CM4AI Dataset Documentation

Datasheet for Dataset - Human Readable Format

🎯

Motivation

Why was the dataset created?

Description | ID | Name | Response
Create AI-ready maps of human cell architecture from disease-relevant cell lines as part of the NIH ... | cm4ai:purpose:ai-ready-maps | Create AI-ready cell architecture maps | CM4AI's overarching mission is to produce ethical, AI-ready datasets of cell architecture, inferred ...
Enable interpretable genotype-phenotype learning by providing visible machine learning systems infor... | cm4ai:purpose:interpretable-genotype-phenotype | Enable interpretable genotype-phenotype learning | The CM4AI project seeks to map the spatiotemporal architecture of human cells and use these maps tow...
  • ID
    cm4ai:funder:nih-bridge2ai
    Name
    NIH Bridge2AI Functional Genomics
    Description
    National Institutes of Health Bridge to Artificial Intelligence program, Functional Genomics Data Generation Project. This work was funded by the National Institutes of Health under awards 1OT2OD032742-01 (Bridge2AI Functional Genomics) and 5U54HG012513-02 (Bridge2AI Bridge Center), and by the Frederick Thomas Fund of the University of Virginia. Grant number: 1OT2OD032742-01. Project period: September 1, 2022 to August 31, 2026.
📊

Composition

What do the instances represent?

Description | ID | Instance Type | Name
Expressed genome-scale CRISPRi Perturbation Cell Atlas in undifferentiated KOLF2.1J human induced pl... | cm4ai:instance:crispr-perturbation | Perturb-seq single-cell RNA sequencing data | CRISPR Perturbation Cell Atlas
Size exclusion chromatography-mass spectroscopy (SEC-MS) data from undifferentiated KOLF2.1J human i... | cm4ai:instance:sec-ms-ipsc | Protein-protein interaction mass spectrometry data | SEC-MS protein-protein interactions in iPSCs
SEC-MS protein-protein interaction data from iPSC-derived NPCs (neural progenitor cells), neurons, a... | cm4ai:instance:sec-ms-differentiated | Protein-protein interaction mass spectrometry data | SEC-MS in differentiated iPSC derivatives
Spatial localization of 563 proteins of interest in untreated cells of the breast cancer cell line M... | cm4ai:instance:if-images-untreated | Immunofluorescence confocal microscopy images | Immunofluorescence images - untreated MDA-MB-468
Spatial localization of 563 proteins in MDA-MB-468 breast cancer cells treated with paclitaxel chemo... | cm4ai:instance:if-images-paclitaxel | Immunofluorescence confocal microscopy images | Immunofluorescence images - paclitaxel-treated MDA-MB-468
Spatial localization of 563 proteins in MDA-MB-468 breast cancer cells treated with vorinostat chemo... | cm4ai:instance:if-images-vorinostat | Immunofluorescence confocal microscopy images | Immunofluorescence images - vorinostat-treated MDA-MB-468
AP-MS on endogenously tagged cell lines for protein-protein interaction mapping. 17 genes endogenous... | cm4ai:instance:ap-ms | Protein-protein interaction mass spectrometry data | Affinity purification mass spectrometry
Hierarchical directed acyclic graph (DAG) maps of cell architecture produced by integrating protein ... | cm4ai:instance:cell-maps | Network data structure (hierarchical graph) | Hierarchical cell maps

Access URLs | Description | ID | Name
https://dataverse.lib.virginia.edu/dataset.xhtml?persistentId=doi:10.18130/V3/B35XWX | Datasets packaged in RO-Crate format with rich metadata and provenance graphs. | cm4ai:format:ro-crate | RO-Crate packages
https://dataverse.lib.virginia.edu | University of Virginia's LibraData (Dataverse) NIH-approved generalist repository. | cm4ai:format:dataverse | LibraData Dataverse repository
https://cm4ai.org, http://www.cm4ai.org | CM4AI project portal using U-BRITE platform for open dissemination. | cm4ai:format:portal | CM4AI web portal
https://www.ndexbio.org | Cell maps shared via NDEx for visualization in web browsers or Cytoscape. | cm4ai:format:ndex | Network Data Exchange
🔍

Collection Process

How was the data acquired?

CM4AI
Cell Maps for Artificial Intelligence - March 2025 Data Release (Beta)
This dataset is the March 2025 Data Release of Cell Maps for Artificial Intelligence (CM4AI; CM4AI.org), the Functional Genomics Grand Challenge in the NIH Bridge2AI program. This Beta release includes perturb-seq data in undifferentiated KOLF2.1J iPSCs; SEC-MS data in undifferentiated KOLF2.1J iPSCs and iPSC-derived NPCs, neurons, and cardiomyocytes; and IF images in MDA-MB-468 breast cancer cells in the presence and absence of chemotherapy (vorinostat and paclitaxel). CM4AI's objective is to deliver machine-readable hierarchical maps of cell architecture as AI-Ready data produced from multimodal interrogation of 100 chromatin modifiers and 100 metabolic enzymes involved in cancer, neuropsychiatric, and cardiac disorders in disease-relevant cell lines under perturbed and unperturbed conditions. CM4AI output data are packaged with provenance graphs and rich metadata as AI-ready datasets in RO-Crate format using the FAIRSCAPE framework. CM4AI is a collaboration of UCSD, UCSF, Stanford, UVA, Yale, UA Birmingham, Simon Fraser University, and the Hastings Center.
en
  • AI
  • affinity purification
  • AP-MS
  • artificial intelligence
  • breast cancer
  • Bridge2AI
  • cardiomyocyte
  • CM4AI
  • CRISPR/Cas9
  • CRISPR/Cas perturbation screens
  • functional genomics
  • induced pluripotent stem cell
  • iPSC
  • KOLF2.1J
  • machine learning
  • mass spectroscopy
  • MDA-MB-468
  • neural progenitor cell
  • NPC
  • neuron
  • paclitaxel
  • perturb-seq
  • perturbation sequencing
  • protein-protein interaction
  • protein localization
  • single-cell RNA sequencing
  • scRNAseq
  • SEC-MS
  • size exclusion chromatography
  • subcellular imaging
  • vorinostat
  • RO-Crate
  • FAIRSCAPE
  • cell maps
  • hierarchical maps
  • visible machine learning
  • VNN
  • visible neural networks
  • chromatin modifiers
  • metabolic enzymes
  • immunofluorescence
  • IF imaging
  • spatial proteomics

Description | ID | Name | Response
Address the limitation of black-box machine learning models in genomics by creating visible neural n... | cm4ai:gap:black-box-ml | Address black-box machine learning | Machine learning models show great promise in analyzing the human genome but their inner workings ar...
Fill the gap in availability of high-quality, FAIR, AI-ready functional genomics datasets with rich ... | cm4ai:gap:fair-ai-datasets | Provide FAIR AI-ready functional genomics datasets | CM4AI addresses the need for ethical, FAIR, AI-Ready biomedical data that are fully characterized wi...

Role | Name | ORCID | Affiliation
Contributor | Trey Ideker | cm4ai:creator:ideker | -
Contributor | Timothy W Clark | cm4ai:creator:clark | -
Contributor | Nevan J Krogan | cm4ai:creator:krogan | -
Contributor | Emma Lundberg | cm4ai:creator:lundberg | -
Contributor | Prashant Mali | cm4ai:creator:mali | -
Contributor | Andrej Sali | cm4ai:creator:sali | -
Contributor | Jean-Christophe Bélisle-Pipon | cm4ai:creator:belisle-pipon | -
Contributor | Cynthia Brandt | cm4ai:creator:brandt | -
Contributor | Jake Yue Chen | cm4ai:creator:chen | -
Contributor | Ying Ding | cm4ai:creator:ding | -
Contributor | Samah Fodeh | cm4ai:creator:fodeh | -
Contributor | Pamela Payne-Foster | cm4ai:creator:payne-foster | -
Contributor | Sarah J Ratcliffe | cm4ai:creator:ratcliffe | -
Contributor | Vardit Ravitsky | cm4ai:creator:ravitsky | -
Contributor | Wade Loren Schulz | cm4ai:creator:schulz | -

Description | ID | Name
Expressed genome-scale CRISPRi Perturbation Cell Atlas in undifferentiated KOLF2.1J human induced pl... | cm4ai:subset:crispr-atlas | CRISPR Perturbation Cell Atlas
Size exclusion chromatography-mass spectroscopy data on undifferentiated KOLF2.1J human induced plur... | cm4ai:subset:sec-ms-ipsc | Protein-protein Interaction SEC-MS iPSCs
Spatial localization of 563 proteins in untreated MDA-MB-468 breast cancer cells imaged by ICC-IF an... | cm4ai:subset:if-untreated | Protein Localization Images - Untreated
Spatial localization of 563 proteins in MDA-MB-468 breast cancer cells treated with paclitaxel, imag... | cm4ai:subset:if-paclitaxel | Protein Localization Images - Paclitaxel
Spatial localization of 563 proteins in MDA-MB-468 breast cancer cells treated with vorinostat, imag... | cm4ai:subset:if-vorinostat | Protein Localization Images - Vorinostat

Description | ID | Name
Disease-relevant cell lines selected to enable AI-ready functional genomics mapping for cancer, neur... | cm4ai:sampling:cell-line-selection | Disease-relevant cell line selection
Selection of 100 chromatin regulators and 100 metabolic enzymes involved in cancer, neuropsychiatric... | cm4ai:sampling:chromatin-modifiers | Chromatin modifier selection
Selection of chemotherapy treatments (paclitaxel and vorinostat) and cell differentiation states to ... | cm4ai:sampling:perturbation-conditions | Perturbation and treatment conditions

Description | External Resources | ID | Name
Official CM4AI project website and portal | https://cm4ai.org, https://www.cm4ai.org | cm4ai:resource:project-website | CM4AI Project Website
Key publications describing CM4AI methods and datasets | https://doi.org/10.1101/2024.05.21.589311, https://doi.org/10.1101/2024.11.03.621734 | cm4ai:resource:publications | CM4AI Publications
Platform for sharing and visualizing cell maps | https://www.ndexbio.org | cm4ai:resource:ndex | Network Data Exchange (NDEx)
AI-readiness framework for metadata and provenance | https://fairscape.github.io | cm4ai:resource:fairscape | FAIRSCAPE Framework
Software tools for cell map creation and analysis | https://github.com/idekerlab, http://integrativemodeling.org | cm4ai:resource:software-tools | CM4AI Software Tools

Acquisition Details | Description | ID | Name | Was Directly Observed
Endogenous tagging of genes in MDA-MB-468 cells, AP-MS data acquisition under three conditions (untreated, paclitaxel, vorinostat), 17 genes tagged in Year 1, 34 additional genes in process | AP-MS on endogenously tagged cell lines to map protein-protein interactions of 100 chromatin regulat... | cm4ai:acquisition:ap-ms | Affinity purification mass spectrometry | True
SEC-MS performed on MDA-MB-468 cells under three conditions, Identified 72/100 chromatin modifiers with 52 in protein complexes, ... (+3 more) | SEC-MS for proteome-wide complex and interaction mapping as orthogonal approach to AP-MS. | cm4ai:acquisition:sec-ms | Size exclusion chromatography mass spectrometry | True
Automated fixation and permeabilization protocols using pipetting robot, Four-channel staining (DAPI, calreticulin, tubulin, target protein), ... (+3 more) | IF staining and high-resolution confocal microscopy for spatial proteomics mapping of protein subcel... | cm4ai:acquisition:immunofluorescence | Immunofluorescence confocal microscopy | True
CRISPR lentiviral library targeting 100 chromatin factors with 6 guide RNAs per gene, MDA-MB-468 and KOLF2 CRISPR lines with inducible dCas9, ... (+4 more) | CRISPR-Cas9 perturbation screens for genetic perturbation mapping with single-cell RNA sequencing re... | cm4ai:acquisition:crispr-perturbation | Single-cell CRISPR perturbation screens | True

Description | ID | Mechanism Details | Name
Mass spectrometry instruments for AP-MS and SEC-MS protein interaction mapping. | cm4ai:mechanism:mass-spectrometry | Affinity purification coupled with mass spectrometry, Size exclusion chromatography coupled with mass spectrometry, Proteome-wide interaction profiling | Mass spectrometry instruments
High-resolution confocal microscopy for immunofluorescence imaging of protein subcellular localizati... | cm4ai:mechanism:confocal-microscopy | Confocal microscopy with four-channel imaging, Immunofluorescence-based staining (ICC-IF), Automated pipetting robot for sample preparation | Confocal microscopy
CRISPR-Cas9 gene perturbation with single-cell RNA sequencing for transcriptional and fitness phenot... | cm4ai:mechanism:crispr-cas9 | CRISPRi gene knockdown with inducible dCas9, Lentiviral library delivery, ... (+2 more) | CRISPR-Cas9 perturbation system
Multi-Scale Integrated Cell (MuSIC) pipeline for integrating multimodal data streams into hierarchic... | cm4ai:mechanism:music-pipeline | node2vec deep learning for PPI network embedding, Human Protein Atlas deep learning for image embedding, ... (+4 more) | MuSIC data integration pipeline

Collector Details | Description | ID | Name
Ideker Lab - project coordination and cell map integration, Mali Lab - genetic perturbation mapping, Sali Lab - integrative structure modeling | UCSD teams led by Trey Ideker (PI), Prashant Mali, and Andrej Sali. | cm4ai:collector:ucsd | University of California San Diego
Krogan Laboratory - AP-MS and SEC-MS data generation, Protein complex mapping across conditions | UCSF teams led by Nevan Krogan for protein-protein interaction data. | cm4ai:collector:ucsf | University of California San Francisco
Lundberg Laboratory - immunofluorescence imaging, Spatial proteomics mapping with confocal microscopy, Copyright holder for raw image data | Stanford team led by Emma Lundberg for spatial proteomics. | cm4ai:collector:stanford | Stanford University
Standards Module - FAIRSCAPE framework development, AI-readiness packaging and metadata, Data repository (LibraData/Dataverse) | UVA team led by Timothy Clark for AI-readiness standards. | cm4ai:collector:uva | University of Virginia
Data curation and standards, Internship program hosting | Yale team led by Cynthia Brandt, Samah Fodeh, and Wade Schulz. | cm4ai:collector:yale | Yale University
Teaming Module - U-BRITE platform for portal, Open dissemination of data, maps, and tools, Internship program hosting | UAB team led by Jake Chen for teaming and dissemination. | cm4ai:collector:uab | University of Alabama at Birmingham

Description | ID | Name | Timeframe Details
September 1, 2022 to August 31, 2026 with ongoing data augmentation. | cm4ai:timeframe:project-period | Overall project period | Project start September 1 2022, Project end August 31 2026, ... (+4 more)
Year 1 data generation including chromatin modifiers, spatial proteomics, and CRISPR screens. | cm4ai:timeframe:year1 | Year 1 accomplishments | 100 chromatin regulators mapped in MDA-MB-468 cells, Single-cell CRISPR screens completed, ... (+2 more)

Description | ID | Name | Preprocessing Details
Generation of embeddings from PPI networks and IF images using deep learning models for dimensionali... | cm4ai:preprocessing:embedding | Deep learning embeddings | node2vec deep learning for PPI network embedding, Human Protein Atlas deep learning for image embedding, Contrastive learning for co-embedding integration
Integration of PPI and image embeddings through MuSIC pipeline to create co-embeddings and hierarchi... | cm4ai:preprocessing:integration | Multimodal data integration | Co-embedding integration with minimal information loss, Community detection based on co-embedding similarities, Hierarchy creation from multi-resolution community detection
Annotation of protein assemblies in cell maps using ontology alignment and large language models. | cm4ai:preprocessing:annotation | Cell map annotation | Alignment to Gene Ontology and Reactome pathways, LLM-based naming of protein assemblies with confidence scores, Evaluation of overlap with known cell biology
Packaging of datasets with rich metadata, provenance graphs, and schemas in RO-Crate format for AI-r... | cm4ai:preprocessing:ro-crate-packaging | RO-Crate packaging with FAIRSCAPE | FAIRSCAPE-CLI client for RO-Crate creation, JSON-Schema data dictionaries for all datasets, ... (+4 more)
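
As a rough illustration of the RO-Crate packaging named in the last row above, the sketch below parses a crate's metadata document and lists the entities in its graph. Under the RO-Crate 1.1 specification the metadata file is named `ro-crate-metadata.json` and holds a JSON-LD `@graph`. The sample entities, file names, and the helper functions `list_entities` and `root_dataset_id` are hypothetical and are not taken from an actual CM4AI crate.

```python
import json

# Hypothetical, minimal RO-Crate metadata document (NOT a real CM4AI crate).
SAMPLE_METADATA = """
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "name": "Example crate",
     "hasPart": [{"@id": "data/example.tsv"}]},
    {"@id": "data/example.tsv", "@type": "File", "name": "Example table"}
  ]
}
"""


def list_entities(metadata_text):
    """Return (id, type, name) tuples for every entity in the @graph."""
    graph = json.loads(metadata_text).get("@graph", [])
    return [(e.get("@id"), e.get("@type"), e.get("name")) for e in graph]


def root_dataset_id(metadata_text):
    """Find the root dataset by following the metadata descriptor's 'about' link."""
    for entity in json.loads(metadata_text).get("@graph", []):
        if entity.get("@id") == "ro-crate-metadata.json":
            return entity["about"]["@id"]
    return None


entities = list_entities(SAMPLE_METADATA)
root_id = root_dataset_id(SAMPLE_METADATA)
for entity_id, entity_type, name in entities:
    print(entity_id, entity_type, name)
```

For a downloaded crate, the same functions could be applied to the text of its `ro-crate-metadata.json` file; what additional provenance fields appear there depends on how the crate was produced.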
  1. ID
    cm4ai:maintainer:cm4ai-team
    Name
    CM4AI Project Team
    Description
    Multi-institutional team responsible for dataset maintenance.
    Maintainer Details
    • Point of contact Trey Ideker (University of California San Diego)
    • Data depositor Justin Niestroy (University of Virginia)
    • Data Access Committee for ethical supervision

Description | ID | Name | Review Details
Ethics team employs Value-Sensitive Design (VSD) methodology for ethical AI development. | cm4ai:ethics:value-sensitive-design | Value-Sensitive Design framework | Conceptual work to craft axiological repository of values, Design standards and expectations for responsible AI, ... (+2 more)
Mixed empirical research combining qualitative and quantitative methods to capture community insight... | cm4ai:ethics:stakeholder-engagement | Stakeholder engagement research | Diverse perspective integration in ethical guidelines, Community insights on values and concerns, Tools for ethical, legal, and social ramifications awareness
Bridge2AI Data Sharing and Dissemination Working Group Code of Conduct for data access. | cm4ai:ethics:code-of-conduct | Bridge2AI Code of Conduct | Basic requirements for Bridge2AI Open House participants, Attestation required prior to accessing CM4AI datasets, Ethical commitments from data users

Description | ID | Impact Details | Name
Ethical preparation, licensing, dissemination, and data access supervision balancing openness with i... | cm4ai:protection:licensing | CC BY-NC-SA 4.0 for open non-commercial use, Commercial use requires separate licensing, ... (+2 more) | Licensing and data governance
Guidelines and best practices for ethical development of AI systems leveraging CM4AI data. | cm4ai:protection:responsible-ai | Anticipation of potential AI/ML applications, Integration into data governance frameworks, ... (+2 more) | Responsible AI development framework
🚀

Uses

What (other) tasks could the dataset be used for?

Description | ID | Name | Response
Create hierarchical directed acyclic graph (DAG) maps where each node represents an assembly of prot... | cm4ai:task:cell-mapping | Hierarchical cell mapping | Cell maps are hierarchical directed acyclic graphs (DAG), where each node represents an assembly of ...
Integrate three complementary mapping approaches - proteomic mass spectrometry, cellular imaging, an... | cm4ai:task:multimodal-integration | Multimodal data integration | The project launches a coordinated effort involving three complementary mapping approaches: affinity...
Package datasets with FAIR compliance, provenance graphs, complete schemas, validation procedures, d... | cm4ai:task:ai-readiness | AI-readiness packaging | AI-Ready biomedical data are fully characterized FAIR data of known provenance, which can be ethical...
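
To make the data structure in the first row concrete, here is a minimal sketch of a hierarchical DAG in which each node is an assembly of proteins. The assembly names, protein memberships, and helper functions are hypothetical placeholders, not CM4AI outputs. The nesting check (each child's proteins are a subset of its parent's) is an assumption about how such hierarchies are typically organized, not a rule stated in this datasheet. The acyclicity check uses Python's standard `graphlib` module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical assemblies: node -> set of member proteins (placeholder names).
members = {
    "cell": {"P1", "P2", "P3", "P4"},
    "assembly_A": {"P1", "P2", "P3"},
    "assembly_B": {"P3", "P4"},
    "complex_A1": {"P1", "P2"},
    "complex_AB": {"P3"},
}

# Edges point from a parent assembly to the sub-assemblies it contains.
# "complex_AB" has two parents, which makes this a DAG rather than a tree.
children = {
    "cell": {"assembly_A", "assembly_B"},
    "assembly_A": {"complex_A1", "complex_AB"},
    "assembly_B": {"complex_AB"},
    "complex_A1": set(),
    "complex_AB": set(),
}


def is_acyclic(graph):
    """Return True if the parent -> children graph contains no cycle."""
    try:
        tuple(TopologicalSorter(graph).static_order())
        return True
    except CycleError:
        return False


def children_nested(graph, node_members):
    """Return True if every child's proteins are a subset of its parent's."""
    return all(
        node_members[child] <= node_members[parent]
        for parent, kids in graph.items()
        for child in kids
    )


print("acyclic:", is_acyclic(children))
print("nested:", children_nested(children, members))
```

A plain dictionary of sets keeps the sketch dependency-free; a graph library would be a reasonable alternative for maps with many nodes.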

Description | Examples | ID | Name
Scientific publications describing CM4AI methods and initial findings. | Clark T, et al. Cell Maps for Artificial Intelligence. bioRxiv 2024.05.21.589311, Nourreddine S, et al. A PERTURBATION CELL ATLAS OF HUMAN iPSCs. bioRxiv 2024.11.03.621734 | cm4ai:use:publications | Publications using CM4AI data
March 2024 CodeFest event for training researchers in CM4AI data and tools. | 38 registrants with 40% female and 30% from underrepresented communities, Detailed tutorials on data derivation and integration, Production-ready software demonstration | cm4ai:use:codefest | CM4AI CodeFest training event

Description | Examples | ID | Name
Development of visible neural networks (VNNs) that provide interpretable AI/ML predictions based on ... | Visible machine learning systems informed by cell maps, Interpretable genotype-phenotype predictions, ... (+2 more) | cm4ai:intended:visible-ml | Visible machine learning development
AI/ML research on functional genomics data including transcriptional phenotypes, protein interaction... | Machine learning on multimodal functional genomics data, Disease mechanism understanding, ... (+2 more) | cm4ai:intended:biomedical-ai | Biomedical AI research
Research on protein assemblies, cellular compartments, and protein complex organization. | Protein complex characterization, Subcellular localization studies, ... (+2 more) | cm4ai:intended:cell-biology | Cell biology research

Description | ID | Impact Details | Name
Dataset characteristics specific to MDA-MB-468 and KOLF2.1J cell lines may not generalize to all cel... | cm4ai:impact:cell-line-specific | MDA-MB-468 specific to triple-negative breast cancer context, KOLF2.1J from healthy male Northern European donor, ... (+2 more) | Cell line-specific characteristics
Limitations inherent to mass spectrometry, microscopy, and CRISPR perturbation technologies. | cm4ai:impact:technological-limitations | Mass spectrometry detection limits, Antibody availability and specificity constraints, ... (+3 more) | Technological and methodological limitations

Description | Discouragement Details | ID | Name
Commercial use is restricted by CC BY-NC-SA 4.0 license and requires separate licensing negotiation.... | Commercial use not permitted under CC BY-NC-SA 4.0, Separate license required from copyright holders (UCSD, Stanford, UCSF), Data Access Committee supervision for commercial licensing | cm4ai:discouraged:commercial | Commercial use without license
Data are from cell lines, not clinical samples, and should not be directly applied to clinical decis... | Non-clinical data from tissue cultures, De-identified and cannot be matched to human subjects, Requires clinical validation before medical application | cm4ai:discouraged:clinical-direct | Direct clinical application without validation
ID
cm4ai:license:cc-by-nc-sa
Name
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International
Description
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license. Data are Copyright 2025 The Regents of the University of California except where otherwise noted. Spatial proteomics raw image data copyright 2025 The Board of Trustees of the Leland Stanford Junior University.
License Terms
  • Attribution required to copyright holders and authors
  • Citation of bioRxiv article required
  • Non-commercial use only without separate license
  • ShareAlike - derivatives must use same license
  • Software under BSD-3 and MIT open source licenses
  • FAIRSCAPE under MIT license
📤

Distribution

How will the dataset be distributed?

CC BY-NC-SA 4.0
ID
cm4ai:version-access
Name
Dataset versioning
Description
Dataset versioned through University of Virginia Dataverse with DOI persistence.
Version Details
  • DOI doi:10.18130/V3/B35XWX for persistent access
  • Dataverse versioning system tracks updates
  • Software packages versioned for reproducibility
  • Version referenced in dataset metadata
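
The DOI listed above can be turned into the Dataverse landing-page URL given earlier in this datasheet. The sketch below only builds strings and makes no network request. The function names are hypothetical, and the `https://doi.org/` resolver form is the generic DOI convention, not something this datasheet states.

```python
from urllib.parse import urlencode

# Values taken from this datasheet.
DATAVERSE_BASE = "https://dataverse.lib.virginia.edu/dataset.xhtml"
DATASET_DOI = "doi:10.18130/V3/B35XWX"


def dataverse_landing_url(doi, base=DATAVERSE_BASE):
    """Build the landing-page URL that carries the DOI as 'persistentId'."""
    # Keep ':' and '/' unescaped so the result matches the documented URL.
    return base + "?" + urlencode({"persistentId": doi}, safe=":/")


def doi_resolver_url(doi):
    """Build the generic doi.org resolver URL (assumed convention)."""
    return "https://doi.org/" + doi.removeprefix("doi:")


print(dataverse_landing_url(DATASET_DOI))
print(doi_resolver_url(DATASET_DOI))
```

Recording the DOI rather than a copied URL in analysis scripts keeps references stable if the repository's page layout changes.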
🔄

Maintenance

How will the dataset be maintained?

ID
cm4ai:updates
Name
Regular data augmentation
Description
Data will be augmented regularly through the end of the project in August 2026.
Update Details
  • Ongoing data generation and release
  • Regular updates to LibraData repository
  • Additional cell lines and conditions planned
  • Year 2 expansion to iPSC differentiation states
👥

Human Subjects

Does the dataset relate to people?

ID
cm4ai:ethics:non-human-subjects
Name
Non-human subjects research
Description
CM4AI data are distinctive in being non-clinical data from tissue cultures. They are considered de-identified because they cannot be matched to human subjects with current knowledge. The data come from ethically sourced cell lines: MDA-MB-468 from ATCC with established provenance, and KOLF2.1J from HipSci with donor consent for research use. There is no direct human subjects involvement.
Involves Human Subjects
False
Generated on 2025-12-09 18:07:08 using Bridge2AI Data Sheets Schema