CM4AI d4d

Datasheet for Dataset - Human Readable Format

🎯

Motivation

Why was the dataset created?

  • Description
    Create AI-ready maps of human cell architecture from disease-relevant cell lines as part of the NIH Bridge2AI Functional Genomics Grand Challenge.
📊

Composition

What do the instances represent?

  • Description
    The dataset contains functional genomics data organized into multiple data types: CRISPRi Perturbation Cell Atlas mapping transcriptional and fitness phenotypes for 11,739 targeted genes in KOLF2.1J iPSCs; SEC-MS protein-protein interaction data from undifferentiated iPSCs, NPCs, neurons, and cardiomyocytes; and IF images showing spatial localization of 563 proteins of interest in MDA-MB-468 breast cancer cells with and without chemotherapy treatment.
🔍

Collection Process

How was the data acquired?

CM4AI Dataset
Cell Maps for Artificial Intelligence - March 2025 Data Release (Beta)
This dataset is the March 2025 Data Release of Cell Maps for Artificial Intelligence (CM4AI; CM4AI.org), the Functional Genomics Grand Challenge in the NIH Bridge2AI program. This Beta release includes perturb-seq data in undifferentiated KOLF2.1J iPSCs; SEC-MS data in undifferentiated KOLF2.1J iPSCs and iPSC-derived NPCs, neurons, and cardiomyocytes; and IF images in MDA-MB-468 breast cancer cells in the presence and absence of chemotherapy (vorinostat and paclitaxel). CM4AI output data are packaged with provenance graphs and rich metadata as AI-ready datasets in RO-Crate format using the FAIRSCAPE framework. CM4AI is a collaboration of UCSD, UCSF, Stanford, UVA, Yale, UA Birmingham, Simon Fraser University, and the Hastings Center.
Language
en
Keywords
  • AI
  • artificial intelligence
  • machine learning
  • Bridge2AI
  • CM4AI
  • functional genomics
  • perturb-seq
  • CRISPR/Cas9
  • induced pluripotent stem cell
  • iPSC
  • KOLF2.1J
  • protein-protein interaction
  • SEC-MS
  • protein localization
  • subcellular imaging
  • MDA-MB-468
  • breast cancer
  • paclitaxel
  • vorinostat
  • RO-Crate
  • FAIRSCAPE
  • Description
    Provide high-quality, AI-ready functional genomics datasets with rich metadata and provenance graphs packaged in RO-Crate format using the FAIRSCAPE framework.
  • Name
    CRISPR Perturbation Cell Atlas
    ID
    cm4ai:crispr-perturbation-atlas
    Description
    Expressed genome-scale CRISPRi Perturbation Cell Atlas in undifferentiated KOLF2.1J human induced pluripotent stem cells (hiPSCs) mapping transcriptional and fitness phenotypes associated with 11,739 targeted genes.
  • Name
    Protein-protein Interaction SEC-MS
    ID
    cm4ai:sec-ms-ppi
    Description
    Size exclusion chromatography-mass spectrometry (SEC-MS) data on undifferentiated KOLF2.1J human induced pluripotent stem cells from the Krogan laboratory at UCSF.
  • Name
    Protein Localization Subcellular Images - Untreated
    ID
    cm4ai:if-images-untreated
    Description
    Spatial localization of 563 proteins in untreated MDA-MB-468 breast cancer cells imaged by ICC-IF and confocal microscopy in the Lundberg Lab at Stanford.
  • Name
    Protein Localization Subcellular Images - Paclitaxel
    ID
    cm4ai:if-images-paclitaxel
    Description
    Spatial localization of 563 proteins in MDA-MB-468 breast cancer cells treated with paclitaxel, imaged by ICC-IF and confocal microscopy.
  • Name
    Protein Localization Subcellular Images - Vorinostat
    ID
    cm4ai:if-images-vorinostat
    Description
    Spatial localization of 563 proteins in MDA-MB-468 breast cancer cells treated with vorinostat, imaged by ICC-IF and confocal microscopy.
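The five component identifiers listed here share a `cm4ai:` prefix, and the three imaging components differ only in their treatment suffix. As a rough illustration of how a script might index them, the sketch below copies the IDs and names from this datasheet into a dictionary. The helper names `split_id` and `imaging_conditions` are hypothetical and are not part of any CM4AI tooling.

```python
# IDs and names copied from this datasheet; the dictionary layout itself
# is an assumption made for illustration.
COMPONENTS = {
    "cm4ai:crispr-perturbation-atlas": "CRISPR Perturbation Cell Atlas",
    "cm4ai:sec-ms-ppi": "Protein-protein Interaction SEC-MS",
    "cm4ai:if-images-untreated": "Protein Localization Subcellular Images - Untreated",
    "cm4ai:if-images-paclitaxel": "Protein Localization Subcellular Images - Paclitaxel",
    "cm4ai:if-images-vorinostat": "Protein Localization Subcellular Images - Vorinostat",
}


def split_id(component_id):
    """Split 'prefix:local-name' into (prefix, local-name)."""
    prefix, _, local = component_id.partition(":")
    return prefix, local


def imaging_conditions():
    """Return the treatment suffixes of the immunofluorescence components."""
    marker = "if-images-"
    conditions = []
    for component_id in COMPONENTS:
        local = split_id(component_id)[1]
        if local.startswith(marker):
            conditions.append(local[len(marker):])
    return sorted(conditions)


print(imaging_conditions())
```

Keying on the identifier rather than the display name keeps lookups stable if the names are reworded in a later release.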
  • Description
    Disease-relevant cell lines were selected to enable AI-ready functional genomics mapping. KOLF2.1J iPSCs were chosen for perturbation studies, and MDA-MB-468 breast cancer cells for protein localization imaging with and without chemotherapy.
  • Description
    Data generated using multiple high-throughput experimental methods including CRISPRi perturbation sequencing, size exclusion chromatography coupled with mass spectrometry, and immunofluorescence confocal microscopy.
    Was Directly Observed
    True
    Acquisition Details
    • CRISPRi perturbation sequencing for transcriptional and fitness phenotypes
    • SEC-MS for protein-protein interaction mapping
    • ICC-IF and confocal microscopy for protein localization imaging
  • Description
    Hardware and software instruments for functional genomics data collection.
    Mechanism Details
    • CRISPRi gene knockdown system with perturb-seq readout
    • Size exclusion chromatography coupled with mass spectrometry
    • Confocal microscopy with DAPI, calreticulin, tubulin, and target protein staining
  • Description
    Multi-institutional collaboration across major research universities.
    Collector Details
    • University of California San Diego (Ideker Lab, Mali Lab, Sali Lab)
    • University of California San Francisco (Krogan Lab)
    • Stanford University (Lundberg Lab)
    • University of Virginia
    • Yale University
    • University of Alabama at Birmingham
  • Description
    Data collected through February 2025 with ongoing augmentation planned.
    Timeframe Details
    • Data creation date: 2025-02-27
    • Publication date: 2025-03-03
    • Ongoing data augmentation planned
  • Description
    CM4AI output data packaged with provenance graphs and rich metadata as AI-ready datasets in RO-Crate format using the FAIRSCAPE framework.
    Preprocessing Details
    • RO-Crate packaging with provenance metadata
    • FAIRSCAPE framework for AI-ready data formatting
    • Rich metadata including ontology annotations
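An RO-Crate keeps its metadata in a JSON-LD file named `ro-crate-metadata.json`, whose `@graph` array lists the crate's entities. The sketch below shows one way to get a quick overview of such a file by grouping entity IDs by `@type`. The sample metadata is a hand-written, hypothetical stand-in. Real CM4AI crates produced with FAIRSCAPE will contain different and much richer entries, including the provenance graph.

```python
import json

# Hypothetical, minimal stand-in for the contents of ro-crate-metadata.json.
# It is NOT taken from the CM4AI release.
SAMPLE_METADATA = """
{
  "@context": "https://w3id.org/ro/crate/1.1/context",
  "@graph": [
    {"@id": "ro-crate-metadata.json", "@type": "CreativeWork",
     "about": {"@id": "./"}},
    {"@id": "./", "@type": "Dataset", "name": "Example crate"},
    {"@id": "images/example.tif", "@type": "File", "name": "Example image"}
  ]
}
"""


def entities_by_type(metadata):
    """Group the @id of every entity in the crate's @graph by its @type."""
    grouped = {}
    for entity in metadata.get("@graph", []):
        types = entity.get("@type", "Unknown")
        # @type may be a single string or a list of strings.
        if isinstance(types, str):
            types = [types]
        for type_name in types:
            grouped.setdefault(type_name, []).append(entity.get("@id"))
    return grouped


crate = json.loads(SAMPLE_METADATA)
summary = entities_by_type(crate)
for type_name, ids in sorted(summary.items()):
    print(type_name, ids)
```

To inspect a downloaded crate, replace `json.loads(SAMPLE_METADATA)` with `json.load` on an open handle to that crate's `ro-crate-metadata.json`.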
  • Name
    CM4AI Project Team
    Description
    Multi-institutional team responsible for dataset maintenance.
    Maintainer Details
    • Point of contact: Trey Ideker (University of California San Diego)
    • Data depositor: Justin Niestroy (University of Virginia)
🚀

Uses

What (other) tasks could the dataset be used for?

  • Description
    Enable AI/machine learning research on functional genomics data including transcriptional phenotypes, protein-protein interactions, and protein localization in disease-relevant cell lines.
  • Examples
    • Clark T, et al. Cell Maps for Artificial Intelligence: AI-Ready Maps of Human Cell Architecture. bioRxiv 2024.05.21.589311
    • Nourreddine S, et al. A PERTURBATION CELL ATLAS OF HUMAN iPSCs. bioRxiv 2024.11.03.621734
  • Description
    Dataset designed for AI/ML research on functional genomics. Users should be aware of cell line-specific characteristics when generalizing findings.
    Impact Details
    • Cell line-specific phenotypes may not generalize across all cell types
    • Chemotherapy treatment conditions specific to paclitaxel and vorinostat
  • Description
    Commercial use restrictions apply per the CC BY-NC-SA 4.0 license.
    Discouragement Details
    • Commercial use not permitted under license terms
    • Attribution and share-alike requirements must be followed
Description
Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license (CC BY-NC-SA 4.0). Attribution required to copyright holders and authors. Spatial proteomics raw image data copyright The Board of Trustees of the Leland Stanford Junior University. Other data copyright The Regents of UC.
License Terms
  • CC BY-NC-SA 4.0
  • Attribution required to copyright holders and authors
  • Non-commercial use only
📤

Distribution

How will the dataset be distributed?

Description
The dataset is versioned through the University of Virginia Dataverse and has a persistent DOI.
Version Details
  • DOI: doi:10.18130/V3/B35XWX for persistent access
  • Dataverse versioning system tracks updates
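A DOI written in the `doi:` form used here can be turned into a resolvable link by appending it to the `https://doi.org/` resolver. A minimal sketch follows. The function name `doi_to_url` is illustrative only, and the check is a simple prefix test, not full DOI validation.

```python
def doi_to_url(doi):
    """Turn 'doi:10.x/y' (or a bare '10.x/y') into a doi.org resolver URL."""
    cleaned = doi.strip()
    if cleaned.lower().startswith("doi:"):
        cleaned = cleaned[4:].strip()
    # All DOIs start with the directory indicator "10."
    if not cleaned.startswith("10."):
        raise ValueError(f"not a DOI: {doi!r}")
    return "https://doi.org/" + cleaned


print(doi_to_url("doi:10.18130/V3/B35XWX"))
# prints https://doi.org/10.18130/V3/B35XWX
```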
🔄

Maintenance

How will the dataset be maintained?

Description
Data will be augmented regularly through the end of the Bridge2AI project.
Update Details
  • Regular data augmentation planned
  • Version 1.4 as of March 2025 publication
Generated on 2026-04-15 21:02:50 using Bridge2AI Data Sheets Schema