Datasheets for Datasets Schema
A LinkML schema for the Datasheets for Datasets model, as published in the paper Datasheets for Datasets. Inspired by datasheets as used in electronics and other industries, Gebru et al. proposed that every dataset "be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on".
Bridge2AI Generating Center Datasheets
Curated comprehensive datasheets for each Bridge2AI data generating project:
- AI-READI - Retinal imaging and diabetes dataset
- CM4AI - Cell maps for AI dataset
- VOICE - Voice biomarker dataset
- CHORUS - Health data for underrepresented populations
D4D-Core Schema (Recommended Entry Point)
A curated, interop-focused subset of D4D. It is the recommended starting point for new datasheets and for systems that exchange datasheets with RO-Crate / FAIRSCAPE / DCAT consumers. See D4D-Core for the schema YAMLs, merged form, validation targets, and curated HTML examples. Each d4d-core slot is paired with a SKOS-aligned external term in the Semantic Exchange Layer.
Semantic Exchange Layer (D4D ↔ RO-Crate / FAIRSCAPE)
The canonical SKOS + SSSOM mapping that lets a D4D datasheet round-trip through RO-Crate, FAIRSCAPE EVI, schema.org, DCAT, and Croissant RAI. See Semantic Exchange for the SKOS TTL, semantic + structural SSSOM, generator scripts, and the /d4d-add-mapping workflow for new mappings.
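To make the round-trip idea concrete, here is a minimal sketch of how an SSSOM-style mapping table could drive a key translation in both directions. The column names follow the SSSOM standard, but the three mapping rows, the `d4d:` identifiers, and the helper functions are hypothetical and chosen only for illustration; they are not the project's actual mappings or generator code. Only rows asserted as `skos:exactMatch` are translated here, on the assumption that looser matches are not safe to invert.

```python
import csv
import io

# Hypothetical SSSOM-style rows. The column names are standard SSSOM,
# but these subject/object pairs are illustrative, not the project's mappings.
SSSOM_TSV = (
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n"
    "d4d:title\tskos:exactMatch\tschema:name\tsemapv:ManualMappingCuration\n"
    "d4d:description\tskos:exactMatch\tschema:description\tsemapv:ManualMappingCuration\n"
    "d4d:license\tskos:closeMatch\tschema:license\tsemapv:ManualMappingCuration\n"
)


def load_exact_mappings(tsv_text):
    """Return {subject_id: object_id} for rows asserted as skos:exactMatch."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {
        row["subject_id"]: row["object_id"]
        for row in reader
        if row["predicate_id"] == "skos:exactMatch"
    }


def translate(record, mapping):
    """Rename keys found in the mapping; leave unmapped keys unchanged."""
    return {mapping.get(key, key): value for key, value in record.items()}


forward = load_exact_mappings(SSSOM_TSV)
# Inverting is only safe because the exact-match rows here are one-to-one.
backward = {target: source for source, target in forward.items()}

datasheet = {
    "d4d:title": "Example dataset",
    "d4d:description": "A toy record",
    "d4d:license": "CC-BY-4.0",
}
exported = translate(datasheet, forward)
restored = translate(exported, backward)
```

In this sketch the `skos:closeMatch` row is deliberately skipped, so `d4d:license` passes through unchanged and the restored record equals the original.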
Repository Structure
Browse the source code repository on GitHub:
- src/data/examples/ - Example data files
- project/ - Generated project files (JSON Schema, OWL, SHACL, etc.)
- src/ - Source files
- src/data_sheets_schema/schema/ - LinkML schema source (edit here); data_sheets_schema_core.yaml is the d4d-core entry point
- src/data_sheets_schema/semantic_exchange/ - canonical SKOS + SSSOM exchange-layer artifacts
- src/data_sheets_schema/datamodel/ - Generated Python datamodel
- src/semantic_exchange/ - SSSOM/SKOS generator scripts
- data/semantic_exchange/ - structural SSSOM + analysis docs
- tests/ - Python tests (test_semantic_exchange/, test_fairscape_integration/, …)
- data/ - D4D metadata and evaluation data
Quick Links
- D4D-Core - Curated interop-focused schema subset (recommended entry point)
- Semantic Exchange - SKOS + SSSOM mapping to RO-Crate / FAIRSCAPE
- Schema Documentation - Complete schema reference (classes, slots, enumerations)
- CLI Reference - Command groups, flags, and workflow examples for d4d
- D4D Examples - View rendered datasheets for Bridge2AI projects
- GitHub Repository - Source code and development
- Issues - Report bugs or request features
Related Resources
- Original Paper: Datasheets for Datasets (Gebru et al. 2021)
- Example: Structured dataset documentation: a datasheet for CheXpert
- Google's Alternative: Data Cards
- Augmented Model: Augmented Datasheets for Speech Datasets
About This Project
This repository stores a LinkML schema representation of the original Datasheets for Datasets model, capturing its topics, sets of questions, and the entities and fields expected in the answers. The schema includes 76 classes and 272 unique slots, with comprehensive coverage of:
- Motivation - Why was the dataset created?
- Composition - What's in the dataset?
- Collection - How was data collected?
- Preprocessing - What preprocessing was done?
- Uses - What should the dataset be used for?
- Distribution - How is the dataset distributed?
- Maintenance - Who maintains the dataset?
- Ethics - What ethical reviews were conducted?
- Human Subjects - What protections for human subjects?
- Data Governance - How is the data governed?
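As a rough illustration of how the topic list above might be used as a completeness checklist, the sketch below reports which topics a draft datasheet has not yet filled in. The dictionary keys are derived from the topic names for illustration only; they are an assumption and do not correspond to the schema's actual class or slot names, and `missing_topics` is a hypothetical helper rather than part of the d4d tooling.

```python
# Topic names taken from the coverage list above. Using them as dictionary
# keys is an assumption for illustration, not the schema's real slot names.
TOPICS = [
    "motivation",
    "composition",
    "collection",
    "preprocessing",
    "uses",
    "distribution",
    "maintenance",
    "ethics",
    "human_subjects",
    "data_governance",
]


def missing_topics(datasheet):
    """Return topics that are absent or empty, in the order listed above."""
    return [topic for topic in TOPICS if not datasheet.get(topic)]


# A toy draft: three topics answered, one present but still empty.
draft = {
    "motivation": "Support research on a toy problem.",
    "composition": "1,000 synthetic records.",
    "collection": "Generated by a script.",
    "uses": "",
}

todo = missing_topics(draft)
```

An empty answer counts as missing here, so `uses` appears in `todo` alongside the topics that are absent from the draft entirely.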
This project was made with linkml-project-cookiecutter.