Datasheets for Datasets Schema
A LinkML schema for the Datasheets for Datasets model, as published in the paper Datasheets for Datasets. Inspired by the datasheets used in electronics and other industries, Gebru et al. proposed that every dataset "be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on".
Bridge2AI Generating Center Datasheets
Curated, comprehensive datasheets for each Bridge2AI data-generating project:
- AI-READI - Retinal imaging and diabetes dataset
- CM4AI - Cell maps for AI dataset
- VOICE - Voice biomarker dataset
- CHORUS - Health data for underrepresented populations
Repository Structure
Browse the source code repository on GitHub:
- examples/ - Example data files
- project/ - Generated project files (JSON Schema, OWL, SHACL, etc.)
- src/ - Source files
- src/data_sheets_schema/schema/ - LinkML schema source (edit here)
- src/data_sheets_schema/datamodel/ - Generated Python datamodel
- tests/ - Python tests
- data/ - D4D metadata and evaluation data
Quick Links
- Schema Documentation - Complete schema reference (classes, slots, enumerations)
- D4D Examples - View rendered datasheets for Bridge2AI projects
- GitHub Repository - Source code and development
- Issues - Report bugs or request features
Related Resources
- Original Paper: Datasheets for Datasets (Gebru et al. 2021)
- Example: Structured dataset documentation: a datasheet for CheXpert
- Google's Alternative: Data Cards
- Augmented Model: Augmented Datasheets for Speech Datasets
About This Project
This repository stores a LinkML schema representation of the original Datasheets for Datasets model. It captures the topics, the sets of questions, and the entities and fields expected in the answers. The schema includes 76 classes and 272 unique slots, and covers the following areas:
- Motivation - Why was the dataset created?
- Composition - What's in the dataset?
- Collection - How was data collected?
- Preprocessing - What preprocessing was done?
- Uses - What should the dataset be used for?
- Distribution - How is the dataset distributed?
- Maintenance - Who maintains the dataset?
- Ethics - What ethical reviews were conducted?
- Human Subjects - What protections are in place for human subjects?
- Data Governance - How is the data governed?
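As a rough illustration of how the topic areas above might be used when drafting a datasheet, the sketch below checks a plain Python dictionary for missing sections. The section keys (`motivation`, `human_subjects`, and so on) and the helper `missing_topics` are hypothetical names chosen for this example; they are not taken from the schema, whose actual class and slot names are listed in the Schema Documentation.

```python
# Hypothetical section keys mirroring the ten topic areas listed above.
# The real schema's class and slot names may differ.
TOPIC_AREAS = [
    "motivation",
    "composition",
    "collection",
    "preprocessing",
    "uses",
    "distribution",
    "maintenance",
    "ethics",
    "human_subjects",
    "data_governance",
]


def missing_topics(datasheet: dict) -> list[str]:
    """Return the topic areas that are absent or empty in a datasheet dict."""
    return [topic for topic in TOPIC_AREAS if not datasheet.get(topic)]


# Illustrative draft with made-up content, not drawn from a real dataset.
draft = {
    "motivation": "Support research on an example condition.",
    "composition": "10,000 de-identified records.",
    "uses": "Benchmarking classification models.",
}

print(missing_topics(draft))
```

A check like this only confirms that each section is present. Validating the content of each answer against the schema would need the generated artifacts in project/ (for example the JSON Schema) or LinkML's own tooling.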
This project was made with linkml-project-cookiecutter.