Datasheets for Datasets Schema

📚 View: Schema Documentation | CLI Reference | D4D Examples | About

A LinkML schema for the Datasheets for Datasets model, as published in "Datasheets for Datasets" (Gebru et al.). Inspired by datasheets as used in the electronics and other industries, Gebru et al. proposed that every dataset "be accompanied with a datasheet that documents its motivation, composition, collection process, recommended uses, and so on".

Bridge2AI Generating Center Datasheets

Curated, comprehensive datasheets for each Bridge2AI data-generating project:

  • AI-READI - Retinal imaging and diabetes dataset
  • CM4AI - Cell maps for AI dataset
  • VOICE - Voice biomarker dataset
  • CHORUS - Health data for underrepresented populations

View all D4D examples →

D4D-Core Schema (Recommended Entry Point)

A curated, interop-focused subset of D4D — the recommended starting point for new datasheets and for systems that exchange datasheets with RO-Crate / FAIRSCAPE / DCAT consumers. See D4D-Core → for the schema YAMLs, merged form, validation targets, and curated HTML examples. Each d4d-core slot is paired with a SKOS-aligned external term in the Semantic Exchange Layer →.

Semantic Exchange Layer (D4D ↔ RO-Crate / FAIRSCAPE)

The canonical SKOS + SSSOM mapping that lets a D4D datasheet round-trip through RO-Crate, FAIRSCAPE EVI, schema.org, DCAT, and Croissant RAI. See Semantic Exchange → for the SKOS TTL, semantic + structural SSSOM, generator scripts, and the /d4d-add-mapping workflow for new mappings.
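
As a rough illustration of what an SSSOM mapping row carries, the sketch below parses a tiny inline SSSOM-style TSV with Python's standard library and indexes it by subject. The column names (`subject_id`, `predicate_id`, `object_id`, `mapping_justification`) follow the SSSOM specification. The rows themselves, including the `d4d:` slot names, are hypothetical placeholders and are not taken from this repository's mapping files. The helper names `load_mappings` and `exact_targets` are likewise made up for this example.

```python
import csv

# Hypothetical SSSOM-style rows; the d4d: CURIEs are illustrative only
# and are not copied from the repository's actual mapping files.
SSSOM_TSV = (
    "subject_id\tpredicate_id\tobject_id\tmapping_justification\n"
    "d4d:title\tskos:exactMatch\tschema:name\tsemapv:ManualMappingCuration\n"
    "d4d:license\tskos:closeMatch\tdcterms:license\tsemapv:ManualMappingCuration\n"
)


def load_mappings(tsv_text):
    """Index SSSOM rows by subject_id, skipping '#' metadata comment lines."""
    lines = [ln for ln in tsv_text.splitlines() if ln and not ln.startswith("#")]
    reader = csv.DictReader(lines, delimiter="\t")
    index = {}
    for row in reader:
        index.setdefault(row["subject_id"], []).append(
            (row["predicate_id"], row["object_id"])
        )
    return index


def exact_targets(index, subject_id):
    """Return external terms linked to subject_id by skos:exactMatch."""
    return [
        obj for pred, obj in index.get(subject_id, []) if pred == "skos:exactMatch"
    ]


mappings = load_mappings(SSSOM_TSV)
print(exact_targets(mappings, "d4d:title"))
```

Filtering on `skos:exactMatch` is only one possible policy. A consumer that tolerates looser alignment might also accept `skos:closeMatch` rows when translating a datasheet to another vocabulary.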

Repository Structure

Browse the source code repository on GitHub.

About This Project

This repository stores a LinkML schema for the original Datasheets for Datasets model. It captures the topics, the sets of questions, and the entities and fields expected in the answers. The schema includes 76 classes and 272 unique slots, with coverage of:

  • Motivation - Why was the dataset created?
  • Composition - What's in the dataset?
  • Collection - How was data collected?
  • Preprocessing - What preprocessing was done?
  • Uses - What should the dataset be used for?
  • Distribution - How is the dataset distributed?
  • Maintenance - Who maintains the dataset?
  • Ethics - What ethical reviews were conducted?
  • Human Subjects - What protections exist for human subjects?
  • Data Governance - How is the data governed?
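
One way to use the topic list above is as a completeness checklist. The following is a minimal sketch, assuming a datasheet has already been loaded into a plain Python dict. The section keys (`motivation`, `composition`, and so on) are hypothetical stand-ins derived from the topic names above, not the schema's actual class or slot names, and `missing_topics` is a made-up helper.

```python
# Topic areas from the list above. The snake_case keys are assumed for
# illustration and are not the schema's real slot names.
TOPICS = [
    "motivation", "composition", "collection", "preprocessing", "uses",
    "distribution", "maintenance", "ethics", "human_subjects", "data_governance",
]


def missing_topics(datasheet):
    """Return topic keys that are absent or empty in the datasheet dict."""
    return [topic for topic in TOPICS if not datasheet.get(topic)]


# Hypothetical partial datasheet with placeholder values.
draft = {
    "motivation": {"purpose": "example purpose"},
    "composition": {"instances": "example instances"},
    "ethics": {},  # present but empty, so it is reported as missing
}

print(missing_topics(draft))
```

A check like this only reports which topic areas have no content. Validating the content itself against the schema would be a job for LinkML's own validation tooling.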

This project was made with linkml-project-cookiecutter.