D4D Examples
This page provides links to rendered Datasheet for Datasets (D4D) examples for Bridge2AI data generating projects.
Recommended: Use the Curated Comprehensive Datasheets below for the most complete and accurate project metadata.
Curated Comprehensive Datasheets
These are the most comprehensive and authoritative datasheets for each project, created through extensive AI-powered synthesis with human oversight and validation:
AI-READI
- Human Readable HTML - Recommended viewing format
- LinkML Format HTML - Technical schema format
- Download YAML - Source metadata
CM4AI
- Human Readable HTML - Recommended viewing format
- LinkML Format HTML - Technical schema format
- Download YAML - Source metadata
VOICE
- Human Readable HTML - Recommended viewing format
- LinkML Format HTML - Technical schema format
- Download YAML - Source metadata
CHORUS
Note: CHORUS does not yet have a curated comprehensive datasheet. For CHORUS metadata, please see the GPT-5 Synthesized Datasheets section below.
GPT-5 Synthesized Datasheets
These datasheets were automatically synthesized from multiple documents using GPT-5:
AI-READI
CHORUS
CM4AI
VOICE
Claude Code Synthesized Datasheets (Deterministic)
These datasheets were automatically synthesized using Claude Sonnet 4.5 with deterministic settings (temperature=0.0) for reproducibility:
AI-READI
CHORUS
CM4AI
VOICE
Individual Dataset Datasheets
These datasheets were created from specific dataset metadata sources:
AI-READI (FAIRHub v3)
CM4AI (Dataverse v3)
VOICE (PhysioNet v3)
About the Datasheets
Curated Comprehensive Datasheets
The Curated Comprehensive Datasheets represent the most complete and authoritative metadata for each project. These were created through:
- Automated extraction of metadata from multiple data sources and documentation using AI
- Human oversight and validation by domain experts
- Iterative refinement to ensure completeness and accuracy
- Validation against the LinkML schema
- Rendering to multiple formats (human-readable HTML, technical LinkML HTML, and YAML)
These datasheets provide comprehensive project-level metadata including dataset motivation, composition, collection processes, preprocessing, recommended uses, distribution, maintenance, ethics, and governance.
GPT-5 Synthesized Datasheets
The GPT-5 Synthesized Datasheets were created by: 1. Concatenating multiple project-related documents in reproducible order 2. Processing with GPT-5 to extract and synthesize D4D metadata 3. Validating against the LinkML schema 4. Rendering to human-readable HTML format
These provide automated comprehensive project-level metadata and include both HTML views and downloadable YAML files. Note: HTML rendering may show incomplete data in some cases; refer to YAML source files for complete extracted metadata.
Claude Code Synthesized Datasheets (Deterministic)
The Claude Code Synthesized Datasheets are generated with deterministic settings for reproducibility:
1. Temperature=0.0: Eliminates randomness in model responses
2. Pinned model version: claude-sonnet-4-5-20250929 prevents changes from model updates
3. Version-controlled prompts: Stored in external files tracked in git
4. Local schema: Uses version-controlled schema file (not remote)
5. Comprehensive metadata: Each YAML includes a metadata file tracking all generation parameters
Key Features: - Reproducible: Running twice on same input produces identical output - Traceable: Complete provenance tracking via metadata files - Comparable: Can meaningfully compare with GPT-5 outputs - Transparent: All prompts and settings version-controlled and documented
Metadata Files contain: - SHA-256 hashes of input file, schema, and prompts - Model settings (temperature, max_tokens) - Processing environment details - Git commit hash for provenance - Reproducibility command
See DETERMINISM.md for complete details on the deterministic approach.
Individual Dataset Datasheets
The Individual Dataset Datasheets provide detailed metadata for specific datasets from each project's primary data repository (FAIRHub, Dataverse, PhysioNet). These focus on individual dataset instances rather than project-level metadata.
Schema Information
All datasheets conform to the Datasheets for Datasets framework by Gebru et al., implemented using the Bridge2AI LinkML schema.
The YAML files can be validated, transformed, and processed using LinkML tools. See the LinkML documentation for more information.