AI READI d4d

Datasheet for Dataset - Human Readable Format

🎯

Motivation

Why was the dataset created?

  • Description
    Better understand salutogenesis (the pathway from disease to health) in Type 2 Diabetes Mellitus using a harmonized, multi-domain dataset designed for AI/ML research.
📊

Composition

What do the instances represent?

  • Description
    Individual participants with and without Type 2 Diabetes Mellitus (T2DM) with multi-domain measurements including survey responses, physical and clinical measurements, blood and urine lab results, retinal imaging, ECG, wearable device time-series, blood glucose levels, and environmental sensor data (e.g., home air quality).
  • Description
    Participants with and without Type 2 Diabetes Mellitus, stratified by diabetes severity. Recruitment aimed for an approximately equal distribution across severity groups, but the pilot and periodic releases may not be balanced because enrollment is ongoing.
  • Description
    Public dataset downloadable upon agreement with a license; full dataset available via controlled access through Data Use Agreement.
🔍

Collection Process

How was the data acquired?

AI-READI Dataset
Flagship Dataset of Type 2 Diabetes from the AI-READI Project
The AI-READI dataset consists of data collected from individuals with and without Type 2 Diabetes Mellitus (T2DM) and harmonized across 3 data collection sites. The composition was designed with future AI/Machine Learning studies in mind, including recruitment sampling procedures aimed at achieving approximately equal distribution of participants across diabetes severity, as well as a multi-domain data acquisition protocol (survey data, physical measurements, clinical data, imaging data, wearable device data, etc.) to enable downstream AI/ML analyses that may not be feasible with existing data sources such as claims or electronic health records data. The goal is to better understand salutogenesis (the pathway from disease to health) in T2DM.
Language
en
Keywords
  • Type 2 Diabetes
  • T2DM
  • AI-READI
  • Machine Learning
  • multimodal
  • harmonized
  • multi-site
  • survey data
  • clinical data
  • imaging data
  • wearable device data
  • ECG
  • retinal images
  • blood glucose
  • laboratory results
  • environmental data
  • FAIR principles
  • Healthsheet
  • Description
    Provide a harmonized, multi-site, multi-domain dataset enabling AI/ML analyses not feasible with existing sources (e.g., claims or EHR alone), with recruitment targeting approximately equal distribution by diabetes severity.
  • Name
    Public dataset
    ID
    aireadi:public-dataset
    Description
    Includes survey data, blood and urine lab results, fitness activity levels, clinical measurements (e.g., monofilament and cognitive function testing), retinal images, ECG, blood sugar levels, and environmental variables such as home air quality. Available for public download upon agreement with a license that defines how the data can be used.
  • Name
    Controlled-access dataset
    ID
    aireadi:controlled-access-dataset
    Description
    Includes 5-digit zip code, sex, race, ethnicity, genetic sequencing data, past health records, medications, and traffic and accident reports. Accessible by entering into a data use agreement.
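The split between the two access tiers can be expressed as a small lookup, which may help when planning which agreement a study needs. This is a minimal sketch, not part of any AI-READI tooling: the field names are transcribed from the two dataset descriptions above, while the function name and the rule "anything not listed as controlled is treated as public" are assumptions made for illustration.

```python
# Fields the datasheet lists under the controlled-access dataset.
# The string spellings are illustrative, not official variable names.
CONTROLLED_ACCESS_FIELDS = {
    "5-digit zip code",
    "sex",
    "race",
    "ethnicity",
    "genetic sequencing data",
    "past health records",
    "medications",
    "traffic and accident reports",
}


def required_tier(requested_fields):
    """Return the dataset ID needed and the fields that trigger controlled access.

    Assumption: any field not in CONTROLLED_ACCESS_FIELDS is available
    in the public dataset.
    """
    controlled = sorted(f for f in requested_fields if f in CONTROLLED_ACCESS_FIELDS)
    if controlled:
        return "aireadi:controlled-access-dataset", controlled
    return "aireadi:public-dataset", controlled


# Example: retinal images and ECG alone stay in the public tier,
# while adding race requires the Data Use Agreement.
print(required_tier(["retinal images", "ECG"]))
print(required_tier(["retinal images", "race"]))
```

The authoritative field lists are those in the license and Data Use Agreement; this lookup only mirrors the summary given here.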
  1. Description
    Recruitment sampling procedures aimed at achieving approximately equal distribution of participants across diabetes severity.
    Sample
    True
    Random Sampling
    False
    Representative Sample
    False
    Strategies
    • Targeted recruitment to balance diabetes severity across participant groups
  • Description
    As enrollment is ongoing, the pilot data release and periodic updates to data releases may not have achieved balanced distribution across groups.
    Anomaly Details
    • Early releases may exhibit unbalanced distributions across diabetes severity groups
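Because early releases may be unbalanced across diabetes severity groups, it can help to measure the imbalance before modeling. The following is a minimal sketch of one common approach: it reports each group's share and an inverse-frequency weight, using the same formula as the "balanced" heuristic in scikit-learn (`n_samples / (n_groups * group_count)`). The group labels below are placeholders; the actual severity group names and the column that holds them are defined in the dataset documentation.

```python
from collections import Counter


def severity_balance(labels):
    """Summarize group sizes and compute inverse-frequency weights.

    A weight of 1.0 means the group is exactly at its balanced share;
    weights above 1.0 mark under-represented groups.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    n_groups = len(counts)
    report = {}
    for group, count in counts.items():
        report[group] = {
            "count": count,
            "share": count / total,
            "weight": total / (n_groups * count),
        }
    return report


# Placeholder labels for illustration only (not real AI-READI data).
labels = ["group_a"] * 6 + ["group_b"] * 3 + ["group_c"] * 1
report = severity_balance(labels)
for group, stats in report.items():
    print(group, stats)
```

The weights can be passed to a training procedure that accepts per-class weights, or the shares can simply be reported alongside results so that readers can judge how far a given release is from the balanced design.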
  • Name
    AI-READI Dataset Documentation
    Description
    Documentation for the AI-READI dataset on the FAIRhub data portal, including dataset landing page and comprehensive documentation.
    External Resources
    • https://docs.aireadi.org
    • https://fairhub.io/datasets/2
    Future Guarantees
    Documentation versions correspond to dataset versions
  • Name
    Related Publications
    Description
    Related AI-READI publications in Nature Metabolism and BMJ Open
    External Resources
    • https://doi.org/10.1038/s42255-024-01165-x
    • https://doi.org/10.1136/bmjopen-2024-097449
  • Name
    Zenodo Archive
    Description
    Archived dataset record on Zenodo
    External Resources
    • https://doi.org/10.5281/zenodo.10642459
  • Description
    Contains protected health information elements under controlled access including past health records, medications, and genetic sequencing data.
    Confidential Elements Present
    True
    Confidentiality Details
    • 5-digit zip code held under controlled access
    • Genetic sequencing data held under controlled access
    • Past health records and medications held under controlled access
  • Description
    Sensitive demographic and health data held under controlled access.
    Sensitive Elements Present
    True
    Sensitivity Details
    • Sex, race, ethnicity held under controlled access
    • Genetic sequencing data
    • Past health records and medications
    • Traffic and accident reports
  1. Description
    Harmonized, multi-domain data acquisition across three collection sites using surveys, clinical exams, imaging devices, wearable sensors, and environmental monitors.
    Was Directly Observed
    True
    Was Reported By Subjects
    True
    Acquisition Details
    • Physical and clinical measurements directly observed
    • Survey data reported by subjects
    • Imaging data (retinal) directly captured
    • Wearable device data passively collected
    • Environmental sensor data (home air quality) directly measured
  • Description
    Multi-modal data collection using hardware devices, clinical procedures, and software-driven capture.
    Mechanism Details
    • Hardware devices and sensors (retinal imaging, ECG, wearables, environmental monitors)
    • Clinical procedures (monofilament and cognitive function testing)
    • Software-driven data capture with manual curation as needed
  • Description
    Three data collection sites were involved in recruitment and data acquisition.
    Collector Details
    • Multi-site data collection across 3 harmonized sites
  • Description
    Pilot study phase data included; enrollment is ongoing with periodic data releases planned.
    Timeframe Details
    • Pilot phase data included in initial release
    • Enrollment ongoing with periodic updates
  • Description
    Domain-specific processing and harmonization described in the Dataset Documentation for each data domain.
    Preprocessing Details
    • File formats, data standards, metadata, and example outputs provided per domain
    • Harmonization across three collection sites
  • Description
    Harmonization and processing across three sites with domain-specific details provided in the documentation.
    Cleaning Details
    • Cross-site harmonization procedures
    • Domain-specific data cleaning as documented
  • Description
    Domain-specific labeling and annotation where applicable, as described in the documentation for each data domain.
    Labeling Details
    • Clinical test outputs annotated per domain protocols
    • Imaging outputs labeled according to clinical standards
  • Name
    AI-READI Project Team
    Description
    Project team responsible for maintaining the dataset.
    Maintainer Details
    • See the Contact Us page on the documentation site and the project's GitHub references
🚀

Uses

What (other) tasks could the dataset be used for?

  • Description
    Enable downstream AI/ML analyses across survey, clinical, imaging, wearable, and environmental domains related to T2DM that may not be feasible with existing data sources such as claims or electronic health records data.
  • Description
    Imbalance across groups in early releases may affect AI/ML model performance and fairness. Users should account for group balance and potential distribution shifts. Sensitive elements must be handled under the Data Use Agreement (DUA) to mitigate privacy risks.
    Impact Details
    • Potential bias from unbalanced early releases
    • Privacy considerations for controlled-access data
  • Description
    Users must adhere to license terms for public data and the Data Use Agreement for controlled-access data.
    Discouragement Details
    • Uses not permitted by license terms
    • Uses that would violate participant privacy protections
Description
Public data available for download upon agreement with a license defining permitted uses. Full dataset access contingent on entering into a Data Use Agreement (controlled access).
License Terms
  • Public data under license with defined permitted uses
  • Full dataset requires Data Use Agreement
📤

Distribution

How will the dataset be distributed?

Description
Separate documentation versions align to dataset versions. Users can navigate between versions via the documentation site.
Version Details
  • Version dropdown available in documentation
  • Each dataset version has corresponding documentation version
🔄

Maintenance

How will the dataset be maintained?

Description
Periodic updates to data releases are planned as enrollment proceeds. Documentation versions align with dataset versions.
Update Details
  • Periodic data releases as enrollment continues
  • Documentation versioned with dataset (e.g., v1.0.0, v2.0.0)
Generated on 2026-04-15 21:08:53 using Bridge2AI Data Sheets Schema