AI-READI (Claude Code Synthesized)

Datasheet for Dataset - Human Readable Format

🎯

Motivation

Why was the dataset created?

  • Response
    To better understand salutogenesis (the pathway from disease to health) in Type 2 Diabetes Mellitus (T2DM), using a harmonized, multi-domain dataset designed for AI/ML research.
📊

Composition

What do the instances represent?

  • Representation
    Individuals (participants) with and without Type 2 Diabetes Mellitus (T2DM)
    Instance Type
    Participants and their multi-domain measurements
    Data Type
    Survey responses; physical and clinical measurements; blood and urine lab results; imaging (retinal); physiological signals (ECG); wearable device time-series; blood glucose levels; environmental sensor data (e.g., home air quality).
  • Identification
    • With and without T2DM
    • Diabetes severity strata
    Distribution
    • Recruitment aimed at approximately equal distribution by diabetes severity
    • Pilot and periodic releases may not achieve full balance due to ongoing enrollment
  • Description
    • Public dataset downloadable upon agreement with a license
    • Full dataset available via controlled access (DUA)
    • Multiple modalities spanning tabular data, images, and time-series; file formats and standards are documented per domain
🔍

Collection Process

How was the data acquired?

AI-READI Dataset
AI-READI Flagship Dataset of Type 2 Diabetes
The AI-READI dataset consists of data collected from individuals with and without Type 2 Diabetes Mellitus (T2DM), harmonized across three data collection sites. It was designed with future AI/Machine Learning (AI/ML) studies in mind, including recruitment sampling procedures aimed at achieving approximately equal distribution of participants across diabetes severity, and a multi-domain data acquisition protocol (survey data, physical measurements, clinical data, imaging data, wearable device data, etc.). The goal is to better understand salutogenesis (the pathway from disease to health) in T2DM. Some non-sensitive data will be publicly downloadable upon agreement with a license defining permitted uses. The full dataset is accessible via a Data Use Agreement (DUA). Public data include survey data, blood and urine lab results, fitness activity levels, clinical measurements (e.g., monofilament and cognitive function testing), retinal images, ECG, blood sugar levels, and environmental variables (e.g., home air quality). Controlled-access data include 5-digit ZIP code, sex, race, ethnicity, genetic sequencing data, past health records, medications, and traffic and accident reports. Enrollment is ongoing; pilot and periodic releases may not achieve balanced distributions across groups. Documentation versions correspond to dataset versions and include domain-level acquisition and processing details.
en
  • Type 2 Diabetes
  • T2DM
  • AI
  • Machine Learning
  • multimodal
  • harmonized
  • multi-site
  • survey data
  • clinical data
  • imaging data
  • wearable device data
  • time-series
  • ECG
  • retinal images
  • blood glucose
  • laboratory results
  • environmental data
  • FAIR principles
  • Healthsheet
  • AI_READI
  • FAIRhub
mixed (tabular and non-tabular modalities)
  • Response
    Provide a harmonized, multi-site, multi-domain dataset enabling AI/ML analyses not feasible with existing sources (e.g., claims or EHR alone), with recruitment targeting approximately equal distribution by diabetes severity.
Description | Name
Includes survey data, blood and urine lab results, fitness activity levels, clinical measurements (e... | Public dataset
Includes 5-digit ZIP code, sex, race, ethnicity, genetic sequencing data, past health records, medic... | Controlled-access dataset
  • Is Sample
    • Yes (recruited participants across three data collection sites)
  • Is Random
    • No (targeted recruitment to balance diabetes severity)
  • Source Data
    • Participants enrolled via three data collection sites
  • Is Representative
    • Designed for approximate balance by diabetes severity; broader representativeness not claimed
  • Representative Verification
    • Balance is targeted during recruitment; because enrollment is ongoing, pilot releases may not achieve it
  • Strategies
    • Recruitment sampling to achieve approximately equal distribution across diabetes severity
  • Description
    • Ongoing enrollment means early releases may exhibit unbalanced distributions across groups.
  • External Resources
    • Dataset landing page on the FAIRhub data portal (https://fairhub.io/datasets/2)
    • Documentation sections with domain-specific standards, metadata, file formats, and example outputs
    • Publications page (https://aireadi.org/publications)
    Archival
    • Documentation versions correspond to dataset versions
    • Zenodo archive (doi:10.5281/zenodo.10642459)
    Restrictions
    • Full dataset requires a Data Use Agreement; some data publicly available under license
  • Description
    • Contains protected health information elements under controlled access (e.g., past health records).
  • Description
    • Genetic sequencing data (controlled)
    • Past health records and medications (controlled)
    • Demographics (sex, race, ethnicity) and 5-digit ZIP code (controlled)
Description
  • Public data are released under licensing terms; sensitive elements are held under controlled access via DUA to protect participant privacy.
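
The split between public and controlled-access elements described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's actual release tooling: the field names (`zip_code`, `genetic_sequencing`, and so on) and the `split_by_access` helper are hypothetical, chosen only to mirror the controlled categories named in this datasheet.

```python
# Controlled-access categories as listed in this datasheet.
# The key names themselves are hypothetical, not AI-READI column names.
CONTROLLED_FIELDS = {
    "zip_code",
    "sex",
    "race",
    "ethnicity",
    "genetic_sequencing",
    "past_health_records",
    "medications",
    "traffic_accident_reports",
}


def split_by_access(record):
    """Split one participant record into (public, controlled) dictionaries."""
    public = {k: v for k, v in record.items() if k not in CONTROLLED_FIELDS}
    controlled = {k: v for k, v in record.items() if k in CONTROLLED_FIELDS}
    return public, controlled


# Made-up record for illustration only; not real participant data.
example = {"participant_id": "p1", "hba1c": 6.1, "zip_code": "00000", "sex": "F"}
public_part, controlled_part = split_by_access(example)
```

Keeping the controlled list in one place makes it straightforward to confirm that a derived file intended for public sharing contains none of the DUA-restricted elements.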
  1. Description
    • Harmonized, multi-domain data acquisition across three collection sites
    • Data collected via surveys, clinical exams, imaging devices, wearable sensors, and environmental monitors
    Was Directly Observed
    Yes (physical/clinical measurements, imaging, wearable, environmental sensors)
    Was Reported By Subjects
    Yes (survey data)
    Was Inferred Derived
    Unspecified
    Was Validated Verified
    Harmonization across sites; domain-specific validation details provided in the documentation.
  • Description
    • Hardware devices and sensors (e.g., retinal imaging, ECG, wearables, environmental monitors)
    • Clinical procedures (e.g., monofilament and cognitive testing)
    • Software-driven data capture and manual curation as needed
  • Description
    • Three data collection sites were involved in recruitment and data acquisition.
  • Description
    • Pilot-phase data are included; enrollment is ongoing, and periodic data releases are planned.
  • Description
    • Domain-specific processing and harmonization described in the Dataset Documentation (file formats, standards, metadata, example outputs).
  • Description
    • Harmonization and processing across three sites; details provided per domain in the documentation.
  • Description
    • Domain-specific labeling/annotation where applicable (e.g., clinical test outputs, imaging outputs), as described in the documentation.
  • Description
    • AI-READI project team (see the Contact Us page of the documentation site and the GitHub references)
🚀

Uses

What (other) tasks could the dataset be used for?

  • Response
    Enable downstream AI/ML analyses across survey, clinical, imaging, wearable, and environmental domains related to T2DM.
  • Description
    • FAIRhub data portal (https://fairhub.io/datasets/2)
    • Zenodo (doi:10.5281/zenodo.10642459)
  • Description
    • Early-release imbalance across groups may affect AI/ML model performance and fairness; users should account for group balance and distribution shifts.
    • Sensitive elements must be handled under DUA to mitigate privacy risks.
  • Description
    • Not specified; users must adhere to license terms and the Data Use Agreement.
Description
  • Public data are available for download upon agreement with a license defining permitted uses.
  • Full dataset access is contingent on entering into a Data Use Agreement (controlled access).
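
One way to account for the early-release group imbalance noted above is to inspect group shares before modeling and, where appropriate, apply inverse-frequency weights. The sketch below assumes a simple list of group labels; the label names (`no_t2dm`, `stratum_a`, `stratum_b`) and counts are invented for illustration and do not reflect the actual AI-READI strata or distributions.

```python
from collections import Counter


def group_balance(labels):
    """Return each group's share of the sample (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def inverse_frequency_weights(labels):
    """Weight each group by total / (n_groups * count).

    Rarer groups get larger weights, and the weights average to 1
    when taken over all samples.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    n_groups = len(counts)
    return {group: total / (n_groups * n) for group, n in counts.items()}


# Hypothetical labels for illustration only; not AI-READI data.
labels = ["no_t2dm"] * 6 + ["stratum_a"] * 3 + ["stratum_b"] * 1
shares = group_balance(labels)
weights = inverse_frequency_weights(labels)
```

Reweighting is only one option; stratified splitting or reporting per-group metrics may suit a given analysis better, and any of these should be rechecked against each new dataset version as enrollment proceeds.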
📤

Distribution

How will the dataset be distributed?

Description
  • Separate documentation versions align to dataset versions (e.g., v1.0.0, v2.0.0); users can navigate between versions via the documentation site.
🔄

Maintenance

How will the dataset be maintained?

Description
  • Periodic updates to data releases are planned as enrollment proceeds; documentation versions align with dataset versions.
Generated on 2025-11-16 17:37:50 using Bridge2AI Data Sheets Schema