AI-READI Dataset
AI-READI Flagship Dataset of Type 2 Diabetes
The AI-READI dataset consists of data collected from individuals with and without Type 2 Diabetes Mellitus (T2DM), harmonized across three data collection sites. It was designed with future AI/Machine Learning (AI/ML) studies in mind: recruitment sampling aims for an approximately equal distribution of participants across diabetes severity, and the data acquisition protocol spans multiple domains (survey data, physical measurements, clinical data, imaging data, wearable device data, etc.). The goal is to better understand salutogenesis (the pathway from disease to health) in T2DM. Some non-sensitive data will be publicly downloadable upon agreement with a license defining permitted uses; the full dataset is accessible under a Data Use Agreement (DUA). Public data include survey data, blood and urine lab results, fitness activity levels, clinical measurements (e.g., monofilament and cognitive function testing), retinal images, ECG, blood sugar levels, and environmental variables (e.g., home air quality). Controlled-access data include 5-digit ZIP code, sex, race, ethnicity, genetic sequencing data, past health records, medications, and traffic and accident reports. Enrollment is ongoing, so pilot and periodic releases may not achieve balanced distributions across groups. Documentation versions correspond to dataset versions and include domain-level acquisition and processing details.
en
- Type 2 Diabetes
- T2DM
- AI
- Machine Learning
- multimodal
- harmonized
- multi-site
- survey data
- clinical data
- imaging data
- wearable device data
- time-series
- ECG
- retinal images
- blood glucose
- laboratory results
- environmental data
- FAIR principles
- Healthsheet
- ID
- AI-READI Dataset
- Name
- AI-READI Dataset
- Title
- AI-READI Flagship Dataset of Type 2 Diabetes
- Description
- Harmonized, multi-site, multi-domain dataset of individuals with and without T2DM supporting AI/ML research on salutogenesis, with public and controlled-access components and versioned documentation aligned to data releases.
- Language
- en
- Page
- https://fairhub.io/datasets/2
- Keywords
- Type 2 Diabetes
- T2DM
- AI
- Machine Learning
- multimodal
- harmonized
- multi-site
- survey data
- clinical data
- imaging data
- wearable device data
- time-series
- ECG
- retinal images
- blood glucose
- laboratory results
- environmental data
- FAIR principles
- Healthsheet
- Is Tabular
- mixed (tabular and non-tabular modalities)
- Purposes
- Response
- Better understand salutogenesis (pathway from disease to health) in T2DM using a harmonized, multi-domain dataset designed for AI/ML research.
- Tasks
- Response
- Enable downstream AI/ML analyses across survey, clinical, imaging, wearable, and environmental domains related to T2DM.
- Addressing Gaps
- Response
- Provide a harmonized, multi-site, multi-domain dataset enabling AI/ML analyses not feasible with existing sources (e.g., claims or EHR alone), with recruitment targeting approximately equal distribution by diabetes severity.
- Instances
- Representation
- Individuals (participants) with and without Type 2 Diabetes Mellitus (T2DM)
- Instance Type
- Participants and their multi-domain measurements
- Data Type
- Survey responses; physical and clinical measurements; blood and urine lab results; imaging (retinal); physiological signals (ECG); wearable device time-series; blood glucose levels; environmental sensor data (e.g., home air quality).
- Subsets
- Public dataset
- Includes survey data, blood and urine lab results, fitness activity levels, clinical measurements (e...
- Controlled-access dataset
- Includes 5-digit ZIP code, sex, race, ethnicity, genetic sequencing data, past health records, medic...
- Sampling Strategies
- Is Sample
- Yes (recruited participants across three data collection sites)
- Is Random
- No (targeted recruitment to balance diabetes severity)
- Source Data
- Participants enrolled via three data collection sites
- Is Representative
- Designed for approximate balance by diabetes severity; broader representativeness not claimed
- Representative Verification
- Balance is targeted during recruitment; because enrollment is ongoing, the pilot release may not achieve it
- Strategies
- Recruitment sampling to achieve approximately equal distribution across diabetes severity
- Subpopulations
- Identification
- With and without T2DM
- Diabetes severity strata
- Distribution
- Recruitment aimed at approximately equal distribution by diabetes severity
- Pilot and periodic releases may not achieve full balance due to ongoing enrollment
- Anomalies
- Description
- Ongoing enrollment means early releases may exhibit unbalanced distributions across groups.
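Because early releases may be unbalanced, it can help to measure the imbalance before any modeling. The following is a minimal sketch, not part of the AI-READI tooling: it assumes each participant carries a single severity label, and the labels `group_a` through `group_d` are hypothetical placeholders rather than the dataset's actual stratum names.

```python
from collections import Counter


def stratum_balance(strata):
    """Return (counts, ratio), where ratio = smallest group size / largest group size.

    A ratio of 1.0 means perfectly balanced groups; values near 0 mean
    at least one group is strongly under-represented.
    """
    counts = Counter(strata)
    if not counts:
        raise ValueError("no participants supplied")
    ratio = min(counts.values()) / max(counts.values())
    return dict(counts), ratio


# Hypothetical labels and counts, for illustration only.
example = ["group_a"] * 4 + ["group_b"] * 3 + ["group_c"] * 2 + ["group_d"] * 1
counts, ratio = stratum_balance(example)
print(counts, ratio)
```

A low ratio on a given release suggests that stratified splitting or reweighting is worth considering before training.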
- External Resources
- External Resources
- FAIRhub Dataset Landing Page
- https://fairhub.io/datasets/2
- AI-READI Publications Page
- https://aireadi.org/publications
- Archival
- Documentation versions correspond to dataset versions
- Restrictions
- Full dataset requires a Data Use Agreement; some data publicly available under license
- Confidential Elements
- Description
- Contains protected health information elements under controlled access (e.g., past health records).
- Sensitive Elements
- Description
- Genetic sequencing data (controlled)
- Past health records and medications (controlled)
- Demographics (sex, race, ethnicity) and 5-digit ZIP code (controlled)
- Is Deidentified
- Description
- Public data are released under licensing terms; sensitive elements are held under controlled access via DUA to protect participant privacy.
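The public versus controlled-access split described above can be pictured as a field-level partition of a participant record. This is only an illustrative sketch: the field names are hypothetical (the real variable names are defined in the dataset documentation), the controlled list is taken from the sensitive elements and the dataset description in this sheet, and the function is not part of any AI-READI software.

```python
# Hypothetical field names mirroring the controlled-access elements listed in this sheet.
CONTROLLED_FIELDS = frozenset({
    "zip_code",
    "sex",
    "race",
    "ethnicity",
    "genetic_sequencing",
    "past_health_records",
    "medications",
    "traffic_and_accident_reports",
})


def split_by_access(record):
    """Partition a record (dict) into (public, controlled) dicts by field name."""
    public = {k: v for k, v in record.items() if k not in CONTROLLED_FIELDS}
    controlled = {k: v for k, v in record.items() if k in CONTROLLED_FIELDS}
    return public, controlled


# Made-up placeholder values, for illustration only.
record = {"participant_id": "P0001", "survey_score": 7, "zip_code": "00000", "medications": []}
public, controlled = split_by_access(record)
print(sorted(public), sorted(controlled))
```

Keeping the controlled field list in one place makes it easier to audit that nothing covered by the DUA leaks into a derived public artifact.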
- Acquisition Methods
- Description
- Harmonized, multi-domain data acquisition across three collection sites
- Data collected via surveys, clinical exams, imaging devices, wearable sensors, and environmental monitors
- Was Directly Observed
- Yes (physical/clinical measurements, imaging, wearable, environmental sensors)
- Was Reported By Subjects
- Yes (survey data)
- Was Inferred Derived
- Unspecified
- Was Validated Verified
- Harmonization across sites; domain-specific validation details provided in the documentation.
- Collection Mechanisms
- Description
- Hardware devices and sensors (e.g., retinal imaging, ECG, wearables, environmental monitors)
- Clinical procedures (e.g., monofilament and cognitive testing)
- Software-driven data capture and manual curation as needed
- Data Collectors
- Description
- Three data collection sites were involved in recruitment and data acquisition.
- Collection Timeframes
- Description
- Includes data from the pilot study phase; enrollment is ongoing and periodic data releases are planned.
- Preprocessing Strategies
- Description
- Domain-specific processing and harmonization described in the Dataset Documentation (file formats, standards, metadata, example outputs).
- Cleaning Strategies
- Description
- Harmonization and processing across three sites; details provided per domain in the documentation.
- Labeling Strategies
- Description
- Domain-specific labeling/annotation where applicable (e.g., clinical test outputs, imaging outputs), as described in the documentation.
- Future Use Impacts
- Description
- Early-release imbalance across groups may affect AI/ML model performance and fairness; users should account for group balance and distribution shifts.
- Sensitive elements must be handled under DUA to mitigate privacy risks.
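One common way to account for group imbalance when training a model is inverse-frequency sample weighting. The sketch below is a generic technique, not a method prescribed by the AI-READI documentation; the group labels are hypothetical, and the weights are normalized so that they sum to the number of samples.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Map each group label to the weight n_total / (n_groups * n_in_group).

    Under-represented groups receive weights above 1, over-represented
    groups receive weights below 1, and the per-sample weights sum to n_total.
    """
    counts = Counter(labels)
    if not counts:
        raise ValueError("no labels supplied")
    n_total = sum(counts.values())
    n_groups = len(counts)
    return {group: n_total / (n_groups * n) for group, n in counts.items()}


# Hypothetical labels, for illustration only.
labels = ["group_a"] * 6 + ["group_b"] * 2
weights = inverse_frequency_weights(labels)
sample_weights = [weights[g] for g in labels]
print(weights)
```

Reweighting does not replace reporting per-group performance; both are useful when a release is known to be unbalanced.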
- Discouraged Uses
- Description
- Not specified; users must adhere to license terms and the Data Use Agreement.
- Distribution Formats
- Description
- Public dataset downloadable upon agreement with a license
- Full dataset available via controlled access (DUA)
- Multiple modalities spanning tabular data, images, and time-series; file formats and standards are documented per domain
- License And Use Terms
- Description
- Public data are available for download upon agreement with a license defining permitted uses.
- Full dataset access is contingent on entering into a Data Use Agreement (controlled access).
- Maintainers
- Description
- AI-READI project team (see the documentation site's Contact Us page and GitHub references)
- Updates
- Description
- Periodic updates to data releases are planned as enrollment proceeds; documentation versions align with dataset versions.
- Version Access
- Description
- Separate documentation versions align to dataset versions (e.g., v1.0.0, v2.0.0); users can navigate between versions via the documentation site.
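Since documentation versions are meant to align with dataset versions, an analysis script can check that the two tags agree before proceeding. This is a minimal sketch that assumes tags follow the `vMAJOR.MINOR.PATCH` pattern seen in the examples above; the helper names are hypothetical and do not correspond to any AI-READI API.

```python
import re

# Matches tags such as "v1.0.0" or "v2.0.0".
_VERSION_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")


def parse_version(tag):
    """Convert a tag like 'v2.0.0' into a comparable tuple such as (2, 0, 0)."""
    match = _VERSION_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized version tag: {tag!r}")
    return tuple(int(part) for part in match.groups())


def docs_match_dataset(dataset_tag, docs_tag):
    """Return True when the documentation tag matches the dataset tag exactly."""
    return parse_version(dataset_tag) == parse_version(docs_tag)


print(docs_match_dataset("v2.0.0", "v2.0.0"))
print(sorted(["v2.0.0", "v1.0.0"], key=parse_version))
```

Parsing into integer tuples avoids the string-comparison pitfall where a tag such as `v10.0.0` would sort before `v2.0.0`.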