AI READI (GPT-5 Synthesized)

Datasheet for Dataset - Human Readable Format

Collection Process

How was the data acquired?

AI-READI Dataset
AI-READI Flagship Dataset of Type 2 Diabetes
The AI-READI dataset consists of data collected from individuals with and without Type 2 Diabetes Mellitus (T2DM), harmonized across three data collection sites. It was designed with future AI/Machine Learning (AI/ML) studies in mind, including recruitment sampling procedures aimed at achieving approximately equal distribution of participants across diabetes severity, and a multi-domain data acquisition protocol (survey data, physical measurements, clinical data, imaging data, wearable device data, etc.). The goal is to better understand salutogenesis (the pathway from disease to health) in T2DM. Some non-sensitive data will be publicly downloadable upon agreement with a license defining permitted uses. The full dataset is accessible via a Data Use Agreement (DUA). Public data include survey data, blood and urine lab results, fitness activity levels, clinical measurements (e.g., monofilament and cognitive function testing), retinal images, ECG, blood sugar levels, and environmental variables (e.g., home air quality). Controlled-access data include 5-digit ZIP code, sex, race, ethnicity, genetic sequencing data, past health records, medications, and traffic and accident reports. Enrollment is ongoing; pilot and periodic releases may not achieve balanced distributions across groups. Documentation versions correspond to dataset versions and include domain-level acquisition and processing details.
  1. ID
    AI-READI Dataset
    Name
    AI-READI Dataset
    Title
    AI-READI Flagship Dataset of Type 2 Diabetes
    Description
    Harmonized, multi-site, multi-domain dataset of individuals with and without T2DM supporting AI/ML research on salutogenesis, with public and controlled-access components and versioned documentation aligned to data releases.
    Language
    en
    Page
    https://fairhub.io/datasets/2
    Keywords
    • Type 2 Diabetes
    • T2DM
    • AI
    • Machine Learning
    • multimodal
    • harmonized
    • multi-site
    • survey data
    • clinical data
    • imaging data
    • wearable device data
    • time-series
    • ECG
    • retinal images
    • blood glucose
    • laboratory results
    • environmental data
    • FAIR principles
    • Healthsheet
    Is Tabular
    mixed (tabular and non-tabular modalities)
    Purposes
    • Response
      Better understand salutogenesis (pathway from disease to health) in T2DM using a harmonized, multi-domain dataset designed for AI/ML research.
    Tasks
    • Response
      Enable downstream AI/ML analyses across survey, clinical, imaging, wearable, and environmental domains related to T2DM.
    Addressing Gaps
    • Response
      Provide a harmonized, multi-site, multi-domain dataset enabling AI/ML analyses not feasible with existing sources (e.g., claims or EHR alone), with recruitment targeting approximately equal distribution by diabetes severity.
    Instances
    • Representation
      Individuals (participants) with and without Type 2 Diabetes Mellitus (T2DM)
      Instance Type
      Participants and their multi-domain measurements
      Data Type
      Survey responses; physical and clinical measurements; blood and urine lab results; imaging (retinal); physiological signals (ECG); wearable device time-series; blood glucose levels; environmental sensor data (e.g., home air quality).
    Subsets
    • Name
      Public dataset
      Description
      Includes survey data, blood and urine lab results, fitness activity levels, clinical measurements (e...
    • Name
      Controlled-access dataset
      Description
      Includes 5-digit ZIP code, sex, race, ethnicity, genetic sequencing data, past health records, medic...
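The public/controlled split can be sketched as a simple lookup. This is a minimal illustration, assuming the tier assignments listed in the dataset description at the top of this datasheet; the snake_case keys and the function name `split_by_access` are hypothetical labels, not actual column names or tooling from the AI-READI release.

```python
# Access tiers as listed in the dataset description. The keys are
# hypothetical labels for illustration, not real AI-READI column names.
ACCESS_TIER = {
    "survey_data": "public",
    "blood_and_urine_labs": "public",
    "fitness_activity_levels": "public",
    "clinical_measurements": "public",
    "retinal_images": "public",
    "ecg": "public",
    "blood_glucose": "public",
    "environmental_variables": "public",
    "zip_code_5_digit": "controlled",
    "sex": "controlled",
    "race": "controlled",
    "ethnicity": "controlled",
    "genetic_sequencing": "controlled",
    "past_health_records": "controlled",
    "medications": "controlled",
    "traffic_and_accident_reports": "controlled",
}


def split_by_access(fields):
    """Sort field labels into (public, controlled, unknown), keeping order."""
    public, controlled, unknown = [], [], []
    buckets = {"public": public, "controlled": controlled}
    for name in fields:
        # Labels missing from ACCESS_TIER fall through to `unknown`
        # instead of being silently treated as public.
        buckets.get(ACCESS_TIER.get(name), unknown).append(name)
    return public, controlled, unknown
```

Routing unrecognized labels to a separate `unknown` bucket is a deliberate choice: a field that is not explicitly listed as public should be reviewed against the license and DUA rather than assumed to be downloadable.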
    Sampling Strategies
    1. Is Sample
      • Yes (recruited participants across three data collection sites)
      Is Random
      • No (targeted recruitment to balance diabetes severity)
      Source Data
      • Participants enrolled via three data collection sites
      Is Representative
      • Designed for approximate balance by diabetes severity; broader representativeness not claimed
      Representative Verification
      • Balance is targeted during recruitment; because enrollment is ongoing, pilot and periodic releases may not achieve it
      Strategies
      • Recruitment sampling to achieve approximately equal distribution across diabetes severity
    Subpopulations
    • Identification
      • With and without T2DM
      • Diabetes severity strata
      Distribution
      • Recruitment aimed at approximately equal distribution by diabetes severity
      • Pilot and periodic releases may not achieve full balance due to ongoing enrollment
    Anomalies
    • Description
      • Ongoing enrollment means early releases may exhibit unbalanced distributions across groups.
    External Resources
    Confidential Elements
    • Description
      • Contains protected health information elements under controlled access (e.g., past health records).
    Sensitive Elements
    • Description
      • Genetic sequencing data (controlled)
      • Past health records and medications (controlled)
      • Demographics (sex, race, ethnicity) and 5-digit ZIP code (controlled)
    Is Deidentified
    Description
    • Public data are released under licensing terms; sensitive elements are held under controlled access via DUA to protect participant privacy.
    Acquisition Methods
    1. Description
      • Harmonized, multi-domain data acquisition across three collection sites
      • Data collected via surveys, clinical exams, imaging devices, wearable sensors, and environmental monitors
      Was Directly Observed
      Yes (physical/clinical measurements, imaging, wearable, environmental sensors)
      Was Reported By Subjects
      Yes (survey data)
      Was Inferred Derived
      Unspecified
      Was Validated Verified
      Harmonization across sites; domain-specific validation details provided in the documentation.
    Collection Mechanisms
    • Description
      • Hardware devices and sensors (e.g., retinal imaging, ECG, wearables, environmental monitors)
      • Clinical procedures (e.g., monofilament and cognitive testing)
      • Software-driven data capture and manual curation as needed
    Data Collectors
    • Description
      • Three data collection sites were involved in recruitment and data acquisition.
    Collection Timeframes
    • Description
      • Pilot-phase data are included; enrollment is ongoing, and periodic data releases are planned.
    Preprocessing Strategies
    • Description
      • Domain-specific processing and harmonization described in the Dataset Documentation (file formats, standards, metadata, example outputs).
    Cleaning Strategies
    • Description
      • Harmonization and processing across three sites; details provided per domain in the documentation.
    Labeling Strategies
    • Description
      • Domain-specific labeling/annotation where applicable (e.g., clinical test outputs, imaging outputs), as described in the documentation.
    Future Use Impacts
    • Description
      • Early-release imbalance across groups may affect AI/ML model performance and fairness; users should account for group balance and distribution shifts.
      • Sensitive elements must be handled under DUA to mitigate privacy risks.
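One way to act on the group-balance caveat above is to inspect group shares before training and, if needed, reweight samples. The sketch below is an assumption-laden illustration: the group labels are hypothetical placeholders (the datasheet does not name the severity strata), and inverse-frequency weighting is one common option, not a method prescribed by the AI-READI documentation.

```python
from collections import Counter


def group_shares(labels):
    """Return the fraction of samples in each group."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


def inverse_frequency_weights(labels):
    """Per-group weights so that every group contributes equally in total.

    weight(g) = N / (K * n_g), where N is the number of samples, K the
    number of groups, and n_g the number of samples in group g.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    n_groups = len(counts)
    return {group: total / (n_groups * n) for group, n in counts.items()}


# Hypothetical, deliberately unbalanced labels (not real AI-READI data).
labels = ["group_a"] * 4 + ["group_b"] * 2 + ["group_c"] + ["group_d"]
shares = group_shares(labels)
weights = inverse_frequency_weights(labels)
```

With these weights, the summed weight of each group is the same (here 2.0 per group), so an over-represented group no longer dominates a weighted loss. Reweighting does not address distribution shift between releases; that still needs to be checked release by release.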
    Discouraged Uses
    • Description
      • Not specified; users must adhere to license terms and the Data Use Agreement.
    Distribution Formats
    • Description
      • Public dataset downloadable upon agreement with a license
      • Full dataset available via controlled access (DUA)
      • Multiple modalities spanning tabular data, images, and time-series; file formats and standards are documented per domain
    License And Use Terms
    Description
    • Public data are available for download upon agreement with a license defining permitted uses.
    • Full dataset access is contingent on entering into a Data Use Agreement (controlled access).
    Maintainers
    • Description
      • AI-READI project team (see the Contact Us page of the documentation site and the GitHub references)
    Updates
    Description
    • Periodic updates to data releases are planned as enrollment proceeds; documentation versions align with dataset versions.
    Version Access
    Description
    • Separate documentation versions align to dataset versions (e.g., v1.0.0, v2.0.0); users can navigate between versions via the documentation site.
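Because documentation versions are aligned with dataset versions, an analysis script can check that the two match before proceeding. The following is a minimal sketch, assuming version tags follow the `vMAJOR.MINOR.PATCH` pattern shown above; the function names are hypothetical and no AI-READI or FAIRhub API is involved.

```python
def parse_version(tag):
    """Parse a tag such as 'v1.0.0' into a tuple of ints, e.g. (1, 0, 0)."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def matching_docs(dataset_version, doc_versions):
    """Return the documentation tag equal to the dataset version, or None."""
    target = parse_version(dataset_version)
    for tag in doc_versions:
        if parse_version(tag) == target:
            return tag
    return None


# Hypothetical list of available documentation versions.
available_docs = ["v1.0.0", "v2.0.0"]
```

For example, `matching_docs("v2.0.0", available_docs)` returns `"v2.0.0"`, while a dataset version with no matching documentation yields `None`, which a caller can treat as a signal to stop and look up the correct documentation manually.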
Generated on 2025-11-16 18:39:57 using Bridge2AI Data Sheets Schema