AI READI Dataset Documentation

Datasheet for Dataset - Human Readable Format

🎯

Motivation

Why was the dataset created?

Description | ID | Name
Better understand salutogenesis (the pathway from disease to health) in Type 2 Diabetes Mellitus usi... | purpose-001 | Understanding T2DM salutogenesis
Establish standards, best practices, and guidelines for collection, preparation, and sharing of medi... | purpose-002 | Establishing AI/ML data standards
Address the lack of racial and ethnic diversity in T2DM research by creating a dataset that is tripl... | purpose-003 | Addressing demographic inequities in T2DM research
  • ID
    funder-001
    Name
    NIH Common Fund Bridge2AI Program
    Description
    Funded through National Institutes of Health grant OT2OD032644, administered by NIH Office of the Director. Additional support from grants P30DK035816 (Nutrition and Obesity Research Center), UL1TR003096, and Research to Prevent Blindness. Total funding in 2022: $5,026,499. Opportunity Number: OTA-21-008.
📊

Composition

What do the instances represent?

  1. ID
    instance-001
    Name
    Individual participants
    Description
    Individual participants aged 40 and older with and without Type 2 Diabetes Mellitus (T2DM). Target enrollment is 4,000 people, triple-balanced by self-reported race/ethnicity (Asian, Black, Hispanic, White), T2DM severity (no diabetes, pre-diabetes/lifestyle-controlled diabetes, diabetes treated with oral medications or non-insulin injections, insulin-controlled diabetes), and biological sex (male, female). Participants must speak, read, and understand English. Exclusion criteria include pregnancy and type 1 diabetes.
    Instance Type
    Human participants recruited from three health system sites (University of Alabama at Birmingham, University of California San Diego, University of Washington) between 2020 and 2025.
Description | ID | Name
Self-reported Asian race/ethnicity, target ~1,000 participants (25% of sample) | subpop-001 | Asian participants
Self-reported Black race/ethnicity, target ~1,000 participants (25% of sample) | subpop-002 | Black participants
Self-reported Hispanic ethnicity, target ~1,000 participants (25% of sample) | subpop-003 | Hispanic participants
Self-reported White race/ethnicity, target ~1,000 participants (25% of sample) | subpop-004 | White participants
Participants without diabetes diagnosis, target ~1,000 participants (25% of sample) | subpop-005 | No diabetes
Participants with pre-diabetes or lifestyle-controlled diabetes, target ~1,000 participants (25% of ... | subpop-006 | Pre-diabetes and lifestyle-controlled diabetes
Participants with diabetes treated with oral medications or non-insulin injections, target ~1,000 pa... | subpop-007 | Medication-controlled diabetes
Participants with insulin-controlled diabetes, target ~1,000 participants (25% of sample) | subpop-008 | Insulin-controlled diabetes
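The subgroup targets above can be checked with a little arithmetic. The sketch below assumes an idealized design in which the 4,000-participant target is split evenly across every joint race/ethnicity x severity x sex cell; the datasheet only states approximate marginal targets, so the per-cell figure is an illustration, not a documented recruitment quota. The function name `joint_cell_targets` is hypothetical.

```python
from itertools import product

# Category labels come from the datasheet. The even split across joint
# cells is an assumption made for illustration only.
RACE_ETHNICITY = ["Asian", "Black", "Hispanic", "White"]
SEVERITY = [
    "No diabetes",
    "Pre-diabetes / lifestyle-controlled",
    "Medication-controlled",
    "Insulin-controlled",
]
SEX = ["Male", "Female"]
TARGET_ENROLLMENT = 4000


def joint_cell_targets(total=TARGET_ENROLLMENT):
    """Return an idealized target for every race x severity x sex cell."""
    cells = list(product(RACE_ETHNICITY, SEVERITY, SEX))
    per_cell, remainder = divmod(total, len(cells))
    if remainder:
        raise ValueError("total does not divide evenly across cells")
    return {cell: per_cell for cell in cells}


targets = joint_cell_targets()
# Marginal target for one race/ethnicity group, summed over severity and sex.
asian_total = sum(n for (race, _, _), n in targets.items() if race == "Asian")
print(len(targets), targets[("Asian", "No diabetes", "Female")], asian_total)
```

Under this assumption there are 32 cells of 125 participants each, which reproduces the ~1,000 marginal target per race/ethnicity group and per severity group.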
Access URLs | Description | ID | Name
https://fairhub.io/datasets/2 | Retinal imaging data distributed in DICOM format (converted from proprietary .fda and .sdt formats f... | format-001 | DICOM for imaging
https://fairhub.io/datasets/2 | Survey data, clinical lab results, continuous glucose monitoring, environmental sensor data, and oth... | format-002 | CSV for tabular and time-series data
https://fairhub.io/datasets/2 | Physical activity monitoring data (from Garmin VivoSmart 5) converted from .FIT format to mHealth st... | format-003 | mHealth standard for wearable data
https://docs.aireadi.org/ | Metadata and data dictionaries provided through REDCap system documentation for survey and study coo... | format-004 | REDCap data dictionary
🔍

Collection Process

How was the data acquired?

AI-READI
Artificial Intelligence Ready and Equitable Atlas for Diabetes Insights (AI-READI)
AI-READI is a flagship dataset of multimodal data from a target of 4,000 individuals with and without Type 2 Diabetes Mellitus (T2DM), harmonized across three data collection sites (Birmingham, Alabama; San Diego, California; Seattle, Washington). The dataset was designed with future AI/machine learning studies in mind. Recruitment sampling procedures aim for an approximately equal distribution of participants across race/ethnicity, biological sex, and T2DM severity (triple balancing). A multi-domain data acquisition protocol (survey data, physical measurements, clinical data, imaging data, wearable device data, environmental sensors, biospecimens) enables downstream AI/ML analyses that may not be feasible with existing data sources such as claims or electronic health records. The goal is to better understand salutogenesis (the pathway from disease to health) in T2DM. The study follows FAIR principles and incorporates ethical and equitable data collection and management practices.
en
  • Type 2 Diabetes Mellitus
  • T2DM
  • AI-READI
  • Machine Learning
  • Artificial Intelligence
  • multimodal dataset
  • harmonized data
  • multi-site study
  • salutogenesis
  • FAIR principles
  • retinal imaging
  • continuous glucose monitoring
  • wearable devices
  • biorepository
  • biospecimens
  • triple-balanced sampling
  • health equity
  • Bridge2AI
  • cross-sectional study
Description | ID | Name
Provide a large-scale, harmonized, multi-site, multi-domain dataset enabling AI/ML analyses not feas... | gap-001 | Lack of multimodal T2DM datasets
Address demographic inequities in T2DM research by recruiting equal proportions across four race/eth... | gap-002 | Demographic underrepresentation
Create a model for future AI-ready medical datasets through comprehensive metadata, standardized dat... | gap-003 | AI-readiness of medical datasets
Role | Name | ORCID | Affiliation
Contributor | Cynthia Owsley | creator-001 | -
Contributor | Aaron Lee | creator-002 | -
Contributor | Sally L. Baxter | creator-003 | -
Contributor | Christopher G. Chute | creator-004 | -
Contributor | Megan E. Collins | creator-005 | -
Contributor | Jeffrey C. Edberg | creator-006 | -
Contributor | Kadija Ferryman | creator-007 | -
Contributor | Michelle Hribar | creator-008 | -
Contributor | Samantha Hurst | creator-009 | -
Contributor | Hiroshi Ishikawa | creator-010 | -
Contributor | Cecilia S. Lee | creator-011 | -
Contributor | Alvin Y. Liu | creator-012 | -
Contributor | Gerald McGwin | creator-013 | -
Contributor | Shannon K. McWeeney | creator-014 | -
Contributor | Camille Nebeker | creator-015 | -
Contributor | Bhavesh Patel | creator-016 | -
Contributor | Sara Jean Singer | creator-017 | -
Contributor | Michael P. Snyder | creator-018 | -
Contributor | Joseph Manuel Yracheta | creator-019 | -
Contributor | Linda M. Zangwill | creator-020 | -
Description | ID | Name
Includes data not considered sensitive personal health information, available to the public for down... | subset-001 | Public Access Dataset
Includes sensitive data accessible by entering into a data use agreement. Contains 5-digit zip code,... | subset-002 | Controlled Access Dataset
Biobanked samples stored at UAB Center for Clinical and Translational Science (CCTS), including plas... | subset-003 | Biorepository
  1. ID
    sampling-001
    Name
    Triple-balanced recruitment
    Description
    Recruitment sampling procedures aimed at achieving approximately equal distribution of participants across three dimensions: (1) race/ethnicity (Asian, Black, Hispanic, White), (2) T2DM severity (no diabetes, pre-diabetes/lifestyle-controlled, medication-controlled, insulin-controlled), and (3) biological sex (male, female). This balanced design is critical for developing unbiased machine learning models.
    Is Sample
    • True
    Is Random
    • False
    Is Representative
    • False
    Strategies
    • Targeted recruitment to balance demographics across race/ethnicity, sex, and diabetes severity
    • Wave-based recruitment with monitoring and adjustment through under- and oversampling
    • Recruitment from electronic health records screening using ICD-10 codes (R73.09 for pre-diabetes, E11.X for T2DM)
    • Personalized invitation letters and emails with REDCap recruitment interface
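The ICD-10 screening step in the strategies above can be sketched as a simple pattern match. This is a minimal illustration, not the study's actual screening logic: it reads "E11.X" as "E11 with any subcode", which is an assumption, and the function name `screen_code` and its return labels are hypothetical.

```python
import re

# Codes named in the datasheet: R73.09 (pre-diabetes) and E11.X (T2DM).
# Treating "E11.X" as "E11" plus an optional subcode is an assumption.
PREDIABETES_PATTERN = re.compile(r"^R73\.09$")
T2DM_PATTERN = re.compile(r"^E11(\.[0-9A-Z]{1,4})?$")


def screen_code(code):
    """Map one ICD-10 code to a screening label, or None if it does not match."""
    normalized = code.strip().upper()
    if PREDIABETES_PATTERN.match(normalized):
        return "pre-diabetes"
    if T2DM_PATTERN.match(normalized):
        return "T2DM"
    return None


for example in ["E11.9", "R73.09", "E10.9"]:
    print(example, screen_code(example))
```

Type 1 diabetes codes (E10.x) fall through to `None` here, which is consistent with type 1 diabetes being an exclusion criterion.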
Description | ID | Name
Single study encounter per participant at one of three data collection sites (Birmingham, San Diego,... | collection-001 | In-person data collection visits
Source population identified by screening electronic health records for patients aged 40+ who had me... | collection-002 | Electronic health record screening
Participants recruited in waves to facilitate efficient sampling. Composition and size of each wave ... | collection-003 | Wave-based recruitment
Continuous glucose monitoring (Dexcom G6, 5-minute intervals), physical activity monitoring (Garmin ... | collection-004 | Home-based wearable monitoring
Blood (53 mL) and urine collected during study visit. Local processing for plasma, serum, buffy coat... | collection-005 | Biospecimen collection and biobanking
Description | ID | Name
Self-reported data collected via REDCap interfaces including demographics, medical history, social d... | acquisition-001 | Survey and questionnaire data
Height, weight, blood pressure, heart rate, waist circumference, body composition, and other anthrop... | acquisition-002 | Physical measurements and vital signs
Multi-device retinal imaging protocol capturing data from Optos California, Spectralis OCT2, Triton ... | acquisition-003 | Retinal imaging
Visual acuity and contrast sensitivity under photopic (daylight) and mesopic (dim light) conditions ... | acquisition-004 | Visual function testing
Complete blood count (CBC) from fresh whole blood at local CLIA-certified labs. Central lab testing ... | acquisition-005 | Clinical laboratory testing
12-lead ECG data collected during study visit using standardized protocols. | acquisition-006 | Electrocardiogram (ECG)
Montreal Cognitive Assessment (MoCA) administered to assess cognitive function, relevant to T2DM com... | acquisition-007 | Cognitive function testing
Monofilament testing performed to assess peripheral neuropathy, a common T2DM complication. | acquisition-008 | Peripheral neuropathy assessment
Dexcom G6 Continuous Glucose Monitor capturing blood glucose measurements (mg/dL) every 5 minutes. D... | acquisition-009 | Continuous glucose monitoring
Garmin VivoSmart 5 wearable device capturing number of steps, heart rate, sleep duration (circadian ... | acquisition-010 | Physical activity monitoring
Custom-designed environmental sensor (Karalis Johnson Retina Center, UW) capturing ambient temperatu... | acquisition-011 | Environmental monitoring
Non-fasting blood (53 mL) and urine collection. Processing includes whole blood for CBC, EDTA plasma... | acquisition-012 | Biospecimen collection
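The 5-minute sampling interval stated for the continuous glucose monitor implies 288 readings per full day, which gives a simple way to estimate how complete a wearable recording is. The sketch below uses synthetic timestamps and hypothetical helper names (`expected_readings`, `completeness`); it does not reflect the layout of the released files.

```python
from datetime import datetime, timedelta

CGM_INTERVAL = timedelta(minutes=5)  # sampling interval stated in the datasheet


def expected_readings(start, end, interval=CGM_INTERVAL):
    """Number of readings expected in the window [start, end)."""
    return int((end - start) / interval)


def completeness(timestamps, start, end, interval=CGM_INTERVAL):
    """Fraction of expected readings that are present in [start, end)."""
    expected = expected_readings(start, end, interval)
    in_window = {t for t in timestamps if start <= t < end}
    return len(in_window) / expected if expected else 0.0


# Synthetic example: one day of readings with a one-hour gap (12 readings).
day_start = datetime(2024, 1, 1)
day_end = day_start + timedelta(days=1)
readings = [
    day_start + i * CGM_INTERVAL
    for i in range(288)
    if not 120 <= i < 132
]
print(expected_readings(day_start, day_end), round(completeness(readings, day_start, day_end), 3))
```

Counting distinct timestamps inside the window keeps duplicate readings from inflating the completeness estimate.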
Description | ID | Name | Preprocessing Details
Data harmonized across three collection sites (Birmingham, San Diego, Seattle) using standardized op... | preproc-001 | Data standardization and harmonization | Standardized operating procedures across all three sites, Common protocols and equipment, Centralized data management through REDCap
Retinal imaging data converted from proprietary formats (.fda, .sdt) to DICOM standard for the datas... | preproc-002 | Image format conversion | Proprietary retinal imaging formats (.fda, .sdt) converted to DICOM, Wearable device data (.FIT) converted to mHealth standard
Standardized local processing for plasma, serum, and buffy coats using consistent protocols. Central... | preproc-003 | Biospecimen processing | Standardized local processing for plasma, serum, buffy coats, Centralized PBMC processing at UAB CCTS, Batch shipping for central lab analyses
Multiple quality control measures including standardized training of study coordinators, equipment c... | preproc-004 | Quality control and validation | Standardized training of study coordinators, Equipment calibration protocols, ... (+2 more)
  1. ID
    cleaning-001
    Name
    Multi-site harmonization
    Description
    Standardized protocols and procedures across all three data collection sites ensure data consistency and quality. Common equipment, training, and REDCap data management system used to maintain FAIR principles compliance.
    Cleaning Details
    • Cross-site harmonization procedures
    • Standardized equipment and training
    • REDCap data management for quality
    • FAIR principles implementation
  1. ID
    maintainer-001
    Name
    AI-READI Consortium
    Description
    Multidisciplinary consortium managing dataset maintenance including data collection sites, coordinating centers, and data governance committees.
    Maintainer Details
    • University of Washington (lead institution, data coordination)
    • University of Alabama at Birmingham (biorepository, data collection)
    • University of California San Diego (data collection)
    • Data Access Committee (access policies)
    • Documentation team (version-specific guides)
ID
retention-001
Name
Data and biospecimen retention
Description
Digital data maintained according to NIH data sharing policies. Biospecimen retention subject to institutional policies and consent agreements. Finite number of biospecimen samples available for distribution.
Retention Details
  • NIH data sharing policies govern digital data retention
  • Biospecimen retention per institutional policies
  • Consent agreements specify retention terms
  • Finite biospecimen availability
Description | ID | Name | Sensitive Elements Present | Sensitivity Details
Genomic DNA extracted from buffy coats, blood derivatives, and urine samples stored with potential f... | sensitive-001 | Genetic and biospecimen data | True | Genetic sequencing data from buffy coats, Blood derivatives and urine biospecimens
5-digit zip code, detailed race, ethnicity, and sex information available in controlled access datas... | sensitive-002 | Geographic and demographic identifiers | True | 5-digit zip code, Race and ethnicity details, Biological sex
Past health records, medications, traffic and accident reports available in controlled access datase... | sensitive-003 | Medical history and records | True | Past health records, Medications, Traffic and accident reports
Description | External Resources | ID | Name
Official project website with overview and resources | https://aireadi.org/ | resource-001 | AI-READI Project Website
Comprehensive dataset documentation with version-specific guides | https://docs.aireadi.org/ | resource-002 | AI-READI Dataset Documentation
Dataset repository and download portal | https://fairhub.io/datasets/2 | resource-003 | FAIRhub Dataset Landing Page
Parent NIH Common Fund program supporting AI-ready biomedical datasets | https://bridge2ai.org/ | resource-004 | Bridge2AI Program
Federal grant information and project details | https://reporter.nih.gov/project-details/10471118 | resource-005 | NIH RePORTER Project Details
Policies and procedures for data access | https://aireadi.org/goals/data-sharing | resource-006 | Data Sharing Information
Additional dataset documentation and resources | https://doi.org/10.5281/zenodo.10642459 | resource-007 | Zenodo Archive
BMJ Open publication describing study design and protocol | https://doi.org/10.1136/bmjopen-2024-097449 | resource-008 | Protocol Publication
Overview of AI-READI approach and significance | https://doi.org/10.1038/s42255-024-01165-x | resource-009 | Nature Metabolism Commentary
🚀

Uses

What (other) tasks could the dataset be used for?

Description | ID | Name
Enable downstream AI/ML analyses across survey, clinical, imaging, wearable device, environmental, a... | task-001 | Enable multi-domain AI/ML analyses for T2DM
Support the development of unbiased machine learning models through balanced data collection across ... | task-002 | Develop unbiased AI/ML models
Study disease trajectories and salutogenesis pathways in T2DM through cross-sectional analysis of pa... | task-003 | Study T2DM disease trajectories
Description | ID | Name
Primary intended use is development and training of artificial intelligence and machine learning mod... | use-001 | AI/ML model development for T2DM
Research leveraging multiple data domains (imaging, clinical, genomic, wearable, environmental) to u... | use-002 | Multi-modal T2DM research
Studies examining racial and ethnic disparities in T2DM outcomes, social determinants of health effe... | use-003 | Health equity research
Discovery of novel biomarkers for T2DM progression, complications, and salutogenesis using biospecim... | use-004 | Biomarker discovery
Use as an exemplar for future AI-ready medical dataset development, demonstrating best practices in ... | use-005 | Model dataset for AI-ready data standards
Description | ID | Name
As enrollment is ongoing, pilot data releases and periodic updates may not have achieved balanced di... | discouraged-001 | Uses outside pilot phase scope
Dataset is for research purposes. Any AI/ML models developed should undergo appropriate clinical val... | discouraged-002 | Clinical decision-making without validation
Attempts to re-identify participants from de-identified data violate ethical principles and data use... | discouraged-003 | Re-identification attempts
ID
license-001
Name
Creative Commons Attribution Non-Commercial
Description
Public access data distributed under the Creative Commons Attribution Non-Commercial (CC BY-NC 4.0) license. The license permits others to distribute, remix, adapt, and build upon this work non-commercially, and to license their derivative works on different terms, provided the original work is properly cited, appropriate credit is given, any changes made are indicated, and the use is non-commercial. Controlled access data requires a data use agreement. See http://creativecommons.org/licenses/by-nc/4.0/ for full license terms.
License Terms
  • Proper citation required
  • Non-commercial use only for public data
  • Derivative works permitted with attribution
  • Changes must be indicated
  • Controlled access data requires separate data use agreement
📤

Distribution

How will the dataset be distributed?

CC BY-NC 4.0
🔄

Maintenance

How will the dataset be maintained?

ID
updates-001
Name
Periodic data releases and maintenance plan
Description
Dataset updated periodically as enrollment progresses toward target of 4,000 participants by November 2026. Version-specific documentation maintained for each release. Biorepository maintained at UAB CCTS with long-term storage protocols. Data sharing policies under ongoing development by Data Access Committee.
Frequency
Periodic releases with ongoing enrollment; final release planned for late 2026
Update Details
  • Periodic data releases as enrollment continues
  • Pilot data released with ongoing enrollment
  • Final dataset expected after completion of 4,000 participant enrollment by November 2026
  • Dataset versioning implemented (v1.0.0, v2.0.0, v3.0.0)
  • Version-specific documentation at https://docs.aireadi.org/
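Because releases are tagged with three-part version strings, code that selects "the latest release" should compare the numeric parts rather than the raw strings. A minimal sketch, assuming tags always follow the `vMAJOR.MINOR.PATCH` pattern shown above; the helper name `parse_version` is hypothetical.

```python
def parse_version(tag):
    """Convert a tag such as 'v2.0.0' into a tuple of integers, e.g. (2, 0, 0)."""
    major, minor, patch = tag.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


# Version tags listed in the datasheet, deliberately out of order.
releases = ["v2.0.0", "v1.0.0", "v3.0.0"]
latest = max(releases, key=parse_version)
print(latest)
```

Comparing integer tuples avoids the string-ordering pitfall where a tag like "v10.0.0" would sort before "v9.0.0".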
👥

Human Subjects

Does the dataset relate to people?

ID
hsr-001
Name
AI-READI Human Subjects Research
Description
Study approved by Institutional Review Board (IRB) of University of Washington (approval number STUDY00016228), with reliance agreements from IRBs of University of Alabama at Birmingham and University of California, San Diego. Written informed consent provided by all participants. Bioethics guidance integrated throughout study design. Community Advisory Board of 11 persons with diversity in race and ethnicity contributes to protocol development. Ethical and equitable data collection and management practices implemented.
Involves Human Subjects
True
IRB Approval
  • University of Washington IRB approval number STUDY00016228
  • University of Alabama at Birmingham IRB reliance agreement
  • University of California San Diego IRB reliance agreement
Ethics Review Board
  • University of Washington Institutional Review Board
  • University of Alabama at Birmingham Institutional Review Board (reliance agreement)
  • University of California San Diego Institutional Review Board (reliance agreement)
  • Community Advisory Board with 11 members representing diverse race and ethnicity
Special Populations
  • Recruitment targeted to include racial and ethnic minorities disproportionately affected by T2DM
  • Asian populations
  • Black populations
  • Hispanic populations
  • Tribal consultation planned for Native American cohort participation
Generated on 2025-12-09 18:07:08 using Bridge2AI Data Sheets Schema