VOICE Dataset Documentation

Description	Purpose Details
To integrate the use of voice as a biomarker of health in clinical care by generating a substantial ...	Voice is a promising biomarker as it is simple to collect, cost-effective, and has broad clinical utility, Recent AI advances enable extraction of prognostically useful information from voice data, ... (+2 more)
To develop new standards of acoustic and voice data collection and analysis for voice AI research, i...	Standardized voice data collection protocols across sites, Acoustic quality standardization and calibration, ... (+2 more)
To create software and cloud infrastructure for automated voice data collection through smartphone a...	Custom tablet/smartphone application for voice recording, Integrated acoustic quality standardization, ... (+2 more)

Description	Funder Name	Grant Info
NIH Office of the Director, Bridge to Artificial Intelligence (Bridge2AI) program, grant 3OT2OD03272...	National Institutes of Health (NIH)	Grant number 3OT2OD032720-01S1 (current), 3OT2OD032720-01S3 (referenced), Administering IC: NIH Office of the Director, ... (+12 more)
Additional infrastructure support from National Institute of Biomedical Imaging and Bioengineering (...	NIBIB	Grant number R01EB030362, Supports PhysioNet infrastructure, MIT Laboratory for Computational Physiology

Composition

What do the instances represent?

Instances

Access	Count	Description	Format	Instance Type	Privacy Note
	12523	Voice and speech audio recordings from 306 participants across five clinical sites in North America,...	Derived features (spectrograms, MFCCs) in Parquet format	Audio recordings
	12523	Spectrograms computed using short-time Fast Fourier Transform (FFT) with 25ms window size, 10ms hop ...	Parquet (spectrograms.parquet)	Spectrograms
	12523	Mel-frequency cepstral coefficients (MFCCs) with 60 coefficients extracted from spectrograms, result...	Parquet (mfcc.parquet)	MFCCs
		Acoustic features extracted using OpenSMILE (Speech and Music Interpretation by Large-space Extracti...	TSV (static_features.tsv)	Acoustic features
		Phonetic and prosodic features computed using Parselmouth (Python interface to Praat), providing mea...	TSV (static_features.tsv)	Prosodic features
	306	Demographic data from 306 participants including de-identified geographic information (country retai...	TSV (phenotype.tsv)	Demographics
	306	Self-reported medical history questionnaires covering health status, disease history, medication use...	TSV (phenotype.tsv)	Medical history
		Disease-specific validated questionnaires tailored to participant's disease cohort membership (voice...	TSV (phenotype.tsv)	Clinical questionnaires
		Targeted questionnaires on known confounders for voice including smoking status, vocal use patterns,...	TSV (phenotype.tsv)	Voice confounders
With participant consent		Electronic health record (EHR) data accessed with participant consent for gold standard validation o...		EHR data
		Automated transcriptions generated using OpenAI's Whisper Large model. Free speech transcripts remov...		Transcriptions	Free speech transcripts removed for privacy

Subpopulations

Description	Subpopulation Type
Multi-institutional participants recruited from five clinical sites across North America to ensure g...	Geographic diversity
Disease cohort-based sampling targeting five categories with known voice manifestations: (1) Voice d...	Disease cohort stratification
Intentional recruitment of diverse participants to address historical underrepresentation in voice A...	DEI-focused recruitment

Distribution Formats

Description
Parquet file containing spectrograms with participant_id, session_id, task_name, and 513xN dimension time-frequency representation arrays. Parquet format provides efficient columnar storage and fast queries.
File Name
spectrograms.parquet
Format
Parquet
Structure
Columnar with metadata and dense arrays
Size
Large (dense spectrogram data)
Description
Parquet file containing 60xN dimension MFCC arrays derived from spectrograms. Added in version 1.1 release (January 17, 2025).
File Name
mfcc.parquet
Format
Parquet
Structure
Columnar with metadata and dense arrays
Version Added
1.1
Description
Tab-delimited phenotype data with one row per unique participant (306 rows total), containing demographics, acoustic confounders, and responses to validated questionnaires. Accompanied by JSON data dictionary (phenotype.json) with column descriptions.
File Name
phenotype.tsv
Format
TSV (tab-separated values)
Structure
One row per participant
Rows
306
Data Dictionary
phenotype.json
Description
Tab-delimited static features with one row per unique recording (12,523 rows total), containing features derived from OpenSMILE, Praat, Parselmouth, and TorchAudio. Accompanied by JSON data dictionary (static_features.json) with feature descriptions.
File Name
static_features.tsv
Format
TSV (tab-separated values)
Structure
One row per recording
Rows
12,523
Data Dictionary
static_features.json
Description
Primary distribution through PhysioNet registered access system managed by MIT Laboratory for Computational Physiology, supported by NIBIB grant R01EB030362. Requires Data Access Compliance Office (DACO) approval with Data Transfer and Use Agreement (DTUA).
Platform
PhysioNet
URL
https://physionet.org/content/b2ai-voice/
DOI v1.1
https://doi.org/10.13026/249v-w155
DOI Latest
https://doi.org/10.13026/37yb-1t42
Access Mechanism
Registered access with DTUA
Description
Secondary distribution through Health Data Nexus platform providing alternative access point for AI-ready biomedical datasets.
Platform
Health Data Nexus
URL
https://healthdatanexus.ai/content/b2ai-voice/1.0/
Access Mechanism
Registered access
Description
Project documentation, protocols, and software tools available through official website and GitHub repositories under open-source licenses (MIT License for software).
Platform
Project website and GitHub
URL Documentation
https://docs.b2ai-voice.org
URL GitHub Docs
https://github.com/eipm/bridge2ai-docs
URL GitHub b2aiprep
https://github.com/sensein/b2aiprep
License
MIT License (software)
Description
Raw audio data available through controlled access only by contacting Data Access Compliance Office (DACO@b2ai-voice.org). Raw audio waveforms disseminated with additional privacy protections to protect participant confidentiality.
Platform
Controlled access (DACO)
Contact
DACO@b2ai-voice.org
Data Type
Raw audio waveforms
Privacy Level
Enhanced (controlled access only)
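
The two TSV files described above have different grains: phenotype.tsv has one row per participant, while static_features.tsv has one row per recording. The following is a minimal sketch of attaching participant-level fields to each recording. It assumes both files share a `participant_id` column; the source names that column only for spectrograms.parquet, so the join key is an assumption, and `age_group` and `f0_mean` are hypothetical column names used with made-up values. Check phenotype.json and static_features.json for the real columns.

```python
import csv
import io


def read_tsv(handle):
    """Read a tab-delimited file object into a list of dict rows."""
    return list(csv.DictReader(handle, delimiter="\t"))


def attach_phenotype(recordings, participants, key="participant_id"):
    """Left-join per-recording rows to per-participant rows on `key`.

    `key` is an assumed shared column name, not confirmed by the source.
    """
    by_id = {row[key]: row for row in participants}
    if len(by_id) != len(participants):
        raise ValueError("expected one phenotype row per participant")
    joined = []
    for rec in recordings:
        merged = dict(by_id.get(rec[key], {}))
        merged.update(rec)  # recording columns win on name clashes
        joined.append(merged)
    return joined


# Tiny in-memory stand-ins for the real files (illustrative values only).
PHENOTYPE_TSV = "participant_id\tage_group\nP1\tadult\nP2\tadult\n"
FEATURES_TSV = (
    "participant_id\ttask_name\tf0_mean\n"
    "P1\ttask_a\t120.5\n"
    "P1\ttask_b\t118.0\n"
    "P2\ttask_a\t201.3\n"
)

participants = read_tsv(io.StringIO(PHENOTYPE_TSV))
recordings = read_tsv(io.StringIO(FEATURES_TSV))
joined = attach_phenotype(recordings, participants)
```

With the real files, replace the `io.StringIO` objects with `open("phenotype.tsv", newline="")` and `open("static_features.tsv", newline="")`. The joined list keeps the recording grain, so its length should match the number of recordings.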

Collection Process

How was the data acquired?

ID

bridge2ai-voice-dataset

Name

Bridge2AI-Voice

Title

Bridge2AI-Voice - An ethically-sourced, diverse voice dataset linked to health information

Description

The Bridge2AI-Voice dataset contains comprehensive voice, speech, and language data linked to health information, collected through a multi-institutional initiative funded by NIH's Bridge to Artificial Intelligence program. The dataset includes samples from conventional acoustic tasks such as respiratory sounds, cough sounds, and free speech prompts. Participants perform speaking tasks and complete self-reported demographic and medical history questionnaires, as well as disease-specific validated questionnaires. The project aims to integrate voice as a biomarker of health in clinical care by generating a substantial, ethically sourced, and diverse voice database linked to multimodal health biomarkers (EHR, radiomics, genomics) to fuel voice AI research and build predictive models for screening, diagnosis, and treatment across a broad range of diseases. Data collection is conducted via smartphone application linked to electronic health records, supported by federated learning technology to protect data privacy. Version 1.1 provides 12,523 recordings for 306 participants collected across five sites in North America. The dataset is distributed through PhysioNet and Health Data Nexus under a registered access license.

Page

https://physionet.org/content/b2ai-voice/

Language

Keywords

voice
speech
bridge2ai
voice biomarker
acoustic biomarker
AI
machine learning
health
disease screening
voice disorders
neurological disorders
mood disorders
respiratory disorders
pediatric
PhysioNet
federated learning
ethical AI
FAIR data
CARE principles
multimodal biomarkers

Addressing Gaps

Description	Existing Limitations	Gap Type
Address the pressing need for large, high quality, multi-institutional and diverse voice databases l...	Previous literature used small datasets with limited demographic diversity reporting, Lack of standardized data collection protocols precluding meta-analysis, ... (+4 more)	Dataset availability
Address ethical concerns about patient privacy protection, fair representation of populations, and c...	Industry development lacks comprehensive ethical oversight, Privacy protection inadequate in commercial voice AI, ... (+3 more)	Ethical framework
Build bridges between the medical voice research world, acoustic engineers, and the AI/ML community ...	Siloed research communities, Limited clinical translation of voice AI research, ... (+2 more)	Interdisciplinary collaboration

Creators

Role	ORCID	Affiliation
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-

Subsets

Cohort	Data Dictionary	Description	File Size	Format	ID	Name	Rows	Status	Version Added
		Parquet file (spectrograms.parquet) containing time-frequency representations with participant_id, s...	Large (dense array data)	Parquet	voice:spectrograms	Spectrograms
		Parquet file (mfcc.parquet) containing 60xN dimension MFCC arrays derived from spectrograms. Added i...		Parquet	voice:mfcc	Mel-frequency Cepstral Coefficients			1.1
	phenotype.json	Tab-delimited file (phenotype.tsv) with one row per unique participant (306 rows), containing demogr...		TSV	voice:phenotype	Phenotype Data	306
	static_features.json	Tab-delimited file (static_features.tsv) containing features derived from raw audio using OpenSMILE,...		TSV	voice:static-features	Static Acoustic Features	12523
Voice disorders		Participants with vocal pathologies including laryngeal cancers, vocal fold paralysis, and benign la...			voice:cohort-voice-disorders	Voice Disorders Cohort
Neurological disorders		Participants with neurological and neurodegenerative conditions including Alzheimer's disease, Parki...			voice:cohort-neuro	Neurological Disorders Cohort
Mood and psychiatric disorders		Participants with mood and psychiatric conditions including depression, schizophrenia, and bipolar d...			voice:cohort-mood	Mood and Psychiatric Disorders Cohort
Respiratory disorders		Participants with respiratory conditions including pneumonia, COPD, heart failure, and obstructive s...			voice:cohort-respiratory	Respiratory Disorders Cohort
Pediatric		Pediatric participants with conditions including autism and speech delay. Not included in version 1....			voice:cohort-pediatric	Pediatric Cohort		Not included in v1.1 (adult cohort only)

Sampling Strategies

Description	Rationale	Sampling Method
Participants selected based on membership to five predetermined disease cohort groups identified fro...	Target conditions with established voice-disease associations and clinical unmet needs	Disease cohort-based selection
Multi-institutional recruitment across five sites in North America to ensure geographic diversity, s...	Generalizability and reduced site-specific bias	Multi-site geographic sampling
Intentional focus on recruiting diverse participants historically underrepresented in voice AI resea...	Fairness, representativeness, and reduction of algorithmic bias	Diversity-targeted recruitment
Patients screened at specialty clinics based on predetermined inclusion/exclusion criteria developed...	Clinical validity and gold standard diagnosis	Clinician-guided screening

Acquisition Methods

Collection Mode	Data Sources	Description	Equipment	Instruments	Was Directly Observed
In-person at clinical sites		Voice recordings collected using custom tablet application with headsets at clinical sites during sc...	Custom tablet application (REDCap-based v3.20.0), Headsets for audio capture with acoustic quality control, ... (+2 more)		True
Self-report via tablet application		Structured questionnaires administered via custom data collection application on tablets, capturing ...		Demographic questionnaires, Medical history questionnaires, ... (+2 more)	True
EHR data extraction with consent	Institutional EHR systems at participating sites, Diagnostic codes and clinical notes, ... (+2 more)	Electronic health record (EHR) data accessed through institutional platforms with participant consen...			False

Collection Mechanisms

Components	Description	Mechanism Type
Tablets for application deployment, Headsets with acoustic specifications, ... (+2 more)	Hardware infrastructure including tablets with integrated headsets for standardized audio capture, a...	Hardware
REDCap v3.20.0 (doi:10.5281/zenodo.14148755), Custom tablet/smartphone application, ... (+7 more)	Software infrastructure including REDCap electronic data capture framework (v3.20.0), custom voice r...	Software
Institutional EHR APIs, Secure data linkage protocols, ... (+2 more)	EHR integration platforms enabling secure linkage between voice data and clinical records across par...	Data integration

Data Collectors

Collector Type	Description	Sites	Systems
Human - Clinical research staff	Clinical research coordinators and trained study personnel at participating sites responsible for pa...	University of South Florida, Massachusetts Institute of Technology, ... (+3 more)
Automated - Computational pipelines	Automated computational systems for audio preprocessing, feature extraction, transcription, and qual...		b2aiprep preprocessing library, OpenSMILE acoustic feature extraction, ... (+4 more)

Collection Timeframes

Description
Project initiated September 1, 2022 with planned completion November 30, 2026. Ongoing data collection with periodic versioned releases.
Start Date
2022-09-01
End Date
2026-11-30
Collection Status
Ongoing
Description
Version release timeline: v1.0 released January 2024 (initial release with 306 participants, 12,523 recordings), v1.1 released January 17, 2025 (added MFCCs), v2.0.0 planned April 16, 2025, v2.0.1 planned August 18, 2025.
Release Schedule
- v1.0: January 2024
- v1.1: January 17, 2025
- v2.0.0: April 16, 2025 (planned)
- v2.0.1: August 18, 2025 (planned)
Description
Most participants complete data collection in a single session. A subset requires multiple sessions to complete the protocol, so some participants have more than one session.
Session Structure
Single or multi-session per participant
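
Because some participants have more than one session, it can help to count sessions per participant before splitting data or aggregating features. The sketch below assumes each recording row carries `participant_id` and `session_id` fields, as listed for spectrograms.parquet; the sample rows are invented for illustration.

```python
from collections import defaultdict


def sessions_per_participant(rows):
    """Count distinct session IDs for each participant ID."""
    sessions = defaultdict(set)
    for row in rows:
        sessions[row["participant_id"]].add(row["session_id"])
    return {pid: len(ids) for pid, ids in sessions.items()}


# Invented example rows: P1 has two sessions, P2 has one.
rows = [
    {"participant_id": "P1", "session_id": "S1"},
    {"participant_id": "P1", "session_id": "S1"},
    {"participant_id": "P1", "session_id": "S2"},
    {"participant_id": "P2", "session_id": "S3"},
]

counts = sessions_per_participant(rows)
multi_session = sorted(pid for pid, n in counts.items() if n > 1)
```

Grouping by participant rather than by session or recording keeps all of one person's recordings on the same side of a train/test split.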

Preprocessing Strategies

Description	Methods	Preprocessing Type	Privacy Measures	Tools
Raw audio preprocessing pipeline standardizes all recordings to monaural (single-channel) format and...	Conversion to monaural (mono) audio, Resampling to 16 kHz, Butterworth anti-aliasing filter	Audio standardization
Spectrogram extraction using short-time Fast Fourier Transform (FFT) with 25ms window size, 10ms hop...	Short-time FFT with 25ms window, 10ms hop length, ... (+3 more)	Time-frequency transformation
Mel-frequency cepstral coefficient (MFCC) extraction with 60 coefficients computed from spectrograms...	60 MFCC coefficients, Derived from spectrograms, ... (+2 more)	Perceptual feature extraction
Acoustic feature extraction using OpenSMILE (Speech and Music Interpretation by Large-space Extracti...		Acoustic feature extraction		OpenSMILE (Eyben et al. 2010), LLD computation, ... (+2 more)
Phonetic and prosodic feature computation using Parselmouth (Python interface to Praat) for fundamen...		Prosodic analysis		Parselmouth (Jadoul et al. 2018), Praat phonetic analysis, ... (+3 more)
Automated speech transcription using OpenAI's Whisper Large model for accurate transcription of voic...		Automated transcription	Free speech transcripts removed, Only non-identifying task transcriptions retained	OpenAI Whisper Large model, Automatic speech recognition (ASR)
REDCap data export and conversion using open-source b2aiprep library (v0.21.0) for standardized extr...		Data export and formatting		b2aiprep v0.21.0 (https://github.com/sensein/b2aiprep), REDCap API integration, ... (+2 more)
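
The spectrogram parameters above translate into sample counts as follows. This is a sketch of the arithmetic only, not the b2aiprep implementation. The 16 kHz rate, 25 ms window, and 10 ms hop come from the table; the FFT size of 1024 is inferred from the 513 frequency rows reported for spectrograms.parquet (513 = 1024 / 2 + 1) and is not stated in the source. The frame count assumes no padding or centering, which is also an assumption; a pipeline that pads the signal will produce a slightly different N.

```python
SAMPLE_RATE_HZ = 16_000  # from the audio standardization step
WINDOW_MS = 25           # from the spectrogram step
HOP_MS = 10              # from the spectrogram step
N_FREQ_BINS = 513        # rows of each 513xN spectrogram array


def stft_layout(n_samples, sample_rate=SAMPLE_RATE_HZ):
    """Return window, hop, FFT size, and frame count for a recording."""
    win_length = sample_rate * WINDOW_MS // 1000
    hop_length = sample_rate * HOP_MS // 1000
    # Inferred: a one-sided FFT with 513 bins implies a 1024-point FFT.
    n_fft = 2 * (N_FREQ_BINS - 1)
    # Assumption: no padding, so only full windows are counted.
    if n_samples < win_length:
        n_frames = 0
    else:
        n_frames = 1 + (n_samples - win_length) // hop_length
    return {
        "win_length": win_length,
        "hop_length": hop_length,
        "n_fft": n_fft,
        "n_frames": n_frames,
        "bin_width_hz": sample_rate / n_fft,
    }


# A three-second recording at 16 kHz.
layout = stft_layout(n_samples=3 * SAMPLE_RATE_HZ)
```

Under these assumptions a three-second recording yields 298 frames of 400 samples each, stepped by 160 samples, with frequency bins 15.625 Hz apart.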

Cleaning Strategies

Cleaning Type	Description	Identifiers Removed	Privacy Measures	Quality Measures
HIPAA Safe Harbor de-identification	HIPAA Safe Harbor de-identification method applied systematically to remove all 18 categories of ide...	Names, Geographic subdivisions smaller than state (state/province removed, country retained), ... (+16 more)
Privacy-preserving feature extraction	Raw audio waveforms excluded from public releases v1.0 and v1.1 to protect participant privacy and p...		Raw audio omitted from v1.0 and v1.1, Only derived features publicly released, ... (+2 more)
Transcript privacy protection	Free speech transcripts removed from all public releases to prevent disclosure of potentially identi...		Free speech transcripts removed, Task-based transcriptions retained (non-identifying prompts only), Reduces re-identification risk from unique speech patterns
Quality assurance	Data quality control procedures including acoustic quality validation, outlier detection, completene...			Acoustic quality thresholds, Outlier detection and flagging, ... (+3 more)
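
The transcript rule above (free speech transcripts removed, task-based transcriptions retained) can be sketched as a simple filter. The task names and the `transcript` field below are hypothetical; the source does not list which tasks count as free speech, so treat this as an illustration of the rule rather than the project's actual procedure.

```python
# Hypothetical task names; the real task list is not given in the source.
FREE_SPEECH_TASKS = {"free_speech", "open_response"}


def redact_free_speech(records, free_speech_tasks=FREE_SPEECH_TASKS):
    """Return copies of records with free-speech transcripts blanked out."""
    cleaned = []
    for rec in records:
        out = dict(rec)  # copy so the input is not modified
        if out["task_name"] in free_speech_tasks:
            out["transcript"] = None
        cleaned.append(out)
    return cleaned


# Invented example records.
records = [
    {"task_name": "reading_passage", "transcript": "scripted prompt text"},
    {"task_name": "free_speech", "transcript": "something personal"},
]
cleaned = redact_free_speech(records)
```

Blanking the field instead of dropping the row keeps the recording's other features available while removing the text that could identify a participant.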

Maintainers

Grantor	Grant Name	Grant Number
-	-	-
-	-	-
-	-	-
-	-	-

Governance Model

Role	Name	ORCID	Affiliation
Contributor		-	-
Contributor		-	-

Third Party Restrictions

Role	ORCID	Affiliation
Contributor	-	-
Contributor	-	-
Contributor	-	-

Citation Requirements

Citation Type	Description	Format	Policy
Dataset citation	Primary dataset citation required for all publications, presentations, and other uses of data. Shoul...	Johnson, A., Bélisle-Pipon, J., Dorr, D., Ghosh, S., Payne, P., Powell, M., Rameau, A., Ravitsky, V....
Platform citation	PhysioNet platform citation required as standard acknowledgment of infrastructure supporting data di...	Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. ...
Attribution requirement	Recipient agrees to recognize contribution of Provider as source of data in all written, visual, or ...		Provider recognition required in all public disclosures
Publication encouragement	Recipients encouraged to make results publicly available in open-access journals or pre-print server...		Open-access publication encouraged

Project Specific Aims

Aim Number	Aim Title	Description
1	Data Acquisition Module	To build a multi-modal, multi-institutional, large scale, diverse and ethically sourced human voice ...
2	Standard Module	To introduce the field of acoustic biomarkers by developing new standards of acoustic and voice data...
3	Tool Development and Optimization	To develop a software and cloud infrastructure for automated voice data collection through a smartph...
4	Ethics Module	To integrate existing scholarship, tools, and guidance with development of new standard and normativ...
5	Teaming Module	To build bridges between the medical voice research world, the acoustic engineers, and the AI/ML wor...
6	Skills and Workforce Development Module	To develop a unique curriculum on voice biomarkers of health and the development, validation, and im...

Related Publications

Citation
Rameau, A., Ghosh, S., Sigaras, A., Elemento, O., Belisle-Pipon, J.-C., Ravitsky, V., Powell, M., Jo...
Bensoussan, Y., Ghosh, S. S., Rameau, A., Boyer, M., Bahr, R., Watts, S., Rudzicz, F., Bolser, D., L...
Sigaras, A., Zisimopoulos, P., Tang, J., Bevers, I., Gallois, H., Bernier, A., Bensoussan, Y., Ghosh...
Johnson, A., Bélisle-Pipon, J., Dorr, D., Ghosh, S., Payne, P., Powell, M., Rameau, A., Ravitsky, V....

Software And Tools

Description	DOI	License	Name	Reference	Topics	URL	Version
Open source library for preprocessing raw audio waveforms and merging source data into phenotype fil...		Open source	b2aiprep			https://github.com/sensein/b2aiprep	0.21.0
Custom REDCap configuration for voice data collection	https://doi.org/10.5281/zenodo.14148755		Bridge2AI Voice REDCap				v3.20.0
Documentation dashboard and project documentation	https://zenodo.org/doi/10.5281/zenodo.13834653	MIT License	bridge2ai-docs		ai, bridge2ai, ... (+3 more)	https://github.com/eipm/bridge2ai-docs	2.0.5
The Munich Versatile and Fast Open-Source Audio Feature Extractor			OpenSMILE	Florian Eyben, Martin Wöllmer, Björn Schuller: "openSMILE - The Munich Versatile and Fast Open-Sourc...		https://audeering.github.io/opensmile/
Phonetic analysis software			Praat	Boersma P, Van Heuven V. Speak and unSpeak with PRAAT. Glot International. 2001 Nov;5(9/10):341-7.		http://www.praat.org/
Python interface to Praat for phonetic analysis			Parselmouth	Jadoul Y, Thompson B, De Boer B. Introducing parselmouth: A python interface to praat. Journal of Ph...		https://github.com/YannickJadoul/Parselmouth
Audio processing library for PyTorch			TorchAudio	Yang, Y.-Y., Hira, M., Ni, Z., Chourdia, A., Astafurov, A., Chen, C., Yeh, C.-F., Puhrsch, C., Polla...		https://github.com/pytorch/audio
Automatic speech recognition model (Large variant)			OpenAI Whisper			https://github.com/openai/whisper

Uses

What (other) tasks could the dataset be used for?

Tasks

Description	Target Applications	Target Populations	Task Type
Enable AI/ML research for disease screening, diagnosis, and treatment monitoring across five disease...		Adults with voice disorders, Adults with neurological/neurodegenerative conditions, ... (+3 more)	AI/ML model development
Discovery and validation of novel acoustic biomarkers associated with health conditions, expanding b...	Voice changes in depression (decreased fundamental frequency, monotonous speech), Voice changes in anxiety (increased fundamental frequency), ... (+3 more)		Biomarker discovery
Development of clinical decision support tools integrating voice biomarkers into healthcare workflow...	Point-of-care voice screening tools, Remote patient monitoring using voice, ... (+2 more)		Clinical application
Multi-modal biomarker research integrating voice with EHR, radiomics, genomics, and other data sourc...	Voice + EHR integration for diagnosis validation, Voice + genomics for personalized medicine, ... (+2 more)		Multi-modal integration

Existing Uses

Description
Dataset publicly released through PhysioNet and Health Data Nexus for voice AI research community access under registered access license. Initial research outputs include protocol development publication and open-source software tools.
Publication References
- Rameau A, et al. (2024) Developing Multi-Disorder Voice Protocols: A team science approach involving clinical expertise, bioethics, standards, and DEI. Proc. Interspeech 2024, 1445-1449, doi:10.21437/Interspeech.2024-1926
- Bensoussan Y, et al. (2024) Bridge2AI Voice REDCap (v3.20.0). Zenodo, doi:10.5281/zenodo.14148755
- Sigaras A, et al. (2024) eipm/bridge2ai-docs. Zenodo, doi:10.5281/zenodo.13834653
- Johnson A, et al. (2024) Bridge2AI-Voice v1.0. Health Data Nexus, doi:10.57764/qb6h-em84

Future Use Impacts

Description	Impact Type	Potential Benefits	Potential Harms
Voice biomarker discovery for disease screening and diagnosis may enable earlier detection, non-inva...	Clinical decision support	Earlier disease detection through voice screening, Non-invasive monitoring tools, ... (+3 more)	False positive results causing unnecessary anxiety and interventions, False negative results delaying diagnosis and treatment, ... (+3 more)
Multi-modal AI model development integrating voice with EHR, genomics, and imaging data may provide ...	Multi-modal integration	Comprehensive patient phenotyping, Improved diagnostic accuracy through data fusion, ... (+2 more)	Increased re-identification risk from linked data, Privacy concerns about comprehensive patient profiles, ... (+2 more)
Federated learning applications may enable privacy-preserving collaborative research across institut...	Privacy-preserving collaboration	Multi-institutional model training without data sharing, Preservation of patient privacy, ... (+2 more)	Model inversion attacks extracting training data, Gradient leakage revealing patient information, ... (+2 more)
Commercial voice AI applications (e.g., smartphone-based screening) may increase accessibility but r...	Commercial applications	Consumer-accessible health monitoring, Scalable screening tools, ... (+2 more)	Data exploitation for profit, Biometric surveillance concerns, ... (+3 more)

Intended Uses

Description	Use Case
Development and validation of AI/ML models for voice-based disease screening, diagnosis, and monitor...	AI/ML model development
Discovery and validation of novel acoustic biomarkers associated with health conditions not previous...	Biomarker discovery
Development of clinical decision support tools integrating voice biomarkers into healthcare workflow...	Clinical decision support
Multi-modal biomarker research integrating voice with EHR, radiomics, genomics, and other data sourc...	Multi-modal data integration
Federated learning applications for privacy-preserving collaborative research across institutions, e...	Federated learning research
Development of standards, best practices, and quality measures for acoustic and voice data collectio...	Standards development
Education and training of interdisciplinary researchers in voice biomarkers, AI/ML methods, and ethi...	Workforce development

Discouraged Uses

Role	ORCID	Affiliation
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-
Contributor	-	-

License And Use Terms

Description

All data access is governed by the Bridge2AI Voice Registered Access License and requires a Data Transfer and Use Agreement (DTUA). Registered users must sign the DTUA and obtain approval from the Data Access Compliance Office (DACO) before accessing files. Recipients must establish administrative, technical, and physical safeguards to protect Personally Identifiable Information (PII) per OMB M-07-16 and ensure that only authorized persons access the data. Data are provided "AS IS" without warranties of any kind, and Recipients assume all liability for use, storage, disclosure, or disposal. Unauthorized disclosure to third parties is prohibited; collaborators must apply independently. Attribution is required, citing both the dataset DOI and the PhysioNet platform. Commercial use is allowed under the DTUA terms. Recipients may retain derivative works with proper attribution and may publish results (open-access publication is encouraged). The use period runs for two years from the DTUA start date, or until completion of the project, expiration of ethics approval, or termination, whichever occurs first; it is renewable with Provider approval. One archival copy is allowed for records-retention compliance. The Provider Institution (University of South Florida) may unilaterally amend the agreement if the Federal sponsor requires it; a Recipient may object, which results in immediate termination. Certificate of Confidentiality protections apply and must be asserted against compulsory legal demands. The DTUA is approved for use through August 31, 2025.

License Name

Bridge2AI Voice Registered Access License

Agreement Required

Data Transfer and Use Agreement (DTUA)

Approval Authority

Data Access Compliance Office (DACO)

Provider Institution

University of South Florida Board of Trustees

Effective Through

August 31, 2025

Key Terms

Registered access with DACO approval required
Data classified as Personally Identifiable Information (PII, OMB M-07-16)
Administrative, technical, physical safeguards required
Certificate of Confidentiality protections (must assert against legal demands)
Data provided "AS IS" without warranties
Recipients assume liability for use
No unauthorized third-party disclosure
Collaborators apply independently
Attribution requirements (dataset DOI + PhysioNet)
Commercial use allowed
Open-access publication encouraged
Two-year use period (renewable)
Archival copy allowed for records retention
Provider may amend if Federal sponsor requires
Termination results in data destruction (certification required)


Anticipated Changes	Changes	Description	DOI	Platform	Release Date	Status	Version
	Initial public release, 12,523 recordings, 306 participants, ... (+6 more)	Initial release of Bridge2AI-Voice dataset with 12,523 recordings from 306 participants across five ...	https://doi.org/10.57764/qb6h-em84	Health Data Nexus	January 2024		1.0
	Added mfcc.parquet file, 60 MFCC coefficients (60xN dimension), Derived from existing spectrograms	Added Mel-frequency cepstral coefficients (MFCCs) with 60 coefficients per recording, providing addi...	https://doi.org/10.13026/249v-w155	PhysioNet	January 17, 2025		1.1
Additional participants, Expanded disease cohorts, ... (+2 more)		Planned future release with additional participants, enhanced features, and expanded cohorts. Detail...			April 16, 2025 (planned)	Planned	2.0.0
Bug fixes and corrections, Documentation improvements, Minor feature enhancements		Planned maintenance release with bug fixes, documentation updates, and minor enhancements. Currently...			August 18, 2025 (planned)	Planned (latest version)	2.0.1