Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information

Grantor	Grant Name	Grant Number
National Institutes of Health	Bridge2AI: Voice as a Biomarker of Health - Building an ethically sourced, bioaccoustic database to understand disease like never before	3OT2OD032720-01S1

Dataset Resource

ID *

Title

Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information

Description

Bridge2AI-Voice v2.0 contains data for 19,271 recordings collected from 442 participants across five sites in North America. Participants were selected based on known conditions that manifest in the voice waveform, including voice disorders, neurological disorders, mood disorders, and respiratory disorders. The release contains only data considered low risk: derived data such as spectrograms are included, but the original voice recordings are not. Detailed demographic, clinical, and validated questionnaire data are also available.

Creators

Role	Name	ORCID	Affiliation
Principal Investigator	Yael Bensoussan	Yael Bensoussan	-
Principal Investigator	Olivier Elemento	Olivier Elemento	-
Principal Investigator	Satrajit Ghosh	Satrajit Ghosh	-
Author	Alexandros Sigaras	Alexandros Sigaras	-
Author	Anais Rameau	Anais Rameau	-
Author	Maria Powell	Maria Powell	-
Author	David Dorr	David Dorr	-
Author	Philip Payne	Philip Payne	-
Author	Vardit Ravitsky	Vardit Ravitsky	-
Author	Jean-Christophe Bélisle-Pipon	Jean-Christophe Bélisle-Pipon	-
Author	Alistair Johnson	Alistair Johnson	-
Author	Ruth Bahr	Ruth Bahr	-
Author	Stephanie Watts	Stephanie Watts	-
Author	Donald Bolser	Donald Bolser	-
Author	Jennifer Siu	Jennifer Siu	-
Author	Jordan Lerner-Ellis	Jordan Lerner-Ellis	-
Author	Frank Rudzicz	Frank Rudzicz	-
Author	Micah Boyer	Micah Boyer	-
Author	Samantha Salvi Cruz	Samantha Salvi Cruz	-
Author	Yassmeen Abdel-Aty	Yassmeen Abdel-Aty	-
Author	Toufeeq Ahmed Syed	Toufeeq Ahmed Syed	-
Author	James Anibal	James Anibal	-
Author	Stephen Aradi	Stephen Aradi	-
Author	Ana Sophia Martinez	Ana Sophia Martinez	-
Author	Shaheen Awan	Shaheen Awan	-
Author	Steven Bedrick	Steven Bedrick	-
Author	Isaac Bevers	Isaac Bevers	-
Author	Rahul Brito	Rahul Brito	-
Author	Selina Casalino	Selina Casalino	-
Author	John Costello	John Costello	-
Author	Iris De Santiago	Iris De Santiago	-
Author	Enrique Diaz-Ocampo	Enrique Diaz-Ocampo	-
Author	Mohamed Ebraheem	Mohamed Ebraheem	-
Author	Ellie Eiseman	Ellie Eiseman	-
Author	Mahmoud Elmahdy	Mahmoud Elmahdy	-
Author	Emily Evangelista	Emily Evangelista	-
Author	Kenneth Fletcher	Kenneth Fletcher	-
Author	Alexander Gelbard	Alexander Gelbard	-
Author	Anna Goldenberg	Anna Goldenberg	-
Author	Karim Hanna	Karim Hanna	-
Author	William Hersh	William Hersh	-
Author	Lochana Jayachandran	Lochana Jayachandran	-
Author	Kaley Jenney	Kaley Jenney	-
Author	Kathy Jenkins	Kathy Jenkins	-
Author	Stacy Jo	Stacy Jo	-
Author	Ayush Kalia	Ayush Kalia	-
Author	Andrea Krussel	Andrea Krussel	-
Author	Elisa Lapadula	Elisa Lapadula	-
Author	Chloe Loewith	Chloe Loewith	-
Author	Radhika Mahajan	Radhika Mahajan	-
Author	Vrishni Maharaj	Vrishni Maharaj	-
Author	Siyu Miao	Siyu Miao	-
Author	Matthew Mifsud	Matthew Mifsud	-
Author	Marian Mikhael	Marian Mikhael	-
Author	Elijah Moothedan	Elijah Moothedan	-
Author	Yosef Nafii	Yosef Nafii	-
Author	Tempestt Neal	Tempestt Neal	-
Author	Karlee Newberry	Karlee Newberry	-
Author	Evan Ng	Evan Ng	-
Author	Christopher Nickel	Christopher Nickel	-
Author	Trevor Pharr	Trevor Pharr	-
Author	Claire Premi-Bortolotto	Claire Premi-Bortolotto	-
Author	JM Rahman	JM Rahman	-
Author	Sarah Rohde	Sarah Rohde	-
Author	Laurie Russell	Laurie Russell	-
Author	Suketu Shah	Suketu Shah	-
Author	Ahmed Shawkat	Ahmed Shawkat	-
Author	Elizabeth Silberholz	Elizabeth Silberholz	-
Author	Duncan Sutherland	Duncan Sutherland	-
Author	Venkata Swarna Mukhi	Venkata Swarna Mukhi	-
Author	Jeffrey Tang	Jeffrey Tang	-
Author	Jamie Toghranegar	Jamie Toghranegar	-
Author	Kimberly Vinson	Kimberly Vinson	-
Author	Claire Wilson	Claire Wilson	-
Author	Madeleine Zanin	Madeleine Zanin	-
Author	Xijie Zeng	Xijie Zeng	-
Author	Theresa Zesiewicz	Theresa Zesiewicz	-
Author	Robin Zhao	Robin Zhao	-
Author	Pantelis Zisimopoulos	Pantelis Zisimopoulos	-

Issued

2025-04-16

Keywords

voice
bridge2ai

Collection Timeframes

Description

Conforms To

Preprocessing Strategies

Description
Raw audio was preprocessed by converting it to monaural (single-channel) audio and resampling it to 16 kHz with a Butterworth anti-aliasing filter. Derived data include spectrograms (short-time FFT), 60 Mel-frequency cepstral coefficients (MFCCs), acoustic features extracted using openSMILE, phonetic and prosodic features computed using Parselmouth and Praat, and transcriptions generated using OpenAI's Whisper Large model.
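
The mono conversion and resampling step can be illustrated with a minimal, standard-library-only sketch. It is not the project's actual pipeline: the 48 kHz source rate, the second-order filter, the cutoff at 0.45 times the target rate, and all function names are assumptions made for illustration, since the description above does not state the filter order or cutoff.

```python
import math


def to_mono(frames):
    """Average the channels of each frame (a tuple of per-channel samples)."""
    return [sum(frame) / len(frame) for frame in frames]


def butterworth_lowpass(samples, cutoff_hz, rate_hz):
    """Apply a 2nd-order Butterworth low-pass filter (bilinear-transform biquad)."""
    w0 = 2.0 * math.pi * cutoff_hz / rate_hz
    q = 1.0 / math.sqrt(2.0)  # Q = 1/sqrt(2) yields the Butterworth response
    alpha = math.sin(w0) / (2.0 * q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    # Coefficients normalized by a0.
    b0 = (1.0 - cos_w0) / 2.0 / a0
    b1 = (1.0 - cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def resample_to_16k(frames, source_rate_hz, target_rate_hz=16000):
    """Downmix to mono, low-pass filter, then keep every Nth sample."""
    if source_rate_hz % target_rate_hz != 0:
        raise ValueError("this sketch only handles integer decimation factors")
    factor = source_rate_hz // target_rate_hz
    mono = to_mono(frames)
    if factor == 1:
        return mono
    # Assumed cutoff: slightly below the new Nyquist frequency (8 kHz).
    filtered = butterworth_lowpass(mono, 0.45 * target_rate_hz, source_rate_hz)
    return filtered[::factor]


# Usage: one tenth of a second of constant stereo audio at an assumed 48 kHz.
stereo = [(0.5, 0.5)] * 4800
mono_16k = resample_to_16k(stereo, 48000)
```

A production pipeline would more likely use a higher-order filter and a polyphase resampler from an audio or signal-processing library; the sketch only shows the order of operations (downmix, filter, decimate).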

Cleaning Strategies

Description
HIPAA Safe Harbor identifiers were removed. State and province were removed. Country of data collection was retained. Audio records with sensitive information were removed, but their static features were retained. In this release, audio waveforms are omitted.
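
As a rough illustration of the location-related part of this step only, the sketch below drops state and province fields from a record while keeping the country. The field names and the example record are hypothetical; the dataset's real column names may differ, and full Safe Harbor de-identification covers many more identifier types than shown here.

```python
# Hypothetical field names; the dataset's real column names may differ.
LOCATION_FIELDS_TO_DROP = {"state", "province"}


def drop_location_fields(record):
    """Return a copy of a participant record without state/province fields."""
    return {
        key: value
        for key, value in record.items()
        if key not in LOCATION_FIELDS_TO_DROP
    }


# Usage with a made-up record: the country is retained, the state is removed.
record = {"participant_id": "p001", "country": "USA", "state": "FL"}
cleaned = drop_location_fields(record)
```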

Is Deidentified

Identifiable Elements Present

False

Description

HIPAA Safe Harbor identifiers were removed. State and province were removed. Audio records with sensitive information were removed. Audio waveforms are omitted from this release.

Sensitive Elements

Sensitive Elements Present
True
Description
- The dataset contains data linked to health conditions including voice disorders, neurological disorders, mood disorders, and respiratory disorders. The raw voice recordings are considered sensitive and are not included in this release but are available in a companion release.

Acquisition Methods

Description
Data were collected from consenting patients at specialty clinics and institutions. A standardized protocol involved collecting demographic information, health questionnaires, and voice recording tasks.
Was Directly Observed
Was Reported By Subjects
True
Was Inferred Derived
True
Was Validated Verified

Compression

Conforms To Class

Conforms To Schema

Created By

Created On

Language

Modified By

Page

Publisher

PhysioNet

Status

Was Derived From

Dialect

Encoding

Format

Hash

Md5

Media Type

Path

Sha256

Addressing Gaps

Response
For voice to emerge as a biomarker of health, there is a pressing need for a large, high-quality, multi-institutional, and diverse voice database linked to other health biomarkers from multiple data modalities, to fuel voice AI research and answer tangible clinical questions.

Subsets

Anomalies

External Resources

External Resources
- Audio recordings are included in a companion release on PhysioNet titled 'Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information (Audio Included)'.
Future Guarantees
Archival
Restrictions

Confidential Elements

Content Warnings

Collection Mechanisms

Description
Data were collected using a custom tablet application, with a headset used for recording when possible. Data were exported from REDCap and converted using the open-source b2aiprep library.

Sampling Strategies

Description
Patients presenting at specialty clinics and institutions were considered for enrollment. Patients were selected based on membership in five predetermined groups (respiratory disorders, voice disorders, neurological disorders, mood disorders, pediatric). This is a purposive sampling strategy.

Data Collectors

Ethical Reviews

Description
Data collection and sharing were approved by the University of South Florida Institutional Review Board.

Data Protection Impacts

Labeling Strategies

Raw Sources

Description
Yes, the raw audio data is available in a companion release on PhysioNet.

Ip Restrictions

Regulatory Restrictions

Maintainers

Errata

Retention Limit

Extension Mechanism

Is Tabular

Counts	Data Substrate	Data Topic	Instance Type	Label	Label Description	Missing Information	Sampling Strategies
442			participants
19271			recordings
