Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information
Datasheet for Dataset - Human Readable Format
🎯
Motivation
Why was the dataset created?
Dataset Resource
Grantor: National Institutes of Health
Grant Name: Bridge2AI: Voice as a Biomarker of Health - Building an ethically sourced, bioaccoustic database to understand disease like never before
Grant Number: 3OT2OD032720-01S1
Dataset Resource
Response
The Bridge2AI-Voice project seeks to create an ethically sourced flagship dataset to enable future research in artificial intelligence and support critical insights into the use of voice as a biomarker of health.
📊
Composition
What do the instances represent?
Dataset Resource
Counts
Data Substrate
Data Topic
Instance Type
Label
Label Description
Missing Information
Sampling Strategies
442 participants
19,271 recordings
Dataset Resource
Subpopulation Elements Present: True
Identification: Disease cohort categories
Distribution: Voice Disorders, Neurological and Neurodegenerative Disorders, Mood and Psychiatric Disorders, Respiratory Disorders. Note: The v2.0.0 dataset does not contain pediatric data and does not contain an equal distribution across categories of diseases.
Dataset Resource
Bridge2AI-Voice v2.0 contains data for 19,271 recordings collected from 442 participants across five sites in North America. Participants were selected based on conditions known to manifest in the voice waveform, including voice disorders, neurological disorders, mood disorders, and respiratory disorders. The release contains only data considered low risk: derivations such as spectrograms are included, but the original voice recordings are not. Detailed demographic, clinical, and validated questionnaire data are also made available.
Dataset Resource
Principal Investigators: Yael Bensoussan, Olivier Elemento, Satrajit Ghosh
Authors: Alexandros Sigaras, Anais Rameau, Maria Powell, David Dorr, Philip Payne, Vardit Ravitsky, Jean-Christophe Bélisle-Pipon, Alistair Johnson, Ruth Bahr, Stephanie Watts, Donald Bolser, Jennifer Siu, Jordan Lerner-Ellis, Frank Rudzicz, Micah Boyer, Samantha Salvi Cruz, Yassmeen Abdel-Aty, Toufeeq Ahmed Syed, James Anibal, Stephen Aradi, Ana Sophia Martinez, Shaheen Awan, Steven Bedrick, Isaac Bevers, Rahul Brito, Selina Casalino, John Costello, Iris De Santiago, Enrique Diaz-Ocampo, Mohamed Ebraheem, Ellie Eiseman, Mahmoud Elmahdy, Emily Evangelista, Kenneth Fletcher, Alexander Gelbard, Anna Goldenberg, Karim Hanna, William Hersh, Lochana Jayachandran, Kaley Jenney, Kathy Jenkins, Stacy Jo, Ayush Kalia, Andrea Krussel, Elisa Lapadula, Chloe Loewith, Radhika Mahajan, Vrishni Maharaj, Siyu Miao, Matthew Mifsud, Marian Mikhael, Elijah Moothedan, Yosef Nafii, Tempestt Neal, Karlee Newberry, Evan Ng, Christopher Nickel, Trevor Pharr, Claire Premi-Bortolotto, JM Rahman, Sarah Rohde, Laurie Russell, Suketu Shah, Ahmed Shawkat, Elizabeth Silberholz, Duncan Sutherland, Venkata Swarna Mukhi, Jeffrey Tang, Jamie Toghranegar, Kimberly Vinson, Claire Wilson, Madeleine Zanin, Xijie Zeng, Theresa Zesiewicz, Robin Zhao, Pantelis Zisimopoulos
Dataset Resource
2025-04-16
Dataset Resource
voice
bridge2ai
Dataset Resource
Description
Raw audio was preprocessed by converting to monaural and resampling to 16 kHz with a Butterworth anti-aliasing filter. Derived data include spectrograms (short-time FFT), 60 Mel-frequency cepstral coefficients (MFCCs), acoustic features extracted using OpenSMILE, phonetic and prosodic features computed using Parselmouth and Praat, and transcriptions generated using OpenAI's Whisper Large model.
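The mono conversion and anti-aliased resampling described above can be sketched roughly as follows. This is a minimal standard-library illustration, not the project's actual pipeline: the function names are hypothetical, the filter order (second) and the cutoff (0.45 times the target rate) are assumptions, and only integer decimation factors such as 48 kHz to 16 kHz are handled.

```python
import math


def to_mono(frames):
    """Average the channels of each frame (a tuple of per-channel samples)."""
    return [sum(frame) / len(frame) for frame in frames]


def butterworth_lowpass(cutoff_hz, sample_rate_hz):
    """Coefficients of a second-order Butterworth low-pass biquad.

    Uses the bilinear-transform design with Q = 1/sqrt(2). The filter order
    is an assumption; the datasheet does not state it.
    """
    w0 = 2.0 * math.pi * cutoff_hz / sample_rate_hz
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2 * Q)
    cos_w0 = math.cos(w0)
    a0 = 1.0 + alpha
    b = [(1.0 - cos_w0) / (2.0 * a0), (1.0 - cos_w0) / a0, (1.0 - cos_w0) / (2.0 * a0)]
    a = [1.0, -2.0 * cos_w0 / a0, (1.0 - alpha) / a0]
    return b, a


def apply_biquad(b, a, samples):
    """Run a direct-form I biquad (a[0] == 1) over a list of samples."""
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in samples:
        y0 = b[0] * x0 + b[1] * x1 + b[2] * x2 - a[1] * y1 - a[2] * y2
        x2, x1 = x1, x0
        y2, y1 = y1, y0
        out.append(y0)
    return out


def resample_to_16k(frames, source_rate_hz, target_rate_hz=16000):
    """Convert multi-channel frames to mono and decimate to the target rate."""
    if source_rate_hz % target_rate_hz != 0:
        raise ValueError("this sketch only handles integer decimation factors")
    factor = source_rate_hz // target_rate_hz
    mono = to_mono(frames)
    if factor == 1:
        return mono
    # Low-pass below the new Nyquist frequency before discarding samples.
    b, a = butterworth_lowpass(0.45 * target_rate_hz, source_rate_hz)
    return apply_biquad(b, a, mono)[::factor]
```

For example, `resample_to_16k(frames, 48000)` keeps every third filtered sample. A production pipeline would more likely use a higher-order filter or a polyphase resampler from an established signal-processing library.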
Dataset Resource
Description
HIPAA Safe Harbor identifiers were removed. State and province were removed. Country of data collection was retained. Audio records with sensitive information were removed, but their static features were retained. In this release, audio waveforms are omitted.
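As a rough illustration of the de-identification rules listed above, the sketch below applies them to a single record represented as a dictionary. Every field name here (`state_province`, `country`, `waveform`, `contains_sensitive_speech`, `time_varying_features`, `static_features`) is a hypothetical placeholder rather than the release's actual schema, and treating "audio records" as the time-varying derivatives of a flagged recording is only one plausible reading of the description.

```python
# Hypothetical field names; the real schema may differ.
LOCATION_FIELDS_REMOVED = {"state_province"}


def deidentify(record):
    """Return a de-identified copy of one recording's metadata and features."""
    cleaned = {
        key: value
        for key, value in record.items()
        if key not in LOCATION_FIELDS_REMOVED
    }
    # Audio waveforms are omitted from this release for every record.
    cleaned.pop("waveform", None)
    # Assumption: for recordings flagged as containing sensitive information,
    # time-varying derivatives are dropped while static features are kept.
    if cleaned.get("contains_sensitive_speech"):
        cleaned.pop("time_varying_features", None)
    return cleaned
```

The function builds a new dictionary instead of mutating its input, so the original record stays available to whichever process holds the restricted data.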
Dataset Resource
Identifiable Elements Present: False
Description
HIPAA Safe Harbor identifiers were removed. State and province were removed. Audio records with sensitive information were removed. Audio waveforms are omitted from this release.
Dataset Resource
Sensitive Elements Present: True
Description
The dataset contains data linked to health conditions including voice disorders, neurological disorders, mood disorders, and respiratory disorders. The raw voice recordings are considered sensitive and are not included in this release but are available in a companion release.
Dataset Resource
Description
Data were collected, with consent, from patients at specialty clinics and institutions. A standardized protocol covered demographic information, health questionnaires, and voice recording tasks.
Was Directly Observed:
Was Reported By Subjects: True
Was Inferred Derived: True
Was Validated Verified:
Dataset Resource
PhysioNet
Dataset Resource
Response
For voice to emerge as a biomarker of health, there is a pressing need for a large, high-quality, multi-institutional, and diverse voice database, linked to other health biomarkers from data of different modalities, to fuel voice AI research and answer tangible clinical questions.
Dataset Resource
External Resources
Audio recordings are included in a companion release on PhysioNet with the title 'Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information (Audio Included)'.
Future Guarantees
Archival
Restrictions
Dataset Resource
Description
Data were collected using a custom tablet application, with a headset used for recording when possible. Data were exported from REDCap and converted using the b2aiprep open-source library.
Dataset Resource
Description
Patients presenting at specialty clinics and institutions were considered for enrollment. Patients were selected based on membership in five predetermined groups (respiratory disorders, voice disorders, neurological disorders, mood disorders, pediatric). This is a purposive sampling strategy.
Dataset Resource
Description
Data collection and sharing were approved by the University of South Florida Institutional Review Board.
Dataset Resource
Description
Yes, the raw audio data is available in a companion release on PhysioNet.
🚀
Uses
What (other) tasks could the dataset be used for?
Dataset Resource
Description
Access is restricted to credentialed users who sign the 'Bridge2AI Voice Registered Access Agreement' Data Use Agreement. The license is the 'Bridge2AI Voice Registered Access License'.
Previous versions are available. v1.1 was released Jan. 17, 2025. v1.0 is available at https://doi.org/10.57764/qb6h-em84.
🔄
Maintenance
How will the dataset be maintained?
Dataset Resource
2.0.0
Dataset Resource
Description
b2ai-voice v2.0: This release adds data for 136 new participants. Spectrograms were reprocessed. All spectrograms and Mel-frequency cepstral coefficients (MFCCs) derived from free-speech files have been removed.
b2ai-voice v1.1: This release added MFCCs.
b2ai-voice v1.0: This was the first release.
Generated on 2025-10-30 10:38:41 using Bridge2AI Data Sheets Schema