creators: &id001
- affiliation: ''
principal_investigator:
id: Yael Bensoussan
name: Yael Bensoussan
- affiliation: ''
principal_investigator:
id: Olivier Elemento
name: Olivier Elemento
- affiliation: ''
principal_investigator:
id: Satrajit Ghosh
name: Satrajit Ghosh
- affiliation: ''
author:
id: Alexandros Sigaras
name: Alexandros Sigaras
- affiliation: ''
author:
id: Anais Rameau
name: Anais Rameau
- affiliation: ''
author:
id: Maria Powell
name: Maria Powell
- affiliation: ''
author:
id: David Dorr
name: David Dorr
- affiliation: ''
author:
id: Philip Payne
name: Philip Payne
- affiliation: ''
author:
id: Vardit Ravitsky
name: Vardit Ravitsky
- affiliation: ''
author:
id: "Jean-Christophe B\xE9lisle-Pipon"
name: "Jean-Christophe B\xE9lisle-Pipon"
- affiliation: ''
author:
id: Alistair Johnson
name: Alistair Johnson
- affiliation: ''
author:
id: Ruth Bahr
name: Ruth Bahr
- affiliation: ''
author:
id: Stephanie Watts
name: Stephanie Watts
- affiliation: ''
author:
id: Donald Bolser
name: Donald Bolser
- affiliation: ''
author:
id: Jennifer Siu
name: Jennifer Siu
- affiliation: ''
author:
id: Jordan Lerner-Ellis
name: Jordan Lerner-Ellis
- affiliation: ''
author:
id: Frank Rudzicz
name: Frank Rudzicz
- affiliation: ''
author:
id: Micah Boyer
name: Micah Boyer
- affiliation: ''
author:
id: Samantha Salvi Cruz
name: Samantha Salvi Cruz
- affiliation: ''
author:
id: Yassmeen Abdel-Aty
name: Yassmeen Abdel-Aty
- affiliation: ''
author:
id: Toufeeq Ahmed Syed
name: Toufeeq Ahmed Syed
- affiliation: ''
author:
id: James Anibal
name: James Anibal
- affiliation: ''
author:
id: Stephen Aradi
name: Stephen Aradi
- affiliation: ''
author:
id: Ana Sophia Martinez
name: Ana Sophia Martinez
- affiliation: ''
author:
id: Shaheen Awan
name: Shaheen Awan
- affiliation: ''
author:
id: Steven Bedrick
name: Steven Bedrick
- affiliation: ''
author:
id: Isaac Bevers
name: Isaac Bevers
- affiliation: ''
author:
id: Rahul Brito
name: Rahul Brito
- affiliation: ''
author:
id: Selina Casalino
name: Selina Casalino
- affiliation: ''
author:
id: John Costello
name: John Costello
- affiliation: ''
author:
id: Iris De Santiago
name: Iris De Santiago
- affiliation: ''
author:
id: Enrique Diaz-Ocampo
name: Enrique Diaz-Ocampo
- affiliation: ''
author:
id: Mohamed Ebraheem
name: Mohamed Ebraheem
- affiliation: ''
author:
id: Ellie Eiseman
name: Ellie Eiseman
- affiliation: ''
author:
id: Mahmoud Elmahdy
name: Mahmoud Elmahdy
- affiliation: ''
author:
id: Emily Evangelista
name: Emily Evangelista
- affiliation: ''
author:
id: Kenneth Fletcher
name: Kenneth Fletcher
- affiliation: ''
author:
id: Alexander Gelbard
name: Alexander Gelbard
- affiliation: ''
author:
id: Anna Goldenberg
name: Anna Goldenberg
- affiliation: ''
author:
id: Karim Hanna
name: Karim Hanna
- affiliation: ''
author:
id: William Hersh
name: William Hersh
- affiliation: ''
author:
id: Lochana Jayachandran
name: Lochana Jayachandran
- affiliation: ''
author:
id: Kaley Jenney
name: Kaley Jenney
- affiliation: ''
author:
id: Kathy Jenkins
name: Kathy Jenkins
- affiliation: ''
author:
id: Stacy Jo
name: Stacy Jo
- affiliation: ''
author:
id: Ayush Kalia
name: Ayush Kalia
- affiliation: ''
author:
id: Andrea Krussel
name: Andrea Krussel
- affiliation: ''
author:
id: Elisa Lapadula
name: Elisa Lapadula
- affiliation: ''
author:
id: Chloe Loewith
name: Chloe Loewith
- affiliation: ''
author:
id: Radhika Mahajan
name: Radhika Mahajan
- affiliation: ''
author:
id: Vrishni Maharaj
name: Vrishni Maharaj
- affiliation: ''
author:
id: Siyu Miao
name: Siyu Miao
- affiliation: ''
author:
id: Matthew Mifsud
name: Matthew Mifsud
- affiliation: ''
author:
id: Marian Mikhael
name: Marian Mikhael
- affiliation: ''
author:
id: Elijah Moothedan
name: Elijah Moothedan
- affiliation: ''
author:
id: Yosef Nafii
name: Yosef Nafii
- affiliation: ''
author:
id: Tempestt Neal
name: Tempestt Neal
- affiliation: ''
author:
id: Karlee Newberry
name: Karlee Newberry
- affiliation: ''
author:
id: Evan Ng
name: Evan Ng
- affiliation: ''
author:
id: Christopher Nickel
name: Christopher Nickel
- affiliation: ''
author:
id: Trevor Pharr
name: Trevor Pharr
- affiliation: ''
author:
id: Claire Premi-Bortolotto
name: Claire Premi-Bortolotto
- affiliation: ''
author:
id: JM Rahman
name: JM Rahman
- affiliation: ''
author:
id: Sarah Rohde
name: Sarah Rohde
- affiliation: ''
author:
id: Laurie Russell
name: Laurie Russell
- affiliation: ''
author:
id: Suketu Shah
name: Suketu Shah
- affiliation: ''
author:
id: Ahmed Shawkat
name: Ahmed Shawkat
- affiliation: ''
author:
id: Elizabeth Silberholz
name: Elizabeth Silberholz
- affiliation: ''
author:
id: Duncan Sutherland
name: Duncan Sutherland
- affiliation: ''
author:
id: Venkata Swarna Mukhi
name: Venkata Swarna Mukhi
- affiliation: ''
author:
id: Jeffrey Tang
name: Jeffrey Tang
- affiliation: ''
author:
id: Jamie Toghranegar
name: Jamie Toghranegar
- affiliation: ''
author:
id: Kimberly Vinson
name: Kimberly Vinson
- affiliation: ''
author:
id: Claire Wilson
name: Claire Wilson
- affiliation: ''
author:
id: Madeleine Zanin
name: Madeleine Zanin
- affiliation: ''
author:
id: Xijie Zeng
name: Xijie Zeng
- affiliation: ''
author:
id: Theresa Zesiewicz
name: Theresa Zesiewicz
- affiliation: ''
author:
id: Robin Zhao
name: Robin Zhao
- affiliation: ''
author:
id: Pantelis Zisimopoulos
name: Pantelis Zisimopoulos
description: Bridge2AI-Voice v2.0 contains data for 19,271 recordings collected from
442 participants across five sites in North America. Participants were selected
based on known conditions which manifest within the voice waveform including voice
disorders, neurological disorders, mood disorders, and respiratory disorders. The
release contains data considered low risk, including derivations such as spectrograms
but not the original voice recordings. Detailed demographic, clinical, and validated
questionnaire data are also made available.
funders: &id002
- grant:
grant_number: 3OT2OD032720-01S1
id: 3OT2OD032720-01S1
name: 'Bridge2AI: Voice as a Biomarker of Health - Building an ethically sourced,
bioaccoustic database to understand disease like never before'
grantor:
id: NIH
name: National Institutes of Health
id: https://doi.org/10.13026/3xt6-rf05
issued: '2025-04-16'
keywords: &id003
- voice
- bridge2ai
resources:
- acquisition_methods:
- description: Data was collected via patient consent from individuals at specialty
clinics and institutions. A standardized protocol involved collecting demographic
information, health questionnaires, and voice recording tasks.
was_directly_observed: ''
was_inferred_derived: true
was_reported_by_subjects: true
was_validated_verified: ''
addressing_gaps:
- response: For voice to emerge as a biomarker of health, there is a pressing need
for large, high quality, multi-institutional and diverse voice database linked
to other health biomarkers from various data of different modality to fuel voice
AI research and answer tangible clinical questions.
anomalies: []
bytes: ''
cleaning_strategies:
- description: HIPAA Safe Harbor identifiers were removed. State and province were
removed. Country of data collection was retained. Audio records with sensitive
information were removed, but their static features were retained. In this release,
audio waveforms are omitted.
collection_mechanisms:
- description: Data collection was conducted using a custom application on a tablet
with a headset used for data collection when possible. Data were exported and
converted from RedCap using the b2aiprep open source library.
collection_timeframes:
- description: ''
compression: ''
confidential_elements: []
conforms_to: []
conforms_to_class: ''
conforms_to_schema: ''
content_warnings: []
created_by: []
created_on: ''
creators: *id001
data_collectors: []
data_protection_impacts: []
description: Bridge2AI-Voice v2.0 contains data for 19,271 recordings collected
from 442 participants across five sites in North America. Participants were selected
based on known conditions which manifest within the voice waveform including voice
disorders, neurological disorders, mood disorders, and respiratory disorders.
The release contains data considered low risk, including derivations such as spectrograms
but not the original voice recordings. Detailed demographic, clinical, and validated
questionnaire data are also made available.
dialect: ''
discouraged_uses: []
distribution_dates: []
distribution_formats:
- description: spectrograms.parquet, mfcc.parquet, phenotype.tsv, static_features.tsv
doi: https://doi.org/10.13026/3xt6-rf05
download_url: ''
encoding: ''
errata: []
ethical_reviews:
- description: Data collection and sharing was approved by the University of South
Florida Institutional Review Board.
existing_uses: []
extension_mechanism: ''
external_resources:
- archival: ''
external_resources:
- 'Audio recordings are included on a companion release on PhysioNet with the
title ''Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked
to health information (Audio Included)''.'
future_guarantees: ''
restrictions: ''
format: ''
funders: *id002
future_use_impacts: []
hash: ''
id: https://doi.org/10.13026/3xt6-rf05
instances:
- counts: 442
data_substrate: ''
data_topic: ''
instance_type: participants
label: ''
label_description: ''
missing_information: []
sampling_strategies: []
- counts: 19271
data_substrate: ''
data_topic: ''
instance_type: recordings
label: ''
label_description: ''
missing_information: []
sampling_strategies: []
ip_restrictions: ''
is_deidentified:
description:
- HIPAA Safe Harbor identifiers were removed. State and province were removed.
Audio records with sensitive information were removed. Audio waveforms are omitted
from this release.
identifiable_elements_present: false
is_tabular: ''
issued: '2025-04-16'
keywords: *id003
labeling_strategies: []
language: ''
last_updated_on: ''
license: Bridge2AI Voice Registered Access License
license_and_use_terms:
- description: Access is restricted to credentialed users who sign the 'Bridge2AI
Voice Registered Access Agreement' Data Use Agreement. The license is the 'Bridge2AI
Voice Registered Access License'.
maintainers: []
md5: ''
media_type: ''
modified_by: []
other_tasks: []
page: ''
path: ''
preprocessing_strategies:
- description: 'Raw audio was preprocessed by converting to monaural and resampling
to 16 kHz with a Butterworth anti-aliasing filter. Derived data includes: Spectrograms
(short-time FFT), 60 Mel-frequency cepstral coefficients (MFCCs), acoustic features
extracted using OpenSMILE, phonetic and prosodic features computed using Parselmouth
and Praat, and transcriptions generated using OpenAI''s Whisper Large model.'
publisher: PhysioNet
purposes:
- response: The Bridge2AI-Voice project seeks to create an ethically sourced flagship
dataset to enable future research in artificial intelligence and support critical
insights into the use of voice as a biomarker of health.
raw_sources:
- description: Yes, the raw audio data is available in a companion release on PhysioNet.
regulatory_restrictions: ''
retention_limit: ''
sampling_strategies:
- description: Patients presenting at specialty clinics and institutions were considered
for enrollment. Patients were selected based on membership to five predetermined
groups (Respiratory disorders, Voice disorders, Neurological disorders, Mood
disorders, Pediatric). This is a purposive sampling strategy.
sensitive_elements:
- description:
- The dataset contains data linked to health conditions including voice disorders,
neurological disorders, mood disorders, and respiratory disorders. The raw voice
recordings are considered sensitive and are not included in this release but
are available in a companion release.
sensitive_elements_present: true
sha256: ''
status: ''
subpopulations:
- distribution:
- 'Voice Disorders, Neurological and Neurodegenerative Disorders, Mood and Psychiatric
Disorders, Respiratory disorders. Note: The v2.0.0 dataset does not contain
pediatric data and does not contain an equal distribution across categories
of diseases.'
identification:
- Disease cohort categories
subpopulation_elements_present: true
subsets: []
tasks: []
title: 'Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health
information'
updates:
- description: 'b2ai-voice v2.0: This release provides data for an additional 136
new participants. Spectrograms were reprocessed. All spectrograms and Mel-frequency
cepstral coefficients from free speech related files have been removed. b2ai-voice
v1.1: This release added Mel-frequency cepstral coefficients (MFCCs). b2ai-voice
v1.0: This was the first release.'
use_repository: []
version: 2.0.0
version_access:
- description: Previous versions are available. v1.1 was released Jan. 17, 2025.
v1.0 is available at https://doi.org/10.57764/qb6h-em84.
was_derived_from: ''
title: 'Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health
information'
version: 2.0.0