Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information

Datasheet for Dataset - Human Readable Format

🎯

Motivation

Why was the dataset created?

Dataset Resource
Grantor | Grant Name | Grant Number
National Institutes of Health | Bridge2AI: Voice as a Biomarker of Health - Building an ethically sourced, bioaccoustic database to understand disease like never before | 3OT2OD032720-01S1
Dataset Resource
  • Response
    The Bridge2AI-Voice project seeks to create an ethically sourced flagship dataset to enable future research in artificial intelligence and support critical insights into the use of voice as a biomarker of health.
📊

Composition

What do the instances represent?

Dataset Resource
Counts | Data Substrate | Data Topic | Instance Type | Label | Label Description | Missing Information | Sampling Strategies
442 participants
19,271 recordings
Dataset Resource
  • Subpopulation Elements Present
    True
    Identification
    • Disease cohort categories
    Distribution
    • Voice disorders, neurological and neurodegenerative disorders, mood and psychiatric disorders, and respiratory disorders. Note: The v2.0.0 dataset does not contain pediatric data, and participants are not evenly distributed across disease categories.
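Because the cohorts are not evenly represented, it can help to tally cohort sizes before training a model. The following is a minimal sketch under stated assumptions: the cohort labels are toy values, and inverse-frequency weighting is one common way to compensate for imbalance, not something the dataset prescribes.

```python
from collections import Counter


def cohort_weights(labels):
    """Return inverse-frequency weights, one per cohort label.

    weight = total / (number_of_cohorts * cohort_count), so a cohort
    that is half as common receives twice the weight.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {cohort: total / (len(counts) * n) for cohort, n in counts.items()}


# Toy labels for illustration only; these are not counts from the dataset.
example_labels = ["voice"] * 6 + ["respiratory"] * 2
example_weights = cohort_weights(example_labels)
```

The weights can then be passed to whatever loss or sampler a downstream model uses.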
Dataset Resource
  • Description
    spectrograms.parquet, mfcc.parquet, phenotype.tsv, static_features.tsv
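The two TSV files can be read with standard tooling and linked to each other. The sketch below uses only the Python standard library and inline toy data. Every column name in it (`participant_id`, `age`, `task`, `f0_mean`) is a hypothetical placeholder, so check the actual file headers before adapting it. Reading the Parquet files needs a third-party reader and is not shown.

```python
import csv
import io

# Toy stand-ins for phenotype.tsv and static_features.tsv.
# All column names and values here are hypothetical.
PHENOTYPE_TSV = "\n".join([
    "participant_id\tage",
    "p001\t54",
    "p002\t37",
]) + "\n"

FEATURES_TSV = "\n".join([
    "participant_id\ttask\tf0_mean",
    "p001\tsustained_vowel\t182.4",
    "p002\tsustained_vowel\t121.0",
    "p001\treading\t175.9",
]) + "\n"


def read_tsv(handle):
    """Parse a tab-separated file object into a list of dicts."""
    return list(csv.DictReader(handle, delimiter="\t"))


def link_features_to_phenotype(features, phenotype, key="participant_id"):
    """Attach each participant's phenotype row to every feature row."""
    by_id = {row[key]: row for row in phenotype}
    return [{**by_id[row[key]], **row} for row in features if row[key] in by_id]


phenotype = read_tsv(io.StringIO(PHENOTYPE_TSV))
features = read_tsv(io.StringIO(FEATURES_TSV))
linked = link_features_to_phenotype(features, phenotype)
```

With real files, replace the `io.StringIO` objects with `open(path, newline="")`.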
🔍

Collection Process

How was the data acquired?

Dataset Resource
Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information
Dataset Resource
Bridge2AI-Voice v2.0 contains data for 19,271 recordings collected from 442 participants across five sites in North America. Participants were selected based on known conditions which manifest within the voice waveform including voice disorders, neurological disorders, mood disorders, and respiratory disorders. The release contains data considered low risk, including derivations such as spectrograms but not the original voice recordings. Detailed demographic, clinical, and validated questionnaire data are also made available.
Dataset Resource
Role | Name | ORCID | Affiliation
Principal Investigator | Yael Bensoussan | - | -
Principal Investigator | Olivier Elemento | - | -
Principal Investigator | Satrajit Ghosh | - | -
Author | Alexandros Sigaras | - | -
Author | Anais Rameau | - | -
Author | Maria Powell | - | -
Author | David Dorr | - | -
Author | Philip Payne | - | -
Author | Vardit Ravitsky | - | -
Author | Jean-Christophe Bélisle-Pipon | - | -
Author | Alistair Johnson | - | -
Author | Ruth Bahr | - | -
Author | Stephanie Watts | - | -
Author | Donald Bolser | - | -
Author | Jennifer Siu | - | -
Author | Jordan Lerner-Ellis | - | -
Author | Frank Rudzicz | - | -
Author | Micah Boyer | - | -
Author | Samantha Salvi Cruz | - | -
Author | Yassmeen Abdel-Aty | - | -
Author | Toufeeq Ahmed Syed | - | -
Author | James Anibal | - | -
Author | Stephen Aradi | - | -
Author | Ana Sophia Martinez | - | -
Author | Shaheen Awan | - | -
Author | Steven Bedrick | - | -
Author | Isaac Bevers | - | -
Author | Rahul Brito | - | -
Author | Selina Casalino | - | -
Author | John Costello | - | -
Author | Iris De Santiago | - | -
Author | Enrique Diaz-Ocampo | - | -
Author | Mohamed Ebraheem | - | -
Author | Ellie Eiseman | - | -
Author | Mahmoud Elmahdy | - | -
Author | Emily Evangelista | - | -
Author | Kenneth Fletcher | - | -
Author | Alexander Gelbard | - | -
Author | Anna Goldenberg | - | -
Author | Karim Hanna | - | -
Author | William Hersh | - | -
Author | Lochana Jayachandran | - | -
Author | Kaley Jenney | - | -
Author | Kathy Jenkins | - | -
Author | Stacy Jo | - | -
Author | Ayush Kalia | - | -
Author | Andrea Krussel | - | -
Author | Elisa Lapadula | - | -
Author | Chloe Loewith | - | -
Author | Radhika Mahajan | - | -
Author | Vrishni Maharaj | - | -
Author | Siyu Miao | - | -
Author | Matthew Mifsud | - | -
Author | Marian Mikhael | - | -
Author | Elijah Moothedan | - | -
Author | Yosef Nafii | - | -
Author | Tempestt Neal | - | -
Author | Karlee Newberry | - | -
Author | Evan Ng | - | -
Author | Christopher Nickel | - | -
Author | Trevor Pharr | - | -
Author | Claire Premi-Bortolotto | - | -
Author | JM Rahman | - | -
Author | Sarah Rohde | - | -
Author | Laurie Russell | - | -
Author | Suketu Shah | - | -
Author | Ahmed Shawkat | - | -
Author | Elizabeth Silberholz | - | -
Author | Duncan Sutherland | - | -
Author | Venkata Swarna Mukhi | - | -
Author | Jeffrey Tang | - | -
Author | Jamie Toghranegar | - | -
Author | Kimberly Vinson | - | -
Author | Claire Wilson | - | -
Author | Madeleine Zanin | - | -
Author | Xijie Zeng | - | -
Author | Theresa Zesiewicz | - | -
Author | Robin Zhao | - | -
Author | Pantelis Zisimopoulos | - | -
Dataset Resource
2025-04-16
Dataset Resource
  • voice
  • bridge2ai
Dataset Resource
  • Description
    Raw audio was preprocessed by converting to monaural and resampling to 16 kHz with a Butterworth anti-aliasing filter. Derived data include spectrograms (short-time FFT), 60 Mel-frequency cepstral coefficients (MFCCs), acoustic features extracted using OpenSMILE, phonetic and prosodic features computed using Parselmouth and Praat, and transcriptions generated using OpenAI's Whisper Large model.
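The first steps of this pipeline (downmix, anti-aliased resampling, short-time spectrum) can be sketched in plain Python. This is an illustration under stated assumptions, not the project's implementation. The function names, the 48 kHz input rate, the second-order filter, the cutoff at 90% of the new Nyquist frequency, and the 400-sample frame with a 160-sample hop are all choices made for this example. A direct DFT stands in for the FFT: it computes the same transform, only more slowly. A numerical library would normally replace these loops.

```python
import cmath
import math


def to_mono(frames):
    """Downmix multi-channel frames (one tuple per sample) by averaging channels."""
    return [sum(frame) / len(frame) for frame in frames]


def butterworth_lowpass(samples, fs, cutoff_hz):
    """Second-order Butterworth low-pass as a biquad (bilinear transform, Q = 1/sqrt(2))."""
    w0 = 2.0 * math.pi * cutoff_hz / fs
    cos_w0 = math.cos(w0)
    alpha = math.sin(w0) / math.sqrt(2.0)  # sin(w0) / (2 * Q)
    a0 = 1.0 + alpha
    b0 = (1.0 - cos_w0) / (2.0 * a0)
    b1 = (1.0 - cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x0 in samples:
        y0 = b0 * x0 + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        x2, x1 = x1, x0
        y2, y1 = y1, y0
        out.append(y0)
    return out


def resample_integer(samples, fs_in, fs_out=16000):
    """Low-pass below the new Nyquist frequency, then keep every n-th sample."""
    factor, remainder = divmod(fs_in, fs_out)
    if remainder or factor < 1:
        raise ValueError("this sketch only handles integer downsampling ratios")
    filtered = butterworth_lowpass(samples, fs_in, 0.9 * fs_out / 2.0)
    return filtered[::factor]


def spectrogram(samples, frame_len=400, hop=160):
    """Magnitudes of a Hann-windowed short-time DFT, one list of bins per frame."""
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * n / frame_len)
              for n in range(frame_len)]
    # Precomputed complex exponentials, indexed by (k * n) mod frame_len.
    twiddle = [cmath.exp(-2j * math.pi * m / frame_len) for m in range(frame_len)]
    n_bins = frame_len // 2 + 1
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        seg = [samples[start + n] * window[n] for n in range(frame_len)]
        frames.append([
            abs(sum(seg[n] * twiddle[(k * n) % frame_len] for n in range(frame_len)))
            for k in range(n_bins)
        ])
    return frames
```

At 16 kHz, a 400-sample frame spans 25 ms and each frequency bin is 40 Hz wide, so a 1 kHz tone peaks in bin 25.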
Dataset Resource
  • Description
    HIPAA Safe Harbor identifiers were removed. State and province were removed. Country of data collection was retained. Audio records with sensitive information were removed, but their static features were retained. In this release, audio waveforms are omitted.
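The field-level part of this de-identification can be pictured as a filter over each record. This is a minimal sketch, assuming hypothetical field names. The identifier set below is a small illustrative subset of the HIPAA Safe Harbor categories, not the project's actual list.

```python
# Hypothetical field names; the real schema may differ.
GEOGRAPHIC_FIELDS_TO_DROP = {"state", "province"}
# Illustrative subset of HIPAA Safe Harbor identifier categories.
DIRECT_IDENTIFIERS = {"name", "email", "phone"}


def deidentify(record):
    """Return a copy of the record without identifier or sub-country fields.

    The country field is deliberately left in place.
    """
    drop = GEOGRAPHIC_FIELDS_TO_DROP | DIRECT_IDENTIFIERS
    return {key: value for key, value in record.items() if key not in drop}


example_record = {
    "participant_id": "p001",
    "name": "Jane Doe",
    "state": "FL",
    "country": "USA",
}
example_clean = deidentify(example_record)
```

Building a new dict rather than mutating the input keeps the original record available for auditing.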
Dataset Resource
Identifiable Elements Present
False
Description
  • HIPAA Safe Harbor identifiers were removed. State and province were removed. Audio records with sensitive information were removed. Audio waveforms are omitted from this release.
Dataset Resource
  • Sensitive Elements Present
    True
    Description
    • The dataset contains data linked to health conditions including voice disorders, neurological disorders, mood disorders, and respiratory disorders. The raw voice recordings are considered sensitive and are not included in this release but are available in a companion release.
Dataset Resource
  1. Description
    Data were collected, with consent, from patients at specialty clinics and institutions. A standardized protocol covered demographic information, health questionnaires, and voice recording tasks.
    Was Directly Observed
    Was Reported By Subjects
    True
    Was Inferred Derived
    True
    Was Validated Verified
Dataset Resource
PhysioNet
Dataset Resource
  • Response
    For voice to emerge as a biomarker of health, there is a pressing need for a large, high-quality, multi-institutional, and diverse voice database, linked to other health biomarkers across different data modalities, to fuel voice AI research and answer tangible clinical questions.
Dataset Resource
  1. External Resources
    • Audio recordings are included in a companion release on PhysioNet with the title 'Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information (Audio Included)'.
    Future Guarantees
    Archival
    Restrictions
Dataset Resource
  • Description
    Data were collected using a custom tablet application, with a headset when possible. Data were exported from REDCap and converted using the open-source b2aiprep library.
Dataset Resource
  • Description
    Patients presenting at specialty clinics and institutions were considered for enrollment. Patients were selected based on membership in one of five predetermined groups (respiratory disorders, voice disorders, neurological disorders, mood disorders, pediatric). This is a purposive sampling strategy.
Dataset Resource
  • Description
    Data collection and sharing were approved by the University of South Florida Institutional Review Board.
Dataset Resource
  • Description
    Yes, the raw audio data is available in a companion release on PhysioNet.
🚀

Uses

What (other) tasks could the dataset be used for?

Dataset Resource
  • Description
    Access is restricted to credentialed users who sign the 'Bridge2AI Voice Registered Access Agreement' Data Use Agreement. The license is the 'Bridge2AI Voice Registered Access License'.
📤

Distribution

How will the dataset be distributed?

Dataset Resource
Bridge2AI Voice Registered Access License
Dataset Resource
  • Description
    Previous versions are available. v1.1 was released Jan. 17, 2025. v1.0 is available at https://doi.org/10.57764/qb6h-em84.
🔄

Maintenance

How will the dataset be maintained?

Dataset Resource
2.0.0
Dataset Resource
  • Description
    b2ai-voice v2.0: This release provides data for 136 additional participants. Spectrograms were reprocessed. All spectrograms and Mel-frequency cepstral coefficients from free-speech-related files have been removed.
    b2ai-voice v1.1: This release added Mel-frequency cepstral coefficients (MFCCs).
    b2ai-voice v1.0: This was the first release.
Dataset Resource
Generated on 2025-10-30 10:38:41 using Bridge2AI Data Sheets Schema