VOICE (Claude Code Synthesized)

Datasheet for Dataset - Human Readable Format

🎯

Motivation

Why was the dataset created?

  • Response
    Enable ethically sourced, large-scale research on voice as a biomarker of health by linking derived voice representations to demographic, clinical, and questionnaire data.
Funding
  • Agency
    National Institutes of Health
  • Award Number
    3OT2OD032720-01S1
  • Project Title
    Bridge2AI: Voice as a Biomarker of Health - Building an ethically sourced, bioacoustic database to understand disease like never before
Acknowledgements
We acknowledge the contribution of study participants and the NIH for continued support of the project.
Platform Support
PhysioNet infrastructure is supported by the National Institute of Biomedical Imaging and Bioengineering under NIH grant number R01EB030362.
📊

Composition

What do the instances represent?

  • Representation
    Adult participants with voice, neurological, mood, and respiratory disorders
  • Instance Type
    Participants and their voice-derived features with linked clinical phenotype data
  • Data Type
    Spectrograms derived from audio; mel-frequency cepstral coefficients; acoustic feature sets (openSMILE); phonetic and prosodic features (Parselmouth and Praat); transcriptions generated by OpenAI Whisper Large (free speech transcripts removed); phenotype and questionnaire data.
  • File Formats
    Parquet files (spectrograms, MFCC); TSV files (phenotype, static features); JSON files (data dictionaries)
  • Version
    v1.1, released 2025-01-17
🔍

Collection Process

How was the data acquired?

Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information
The human voice contains complex acoustic markers that have been linked to important health conditions including dementia, mood disorders, and cancer. Viewed as a biomarker, voice is a promising characteristic to measure: it is simple to collect, cost-effective, and broadly useful in clinical settings. Recent advances in artificial intelligence have provided techniques to extract previously inaccessible, prognostically useful information from dense data elements such as images. The Bridge2AI-Voice project seeks to create an ethically sourced flagship dataset to enable future research in artificial intelligence and to support critical insights into the use of voice as a biomarker of health. Here we present Bridge2AI-Voice, a comprehensive collection of data derived from voice recordings with corresponding clinical information. Bridge2AI-Voice v1.0, the initial release, provides 12,523 recordings from 306 participants collected across five sites in North America. Participants were selected based on known conditions that manifest within the voice waveform, including voice disorders, neurological disorders, mood disorders, and respiratory disorders. The initial release contains data considered low risk, including derivations such as spectrograms but not the original voice recordings. Detailed demographic, clinical, and validated questionnaire data are also made available.
Language
en
RRID
SCR_007345
Publisher
PhysioNet
Release Date
2025-01-17
Keywords
  • VOICE
  • voice
  • bridge2ai
  • biomarker
  • dementia
  • mood disorders
  • cancer
  • voice disorders
  • neurological disorders
  • respiratory disorders
  • spectrograms
  • acoustic features
  • Health Data Nexus
  • PhysioNet
  • ethical data
  • AI
  • machine learning
  • Purpose
    Create an ethically sourced flagship dataset to enable AI research on voice as a biomarker, supporting critical insights into voice-health relationships not previously available in standardized datasets.
Overview
Derived audio representations and associated phenotype data from adult participants recruited at specialty clinics.
Population
Cohort Scope
Adult cohort only as of v1.1
Recruitment Region
Five sites in North America
Participants
306
Recordings
12,523
Condition Groups
  • Voice disorders
  • Neurological and neurodegenerative disorders
  • Mood and psychiatric disorders
  • Respiratory disorders
  • Pediatric voice and speech disorders (planned; not included in v1.1)
Modalities
  • Spectrograms derived from audio
  • Mel-frequency cepstral coefficients
  • Acoustic feature sets (openSMILE)
  • Phonetic and prosodic features (Parselmouth and Praat)
  • Transcriptions generated by OpenAI Whisper Large (free speech transcripts removed)
  • Phenotype and questionnaire data
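
Raw audio is not distributed, but for orientation, the following minimal sketch shows how the Parselmouth and openSMILE features named above are typically computed from a local WAV file. The file name is hypothetical, and the eGeMAPSv02 feature set is an assumption rather than a confirmed project choice.

import parselmouth  # Python interface to Praat
import opensmile    # Python wrapper for openSMILE

# Hypothetical local recording; the dataset itself ships only derived features.
snd = parselmouth.Sound("recording.wav")

# Prosodic measures via Praat algorithms: fundamental frequency per frame
# (0 Hz marks unvoiced frames) and formants via Burg's method.
pitch = snd.to_pitch()
f0 = pitch.selected_array["frequency"]
formants = snd.to_formant_burg()
f1_mid = formants.get_value_at_time(1, snd.duration / 2)  # F1 at the midpoint

# Acoustic functionals via openSMILE; eGeMAPSv02 is an assumed feature set.
smile = opensmile.Smile(
    feature_set=opensmile.FeatureSet.eGeMAPSv02,
    feature_level=opensmile.FeatureLevel.Functionals,
)
functionals = smile.process_file("recording.wav")  # one-row pandas DataFrame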
Data Formats
  • Parquet
  • TSV
  • JSON
Identifiers In Files
  • participant_id
  • session_id
  • task_name
Sampling And Dimensions
Audio resampled to 16 kHz; spectrograms are 513 x N; MFCC arrays are 60 x N, where N is proportional to recording length.
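
As a consistency check on these dimensions: at 16 kHz a 10 ms hop is 160 samples, so a recording of s seconds yields roughly N ≈ 100·s frames, and 513 frequency bins equal n_fft/2 + 1 for a 1024-point FFT (a literal 512-point FFT would yield 257 bins). A small sketch, assuming centered STFT framing as in common toolkits:

# Expected shapes for a hypothetical 5-second recording, assuming centered
# framing (N = 1 + floor(samples / hop)), as librosa and torchaudio use.
sr = 16_000                    # sampling rate after resampling (Hz)
hop = int(0.010 * sr)          # 10 ms hop -> 160 samples
samples = int(5.0 * sr)        # 5 s of audio -> 80,000 samples
n_frames = 1 + samples // hop  # -> 501 frames

spectrogram_shape = (513, n_frames)   # 513 bins = 1024 // 2 + 1
mfcc_shape = (60, n_frames)
print(spectrogram_shape, mfcc_shape)  # (513, 501) (60, 501)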
Setting
Specialty clinics and institutions
Participant Selection
Participants were screened against inclusion and exclusion criteria within five predetermined condition groups.
Consent
Participants provided consent for data collection and sharing of de-identified research data.
Procedure
Standardized protocol collecting demographics, health questionnaires, targeted confounders for voice, disease-specific information, and voice tasks such as sustained vowel phonation.
Data Capture
Custom tablet application used for collection; headset used when possible.
Sessions
Most participants completed one session; a subset required multiple sessions.
Data Export And Merge
Exported from REDCap and converted using the open-source b2aiprep library.
  1. Description
    Standardized protocol collecting demographics, health questionnaires, targeted confounders for voice, disease-specific information, and voice tasks such as sustained vowel phonation.
    Was Directly Observed
    Yes (voice recordings via tablet application)
    Was Reported By Subjects
    Yes (questionnaires)
    Was Validated / Verified
    Standardized data collection protocol; validated questionnaires; REDCap data capture
  • Instrumentation
    Custom tablet application used for collection; headset used when possible; REDCap for phenotype data
  • Collection Sites
    Five data collection sites in North America (specialty clinics)
  • Release Timeline
    Initial release (v1.0) in 2024; v1.1 released 2025-01-17; latest version 2.0.1 released 2025-08-18
  • Processing (see the sketch after this list)
    Raw audio: converted to mono and resampled to 16 kHz with a Butterworth anti-aliasing filter.
    Spectrograms: short-time FFT with 25 ms window, 10 ms hop, 512-point FFT; stored in power representation.
    MFCC: 60 coefficients computed from the spectrograms.
    Acoustic features: extracted using openSMILE, capturing temporal dynamics and acoustic characteristics.
    Phonetic/prosodic features: computed using Parselmouth and Praat; includes measures of fundamental frequency, formants, and voice quality.
    Transcription: generated using OpenAI Whisper Large; transcripts of free speech audio were removed prior to release.
    Code: the open-source b2aiprep library was used to preprocess waveforms and merge phenotype data.
  • De-identification
    HIPAA Safe Harbor approach; removal of identifiers including names, geographic locators, dates at finer than year resolution, phone/fax numbers, email addresses, IP addresses, Social Security Numbers, medical record numbers, health plan beneficiary numbers, device identifiers, license numbers, account numbers, vehicle identifiers, website URLs, full face photos, biometric identifiers, and any other unique identifiers. State and province removed; country of data collection retained. Transcripts of free speech audio removed. Raw audio waveforms omitted in v1.1; only spectrograms and other derived features are released.
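
A minimal sketch of this pipeline using torchaudio (one of the referenced tools). Parameter values follow the description above, except that n_fft=1024 is inferred from the stated 513-bin spectrogram dimension, and torchaudio's resampler uses windowed-sinc filtering rather than the Butterworth filter described, so treat this as an approximation rather than the exact b2aiprep implementation.

import torchaudio

# Load a hypothetical local WAV, mix to mono, and resample to 16 kHz.
wav, sr = torchaudio.load("recording.wav")
wav = wav.mean(dim=0, keepdim=True)
wav = torchaudio.functional.resample(wav, orig_freq=sr, new_freq=16_000)

# Power spectrogram: 25 ms window (400 samples), 10 ms hop (160 samples).
# n_fft=1024 gives the 513 frequency bins stated in the datasheet.
spec = torchaudio.transforms.Spectrogram(
    n_fft=1024, win_length=400, hop_length=160, power=2.0
)(wav)                                  # shape: (1, 513, N)

# 60 MFCCs over the same framing.
mfcc = torchaudio.transforms.MFCC(
    sample_rate=16_000,
    n_mfcc=60,
    melkwargs={"n_fft": 1024, "win_length": 400, "hop_length": 160},
)(wav)                                  # shape: (1, 60, N)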
De-identification Summary
  • HIPAA Safe Harbor de-identification applied
  • No raw audio waveforms in v1.1; only derived representations released
  • Free speech transcripts removed to reduce re-identification risk
  • Sensitive Data
    Health condition information (voice disorders, neurological disorders, mood disorders, respiratory disorders) under restricted access with a data use agreement
Version Notice
Files for version 1.1 are no longer available; the latest version of this project is 2.0.1.
Listing
  • spectrograms.parquet (Parquet): Dense time-frequency representations derived from voice waveforms; includes participant_id, session_id, and task_name columns.
  • mfcc.parquet (Parquet): Mel-frequency cepstral coefficients derived from spectrograms; arrays of size 60 x N per recording.
  • phenotype.tsv (TSV): One row per participant; demographics, acoustic confounders, and responses to validated questionnaires.
  • phenotype.json (JSON): Data dictionary for phenotype.tsv with one-sentence descriptions per column.
  • static_features.tsv (TSV): One row per audio recording; features derived using openSMILE, Praat, Parselmouth, and torchaudio.
  • static_features.json (JSON): Data dictionary for static_features.tsv with feature descriptions.
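
A minimal loading sketch for these files, assuming they have been downloaded to the working directory and that pandas with a Parquet engine (pyarrow or fastparquet) is available; only the identifier columns documented above are relied on.

import json
import pandas as pd

# Tabular files: one row per participant vs. one row per recording.
phenotype = pd.read_csv("phenotype.tsv", sep="\t")
static_features = pd.read_csv("static_features.tsv", sep="\t")

# Data dictionaries describing each column.
with open("phenotype.json") as f:
    phenotype_dictionary = json.load(f)

# Attach participant-level phenotype data to each recording's features
# using the shared participant_id identifier.
merged = static_features.merge(phenotype, on="participant_id", how="left")

# Derived audio representations (requires pyarrow or fastparquet).
spectrograms = pd.read_parquet("spectrograms.parquet")
mfcc = pd.read_parquet("mfcc.parquet")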
Limitations
  • Adult cohort only in v1.1; pediatric data not included.
  • No raw audio is released in v1.1; analyses are limited to derived representations.
  • Participants were selected based on conditions known to manifest in voice, which may affect generalizability.
  • Curators
    Bridge2AI-Voice project team; hosted on PhysioNet
Version History
  • 1.0 (2024): Initial release of the dataset.
  • 1.1 (2025-01-17): Added Mel-frequency cepstral coefficients.
  • 2.0.0 (2025-04-16): Major update (details not provided in source).
  • 2.0.1 (2025-08-18): Latest version (details not provided in source).
Contributors
  • Alistair Johnson
  • Jean-Christophe Bélisle-Pipon
  • David Dorr
  • Satrajit Ghosh
  • Philip Payne
  • Maria Powell
  • Anais Rameau
  • Vardit Ravitsky
  • Alexandros Sigaras
  • Olivier Elemento
  • Yael Bensoussan
Contact
Not publicly listed; contact information requires login.
Preprocessing Code
Name
b2aiprep
URL
https://github.com/sensein/b2aiprep
Description
Open source library used to preprocess raw audio and merge phenotype data.
Referenced Tools
  • openSMILE
  • Praat
  • Parselmouth
  • torchaudio
  • OpenAI Whisper Large
  • librosa (example usage for visualization)
Dataset Citation
Johnson, A., Bélisle-Pipon, J., Dorr, D., Ghosh, S., Payne, P., Powell, M., Rameau, A., Ravitsky, V., Sigaras, A., Elemento, O., & Bensoussan, Y. (2025). Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information (version 1.1). PhysioNet. RRID:SCR_007345. https://doi.org/10.13026/249v-w155
Platform Citation
Goldberger, A., Amaral, L., Glass, L., Hausdorff, J., Ivanov, P. C., Mark, R., ... & Stanley, H. E. (2000). PhysioBank, PhysioToolkit, and PhysioNet: Components of a new research resource for complex physiologic signals. Circulation [Online]. 101 (23), pp. e215–e220. RRID:SCR_007345.
  • External Resources
    • PhysioNet platform (https://physionet.org/)
    • Health Data Nexus (https://healthdatanexus.ai/content/b2ai-voice/1.0/)
    • Project documentation (https://docs.b2ai-voice.org)
    • b2aiprep GitHub repository (https://github.com/sensein/b2aiprep)
    • Bridge2AI Voice REDCap on Zenodo (https://doi.org/10.5281/zenodo.14148755)
  • Rameau, A., Ghosh, S., Sigaras, A., Elemento, O., Bélisle-Pipon, J.-C., Ravitsky, V., Powell, M., Johnson, A., Dorr, D., Payne, P., Boyer, M., Watts, S., Bahr, R., Rudzicz, F., Lerner-Ellis, J., Awan, S., Bolser, D., Bensoussan, Y. (2024). Developing Multi-Disorder Voice Protocols: A team science approach involving clinical expertise, bioethics, standards, and DEI. Proc. Interspeech 2024, 1445-1449. doi: 10.21437/Interspeech.2024-1926
  • Bensoussan, Y., Ghosh, S. S., Rameau, A., Boyer, M., Bahr, R., Watts, S., Rudzicz, F., Bolser, D., Lerner-Ellis, J., Awan, S., Powell, M. E., Belisle-Pipon, J.-C., Ravitsky, V., Johnson, A., Zisimopoulos, P., Tang, J., Sigaras, A., Elemento, O., Dorr, D., … Bridge2AI-Voice. (2024). Bridge2AI Voice REDCap (v3.20.0). Zenodo. https://doi.org/10.5281/zenodo.14148755
  • Florian Eyben, Martin Wöllmer, Björn Schuller: openSMILE - The Munich Versatile and Fast Open-Source Audio Feature Extractor, Proc. ACM Multimedia (MM), ACM, Florence, Italy, ISBN 978-1-60558-933-6, pp. 1459-1462, 25.-29.10.2010.
  • Boersma P, Van Heuven V. Speak and unSpeak with PRAAT. Glot International. 2001 Nov;5(9/10):341-7.
  • Jadoul Y, Thompson B, De Boer B. Introducing parselmouth: A python interface to praat. Journal of Phonetics. 2018 Nov 1;71:1-5.
  • Hwang, J., Hira, M., Chen, C., Zhang, X., Ni, Z., Sun, G., Ma, P., Huang, R., Pratap, V., Zhang, Y., Kumar, A., Yu, C.-Y., Zhu, C., Liu, C., Kahn, J., Ravanelli, M., Sun, P., Watanabe, S., Shi, Y., Tao, T., Scheibler, R., Cornell, S., Kim, S., & Petridis, S. (2023). TorchAudio 2.1: Advancing speech recognition, self-supervised learning, and audio processing components for PyTorch. arXiv preprint arXiv:2310.17864
  • Yang, Y.-Y., Hira, M., Ni, Z., Chourdia, A., Astafurov, A., Chen, C., Yeh, C.-F., Puhrsch, C., Pollack, D., Genzel, D., Greenberg, D., Yang, E. Z., Lian, J., Mahadeokar, J., Hwang, J., Chen, J., Goldsborough, P., Roy, P., Narenthiran, S., Watanabe, S., Chintala, S., Quenneville-Bélair, V, & Shi, Y. (2021). TorchAudio: Building Blocks for Audio and Speech Processing. arXiv preprint arXiv:2110.15018.
  • Bevers, I., Ghosh, S., Johnson, A., Brito, R., Bedrick, S., Catania, F., & Ng, E. b2aiprep (Version 0.21.0) [Computer software]. https://github.com/sensein/b2aiprep
  • Johnson, A., Bélisle-Pipon, J., Dorr, D., Ghosh, S., Payne, P., Powell, M., Rameau, A., Ravitsky, V., Sigaras, A., Elemento, O., & Bensoussan, Y. (2024). Bridge2AI-Voice: An ethically-sourced, diverse voice dataset linked to health information (version 1.0). Health Data Nexus. https://doi.org/10.57764/qb6h-em84
🚀

Uses

What (other) tasks could the dataset be used for?

Primary
Artificial intelligence and clinical research on voice as a biomarker of health.
Examples
  • Development and benchmarking of models to associate voice-derived features with health conditions.
  • Exploration of acoustic, phonetic, and prosodic correlates of disease using de-identified derived data.
Usage Notes
Data are provided as derived representations without raw audio to reduce re-identification risk.
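
For orientation, a sketch of how a released power spectrogram could be rendered with librosa (listed under referenced tools). The random array below is a stand-in for one 513 x N spectrogram, since the exact layout of the arrays inside the Parquet files is not specified here.

import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Stand-in for one released 513 x N power spectrogram (~5 s at a 10 ms hop);
# in practice this array would be reconstructed from spectrograms.parquet.
S = np.random.rand(513, 501)

S_db = librosa.power_to_db(S, ref=np.max)  # power -> decibels for display
librosa.display.specshow(
    S_db, sr=16_000, hop_length=160, x_axis="time", y_axis="linear"
)
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram (dB)")
plt.tight_layout()
plt.show()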
  • Limitations
    Adult cohort only in v1.1; pediatric data are not included. Participants were selected based on conditions known to manifest in voice. Both factors may limit generalizability, and users should account for these sampling characteristics.
Access
  • Access Policy: Restricted access; only registered users who sign the specified data use agreement can access the files.
  • License: Bridge2AI Voice Registered Access License
  • Data Use Agreement: Bridge2AI Voice Registered Access Agreement
Hosting
  • PhysioNet restricted access repository
  • Health Data Nexus
📤

Distribution

How will the dataset be distributed?

DOI (version 1.1)
10.13026/249v-w155
DOI (version not specified in source)
10.13026/37yb-1t42
Platform
PhysioNet
Access Policy
Restricted Access
Access Conditions
Only registered users who sign the specified data use agreement can access the files.
License
Bridge2AI Voice Registered Access License
Data Use Agreement
Bridge2AI Voice Registered Access Agreement
Availability
  • Multiple versions available on the PhysioNet platform (v1.1, v2.0.0, v2.0.1)
  • Also available on Health Data Nexus (b2ai-voice version 1.0)
🔄

Maintenance

How will the dataset be maintained?

Version
1.1
Maintenance Plan
  • Periodic updates planned; v1.1 released 2025-01-17 adding MFCC; v2.0.0 released 2025-04-16; v2.0.1 released 2025-08-18
👥

Human Subjects

Does the dataset relate to people?

IRB Approval
Data collection and sharing approved by the University of South Florida Institutional Review Board.
Ethical Position
Dataset is ethically sourced with privacy protections; derived data released for low risk.
Conflicts Of Interest
None to declare.
Generated on 2025-11-16 17:37:50 using Bridge2AI Data Sheets Schema