VOICE d4d

Datasheet for Dataset - Human Readable Format

🎯

Motivation

Why was the dataset created?

  • Description
    Create an ethically sourced flagship dataset to enable future research in artificial intelligence and support critical insights into the use of voice as a biomarker of health.
📊

Composition

What do the instances represent?

  • Description
    Version 1.1 contains 12,523 recordings from 306 adult participants collected across five sites in North America. Most participants completed one session, with a subset completing multiple sessions.
  • Description
    • Voice disorders (benign and malignant lesions affecting vocal folds)
    • Neurological and neurodegenerative disorders (including Parkinson's, ALS)
    • Mood and psychiatric disorders (depression, anxiety)
    • Respiratory disorders (cough, breathing sounds)
🔍

Collection Process

How was the data acquired?

Bridge2AI-Voice Dataset
Bridge2AI-Voice - An ethically-sourced, diverse voice dataset linked to health information
The Bridge2AI-Voice project creates an ethically sourced flagship dataset to enable future research in artificial intelligence and support critical insights into the use of voice as a biomarker of health. Version 1.1 provides 12,523 recordings from 306 adult participants collected across five sites in North America. Participants were selected based on known conditions that manifest within the voice waveform, including voice disorders, neurological disorders, mood disorders, and respiratory disorders. This release contains de-identified derived data, including spectrograms, MFCCs, and acoustic features, but not the original voice recordings.
Language
en
Keywords
  • Bridge2AI
  • voice biomarker
  • speech
  • health
  • voice disorders
  • neurological disorders
  • mood disorders
  • respiratory disorders
  • spectrogram
  • MFCC
  • acoustic features
  • PhysioNet
  • Description
    Address the need for large, high quality, multi-institutional and diverse voice databases linked to health biomarkers to fuel voice AI research and answer clinical questions.
  • Description
    Time-frequency power spectrograms (513 x N dimension) computed using short-time FFT with 25ms window, 10ms hop length, and 512-point FFT.
    ID
    voice:spectrograms
    Name
    spectrograms.parquet
  • Description
    60 Mel-frequency cepstral coefficients (MFCCs) derived from spectrograms, 60 x N dimension per recording.
    ID
    voice:mfcc
    Name
    mfcc.parquet
  • Description
    Participant demographics, validated questionnaire responses, and acoustic confounders with one row per unique participant.
    ID
    voice:phenotype
    Name
    phenotype.tsv
  • Description
    Acoustic features from openSMILE, Praat, parselmouth, and torchaudio with one row per unique recording.
    ID
    voice:static-features
    Name
    static_features.tsv
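Because phenotype.tsv has one row per participant and static_features.tsv has one row per recording, analyses usually need to attach participant-level fields to each recording. The following is a minimal sketch of that many-to-one join using only the Python standard library. The column names (participant_id, recording_id, age, f0_mean) and the sample rows are hypothetical stand-ins, not taken from the released files.

```python
import csv
import io

# Tiny made-up stand-ins for phenotype.tsv and static_features.tsv.
# All column names and values here are hypothetical; check the real
# file headers before adapting this sketch.
PHENOTYPE_TSV = "participant_id\tage\nP001\t54\nP002\t37\n"
FEATURES_TSV = (
    "participant_id\trecording_id\tf0_mean\n"
    "P001\tR1\t118.2\n"
    "P001\tR2\t121.7\n"
    "P002\tR3\t203.4\n"
)


def read_tsv(text):
    """Parse tab-separated text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def attach_phenotype(features, phenotype, key="participant_id"):
    """Copy participant-level fields onto every recording-level row."""
    by_participant = {row[key]: row for row in phenotype}
    merged = []
    for recording in features:
        person = by_participant.get(recording[key], {})
        # Recording fields win if a column name appears in both files.
        merged.append({**person, **recording})
    return merged


merged_rows = attach_phenotype(read_tsv(FEATURES_TSV), read_tsv(PHENOTYPE_TSV))
```

With real files, the same functions could be fed the file contents read from disk. Recordings whose participant is missing from the phenotype table are kept with only their own fields.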
  • Description
    Participants were selected from specialty clinics and institutions based on membership in five predetermined disease groups. Project investigators screened patients against inclusion/exclusion criteria before the visit.
  • Description
    Voice tasks, including sustained vowel phonation, were collected under a standardized protocol using a custom tablet application with a headset.
    Was Directly Observed
    True
    Acquisition Details
    • Custom tablet application for data collection
    • Headset used for voice recording when possible
    • Standardized protocol across five collection sites
  • Description
    Data were collected using REDCap together with a custom tablet application.
    Mechanism Details
    • Custom tablet application for voice recording
    • REDCap for clinical and questionnaire data
    • Headset microphone for audio capture
  • Description
    Specialty clinics and institutions across five North American sites.
    Collector Details
    • Five North American clinical sites
    • Specialty clinics for each disease cohort
  • Description
    Data collected prior to January 2025 release.
    Timeframe Details
    • v1.1 released January 17, 2025
  • Description
    Raw audio was converted to mono and resampled to 16 kHz with a Butterworth anti-aliasing filter. Derived data were computed using multiple feature extraction pipelines.
    Preprocessing Details
    • Audio resampled to 16 kHz mono
    • Spectrograms computed via short-time FFT (25ms window, 10ms hop, 512-point FFT)
    • 60 MFCCs extracted from spectrograms
    • Acoustic features extracted using openSMILE
    • Phonetic and prosodic features computed using Parselmouth/Praat
    • Transcriptions generated using OpenAI Whisper Large (free speech transcripts removed)
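The spectrogram settings above translate into concrete array sizes. The sketch below works through that arithmetic for 16 kHz audio, assuming a one-sided FFT and no padding at the edges of the signal (both are assumptions; the project's actual framing is not stated here). The function name stft_shape is hypothetical. Under these assumptions a 512-point FFT yields 257 frequency bins; the 513-row shape quoted in the file description would correspond to a 1024-point FFT, so the exact setting is worth checking against the files themselves.

```python
SAMPLE_RATE_HZ = 16_000  # resampled rate stated in the datasheet
WINDOW_MS = 25           # analysis window length
HOP_MS = 10              # step between successive windows
N_FFT = 512              # FFT size stated in the datasheet


def stft_shape(num_samples, sample_rate=SAMPLE_RATE_HZ,
               window_ms=WINDOW_MS, hop_ms=HOP_MS, n_fft=N_FFT):
    """Return (window_samples, hop_samples, n_bins, n_frames).

    Assumes a one-sided spectrum and no padding, so only windows that
    fit entirely inside the signal are counted.
    """
    window = sample_rate * window_ms // 1000
    hop = sample_rate * hop_ms // 1000
    n_bins = n_fft // 2 + 1
    if num_samples < window:
        n_frames = 0
    else:
        n_frames = 1 + (num_samples - window) // hop
    return window, hop, n_bins, n_frames


# One second of 16 kHz audio.
one_second = stft_shape(SAMPLE_RATE_HZ)
```

Libraries that center and pad each frame report a slightly larger frame count for the same signal, so N in the released arrays may differ from this estimate by a few frames.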
  • Description
    HIPAA Safe Harbor de-identification applied. Free speech transcripts removed to reduce re-identification risk. Raw audio waveforms not released in v1.1.
    Cleaning Details
    • HIPAA Safe Harbor identifiers removed
    • State and province removed (country retained)
    • Free speech transcripts removed
    • Raw audio waveforms omitted
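Two of the cleaning steps listed above, dropping state and province while keeping country and removing free speech transcripts, can be illustrated with a short filter. This is only a sketch under assumed field names (state, province, country, task, free_speech are hypothetical); it is not the project's pipeline and does not by itself amount to HIPAA Safe Harbor de-identification.

```python
# Hypothetical field names; the released files may use different ones.
LOCATION_FIELDS_TO_DROP = {"state", "province"}
FREE_SPEECH_TASKS = {"free_speech"}


def scrub_participant(record):
    """Drop sub-national location fields and keep everything else."""
    return {k: v for k, v in record.items()
            if k not in LOCATION_FIELDS_TO_DROP}


def scrub_transcripts(transcripts):
    """Keep only transcripts that do not come from free speech tasks."""
    return [t for t in transcripts if t["task"] not in FREE_SPEECH_TASKS]


participant = {"participant_id": "P001", "country": "USA", "state": "FL"}
transcripts = [
    {"task": "sustained_vowel", "text": "aaaa"},
    {"task": "free_speech", "text": "made-up example sentence"},
]

clean_participant = scrub_participant(participant)
clean_transcripts = scrub_transcripts(transcripts)
```

Using explicit drop lists keeps the sketch easy to audit, though an allow list of permitted fields is generally the safer design when new columns may appear later.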
  • Description
    Data collection and sharing approved by the University of South Florida Institutional Review Board.
    Review Details
    • IRB approval from University of South Florida
Description
De-identification using HIPAA Safe Harbor method. All Safe Harbor identifiers removed including names, geographic locators, dates, contact information, government identifiers, and biometric identifiers.
  • Name
    PhysioNet platform team
    Description
    Dataset hosted and maintained by PhysioNet.
    Maintainer Details
    • PhysioNet platform and infrastructure
    • MIT Laboratory for Computational Physiology
🚀

Uses

What (other) tasks could the dataset be used for?

  • Description
    Enable AI research on voice as a biomarker for health conditions including voice disorders, neurological disorders, mood disorders, and respiratory disorders.
  • Description
    Dataset designed for AI research on voice as a health biomarker. Data are provided as derived representations without raw audio to reduce re-identification risk.
    Impact Details
    • Adult cohort only in v1.1 (pediatric data not included)
    • No raw audio in v1.1 (analyses limited to derived representations)
    • Participants selected for conditions manifesting in voice
  • Description
    Access requires signing the Bridge2AI Voice Registered Access Agreement. Use governed by the agreement terms.
    Discouragement Details
    • Registered access required
    • Must sign data use agreement
Description
Bridge2AI Voice Registered Access License. Only registered users who sign the specified data use agreement can access the files.
License Terms
  • Bridge2AI Voice Registered Access License
  • Requires signing Bridge2AI Voice Registered Access Agreement
  • Registered access on PhysioNet
📤

Distribution

How will the dataset be distributed?

Description
Multiple versions available through PhysioNet. Files for v1.1 may no longer be available; latest version is 2.0.1.
Latest Version DOI
https://doi.org/10.13026/37yb-1t42
Versions Available
  • 1.1 (January 2025)
  • 2.0.0 (April 2025)
  • 2.0.1 (August 2025)
🔄

Maintenance

How will the dataset be maintained?

Description
Version 1.1, released January 17, 2025, added MFCCs. The project was subsequently updated to versions 2.0.0 and 2.0.1.
Update Details
  • v1.0 initial release 2024
  • v1.1 released 2025-01-17 (added MFCCs)
  • v2.0.0 released 2025-04-16
  • v2.0.1 released 2025-08-18
👥

Human Subjects

Does the dataset relate to people?

  • Consent Obtained
    True
    Consent Type
    • Written consent for data collection
    • Consent for research data sharing
    Description
    Participants provided consent for data collection and sharing of de-identified research data.
Generated on 2026-04-15 21:02:52 using Bridge2AI Data Sheets Schema