Dataset asset · Open Source Community · Speech Recognition · Crowdsourced Data

Common Voice Dataset

This dataset contains speech contributions made by the Common Voice community through the web platform. All contributions are included, regardless of validation status. The dataset is released roughly every six months and includes audio files and associated metadata such as age, gender, and accent.

Source: GitHub
Created: Jul 17, 2020
Updated: May 14, 2024
Overview

Dataset description and usage context

Common Voice Dataset Overview

Dataset Description

  • Source: Speech contributions from the Common Voice community via the web platform.
  • Release Frequency: Approximately every six months.
  • Data Processing: All speech contributions, irrespective of validation status, are included. The dataset is packaged with the Common Voice Bundler tool and uploaded to S3.

Dataset Structure

  • File Format: Each downloaded .tar.gz contains the following layout:

    [lang].tar.gz/
    ├── clips/
    │   ├── *.mp3 files
    |__ dev.tsv
    |__ invalidated.tsv
    |__ other.tsv
    |__ test.tsv
    |__ train.tsv
    |__ validated.tsv
    |__ reported.tsv (as of Corpus 5.0)

  • TSV File Contents: Each .tsv file lists the audio files along with the original source sentences, a hashed client_id, validation data, and demographic information.
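
The layout described above can be inspected without unpacking the whole archive. The following is a minimal sketch, assuming a local archive that follows that layout; the function name list_archive_layout and the archive path are hypothetical and not part of any Common Voice tooling.

```python
import tarfile


def list_archive_layout(archive_path):
    """Return (sorted .tsv member names, number of .mp3 clips) for an archive.

    Assumes the archive follows the layout described above; nothing here
    is specific to an official Common Voice tool.
    """
    tsv_names = []
    clip_count = 0
    with tarfile.open(archive_path, "r:gz") as archive:
        # Iterating the archive streams member headers without extracting files.
        for member in archive:
            if not member.isfile():
                continue
            if member.name.endswith(".tsv"):
                tsv_names.append(member.name)
            elif member.name.endswith(".mp3"):
                clip_count += 1
    return sorted(tsv_names), clip_count
```

Calling list_archive_layout("xx.tar.gz") on a downloaded archive (the file name is a placeholder) gives a quick check that the expected .tsv files are present before a full extraction.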

Dataset Fields

  • Each row (audio clip) includes:
    • client_id
    • path
    • text
    • up_votes
    • down_votes
    • age
    • gender
    • accent
    • segment
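
As an illustration of how these fields might be read, here is a small sketch using Python's standard csv module. It assumes the header row uses exactly the field names listed above; actual releases may name columns differently, so check the header of your own files. The sample rows are invented, and the vote comparison is only an example filter, not Common Voice's validation rule.

```python
import csv
import io

# Invented sample rows using the field names listed above.
SAMPLE_TSV = (
    "client_id\tpath\ttext\tup_votes\tdown_votes\tage\tgender\taccent\tsegment\n"
    "abc123\tclip_0001.mp3\tHello world\t2\t0\ttwenties\tfemale\t\t\n"
    "def456\tclip_0002.mp3\tGood morning\t0\t2\t\t\t\t\n"
)


def read_clips(tsv_file):
    """Parse tab-separated rows into dicts, converting vote counts to int."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal text.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    clips = []
    for row in reader:
        row["up_votes"] = int(row["up_votes"] or 0)
        row["down_votes"] = int(row["down_votes"] or 0)
        clips.append(row)
    return clips


def more_up_than_down(clips):
    """Illustrative filter: keep clips with more up votes than down votes."""
    return [clip for clip in clips if clip["up_votes"] > clip["down_votes"]]


clips = read_clips(io.StringIO(SAMPLE_TSV))
print([clip["path"] for clip in more_up_than_down(clips)])
```

For a real file, pass open("validated.tsv", encoding="utf-8", newline="") to read_clips in place of the in-memory sample.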

Dataset Usage

  • ML Applications: For machine learning use, the data is processed with the Mozilla Corpora Creator to generate train, development, and test splits.
  • Data Splits: Split generation is nondeterministic, so the splits can differ from one run to the next; the process aims to avoid duplication and demographic bias.

Access

  • Download Recommendation: For large files, use curl with resume support.
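
With curl, resuming is typically done with the -C - option, which continues from the size of the partial file already on disk. The same idea can be sketched in Python with an HTTP Range request. This is a minimal sketch under stated assumptions: the server must support range requests, the URL is a placeholder you supply, and the helper names build_resume_request and resume_download are hypothetical.

```python
import os
import urllib.request


def build_resume_request(url, dest_path):
    """Build a GET request that asks only for the bytes not yet on disk."""
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        # Ask the server to start at the first missing byte.
        request.add_header("Range", f"bytes={offset}-")
    return request, offset


def resume_download(url, dest_path, chunk_size=1 << 20):
    """Download url to dest_path, appending to a partial file when possible.

    If the file is already complete, the server may answer with an error
    (such as 416), which urlopen raises as HTTPError.
    """
    request, offset = build_resume_request(url, dest_path)
    with urllib.request.urlopen(request) as response:
        # 206 means the server honored the Range header; otherwise start over.
        mode = "ab" if offset and response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Calling resume_download again after an interrupted transfer picks up where the partial file ends, provided the server honors the Range header.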
