Common Voice Dataset
This dataset contains speech contributions from the Common Voice community on the web platform; all contributions are included regardless of validation status. The dataset is released roughly every six months and includes audio files and associated metadata such as age, gender, and accent.
Common Voice Dataset Overview
Dataset Description
- Source: Speech contributions from the Common Voice community via the web platform.
- Release Frequency: Approximately every six months.
- Data Processing: All speech contributions, irrespective of validation status, are included. The dataset is packaged with the Common Voice Bundler tool and uploaded to S3.
Dataset Structure
- File Format: Each downloaded .tar.gz contains the following layout:

    [lang].tar.gz/
    ├── clips/
    │   ├── *.mp3 files
    ├── dev.tsv
    ├── invalidated.tsv
    ├── other.tsv
    ├── test.tsv
    ├── train.tsv
    ├── validated.tsv
    └── reported.tsv (as of Corpus 5.0)

- TSV File Contents: Each .tsv lists audio files, original source sentences, the hashed client_id, validation data, and demographic information.
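To check which split files an archive actually ships, the layout above can be inspected without unpacking everything. The following is a minimal sketch using Python's standard `tarfile` module. The helper names and the `en/` top-level directory are assumptions made for illustration, and the demo archive is built in memory with placeholder bytes rather than real clips.

```python
import io
import tarfile


def list_tsv_members(archive):
    """Return the sorted names of all regular .tsv files in an open tar archive."""
    return sorted(
        member.name
        for member in archive.getmembers()
        if member.isfile() and member.name.endswith(".tsv")
    )


def build_demo_archive():
    """Build a tiny in-memory .tar.gz that mimics the documented layout.

    The "en/" prefix and the file contents are placeholders, not real data.
    """
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for name in ("en/clips/sample.mp3", "en/train.tsv", "en/validated.tsv"):
            payload = b"placeholder"
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    with tarfile.open(fileobj=build_demo_archive(), mode="r:gz") as archive:
        print(list_tsv_members(archive))
```

For a real download, the same helper should work with `tarfile.open("path/to/archive.tar.gz", mode="r:gz")`, where the path is whatever file was fetched.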
Dataset Fields
- Each row (audio clip) includes:
  - client_id
  - path
  - text
  - up_votes
  - down_votes
  - age
  - gender
  - accent
  - segment
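The fields listed above map naturally onto a tab-separated reader. The sketch below parses rows with Python's `csv` module and keeps clips with more up-votes than down-votes. The sample rows are made up for illustration, the column names are taken from the field list above, and the vote comparison is an assumed example filter rather than the project's official validation rule.

```python
import csv
import io

# Made-up sample rows that use the documented column names.
SAMPLE_TSV = (
    "client_id\tpath\ttext\tup_votes\tdown_votes\tage\tgender\taccent\tsegment\n"
    "abc123\tclip_0001.mp3\tHello world\t2\t0\ttwenties\tfemale\t\t\n"
    "def456\tclip_0002.mp3\tGood morning\t0\t2\t\t\t\t\n"
)


def read_clips(handle):
    """Parse a TSV file handle into a list of dicts keyed by column name."""
    # QUOTE_NONE treats quote characters inside sentences as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def more_up_than_down(rows):
    """Keep rows whose up_votes exceed down_votes (an illustrative filter)."""
    return [row for row in rows if int(row["up_votes"]) > int(row["down_votes"])]


if __name__ == "__main__":
    rows = read_clips(io.StringIO(SAMPLE_TSV))
    for row in more_up_than_down(rows):
        print(row["path"], row["text"])
```

Demographic fields may be empty, as in the second sample row, so downstream code should treat an empty string as "not provided" rather than as a category.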
Dataset Usage
- ML Applications: The data is processed with the Mozilla Corpora Creator to generate test, train, and development splits.
- Data Splits: Split generation is nondeterministic, a choice intended to avoid duplication and demographic bias.
Access
- Download Recommendation: For large files, use curl with resume support.
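With curl, resuming is typically done through the `-C -` option, which continues from the size of the partial file already on disk. Where a script is preferred, the same idea can be sketched in Python with an HTTP `Range` header. The function names and the example URL below are hypothetical, and the sketch assumes the server supports range requests.

```python
import os
import urllib.request


def build_resume_request(url, dest_path):
    """Return (request, offset), asking only for bytes not yet on disk."""
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        # Request everything from the current end of the partial file onward.
        request.add_header("Range", f"bytes={offset}-")
    return request, offset


def download_with_resume(url, dest_path, chunk_size=1 << 20):
    """Download url to dest_path, appending if the server honours the range.

    If the file is already complete, a server may answer 416 and urlopen
    raises urllib.error.HTTPError; callers may want to handle that case.
    """
    request, offset = build_resume_request(url, dest_path)
    with urllib.request.urlopen(request) as response:
        # 206 Partial Content means the Range header was accepted;
        # any other success status means the full body is being sent again.
        mode = "ab" if offset > 0 and response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)


if __name__ == "__main__":
    # Hypothetical URL; substitute the real download link for the archive.
    request, offset = build_resume_request(
        "https://example.invalid/en.tar.gz", "en.tar.gz"
    )
    print(offset, request.get_header("Range"))
```

Falling back to a full rewrite when the server does not answer with 206 avoids appending a complete body onto a partial file, which would corrupt the archive.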
Citation
- Academic Citation: When used in research, cite Common Voice: A Massively‑Multilingual Speech Corpus.