Common Voice Dataset
This dataset contains speech contributions made by the Common Voice community through the web platform; all contributions are included regardless of validation status. The dataset is released roughly every six months and includes audio files and associated metadata such as age, gender, and accent.
Dataset Description
- Source: Speech contributions from the Common Voice community via the web platform.
- Release Frequency: Approximately every six months.
- Data Processing: All speech contributions, irrespective of validation status, are included. The dataset is packaged with the Common Voice Bundler tool and uploaded to S3.
Dataset Structure
- File Format: Each downloaded .tar.gz contains the following layout:

    [lang].tar.gz/
    ├── clips/
    │   ├── *.mp3 files
    ├── dev.tsv
    ├── invalidated.tsv
    ├── other.tsv
    ├── test.tsv
    ├── train.tsv
    ├── validated.tsv
    └── reported.tsv (as of Corpus 5.0)

- TSV File Contents: Each .tsv lists audio files, original source sentences, the hashed client_id, validation data, and demographic information.
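
To sanity-check a downloaded archive against the layout above, the following is a minimal sketch using Python's standard tarfile module. The function name summarize_archive is hypothetical, and the sketch assumes the clips sit in a directory named clips and the metadata files end in .tsv, as the layout describes.

```python
import tarfile
from pathlib import PurePosixPath


def summarize_archive(archive_path):
    """Return (number of .mp3 clips, sorted list of .tsv file names)."""
    clip_count = 0
    tsv_names = set()
    # Iterating the archive reads member headers without extracting files.
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if not member.isfile():
                continue
            path = PurePosixPath(member.name)
            # Assumption: clips live in a directory literally named "clips".
            if path.suffix == ".mp3" and path.parent.name == "clips":
                clip_count += 1
            elif path.suffix == ".tsv":
                tsv_names.add(path.name)
    return clip_count, sorted(tsv_names)
```

Calling summarize_archive("en.tar.gz") (a placeholder file name) would report how many clips the archive holds and which of the TSV files listed above are present.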
Dataset Fields
- Each row (audio clip) includes:
- client_id
- path
- text
- up_votes
- down_votes
- age
- gender
- accent
- segment
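
The fields above can be read with Python's standard csv module. This is a sketch rather than an official loader: the function name read_clips is hypothetical, and the column names are taken from the list above, so check the header row of your release in case they differ.

```python
import csv


def read_clips(tsv_path):
    """Yield one dict per audio clip, with vote counts converted to int."""
    with open(tsv_path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE keeps quotation marks inside sentences as literal text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            # Assumption: an empty vote cell means zero votes.
            row["up_votes"] = int(row["up_votes"] or 0)
            row["down_votes"] = int(row["down_votes"] or 0)
            yield row
```

For example, `[r["path"] for r in read_clips("validated.tsv") if r["up_votes"] > r["down_votes"]]` would collect the clips with more up votes than down votes.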
Dataset Usage
- ML Applications: For machine learning use, the data is processed with the Mozilla Corpora Creator to generate train, development, and test splits.
- Data Splits: Generation is nondeterministic to avoid duplication and demographic bias.
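
Because the splits are meant to avoid duplication, it can be worth verifying that no clip appears in more than one split before training. The sketch below is one way to do that, not part of the Corpora Creator; the function name shared_values is hypothetical, and comparing on the path field is an assumption (client_id could be passed instead to compare speakers).

```python
from itertools import combinations


def shared_values(splits, key="path"):
    """Map each pair of split names to the values of `key` they share.

    `splits` is a dict such as {"train": rows, "dev": rows, "test": rows},
    where each rows value is an iterable of dicts. Pairs with no overlap
    are left out of the result.
    """
    seen = {name: {row[key] for row in rows} for name, rows in splits.items()}
    overlaps = {}
    for left, right in combinations(sorted(seen), 2):
        common = seen[left] & seen[right]
        if common:
            overlaps[(left, right)] = common
    return overlaps
```

An empty result means the splits share no values for the chosen field.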
Access
- Download Recommendation: For large files, use curl with resume support.
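
As a sketch of the curl recommendation, the helper below builds a curl command that resumes a partial download and runs it from Python. The function names are hypothetical, and the URL must be the download link obtained from the dataset page; none is given here. The `--continue-at -` option asks curl to work out the resume offset from the existing partial file.

```python
import subprocess


def curl_resume_command(url, destination):
    """Build a curl invocation that resumes an interrupted download."""
    return [
        "curl",
        "--location",           # follow redirects
        "--continue-at", "-",   # resume from the size of the partial file
        "--output", destination,
        url,
    ]


def download(url, destination):
    """Run curl; re-running after an interruption continues the transfer."""
    subprocess.run(curl_resume_command(url, destination), check=True)
```

If a transfer is interrupted, calling download again with the same arguments continues from where the partial file ends, provided the server supports range requests.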
Citation
- Academic Citation: When used in research, cite "Common Voice: A Massively-Multilingual Speech Corpus".
Source
Organization: github
Created: 7/17/2020