Common Voice Dataset

This dataset contains speech contributions made by the Common Voice community through the web platform. All contributions are included, regardless of validation status. A new version is released roughly every six months and includes the audio files together with associated metadata such as age, gender, and accent.

Updated 5/14/2024

Common Voice Dataset Overview

Dataset Description

Source: Speech contributions from the Common Voice community via the web platform.
Release Frequency: Approximately every six months.
Data Processing: All speech contributions, irrespective of validation status, are included. The dataset is packaged with the Common Voice Bundler tool and uploaded to S3.

Dataset Structure

File Format: Each downloaded .tar.gz contains the following layout:

[lang].tar.gz/
├── clips/
│   └── *.mp3 files
├── dev.tsv
├── invalidated.tsv
├── other.tsv
├── test.tsv
├── train.tsv
├── validated.tsv
└── reported.tsv (as of Corpus 5.0)

TSV File Contents: Each .tsv file lists the audio file names, the original source sentences, the hashed client_id, validation data, and demographic information.
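
As a minimal sketch of how one of these files might be read, the snippet below parses tab-separated rows with Python's standard csv module. The helper name read_tsv_rows and the in-memory sample row are illustrative assumptions, not part of the dataset or its tooling. The column names are taken from the field list in this document.

```python
import csv
import io


def read_tsv_rows(stream):
    """Yield one dict per clip row from a tab-separated stream with a header line."""
    # QUOTE_NONE keeps quotation marks inside sentences as literal characters.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Hypothetical sample standing in for a file such as validated.tsv.
sample = io.StringIO(
    "client_id\tpath\ttext\tup_votes\tdown_votes\tage\tgender\taccent\tsegment\n"
    "abc123\tclip_0001.mp3\tHello world\t2\t0\ttwenties\t\t\t\n"
)
rows = list(read_tsv_rows(sample))
print(rows[0]["path"], rows[0]["text"])
```

With a real extracted archive, the same function could be given a file opened with open(path, encoding="utf-8", newline="").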

Dataset Fields

Each row (audio clip) includes:
- client_id
- path
- text
- up_votes
- down_votes
- age
- gender
- accent
- segment
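
The fields above can be modelled as a small typed record. This is only a sketch: the Clip class and clip_from_row helper are hypothetical names, and treating the vote counts as integers and everything else as free text is an assumption rather than a documented schema.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    client_id: str
    path: str
    text: str
    up_votes: int
    down_votes: int
    age: str
    gender: str
    accent: str
    segment: str


def clip_from_row(row):
    """Build a Clip from a dict of strings; missing or empty values fall back to '' or 0."""
    return Clip(
        client_id=row.get("client_id") or "",
        path=row.get("path") or "",
        text=row.get("text") or "",
        up_votes=int(row.get("up_votes") or 0),      # assumed to be an integer count
        down_votes=int(row.get("down_votes") or 0),  # assumed to be an integer count
        age=row.get("age") or "",
        gender=row.get("gender") or "",
        accent=row.get("accent") or "",
        segment=row.get("segment") or "",
    )


# Hypothetical row, shaped like the output of csv.DictReader.
example = clip_from_row(
    {"client_id": "abc123", "path": "clip_0001.mp3", "text": "Hello world",
     "up_votes": "2", "down_votes": "", "age": "twenties"}
)
print(example.up_votes - example.down_votes)
```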

Dataset Usage

ML Applications: The data is processed with the Mozilla Corpora Creator to generate the test, train, and development (dev) splits.
Data Splits: Split generation is nondeterministic and is designed to avoid duplication and demographic bias.
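
One way to sanity-check the "no duplication" goal is to look for values that occur in more than one split. The sketch below assumes that duplication is measured per client_id, which is an interpretation and not something this page states. The function shared_values and the toy rows are hypothetical.

```python
def shared_values(split_a, split_b, key="client_id"):
    """Return the set of `key` values that occur in both splits."""
    values_a = {row[key] for row in split_a}
    values_b = {row[key] for row in split_b}
    return values_a & values_b


# Toy rows standing in for train.tsv, test.tsv, and dev.tsv.
train = [{"client_id": "s1", "path": "a.mp3"}, {"client_id": "s2", "path": "b.mp3"}]
test = [{"client_id": "s3", "path": "c.mp3"}]
dev = [{"client_id": "s2", "path": "d.mp3"}]

print(shared_values(train, test))  # empty set: no overlap
print(shared_values(train, dev))   # contains "s2": overlap detected
```

Passing key="path" instead would check for repeated clips rather than repeated contributors.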

Access

Download Recommendation: For large files, use curl with resume support.
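
With curl, resuming is typically done with the -C - option, which continues from the size of the partial file. The same idea can be sketched in Python with an HTTP Range header. The function names and the example URL below are hypothetical, and the sketch assumes the server supports range requests.

```python
import os
import urllib.request


def build_resume_request(url, dest_path):
    """Return (request, offset), asking only for bytes not yet present in dest_path."""
    offset = os.path.getsize(dest_path) if os.path.exists(dest_path) else 0
    request = urllib.request.Request(url)
    if offset > 0:
        request.add_header("Range", f"bytes={offset}-")
    return request, offset


def resume_download(url, dest_path, chunk_size=1 << 20):
    """Download url to dest_path, appending to a partial file when the server allows it."""
    request, _ = build_resume_request(url, dest_path)
    # Note: if the file is already complete, many servers answer 416 and
    # urlopen raises urllib.error.HTTPError; callers may want to catch that.
    with urllib.request.urlopen(request) as response:
        # 206 means the range was honoured; any other success means a full body was sent.
        mode = "ab" if response.status == 206 else "wb"
        with open(dest_path, mode) as out:
            while True:
                chunk = response.read(chunk_size)
                if not chunk:
                    break
                out.write(chunk)
```

Calling resume_download again after an interrupted transfer picks up at the current file size instead of starting over.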

Citation

Academic Citation: When using the dataset in research, cite the paper "Common Voice: A Massively-Multilingual Speech Corpus".
