Common Voice Dataset

This dataset contains speech contributions made by the Common Voice community through the web platform. All contributions are included, regardless of validation status. The dataset is released roughly every six months and includes audio files with associated metadata such as age, gender, and accent.

Updated 5/14/2024

Description

Common Voice Dataset Overview

Dataset Description

  • Source: Speech contributions from the Common Voice community via the web platform.
  • Release Frequency: Approximately every six months.
  • Data Processing: All speech contributions, irrespective of validation status, are included. The dataset is packaged with the Common Voice Bundler tool and uploaded to S3.

Dataset Structure

  • File Format: Each downloaded .tar.gz contains the following layout:

    [lang].tar.gz/
    ├── clips/
    │   └── *.mp3 files
    ├── dev.tsv
    ├── invalidated.tsv
    ├── other.tsv
    ├── test.tsv
    ├── train.tsv
    ├── validated.tsv
    └── reported.tsv (as of Corpus 5.0)

  • TSV File Contents: Each .tsv file lists the audio files together with their original source sentences, a hashed client_id, validation data, and demographic information.
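
To check that a downloaded archive matches the layout above, its members can be counted by extension without extracting anything. This is a minimal sketch using Python's standard tarfile module; the function name summarize_archive and the example file name are illustrative assumptions, not part of any Common Voice tooling.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarize_archive(archive_path):
    """Count regular files in a .tar.gz archive by extension.

    Hypothetical helper for illustration: a Common Voice language
    archive would be expected to report many '.mp3' entries and a
    handful of '.tsv' entries.
    """
    counts = Counter()
    with tarfile.open(archive_path, "r:gz") as tar:
        for member in tar:
            # Skip directories and other non-file entries.
            if member.isfile():
                counts[PurePosixPath(member.name).suffix] += 1
    return dict(counts)
```

Calling summarize_archive("en.tar.gz") (an assumed local file name) returns a dictionary mapping each extension to its file count. Iterating over the archive reads the headers sequentially, so the clips never need to be unpacked to disk.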

Dataset Fields

  • Each row (audio clip) includes:
    • client_id
    • path
    • text
    • up_votes
    • down_votes
    • age
    • gender
    • accent
    • segment
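
The fields above can be read with Python's standard csv module. The sketch below assumes the column names exactly as listed here; a given release may name columns differently, so check the header row first. The helper names read_clips and net_positive are hypothetical, and the vote filter is only an illustration, not the official validation rule.

```python
import csv


def read_clips(tsv_file):
    """Yield one dict per clip row from an open TSV file object.

    Assumes the header contains 'up_votes' and 'down_votes' columns,
    as in the field list above.
    """
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Vote counts arrive as strings; treat empty cells as zero.
        row["up_votes"] = int(row["up_votes"] or 0)
        row["down_votes"] = int(row["down_votes"] or 0)
        yield row


def net_positive(rows):
    """Keep clips with more up-votes than down-votes (illustrative filter)."""
    return [r for r in rows if r["up_votes"] > r["down_votes"]]
```

With a real file, open it as open("validated.tsv", newline="", encoding="utf-8") and pass the file object to read_clips. Quoting is disabled because sentences may contain quotation marks that should be kept literally.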

Dataset Usage

  • ML Applications: The dataset is processed with the Mozilla Corpora Creator to generate test, train, and development splits.
  • Data Splits: Generation is nondeterministic to avoid duplication and demographic bias.

Access

  • Download Recommendation: For large files, use curl with resume support.
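
The same resume behaviour can be approximated in Python with an HTTP Range request. This is a rough sketch under stated assumptions: the function names range_header and resume_download are hypothetical, the download URL must come from the dataset's access page, and the server is assumed to support byte ranges.

```python
import os
import shutil
import urllib.request


def range_header(partial_path):
    """Return the Range header needed to continue a partial download."""
    offset = os.path.getsize(partial_path) if os.path.exists(partial_path) else 0
    # No header is needed when nothing has been downloaded yet.
    return {"Range": f"bytes={offset}-"} if offset else {}


def resume_download(url, dest, chunk_size=1 << 20):
    """Download url to dest, continuing from any bytes already on disk."""
    headers = range_header(dest)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request) as response:
        # 206 means the server honoured the range; otherwise start over.
        mode = "ab" if headers and response.status == 206 else "wb"
        with open(dest, mode) as out:
            shutil.copyfileobj(response, out, chunk_size)
```

If the local file is already complete, a server will typically answer with status 416, which urllib raises as an HTTPError; callers may want to catch that case. On the command line, curl's `-C -` option serves the same purpose.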

Topics

Speech Recognition
Crowdsourced Data

Source

Organization: github

Created: 7/17/2020