This dataset contains speech contributions recorded by the Common Voice community via its web platform; all contributions are included regardless of validation status. The dataset is released roughly every six months and includes audio files and associated metadata such as age, gender, and accent.
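Because unvalidated clips are included, downstream users often filter on the per-clip review votes (the `up_votes`/`down_votes` fields documented in the Common Voice dataset card further below). A minimal sketch, with an illustrative language config and vote rule:

```python
# Minimal filtering sketch: keep clips whose reviewer votes suggest a valid
# recording. The language config and vote rule are illustrative choices.
from datasets import load_dataset

cv = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")
kept = cv.filter(lambda ex: ex["up_votes"] > ex["down_votes"])
print(f"kept {len(kept)} of {len(cv)} clips")
```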
The 2021‑punctuation‑restoration dataset is primarily used to restore punctuation in the output of automatic speech recognition (ASR) systems. It contains Polish text and audio data, divided into two parts: WikiTalks (conversational) and WikiNews (informational). The dataset aims to improve the readability of ASR‑generated transcripts and may also enhance performance on other NLP tasks. It comprises 1,200 texts, totaling over 240,000 words, spoken by over 100 different native speakers. The dataset provides training and test splits, with the test set containing ASR transcriptions of texts from both sources (WikiNews and WikiTalks).
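For orientation, here is a sketch of how punctuation restoration over ASR output is commonly scored: words are aligned, the punctuation mark following each word is treated as a label, and per-mark matches are combined into an F1 score. The helper names and mark inventory below are illustrative, not part of this dataset's tooling.

```python
# Token-level scoring sketch for punctuation restoration. Assumes reference
# and hypothesis share the same word sequence; PUNCT and the aggregate F1
# scheme are illustrative choices, not prescribed by the dataset.
PUNCT = {".", ",", "?", "!", ":", ";", "-"}

def labels(tokens):
    """Map each word to the punctuation that follows it (or 'O' for none)."""
    out = []
    for tok in tokens:
        word = tok.rstrip("".join(PUNCT))
        out.append(tok[len(word):] or "O")
    return out

def punct_f1(reference, hypothesis):
    ref, hyp = labels(reference.split()), labels(hypothesis.split())
    assert len(ref) == len(hyp), "sketch assumes identical word sequences"
    tp = sum(r == h != "O" for r, h in zip(ref, hyp))
    fp = sum(h != "O" and r != h for r, h in zip(ref, hyp))
    fn = sum(r != "O" and r != h for r, h in zip(ref, hyp))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(punct_f1("Hello, how are you?", "Hello how are you?"))  # ~0.67
```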
A collection of human voice recordings featuring various singing styles (pitch, vowels, consonants, etc.). The dataset aims to simplify research on voice‑based music controllers and can be used to benchmark vocal feature detection algorithms (pitch detection, onset detection) as well as serve as training data for machine‑learning models.
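As a rough illustration of the benchmarking use case, the sketch below uses `librosa` (the file name is a placeholder): pYIN for frame-wise pitch and the default spectral-flux onset detector, whose outputs could be compared against the dataset's annotations.

```python
# Pitch and onset benchmarking sketch with librosa; the audio path is a
# placeholder for one recording from the dataset.
import librosa
import numpy as np

y, sr = librosa.load("vocal_take.wav", sr=None)  # hypothetical recording

# Frame-wise fundamental frequency over a typical singing range.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)
print("median F0 of voiced frames (Hz):", np.nanmedian(f0[voiced_flag]))

# Note onsets, e.g. to score against hand-labelled onset times.
onsets = librosa.onset.onset_detect(y=y, sr=sr, units="time")
print("first detected onsets (s):", onsets[:10])
```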
This dataset of public‑domain speech is used to train Mozilla's DeepSpeech model. It currently focuses on Spanish, with plans to add more languages, and includes 120 hours of clean Spanish speech plus 100 hours of clean speech from a single speaker, in LJSpeech format.
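Since the single‑speaker portion follows the LJSpeech layout, a reader can assume the conventional structure: a `metadata.csv` with pipe‑separated `id|transcription|normalized_transcription` rows and audio under `wavs/<id>.wav`. A minimal, hypothetical reader:

```python
# Reader sketch for an LJSpeech-style corpus layout; the corpus root is a
# placeholder, and the three-column pipe-separated format is the standard
# LJSpeech convention.
import csv
from pathlib import Path

root = Path("spanish_single_speaker")  # hypothetical corpus root

with open(root / "metadata.csv", encoding="utf-8", newline="") as f:
    for row in csv.reader(f, delimiter="|", quoting=csv.QUOTE_NONE):
        utt_id, raw_text = row[0], row[1]
        normalized = row[2] if len(row) > 2 else raw_text
        wav_path = root / "wavs" / f"{utt_id}.wav"
        print(wav_path, normalized)  # feed to a TTS/ASR pipeline from here
```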
The M‑AILABS Speech Dataset is the first large‑scale free dataset we provide for both speech recognition and speech synthesis training. The data are primarily derived from LibriVox and Project Gutenberg and comprise nearly a thousand hours of audio with aligned text files. Each transcribed segment ranges from 1 to 20 seconds. The texts were published between 1884 and 1964 and are in the public domain. The audio recordings, from the LibriVox project, are also public domain, except for the Ukrainian recordings, which are supplied by Nash Format or Gwara Media and are intended solely for machine‑learning use.
How2 is a multimodal dataset containing approximately 80,000 instructional videos (~2,000 hours) with English subtitles and summaries. About 300 hours of videos have been crowd‑translated into Portuguese and were used in the JSALT 2018 workshop. The dataset can be used for speech recognition, speech summarization, text summarization, and their multimodal extensions.
The LibriTTS Corpus with Forced Alignments dataset is a speech dataset for automatic speech recognition (ASR) and text‑to‑speech (TTS) tasks. It includes audio files, corresponding transcripts, phonemes, and their durations. The dataset provides pre‑processed alignment information so users do not need to run the Montreal Forced Aligner locally. A data collator is also provided to create training batches. The dataset is divided into several subsets (train, dev, test, etc.) corresponding to different subsets of LibriSpeech.
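The dataset's own collator should be preferred; purely to illustrate what such a collator does, here is a generic padding collator sketch in PyTorch. The field names (`input_values`, `phoneme_ids`, `durations`) are assumptions, not the dataset's documented schema.

```python
# Generic padding collator sketch (PyTorch). Field names are assumptions;
# the dataset ships its own collator, which this only approximates.
import torch

def collate(batch):
    """Pad variable-length audio and phoneme/duration sequences into tensors."""
    audio = [torch.as_tensor(ex["input_values"], dtype=torch.float32) for ex in batch]
    phones = [torch.as_tensor(ex["phoneme_ids"], dtype=torch.long) for ex in batch]
    durs = [torch.as_tensor(ex["durations"], dtype=torch.long) for ex in batch]
    return {
        "input_values": torch.nn.utils.rnn.pad_sequence(audio, batch_first=True),
        "phoneme_ids": torch.nn.utils.rnn.pad_sequence(phones, batch_first=True),
        "durations": torch.nn.utils.rnn.pad_sequence(durs, batch_first=True),
        "audio_lengths": torch.tensor([len(a) for a in audio]),
    }
```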
A repository containing various audio datasets, including speech, music, and audio mixture datasets. It includes speech datasets such as VCTK and LibriSpeech, music datasets such as StarNet, and audio mixture datasets such as Libri2Mix and Divide and Remaster (DnR).
This dataset comprises multiple sub‑datasets for speech language understanding research, including data from the SmartLights and SmartSpeaker assistants. The SmartLights dataset is used for cross‑validation and contains six intents for controlling light switches, brightness, or color changes. The SmartSpeaker dataset is used for training/testing, includes English and French versions, and is intended for controlling playback and music on smart speakers.
This dataset is used to train speech recognition models and contains 35 words divided into numeric, directional, command, animal, and other categories.
The KsponSpeech dataset contains 969 hours of Korean conversational speech recorded by approximately 2,000 native Korean speakers in clean environments. All data were created by recording dialogues between two people and manually transcribing the audio. Transcriptions provide both orthographic and phonetic versions, along with disfluency tags (e.g., filler words, repeated words, word fragments) to indicate spontaneous speech. The dataset is primarily used for automatic speech recognition tasks and has been publicly released on the Korean government open data platform.
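As a rough illustration of working with such transcripts, the sketch below reduces a tagged transcript to plain orthographic text. The tag conventions (dual `(A)/(B)` transcriptions, slash‑marked noise and filler tags) follow the KsponSpeech paper's description from memory and should be verified against the corpus documentation.

```python
# Transcript-cleaning sketch for KsponSpeech-style annotations. Treat the
# exact tag inventory as an assumption to check against the corpus docs.
import re

def clean(transcript: str) -> str:
    # Keep the orthographic side of dual transcriptions: (3시)/(세 시) -> 3시
    text = re.sub(r"\(([^)]*)\)/\([^)]*\)", r"\1", transcript)
    # Drop standalone noise tags such as b/, n/, l/, o/
    text = re.sub(r"\b[bnlo]/", "", text)
    # Drop remaining disfluency markers (filler '/', fragment '+', unclear '*')
    text = re.sub(r"[/*+]", "", text)
    return " ".join(text.split())

print(clean("b/ (3시)/(세 시)에 만나자 아/ 그래 o/"))
# -> "3시에 만나자 아 그래"
```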
GigaSpeech is an evolving, multi‑domain English speech recognition corpus created by Tsinghua University's Department of Electronic Engineering and partner institutions. It contains 10,000 hours of high‑quality manually transcribed audio for supervised training, and a total of 40,000 hours suitable for semi‑supervised and unsupervised training. The corpus is compiled from audiobooks, podcasts, and YouTube videos, covering both read and spontaneous speech styles across topics such as arts, science, and sports. The creation pipeline includes audio collection, text normalization, forced alignment, audio segmentation, and segment validation. GigaSpeech aims to advance speech recognition research and address the performance saturation of existing datasets.
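For reference, GigaSpeech is distributed on the Hugging Face Hub as a gated dataset; a minimal loading sketch follows (the `speechcolab/gigaspeech` repository id and `xs` config name are recalled from the Hub and worth double‑checking):

```python
# Loading sketch for the smallest GigaSpeech subset. The repo is gated, so
# an access token is required; config names may change between releases.
from datasets import load_dataset

gs = load_dataset("speechcolab/gigaspeech", "xs", split="train", use_auth_token=True)
sample = gs[0]
print(sample["text"])                    # transcript
print(sample["audio"]["sampling_rate"])  # decoded audio array + rate
```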
This repository provides full‑context label files for the VCTK corpus. These label files were created following the preprocessing steps in r9y9/deepvoice3_pytorch. The dataset includes both full and mono label files, detailing the segmentation and annotation format of the audio data.
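The label files follow the HTS convention of `start end label` lines; below is a minimal parser sketch, assuming HTK's 100 ns time units (an assumption worth verifying against the repository's preprocessing).

```python
# Parser sketch for HTS-style label files ("start end label" per line,
# times assumed to be in 100 ns HTK units).
def read_labels(path):
    segments = []
    with open(path) as f:
        for line in f:
            start, end, label = line.split(None, 2)
            # Convert 100 ns units to seconds.
            segments.append((int(start) * 1e-7, int(end) * 1e-7, label.strip()))
    return segments

for start, end, label in read_labels("p225_001.lab")[:5]:  # hypothetical file
    print(f"{start:.3f}-{end:.3f}s  {label}")
```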
---
annotations_creators:
- crowdsourced
language_creators:
- crowdsourced
license:
- cc0-1.0
multilinguality:
- multilingual
size_categories:
  ab:
  - 10K<n<100K
  ar:
  - 100K<n<1M
  as:
  - 1K<n<10K
  ast:
  - n<1K
  az:
  - n<1K
  ba:
  - 100K<n<1M
  bas:
  - 1K<n<10K
  be:
  - 100K<n<1M
  bg:
  - 1K<n<10K
  bn:
  - 100K<n<1M
  br:
  - 10K<n<100K
  ca:
  - 1M<n<10M
  ckb:
  - 100K<n<1M
  cnh:
  - 1K<n<10K
  cs:
  - 10K<n<100K
  cv:
  - 10K<n<100K
  cy:
  - 100K<n<1M
  da:
  - 1K<n<10K
  de:
  - 100K<n<1M
  dv:
  - 10K<n<100K
  el:
  - 10K<n<100K
  en:
  - 1M<n<10M
  eo:
  - 1M<n<10M
  es:
  - 1M<n<10M
  et:
  - 10K<n<100K
  eu:
  - 100K<n<1M
  fa:
  - 100K<n<1M
  fi:
  - 10K<n<100K
  fr:
  - 100K<n<1M
  fy-NL:
  - 10K<n<100K
  ga-IE:
  - 1K<n<10K
  gl:
  - 10K<n<100K
  gn:
  - 1K<n<10K
  ha:
  - 1K<n<10K
  hi:
  - 10K<n<100K
  hsb:
  - 1K<n<10K
  hu:
  - 10K<n<100K
  hy-AM:
  - 1K<n<10K
  ia:
  - 10K<n<100K
  id:
  - 10K<n<100K
  ig:
  - 1K<n<10K
  it:
  - 100K<n<1M
  ja:
  - 10K<n<100K
  ka:
  - 10K<n<100K
  kab:
  - 100K<n<1M
  kk:
  - 1K<n<10K
  kmr:
  - 10K<n<100K
  ky:
  - 10K<n<100K
  lg:
  - 100K<n<1M
  lt:
  - 10K<n<100K
  lv:
  - 1K<n<10K
  mdf:
  - n<1K
  mhr:
  - 100K<n<1M
  mk:
  - n<1K
  ml:
  - 1K<n<10K
  mn:
  - 10K<n<100K
  mr:
  - 10K<n<100K
  mrj:
  - 10K<n<100K
  mt:
  - 10K<n<100K
  myv:
  - 1K<n<10K
  nan-tw:
  - 10K<n<100K
  ne-NP:
  - n<1K
  nl:
  - 10K<n<100K
  nn-NO:
  - n<1K
  or:
  - 1K<n<10K
  pa-IN:
  - 1K<n<10K
  pl:
  - 100K<n<1M
  pt:
  - 100K<n<1M
  rm-sursilv:
  - 1K<n<10K
  rm-vallader:
  - 1K<n<10K
  ro:
  - 10K<n<100K
  ru:
  - 100K<n<1M
  rw:
  - 1M<n<10M
  sah:
  - 1K<n<10K
  sat:
  - n<1K
  sc:
  - 1K<n<10K
  sk:
  - 10K<n<100K
  skr:
  - 1K<n<10K
  sl:
  - 10K<n<100K
  sr:
  - 1K<n<10K
  sv-SE:
  - 10K<n<100K
  sw:
  - 100K<n<1M
  ta:
  - 100K<n<1M
  th:
  - 100K<n<1M
  ti:
  - n<1K
  tig:
  - n<1K
  tok:
  - 1K<n<10K
  tr:
  - 10K<n<100K
  tt:
  - 10K<n<100K
  tw:
  - n<1K
  ug:
  - 10K<n<100K
  uk:
  - 10K<n<100K
  ur:
  - 100K<n<1M
  uz:
  - 100K<n<1M
  vi:
  - 10K<n<100K
  vot:
  - n<1K
  yue:
  - 10K<n<100K
  zh-CN:
  - 100K<n<1M
  zh-HK:
  - 100K<n<1M
  zh-TW:
  - 100K<n<1M
source_datasets:
- extended|common_voice
task_categories:
- automatic-speech-recognition
task_ids: []
paperswithcode_id: common-voice
pretty_name: Common Voice Corpus 11.0
language_bcp47:
- ab
- ar
- as
- ast
- az
- ba
- bas
- be
- bg
- bn
- br
- ca
- ckb
- cnh
- cs
- cv
- cy
- da
- de
- dv
- el
- en
- eo
- es
- et
- eu
- fa
- fi
- fr
- fy-NL
- ga-IE
- gl
- gn
- ha
- hi
- hsb
- hu
- hy-AM
- ia
- id
- ig
- it
- ja
- ka
- kab
- kk
- kmr
- ky
- lg
- lt
- lv
- mdf
- mhr
- mk
- ml
- mn
- mr
- mrj
- mt
- myv
- nan-tw
- ne-NP
- nl
- nn-NO
- or
- pa-IN
- pl
- pt
- rm-sursilv
- rm-vallader
- ro
- ru
- rw
- sah
- sat
- sc
- sk
- skr
- sl
- sr
- sv-SE
- sw
- ta
- th
- ti
- tig
- tok
- tr
- tt
- tw
- ug
- uk
- ur
- uz
- vi
- vot
- yue
- zh-CN
- zh-HK
- zh-TW
extra_gated_prompt: By clicking on “Access repository” below, you also agree to not attempt to determine the identity of speakers in the Common Voice dataset.
---

# Dataset Card for Common Voice Corpus 11.0

## Table of Contents
- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards)
  - [Languages](#languages)
- [How to use](#how-to-use)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** https://commonvoice.mozilla.org/en/datasets
- **Repository:** https://github.com/common-voice/common-voice
- **Paper:** https://arxiv.org/abs/1912.06670
- **Leaderboard:** https://paperswithcode.com/dataset/common-voice
- **Point of Contact:** [Anton Lozhkov](mailto:anton@huggingface.co)

### Dataset Summary

The Common Voice dataset consists of a unique MP3 and corresponding text file. Many of the 24210 recorded hours in the dataset also include demographic metadata like age, sex, and accent that can help improve the accuracy of speech recognition engines.

The dataset currently consists of 16413 validated hours in 100 languages, but more voices and languages are always added. Take a look at the [Languages](https://commonvoice.mozilla.org/en/languages) page to request a language or start contributing.

### Supported Tasks and Leaderboards

The results for models trained on the Common Voice datasets are available via the [🤗 Autoevaluate Leaderboard](https://huggingface.co/spaces/autoevaluate/leaderboards?dataset=mozilla-foundation%2Fcommon_voice_11_0&only_verified=0&task=automatic-speech-recognition&config=ar&split=test&metric=wer)

### Languages

```
Abkhaz, Arabic, Armenian, Assamese, Asturian, Azerbaijani, Basaa, Bashkir, Basque, Belarusian, Bengali, Breton, Bulgarian, Cantonese, Catalan, Central Kurdish, Chinese (China), Chinese (Hong Kong), Chinese (Taiwan), Chuvash, Czech, Danish, Dhivehi, Dutch, English, Erzya, Esperanto, Estonian, Finnish, French, Frisian, Galician, Georgian, German, Greek, Guarani, Hakha Chin, Hausa, Hill Mari, Hindi, Hungarian, Igbo, Indonesian, Interlingua, Irish, Italian, Japanese, Kabyle, Kazakh, Kinyarwanda, Kurmanji Kurdish, Kyrgyz, Latvian, Lithuanian, Luganda, Macedonian, Malayalam, Maltese, Marathi, Meadow Mari, Moksha, Mongolian, Nepali, Norwegian Nynorsk, Odia, Persian, Polish, Portuguese, Punjabi, Romanian, Romansh Sursilvan, Romansh Vallader, Russian, Sakha, Santali (Ol Chiki), Saraiki, Sardinian, Serbian, Slovak, Slovenian, Sorbian, Upper, Spanish, Swahili, Swedish, Taiwanese (Minnan), Tamil, Tatar, Thai, Tigre, Tigrinya, Toki Pona, Turkish, Twi, Ukrainian, Urdu, Uyghur, Uzbek, Vietnamese, Votic, Welsh
```

## How to use

The `datasets` library allows you to load and pre-process your dataset in pure Python, at scale. The dataset can be downloaded and prepared in one call to your local drive by using the `load_dataset` function.
For example, to download the Hindi config, simply specify the corresponding language config name (i.e., "hi" for Hindi):

```python
from datasets import load_dataset

cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")
```

Using the `datasets` library, you can also stream the dataset on-the-fly by adding a `streaming=True` argument to the `load_dataset` function call. Loading a dataset in streaming mode loads individual samples of the dataset at a time, rather than downloading the entire dataset to disk.

```python
from datasets import load_dataset

cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True)

print(next(iter(cv_11)))
```

*Bonus*: create a [PyTorch dataloader](https://huggingface.co/docs/datasets/use_with_pytorch) directly with your own datasets (local/streamed).

### Local

```python
from datasets import load_dataset
from torch.utils.data import DataLoader
from torch.utils.data.sampler import BatchSampler, RandomSampler

cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train")
batch_sampler = BatchSampler(RandomSampler(cv_11), batch_size=32, drop_last=False)
dataloader = DataLoader(cv_11, batch_sampler=batch_sampler)
```

### Streaming

```python
from datasets import load_dataset
from torch.utils.data import DataLoader

cv_11 = load_dataset("mozilla-foundation/common_voice_11_0", "hi", split="train", streaming=True)
dataloader = DataLoader(cv_11, batch_size=32)
```

To find out more about loading and preparing audio datasets, head over to [hf.co/blog/audio-datasets](https://huggingface.co/blog/audio-datasets).

### Example scripts

Train your own CTC or Seq2Seq Automatic Speech Recognition models on Common Voice 11 with `transformers` - [here](https://github.com/huggingface/transformers/tree/main/examples/pytorch/speech-recognition).

## Dataset Structure

### Data Instances

A typical data point comprises the `path` to the audio file and its `sentence`. Additional fields include `accent`, `age`, `client_id`, `up_votes`, `down_votes`, `gender`, `locale` and `segment`.

```python
{
  'client_id': 'd59478fbc1ee646a28a3c652a119379939123784d99131b865a89f8b21c81f69276c48bd574b81267d9d1a77b83b43e6d475a6cfc79c232ddbca946ae9c7afc5',
  'path': 'et/clips/common_voice_et_18318995.mp3',
  'audio': {
    'path': 'et/clips/common_voice_et_18318995.mp3',
    'array': array([-0.00048828, -0.00018311, -0.00137329, ..., 0.00079346, 0.00091553, 0.00085449], dtype=float32),
    'sampling_rate': 48000
  },
  'sentence': 'Tasub kokku saada inimestega, keda tunned juba ammust ajast saati.',
  'up_votes': 2,
  'down_votes': 0,
  'age': 'twenties',
  'gender': 'male',
  'accent': '',
  'locale': 'et',
  'segment': ''
}
```

### Data Fields

`client_id` (`string`): An id for which client (voice) made the recording

`path` (`string`): The path to the audio file

`audio` (`dict`): A dictionary containing the path to the downloaded audio file, the decoded audio array, and the sampling rate. Note that when accessing the audio column: `dataset[0]["audio"]` the audio file is automatically decoded and resampled to `dataset.features["audio"].sampling_rate`. Decoding and resampling of a large number of audio files might take a significant amount of time. Thus it is important to first query the sample index before the `"audio"` column, *i.e.* `dataset[0]["audio"]` should **always** be preferred over `dataset["audio"][0]`.
`sentence` (`string`): The sentence the user was prompted to speak

`up_votes` (`int64`): How many upvotes the audio file has received from reviewers

`down_votes` (`int64`): How many downvotes the audio file has received from reviewers

`age` (`string`): The age of the speaker (e.g. `teens`, `twenties`, `fifties`)

`gender` (`string`): The gender of the speaker

`accent` (`string`): Accent of the speaker

`locale` (`string`): The locale of the speaker

`segment` (`string`): Usually an empty field

### Data Splits

The speech material has been subdivided into portions for dev, train, test, validated, invalidated, reported and other.

The validated data is data that has been reviewed and received enough upvotes to be considered high quality. The invalidated data is data that has been reviewed and received enough downvotes to be considered low quality. The reported data is data that has been reported, for different reasons. The other data is data that has not yet been reviewed. The dev, test and train splits are all drawn from data that has been reviewed and deemed of high quality.

## Data Preprocessing Recommended by Hugging Face

The following are data preprocessing steps advised by the Hugging Face team. They are accompanied by an example code snippet that shows how to put them to practice.

Many examples in this dataset have trailing quotation marks, e.g. _“the cat sat on the mat.”_. These trailing quotation marks do not change the actual meaning of the sentence, and it is near impossible to infer whether a sentence is a quotation or not from audio data alone. In these cases, it is advised to strip the quotation marks, leaving: _the cat sat on the mat_.

In addition, the majority of training sentences end in punctuation ( . or ? or ! ), whereas just a small proportion do not. In the dev set, **almost all** sentences end in punctuation. Thus, it is recommended to append a full-stop ( . ) to the end of the small number of training examples that do not end in punctuation.

```python
from datasets import load_dataset

ds = load_dataset("mozilla-foundation/common_voice_11_0", "en", use_auth_token=True)

def prepare_dataset(batch):
    """Function to preprocess the dataset with the .map method"""
    transcription = batch["sentence"]

    if transcription.startswith('"') and transcription.endswith('"'):
        # we can remove trailing quotation marks as they do not affect the transcription
        transcription = transcription[1:-1]

    if transcription[-1] not in [".", "?", "!"]:
        # append a full-stop to sentences that do not end in punctuation
        transcription = transcription + "."

    batch["sentence"] = transcription

    return batch

ds = ds.map(prepare_dataset, desc="preprocess dataset")
```

## Dataset Creation

### Curation Rationale

[Needs More Information]

### Source Data

#### Initial Data Collection and Normalization

[Needs More Information]

#### Who are the source language producers?

[Needs More Information]

### Annotations

#### Annotation process

[Needs More Information]

#### Who are the annotators?

[Needs More Information]

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.

## Considerations for Using the Data

### Social Impact of Dataset

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in the Common Voice dataset.
### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Public Domain, [CC-0](https://creativecommons.org/share-your-work/public-domain/cc0/)

### Citation Information

```
@inproceedings{commonvoice:2020,
  author = {Ardila, R. and Branson, M. and Davis, K. and Henretty, M. and Kohler, M. and Meyer, J. and Morais, R. and Saunders, L. and Tyers, F. M. and Weber, G.},
  title = {Common Voice: A Massively-Multilingual Speech Corpus},
  booktitle = {Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020)},
  pages = {4211--4215},
  year = 2020
}
```
This dataset consists of video recordings of people uttering different phrases. It was created at the State University of Nizhny Novgorod in Russia and is distinctive for its library of Russian phrases, most of which come from classic Russian literature and other publicly available texts. Participants sat in front of a phone or laptop screen and spoke the phrases from various distances. Each person in a video utters a specific phrase from the total phrase list. Videos are recorded in MP4 format.
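A minimal sketch for consuming the recordings, assuming nothing beyond standard MP4 files (the clip name is a placeholder):

```python
# Frame-iteration sketch with OpenCV over one MP4 recording.
import cv2

cap = cv2.VideoCapture("phrase_0001.mp4")  # hypothetical clip name
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # `frame` is a BGR image; hand it to a lip-reading / AV-speech model here
cap.release()
```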
This dataset is the original ASVspoof 2021 LA subset, containing two main features: 'label' and 'input_values'. The 'label' feature is a categorical label with two classes: 'fake' and 'real'. The 'input_values' feature is a sequence of floating‑point numbers. The dataset is split into training, validation, and test sets, each with specified sample counts and file sizes. The configuration name is 'default', and data files are assigned per split. The license is 'odc‑by', and the dataset name is 'ASVspoof 2021 LA'.
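A short inspection sketch, assuming the features described above; the repository id is a placeholder since the description does not state one:

```python
# Inspection sketch: class balance and one waveform length. The dataset id
# is a placeholder; 'label' and 'input_values' follow the description above.
from collections import Counter
from datasets import load_dataset

ds = load_dataset("path/to/asvspoof-2021-la", split="train")  # hypothetical id
print(Counter(ds["label"]))        # fake vs. real counts
print(len(ds[0]["input_values"]))  # length of one example's float sequence
```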