Dataset asset · Open Source Community · Speech Recognition · Big Data

GigaSpeech

GigaSpeech is an evolving, multi‑domain English speech recognition corpus created by Tsinghua University's Department of Electronic Engineering and partner institutions. It contains 10,000 hours of high‑quality manually transcribed audio for supervised training, and 40,000 hours of audio in total suitable for semi‑supervised and unsupervised training. The corpus is compiled from audiobooks, podcasts, and YouTube videos, covering both read and spontaneous speech across topics such as arts, science, and sports. The creation pipeline includes audio collection, text normalization, forced alignment, audio segmentation, and segment validation. GigaSpeech aims to advance speech recognition research and address the performance saturation of existing datasets.

Source: arXiv
Created: Jun 13, 2021
Updated: Jun 13, 2021
Availability: Linked source ready
Overview

Dataset description and usage context

GigaSpeech Dataset Overview

Dataset Version

  • Version: 1.0.0
  • Release Date: July 5, 2021

Dataset Download

  • Download Steps:
    1. Fill out the Google Form to obtain access permission.
    2. Choose one of the following options:
      • Option A: Follow the instructions in the reply email to obtain the original release version.
      • Option B: Retrieve the pre‑processed version from HuggingFace.
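
For Option B, the pre‑processed release can be loaded with the Hugging Face `datasets` library. A minimal sketch, assuming the Hub repository id `speechcolab/gigaspeech` and lowercase subset config names; access approval and a logged‑in Hugging Face token are still required, and `subset_config` is a small local helper added here for illustration:

```python
def subset_config(name: str) -> str:
    """Map a subset label such as "XS" or "xl" to a lowercase config name."""
    config = name.lower()
    if config not in {"xs", "s", "m", "l", "xl"}:
        raise ValueError(f"unknown GigaSpeech subset: {name!r}")
    return config

def load_subset(name: str = "XS"):
    """Download one GigaSpeech subset (requires prior access approval)."""
    from datasets import load_dataset  # needs `pip install datasets`
    return load_dataset("speechcolab/gigaspeech", subset_config(name), split="train")
```

Nothing is downloaded until `load_subset` is actually called, so the subset name can be validated locally first.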

Dataset Details

Audio Sources

  • Language: English
  • Total Duration: 33,005 hours (including 10,000 hours of high‑quality manual transcription)
Audio Source | Transcribed Hours | Total Hours | Acoustic Conditions
Audiobooks   | 2,655             | 11,982      | Reading; various ages & accents
Podcasts     | 3,498             | 9,254       | Clean or with background music; indoor; close-field; spontaneous; various ages & accents
YouTube      | 3,845             | 11,768      | Clean & noisy; indoor & outdoor; close-field & far-field; reading & spontaneous; various ages & accents
Total        | 10,000            | 33,005      |

Transcribed Training Subsets

Subset | Hours  | Notes
XS     | 10     | System building & debugging
S      | 250    | Quick research experiments
M      | 1,000  | Large-scale research experiments
L      | 2,500  | Medium-scale industrial experiments
XL     | 10,000 | Large-scale industrial experiments

Transcribed Evaluation Subsets

Subset | Hours | Notes
Dev    | 12    | Randomly selected from crawled podcast and YouTube data
Test   | 40    | Partly randomly selected from crawled data; partly collected manually via other channels for better coverage

Data Preparation Guide

  • Data preparation scripts: Scripts are provided for various ASR toolkits; e.g., Kaldi scripts are located in the toolkits/kaldi directory.

Metadata Files

  • Filename: GigaSpeech.json
  • Content: Includes audio file paths, segments, transcription texts, etc.
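
As a sketch of consuming this metadata, the snippet below walks the JSON and collects the segments belonging to one subset. The field names (`audios`, `segments`, `begin_time`, `end_time`, `text_tn`, `subsets`) and the in‑memory sample are illustrative assumptions; verify them against your copy of GigaSpeech.json:

```python
import json

def segments_for_subset(metadata: dict, subset: str):
    """Yield (audio_path, begin, end, text) for segments tagged with `subset`."""
    for audio in metadata.get("audios", []):
        for seg in audio.get("segments", []):
            if subset in seg.get("subsets", []):
                yield audio["path"], seg["begin_time"], seg["end_time"], seg["text_tn"]

# In practice: metadata = json.load(open("GigaSpeech.json"))
# Tiny in-memory sample mirroring the assumed structure:
sample = {
    "audios": [{
        "path": "audio/podcast/P0001.opus",
        "segments": [{
            "sid": "POD0000000001_S0000001",
            "begin_time": 0.0, "end_time": 5.2,
            "subsets": ["{XL}", "{XS}"],
            "text_tn": "HELLO AND WELCOME <COMMA> EVERYONE <PERIOD>",
        }],
    }]
}
rows = list(segments_for_subset(sample, "{XS}"))
```

Iterating lazily with a generator keeps memory flat even though the real metadata file describes tens of thousands of hours of audio.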

Audio Processing

  • Sample Rate: 16 kHz
  • Format: Opus compressed format

Text Pre‑processing

  • Punctuation: Four punctuation marks are preserved in the transcripts: comma, period, question mark, and exclamation point.
  • Noise tags: Non-speech segments are marked with noise tags; discarding these segments during training is recommended.
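
A minimal sketch of that recommendation, assuming the tag spellings `<SIL>`, `<NOISE>`, `<MUSIC>`, and `<OTHER>` used in GigaSpeech transcripts: keep a segment only if something other than a tag remains.

```python
GARBAGE_TAGS = {"<SIL>", "<NOISE>", "<MUSIC>", "<OTHER>"}

def is_trainable(text: str) -> bool:
    """Keep a segment only if it contains at least one real word token."""
    tokens = text.split()
    return any(tok not in GARBAGE_TAGS for tok in tokens) if tokens else False

# Filtering example: only the middle utterance survives.
utts = ["<NOISE>", "HELLO WORLD <PERIOD>", "<MUSIC> <SIL>"]
kept = [u for u in utts if is_trainable(u)]
```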

Text Post‑processing

  • Fillers: Removing fillers before WER calculation is suggested, to ensure fair comparison across toolkits.
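
This post-processing can be sketched as a small scoring-time normalizer followed by a standard word-level edit distance; the filler list and punctuation-token spellings below are illustrative assumptions, not the official scoring configuration:

```python
PUNCT_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
FILLERS = {"UH", "UM", "AH", "ER", "MM", "HM"}  # illustrative, not official

def for_scoring(text: str) -> list:
    """Drop punctuation tokens and fillers before computing WER."""
    return [t for t in text.upper().split()
            if t not in PUNCT_TAGS and t not in FILLERS]

def wer(ref: list, hyp: list) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution or match
    return d[-1] / max(len(ref), 1)

ref = for_scoring("UM HELLO WORLD <PERIOD>")
hyp = for_scoring("HELLO UH WORD")
error = wer(ref, hyp)  # one substitution over two reference words -> 0.5
```

Stripping the same tokens from both reference and hypothesis is what makes the comparison fair: a system is not penalized for choices the normalization already removed.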

Citation

  • Please cite the following paper:
    @inproceedings{GigaSpeech2021,
      title={GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio},
      booktitle={Proc. Interspeech 2021},
      year={2021},
      author={Guoguo Chen and Shuzhou Chai and Guanbo Wang and Jiayu Du and Wei-Qiang Zhang and Chao Weng and Dan Su and Daniel Povey and Jan Trmal and Junbo Zhang and Mingjie Jin and Sanjeev Khudanpur and Shinji Watanabe and Shuaijiang Zhao and Wei Zou and Xiangang Li and Xuchen Yao and Yongqing Wang and Yujun Wang and Zhao You and Zhiyong Yan}
    }
    