GigaSpeech
GigaSpeech is an evolving, multi‑domain English speech recognition corpus created by Tsinghua University's Department of Electronic Engineering and partner institutions. It contains 10,000 hours of high‑quality manually transcribed audio for supervised training, and a total of 40,000 hours suitable for semi‑supervised and unsupervised training. The corpus is compiled from audiobooks, podcasts, and YouTube videos, covering both read and spontaneous speech styles across topics such as arts, science, and sports. The creation pipeline includes audio collection, text normalization, forced alignment, audio segmentation, and segment validation. GigaSpeech aims to advance speech recognition research and address the performance saturation of existing datasets.
Description
GigaSpeech Dataset Overview
Dataset Version
- Version: 1.0.0
- Release Date: July 5, 2021
Dataset Download
- Download Steps:
  - Fill out the Google Form to obtain access permission.
  - Choose one of the following options:
    - Option A: Follow the instructions in the reply email to obtain the original release version.
    - Option B: Retrieve the pre‑processed version from HuggingFace.
Dataset Details
Audio Sources
- Language: English
- Total Duration: 33,005 hours (including 10,000 hours of high‑quality manual transcription)
| Audio Source | Transcribed Hours | Total Hours | Acoustic Conditions |
|---|---|---|---|
| Audiobooks | 2,655 | 11,982 | Reading; various ages & accents |
| Podcasts | 3,498 | 9,254 | Clean or with background music; indoor; close‑field; spontaneous; various ages & accents |
| YouTube | 3,845 | 11,768 | Clean & noisy; indoor & outdoor; close‑field & far‑field; reading & spontaneous; various ages & accents |
| Total | 10,000 | 33,005 | — |
Transcribed Training Subsets
| Subset | Hours | Notes |
|---|---|---|
| XS | 10 | System building & debugging |
| S | 250 | Quick research experiments |
| M | 1,000 | Large‑scale research experiments |
| L | 2,500 | Medium‑scale industrial experiments |
| XL | 10,000 | Large‑scale industrial experiments |
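As a rough illustration of how the subset sizes can guide experiment planning, the sketch below picks the smallest training subset that covers a given hours budget. The dictionary mirrors the table above; the helper name `smallest_subset_for` is hypothetical and is not part of any GigaSpeech tooling.

```python
# Hours per transcribed training subset, copied from the table above.
TRAIN_SUBSET_HOURS = {"XS": 10, "S": 250, "M": 1000, "L": 2500, "XL": 10000}


def smallest_subset_for(target_hours: float) -> str:
    """Return the smallest subset with at least `target_hours` of audio.

    Hypothetical helper, for illustration only.
    """
    # Walk the subsets from smallest to largest and stop at the first fit.
    for name, hours in sorted(TRAIN_SUBSET_HOURS.items(), key=lambda kv: kv[1]):
        if hours >= target_hours:
            return name
    raise ValueError(f"no subset provides {target_hours} hours")


if __name__ == "__main__":
    print(smallest_subset_for(100))
```

A budget of 100 hours, for example, does not fit in XS (10 hours), so the helper moves up to S (250 hours).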
Transcribed Evaluation Subsets
| Subset | Hours | Notes |
|---|---|---|
| Dev | 12 | Randomly selected from crawled podcast and YouTube data |
| Test | 40 | Partly randomly selected from the crawled data; partly collected manually through other channels for better coverage |
Data Preparation Guide
- Data preparation scripts: Scripts are provided for various ASR toolkits, e.g., Kaldi scripts located in the toolkits/kaldi directory.
Metadata Files
- Filename: GigaSpeech.json
- Content: Includes audio file paths, segments, transcription texts, etc.
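The sketch below shows one way the metadata might be walked to list segments and total their duration. It runs on a tiny in-memory stand-in rather than the real file. The field names (`audios`, `path`, `segments`, `sid`, `begin_time`, `end_time`, `text_tn`, `subsets`), the subset tag spelling, and the example path are assumptions for illustration and should be checked against the actual GigaSpeech.json.

```python
import json

# A tiny stand-in for GigaSpeech.json. All field names and values here are
# assumptions for illustration; verify them against the real metadata file.
EXAMPLE_METADATA = json.loads("""
{
  "audios": [
    {
      "path": "audio/example/AUD0000000001.opus",
      "segments": [
        {"sid": "AUD0000000001_S0000001", "begin_time": 0.0, "end_time": 2.5,
         "text_tn": "HELLO WORLD", "subsets": ["{XS}", "{XL}"]},
        {"sid": "AUD0000000001_S0000002", "begin_time": 2.5, "end_time": 4.0,
         "text_tn": "GOOD MORNING", "subsets": ["{XL}"]}
      ]
    }
  ]
}
""")


def iter_segments(metadata, subset=None):
    """Yield (audio_path, segment) pairs, optionally filtered by subset tag."""
    for audio in metadata.get("audios", []):
        for segment in audio.get("segments", []):
            if subset is None or subset in segment.get("subsets", []):
                yield audio["path"], segment


def total_hours(metadata, subset=None):
    """Sum segment durations (assumed to be in seconds) and convert to hours."""
    seconds = sum(
        seg["end_time"] - seg["begin_time"]
        for _, seg in iter_segments(metadata, subset)
    )
    return seconds / 3600.0


if __name__ == "__main__":
    for path, seg in iter_segments(EXAMPLE_METADATA, subset="{XS}"):
        print(path, seg["sid"], seg["text_tn"])
```

For the real corpus, `EXAMPLE_METADATA` would be replaced by the parsed contents of GigaSpeech.json. The full file is likely large, so a streaming JSON parser may be preferable to loading it in one call.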
Audio Processing
- Sample Rate: 16 kHz
- Format: Opus compressed format
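Given the 16 kHz sample rate, segment boundaries can be mapped to sample indices with simple arithmetic. The sketch below assumes the audio has already been decoded from Opus to a 16 kHz waveform (decoding needs an external decoder and is not shown) and that segment boundaries are given in seconds. The function name is hypothetical.

```python
SAMPLE_RATE_HZ = 16_000  # from the "Sample Rate" entry above


def segment_sample_range(begin_time_s, end_time_s, sample_rate_hz=SAMPLE_RATE_HZ):
    """Convert segment boundaries in seconds to (start, stop) sample indices.

    Hypothetical helper; assumes times are measured from the start of the file.
    """
    if begin_time_s < 0 or end_time_s < begin_time_s:
        raise ValueError("expected 0 <= begin_time_s <= end_time_s")
    start = round(begin_time_s * sample_rate_hz)
    stop = round(end_time_s * sample_rate_hz)
    return start, stop


if __name__ == "__main__":
    # A segment from 1.5 s to 3.0 s covers samples 24000 up to 48000.
    print(segment_sample_range(1.5, 3.0))
```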
Text Pre‑processing
- Punctuation: Four punctuation tags are preserved in the transcripts (<COMMA>, <PERIOD>, <QUESTIONMARK>, <EXCLAMATIONPOINT>).
- Noise tags: Non‑speech segments are marked with tags; discarding them during training is recommended.
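A minimal sketch of this pre-processing step might look as follows. The tag spellings (`<COMMA>`, `<PERIOD>`, `<QUESTIONMARK>`, `<EXCLAMATIONPOINT>` for punctuation; `<SIL>`, `<MUSIC>`, `<NOISE>`, `<OTHER>` for non-speech) are assumptions about how the corpus marks these events and should be confirmed against the released transcripts. The function names are hypothetical.

```python
# Assumed tag spellings; confirm against the released transcripts.
PUNCTUATION_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
NOISE_TAGS = {"<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"}


def strip_punctuation_tags(text):
    """Remove punctuation tags from a whitespace-tokenized transcript."""
    return " ".join(tok for tok in text.split() if tok not in PUNCTUATION_TAGS)


def prepare_training_texts(texts, keep_punctuation=True):
    """Drop empty or noise-tagged utterances; optionally strip punctuation tags."""
    kept = []
    for text in texts:
        tokens = text.split()
        # Discard empty transcripts and anything carrying a non-speech tag.
        if not tokens or any(tok in NOISE_TAGS for tok in tokens):
            continue
        kept.append(text if keep_punctuation else strip_punctuation_tags(text))
    return kept


if __name__ == "__main__":
    samples = ["HELLO <COMMA> WORLD <PERIOD>", "<MUSIC>", "HOW ARE YOU <QUESTIONMARK>"]
    print(prepare_training_texts(samples, keep_punctuation=False))
```

Whether to keep the punctuation tags depends on the task: a plain ASR system would typically strip them, while a model that predicts punctuation would keep them.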
Text Post‑processing
- Fillers: Removing conversational fillers from both reference and hypothesis before WER calculation is suggested, to ensure fair comparison across toolkits.
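The effect of filler removal on WER can be sketched as below. The filler list is a small illustrative sample, not the corpus's official list, and both function names are hypothetical. WER is computed here as word-level edit distance divided by the reference length.

```python
# Illustrative filler list (an assumption); the official scoring tools may differ.
FILLERS = {"UH", "UM", "EH", "MM", "HM", "AH", "HUH", "ER"}


def remove_fillers(text):
    """Uppercase the text and drop filler words."""
    return " ".join(tok for tok in text.upper().split() if tok not in FILLERS)


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    ref, hyp = "UM I THINK SO", "I THINK SO"
    print(word_error_rate(ref, hyp))                                  # fillers kept
    print(word_error_rate(remove_fillers(ref), remove_fillers(hyp)))  # fillers removed
```

In this toy case the only error is the missing filler, so the WER falls from 0.25 to 0.0 once fillers are removed from both sides. This shows why toolkits that treat fillers differently are hard to compare without this step.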
Citation
- Please cite the following paper:
@inproceedings{GigaSpeech2021,
  title={GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio},
  booktitle={Proc. Interspeech 2021},
  year={2021},
  author={Guoguo Chen and Shuzhou Chai and Guanbo Wang and Jiayu Du and Wei-Qiang Zhang and Chao Weng and Dan Su and Daniel Povey and Jan Trmal and Junbo Zhang and Mingjie Jin and Sanjeev Khudanpur and Shinji Watanabe and Shuaijiang Zhao and Wei Zou and Xiangang Li and Xuchen Yao and Yongqing Wang and Yujun Wang and Zhao You and Zhiyong Yan}
}
Source
Organization: arXiv
Created: 6/13/2021