Dataset asset · Open Source Community · Speech Recognition · Big Data

GigaSpeech

GigaSpeech is an evolving, multi‑domain English speech recognition corpus created by Tsinghua University's Department of Electronic Engineering and partner institutions. It contains 10,000 hours of high‑quality manually transcribed audio for supervised training, and 40,000 hours of audio in total suitable for semi‑supervised and unsupervised training. The corpus is compiled from audiobooks, podcasts, and YouTube videos, covering both read and spontaneous speech across topics such as arts, science, and sports. The creation pipeline includes audio collection, text normalization, forced alignment, audio segmentation, and segment validation. GigaSpeech aims to advance speech recognition research and address the performance saturation of existing datasets.

Source: arXiv
Created: Jun 13, 2021
Updated: Jun 13, 2021
Availability: Linked source ready
Overview

Dataset description and usage context

GigaSpeech Dataset Overview

Dataset Version

  • Version: 1.0.0
  • Release Date: July 5, 2021

Dataset Download

  • Download Steps:
    1. Fill out the Google Form to obtain access permission.
    2. Choose one of the following options:
      • Option A: Follow the instructions in the reply email to obtain the original release version.
      • Option B: Retrieve the pre‑processed version from HuggingFace.
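
For Option B, the pre‑processed release can be loaded with the Hugging Face `datasets` library. A minimal sketch, assuming the Hub repository id `speechcolab/gigaspeech` and lowercase subset config names; access approval and a logged‑in Hugging Face token are still required, and `subset_config` is a small local helper added here for illustration:

```python
def subset_config(name: str) -> str:
    """Map a subset label such as "XS" or "xl" to a lowercase config name."""
    config = name.lower()
    if config not in {"xs", "s", "m", "l", "xl"}:
        raise ValueError(f"unknown GigaSpeech subset: {name!r}")
    return config

def load_subset(name: str = "XS"):
    """Download one GigaSpeech subset (requires prior access approval)."""
    from datasets import load_dataset  # needs `pip install datasets`
    return load_dataset("speechcolab/gigaspeech", subset_config(name), split="train")
```

Nothing is downloaded until `load_subset` is actually called, so the subset name can be validated locally first.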

Dataset Details

Audio Sources

  • Language: English
  • Total Duration: 33,005 hours (including 10,000 hours of high‑quality manual transcription)
Audio Source | Transcribed Hours | Total Hours | Acoustic Conditions
Audiobooks   | 2,655             | 11,982      | Reading; various ages & accents
Podcasts     | 3,498             | 9,254       | Clean or with background music; indoor; close-field; spontaneous; various ages & accents
YouTube      | 3,845             | 11,768      | Clean & noisy; indoor & outdoor; close-field & far-field; reading & spontaneous; various ages & accents
Total        | 10,000            | 33,005      |

Transcribed Training Subsets

Subset | Hours  | Notes
XS     | 10     | System building & debugging
S      | 250    | Quick research experiments
M      | 1,000  | Large-scale research experiments
L      | 2,500  | Medium-scale industrial experiments
XL     | 10,000 | Large-scale industrial experiments

Transcribed Evaluation Subsets

Subset | Hours | Notes
Dev    | 12    | Randomly selected from crawled podcast and YouTube data
Test   | 40    | Partly randomly selected from crawled data; partly collected manually via other channels for better coverage

Data Preparation Guide

  • Data preparation scripts: Scripts are provided for various ASR toolkits; e.g., Kaldi scripts are located in the toolkits/kaldi directory.

Metadata Files

  • Filename: GigaSpeech.json
  • Content: Includes audio file paths, segments, transcription texts, etc.
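
As a sketch of consuming this metadata, the snippet below walks the JSON and collects the segments belonging to one subset. The field names (`audios`, `segments`, `begin_time`, `end_time`, `text_tn`, `subsets`) and the in‑memory sample are illustrative assumptions; verify them against your copy of GigaSpeech.json:

```python
import json

def segments_for_subset(metadata: dict, subset: str):
    """Yield (audio_path, begin, end, text) for segments tagged with `subset`."""
    for audio in metadata.get("audios", []):
        for seg in audio.get("segments", []):
            if subset in seg.get("subsets", []):
                yield audio["path"], seg["begin_time"], seg["end_time"], seg["text_tn"]

# In practice: metadata = json.load(open("GigaSpeech.json"))
# Tiny in-memory sample mirroring the assumed structure:
sample = {
    "audios": [{
        "path": "audio/podcast/P0001.opus",
        "segments": [{
            "sid": "POD0000000001_S0000001",
            "begin_time": 0.0, "end_time": 5.2,
            "subsets": ["{XL}", "{XS}"],
            "text_tn": "HELLO AND WELCOME <COMMA> EVERYONE <PERIOD>",
        }],
    }]
}
rows = list(segments_for_subset(sample, "{XS}"))
```

Iterating lazily with a generator keeps memory flat even though the real metadata file describes tens of thousands of hours of audio.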

Audio Processing

  • Sample Rate: 16 kHz
  • Format: Opus compressed format

Text Pre‑processing

  • Punctuation: Four punctuation marks are preserved in the transcripts: comma, period, question mark, and exclamation point.
  • Noise tags: Non-speech segments are marked with noise tags; discarding these segments during training is recommended.
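
A minimal sketch of that recommendation, assuming the tag spellings `<SIL>`, `<NOISE>`, `<MUSIC>`, and `<OTHER>` used in GigaSpeech transcripts: keep a segment only if something other than a tag remains.

```python
GARBAGE_TAGS = {"<SIL>", "<NOISE>", "<MUSIC>", "<OTHER>"}

def is_trainable(text: str) -> bool:
    """Keep a segment only if it contains at least one real word token."""
    tokens = text.split()
    return any(tok not in GARBAGE_TAGS for tok in tokens) if tokens else False

# Filtering example: only the middle utterance survives.
utts = ["<NOISE>", "HELLO WORLD <PERIOD>", "<MUSIC> <SIL>"]
kept = [u for u in utts if is_trainable(u)]
```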

Text Post‑processing

  • Fillers: Removing fillers before WER calculation is suggested, to ensure fair comparison across toolkits.
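
This post-processing can be sketched as a small scoring-time normalizer followed by a standard word-level edit distance; the filler list and punctuation-token spellings below are illustrative assumptions, not the official scoring configuration:

```python
PUNCT_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
FILLERS = {"UH", "UM", "AH", "ER", "MM", "HM"}  # illustrative, not official

def for_scoring(text: str) -> list:
    """Drop punctuation tokens and fillers before computing WER."""
    return [t for t in text.upper().split()
            if t not in PUNCT_TAGS and t not in FILLERS]

def wer(ref: list, hyp: list) -> float:
    """Word error rate: word-level Levenshtein distance / reference length."""
    d = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        prev, d[0] = d[0], i
        for j, h in enumerate(hyp, 1):
            prev, d[j] = d[j], min(d[j] + 1,         # deletion
                                   d[j - 1] + 1,     # insertion
                                   prev + (r != h))  # substitution or match
    return d[-1] / max(len(ref), 1)

ref = for_scoring("UM HELLO WORLD <PERIOD>")
hyp = for_scoring("HELLO UH WORD")
error = wer(ref, hyp)  # one substitution over two reference words -> 0.5
```

Stripping the same tokens from both reference and hypothesis is what makes the comparison fair: a system is not penalized for choices the normalization already removed.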

Citation

  • Please cite the following paper:
    @inproceedings{GigaSpeech2021,
      title={GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio},
      booktitle={Proc. Interspeech 2021},
      year={2021},
      author={Guoguo Chen and Shuzhou Chai and Guanbo Wang and Jiayu Du and Wei-Qiang Zhang and Chao Weng and Dan Su and Daniel Povey and Jan Trmal and Junbo Zhang and Mingjie Jin and Sanjeev Khudanpur and Shinji Watanabe and Shuaijiang Zhao and Wei Zou and Xiangang Li and Xuchen Yao and Yongqing Wang and Yujun Wang and Zhao You and Zhiyong Yan}
    }
    