GigaSpeech

GigaSpeech is an evolving, multi‑domain English speech recognition corpus created by Tsinghua University's Department of Electronic Engineering and partner institutions. It contains 10,000 hours of high‑quality manually transcribed audio for supervised training, and a total of 40,000 hours suitable for semi‑supervised and unsupervised training. The corpus is compiled from audiobooks, podcasts, and YouTube videos, covering both read and spontaneous speech styles across topics such as arts, science, and sports. The creation pipeline includes audio collection, text normalization, forced alignment, audio segmentation, and segment validation. GigaSpeech aims to advance speech recognition research and address the performance saturation of existing datasets.

Updated 6/13/2021
arXiv

Description

GigaSpeech Dataset Overview

Dataset Version

  • Version: 1.0.0
  • Release Date: July 5, 2021

Dataset Download

  • Download Steps:
    1. Fill out the Google Form to obtain access permission.
    2. Choose one of the following options:
      • Option A: Follow the instructions in the reply email to obtain the original release version.
      • Option B: Retrieve the pre‑processed version from HuggingFace.

Dataset Details

Audio Sources

  • Language: English
  • Total Duration: 33,005 hours (including 10,000 hours of high‑quality manual transcription)
Audio Source | Transcribed Hours | Total Hours | Acoustic Conditions
Audiobooks   | 2,655             | 11,982      | Reading; various ages & accents
Podcasts     | 3,498             | 9,254       | Clean or with background music; indoor; close-field; spontaneous; various ages & accents
YouTube      | 3,845             | 11,768      | Clean & noisy; indoor & outdoor; close-field & far-field; reading & spontaneous; various ages & accents
Total        | 10,000            | 33,005      |
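
As a quick illustration of how the per-source figures relate, the following sketch computes the share of each source that is transcribed. The hour values are copied from the table above; the function and variable names are made up for this example.

```python
# Hours per source, copied from the table above: (transcribed, total).
SOURCES = {
    "Audiobooks": (2655, 11982),
    "Podcasts": (3498, 9254),
    "YouTube": (3845, 11768),
}


def transcribed_share(transcribed_hours, total_hours):
    """Return the fraction of a source's audio that has transcriptions."""
    return transcribed_hours / total_hours


shares = {name: transcribed_share(t, tot) for name, (t, tot) in SOURCES.items()}
sum_transcribed = sum(t for t, _ in SOURCES.values())
sum_total = sum(tot for _, tot in SOURCES.values())

for name, share in shares.items():
    print(f"{name}: {share:.1%} transcribed")
print(f"Row sums: {sum_transcribed} transcribed, {sum_total} total")
```

The three rows sum to 9,998 and 33,004 hours, slightly below the listed totals of 10,000 and 33,005, which would be consistent with the per-source figures being rounded.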

Transcribed Training Subsets

Subset | Hours  | Notes
XS     | 10     | System building & debugging
S      | 250    | Quick research experiments
M      | 1,000  | Large-scale research experiments
L      | 2,500  | Medium-scale industrial experiments
XL     | 10,000 | Large-scale industrial experiments

Transcribed Evaluation Subsets

Subset | Hours | Notes
Dev    | 12    | Randomly selected from crawled podcast and YouTube data
Test   | 40    | Partly randomly selected from crawled data; partly collected manually through other channels for better coverage

Data Preparation Guide

  • Data preparation scripts: Scripts are provided for various ASR toolkits; for example, the Kaldi scripts are located in the toolkits/kaldi directory.

Metadata Files

  • Filename: GigaSpeech.json
  • Content: Includes audio file paths, segments, transcription texts, etc.
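
A minimal sketch of walking such a metadata file to collect the segments of one training subset. The key names (`audios`, `path`, `segments`, `begin_time`, `end_time`, `text`, `subsets`) and the sample record are assumptions invented for this illustration; the real GigaSpeech.json should be inspected before relying on any of them.

```python
import json

# Hypothetical miniature of GigaSpeech.json. All key names and values here
# are assumptions for illustration only; check them against the real file.
SAMPLE_METADATA = json.loads("""
{
  "audios": [
    {
      "path": "audio/example_0001.opus",
      "segments": [
        {"begin_time": 0.0, "end_time": 4.2, "text": "HELLO WORLD", "subsets": ["XS", "XL"]},
        {"begin_time": 4.2, "end_time": 9.0, "text": "SECOND SEGMENT", "subsets": ["XL"]}
      ]
    }
  ]
}
""")


def iter_segments(metadata, subset):
    """Yield (audio_path, begin, end, text) for segments tagged with `subset`."""
    for audio in metadata["audios"]:
        for seg in audio["segments"]:
            if subset in seg["subsets"]:
                yield audio["path"], seg["begin_time"], seg["end_time"], seg["text"]


xs_segments = list(iter_segments(SAMPLE_METADATA, "XS"))
xl_seconds = sum(end - begin for _, begin, end, _ in iter_segments(SAMPLE_METADATA, "XL"))

print(xs_segments)
print(f"XL duration in the sample: {xl_seconds:.1f} s")
```

For the full file, replacing the inline sample with `json.load` on an open file handle would be the natural next step, memory permitting.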

Audio Processing

  • Sample Rate: 16 kHz
  • Format: Opus compressed format
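
To give a rough sense of why a compressed format matters at this scale, the sketch below estimates the size of the audio if it were stored as uncompressed PCM at the stated 16 kHz sample rate. The 16-bit sample width and mono channel count are assumptions made for this estimate and are not stated above.

```python
SAMPLE_RATE_HZ = 16_000   # from the text above
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM
CHANNELS = 1              # assumption: mono audio


def pcm_bytes(hours):
    """Size in bytes of `hours` of uncompressed audio under the assumptions above."""
    seconds = hours * 3600
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


xl_bytes = pcm_bytes(10_000)  # the 10,000 transcribed hours
print(f"10,000 hours as raw PCM: {xl_bytes / 1e12:.3f} TB")
```

Under these assumptions the transcribed portion alone would occupy a little over one terabyte uncompressed.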

Text Pre‑processing

  • Punctuation: Four punctuation tags are preserved (<COMMA>, <PERIOD>, <QUESTIONMARK>, <EXCLAMATIONPOINT>).
  • Noise tags: These mark non-speech segments; discarding such segments during training is recommended.
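
The two rules above could be applied to transcripts roughly as follows. This is a sketch, not the corpus's own tooling: it assumes the punctuation tags are spelled `<COMMA>`, `<PERIOD>`, `<QUESTIONMARK>`, and `<EXCLAMATIONPOINT>`, and that noise tags look like `<SIL>`, `<MUSIC>`, `<NOISE>`, and `<OTHER>`. The helper names are hypothetical.

```python
# Assumed tag spellings; adjust them to match the actual transcripts.
PUNCTUATION_TAGS = {"<COMMA>", "<PERIOD>", "<QUESTIONMARK>", "<EXCLAMATIONPOINT>"}
NOISE_TAGS = {"<SIL>", "<MUSIC>", "<NOISE>", "<OTHER>"}


def is_non_speech(text):
    """True if the transcript contains any of the assumed noise tags."""
    return any(token in NOISE_TAGS for token in text.split())


def strip_punctuation_tags(text):
    """Drop punctuation tags, keeping the spoken words."""
    return " ".join(t for t in text.split() if t not in PUNCTUATION_TAGS)


def prepare_for_training(transcripts, keep_punctuation=False):
    """Discard non-speech segments and optionally remove punctuation tags."""
    kept = [t for t in transcripts if not is_non_speech(t)]
    if keep_punctuation:
        return kept
    return [strip_punctuation_tags(t) for t in kept]


examples = ["HELLO <COMMA> HOW ARE YOU <QUESTIONMARK>", "<MUSIC>", "FINE <PERIOD>"]
cleaned = prepare_for_training(examples)
print(cleaned)
```

Whether to keep the punctuation tags depends on the model being trained, so the sketch leaves that as a parameter rather than deciding it.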

Text Post‑processing

  • Fillers: Removing conversational fillers from both reference and hypothesis before computing WER is suggested, so that results are comparable across toolkits.
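
A minimal sketch of filler removal followed by a word error rate computed from word-level edit distance. The filler list is an assumption chosen for illustration, and the functions are hypothetical helpers rather than any toolkit's scoring script.

```python
# Assumed filler inventory; replace it with the list your evaluation uses.
FILLERS = {"UH", "UM", "EH", "MM", "HM", "AH", "HUH", "HA", "ER"}


def remove_fillers(words):
    """Drop filler words (case-insensitive) from a list of words."""
    return [w for w in words if w.upper() not in FILLERS]


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """WER after filler removal: edit distance divided by reference length."""
    ref = remove_fillers(reference.split())
    hyp = remove_fillers(hypothesis.split())
    if not ref:
        raise ValueError("reference is empty after filler removal")
    return edit_distance(ref, hyp) / len(ref)


score = wer("UM I THINK IT WORKS", "I THINK UH IT WORK")
print(f"WER after filler removal: {score:.2f}")
```

Applying the same filler list to both reference and hypothesis keeps a system from being penalized for how it happens to handle hesitations.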

Citation

  • Please cite the following paper:
    @inproceedings{GigaSpeech2021,
      title={GigaSpeech: An Evolving, Multi-domain ASR Corpus with 10,000 Hours of Transcribed Audio},
      booktitle={Proc. Interspeech 2021},
      year={2021},
      author={Guoguo Chen and Shuzhou Chai and Guanbo Wang and Jiayu Du and Wei-Qiang Zhang and Chao Weng and Dan Su and Daniel Povey and Jan Trmal and Junbo Zhang and Mingjie Jin and Sanjeev Khudanpur and Shinji Watanabe and Shuaijiang Zhao and Wei Zou and Xiangang Li and Xuchen Yao and Yongqing Wang and Yujun Wang and Zhao You and Zhiyong Yan}
    }
    

Topics

Speech Recognition
Big Data

Source

Organization: arXiv

Created: 6/13/2021
