High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.



GigaSpeech

GigaSpeech is an evolving, multi-domain English speech recognition corpus created by Tsinghua University's Department of Electronic Engineering and partner institutions. It contains 10,000 hours of high-quality labeled audio for supervised training and 40,000 hours of total audio suitable for semi-supervised and unsupervised training. The audio is drawn from audiobooks, podcasts, and YouTube videos, and covers both read and spontaneous speech across topics such as arts, science, and sports. The creation pipeline comprises audio collection, text normalization, forced alignment, audio segmentation, and segment validation. GigaSpeech aims to advance speech recognition research and to address the performance saturation seen on existing datasets.

arXiv
