Datasets | JuheAPI

AILAB-VNUHCM/vivos

Automatic Speech Recognition

Vietnamese

VIVOS is a free Vietnamese audio corpus containing 15 hours of recordings, prepared for Vietnamese automatic speech recognition tasks. The corpus was compiled by the AILAB lab at VNU‑HCM – University of Science, aiming to attract researchers to address Vietnamese speech recognition challenges. It includes audio files, corresponding transcripts, speaker IDs, and file paths, split into training and test sets. The dataset is released under a CC BY‑NC‑SA 4.0 license for non‑commercial use.

hugging_face

View Details

japanese-anime-speech-v2

Automatic Speech Recognition

Anime

japanese‑anime‑speech‑v2 is an audio‑text dataset designed to train automatic speech recognition models. It contains 300,506 audio clips and their transcriptions sourced from visual novels. The goal is to improve ASR models (e.g., OpenAI's Whisper) for transcribing anime and similar Japanese media dialogue. Audio is in MP3 format, sampled at 16 kHz, with an average length of 5.5 seconds. This is the first release of the japanese‑anime‑speech‑v2 series; compared with the previous version, audio quality has been adjusted and NSFW content is not filtered. The dataset is predominantly female voices, with vocabularies around love, relationships, and fantasy, which may not fully reflect real‑world speech patterns. Future plans include separating safe and NSFW content, improving text formatting, and expanding data sources.

huggingface

View Details

CSTR-Edinburgh/vctk

Automatic Speech Recognition

Text-to-Speech

--- annotations_creators: - expert-generated language_creators: - crowdsourced language: - en license: - cc-by-4.0 multilinguality: - monolingual pretty_name: VCTK size_categories: - 10K<n<100K source_datasets: - original task_categories: - automatic-speech-recognition - text-to-speech - text-to-audio task_ids: [] paperswithcode_id: vctk train-eval-index: - config: main task: automatic-speech-recognition task_id: speech_recognition splits: train_split: train col_mapping: file: path text: text metrics: - type: wer name: WER - type: cer name: CER dataset_info: features: - name: speaker_id dtype: string - name: audio dtype: audio: sampling_rate: 48000 - name: file dtype: string - name: text dtype: string - name: text_id dtype: string - name: age dtype: string - name: gender dtype: string - name: accent dtype: string - name: region dtype: string - name: comment dtype: string config_name: main splits: - name: train num_bytes: 40103111 num_examples: 88156 download_size: 11747302977 dataset_size: 40103111 --- # Dataset Card for VCTK ## Table of Contents - [Dataset Description](#dataset-description) - [Dataset Summary](#dataset-summary) - [Supported Tasks and Leaderboards](#supported-tasks-and-leaderboards) - [Languages](#languages) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Fields](#data-fields) - [Data Splits](#data-splits) - [Dataset Creation](#dataset-creation) - [Curation Rationale](#curation-rationale) - [Source Data](#source-data) - [Annotations](#annotations) - [Personal and Sensitive Information](#personal-and-sensitive-information) - [Considerations for Using the Data](#considerations-for-using-the-data) - [Social Impact of Dataset](#social-impact-of-dataset) - [Discussion of Biases](#discussion-of-biases) - [Other Known Limitations](#other-known-limitations) - [Additional Information](#additional-information) - [Dataset Curators](#dataset-curators) - [Licensing Information](#licensing-information) - [Citation Information](#citation-information) - [Contributions](#contributions) ## Dataset Description - **Homepage:** [Edinburg DataShare](https://doi.org/10.7488/ds/2645) - **Repository:** - **Paper:** - **Leaderboard:** - **Point of Contact:** ### Dataset Summary This CSTR VCTK Corpus includes around 44-hours of speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the rainbow passage and an elicitation paragraph used for the speech accent archive. ### Supported Tasks - `automatic-speech-recognition`, `speaker-identification`: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe the audio file to written text. The most common evaluation metric is the word error rate (WER). - `text-to-speech`, `text-to-audio`: The dataset can also be used to train a model for Text-To-Speech (TTS). ### Languages [More Information Needed] ## Dataset Structure ### Data Instances A data point comprises the path to the audio file, called `file` and its transcription, called `text`. ``` { 'speaker_id': 'p225', 'text_id': '001', 'text': 'Please call Stella.', 'age': '23', 'gender': 'F', 'accent': 'English', 'region': 'Southern England', 'file': '/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac', 'audio': { 'path': '/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac', 'array': array([0.00485229, 0.00689697, 0.00619507, ..., 0.00811768, 0.00836182, 0.00854492], dtype=float32), 'sampling_rate': 48000 }, 'comment': '' } ``` Each audio file is a single-channel FLAC with a sample rate of 48000 Hz. ### Data Fields Each row consists of the following fields: - `speaker_id`: Speaker ID - `audio`: Audio recording - `file`: Path to audio file - `text`: Text transcription of corresponding audio - `text_id`: Text ID - `age`: Speaker's age - `gender`: Speaker's gender - `accent`: Speaker's accent - `region`: Speaker's region, if annotation exists - `comment`: Miscellaneous comments, if any ### Data Splits The dataset has no predefined splits. ## Dataset Creation ### Curation Rationale [More Information Needed] ### Source Data #### Initial Data Collection and Normalization [More Information Needed] #### Who are the source language producers? [More Information Needed] ### Annotations #### Annotation process [More Information Needed] #### Who are the annotators? [More Information Needed] ### Personal and Sensitive Information The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset. ## Considerations for Using the Data ### Social Impact of Dataset [More Information Needed] ### Discussion of Biases [More Information Needed] ### Other Known Limitations [More Information Needed] ## Additional Information ### Dataset Curators [More Information Needed] ### Licensing Information Public Domain, Creative Commons Attribution 4.0 International Public License ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode)) ### Citation Information ```bibtex @inproceedings{Veaux2017CSTRVC, title = {CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit}, author = {Christophe Veaux and Junichi Yamagishi and Kirsten MacDonald}, year = 2017 } ``` ### Contributions Thanks to [@jaketae](https://github.com/jaketae) for adding this dataset.

hugging_face

View Details

edinburghcstr/edacc

Automatic Speech Recognition

Accent Diversity

The Edinburgh International Accents of English Corpus (EdAcc) is a new automatic speech recognition (ASR) dataset containing 40 hours of English dialogue that spans a wide range of English accents. It includes extensive first‑language and second‑language English variants, along with detailed speaker background information. Recent evaluations with public and commercial models show that EdAcc highlights shortcomings of current English ASR models: while they perform well on existing benchmarks, their performance degrades significantly on speakers with different accents.

hugging_face

View Details

Dataset Hub

Browse by Category

AILAB-VNUHCM/vivos

japanese-anime-speech-v2

CSTR-Edinburgh/vctk

edinburghcstr/edacc