CSTR-Edinburgh/vctk

---
annotations_creators:
- expert-generated
language_creators:
- crowdsourced
language:
- en
license:
- cc-by-4.0
multilinguality:
- monolingual
pretty_name: VCTK
size_categories:
- 10K<n<100K
source_datasets:
- original
task_categories:
- automatic-speech-recognition
- text-to-speech
- text-to-audio
task_ids: []
paperswithcode_id: vctk
train-eval-index:
- config: main
  task: automatic-speech-recognition
  task_id: speech_recognition
  splits:
    train_split: train
  col_mapping:
    file: path
    text: text
  metrics:
  - type: wer
    name: WER
  - type: cer
    name: CER
dataset_info:
  features:
  - name: speaker_id
    dtype: string
  - name: audio
    dtype:
      audio:
        sampling_rate: 48000
  - name: file
    dtype: string
  - name: text
    dtype: string
  - name: text_id
    dtype: string
  - name: age
    dtype: string
  - name: gender
    dtype: string
  - name: accent
    dtype: string
  - name: region
    dtype: string
  - name: comment
    dtype: string
  config_name: main
  splits:
  - name: train
    num_bytes: 40103111
    num_examples: 88156
  download_size: 11747302977
  dataset_size: 40103111
---

# Dataset Card for VCTK

## Table of Contents

- [Dataset Description](#dataset-description)
  - [Dataset Summary](#dataset-summary)
  - [Supported Tasks](#supported-tasks)
  - [Languages](#languages)
- [Dataset Structure](#dataset-structure)
  - [Data Instances](#data-instances)
  - [Data Fields](#data-fields)
  - [Data Splits](#data-splits)
- [Dataset Creation](#dataset-creation)
  - [Curation Rationale](#curation-rationale)
  - [Source Data](#source-data)
  - [Annotations](#annotations)
  - [Personal and Sensitive Information](#personal-and-sensitive-information)
- [Considerations for Using the Data](#considerations-for-using-the-data)
  - [Social Impact of Dataset](#social-impact-of-dataset)
  - [Discussion of Biases](#discussion-of-biases)
  - [Other Known Limitations](#other-known-limitations)
- [Additional Information](#additional-information)
  - [Dataset Curators](#dataset-curators)
  - [Licensing Information](#licensing-information)
  - [Citation Information](#citation-information)
  - [Contributions](#contributions)

## Dataset Description

- **Homepage:** [Edinburgh DataShare](https://doi.org/10.7488/ds/2645)
- **Repository:**
- **Paper:**
- **Leaderboard:**
- **Point of Contact:**

### Dataset Summary

The CSTR VCTK Corpus includes around 44 hours of speech data uttered by 110 English speakers with various accents. Each speaker reads out about 400 sentences, which were selected from a newspaper, the Rainbow Passage, and an elicitation paragraph used for the Speech Accent Archive.

### Supported Tasks

- `automatic-speech-recognition`, `speaker-identification`: The dataset can be used to train a model for Automatic Speech Recognition (ASR). The model is presented with an audio file and asked to transcribe the audio file to written text. The most common evaluation metric is the word error rate (WER).
- `text-to-speech`, `text-to-audio`: The dataset can also be used to train a model for Text-To-Speech (TTS).

### Languages

[More Information Needed]

## Dataset Structure

### Data Instances

A data point comprises the path to the audio file, called `file`, and its transcription, called `text`.

```
{
  'speaker_id': 'p225',
  'text_id': '001',
  'text': 'Please call Stella.',
  'age': '23',
  'gender': 'F',
  'accent': 'English',
  'region': 'Southern England',
  'file': '/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac',
  'audio': {
    'path': '/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac',
    'array': array([0.00485229, 0.00689697, 0.00619507, ..., 0.00811768, 0.00836182, 0.00854492], dtype=float32),
    'sampling_rate': 48000
  },
  'comment': ''
}
```

Each audio file is a single-channel FLAC with a sample rate of 48000 Hz.

### Data Fields

Each row consists of the following fields:

- `speaker_id`: Speaker ID
- `audio`: Audio recording
- `file`: Path to audio file
- `text`: Text transcription of corresponding audio
- `text_id`: Text ID
- `age`: Speaker's age
- `gender`: Speaker's gender
- `accent`: Speaker's accent
- `region`: Speaker's region, if annotation exists
- `comment`: Miscellaneous comments, if any

### Data Splits

The dataset has no predefined splits.

## Dataset Creation

### Curation Rationale

[More Information Needed]

### Source Data

#### Initial Data Collection and Normalization

[More Information Needed]

#### Who are the source language producers?

[More Information Needed]

### Annotations

#### Annotation process

[More Information Needed]

#### Who are the annotators?

[More Information Needed]

### Personal and Sensitive Information

The dataset consists of people who have donated their voice online. You agree to not attempt to determine the identity of speakers in this dataset.

## Considerations for Using the Data

### Social Impact of Dataset

[More Information Needed]

### Discussion of Biases

[More Information Needed]

### Other Known Limitations

[More Information Needed]

## Additional Information

### Dataset Curators

[More Information Needed]

### Licensing Information

Public Domain, Creative Commons Attribution 4.0 International Public License ([CC-BY-4.0](https://creativecommons.org/licenses/by/4.0/legalcode))

### Citation Information

```bibtex
@inproceedings{Veaux2017CSTRVC,
  title  = {CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
  author = {Christophe Veaux and Junichi Yamagishi and Kirsten MacDonald},
  year   = 2017
}
```

### Contributions

Thanks to [@jaketae](https://github.com/jaketae) for adding this dataset.

Updated 8/14/2024

hugging_face

Description

Dataset Card for VCTK

Dataset Description

Dataset Overview

The VCTK dataset contains approximately 44 hours of English speech recorded by 110 speakers with various accents. Each speaker read about 400 sentences selected from a newspaper, the Rainbow Passage, and an elicitation paragraph used for the Speech Accent Archive.

Supported Tasks

automatic-speech-recognition (ASR): The dataset can be used to train ASR models. Models receive audio files and output transcribed text. The most common evaluation metric is word error rate (WER).
text-to-speech (TTS): The dataset can also be used to train TTS models.
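
Since WER is named as the usual ASR metric, a minimal sketch of how it can be computed may help: the word-level edit distance between reference and hypothesis, divided by the number of reference words. The function name `word_error_rate` and the lowercasing step are assumptions made for illustration; the card does not prescribe a text normalization, and evaluation toolkits differ on this point.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: text is lowercased and split on whitespace only, so
    punctuation stays attached to words ("Stella." != "Stella").
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the first i-1 reference words
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

For the sample transcription "Please call Stella.", a hypothesis that drops the last word has one deletion over three reference words, so the function returns 1/3. WER can exceed 1.0 when the hypothesis contains many insertions.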

Language

The language of the dataset is English.

Dataset Structure

Data Example

Each data point includes the path to the audio file (named file) and its transcription (text).

{
  "speaker_id": "p225",
  "text_id": "001",
  "text": "Please call Stella.",
  "age": "23",
  "gender": "F",
  "accent": "English",
  "region": "Southern England",
  "file": "/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac",
  "audio": {
    "path": "/datasets/downloads/extracted/8ed7dad05dfffdb552a3699777442af8e8ed11e656feb277f35bf9aea448f49e/wav48_silence_trimmed/p225/p225_001_mic1.flac",
    "array": [0.00485229, 0.00689697, 0.00619507, ..., 0.00811768, 0.00836182, 0.00854492],
    "sampling_rate": 48000
  },
  "comment": ""
}

Each audio file is a mono FLAC file sampled at 48 kHz.
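
The sample path above suggests that file names encode the speaker, the text ID, and a microphone tag. The sketch below parses that pattern and converts a sample count to seconds using the 48 kHz rate. The naming pattern is inferred from the single example shown here and is an assumption, not a documented guarantee; `parse_vctk_path` and `duration_seconds` are hypothetical helper names.

```python
import re
from pathlib import PurePosixPath

SAMPLING_RATE_HZ = 48_000  # stated on the card

# Assumed pattern, inferred from "p225_001_mic1.flac":
# <speaker_id>_<text_id>_<mic tag>
_NAME_PATTERN = re.compile(
    r"^(?P<speaker_id>[a-z]+\d+)_(?P<text_id>\d+)_(?P<mic>mic\d+)$"
)


def parse_vctk_path(path: str) -> dict:
    """Split a file path into speaker_id, text_id, and mic tag."""
    stem = PurePosixPath(path).stem  # drops directories and ".flac"
    match = _NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {stem!r}")
    return match.groupdict()


def duration_seconds(num_samples: int,
                     sampling_rate: int = SAMPLING_RATE_HZ) -> float:
    """Duration of a mono clip given the length of its sample array."""
    return num_samples / sampling_rate


# The trailing part of the sample path is enough for parsing.
info = parse_vctk_path("wav48_silence_trimmed/p225/p225_001_mic1.flac")
# info == {"speaker_id": "p225", "text_id": "001", "mic": "mic1"}
```

Because the files are mono, the length of the `array` field divided by `sampling_rate` gives the clip duration directly; 96,000 samples correspond to 2.0 seconds.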

Data Fields

speaker_id: Speaker identifier
audio: Audio recording
file: Path to audio file
text: Transcribed text for the audio
text_id: Text identifier
age: Speaker age
gender: Speaker gender
accent: Speaker accent
region: Speaker region (if available)
comment: Additional comments (if any)

Data Splits

The dataset does not provide predefined splits.
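
Because no splits are predefined, users have to create their own. One common choice for multi-speaker corpora, shown here as a sketch rather than a recommended protocol, is to split by `speaker_id` so that no speaker appears in both partitions. The function name, the 10% default, and the fixed seed are all assumptions made for illustration.

```python
import random


def split_by_speaker(rows, eval_fraction=0.1, seed=0):
    """Partition rows into (train, evaluation) with disjoint speakers.

    `rows` is any list of dicts that carry a "speaker_id" key, as in
    the data fields listed above.
    """
    speakers = sorted({row["speaker_id"] for row in rows})
    if len(speakers) < 2:
        raise ValueError("need at least two speakers to split")

    # Shuffle with a local, seeded generator so the split is reproducible.
    rng = random.Random(seed)
    rng.shuffle(speakers)

    # Hold out at least one speaker, but never all of them.
    n_eval = round(len(speakers) * eval_fraction)
    n_eval = min(max(n_eval, 1), len(speakers) - 1)
    eval_speakers = set(speakers[:n_eval])

    train = [r for r in rows if r["speaker_id"] not in eval_speakers]
    evaluation = [r for r in rows if r["speaker_id"] in eval_speakers]
    return train, evaluation
```

A speaker-disjoint split measures generalization to unseen voices. For tasks such as speaker identification, where the same speakers must appear at training and test time, a split by `text_id` within each speaker would be the more fitting design.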

Dataset Creation

Personal and Sensitive Information

The dataset includes recordings from volunteers who donated their voices online. Users agree not to attempt to identify the speakers.

Additional Information

License

Public domain, Creative Commons Attribution 4.0 International Public License (CC-BY-4.0)

Citation

@inproceedings{Veaux2017CSTRVC,
    title        = {CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit},
    author       = {Christophe Veaux and Junichi Yamagishi and Kirsten MacDonald},
    year         = 2017
}


Topics

Automatic Speech Recognition

Text-to-Speech

Source

Organization: hugging_face

Created: Unknown
