edinburghcstr/edacc

The Edinburgh International Accents of English Corpus (EdAcc) is a new automatic speech recognition (ASR) dataset containing 40 hours of English dialogue across a wide range of accents. It covers many first‑language and second‑language varieties of English and includes detailed background information for each speaker. Evaluations with public and commercial models show that EdAcc exposes shortcomings of current English ASR systems: models that perform well on existing benchmarks degrade significantly on the accented speech in EdAcc.

Updated 2/22/2024

hugging_face

Description

Dataset Overview

Dataset Description

Basic Information

Dataset Name: EdAcc: The Edinburgh International Accents of English Corpus
Dataset Size: 7,542,124,660 bytes (about 7.5 GB)
Download Size: 6,951,164,322 bytes (about 7.0 GB)

Structure

Features:
- speaker: speaker ID (string)
- text: transcription of the audio (string)
- accent: accent labeled by trained linguists (string)
- raw_accent: self‑reported accent (string)
- gender: speaker gender (string)
- l1: speaker's native language, standardized by linguists (string)
- audio: dictionary containing filename, decoded audio array, and sampling rate (audio)
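
As a minimal sketch of how these fields might be read, the snippet below defines a loader and a small helper that summarizes one record. The loader assumes the third-party `datasets` package and network access, and is not called here. The helper names (`load_edacc_split`, `summarize_example`), the audio keys `array` and `sampling_rate`, and every value in the synthetic record are illustrative assumptions, not taken from the corpus.

```python
from typing import Any, Dict


def load_edacc_split(split: str = "validation", streaming: bool = True):
    """Load one EdAcc split from the Hugging Face Hub.

    Needs the third-party `datasets` package and network access; the import
    is deferred so the helper below can be used without either.
    """
    from datasets import load_dataset  # pip install datasets

    return load_dataset("edinburghcstr/edacc", split=split, streaming=streaming)


def summarize_example(example: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the metadata fields of one record and derive the clip duration."""
    audio = example["audio"]
    # Assumed keys of the decoded audio dictionary: "array" and "sampling_rate".
    duration_s = len(audio["array"]) / audio["sampling_rate"]
    return {
        "speaker": example["speaker"],
        "accent": example["accent"],
        "raw_accent": example["raw_accent"],
        "gender": example["gender"],
        "l1": example["l1"],
        "n_words": len(example["text"].split()),
        "duration_s": round(duration_s, 3),
    }


# Synthetic record with the same field names as the dataset (values are made up).
fake_example = {
    "speaker": "SPK-000",
    "text": "hello there how are you",
    "accent": "Scottish English",
    "raw_accent": "Scottish",
    "gender": "female",
    "l1": "English",
    "audio": {"path": "fake.wav", "array": [0.0] * 32000, "sampling_rate": 16000},
}
summary = summarize_example(fake_example)
print(summary["duration_s"], summary["n_words"])
```

With real data, the same helper could be applied to records obtained by iterating over `load_edacc_split("validation")`, provided the installed `datasets` version decodes audio into a dictionary as described above.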

Splits

Validation: 9,848 samples, 2,615,574,878 bytes
Test: 9,289 samples, 4,926,549,782 bytes

Supported Tasks

Automatic Speech Recognition (ASR): Models receive audio files and output transcribed text; primary metric is word error rate (WER).
Audio Classification: Models receive audio files and predict speaker accent or gender; primary metric is percentage accuracy.
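
The two metrics above can be sketched in plain Python. The example computes a corpus-level WER from a word-level edit distance, breaks it down by accent label, and computes classification accuracy. Lowercasing and whitespace tokenization are simplifying assumptions rather than the corpus authors' scoring protocol, and the accent labels and sentences are invented for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance over words (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = 0
    words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()  # assumed normalization: lowercase, split on whitespace
        hyp = hypothesis.lower().split()
        errors += word_edit_distance(ref, hyp)
        words += len(ref)
    return errors / words if words else 0.0


def wer_by_accent(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Group (accent, reference, hypothesis) rows by accent and score each group."""
    grouped = defaultdict(list)
    for accent, reference, hypothesis in rows:
        grouped[accent].append((reference, hypothesis))
    return {accent: wer(pairs) for accent, pairs in grouped.items()}


def accuracy(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Fraction of predictions that match the gold labels."""
    if not gold:
        return 0.0
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


# Invented rows: (accent label, reference transcript, model output).
rows = [
    ("Accent A", "the cat sat on the mat", "the cat sat on the mat"),
    ("Accent B", "the cat sat on the mat", "the cat sat on mat"),
    ("Accent B", "good morning", "good mourning"),
]
per_accent = wer_by_accent(rows)
overall = wer((ref, hyp) for _, ref, hyp in rows)
label_accuracy = accuracy(["A", "B", "B"], ["A", "B", "A"])
print(per_accent, overall, label_accuracy)
```

Reporting WER per accent rather than only as one aggregate makes it easier to see which speaker groups a model handles poorly.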

Dataset Creation

Data Collection Process

Participants held relaxed conversations over Zoom and completed detailed questionnaires, which supplied the speaker metadata.
The questionnaire captured language background, English learning duration, language usage, residence history, relationship with conversation partners, and self‑perceived accent.
It also gathered demographic information such as age, gender, ethnic background, and education level.
Professional transcribers transcribed the conversations, capturing speech, overlaps, background noise, laughter, and hesitations.

License

License: Public domain, Creative Commons Attribution‑ShareAlike International Public License (CC‑BY‑SA)

Citation

@inproceedings{sanabria23edacc,
   title="{The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR}",
   author={Sanabria, Ramon and Bogoychev, Nikolay and  Markl, Nina and Carmantini, Andrea and  Klejch, Ondrej and Bell, Peter},
   booktitle={ICASSP 2023},
   year={2023},
}

Contributors

Thanks to @sanchit-gandhi for adding this dataset.

Topics

Automatic Speech Recognition

Accent Diversity

Source

Organization: hugging_face

Created: Unknown
