Dataset asset | Open Source Community | Automatic Speech Recognition | Accent Diversity
edinburghcstr/edacc
The Edinburgh International Accents of English Corpus (EdAcc) is an automatic speech recognition (ASR) dataset containing 40 hours of English dialogue spoken in a wide range of accents. It covers many first‑language and second‑language varieties of English and includes detailed background information for each speaker. Recent evaluations with public and commercial models show that EdAcc exposes a shortcoming of current English ASR models: they perform well on existing benchmarks, but their performance degrades significantly on speakers with different accents.
Source
hugging_face
Created
Nov 28, 2025
Updated
Feb 22, 2024
Dataset Overview
Dataset Description
Basic Information
- Dataset Name: EdAcc: The Edinburgh International Accents of English Corpus
- Size: 7,542,124,660 bytes (about 7.5 GB)
- Download Size: 6,951,164,322 bytes
Structure
- Features:
  - speaker: speaker ID (string)
  - text: transcription of the audio (string)
  - accent: accent labeled by trained linguists (string)
  - raw_accent: self‑reported accent (string)
  - gender: speaker gender (string)
  - l1: speaker's native language, standardized by linguists (string)
  - audio: dictionary containing filename, decoded audio array, and sampling rate (audio)
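The feature list above can be turned into a small sanity check for individual records. The following is a minimal sketch, not part of the dataset's own tooling: `load_split` and `missing_or_mistyped` are hypothetical helper names, the example record holds made-up values, and streaming is used only as an assumption to avoid downloading the full dataset up front.

```python
from typing import Any, Mapping

# String-valued fields taken from the feature list above.
STRING_FIELDS = ("speaker", "text", "accent", "raw_accent", "gender", "l1")


def load_split(split: str = "validation"):
    """Stream one EdAcc split from the Hugging Face Hub (needs network access)."""
    # Imported lazily so the schema check below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("edinburghcstr/edacc", split=split, streaming=True)


def missing_or_mistyped(record: Mapping[str, Any]) -> list[str]:
    """Return the names of fields that are absent or not of the expected type."""
    problems = [
        name for name in STRING_FIELDS
        if not isinstance(record.get(name), str)
    ]
    # The decoded audio type can differ between `datasets` versions,
    # so only the presence of the field is checked here.
    if "audio" not in record:
        problems.append("audio")
    return problems


# Hypothetical record with made-up values, shaped like the feature list.
example = {
    "speaker": "SPEAKER-01",
    "text": "so what did you do at the weekend",
    "accent": "Example accent label",
    "raw_accent": "Example self-reported accent",
    "gender": "Female",
    "l1": "English",
    "audio": {"path": "example.wav", "array": [0.0, 0.01], "sampling_rate": 16000},
}

print(missing_or_mistyped(example))
```

With network access, the same check could be applied to a real record, for example `missing_or_mistyped(next(iter(load_split("validation"))))`.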
Splits
- Validation: 9,848 samples, 2,615,574,878 bytes
- Test: 9,289 samples, 4,926,549,782 bytes
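As a quick consistency check, the split figures can be summed and compared with the total size listed under Basic Information. This sketch uses the numbers from this page rounded to whole bytes; the variable and function names are illustrative only.

```python
# Split statistics copied from the list above, rounded to whole bytes.
SPLITS = {
    "validation": {"samples": 9_848, "bytes": 2_615_574_878},
    "test": {"samples": 9_289, "bytes": 4_926_549_782},
}

total_samples = sum(split["samples"] for split in SPLITS.values())
total_bytes = sum(split["bytes"] for split in SPLITS.values())


def mean_bytes_per_sample(name: str) -> float:
    """Average stored size of one sample in the named split."""
    split = SPLITS[name]
    return split["bytes"] / split["samples"]


print(f"total samples: {total_samples:,}")
print(f"total bytes:   {total_bytes:,}")
for name in SPLITS:
    print(f"{name}: {mean_bytes_per_sample(name):,.0f} bytes per sample on average")
```

The sum of the two splits matches the listed dataset size, and the test split stores noticeably more data per sample than the validation split.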
Supported Tasks
- Automatic Speech Recognition (ASR): Models receive audio files and output transcribed text; primary metric is word error rate (WER).
- Audio Classification: Models receive audio files and predict speaker accent or gender; primary metric is percentage accuracy.
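Both metrics named above are simple to compute by hand. The sketch below is a plain-Python illustration and not the evaluation code used by the dataset authors: it assumes whitespace tokenization with no text normalization, and the function names are hypothetical.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_percent(labels: Sequence[str], predictions: Sequence[str]) -> float:
    """Percentage of predictions that exactly match their labels."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return 100.0 * correct / len(labels)


# One deleted word out of six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
# Three of four accent predictions correct.
print(accuracy_percent(["a", "b", "c", "d"], ["a", "b", "x", "d"]))
```

In practice, WER is usually aggregated over a whole test set by summing edit distances and reference lengths across utterances rather than averaging per-utterance scores, and transcripts are typically normalized (casing, punctuation) before scoring.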
Dataset Creation
Data Collection Process
- Participants engaged in relaxed conversations over Zoom and completed detailed questionnaires to collect metadata.
- The questionnaire captured language background, English learning duration, language usage, residence history, relationship with conversation partners, and self‑perceived accent.
- It also gathered demographic information such as age, gender, ethnic background, and education level.
- Conversations were transcribed by professional transcribers, ensuring accurate capture of speech, overlap, background noises, laughter, and hesitations.
License
- License: Public domain, Creative Commons Attribution‑ShareAlike International Public License (CC‑BY‑SA)
Citation
@inproceedings{sanabria23edacc,
title="{The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR}",
author={Sanabria, Ramon and Bogoychev, Nikolay and Markl, Nina and Carmantini, Andrea and Klejch, Ondrej and Bell, Peter},
booktitle={ICASSP 2023},
year={2023},
}
Contributors
- Thanks to @sanchit-gandhi for adding this dataset.