edinburghcstr/edacc
The Edinburgh International Accents of English Corpus (EdAcc) is a new automatic speech recognition (ASR) dataset containing 40 hours of English dialogue that spans a wide range of English accents. It includes extensive first‑language and second‑language English variants, along with detailed speaker background information. Recent evaluations with public and commercial models show that EdAcc highlights shortcomings of current English ASR models: while they perform well on existing benchmarks, their performance degrades significantly on speakers with different accents.
Description
Dataset Overview
Dataset Description
Basic Information
- Dataset Name: EdAcc: The Edinburgh International Accents of English Corpus
- Size: 7,542,124,660.366 bytes
- Download Size: 6,951,164,322 bytes
Structure
- Features:
  - speaker: speaker ID (string)
  - text: transcription of the audio (string)
  - accent: accent labeled by trained linguists (string)
  - raw_accent: self‑reported accent (string)
  - gender: speaker gender (string)
  - l1: speaker's native language, standardized by linguists (string)
  - audio: dictionary containing filename, decoded audio array, and sampling rate (audio)
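To make the feature list concrete, the sketch below builds two records with the same fields and derives a clip duration and an accent tally from them. All values are invented for illustration and are not taken from the corpus. The `audio` keys `path`, `array`, and `sampling_rate`, and the helper names `duration_seconds` and `accent_counts`, are assumptions rather than anything the dataset card specifies.

```python
from collections import Counter

# Hypothetical records mirroring the listed features; every value is made up.
records = [
    {
        "speaker": "SPK-A",
        "text": "so where did you grow up",
        "accent": "Scottish English",
        "raw_accent": "Scottish",
        "gender": "female",
        "l1": "English",
        # Assumed key names for the audio dictionary.
        "audio": {"path": "a.wav", "array": [0.0] * 32000, "sampling_rate": 16000},
    },
    {
        "speaker": "SPK-B",
        "text": "mostly in the north",
        "accent": "Indian English",
        "raw_accent": "Indian",
        "gender": "male",
        "l1": "Hindi",
        "audio": {"path": "b.wav", "array": [0.0] * 8000, "sampling_rate": 16000},
    },
]


def duration_seconds(record):
    """Clip length: number of decoded samples divided by the sampling rate."""
    audio = record["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def accent_counts(rows):
    """Count records per linguist-assigned accent label."""
    return Counter(row["accent"] for row in rows)


for row in records:
    print(row["speaker"], row["accent"], f"{duration_seconds(row):.1f} s")
print(accent_counts(records))
```

The same two helpers should work unchanged on real records as long as the loaded rows expose these field names.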
Splits
- Validation: 9,848 samples, 2,615,574,877.928 bytes
- Test: 9,289 samples, 4,926,549,782.438 bytes
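The split figures above imply that the test split holds slightly fewer samples than the validation split but nearly twice the bytes. A small sketch of that arithmetic follows; the helper name `split_summary` is hypothetical, and the only inputs are the sample counts and byte sizes listed above.

```python
# Sample counts and byte sizes as listed for each split.
splits = {
    "validation": {"samples": 9_848, "bytes": 2_615_574_877.928},
    "test": {"samples": 9_289, "bytes": 4_926_549_782.438},
}


def split_summary(split_info):
    """Return mean bytes per sample and share of total bytes for each split."""
    total_bytes = sum(info["bytes"] for info in split_info.values())
    return {
        name: {
            "mean_bytes_per_sample": info["bytes"] / info["samples"],
            "share_of_bytes": info["bytes"] / total_bytes,
        }
        for name, info in split_info.items()
    }


for name, stats in split_summary(splits).items():
    print(
        f"{name}: {stats['mean_bytes_per_sample'] / 1000:.0f} kB per sample, "
        f"{stats['share_of_bytes']:.1%} of all bytes"
    )
```

The two split sizes sum to the reported dataset size, so the shares computed here cover the whole dataset.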
Supported Tasks
- Automatic Speech Recognition (ASR): Models receive audio files and output transcribed text; primary metric is word error rate (WER).
- Audio Classification: Models receive audio files and predict speaker accent or gender; primary metric is percentage accuracy.
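The two metrics named above can be sketched in plain Python as follows. This is a minimal illustration, not the evaluation code used by the dataset authors: WER is computed at corpus level as word-level edit distance divided by the number of reference words, with no text normalization (casing, punctuation) applied. The function names, including the per-accent breakdown `wer_by_accent`, are hypothetical.

```python
from collections import defaultdict


def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, deletions, insertions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i] + [0] * len(hyp_words)
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, substitution)
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word errors over total reference words."""
    errors = 0
    ref_len = 0
    for ref, hyp in zip(references, hypotheses):
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += edit_distance(ref_words, hyp_words)
        ref_len += len(ref_words)
    if ref_len == 0:
        raise ValueError("references must contain at least one word")
    return errors / ref_len


def accuracy(labels, predictions):
    """Fraction of predictions that match the labels (e.g. accent or gender)."""
    if not labels:
        raise ValueError("labels must not be empty")
    correct = sum(label == pred for label, pred in zip(labels, predictions))
    return correct / len(labels)


def wer_by_accent(records, hypotheses):
    """Group utterances by their `accent` field and compute WER per group."""
    grouped = defaultdict(lambda: ([], []))
    for record, hyp in zip(records, hypotheses):
        refs, hyps = grouped[record["accent"]]
        refs.append(record["text"])
        hyps.append(hyp)
    return {
        accent: word_error_rate(refs, hyps)
        for accent, (refs, hyps) in grouped.items()
    }
```

Reporting WER per accent group rather than as a single number is one way to surface the accent-dependent degradation the corpus is meant to expose.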
Dataset Creation
Data Collection Process
- Participants held relaxed conversations over Zoom and filled in detailed questionnaires, which supplied the speaker metadata.
- The questionnaire captured language background, English learning duration, language usage, residence history, relationship with conversation partners, and self‑perceived accent.
- It also gathered demographic information such as age, gender, ethnic background, and education level.
- Conversations were transcribed by professional transcribers, who captured the speech along with overlapping talk, background noise, laughter, and hesitations.
License
- License: Public domain, Creative Commons Attribution‑ShareAlike International Public License (CC‑BY‑SA)
Citation
@inproceedings{sanabria23edacc,
title="{The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR}",
author={Sanabria, Ramon and Bogoychev, Nikolay and Markl, Nina and Carmantini, Andrea and Klejch, Ondrej and Bell, Peter},
booktitle={ICASSP 2023},
year={2023},
}
Contributors
- Thanks to @sanchit-gandhi for adding this dataset.
Source
Organization: hugging_face