Dataset asset · Open Source Community · Automatic Speech Recognition · Accent Diversity

edinburghcstr/edacc

The Edinburgh International Accents of English Corpus (EdAcc) is a new automatic speech recognition (ASR) dataset containing 40 hours of English dialogue spanning a wide range of English accents. It covers many first‑language and second‑language varieties of English and includes detailed speaker background information. Evaluations with public and commercial models show that EdAcc exposes shortcomings of current English ASR models: they perform well on existing benchmarks, but their performance degrades significantly on speakers with more diverse accents.

Source
hugging_face
Created
Nov 28, 2025
Updated
Feb 22, 2024
Dataset Overview

Dataset Description

Basic Information

  • Dataset Name: EdAcc: The Edinburgh International Accents of English Corpus
  • Size: approximately 7,542,124,660 bytes (about 7.5 GB)
  • Download Size: 6,951,164,322 bytes (about 7.0 GB)

Structure

  • Features:
    • speaker: speaker ID (string)
    • text: transcription of the audio (string)
    • accent: accent labeled by trained linguists (string)
    • raw_accent: self‑reported accent (string)
    • gender: speaker gender (string)
    • l1: speaker's native language, standardized by linguists (string)
    • audio: dictionary containing filename, decoded audio array, and sampling rate (audio)
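
The metadata fields listed above lend themselves to simple per-accent or per-language tallies. The following is a minimal sketch, assuming each record is a mapping keyed by the feature names above (as it would be if loaded with a library such as Hugging Face `datasets` from the `edinburghcstr/edacc` repository). The helper name `count_by_field`, the speaker IDs, and the label values are hypothetical; the actual label strings in the corpus may differ.

```python
from collections import Counter
from typing import Iterable, Mapping


def count_by_field(records: Iterable[Mapping[str, object]], field: str) -> Counter:
    """Count how many utterances carry each value of a metadata field."""
    return Counter(str(record[field]) for record in records)


# Hypothetical records that mirror the documented schema.
# The "audio" field is omitted here to keep the sketch small.
sample_records = [
    {"speaker": "spk-001", "text": "hello there", "accent": "Scottish English",
     "raw_accent": "Scottish", "gender": "female", "l1": "English"},
    {"speaker": "spk-002", "text": "good morning", "accent": "Indian English",
     "raw_accent": "Indian", "gender": "male", "l1": "Hindi"},
    {"speaker": "spk-003", "text": "how are you", "accent": "Scottish English",
     "raw_accent": "Glaswegian", "gender": "male", "l1": "English"},
]

accent_counts = count_by_field(sample_records, "accent")
l1_counts = count_by_field(sample_records, "l1")
```

Counting on `accent` rather than `raw_accent` groups speakers under the linguist-assigned labels, which are likely to be more consistent than the free-form self-reports.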

Splits

  • Validation: 9,848 samples, approximately 2,615,574,878 bytes (about 2.6 GB)
  • Test: 9,289 samples, approximately 4,926,549,782 bytes (about 4.9 GB)

Supported Tasks

  • Automatic Speech Recognition (ASR): Models receive audio files and output transcribed text; primary metric is word error rate (WER).
  • Audio Classification: Models receive audio files and predict speaker accent or gender; primary metric is percentage accuracy.
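
The two metrics named above can be sketched in plain Python. This is an illustrative implementation, not the evaluation code used by the dataset authors: `word_error_rate` computes the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words, and `accuracy_percent` is the share of exact label matches. Both function names are hypothetical, and no text normalization (casing, punctuation) is applied, which a real evaluation would normally add.

```python
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def accuracy_percent(gold: Sequence[str], predicted: Sequence[str]) -> float:
    """Percentage of predictions that exactly match the gold labels."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
example_wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
example_acc = accuracy_percent(["a", "b", "c", "d"], ["a", "b", "x", "d"])
```

When scoring a whole split, WER is usually computed by summing the edit distances and dividing by the total number of reference words, rather than by averaging per-utterance rates.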

Dataset Creation

Data Collection Process

  • Participants held relaxed conversations over Zoom and completed detailed questionnaires, which supplied the speaker metadata.
  • The questionnaire captured language background, English learning duration, language usage, residence history, relationship with conversation partners, and self‑perceived accent.
  • It also gathered demographic information such as age, gender, ethnic background, and education level.
  • Conversations were transcribed by professional transcribers, ensuring accurate capture of speech, overlap, background noises, laughter, and hesitations.

License

  • License: Public domain, Creative Commons Attribution‑ShareAlike International Public License (CC‑BY‑SA)

Citation

@inproceedings{sanabria23edacc,
   title="{The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR}",
   author={Sanabria, Ramon and Bogoychev, Nikolay and Markl, Nina and Carmantini, Andrea and Klejch, Ondrej and Bell, Peter},
   booktitle={ICASSP 2023},
   year={2023},
}
