edinburghcstr/edacc

The Edinburgh International Accents of English Corpus (EdAcc) is an automatic speech recognition (ASR) dataset containing 40 hours of English dialogue spoken in a wide range of accents. It covers many first‑language and second‑language varieties of English and provides detailed background information for each speaker. Evaluations with public and commercial models show that EdAcc exposes a shortcoming of current English ASR models: they perform well on existing benchmarks, but their performance degrades significantly on speakers with other accents.

Updated: 2/22/2024
Source: Hugging Face

Dataset Description

Basic Information

  • Dataset Name: EdAcc: The Edinburgh International Accents of English Corpus
  • Dataset Size: 7,542,124,660.365999 bytes (about 7.54 GB)
  • Download Size: 6,951,164,322 bytes (about 6.95 GB)

Structure

  • Features:
    • speaker: speaker ID (string)
    • text: transcription of the audio (string)
    • accent: accent labeled by trained linguists (string)
    • raw_accent: self‑reported accent (string)
    • gender: speaker gender (string)
    • l1: speaker's native language, standardized by linguists (string)
    • audio: dictionary containing filename, decoded audio array, and sampling rate (audio)
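
As a rough illustration of the field layout above, the following Python sketch summarizes one record and derives the clip duration from the audio dictionary. The audio keys `path`, `array`, and `sampling_rate` follow the usual Hugging Face `datasets` audio convention and are an assumption here, since the card only mentions a filename, a decoded array, and a sampling rate. The helper names and the synthetic record are hypothetical and exist only for this example.

```python
from typing import Any, Dict

# String-valued fields, named as in the feature list above.
METADATA_FIELDS = ("speaker", "text", "accent", "raw_accent", "gender", "l1")


def clip_duration_seconds(audio: Dict[str, Any]) -> float:
    """Duration of a decoded clip: number of samples divided by sampling rate."""
    return len(audio["array"]) / audio["sampling_rate"]


def describe_sample(sample: Dict[str, Any]) -> Dict[str, Any]:
    """Collect the string metadata and add the clip duration in seconds."""
    missing = [name for name in METADATA_FIELDS if name not in sample]
    if missing:
        raise KeyError(f"sample is missing fields: {missing}")
    summary = {name: sample[name] for name in METADATA_FIELDS}
    summary["duration_s"] = clip_duration_seconds(sample["audio"])
    return summary


# Synthetic record shaped like the feature list; every value is invented.
example = {
    "speaker": "SPK-000",
    "text": "hello there",
    "accent": "Example accent",
    "raw_accent": "example self-reported accent",
    "gender": "female",
    "l1": "Example language",
    # Assumed key names; one second of silence at 16 kHz.
    "audio": {"path": "example.wav", "array": [0.0] * 16000, "sampling_rate": 16000},
}

print(describe_sample(example))
```

With the `datasets` library installed, real records would presumably come from a call such as `load_dataset("edinburghcstr/edacc", split="validation")`, using the repository name shown above. That call is left out of the sketch because it needs network access.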

Splits

  • Validation: 9,848 samples, 2,615,574,877.928 bytes (about 2.62 GB)
  • Test: 9,289 samples, 4,926,549,782.438 bytes (about 4.93 GB)
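
The split figures can be cross-checked against the overall dataset size with a few lines of arithmetic. The sketch below uses only the numbers listed on this page; the dictionary layout and function names are illustrative choices, not part of the dataset.

```python
from typing import Dict

# Sample counts and byte sizes as listed in the Splits section.
SPLITS: Dict[str, Dict[str, float]] = {
    "validation": {"samples": 9_848, "bytes": 2_615_574_877.928},
    "test": {"samples": 9_289, "bytes": 4_926_549_782.438},
}

# Overall dataset size as listed under Basic Information.
REPORTED_TOTAL_BYTES = 7_542_124_660.365999


def total_samples(splits: Dict[str, Dict[str, float]]) -> int:
    """Sum the sample counts over all splits."""
    return int(sum(info["samples"] for info in splits.values()))


def total_bytes(splits: Dict[str, Dict[str, float]]) -> float:
    """Sum the byte sizes over all splits."""
    return sum(info["bytes"] for info in splits.values())


def mean_bytes_per_sample(info: Dict[str, float]) -> float:
    """Average storage per sample in one split."""
    return info["bytes"] / info["samples"]


for name, info in SPLITS.items():
    print(f"{name}: {mean_bytes_per_sample(info) / 1e3:.1f} kB per sample on average")

print(f"samples in all splits: {total_samples(SPLITS)}")
print(f"difference from reported total: {total_bytes(SPLITS) - REPORTED_TOTAL_BYTES:.3f} bytes")
```

The two split sizes add up to the reported dataset size to within a fraction of a byte, which suggests the fractional byte counts are rounding residue from how the sizes were estimated rather than real sub-byte quantities.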

Supported Tasks

  • Automatic Speech Recognition (ASR): Models receive audio files and output transcribed text; primary metric is word error rate (WER).
  • Audio Classification: Models receive audio files and predict speaker accent or gender; primary metric is percentage accuracy.
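
The two metrics named above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the evaluation code used for EdAcc: words are split on whitespace with no text normalization (real evaluations usually normalize case and punctuation first, and the card does not say which normalization applies), and all function names are hypothetical. The per-accent grouping uses the `accent` field to show how accuracy differences between accents could be surfaced.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Sequence, Tuple


def word_edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word errors divided by total reference words."""
    errors = 0
    ref_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        errors += word_edit_distance(ref, hypothesis.split())
        ref_words += len(ref)
    if ref_words == 0:
        raise ValueError("no reference words to score against")
    return errors / ref_words


def wer_by_accent(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Group (accent, reference, hypothesis) rows by accent and score each group."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for accent, reference, hypothesis in rows:
        grouped[accent].append((reference, hypothesis))
    return {accent: corpus_wer(pairs) for accent, pairs in grouped.items()}


def accuracy_percent(labels: Sequence[str], predictions: Sequence[str]) -> float:
    """Percentage of predictions that equal the reference label."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and equal in length")
    correct = sum(label == pred for label, pred in zip(labels, predictions))
    return 100.0 * correct / len(labels)


# Invented transcripts and labels, for illustration only.
rows = [
    ("accent A", "the cat sat down", "the cat sat down"),
    ("accent B", "the cat sat down", "the bat sat"),
]
print(wer_by_accent(rows))
print(accuracy_percent(["accent A", "accent B"], ["accent A", "accent A"]))
```

Pooling errors over all utterances in a group, rather than averaging per-utterance rates, keeps short utterances from dominating the score.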

Dataset Creation

Data Collection Process

  • Participants engaged in relaxed conversations over Zoom and completed detailed questionnaires to collect metadata.
  • The questionnaire captured language background, English learning duration, language usage, residence history, relationship with conversation partners, and self‑perceived accent.
  • It also gathered demographic information such as age, gender, ethnic background, and education level.
  • Conversations were transcribed by professional transcribers, ensuring accurate capture of speech, overlap, background noises, laughter, and hesitations.

License

  • License: Public domain, Creative Commons Attribution‑ShareAlike International Public License (CC‑BY‑SA)

Citation

@inproceedings{sanabria23edacc,
   title={{The Edinburgh International Accents of English Corpus: Towards the Democratization of English ASR}},
   author={Sanabria, Ramon and Bogoychev, Nikolay and Markl, Nina and Carmantini, Andrea and Klejch, Ondrej and Bell, Peter},
   booktitle={ICASSP 2023},
   year={2023},
}

Topics

Automatic Speech Recognition
Accent Diversity

