Dataset asset · Open Source Community · Speech Signal Processing · Multi‑Speaker Separation

huckiyang/DiPCo

The DiPCo (Dinner Party Corpus) dataset, publicly released by Amazon, helps speech scientists separate multiple speakers' signals in reverberant rooms. It was created by staging dinner‑party sessions with volunteers in a lab, with four participants per session. The corpus includes near‑field and far‑field recordings together with detailed transcriptions for development and evaluation, and is released under the CDLA‑Permissive‑1.0 license.

Source
Hugging Face
Created
Nov 28, 2025
Updated
Feb 6, 2024
Signals
167 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

  • Name: DiPCo – Dinner Party Corpus
  • Alias: DiPCo

Dataset Attributes

  • Language: English (en)
  • Task Categories: automatic‑speech‑recognition, voice‑activity‑detection
  • Multilinguality: Monolingual
  • Tags: speaker separation, speech recognition, microphone array processing
  • License: CDLA‑Permissive‑1.0
  • Size Range: 100 M < size < 100 G
  • Annotation Creators: expert generated
  • Language Creators: expert generated
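
Given the repository ID above (huckiyang/DiPCo), a full local snapshot can be pulled with the huggingface_hub client. A minimal sketch; it uses only the documented snapshot_download API and downloads into the default local cache:

# Minimal download sketch: fetch the dataset snapshot from the Hugging Face Hub.
# Requires: pip install huggingface_hub
from huggingface_hub import snapshot_download

local_dir = snapshot_download(repo_id="huckiyang/DiPCo", repo_type="dataset")
print("dataset downloaded to:", local_dir)

The returned path points at the local copy, which should contain the layout shown under Dataset Structure below.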

Dataset Content

  • Audio Format: WAV, 16 kHz, 16‑bit
  • Recording Types:
    • Near‑field (single‑channel microphone)
    • Far‑field (7‑channel microphone array)
  • File Naming Rules (parsed in the sketch after this list):
    • Near‑field: <session_id>_<speaker_id>.wav
    • Far‑field: <session_id>_<device_id>.<channel_id>.wav
  • Transcription Format: JSON
  • Transcription Content: session ID, speaker ID, gender, mother tongue, language proficiency, transcript text, start time, end time, reference signal
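
The naming rules above are regular enough to parse mechanically, and the audio format can be verified from the WAV header alone. A minimal sketch using only the standard library; the filenames at the end are illustrative:

# Sketch: classify a DiPCo audio file by its name and verify the stated
# format (WAV, 16 kHz, 16-bit).
import re
import wave

NEAR_FIELD = re.compile(r"^(?P<session>S\d{2})_(?P<speaker>P\d{2})\.wav$")
FAR_FIELD = re.compile(r"^(?P<session>S\d{2})_(?P<device>U\d{2})\.(?P<channel>CH\d)\.wav$")

def classify(filename):
    """Return (recording_type, ids) parsed from a DiPCo filename."""
    if m := NEAR_FIELD.match(filename):
        return "near-field", m.groupdict()
    if m := FAR_FIELD.match(filename):
        return "far-field", m.groupdict()
    raise ValueError(f"unrecognized DiPCo filename: {filename}")

def check_format(path):
    """Assert the WAV header matches the documented 16 kHz / 16-bit format."""
    with wave.open(path, "rb") as w:
        assert w.getframerate() == 16000, "expected 16 kHz"
        assert w.getsampwidth() == 2, "expected 16-bit samples"

print(classify("S02_P05.wav"))       # near-field: session S02, speaker P05
print(classify("S02_U01.CH1.wav"))   # far-field: session S02, device U01, CH1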

Dataset Structure

DiPCo/
├── audio
│   ├── dev
│   └── eval
└── transcriptions
    ├── dev
    └── eval
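
A short inventory sketch, assuming the tree above was extracted to a local directory named DiPCo:

# Sketch: count audio and transcription files per split.
from pathlib import Path

root = Path("DiPCo")  # assumed local path to the extracted corpus
for split in ("dev", "eval"):
    n_wav = len(list((root / "audio" / split).glob("*.wav")))
    n_json = len(list((root / "transcriptions" / split).glob("*.json")))
    print(f"{split}: {n_wav} audio files, {n_json} transcription files")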

Session Details

  • Number of Sessions: 10
  • Participants per Session: 4
  • Number of Devices: 5
  • Channels per Device: 7
  • Session Naming: <session_id> (e.g., S01, S02, …)
  • Speaker Naming: <speaker_id> (e.g., P01, P02, …)
  • Device Naming: <device_id> (e.g., U01, U02, …)
  • Channel Naming: <channel_id> (e.g., CH1, CH2, …); the sketch after this list combines these IDs into the expected far‑field filenames
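
Put together, these conventions make the full set of far‑field filenames for a session predictable. A minimal sketch:

# Sketch: expected far-field filenames for one session, derived from the
# conventions above (5 devices x 7 channels = 35 files per session).
def far_field_files(session_id):
    return [
        f"{session_id}_U{d:02d}.CH{c}.wav"
        for d in range(1, 6)   # devices U01..U05
        for c in range(1, 8)   # channels CH1..CH7
    ]

files = far_field_files("S02")
print(len(files))    # 35
print(files[0])      # S02_U01.CH1.wav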

Development & Evaluation Sets

  • Development Set: Sessions S02, S04, S05, S09, S10; total 2 h 43 min, 3,691 utterances
  • Evaluation Set: Sessions S01, S03, S06, S07, S08; total 2 h 36 min, 3,619 utterances (see the session‑to‑split sketch after this list)
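
For preprocessing, the published partition reduces to a small session‑to‑split map. A minimal sketch:

# Sketch: session-to-split map for routing files during preprocessing.
SPLITS = {
    "dev":  {"S02", "S04", "S05", "S09", "S10"},
    "eval": {"S01", "S03", "S06", "S07", "S08"},
}

def split_for(session_id):
    for split, sessions in SPLITS.items():
        if session_id in sessions:
            return split
    raise KeyError(f"unknown session: {session_id}")

print(split_for("S02"))   # dev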

Transcription Example

{
  "start_time": {
    "U01": "00:02:12.79",
    "U02": "00:02:12.79",
    "U03": "00:02:12.79",
    "U04": "00:02:12.79",
    "U05": "00:02:12.79",
    "close-talk": "00:02:12.79"
  },
  "end_time": {
    "U01": "00:02:14.84",
    "U02": "00:02:14.84",
    "U03": "00:02:14.84",
    "U04": "00:02:14.84",
    "U05": "00:02:14.84",
    "close-talk": "00:02:14.84"
  },
  "gender": "male",
  "mother_tongue": "U.S. English",
  "nativeness": "native",
  "ref": "close-talk",
  "session_id": "S02",
  "speaker_id": "P05",
  "words": "[noise] how do you like the food"
}
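
A minimal sketch for consuming such records, assuming each transcription file holds a list of them (the path is hypothetical):

# Sketch: read one transcription file and print utterance boundaries in seconds.
import json

def to_seconds(ts):
    """Convert an 'HH:MM:SS.ss' timestamp to seconds."""
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)

with open("DiPCo/transcriptions/dev/S02.json") as f:  # hypothetical path
    utterances = json.load(f)

for utt in utterances[:5]:
    ref = utt["ref"]  # reference signal, e.g. "close-talk"
    start = to_seconds(utt["start_time"][ref])
    end = to_seconds(utt["end_time"][ref])
    print(f'{utt["speaker_id"]} [{start:.2f}-{end:.2f} s]: {utt["words"]}')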

License

  • Type: CDLA‑Permissive‑1.0
  • Details: see LICENSE file