Murple/ksponspeech

The KsponSpeech dataset contains 969 hours of Korean conversational speech recorded by approximately 2,000 native Korean speakers in clean environments. All data were created by recording dialogues between two people and manually transcribing the audio. Transcriptions provide both orthographic and phonetic versions, along with disfluency tags (e.g., filler words, repeated words, word fragments) to indicate spontaneous speech. The dataset is primarily used for automatic speech recognition tasks and has been publicly released on the Korean government open data platform.

Updated 11/14/2022

Description

Dataset Overview

Dataset Name

  • Name: KsponSpeech

Dataset Attributes

  • Language: Korean (ko)
  • Language Creation Method: Crowdsourced
  • Multilinguality: Monolingual
  • Annotation Creation Method: Expert-generated
  • Size Category: 10K < n < 100K (number of examples)
  • Source Data: Original
  • Task Category: Automatic Speech Recognition

Dataset Description

  • Summary: Contains 969 hours of general open-domain conversational speech recorded by about 2,000 native Korean speakers in clean environments. The data were built by recording two people conversing freely and manually transcribing the recordings. Each transcription includes both an orthographic and a phonetic version, plus disfluency tags that mark spontaneous-speech phenomena such as filler words, repeated words, and word fragments.
  • Supported Tasks: Automatic Speech Recognition
  • Language: Korean

Dataset Structure

  • Data Instances: Each instance includes audio information (path, array, sample rate), text transcription, and a unique ID.
  • Data Fields:
    • Audio: Contains the audio file path, decoded audio array, and sample rate.
    • Text: Transcription of the audio file.
    • ID: Unique identifier for the data sample.
  • Data Splits: Includes training, validation, and two evaluation sets (eval.clean and eval.other).
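As a minimal sketch of the instance layout described above, the following Python snippet models one record as a plain dictionary and checks that the expected fields are present. The key names (`audio`, `path`, `array`, `sampling_rate`, `text`, `id`) are assumptions based on common Hugging Face audio-dataset conventions, and the sample values are hypothetical placeholders, not real data from the corpus.

```python
from typing import Any, Dict, List

# Assumed field names; the dataset card only names the fields informally.
REQUIRED_FIELDS = ("audio", "text", "id")
REQUIRED_AUDIO_KEYS = ("path", "array", "sampling_rate")


def missing_fields(instance: Dict[str, Any]) -> List[str]:
    """Return the names of expected fields that are absent from an instance."""
    missing = [name for name in REQUIRED_FIELDS if name not in instance]
    audio = instance.get("audio")
    if isinstance(audio, dict):
        # Audio sub-fields are reported with an "audio." prefix.
        missing += [f"audio.{key}" for key in REQUIRED_AUDIO_KEYS if key not in audio]
    return missing


def duration_seconds(instance: Dict[str, Any]) -> float:
    """Compute clip length from the decoded array and its sample rate."""
    audio = instance["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


# Hypothetical instance: one second of silence at an illustrative 16 kHz rate.
sample = {
    "id": "example_000001",
    "audio": {
        "path": "example_000001.wav",
        "array": [0.0] * 16000,
        "sampling_rate": 16000,
    },
    "text": "안녕하세요",  # placeholder transcription
}

print(missing_fields(sample))
print(duration_seconds(sample))
```

A check like this can be run on a few records before training to confirm that the loaded data matches the structure the card describes.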

Dataset Creation

  • Source Data: Constructed by recording two people freely conversing and manually transcribing the dialogues.
  • Annotations: Dual orthographic and phonetic transcriptions, along with disfluency tags for spontaneous speech.
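To illustrate how a dual transcription might be reduced to a single version for training, here is a rough Python sketch. It assumes the dual segments are written as `(orthographic)/(phonetic)` inline in the text; this notation is an assumption and is not specified on this page, so it should be checked against the actual transcripts. The function name `select_transcription` and the example string are hypothetical.

```python
import re

# Assumed notation: "(orthographic)/(phonetic)" with no nested parentheses.
DUAL_PATTERN = re.compile(r"\(([^()]*)\)/\(([^()]*)\)")


def select_transcription(text: str, version: str = "orthographic") -> str:
    """Keep one side of every dual-transcription segment in the text."""
    if version not in ("orthographic", "phonetic"):
        raise ValueError("version must be 'orthographic' or 'phonetic'")
    group = 1 if version == "orthographic" else 2
    # Text outside dual segments is left unchanged.
    return DUAL_PATTERN.sub(lambda match: match.group(group), text)


# Hypothetical transcript line, not taken from the corpus.
example = "(70%)/(칠십 퍼센트) 정도"

print(select_transcription(example, "orthographic"))
print(select_transcription(example, "phonetic"))
```

Which version to keep depends on the target: the orthographic form suits written-style output, while the phonetic form follows what was actually spoken.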

Citation Information

@Article{app10196936,
  AUTHOR = {Bang, Jeong-Uk and Yun, Seung and Kim, Seung-Hi and Choi, Mu-Yeol and Lee, Min-Kyu and Kim, Yeo-Jeong and Kim, Dong-Hyun and Park, Jun and Lee, Young-Jik and Kim, Sang-Hun},
  TITLE = {KsponSpeech: Korean Spontaneous Speech Corpus for Automatic Speech Recognition},
  JOURNAL = {Applied Sciences},
  VOLUME = {10},
  YEAR = {2020},
  NUMBER = {19},
  ARTICLE-NUMBER = {6936},
  URL = {https://www.mdpi.com/2076-3417/10/19/6936},
  ISSN = {2076-3417},
  ABSTRACT = {This paper introduces a large-scale spontaneous speech corpus of Korean, named KsponSpeech. This corpus contains 969 h of general open-domain dialog utterances, spoken by about 2000 native Korean speakers in a clean environment. All data were constructed by recording the dialogue of two people freely conversing on a variety of topics and manually transcribing the utterances. The transcription provides a dual transcription consisting of orthography and pronunciation, and disfluency tags for spontaneity of speech, such as filler words, repeated words, and word fragments. This paper also presents the baseline performance of an end-to-end speech recognition model trained with KsponSpeech. In addition, we investigated the performance of standard end-to-end architectures and the number of sub-word units suitable for Korean. We investigated issues that should be considered in spontaneous speech recognition in Korean. KsponSpeech is publicly available on an open data hub site of the Korea government.},
  DOI = {10.3390/app10196936}
}

Topics

Speech Recognition
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
