clarin-pl/2021-punctuation-restoration

The 2021-punctuation-restoration dataset is intended for restoring punctuation in the output of automatic speech recognition (ASR) systems. It contains Polish text and audio data in two parts: WikiTalks (conversational) and WikiNews (informational). Restored punctuation makes ASR transcripts more readable and may also improve performance on downstream NLP tasks. The dataset comprises 1,200 texts totaling over 240,000 words, spoken by more than 100 native speakers. It provides training and test splits; the test set contains ASR transcriptions of texts from both sources.

Updated 8/29/2022

hugging_face

Description

Dataset Overview

Dataset Composition

Training Size: 800 texts
Development Size: 0 texts
Test Size: 200 texts

Data Format

Input Format: TSV file in which each line contains a text ID and the lower-cased input text without punctuation.
Output Format: File with the same number of lines as the input, each line containing the corresponding text with punctuation restored.
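
The format above can be illustrated with a minimal Python sketch. It is not the official tooling: the function names, the two-column layout (ID, then text) and the set of stripped marks in `MARKS` are assumptions made for illustration only.

```python
import csv
import io

# Assumed set of punctuation marks to strip; the task's real inventory may differ.
MARKS = ".,?!:;"


def to_model_input(text):
    """Lower-case a punctuated text and drop the assumed marks."""
    stripped = text.translate(str.maketrans("", "", MARKS))
    return " ".join(stripped.lower().split())


def read_input(tsv_text):
    """Parse TSV content into (text_id, text) pairs, assuming two columns per line."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader if row]


def same_line_count(rows, output_text):
    """Check that the output has exactly one line per input row."""
    return len(output_text.splitlines()) == len(rows)


if __name__ == "__main__":
    sample = "doc1\ttak to prawda\ndoc2\tco to jest\n"
    rows = read_input(sample)
    output = "Tak, to prawda.\nCo to jest?\n"
    print(rows)
    print(same_line_count(rows, output))
```

A submission would normally be validated with a check like `same_line_count` before scoring, since a missing or extra line misaligns every text that follows it.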

Evaluation

Metrics: Precision, recall and F1 score, computed separately for each punctuation mark.
Final Score: Weighted average of F1 scores across punctuation marks.
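
As a rough illustration of per-mark scoring, the Python sketch below computes precision, recall and F1 for each punctuation mark and then a weighted average of the F1 scores. It is not the official evaluation script. It assumes each word position carries a label (the mark that follows the word, or an empty string for none) and that each mark is weighted by its frequency in the reference; the source does not specify the weights.

```python
from collections import Counter


def per_mark_scores(reference, predicted):
    """Compute precision, recall, F1 and support for each punctuation mark.

    Both arguments are equal-length lists of labels, one per word position:
    the mark following the word, or "" when no mark follows (an assumption).
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    for ref, hyp in zip(reference, predicted):
        if ref == hyp:
            if ref:
                tp[ref] += 1
        else:
            if hyp:
                fp[hyp] += 1  # predicted a mark that is not in the reference
            if ref:
                fn[ref] += 1  # missed (or replaced) a reference mark
    scores = {}
    for mark in sorted(set(tp) | set(fp) | set(fn)):
        p_den = tp[mark] + fp[mark]
        r_den = tp[mark] + fn[mark]
        p = tp[mark] / p_den if p_den else 0.0
        r = tp[mark] / r_den if r_den else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        scores[mark] = {"precision": p, "recall": r, "f1": f1, "support": r_den}
    return scores


def weighted_f1(scores):
    """Average F1 over marks, weighted by reference frequency (assumed weighting)."""
    total = sum(s["support"] for s in scores.values())
    if total == 0:
        return 0.0
    return sum(s["f1"] * s["support"] for s in scores.values()) / total


if __name__ == "__main__":
    reference = ["", ",", "", ".", "", "?"]
    predicted = ["", ",", ",", ".", "", "."]
    scores = per_mark_scores(reference, predicted)
    for mark, s in scores.items():
        print(mark, s)
    print(weighted_f1(scores))
```

Weighting by support means frequent marks such as commas and full stops dominate the final score, while rare marks contribute little even when they are predicted poorly.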

Download

Dataset hosted on GitHub: https://github.com/poleval/2021-punctuation-restoration
Additional training data and resources available via Google Drive.

License

Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0)

Topics

Speech Recognition

Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
