clarin-pl/2021-punctuation-restoration
The 2021‑punctuation‑restoration dataset is primarily used to restore punctuation in the output of automatic speech recognition (ASR) systems. It contains Polish text and audio data, divided into two parts: WikiTalks (conversational) and WikiNews (informational). The dataset aims to improve the readability of ASR‑generated transcripts and may also enhance performance on other NLP tasks. It comprises 1,200 texts, totaling over 240,000 words, spoken by over 100 different native speakers. The dataset provides training and test splits, with the test set containing ASR transcriptions of texts from both sources (WikiNews and WikiTalks).
Description
Dataset Overview
Dataset Composition
- Training Size: 800 texts
- Development Size: 0 texts
- Test Size: 200 texts
Data Format
- Input Format: TSV file containing text ID and lower‑cased input text without punctuation.
- Output Format: Same number of lines as the input file, each line containing the text with punctuation.
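The input/output contract above can be sketched in a few lines of Python. This is a minimal illustration, not the task's reference tooling: it assumes the TSV has exactly two tab-separated columns (text ID, then text), the sample IDs and sentences are invented, and `restore_punctuation` is a hypothetical placeholder for a real model.

```python
import io


def read_inputs(handle):
    """Yield (text_id, text) pairs from a TSV stream.

    Assumption: two tab-separated columns, text ID first.
    """
    for line in handle:
        text_id, text = line.rstrip("\n").split("\t", 1)
        yield text_id, text


def restore_punctuation(text):
    """Hypothetical placeholder; a real system would run a model here."""
    return text + "."


def write_outputs(rows, handle):
    """Write one punctuated text per line, in input order."""
    for _text_id, text in rows:
        handle.write(restore_punctuation(text) + "\n")


# Invented sample rows, used only to show the shape of the data.
sample_input = (
    "wikinews_001\tw środę odbyło się spotkanie\n"
    "wikitalks_002\tdzień dobry jak się masz\n"
)

out = io.StringIO()
write_outputs(read_inputs(io.StringIO(sample_input)), out)
output_lines = out.getvalue().splitlines()
```

Writing exactly one output line per input row keeps the two files aligned, which is what the "same number of lines" requirement relies on.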
Evaluation
- Metrics: Precision, recall and F1 score, computed separately for each punctuation mark.
- Final Score: Weighted average of F1 scores across punctuation marks.
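As a rough illustration of this scoring scheme, the sketch below computes precision, recall and F1 per punctuation mark and then averages the F1 scores. Two points are assumptions rather than facts from the task description: each token slot is represented by the mark that follows it (an empty string for none), and the average is weighted by each mark's count in the reference. The official evaluation script may differ.

```python
def per_mark_scores(reference, predicted, marks):
    """Compute precision, recall and F1 separately for each mark.

    `reference` and `predicted` are aligned lists holding the mark that
    follows each token ("" when no mark follows).
    """
    scores = {}
    pairs = list(zip(reference, predicted))
    for mark in marks:
        tp = sum(1 for r, p in pairs if r == mark and p == mark)
        fp = sum(1 for r, p in pairs if r != mark and p == mark)
        fn = sum(1 for r, p in pairs if r == mark and p != mark)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores[mark] = {
            "precision": precision,
            "recall": recall,
            "f1": f1,
            "support": tp + fn,  # occurrences of the mark in the reference
        }
    return scores


def weighted_f1(scores):
    """Average F1 across marks, weighted by reference support (assumption)."""
    total = sum(s["support"] for s in scores.values())
    if total == 0:
        return 0.0
    return sum(s["f1"] * s["support"] for s in scores.values()) / total


# Invented toy labels for six token slots.
reference = [",", "", ".", "", "?", "."]
predicted = [",", "", ".", ",", ".", "."]

scores = per_mark_scores(reference, predicted, marks=[",", ".", "?"])
final_score = weighted_f1(scores)
```

In the toy data the question mark is never predicted, so its F1 is 0 and it lowers the final score in proportion to how often it appears in the reference.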
Download
- Dataset hosted on GitHub: https://github.com/poleval/2021-punctuation-restoration
- Additional training data and resources available via Google Drive.
License
- Creative Commons – Attribution‑NonCommercial‑NoDerivatives 4.0 International (CC BY‑NC‑ND 4.0)
Source
Organization: hugging_face