clarin-pl/2021-punctuation-restoration
The 2021-punctuation-restoration dataset is intended for restoring punctuation in the output of automatic speech recognition (ASR) systems. It contains Polish text and audio data in two parts: WikiTalks (conversational) and WikiNews (informational). The goal is to improve the readability of ASR-generated transcripts, which may also help performance on other NLP tasks. It comprises 1,200 texts totaling over 240,000 words, spoken by more than 100 native speakers. The dataset provides training and test splits; the test set contains ASR transcriptions of texts from both WikiNews and WikiTalks.
Dataset Overview
Dataset Composition
- Training Size: 800 texts
- Development Size: 0 texts
- Test Size: 200 texts
Data Format
- Input Format: TSV file containing text ID and lower‑cased input text without punctuation.
- Output Format: Same number of lines as the input file, each line containing the text with punctuation.
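The input/output contract above can be sketched in Python. This is a minimal illustration, not the official tooling: it assumes two tab-separated columns with no header row, the sample IDs and sentences are invented, and `restore_punctuation` is a hypothetical stand-in for a real model.

```python
def read_inputs(lines):
    """Parse 'text_id<TAB>text' lines into (text_id, text) pairs."""
    pairs = []
    for line in lines:
        # Assumption: exactly one tab separates the ID from the text.
        text_id, text = line.rstrip("\n").split("\t", 1)
        pairs.append((text_id, text))
    return pairs


def restore_punctuation(text):
    """Hypothetical placeholder model: capitalise and append a period."""
    return text[:1].upper() + text[1:] + "."


def write_outputs(pairs, restore=restore_punctuation):
    """Return one punctuated line per input line, in the same order."""
    return [restore(text) for _, text in pairs]


# Invented sample rows in the assumed input layout.
sample = [
    "wikinews-0001\tdzisiaj w warszawie odbyła się konferencja\n",
    "wikitalks-0001\tdzień dobry jak się pan ma\n",
]
pairs = read_inputs(sample)
outputs = write_outputs(pairs)

# The output must have the same number of lines as the input.
assert len(outputs) == len(sample)
```

Keeping the line count and order identical to the input is what lets a scorer align each prediction with its reference text without needing the ID in the output.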
Evaluation
- Metrics: Precision, recall and F1 score, computed separately for each punctuation mark.
- Final Score: Weighted average of F1 scores across punctuation marks.
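The scoring scheme can be illustrated with a short Python sketch. It is not the official evaluation script. It assumes each token carries one label (the punctuation mark that follows it, or an empty string for none), and it assumes the weights are the number of reference occurrences of each mark, which the description above does not specify.

```python
def per_mark_scores(reference, predicted, marks):
    """Return {mark: (precision, recall, f1, support)} for aligned labels."""
    scores = {}
    for mark in marks:
        pairs = list(zip(reference, predicted))
        tp = sum(1 for r, p in pairs if r == mark and p == mark)
        fp = sum(1 for r, p in pairs if r != mark and p == mark)
        fn = sum(1 for r, p in pairs if r == mark and p != mark)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[mark] = (precision, recall, f1, tp + fn)
    return scores


def weighted_f1(scores):
    """Average per-mark F1, weighted by reference count (an assumption)."""
    total = sum(support for _, _, _, support in scores.values())
    if not total:
        return 0.0
    return sum(f1 * support for _, _, f1, support in scores.values()) / total


# Invented toy labels: one label per token, "" means no punctuation.
reference = [",", "", ".", "", ",", "."]
predicted = [",", "", ".", ",", "", "."]

scores = per_mark_scores(reference, predicted, marks=[",", "."])
final = weighted_f1(scores)
print(scores)
print(final)
```

In this toy case the comma is predicted correctly once, inserted once where it does not belong, and missed once, so its precision, recall and F1 are all 0.5. The period is always correct, so with equal support for both marks the weighted F1 is 0.75.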
Download
- Dataset hosted on GitHub: https://github.com/poleval/2021-punctuation-restoration
- Additional training data and resources available via Google Drive.
License
- Creative Commons – Attribution‑NonCommercial‑NoDerivatives 4.0 International (CC BY‑NC‑ND 4.0)