Dataset asset: Open Source Community, Document Processing, Language Technology
pd3f-dataset-bmjv
This dataset consists mainly of German PDF files for developing pd3f. The files were downloaded from the BMJV page 'Stellungnahmen zu Referentenentwürfen' and processed with OCR for German and English text.
Source
github
Created
Mar 23, 2020
Updated
Apr 3, 2021
Overview
Dataset description and usage context
pd3f-dataset-bmjv Dataset Overview
Dataset Content
- This dataset consists mainly of German PDF files for developing the pd3f project.
- The PDF files are sourced from public documents and can be downloaded via the following link: https://data.jfilter.de/nlp/pd3f/bmjv_v1.zip.
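
As a minimal sketch of fetching the archive linked above, the following Python snippet downloads the zip file and extracts only its PDF members. It uses the standard library only. The function names `download_archive` and `extract_pdfs` and the local `data/` paths are hypothetical, chosen for illustration, and the internal layout of the archive is not assumed beyond it containing `.pdf` files.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL as given in the dataset description.
DATASET_URL = "https://data.jfilter.de/nlp/pd3f/bmjv_v1.zip"


def download_archive(url: str, target: Path) -> Path:
    """Download the archive to `target` (requires network access)."""
    target.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, target)
    return target


def extract_pdfs(archive: Path, dest: Path) -> list[Path]:
    """Extract only the .pdf members of `archive` into `dest`; return their paths."""
    dest.mkdir(parents=True, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(archive) as zf:
        for member in zf.namelist():
            if member.lower().endswith(".pdf"):
                extracted.append(Path(zf.extract(member, dest)))
    return sorted(extracted)


# Example usage (hypothetical local paths):
#   archive = download_archive(DATASET_URL, Path("data/bmjv_v1.zip"))
#   pdfs = extract_pdfs(archive, Path("data/bmjv_v1"))
```

Filtering on the `.pdf` suffix keeps any non-PDF files in the archive out of the working directory.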
Dataset Source
- The files were downloaded from the BMJV page “Stellungnahmen zu Referentenentwürfen” (statements on ministerial draft bills) around 2022-04-02.
- Numerical prefixes were added to file names.
- OCRmyPDF was used to OCR German and English content.
- Files were sorted and grouped by language.
- The files were inspected manually, and those with OCR errors were re-processed.
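
The renaming and OCR steps listed above could be sketched roughly as follows. This is not the script the dataset author used. The prefix format (zero-padded index plus underscore) is an assumption, since the description only says that numerical prefixes were added, and the helper names `prefixed_name`, `ocr_command`, and `plan` are hypothetical. The only external interface relied on is the OCRmyPDF command line, whose `-l` option accepts several Tesseract language codes joined with `+`.

```python
from pathlib import Path


def prefixed_name(index: int, filename: str, width: int = 3) -> str:
    """Return `filename` with a zero-padded numeric prefix (assumed format)."""
    return f"{index:0{width}d}_{filename}"


def ocr_command(src: Path, dst: Path, languages=("deu", "eng")) -> list[str]:
    """Build an OCRmyPDF invocation for German and English text."""
    return ["ocrmypdf", "-l", "+".join(languages), str(src), str(dst)]


def plan(filenames, out_dir: Path) -> list[list[str]]:
    """Pair each input file, in sorted order, with a prefixed output path."""
    jobs = []
    for i, name in enumerate(sorted(filenames), start=1):
        dst = out_dir / prefixed_name(i, Path(name).name)
        jobs.append(ocr_command(Path(name), dst))
    return jobs
```

Each command list can be passed to `subprocess.run(cmd, check=True)` on a machine where OCRmyPDF and the Tesseract language data for `deu` and `eng` are installed. Grouping by language and the manual review are left out of the sketch.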
License
- The dataset is provided under the GPLv3 license.