pd3f-dataset-bmjv
This dataset mainly contains German PDF files for developing pd3f. The files were downloaded from the BMJV page 'Stellungnahmen zu Referentenentwürfen' and processed with OCR for German and English.
Updated 4/3/2021
Description
pd3f-dataset-bmjv Dataset Overview
Dataset Content
- This dataset mainly contains German PDF files for developing the pd3f project.
- The PDF files come from public documents and can be downloaded from https://data.jfilter.de/nlp/pd3f/bmjv_v1.zip.
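Fetching and unpacking the archive can be sketched as follows. This is a minimal illustration using only the Python standard library, not an official pd3f tool: the helper names `archive_name` and `download_and_extract` and the target directory `bmjv_v1` are hypothetical, and the internal folder layout of the archive is not documented here, so the sketch simply searches the extracted tree for PDF files.

```python
from __future__ import annotations

import urllib.request
import zipfile
from pathlib import Path
from urllib.parse import urlparse

# Download link as given in the dataset description.
DATASET_URL = "https://data.jfilter.de/nlp/pd3f/bmjv_v1.zip"


def archive_name(url: str) -> str:
    """Return the file name component of a URL path."""
    return Path(urlparse(url).path).name


def download_and_extract(url: str = DATASET_URL, target_dir: str = "bmjv_v1") -> list[Path]:
    """Download the archive (unless already present), unpack it, and list the PDFs.

    The target directory name is an assumption made for this sketch.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    archive = target / archive_name(url)
    if not archive.exists():  # reuse a local copy instead of downloading again
        urllib.request.urlretrieve(url, archive)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(target)
    # The archive layout is unknown here, so search recursively.
    return sorted(target.rglob("*.pdf"))
```

Calling `download_and_extract()` with no arguments would download the archive into `bmjv_v1/` and return the sorted list of extracted PDF paths.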
Dataset Source
- The files were downloaded around 2022-04-02 from the BMJV (German Federal Ministry of Justice and Consumer Protection) page "Stellungnahmen zu Referentenentwürfen" (statements on draft bills).
- Numerical prefixes were added to the file names.
- The files were OCRed with OCRmyPDF for German and English.
- The files were sorted and grouped by language.
- Files with OCR errors were manually inspected and re-processed.
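The prefixing and OCR steps above could be reproduced roughly as follows. This is a sketch under stated assumptions, not the author's actual script: the zero-padded `001_` prefix scheme and the helper names are hypothetical, and the exact OCRmyPDF options used for the dataset are not documented here. The only OCRmyPDF feature relied on is the `-l` option, which takes Tesseract language codes joined by `+` (`deu` for German, `eng` for English).

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def prefixed_names(names: list[str], width: int = 3) -> dict[str, str]:
    """Map each file name to a copy with a zero-padded running number in front.

    The prefix format is an assumption; the dataset description only says
    that numerical prefixes were added.
    """
    ordered = sorted(names)
    return {name: f"{i:0{width}d}_{name}" for i, name in enumerate(ordered, start=1)}


def ocr_command(src: Path, dst: Path, languages: tuple[str, ...] = ("deu", "eng")) -> list[str]:
    """Build an OCRmyPDF command line for the given languages."""
    return ["ocrmypdf", "-l", "+".join(languages), str(src), str(dst)]


def run_ocr(src: Path, dst: Path) -> None:
    """Run OCRmyPDF; requires ocrmypdf and the Tesseract language data to be installed."""
    subprocess.run(ocr_command(src, dst), check=True)
```

Keeping the command construction separate from `run_ocr` makes it possible to inspect or log the exact call before running it on a large batch of files.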
License
- The dataset is provided under the GPLv3 license.
Topics
Document Processing
Language Technology
Source
Organization: github
Created: 3/23/2020