pd3f-dataset-bmjv

This dataset mainly contains German PDF files used for developing pd3f. The files were downloaded from the BMJV page 'Stellungnahmen zu Referententwürfen' and processed with OCR for German and English.

Updated 4/3/2021

pd3f-dataset-bmjv Dataset Overview

Dataset Content

This dataset mainly contains German PDF files used for developing the pd3f project.
The PDF files come from public documents and can be downloaded from https://data.jfilter.de/nlp/pd3f/bmjv_v1.zip.
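
For readers who want to fetch the archive programmatically, here is a minimal sketch using only the Python standard library. The function names `download_archive` and `extract_pdfs` are hypothetical helpers, not part of pd3f, and the internal folder layout of the zip is not documented here, so the code simply searches the extracted tree for PDF files.

```python
import urllib.request
import zipfile
from pathlib import Path

# URL taken from the dataset description above.
DATASET_URL = "https://data.jfilter.de/nlp/pd3f/bmjv_v1.zip"


def download_archive(url: str, target: Path) -> Path:
    """Download the archive to `target` unless a file already exists there."""
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(target))
    return target


def extract_pdfs(archive: Path, dest: Path) -> list[Path]:
    """Unpack the archive into `dest` and return the extracted PDF paths, sorted."""
    dest.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    # The archive layout is an assumption, so search recursively.
    return sorted(
        p for p in dest.rglob("*") if p.is_file() and p.suffix.lower() == ".pdf"
    )
```

A typical call would be `extract_pdfs(download_archive(DATASET_URL, Path("data/bmjv_v1.zip")), Path("data/bmjv_v1"))`. The download is skipped when the target file is already present.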

Dataset Source

The files are the BMJV's “Stellungnahmen zu Referententwürfen” (statements on draft bills), downloaded around 2022-04-02.
Numerical prefixes were added to the file names.
OCRmyPDF was used to run OCR on the German and English content.
Files were sorted and grouped by language.
OCR errors were inspected manually and the affected files were re-processed.
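
The prefixing and OCR steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the script actually used to build the dataset: the prefix format (three zero-padded digits plus an underscore) and both helper names are hypothetical. The `-l` option with `+`-joined Tesseract language codes is the standard OCRmyPDF way to request several languages.

```python
from pathlib import Path


def add_numeric_prefixes(names: list[str], width: int = 3) -> dict[str, str]:
    """Map each file name to a numbered variant, e.g. 'a.pdf' -> '001_a.pdf'.

    The exact prefix format used for the dataset is not documented;
    this zero-padded scheme is an assumption.
    """
    ordered = sorted(names)
    return {name: f"{i:0{width}d}_{name}" for i, name in enumerate(ordered, start=1)}


def ocr_command(src: Path, dst: Path, languages=("deu", "eng")) -> list[str]:
    """Build an OCRmyPDF command line for the given Tesseract language codes."""
    return ["ocrmypdf", "-l", "+".join(languages), str(src), str(dst)]
```

With OCRmyPDF and the German and English Tesseract language packs installed, the returned list can be passed to `subprocess.run(cmd, check=True)`. Building the command separately from running it keeps the sketch easy to inspect before any file is touched.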

License

The dataset is provided under the GPLv3 license.

Topics

Document Processing

Language Technology

Source

Organization: github

Created: 3/23/2020