Tags: Open Source Community, Document Processing, Language Technology

pd3f-dataset-bmjv

This dataset consists mainly of German PDF files for developing pd3f. The files were downloaded from the BMJV 'Stellungnahmen zu Referententwürfen' and processed with OCR for German and English text.

Source: GitHub
Created: Mar 23, 2020
Updated: Apr 3, 2021

pd3f‑dataset‑bmjv Dataset Overview

Dataset Content

Dataset Source

  • The dataset consists of files from the BMJV “Stellungnahmen zu Referententwürfen”, downloaded around 2022‑04‑02.
  • Numerical prefixes were added to file names.
  • OCRmyPDF was used to OCR German and English content.
  • Files were sorted and grouped by language.
  • Files with OCR errors were manually inspected and re‑processed.
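
The preparation steps above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the prefix format (zero-padded index plus underscore), the output directory, and the helper names `prefixed_name`, `ocr_command`, and `plan` are all assumptions. The only external detail relied on is OCRmyPDF's `-l` option, which takes Tesseract language codes joined with `+` (here `deu+eng`, which requires the corresponding Tesseract language packs). The sketch only builds the commands and does not run them.

```python
from __future__ import annotations

from pathlib import Path


def prefixed_name(index: int, name: str, width: int = 4) -> str:
    """Add a zero-padded numerical prefix to a file name (format is an assumption)."""
    return f"{index:0{width}d}_{name}"


def ocr_command(src: Path, dst: Path, languages=("deu", "eng")) -> list[str]:
    """Build an OCRmyPDF command line for German and English OCR."""
    # `-l deu+eng` tells OCRmyPDF (via Tesseract) to recognize both languages.
    return ["ocrmypdf", "-l", "+".join(languages), str(src), str(dst)]


def plan(names, out_dir: Path):
    """Yield one OCR command per input file, with prefixed output names."""
    for i, name in enumerate(sorted(names), start=1):
        src = Path(name)
        dst = out_dir / prefixed_name(i, src.name)
        yield ocr_command(src, dst)


if __name__ == "__main__":
    # Hypothetical input files; print the commands instead of executing them.
    for cmd in plan(["stellungnahme_b.pdf", "stellungnahme_a.pdf"], Path("ocr")):
        print(" ".join(cmd))
```

Each command could be passed to `subprocess.run(cmd, check=True)` once OCRmyPDF is installed. Files that still show OCR errors would then be inspected and re-processed by hand, as the list describes.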

License

  • The dataset is provided under the GPLv3 license.