PoeTree
PoeTree is a standardized collection of poetry corpora containing over 300,000 poems in ten languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Russian, Slovenian, and Spanish). Each corpus has been deduplicated, annotated with Universal Dependencies, enriched with additional metadata, and converted into a unified JSON structure.
Dataset description and usage context
Dataset Overview
poetRee is an R package that fetches curated poetry data from the PoeTree API. PoeTree is a standardized collection of over 300,000 poems in ten languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Slovenian, Spanish, and Russian). Each sub‑corpus has been deduplicated, annotated with Universal Dependencies relations, enriched with extra metadata, and converted into a uniform JSON format.
Dataset Contents
- Metadata: Provides a summary for each sub‑corpus, including ISO language codes.
- Author Information: Detailed information for all authors present in the corpora.
- Source Information: Bibliographic details of the sources for all entries; results can be restricted to given author IDs.
- Poem Information: All poem records for a given author ID (or vector of author IDs).
- Text Information: Text and annotations for a specified poem ID, supporting multiple output formats.
Usage
- Installation: Install via devtools::install_github("perechen/poetRee").
- Citation: When using the PoeTree dataset, cite the associated dataset and publications.
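The installation step above can be sketched as a short R session. This is a minimal sketch, assuming that devtools may need to be installed from CRAN first and that GitHub is reachable from your machine. Only the install_github call comes from the instructions above; the final ls() call is base R and simply lists whatever the package exports, so no function names are assumed here.

```r
# Assumption: devtools may not be installed yet; fetch it from CRAN if missing.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Install poetRee from GitHub, as stated in the Usage section.
devtools::install_github("perechen/poetRee")

# Load the package and list its exported objects to see what is available.
library(poetRee)
ls("package:poetRee")
```

Listing the exports first is a cautious way to discover the accessors for metadata, authors, sources, poems, and texts before consulting each function's help page with ?name.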
Examples
- Metadata Example: Shows statistics such as number of authors, poems, and lines per corpus.
- Author Example: Lists detailed author information for a specific corpus (e.g., Czech).
- Source Example: Shows source details for a given corpus and author ID.
- Poem Example: Displays poem details for a specific corpus and author ID.
- Text Example: Shows the text of a particular poem ID in various output formats.