Dataset Catalog

Browse trusted datasets for evaluation, enrichment, and production use.

Category: Corpus

CEC-Corpus

Emergency Events | Corpus

The Chinese Emergency Corpus (CEC) was built by the Semantic Intelligence Lab at Shanghai University. It contains news reports on five types of emergencies: earthquakes, fires, traffic accidents, terrorist attacks, and food poisoning. The texts were preprocessed, analyzed, and annotated in XML using six tags (Event, Denoter, Time, Location, Participant, and Object) that together describe each event and its elements.
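
As a rough illustration of how the six tags might be read, the sketch below parses a small, made-up XML fragment with Python's standard library. Only the tag names come from the description above; the sample text, the `Document` root element, the nesting of element tags inside `Event`, and the helper name `extract_events` are assumptions, so the real corpus files may need a different traversal.

```python
import xml.etree.ElementTree as ET

# Tags that describe the elements of an event (names taken from the corpus description).
ELEMENT_TAGS = ("Denoter", "Time", "Location", "Participant", "Object")

# Hypothetical sample: the root element and nesting are assumptions, not the real schema.
SAMPLE = """
<Document>
  <Event>
    <Time>Monday morning</Time>
    <Location>a downtown warehouse</Location>
    <Denoter>fire</Denoter>
    <Participant>firefighters</Participant>
    <Object>the building</Object>
  </Event>
  <Event>
    <Participant>Two residents</Participant> were <Denoter>injured</Denoter>.
  </Event>
</Document>
"""


def extract_events(xml_text):
    """Return one dict per <Event>, mapping each element tag to its text spans."""
    root = ET.fromstring(xml_text.strip())
    events = []
    for event in root.iter("Event"):
        record = {tag: [] for tag in ELEMENT_TAGS}
        # Walk every descendant of the event and collect the annotated spans.
        for node in event.iter():
            if node.tag in record:
                record[node.tag].append("".join(node.itertext()).strip())
        events.append(record)
    return events


if __name__ == "__main__":
    for number, record in enumerate(extract_events(SAMPLE), start=1):
        print(number, record)
```

Collecting spans into lists keeps the sketch tolerant of events that lack an element or repeat one. If the real files nest one Event inside another, the inner spans would also be counted under the outer event, so the traversal would need adjusting.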

Source: GitHub | Updated May 24, 2024 | 570 views | Linked

FinPile

Finance | Corpus

FinPile is a secure, high‑quality, open‑source Chinese financial corpus for generating and auditing financial data.

Source: GitHub | Updated Sep 20, 2024 | 543 views | Linked

MedNorm corpus

Medical Terminology Normalization | Corpus

The MedNorm corpus is a dataset and embedding collection for cross-terminology medical concept normalization. It combines instances from multiple datasets and maps each one consistently to both MedDRA and SNOMED-CT terms.

Source: GitHub | Updated Aug 27, 2022 | 259 views | Linked

PoeTree

Multilingual Literature | Corpus

PoeTree is a standardized collection of poetry corpora containing over 300,000 poems in nine languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Spanish, and Russian). Each corpus has been deduplicated, annotated with Universal Dependencies, supplemented with additional metadata, and converted into a unified JSON structure.

Source: GitHub | Updated Jan 17, 2024 | 196 views | Linked