Explore high-quality datasets for your AI and machine learning projects.
The Chinese Emergency Corpus, built by the Semantic Intelligence Lab at Shanghai University, contains news reports on five types of emergencies: earthquakes, fires, traffic accidents, terrorist attacks, and food poisoning. The texts were preprocessed, analyzed, and annotated, with XML as the annotation format. Six tags (Event, Denoter, Time, Location, Participant, and Object) describe each event and its elements.
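To illustrate how the six tags might be consumed, here is a minimal Python sketch that collects the elements of each annotated event. The sample sentence and the nesting of element tags inside `Event` are assumptions made for illustration; the corpus's actual schema, attributes, and file layout may differ, and `extract_events` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical annotated sentence. Only the six tag names come from the
# corpus description; the nesting shown here is an assumption.
SAMPLE = """
<Event>
  <Time>On the morning of 3 May</Time>, <Location>a warehouse in Shanghai</Location>
  <Denoter>caught fire</Denoter>; <Participant>firefighters</Participant>
  rescued <Object>stored goods</Object>.
</Event>
"""

# The five element tags that describe an Event.
ELEMENT_TAGS = ("Denoter", "Time", "Location", "Participant", "Object")


def extract_events(xml_text):
    """Return one dict per <Event>, mapping each element tag to its text spans."""
    root = ET.fromstring(xml_text.strip())
    events = []
    # iter() also yields the root itself when it is an <Event>.
    for event in root.iter("Event"):
        elements = defaultdict(list)
        for node in event.iter():
            if node.tag in ELEMENT_TAGS:
                # itertext() gathers text even if the span has nested markup.
                elements[node.tag].append("".join(node.itertext()).strip())
        events.append(dict(elements))
    return events


if __name__ == "__main__":
    for event in extract_events(SAMPLE):
        for tag, spans in event.items():
            print(f"{tag}: {spans}")
```

Using the standard-library `xml.etree.ElementTree` keeps the sketch dependency-free; for large annotation files, a streaming parser such as `ET.iterparse` may be a better fit.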
FinPile is a secure, high‑quality, open‑source Chinese financial corpus for generating and auditing financial data.
MedNorm is a corpus and embedding collection for cross‑terminology medical concept normalization. It combines instances from multiple datasets and maps them consistently to both MedDRA and SNOMED‑CT terms.
PoeTree is a standardized collection of poetry corpora containing over 300,000 poems in nine languages (Czech, English, French, German, Hungarian, Italian, Portuguese, Spanish, and Russian). Each corpus has been deduplicated, annotated with Universal Dependencies, enriched with additional metadata, and converted into a unified JSON structure.