Explore high-quality datasets for your AI and machine learning projects.
The MIT Movie NER dataset is part of the T-NER project and is designed for named entity recognition in the movie domain. It covers 12 entity types: Actor, Plot, Opinion, Award, Year, Genre, Origin, Director, Soundtrack, Relationship, Character_Name, and Quote. The dataset is split into training (6,816 instances), validation (1,000 instances), and test (1,953 instances) sets.
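As a rough illustration of how the entity types and split sizes listed above might be used when setting up a tagging model, the following sketch builds a label vocabulary and computes the split proportions. The BIO tagging scheme, the label ordering, and the helper names `build_bio_labels` and `split_fractions` are assumptions made for this example; they are not taken from the dataset itself.

```python
# Entity types as listed in the dataset description.
ENTITY_TYPES = [
    "Actor", "Plot", "Opinion", "Award", "Year", "Genre", "Origin",
    "Director", "Soundtrack", "Relationship", "Character_Name", "Quote",
]

# Split sizes as stated in the dataset description.
SPLIT_SIZES = {"train": 6816, "validation": 1000, "test": 1953}


def build_bio_labels(entity_types):
    """Return "O" plus a B-/I- tag pair per entity type (BIO scheme is assumed)."""
    labels = ["O"]
    for entity_type in entity_types:
        labels.append(f"B-{entity_type}")
        labels.append(f"I-{entity_type}")
    return labels


def split_fractions(split_sizes):
    """Return each split's share of the total number of instances."""
    total = sum(split_sizes.values())
    return {name: size / total for name, size in split_sizes.items()}


labels = build_bio_labels(ENTITY_TYPES)
# Hypothetical label-to-index mapping; the real dataset may order labels differently.
label2id = {label: index for index, label in enumerate(labels)}
fractions = split_fractions(SPLIT_SIZES)

print(f"{len(labels)} labels, {sum(SPLIT_SIZES.values())} instances in total")
for name, fraction in fractions.items():
    print(f"{name}: {fraction:.1%}")
```

Under these assumptions, 12 entity types yield 25 tags, and the 9,769 instances divide into roughly 70% training, 10% validation, and 20% test.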
CCNC is a large Chinese name corpus containing 3,658,109 name samples, sourced from the Name Encyclopedia and the Chinese Personal Names Corpus. The samples have been processed and given phonetic annotations, and the corpus is used for Chinese name research and entity recognition.
The DWIE (Deutsche Welle Information Extraction) corpus is a dataset designed for document-level multi-task information extraction. It combines four main IE subtasks: named entity recognition, coreference resolution, relation extraction, and entity linking. The dataset includes detailed entity and relation annotations, with entities linked to Wikipedia, and is suitable for feature extraction and text classification tasks on English text.