High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

tner/mit_movie_trivia

The MIT Movie NER dataset, distributed as part of the T-NER project, is designed for named entity recognition in the movie domain. It covers 12 entity types: Actor, Plot, Opinion, Award, Year, Genre, Origin, Director, Soundtrack, Relationship, Character_Name, and Quote. The dataset is split into training (6,816 instances), validation (1,000 instances), and test (1,953 instances) sets.
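
As a quick sanity check on the split sizes quoted above, the following minimal sketch computes the total number of instances and each split's share. The counts are copied from the description and are not re-verified against the hosted dataset; the names `SPLIT_SIZES` and `split_fractions` are hypothetical and used only for illustration.

```python
# Split sizes as stated in the dataset description (assumed correct, not re-verified).
SPLIT_SIZES = {"train": 6816, "validation": 1000, "test": 1953}


def split_fractions(sizes):
    """Return each split's share of the total number of instances."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


total_instances = sum(SPLIT_SIZES.values())
fractions = split_fractions(SPLIT_SIZES)

# Print one line per split, then the overall total.
for name, share in fractions.items():
    print(f"{name}: {SPLIT_SIZES[name]} instances ({share:.1%})")
print(f"total: {total_instances}")
```

Under these numbers the split is roughly 70/10/20 across training, validation, and test.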

hugging_face

CCNC

Chinese Name Research

Entity Recognition

CCNC is a large Chinese name corpus containing 3,658,109 name samples, sourced from the Name Encyclopedia and the Chinese Personal Names Corpus. The samples have been processed and given phonetic annotations, and the corpus is intended for Chinese name research and entity recognition.

github

DFKI-SLT/DWIE

Information Extraction

Entity Recognition

The DWIE (Deutsche Welle Information Extraction) corpus is a dataset for document-level multi-task information extraction. It combines four main IE subtasks: named entity recognition, coreference resolution, relation extraction, and entity linking. Entities and relations are annotated in detail and linked to Wikipedia, and the English-language corpus is also suitable for feature extraction and text classification tasks.

hugging_face
