High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

darrow-ai/LegalLensNER

LegalLensNER is a named entity recognition (NER) dataset for the legal domain, focused on detecting legal violations in unstructured text. Each record contains a unique identifier, the word or token, the entity class assigned to it (e.g., Law, Violation, Violated By, or Violated On), and the token's start and end character indices in the text. The data was generated with GPT‑4 and then manually reviewed by experienced legal annotators. The dataset is open to researchers and practitioners for further enrichment and collaboration.
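A record of this shape (identifier, token, entity class, character offsets) can be sanity-checked by slicing the source text with the offsets. The sketch below is illustrative only: the field names, the label strings, the example sentence, and the choice of an exclusive end index are all assumptions, not the dataset's documented schema.

```python
from dataclasses import dataclass


@dataclass
class TokenRecord:
    # Hypothetical field names; the real column names may differ.
    record_id: int
    token: str
    label: str
    start: int  # index of the first character of the token
    end: int    # assumed exclusive, so text[start:end] == token


def make_record(record_id: int, text: str, token: str, label: str) -> TokenRecord:
    """Build a record by locating the first occurrence of the token in the text."""
    start = text.index(token)
    return TokenRecord(record_id, token, label, start, start + len(token))


def offsets_match(text: str, rec: TokenRecord) -> bool:
    """Return True when the character offsets recover exactly the stored token."""
    return text[rec.start:rec.end] == rec.token


# Made-up example sentence and labels, for illustration only.
text = "The company breached the Clean Air Act."
records = [
    make_record(0, text, "company", "VIOLATED_BY"),
    make_record(1, text, "breached", "VIOLATION"),
    make_record(2, text, "Clean", "LAW"),
]

all_consistent = all(offsets_match(text, rec) for rec in records)
```

If the dataset stores an inclusive end index instead, the slice would need to be `text[rec.start:rec.end + 1]`.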

hugging_face


RaTE-NER

Radiology

Named Entity Recognition

RaTE-NER is a large-scale radiology named entity recognition (NER) dataset. It contains 13,235 manually annotated sentences from 1,816 reports in the MIMIC-IV database, covering nine imaging modalities and 23 anatomical regions. Using GPT-4 together with medical knowledge bases, the dataset is further enriched with 33,605 sentences from 17,432 reports in Radiopaedia, capturing the complexity and subtleties of rare diseases and abnormalities. The dataset is provided in two preprocessing formats to support different NER approaches, and its documentation describes the file paths and structure.

hugging_face


leondz/wnut_17

Named Entity Recognition

Text Classification

The WNUT 17 dataset is a named entity recognition (NER) dataset focusing on identifying novel and rare entities in noisy text. It includes training (3,394 samples), validation (1,009 samples), and test (1,287 samples) sets. Each sample contains an ID, token list, and IOB2‑formatted NER labels covering entities such as companies, creative works, groups, locations, persons, and products. The dataset was created to provide definitions for emerging and rare entities and to support detection of such entities.
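In IOB2 labeling, every entity begins with a `B-` tag, continues with `I-` tags of the same type, and non-entity tokens are tagged `O`. A minimal sketch of turning a token list and its IOB2 tags into entity spans follows; the function name, the example sentence, and the label strings are illustrative assumptions, and the sketch takes string tags rather than integer class ids.

```python
from typing import List, Tuple


def iob2_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group IOB2-tagged tokens into (entity_type, entity_text) pairs.

    An I- tag that does not continue an open entity of the same type is
    treated as outside any entity (a lenient choice made for this sketch).
    """
    spans: List[Tuple[str, str]] = []
    current_type = None
    current_tokens: List[str] = []

    def close_span() -> None:
        # Emit the entity collected so far, if there is one.
        if current_type is not None:
            spans.append((current_type, " ".join(current_tokens)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            close_span()
            current_type = tag[2:]
            current_tokens = [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            close_span()
            current_type = None
            current_tokens = []

    close_span()
    return spans


# Made-up example, not taken from the dataset.
tokens = ["Watching", "Stranger", "Things", "in", "Berlin"]
tags = ["O", "B-creative-work", "I-creative-work", "O", "B-location"]
spans = iob2_to_spans(tokens, tags)
# spans == [("creative-work", "Stranger Things"), ("location", "Berlin")]
```

If the labels are stored as integer ids, they would first need to be mapped back to tag strings before this kind of decoding.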

hugging_face


spyysalo/bc2gm_corpus

Bioinformatics

Named Entity Recognition

Bc2GmCorpus is a dataset for named entity recognition focusing on gene‑related entities. It comprises a training set, validation set, and test set containing 12,500, 2,500, and 5,000 samples respectively. Each sample includes an `id`, a list of `tokens`, and `ner_tags` indicating gene‑related entity annotations.

hugging_face


NER_corpus_chinese

Named Entity Recognition

Natural Language Processing

A collection of Chinese NER corpora, including the People's Daily 1998 edition and the MSRA corpus, for use in named entity recognition tasks.

github


bnsapa/cybersecurity-ner

Cybersecurity

Named Entity Recognition

This token‑classification dataset has three features: id (string), tokens (list of strings), and ner_tags (list of named‑entity labels). The ner_tags span 11 categories covering entity types such as indicators, malware, organizations, systems, and vulnerabilities. The data is split into training, testing, and validation subsets of differing sizes. The download size is 385,026 bytes and the total size is 1,873,973 bytes. The default configuration specifies a file path for each split. The license is Apache 2.0.
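A record with these three features can be checked for basic consistency before training. The sketch below assumes one tag per token, which is the usual convention for token classification but is not stated in the description; the function name and the sample record are hypothetical.

```python
from typing import Any, Dict, List


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a token-classification record.

    Expects the features id (string), tokens (list of strings), and ner_tags
    (list of labels). An empty list means the record looks consistent.
    """
    problems: List[str] = []

    if not isinstance(record.get("id"), str):
        problems.append("id must be a string")

    tokens = record.get("tokens")
    if not isinstance(tokens, list) or not all(isinstance(t, str) for t in tokens):
        problems.append("tokens must be a list of strings")

    tags = record.get("ner_tags")
    if not isinstance(tags, list):
        problems.append("ner_tags must be a list")

    # Assumption: each token carries exactly one tag.
    if isinstance(tokens, list) and isinstance(tags, list) and len(tokens) != len(tags):
        problems.append("tokens and ner_tags must have the same length")

    return problems


# Hypothetical record; the tag values are placeholders, not the real label set.
sample = {"id": "0", "tokens": ["The", "malware", "spread"], "ner_tags": [0, 1, 0]}
sample_problems = validate_record(sample)  # empty list for this record
```

Returning a list of problems rather than raising on the first one makes it easier to report every issue in a malformed split at once.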

hugging_face
