Explore high-quality datasets for your AI and machine learning projects.
LegalLensNER is a dataset designed for named entity recognition (NER) in the legal domain, with a particular focus on detecting legal violations in unstructured text. Each record contains a unique identifier, the word or token from the text, the entity class assigned to that token (e.g., Law, Violation, Violated By, or Violated On), and the start and end character indices of the token in the text. The data was generated automatically with GPT‑4 and then manually reviewed by experienced legal annotators. The dataset is open to researchers and practitioners for further enrichment and collaboration.
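Token-level records with character offsets, as described above, can be merged back into entity spans. The following is a minimal sketch, not the dataset's own tooling: the field names (`record_id`, `token`, `label`, `start`, `end`), the `"O"` marker for non-entity tokens, the exclusive `end` index, and the rule of merging adjacent tokens that share a label are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class TokenRecord:
    record_id: str  # hypothetical field name for the unique identifier
    token: str
    label: str      # e.g. "Law", "Violated By", or "O" (assumed non-entity marker)
    start: int      # start character index in the text
    end: int        # end character index, assumed exclusive


def merge_spans(text, records, outside="O"):
    """Merge consecutive tokens that share a label into (label, surface text) spans."""
    spans = []
    current = None  # [label, start, end] of the span being built
    for rec in sorted(records, key=lambda r: r.start):
        if rec.label == outside:
            if current is not None:
                spans.append(current)
                current = None
            continue
        if current is not None and current[0] == rec.label:
            current[2] = rec.end  # extend the open span
        else:
            if current is not None:
                spans.append(current)
            current = [rec.label, rec.start, rec.end]
    if current is not None:
        spans.append(current)
    return [(label, text[s:e]) for label, s, e in spans]


# Illustrative sentence and annotations (invented, not taken from the dataset).
text = "Acme Corp violated the Clean Air Act."
records = [
    TokenRecord("r1", "Acme", "Violated By", 0, 4),
    TokenRecord("r1", "Corp", "Violated By", 5, 9),
    TokenRecord("r1", "violated", "O", 10, 18),
    TokenRecord("r1", "the", "O", 19, 22),
    TokenRecord("r1", "Clean", "Law", 23, 28),
    TokenRecord("r1", "Air", "Law", 29, 32),
    TokenRecord("r1", "Act", "Law", 33, 36),
    TokenRecord("r1", ".", "O", 36, 37),
]
entity_spans = merge_spans(text, records)
```

Slicing the original text by character offsets, rather than joining tokens, preserves the exact spacing of the source. If the dataset's `end` index turns out to be inclusive, the slice would need `e + 1` instead.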
The RaTE-NER dataset is a large-scale radiology named entity recognition (NER) dataset. It contains 13,235 manually annotated sentences from 1,816 reports in the MIMIC-IV database, covering nine imaging modalities and 23 anatomical regions. In addition, using GPT-4 together with medical knowledge bases, the dataset includes a further 33,605 sentences from 17,432 reports in Radiopaedia, which capture the complexity and subtleties of rare diseases and abnormalities. The dataset is provided in two preprocessing formats to support different NER approaches, and its documentation describes the file paths and directory structure.
The WNUT 17 dataset is a named entity recognition (NER) dataset focusing on identifying novel and rare entities in noisy text. It includes training (3,394 samples), validation (1,009 samples), and test (1,287 samples) sets. Each sample contains an ID, token list, and IOB2‑formatted NER labels covering entities such as companies, creative works, groups, locations, persons, and products. The dataset was created to provide definitions for emerging and rare entities and to support detection of such entities.
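IOB2 labels like those described above mark the first token of an entity with `B-` and continuation tokens with `I-`. A minimal sketch of decoding such labels into entity spans follows. The function name, the example sentence, and the exact label strings (`person`, `location`) are illustrative assumptions, not taken from the dataset files.

```python
def iob2_to_spans(tokens, tags):
    """Decode parallel token and IOB2 tag lists into (entity type, text) spans."""
    spans = []
    start = None  # index of the first token of the open entity
    label = None  # entity type of the open entity
    for i, tag in enumerate(tags):
        # A token continues the open entity only if it is I- with the same type.
        inside = tag.startswith("I-") and label == tag[2:]
        if not inside and label is not None:
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and label is None:
            # Lenient handling: a stray I- tag without a B- opens a new entity.
            start, label = i, tag[2:]
    if label is not None:
        spans.append((label, " ".join(tokens[start:])))
    return spans


# Invented example in the style of noisy social-media text.
tokens = ["Just", "saw", "Kendrick", "Lamar", "at", "Red", "Rocks", "!"]
tags = ["O", "O", "B-person", "I-person", "O", "B-location", "I-location", "O"]
spans = iob2_to_spans(tokens, tags)
```

Strict IOB2 never produces an `I-` tag without a preceding `B-` of the same type; the lenient branch is a design choice for model predictions, which are not guaranteed to be well-formed.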
Bc2GmCorpus is a dataset for named entity recognition focusing on gene‑related entities. It comprises a training set, validation set, and test set containing 12,500, 2,500, and 5,000 samples respectively. Each sample includes an `id`, a list of `tokens`, and `ner_tags` indicating gene‑related entity annotations.
A collection of Chinese corpora for named entity recognition (NER) tasks, including sources such as the People's Daily 1998 edition and the MSRA corpus.
This dataset is intended primarily for token‑classification tasks and has three features: id (string), tokens (list of strings), and ner_tags (list of named‑entity labels). The ner_tags use 11 label categories covering entity types such as indicators, malware, organizations, systems, and vulnerabilities. The dataset is split into training, test, and validation subsets, each with its own sample count and byte size. The download size is 385,026 bytes and the total dataset size is 1,873,973 bytes. It uses the default configuration, which lists the file paths for each split. The license is Apache 2.0.
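In token-classification datasets with this id / tokens / ner_tags layout, the tags are often stored as integer class indices that need to be mapped back to label names. The sketch below assumes that storage format. The label list is hypothetical: five entity types in a B-/I- scheme plus `O` is one plausible way to reach 11 categories, but the actual label names and their order should be read from the dataset's own metadata.

```python
from collections import Counter

# Hypothetical label inventory (assumption): 5 types x (B, I) + "O" = 11 labels.
ENTITY_TYPES = ["Indicator", "Malware", "Organization", "System", "Vulnerability"]
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in ENTITY_TYPES for prefix in ("B", "I")]


def decode_tags(sample):
    """Map a sample's integer ner_tags to label names."""
    return [LABELS[i] for i in sample["ner_tags"]]


def count_entities(samples):
    """Count entities per type by counting B- tags across all samples."""
    counts = Counter()
    for sample in samples:
        for name in decode_tags(sample):
            if name.startswith("B-"):
                counts[name[2:]] += 1
    return counts


# Invented sample following the id / tokens / ner_tags schema.
sample = {
    "id": "0",
    "tokens": ["Emotet", "targets", "Windows", "hosts"],
    "ner_tags": [3, 0, 7, 0],
}
decoded = decode_tags(sample)
entity_counts = count_entities([sample])
```

Counting only `B-` tags gives one count per entity regardless of how many tokens it spans.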