LegalLensNER is a dataset for named entity recognition (NER) in the legal domain, focused on detecting legal violations in unstructured text. Each record contains a unique identifier, a word or token from the text, the entity class assigned to that token (e.g., Law, Violation, Violated By, or Violated On), and the start and end character indices of the token in the text. The data was produced by combining automated generation with GPT-4 and manual review by experienced legal annotators. The dataset is open to researchers and practitioners for further enrichment and collaboration.
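To make the record layout described above concrete, here is a minimal Python sketch of how one might represent and sanity-check token-level records. It is illustrative only: the field names (`record_id`, `token`, `label`, `start`, `end`), the exact label strings, the end-exclusive index convention, and the sample sentence are all assumptions, not taken from the dataset itself.

```python
from dataclasses import dataclass

# Entity classes named in the description; the exact label strings are assumed.
ENTITY_CLASSES = {"Law", "Violation", "Violated By", "Violated On"}


@dataclass
class TokenRecord:
    # All field names here are hypothetical stand-ins for the dataset's columns.
    record_id: str
    token: str
    label: str
    start: int  # character offset where the token begins
    end: int    # assumed to be exclusive, as in Python slicing


def validate(text: str, rec: TokenRecord) -> bool:
    """Check that the label is known and the offsets point at the token."""
    return rec.label in ENTITY_CLASSES and text[rec.start:rec.end] == rec.token


def merge_spans(text: str, records: list[TokenRecord]) -> list[tuple[str, str]]:
    """Join adjacent tokens that share a label into (label, phrase) pairs."""
    spans: list[tuple[str, int, int]] = []
    for rec in sorted(records, key=lambda r: r.start):
        prev = spans[-1] if spans else None
        # Merge only when the labels match and nothing but whitespace separates them.
        if prev and prev[0] == rec.label and text[prev[2]:rec.start].strip() == "":
            spans[-1] = (prev[0], prev[1], rec.end)
        else:
            spans.append((rec.label, rec.start, rec.end))
    return [(label, text[s:e]) for label, s, e in spans]


# Made-up example sentence and annotations, not drawn from the dataset.
TEXT = "Acme Corp violated the Clean Air Act."
RECORDS = [
    TokenRecord("r1", "Acme", "Violated By", 0, 4),
    TokenRecord("r1", "Corp", "Violated By", 5, 9),
    TokenRecord("r1", "Clean", "Law", 23, 28),
    TokenRecord("r1", "Air", "Law", 29, 32),
    TokenRecord("r1", "Act", "Law", 33, 36),
]

if __name__ == "__main__":
    print(all(validate(TEXT, r) for r in RECORDS))
    print(merge_spans(TEXT, RECORDS))
```

Checking offsets against the source text before training is a cheap way to catch misaligned annotations; if the dataset turns out to use inclusive end indices, the slice in `validate` would need `rec.end + 1` instead.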