Tags: Dataset asset, Open Source Community, Named Entity Recognition, Legal Domain

darrow-ai/LegalLensNER

LegalLensNER is a dataset specifically designed for named entity recognition (NER) in the legal domain, with a particular focus on detecting legal violations in unstructured text. The dataset contains a unique identifier for each record, the specific word or token in the text, the entity class assigned to the word (e.g., Law, Violation, Violated By, or Violated On), and the start and end character indices of the word in the text. The data generation process combines GPT‑4 automated data generation with manual review by experienced legal annotators. The dataset is open to researchers and practitioners for further enrichment and collaboration.

Source: Hugging Face
Created: Nov 28, 2025
Updated: Jul 8, 2024

Dataset Overview

LegalLensNER is a dataset specifically designed for named entity recognition (NER) in the legal domain, with a particular emphasis on detecting legal violations in unstructured text.

Data Fields

  • id: (int) Unique identifier for each record.
  • word: (str) Specific word or token in the text.
  • label: (str) Entity class assigned to the word, including Law, Violation, Violated By, or Violated On.
  • start: (int) Starting character index of the word in the text.
  • end: (int) Ending character index of the word in the text.
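The fields above describe one token per record, so a multi-word entity such as a statute name is spread over several records. The following is a minimal sketch of how such records might be merged back into entity spans. The toy records, the `merge_spans` helper, the `"O"` tag for non-entity tokens, and the exclusive `end` offset are all assumptions made for illustration; the actual tagging scheme and offset convention should be checked against the loaded data.

```python
from typing import Dict, List

# Hypothetical records for the sentence
# "Acme Corp breached the Clean Air Act" (not taken from the dataset).
# Assumptions: "O" marks non-entity tokens and `end` is exclusive.
records: List[Dict] = [
    {"id": 0, "word": "Acme", "label": "Violated By", "start": 0, "end": 4},
    {"id": 1, "word": "Corp", "label": "Violated By", "start": 5, "end": 9},
    {"id": 2, "word": "breached", "label": "O", "start": 10, "end": 18},
    {"id": 3, "word": "the", "label": "O", "start": 19, "end": 22},
    {"id": 4, "word": "Clean", "label": "Law", "start": 23, "end": 28},
    {"id": 5, "word": "Air", "label": "Law", "start": 29, "end": 32},
    {"id": 6, "word": "Act", "label": "Law", "start": 33, "end": 36},
]


def merge_spans(records: List[Dict], outside_label: str = "O") -> List[Dict]:
    """Merge consecutive records that share an entity label into one span."""
    spans: List[Dict] = []
    prev_label = None
    for rec in records:
        label = rec["label"]
        if label != outside_label:
            if spans and label == prev_label:
                # Same label as the previous token: extend the open span.
                spans[-1]["text"] += " " + rec["word"]
                spans[-1]["end"] = rec["end"]
            else:
                # Label changed (or first entity token): start a new span.
                spans.append({
                    "label": label,
                    "text": rec["word"],
                    "start": rec["start"],
                    "end": rec["end"],
                })
        prev_label = label
    return spans


spans = merge_spans(records)
for span in spans:
    print(span)
```

Merging on label equality alone cannot separate two adjacent entities of the same class; if the data uses a scheme with explicit begin markers, the grouping rule would need to follow that scheme instead.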

Data Generation

LegalLensNER is generated through a detailed pipeline that includes automatic data generation using GPT‑4 to produce synthetic data, followed by manual review by experienced legal annotators.

Collaboration & Contribution

LegalLensNER is a resource for legal-domain NER that supports legal text analysis, information extraction, and legal natural language processing (NLP) research and applications. The dataset is open to further enrichment: researchers and practitioners interested in legal NLP are encouraged to contribute or to join collaborative projects that expand its breadth and depth.

Data Example

To access the dataset, you can use the following code snippet:

from datasets import load_dataset

# Download the dataset from the Hugging Face Hub (requires the `datasets` package)
dataset = load_dataset("darrow-ai/LegalLensNER")
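Once the data is loaded, a quick first check is the distribution of entity classes. The sketch below counts labels over any iterable of records that carry a `label` field, as described under Data Fields. The `label_counts` helper and the toy records are hypothetical and exist only so the example runs without network access.

```python
from collections import Counter
from typing import Dict, Iterable


def label_counts(records: Iterable[Dict]) -> Counter:
    """Count how often each entity class appears in the `label` field."""
    return Counter(rec["label"] for rec in records)


# Hypothetical records for illustration only (not taken from the dataset).
toy_records = [
    {"word": "Acme", "label": "Violated By"},
    {"word": "Clean", "label": "Law"},
    {"word": "Act", "label": "Law"},
    {"word": "fraud", "label": "Violation"},
]

counts = label_counts(toy_records)
for label, n in counts.most_common():
    print(f"{label}: {n}")
```

With the real data, the same helper could be applied to each split returned by `load_dataset`, for example `label_counts(dataset[split_name])`. The split names are not stated here, so they should be read from the loaded object.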

Citation

@article{bernsohn2024legallens,
  title={LegalLens: Leveraging LLMs for Legal Violation Identification in Unstructured Text},
  author={Bernsohn, Dor and Semo, Gil and Vazana, Yaron and Hayat, Gila and Hagag, Ben and Niklaus, Joel and Saha, Rohit and Truskovskyi, Kyryl},
  journal={arXiv preprint arXiv:2402.04335},
  year={2024}
}