
darrow-ai/LegalLensNER

LegalLensNER is a dataset specifically designed for named entity recognition (NER) in the legal domain, with a particular focus on detecting legal violations in unstructured text. The dataset contains a unique identifier for each record, the specific word or token in the text, the entity class assigned to the word (e.g., Law, Violation, Violated By, or Violated On), and the start and end character indices of the word in the text. The data generation process combines GPT‑4 automated data generation with manual review by experienced legal annotators. The dataset is open to researchers and practitioners for further enrichment and collaboration.

Updated 7/8/2024
hugging_face

Description

Dataset Overview

LegalLensNER is a dataset specifically designed for named entity recognition (NER) in the legal domain, with a particular emphasis on detecting legal violations in unstructured text.

Data Fields

  • id: (int) Unique identifier for each record.
  • word: (str) Specific word or token in the text.
  • label: (str) Entity class assigned to the word, including Law, Violation, Violated By, or Violated On.
  • start: (int) Starting character index of the word in the text.
  • end: (int) Ending character index of the word in the text.
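
As an illustration of how these five fields fit together, the sketch below models one record and checks that its character offsets actually point at the stated word. The sentence, the records, and the names TokenRecord and span_matches are hypothetical and not taken from the dataset. The check also assumes that end is an exclusive index (Python slice convention), which the field descriptions above do not confirm.

```python
from typing import TypedDict


class TokenRecord(TypedDict):
    """One token-level record with the fields listed above (hypothetical type name)."""
    id: int
    word: str
    label: str
    start: int
    end: int


def span_matches(text: str, rec: TokenRecord) -> bool:
    # ASSUMPTION: `end` is exclusive, so text[start:end] should equal `word`.
    return text[rec["start"]:rec["end"]] == rec["word"]


# Hypothetical sentence and records, made up for illustration only.
text = "Acme Corp breached the Clean Air Act."
records: list[TokenRecord] = [
    {"id": 0, "word": "Acme", "label": "Violated By", "start": 0, "end": 4},
    {"id": 1, "word": "Clean", "label": "Law", "start": 23, "end": 28},
    {"id": 2, "word": "Act", "label": "Law", "start": 33, "end": 36},
]

# True for every record whose offsets slice out exactly its word.
all_consistent = all(span_matches(text, r) for r in records)
```

If the real offsets turn out to be inclusive, the slice would need to be text[start:end + 1] instead.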

Data Generation

LegalLensNER is produced by a pipeline in which GPT‑4 generates synthetic data that experienced legal annotators then review manually.

Collaboration & Contribution

LegalLensNER is a resource for legal‑domain NER, supporting legal text analysis, information extraction, and the development of legal natural language processing (NLP) research and applications. The dataset is open for further enrichment: researchers and practitioners interested in legal NLP are encouraged to contribute or to take part in joint projects that expand its breadth and depth.

Data Example

The dataset can be loaded with the Hugging Face datasets library:

from datasets import load_dataset
dataset = load_dataset("darrow-ai/LegalLensNER")
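
Once the data is loaded, a quick first check is the label distribution. The sketch below counts records per entity class. It runs on a small in-memory list so that it needs no download. The helper name label_counts and the sample rows are hypothetical, and it assumes each row exposes a label column as described under Data Fields.

```python
from collections import Counter
from typing import Iterable, Mapping


def label_counts(rows: Iterable[Mapping[str, object]]) -> Counter:
    """Count how many token records carry each entity label."""
    return Counter(str(row["label"]) for row in rows)


# Hypothetical stand-in rows; real rows would come from the loaded dataset.
sample_rows = [
    {"id": 0, "word": "Acme", "label": "Violated By", "start": 0, "end": 4},
    {"id": 1, "word": "Clean", "label": "Law", "start": 23, "end": 28},
    {"id": 2, "word": "Air", "label": "Law", "start": 29, "end": 32},
    {"id": 3, "word": "Act", "label": "Law", "start": 33, "end": 36},
]

counts = label_counts(sample_rows)

# With the real data, something like label_counts(dataset["train"]) may work;
# the split name "train" is an ASSUMPTION and should be checked against the
# keys of the object returned by load_dataset.
```

Iterating over a split yields one dictionary per row, so the same helper should apply unchanged if the column names match the field list above.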

Citation

@article{bernsohn2024legallens,
  title={LegalLens: Leveraging LLMs for Legal Violation Identification in Unstructured Text},
  author={Bernsohn, Dor and Semo, Gil and Vazana, Yaron and Hayat, Gila and Hagag, Ben and Niklaus, Joel and Saha, Rohit and Truskovskyi, Kyryl},
  journal={arXiv preprint arXiv:2402.04335},
  year={2024}
}

Topics

Legal Domain
Named Entity Recognition

Source

Organization: hugging_face

Created: Unknown