darrow-ai/LegalLensNER
LegalLensNER is a dataset specifically designed for named entity recognition (NER) in the legal domain, with a particular focus on detecting legal violations in unstructured text. The dataset contains a unique identifier for each record, the specific word or token in the text, the entity class assigned to the word (e.g., Law, Violation, Violated By, or Violated On), and the start and end character indices of the word in the text. The data generation process combines GPT‑4 automated data generation with manual review by experienced legal annotators. The dataset is open to researchers and practitioners for further enrichment and collaboration.
Description
Dataset Overview
LegalLensNER is a dataset specifically designed for named entity recognition (NER) in the legal domain, with a particular emphasis on detecting legal violations in unstructured text.
Data Fields
- id: (int) Unique identifier for each record.
- word: (str) Specific word or token in the text.
- label: (str) Entity class assigned to the word, including Law, Violation, Violated By, or Violated On.
- start: (int) Starting character index of the word in the text.
- end: (int) Ending character index of the word in the text.
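Because each record describes a single token, a full entity mention such as a multi-word law name has to be reassembled from consecutive records. The following is a minimal sketch of one way to do that, not the dataset authors' method. It assumes that records arrive in text order, that non-entity tokens carry a label outside the four entity classes (the "O" used here is hypothetical), and that end is an exclusive character offset. The TokenRecord class, the merge_spans helper, and the sample sentence are all invented for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class TokenRecord:
    """One row, mirroring the five documented fields."""
    id: int
    word: str
    label: str
    start: int
    end: int


# The four entity classes named in the field description.
ENTITY_LABELS = {"Law", "Violation", "Violated By", "Violated On"}


def merge_spans(records: Iterable[TokenRecord]) -> List[Tuple[str, str, int, int]]:
    """Merge consecutive tokens sharing an entity label into (label, text, start, end)."""
    spans: List[Tuple[str, str, int, int]] = []
    current = None  # [label, words, start, end] of the span being built
    for rec in records:
        if rec.label in ENTITY_LABELS and current and current[0] == rec.label:
            # Same label as the open span: extend it.
            current[1].append(rec.word)
            current[3] = rec.end
            continue
        if current:
            # Label changed or a non-entity token appeared: close the open span.
            spans.append((current[0], " ".join(current[1]), current[2], current[3]))
            current = None
        if rec.label in ENTITY_LABELS:
            current = [rec.label, [rec.word], rec.start, rec.end]
    if current:
        spans.append((current[0], " ".join(current[1]), current[2], current[3]))
    return spans


# Invented sample for "Acme Corp violated the Clean Air Act" (not taken from the dataset).
SAMPLE = [
    TokenRecord(0, "Acme", "Violated By", 0, 4),
    TokenRecord(1, "Corp", "Violated By", 5, 9),
    TokenRecord(2, "violated", "O", 10, 18),
    TokenRecord(3, "the", "O", 19, 22),
    TokenRecord(4, "Clean", "Law", 23, 28),
    TokenRecord(5, "Air", "Law", 29, 32),
    TokenRecord(6, "Act", "Law", 33, 36),
]

for span in merge_spans(SAMPLE):
    print(span)
```

Note that this simple grouping cannot separate two adjacent mentions of the same class; if the real labels carry a begin/inside prefix scheme, that information should be used instead.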
Data Generation
LegalLensNER is produced by a pipeline in which GPT-4 generates synthetic data that experienced legal annotators then review manually.
Collaboration & Contribution
LegalLensNER is a resource for legal-domain NER that supports legal text analysis and information extraction, as well as research and applications in legal natural language processing (NLP). The dataset is open to further enrichment: researchers and practitioners interested in legal NLP are encouraged to contribute or to take part in joint projects that expand its breadth and depth.
Data Example
The dataset can be loaded with the Hugging Face datasets library:
from datasets import load_dataset
dataset = load_dataset("darrow-ai/LegalLensNER")
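After loading, a quick sanity check is to count how often each entity class occurs. The sketch below is illustrative only: count_labels is a hypothetical helper, and it assumes each row behaves like a mapping with a "label" key, as the field list suggests. It runs on invented rows so that it does not depend on a download.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_labels(rows: Iterable[Mapping[str, object]]) -> Dict[str, int]:
    """Count how many rows carry each value of the 'label' field."""
    return dict(Counter(str(row["label"]) for row in rows))


# Invented rows for illustration; not taken from the dataset.
sample_rows = [
    {"id": 0, "word": "Clean", "label": "Law", "start": 23, "end": 28},
    {"id": 1, "word": "Air", "label": "Law", "start": 29, "end": 32},
    {"id": 2, "word": "Acme", "label": "Violated By", "start": 0, "end": 4},
]

print(count_labels(sample_rows))
```

With the real data, the rows of one split of the loaded dataset could be passed in the same way; the available split names and the exact column layout should be checked on the loaded object first, since they are not stated here.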
Citation
@article{bernsohn2024legallens,
title={LegalLens: Leveraging LLMs for Legal Violation Identification in Unstructured Text},
author={Bernsohn, Dor and Semo, Gil and Vazana, Yaron and Hayat, Gila and Hagag, Ben and Niklaus, Joel and Saha, Rohit and Truskovskyi, Kyryl},
journal={arXiv preprint arXiv:2402.04335},
year={2024}
}
Source
Organization: hugging_face