Dataset asset · Open Source Community · Artificial Intelligence · Legal AI

ShengbinYue/DISC-Law-SFT

The DISC‑Law‑SFT dataset is a high‑quality Chinese legal supervised fine‑tuning (SFT) dataset designed to improve how legal AI systems understand and generate legal text. It consists of two subsets: DISC‑Law‑SFT‑Pair, which introduces legal reasoning ability, and DISC‑Law‑SFT‑Triplet, which enhances the model's use of external legal knowledge. Its tasks span legal information extraction, event detection, case classification, judgment prediction, case matching, text summarization, judicial public‑opinion summarization, QA, reading comprehension, and judicial exam questions. The total size is 403 K entries, and the dataset is suited to applications such as legal assistants, consulting services, and exam preparation.

Source
hugging_face
Created
Nov 28, 2025
Updated
Oct 20, 2024
Overview

Dataset description and usage context

DISC‑Law‑SFT Dataset Overview

Basic Information

  • Name: DISC‑Law‑SFT Dataset
  • Language: Chinese
  • Tags: Legal
  • Size category: 100 M < n < 1 B (as listed at the source; the task breakdown below totals 403 K entries)
  • License: Apache‑2.0

Contents

DISC‑Law‑SFT contains two main subsets:

1. DISC‑Law‑SFT‑Pair

  • Purpose: Introduce legal reasoning ability.
  • Tasks & Sizes:
    • Legal Information Extraction: 32 K
    • Legal Event Detection: 27 K
    • Legal Case Classification: 20 K
    • Legal Judgment Prediction: 11 K
    • Legal Case Matching: 8 K
    • Legal Text Summarization: 9 K
    • Judicial Public‑Opinion Summarization: 6 K
    • Legal QA: 93 K
    • Legal Reading Comprehension: 38 K
    • Judicial Exam: 12 K

2. DISC‑Law‑SFT‑Triplet

  • Purpose: Enhance the model's use of external legal knowledge.
  • Tasks & Sizes:
    • Legal Judgment Prediction: 16 K
    • Legal QA: 23 K
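
The difference between the two subsets can be sketched as a difference in record shape: a Pair‑style record carries only an instruction and a response, while a Triplet‑style record additionally carries reference material for the model to draw on. The sketch below is illustrative only. The class name `SFTRecord`, the field names `input`, `output`, and `reference`, and the prompt layout are assumptions made for this example, not the dataset's documented schema, so check the actual files before relying on them.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SFTRecord:
    """Hypothetical record shape; field names are assumptions, not the real schema."""
    input: str                             # instruction or question text
    output: str                            # target response used for fine-tuning
    reference: Optional[List[str]] = None  # external legal knowledge (Triplet-style only)


def build_prompt(record: SFTRecord) -> str:
    """Build the model-facing prompt for one record.

    Pair-style records pass the input through unchanged. Triplet-style
    records prepend a numbered list of reference passages.
    """
    if record.reference:
        refs = "\n".join(f"[{i}] {text}" for i, text in enumerate(record.reference, start=1))
        return f"Reference:\n{refs}\n\nQuestion:\n{record.input}"
    return record.input


# Placeholder strings stand in for real legal text.
pair_example = SFTRecord(input="QUESTION", output="ANSWER")
triplet_example = SFTRecord(
    input="QUESTION",
    output="ANSWER",
    reference=["STATUTE_A", "STATUTE_B"],
)

print(build_prompt(pair_example))
print(build_prompt(triplet_example))
```

Keeping `reference` optional lets one code path handle both subsets, which may be convenient if the two are mixed in a single training run.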

Common (General‑Purpose) Portion

  • Tasks & Sizes:
    • Alpaca‑GPT4: 48 K
    • Firefly: 60 K

Total Size

  • Total: 403 K
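
The stated total can be checked against the per‑task figures listed above. The short sketch below copies those figures (in thousands of entries) and sums them per subset. The dictionary and variable names are chosen for this example only.

```python
# Entry counts in thousands (K), copied from the task lists above.
PAIR_K = {
    "Legal Information Extraction": 32,
    "Legal Event Detection": 27,
    "Legal Case Classification": 20,
    "Legal Judgment Prediction": 11,
    "Legal Case Matching": 8,
    "Legal Text Summarization": 9,
    "Judicial Public-Opinion Summarization": 6,
    "Legal QA": 93,
    "Legal Reading Comprehension": 38,
    "Judicial Exam": 12,
}
TRIPLET_K = {
    "Legal Judgment Prediction": 16,
    "Legal QA": 23,
}
COMMON_K = {
    "Alpaca-GPT4": 48,
    "Firefly": 60,
}

# Sum each portion, then the whole dataset.
subtotals_k = {
    "Pair": sum(PAIR_K.values()),        # 256
    "Triplet": sum(TRIPLET_K.values()),  # 39
    "Common": sum(COMMON_K.values()),    # 108
}
total_k = sum(subtotals_k.values())      # 403

for name, count in subtotals_k.items():
    print(f"{name}: {count} K")
print(f"Total: {total_k} K")
```

The three subtotals (256 K, 39 K, and 108 K) add up to the stated 403 K, so the per‑task figures and the total are mutually consistent.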

Availability

  • Status: Most data have been open‑sourced.