Back to datasets
Dataset assetOpen Source CommunityTheorem ProvingLean4

l3lab/ntp-mathlib-instruct-context

The miniCTX dataset is a neural theorem proving dataset that emphasizes proof tasks within long-context environments. It consists of Lean 4 tactic prediction examples extracted from Mathlib, each containing a prompt (instruction, preceding file content, proof state) and a completion (tactic). File contents are truncated to 1,024 tokens. The dataset was generated using the `ntptoolkit` utilities `ntp‑training‑data` and `instruction_tuning.py`, and detailed generation configuration is provided.

Source
hugging_face
Created
Nov 28, 2025
Updated
Sep 6, 2024
Signals
114 views
Availability
Linked source ready
Overview

Dataset description and usage context

Dataset Overview

Dataset Name

miniCTX: Neural Theorem Proving with (Long‑)Contexts

Dataset Content

  • Source: Lean 4 tactic prediction examples extracted from Mathlib.
  • Example Structure:
    • prompt:
      • Contains instruction, preceding file content, proof state or instruction, proof state
    • completion: tactic
  • File Content: Truncated to 1,024 tokens.

Dataset Configuration

  • Configuration Name: default
  • Data Files:
    • split: train, dev, test
    • path: "with_context_mathlib_only/with_context_train*", "with_context_mathlib_only/with_context_dev*", "with_context_mathlib_only/with_context_test*"

Version Information

  • Generation Tools: ntptoolkit's ntp‑training‑data and instruction_tuning.py
  • Configuration Details:
    {
        "repo": "https://github.com/leanprover-community/mathlib4",
        "commit": "cf8e23a62939ed7cc530fbb68e83539730f32f86",
        "lean": "leanprover/lean4:v4.4.0",
        "name": "mathlib",
        "import_file": "Mathlib.lean",
        "imports": ["Mathlib"]
    }
    

Citation

@misc{hu2024minictx, author = {Jiewen Hu and Thomas Zhu and Sean Welleck}, title = {miniCTX: Neural Theorem Proving with (Long‑)Contexts}, year = {2024}, eprint={}, archivePrefix={arXiv}, }

Need downstream help?

Pair the dataset with AI analysis and content workflows.

Once the source passes your review, move straight into summarization, transformation, report drafting, or presentation generation with the JuheAI toolchain.

Explore AI studio