Dataset asset · Open Source Community · Machine Learning · Natural Language Processing

prm800k

This dataset contains the data from [openai/prm800k](https://github.com/openai/prm800k). It is organized into two configurations, phase1 and phase2, each with train and test splits. Top-level features include labeler, timestamp, question, and label; the full feature types are given in the README.

Source: huggingface
Created: Dec 13, 2024
Updated: Dec 14, 2024

Dataset Information

Configuration phase1

  • Features:
    • labeler: string
    • timestamp: string
    • generation: null
    • is_quality_control_question: bool
    • is_initial_screening_question: bool
    • question (structured):
      • problem: string
      • ground_truth_answer: string
    • label (structured):
      • steps (list):
        • completions (list):
          • text: string
          • rating: int64
          • flagged: bool
        • human_completion (structured):
          • text: string
          • rating: null
          • source: string
          • flagged: bool
          • corrected_rating: int64
        • chosen_completion: int64
      • total_time: int64
      • finish_reason: string
  • Splits:
    • train: 5,185,121 bytes, 949 samples
    • test: 532,137 bytes, 106 samples
  • Download Size: 1,850,110 bytes
  • Dataset Size: 5,717,258 bytes
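
The nested label structure is easier to follow with a small traversal. The following is a minimal sketch, not the dataset authors' code: the record is hypothetical and only mirrors the phase1 feature list above. It assumes that chosen_completion is an index into completions and is null when the human-written completion was carried forward, and that ratings take the values -1, 0, and 1; neither assumption is stated on this page.

```python
from typing import Any

# Hypothetical record shaped like the phase1 feature list. All values are
# invented for illustration and are not taken from the dataset.
sample: dict[str, Any] = {
    "labeler": "labeler-0",
    "timestamp": "2022-01-01T00:00:00",
    "generation": None,
    "is_quality_control_question": False,
    "is_initial_screening_question": False,
    "question": {"problem": "What is 2 + 3?", "ground_truth_answer": "5"},
    "label": {
        "steps": [
            {
                "completions": [
                    {"text": "Add the two numbers.", "rating": 1, "flagged": False},
                    {"text": "Multiply the two numbers.", "rating": -1, "flagged": False},
                ],
                "human_completion": None,
                "chosen_completion": 0,
            },
            {
                "completions": [
                    {"text": "2 + 3 = 6.", "rating": -1, "flagged": False},
                ],
                "human_completion": {
                    "text": "2 + 3 = 5.",
                    "rating": None,
                    "source": "human",
                    "flagged": False,
                    "corrected_rating": None,
                },
                # Assumption: null here means the human completion was used.
                "chosen_completion": None,
            },
        ],
        "total_time": 1000,
        "finish_reason": "solution",
    },
}


def chosen_path(record: dict[str, Any]) -> list[str]:
    """Return the text carried forward at each step of the solution."""
    path = []
    for step in record["label"]["steps"]:
        idx = step["chosen_completion"]
        if idx is not None:
            path.append(step["completions"][idx]["text"])
        elif step.get("human_completion") is not None:
            path.append(step["human_completion"]["text"])
    return path


def rating_counts(record: dict[str, Any]) -> dict[int, int]:
    """Count model completions per rating value, skipping unrated ones."""
    counts: dict[int, int] = {}
    for step in record["label"]["steps"]:
        for completion in step["completions"]:
            rating = completion["rating"]
            if rating is not None:
                counts[rating] = counts.get(rating, 0) + 1
    return counts


print(chosen_path(sample))
print(rating_counts(sample))
```

The same traversal should apply to phase2 records, where human_completion is always null and the fallback branch is simply never taken.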

Configuration phase2

  • Features:
    • labeler: string
    • timestamp: string
    • generation: int64
    • is_quality_control_question: bool
    • is_initial_screening_question: bool
    • question (structured):
      • problem: string
      • ground_truth_solution: string
      • ground_truth_answer: string
      • pre_generated_steps: sequence of string
      • pre_generated_answer: string
      • pre_generated_verifier_score: float64
    • label (structured):
      • steps (list):
        • completions (list):
          • text: string
          • rating: int64
          • flagged: bool
        • human_completion: null
        • chosen_completion: int64
      • total_time: int64
      • finish_reason: string
  • Splits:
    • train: 344,736,273 bytes, 97,782 samples
    • test: 9,164,167 bytes, 2,762 samples
  • Download Size: 132,668,705 bytes
  • Dataset Size: 353,900,440 bytes
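
The split and size figures listed for the two configurations can be cross-checked with a few lines of arithmetic. This sketch uses only the numbers given above; the interpretation of Download Size as the size of the stored files and Dataset Size as the size of the decoded data is an assumption based on common Hugging Face metadata conventions.

```python
# (bytes, samples) per split, copied from the listings above.
SPLITS = {
    "phase1": {"train": (5_185_121, 949), "test": (532_137, 106)},
    "phase2": {"train": (344_736_273, 97_782), "test": (9_164_167, 2_762)},
}

# Reported totals per configuration, copied from the listings above.
REPORTED = {
    "phase1": {"download": 1_850_110, "dataset": 5_717_258},
    "phase2": {"download": 132_668_705, "dataset": 353_900_440},
}


def summarize(config: str) -> dict:
    """Sum the splits of one configuration and compare with the reported totals."""
    total_bytes = sum(size for size, _ in SPLITS[config].values())
    total_samples = sum(count for _, count in SPLITS[config].values())
    reported = REPORTED[config]
    return {
        "total_bytes": total_bytes,
        "total_samples": total_samples,
        "matches_reported": total_bytes == reported["dataset"],
        "bytes_per_sample": total_bytes / total_samples,
        # Assumed meaning: stored (downloaded) size relative to decoded size.
        "download_ratio": reported["download"] / reported["dataset"],
    }


for name in SPLITS:
    print(name, summarize(name))
```

For both configurations the split sizes add up exactly to the reported Dataset Size, and the download is a little over a third of that size.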

Configuration Files

  • phase1:
    • train: phase1/train-*
    • test: phase1/test-*
  • phase2:
    • train: phase2/train-*
    • test: phase2/test-*
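
The patterns above map file paths to a configuration and split. As a rough illustration of that mapping, the sketch below matches paths against the same patterns with Python's standard fnmatch module. The shard file names are hypothetical, and the hosting platform may resolve patterns with its own rules, so treat this as an approximation.

```python
from fnmatch import fnmatchcase
from typing import Optional

# Patterns copied from the configuration file listing above.
DATA_FILES = {
    "phase1": {"train": "phase1/train-*", "test": "phase1/test-*"},
    "phase2": {"train": "phase2/train-*", "test": "phase2/test-*"},
}


def route(path: str) -> Optional[tuple[str, str]]:
    """Return (config, split) for the first pattern that matches the path."""
    for config, splits in DATA_FILES.items():
        for split, pattern in splits.items():
            if fnmatchcase(path, pattern):
                return config, split
    return None


# Hypothetical shard names; the real file names are not listed on this page.
for candidate in [
    "phase1/train-00000-of-00001.parquet",
    "phase2/test-00000-of-00001.parquet",
    "README.md",
]:
    print(candidate, "->", route(candidate))
```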

Language

  • English (en)

Dataset Scale

  • 10K < n < 100K