Dataset asset, Open Source Community, Machine Learning, Natural Language Processing
prm800k
This dataset contains data from [openai/prm800k](https://github.com/openai/prm800k). It is divided into two phases (phase1 and phase2), each with train and test splits. Features include labeler, timestamp, question, etc.; detailed feature types are described in the README.
Source: Hugging Face
Created: Dec 13, 2024
Updated: Dec 14, 2024
Dataset Overview
Dataset Information
Configuration phase1
- Features:
  - labeler: string
  - timestamp: string
  - generation: null
  - is_quality_control_question: bool
  - is_initial_screening_question: bool
  - question (structured):
    - problem: string
    - ground_truth_answer: string
  - label (structured):
    - steps (list):
      - completions (list):
        - text: string
        - rating: int64
        - flagged: bool
      - human_completion (structured):
        - text: string
        - rating: null
        - source: string
        - flagged: bool
        - corrected_rating: int64
      - chosen_completion: int64
    - total_time: int64
    - finish_reason: string
- Splits:
  - train: 5,185,121 bytes, 949 samples
  - test: 532,137 bytes, 106 samples
- Download Size: 1,850,110 bytes
- Dataset Size: 5,717,258 bytes
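The nested label structure of phase1 can be walked with a few lines of Python. The sketch below is a minimal illustration, not the dataset authors' tooling. It assumes that chosen_completion is an index into completions and that it is empty (None) when the labeler wrote a human_completion instead; both assumptions are inferred from the field names and should be checked against the README. The helper names and the sample record are hypothetical.

```python
from typing import Any, Optional


def chosen_step_text(step: dict[str, Any]) -> Optional[str]:
    """Return the text selected for one step, or None if nothing was chosen."""
    completions = step.get("completions") or []
    idx = step.get("chosen_completion")
    # Assumption: chosen_completion is an index into completions.
    if idx is not None and 0 <= idx < len(completions):
        return completions[idx]["text"]
    # Assumption: a labeler-written step stands in for the model completions.
    human = step.get("human_completion")
    if human is not None:
        return human["text"]
    return None


def solution_path(record: dict[str, Any]) -> list[str]:
    """Collect the selected step texts of one record, in order."""
    texts = (chosen_step_text(step) for step in record["label"]["steps"])
    return [t for t in texts if t is not None]


# Hand-written record for illustration only; not taken from the dataset.
example_record = {
    "question": {"problem": "What is 2 + 3 * 4?", "ground_truth_answer": "14"},
    "label": {
        "steps": [
            {
                "completions": [
                    {"text": "Add first: 2 + 3 = 5.", "rating": -1, "flagged": False},
                    {"text": "Multiply first: 3 * 4 = 12.", "rating": 1, "flagged": False},
                ],
                "human_completion": None,
                "chosen_completion": 1,
            },
            {
                "completions": [
                    {"text": "So the answer is 20.", "rating": -1, "flagged": False},
                ],
                "human_completion": {
                    "text": "Then 2 + 12 = 14.",
                    "rating": None,
                    "source": "human",
                    "flagged": False,
                    "corrected_rating": None,
                },
                "chosen_completion": None,
            },
        ],
        "total_time": 1000,
        "finish_reason": "solution",
    },
}

print(solution_path(example_record))
```

Steps with neither a valid index nor a human completion are skipped rather than raising, which keeps the helper usable on records where labeling stopped early.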
Configuration phase2
- Features:
  - labeler: string
  - timestamp: string
  - generation: int64
  - is_quality_control_question: bool
  - is_initial_screening_question: bool
  - question (structured):
    - problem: string
    - ground_truth_solution: string
    - ground_truth_answer: string
    - pre_generated_steps: sequence of string
    - pre_generated_answer: string
    - pre_generated_verifier_score: float64
  - label (structured):
    - steps (list):
      - completions (list):
        - text: string
        - rating: int64
        - flagged: bool
      - human_completion: null
      - chosen_completion: int64
    - total_time: int64
    - finish_reason: string
- Splits:
  - train: 344,736,273 bytes, 97,782 samples
  - test: 9,164,167 bytes, 2,762 samples
- Download Size: 132,668,705 bytes
- Dataset Size: 353,900,440 bytes
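Because each phase2 record nests ratings under label, steps, and completions, a common first step is to flatten records into one row per rated completion. The following is a rough sketch under that schema only. The function name iter_step_ratings, the sample record, and the rating values in it are illustrative assumptions; completions whose rating is missing are skipped.

```python
from collections import Counter
from typing import Any, Iterator


def iter_step_ratings(record: dict[str, Any]) -> Iterator[tuple[str, int, str, int]]:
    """Yield (problem, step_index, completion_text, rating) per rated completion."""
    problem = record["question"]["problem"]
    for step_index, step in enumerate(record["label"]["steps"]):
        for completion in step.get("completions") or []:
            rating = completion.get("rating")
            if rating is None:
                # Unrated completions carry no supervision signal; skip them.
                continue
            yield problem, step_index, completion["text"], rating


# Hand-written record for illustration only; not taken from the dataset.
sample = {
    "question": {"problem": "Solve x + 1 = 3."},
    "label": {
        "steps": [
            {
                "completions": [
                    {"text": "Subtract 1 from both sides.", "rating": 1, "flagged": False},
                    {"text": "Consider the equation.", "rating": 0, "flagged": False},
                ],
                "human_completion": None,
                "chosen_completion": 0,
            },
            {
                "completions": [
                    {"text": "So x = 4.", "rating": -1, "flagged": False},
                    {"text": "So x = 2.", "rating": None, "flagged": False},
                ],
                "human_completion": None,
                "chosen_completion": None,
            },
        ],
        "total_time": 500,
        "finish_reason": "found_error",
    },
}

rows = list(iter_step_ratings(sample))
rating_counts = Counter(rating for _, _, _, rating in rows)
print(len(rows), dict(rating_counts))
```

Yielding flat tuples keeps the function independent of any particular loader, so the same code can be applied to rows from a dataframe, a JSON lines file, or a streamed dataset.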
Configuration Files
- phase1:
  - train: phase1/train-*
  - test: phase1/test-*
- phase2:
  - train: phase2/train-*
  - test: phase2/test-*
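The patterns above are glob patterns that assign data files to a configuration and split. As a small sketch of how such a mapping resolves, the snippet below matches file paths against the four patterns with Python's standard fnmatch module. The shard file names used in the example are hypothetical; the actual file names in the repository are not listed here.

```python
from fnmatch import fnmatch
from typing import Optional

# Patterns copied from the configuration listing.
DATA_FILES = {
    "phase1": {"train": "phase1/train-*", "test": "phase1/test-*"},
    "phase2": {"train": "phase2/train-*", "test": "phase2/test-*"},
}


def locate(path: str) -> Optional[tuple[str, str]]:
    """Return (config, split) for a repository file path, or None if unmatched."""
    for config, splits in DATA_FILES.items():
        for split, pattern in splits.items():
            if fnmatch(path, pattern):
                return config, split
    return None


# Hypothetical shard names, for illustration only.
print(locate("phase1/train-00000-of-00001.parquet"))
print(locate("phase2/test-00000-of-00001.parquet"))
print(locate("README.md"))
```

Files that match no pattern, such as the README, belong to no split and return None.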
Language
- English (en)
Dataset Scale
- 10K < n < 100K