botp/Open-Platypus

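The OpenPlatypus dataset focuses on improving the logical reasoning ability of large language models (LLMs) and was used to train the Platypus2 models. It is composed of several sub-datasets, including PRM800K, ScienceQA, SciBench, ReClor, and TheoremQA, which were filtered using keyword search and Sentence Transformers to remove questions with similarity above 80%. In addition, roughly 200 questions that appear in the Hugging Face benchmark test sets were removed. Each record has three string features (input, output, and instruction). The train split contains 24,926 examples with a total size of 30,418,784 bytes.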

Source

hugging_face

Created

Nov 28, 2025

Updated

Aug 17, 2023

Overview

Dataset description and usage context

OpenPlatypus Dataset Overview

Dataset Configuration

Default configuration (default)
- Training data file path: data/train-*
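
The configuration above can be loaded along these lines. This is a minimal sketch, assuming the Hugging Face `datasets` package is installed and that the repository id is `botp/Open-Platypus` as given in the page title. The helper name `load_train_split` is hypothetical.

```python
REPO_ID = "botp/Open-Platypus"  # repository id taken from the page title


def load_train_split(repo_id: str = REPO_ID):
    """Download (or reuse the cached copy of) the train split.

    The import is deferred so this module can be imported even when
    the `datasets` package is not installed.
    """
    from datasets import load_dataset

    # The default configuration maps the train split to data/train-*,
    # so no explicit data_files argument should be needed.
    return load_dataset(repo_id, split="train")
```

Calling `load_train_split()` needs network access on first use. The returned object can then be indexed like a list of records, for example `load_train_split()[0]["instruction"]`.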

Dataset Information

Features:
- input: string
- output: string
- instruction: string
Splits:
- train: 24,926 examples, 30,418,784 bytes in total
Download size: 15,545,530 bytes
Dataset size: 30,418,784 bytes
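
The figures above support a quick sanity check. The sketch below validates that a record has the three string features and derives two ratios from the reported sizes. All function and constant names are hypothetical and chosen for illustration only.

```python
EXPECTED_FEATURES = ("input", "output", "instruction")

# Figures copied from the dataset information above.
NUM_EXAMPLES = 24_926
DATASET_BYTES = 30_418_784
DOWNLOAD_BYTES = 15_545_530


def is_valid_record(record: dict) -> bool:
    """Return True when the record has exactly the three string fields."""
    return set(record) == set(EXPECTED_FEATURES) and all(
        isinstance(record[name], str) for name in EXPECTED_FEATURES
    )


def average_bytes_per_example() -> float:
    """Average size of one example in bytes."""
    return DATASET_BYTES / NUM_EXAMPLES


def download_ratio() -> float:
    """Download size as a fraction of the full dataset size."""
    return DOWNLOAD_BYTES / DATASET_BYTES
```

With these numbers an example averages roughly 1,220 bytes, and the download is about 51% of the full dataset size.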

Language

English (en)

Dataset Size Category

10K < n < 100K

Data Sources

The dataset is composed of the following sub-datasets, which were filtered using keyword search and Sentence Transformers to remove questions with similarity above 80%:
- PRM800K: MIT license
- ScienceQA: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license
- SciBench: MIT license
- ReClor: non-commercial license
- TheoremQA: MIT license
- nuprl/leetcode-solutions-python-testgen-gpt4: no license listed
- jondurbin/airoboros-gpt4-1.4.1: other license
- TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k: Apache-2.0 license
- openbookQA: Apache-2.0 license
- ARB: MIT license
- timdettmers/openassistant-guanaco: Apache-2.0 license
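
The similarity filter described above can be sketched as a greedy pass that keeps a question only if its cosine similarity to every already-kept question is at most 0.8. The card does not describe the exact procedure, so the greedy strategy is an assumption. To stay self-contained, the default `bag_of_words` embedding below is a toy stand-in, not Sentence Transformers. In practice a sentence-embedding model would be passed in through the hypothetical `embed` parameter.

```python
import math
from collections import Counter
from typing import Callable, Dict, Iterable, List

Vector = Dict[str, float]


def bag_of_words(text: str) -> Vector:
    """Toy embedding: lowercase token counts (stand-in for a real model)."""
    return dict(Counter(text.lower().split()))


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse vectors."""
    dot = sum(value * b.get(key, 0.0) for key, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(
    questions: Iterable[str],
    embed: Callable[[str], Vector] = bag_of_words,
    threshold: float = 0.8,
) -> List[str]:
    """Keep each question unless it is too similar to one already kept."""
    kept: List[str] = []
    kept_vectors: List[Vector] = []
    for question in questions:
        vector = embed(question)
        if all(cosine(vector, other) <= threshold for other in kept_vectors):
            kept.append(question)
            kept_vectors.append(vector)
    return kept
```

The pass is quadratic in the number of kept questions, which is acceptable for a sketch. A larger corpus would normally use a vector index instead.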

Data Contamination Check

Roughly 200 questions that appear in the Hugging Face benchmark test sets were removed.
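
The card does not say how the overlapping questions were detected. As one possible illustration, the sketch below drops records whose question text matches a benchmark question after simple normalization. The matching rule, the function names, and the choice of the `instruction` field are all assumptions, not the authors' method.

```python
import re
from typing import Dict, Iterable, List


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def remove_contaminated(
    records: Iterable[Dict[str, str]],
    benchmark_questions: Iterable[str],
    field: str = "instruction",
) -> List[Dict[str, str]]:
    """Drop records whose normalized text equals a benchmark question."""
    benchmark_keys = {normalize(q) for q in benchmark_questions}
    return [r for r in records if normalize(r[field]) not in benchmark_keys]
```

Exact matching after normalization only catches verbatim overlap. Paraphrased benchmark questions would need a similarity-based check instead.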

Citation

@article{platypus2023,
  title={Platypus: Quick, Cheap, and Powerful Refinement of LLMs},
  author={Ariel N. Lee and Cole J. Hunter and Nataniel Ruiz},
  journal={arXiv preprint arXiv:2308.07317},
  year={2023}
}

@article{lightman2023lets,
  title={Let's Verify Step by Step},
  author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
  journal={arXiv preprint arXiv:2305.20050},
  year={2023}
}

@inproceedings{lu2022learn,
  title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
  author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
  booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)},
  year={2022}
}

@misc{wang2023scibench,
  title={SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models},
  author={Xiaoxuan Wang and Ziniu Hu and Pan Lu and Yanqiao Zhu and Jieyu Zhang and Satyen Subramaniam and Arjun R. Loomba and Shichang Zhang and Yizhou Sun and Wei Wang},
  year={2023},
  eprint={2307.10635},
  archivePrefix={arXiv}
}

@inproceedings{yu2020reclor,
  author={Yu, Weihao and Jiang, Zihang and Dong, Yanfei and Feng, Jiashi},
  title={ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning},
  booktitle={International Conference on Learning Representations (ICLR)},
  month={April},
  year={2020}
}

@article{chen2023theoremqa,
  title={TheoremQA: A Theorem-driven Question Answering dataset},
  author={Chen, Wenhu and Yin, Ming and Ku, Max and Wan, Elaine and Ma, Xueguang and Xu, Jianyu and Xia, Tony and Wang, Xinyi and Lu, Pan},
  journal={arXiv preprint arXiv:2305.12524},
  year={2023}
}

@inproceedings{OpenBookQA2018,
  title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
  author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
  booktitle={EMNLP},
  year={2018}
}

@misc{sawada2023arb,
  title={ARB: Advanced Reasoning Benchmark for Large Language Models},
  author={Tomohiro Sawada and Daniel Paleka and Alexander Havrilla and Pranav Tadepalli and Paula Vidas and Alexander Kranias and John J. Nay and Kshitij Gupta and Aran Komatsuzaki},
  year={2023},
  eprint={2307.13692},
  archivePrefix={arXiv}
}
