botp/Open-Platypus
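The OpenPlatypus dataset focuses on improving the logical reasoning ability of large language models (LLMs) and was used to train the Platypus2 models. It is composed of several sub-datasets, including PRM800K, ScienceQA, SciBench, ReClor, and TheoremQA, which were filtered with keyword search and Sentence Transformers to remove questions whose similarity exceeds 80%. In addition, about 200 questions that appear in the Hugging Face benchmark test sets were removed. The dataset's features are input, output, and instruction, all of type string. The training split contains 24,926 examples with a total size of 30,418,784 bytes.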
Description
OpenPlatypus Dataset Overview
Dataset configuration
- Default configuration (default)
  - Training data file path: data/train-*
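The `data/train-*` path is a glob pattern, so the default configuration picks up every file under `data/` whose name starts with `train-`. A minimal sketch of that matching with Python's `fnmatch` follows. The file names are hypothetical and only illustrate the pattern; they are not taken from the repository listing, and the hosting platform's own glob rules may differ in edge cases.

```python
from fnmatch import fnmatch

# Glob pattern from the default configuration.
TRAIN_PATTERN = "data/train-*"


def select_train_files(paths, pattern=TRAIN_PATTERN):
    """Return the paths that match the training-split glob, sorted by name."""
    return sorted(p for p in paths if fnmatch(p, pattern))


# Hypothetical file listing, for illustration only.
example_paths = [
    "README.md",
    "data/train-00000-of-00001.parquet",
    "data/test-00000-of-00001.parquet",
]
train_files = select_train_files(example_paths)
print(train_files)
```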
Dataset information
- Features:
  - input: string
  - output: string
  - instruction: string
- Splits:
  - train: 24,926 examples, 30,418,784 bytes
- Download size: 15,545,530 bytes
- Dataset size: 30,418,784 bytes
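The schema and size figures above can be checked with a few lines of Python. This is a sketch under stated assumptions: the record is a hypothetical example rather than a row from the dataset, and `is_valid_record` is an illustrative helper, not part of any library. The arithmetic works out to roughly 1.2 KB per example, with the download at about half the in-memory dataset size.

```python
# Figures quoted in the dataset information.
NUM_EXAMPLES = 24926
DATASET_BYTES = 30418784
DOWNLOAD_BYTES = 15545530

# The three string features every record is expected to carry.
REQUIRED_FIELDS = ("input", "output", "instruction")


def is_valid_record(record):
    """True if the record has all three fields and each value is a string."""
    return all(isinstance(record.get(field), str) for field in REQUIRED_FIELDS)


avg_bytes = DATASET_BYTES / NUM_EXAMPLES
download_ratio = DOWNLOAD_BYTES / DATASET_BYTES

# Hypothetical record, for illustration only.
sample = {"instruction": "What is 2 + 2?", "input": "", "output": "4"}
print(is_valid_record(sample))
print(f"{avg_bytes:.0f} bytes per example on average")
print(f"download is {download_ratio:.0%} of the dataset size")
```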
Language
- English (en)
Dataset size category
- 10K < n < 100K
Data sources
- The dataset is composed of the following sub-datasets, filtered with keyword search and Sentence Transformers to remove questions whose similarity exceeds 80%:
  - PRM800K: MIT license
  - ScienceQA: Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International license
  - SciBench: MIT license
  - ReClor: non-commercial license
  - TheoremQA: MIT license
  - nuprl/leetcode-solutions-python-testgen-gpt4: no license listed
  - jondurbin/airoboros-gpt4-1.4.1: other license
  - TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k: Apache-2.0 license
  - openbookQA: Apache-2.0 license
  - ARB: MIT license
  - timdettmers/openassistant-guanaco: Apache-2.0 license
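The 80% similarity filter can be sketched as a greedy pass that keeps a question only if it is not too close to any question already kept. The code below is an assumption-laden illustration, not the authors' pipeline: it swaps the Sentence Transformers embeddings for a plain bag-of-words vector so that it runs with the standard library alone, and the greedy order and the example questions are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase word counts (NOT a Sentence Transformers model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def filter_similar(questions, threshold=0.8):
    """Keep a question only if its similarity to every kept question is <= threshold."""
    kept, kept_vectors = [], []
    for question in questions:
        vector = embed(question)
        if all(cosine(vector, other) <= threshold for other in kept_vectors):
            kept.append(question)
            kept_vectors.append(vector)
    return kept


# Hypothetical questions, for illustration only.
questions = [
    "What is the derivative of x squared?",
    "What is the derivative of x squared ?",
    "Name the largest planet in the solar system.",
]
unique = filter_similar(questions)
print(unique)
```

With a real embedding model, `embed` would return dense vectors and the same greedy loop would apply. The pairwise comparison is quadratic in the number of kept questions, so a larger corpus would likely need batched or approximate nearest-neighbor search.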
Data contamination check
- About 200 questions that appear in the Hugging Face benchmark test sets were removed.
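The card does not say how the overlapping questions were detected. One simple way to illustrate the idea is normalized exact matching against the benchmark questions, sketched below. The matching rule, the function names, and the example strings are all assumptions made for illustration.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation and extra whitespace so trivial variants compare equal."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def remove_contaminated(train_questions, benchmark_questions):
    """Drop training questions whose normalized text appears in the benchmark set."""
    benchmark = {normalize(q) for q in benchmark_questions}
    return [q for q in train_questions if normalize(q) not in benchmark]


# Hypothetical examples, for illustration only.
train = ["Which gas do plants absorb?", "Solve 3x = 12 for x."]
benchmark = ["which gas do plants absorb"]
clean = remove_contaminated(train, benchmark)
print(clean)
```

Exact matching only catches near-verbatim copies. Paraphrased benchmark questions would need a similarity-based check like the one used for deduplication.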
Citation
@article{platypus2023,
  title={Platypus: Quick, Cheap, and Powerful Refinement of LLMs},
  author={Ariel N. Lee and Cole J. Hunter and Nataniel Ruiz},
  journal={arXiv preprint arXiv:2308.07317},
  year={2023}
}
@article{lightman2023lets,
  title={Let's Verify Step by Step},
  author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
  journal={arXiv preprint arXiv:2305.20050},
  year={2023}
}
@inproceedings{lu2022learn,
  title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
  author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
  booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)},
  year={2022}
}
@misc{wang2023scibench,
  title={SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models},
  author={Xiaoxuan Wang and Ziniu Hu and Pan Lu and Yanqiao Zhu and Jieyu Zhang and Satyen Subramaniam and Arjun R. Loomba and Shichang Zhang and Yizhou Sun and Wei Wang},
  year={2023},
  eprint={2307.10635},
  archivePrefix={arXiv}
}
@inproceedings{yu2020reclor,
  author={Yu, Weihao and Jiang, Zihang and Dong, Yanfei and Feng, Jiashi},
  title={ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning},
  booktitle={International Conference on Learning Representations (ICLR)},
  month={April},
  year={2020}
}
@article{chen2023theoremqa,
  title={TheoremQA: A Theorem-driven Question Answering dataset},
  author={Chen, Wenhu and Yin, Ming and Ku, Max and Wan, Elaine and Ma, Xueguang and Xu, Jianyu and Xia, Tony and Wang, Xinyi and Lu, Pan},
  journal={arXiv preprint arXiv:2305.12524},
  year={2023}
}
@inproceedings{OpenBookQA2018,
  title={Can a Suit of Armor Conduct Electricity? A New Dataset for Open Book Question Answering},
  author={Todor Mihaylov and Peter Clark and Tushar Khot and Ashish Sabharwal},
  booktitle={EMNLP},
  year={2018}
}
@misc{sawada2023arb,
  title={ARB: Advanced Reasoning Benchmark for Large Language Models},
  author={Tomohiro Sawada and Daniel Paleka and Alexander Havrilla and Pranav Tadepalli and Paula Vidas and Alexander Kranias and John J. Nay and Kshitij Gupta and Aran Komatsuzaki},
  eprint={2307.13692},
  archivePrefix={arXiv},
  year={2023}
}
Source
Organization: hugging_face
Created: Unknown