garage-bAInd/Open-Platypus

---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
dataset_info:
  features:
  - name: input
    dtype: string
  - name: output
    dtype: string
  - name: instruction
    dtype: string
  - name: data_source
    dtype: string
  splits:
  - name: train
    num_bytes: 30776452
    num_examples: 24926
  download_size: 15565850
  dataset_size: 30776452
language:
- en
size_categories:
- 10K<n<100K
---

# Open-Platypus

This dataset is focused on improving LLM logical reasoning skills and was used to train the Platypus2 models. It comprises the following datasets, which were filtered using keyword search and then Sentence Transformers to remove questions with a similarity above 80%:

| Dataset Name | License Type |
|--------------|--------------|
| [PRM800K](https://github.com/openai/prm800k) | MIT |
| [MATH](https://github.com/hendrycks/math) | MIT |
| [ScienceQA](https://github.com/lupantech/ScienceQA) | [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International](https://creativecommons.org/licenses/by-nc-sa/4.0/) |
| [SciBench](https://github.com/mandyyyyii/scibench) | MIT |
| [ReClor](https://whyu.me/reclor/) | Non-commercial |
| [TheoremQA](https://huggingface.co/datasets/wenhu/TheoremQA) | MIT |
| [`nuprl/leetcode-solutions-python-testgen-gpt4`](https://huggingface.co/datasets/nuprl/leetcode-solutions-python-testgen-gpt4/viewer/nuprl--leetcode-solutions-python-testgen-gpt4/train?p=1) | None listed |
| [`jondurbin/airoboros-gpt4-1.4.1`](https://huggingface.co/datasets/jondurbin/airoboros-gpt4-1.4.1) | other |
| [`TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k`](https://huggingface.co/datasets/TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k/viewer/TigerResearch--tigerbot-kaggle-leetcodesolutions-en-2k/train?p=2) | apache-2.0 |
| [ARB](https://arb.duckai.org) | CC BY 4.0 |
| [`timdettmers/openassistant-guanaco`](https://huggingface.co/datasets/timdettmers/openassistant-guanaco) | apache-2.0 |

## Data Contamination Check

We've removed approximately 200 questions that appear in the Hugging Face benchmark test sets. Please see our [paper](https://arxiv.org/abs/2308.07317) and [project webpage](https://platypus-llm.github.io) for additional information.

## Model Info

Please see models at [`garage-bAInd`](https://huggingface.co/garage-bAInd).

## Training and filtering code

Please see the [Platypus GitHub repo](https://github.com/arielnlee/Platypus).

## Citations

```bibtex
@article{platypus2023,
  title={Platypus: Quick, Cheap, and Powerful Refinement of LLMs},
  author={Ariel N. Lee and Cole J. Hunter and Nataniel Ruiz},
  journal={arXiv preprint arXiv:2308.07317},
  year={2023}
}
```

```bibtex
@article{lightman2023lets,
  title={Let's Verify Step by Step},
  author={Lightman, Hunter and Kosaraju, Vineet and Burda, Yura and Edwards, Harri and Baker, Bowen and Lee, Teddy and Leike, Jan and Schulman, John and Sutskever, Ilya and Cobbe, Karl},
  journal={preprint arXiv:2305.20050},
  year={2023}
}
```

```bibtex
@inproceedings{lu2022learn,
  title={Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering},
  author={Lu, Pan and Mishra, Swaroop and Xia, Tony and Qiu, Liang and Chang, Kai-Wei and Zhu, Song-Chun and Tafjord, Oyvind and Clark, Peter and Kalyan, Ashwin},
  booktitle={The 36th Conference on Neural Information Processing Systems (NeurIPS)},
  year={2022}
}
```

```bibtex
@misc{wang2023scibench,
  title={SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models},
  author={Xiaoxuan Wang and Ziniu Hu and Pan Lu and Yanqiao Zhu and Jieyu Zhang and Satyen Subramaniam and Arjun R. Loomba and Shichang Zhang and Yizhou Sun and Wei Wang},
  year={2023},
  eprint={2307.10635},
  archivePrefix={arXiv}
}
```

```bibtex
@inproceedings{yu2020reclor,
  author={Yu, Weihao and Jiang, Zihang and Dong, Yanfei and Feng, Jiashi},
  title={ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning},
  booktitle={International Conference on Learning Representations (ICLR)},
  month={April},
  year={2020}
}
```

```bibtex
@article{chen2023theoremqa,
  title={TheoremQA: A Theorem-driven Question Answering dataset},
  author={Chen, Wenhu and Yin, Ming and Ku, Max and Wan, Elaine and Ma, Xueguang and Xu, Jianyu and Xia, Tony and Wang, Xinyi and Lu, Pan},
  journal={preprint arXiv:2305.12524},
  year={2023}
}
```

```bibtex
@article{hendrycksmath2021,
  title={Measuring Mathematical Problem Solving With the MATH Dataset},
  author={Dan Hendrycks and Collin Burns and Saurav Kadavath and Akul Arora and Steven Basart and Eric Tang and Dawn Song and Jacob Steinhardt},
  journal={NeurIPS},
  year={2021}
}
```

```bibtex
@misc{sawada2023arb,
  title={ARB: Advanced Reasoning Benchmark for Large Language Models},
  author={Tomohiro Sawada and Daniel Paleka and Alexander Havrilla and Pranav Tadepalli and Paula Vidas and Alexander Kranias and John J. Nay and Kshitij Gupta and Aran Komatsuzaki},
  eprint={2307.13692},
  archivePrefix={arXiv},
  year={2023}
}
```

Updated 1/24/2024

hugging_face

Description

Dataset Overview

Basic Information

Dataset Name: Open‑Platypus
Dataset Size:
- Download Size: 15,565,850 bytes
- Dataset Size: 30,776,452 bytes
Language: English (en)
Size Category: 10K < n < 100K

Structure

Configuration:
- Default Config:
  - Data Files:
    - Split: train
    - Path: data/train-*
Dataset Information:
- Features:
  - input: string
  - output: string
  - instruction: string
  - data_source: string
- Splits:
  - train:
    - Bytes: 30,776,452
    - Samples: 24,926
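
The feature list above implies that every record is a flat mapping of four string fields. As a minimal sketch of checking that shape, the helper below validates one record against those field names. The function name `validate_record` and the sample record are hypothetical and invented for illustration; they are not taken from the dataset or its tooling.

```python
from typing import Any, Mapping

# Field names as listed under "Features" above; all are declared as strings.
EXPECTED_FEATURES = ("input", "output", "instruction", "data_source")


def validate_record(record: Mapping[str, Any]) -> list[str]:
    """Return a list of problems found in one record (empty if it matches the schema)."""
    problems = []
    for name in EXPECTED_FEATURES:
        if name not in record:
            problems.append(f"missing field: {name}")
        elif not isinstance(record[name], str):
            problems.append(f"field {name} is not a string")
    extra = sorted(set(record) - set(EXPECTED_FEATURES))
    if extra:
        problems.append(f"unexpected fields: {', '.join(extra)}")
    return problems


# Hypothetical record, invented for illustration; not an actual dataset row.
example = {
    "instruction": "What is 2 + 2?",
    "input": "",
    "output": "4",
    "data_source": "example_source",
}
print(validate_record(example))
```

A check like this could be run over each row after loading the train split, for instance before formatting rows into prompts.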

Source

Constituent Datasets:
- PRM800K
- MATH
- ScienceQA
- SciBench
- ReClor
- TheoremQA
- nuprl/leetcode-solutions-python-testgen-gpt4
- jondurbin/airoboros-gpt4-1.4.1
- TigerResearch/tigerbot-kaggle-leetcodesolutions-en-2k
- ARB
- timdettmers/openassistant-guanaco

Purpose

Goal: Improve the logical reasoning skills of large language models (LLMs); the dataset was used to train the Platypus2 models.
Processing: Keyword search and Sentence‑Transformers filtering were used to remove questions with similarity above 80%.
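
The similarity filtering described above can be sketched as a greedy pass that keeps a question only if it is not too close to any question already kept. This is an illustration under stated assumptions, not the authors' pipeline: a toy bag-of-words vector stands in for Sentence Transformers embeddings so the example runs with the standard library alone, the 0.8 threshold mirrors the 80% figure, and the names `embed`, `cosine`, and `drop_near_duplicates` are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: lowercase word counts (a stand-in for a sentence-embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(questions: list[str], threshold: float = 0.8) -> list[str]:
    """Keep a question only if its similarity to every kept question is <= threshold."""
    kept, kept_vectors = [], []
    for question in questions:
        vector = embed(question)
        if all(cosine(vector, other) <= threshold for other in kept_vectors):
            kept.append(question)
            kept_vectors.append(vector)
    return kept


# Hypothetical questions, invented for illustration.
questions = [
    "What is the derivative of x squared?",
    "What is the derivative of x squared",
    "Name the capital of France.",
]
print(drop_near_duplicates(questions))
```

The greedy keep-first strategy is a design choice of this sketch: it compares each candidate against all kept items, which is quadratic in the worst case, and the result depends on input order.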

Cleaning

Cleaning Measures: Approximately 200 questions that appear in the Hugging Face benchmark test sets were removed.
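
This page does not say how overlapping questions were detected. Purely as an illustration, the sketch below assumes normalized exact matching: both sides are lowercased and stripped of punctuation and extra whitespace, and any training question whose normalized text equals a benchmark test question is dropped. The names `normalize` and `remove_contaminated` and the sample questions are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace so trivial edits still match."""
    text = re.sub(r"[^a-z0-9\s]", "", text.lower())
    return " ".join(text.split())


def remove_contaminated(train_questions: list[str], benchmark_questions: list[str]) -> list[str]:
    """Drop training questions whose normalized text equals a benchmark test question."""
    benchmark_keys = {normalize(q) for q in benchmark_questions}
    return [q for q in train_questions if normalize(q) not in benchmark_keys]


# Hypothetical questions, invented for illustration.
train = ["What is the boiling point of water?", "Define entropy."]
benchmark = ["what is the boiling point of water"]
print(remove_contaminated(train, benchmark))
```

Exact matching only catches verbatim or near-verbatim overlap; paraphrased benchmark questions would need a similarity-based check instead.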

Citation

References:
- Platypus: Quick, Cheap, and Powerful Refinement of LLMs
- Let's Verify Step by Step
- Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering
- SciBench: Evaluating College‑Level Scientific Problem‑Solving Abilities of Large Language Models
- ReClor: A Reading Comprehension Dataset Requiring Logical Reasoning
- TheoremQA: A Theorem‑driven Question Answering dataset
- Measuring Mathematical Problem Solving With the MATH Dataset
- ARB: Advanced Reasoning Benchmark for Large Language Models

Topics

Natural Language Processing

Machine Learning

Source

Organization: hugging_face

Created: Unknown
