Tags: Open Source Community · Question Answering · Python Programming

CodeFeedback-Python105K

This dataset is a subset of `m-a-p/CodeFeedback-Filtered-Instruction`, containing the 104,848 samples written in Python. Each record has two string features, `query` and `response`, and all samples sit in a single training split. The dataset is suited to English-language question-answering tasks, with roughly 105K samples in total.

Source: huggingface
Created: Nov 1, 2024
Updated: Nov 14, 2024
Overview

Dataset description and usage context

CodeFeedback-Python105K Dataset Overview

Dataset Information

  • Features:
    • query: string type
    • response: string type
  • Splits:
    • train: contains 104,848 samples, occupying 232,791,997 bytes
  • Download Size: 114,503,169 bytes
  • Dataset Size: 232,791,997 bytes
  • Configurations:
    • default: includes training data files data/train-*
  • License: Apache 2.0
  • Task Category: Question Answering
  • Language: English
  • Scale Category: 100K < n < 1M
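The feature schema listed above is simple enough to check directly. The sketch below uses hand-made sample records as stand-ins for the real data (the record contents are illustrative, not drawn from the dataset) and verifies that each row carries exactly the two string fields the card describes:

```python
# Minimal schema check for CodeFeedback-Python105K-style records.
# The sample records below are illustrative stand-ins, not real data.
records = [
    {"query": "Write a function that reverses a string.",
     "response": "def reverse(s):\n    return s[::-1]"},
    {"query": "How do I read a file line by line in Python?",
     "response": "with open('f.txt') as fh:\n    for line in fh:\n        print(line)"},
]

def validate(record):
    """Return True if the record matches the card's schema:
    exactly the string features `query` and `response`."""
    return (set(record) == {"query", "response"}
            and all(isinstance(record[key], str) for key in record))

assert all(validate(r) for r in records)
```

With the real data, the same per-row check applies after loading the train split (for example via the Hugging Face `datasets` library).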

Dataset Source

  • This dataset is a subset extracted from the m-a-p/CodeFeedback-Filtered-Instruction dataset, which originally contains 156,526 samples.
  • The original dataset includes samples from four major open‑source code instruction tuning datasets:
    • Magicoder-OSS-Instruct
    • Python code subset of ShareGPT
    • Magicoder-Evol-Instruct
    • Evol-Instruct-Code
  • This subset contains only 104,848 samples written in Python.
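The extraction described above amounts to a per-sample language filter over the parent dataset. The sketch below assumes the parent records expose a language tag (the field name `lang` is an assumption, as are the in-memory stand-in records); it keeps only Python samples and drops the tag, leaving the `query`/`response` schema of this subset:

```python
# Sketch of the Python-subset extraction described above.
# `lang` as the language field of m-a-p/CodeFeedback-Filtered-Instruction
# is an assumption; the records here are illustrative stand-ins.
parent = [
    {"query": "Reverse a list.", "response": "lst[::-1]", "lang": "python"},
    {"query": "Sum an array in C.", "response": "total += a[i];", "lang": "c"},
    {"query": "Parse JSON.", "response": "import json", "lang": "python"},
]

def python_subset(records, lang_field="lang"):
    """Keep only Python samples and drop the language tag,
    matching the query/response schema of the subset."""
    return [
        {"query": r["query"], "response": r["response"]}
        for r in records
        if r.get(lang_field, "").lower() == "python"
    ]

subset = python_subset(parent)
assert len(subset) == 2
assert all(set(r) == {"query", "response"} for r in subset)
```

Applied to the full 156,526-sample parent dataset, a filter of this shape would yield the 104,848 Python samples this card describes.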

References

@article{zheng2024opencodeinterpreter, title={Opencodeinterpreter: Integrating code generation with execution and refinement}, author={Zheng, Tianyu and Zhang, Ge and Shen, Tianhao and Liu, Xueling and Lin, Bill Yuchen and Fu, Jie and Chen, Wenhu and Yue, Xiang}, journal={arXiv preprint arXiv:2402.14658}, year={2024} }

@article{meng2024pissa, title={Pissa: Principal singular values and singular vectors adaptation of large language models}, author={Meng, Fanxu and Wang, Zhaohui and Zhang, Muhan}, journal={arXiv preprint arXiv:2404.02948}, year={2024} }
