CodeFeedback-Python105K
CodeFeedback-Python105K is a subset of the `m-a-p/CodeFeedback-Filtered-Instruction` dataset, made up of the 104,848 samples written in Python. Each sample has two string features, `query` and `response`. The data ships as a single training split holding all 104,848 samples. It is intended for English question‑answering tasks and is tagged in the 10K < n < 100K scale category, although the sample count slightly exceeds that range.
Description
CodeFeedback-Python105K Dataset Overview
Dataset Information
- Features:
query: string type
response: string type
- Splits:
train: contains 104,848 samples, occupying 232,791,997 bytes
- Download Size: 114,503,169 bytes
- Dataset Size: 232,791,997 bytes
- Configurations:
default: includes training data files `data/train-*`
- License: Apache 2.0
- Task Category: Question Answering
- Language: English
- Scale Category: 10K < n < 100K
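The size figures listed above can be put in per-sample terms with a little arithmetic. The following is a minimal sketch that only reuses the numbers stated in this card; the function names are illustrative and not part of any library.

```python
# Figures copied from the dataset information above.
NUM_SAMPLES = 104_848
DATASET_SIZE_BYTES = 232_791_997
DOWNLOAD_SIZE_BYTES = 114_503_169


def average_bytes_per_sample(total_bytes: int, num_samples: int) -> float:
    """Mean stored size of one query/response pair, in bytes."""
    return total_bytes / num_samples


def download_ratio(download_bytes: int, dataset_bytes: int) -> float:
    """Fraction of the full dataset size that has to be downloaded."""
    return download_bytes / dataset_bytes


if __name__ == "__main__":
    avg = average_bytes_per_sample(DATASET_SIZE_BYTES, NUM_SAMPLES)
    ratio = download_ratio(DOWNLOAD_SIZE_BYTES, DATASET_SIZE_BYTES)
    print(f"average sample size: {avg:.0f} bytes")
    print(f"download / dataset size: {ratio:.1%}")
```

With the stated figures this works out to roughly 2.2 KB per sample, and a download of about half the dataset size.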
Dataset Source
- This dataset is a subset extracted from the `m-a-p/CodeFeedback-Filtered-Instruction` dataset, which originally contains 156,526 samples.
- The original dataset includes samples from four major open‑source code instruction tuning datasets:
- Magicoder-OSS-Instruct
- Python code subset of ShareGPT
- Magicoder-Evol-Instruct
- Evol-Instruct-Code
- This subset keeps only the 104,848 samples written in Python.
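The extraction described above amounts to a filter over the source records. The sketch below shows one way it could look on plain dictionaries. It is not the authors' actual procedure: the `lang` and `answer` field names are assumptions about the source schema, and `select_python_samples` is a hypothetical helper. Only `query` and `response` are confirmed by this card.

```python
from typing import Dict, Iterable, List


def select_python_samples(
    records: Iterable[Dict[str, str]],
    lang_field: str = "lang",      # assumed name of the language column
    answer_field: str = "answer",  # assumed name of the source answer column
) -> List[Dict[str, str]]:
    """Keep Python records and map them to the query/response schema."""
    subset = []
    for record in records:
        # Skip anything not labeled as Python (case-insensitive match).
        if str(record.get(lang_field, "")).lower() != "python":
            continue
        subset.append(
            {"query": record["query"], "response": record[answer_field]}
        )
    return subset


if __name__ == "__main__":
    # Toy records for illustration only; not taken from the real dataset.
    toy_records = [
        {"query": "Reverse a list.", "answer": "Use xs[::-1].", "lang": "python"},
        {"query": "Declare a slice.", "answer": "var s []int", "lang": "go"},
    ]
    print(select_python_samples(toy_records))
```

Passing the field names as parameters keeps the sketch usable if the source columns turn out to be named differently.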
References
@article{zheng2024opencodeinterpreter, title={Opencodeinterpreter: Integrating code generation with execution and refinement}, author={Zheng, Tianyu and Zhang, Ge and Shen, Tianhao and Liu, Xueling and Lin, Bill Yuchen and Fu, Jie and Chen, Wenhu and Yue, Xiang}, journal={arXiv preprint arXiv:2402.14658}, year={2024} }
@article{meng2024pissa, title={Pissa: Principal singular values and singular vectors adaptation of large language models}, author={Meng, Fanxu and Wang, Zhaohui and Zhang, Muhan}, journal={arXiv preprint arXiv:2404.02948}, year={2024} }
Source
Organization: huggingface
Created: 11/1/2024