Tags: Open Source Community · Question Answering · Python Programming

CodeFeedback-Python105K

This dataset is a subset of `m-a-p/CodeFeedback-Filtered-Instruction`, containing the 104,848 samples written in Python. Each record has two string features, `query` and `response`, and all samples sit in a single training split. The dataset is suited to English-language question-answering tasks, with roughly 105K samples in total.

Source: huggingface
Created: Nov 1, 2024
Updated: Nov 14, 2024
Overview

Dataset description and usage context

CodeFeedback-Python105K Dataset Overview

Dataset Information

  • Features:
    • query: string type
    • response: string type
  • Splits:
    • train: contains 104,848 samples, occupying 232,791,997 bytes
  • Download Size: 114,503,169 bytes
  • Dataset Size: 232,791,997 bytes
  • Configurations:
    • default: includes training data files data/train-*
  • License: Apache 2.0
  • Task Category: Question Answering
  • Language: English
  • Scale Category: 100K < n < 1M
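The feature schema listed above is simple enough to check directly. The sketch below uses hand-made sample records as stand-ins for the real data (the record contents are illustrative, not drawn from the dataset) and verifies that each row carries exactly the two string fields the card describes:

```python
# Minimal schema check for CodeFeedback-Python105K-style records.
# The sample records below are illustrative stand-ins, not real data.
records = [
    {"query": "Write a function that reverses a string.",
     "response": "def reverse(s):\n    return s[::-1]"},
    {"query": "How do I read a file line by line in Python?",
     "response": "with open('f.txt') as fh:\n    for line in fh:\n        print(line)"},
]

def validate(record):
    """Return True if the record matches the card's schema:
    exactly the string features `query` and `response`."""
    return (set(record) == {"query", "response"}
            and all(isinstance(record[key], str) for key in record))

assert all(validate(r) for r in records)
```

With the real data, the same per-row check applies after loading the train split (for example via the Hugging Face `datasets` library).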

Dataset Source

  • This dataset is a subset extracted from the m-a-p/CodeFeedback-Filtered-Instruction dataset, which originally contains 156,526 samples.
  • The original dataset includes samples from four major open‑source code instruction tuning datasets:
    • Magicoder-OSS-Instruct
    • Python code subset of ShareGPT
    • Magicoder-Evol-Instruct
    • Evol-Instruct-Code
  • This subset contains only 104,848 samples written in Python.
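The extraction described above amounts to a per-sample language filter over the parent dataset. The sketch below assumes the parent records expose a language tag (the field name `lang` is an assumption, as are the in-memory stand-in records); it keeps only Python samples and drops the tag, leaving the `query`/`response` schema of this subset:

```python
# Sketch of the Python-subset extraction described above.
# `lang` as the language field of m-a-p/CodeFeedback-Filtered-Instruction
# is an assumption; the records here are illustrative stand-ins.
parent = [
    {"query": "Reverse a list.", "response": "lst[::-1]", "lang": "python"},
    {"query": "Sum an array in C.", "response": "total += a[i];", "lang": "c"},
    {"query": "Parse JSON.", "response": "import json", "lang": "python"},
]

def python_subset(records, lang_field="lang"):
    """Keep only Python samples and drop the language tag,
    matching the query/response schema of the subset."""
    return [
        {"query": r["query"], "response": r["response"]}
        for r in records
        if r.get(lang_field, "").lower() == "python"
    ]

subset = python_subset(parent)
assert len(subset) == 2
assert all(set(r) == {"query", "response"} for r in subset)
```

Applied to the full 156,526-sample parent dataset, a filter of this shape would yield the 104,848 Python samples this card describes.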

References

@article{zheng2024opencodeinterpreter, title={Opencodeinterpreter: Integrating code generation with execution and refinement}, author={Zheng, Tianyu and Zhang, Ge and Shen, Tianhao and Liu, Xueling and Lin, Bill Yuchen and Fu, Jie and Chen, Wenhu and Yue, Xiang}, journal={arXiv preprint arXiv:2402.14658}, year={2024} }

@article{meng2024pissa, title={Pissa: Principal singular values and singular vectors adaptation of large language models}, author={Meng, Fanxu and Wang, Zhaohui and Zhang, Muhan}, journal={arXiv preprint arXiv:2404.02948}, year={2024} }
