semeru/Text-Code-CodeSearchNet-Python
This dataset supports natural‑language‑to‑source‑code search and contains filtered Python code examples. Samples were removed if they cannot be parsed into an abstract syntax tree, if their documentation has fewer than 3 or more than 256 tokens, if they contain special tokens, or if they are not in English. The dataset comprises three .jsonl files for training, validation, and testing; each line represents one function. In the test set, function names and variables are replaced with special tokens to assess model generalization.
Description
Dataset Overview
Source and Processing
- Source: The dataset originates from CodeSearchNet.
- Processing: Pre‑processed with CodeXGLUE scripts, which remove code samples that cannot be parsed into an abstract syntax tree and those that do not meet token count or language criteria.
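The parse and token-count criteria above can be illustrated with a minimal sketch. This is not the CodeXGLUE script itself: Python's built-in ast module stands in for whatever parser the original scripts use, the helper names parses_to_ast and keep_sample are hypothetical, and treating the 3 to 256 range as inclusive is an assumption. The special-token and language checks are left out.

```python
import ast

MIN_DOC_TOKENS = 3    # lower bound stated in the description
MAX_DOC_TOKENS = 256  # upper bound stated in the description (assumed inclusive)


def parses_to_ast(code: str) -> bool:
    """Return True if the code parses into a Python abstract syntax tree."""
    try:
        ast.parse(code)
        return True
    except (SyntaxError, ValueError):
        return False


def keep_sample(code: str, docstring_tokens: list) -> bool:
    """Apply the parse check and the documentation-length check to one sample."""
    doc_len_ok = MIN_DOC_TOKENS <= len(docstring_tokens) <= MAX_DOC_TOKENS
    return doc_len_ok and parses_to_ast(code)
```

A sample passes only if both checks hold, so a syntactically broken function is dropped even when its documentation length is within range.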
Content
- Format: Three .jsonl files: train.jsonl, valid.jsonl, test.jsonl.
- Structure: Each line represents a function with fields such as repo, path, func_name, original_string, language, code/function, code_tokens/function_tokens, docstring, docstring_tokens, url, idx, etc.
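Because each line is a standalone JSON object, the files can be streamed one record at a time. The sketch below assumes the files are available locally. The helper names read_jsonl and summarize are hypothetical, and since the field list above gives alternatives (code/function), the code checks for both rather than assuming one.

```python
import json
from typing import Iterator


def read_jsonl(path: str) -> Iterator[dict]:
    """Yield one record (one function) per non-empty line of a .jsonl file."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(record: dict) -> dict:
    """Pick a few of the listed fields; .get() tolerates fields that are absent."""
    code = record.get("code", record.get("function"))
    return {
        "idx": record.get("idx"),
        "url": record.get("url"),
        "func_name": record.get("func_name"),
        "n_doc_tokens": len(record.get("docstring_tokens", [])),
        "has_code": code is not None,
    }
```

For example, `for record in read_jsonl("train.jsonl"): print(summarize(record))` would print a short summary per function, assuming train.jsonl sits in the working directory.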
Statistics
| Split | #Examples |
|---|---|
| Train | 251,820 |
| Dev | 9,604 |
| Test | 19,210 |
Task Definition
- Goal: Given a natural‑language description, retrieve matching source code.
- Test: Function names and variables in the test set are replaced with special tokens to evaluate generalization.
Example
- File: evaluator/test.jsonl
- Sample Content: Multiple records, each with url, docstring, function, idx, etc.
Prediction Input
- Processing: For each natural‑language query, rank the candidate code snippets in descending order of relevance score and return their idx values in that order.
- Example Output: One JSON record per query, containing the url and an array of the corresponding ranked answers.
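The ranking step can be sketched as follows. The token-overlap score is a toy stand-in for a real retrieval model, the candidate records are made up for illustration, and the output key name "answers" is an assumption based on the description above rather than a confirmed schema.

```python
import json


def overlap_score(query_tokens: list, code_tokens: list) -> int:
    """Toy relevance score: number of distinct tokens shared by query and code."""
    return len(set(query_tokens) & set(code_tokens))


def rank_candidates(query_tokens: list, candidates: list) -> list:
    """Return candidate idx values sorted by descending score.

    Ties keep their input order because sorted() is stable.
    """
    ranked = sorted(
        candidates,
        key=lambda cand: overlap_score(query_tokens, cand["code_tokens"]),
        reverse=True,
    )
    return [cand["idx"] for cand in ranked]


def prediction_line(url: str, ranked_idx: list) -> str:
    """Serialize one prediction; the key name "answers" is an assumption."""
    return json.dumps({"url": url, "answers": ranked_idx})


# Made-up candidates and query, for illustration only.
candidates = [
    {"idx": 0, "code_tokens": ["def", "add", "(", "a", ",", "b", ")"]},
    {"idx": 1, "code_tokens": ["def", "read", "file", "(", "path", ")"]},
    {"idx": 2, "code_tokens": ["def", "close", "file", "(", "handle", ")"]},
]
query = ["read", "a", "file", "from", "path"]
print(prediction_line("url0", rank_candidates(query, candidates)))
```

In practice the score would come from a trained model comparing query and code representations; only the sort-and-emit logic carries over.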
Source
Organization: hugging_face
Created: Unknown