
semeru/Text-Code-CodeSearchNet-Python

This dataset supports natural-language-to-source-code search and contains filtered Python code examples. Samples were removed if they could not be parsed into an abstract syntax tree, had a documentation token count outside the range 3 to 256, contained special tokens, or were not in English. The dataset consists of three .jsonl files (training, validation, and test), with each line representing one function. In the test set, function names and variable names are replaced with special tokens to assess model generalization.

Updated 3/27/2023
hugging_face

Description

Dataset Overview

Source and Processing

  • Source: The dataset originates from CodeSearchNet.
  • Processing: Pre-processed with the CodeXGLUE scripts, which remove code samples that cannot be parsed into an abstract syntax tree, as well as samples whose documentation falls outside 3 to 256 tokens, contains special tokens, or is not in English.
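
The filtering rules above can be sketched roughly as follows. This is a minimal illustration, not the CodeXGLUE script itself: the function name `keep_sample`, the inclusive bounds, and the ASCII-only check used as a stand-in for English detection are all assumptions, and the original pipeline may use a different parser than Python's built-in `ast` module.

```python
import ast


def keep_sample(code: str, docstring_tokens: list,
                min_tokens: int = 3, max_tokens: int = 256) -> bool:
    """Return True if a sample passes the (assumed) filtering rules."""
    # Rule 1: the code must parse into an abstract syntax tree.
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False

    # Rule 2: the documentation must have between min_tokens and
    # max_tokens tokens (treated as inclusive here, which is an assumption).
    if not (min_tokens <= len(docstring_tokens) <= max_tokens):
        return False

    # Rule 3: crude stand-in for the English-only criterion:
    # reject any documentation token with non-ASCII characters.
    return all(tok.isascii() for tok in docstring_tokens)
```

The special-token rule is omitted because the source does not say which tokens are meant.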

Content

  • Format: Three .jsonl files: train.jsonl, valid.jsonl, test.jsonl.
  • Structure: Each line represents a function with fields such as repo, path, func_name, original_string, language, code/function, code_tokens/function_tokens, docstring, docstring_tokens, url, idx, etc.
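
Because each line is one JSON object, a split can be read line by line. The sketch below assumes the field names listed above; the helper names (`parse_jsonl`, `load_split`, `get_code`) are hypothetical and not part of any published loader.

```python
import json
from typing import Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one record per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def load_split(path: str) -> list:
    """Read a whole split, e.g. train.jsonl, into a list of records."""
    with open(path, encoding="utf-8") as fh:
        return list(parse_jsonl(fh))


def get_code(record: dict) -> str:
    """Return the source text, stored under 'code' or 'function'."""
    return record.get("code") or record.get("function", "")
```

For example, `load_split("train.jsonl")` would return one dictionary per function, and `get_code` hides the difference between the `code` and `function` field names.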

Statistics

Split    #Examples
Train    251,820
Dev      9,604
Test     19,210

Task Definition

  • Goal: Given a natural‑language description, retrieve matching source code.
  • Test: Function names and variables in the test set are replaced with special tokens to evaluate generalization.
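
To make the retrieval goal concrete, here is a deliberately simple baseline that ranks candidates by token overlap (Jaccard similarity) between the query and the code. It is only a stand-in for illustration: practical systems typically use learned encoders, and the function names below are hypothetical.

```python
import re


def tokenize(text: str) -> set:
    """Lowercase and split on anything that is not a letter or digit."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by size of the union."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def rank_candidates(query: str, candidates: dict) -> list:
    """Return candidate idx values, best match first.

    `candidates` maps idx -> code string. Ties are broken by idx.
    """
    q = tokenize(query)
    scored = [(jaccard(q, tokenize(code)), idx)
              for idx, code in candidates.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored]
```

A lexical baseline like this would likely do poorly on the test set, since replacing function and variable names with special tokens removes most of the overlap it relies on. That is the generalization the test split is meant to probe.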

Example

  • File: evaluator/test.jsonl
  • Sample Content: Multiple records, each with url, docstring, function, idx, etc.

Prediction Input

  • Processing: For each natural-language query, rank the candidate code snippets in descending order of relevance and return their idx values.
  • Example Output: JSON format containing the url and an array of corresponding answers.
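
One prediction record could be serialized as in the sketch below. The key names `url` and `answers` are assumptions inferred from the description above, and the URL is a made-up placeholder. Check the evaluator's expected format before relying on this layout.

```python
import json


def format_prediction(url: str, ranked_idxs: list) -> str:
    """Serialize one query's ranked results as a single JSON line.

    `ranked_idxs` holds candidate idx values, best match first.
    The key names are assumed, not confirmed by the source.
    """
    return json.dumps({"url": url, "answers": ranked_idxs})


# Hypothetical query URL and ranking, for illustration only.
line = format_prediction("https://example.com/repo/file.py#L1-L5", [3, 1, 2])
print(line)
```

Writing one such line per query keeps the predictions in the same line-oriented layout as the dataset files.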

Topics

Code Search
Natural Language Processing

Source

Organization: hugging_face

Created: Unknown
