This dataset targets natural-language-to-code search and contains filtered Python functions. Samples were removed if they could not be parsed into an abstract syntax tree, had documentation with fewer than 3 or more than 256 tokens, contained special tokens, or were not in English. The dataset consists of three .jsonl files (training, validation, and test), in which each line represents one function. In the test set, function and variable names are replaced with special tokens to assess how well models generalize.
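The filtering criteria above can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' actual pipeline. The field names `code` and `docstring`, the whitespace tokenization, the `SPECIAL_MARKERS` list, and the ASCII-only check used as a stand-in for English detection are all assumptions made for the example.

```python
import ast
import json

# Hypothetical field names; the real .jsonl schema may differ.
CODE_FIELD = "code"
DOC_FIELD = "docstring"

# Placeholder markers standing in for whatever "special tokens" were excluded.
SPECIAL_MARKERS = ("<img", "https:")


def keep_sample(sample, min_doc_tokens=3, max_doc_tokens=256):
    """Return True if a sample passes the four filters described above."""
    code = sample.get(CODE_FIELD, "")
    doc = sample.get(DOC_FIELD, "")

    # 1. The code must parse into an abstract syntax tree.
    try:
        ast.parse(code)
    except (SyntaxError, ValueError):
        return False

    # 2. Documentation length must be within 3..256 tokens
    #    (whitespace splitting is an assumption).
    n_tokens = len(doc.split())
    if not (min_doc_tokens <= n_tokens <= max_doc_tokens):
        return False

    # 3. The documentation must not contain special tokens.
    if any(marker in doc for marker in SPECIAL_MARKERS):
        return False

    # 4. Crude proxy for "English only": reject non-ASCII documentation.
    if not doc.isascii():
        return False

    return True


def parse_jsonl(lines):
    """Yield one dict per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)
```

With an open file handle `f` for one of the .jsonl files, `[s for s in parse_jsonl(f) if keep_sample(s)]` would collect the functions that pass every check. A real language filter would likely need something stronger than the ASCII test used here.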