
RefineCode-code-corpus-meta

This dataset contains metadata for the raw source code in RefineCode, including repository names and file paths. It demonstrates superior training effectiveness and efficiency compared with The Stack V2 training subset. Metadata covering about 50% of the files from The Stack V2 has been uploaded so far, and the remaining portion is being prepared for release. RefineCode is a high‑quality, reproducible code pre‑training corpus containing about 960 billion tokens, covering 607 programming languages and including 75 billion code‑related tokens, and it incorporates cleaning rules with custom weight allocation for more than 130 specific languages.


Description

RefineCode Code Corpus Metadata Dataset

Dataset Overview

This dataset provides metadata for the raw source code in RefineCode (repository names, file paths, sizes, and language labels). Users can refer to this metadata to collect the corresponding files and reproduce RefineCode.
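One way to use the metadata for reproduction is to group file paths by repository, so that each repository only needs to be fetched once. The sketch below assumes hypothetical rows following the dataset's field names; the repository and file names are illustrative, not taken from the corpus.

```python
from collections import defaultdict

# Hypothetical metadata rows; field names follow the dataset schema.
rows = [
    {"repo_name": "org/repo-a", "sub_path": "src", "file_name": "main.c"},
    {"repo_name": "org/repo-a", "sub_path": "include", "file_name": "main.h"},
    {"repo_name": "org/repo-b", "sub_path": "", "file_name": "setup.py"},
]

# Group in-repo file paths by repository, fetching each repo once.
files_by_repo: dict[str, list[str]] = defaultdict(list)
for r in rows:
    path = f"{r['sub_path']}/{r['file_name']}" if r["sub_path"] else r["file_name"]
    files_by_repo[r["repo_name"]].append(path)

print(dict(files_by_repo))
```

From here, a reproduction script would iterate over `files_by_repo`, clone or download each repository, and keep only the listed paths.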

Dataset Features

  • repo_name: Repository name, type string.
  • sub_path: Sub‑path, type string.
  • file_name: File name, type string.
  • file_ext: File extension, type string.
  • file_size_in_byte: File size (bytes), type int64.
  • line_count: Number of lines, type int64.
  • lang: Language, type string.
  • program_lang: Programming language, type string.
  • doc_type: Document type, type string.
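The fields above make it straightforward to select a custom subset before downloading anything, for example all files of one programming language under a size cap. A minimal sketch, using hypothetical rows with the field names and types listed above:

```python
# Hypothetical rows using the dataset's field names and types.
rows = [
    {"repo_name": "org/a", "file_ext": "py", "file_size_in_byte": 2048,
     "line_count": 80, "program_lang": "Python", "doc_type": "code"},
    {"repo_name": "org/b", "file_ext": "md", "file_size_in_byte": 512,
     "line_count": 20, "program_lang": "", "doc_type": "doc"},
]

def select(rows, program_lang, max_bytes):
    """Keep files of one programming language under a size cap."""
    return [r for r in rows
            if r["program_lang"] == program_lang
            and r["file_size_in_byte"] <= max_bytes]

print(select(rows, "Python", 1_000_000))
```

The same pattern extends to filtering on `doc_type`, `file_ext`, or `line_count`.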

Dataset Split

  • The_Stack_V2: Contains 46,577,045,485 bytes of data, comprising 336,845,710 samples.

Dataset Size

  • Download size: 20,019,085,005 bytes.
  • Dataset size: 46,577,045,485 bytes.

Configuration

  • default: Data file path is data/The_Stack_V2-*.
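The `data/The_Stack_V2-*` path is a glob over the data shards. The shard file names below are hypothetical examples used only to show how the pattern matches:

```python
from fnmatch import fnmatch

# Glob used by the default configuration.
PATTERN = "data/The_Stack_V2-*"

# Hypothetical shard names for illustration.
shards = [
    "data/The_Stack_V2-00000-of-00256.parquet",
    "data/The_Stack_V2-00001-of-00256.parquet",
    "README.md",
]

# Only the data shards match; other repo files are ignored.
matched = [s for s in shards if fnmatch(s, PATTERN)]
print(matched)
```

Dataset loaders that accept a `data_files` glob (such as Hugging Face `datasets`) resolve the configuration's shards in the same way.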

Dataset Characteristics

  • High Quality: RefineCode is a high‑quality code pre‑training corpus.
  • Reproducible: Users can reproduce RefineCode using the metadata.
  • Scale: Contains 960 billion tokens, covering 607 programming languages and 75 billion code‑related tokens.
  • Rules: Includes rules for over 130 specific languages with custom weight distribution.

Dataset Advantages

  • Training Efficiency: Compared with The Stack V2 training subset, RefineCode shows better training efficiency and performance.
  • Visualization: A PCA visualization of embeddings extracted with CodeBERT shows RefineCode's clear advantage over comparable pre‑training datasets.



Topics

Code Pretraining
Programming Languages

Source

Organization: huggingface

Created: 11/15/2024
