This dataset contains metadata for the raw source code in RefineCode, including repository names and file paths. RefineCode demonstrates superior training effectiveness and efficiency compared to the training subset of The Stack V2. Metadata covering about 50% of the files drawn from The Stack V2 has been uploaded so far, and work is underway to release the remainder. RefineCode is a high-quality, reproducible code pre-training corpus comprising 96 billion tokens across 607 programming languages and 75 billion code-related tokens. It applies language-specific rules for more than 130 languages, with custom weight allocation.
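Because the metadata identifies files by repository name and file path, a natural first step is to group paths by repository before looking the files up in their original source. The sketch below is illustrative only: the field names `repo_name` and `path`, the sample rows, and the helper `group_paths_by_repo` are assumptions made for the example, not the dataset's confirmed schema or API.

```python
from collections import defaultdict

# Hypothetical metadata rows. The field names "repo_name" and "path" are
# assumptions for illustration, not the dataset's confirmed schema.
rows = [
    {"repo_name": "example-org/alpha", "path": "src/main.py"},
    {"repo_name": "example-org/alpha", "path": "README.md"},
    {"repo_name": "example-org/beta", "path": "lib/util.rs"},
]


def group_paths_by_repo(records):
    """Map each repository name to the sorted list of its file paths."""
    grouped = defaultdict(list)
    for record in records:
        grouped[record["repo_name"]].append(record["path"])
    return {repo: sorted(paths) for repo, paths in grouped.items()}


files_by_repo = group_paths_by_repo(rows)
for repo, paths in files_by_repo.items():
    print(repo, paths)
```

Grouping by repository first means each repository only needs to be visited once when the listed files are retrieved.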