Dataset asset · Open Source Community · Code Data Processing · Pre‑training Datasets

opc-fineweb-code-corpus

opc‑fineweb‑code‑corpus is part of the OpenCoder dataset collection and is used in the pre‑training stage. It consists of code‑related web data recalled from FineWeb and filtered through three rounds of fastText classification, yielding a corpus of roughly 55 B code‑ and math‑related tokens. The math‑related portion is released separately as OpenCoder‑LLM/fineweb‑math‑corpus.

Source
Hugging Face
Created
Nov 10, 2024
Updated
Nov 24, 2024
Availability
Linked source ready
Overview

Dataset description and usage context

opc‑fineweb‑code‑corpus

Dataset Overview

  • Dataset Name: opc‑fineweb‑code‑corpus
  • Source: Fineweb
  • Purpose: Used for OpenCoder pre‑training
  • Size: 55 B code and math‑related tokens

Features

  • url: string
  • tag: string
  • text: string
  • file_path: string
  • dump: string
  • file_size_in_byte: 64‑bit integer
  • line_count: 64‑bit integer
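As a rough illustration of the schema above, a single record might look like the following sketch. The field names come from the dataset card; every value is invented for illustration, not drawn from the actual corpus:

```python
# Hypothetical record matching the declared feature schema.
# Field names are from the dataset card; all values below are invented.
sample = {
    "url": "https://example.com/post/quicksort-in-c",   # hypothetical source URL
    "tag": "code",                                      # hypothetical category tag
    "text": "Quicksort partitions the array around a pivot element...",
    "file_path": "data/train-00000-of-00064.parquet",   # hypothetical shard path
    "dump": "CC-MAIN-2024-10",                          # hypothetical crawl dump label
    "file_size_in_byte": 1834,
    "line_count": 42,
}

# Basic type checks mirroring the schema (string vs. 64-bit integer fields).
assert all(isinstance(sample[k], str) for k in ("url", "tag", "text", "file_path", "dump"))
assert all(isinstance(sample[k], int) for k in ("file_size_in_byte", "line_count"))
```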

Splits

  • train: 100,920,235 samples, total size 254,927,419,643 bytes
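A quick back-of-the-envelope check using the split statistics above shows the average on-disk size per record:

```python
# Figures taken from the train split statistics above.
total_bytes = 254_927_419_643
num_samples = 100_920_235

avg_bytes = total_bytes / num_samples  # average on-disk size per record
print(round(avg_bytes))  # ≈ 2526 bytes, i.e. roughly 2.5 KB of text per sample
```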

Configuration

  • config_name: default
  • data_files:
    • split: train
    • path: data/train-*
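Given the configuration above, the corpus can presumably be loaded with the Hugging Face `datasets` library. A minimal sketch, assuming the hub ID `OpenCoder-LLM/opc-fineweb-code-corpus` and that `datasets` is installed; streaming avoids downloading all ~255 GB up front:

```python
# Glob pattern for the train split, as declared in the dataset configuration.
DATA_FILES = {"train": "data/train-*"}

def stream_corpus(repo_id: str = "OpenCoder-LLM/opc-fineweb-code-corpus"):
    """Lazily iterate over the train split without downloading it in full."""
    from datasets import load_dataset  # deferred so this module imports without `datasets`
    return load_dataset(repo_id, split="train", streaming=True)

# Example usage (requires network access and the `datasets` package):
#   for record in stream_corpus():
#       print(record["url"], record["line_count"])
#       break
```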

Citation

  • Paper: OpenCoder: The Open Cookbook for Top‑Tier Code Large Language Models
  • Authors: Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, Wei Chu
  • Year: 2024
  • URL: https://arxiv.org/pdf/2411.04905