Dataset asset · Open Source Community · Code Data Processing · Pre‑training Datasets

opc-fineweb-code-corpus

opc‑fineweb‑code‑corpus is part of the OpenCoder dataset collection and is used in the pre‑training stage. It consists of code‑related web data recalled from FineWeb and filtered through three rounds of fastText classification, yielding a corpus of roughly 55 B code‑ and math‑related tokens. The math‑related portion is released separately as OpenCoder‑LLM/fineweb‑math‑corpus.

Source
Hugging Face
Created
Nov 10, 2024
Updated
Nov 24, 2024
Availability
Linked source ready
Overview

Dataset description and usage context

opc‑fineweb‑code‑corpus

Dataset Overview

  • Dataset Name: opc‑fineweb‑code‑corpus
  • Source: Fineweb
  • Purpose: Used for OpenCoder pre‑training
  • Size: 55 B code and math‑related tokens

Features

  • url: string
  • tag: string
  • text: string
  • file_path: string
  • dump: string
  • file_size_in_byte: 64‑bit integer
  • line_count: 64‑bit integer
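As a rough illustration of the schema above, a single record might look like the following sketch. The field names come from the dataset card; every value is invented for illustration, not drawn from the actual corpus:

```python
# Hypothetical record matching the declared feature schema.
# Field names are from the dataset card; all values below are invented.
sample = {
    "url": "https://example.com/post/quicksort-in-c",   # hypothetical source URL
    "tag": "code",                                      # hypothetical category tag
    "text": "Quicksort partitions the array around a pivot element...",
    "file_path": "data/train-00000-of-00064.parquet",   # hypothetical shard path
    "dump": "CC-MAIN-2024-10",                          # hypothetical crawl dump label
    "file_size_in_byte": 1834,
    "line_count": 42,
}

# Basic type checks mirroring the schema (string vs. 64-bit integer fields).
assert all(isinstance(sample[k], str) for k in ("url", "tag", "text", "file_path", "dump"))
assert all(isinstance(sample[k], int) for k in ("file_size_in_byte", "line_count"))
```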

Splits

  • train: 100,920,235 samples, total size 254,927,419,643 bytes
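A quick back-of-the-envelope check using the split statistics above shows the average on-disk size per record:

```python
# Figures taken from the train split statistics above.
total_bytes = 254_927_419_643
num_samples = 100_920_235

avg_bytes = total_bytes / num_samples  # average on-disk size per record
print(round(avg_bytes))  # ≈ 2526 bytes, i.e. roughly 2.5 KB of text per sample
```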

Configuration

  • config_name: default
  • data_files:
    • split: train
    • path: data/train-*
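Given the configuration above, the corpus can presumably be loaded with the Hugging Face `datasets` library. A minimal sketch, assuming the hub ID `OpenCoder-LLM/opc-fineweb-code-corpus` and that `datasets` is installed; streaming avoids downloading all ~255 GB up front:

```python
# Glob pattern for the train split, as declared in the dataset configuration.
DATA_FILES = {"train": "data/train-*"}

def stream_corpus(repo_id: str = "OpenCoder-LLM/opc-fineweb-code-corpus"):
    """Lazily iterate over the train split without downloading it in full."""
    from datasets import load_dataset  # deferred so this module imports without `datasets`
    return load_dataset(repo_id, split="train", streaming=True)

# Example usage (requires network access and the `datasets` package):
#   for record in stream_corpus():
#       print(record["url"], record["line_count"])
#       break
```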

Citation

  • Paper: OpenCoder: The Open Cookbook for Top‑Tier Code Large Language Models
  • Authors: Siming Huang, Tianhao Cheng, Jason Klein Liu, Jiaran Hao, Liuyihan Song, Yang Xu, J. Yang, J. H. Liu, Chenchen Zhang, Linzheng Chai, Ruifeng Yuan, Zhaoxiang Zhang, Jie Fu, Qian Liu, Ge Zhang, Zili Wang, Yuan Qi, Yinghui Xu, Wei Chu
  • Year: 2024
  • URL: https://arxiv.org/pdf/2411.04905