High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

opc-fineweb-code-corpus

opc-fineweb-code-corpus is part of the OpenCoder dataset and is intended for the pre-training stage. It consists of code-related data retrieved from the FineWeb web corpus through three rounds of fastText filtering. The resulting corpus contains 55B tokens of code and math-related data. The math-related portion is available in OpenCoder-LLM/fineweb-math-corpus.

huggingface
