Dataset asset, Open Source Community, Data Analysis, Educational Content
HuggingFaceFW/fineweb-edu
FineWeb‑Edu is a subset of educational web pages selected from the FineWeb dataset, totaling 1.3 trillion tokens. A quality classifier, trained on annotations generated by Llama3‑70B‑Instruct, was used to select high‑quality educational content. The dataset is intended as high‑quality educational data for language‑model training.
Source
hugging_face
Created
Nov 28, 2025
Updated
Oct 11, 2024
Signals
1,718 views
Availability
Linked source ready
Dataset Overview
Basic Information
- License: odc‑by
- Task Type: Text Generation
- Language: English
- Dataset Name: FineWeb‑Edu
- Size: 1.3 trillion tokens
Data Configuration
- Default Configuration: Includes all data
  - Path: data/*/*
- Sample Configurations:
  - sample-350BT: random subset of ~350 billion GPT‑2 tokens
    - Path: sample/350BT/*
  - sample-100BT: random subset of ~100 billion GPT‑2 tokens
    - Path: sample/100BT/*
  - sample-10BT: random subset of ~10 billion GPT‑2 tokens
    - Path: sample/10BT/*
- Specific Crawl Configurations:
  - CC-MAIN-2024-10 to CC-MAIN-2013-20, each representing a specific crawl period
    - Path format: data/CC-MAIN-(year)-(week number)/*
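The mapping from configuration name to file pattern can be sketched as a small helper. This is only an illustration of the layout listed above: the function name `config_to_glob` is hypothetical, and the name "default" for the default configuration is an assumption.

```python
def config_to_glob(name: str) -> str:
    """Map a configuration name to the file pattern listed for it.

    Hypothetical helper, not part of any library. It only encodes the
    path layout described in this section.
    """
    if name == "default":  # assumed name of the default configuration
        return "data/*/*"
    if name.startswith("sample-"):
        # "sample-10BT" -> "sample/10BT/*"
        return "sample/" + name[len("sample-"):] + "/*"
    if name.startswith("CC-MAIN-"):
        # One directory per crawl: "CC-MAIN-2024-10" -> "data/CC-MAIN-2024-10/*"
        return "data/" + name + "/*"
    raise ValueError(f"unknown configuration: {name!r}")


for cfg in ("default", "sample-10BT", "CC-MAIN-2024-10"):
    print(cfg, "->", config_to_glob(cfg))
```

A pattern built this way could be passed as a glob to a reader, with a file extension appended where the reader expects one.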
Dataset Loading
- Using datatrove:
    from datatrove.pipeline.readers import ParquetReader

    # limit caps how many documents are read; remove it to read everything
    data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb-edu", glob_pattern="data/*/*.parquet", limit=1000)
    for document in data_reader():
        print(document)
- Using datasets:
    from datasets import load_dataset

    # streaming=True iterates over the crawl without downloading it in full first
    fw = load_dataset("HuggingFaceFW/fineweb-edu", name="CC-MAIN-2024-10", split="train", streaming=True)
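A streamed dataset is an iterable, so a quick way to inspect it is to take only the first few records. The sketch below assumes each record is a dict with a "text" field; that field name, and the helper names `preview` and `open_stream`, are assumptions for illustration. The demo uses stand-in records so it runs offline.

```python
from itertools import islice


def preview(stream, n=3, width=60):
    """Return the first n records, with the assumed "text" field truncated."""
    rows = []
    for record in islice(stream, n):
        rows.append({**record, "text": record["text"][:width]})
    return rows


def open_stream(config="sample-10BT"):
    """Open one configuration in streaming mode (needs network access)."""
    # Imported lazily so preview() works without the library installed.
    from datasets import load_dataset

    return load_dataset(
        "HuggingFaceFW/fineweb-edu", name=config, split="train", streaming=True
    )


# Stand-in records for an offline demo; real records would come from open_stream().
demo = [
    {"text": "Photosynthesis converts light energy into chemical energy."},
    {"text": "A prime number has exactly two positive divisors."},
    {"text": "Plate tectonics describes the motion of the lithosphere."},
    {"text": "This record is never read because n defaults to 3."},
]

for row in preview(demo):
    print(row["text"])
```

Because `islice` stops after n items, the same call on a real stream reads only a handful of records rather than the whole subset.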
Dataset Creation
- Classifier Training: An education‑quality classifier was trained on annotations generated by Llama3‑70B‑Instruct.
- Filtering & Results: Pages with a classifier score below 3 were filtered out, retaining 1.3 trillion educational tokens.
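The filtering step amounts to a score cutoff. A minimal sketch, assuming each record carries a numeric "score" field from the classifier (the field name and the function `keep_educational` are hypothetical, and the sample scores are made up for the demo):

```python
def keep_educational(records, threshold=3):
    """Yield only the records whose classifier score meets the threshold."""
    for record in records:
        if record["score"] >= threshold:  # "score" is an assumed field name
            yield record


# Made-up records for illustration; these are not real dataset rows.
sample = [
    {"id": "a", "score": 1.2},
    {"id": "b", "score": 2.4},
    {"id": "c", "score": 3.1},
    {"id": "d", "score": 4.0},
]

kept_strict = [r["id"] for r in keep_educational(sample, threshold=3)]
kept_loose = [r["id"] for r in keep_educational(sample, threshold=2)]
print(kept_strict)  # ['c', 'd']
print(kept_loose)   # ['b', 'c', 'd']
```

Lowering the threshold keeps more data at lower average quality, which is the trade-off between a threshold of 3 and a threshold of 2.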
Dataset Versions
- FineWeb‑Edu: 1.3 trillion tokens
- FineWeb‑Edu‑score‑2: 5.4 trillion tokens (threshold 2)
Classifier