DATASET
Open Source Community
HuggingFaceFW/fineweb-edu
The FineWeb-Edu dataset is a selection of educational web pages from the FineWeb dataset, containing 1.3 trillion tokens. A quality classifier, trained on annotations generated by Llama3-70B-Instruct, was used to select high-quality educational content. The dataset is intended to provide high-quality educational data for language-model training.
Updated 10/11/2024
hugging_face
Description
Dataset Overview
Basic Information
- License: odc‑by
- Task Type: Text Generation
- Language: English
- Dataset Name: FineWeb‑Edu
- Size: 1.3 trillion tokens
Data Configuration
- Default Configuration: Includes all data
  - Path: data/*/*
- Sample Configurations:
  - sample-350BT: ~350 billion GPT-2 tokens (random subset)
    - Path: sample/350BT/*
  - sample-100BT: ~100 billion GPT-2 tokens (random subset)
    - Path: sample/100BT/*
  - sample-10BT: ~10 billion GPT-2 tokens (random subset)
    - Path: sample/10BT/*
- Specific Crawl Configurations:
  - CC-MAIN-2024-10 to CC-MAIN-2013-20, each representing a specific crawl period
    - Path format: data/CC-MAIN-(year)-(week number)/*
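The configuration names listed above map onto path patterns in a regular way. The following is a minimal sketch of that mapping, assuming the layout shown in this section. The helper `config_to_glob` is hypothetical and not part of any library, and using `"default"` as the name of the default configuration is also an assumption.

```python
import re

# Hypothetical helper (not part of `datasets` or `datatrove`): maps a
# FineWeb-Edu configuration name to the path pattern listed in this section.
SAMPLE_CONFIGS = {"sample-350BT", "sample-100BT", "sample-10BT"}

# Crawl configurations look like CC-MAIN-<year>-<week number>.
CRAWL_RE = re.compile(r"^CC-MAIN-\d{4}-\d{2}$")


def config_to_glob(name: str) -> str:
    """Return the path pattern for a configuration name."""
    if name == "default":  # assumed name for the default configuration
        return "data/*/*"
    if name in SAMPLE_CONFIGS:
        # "sample-10BT" -> "sample/10BT/*"
        return f"sample/{name.split('-', 1)[1]}/*"
    if CRAWL_RE.match(name):
        # "CC-MAIN-2024-10" -> "data/CC-MAIN-2024-10/*"
        return f"data/{name}/*"
    raise ValueError(f"unknown configuration: {name!r}")


print(config_to_glob("sample-10BT"))
print(config_to_glob("CC-MAIN-2024-10"))
```

Raising `ValueError` on an unrecognized name is a design choice for the sketch: a typo in a crawl name fails immediately instead of producing a pattern that matches no files.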
Dataset Loading
- Using datatrove:
    from datatrove.pipeline.readers import ParquetReader
    # limit caps the number of documents read; remove it to read everything
    data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb-edu", glob_pattern="data/*/*.parquet", limit=1000)
    for document in data_reader():
        print(document)
- Using datasets:
    from datasets import load_dataset
    # streaming=True iterates over records without downloading the whole configuration first
    fw = load_dataset("HuggingFaceFW/fineweb-edu", name="CC-MAIN-2024-10", split="train", streaming=True)
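With `streaming=True`, `load_dataset` returns an iterable that yields one record at a time, so a few documents can be inspected without downloading an entire crawl. The following is a minimal sketch, assuming network access and an installed `datasets` package. The helper names `peek` and `stream_sample` are hypothetical, the `sample-10BT` configuration name comes from the sample configurations listed earlier, and the `text` field in the usage comment is an assumption about the record schema.

```python
from itertools import islice


def peek(records, n=3):
    """Return at most the first n records from any iterable."""
    return list(islice(records, n))


def stream_sample(n=3):
    """Stream the first n records of the sample-10BT configuration."""
    # Imported here so that `peek` stays usable without `datasets` installed.
    from datasets import load_dataset

    fw = load_dataset(
        "HuggingFaceFW/fineweb-edu",
        name="sample-10BT",
        split="train",
        streaming=True,
    )
    return peek(fw, n)


# Usage (requires network access; the "text" field name is an assumption):
#     for record in stream_sample():
#         print(record["text"][:200])
```

Keeping `peek` separate from the loading call means the slicing logic can be checked on any local iterable before a remote stream is opened.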
Dataset Creation
- Classifier Training: An education-quality classifier was trained on annotations generated by Llama3-70B-Instruct.
- Filtering & Results: The classifier's scores were filtered with a threshold of 3, retaining 1.3 trillion educational tokens.
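The filtering step amounts to keeping only the documents whose classifier score meets the threshold. The following is a minimal sketch on in-memory records, not the pipeline actually used to build the dataset. The `score` field name, the helper `keep_educational`, and the sample records are illustrative assumptions.

```python
THRESHOLD = 3  # threshold stated in the filtering step above


def keep_educational(docs, threshold=THRESHOLD, score_key="score"):
    """Yield only the documents whose score is at least the threshold."""
    for doc in docs:
        if doc[score_key] >= threshold:
            yield doc


# Made-up records for illustration only.
docs = [
    {"text": "Photosynthesis converts light into chemical energy.", "score": 4},
    {"text": "Buy now! Limited offer.", "score": 0},
    {"text": "A short history of the printing press.", "score": 3},
    {"text": "Celebrity gossip roundup.", "score": 2},
]

kept = list(keep_educational(docs))
print(len(kept))  # 2
```

Passing `threshold=2` keeps three of the four sample records, mirroring the looser cut used for the score-2 variant listed under Dataset Versions.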
Dataset Versions
- FineWeb‑Edu: 1.3 trillion tokens
- FineWeb‑Edu‑score‑2: 5.4 trillion tokens (threshold 2)
Classifier
Topics
Educational Content
Data Analysis
Source
Organization: hugging_face
Created: Unknown