
HuggingFaceFW/fineweb-edu

The FineWeb-Edu dataset is a selection of educational web pages from the FineWeb dataset, containing 1.3 trillion tokens. A quality classifier, trained on annotations generated by Llama3-70B-Instruct, was used to select high-quality educational content. The dataset aims to provide high-quality educational data for language-model training.

Source
hugging_face
Created
Nov 28, 2025
Updated
Oct 11, 2024

Dataset Overview

Basic Information

  • License: odc‑by
  • Task Type: Text Generation
  • Language: English
  • Dataset Name: FineWeb‑Edu
  • Size: 1.3 trillion tokens

Data Configuration

  • Default Configuration: Includes all data
    • Path: data/*/*
  • Sample Configurations:
    • sample-350BT: ~350 billion GPT‑2 tokens random subset
      • Path: sample/350BT/*
    • sample-100BT: ~100 billion GPT‑2 tokens random subset
      • Path: sample/100BT/*
    • sample-10BT: ~10 billion GPT‑2 tokens random subset
      • Path: sample/10BT/*
  • Specific Crawl Configurations:
    • CC-MAIN-2024-10 to CC-MAIN-2013-20, each representing a specific crawl period
      • Path format: data/CC-MAIN-(year)-(week number)/*
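
The mapping from configuration names to file paths listed above can be sketched as a small lookup. This is an illustrative helper only: `config_to_glob` is a hypothetical name, not part of `datasets` or `datatrove`, and the use of "default" as the name of the default configuration is an assumption.

```python
import re

# Sample configurations and their paths, as listed above.
SAMPLE_PATHS = {
    "sample-350BT": "sample/350BT/*",
    "sample-100BT": "sample/100BT/*",
    "sample-10BT": "sample/10BT/*",
}

# Crawl configurations follow CC-MAIN-(year)-(week number).
CRAWL_NAME = re.compile(r"^CC-MAIN-\d{4}-\d{2}$")


def config_to_glob(name: str) -> str:
    """Return the path pattern for a configuration name (hypothetical helper)."""
    if name == "default":  # assumed name of the default configuration
        return "data/*/*"
    if name in SAMPLE_PATHS:
        return SAMPLE_PATHS[name]
    if CRAWL_NAME.match(name):
        return f"data/{name}/*"
    raise ValueError(f"unknown configuration: {name!r}")
```

For example, `config_to_glob("CC-MAIN-2024-10")` yields the pattern for a single crawl, which is much smaller than the default configuration.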

Dataset Loading

  • Using datatrove:
from datatrove.pipeline.readers import ParquetReader

# limit=1000 stops after the first 1,000 documents; remove it to read everything
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb-edu", glob_pattern="data/*/*.parquet", limit=1000)
for document in data_reader():
    print(document)
  • Using datasets:
from datasets import load_dataset

# streaming=True iterates over the crawl without downloading it in full first
fw = load_dataset("HuggingFaceFW/fineweb-edu", name="CC-MAIN-2024-10", split="train", streaming=True)
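
A streamed dataset is consumed by iteration, so a quick way to inspect it is to take the first few records. The sketch below assumes each record is a dict with a "text" field, which the listing above does not confirm; `preview_texts` and `open_sample_stream` are hypothetical helper names. The sample-10BT configuration is used because it is the smallest one listed.

```python
from itertools import islice


def preview_texts(stream, n=3, max_chars=200):
    """Return the text of the first n records, truncated to max_chars.

    Assumes each record is a dict with a "text" key (an assumption).
    """
    return [record["text"][:max_chars] for record in islice(stream, n)]


def open_sample_stream():
    """Open the smallest sample configuration in streaming mode (needs network)."""
    from datasets import load_dataset  # imported lazily; only needed here

    return load_dataset(
        "HuggingFaceFW/fineweb-edu",
        name="sample-10BT",
        split="train",
        streaming=True,
    )
```

Calling `preview_texts(open_sample_stream())` would then fetch only as much data as the first three records require.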

Dataset Creation

  • Classifier Training: An educational-quality classifier was trained on annotations generated by Llama3-70B-Instruct.
  • Filtering & Results: Only documents meeting a classifier-score threshold of 3 were kept, retaining 1.3 trillion educational tokens.
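
The thresholding step can be illustrated with a minimal sketch. It assumes each record carries a numeric classifier score under a "score" key and that the comparison is inclusive (score >= 3); neither detail is stated above, and `keep_educational` is a hypothetical helper, not the pipeline actually used to build the dataset.

```python
def keep_educational(records, threshold=3, score_field="score"):
    """Yield records whose classifier score meets the threshold.

    The field name and the inclusive comparison are assumptions.
    """
    for record in records:
        if record[score_field] >= threshold:
            yield record


# Toy records with made-up scores, for illustration only.
docs = [
    {"text": "celebrity gossip", "score": 1.0},
    {"text": "intro to photosynthesis", "score": 3.0},
    {"text": "linear algebra lecture notes", "score": 4.2},
]
kept = list(keep_educational(docs))
```

Lowering `threshold` to 2 in the same sketch corresponds to the looser filter used for the larger FineWeb-Edu-score-2 version.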

Dataset Versions

  • FineWeb‑Edu: 1.3 trillion tokens
  • FineWeb‑Edu‑score‑2: 5.4 trillion tokens (threshold 2)

Classifier
