
HuggingFaceFW/fineweb-edu

The FineWeb‑Edu dataset is a selection of educational web pages filtered from the FineWeb dataset, totaling 1.3 trillion tokens. The pages were selected by an educational‑quality classifier trained on annotations generated by Llama3‑70B‑Instruct. The dataset is intended as high‑quality educational data for language‑model training.

Updated 10/11/2024
hugging_face

Description

Dataset Overview

Basic Information

  • License: odc‑by
  • Task Type: Text Generation
  • Language: English
  • Dataset Name: FineWeb‑Edu
  • Size: 1.3 trillion tokens

Data Configuration

  • Default Configuration: Includes all data
    • Path: data/*/*
  • Sample Configurations:
    • sample-350BT: random subset of ~350 billion GPT‑2 tokens
      • Path: sample/350BT/*
    • sample-100BT: random subset of ~100 billion GPT‑2 tokens
      • Path: sample/100BT/*
    • sample-10BT: random subset of ~10 billion GPT‑2 tokens
      • Path: sample/10BT/*
  • Specific Crawl Configurations:
    • One configuration per crawl period, from CC-MAIN-2013-20 to CC-MAIN-2024-10
      • Path format: data/CC-MAIN-(year)-(week number)/*
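
To make the path conventions above concrete, here is a minimal sketch that maps a configuration name to its glob path. The helper `config_to_glob` is hypothetical and not part of datasets or datatrove, and the name "default" for the default configuration is an assumption; the function only encodes the patterns listed above.

```python
import re

# Hypothetical helper for illustration; not part of datasets or datatrove.
# It encodes only the path patterns listed in the Data Configuration section.
_CRAWL = re.compile(r"CC-MAIN-\d{4}-\d{2}")
_SAMPLE = re.compile(r"sample-(\d+BT)")


def config_to_glob(name: str) -> str:
    """Return the glob path for a FineWeb-Edu configuration name."""
    if name == "default":  # assumed name of the default configuration
        return "data/*/*"
    sample = _SAMPLE.fullmatch(name)
    if sample:
        return f"sample/{sample.group(1)}/*"
    if _CRAWL.fullmatch(name):
        return f"data/{name}/*"
    raise ValueError(f"unrecognized configuration name: {name!r}")


print(config_to_glob("sample-10BT"))      # sample/10BT/*
print(config_to_glob("CC-MAIN-2024-10"))  # data/CC-MAIN-2024-10/*
```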

Dataset Loading

  • Using datatrove:
from datatrove.pipeline.readers import ParquetReader

# limit=1000 stops after 1,000 documents; remove it to read everything
data_reader = ParquetReader("hf://datasets/HuggingFaceFW/fineweb-edu", glob_pattern="data/*/*.parquet", limit=1000)
for document in data_reader():
    print(document)
  • Using datasets:
from datasets import load_dataset

# streaming=True reads records lazily instead of downloading the whole crawl first
fw = load_dataset("HuggingFaceFW/fineweb-edu", name="CC-MAIN-2024-10", split="train", streaming=True)
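
A streamed dataset is an iterable, so a quick way to inspect it is to take only the first few records. The sketch below is one possible approach: `head` and `preview_fineweb_edu` are hypothetical helper names, and the choice of the sample-10BT configuration is only an example. The `datasets` import is deferred so that `head` works on any iterable without the package installed.

```python
from itertools import islice


def head(stream, n):
    """Return the first n items of any iterable without reading the rest."""
    return list(islice(stream, n))


def preview_fineweb_edu(n=3, config="sample-10BT"):
    """Fetch the first n records of one configuration (needs network access)."""
    # Imported here so that head() stays usable without the datasets package.
    from datasets import load_dataset

    fw = load_dataset(
        "HuggingFaceFW/fineweb-edu", name=config, split="train", streaming=True
    )
    return head(fw, n)


# head() works on any iterable, which makes it easy to check offline.
print(head(iter(range(10)), 3))  # [0, 1, 2]
```

Calling `preview_fineweb_edu()` returns a short list of records, which is usually enough to check the available fields before committing to a full pass over the data.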

Dataset Creation

  • Classifier Training: An educational‑quality classifier was trained on annotations generated by Llama3‑70B‑Instruct.
  • Filtering & Results: Only pages with a classifier score of at least 3 (on a 0 to 5 scale) were kept, retaining 1.3 trillion educational tokens.
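
The threshold step described above can be sketched as a simple filter. This is an illustration under stated assumptions, not the actual FineWeb‑Edu pipeline: the field name `score`, the function `keep_educational`, and the toy records are all made up for the example, and the sketch compares the raw score, whereas the real pipeline may compare a rounded one.

```python
from typing import Iterable, Iterator


def keep_educational(records: Iterable[dict], threshold: float = 3.0) -> Iterator[dict]:
    """Yield records whose classifier score is at least the threshold."""
    for record in records:
        # "score" is an assumed field name holding the classifier output.
        if record["score"] >= threshold:
            yield record


# Made-up records for illustration only; these are not real dataset rows.
toy = [
    {"text": "Photosynthesis converts light into chemical energy.", "score": 3.6},
    {"text": "Buy now, limited offer!", "score": 0.4},
    {"text": "A short note on fractions.", "score": 2.7},
]

kept = list(keep_educational(toy))
print(len(kept))  # 1
```

Lowering the threshold to 2 keeps more records, which mirrors the trade-off between the stricter and the more permissive variants of the dataset.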

Dataset Versions

  • FineWeb‑Edu: 1.3 trillion tokens
  • FineWeb‑Edu‑score‑2: 5.4 trillion tokens (threshold 2)

Topics

Educational Content
Data Analysis

Source

Organization: hugging_face

Created: Unknown
