Dataset asset · Open Source Community · Natural Language Processing · Text Dataset

stas/openwebtext-10k

stas/openwebtext-10k is a subset of the OpenWebText dataset, which is itself an open‑source replica of OpenAI's WebText dataset. The subset contains the first 10,000 records of OpenWebText and is intended primarily for testing. It has a single split, `train`, with a single `text` feature and 10,000 rows. The compressed size is approximately 15 MB and the uncompressed size is about 50 MB.

Source
hugging_face
Created
Nov 28, 2025
Updated
Sep 15, 2021
Dataset Overview

Basic Information

  • Name: OpenWebText‑10K
  • Description: The first 10,000 records of OpenWebText, an open‑source replica of OpenAI's WebText dataset, intended for testing.
  • Record Count: 10,000
  • Structure: A single `train` split with a single `text` feature
  • Size:
    • Compressed: ~15 MB
    • Uncompressed: ~50 MB

Usage

  • Loading:

    from datasets import load_dataset
    ds = load_dataset("stas/openwebtext-10k")
    
  • Convert to JSONL:

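    from datasets import load_dataset
    dataset_name = "stas/openwebtext-10k"
    # Use the repository name without its namespace as the output file stem.
    name = dataset_name.split("/")[-1]
    # Request the `train` split directly so a single Dataset is returned.
    ds = load_dataset(dataset_name, split="train")
    # Write one JSON object per line to openwebtext-10k.jsonl.
    ds.to_json(f"{name}.jsonl", orient="records", lines=True)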
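
After the conversion, it can be useful to confirm that the exported file looks as expected. The following is a minimal sketch using only the standard library, assuming each line of the file is a JSON object with a `text` key, as the dataset description suggests. The helper names `iter_jsonl` and `summarize` are hypothetical and not part of the `datasets` library, and the demo runs on a tiny stand-in file so that nothing needs to be downloaded.

```python
import json
import os
import tempfile
from typing import Iterator


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield one decoded JSON object per non-empty line of a JSONL file."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def summarize(path: str) -> dict:
    """Count the records and the total characters in their `text` fields."""
    rows = 0
    chars = 0
    for record in iter_jsonl(path):
        rows += 1
        chars += len(record["text"])  # assumes every record has a `text` key
    return {"rows": rows, "chars": chars}


# Demo on a tiny hypothetical file standing in for the real export.
with tempfile.TemporaryDirectory() as tmp:
    sample_path = os.path.join(tmp, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as fh:
        fh.write(json.dumps({"text": "first document"}) + "\n")
        fh.write(json.dumps({"text": "second"}) + "\n")
    print(summarize(sample_path))  # {'rows': 2, 'chars': 20}
```

For the real export, calling `summarize("openwebtext-10k.jsonl")` should report 10,000 rows if the file matches the description above.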
    