stas/openwebtext-10k

stas/openwebtext-10k is a subset of OpenWebText, an open‑source replica of OpenAI's WebText dataset. It contains the first 10,000 records of the original dataset and is intended primarily for testing. It has a single split, `train`, with one feature, `text`, and 10,000 rows. The compressed size is approximately 15 MB and the uncompressed size is about 50 MB.

Updated 9/15/2021
hugging_face

Description

Dataset Overview

Basic Information

  • Name: OpenWebText‑10K
  • Description: The first 10,000 records of OpenWebText, an open‑source replica of OpenAI's WebText dataset, intended for testing.
  • Record Count: 10,000
  • Structure: Single split `train` with a single feature, `text`
  • Size:
    • Compressed: ~15 MB
    • Uncompressed: ~50 MB

Usage

  • Loading:

    from datasets import load_dataset
    ds = load_dataset("stas/openwebtext-10k")
    
  • Convert to JSONL:

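    from datasets import load_dataset
    dataset_name = "stas/openwebtext-10k"
    # Use the part after the slash as the output file stem
    name = dataset_name.split("/")[-1]
    # Load the only split, `train`
    ds = load_dataset(dataset_name, split="train")
    # Write one JSON object per line to openwebtext-10k.jsonl
    ds.to_json(f"{name}.jsonl", orient="records", lines=True)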
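
Once the JSONL file exists, it can be read back without the `datasets` library. The following is a minimal sketch using only the Python standard library; the helper names `read_jsonl` and `summarize` are hypothetical and not part of the dataset or of `datasets`. It assumes each line holds one JSON object with a `text` key, as described under Basic Information.

```python
import json
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a JSON Lines file."""
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def summarize(path):
    """Return (record_count, total_characters) over the `text` field."""
    count = 0
    chars = 0
    for record in read_jsonl(path):
        count += 1
        chars += len(record["text"])  # assumes every record has `text`
    return count, chars
```

Calling `summarize("openwebtext-10k.jsonl")` after the conversion step should report 10,000 records if the export completed. The generator reads one line at a time, so the whole file is never held in memory.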
    

Topics

Natural Language Processing
Text Dataset

Source

Organization: hugging_face

Created: Unknown
