stas/openwebtext-10k
stas/openwebtext-10k is a subset of OpenWebText, an open-source replica of OpenAI's WebText dataset. It contains the first 10,000 records of the original dataset and is intended primarily for testing. It has a single split, `train`, with one feature, `text`, and 10,000 rows. The compressed size is approximately 15 MB and the uncompressed size is about 50 MB.
Description
Dataset Overview
Basic Information
- Name: OpenWebText‑10K
- Description: The first 10,000 records of OpenWebText, an open-source replica of OpenAI's WebText dataset, intended for testing.
- Record Count: 10,000
- Structure: Single feature, `text`
- Size:
  - Compressed: ~15 MB
  - Uncompressed: ~50 MB
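The structure described above can be checked once the data is loaded. The following is a minimal sketch and not part of the dataset card: `check_records` is a hypothetical helper, and it assumes each record is a dict whose only field is a string named `text`.

```python
from typing import Iterable, Mapping


def check_records(records: Iterable[Mapping[str, object]],
                  expected_rows: int = 10_000) -> int:
    """Verify each record has only a string `text` field; return total characters.

    Hypothetical helper for illustration. `expected_rows` defaults to the
    record count stated in this card.
    """
    rows = 0
    total_chars = 0
    for record in records:
        # The card describes a single feature named `text`.
        if set(record.keys()) != {"text"}:
            raise ValueError(f"row {rows}: unexpected fields {sorted(record.keys())}")
        text = record["text"]
        if not isinstance(text, str):
            raise ValueError(f"row {rows}: `text` is not a string")
        rows += 1
        total_chars += len(text)
    if rows != expected_rows:
        raise ValueError(f"expected {expected_rows} rows, found {rows}")
    return total_chars
```

With the `datasets` library installed, passing the `train` split returned by `load_dataset("stas/openwebtext-10k", split="train")` to `check_records` should succeed if the description above holds. A plain list of dicts works as well, which makes the helper easy to try on a few hand-written records.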
Usage
- Loading:

      from datasets import load_dataset

      ds = load_dataset("stas/openwebtext-10k")

- Convert to JSONL:

      from datasets import load_dataset

      dataset_name = "stas/openwebtext-10k"
      # Use the part after the slash as the output file name.
      name = dataset_name.split("/")[-1]
      ds = load_dataset(dataset_name, split="train")
      # Write one JSON object per line.
      ds.to_json(f"{name}.jsonl", orient="records", lines=True)
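Once a JSONL file exists, it can be read back with the standard library alone. This is a minimal sketch under stated assumptions: `iter_jsonl_texts` is a hypothetical helper name, and it assumes one JSON object per line, each with a `text` field, as described above.

```python
import json
from typing import Iterator


def iter_jsonl_texts(path: str) -> Iterator[str]:
    """Yield the `text` field of each record in a JSON Lines file.

    Hypothetical helper for illustration; assumes every non-blank line
    is a JSON object with a `text` key.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:  # skip blank lines, e.g. a trailing newline
                continue
            yield json.loads(line)["text"]
```

For example, `sum(1 for _ in iter_jsonl_texts("openwebtext-10k.jsonl"))` counts the records in the exported file without loading them all into memory at once.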
Source
Organization: hugging_face
Created: Unknown