openwebtext-sentences
The OpenWebText‑Sentences dataset is extracted from the OpenWebText corpus and contains the original text split into individual sentences. It is stored in Parquet format for fast access. Sentences were split using the NLTK 3.9.1 pre‑trained "Punkt" tokenizer. The dataset is 25.7 GB in size and contains 307,432,490 English sentences.
Description
OpenWebText‑Sentences Dataset
Overview
This dataset is derived from the popular OpenWebText corpus. It contains the same text as the original, split into individual sentences.
Key Features
- Content: All text from the original OpenWebText dataset.
- Format: Sentences are stored separately in Parquet format to improve access speed.
- Order: Sentences keep the text and ordering of the original OpenWebText.
- Tokenization: Sentence splitting performed with the NLTK 3.9.1 pre‑trained "Punkt" tokenizer.
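The splitting step described above can be sketched roughly as follows. This is a minimal illustration, not the script used to build the dataset: the helper names `punkt_splitter` and `iter_sentences` are hypothetical, and stripping whitespace and dropping empty strings is an assumption. The NLTK calls (`nltk.download`, `sent_tokenize`) are the library's standard entry points for the pre‑trained Punkt model.

```python
from typing import Callable, Iterable, Iterator, List


def punkt_splitter() -> Callable[[str], List[str]]:
    """Return a splitter backed by NLTK's pre-trained Punkt model.

    Requires `pip install nltk`. NLTK is imported lazily so the rest of
    this sketch can be used without it.
    """
    import nltk
    from nltk.tokenize import sent_tokenize

    # NLTK 3.9.x loads the Punkt model from the "punkt_tab" resource.
    nltk.download("punkt_tab", quiet=True)
    return lambda text: sent_tokenize(text, language="english")


def iter_sentences(
    documents: Iterable[str],
    split: Callable[[str], List[str]],
) -> Iterator[str]:
    """Yield sentences document by document, preserving the original order."""
    for doc in documents:
        for sentence in split(doc):
            sentence = sentence.strip()  # assumption: trim surrounding whitespace
            if sentence:                 # assumption: skip empty results
                yield sentence
```

With NLTK installed, `list(iter_sentences(docs, punkt_splitter()))` would produce one flat, order‑preserving list of sentences for an iterable of document strings `docs`.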
Dataset Information
- Size: 25.7 GB (generated dataset)
- Number of Sentences: 307,432,490
- Language: English
Original OpenWebText Information
- Size: 41.70 GB (generated dataset)
- Number of Documents: 8,013,769
- Language: English
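As a quick sanity check on the figures above, the snippet below derives two rough ratios from the reported numbers. It only performs arithmetic on the values listed in this card; the variable and function names are illustrative.

```python
# Figures as reported in this dataset card.
NUM_SENTENCES = 307_432_490
NUM_DOCUMENTS = 8_013_769
SENTENCES_SIZE_GB = 25.7
ORIGINAL_SIZE_GB = 41.70


def ratio(numerator: float, denominator: float) -> float:
    """Plain division, with a guard against a zero denominator."""
    if denominator == 0:
        raise ValueError("denominator must be non-zero")
    return numerator / denominator


avg_sentences_per_doc = ratio(NUM_SENTENCES, NUM_DOCUMENTS)
size_ratio = ratio(SENTENCES_SIZE_GB, ORIGINAL_SIZE_GB)

print(f"Average sentences per document: {avg_sentences_per_doc:.1f}")
print(f"Sentence dataset size relative to original: {size_ratio:.2f}")
```

On these numbers, a document in the original corpus corresponds to roughly 38 sentences on average.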
Citation
When using this dataset, please cite the original OpenWebText corpus:
@misc{Gokaslan2019OpenWeb,
title={OpenWebText Corpus},
author={Gokaslan, Aaron and Cohen, Vanya and Pavlick, Ellie and Tellex, Stefanie},
howpublished={\url{http://Skylion007.github.io/OpenWebTextCorpus}},
year={2019}
}
Source
Organization: huggingface
Created: 9/17/2024