Dataset asset · Open Source Community · Natural Language Processing · Text Analysis
openwebtext-sentences
The OpenWebText‑Sentences dataset is extracted from the OpenWebText corpus, containing the original textual content split into individual sentences. It is stored in Parquet format for fast access. Sentences were split using the NLTK 3.9.1 pre‑trained "Punkt" tokenizer. The dataset size is 25.7 GB and includes 307,432,490 sentences in English.
Source
huggingface
Created
Sep 17, 2024
Updated
Sep 22, 2024
Availability
Linked source ready
Overview
Dataset description and usage context
OpenWebText‑Sentences Dataset
Overview
This dataset originates from the popular OpenWebText corpus and contains the same textual content as the original OpenWebText, but split into individual sentences.
Key Features
- Content: All text from the original OpenWebText dataset.
- Format: Sentences are stored separately in Parquet format to improve access speed.
- Order: Preserves the text and sentence ordering of the original OpenWebText corpus.
- Tokenization: Sentence splitting performed with the NLTK 3.9.1 pre‑trained "Punkt" tokenizer.
Dataset Information
- Size: 25.7 GB (generated dataset)
- Number of Sentences: 307,432,490
- Language: English
Original OpenWebText Information
- Size: 41.70 GB (generated dataset)
- Number of Documents: 8,013,769
- Language: English
Citation
When using this dataset, please cite the original OpenWebText corpus:
@misc{Gokaslan2019OpenWeb,
  title={OpenWebText Corpus},
  author={Gokaslan, Aaron and Cohen, Vanya and Pavlick, Ellie and Tellex, Stefanie},
  howpublished={\url{http://Skylion007.github.io/OpenWebTextCorpus}},
  year={2019}
}
Need downstream help?
Pair the dataset with AI analysis and content workflows.
Once the source passes your review, move straight into summarization, transformation, report drafting, or presentation generation with the JuheAI toolchain.