Dataset asset · Open Source Community · Natural Language Processing · Text Analysis
openwebtext-sentences
The OpenWebText‑Sentences dataset is extracted from the OpenWebText corpus, containing the original textual content split into individual sentences. It is stored in Parquet format for fast access. Sentences were split using the NLTK 3.9.1 pre‑trained "Punkt" tokenizer. The dataset size is 25.7 GB and includes 307,432,490 sentences in English.
Source
huggingface
Created
Sep 17, 2024
Updated
Sep 22, 2024
Availability
Linked source ready
Overview
Dataset description and usage context
OpenWebText‑Sentences Dataset
Overview
This dataset originates from the popular OpenWebText corpus and contains the same textual content as the original OpenWebText, but split into individual sentences.
Key Features
- Content: All text from the original OpenWebText dataset.
- Format: Sentences are stored separately in Parquet format to improve access speed.
- Order: Preserves the text and sentence ordering of the original OpenWebText corpus.
- Tokenization: Sentence splitting performed with the NLTK 3.9.1 pre‑trained "Punkt" tokenizer.
Dataset Information
- Size: 25.7 GB (generated dataset)
- Number of Sentences: 307,432,490
- Language: English
Original OpenWebText Information
- Size: 41.70 GB (generated dataset)
- Number of Documents: 8,013,769
- Language: English
Citation
When using this dataset, please cite the original OpenWebText corpus:
@misc{Gokaslan2019OpenWeb,
  title={OpenWebText Corpus},
  author={Gokaslan, Aaron and Cohen, Vanya and Pavlick, Ellie and Tellex, Stefanie},
  howpublished={\url{http://Skylion007.github.io/OpenWebTextCorpus}},
  year={2019}
}
Need downstream help?
Pair the dataset with AI analysis and content workflows.
Once the source passes your review, move straight into summarization, transformation, report drafting, or presentation generation with the JuheAI toolchain.