openwebtext-sentences

The OpenWebText‑Sentences dataset is derived from the OpenWebText corpus and contains the original text split into individual sentences. It is stored in Parquet format for fast access. Sentences were split using the NLTK 3.9.1 pre‑trained "Punkt" tokenizer. The dataset is 25.7 GB in size and contains 307,432,490 English sentences.

Updated 9/22/2024
huggingface

Description

OpenWebText‑Sentences Dataset

Overview

This dataset originates from the popular OpenWebText corpus and contains the same textual content as the original OpenWebText, but split into individual sentences.

Key Features

  • Content: All text from the original OpenWebText dataset.
  • Format: Sentences are stored separately in Parquet format to improve access speed.
  • Order: Sentences keep the original OpenWebText wording and appear in the same sequence as in the source corpus.
  • Tokenization: Sentence splitting performed with the NLTK 3.9.1 pre‑trained "Punkt" tokenizer.

Dataset Information

  • Size: 25.7 GB (generated dataset)
  • Number of Sentences: 307,432,490
  • Language: English

Original OpenWebText Information

  • Size: 41.70 GB (generated dataset)
  • Number of Documents: 8,013,769
  • Language: English
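
The figures above allow a quick back-of-the-envelope check of how the two datasets relate. The sketch below uses only the numbers quoted in this card; it assumes that "GB" means 10^9 bytes, which the card does not state, so the bytes-per-sentence figure is only a rough estimate.

```python
# Figures quoted in this dataset card.
NUM_SENTENCES = 307_432_490
NUM_DOCUMENTS = 8_013_769
SENTENCES_SIZE_GB = 25.7

# Average number of sentences produced per original document.
sentences_per_document = NUM_SENTENCES / NUM_DOCUMENTS

# Assumption: 1 GB = 10**9 bytes (the card does not say whether GB or GiB is meant).
bytes_per_sentence = SENTENCES_SIZE_GB * 1e9 / NUM_SENTENCES

print(f"Sentences per document: {sentences_per_document:.1f}")
print(f"Approximate bytes per sentence: {bytes_per_sentence:.1f}")
```

On these numbers, each OpenWebText document yields roughly 38 sentences on average.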

Citation

When using this dataset, please cite the original OpenWebText corpus:

@misc{Gokaslan2019OpenWeb,
    title={OpenWebText Corpus},
    author={Gokaslan, Aaron and Cohen, Vanya and Pavlick, Ellie and Tellex, Stefanie},
    howpublished={\url{http://Skylion007.github.io/OpenWebTextCorpus}},
    year={2019}
}

Topics

Natural Language Processing
Text Analysis

Source

Organization: huggingface

Created: 9/17/2024
