High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.



poetry_dataset

The dataset has five features: title, author, text, identifier, and category. It provides a training split of 47,829 samples totaling 12,701,752 bytes; the download size is 9,836,711 bytes.
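The size figures in this description can be turned into two quick sanity checks: average bytes per sample and the ratio of download size to dataset size. The snippet below is a minimal sketch; the function names are hypothetical and only the three numbers come from the listing.

```python
# Figures copied from the poetry_dataset description.
NUM_SAMPLES = 47_829
DATASET_BYTES = 12_701_752
DOWNLOAD_BYTES = 9_836_711


def bytes_per_sample(dataset_bytes: int, num_samples: int) -> float:
    """Average stored size of one sample, in bytes."""
    if num_samples <= 0:
        raise ValueError("num_samples must be positive")
    return dataset_bytes / num_samples


def download_ratio(download_bytes: int, dataset_bytes: int) -> float:
    """Download size as a fraction of the full dataset size."""
    if dataset_bytes <= 0:
        raise ValueError("dataset_bytes must be positive")
    return download_bytes / dataset_bytes


print(f"avg bytes per sample: {bytes_per_sample(DATASET_BYTES, NUM_SAMPLES):.1f}")
print(f"download / dataset:   {download_ratio(DOWNLOAD_BYTES, DATASET_BYTES):.3f}")
```

The same two helpers apply to any other entry that reports a sample count, a dataset size, and a download size.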

huggingface


ISEAR

Sentiment Analysis

Text Analysis

The ISEAR (International Survey on Emotion Antecedents and Reactions) dataset, developed by the Swiss National Centre of Competence in Research, is suitable for text analysis and sentiment analysis.

github


BBC-Dataset-News-Classification

News Classification

Text Analysis

The collection comprises 2,225 news articles from the BBC News website between 2004 and 2005, covering five thematic domains: business, entertainment, politics, sports, and technology.

github


afmck/text8

Natural Language Processing

Text Analysis

The dataset has three splits: train, validation, and test. Each split holds one sample, and the splits differ in byte size. The only feature is text (string). The total download size is 54,357,043 bytes and the total dataset size is 100,000,012 bytes. The configuration name is default, with one data file path for each of train, validation, and test.

huggingface


openwebtext-sentences

Natural Language Processing

Text Analysis

The OpenWebText‑Sentences dataset is extracted from the OpenWebText corpus, containing the original textual content split into individual sentences. It is stored in Parquet format for fast access. Sentences were split using the NLTK 3.9.1 pre‑trained "Punkt" tokenizer. The dataset size is 25.7 GB and includes 307,432,490 sentences in English.

huggingface


llm-aes/asappp-3-6-original

Essay Scoring

Text Analysis

---
dataset_info:
  features:
  - name: Essay_ID
    dtype: int64
  - name: essay_set
    dtype: int64
  - name: essay
    dtype: string
  - name: rater1_domain1
    dtype: int64
  - name: rater2_domain1
    dtype: int64
  - name: domain1_score
    dtype: int64
  - name: rubrics
    dtype: string
  - name: prompt
    dtype: string
  - name: Content
    dtype: int64
  - name: Prompt_Adherence
    dtype: int64
  - name: Language
    dtype: int64
  - name: Narrativity
    dtype: int64
  splits:
  - name: train
    num_bytes: 60382165
    num_examples: 7101
  download_size: 2445084
  dataset_size: 60382165
configs:
- config_name: default
  data_files:
  - split: train
    path: data/train-*
---
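The feature list in this metadata can serve as a lightweight schema for checking records before use. The following is a minimal sketch using only the Python standard library: the feature names and dtypes are copied from the metadata, while `validate_record` and the dtype-to-Python-type mapping are hypothetical helpers introduced for illustration, not part of any dataset tooling.

```python
from typing import Dict, List

# Feature names and dtypes as listed in the dataset_info metadata.
EXPECTED_FEATURES: Dict[str, str] = {
    "Essay_ID": "int64",
    "essay_set": "int64",
    "essay": "string",
    "rater1_domain1": "int64",
    "rater2_domain1": "int64",
    "domain1_score": "int64",
    "rubrics": "string",
    "prompt": "string",
    "Content": "int64",
    "Prompt_Adherence": "int64",
    "Language": "int64",
    "Narrativity": "int64",
}

# Assumed mapping from dtype labels to plain Python types.
PYTHON_TYPES = {"int64": int, "string": str}


def validate_record(record: dict) -> List[str]:
    """Return a list of problems found in one record (empty if none)."""
    problems: List[str] = []
    for name, dtype in EXPECTED_FEATURES.items():
        if name not in record:
            problems.append(f"missing field: {name}")
            continue
        value = record[name]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, PYTHON_TYPES[dtype]):
            problems.append(
                f"{name}: expected {dtype}, got {type(value).__name__}"
            )
    for name in record:
        if name not in EXPECTED_FEATURES:
            problems.append(f"unexpected field: {name}")
    return problems
```

Calling `validate_record` on each row returns an empty list for well-formed records, so a non-empty result flags rows with missing, extra, or mistyped fields.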

huggingface
