Explore high-quality datasets for your AI and machine learning projects.
The dataset comprises five features: title, author, text, identifier, and category. It has a single train split of 47,829 samples, with a total size of 12,701,752 bytes and a download size of 9,836,711 bytes.
The ISEAR dataset, developed by the Swiss National Center for Ability Research, is the International Survey on Emotion Antecedents and Reactions, suited to text analysis and sentiment analysis.
The collection comprises 2,225 news articles published on the BBC News website between 2004 and 2005, spanning five topical areas: business, entertainment, politics, sport, and technology.
The dataset has three splits: train, validation, and test, each with its own sample count and byte size. The sole feature is text (string). The total download size is 54,357,043 bytes and the dataset size is 100,000,012 bytes. The configuration name is default, with data file paths for the train, validation, and test splits.
The OpenWebText‑Sentences dataset is extracted from the OpenWebText corpus, containing the original textual content split into individual sentences. It is stored in Parquet format for fast access. Sentences were split using the NLTK 3.9.1 pre‑trained "Punkt" tokenizer. The dataset size is 25.7 GB and includes 307,432,490 sentences in English.
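The dataset itself was split with NLTK's pre-trained Punkt tokenizer, but the idea can be illustrated with a minimal, self-contained sketch; the regex rule below is a rough stand-in for Punkt, and the sample text is invented for illustration.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Rough approximation of sentence splitting: break after ., ! or ?
    # when followed by whitespace and an uppercase letter. The actual
    # dataset used NLTK's Punkt tokenizer, which also handles
    # abbreviations, initials, and other edge cases.
    parts = re.split(r"(?<=[.!?])\s+(?=[A-Z])", text.strip())
    return [p for p in parts if p]

sample = "OpenWebText is a web-scale corpus. Each document was split into sentences. Storage is Parquet."
print(split_sentences(sample))
```

In practice, NLTK's `sent_tokenize` replaces the regex rule above and yields noticeably better boundaries on real web text.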
---
dataset_info:
  features:
    - name: Essay_ID
      dtype: int64
    - name: essay_set
      dtype: int64
    - name: essay
      dtype: string
    - name: rater1_domain1
      dtype: int64
    - name: rater2_domain1
      dtype: int64
    - name: domain1_score
      dtype: int64
    - name: rubrics
      dtype: string
    - name: prompt
      dtype: string
    - name: Content
      dtype: int64
    - name: Prompt_Adherence
      dtype: int64
    - name: Language
      dtype: int64
    - name: Narrativity
      dtype: int64
  splits:
    - name: train
      num_bytes: 60382165
      num_examples: 7101
  download_size: 2445084
  dataset_size: 60382165
configs:
  - config_name: default
    data_files:
      - split: train
        path: data/train-*
---
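The config above maps the train split to every shard matching `data/train-*`. A minimal sketch of how that glob pattern resolves to shard files, using only the standard library (the shard file names below are hypothetical placeholders, not the dataset's actual files):

```python
import glob
import os
import tempfile

# Create a throwaway directory layout mimicking the card's data_files
# pattern, with two empty placeholder shards.
with tempfile.TemporaryDirectory() as root:
    data_dir = os.path.join(root, "data")
    os.makedirs(data_dir)
    for name in ("train-00000-of-00002.parquet", "train-00001-of-00002.parquet"):
        open(os.path.join(data_dir, name), "w").close()

    # Resolve the split's glob pattern, sorted so shards load in order.
    shards = sorted(glob.glob(os.path.join(data_dir, "train-*")))
    print([os.path.basename(p) for p in shards])
```

Dataset loaders expand the pattern the same way, then concatenate the matched shards into the single train split.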