Explore high-quality datasets for your AI and machine learning projects.
OpenAI 1M with DBPedia Entities is a dataset of one million samples designed for feature-extraction tasks. Each sample includes an `_id`, `title`, `text`, and an `openai` field containing a 1536-dimensional float32 embedding generated with the text-embedding-ada-002 model. The dataset is English, was created in June 2023 for benchmarking pgvector and Qdrant vector-database performance, and is planned to grow to ten million vectors. It is derived from the first one million entries of the BeIR/DBpedia-Entity dataset.
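To make the record layout concrete, here is a minimal sketch of a sample with that schema and a cosine-similarity helper, the usual metric for ada-002 embeddings. The record contents and the use of NumPy are illustrative assumptions; the vector is random stand-in data, not a real model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical record mirroring the dataset schema described above.
sample = {
    "_id": "<dbpedia:Example>",
    "title": "Example",
    "text": "An example DBpedia abstract.",
    "openai": rng.standard_normal(1536).astype(np.float32),  # 1536-dim float32
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A vector is maximally similar to itself (score ~= 1.0).
score = cosine_similarity(sample["openai"], sample["openai"])
```

The same helper works unchanged for comparing a query embedding against any stored `openai` vector.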
This dataset includes an `embedding` feature (string type), five integer features (`O`, `C`, `E`, `A`, `N`), and an integer `id`. It is split into training, validation, and evaluation sets of 1,578, 395, and 494 samples, respectively; the dataset configuration specifies the file path for each split.
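The split sizes above imply the following proportions; the sizes come from the card, while the derived ratios are computed here for illustration.

```python
# Split sizes as stated in the dataset card.
splits = {"train": 1578, "validation": 395, "evaluation": 494}

total = sum(splits.values())  # 2467 samples in all
ratios = {name: round(n / total, 3) for name, n in splits.items()}
# Roughly a 64/16/20 train/validation/evaluation split.
```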
---
dataset_info:
  features:
  - name: fact
    dtype: string
  - name: count
    dtype: int64
  - name: embeddings
    sequence: float32
  splits:
  - name: train
    num_bytes: 4951309139
    num_examples: 1578238
  download_size: 5895178326
  dataset_size: 4951309139
---

# Dataset Card for "omcs_dataset_full_with_embeds"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
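The metadata above allows a quick sanity check of the average on-disk record size. The byte and example counts are taken from the YAML; the per-example figure is derived here, and the embedding dimension is not stated in the card.

```python
# Train-split totals from the dataset_info metadata above.
num_bytes = 4_951_309_139
num_examples = 1_578_238

# Average serialized size per example: roughly 3.1 KB.
avg_bytes_per_example = num_bytes / num_examples
```

At about 3.1 KB per row, the float32 `embeddings` sequence plausibly dominates the record (a 768-dimensional float32 vector alone would be 3072 bytes), though the actual dimension is an assumption, not something the card specifies.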