
KShivendu/dbpedia-entities-openai-1M

OpenAI 1M with DBPedia Entities is a dataset of one million samples intended for feature-extraction tasks. Each sample has an `_id`, a `title`, a `text`, and an `openai` field holding a 1536-dimensional float32 embedding generated with the `text-embedding-ada-002` model. The text is in English. The dataset was created in June 2023 to benchmark the performance of pgvector and a vector database (Qdrant), with a planned expansion to ten million vectors. It is derived from the first one million entries of the BeIR/DBpedia-Entity dataset.
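A minimal sketch of how the samples might be read, assuming the Hugging Face `datasets` library is installed and that the single split listed in this card is exposed under the name `train`. The helper names `stream_records` and `head` are hypothetical and chosen only for illustration. Streaming is used so that the full set of vectors does not have to be downloaded just to inspect a few rows.

```python
from itertools import islice
from typing import Any, Dict, Iterable, Iterator, List

# Repository id as given in this card.
DATASET_ID = "KShivendu/dbpedia-entities-openai-1M"


def stream_records(split: str = "train") -> Iterator[Dict[str, Any]]:
    """Lazily yield records from the Hub without downloading the full dataset.

    Assumption: the split described in this card is named "train".
    """
    # Imported lazily so this module loads even when `datasets` is not installed.
    from datasets import load_dataset

    return iter(load_dataset(DATASET_ID, split=split, streaming=True))


def head(records: Iterable[Dict[str, Any]], n: int = 3) -> List[Dict[str, Any]]:
    """Materialize only the first n records of a (possibly very long) stream."""
    return list(islice(records, n))
```

With network access, `head(stream_records(), 2)` should return two dictionaries carrying the `_id`, `title`, `text`, and `openai` fields described in this card.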

Source: Hugging Face
Created: Nov 28, 2025
Updated: Feb 19, 2024

Dataset Overview

Basic Information

License: MIT
Size: 1M < n < 10M
Language: English (en)

Features

_id: string
title: string
text: string
openai: sequence of float32 (1536‑dimensional)
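The feature list above can be turned into a small sanity check before records are loaded into a vector store. This is only a sketch: `validate_record` is a hypothetical helper, and it assumes each record arrives as a plain Python dictionary whose `openai` value is a list or tuple of numbers.

```python
import math
from typing import Any, Dict, List

EMBEDDING_DIM = 1536  # dimension stated in this card
STRING_FIELDS = ("_id", "title", "text")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record looks valid."""
    problems: List[str] = []
    # The three textual features must all be present and be strings.
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: expected a string")
    # The embedding must be a finite numeric sequence of the stated length.
    vector = record.get("openai")
    if not isinstance(vector, (list, tuple)):
        problems.append("openai: expected a sequence of floats")
    elif len(vector) != EMBEDDING_DIM:
        problems.append(f"openai: expected {EMBEDDING_DIM} values, got {len(vector)}")
    elif not all(isinstance(x, (int, float)) and math.isfinite(x) for x in vector):
        problems.append("openai: contains a non-numeric or non-finite value")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log every issue in a bad row during a bulk import.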

Splits

Training Set:
- Samples: 1,000,000
- Size: 12,383,152 bytes
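When sizing a pgvector or Qdrant benchmark, it can help to estimate how much memory the dense vectors alone would occupy. The sketch below uses only the sample count and the float32, 1536-dimensional schema from this card. It deliberately ignores the `_id`, `title`, and `text` strings, any index structures, and any on-disk encoding, so it is not meant to reproduce the reported split size.

```python
NUM_SAMPLES = 1_000_000   # sample count from the Splits section
EMBEDDING_DIM = 1536      # embedding dimension from the Features section
BYTES_PER_FLOAT32 = 4     # IEEE 754 single precision


def raw_vector_bytes(num_samples: int, dim: int,
                     bytes_per_value: int = BYTES_PER_FLOAT32) -> int:
    """Bytes needed for the dense embedding matrix alone (no strings, no index)."""
    return num_samples * dim * bytes_per_value


total = raw_vector_bytes(NUM_SAMPLES, EMBEDDING_DIM)
print(f"{total:,} bytes = {total / 10**9:.3f} GB = {total / 2**30:.2f} GiB")
# prints: 6,144,000,000 bytes = 6.144 GB = 5.72 GiB
```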

Task Category

Feature Extraction

Dataset Name

Pretty Name: OpenAI 1M with DBPedia Entities

Embedding Details

Dimension: 1536
Embedding Text: title (string) + text (string)
Model: text-embedding-ada-002
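To make the embedding details concrete, the following sketch shows one way the embedded string could be formed and how the stored vectors might be searched exactly, which is a common baseline when checking results from an approximate index. The card only states "title + text", so the space separator in `embedding_input` is an assumption. All function names here are hypothetical, and the pure-Python cosine search is meant for small samples, not for the full million vectors.

```python
import math
from typing import List, Sequence, Tuple


def embedding_input(title: str, text: str, sep: str = " ") -> str:
    """Join title and text into the string to embed.

    Assumption: the card says only "title + text"; the separator is a guess.
    """
    return f"{title}{sep}{text}"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float],
          corpus: Sequence[Tuple[str, Sequence[float]]],
          k: int = 5) -> List[Tuple[str, float]]:
    """Exact (brute-force) nearest neighbours by cosine similarity, best first."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in corpus]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]
```

In practice the `corpus` argument would be built from `(_id, openai)` pairs of a sample of records, and the query vector would need to come from the same embedding model for the scores to be meaningful.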
