Dataset asset, Open Source Community, Natural Language Processing, Text Embedding
KShivendu/dbpedia-entities-openai-1M
OpenAI 1M with DBPedia Entities is a dataset of one million samples designed for feature‑extraction tasks. Each sample includes an `_id`, a `title`, a `text`, and an `openai` field containing a 1536‑dimensional float32 embedding generated with the text‑embedding‑ada‑002 model. The dataset is in English and was created in June 2023 to benchmark the performance of pgvector and a vector database (Qdrant); an expansion to ten million vectors is planned. It is derived from the first one million entries of the BeIR/DBpedia‑Entity dataset.
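Benchmarks of approximate vector search are commonly scored against an exact brute-force result. The sketch below shows such a baseline using cosine similarity in plain Python. It is only an illustration: the function names `cosine_similarity` and `exact_top_k` are hypothetical, the choice of cosine similarity as the metric is an assumption, and nothing here reflects how the dataset authors actually ran their benchmarks.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def exact_top_k(
    query: Sequence[float],
    vectors: Dict[str, Sequence[float]],
    k: int,
) -> List[Tuple[str, float]]:
    """Score every stored vector against the query and return the k best (id, score) pairs."""
    scored = [(vec_id, cosine_similarity(query, vec)) for vec_id, vec in vectors.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional vectors stand in for the 1536-dimensional embeddings.
toy_vectors = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_result = exact_top_k([1.0, 0.0], toy_vectors, k=2)
```

A scan like this is linear in the number of vectors, so at one million entries it serves as a correctness reference rather than a practical search method.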
Source
hugging_face
Created
Nov 28, 2025
Updated
Feb 19, 2024
Dataset Overview
Basic Information
- License: MIT
- Size: 1M < n < 10M
- Language: English (en)
Features
- _id: string
- title: string
- text: string
- openai: sequence of float32 (1536‑dimensional)
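The feature list above can be sketched as a small record type with a shape check. This is a minimal illustration, assuming the `openai` field is held as a plain Python list of floats; `Record`, `validate`, and `raw_embedding_bytes` are hypothetical names introduced here and are not part of the dataset or any library.

```python
from dataclasses import dataclass
from typing import List

EMBEDDING_DIM = 1536     # dimension stated in the feature list
BYTES_PER_FLOAT32 = 4    # one float32 value occupies 4 bytes


@dataclass
class Record:
    """One sample mirroring the four listed features (hypothetical helper type)."""
    _id: str
    title: str
    text: str
    openai: List[float]


def validate(record: Record) -> None:
    """Raise ValueError if the record does not match the listed schema."""
    for name in ("_id", "title", "text"):
        if not isinstance(getattr(record, name), str):
            raise ValueError(f"{name} must be a string")
    if len(record.openai) != EMBEDDING_DIM:
        raise ValueError(f"expected {EMBEDDING_DIM} values, got {len(record.openai)}")
    if not all(isinstance(v, float) for v in record.openai):
        raise ValueError("embedding values must be floats")


def raw_embedding_bytes(n_vectors: int) -> int:
    """Bytes needed to hold n_vectors embeddings as raw float32 values."""
    return n_vectors * EMBEDDING_DIM * BYTES_PER_FLOAT32


# Placeholder record with made-up field values, for illustration only.
example = Record(_id="example-id", title="Example", text="Example text.", openai=[0.0] * EMBEDDING_DIM)
validate(example)
```

Under these assumptions, `raw_embedding_bytes(1_000_000)` gives 6,144,000,000 bytes, roughly 6.1 GB of raw float32 values before any index or storage overhead.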
Splits
- Training Set:
  - Samples: 1,000,000
  - Size: 12,383,152 bytes
Task Category
- Feature Extraction
Dataset Name
- Pretty Name: OpenAI 1M with DBPedia Entities
Embedding Details
- Dimension: 1536
- Embedding Text: title (string) + text (string)
- Model: text‑embedding‑ada‑002
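The embedding input is described as the title combined with the text. The sketch below shows one way that string could be built before sending it to an embedding model. The separator is an assumption, since the source states only "title + text"; `build_embedding_text` is a hypothetical helper, and the call to the embedding model itself is deliberately left out.

```python
def build_embedding_text(title: str, text: str, separator: str = " ") -> str:
    """Join title and text into one input string.

    The default single-space separator is an assumption; the source
    does not say how the two fields were joined.
    """
    return f"{title}{separator}{text}"


# Made-up title and text, for illustration only.
sample_input = build_embedding_text("Example title", "Example body text.")
```

Passing `separator=""` reproduces a literal concatenation, which is the other obvious reading of "title + text".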