Dataset asset, Open Source Community, Natural Language Processing, Text Embedding

KShivendu/dbpedia-entities-openai-1M

OpenAI 1M with DBPedia Entities is a dataset of one million samples intended for feature-extraction tasks. Each sample has an `_id`, a `title`, a `text`, and an `openai` field holding a 1536-dimensional float32 embedding generated with the `text-embedding-ada-002` model. The dataset is in English and was created in June 2023 to benchmark the performance of pgvector and the Qdrant vector database; an expansion to ten million vectors is planned. It is derived from the first one million entries of the BeIR/DBpedia-Entity dataset.
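
A minimal sketch of reading a few records, assuming the Hugging Face `datasets` package is installed and that the single split is exposed as `train` (the listing only calls it the training set). The helper name `iter_records` is hypothetical. Streaming is used so that the full one million vectors are not downloaded up front.

```python
from typing import Any, Dict, Iterator

DATASET_ID = "KShivendu/dbpedia-entities-openai-1M"
EMBEDDING_DIM = 1536


def iter_records(limit: int = 3) -> Iterator[Dict[str, Any]]:
    """Yield the first `limit` records without downloading the whole dataset."""
    # Imported lazily so this module can be imported without the package installed.
    from datasets import load_dataset

    # Assumption: the split name is "train".
    stream = load_dataset(DATASET_ID, split="train", streaming=True)
    for index, record in enumerate(stream):
        if index >= limit:
            break
        yield record


if __name__ == "__main__":
    for record in iter_records():
        # Each record should carry the four fields described in this listing.
        print(record["_id"], record["title"], len(record["openai"]))
```

Running the script needs network access to the Hugging Face Hub; nothing is fetched until the generator is iterated.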

Source: hugging_face
Created: Nov 28, 2025
Updated: Feb 19, 2024

Dataset Overview

Basic Information

  • License: MIT
  • Size: 1M < n < 10M
  • Language: English (en)

Features

  • _id: string
  • title: string
  • text: string
  • openai: sequence of float32 (1536‑dimensional)
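
The feature list above can be turned into a small sanity check before records are loaded into a vector store. This is a sketch only: `validate_record` is a hypothetical helper, the sample record is made up for illustration, and Python floats stand in for float32 values.

```python
from typing import Any, Dict, List

EMBEDDING_DIM = 1536
STRING_FIELDS = ("_id", "title", "text")


def validate_record(record: Dict[str, Any]) -> List[str]:
    """Return a list of schema problems; an empty list means the record fits."""
    problems = []
    for name in STRING_FIELDS:
        if not isinstance(record.get(name), str):
            problems.append(f"{name}: expected a string")
    vector = record.get("openai")
    if not isinstance(vector, (list, tuple)):
        problems.append("openai: expected a sequence of floats")
    elif len(vector) != EMBEDDING_DIM:
        problems.append(
            f"openai: expected {EMBEDDING_DIM} values, got {len(vector)}"
        )
    elif not all(isinstance(value, float) for value in vector):
        problems.append("openai: expected only float values")
    return problems


# Made-up record for illustration; real values come from the dataset.
sample = {
    "_id": "example-id",
    "title": "Example title",
    "text": "Example text.",
    "openai": [0.0] * EMBEDDING_DIM,
}
print(validate_record(sample))
```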

Splits

  • Training Set:
    • Samples: 1,000,000
    • Size: 12,383,152 bytes
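
When sizing a benchmark, it helps to estimate the raw memory footprint of the embeddings alone. The sketch below multiplies the sample count by the dimension and by 4 bytes per float32 value; it ignores the text fields and any index overhead. The 12,383,152 bytes listed for the split is far below this estimate, so it presumably measures something other than the uncompressed vectors; the source does not say what.

```python
NUM_VECTORS = 1_000_000
EMBEDDING_DIM = 1536
BYTES_PER_FLOAT32 = 4


def raw_embedding_bytes(num_vectors: int, dim: int, bytes_per_value: int = 4) -> int:
    """Bytes needed to hold the vectors as a dense, uncompressed array."""
    return num_vectors * dim * bytes_per_value


total_bytes = raw_embedding_bytes(NUM_VECTORS, EMBEDDING_DIM, BYTES_PER_FLOAT32)
total_gib = total_bytes / 2**30
print(f"{total_bytes:,} bytes, about {total_gib:.2f} GiB")
```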

Task Category

  • Feature Extraction

Dataset Name

  • Pretty Name: OpenAI 1M with DBPedia Entities

Embedding Details

  • Dimension: 1536
  • Embedding Text: title (string) + text (string)
  • Model: text-embedding-ada-002
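
The details above say that each embedding covers the title plus the text. A rough sketch of how such an input string might be assembled and how the resulting vectors could be compared follows. The separator between title and text is an assumption (the source does not state one), cosine similarity is chosen here as a common metric for this kind of embedding rather than one confirmed by the source, and all function names are hypothetical. Toy 2-dimensional vectors replace the real 1536-dimensional ones.

```python
import math
from typing import List, Sequence


def build_embedding_text(title: str, text: str, separator: str = " ") -> str:
    """Join title and text; the separator is an assumption, not from the source."""
    return f"{title}{separator}{text}"


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int = 1) -> List[int]:
    """Indices of the k most similar vectors, found by brute-force scan."""
    scored = [(cosine_similarity(query, v), i) for i, v in enumerate(vectors)]
    scored.sort(reverse=True)
    return [index for _, index in scored[:k]]


# Toy vectors for illustration only.
corpus = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
print(top_k([1.0, 0.0], corpus, k=2))
```

A brute-force scan like this is only practical for small samples; it gives exact results that an approximate index in pgvector or Qdrant can be checked against.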