Explore high-quality datasets for your AI and machine learning projects.
OpenAI 1M with DBPedia Entities is a dataset of one million samples designed for feature-extraction tasks. Each sample includes an `_id`, `title`, `text`, and an `openai` field containing a 1536-dimensional float32 embedding generated with the text-embedding-ada-002 model. The dataset is English, was created in June 2023 for benchmarking pgvector and Qdrant vector-database performance, and is planned to grow to ten million vectors. It is derived from the first one million entries of the BeIR/DBpedia-Entity dataset.
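To make the record layout concrete, here is a minimal sketch of a sample with that schema and a cosine-similarity helper, the usual metric for ada-002 embeddings. The record contents and the use of NumPy are illustrative assumptions; the vector is random stand-in data, not a real model output.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical record mirroring the dataset schema described above.
sample = {
    "_id": "<dbpedia:Example>",
    "title": "Example",
    "text": "An example DBpedia abstract.",
    "openai": rng.standard_normal(1536).astype(np.float32),  # 1536-dim float32
}

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# A vector is maximally similar to itself (score ~= 1.0).
score = cosine_similarity(sample["openai"], sample["openai"])
```

The same helper works unchanged for comparing a query embedding against any stored `openai` vector.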
This dataset includes an `embedding` feature (string type), five integer features (`O`, `C`, `E`, `A`, `N`), and an integer `id`. It is split into training, validation, and evaluation sets of 1,578, 395, and 494 samples, respectively; the dataset configuration specifies the file path for each split.
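The split sizes above imply the following proportions; the sizes come from the card, while the derived ratios are computed here for illustration.

```python
# Split sizes as stated in the dataset card.
splits = {"train": 1578, "validation": 395, "evaluation": 494}

total = sum(splits.values())  # 2467 samples in all
ratios = {name: round(n / total, 3) for name, n in splits.items()}
# Roughly a 64/16/20 train/validation/evaluation split.
```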
---
dataset_info:
  features:
  - name: fact
    dtype: string
  - name: count
    dtype: int64
  - name: embeddings
    sequence: float32
  splits:
  - name: train
    num_bytes: 4951309139
    num_examples: 1578238
  download_size: 5895178326
  dataset_size: 4951309139
---

# Dataset Card for "omcs_dataset_full_with_embeds"

[More Information needed](https://github.com/huggingface/datasets/blob/main/CONTRIBUTING.md#how-to-contribute-to-the-dataset-cards)
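The metadata above allows a quick sanity check of the average on-disk record size. The byte and example counts are taken from the YAML; the per-example figure is derived here, and the embedding dimension is not stated in the card.

```python
# Train-split totals from the dataset_info metadata above.
num_bytes = 4_951_309_139
num_examples = 1_578_238

# Average serialized size per example: roughly 3.1 KB.
avg_bytes_per_example = num_bytes / num_examples
```

At about 3.1 KB per row, the float32 `embeddings` sequence plausibly dominates the record (a 768-dimensional float32 vector alone would be 3072 bytes), though the actual dimension is an assumption, not something the card specifies.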