neural-bridge/rag-dataset-12000
Retrieval-Augmented Generation (RAG) Dataset 12000 is an English dataset specifically designed to optimize RAG models. Built by Neural Bridge AI, it contains 12,000 entries, each comprising three fields: context, question, and answer. Context data originates from Falcon RefinedWeb, while questions and answers are generated by GPT-4. The dataset is split into a training set (9,600 samples) and a test set (2,400 samples) and is released under the Apache 2.0 license.
Description
Dataset Overview
Dataset Name
Retrieval-Augmented Generation (RAG) Dataset 12000
Dataset Description
- Purpose: To build optimized RAG models, which enhance large language models (LLMs) by drawing on external, authoritative knowledge bases when generating responses.
- Characteristics: RAG extends models to specific domains or internal organizational data without retraining, improving the relevance, accuracy, and context specificity of their outputs.
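The retrieve-then-generate flow described above can be sketched in a few lines. This is a toy illustration only, not the method used to build or evaluate the dataset: the word-overlap scorer, the `retrieve` and `build_prompt` helpers, and the sample passages are all hypothetical, and a real system would likely use a proper retriever and pass the prompt to an LLM.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase a string and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str], k: int = 1) -> list[str]:
    """Return the k passages that share the most words with the question."""
    q_tokens = tokenize(question)
    ranked = sorted(
        passages,
        key=lambda p: len(q_tokens & tokenize(p)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question: str, contexts: list[str]) -> str:
    """Place the retrieved context ahead of the question (hypothetical template)."""
    joined = "\n\n".join(contexts)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up stand-ins for an external knowledge base.
passages = [
    "The warranty covers parts and labor for two years.",
    "Our office is closed on public holidays.",
]
question = "How long does the warranty cover parts?"
prompt = build_prompt(question, retrieve(question, passages))
print(prompt)
```

The model itself is never retrained here; only the prompt changes, which is the property the Characteristics bullet refers to.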
Dataset Structure
- Features:
  - context: string, a sequence of tokens.
  - question: string, a question related to the context.
  - answer: string, the answer to the question.
- Data Splits:
  - train: 9,600 samples.
  - test: 2,400 samples.
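The splits can be checked after loading. The sketch below assumes that the identifier in the title, `neural-bridge/rag-dataset-12000`, is the Hugging Face Hub repository ID and that the third-party `datasets` package is installed. The helper names `split_fractions` and `load_and_check` are hypothetical, and the expected sizes are copied from this card.

```python
# Split sizes as stated in this card.
EXPECTED_SPLITS = {"train": 9_600, "test": 2_400}


def split_fractions(sizes: dict[str, int]) -> dict[str, float]:
    """Return each split's share of the total number of samples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_and_check(repo_id: str = "neural-bridge/rag-dataset-12000"):
    """Download the dataset and compare its split sizes with the card.

    Assumes `repo_id` is the Hub identifier and that network access
    and the `datasets` package are available.
    """
    from datasets import load_dataset  # imported lazily; third-party

    ds = load_dataset(repo_id)
    for name, expected in EXPECTED_SPLITS.items():
        actual = ds[name].num_rows
        if actual != expected:
            raise ValueError(f"{name}: expected {expected} rows, found {actual}")
    return ds


print(split_fractions(EXPECTED_SPLITS))  # {'train': 0.8, 'test': 0.2}
```

Calling `load_and_check()` triggers the download. The sizes on this card correspond to an 80/20 train/test split of the 12,000 entries.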
Language
- Language: English (en)
License
- License: Apache License 2.0
Data Source
- Source: Falcon RefinedWeb
Data Example
{
  "context": "...",
  "question": "...",
  "answer": "..."
}
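A record with this shape can be validated before use. The following is a minimal sketch using only the standard library. The `RagRecord` type and the `parse_record` helper are hypothetical names, and the `"..."` values are the placeholders from the example above, not real data.

```python
import json
from typing import TypedDict


class RagRecord(TypedDict):
    """One dataset entry: the three string fields listed in this card."""
    context: str
    question: str
    answer: str


FIELDS = ("context", "question", "answer")


def parse_record(line: str) -> RagRecord:
    """Parse one JSON object and check it has exactly the three string fields."""
    obj = json.loads(line)
    if not isinstance(obj, dict) or set(obj) != set(FIELDS):
        raise ValueError(f"expected an object with exactly the fields {FIELDS}")
    for field in FIELDS:
        if not isinstance(obj[field], str):
            raise ValueError(f"field {field!r} must be a string")
    return RagRecord(
        context=obj["context"],
        question=obj["question"],
        answer=obj["answer"],
    )


# Placeholder values copied from the card's example.
record = parse_record('{ "context": "...", "question": "...", "answer": "..." }')
print(sorted(record))
```

Rejecting records with missing or extra keys early keeps malformed rows out of later training or evaluation steps.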
Source
Organization: hugging_face
Created: Unknown