neural-bridge/rag-dataset-12000

Retrieval-Augmented Generation (RAG) Dataset 12000 is an English dataset specifically designed to optimize RAG models. Built by Neural Bridge AI, it contains 12,000 entries, each comprising three fields: context, question, and answer. Context data originates from Falcon RefinedWeb, while questions and answers are generated by GPT-4. The dataset is split into a training set (9,600 samples) and a test set (2,400 samples) and is released under the Apache 2.0 license.

Updated 2/5/2024

Description

Dataset Overview

Dataset Name

Retrieval-Augmented Generation (RAG) Dataset 12000

Dataset Description

Purpose: To support building optimized RAG models, which enhance large language models (LLMs) by drawing on external, authoritative knowledge bases when generating responses.
Characteristics: RAG extends a model to specific domains or an organization's internal data without retraining, improving the relevance, accuracy, and contextual specificity of its outputs.

Dataset Structure

Features:
- context: string, the passage of text (drawn from Falcon RefinedWeb) on which the question is based.
- question: string, a question related to the context.
- answer: string, the answer to the question.
Data Splits:
- train: 9,600 samples.
- test: 2,400 samples.
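
As a rough illustration of the structure above, the following Python sketch checks that a record carries the three string fields and computes the split proportions from the listed sample counts. The helper names (`load_splits`, `is_valid_record`, `split_fractions`) are hypothetical and not part of the dataset or any library. The loader assumes the third-party `datasets` package is installed and that the dataset is reachable on the Hugging Face Hub under the identifier in this page's title.

```python
from typing import Dict

# Field names and split sizes as listed in this dataset card.
EXPECTED_FIELDS = ("context", "question", "answer")
EXPECTED_SPLIT_SIZES = {"train": 9600, "test": 2400}


def load_splits():
    """Download the dataset from the Hugging Face Hub (needs network access).

    Assumption: the `datasets` package is installed (pip install datasets).
    The import is kept local so the other helpers work without it.
    """
    from datasets import load_dataset

    return load_dataset("neural-bridge/rag-dataset-12000")


def is_valid_record(record: dict) -> bool:
    """Return True if the record has exactly the three expected string fields."""
    return set(record) == set(EXPECTED_FIELDS) and all(
        isinstance(record[name], str) for name in EXPECTED_FIELDS
    )


def split_fractions(sizes: Dict[str, int]) -> Dict[str, float]:
    """Return each split's share of the total number of samples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}
```

With the counts listed above, `split_fractions(EXPECTED_SPLIT_SIZES)` gives an 80/20 train/test split. Whether every record in the published files passes `is_valid_record` is not stated in this card, so treat the check as a sanity test and not a guarantee.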

Language

Language: English (en)

License

License: Apache License 2.0

Data Source

Source: Falcon RefinedWeb

Data Example

{
  "context": "...",
  "question": "...",
  "answer": "..."
}
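
One possible way to use a record of this shape is to turn it into a context-grounded prompt for training or evaluating a RAG model. The sketch below is only an illustration: the prompt wording and the helper names `parse_record` and `build_prompt` are assumptions, not something the dataset prescribes.

```python
import json

REQUIRED_FIELDS = ("context", "question", "answer")

# Hypothetical prompt layout; the dataset does not define one.
PROMPT_TEMPLATE = (
    "Answer the question using only the context below.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def parse_record(line: str) -> dict:
    """Parse one JSON-encoded record and check that all three fields exist."""
    record = json.loads(line)
    missing = [name for name in REQUIRED_FIELDS if name not in record]
    if missing:
        raise KeyError(f"record is missing fields: {missing}")
    return record


def build_prompt(record: dict) -> str:
    """Format the context and question into a prompt; the answer is the target."""
    return PROMPT_TEMPLATE.format(
        context=record["context"], question=record["question"]
    )
```

In this layout the `answer` field is left out of the prompt on purpose, so it can serve as the reference output when fine-tuning or scoring a model's response.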

Topics

Natural Language Processing

Machine Learning

Source

Organization: hugging_face

Created: Unknown