neural-bridge/rag-dataset-12000

Retrieval-Augmented Generation (RAG) Dataset 12000 is an English dataset specifically designed to optimize RAG models. Built by Neural Bridge AI, it contains 12,000 entries, each comprising three fields: context, question, and answer. Context data originates from Falcon RefinedWeb, while questions and answers are generated by GPT-4. The dataset is split into a training set (9,600 samples) and a test set (2,400 samples) and is released under the Apache 2.0 license.

Updated 2/5/2024
hugging_face

Description

Dataset Overview

Dataset Name

Retrieval-Augmented Generation (RAG) Dataset 12000

Dataset Description

  • Purpose: To support building RAG models, which enhance large language models (LLMs) by consulting external, authoritative knowledge bases when generating responses.
  • Characteristics: RAG extends a model to specific domains or to an organization's internal data without retraining, improving the relevance, accuracy, and context specificity of its outputs.
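To make the retrieve-then-generate idea concrete, here is a minimal sketch of the retrieval step only. It scores passages by word overlap with the question, which is a deliberately simple stand-in chosen for illustration; it is not how this dataset was built, and practical RAG systems usually rely on embedding-based or other stronger retrievers. The function names `tokenize` and `retrieve` are hypothetical.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its distinct alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(question: str, passages: list[str]) -> str:
    """Return the passage that shares the most distinct words with the question.

    Toy scoring for illustration only (assumption: word overlap is an
    acceptable proxy for relevance in this tiny example).
    """
    question_words = tokenize(question)
    return max(passages, key=lambda p: len(question_words & tokenize(p)))


passages = [
    "The Eiffel Tower is in Paris.",
    "Mount Everest is the tallest mountain on Earth.",
]
best = retrieve("Which mountain is the tallest?", passages)
```

The retrieved passage would then be placed in the model's prompt as context, which is the role the context field plays in each record of this dataset.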

Dataset Structure

  • Features:
    • context: string, the passage of text the question is asked about (drawn from Falcon RefinedWeb).
    • question: string, a question related to the context.
    • answer: string, the answer to the question.
  • Data Splits:
    • train: 9,600 samples.
    • test: 2,400 samples.
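The structure above can be checked programmatically. The sketch below assumes the dataset is published on the Hugging Face Hub under the identifier shown at the top of this page and that the third-party `datasets` package is installed; neither has been verified here. The helper names `is_valid_record` and `load_and_check` are hypothetical.

```python
from typing import Any, Mapping

REPO_ID = "neural-bridge/rag-dataset-12000"  # assumed Hub identifier
EXPECTED_FIELDS = ("context", "question", "answer")
EXPECTED_SPLIT_SIZES = {"train": 9_600, "test": 2_400}


def is_valid_record(record: Mapping[str, Any]) -> bool:
    """Return True if the record has exactly the three string fields."""
    return set(record) == set(EXPECTED_FIELDS) and all(
        isinstance(record[name], str) for name in EXPECTED_FIELDS
    )


def load_and_check():
    """Download the dataset and print each split size next to the listed one.

    Requires network access and the `datasets` package, so the import is
    kept inside the function.
    """
    from datasets import load_dataset

    dataset = load_dataset(REPO_ID)
    for split, expected in EXPECTED_SPLIT_SIZES.items():
        actual = dataset[split].num_rows
        print(f"{split}: {actual} rows (listing says {expected})")
    return dataset
```

Calling `load_and_check()` downloads the data; `is_valid_record` can be used on its own to validate individual rows. The listed split sizes correspond to an 80/20 division of the 12,000 entries.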

Language

  • Language: English (en)

License

  • License: Apache License 2.0

Data Source

  • Context: passages from Falcon RefinedWeb.
  • Questions and answers: generated by GPT-4.

Data Example

{
  "context": "...",
  "question": "...",
  "answer": "..."
}
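One possible way to use a record of this shape is to turn it into a prompt and target pair for fine-tuning or evaluation. The sketch below is an illustration under assumptions: the prompt wording, the `prompt`/`completion` key names, and the helper names `build_prompt` and `to_training_pair` are all hypothetical and are not prescribed by the dataset.

```python
def build_prompt(context: str, question: str) -> str:
    """Combine a context passage and a question into a single prompt string."""
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def to_training_pair(record: dict) -> dict:
    """Map one dataset record to a prompt and its expected completion."""
    return {
        "prompt": build_prompt(record["context"], record["question"]),
        # Leading space so the completion follows "Answer:" naturally.
        "completion": " " + record["answer"],
    }


example = {
    "context": "Paris is the capital of France.",
    "question": "What is the capital of France?",
    "answer": "Paris",
}
pair = to_training_pair(example)
```

Keeping the prompt template in one function makes it easy to apply the same format at training time and at inference time.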

Topics

Natural Language Processing
Machine Learning

Source

Organization: hugging_face

Created: Unknown
