Dataset asset · Open Source Community · Question Answering Systems · Wikipedia
legacy107/qa_wikipedia
The qa_wikipedia dataset is a question‑answering dataset containing multiple documents extracted from Wikipedia along with associated questions. Features include document ID, title, context, question, answer start position, answer text, and the full article. The dataset is split into training, test, and validation subsets for different modeling stages.
Source: hugging_face
Created: Nov 28, 2025
Updated: Sep 18, 2023
Signals: 108 views
Availability: Linked source ready
Dataset Overview
Configuration
- Default configuration (default)
- Data file paths:
  - Training set (train): data/train-*
  - Test set (test): data/test-*
  - Validation set (validation): data/validation-*
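The split-to-file mapping above is a set of glob patterns. As a minimal sketch of how a data file resolves to a split (the shard filename shown is hypothetical, following the common Hugging Face shard naming scheme, which this card does not confirm):

```python
import fnmatch
from typing import Optional

# Glob patterns from the card's default configuration.
SPLIT_PATTERNS = {
    "train": "data/train-*",
    "test": "data/test-*",
    "validation": "data/validation-*",
}

def split_for_path(path: str) -> Optional[str]:
    """Return the split a data file belongs to, or None if no pattern matches."""
    for split, pattern in SPLIT_PATTERNS.items():
        if fnmatch.fnmatch(path, pattern):
            return split
    return None

print(split_for_path("data/train-00000-of-00005.parquet"))       # train
print(split_for_path("data/validation-00000-of-00001.parquet"))  # validation
```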
Data Features
- id: string
- title: string
- context: string
- question: string
- answer_start: 64-bit integer
- answer: string
- article: string
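The feature list corresponds to records shaped like the sketch below. The field names and types come from the card; the example values are invented, and the assumption that answer_start is a character offset of the answer inside context follows the usual SQuAD-style convention, which this card does not explicitly state:

```python
from typing import TypedDict

class QAExample(TypedDict):
    """One qa_wikipedia record, per the feature list above."""
    id: str
    title: str
    context: str
    question: str
    answer_start: int  # assumed: character offset of `answer` within `context`
    answer: str
    article: str

# Hypothetical record illustrating the shape (values are made up).
sample: QAExample = {
    "id": "0001",
    "title": "Alan Turing",
    "context": "Alan Turing was born in London in 1912.",
    "question": "Where was Alan Turing born?",
    "answer_start": 24,
    "answer": "London",
    "article": "Alan Turing was born in London in 1912. ...",
}

# Under the SQuAD-style convention, answer_start indexes the answer in context.
assert sample["context"][sample["answer_start"]:].startswith(sample["answer"])
```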
Data Splits
- Training set (train)
  - Size: 7,477,859,892 bytes
  - Samples: 138,712
- Test set (test)
  - Size: 898,641,134 bytes
  - Samples: 17,341
- Validation set (validation)
  - Size: 926,495,549 bytes
  - Samples: 17,291
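A quick back-of-the-envelope check on the split figures: dividing each split's byte size by its sample count gives a roughly uniform per-sample footprint of about 50–53 KiB, which is what you would expect from a random split of one corpus (a sanity-check sketch, nothing more):

```python
# Split sizes and sample counts taken directly from the card.
splits = {
    "train": (7_477_859_892, 138_712),
    "test": (898_641_134, 17_341),
    "validation": (926_495_549, 17_291),
}

for name, (size_bytes, samples) in splits.items():
    per_sample_kib = size_bytes / samples / 1024
    print(f"{name}: ~{per_sample_kib:.1f} KiB per sample")
```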
Dataset Size
- Download size: 498,772,569 bytes
- Total dataset size: 9,302,996,575 bytes
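The gap between the download size and the total dataset size implies the distributed files are heavily compressed, by roughly 18–19x. One plausible reason, though the card does not say so, is that the full article is repeated for every question, which compresses extremely well:

```python
# Figures taken directly from the card.
download_bytes = 498_772_569
total_bytes = 9_302_996_575

ratio = total_bytes / download_bytes
print(f"~{ratio:.1f}x expansion from download to on-disk dataset size")
```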