cimec/lambada
The LAMBADA dataset evaluates the text‑understanding ability of computational models through a word‑prediction task, specifically testing whether a model can handle long‑range dependencies. It consists of narrative passages extracted from BookCorpus: the development and test sets contain curated passages, while the training data covers the full text of 2,662 novels, disjoint from the evaluation sets. Each instance provides a text field (plus a domain field in the training split). The dataset was created to assess whether language models can retain long‑term contextual memory; paid crowdworkers verified that the target word can be guessed only after reading the entire passage, not from the final sentence alone. The language is English and the license is CC BY 4.0.
Description
Dataset Overview
Dataset Name
- Name: LAMBADA
- Alias: None
Basic Information
- Language: English (en)
- License: CC BY 4.0
- Multilinguality: Monolingual
- Size: 10K < n < 100K
- Source Dataset: Extended from BookCorpus
- Task Category: Text‑to‑text generation
- Task ID: None
- Tags: Long‑range dependency
Dataset Structure
- Config Name: plain_text
- Features:
- text: string, containing context, target sentence, and target word
- domain: string, provided only in the training split
- Splits:
- Training: 2,662 novels, >200 M words
- Validation: 4,869 passages
- Test: 5,153 passages
Dataset Creation
- Purpose: Evaluate whether language models can retain long‑term contextual memory
- Data Source: Novels from BookCorpus
- Annotation Process: Paid crowdworkers verified that the target word could be guessed from the full passage but not from the final sentence alone
Usage Considerations
- License: Must comply with CC BY 4.0
- Citation: Full citation information is provided in the dataset metadata
Detailed Information
Description
- Summary: LAMBADA assesses computational models' text‑understanding ability through a word‑prediction task. Human participants can guess the final word only after reading the entire passage; the last sentence alone is insufficient.
- Supported Tasks: Long‑range dependency evaluation via word prediction
Structure
- Instance: Each instance contains a text sequence composed of context, target sentence, and target word. Training data includes the full text of 2,662 novels, disjoint from validation and test sets.
- Fields:
- domain: Provided only in the training split, indicating the category of the source book
- text: Contains context, target sentence, and target word
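Because each instance stores the context, target sentence, and target word as one string, evaluation typically peels off the final whitespace‑delimited token as the prediction target. A minimal sketch of that convention (the sample passage below is illustrative, not drawn from the dataset):

```python
# Split a LAMBADA-style passage into (context, target_word): the target
# word is the final whitespace-delimited token; everything before it is
# the context a model must condition on.
def split_instance(text: str) -> tuple[str, str]:
    context, _, target = text.rstrip().rpartition(" ")
    return context, target

# Fraction of passages whose target word was predicted exactly.
def last_word_accuracy(predictions: list[str], texts: list[str]) -> float:
    correct = sum(
        pred == split_instance(text)[1]
        for pred, text in zip(predictions, texts)
    )
    return correct / len(texts)

# Illustrative passage, not an actual dataset instance.
sample = "She opened the letter with trembling hands and read the name"
context, target = split_instance(sample)
print(target)  # -> name
```

Exact-match accuracy on the final word is the standard LAMBADA metric; the helper names above are hypothetical, not part of any dataset API.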
Creation Rationale
- The dataset is designed to evaluate language models' ability to handle long‑distance context. Filtering ensures the target word is guessable by humans after reading the whole passage but not from the final sentence alone.
- Source data are from BookCorpus, de‑duplicated and filtered to remove potentially offensive content.
Usage Notes
- License Information: Released under CC BY 4.0
- Citation: Comprehensive citation format is supplied for academic referencing.
Source
Organization: hugging_face