cimec/lambada
The LAMBADA dataset evaluates the text‑understanding ability of computational models through a word‑prediction task, specifically testing whether a model can handle long‑range dependencies. It consists of narrative passages extracted from BookCorpus: the development and test sets contain curated passages, while the training data covers the full text of 2,662 novels, disjoint from the evaluation sets. Each instance provides a text field (plus a domain field in the training split). The dataset was created to assess whether language models can retain long‑term contextual memory; paid crowdworkers verified that the target word can be guessed only after reading the entire passage, not from the final sentence alone. The language is English and the license is CC BY 4.0.
Description
Dataset Overview
Dataset Name
- Name: LAMBADA
- Alias: None
Basic Information
- Language: English (en)
- License: CC BY 4.0
- Multilinguality: Monolingual
- Size: 10K < n < 100K
- Source Dataset: Extended from BookCorpus
- Task Category: Text‑to‑text generation
- Task ID: None
- Tags: Long‑range dependency
Dataset Structure
- Config Name: plain_text
- Features:
- text: string, containing context, target sentence, and target word
- domain: string, provided only in the training split
- Splits:
- Training: 2,662 novels, >200 M words
- Validation: 4,869 passages
- Test: 5,153 passages
Dataset Creation
- Purpose: Evaluate whether language models can retain long‑term contextual memory
- Data Source: Novels from BookCorpus
- Annotation Process: Paid crowdworkers verified that the target word could be guessed from the full passage but not from the final sentence alone
Usage Considerations
- License: Must comply with CC BY 4.0
- Citation: Full citation information is provided in the dataset metadata
Detailed Information
Description
- Summary: LAMBADA assesses computational models' text‑understanding ability through a word‑prediction task. Human participants can guess the final word only after reading the entire passage; the last sentence alone is insufficient.
- Supported Tasks: Long‑range dependency evaluation via word prediction
Structure
- Instance: Each instance contains a text sequence composed of context, target sentence, and target word. Training data includes the full text of 2,662 novels, disjoint from validation and test sets.
- Fields:
- domain: Provided only in the training split, indicating the category of the source book
- text: Contains context, target sentence, and target word
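Because each instance stores the context, target sentence, and target word as one string, evaluation typically peels off the final whitespace‑delimited token as the prediction target. A minimal sketch of that convention (the sample passage below is illustrative, not drawn from the dataset):

```python
# Split a LAMBADA-style passage into (context, target_word): the target
# word is the final whitespace-delimited token; everything before it is
# the context a model must condition on.
def split_instance(text: str) -> tuple[str, str]:
    context, _, target = text.rstrip().rpartition(" ")
    return context, target

# Fraction of passages whose target word was predicted exactly.
def last_word_accuracy(predictions: list[str], texts: list[str]) -> float:
    correct = sum(
        pred == split_instance(text)[1]
        for pred, text in zip(predictions, texts)
    )
    return correct / len(texts)

# Illustrative passage, not an actual dataset instance.
sample = "She opened the letter with trembling hands and read the name"
context, target = split_instance(sample)
print(target)  # -> name
```

Exact-match accuracy on the final word is the standard LAMBADA metric; the helper names above are hypothetical, not part of any dataset API.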
Creation Rationale
- The dataset is designed to evaluate language models' ability to handle long‑distance context. Filtering ensures the target word is guessable by humans after reading the whole passage but not from the final sentence alone.
- Source data are from BookCorpus, de‑duplicated and filtered to remove potentially offensive content.
Usage Notes
- License Information: Released under CC BY 4.0
- Citation: Comprehensive citation format is supplied for academic referencing.
Source
Organization: hugging_face