
cimec/lambada

The LAMBADA dataset evaluates computational models' text‑understanding ability, specifically testing whether a model can handle long‑range dependencies via a word‑prediction task. The development and test sets consist of narrative passages extracted from BookCorpus, while the training data covers the full text of 2,662 novels disjoint from them. Each instance carries a text field (and, in the training split only, a domain field). The dataset was created to assess whether language models can retain long‑term contextual memory: paid crowdworkers filtered passages so that the target word can be guessed only by reading the entire passage, not the final sentence alone. The language is English and the license is CC BY 4.0.

Updated 1/4/2024
hugging_face

Description

Dataset Overview

Dataset Name

  • Name: LAMBADA
  • Alias: None

Basic Information

  • Language: English (en)
  • License: CC BY 4.0
  • Multilinguality: Monolingual
  • Size: 10K < n < 100K
  • Source Dataset: Extended from BookCorpus
  • Task Category: Text‑to‑text generation
  • Task ID: None
  • Tags: Long‑range dependency

Dataset Structure

  • Config Name: plain_text
  • Features:
    • text: string, containing context, target sentence, and target word
    • domain: string, provided only in the training split
  • Splits:
    • Training: 2,662 novels, >200 M words
    • Validation: 4,869 passages
    • Test: 5,153 passages
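
Since each instance's text field ends with the target word, the word‑prediction task amounts to splitting off the last token and asking a model to recover it from the preceding context. A minimal sketch (the passage below is invented for illustration, not from the dataset):

```python
# Hypothetical helper: split a LAMBADA-style passage into the context a
# model conditions on and the final target word it must predict.
def split_instance(text: str) -> tuple[str, str]:
    context, _, target = text.rpartition(" ")
    return context, target

sample = "he dropped the envelope on the desk and stared at the name"
context, target = split_instance(sample)
print(target)  # -> name
```

In the real dataset the target is always the last word of the passage, so no separate label field is needed for the validation and test splits.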

Dataset Creation

  • Purpose: Evaluate whether language models can retain long‑term contextual memory
  • Data Source: Novels from BookCorpus
  • Annotation Process: Paid crowdworkers were instructed to predict the final word only after reading the whole passage

Usage Considerations

  • License: Must comply with CC BY 4.0
  • Citation: Full citation information is provided in the dataset metadata

Detailed Information

Description

  • Summary: LAMBADA assesses computational models' text‑understanding ability through a word‑prediction task. Human participants can guess the final word only after reading the entire passage; the last sentence alone is insufficient.
  • Supported Tasks: Long‑range dependency evaluation via word prediction
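
LAMBADA results are typically reported as last‑word accuracy: the fraction of passages whose final word the model predicts exactly. A hedged sketch of that metric, using toy predictions rather than real model output:

```python
# Toy last-word-accuracy computation; predictions and gold targets are
# invented examples, not real LAMBADA instances or model output.
def lambada_accuracy(predictions: list[str], targets: list[str]) -> float:
    assert len(predictions) == len(targets)
    correct = sum(p.strip() == t.strip() for p, t in zip(predictions, targets))
    return correct / len(targets)

preds = ["name", "door", "window"]
golds = ["name", "door", "letter"]
print(lambada_accuracy(preds, golds))  # 2 of 3 correct
```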

Structure

  • Instance: Each instance contains a text sequence composed of context, target sentence, and target word. Training data includes the full text of 2,662 novels, disjoint from validation and test sets.
  • Fields:
    • domain: Provided only in the training split, indicating a sub‑category extracted from the book
    • text: Contains context, target sentence, and target word

Creation Rationale

  • The dataset is designed to evaluate language models' ability to handle long‑distance context. Filtering ensures the target word is guessable by humans after reading the whole passage but not from the final sentence alone.
  • Source data are from BookCorpus, de‑duplicated and filtered to remove potentially offensive content.

Usage Notes

  • License Information: Released under CC BY 4.0
  • Citation: Comprehensive citation format is supplied for academic referencing.



Topics

Natural Language Processing
Text Understanding

Source

Organization: hugging_face

Created: Unknown
