facebook/lama
The LAMA dataset is used to probe and analyze factual and commonsense knowledge in pre-trained language models. It has four configurations (google_re, trex, conceptnet, and squad), each with its own fields, drawn from Google-RE, T-REx, ConceptNet, and SQuAD respectively. The dataset is English-only. It contains cleaned sentences in which the answer is replaced by a mask token ([MASK]), together with the corresponding answers; some configurations also include negative sentences.
Dataset description and usage context
Dataset Overview
Dataset Name: LAMA: LAnguage Model Analysis
Purpose: Probe and analyze factual and commonsense knowledge contained in pre‑trained language models.
Composition: Data from Google-RE, T-REx (a subset of Wikidata triples), ConceptNet, and SQuAD.
Language: English (en)
License: CC‑BY‑4.0
Multilinguality: Monolingual
Size Categories:
- <1K
- 1K‑10K
- 10K‑100K
- 1M‑10M
Task Types:
- Text Retrieval
- Text Classification
Task IDs:
- Fact‑checking Retrieval
- Text Scoring
Configurations: conceptnet, google_re, squad, trex
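The probing purpose described in the overview can be illustrated with a minimal sketch. It assumes each item is a sentence containing exactly one [MASK] token plus its expected answer, and it scores a predictor with precision at k, a common metric for cloze-style probes. The example sentences, `toy_predict`, and `precision_at_k` are hypothetical stand-ins for illustration and are not part of the dataset or any library.

```python
from typing import Callable, List, Sequence, Tuple

MASK = "[MASK]"

# Hypothetical cloze items: (sentence with one [MASK], expected answer).
EXAMPLES: List[Tuple[str, str]] = [
    ("The capital of France is [MASK].", "Paris"),
    ("Dante was born in [MASK].", "Florence"),
    ("Birds can [MASK].", "fly"),
]


def toy_predict(masked_sentence: str) -> List[str]:
    """Stand-in for a language model: returns candidate fillers, best first."""
    lookup = {
        "The capital of France is [MASK].": ["Paris", "Lyon"],
        "Dante was born in [MASK].": ["Rome", "Florence"],
    }
    return lookup.get(masked_sentence, ["the"])


def precision_at_k(
    examples: Sequence[Tuple[str, str]],
    predict: Callable[[str], List[str]],
    k: int = 1,
) -> float:
    """Fraction of items whose answer appears in the top-k predictions."""
    if not examples:
        raise ValueError("need at least one example")
    hits = 0
    for sentence, answer in examples:
        if sentence.count(MASK) != 1:
            raise ValueError(f"expected exactly one {MASK} in: {sentence!r}")
        if answer in predict(sentence)[:k]:
            hits += 1
    return hits / len(examples)


if __name__ == "__main__":
    print(f"P@1 = {precision_at_k(EXAMPLES, toy_predict, k=1):.3f}")
    print(f"P@2 = {precision_at_k(EXAMPLES, toy_predict, k=2):.3f}")
```

In practice `toy_predict` would be replaced by a call into a real masked language model, and the items would come from one of the configurations listed above.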
Structure
Data Instances:
- trex: uuid, obj_uri, obj_label, sub_uri, sub_label, predicate_id, ...
- conceptnet: uuid, sub, obj, pred, obj_label, ...
- squad: id, sub_label, obj_label, ...
- google_re: uuid, pred, sub, obj, evidences, judgments, ...
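Because the field lists above are abbreviated, it can help to check that a loaded record carries at least the fields named here before relying on them. The following is a minimal sketch: `LISTED_FIELDS` copies only the names shown in this listing, and `missing_fields` and the sample record are hypothetical helpers and values, not part of the dataset.

```python
from typing import Dict, List, Mapping

# Field names copied from the listing above. Each configuration has further
# fields (elided as "..." in the listing) that are deliberately not guessed here.
LISTED_FIELDS: Dict[str, List[str]] = {
    "trex": ["uuid", "obj_uri", "obj_label", "sub_uri", "sub_label", "predicate_id"],
    "conceptnet": ["uuid", "sub", "obj", "pred", "obj_label"],
    "squad": ["id", "sub_label", "obj_label"],
    "google_re": ["uuid", "pred", "sub", "obj", "evidences", "judgments"],
}


def missing_fields(config: str, record: Mapping[str, object]) -> List[str]:
    """Return the listed fields for `config` that are absent from `record`."""
    if config not in LISTED_FIELDS:
        raise KeyError(f"unknown configuration: {config!r}")
    return [name for name in LISTED_FIELDS[config] if name not in record]


if __name__ == "__main__":
    # Hypothetical record with made-up values, for illustration only.
    record = {"id": "example-0", "sub_label": "Squad", "extra": 1}
    print(missing_fields("squad", record))
```

Extra keys in a record are ignored, so the check stays valid even though the listing does not enumerate every field.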
Splits: No explicit splits provided.
Creation
Source Data: Aggregated from existing datasets, cleaned and adapted for probing.
Annotation: Mixed crowd‑sourced, expert‑generated, and machine‑generated annotations.
Usage Notes
Social Impact: Designed to evaluate language‑model understanding.
Bias Discussion: Crowd‑sourced data may contain biases.
Known Limitations: Limited documentation of original fields.