Dataset asset, Open Source Community, Relation Extraction, Counterfactual Reasoning

Re-DocRED-CF

Re-DocRED-CF is a counterfactual dataset for document-level relation extraction, generated by entity replacement. It contains five counterfactual variants, each with training, development, and test splits plus a mixed training set. Each example includes the document title, relation labels, entity vertex sets, tokenized sentences, and an original document ID giving its index in the seed dataset.

Source: Hugging Face
Created: Oct 14, 2024
Updated: Oct 15, 2024

Re-DocRED-CF Dataset Overview

Dataset Description

Re-DocRED-CF is a counterfactual dataset for document-level relation extraction (RE). It is built by replacing entities in seed documents and is intended for evaluating and mitigating factual bias in document-level RE.

Dataset Structure

The dataset comprises five counterfactual variants, each containing the following files:

  • train.jsonl
  • dev.jsonl
  • test.jsonl
  • train_mix.jsonl

Variant List

  • var-01
  • var-02
  • var-03
  • var-04
  • var-05
  • var-06
  • var-07
  • var-08
  • var-09
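
The description above mentions five variants while this list has nine entries, so it may be worth checking which configurations are actually published before iterating over them. The following is a minimal sketch, assuming the `get_dataset_config_names` function from the `datasets` library and the repository name used in the Loading the Dataset section. The helpers `listed_variants`, `compare_variants`, and `published_variants` are hypothetical names introduced here for illustration only.

```python
from typing import Iterable, List, Set, Tuple

# Repository name as it appears in the loading snippet of this page.
REPO_ID = "amodaresi/Re-DocRED-CF"


def listed_variants(count: int) -> List[str]:
    """Build zero-padded variant names such as 'var-01' for 1..count."""
    return [f"var-{i:02d}" for i in range(1, count + 1)]


def compare_variants(
    listed: Iterable[str], published: Iterable[str]
) -> Tuple[Set[str], Set[str]]:
    """Return (listed but not published, published but not listed)."""
    listed_set, published_set = set(listed), set(published)
    return listed_set - published_set, published_set - listed_set


def published_variants(repo_id: str = REPO_ID) -> List[str]:
    """Ask the Hub for the configuration names (needs network access)."""
    # Imported lazily so the pure helpers above work without `datasets`.
    from datasets import get_dataset_config_names

    return get_dataset_config_names(repo_id)
```

With network access, `compare_variants(listed_variants(9), published_variants())` would return two sets: names from the list above that are not published, and published names that the list omits.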

Data Format

Each record in a data file has the following fields:

  • title: document title.
  • labels: list of relations; each entry links a head entity to a tail entity and may include supporting evidence sentences.
  • vertexSet: list of entity vertices, each representing all mentions of an entity and its type within the document.
  • sents: tokenized sentences.
  • original_doc_id: index of the example in the original seed dataset.
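
To make the relationship between `labels` and `vertexSet` concrete, the sketch below resolves each relation into a readable (head, relation, tail) triple. The inner keys (`h`, `t`, `r`, `evidence` for labels; `name`, `sent_id`, `pos`, `type` for mentions) are an assumption based on the DocRED convention and are not confirmed by this page. The record itself is a made-up toy example, not data from the dataset.

```python
from typing import Any, Dict, List, Tuple

# Toy record for illustration only; inner keys follow the DocRED convention
# (an assumption, not confirmed by this dataset card).
toy_doc: Dict[str, Any] = {
    "title": "Example Document",
    "sents": [["Alice", "was", "born", "in", "Paris", "."]],
    "vertexSet": [
        [{"name": "Alice", "sent_id": 0, "pos": [0, 1], "type": "PER"}],
        [{"name": "Paris", "sent_id": 0, "pos": [4, 5], "type": "LOC"}],
    ],
    "labels": [{"h": 0, "t": 1, "r": "P19", "evidence": [0]}],
    "original_doc_id": 0,
}


def relation_triples(doc: Dict[str, Any]) -> List[Tuple[str, str, str]]:
    """Map each label to (head name, relation id, tail name).

    The first mention of each entity is used as its display name.
    """
    triples = []
    for label in doc["labels"]:
        head = doc["vertexSet"][label["h"]][0]["name"]
        tail = doc["vertexSet"][label["t"]][0]["name"]
        triples.append((head, label["r"], tail))
    return triples


print(relation_triples(toy_doc))  # [('Alice', 'P19', 'Paris')]
```

If the actual records use different inner keys, only the lookups inside `relation_triples` would need to change.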

Loading the Dataset

from datasets import load_dataset

# Load one counterfactual variant by its configuration name (here "var-01").
dataset = load_dataset("amodaresi/Re-DocRED-CF", "var-01")
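
Because every record carries an `original_doc_id`, records can be grouped by the seed document they refer to. The sketch below shows one way to do this on plain dictionaries. It assumes that counterfactuals derived from the same seed document share the same `original_doc_id`; `group_by_seed` is a hypothetical helper, and the records are toy data.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List


def group_by_seed(records: Iterable[Dict[str, Any]]) -> Dict[int, List[Dict[str, Any]]]:
    """Collect records that share the same original_doc_id."""
    groups: Dict[int, List[Dict[str, Any]]] = defaultdict(list)
    for record in records:
        groups[record["original_doc_id"]].append(record)
    return dict(groups)


# Toy records for illustration only; real records carry the full field set.
toy_records = [
    {"title": "Doc A (counterfactual 1)", "original_doc_id": 0},
    {"title": "Doc A (counterfactual 2)", "original_doc_id": 0},
    {"title": "Doc B (counterfactual 1)", "original_doc_id": 3},
]

groups = group_by_seed(toy_records)
for seed_id, docs in sorted(groups.items()):
    print(seed_id, [d["title"] for d in docs])
```

The same function should accept a loaded split, for example `group_by_seed(dataset["train"])`, since iterating over a split yields dictionary records. The split name `"train"` is an assumption based on the `train.jsonl` file name.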

Citation

If you use this dataset, please cite the following paper:

@inproceedings{modarressi-covered-2024,
  title={Consistent Document-Level Relation Extraction via Counterfactuals},
  author={Ali Modarressi and Abdullatif K{\"o}ksal and Hinrich Sch{\"u}tze},
  year={2024},
  booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
  address={Miami, United States},
  publisher={Association for Computational Linguistics}
}