Re-DocRED-CF
Re‑DocRED‑CF is a counterfactual dataset for document‑level relation extraction, generated by entity replacement. It contains five counterfactual variants, each with training, development, and test splits, plus a mixed training set. Each example includes document title, relation labels, entity vertex sets, tokenized sentences, and the original document ID indicating its index in the seed dataset.
Re‑DocRED‑CF Dataset Overview
Dataset Description
Re‑DocRED‑CF is a counterfactual dataset for document‑level relation extraction (RE), created by replacing entities to evaluate and mitigate factual bias in document‑level RE.
Dataset Structure
The dataset comprises five counterfactual variants, each containing the following files:
train.jsonl
dev.jsonl
test.jsonl
train_mix.jsonl
Variant List
var-01
var-02
var-03
var-04
var-05
var-06
var-07
var-08
var-09
Data Format
Each data file includes the following fields:
title: document title.
labels: list of relations; each entry links a head entity to a tail entity and may include supporting evidence sentences.
vertexSet: list of entity vertices, each representing all mentions of an entity and its type within the document.
sents: tokenized sentences.
original_doc_id: index of the example in the original seed dataset.
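As a rough illustration of how these fields fit together, the sketch below resolves each relation label to the names of its head and tail entities and joins the evidence sentences back into text. The nested key names (h, t, r, evidence inside labels; name, type, sent_id, pos inside vertexSet mentions) follow the DocRED convention and are assumed here rather than confirmed by this card. The record is a made-up toy example, and relation_triples is a hypothetical helper.

```python
# Toy record, invented for illustration. The nested key names follow the
# DocRED convention and are an assumption, not taken from this card.
record = {
    "title": "Example document",
    "sents": [
        ["Alice", "was", "born", "in", "Paris", "."],
        ["She", "later", "moved", "to", "Berlin", "."],
    ],
    "vertexSet": [
        [{"name": "Alice", "type": "PER", "sent_id": 0, "pos": [0, 1]}],
        [{"name": "Paris", "type": "LOC", "sent_id": 0, "pos": [4, 5]}],
    ],
    "labels": [{"h": 0, "t": 1, "r": "P19", "evidence": [0]}],
    "original_doc_id": 0,
}


def relation_triples(rec):
    """Return (head name, relation id, tail name, evidence sentences) tuples."""
    triples = []
    for label in rec["labels"]:
        # h and t index into vertexSet; the first mention supplies the name.
        head = rec["vertexSet"][label["h"]][0]["name"]
        tail = rec["vertexSet"][label["t"]][0]["name"]
        # Evidence is optional, so fall back to an empty list.
        evidence = [" ".join(rec["sents"][i]) for i in label.get("evidence", [])]
        triples.append((head, label["r"], tail, evidence))
    return triples


for triple in relation_triples(record):
    print(triple)
```

Using the first mention as the entity name is a simplification; an entity with several mentions may appear under different surface forms.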
Loading the Dataset
from datasets import load_dataset
dataset = load_dataset("amodaresi/Re-DocRED-CF", "var-01")
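If the files have been downloaded locally, they can also be read with the standard library alone. The sketch below assumes one JSON object per line (the usual JSON Lines layout) and groups records by original_doc_id, so that records derived from the same seed document can be inspected together. The helper names read_jsonl and group_by_seed are hypothetical and not part of the dataset's tooling.

```python
import json
from collections import defaultdict


def read_jsonl(path):
    """Read a JSON Lines file: one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def group_by_seed(records):
    """Map each original_doc_id to the records that carry that ID."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["original_doc_id"]].append(rec)
    return dict(groups)
```

For example, group_by_seed(read_jsonl("train_mix.jsonl")) would return a dictionary keyed by seed index, assuming a file of that name sits in the working directory.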
Citation
If you use this dataset, please cite the following paper:
@inproceedings{modarressi-covered-2024,
title={Consistent Document-Level Relation Extraction via Counterfactuals},
author={Ali Modarressi and Abdullatif K{\"o}ksal and Hinrich Sch{\"u}tze},
year={2024},
booktitle={Findings of the Association for Computational Linguistics: EMNLP 2024},
address={Miami, United States},
publisher={Association for Computational Linguistics}
}