DFKI-SLT/conll04

The CoNLL04 dataset is a benchmark for relation extraction, containing 1,437 sentences, each with at least one relation. Sentences are annotated with typed entities (e.g., `Peop`, `Loc`, `Org`, `Other`) and typed relations between them (e.g., `Located_In`, `Work_For`, `OrgBased_In`, `Live_In`, `Kill`). The dataset is in English and formatted as JSONL.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jun 7, 2024

Dataset Overview

Dataset Name: CoNLL04

Purpose: Relation extraction task

Language: English

Size: 1,437 sentences according to the dataset description, each containing at least one relation. The three splits listed under Splits sum to 1,441 samples.

Data Structure

Fields

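  • tokens: The tokenized sentence, list of strings.
  • entities: List of entities, each with:
    • type: Entity type, string.
    • start: Index of the entity's first token in tokens, integer.
    • end: End index of the entity span in tokens, integer.
  • relations: List of relations, each with:
    • type: Relation type, string.
    • head: Index of the head entity in entities, integer.
    • tail: Index of the tail entity in entities, integer.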
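
To make the field layout concrete, here is a minimal sketch that resolves entity spans and relation indices for one record. The sample record is hand-written for illustration and is not an actual dataset row. The sketch assumes that `tokens` is a list of strings, that `start` and `end` are token offsets with `end` exclusive (the convention used by the SpERT preprocessing cited below), and that `head` and `tail` index into `entities`. These assumptions should be checked against real rows before relying on them.

```python
from typing import Dict, List, Tuple

# Hypothetical record in the documented shape (not an actual dataset row).
record = {
    "tokens": ["John", "Smith", "works", "for", "Acme", "Corp", "in", "Boston", "."],
    "entities": [
        {"type": "Peop", "start": 0, "end": 2},
        {"type": "Org", "start": 4, "end": 6},
        {"type": "Loc", "start": 7, "end": 8},
    ],
    "relations": [
        {"type": "Work_For", "head": 0, "tail": 1},
        {"type": "OrgBased_In", "head": 1, "tail": 2},
    ],
}


def entity_text(rec: Dict, index: int) -> str:
    """Join the tokens covered by entity `index` (assumes `end` is exclusive)."""
    ent = rec["entities"][index]
    return " ".join(rec["tokens"][ent["start"]:ent["end"]])


def relation_triples(rec: Dict) -> List[Tuple[str, str, str]]:
    """Return (head text, relation type, tail text) for every relation."""
    return [
        (entity_text(rec, rel["head"]), rel["type"], entity_text(rec, rel["tail"]))
        for rel in rec["relations"]
    ]


for head, rel_type, tail in relation_triples(record):
    print(f"{head} --{rel_type}--> {tail}")
```

For the sample record this prints `John Smith --Work_For--> Acme Corp` and `Acme Corp --OrgBased_In--> Boston`. If `end` turns out to be inclusive in the actual data, the slice in `entity_text` would need `ent["end"] + 1` instead.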

Splits

  • Training (train): 922 samples, 358,752 bytes.
  • Validation (validation): 231 samples, 94,688 bytes.
  • Test (test): 288 samples, 114,248 bytes.
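
As a quick sanity check on the figures above, the following sketch totals the split sizes and prints each split's share of the samples. It uses only the numbers listed in this section. The totals come to 1,441 samples, slightly more than the 1,437 sentences quoted in the overview; the card does not explain the difference.

```python
# Split sizes as listed above: (samples, bytes).
splits = {
    "train": (922, 358_752),
    "validation": (231, 94_688),
    "test": (288, 114_248),
}

total_samples = sum(n for n, _ in splits.values())
total_bytes = sum(b for _, b in splits.values())

for name, (n, size) in splits.items():
    share = 100 * n / total_samples
    print(f"{name:<10} {n:>4} samples  {share:5.1f}%  {size / n:6.1f} bytes/sample")

print(f"total      {total_samples} samples, {total_bytes} bytes")
```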

Configuration

  • Default:
    • Train path: data/train-*
    • Validation path: data/validation-*
    • Test path: data/test-*
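
A minimal loading sketch, assuming the Hugging Face `datasets` library is installed and the Hub is reachable. The repository ID and glob patterns are taken from this page; the constant names and the `load_conll04` helper are hypothetical and introduced only for illustration.

```python
from typing import Dict

REPO_ID = "DFKI-SLT/conll04"

# Glob patterns from the default configuration, keyed by split name.
DATA_FILES: Dict[str, str] = {
    "train": "data/train-*",
    "validation": "data/validation-*",
    "test": "data/test-*",
}


def load_conll04():
    """Load all three splits from the Hub (needs `datasets` and network access)."""
    # Imported lazily so this module can still be imported offline.
    from datasets import load_dataset

    # Passing data_files explicitly mirrors the default configuration;
    # omitting it should resolve to the same files.
    return load_dataset(REPO_ID, data_files=DATA_FILES)
```

Calling `ds = load_conll04()` should return a dictionary-like object keyed by split name, so `ds["train"][0]` would be one record with the fields described under Data Structure.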

Citation

BibTeX:

@inproceedings{roth-yih-2004-linear,
    title = "A Linear Programming Formulation for Global Inference in Natural Language Tasks",
    author = "Roth, Dan  and
      Yih, Wen-tau",
    booktitle = "Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004",
    month = may # " 6 - " # may # " 7",
    year = "2004",
    address = "Boston, Massachusetts, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W04-2401",
    pages = "1--8",
}
@article{eberts-ulges2019spert,
  author       = {Markus Eberts and
                  Adrian Ulges},
  title        = {Span-based Joint Entity and Relation Extraction with Transformer Pre-training},
  journal      = {CoRR},
  volume       = {abs/1909.07755},
  year         = {2019},
  url          = {http://arxiv.org/abs/1909.07755},
  eprinttype    = {arXiv},
  eprint       = {1909.07755},
  timestamp    = {Mon, 23 Sep 2019 18:07:15 +0200},
  biburl       = {https://dblp.org/rec/journals/corr/abs-1909-07755.bib},
  bibsource    = {dblp computer science bibliography, https://dblp.org}
}

APA:

  • Roth, D., & Yih, W. (2004). A linear programming formulation for global inference in natural language tasks. In Proceedings of the Eighth Conference on Computational Natural Language Learning (CoNLL-2004) at HLT-NAACL 2004 (pp. 1–8). Boston, MA, USA: Association for Computational Linguistics. https://aclanthology.org/W04-2401
  • Eberts, M., & Ulges, A. (2019). Span-based joint entity and relation extraction with transformer pre-training. CoRR, abs/1909.07755. http://arxiv.org/abs/1909.07755