Dataset asset, Open Source Community, Natural Language Processing, Information Retrieval

irds/nfcorpus

The nfcorpus dataset is a text retrieval collection for medical information retrieval, consisting of 5,371 documents. Each document includes a document ID, URL, title, and abstract. The dataset was introduced by Vera Boteva et al. at the 2016 European Conference on Information Retrieval (ECIR). Its document corpus is shared by several related datasets, such as `nfcorpus_dev` and `nfcorpus_test`.

Source: hugging_face
Created: Nov 28, 2025
Updated: Jan 5, 2023
Signals: 332 views
Availability: Linked source ready

Dataset Overview

Dataset Name

nfcorpus

Source

Provided by the ir-datasets package.

Content

  • Data type: docs (documents, i.e., corpus)
  • Number of documents: 5,371

Use Cases

Used in multiple related datasets, including:

  • nfcorpus_dev
  • nfcorpus_dev_nontopic
  • nfcorpus_dev_video
  • nfcorpus_test
  • nfcorpus_test_nontopic
  • nfcorpus_test_video
  • nfcorpus_train
  • nfcorpus_train_nontopic
  • nfcorpus_train_video

Example Usage

```python
from datasets import load_dataset

# Load the 'docs' configuration of the irds/nfcorpus dataset
docs = load_dataset('irds/nfcorpus', 'docs')
for record in docs:
    record  # {'doc_id': ..., 'url': ..., 'title': ..., 'abstract': ...}
```
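Once records are available, a common next step is to index them by document ID so that retrieval results can be mapped back to their text. The following is a minimal sketch, assuming each record is a dict with the four fields named in the example above (`doc_id`, `url`, `title`, `abstract`). The helper name `build_doc_index` and the sample rows are hypothetical placeholders, not part of the dataset or of any library.

```python
from typing import Dict, Iterable, Mapping


def build_doc_index(records: Iterable[Mapping[str, str]]) -> Dict[str, str]:
    """Map each doc_id to one searchable string (title plus abstract)."""
    index: Dict[str, str] = {}
    for record in records:
        # Join title and abstract; strip() handles an empty field on either side.
        text = f"{record['title']} {record['abstract']}".strip()
        index[record["doc_id"]] = text
    return index


# Hypothetical rows in the same shape as the dataset records.
# These are illustrative placeholders, not actual nfcorpus content.
sample_records = [
    {"doc_id": "D1", "url": "http://example.org/1",
     "title": "First title", "abstract": "First abstract."},
    {"doc_id": "D2", "url": "http://example.org/2",
     "title": "Second title", "abstract": ""},
]

doc_index = build_doc_index(sample_records)
```

In practice, `sample_records` would be replaced by the rows obtained from the loading call. Note that `load_dataset` called without a `split` argument generally returns a dictionary of splits, so a split may need to be selected before iterating over rows.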

Citation

@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}
