irds/nfcorpus
The nfcorpus dataset is a text retrieval collection for medical information retrieval, consisting of 5,371 documents. Each document includes a document ID, URL, title, and abstract. The dataset was introduced by Vera Boteva et al. at the 2016 European Conference on Information Retrieval (ECIR), and its corpus is used by several related datasets such as `nfcorpus_dev` and `nfcorpus_test`.
Dataset description and usage context
Dataset Overview
Dataset Name
nfcorpus
Source
Provided by the ir-datasets package.
Content
- Data type: docs (documents, i.e., corpus)
- Number of documents: 5,371
Use Cases
Used in multiple related datasets, including:
- `nfcorpus_dev`
- `nfcorpus_dev_nontopic`
- `nfcorpus_dev_video`
- `nfcorpus_test`
- `nfcorpus_test_nontopic`
- `nfcorpus_test_video`
- `nfcorpus_train`
- `nfcorpus_train_nontopic`
- `nfcorpus_train_video`
Example Usage
```python
from datasets import load_dataset

# Load the document corpus ('docs') of irds/nfcorpus
docs = load_dataset('irds/nfcorpus', 'docs')
for record in docs:
    record  # {'doc_id': ..., 'url': ..., 'title': ..., 'abstract': ...}
```
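Once records are loaded, a common next step is looking documents up by ID or filtering them by keyword. The following is a minimal sketch that assumes only the record shape described above (`doc_id`, `url`, `title`, `abstract`). The helper names `build_doc_index` and `keyword_search` are hypothetical, and `SAMPLE_DOCS` holds made-up placeholder records, not real nfcorpus entries, so the example runs without downloading anything.

```python
from typing import Dict, Iterable, List

Record = Dict[str, str]

# Placeholder records in the documented shape; these are NOT real nfcorpus entries.
SAMPLE_DOCS: List[Record] = [
    {
        "doc_id": "D1",
        "url": "https://example.org/1",
        "title": "Dietary fiber and health",
        "abstract": "A placeholder abstract about fiber intake.",
    },
    {
        "doc_id": "D2",
        "url": "https://example.org/2",
        "title": "Vitamin D overview",
        "abstract": "A placeholder abstract about vitamin D.",
    },
]


def build_doc_index(records: Iterable[Record]) -> Dict[str, Record]:
    """Map each doc_id to its full record for direct lookup (hypothetical helper)."""
    return {record["doc_id"]: record for record in records}


def keyword_search(records: Iterable[Record], term: str) -> List[str]:
    """Return doc_ids whose title or abstract contains `term`, ignoring case."""
    needle = term.lower()
    return [
        record["doc_id"]
        for record in records
        if needle in record["title"].lower() or needle in record["abstract"].lower()
    ]


index = build_doc_index(SAMPLE_DOCS)
hits = keyword_search(SAMPLE_DOCS, "fiber")
```

With real data, the same helpers could be passed the records iterated from the loaded corpus in place of `SAMPLE_DOCS`. Substring matching is only a stand-in here; a ranking model such as BM25 would be the more usual choice for retrieval experiments.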
Citation
@inproceedings{Boteva2016Nfcorpus,
  title = "A Full-Text Learning to Rank Dataset for Medical Information Retrieval",
  author = "Vera Boteva and Demian Gholipour and Artem Sokolov and Stefan Riezler",
  booktitle = "Proceedings of the European Conference on Information Retrieval ({ECIR})",
  location = "Padova, Italy",
  publisher = "Springer",
  year = 2016
}