High Quality Data

Dataset Hub

Explore high-quality datasets for your AI and machine learning projects.

wangrongsheng/RerankerLLM-Dataset

This dataset supports training and evaluating re-ranking large language models (LLMs). It contains queries sampled from the MS MARCO dataset together with the rankings ChatGPT predicted for them. The files include 10K- and 100K-scale query sets and the corresponding ChatGPT predictions.

hugging_face

irds/msmarco-document-v2_trec-dl-2019

Information Retrieval

Deep Learning

The msmarco-document-v2/trec-dl-2019 dataset, provided through the ir-datasets package, targets document retrieval. It contains 200 queries and 13,940 relevance judgments (qrels) for evaluating document retrieval systems. The data can be loaded and processed in Python with Hugging Face's datasets library.
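As a rough illustration of that usage, the sketch below wraps the loading call and groups relevance judgments by query. The configuration names (`"queries"`, `"qrels"`) and the row fields `query_id`, `doc_id`, and `relevance` are assumptions based on how ir-datasets usually lays out its data, and `load_split` and `group_qrels` are hypothetical helper names; check the dataset card before relying on any of them. Depending on the installed version of `datasets`, script-based datasets may also need extra options.

```python
from collections import defaultdict


def load_split(config):
    """Load one configuration from the Hugging Face Hub (needs network access).

    `config` is assumed to be a name such as "queries" or "qrels".
    """
    # Imported lazily so group_qrels below works without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("irds/msmarco-document-v2_trec-dl-2019", config)


def group_qrels(qrel_rows):
    """Map each query_id to a {doc_id: relevance} dictionary."""
    judged = defaultdict(dict)
    for row in qrel_rows:
        # Field names are assumptions; adjust them to the actual schema.
        judged[row["query_id"]][row["doc_id"]] = row["relevance"]
    return dict(judged)


# Hand-made rows standing in for real qrels (illustrative values only).
sample_rows = [
    {"query_id": "q1", "doc_id": "d1", "relevance": 2},
    {"query_id": "q1", "doc_id": "d2", "relevance": 0},
    {"query_id": "q2", "doc_id": "d3", "relevance": 1},
]
grouped = group_qrels(sample_rows)
```

With the real data, the rows returned by `load_split("qrels")` would be passed to `group_qrels` in place of `sample_rows`.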

hugging_face

irds/nfcorpus

Natural Language Processing

Information Retrieval

The nfcorpus dataset is a medical information retrieval collection of 5,371 documents. Each document has a document ID, URL, title, and abstract. The dataset was introduced by Vera Boteva et al. at the 2016 European Conference on Information Retrieval, and its documents are used by related sets such as `nfcorpus_dev` and `nfcorpus_test`.
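The four-field document layout can be modeled as a small record type. This is a minimal sketch: the class name `NfCorpusDoc`, the field names `doc_id`, `url`, `title`, and `abstract`, and the helper `searchable_text` are assumptions chosen to mirror the description above, and the example values are invented placeholders rather than real corpus entries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NfCorpusDoc:
    """One document record (field names are assumed, not confirmed)."""

    doc_id: str
    url: str
    title: str
    abstract: str


def searchable_text(doc):
    """Join title and abstract into the text a retriever might index."""
    return f"{doc.title}\n{doc.abstract}".strip()


# Placeholder values for illustration only; not taken from the corpus.
example = NfCorpusDoc(
    doc_id="DOC-0",
    url="https://example.org/doc-0",
    title="Hypothetical title",
    abstract="Hypothetical abstract text.",
)
text = searchable_text(example)
```

Indexing the title together with the abstract is one common choice for short documents like these; a retriever could just as well index the two fields separately.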

hugging_face

THUIR/T2Ranking

Information Retrieval

Passage Ranking

T2Ranking is a large-scale Chinese passage ranking benchmark containing over 300K queries and more than 2M unique passages, sourced from real-world search engines. It focuses on Chinese search scenarios and provides extensive fine-grained relevance annotations. Passages are retrieved from multiple commercial search engines and complete annotations are provided, which mitigates the false-negative problem; several further strategies are used to ensure dataset quality.

hugging_face
