THUIR/T2Ranking

T2Ranking is a large-scale Chinese passage ranking benchmark containing over 300K queries and more than 2M unique passages sourced from real-world search engines. It targets Chinese search scenarios and provides extensive fine-grained relevance annotations. Passages are retrieved from multiple commercial search engines and all of them are annotated, which mitigates the false-negative problem, and several quality-control strategies are applied to keep the dataset quality high.

Updated 3/6/2025
Description

Dataset Overview

Dataset Name

T2Ranking

Dataset Description

T2Ranking is a large-scale Chinese passage ranking benchmark containing over 300,000 queries and more than 2,000,000 unique passages derived from real search engine logs. It targets Chinese search scenarios and is intended to support the development of deep learning methods and accurate ranking algorithms.

Dataset Characteristics

  • Language: Chinese
  • Scale: Between 1M and 10M records
  • Content: Four-level relevance judgments, which allow fine-grained analysis of the relationship between queries and passages
  • Source: User logs of the Sogou search engine, with passages segmented and de-duplicated using model-based methods
  • Advantages: Larger in scale and richer in relevance annotation than existing Chinese passage ranking datasets
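
Four-level relevance judgments lend themselves to graded metrics such as NDCG. The following is a minimal sketch, assuming the four levels are encoded as integers 0 to 3 (an assumption, the encoding is not stated on this page) and using the common exponential-gain formulation. The function names are illustrative and not part of the dataset's tooling.

```python
import math
from typing import Sequence


def dcg_at_k(gains: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the first k graded labels."""
    # Exponential gain (2^label - 1), discounted by log2 of the 1-based rank + 1.
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_labels: Sequence[int], k: int) -> float:
    """NDCG@k for one query.

    ranked_labels holds the relevance label (assumed 0..3) of each judged
    passage, in the order the ranking model returned them. The ideal ranking
    is built from these same labels, so pass every judged passage for the query.
    """
    ideal_dcg = dcg_at_k(sorted(ranked_labels, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant passage among the judged ones
    return dcg_at_k(ranked_labels, k) / ideal_dcg


# A ranking that places the best passage second scores below a perfect one.
perfect = ndcg_at_k([3, 2, 1, 0], k=4)
swapped = ndcg_at_k([0, 3], k=2)
```

With binary labels the two levels collapse to gains of 0 and 1; the graded labels are what let the metric distinguish a highly relevant passage from a marginally relevant one.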

Dataset Files

  • Collection: collection.tsv (2,303,643 records)
  • Queries: queries.train.tsv (258,042 records), queries.dev.tsv (24,832 records), queries.test.tsv (24,832 records)
  • Relevance: qrels.train.tsv (1,613,421 records), qrels.dev.tsv (400,536 records), qrels.retrieval.train.tsv (744,663 records), qrels.retrieval.dev.tsv (118,933 records)
  • Negative Samples: train.bm25.tsv (200,359,731 records), train.mined.tsv (200,376,001 records)
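
Since all files are tab-separated, they can be read with the Python standard library. The sketch below assumes each file starts with a header line; the column names `qid`, `pid`, and `rel` are hypothetical placeholders used for illustration, so check the actual header of each file before relying on them.

```python
import csv
import io
from typing import Dict, Iterator, TextIO


def read_tsv(handle: TextIO) -> Iterator[Dict[str, str]]:
    """Yield one dict per row, keyed by the header line of the TSV file."""
    # QUOTE_NONE keeps quote characters inside passage text as literal text.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


# Tiny in-memory stand-in for a relevance file; the column names are assumptions.
sample = io.StringIO("qid\tpid\trel\n0\t17\t3\n0\t42\t1\n1\t17\t0\n")

# Group relevance labels by query: {qid: {pid: label}}.
labels_by_query: Dict[str, Dict[str, int]] = {}
for row in read_tsv(sample):
    labels_by_query.setdefault(row["qid"], {})[row["pid"]] = int(row["rel"])
```

For a file on disk, replace the `io.StringIO` object with `open(path, encoding="utf-8", newline="")`. Because the function yields rows lazily, it also suits the negative-sample files, which are too large to load into memory comfortably.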

Dataset Download

The dataset can be downloaded with the following commands:

git lfs install
git clone https://huggingface.co/datasets/THUIR/T2Ranking

License

The dataset is released under the Apache License 2.0.

Citation

If you use this dataset in your research, please cite the following paper:

@misc{xie2023t2ranking,
      title={T2Ranking: A large-scale Chinese Benchmark for Passage Ranking}, 
      author={Xiaohui Xie and Qian Dong and Bingning Wang and Feiyang Lv and Ting Yao and Weinan Gan and Zhijing Wu and Xiangsheng Li and Haitao Li and Yiqun Liu and Jin Ma},
      year={2023},
      eprint={2304.03679},
      archivePrefix={arXiv},
      primaryClass={cs.IR}
}

Topics

Information Retrieval
Passage Ranking

Source

Organization: hugging_face
