THUIR/T2Ranking
T2Ranking is a large-scale Chinese passage ranking benchmark containing over 300K queries and more than 2M unique passages sourced from real-world search engines. It focuses on Chinese search scenarios and provides extensive fine-grained relevance annotations. Because passages are retrieved from multiple commercial search engines and annotated completely, the false-negative problem is mitigated, and several quality-control strategies are applied to keep the dataset reliable.
Description
Dataset Overview
Dataset Name
T2Ranking
Dataset Description
T2Ranking is a large-scale Chinese passage ranking benchmark containing over 300,000 queries and more than 2,000,000 unique passages, derived from real search engine logs. It focuses on Chinese search scenarios and is intended to support the development of deep learning models for accurate passage ranking.
Dataset Characteristics
- Language: Chinese
- Scale: Between 1M and 10M records (size category)
- Content: Includes detailed four‑level relevance judgments, helping to explore fine‑grained relationships between queries and passages
- Source: Data originates from user logs of the Sogou search engine; passages are segmented and de-duplicated using model-based methods
- Advantages: Compared with existing Chinese passage ranking datasets, it has clear advantages in scale and relevance annotation
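The four-level relevance judgments listed above lend themselves to graded metrics such as NDCG. The following is a minimal sketch, not the dataset authors' evaluation code. It assumes labels are integers from 0 (least relevant) to 3 (most relevant), and the helper names (`dcg`, `ndcg`, `binarize`) and the binarization threshold of 2 are hypothetical choices made for illustration.

```python
import math
from typing import Sequence


def dcg(gains: Sequence[int], k: int) -> float:
    """Discounted cumulative gain over the top-k graded labels."""
    return sum((2 ** g - 1) / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))


def ndcg(ranked_labels: Sequence[int], k: int = 10) -> float:
    """NDCG@k for one query.

    ranked_labels holds the relevance labels (assumed 0..3) of the passages
    in the order a system ranked them. The ideal ranking is computed from
    these same labels, so unretrieved judged passages are not counted.
    """
    ideal = dcg(sorted(ranked_labels, reverse=True), k)
    return dcg(ranked_labels, k) / ideal if ideal > 0 else 0.0


def binarize(label: int, threshold: int = 2) -> int:
    """Collapse a graded label to 0/1; the threshold is an assumption."""
    return 1 if label >= threshold else 0
```

A perfectly ordered list such as `ndcg([3, 2, 1, 0])` scores 1.0, while placing a highly relevant passage below an irrelevant one lowers the score. Binarizing is only needed for metrics that expect binary relevance, such as recall or MRR.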
Dataset Files
- Collection: collection.tsv (2,303,643 records)
- Queries: queries.train.tsv (258,042 records), queries.dev.tsv (24,832 records), queries.test.tsv (24,832 records)
- Relevance: qrels.train.tsv (1,613,421 records), qrels.dev.tsv (400,536 records), qrels.retrieval.train.tsv (744,663 records), qrels.retrieval.dev.tsv (118,933 records)
- Negative Samples: train.bm25.tsv (200,359,731 records), train.mined.tsv (200,376,001 records)
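The files listed above are tab-separated, so they can be read with the Python standard library. The sketch below is illustrative only and makes assumptions that should be checked against the downloaded files: that each file starts with a header row, that the collection and query files have two columns (id, text), and that the qrels files have three columns (query id, passage id, relevance label). The function names `read_id_text` and `read_qrels` are hypothetical.

```python
import csv
import io
from collections import defaultdict
from typing import Dict, TextIO


def read_id_text(handle: TextIO) -> Dict[str, str]:
    """Read a two-column TSV (id, text) into a dict keyed by id."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the assumed header row
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def read_qrels(handle: TextIO) -> Dict[str, Dict[str, int]]:
    """Read a three-column TSV (qid, pid, label) into {qid: {pid: label}}."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    next(reader, None)  # skip the assumed header row
    qrels: Dict[str, Dict[str, int]] = defaultdict(dict)
    for row in reader:
        if len(row) >= 3:
            qid, pid, label = row[0], row[1], row[2]
            qrels[qid][pid] = int(label)
    return dict(qrels)
```

With a local copy of the data, the functions can be applied to files opened with `open(path, encoding="utf-8", newline="")`. The negative-sample files are roughly 200M records each, so they are better streamed row by row than loaded into memory.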
Dataset Download
The dataset can be downloaded with the following commands:
git lfs install
git clone https://huggingface.co/datasets/THUIR/T2Ranking
License
The dataset is released under the Apache License 2.0.
Citation
If you use this dataset in research, please cite the relevant paper:
@misc{xie2023t2ranking,
title={T2Ranking: A large-scale Chinese Benchmark for Passage Ranking},
author={Xiaohui Xie and Qian Dong and Bingning Wang and Feiyang Lv and Ting Yao and Weinan Gan and Zhijing Wu and Xiangsheng Li and Haitao Li and Yiqun Liu and Jin Ma},
year={2023},
eprint={2304.03679},
archivePrefix={arXiv},
primaryClass={cs.IR}
}
Source
Organization: hugging_face