HIT-TMG/Hansel
# Dataset Card for "Hansel" ## Table of Contents - [Table of Contents](#table-of-contents) - [Dataset Description](#dataset-description) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Splits](#data-splits) - [Citation](#citation) ## Dataset Description - **Homepage:** https://github.com/HITsz-TMG/Hansel - **Paper:** https://arxiv.org/abs/2207.13005 Hansel is a high‑quality human‑annotated Chinese entity linking (EL) dataset, focusing on tail entities and emerging entities: - The test set contains Few‑shot (FS) and zero‑shot (ZS) slices, has 10K examples and uses Wikidata as the corresponding knowledge base. - The training and validation sets are from Wikipedia hyperlinks, useful for large‑scale pretraining of Chinese EL systems. Please see our [WSDM 2023](https://www.wsdm-conference.org/2023/) paper **"Hansel: A Chinese Few‑Shot and Zero‑Shot Entity Linking Benchmark"** to learn more about our dataset. For models in the paper and our processed knowledge base, please see our [Github repository](https://github.com/HITsz-TMG/Hansel). ## Dataset Structure ### Data Instances {"id": "hansel-eval-zs-1463", "text": "1905电影网讯 已经筹备了十余年的吉尔莫·德尔·托罗的《匹诺曹》,在上个月顺利被网飞公司买下,成为了流媒体巨头旗下的新片。近日,这部备受关注的影片确定了自己的档期:2021年。虽然具体时间未定,但影片却已经实实在在地向前迈出了一步。", "start": 29, "end": 32, "mention": "匹诺曹", "gold_id": "Q73895818", "source": "https://www.1905.com/news/20181107/1325389.shtml", "domain": "news" } ### Data Splits | | # Mentions | # Entities | Domain | | ---- | ---- | ---- | ---- | | Train | 9,879,813 | 541,058 | Wikipedia | | Validation | 9,674 | 6,320 | Wikipedia | | Hansel-FS | 5,260 | 2,720 | News, Social Media | | Hansel-ZS | 4,715 | 4,046 | News, Social Media, E‑books, etc.| ## Citation If you find our dataset useful, please cite us. ```bibtex @inproceedings{xu2022hansel, author = {Xu, Zhenran and Shan, Zifei and Li, Yuxin and Hu, Baotian and Qin, Bing}, title = {Hansel: A Chinese Few‑Shot and Zero‑Shot Entity Linking Benchmark}, year = {2023}, publisher = {Association for Computing Machinery}, url = {https://doi.org/10.1145/3539597.3570418}, booktitle = {Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining}, pages = {832–840} } ```
Description
Dataset Overview: Hansel
Dataset Description
- Name: Hansel
- Language: Chinese
- License: cc‑by‑sa‑4.0
- Task Category: text‑retrieval
- Task ID: entity‑linking‑retrieval
- Size: 1M<n<10M and 1K<n<10K
- Source: original
- Characteristics:
- Focuses on tail entities and emerging entities
- Test set includes Few‑shot (FS) and zero‑shot (ZS) slices, totaling 10K examples, using Wikidata as the knowledge base
- Training and validation sets are derived from Wikipedia hyperlinks
Dataset Structure
Data Instances
- Fields: id, text, start, end, mention, gold_id, source, domain
- Example:
{ "id": "hansel-eval-zs-1463", "text": "1905电影网讯 已经筹备了十余年的吉尔莫·德尔·托罗的《匹诺曹》,在上个月顺利被网飞公司买下,成为了流媒体巨头旗下的新片。近日,这部备受关注的影片确定了自己的档期:2021年。虽然具体时间未定,但影片却已经实实在在地向前迈出了一步。", "start": 29, "end": 32, "mention": "匹诺曹", "gold_id": "Q73895818", "source": "https://www.1905.com/news/20181107/1325389.shtml", "domain": "news" }
Data Splits
| Split | Mentions | Entities | Domain |
|---|---|---|---|
| Train | 9,879,813 | 541,058 | Wikipedia |
| Validation | 9,674 | 6,320 | Wikipedia |
| Hansel‑FS | 5,260 | 2,720 | News, Social Media |
| Hansel‑ZS | 4,715 | 4,046 | News, Social Media, E‑books, etc. |
Citation Information
@inproceedings{xu2022hansel,
author = {Xu, Zhenran and Shan, Zifei and Li, Yuxin and Hu, Baotian and Qin, Bing},
title = {Hansel: A Chinese Few‑Shot and Zero‑Shot Entity Linking Benchmark},
year = {2023},
publisher = {Association for Computing Machinery},
url = {https://doi.org/10.1145/3539597.3570418},
booktitle = {Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining},
pages = {832–840}
}
AI studio
Generate PPTs instantly with Nano Banana Pro.
Generate PPT NowAccess Dataset
Please login to view download links and access full dataset details.
Topics
Source
Organization: hugging_face
Created: Unknown
Power Your Data Analysis with Premium AI Models
Supporting GPT-5, Claude-4, DeepSeek v3, Gemini and more.
Enjoy a free trial and save 20%+ compared to official pricing.