JUHE API Marketplace
DATASET
Open Source Community

HIT-TMG/Hansel

# Dataset Card for "Hansel" ## Table of Contents - [Table of Contents](#table-of-contents) - [Dataset Description](#dataset-description) - [Dataset Structure](#dataset-structure) - [Data Instances](#data-instances) - [Data Splits](#data-splits) - [Citation](#citation) ## Dataset Description - **Homepage:** https://github.com/HITsz-TMG/Hansel - **Paper:** https://arxiv.org/abs/2207.13005 Hansel is a high‑quality human‑annotated Chinese entity linking (EL) dataset, focusing on tail entities and emerging entities: - The test set contains Few‑shot (FS) and zero‑shot (ZS) slices, has 10K examples and uses Wikidata as the corresponding knowledge base. - The training and validation sets are from Wikipedia hyperlinks, useful for large‑scale pretraining of Chinese EL systems. Please see our [WSDM 2023](https://www.wsdm-conference.org/2023/) paper **"Hansel: A Chinese Few‑Shot and Zero‑Shot Entity Linking Benchmark"** to learn more about our dataset. For models in the paper and our processed knowledge base, please see our [Github repository](https://github.com/HITsz-TMG/Hansel). ## Dataset Structure ### Data Instances {"id": "hansel-eval-zs-1463", "text": "1905电影网讯 已经筹备了十余年的吉尔莫·德尔·托罗的《匹诺曹》,在上个月顺利被网飞公司买下,成为了流媒体巨头旗下的新片。近日,这部备受关注的影片确定了自己的档期:2021年。虽然具体时间未定,但影片却已经实实在在地向前迈出了一步。", "start": 29, "end": 32, "mention": "匹诺曹", "gold_id": "Q73895818", "source": "https://www.1905.com/news/20181107/1325389.shtml", "domain": "news" } ### Data Splits | | # Mentions | # Entities | Domain | | ---- | ---- | ---- | ---- | | Train | 9,879,813 | 541,058 | Wikipedia | | Validation | 9,674 | 6,320 | Wikipedia | | Hansel-FS | 5,260 | 2,720 | News, Social Media | | Hansel-ZS | 4,715 | 4,046 | News, Social Media, E‑books, etc.| ## Citation If you find our dataset useful, please cite us. ```bibtex @inproceedings{xu2022hansel, author = {Xu, Zhenran and Shan, Zifei and Li, Yuxin and Hu, Baotian and Qin, Bing}, title = {Hansel: A Chinese Few‑Shot and Zero‑Shot Entity Linking Benchmark}, year = {2023}, publisher = {Association for Computing Machinery}, url = {https://doi.org/10.1145/3539597.3570418}, booktitle = {Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining}, pages = {832–840} } ```

Updated 3/13/2023
hugging_face

Description

Dataset Overview: Hansel

Dataset Description

  • Name: Hansel
  • Language: Chinese
  • License: cc‑by‑sa‑4.0
  • Task Category: text‑retrieval
  • Task ID: entity‑linking‑retrieval
  • Size: 1M<n<10M and 1K<n<10K
  • Source: original
  • Characteristics:
    • Focuses on tail entities and emerging entities
    • Test set includes Few‑shot (FS) and zero‑shot (ZS) slices, totaling 10K examples, using Wikidata as the knowledge base
    • Training and validation sets are derived from Wikipedia hyperlinks

Dataset Structure

Data Instances

  • Fields: id, text, start, end, mention, gold_id, source, domain
  • Example:
    {
      "id": "hansel-eval-zs-1463",
      "text": "1905电影网讯 已经筹备了十余年的吉尔莫·德尔·托罗的《匹诺曹》,在上个月顺利被网飞公司买下,成为了流媒体巨头旗下的新片。近日,这部备受关注的影片确定了自己的档期:2021年。虽然具体时间未定,但影片却已经实实在在地向前迈出了一步。",
      "start": 29,
      "end": 32,
      "mention": "匹诺曹",
      "gold_id": "Q73895818",
      "source": "https://www.1905.com/news/20181107/1325389.shtml",
      "domain": "news"
    }
    

Data Splits

SplitMentionsEntitiesDomain
Train9,879,813541,058Wikipedia
Validation9,6746,320Wikipedia
Hansel‑FS5,2602,720News, Social Media
Hansel‑ZS4,7154,046News, Social Media, E‑books, etc.

Citation Information

@inproceedings{xu2022hansel,
  author = {Xu, Zhenran and Shan, Zifei and Li, Yuxin and Hu, Baotian and Qin, Bing},
  title = {Hansel: A Chinese Few‑Shot and Zero‑Shot Entity Linking Benchmark},
  year = {2023},
  publisher = {Association for Computing Machinery},
  url = {https://doi.org/10.1145/3539597.3570418},
  booktitle = {Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining},
  pages = {832–840}
}

AI studio

Generate PPTs instantly with Nano Banana Pro.

Generate PPT Now

Access Dataset

Login to Access

Please login to view download links and access full dataset details.

Topics

Natural Language Processing
Entity Linking

Source

Organization: hugging_face

Created: Unknown

Power Your Data Analysis with Premium AI Models

Supporting GPT-5, Claude-4, DeepSeek v3, Gemini and more.

Enjoy a free trial and save 20%+ compared to official pricing.