
Shopping MMLU

Shopping MMLU is a large‑scale multi‑task online‑shopping benchmark dataset created by Amazon. It is designed to comprehensively evaluate large language models (LLMs) on multiple shopping‑related tasks. The dataset comprises 57 tasks covering four core shopping skills—concept understanding, knowledge reasoning, user‑behavior alignment, and multilingual capability—totaling 20,799 questions. It was constructed from authentic Amazon data and reformulated into text‑generation tasks to suit LLM solutions. Shopping MMLU is primarily intended for online‑shopping assistants, aiming to improve the shopping experience by reducing task‑specific engineering effort and enabling interactive user dialogues.

Updated 10/28/2024
arXiv

Description

Shopping MMLU Dataset Overview

Dataset Introduction

  • Name: Shopping MMLU
  • Description: An online‑shopping multi‑task benchmark for large language models (LLMs), covering four primary shopping skills: shopping concept understanding, shopping knowledge reasoning, user‑behavior alignment, and multilingual ability.
  • Venue: Accepted at the NeurIPS 2024 Datasets and Benchmarks Track; also used as the benchmark for the Amazon KDD Cup 2024.

Dataset Structure

  • Data Folder: data
  • Skill‑wise Evaluation Code: skill_wise_eval
  • Task‑wise Evaluation Code: task_wise_eval

Data Formats

  • File Formats: Although the 57 tasks span five task types, the data files come in only two formats:
    • Multiple‑choice: .csv files with three columns: question, choices, answer.
    • Other Tasks: .json files containing two fields: input_field and target_field.
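A minimal sketch of reading the two formats with the Python standard library. Only the column and field names come from the list above; the sample rows are invented for illustration:

```python
import csv
import io
import json

# Invented sample rows in the two formats described above.
MC_CSV = """question,choices,answer
Which accessory fits product X?,"['A. case', 'B. charger']",A
"""

TASK_JSON = '{"input_field": "Describe product X.", "target_field": "A compact case."}'

def load_multiple_choice(text):
    """Parse a multiple-choice .csv into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def load_generation_task(text):
    """Parse a single .json record with input_field / target_field."""
    return json.loads(text)

rows = load_multiple_choice(MC_CSV)
record = load_generation_task(TASK_JSON)
```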

Data Download

  • Download Method: Download the data.zip archive and unzip it into the data directory.
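Programmatically, the unzip step might look like this (a sketch; it assumes data.zip has already been downloaded to the working directory):

```python
import zipfile
from pathlib import Path

def extract_dataset(archive: str = "data.zip", dest: str = "data") -> list[str]:
    """Unzip the downloaded archive into the data directory and list its files."""
    Path(dest).mkdir(exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return sorted(p.name for p in Path(dest).iterdir())
```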

Evaluation Methods

Dependencies

  • Main Libraries:
    • transformers==4.37.0
    • torch==2.1.2+cu121
    • pandas==2.0.3
    • evaluate==0.4.1
    • sentence_transformers==2.2.2
    • rouge_score
    • sacrebleu
    • sacrebleu[jp]
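The pinned dependencies above can be installed in one setup step (a sketch; versions are copied from the list, and the +cu121 torch build is assumed to come from the PyTorch CUDA wheel index):

```shell
# torch +cu121 builds are served from the PyTorch CUDA wheel index
pip install torch==2.1.2+cu121 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.37.0 pandas==2.0.3 evaluate==0.4.1 \
    sentence_transformers==2.2.2 rouge_score sacrebleu "sacrebleu[jp]"
```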

Single‑Task Evaluation

  • Example: Evaluate the Vicuna‑7B‑v1.5 model on a multiple‑choice task (here, asin_compatibility).
    cd task_wise_eval/
    python3 hf_multi_choice.py --test_subject asin_compatibility --model_name vicuna2
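For multiple‑choice tasks, scoring reduces to exact‑match accuracy against the answer column. An illustrative scorer (a hypothetical helper, not the repository's actual evaluation code):

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted choice letter matches the gold answer."""
    pairs = list(zip(predictions, answers))
    # Normalize whitespace and case before comparing choice letters.
    correct = sum(p.strip().upper() == a.strip().upper() for p, a in pairs)
    return correct / len(pairs)
```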
    

Skill‑Level Evaluation

  • Example: Evaluate the Vicuna‑7B‑v1.5 model on the skill1_concept skill.
    cd skill_wise_eval/
    python3 hf_skill_inference.py --model_name vicuna2 --filename skill1_concept --output_filename <your_filename>
    python3 skill_evaluation.py --data_filename skill1_concept --output_filename vicuna2_<your_filename>
    


Topics

Online Shopping
Language Model Evaluation

