
Shopping MMLU

Shopping MMLU is a large‑scale multi‑task online‑shopping benchmark dataset created by Amazon. It is designed to comprehensively evaluate large language models (LLMs) on multiple shopping‑related tasks. The dataset comprises 57 tasks covering four core shopping skills—concept understanding, knowledge reasoning, user‑behavior alignment, and multilingual capability—totaling 20,799 questions. It was constructed from authentic Amazon data and reformulated into text‑generation tasks to suit LLM solutions. Shopping MMLU is primarily intended for online‑shopping assistants, aiming to improve the shopping experience by reducing task‑specific engineering effort and enabling interactive user dialogues.

Updated 10/28/2024
arXiv

Description

Shopping MMLU Dataset Overview

Dataset Introduction

  • Name: Shopping MMLU
  • Description: An online‑shopping multi‑task benchmark for large language models (LLMs), covering four primary shopping skills: shopping concept understanding, shopping knowledge reasoning, user‑behavior alignment, and multilingual ability.
  • Venue: Accepted at the NeurIPS 2024 Datasets and Benchmarks Track; also used as the benchmark for the Amazon KDD Cup 2024.

Dataset Structure

  • Data Folder: data
  • Skill‑wise Evaluation Code: skill_wise_eval
  • Task‑wise Evaluation Code: task_wise_eval

Data Formats

  • File Formats: Although the 57 tasks span five task types, the data files come in only two formats:
    • Multiple‑choice: .csv files with three columns: question, choices, answer.
    • Other Tasks: .json files containing two fields: input_field and target_field.
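A minimal sketch of reading the two formats with the Python standard library. Only the column and field names come from the list above; the sample rows are invented for illustration:

```python
import csv
import io
import json

# Invented sample rows in the two formats described above.
MC_CSV = """question,choices,answer
Which accessory fits product X?,"['A. case', 'B. charger']",A
"""

TASK_JSON = '{"input_field": "Describe product X.", "target_field": "A compact case."}'

def load_multiple_choice(text):
    """Parse a multiple-choice .csv into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))

def load_generation_task(text):
    """Parse a single .json record with input_field / target_field."""
    return json.loads(text)

rows = load_multiple_choice(MC_CSV)
record = load_generation_task(TASK_JSON)
```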

Data Download

  • Download Method: Download the data.zip archive and unzip it into the data directory.
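Programmatically, the unzip step might look like this (a sketch; it assumes data.zip has already been downloaded to the working directory):

```python
import zipfile
from pathlib import Path

def extract_dataset(archive: str = "data.zip", dest: str = "data") -> list[str]:
    """Unzip the downloaded archive into the data directory and list its files."""
    Path(dest).mkdir(exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        zf.extractall(dest)
    return sorted(p.name for p in Path(dest).iterdir())
```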

Evaluation Methods

Dependencies

  • Main Libraries:
    • transformers==4.37.0
    • torch==2.1.2+cu121
    • pandas==2.0.3
    • evaluate==0.4.1
    • sentence_transformers==2.2.2
    • rouge_score
    • sacrebleu
    • sacrebleu[jp]
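The pinned dependencies above can be installed in one setup step (a sketch; versions are copied from the list, and the +cu121 torch build is assumed to come from the PyTorch CUDA wheel index):

```shell
# torch +cu121 builds are served from the PyTorch CUDA wheel index
pip install torch==2.1.2+cu121 --index-url https://download.pytorch.org/whl/cu121
pip install transformers==4.37.0 pandas==2.0.3 evaluate==0.4.1 \
    sentence_transformers==2.2.2 rouge_score sacrebleu "sacrebleu[jp]"
```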

Single‑Task Evaluation

  • Example: Evaluate the Vicuna‑7B‑v1.5 model on a multiple‑choice task (here, asin_compatibility).
    cd task_wise_eval/
    python3 hf_multi_choice.py --test_subject asin_compatibility --model_name vicuna2
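For multiple‑choice tasks, scoring reduces to exact‑match accuracy against the answer column. An illustrative scorer (a hypothetical helper, not the repository's actual evaluation code):

```python
def accuracy(predictions, answers):
    """Fraction of questions where the predicted choice letter matches the gold answer."""
    pairs = list(zip(predictions, answers))
    # Normalize whitespace and case before comparing choice letters.
    correct = sum(p.strip().upper() == a.strip().upper() for p, a in pairs)
    return correct / len(pairs)
```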
    

Skill‑Level Evaluation

  • Example: Evaluate the Vicuna‑7B‑v1.5 model on the skill1_concept skill.
    cd skill_wise_eval/
    python3 hf_skill_inference.py --model_name vicuna2 --filename skill1_concept --output_filename <your_filename>
    python3 skill_evaluation.py --data_filename skill1_concept --output_filename vicuna2_<your_filename>
    


Topics

Online Shopping
Language Model Evaluation

