# Shopping MMLU
Shopping MMLU is a large‑scale multi‑task online‑shopping benchmark dataset created by Amazon. It is designed to comprehensively evaluate large language models (LLMs) on multiple shopping‑related tasks. The dataset comprises 57 tasks covering four core shopping skills—concept understanding, knowledge reasoning, user‑behavior alignment, and multilingual capability—totaling 20,799 questions. It was constructed from authentic Amazon data and reformulated into text‑generation tasks to suit LLM solutions. Shopping MMLU is primarily intended for online‑shopping assistants, aiming to improve the shopping experience by reducing task‑specific engineering effort and enabling interactive user dialogues.
## Dataset Introduction
- Name: Shopping MMLU
- Description: An online‑shopping multi‑task benchmark for large language models (LLMs), covering four primary shopping skills: shopping concept understanding, shopping knowledge reasoning, user‑behavior alignment, and multilingual ability.
- Venue: Accepted at the NeurIPS 2024 Datasets and Benchmarks Track; also used for the Amazon KDD Cup 2024.
## Dataset Structure

- Data folder: `data`
- Skill-wise evaluation code: `skill_wise_eval`
- Task-wise evaluation code: `task_wise_eval`
## Data Formats

- Task types: The tasks fall into five different task types, but on disk they use just two file formats:
  - Multiple-choice tasks: `.csv` files with three columns: `question`, `choices`, `answer`.
  - All other tasks: `.json` files with two fields: `input_field` and `target_field`.
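The two on-disk formats described above can be parsed with the Python standard library alone. A minimal, self-contained sketch (the file contents below are made-up illustrations, not real dataset rows):

```python
import csv
import io
import json

# Illustrative multiple-choice row in the .csv layout: question, choices, answer.
csv_text = (
    "question,choices,answer\n"
    "Which size fits a 6ft user?,\"['S', 'M', 'L']\",2\n"
)
mc_rows = list(csv.DictReader(io.StringIO(csv_text)))
question = mc_rows[0]["question"]
answer_idx = int(mc_rows[0]["answer"])  # index into the choices list

# Illustrative record in the .json layout used by all other tasks.
json_text = '{"input_field": "Describe this product.", "target_field": "A soft cotton tee."}'
record = json.loads(json_text)
prompt, target = record["input_field"], record["target_field"]
```

With the real files, `csv.DictReader(open(path))` and `json.load(open(path))` replace the in-memory strings.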
## Data Download

- Download Method: Download the `data.zip` archive and unzip it into the `data` directory.
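The unzip step can also be scripted. A minimal sketch using only the standard library; it builds a throwaway stand-in archive so the snippet runs anywhere — with the real `data.zip`, only the extraction lines are needed:

```python
import json
import tempfile
import zipfile
from pathlib import Path

workdir = Path(tempfile.mkdtemp())
archive = workdir / "data.zip"

# Stand-in archive so this sketch is self-contained; the real data.zip
# is the one downloaded per the instructions above.
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("sample_task.json", json.dumps({"input_field": "q", "target_field": "a"}))

# The actual step: unzip the archive into the data directory.
with zipfile.ZipFile(archive) as zf:
    zf.extractall(workdir / "data")

extracted = sorted(p.name for p in (workdir / "data").iterdir())
```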
## Evaluation Methods

### Dependencies

Main libraries:

- `transformers==4.37.0`
- `torch==2.1.2+cu121`
- `pandas==2.0.3`
- `evaluate==0.4.1`
- `sentence_transformers==2.2.2`
- `rouge_score`
- `sacrebleu`
- `sacrebleu[jp]`
### Single-Task Evaluation

Example: evaluate the Vicuna-7B-v1.5 model on the `asin_compatibility` task (multiple-choice format).

```shell
cd task_wise_eval/
python3 hf_multi_choice.py --test_subject asin_compatibility --model_name vicuna2
```
### Skill-Level Evaluation

Example: evaluate the Vicuna-7B-v1.5 model on the `skill1_concept` skill.

```shell
cd skill_wise_eval/
python3 hf_skill_inference.py --model_name vicuna2 --filename skill1_concept --output_filename <your_filename>
python3 skill_evaluation.py --data_filename skill1_concept --output_filename vicuna2_<your_filename>
```
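After inference, scoring depends on the task format; the repository's actual metric code lives in `skill_wise_eval` and `task_wise_eval`. As a hedged illustration only (not the repository's exact implementation), scoring a multiple-choice task reduces to comparing predicted choice indices against gold indices:

```python
def multiple_choice_accuracy(predictions, gold):
    """Fraction of questions whose predicted choice index matches the gold index."""
    assert len(predictions) == len(gold) and gold, "need equal-length, non-empty lists"
    correct = sum(int(p == g) for p, g in zip(predictions, gold))
    return correct / len(gold)

# Toy model outputs vs. gold answer indices (illustrative values only).
preds = [2, 0, 1, 1]
golds = [2, 0, 3, 1]
acc = multiple_choice_accuracy(preds, golds)  # 3 of 4 correct -> 0.75
```

Generation-style tasks would instead use text metrics such as ROUGE or sacreBLEU, consistent with the dependency list above.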
## References
- Paper: Detailed information can be found in the arXiv paper.
- KDD Cup Challenge: More details are available on the KDD Cup 2024 website.