ActPlan-1K
ActPlan‑1K is a multimodal planning benchmark created jointly by the Hong Kong University of Science and Technology and the University of California, San Diego. It evaluates the procedural planning abilities of vision‑language models in household activities. The dataset contains 153 activities and 1,187 instances, each pairing a natural‑language task description with multiple environment images rendered in the iGibson2 simulator. It was built by combining ChatGPT with iGibson2: BDDL activity definitions were converted into natural‑language descriptions, and matching environment images were collected. ActPlan‑1K is used primarily to assess the program planning capabilities of vision‑language models on multimodal tasks, especially household activities and counterfactual scenarios.
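The structure described above (a task description plus environment images and a gold plan per instance) can be sketched as a small data class. The field names here are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActPlanInstance:
    """One of the 1,187 ActPlan-1K instances (field names are illustrative)."""
    activity: str                          # one of the 153 household activities
    description: str                       # natural-language task description
    image_paths: List[str] = field(default_factory=list)  # iGibson2 renders
    gold_plan: List[str] = field(default_factory=list)    # gold program plan, one step per entry
    counterfactual: bool = False           # normal vs. counterfactual variant

# Hypothetical instance for illustration only:
instance = ActPlanInstance(
    activity="assembling_gift_baskets",
    description="Assemble gift baskets using the items on the table.",
    image_paths=["scene_0.png", "scene_1.png"],
    gold_plan=["find basket", "place candle in basket"],
)
```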
Description
ActPlan‑1K Dataset Overview
Dataset Definition
- Base Source: Defined in the BDDL language, extending the Behavior100 activity set.
- Definition Process:
- Translate activity descriptions from Behavior100 into natural language.
- Use ChatGPT to generate specific programs and contexts.
- Annotate initial and goal descriptions in the iGibson environment to create new BDDL cases.
- Convert BDDL descriptions into natural‑language task statements.
- Storage Location: ./bddl/activity-definitions.
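The last step of the definition process, converting BDDL goal conditions into natural‑language task statements, can be illustrated with a toy template expander. The actual pipeline uses ChatGPT for this conversion; the predicate names and templates below are simplified assumptions:

```python
# Toy stand-in for the ChatGPT-based BDDL-to-text step.
# Predicate names and templates are illustrative, not real BDDL coverage.
TEMPLATES = {
    "inside": "put the {0} inside the {1}",
    "ontop": "place the {0} on top of the {1}",
    "cooked": "make sure the {0} is cooked",
}

def predicate_to_sentence(pred):
    """Render one (name, *args) goal predicate as an English clause."""
    name, *args = pred
    return TEMPLATES[name].format(*args)

def goal_to_description(goal_predicates):
    """Join all goal clauses into a single task statement."""
    clauses = [predicate_to_sentence(p) for p in goal_predicates]
    return "Goal: " + "; ".join(clauses) + "."

goal = [("inside", "candle", "basket"), ("ontop", "basket", "table")]
print(goal_to_description(goal))
# → Goal: put the candle inside the basket; place the basket on top of the table.
```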
Multimodal Data Collection
- Visual Information: Capture primary scene images within activity environments.
- Collection Procedure:
- For counterfactual activities, sample scene instances based on the previous step's activity definitions.
- For normal activities, use predefined activities from Behavior100.
- Load scene instances in the iGibson2 simulator and record video, selecting images that cover the main content.
- Examples: ./annotation/Beechwood_0_int/assembling_gift_baskets/0 (normal) and ./annotation/Beechwood_0_int/assembling_gift_baskets/1 (counterfactual).
- Data Download: the full dataset, including all annotations and sampled URDF files, is available for download.
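Selecting images that cover the main content of a recorded walkthrough amounts to sampling frames spread across the video. A minimal sketch of one such strategy, evenly spaced frame indices (the dataset's actual selection procedure is not specified here):

```python
def sample_frame_indices(n_frames: int, k: int) -> list:
    """Pick k roughly evenly spaced frame indices from a recording of
    n_frames, so the selected images span the whole walkthrough.
    This is an assumed strategy, not the benchmark's documented one."""
    if k >= n_frames:
        return list(range(n_frames))
    step = n_frames / k
    # Take the midpoint of each of the k equal segments.
    return [int(i * step + step / 2) for i in range(k)]

print(sample_frame_indices(100, 4))  # → [12, 37, 62, 87]
```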
Automatic Evaluation
- Evaluation Method: Provide a natural‑language description and a selected set of images as prompts to a vision‑language model, which generates a program plan that is compared against a gold standard plan.
- Metrics:
- LCS: Longest Common Subsequence between the generated plan and the gold plan; details in ./auto_lcs.
- Fine-tuned BLEURT: a BLEURT metric fine-tuned for plan comparison; details in ./bleu-cls.
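The LCS metric can be computed with the classic dynamic-programming recurrence over plan steps. A minimal sketch follows; the normalization (dividing by the longer plan's length) is one common convention and may differ from the exact scoring in ./auto_lcs:

```python
def lcs_length(pred, gold):
    """Classic O(m*n) DP for the longest common subsequence of two step lists."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == gold[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_score(pred_plan, gold_plan):
    """Normalize by the longer plan; assumed convention, not necessarily
    the exact one used in ./auto_lcs."""
    if not pred_plan and not gold_plan:
        return 1.0
    return lcs_length(pred_plan, gold_plan) / max(len(pred_plan), len(gold_plan))

pred = ["open fridge", "take milk", "close fridge"]
gold = ["open fridge", "take milk", "pour milk", "close fridge"]
print(lcs_score(pred, gold))  # → 0.75
```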
Source
Organization: arXiv
Created: 10/5/2024