ActPlan-1K
ActPlan‑1K is a multimodal planning benchmark created jointly by the Hong Kong University of Science and Technology and the University of California, San Diego. It evaluates the procedural planning abilities of vision‑language models in household activities. The dataset contains 153 activities and 1,187 instances, each pairing a natural‑language task description with multiple environment images rendered in the iGibson2 simulator. It was built by combining ChatGPT with iGibson2: BDDL activity definitions were converted into natural‑language descriptions, and matching environment images were collected. ActPlan‑1K is used primarily to assess the program planning capabilities of vision‑language models on multimodal tasks, especially household activities and counterfactual scenarios.
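The structure described above (a task description plus environment images and a gold plan per instance) can be sketched as a small data class. The field names here are illustrative assumptions, not the dataset's actual schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class ActPlanInstance:
    """One of the 1,187 ActPlan-1K instances (field names are illustrative)."""
    activity: str                          # one of the 153 household activities
    description: str                       # natural-language task description
    image_paths: List[str] = field(default_factory=list)  # iGibson2 renders
    gold_plan: List[str] = field(default_factory=list)    # gold program plan, one step per entry
    counterfactual: bool = False           # normal vs. counterfactual variant

# Hypothetical instance for illustration only:
instance = ActPlanInstance(
    activity="assembling_gift_baskets",
    description="Assemble gift baskets using the items on the table.",
    image_paths=["scene_0.png", "scene_1.png"],
    gold_plan=["find basket", "place candle in basket"],
)
```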
Description
ActPlan‑1K Dataset Overview
Dataset Definition
- Base Source: Defined in the BDDL language, extending the Behavior100 activity set.
- Definition Process:
- Translate activity descriptions from Behavior100 into natural language.
- Use ChatGPT to generate specific programs and contexts.
- Annotate initial and goal descriptions in the iGibson environment to create new BDDL cases.
- Convert BDDL descriptions into natural‑language task statements.
- Storage Location: ./bddl/activity-definitions.
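The last step of the definition process, converting BDDL goal conditions into natural‑language task statements, can be illustrated with a toy template expander. The actual pipeline uses ChatGPT for this conversion; the predicate names and templates below are simplified assumptions:

```python
# Toy stand-in for the ChatGPT-based BDDL-to-text step.
# Predicate names and templates are illustrative, not real BDDL coverage.
TEMPLATES = {
    "inside": "put the {0} inside the {1}",
    "ontop": "place the {0} on top of the {1}",
    "cooked": "make sure the {0} is cooked",
}

def predicate_to_sentence(pred):
    """Render one (name, *args) goal predicate as an English clause."""
    name, *args = pred
    return TEMPLATES[name].format(*args)

def goal_to_description(goal_predicates):
    """Join all goal clauses into a single task statement."""
    clauses = [predicate_to_sentence(p) for p in goal_predicates]
    return "Goal: " + "; ".join(clauses) + "."

goal = [("inside", "candle", "basket"), ("ontop", "basket", "table")]
print(goal_to_description(goal))
# → Goal: put the candle inside the basket; place the basket on top of the table.
```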
Multimodal Data Collection
- Visual Information: Capture primary scene images within activity environments.
- Collection Procedure:
- For counterfactual activities, sample scene instances based on the previous step's activity definitions.
- For normal activities, use predefined activities from Behavior100.
- Load scene instances in the iGibson2 simulator and record video, selecting images that cover the main content.
- Examples: ./annotation/Beechwood_0_int/assembling_gift_baskets/0 (normal) and ./annotation/Beechwood_0_int/assembling_gift_baskets/1 (counterfactual).
- Data Download: the full dataset, including all annotations and sampled URDF files, is available for download.
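Selecting images that cover the main content of a recorded walkthrough amounts to sampling frames spread across the video. A minimal sketch of one such strategy, evenly spaced frame indices (the dataset's actual selection procedure is not specified here):

```python
def sample_frame_indices(n_frames: int, k: int) -> list:
    """Pick k roughly evenly spaced frame indices from a recording of
    n_frames, so the selected images span the whole walkthrough.
    This is an assumed strategy, not the benchmark's documented one."""
    if k >= n_frames:
        return list(range(n_frames))
    step = n_frames / k
    # Take the midpoint of each of the k equal segments.
    return [int(i * step + step / 2) for i in range(k)]

print(sample_frame_indices(100, 4))  # → [12, 37, 62, 87]
```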
Automatic Evaluation
- Evaluation Method: Provide a natural‑language description and a selected set of images as prompts to a vision‑language model, which generates a program plan that is compared against a gold standard plan.
- Metrics:
- LCS: Longest Common Subsequence between the generated plan and the gold plan; details in ./auto_lcs.
- Fine-tuned BLEURT: a BLEURT metric fine-tuned for plan comparison; details in ./bleu-cls.
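The LCS metric can be computed with the classic dynamic-programming recurrence over plan steps. A minimal sketch follows; the normalization (dividing by the longer plan's length) is one common convention and may differ from the exact scoring in ./auto_lcs:

```python
def lcs_length(pred, gold):
    """Classic O(m*n) DP for the longest common subsequence of two step lists."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if pred[i - 1] == gold[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]

def lcs_score(pred_plan, gold_plan):
    """Normalize by the longer plan; assumed convention, not necessarily
    the exact one used in ./auto_lcs."""
    if not pred_plan and not gold_plan:
        return 1.0
    return lcs_length(pred_plan, gold_plan) / max(len(pred_plan), len(gold_plan))

pred = ["open fridge", "take milk", "close fridge"]
gold = ["open fridge", "take milk", "pour milk", "close fridge"]
print(lcs_score(pred, gold))  # → 0.75
```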
Source
Organization: arXiv
Created: 10/5/2024