ActPlan-1K

ActPlan‑1K is a multimodal planning benchmark created jointly by the Hong Kong University of Science and Technology and the University of California, San Diego. It evaluates the program planning abilities of vision‑language models in household activities. The dataset comprises 153 activities and 1,187 instances, each pairing a natural‑language task description with multiple environment images captured from the iGibson2 simulator. It was constructed by combining ChatGPT with iGibson2: BDDL activity definitions were converted into natural‑language descriptions, and environment images were collected from the simulated scenes. ActPlan‑1K is primarily used to assess vision‑language models' program planning capabilities on multimodal tasks, especially household activities and counterfactual scenarios.

Updated 10/5/2024
arXiv

Description

ActPlan‑1K Dataset Overview

Dataset Definition

  • Base Source: Defined in the BDDL language, extending Behavior100.
  • Definition Process:
    1. Translate activity descriptions from Behavior100 into natural language.
    2. Use ChatGPT to generate specific programs and contexts.
    3. Annotate initial and goal descriptions in the iGibson environment to create new BDDL cases.
    4. Convert BDDL descriptions into natural‑language task statements.
  • Storage Location: ./bddl/activity-definitions.
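
The BDDL‑to‑natural‑language conversion (step 4 above) can be sketched as a simple predicate‑to‑template translation. Everything in this sketch is illustrative: the predicate names, object identifiers, and English templates are assumptions for demonstration, not the dataset's actual vocabulary or conversion pipeline.

```python
# Hypothetical sketch: turning BDDL-style goal predicates, written as
# s-expressions like "(ontop obj1 obj2)", into natural-language clauses.
TEMPLATES = {
    "ontop": "{0} is on top of {1}",
    "inside": "{0} is inside {1}",
    "cooked": "{0} is cooked",
}

def strip_id(token: str) -> str:
    # Drop the WordNet-style suffix: "gift_basket.n.01_1" -> "gift basket".
    return token.split(".")[0].replace("_", " ")

def predicate_to_text(expr: str) -> str:
    # Parse "(pred arg1 arg2 ...)" and fill the matching template.
    parts = expr.strip("() \n").split()
    pred, args = parts[0], [strip_id(a) for a in parts[1:]]
    return TEMPLATES[pred].format(*args)

goal = ["(ontop gift_basket.n.01_1 table.n.02_1)",
        "(cooked chicken.n.01_1)"]
print("; ".join(predicate_to_text(g) for g in goal))
# -> gift basket is on top of table; chicken is cooked
```

In the actual pipeline this translation is performed with ChatGPT rather than fixed templates, which handles predicates and phrasings a template table cannot.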

Multimodal Data Collection

  • Visual Information: Capture primary scene images within activity environments.
  • Collection Procedure:
    1. For counterfactual activities, sample scene instances based on the previous step's activity definitions.
    2. For normal activities, use predefined activities from Behavior100.
    3. Load scene instances in the iGibson2 simulator and record video, selecting images that cover the main content.
  • Examples: ./annotation/Beechwood_0_int/assembling_gift_baskets/0 (normal) and ./annotation/Beechwood_0_int/assembling_gift_baskets/1 (counterfactual).
  • Data Download: The full dataset, including all annotations and sampled URDF files, is available for download.
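
As a rough illustration of step 3 (recording a video and selecting images that cover the main content), one simple strategy is to pick evenly spaced frame indices from the recording. This uniform-sampling strategy is an assumption for illustration; the actual annotation may select frames manually or by content.

```python
# Illustrative sketch: choose k evenly spaced frame indices from an
# n-frame recording so the selected images span the whole video.
def sample_frame_indices(n_frames: int, k: int) -> list[int]:
    if k >= n_frames:
        return list(range(n_frames))  # fewer frames than requested
    if k == 1:
        return [0]
    step = (n_frames - 1) / (k - 1)   # spacing so first/last are included
    return [round(i * step) for i in range(k)]

print(sample_frame_indices(100, 5))
```

The returned indices could then be used to extract the corresponding frames from the iGibson2 recording with any video library.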

Automatic Evaluation

  • Evaluation Method: Provide the natural‑language task description and a selected set of environment images as a prompt to a vision‑language model; the generated program plan is then compared against a gold‑standard plan.
  • Metrics:
    1. LCS: longest common subsequence between the generated and gold plans, details in ./auto_lcs.
    2. Fine‑tuned BLEURT: a fine‑tuned BLEURT similarity score, details in ./bleu-cls.
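
A minimal sketch of how an LCS‑based plan score could work, assuming plans are lists of step strings compared by exact match and the score is normalized by the gold plan's length. These are assumptions for illustration; the official scoring in ./auto_lcs may tokenize and normalize steps differently.

```python
# Classic dynamic-programming longest common subsequence over plan steps.
def lcs_length(a: list[str], b: list[str]) -> int:
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if x == y else max(dp[i-1][j], dp[i][j-1])
    return dp[len(a)][len(b)]

def lcs_score(pred: list[str], gold: list[str]) -> float:
    # Fraction of gold steps recovered in the correct order (one common
    # normalization convention).
    return lcs_length(pred, gold) / len(gold)

gold = ["open cabinet", "grasp basket", "place basket on table"]
pred = ["grasp basket", "open cabinet", "place basket on table"]
print(lcs_score(pred, gold))  # two of the three gold steps appear in order
```

LCS rewards getting steps in the right relative order, while the fine‑tuned BLEURT score captures semantic similarity between steps that are worded differently.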

Topics

Vision-Language Models
Program Planning

Source

Organization: arXiv

Created: 10/5/2024
