Dataset asset · Open Source Community · Autonomous Driving · Spatial Understanding

DriveMLLM

The DriveMLLM dataset, created by the Institute of Automation, Chinese Academy of Sciences together with other institutions, targets spatial understanding tasks in autonomous driving scenarios. It contains 880 forward-camera images covering both absolute and relative spatial reasoning tasks, each paired with natural-language questions. The images are drawn from the nuScenes dataset and were carefully selected and annotated so that objects are clearly visible and spatial relationships are unambiguous. DriveMLLM aims to evaluate, and help improve, the spatial reasoning abilities of multimodal large language models in autonomous driving, with an emphasis on complex spatial-relation understanding.

Source
arXiv
Created
Nov 20, 2024
Updated
Nov 20, 2024
Availability
Linked source ready
Overview

Dataset description and usage context

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

Dataset Overview

  • Dataset Name: MLLM_eval_dataset
  • Data Source:
    • Images come from the nuScenes validation set CAM_FRONT.
    • A metadata.jsonl file provides image attributes such as location2D (a reading sketch follows this list).
  • Purpose: Evaluate multimodal large language models on spatial understanding in autonomous driving.
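
The metadata.jsonl file is a JSON Lines file with one record per image. A minimal reading sketch in Python, assuming only that each line is a JSON object and that location2D is one of its keys (the full field list is not documented here):

    import json

    # Read the per-image metadata; each non-empty line is one JSON object.
    records = []
    with open("metadata.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))

    # location2D is the only attribute named in this overview; other keys are unknown here.
    print(len(records), "records")
    print(records[0].get("location2D"))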

Using the Dataset

0. Prepare the Dataset
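
A minimal preparation sketch, assuming the benchmark is pulled straight from the Hugging Face Hub under the bonbon-rj/MLLM_eval_dataset identifier that the commands below pass via --hf_dataset (split handling and field names are assumptions):

    from datasets import load_dataset

    # Download and cache the benchmark from the Hugging Face Hub.
    # The identifier matches the --hf_dataset argument used by the scripts below.
    dataset = load_dataset("bonbon-rj/MLLM_eval_dataset")

    print(dataset)  # show the available splits and their sizes
    first_split = next(iter(dataset))
    print(dataset[first_split][0].keys())  # inspect the per-sample fields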

1. Environment Setup

2. Inference

  • Inference Scripts (a conceptual sketch of the shared loop follows these commands):
    • GPT API:
      export OPENAI_API_KEY=your_api_key
      python inference/get_MLLM_output.py \
          --model_type gpt \
          --model gpt-4o \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
      
    • Gemini API:
      export GOOGLE_API_KEY=your_api_key
      python inference/get_MLLM_output.py \
          --model_type gemini \
          --model models/gemini-1.5-flash \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
      
    • Local LLaVA‑Next:
      python inference/get_MLLM_output.py \
          --model_type llava \
          --model lmms-lab/llava-onevision-qwen2-7b-si \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
      
    • Local QWen2‑VL:
      python inference/get_MLLM_output.py \
          --model_type qwen \
          --model Qwen/Qwen2-VL-7B-Instruct \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
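
All four commands drive the same get_MLLM_output.py script and only switch the backend with --model_type. A conceptual Python sketch of the kind of loop such a script runs, with the backend call stubbed out; the split and field names are assumptions, and this is not the repository's actual implementation:

    import json
    from pathlib import Path

    from datasets import load_dataset

    def query_model(image, prompt):
        # Stub for the backend selected by --model_type
        # (OpenAI GPT, Gemini, LLaVA-Next, or Qwen2-VL in the commands above).
        return "<model answer>"

    def run_inference(hf_dataset, save_dir, model_name):
        data = load_dataset(hf_dataset, split="train")  # split name is an assumption
        results = []
        for sample in data:
            prompt = sample.get("question", "")  # field name is an assumption
            results.append({"prompt": prompt, "answer": query_model(sample.get("image"), prompt)})
        out_path = Path(save_dir) / f"{model_name.replace('/', '_')}.json"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(json.dumps(results, indent=2), encoding="utf-8")

    run_inference("bonbon-rj/MLLM_eval_dataset", "inference/mllm_outputs", "gpt-4o")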
      

3. Evaluation

  • Evaluation Scripts (a sketch for collecting the resulting JSON files follows these commands):
    • All Results:
      python evaluation/eval_from_json.py \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --eval_root_dir inference/mllm_outputs \
          --save_dir evaluation/eval_result \
          --eval_model_path all
      
    • Specific Model:
      python evaluation/eval_from_json.py \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --eval_root_dir inference/mllm_outputs \
          --save_dir evaluation/eval_result \
          --eval_model_path gemini/gemini-1.5-flash
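
The evaluation script writes its results under --save_dir. A small collection sketch, assuming only that the outputs are JSON files somewhere below that directory (their exact layout and metric names are not documented here):

    import json
    from pathlib import Path

    # Walk the evaluation output directory and list every JSON result found.
    eval_dir = Path("evaluation/eval_result")
    for path in sorted(eval_dir.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            result = json.load(f)
        keys = list(result) if isinstance(result, dict) else type(result).__name__
        print(path.relative_to(eval_dir), "->", keys)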
      

Citation

@article{DriveMLLM,
  title={DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving},
  author={Guo, Xianda and Zhang, Ruijun and Duan, Yiqun and He, Yuhang and Zhang, Chenming and Chen, Long},
  journal={arXiv preprint arXiv:2411.13112},
  year={2024}
}