Dataset asset · Open Source Community · Autonomous Driving · Spatial Understanding

DriveMLLM

The DriveMLLM dataset, created by the Institute of Automation, Chinese Academy of Sciences together with other institutions, targets spatial understanding tasks in autonomous driving scenarios. It contains 880 forward-camera images covering both absolute and relative spatial reasoning tasks, each paired with natural-language questions. The images are drawn from the nuScenes dataset and were carefully selected and annotated so that objects are clearly visible and spatial relationships are unambiguous. DriveMLLM aims to evaluate, and help improve, the spatial reasoning abilities of multimodal large language models in autonomous driving, with an emphasis on complex spatial-relation understanding.

Source
arXiv
Created
Nov 20, 2024
Updated
Nov 20, 2024
Availability
Linked source ready
Overview

Dataset description and usage context

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

Dataset Overview

  • Dataset Name: MLLM_eval_dataset
  • Data Source:
    • Images come from the nuScenes validation set CAM_FRONT.
    • A metadata.jsonl file provides image attributes such as location2D (a reading sketch follows this list).
  • Purpose: Evaluate multimodal large language models on spatial understanding in autonomous driving.
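
The metadata.jsonl file is a JSON Lines file with one record per image. A minimal reading sketch in Python, assuming only that each line is a JSON object and that location2D is one of its keys (the full field list is not documented here):

    import json

    # Read the per-image metadata; each non-empty line is one JSON object.
    records = []
    with open("metadata.jsonl", "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))

    # location2D is the only attribute named in this overview; other keys are unknown here.
    print(len(records), "records")
    print(records[0].get("location2D"))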

Using the Dataset

0. Prepare the Dataset
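
A minimal preparation sketch, assuming the benchmark is pulled straight from the Hugging Face Hub under the bonbon-rj/MLLM_eval_dataset identifier that the commands below pass via --hf_dataset (split handling and field names are assumptions):

    from datasets import load_dataset

    # Download and cache the benchmark from the Hugging Face Hub.
    # The identifier matches the --hf_dataset argument used by the scripts below.
    dataset = load_dataset("bonbon-rj/MLLM_eval_dataset")

    print(dataset)  # show the available splits and their sizes
    first_split = next(iter(dataset))
    print(dataset[first_split][0].keys())  # inspect the per-sample fields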

1. Environment Setup

2. Inference

  • Inference Scripts (a conceptual sketch of the shared loop follows these commands):
    • GPT API:
      export OPENAI_API_KEY=your_api_key
      python inference/get_MLLM_output.py \
          --model_type gpt \
          --model gpt-4o \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
      
    • Gemini API:
      export GOOGLE_API_KEY=your_api_key
      python inference/get_MLLM_output.py \
          --model_type gemini \
          --model models/gemini-1.5-flash \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
      
    • Local LLaVA‑Next:
      python inference/get_MLLM_output.py \
          --model_type llava \
          --model lmms-lab/llava-onevision-qwen2-7b-si \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
      
    • Local QWen2‑VL:
      python inference/get_MLLM_output.py \
          --model_type qwen \
          --model Qwen/Qwen2-VL-7B-Instruct \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
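
All four commands drive the same get_MLLM_output.py script and only switch the backend with --model_type. A conceptual Python sketch of the kind of loop such a script runs, with the backend call stubbed out; the split and field names are assumptions, and this is not the repository's actual implementation:

    import json
    from pathlib import Path

    from datasets import load_dataset

    def query_model(image, prompt):
        # Stub for the backend selected by --model_type
        # (OpenAI GPT, Gemini, LLaVA-Next, or Qwen2-VL in the commands above).
        return "<model answer>"

    def run_inference(hf_dataset, save_dir, model_name):
        data = load_dataset(hf_dataset, split="train")  # split name is an assumption
        results = []
        for sample in data:
            prompt = sample.get("question", "")  # field name is an assumption
            results.append({"prompt": prompt, "answer": query_model(sample.get("image"), prompt)})
        out_path = Path(save_dir) / f"{model_name.replace('/', '_')}.json"
        out_path.parent.mkdir(parents=True, exist_ok=True)
        out_path.write_text(json.dumps(results, indent=2), encoding="utf-8")

    run_inference("bonbon-rj/MLLM_eval_dataset", "inference/mllm_outputs", "gpt-4o")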
      

3. Evaluation

  • Evaluation Scripts (a sketch for collecting the resulting JSON files follows these commands):
    • All Results:
      python evaluation/eval_from_json.py \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --eval_root_dir inference/mllm_outputs \
          --save_dir evaluation/eval_result \
          --eval_model_path all
      
    • Specific Model:
      python evaluation/eval_from_json.py \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --eval_root_dir inference/mllm_outputs \
          --save_dir evaluation/eval_result \
          --eval_model_path gemini/gemini-1.5-flash
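
The evaluation script writes its results under --save_dir. A small collection sketch, assuming only that the outputs are JSON files somewhere below that directory (their exact layout and metric names are not documented here):

    import json
    from pathlib import Path

    # Walk the evaluation output directory and list every JSON result found.
    eval_dir = Path("evaluation/eval_result")
    for path in sorted(eval_dir.rglob("*.json")):
        with path.open(encoding="utf-8") as f:
            result = json.load(f)
        keys = list(result) if isinstance(result, dict) else type(result).__name__
        print(path.relative_to(eval_dir), "->", keys)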
      

Citation

@article{DriveMLLM,
  title={DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving},
  author={Guo, Xianda and Zhang, Ruijun and Duan, Yiqun and He, Yuhang and Zhang, Chenming and Chen, Long},
  journal={arXiv preprint arXiv:2411.13112},
  year={2024}
}