
DriveMLLM

The DriveMLLM dataset, created by the Institute of Automation, Chinese Academy of Sciences together with other institutions, targets spatial understanding in autonomous driving scenes. It contains 880 forward‑camera images covering both absolute and relative spatial reasoning tasks, each paired with natural‑language questions. Built on the nuScenes dataset, the images were carefully selected and annotated to ensure that objects are clearly visible and their spatial relationships are explicit. DriveMLLM aims to evaluate and improve the spatial reasoning abilities of multimodal large language models (MLLMs) in autonomous driving, with an emphasis on complex spatial relation understanding.

Updated 11/20/2024
arXiv

Description

DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving

Dataset Overview

  • Dataset Name: MLLM_eval_dataset
  • Data Source:
    • Images come from the CAM_FRONT (forward camera) of the nuScenes validation set.
    • A metadata.jsonl file provides image attributes such as location2D.
  • Purpose: Evaluate multimodal large language models on spatial understanding in autonomous driving.
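
The metadata.jsonl file is in JSON Lines format (one JSON object per line). A minimal reading sketch follows — the sample record and its image field are invented for illustration; only the location2D attribute is named in the overview above:

```python
import json

# Sample record invented for illustration; only "location2D" is documented above.
sample = '{"image": "CAM_FRONT_sample.jpg", "location2D": [512.0, 384.0]}\n'

def read_jsonl(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

records = read_jsonl(sample)
print(records[0]["location2D"])  # prints [512.0, 384.0]
```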

Using the Dataset

0. Prepare the Dataset
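
This step is not detailed on this page. A plausible sketch uses the Hugging Face datasets library with the bonbon-rj/MLLM_eval_dataset id that the inference commands below pass via --hf_dataset; the import is deferred so the snippet loads even without the library installed:

```python
# Dataset id taken from the --hf_dataset flag used by the inference commands below.
DATASET_ID = "bonbon-rj/MLLM_eval_dataset"

def fetch(split=None):
    """Download the benchmark from the Hugging Face Hub (network required)."""
    from datasets import load_dataset  # pip install datasets
    return load_dataset(DATASET_ID, split=split)

# Usage (requires network access):
#   ds = fetch()
#   print(ds)
```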

1. Environment Setup

2. Inference

  • Inference Scripts:
    • GPT API:
      export OPENAI_API_KEY=your_api_key
      python inference/get_MLLM_output.py \
          --model_type gpt \
          --model gpt-4o \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
      
    • Gemini API:
      export GOOGLE_API_KEY=your_api_key
      python inference/get_MLLM_output.py \
          --model_type gemini \
          --model models/gemini-1.5-flash \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
      
    • Local LLaVA‑Next:
      python inference/get_MLLM_output.py \
          --model_type llava \
          --model lmms-lab/llava-onevision-qwen2-7b-si \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
      
    • Local QWen2‑VL:
      python inference/get_MLLM_output.py \
          --model_type qwen \
          --model Qwen/Qwen2-VL-7B-Instruct \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --prompts_dir prompt/prompts \
          --save_dir inference/mllm_outputs
      

3. Evaluation

  • Evaluation Scripts:
    • All Results:
      python evaluation/eval_from_json.py \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --eval_root_dir inference/mllm_outputs \
          --save_dir evaluation/eval_result \
          --eval_model_path all
      
    • Specific Model:
      python evaluation/eval_from_json.py \
          --hf_dataset bonbon-rj/MLLM_eval_dataset \
          --eval_root_dir inference/mllm_outputs \
          --save_dir evaluation/eval_result \
          --eval_model_path gemini/gemini-1.5-flash
      
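The --eval_model_path values above (all, gemini/gemini-1.5-flash) suggest that inference outputs are grouped as <save_dir>/<model_type>/<model_name>. The helper below only reconstructs that assumed layout; it is a sketch for orientation, not part of the repository:

```python
from pathlib import Path

def output_dir(save_dir, model_type, model):
    """Assumed location of one model's outputs: <save_dir>/<model_type>/<model_name>."""
    # Path(model).name drops any prefix such as "models/" in "models/gemini-1.5-flash".
    return Path(save_dir) / model_type / Path(model).name

print(output_dir("inference/mllm_outputs", "gemini", "models/gemini-1.5-flash"))
```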

Citation

@article{DriveMLLM,
    title={DriveMLLM: A Benchmark for Spatial Understanding with Multimodal Large Language Models in Autonomous Driving},
    author={Guo, Xianda and Zhang, Ruijun and Duan, Yiqun and He, Yuhang and Zhang, Chenming and Chen, Long},
    journal={arXiv preprint arXiv:2411.13112},
    year={2024}
}



Topics

Autonomous Driving
Spatial Understanding

Source

Organization: arXiv

Created: 11/20/2024
