---
base_model:
- OpenGVLab/InternVL2_5-8B
language:
- en
license: apache-2.0
pipeline_tag: image-text-to-text
library_name: transformers
datasets:
- ayeshaishaq/DriveLMMo1
---
**DriveLMM-o1: A Large Multimodal Model for Autonomous Driving Reasoning**

[Paper](https://arxiv.org/abs/2503.10621)

DriveLMM-o1 is a fine-tuned large multimodal model designed for autonomous driving. Built on InternVL2.5-8B with LoRA-based adaptation, it leverages stitched multiview images to produce step-by-step reasoning. This structured approach enhances both final decision accuracy and interpretability in complex driving tasks like perception, prediction, and planning.
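
The exact multiview preprocessing is described in the paper and code repository; as a rough illustration, the sketch below composites six surround-view camera frames into a single image with PIL before it is handed to the model. The file names and the 2x3 grid layout are assumptions for illustration, not the authors' exact pipeline.

```python
# Minimal sketch: stitch multiview camera frames into one image.
# File names and the 2x3 grid layout are illustrative assumptions,
# not the exact preprocessing used to train DriveLMM-o1.
from PIL import Image

view_paths = [
    "cam_front_left.jpg", "cam_front.jpg", "cam_front_right.jpg",
    "cam_back_left.jpg", "cam_back.jpg", "cam_back_right.jpg",
]

views = [Image.open(p).convert("RGB") for p in view_paths]
w, h = views[0].size  # assumes all views share the same resolution

# Arrange the six views in a 2x3 grid (front row on top, back row below).
stitched = Image.new("RGB", (3 * w, 2 * h))
for i, view in enumerate(views):
    row, col = divmod(i, 3)
    stitched.paste(view, (col * w, row * h))

stitched.save("stitched_multiview.jpg")
```
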
**Key Features:**
- **Multimodal Integration:** Combines multiview images for comprehensive scene understanding.
- **Step-by-Step Reasoning:** Produces detailed intermediate reasoning steps to explain decisions.
- **Efficient Adaptation:** Uses dynamic image patching and LoRA fine-tuning to handle high-resolution inputs with minimal additional trainable parameters (see the configuration sketch after this list).
- **Performance Gains:** Achieves significant improvements in both final answer accuracy and overall reasoning scores compared to previous open-source models.
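
The LoRA hyperparameters used for training are reported in the paper; the snippet below is only a minimal sketch of how such an adapter configuration can be declared with the `peft` library. The rank, alpha, dropout, and target-module names are illustrative assumptions (InternLM2-style fused attention projections), not the reported training settings.

```python
# Sketch only: declaring a LoRA adapter configuration with peft.
# All hyperparameters below are illustrative assumptions, not the
# values used to train DriveLMM-o1.
from peft import LoraConfig

lora_config = LoraConfig(
    r=16,                            # low-rank adapter dimension
    lora_alpha=32,                   # scaling applied to the adapter output
    lora_dropout=0.05,
    target_modules=["wqkv", "wo"],   # attention projection names; must match the chosen backbone
    task_type="CAUSAL_LM",
)

# The adapters would then be attached to the language model with
# peft.get_peft_model(language_model, lora_config), freezing the base
# weights and training only the low-rank matrices.
```
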
**Performance Comparison:**

| Model | Risk Assessment Accuracy | Traffic Rule Adherence | Scene Awareness & Object Understanding | Relevance | Missing Details | Overall Reasoning Score | Final Answer Accuracy |
|-------------------------|--------------------------|------------------------|------------------------------------------|-----------|-----------------|-------------------------|-----------------------|
| GPT-4o (Closed) | 71.32 | 80.72 | 72.96 | 76.65 | 71.43 | 72.52 | 57.84 |
| Qwen-2.5-VL-7B | 46.44 | 60.45 | 51.02 | 50.15 | 52.19 | 51.77 | 37.81 |
| Ovis1.5-Gemma2-9B | 51.34 | 66.36 | 54.74 | 55.72 | 55.74 | 55.62 | 48.85 |
| Mulberry-7B | 51.89 | 63.66 | 56.68 | 57.27 | 57.45 | 57.65 | 52.86 |
| LLaVA-CoT | 57.62 | 69.01 | 60.84 | 62.72 | 60.67 | 61.41 | 49.27 |
| LlamaV-o1 | 60.20 | 73.52 | 62.67 | 64.66 | 63.41 | 63.13 | 50.02 |
| InternVL2.5-8B | 69.02 | 78.43 | 71.52 | 75.80 | 70.54 | 71.62 | 54.87 |
| **DriveLMM-o1 (Ours)** | **73.01** | **81.56** | **75.39** | **79.42** | **74.49** | **75.24** | **62.36** |

**Usage:**

Load the model using the following code snippet:
```python
from transformers import AutoModel, AutoTokenizer
import torch

path = 'ayeshaishaq/DriveLMMo1'

# trust_remote_code is required to load the InternVL architecture;
# use_flash_attn assumes flash-attn is installed.
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True,
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(
    path,
    trust_remote_code=True,
    use_fast=False,
)
```
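
After loading, inference can follow the InternVL-style `chat` interface exposed by the base model's remote code. The snippet below is a minimal sketch: it normalizes one stitched multiview image into a single 448x448 tile instead of using the full dynamic-patching pipeline from the base repository, and the image path and question are placeholders.

```python
# Minimal inference sketch using the InternVL-style chat interface.
# A single 448x448 tile is used for brevity; see the base model
# repository for the full dynamic image-patching preprocessing.
import torch
import torchvision.transforms as T
from PIL import Image

transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406),   # ImageNet statistics used by InternVL
                std=(0.229, 0.224, 0.225)),
])

image = Image.open('stitched_multiview.jpg').convert('RGB')  # placeholder path
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

generation_config = dict(max_new_tokens=1024, do_sample=False)
question = ('<image>\n'
            'Is it safe for the ego vehicle to change to the left lane? '
            'Explain your reasoning step by step.')  # example question

response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```
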
For detailed usage instructions and additional configurations, please refer to the [OpenGVLab/InternVL2_5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) repository.

Code: [https://github.com/ayesha-ishaq/DriveLMM-o1](https://github.com/ayesha-ishaq/DriveLMM-o1)

**Limitations:**

While DriveLMM-o1 demonstrates strong performance on autonomous driving tasks, it is fine-tuned for domain-specific reasoning; users may need to further fine-tune or adapt the model for driving environments that differ from its training data.