---
license: apache-2.0
language:
- en
base_model:
- OpenGVLab/InternVL2_5-8B
pipeline_tag: visual-question-answering
---

**DriveLMM-o1: A Large Multimodal Model for Autonomous Driving Reasoning**

DriveLMM-o1 is a fine-tuned large multimodal model designed for autonomous driving. Built on InternVL2.5-8B with LoRA-based adaptation, it reasons step by step over stitched multiview images. This structured approach improves both final decision accuracy and interpretability in complex driving tasks such as perception, prediction, and planning.

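For illustration, the sketch below shows one way multiview camera frames could be stitched into a single image before being passed to the model; the camera names, grid layout, and resolution are assumptions, not the exact preprocessing used during training.

```python
from PIL import Image

# Hypothetical camera views; the file names and 3x2 layout are assumptions.
view_paths = [
    'cam_front_left.jpg', 'cam_front.jpg', 'cam_front_right.jpg',
    'cam_back_left.jpg', 'cam_back.jpg', 'cam_back_right.jpg',
]
views = [Image.open(p).convert('RGB').resize((640, 360)) for p in view_paths]

# Paste the six views onto one canvas: front row on top, back row below.
stitched = Image.new('RGB', (3 * 640, 2 * 360))
for i, view in enumerate(views):
    stitched.paste(view, ((i % 3) * 640, (i // 3) * 360))
stitched.save('stitched_multiview.jpg')
```
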
**Key Features:**

- **Multimodal Integration:** Combines multiview images for comprehensive scene understanding.
- **Step-by-Step Reasoning:** Produces detailed intermediate reasoning steps that explain each decision.
- **Efficient Adaptation:** Uses dynamic image patching and LoRA fine-tuning to handle high-resolution inputs with minimal additional parameters (a rough configuration sketch follows this list).
- **Performance Gains:** Achieves significant improvements in both final answer accuracy and overall reasoning score compared to previous open-source models.

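As a rough illustration of the LoRA-based adaptation, the sketch below attaches LoRA adapters to the language model of the InternVL2.5-8B base using the `peft` library. The rank, scaling factor, dropout, and target module names are assumptions for illustration, not the released training configuration.

```python
import torch
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Load the base model (see the Usage section below for the full loading call).
base = AutoModel.from_pretrained(
    'OpenGVLab/InternVL2_5-8B',
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# LoRA hyperparameters here are illustrative assumptions, not the values
# used to train DriveLMM-o1.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['wqkv', 'wo'],  # assumed InternLM2 attention projection names
    task_type='CAUSAL_LM',
)
base.language_model = get_peft_model(base.language_model, lora_config)
base.language_model.print_trainable_parameters()  # only adapter weights are trainable
```
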
**Performance Comparison:**

| Model | Risk Assessment Accuracy | Traffic Rule Adherence | Scene Awareness & Object Understanding | Relevance | Missing Details | Overall Reasoning Score | Final Answer Accuracy |
|-------------------------|--------------------------|------------------------|------------------------------------------|-----------|-----------------|-------------------------|-----------------------|
| GPT-4o (Closed) | 71.32 | 80.72 | 72.96 | 76.65 | 71.43 | 72.52 | 57.84 |
| Qwen-2.5-VL-7B | 46.44 | 60.45 | 51.02 | 50.15 | 52.19 | 51.77 | 37.81 |
| Ovis1.5-Gemma2-9B | 51.34 | 66.36 | 54.74 | 55.72 | 55.74 | 55.62 | 48.85 |
| Mulberry-7B | 51.89 | 63.66 | 56.68 | 57.27 | 57.45 | 57.65 | 52.86 |
| LLaVA-CoT | 57.62 | 69.01 | 60.84 | 62.72 | 60.67 | 61.41 | 49.27 |
| LlamaV-o1 | 60.20 | 73.52 | 62.67 | 64.66 | 63.41 | 63.13 | 50.02 |
| InternVL2.5-8B | 69.02 | 78.43 | 71.52 | 75.80 | 70.54 | 71.62 | 54.87 |
| **DriveLMM-o1 (Ours)** | **73.01** | **81.56** | **75.39** | **79.42** | **74.49** | **75.24** | **62.36** |

**Usage:**

Load the model using the following code snippet:

```python
from transformers import AutoModel, AutoTokenizer
import torch

# Load the DriveLMM-o1 checkpoint (InternVL2.5-8B architecture; custom modeling
# code is enabled via trust_remote_code).
path = 'ayeshaishaq/DriveLMMo1'
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    use_flash_attn=True,
    trust_remote_code=True
).eval().cuda()

tokenizer = AutoTokenizer.from_pretrained(
    path,
    trust_remote_code=True,
    use_fast=False
)
```
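
For a quick end-to-end check, the sketch below runs single-image inference, assuming DriveLMM-o1 keeps InternVL's `model.chat` interface and standard 448x448 ImageNet-normalized input tiles; the image path and question are placeholders, and the full dynamic-patching preprocessing helper is provided in the InternVL2_5-8B model card linked below.

```python
import torch
import torchvision.transforms as T
from PIL import Image

# Minimal single-tile preprocessing; the multi-tile dynamic-patching helper
# from the InternVL2_5-8B model card is recommended for high-resolution inputs.
transform = T.Compose([
    T.Resize((448, 448)),
    T.ToTensor(),
    T.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])
image = Image.open('stitched_multiview.jpg').convert('RGB')  # placeholder stitched input
pixel_values = transform(image).unsqueeze(0).to(torch.bfloat16).cuda()

question = '<image>\nWhat should the ego vehicle do next? Explain your reasoning step by step.'
generation_config = dict(max_new_tokens=1024, do_sample=False)
response = model.chat(tokenizer, pixel_values, question, generation_config)
print(response)
```
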
For detailed usage instructions and additional configurations, please refer to the [OpenGVLab/InternVL2_5-8B](https://huggingface.co/OpenGVLab/InternVL2_5-8B) repository.

**Limitations:**

While DriveLMM-o1 demonstrates strong performance on autonomous driving tasks, its reasoning is specialized to the driving domain it was fine-tuned on. Users may need to further fine-tune or adapt the model for other driving environments.