---
base_model: meta-llama/Llama-3.2-11B-Vision-Instruct
library_name: transformers
license: llama3.2
tags:
- lora
- llama
- vision-language
- peft
- fine-tuned
inference: false
pipeline_tag: image-text-to-text
---

# **lavender-llama-3.2-11b-lora** 🚀

**LoRA fine-tuned model based on** `meta-llama/Llama-3.2-11B-Vision-Instruct`

## **Model Overview**

This is a **Lavender fine-tuned** version of **Llama-3.2-11B-Vision-Instruct**. Lavender is a simple supervised fine-tuning (SFT) method that boosts the performance of advanced vision-language models (VLMs) by leveraging state-of-the-art image generation models such as Stable Diffusion. This model retains the core capabilities of Llama-3.2 while incorporating Stable Diffusion's visual expertise.

- **Base Model**: [`meta-llama/Llama-3.2-11B-Vision-Instruct`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- **Fine-Tuned Model**: `lxasqjc/lavender-llama-3.2-11b-lora`
- **Lavender Paper**: [Diffusion Instruction Tuning (arXiv)](https://arxiv.org/abs/2502.06814)
- **Lavender Project Space**: [Diffusion Instruction Tuning](https://astrazeneca.github.io/vlm/)
- **GitHub Repository**: [Diffusion Instruction Tuning](https://github.com/AstraZeneca/vlm)
- **Parameter-Efficient Fine-Tuning (PEFT)**: Uses **LoRA** (Low-Rank Adaptation), so only a small set of adapter weights is trained and distributed.
- **License**: Llama 3.2 Community License (see [`LICENSE.txt`](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct/resolve/main/LICENSE.txt))

---

## **🛠️ How to Use This Model**

This model contains **only LoRA weights**. To use it, you must first **load the base model** and then apply the LoRA adapter.

### **Install Required Packages**

```bash
pip install torch transformers accelerate peft
```

### **Load and Use the LoRA Model**

```python
from transformers import MllamaForConditionalGeneration, MllamaProcessor
from peft import PeftModel
import requests
import torch
from PIL import Image

# Define model paths
base_model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
lora_model_name = "lxasqjc/lavender-llama-3.2-11b-lora"

# Load the base model and its processor
base_model = MllamaForConditionalGeneration.from_pretrained(
    base_model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = MllamaProcessor.from_pretrained(base_model_name)

# Apply the LoRA adapter on top of the base model
lora_model = PeftModel.from_pretrained(base_model, lora_model_name)

# Download an example image
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Build a multimodal chat prompt (image + text)
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "If I had to write a haiku for this one, it would be: "}
    ]}
]
input_text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt"
).to(lora_model.device)

# Generate and decode the response
output = lora_model.generate(**inputs, max_new_tokens=30)
print(processor.decode(output[0]))
```

---

## **💡 Applications**

- **Vision-Language Tasks**: Describing images and answering visual questions.
- **Instruction Following**: Improved instruction compliance compared to the base Llama-3.2 model.
- **Multimodal Reasoning**: Processes and reasons jointly over images and text.
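
---

## **🔀 Optional: Merge the LoRA Adapter (Sketch)**

If you prefer to serve the model without a PEFT dependency at inference time, the LoRA adapter can be folded into the base weights. This is a minimal sketch, not an officially validated workflow: it assumes the `lora_model` and `processor` objects from the usage example above, and the output directory name is a hypothetical placeholder.

```python
# Continues from the "Load and Use the LoRA Model" snippet above.
# merge_and_unload() folds the LoRA deltas into the base weights and
# returns a plain transformers model (no PEFT wrapper).
merged_model = lora_model.merge_and_unload()

# Save the merged weights and processor locally (hypothetical placeholder path)
save_dir = "lavender-llama-3.2-11b-merged"
merged_model.save_pretrained(save_dir)
processor.save_pretrained(save_dir)
```

The merged checkpoint can then be reloaded with `MllamaForConditionalGeneration.from_pretrained(save_dir)` alone; the trade-off is a full-size checkpoint on disk instead of the small adapter files.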
---

## **📊 Training Details**

- **Fine-Tuning Method**: LoRA (Low-Rank Adaptation)
- **Base Model**: `meta-llama/Llama-3.2-11B-Vision-Instruct`
- **PEFT Framework**: [Hugging Face PEFT](https://huggingface.co/docs/peft)
- **Precision**: `bfloat16`
- **Hyperparameters**: Coming soon

---

## **📏 Limitations & Considerations**

- This is **not a standalone model**: the adapter must be loaded on top of the base Llama-3.2-11B-Vision-Instruct weights.
- **Biases & Ethical Use**: Like all large models, it may exhibit biases present in the pretraining data.
- **Hardware Requirements**: Minimum **24 GB of VRAM** for `bfloat16` inference (A100 40GB recommended); see the low-VRAM loading sketch at the end of this card.

---

## **🔗 References**

- **Base Model**: [Meta Llama-3.2-11B-Vision-Instruct](https://huggingface.co/meta-llama/Llama-3.2-11B-Vision-Instruct)
- **LoRA Paper**: [LoRA: Low-Rank Adaptation of Large Language Models](https://arxiv.org/abs/2106.09685)
- **PEFT Documentation**: [Hugging Face PEFT](https://huggingface.co/docs/peft)
- **Project Space**: [Diffusion Instruction Tuning](https://astrazeneca.github.io/vlm/)
- **Paper**: [Diffusion Instruction Tuning (arXiv)](https://arxiv.org/abs/2502.06814)

### **Citation**

If you use this model or work in your research, please cite:

```bibtex
@misc{jin2025diffusioninstructiontuning,
      title={Diffusion Instruction Tuning},
      author={Chen Jin and Ryutaro Tanno and Amrutha Saseendran and Tom Diethe and Philip Teare},
      year={2025},
      eprint={2502.06814},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2502.06814},
}
```
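
---

## **📎 Appendix: Low-VRAM Loading (Sketch)**

The 24 GB VRAM figure in the Limitations section assumes full `bfloat16` inference. As a hedged alternative for smaller GPUs, the base model can be loaded with 4-bit quantization via `bitsandbytes` before attaching the adapter. This is an untested sketch, not an officially supported configuration: quantization may affect output quality, and the extra `bitsandbytes` dependency is an assumption beyond the packages listed earlier.

```python
from transformers import MllamaForConditionalGeneration, MllamaProcessor, BitsAndBytesConfig
from peft import PeftModel
import torch

# Assumes `pip install bitsandbytes` in addition to the packages listed earlier.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

base_model_name = "meta-llama/Llama-3.2-11B-Vision-Instruct"
lora_model_name = "lxasqjc/lavender-llama-3.2-11b-lora"

# Load the base model in 4-bit, then attach the LoRA adapter as before.
base_model = MllamaForConditionalGeneration.from_pretrained(
    base_model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
processor = MllamaProcessor.from_pretrained(base_model_name)
lora_model = PeftModel.from_pretrained(base_model, lora_model_name)
```

Generation then proceeds exactly as in the usage example above.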