Model Card: Qwen2-VL-7B for Nutrition Table Detection
This model card describes a fine-tuned version of the Qwen/Qwen2-VL-7B-Instruct model, specifically adapted for detecting nutrition tables in product images.
Model Details
- Base Model: Qwen/Qwen2-VL-7B-Instruct
- Model Type: Vision Language Model (VLM)
- Fine-tuning Task: Object Detection (Nutrition Tables)
- Fine-tuning Method: QLoRA (Quantized Low-Rank Adaptation) with SFT (Supervised Fine-Tuning)
- Language(s): English (primarily for prompts and responses)
- License: Apache-2.0 (inherited from the base model)
Colab Notebooks
The entire fine-tuning walkthrough can be accessed at: https://colab.research.google.com/drive/1EkF4arAYcxfi2fugO1B3bfohr9gZa8Ly?usp=sharing
For serving with vLLM / NVIDIA Triton, the code can be found at: https://colab.research.google.com/drive/1furnMbQmD7beK5Z35KnJb2lQCIzQ-dpB?usp=sharing
Intended Use
This model is intended for identifying and localizing nutrition tables within images of food products. The primary output is the bounding box coordinates of the detected nutrition table.
Primary Intended Uses:
- Automated extraction of nutrition information from product packaging.
- Assisting in food logging and dietary tracking applications.
- Retail and e-commerce applications for product information management.
Out-of-Scope Uses:
- Detection of objects other than nutrition tables (unless further fine-tuned).
- Optical Character Recognition (OCR) of the text within the nutrition table (the model only provides bounding boxes).
- Making dietary recommendations or health assessments.
Training Data
- Dataset: openfoodfacts/nutrition-table-detection
- Dataset Description: This dataset contains product images along with corresponding bounding boxes for nutrition tables.
- Preprocessing:
  - The dataset was converted to the OpenAI ChatML format.
  - Each sample consists of:
    - A system message defining the VLM's role.
    - A user message containing the product image and the prompt: "Detect the bounding box of the nutrition table."
    - An assistant message containing the ground truth bounding box coordinates formatted with Qwen2-VL-specific tokens: `<|object_ref_start|>nutrition table<|object_ref_end|><|box_start|>(x0,y0),(x1,y1)<|box_end|>`, where coordinates are scaled to the [0,1000) integer space.
  - The object name used was "nutrition table".
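A minimal sketch of this conversion is shown below. The dataset column names (`image`, `bbox`), the pixel-coordinate format of the raw boxes, the system message text, and the `to_chatml` helper are assumptions for illustration; the actual notebook code and dataset schema may differ.

```python
# Sketch of converting one dataset sample into the ChatML format described above.
# Assumed schema: sample["image"] is a PIL image, sample["bbox"] is an absolute
# pixel box (x0, y0, x1, y1). Adjust to the real dataset columns.

SYSTEM_MSG = "You are a vision-language assistant that detects objects in images."
USER_PROMPT = "Detect the bounding box of the nutrition table."

def scale_to_qwen(value: float, size: int) -> int:
    """Scale an absolute pixel coordinate into Qwen2-VL's [0, 1000) integer space."""
    return min(int(value / size * 1000), 999)

def to_chatml(sample: dict) -> dict:
    img = sample["image"]                # assumed column name
    x0, y0, x1, y1 = sample["bbox"]      # assumed column name, pixel coordinates
    w, h = img.size
    box = (
        f"({scale_to_qwen(x0, w)},{scale_to_qwen(y0, h)}),"
        f"({scale_to_qwen(x1, w)},{scale_to_qwen(y1, h)})"
    )
    answer = (
        "<|object_ref_start|>nutrition table<|object_ref_end|>"
        f"<|box_start|>{box}<|box_end|>"
    )
    return {
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_MSG}]},
            {"role": "user", "content": [
                {"type": "image", "image": img},
                {"type": "text", "text": USER_PROMPT},
            ]},
            {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        ]
    }
```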
Training Procedure
- Fine-tuning Framework: Hugging Face TRL (Transformer Reinforcement Learning) library, specifically using `SFTTrainer` (or a custom `QwenVLSFTTrainer` to handle specific model inputs). The quantization, LoRA, and SFT settings below are reproduced in the configuration sketch after this list.
- Quantization: 4-bit NormalFloat (NF4) quantization using `bitsandbytes`.
  - `bnb_4bit_quant_type`: "nf4"
  - `bnb_4bit_compute_dtype`: `torch.bfloat16`
  - `bnb_4bit_use_double_quant`: True
- LoRA Configuration (QLoRA):
  - `r`: 64
  - `lora_alpha`: 16
  - `lora_dropout`: 0.05
  - `bias`: "none"
  - `task_type`: "CAUSAL_LM"
  - `target_modules`: `["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj", "qkv", "proj"]` (covering both the transformer decoder and the vision encoder ViT blocks)
- SFT Configuration (`SFTConfig`):
  - `dataset_text_field`: "messages"
  - `learning_rate`: 2e-4
  - `per_device_train_batch_size`: 4 (PD_BATCH)
  - `gradient_accumulation_steps`: 16 (GA_STEPS), for an effective batch size of 64
  - `num_train_epochs`: 3
  - `lr_scheduler_type`: "cosine"
  - `warmup_ratio`: 0.05
  - `bf16`: True
  - `tf32`: True
  - `gradient_checkpointing`: True
  - `optim`: "paged_adamw_32bit"
  - `max_grad_norm`: 1.0
  - `eval_strategy`: "steps"
  - `eval_steps`: 500
  - `save_strategy`: "steps"
  - `save_steps`: 500
  - `save_total_limit`: 2
  - `logging_steps`: 25
  - `report_to`: "wandb"
  - `load_best_model_at_end`: True
  - `metric_for_best_model`: "eval_loss"
  - `remove_unused_columns`: False
  - `packing`: False
  - `dataloader_pin_memory`: True
  - `output_dir`: "qwen2vl_qlora_sft"
  - `seed`: 42
- Hardware: Training was performed on NVIDIA A100 GPUs. Flash Attention 2 / SDPA was enabled for memory efficiency.
- Software:
  - `transformers`: 4.52.0.dev0 (or a similar dev version, potentially 4.47.0.dev0 as initially installed)
  - `trl`: 0.12.0.dev0
  - `datasets`: 3.0.2
  - `bitsandbytes`: 0.44.1
  - `peft`: 0.13.2
  - `qwen-vl-utils`: 0.0.8
  - `accelerate`: 1.0.1
  - `torch`: 2.4.1+cu121 (a specific older version pinned due to compatibility issues with the latest PyTorch at the time of notebook creation)
  - `torchvision`: 0.19.1+cu121
  - `torchaudio`: 2.4.1+cu121
  - `wandb` for logging
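The quantization, LoRA, and SFT settings above can be reproduced roughly as follows. This is a sketch assembled from the listed hyperparameters, not the notebook's exact code; model loading and the custom `QwenVLSFTTrainer` wiring are omitted.

```python
import torch
from transformers import BitsAndBytesConfig
from peft import LoraConfig
from trl import SFTConfig

# 4-bit NF4 quantization used when loading the base model for QLoRA
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# LoRA adapters over attention/MLP projections in the decoder and the vision ViT
peft_config = LoraConfig(
    r=64,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
        "qkv", "proj",
    ],
)

# SFT hyperparameters (effective batch size = 4 * 16 = 64)
sft_config = SFTConfig(
    output_dir="qwen2vl_qlora_sft",
    dataset_text_field="messages",
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    bf16=True,
    tf32=True,
    gradient_checkpointing=True,
    optim="paged_adamw_32bit",
    max_grad_norm=1.0,
    eval_strategy="steps",
    eval_steps=500,
    save_strategy="steps",
    save_steps=500,
    save_total_limit=2,
    logging_steps=25,
    report_to="wandb",
    load_best_model_at_end=True,
    metric_for_best_model="eval_loss",
    remove_unused_columns=False,
    packing=False,
    dataloader_pin_memory=True,
    seed=42,
)
```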
Evaluation
- Metric: Mean Intersection over Union (IoU) between predicted and ground truth bounding boxes.
- Fine-tuned Model Performance (on validation set):
- Mean IoU: 0.1111
- Base Model Performance (Qwen/Qwen2-VL-7B-Instruct, on validation set, without fine-tuning):
- Mean IoU: 0.0632
- Comparison: The fine-tuned model improves mean IoU over the base model from 0.0632 to 0.1111 (roughly a 76% relative improvement), indicating better localization of nutrition tables.
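For context, the metric averages per-sample IoU between predicted and ground-truth boxes. A minimal sketch for axis-aligned `(x0, y0, x1, y1)` boxes follows; it is not the notebook's exact evaluation code, and treating unparseable predictions as 0 is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over Union for two axis-aligned boxes (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix0, iy0 = max(ax0, bx0), max(ay0, by0)
    ix1, iy1 = min(ax1, bx1), min(ay1, by1)
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def mean_iou(predictions, ground_truths):
    """Average IoU over paired boxes; predictions that failed to parse count as 0 (assumption)."""
    scores = [iou(p, g) if p is not None else 0.0 for p, g in zip(predictions, ground_truths)]
    return sum(scores) / len(scores)
```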
Hardware and Software Requirements
- GPU: 1x A100 or 2x A6000 GPUs are recommended for fine-tuning due to the model size. Flash Attention support (NVIDIA Ampere series or newer) is beneficial for memory efficiency.
- Software: See "Training Procedure" for key library versions.
Limitations and Bias
- Computational Resources: Fine-tuning Qwen2-VL-7B is computationally intensive.
- Flash Attention: Optimal memory efficiency with Flash Attention is limited to NVIDIA Ampere GPUs or newer; on older GPUs it must be disabled, which may require more memory.
- Dataset Specificity: The model is fine-tuned specifically for nutrition table detection on the `openfoodfacts/nutrition-table-detection` dataset. Performance on other types of objects or significantly different image styles may vary.
- Bounding Box Format: The model outputs bounding boxes in a specific format (`<|object_ref_start|>object_name<|object_ref_end|><|box_start|>(x0,y0),(x1,y1)<|box_end|>`) with coordinates scaled to [0,1000). Parsing this output is necessary for downstream tasks.
- IoU Score: While improved, the mean IoU of 0.1111 suggests there is still room for improvement in localization accuracy. Further fine-tuning, data augmentation, or architectural adjustments might be needed for higher precision.
- PyTorch Version: A specific PyTorch version (2.4.1+cu121) was used due to an issue with the latest release at the time the notebook was created; this pin may matter for reproducibility.
How to Use
The fine-tuned LoRA adapters are available in the `qwen2vl_qlora_sft/checkpoint-51` directory (or your specified output directory).
The notebook demonstrates merging these adapters with the base Qwen/Qwen2-VL-7B-Instruct model and saving the merged model to `/content/qwen2vl_merged-bf16`. The merged model can then be pushed to a Hugging Face Hub repository (e.g., `lordChipotle/nutrition-label-detector`).
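A rough sketch of the merge-and-push step, assuming the checkpoint and output paths above (the notebook's exact code may differ):

```python
import torch
from peft import PeftModel
from transformers import AutoModelForVision2Seq, AutoProcessor

BASE_MODEL = "Qwen/Qwen2-VL-7B-Instruct"
ADAPTER_DIR = "qwen2vl_qlora_sft/checkpoint-51"
MERGED_DIR = "/content/qwen2vl_merged-bf16"

# Load the base model unquantized in bf16 and fold the LoRA weights into it
base = AutoModelForVision2Seq.from_pretrained(
    BASE_MODEL, torch_dtype=torch.bfloat16, device_map="auto", trust_remote_code=True
)
merged = PeftModel.from_pretrained(base, ADAPTER_DIR).merge_and_unload()

# Save the merged weights and processor; optionally push them to the Hub
merged.save_pretrained(MERGED_DIR)
processor = AutoProcessor.from_pretrained(BASE_MODEL, trust_remote_code=True)
processor.save_pretrained(MERGED_DIR)
# merged.push_to_hub("lordChipotle/nutrition-label-detector")
# processor.push_to_hub("lordChipotle/nutrition-label-detector")
```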
Inference with the merged model:
Use `AutoModelForVision2Seq` and `AutoProcessor` from the Hugging Face `transformers` library, loading the model from your Hub repository or the saved `OUTPUT_DIR`.
```python
from transformers import AutoModelForVision2Seq, AutoProcessor
from PIL import Image
import torch

# Example: load from the local merged directory or your Hugging Face Hub repo
MODEL_PATH = "/content/qwen2vl_merged-bf16"  # or "lordChipotle/nutrition-label-detector"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
DTYPE = torch.bfloat16

model = AutoModelForVision2Seq.from_pretrained(
    MODEL_PATH,
    torch_dtype=DTYPE,
    device_map="auto",
    trust_remote_code=True,  # Qwen models may require this
)
processor = AutoProcessor.from_pretrained(MODEL_PATH, trust_remote_code=True)

# Prepare the image and the detection prompt
img = Image.open("path/to/your/image.jpg").convert("RGB")
prompt_text = "Detect the bounding box of the nutrition table."
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": img},
            {"type": "text", "text": prompt_text},
        ],
    }
]
prompt_chatml = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = processor(
    text=prompt_chatml,
    images=[img],  # single image for simplicity
    return_tensors="pt",
).to(DEVICE)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)
# Drop the prompt tokens so only the generated answer is decoded
generated_ids = output_ids[:, inputs["input_ids"].shape[1]:]
response = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
# Parse the bounding box from the response (see the sketch below)
```
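The decoded response contains the box coordinates in [0,1000) space. A small parsing sketch follows; the helper name and regex are illustrative, not taken from the notebook.

```python
import re

def parse_qwen_box(response: str, image_width: int, image_height: int):
    """Extract the last "(x0,y0),(x1,y1)" pair and rescale it from [0,1000) to pixels."""
    matches = re.findall(r"\((\d+),(\d+)\),\((\d+),(\d+)\)", response)
    if not matches:
        return None
    x0, y0, x1, y1 = map(int, matches[-1])
    return (
        x0 / 1000 * image_width,
        y0 / 1000 * image_height,
        x1 / 1000 * image_width,
        y1 / 1000 * image_height,
    )

# Example usage:
# box = parse_qwen_box(response, *img.size)
# print(box)  # pixel-space (x0, y0, x1, y1), or None if no box was produced
```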