--- language: dv tags: - vision - image-to-text - OCR - Dhivehi - PaliGemma2 license: apache-2.0 datasets: - alakxender/dhivehi-vrd-b1-img-questions metrics: - accuracy base_model: - google/paligemma2-3b-pt-224 library_name: transformers --- # PaliGemma2 VRD Dhivehi OCR Model ## Model Description This is a fine-tuned version of the PaliGemma2 model specifically optimized for Optical Character Recognition (OCR) of Dhivehi text in images. The model is based on the `google/paligemma2-3b-pt-224` architecture and has been fine-tuned for improved performance in reading and transcribing Dhivehi text from images. ## Model Details - **Model type:** Vision-Language Model - **Base model:** google/paligemma2-3b-pt-224 - **Fine-tuning approach:** QLoRA - **Input format:** Images with text - **Output format:** Text transcription - **Supported languages:** Primarily Dhivehi ## How to Use ### Option 1: Direct Loading ```python from transformers.image_utils import load_image import torch from transformers import PaliGemmaForConditionalGeneration, AutoProcessor # Print GPU information print(f"CUDA available: {torch.cuda.is_available()}") if torch.cuda.is_available(): print(f"Current GPU: {torch.cuda.get_device_name(0)}") print(f"GPU memory allocated: {torch.cuda.memory_allocated(0) / 1024**2:.2f} MB") print(f"GPU memory cached: {torch.cuda.memory_reserved(0) / 1024**2:.2f} MB") model_id = "alakxender/paligemma2-qlora-vrd-dhivehi-ocr-224-sm" print("Loading model...") model = PaliGemmaForConditionalGeneration.from_pretrained(model_id).to("cuda") processor = AutoProcessor.from_pretrained(model_id) print("Loading image...") image = load_image("ocr1.png") print("Processing image...") prompt = "What text is written in this image?" model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16).to("cuda") input_len = model_inputs["input_ids"].shape[-1] print("Model inputs device:", model_inputs["input_ids"].device) print("Model device:", model.device) print("Generating output...") with torch.inference_mode(): generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False) generation = generation[0][input_len:] decoded = processor.decode(generation, skip_special_tokens=True) print(decoded) print("Done!") ``` ### Option 2: Memory-Efficient PEFT Loading ```python from transformers.image_utils import load_image import torch from transformers import PaliGemmaForConditionalGeneration, AutoProcessor from peft import PeftModel, PeftConfig # Define model ID model_id = "alakxender/paligemma2-qlora-vrd-dhivehi-ocr-224-sm" # Load the PEFT configuration to get the base model path print("Loading PEFT configuration...") peft_config = PeftConfig.from_pretrained(model_id) # Load the base model print(f"Loading base model: {peft_config.base_model_name_or_path}...") base_model = PaliGemmaForConditionalGeneration.from_pretrained( peft_config.base_model_name_or_path, device_map="auto", torch_dtype=torch.bfloat16 ) # Load the adapter on top of the base model print(f"Loading PEFT adapter: {model_id}...") model = PeftModel.from_pretrained(base_model, model_id) # Load the processor from the base model processor = AutoProcessor.from_pretrained(peft_config.base_model_name_or_path) print("Loading image...") image = load_image("ocr1.png") print("Processing image...") prompt = "What text is written in this image?" model_inputs = processor(text=prompt, images=image, return_tensors="pt").to(torch.bfloat16) # Move inputs to the same device as the model if hasattr(model, 'device'): device = model.device else: # If device isn't directly accessible, infer from model parameters device = next(model.parameters()).device model_inputs = {k: v.to(device) for k, v in model_inputs.items()} input_len = model_inputs["input_ids"].shape[-1] # Process without printing device information print("Generating output...") with torch.inference_mode(): generation = model.generate(**model_inputs, max_new_tokens=100, do_sample=False) generation = generation[0][input_len:] decoded = processor.decode(generation, skip_special_tokens=True) print(decoded) print("Done!") ``` ## Training Details - **Base Model:** google/paligemma2-3b-pt-224 - **Dataset:** alakxender/dhivehi-vrd-b1-img-questions - **Training Configuration:** - Batch size: 2 per device - Gradient accumulation steps: 8 - Effective batch size: 16 - Learning rate: 2e-5 - Weight decay: 1e-6 - Adam β2: 0.999 - Warmup steps: 2 - Training steps: 20,000 - Epochs: 1 - Mixed precision: bfloat16 - **QLoRA Configuration:** - Quantization: 4-bit NF4 - LoRA rank (r): 8 - Target modules: - q_proj - k_proj - v_proj - o_proj - gate_proj - up_proj - down_proj - Task type: CAUSAL_LM - Optimizer: paged_adamw_8bit - **Data Processing:** - Image resize method: LANCZOS - Input format: RGB images - Text prompt format: "answer [question]" - **Training Metrics:** - Initial loss: ~15 - Final loss: ~2 - Learning rate: Decreasing from 1.5e-5 to 5e-6 - Gradient norm: Stabilized around 20-60 - Model checkpointing: Every 1000 steps - Logging frequency: Every 100 steps ## Performance The model showed consistent improvement during training: - Loss decreased significantly in the first 5k steps and stabilized afterwards - Gradient norms remained in a healthy range throughout training - Learning rate was automatically adjusted following a linear decay schedule - Training completed successfully with convergence in loss metrics - Training progress was monitored using Weights & Biases ## Model Architecture This model uses Parameter-Efficient Fine-Tuning (PEFT) with QLoRA: - **Quantization:** 4-bit quantization for memory efficiency - **LoRA Adaptation:** Low-rank adaptation of key transformer components - **Memory Optimization:** Uses paged optimizer for efficient memory usage - **Mixed Precision:** bfloat16 for training stability and speed ## Limitations - Primarily optimized for Dhivehi text - Performance may vary with different image qualities and text styles - May or may not perform optimally on handwritten text ## Dataset - **Source Dataset:** alakxender/dhivehi-vrd-images (VRD Batch 1) - **Processed Dataset:** alakxender/dhivehi-vrd-b1-img-questions - **Dataset Size:** - Total: 474,169 samples - Training set: 379,335 samples (80%) - Validation set: 94,834 samples (20%) - **Question Types:** The dataset uses a variety of question prompts for OCR tasks, including: ``` - "What text is written in this image?" - "Can you read and transcribe the Dhivehi text shown in this image?" - "What is the Dhivehi text visible in this image?" - "Please read out the text content from this image" - "What Dhivehi text can you see in this image?" - "Is there any text visible in this image? If so, what does it say?" - "Could you transcribe the Dhivehi text shown in this image?" - "What does the text in this image say?" - "Can you read the Dhivehi text in this image? What does it say?" - "Please identify and transcribe any text visible in this image" - "What Dhivehi text is present in this image?" ``` - **Dataset Format:** - Features: - `image`: Image containing Dhivehi text - `question`: Randomly selected question from the question pool - `answer`: Ground truth Dhivehi text transcription - Processing: Memory-efficient chunked processing (10,000 samples per chunk)