---
license: apache-2.0
tags:
- text-generation-inference
- image-captioning
- optical-character-recognition
- intelligent-character-recognition
- caption
- ocr
- visual-understanding
- art
- icr
- image-to-text
- vlm
language:
- en
- zh
library_name: transformers
pipeline_tag: image-text-to-text
base_model:
- Qwen/Qwen2.5-VL-7B-Instruct
---

# Lh41-1042-Magellanic-7B-0711
The **Lh41-1042-Magellanic-7B-0711** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, optimized for image captioning, visual analysis, and image reasoning. Built on the Qwen2.5-VL architecture, this experimental model was trained on 3,000K image pairs to strengthen image understanding and reasoning across all categories of images with varying dimensions.
## Key Enhancements
- **Advanced Image Captioning**: Generates detailed, contextually accurate descriptions of images across diverse categories and dimensions.
- **Enhanced Visual Analysis**: Efficiently analyzes and interprets complex visual content, patterns, and relationships within images.
- **Superior Image Reasoning**: Optimized for logical reasoning and inference over visual information, enabling complex visual question answering.
- **Multi-Category Image Support**: Handles all categories of images with varying dimensions, from simple objects to complex scenes.
- **Strong Performance Across Resolutions**: Achieves competitive results on OCR and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.
- **Video Understanding (20+ minutes)**: Supports detailed comprehension of long videos for content summarization, Q&A, and multi-modal reasoning (a minimal video-inference sketch follows the Quick Start example below).
- **Visually-Grounded Device Interaction**: Enables mobile/robotic device operation from visual inputs and text instructions, using contextual understanding and decision-making logic.
## Quick Start with Transformers
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info  # requires: pip install qwen-vl-utils

# Load the model with automatic dtype selection and device placement
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Lh41-1042-Magellanic-7B-0711", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/Lh41-1042-Magellanic-7B-0711")

# A chat message combining an image and a text prompt
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image."},
        ],
    }
]

# Render the chat template and extract vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then decode only the newly generated tokens
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
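Video inputs reuse the same pipeline. Below is a minimal sketch, assuming the `model` and `processor` from the snippet above are already loaded; the video path is a placeholder, and frame sampling is left to the `qwen_vl_utils` defaults.

```python
# Placeholder path -- replace with a real local video file or URL
video_messages = [
    {
        "role": "user",
        "content": [
            {"type": "video", "video": "file:///path/to/video.mp4"},
            {"type": "text", "text": "Summarize this video."},
        ],
    }
]

text = processor.apply_chat_template(
    video_messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(video_messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to("cuda")

generated_ids = model.generate(**inputs, max_new_tokens=256)
output_text = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)],
    skip_special_tokens=True,
)
print(output_text)
```

Note that long videos can consume substantial memory (see Limitations below).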
## Intended Use
This model is intended for:
- Advanced image captioning with contextually rich and detailed descriptions.
- High-fidelity visual analysis and interpretation of complex visual content.
- Image reasoning tasks requiring logical inference and pattern recognition.
- Visual question answering for educational and enterprise applications.
- Multi-modal content understanding across diverse image categories and dimensions.
- Automated image description generation for accessibility and content management.
- Visual content analysis for creative and professional applications.
- Robotic or mobile automation with vision-guided contextual interaction.
## Training Details
| Parameter | Value |
|---|---|
| Dataset Size | 3,000K image pairs |
| Model Architecture | `Qwen2_5_VLForConditionalGeneration` |
| Total Disk Volume | 600,000 MB (~600 GB) |
| Training Time | approx. 16,488 seconds (~4.58 hours) |
| Model Stage | Experimental |
| Hardware | 3 × NVIDIA A40 (29 vCPUs) |
| Warmup Steps | 750 |
| Precision | bfloat16 |
## Limitations
- May show degraded performance on extremely low-quality or occluded images.
- Not optimized for real-time applications on low-resource or edge devices due to computational demands.
- Variable accuracy on uncommon visual patterns or highly specialized domain images.
- Long video processing may require substantial memory and is not optimized for streaming applications.
- Performance is sensitive to visual token settings (e.g., `min_pixels`/`max_pixels`); suboptimal configurations can degrade results (see the configuration sketch after this list).
- In rare cases, outputs may contain hallucinated or contextually misaligned information.
- As an experimental model, performance may vary across different use cases and requires further validation.
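As noted above, visual token settings affect both output quality and memory use. Per the upstream Qwen2.5-VL processor API, the per-image visual token budget can be bounded with `min_pixels` and `max_pixels`; the bounds below are illustrative values, not tuned recommendations.

```python
from transformers import AutoProcessor

# Each visual token corresponds to a 28x28 pixel patch, so bounding the
# pixel budget bounds the number of visual tokens per image.
# These values are illustrative, not tuned recommendations.
min_pixels = 256 * 28 * 28
max_pixels = 1280 * 28 * 28

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Lh41-1042-Magellanic-7B-0711",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```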