license: apache-2.0
language:
  - en
  - zh
pipeline_tag: image-text-to-text
library_name: transformers
tags:
  - text-generation-inference
  - image-captioning
  - optical-character-recognition
  - intelligent-character-recognition
  - caption
  - ocr
  - visual-understanding
  - art
  - icr
  - image-to-text
  - vlm
base_model:
  - prithivMLmods/VIREX-062225-exp

WR30a-Deep-7B-0711

WR30a-Deep-7B-0711 is an experimental fine-tune of Qwen2.5-VL-7B-Instruct, optimized for image captioning, visual analysis, and image reasoning. It keeps the Qwen2.5-VL architecture and was trained on 1,500K (1.5 million) image pairs to improve image understanding and reasoning across image categories and varying image dimensions.

Key Enhancements

Superior Image Captioning: Advanced capability for generating detailed, contextually accurate captions for diverse image types and content.
Enhanced Visual Analysis: Designed to efficiently analyze and interpret complex visual information across different image categories and formats.
Advanced Image Reasoning: Optimized for logical reasoning about visual content, understanding relationships, and making inferences from images.
Multi-Category Image Support: Handles all categories of images at varying dimensions, from simple objects to complex scenes.
Benchmark Performance: Achieves competitive results on visual understanding benchmarks and real-world image analysis tasks.
Dimensional Flexibility: Supports images of various resolutions and aspect ratios for comprehensive visual processing.
Cross-Domain Visual Understanding: Enables robust performance across different visual domains and content types.
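
To make the resolution and aspect-ratio point concrete, here is a rough sketch of how an image's size might translate into a visual token count. It assumes the convention described in the upstream Qwen2.5-VL model card (roughly one visual token per 28×28 pixel block, with a pixel budget set through the processor's `min_pixels` and `max_pixels` options). The function `estimate_visual_tokens` is a hypothetical helper written for illustration; it approximates the resizing idea and is not the library's exact implementation.

```python
import math

PATCH = 28  # assumption: one visual token per 28x28 pixel block

def estimate_visual_tokens(height, width,
                           min_pixels=256 * PATCH * PATCH,
                           max_pixels=1280 * PATCH * PATCH):
    """Roughly estimate (resized_height, resized_width, tokens) for an image."""
    # Snap each side to the nearest multiple of PATCH, keeping at least one patch.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w, (h // PATCH) * (w // PATCH)

for size in [(1080, 1920), (120, 90), (512, 512)]:
    print(size, "->", estimate_visual_tokens(*size))
```

Under these assumptions, a lower `max_pixels` trades fine detail for fewer visual tokens and less memory, which may matter for large or unusually shaped images.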

Quick Start with Transformers

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

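# Load the fine-tuned weights; device_map="auto" places them on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/WR30a-Deep-7B-0711", torch_dtype="auto", device_map="auto"
)

# The processor bundles the tokenizer and the image preprocessor.
processor = AutoProcessor.from_pretrained("prithivMLmods/WR30a-Deep-7B-0711")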

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Render the chat template to a prompt string, then collect the image and video inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
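# Move inputs to the model's device instead of hard-coding "cuda".
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so only the newly generated tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)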
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
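
The trimming step in the snippet above is easy to misread, so here is a minimal sketch of the same slicing on plain Python lists. The token IDs are made up for illustration, and `trim_prompt_tokens` is a hypothetical helper name; the real code operates on tensors, but the indexing logic is the same.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt tokens from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]

# Toy token IDs standing in for real tensors (values are made up).
prompt_batch = [[101, 7, 8, 9]]
generated_batch = [[101, 7, 8, 9, 42, 43, 2]]

print(trim_prompt_tokens(prompt_batch, generated_batch))  # [[42, 43, 2]]
```

Because `generate` returns the prompt followed by the continuation, skipping this step would cause the decoded output to repeat the prompt text.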

Intended Use

This model is intended for:

High-quality image captioning across diverse visual content and categories.
Comprehensive visual analysis and interpretation of complex imagery.
Advanced image reasoning for educational, research, and commercial applications.
Image understanding across varying resolutions and aspect ratios.
Visual question answering and image-based dialogue systems.
Content moderation and automated image classification tasks.
Creative applications requiring detailed visual understanding.
Accessibility tools for image description and visual assistance.
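
Several of the uses listed above differ only in the prompt sent alongside the image. As a minimal sketch, the helper below builds the chat-style `messages` structure used in the quick start for a few such tasks. `TASK_PROMPTS` and `build_messages` are hypothetical names, and apart from the captioning prompt the wording is illustrative rather than taken from the model's training setup.

```python
# Hypothetical prompt templates; only the "caption" wording appears in the quick start.
TASK_PROMPTS = {
    "caption": "Describe this image in detail.",
    "vqa": "Answer the question about this image: {question}",
    "alt_text": "Write a one-sentence description of this image for a screen reader.",
}

def build_messages(image, task, **fields):
    """Return a single-turn message list pairing one image with a task prompt."""
    prompt = TASK_PROMPTS[task].format(**fields)
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]

messages = build_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    "vqa",
    question="How many people are in the picture?",
)
print(messages[0]["content"][1]["text"])
```

The resulting list can be passed to `processor.apply_chat_template` and `process_vision_info` in the same way as the hand-written `messages` in the quick start.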

Training Details

Parameter	Value
Dataset Size	1,500K (1.5M) image pairs
Model Architecture	`Qwen2_5_VLForConditionalGeneration`
Total Disk Volume	400,000 MB (~400 GB)
Training Time	approx. 9,612 seconds (~2.67 hours)
Model Stage	Experimental
Hardware	2 × NVIDIA A40 (19 vCPUs)
Precision	bfloat16
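
The unit conversions in the table can be checked with a few lines of arithmetic. This sketch only restates the figures given above; it assumes decimal units (1 GB = 1,000 MB) for the disk volume.

```python
training_seconds = 9_612      # from the table
disk_volume_mb = 400_000      # from the table
image_pairs = 1_500 * 1_000   # "1,500K" image pairs

training_hours = training_seconds / 3600
disk_volume_gb = disk_volume_mb / 1000  # assumption: decimal gigabytes

print(f"Training time: {training_hours:.2f} h")   # Training time: 2.67 h
print(f"Disk volume:   {disk_volume_gb:.0f} GB")  # Disk volume:   400 GB
print(f"Image pairs:   {image_pairs:,}")          # Image pairs:   1,500,000
```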

Limitations

May show degraded performance on extremely low-quality or heavily corrupted images.
Not optimized for real-time applications on low-resource or edge devices due to computational demands.
Variable accuracy on highly specialized or domain-specific visual content.
Performance may vary with unusual image compositions or artistic styles.
Because the model is at an experimental stage, outputs should be validated before use in critical applications.
May require fine-tuning for specific niche use cases or domains.