license: apache-2.0
language:
- en
- zh
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- text-generation-inference
- image-captioning
- optical-character-recognition
- intelligent-character-recognition
- caption
- ocr
- visual-understanding
- art
- icr
- image-to-text
- vlm
base_model:
- prithivMLmods/VIREX-062225-exp
WR30a-Deep-7B-0711
The WR30a-Deep-7B-0711 model is a fine-tuned version of Qwen2.5-VL-7B-Instruct, optimized for image captioning, visual analysis, and image reasoning. Built on the Qwen2.5-VL architecture, this experimental model was trained on 1,500K (1.5 million) image pairs to improve visual comprehension and reasoning across image categories and across varying image dimensions.
Key Enhancements
- Superior Image Captioning: Generates detailed, contextually accurate captions for diverse image types and content.
- Enhanced Visual Analysis: Designed to efficiently analyze and interpret complex visual information across different image categories and formats.
- Advanced Image Reasoning: Optimized for logical reasoning about visual content, understanding relationships, and making inferences from images.
- Multi-Category Image Support: Handles a wide range of image categories and varying dimensions, from simple objects to complex scenes.
- State-of-the-Art Performance: Achieves competitive results on visual understanding benchmarks and real-world image analysis tasks.
- Dimensional Flexibility: Supports images of various resolutions and aspect ratios.
- Cross-Domain Visual Understanding: Enables robust performance across different visual domains and content types.
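To make the "Dimensional Flexibility" point concrete: upstream Qwen2.5-VL preprocessing rescales each image so that both sides are multiples of 28 pixels and the total area fits a pixel budget, then spends roughly one visual token per 28×28 block. The sketch below approximates that rule in plain Python. The function name `estimate_visual_tokens` is hypothetical, and the default budgets are example values assumed for illustration; the actual limits depend on the processor configuration shipped with this checkpoint.

```python
import math

PATCH = 28  # assumed pixels per side covered by one visual token (upstream default)


def estimate_visual_tokens(height, width,
                           min_pixels=256 * PATCH * PATCH,
                           max_pixels=1280 * PATCH * PATCH):
    """Approximate the resized (height, width) and visual-token count for one image."""
    # Round each side to the nearest multiple of PATCH, never below one patch.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = math.floor(height / scale / PATCH) * PATCH
        w = math.floor(width / scale / PATCH) * PATCH
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return h, w, (h // PATCH) * (w // PATCH)


# An image already aligned to the patch grid and inside the budget is left unchanged.
print(estimate_visual_tokens(448, 672))  # (448, 672, 384)
```

Larger images are scaled down before tokenization, so a lower `max_pixels` trades fine detail (small text, for example) for shorter sequences and less memory.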
Quick Start with Transformers
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model and its processor; device_map="auto" places weights on the available devices.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/WR30a-Deep-7B-0711", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("prithivMLmods/WR30a-Deep-7B-0711")

# A single user turn containing one image and one text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Render the chat template as text and collect the image/video inputs it refers to.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the model's device instead of hard-coding "cuda".
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so only the newly generated tokens are decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
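The slicing step in the Quick Start works because `generate` returns each prompt followed by its continuation, so dropping the first `len(in_ids)` tokens leaves only the new tokens. Here is a toy illustration with plain Python lists standing in for tensors; the token values and the helper name `trim_prompt_tokens` are made up for this example.

```python
def trim_prompt_tokens(input_ids, generated_ids):
    """Drop the echoed prompt from the front of each generated sequence."""
    return [out[len(inp):] for inp, out in zip(input_ids, generated_ids)]


# Two hypothetical prompts, already padded to the same length (as padding=True would do).
prompts = [[101, 7, 8, 9], [101, 0, 5, 6]]
# What a decoder-only generate() call returns: the prompt plus the new tokens.
outputs = [[101, 7, 8, 9, 42, 43], [101, 0, 5, 6, 77, 78]]

new_tokens = trim_prompt_tokens(prompts, outputs)
print(new_tokens)  # [[42, 43], [77, 78]]
```

Without this step, the decoded string would begin with the full chat prompt rather than the caption.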
Intended Use
This model is intended for:
- High-quality image captioning across diverse visual content and categories.
- Comprehensive visual analysis and interpretation of complex imagery.
- Advanced image reasoning for educational, research, and commercial applications.
- Image understanding across varying resolutions and aspect ratios.
- Visual question answering and image-based dialogue systems.
- Content moderation and automated image classification tasks.
- Creative applications requiring detailed visual understanding.
- Accessibility tools for image description and visual assistance.
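For the visual question answering and dialogue use case, a conversation can be represented by extending the same `messages` list used in the Quick Start with alternating user and assistant turns. The helpers below are a hypothetical sketch (`start_conversation` and `add_turn` are not part of any library, and the image path is a placeholder). They only build the list, which would then be passed to `processor.apply_chat_template` as in the Quick Start.

```python
def start_conversation(image, question):
    """Create a message list with one user turn holding an image and a question."""
    return [{
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ],
    }]


def add_turn(messages, answer, follow_up):
    """Return a new list with the model's answer and the next user question appended."""
    return messages + [
        {"role": "assistant", "content": [{"type": "text", "text": answer}]},
        {"role": "user", "content": [{"type": "text", "text": follow_up}]},
    ]


# Placeholder image path and made-up dialogue, for illustration only.
chat = start_conversation("file:///path/to/photo.jpg", "What is in this image?")
chat = add_turn(chat, "A dog on a beach.", "What breed does it look like?")
print([m["role"] for m in chat])  # ['user', 'assistant', 'user']
```

The image appears only in the first turn; later questions refer back to it through the conversation history.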
Training Details
Parameter | Value |
---|---|
Dataset Size | 1,500K image pairs |
Model Architecture | Qwen2_5_VLForConditionalGeneration |
Total Disk Volume | 400,000 MB |
Training Time | approx. 9,612 seconds (~2.67 hours) |
Model Stage | Experimental |
Hardware | 2 × NVIDIA A40 (19 vCPUs) |
Precision | bfloat16 |
Limitations
- May show degraded performance on extremely low-quality or heavily corrupted images.
- Not optimized for real-time applications on low-resource or edge devices due to computational demands.
- Variable accuracy on highly specialized or domain-specific visual content.
- Performance may vary with unusual image compositions or artistic styles.
- Because the model is at an experimental stage, outputs should be validated before use in critical applications.
- May require fine-tuning for specific niche use cases or domains.
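As a rough illustration of the computational demands noted above, the memory needed for the weights alone can be estimated from the parameter count and the precision. This sketch assumes about 7 billion parameters (taken from the "7B" in the model name; the exact count is not stated in this card) stored in bfloat16 at 2 bytes each. It ignores activations, the KV cache, and image features, all of which add to the total.

```python
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2, "int8": 1}


def weight_memory_gib(num_params, dtype="bfloat16"):
    """Memory for the weights only, in GiB (2**30 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 2**30


assumed_params = 7e9  # assumption from the "7B" name, not a measured count
print(f"{weight_memory_gib(assumed_params):.1f} GiB")  # 13.0 GiB
```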