	---
	license: apache-2.0
	language:
	- en
	- zh
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- text-generation-inference
	- image-captioning
	- optical-character-recognition
	- intelligent-character-recognition
	- caption
	- ocr
	- visual-understanding
	- art
	- icr
	- image-to-text
	- vlm
	base_model:
	- prithivMLmods/VIREX-062225-exp
	---

	![1.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/YBTwin1Fqn2NUBP5oKos2.png)

	# WR30a-Deep-7B-0711

	> WR30a-Deep-7B-0711 is a fine-tuned version of Qwen2.5-VL-7B-Instruct, optimized for image captioning, visual analysis, and image reasoning. Built on the Qwen2.5-VL architecture, this experimental model was trained on 1,500K (1.5 million) image pairs to improve visual comprehension and reasoning across image categories and varying image dimensions.

	# Key Enhancements

	* Superior Image Captioning: Advanced capability for generating detailed, contextually accurate captions for diverse image types and content.

	* Enhanced Visual Analysis: Designed to efficiently analyze and interpret complex visual information across different image categories and formats.

	* Advanced Image Reasoning: Optimized for logical reasoning about visual content, understanding relationships, and making inferences from images.

	* Multi-Category Image Support: Handles images of all categories and varying dimensions, from simple objects to complex scenes.

	* State-of-the-Art Performance: Achieves competitive results on visual understanding benchmarks and real-world image analysis tasks.

	* Dimensional Flexibility: Supports images of various resolutions and aspect ratios for comprehensive visual processing.

	* Cross-Domain Visual Understanding: Enables robust performance across different visual domains and content types.

	# Quick Start with Transformers

	```python
	from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
	from qwen_vl_utils import process_vision_info

	# Load the model; device_map="auto" places the weights on the available device(s).
	model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
	    "prithivMLmods/WR30a-Deep-7B-0711", torch_dtype="auto", device_map="auto"
	)

	processor = AutoProcessor.from_pretrained("prithivMLmods/WR30a-Deep-7B-0711")

	# A single user turn containing one image and one text instruction.
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {
	                "type": "image",
	                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
	            },
	            {"type": "text", "text": "Describe this image in detail."},
	        ],
	    }
	]

	# Build the chat prompt and extract the image/video inputs from the messages.
	text = processor.apply_chat_template(
	    messages, tokenize=False, add_generation_prompt=True
	)
	image_inputs, video_inputs = process_vision_info(messages)
	inputs = processor(
	    text=[text],
	    images=image_inputs,
	    videos=video_inputs,
	    padding=True,
	    return_tensors="pt",
	)
	# Move the inputs to the model's device instead of hard-coding "cuda".
	inputs = inputs.to(model.device)

	# Generate, then drop the prompt tokens so only newly generated tokens are decoded.
	generated_ids = model.generate(**inputs, max_new_tokens=128)
	generated_ids_trimmed = [
	    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
	]
	output_text = processor.batch_decode(
	    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
	)
	print(output_text)
	```
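
The resolution flexibility listed under Key Enhancements has a cost: larger images produce more visual tokens. In the upstream Qwen2.5-VL usage notes, the processor accepts `min_pixels` and `max_pixels` bounds, with each visual token covering roughly a 28 × 28 pixel area. The sketch below only does that arithmetic. The helper name `pixel_budget` and the 256 to 1280 token range are illustrative assumptions, not settings taken from this model card.

```python
# Assumption: one visual token covers about 28 x 28 pixels, as described in the
# upstream Qwen2.5-VL usage notes. `pixel_budget` is a hypothetical helper.
PIXELS_PER_TOKEN = 28 * 28


def pixel_budget(min_tokens, max_tokens):
    """Convert a visual-token range into a (min_pixels, max_pixels) pair."""
    if min_tokens <= 0 or max_tokens < min_tokens:
        raise ValueError("expected 0 < min_tokens <= max_tokens")
    return min_tokens * PIXELS_PER_TOKEN, max_tokens * PIXELS_PER_TOKEN


# Illustrative range only; tune it to your memory and latency budget.
min_pixels, max_pixels = pixel_budget(256, 1280)
print(min_pixels, max_pixels)

# The bounds could then be passed when loading the processor, for example:
# processor = AutoProcessor.from_pretrained(
#     "prithivMLmods/WR30a-Deep-7B-0711",
#     min_pixels=min_pixels,
#     max_pixels=max_pixels,
# )
```

A narrower range trades fine detail for lower memory use and faster generation, which may matter more for captioning large batches than for reading small text in an image.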

	# Intended Use

	This model is intended for:

	* High-quality image captioning across diverse visual content and categories.
	* Comprehensive visual analysis and interpretation of complex imagery.
	* Advanced image reasoning for educational, research, and commercial applications.
	* Image understanding across varying resolutions and aspect ratios.
	* Visual question answering and image-based dialogue systems.
	* Content moderation and automated image classification tasks.
	* Creative applications requiring detailed visual understanding.
	* Accessibility tools for image description and visual assistance.
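
Several of these uses (captioning, visual question answering, accessibility descriptions) differ only in the text prompt sent alongside the image. A minimal sketch of a prompt-switching helper follows. `build_messages`, the task names, and the prompt wording are hypothetical; only the message layout is taken from the Quick Start example.

```python
# Hypothetical task-to-prompt mapping; the wording is illustrative only.
TASK_PROMPTS = {
    "caption": "Describe this image in detail.",
    "alt_text": "Write a one-sentence description of this image for a screen reader.",
}


def build_messages(image, task="caption", question=None):
    """Return a single-turn message list in the Quick Start format.

    If `question` is given it is used as the prompt (visual question
    answering); otherwise the prompt is looked up by `task`.
    """
    if question is not None:
        prompt = question
    elif task in TASK_PROMPTS:
        prompt = TASK_PROMPTS[task]
    else:
        raise ValueError(f"unknown task: {task!r}")
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    ]


demo_image = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"
vqa_messages = build_messages(demo_image, question="What is the person doing?")
print(vqa_messages[0]["content"][1]["text"])
```

The returned list can be passed to `processor.apply_chat_template` and `process_vision_info` in place of the hand-written `messages` from the Quick Start.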

	# Training Details

	| Parameter          | Value                                 |
	|--------------------|---------------------------------------|
	| Dataset Size       | 1,500K (1.5 million) image pairs      |
	| Model Architecture | `Qwen2_5_VLForConditionalGeneration`  |
	| Total Disk Volume  | 400,000 MB (~400 GB)                  |
	| Training Time      | approx. 9,612 seconds (~2.67 hours)   |
	| Model Stage        | Experimental                          |
	| Hardware           | 2 × NVIDIA A40 (19 vCPUs)             |
	| Precision          | bfloat16                              |

	# Limitations

	* May show degraded performance on extremely low-quality or heavily corrupted images.
	* Not optimized for real-time applications on low-resource or edge devices due to computational demands.
	* Variable accuracy on highly specialized or domain-specific visual content.
	* Performance may vary with unusual image compositions or artistic styles.
	* Because the model is at an experimental stage, outputs should be validated before use in critical applications.
	* May require fine-tuning for specific niche use cases or domains.
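
To put the low-resource limitation in rough numbers, weight memory can be estimated from the parameter count and the precision. The sketch below assumes about 7 billion parameters (read off the "7B" in the model name; the exact count is not stated in this card) and 2 bytes per parameter for bfloat16. It ignores activations, the KV cache, and image features, so real usage will be higher.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "bfloat16": 2, "float16": 2}


def weight_memory_gb(num_params, dtype="bfloat16"):
    """Estimate memory for the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# Assumption: roughly 7e9 parameters, inferred from the "7B" in the model name.
approx_params = 7e9
print(f"bfloat16 weights: ~{weight_memory_gb(approx_params):.1f} GB")
print(f"float32 weights:  ~{weight_memory_gb(approx_params, 'float32'):.1f} GB")
```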