---
license: apache-2.0
language:
- en
- zh
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- icr
- image-to-text
- vlm
---

# **WR30a-Deep-7B-0711**

> The **WR30a-Deep-7B-0711** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, optimized for **Image Captioning**, **Visual Analysis**, and **Image Reasoning**. Built on the Qwen2.5-VL architecture, this experimental model strengthens visual comprehension through focused training on 1.5 million image pairs, improving image understanding and reasoning across all categories of images at varying dimensions.

# Key Enhancements

* **Superior Image Captioning**: Generates detailed, contextually accurate captions for diverse image types and content.

* **Enhanced Visual Analysis**: Efficiently analyzes and interprets complex visual information across different image categories and formats.

* **Advanced Image Reasoning**: Optimized for logical reasoning about visual content, understanding relationships, and drawing inferences from images.

* **Multi-Category Image Support**: Handles all categories of images, from simple objects to complex scenes, at varying dimensions.

* **Competitive Performance**: Achieves competitive results on visual understanding benchmarks and real-world image analysis tasks.

* **Dimensional Flexibility**: Supports images of various resolutions and aspect ratios (see the resolution-control sketch after this list).

* **Cross-Domain Visual Understanding**: Performs robustly across different visual domains and content types.

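A minimal sketch of that resolution control, using the `min_pixels`/`max_pixels` knobs exposed by the upstream Qwen2.5-VL processor (the specific bounds below are illustrative, not values from this card):

```python
from transformers import AutoProcessor

# Bound the pixel area the processor resizes images into, which in turn
# bounds the number of visual tokens per image. Qwen2.5-VL works in
# 28x28-pixel patches, so bounds are expressed as multiples of 28*28.
min_pixels = 256 * 28 * 28   # illustrative lower bound
max_pixels = 1280 * 28 * 28  # illustrative upper bound
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/WR30a-Deep-7B-0711",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```

Lower bounds cut memory use and latency on large images; higher bounds preserve fine detail for tasks like dense captioning.
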
# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint; torch_dtype="auto" picks the stored dtype
# and device_map="auto" places layers across the available GPUs.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/WR30a-Deep-7B-0711", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/WR30a-Deep-7B-0711")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Build the chat prompt and collect the image/video inputs it references.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens so only the reply is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

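The same pipeline also works offline: `qwen_vl_utils` resolves local file paths (and PIL images) in addition to URLs. A small variant, with a placeholder path:

```python
# "file:///path/to/photo.jpg" is a placeholder; point it at a real local file.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/photo.jpg"},
            {"type": "text", "text": "Write a one-sentence caption."},
        ],
    }
]
# The remaining steps (apply_chat_template, process_vision_info, generate)
# are identical to the snippet above.
```
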
# Intended Use

This model is intended for:

* High-quality image captioning across diverse visual content and categories.
* Comprehensive visual analysis and interpretation of complex imagery.
* Advanced image reasoning for educational, research, and commercial applications.
* Multi-dimensional image understanding regardless of resolution or aspect ratio.
* Visual question answering and image-based dialogue systems (a multi-turn sketch follows this list).
* Content moderation and automated image classification tasks.
* Creative applications requiring detailed visual understanding.
* Accessibility tools for image description and visual assistance.

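A minimal multi-turn sketch, reusing `messages`, `processor`, `model`, and `output_text` from the Quick Start snippet (the follow-up question is illustrative):

```python
# Append the model's first reply, then ask a follow-up about the same image.
messages.append(
    {"role": "assistant", "content": [{"type": "text", "text": output_text[0]}]}
)
messages.append(
    {"role": "user", "content": [{"type": "text", "text": "What is in the foreground?"}]}
)

# Rebuild the prompt and re-run generation on the extended history.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)
```

The image travels with the chat history and is re-encoded on every turn, so for long dialogues consider trimming old turns to control context length.
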
# Training Details

| Parameter              | Value                                |
|------------------------|--------------------------------------|
| **Dataset Size**       | 1.5M image pairs                     |
| **Model Architecture** | `Qwen2_5_VLForConditionalGeneration` |
| **Total Disk Volume**  | 400,000 MB (~400 GB)                 |
| **Training Time**      | approx. 9,612 seconds (~2.67 hours)  |
| **Model Stage**        | Experimental                         |
| **Hardware**           | 2 × NVIDIA A40 (19 vCPUs)            |
| **Precision**          | bfloat16                             |

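Since the card lists bfloat16 as the training precision, loading the weights in bf16 explicitly matches it (with `torch_dtype="auto"`, as in the Quick Start, transformers typically resolves to the same dtype):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Load in bfloat16 to match the training precision listed above.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/WR30a-Deep-7B-0711",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
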
# Limitations

* May show degraded performance on extremely low-quality or heavily corrupted images.
* Not optimized for real-time use on low-resource or edge devices due to its computational demands (see the quantization sketch after this list).
* Accuracy can vary on highly specialized or domain-specific visual content.
* Performance may vary with unusual image compositions or artistic styles.
* As an experimental model, its outputs should be validated before use in critical applications.
* May require further fine-tuning for niche use cases or domains.
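One possible mitigation for constrained GPUs is 4-bit quantization via bitsandbytes; this is a generic transformers recipe, not something validated for this checkpoint, and some quality loss is expected:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

# NF4 4-bit quantization; compute still runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/WR30a-Deep-7B-0711",
    quantization_config=bnb_config,
    device_map="auto",
)
```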