---
license: apache-2.0
language:
- en
- zh
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- icr
- image-to-text
- vlm
---

# **WR30a-Deep-7B-0711**

> The **WR30a-Deep-7B-0711** model is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, optimized for **Image Captioning**, **Visual Analysis**, and **Image Reasoning**. Built on the Qwen2.5-VL architecture, this experimental model strengthens visual comprehension through focused training on 1.5 million image pairs, improving image understanding and reasoning across all categories of images at varying dimensions.

# Key Enhancements

* **Superior Image Captioning**: Generates detailed, contextually accurate captions for diverse image types and content.

* **Enhanced Visual Analysis**: Efficiently analyzes and interprets complex visual information across different image categories and formats.

* **Advanced Image Reasoning**: Optimized for logical reasoning about visual content, understanding relationships, and drawing inferences from images.

* **Multi-Category Image Support**: Handles all categories of images, from simple objects to complex scenes, at varying dimensions.

* **Competitive Performance**: Achieves competitive results on visual understanding benchmarks and real-world image analysis tasks.

* **Dimensional Flexibility**: Supports images of various resolutions and aspect ratios (see the resolution-control sketch after this list).

* **Cross-Domain Visual Understanding**: Performs robustly across different visual domains and content types.

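A minimal sketch of that resolution control, using the `min_pixels`/`max_pixels` knobs exposed by the upstream Qwen2.5-VL processor (the specific bounds below are illustrative, not values from this card):

```python
from transformers import AutoProcessor

# Bound the pixel area the processor resizes images into, which in turn
# bounds the number of visual tokens per image. Qwen2.5-VL works in
# 28x28-pixel patches, so bounds are expressed as multiples of 28*28.
min_pixels = 256 * 28 * 28   # illustrative lower bound
max_pixels = 1280 * 28 * 28  # illustrative upper bound
processor = AutoProcessor.from_pretrained(
    "prithivMLmods/WR30a-Deep-7B-0711",
    min_pixels=min_pixels,
    max_pixels=max_pixels,
)
```

Lower bounds cut memory use and latency on large images; higher bounds preserve fine detail for tasks like dense captioning.
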
# Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint; torch_dtype="auto" picks the stored dtype
# and device_map="auto" places layers across the available GPUs.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/WR30a-Deep-7B-0711", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/WR30a-Deep-7B-0711")

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

# Build the chat prompt and collect the image/video inputs it references.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens so only the reply is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

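The same pipeline also works offline: `qwen_vl_utils` resolves local file paths (and PIL images) in addition to URLs. A small variant, with a placeholder path:

```python
# "file:///path/to/photo.jpg" is a placeholder; point it at a real local file.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/photo.jpg"},
            {"type": "text", "text": "Write a one-sentence caption."},
        ],
    }
]
# The remaining steps (apply_chat_template, process_vision_info, generate)
# are identical to the snippet above.
```
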
# Intended Use

This model is intended for:

* High-quality image captioning across diverse visual content and categories.
* Comprehensive visual analysis and interpretation of complex imagery.
* Advanced image reasoning for educational, research, and commercial applications.
* Multi-dimensional image understanding regardless of resolution or aspect ratio.
* Visual question answering and image-based dialogue systems (a multi-turn sketch follows this list).
* Content moderation and automated image classification tasks.
* Creative applications requiring detailed visual understanding.
* Accessibility tools for image description and visual assistance.

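A minimal multi-turn sketch, reusing `messages`, `processor`, `model`, and `output_text` from the Quick Start snippet (the follow-up question is illustrative):

```python
# Append the model's first reply, then ask a follow-up about the same image.
messages.append(
    {"role": "assistant", "content": [{"type": "text", "text": output_text[0]}]}
)
messages.append(
    {"role": "user", "content": [{"type": "text", "text": "What is in the foreground?"}]}
)

# Rebuild the prompt and re-run generation on the extended history.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to("cuda")
generated_ids = model.generate(**inputs, max_new_tokens=128)
```

The image travels with the chat history and is re-encoded on every turn, so for long dialogues consider trimming old turns to control context length.
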
# Training Details

| Parameter              | Value                                |
|------------------------|--------------------------------------|
| **Dataset Size**       | 1.5M image pairs                     |
| **Model Architecture** | `Qwen2_5_VLForConditionalGeneration` |
| **Total Disk Volume**  | 400,000 MB (~400 GB)                 |
| **Training Time**      | approx. 9,612 seconds (~2.67 hours)  |
| **Model Stage**        | Experimental                         |
| **Hardware**           | 2 × NVIDIA A40 (19 vCPUs)            |
| **Precision**          | bfloat16                             |

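Since the card lists bfloat16 as the training precision, loading the weights in bf16 explicitly matches it (with `torch_dtype="auto"`, as in the Quick Start, transformers typically resolves to the same dtype):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration

# Load in bfloat16 to match the training precision listed above.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/WR30a-Deep-7B-0711",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
```
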
# Limitations

* May show degraded performance on extremely low-quality or heavily corrupted images.
* Not optimized for real-time use on low-resource or edge devices due to its computational demands (see the quantization sketch after this list).
* Accuracy can vary on highly specialized or domain-specific visual content.
* Performance may vary with unusual image compositions or artistic styles.
* As an experimental model, its outputs should be validated before use in critical applications.
* May require further fine-tuning for niche use cases or domains.
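One possible mitigation for constrained GPUs is 4-bit quantization via bitsandbytes; this is a generic transformers recipe, not something validated for this checkpoint, and some quality loss is expected:

```python
import torch
from transformers import BitsAndBytesConfig, Qwen2_5_VLForConditionalGeneration

# NF4 4-bit quantization; compute still runs in bfloat16.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/WR30a-Deep-7B-0711",
    quantization_config=bnb_config,
    device_map="auto",
)
```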