prithivMLmods committed
Commit 1cd8b25 · verified · 1 Parent(s): abb58a1

Update README.md

Files changed (1)
  1. README.md +105 -1
README.md CHANGED
@@ -8,6 +8,7 @@ tags:
  - OCR
  - Receipt
  - VisionOCR
+ - Messy Handwriting OCR
  datasets:
  - linxy/LaTeX_OCR
  - mychen76/ds_receipts_v2_eval
@@ -17,4 +18,107 @@ base_model:
  - Qwen/Qwen2.5-VL-3B-Instruct
  pipeline_tag: image-text-to-text
  library_name: transformers
- ---
+ ---
+
+ ![OCR.png](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/Xn8x267VedkZf6HFRsROD.png)
+
+ # **visionOCR-3B-061125**
+
+ > The **visionOCR-3B-061125** model is a fine-tuned version of **Qwen/Qwen2.5-VL-3B-Instruct**, optimized for **Document-Level Optical Character Recognition (OCR)**, **long-context vision-language understanding**, and **accurate image-to-text conversion with mathematical LaTeX formatting**. Built on the Qwen2.5-VL architecture, it improves document comprehension, structured data extraction, and visual reasoning across diverse input formats.
+
+ # Key Enhancements
+
+ * **Advanced Document-Level OCR**: Extracts structured content from complex, multi-page documents such as invoices, academic papers, forms, and scanned reports.
+
+ * **Enhanced Long-Context Vision-Language Understanding**: Handles dense document layouts, long sequences of embedded text, tables, and diagrams with coherent cross-reference understanding.
+
+ * **State-of-the-Art Performance Across Resolutions**: Achieves competitive results on OCR and visual QA benchmarks such as DocVQA, MathVista, RealWorldQA, and MTVQA.
+
+ * **Long Video Understanding (20+ minutes)**: Supports detailed comprehension of long-duration videos for content summarization, Q&A, and multi-modal reasoning (a video-input sketch follows the Quick Start example below).
+
+ * **Visually-Grounded Device Interaction**: Enables mobile/robotic device operation from visual inputs and text instructions, using contextual understanding and decision-making logic.
+
+ # Quick Start with Transformers
+
+ ```python
+ from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
+ from qwen_vl_utils import process_vision_info
+
+ # Load the fine-tuned checkpoint and its paired processor.
+ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
+     "prithivMLmods/visionOCR-3B-061125", torch_dtype="auto", device_map="auto"
+ )
+ processor = AutoProcessor.from_pretrained("prithivMLmods/visionOCR-3B-061125")
+
+ # A single-turn chat message combining one image with a text instruction.
+ messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "image",
+                 "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
+             },
+             {"type": "text", "text": "Describe this image."},
+         ],
+     }
+ ]
+
+ # Render the chat template, gather the vision inputs, and batch everything.
+ text = processor.apply_chat_template(
+     messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(messages)
+ inputs = processor(
+     text=[text],
+     images=image_inputs,
+     videos=video_inputs,
+     padding=True,
+     return_tensors="pt",
+ )
+ inputs = inputs.to(model.device)
+
+ # Generate, then strip the prompt tokens so only new text is decoded.
+ generated_ids = model.generate(**inputs, max_new_tokens=128)
+ generated_ids_trimmed = [
+     out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
+ ]
+ output_text = processor.batch_decode(
+     generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
+ )
+ print(output_text)
+ ```
+
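+ The same chat-template pattern extends to video inputs. Below is a minimal sketch of long-video Q&A under a few assumptions: the file path is a placeholder, and the `fps` and `max_pixels` keys are passed through `process_vision_info` as in the upstream Qwen2.5-VL usage examples; the values shown are illustrative, not tuned.
+
+ ```python
+ # Reuses `model` and `processor` from the Quick Start snippet above.
+ video_messages = [
+     {
+         "role": "user",
+         "content": [
+             {
+                 "type": "video",
+                 "video": "file:///path/to/lecture.mp4",  # placeholder path
+                 "fps": 1.0,               # assumed frame-sampling rate
+                 "max_pixels": 360 * 420,  # caps visual tokens per frame
+             },
+             {"type": "text", "text": "Summarize the key points of this video."},
+         ],
+     }
+ ]
+
+ text = processor.apply_chat_template(
+     video_messages, tokenize=False, add_generation_prompt=True
+ )
+ image_inputs, video_inputs = process_vision_info(video_messages)
+ inputs = processor(
+     text=[text], images=image_inputs, videos=video_inputs,
+     padding=True, return_tensors="pt",
+ ).to(model.device)
+
+ generated_ids = model.generate(**inputs, max_new_tokens=256)
+ trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, generated_ids)]
+ print(processor.batch_decode(trimmed, skip_special_tokens=True))
+ ```
+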
+ # Intended Use
+
+ This model is intended for:
+
+ * High-fidelity OCR from documents, forms, receipts, and printed or scanned materials.
+ * Image- and document-based question answering for educational and enterprise applications.
+ * Extraction and LaTeX formatting of mathematical expressions from printed or handwritten content (see the prompt sketch after this list).
+ * Retrieval and summarization from long documents, slides, and multi-modal inputs.
+ * Multilingual OCR and structured content extraction for global use cases.
+ * Robotic or mobile automation with vision-guided contextual interaction.
+
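+ For OCR-style use, only the text instruction in the Quick Start example changes. A minimal sketch; the prompt wording and image URL are illustrative assumptions, not a canonical recipe:
+
+ ```python
+ # Reuses `model` and `processor` from the Quick Start snippet above.
+ ocr_messages = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image", "image": "https://example.com/scanned_page.png"},  # placeholder URL
+             {
+                 "type": "text",
+                 "text": "Transcribe all text in this image in reading order. "
+                         "Render any mathematical expressions as LaTeX.",
+             },
+         ],
+     }
+ ]
+ # From here, apply the same apply_chat_template / process_vision_info /
+ # generate / batch_decode steps shown in the Quick Start example.
+ ```
+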
+ # Limitations
+
+ * May show degraded performance on extremely low-quality or occluded images.
+ * Not optimized for real-time use on low-resource or edge devices due to its computational demands.
+ * Variable accuracy on uncommon or low-resource languages and scripts.
+ * Long-video processing may require substantial memory and is not optimized for streaming applications.
+ * Visual token settings affect performance; suboptimal configurations can degrade results (see the processor sketch after this list).
+ * In rare cases, outputs may contain hallucinated or contextually misaligned information.
+
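+ The visual-token budget is set on the processor. A minimal sketch, assuming the `min_pixels`/`max_pixels` arguments behave as in the upstream Qwen2.5-VL processor; the values shown are illustrative trade-offs, not tuned recommendations:
+
+ ```python
+ from transformers import AutoProcessor
+
+ # Each visual token corresponds to a 28x28 pixel patch, so bounding the
+ # pixel count bounds the visual tokens per image (~256 to ~1280 here).
+ processor = AutoProcessor.from_pretrained(
+     "prithivMLmods/visionOCR-3B-061125",
+     min_pixels=256 * 28 * 28,
+     max_pixels=1280 * 28 * 28,
+ )
+ ```
+
+ Lower budgets reduce memory and latency at the cost of fine detail (small fonts, dense tables); higher budgets favor OCR fidelity.
+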
+ ## References
+
+ * **DocVLM: Make Your VLM an Efficient Reader**
+   [https://arxiv.org/pdf/2412.08746v1](https://arxiv.org/pdf/2412.08746v1)
+
+ * **YaRN: Efficient Context Window Extension of Large Language Models**
+   [https://arxiv.org/pdf/2309.00071](https://arxiv.org/pdf/2309.00071)
+
+ * **Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution**
+   [https://arxiv.org/pdf/2409.12191](https://arxiv.org/pdf/2409.12191)
+
+ * **Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond**
+   [https://arxiv.org/pdf/2308.12966](https://arxiv.org/pdf/2308.12966)
+
+ * **A Comprehensive and Challenging OCR Benchmark for Evaluating Large Multimodal Models in Literacy**
+   [https://arxiv.org/pdf/2412.02210](https://arxiv.org/pdf/2412.02210)