Commit d7a064b by paulml · verified · 1 Parent(s): e8c99d6

Create README.md

Files changed (1)
  1. README.md +121 -0
README.md ADDED
@@ -0,0 +1,121 @@
---
license: apache-2.0
datasets:
- racineai/OGC_2_vdr-visRAG-colpali
language:
- fr
- en
- de
- es
- it
base_model:
- HuggingFaceTB/SmolVLM-500M-Instruct
---

# Flantier-SmolVLM-500M-dse

A lightweight multimodal vision-language model specialized for technical document retrieval.

## Overview

Flantier-SmolVLM-500M-dse (Document Screenshot Embedding) is a 500M-parameter vision-language model designed for efficient retrieval of technical documentation. It encodes document screenshots directly into embeddings, preserving text, images, and layout without a separate content-extraction step.

## Key Features

- **Efficient Retrieval**: Generates document and query embeddings for semantic similarity search
- **Multimodal Understanding**: Processes text, diagrams, charts, and tables in their original layout
- **Lightweight Architecture**: Only 500M parameters, runs on consumer GPUs
- **No Preprocessing Required**: Works directly on document screenshots

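As a rough sanity check on the "runs on consumer GPUs" point, the memory needed just to hold the weights can be estimated from the parameter count and the dtype size. This is a back-of-the-envelope sketch: it assumes the nominal 500M figure (the exact parameter count may differ) and ignores activations, image tokens, and framework overhead. The helper name `weight_memory_gib` is made up for this illustration.

```python
def weight_memory_gib(num_params, bytes_per_param):
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 1024**3

# Assumption: nominal parameter count taken from the model name.
NOMINAL_PARAMS = 500_000_000

for dtype_name, nbytes in [("float32", 4), ("bfloat16", 2)]:
    print(f"{dtype_name}: {weight_memory_gib(NOMINAL_PARAMS, nbytes):.2f} GiB")
```

Under these assumptions the weights take a little under 1 GiB in bfloat16 and a little under 2 GiB in float32.
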
## Installation

```bash
pip install transformers accelerate pillow
```

## Usage Example

```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load model and processor
processor = AutoProcessor.from_pretrained("racineai/Flantier-SmolVLM-500M-dse")
model = AutoModelForVision2Seq.from_pretrained(
    "racineai/Flantier-SmolVLM-500M-dse",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load document image
document_image = Image.open("technical_document.jpg")

# Process for document embedding
doc_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    },
]
doc_prompt = processor.apply_chat_template(doc_messages, add_generation_prompt=True)
doc_inputs = processor(text=doc_prompt, images=[document_image], return_tensors="pt").to(model.device)

# Generate document embedding
with torch.no_grad():
    doc_outputs = model(**doc_inputs, output_hidden_states=True, return_dict=True)
    doc_embedding = doc_outputs.hidden_states[-1][:, -1]  # Last token embedding
    doc_embedding = torch.nn.functional.normalize(doc_embedding, p=2, dim=-1)

# Process query embedding
query = "What are the specifications of this component?"
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query}
        ]
    },
]
query_prompt = processor.apply_chat_template(query_messages, add_generation_prompt=True)
query_inputs = processor(text=query_prompt, return_tensors="pt").to(model.device)

# Generate query embedding
with torch.no_grad():
    query_outputs = model(**query_inputs, output_hidden_states=True, return_dict=True)
    query_embedding = query_outputs.hidden_states[-1][:, -1]  # Last token embedding
    query_embedding = torch.nn.functional.normalize(query_embedding, p=2, dim=-1)

# Calculate similarity
similarity = torch.nn.functional.cosine_similarity(query_embedding, doc_embedding)
print(f"Similarity score: {similarity.item():.4f}")
```

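The example above scores a single query against a single document. For retrieval over a collection, the same cosine similarity can be computed against every document embedding and the results sorted. The following is a minimal sketch in plain Python, using toy vectors in place of real model embeddings; the helper names `l2_normalize` and `rank_documents` are hypothetical and not part of this model's API.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)

def rank_documents(query_embedding, doc_embeddings, top_k=3):
    """Return (doc_index, cosine_score) pairs, best match first."""
    q = l2_normalize(query_embedding)
    scores = []
    for idx, doc in enumerate(doc_embeddings):
        d = l2_normalize(doc)
        # Dot product of unit vectors equals cosine similarity.
        scores.append((idx, sum(a * b for a, b in zip(q, d))))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]

# Toy 3-dimensional vectors standing in for real model embeddings.
toy_docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
toy_query = [1.0, 0.1, 0.0]

ranking = rank_documents(toy_query, toy_docs, top_k=2)
print(ranking)
```

With real embeddings, each tensor row would first need converting to a list of floats (for instance via something like `doc_embedding[0].float().tolist()`), or the same ranking could be done directly on tensors with a matrix product. For large collections a vector index is likely a better fit than this linear scan.
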
## Applications

- **Technical Document Retrieval**: Find relevant documents based on technical queries
- **Technical Support Systems**: Match user questions to relevant documentation
- **Engineering Knowledge Management**: Index and search technical specifications, diagrams, and reports

## Training Methodology

This model was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This eliminates the need for content extraction preprocessing while preserving all visual and textual information in documents.

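This README does not state the training objective. Embedding retrievers of this kind are commonly trained with a contrastive (InfoNCE) loss over query/document pairs using in-batch negatives, so the sketch below illustrates that general idea only. The loss form, the `temperature` value, and the function name `info_nce_loss` are all assumptions, not details confirmed for this model.

```python
import math

def info_nce_loss(query_embeddings, doc_embeddings, temperature=0.02):
    """Mean InfoNCE loss over a batch.

    doc_embeddings[i] is treated as the positive for query_embeddings[i];
    every other document in the batch acts as a negative.
    Inputs are lists of L2-normalized vectors (lists of floats).
    """
    losses = []
    for i, q in enumerate(query_embeddings):
        logits = [sum(a * b for a, b in zip(q, d)) / temperature
                  for d in doc_embeddings]
        # Numerically stable log-sum-exp over all documents in the batch.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(l - max_logit) for l in logits)
        )
        # Negative log-probability of picking the positive document.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)

# Toy batch: each query matches the document at the same index.
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]

print(info_nce_loss(queries, aligned_docs, temperature=1.0))
print(info_nce_loss(queries, swapped_docs, temperature=1.0))
```

The loss is lower when each query is closest to its own document, which is the behaviour retrieval relies on at inference time.
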
## Citation

```
@misc{flantier-smolvlm-dse,
  author = {racine.ai},
  title = {Flantier-SmolVLM-500M-dse: A Lightweight Document Screenshot Embedding Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/racineai/Flantier-SmolVLM-500M-dse}
}
```

## License

This model is released under the Apache 2.0 license.