---
license: apache-2.0
datasets:
- racineai/OGC_2_vdr-visRAG-colpali
language:
- fr
- en
- de
- es
- it
base_model:
- HuggingFaceTB/SmolVLM-500M-Instruct
---

# Flantier-SmolVLM-500M-dse

A lightweight multimodal vision-language model specialized for technical document retrieval.

## Overview

Flantier-SmolVLM-500M-dse (Document Screenshot Embedding) is a 500M-parameter vision-language model designed for efficient retrieval of technical documentation. It encodes document screenshots directly into embeddings, preserving text, images, and layout without a separate content-extraction step.

## Key Features

- **Efficient Retrieval**: Generates document and query embeddings for semantic similarity search
- **Multimodal Understanding**: Processes text, diagrams, charts, and tables in their original layout
- **Lightweight Architecture**: Only 500M parameters, runs on consumer GPUs
- **No Preprocessing Required**: Works directly on document screenshots, with no separate content-extraction step

## Installation

```bash
pip install transformers accelerate pillow
```

## Usage Example

```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load model and processor
processor = AutoProcessor.from_pretrained("racineai/Flantier-SmolVLM-500M-dse")
model = AutoModelForVision2Seq.from_pretrained(
    "racineai/Flantier-SmolVLM-500M-dse",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load document image
document_image = Image.open("technical_document.jpg")

# Process for document embedding
doc_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    },
]
doc_prompt = processor.apply_chat_template(doc_messages, add_generation_prompt=True)
doc_inputs = processor(text=doc_prompt, images=[document_image], return_tensors="pt").to(model.device)

# Generate document embedding
with torch.no_grad():
    doc_outputs = model(**doc_inputs, output_hidden_states=True, return_dict=True)
    doc_embedding = doc_outputs.hidden_states[-1][:, -1]  # Final-layer hidden state of the last token (batch of one, so no padding)
    doc_embedding = torch.nn.functional.normalize(doc_embedding, p=2, dim=-1)

# Process query embedding
query = "What are the specifications of this component?"
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query}
        ]
    },
]
query_prompt = processor.apply_chat_template(query_messages, add_generation_prompt=True)
query_inputs = processor(text=query_prompt, return_tensors="pt").to(model.device)

# Generate query embedding
with torch.no_grad():
    query_outputs = model(**query_inputs, output_hidden_states=True, return_dict=True)
    query_embedding = query_outputs.hidden_states[-1][:, -1]  # Final-layer hidden state of the last token (batch of one, so no padding)
    query_embedding = torch.nn.functional.normalize(query_embedding, p=2, dim=-1)

# Calculate similarity
similarity = torch.nn.functional.cosine_similarity(query_embedding, doc_embedding)
print(f"Similarity score: {similarity.item():.4f}")
```
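The example above embeds one input at a time, so `[:, -1]` always lands on a real token. When several inputs are batched with padding, the last position of a shorter sequence may be a padding token instead. The following is a minimal sketch of last-token pooling driven by the attention mask. The helper name `last_token_pool` is hypothetical and not part of this model's API, and the toy tensors stand in for real model outputs.

```python
import torch


def last_token_pool(hidden: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Return the L2-normalized hidden state of each sequence's last real token.

    hidden:         (batch, seq_len, dim) final-layer hidden states
    attention_mask: (batch, seq_len) with 1 for real tokens and 0 for padding
    """
    # Index of the last real token in each row. Multiplying the mask by the
    # position index zeroes out padding, so argmax finds the right position
    # for both left- and right-padded batches.
    positions = torch.arange(attention_mask.shape[1], device=attention_mask.device)
    last_index = (attention_mask * positions).argmax(dim=1)
    batch_index = torch.arange(hidden.shape[0], device=hidden.device)
    pooled = hidden[batch_index, last_index]
    return torch.nn.functional.normalize(pooled, p=2, dim=-1)


# Toy stand-in for model outputs: two sequences of length 3, dimension 2.
hidden = torch.tensor([
    [[1.0, 0.0], [0.0, 2.0], [9.0, 9.0]],  # last position is padding
    [[1.0, 0.0], [0.0, 1.0], [3.0, 4.0]],  # no padding
])
mask = torch.tensor([[1, 1, 0], [1, 1, 1]])

embeddings = last_token_pool(hidden, mask)
print(embeddings)
```

With real inputs, `hidden` would be `outputs.hidden_states[-1]` and `attention_mask` would come from the processor output, assuming the processor returns one for the batch.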

## Applications

- **Technical Document Retrieval**: Find relevant documents based on technical queries
- **Technical Support Systems**: Match user questions to relevant documentation
- **Engineering Knowledge Management**: Index and search technical specifications, diagrams, and reports
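
For the retrieval use cases listed above, document embeddings are typically computed once, stacked into a matrix, and scored against each incoming query. The sketch below illustrates that ranking step with toy vectors in place of real embeddings. The function name `rank_documents` and the `top_k` parameter are illustrative assumptions, not part of this model's API.

```python
import torch


def rank_documents(query_embedding: torch.Tensor, doc_embeddings: torch.Tensor, top_k: int = 3):
    """Return (scores, indices) of the top_k documents for one query.

    query_embedding: (dim,) L2-normalized query vector
    doc_embeddings:  (num_docs, dim) L2-normalized document vectors
    """
    # For unit-length vectors the dot product equals cosine similarity.
    scores = doc_embeddings @ query_embedding
    top_k = min(top_k, doc_embeddings.shape[0])
    return torch.topk(scores, k=top_k)


# Toy 2-D vectors standing in for real document and query embeddings.
docs = torch.nn.functional.normalize(
    torch.tensor([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]), p=2, dim=-1
)
query = torch.nn.functional.normalize(torch.tensor([0.9, 0.1]), p=2, dim=-1)

scores, indices = rank_documents(query, docs, top_k=2)
for score, idx in zip(scores.tolist(), indices.tolist()):
    print(f"doc {idx}: {score:.4f}")
```

For larger collections, the same normalized embeddings could be handed to an approximate nearest-neighbor index instead of a dense matrix product.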

## Training Methodology

This model was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This removes the need for a content-extraction preprocessing step while preserving the visual and textual information in each document.

## Citation

```bibtex
@misc{flantier-smolvlm-dse,
  author = {racine.ai},
  title = {Flantier-SmolVLM-500M-dse: A Lightweight Document Screenshot Embedding Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/racineai/Flantier-SmolVLM-500M-dse}
}
```

## License

This model is released under the Apache 2.0 license.