---
license: apache-2.0
datasets:
- racineai/OGC_2_vdr-visRAG-colpali
language:
- fr
- en
- de
- es
- it
base_model:
- HuggingFaceTB/SmolVLM-500M-Instruct
---

# Flantier-SmolVLM-500M-dse

A lightweight multimodal vision-language model specialized for technical document retrieval.

## Overview

Flantier-SmolVLM-500M-dse (Document Screenshot Embedding) is a 500M-parameter vision-language model designed for efficient retrieval of technical documentation. It encodes document screenshots directly into embeddings, so text, images, and layout are captured together without a separate content-extraction step.

## Key Features

- **Efficient Retrieval**: Generates document and query embeddings for semantic similarity search
- **Multimodal Understanding**: Processes text, diagrams, charts, and tables in their original layout
- **Lightweight Architecture**: About 500M parameters, small enough to run on consumer GPUs
- **No Preprocessing Required**: Works directly on document screenshots
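
As a rough illustration of the lightweight claim, the weight footprint can be estimated from the parameter count alone. The sketch below assumes a nominal 500,000,000 parameters and ignores activations, image tokens, and framework overhead, so real memory usage will be higher. `weight_memory_gib` is a hypothetical helper written for this example, not part of any library.

```python
def weight_memory_gib(num_params: int, bytes_per_param: int) -> float:
    """Estimate the memory needed to hold the model weights, in GiB."""
    return num_params * bytes_per_param / 1024**3


# Assumption: nominal parameter count taken from the model name.
NUM_PARAMS = 500_000_000

# Bytes per parameter for two common dtypes.
DTYPE_SIZES = {"float32": 4, "bfloat16": 2}

for dtype_name, size in DTYPE_SIZES.items():
    estimate = weight_memory_gib(NUM_PARAMS, size)
    print(f"{dtype_name}: ~{estimate:.2f} GiB for weights only")
```

Under these assumptions the weights alone take roughly 0.93 GiB in bfloat16 and about 1.86 GiB in float32.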

## Installation

```bash
pip install torch transformers accelerate pillow
```

## Usage Example

```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load model and processor
processor = AutoProcessor.from_pretrained("racineai/Flantier-SmolVLM-500M-dse")
model = AutoModelForVision2Seq.from_pretrained(
    "racineai/Flantier-SmolVLM-500M-dse",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Load document image
document_image = Image.open("technical_document.jpg")

# Build the document prompt: one image placeholder plus a text instruction
doc_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]
doc_prompt = processor.apply_chat_template(doc_messages, add_generation_prompt=True)
doc_inputs = processor(text=doc_prompt, images=[document_image], return_tensors="pt").to(model.device)

# Generate document embedding
with torch.no_grad():
    doc_outputs = model(**doc_inputs, output_hidden_states=True, return_dict=True)
    doc_embedding = doc_outputs.hidden_states[-1][:, -1]  # Last layer, last token
    doc_embedding = torch.nn.functional.normalize(doc_embedding, p=2, dim=-1)

# Build the query prompt (text only)
query = "What are the specifications of this component?"
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query},
        ],
    },
]
query_prompt = processor.apply_chat_template(query_messages, add_generation_prompt=True)
query_inputs = processor(text=query_prompt, return_tensors="pt").to(model.device)

# Generate query embedding
with torch.no_grad():
    query_outputs = model(**query_inputs, output_hidden_states=True, return_dict=True)
    query_embedding = query_outputs.hidden_states[-1][:, -1]  # Last layer, last token
    query_embedding = torch.nn.functional.normalize(query_embedding, p=2, dim=-1)

# Calculate similarity (both embeddings have shape (1, hidden_size))
similarity = torch.nn.functional.cosine_similarity(query_embedding, doc_embedding)
print(f"Similarity score: {similarity.item():.4f}")
```
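
The usage example compares one query with one document. For retrieval over several pages, the same embeddings can be stacked and ranked by cosine similarity. The sketch below is a minimal illustration that uses toy 2-dimensional vectors in place of real model outputs; `rank_documents` and its `top_k` parameter are hypothetical names introduced for this example.

```python
import torch


def rank_documents(query_embedding, doc_embeddings, top_k=3):
    """Return (doc_index, score) pairs sorted by descending cosine similarity.

    query_embedding: tensor of shape (dim,) or (1, dim)
    doc_embeddings:  tensor of shape (num_docs, dim)
    """
    # Cast to float32 and re-normalize so the dot product equals cosine similarity.
    q = torch.nn.functional.normalize(query_embedding.float().reshape(1, -1), p=2, dim=-1)
    d = torch.nn.functional.normalize(doc_embeddings.float(), p=2, dim=-1)
    scores = (q @ d.T).squeeze(0)  # shape: (num_docs,)
    k = min(top_k, scores.numel())
    top = torch.topk(scores, k=k)
    return list(zip(top.indices.tolist(), top.values.tolist()))


# Toy stand-ins for real embeddings (assumption: illustration only).
docs = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = torch.tensor([[1.0, 0.1]])

ranked = rank_documents(query, docs, top_k=2)
for index, score in ranked:
    print(f"doc {index}: {score:.4f}")
```

With real data, `doc_embeddings` would be built by concatenating the per-page `doc_embedding` tensors, for example with `torch.cat`, and can be computed once and reused across queries.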

## Applications

- **Technical Document Retrieval**: Find relevant documents based on technical queries
- **Technical Support Systems**: Match user questions to relevant documentation
- **Engineering Knowledge Management**: Index and search technical specifications, diagrams, and reports

## Training Methodology

This model was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This removes the need for a content-extraction preprocessing step while keeping the visual and textual information of each document available to the encoder.
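
The model card does not state the training objective. Retrievers of this kind are commonly trained with a contrastive (InfoNCE) loss over in-batch negatives, and the sketch below shows that general technique under that assumption. The function name `info_nce_loss` and the temperature value are illustrative and are not taken from the actual training setup.

```python
import torch
import torch.nn.functional as F


def info_nce_loss(query_emb, doc_emb, temperature=0.02):
    """Contrastive loss where row i of doc_emb is the positive for row i of query_emb.

    All other documents in the batch act as negatives.
    The temperature default is an assumption chosen for illustration.
    """
    q = F.normalize(query_emb, p=2, dim=-1)
    d = F.normalize(doc_emb, p=2, dim=-1)
    logits = q @ d.T / temperature  # shape: (batch, batch)
    targets = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, targets)


# Toy batch: four orthogonal "queries" and their matching "documents".
queries = torch.eye(4)
aligned_docs = torch.eye(4)
shuffled_docs = torch.eye(4)[[1, 0, 3, 2]]  # every positive is mismatched

aligned_loss = info_nce_loss(queries, aligned_docs)
shuffled_loss = info_nce_loss(queries, shuffled_docs)
print(f"aligned: {aligned_loss.item():.4f}, shuffled: {shuffled_loss.item():.4f}")
```

The loss is close to zero when each query is most similar to its own document and grows when the pairs are mismatched, which is the signal that pulls matching query and screenshot embeddings together.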

## Citation

```
@misc{flantier-smolvlm-dse,
  author = {racine.ai},
  title = {Flantier-SmolVLM-500M-dse: A Lightweight Document Screenshot Embedding Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/racineai/Flantier-SmolVLM-500M-dse}
}
```

## License

This model is released under the Apache 2.0 license.