---
license: apache-2.0
datasets:
- racineai/OGC_2_vdr-visRAG-colpali
language:
- fr
- en
- de
- es
- it
base_model:
- HuggingFaceTB/SmolVLM-500M-Instruct
---

# Flantier-SmolVLM-500M-dse

A lightweight multimodal vision-language model specialized for technical document retrieval.

## Overview

Flantier-SmolVLM-500M-dse (Document Screenshot Embedding) is a 500M-parameter vision-language model designed for efficient retrieval of technical documentation. It encodes document screenshots directly into embeddings, preserving all information (text, images, and layout) without requiring separate content extraction.

## Key Features

- **Efficient Retrieval**: Generates document and query embeddings for semantic similarity search
- **Multimodal Understanding**: Processes text, diagrams, charts, and tables in their original layout
- **Lightweight Architecture**: Only 500M parameters, runs on consumer GPUs
- **No Preprocessing Required**: Works directly with document screenshots

## Installation

```bash
pip install transformers accelerate pillow
```

## Usage Example

```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

# Load model and processor
processor = AutoProcessor.from_pretrained("racineai/Flantier-SmolVLM-500M-dse")
model = AutoModelForVision2Seq.from_pretrained(
    "racineai/Flantier-SmolVLM-500M-dse",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Load document image
document_image = Image.open("technical_document.jpg")

# Process for document embedding
doc_messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"}
        ]
    },
]
doc_prompt = processor.apply_chat_template(doc_messages, add_generation_prompt=True)
doc_inputs = processor(text=doc_prompt, images=[document_image], return_tensors="pt").to(model.device)

# Generate document embedding
with torch.no_grad():
    doc_outputs = model(**doc_inputs, output_hidden_states=True, return_dict=True)
    doc_embedding = doc_outputs.hidden_states[-1][:, -1]  # Last token embedding
    doc_embedding = torch.nn.functional.normalize(doc_embedding, p=2, dim=-1)

# Process query embedding
query = "What are the specifications of this component?"
query_messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": query}
        ]
    },
]
query_prompt = processor.apply_chat_template(query_messages, add_generation_prompt=True)
query_inputs = processor(text=query_prompt, return_tensors="pt").to(model.device)

# Generate query embedding
with torch.no_grad():
    query_outputs = model(**query_inputs, output_hidden_states=True, return_dict=True)
    query_embedding = query_outputs.hidden_states[-1][:, -1]  # Last token embedding
    query_embedding = torch.nn.functional.normalize(query_embedding, p=2, dim=-1)

# Calculate similarity
similarity = torch.nn.functional.cosine_similarity(query_embedding, doc_embedding)
print(f"Similarity score: {similarity.item():.4f}")
```

## Applications

- **Technical Document Retrieval**: Find relevant documents based on technical queries
- **Technical Support Systems**: Match user questions to relevant documentation
- **Engineering Knowledge Management**: Index and search technical specifications, diagrams, and reports

## Training Methodology

This model was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This eliminates the need for content-extraction preprocessing while preserving all visual and textual information in documents.
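Because DSE produces one embedding per screenshot, corpus retrieval reduces to embedding each document page once and ranking the stored vectors against a query embedding. The sketch below is a minimal illustration built on the usage example above; the `embed_document` and `embed_query` helpers and the image paths are illustrative assumptions, not part of the model's API.

```python
from PIL import Image
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq

processor = AutoProcessor.from_pretrained("racineai/Flantier-SmolVLM-500M-dse")
model = AutoModelForVision2Seq.from_pretrained(
    "racineai/Flantier-SmolVLM-500M-dse",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

def _last_token_embedding(inputs):
    # Forward pass without generation; take the hidden state of the final token
    # and L2-normalize it, mirroring the usage example above.
    with torch.no_grad():
        outputs = model(**inputs, output_hidden_states=True, return_dict=True)
    embedding = outputs.hidden_states[-1][:, -1]
    return torch.nn.functional.normalize(embedding, p=2, dim=-1)

def embed_document(image_path):
    # Hypothetical helper: embed one document screenshot.
    image = Image.open(image_path)
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "What is shown in this image?"}
    ]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, images=[image], return_tensors="pt").to(model.device)
    return _last_token_embedding(inputs)

def embed_query(query):
    # Hypothetical helper: embed a text-only query.
    messages = [{"role": "user", "content": [{"type": "text", "text": query}]}]
    prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=prompt, return_tensors="pt").to(model.device)
    return _last_token_embedding(inputs)

# Placeholder corpus of document screenshots (paths are illustrative).
corpus_paths = ["datasheet_p1.jpg", "datasheet_p2.jpg", "wiring_diagram.jpg"]
corpus_embeddings = torch.cat([embed_document(p) for p in corpus_paths], dim=0)  # (N, hidden)

query_embedding = embed_query("What are the specifications of this component?")  # (1, hidden)

# For L2-normalized vectors, the dot product equals cosine similarity.
scores = (query_embedding @ corpus_embeddings.T).squeeze(0)
for path, score in sorted(zip(corpus_paths, scores.tolist()), key=lambda x: -x[1]):
    print(f"{score:.4f}  {path}")
```

For larger corpora, the precomputed document embeddings would typically be stored in a vector index rather than recomputed per query.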
## Citation

```
@misc{flantier-smolvlm-dse,
  author = {racine.ai},
  title = {Flantier-SmolVLM-500M-dse: A Lightweight Document Screenshot Embedding Model},
  year = {2025},
  publisher = {Hugging Face},
  url = {https://huggingface.co/racineai/Flantier-SmolVLM-500M-dse}
}
```

## License

This model is released under the Apache 2.0 license.