	---
	license: apache-2.0
	datasets:
	- racineai/OGC_2_vdr-visRAG-colpali
	language:
	- fr
	- en
	- de
	- es
	- it
	base_model:
	- HuggingFaceTB/SmolVLM-500M-Instruct
	---

	# Flantier-SmolVLM-500M-dse

	A lightweight multimodal vision-language model specialized for technical document retrieval.

	## Overview

	Flantier-SmolVLM-500M-dse (Document Screenshot Embedding) is a 500M-parameter vision-language model designed for efficient retrieval of technical documentation. It encodes document screenshots directly into embeddings, so text, images, and layout are preserved without a separate content-extraction step.

	## Key Features

	- Efficient Retrieval: Generates document and query embeddings for semantic similarity search
	- Multimodal Understanding: Processes text, diagrams, charts, and tables in their original layout
	- Lightweight Architecture: Only 500M parameters, runs on consumer GPUs
	- No Preprocessing Required: Works directly on document screenshots

	## Installation

	```bash
	pip install transformers accelerate pillow
	```

	## Usage Example

	```python
	from PIL import Image
	import torch
	from transformers import AutoProcessor, AutoModelForVision2Seq

	# Load model and processor
	processor = AutoProcessor.from_pretrained("racineai/Flantier-SmolVLM-500M-dse")
	model = AutoModelForVision2Seq.from_pretrained(
	    "racineai/Flantier-SmolVLM-500M-dse",
	    torch_dtype=torch.bfloat16,
	    device_map="auto",
	)

	# Load document image
	document_image = Image.open("technical_document.jpg")

	# Process for document embedding
	doc_messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image"},
	            {"type": "text", "text": "What is shown in this image?"},
	        ],
	    },
	]
	doc_prompt = processor.apply_chat_template(doc_messages, add_generation_prompt=True)
	doc_inputs = processor(text=doc_prompt, images=[document_image], return_tensors="pt").to(model.device)

	# Generate document embedding
	with torch.no_grad():
	    doc_outputs = model(**doc_inputs, output_hidden_states=True, return_dict=True)
	    doc_embedding = doc_outputs.hidden_states[-1][:, -1]  # Last token embedding
	    doc_embedding = torch.nn.functional.normalize(doc_embedding, p=2, dim=-1)

	# Process query embedding
	query = "What are the specifications of this component?"
	query_messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "text", "text": query},
	        ],
	    },
	]
	query_prompt = processor.apply_chat_template(query_messages, add_generation_prompt=True)
	query_inputs = processor(text=query_prompt, return_tensors="pt").to(model.device)

	# Generate query embedding
	with torch.no_grad():
	    query_outputs = model(**query_inputs, output_hidden_states=True, return_dict=True)
	    query_embedding = query_outputs.hidden_states[-1][:, -1]  # Last token embedding
	    query_embedding = torch.nn.functional.normalize(query_embedding, p=2, dim=-1)

	# Calculate similarity
	similarity = torch.nn.functional.cosine_similarity(query_embedding, doc_embedding)
	print(f"Similarity score: {similarity.item():.4f}")
	```
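
In practice a query is usually compared against many pages rather than one. The sketch below ranks a few document embeddings against a query embedding by cosine similarity. It is framework-free and works on plain Python lists (for example, obtained with something like `doc_embedding[0].float().tolist()`). The `rank_documents` helper, the page names, and the toy 3-dimensional vectors are illustrative assumptions, not part of the model's API.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0.0 else [x / norm for x in vec]


def rank_documents(query_vec, doc_vecs, top_k=3):
    """Return up to top_k (doc_id, cosine similarity) pairs, best first.

    doc_vecs maps a hypothetical document id to its embedding.
    """
    q = l2_normalize(query_vec)
    scores = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalize(vec)
        # For unit vectors, the dot product equals the cosine similarity.
        scores.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scores.sort(key=lambda pair: pair[1], reverse=True)
    return scores[:top_k]


# Toy embeddings standing in for real model outputs.
docs = {
    "pump_manual_p3": [0.9, 0.1, 0.0],
    "wiring_diagram_p1": [0.0, 1.0, 0.0],
    "datasheet_p7": [0.7, 0.7, 0.1],
}
query = [1.0, 0.0, 0.0]

for doc_id, score in rank_documents(query, docs, top_k=2):
    print(f"{doc_id}: {score:.4f}")
```

For larger collections, the same ranking is typically done as a single matrix product over stacked, normalized embeddings, or delegated to a vector index.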

	## Applications

	- Technical Document Retrieval: Find relevant documents based on technical queries
	- Technical Support Systems: Match user questions to relevant documentation
	- Engineering Knowledge Management: Index and search technical specifications, diagrams, and reports

	## Training Methodology

	This model was trained using the Document Screenshot Embedding (DSE) approach, which treats document screenshots as a unified input format. This eliminates the need for content extraction preprocessing while preserving all visual and textual information in documents.
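
This card does not state the training objective. Embedding retrievers of this kind are commonly trained with a contrastive (InfoNCE) loss over in-batch negatives, so the sketch below illustrates that objective on toy similarity scores. It is an assumption about the general approach, not this model's confirmed recipe; the `info_nce_loss` helper and the temperature of 0.02 are hypothetical.

```python
import math


def info_nce_loss(sim_matrix, temperature=0.02):
    """Mean InfoNCE loss for a square query-by-document similarity matrix.

    sim_matrix[i][j] is the cosine similarity between query i and document j.
    The matching (positive) document for query i sits at column i; the other
    columns in that row act as in-batch negatives.
    """
    losses = []
    for i, row in enumerate(sim_matrix):
        logits = [s / temperature for s in row]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy similarities: rows are queries, columns are document screenshots.
well_aligned = [[0.9, 0.1], [0.2, 0.8]]
misaligned = [[0.1, 0.9], [0.8, 0.2]]

print(f"aligned loss:    {info_nce_loss(well_aligned):.4f}")
print(f"misaligned loss: {info_nce_loss(misaligned):.4f}")
```

The loss is small when each query scores highest against its own screenshot and grows when a negative outranks the positive, which is the behavior a retrieval embedding needs.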

	## Citation

	```bibtex
	@misc{flantier-smolvlm-dse,
	author = {racine.ai},
	title = {Flantier-SmolVLM-500M-dse: A Lightweight Document Screenshot Embedding Model},
	year = {2025},
	publisher = {Hugging Face},
	url = {https://huggingface.co/racineai/Flantier-SmolVLM-500M-dse}
	}
	```

	## License

	This model is released under the Apache 2.0 license.