This is N2.3-Eye-1.3B-[DEV]. We are currently experimenting with siglip2. This model is underfit.
N2-Eye: Multimodal Conversational AI
N2-Eye is a multimodal language model that combines the power of LiquidAI's LFM2-1.2B language model with Google's siglip2 vision encoder to enable image understanding and conversation capabilities.
Model Details
- Base Language Model: LiquidAI/LFM2-1.2B (1.26B parameters)
- Vision Encoder: google/siglip2-so400m-patch14-384
- Model Type: Image-Text-to-Text (Multimodal Conversational)
- Training Dataset: CRAG-MM Multi-Turn Public Dataset
- License: MIT
- Framework: PyTorch + Transformers
Architecture
N2-Eye uses a modular architecture that combines:
- Language Model: LFM2-1.2B for text generation and conversation
- Vision Encoder: siglip2 for image understanding (frozen during training)
- Projection Layer: A trainable MLP that maps siglip2 features to the language model's embedding space
The model processes images by:
- Encoding images with siglip2 to extract visual features
- Projecting these features through a learnable projection layer
- Integrating the projected features into the language model at special `<image>` token positions (a minimal sketch of the projection step follows)
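The projection step can be illustrated with a minimal sketch. The dimensions below are assumptions (roughly 1152 for the siglip2-so400m output and a placeholder LFM2-1.2B hidden size), and the two-layer MLP layout is illustrative rather than the exact released module; read the real sizes from the model configs.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative projection MLP (not the exact released module).

    vision_dim and lm_dim are assumptions -- take them from the siglip2 and
    LFM2 configs (e.g. vision_model.config.hidden_size, lm.config.hidden_size).
    """

    def __init__(self, vision_dim: int = 1152, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the frozen encoder
        return self.proj(image_features)  # (batch, num_patches, lm_dim)
```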
Training Details
Dataset
- Source: CRAG-MM Multi-Turn Public Dataset (v0.1.1)
- Format: Multi-turn conversations with images
- Preprocessing: Conversations formatted with ChatML-style tokens
Training Configuration
- Batch Size: 2 per device (gradient accumulation steps: 4; effective batch size 8)
- Learning Rate: 2e-5
- Training Length: 1 epoch on the validation split (this is a DEV version)
- Precision: bfloat16
- Max Sequence Length: 2048 tokens
- Optimization: Gradient checkpointing enabled
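For reference, the configuration above maps roughly onto the following `TrainingArguments`. This is an illustrative sketch rather than the exact training script; `output_dir`, logging, and saving settings are placeholders.

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./n2-eye-dev",        # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,    # effective batch size of 8 per device
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,                        # bfloat16 precision
    gradient_checkpointing=True,
    logging_steps=10,                 # placeholder logging cadence
    save_strategy="epoch",            # placeholder save policy
)
```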
Special Tokens
- `<image>`: Placeholder for image embeddings in the conversation (see the registration sketch below)
- System prompt: "You are a helpful assistant trained by Liquid AI. You can see and understand images."
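If you rebuild the tokenizer yourself, the `<image>` placeholder must exist as a token. The snippet below is purely illustrative; the released tokenizer may already include it.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.3-Eye-1.3B", trust_remote_code=True)

# Register the image placeholder if it is not already in the vocabulary.
if "<image>" not in tokenizer.get_vocab():
    tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
    # After adding tokens, the language model's embeddings would need resizing:
    # model.resize_token_embeddings(len(tokenizer))

image_token_id = tokenizer.convert_tokens_to_ids("<image>")
```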
Usage
Basic Inference
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.3-Eye-1.3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GoofyLM/N2.3-Eye-1.3B", trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
Chat Template
N2-Eye uses a ChatML-based format with support for tools and multimodal content. The model ships with a Jinja2 chat template that handles:
- System prompts: automatically wrapped in `<|im_start|>system` tags
- Tool integration: tool definitions enclosed in `<|tool_list_start|>` and `<|tool_list_end|>` markers
- Tool responses: wrapped in `<|tool_response_start|>` and `<|tool_response_end|>` markers
- Multimodal content: JSON serialization for complex message content, including images
Basic conversation format:
```
<|im_start|>system
You are a helpful assistant trained by Liquid AI. You can see and understand images.<|im_end|>
<image>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
For tool-enabled conversations:
```
<|im_start|>system
{system_prompt}
List of tools: <|tool_list_start|>[{tool_definitions}]<|tool_list_end|><|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
<|im_start|>tool
<|tool_response_start|>{tool_output}<|tool_response_end|><|im_end|>
```
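For illustration, the basic (non-tool) format can also be assembled by hand. The sketch below mirrors the template above; the bundled Jinja2 template remains the source of truth for exact token placement.

```python
SYSTEM_PROMPT = (
    "You are a helpful assistant trained by Liquid AI. "
    "You can see and understand images."
)

def build_prompt(user_message: str) -> str:
    # Mirrors the basic conversation format above; the trailing
    # "<|im_start|>assistant\n" cues the model to generate a reply.
    return (
        f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n"
        "<image>\n"
        f"<|im_start|>user\n{user_message}<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

print(build_prompt("What animal is on the candy?"))
```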
Capabilities
N2-Eye can:
- Visual Understanding: Understand and describe images in detail
- Visual Q&A: Answer questions about visual content
- Multi-turn Conversations: Engage in extended conversations that reference images
- Tool Integration: Support for tool calling and structured responses
- Multimodal Reasoning: Combine visual and textual information for comprehensive responses
- Structured Output: Handle complex message formats including JSON content
Limitations
- Image Token Handling: Requires specific placement of `<image>` tokens in the conversation format
- Single Image: Currently optimized for a single image per conversation
- Training Scale: Trained on a limited dataset (validation split only)
- Frozen Vision: The siglip2 encoder is frozen during training, limiting adaptation to new visual domains
Technical Implementation
Model Architecture Classes
The implementation includes several key components:
- MultimodalLFM2Model: Main model class combining language and vision
- CRAGMMDataset: Dataset handler for CRAG-MM format
- MultimodalTrainer: Custom trainer for multimodal inputs
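The class names above come from the accompanying training script. As a rough illustration of the image-integration step described earlier, the following sketch overwrites text embeddings at `<image>` positions with projected vision features; shapes and naming are assumptions, not the actual `MultimodalLFM2Model` code.

```python
import torch

def merge_image_features(
    input_ids: torch.Tensor,     # (batch, seq_len)
    text_embeds: torch.Tensor,   # (batch, seq_len, hidden)
    image_embeds: torch.Tensor,  # (batch, num_patches, hidden), already projected
    image_token_id: int,
) -> torch.Tensor:
    """Illustrative only: place projected image features at <image> positions."""
    merged = text_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == image_token_id).nonzero(as_tuple=True)[0]
        # Use as many image patches as there are <image> placeholder positions.
        n = min(positions.numel(), image_embeds.size(1))
        merged[b, positions[:n]] = image_embeds[b, :n]
    return merged
```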
Key Features
- Gradient Checkpointing: Memory-efficient training
- Custom Collation: Handles multimodal batch processing
- Flexible Image Integration: Dynamic matching of image features to token positions
- Safe Serialization: Custom saving to handle shared tensors
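As an example of what the custom collation might look like, here is a hedged sketch that pads token sequences and stacks image tensors. The field names (`input_ids`, `pixel_values`) and the pad id are assumptions about the dataset output, not the exact `CRAGMMDataset` schema.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def multimodal_collate(batch, pad_token_id: int = 0):
    # Pad variable-length token sequences to the longest example in the batch.
    input_ids = pad_sequence(
        [ex["input_ids"] for ex in batch], batch_first=True, padding_value=pad_token_id
    )
    attention_mask = (input_ids != pad_token_id).long()
    # Stack per-example image tensors into a single batch tensor.
    pixel_values = torch.stack([ex["pixel_values"] for ex in batch])
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "pixel_values": pixel_values,
    }
```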
Requirements
```
torch
transformers
datasets
Pillow
clip-by-openai
```
Training Your Own Version
To retrain or fine-tune N2-Eye:
1. Install dependencies
2. Prepare your dataset in CRAG-MM format
3. Modify the configuration in the training script
4. Run the training pipeline
See the included training script for complete implementation details.
Citation
If you use N2-Eye in your research, please cite:
```bibtex
@misc{n2eye2025,
  title={N2-Eye: Multimodal Conversational AI},
  author={GoofyLM Lab},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/GoofyLM/N2-Eye-v1-1.3B}}
}
```
Acknowledgments
- LiquidAI for the LFM2-1.2B base model
- Google for the siglip2 vision encoder
- CRAG-MM dataset contributors for training data
- Hugging Face for the transformers library and model hosting
License
This model is released under the MIT License. See the LICENSE file for details.