This is N2.3-Eye-1.3B-DEV. We are currently experimenting with SigLIP2; this model is underfit.

N2-Eye: Multimodal Conversational AI

N2-Eye is a multimodal language model that pairs LiquidAI's LFM2-1.2B language model with Google's SigLIP2 vision encoder to enable image understanding and image-grounded conversation.

Model Details

  • Base Language Model: LiquidAI/LFM2-1.2B (1.26B parameters)
  • Vision Encoder: google/siglip2-so400m-patch14-384
  • Model Type: Image-Text-to-Text (Multimodal Conversational)
  • Training Dataset: CRAG-MM Multi-Turn Public Dataset
  • License: MIT
  • Framework: PyTorch + Transformers

Architecture

N2-Eye uses a modular architecture that combines:

  1. Language Model: LFM2-1.2B for text generation and conversation
  2. Vision Encoder: SigLIP2 for image understanding (frozen during training)
  3. Projection Layer: A trainable MLP that maps SigLIP2 features to the language model's embedding space

The model processes images by:

  • Encoding images with SigLIP2 to extract visual features
  • Projecting these features through a learnable projection layer
  • Integrating the projected features into the language model at the special <image> token positions (a minimal sketch follows this list)
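
A minimal sketch of this flow, assuming a simple two-layer MLP projector and one visual feature per <image> token; the module and function names here are illustrative, not the classes shipped with the checkpoint:

import torch
import torch.nn as nn

class ImageProjector(nn.Module):
    """Hypothetical two-layer MLP mapping SigLIP2 features into the LM embedding space."""
    def __init__(self, vision_dim: int, lm_dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vision_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        return self.net(vision_features)

def splice_image_features(input_ids, text_embeds, image_embeds, image_token_id):
    # Overwrite the text embedding at every <image> position with a projected
    # visual feature (assumes image_embeds has one row per <image> token).
    positions = (input_ids == image_token_id).nonzero(as_tuple=True)
    text_embeds[positions] = image_embeds.to(text_embeds.dtype)
    return text_embeds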

Training Details

Dataset

  • Source: CRAG-MM Multi-Turn Public Dataset (v0.1.1)
  • Format: Multi-turn conversations with images
  • Preprocessing: Conversations formatted with ChatML-style tokens
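
For illustration, preprocessing along these lines produces the ChatML-style strings shown in the Chat Template section below. The turn schema here is simplified and is not the exact CRAG-MM field layout:

SYSTEM_PROMPT = ("You are a helpful assistant trained by Liquid AI. "
                 "You can see and understand images.")

def format_conversation(turns):
    # turns: list of {"role": "user" | "assistant", "text": str} -- illustrative schema,
    # not the actual CRAG-MM field names.
    parts = [f"<|im_start|>system\n{SYSTEM_PROMPT}<|im_end|>\n<image>\n"]
    for turn in turns:
        parts.append(f"<|im_start|>{turn['role']}\n{turn['text']}<|im_end|>\n")
    return "".join(parts)

print(format_conversation([
    {"role": "user", "text": "What animal is on the candy?"},
]))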

Training Configuration

  • Batch Size: 2 per device (gradient accumulation steps: 4, effective batch size 8 per device)
  • Learning Rate: 2e-5
  • Training Length: 1 epoch on the validation split (it's a DEV version)
  • Precision: bfloat16
  • Max Sequence Length: 2048 tokens
  • Optimization: Gradient checkpointing enabled
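
The settings above map roughly onto a Hugging Face TrainingArguments block like the following; the output path and logging cadence are placeholders, and the 2048-token limit is applied at tokenization time rather than here:

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./n2-eye-dev",             # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,          # effective batch size 8 per device
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    gradient_checkpointing=True,
    logging_steps=10,                       # placeholder logging cadence
    save_strategy="epoch",
)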

Special Tokens

  • <image>: Placeholder for image embeddings in conversation
  • System prompt: "You are a helpful assistant trained by Liquid AI. You can see and understand images."
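
If you are reproducing the setup from the base model, the <image> placeholder presumably has to be registered as a special token so it is never split during tokenization; a minimal sketch:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
image_token_id = tokenizer.convert_tokens_to_ids("<image>")
# After adding tokens, the language model's embedding matrix must be resized:
# model.resize_token_embeddings(len(tokenizer))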

Usage

Basic Inference

# Load the model directly; trust_remote_code pulls in the custom multimodal wrapper
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.3-Eye-1.3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GoofyLM/N2.3-Eye-1.3B", trust_remote_code=True)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Chat Template

N2-Eye uses a ChatML-based format with support for tools and multimodal content. The model ships with a Jinja2 chat template that handles:

  • System prompts: Automatically formatted with <|im_start|>system tags
  • Tool integration: Special <|tool_list_start|> and <|tool_list_end|> markers for tool definitions
  • Tool responses: Wrapped with <|tool_response_start|> and <|tool_response_end|> markers
  • Multimodal content: JSON serialization for complex message content including images

Basic conversation format:

<|im_start|>system
You are a helpful assistant trained by Liquid AI. You can see and understand images.<|im_end|>
<image>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>

For tool-enabled conversations:

<|im_start|>system
{system_prompt}
List of tools: <|tool_list_start|>[{tool_definitions}]<|tool_list_end|><|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
<|im_start|>tool
<|tool_response_start|>{tool_output}<|tool_response_end|><|im_end|>
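
Assuming the bundled template accepts the standard tools argument of apply_chat_template (available in recent transformers releases), tool definitions can be rendered without hand-writing the markers above; the weather schema here is purely illustrative:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.3-Eye-1.3B", trust_remote_code=True)

# Purely illustrative tool schema -- not part of the model release.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}

messages = [{"role": "user", "content": "What's the weather in Paris?"}]
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[weather_tool],        # rendered inside <|tool_list_start|> ... <|tool_list_end|>
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)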

Capabilities

N2-Eye can:

  • Visual Understanding: Understand and describe images in detail
  • Visual Q&A: Answer questions about visual content
  • Multi-turn Conversations: Engage in extended conversations that reference images
  • Tool Integration: Support for tool calling and structured responses
  • Multimodal Reasoning: Combine visual and textual information for comprehensive responses
  • Structured Output: Handle complex message formats including JSON content

Limitations

  • Image Token Handling: Requires specific placement of <image> tokens in conversation format
  • Single Image: Currently optimized for a single image per conversation
  • Training Scale: Trained on a limited dataset (validation split only)
  • Frozen Vision: The SigLIP2 encoder is frozen, limiting adaptation to new visual domains

Technical Implementation

Model Architecture Classes

The implementation includes several key components:

  1. MultimodalLFM2Model: Main model class combining language and vision
  2. CRAGMMDataset: Dataset handler for CRAG-MM format
  3. MultimodalTrainer: Custom trainer for multimodal inputs

Key Features

  • Gradient Checkpointing: Memory-efficient training
  • Custom Collation: Handles multimodal batch processing
  • Flexible Image Integration: Dynamic matching of image features to token positions
  • Safe Serialization: Custom saving to handle shared tensors
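
As a rough idea of what the custom collation does, here is a hedged sketch; the tensor names and layout are assumptions, not the shipped MultimodalTrainer code:

import torch

def multimodal_collate_fn(batch, pad_token_id):
    # batch: list of dicts with "input_ids", "labels" (1-D tensors) and "pixel_values".
    input_ids = torch.nn.utils.rnn.pad_sequence(
        [ex["input_ids"] for ex in batch], batch_first=True, padding_value=pad_token_id
    )
    labels = torch.nn.utils.rnn.pad_sequence(
        [ex["labels"] for ex in batch], batch_first=True, padding_value=-100
    )
    attention_mask = (input_ids != pad_token_id).long()
    pixel_values = torch.stack([ex["pixel_values"] for ex in batch])
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
        "pixel_values": pixel_values,
    }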

Requirements

torch
transformers
datasets
Pillow
clip-by-openai

Training Your Own Version

To retrain or fine-tune N2-Eye:

  1. Install dependencies
  2. Prepare your dataset in CRAG-MM format
  3. Modify configuration in the training script
  4. Run the training pipeline

See the included training script for complete implementation details.
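
For step 2, the conversations can be pulled with the datasets library; the repository id below is a placeholder for wherever you host the CRAG-MM Multi-Turn Public (v0.1.1) data:

from datasets import load_dataset

# Placeholder repository id -- substitute the actual CRAG-MM dataset path you use.
dataset = load_dataset("your-org/crag-mm-multi-turn-public", split="validation")
print(dataset[0].keys())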

Citation

If you use N2-Eye in your research, please cite:

@misc{n2eye2025,
  title={N2-Eye: Multimodal Conversational AI},
  author={GoofyLM Lab},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/GoofyLM/N2-Eye-v1-1.3B}}
}

Acknowledgments

  • LiquidAI for the LFM2-1.2B base model
  • Google for the siglip2 vision encoder
  • CRAG-MM dataset contributors for training data
  • Hugging Face for the transformers library and model hosting

License

This model is released under the MIT License. See the LICENSE file for details.
