A newer version of this model is available: GoofyLM/N2.2-Eye-1.3B

N2-Eye: Multimodal Conversational AI


N2-Eye is a multimodal language model that combines LiquidAI's LFM2-1.2B language model with OpenAI's CLIP vision encoder to enable image understanding and image-grounded conversation.

Model Details

  • Base Language Model: LiquidAI/LFM2-1.2B (1.26B parameters)
  • Vision Encoder: OpenAI CLIP-ViT-Base-Patch32
  • Model Type: Image-Text-to-Text (Multimodal Conversational)
  • Training Dataset: CRAG-MM Multi-Turn Public Dataset
  • License: MIT
  • Framework: PyTorch + Transformers

Architecture

N2-Eye uses a modular architecture that combines:

  1. Language Model: LFM2-1.2B for text generation and conversation
  2. Vision Encoder: CLIP for image understanding (frozen during training)
  3. Projection Layer: A trainable MLP that maps CLIP features to the language model's embedding space

The model processes images by:

  • Encoding images with CLIP to extract visual features
  • Projecting these features through a learnable projection layer
  • Integrating projected features into the language model at special <image> token positions
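
As a rough illustration of the projection and insertion steps, the sketch below uses a two-layer MLP; the projector depth, hidden sizes, and the LFM2 embedding width are assumptions rather than details taken from this card (CLIP-ViT-Base-Patch32 does produce 512-dimensional pooled image features).

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative projector: maps CLIP image features (512-d for ViT-B/32)
    into the language model's embedding space. Layer count and sizes are
    assumptions, not N2-Eye's exact configuration."""
    def __init__(self, clip_dim: int = 512, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        return self.proj(clip_features)

def insert_image_features(inputs_embeds, input_ids, image_embeds, image_token_id):
    """Write the projected visual features into the embedding sequence at the
    positions of the <image> placeholder token (one vector per <image> token)."""
    batch_idx, pos_idx = (input_ids == image_token_id).nonzero(as_tuple=True)
    inputs_embeds[batch_idx, pos_idx] = image_embeds.to(inputs_embeds.dtype)
    return inputs_embeds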

Training Details

Dataset

  • Source: CRAG-MM Multi-Turn Public Dataset (v0.1.1)
  • Format: Multi-turn conversations with images
  • Preprocessing: Conversations formatted with ChatML-style tokens

Training Configuration

  • Batch Size: 2 per device, with 4 gradient accumulation steps (effective batch size of 8 per device)
  • Learning Rate: 2e-5
  • Training Length: 1 epoch on validation split
  • Precision: bfloat16
  • Max Sequence Length: 2048 tokens
  • Optimization: Gradient checkpointing enabled
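
For orientation, these hyperparameters map onto Hugging Face TrainingArguments roughly as follows; the output path and logging settings are placeholders, not values taken from the original training script.

from transformers import TrainingArguments

# Approximate mirror of the listed hyperparameters; paths and logging are placeholders.
training_args = TrainingArguments(
    output_dir="./n2-eye-checkpoints",   # placeholder
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,       # effective batch size of 8 per device
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,                           # bfloat16 precision
    gradient_checkpointing=True,         # memory-efficient training
    logging_steps=10,                    # placeholder
)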

Special Tokens

  • <image>: Placeholder for image embeddings in conversation
  • System prompt: "You are a helpful assistant trained by Liquid AI. You can see and understand images."
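
For reference, registering an <image> placeholder with a transformers tokenizer typically looks like the sketch below. The released N2-Eye checkpoint should already include this token, so this is only relevant when reproducing training from the base model; a recent transformers release with LFM2 support is assumed.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumes you start from the base language model, not the released N2-Eye checkpoint.
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-1.2B")

num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix for the new token

image_token_id = tokenizer.convert_tokens_to_ids("<image>")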

Usage

Basic Inference

# Load the model and tokenizer; trust_remote_code pulls in the custom multimodal wrapper
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)

# Single-turn conversation with one image and one text question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]

# Render the conversation with the chat template and tokenize it
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate a reply and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Chat Template

N2-Eye uses a ChatML-based format with support for tools and multimodal content. The model ships with a Jinja2 chat template that handles:

  • System prompts: Automatically formatted with <|im_start|>system tags
  • Tool integration: Special <|tool_list_start|> and <|tool_list_end|> markers for tool definitions
  • Tool responses: Wrapped with <|tool_response_start|> and <|tool_response_end|> markers
  • Multimodal content: JSON serialization for complex message content including images

Basic conversation format:

<|im_start|>system
You are a helpful assistant trained by Liquid AI. You can see and understand images.<|im_end|>
<image>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>

For tool-enabled conversations:

<|im_start|>system
{system_prompt}
List of tools: <|tool_list_start|>[{tool_definitions}]<|tool_list_end|><|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
<|im_start|>tool
<|tool_response_start|>{tool_output}<|tool_response_end|><|im_end|>
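
As a hedged example, the standard transformers chat-template API can render these tool markers by passing Python functions via the tools argument; the get_weather function below is purely illustrative and not part of N2-Eye.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny"

messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],        # rendered between <|tool_list_start|> and <|tool_list_end|>
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)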

Capabilities

N2-Eye can:

  • Visual Understanding: Describe images in detail
  • Visual Q&A: Answer questions about visual content
  • Multi-turn Conversations: Sustain extended conversations that reference images
  • Tool Integration: Call tools and return structured responses
  • Multimodal Reasoning: Combine visual and textual information into a single response
  • Structured Output: Handle complex message formats, including JSON content

Limitations

  • Image Token Handling: Requires specific placement of <image> tokens in conversation format
  • Single Image: Currently optimized for single image per conversation
  • Training Scale: Trained on a limited dataset (validation split only)
  • Frozen Vision: CLIP encoder is frozen, limiting adaptation to new visual domains

Technical Implementation

Model Architecture Classes

The implementation includes several key components:

  1. MultimodalLFM2Model: Main model class combining language and vision
  2. CRAGMMDataset: Dataset handler for CRAG-MM format
  3. MultimodalTrainer: Custom trainer for multimodal inputs

Key Features

  • Gradient Checkpointing: Memory-efficient training
  • Custom Collation: Handles multimodal batch processing
  • Flexible Image Integration: Dynamic matching of image features to token positions
  • Safe Serialization: Custom saving to handle shared tensors
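
A minimal sketch of what the custom collation might look like is shown below: text fields are padded as usual, while image tensors are stacked separately so they can be routed to the CLIP encoder. Function and key names are assumptions, not the card's actual implementation.

import torch

def multimodal_collate_fn(batch, tokenizer):
    """Illustrative collator; the real MultimodalTrainer collation may differ."""
    text_features = [
        {"input_ids": ex["input_ids"], "attention_mask": ex["attention_mask"]}
        for ex in batch
    ]
    padded = tokenizer.pad(text_features, padding=True, return_tensors="pt")

    # Labels mirror input_ids, with padding positions masked out of the loss.
    labels = padded["input_ids"].clone()
    labels[padded["attention_mask"] == 0] = -100

    return {
        "input_ids": padded["input_ids"],
        "attention_mask": padded["attention_mask"],
        "labels": labels,
        "pixel_values": torch.stack([ex["pixel_values"] for ex in batch]),
    }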

Requirements

torch
transformers
datasets
Pillow
clip-by-openai
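
They can be installed in one step, for example:

pip install torch transformers datasets Pillow clip-by-openai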

Training Your Own Version

To retrain or fine-tune N2-Eye:

  1. Install dependencies
  2. Prepare your dataset in CRAG-MM format
  3. Modify configuration in the training script
  4. Run the training pipeline

See the included training script for complete implementation details.
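
As a rough illustration of what a single preprocessed training example might look like after applying the conversation format above (field names and the sample texts are hypothetical, not the dataset's actual schema):

from PIL import Image

# Hypothetical preprocessed example; the actual CRAG-MM fields and loader
# live in the training script.
example = {
    "image": Image.open("example.jpg"),  # placeholder image path
    "text": (
        "<|im_start|>system\n"
        "You are a helpful assistant trained by Liquid AI. "
        "You can see and understand images.<|im_end|>\n"
        "<image>\n"
        "<|im_start|>user\n"
        "What is shown in this image?<|im_end|>\n"
        "<|im_start|>assistant\n"
        "A red bicycle leaning against a brick wall.<|im_end|>"
    ),
}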

Citation

If you use N2-Eye in your research, please cite:

@misc{n2eye2025,
  title={N2-Eye: Multimodal Conversational AI},
  author={GoofyLM Lab},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/GoofyLM/N2-Eye-v1-1.3B}}
}

Acknowledgments

  • LiquidAI for the LFM2-1.2B base model
  • OpenAI for the CLIP vision encoder
  • CRAG-MM dataset contributors for training data
  • Hugging Face for the transformers library and model hosting

License

This model is released under the MIT License. See the LICENSE file for details.
