A newer version of this model is available: GoofyLM/N2.2-Eye-1.3B

N2-Eye: Multimodal Conversational AI


N2-Eye is a multimodal language model that combines LiquidAI's LFM2-1.2B language model with OpenAI's CLIP vision encoder to enable image understanding and image-grounded conversation.

Model Details

  • Base Language Model: LiquidAI/LFM2-1.2B (1.26B parameters)
  • Vision Encoder: OpenAI CLIP-ViT-Base-Patch32
  • Model Type: Image-Text-to-Text (Multimodal Conversational)
  • Training Dataset: CRAG-MM Multi-Turn Public Dataset
  • License: MIT
  • Framework: PyTorch + Transformers

Architecture

N2-Eye uses a modular architecture that combines:

  1. Language Model: LFM2-1.2B for text generation and conversation
  2. Vision Encoder: CLIP for image understanding (frozen during training)
  3. Projection Layer: A trainable MLP that maps CLIP features to the language model's embedding space

The model processes images by:

  • Encoding images with CLIP to extract visual features
  • Projecting these features through a learnable projection layer
  • Integrating projected features into the language model at special <image> token positions
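
As a rough illustration of the projection and insertion steps, the sketch below uses a two-layer MLP; the projector depth, hidden sizes, and the LFM2 embedding width are assumptions rather than details taken from this card (CLIP-ViT-Base-Patch32 does produce 512-dimensional pooled image features).

import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative projector: maps CLIP image features (512-d for ViT-B/32)
    into the language model's embedding space. Layer count and sizes are
    assumptions, not N2-Eye's exact configuration."""
    def __init__(self, clip_dim: int = 512, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        return self.proj(clip_features)

def insert_image_features(inputs_embeds, input_ids, image_embeds, image_token_id):
    """Write the projected visual features into the embedding sequence at the
    positions of the <image> placeholder token (one vector per <image> token)."""
    batch_idx, pos_idx = (input_ids == image_token_id).nonzero(as_tuple=True)
    inputs_embeds[batch_idx, pos_idx] = image_embeds.to(inputs_embeds.dtype)
    return inputs_embeds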

Training Details

Dataset

  • Source: CRAG-MM Multi-Turn Public Dataset (v0.1.1)
  • Format: Multi-turn conversations with images
  • Preprocessing: Conversations formatted with ChatML-style tokens

Training Configuration

  • Batch Size: 2 per device, with 4 gradient accumulation steps (effective batch size of 8 per device)
  • Learning Rate: 2e-5
  • Training Length: 1 epoch on validation split
  • Precision: bfloat16
  • Max Sequence Length: 2048 tokens
  • Optimization: Gradient checkpointing enabled
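
For orientation, these hyperparameters map onto Hugging Face TrainingArguments roughly as follows; the output path and logging settings are placeholders, not values taken from the original training script.

from transformers import TrainingArguments

# Approximate mirror of the listed hyperparameters; paths and logging are placeholders.
training_args = TrainingArguments(
    output_dir="./n2-eye-checkpoints",   # placeholder
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,       # effective batch size of 8 per device
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,                           # bfloat16 precision
    gradient_checkpointing=True,         # memory-efficient training
    logging_steps=10,                    # placeholder
)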

Special Tokens

  • <image>: Placeholder for image embeddings in conversation
  • System prompt: "You are a helpful assistant trained by Liquid AI. You can see and understand images."
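
For reference, registering an <image> placeholder with a transformers tokenizer typically looks like the sketch below. The released N2-Eye checkpoint should already include this token, so this is only relevant when reproducing training from the base model; a recent transformers release with LFM2 support is assumed.

from transformers import AutoTokenizer, AutoModelForCausalLM

# Assumes you start from the base language model, not the released N2-Eye checkpoint.
tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")
model = AutoModelForCausalLM.from_pretrained("LiquidAI/LFM2-1.2B")

num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
if num_added > 0:
    model.resize_token_embeddings(len(tokenizer))  # grow the embedding matrix for the new token

image_token_id = tokenizer.convert_tokens_to_ids("<image>")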

Usage

Basic Inference

# Load the model and tokenizer; trust_remote_code pulls in the custom multimodal wrapper
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)

# Single-turn conversation with one image and one text question
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]

# Render the conversation with the chat template and tokenize it
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# Generate a reply and decode only the newly generated tokens
outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Chat Template

N2-Eye uses a ChatML-based format with support for tools and multimodal content. The model ships with a Jinja2 chat template that handles:

  • System prompts: Automatically formatted with <|im_start|>system tags
  • Tool integration: Special <|tool_list_start|> and <|tool_list_end|> markers for tool definitions
  • Tool responses: Wrapped with <|tool_response_start|> and <|tool_response_end|> markers
  • Multimodal content: JSON serialization for complex message content including images

Basic conversation format:

<|im_start|>system
You are a helpful assistant trained by Liquid AI. You can see and understand images.<|im_end|>
<image>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>

For tool-enabled conversations:

<|im_start|>system
{system_prompt}
List of tools: <|tool_list_start|>[{tool_definitions}]<|tool_list_end|><|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
<|im_start|>tool
<|tool_response_start|>{tool_output}<|tool_response_end|><|im_end|>
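
As a hedged example, the standard transformers chat-template API can render these tool markers by passing Python functions via the tools argument; the get_weather function below is purely illustrative and not part of N2-Eye.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    return "sunny"

messages = [
    {"role": "user", "content": "What's the weather in Paris?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],        # rendered between <|tool_list_start|> and <|tool_list_end|>
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)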

Capabilities

N2-Eye can:

  • Visual Understanding: Describe images in detail
  • Visual Q&A: Answer questions about visual content
  • Multi-turn Conversations: Sustain extended conversations that reference images
  • Tool Integration: Call tools and return structured responses
  • Multimodal Reasoning: Combine visual and textual information into a single response
  • Structured Output: Handle complex message formats, including JSON content

Limitations

  • Image Token Handling: Requires specific placement of <image> tokens in conversation format
  • Single Image: Currently optimized for single image per conversation
  • Training Scale: Trained on a limited dataset (validation split only)
  • Frozen Vision: CLIP encoder is frozen, limiting adaptation to new visual domains

Technical Implementation

Model Architecture Classes

The implementation includes several key components:

  1. MultimodalLFM2Model: Main model class combining language and vision
  2. CRAGMMDataset: Dataset handler for CRAG-MM format
  3. MultimodalTrainer: Custom trainer for multimodal inputs

Key Features

  • Gradient Checkpointing: Memory-efficient training
  • Custom Collation: Handles multimodal batch processing
  • Flexible Image Integration: Dynamic matching of image features to token positions
  • Safe Serialization: Custom saving to handle shared tensors
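
A minimal sketch of what the custom collation might look like is shown below: text fields are padded as usual, while image tensors are stacked separately so they can be routed to the CLIP encoder. Function and key names are assumptions, not the card's actual implementation.

import torch

def multimodal_collate_fn(batch, tokenizer):
    """Illustrative collator; the real MultimodalTrainer collation may differ."""
    text_features = [
        {"input_ids": ex["input_ids"], "attention_mask": ex["attention_mask"]}
        for ex in batch
    ]
    padded = tokenizer.pad(text_features, padding=True, return_tensors="pt")

    # Labels mirror input_ids, with padding positions masked out of the loss.
    labels = padded["input_ids"].clone()
    labels[padded["attention_mask"] == 0] = -100

    return {
        "input_ids": padded["input_ids"],
        "attention_mask": padded["attention_mask"],
        "labels": labels,
        "pixel_values": torch.stack([ex["pixel_values"] for ex in batch]),
    }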

Requirements

torch
transformers
datasets
Pillow
clip-by-openai
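
They can be installed in one step, for example:

pip install torch transformers datasets Pillow clip-by-openai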

Training Your Own Version

To retrain or fine-tune N2-Eye:

  1. Install dependencies
  2. Prepare your dataset in CRAG-MM format
  3. Modify configuration in the training script
  4. Run the training pipeline

See the included training script for complete implementation details.
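
As a rough illustration of what a single preprocessed training example might look like after applying the conversation format above (field names and the sample texts are hypothetical, not the dataset's actual schema):

from PIL import Image

# Hypothetical preprocessed example; the actual CRAG-MM fields and loader
# live in the training script.
example = {
    "image": Image.open("example.jpg"),  # placeholder image path
    "text": (
        "<|im_start|>system\n"
        "You are a helpful assistant trained by Liquid AI. "
        "You can see and understand images.<|im_end|>\n"
        "<image>\n"
        "<|im_start|>user\n"
        "What is shown in this image?<|im_end|>\n"
        "<|im_start|>assistant\n"
        "A red bicycle leaning against a brick wall.<|im_end|>"
    ),
}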

Citation

If you use N2-Eye in your research, please cite:

@misc{n2eye2025,
  title={N2-Eye: Multimodal Conversational AI},
  author={GoofyLM Lab},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/GoofyLM/N2-Eye-v1-1.3B}}
}

Acknowledgments

  • LiquidAI for the LFM2-1.2B base model
  • OpenAI for the CLIP vision encoder
  • CRAG-MM dataset contributors for training data
  • Hugging Face for the transformers library and model hosting

License

This model is released under the MIT License. See the LICENSE file for details.
