---
license: mit
language:
- en
base_model:
- LiquidAI/LFM2-1.2B
- openai/clip-vit-base-patch32
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- merge
datasets:
- crag-mm-2025/crag-mm-multi-turn-public
---

# N2-Eye: Multimodal Conversational AI

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/gq_R1hx5UTDiSns2gUzJ2.png)

N2-Eye is a multimodal language model that combines LiquidAI's LFM2-1.2B language model with OpenAI's CLIP vision encoder to enable image understanding and conversational capabilities.

## Model Details

- **Base Language Model**: LiquidAI/LFM2-1.2B (1.26B parameters)
- **Vision Encoder**: OpenAI CLIP-ViT-Base-Patch32
- **Model Type**: Image-Text-to-Text (Multimodal Conversational)
- **Training Dataset**: CRAG-MM Multi-Turn Public Dataset
- **License**: MIT
- **Framework**: PyTorch + Transformers

## Architecture

N2-Eye uses a modular architecture that combines:

1. **Language Model**: LFM2-1.2B for text generation and conversation
2. **Vision Encoder**: CLIP for image understanding (frozen during training)
3. **Projection Layer**: A trainable MLP that maps CLIP features to the language model's embedding space

The model processes images by:

- Encoding images with CLIP to extract visual features
- Projecting these features through a learnable projection layer
- Integrating the projected features into the language model at the positions of a special image placeholder token
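The Architecture section above describes the projection step only in prose. As a rough illustration, the sketch below shows how CLIP features could be projected into the language model's embedding space and substituted in at the image placeholder positions. The class name `ImageProjector`, the helper `splice_image_features`, the feature dimensions, and the placeholder token id are assumptions made for this sketch, not N2-Eye's actual implementation (the real logic lives in the model's own code, loaded with `trust_remote_code=True` in the usage example below).

```python
import torch
import torch.nn as nn


class ImageProjector(nn.Module):
    """Hypothetical MLP mapping CLIP image features to the LM embedding space."""

    def __init__(self, clip_dim: int = 512, lm_dim: int = 2048, hidden_dim: int = 1024):
        super().__init__()
        # Assumed sizes: CLIP ViT-B/32 projected image features are 512-d;
        # lm_dim must match the language model's hidden size (assumed here).
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, lm_dim),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        return self.proj(clip_features)


def splice_image_features(
    token_embeds: torch.Tensor,   # (batch, seq_len, lm_dim) text token embeddings
    image_embeds: torch.Tensor,   # (batch, lm_dim) one projected image feature per sample
    input_ids: torch.Tensor,      # (batch, seq_len) token ids
    image_token_id: int,
) -> torch.Tensor:
    """Replace the embedding at each image-placeholder position with the image feature."""
    out = token_embeds.clone()
    mask = input_ids == image_token_id                 # placeholder positions
    batch_idx, pos_idx = mask.nonzero(as_tuple=True)
    out[batch_idx, pos_idx] = image_embeds[batch_idx].to(out.dtype)
    return out


if __name__ == "__main__":
    proj = ImageProjector()
    clip_feats = torch.randn(1, 512)                   # stand-in for CLIP image features
    img_embed = proj(clip_feats)                       # (1, 2048)

    input_ids = torch.tensor([[10, 99, 11, 12]])       # 99 = hypothetical image token id
    token_embeds = torch.randn(1, 4, 2048)
    fused = splice_image_features(token_embeds, img_embed, input_ids, image_token_id=99)
    print(fused.shape)                                 # torch.Size([1, 4, 2048])
```

The sketch assumes a single pooled image feature per conversation, which matches the single-image limitation noted later in this card.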
## Training Details

### Dataset

- **Source**: CRAG-MM Multi-Turn Public Dataset (v0.1.1)
- **Format**: Multi-turn conversations with images
- **Preprocessing**: Conversations formatted with ChatML-style tokens

### Training Configuration

- **Batch Size**: 2 per device (gradient accumulation steps: 4)
- **Learning Rate**: 2e-5
- **Training Length**: 3 epochs on the validation split (final training loss: 0.7033)
- **Precision**: bfloat16
- **Max Sequence Length**: 2048 tokens
- **Optimization**: Gradient checkpointing enabled

### Special Tokens

- Image placeholder token: marks where image embeddings are inserted in the conversation
- System prompt: "You are a helpful assistant trained by Liquid AI. You can see and understand images."

## Usage

### Basic Inference

```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.2-Eye-1.3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GoofyLM/N2.2-Eye-1.3B", trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

### Chat Template

N2-Eye uses a ChatML-based format with support for tools and multimodal content. The model ships with a Jinja2 chat template that handles:

- **System prompts**: Automatically formatted with `<|im_start|>system` tags
- **Tool integration**: Special `<|tool_list_start|>` and `<|tool_list_end|>` markers for tool definitions
- **Tool responses**: Wrapped with `<|tool_response_start|>` and `<|tool_response_end|>` markers
- **Multimodal content**: JSON serialization for complex message content, including images

Basic conversation format:

```
<|im_start|>system
You are a helpful assistant trained by Liquid AI. You can see and understand images.<|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```

For tool-enabled conversations:

```
<|im_start|>system
{system_prompt}
List of tools: <|tool_list_start|>[{tool_definitions}]<|tool_list_end|><|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
<|im_start|>tool
<|tool_response_start|>{tool_output}<|tool_response_end|><|im_end|>
```

## Capabilities

N2-Eye can:

- **Visual Understanding**: Describe images in detail
- **Visual Q&A**: Answer questions about visual content
- **Multi-turn Conversations**: Engage in extended conversations that reference images
- **Tool Integration**: Support tool calling and structured responses
- **Multimodal Reasoning**: Combine visual and textual information for comprehensive responses
- **Structured Output**: Handle complex message formats, including JSON content

## Limitations

- **Image Token Handling**: Requires specific placement of image placeholder tokens in the conversation format
- **Single Image**: Currently optimized for a single image per conversation
- **Training Scale**: Trained on a limited dataset (validation split only)
- **Frozen Vision**: The CLIP encoder is frozen, limiting adaptation to new visual domains

## Technical Implementation

### Model Architecture Classes

The implementation includes several key components:

1. **MultimodalLFM2Model**: Main model class combining language and vision
2. **CRAGMMDataset**: Dataset handler for the CRAG-MM format
3. **MultimodalTrainer**: Custom trainer for multimodal inputs

### Key Features

- **Gradient Checkpointing**: Memory-efficient training
- **Custom Collation**: Handles multimodal batch processing
- **Flexible Image Integration**: Dynamic matching of image features to token positions
- **Safe Serialization**: Custom saving to handle shared tensors

## Requirements

```
torch
transformers
datasets
Pillow
clip-by-openai
```

## Training Your Own Version

To retrain or fine-tune N2-Eye:

1. Install the dependencies
2. Prepare your dataset in CRAG-MM format
3. Adjust the configuration in the training script
4. Run the training pipeline

See the included training script for complete implementation details.

## Citation

If you use N2-Eye in your research, please cite:

```bibtex
@misc{n2eye2025,
  title={N2-Eye: Multimodal Conversational AI},
  author={GoofyLM Lab},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/GoofyLM/N2-Eye-v1-1.3B}}
}
```

## Acknowledgments

- **LiquidAI** for the LFM2-1.2B base model
- **OpenAI** for the CLIP vision encoder
- **CRAG-MM** dataset contributors for the training data
- **Hugging Face** for the transformers library and model hosting

## License

This model is released under the MIT License. See the LICENSE file for details.