---
license: mit
language:
- en
base_model:
- LiquidAI/LFM2-1.2B
- openai/clip-vit-base-patch32
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- merge
datasets:
- crag-mm-2025/crag-mm-multi-turn-public
---
# N2-Eye: Multimodal Conversational AI

N2-Eye is a multimodal language model that combines the power of LiquidAI's LFM2-1.2B language model with OpenAI's CLIP vision encoder to enable image understanding and conversation capabilities.
## Model Details
- **Base Language Model**: LiquidAI/LFM2-1.2B (1.26B parameters)
- **Vision Encoder**: OpenAI CLIP-ViT-Base-Patch32
- **Model Type**: Image-Text-to-Text (Multimodal Conversational)
- **Training Dataset**: CRAG-MM Multi-Turn Public Dataset
- **License**: MIT
- **Framework**: PyTorch + Transformers
## Architecture
N2-Eye uses a modular architecture that combines:
1. **Language Model**: LFM2-1.2B for text generation and conversation
2. **Vision Encoder**: CLIP for image understanding (frozen during training)
3. **Projection Layer**: A trainable MLP that maps CLIP features to the language model's embedding space
The model processes images by:
- Encoding images with CLIP to extract visual features
- Projecting these features through a learnable projection layer
- Integrating projected features into the language model at special `<image>` token positions
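
The projection layer itself is not documented in detail here, but the idea can be pictured as a small trainable MLP between the frozen CLIP encoder and the frozen-or-finetuned language model. The sketch below is illustrative only; the layer sizes and MLP depth are assumptions, not the exact layout of the released checkpoint.

```python
import torch
import torch.nn as nn

class VisionProjection(nn.Module):
    """Maps pooled CLIP image features into the language model's embedding space.

    Minimal sketch: clip_dim=512 matches CLIP-ViT-Base-Patch32's pooled output;
    lm_dim is a placeholder for the LFM2-1.2B hidden size.
    """
    def __init__(self, clip_dim: int = 512, lm_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_dim),
            nn.GELU(),
            nn.Linear(lm_dim, lm_dim),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, clip_dim) -> (batch, lm_dim)
        return self.proj(image_features)
```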
## Training Details
### Dataset
- **Source**: CRAG-MM Multi-Turn Public Dataset (v0.1.1)
- **Format**: Multi-turn conversations with images
- **Preprocessing**: Conversations formatted with ChatML-style tokens
### Training Configuration
- **Batch Size**: 2 per device (with gradient accumulation steps: 4)
- **Learning Rate**: 2e-5
- **Training Length**: 3 epochs on the validation split (final training loss: 0.7033)
- **Precision**: bfloat16
- **Max Sequence Length**: 2048 tokens
- **Optimization**: Gradient checkpointing enabled
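
For reference, these hyperparameters roughly correspond to the following `transformers.TrainingArguments`; the output directory and logging interval are placeholders, and the 2048-token limit is applied at tokenization time rather than here.

```python
from transformers import TrainingArguments

# Sketch of the configuration listed above; paths and logging values are placeholders.
training_args = TrainingArguments(
    output_dir="./n2-eye-checkpoints",   # placeholder path
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,        # effective batch of 8 samples per optimizer step
    learning_rate=2e-5,
    num_train_epochs=3,
    bf16=True,                            # bfloat16 precision
    gradient_checkpointing=True,
    logging_steps=10,                     # placeholder
)
```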
### Special Tokens
- `<image>`: Placeholder for image embeddings in conversation
- System prompt: "You are a helpful assistant trained by Liquid AI. You can see and understand images."
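
If you rebuild the tokenizer yourself, the `<image>` placeholder can be registered as an additional special token as sketched below; the released tokenizer is expected to ship with it already.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("LiquidAI/LFM2-1.2B")

# Register the image placeholder used during training.
num_added = tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
# If any tokens were added, the language model's embedding table must be resized:
# model.resize_token_embeddings(len(tokenizer))
```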
## Usage
### Basic Inference
```python
# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.2-Eye-1.3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GoofyLM/N2.2-Eye-1.3B", trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"},
        ],
    },
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```
### Chat Template
N2-Eye uses a ChatML-based format with support for tools and multimodal content. The bundled Jinja2 template handles:
- **System prompts**: Automatically formatted with `<|im_start|>system` tags
- **Tool integration**: Special `<|tool_list_start|>` and `<|tool_list_end|>` markers for tool definitions
- **Tool responses**: Wrapped with `<|tool_response_start|>` and `<|tool_response_end|>` markers
- **Multimodal content**: JSON serialization for complex message content including images
Basic conversation format:
```
<|im_start|>system
You are a helpful assistant trained by Liquid AI. You can see and understand images.<|im_end|>
<image>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```
For tool-enabled conversations:
```
<|im_start|>system
{system_prompt}
List of tools: <|tool_list_start|>[{tool_definitions}]<|tool_list_end|><|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
<|im_start|>tool
<|tool_response_start|>{tool_output}<|tool_response_end|><|im_end|>
```
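Assuming the bundled template accepts the standard `tools` argument of `apply_chat_template`, a tool-enabled prompt could be built as sketched below; the weather tool schema is purely illustrative and not shipped with the model.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.2-Eye-1.3B", trust_remote_code=True)

# Illustrative tool schema -- not part of the model.
weather_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string", "description": "City name"}},
            "required": ["city"],
        },
    },
}

messages = [
    {"role": "user", "content": "What's the weather like in Paris right now?"},
]

prompt = tokenizer.apply_chat_template(
    messages,
    tools=[weather_tool],        # rendered between <|tool_list_start|> and <|tool_list_end|>
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)
```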
## Capabilities
N2-Eye can:
- **Visual Understanding**: Understand and describe images in detail
- **Visual Q&A**: Answer questions about visual content
- **Multi-turn Conversations**: Engage in extended conversations that reference images
- **Tool Integration**: Support for tool calling and structured responses
- **Multimodal Reasoning**: Combine visual and textual information for comprehensive responses
- **Structured Output**: Handle complex message formats including JSON content
## Limitations
- **Image Token Handling**: Requires specific placement of `<image>` tokens in conversation format
- **Single Image**: Currently optimized for single image per conversation
- **Training Scale**: Trained on a limited dataset (validation split only)
- **Frozen Vision**: CLIP encoder is frozen, limiting adaptation to new visual domains
## Technical Implementation
### Model Architecture Classes
The implementation includes several key components:
1. **MultimodalLFM2Model**: Main model class combining language and vision
2. **CRAGMMDataset**: Dataset handler for CRAG-MM format
3. **MultimodalTrainer**: Custom trainer for multimodal inputs
### Key Features
- **Gradient Checkpointing**: Memory-efficient training
- **Custom Collation**: Handles multimodal batch processing
- **Flexible Image Integration**: Dynamic matching of image features to token positions
- **Safe Serialization**: Custom saving to handle shared tensors
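
The "flexible image integration" can be pictured as overwriting token embeddings at `<image>` positions with the projected CLIP features. This is a simplified sketch; the actual `MultimodalLFM2Model` implementation may differ.

```python
import torch

def insert_image_features(
    inputs_embeds: torch.Tensor,   # (batch, seq_len, hidden) token embeddings from the LM
    input_ids: torch.Tensor,       # (batch, seq_len)
    image_features: torch.Tensor,  # (batch, hidden) one projected image embedding per sample
    image_token_id: int,
) -> torch.Tensor:
    """Overwrite embeddings at <image> token positions with projected CLIP features."""
    embeds = inputs_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == image_token_id).nonzero(as_tuple=True)[0]
        if positions.numel() > 0:
            # Single-image setup: every <image> position receives the same projected feature.
            embeds[b, positions] = image_features[b].to(embeds.dtype)
    return embeds
```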
## Requirements
```
torch
transformers
datasets
Pillow
clip-by-openai
```
## Training Your Own Version
To retrain or fine-tune N2-Eye:
1. Install dependencies
2. Prepare your dataset in CRAG-MM format
3. Modify configuration in the training script
4. Run the training pipeline
See the included training script for complete implementation details.
## Citation
If you use N2-Eye in your research, please cite:
```bibtex
@misc{n2eye2025,
  title={N2-Eye: Multimodal Conversational AI},
  author={GoofyLM Lab},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/GoofyLM/N2-Eye-v1-1.3B}}
}
```
## Acknowledgments
- **LiquidAI** for the LFM2-1.2B base model
- **OpenAI** for the CLIP vision encoder
- **CRAG-MM** dataset contributors for training data
- **Hugging Face** for the transformers library and model hosting
## License
This model is released under the MIT License. See the LICENSE file for details.