---
license: mit
language:
- en
base_model:
- LiquidAI/LFM2-1.2B
- openai/clip-vit-base-patch32
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- merge
datasets:
- crag-mm-2025/crag-mm-multi-turn-public
new_version: GoofyLM/N2.2-Eye-1.3B
---

# N2-Eye: Multimodal Conversational AI

![image/png](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/gq_R1hx5UTDiSns2gUzJ2.png)

N2-Eye is a multimodal language model that combines LiquidAI's LFM2-1.2B language model with OpenAI's CLIP vision encoder, enabling it to understand images and hold conversations about them.

## Model Details

- **Base Language Model**: LiquidAI/LFM2-1.2B (1.26B parameters)
- **Vision Encoder**: OpenAI CLIP-ViT-Base-Patch32
- **Model Type**: Image-Text-to-Text (Multimodal Conversational)
- **Training Dataset**: CRAG-MM Multi-Turn Public Dataset
- **License**: MIT
- **Framework**: PyTorch + Transformers

## Architecture

N2-Eye uses a modular architecture that combines:

1. **Language Model**: LFM2-1.2B for text generation and conversation
2. **Vision Encoder**: CLIP for image understanding (frozen during training)
3. **Projection Layer**: A trainable MLP that maps CLIP features to the language model's embedding space

The model processes images by:
- Encoding images with CLIP to extract visual features
- Projecting these features through a learnable projection layer
- Integrating projected features into the language model at special `<image>` token positions
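The projection and splice steps are not shipped as a standalone module in this card, but a minimal PyTorch sketch of the flow above might look like the following. The dimensions and names are illustrative: 512 is CLIP ViT-B/32's pooled image-feature size, and `lm_hidden` must match LFM2-1.2B's embedding width (2048 is an assumption here).

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Illustrative MLP that maps CLIP image features into the LM embedding space.

    Dimensions are assumptions for this sketch: 512 is CLIP ViT-B/32's pooled
    image-feature size; lm_hidden must match the language model's hidden width.
    """
    def __init__(self, clip_dim: int = 512, lm_hidden: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(clip_dim, lm_hidden),
            nn.GELU(),
            nn.Linear(lm_hidden, lm_hidden),
        )

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, clip_dim) -> (batch, lm_hidden)
        return self.proj(image_features)


def splice_image_embeddings(inputs_embeds, image_embeds, input_ids, image_token_id):
    """Replace the embedding at each <image> token position with the projected
    image embedding, as described in the architecture section above."""
    inputs_embeds = inputs_embeds.clone()
    for b in range(input_ids.size(0)):
        positions = (input_ids[b] == image_token_id).nonzero(as_tuple=True)[0]
        for pos in positions:
            inputs_embeds[b, pos] = image_embeds[b]
    return inputs_embeds
```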

## Training Details

### Dataset
- **Source**: CRAG-MM Multi-Turn Public Dataset (v0.1.1)
- **Format**: Multi-turn conversations with images
- **Preprocessing**: Conversations formatted with ChatML-style tokens
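For reference, the dataset listed in the card metadata can be pulled with the `datasets` library. The revision tag and split name below are assumptions based on the notes above; check the dataset repo for the exact configuration.

```python
from datasets import load_dataset

# Dataset repo comes from the model card metadata; the revision tag ("v0.1.1")
# and split name ("validation", per the training notes) are assumptions.
ds = load_dataset(
    "crag-mm-2025/crag-mm-multi-turn-public",
    revision="v0.1.1",
    split="validation",
)
print(ds[0].keys())  # inspect the conversation turns, images, and answers
```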

### Training Configuration
- **Batch Size**: 2 per device with 4 gradient accumulation steps (effective batch size of 8 per device)
- **Learning Rate**: 2e-5
- **Training Length**: 1 epoch on validation split
- **Precision**: bfloat16
- **Max Sequence Length**: 2048 tokens
- **Optimization**: Gradient checkpointing enabled
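These hyperparameters map onto `transformers.TrainingArguments` roughly as follows; this is a sketch of the configuration, not the exact training script used for this release. The output directory is a placeholder, and the 2048-token limit is enforced at tokenization time rather than here.

```python
from transformers import TrainingArguments

# Approximate mapping of the configuration listed above; output_dir is a
# placeholder. Max sequence length (2048) is applied in the dataset/tokenizer.
training_args = TrainingArguments(
    output_dir="n2-eye-checkpoints",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,
    learning_rate=2e-5,
    num_train_epochs=1,
    bf16=True,
    gradient_checkpointing=True,
)
```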

### Special Tokens
- `<image>`: Placeholder for image embeddings in conversation
- System prompt: "You are a helpful assistant trained by Liquid AI. You can see and understand images."
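If you wire up the tokenizer yourself, `<image>` has to be registered as a special token so it is kept as a single token. The released checkpoint may already ship with this done, so treat the following as a hedged sketch.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)

# Add <image> only if it is missing; the released tokenizer may already include it.
if "<image>" not in tokenizer.get_vocab():
    tokenizer.add_special_tokens({"additional_special_tokens": ["<image>"]})
    # model.resize_token_embeddings(len(tokenizer))  # after loading the model

image_token_id = tokenizer.convert_tokens_to_ids("<image>")
```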

## Usage

### Basic Inference

```python
# Load the model and tokenizer (custom code is required for the multimodal
# components, hence trust_remote_code=True)
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]

# Render the conversation with the bundled chat template and move the tensors
# to the model's device
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
# Decode only the newly generated tokens
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
```

### Chat Template

N2-Eye uses a ChatML-based format with support for tools and multimodal content. The bundled Jinja2 chat template handles:

- **System prompts**: Automatically formatted with `<|im_start|>system` tags
- **Tool integration**: Special `<|tool_list_start|>` and `<|tool_list_end|>` markers for tool definitions
- **Tool responses**: Wrapped with `<|tool_response_start|>` and `<|tool_response_end|>` markers
- **Multimodal content**: JSON serialization for complex message content including images

Basic conversation format:
```
<|im_start|>system
You are a helpful assistant trained by Liquid AI. You can see and understand images.<|im_end|>
<image>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
```

For tool-enabled conversations:
```
<|im_start|>system
{system_prompt}
List of tools: <|tool_list_start|>[{tool_definitions}]<|tool_list_end|><|im_end|>
<|im_start|>user
{user_message}<|im_end|>
<|im_start|>assistant
{assistant_response}<|im_end|>
<|im_start|>tool
<|tool_response_start|>{tool_output}<|tool_response_end|><|im_end|>
```
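Recent `transformers` versions can render the tool section for you by passing a `tools` list to `apply_chat_template`. Whether this release's template consumes the list exactly as shown is an assumption, and the `get_weather` tool is invented purely for illustration.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)

def get_weather(city: str) -> str:
    """Get the current weather for a city.

    Args:
        city: Name of the city to look up.
    """
    ...

messages = [
    {"role": "system", "content": "You are a helpful assistant trained by Liquid AI."},
    {"role": "user", "content": "What's the weather in Paris?"},
]

# get_weather is a hypothetical tool; its schema is derived from the type
# hints and docstring and injected between the tool-list markers.
prompt = tokenizer.apply_chat_template(
    messages,
    tools=[get_weather],
    add_generation_prompt=True,
    tokenize=False,
)
print(prompt)
```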

## Capabilities

N2-Eye can:
- **Visual Understanding**: Understand and describe images in detail
- **Visual Q&A**: Answer questions about visual content
- **Multi-turn Conversations**: Engage in extended conversations that reference images
- **Tool Integration**: Support for tool calling and structured responses
- **Multimodal Reasoning**: Combine visual and textual information for comprehensive responses
- **Structured Output**: Handle complex message formats including JSON content

## Limitations

- **Image Token Handling**: Requires specific placement of `<image>` tokens in the conversation format
- **Single Image**: Currently optimized for a single image per conversation
- **Training Scale**: Trained on a limited dataset (validation split only)
- **Frozen Vision**: CLIP encoder is frozen, limiting adaptation to new visual domains

## Technical Implementation

### Model Architecture Classes

The implementation includes several key components:

1. **MultimodalLFM2Model**: Main model class combining language and vision
2. **CRAGMMDataset**: Dataset handler for CRAG-MM format
3. **MultimodalTrainer**: Custom trainer for multimodal inputs

### Key Features

- **Gradient Checkpointing**: Memory-efficient training
- **Custom Collation**: Handles multimodal batch processing
- **Flexible Image Integration**: Dynamic matching of image features to token positions
- **Safe Serialization**: Custom saving to handle shared tensors
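As an illustration of the custom collation point above, a collator for this kind of model typically pads the text fields to the longest sequence in the batch and stacks the image tensors separately. The field names below are assumptions for the sketch, not the exact keys used by `CRAGMMDataset`.

```python
import torch

def multimodal_collate(batch, pad_token_id):
    """Pad text fields and stack image tensors for a multimodal batch.

    Keys (input_ids, labels, pixel_values) are assumed for illustration.
    """
    max_len = max(len(ex["input_ids"]) for ex in batch)

    def pad(seq, value):
        return seq + [value] * (max_len - len(seq))

    input_ids = torch.tensor([pad(ex["input_ids"], pad_token_id) for ex in batch])
    attention_mask = torch.tensor([pad([1] * len(ex["input_ids"]), 0) for ex in batch])
    labels = torch.tensor([pad(ex["labels"], -100) for ex in batch])  # -100 is ignored by the loss
    pixel_values = torch.stack([ex["pixel_values"] for ex in batch])

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
        "pixel_values": pixel_values,
    }
```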

## Requirements

```
torch
transformers
datasets
Pillow
clip-by-openai
```

## Training Your Own Version

To retrain or fine-tune N2-Eye:

1. Install dependencies
2. Prepare your dataset in CRAG-MM format
3. Modify configuration in the training script
4. Run the training pipeline

See the included training script for complete implementation details.

## Citation

If you use N2-Eye in your research, please cite:

```bibtex
@misc{n2eye2025,
  title={N2-Eye: Multimodal Conversational AI},
  author={GoofyLM Lab},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/GoofyLM/N2-Eye-v1-1.3B}}
}
```

## Acknowledgments

- **LiquidAI** for the LFM2-1.2B base model
- **OpenAI** for the CLIP vision encoder
- **CRAG-MM** dataset contributors for training data
- **Hugging Face** for the transformers library and model hosting

## License

This model is released under the MIT License. See the LICENSE file for details.