	---
	license: mit
	language:
	- en
	base_model:
	- LiquidAI/LFM2-1.2B
	- openai/clip-vit-base-patch32
	pipeline_tag: image-text-to-text
	library_name: transformers
	tags:
	- merge
	datasets:
	- crag-mm-2025/crag-mm-multi-turn-public
	new_version: GoofyLM/N2.2-Eye-1.3B
	---

	# N2-Eye: Multimodal Conversational AI

	![image/png](https://cdn-uploads.huggingface.co/production/uploads/6615494716917dfdc645c44e/gq_R1hx5UTDiSns2gUzJ2.png)

	N2-Eye is a multimodal language model that pairs LiquidAI's LFM2-1.2B language model with OpenAI's CLIP vision encoder, so it can answer questions about images and hold conversations that reference them.

	## Model Details

	- Base Language Model: LiquidAI/LFM2-1.2B (1.26B parameters)
	- Vision Encoder: OpenAI CLIP-ViT-Base-Patch32
	- Model Type: Image-Text-to-Text (Multimodal Conversational)
	- Training Dataset: CRAG-MM Multi-Turn Public Dataset
	- License: MIT
	- Framework: PyTorch + Transformers

	## Architecture

	N2-Eye uses a modular architecture that combines:

	1. Language Model: LFM2-1.2B for text generation and conversation
	2. Vision Encoder: CLIP for image understanding (frozen during training)
	3. Projection Layer: A trainable MLP that maps CLIP features to the language model's embedding space

	The model processes images by:
	- Encoding images with CLIP to extract visual features
	- Projecting these features through a learnable projection layer
	- Integrating projected features into the language model at special `<image>` token positions
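
The three steps above can be sketched in PyTorch as follows. This is a minimal illustration, not the released implementation: the feature widths, the two-layer MLP shape, the `IMAGE_TOKEN_ID` value, and the names `Projector` and `splice_image_features` are all assumptions made for the example, and a random tensor stands in for the CLIP output.

```python
import torch
import torch.nn as nn

# Assumed sizes for illustration only (not read from the released checkpoint).
CLIP_DIM = 768       # assumed width of the CLIP vision features
LM_DIM = 2048        # assumed hidden size of the language model
IMAGE_TOKEN_ID = 7   # hypothetical id of the <image> token


class Projector(nn.Module):
    """Small MLP that maps vision features into the LM embedding space."""

    def __init__(self, in_dim: int, out_dim: int) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, out_dim),
            nn.GELU(),
            nn.Linear(out_dim, out_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def splice_image_features(input_ids, text_embeds, image_embeds):
    """Replace the embeddings at <image> positions with projected image features.

    input_ids:    (seq_len,) token ids
    text_embeds:  (seq_len, LM_DIM) embeddings from the LM embedding table
    image_embeds: (num_image_tokens, LM_DIM) projected vision features
    """
    mask = input_ids == IMAGE_TOKEN_ID
    if int(mask.sum()) != image_embeds.shape[0]:
        raise ValueError("number of <image> tokens must match the image features")
    fused = text_embeds.clone()
    fused[mask] = image_embeds.to(fused.dtype)
    return fused


torch.manual_seed(0)
projector = Projector(CLIP_DIM, LM_DIM)
clip_features = torch.randn(1, CLIP_DIM)  # stand-in for one CLIP image feature
input_ids = torch.tensor([1, 5, IMAGE_TOKEN_ID, 9, 2])
text_embeds = torch.randn(input_ids.shape[0], LM_DIM)

fused = splice_image_features(input_ids, text_embeds, projector(clip_features))
print(fused.shape)
```

Only the row at the `<image>` position changes; every other row keeps its text embedding, so the language model sees one sequence in which the image occupies ordinary token slots.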

	## Training Details

	### Dataset
	- Source: CRAG-MM Multi-Turn Public Dataset (v0.1.1)
	- Format: Multi-turn conversations with images
	- Preprocessing: Conversations formatted with ChatML-style tokens

	### Training Configuration
	- Batch Size: 2 per device with 4 gradient accumulation steps (effective batch size of 8 per device)
	- Learning Rate: 2e-5
	- Training Length: 1 epoch on the validation split
	- Precision: bfloat16
	- Max Sequence Length: 2048 tokens
	- Optimization: Gradient checkpointing enabled
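
For reference, the settings above can be collected in a small configuration object. This is only a sketch: `TrainConfig` and its field names are hypothetical and are not taken from the training script. It mainly shows how the per-device batch size and gradient accumulation combine into the effective batch size.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainConfig:
    """Hypothetical container mirroring the values listed in this section."""

    per_device_batch_size: int = 2
    gradient_accumulation_steps: int = 4
    learning_rate: float = 2e-5
    num_epochs: int = 1
    bf16: bool = True
    max_seq_length: int = 2048
    gradient_checkpointing: bool = True

    def effective_batch_size(self, num_devices: int = 1) -> int:
        # Samples contributing to each optimizer step.
        return (
            self.per_device_batch_size
            * self.gradient_accumulation_steps
            * num_devices
        )


config = TrainConfig()
print(config.effective_batch_size())
```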

	### Special Tokens
	- `<image>`: Placeholder for image embeddings in conversation
	- System prompt: "You are a helpful assistant trained by Liquid AI. You can see and understand images."

	## Usage

	### Basic Inference

	```python
	# Load the tokenizer and model; trust_remote_code=True runs the custom code shipped in the repository
	from transformers import AutoTokenizer, AutoModelForCausalLM

	tokenizer = AutoTokenizer.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained("GoofyLM/N2.1-Eye-1.3B", trust_remote_code=True)

	# A single user turn containing an image reference and a text question
	messages = [
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
	            {"type": "text", "text": "What animal is on the candy?"},
	        ],
	    },
	]

	# Render the chat template, tokenize, and move the tensors to the model's device
	inputs = tokenizer.apply_chat_template(
	    messages,
	    add_generation_prompt=True,
	    tokenize=True,
	    return_dict=True,
	    return_tensors="pt",
	).to(model.device)

	# Generate up to 40 new tokens and decode only the newly generated part
	outputs = model.generate(**inputs, max_new_tokens=40)
	print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))
	```

	### Chat Template

	N2-Eye uses a ChatML-based format that also covers tool use and multimodal content. The bundled Jinja2 chat template handles:

	- System prompts: Automatically formatted with `<|im_start|>system` tags
	- Tool integration: Special `<|tool_list_start|>` and `<|tool_list_end|>` markers for tool definitions
	- Tool responses: Wrapped with `<|tool_response_start|>` and `<|tool_response_end|>` markers
	- Multimodal content: JSON serialization for complex message content including images

	Basic conversation format:
	```
	<|im_start|>system
	You are a helpful assistant trained by Liquid AI. You can see and understand images.<|im_end|>
	<image>
	<|im_start|>user
	{user_message}<|im_end|>
	<|im_start|>assistant
	{assistant_response}<|im_end|>
	```
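
If you need to assemble this format by hand, for example when inspecting prompts, a plain string builder along these lines reproduces the layout shown above. The helper name `build_prompt` and its parameters are made up for this sketch; in normal use `tokenizer.apply_chat_template` should remain the source of truth.

```python
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"
DEFAULT_SYSTEM = (
    "You are a helpful assistant trained by Liquid AI. "
    "You can see and understand images."
)


def build_prompt(user_message: str,
                 system_prompt: str = DEFAULT_SYSTEM,
                 with_image: bool = True) -> str:
    """Assemble a single-turn prompt that ends with an open assistant turn."""
    parts = [f"{IM_START}system\n{system_prompt}{IM_END}"]
    if with_image:
        # The <image> placeholder sits between the system and user turns.
        parts.append("<image>")
    parts.append(f"{IM_START}user\n{user_message}{IM_END}")
    # Leave the assistant turn open so generation continues from here.
    parts.append(f"{IM_START}assistant\n")
    return "\n".join(parts)


prompt = build_prompt("What animal is on the candy?")
print(prompt)
```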

	For tool-enabled conversations:
	```
	<|im_start|>system
	{system_prompt}
	List of tools: <|tool_list_start|>[{tool_definitions}]<|tool_list_end|><|im_end|>
	<|im_start|>user
	{user_message}<|im_end|>
	<|im_start|>assistant
	{assistant_response}<|im_end|>
	<|im_start|>tool
	<|tool_response_start|>{tool_output}<|tool_response_end|><|im_end|>
	```
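
The tool markers can be filled in with ordinary JSON serialization. The sketch below is illustrative only: the helper names and the `get_weather` tool definition are hypothetical, and the exact schema the model expects for tool definitions is not specified in this card.

```python
import json


def build_tool_system_turn(system_prompt: str, tools: list) -> str:
    """Render the system turn with a JSON list of tool definitions."""
    tool_json = json.dumps(tools)  # a list serializes to "[{...}, ...]"
    return (
        f"<|im_start|>system\n{system_prompt}\n"
        f"List of tools: <|tool_list_start|>{tool_json}<|tool_list_end|><|im_end|>"
    )


def build_tool_turn(tool_output: str) -> str:
    """Wrap a tool result in the tool-response markers."""
    return (
        "<|im_start|>tool\n"
        f"<|tool_response_start|>{tool_output}<|tool_response_end|><|im_end|>"
    )


# Hypothetical tool definition, used only to exercise the helpers.
tools = [
    {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "parameters": {"city": {"type": "string"}},
    }
]

system_turn = build_tool_system_turn("You are a helpful assistant.", tools)
tool_turn = build_tool_turn(json.dumps({"temperature_c": 21}))
print(system_turn)
print(tool_turn)
```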

	## Capabilities

	N2-Eye supports:
	- Visual Understanding: Understand and describe images in detail
	- Visual Q&A: Answer questions about visual content
	- Multi-turn Conversations: Engage in extended conversations that reference images
	- Tool Integration: Support for tool calling and structured responses
	- Multimodal Reasoning: Combine visual and textual information for comprehensive responses
	- Structured Output: Handle complex message formats including JSON content

	## Limitations

	- Image Token Handling: The `<image>` token must be placed as shown in the conversation format above
	- Single Image: Currently optimized for a single image per conversation
	- Training Scale: Trained on a limited dataset (the validation split only, for 1 epoch)
	- Frozen Vision: CLIP encoder is frozen, limiting adaptation to new visual domains

	## Technical Implementation

	### Model Architecture Classes

	The implementation includes several key components:

	1. MultimodalLFM2Model: Main model class combining language and vision
	2. CRAGMMDataset: Dataset handler for CRAG-MM format
	3. MultimodalTrainer: Custom trainer for multimodal inputs

	### Key Features

	- Gradient Checkpointing: Memory-efficient training
	- Custom Collation: Handles multimodal batch processing
	- Flexible Image Integration: Dynamic matching of image features to token positions
	- Safe Serialization: Custom saving to handle shared tensors
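
As an illustration of what custom collation involves, the sketch below pads variable-length token sequences, builds attention masks, masks padded label positions, and carries image data alongside each example. It uses plain lists so that it runs without dependencies; a real collator would return tensors. `PAD_TOKEN_ID`, the dictionary keys, and the function name `collate` are assumptions for this example and are not taken from the training script.

```python
from typing import Any, Dict, List

PAD_TOKEN_ID = 0      # hypothetical padding token id
IGNORE_INDEX = -100   # label value that PyTorch's cross-entropy loss ignores by default


def collate(batch: List[Dict[str, Any]]) -> Dict[str, list]:
    """Right-pad each example to the longest sequence in the batch."""
    max_len = max(len(example["input_ids"]) for example in batch)
    out: Dict[str, list] = {
        "input_ids": [],
        "attention_mask": [],
        "labels": [],
        "pixel_values": [],
    }
    for example in batch:
        ids = list(example["input_ids"])
        pad = max_len - len(ids)
        out["input_ids"].append(ids + [PAD_TOKEN_ID] * pad)
        out["attention_mask"].append([1] * len(ids) + [0] * pad)
        # Padded positions must not contribute to the loss.
        out["labels"].append(ids + [IGNORE_INDEX] * pad)
        # Image data is passed through unchanged, one entry per example.
        out["pixel_values"].append(example["pixel_values"])
    return out


batch = collate([
    {"input_ids": [11, 12, 13], "pixel_values": "image_a"},
    {"input_ids": [21], "pixel_values": "image_b"},
])
print(batch["input_ids"])
print(batch["attention_mask"])
```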

	## Requirements

	```
	torch
	transformers
	datasets
	Pillow
	clip-by-openai
	```

	## Training Your Own Version

	To retrain or fine-tune N2-Eye:

	1. Install dependencies
	2. Prepare your dataset in CRAG-MM format
	3. Modify configuration in the training script
	4. Run the training pipeline

	See the included training script for complete implementation details.

	## Citation

	If you use N2-Eye in your research, please cite:

	```bibtex
	@misc{n2eye2025,
	title={N2-Eye: Multimodal Conversational AI},
	author={GoofyLM Lab},
	year={2025},
	publisher={Hugging Face},
	howpublished={\url{https://huggingface.co/GoofyLM/N2-Eye-v1-1.3B}}
	}
	```

	## Acknowledgments

	- LiquidAI for the LFM2-1.2B base model
	- OpenAI for the CLIP vision encoder
	- CRAG-MM dataset contributors for training data
	- Hugging Face for the transformers library and model hosting

	## License

	This model is released under the MIT License. See the LICENSE file for details.