
Sai Rajeswar

rajeswarsai

AI & ML interests

None yet

Recent Activity

reacted to ahmed-masry's post with 👍 about 9 hours ago
Happy to announce AlignVLM 📏, a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs) 🌍📄🖼️

🔗 Read the paper: https://huggingface.co/papers/2502.01341

🧐 What's the challenge? Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance. ❌

🎯 Our solution: the ALIGN connector. We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space the LLM can effectively interpret. ✅

🔬 How does it perform? We compared ALIGN against common connectors such as MLPs, the Perceiver Resampler, and Ovis, all trained under similar configurations. The result: ALIGN outperforms them all 🏆 on diverse document understanding tasks 📄.

📊 Meet the AlignVLM model family! We trained Llama 3.1 (1B, 3B, 8B) with our connector and benchmarked the models against various baselines. The results:
✅ AlignVLM surpasses all base VLMs trained under similar configurations.
✅ Our models also perform competitively against instruct VLMs such as Qwen2-VL and InternVL-2.5 🚀.

🤔 What about robustness to noise? We injected Gaussian noise (μ=0, σ=3) into the vision encoder's outputs before feeding them to the connector:
✅ ALIGN connector: minimal drop (↓1.67%), demonstrating high robustness.
❌ MLP connector: severe degradation (↓25.54%), struggling with noisy inputs.

Code & model weights coming soon! Stay tuned! 🔥
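The post describes the ALIGN connector only at a high level. Below is a minimal PyTorch sketch of that idea, assuming a linear projection from vision features to vocabulary-sized logits followed by a softmax-weighted average of the LLM's text embeddings; the class name, layer choice, and sizes are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignConnectorSketch(nn.Module):
    """Hypothetical sketch of the ALIGN connector idea from the post:
    turn each vision feature into a probability distribution over the LLM
    vocabulary, then output the corresponding convex combination (weighted
    average) of the LLM's text embeddings, so the result stays inside the
    space the LLM already knows how to interpret."""

    def __init__(self, vision_dim: int, text_embed: nn.Embedding):
        super().__init__()
        vocab_size, _ = text_embed.weight.shape
        # Assumed design: a linear map from vision features to vocab-sized logits.
        self.proj = nn.Linear(vision_dim, vocab_size)
        # Reuse the (typically frozen) LLM input embedding table as the basis.
        self.text_embed = text_embed

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # vision_feats: (batch, num_patches, vision_dim)
        logits = self.proj(vision_feats)          # (B, N, vocab_size)
        weights = F.softmax(logits, dim=-1)       # convex weights over the vocabulary
        # Weighted average of text embeddings -> (B, N, llm_hidden_dim)
        return weights @ self.text_embed.weight

# Toy usage with made-up sizes (a real model would reuse the LLM's own embedding table).
embed = nn.Embedding(num_embeddings=1000, embedding_dim=64)
connector = AlignConnectorSketch(vision_dim=128, text_embed=embed)
out = connector(torch.randn(2, 16, 128))
print(out.shape)  # torch.Size([2, 16, 64])
```

Because the output is a convex combination of existing text embeddings, it cannot drift outside the LLM's embedding space, which is consistent with the robustness-to-noise result reported in the post.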

Organizations

ServiceNow · SNOW-Multimodal · Agent Poirot
