Visualizations + NLP

VisNLP's activity

ahmed-masry

posted an update about 4 hours ago

Happy to announce AlignVLM 📏 – a novel approach to bridging vision and language latent spaces for multimodal understanding in Vision-Language Models (VLMs) 🌍📄🖼

🔗 Read the paper: AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding (2502.01341)

🧐 What’s the challenge?
Aligning visual features with language embeddings remains a major bottleneck in VLMs. Existing connectors such as multi-layer perceptrons (MLPs) often introduce noise that degrades performance. ❌

🎯 Our Solution: ALIGN Connector
We propose AlignVLM, a method that maps vision features into a weighted average of LLM text embeddings, ensuring they remain in a space that the LLM can effectively interpret. ✅
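
To make the idea of a "weighted average of LLM text embeddings" concrete, here is a minimal sketch in plain Python. It is an illustration under assumptions, not the released implementation: the linear projection `proj_weights`, the softmax weighting, and all sizes are hypothetical choices made for this example.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def align_connector(vision_feature, proj_weights, text_embeddings):
    """Map one vision feature to a weighted average of text embeddings.

    vision_feature:  list of d_v floats (one visual token)
    proj_weights:    vocab_size x d_v matrix (hypothetical learned projection)
    text_embeddings: vocab_size x d_llm matrix (stand-in for the LLM embedding table)
    """
    # Score every vocabulary entry against the vision feature.
    logits = [sum(w * x for w, x in zip(row, vision_feature)) for row in proj_weights]
    # Turn the scores into weights that are non-negative and sum to 1.
    weights = softmax(logits)
    # Weighted average of the text embeddings, one output dimension at a time.
    d_llm = len(text_embeddings[0])
    return [
        sum(p * emb[j] for p, emb in zip(weights, text_embeddings))
        for j in range(d_llm)
    ]

# Toy sizes and random values, chosen only for illustration.
random.seed(0)
vocab_size, d_v, d_llm = 5, 4, 3
vision_feature = [random.gauss(0, 1) for _ in range(d_v)]
proj_weights = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(vocab_size)]
text_embeddings = [[random.gauss(0, 1) for _ in range(d_llm)] for _ in range(vocab_size)]

aligned = align_connector(vision_feature, proj_weights, text_embeddings)
```

Because the weights are non-negative and sum to 1, each coordinate of `aligned` stays between the smallest and largest value that the text embeddings take in that coordinate, which is one way to read "remain in a space that the LLM can effectively interpret".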

🔬 How does it perform?
We compared ALIGN against common connectors like MLPs, Perceiver Resampler, and Ovis trained under similar configurations. The results? ALIGN outperforms them all 🏆 on diverse document understanding tasks 📄.

📊 Meet the AlignVLM Model Family!
We trained Llama 3.2 (1B, 3B) and Llama 3.1 (8B) with our connector and benchmarked the resulting models against various VLMs. The results:
✅ AlignVLM surpasses all Base VLMs trained under similar configurations.
✅ Our models also perform competitively against Instruct VLMs such as Qwen2-VL and InternVL-2.5 🚀.

🤔 What about robustness to noise?
We injected Gaussian noise (μ=0, σ=3) into the vision encoder’s outputs before feeding them to the connector:
✅ ALIGN Connector: Minimal drop (↓1.67%) – proving its high robustness!
❌ MLP Connector: Severe degradation (↓25.54%) – struggling with noisy inputs.
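
A rough sketch of the noise-injection step, assuming the vision encoder output is a list of per-patch feature vectors. The helper name `add_gaussian_noise` and the toy feature shape are hypothetical; only the noise parameters (μ=0, σ=3) come from the experiment described above.

```python
import random

def add_gaussian_noise(features, mean=0.0, std=3.0, rng=None):
    """Return a copy of `features` with i.i.d. Gaussian noise added to every value."""
    rng = rng or random.Random()
    return [[x + rng.gauss(mean, std) for x in row] for row in features]

# Toy stand-in for vision encoder outputs: 200 patches with 64 dimensions each.
rng = random.Random(0)
clean = [[0.0] * 64 for _ in range(200)]
noisy = add_gaussian_noise(clean, rng=rng)

# `noisy` would then be passed to the connector in place of `clean`.
```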

Code & model weights coming soon! Stay tuned! 🔥

ahmed-masry

authored a paper about 12 hours ago

AlignVLM: Bridging Vision and Language Latent Spaces for Multimodal Understanding

Paper • 2502.01341 • Published 1 day ago • 25

mparvez

authored a paper 30 days ago

MapEval: A Map-Based Evaluation of Geo-Spatial Reasoning in Foundation Models

Paper • 2501.00316 • Published Dec 31, 2024 • 22

ahmed-masry

authored a paper about 2 months ago

BigDocs: An Open and Permissively-Licensed Dataset for Training Multimodal Models on Document and Code Tasks

Paper • 2412.04626 • Published Dec 5, 2024 • 13

mparvez

authored 9 papers 2 months ago

xCodeEval: A Large Scale Multilingual Multitask Benchmark for Code Understanding, Generation, Translation and Retrieval

Paper • 2303.03004 • Published Mar 6, 2023

DelucionQA: Detecting Hallucinations in Domain-specific Question Answering

Paper • 2312.05200 • Published Dec 8, 2023 • 1

Evidence to Generate (E2G): A Single-agent Two-step Prompting for Context Grounded and Retrieval Augmented Reasoning

Paper • 2401.05787 • Published Jan 11, 2024

ChartInstruct: Instruction Tuning for Chart Comprehension and Reasoning

Paper • 2403.09028 • Published Mar 14, 2024

MapCoder: Multi-Agent Code Generation for Competitive Problem Solving

Paper • 2405.11403 • Published May 18, 2024 • 2

A Systematic Survey and Critical Review on Evaluating Large Language Models: Challenges, Limitations, and Recommendations

Paper • 2407.04069 • Published Jul 4, 2024

Learning to Filter Context for Retrieval-Augmented Generation

Paper • 2311.08377 • Published Nov 14, 2023

Open-RAG: Enhanced Retrieval-Augmented Reasoning with Open-Source Large Language Models

Paper • 2410.01782 • Published Oct 2, 2024 • 10

VideoLights: Feature Refinement and Cross-Task Alignment Transformer for Joint Video Highlight Detection and Moment Retrieval

Paper • 2412.01558 • Published Dec 2, 2024 • 4

ahmed-masry

posted an update 4 months ago

🚀 Introducing ColFlor: An Efficient, OCR-Free Vision-Language Document Retrieval Model 🌟

Earlier this year, ColPali revolutionized document retrieval by eliminating the need for error-prone OCR pipelines. Instead, it directly processes the document images. However, with its 3 billion parameters, ColPali is computationally heavy for large-scale applications.

That’s where ColFlor comes in: a smaller, faster alternative! 🎉 ColFlor is 17x smaller than ColPali and offers a more efficient, OCR-free document retrieval solution, making it well suited to users with limited computing resources (the GPU Poor). 💡

Key Highlights:
🧠 174M parameters (vs. 3B for ColPali)
⚡ 9.8x faster query encoding, 5.25x faster image encoding
📉 Only 1.8% performance drop on text-rich English documents
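
As a quick sanity check on the "17x smaller" figure, the ratio follows directly from the two parameter counts quoted above. The 3B value is the rounded number from this post, so the result is approximate.

```python
colpali_params = 3_000_000_000  # "3 billion", rounded figure quoted in the post
colflor_params = 174_000_000    # 174M

size_ratio = colpali_params / colflor_params
print(f"ColFlor is about {size_ratio:.1f}x smaller")  # ColFlor is about 17.2x smaller
```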

Check out the full blog post for more insights on modeling, training, and evaluations across various document retrieval tasks! 🚀
Also, feel free to try our demo on Hugging Face 🤗

🔗 Resources:
📄 Blog post: https://huggingface.co/blog/ahmed-masry/colflor
🧠 Model: ahmed-masry/ColFlor
💻 Demo: ahmed-masry/ColFlor-Demo
🏋️‍♂️ Training code: https://github.com/AhmedMasryKU/colflor
📊 Evaluation code: https://github.com/AhmedMasryKU/vidore-benchmark-colflor

ahmed-masry

posted an update 7 months ago

📢 Exciting News! Our latest paper "ChartGemma" is out! 📊

🧵1/3: ChartGemma overcomes a key limitation of existing chart models, which rely too heavily on the underlying data tables. Instead, it is trained on data generated directly from chart images, capturing crucial visual trends 📸🔍

🧵2/3: ChartGemma builds upon PaliGemma from Google Research and is fine-tuned on a high-quality visual instruction-tuning dataset generated with Gemini 1.5 Flash. 🌟📊

🧵3/3: Achieves state-of-the-art results in chart summarization, question answering, and fact-checking tasks. 🏅📊 It can also generate more accurate and realistic chart summaries. 📝🔍

Our model and data are publicly available. We also have a cool web demo. Check it out! 🚀
Demo: ahmed-masry/ChartGemma
Code: https://github.com/vis-nlp/ChartGemma
Paper: ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild (2407.04172)

ahmed-masry

authored 5 papers 7 months ago

Chart-to-Text: A Large-Scale Benchmark for Chart Summarization

Paper • 2203.06486 • Published Mar 12, 2022

ChartQA: A Benchmark for Question Answering about Charts with Visual and Logical Reasoning

Paper • 2203.10244 • Published Mar 19, 2022

ChartGemma: Visual Instruction-tuning for Chart Reasoning in the Wild

Paper • 2407.04172 • Published Jul 4, 2024 • 23

UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning

Paper • 2305.14761 • Published May 24, 2023

Do LLMs Work on Charts? Designing Few-Shot Prompts for Chart Question Answering and Summarization

Paper • 2312.10610 • Published Dec 17, 2023 • 1

Team members 6