Leyo (Leo Tronchon)

upvoted 3 articles over 1 year ago

Docmatix - a huge dataset for Document Visual Question Answering

Article • Jul 18, 2024 • 78

Introducing Idefics2: A Powerful 8B Vision-Language Model for the community

Article • Apr 15, 2024 • 191

Multimodal Augmentation for Documents: Recovering “Comprehension” in “Reading and Comprehension” task

Article • May 16, 2024 • 17

upvoted a paper almost 2 years ago

What matters when building vision-language models?

Paper • 2405.02246 • Published May 3, 2024 • 103

upvoted a collection almost 2 years ago

Idefics2 🐶

Collection

Idefics2-8B is a foundation vision-language model. In this collection, you will find the models, datasets and demo related to its creation. • 11 items • Updated May 6, 2024 • 92

upvoted 2 papers almost 2 years ago

Unlocking the conversion of Web Screenshots into HTML Code with the WebSight Dataset

Paper • 2403.09029 • Published Mar 14, 2024 • 56

PaLM2-VAdapter: Progressively Aligned Language Model Makes a Strong Vision-language Adapter

Paper • 2402.10896 • Published Feb 16, 2024 • 16

upvoted a paper about 2 years ago

InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks

Paper • 2312.14238 • Published Dec 21, 2023 • 20

upvoted 10 papers over 2 years ago

The All-Seeing Project: Towards Panoptic Visual Recognition and Understanding of the Open World

Paper • 2308.01907 • Published Aug 3, 2023 • 12

Retentive Network: A Successor to Transformer for Large Language Models

Paper • 2307.08621 • Published Jul 17, 2023 • 172

Generative Pretraining in Multimodality

Paper • 2307.05222 • Published Jul 11, 2023 • 22

OBELICS: An Open Web-Scale Filtered Dataset of Interleaved Image-Text Documents

Paper • 2306.16527 • Published Jun 21, 2023 • 46

ConvNets Match Vision Transformers at Scale

FP8-LM: Training FP8 Large Language Models

Octopus: Embodied Vision-Language Programmer from Environmental Feedback

Vision Transformers Need Registers

Small-scale proxies for large-scale Transformer training instabilities

Scaling Autoregressive Multi-Modal Models: Pretraining and Instruction Tuning
