more details about local and global features

by VictorSanh - opened Mar 19, 2024

Cool release!

The blogpost mentions "H-Former integrates a dual-network design to learn both local and global features for vision-language alignment". Can you say more about the local and global features and how they are computed and combined?
I could not work that out just from reading the code. It looks like the H-Former is essentially a Perceiver module, but I could be wrong.

xwwu

HyperGAI org Mar 20, 2024

edited Mar 20, 2024

Hi @VictorSanh, thank you for your interest in our work! We will provide more details in our technical report, which may take some time to prepare. We will let you know once the report is ready.
