@merve on Hugging Face: "Florence-2 is a new vision foundation model capable of a wide variety of tasks…"


merve

posted an update Jun 20, 2024


Florence-2 is a new vision foundation model capable of a wide variety of tasks 🤯
Demo 👉🏻 gokaygokay/Florence-2
Collection 👉🏻 microsoft/florence-6669f44df0d87d9c3bfb76de

This model can handle tasks ranging from OCR to semantic segmentation.

What sets it apart from previous models is the training data: the authors compiled a dataset of 126M images with 5.4B annotations, labelled by their own data engine, which pseudo-labels images using smaller specialized models and APIs.

The architecture is similar to previous models: an image encoder, followed by a multimodal encoder and a text decoder. The authors built the multitask dataset with a prompt for each task.
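To make the per-task prompts concrete, here is a minimal sketch of how a task is selected at inference time. It assumes the loading pattern and task tokens shown on the Florence-2 model cards at the time of this post (`trust_remote_code=True`, `post_process_generation`); the friendly task names, `build_prompt`, and `run_task` are hypothetical helpers made up for this example, and the exact calls may differ across `transformers` versions.

```python
# Task tokens as listed on the Florence-2 model cards.
# The friendly names on the left are invented for this sketch.
TASK_PROMPTS = {
    "ocr": "<OCR>",
    "caption": "<CAPTION>",
    "detailed_caption": "<DETAILED_CAPTION>",
    "object_detection": "<OD>",
    "referring_segmentation": "<REFERRING_EXPRESSION_SEGMENTATION>",
}


def build_prompt(task, text_input=""):
    """Return the task token, optionally followed by extra text input."""
    return TASK_PROMPTS[task] + text_input


def run_task(image, task, text_input="", checkpoint="microsoft/Florence-2-base"):
    """Run one task on a PIL image and return the parsed result (hypothetical helper)."""
    # Imported lazily so the prompt helpers above work without these packages.
    from transformers import AutoModelForCausalLM, AutoProcessor

    processor = AutoProcessor.from_pretrained(checkpoint, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(checkpoint, trust_remote_code=True)

    inputs = processor(
        text=build_prompt(task, text_input), images=image, return_tensors="pt"
    )
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    # Keep special tokens: the post-processor needs them to parse boxes and regions.
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(
        raw, task=TASK_PROMPTS[task], image_size=(image.width, image.height)
    )
```

With a PIL image loaded, `run_task(image, "ocr")` and `run_task(image, "detailed_caption")` would run the same weights on two different tasks; only the prompt token changes.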

You can also fine-tune this model on any task of your choice. The authors report results on several downstream tasks, comparing fine-tuning with the vision encoder frozen and unfrozen 🤓📉
They have released fine-tuned models too; you can find them in the collection above 🤗
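If you want to try the frozen vs. unfrozen comparison yourself, a rough sketch of the freezing step looks like this. It assumes a PyTorch-style model whose image encoder is exposed as `model.vision_tower`; that attribute name is an assumption about the released checkpoints, and `set_trainable` and `freeze_vision_encoder` are hypothetical helpers, not part of any library.

```python
def set_trainable(module, trainable):
    """Toggle gradient tracking for every parameter of a torch-style module.

    Returns the number of parameters touched.
    """
    count = 0
    for param in module.parameters():
        param.requires_grad = trainable
        count += 1
    return count


def freeze_vision_encoder(model, attr="vision_tower"):
    """Freeze the image encoder so only the remaining weights are trained.

    ASSUMPTION: the image encoder lives under `model.vision_tower`;
    pass a different `attr` if your checkpoint names it differently.
    """
    return set_trainable(getattr(model, attr), False)
```

Calling `freeze_vision_encoder(model)` before building the optimizer keeps the image encoder fixed; skipping the call corresponds to the unfrozen setting.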

polles

Jun 20, 2024

nice post !

ZeroWw

Jun 21, 2024

Interesting. I gave it a photo of a barely readable handwritten note on old paper. With OCR it made a mess, but with "Detailed caption" it made only 2 errors.

lucasjin

Jun 22, 2024

It was trained on 126M images, yet it doesn't support Chinese or other languages well. A bit of a pity.
