smolagents can see 🔥
we just shipped vision support to smolagents 🤗 agentic computers FTW
you can now:
💻 let the agent get images dynamically (e.g. agentic web browser)
📑 pass images at the init of the agent (e.g. chatting with documents, filling forms automatically etc.)
with only a few lines of code changed! 🤯
you can use transformers models locally (like Qwen2VL) OR plug in your favorite multimodal inference provider (gpt-4o, Anthropic & co)
read our blog http://hf.co/blog/smolagents-can-see
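
here is a minimal sketch of the "pass images at the init" case, assuming the API shown in the linked blog post: images are handed to the agent as PIL images through the `images` argument of `agent.run`. The helper name `describe_image` is hypothetical and only there for illustration. It also assumes `smolagents` and `Pillow` are installed and `OPENAI_API_KEY` is set in your environment.

```python
def describe_image(image_path: str, question: str) -> str:
    """Ask a vision-capable agent a question about one local image.

    Hypothetical helper for illustration; not part of smolagents.
    """
    # Third-party imports live inside the function so this file can be
    # imported even when smolagents or Pillow are not installed.
    from PIL import Image
    from smolagents import CodeAgent, OpenAIServerModel

    image = Image.open(image_path)

    # Assumption: OPENAI_API_KEY is available in the environment.
    model = OpenAIServerModel(model_id="gpt-4o")

    # No extra tools: the agent only needs to look at the image.
    agent = CodeAgent(tools=[], model=model)

    # The images travel with the task through the `images` argument.
    return str(agent.run(question, images=[image]))
```

calling `describe_image("invoice.png", "What is the total on this invoice?")` would send the image along with the question (`invoice.png` is a made-up file name). Swapping `OpenAIServerModel` for another vision-capable model class should leave the rest unchanged.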