Interact with an agent to perform web-based tasks
image and video understanding.
Find matching images from a collection
Generate a short video from an image
Generate responses to video or image inputs
Generate realistic voice synthesis using text and reference audio