Generate text by uploading images or videos
a tiny vision language model
Generate text based on input prompts