---
license: apache-2.0
---

# What is Yi-VL?

## Architecture

Yi-VL adopts the [LLaVA](https://github.com/haotian-liu/LLaVA) architecture, which is composed of three primary components:

- Vision Transformer (ViT): initialized with the [CLIP ViT-H/14 model](https://huggingface.co/laion/CLIP-ViT-H-14-laion2B-s32B-b79K) and used for image encoding.
- Projection Module: designed to align image features with the text feature space. It consists of a two-layer Multilayer Perceptron (MLP) with layer normalizations.
- Large Language Model (LLM): initialized with [Yi-34B-Chat](https://huggingface.co/01-ai/Yi-34B-Chat) or [Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat), both of which understand and generate English and Chinese.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/656d9adce8bf55919aca7c3f/EGVHSWG4kAcX01xDaoeXS.png)

# How to use Yi-VL?

## Quick start

Support for Yi-VL has been implemented in the SGLang codebase, so you can call the model by defining a function like this:

```python
import sglang as sgl


@sgl.function
def image_qa(s, image_path, question):
    # The user turn carries the image followed by the text question.
    s += sgl.user(sgl.image(image_path) + question)
    # The assistant turn stores the generated text under the key "answer".
    s += sgl.assistant(sgl.gen("answer"))


# Load the model and tokenizer and make this runtime the default backend.
runtime = sgl.Runtime(model_path="BabyChou/Yi-VL-34B",
                      tokenizer_path="BabyChou/Yi-VL-34B")
sgl.set_default_backend(runtime)

# Single query
state = image_qa.run(
    image_path="images/cat.jpeg",
    question="What is this?",
    max_new_tokens=64)
print(state["answer"], "\n")
```

## License

Please refer to the [acknowledgments and attributions](#acknowledgments_and_attributions), as well as to the individual components, for the license of the source code.

The Yi series models are fully open for academic research and free for commercial use, with permission granted automatically upon application.

All usage must adhere to the [Yi Series Models Community License Agreement 2.1](https://huggingface.co/01-ai/Yi-VL-34B/blob/main/LICENSE).
For free commercial use, you only need to send an email to get official commercial permission.
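
## Projection module sketch

The Architecture section describes the projection module as a two-layer MLP with layer normalizations that maps image features into the text feature space. A minimal, dependency-free sketch of that idea is given here. The layer ordering (linear, layer norm, GELU, linear, layer norm), the GELU activation, the toy dimensions, and all function names are assumptions made for illustration and are not taken from the released implementation. The learnable scale and shift of layer normalization are omitted.

```python
import math
import random


def layer_norm(x, eps=1e-5):
    """Normalize one feature vector to zero mean and (near) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def gelu(x):
    """Exact GELU, computed with the Gaussian error function."""
    return [0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0))) for v in x]


def linear(x, weight, bias):
    """Compute weight @ x + bias for one vector; weight is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def init_linear(rng, in_dim, out_dim):
    """Random weights for illustration only; real weights come from training."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def project(patch_features, params):
    """Map each image patch vector into the LLM embedding space."""
    (w1, b1), (w2, b2) = params
    tokens = []
    for x in patch_features:
        # Layer 1 (assumed ordering): linear -> layer norm -> GELU
        h = gelu(layer_norm(linear(x, w1, b1)))
        # Layer 2 (assumed ordering): linear -> layer norm
        tokens.append(layer_norm(linear(h, w2, b2)))
    return tokens


# Toy sizes (hypothetical); the real vision and LLM widths are far larger.
VISION_DIM, LLM_DIM, NUM_PATCHES = 8, 16, 4
rng = random.Random(0)
params = (init_linear(rng, VISION_DIM, LLM_DIM),
          init_linear(rng, LLM_DIM, LLM_DIM))
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)]
           for _ in range(NUM_PATCHES)]
image_tokens = project(patches, params)
print(len(image_tokens), len(image_tokens[0]))  # one LLM-width vector per patch
```

Each image patch yields one vector with the same width as the LLM's token embeddings, which is what allows image features to sit alongside text tokens in a LLaVA-style model.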