NCSOFT/VARCO-VISION-14B

@@ -17,9 +17,14 @@ pipeline_tag: image-text-to-text
 # VARCO-VISION-14B
-## About the Model
-**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM). The training pipeline of VARCO-VISION consists of four stages: Feature Alignment Pre-training, Basic Supervised Fine-tuning, Advanced Supervised Fine-tuning, and Preference Optimization. In both multimodal and text-only benchmarks, VARCO-VISION-14B not only surpasses other models of similar size in performance but also achieves scores comparable to those of proprietary models.  The model currently accepts a single image and a text as inputs, generating an output text. It supports grounding, referring as well as OCR (Optical Character Recognition).
 - **Developed by:** NC Research, Multimodal Generation Team
 - **Technical Report:** [VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models](https://arxiv.org/pdf/2411.19103)
@@ -146,7 +151,7 @@ conversation = [
     {
         "role": "user",
         "content": [
-            {"type": "text", "text": "<gro>\nDescribe the image in detail."},
             {"type": "image"},
         ],
     },
@@ -171,7 +176,7 @@ conversation = [
         "content": [
             {
                 "type": "text",
-                "text": "<obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>은 어떻게 쓰는거야?",
             },
             {"type": "image"},
         ],
@@ -196,7 +201,7 @@ conversation = [
     {
         "role": "user",
         "content": [
-            {"type": "text", "text": "<ocr>"},
             {"type": "image"},
         ],
     },

 # VARCO-VISION-14B
+## 🚨News🎙️
+- The 2.0 model has been released. Please use the new version.
+- 📰 2025-07-16: We released VARCO-VISION-2.0-14B at [link](https://huggingface.co/NCSOFT/VARCO-VISION-2.0-14B)
+- 📰 2025-07-16: We released GME-VARCO-VISION-Embedding at [link](https://huggingface.co/NCSOFT/GME-VARCO-VISION-Embedding)
+## About the VARCO-VISION-1.0-14B Model
+**VARCO-VISION-14B** is a powerful English-Korean Vision-Language Model (VLM). The training pipeline of VARCO-VISION consists of four stages: Feature Alignment Pre-training, Basic Supervised Fine-tuning, Advanced Supervised Fine-tuning, and Preference Optimization. In both multimodal and text-only benchmarks, VARCO-VISION-14B not only surpasses other models of similar size in performance but also achieves scores comparable to those of proprietary models. The model currently accepts a single image and a text prompt as input and generates text as output. It supports grounding, referring, and OCR (Optical Character Recognition).
 - **Developed by:** NC Research, Multimodal Generation Team
 - **Technical Report:** [VARCO-VISION: Expanding Frontiers in Korean Vision-Language Models](https://arxiv.org/pdf/2411.19103)
     {
         "role": "user",
         "content": [
+            {"type": "text", "text": "<gro>\nDescribe the image in detail."},
             {"type": "image"},
         ],
     },
         "content": [
             {
                 "type": "text",
+                "text": "<obj>이 물건</obj><bbox>0.039, 0.138, 0.283, 0.257</bbox>은 어떻게 쓰는거야?",  # Korean: "How do I use this object?"
             },
             {"type": "image"},
         ],
     {
         "role": "user",
         "content": [
+            {"type": "text", "text": "<ocr>"},
             {"type": "image"},
         ],
     },
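
The referring prompt above wraps a phrase in `<obj>` and attaches a `<bbox>` with four comma-separated values. As a rough sketch of how such strings could be built and read back, the helpers below format a referring prompt and convert tagged boxes to pixel coordinates. They rest on assumptions not confirmed by this diff: that the four values are normalized `x1, y1, x2, y2` in the range 0 to 1, and that grounded output reuses the same `<obj>…</obj><bbox>…</bbox>` layout. The function names `format_referring_prompt` and `parse_grounded_text` are hypothetical and not part of the model's API.

```python
import re

# Assumption: each <bbox> holds four normalized floats in x1, y1, x2, y2 order,
# and directly follows the <obj> phrase it belongs to.
OBJ_BBOX_PATTERN = re.compile(r"<obj>(.*?)</obj>\s*<bbox>(.*?)</bbox>", re.DOTALL)


def format_referring_prompt(phrase, box, question):
    """Build a referring prompt: tagged phrase, normalized box, then the question text."""
    coords = ", ".join(f"{value:.3f}" for value in box)
    return f"<obj>{phrase}</obj><bbox>{coords}</bbox>{question}"


def parse_grounded_text(text, width, height):
    """Return a list of (phrase, (x1, y1, x2, y2)) with boxes scaled to pixels."""
    results = []
    for phrase, raw_coords in OBJ_BBOX_PATTERN.findall(text):
        x1, y1, x2, y2 = (float(value) for value in raw_coords.split(","))
        pixel_box = (
            round(x1 * width),
            round(y1 * height),
            round(x2 * width),
            round(y2 * height),
        )
        results.append((phrase, pixel_box))
    return results


if __name__ == "__main__":
    prompt = format_referring_prompt(
        "this object", (0.039, 0.138, 0.283, 0.257), " How do I use it?"
    )
    print(prompt)
    print(parse_grounded_text(prompt, width=1000, height=1000))
```

If the model's actual coordinate convention differs (for example, a different value order or scale), only the unpacking and scaling inside `parse_grounded_text` would need to change.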