Improve model card: update pipeline tag, add abstract, and code link
This PR significantly enhances the model card for Kwai Keye-VL by:
* **Updating the `pipeline_tag`**: Changed from `image-text-to-text` to `video-text-to-text`. This accurately reflects the model's primary strength and focus on video understanding, as highlighted in the paper's abstract and experimental results, and improves its discoverability for users searching for video-capable models on the Hub (a short discoverability sketch follows this summary).
* **Adding the paper abstract**: The full abstract from the technical report has been added to provide a comprehensive overview of the model's capabilities, development, and innovative training recipe directly on the model page.
* **Including a direct code link**: A link to the official GitHub repository (`https://github.com/Kwai-Keye/Keye`) has been added to the top navigation bar for easier access to the model's codebase and utilities.
* **Updating the citation**: The BibTeX entry has been updated to the more complete version found in the official GitHub repository, ensuring proper attribution to the technical report.
These changes make the model card more informative, accurate, and user-friendly.
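To illustrate why the `pipeline_tag` update matters for discoverability, here is a minimal sketch (not part of the diff below) of how one might query the Hub for video-capable models; it assumes a recent `huggingface_hub` release in which `list_models` exposes `pipeline_tag`, `search`, and `limit` arguments.

```python
# Hedged illustration only, not part of this PR's diff.
# Assumes a recent huggingface_hub where list_models accepts pipeline_tag/search/limit.
from huggingface_hub import HfApi

api = HfApi()

# After this change, Kwai Keye-VL should surface under the video-text-to-text pipeline tag.
for model in api.list_models(pipeline_tag="video-text-to-text", search="Keye", limit=5):
    print(model.id, model.pipeline_tag)
```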
@@ -1,12 +1,11 @@
-
 ---
-license: apache-2.0
 language:
 - en
-pipeline_tag: image-text-to-text
+library_name: transformers
+license: apache-2.0
+pipeline_tag: video-text-to-text
 tags:
 - multimodal
-library_name: transformers
 ---
 
 # Kwai Keye-VL
@@ -15,7 +14,11 @@ library_name: transformers
 <img src="asset/keye_logo_2.png" width="100%" alt="Kwai Keye-VL Logo">
 </div>
 
-<font size=3><div align='center' > [[🏠 Home Page](https://kwai-keye.github.io/)] [[📖 Technical Report](https://huggingface.co/papers/2507.01949)] [[🤗 Models](https://huggingface.co/Kwai-Keye)] [[🚀 Demo](https://huggingface.co/spaces/Kwai-Keye/Keye-VL-8B-Preview)] </div></font>
+<font size=3><div align='center' > [[🏠 Home Page](https://kwai-keye.github.io/)] [[📖 Technical Report](https://huggingface.co/papers/2507.01949)] [[🤗 Models](https://huggingface.co/Kwai-Keye)] [[🚀 Demo](https://huggingface.co/spaces/Kwai-Keye/Keye-VL-8B-Preview)] [[💻 Code](https://github.com/Kwai-Keye/Keye)] </div></font>
+
+## Abstract
+
+While Multimodal Large Language Models (MLLMs) demonstrate remarkable capabilities on static images, they often fall short in comprehending dynamic, information-dense short-form videos, a dominant medium in today's digital landscape. To bridge this gap, we introduce **Kwai Keye-VL**, an 8-billion-parameter multimodal foundation model engineered for leading-edge performance in short-video understanding while maintaining robust general-purpose vision-language abilities. The development of Keye-VL rests on two core pillars: a massive, high-quality dataset exceeding 600 billion tokens with a strong emphasis on video, and an innovative training recipe. This recipe features a four-stage pre-training process for solid vision-language alignment, followed by a meticulous two-phase post-training process. The first post-training stage enhances foundational capabilities like instruction following, while the second phase focuses on stimulating advanced reasoning. In this second phase, a key innovation is our five-mode "cold-start" data mixture, which includes "thinking", "non-thinking", "auto-think", "think with image", and high-quality video data. This mixture teaches the model to decide when and how to reason. Subsequent reinforcement learning (RL) and alignment steps further enhance these reasoning capabilities and correct abnormal model behaviors, such as repetitive outputs. To validate our approach, we conduct extensive evaluations, showing that Keye-VL achieves state-of-the-art results on public video benchmarks and remains highly competitive on general image-based tasks (Figure 1). Furthermore, we develop and release the **KC-MMBench**, a new benchmark tailored for real-world short-video scenarios, where Keye-VL shows a significant advantage.
 
 ## 🔥 News
 * **`2025.06.26`** 🎉 We are very proud to launch **Kwai Keye-VL**, a cutting-edge multimodal large language model meticulously crafted by the **Kwai Keye Team** at [Kuaishou](https://www.kuaishou.com/). As a cornerstone AI product within Kuaishou's advanced technology ecosystem, Keye excels in video understanding, visual perception, and reasoning tasks, setting new benchmarks in performance. Our team is working tirelessly to push the boundaries of what's possible, so stay tuned for more exciting updates!
@@ -476,12 +479,14 @@ The post-training phase of Kwai Keye is meticulously designed into two phases wi
 If you find our work helpful for your research, please consider citing our work.
 
 ```bibtex
-@misc{
-
-
-
-
-
+@misc{kwaikeyeteam2025kwaikeyevltechnicalreport,
+      title={Kwai Keye-VL Technical Report},
+      author={Kwai Keye Team},
+      year={2025},
+      eprint={2507.01949},
+      archivePrefix={arXiv},
+      primaryClass={cs.CV},
+      url={https://arxiv.org/abs/2507.01949},
 }
 ```
 
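As a quick sanity check of the updated metadata (`library_name: transformers`, `pipeline_tag: video-text-to-text`), below is a minimal, hedged loading sketch. It is not taken from the model card: the repo id `Kwai-Keye/Keye-VL-8B-Preview` is assumed from the demo link above, and `trust_remote_code=True` is assumed because checkpoints of this kind typically ship custom modeling and processing code.

```python
# Minimal sketch, not from the model card; the repo id and flags below are assumptions.
import torch
from transformers import AutoModel, AutoProcessor

model_id = "Kwai-Keye/Keye-VL-8B-Preview"  # repo id assumed from the demo link

# trust_remote_code=True is assumed, since multimodal checkpoints like this one
# usually register custom processor and model classes on the Hub.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

# Video or image inputs plus a text prompt would be prepared by the processor and passed
# to model.generate; the exact message format is defined by the model's own processor
# and is not reproduced here.
```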