liushaowei committed · Commit 7f98307 · 1 Parent(s): 2bfbc7b

update readme

README.md CHANGED
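
To reproduce this diff locally, here is a minimal sketch using `huggingface_hub` and Python's `difflib`. The repository id below is an assumption (the commit header does not name the repo), and the abbreviated revision hashes may need to be expanded to full SHAs:

```python
# Minimal sketch: download README.md at this commit and at its parent,
# then print a unified diff. REPO_ID is a hypothetical placeholder.
import difflib

from huggingface_hub import hf_hub_download

REPO_ID = "moonshotai/Kimi-K2-Instruct"  # assumption: not named on this page

old_path = hf_hub_download(REPO_ID, "README.md", revision="2bfbc7b")  # parent
new_path = hf_hub_download(REPO_ID, "README.md", revision="7f98307")  # this commit

with open(old_path, encoding="utf-8") as f:
    old_lines = f.readlines()
with open(new_path, encoding="utf-8") as f:
    new_lines = f.readlines()

print("".join(difflib.unified_diff(
    old_lines, new_lines,
    fromfile="README.md@2bfbc7b",
    tofile="README.md@7f98307",
)))
```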
@@ -43,6 +43,11 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
 - **Kimi-K2-Base**: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
 - **Kimi-K2-Instruct**: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.
 
+<div align="center">
+<picture>
+<img src="figures/banner.png" width="80%" alt="Evaluation Results">
+</picture>
+</div>
 
 ## 2. Model Summary
 
@@ -128,7 +133,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
 
 <tr>
 <td align="center">SWE-bench Verified <br/><sup>(Agentless Coding)</sup></td>
-<td align="center">Single Patch</td>
+<td align="center">Single Patch w/o Test (Acc)</td>
 <td align="center"><ins><strong>51.8</strong></ins></td>
 <td align="center">36.6</td>
 <td align="center">39.4</td>
@@ -188,7 +193,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
 
 <tr>
 <!--<td align="center">TerminalBench</td>-->
-<td align="center">Acc</td>
+<td align="center">Terminus (Acc)</td>
 <td align="center"><ins><strong>25.0</strong> </ins></td>
 <td align="center">16.3</td>
 <td align="center">6.6</td>
@@ -495,26 +500,150 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
 
 <div align="center">
 
-[20 removed lines; their content is not rendered in this diff view]
+<table>
+<thead>
+<tr>
+<th align="center">Benchmark</th>
+<th align="center">Metric</th>
+<th align="center">Shot</th>
+<th align="center">Kimi K2 Base</th>
+<th align="center">Deepseek-V3-Base</th>
+<th align="center">Qwen2.5-72B</th>
+<th align="center">Llama 4 Maverick</th>
+</tr>
+</thead>
+<tbody>
+<tr>
+<td align="center" colspan="7"><strong>General Tasks</strong></td>
+</tr>
+<tr>
+<td align="center">MMLU</td>
+<td align="center">EM</td>
+<td align="center">5-shot</td>
+<td align="center"><strong>87.8</strong></td>
+<td align="center">87.1</td>
+<td align="center">86.1</td>
+<td align="center">84.9</td>
+</tr>
+<tr>
+<td align="center">MMLU-pro</td>
+<td align="center">EM</td>
+<td align="center">5-shot</td>
+<td align="center"><strong>69.2</strong></td>
+<td align="center">60.6</td>
+<td align="center">62.8</td>
+<td align="center">63.5</td>
+</tr>
+<tr>
+<td align="center">MMLU-redux-2.0</td>
+<td align="center">EM</td>
+<td align="center">5-shot</td>
+<td align="center"><strong>90.2</strong></td>
+<td align="center">89.5</td>
+<td align="center">87.8</td>
+<td align="center">88.2</td>
+</tr>
+<tr>
+<td align="center">SimpleQA</td>
+<td align="center">Correct</td>
+<td align="center">5-shot</td>
+<td align="center"><strong>35.3</strong></td>
+<td align="center">26.5</td>
+<td align="center">10.3</td>
+<td align="center">23.7</td>
+</tr>
+<tr>
+<td align="center">TriviaQA</td>
+<td align="center">EM</td>
+<td align="center">5-shot</td>
+<td align="center"><strong>85.1</strong></td>
+<td align="center">84.1</td>
+<td align="center">76.0</td>
+<td align="center">79.3</td>
+</tr>
+<tr>
+<td align="center">GPQA-Diamond</td>
+<td align="center">Avg@8</td>
+<td align="center">5-shot</td>
+<td align="center">48.1</td>
+<td align="center"><strong>50.5</strong></td>
+<td align="center">40.8</td>
+<td align="center">49.4</td>
+</tr>
+<tr>
+<td align="center">SuperGPQA</td>
+<td align="center">EM</td>
+<td align="center">5-shot</td>
+<td align="center"><strong>44.7</strong></td>
+<td align="center">39.2</td>
+<td align="center">34.2</td>
+<td align="center">38.8</td>
+</tr>
+<tr>
+<td align="center" colspan="7"><strong>Coding Tasks</strong></td>
+</tr>
+<tr>
+<td align="center">LiveCodeBench v6</td>
+<td align="center">Pass@1</td>
+<td align="center">1-shot</td>
+<td align="center"><strong>26.3</strong></td>
+<td align="center">22.9</td>
+<td align="center">21.1</td>
+<td align="center">25.1</td>
+</tr>
+<tr>
+<td align="center">EvalPlus</td>
+<td align="center">Pass@1</td>
+<td align="center">-</td>
+<td align="center"><strong>80.3</strong></td>
+<td align="center">65.6</td>
+<td align="center">66.0</td>
+<td align="center">65.5</td>
+</tr>
+<tr>
+<td align="center" colspan="7"><strong>Mathematics Tasks</strong></td>
+</tr>
+<tr>
+<td align="center">MATH</td>
+<td align="center">EM</td>
+<td align="center">4-shot</td>
+<td align="center"><strong>70.2</strong></td>
+<td align="center">60.1</td>
+<td align="center">61.0</td>
+<td align="center">63.0</td>
+</tr>
+<tr>
+<td align="center">GSM8k</td>
+<td align="center">EM</td>
+<td align="center">8-shot</td>
+<td align="center"><strong>92.1</strong></td>
+<td align="center">91.7</td>
+<td align="center">90.4</td>
+<td align="center">86.3</td>
+</tr>
+<tr>
+<td align="center" colspan="7"><strong>Chinese Tasks</strong></td>
+</tr>
+<tr>
+<td align="center">C-Eval</td>
+<td align="center">EM</td>
+<td align="center">5-shot</td>
+<td align="center"><strong>92.5</strong></td>
+<td align="center">90.0</td>
+<td align="center">90.9</td>
+<td align="center">80.9</td>
+</tr>
+<tr>
+<td align="center">CSimpleQA</td>
+<td align="center">Correct</td>
+<td align="center">5-shot</td>
+<td align="center"><strong>77.6</strong></td>
+<td align="center">72.1</td>
+<td align="center">50.5</td>
+<td align="center">53.5</td>
+</tr>
+</tbody>
+</table>
 </div>
 <sup>
 • We only evaluate open-source pretrained models in this work. We report results for Qwen2.5-72B because the base checkpoint for Qwen3-235B-A22B was not open-sourced at the time of our study.
@@ -656,4 +785,4 @@ Both the code repository and the model weights are released under the [Modified
 
 ## 7. Contact Us
 
-If you have any questions, please reach out at [[email protected]](mailto:[email protected]).
+If you have any questions, please reach out at [[email protected]](mailto:[email protected]).
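
Since the newly added benchmark table is plain HTML, it can be loaded straight into a DataFrame for analysis. Below is a minimal sketch using `pandas.read_html` (which needs an HTML parser backend such as `lxml` or `beautifulsoup4` installed) on an abbreviated copy of the markup; note that in the full table the section-header rows (`colspan="7"`, e.g. **General Tasks**) come through as data rows and would need filtering:

```python
# Minimal sketch: parse an abbreviated copy of the new results table.
# pandas.read_html needs an HTML parser backend (lxml or beautifulsoup4).
from io import StringIO

import pandas as pd

HTML = """
<table>
  <thead>
    <tr>
      <th>Benchmark</th><th>Metric</th><th>Shot</th><th>Kimi K2 Base</th>
      <th>Deepseek-V3-Base</th><th>Qwen2.5-72B</th><th>Llama 4 Maverick</th>
    </tr>
  </thead>
  <tbody>
    <tr><td>MMLU</td><td>EM</td><td>5-shot</td><td>87.8</td><td>87.1</td><td>86.1</td><td>84.9</td></tr>
    <tr><td>GSM8k</td><td>EM</td><td>8-shot</td><td>92.1</td><td>91.7</td><td>90.4</td><td>86.3</td></tr>
  </tbody>
</table>
"""

# read_html returns one DataFrame per <table> in the document.
df = pd.read_html(StringIO(HTML))[0]
print(df.sort_values("Kimi K2 Base", ascending=False))
```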