liushaowei committed on
Commit 7f98307 · 1 Parent(s): 2bfbc7b

update readme

Files changed (1)
  1. README.md +152 -23
README.md CHANGED
@@ -43,6 +43,11 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi
  - **Kimi-K2-Base**: The foundation model, a strong start for researchers and builders who want full control for fine-tuning and custom solutions.
  - **Kimi-K2-Instruct**: The post-trained model best for drop-in, general-purpose chat and agentic experiences. It is a reflex-grade model without long thinking.

+ <div align="center">
+ <picture>
+ <img src="figures/banner.png" width="80%" alt="Evaluation Results">
+ </picture>
+ </div>

  ## 2. Model Summary

@@ -128,7 +133,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi

  <tr>
  <td align="center">SWE-bench Verified <br/><sup>(Agentless Coding)</sup></td>
- <td align="center">Single Patch</td>
+ <td align="center">Single Patch w/o Test (Acc)</td>
  <td align="center"><ins><strong>51.8</strong></ins></td>
  <td align="center">36.6</td>
  <td align="center">39.4</td>
@@ -188,7 +193,7 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi

  <tr>
  <!--<td align="center">TerminalBench</td>-->
- <td align="center">Acc</td>
+ <td align="center">Terminus (Acc)</td>
  <td align="center"><ins><strong>25.0</strong> </ins></td>
  <td align="center">16.3</td>
  <td align="center">6.6</td>
@@ -495,26 +500,150 @@ Kimi K2 is a state-of-the-art mixture-of-experts (MoE) language model with 32 bi

  <div align="center">

- | Benchmark | Metric | Shot | Kimi K2 Base | Deepseek-V3-Base | Qwen2.5-72B | Llama 4 Maverick |
- |:-------------------:|:----------:|:---------:|:--------------:|:------------------:|:-------------:|:------------------:|
- | **General Tasks** | | | | | | |
- | MMLU | EM | 5-shot | **87.8** | 87.1 | 86.1 | 84.9 |
- | MMLU-pro | EM | 5-shot | **69.2** | 60.6 | 62.8 | 63.5 |
- | MMLU-redux-2.0 | EM | 5-shot | **90.2** | 89.5 | 87.8 | 88.2 |
- | SimpleQA | Correct | 5-shot | **35.3** | 26.5 | 10.3 | 23.7 |
- | TriviaQA | EM | 5-shot | **85.1** | 84.1 | 76.0 | 79.3 |
- | GPQA-Diamond | Avg@8 | 5-shot | 48.1 | **50.5** | 40.8 | 49.4 |
- | SuperGPQA | EM | 5-shot | **44.7** | 39.2 | 34.2 | 38.8 |
- | **Code Tasks** | | | | | | |
- | LiveCodeBench v6 | Pass@1 | 1-shot | **26.3** | 22.9 | 21.1 | 25.1 |
- | EvalPlus | Pass@1 | - | **80.3** | 65.6 | 66.0 | 65.5 |
- | **Mathematics Tasks** | | | | | | |
- | MATH | EM | 4-shot | **70.2** | 60.1 | 61.0 | 63.0 |
- | GSM8k | EM | 8-shot | **92.1** | 91.7 | 90.4 | 86.3 |
- | **Chinese Tasks** | | | | | | |
- | C-Eval | EM | 5-shot | **92.5** | 90.0 | 90.9 | 80.9 |
- | CSimpleQA | Correct | 5-shot | **77.6** | 72.1 | 50.5 | 53.5 |
-
+ <table>
+ <thead>
+ <tr>
+ <th align="center">Benchmark</th>
+ <th align="center">Metric</th>
+ <th align="center">Shot</th>
+ <th align="center">Kimi K2 Base</th>
+ <th align="center">Deepseek-V3-Base</th>
+ <th align="center">Qwen2.5-72B</th>
+ <th align="center">Llama 4 Maverick</th>
+ </tr>
+ </thead>
+ <tbody>
+ <tr>
+ <td align="center" colspan="7"><strong>General Tasks</strong></td>
+ </tr>
+ <tr>
+ <td align="center">MMLU</td>
+ <td align="center">EM</td>
+ <td align="center">5-shot</td>
+ <td align="center"><strong>87.8</strong></td>
+ <td align="center">87.1</td>
+ <td align="center">86.1</td>
+ <td align="center">84.9</td>
+ </tr>
+ <tr>
+ <td align="center">MMLU-pro</td>
+ <td align="center">EM</td>
+ <td align="center">5-shot</td>
+ <td align="center"><strong>69.2</strong></td>
+ <td align="center">60.6</td>
+ <td align="center">62.8</td>
+ <td align="center">63.5</td>
+ </tr>
+ <tr>
+ <td align="center">MMLU-redux-2.0</td>
+ <td align="center">EM</td>
+ <td align="center">5-shot</td>
+ <td align="center"><strong>90.2</strong></td>
+ <td align="center">89.5</td>
+ <td align="center">87.8</td>
+ <td align="center">88.2</td>
+ </tr>
+ <tr>
+ <td align="center">SimpleQA</td>
+ <td align="center">Correct</td>
+ <td align="center">5-shot</td>
+ <td align="center"><strong>35.3</strong></td>
+ <td align="center">26.5</td>
+ <td align="center">10.3</td>
+ <td align="center">23.7</td>
+ </tr>
+ <tr>
+ <td align="center">TriviaQA</td>
+ <td align="center">EM</td>
+ <td align="center">5-shot</td>
+ <td align="center"><strong>85.1</strong></td>
+ <td align="center">84.1</td>
+ <td align="center">76.0</td>
+ <td align="center">79.3</td>
+ </tr>
+ <tr>
+ <td align="center">GPQA-Diamond</td>
+ <td align="center">Avg@8</td>
+ <td align="center">5-shot</td>
+ <td align="center">48.1</td>
+ <td align="center"><strong>50.5</strong></td>
+ <td align="center">40.8</td>
+ <td align="center">49.4</td>
+ </tr>
+ <tr>
+ <td align="center">SuperGPQA</td>
+ <td align="center">EM</td>
+ <td align="center">5-shot</td>
+ <td align="center"><strong>44.7</strong></td>
+ <td align="center">39.2</td>
+ <td align="center">34.2</td>
+ <td align="center">38.8</td>
+ </tr>
+ <tr>
+ <td align="center" colspan="7"><strong>Coding Tasks</strong></td>
+ </tr>
+ <tr>
+ <td align="center">LiveCodeBench v6</td>
+ <td align="center">Pass@1</td>
+ <td align="center">1-shot</td>
+ <td align="center"><strong>26.3</strong></td>
+ <td align="center">22.9</td>
+ <td align="center">21.1</td>
+ <td align="center">25.1</td>
+ </tr>
+ <tr>
+ <td align="center">EvalPlus</td>
+ <td align="center">Pass@1</td>
+ <td align="center">-</td>
+ <td align="center"><strong>80.3</strong></td>
+ <td align="center">65.6</td>
+ <td align="center">66.0</td>
+ <td align="center">65.5</td>
+ </tr>
+ <tr>
+ <td align="center" colspan="7"><strong>Mathematics Tasks</strong></td>
+ </tr>
+ <tr>
+ <td align="center">MATH</td>
+ <td align="center">EM</td>
+ <td align="center">4-shot</td>
+ <td align="center"><strong>70.2</strong></td>
+ <td align="center">60.1</td>
+ <td align="center">61.0</td>
+ <td align="center">63.0</td>
+ </tr>
+ <tr>
+ <td align="center">GSM8k</td>
+ <td align="center">EM</td>
+ <td align="center">8-shot</td>
+ <td align="center"><strong>92.1</strong></td>
+ <td align="center">91.7</td>
+ <td align="center">90.4</td>
+ <td align="center">86.3</td>
+ </tr>
+ <tr>
+ <td align="center" colspan="7"><strong>Chinese Tasks</strong></td>
+ </tr>
+ <tr>
+ <td align="center">C-Eval</td>
+ <td align="center">EM</td>
+ <td align="center">5-shot</td>
+ <td align="center"><strong>92.5</strong></td>
+ <td align="center">90.0</td>
+ <td align="center">90.9</td>
+ <td align="center">80.9</td>
+ </tr>
+ <tr>
+ <td align="center">CSimpleQA</td>
+ <td align="center">Correct</td>
+ <td align="center">5-shot</td>
+ <td align="center"><strong>77.6</strong></td>
+ <td align="center">72.1</td>
+ <td align="center">50.5</td>
+ <td align="center">53.5</td>
+ </tr>
+ </tbody>
+ </table>
  </div>
  <sup>
  • We only evaluate open-source pretrained models in this work. We report results for Qwen2.5-72B because the base checkpoint for Qwen3-235B-A22B was not open-sourced at the time of our study.
 
@@ -656,4 +785,4 @@ Both the code repository and the model weights are released under the [Modified

  ## 7. Contact Us

- If you have any questions, please reach out at [[email protected]](mailto:[email protected]).
+ If you have any questions, please reach out at [[email protected]](mailto:[email protected]).