Zacharias030 committed
Commit 1019e2b · verified · 1 parent: 61aae47

Update README.md

Files changed (1): README.md (+5 -6)
README.md (after this commit):

---

# KernelLLM

![scatter performance comparison plot](media/llm_performance_comparison.png)

Caption: On KernelBench-Triton Level 1, our 8B parameter model matches GPT-4o in single-shot performance. With multiple inferences, KernelLLM's performance matches DeepSeek R1. This is all from a model with two orders of magnitude fewer parameters than its competitors.

## Making Kernel Development more accessible with KernelLLM
KernelLLM's vision is to meet the growing demand for high-performance GPU kernels…

KernelLLM aims to democratize GPU programming by making kernel development more accessible and efficient.

![alt text](media/triton-kernel-workflow.png)

*KernelLLM Workflow for Triton Kernel Generation: Our approach uses KernelLLM to translate PyTorch code (green) into Triton kernel candidates. Input and output components are marked in bold. The generations are validated against unit tests, which run the kernels with random inputs of known shapes. This workflow allows us to evaluate multiple generations (pass@k) by increasing the number of kernel candidates. The best kernel implementation is selected and returned (green output).*
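The generate-then-validate loop described in the workflow can be sketched roughly as follows. This is an illustrative sketch, not the project's released harness; `validate_candidate` and `best_of_k` are hypothetical names, and the tolerances are assumptions:

```python
import torch

def validate_candidate(candidate_fn, reference_module, input_shapes, n_trials=3):
    """Run a generated kernel against the PyTorch reference on random inputs
    of known shapes; any crash or numerical mismatch fails the candidate."""
    for _ in range(n_trials):
        inputs = [torch.randn(*shape) for shape in input_shapes]
        expected = reference_module(*inputs)
        try:
            actual = candidate_fn(*inputs)
        except Exception:
            return False  # a crashing candidate counts as incorrect
        if not torch.allclose(expected, actual, rtol=1e-3, atol=1e-4):
            return False
    return True

def best_of_k(generate, reference_module, input_shapes, k=8):
    """pass@k-style selection: draw k candidates, return the first that validates."""
    for _ in range(k):
        candidate = generate()
        if validate_candidate(candidate, reference_module, input_shapes):
            return candidate
    return None  # no correct kernel among the k generations
```

Raising k only adds more draws of `generate()`, which is why pass@k improves with the number of candidate generations.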
The model was trained on approximately 25,000 paired examples of PyTorch modules and their equivalent Triton kernel implementations, plus additional synthetically generated samples. Our approach combines filtered code from TheStack [Kocetkov et al. 2022] with synthetic examples generated through torch.compile() and additional prompting techniques. The filtered and compiled dataset can be found [on Huggingface](https://huggingface.co/datasets/GPUMODE/Inductor_Created_Data_Permissive).
We finetuned Llama3.1-8B-Instruct on the created dataset using supervised instruction tuning.

### Model Performance

![alt text](media/blog_post_model_performance.png)

| Model | Parameters (B) | Score | Pass@k |
|-------|---------------|-------|--------|
Our 8B parameter model achieves competitive or superior performance compared to much larger models on kernel generation tasks, demonstrating the effectiveness of our specialized training approach.

The resulting model is competitive with state-of-the-art LLMs despite its small size. We evaluate our model on KernelBench, an open-source benchmark for evaluating the ability of LLMs to write efficient GPU kernels. It contains 250 selected PyTorch modules organized into difficulty levels, from single torch operators such as Conv2D or Swish (level 1) to full model architectures (level 3). The benchmark measures both correctness (by comparing against reference PyTorch outputs) and performance (by measuring speedup over baseline implementations). We implemented a new KernelBench-Triton variant that evaluates an LLM's ability to generate Triton kernels, making it an ideal benchmark for evaluating KernelLLM's capabilities. All our measurements were done on Nvidia H100 GPUs.
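Pass@k numbers of this kind are commonly computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021); a minimal sketch, with the function name chosen here for illustration:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: estimated probability that at least one of k samples
    is correct, given n total samples of which c passed the unit tests."""
    if n - c < k:
        return 1.0  # fewer than k failures: every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n=10 generations of which c=5 validate, pass@1 evaluates to 0.5.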
![pass at k analysis plot](media/kernelllm_pass_at_k_scaling.png)