---
license: cc-by-nc-4.0
---

# ArcticSpeculator

Build the fastest OSS vLLM-based speculative decoding system for your own model, using ArcticTraining and ArcticInference!
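
As a rough illustration of the workflow (not the project's documented API), the sketch below shows the general pattern for vLLM offline inference with a speculative-decoding configuration. The target model, draft checkpoint path, config keys, and token count are placeholder assumptions; consult the ArcticInference and ArcticTraining docs for the exact setup.

```python
# Minimal sketch of vLLM offline inference with a speculative-decoding config.
# Names and config keys are placeholders (assumptions), not the documented
# ArcticInference API; recent vLLM versions accept a dict-style
# speculative_config, older ones used separate speculative_* arguments.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=8,                      # e.g. 8xH100, as in the benchmark
    speculative_config={                         # assumed keys; check vLLM/ArcticInference docs
        "model": "path/to/arctic-speculator",    # hypothetical speculator checkpoint
        "num_speculative_tokens": 3,
    },
)

outputs = llm.generate(
    ["Write a short poem about snow."],
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)
```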

We compare the throughput (tokens/s) of existing vLLM-based speculative decoding systems for Llama3.1-70B-Instruct on 8xH100 below:

| Method                       | ShareGPT | HumanEval |
|------------------------------|----------|-----------|
| vLLM V1 Baseline             | 84.1     | 84.1      |
| vLLM V1 Eagle                | 102.2    | 112.0     |
| vLLM V1 Eagle3               | 77.7     | 85.3      |
| vLLM V0 MLP-Speculator (IBM) | 77.9     | 66.7      |
| ArcticSpeculator             | 172.4    | 203.7     |
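
The table does not include the benchmark harness itself. As a hedged sketch, per-request throughput in tokens/s can be estimated with plain vLLM as shown below; the prompt list is a stand-in for ShareGPT or HumanEval data, and the numbers above were not produced with this exact script.

```python
# Sketch of measuring generation throughput (tokens/s) with vLLM.
# Prompts would come from ShareGPT or HumanEval; this list is a placeholder.
import time
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-70B-Instruct", tensor_parallel_size=8)
prompts = ["Explain speculative decoding in one paragraph."]  # placeholder prompts
params = SamplingParams(temperature=0.0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

# Count generated tokens across all requests and report tokens per second.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.1f} tokens/s")
```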

For more details about ArcticSpeculator and how to use it, see:

We also release ArcticSpeculator checkpoints, trained with ArcticTraining, that can be run with ArcticInference: