---
license: cc-by-nc-4.0
---
# ArcticSpeculator

Build the fastest open-source, vLLM-based speculative decoding system for your own model, using [ArcticTraining](https://github.com/snowflakedb/ArcticTraining) and [ArcticInference](https://github.com/snowflakedb/ArcticInference)!

We compare the throughput (tokens/s) of existing vLLM-based speculative decoding systems for Llama-3.1-70B-Instruct on 8xH100 GPUs:

| Method                       | ShareGPT  | HumanEval |
|------------------------------|-----------|-----------|
| vLLM V1 Baseline             | 84.1      | 84.1      |
| vLLM V1 Eagle                | 102.2     | 112.0     |
| vLLM V1 Eagle3               | 77.7      | 85.3      |
| vLLM V0 MLP-Speculator (IBM) | 77.9      | 66.7      |
| ArcticSpeculator             | **172.4** | **203.7** |

For more details about ArcticSpeculator and how to use it, see:

* [Using Arctic-Inference and Arctic-Training for improving real-world speculative decoding performance (blog)]()
* [Getting started guide using ArcticTraining](https://github.com/snowflakedb/ArcticTraining/tree/mlp-variant-speculator/projects/mlp_variant_speculator)

We also release the ArcticSpeculator checkpoints we trained with [ArcticTraining](https://github.com/snowflakedb/ArcticTraining), ready to run with [ArcticInference](https://github.com/snowflakedb/ArcticInference); a usage sketch follows the table:

| Base model | ArcticSpeculator checkpoint |
|------------|-----------------------------|
| [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) | [Arctic-LSTM-Speculator-Llama-3.1-70B-Instruct](https://huggingface.co/Snowflake/Arctic-LSTM-Speculator-Llama-3.1-70B-Instruct) |
| [Llama-3.3-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.3-70B-Instruct) | [Arctic-LSTM-Speculator-Llama-3.3-70B-Instruct](https://huggingface.co/Snowflake/Arctic-LSTM-Speculator-Llama-3.3-70B-Instruct) |
| [Qwen2.5-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-32B-Instruct) | [Arctic-LSTM-Speculator-Qwen2.5-32B-Instruct](https://huggingface.co/Snowflake/Arctic-LSTM-Speculator-Qwen2.5-32B-Instruct) |
| [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | [Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct](https://huggingface.co/Snowflake/Arctic-LSTM-Speculator-Llama-3.1-8B-Instruct) |
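
To get a feel for how these checkpoints plug into vLLM, here is a minimal sketch using vLLM's offline `LLM` API. It assumes ArcticInference is installed alongside vLLM (e.g. via `pip install arctic-inference[vllm]`, package name assumed) so that its speculative decoding method is registered as a vLLM plugin; the `"method": "arctic"` value and the draft length below are illustrative assumptions, so follow the getting-started guide above for the exact, supported configuration.

```python
# Hypothetical sketch: pair a base model with its ArcticSpeculator checkpoint
# through vLLM's offline LLM API. The speculative_config keys/values are
# assumptions, not the verified ArcticInference invocation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",
    tensor_parallel_size=8,  # 8xH100, matching the benchmark setup above
    speculative_config={
        "method": "arctic",  # assumed method name registered by the ArcticInference plugin
        "model": "Snowflake/Arctic-LSTM-Speculator-Llama-3.1-70B-Instruct",
        "num_speculative_tokens": 3,  # assumed draft length; tune for your workload
    },
)

outputs = llm.generate(
    ["Write a Python function that checks whether a string is a palindrome."],
    SamplingParams(temperature=0.0, max_tokens=256),
)
print(outputs[0].outputs[0].text)
```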