EAGLE Speculative Decoding Model Trained with BaldEagle

BaldEagle Repo: https://github.com/NickL77/BaldEagle/

Achieves a 3.17x speedup (49.24 tok/s -> 156.33 tok/s) over standard decoding on Llama 3.1 8B Instruct.

Benchmarking (on RTX 3090)

  1. Start sglang server
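# The --speculative-* flags configure the EAGLE draft tree: 5 autoregressive
# draft steps, the top-8 candidate tokens kept at each step, and up to 64
# draft tokens verified per target-model forward pass.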
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --speculative-algo EAGLE \
  --speculative-draft NickL77/BaldEagle-Llama-3.1-8B-Instruct \
  --speculative-num-steps 5 \
  --speculative-eagle-topk 8 \
  --speculative-num-draft-tokens 64 \
  --dtype bfloat16 \
  --port 30000 \
  --mem-fraction-static 0.65
  2. In another terminal, run benchmark script
python3 bench_sglang_eagle_double_turn.py

Output:

#questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57

runtime: 5 min 24 sec
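
To sanity-check the server before benchmarking, you can hit sglang's OpenAI-compatible endpoint directly. A minimal smoke test (the prompt and max_tokens here are illustrative, not part of the benchmark):

import requests

# One chat completion against the sglang server launched in step 1.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",
        "messages": [{"role": "user", "content": "Summarize speculative decoding in one sentence."}],
        "max_tokens": 64,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])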

Baseline

  1. Start sglang server
python3 -m sglang.launch_server \
  --model meta-llama/Meta-Llama-3.1-8B-Instruct \
  --dtype bfloat16 \
  --port 30000 \
  --mem-fraction-static 0.65
  2. In another terminal, run benchmark script
python3 bench_sglang_eagle_double_turn.py

Output:

#questions: 80, Throughput: 49.24 token/s, Acceptance length: 1.00

runtime: 15 min 5 sec
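
Acceptance length is the average number of tokens the target model accepts per forward pass: 1.00 for the baseline (one token per pass) versus 3.57 with the EAGLE draft. If drafting were free, this would roughly bound the achievable speedup; the gap to the measured 3.17x is the cost of running the draft model itself. A quick check of the numbers above:

# Speedup implied by the two benchmark runs vs. the acceptance-length bound.
eagle_tps, baseline_tps = 156.33, 49.24
print(f"measured speedup: {eagle_tps / baseline_tps:.2f}x")  # 3.17x
print(f"acceptance-length bound: {3.57:.2f}x")               # 3.57x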
