Eagle Speculative Decoding Model Trained with BaldEagle
BaldEagle Repo: https://github.com/NickL77/BaldEagle/
Achieves a 3.17x speedup (49.24 tok/s -> 156.33 tok/s) on the Llama 3.1 8B Instruct model.
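The speedup comes from speculative decoding: a cheap draft model proposes several tokens ahead, and the target model verifies them all at once, so each expensive target pass can yield more than one token. Below is an illustrative-only Python sketch of the greedy draft/verify loop; EAGLE itself drafts a token tree and reuses target hidden states, and target_next / draft_next here are toy stand-ins, not the real models.

# Illustrative sketch of one greedy speculative-decoding step.
# target_next / draft_next are toy stand-ins for the real models.

def target_next(ctx):
    # Pretend greedy next-token choice of the expensive target model.
    return (sum(ctx) * 31 + 7) % 100

def draft_next(ctx):
    # Cheap draft model: agrees with the target most of the time.
    guess = target_next(ctx)
    return guess if sum(ctx) % 4 else (guess + 1) % 100

def spec_decode_step(ctx, num_steps=5):
    # 1) Draft num_steps tokens autoregressively with the cheap model.
    draft = []
    for _ in range(num_steps):
        draft.append(draft_next(ctx + draft))
    # 2) Verify: the target checks every drafted position (in practice,
    #    one batched forward pass); accept the longest agreeing prefix.
    accepted = []
    for tok in draft:
        expected = target_next(ctx + accepted)
        accepted.append(expected)   # always emit the target's own token
        if tok != expected:         # first disagreement ends the step
            break
    return accepted

tokens = [1, 2, 3]
out = spec_decode_step(tokens)
print(len(out), out)  # >1 token per target pass -> "acceptance length" > 1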
Benchmarking (on RTX 3090)
- Start the sglang server
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--speculative-algo EAGLE \
--speculative-draft NickL77/BaldEagle-Llama-3.1-8B-Instruct \
--speculative-num-steps 5 \
--speculative-eagle-topk 8 \
--speculative-num-draft-tokens 64 \
--dtype bfloat16 \
--port 30000 \
--mem-fraction-static 0.65
- In another terminal, run the benchmark script (a quick client sanity check is also sketched after the output below)
python3 bench_sglang_eagle_double_turn.py
Output:
#questions: 80, Throughput: 156.33 token/s, Acceptance length: 3.57
runtime: 5 min 24 sec
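To confirm the server is actually serving before (or alongside) the benchmark, you can hit it from another terminal. A minimal sketch, assuming sglang's OpenAI-compatible /v1/chat/completions endpoint on the port chosen above; the model field is shown as the served model path for clarity.

import requests

# Assumes the server above is running locally on --port 30000.
resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "meta-llama/Meta-Llama-3.1-8B-Instruct",  # served model path
        "messages": [{"role": "user", "content": "Say hello in five words."}],
        "max_tokens": 32,
        "temperature": 0,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])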
Baseline
- Start the sglang server
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--dtype bfloat16 \
--port 30000 \
--mem-fraction-static 0.65
- In another terminal, run the benchmark script
python3 bench_sglang_eagle_double_turn.py
Output:
#questions: 80, Throughput: 49.24 token/s, Acceptance length: 1.00
runtime: 15 min 5 sec
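Acceptance length is the average number of tokens emitted per target-model forward pass, so the baseline's 1.00 (one token per pass, no speculation) versus EAGLE's 3.57 is where the gain comes from; the realized speedup is somewhat below 3.57x because drafting and verification are not free. A rough back-of-the-envelope check using only the numbers reported above:

# Numbers copied from the two benchmark outputs above.
baseline_tps = 49.24    # tok/s, no speculation
eagle_tps = 156.33      # tok/s, EAGLE speculative decoding
acceptance_len = 3.57   # avg tokens accepted per target forward pass

speedup = eagle_tps / baseline_tps
print(f"speedup: {speedup:.2f}x")  # ~3.17x
# Rough overhead factor: acceptance length is an upper bound on the speedup,
# so the gap to the realized speedup approximates the drafting/verify cost.
print(f"overhead factor: {acceptance_len / speedup:.2f}x")  # ~1.12x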