GQA configuration

#1
by gogo8232 - opened

In the Llama 70B EAGLE, you downsized the max position embeddings, which makes some sense given your training code. In this particular case, however, you disabled GQA. Is there any reason for that? I also checked the Llama 70B case, and there you did not change num_key_value_heads in the config.

Lastly, a minor point: max_window_layers increased from 70 to 80. Is there any reason for that?
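
For reference, here is the config diff I am looking at. A minimal sketch of how one could reproduce it from the two config.json files (the paths are placeholders for wherever the configs are downloaded, not the actual repo layout):

```python
# Minimal sketch: diff two config.json files key by key.
# Paths are placeholders; point them at the downloaded configs.
import json

with open("Qwen2.5-70B-Instruct/config.json") as f:
    base = json.load(f)
with open("EAGLE-Qwen2.5-70B-Instruct/config.json") as f:
    draft = json.load(f)

# Print every key whose value differs (or is missing on one side).
for key in sorted(set(base) | set(draft)):
    if base.get(key) != draft.get(key):
        print(f"{key}: {base.get(key, 'Not specified')} -> {draft.get(key, 'Not specified')}")
```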

| Parameter | Qwen2.5-70B-Instruct | EAGLE-Qwen2.5-70B-Instruct |
|---|---|---|
| architectures | Qwen2ForCausalLM | Qwen2ForCausalLM |
| attention_dropout | 0.0 | 0.0 |
| bos_token_id | 151643 | 151643 |
| eos_token_id | 151645 | 151645 |
| hidden_act | silu | silu |
| hidden_size | 8192 | 8192 |
| initializer_range | 0.02 | 0.02 |
| intermediate_size | 29568 | 29568 |
| max_position_embeddings | 32768 | 32768 |
| max_window_layers | 70 | 80 |
| model_type | qwen2 | qwen2 |
| num_attention_heads | 64 | 64 |
| num_hidden_layers | 80 | 1 |
| num_key_value_heads | 8 | 64 |
| rms_norm_eps | 1e-06 | 1e-06 |
| rope_theta | 1000000.0 | 1000000.0 |
| sliding_window | 131072 | 131072 |
| tie_word_embeddings | false | false |
| torch_dtype | bfloat16 | bfloat16 |
| transformers_version | 4.43.1 | 4.40.1 |
| use_cache | true | true |
| use_sliding_window | false | false |
| vocab_size | 152064 | 152064 |
| qkv_bias | Not specified | true |
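
To make the GQA point concrete, here is a minimal sketch (using transformers' Qwen2Config, not your training code) of what the num_key_value_heads change implies:

```python
# Minimal sketch of how num_key_value_heads controls GQA in a Qwen2 config.
from transformers import Qwen2Config

# Base model: 64 query heads sharing 8 KV heads -> grouped-query attention.
base = Qwen2Config(hidden_size=8192, num_attention_heads=64, num_key_value_heads=8)

# EAGLE draft config above: num_key_value_heads == num_attention_heads,
# so every query head has its own KV head -> plain multi-head attention.
draft = Qwen2Config(hidden_size=8192, num_attention_heads=64, num_key_value_heads=64)

for name, cfg in [("base", base), ("draft", draft)]:
    groups = cfg.num_attention_heads // cfg.num_key_value_heads
    print(f"{name}: {cfg.num_key_value_heads} KV heads, "
          f"{groups} query heads per KV head "
          f"({'GQA' if groups > 1 else 'MHA'})")
```

In other words, with num_key_value_heads equal to num_attention_heads there is no key/value head sharing at all, which is standard multi-head attention rather than GQA.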