GQA configuration
#1 · opened by gogo8232
In the Llama 70B Eagle, you downsized max_position_embeddings, which somewhat makes sense given your training code. In this particular case, however, you disabled GQA. Is there any reason for that? I just checked the Llama 70B case, and there you didn't change num_key_value_heads.

Lastly, a minor issue: max_window_layers has increased from 70 to 80. Is there any reason for that?

For comparison, here are the two configs side by side:
| Parameter | Qwen2.5-70B-Instruct | EAGLE-Qwen2.5-70B-Instruct |
|---|---|---|
| architectures | Qwen2ForCausalLM | Qwen2ForCausalLM |
| attention_dropout | 0.0 | 0.0 |
| bos_token_id | 151643 | 151643 |
| eos_token_id | 151645 | 151645 |
| hidden_act | silu | silu |
| hidden_size | 8192 | 8192 |
| initializer_range | 0.02 | 0.02 |
| intermediate_size | 29568 | 29568 |
| max_position_embeddings | 32768 | 32768 |
| max_window_layers | 70 | 80 |
| model_type | qwen2 | qwen2 |
| num_attention_heads | 64 | 64 |
| num_hidden_layers | 80 | 1 |
| num_key_value_heads | 8 | 64 |
| rms_norm_eps | 1e-06 | 1e-06 |
| rope_theta | 1000000.0 | 1000000.0 |
| sliding_window | 131072 | 131072 |
| tie_word_embeddings | false | false |
| torch_dtype | bfloat16 | bfloat16 |
| transformers_version | 4.43.1 | 4.40.1 |
| use_cache | true | true |
| use_sliding_window | false | false |
| vocab_size | 152064 | 152064 |
| qkv_bias | Not specified | true |
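For reference, here is a rough sketch of how this comparison can be reproduced with `transformers` (the repo ids below are guesses based on the table header, so substitute the actual base-model and EAGLE draft-model repositories). GQA is only in effect when num_key_value_heads is smaller than num_attention_heads; with 64 KV heads for 64 attention heads, the draft model falls back to plain multi-head attention.

```python
# Sketch only: repo ids are assumptions taken from the table header above.
from transformers import AutoConfig

BASE_REPO = "Qwen/Qwen2.5-70B-Instruct"      # assumed base-model repo id
EAGLE_REPO = "EAGLE-Qwen2.5-70B-Instruct"    # assumed EAGLE draft-model repo id

base = AutoConfig.from_pretrained(BASE_REPO)
draft = AutoConfig.from_pretrained(EAGLE_REPO)

# GQA is enabled only when num_key_value_heads < num_attention_heads;
# equal counts mean ordinary multi-head attention (GQA disabled).
for name, cfg in [("base", base), ("draft", draft)]:
    ratio = cfg.num_attention_heads // cfg.num_key_value_heads
    mode = "MHA (GQA disabled)" if ratio == 1 else f"GQA ({ratio} query heads per KV head)"
    print(f"{name}: {cfg.num_attention_heads} heads / {cfg.num_key_value_heads} KV heads -> {mode}")

# Print every field that differs, which surfaces num_key_value_heads,
# num_hidden_layers, max_window_layers, transformers_version, and qkv_bias.
keys = sorted(set(base.to_dict()) | set(draft.to_dict()))
for key in keys:
    a, b = getattr(base, key, None), getattr(draft, key, None)
    if a != b:
        print(f"{key}: {a} -> {b}")
```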