How did you train your LatentAttentionLayer?
Hello.
I am wondering how you trained your latent attention layer.
In your technical report, you mention that you used LoRA with rank 16.
This makes sense for the layers and weights that come from the initial Mistral-7B-v0.1 model, but I am confused about whether you used LoRA in the latent attention layer too.
Did you train LoRA for the latent attention layer? If so, are all the initial weights for latent attention in the base model frozen at 0?
Did you use the same learning rate for the decoder layers and the latent attention layer?
Could you explain how you trained your model after adding the latent attention layer?
Thank you.
Hi @juneonetwothree, thanks for asking the question. We did not use the LoRA technique for the latent attention layer; only the decoder-only LLM is trained with LoRA. The decoder-only LLM and the latent attention layer are trained in an end-to-end manner, with the same learning rate for both.
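
For anyone wanting to see what that setup looks like in practice, here is a minimal sketch (not the authors' code), assuming a PyTorch + PEFT stack: LoRA adapters (rank 16) on the frozen decoder, a fully trainable latent attention layer, and a single optimizer with one learning rate so everything trains end to end. The `LatentAttentionLayer` definition, the `target_modules` choice, and the learning rate value are illustrative assumptions, not taken from the report.

```python
import torch
import torch.nn as nn
from transformers import AutoModel
from peft import LoraConfig, get_peft_model


class LatentAttentionLayer(nn.Module):
    """Hypothetical latent attention pooling: learnable latent vectors
    cross-attend to the decoder's token hidden states."""

    def __init__(self, hidden_dim: int, num_latents: int = 512, num_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(num_latents, hidden_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)

    def forward(self, token_states: torch.Tensor) -> torch.Tensor:
        # token_states: (batch, seq_len, hidden_dim)
        q = self.latents.unsqueeze(0).expand(token_states.size(0), -1, -1)
        out, _ = self.cross_attn(q, token_states, token_states)
        return out.mean(dim=1)  # pooled embedding, (batch, hidden_dim)


decoder = AutoModel.from_pretrained("mistralai/Mistral-7B-v0.1")
hidden_size = decoder.config.hidden_size

# LoRA (rank 16) on the decoder only; its base weights stay frozen.
lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
)
decoder = get_peft_model(decoder, lora_cfg)

# The latent attention layer is a fresh module trained in full (no LoRA, nothing frozen).
latent_attn = LatentAttentionLayer(hidden_dim=hidden_size)

# One optimizer and one learning rate cover both the LoRA adapters and the
# latent attention layer, so the whole model is trained end to end together.
trainable = [p for p in decoder.parameters() if p.requires_grad] + list(latent_attn.parameters())
optimizer = torch.optim.AdamW(trainable, lr=2e-5)  # lr value is illustrative
```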