bird-of-paradise/deepseek-mla · Why do you have UV, UK, and UQ?

Hi Hudson! Great question about the matrix absorption optimization.

You're absolutely right that the paper mentions these matrices can be absorbed during inference for efficiency. In this implementation, I kept them separate for a few reasons:

Clarity and Understanding: This implementation prioritizes making the architecture clear and understandable, showing each component explicitly as described in the paper's equations.
Training vs. Inference: The absorption optimization applies to inference only. During training, the up-projection matrices are learned as separate parameters; their products can only be precomputed once the weights are frozen.
Flexibility: Keeping them separate allows for easier experimentation and modification of individual components.
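
To make the trade-off concrete, here is a minimal NumPy sketch of what absorption means. It is not the code from this repository: it assumes a single head, ignores the decoupled RoPE path, and uses made-up toy dimensions and variable names (W_uq, W_uk, W_uv, W_o, c_q, c_kv). It only checks that folding UQ into UK, and UV into the output projection, gives the same result as the explicit version.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy sizes, chosen only for illustration.
d_cq, d_ckv, d_head, d_model, n_tokens = 6, 4, 8, 16, 5

W_uq = rng.standard_normal((d_head, d_cq))    # up-projects the compressed query
W_uk = rng.standard_normal((d_head, d_ckv))   # up-projects the KV latent to keys
W_uv = rng.standard_normal((d_head, d_ckv))   # up-projects the KV latent to values
W_o = rng.standard_normal((d_model, d_head))  # output projection for this head

c_q = rng.standard_normal(d_cq)                # compressed query of the current token
c_kv = rng.standard_normal((n_tokens, d_ckv))  # cached KV latents of past tokens


def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()


# Explicit path: materialize q, k, v with separate matrices (as kept here).
q = W_uq @ c_q                      # (d_head,)
k = c_kv @ W_uk.T                   # (n_tokens, d_head)
v = c_kv @ W_uv.T                   # (n_tokens, d_head)
weights = softmax(k @ q / np.sqrt(d_head))
out_explicit = W_o @ (weights @ v)  # (d_model,)

# Absorbed path: precompute the products once, then attend over the latents.
W_qk = W_uq.T @ W_uk                # (d_cq, d_ckv), replaces UQ and UK
W_ov = W_o @ W_uv                   # (d_model, d_ckv), replaces UV and W_o
weights_abs = softmax(c_kv @ (W_qk.T @ c_q) / np.sqrt(d_head))
out_absorbed = W_ov @ (weights_abs @ c_kv)

print(np.allclose(out_explicit, out_absorbed))
```

In the absorbed path the per-token keys and values are never materialized; attention runs directly on the cached latents. That is where the inference savings come from, and it is also why the individual matrices are harder to read off from the code.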