Simple GLA

The gating mechanism used in Gated RFA, Mamba2, and YOCO (a.k.a. Gated RetNet).

Compared to GLA, the gating here is head-wise rather than elementwise. As a result, the RetNet kernel can be adapted for matmul-based training without numerical instability. Simple GLA is faster than GLA but less expressive; I will use it as a baseline for GLA. A sketch of the shape difference follows below.
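
To make the head-wise vs. elementwise distinction concrete, the gate tensors might have the following shapes (the batch size, head count, sequence length, and key dimension below are made-up values for illustration):

```python
import torch

B, H, T, D_k = 2, 4, 16, 32  # hypothetical batch, heads, length, key dim

# GLA: elementwise gate, one decay value per key-dimension channel per step
g_gla = torch.sigmoid(torch.randn(B, H, T, D_k))  # shape (B, H, T, D_k)

# Simple GLA: head-wise gate, a single scalar decay per head per step
g_simple = torch.sigmoid(torch.randn(B, H, T))    # shape (B, H, T)
```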

The state update is $S_{t+1} = g_{t+1} \odot S_{t} + K_{t+1} V_{t+1}^{\top}$, where $g_{t+1}$ is a scalar (a single decay value per head).
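
A minimal recurrent reference of this update in PyTorch might look as follows; the function name, tensor layout, and sigmoid gate parameterization are assumptions for illustration, and the actual training kernels compute the same thing chunkwise with matmuls rather than step by step:

```python
import torch

def simple_gla_recurrent(q, k, v, g):
    """Naive recurrence for S_{t+1} = g_{t+1} * S_t + k_{t+1} v_{t+1}^T.

    Assumed shapes (hypothetical layout):
        q, k: (B, H, T, D_k)
        v:    (B, H, T, D_v)
        g:    (B, H, T)  -- one scalar decay per head per step
    Returns o: (B, H, T, D_v)
    """
    B, H, T, D_k = k.shape
    D_v = v.shape[-1]
    S = k.new_zeros(B, H, D_k, D_v)  # running state S_t
    outputs = []
    for t in range(T):
        # head-wise scalar gate broadcasts over the whole state matrix
        S = g[:, :, t, None, None] * S + k[:, :, t, :, None] * v[:, :, t, None, :]
        # read out with the query: o_t = q_t^T S_t
        outputs.append(torch.einsum('bhk,bhkv->bhv', q[:, :, t], S))
    return torch.stack(outputs, dim=2)

# Usage with made-up sizes
q = torch.randn(2, 4, 16, 32)
k = torch.randn(2, 4, 16, 32)
v = torch.randn(2, 4, 16, 64)
g = torch.sigmoid(torch.randn(2, 4, 16))  # decay values in (0, 1)
o = simple_gla_recurrent(q, k, v, g)      # (2, 4, 16, 64)
```

Because $g$ is a single scalar per head, the product of gates across a chunk collapses to one decay factor per position, which is what lets the chunkwise form stay in plain matmuls without the numerical-stability tricks that elementwise GLA gating requires.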