Lightweight implementation of the newly introduced “Differential Transformer”:
Proposes a differential attention mechanism that computes attention scores as the difference between two separate softmax attention maps, thereby reducing noise in the attention blocks. [[[Differential nanoGPT]]] :)
Code: https://github.com/Jaykef/ai-algorithms/blob/main/DIFF_Transformer.ipynb
YT Video: https://youtu.be/9V4mJA5y7dg
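
A minimal single-head PyTorch sketch of the idea, assuming the paper's formulation softmax(Q1K1ᵀ/√d) − λ·softmax(Q2K2ᵀ/√d): the class name `DiffAttention` and the plain learnable scalar λ are illustrative simplifications (the paper reparameterizes λ via learnable vectors), not the notebook's exact code:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class DiffAttention(nn.Module):
    # Hypothetical minimal sketch, not the notebook's exact implementation.
    def __init__(self, d_model, d_head, lambda_init=0.8):
        super().__init__()
        # Two query/key projections: one pair per softmax attention map.
        self.q_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.k_proj = nn.Linear(d_model, 2 * d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        # Learnable scalar weighting the second (noise-cancelling) map;
        # the paper reparameterizes lambda, a plain Parameter keeps this short.
        self.lmbda = nn.Parameter(torch.tensor(lambda_init))
        self.d_head = d_head

    def forward(self, x):
        B, T, _ = x.shape
        q1, q2 = self.q_proj(x).chunk(2, dim=-1)   # (B, T, d_head) each
        k1, k2 = self.k_proj(x).chunk(2, dim=-1)
        v = self.v_proj(x)
        scale = 1.0 / math.sqrt(self.d_head)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), 1)
        a1 = (q1 @ k1.transpose(-2, -1)) * scale
        a2 = (q2 @ k2.transpose(-2, -1)) * scale
        a1 = a1.masked_fill(mask, float("-inf"))  # causal masking
        a2 = a2.masked_fill(mask, float("-inf"))
        # Differential attention: subtracting the two softmax maps cancels
        # common-mode noise, sharpening attention on the relevant tokens.
        attn = F.softmax(a1, dim=-1) - self.lmbda * F.softmax(a2, dim=-1)
        return attn @ v

# Quick shape check
x = torch.randn(2, 16, 64)
print(DiffAttention(d_model=64, d_head=32)(x).shape)  # torch.Size([2, 16, 32])
```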