Linear Attention Overview

Let's start by looking at the standard attention formula:

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

Where:

- $V$ is the value matrix representing meanings in the sequence, dimensions $n \times d_v$
- $K$ is the key matrix representing indexes/locations of the values, dimensions $n \times d_k$
- $Q$ is the query matrix representing information we currently need, dimensions $n \times d_k$
- $d_k$ is the dimension of the key vectors

The output is a weighted sum of the value vectors $V$, where the weights are determined by the attention scores.
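To make the shapes concrete, here is a minimal NumPy sketch of this formula. The matrices `Q`, `K`, `V` and the `softmax` helper are illustrative assumptions for this post, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (n, d_k), K: (n, d_k), V: (n, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n, n) scaled dot-product scores
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n, d_v) weighted sum of the values

# Tiny example: n = 4 tokens, d_k = 8, d_v = 16
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 16))
print(attention(Q, K, V).shape)  # -> (4, 16)
```

Note the $n \times n$ score matrix: this is the quadratic cost in sequence length that linear attention aims to avoid.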

July 13, 2025 · 15 min · 3103 words