Fast Autoregressive Transformers with Linear Attention

ICML 2020

Transformers achieve remarkable performance in several tasks but
due to their quadratic complexity, with respect to the input's
length, they are prohibitively slow for very long sequences. To
address this limitation, we express the self-attention as a linear
dot-product of kernel feature maps and make use of the
associativity property of matrix products to reduce the complexity
from $\bigO{N^2}$ to $\bigO{N}$, where $N$ is the sequence length.
We show that this formulation permits an iterative implementation
that dramatically accelerates autoregressive transformers and
reveals their relationship to recurrent neural networks. Our
*linear transformers* achieve similar performance to vanilla
transformers and they are up to 4000x faster on autoregressive
prediction of very long sequences.

Video presentation

Interactive Demo

In our paper, we show that replacing the softmax attention with a
linear dot product of kernel feature maps $\fe{Q}$ and $\fe{K}$
allows us to express any transformer as a recurrent neural network
with the following update equations, where $x_i$ denotes the $i$-th
input, $y_i$ the $i$-th output and $s_i$, $z_i$ the hidden state
at timestep $i$ (see section 3 of the paper for a detailed
description):
$$
\begin{aligned}
s_0 &= 0, \\
z_0 &= 0, \\
s_i &= s_{i-1} + \fe{x_i W_K} \left(x_i W_V\right)^T, \\
z_i &= z_{i-1} + \fe{x_i W_K}, \\
y_i &= f_l\left(\frac{\fe{x_i W_Q}^T s_i}{\fe{x_i W_Q}^T z_i} + x_i\right).
\end{aligned}
$$

The above formulation allows for efficient autoregressive inference
with linear time complexity and constant memory. To showcase the
efficiency of our model we load in the browser a transformer
trained for autoregressive image generation and perform
unconditional image generation with JavaScript in the browser. The
loaded transformer has 8 layers and 8 heads per layer.

The following demo does not compare with any baselines; if you want
to compare our formulation with the standard transformer try our colab.
However, this demo runs on your device and not on any server, you
could even try it on your phone.

Unconditional Samples Generator

Concurrent Work

If you are interested in our work, you might also be interested in
highly related concurrent work:

Acknowledgements

Angelos Katharopoulos and Apoorv Vyas are supported by SNSF for their research.