In a recently proposed method, self-attention is rephrased as a kernelized similarity function. Besides the fact that the computational complexity is drastically reduced, it is now possible to use feature transformations other than exp(q*k). The paper uses elu(x)+1 as the transformation because the kernel function k(x,y) acts as a proxy for the similarity between x and y, so its value must be positive.
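For reference, the rephrasing can be written out as follows. The notation (phi for the feature map, q_i, k_j, v_j for query, key and value vectors, D for their dimension) follows the cited paper rather than this post, and the 1/sqrt(D) scaling is the paper's convention for the softmax case, so treat this as a summary sketch:

```latex
\begin{align}
V'_i &= \frac{\sum_{j} \operatorname{sim}(q_i, k_j)\, v_j}{\sum_{j} \operatorname{sim}(q_i, k_j)},
  & \operatorname{sim}(q, k) &= \exp\!\left(\frac{q^\top k}{\sqrt{D}}\right)
  && \text{(softmax attention)} \\
V'_i &= \frac{\phi(q_i)^\top \sum_{j} \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_{j} \phi(k_j)},
  & \operatorname{sim}(q, k) &= \phi(q)^\top \phi(k),
  \quad \phi(x) = \operatorname{elu}(x) + 1
  && \text{(linear attention)}
\end{align}
```

In the second form the two sums over j do not depend on the query, so they can be computed once and reused, which is where the reduced complexity comes from.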
What is really nice is that the authors also provide PyTorch code that is pretty much self-contained, which makes it easy to test different feature maps. But before we actually start using the new layer in a network, we wanted to better understand the impact of the feature transformation. This can be done without any neural net training, just by looking at some numbers.
We drew five random numbers from N(0, 1): X = [0.56 0.12 0.06 -0.43 0.03]. We can imagine that X contains the inner products of some word with all other words in a sequence. Thus, higher positive scores indicate similarity, while negative values indicate the opposite. Then we used three different transformations:
(1) (torch.nn.functional.leaky_relu(X)+0.05)**2 [leaky relu]
(2) torch.nn.functional.elu(X)+1 [elu]
(3) torch.nn.functional.gelu(X)+0.2 [gelu]
We then applied each feature transformation to X and normalized the resulting scores so that they sum to one:
[0.56  0.12  0.06  -0.43 0.03 ] -- X
[0.88  0.07  0.03  0.005 0.016] -- (1) leaky relu
[0.288 0.207 0.196 0.12  0.19 ] -- (2) elu
[0.437 0.194 0.169 0.041 0.158] -- (3) gelu
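The numbers above can be reproduced with a few lines of plain Python. This is a minimal sketch that avoids the PyTorch dependency: the helper functions are our own re-implementations and are assumed to mirror the PyTorch defaults (negative_slope=0.01 for leaky_relu, alpha=1.0 for elu, the exact erf-based gelu), and the names FEATURE_MAPS and normalized_weights are only illustrative.

```python
import math

# The five sampled scores from the text.
X = [0.56, 0.12, 0.06, -0.43, 0.03]


def leaky_relu(x, negative_slope=0.01):
    # Assumed to match the default of torch.nn.functional.leaky_relu.
    return x if x >= 0 else negative_slope * x


def elu(x, alpha=1.0):
    # Assumed to match the default of torch.nn.functional.elu.
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


def gelu(x):
    # Exact (erf-based) GELU: x * Phi(x).
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


# The three feature transformations (1)-(3) from the text.
FEATURE_MAPS = {
    "leaky_relu": lambda x: (leaky_relu(x) + 0.05) ** 2,
    "elu": lambda x: elu(x) + 1.0,
    "gelu": lambda x: gelu(x) + 0.2,
}


def normalized_weights(scores, feature_map):
    # Apply the feature map element-wise, then divide by the sum.
    mapped = [feature_map(s) for s in scores]
    total = sum(mapped)
    return [m / total for m in mapped]


weights = {name: normalized_weights(X, fmap) for name, fmap in FEATURE_MAPS.items()}
for name, w in weights.items():
    print(f"{name:>10}: {[round(v, 3) for v in w]}")
```

Because X is given here with only two decimals, the last digit of a printed weight can differ slightly from the rows above.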
The intuition is that larger positive values should contribute more, while negative ones should get little or almost no weight. The problem with (2)-(3) is that, because of the additive shift, inputs in the range [-bias, 0] are still mapped to sizeable positive values and therefore contribute a lot to the final weights. This can be seen in (2), where the input values -0.43 / 0.03 are mapped to 0.12 / 0.19 even though the difference between the inputs is very noticeable. In general, the elu transformation leads to a distribution that is close to uniform, which is not very useful since words usually do not contribute equally to a context. With gelu, at least the problem with negative values is handled. The first transformation probably best reflects the assumption that negative and very small inner product scores should not contribute (much) to the attention, although it may put a little too much emphasis on larger positive values.
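One way to put a number on "close to uniform" is the Shannon entropy of each weight vector, divided by the maximum possible entropy log(5), so that 1.0 means perfectly uniform. The entropy measure is our own addition and not part of the paper; the rows are copied from the table above and renormalized because they are rounded.

```python
import math

# Normalized weights copied from the table above (rounded values).
ROWS = {
    "leaky_relu": [0.88, 0.07, 0.03, 0.005, 0.016],
    "elu": [0.288, 0.207, 0.196, 0.12, 0.19],
    "gelu": [0.437, 0.194, 0.169, 0.041, 0.158],
}


def entropy(weights):
    # Renormalize first, since the rounded rows do not sum to exactly one.
    total = sum(weights)
    probs = [w / total for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0)


# Entropy of the uniform distribution over five items.
max_entropy = math.log(5)

normalized_entropy = {name: entropy(w) / max_entropy for name, w in ROWS.items()}
for name, value in normalized_entropy.items():
    print(f"{name:>10}: {value:.2f}")
```

The elu row comes out very close to 1.0, gelu somewhat lower, and the leaky relu variant far lower, which matches the qualitative reading of the table.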
Bottom line: the experiments in the paper confirm that the ELU transformation is a reasonable choice, but in our own experiments we noticed some strange behavior in the final representation when transformation (2) is used, and thus we stick with (1). It would be interesting to know why this happens, as the answer would probably lead to a deeper understanding of the transformation step and its implications.
arXiv:2006.16236: Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention