The other day we were joking that the days of attention are numbered, and a few days later yet another paper was published showing that other methods can also solve the same problems without attention. As we mentioned before, we think attention is a great building block with respect to explainability, but its computational complexity is something we need to work on. Frankly, it feels a bit like walking along the edge of a circle and arriving back where we started; stated differently, the insights gathered during our walk indicate that plain feed-forward MLPs with ‘spatial mixing’ are enough to solve all the problems out there. So, we are back at ‘MLPs Are All We Need’.
However, regardless of whether attention is required or not, we need a building block that can at least partly explain the decisions of a neural net. As a consequence, at the end of the network we need some scoring with respect to the “input tokens”. When previous layers perform non-linear spatial mixing of these tokens, the scores obviously cannot be traced back directly to the input tokens, but this problem exists for Transformer architectures in general. So, for the sake of simplicity, we always consider ‘4 layer’ networks: an embedding layer, one mixing layer, an attention layer and finally the output layer for the prediction. For the mixing layer, the only constraint is that the shape of the input sequence is preserved, which is usually (batch, seq, dim). With this in mind, we could use [1,3] or any classical attention method.
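To make this concrete, here is a minimal sketch of such a ‘4 layer’ network in PyTorch. The module layout, the placeholder mixing layer and the mean pooling before the output are our own assumptions for illustration, not taken from any of the cited papers:

import torch.nn as nn

class FourLayerNet(nn.Module):
    # embedding -> mixing -> attention -> output
    def __init__(self, vocab_size, dim, num_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # placeholder mixing layer; any block that preserves the
        # (batch, seq, dim) shape, e.g. a gMLP or Mixer block, fits here
        self.mixing = nn.Sequential(nn.Linear(dim, dim), nn.GELU())
        self.attn = nn.MultiheadAttention(dim, num_heads=1, batch_first=True)
        self.out = nn.Linear(dim, num_classes)

    def forward(self, tokens):              # tokens: (batch, seq)
        x = self.embed(tokens)              # (batch, seq, dim)
        x = self.mixing(x)                  # shape preserved
        x, scores = self.attn(x, x, x)      # scores: (batch, seq, seq)
        return self.out(x.mean(dim=1)), scores

The attention weights returned by the last layer are what gives us the scoring of the “input tokens” mentioned above.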
If we use a building block from a paper, it is worth thinking about it for a moment before we implement it as a PyTorch layer. Why? We are not sure what the design criteria for this block were, but we assume that some idea was verified by applying it to some problem. That usually means some dataset was used and the goal was to train a model that generalizes well. Even if the goal is to design something that is applicable to a broad range of problems, the optimization of the design was likely done with respect to the conducted experiments, and that means with respect to the datasets used.
But let us not be vague here. We really appreciate the efforts, and also that the authors share their results with the community, often with a reference implementation or some pseudo code, but our point is that maybe the building block is too powerful for your problem. Very often a grid search is done for the hyperparameters, but it is less clear how to apply this idea to “minimize” a layer design. Maybe we are still too foggy here, so let’s be more concrete:
This is the pseudo code from [1]:
shortcut = x
x = norm(x, axis="channel")
x = gelu(proj(x, d_ffn, axis="channel"))
x = spatial_gating_unit(x)
x = proj(x, d_model, axis="channel")
x = x + shortcut
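Translated into an actual PyTorch layer, the block could look roughly like this. The spatial gating unit follows the description in [1], including the near-identity initialization of the spatial projection; everything else is our reading of the pseudo code, so treat it as a sketch rather than a reference implementation:

import torch.nn as nn
import torch.nn.functional as F

class SpatialGatingUnit(nn.Module):
    # splits the channels in half and gates one half with a learned
    # linear projection over the sequence (spatial) dimension
    def __init__(self, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_ffn // 2)
        self.spatial_proj = nn.Linear(seq_len, seq_len)
        nn.init.zeros_(self.spatial_proj.weight)  # near-identity init,
        nn.init.ones_(self.spatial_proj.bias)     # as suggested in [1]

    def forward(self, x):                   # x: (batch, seq, d_ffn)
        u, v = x.chunk(2, dim=-1)
        v = self.norm(v)
        v = self.spatial_proj(v.transpose(1, 2)).transpose(1, 2)
        return u * v                        # (batch, seq, d_ffn // 2)

class GMLPBlock(nn.Module):
    def __init__(self, d_model, d_ffn, seq_len):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.proj_in = nn.Linear(d_model, d_ffn)
        self.sgu = SpatialGatingUnit(d_ffn, seq_len)
        self.proj_out = nn.Linear(d_ffn // 2, d_model)

    def forward(self, x):                   # x: (batch, seq, d_model)
        shortcut = x
        x = F.gelu(self.proj_in(self.norm(x)))
        x = self.sgu(x)
        x = self.proj_out(x)
        return x + shortcut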
As said before, we do not imply that these designs happen by accident, but as one paper states, “it is still unclear what empowers such success” (referring to the Transformer architecture), and the same is true for new methods that provide no theoretical analysis to show that every operation is really necessary.
Furthermore, all those architectures were designed with scaling in mind, which means nobody thinks about datasets with just hundreds or thousands of examples. It is true that models should be reused as often as possible to avoid wasting energy, but often fine-tuning or distilling a larger model requires more energy than training the right model on a smaller dataset. And the right model implies a network architecture that minimizes the required FLOPs.
The point is that we believe building blocks are still too coarse to be used as a recipe for general problems, especially if the problem is tied to a non-large dataset. That means if time and space are not a problem and the dataset is considerably large, those building blocks often already provide a very strong baseline. However, it is not clear which parts of the blocks are really required and which parts could be removed without sacrificing anything. This is very much related to questions like: do you really need 768 dims, or are 384 enough?
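As a hypothetical back-of-the-envelope example, halving the hidden dimension of a single dense layer roughly quarters both its weights and its multiply-accumulate operations, because both grow quadratically with the width:

# weights and multiply-accumulates (MACs) of one dense layer
# mapping dim -> dim, evaluated for a sequence of 128 tokens
for dim in (768, 384):
    params = dim * dim + dim      # weight matrix + bias
    macs = 128 * dim * dim        # one MAC per weight per token
    print(f"dim={dim}: {params:,} params, {macs:,} MACs")

# dim=768: 590,592 params, 75,497,472 MACs
# dim=384: 147,840 params, 18,874,368 MACs

Of course this says nothing about the accuracy at the smaller width; that is exactly the trade-off we would like to see studied more systematically.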
Why do we think this is so important? Most of the companies out there have limited resources, time and money, and if your network could deliver the same decision in half the time or space, shouldn’t this be preferred? Model distillation or quantization might be applied here, but we think it is still important to research how to optimize network architectures with respect to a budget.
[1] Pay Attention to MLPs, arxiv:2105.08050
[2] MLP-Mixer, arxiv:2105.01601
[3] Fast Autoregressive Transformers with Linear Attention, arxiv:2006.16236