With our very modest computational machinery, we search for an AI architecture that is both lightweight and versatile. That is why the current tendency in AI to improve models simply by using bigger networks and more data concerns us a bit. The energy required to train and run those models is considerable, not to mention the problem that only very big companies can afford to train them and therefore decide how others may use them.
It is also a problem that the Transformer architecture, which is now used almost everywhere, has a quadratic complexity with respect to the size of the input. A lot of variants have been proposed to address the issue, like [1], but the real-world models out there mostly use the vanilla Transformer architecture. And since the method of [1] does not generate the similarity matrix, it is also hard to 'debug' possible attention patterns.
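To illustrate the trade-off, here is a minimal, non-causal sketch of both attention variants (the feature map phi(x) = elu(x) + 1 follows [1]; the function names are ours):

```python
import torch
import torch.nn.functional as F

def vanilla_attention(q, k, v):
    # q, k, v: (batch, N, d). Builds the explicit N x N similarity
    # matrix -> O(N^2), but the matrix can be inspected ('debugged').
    attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
    return attn @ v

def linear_attention(q, k, v, eps=1e-6):
    # Kernel trick from [1]: phi(q) (phi(k)^T v) never materializes
    # the N x N matrix -> O(N), but there is no attention map to look at.
    phi = lambda x: F.elu(x) + 1
    q, k = phi(q), phi(k)
    kv = k.transpose(1, 2) @ v                            # (batch, d, d)
    z = q @ k.sum(dim=1, keepdim=True).transpose(1, 2)    # (batch, N, 1)
    return (q @ kv) / (z + eps)
```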
Even if a method is working very well, we advocate also following other, maybe even less promising, research directions, since it is unlikely that the Transformer is the end of the AI road. The fact that papers like [2] exist confirms that other researchers have similar thoughts. To stress this point: we do not favor any architecture, we just want a neural net that is "getting things done" efficiently, and it does not matter if it is an FFN, RNN, Transformer or ConvNet. But since [2] aims for fixed-size input, which is problematic for our domain, NLP, we did not pursue this line of research. But not a year later, [3] introduced a method that works with variable-size input, and the method is even called "green". So, good news everyone, we have a green AI architecture to deliver.
We implemented the method in PyTorch, which is very simple if you get the axes right. Instead of using a fixed weight matrix of shape (N, dim), where N is the sequence length, we generate the matrix with a 'hyper network'. This allows us to process inputs of any length, and the complexity is linear, O(n), with respect to the input, not quadratic like in Transformers. So, problem solved and we are done here?
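A minimal sketch of such a token-mixing layer, assuming the formulation from [3] (the class and parameter names are ours, and details like weight tying and positional information are omitted):

```python
import torch
import torch.nn as nn

class HyperTokenMixer(nn.Module):
    """HyperMixer-style token mixing (sketch, following [3]).

    The mixing weights W1, W2 are generated from the tokens themselves
    by small 'hyper networks' instead of being fixed (N, dim) parameters,
    so the layer accepts any sequence length N and runs in O(N).
    """

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.hyper_w1 = nn.Linear(dim, hidden_dim)  # generates W1 from x
        self.hyper_w2 = nn.Linear(dim, hidden_dim)  # generates W2 from x
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) -- N may differ from batch to batch
        w1 = self.hyper_w1(x)                      # (batch, N, hidden)
        w2 = self.hyper_w2(x)                      # (batch, N, hidden)
        mixed = self.act(w1.transpose(1, 2) @ x)   # (batch, hidden, dim)
        return w2 @ mixed                          # (batch, N, dim)
```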
We created a modular neural net with PyTorch that allows us to easily replace the 'mixing' layer by just following a defined interface. In other words, we can test vanilla Self Attention, Linear Attention and Hyper Mixer by replacing one line. It should be noted that we did not optimize any particular method: we used the reference implementation of [1] for Self Attention and Linear Attention, and our own implementation of [3], which just consists of matrix multiplications and transposing operations.
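The interface boils down to something like this (a hypothetical sketch of our setup, not the actual code):

```python
import torch.nn as nn

class Block(nn.Module):
    """Transformer-style block with a pluggable 'mixing' layer.

    Any module mapping (batch, N, dim) -> (batch, N, dim) fits the
    interface, so swapping between attention variants and HyperMixer
    is literally a one-line change at construction time.
    """

    def __init__(self, dim: int, mixer: nn.Module):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mixer = mixer  # the only line that changes between methods
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))   # token mixing
        return x + self.ffn(self.norm2(x))  # feature mixing

# the one-line swap, e.g.:
# block = Block(dim=128, mixer=HyperTokenMixer(128, hidden_dim=256))
```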
We discarded vanilla Transformers from the evaluation, since preliminary runtime tests with longer sequences, on commodity hardware(!), showed that they did not scale for the use-case we are currently working on. It should be noted that our implementation of [3] might not be efficient. But for all experiments we conducted, [1] clearly outperformed [3] with respect to the loss function and also the CPU runtime. We did a lot of tests to check whether our implementation of [3] is buggy, but since learning is clearly visible, just slower in terms of CPU-time and the loss function, we concluded that the code is correct, just not a good fit for the problem at hand.
So, let’s get back to green AI. We did not formally analyze the FLOPs required by [1] and [3]; we just measured the CPU-time both methods need to train a model, both per epoch and the total time until the model “converged”. Judging by these timings, Linear Attention and Hyper Mixer seem to be in a very similar complexity class. This might change if you train with ‘thousands of dimensions’, but in our case Hyper Mixer is just as green as the Linear Transformer [1].
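For reference, the measurement itself is nothing fancy; a helper like the following (a hypothetical sketch, names are ours) captures what we did per epoch:

```python
import time

def timed_epoch(model, loader, loss_fn, opt):
    """Train for one epoch and return the consumed CPU time in seconds."""
    start = time.process_time()  # CPU time, not wall-clock
    for x, y in loader:
        opt.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        opt.step()
    return time.process_time() - start
```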
Bottom line: since Transformers are pretty dominant right now, the quadratic complexity should definitely be addressed to speed up computations, which saves energy and reduces the carbon footprint. However, even if our experiments are very modest, for our kind of NLP problems the Transformer still performs best. But as we mentioned earlier, we do not believe that the Transformer is the end of the AI road, which is why we encourage researchers to explore other directions and welcome alternatives like [2,3].
[1] Fast Autoregressive Transformers with Linear Attention, arXiv:2006.16236
[2] MLP-Mixer: An all-MLP Architecture for Vision, arXiv:2105.01601
[3] HyperMixer: An MLP-based Green AI Alternative to Transformers, arXiv:2203.03691