We are still pursuing our SOSO idea and during the literature research we stumbled upon mixture of experts (MOE). The idea dates back to the early 90s and is quite simple: instead of having one big network that does all the work, we delegate the decision to one or more experts. The architecture consists of two building blocks:
(1) a controller network that decides which experts to ask, based on the current input
(2) a set of expert networks that are as simple as possible.
The controller outputs a probability distribution and, depending on the design, either only those experts whose probability is above a threshold are asked, or only the top-k experts. The idea is to limit the computations required at test time to derive a decision. MOEs can also be used for regression.
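To make the routing concrete, here is a minimal numpy sketch of top-k gating; the names (`moe_forward`, `W_gate`) and the random linear toy experts are our own illustration, not the setup of any particular paper.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_forward(x, W_gate, experts, k=2):
    # controller: probability distribution over the experts
    probs = softmax(W_gate @ x)
    # keep only the k most probable experts and renormalize their weights
    top_k = np.argsort(probs)[-k:]
    weights = probs[top_k] / probs[top_k].sum()
    # evaluate only the selected experts and combine their outputs
    outputs = np.stack([experts[i](x) for i in top_k])
    return weights @ outputs

# toy usage: four random linear "experts" mapping a 3-dim input to 2 dims
rng = np.random.default_rng(0)
experts = [(lambda x, W=rng.normal(size=(2, 3)): W @ x) for _ in range(4)]
W_gate = rng.normal(size=(4, 3))
print(moe_forward(rng.normal(size=3), W_gate, experts, k=2))
```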
While we followed this track, we stumbled upon ‘Active Memory Networks’, which are somehow related to this idea. Since one of our goals is also topic learning, we gave it a try: one conclusion of the paper was that the memory cells often learned topics without explicitly being given topic labels or being encouraged to learn such information with a specific loss function. We found the hyper-parameters in the PhD thesis of the first author.
In contrast to our other posts, this one is about the frustration when the training silently fails. The authors of the paper also mention one big issue and propose a solution, but as so often, the solution seems to be data dependent and there is no golden rule in general. The issue is that gating functions based on the softmax often converge to absolutely useless states, namely a uniform output, or a single unit gets all the score, which is reinforced over time, and all other units quickly die. We already encountered this problem and experimented with subtracting the mean activation value, while the paper proposes a temperature-based annealing. The problem is that in the derivative the update for unit ‘i’ is multiplied by its probability score, and thus, if the score is very small, the unit gets almost no update, while if it is close to one, that single unit gets all the weight, which is a kind of reinforcement. With the temperature, a near-uniform probability is forced during the early epochs, which means all units get updated. When the annealing is done, the temperature T is set to one, which recovers the classical softmax output. This can be considered a warmup that is slowly turned off to recover the original state in a step-wise, but continual fashion.
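A small sketch of the temperature-scaled softmax (plain numpy, with made-up toy logits): dividing the logits by a large T flattens the distribution so every gating unit keeps receiving a gradient, and T = 1 brings back the ordinary softmax.

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    # dividing the logits by T flattens the distribution for large T;
    # T = 1 recovers the ordinary softmax
    z = (z - z.max()) / T
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, -1.0, 0.0])
for T in (10.0, 2.0, 1.0):
    print(T, softmax_with_temperature(logits, T).round(3))
# T = 10 -> close to uniform, all gating units keep receiving updates
# T = 1  -> the usual, much peakier softmax
```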
As also noted by the authors, it is unlikely that such a complex network automatically adjusts itself to learn a powerful, but also selective representation. One problem is definitely co-adaptation, which means that units likely cooperate and rely on each other in case of errors. To promote diversity, a special loss function is added and dropout is used as well. And as usual, most hyper-parameters interact with each other, and we further need a working baseline before we can start tuning at all. So, finally, we need to take care of at least:
– the temperature T, the multiplier gamma and a schedule for when to derive a new value of T
– the amount of dropout in the memory cells
– the regularization penalty lambda for the ITL loss
Seems manageable, but the problem is that we need to wait until the warmup is done before we can analyze the probability distribution of the controller, since earlier it is surely almost uniform.
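Just to illustrate what such a schedule could look like, here is a hypothetical step-wise decay; T0, gamma and the step size are made-up values, not the settings from the paper or the thesis.

```python
def temperature(epoch, T0=10.0, gamma=0.7, step=5):
    # every `step` epochs the temperature is multiplied by gamma < 1,
    # clamped at 1.0 so the classical softmax is recovered in the end
    return max(T0 * gamma ** (epoch // step), 1.0)

for epoch in range(0, 40, 5):
    print(epoch, round(temperature(epoch), 3))
```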
Just a quick note: this issue is not limited to this particular method, which is why it is so important to mention such pitfalls in a paper and, if there is no general solution, to at least give some hints on where to start. Still, even with those hints or solutions, there is no guarantee that they will also work with a new dataset, or at least not without a considerable amount of fine-tuning.
Bottom line: it is often a dream that one can just re-implement some neural method found in a paper and have it work out-of-the-box with the data at hand. When we read papers, we usually wonder how much time the authors spent until a working solution was found. The problem is that the loss landscape of a model with millions of parameters is very hard to understand and that every (hyper-)parameter has an impact; due to the chain rule of backprop, a tiny modification propagates through the network and might be magnified or damped depending on the dynamics.
At the beginning we said silently fail, but why silently? The final accuracy of the network might be reasonable or even perfect, but the attention mechanism might still be useless, since it often performs either plain averaging or a one-hot selection. For a complex dataset it is reasonable to assume that a mixture of experts should indeed be used to derive the final representation. However, when the architecture is powerful enough, there is no benefit for the network to learn a non-trivial attention mechanism, since the other parts of the network compensate for the problem.
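One cheap diagnostic for this (our own helper, not from the paper) is the normalized entropy of the gating or attention distribution: values near 1 indicate averaging, values near 0 indicate a one-hot collapse.

```python
import numpy as np

def gate_entropy(probs, eps=1e-12):
    # entropy of the gating distribution, normalized to [0, 1]:
    # ~1.0 -> near-uniform (averaging), ~0.0 -> one-hot collapse
    probs = np.clip(probs, eps, 1.0)
    h = -(probs * np.log(probs)).sum()
    return h / np.log(len(probs))

print(gate_entropy(np.array([0.25, 0.25, 0.25, 0.25])))  # 1.0   -> averaging
print(gate_entropy(np.array([0.97, 0.01, 0.01, 0.01])))  # ~0.12 -> almost one-hot
```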
 “Active Memory Networks for Language Modeling”, O. Chen, A. Ragni, M.J.F. Gales, X. Chen