Memory-augmented networks can remember rare events, which is very useful for one-shot learning and for avoiding catastrophic forgetting of very descriptive but infrequent patterns. Given the evidence from recently published papers, it seems safe to say that memory is a step in the right direction toward more powerful networks. However, as usual, there is a BUT that can bite you in the backside.
In general, it is assumed that the data has some underlying but hidden factors that a model needs to explain. If the model does a good job, it learns a compressed representation of the data that can be used to solve the problem at hand, usually a classification task. So the success of the model relies on disentangling the data until classification with a linear model is possible.
When a memory is added to the model, its life gets easier, because it can store and retrieve templates for the latent factors that describe a class label. This removes the burden of encoding all the knowledge into the weight matrices. That is especially important if some patterns are very rare and are therefore likely to be “overwritten” by more frequent ones: fitting the frequent patterns improves the loss a lot, but it does not help to learn the rare ones.
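To make the "store and retrieve templates" idea concrete, here is a minimal sketch of a memory read, assuming templates are stored as unit-length key vectors with one class label each and that retrieval is a nearest-neighbor lookup by cosine similarity. The class name `TemplateMemory` and the toy vectors are made up for illustration and are not taken from any particular implementation.

```python
import numpy as np

def normalize(x):
    # Scale vectors to unit length so a dot product equals cosine similarity.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

class TemplateMemory:
    """Hypothetical key-value memory: one template (key) per stored label."""

    def __init__(self, keys, labels):
        self.keys = normalize(np.asarray(keys, dtype=float))
        self.labels = np.asarray(labels)

    def read(self, query):
        # Compare the controller output (query) against every stored template
        # and return the label of the most similar one.
        sims = self.keys @ normalize(np.asarray(query, dtype=float))
        best = int(np.argmax(sims))
        return self.labels[best], float(sims[best])

memory = TemplateMemory(keys=[[1.0, 0.0], [0.0, 1.0]], labels=[0, 1])
label, similarity = memory.read([0.9, 0.1])
```

The point of the sketch is only that the label comes from a stored template rather than from the weight matrices, so a pattern seen once can be recalled as long as its entry is still in the memory.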
The problem is that for some kinds of data, it takes a lot of time and space (memory) to converge to a stable state, and during this time the memory is adjusted a lot. What does that mean? By default, the oldest entry is replaced, and the oldest entry is likely to point to a rare pattern, because rare patterns are not seen and updated very often. As a result, templates for rare patterns are eventually removed from the memory and need to be re-learned when the pattern shows up again, which is burdensome and unreliable.
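The effect can be reproduced with a toy simulation. This is a sketch under simplifying assumptions: entries are identified by string keys instead of vectors, every miss inserts a new entry, and the drifting controller representation is mimicked by giving the frequent classes a new "version" of their key every few steps. The class `AgeBasedMemory` is hypothetical.

```python
class AgeBasedMemory:
    """Toy memory that always replaces the oldest entry when it is full."""

    def __init__(self, size):
        self.size = size
        self.age = {}  # key -> number of steps since the entry was last written or hit

    def step(self, key):
        # Every existing entry gets one step older.
        for k in self.age:
            self.age[k] += 1
        # On a miss with a full memory, evict the entry with the highest age.
        if key not in self.age and len(self.age) >= self.size:
            oldest = max(self.age, key=self.age.get)
            del self.age[oldest]
        # Insert the new entry, or refresh the existing one, with age 0.
        self.age[key] = 0

memory = AgeBasedMemory(size=4)
memory.step("rare")  # the rare pattern is seen exactly once

# While the model is still converging, the frequent classes keep producing
# keys that no longer match their old entries, so they are inserted as new ones.
for t in range(6):
    memory.step(f"frequent_{t % 3}_v{t // 3}")

rare_survived = "rare" in memory.age
```

In this run the rare entry is the first one to be evicted, even though the stale first-version entries of the frequent classes are the ones that are no longer useful.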
In other words, if the underlying data manifold is very complex and the memory is in flux while the model is still converging, the benefit of using a memory for rare events is practically gone, because the rare entries are “pushed out” of the memory by the many readjustment steps.
Bottom line: we need to adjust the procedure that selects “old” entries so that it minimizes the probability of removing rare events. But the problem is more complex than that, because a template is likely to get “out of sync” if it is not averaged with a recent controller representation from time to time. Again, the data seems to be the culprit, since our experiments in other domains, such as images or natural language, worked much better.
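Both ideas can be sketched in a few lines. This is only one possible direction and not a tested solution: the eviction heuristic (`pick_victim`) is an assumption made up for illustration, and it prefers to evict from the label that occupies the most slots, so a label with a single template is touched last. The refresh step (`refresh_template`) uses a normalized average of the stored template and the current controller output, which is one common way to keep a template in sync.

```python
import numpy as np
from collections import Counter

def normalize(x):
    # Scale a vector to unit length.
    return x / np.linalg.norm(x)

def refresh_template(key, query):
    # Normalized average of the stored template and the recent controller output.
    return normalize(key + query)

def pick_victim(labels, ages):
    # Hypothetical heuristic: evict from the label holding the most slots and
    # only fall back to "oldest first" among the entries of that label.
    counts = Counter(labels)
    return max(range(len(labels)), key=lambda i: (counts[labels[i]], ages[i]))

labels = ["common", "common", "common", "rare"]
ages = [5, 2, 1, 40]
victim = pick_victim(labels, ages)  # index of the entry to replace

key = normalize(np.array([1.0, 0.0]))
query = normalize(np.array([0.0, 1.0]))
refreshed = refresh_template(key, query)
```

With these toy values, a purely age-based rule would replace the rare entry (age 40), while the label-aware rule replaces the oldest "common" entry instead. Whether such a rule helps on real data is an open question, since it does nothing about a rare template that has drifted away from the controller without ever being refreshed.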