In the previous post we gave brief introduction about a network that is augmented with an external memory. The motivation is that humans can learn new concepts from very few examples and it would help a lot, if networks could do the same. The idea of meta-learning is actually still very old, but it recently got a lot of attention again. However, very often it is described with sequence tasks which do not fit for our problem. Thus, let’s start by summarizing our problem again.
We have a large dataset of items described by a set of sparse features. The problem is to find a function that allows to assign a utility to item each which best approximates the references of a user. Besides the problem to capture the actual preferences of a user, we mostly work with positive examples. Negative feedback is possible and partly available, but not to the same extent as positive one and in the most extreme case, it is missing entirely.
So, we could model the problem as a classification task, but it is very likely that a lot very important patterns are not very frequent in the training set or, in case of on-line learning, not seen very often in the data stream. In other words, we need something like a template which can be used to describe rare events and which can be accessed at test time to retrieve a label, or any other stored information. Such a system helps against catastrophic forgetting, as long as at least some event is matching a template, because otherwise, the memory gets overwritten eventually.
The problem with the basic approach is that it is not fully differentiable. There are alternatives, but as mentioned before, they are mostly used for sequential data and often do not scale. In our case, the stream of data consists of items with an unknown, but positive preference, since there is at least one reason why the user decided to click/view/record it. And further interaction with the item can be treated as some kind of reinforcement, either positive or negative.
In such a scenario, a memory is essential since the system should remember recently introduced preferences/patterns at least for some time until their are renewed or eventually overwritten. The most basic approach is to add some memory to a network that is accessed with a nearest neighbor search and updated somehow by merging the information from the query with the existing template. The key is to use the differentiable part of the system to clearly separate those templates, that are ”memories”, from each other to allow a non-ambiguous decision.
In contrast to metric learning, we do not learn a single metric that induces some embedding space, but we learn ”conceptual templates” very similar to clusters. The difference is that we might have a lot of clusters with only few matches which are nevertheless extremely helpful, because they encode some important (temporal) patterns.
As soon as we cleaned up our code and finished our experiments, we plan to summarize our results in a new post, hopefully with useful insights about how to apply memory-augmented networks in the domain of non-sequential problems.