Augmenting neural networks with external memory is a hot topic these days. However, methods that aim for large-scale learning are usually not fully differentiable, and fully differentiable methods usually do not scale well. Furthermore, the memory is often only applicable to settings where training is based on stories or sequences.
What we have in mind is a memory that is formed by on-line training, supports one-shot learning, and has some kind of structure with a minimal memory footprint. In other words, we do not want to fix the layout before training; we want it to evolve over time. In terms of known concepts, we want something like competitive learning with a winner-takes-all approach to distribute knowledge among the memory nodes.
Similar to support vectors, we are looking for examples that lie on the boundary “line” that separates the feature space. For example, we have two nodes with different classes, and the edge between them crosses the decision boundary between those two classes. We were a little disappointed because the idea has already been described in [arxiv:1505.02867]. However, their approach assumes a fixed feature representation, which is not suited to our needs. After we experimented with ways to learn a good feature representation for the nodes, we stumbled upon a new paper [arxiv:1702.08833] that enhances Boundary Trees with backprop to learn the features.
But let’s start with an intuitive example first. We have a set of binary movie ratings from a user, which should be used to estimate a function that predicts the class of unseen examples. The idea is to combine elements from memory networks with Boundary Trees to learn something like a decision tree that classifies unseen examples with a minimal memory footprint. So let’s start.
In the beginning we have nothing, so we sample a rating from the dataset, which becomes the root of our tree. We measure the “match” of a query, an unseen sample, with a node by the squared l2 distance between the query and the node: T.sum((query - node)**2). To find the best match, we traverse the tree iteratively: we consider the best local match and continue with its children until we reach a leaf node or the current node is a better match than any of its children. If this final node already has the same class as the selected sample, we discard the sample; otherwise we add it as a child of the final node. With a fixed representation the method is straightforward, but it is only logical to assume that a learned representation is more efficient, since we do not know a priori what is actually useful. And now comes the tricky part, where we have to learn a representation.
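The traversal and insertion rule above can be sketched in a few lines. This is a minimal sketch with plain numpy instead of Theano; the `Node`, `find_best`, and `insert` names are our own, not from the referenced papers:

```python
import numpy as np

class Node:
    """One memory node: a feature vector, its class label, and children."""
    def __init__(self, x, label):
        self.x = np.asarray(x, dtype=float)
        self.label = label
        self.children = []

def sq_dist(a, b):
    # squared l2 distance, as in the post: T.sum((query - node)**2)
    return float(np.sum((a - b) ** 2))

def find_best(root, query):
    """Greedy traversal: follow the closest child until we reach a leaf
    or the current node is a better match than any of its children."""
    node = root
    while node.children:
        child = min(node.children, key=lambda c: sq_dist(query, c.x))
        if sq_dist(query, node.x) <= sq_dist(query, child.x):
            break
        node = child
    return node

def insert(root, x, label):
    """Add (x, label) as a child of the best match only if that match
    predicts the wrong class; otherwise discard the sample."""
    x = np.asarray(x, dtype=float)
    node = find_best(root, x)
    if node.label != label:
        node.children.append(Node(x, label))
```

Prediction is then simply `find_best(root, query).label`, and only samples that correct a mistake are stored, which keeps the memory footprint small.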
In case you cannot learn something end-to-end, there is a tradition in optimization of alternating between two steps: keep one parameter fixed and update the other, then do the opposite. We can use the same approach here: first, build the memory tree with a fixed feature representation given by the current model parameters. Then fix the tree and use random queries to update the representation, i.e., the model parameters. The two steps are repeated until convergence.
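The representation step can be illustrated with a deliberately simple setup. This is a hedged sketch under strong assumptions: a linear feature map `W @ x` (the referenced paper uses a deep net with backprop) and a hand-rolled contrastive update of our own devising, pulling each query toward its nearest same-class node and pushing it away from its nearest other-class node while the tree nodes stay fixed. The gradient of ||W(q - n)||² with respect to W is 2·W(q - n)(q - n)ᵀ:

```python
import numpy as np

def embed(W, x):
    # hypothetical linear feature map; a real model would be a deep net
    return W @ x

def sq_dist(a, b):
    return float(np.sum((a - b) ** 2))

def representation_step(W, nodes, queries, labels, lr=0.01):
    """One pass of the 'fix the tree, update the features' step.
    nodes: list of (vector, label) pairs frozen from the tree-building step."""
    W = W.copy()
    for q, y in zip(queries, labels):
        # nearest same-class and other-class nodes in the current embedding
        same = min((x for x, l in nodes if l == y),
                   key=lambda x: sq_dist(embed(W, q), embed(W, x)))
        diff = min((x for x, l in nodes if l != y),
                   key=lambda x: sq_dist(embed(W, q), embed(W, x)))
        # gradient step on sq_dist(same) - sq_dist(diff)
        W -= lr * 2 * np.outer(embed(W, q - same), q - same)  # pull closer
        W += lr * 2 * np.outer(embed(W, q - diff), q - diff)  # push away
    return W
```

In the full scheme, the updated `W` would then be used to rebuild the tree, and the two phases alternate until the model stops improving.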
We hopefully find some time to elaborate on the details in later posts, but the core idea is very simple: build a tree, refine it, and continue as long as the model improves. In the end, we hopefully have a model, a tree, that consists of a manageable number of nodes and allows us to correctly predict the class of unseen examples by traversing the tree to the corresponding leaf node. At least for the domain of images it has been shown that trained models have an interpretable structure, and we hope to show the same for high-dimensional sparse input data.