A Network To Remember

Introduction

In the post about rare words, we discussed the problem that a lot of low-frequency values/tokens are often extremely valuable but cannot be (fully) used for training. To address the problem, we could learn a separate feature space for those values that is then fused with the ordinary feature space, or we could use some kind of memory. In the case of neural networks, there is an implicit memory that can be trained with a loss function, but there is no easy way to control this memory beyond predicting the correct output.

The idea is to add an explicit, addressable memory that is attached to the network, like in memory networks [arxiv:1503.08895]. However, a problem of some methods is that the memory is restricted to so-called episodes that are bound to a single input sample. In other words, it is not clear how to use the memory to store general information that is intended to generalize beyond single training examples. In a recent paper submitted to ICLR 2017 [“Learning To Remember Rare Events”], the authors present a method to augment a network with a memory that is especially suited to handle “events” that are rarely present in the data. In other words, the memory is able to store both long-term and short-term patterns.

For quite some time, we have tinkered with the idea of adding a memory module to our networks to allow the model to evolve over time and to remember the little things. Especially for preference-based learning, such a component is very valuable, since there is very likely a drift, or repeating patterns based on temporal attributes like Christmas, summer or weekends. The problem is that some of those events do not happen very often, but they are still extremely important. Stated differently, the idea is to have a model that is capable of “life-long” learning where new memories can be formed and older memories are eventually overwritten.

Overview

Let’s start with a brief overview of the method. The memory M consists of three parts: the key storage K, combined with the values V and the age A of each entry. Let’s assume that the values V are positive integers, which means we query the memory to get the best matching label. The notation looks like this:

K: shape=(size_of_mem, key_size) # list of vectors
V: shape=(size_of_mem, 1) # integer labels
A: shape=(size_of_mem, 1) # age as integers

The query is a vector with key_size dimensions. As in the paper, each vector is normalized so that the dot product is the cosine similarity in [-1,+1].

To retrieve the label that is best matching for a query, we perform a nearest neighbor search:

q = raw_query / np.sqrt(np.sum(raw_query**2))  # l2-normalize the query
idx = np.argmax(np.dot(q, K.T))                # index of the most similar key
best_label = V[idx]                            # the label stored for that key

To enhance an ordinary neural network with the memory module, we get rid of the softmax output layer and use the representation of the new last layer, usually a fully connected one, as the query to the memory, after l2-normalizing it. The aim is to map a query to the correct class label of the input sample. As stated in the paper, the whole approach is fully differentiable except for the nearest neighbor search.

At the beginning, the memory is empty and the network parameters are just random values. We start by choosing a random sample x and its label y. To convert x into the query, we forward propagate it through the network.

(1) q = fprop(x); q' = l2_norm(q)

Next, we determine the nearest neighbor of the query q’ with respect to the memory.

(2) n1 = nearest_neighbor(q', M)

In case V[n1] already contains the correct label y, we update the memory slot with the average of both values.

(3.a) K[n1] = l2_norm(q' + K[n1])

Otherwise, we need to find a new slot n’ for the query, which is usually done by selecting the oldest entry combined with some randomness.

(3.b) n' = argmax(A + rand); K[n'] = q'; V[n'] = y

The age of all slots is then increased by one,

(4) A += 1

while the age of the updated slot is always reset to zero.

(5.a) A[n1] = 0
(5.b) A[n'] = 0
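
To make the update steps concrete, here is a minimal NumPy sketch of steps (2)-(5), assuming the arrays K, V and A from above (with V and A flattened to 1-D for simplicity) and an already l2-normalized query q from the forward pass; the noise scale in (3.b) is an arbitrary choice of ours:

import numpy as np

def l2_norm(v):
    return v / (np.sqrt(np.sum(v ** 2)) + 1e-8)

def memory_update(K, V, A, q, y):
    """One memory update for an l2-normalized query q with ground-truth label y."""
    n1 = np.argmax(np.dot(K, q))        # (2) nearest neighbor via cosine similarity
    A += 1                              # (4) increase the age of all slots
    if V[n1] == y:
        K[n1] = l2_norm(q + K[n1])      # (3.a) merge the query into the existing key
        A[n1] = 0                       # (5.a) reset the age of the updated slot
    else:
        noise = np.random.randint(0, 10, size=A.shape)  # noise scale is our own choice
        n_new = np.argmax(A + noise)    # (3.b) pick (roughly) the oldest slot
        K[n_new], V[n_new] = q, y
        A[n_new] = 0                    # (5.b) reset the age of the new slot
    return K, V, A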

The last question is how to optimize the network parameters to learn a useful memory for the classification task.

We have a query q’ and the correct label y, and the nearest neighbor of q’ should return that label. Thus, we compute the k nearest neighbors and find an index n_pos such that V[n_pos] = y and an index n_neg such that V[n_neg] != y. The loss is then defined as:

# q_prime denotes the l2-normalized query q'
loss = np.maximum(0, np.dot(q_prime, K[n_neg]) - np.dot(q_prime, K[n_pos]) + margin)

Stated differently, the cosine similarity f(q’, pos) should be higher than f(q’, neg) by the given margin, which means we want to maximize the similarity for the positive key and minimize it for the negative key.

As a quick reminder, the memory content K is not a trainable parameter, but since the content is derived from the query q’, we can shape the memory through the loss function.
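
A hedged NumPy sketch of the loss computation, reusing K, V and the normalized query from above; k and margin are values we picked, and we assume that both a correct and a wrong label appear among the k nearest neighbors:

import numpy as np

def memory_loss(K, V, q, y, k=16, margin=0.1):
    """Hinge loss on cosine similarities for a normalized query q and label y."""
    sims = np.dot(K, q)                         # cosine similarity to every key
    knn = np.argsort(-sims)[:k]                 # indices of the k nearest neighbors
    n_pos = next(i for i in knn if V[i] == y)   # nearest key with the correct label
    n_neg = next(i for i in knn if V[i] != y)   # nearest key with a wrong label
    return np.maximum(0.0, sims[n_neg] - sims[n_pos] + margin)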

Example

To train a good model, one usually needs a sufficiently large training set that provides enough data per tag/label, and it is quite common that the performance for rare tags is pretty bad. With a memory module, a network can encode the limited information that is available and retrieve it later, without the risk that the whole pattern is forgotten during training because it only marginally influences the loss function.

Let’s illustrate the behavior with an example. We start with an empty memory and draw samples from the training set. A rare tag will thus be drawn with a very low frequency, but if there is a pattern in the input data, the chance is good that its items are mapped to the same memory entry, or at least to a small group of entries. As the training continues, the memory entries for more frequent tags will be updated, but hopefully the memory entries assigned to the rare tag won’t be overwritten. The problem is that the age of all non-updated entries is increased in every step and thus, entries of rare events are more likely candidates to be overwritten. The probability that this happens depends both on the memory size and on the number of distinct patterns in the training data. However, since each epoch visits all training samples, the rare events also get updated eventually, and hopefully their pattern is strong enough that the existing entry is updated instead of a new one being allocated.

At the end of the training, we hopefully have learned a memory that is good at correctly predicting tags for new data, both for the frequent tags and for the rare ones for which only limited data was available.

Bottom Line

Depending on the actual goal, a static network might suffice. For instance, for some classification tasks, a re-training might only be required once or twice a year. However, for other tasks, it is imperative that a network quickly reacts to drifts in the input patterns and adjusts its output. But in contrast to on-line learning, the network should not forget old patterns (too) quickly; it should remember them for a while, because re-learning them from scratch when they occur again is very inefficient. Therefore, a memory can also help to store several less frequent patterns, like those of rare events, which is essential to explain some preferences and to deliver a good performance.

Who Is The Brain Behind This?

The more research we do in the area of recommendation systems, the clearer it becomes that a real solution cannot be accomplished by any “classical” algorithm. What we are trying to say is that we doubt that any existing classifier, regardless of how powerful the handcrafted features are, will ever be sufficient to deliver a satisfying performance over the whole time. The main problem is that such a model only captures current trends with the data known so far, and even with on-line learning, it hardly evolves or reacts to trends that pop up frequently but then soon vanish into thin air. What we need is a model that contains a component to address short-term events as well as long-term events. In other words, we need some kind of memory to fight the problem of catastrophic forgetting that can happen in neural nets. Like in the brain, we want something that remembers recent events, but also encodes regular patterns and is able to get rid of older memories by overwriting them. Such a dynamic component would allow life-long learning, which is perfect for preference-based learning.

Thus, we decided to change our direction of research towards adding external memory to networks, but on a much smaller scale compared to what other researchers did and will do. The next post will shed some light on the issue, but we still need to figure out how we can fully utilize the power of such a component and, of course, how we can phrase our problem as a learning task for such a network.

Curriculum Learning Revisited

It is always a good idea to draw inspiration from biological learning. For instance, if we learn something new, it is common to start with simple examples and gradually move to ones that are more difficult to solve. The idea is to learn the basic steps and then to combine those steps into new knowledge, which is then combined again and so forth. The same can be done with neural networks, where it is called curriculum learning. At the beginning, the network is fed with examples that are easy to learn. Then, decided by some schedule, more difficult examples are presented to the network. The idea is, again, that the network learns basic concepts first which can then be combined into more complex ones to classify the challenging examples.

A recently published paper [arxiv:1612.09508] combines this idea with a feedback mechanism to incrementally predict a sequence of classes for an image, ranging from easy to difficult (coarse to fine-grained). The idea is to use a recurrent network in combination with loss functions that are attached to the output of each time step. The input is the image in combination with the hidden state of the previous time step. In other words, the output of the first step is fed into the first loss function -the easy class-, the next time step is fed into the second loss function -more difficult-, and this continues for #T steps, which equals the number of loss functions.

An obvious advantage is that we can model a taxonomy of classes with this approach, for example: (animal, vehicle, plane) -> ([cat, dog, bird], [car, bike, truck], […]) -> ([tabby, …], [shepherd, beagle, …], […]). At the top level, the class is very coarse, but with each step it is further refined, which is the connection to curriculum learning. For instance, to predict whether an image contains an animal or a vehicle is much easier than to predict whether it contains a German shepherd or a pick-up truck.

However, in contrast to plain curriculum learning, we do not increase the difficulty per epoch, but per “layer”, because the network needs to predict all classes for an image correctly at every epoch. The idea is related to multi-task learning but with the difference that the loss functions are now connected to a recurrent layer which evolves over time and depends on the output of all previous steps.

So much for the overview of the method; we are of course not interested in applying the method to images but to movie data. The good news is that we can easily build a simple taxonomy of classes, for instance: top genre -> sub genre -> theme, and since the idea can be applied to all kinds of data, it is straightforward to feed our movie feature vectors into the network. We start with a very simple network. The input is a 2,000-dim vector of floats in the range [0,1], and the pseudo code looks like this:

x = T.vector() # input vector
y1, y2, y3 = T.iscalar(), T.iscalar(), T.iscalar() # output class labels
h1 = Projection+LayerNormalization+ReLU(x, num_units=64)
r1 = GatedRecurrent+LayerNormalization+ReLU(h1, num_units=64)
r1_step1 = "output of time-step 1"
r1_step2 = "output of time-step 2"
r1_step3 = "output of time-step 3"
y1_hat = Softmax(r1_step1, num_classes=#y1)
y2_hat = Softmax(r1_step2, num_classes=#y2)
y3_hat = Softmax(r1_step3, num_classes=#y3)
loss = nll(y1_hat, y1) + nll(y2_hat, y2) + nll(y3_hat, y3)

For each class, we use a separate softmax to predict the label, with input from the recurrent output at step #t. Thus, a training step can be summarized as follows: a movie vector is fed into the network and projected into the feature space spanned by h1. Then we feed h1 three times into the recurrent layer r1, together with the state from the previous step (at #t=0 the state is zero), to produce a prediction for each class. Therefore, there is a two-fold dependency. First, the hidden state is propagated through time, and second, the next state is influenced by the error derivative of the loss from the previous state, which means the representation learned by the recurrent layer must be useful for all labels.
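
To make the loop explicit, here is a simplified NumPy sketch of one forward pass with the summed loss. As a simplification on our side, it uses a plain recurrent cell instead of the gated, layer-normalized one from the pseudo code, random weights instead of trained ones, and made-up class counts:

import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))
    return e / np.sum(e)

rng = np.random.RandomState(0)
dim_x, dim_h, n_cls = 2000, 64, (10, 50, 200)   # class counts per level are made up

# parameters (random here; in practice they are trained)
W_in  = rng.normal(0, 0.01, (dim_x, dim_h))     # projection x -> h1
W_x   = rng.normal(0, 0.01, (dim_h, dim_h))     # input-to-recurrent weights
W_h   = rng.normal(0, 0.01, (dim_h, dim_h))     # recurrent weights
W_out = [rng.normal(0, 0.01, (dim_h, c)) for c in n_cls]  # one softmax head per step

x = rng.uniform(0, 1, dim_x)                    # movie feature vector in [0, 1]
y = (3, 17, 101)                                # (top genre, sub genre, theme) labels

h1 = np.maximum(0, np.dot(x, W_in))             # projection + ReLU
state = np.zeros(dim_h)                         # state is zero at t=0
loss = 0.0
for t in range(3):                              # feed h1 three times into the recurrent layer
    state = np.maximum(0, np.dot(h1, W_x) + np.dot(state, W_h))
    y_hat = softmax(np.dot(state, W_out[t]))    # prediction for class level t
    loss += -np.log(y_hat[y[t]])                # negative log-likelihood per step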

To illustrate the method with an example, let’s consider a movie with the top-genre “horror”, the sub-genre “creature-film” and the theme “zombies”. When the network sees the movie, it gets the first hint that it is a horror movie and builds a representation that maximizes the prediction of the first softmax for “horror”. But instead of forgetting the context, it ‘remembers’ that it belongs to the horror genre and uses the hint to adjust the representation to predict both classes (horror,creature-film) correctly. This is the second step. Lastly, it is using the given context to adjust the representation again to predict all three classes (horror,creature-film,zombies) correctly. The whole process can be considered as a loop which is in contrast to typical multi-label learning that is using a single output of a layer to predict all classes correctly.

Bottom line, as we demonstrated, feedback networks are not limited to images and they are extremely useful if there exists a natural, hierarchical label space for the data samples. Furthermore, since the classification of hierarchical labels requires more powerful representations, it is very likely that the learned feature space is more versatile and can also be used as input features for other models.

Rare Words Revisited

It is well known that words that do not occur very often are often nevertheless very descriptive, which is also true for the domain of non-textual items. This is related to the problem that a lot of models only consider the co-occurrences of items to train an embedding, and in the case of rare words/items that does not really work. The issue was also discussed in a paper released in 2016 [arxiv:1608.00318], with the idea to combine knowledge graphs and recurrent language models. Why is this relevant for our domain? Well, the sparsity of features is a huge problem and since their distribution follows a power law, a set of very important but low-frequency features will be ignored, which seriously limits the expressiveness of possible models. To tackle the problem, we need to learn an embedding for those words that is not only based on co-occurrence statistics.

We also tinkered with the idea -before we read the paper- of learning an embedding that combines a skip-gram model with facts represented as triplets
(subject, relation, object)
but since, in our domain, rare features cannot always be easily expressed as facts, we did not follow it. However, with the research we did so far, it becomes more and more obvious that models that only use the basic features will not, now or ever, deliver satisfying results.
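
Purely for illustration, such fact triplets for our movie domain could look like the following; the relations and the last two facts are made up by us, borrowing the david-cronenberg example that appears later in this post:

# hypothetical fact triplets (subject, relation, object) for the movie domain
facts = [
    ("david-cronenberg", "is-a", "director"),
    ("david-cronenberg", "directed", "scanners"),
    ("scanners", "has-genre", "horror"),      # relation names are our own invention
    ("scanners", "has-theme", "telepathy"),
]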

In other words, we need to utilize all kinds of data, and if the co-occurrence does not work, we need something else. The challenge is to build a network with something like a universal memory that can be used for a variety of tasks, like clustering items or training preference models. The first challenge is to find a way to build a powerful knowledge base, mostly in a semi-supervised way, and the second is to incorporate all the knowledge into a network.

The good news is that there are lots of interesting and new methods out there which we can study and adopt, but the bad news is that none of them really fits our domain, which is the reason we cannot easily resort to a standard data set like Wikipedia. But as usual, we won’t give up and will continue our research, because we believe that it is only a matter of time until we -or someone else- come up with a new idea that brings us a step further on our way to solve the problem.

The Year 2016 On A Rice Grain

If we consider all domains of machine learning, the year was definitely a huge success. For instance, recurrent networks are now all over the place, either stand-alone or as part of other networks. In the domain of images, all the cool kids are now using some variant of generative adversarial networks (GAN), and reinforcement learning is so popular that you cannot avoid it. In other words, whenever a method works with “natural” data, like images, videos, documents or ‘environments’, a lot of progress has been made. Of course there was also progress in other domains, for instance information retrieval or recommender systems, but not in such a profound way. And this brings us to the domain we are mostly concerned about, which is personalization.

First, even if finding the needle in the digital haystack gets more and more important, for instance in product search/recommendation, there were very few papers about it. One reason might be that experiments require a sufficiently large data set which contains valuable information, and not many are willing to give away data like this for free. Another reason might be that most items -in general- cannot be described in a uniform way. For instance, two product descriptions in different languages can be considered very similar, but are encoded totally differently. Furthermore, the level of detail might vary strongly and thus, content-based methods might fail. So, we are back at collaborative filtering, which has its own weaknesses and strengths.

Stated differently, in contrast to images, video or audio, the quality of a model mainly depends on how the input data is encoded into features, and since we do not know in advance if the features are powerful enough for the problem we try to solve, the model is often simply not powerful enough. We already argued that those models could be enriched with data from other domains, like image data, a photo of the product or a summary of the video/song, but this multi-modal approach has not been widely explored yet and needs a lot more research.

Without a doubt we made some progress in building personalized services, but compared to others, the steps seem very tiny and sometimes even directed backwards. It’s hard to tell what could improve the situation, because even such a simple thing as a taxonomy can fail when a couple of parties are involved. There are some approaches for open knowledge bases, but they are far from being complete or easy to use, and the rest of the knowledge bases have been acquired by companies which now control the data and the access to them.

This might all sound very dark, but it also underlines that a fundamental rethink is called for. For instance, instead of encoding all required knowledge in the features, it could make sense to have some kind of memory that allows us to access pre-existing knowledge that is not restricted to a specific problem. Take the problem that a feature like “david-cronenberg” is not just a word, but describes a concept that implies a lot of facts, like (david-cronenberg, is-a, director) or (david-cronenberg, directed, scanners). In other words, we need an efficient way to re-use, or at least integrate, existing knowledge in a consistent way to improve the expressiveness of models with limited features.

In a nutshell, maybe we need to throw away most of the old baggage and start -almost- from scratch again to get rid of all the restrictions that have piled up over the years. Of course this is easier said than done, but researchers have already demonstrated that this is possible in other domains. However, it might also be possible that somebody already came up with a great idea but is not able/willing to publish it since it is owned by some company. But one thing is for sure, namely that we are still at the beginning of a very long, but interesting journey.

Efficient Embeddings With Theano

Training a word embedding model is a pretty old hat and maybe this is nothing new, but we nevertheless thought it would be a good idea to summarize our experiences to avoid possible headaches in case somebody is not so familiar with Theano. To be fair, the Theano FAQ mentions the issue, so consider this post a more detailed summary.

Let’s assume that we want to train an embedding model for words with a vocabulary size of N. Then we have a lookup matrix “W”

W = theano.shared(np.random.normal(0, 0.1, size=(N, dim)))

where dim is the number of dimensions we use for the embedding.

For the sake of simplicity, a training step consists of a pair (ref, pos) that should be moved closer together and a pair (ref, neg) that is supposed to be pulled apart. In other words, in each step we only change three rows of W.

This is a possible loss function with the update:

cost = -T.log(T.nnet.sigmoid(T.dot(W[ref], W[pos]))) - T.log(1 - T.nnet.sigmoid(T.dot(W[ref], W[neg])))
grad_W = T.grad(cost, W)
updates = [(W, W - lrate * grad_W)]

So, what is the problem here? Even if we only adjust three rows of the matrix, we calculate the gradients for the whole matrix. This is very wasteful and also awfully slow.

The solution is to use advanced indexing and only calculate the gradients for the subset of weights that are actually used in the step:

idx = T.ivector() #ref, pos, neg
W_part = W[idx]
cost = -T.log(T.nnet.sigmoid(T.dot(W_part[0], W_part[1]))) - T.log(1 - T.nnet.sigmoid(T.dot(W_part[0], W_part[2])))
grad_W_part = T.grad(cost, W_part)
updates = [(W, T.set_subtensor(W_part, W_part - lrate * grad_W_part))]

For those who are not familiar with Theano, let’s see what is going on here. The variable “idx” is a vector that holds integers, in our case ref, pos, neg which are used to retrieve particular rows of “W”. In other words, W_part contains references to these rows and is therefore something like a view of “W”.

The trick is to call T.grad with the subset and not the whole matrix to avoid unnecessary computations, because the gradients of all rows except the ones referenced in W_part are zero anyway. But since we are now working with a view, we need a different way to update just the three rows in W, which can be done with set_subtensor(). First, we update W_part as usual with gradient descent, but then we use advanced indexing to replace just the referenced rows. We can think of the new update statement as:

W[ref] = W_part[0] - lrate * grad_W_part[0]
W[pos] = W_part[1] - lrate * grad_W_part[1]
W[neg] = W_part[2] - lrate * grad_W_part[2]

And that’s it. It is not really rocket science, but it requires some deeper understanding of how Theano works under the hood.

The lesson is that if you just use a subset of weights during learning, you should only derive gradients for this subset. The notation might look unfamiliar at the beginning, but in case of a lookup matrix with several hundred thousands rows, we are talking about savings in the range of hours and this is worth the little extra effort.
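
Putting the pieces together, a minimal end-to-end sketch might look like the following; the vocabulary size, embedding dimension, learning rate and the toy training triple are placeholder values of ours:

import numpy as np
import theano
import theano.tensor as T

N, dim, lrate = 1000, 50, 0.1     # toy vocabulary size, embedding size and learning rate
W = theano.shared(np.random.normal(0, 0.1, size=(N, dim)))

idx = T.ivector()                 # holds [ref, pos, neg]
W_part = W[idx]
cost = -T.log(T.nnet.sigmoid(T.dot(W_part[0], W_part[1]))) - T.log(1 - T.nnet.sigmoid(T.dot(W_part[0], W_part[2])))
grad_W_part = T.grad(cost, W_part)
updates = [(W, T.set_subtensor(W_part, W_part - lrate * grad_W_part))]
train_step = theano.function([idx], cost, updates=updates)

# one toy step: word 3 is pulled towards word 7 and pushed away from word 42
print(train_step(np.asarray([3, 7, 42], dtype=np.int32)))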

And last but not least, updating the whole matrix leads to very strange side-effects (also briefly mentioned in the FAQ). Let’s assume that we train an embedding model and we use RMSprop as the optimizer. In each step, we update the moving average:

avg_new = 0.9 * avg + 0.1 * grad**2,

where avg_new has the same dimension as the embedding matrix. In case of a batch-size of one, most of the gradients are zero and thus, the update for a row with a zero gradient looks like this:

avg_new[i] = 0.9 * avg[i] + 0.1 * 0,

which means the average is pushed to zero.

Therefore, even if a row was not referenced in the pairs, the dynamics of the optimizer change for all rows, regardless of whether they were referenced or not. This is true for all kinds of optimizers that keep a history of values, like momentum, Adam or Adagrad.
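
A tiny illustration of this effect with made-up numbers: after ten steps in which a row receives a zero gradient, its stored average has decayed considerably, which changes the effective step size the next time the row actually gets a gradient:

avg = 0.5                   # running average of squared gradients for an untouched row
for _ in range(10):         # ten steps with a zero gradient for this row
    avg = 0.9 * avg + 0.1 * 0.0 ** 2
print(avg)                  # ~0.17, although the row was never part of a training pair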

Bottom line: while using only the subset means a huge gain in performance for classical SGD, but is not strictly required there, it is mandatory for all more sophisticated optimizers.

Label Schmoozing

Even if there are millions of items out there, for instance images, tweets, documents and other stuff, labels are still very scarce. Yes, there are large datasets with labels, but they might not work for the problem at hand, or the labels might be too noisy. So, at the end, we often have a rather small dataset to train our model, and with such a dataset come the old problems. If we do not regularize the model, the network is likely to memorize most patterns and learns the noise in the training data. In other words, the model overfits and does not generalize to new examples. A good sign that this happens is that the network is overconfident when it predicts the label distribution. That means, even if the network is totally wrong, it puts all its confidence in a single class. Of course it is best to get more data, but if this is not possible, there are astonishingly simple regularizers that can help to improve the situation.

(1) Label Smoothing[1]

We start with label smoothing, which became popular again with the Inception network architecture, though it has actually been around for a longer time but was probably forgotten or simply ignored. The idea is pretty simple: the network is discouraged from putting all its confidence into a single label class. Informally speaking, even if the network is sure that the image contains a frog, it stays open to other suggestions, even if they are unlikely. This leads to the following regularized loss:

y = T.vector()
y_hat = T.nnet.softmax(..)
y_smooth = 0.9 * y + 0.1 / num_labels
loss = -T.sum(y_smooth * T.log(y_hat))

If we have 4 classes, and the correct class is 0, y_smooth looks like this: y_smooth = [0.925, 0.025, 0.025, 0.025].

Stated differently, for each label y, we keep the correct class with probability 0.9 and with probability 0.1 we replace it with a label drawn from the uniform distribution over all classes, 1/#labels, independent of the actual label.
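
For reference, a self-contained NumPy version of the smoothed loss, with a made-up softmax output, could look like this:

import numpy as np

num_labels = 4
y = np.eye(num_labels)[0]                 # one-hot target, correct class is 0
y_hat = np.array([0.7, 0.1, 0.1, 0.1])    # some softmax output (made up)

y_smooth = 0.9 * y + 0.1 / num_labels     # [0.925, 0.025, 0.025, 0.025]
loss = -np.sum(y_smooth * np.log(y_hat))
print(y_smooth, loss)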

Compared to other regularizers, there are only very few papers with qualitative results for neural networks. However, since the implementation is straightforward, it is easy to compare vanilla and smoothed models for the problem at hand, at the cost of training the model twice.

(2) Penalizing Confident Predictions[2]

This particular approach is newer, but also related to the idea of method (1). It is based on the idea of penalizing network outputs with low entropy, where the entropy is defined as follows:

entropy = -T.sum(y_hat * T.log(y_hat))

This seems familiar, right? Now, let’s fill in some numbers to learn something about the range. Since y_hat is the output of a softmax, all values are in [0, 1].

Let’s calculate the entropy for some examples with 4 classes:
y_hat = [0.95, 0.04, 0.005, 0.005] => entropy = 0.23
y_hat = [0.75, 0.1, 0.14, 0.01] => entropy = 0.77
y_hat = [0.25, 0.25, 0.25, 0.25] => entropy = 1.39
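
These values are easy to verify with a few lines of NumPy:

import numpy as np

def entropy(p):
    return -np.sum(p * np.log(p))

print(entropy(np.array([0.95, 0.04, 0.005, 0.005])))   # ~0.23
print(entropy(np.array([0.75, 0.1, 0.14, 0.01])))      # ~0.77
print(entropy(np.array([0.25, 0.25, 0.25, 0.25])))     # ~1.39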

The aim of the regularizer is thus to discourage peaked distributions, which have low entropy; the entropy is maximal when all labels are equally probable. On its own, this is of course not useful for classification, which means we need an extra hyper-parameter ‘beta’ to control the strength of the regularization.

This leads to the following regularized loss:

y_hat = T.nnet.softmax(..)
entropy = -T.sum(y_hat * T.log(y_hat))
loss = -T.sum(T.log(y_hat) * y) - beta * entropy

Actually, as noted in the comments on the approach, the idea has been around for a while, e.g. [3], but not in the domain of neural networks. Thus, the situation is the same as for (1), but again it is easy to assess the results by comparing a model that is trained without the regularizer to one that uses it.

Summary

The focus of this post is not whether or not these methods are actually new, but how to fight overfitting in the case of smaller datasets and/or challenging learning tasks. Especially for method (2), our various experiments confirmed that this regularizer can actually help to learn better neural network models, which is notable since the methods have not been broadly applied in this domain yet.

Links

[1] arxiv.org/abs/1512.00567
[2] openreview.net/pdf?id=HkCjNI5ex
[3] jmlr.org/papers/volume11/mann10a/mann10a.pdf