Joint Representation Learning of Attributes and Items

Learning dense embeddings for graph-like data is still tremendously popular. For instance, there is word2vec, pin2vec, node2vec, doc2vec or tweet2vec, and there is no end in sight. The idea of capturing the semantic information of tokens in a feature representation is very versatile and, despite its simplicity, very powerful. However, it is not obvious how to appropriately(!) convert items, for example a document which is a sequence of tokens, into a single representation. The average of all tokens, or the sum, works well, but it does not consider the order of the tokens and neglects other possible structural information. To be clear, our proposal does not address the whole issue, but it at least allows us to capture the statistics of items from the dataset.

As our domain is not text but movies, there is no clear notion of a sequence for the meta data, but we can treat the problem as a bipartite graph with the items on the “left” side and the attributes on the other. In this graph, items are not directly connected but are meshed by common attributes. In other words, the length of the shortest path from item A to item B is 2, i.e., A->some_node->B. A simple example: A and B are both sci-fi movies with a common theme of {spaceship, alien} along with other individual attributes, and thus they should be treated as at least latently similar.

In this setting, item nodes can be seen as anchors that are used to shape the feature space via the local neighborhood, but also via walks from a source node to an arbitrary node that is reachable from it. The power of the embedding lies in the sampling, but for now let’s just focus on the objective: min -log P(N(u)|u), where u is the source node and N(u) is the set of all neighbors of u, with P(n_i|u) = exp(f(n_i)*f(u)) / sum(v, exp(f(v)*f(u))) for each neighbor n_i in N(u), where the sum runs over all nodes v. In plain English, we want to maximize the probability of observing the neighbors N(u) for the source node u. By using the softmax, we push each pair (n_i, u) closer together while pulling the other nodes apart.
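
As a minimal numpy sketch of this objective (the embedding table f, the node ids and the neighborhood are made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up embedding table: 6 nodes (movies and attributes), 4 dimensions.
f = rng.normal(size=(6, 4))

def neighbor_probs(u):
    """P(n|u) = exp(f(n)*f(u)) / sum_v exp(f(v)*f(u)), over all nodes v."""
    scores = f @ f[u]
    scores -= scores.max()        # numerical stability
    e = np.exp(scores)
    return e / e.sum()

def neg_log_likelihood(u, neighbors):
    """-log P(N(u)|u), assuming the neighbors are independent given u."""
    p = neighbor_probs(u)
    return -np.sum(np.log(p[neighbors]))

p = neighbor_probs(0)             # softmax distribution over all nodes
loss = neg_log_likelihood(0, [1, 2])
```

Minimizing this loss with respect to f is what pushes the (n_i, u) pairs together.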

This is closely related to the word2vec objective, with an adapted method to generate training samples. In the original setup, we select a word from a sentence and try to predict the surrounding words, while here we select a node from the graph and try to predict the sampled neighborhood. By customizing the sampling strategy for the neighborhood, we can model different aspects of the graph and thus guide the learned representation.

Bottom line, instead of learning an embedding just for the attributes, we jointly learn an embedding for movies and attributes. This combines a transductive setting, since new movies cannot be embedded without re-training, with an inductive one, since we can at least approximate the embedding of a new movie if we know its tags.
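
For illustration, a tiny sketch of the inductive part, with hypothetical attribute embeddings and the mean of the tag embeddings as the approximation:

```python
import numpy as np

# Hypothetical attribute embeddings, jointly learned with the movies.
attr_emb = {
    "sci-fi":    np.array([1.0, 0.0, 0.5]),
    "spaceship": np.array([0.8, 0.2, 0.4]),
    "alien":     np.array([0.9, 0.1, 0.6]),
}

def approximate_movie_embedding(tags):
    """Approximate an unseen movie by averaging its tag embeddings."""
    return np.mean([attr_emb[t] for t in tags], axis=0)

new_movie = approximate_movie_embedding(["sci-fi", "spaceship", "alien"])
```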

When Unsuper Is Actually Super, Or Not?

It is no secret that most of the energy has been put into advancing supervised approaches to machine learning. One reason is that lots of problems can actually be phrased as predicting labels, often with very good results. So the question is, especially for commercial solutions where time and resources are very limited, whether it isn’t better to spend some time labeling data and training a classifier to get some insights about the data. We got some inspiration from a recent Twitter post that suggested a similar approach.

For instance, if we want to predict whether an “event” is an outlier or not, we have to decide between supervised and unsupervised methods. The advantage of the latter is that we have access to lots of data but no clear notion of an “outlier”, while for the former we need labeled events, with the risk that the data is not very representative and that the trained model is therefore of limited use.

In other words, it is the old story again: a supervised model is usually easier to train if we have sufficient labeled data, at the expense that we get what we “feed”. Thus, more labeled data is likely to improve the model, but we can never be sure that we have captured all irregularities. On the other hand, unsupervised learning might be able to (fully) disentangle the explanatory factors of the data and thus lead to a more powerful model, but coming up with a proper loss function and the actual training can be very hard.

Bottom line, there is some truth in the advice that if you cannot come up with a good unsupervised model, but you can partly solve the problem with a supervised one, you should start with the latter. With some luck, the simple model will lead to additional insights that might eventually lead to an unsupervised solution.

Forming New Memory

Augmenting neural networks with external memory is definitely a hot topic these days. However, if a method aims for large-scale learning, it is usually not fully differentiable, and if it is fully differentiable, it usually does not scale. Furthermore, the memory is often only applicable to settings where training is done on stories or sequences.

What we have in mind is a memory that is formed by on-line training, with support for one-shot learning, and that has some kind of structure with a focus on minimal memory footprint. In other words, we do not want to fix the layout before training but we want it to evolve over time. In terms of known concepts, we want something like competitive learning related to winner-takes-all approaches to distribute knowledge among memory nodes.

Similar to support vectors, we are looking for examples that lie on the boundary “line” that separates the feature space. For example, we have two nodes with different classes and the edge between them crosses the boundary between those two classes. We were a little disappointed because the idea has already been described in [arxiv:1505.02867]. However, their approach assumes a fixed feature representation, which is not suited to our needs. After we experimented with ways to learn a good feature representation for the nodes, we stumbled upon a new paper [arxiv:1702.08833] that enhances Boundary Trees with backprop to learn the features.

But let’s start with an intuitive example. We have a set of binary ratings from a user for movies, which should be used to estimate a function that predicts the class of unseen examples. The idea is to combine elements from memory networks with boundary trees to learn something like a decision tree that allows us to classify unseen examples with a minimal memory footprint. So let’s start.

At the beginning we have nothing, so we sample a rating from the dataset, which becomes the root of our tree. We measure the “match” of a query, an unseen sample, with a node by using the squared l2 distance between the query and the node: T.sum((query – node)**2). To find the best match, we traverse the tree iteratively by considering the best local match and continuing with its children until we reach a leaf node or the parent node is a better match than any of its children. If the final node already has the same class as the selected sample, we discard the sample; otherwise we add it as a child of the final node. With a fixed representation the method is straightforward, but it is only logical to assume that a learned representation is more efficient, since we do not know a priori what is actually useful. And now comes the tricky part, where we have to learn a representation.
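
The traversal and insertion steps can be sketched as follows; this is a minimal toy with a fixed representation and made-up 2-d samples, not the full method:

```python
import numpy as np

class BoundaryTree:
    """Toy boundary tree with a fixed feature representation."""
    def __init__(self, x, label):
        self.x = np.asarray(x, dtype=float)
        self.label = label
        self.children = []

    def query(self, q):
        """Descend greedily to the locally closest node (squared l2)."""
        node = self
        while True:
            best = min([node] + node.children,
                       key=lambda n: np.sum((q - n.x) ** 2))
            if best is node:        # parent beats all children -> stop
                return node
            node = best

    def train(self, x, label):
        """Keep a sample only if its best match has a different class."""
        x = np.asarray(x, dtype=float)
        match = self.query(x)
        if match.label != label:
            match.children.append(BoundaryTree(x, label))

tree = BoundaryTree([0.0, 0.0], 0)   # first sampled rating becomes the root
tree.train([1.0, 1.0], 1)            # different class -> stored as a child
tree.train([0.1, 0.0], 0)            # same class as its match -> discarded
pred = tree.query(np.asarray([0.9, 1.1])).label
```

Only samples near the class boundary survive, which is what keeps the memory footprint small.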

If you cannot learn something end-to-end, it has some tradition in optimization to alternate between two steps: keep one parameter fixed and update the other, then do the opposite. We can use the same approach here and build the memory tree with a fixed feature representation derived from the current model parameters. Then we fix the tree and use random queries to update the representation, i.e., the model parameters. The two steps are repeated until convergence.

We will hopefully find some time to elaborate on the details in later posts, but the core idea is very simple: build a tree, refine it, and continue as long as the model improves. At the end, we hopefully have a model -a tree- that consists of a manageable number of nodes and allows us to correctly predict the class of an unseen example by traversing the tree to the corresponding leaf node. At least for the domain of images it has been shown that trained models have an interpretable structure, and we hope to show the same for the domain of high-dimensional sparse input data.

Sparse Input Data With Theano

For some kinds of data it is not unusual to have a couple of thousand dimensions of which only very few carry actual values, like a bag-of-words approach with thousands of binary features where on average 99.5% of them are zero. In the case of a neural network, this means we have a projection matrix W with N rows and M columns, where N is the number of input features (~10,000). Since a dot product only depends on the non-zero entries, the procedure can be sped up a lot if we use a sparse matrix instead of a dense one for the input data. However, we only need the sparse tensor type once, since after the first layer the output is always dense again.

The actual implementation in Theano is not a big deal. Instead of T.fmatrix(), we use sparse.csc_matrix() which comes from the sparse module of Theano: from theano import sparse. If we use a generic projection layer, all we have to check is the instance type of the input tensor to use the appropriate dot function:

if type(input.output) is theano.sparse.basic.SparseVariable:
    op = sparse.structured_dot
    op =

That is all and the rest can stay as it is.

The idea of “structured_dot” is that the first operand, the input data, is sparse and the other operand, the projection matrix, is dense. The derived gradient is also sparse and, according to the docs, both fprop and bprop use C code.
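
Theano aside, the effect of a sparse first operand is easy to see with scipy.sparse; the shapes and the sparsity level below are made up for illustration:

```python
import numpy as np
from scipy import sparse

rng = np.random.default_rng(0)

# Mostly-zero input: 4 samples, 10,000 binary features, ~0.5% non-zero.
dense = (rng.random((4, 10_000)) < 0.005).astype(np.float32)
X = sparse.csc_matrix(dense)   # sparse input, as in the Theano setup
W = rng.normal(size=(10_000, 32)).astype(np.float32)

out = X @ W   # sparse x dense -> dense output, only non-zeros contribute
```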

Bottom line, if the input dimension is huge but only very few elements are actually non-zero, using a sparse matrix object is essential for good performance. The fact that non-contiguous objects cannot be used on the GPU is a drawback, but not a real problem for our models since they are CPU-optimized anyway.

Gratitude For The Old

In the last post we discussed a problem that occurs when the first phase of learning has many ups and downs, which means the memory is re-adjusted a lot. In many of those cases the system calms down eventually, but the drawback is that rare labels are very likely removed and re-introduced several times, which does not allow a stable pattern to be learned for them.

The problem is that the age of all memory slots is always increased by one, regardless of how frequent the actual label is. In other words, if we have three labels with a distribution of 80/18/2, slots with the third label easily grow old and become good candidates for replacement in the phase where the system tries to settle down.

The issue can be addressed by keeping a history of how labels are distributed across the memory. The more a label occupies the memory, the higher its chance of being replaced should be, because several feature templates are available for it. This should help to keep rare labels in memory and allow a stable feature template to be learned for them.

The implementation is pretty easy. Instead of selecting the slot just by its age, we also consider the label distribution:

n = argmax(A * T)

where A is the vector of slot ages and T is a vector of the same length, filled with the label portion #label/#total per dimension.

For example, if a rare label has age=50 but t=0.2, and a frequent label has age=15 but t=0.8, the more frequent one gets replaced, because 15*0.8=12 > 50*0.2=10. And the good thing is that if all labels are distributed uniformly, we recover exactly the original method.
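
A two-slot numpy sketch with exactly the numbers from this example:

```python
import numpy as np

# Ages A and label portions T for two slots: a rare label (age 50,
# 20% of memory) and a frequent one (age 15, 80% of memory).
A = np.array([50.0, 15.0])
T = np.array([0.2, 0.8])

n = int(np.argmax(A * T))   # A * T = [10, 12] -> the frequent slot wins
```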

Forgetting Events Despite Having a Memory

Memory augmented networks have the capability to remember rare events, which is very useful for one-shot learning and for avoiding catastrophic forgetting of very descriptive but low-frequency patterns. With the evidence from recently published papers, it is safe to say that memory is definitely a step in the right direction to make networks more powerful. However, as usual, there is a BUT that can bite you in the backside.

In general, it is assumed that the data has some underlying, but hidden factors that need to be explained by a model. If the model does a good job, it learns a compressed representation of the data that can be used to solve the problem at hand, usually a classification task. So, the success of the model relies on disentangling the data until a classification with a linear model is possible.

When a memory is added to the model, its life gets easier because it can store and retrieve templates for the latent factors that describe a class label, which removes the burden on the model to encode all the knowledge into its weight matrices. This is especially important if some patterns are very rare and are therefore likely “overwritten” by more frequent ones, which improves the loss a lot but does not help to learn the rare patterns.

The problem is that for some kinds of data, it takes a lot of time and space (memory) to converge to a stable state, and during this time the memory is adjusted a lot. What does that mean? By default, the oldest entry is replaced, which means it likely points to a rare pattern, because those are not seen and updated very often. This leads to the problem that templates for rare patterns are eventually removed from the memory and need to be re-learned when they are introduced again, which is burdensome and unreliable.

In other words, if the underlying data manifold is very complex and the memory is in flux during the convergence phase, the benefit of using a memory for rare events is practically gone, since they are “pushed out” of the memory by the many readjustment steps.

Bottom line, we need to adjust the procedure for selecting “old” entries to minimize the probability of removing rare events. But the problem is more complex than that, because a template likely gets “out of sync” if it is not averaged with a recent controller representation from time to time. Again, the problem is the data, since our experiments with other domains, like images or natural language, worked much better.

Converting Theano to Numpy

It is an open secret that we like Theano. It’s flexible, powerful and, once you have mastered some hurdles, it allows you to easily test a variety of loss functions and network architectures. However, once the model is trained, Theano can be a bit of a burden when it comes to the fprop-only part. In other words, if we just want to get predictions or feature representations, the setup and compilation overhead might be too much. The alternative is to convert the flow graph into numpy, which has the advantage that there are fewer dependencies and less overhead for the actual predictions with the model. Frankly, what we describe is neither rocket science nor new, but it is also not common usage, so we decided to summarize the method in this post.

To convert the graph notation to numpy, we make use of the __call__ interface of Python classes. The idea is to call an instance of a class as a function with a parameter:

class Input(Layer):
    def __init__(self):
        self.prev = None  # no previous layer

    def __call__(self, value):
        return value  # identity

class Projection(Layer):
    def __init__(self, prev, W, bias):
        self.W, self.bias = W, bias
        self.prev = prev  # previous layer

    def __call__(self, value):
        val = self.prev(value)
        return, self.W) + self.bias

We illustrate the method with a 1-layer linear network:

inp = Input()
lin = Projection(inp, W="random matrix", bias="zero bias")
X = "input matrix"
out = lin(X)

The notation of fprop might be confusing here, since the input travels backwards from the last layer to the input layer. So let’s see what is happening here:

lin(X) is equivalent to lin.__call__(X), and inside this function the output of the previous layer is requested via self.prev(value), which continues until the input layer returns the actual value. This is the stop condition. The approach is not restricted to a 1-layer network and can be used for arbitrarily large networks.

With this idea, all we have to do is to split the layer setup and the computation part, which are combined in Theano. For instance, a projection layer in Theano:

class Projection(Layer):
    def __init__(self, input, W, bias):
        self.output =, W) + bias

now looks like this with numpy:

class ProjectionNP(LayerNP):
    def __init__(self, input, W, bias):  # setup
        self.prev = input
        self.W, self.bias = W, bias

    def __call__(self, value):  # computation
        val = self.prev(value)
        return, self.W) + self.bias

In other words, the step to convert any Theano layer is pretty straightforward and only needs time to type, but not to think (much).

The storage of such a model is just a list of all layers, and we can extract the output of any layer by simply calling the layer object with the input:

net = [inp, pro, bn, relu]
net[-1](X) # relu
net[-3](X) # projection

Let’s summarize the advantages again: First, except for numpy there are no other dependencies, and numpy is pretty portable and introduces little overhead. Second, we do not need to compile any functions, since we are working with real data and not symbolic variables. The latter is especially important if an “app” is started frequently but the interaction time is rather low, because then a constant overhead very likely decreases user satisfaction.

Bottom line, the method we described here is especially useful for smaller models and environments with limited resources which might include apps that are frequently started and thus should have low setup time.