Sparse Input Data With Theano

For some kinds of data it is not unusual to have a couple of thousand dimensions of which only very few carry actual values. Think of a bag-of-words approach with thousands of binary features where, on average, 99.5% of them are zero. In case of a neural network this means we have a projection matrix W with N rows and M columns where N is the number of input features (~10,000). Since a dot product depends only on the non-zero entries, the procedure can be sped up a lot if we use a sparse matrix instead of a dense one for the input data. However, we only need the sparse tensor type once, since after the first layer, the output is always dense again.

The actual implementation in Theano is not a big deal. Instead of T.fmatrix(), we use sparse.csc_matrix() which comes from the sparse module of Theano (from theano import sparse). If we use a generic projection layer, all we have to do is check the type of the input tensor and pick the appropriate dot function:

if type(input.output) is theano.sparse.basic.SparseVariable:
    op = sparse.structured_dot  # sparse input, dense projection matrix
else:
    op = T.dot  # dense input

That is all and the rest can stay as it is.

The idea of “structured_dot” is that the first operand, the input data, is sparse and the other operand, the projection matrix, is dense. The derived gradient is also sparse and, according to the docs, both fprop and bprop use C code.
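To see why only the non-zero entries matter, here is a minimal numpy sketch. It is not the Theano code path; the sizes, the random matrix and the list of active features are made up for illustration. For a binary bag-of-words row, the dense dot product equals the sum of the few rows of W that belong to active features:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 10_000, 8  # illustrative sizes, not from a real model

W = rng.standard_normal((n_features, n_hidden))  # dense projection matrix

# One binary bag-of-words sample with only a handful of active features.
active = np.array([3, 42, 977, 5120])
x = np.zeros(n_features)
x[active] = 1.0

dense_out = x @ W                    # multiplies all 10,000 rows of W
sparse_out = W[active].sum(axis=0)   # reads only the 4 active rows

print(np.allclose(dense_out, sparse_out))
```

A sparse matrix format stores exactly this kind of index information, which is what allows a sparse dot product to skip the zeros.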

Bottom line, if the input dimension is huge but only very few elements are actually “non-zero”, using a sparse matrix object is essential for good performance. The fact that non-contiguous objects cannot be used on the GPU is a drawback, but not a real problem for our models since they are CPU-optimized anyway.

Gratitude For The Old

In the last post we discussed a problem that occurs when the first phase of learning has many ups and downs, which means the memory is re-adjusted a lot. In many of those cases, the system calms down eventually, but the drawback is that rare labels are very likely removed and re-introduced several times, which makes it hard to learn a stable pattern for them.

The problem is that the age of all memory slots is always increased by one regardless of how frequent an actual label is. In other words, if we have three labels and the distribution is 80/18/2, slots with label three age quickly and are good candidates to be replaced in the phase where the system tries to settle down.

The issue can be addressed by keeping a history of how labels are distributed across the memory. The more of the memory a label occupies, the higher its chance of being replaced should be, because several feature templates are available for it. This should help to keep rare labels in memory so that a stable feature template can be learned for them.

The implementation is pretty easy. Instead of selecting the slot just by its age, we also consider the label distribution:

n = argmax(A * T)

where A is the vector of slot ages and T is a vector of the same length as A that holds, per slot, the portion #label/#total of the memory occupied by the label of that slot.

For example, if a rare label has age=50, but a t=0.2 and we have a frequent label with age=15 but t=0.8, the more frequent one gets replaced because 15*0.8=12 and 50*0.2=10. And the good thing is that if all labels are distributed uniformly, we get exactly the original method.
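The selection rule can be sketched in a few lines of numpy. This is only an illustration under our own assumptions: the helper name select_slot and the toy memory of five slots are hypothetical and not taken from the actual model code.

```python
import numpy as np

def select_slot(age, labels):
    """Return the index of the slot to replace, n = argmax(A * T)."""
    age = np.asarray(age, dtype=float)
    labels = np.asarray(labels)
    # Share of the memory occupied by each distinct label (#label / #total).
    uniq, counts = np.unique(labels, return_counts=True)
    share = dict(zip(uniq.tolist(), (counts / labels.size).tolist()))
    # T[i] is the share of the label stored in slot i.
    T = np.array([share[l] for l in labels.tolist()])
    return int(np.argmax(age * T))

# Toy memory: four slots hold the frequent label 0, one slot the rare label 1.
labels = [0, 0, 0, 0, 1]
age = [15, 3, 7, 1, 50]

by_age_only = int(np.argmax(age))       # would evict the rare slot
by_age_and_share = select_slot(age, labels)
print(by_age_only, by_age_and_share)
```

With these numbers the scores are the ones from the example above (15*0.8 for the frequent slot, 50*0.2 for the rare one), so the frequent slot is selected even though the rare slot is much older. If every label has the same share, T is constant and the rule reduces to the plain argmax over the age.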

Forgetting Events Despite Having a Memory

Memory-augmented networks have the capability to remember rare events, which is very useful for one-shot learning and to avoid catastrophic forgetting of very descriptive, but infrequent patterns. With the evidence from the recently published papers it is safe to say that memory is definitely a step in the right direction to make networks more powerful. However, as usual, there is a BUT which can bite you in the backside.

In general, it is assumed that the data has some underlying, but hidden factors that need to be explained by a model. If the model does a good job, it learns a compressed representation of the data that can be used to solve the problem at hand, usually a classification task. So, the success of the model relies on disentangling the data until a classification with a linear model is possible.

When a memory is added to the model, its life gets easier because it can store and retrieve templates for latent factors that describe a class label, which removes the burden from the model to encode all the knowledge into its weight matrices. This is especially important if some patterns are very rare and therefore are likely “overwritten” by more frequent ones, which improves the loss a lot, but does not help to learn those rare patterns.

The problem is that for some kinds of data, it takes a lot of time and space (memory) to converge to a stable state and during this time, the memory is adjusted a lot. What does this mean? By default, the oldest entry is replaced, which means it likely points to a rare pattern because those are not seen and updated very often. And this leads to the problem that templates for rare patterns are eventually removed from the memory and need to be re-learned when introduced again, which is burdensome and unreliable.

In other words, if the underlying data manifold is very complex and the memory is in flux during a phase of converging, the benefit of using a memory for rare events is practically gone, since they are “pushed out” of the memory due to the many readjustment steps.

Bottom line, we need to adjust the procedure to select “old” entries to minimize the probability of removing rare events. But the problem is more complex than that because a template likely gets “out of sync” if it is not averaged with a recent controller representation from time to time. Again, the problem is the data, since our experiments with other domains, like images or natural language, worked much better.

Converting Theano to Numpy

It is an open secret that we like Theano. It’s flexible, powerful and once you have mastered some hurdles, it allows you to easily test a variety of loss functions and network architectures. However, once the model is trained, Theano can be a bit of a burden when it comes to the fprop-only part. In other words, if we just want to get predictions or feature representations, the setup and compilation overhead might be too much. The alternative would be to convert the flow graph into numpy, which has the advantage that there are fewer dependencies and less overhead for the actual predictions with the model. Frankly, what we describe is neither rocket science nor new, but it is also not common practice, so we decided to summarize the method in this post.

To convert the graph notation to numpy, we make use of the __call__ interface of python classes. The idea is to call an instance of a class as a function with a parameter:

import numpy as np

class Layer(object):
    pass  # minimal base class (assumed, not shown in the post)

class Input(Layer):
    def __init__(self):
        self.prev = None  # no previous layer

    def __call__(self, value):
        return value  # identity

class Projection(Layer):
    def __init__(self, prev, W, bias):
        self.W, self.bias = W, bias
        self.prev = prev  # previous layer

    def __call__(self, value):
        val = self.prev(value)  # ask the previous layer for its output
        return np.dot(val, self.W) + self.bias

We illustrate the method with a 1-layer linear network:

inp = Input()
lin = Projection(inp, W="random matrix", bias="zero bias")
X = "input matrix"
out = lin(X)

The notation of fprop might be confusing here, since the input travels backwards from the last layer to the input layer. So, let’s see what is happening here:

lin(X) is equivalent to lin.__call__(X) and inside this function, the output of the previous layer is requested with self.prev(value), which is continued until the input layer returns the actual value. This is the stop condition. The approach is not restricted to a 1-layer network and can be used for arbitrarily large networks.

With this idea, all we have to do is to split the layer setup and the computation, which are combined in Theano. For instance, a projection layer in Theano:

class Projection(Layer):
    def __init__(self, input, W, bias):
        self.output = T.dot(input.output, W) + bias

now looks like this with numpy:

class ProjectionNP(LayerNP):
    def __init__(self, input, W, bias):  # setup
        self.prev = input
        self.W, self.bias = W, bias

    def __call__(self, value):  # computation
        val = self.prev(value)
        return np.dot(val, self.W) + self.bias

In other words, the step to convert any Theano layer is pretty straightforward and only needs time to type, but not to think (much).

The storage of such a model is just a list with all layers and we can extract the output of any layer, by simply calling the layer object with the input:

net = [inp, pro, bn, relu]
net[-1](X) # relu
net[-3](X) # projection
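The following self-contained sketch puts the pieces together for a small network. It is an illustration under our own assumptions: the weights are random instead of trained, the ReLU class is a hypothetical helper written in the same style, and the batch normalization layer from the list above is left out to keep the example short.

```python
import numpy as np

class Input:
    def __init__(self):
        self.prev = None  # no previous layer

    def __call__(self, value):
        return value  # stop condition: hand back the raw input

class Projection:
    def __init__(self, prev, W, bias):
        self.prev, self.W, self.bias = prev, W, bias

    def __call__(self, value):
        return np.dot(self.prev(value), self.W) + self.bias

class ReLU:  # hypothetical helper layer, same pattern as Projection
    def __init__(self, prev):
        self.prev = prev

    def __call__(self, value):
        return np.maximum(0.0, self.prev(value))

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))  # stand-in for trained weights
b = np.zeros(3)
X = rng.standard_normal((2, 4))  # two samples with four features each

inp = Input()
pro = Projection(inp, W, b)
relu = ReLU(pro)
net = [inp, pro, relu]

hidden = net[-2](X)  # output of the projection layer
out = net[-1](X)     # output of the full network
print(hidden.shape, out.shape)
```

Note that calling net[-1](X) recomputes all earlier layers. For small models that is fine; if several intermediate outputs are needed at once, caching would be an obvious extension.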

Let’s summarize the advantages again: First, except for numpy there are no other dependencies, and numpy is pretty portable and introduces little overhead. Second, we do not need to compile any functions since we are working with real data and not symbolic variables. The latter is especially important if an “app” is started frequently but the interaction time is rather short, because then a constant overhead very likely reduces user satisfaction.

Bottom line, the method we described here is especially useful for smaller models and environments with limited resources which might include apps that are frequently started and thus should have low setup time.

Theano vs. The Rest

If we only consider the back-ends, there are three major frameworks available: Torch, which was released in the early 2000s, Theano, which followed around 2010, and TensorFlow, released at the end of 2015 as the youngest member of the team. Yes, there are other frameworks, but most of the big companies are using one of those, with a noticeable shift towards TensorFlow, probably because it has the largest community, offers lots of high-level code for common tasks, including visualization and data processing, and is under rapid development.

Theano, on the other hand, is rather small if we consider the provided functionality, but it provides a kind of low-level access that is very convenient if you need to manipulate gradient expressions directly. Furthermore, there is no overhead if you just want to optimize a function. The price you have to pay is a steep learning curve and that you need to write your own code for the network abstraction. It is also possible to use a front-end for this, but as soon as you handle very complex loss functions and non-standard components in terms of layers, generic frameworks/front-ends often reach their limits.

If we think of a large-scale adoption of a framework, it is perfectly understandable to switch, because, for instance, in case of multi-{C,G}PU Theano might not be the best choice. In other words, each framework has its unique positive and negative sides, but sometimes, if all you have is a nail, you just need a hammer and a whole tool belt is too much overhead.

Bottom line, we are still huge supporters of Theano and hope that the development of it will continue, since it is a fine piece of software and a big help if it is used for the problem it was designed for.

Padded Word Indexes For Embeddings With Theano

We already wrote a post about how to speed up embeddings with Theano, but in that post, we used a batch size of one. If you have to use mini-batches, things get a little more complicated. For instance, let’s assume that you have a network that takes the average of per-sample tags, encoded as one-hot vectors, in combination with other features.

With a batch size of one, things are easy:

W = "Embedding Matrix"
i = T.ivector()
avg = T.mean(W[i], axis=0)

But now, let’s assume that we have a mini-batch and the number of tags per sample varies.

The naive solution:

i = T.imatrix()
avg = T.mean(W[i], axis=0)
func = theano.function([i], avg)

won’t work with an input like “[[0], [1, 2], [1, 10, 11]]” because a matrix only supports rows of the same length.

Thus, we need to pad all rows with a “stop token” until they have the same length: “[[#, #, 0], [1, 2, #], [1, 10, 11]]”. The most straightforward solution is to use “0” as this token and increment all IDs by one: “[[0, 0, 1], [2, 3, 0], [2, 11, 12]]”. In other words, entry “0” of the embedding won’t get any updates.

So much for the theory, but how can we express this in Theano? Well, there are different ways and ours is very likely neither the smartest nor the fastest one, but it works! We split the calculation of the mean into the sum part and the dividing part.

Let’s assume that pos_list is a T.imatrix() that is fed the padded indexes:

pos_list = [[0, 0, 1], [2, 3, 0], [2, 11, 12]]

Then we need a binary mask that marks the entries that are not padding tokens:

mask = (1. * (pos_list > 0))[:, :, None] #shape (n, x, 1)

Next, we “fetch” all indexed rows but we zero out the ones with padding tokens:

w = T.sum(mask * W[pos_list], axis=1) #shape W[pos_list]: (n, x, y), shape w: (n, y)

Finally, we determine the non-padded indexes per row:

div = T.sum(pos_list > 0, axis=1)[:, None] # shape(n, 1)

The rest is a piece of cake:

avg_batch = w / T.maximum(1, div) #avoid div-by-zero
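Since the same indexing tricks work in numpy, the whole procedure can be checked without compiling anything. The sketch below is a numpy re-statement of the steps above, not the Theano code itself; the helper names pad_and_shift and masked_mean and the random embedding matrix are assumptions made for illustration. The padding is appended on the right here, which makes no difference for the average.

```python
import numpy as np

def pad_and_shift(rows, pad_id=0):
    """Increment all IDs by one and right-pad rows with pad_id to equal length."""
    width = max(len(r) for r in rows)
    out = np.full((len(rows), width), pad_id, dtype=np.int64)
    for n, r in enumerate(rows):
        out[n, :len(r)] = np.asarray(r, dtype=np.int64) + 1
    return out

def masked_mean(W, pos_list):
    """Average the embedding rows per sample, ignoring padding tokens."""
    mask = (pos_list > 0).astype(W.dtype)[:, :, None]  # shape (n, x, 1)
    w = np.sum(mask * W[pos_list], axis=1)             # shape (n, y)
    div = np.sum(pos_list > 0, axis=1)[:, None]        # shape (n, 1)
    return w / np.maximum(1, div)                      # avoid div-by-zero

rng = np.random.default_rng(0)
W = rng.standard_normal((13, 4))   # row 0 is reserved for the padding token
tags = [[0], [1, 2], [1, 10, 11]]  # ragged tag IDs per sample

pos_list = pad_and_shift(tags)
avg_batch = masked_mean(W, pos_list)
print(pos_list)
print(avg_batch.shape)
```

Each row of avg_batch should match the plain mean over the shifted IDs of that sample, which is an easy sanity check before moving the same logic into a Theano graph.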

Frankly, there is no magic here and all we do is advanced indexing and reshaping. Again, we are pretty sure there are smarter ways to do this, but the performance is okay and the problem is solved, so why bother?

With this method it is now possible to train a model with mini-batches that uses averages of embeddings as input.

More Data vs. Better Models

The hype about A.I. has reached almost preposterous proportions. Without a doubt, there was a lot of recent progress, but there is still a long way to go to achieve even a modest success in terms of a real ‘intelligence’. That’s why it is no shame to say that we are just scratching the surface. With deep neural nets, we are closer than we were ten years ago, but most of the work is still supervised, even if some approaches are *very* clever. Thus, with more data we can likely improve the score of some model, but this does not help to overcome serious limitations of big, but dumb networks. One way out of it would be unsupervised learning, but the advances in this domain are rather modest, probably because supervised learning works so well for most tasks. Thus, it should be noted that for some kinds of problems, more data actually helps a lot and might even solve the whole problem, but it is very unlikely that this is true for most kinds of problems.

For instance, as soon as we use some kind of label, the learning is only driven by the error signal induced by the difference between the actual and the predicted value. Stated differently, if the model is able to correctly predict the labels, there will be no further disentangling of explaining factors in the data, because there is no benefit in terms of the objective function.

But there are real-world problems with limited or no supervision at all, which means there is no direct error signal, but we still must explain the data. One solution to the problem is a generative approach, since if we can generate realistic data, we surely understand most of the explaining factors. However, generative models often involve sampling and learning can be rather slow and/or challenging. Furthermore, for some kinds of data, like sparse textual data, a successful generative training can be even more difficult.

With the introduction of memory to networks, models got more powerful, especially in handling “rare events”, but most of the time the overall network is still supervised and so is the adjustment of the memory. The required supervision is the first problem and the second one is that there is no large-scale support for general memory architectures. For instance, non-differentiable memory often requires a nearest neighbor search[1], which is a bottleneck, or it is required to pre-fill the memory and reset it after so-called “episodes”.

In a talk they used the analogy with a cake where supervised learning is the “icing”, but the unsupervised learning is the core of the cake, the “heart” of it. So, in other words, even with unlimited data we cannot make a dumb model smarter, because at some point it would stop learning with respect to the supervised loss function. The reason is that it “knows” everything about the data for a “perfect” prediction but is ignoring other details. So, it’s the old story again, about choosing an appropriate loss function that actually learns the explaining factors of the data.

Bottom line, getting more data is always a good idea, but only if we can somehow extract knowledge from it. Thus, it should be our first priority to work on models that can learn without any supervision, and also with less data (one-shot learning). But we should also not forget about practical aspects, because models which are slow and require lots of resources are of very limited use.