# Tagged: theano

# Just Getting ML Things Done

It’s true that at some point you might need full control of the situation, with access to the loss function and maybe even the gradients. But sometimes all you need is a hammer and a couple of nails, without fine-tuning anything. Why? Because you just want to get the job done, ASAP. For example, we had a crazy idea that is likely NOT to work, but if it takes just 10 minutes of coding and 30 minutes of waiting for the results, why not try it? Without a proper framework, we would need to adjust our own code, which can be a problem if there are lots of “best practices” one needs to consider, as is the case for recurrent nets. Then it’s probably a better idea to use a mature implementation that avoids common pitfalls, so you can focus on the actual work. Otherwise frustration is only a matter of time, and spending hours on a problem that someone else has already solved will lead you nowhere.

Long story short, what we needed was a front-end for Theano with a clean interface that does not require tons of boilerplate code. After some investigation, with a focus on support for recurrent nets, we decided to give Keras a try. The actual code to build and train the network is less than 20 lines long, thanks to the sophisticated design of the framework. In other words, it allows you to get things done in a nice and fast way, if you are willing to sacrifice the ability to control every aspect of the training.

After testing the code with a small dataset, we generated the actual dataset and started the training. What can we say? The whole procedure was painless. Installation? No problem. Writing the code? Piece of cake. The training? Smooth without any fine-tuning. Model deployment? After installing h5py it worked without any problems ;-).

Bottom line, we are still advocates of Theano, but writing all the code yourself can be a bit of a burden, especially if you throw all the stuff away after a single experiment. Furthermore, if your experiment uses a standard network architecture that does not need to be tuned or adjusted, mature code can avoid lots of frustration in the form of hours of bug hunting. Plus, it’s likely that the code contains some heuristics to work around known problems that you might not be aware of.

For clarification, we do not say that Keras is just a high-level front-end that does not allow any customization. What we say is that it does a great job of providing one in case you don’t need all the expert stuff! And last but not least, it allows you to switch the back-end in case you want something other than Theano. We like the concept of providing a unified API for different back-ends a lot, because back-ends may have different strengths and weaknesses, and the freedom to choose allows you to write code against one API while switching back-ends as you need.

# Top-K-Gating With Theano

In a recently published paper [arxiv:1701.06538], the authors use a mixture of experts, which in itself is not new. The twist is to use only a small subset of those experts, which cannot be done with an ordinary softmax, since every output of a softmax is always (slightly) positive. The idea is to keep only the top-k experts by setting the values of all non-top-k experts to a large negative value before applying the softmax. As a result, the output at the corresponding positions is practically zero.

With numpy, where x is a vector, this is actually straightforward:

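    import numpy as np

    def keep_topk(x, k, neg=-10):
        rest = x.shape[0] - k        # number of entries to suppress
        idx = np.argsort(x)[0:rest]  # indices of the smallest values
        x[idx] = neg                 # note: modifies x in place
        return x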

We just sort the values of x, take the indices of the |x|-k smallest entries and set the values at those positions to -10.
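To see the effect on the gate values, here is a minimal sketch that combines the masking with a softmax. The `softmax` helper, the example logits and the choice of `neg=-1e9` (instead of -10) are our own assumptions for illustration, and this variant works on a copy rather than modifying the input in place.

```python
import numpy as np

def keep_topk(x, k, neg=-1e9):
    """Return a copy of x where all but the k largest entries are set to neg."""
    out = x.copy()
    rest = out.shape[0] - k
    idx = np.argsort(out)[:rest]  # indices of the smallest values
    out[idx] = neg
    return out

def softmax(x):
    """Numerically stable softmax for a 1-d vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

logits = np.array([0.1, 2.0, -1.0, 1.5, 0.3])  # hypothetical gate logits
gates = softmax(keep_topk(logits, k=2))
print(gates)  # only positions 1 and 3 carry non-zero weight
```

With a very large negative value the suppressed gates underflow to exactly zero, so only the k selected experts contribute to the blend.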

But since we want to use all the nice features of Theano, we need to port the code to the tensor world. Frankly, this is no big deal either, but it requires a tiny adaptation since we cannot assign values to tensors directly.

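    import theano.tensor as T

    def keep_topk(x, k, neg=-10):
        rest = x.shape[0] - k
        idx = T.argsort(x)[0:rest]
        # tensors are immutable, so we build a new tensor with the entries replaced
        return T.set_subtensor(x[idx], neg)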

And that’s it.

The reason why we spent some time with the porting is that we also had the idea to use soft attention to model the final prediction as a decision of a small set of experts. The experts might have different opinions and with the gating, we can blend different confidence levels with different outputs.

# Sparse Input Data With Theano

For some kinds of data it is not unusual to have a couple of thousand dimensions of which only very few carry actual values. An example is a bag-of-words approach with thousands of binary features where on average 99.5% of them are zero. In the case of a neural network this means we have a projection matrix W with N rows and M columns, where N is the number of input features (~10,000). Since a dot-product obviously depends only on the non-zero entries, the procedure can be sped up a lot if we use a sparse matrix instead of a dense one for the input data. However, we only need the sparse tensor type once, since after the first layer the output is always dense again.
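The claim that the dot-product only depends on the non-zero entries can be checked with plain numpy. The following is just a sketch with made-up sizes and feature indices: for a binary input, the dense product equals the sum of the few rows of W that belong to the active features.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_hidden = 10_000, 64                 # illustrative sizes
W = rng.standard_normal((n_features, n_hidden))   # projection matrix

# One bag-of-words sample with only a handful of active (binary) features.
active = np.array([3, 17, 4242, 9999])            # hypothetical feature ids
x_dense = np.zeros(n_features)
x_dense[active] = 1.0

dense_out = x_dense @ W              # touches all 10,000 rows of W
sparse_out = W[active].sum(axis=0)   # touches only the 4 active rows

print(np.allclose(dense_out, sparse_out))
```

A sparse matrix type exploits exactly this: it stores only the active positions, so the work scales with the number of non-zero entries rather than with N.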

The actual implementation in Theano is not a big deal. Instead of T.fmatrix(), we use sparse.csc_matrix(), which comes from the sparse module of Theano (`from theano import sparse`). If we use a generic projection layer, all we have to check is the instance type of the input tensor to use the appropriate dot function:

    if type(input.output) is theano.sparse.basic.SparseVariable:
        op = sparse.structured_dot
    else:
        op = T.dot

That is all and the rest can stay as it is.

The idea of “structured_dot” is that the first operand, the input data, is sparse and the other operand, the projection matrix, is dense. The derived gradient is also sparse and, according to the docs, both fprop and bprop are implemented in C.

Bottom line, if the input dimension is huge but only very few elements are actually “non-zero”, using a sparse matrix object is essential for good performance. The fact that non-contiguous objects cannot be used on the GPU is a drawback, but not a real problem for our models since they are CPU-optimized anyway.

# Converting Theano to Numpy

It is an open secret that we like Theano. It’s flexible and powerful, and once you have mastered some hurdles, it allows you to easily test a variety of loss functions and network architectures. However, once the model is trained, Theano can be a bit of a burden when it comes to the fprop-only part. In other words, if we just want to get predictions or feature representations, the setup and compilation overhead might be too much. The alternative is to convert the flow graph into numpy, which has the advantage that there are fewer dependencies and less overhead for the actual predictions with the model. Frankly, what we describe is neither rocket science nor new, but it is not common practice either, so we decided to summarize the method in this post.

To convert the graph notation to numpy, we make use of the `__call__` interface of Python classes. The idea is to call an instance of a class as a function with a parameter:

    import numpy as np

    class Layer(object):  # minimal stand-in for the base class
        pass

    class Input(Layer):
        def __init__(self):
            self.prev = None  # no previous layer

        def __call__(self, value):
            return value  # identity

    class Projection(Layer):
        def __init__(self, prev, W, bias):
            self.W, self.bias = W, bias
            self.prev = prev  # previous layer

        def __call__(self, value):
            val = self.prev(value)  # ask the previous layer for its output
            return np.dot(val, self.W) + self.bias

We illustrate the method with a 1-layer linear network:

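    W = np.random.randn(4, 3)  # random matrix (shapes are only illustrative)
    b = np.zeros(3)            # zero bias
    inp = Input()
    lin = Projection(inp, W=W, bias=b)
    X = np.random.randn(2, 4)  # input matrix
    out = lin(X)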

The notation of fprop might be confusing here, since the input travels backwards from the last layer to the input layer. So, let’s see what is happening here:

`lin(X)` is equivalent to `lin.__call__(X)`, and inside this function the output of the previous layer is requested via `self.prev(value)`, which continues until the input layer returns the actual value. This is the stop condition. The approach is not restricted to a 1-layer network and can be used for arbitrarily large networks.

With this idea, all we have to do is to split the layer setup and computation part that is combined in Theano. For instance, a projection layer in Theano:

    class Projection(Layer):
        def __init__(self, input, W, bias):
            self.output = T.dot(input, W) + bias

now looks like this with numpy:

    class ProjectionNP(LayerNP):
        def __init__(self, input, W, bias):  # setup
            self.prev = input
            self.W, self.bias = W, bias

        def __call__(self, value):  # computation
            val = self.prev(value)
            return np.dot(val, self.W) + self.bias

In other words, the step to convert any Theano layer is pretty straightforward and only needs time to type, but not to think (much).

The storage of such a model is just a list with all layers and we can extract the output of any layer, by simply calling the layer object with the input:

    net = [inp, pro, bn, relu]
    net[-1](X)  # relu
    net[-3](X)  # projection
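Putting the pieces together, a self-contained sketch of such a layer chain might look as follows. The `ReLU` class, the weight shapes and the random data are our own assumptions for illustration, and the batch-normalization layer from the list above is left out to keep it short.

```python
import numpy as np

class Layer:
    """Minimal base class; `prev` links to the preceding layer."""
    prev = None

class Input(Layer):
    def __call__(self, value):
        return value  # identity, stops the chain of calls

class Projection(Layer):
    def __init__(self, prev, W, bias):
        self.prev, self.W, self.bias = prev, W, bias

    def __call__(self, value):
        return np.dot(self.prev(value), self.W) + self.bias

class ReLU(Layer):
    def __init__(self, prev):
        self.prev = prev

    def __call__(self, value):
        return np.maximum(self.prev(value), 0.0)

rng = np.random.default_rng(0)
W, b = rng.standard_normal((4, 3)), np.zeros(3)  # illustrative parameters

inp = Input()
pro = Projection(inp, W, b)
relu = ReLU(pro)
net = [inp, pro, relu]

X = rng.standard_normal((2, 4))  # two samples with four features each
hidden = net[-2](X)              # output of the projection layer
out = net[-1](X)                 # output of the final ReLU
```

Note that every call walks the chain back to the input layer, so asking for several intermediate outputs recomputes the shared prefix each time. For small models this is usually negligible.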

Let’s summarize the advantages again: First, there are no dependencies other than numpy, which is pretty portable and does not introduce much overhead. Second, we do not need to compile any functions since we are working with real data and not symbolic variables. The latter is especially important if an “app” is started frequently but the interaction time is rather short, because then a constant overhead very likely reduces user satisfaction.

Bottom line, the method we described here is especially useful for smaller models and environments with limited resources which might include apps that are frequently started and thus should have low setup time.

# Theano vs. The Rest

If we only consider the back-ends, there are three major frameworks available: Torch, which was released in the early 2000s, Theano, which followed around 2010, and TensorFlow, released at the end of 2015 as the youngest member of the team. Yes, there are other frameworks, but most of the big companies use one of those, with a noticeable shift towards TensorFlow, probably because it has the largest community, offers lots of high-level code for common tasks including visualization and data processing, and is under rapid development.

Theano, on the other hand, is rather small if we consider the provided functionality, but it provides a kind of low-level access that is very convenient if you need to manipulate gradient expressions directly. Furthermore, there is no overhead if you just want to optimize a function. The price you have to pay is a steep learning curve and the need to write your own code for the network abstraction. It is also possible to use a front-end for this, but as soon as you handle very complex loss functions and non-standard components in terms of layers, generic frameworks/front-ends often reach their limits.

If we think of a large-scale adoption of a framework, it is perfectly understandable to switch, because, for instance, in the case of multi-{C,G}PU setups Theano might not be the best choice. In other words, each framework has its unique positive and negative sides, but sometimes, if all you have is a nail, you just need a hammer, and a whole tool belt is too much overhead.

Bottom line, we are still huge supporters of Theano and hope that the development of it will continue, since it is a fine piece of software and a big help if it is used for the problem it was designed for.

# Padded Word Indexes For Embeddings With Theano

We already wrote a post about how to speed up embeddings with Theano, but in that post we used a batch size of one. If you have to use mini-batches, things get a little more complicated. For instance, let’s assume that you have a network that takes the average of per-sample tags, encoded as one-hot vectors, in combination with other features.

With a batch size of one, things are easy:

    W = "Embedding Matrix"  # placeholder for the embedding matrix
    i = T.ivector()
    avg = T.mean(W[i], axis=0)

But now, let’s assume that we have a mini-batch and the number of tags per sample varies.

The naive solution:

    i = T.imatrix()
    avg = T.mean(W[i], axis=1)  # average over the tag axis of each sample
    func = theano.function([i], avg)

won’t work with an input like “[[0], [1, 2], [1, 10, 11]]” because a matrix only supports rows of the same length.

Thus, we need to pad all rows with a “stop token” until they have the same length: “[[#, #, 0], [1, 2, #], [1, 10, 11]]”. The most straightforward solution is to use “0” as this token and increment all IDs by one, which gives “[[0, 0, 1], [2, 3, 0], [2, 11, 12]]”. In other words, entry “0” of the embedding won’t get any updates.

So much for the theory, but how can we express this in Theano? Well, there are different ways and ours is very likely neither the smartest nor the fastest one, but it works! We split the calculation of the mean into a summing part and a dividing part.

Let’s assume that we have

    pos_list = T.imatrix()  # fed with e.g. [[0, 0, 1], [2, 3, 0], [2, 11, 12]]

Then we need a binary mask to decide which entries are not padding tokens:

    mask = (1. * (pos_list > 0))[:, :, None]  # shape (n, x, 1)

Next, we “fetch” all indexed rows but we zero out the ones with padding tokens:

    w = T.sum(mask * W[pos_list], axis=1)  # shape W[pos_list]: (n, x, y), shape w: (n, y)

Finally, we determine the non-padded indexes per row:

    div = T.sum(pos_list > 0, axis=1)[:, None]  # shape (n, 1)

The rest is a piece of cake:

    avg_batch = w / T.maximum(1, div)  # avoid div-by-zero

Frankly, there is no magic here and all we do is advanced indexing and reshaping. Again, we are pretty sure there are smarter ways to do this, but the performance is okay and the problem is solved, so why bother?

With this method it is now possible to train a model with mini-batches that uses averages of embeddings as input.
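The same steps can be checked with plain numpy, which makes it easy to verify the result against a per-sample average. This is only a sketch: the embedding matrix is random, its size is chosen to fit the example IDs, and row 0 plays the role of the padding entry.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((13, 4))  # 13 embeddings with 4 dims; row 0 = padding

# Padded and shifted tag IDs for three samples.
pos_list = np.array([[0, 0, 1], [2, 3, 0], [2, 11, 12]])

mask = (pos_list > 0).astype(W.dtype)[:, :, None]  # shape (n, x, 1)
w = (mask * W[pos_list]).sum(axis=1)               # shape (n, y)
div = (pos_list > 0).sum(axis=1)[:, None]          # shape (n, 1)
avg_batch = w / np.maximum(1, div)                 # avoid div-by-zero

print(avg_batch.shape)
```

Each row of `avg_batch` is the mean over the non-padded embeddings of that sample only. The first row, for example, is just `W[1]`.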

# Curriculum Learning Revisited

It is always a good idea to draw inspiration from biological learning. For instance, if we learn something new, it is common to start with simple examples and gradually move to ones that are more difficult to solve. The idea is to learn the basic steps and then to combine those steps into new knowledge, which is then combined again and so forth. The same can be done with neural networks, where it is called curriculum learning. At the beginning the network is fed with examples that are easy to learn. Then, decided by some schedule, more difficult examples are presented to the network. The idea is, again, that the network learns basic concepts first, which can then be combined into more complex ones to classify the challenging examples.

A recently published paper [arxiv:1612.09508] combines this idea with a feedback mechanism to incrementally predict a sequence of classes for an image that ranges from easy to difficult (coarse to fine-grained). The idea is to use a recurrent network in combination with loss functions that are attached to the output of each time step. The input is the image in combination with the hidden state of the previous time step. In other words, the output of the first step is fed into the first loss function (the easy class), the output of the next time step is fed into the second loss function (more difficult), and this continues for #T steps, which equals the number of loss functions.

An obvious advantage is that we can model a taxonomy of classes with this approach, for example: (animal, vehicle, plane) -> ([cat, dog, bird], [car, bike, truck], […]) -> ([tabby, …], [shepherd, beagle, …], […]). At the top level, the class is very coarse, but with each step it is further refined, which is the connection to curriculum learning. For instance, predicting whether an image contains an animal or a vehicle is much easier than predicting whether it contains a German shepherd or a pick-up truck.

However, in contrast to plain curriculum learning, we do not increase the difficulty per epoch, but per “layer”, because the network needs to predict all classes for an image correctly at every epoch. The idea is related to multi-task learning but with the difference that the loss functions are now connected to a recurrent layer which evolves over time and depends on the output of all previous steps.

So much for the overview of the method, but we are of course not interested in applying the method to images but to movie data. The good news is that we can easily build a simple taxonomy of classes, for instance: top genre -> sub genre -> theme, and since the idea can be applied to all kinds of data, it is straightforward to feed our movie feature vectors to the network. We start with a very simple network. The input is a 2,000 dim vector with floats in the range [0,1] and the pseudo code looks like this:

    x = T.vector()  # input vector
    y1, y2, y3 = T.iscalar(), T.iscalar(), T.iscalar()  # output class labels

    h1 = Projection+LayerNormalization+ReLU(x, num_units=64)
    r1 = GatedRecurrent+LayerNormalization+ReLU(h1, num_units=64)

    r1_step1 = "output of time-step 1"
    r1_step2 = "output of time-step 2"
    r1_step3 = "output of time-step 3"

    y1_hat = Softmax(r1_step1, num_classes=#y1)
    y2_hat = Softmax(r1_step2, num_classes=#y2)
    y3_hat = Softmax(r1_step3, num_classes=#y3)

    loss = nll(y1_hat, y1) + nll(y2_hat, y2) + nll(y3_hat, y3)

For each class, we use a separate softmax to predict the label, with the recurrent output at step #t as its input. Thus a training step can be summarized as follows: a movie vector is fed into the network and projected to the feature space spanned by h1. Then we feed h1 three times into the recurrent layer r1, together with the state from the previous step (at #t=0 the state is zero), to produce a prediction for each class. Therefore, there is a two-fold dependency. First, the hidden state is propagated through time, and second, the next state is influenced by the error derivative of the loss from the previous state, which means the representation learned by the recurrent layer must be useful for all labels.
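To make the data flow of the pseudo code more concrete, here is a forward-pass-only sketch in numpy. It is not our actual implementation: layer normalization and biases are omitted, the recurrent layer is a standard GRU cell with a tanh candidate, the weights are random, and the class counts and labels are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def init(*shape):
    return rng.standard_normal(shape) * 0.1

n_in, n_hid = 2000, 64
n_classes = [5, 12, 30]  # hypothetical sizes: top genre, sub genre, theme

W_h = init(n_in, n_hid)                          # projection layer
Wz, Uz = init(n_hid, n_hid), init(n_hid, n_hid)  # update gate
Wr, Ur = init(n_hid, n_hid), init(n_hid, n_hid)  # reset gate
Wc, Uc = init(n_hid, n_hid), init(n_hid, n_hid)  # candidate state
W_out = [init(n_hid, c) for c in n_classes]      # one softmax per level

def gru_step(inp, state):
    z = sigmoid(inp @ Wz + state @ Uz)
    r = sigmoid(inp @ Wr + state @ Ur)
    cand = np.tanh(inp @ Wc + (r * state) @ Uc)
    return (1.0 - z) * state + z * cand

def forward(x, labels):
    h1 = np.maximum(x @ W_h, 0.0)  # projection + ReLU
    state = np.zeros(n_hid)        # the state at t=0 is zero
    loss, probs = 0.0, []
    for W_o, y in zip(W_out, labels):
        state = gru_step(h1, state)  # same input h1, evolving state
        p = softmax(state @ W_o)     # prediction for this level
        probs.append(p)
        loss += -np.log(p[y])        # negative log-likelihood
    return loss, probs

x = rng.random(n_in)  # movie feature vector in [0, 1]
loss, probs = forward(x, labels=(1, 4, 7))
```

The loop makes the two-fold dependency visible: the state is carried from step to step, and because all three losses are summed, the gradients of every level flow back through the same recurrent weights.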

To illustrate the method with an example, let’s consider a movie with the top-genre “horror”, the sub-genre “creature-film” and the theme “zombies”. When the network sees the movie, it gets the first hint that it is a horror movie and builds a representation that maximizes the prediction of the first softmax for “horror”. But instead of forgetting the context, it ‘remembers’ that the movie belongs to the horror genre and uses the hint to adjust the representation to predict both classes (horror, creature-film) correctly. This is the second step. Lastly, it uses the given context to adjust the representation again to predict all three classes (horror, creature-film, zombies) correctly. The whole process can be considered as a loop, in contrast to typical multi-label learning, which uses a single output of a layer to predict all classes correctly.

Bottom line, as we demonstrated, feedback networks are not limited to images and they are extremely useful if there exists a natural, hierarchical label space for the data samples. Furthermore, since the classification of hierarchical labels requires more powerful representations, it is very likely that the learned feature space is more versatile and can also be used as input features for other models.