# Tagged: theano

# PyTorch: Non-differentiable is Nothing

In the last weeks, we really learned to appreciate PyTorch. Not because it is flawless, but because it is very intuitive and makes your life so much easier when you need to debug something. And let’s face it, at one point your network is doing something silly and you have no idea why. Then you need to take a look under the hood which can be a bit of a burden with symbolic variables. Furthermore, the dynamic graph approach lets you define some concepts, like recurrence without complicated scan functions which is one more reason why recurrent networks and PyTorch should be best friends. Bottom line, for a beta, the framework feels very stable and except for some numerical instabilities, we never encountered any problems so far.

With arxiv as a hub for research papers, one have access to lots of new ideas. Not all of them are diamonds, but sometimes a paper contains the hint you need to complete a problem or at least to evaluate an idea or to use it as a basis. Then, it is extremely useful if you can do a rapid prototype. Even with Theano it was quite easy possible, but since the framework is a little too low-level, you had to write lots of setup code. With PyTorch and the seamless numpy integration the setup part is almost neglectable.

Furthermore, it is also possible to transfer a learned model from one framework to another, in case one might have concerns to use a particular framework for production, or a particular framework is required. Thus, it is no problem to use the advantages of PyTorch for training and evaluation of a model and then store the model parameters as numpy arrays and build the network with Theano.

It is really hard to imagine to evaluate new, complex networks without all the fancy frameworks, especially automatic differentiation, but also computational graphs in general. For example, the famous AlexNet was written from scratch which was quite an achievement. In other words, there is no excuse today, not to test your ideas quickly, but also systematically with a framework like PyTorch. Yes, you need some Python and math skills, but with all the groundwork done by others, it is much faster than a few years ago.

So, the major problem, which cannot be solved by PyTorch yet, is to get your hands on a sufficiently large set of (labeled) data. However, if you have a dataset you can start with, there are no limits. Try some funky loss functions, or combine them. Can you imagine a world without GANs? Not anymore. But, somebody has to come up with the idea before it can be refined.

Bottom line, it has never been easier to utilize massive amounts of computational power with GPUs. There are many publicly available datasets, but also relatively easy ways to acquire labeled data, so you should be at least able to start your work. In combination with all the available frameworks, you can train very complex networks in a reasonable time which would have been impossible more than ten years ago even with a massive cluster of machines. So, everybody with a clever idea and some luck has the opportunity to make a difference.

# Just Getting ML Things Done

It’s true that at some point, you might need full control of the situation, with access to the loss function and maybe even the gradients. But sometimes, all you need is a hammer and a pair of nails without fine-tuning anything. Why? Because you just want to get the job done, ASAP. For example, we had a crazy idea that is likely to NOT work, but if we just need 10 minutes coding and 30 minutes waiting for the results, why not trying it? Without a proper framework, we would need to adjust our own code which can be a problem if there are lots of “best practices” one need to consider, like for instance for recurrent nets. Then it’s probably a better idea to use a mature implementation to avoid common pitfalls, so you can just focus on the actual work. Otherwise frustration is only a matter of time and spending hours on a problem that someone else has already solved will lead you nowhere.

Long story short, what we needed is a front-end for Theano with a clean interface and without the need to write tons of boilerplate code. After some investigations, and by focusing on support for recurrent nets, we decided to give Keras a try. The actual code to build and train the network has less than 20 lines, because of the sophisticated design of the framework. In other words, it allows you to get things done in a nice, but also fast way, if you are willing to to sacrifice the ability of controlling every aspect of the training.

After testing the code with a small dataset, we generated the actual dataset and started the training. What can we say? The whole procedure was painless. Installation? No problem. Writing the code? Piece of cake. The training? Smooth without any fine-tuning. Model deployment? After installing h5py it worked without any problems ;-).

Bottom line, we are still an advocate of Theano, but writing all the code yourself can be a bit of a burden, especially if you throw all the stuff away after a single experiment. Furthermore, if your experiment uses a standard network architecture without the necessity to tune or adjust it, mature code can avoid lots of frustration in form of hours of bug hunting. Plus, it’s likely that the code contains some heuristics to work around some known problems that you might not be aware of.

For clarification, we do not say that Keras is just a high-level front-end and does not allow any customization, what we say is that it does a great just to provide one in case you don’t need all the expert stuff! And last but not least, it allows you to switch the back-end in case you want something else than Theano. We like the concept a lot provide a unified API for different back-ends, because it’s possible that back-ends have different strengths and weaknesses and the freedom to choose allows you, to write code with one API but switching

back-ends dynamically as you need it.

# Top-K-Gating With Theano

In a recently published paper [arxiv:1701.06538], the authors introduced a mixture of experts which is not new. However, the twist is to use only a small subset of those experts which cannot be done with an ordinary softmax, since the output of a softmax is always -slightly- positive. The idea is to keep only the truly top-k experts by setting the values, before applying the softmax operation, of all non-top-k experts to a large negative value. The result is that the actual output value at the corresponding position is zero.

With numpy, x is a vector, this is actually straightforward:

def keep_topk(x, k, neg=-10):

rest = x.shape[0] - k

idx = np.argsort(x)[0:rest]

x[idx] = neg

return x

We just sort the values of x, getting the indicies for the |x|-k positions and set the values to -10.

But since we want to use all the nice features of Theano, we need to port the code to the tensor world. Frankly, this is no big deal either, but it requires a tiny adaption since we cannot assign values to tensors directly.

def keep_topk(x, k, neg=-10):

rest = x.shape[0] - k

idx = T.argsort(x)[0:rest]

return T.set_subtensor(x[idx], neg)

And that’s it.

The reason why we spent some time with the porting is that we also had the idea to use soft attention to model the final prediction as a decision of a small set of experts. The experts might have different opinions and with the gating, we can blend different confidence levels with different outputs.

# Sparse Input Data With Theano

For some kind of data it is not unusual to have a couple of thousand dimensions but only very few of them carry actual values. Like a bag-of-word approach with thousands of binary features but on average 99.5% of them are zero. In case of a neural network this means we have a projection matrix W with N rows and M columns where N is the number of input features (~10,000). Since it is obvious that a dot-product just depends on non-zero entries, the procedure can be speed-up a lot if we use a sparse matrix instead of a dense one, for the input data. However, we only need the sparse tensor type once, since after the first layer, the output is always dense again.

The actual implementation in Theano is not a big deal. Instead of T.fmatrix(), we use sparse.csc_matrix() which comes from the sparse module of Theano: `from theano import sparse`

. If we use a generic projection layer, all we have to check is the instance type of the input tensor to use the appropriate dot function:

if type(input.output) is theano.sparse.basic.SparseVariable:

op = sparse.structured_dot

else:

op = T.dot

That is all and the rest can stay as it is.

The idea of “structured_dot” is that the first operand, the input data, is sparse and the other operand, the projection matrix, is dense. The derived gradient is also sparse and according to the docs, both fprop and bprop is using C-code.

Bottom line, if the input dimension is huge but only very few elements are actually “non-zero” using a sparse matrix object is essential for a good performance. The fact that non-contiguous objects cannot be used on the GPU is a drawback, but not a real problem for our models since they are CPU-optimized anyway.

# Converting Theano to Numpy

It is an open secret that we like Theano. It’s flexible, powerful and once you mastered some hurdles, it allows you to easily test a variety of loss functions and network architectures. However, once the model is trained, Theano can be a bit of a burden when it comes to the fprop-only part. In other words, if we just want to get predictions or feature representations, the setup and compilation overhead might be too much. The alternative would be to convert the flow graph into numpy which has the advantage that there are fewer dependencies and less overhead for the actual predictions with the model. Frankly, what we describe is neither rocket science nor new, but it is also no common usage, so we decided to summarize the method in this post.

To convert the graph notation to numpy, we make use of the **__call__** interface of python classes. The idea is to call an instance of a class as a function with a parameter:

class Input(Layer):

def __init__(self):

self.prev = None # no previous layer

def __call__(self, value):

return value #identity

class Projection(Layer):

def __init__(self, prev, W, bias):

self.W = W, self.bias = bias

self.prev = prev # previous layer

` def __call__(self, value):
val = self.prev(value)
return np.dot(val, self.W) + self.bias
`

We illustrate the method with a 1-layer linear network:

inp = Input()

lin = Projection(inp, W="random matrix", b="zero bias")

X = "input matrix"

out = lin(X)

The notation of fprop might be confusing here, since the input travels backwards from the last layer to the input layer. So, let’s see what is happening here:

lin(X) is equivalent to lin.__call__(value) and inside this function, the output of the previous layer is requested self.prev(value) which is continued until the input layer returns the actual value. This is the stop condition. The approach is not restricted to a 1-layer network and can be used for arbitrary large networks.

With this idea, all we have to do is to split the layer setup and computation part that is combined in Theano. For instance, a projection layer in Theano:

class Projection(Layer):

def __init__(self, input, W, bias):

self.output = T.dot(input, W) + bias

now looks like this with numpy:

class ProjectionNP(LayerNP):

def __init__(self, input, W, bias): # setup

self.prev = input

self.W, self.bias = W, bias

` def __call__(self, value): # computation
val = self.prev(value)
return np.dot(value, self.W) + self.bias
`

In other words, the step to convert any Theano layer is pretty straightforward and only needs time to type, but not to think (much).

The storage of such a model is just a list with all layers and we can extract the output of any layer, by simply calling the layer object with the input:

net = [inp, pro, bn, relu]

net[-1](X) # relu

net[-3](X) # projection

Let’s summarize the advantages again: First, except for numpy there are no other dependencies and numpy is pretty portable and introduces not much overhead. Second, we do not need to compile any functions since we are working with real data and not symbolic variables. The latter is especially important if an “app” is started frequently but the interaction time is rather low, because then a constant overhead very likely declines user satisfaction.

Bottom line, the method we described here is especially useful for smaller models and environments with limited resources which might include apps that are frequently started and thus should have low setup time.

# Theano vs. The Rest

If we only consider the back-ends, there are three major frameworks available. Torch, which was released in early 2000, Theano which followed around 2010 and TensorFlow released at the end of 2015 as the youngest member in the team. Yes, there are other frameworks, but most of the big companies are using one of those with a noticeable shift towards TensorFlow. Probably because it has the largest community, lots of high-level code for common tasks which includes visualization and data processing and it undergoes a rapid development.

Theano on the other side is rather small, if we consider the provided functionality, but provides a kind of low-level access that is very convenient if you need to manipulate gradient expressions directly. Furthermore, there is no overhead if you just want to optimize a function. The price you have to pay is a steep learning curve and that you need to write your own code for the network abstraction. It is also possible to use a front-end for this, but as soon as you handle very complex loss functions and non-standard components in terms of layers, generic frameworks/front-ends often reach their limits.

If we think of a large-scale adoption of a framework, it is perfectly understandable to switch, because, for instance, in case of multi-{C,G}PU Theano might not be the best choice. In other words, each framework has its unique positive and negative sides, but sometimes you just need a hammer, if you have a nail and a tool belt is too much overhead.

Bottom line, we are still huge supporters of Theano and hope that the development of it will continue, since it is a fine piece of software and a big help if it is used for the problem it was designed for.

# Padded Word Indexes For Embeddings With Theano

We already wrote a post about how to speed-up embeddings with Theano, but in the post, we used a batch size of one. If you have to use mini-batches, things get a little more complicated. For instance, let’s assume that you have a network that takes the average of per-sample tags, encoded as one-hot vectors, in combination with other features.

With a batch size of one, things are easy:

W = "Embedding Matrix"

i = T.ivector()

avg = T.mean(W[i], axis=0)

But now, let’s assume that we have a mini-batch and the number of tags per sample varies.

The naive solution:

i = T.imatrix()

avg = T.mean(W[i], axis=0)

func = theano.function([i], avg)

won’t work with an input like “[[0], [1, 2], [1, 10, 11]]” because a matrix does only support rows with the same length.

Thus, we need to pad all rows with a “stop token” until they have the same length: “[[#, #, 0], [1, 2, #], [1, 10, 11]]”. The most straightforward solution is to use “0” as this token and increment all IDs by one. In other words, entry “0” of the embedding won’t get any updates. “[[0, 0, 1], [2, 3, 0], [2, 10, 11]]”.

So far for the theory, but how can we express this in Theano? Well, there are different ways and ours is very likely neither the smartest nor the fastest one, but it works! We split the calculation of the mean into the sum part and the dividing part.

Let’s assume that we have

pos_list = [[0, 0, 1], [2, 3, 0], [2, 10, 11]]

Then we need a binary mask to decide what are not padding tokens:

mask = (1. * (pos_list > 0))[:, :, None] #shape (n, x, 1)

Next, we “fetch” all indexed rows but we zero out the ones with padding tokens:

w = T.sum(mask * W[pos_list], axis=1) #shape W: (n, x, y), shape w: (n, y)

Finally, we determine the non-padded indexes per row:

div = T.sum(pos_list > 0, axis=1)[:, None] # shape(n, 1)

The rest is piece of cake:

avg_batch = w / T.maximum(1, div) #avoid div-by-zero

Frankly, there is no magic here and all we do is advanced indexing and reshaping. Again, we are pretty sure there are smarter ways to do this, but the performance is okay and the problem is solved, so why bother?

With this method it is now possible train a model with mini-batches that is using averages of embeddings as input.