PyTorch – Weight Decay Made Easy

In PyTorch, the implementation of the optimizer does not know anything about neural nets, which means it is possible that the current settings also apply L2 weight decay to bias parameters. In general this is not done, since those parameters are less prone to overfitting. Furthermore, the decay should also not be applied to parameters with a one-dimensional shape, meaning the parameter is a vector and not a matrix, which is quite often the case for normalization modules like batch norm, layer norm or weight norm. So, how can we tell the optimizer, in a principled way, to set the decay of those parameters to zero?

With the function named_parameters(), we get a name along with each parameter value. For standard layers, biases are named “bias”, and combined with the shape we can create two parameter lists, one with weight decay and one without it. Furthermore, we can easily use a skip_list to manually disable weight decay for some layers, like embedding layers. The code is pretty simple:

def add_weight_decay(net, l2_value, skip_list=()):
    decay, no_decay = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad:
            continue  # skip frozen weights
        if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [{'params': no_decay, 'weight_decay': 0.},
            {'params': decay, 'weight_decay': l2_value}]

and the returned list is passed to the optimizer:

params = add_weight_decay(net, 2e-5)
sgd = torch.optim.SGD(params, lr=0.05)
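To make the grouping concrete, here is a small self-contained check. The toy layer sizes are arbitrary and only serve as an illustration, and add_weight_decay is repeated from above so the snippet runs on its own:

```python
import torch
import torch.nn as nn

def add_weight_decay(net, l2_value, skip_list=()):
    decay, no_decay = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad:
            continue  # skip frozen weights
        if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [{'params': no_decay, 'weight_decay': 0.},
            {'params': decay, 'weight_decay': l2_value}]

# a toy net with two linear layers: two weight matrices, two bias vectors
net = nn.Sequential(nn.Linear(10, 20), nn.ReLU(), nn.Linear(20, 2))
groups = add_weight_decay(net, 2e-5)

# both biases end up in the no-decay group, both weight matrices in the decay group
print(len(groups[0]['params']), len(groups[1]['params']))  # 2 2
```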

That’s it. The behavior is documented, but we still think it’s a good idea to give an example, since in frameworks specialized in neural nets the default behavior might be different. Furthermore, the method is straightforward, but it requires some knowledge of the internals, and an example hopefully helps to better understand the process.


PyTorch – Freezing Weights of Pre-Trained Layers

Back in 2006, deep nets were trained by stacking pre-trained layers until the full network was built. Then a final fine-tuning step was performed to tune all network weights jointly. With the introduction of batch norm and other techniques, this has become obsolete, since now we can train deep nets end-to-end without much trouble. However, sometimes it is still beneficial to combine layers from pre-trained networks to give a network a direction where to search for good solutions. For instance, a recurrent network could use word embeddings from an unsupervised learning step like GloVe or Word2Vec. Whether this makes sense surely depends on the problem, but it is still possible.

We recently encountered a problem where we needed to predict multiple tags for a sequence of words. As a baseline we tried to train a continuous bag-of-words model with a skip-gram loss, but we found the performance unsatisfying, mostly because the supervised loss failed to learn a good embedding of individual words. This is a common problem, since the representation of the data depends only on the error signal, and if it goes to zero, learning stops immediately. This helps to reduce the loss, but it might also hurt generalization, since getting predictions for frequent items right drives the loss down faster than getting the predictions for items from the long tail right.

So, we decided to pre-train the embedding layer unsupervised, with a huge corpus of sentences (sequences of words), and then we used the embedding in a network to predict tags for them. We further decided to freeze the layer, which means the weights are not changed during learning. The reason is that we want to avoid a bias introduced by the supervised error signal. Next, we describe how this is possible in PyTorch.

Let’s assume we have a simple network:

class Network(nn.Module):
    def __init__(self, n_words, n_dim=5):
        super(Network, self).__init__()
        self.word_embed = nn.Embedding(n_words, n_dim, sparse=True)

def freeze_layer(layer):
    for param in layer.parameters():
        param.requires_grad = False

net = Network(1000)

By default in PyTorch, every parameter in a module (network) requires a gradient (requires_grad=True), which makes sense, since we usually want to jointly learn all parameters of a network. However, in case of a pre-trained layer, we want to disable backprop for this layer, which means the weights are fixed and do not get any updates during the backprop step.
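Putting the pieces together, a minimal sketch might look as follows. The snippet repeats the definitions from above so it runs on its own; note that after freezing, the optimizer should only be handed the parameters that still require gradients:

```python
import torch
import torch.nn as nn

class Network(nn.Module):
    def __init__(self, n_words, n_dim=5):
        super(Network, self).__init__()
        self.word_embed = nn.Embedding(n_words, n_dim, sparse=True)

def freeze_layer(layer):
    for param in layer.parameters():
        param.requires_grad = False

net = Network(1000)
freeze_layer(net.word_embed)  # disable backprop for the pre-trained layer

# only the parameters that still require gradients go to the optimizer;
# in this toy net the embedding is the only layer, so nothing is left
trainable = [p for p in net.parameters() if p.requires_grad]
print(len(trainable))  # 0
```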

That’s it. Pretty easy, if you know a little about the PyTorch internals. With this in mind, we can use the setting to fine-tune a network, for example by just learning the weights of a new classifier on top of a pre-trained network, or we can use it to combine a pre-trained layer at the bottom with a new network architecture.

Character-Level Positional Encoding

For documents, word-level embeddings work pretty well, because the vocabulary does not contain too many special cases. However, in case of tweets or (short) texts with a rather dynamic structure, using words might not be appropriate, because it is not possible to generalize to unknown words. That includes the case where those words are pretty similar to existing ones, but not quite the same. The issue can be tackled with recurrent nets that work on characters, but the problem is that the processing cannot be parallelized easily. The recent paper “Attention Is All You Need” describes an approach that uses a positional encoding to encode sequences of words without the need for a recurrent net. Thus, the process can be parallelized more easily and therefore performs better. However, we still have the problem that we cannot generalize to unknown words, except for a ‘UNK’ token, which is not very useful in our case.

To be more precise, we try to model titles from TV shows, series, etc. in order to predict the category, and there it is imperative that we can generalize to slight variations of those titles, which are likely to happen due to spelling errors and different formats used by different TV channels. The method to adapt the procedure to use characters instead of words is straightforward. We just need to build a lookup map with the calculated values, based on the position and the index of the dimension, up to the maximal sequence length that is present in the training data. Then, we can easily calculate the encoding of arbitrary inputs by using this formula:

 char_id_list  # sentence as a sequence of character ids
 embed = np.sum(pos_enc_map[np.arange(len(char_id_list))] * char_embed[char_id_list], 0)

and one possible lookup could be calculated by:

result = np.zeros((max_seq_len, dim_len))
for pos in range(max_seq_len):
    a = np.arange(dim_len) + 1.
    if pos % 2 == 0:
        t = np.cos(pos / a)
    else:
        t = np.sin(pos / a)
    result[pos] = t

The lookup uses different weights for different positions, but it also treats each dimension differently, which helps to bring more diversity to the encoding. And since the encoding does not depend on previous states, it preserves the semantic similarity of tokens that are present in multiple strings.
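To illustrate, here is a small self-contained sketch that combines the lookup with a toy character embedding. The sizes, the random embedding matrix and the example id sequence are arbitrary assumptions, only meant to show the shapes:

```python
import numpy as np

max_seq_len, dim_len, n_chars = 50, 8, 30

# positional lookup as described above: cos for even positions, sin for odd ones
pos_enc_map = np.zeros((max_seq_len, dim_len))
for pos in range(max_seq_len):
    a = np.arange(dim_len) + 1.
    if pos % 2 == 0:
        t = np.cos(pos / a)
    else:
        t = np.sin(pos / a)
    pos_enc_map[pos] = t

# toy character embedding matrix (random, just for illustration)
rng = np.random.RandomState(0)
char_embed = rng.randn(n_chars, dim_len)

char_id_list = [3, 7, 7, 1]  # a sentence as a sequence of character ids
embed = np.sum(pos_enc_map[np.arange(len(char_id_list))] * char_embed[char_id_list], 0)
print(embed.shape)  # (8,) -> one fixed-size vector per input string
```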

Bottom line, a positional encoding allows us to take the order of a sequence into consideration without using a recurrent net, which is very beneficial in many cases. Furthermore, the character-level encoding allows us to classify text sequences we have never seen before, which is very important due to minor variations of titles.

Do We Need More Data or Better Ways to Use It?

One of the first demonstrations of how powerful Deep Learning can be used 1,000 pictures per category and needed quite a lot of steps to build a model that worked. Without a doubt, this was a seminal work, but it also demonstrated that DL only vaguely resembles how humans learn. For instance, if a child had to look at 1,000 cups to get the concept of a cup, the lifespan of humans would be too short to survive without strong supervision. Another example are the recent breakthroughs in reinforcement learning, which also come at a certain cost, like a couple of thousand bucks a day for energy. In a lot of cases, data, and even labels, might be no problem, but it often takes days or even weeks to turn them into a useful model. This is also in stark contrast to the brain, which uses very little energy and is able to generalize with just a few examples, or even one.

Again, this is nothing new, but it raises the question of whether we spend too little time on fundamental research and instead try too often to beat state-of-the-art results to get a place in the hall of fame. The viewpoint is probably too simple, since there is research that focuses on real-world usage, like WaveNet, but it also shows that you need lots of manpower to do it. Thus, most companies have to rely on global players or public research if they want to build cutting-edge A.I. products. The introduction of GPU clouds definitely helped, because it allows everyone to train larger models without buying the computational machinery, but using the cloud is not free either, and it gets worse if the training has to be fast, since you then need to buy lots of GPU time. The topic, in a broader context, has also been debated recently in [1]. In the spirit of that debate, the question is: how can we avoid running against a wall about 1,000 times before we realize it’s not a good idea?

[1] “Will the Future of AI Learning Depend More on Nature or Nurture?”

Stuck In A Local Minimum

It was never easier to get access to very powerful machinery to learn something. There are lots of different frameworks to train your model, and with GPUs you can even train “bigger” models on commodity hardware. Furthermore, there is a lot of published research and there are cool blog posts that explain recent trends and new theories. In other words, almost everything is out there; you just need to find the time to digest all the content and turn it into something that is able to solve your problem, right? Well, honestly, if you have one of those common problems, a solution is probably just around the corner and you don’t need much time or energy to solve it. Maybe there is even a pre-trained network that can be used directly, or you could ask around whether somebody is willing to share one. But frankly, this is the exception rather than the rule, because very often your problem is special and the chance that existing code is available is close to zero. Maybe there are some relevant papers, but it is likely to take time to find them, and more time to implement the code. In other words, if the problem is easy, you often don’t need to do any research, but otherwise it can be a very long way with lots of obstacles even to get a hint of where to start. In such cases, if you are lucky, you can ask people in your company or team to give you some hints, or at least to discuss the problem at eye level. But what if you have no access to such valuable resources? Doing research on your own takes time, and it is not guaranteed to lead anywhere if your time is limited, which is usually the case. So, what to do? It’s like training a neural network with lots of local minima, where in some configurations learning even gets totally stuck. This is nothing new, but sometimes we get the impression that the popular opinion is that all you need is a framework, and tuning the knobs for as long as it takes will solve the problem.
This is like having a racing car without the proper training to drive it. There is a chance to win a race, but it’s more likely that you wreck it. The question is how to best spend your time when you want to solve a problem. Concentrate on technology, or on the theory? A little bit of both? Or try to find existing solutions? This is related to our concern that we meet a growing number of ML engineers who seem to be just power users of a framework, without the ability to understand what is going on under the hood.

A Canticle For Theano

Back in 2010, there was a presentation at SciPy about a new computation framework, and its name was Theano. At that time it was not the only available framework (there was also Torch, for example), but it provided a lot of new and very powerful features. Furthermore, it provided an interface very similar to the popular numpy library, which is very versatile and provides everything you need to build machine learning algorithms, especially neural nets.

The idea to describe an algorithm in a purely symbolic way and let the framework optimize this expression into a computational graph was something new; it also allowed executing code on other devices like the GPU, which can be much faster. And the cherry on top of this delicious cake was automatic differentiation. In contrast to other frameworks, or to implementing algorithms by hand, you can define an arbitrary loss function and let the framework do the calculation of the gradient. This has several advantages: it avoids the error-prone process of deriving the gradient manually, and it performs the required steps in a deterministic way with the help of the computational graph.

In other words, Theano allows you to implement everything you can imagine, as long as it is continuous and differentiable. Of course, this does not come for free: since the framework is a low-level library, it requires a solid understanding of the whole workflow, and you have to code much of the functionality yourself. But once you have done this, it allows rapid prototyping of different algorithms, since you don’t need to derive gradients manually, and you don’t have to care about optimizing the data flow, since Theano optimizes the graph for you.

More than five years ago, we started to use Theano because we love Python, machine learning, including neural nets, and the way Theano does things. In this time, we never encountered any serious bugs or problems, and we were able to use it on a broad range of devices, including CPUs, GPUs and embedded ones. We implemented innumerable algorithms in this time, from rather straightforward nets to deep nets with dozens of layers. And we did not only use it for research; we also used it to build real-world applications that were deployed and are still in use.

In other words, without Theano, the world of machine learning would definitely not be the same. For instance, recent frameworks were likely inspired by the features Theano provided. So, the question is: why did such a framework, with all its awesomeness, never manage to get the attention it deserved? Was it too difficult to use? Does it require the backing of a major company? Was it too close to academia? Missing marketing? We don’t have the answers, and now that the development of Theano will stop soon, the future is very uncertain. Sure, it is open source, but we still need people who are willing to spend time to improve it and to coordinate the further development. But even if we are a little skeptical about the future of Theano, version 1.0 is still a very useful piece of software that can be used to design and deploy your models.

What’s left to say? We would like to thank all the contributors and authors of Theano for their hard work. Theano is an amazing piece of software that allows everyone to express their thoughts as algorithms in a straightforward and efficient way. With it, it was possible to build cutting-edge products and to use GPUs even if you were not a CUDA expert.

Training and Deploying Neural Nets: Two Sides Of One Coin

There are a lot of problems out there and some of them can be definitely solved with neural nets. However, despite the availability of lots of sophisticated frameworks, you still need a solid understanding of the theory and also a portion of luck; at least when you plan to design your own loss function, or in case you want to implement a model from a recent research paper. To just train a classifier on some labeled data, you can practically select any of the available frameworks at random. But most real-world problems are not like that.

The truth is that you can do all this stuff with most of the frameworks out there, but what is not said very often is that this can be very time consuming and sometimes even frustrating. In the last months we promoted PyTorch as a framework that goes hand-in-hand with Python and which allows you to easily create networks with dynamic graphs. Not to mention the ability to debug code on-the-fly, since tensors have actual values and are not purely symbolic. This increased our productivity a lot and also reduced our level of frustration.

Still, we should not forget that all this is just technology, and even if frameworks have very large communities (and might be backed by big companies), there is no guarantee that they won’t be obsolete next year or maybe in five years. That said, in our humble opinion a framework should allow developers to get things done quickly, not force them to write lots of code that is not related to the actual problem. But this can turn into a problem when a framework is very high-level and does not easily allow you to customize your models, or to adapt the design, which includes, for example, the combination of different loss functions.

Let’s take NLP as an example: Without a doubt attention is a very important part of most modern approaches and thus, it is very likely that we need to integrate this feature in a real-world model. Despite its effectiveness, the method is not very complicated and in terms of the computational graph, it is also not very hard to implement. But this of course depends on the framework and its API. Does the framework come with native support for it? Is it possible to modify it easily? How well does it fit into the layer landscape? How difficult is it to implement it from scratch? Can it be debugged easily?
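To give an impression of how little code the core mechanism needs, here is a minimal sketch of scaled dot-product attention in plain numpy. This is a single head without masking, with toy shapes, and it is only an illustration of the mechanism, not the full architecture from the paper:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # scores: similarity of each query with each key, scaled by sqrt(d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # (n_queries, n_keys)
    # row-wise softmax turns the scores into attention weights
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # each output is a weighted average of the value vectors
    return weights @ V                                # (n_queries, d_v)

rng = np.random.RandomState(0)
Q = rng.randn(4, 8)
K = rng.randn(6, 8)
V = rng.randn(6, 8)
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (4, 8)
```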

Even if we have made a lot of progress in understanding and training neural nets, it still feels more like black magic than science. With a nice framework like Keras, it is not hard to train a neural net from scratch. But what happens if the learning gets stuck and this cannot be fixed trivially by adjusting some options? Then you need to go deeper, which requires a different skill set. In other words, try easy solutions first, since sometimes you don’t need more than a standard model.

This brings us to the question of whether we should use different frameworks for experiments and production. For the experiments, we need one that is very flexible, easy to debug, focused on understanding what is going on inside, and easily adaptable. However, for deployment we need one that allows running the model in heterogeneous environments with very different resources. It is possible that a single framework can do both, but the requirements for the two cases are very different.

Bottom line, once the model is trained, the major focus is maximal performance and minimal resource usage. Issues like flexibility, adaptability and, to some degree, debugging are not that important any longer. That is why we wonder why there is so little information about using neural nets in production environments and how to do it, because integrating models into applications, and the deployment itself, is far from trivial.