Backtrack But Not Backwards

A recent talk at NIPS 2017 underlined that, despite all the progress made in recent years, we still know very little about how things work internally. The question is: what do you do when you encounter a strange problem while training your model? Do you just switch to another optimizer, architecture, or hyper-parameter, or do you try to find the root cause of the problem? With all the nice, publicly available ML stuff out there, it is tempting to try one thing after another until something works. At the end of the day, your model might be powerful enough to solve the problem at hand, but it is also very likely that it is just a black box you don’t fully understand, and if the system stops working, you need to search for a new model again.

The talk also emphasized that we need more well-understood building blocks that can be combined to tackle more complex problems, instead of just plugging “mythical” things into our networks that make them “magically” work. In other words, we should focus more on basic experiments to better understand existing building blocks. That includes spending more time on showing why things work, instead of just saying that they do because the error rate goes down while no one cares to explain exactly why.

This is the kind of backtracking you do when you are stuck because your model won’t work. If you just switch your architecture, you won’t gain any new insights, and if the problem occurs again, you need to switch again. The process of really understanding what is going on can be extremely painful and probably needs lots of resources, but in the end it pays off, since you can use the knowledge to build better models and to focus on new problems instead of just plugging black boxes together and hoping that they eventually work.

Like a long journey, it starts with a single step, and at the beginning there might be no light at the end of the tunnel. But if you don’t give up, you will eventually figure out where to put the next piece of the puzzle, and then the next one, and so forth. In the end you will see much more of the whole picture, even if it takes a very long time.


IdeaPad 720s – Machine Learning For The Road

Even if the GPU is a mobile version, the performance gain compared to a CPU is very noticeable. As a result, it makes a lot of sense to have a dedicated GPU in your notebook if you buy a new one. This allows you to play with more complex models, or even to train them, while you are traveling and don’t have easy access to servers with GPU cards. And even if you just want to use a pre-trained model for feature extraction, you can save a lot of time by using the GPU.

There is always the option to use a gaming notebook, but in case you want a lightweight companion there are far fewer options. Our choice was the IdeaPad 720s with a 14-inch screen, because it is lightweight but still powerful, with enough RAM and a dedicated GeForce 940MX GPU that comes with non-shared memory. Without a doubt this is no high-end configuration, but the CUDA capabilities are sufficient to run older nets or to design your own, be it a ConvNet or an RNN. Plus, with the huge SSD, you can train on pretty large training sets with better I/O performance than SATA disks.

So much for the theory, now comes the reality. Especially for newer or more exotic notebooks, installing Linux is not always trivial. To spare others the pain, we summarize the steps we took to get it running. It’s still not perfect, as there are minor problems with the WLAN, but we have used it for quite a few hours now without any trouble and successfully tested PyTorch + CUDA.

The first thing you have to do is to switch from “UEFI” to “Legacy Support”, but that’s nothing new. You can enter the BIOS by pressing F2 during boot, or use the Novo “button” on the left side of the notebook. If this worked, you should shrink the NTFS volume to make room for a real OS, which is pretty straightforward. Just halve the size, so you get about 250 GB for Linux. After all settings have been adjusted, we can start with the Linux installation. Lubuntu seems a good choice since it is lightweight but still comes with excellent support for detecting more exotic hardware. Make sure you have chosen the correct boot order, so you can boot from a USB stick or DVD drive.

Long story short, the installation failed miserably because the SSD was not recognized. But there is no time for panic! Thanks to the active community, we found the answer to that problem pretty fast. You have to switch the “SATA Controller Mode” from “RAID” to “AHCI” in the BIOS. With the new setting, it was possible to create an ext4 and a swap partition in the free space of the SSD. Then, the actual installation could be done without any problems. Only the GRUB installation seems not optimal, since we get no boot screen and thus don’t know if our Windows partition was correctly recognized. From the GRUB config it does not seem so, but this is not our major concern, since our focus is a working Linux system with GPU support. So, we are blind at startup, but since Linux starts correctly we do not investigate this any further right now.

The next step is to get PyTorch working, which was no problem at all. We used pip with Python 2.7 and CUDA 8.0, and it worked like a charm. Only torchvision failed, but we solved it by using “pip install --no-deps torchvision”, since one dependency is still pytorch 0.1.2. A quick test with ipython confirmed that everything is okay and working. The last step is the installation of the CUDA toolkit, which was also no problem thanks to the apt sources we just had to uncomment in the sources.list file. After “apt-get update” we installed the CUDA toolkit packages and all their dependencies. Since CUDA requires a kernel module that is compiled at the end of the installation, a restart is required. To check if the setup was done correctly, start “nvidia-smi” after the reboot and see if the device is listed there.

After we got a prompt again, we downloaded a pre-trained network from the model collection and hacked some code to perform an image classification. Compared to the early days of ConvNets, even the CPU version was pretty “fast”. Next, we checked that CUDA is correctly recognized by PyTorch, and after that we moved the model to the GPU along with the tensor we use for classification. As we mentioned at the beginning of the post, the performance boost was clearly visible, and except for the first call, which triggered some background activity to set up CUDA, everything went smoothly.
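The check described above can be sketched roughly as follows. This is a minimal sketch, not the code we ran: it assumes a recent PyTorch release (the torch.device API is newer than the version used in this post) and uses a small, hypothetical stand-in network instead of the pre-trained model, so it runs without any download. It falls back to the CPU when CUDA is not available.

```python
import torch
import torch.nn as nn

# Pick the GPU if PyTorch can see it, otherwise stay on the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the pre-trained ConvNet (hypothetical toy model).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).to(device)
model.eval()

# The input tensor must live on the same device as the model.
image = torch.rand(1, 3, 64, 64, device=device)

with torch.no_grad():
    scores = model(image)
predicted_class = scores.argmax(dim=1).item()
print(device, predicted_class)
```

The important part is that both the model and the input are moved to the same device; forgetting one of them is the most common source of errors at this step.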

Bottom line, here is the check list again to enjoy the notebook with Linux:
– Shrink the size of your NTFS volume by 50%
– Switch from “UEFI” to “Legacy Support”
– Switch the “SATA Controller Mode” from “RAID” to “AHCI”
Since there is still room for improvements, we might create a successor blog post with additional details.

PyTorch 0.3 – An Early Xmas Gift

This year, Santa is a little early, but the gifts are impressive nonetheless. In the last few days, we hacked on a model to predict whether a text sequence belongs to a certain type or not, and we found out that if we do not process the text from left to right but in reverse order, the model learns much better and faster. The idea is not new and was already introduced in neural translation models. Still, it’s amazing that such a small modification has such a huge impact.
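Reversing the input is purely a preprocessing step. The following is a minimal sketch under the assumption that the text is encoded character by character; the vocabulary and the function name encode_reversed are made up for illustration and are not taken from our model.

```python
def encode_reversed(text, char_to_id, unk_id=0):
    """Map a string to a list of character ids, last character first."""
    return [char_to_id.get(ch, unk_id) for ch in reversed(text)]


# Hypothetical toy vocabulary; id 0 is reserved for unknown characters.
vocab = {ch: i + 1 for i, ch in enumerate("abcdefghijklmnopqrstuvwxyz ")}

ids = encode_reversed("the end", vocab)
print(ids)
```

The network itself stays untouched; it simply sees the last character of the sequence first.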

Next, we heard about the new PyTorch release and that it might also bring gifts, for instance performance boosts along with other nice goodies. So, we updated, still from source since the provided pre-compiled packages do not work for us, and then ran a short test. The results were pretty stunning, since a cycle now takes about 5 seconds less than with version 0.2.0.

All in all, from a machine learning perspective we are pretty happy with the gifts Santa brought us, and given our experience so far, we encourage everybody to update their PyTorch version to also feel the machine learning Christmas spirit.

When The ML Devil Is A Cute Squirrel

We recently stumbled over a problem that is pretty straightforward: based on a sub-title, we had to decide if the text describes a movie or some other type, like a series or a documentary. So, we started our favorite editor and began to hack a very basic recurrent neural network. Furthermore, since we wanted to ensure that we can use the net for all kinds of new input, we decided to use a character-based net. That was the easy part. From a fairly recent paper, we took the heuristic to initialize all non-recurrent weights from U[-0.1, 0.1] and the recurrent weights using orthogonalization, with Adam as our optimizer.

We are aware that heuristics do not always work, but we were pretty astonished that no learning at all occurred, not even a little. So, we used the default weight initialization from the framework and voilà, there was immediate progress. Just a slightly different weight initialization procedure and it works. Out of curiosity, we also tried different optimizers and found out that Adam with the default settings, lr=0.001, was far from optimal. For instance, when we used RMSprop with the same lr parameter, the error after an epoch was “much” lower and the number of correctly classified items was higher.
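For reference, the initialization heuristic and the optimizer swap can be sketched as follows. This is a sketch under assumptions, not our original code: it uses the current torch.nn.init API and a plain nn.RNN with made-up sizes, and it also draws the biases from U[-0.1, 0.1], which the heuristic as quoted above does not specify.

```python
import torch
import torch.nn as nn


def init_rnn(rnn, scale=0.1):
    """Orthogonal recurrent weights, U[-scale, scale] for everything else."""
    for name, param in rnn.named_parameters():
        if "weight_hh" in name:          # hidden-to-hidden (recurrent) matrix
            nn.init.orthogonal_(param)
        else:                            # input weights and biases (assumption)
            nn.init.uniform_(param, -scale, scale)


rnn = nn.RNN(input_size=16, hidden_size=32, batch_first=True)
init_rnn(rnn)

# Swapping the optimizer is a one-line change; both use lr=0.001 here.
adam = torch.optim.Adam(rnn.parameters(), lr=0.001)
rmsprop = torch.optim.RMSprop(rnn.parameters(), lr=0.001)
```

To compare against the framework default, simply skip the call to init_rnn and train both variants with the same data and seed.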

The lesson we learned, again, is that even with all the insights and tricks from dozens of papers, lectures and tutorials, optimizing neural nets is still more of an art than a science, and there is no recipe one can always use to get a good model. This is why we strongly favor doing more basic research over beating state-of-the-art results, since this is the only way to gain more insight into how to actually solve the problem. To be fair, there are people doing exactly this, and they also share their insights, which is very valuable. But on the other hand, there are lots of papers that hardly provide enough details to repeat the experiments.

It boils down to the question: what do you do if you are working on a challenging problem and you run into a dead end? To quote from Schmidhuber’s AMA on how to recognize a promising ML student: “[..]run into a dead end, and backtrack. Another dead end, another backtrack. But they don’t give up.” But way too often it seems that if something is not working, the method is discarded and something new is tried without understanding the actual problem. If you do ML stuff in your spare time, it’s understandable that you want to make progress no matter how. But if you are a professional, deeper insights should be the way to go, not just getting something done without knowing why or how it works.

To sum it up, even if machine learning is very fascinating these days, especially with all the resources you can use, it is still a long way until we really understand what is going on under the hood. And as long as we do not stop trying to find out, we will make continual progress, even if the steps seem to be very tiny.

The Opposite of Something

When the size of the vocabulary is very large, learning embeddings with the negative log-likelihood, and thus the softmax, can be pretty slow. That’s why negative sampling has been introduced. It’s very fast and can lead to competitive results while using far fewer resources. However, there is a serious drawback. Similar to the triplet loss, the performance largely depends on the generation of proper positive and negative pairs. This is nothing new, but has recently been confirmed again by an analysis reported in the StarSpace [arxiv:1709.03856] paper.

The problem is that selecting “trivial” negative candidates might result in no loss or a low loss, since those items are likely to be already well separated from the positive item. Furthermore, there is often no clear strategy for deciding what the inverse of something is. For instance, let’s assume that we have two positive items related to “cooking” and now we need one or more negative items as a contrastive force. The question is: are items from “cars” better than those from “news”? Are they more inverse? A solution could be to perform hard negative mining by finding items that clearly violate the margin and thus lead to a higher loss, which means some learning occurs. But the procedure is computationally very expensive and not feasible if we have thousands of candidates or more.
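To make the issue concrete, here is a small sketch in plain Python. It assumes a margin ranking loss on cosine similarity, which is one common choice but not necessarily the one used in any particular paper; the 2-d embeddings and all function names are invented for illustration.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def margin_loss(anchor, positive, negative, margin=0.2):
    """Hinge loss: zero once the negative is `margin` less similar than the positive."""
    return max(0.0, margin - cosine(anchor, positive) + cosine(anchor, negative))


def hardest_negative(anchor, positive, candidates, margin=0.2):
    """Brute-force mining: return the candidate with the highest loss."""
    return max(candidates, key=lambda neg: margin_loss(anchor, positive, neg, margin))


# Hypothetical embeddings, hand-picked for illustration only.
cooking_a = [1.0, 0.1]
cooking_b = [0.9, 0.2]
cars = [-1.0, 0.3]  # far away: a "trivial" negative
news = [0.8, 0.6]   # close by: violates the margin

print(margin_loss(cooking_a, cooking_b, cars))  # no loss, so no learning signal
print(margin_loss(cooking_a, cooking_b, news))  # positive loss
```

The brute-force search in hardest_negative evaluates the loss for every candidate, which is exactly why the procedure becomes expensive for large candidate sets.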

So, if we restrict the norm of each embedding and do not use an L2 weight decay scheme that always pushes the weights down, the model will eventually “converge”, but we don’t know how many steps are required. In other words, a straightforward (linear) model might often suffice, and we should instead invest more time in finding clever ways to perform the positive and negative sampling step.

It is astonishing and a little sad that the issue has not received more attention in research and that often just trivial examples are given that cannot be used in real-world problems. Without a doubt the issue is challenging, but since it can often decide the performance of a model, it should be worth the time.

Flaming Winners

Recently, we read a paper that also mentioned winner-takes-all (WTA) circuits, and since we moved from Theano to PyTorch, we wanted to give the idea a try. This type of neuron is similar to maxout, but instead of reducing the output dimensions, the dimensions are kept and filled with zeros. Thus, a layer consists of groups of neurons, and in each group only the “fittest” survives, while the others are set to zero. For example, let’s assume that we have 128 neurons and they should form 32 groups with 4 units each. In PyTorch this is done with a linear layer: wta = nn.Linear(dim_in, 32*4). Next comes the implementation of the forward step, which is straightforward. We assume that the shape of the input tensor is (batch_size, dim_in).

def forward(self, input):
 # self.wta is the linear layer from above, created in __init__
 h = self.wta(input) # projection: (batch, 32*4)
 h = h.view(-1, 32, 4) # reshape: (batch, 32, 4)
 val, _ = h.max(2) # maximal value per group: (batch, 32)
 val = val[:, :, None] # reshape: (batch, 32, 1)
 pre = val * (h >= val).type_as(h) # binary mask -> float, same dtype and device as h
 return pre.view(-1, 32*4) # reshape: (batch, 32*4)

That’s it. Definitely not rocket science, just a bit of juggling with the shape of the tensors and reshaping.
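For completeness, here is a self-contained sketch of the same idea wrapped in a module, together with a quick sanity check. The class name WTALayer and its parameters are our own choices for this example, and it assumes a reasonably recent PyTorch version.

```python
import torch
import torch.nn as nn


class WTALayer(nn.Module):
    """Linear projection followed by winner-takes-all within each group."""

    def __init__(self, dim_in, n_groups=32, group_size=4):
        super().__init__()
        self.n_groups = n_groups
        self.group_size = group_size
        self.proj = nn.Linear(dim_in, n_groups * group_size)

    def forward(self, x):
        h = self.proj(x).view(-1, self.n_groups, self.group_size)
        winners, _ = h.max(dim=2, keepdim=True)   # (batch, n_groups, 1)
        mask = (h >= winners).to(h.dtype)         # 1 for the winner, 0 otherwise
        return (h * mask).view(-1, self.n_groups * self.group_size)


torch.manual_seed(0)
layer = WTALayer(dim_in=20)
out = layer(torch.randn(8, 20))
survivors = (out.view(8, 32, 4) != 0).sum(dim=2)  # non-zero units per group
print(out.shape, survivors.max().item())
```

Building the mask from h itself keeps everything on the same device, so the layer works unchanged on CPU and GPU. Note that in the unlikely case of an exact tie, all tied units in a group survive.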

PyTorch – Weight Decay Made Easy

In PyTorch the implementation of the optimizer does not know anything about neural nets, which means it is possible that the current settings also apply L2 weight decay to bias parameters. In general this is not done, since those parameters are less likely to overfit. Furthermore, the decay should also not be applied to parameters with a one-dimensional shape, meaning the parameter is a vector and not a matrix, which is quite often the case for normalization modules like batch-norm, layer-norm or weight-norm. So, how can we tell the optimizer in a principled way to set the decay of those parameters to zero?

With the introduction of the function named_parameters(), we also get a name along with the parameter value. For standard layers, biases are named as “bias” and combined with the shape, we can create two parameter lists, one with weight_decay and the other without it. Furthermore, we can easily use a skip_list to manually disable weight_decay for some layers, like embedding layers. The code is pretty simple:

def add_weight_decay(net, l2_value, skip_list=()):
 decay, no_decay = [], []
 for name, param in net.named_parameters():
  if not param.requires_grad:
   continue # frozen weights
  # no decay for vectors (biases, norm parameters) and for names in skip_list
  if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
   no_decay.append(param)
  else:
   decay.append(param)
 return [{'params': no_decay, 'weight_decay': 0.}, {'params': decay, 'weight_decay': l2_value}]

and the returned list is passed to the optimizer:

params = add_weight_decay(net, 2e-5)
sgd = torch.optim.SGD(params, lr=0.05)
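To see which parameters end up in which group, the following self-contained sketch repeats the function and applies it to a small toy network. The network is made up for this example, and the sketch assumes a recent PyTorch version.

```python
import torch
import torch.nn as nn


def add_weight_decay(net, l2_value, skip_list=()):
    decay, no_decay = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad:
            continue  # frozen weights
        if param.dim() == 1 or name.endswith(".bias") or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [{"params": no_decay, "weight_decay": 0.0},
            {"params": decay, "weight_decay": l2_value}]


# Hypothetical toy network: two linear layers with batch norm in between.
net = nn.Sequential(nn.Linear(10, 20), nn.BatchNorm1d(20), nn.Linear(20, 2))
sgd = torch.optim.SGD(add_weight_decay(net, 2e-5), lr=0.05)

for group in sgd.param_groups:
    print(group["weight_decay"], [tuple(p.shape) for p in group["params"]])
```

Only the two weight matrices of the linear layers receive the decay; the biases and the batch-norm parameters end up in the group without it. Note that skip_list is matched against the full parameter name as returned by named_parameters(), so an entry has to look like "0.weight" for this toy network.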

That’s it. The behavior is documented, but we still think it’s a good idea to give an example, since in frameworks specialized in neural nets the default behavior might be different. Furthermore, the method is straightforward but requires some knowledge of the internals, and an example hopefully helps to better understand the process.