
Challenges With Real-World Embeddings

To relate TV titles that come from the electronic program guide (EPG), we decided to train an embedding that directly optimizes at the “sentence” level instead of just relating words, as word2vec does. This is in the spirit of StarSpace [arxiv:1709.03856], a fairly simple approach that is nevertheless a strong baseline.

The idea for the training is straightforward and uses a leave-one-out approach. We use either existing or manually annotated categories for weak supervision. Then we sample N positive items from one category and use N-1 of them to predict the Nth item. Negative items are sampled from other categories. A title is encoded as the sum of the embeddings of all the words it contains: np.sum(E[word_id_list], 0) (bag-of-words). All vectors are normalized to lie on the unit ball. Next, we combine all bag-of-words vectors into a bag-of-documents: np.mean(V, 0), where V is a matrix with #titles rows and #dim columns.
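To make the encoding concrete, here is a small sketch; the helper names and the toy embedding matrix are ours, not from the actual pipeline:

```python
import numpy as np

def encode_title(word_ids, E):
    """Bag-of-words: sum the word embeddings, then normalize onto the unit ball."""
    v = np.sum(E[word_ids], 0)
    return v / np.linalg.norm(v)

def encode_bag(titles, E):
    """Bag-of-documents: average the encoded titles (one row per title)."""
    V = np.stack([encode_title(t, E) for t in titles])  # (#titles, #dim)
    return np.mean(V, 0)

rng = np.random.RandomState(0)
E = rng.randn(4, 3)  # toy embedding matrix: 4 words, 3 dimensions
doc = encode_bag([[0, 1], [2, 3]], E)  # two titles with two words each
```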

The loss is the well-known triplet loss: np.maximum(0, margin - pos + neg), where pos = np.dot(bag_items, nth_item), neg = np.dot(bag_items, neg_item), and margin is a hyper-parameter (0.2 by default). The idea is not to learn a hyperplane to separate classes, but to move related items closer together and to push unrelated items further away. The learning stops when positive and negative pairs are separated by at least the given margin. The performance of such models largely depends on the sampling of positive and negative items, but this is not the concern of this post.
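For concreteness, the loss on a single triplet can be sketched like this; the vectors are toy values, not real title encodings:

```python
import numpy as np

def triplet_loss(bag_items, nth_item, neg_item, margin=0.2):
    """Hinge loss on cosine scores: positives must beat negatives by `margin`."""
    pos = np.dot(bag_items, nth_item)  # similarity to the held-out positive
    neg = np.dot(bag_items, neg_item)  # similarity to the negative sample
    return np.maximum(0, margin - pos + neg)

bag = np.array([1.0, 0.0])
loss_easy = triplet_loss(bag, np.array([1.0, 0.0]), np.array([0.0, 1.0]))
loss_hard = triplet_loss(bag, np.array([0.0, 1.0]), np.array([1.0, 0.0]))
```

For the easy triplet the margin is already satisfied, so the loss is zero and no gradient flows; the hard triplet produces a loss of margin + 1.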

In contrast to titles of books, and to some degree movies, generic TV titles, belonging to shows, reports, or entertainment, are very heterogeneous, contain lots of special characters, and can also include meta information. Therefore, we need a more sophisticated tokenizer to convert titles into “words”, but we also need to address the issue of rare “words”. The long tail is always a problem for text, but in our case the domain resembles tweets, with special emoticons and/or hashtags, more than traditional text. Throwing those “words” away is no solution, which is why we need to adjust the learning scheme.

In word2vec, down-sampling of frequent words is used, but this does not really address our problem: we do not want to dampen the learning signal for frequent “words”, we want to boost the signal for rare “words”. That is why we decided to scale the gradients with the inverse frequency of the words. The procedure just requires a fixed lookup table ID -> WEIGHT, which is easy to implement.
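One simple way to build such a table is shown below; the exact weighting scheme is not spelled out in this post, so treat the inverse relative frequency here as an illustration:

```python
from collections import Counter

def build_freq_lookup(corpus):
    """Map word ID -> inverse relative frequency: rare words get larger weights."""
    counts = Counter(w for title in corpus for w in title)
    total = float(sum(counts.values()))
    return {w: total / c for w, c in counts.items()}

corpus = [[0, 1, 1], [1, 2]]  # word 1 is frequent, word 2 is rare
lookup = build_freq_lookup(corpus)
```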

The necessity of the procedure became obvious when we checked the results of our first model. We took an arbitrary title and used the cosine score to rank all other titles. The results looked promising, but from time to time there were outliers, and we wanted to find out why. We started by removing single “words” from the offending title and repeating the ranking. We found that the problem was often related to rare words that did not get many weight updates, so their position in the embedding space was somewhat arbitrary. When such a “word” was removed, the cosine score dropped dramatically. The same pattern showed up for other titles.

Thanks to PyTorch, the implementation of re-scaling the gradients was very easy:

def scale_grad_by_freq(parameters, lookup):
 parameters = list(filter(lambda p: p.grad is not None, parameters))
 for p in parameters:
  g = p.grad.data # sparse gradient of the embedding
  grads = g._values() # one row per touched embedding
  for j, i in enumerate(g._indices().view(-1)): grads[j].mul_(lookup[i])

The multiplication is done in place, and thanks to the sparse flag of the PyTorch Embedding module, we only re-scale a small subset of all embeddings. With this minor modification, a loss that involves rare “words” leads to a stronger error signal, which partly compensates for the fact that those “words” get fewer updates. This is not the holy grail, but a good example of how a deeper understanding of a problem can minimize the level of frustration and give you more time to enhance or tune your model.
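To see the effect on a toy embedding (the lookup values are made up, and we coalesce the sparse gradient here so that _indices()/_values() are well-defined):

```python
import torch
import torch.nn as nn

emb = nn.Embedding(10, 4, sparse=True)  # sparse=True -> only used rows get gradients
lookup = {i: 1.0 + i for i in range(10)}  # made-up ID -> inverse-frequency weight

emb(torch.tensor([2])).sum().backward()  # touches only row 2

g = emb.weight.grad.coalesce()  # sparse gradient with a single row
vals = g._values()
for j, i in enumerate(g._indices().view(-1)):
    vals[j].mul_(lookup[int(i)])  # in-place re-scale, as in scale_grad_by_freq
```

The gradient of the sum with respect to row 2 is all ones, so after re-scaling the row holds exactly the weight lookup[2].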


IdeaPad 720s – Machine Learning For The Road

Even if the GPU is a mobile version, the performance gain compared to a CPU is very noticeable. As a result, it makes a lot of sense to have a dedicated GPU in your notebook if you buy a new one. This allows you to play with more complex models, or even to train them, while you are traveling, which might mean that you don’t have easy access to servers with GPU cards all the time. And even if you just want to use a pre-trained model for feature extraction, you can save a lot of time by using the GPU.

There is always the option to use a gaming notebook, but in case you want a lightweight companion, there are far fewer options. Our choice was the IdeaPad 720s with a 14″ screen, because it is lightweight but still powerful, with enough RAM and a dedicated GeForce 940MX GPU that comes with non-shared memory. Without a doubt this is no high-end configuration, but the CUDA capabilities are sufficient to run older nets, or to design your own, be it a ConvNet or an RNN. Plus, with the huge SSD, you can train on pretty large training sets with better I/O performance than SATA disks offer.

So much for the theory, but now comes the reality. Especially for newer or more exotic notebooks, installing Linux is not always trivial. To spare others the pain, we summarize the steps we took to get it running. It’s still not perfect, there are minor problems with the WLAN, but we have used it for quite a few hours now without any problems and successfully tested PyTorch + CUDA.

The first thing you have to do is to switch from “UEFI” to “Legacy Support”, but that’s nothing new. You can enter the BIOS by pressing F2 during boot, or use the Novo “button” on the left side of the notebook. If this worked, you should shrink the NTFS volume, which is pretty straightforward, to make room for a real OS. Just halve the size, so you get about ~250 GB for Linux. After all settings were adjusted, we could start with the Linux installation. Lubuntu seems like a good choice, since it is also lightweight but comes with excellent support for detecting more exotic hardware. Make sure you have chosen the correct boot order, so you can boot from a USB stick/DVD drive.

Long story short, the installation failed miserably because the SSD drive was not recognized. But there is no time for panic! Thanks to the active community, we found the answer to that problem pretty fast. You have to switch the “SATA Controller Mode” from “RAID” to “AHCI” in the BIOS. With the new setting, it was possible to create an ext4 + swap partition in the free space of the SSD. Then, the actual installation could be done without any problems. Only the GRUB installation seems suboptimal, since we get no boot screen and thus don’t know whether our Windows partition was correctly recognized. From the GRUB config it does not seem so, but this is not our major concern, since our focus is a working Linux system with GPU support. So, we are blind at startup, but since Linux starts correctly, we are not investigating this any further right now.

The next step was to get PyTorch working, which was no problem at all. We used pip, Python 2.7 + CUDA 8.0, and it worked like a charm. Only torchvision failed, but we solved it with “pip install --no-deps torchvision”, since one of its dependencies is still pytorch 0.1.2. A quick test with ipython confirmed that everything is okay and working. The last step was the installation of the CUDA toolkit, which was also no problem thanks to the apt sources we just had to uncomment in the sources.list file. After “apt-get update”, we installed the CUDA toolkit packages and all their dependencies. Since CUDA requires a kernel module that is compiled at the end of the installation, a restart is required. To check if the setup was done correctly, start “nvidia-smi” after the reboot and see if the device is listed there.

After we got a prompt again, we downloaded a pre-trained network from the model collection and hacked some code to perform an image classification. Compared to the early days of ConvNets, even the CPU version was pretty “fast”. Next, we checked that CUDA is correctly recognized by PyTorch, and after that, we moved the model to the GPU, along with the tensor we use for classification. As we mentioned at the beginning of the post, the performance boost was very visible, and except for the first call, which triggered some background activity to set up CUDA, everything went smoothly.

Bottom line, here is the checklist again to enjoy the notebook with Linux:
– Shrink the size of your NTFS volume by 50%
– Switch from “UEFI” to “Legacy Support”
– Switch the “SATA Controller Mode” from “RAID” to “AHCI”
Since there is still room for improvements, we might create a successor blog post with additional details.

PyTorch 0.3 – An Early Xmas Gift

This year, Santa is a little early, but the gifts are nonetheless impressive. In the last days, we hacked on a model to predict whether a text sequence belongs to a certain type or not, and we found that if we process the text not from left to right but in reverse order, the model learns much better and faster. The idea is not new and was already introduced in neural translation models. Still, it’s amazing that such a small modification has such a huge impact.
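The modification itself is trivial, since reversing the token order before feeding the model is all it takes (a sketch, with strings standing in for token IDs):

```python
def reverse_tokens(tokens):
    """Return the sequence in reverse order for right-to-left processing."""
    return tokens[::-1]

reversed_seq = reverse_tokens(["the", "cat", "sat"])
```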

Next, we heard about the new PyTorch release and that it might also bring gifts, for instance performance boosts along with other nice goodies. So we updated, still from source, since the provided pre-compiled packages do not work for us, and ran a short test. The results were pretty stunning: a cycle now takes about 5 seconds less than with version 0.2.0.

In total, from a machine-learning perspective, we are pretty happy with the gifts Santa brought us, and with the experiences we have made so far, we encourage everybody to update their PyTorch version and feel the machine-learning Christmas spirit too.

Flaming Winners

Recently, we read a paper that also mentioned winner-takes-all (WTA) circuits, and since we moved from Theano to PyTorch, we wanted to give the idea a try. This type of neuron is similar to maxout, but instead of reducing the output dimensions, the dimensions are kept and filled with zeros. Thus, a layer consists of groups of neurons, and in each group only the “fittest” survives, while the others are set to zero. For example, let’s assume that we have 128 neurons and they should form 32 groups with 4 units each. In PyTorch this is done with a linear layer: wta = nn.Linear(dim_in, 32*4). Next comes the implementation of the forward step, which is straightforward. We assume that the shape of the input tensor is (batch_size, dim_in).

def forward(self, input):
 h = wta(input) # projection: (batch, 32*4)
 h = h.view(-1, 32, 4) # reshape: (batch, 32, 4)
 val, _ = h.max(2) # maximal value per group: (batch, 32)
 val = val[:, :, None] # reshape: (batch, 32, 1)
 pre = val * (h >= val).type(torch.FloatTensor) # keep only the winners
 return pre.view(-1, 32*4) # reshape: (batch, 32*4)

That’s it. Definitely not rocket science, just a bit of juggling with the shape of the tensors and reshaping.
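Wrapped into a module, the whole thing looks like this; the input dimension and the group sizes here are just our toy setup:

```python
import torch
import torch.nn as nn

class WTA(nn.Module):
    """Winner-takes-all layer: per group of units, only the max survives."""
    def __init__(self, dim_in, groups=32, units=4):
        super(WTA, self).__init__()
        self.groups, self.units = groups, units
        self.wta = nn.Linear(dim_in, groups * units)

    def forward(self, input):
        h = self.wta(input)                      # (batch, groups*units)
        h = h.view(-1, self.groups, self.units)  # (batch, groups, units)
        val, _ = h.max(2)                        # maximal value per group
        mask = (h >= val[:, :, None]).float()    # 1 for winners, 0 for losers
        return (h * mask).view(-1, self.groups * self.units)

torch.manual_seed(0)
out = WTA(16)(torch.randn(8, 16))  # (8, 128), one non-zero unit per group
```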

PyTorch – Weight Decay Made Easy

In PyTorch, the implementation of the optimizer does not know anything about neural nets, which means it is possible that the current settings also apply L2 weight decay to bias parameters. In general this is not done, since those parameters are less likely to overfit. Furthermore, the decay should also not be applied to parameters with a one-dimensional shape, meaning the parameter is a vector and not a matrix, which is often the case for normalization modules like batch norm, layer norm, or weight norm. So, how can we tell the optimizer in a principled way to set the decay of those parameters to zero?

With the introduction of the function named_parameters(), we also get a name along with the parameter value. For standard layers, biases are named “bias”, and combined with the shape, we can create two parameter lists, one with weight decay and one without. Furthermore, we can easily use a skip_list to manually disable weight decay for some layers, like embedding layers. The code is pretty simple:

def add_weight_decay(net, l2_value, skip_list=()):
 decay, no_decay = [], []
 for name, param in net.named_parameters():
  if not param.requires_grad: continue # frozen weights
  if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list: no_decay.append(param)
  else: decay.append(param)
 return [{'params': no_decay, 'weight_decay': 0.}, {'params': decay, 'weight_decay': l2_value}]

and the returned list is passed to the optimizer:

params = add_weight_decay(net, 2e-5)
sgd = torch.optim.SGD(params, lr=0.05)

That’s it. The behavior is documented, but we still think it’s a good idea to give an example, since in frameworks specialized in neural nets, the default behavior might be different. Furthermore, the method is straightforward but requires some knowledge of the internals, and an example hopefully helps to better understand the process.

PyTorch – Freezing Weights of Pre-Trained Layers

Back in 2006, deep nets were trained by stacking pre-trained layers until the full network was built, followed by a final fine-tuning step to adjust all network weights jointly. With the introduction of batch norm and other techniques, that has become obsolete, since now we can train deep nets end-to-end without much trouble. However, sometimes it is still beneficial to combine layers from pre-trained networks to give a network a direction in which to search for good solutions. For instance, a recurrent network could use word embeddings from an unsupervised learning step like GloVe or word2vec. Whether this makes sense surely depends on the problem, but it is still possible.

We recently encountered a problem where we need to predict multiple tags for a sequence of words. As a baseline, we tried to train a continuous bag-of-words model with a skip-gram loss, but we found the performance not satisfying, mostly because the supervised loss failed to learn a good embedding of individual words. This is a common problem, since the representation of the data depends only on the error signal, and if it goes to zero, the learning stops immediately. This helps to reduce the loss, but it might also hurt generalization, since getting the predictions for frequent items right drives the loss down faster than getting them right for items from the long tail.

So, we decided to pre-train the embedding layer unsupervised, with a huge corpus of sentences (sequences of words), and then used the embedding in a network to predict tags for them. We further decided to freeze the layer, which means the weights are not changed during learning. The reason is that we want to avoid a bias introduced by the supervised error signal. Next, we describe how this is possible in PyTorch.

Let’s assume we have a simple network:

class Network(nn.Module):
 def __init__(self, n_words, n_dim=5):
  super(Network, self).__init__()
  self.word_embed = nn.Embedding(n_words, n_dim, sparse=True)

def freeze_layer(layer):
 for param in layer.parameters():
  param.requires_grad = False

net = Network(1000)
freeze_layer(net.word_embed)

By default in PyTorch, every parameter in a module (network) requires a gradient (requires_grad=True), which makes sense, since usually we want to jointly learn all parameters of a network. However, in the case of a pre-trained layer, we want to disable backprop for this layer, which means the weights are fixed and do not get any updates during the backprop step. The freeze_layer function above does exactly that for a given layer.
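One practical detail: at least in older PyTorch versions, the optimizer complains about parameters that do not require gradients, so we hand it only the trainable ones. A self-contained sketch (the linear head is our addition, so that something is left to train):

```python
import torch
import torch.nn as nn

class Network(nn.Module):
    def __init__(self, n_words, n_dim=5):
        super(Network, self).__init__()
        self.word_embed = nn.Embedding(n_words, n_dim, sparse=True)
        self.out = nn.Linear(n_dim, 2)  # hypothetical classifier on top

net = Network(1000)
for param in net.word_embed.parameters():  # freeze the pre-trained layer
    param.requires_grad = False

# hand only the parameters that still require gradients to the optimizer
trainable = [p for p in net.parameters() if p.requires_grad]
sgd = torch.optim.SGD(trainable, lr=0.05)  # only the linear head gets updates
```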

That’s it. Pretty easy, if you know a little about the PyTorch internals. With this in mind, we can use the setting to fine-tune a network, like learning just the weights of a new classifier on top of a pre-trained network, or we can use it to combine a pre-trained layer at the bottom with a new network architecture.

Updating PyTorch

About a week ago, there was an update of the framework (0.2.0), and since we had encountered some minor problems, we decided to test the new version. For convenience, we used pip to perform the update. It should be noted that our environment is Python 2.7 with no GPU support. Since the first link did not work (no support for our environment was reported), we tried the second link, and that seemed to work. Everything seemed fine, and we could execute a trained network without any problems. However, when we tried to train our network again, we got an “illegal instruction” and the process aborted itself. We could have tried conda, but we decided to compile the source from scratch to best match our environment.

To avoid messing up a system-wide installation, we used $ python setup.py install --user. After the couple of minutes it took to compile the code, we got a ‘finished’ message and no error. We tried the test part of the network, which worked, and now, to our satisfaction, the training also worked again. So, we consider this step successful, but we have the feeling that the selected BLAS routines are a little slower than in the old version. However, we need further investigation before we can confirm this.

Bottom line, despite the coolness of the framework, an update does not seem to be straightforward for all environments with respect to the available pre-built packages. However, since building from source works like a charm on a fairly standard system, we can “always” use this option as a fallback.