PyTorch: Identifying Computational Bottlenecks

When we start with a new idea, we tend to focus on the clarity of the code rather than on the overall performance. Of course the model should not be slow as a snail, but there is often room for improvement. Still, it is more important to get it working first than to make it super fast. Once everything works, it's time to take a closer look at the code and identify possible bottlenecks.

In our case, we often calculate dot products between vectors and matrices and there are different ways to do the math. For example:
(1) torch.sum(anchor * examples, 1) # shape: (1, dim) x (n, dim)
(2) examples.mm(anchor.view(-1, 1)) # shape: (n, dim) x (dim, 1)
Function-wise there is not much overhead in either method, but after some profiling we found that method (2) is about 40% faster than method (1). This is probably related to better hardware utilization, since (2) is expressed as a single batched matrix-vector product.
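
For reference, a minimal timing sketch of the comparison (with made-up sizes; the exact numbers depend on your hardware):

import time
import torch

n, dim = 100000, 128          # made-up sizes for illustration
anchor = torch.randn(1, dim)
examples = torch.randn(n, dim)

def timeit(fn, repeat=100):
    start = time.time()
    for _ in range(repeat):
        fn()
    return time.time() - start

t1 = timeit(lambda: torch.sum(anchor * examples, 1))   # (1) element-wise product + sum
t2 = timeit(lambda: examples.mm(anchor.view(-1, 1)))   # (2) single matrix-vector product
print("sum: %.3fs  mm: %.3fs" % (t1, t2))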

Frankly, this is nothing new, but it reminded us that for large-scale learning, choosing the right numeric formulation can save you a day or a week, or give you the opportunity to train a little longer. In our case, introducing padding reduced the runtime by almost 50%, and the batched dot product gained us another 40%.


PyTorch: Faster Embedding Lookups With Padding

There are quite a few helper functions when it comes to recurrent nets, but in our case we just wanted to speed up the forward step of a model that only uses Embedding layers. Maybe there is also a helper for our problem, but in any case it's a good idea to implement these steps manually to see how things work under the hood and to learn about possible side effects. Our setup is pretty simple: we have a batch of lists of token IDs, and the network shall return the sum of the corresponding embeddings for each sample.

The naive implementation only works if all those token lists have the same size, otherwise we are not able to build a LongTensor:

torch.LongTensor([[0, 5, 10], [3, 33, 333]])  # okay
torch.LongTensor([[0, 5, 10], [3, 33]])       # error: lists have different lengths

Since this is a common problem, the nn.Embedding module of PyTorch supports padding via the padding_idx=PAD argument. Whenever PAD is found in the input LongTensor, the corresponding output row is filled with zeros:

torch.LongTensor([3, 33, PAD]):
x_3_0 ... x_3_d
x_33_0 ... x_33_d
0 ... 0

In other words, PAD acts like a dummy embedding that does not change the gradient because no actual parameters are used. With this approach, we are able to return the aggregated embeddings (sum) for a batch of samples of different lengths, instead of forwarding each sample separately through the network.
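
To make the snippet below concrete: net can be as small as a single Embedding layer with padding_idx set, and the padded batch can be built with a tiny helper. The vocabulary size and the PAD index here are made-up values:

import torch
import torch.nn as nn
from torch.autograd import Variable  # only needed for older PyTorch versions

vocab_size, embed_dim = 1000, 10
PAD = vocab_size - 1       # assumption: the last index is reserved for padding

# Lookups of PAD return a zero row and produce no gradient for it.
net = nn.Embedding(vocab_size, embed_dim, padding_idx=PAD)

def pad_batch(token_lists):
    # Pad all token lists to the length of the longest one.
    max_len = max(len(t) for t in token_lists)
    return torch.LongTensor([t + [PAD] * (max_len - len(t)) for t in token_lists])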


batch = torch.LongTensor([
    [0, 5, 10, 15],
    [3, 33, PAD, PAD],
    [17, PAD, PAD, PAD]])
batch_emb = net(Variable(batch))  # one forward pass for the whole padded batch

We measured the runtime of both approaches and, as expected, there is a notable performance gain from batching: naive=14863 msecs vs. batched=8294 msecs, an improvement of more than 40%.

There is no magic involved; you just need to make sure that you work along the correct axis when you perform per-sample transformations. In our case, we normalized each aggregated vector (sum) so that it has unit norm.

As a last step, let’s go through an example: if we assume that embed_dim is 10 and we use batch as the input to the network, the output shape is (3, 4, 10), which means we have 3 samples, each with 4 embeddings of 10 dimensions. Now we want to calculate the sum of the embeddings for each sample in the batch: batch_emb_sum = torch.sum(batch_emb, 1), with a resulting shape of (3, 10), and finally the normalization step: batch_emb_final = batch_emb_sum / batch_emb_sum.norm(dim=-1, keepdim=True), and that’s it. Thanks to the padding, the zero vectors do not interfere with any of these steps, since adding zero to something does not change anything.

But we need to be careful with operations that depend on the number of elements, like torch.mean, since padding changes the effective length. To be more concrete, if a sample only has one real token but three PAD entries, its slice has shape (4, 10) and the mean would be torch.sum(x, 0) / 4, even though the last three rows do not hold any values. Thus, we need to use the actual length whenever padding is involved: actual_len = #rows - #pad_rows.
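
Continuing the snippet above (batch, batch_emb and PAD as defined earlier), a small sketch of how the actual lengths can be recovered from the PAD mask:

mask = (batch != PAD)                        # (3, 4): True for real tokens, False for PAD
actual_len = mask.sum(1).float()             # e.g. [4, 2, 1] real tokens per sample
batch_emb_mean = torch.sum(batch_emb, 1) / actual_len.unsqueeze(1)   # correct per-sample mean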

Ask Your Neighbors For Advice

Since we have a rather unusual setup, ordinary classification often delivers unsatisfying performance. We tried to address the issue with a large-margin approach that uses a triplet loss to model local neighborhoods, but even this approach fails for some kinds of data. To put it simply, samples with a rare combination of features might not find their place in the learned embedding space and thus probably end up at some “random” position. To guide the way of those little rascals, we introduced a memory component as in [arxiv:1703.03129].

First, the memory is filled uniformly with samples of different labels. After that, for each query we find the nearest neighbor with the same label, but also the nearest one with a different label. The idea is to ensure that memory slots with different labels are well separated in the embedding space. Since we do not backprop through the memory slots, we adjust the embedding parameters of the query to enforce the margin. This is done with a simple triplet loss:
max(0, margin + dot(query, nn_neg) - dot(query, nn_pos))
where all data is unit-normed.

The memory slot with the matching label is then updated:
mem[idx_pos] = l2_norm(query + mem[idx_pos])
where idx_pos is the position of the memory slot.
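
A condensed sketch of both steps (nearest-neighbor lookup plus triplet loss, followed by the slot update). The helper name and the masking details are ours; it assumes unit-normed rows in mem and a unit-normed query, and that mem itself is a plain tensor that is not part of the optimizer:

import torch
import torch.nn.functional as F

def memory_step(query, mem, mem_labels, label, margin=0.2):
    # query: (dim,) unit-normed embedding that still requires grad
    # mem: (slots, dim) unit-normed memory, mem_labels: (slots,) LongTensor
    sims = mem.mv(query)                     # cosine scores against all slots
    pos_mask = (mem_labels == label)
    idx_pos = sims.masked_fill(~pos_mask, float('-inf')).argmax()   # nearest slot, same label
    idx_neg = sims.masked_fill(pos_mask, float('-inf')).argmax()    # nearest slot, other label
    # Triplet loss: only the query's embedding parameters receive gradients.
    loss = F.relu(margin + sims[idx_neg] - sims[idx_pos])
    # Update the matching slot without backprop: move it towards the query
    # and re-normalize it to unit length.
    with torch.no_grad():
        mem[idx_pos] = F.normalize(query + mem[idx_pos], dim=0)
    return loss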

The argument for why this improves performance is similar to the one in the paper: the additional memory helps to remember combinations of features that are rarely seen, and is thus often able to infer the correct label even if the embedding has not “settled” yet.

Furthermore, the memory can also help to improve the embedding space by concatenating embeddings of samples and slots, which leads to a cleaner separation of the class boundaries: h_new = hidden || nn(mem, hidden), where nn(mem, hidden) is the nearest memory slot for the hidden vector.

Still, there are quite a few questions that need to be addressed by further research:
– The more a memory slot gets updated, the more likely it is to be close to an arbitrary query. This will likely lead to lots of orphaned memory slots. How can we ensure that distinct feature combinations won’t get mixed into those few dominant slots?

– Shall we introduce some non-determinism to add noise and improve the utilization of the memory (to better preserve rare patterns)?

– Shall the memory be a circular buffer, or shall we average memory slots so that they converge to a stable state?

As long as there is no way to learn reliably from sparse input data in an unsupervised way, we believe that external memory is a promising path to pursue. However, even with this adjustment there are still lots of challenges to address before we can come up with a working solution.

Don’t Push Things Into Neural Corners

The last year was quite a ride, with lots of new ideas, controversial discussions and real-world examples showing that large-scale machine learning is more than a flash in the pan. However, we still have the feeling that more basic research should be done. For instance, a while ago we stumbled upon an excellent blog post[1] that explains neural nets in a very descriptive but nevertheless formal way. Since manifolds are a very powerful tool to explain what the representation a net has learned looks like, we tried to find similar, but introductory, literature. It was a bit surprising that we did not find much. Maybe we didn’t try hard enough, but it’s more likely that relevant literature is buried somewhere and cannot easily be accessed through search engines, or, in the worst case, there is not much at all.

Nevertheless, the post inspired us to rethink the way we tackle our current problems. For instance, we try to build a preference model that ranks items according to the known preferences of users. But instead of simply classifying new items, we would like to capture latent topics in known item pairs to be able to perform something like a k-NN search at test time, to better explain why a user might like an item or not. The reason we are working on alternatives is that the input space is very sparse and heterogeneous, so generalization often fails. In other words, we get good scores for train/test, but unseen items still land on the wrong side of the decision boundary way too often.

To get a better understanding of what’s going wrong, we worked on ways to visualize decision boundaries. It is not hard to believe that a hyperplane is not always the best way to support fine-grained classification. One problem is that the +1 items are not really homogeneous, like a single image category, but might contain items from very different categories, and the same is true for the -1 items. Thus, it might be much easier for the network to learn a local neighborhood instead. As a baseline we could use k-NN, but it is well known that the accuracy of that algorithm highly depends on the feature representation, and since our input space consists of highly entangled data, it does not work in the original space.

As a result, we need a neural net to disentangle the representation with some loss function, and then we can use k-NN to perform a prediction that is more robust than an end-to-end classification. For example, let’s assume that we trained a classification model and we get a positive item x that lies on the wrong side of the decision boundary. However, there is also a wrongly classified positive item x’ that is pretty close to x, and the nearest negative item x” is farther away than x’: dist(x, x”) > dist(x, x’). Therefore, while the classifier would label x as -1, a 1-NN model would still get it right because we consider the neighborhood of x.

Of course it’s not always that trivial, but instead of pushing every category into a single corner of the feature space, we can make the life of the neural net easier by just separating positive and negative items by a fixed margin. This is nothing more than the good old triplet loss, but we still have one challenge to address!

The problem is that not all +1 items come from the same distribution. Thus, if we just sample uniformly, we likely sample items with different topics and force them to be close together in the feature space. This still works, but in the end we probably just learn a representation that is linearly separable again. So the question is how to preserve the local neighborhood within each +1/-1 category. We could cheat by using tags or meta-data, or we could extract topics with an NMF.

Of course the new expressive power won’t come for free. The beauty of a classifier is that we can make a prediction by just feeding the item to the network and reading off its score. With a k-NN-based method, we also need to store labeled training data and perform a lookup each time a new item needs to be labeled. But depending on how much labeled training data is required for a good score, the overhead is often negligible, since we just need to store a matrix of size (N x dim) and perform a single matrix multiplication, followed by an argsort. And if all items have unit norm, this corresponds to the cosine score.
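
A rough sketch of such a lookup under these assumptions (X is the stored (N x dim) matrix of unit-normed labeled items; the function name and the majority vote are ours):

import torch

def knn_predict(query, X, labels, k=5):
    # query: (dim,) unit-normed embedding of a new item
    # X: (N, dim) unit-normed embeddings of the stored labeled items
    # labels: (N,) LongTensor with entries in {-1, +1}
    scores = X.mv(query)                              # cosine scores: one matrix-vector product
    top_idx = scores.argsort(descending=True)[:k]     # indices of the k nearest neighbors
    return torch.sign(labels[top_idx].float().sum())  # majority vote of the neighborhood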

Of course we are not done here, but we have a strong feeling that pursuing this direction will lead somewhere, even if it takes a lot of steps to get there.

[1] colah.github.io/posts/2014-03-NN-Manifolds-Topology/

Challenges With Real-World Embeddings

To relate TV titles that come from the electronic program guide (EPG), we decided to train an embedding that directly optimizes on “sentence-level” instead of just relating words, like word2vec. This is in the spirit of StarSpace [arxiv:1709.03856], which is a fairly simple approach but nevertheless a strong baseline.

The idea for the training is straightforward and uses a leave-one-out approach. We use either existing or manually annotated categories for weak supervision. Then we sample N positive items from one category and use N-1 of them to predict the Nth item. Negative items are sampled from other categories. A title is encoded as the sum of the embeddings of the words it contains: np.sum(E[word_id_list], 0) (bag-of-words). All vectors are normalized to lie on the unit ball. Next, we combine all bag-of-words vectors into a bag-of-documents: np.mean(V, 0), where V is a matrix with #title rows and #dim columns.

The loss is the well-known triplet loss: np.maximum(0, margin - pos + neg), where pos = np.dot(bag_items, nth_item), neg = np.dot(bag_items, neg_item) and margin is a hyper-parameter (0.2 by default). The idea is not to learn a hyperplane to separate classes, but to move related items closer together and push unrelated items further away. The learning stops when positive and negative pairs are separated by at least the given margin. The performance of such models largely depends on the sampling of positive and negative items, but that is not the concern of this post.
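
A compact numpy sketch of the encoding and the loss as described above (the function and argument names are made up; in the real model the loss is of course computed by the framework so that gradients can flow):

import numpy as np

def l2_norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def encode_title(E, word_id_list):
    # Bag-of-words: sum the word embeddings and project back onto the unit ball.
    return l2_norm(np.sum(E[word_id_list], 0))

def triplet_loss(E, context_titles, nth_title, neg_title, margin=0.2):
    # Bag-of-documents: mean over the N-1 encoded positive titles.
    V = np.stack([encode_title(E, t) for t in context_titles])
    bag_items = np.mean(V, 0)
    pos = np.dot(bag_items, encode_title(E, nth_title))
    neg = np.dot(bag_items, encode_title(E, neg_title))
    return np.maximum(0, margin - pos + neg)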

In contrast to titles of books, and to some degree movies, generic TV titles belonging to shows, reports or entertainment are very heterogeneous, with lots of special characters, and they can also include meta information. Therefore, we need a more sophisticated tokenizer to convert titles into “words”, but we also need to address the issue of rare “words”. The long tail is always a problem for text, but in our case the domain is more like tweets with special emoticons and/or hashtags than traditional text. Throwing away those “words” is no solution, which is why we need to adjust the learning scheme.

In word2vec, down-sampling of frequent words is used, but this does not really address our problem: we do not want to dampen the learning signal for frequent “words”, we want to boost the signal for rare “words”. That is why we decided to scale the gradients with the inverse frequency of the words. The procedure just requires a fixed lookup table ID -> WEIGHT, which is easy to implement.
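
The exact weighting is not important for the idea; as one plausible sketch, the table could be built once from corpus counts like this (the median normalization and the clipping are our assumptions):

import numpy as np

def build_inv_freq_lookup(word_counts, num_words, max_weight=10.0):
    # word_counts: dict word_id -> corpus frequency; unseen ids keep weight 1.0
    lookup = np.ones(num_words, dtype=np.float32)
    median = float(np.median(list(word_counts.values())))
    for word_id, count in word_counts.items():
        # Weight relative to the median frequency, clipped so that extremely
        # rare words do not blow up the gradients.
        lookup[word_id] = min(median / count, max_weight)
    return lookup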

The necessity of the procedure became obvious when we checked the results of our first model. We took an arbitrary title and used the cosine score to rank all other titles. The results looked promising, but from time to time there were outliers and we wanted to find out why. We started by removing single “words” from the offending title and repeated the ranking. We found out that the problems were often related to rare words that did not get many weight updates and whose position in the embedding space is therefore somewhat “arbitrary”. When such a “word” was removed, the cosine score dropped dramatically. This also held for other titles.

Thanks to PyTorch, the implementation of re-scaling the gradients was very easy:

def scale_grad_by_freq(parameters, lookup):
    # Only consider parameters that received a (sparse) gradient in this step.
    parameters = list(filter(lambda p: p.grad is not None, parameters))
    for p in parameters:
        g = p.grad.data          # sparse gradient of the embedding weight
        grads = g._values()      # gradient rows of the embeddings that were touched
        # Scale each gradient row by the weight of its word id.
        for j, i in enumerate(g._indices().view(-1)):
            grads[j].mul_(lookup[i])

The multiplication is done in-place, and thanks to the sparse flag of the PyTorch Embedding module, we only re-scale a small subset of all embeddings. With this minor modification, a loss that involves rare “words” leads to a stronger error signal, which partly compensates for the fact that those “words” get fewer updates. This is not the holy grail, but a good example of how a deeper understanding of a problem can reduce the level of frustration and give you more time to enhance or tune your model.
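
For completeness, a minimal usage sketch: the call sits between backward() and the optimizer step, and the Embedding layer needs the sparse flag (sizes are made up, the loss is model-specific):

import torch
import torch.nn as nn

emb = nn.Embedding(50000, 100, sparse=True)     # sparse flag -> sparse gradients
optimizer = torch.optim.SGD(emb.parameters(), lr=0.1)

def train_step(loss, lookup):
    # One update: re-scale the sparse gradients between backward() and step().
    optimizer.zero_grad()
    loss.backward()
    scale_grad_by_freq(emb.parameters(), lookup)
    optimizer.step()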

Backtrack But Not Backwards

A recent talk at NIPS 2017 underlined that despite all the progress we made in the last years, we still know very little about how things work internally. The question is: what do you do when you encounter a strange problem while training your model? Do you just switch to another optimizer/architecture/hyper-parameter, or do you try to find the root cause of the problem? With all the nice and publicly available ML stuff out there, it is tempting to just try all these things, and if one does not work, to try the next one. At the end of the day, your model might be powerful enough to solve the problem at hand, but it is also very likely that it is just a black box you don’t fully understand, and if the system stops working, you need to search for a new model again.

The talk also emphasized that we need more well-understood building blocks which can be combined to tackle more complex problems, instead of just plugging “mythical” things into our networks that make them “magically” work. In other words, we should focus more on basic experiments to better understand existing building blocks, which includes spending more time on proving why things work instead of just saying they do because the error rate goes down, while no one cares to explain exactly why.

This is the kind of backtracking you do when you are stuck because your model won’t work. If you just switch your architecture, you won’t get any new insights, and if the problem occurs again, you need to switch again. The process of really understanding what is going on can be extremely painful and probably needs lots of resources, but in the end it pays off, since you can use the knowledge to build better models and to focus on new problems, instead of just plugging black boxes together and hoping that they eventually work.

Like a long journey, it starts with a single step, and at the beginning there might be no light at the end of the tunnel, but if you don’t give up, you will eventually figure out how to place the next piece of the puzzle, and then the next one, and so forth. In the end you will see much more of the whole picture, even if it takes a very long time.

IdeaPad 720s – Machine Learning For The Road

Even if the GPU is a mobile version, the performance gain compared to a CPU is very noticeable. As a result, it makes a lot of sense to have a dedicated GPU in your notebook if you buy a new one. This allows you to play with more complex models, or even to train them, while you are traveling, when you might not have easy access to servers with GPU cards. And even if you just want to use a pre-trained model for feature extraction, the GPU can save you a lot of time.

There is always the option to use a gamer’s notebook, but if you want a lightweight companion there are far fewer options. Our choice was the IdeaPad 720s with a 14″ screen, because it is lightweight but still powerful, with enough RAM and a dedicated GeForce 940MX GPU with non-shared memory. Without a doubt this is no high-end configuration, but the CUDA capabilities are sufficient to run older nets, or to design your own one, be it a ConvNet or an RNN. Plus, with the huge SSD, you can train on pretty large training sets with better I/O performance than SATA disks.

So much for the theory, but now comes the reality. Especially for newer or more exotic notebooks, installing Linux is not always trivial. To spare others the pain, we summarize the steps we took to get it running. It’s still not perfect, there are minor problems with the WLAN, but we have used it for quite some hours now without any problems and successfully tested PyTorch + CUDA.

The first thing you have to do is to switch from “UEFI” to “Legacy Support”, but that’s nothing new. You can enter the BIOS by pressing F2 during boot, or use the Novo “button” on the left side of the notebook. If this worked, you should shrink the NTFS volume, which is pretty straightforward, to make room for a real OS. Just halve the size, so you get about ~250 GB for Linux. After all settings have been adjusted, we can start with the Linux installation. Lubuntu seems a good choice since it is also lightweight, but comes with excellent support for detecting more exotic hardware. Make sure you have chosen the correct boot order, so you can boot from a USB stick/DVD drive.

Long story short, the installation failed miserably because the SSD drive was not recognized. But there is no reason to panic! Thanks to the active community, we found the answer to that problem pretty quickly: you have to switch the “SATA Controller Mode” from “RAID” to “AHCI” in the BIOS. With the new setting, it was possible to create an ext4 + swap partition in the free space of the SSD. Then the actual installation could be done without any problems. Merely the GRUB installation seems not optimal, since we get no boot screen and thus don’t know if our Windows partition was correctly recognized. From the GRUB config it does not seem so, but this is not our major concern, since our focus is a working Linux system with GPU support. So we are blind at startup, but since Linux starts correctly we do not investigate this any further right now.

The next step is to get PyTorch working, which was no problem at all. We used pip with Python 2.7 + CUDA 8.0 and it worked like a charm. Only torchvision failed, but we solved it with “pip install --no-deps torchvision”, since one of its dependencies is still pytorch 0.1.2. A quick test with ipython confirmed that everything is okay and working. The last step is the installation of the CUDA toolkit, which was also no problem thanks to the apt sources we just had to uncomment in the sources.list file. After “apt-get update” we installed the CUDA toolkit packages and all their dependencies. Since CUDA requires a kernel module that is compiled at the end of the installation, a restart is required. To check if the setup was done correctly, start “nvidia-smi” after the reboot and see if the device is listed there.

After we got a prompt again, we downloaded a pre-trained network from the model collection and hacked some code to perform an image classification. Compared to the early days of ConvNets, even the CPU version was pretty “fast”. Next, we checked that CUDA is correctly recognized by PyTorch, and after that we moved the model to the GPU, as well as the tensor we use for classification. As we mentioned at the beginning of the post, the performance boost was clearly visible, and except for the first call, which triggered some background activity to set up CUDA, everything went smoothly.
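
The sanity check was essentially the following (sketched here against a more recent PyTorch API than the 0.x version we used back then; resnet18 simply stands in for whichever pre-trained model you downloaded):

import torch
import torchvision.models as models

print(torch.cuda.is_available())         # True once driver and toolkit are set up

model = models.resnet18(pretrained=True).cuda().eval()
x = torch.randn(1, 3, 224, 224).cuda()   # dummy "image" moved to the GPU
with torch.no_grad():
    scores = model(x)
print(scores.argmax().item())            # predicted class index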

Bottom line, here is the check list again to enjoy the notebook with Linux:
– Shrink the size of your NTFS volume by 50%
– Switch from “UEFI” to “Legacy Support”
– Switch the “SATA Controller Mode” from “RAID” to “AHCI”
Since there is still room for improvement, we might create a follow-up post with additional details.