PyTorch: Tackle Sparse Gradient Issues with Multi-Processing

Let’s imagine that we want to train a model that uses an embedding layer for a very large vocabulary. In online mode, each sample only touches very few words, which means the gradient is sparse because most of the words do not need to be updated at all[1].

The good news is that sparse updates already work with vanilla gradient descent and Adagrad. However, since the latter eventually decays the learning rate towards zero, we might have a problem if we need to visit a lot of samples to achieve a good score. This might be the case for our recent model that uses a triplet loss, because not every update has the same impact and relying on more recent gradient information would be more efficient. Thus, we decided *not* to use Adagrad.

As a result, we can only use stochastic gradient descent. There is nothing wrong with this optimizer, but it can take a long time to converge. We can address part of the problem with momentum, since it accelerates learning if the gradient consistently points in one direction, but it turned out that enabling momentum turns the sparse update into a dense one, which means we lose our only computational advantage.

Again, the issue is likely to be relevant for other frameworks as well, and in general sparse operations always seem to lag a little behind their dense counterparts. But since PyTorch is very young, we don’t mind challenges, as noted before.

We would really like to continue using it for our current experiment, but we also need to ensure that the optimizer is not a bottleneck in terms of training time. The good news is that the results we got so far confirm that the model is learning a useful representation. But because of the long tail, we might need more time to perform a sufficiently large number of updates for the rare words.

So, in this particular case we do not have many options, but a single good one would suffice, and indeed there seems to be a way out: asynchronous training with multi-processing. We still need to investigate the details, but PyTorch provides a drop-in replacement for Python’s multiprocessing module, torch.multiprocessing, and a quick & dirty example already seems to work.

The idea is to create the model as usual, with the only exception of calling model.share_memory() to properly share the model parameters with fork(). Then we spawn N new processes, each of which gets a handle on the model, with tied parameters, but its own optimizer and data sampler. In other words, we perform N independent trainings, but all processes update the same model parameters. The provided example code from PyTorch is pretty much runnable out of the box:


## borrowed from PyTorch ##
import torch.multiprocessing as mp

model = TheModel()
model.share_memory()  # required for async training: parameters live in shared memory

# train: function that takes "model" as its only parameter
procs = []
for _ in range(4):  # number of worker processes
    p = mp.Process(target=train, args=(model,))
    p.start()
    procs.append(p)

for p in procs:  # wait until all trainers are done
    p.join()

The training function train does not require any special code; stated differently, if you call it just once, single-threaded, it works as usual. As noted before, there are likely some issues that need to be taken care of, but a first test seemed to work without any problems.
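For completeness, here is a minimal sketch of what such a train function could look like; data_sampler and loss_fn are hypothetical placeholders for the per-process data pipeline and the loss:

import torch.optim as optim

def train(model):
    # each process creates its own optimizer and sampler, while the
    # model parameters themselves live in shared memory
    optimizer = optim.SGD(model.parameters(), lr=0.01)
    for batch, target in data_sampler():  # hypothetical data sampler
        optimizer.zero_grad()
        loss = loss_fn(model(batch), target)  # hypothetical loss function
        loss.backward()
        optimizer.step()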

Bottom line, the parallel training should help to increase the throughput and thus perform more updates, which hopefully leads to an earlier convergence of the model. We still need to investigate the quality of the trained model, along with some hyper-parameters like the number of processes, but we are confident that we will find a way to make it work.

[1]

PyTorch: Non-differentiable is Nothing

In the last weeks, we really learned to appreciate PyTorch. Not because it is flawless, but because it is very intuitive and makes your life so much easier when you need to debug something. And let’s face it, at some point your network is doing something silly and you have no idea why. Then you need to take a look under the hood, which can be a bit of a burden with symbolic variables. Furthermore, the dynamic graph approach lets you express concepts like recurrence without complicated scan functions, which is one more reason why recurrent networks and PyTorch should be best friends. Bottom line, for a beta, the framework feels very stable and, except for some numerical instabilities, we have not encountered any problems so far.

With arxiv as a hub for research papers, one has access to lots of new ideas. Not all of them are diamonds, but sometimes a paper contains exactly the hint you need to complete a problem, or at least to evaluate an idea or use it as a basis. Then it is extremely useful if you can build a rapid prototype. Even with Theano this was quite possible, but since the framework is a little too low-level, you had to write lots of setup code. With PyTorch and its seamless numpy integration, the setup part is almost negligible.

Furthermore, it is also possible to transfer a learned model from one framework to another, in case one has concerns about using a particular framework in production, or a particular framework is required. Thus, it is no problem to use the advantages of PyTorch for training and evaluating a model, then store the model parameters as numpy arrays and rebuild the network with Theano.
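As a sketch of how such a transfer could start, assuming model is a torch.nn.Module (the file name is arbitrary):

import numpy as np

# export every learned tensor as a numpy array, keyed by parameter name
params = {name: tensor.cpu().numpy() for name, tensor in model.state_dict().items()}
np.savez("model_params.npz", **params)
# the arrays can later be loaded with np.load() and copied into
# Theano shared variables to rebuild the network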

It is really hard to imagine evaluating new, complex networks without all the fancy frameworks, especially automatic differentiation, but also computational graphs in general. For example, the famous AlexNet was written from scratch, which was quite an achievement. In other words, there is no excuse today not to test your ideas quickly, but also systematically, with a framework like PyTorch. Yes, you need some Python and math skills, but with all the groundwork done by others, it is much faster than a few years ago.

So, the major problem, which cannot be solved by PyTorch yet, is to get your hands on a sufficiently large set of (labeled) data. However, if you have a dataset to start with, there are no limits. Try some funky loss functions, or combine them. Can you imagine a world without GANs? Not anymore. But somebody has to come up with the idea before it can be refined.

Bottom line, it has never been easier to utilize massive amounts of computational power with GPUs. There are many publicly available datasets, but also relatively easy ways to acquire labeled data, so you should at least be able to start your work. In combination with all the available frameworks, you can train very complex networks in a reasonable time, which would have been impossible more than ten years ago even with a massive cluster of machines. So, everybody with a clever idea and some luck has the opportunity to make a difference.

Think Local, Not Global

In contrast to some academic datasets, real-life datasets are often unbalanced, with a lot of noise and more variance than some models can cope with. There are ways to adapt models for those problems, but sometimes a plain classification is too restrictive. What do we mean by that? With a lot of variance for a single label, it might be possible to squeeze all samples into some corner of a high-dimensional space, but it is probably easier to learn the relative similarity of samples.

With a triplet loss, we can move two samples with the same label closer together while pushing a sample with a different label further away: maximum(0, margin + f(anchor, neg) - f(anchor, pos)). The approach has the advantage that we can distribute samples across the whole feature space by forming “local clusters” instead of concentrating them in one dense region of the space. The quality of the model depends on the sampling of the triplets, since for a lot of triplets the margin is already preserved and no learning is done. But since the sampling can be done asynchronously, it is usually no problem to find enough violating triplets to continue the training.
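As a sketch in PyTorch, assuming f is the cosine similarity between batches of embeddings and using the 0.3 margin from the hinge loss below:

import torch
import torch.nn.functional as F

def triplet_loss(anchor, pos, neg, margin=0.3):
    # zero loss, and thus no gradient, whenever the positive pair already
    # beats the negative pair by at least the margin
    violation = margin + F.cosine_similarity(anchor, neg) - F.cosine_similarity(anchor, pos)
    return torch.clamp(violation, min=0).mean()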

If we take, for instance, items from an electronic program guide (EPG), there might be a huge variance for the category “sports”, but only a few words to describe the content. Thus, it might be sufficient that a nearest-neighbor search in the feature space returns any item with the same label to perform a successful “classification”. This has the benefit that we can capture lots of local patterns with a shallower network, instead of one global pattern as in the case of a softmax classification, which likely requires a more complex network. The price is that we cannot simply classify new items with a single forward pass, but the model is probably easier to train because of the relative similarity. Plus, if the similarity of the anchor and the positive sample is higher than the similarity of the anchor and the negative one, with respect to the margin, there is no loss at all and no gradient step is required. In case of a negative log-likelihood loss, for example, there is always a gradient step, even if it is very small.

Furthermore, nearest-neighbor searches[arxiv:1703.03129] can be done with matrix multiplications, which are very efficient these days thanks to modern CPU and GPU architectures. It is also not required to scan the whole dataset, because it is likely that we can perform a bottom-up clustering of the feature space to handle near duplicates.
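A sketch of such a search, assuming the item embeddings are L2-normalized so that a dot product equals the cosine similarity (all names are ours):

import torch

def nearest_neighbors(items, query, k=5):
    # items: (n_items, n_dims) matrix of L2-normalized item embeddings
    # query: (n_dims,) L2-normalized query embedding
    scores = torch.mv(items, query)  # all similarities in one matrix-vector product
    return torch.topk(scores, k)     # values and indices of the k best matches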

The next question is how to combine the triplet loss with attention, to aggregate data from related items, for example items with similar tags, or to group embeddings of similar words. We are using the ideas from an earlier post to encode items with a variable number of words into a fixed-length representation.

More PyTorch Kludges

We slightly adjusted our loss function to use a hinge-like loss, which requires the maximum function we used pretty often in Theano: T.maximum(0, 0.3 - score). However, in PyTorch the cmax function, which provided the same functionality, was recently removed to simplify the API.

Of course it’s still possible, but the new syntax feels a bit strange: torch.clamp(0.3 - score, min=0). In case of the minimum, it’s the other way around: torch.clamp(0.3 - score, max=0).
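A quick demonstration with made-up scores:

import torch

score = torch.randn(8)                   # hypothetical similarity scores
hinge = torch.clamp(0.3 - score, min=0)  # elementwise maximum(0, 0.3 - score)
floor = torch.clamp(0.3 - score, max=0)  # elementwise minimum(0, 0.3 - score)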

So, it’s the clamp function again, which we previously used to bound the values of a function. It seems clamp is our Swiss army knife.

PyTorch: Kludges To Ensure Numerical Stability

After we decided to switch to PyTorch for new experiments, we stumbled upon some minor problems. They are no big deal and the workarounds are straightforward, but one should be aware of them to avoid frustration. Furthermore, it should be noted that the framework is flagged as “early beta” and this is part of the adventure mentioned on the website :-).

We extended an existing model by adding a skip-gram-like loss to relate samples with tags, in either a positive or a negative way. For this, we are using the classical sigmoid + log loss:

import torch

sigmoid = torch.nn.functional.sigmoid
dot_p = torch.dot(anchor, tag_p)           # similarity to a positive tag
loss_pos = -torch.log(sigmoid(dot_p))      # (1)
dot_n = torch.dot(anchor, tag_n)           # similarity to a negative tag
loss_neg = -torch.log(1 - sigmoid(dot_n))  # (2)

The critical point is log(0): the log is undefined for this input and PyTorch returns “-inf”. There are two ways this can happen:
(1) sigmoid(x) = 0, which means x is a “large” negative value.
(2) sigmoid(x) = 1, which means x is a “large” positive value.
In both cases, the argument of the log evaluates to zero, so the loss becomes “inf” and a hiccup occurs, a numerical instability that makes further optimization steps useless.

One possible workaround is to bound the values of sigmoid to be slightly above zero and slightly below one, with eps ~1e-4:

eps = 1e-4
value = torch.nn.functional.sigmoid(x)
value = torch.clamp(value, min=eps, max=1 - eps)  # bound to [eps, 1 - eps]

With this adjustment, sigmoid(dot_p) is always slightly above zero and (1 - sigmoid(dot_n)) also never evaluates to zero.
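Putting both pieces together, a sketch of the bounded version of the loss from above:

import torch

eps = 1e-4
sigmoid = torch.nn.functional.sigmoid

def stable_loss(anchor, tag_p, tag_n):
    # bound both sigmoid outputs to [eps, 1 - eps] so log() never sees zero
    s_p = torch.clamp(sigmoid(torch.dot(anchor, tag_p)), min=eps, max=1 - eps)
    s_n = torch.clamp(sigmoid(torch.dot(anchor, tag_n)), min=eps, max=1 - eps)
    return -torch.log(s_p) - torch.log(1 - s_n)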

It might be that the pre-defined loss functions in PyTorch do not suffer from this problem, but since we usually design our own loss functions from scratch, numerical instabilities can happen when we combine certain functions. With the described kludge, we no longer encountered any problems during the training of our model.

Again, we are pretty sure that those issues will be addressed over time, but since PyTorch is already very powerful, elegant and fast, we do not want to wait until then. In other words, we really appreciate the hard work of the PyTorch team, and since we chose to use a framework in an “early-release beta”, it’s only fair to be patient. Of course we are willing to help the project, for example by reporting bugs, but in this case someone else already did (issue #1835).

Efficient Embedding Models With PyTorch

With the ability to actually see the values of tensors at each step of the computation, PyTorch is our red-hot favorite when it comes to ML frameworks, not least because it makes debugging so much easier. There are still some rough edges, but there is also a pretty active community that continually improves the framework and fixes existing bugs.

We recently stumbled upon a paper [arxiv:1704.08384] that uses a knowledge-based memory in combination with attention, and we wanted to try a similar approach to predict types for fragments of text that often have very few tokens. The pre-processing took most of the time, while the actual training and the description of the model, thanks to PyTorch, were a piece of cake. Our idea can be implemented by combining some recently introduced methods and does not require any new layer or module.

In our first approach we ignore the order of the tokens, but we use a mask [arxiv:1612.03969] to weight individual dimensions of the embedding:
torch.sum(E(in_words) * M(in_words), 0)
where E and M are both matrices with shape=(#tokens, #dims). This allows us to convert an arbitrary sequence of tokens into a fixed-length representation. The mask should be initialized to 1 for all entries, which can be done with:
M.weight = torch.nn.Parameter(torch.from_numpy(np.ones((#tokens, #dims))).float())
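Spelled out, the whole encoder is just a few lines; n_tokens and n_dims stand for the #tokens and #dims placeholders:

import numpy as np
import torch

n_tokens, n_dims = 1000, 50  # placeholder sizes

E = torch.nn.Embedding(n_tokens, n_dims)
M = torch.nn.Embedding(n_tokens, n_dims)
# start with an all-ones mask so that the encoder is initially a plain sum
M.weight = torch.nn.Parameter(torch.from_numpy(np.ones((n_tokens, n_dims))).float())

def encode(in_words):
    # in_words: LongTensor of token ids, arbitrary length;
    # the result is a single vector of size n_dims
    return torch.sum(E(in_words) * M(in_words), 0)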

The problem is that even if an example only references a very small subset of all tokens, the gradient update is dense, which means the whole embedding matrix is updated. This problem is not limited to PyTorch; for instance, it is also present in Theano, for which we already described one way to fix it[1]. In PyTorch this is of course also possible, but the approach is different.

Usually a model contains a module for the embedding,
torch.nn.Embedding(#tokens, #dims)
which by default leads to a dense gradient update. To switch to sparse gradient updates, we only have to adjust the initialization to
torch.nn.Embedding(#tokens, #dims, sparse=True)
and that is all.
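A minimal sketch of the sparse setup in a recent PyTorch version (sizes are placeholders):

import torch

emb = torch.nn.Embedding(1000, 50, sparse=True)      # placeholder sizes
opt = torch.optim.Adagrad(emb.parameters(), lr=0.1)  # handles sparse gradients

ids = torch.LongTensor([3, 17, 42])  # only these rows receive a gradient
emb(ids).sum().backward()
print(emb.weight.grad.is_sparse)     # True
opt.step()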

However, in our PyTorch version the adjustment only worked with basic optimizers like Adagrad or SGD; it refused to work with RMSprop or Adam. It seems some functionality is still missing:
'torch.sparse.FloatTensor' object has no attribute 'addcmul_'
but we strongly believe that this will be fixed pretty soon.

The performance gain from the sparsity is pretty huge: with everything else equal, the processing of a block took 7,000 ms without sparsity, but only 950 ms with sparsity, an improvement of 86%.

Without the memory, the rest of the model is straightforward: first we encode the input tokens to get a fixed-length vector, then we use a linear layer in combination with a softmax to predict the type.
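With the encode function sketched above, the remaining model is literally two lines; the five classes follow our first test below:

import torch

clf = torch.nn.Linear(n_dims, 5)

def predict(in_words):
    h = encode(in_words)  # fixed-length encoding of the token ids
    return torch.nn.functional.log_softmax(clf(h), dim=-1)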

To address the issue of unbalanced labels, we introduce a penalty that depends on the inverse frequency of the labels: log(#total / #total(y)). For example, the penalty of a fairly common label is 1.17, while it is 3.66 for a rather rare one.
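For illustration, with made-up label counts the formula roughly reproduces the two values quoted above:

import numpy as np

counts = np.array([2900., 2500., 2000., 1700., 240.])  # hypothetical label counts
weights = np.log(counts.sum() / counts)
# -> [1.17, 1.32, 1.54, 1.70, 3.66]: common labels receive small
#    penalties, rare labels large ones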

In a first test, we used ~30K tokens and five classes, and we got reasonable results in less than an hour. After we finish analyzing the results, we plan to integrate the knowledge base into the model, but that is a completely new story.

[1] Efficient Embeddings With Theano

pytorch: A New Burning Star

We are still in love with Theano, and it has been part of our machine learning stack for quite some time, but now and then you need something else to get things done. Of course Python is our first choice, since it is intuitive and flexible, and when combined with low-level modules written in C/C++, performance is no problem either.

Frankly, one reason we never gave torch a try is that even if learning a new language can be fun, time is very valuable. Plus, in our humble opinion, it makes more sense to master one language than to divide your attention between two. However, with the arrival of pytorch, things are different now.

It’s not like all our problems are solved, but pytorch introduces a new, very interesting concept of dynamic graphs. Furthermore, pytorch follows a WYSIWYG approach, which means that a tensor contains actual values at any moment and not just symbolic references to them. Because of this, there are no lengthy compilation steps, which also makes debugging much easier.

Even if the state of the project is described as “early-release beta”, our experiments so far did not encounter any serious problems or limitations. But to be fair, our models were fairly straightforward and thus only used standard components. Nevertheless, describing the model and the actual training worked like a charm, without any pitfalls. And with our existing knowledge of Theano and other graph-based frameworks, it was easy to adapt existing code and/or write new code.

The integration of the automatic differentiation is a little different compared to Theano, but this is also no big deal if you spend a little time with the interface description. Plus, with all the available examples and tutorials, it’s pretty easy to get an overview of all the modules you need for your daily work. Especially recurrent networks are pretty easy to use and require only minimal knowledge of the underlying details in case you just need a standard setup to solve a problem.

Bottom line, we are big fans of frameworks that are easy to use but also versatile. When we stumbled upon numpy many years ago, we instantly fell in love, because implementing algorithms was straightforward and also very fast thanks to the optimized linear algebra routines and the vectorization. The only drawback is the missing automatic differentiation, because doing it manually is burdensome and very error-prone. Thus, if a framework extends numpy with this feature, plus the ability to perform calculations on the GPU in a transparent way, the outcome has to be useful ;-).

In other words, if you are familiar with the computational machinery that is required for implementing neural networks, but also other machine learning models, pytorch can make your life a lot easier. It’s pretty lightweight, fast and easy to use. Maybe it needs a little more time to become “feature complete” and more mature, but our tests did not reveal any severe problems, and since the community is pretty active, problems should be addressed pretty soon after they are reported.