# PyTorch: Tackle Sparse Gradient Issues with Multi-Processing

Let’s imagine that we want to train a model that uses an embedding layer for a very large vocabulary. In on-line mode, each sample only touches very few words, which means the gradient is sparse because most of the words do not need to be updated at all[1].

The good news is that this already works with vanilla gradient descent and Adagrad. However, since the latter eventually decays the learning rate towards zero, we might have a problem if we need to visit a lot of samples to achieve a good score. This might be the case for our recent model that uses a triplet loss, because not every update has the same impact, and using more recent gradient information would be more efficient. Thus, we decided *not* to use Adagrad.

As a result, we can only use stochastic gradient descent. There is nothing wrong with this optimizer, but it can take a long time to converge. Momentum addresses part of the problem, since it accelerates learning when the gradient keeps pointing in one direction, but it turned out that enabling momentum turns the sparse update into a dense one, and that means we lose our only computational advantage.

Again, the issue is likely to be relevant for other frameworks as well, and in general sparse operations always seem to lag a little behind their dense counterparts. Since PyTorch is very young, we don’t mind challenges, as noted before.

We would really like to continue using it for our current experiment, but we also need to ensure that the optimizer is not a bottleneck in terms of training time. The good news is that the results we got so far confirm that the model is learning a useful representation. But because of the long-tail problem, we might need more time to perform a sufficiently large number of updates for the rare words.

So, in this particular case we do not have many options, but a good one would suffice, and indeed there seems to be a way out of it: asynchronous training with multi-processing. We still need to investigate more details, but PyTorch provides a drop-in replacement for “multiprocessing”, and a quick & dirty example seems to work already.

The idea is to create the model as usual, with the only exception that we call “model.share_memory()” to properly share the model parameters with fork(). Then we spawn N new processes that all get a copy of the model, with tied parameters, but each with its own optimizer and data sampler. In other words, we perform N independent trainings, but all processes update the same model parameters. The provided example code from PyTorch is pretty much runnable out of the box:

```
## borrowed from PyTorch ##
import torch.multiprocessing as mp

model = TheModel()
model.share_memory()  # required for async training

# train: function with "model" as the parameter
procs = []
for _ in range(4):  # no. processes
    p = mp.Process(target=train, args=(model,))
    p.start()
    procs.append(p)

for p in procs:  # wait until all trainers are done
    p.join()
```

The training function ‘train’ does not require any special code; stated differently, if you call it just once, single-threaded, it works as usual. As noted before, there are likely some issues that need to be taken care of, but a first test seemed to work without any problems.
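A minimal, self-contained sketch of such a training function (the model, data, and hyper-parameters here are made up for illustration, not from the original setup):

```
import torch
import torch.nn as nn

def train(model, steps=200):
    # each process runs this plain loop with its own optimizer and data,
    # while the model parameters live in shared memory
    opt = torch.optim.SGD(model.parameters(), lr=0.1)
    for _ in range(steps):
        x = torch.randn(8, 4)               # synthetic batch
        y = x.sum(dim=1, keepdim=True)      # toy regression target
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x), y)
        loss.backward()
        opt.step()
    return loss.item()

model = nn.Linear(4, 1)
model.share_memory()  # harmless when training single-threaded
train(model)
```

Called once like this, it behaves exactly like ordinary single-process training; only the mp.Process wrapper turns it into asynchronous training.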

Bottom line, the parallel training should help to increase the throughput so we can perform more updates, which hopefully leads to an earlier convergence of the model. We still need to investigate the quality of the trained model, along with some hyper-parameters, like the number of processes, but we are confident that we will find a way to make it work.

# Efficient Embedding Models With PyTorch

With the ability to actually see the values of tensors at each step of the computation, PyTorch is our red-hot favorite when it comes to ML frameworks. One reason is that it makes debugging so much easier. There are still some rough edges, but there is also a pretty active community that continually improves the framework and fixes existing bugs.

We recently stumbled upon a paper [arxiv:1704.08384] that uses a knowledge-based memory in combination with attention, and we wanted to try a similar approach to predict types for fragments of text that often have very few tokens. The pre-processing took the most time, while the actual training and description of the model, thanks to PyTorch, was a piece of cake. Our idea can be implemented by combining some recently introduced methods, and it does not require any new layer or module.

In our first approach we ignore the order of the tokens, but we are using a mask [arxiv:1612.03969] to weight individual dimensions of the embedding:

`torch.sum(E(in_words) * M(in_words), 0)`

where E, M are both matrices with shape=(#tokens, #dims). This allows us to convert an arbitrary sequence of tokens into a fixed length representation. The mask should be initialized to 1 for all entries which can be done with:

`M.weight = torch.nn.Parameter(torch.from_numpy(np.ones((#tokens, #dims))).float())`
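Putting both snippets together, a minimal sketch of the encoder could look as follows (vocabulary size, embedding size, and token ids are made-up values):

```
import torch
import torch.nn as nn

n_tokens, n_dims = 1000, 16           # hypothetical sizes
E = nn.Embedding(n_tokens, n_dims)    # token embeddings
M = nn.Embedding(n_tokens, n_dims)    # per-dimension mask
M.weight = nn.Parameter(torch.ones(n_tokens, n_dims))  # init mask to 1

in_words = torch.tensor([3, 42, 7])   # arbitrary sequence of token ids
encoded = torch.sum(E(in_words) * M(in_words), 0)  # fixed-length vector
```

Since the mask starts at 1, the initial output equals the plain sum of the embeddings; during training the mask learns to down-weight individual dimensions.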

The problem is that even if an example only references a very small subset of all tokens, the gradient update is dense which means the whole embedding matrix is updated. This problem is not limited to PyTorch, for instance, it is also present in Theano. For the latter we already described one way to fix it[1]. In PyTorch this is of course also possible, but the approach is different.

Usually a model contains a module for the embedding

`torch.nn.Embedding(#tokens, #dims)`

which leads by default to a dense gradient update. To switch to sparse gradient updates, we only have to adjust the initialization to

`torch.nn.Embedding(#tokens, #dims, sparse=True)`

and that is all.
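The effect is easy to verify: after a backward pass, the gradient of the embedding weights is a sparse tensor that only covers the rows that were actually referenced (the sizes below are arbitrary):

```
import torch

emb = torch.nn.Embedding(1000, 16, sparse=True)
out = emb(torch.tensor([1, 2, 3])).sum()
out.backward()
# only the three referenced rows carry gradient information
print(emb.weight.grad.is_sparse)  # prints True
```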

However, in our PyTorch version the adjustment only worked with basic optimizers like Adagrad or SGD; it refused to work with RMSprop or Adam. It seems some functionality is still missing:

`'torch.sparse.FloatTensor' object has no attribute 'addcmul_'`

but we strongly believe that this will be fixed pretty soon.

The performance gain from sparsity is pretty huge: when everything else is equal, the processing of a block took 7,000 ms without sparsity, but only 950 ms with sparsity. This is an improvement of about 86%.

Without the memory, the rest of the model is straightforward: First we encode the input tokens to get a fixed length vector, then we use a linear layer in combination with a softmax to predict the type.

To address the issue of unbalanced labels, we introduce a penalty that depends on the inverse frequency of the labels: log(#total / #total(y)). For example, the penalty of a fairly common label is 1.17, while it is 3.66 for a rather rare one.
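The weights can be computed in a few lines; the label counts below are invented, but with this distribution a common label ends up near 1.17, while the rare one lands around 3.65:

```
import math
from collections import Counter

# hypothetical label distribution over five classes
labels = ["a"] * 310 + ["b"] * 250 + ["c"] * 220 + ["d"] * 194 + ["e"] * 26
counts = Counter(labels)
total = len(labels)

# penalty: log(#total / #total(y)); rare labels get larger weights
weights = {y: math.log(total / n) for y, n in counts.items()}
```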

In a first test, we used ~30K tokens and five classes, and we got reasonable results in less than an hour. After we finish analyzing the results, we plan to integrate the knowledge base into the model, but that is a completely new story.

# Regularization Beyond Dropout

Before the advent of batch normalization (BN), dropout was the silver bullet to avoid overfitting and to regularize a network. Now that BN belongs to the standard tools for training networks, dropout is often omitted, or at least the rate is drastically reduced. However, for some special ‘networks’, like factorization machines (FM), neither dropout nor BN can be used (efficiently), and thus we are searching for ways to regularize such a model with something stronger than weight decay.

First, let’s see why dropout does not make sense here. The input data for FMs is usually very sparse, and if we apply dropout to the input, it is possible that all elements become zero, which does not make any sense. So, for the linear part T.dot(x, W), dropout is not an option. What about the non-linear part T.sum(T.dot(x, V)**2 - T.dot(x**2, V**2))? It is easy to see that we have the same problem here.
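The pairwise term is easy to play with; in the following sketch (toy input, random factors) the result reduces to the dot product of the two active factor rows, which makes it obvious that zeroing the active entries of x, as dropout might, kills the whole term (we include the 0.5 factor from the standard FM formulation):

```
import torch

x = torch.tensor([0., 1., 0., 1.])   # sparse input, two active features
V = torch.randn(4, 3)                # factor matrix, k = 3

# second-order FM term: interactions of the active features only
pairwise = 0.5 * torch.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
# for this x, pairwise equals torch.dot(V[1], V[3])
```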

As noted earlier, weight decay is always a good idea to prevent the prediction of a model from being driven by very few large weights. For example, if we have two weights [w1, w2], the model w=[0.1, 2.1] has a loss of

l=0.5 * (0.1**2 + 2.1**2)=2.21,

while the weights w2=[0.1, 0.7] have a loss of

l=0.5 * (0.1**2 + 0.7**2)=0.25.
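A quick sanity check of the L2 penalty (0.5 times the sum of squared weights) for two hypothetical weight vectors:

```
def l2_penalty(w):
    # weight decay term: 0.5 * sum_i w_i^2
    return 0.5 * sum(v ** 2 for v in w)

print(round(l2_penalty([0.1, 2.1]), 2))  # 2.21
print(round(l2_penalty([0.1, 0.7]), 2))  # 0.25
```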

The problem with weight decay is that it does not help if the model gets stuck in a poor local minimum or on a plateau.

In a paper about SGD variants, the authors mention a simple method that adds noise to the gradient [arxiv:1511.06807]. The approach is very simple: it just adds Gaussian random noise, with a standard deviation that is decayed over time, to the gradient. The annealing ensures that at later stages of the training the noise has less impact. This is beneficial because by then we have likely found a good minimum, and too much noise increases the chance that we “jump” out of it.
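A sketch of the method in PyTorch terms (the default eta and gamma follow the values reported in [arxiv:1511.06807]; the function name is our own):

```
import torch

def add_gradient_noise(params, step, eta=0.01, gamma=0.55):
    # annealed Gaussian noise with sigma^2 = eta / (1 + step)**gamma
    sigma = (eta / (1 + step) ** gamma) ** 0.5
    for p in params:
        if p.grad is not None:
            p.grad.add_(torch.randn_like(p.grad) * sigma)

# usage inside a training loop, after loss.backward():
#   add_gradient_noise(model.parameters(), step)
```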

The intention of the paper is to make the optimization of _deep_ networks easier, but the method itself is not limited to deep networks and can be used for all kinds of models. So, we used gradient noise to train an FM model, to see if we can improve the generalization performance (by finding a better local minimum).

To compare the results, we fixed all random seeds and trained the model for ten epochs. The data is a two-class problem to predict whether a user likes a movie or not. To be fair, the problem is already simple, but with gradient noise we achieved a precision of 100% in 3 epochs, while without the noise it took 9 epochs. We further fixed a set of movies to study the predictions of both models. However, there is no clear trend, except a possible tendency that the model trained with the noise gives lower scores to movies where the preference is “vague”, but also lower scores to movies that are rated negatively.

Bottom line, we could not clearly verify that gradient noise improves the accuracy of the model, but the method definitely helps to fight overfitting where dropout cannot be used, and it leads to faster convergence.

# Devilish Details

In general, we are very glad that there are general purpose toolkits available to solve a wide range of problems we encounter every day. Especially in machine learning, such toolkits can be very handy to evaluate different approaches very fast and without the need to implement all the algorithms manually. Furthermore, there are a lot of problems that can be solved without the need to modify the implementations of the framework, thanks to the wide range of available parameters. In other words, whenever possible it is a good idea to rely on experts to implement the real stuff and we can focus on tuning the parameters of the algorithm for the task at hand.

However, sometimes the problem is very special and so might be the data. In this case, we have two options: first, we sit down and implement one solution after another, or second, we use a toolkit that allows us to state the problem in some abstract language and let it do the hard work. That sounds like magic, but at least for the pure optimization part, there are good solutions available.

In the past, we already mentioned Theano, a library that helps to optimize a given cost function with automatic differentiation and highly optimized code; it can also utilize the GPU. Why do we mention it here? In our case, the problem is that we can almost never use an off-the-shelf approach to train a model. That means we need to work on a special cost function that describes the loss for certain inputs, and then we have to minimize it.

Even with Theano, we still have to write a lot of code, for instance the support for mini-batches or procedures to decay the learning rate, but the hard part, the optimization and the calculation of the gradients, is taken off our shoulders. It is well known that some algorithms, like back-prop, are notoriously difficult to implement correctly, and with Theano we can focus on the actual problem instead of the optimization part.

We already mentioned several times that good features are essential to train good models, but it is also very important to use a proper cost function for the data at hand. The use of Theano requires some skills, but once this hurdle is cleared, it is a very efficient helper to try different cost functions with only minor modifications of the code.

It is needless to say that things can still go wrong, but at such a high level, bug hunting is much easier than finding an error in a manually derived gradient expression, and that means, again, we hopefully have more time for tuning and refining the approach.