The Model Has Been Trained … Now What?

With the increasing popularity of data science, whatever this term actually means, a lot of functionality in software is based on machine learning. And yes, we all know machine learning is a lot of fun, if you get your model to learn something useful. In the academic community a major focus is to find something new, or enhance existing approaches, or to just beat the existing state of the art score. However, in industry, once a model is trained to solve an actual problem, it needs to be deployed and maintained somewhere.

Suddenly, we might not have access to huge GPU/CPU clusters any longer which means the final model might need to run a device with very limited computational power, or even on commodity hardware. Not to speak about the versioning of models and the necessity to re-deploy the actual parameters at some time. At this point, we need to change our point of view from research/science to production/engineering.

In case of python, pickling the whole model is pretty easy, but it requires that a compatible version of the code is used for de-serialization[1]. For long-term storage, it is a much better idea to store the actual model parameters in a way that does not depend on the actual implementation. For instance, if you trained an Elman recurrent network, you have three parameters:
(1) the embedding matrix
(2) the “recurrent” matrix
(3) the bias
which are nothing more than just plain (numpy) arrays which can be even stored in JSON as a list (of lists). To utilize the model, it is straightforward -in any language- to implement the forward propagation, or to use an existing implementation which just requires to initialize the parameters. For example, in major languages like Java or C++, initializing arrays from JSON data is no big deal. Of course there are many other ways, but JSON is a very convenient data transport format.

And since the storage of the parameters is not coupled to any code, a model setup is possible in almost any environment with sufficient resources.

Sure, we are aware that restoring a 100-layer network from a JSON file can be burdensome, but nevertheless it is required to transfer model parameters in a unified, non-language dependent, way. So, we discussed some details of the storage and the deploy, but what about using the model in real applications?

We want to consider a broader context and not only images. Like the implementation of a search / retrieval system. In contrast to experiments, it is mandatory, that we get a result from a model within reasonable time. In other words, nobody says that training is easy, but if a good model need too much time to get a decision, it is useless for real-world applications. For instance, if the output of the model is “rather large”, we need to think about ways for an efficient retrieval in case of information retrieval. As an example: if the final output is 4096 dims (float32) and we have 100K ‘documents’, we need ~1.5 GB to store just the results and not a single bit of meta data.

But even for smaller models, like word2vec, we might have 50K words, each represented by 100 dims, which makes it a non-trivial task to match the entered sequence of words to, say, an existing list of movie titles and to rank the results in real-time and multi-threaded.

We know, we are a bit unfair here, but we have the feeling that way too often most of the energy is put into beating some score, which includes that often even the information in papers are not sufficient to reproduce results, and the models that actually provide new insights are sometimes not usable because they require lots of computational power.

Bottom line, on the one hand, without good models we have nothing to deploy, but on the other hand, machine learning is so much more than just training a good model and then dumping it at some place to let some other guys run it. As a machine learning engineer, we have the responsibility to be part of the whole development process and not just the cool part where we can play with new neural network architectures and let other folks do the “boring” part.



PyTorch: Tackle Sparse Gradient Issues with Multi-Processing

Let’s imagine that we want to train a model that is using an embedding layer for a very large vocabulary. In on-line mode, you only work on very few words per sample which means you get a sparse gradient because most of the words do not need to be touched in any way[1].

The good news is that it already works with vanilla gradient descent and AdaGrad. However, since the latter eventually decays the learning rate to zero, we might have a problem if we need to visit a lot of samples to achieve a good score. This might be the case for our recent model that is using a triplet loss, because not every update has the same impact and using more recent gradient information would be more efficient. Thus, we decided *not* to use Adagrad.

As a result, we can only use stochastic gradient descent. There is nothing wrong with this optimizer, but it can take lots of time until convergence. We can address parts of the problem with momentum, since it accelerates learning if the gradient follows one direction, but it turned out that enabling momentum turns the sparse update into a dense one and that means we are losing our only computational advantage.

Again, the issue is likely to be also relevant for other frameworks and in general sparse operations always seem a little behind their dense colleagues. Since PyTorch is very young, we don’t mind challenges, as noted before.

We would really like to continue using it for our current experiment, but we also need to ensure that the optimizer is not a bottleneck in terms of training time. The good news is that the results, we got so far, confirm that the model is learning a useful representation. But with the problem of the long tail, we might need more time to perform a sufficiently large number of updates for the rare words.

So, in this particular case we do not have much options, but a good one would suffice and indeed there seems to be a way out it: Asynchronous training with multi processing. We still need to investigate more details, but PyTorch provides a drop-in replacement for “multiprocess” and a quick & dirty example seems to work already.

The idea is to create the model as usual, with the only exception to call “model.share_memory()” to properly share model parameters with fork(). Then, we spawn N new processes that all gets a copy of the model, with tied parameters, but each with its own optimizer and data sampler. In other words, we perform independent N trainings but all processes update the same model parameters. The provided example code from PyTorch is pretty much runnable out of the box:

## borrowed from PyTorch ##
from torch.multiprocessing as mp
model = TheModel()
model.share_memory() # required for async training
# train: function with "model" as the parameter
procs = []
for _ in xrange(4): # no. processes
 p = mp.Process(target=train, args=model(,))

for p in procs: # wait until all trainers are done

The training function ‘train’ does not require any special code, or stated differently it, if you call it just once, single-threaded, it works as usual. As noted before, there are likely some issues that need to be taken care of, but a first test seemed to work without any problems.

Bottom line, the parallel training should help to increase the through-put to perform more updates which -hopefully- leads to an earlier convergence of the model. We still need to investigate the quality of the trained model, along with some hyper-parameters, like the number of processes, but we are confident that we find a way to make it work.


PyTorch: Non-differentiable is Nothing

In the last weeks, we really learned to appreciate PyTorch. Not because it is flawless, but because it is very intuitive and makes your life so much easier when you need to debug something. And let’s face it, at one point your network is doing something silly and you have no idea why. Then you need to take a look under the hood which can be a bit of a burden with symbolic variables. Furthermore, the dynamic graph approach lets you define some concepts, like recurrence without complicated scan functions which is one more reason why recurrent networks and PyTorch should be best friends. Bottom line, for a beta, the framework feels very stable and except for some numerical instabilities, we never encountered any problems so far.

With arxiv as a hub for research papers, one have access to lots of new ideas. Not all of them are diamonds, but sometimes a paper contains the hint you need to complete a problem or at least to evaluate an idea or to use it as a basis. Then, it is extremely useful if you can do a rapid prototype. Even with Theano it was quite easy possible, but since the framework is a little too low-level, you had to write lots of setup code. With PyTorch and the seamless numpy integration the setup part is almost neglectable.

Furthermore, it is also possible to transfer a learned model from one framework to another, in case one might have concerns to use a particular framework for production, or a particular framework is required. Thus, it is no problem to use the advantages of PyTorch for training and evaluation of a model and then store the model parameters as numpy arrays and build the network with Theano.

It is really hard to imagine to evaluate new, complex networks without all the fancy frameworks, especially automatic differentiation, but also computational graphs in general. For example, the famous AlexNet was written from scratch which was quite an achievement. In other words, there is no excuse today, not to test your ideas quickly, but also systematically with a framework like PyTorch. Yes, you need some Python and math skills, but with all the groundwork done by others, it is much faster than a few years ago.

So, the major problem, which cannot be solved by PyTorch yet, is to get your hands on a sufficiently large set of (labeled) data. However, if you have a dataset you can start with, there are no limits. Try some funky loss functions, or combine them. Can you imagine a world without GANs? Not anymore. But, somebody has to come up with the idea before it can be refined.

Bottom line, it has never been easier to utilize massive amounts of computational power with GPUs. There are many publicly available datasets, but also relatively easy ways to acquire labeled data, so you should be at least able to start your work. In combination with all the available frameworks, you can train very complex networks in a reasonable time which would have been impossible more than ten years ago even with a massive cluster of machines. So, everybody with a clever idea and some luck has the opportunity to make a difference.

Think Local, Not Global

In contrast to some academic datasets, real-life datasets are often unbalanced, with a lot noise and more variance than some models can cope with. There are ways to adapt models for those problems, but sometimes a classification is too restrictive. What do we mean by that? With a lot of variance for a single label, it might be possible to squeeze all samples into some corner of a high-dimensional space, but it is probably easier to learn the relative similar of samples.

With a triplet loss, we can move two samples with the same label closer together while we push them further away for a sample of a different label: maximum(0, margin + f(anchor,neg) - f(anchor,pos)). The approach has the advantage that we can distribute samples across the whole feature space by forming “local clusters” instead of concentrating it in a dense region of the space. The quality of the model depends on the sampling of the triplets, since for a lot of triplets the margin is already preserved and no learning is done. But since the sampling can be done asynchronously, it is usually no problem to find enough violating triplets to continue the training.

If we take for instance items from an electronic program guide (epg) there might be a huge variance for the category “sports”, but only few words to describe the content. Thus, it might be sufficient that a nearest neighbor search -in the feature space- returns any item with the same label to perform a successful “classification”. This has the benefit that we can capture lots of local patterns with a shallower network instead of one global pattern in case of a softmax classification that requires a network that is likely complex. The price of the model is that we cannot simply classify new items with a single forward pass, but the model is probably easier to train because of the relative similarity. Plus, if the similarity of the anchor and the positive sample is higher than the similarity of the anchor and the negative one, with respect to the margin, there is no loss at all and no gradient step is required. In case of a nll loss, for example, there is always a gradient step, even if it is very small.

Furthermore, nearest neighbors searches[arxiv:1703.03129] can be done with matrix multiplications that are very efficient these days thanks to modern CPU and GPU architectures. It is also not required to scan the whole dataset, because it is likely that we can perform a bottom up clustering of the feature space to handle near duplicates.

The next question is how to combine a triplet loss with attention to aggregate data from related items, for example items with similar tags, to group embeddings of similar words. We are using the ideas from the post to encode items with a variable number of words into a fixed length representation.

More PyTorch Kludges

We slightly adjusted our loss function to a use a hinge-like loss that requires the use of a maximum function that we used pretty often in Theano: T.maximum(0, 0.3 - score). However, in PyTorch the cmax function, that provided the same functionality, was recently removed to simplify the API.

Of course it’s still possible but the new syntax feels a bit strange: torch.clamp(0.3 - score, min=0). In case of the minimum, it’s the other way around: torch.clamp(0.3 - score, max=0).

So, it’s the clamp function again that we previously used, to bound the values of a function. Seems like clamp is our swiss army knife.

PyTorch: Kludges To Ensure Numerical Stability

After we decided to switch to PyTorch for new experiments, we stumbled about some minor problems. It’s no big deal and the workarounds are straightforward, but one should be aware of them to avoid frustration. Furthermore, it should be noted that the framework is flagged as “early beta” and this is part of the adventure mentioned on the website :-).

We extended an existing model by adding a skip-gram like loss to relate samples with tags both in a positive or negative way. For this, we are using the classical sigmoid + log loss:

sigmoid = torch.nn.functional.sigmoid
dot_p =, tag_p)
loss_pos = -torch.log(sigmoid(dot_p)) #(1)
dot_n =, tag_n)
loss_neg = -torch.log(1 - sigmoid(dot_n)) #(2)

The critical point is log(0), since log is undefined for this input, “inf” in PyTorch, and there are two ways how this can happen:
(1) sigmoid(x) = 0, which means x is a “large” negative value.
(2) sigmoid(x) = 1, which means x is a “large” positive value.
In both cases, -log(y) evaluates to zero and a hiccup occurs which leads to a numerical instability that makes further optimization steps useless.

One possible workaround is to bound the values of sigmoid to be slightly above zero and slightly below one, with eps ~1e-4:

value = torch.nn.functional.sigmoid(x)
value = torch.clamp(torch.clamp(value, min=eps), max=1-eps)

With this adjustment, sigmoid(dot_p) is always slightly positive and (1 – sigmoid(dot_n)) also never evaluates to zero.

It might be possible that pre-defined loss functions in PyTorch do not suffer this problem, but since we usually design our own loss function from scratch, numerical instabilities can happen if we combine certain functions. With the described kludge, we did not encounter problems any longer during the training of our model.

Again, we are pretty sure that those issues are addressed over time, but since PyTorch is already very powerful, elegant and fast, we do not want to wait until this happened. In other words, we really appreciate the hard work of the PyTorch team and since we made a choice to use a framework in an “early-release beta”, it’s only fair to be patient. Of course we are willing to help the project, for example by reporting bugs, but in this case someone else already did it (issue #1835).

Efficient Embedding Models With PyTorch

With the ability to actually see the values of tensors at each step of the computation, PyTorch is our red-hot favorite when it comes to ML frameworks. One reason is that it makes debugging so much easier. There are still some rough edges, but there is also a pretty active community that continually improves the framework and fixes existing bugs.

We recently stumbled about a paper [arxiv:1704.08384] that uses a knowledge-based memory in combination with attention and we wanted to try a similar approach to predict types for fragments of texts that often have very few tokens. The pre-processing took the most time, while the actual training and description of the model, thanks to PyTorch, was a piece of cake. Our idea can be implemented by combining some recently introduced methods and it does not require any new layer or module.

In our first approach we ignore the order of the tokens, but we are using a mask [arxiv:1612.03969] to weight individual dimensions of the embedding:
torch.sum(E(in_words) * M(in_words), 0)
where E, M are both matrices with shape=(#tokens, #dims). This allows us to convert an arbitrary sequence of tokens into a fixed length representation. The mask should be initialized to 1 for all entries which can be done with:
M.weight = torch.nn.Parameter(torch.from_numpy(np.ones((#tokens, #dims))).float())

The problem is that even if an example only references a very small subset of all tokens, the gradient update is dense which means the whole embedding matrix is updated. This problem is not limited to PyTorch, for instance, it is also present in Theano. For the latter we already described one way to fix it[1]. In PyTorch this is of course also possible, but the approach is different.

Usually a model contains an module for the embedding
torch.nn.Embedding(#tokens, #dims)
which leads by default to a dense gradient update. To switch to sparse gradient updates, we only have to adjust the initialization to
torch.nn.Embedding(#tokens, #dims, sparse=True)
and that is all.

However, in our PyTorch version the adjustment only worked with basic optimizers like Adagrad or SGD, but it refused to work with RMSprop or Adam. It seems some functionality is missing
torch.sparse.FloatTensor' object has no attribute 'addcmul_'
but we strongly believe that this is fixed pretty soon.

The performance gain in terms of the sparsity is pretty huge: When everything else is equal, the processing of a block took 7000 ms without sparsity, but only 950 ms with sparsity. This is an improvement of 86%.

Without the memory, the rest of the model is straightforward: First we encode the input tokens to get a fixed length vector, then we use a linear layer in combination with a softmax to predict the type.

To address the issue of unbalanced labels, we introduce a penalty that depends on the inverse frequency of the labels: log(#total / #total(y)). For example, the penalty of an almost common label is 1.17, while it is 3.66 for a rather seldom one.

In a first test, we used ~30 K tokens and five classes and we got reasonable results in less than an hour. After we finish to analyze the results, we plan to integrate the knowledge-base into the model, but this is a completely new story.

[1] Efficient Embeddings With Theano