PyTorch: Kludges To Ensure Numerical Stability

After we decided to switch to PyTorch for new experiments, we stumbled upon some minor problems. They are no big deal and the workarounds are straightforward, but one should be aware of them to avoid frustration. Furthermore, it should be noted that the framework is flagged as “early beta”, and this is part of the adventure mentioned on the website :-).

We extended an existing model by adding a skip-gram-like loss that relates samples to tags in both a positive and a negative way. For this, we are using the classical sigmoid + log loss:

sigmoid = torch.nn.functional.sigmoid
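# NOTE: the first argument of both dot products was lost; "sample" is an assumed placeholder name
dot_p = torch.dot(sample, tag_p)
loss_pos = -torch.log(sigmoid(dot_p))  # (1)
dot_n = torch.dot(sample, tag_n)
loss_neg = -torch.log(1 - sigmoid(dot_n))  # (2)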

The critical point is log(0): the logarithm is undefined for this input and PyTorch returns “-inf”, so the negated loss becomes “inf”. There are two ways this can happen:
(1) sigmoid(x) = 0, which means x is a “large” negative value.
(2) sigmoid(x) = 1, which means x is a “large” positive value.
In both cases, the argument of the log evaluates to zero, so -log(y) evaluates to “inf” and a hiccup occurs which leads to a numerical instability that makes further optimization steps useless.

One possible workaround is to bound the values of sigmoid to be slightly above zero and slightly below one, with eps ~1e-4:

eps = 1e-4
value = torch.nn.functional.sigmoid(x)
value = torch.clamp(value, min=eps, max=1 - eps)

With this adjustment, sigmoid(dot_p) is always slightly positive and (1 - sigmoid(dot_n)) also never evaluates to zero.

It might be possible that pre-defined loss functions in PyTorch do not suffer from this problem, but since we usually design our own loss functions from scratch, numerical instabilities can happen when we combine certain functions. With the described kludge, we no longer encountered problems during the training of our model.
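As an alternative to clamping, the two losses can be rewritten so that log(0) never appears. The following is only a minimal sketch and not the code used for the model above. It relies on the identities -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x). The function name stable_losses and the test values are made up for illustration.

```python
import torch
import torch.nn.functional as F


def stable_losses(dot_p, dot_n):
    """Sigmoid + log loss written with softplus to avoid log(0)."""
    # -log(sigmoid(x)) == softplus(-x)
    loss_pos = F.softplus(-dot_p)
    # -log(1 - sigmoid(x)) == -log(sigmoid(-x)) == softplus(x)
    loss_neg = F.softplus(dot_n)
    return loss_pos, loss_neg


# Hypothetical scores, including "large" negative and positive values.
dot_p = torch.tensor([-200.0, 0.0, 200.0])
dot_n = torch.tensor([-200.0, 0.0, 200.0])

# The naive formulation overflows for the large negative score ...
naive_pos = -torch.log(torch.sigmoid(dot_p))
# ... while the softplus formulation stays finite.
loss_pos, loss_neg = stable_losses(dot_p, dot_n)
print(naive_pos, loss_pos, loss_neg)
```

Unlike the clamp, this variant does not cut off the gradient for extreme scores, which may or may not be desirable depending on the model.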

Again, we are pretty sure that those issues will be addressed over time, but since PyTorch is already very powerful, elegant and fast, we do not want to wait until that has happened. In other words, we really appreciate the hard work of the PyTorch team, and since we made the choice to use a framework in an “early-release beta”, it’s only fair to be patient. Of course we are willing to help the project, for example by reporting bugs, but in this case someone else already did it (issue #1835).

Efficient Embedding Models With PyTorch

With the ability to actually see the values of tensors at each step of the computation, PyTorch is our red-hot favorite when it comes to ML frameworks. One reason is that it makes debugging so much easier. There are still some rough edges, but there is also a pretty active community that continually improves the framework and fixes existing bugs.

We recently stumbled upon a paper [arxiv:1704.08384] that uses a knowledge-based memory in combination with attention, and we wanted to try a similar approach to predict types for fragments of text that often have very few tokens. The pre-processing took the most time, while the description and the actual training of the model, thanks to PyTorch, were a piece of cake. Our idea can be implemented by combining some recently introduced methods and does not require any new layer or module.

In our first approach we ignore the order of the tokens, but we are using a mask [arxiv:1612.03969] to weight individual dimensions of the embedding:
torch.sum(E(in_words) * M(in_words), 0)
where E, M are both matrices with shape=(#tokens, #dims). This allows us to convert an arbitrary sequence of tokens into a fixed length representation. The mask should be initialized to 1 for all entries which can be done with:
M.weight = torch.nn.Parameter(torch.from_numpy(np.ones((#tokens, #dims))).float())
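Put together, the masked bag-of-tokens encoder could look roughly like the sketch below. The sizes n_tokens and n_dims and the token ids are hypothetical, and the mask is initialized with torch.nn.init.ones_ instead of going through numpy.

```python
import torch

n_tokens, n_dims = 1000, 16  # hypothetical vocabulary size and embedding width

E = torch.nn.Embedding(n_tokens, n_dims)  # token embeddings
M = torch.nn.Embedding(n_tokens, n_dims)  # per-token, per-dimension mask
torch.nn.init.ones_(M.weight)             # the mask starts as all ones

# A short fragment given as token ids (made up for illustration).
in_words = torch.tensor([3, 17, 256])

# Weight each embedding dimension by its mask and sum over the tokens,
# which yields a fixed-length vector regardless of the fragment length.
h = torch.sum(E(in_words) * M(in_words), dim=0)
print(h.shape)
```

Right after initialization the mask has no effect, so h equals the plain sum of the token embeddings; the mask only starts to matter once training has moved it away from one.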

The problem is that even if an example only references a very small subset of all tokens, the gradient update is dense, which means the whole embedding matrix is updated. This problem is not limited to PyTorch; for instance, it is also present in Theano. For the latter we already described one way to fix it[1]. In PyTorch this is of course also possible, but the approach is different.

Usually a model contains a module for the embedding
torch.nn.Embedding(#tokens, #dims)
which leads by default to a dense gradient update. To switch to sparse gradient updates, we only have to adjust the initialization to
torch.nn.Embedding(#tokens, #dims, sparse=True)
and that is all.
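The effect can be checked with a small experiment. This is a sketch with made-up sizes and a dummy loss, using plain SGD as the optimizer: the gradient is a sparse tensor and only the referenced rows of the embedding matrix change.

```python
import torch

emb = torch.nn.Embedding(1000, 16, sparse=True)  # hypothetical sizes
opt = torch.optim.SGD(emb.parameters(), lr=0.1)

idx = torch.tensor([1, 5, 7])         # the only tokens used by this example
before = emb.weight.detach().clone()  # snapshot to compare against later

loss = emb(idx).sum()                 # dummy loss, just to get a gradient
loss.backward()
is_sparse = emb.weight.grad.is_sparse

opt.step()

# Rows that differ from the snapshot after the update.
changed = (emb.weight.detach() != before).any(dim=1)
changed_rows = changed.nonzero().flatten().tolist()
print(is_sparse, changed_rows)
```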

However, in our PyTorch version the adjustment only worked with basic optimizers like Adagrad or SGD; it refused to work with RMSprop or Adam. It seems some functionality is missing
'torch.sparse.FloatTensor' object has no attribute 'addcmul_'
but we strongly believe that this will be fixed pretty soon.

The performance gain from sparsity is pretty huge: with everything else being equal, the processing of a block took 7000 ms without sparsity, but only 950 ms with sparsity. This is a reduction of about 86% in processing time.

Without the memory, the rest of the model is straightforward: First we encode the input tokens to get a fixed length vector, then we use a linear layer in combination with a softmax to predict the type.

To address the issue of unbalanced labels, we introduce a penalty that depends on the inverse frequency of the labels: log(#total / #total(y)), where #total is the number of all examples and #total(y) is the number of examples with label y. For example, the penalty of a fairly common label is 1.17, while it is 3.66 for a rather rare one.
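The penalty itself is easy to compute. The following sketch uses a made-up label distribution, so the resulting numbers differ from the ones quoted above; the function name label_penalties is only illustrative.

```python
import math
from collections import Counter


def label_penalties(labels):
    """Return log(#total / #total(y)) for every label y."""
    counts = Counter(labels)
    total = len(labels)
    return {y: math.log(total / count) for y, count in counts.items()}


# Hypothetical, unbalanced label distribution.
labels = ["a"] * 80 + ["b"] * 15 + ["c"] * 5
penalties = label_penalties(labels)
print(penalties)
```

Rare labels get a larger penalty than common ones. How the penalty enters the loss is a separate choice; passing the values as per-class weights to the loss function is one possibility.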

In a first test, we used ~30K tokens and five classes, and we got reasonable results in less than an hour. Once we have finished analyzing the results, we plan to integrate the knowledge base into the model, but that is a completely new story.

[1] Efficient Embeddings With Theano

pytorch: A New Burning Star

We are still in love with Theano and it has been part of our machine learning framework for quite some time, but now and then you need something else to get something done. Of course python is our first choice since it is intuitive and flexible, and when it is combined with low-level modules written in C/C++, the performance is also no problem.

Frankly, one reason we never gave torch a try is that even if learning a new language can be fun, time is very valuable. Plus, in our humble opinion it makes more sense to master one language than to divide your attention between two. However, with the arrival of pytorch things are different now.

It’s not like all our problems are solved, but pytorch introduces a new, very interesting concept of dynamic graphs. Furthermore, pytorch uses the WYSIWYG concept, which means that a tensor contains actual values at any moment and not only a symbolic reference to them. Because of this, there are also no lengthy compilation steps, and this also makes debugging much easier.
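A tiny illustration of what this means in practice, as a sketch with made-up values: every intermediate result can be inspected right away, and the gradient is available after a single call to backward().

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# The result is an actual value, not a symbolic node that must be compiled.
y = (x ** 2).sum()
print(y.item())

# Automatic differentiation: dy/dx = 2 * x
y.backward()
print(x.grad)
```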

Even if the state of the project is described as “early-release beta”, our experiments so far did not encounter any serious problems or limitations. But to be fair, our models were fairly straightforward and thus only used standard components. Nevertheless, describing the model and the actual training worked like a charm without any pitfalls. And with our existing knowledge from Theano and other graph-based frameworks, it was easy to adapt existing code and/or to write new code.

The integration of the automatic differentiation is a little different compared to Theano, but this is also no big deal if you spend a little time on the interface description. Plus, with all the available examples and tutorials, it’s pretty easy to get an overview of all the modules you need for your daily work. Recurrent networks in particular are pretty easy to use and require only minimal knowledge of the underlying details in case you just need a standard setup to solve a problem.

Bottom line, we are big fans of frameworks that are easy to use but also versatile. When we stumbled upon numpy many years ago, we instantly fell in love because implementing algorithms was straightforward and also very fast, thanks to the optimized linear algebra routines and the vectorization. The only drawback is the missing automatic differentiation, because doing it manually is burdensome and very error-prone. Thus, if a framework extends numpy with this feature, plus the ability to perform calculations on the GPU in a transparent way, the outcome has to be useful ;-).

In other words, if you are familiar with the computational machinery that is required for implementing neural networks, but also other machine learning models, pytorch can make your life a lot easier. It’s pretty lightweight, fast and easy to use. Maybe it needs a little more time to be “feature complete” and more mature, but our tests did not reveal any severe problems, and since the community is pretty active, problems should be addressed pretty soon after they are reported.

Generalizing word2vec

The idea of the word2vec method is quite simple, but nevertheless very elegant and, as lots of recently published papers confirmed, very versatile for a broad range of problems. Thus, a lot of problems can be transformed and solved by a generic word2vec implementation. The advantage is that we can reuse a mature and optimized method to solve various problems, but the drawback is that this might not be very flexible if we have special requirements. In that case we end up with kludges for the data preparation, or with the abuse of parameters to emulate a certain behavior.

However, if we just extract the core component of word2vec, we have a fairly generic black box to solve a problem. In the skip-gram case, the input consists of a source word that is used to predict a sequence of context words. This approach also works if we don’t have actual words, but tokens that are somehow related. Especially for unordered data, like sets, where the context is not well defined, an adaptable preparation step makes a lot of sense.

For example, let’s assume that we have a set of titles and each title has a set of corresponding tags. The intuition is that if two titles share a lot of tags, they are related. In other words, the title is the source and all assigned tags form the context, which can be seen as a local neighborhood. In case we also consider tags from titles that are reachable through shared tags, we gradually move away from a local to a more global neighborhood.

This has also been explored in the literature, where the local neighborhood can be described as a breadth-first search and the global one as a depth-first search. This is also related to a random walk, since it makes sense to stochastically decide which node to traverse next. For instance, we start at an arbitrary title, then we sample from the corresponding tags, then we sample from the connected titles and so forth. The whole sequence is then the walk. In contrast to a sequence, a set is not ordered, and thus we need a different notion of the window size. In [arxiv:1603.04259] it was proposed to consider all pairs in the set, or to shuffle each training sample.
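Both ideas, the random walk over titles and tags and the “all pairs” treatment of a set, can be sketched in a few lines. The titles, tags and function names below are made up for illustration, and the walk samples uniformly, which is just one possible choice.

```python
import random
from itertools import permutations

# Hypothetical toy data: every title has a set of tags.
title_tags = {
    "title_a": ["tag_1", "tag_2"],
    "title_b": ["tag_2", "tag_3"],
    "title_c": ["tag_3"],
}

# Inverse mapping: which titles are reachable from a tag.
tag_titles = {}
for title, tags in title_tags.items():
    for tag in tags:
        tag_titles.setdefault(tag, []).append(title)


def random_walk(start_title, length, rng):
    """Alternate between titles and tags, sampling the next node uniformly."""
    walk = [start_title]
    node, on_title = start_title, True
    while len(walk) < length:
        neighbors = title_tags[node] if on_title else tag_titles[node]
        node = rng.choice(neighbors)
        walk.append(node)
        on_title = not on_title
    return walk


def all_pairs(items):
    """All ordered (source, context) pairs of a set, ignoring any window size."""
    return list(permutations(items, 2))


rng = random.Random(0)
walk = random_walk("title_a", 5, rng)
pairs = all_pairs(title_tags["title_a"])
print(walk)
print(pairs)
```

The resulting walks can be fed to a skip-gram model like ordinary sentences, while the pairs can be used directly as (source, context) training samples.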

Bottom line, word2vec can be used far beyond the field of NLP, for example for graph embedding, collaborative filtering or personalization, just to name a few. Furthermore, in most scenarios, a mature implementation [python:gensim.models.Word2Vec] can be used to train an embedding without the need to adapt a single line of code. In other cases, the input data might need to be encoded in a special way, but this is often straightforward and also does not require changing the code.

Attention For Bag-of-Words Data

For quite some time now, attention has been a very hot topic and it has been used very successfully for various problems, like translation or captions for images. The basic idea is clever and simple: if we consider the input of a model, usually a sequence, some parts of it are likely to be more important for the problem, which is usually a prediction of some kind. However, since in our domain we are not working with sequences, but sets, we require an attention mechanism for unordered data. Let’s start with an example.

We consider the domain of movies and, in this particular case, we want to predict the genre from a bag-of-words input. Let the input be x=[“town”, “lawman”, “explosion”, “sheriff”, “brother”, “prison”, “ranch”]. So, the question is which features are most important for the decision, or stated differently, do we really need all features for a confident prediction of the genre? For this example, we only consider very basic genres, like western, horror, scifi or romance.

Since the input data is not ordered and a prediction should therefore not depend on the order, a recurrent network is not straightforward to use, which is why we use a CBOW-based model. With this method, we have an embedding matrix E that has #features rows. Usually the final representation of the input is obtained by aggregating all input features, either by the sum or the mean value. However, this assumes that all features contribute equally to the final prediction:

E = np.random.uniform(-1, 1, size=(#features, #dim))*scale
x = [i1, i2, i3, ..., ik]
U = E[x]
h = np.mean(U, axis=0)

Instead, we want the model to put more focus on “relevant” aspects:

x = [i1, i2, i3, ..., ik]
U = E[x]
g = np.tanh(np.dot(U, v) + bias)
a = softmax(g)
h = np.sum(a[:, None] * U, axis=0)

This is in the spirit of [arxiv:1512.08756], where “v” is a vector of #dim dimensions and bias is a scalar.

With such an attention mechanism, we get a vector “a”, with a length equal to the number of input features with only positive entries such that the sum equals one, like a=[0.3, 0.6, 0.1]. Then, “h” is a weighted combination of all features:
h = 0.3 * U[0] + 0.6 * U[1] + 0.1 * U[2].
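
For completeness, here is a self-contained numpy sketch of the attention step. The sizes, the feature indices and the random initialization of “v” are made up for illustration; in a real model E, v and bias are learned. The softmax helper is defined here because numpy does not ship one.

```python
import numpy as np


def softmax(z):
    """Softmax over a 1-d array, shifted by the maximum for stability."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()


def attend(E, x, v, bias):
    """Return the attention weights and the weighted bag-of-words vector."""
    U = E[x]                            # (k, dim): embeddings of the input features
    g = np.tanh(U @ v + bias)           # (k,): one score per input feature
    a = softmax(g)                      # (k,): positive weights that sum to one
    h = np.sum(a[:, None] * U, axis=0)  # (dim,): weighted combination
    return a, h


rng = np.random.default_rng(0)
n_features, n_dim = 50, 8               # hypothetical sizes
E = rng.uniform(-1, 1, size=(n_features, n_dim)) * 0.1
v = rng.normal(size=n_dim)

x = [3, 7, 11]                          # made-up feature indices
a, h = attend(E, x, v, bias=0.0)
print(a, h.shape)
```

Note that “a” has to be expanded to a column before it is multiplied with “U”, otherwise numpy would try to broadcast the weights along the embedding dimension.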

When we think of our initial example, the different weights are likely to reflect the importance of a word with respect to the genre to predict. For instance, “sheriff” and “ranch” are probably more relevant for the western genre than “explosion” or “brother”, assuming that the dataset contains enough classical western movies to back this up.

Bottom line, if the input data is not ordered, it is not obvious how to learn with a recurrent model. On the other hand, bag-of-words models treat all input features equally, which can hurt the performance when the importance of features is conditional. With the illustrated approach, we are able to work with variable-length data and, furthermore, we use attention to re-weight portions of the input. And finally, as stated in [arxiv:1512.08756], the evaluation can be done in parallel, since a step does not depend on the previous one, unlike in RNNs.

The conclusion is that we can use a simple feed-forward network in combination with attention to handle bag-of-words data in a very efficient way. The next step is to incorporate the method into existing models and to evaluate it to study the benefits, if there are any at all.

Just Getting ML Things Done

It’s true that at some point, you might need full control of the situation, with access to the loss function and maybe even the gradients. But sometimes, all you need is a hammer and a couple of nails without fine-tuning anything. Why? Because you just want to get the job done, ASAP. For example, we had a crazy idea that is likely to NOT work, but if we just need 10 minutes of coding and 30 minutes of waiting for the results, why not try it? Without a proper framework, we would need to adjust our own code, which can be a problem if there are lots of “best practices” one needs to consider, as is the case for recurrent nets. Then it’s probably a better idea to use a mature implementation to avoid common pitfalls, so you can just focus on the actual work. Otherwise frustration is only a matter of time, and spending hours on a problem that someone else has already solved will lead you nowhere.

Long story short, what we needed was a front-end for Theano with a clean interface and without the need to write tons of boilerplate code. After some investigation, and by focusing on support for recurrent nets, we decided to give Keras a try. The actual code to build and train the network has fewer than 20 lines, thanks to the sophisticated design of the framework. In other words, it allows you to get things done in a nice, but also fast way, if you are willing to sacrifice the ability to control every aspect of the training.

After testing the code with a small dataset, we generated the actual dataset and started the training. What can we say? The whole procedure was painless. Installation? No problem. Writing the code? Piece of cake. The training? Smooth without any fine-tuning. Model deployment? After installing h5py it worked without any problems ;-).

Bottom line, we are still advocates of Theano, but writing all the code yourself can be a bit of a burden, especially if you throw all the stuff away after a single experiment. Furthermore, if your experiment uses a standard network architecture without the need to tune or adjust it, mature code can avoid lots of frustration in the form of hours of bug hunting. Plus, it’s likely that the code contains some heuristics to work around known problems that you might not be aware of.

For clarification, we are not saying that Keras is just a high-level front-end that does not allow any customization; what we are saying is that it does a great job of providing one in case you don’t need all the expert stuff! And last but not least, it allows you to switch the back-end in case you want something other than Theano. We like the concept of providing a unified API for different back-ends a lot, because back-ends can have different strengths and weaknesses, and the freedom to choose allows you to write code against one API while switching back-ends as you need it.

Predicting The Next ‘Thing’

It is the dream of all machine learning guys to find a way to use the available data in combination with some unsupervised learning algorithm to train a useful representation of the data. Yes, we are drastically simplifying things here, but the point is to learn without the need to label the data, which is very expensive.

For example, there are tons of documents available which could be used for learning, but the problem is: what cost function do we want to optimize? In the case of word2vec and friends, we try to predict surrounding or center words without explicit labels. This works very well, but the result is an embedding of words, and besides simple aggregation methods, there is no general way to represent documents with a learned embedding in a meaningful way. However, it is still a simple but powerful approach that can easily utilize huge amounts of unlabeled text data to learn a useful representation.

Another example is a recently published paper [arxiv:1704.01444] that is also using a large text corpus without labels, at least for the first model, to just predict the next character of the data block. So far, this is nothing new, but it is remarkable that a single unit learned to predict the sentiment of a data block. In other words, all those models learn by predicting the next “thing” which can be, for instance, a word, a character, or some other token.

The interesting part is that such an “autoregression” model can be learned by just taking a sequence, removing the last item and trying to predict it, given the previous data. This also works for sets, but the process is not straightforward since sets are not ordered. Furthermore, it is not obvious how to select the item, since there is no “previous” data.

Bottom line, it has been demonstrated several times that it is possible to learn a good representation of data by just predicting the next token. Nevertheless, such methods are often limited to short texts, since processing longer texts requires remembering lots of data or context, especially for models based on RNNs.

However, since we are usually dealing with short descriptions of items, with the exception that we handle sets rather than sequences, we adapted the method and trained a model to predict a keyword from the set, given the rest of the set, with moderate results. Despite the problems we encountered, we still believe that no (strongly) supervised model will ever be able to learn a powerful but also general representation of data. Thus, it seems a good idea to follow the research track and address the existing problems one by one, until we eventually find a method that addresses the major hurdles.
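
To make the difference between sequences and sets concrete, here is a minimal sketch of how training examples could be generated. For a sequence the target is simply the last item; for a set, the target is sampled uniformly in this sketch, which is an assumption and only one of several possible ways to select it. The function names and the toy data are made up.

```python
import random


def sequence_example(seq):
    """Predict the last item of a sequence from everything before it."""
    return seq[:-1], seq[-1]


def set_example(items, rng):
    """Hold out one randomly chosen item of a set and keep the rest as context."""
    items = sorted(items)  # fixed order, so the sampling is reproducible
    target = rng.choice(items)
    context = [item for item in items if item != target]
    return context, target


prefix, last = sequence_example(list("abcd"))

keywords = {"town", "sheriff", "ranch"}  # hypothetical keyword set
context, target = set_example(keywords, random.Random(0))
print(prefix, last)
print(context, target)
```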