
PyTorch: Kludges To Ensure Numerical Stability

After we decided to switch to PyTorch for new experiments, we stumbled over some minor problems. They are no big deal and the workarounds are straightforward, but one should be aware of them to avoid frustration. Furthermore, it should be noted that the framework is flagged as “early beta” and this is part of the adventure mentioned on the website :-).

We extended an existing model by adding a skip-gram-like loss to relate samples with tags in both a positive and a negative way. For this, we are using the classical sigmoid + log loss:

sigmoid = torch.nn.functional.sigmoid
dot_p = torch.dot(anchor, tag_p)
loss_pos = -torch.log(sigmoid(dot_p)) #(1)
dot_n = torch.dot(anchor, tag_n)
loss_neg = -torch.log(1 - sigmoid(dot_n)) #(2)

The critical point is log(0): the logarithm is undefined at zero, and in PyTorch -log(0) evaluates to “inf”. There are two ways this can happen:
(1) sigmoid(x) = 0, which means x is a “large” negative value.
(2) sigmoid(x) = 1, which means x is a “large” positive value.
In both cases, the loss becomes -log(0) = inf, and once such a value enters the computation the gradients are no longer finite, which makes further optimization steps useless.
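To make the failure mode concrete, here is a minimal NumPy sketch of the same computation (the dot-product values are made up; the PyTorch version behaves analogously):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Suppress the overflow/divide warnings so we can inspect the results.
with np.errstate(over='ignore', divide='ignore'):
    dot_p = -800.0
    loss_pos = -np.log(sigmoid(dot_p))        # sigmoid saturates at 0 -> inf
    dot_n = 800.0
    loss_neg = -np.log(1.0 - sigmoid(dot_n))  # sigmoid saturates at 1 -> inf
```

Both losses come out as infinity, and every gradient derived from them is non-finite as well.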

One possible workaround is to bound the values of sigmoid to be slightly above zero and slightly below one, with eps ~1e-4:

value = torch.nn.functional.sigmoid(x)
value = torch.clamp(value, min=eps, max=1-eps)

With this adjustment, sigmoid(dot_p) is always slightly positive and (1 - sigmoid(dot_n)) also never evaluates to zero.

It might be that the pre-defined loss functions in PyTorch do not suffer from this problem, but since we usually design our own loss functions from scratch, numerical instabilities can happen when we combine certain functions. With the described kludge, we no longer encountered any problems during the training of our model.
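Another option, instead of clamping, is to rewrite the loss so that the log never sees a zero, using the identities -log(sigmoid(x)) = softplus(-x) and -log(1 - sigmoid(x)) = softplus(x). A minimal NumPy sketch of this trick (PyTorch offers torch.nn.functional.softplus, and its BCEWithLogitsLoss uses the same idea; the threshold 30 is our choice):

```python
import numpy as np

def softplus(x):
    # Numerically stable log(1 + exp(x)): for large x, log1p(exp(x)) ~ x,
    # so we switch to the identity branch and never overflow in exp.
    return np.where(x > 30, x, np.log1p(np.exp(np.minimum(x, 30))))

dot_p = -800.0
loss_pos = softplus(-dot_p)  # equals -log(sigmoid(dot_p)), but stays finite
dot_n = 800.0
loss_neg = softplus(dot_n)   # equals -log(1 - sigmoid(dot_n)), stays finite
```

For moderate inputs the rewritten loss agrees with the naive formula, and for saturated inputs it degrades gracefully to |x| instead of inf.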

Again, we are pretty sure that these issues will be addressed over time, but since PyTorch is already very powerful, elegant and fast, we do not want to wait until then. In other words, we really appreciate the hard work of the PyTorch team, and since we chose to use a framework in an “early-release beta”, it’s only fair to be patient. Of course we are willing to help the project, for example by reporting bugs, but in this case someone else already did (issue #1835).


Activation Sparsity For Non-Sigmoid Units

A lot has been written about the benefits of sparsity for the output values of neurons in a network. In our case, with bag-of-words data that is binary and high-dimensional, it allows a compact representation of the input, a disentangling of topical factors and an easier interpretation. For instance, let us assume that we have 64 units and all of them are activated regardless of the input pattern. This is, without a doubt, not very useful, and sparsity would clearly help to analyze which neuron is sensitive to which kind of latent topic. In the case of sigmoid units, with values in the range 0…1, there are different variants to encourage sparsity:
– the L1 norm on the hidden activation
– the Student-t penalty: log(1 + h**2)
– the cross entropy with a desired probability, like p=0.05:
  p * T.log(p / h) + (1 - p) * T.log((1 - p) / (1 - h))
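The cross-entropy variant from the list can be written down directly; a minimal NumPy sketch, where the helper name, the target rate p and the eps guard against log(0) are our choices:

```python
import numpy as np

def kl_sparsity_penalty(h, p=0.05, eps=1e-8):
    # h: (batchsize, num_units) sigmoid activations in (0, 1).
    # KL divergence between the desired rate p and the mean
    # activation of each unit over the mini-batch.
    h_mean = np.clip(h.mean(axis=0), eps, 1 - eps)  # guard against log(0)
    return np.sum(p * np.log(p / h_mean)
                  + (1 - p) * np.log((1 - p) / (1 - h_mean)))

h_dense = np.full((32, 64), 0.5)    # every unit active half the time
h_sparse = np.full((32, 64), 0.05)  # units active at the desired rate
```

The penalty vanishes when each unit's mean activation matches p and grows as the representation gets denser.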

But since sigmoid units are not really popular these days, the approaches based on probabilities are no longer applicable. With modern units like elu/relu/leaky relu that are unbounded on the positive side, a different approach is required. A method that can be used regardless of the type of neuron is to encourage output values to be close to a specific value, like p=0 for relu units. This can be done with the following penalizing term:
T.sum((p - T.mean(h, axis=0))**2)
In case of mini-batch training, “h” has the shape (batchsize, num_units) and “T.mean(h, axis=0)” is the mean activation of each unit over the mini-batch. The penalty is the sum of the squared differences between “p” and each unit mean.
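The same penalizing term can be sketched in NumPy (the helper name and the toy activations are made up):

```python
import numpy as np

def mean_activation_penalty(h, p=0.0):
    # h: (batchsize, num_units); penalize the squared distance between
    # the target value p and each unit's mean activation over the batch.
    return np.sum((p - h.mean(axis=0)) ** 2)

h = np.maximum(np.random.randn(32, 64), 0)  # relu-like activations
penalty = mean_activation_penalty(h)        # > 0 unless all unit means are 0
```

In a real model this term would be added to the loss with some weighting factor, which has to be tuned.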

Now the question is: why is such a sparsity constraint required at all when relu units + dropout already lead to sparse activations? Well, the problem is the high-dimensional, but very sparse, input data that often leads to feature representations where all hidden units are activated. As noted earlier, this is not useful at all and also a strong indicator that the model was not able to learn the underlying structure of the input data.

Our recent experiment with an autoencoder and threshold units did not suffer from the described problem, but the feature representation was still very dense, only 10% zeros, despite the fact that we used dropout and rectification. To analyze possible benefits of sparse features, we re-trained the model with the penalizing term. With the adjusted model, the portion of zeros increased to about 56%. The next step is to analyze the qualitative results of the sparse model.
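The quoted percentages boil down to counting zero entries in the feature matrix; a trivial NumPy helper for completeness:

```python
import numpy as np

def zero_fraction(features):
    # features: (num_samples, num_units) matrix of hidden activations;
    # returns the fraction of entries that are exactly zero.
    return float((features == 0).mean())

h = np.array([[0.0, 1.2], [0.0, 0.0]])
frac = zero_fraction(h)  # 3 of 4 entries are zero -> 0.75
```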

Bottom line, some network architectures provide natural sparsity and no extra work is required, but sometimes an explicit sparsity term is needed, especially for certain kinds of input data.


We are still busy with increasing the size of our training set and completing our lightweight ontology of feature words. But meanwhile, we thought it would be a great idea to make our data more descriptive by generating useful tags.

We are using a supervised scheme, because the size of our training set does not guarantee that we would really learn useful patterns otherwise. As a very simple example, we tried to train a model that predicts whether a movie has a strong ‘zombie’ theme or not. The assumption is that movies with a specific theme have a unique distribution of keywords.

The network architecture to learn to predict a tag is very simple. The keywords are binarized and serve as the input to the network. We use a single hidden layer with ReLU neurons, and a sigmoid layer with just a single output neuron to predict the probability of the theme. Every movie where the theme is explicitly present (in the meta data) is marked as ‘1’ and a random set of other movies is marked as ‘0’. We use AdaGrad because of the very different frequencies of the keywords, in combination with L2/L1 weight decay. For a simple theme like ‘zombies’ the model works very well and separates movies into positive and negative sets almost perfectly.
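A minimal NumPy sketch of such a tag model, assuming binarized keywords as input, one hidden ReLU layer and a single sigmoid output; for brevity it uses plain full-batch gradient descent on synthetic data instead of AdaGrad with L2/L1 weight decay, and all sizes as well as the toy ‘zombie’ rule are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for binarized keywords: the tag is "present"
# whenever at least two of the first five keywords are set.
num_keywords, num_hidden, n = 50, 16, 400
X = (rng.random((n, num_keywords)) < 0.2).astype(float)
y = (X[:, :5].sum(axis=1) >= 2).astype(float)

W1 = rng.normal(0, 0.1, (num_keywords, num_hidden)); b1 = np.zeros(num_hidden)
W2 = rng.normal(0, 0.1, (num_hidden, 1));            b2 = np.zeros(1)

def forward(X):
    h = np.maximum(X @ W1 + b1, 0)          # hidden ReLU layer
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))    # sigmoid output = confidence
    return h, p.ravel()

lr = 1.0
for step in range(1000):                    # plain full-batch gradient descent
    h, p = forward(X)
    g_out = ((p - y) / n)[:, None]          # d(mean BCE loss)/d(logit)
    g_h = g_out @ W2.T
    g_h[h <= 0] = 0                         # ReLU gradient mask
    W2 -= lr * (h.T @ g_out); b2 -= lr * g_out.sum(axis=0)
    W1 -= lr * (X.T @ g_h);   b1 -= lr * g_h.sum(axis=0)

_, p = forward(X)
accuracy = float(((p > 0.5) == (y > 0.5)).mean())
```

Since the sigmoid output is a confidence, thresholding `p` splits the movies into a positive and a negative set, as described above.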

The advantage of such a model is that all movies with a specific combination of keywords will be marked as ‘zombie’, and not only the ones that were (manually) annotated with it. Furthermore, because the output of the network can be treated as a confidence, the model can be used as a building block with other models to span a feature space that consists of pre-defined topics. And not to forget that we can use multiple themes to learn a single tag (e.g., ‘mummies’, ‘werewolves’, ‘vampires’ -> ‘supernatural’).

We also tried to train models for other tags, for instance ‘sports’ or ‘substance-abuse’. The precision of the results varies strongly, which is an indication that some themes are easier to capture with the limited set of keywords we have. Stated differently, it is obvious that a theme like ‘football’ is easier to learn than a more general theme like ‘team sport’, because for the latter there is likely more noise in the used keywords.

The power of tag models is not exhausted by the given examples. For instance, users could create their own tags, or the existing themes could be condensed into a new taxonomy that better models the similarity between themes.