Flow Gradient, Please Flow, Darn It!

With residual networks, batch normalization and sophisticated recurrent net like LSTMs, it seems that getting a network to “fly” is just a matter of data and time. It probably feels like that because most papers do not mention the effort the authors had to take before the network actually started learning. We have no evidence for that but we a hunch that a successful setup can take a lot of time, mostly to figure out the hyper-parameters and to tune the network architecture. But who knows, we could be also totally wrong.

What’s definitely useful is to visualize as much aspects of the learning as possible. This includes to keep track of stats like the gradient norm, magnitudes of activations, the accuracy and the loss, and many other things. One excellent example is TensorBoard that allows to visualize metrics during learning. But sometimes it just helps a lot to visualize the “flow” of the gradient through the network, to see if the training is healthy or not.

In our case, we tried to train a network that contains two LSTMs but despite the mitigation of vanishing gradients that LSTMS should have, there was no real flow in the network and thus, the loss did not decrease as expected. For each learning step, we dumped the norm of the whole gradient but this is not useful to learn something about the flow of it through the network per layer. So, we decided to plot the absolute mean of the gradient for each layer, excluding bias values, to debug which layer is responsible for “blocking” the backward flow of the error signal.

We used the following script:

def plot_grad_flow(net):
  lay_grads, lay_names = [], []
  for name, param in net.named_parameters():
    if param.requires_grad and 'bias' not in name:
      name = name.replace('.weight', '')
      mean = float(param.grad.abs().mean())
      print(" ++ %s=%.6f" % (name, mean))

  plt.bar(range(len(lay_grads)), lay_grads, align='center', alpha=0.3)
  plt.hlines(0, 0, len(lay_grads) + 1, linewidth=1, color="k" )
  plt.xticks(range(0,len(lay_grads), 1), lay_names, rotation=90)
  plt.xlim(left=-1, right=len(lay_grads) + 1)
  plt.ylabel("avg grad"); plt.title("gradient flow"); plt.grid(True); plt.tight_layout()

We borrowed ideas from a post in the PyTorch discussion forum[1] with some adjustments. For instance, without the tight layout step, not all text was visible on the x axis and we did not use plot, but the bar style.

It is hard to come up with heuristics for absolute values, but if a bar in the plot is almost zero for several iterations, it’s a good indicator that something is wrong. Since if it’s close to zero, the weights of this particular layer won’t get updated much and thus it is likely that the loss won’t go down, because the gradient is “blocked”.

Despite the fact that the visualization will not offer any concrete solutions, you have at least a clue where to start debugging which can help a lot if you have a 24 layer network.

[1] discuss.pytorch.org/t/check-gradient-flow-in-network/15063

Implementing Loss Functions

With PyTorch it’s pretty easy to implement arbitrary loss functions because of the dynamic computational graph. With it, you can use loops and other Python flow control which is extremely useful if you start to implement a more complex loss function. We follow the principle “readable first, performance later” which means we don’t care if the code is efficient, as long as it is easy to read and of course to easy to debug. When we are sure, after performing extensive tests, that we correctly implemented the loss function, then and only then, we think about optimization. The recipe from Adrej Karpathy[1] gives a good hint that we should overfit on a small data set first, before we start training on the whole dataset. This is very important since if we cannot get the loss to go down on a tiny data set, something is wrong and we need to debug our code.

We recently stumbled on a paper from Hinton[arxiv:1902.01889] with the note that a batched version of the loss is provided for TensorFlow (in the CleverHans package). However, since we are using PyTorch, we started from scratch to learn something about the challenges.

Let’s start with some notation. We have a batch X with the corresponding labels in Y, where X is of shape (n, dim) and Y has a shape of (n) and contains integer labels, while X contains the features from some hidden layer.

For a sample (x, y) the distance between x and some other sample x’ is used to define the probability to sample neighboring samples for x, see chapter 2 in[arxiv:1902.01889]. To turn distances into ‘similarity’ values, a gaussian kernel function is used: exp(-dist(x, x’)**2/T), where T is the temperature which is tunable to stronger consider also larger distances (if T is chosen larger) and dist is the euclidean distance betwen x and x’.

For a sample x_i of label y_i, the loss is calculated as:

same_i = sum(j=1 to n, j != i and y_i == y_j) exp(-dist(x_i - x_j)**2/T)
rest_i = sum(k=1 to n, k != i) exp(-dist(x_i - x_k)**2/T)
prob_i = same_i/rest_i
loss_i = -log(prob_i)

The (j != i) and (k != i) part is also known as “leave one out” which means, we skip the sample to avoid a self reference where exp(-dist(x_i, x_i)**2/T) is 1 because the distance is zero.

The idea is pretty simple: if two samples share the same label, the distance between the pair should be lower than to any other sample with a different label. This is the (y_i == y_j) part in same_i formula which selects all samples with the same label. The negative log step should be familiar from the softmax loss.

Let’s consider two cases before we start coding:
(1) A pair (x_i, x_j) share the same label and is actually pretty close. Thus, exp(-dist(x_i, x_j)**2/T) is close to 1 since dist(x_i, x_j) is close to zero. If other distances (x_i, x_k) are reasonable large, the term exp(-large**2/T) is close to zero. So, same_i looks like [0.99, 0.99] because all entries with the same labels are pretty close. And in case other entries are well separated, rest_i looks like [0.99, 0.01, 0.99, 0.01, 0.01] which means 0.99 are entries with the same label and all other entries have different labels. If we sum this up, and divide it, we get a probability ~0.98 and thus a loss of ~0.015 as expected.

(2) Now, if a pair (x_i, x_j) is further away and a sample with a different label x_k is closer to x_i, same_i now looks like [0.01, 0.99] since one entry has a larger distance. And since an impostor now has a smaller distance, rest_i looks like [0.99, 0.01, 0.01, 0.99, 0.01]. The values for rest_i are the same, because only two entries switched places, but the sum of same_i is different now. This means, the probability is now only ~0.5 and the loss is ~0.70.

Bottom line, the loss is supposed to be low, if all samples with the same label are densely located in the feature space, or at least, each pair with the same label is closer to to any other sample with a different label. This is the soft version of the classical k nearest neighbor method.

What follows is a naive implementation of the loss:

for i in range(batch_x.shape[0]):
 label = batch_y[i]
 batch_same = batch_x[(batch_y == label) & (idx != i)]
 batch_rest = batch_x[idx != i]
 dists_1 = ((batch_x[i] - batch_same)**2).sum(dim=1)
 dists_2 = ((batch_x[i] - batch_rest)**2).sum(dim=1)
 prob = torch.exp(-dists_1/temp).sum() / torch.exp(-dists_2/temp).sum()
 assert float(prob.data) <= 1
 loss = -torch.log(prob)
loss /= batch_x.shape[0]

That’s not scary at all thanks to some nice features in PyTorch. First, with masks, it is straightforward to filter entries in the batch and since we can also use the logical AND operation, it is possible to combine selection filters. For batch_same, we select all entries with the same label, except entry i, while for batch_rest everything except entry i is selected. The calculation of the squared distances is also easy. Then we turn the distances into ‘similarities’ sum them up as in the example and divide them to get the final probability to sample an entry with the same label. At the end, we average the loss and we are done.

No doubt that the code can be optimized, but what we wanted to show that with PyTorch the implementation of complex loss functions, one that use advanced indexing for example, is straightforward. To fully utilize GPU devices or to implement efficient batching is a different story we tell later.

[1] karpathy.github.io/2019/04/25/recipe/

Sharing Useful Hints How To Train Neural Nets

These times it is pretty easy to find useful stuff for neural nets with respect to architecture, frameworks and methods. Without a doubt there is also material available that gives you hints how to avoid common pitfalls, but a lot of text is about the illustration of more complex ideas like Bert, to give an overview about an area or an introduction to a framework.

This is already helpful, but sometimes the things that were *not* said are even more interesting. That’s why we welcome the blog by Andrej Karpathy[1] so, much since it contains a cookbook how to avoid common mistakes. This form of assembled knowledge is rather seldom, but so much requires to lower the level of frustration for newbies when they start to dive into neural nets.

By random we stumbled about a homework from a university[2]. It is about implemented recurrent nets and gives you exactly these hints that are to often missing. Like “As always, make sure you can overfit a very small training set as an initial test”[2], which is also present in the recipe[1]. Or “and small hidden sizes for the RNN may work better than you think”[2] which hints that you should start small and only increase the capacity if needed. Or that any net with a softmax should start with a loss of -log(1/#number_of_classes) [2, “Getting started”].

For recurrent nets, there are already some hints, like gradient clipping, but the proper value likely depends on the data plus the architecture, so it might be a good idea to give the actual value instead of justing “gradients were clipped”. Like initializing the gates of highway network with a negative value to bias the first evaluation towards a state. This is definitely not rocket science, but might help to focus on the actual problem and not to get the initial net up and running.

For neural net veterans or skilled researchers this might seem trivial, but to successfully train a neural net can be challenging for newbies, since there is a lot of stuff you have to keep in mind. Plus, lots of hyper-parameters affect each other and even if you do almost everything right it might still not work.

Bottom line, maybe we can make a it a rule also to talk about failures and how we fixed them instead of just telling the good parts of the story. If the code is available, those hints are sometimes given in the code, but scanning a whole code-base for hints can also be time consuming.

[1] karpathy.github.io/2019/04/25/recipe/
[2] _www.cs.utexas.edu/~gdurrett/courses/sp2019/a4.pdf

Character-Based Neural Language Models

With enough data and computational power, it’s quite easy these days to train a high-capacity language model. In this case, you can start from scratch since there is enough data, like Wikipedia, to learn all model parts jointly which includes a word embedding. However, not all of us have access to a large-scale machinery and thus it still is a good idea to think about solutions with a low memory footprint.

Of course we are aware that there are plenty of pre-trained word embeddings, also for non-English languages, and also some which include subword information. However, with 1M tokens and 1M ngrams, each with 300 dimensions, such a model alone needs ~1 GB memory which is pretty much for smaller projects.

So, are there any other options to train an embedding with fewer resources but that is similar with respect to the expressive power? Well, we did not perform real comparisons, but if we go down to the character level, the required resources for embeddings are much lower. The price for is that we now require more computational power. For example, a sequence of N words needs N steps with a recurrent net. In case of characters, we have #(word_1) + #(word_2) + … + #(word_n) steps or stated differently, the sum of the chars per word which is much larger than N. For a word-based approach, the embedding is N x dim, where N is the distinct number of words, where for character-based methods, it is M x dim, where M is the number of distinct chars which is usually much lower. Depending on the data, M is maybe 100, where N is often in the range of 25,000 and more.

The trade-off between time and space is a necessary evil, but we can “cheat” a little to improve our situation. What if we use the character-level for the embedding, to restrict the number of parameters, but at the same time, we let the recurrent net, or whatever is used in the architecture, operating on words. How is that possible? We borrow ideas from conv nets for text. The idea is to treat each word separately and learn a dynamic embedding. Too abstract? Okay, let’s discuss some details.

Let’s consider the sequence “beer is awesome”. The set of chars is: {‘a’, ‘b’, ‘e’, ‘i’, ‘m’, ‘o’, ‘r’, ‘s’, ‘w’} and we have three words [beer], [is], [awesome]. Now, we need to convert each sequence of chars, like [b, e, e, r], into a fixed representation, and of course also [i, s] and [a, w, e, s, o, m, e]. This can be easily done with a convolution (2d) and max-pooling over time. The clou is with such a cnn encoder can convert words of arbitrary lengths into a fixed representation. With such a function, we can learn the vocabulary dynamically per batch and use it for the input to the recurrent net which now operates on words again. Additionally to the computational advantage, we are also able to encode variations of words or even new words, as long as they use the same character set from the training. For example, [beers] or [awesomeness] can be also encoded and hopefully each embedding is close to the “parent” word.

In PyTorch this is pretty easy to code. We set char_dim=15 and n_filters=10:

char_emb = nn.Embedding(n_chars, char_dim, padding_idx=0)
c1 = nn.Conv2d(1, n_filters, (1, char_dim))

#sentence [n_batch, max_wordlen]
emb = self.char_emb(sentence).unsqueeze(1) # [n_batch, 1, max_wordlen, char_dim]
c1 = self.conv1(emb)# [n_batch, n_filters, 32, 1]
# maximum for each filter, across all chars(!)
h1, _ = c1.squeeze(3).max(2) # [n_batch, n_filters]

This toy code shows the basic idea, but is limited in its capacity. The issue can be addressed by using wider filters and more of them. It should be noted that this is not our work, but that we borrowed ideas from [arxiv:1508.06615] and from other excellent publicly available research.

Bottom line, with this conv-based word encoder that works on character-level, we are able to generate a dynamic embedding for arbitrary words. The embedding can then be used as input for other model components, like transformer networks, or recurrent nets. Furthermore, since most language models are trained in an unsupervised way, we can also use the character-embedding function to build embeddings for other tasks in a very compact way, since there is no embedding matrix, but a function to do the embedding that has much fewer parameters.

PyTorch – Combining Dense And Sparse Gradients

In case you a train a vanilla neural network, gradients are usually dense. However, in PyTorch, the embedding layer supports the “sparse=True” option to speed up learning in case of larger vocabularies. We wrote about it before[1]. The only optimizer that can handle both dense and sparse gradients is SGD and not to forget Adagrad. But it should be noted in case of momentum, for SGD, at least for version <1.0.0, there seemed a slowdown because of the accumulator after some epochs. Thus, we usually stayed with Adagrad which was a good choice.

Nevertheless, it seems not satisfying to be tied to one optimizer for all our problems. But what happens if you mix both types of gradients? Adam complaints it can't handle sparse gradients and SparseAdam complaints it can't handle dense ones. Maybe the solution is obvious, but probably not straightforward as it should. In the a discussion[2] a quite elegant solution is proposed to avoid to call multiple optimizers.

But before, let's precisely describe the problem. Let's say we've a simple network with an Embedding layer (sparse) and a Linear layer (dense) and the forward() function combines both. If we want to use Adam as the optimizer, we need one optimizer for the sparse layers (Embedding) and one for the rest. We are omitting bias values for brevity.

The learning step looks like that:

net = Network()
opt_sparse = torch.optim.SparseAdam([net.emb.weight], lr=3e-4)
opt_dense = torch.optim.Adam([net.out.weight], lr=3e-4)

y_hat = net(torch.LongTensor([1, 2, 3])
loss = ((y - y_hat)**2).mean()

It’s not hard to follow, but it could be probably a bit more intuitive. The thing is that we can’t use net.parameters(), at least not without filtering parameters that require a sparse gradient. At the end, we need two parameter lists and a separate optimizer for each type: sparse and dense.

At least from the software engineering point of view, also noted by [2], we can improve the situation a little.

class MultipleOptimizer:
 def __init__(self, *op):
  self.optimizers = op

def zero_grad(self):
 for op in self.optimizers:

def step(self):
 for op in self.optimizers:

With this helper, the new training code becomes:

net = Network()
opt_sparse = torch.optim.SparseAdam([net.emb.weight], lr=3e-4)
opt_dense = torch.optim.Adam([net.out.weight], lr=3e-4)
opt = MultipleOptimizer(opt_sparse, opt_dense)

y_hat = net(torch.LongTensor([1, 2, 3])
loss = ((y - y_hat)**2).mean()

And we can take this approach a little further by letting it figure out what layers lead to sparse gradients and which lead to dense ones.

Surely, there is more work to do, but our aim was to give a simple recipe how to handle sparse and dense gradients without the hassle to call multiple optimizers.

[1] raberrytv.wordpress.com/2017/06/28/efficient-embedding-models-with-pytorch/
[2] discuss.pytorch.org/t/two-optimizers-for-one-model/11085

Implementing The SMORMS3 Optimizer in PyTorch

Even so the Adam optimizer is a pretty solid choice if you begin to train your neural network, it might be possible that learning is still slow at the beginning. Furthermore, for more complex loss functions, it might be possible that you need to tweak beta and find the optimal learning rate to get the network to start learning reasonably fast. The same is true for almost any other optimizer which can be a bit frustrating. So the question is if there is no optimizer that requires no tweaking or at least very little? The answer is, well there is a variant that at least has no beta values and thus, you can focus solely on the learning rate.

Yes, we talk about SMORMS[1] a variant of RMSprop that was conceived by Simon Funk in 2015. To the best of our knowledge it got not much attention back then, even so our earlier experiments confirmed that it improved the learning for some tasks and often delivered the same performance as RMSprop & friends.

The reason we thought about it again is that for our current task, the designed network learns very slow and needs a lot of time until it begins to converge. We tried quite a lot optimizer and hyper-parameters but as the blog by Andrej Karpathy[2] impressively emphasizes it, network training often fails silently and it might require quite a lot of debugging to find out why (axis, weights, architecture, data, etc.).

So, we checked if somebody already implemented SMORMS3 in PyTorch but found no code and thus, we decided to implement it from scratch. That was also a valuable experience since the time of Theano, we usually do not implement optimizers ourselves. Except for the in-place operations, it is straightforward and you only need to follow the “Optimizer” interface by PyTorch.

The code that follows is free, hopefully also free from serious bugs and mainly for illustration. The purpose was not high-speed performance, but the runtime seems already satisfying.

class SMORMS3(Optimizer):
 Optimizer described by Simon Funk
 def __init__(self, params, lr=0.0025, eps=1e-16):
  Setup optimizer with parameters.
  lr: default learning rate
  eps: default epsilon
  defaults = dict(lr=lr)
  super(SMORMS3, self).__init__(params, defaults)
  self.eps = eps

 def step(self, closure=None):
  Perform a single gradient step for all parameters.
  loss = closure() if closure is not None else None
  for group in self.param_groups:
   lr = group['lr']
   for p in group['params']:
    if p.grad is None: # skip if param has no gradient
    grad = p.grad.data
    param_state = self.state[p]
    if 'mem' not in param_state: # setup accumulators once
     param_state['mem'] = torch.full_like(p.data, 1.)
     param_state['g'] = torch.full_like(p.data, 0.)
     param_state['g2'] = torch.full_like(p.data, 0.)
    mem = param_state['mem']
    g, g2 = param_state['g'], param_state['g2']
    # g = (1-r)*g + r*g, g2 = (1-r)*g2 + r*g**2
    r = 1. / (mem + 1.)
    g.mul_(1 - r).addcmul_(r, grad)
    g2.mul_(1 - r).addcmul_(r, grad**2)
    # mem = mem * (1 - g*g/(g2 + eps)) + 1
    div = g2 + self.eps
    mem.mul_(1 - (g**2 / div)).add_(1.)
    # lrate = min(lr, g*g/(g2 + eps))
    lrate = torch.clamp(g**2 / div, max=lr)
    # p = p - lrate*grad/(sqrt(g2) + eps)
    new_grad = lrate*grad / (g2.sqrt() + self.eps)
    p.data.add_(-1, new_grad)
  return loss

[1] sifter.org/~simon/journal/20150420.html
[2] karpathy.github.io/2019/04/25/recipe

Who’s Afraid of the Recurrent Wolf?

Compared to a couple of years ago, training of recurrent neural nets is now much easier. For instance, we trained a character-based RNN to classify certain patterns in the sub-titles of items collected from an electronic program guide (EPG). All we had to do was to collect some training data, label it and use PyTorch to train a model. The model was actually pretty shallow, just one embedding layer fed into some GRU cells followed by a linear layer that acts a softmax classifier. With the packed interface for RNNs of PyTorch, the training took less than two minutes and the precision is above 90%. No pain so far and it just felt like an ordinary feed-forward network, with respect to the training complexity!

Compared to our previous post on batching RNNs, we optimized our code, also thanks to code from the community. The new forward code looks like this:

# batch is list of lists where each list contains a word
# represented as a sequence of chars mapped to an integer
# batch = [[0, 8, 2], [5, 5], [1, 2, 3, 4]]

# first create a tensor with the length of each item in the batch
# and sort the list by decreasing order
batch_len = torch.LongTensor([x.shape[0] for x in batch])
batch_len, perm_idx = batch_len.sort(0, descending=True)

# next, pad the sequences with zeros so they all have the same size
# and adjust the batch order according to the length
batch_pad = pad_sequence(batch, batch_first=True, padding_value=0)
batch_pad = batch_pad[perm_idx]

# apply the embedding on the padded batch and pack the data
batch_emb = embedding(batch_pad)
batch_pack = pack_padded_sequence(batch_emb, batch_len, batch_first=True)

# create the initial rnn state h0, feed the padded batch and undo the packing
ht = torch.zeros(1, batch_emb.shape[0], n_rnn_units)
out, ht = self.rnn(batch_pack, ht)
out, _ = pad_packed_sequence(out, batch_first=True)

# retrieve the last state from each item by using the unpadded length(!)
idx = torch.arange(0, len(batch_len)).long()
out = out[idx, batch_len – 1, :]

# undo the sorting by length which recovers the original order of the input
_, unperm_idx = perm_idx.sort(0)
return out[unperm_idx]

The steps are necessary for large-scale datasets since otherwise, PyTorch processes one item at a time which does not benefit from any parallelism and is thus not applicable for real-world data.

Our next problem was a bit more challenging: We wanted to map all words with the same stem close in the embedding space and push irrelevant words away. We used a margin-based triplet approach that samples an anchor, a reference and a negative word. The margin was set to 0.3 and we used the cosine score forĀ  learning. Again, we started with a pretty basic model that included an embedding and single layer of GRU tanh units, followed by a linear layer to allow unbounded continuous values. Without a doubt this problem has a different complexity, since it is not an ordinary classification problem and thus relies on a good sample strategy. So far, this has nothing to do with recurrent nets and thus, we decided to evaluate the model on a small subset of the dataset with words that can be mostly mapped pretty easy. Stated differently, we wanted to overfit the model to see if it can successfully “learn”, or better remember, the required patterns.

We also used the chance to check different RNN styles and found that vanilla style, so called Elman nets, either learned very slow or not at all and thus decided to select GRUs because they have fewer parameters than LSTMs and are often not inferior. Furthermore, without any regularisation, results were often disappointing, since words with the same prefix get mapped pretty close in the embedding space which was the motivation to use dropout with p=0.1 after the embedding layer to fight easy memorization of patterns. With this setting, the final model delivered a solid performance, but it required more tuning and evaluation than the classification model. A trick we borrowed from seq2seq models is that we read text not from left to right, but from right to left which boosted the performance a lot.

However, the first challenge remains to get the batch interface up and working completely, without the necessity to undo the sorting outside the model class. The drawback is, if we see it correctly, that you can only use RNN variants provided by PyTorch, or you have to hack at least some code. They offer Elman, GRU and LSTM which is sufficient, but there is no easy way to use normalization layers inside the RNNs, except if you use hooks, like weight_norm, but that can be tricky.

Bottom like, working with PyTorch is still heaps of fun, but for beginners it can be a bit of a burden to get more complex models, especially recurrent ones, up-and-running on a large-scale setting. Thanks to the active community, you can always kindly ask for advice, but that does not change the fact that some design issues are sometimes trade-offs between performance and usability. This is perfectly okay, but it can be a bit frustrating if you start to implement a model in a naive, loop-based way, and see the performance will not allow you to use it on a real-world data without training for a very long time.

But who knows, maybe this problem is only relevant for us, since all cool kids use transformer networks
these days and avoid recurrent networks at all ;-).