The Opposite of Something

When the size of the vocabulary is very large, learning embeddings with the negative log-likelihood, and thus the softmax, can be pretty slow. That’s why negative sampling was introduced: it’s very fast and can lead to competitive results while using far fewer resources. However, there is a serious drawback. Similar to the triplet loss, the performance largely depends on generating proper positive and negative pairs. This is nothing new, but it has recently been confirmed again by an analysis reported in the StarSpace [arxiv:1709.03856] paper.

The problem is that selecting “trivial” negative candidates might result in no loss or a very low loss, since those items are likely already well separated from the positive item. Furthermore, there is often no clear strategy for what the opposite of something is. For instance, let’s assume that we have two positive items related to “cooking” and we now need one or more negative items as a contrastive force. Are items from “cars” better suited than items from “news”? Are they more “opposite”? A solution could be hard negative mining: find items that clearly violate the margin and thus lead to a higher loss, which means some learning occurs. But this procedure is computationally very expensive and not feasible if we have thousands of candidates or more.
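To make this a bit more concrete, here is a minimal sketch of margin-based hard negative selection. The function name, the cosine similarity and the margin value are our own illustration and not taken from the StarSpace paper; the idea is simply to score a pool of sampled candidates and keep those that violate the margin:

import torch
import torch.nn.functional as F

def pick_hard_negatives(anchor, positive, candidates, margin=0.2, k=5):
    # anchor, positive: (dim,) embeddings; candidates: (n, dim) sampled negative embeddings
    pos_sim = F.cosine_similarity(anchor, positive, dim=0)
    neg_sim = F.cosine_similarity(anchor.unsqueeze(0), candidates, dim=1)
    # a candidate is "hard" if sim(anchor, neg) > sim(anchor, pos) - margin
    violation = neg_sim - (pos_sim - margin)
    _, idx = torch.topk(violation, min(k, candidates.size(0)))
    return candidates[idx]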

So, if we restrict the norm of each embedding instead of using an L2 weight-decay scheme that constantly pushes the weights down, the model will eventually “converge”, but we don’t know how many steps that requires. In other words, a straightforward (linear) model might often suffice, and we should instead invest more time in finding clever ways to perform the positive and negative sampling step.
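In PyTorch, for instance, such a norm restriction can be expressed directly on the embedding table; a tiny sketch, with an arbitrary limit of 1.0:

import torch.nn as nn

# rows that are looked up are renormalized so their L2 norm never exceeds max_norm
embed = nn.Embedding(10000, 64, max_norm=1.0)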

It is astonishing, and a little sad, that this issue has not received more attention in research; often only trivial examples are given that cannot be used for real-world problems. Without a doubt the issue is challenging, but since it often decides the performance of a model, it should be worth the time.


Flaming Winners

Recently, we read a paper that also mentioned winner-takes-all (WTA) circuits, and since we moved from Theano to PyTorch, we wanted to give the idea a try. This type of neuron is similar to maxout, but instead of reducing the output dimensions, the dimensions are kept and filled with zeros. Thus, a layer consists of groups of neurons, and in each group only the “fittest” neuron survives while the others are set to zero. For example, let’s assume that we have 128 neurons that should form 32 groups with 4 units each. In PyTorch this is just a linear layer: self.wta = nn.Linear(dim_in, 32*4). Next comes the implementation of the forward step, which is straightforward. We assume that the shape of the input tensor is (batch_size, dim_in).

def forward(self, input):
    h = self.wta(input)             # projection: (batch, 32*4)
    h = h.view(-1, 32, 4)           # reshape: (batch, 32, 4)
    val, _ = h.max(2)               # maximal value per group: (batch, 32)
    val = val[:, :, None]           # reshape: (batch, 32, 1)
    pre = val * (h >= val).float()  # keep the winner of each group, zero the rest
    return pre.view(-1, 32*4)       # reshape: (batch, 32*4)

That’s it. Definitely not rocket science, just a bit of juggling with the shapes of the tensors.
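For reference, the same thing wrapped into a self-contained module, with the group count and group size as constructor arguments; the class name and the numbers are just our example from above:

import torch
import torch.nn as nn

class WTALayer(nn.Module):
    def __init__(self, dim_in, n_groups=32, group_size=4):
        super(WTALayer, self).__init__()
        self.n_groups, self.group_size = n_groups, group_size
        self.wta = nn.Linear(dim_in, n_groups * group_size)

    def forward(self, input):
        h = self.wta(input).view(-1, self.n_groups, self.group_size)
        val, _ = h.max(2)       # winner value per group
        val = val[:, :, None]
        pre = val * (h >= val).float()
        return pre.view(-1, self.n_groups * self.group_size)

out = WTALayer(64)(torch.randn(8, 64))  # shape (8, 128); one surviving unit per group of 4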

PyTorch – Weight Decay Made Easy

In PyTorch the implementation of the optimizer does not know anything about neural nets, which means it is possible that the current settings also apply L2 weight decay to bias parameters. In general this is not done, since those parameters are less likely to overfit. Furthermore, the decay should also not be applied to parameters with a one-dimensional shape, meaning the parameter is a vector and not a matrix, which is usually the case for normalization modules like batch norm, layer norm or weight norm. So, how can we tell the optimizer in a principled way to set the decay of those parameters to zero?

With the introduction of the function named_parameters(), we also get a name along with each parameter value. For standard layers, biases are named “bias”, and combined with the shape we can create two parameter lists, one with weight decay and one without it. Furthermore, we can easily use a skip_list to manually disable weight decay for some layers, like embedding layers. The code is pretty simple:

def add_weight_decay(net, l2_value, skip_list=()):
    decay, no_decay = [], []
    for name, param in net.named_parameters():
        if not param.requires_grad:
            continue  # skip frozen weights
        if len(param.shape) == 1 or name.endswith(".bias") or name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [{'params': no_decay, 'weight_decay': 0.}, {'params': decay, 'weight_decay': l2_value}]

and the returned list is passed to the optimizer:

params = add_weight_decay(net, 2e-5)
sgd = torch.optim.SGD(params, lr=0.05)
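And if, for example, an embedding layer should be excluded from weight decay entirely, its parameter name can be passed via skip_list; the name word_embed.weight is just an illustration and depends on the attribute names of the actual model:

params = add_weight_decay(net, 2e-5, skip_list=['word_embed.weight'])
sgd = torch.optim.SGD(params, lr=0.05)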

That’s it. The behavior is documented, but we still think it’s a good idea to give an example, since in frameworks specialized in neural nets the default behavior might be different. Furthermore, the method is straightforward, but it requires some knowledge of the internals, and an example hopefully helps to better understand the process.

PyTorch – Freezing Weights of Pre-Trained Layers

Back in 2006, training deep nets was based on the idea of stacking pre-trained layers until the full network was built, followed by a final fine-tuning step to adjust all network weights jointly. With the introduction of batch norm and other techniques this has become obsolete, since we can now train deep nets end-to-end without much trouble. However, sometimes it is still beneficial to combine layers from pre-trained networks to give a network a direction in which to search for good solutions. For instance, a recurrent network could use word embeddings from an unsupervised learning step like GloVe or Word2Vec. Whether this makes sense surely depends on the problem, but it is possible.

We recently encountered a problem where we needed to predict multiple tags for a sequence of words. As a baseline we tried to train a continuous bag-of-words model with a skip-gram loss, but we found the performance not satisfying, mostly because the supervised loss failed to learn a good embedding of individual words. This is a common problem, since the representation of the data depends only on the error signal, and when it goes to zero, learning stops immediately. This helps to reduce the loss, but it might also hurt generalization, since getting the predictions for frequent items right drives the loss down faster than getting those of long-tail items right.

So, we decided to pre-train the embedding layer unsupervised, on a huge corpus of sentences (sequences of words), and then used the embedding in a network to predict tags for those sequences. We further decided to freeze the layer, which means the weights are not changed during learning. The reason is that we want to avoid a bias introduced by the supervised error signal. Next, we describe how this is possible in PyTorch.

Let’s assume we have a simple network:

import torch
import torch.nn as nn

class Network(nn.Module):
    def __init__(self, n_words, n_dim=5):
        super(Network, self).__init__()
        self.word_embed = nn.Embedding(n_words, n_dim, sparse=True)

def freeze_layer(layer):
    for param in layer.parameters():
        param.requires_grad = False

net = Network(1000)

By default in PyTorch, every parameter of a module (network) requires a gradient (requires_grad=True), which makes sense since we usually want to learn all parameters of a network jointly. However, in case of a pre-trained layer we want to disable backprop for that layer, which means its weights are fixed and do not receive any updates during the backward step.
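Applied to the network above, this is a single call. Some PyTorch versions also complain when parameters with requires_grad=False are handed to the optimizer, so it is safest to pass only the trainable ones; a small sketch that continues the snippet above (in a real model there would of course be trainable layers on top of the frozen embedding):

freeze_layer(net.word_embed)  # the embedding weights no longer receive updates

# collect only the parameters that still require gradients,
# e.g. for torch.optim.SGD(trainable, lr=0.05)
trainable = [p for p in net.parameters() if p.requires_grad]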

That’s it. Pretty easy, if you know a little about the PyTorch internals. With this in mind, we can use the setting to fine-tune a network, for example by learning only the weights of a new classifier on top of a pre-trained network, or we can combine a pre-trained layer at the bottom with a new network architecture.

Character-Level Positional Encoding

For documents, word-level embeddings work pretty well, because the vocabulary does not contain too many special cases. However, for tweets or (short) texts with a rather dynamic structure, using words might not be appropriate, because it is not possible to generalize to unknown words, including words that are pretty similar to existing ones but not quite the same. The issue can be tackled with recurrent nets that work on characters, but then the processing cannot be parallelized easily. The recent paper “Attention Is All You Need” describes an approach that uses a positional encoding to encode sequences of words without a recurrent net. Thus, the process can be parallelized more easily and therefore performs better. However, we still have the problem that we cannot generalize to unknown words except via an ‘UNK’ token, which is not very useful in our case.

To be more precise, we try to model titles of TV shows, series, etc. in order to predict the category, and there it is imperative that we can generalize to slight variations of those titles, which are likely to happen due to spelling errors and different formats across TV channels. Adapting the procedure to work on characters instead of words is straightforward. We just need to build a lookup map with the values calculated from the position and the dimension index, up to the maximal sequence length present in the training data. Then we can easily calculate the encoding of arbitrary inputs with this formula:

char_id_list  # the sentence as a sequence of character ids
embed = np.sum(pos_enc_map[np.arange(len(char_id_list))] * char_embed[char_id_list], 0)

and one possible lookup could be calculated by:

import numpy as np

result = np.zeros((max_seq_len, dim_len))
for pos in range(max_seq_len):
    a = np.arange(dim_len) + 1.
    if pos % 2 == 0:
        t = np.cos(pos / a)
    else:
        t = np.sin(pos / a)
    result[pos] = t
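As a quick illustration of how the two pieces fit together, here is a toy example; the numbers and the random char_embed matrix are placeholders, and result from the snippet above serves as pos_enc_map:

# hypothetical setup: assume the lookup above was built with max_seq_len = 50 and dim_len = 16
pos_enc_map = result                       # the table computed above
char_embed = np.random.randn(30, dim_len)  # stand-in for 30 learned character embeddings

char_id_list = np.array([3, 7, 7, 12, 0])  # a short title as a sequence of character ids
embed = np.sum(pos_enc_map[np.arange(len(char_id_list))] * char_embed[char_id_list], 0)
# embed has shape (dim_len,) regardless of the title length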

The lookup uses different weights for different positions, but it also treats each dimension differently, which brings more diversity into the encoding. And since the encoding does not depend on previous states, the semantic similarity of tokens that appear in multiple strings is preserved.

Bottom line, a positional encoding allows us to take the order of a sequence into consideration without using a recurrent net, which is very beneficial in many cases. Furthermore, the character-level encoding allows us to classify text sequences we have never seen before, which is very important due to the minor variations of titles.

Do We Need More Data or Better Ways to Use It?

One of the first demonstrations of the power of Deep Learning used 1,000 pictures per category and needed quite a lot of steps to build a model that worked. Without a doubt this was seminal work, but it also demonstrated that DL only vaguely resembles how humans learn. For instance, if a child had to look at 1,000 cups to get the concept of a cup, the lifespan of humans would be too short to survive without strong supervision. Another example is the recent breakthroughs in reinforcement learning, which also come at a certain cost, like a couple of thousand bucks a day for energy. In a lot of cases data, and even labels, might be no problem, but it often takes days or even weeks to turn them into a useful model. This is in stark contrast to the brain, which uses very little energy and is able to generalize with just a few, or even one, example. Again, this is nothing new, but it raises the question of whether we spend too little time on fundamental research and instead try too often to beat state-of-the-art results to get a place in the hall of fame. This viewpoint is probably too simple, since there is research that focuses on real-world usage, like WaveNet, but that also shows that you need lots of manpower to do it. Thus, most companies have to rely on global players or public research if they want to build cutting-edge A.I. products. The introduction of GPU clouds definitely helped, because it allows everyone to train larger models without buying the computational machinery, but using the cloud is not free either, and it gets worse if the training has to be fast, since then you need to buy lots of GPU time. The topic, in a broader context, has also recently been debated in [1]. In the spirit of that debate, the question is: how can we avoid running into a wall about 1,000 times before we realize it’s not a good idea?

[1] “Will the Future of AI Learning Depend More on Nature or Nurture?”

Stuck In A Local Minimum

It has never been easier to get access to very powerful machinery to learn something. There are lots of different frameworks to train your model, and with GPUs you can even train “bigger” models on commodity hardware. Furthermore, there is a lot of published research and there are cool blog posts that explain recent trends and new theories. In other words, almost everything is out there; you just need to find the time to digest all the content and turn it into something that solves your problem, right? Well, honestly, if you have one of those common problems, a solution is probably just around the corner and you don’t need much time or energy to solve it. Maybe there is even a pre-trained network that can be used directly, or you could ask around if somebody is willing to share one. But frankly, this is the exception rather than the rule, because very often your problem is special and the chance that existing code is available is close to zero. Maybe there are some relevant papers, but it is likely to take time to find them and even more time to implement the code. In other words, if the problem is easy, you often don’t need to do any research, but otherwise it can be a very long way with lots of obstacles just to get a hint where to start. In such cases, if you are lucky, you can ask people in your company or team to give you some hints, or at least to discuss the problem at eye level. But what if you have no access to such valuable resources? Doing research on your own takes time, and it is not guaranteed to lead anywhere if your time is limited, which is usually the case. So, what to do? It’s like training a neural network with lots of local minima, where in some configurations learning even gets totally stuck. This is nothing new, but sometimes we get the impression that the popular opinion is that all you need is a framework and to turn the knobs for as long as it takes to solve the problem. This is like having a racing car without the proper training to drive it: there is a chance to win a race, but it’s more likely that you wreck the car. The question is how to best spend your time when you want to solve a problem. Concentrate on the technology, or on the theory? A little bit of both? Or try to find existing solutions? This is related to our concern that we meet a growing number of ML engineers who seem to be just power users of a framework, without the ability to understand what is going on under the hood.