PyTorch – Combining Dense And Sparse Gradients

In case you a train a vanilla neural network, gradients are usually dense. However, in PyTorch, the embedding layer supports the “sparse=True” option to speed up learning in case of larger vocabularies. We wrote about it before[1]. The only optimizer that can handle both dense and sparse gradients is SGD and not to forget Adagrad. But it should be noted in case of momentum, for SGD, at least for version <1.0.0, there seemed a slowdown because of the accumulator after some epochs. Thus, we usually stayed with Adagrad which was a good choice.

Nevertheless, it seems not satisfying to be tied to one optimizer for all our problems. But what happens if you mix both types of gradients? Adam complaints it can't handle sparse gradients and SparseAdam complaints it can't handle dense ones. Maybe the solution is obvious, but probably not straightforward as it should. In the a discussion[2] a quite elegant solution is proposed to avoid to call multiple optimizers.

But before, let's precisely describe the problem. Let's say we've a simple network with an Embedding layer (sparse) and a Linear layer (dense) and the forward() function combines both. If we want to use Adam as the optimizer, we need one optimizer for the sparse layers (Embedding) and one for the rest. We are omitting bias values for brevity.

The learning step looks like that:

net = Network()
opt_sparse = torch.optim.SparseAdam([net.emb.weight], lr=3e-4)
opt_dense = torch.optim.Adam([net.out.weight], lr=3e-4)

y_hat = net(torch.LongTensor([1, 2, 3])
loss = ((y - y_hat)**2).mean()

It’s not hard to follow, but it could be probably a bit more intuitive. The thing is that we can’t use net.parameters(), at least not without filtering parameters that require a sparse gradient. At the end, we need two parameter lists and a separate optimizer for each type: sparse and dense.

At least from the software engineering point of view, also noted by [2], we can improve the situation a little.

class MultipleOptimizer:
 def __init__(self, *op):
  self.optimizers = op

def zero_grad(self):
 for op in self.optimizers:

def step(self):
 for op in self.optimizers:

With this helper, the new training code becomes:

net = Network()
opt_sparse = torch.optim.SparseAdam([net.emb.weight], lr=3e-4)
opt_dense = torch.optim.Adam([net.out.weight], lr=3e-4)
opt = MultipleOptimizer(opt_sparse, opt_dense)

y_hat = net(torch.LongTensor([1, 2, 3])
loss = ((y - y_hat)**2).mean()

And we can take this approach a little further by letting it figure out what layers lead to sparse gradients and which lead to dense ones.

Surely, there is more work to do, but our aim was to give a simple recipe how to handle sparse and dense gradients without the hassle to call multiple optimizers.



Implementing The SMORMS3 Optimizer in PyTorch

Even so the Adam optimizer is a pretty solid choice if you begin to train your neural network, it might be possible that learning is still slow at the beginning. Furthermore, for more complex loss functions, it might be possible that you need to tweak beta and find the optimal learning rate to get the network to start learning reasonably fast. The same is true for almost any other optimizer which can be a bit frustrating. So the question is if there is no optimizer that requires no tweaking or at least very little? The answer is, well there is a variant that at least has no beta values and thus, you can focus solely on the learning rate.

Yes, we talk about SMORMS[1] a variant of RMSprop that was conceived by Simon Funk in 2015. To the best of our knowledge it got not much attention back then, even so our earlier experiments confirmed that it improved the learning for some tasks and often delivered the same performance as RMSprop & friends.

The reason we thought about it again is that for our current task, the designed network learns very slow and needs a lot of time until it begins to converge. We tried quite a lot optimizer and hyper-parameters but as the blog by Andrej Karpathy[2] impressively emphasizes it, network training often fails silently and it might require quite a lot of debugging to find out why (axis, weights, architecture, data, etc.).

So, we checked if somebody already implemented SMORMS3 in PyTorch but found no code and thus, we decided to implement it from scratch. That was also a valuable experience since the time of Theano, we usually do not implement optimizers ourselves. Except for the in-place operations, it is straightforward and you only need to follow the “Optimizer” interface by PyTorch.

The code that follows is free, hopefully also free from serious bugs and mainly for illustration. The purpose was not high-speed performance, but the runtime seems already satisfying.

class SMORMS3(Optimizer):
 Optimizer described by Simon Funk
 def __init__(self, params, lr=0.0025, eps=1e-16):
  Setup optimizer with parameters.
  lr: default learning rate
  eps: default epsilon
  defaults = dict(lr=lr)
  super(SMORMS3, self).__init__(params, defaults)
  self.eps = eps

 def step(self, closure=None):
  Perform a single gradient step for all parameters.
  loss = closure() if closure is not None else None
  for group in self.param_groups:
   lr = group['lr']
   for p in group['params']:
    if p.grad is None: # skip if param has no gradient
    grad =
    param_state = self.state[p]
    if 'mem' not in param_state: # setup accumulators once
     param_state['mem'] = torch.full_like(, 1.)
     param_state['g'] = torch.full_like(, 0.)
     param_state['g2'] = torch.full_like(, 0.)
    mem = param_state['mem']
    g, g2 = param_state['g'], param_state['g2']
    # g = (1-r)*g + r*g, g2 = (1-r)*g2 + r*g**2
    r = 1. / (mem + 1.)
    g.mul_(1 - r).addcmul_(r, grad)
    g2.mul_(1 - r).addcmul_(r, grad**2)
    # mem = mem * (1 - g*g/(g2 + eps)) + 1
    div = g2 + self.eps
    mem.mul_(1 - (g**2 / div)).add_(1.)
    # lrate = min(lr, g*g/(g2 + eps))
    lrate = torch.clamp(g**2 / div, max=lr)
    # p = p - lrate*grad/(sqrt(g2) + eps)
    new_grad = lrate*grad / (g2.sqrt() + self.eps), new_grad)
  return loss


Who’s Afraid of the Recurrent Wolf?

Compared to a couple of years ago, training of recurrent neural nets is now much easier. For instance, we trained a character-based RNN to classify certain patterns in the sub-titles of items collected from an electronic program guide (EPG). All we had to do was to collect some training data, label it and use PyTorch to train a model. The model was actually pretty shallow, just one embedding layer fed into some GRU cells followed by a linear layer that acts a softmax classifier. With the packed interface for RNNs of PyTorch, the training took less than two minutes and the precision is above 90%. No pain so far and it just felt like an ordinary feed-forward network, with respect to the training complexity!

Compared to our previous post on batching RNNs, we optimized our code, also thanks to code from the community. The new forward code looks like this:

# batch is list of lists where each list contains a word
# represented as a sequence of chars mapped to an integer
# batch = [[0, 8, 2], [5, 5], [1, 2, 3, 4]]

# first create a tensor with the length of each item in the batch
# and sort the list by decreasing order
batch_len = torch.LongTensor([x.shape[0] for x in batch])
batch_len, perm_idx = batch_len.sort(0, descending=True)

# next, pad the sequences with zeros so they all have the same size
# and adjust the batch order according to the length
batch_pad = pad_sequence(batch, batch_first=True, padding_value=0)
batch_pad = batch_pad[perm_idx]

# apply the embedding on the padded batch and pack the data
batch_emb = embedding(batch_pad)
batch_pack = pack_padded_sequence(batch_emb, batch_len, batch_first=True)

# create the initial rnn state h0, feed the padded batch and undo the packing
ht = torch.zeros(1, batch_emb.shape[0], n_rnn_units)
out, ht = self.rnn(batch_pack, ht)
out, _ = pad_packed_sequence(out, batch_first=True)

# retrieve the last state from each item by using the unpadded length(!)
idx = torch.arange(0, len(batch_len)).long()
out = out[idx, batch_len – 1, :]

# undo the sorting by length which recovers the original order of the input
_, unperm_idx = perm_idx.sort(0)
return out[unperm_idx]

The steps are necessary for large-scale datasets since otherwise, PyTorch processes one item at a time which does not benefit from any parallelism and is thus not applicable for real-world data.

Our next problem was a bit more challenging: We wanted to map all words with the same stem close in the embedding space and push irrelevant words away. We used a margin-based triplet approach that samples an anchor, a reference and a negative word. The margin was set to 0.3 and we used the cosine score forĀ  learning. Again, we started with a pretty basic model that included an embedding and single layer of GRU tanh units, followed by a linear layer to allow unbounded continuous values. Without a doubt this problem has a different complexity, since it is not an ordinary classification problem and thus relies on a good sample strategy. So far, this has nothing to do with recurrent nets and thus, we decided to evaluate the model on a small subset of the dataset with words that can be mostly mapped pretty easy. Stated differently, we wanted to overfit the model to see if it can successfully “learn”, or better remember, the required patterns.

We also used the chance to check different RNN styles and found that vanilla style, so called Elman nets, either learned very slow or not at all and thus decided to select GRUs because they have fewer parameters than LSTMs and are often not inferior. Furthermore, without any regularisation, results were often disappointing, since words with the same prefix get mapped pretty close in the embedding space which was the motivation to use dropout with p=0.1 after the embedding layer to fight easy memorization of patterns. With this setting, the final model delivered a solid performance, but it required more tuning and evaluation than the classification model. A trick we borrowed from seq2seq models is that we read text not from left to right, but from right to left which boosted the performance a lot.

However, the first challenge remains to get the batch interface up and working completely, without the necessity to undo the sorting outside the model class. The drawback is, if we see it correctly, that you can only use RNN variants provided by PyTorch, or you have to hack at least some code. They offer Elman, GRU and LSTM which is sufficient, but there is no easy way to use normalization layers inside the RNNs, except if you use hooks, like weight_norm, but that can be tricky.

Bottom like, working with PyTorch is still heaps of fun, but for beginners it can be a bit of a burden to get more complex models, especially recurrent ones, up-and-running on a large-scale setting. Thanks to the active community, you can always kindly ask for advice, but that does not change the fact that some design issues are sometimes trade-offs between performance and usability. This is perfectly okay, but it can be a bit frustrating if you start to implement a model in a naive, loop-based way, and see the performance will not allow you to use it on a real-world data without training for a very long time.

But who knows, maybe this problem is only relevant for us, since all cool kids use transformer networks
these days and avoid recurrent networks at all ;-).

Clustering Without “K”

Clustering can be a very powerful tool to analyze the learned representation of a model. For instance, if we trained an arbitrary neural net on MNIST, the learned features should “easily” allow to cluster samples according their corresponding digit classes. Thus, in a perfect world, if we would use K-Means with k=10, one centroid for each digit class, we would assume each cluster resembles the average digit of this particular class. Of course it is not that simple, but it’s reasonable to assume that samples with the same label are “clustered” together in the learned high-dim feature space. When we are a bit more abstract, we could say that each “6” lives on the 6-manifold and in general that the manifolds of different classes are well separated.

However, even in case of supervised models, we cannot always set “k” to the number of classes and furthermore, for feature representations that were learned in an unsupervised way, we have no hint at all. Yes, there are some heuristics how to determine “k”, the number of clusters, but those methods do not always work very well and there is no guarantee that it works with the data at hand. A further problem is that most cluster algorithms, like kMeans, do not have the notion of “noise” and thus every sample is assigned to a cluster even if this heavily distorts the “decision boundary” of a centroid. So, what to do if we, for instance, have a learned word embedding and we want a partition of the data any prior knowledge of “k”?

It is no secret that there is a clustering method that automatically figures out “k”, DBSCAN, but it comes with a price. The price is computational complexity, since it requires to determine the “n” nearest neighbors of each sample. There are also optimizations for this procedure, but nevertheless, for larger datasets, off-the-shelf implementations often reach their limits. The complexity depends on two factors: First, the number of samples and second, the number of feature dimensions. With a GPU, the procedure can be scaled pretty easily, since at the end, it comes down to a matrix matrix multiplication, for which a GPU is very well suited. Still, DBSCAN requires some hyper-parameters that have a huge impact on the learned clustering. Namely the epsilon environment and the number of minimal samples in the neighborhood, for sklearn the parameters are named “eps” and “min_samples”. But even with “default” parameters, the clustering, especially for word embeddings, often leads to a reasonable solution that can be easily interpreted by humans. Furthermore, when used with wrong hyper-parameters, most samples are classified as noise which is unlikely for a well-trained embedding model. The question is if there are alternatives that require fewer tuning?

A while ago we stumbled about an algorithm “Chinese Whsipers” (doi: that performs a clustering of an arbitrary weighted graph which is similar to label propagation. Despite its simplicity, the algorithm is very powerful and efficient. It works roughly like this: Each node gets its own class. Then we shuffle the nodes and visit each, one by one. The class of a node “n” is updated according to the weighted class of the neighborhood, which means all nodes with an edge to “n”. Since each edge is weighted, we can aggregate the weights for each label to infer a ranking. The procedure is done until there are no longer changes in the classes, or until a maximal number of epochs, a pass through the nodes, is reached. At the end, usually a lot of classes has been “wiped out” and merged into a particular label that finally forms a cluster. There still might be singleton clusters, but for a reasonable data, like word embeddings, for instance, usually learned clusters resemble some concept of which some might be abstract and some are not.

One example is {"winter_day", "winter_saison", "winter_wonder", "winter_storm", "winter_garden", "winter_end", "winter_months", "winter"}

So, the last question is how can we convert a word embedding, a matrix with |words| rows and dim columns, into a graph? Since we have no labels, we must use something else to decide if there is an edge between two words. A straightforward way is to determine the “N” nearest neighbors of each word and define an edge from word a to b with the cosine score: edge(a,b)=cosine(a,b). It is true that we have a hyper-parameter again, the number of neighbors, but the learned clusters seem pretty robust for reasonable values of “N” like 20-30. To summarize the procedure: we have a word embedding matrix W with shape: |words| x dim, and we turn it into an adjacency matrix by creating edges for each word by considering the cosine score of its “N” nearest neighbors. Since the Chinese Whispers method only requires a graph as the input, the transformation can be done on any feature space that induces a metric to describe the similarity of samples.

We tested the algorithm successfully with word embeddings, but also on other output, like for instance, the features of a tag prediction system. The result was quite promising since the cluster assignments refelcted a subset of the trained labels, without seeing them, but also new tags to group related samples in topics.

Bottom line, clustering is extremely useful to analyze feature representations, or in general, to find patterns in unlabeled data, but the necessity to chose the number of clusters in advance can be a real burden. Both DBSCAN and Chinese Whispers allow to partition the space by itself guided only by the definition of the neighborhood environment. The benefit of those methods is that “intruder” samples are not artificially forced into the gravity of some cluster center and thus their impact on the final results can be limited.

The Power Of Linear Models

Without a doubt Deep Learning is extremely successful, but is it also always required, to solve a particular problem? Stated differently, if we can solve a problem with a linear model, we surely can solve it also with a neural net, since the latter is more expressive. However, in such a case, we likely waste resources, time + space, and it might be even possible that we need to tune the net to achieve the same performance as the linear model. Of course some domains are biased, like images where linear models are known to lack the expressiveness to solve even simple classification problems. But, if we take for instance a classification problem with thousands of human-generated features, even a linear SVM is likely to suffice, since the high-dimensional space increases the chance a lot that the data is already linearly separable or that at least the data is not as entangled as in the case of images.

So, let’s start with a linear SVM by minimizing the hinge loss which is max(0, 1 – y*y_hat). Because there are lots of off-the-shelf solutions available, all we have to take care of is the sparsity in the input. In case of cross-features it can easily happen that we have to cope with 250,000 features but each sample only has about 25. In case of Python support, there is a pretty good chance that there is support for sparse matrices via scipy/numpy and we are done. Nevertheless, to demonstrate how little effort is required to train such a model without any sophisticated libraries, we started from scratch.

The method is straightforward: First, we create a lookup L to map each word to a unique feature ID. The model W is an empty defaultdict of type float. We draw a random sample (x, y), y in [-1, 1], and sum up all features to get y_hat:
y_hat = sum([W[L[word]] for word in x])
First the model is empty and y_hat equals 0. Thus, the loss is: max(0, 1 – y*y_hat) = 1. Next, we derive the gradient, which is extremely simple for the hinge loss, and update the parameters accordingly.

for word in x:
 W[L[word]] -= lrate * (-y)

Since we treat all words as equally important, for the sake of simplicity, the update rule just increases/decreases a feature proportional to the learning rate according to the label:

w_i = w_i - lrate * (-1) => w_i += lrate
w_i = w_i - lrate * (1) => w_i -= lrate

This is neither magical nor unexpected, since we want features that are related to +1 samples to be positive and vice versa negative for -1 samples.

The beauty of this method is that we only need to update the model parameters, features, that are present in the drawn sample which is very efficient. Furthermore, with the lookup and the dictionary, every actual parameter update can be also performed very efficiently and there is no scalability problem in case of millions of features. It is also possible to perform on-line learning, because of the modest model size, which means we just update the model parameters continuously for each new sample (x, y). The advantage is that we can automatically cope with drifts in the preferences over time which means nothing more than to slowly shift the weight from some features to others.

Once the model W is trained, the parameters can be easily serialized as a Python dictionary, along with the lookup L. Then, at test time, the prediction just requires us to sum the weights for the present features in a new sample x, which is very fast:

y_hat = sum([W[L[word]] for word in x])

Bottom line, it is no secret that the power of a linear model lies in the features. However, if features are already available and there is no need to handcraft them, a simple feature combination and a linear model might already provide a very strong baseline. In other words, even if Deep Learning is awesome, we should always start with the simplest model and only increase the capacity if really needed.

fastText: Tinkering With Character N-Grams

Inspired by the recent advances in multi-task learning we also wanted to investigate the capabilities of those methods for simpler tasks. As stated by researchers, learning your own word embedding can easily lead to overfitting which is why we also wanted to use pre-trained embeddings. However, to address out-of-vocab words, it is mandatory that the model can also embed words that were unknown during training. For instance, fastText comes with n-gram support that allows exactly this and furthermore, it also provides models for other
languages than English.

However, with a vocab of 2M words and 300 dimensions, those models can be a bit cumbersome to work with. You require approximately 8 GB RAM just to hold the data in memory which is no problem for modern servers, but at your machine at home, it might be quite a lot since you also need to run other services. A further question is about the interface between the data and your favorite language. There is a very nice API for python, but it requires to load the full binary model and also introduces some minor overhead.

So, we decided to analyze the binary format of the model, to see if we can somehow represent the model more compactly. The good thing is that the format is quite simple: there is a dictionary, additional model parameters and at the end of the file there is the input and the output matrix.

For the pre-trained models, we have 2M n-gram buckets and 2M words, and each row has 300 dimensions (float32). A matrix is represented by a fixed 16 octet header: int64 rows, int64 cols and each vector is a sequence of 300 float32 values. Between the input and output matrix is a single octet that indicates if quantization has been used. With this numbers it is easy to go to the correct offset in the file to read the data.

The size of a matrix is: rows*cols*sizeof(float32) which is 2M * 300 * 4 ~ 2.2 GB. Actually, the input matrix is a concatenation of the vocabulary words and the buckets for the n-grams. Thus, the index 0…2M references the words and 2M…4M references the buckets which is why 2M is added to the index returned by the hash function. So, we have three matrices with a total size of ~7 GB.

Now, we come to the coding part which is pretty easy with python. You just need to calculate the total size of the data + 1 octet as a quantization flag and seek to this position relative from the end: lseek(fd, -total_size, 2). After, you parsed the header, you can simply read each vector by vec = unpack('300f', * sizeof(float32))) and store it in a numpy array. It is also possible to use float16 for the array since the loss of precision should not be noteable for similarity lookups. Since there is no direct lookup for each n-gram, we also need to port the hash function from c++ to python which is used to retrieve the row ID for each n-gram. Example:

h = 2166136261
for i in xrange(len(str)):
 h = h ^ int(ord(str[i]))
 h = h * 16777619
 h = h & 0xffffffff
return h % bucket_size

Because we cannot model a int32_t type, we need to ensure the boundaries with the AND mask. And bucket_size is 2M. To retrieve the embedding of a specific ngram, we use the hash function:

v_emb = W_ngrams[hash("%wh")]

To convert a new word into a lookup vector, we first generate all n-grams with a size between 3 and 6. Then we convert each n-gram to a row ID and sum all vectors up and dividing it by the length of the n-gram list, which is nothing more than the average.

id_list = [hash(x) for x in ngrams("%where%")]
emb = np.sum(W_ngrams[id_list], 0) / len(id_list)

It is not complicated to wrap the whole procedure into a single class that outputs an embedding vector for an arbitrary word. So, instead of fiddling with 7 GB we just have 2.2 GB as a single numpy array, in case we need to generate vectors for unknown words.

But there is a minor issue we need to address. The representation of words from the vocabulary is a combination of the actual embedding vector plus the sum of its n-grams. In other words, to perform a similarity lookup, we require both matrices to perform it. Since the accumulated representation does not change and the actual embedding vector is never use stand-alone, we decided to perform the pre-processing step and store the result as a separate matrix. Now, we can use this matrix to directly output word embeddings and/or to perform nearest neighbor queries.

W_words = [..]
W_grams = [..]
id_list = ngrams(vocab[0])
W_words = np.vstack((W_words[0], W_grams[id_list]))
W_word = np.sum(W_words, 0) / W.shape[0]

Finally, we have two matrices: (1) the accumulated vocabulary matrix, which can be compared to the output of a word2vec with n-gram support and (2) one for the 2M n-grams which can be used with the hash function for lookups.

Bottom line, it is very helpful to have access to pre-trained embeddings which were trained on a large-scale corpus since very often, your data at hand is not sufficient to train ‘unbiased’ embeddings that generalize well. Furthermore, the n-gram support allows you to embed also oov words which will definitely occur and neither random inits nor UNK token are really appropriate to address them.

How to Handle Out-of-Vocabulary Words Pragmatically

Using pre-trained word embeddings can be a huge advantage if you don’t have enough data to train a proper embedding, but also if you are afraid to overfit to the task at hand if you train it jointly with your model. In case of English you find quite a lot word embedding models that are publicly available, but it’s different for other languages like French, German or Spain. There are some available, like the ones provided with fastText, but depending on the language, it can be challenging. So, let’s assume that you have some luck and there is a pre-trained model, then the next challenge waits just around the corner. The problem is that for specific tasks, there are definitely words that are not present in the vocabulary. There are some dirty solutions like mapping all those words to a fixed token like ‘UNK’ or using random vectors, but none of these approaches is really satisfying.

In case of fastText there is a clever, built-in, way to handle oov words: n-grams. The general idea of n-grams is to also consider the structure of a word. For instance, with n=3 and word=’where’: [%wh, whe, her, ere, re%]. The % are used to differ between the word “her” and the ngram her, since the former is encoded as “%her%” which leads to [%he, her, er%]. In case of a new word, the sum of n-grams is used to encode the word which means as long as you have seen the n-grams, you can encode any new word you require. For a sufficiently large text corpus, it is very likely that a large portion, or even all, required n-grams are present.

Since most implementions, also fastText, is using the hashing trick, you cannot directly export the mapping n-gram vector, however, there is a function to query n-grams for a given word:

$ fasttext print-ngrams my_model "gibberish"

There is an open pull request (#289), to export all n-grams for a list of words, but right now to call fasttext for each new word which is very inefficient in case of huge models. Without knowing about the pull request, we did exactly the same. First we slightly modified the code to accept a list of words from stdin and then we performed a sort with duplicate elimination to get a distinct list of all n-grams which are present in the model.

Now, we have a pre-trained model for the fixed vocabulary, but additionally, we also have a model that allows us to map oov words to the same vector space as long as the n-grams are known. We also evaluated the mapping with slightly modified or related words which are not in the vocabulary with very good results.

Bottom line, without a doubt n-grams are not the silver bullet but they help a lot if you work with data that is dynamically changing, which includes spelling errors, variations and/or made-up words. Furthermore, publicly available models often deliver already solid results which takes the burden from you to train a model yourself which might overfit to the problem at hand or is not satisfying at all because you don’t have enough data. In case a model does not come with n-gram support, there is also a good chance to transfer the knowledge encoded in the vectors into n-grams by finding an appropriate loss function that preserves this knowledge in the sum of n-grams.