To relate TV titles that come from the electronic program guide (EPG), we have decided to train an embedding that directly optimizes on “sentence-level” instead of just related words, like word2vec. That is in the spirit of StarSpace[arxiv:1709.03856] which is a fairly simple but approach but which is nevertheless a strong baseline.
The idea for the training is straightforward and uses the leave-out-out approach. We use either existing or manually annotated categories for a weak supervision. Then we sample N positive items from one category, and use N-1 items to predict the Nth item. Negative items are sampled from other categories. A title is encoded as the sum of all the embeddings of the words it contains: np.sum(E[word_id_list], 0) (bag-of-words). All vectors are normalized to lie on the unit-ball. Next, we combine all bag-of-words into bag-of-documents: np.mean(V, 0) where V is a matrix with #title rows and #dim columns.
The loss is the well-known triplet loss: np.maximum(0, margin – pos + neg), where pos = np.dot(bag_items, nth_item) and neg = np.dot(bag_items, neg_item) and margin is a hyper-parameter (0.2 by default). The idea is not to learn a hyperplane to separate classes, but to move related items closer and push unrelated items further away. The learning stops, when positive and negative pairs are separated by at least the given margin. The performance of such models largely depends on the sampling of positive and negative items
but this is not the concern of this post.
In contrast to titles from books, and to some degree movies, generic TV titles belonging to shows, reports or entertain, are very heterogeneous with lots of special characters and that can also include meta information. Therefore, we need a more sophisticated tokenizer to convert titles into “words”, but we also need to address the issue of rare “words”. The long-tail is always a problem for text, but in our case the domain is more like tweets with special emoticons and/or hashtags than traditional text. Throwing away
those “words” is no solution which is why we need to adjust the learning scheme.
In word2vec down-sampling of frequent words is used, but this does not really address the problem since we do not want to damp the learning signal for frequent “words”, but we want to boost the signal for rare “words”. That is why we decided to scale the gradients with the inverse frequency of the words. The procedure just requires a fixed lookup table: ID->WEIGHT, which is easy to implement.
The necessity of the procedure became obvious when we checked the result of our first model. We took an arbitrary title and used the cosine score to rank all other titles. The results looked promising, but from time to time there were outliers and we wanted to find out why. We started by removing single “words” from the offending title and repeated the ranking. We found out that the problem were often related to rare words that did not get much weight updates and thus, their position in the embedding space is something “arbitrary”. When the “word” was removed, the cosine score reduced dramatically. This also worked for other titles.
Thanks to PyTorch, the implementation of re-scaling the gradients was very easy:
def scale_grad_by_freq(parameters, lookup):
parameters = list(filter(lambda p: p.grad is not None, parameters))
for p in parameters:
g,grads = p.grad.data, g._values()
for j, i in enumerate(g._indices().view(-1)): grads[j].mul_(lookup[i])
The multiplication is done in-place and thanks to the sparse flag of the PyTorch Embedding module, we only re-scale a small subset of all embeddings. With this minor modification, a loss that involves rare “words” leads to a stronger error signal which partly compensates the fact that those “words” get fewer updates. This is not the holy grail, but a good example that a deeper understanding of a problem can minimize the level of frustration and gives you more time to enhance or tune your model.
Let’s imagine that we want to train a model that is using an embedding layer for a very large vocabulary. In on-line mode, you only work on very few words per sample which means you get a sparse gradient because most of the words do not need to be touched in any way.
The good news is that it already works with vanilla gradient descent and AdaGrad. However, since the latter eventually decays the learning rate to zero, we might have a problem if we need to visit a lot of samples to achieve a good score. This might be the case for our recent model that is using a triplet loss, because not every update has the same impact and using more recent gradient information would be more efficient. Thus, we decided *not* to use Adagrad.
As a result, we can only use stochastic gradient descent. There is nothing wrong with this optimizer, but it can take lots of time until convergence. We can address parts of the problem with momentum, since it accelerates learning if the gradient follows one direction, but it turned out that enabling momentum turns the sparse update into a dense one and that means we are losing our only computational advantage.
Again, the issue is likely to be also relevant for other frameworks and in general sparse operations always seem a little behind their dense colleagues. Since PyTorch is very young, we don’t mind challenges, as noted before.
We would really like to continue using it for our current experiment, but we also need to ensure that the optimizer is not a bottleneck in terms of training time. The good news is that the results, we got so far, confirm that the model is learning a useful representation. But with the problem of the long tail, we might need more time to perform a sufficiently large number of updates for the rare words.
So, in this particular case we do not have much options, but a good one would suffice and indeed there seems to be a way out it: Asynchronous training with multi processing. We still need to investigate more details, but PyTorch provides a drop-in replacement for “multiprocess” and a quick & dirty example seems to work already.
The idea is to create the model as usual, with the only exception to call “model.share_memory()” to properly share model parameters with fork(). Then, we spawn N new processes that all gets a copy of the model, with tied parameters, but each with its own optimizer and data sampler. In other words, we perform independent N trainings but all processes update the same model parameters. The provided example code from PyTorch is pretty much runnable out of the box:
## borrowed from PyTorch ##
from torch.multiprocessing as mp
model = TheModel()
model.share_memory() # required for async training
# train: function with "model" as the parameter
procs = 
for _ in xrange(4): # no. processes
p = mp.Process(target=train, args=model(,))
for p in procs: # wait until all trainers are done
The training function ‘train’ does not require any special code, or stated differently it, if you call it just once, single-threaded, it works as usual. As noted before, there are likely some issues that need to be taken care of, but a first test seemed to work without any problems.
Bottom line, the parallel training should help to increase the through-put to perform more updates which -hopefully- leads to an earlier convergence of the model. We still need to investigate the quality of the trained model, along with some hyper-parameters, like the number of processes, but we are confident that we find a way to make it work.
With the ability to actually see the values of tensors at each step of the computation, PyTorch is our red-hot favorite when it comes to ML frameworks. One reason is that it makes debugging so much easier. There are still some rough edges, but there is also a pretty active community that continually improves the framework and fixes existing bugs.
We recently stumbled about a paper [arxiv:1704.08384] that uses a knowledge-based memory in combination with attention and we wanted to try a similar approach to predict types for fragments of texts that often have very few tokens. The pre-processing took the most time, while the actual training and description of the model, thanks to PyTorch, was a piece of cake. Our idea can be implemented by combining some recently introduced methods and it does not require any new layer or module.
In our first approach we ignore the order of the tokens, but we are using a mask [arxiv:1612.03969] to weight individual dimensions of the embedding:
torch.sum(E(in_words) * M(in_words), 0)
where E, M are both matrices with shape=(#tokens, #dims). This allows us to convert an arbitrary sequence of tokens into a fixed length representation. The mask should be initialized to 1 for all entries which can be done with:
M.weight = torch.nn.Parameter(torch.from_numpy(np.ones((#tokens, #dims))).float())
The problem is that even if an example only references a very small subset of all tokens, the gradient update is dense which means the whole embedding matrix is updated. This problem is not limited to PyTorch, for instance, it is also present in Theano. For the latter we already described one way to fix it. In PyTorch this is of course also possible, but the approach is different.
Usually a model contains an module for the embedding
which leads by default to a dense gradient update. To switch to sparse gradient updates, we only have to adjust the initialization to
torch.nn.Embedding(#tokens, #dims, sparse=True)
and that is all.
However, in our PyTorch version the adjustment only worked with basic optimizers like Adagrad or SGD, but it refused to work with RMSprop or Adam. It seems some functionality is missing
torch.sparse.FloatTensor' object has no attribute 'addcmul_'
but we strongly believe that this is fixed pretty soon.
The performance gain in terms of the sparsity is pretty huge: When everything else is equal, the processing of a block took 7000 ms without sparsity, but only 950 ms with sparsity. This is an improvement of 86%.
Without the memory, the rest of the model is straightforward: First we encode the input tokens to get a fixed length vector, then we use a linear layer in combination with a softmax to predict the type.
To address the issue of unbalanced labels, we introduce a penalty that depends on the inverse frequency of the labels: log(#total / #total(y)). For example, the penalty of an almost common label is 1.17, while it is 3.66 for a rather seldom one.
In a first test, we used ~30 K tokens and five classes and we got reasonable results in less than an hour. After we finish to analyze the results, we plan to integrate the knowledge-base into the model, but this is a completely new story.
For some kind of data it is not unusual to have a couple of thousand dimensions but only very few of them carry actual values. Like a bag-of-word approach with thousands of binary features but on average 99.5% of them are zero. In case of a neural network this means we have a projection matrix W with N rows and M columns where N is the number of input features (~10,000). Since it is obvious that a dot-product just depends on non-zero entries, the procedure can be speed-up a lot if we use a sparse matrix instead of a dense one, for the input data. However, we only need the sparse tensor type once, since after the first layer, the output is always dense again.
The actual implementation in Theano is not a big deal. Instead of T.fmatrix(), we use sparse.csc_matrix() which comes from the sparse module of Theano:
from theano import sparse. If we use a generic projection layer, all we have to check is the instance type of the input tensor to use the appropriate dot function:
if type(input.output) is theano.sparse.basic.SparseVariable:
op = sparse.structured_dot
op = T.dot
That is all and the rest can stay as it is.
The idea of “structured_dot” is that the first operand, the input data, is sparse and the other operand, the projection matrix, is dense. The derived gradient is also sparse and according to the docs, both fprop and bprop is using C-code.
Bottom line, if the input dimension is huge but only very few elements are actually “non-zero” using a sparse matrix object is essential for a good performance. The fact that non-contiguous objects cannot be used on the GPU is a drawback, but not a real problem for our models since they are CPU-optimized anyway.
In the previous post about sparse topic modeling we discussed how to learn a sparse, distributed representation of topics in data. For the training, we used the co-occurrence matrix and in case of the sparse NMF method, only the sparsified matrix W is used and the output H is thrown away. The dimensions of the matrix W are (#words, #topics) which means we can encode data in terms of latent factors, but also words in terms of features. That means, each word is represented by #topics dimensions or stated differently, the word is encoded as the distribution to each topic. For example in case of #topics=4, an embedding of a word might look like [0.00, 0.91, 0.00, 0.21], indicating that the word is not present in topic 0 and 3, but strongly contributes to topic 1 and marginally to topic 4.
With this encoding, we can determine the cosine similarity of words to see what the learned representation looks like. For instance, the next neighbors of “rock-music” are:
– rock, rock-music, rock-band, rock-star, performer and concert
or for the word “coach”:
– team, coach, baseball, underdogs, football, sports
We tried several words and as expected, the next neighbors look plausible with regard to the co-occurrence data. However, for rare words, the score is decreasing very fast because there is only few data to sufficiently model the relation with other words. In contrast to other embedding models, the rare word problem cannot be addressed by adjusting the sampling strategy, because NMF is working on the whole matrix. The only chance to address the problem is some kind of pre-processing of the raw data, to influence the output of the co-occurrence matrix.
In a nutshell, the learned factor model can be also used as a word embedding which might be useful for information retrieval tasks, like suggesting relevant keywords in search queries. Furthermore, if the word embedding of all present words in a sample is averaged, the approach also allows to embed whole documents or movies into a single vector representation.
There are some algorithms that seem to be very powerful on the first look. For instance, we stumbled about a matrix factorization that is semi-supervised to cluster arbitrary data. To introduce a supervised signal, there are pairwise constraints to express that a pair of documents should be in the same cluster (W_like) or not (W_dislike). The model uses a similarity matrix S, plus W_like and W_dislike to build a matrix A: A = S + W_like – W_dislike.
We will illustrate an example from the movie domain. Let us assume that we have n=30K movies, which means S is a n * n matrix and so is W_* and we use k=50 clusters. When we assume that each value is encoded as a float32 value, we need about 3.5 GB memory. The required memory for the model parameters, is negligible, since we only need to store n * k for the cluster assignment and k * k for the centers, which is less than 6 MB memory.
Regardless of the efficiency and precision of models that are using a full similarity matrix, the bottleneck is the memory consumption during training. For 5K samples we need ~100 MB, but for 50K we need about ~9.5 GB, since the increase is not linearly. Combined with the fact that such models are transductive and do not allow to project unseen samples into the new space, the use is limited for datasets with lots of samples.
The impact could be alleviated if the similarity matrix could be stored as a sparse matrix since then the memory consumption could be drastically reduced. However, this requires that the similarity for lots of pairs of movies is zero. Plus, even if most of the full-fledged optimization frameworks support sparse matrices, they can be problematic.
Sometimes it is fun to let your mind drift in no particular direction. This way, we were reminded of a concept rooted in neural computation for no good reason: Sparse Coding.
Our ultimate goal is to learn topic neurons that represent latent higher-level concepts in movies. A movie usually consists of several topics, but not all of them have the same depth. This is equivalent to a very simple sparse model where each topic is represented by a base (“dictionary”) and the feature representation of a movie is then the sequence of the associated weights for each dictionary.
We illustrate this with a simple example. For the sake of simplicity, we only use five topic dictionaries. Each represents a single high-level topic: dystopia, space, black-comedy, romantic and action. To represent a movie in the feature space, we treat the keywords of it as a single sample x. Furthermore, let X be the sequence of the dictionaries. Now, we need to solve an optimization problem that minimizes the distance |y – Xw|. The vector w is our weight vector for the dictionaries. There are several methods available, but we used the Orthogonal Matching Pursuit because of its performance and the ability to produce sparse solutions.
Let us further assume that our example movie is a science fiction film that plays in a world where omnipotent corporations control most of the world. A possible encoding with respect to the dictionaries could look like that (0.7, 0.5, 0.1, 0., 0.) which can be interpreted as clear focus on the the first three topics (dystopia, space, black-comedy) while the other topics are not present at all. It should be noted that negative weights are also possible.
This feature representation could be easily used for a semantic clustering since the compressed encoding of the movies focuses on high-level concepts and is robust against minor differences of movies on the keyword-level. In other words, if for “noir” movies the same neurons are active, with similar excitation levels, they would be in the same cluster, even if there are minor differences in their plots.