To relate TV titles that come from the electronic program guide (EPG), we have decided to train an embedding that directly optimizes on “sentence-level” instead of just related words, like word2vec. That is in the spirit of StarSpace[arxiv:1709.03856] which is a fairly simple but approach but which is nevertheless a strong baseline.
The idea for the training is straightforward and uses the leave-out-out approach. We use either existing or manually annotated categories for a weak supervision. Then we sample N positive items from one category, and use N-1 items to predict the Nth item. Negative items are sampled from other categories. A title is encoded as the sum of all the embeddings of the words it contains: np.sum(E[word_id_list], 0) (bag-of-words). All vectors are normalized to lie on the unit-ball. Next, we combine all bag-of-words into bag-of-documents: np.mean(V, 0) where V is a matrix with #title rows and #dim columns.
The loss is the well-known triplet loss: np.maximum(0, margin – pos + neg), where pos = np.dot(bag_items, nth_item) and neg = np.dot(bag_items, neg_item) and margin is a hyper-parameter (0.2 by default). The idea is not to learn a hyperplane to separate classes, but to move related items closer and push unrelated items further away. The learning stops, when positive and negative pairs are separated by at least the given margin. The performance of such models largely depends on the sampling of positive and negative items
but this is not the concern of this post.
In contrast to titles from books, and to some degree movies, generic TV titles belonging to shows, reports or entertain, are very heterogeneous with lots of special characters and that can also include meta information. Therefore, we need a more sophisticated tokenizer to convert titles into “words”, but we also need to address the issue of rare “words”. The long-tail is always a problem for text, but in our case the domain is more like tweets with special emoticons and/or hashtags than traditional text. Throwing away
those “words” is no solution which is why we need to adjust the learning scheme.
In word2vec down-sampling of frequent words is used, but this does not really address the problem since we do not want to damp the learning signal for frequent “words”, but we want to boost the signal for rare “words”. That is why we decided to scale the gradients with the inverse frequency of the words. The procedure just requires a fixed lookup table: ID->WEIGHT, which is easy to implement.
The necessity of the procedure became obvious when we checked the result of our first model. We took an arbitrary title and used the cosine score to rank all other titles. The results looked promising, but from time to time there were outliers and we wanted to find out why. We started by removing single “words” from the offending title and repeated the ranking. We found out that the problem were often related to rare words that did not get much weight updates and thus, their position in the embedding space is something “arbitrary”. When the “word” was removed, the cosine score reduced dramatically. This also worked for other titles.
Thanks to PyTorch, the implementation of re-scaling the gradients was very easy:
def scale_grad_by_freq(parameters, lookup):
parameters = list(filter(lambda p: p.grad is not None, parameters))
for p in parameters:
g,grads = p.grad.data, g._values()
for j, i in enumerate(g._indices().view(-1)): grads[j].mul_(lookup[i])
The multiplication is done in-place and thanks to the sparse flag of the PyTorch Embedding module, we only re-scale a small subset of all embeddings. With this minor modification, a loss that involves rare “words” leads to a stronger error signal which partly compensates the fact that those “words” get fewer updates. This is not the holy grail, but a good example that a deeper understanding of a problem can minimize the level of frustration and gives you more time to enhance or tune your model.