Rare Words Revisited

It is well known that words that occur rarely are often nevertheless very descriptive, and the same holds for the domain of non-textual items. The problem is that many models train embeddings solely from item co-occurrences, which simply does not work for rare words or items. The issue was also discussed in a 2016 paper [arxiv:1608.00318] that proposed combining knowledge graphs with recurrent language models. Why is this relevant for our domain? Feature sparsity is a huge problem, and since the feature frequencies follow a power law, a set of very important but low-frequency features gets ignored, which seriously limits the expressiveness of any model built on them. To tackle the problem, we need to learn embeddings for those words that are not based on co-occurrence statistics alone.
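To make the sparsity problem concrete, here is a minimal sketch with synthetic data (all tag names and the threshold are hypothetical) of how a word2vec-style minimum-count cutoff discards exactly the rare but descriptive features in a power-law distribution:

```python
from collections import Counter

# Synthetic item tags whose frequencies follow a rough power law:
# a few very common tags and a long tail of rare, descriptive ones.
tags = (
    ["rock"] * 500 + ["pop"] * 200 + ["indie"] * 50
    + ["shoegaze"] * 3 + ["krautrock"] * 2 + ["zeuhl"] * 1
)

counts = Counter(tags)

# A typical co-occurrence model (e.g. word2vec with min_count=5)
# simply drops everything below the frequency threshold.
MIN_COUNT = 5
kept = {t for t, c in counts.items() if c >= MIN_COUNT}
dropped = {t for t, c in counts.items() if c < MIN_COUNT}

print(sorted(kept))     # ['indie', 'pop', 'rock']
print(sorted(dropped))  # ['krautrock', 'shoegaze', 'zeuhl']
```

The dropped tags are precisely the ones that would tell us the most about an item, which is why co-occurrence statistics alone are not enough.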

Before we read the paper, we had also tinkered with the idea of learning an embedding that combines a skip-gram model with facts represented as triplets
(subject, relation, object)
but since, in our domain, rare features cannot always be easily expressed as facts, we did not pursue it. However, the research we have done so far makes it more and more obvious that models using only the basic features will never deliver satisfying results.
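For reference, the fact-based half of such a combined model can be sketched with a TransE-style translation score, where a triplet (subject, relation, object) is plausible if subject + relation lands near object in embedding space. This is a sketch, not our model: the entities, relation, and dimensionality below are made up, and in practice the embeddings would be trained jointly with the skip-gram objective rather than drawn at random.

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 16

# Hypothetical, randomly initialized embeddings; a real model would
# train these jointly with the skip-gram part.
entities = {name: rng.normal(size=DIM) for name in ["zeuhl", "genre", "magma"]}
relations = {"is_a": rng.normal(size=DIM)}

def transe_score(subj, rel, obj):
    """Lower is better: distance between subject + relation and object."""
    return np.linalg.norm(entities[subj] + relations[rel] - entities[obj])

# Training pushes true triplets below corrupted ones via a margin ranking loss.
def margin_loss(pos, neg, margin=1.0):
    return max(0.0, margin + pos - neg)

pos = transe_score("zeuhl", "is_a", "genre")
neg = transe_score("zeuhl", "is_a", "magma")  # corrupted triplet
print(margin_loss(pos, neg))
```

The appeal of this formulation is that a rare item needs only a handful of facts, not many co-occurrences, to receive a sensible embedding.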

In other words, we need to utilize all kinds of data, and if co-occurrence does not work, we need something else. The challenge is to build a network with something like a universal memory that can be used for a variety of tasks, such as clustering items or training preference models. The first challenge is to find a way to build a powerful knowledge base, mostly in a semi-supervised fashion; the second is to incorporate all that knowledge into a network.

The good news is that there are lots of interesting new methods out there which we can study and adapt; the bad news is that none of them really fits our domain, which is why we cannot easily resort to a standard data set like Wikipedia. But as usual, we won't give up: we will continue our research because we believe it is only a matter of time until we, or someone else, come up with an idea that brings us a step closer to solving the problem.

