In the last week, we tried many different approaches. Most of the time the results were okay, but none of them knocked us off our feet. We speculated a lot about the reasons, since there are known approaches in the literature that work well even for very sparse and limited data. However, one big difference is the semantics of the underlying structure of the data. In the case of micro-blogs, the text still follows some patterns, even if it differs from natural language. With our meta data, the situation is different: if a movie has only very few keywords, and one or two of them are wrong or misspelled, the record is close to useless. Combined with the fact that our training set is also very small, a weak learning signal may pick up so much noise that the model learns nothing useful at all.
Stated differently, we need an approach that is more robust to errors and can cope with very sparse data. While doing some research, we stumbled upon an old idea called “Siamese Networks”. This kind of network can be used for many things, and we will investigate it to learn a similarity measure for our meta data. But first, we need to address some problems. The idea is to learn a projection from the keyword space into a concept space that allows us to compare items via latent topics instead of comparing the keywords directly. That is exactly what we want, but the drawback is that we now have a supervised model that requires labels expressing the similarity of pairs of items.
Let us start with a brief description of the approach. The training is done on randomly sampled items f, g and a label y that describes how similar those items are. For simplicity, we use the cosine similarity. Furthermore, we have a projection matrix W that maps the items f, g into the new concept space by some function F; for now we use the dot product (h = f*W). A possible cost function is the squared loss 0.5*(label’ – label)**2, where label’ is the cosine similarity of f and g in the topic space. That was the easy part.
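To make this concrete, here is a minimal sketch of the forward pass and the squared loss in plain NumPy (not our actual Theano code); the dimensions and the toy keyword vectors are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up dimensions for illustration
n_keywords, n_topics = 1000, 50

# shared projection matrix W, used for both items of a pair
W = rng.normal(scale=0.01, size=(n_keywords, n_topics))

def cosine(a, b, eps=1e-8):
    """Cosine similarity of two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def loss(f, g, label):
    """Squared loss 0.5*(label' - label)**2, where label' is the cosine
    similarity of the two items after projecting them into topic space
    with h = x @ W."""
    h_f, h_g = f @ W, g @ W      # same W on both sides: the "Siamese" part
    label_pred = cosine(h_f, h_g)
    return 0.5 * (label_pred - label) ** 2

# two sparse binary keyword vectors as a toy pair
f = np.zeros(n_keywords); f[[3, 17, 200]] = 1.0
g = np.zeros(n_keywords); g[[3, 42]] = 1.0
print(loss(f, g, label=0.5))
```

The important design choice is that both items are projected with the same matrix W, so the learned topic space is shared and the similarity stays symmetric.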
Now, we need a label that already exists and can be used to discriminate items at a basic level. In our initial setup, we used the genres as the label domain: the label is the Jaccard coefficient of the two genre sets, which has the advantage that the result is bounded by [0, 1], exactly like the cosine similarity. We implemented the model with Theano to avoid juggling with the gradients manually. Since the output is compatible with our previous models, we used our existing machinery to check the similarity of some representative movies. Altogether, the results were at least as good as with the other models, with a tendency to be more consistent and to have fewer outliers.
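For reference, the Jaccard coefficient is just the size of the intersection of the two genre sets divided by the size of their union. A minimal sketch, with made-up genre annotations:

```python
def jaccard(a, b):
    """Jaccard coefficient of two genre sets, bounded by [0, 1]."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention: two empty genre lists count as dissimilar
    return len(a & b) / len(a | b)

# hypothetical genre annotations for two movies
print(jaccard({"drama", "crime"}, {"drama", "thriller"}))  # → 0.3333...
```

Because both the label and the model output live in the same bounded range, the squared loss compares like with like and no extra rescaling is needed.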
Of course, more tests need to be done here, but the new approach seems to have a lot of potential, and if we can improve the label generation, the results can hopefully be improved further. Especially the case where two movies share no keywords, but their keywords are semantically related, can be modeled more effectively with the pair-wise training.