For one last time, we say it again: the best algorithm is useless without proper features, and since we aim to learn a model that describes the data as well as possible, we need a way to work around this limitation. We cannot change the fact that there are no natural features for movies, so we take this as given. What we can do, however, is tackle the problem at the level of the word encoding.
For instance, words like “werewolves” and “werewolf” have the same meaning; they only differ because the plural form does not simply append an “s” to the singular form. Of course, even such non-trivial plural forms are easy to handle. What really troubles us are words, especially those with a frequency of one, that are strongly related to other words. Why? Because such words could be “substituted” with a more general word, and then they would contribute to the top-k words.
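To make the plural case concrete, here is a minimal sketch of the normalization step. The lemma table and the sample tokens are made up for illustration; a real system would use a stemmer or lemmatizer instead of a hand-written dictionary.

```python
# Minimal sketch: normalize irregular plurals with a small, hand-made
# lemma table before counting word frequencies (table is hypothetical).
LEMMAS = {"werewolves": "werewolf", "wolves": "wolf", "mice": "mouse"}

def normalize(token):
    """Map a token to its singular/base form if we know one."""
    return LEMMAS.get(token, token)

tokens = ["werewolves", "werewolf", "moon", "wolves"]
print([normalize(t) for t in tokens])
# ['werewolf', 'werewolf', 'moon', 'wolf']
```

After this step, “werewolves” and “werewolf” count as one word, which is exactly what we need for reliable frequencies.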
Let’s take the word ‘submarine’, for example. If it is used 80% of the time but ‘u-boat’ the other 20%, the split in frequency can prevent the word from entering the top-k selection, and thus the feature ‘submarine’ is not available as a concept that can be learned from the data.
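The effect of this frequency split can be sketched with a toy count, assuming a hypothetical synonym map. With the synonyms merged, both spellings count toward one concept, and that concept now has a chance to reach the top-k selection:

```python
from collections import Counter

# Sketch of the 'submarine' vs. 'u-boat' split (made-up counts):
# without merging, the frequency is divided between two spellings.
SYNONYMS = {"u-boat": "submarine"}

tokens = ["submarine"] * 8 + ["u-boat"] * 2 + ["ocean"] * 9

raw = Counter(tokens)
merged = Counter(SYNONYMS.get(t, t) for t in tokens)

print(raw.most_common(1))     # [('ocean', 9)]  -- 'submarine' loses with 8
print(merged.most_common(1))  # [('submarine', 10)] -- merged, it wins
```

The hard part, of course, is building the synonym map in the first place, which is exactly the natural language problem described below.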
Another example: a particular movie has some keywords that are semantically related to the selected top-k words, but not identical. For instance, ‘marine’ was selected as a top-k word, but a movie contains the word ‘leatherneck’. A similar case is ‘villain’ vs. ‘scoundrel’, and we can think of lots of other pairs. This issue leads to unnecessary sparsity in the feature vectors and prevents movies from being treated as similar, because the feature is missing in a notable fraction of them.
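The sparsity issue can be illustrated with a tiny binary bag-of-words vector over the top-k words. The mapping table of related words is again hypothetical; it stands in for whatever similarity measure we end up using:

```python
# Sketch: build a binary feature vector over the top-k words, mapping
# related keywords (hypothetical table) to their top-k counterpart so a
# movie does not lose a feature just because it uses a rarer synonym.
TOP_K = ["marine", "villain", "moon"]
RELATED = {"leatherneck": "marine", "scoundrel": "villain"}

def feature_vector(keywords):
    mapped = {RELATED.get(w, w) for w in keywords}
    return [1 if w in mapped else 0 for w in TOP_K]

print(feature_vector({"leatherneck", "moon"}))  # [1, 0, 1]
```

Without the mapping, the first entry would be 0 and the movie would look dissimilar to all the ‘marine’ movies, even though the concept is present.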
In a nutshell, this is not a machine learning problem but a natural language problem. To learn a good model, we need to measure the similarity between words and map each word to its most general form to connect them. Then, and only then, will our model be able to find a useful structure in the data.
The next post is about a simple approach that solves some of the described issues. For a full-fledged solution, however, we probably need to dive deeper into the NLP domain.