Sometimes it is a good idea to throw everything overboard and start from scratch, and for some obscure reason we immediately thought about a guy named Arthur. Stated differently, if you want to speak with someone and you don’t know the proper language, you need a translator.
To be more specific, we tried a completely different approach for our input data, the movie features. Instead of using the raw features, we first transform the data into a different encoding, which is then fed into the RBM. A common model used in the domain of language and speech is the n-gram model. The model is actually very versatile and can be used for many purposes; in our case, we only borrowed some of its ideas to split the raw keywords into smaller parts. For instance, the word ‘developer’ could be represented as ‘dev’, ‘eve’, ‘vel’ and so forth.
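To make the splitting concrete, here is a minimal sketch of character n-gram extraction. The function name `char_ngrams` and the fixed trigram size are our own assumptions for illustration; the post does not pin down the exact scheme:

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

print(char_ngrams("developer"))
# ['dev', 'eve', 'vel', 'elo', 'lop', 'ope', 'per']
```

Words shorter than n simply produce no parts, which is fine for keyword data.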
With this approach we can also tackle some existing problems of the data. First, singular and plural forms can be handled more easily, and second, we can avoid stemming altogether if we only use the most frequent parts. A neat side effect is that spelling errors can be handled more robustly.
The training of the RBM remains the same, but now we first encode all existing keywords with an n-gram scheme and then pick the N most frequent parts. In contrast to our first approach, the data is still sparse, but not as sparse as before, because many words share n-grams, which means there are more “active” features in our training data. For our first test, we used binary units: with a simple model we can focus on the parameters, and it is easier to rule out side effects that might be caused by more advanced types of units.
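The encoding step can be sketched as follows, on a toy keyword corpus. All names here (`char_ngrams`, `build_vocab`, `encode`) are illustrative assumptions, not the actual implementation:

```python
from collections import Counter

def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

def build_vocab(movie_keywords, top_n):
    """Keep the top_n most frequent n-grams across all keywords."""
    counts = Counter(g for kws in movie_keywords
                       for kw in kws
                       for g in char_ngrams(kw))
    return [g for g, _ in counts.most_common(top_n)]

def encode(keywords, vocab):
    """Binary feature vector: 1 if the n-gram occurs in any keyword."""
    grams = {g for kw in keywords for g in char_ngrams(kw)}
    return [1 if g in grams else 0 for g in vocab]

movies = [['werewolves', 'vampires'], ['werewolf', 'vampire', 'zombie']]
vocab = build_vocab(movies, top_n=10)
v = encode(movies[0], vocab)   # visible layer input for the RBM
```

Keeping only the top N parts is what makes the representation both compact and robust: rare n-grams, including most one-off spelling errors, never enter the vocabulary.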
As expected, even a model without any fine-tuning leads to very good results. Of course the utilization of n-grams cannot fix incomplete or wrong movie meta data, but now it is much easier to relate items if they have similar keywords. Here is a very simple example: let us assume that we have two movies, A: ‘werewolves’, ‘vampires’, ‘undead’ and B: ‘werewolf’, ‘vampire’, ‘zombie’. If we use the plain keywords, there is no overlap between the two movies. Stemming would definitely help, but it also requires a stemmer for each language, and so forth. With a simple n-gram encoding, the data can be related much more easily and learned with an RBM model.
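The example above can be checked directly. A small sketch, again assuming character trigrams:

```python
def char_ngrams(word, n=3):
    """Split a word into overlapping character n-grams."""
    return [word[i:i + n] for i in range(len(word) - n + 1)]

a = ['werewolves', 'vampires', 'undead']
b = ['werewolf', 'vampire', 'zombie']

print(set(a) & set(b))           # plain keywords: no overlap at all

grams_a = {g for kw in a for g in char_ngrams(kw)}
grams_b = {g for kw in b for g in char_ngrams(kw)}
print(sorted(grams_a & grams_b))
# shared trigrams such as 'wer', 'wol', 'vam' now relate the two movies
```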
However, the drawback of the encoding is that we are losing some information. When a word is split into parts, those parts become “loose”, which means they no longer have to be next to each other. In other words, there is no context for these parts any longer, and a floating ‘wolf’ could stand for a lone guy, the animal, or the mythical creature. Thus, we have to make sure that the RBM model captures the context of n-grams to discriminate between different high-level concepts of words.
Nevertheless, the tests we performed with the trained model indicate that it is sufficient to differentiate between movie genres. The three best matches for the movie Spider-Man were Spider-Man 2, Batman Forever, and Daredevil; in this case, all the movies belong, at least partly, to the category of superhero movies.
The fact that Spider-Man 3 was not on the list confirms our assessment that n-grams are still too superficial to encapsulate higher-level concepts, like the concept of a teen who is a superhero with the powers of a spider. As we discussed earlier, we would need much more meta data to learn such a concept.
Bottom line: the new encoding definitely showed a lot of potential and also proves that it is sometimes a good idea to start from scratch again instead of obsessing about a single idea.