With the limited data at hand, it is hardly rewarding to learn a model that generalizes to unseen data. In the domain of raw words, one can use a large corpus, like Wikipedia, to learn a model of the “big picture”. For movies, there is no comparable common data set.
There are a couple of sites that try to gather information about movies, but they do not share a common focus. Even worse, there is not a single site (none that we are aware of) that allows access to the whole data set in a machine-readable way. What comes closest to such a data set is the text interface of a well-known movie database site.
We suspected that the keywords of this data set are also very sparse, so we analyzed the distribution for a set of movies where each movie received at least 500 votes. This leaves about 102K keywords and 75K movies, with these statistics:
– ~40% of the keywords have a frequency of exactly one
– a movie has ~20 keywords on average
– the variance of the keyword frequencies is ~19,500
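The statistics above are straightforward to compute from a movie-to-keywords mapping. Here is a minimal sketch with a tiny hand-made toy data set; the variable names and the data are purely illustrative, not the actual corpus:

```python
from collections import Counter

# Toy stand-in for the real corpus of ~75K movies: movie -> keyword list
movies = {
    "m1": ["robot", "space", "one-timer-a"],
    "m2": ["robot", "love", "one-timer-b"],
    "m3": ["space", "war"],
}

# Frequency of each keyword across all movies
freq = Counter(kw for kws in movies.values() for kw in kws)

# Share of keywords that occur exactly once (~40% in our data)
singletons = sum(1 for c in freq.values() if c == 1) / len(freq)

# Average number of keywords per movie (~20 in our data)
avg_keywords = sum(len(kws) for kws in movies.values()) / len(movies)

# Variance of the keyword frequencies (~19,500 in our data)
mean_f = sum(freq.values()) / len(freq)
variance = sum((c - mean_f) ** 2 for c in freq.values()) / len(freq)
```

Run on the full corpus, these three numbers are exactly the statistics listed above.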
Stated differently, the situation is very similar to what we described in our previous posts: without some kind of “unification”, it is almost hopeless to use the keywords directly as features. However, even a simple taxonomy can improve the situation a lot. Let us assume that we have two keywords, ‘robot-doll’ and ‘robot-cop’. There is no doubt that both keywords share the parent concept ‘robot’, so we can map all such keywords to this concept.
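Such a mapping can be sketched in a few lines. The concept list and the token-matching rule below are illustrative assumptions, not our actual taxonomy:

```python
# Hand-built mini taxonomy: a keyword whose hyphen-separated tokens
# contain a known concept is mapped to that parent concept.
CONCEPTS = ["robot"]

def to_concept(keyword):
    for concept in CONCEPTS:
        if concept in keyword.split("-"):
            return concept
    return keyword  # keywords without a known parent stay as-is

keywords = ["robot-doll", "robot-cop", "space-station"]
mapped = [to_concept(k) for k in keywords]
# 'robot-doll' and 'robot-cop' now share the single feature 'robot'
```

After this step, movies that previously had disjoint robot-related one-time keywords all share one common feature.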
We performed a simple test and found that 75 keywords could be mapped to this concept. Now, if these keywords are all one-time words, we can at least unify 75 movies under a common concept instead of having a set of 75 disjoint features. Of course, we lose a lot of information during this process, but on the other hand, we gain a lot by noting that all these movies have at least something to do with robots.
Furthermore, we can combine this approach with our most recent idea of using synsets, sets of synonyms for words, to relate one-time keywords to more frequent ones.
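A minimal sketch of the synset idea: replace a rare keyword by the most frequent member of its synset. The synset table here is hand-built for illustration; in practice it might come from a lexical resource like WordNet:

```python
# Hypothetical synsets: each set groups keywords we treat as synonyms.
SYNSETS = [
    {"android", "humanoid", "robot"},
]

def unify(keyword, freq):
    """Replace `keyword` by the most frequent member of its synset."""
    for synset in SYNSETS:
        if keyword in synset:
            return max(synset, key=lambda w: freq.get(w, 0))
    return keyword  # no synset known: keep the keyword unchanged

# Illustrative frequencies: 'android' is a one-time keyword,
# 'robot' is frequent, so the rare keyword is absorbed by it.
freq = {"robot": 120, "android": 1, "humanoid": 2}
print(unify("android", freq))  # prints "robot"
```

This way a one-time keyword does not stay a disjoint feature but inherits the statistics of its frequent synonym.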