It is no secret that some data is more valuable than others. For instance, if a feature describing movie content is present in all movies, it cannot be used to discriminate among movies. That is why weighting the feature values is usually beneficial and common practice. A binary scheme is no doubt easier to implement, but it does not let a model describe the data precisely, because it completely ignores the context.
A very traditional scheme is TF-IDF, which considers not only the frequency of terms but also the inverse ‘document’ frequency. In an earlier post, we briefly discussed this approach for the domain of movies. In a nutshell, the term frequency of a keyword within a single movie is always one, which is a problem and the reason we used conditional frequencies that depend on the genre. After we had built the tf-idf values for all keywords, we processed all movies and replaced the binary features with the new tf-idf values. Now it is very likely that a keyword, e.g. ‘dog’, has several different weight values depending on the genre of the movie.
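The genre-conditional scheme can be sketched in a few lines. This is a minimal illustration with hypothetical toy data, not our actual pipeline; the keyword sets and the exact idf variant are assumptions:

```python
from collections import defaultdict
from math import log

# Toy corpus: each movie is (genre, set of binary keywords).
# Hypothetical data, chosen only to illustrate the scheme.
movies = [
    ("western", {"cowboy", "dog", "marshal"}),
    ("western", {"cowboy", "bounty-hunter"}),
    ("drama",   {"doctor", "dog", "patient"}),
    ("drama",   {"nurse", "dog"}),
    ("drama",   {"school", "teacher"}),
]

genre_counts = defaultdict(int)    # movies per genre
keyword_counts = defaultdict(int)  # (genre, keyword) -> movie count
doc_freq = defaultdict(int)        # keyword -> movies containing it

for genre, keywords in movies:
    genre_counts[genre] += 1
    for kw in keywords:
        keyword_counts[(genre, kw)] += 1
        doc_freq[kw] += 1

n_movies = len(movies)

def tf_idf(genre, keyword):
    # tf is conditioned on the genre (fraction of that genre's movies
    # containing the keyword), which replaces the per-movie term
    # frequency that would otherwise always be one; idf is computed
    # over the whole collection.
    tf = keyword_counts[(genre, keyword)] / genre_counts[genre]
    idf = log(n_movies / doc_freq[keyword])
    return tf * idf
```

With this data, ‘dog’ gets a different weight in a western than in a drama, which is exactly the genre-dependent behavior described above.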
To evaluate the performance of the new features, we decided to use an RBM to find possible latent topics in the data. In contrast to previous experiments, we now have real-valued input data which means that we need to use Gaussian visible units in our RBM model. To simplify training, we normalized each feature to unit variance.
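The normalization step mentioned above can be done with a simple per-column standardization. A minimal pure-Python sketch (the function name and the handling of constant columns are our assumptions):

```python
from statistics import mean, pstdev

def standardize_columns(X):
    # Scale each feature (column) of a row-major data matrix to zero
    # mean and unit variance, a common preprocessing step before
    # training an RBM with Gaussian visible units. Constant columns
    # are mapped to zero to avoid division by zero.
    out_cols = []
    for col in zip(*X):
        m, s = mean(col), pstdev(col)
        if s == 0:
            out_cols.append([0.0] * len(col))
        else:
            out_cols.append([(x - m) / s for x in col])
    return [list(row) for row in zip(*out_cols)]
```

After this step every non-constant feature has unit (population) variance, which matches the assumption built into Gaussian visible units with fixed variance.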
The results look quite promising, as we can see if we consider the features with the highest weights per neuron:
– good-guy, bad-guy, cowboy, marshal, bounty-hunter
– doctor, nurse, surgery, patient
– rock, music, songwriter, showbiz, fame
– school, teenager, teacher, rebel
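Lists like the ones above can be read off the trained weight matrix directly. A minimal sketch, assuming a row-major weight matrix W where W[i][j] connects visible feature i to hidden unit j (the helper name is hypothetical):

```python
def top_features_per_unit(W, feature_names, k=5):
    # For each hidden unit (column of W), return the k feature names
    # with the largest weights -- the per-neuron keyword lists above
    # were obtained in this spirit.
    n_hidden = len(W[0])
    tops = []
    for j in range(n_hidden):
        ranked = sorted(range(len(W)), key=lambda i: W[i][j], reverse=True)
        tops.append([feature_names[i] for i in ranked[:k]])
    return tops
```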
On the one hand, the fact that keywords occur at most once per movie remains problematic, because it does not tell us how important a keyword is for an individual movie; in our approach, we only get its importance for a specific genre. On the other hand, down-weighting features that are not very relevant for a genre is definitely beneficial in general.