A model that precisely captures the preferences of individual users is valuable in many situations. The common approaches are collaborative filtering and content-based methods. Our approach is clearly content-based, but instead of coupling it directly with labels to predict the preference for specific items, we are more interested in capturing preferences in a broader way.
For instance, instead of trying to predict a rating for the new Star Wars movie, we think it would be more useful to learn something about a user's preference for SF movies with ‘space opera’ elements. In other words, we are more interested in models that describe the data, P(X), than in models that predict labels from the data, P(Y|X).
Consider a system that continually stores the actions of a user, including which movies she watches and which descriptions she reads. All these actions can be expressed as preferences towards items. Of course, without feedback after a user watches a movie, no clear judgment can be made. However, the decision to read more about a movie or to watch it clearly expresses a preference towards this movie, or at least shows that the user is interested in some part of it.
The more time a user spends with the system, the more (latent) preferences are available to derive a model from. For simplicity, we ignore possible feedback from the user, since even if a user did not enjoy a movie she watched, something in the movie caught her attention.
As a first step, we are interested in a compact snapshot of the user's preferences. In other words, we want to find out which topics the user is interested in instead of just considering which genres she enjoys. As we described in earlier posts, a rough summary of a topic can be extracted from the meta data (keywords) of a movie. The simplest model is a matrix factorization performed on the meta data of all movies a user watched.
The extracted (latent) features are then a summary of the topics a user enjoys. However, extra care needs to be taken, since over-fitting can be an issue and, as we described earlier, meta data of movies can be incomplete or even wrong. Of course, the outcome of the factorization is only an approximation of the real preferences, but with sufficient meta data and a minimum number of training examples (movies), at least tendencies in the preferences should become visible.
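To make the idea concrete, here is a minimal sketch of such a factorization. It is not the exact model we use: it assumes a binary movie-by-keyword matrix and a non-negative matrix factorization with multiplicative updates, and the movie titles, keywords, the helper names `nmf` and `matmul`, and the choice of two topics are all made up for illustration.

```python
import random

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def nmf(V, k, iters=200, seed=0, eps=1e-9):
    """Approximate non-negative V (n x m) by W (n x k) times H (k x m)
    with multiplicative updates for the squared reconstruction error."""
    rng = random.Random(seed)
    n, m = len(V), len(V[0])
    W = [[rng.random() + 0.1 for _ in range(k)] for _ in range(n)]
    H = [[rng.random() + 0.1 for _ in range(m)] for _ in range(k)]
    for _ in range(iters):
        # Update the topic-keyword matrix H.
        Wt = transpose(W)
        num, den = matmul(Wt, V), matmul(matmul(Wt, W), H)
        H = [[H[i][j] * num[i][j] / (den[i][j] + eps) for j in range(m)]
             for i in range(k)]
        # Update the movie-topic matrix W.
        Ht = transpose(H)
        num, den = matmul(V, Ht), matmul(W, matmul(H, Ht))
        W = [[W[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(n)]
    return W, H

# Hypothetical watch history: movie title -> keywords from its meta data.
watched = {
    "movie_a": ["space", "alien", "empire", "rebellion"],
    "movie_b": ["space", "alien", "spaceship"],
    "movie_c": ["detective", "murder", "city"],
    "movie_d": ["detective", "murder", "conspiracy"],
}

vocab = sorted({kw for kws in watched.values() for kw in kws})
# Binary matrix: one row per watched movie, one column per keyword.
V = [[1.0 if kw in kws else 0.0 for kw in vocab] for kws in watched.values()]

W, H = nmf(V, k=2)

# Squared reconstruction error of the low-rank approximation.
R = matmul(W, H)
error = sum((v - r) ** 2 for vr, rr in zip(V, R) for v, r in zip(vr, rr))

# Each row of H is one latent topic; list its three strongest keywords.
topics = [
    [vocab[j] for j in sorted(range(len(vocab)), key=lambda j: -row[j])[:3]]
    for row in H
]
print(topics, round(error, 3))
```

The rows of `H` play the role of the latent topics, and the rows of `W` say how strongly each watched movie belongs to each topic. With so few movies the result is only a rough tendency, which is exactly the over-fitting caveat mentioned above.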
Furthermore, the model can be used as a building block for more powerful modules. For instance, it could be used to find similar movies (clustering), or it could be turned into a utility function to rank movies.
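As a rough illustration of the ranking idea, the sketch below scores unseen movies against a set of topics. Everything in it is an assumption made for the example: the topic weights are hand-written stand-ins for what a factorization might produce, the candidate titles and keywords are invented, and taking the best cosine match over all topics is just one possible choice of utility function.

```python
import math

# Hypothetical topics: keyword -> non-negative weight, standing in for
# the latent features learned from a user's watched movies.
topics = [
    {"space": 0.9, "alien": 0.8, "empire": 0.4},
    {"detective": 0.9, "murder": 0.8, "city": 0.3},
]

def utility(keywords, topics):
    """Score a movie by its best cosine similarity to any single topic,
    treating the movie as a binary keyword vector."""
    kws = set(keywords)
    if not kws:
        return 0.0
    best = 0.0
    for topic in topics:
        overlap = sum(w for kw, w in topic.items() if kw in kws)
        topic_norm = math.sqrt(sum(w * w for w in topic.values()))
        movie_norm = math.sqrt(len(kws))
        best = max(best, overlap / (topic_norm * movie_norm))
    return best

# Hypothetical unseen movies with their keywords.
candidates = {
    "candidate_x": ["space", "alien", "war"],
    "candidate_y": ["romance", "paris"],
    "candidate_z": ["detective", "city", "space"],
}

# Rank candidates from most to least promising for this user.
ranked = sorted(candidates, key=lambda t: utility(candidates[t], topics),
                reverse=True)
print(ranked)
```

Using the maximum over topics rewards a movie that matches one interest well, instead of averaging it down because it ignores the user's other interests. A weighted sum over topics would be an equally reasonable alternative.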