In preference-based learning, we try to estimate a utility function for a user by learning from the decisions that user made over time. These decisions range from ratings, e.g. “stars” from 1-5, to what a user clicked when offered multiple choices, which movies were recorded, or what kind of movies were browsed. All of these decisions offer some insight into the real preferences of a user, and the more data we have, the better the function can be approximated. This is a bit simplistic, but it suffices for our purposes.
There are lots of different models available to learn such a function, for instance (pairwise) ranking, predicting the star rating, or returning a confidence score between 0 and 1. We focus on purely content-based methods and ignore collaborative filtering and friends. In general, tags can help to better understand user choices and thus the preferences, which is why we focus on a solution that strongly relies on tags to improve the learning. However, finding good tags for movies can be burdensome and challenging, either because a movie is better described by multiple tags that somehow need to be grouped together, or simply because it is hard to find the right tag.
During our research, we stumbled upon a paper on automatically tagging emails that inspired us to break new ground. Instead of explicitly describing the tags for a movie, we define the concept of virtual folders. In the case of email, this can be a folder for business, travel, or whatever. For movies, the idea is to put all movies that fit some criteria into a folder without adding any tags. For instance, a folder might collect all movies that feature a specific kind of music, or a particular actor, or whose theme is a combination of cowboys and dystopia. A human might be able to see the pattern behind a folder, but that is not necessary, and usually the name of the folder will give sufficient hints.
The idea is that all movies which are important for a user will be assigned to one or more virtual folders. This is a replacement for rating or tagging movies explicitly. In other words, all movies in a folder named “70s_music_disco” share a broader topic and all are liked by the user. Thus, virtual folders can be seen as a clustering of all the movies a user has seen and liked. In the simplest case, a user might fall back to folders like ‘crime’, ‘comedy’, ‘scifi’ or ‘horror’, which can be refined later.
Finally, this leads to the well-known scenario where the data is described by arbitrary features and the goal is to predict a set of tags for each sample. The only difference is that now, tags might not be easily described with a single word, because a virtual folder might have very complex criteria for matching movies. Nevertheless, this does not prevent us from training a tag model. And after training, tags are not assigned to samples, but samples are assigned to folders.
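To make the reduction to a tagging problem a bit more concrete, here is a minimal sketch of how folder assignments could be turned into multi-label targets. The folder names, the movie ids and the helper functions `folders_to_labels` and `to_indicator_rows` are made up for illustration; this is only one possible encoding, not a fixed part of the approach.

```python
from collections import defaultdict

# Hypothetical example data: folder name -> ids of the movies the user put there.
folders = {
    "70s_music_disco": {"m1", "m2"},
    "crime": {"m2", "m3"},
}


def folders_to_labels(folders):
    """Invert folder -> movies into movie -> sorted list of folder names."""
    labels = defaultdict(set)
    for name, movie_ids in folders.items():
        for movie_id in movie_ids:
            labels[movie_id].add(name)
    return {movie_id: sorted(names) for movie_id, names in labels.items()}


def to_indicator_rows(labels, folder_names):
    """Build one 0/1 row per movie with one column per folder."""
    return {
        movie_id: [1 if name in names else 0 for name in folder_names]
        for movie_id, names in labels.items()
    }


folder_names = sorted(folders)  # fixed column order for the targets
labels = folders_to_labels(folders)
targets = to_indicator_rows(labels, folder_names)
```

A movie that sits in several folders simply gets several ones in its row, which is exactly the multi-label setting described above. Note that only liked movies appear here, so there are no explicit negative examples.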
Bottom line, we let users describe their preferences by clustering movies into a set of virtual folders that they create themselves, where each movie can be present in one or more folders. Then, a model is trained to predict tags for unseen samples, and these samples are added to the corresponding virtual folder, marked as new, to let the user decide whether a movie fits their mood. This resembles an inverse tag model that is represented as a tree where each tag is a folder, filled with matching movies. However, we doubt that preferences can be modeled as single tags, and we therefore believe that folders which represent more complex themes are more appropriate to represent user preferences.
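The whole loop can be sketched in a few lines. As a stand-in for the actual tag model, which we have not specified here, the sketch uses one centroid per folder and cosine similarity; the feature vectors, the threshold of 0.8 and the function names `train` and `suggest` are assumptions chosen purely for illustration.

```python
import math


def centroid(vectors):
    """Component-wise mean of equally long feature vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(folders, features):
    """One centroid per folder, built from the movies the user placed in it."""
    return {
        name: centroid([features[movie_id] for movie_id in sorted(movie_ids)])
        for name, movie_ids in folders.items()
    }


def suggest(model, unseen, threshold=0.8):
    """Assign each unseen movie to every folder it is similar enough to."""
    suggestions = {name: [] for name in model}
    for movie_id, vector in unseen.items():
        for name, center in model.items():
            if cosine(vector, center) >= threshold:
                suggestions[name].append(movie_id)  # shown as "new" in the folder
    return suggestions


# Hypothetical content features per movie: [disco, crime, western].
features = {"m1": [1.0, 0.0, 0.0], "m2": [1.0, 1.0, 0.0], "m3": [0.0, 1.0, 0.0]}
folders = {"70s_music_disco": {"m1", "m2"}, "crime": {"m2", "m3"}}
unseen = {"u1": [0.9, 0.1, 0.0], "u2": [0.0, 0.0, 1.0]}

model = train(folders, features)
new_items = suggest(model, unseen)
```

Here `u1` ends up as a new item in the disco folder, while `u2` matches no folder and is not suggested at all. A centroid model is a convenient choice for a sketch because it can be built from liked movies alone; any multi-label classifier could take its place.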