For our running example, we assume that a user has access to an electronic program guide (EPG for short) that is, for example, transmitted in the DVB-S stream. We developed a simple program that extracts the data from the DVB stream and stores it in a document-based database as (name, value) pairs. The information varies from channel to channel, but some basic fields are always present: title, runtime, channel, start and stop. Examples of optional fields are a textual description, a category, a production year or a rating.
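To make the record layout concrete, here is a minimal sketch of one EPG entry as a set of (name, value) pairs, together with a check for the mandatory fields. The values, the lower-case field names and the helper `missing_mandatory_fields` are assumptions made for illustration; they are not the output of the actual extraction program.

```python
# Mandatory field names as listed in the text; the exact spelling used in the
# database is an assumption.
MANDATORY_FIELDS = {"title", "runtime", "channel", "start", "stop"}


def missing_mandatory_fields(record):
    """Return the mandatory field names that are absent from an EPG record."""
    return MANDATORY_FIELDS - set(record)


# A made-up record: five mandatory fields plus one optional field (category).
record = {
    "title": "Example Movie",
    "runtime": 95,                 # minutes (assumed unit)
    "channel": "Example Channel",
    "start": "20:15",
    "stop": "21:50",
    "category": "Comedy",
}

print(missing_mandatory_fields(record))  # set()
```

A document-based database can store such a dictionary directly, which is convenient because the optional fields differ from channel to channel.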
The mandatory part of the EPG data alone is not sufficient for deriving useful features for a recommendation engine. Traditional features are the names of actors, movie genres or story keywords. As a result, we need to fetch this data from somewhere else and map it back to the EPG data. For the sake of simplicity, we model all these features as binary values as follows: if we have N genres, we store the values as an N-dimensional vector that has a 1 at position i if the movie belongs to genre i, and 0 otherwise. Example with N = 4: genres = (SciFi, Action, Comedy, Drama), X = (0, 0, 1, 1), which means that movie X belongs to the categories Comedy and Drama. Furthermore, the restriction to binary features allows us to use a very broad spectrum of algorithms to train a model.
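The binary encoding described above can be sketched in a few lines of Python. The function name `encode_genres` and the use of tuples are our own choices for this illustration; the genre list is the one from the example.

```python
# Fixed genre vocabulary; position i in the vector corresponds to GENRES[i].
GENRES = ("SciFi", "Action", "Comedy", "Drama")


def encode_genres(movie_genres, vocabulary=GENRES):
    """Return a binary tuple with a 1 at position i if the movie has vocabulary[i]."""
    present = set(movie_genres)
    return tuple(1 if genre in present else 0 for genre in vocabulary)


x = encode_genres(["Comedy", "Drama"])
print(x)  # (0, 0, 1, 1)
```

Actors or story keywords can be encoded the same way with their own vocabulary, and the resulting vectors can simply be concatenated into one feature vector per movie.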
As a last step, we need to gather ratings for movies. This step does not have to be done by the system; we just need to ensure that we can map the ratings back to the EPG data. Usually, it is sufficient to store the triple (title, year, rating), since the title and the year together identify a movie precisely (except when two movies with the same name are released in the same year).
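One possible way to map the ratings back to the EPG data is a lookup keyed on (title, year). The sketch below assumes that the EPG records are dictionaries with a "title" field and an optional "year" field; these names, the helper functions and the normalisation of the title (trimming and case folding) are assumptions for illustration, not part of the system described here.

```python
def rating_key(title, year):
    """Build a lookup key from title and year; the normalisation is an assumption."""
    return (title.strip().casefold(), int(year))


def attach_ratings(epg_records, ratings):
    """Return (record, rating) pairs for all EPG records with a known rating.

    ratings is an iterable of (title, year, rating) triples.
    """
    lookup = {rating_key(title, year): rating for title, year, rating in ratings}
    matched = []
    for record in epg_records:
        year = record.get("year")
        if year is None:
            # The production year is optional in the EPG data, so such
            # records cannot be matched by this simple scheme.
            continue
        key = rating_key(record["title"], year)
        if key in lookup:
            matched.append((record, lookup[key]))
    return matched


# Made-up data for illustration.
ratings = [("Example Movie", 2010, 4)]
epg_records = [
    {"title": " example movie ", "year": "2010"},
    {"title": "Another Movie"},
]
print(attach_ratings(epg_records, ratings))
```

Two movies with the same title released in the same year would collide on this key, which is exactly the exception mentioned above.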
Now we have the basic ingredients to build our system: useful binary features to model the movies, and ratings for a subset of them to describe our preferences. In our next post, we will describe a very simple approach that nevertheless works pretty well with carefully created features.