Getting More From Less

The data provided by the electronic program guide (EPG) is mainly meant to give a brief overview of what is on TV right now. Such a list provides at least the title of the item, the channel, and the duration. Other data, like a short description or genre, is optional, which means that very often more data has to be gathered to decide whether a user would consider an item interesting, especially when it is not currently on air to zap into. For a personalized recommender, however, it is essential to cluster items into categories like series/shows/movies in order to actually assist users. Of course, it would also be possible to match titles against external (movie) databases, but relying on third-party data is sometimes not sufficient, especially for non-English languages.

So we decided to experiment with the very little data we have, to see whether we can build a movie classifier based solely on the EPG data. We used factorization machines (FMs) because they have shown excellent results on very sparse input data across various problems. We only use categorical features to train the model, which means each feature is a choice of exactly 1 out of N options. The first feature is the duration, discretized into steps of 20 minutes from 60 to 240 minutes. The second is a 1-hot encoding of the channel, and the third is a set of disjoint term groups:

0: (movie, thriller)
1: (series,)
2: (show, magazine)

Therefore, if a description contains movie or thriller, the output is a ‘1’ at position 0; if it contains series, a ‘1’ at position 1; if it contains show or magazine, a ‘1’ at position 2; and so forth.
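A minimal sketch of this feature encoding could look as follows. The bin-selection logic, the placeholder channel list, and the helper name `encode` are assumptions for illustration, not part of the original setup:

```python
import numpy as np

# Assumed bin edges and placeholder channels; the term groups are
# the ones listed above.
DURATION_BINS = list(range(60, 241, 20))            # 60, 80, ..., 240 minutes
CHANNELS = ["channel_a", "channel_b", "channel_c"]  # placeholder channel list
TERM_GROUPS = [("movie", "thriller"), ("series",), ("show", "magazine")]

def encode(duration, channel, description):
    # Discretize the duration into 20-minute steps (1-hot).
    dur = np.zeros(len(DURATION_BINS))
    idx = min(range(len(DURATION_BINS)),
              key=lambda i: abs(DURATION_BINS[i] - duration))
    dur[idx] = 1.0
    # 1-hot encoding of the channel.
    chan = np.zeros(len(CHANNELS))
    chan[CHANNELS.index(channel)] = 1.0
    # 1-hot over the disjoint term groups found in the description.
    terms = np.zeros(len(TERM_GROUPS))
    words = set(description.lower().split())
    for i, group in enumerate(TERM_GROUPS):
        if words & set(group):
            terms[i] = 1.0
            break
    return np.concatenate([dur, chan, terms])
```

The concatenated vector is exactly the kind of sparse 1-of-N input that FMs handle well: only three of its entries are non-zero.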

We wanted to start with a very simple model that we can improve stepwise, to see how much information is required to reliably predict whether an item is a movie or not. The hyper-parameters are mostly the same as in our last experiment: we used 5 factors and RMSprop to train the model. Since we expect a global pattern that is valid for all movies, we do not worry much about over-fitting, because it is unlikely that a new movie has completely different statistics (like a duration of 1,000 minutes). Stated differently, we expect to find “all” patterns already in the training set.
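For reference, the RMSprop update used for training can be sketched as below; the learning rate, decay factor, and epsilon are assumptions, since the post does not state them:

```python
import numpy as np

# Sketch of one RMSprop step: keep a running average of squared
# gradients and scale each update by its inverse square root.
# lr, decay, and eps are assumed values, not taken from the post.
def rmsprop_step(param, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    cache = decay * cache + (1 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache
```

The per-parameter scaling is what makes RMSprop a convenient default here: sparse 1-hot features produce gradients of very different magnitudes, and the running average normalizes them.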

To see whether a linear model suffices to find such patterns, we started with a linear SVM and then extended the objective with the pairwise interaction term of FM. The score of the linear model was 88.09%, vs. 99.57% for the non-linear one. The results clearly show that the model benefits enormously from modeling pairwise interactions between features; nevertheless, the linear model also provides a very good baseline.
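The difference between the two models is exactly the second-order FM term. A minimal sketch of the FM score, using the standard O(k·n) reformulation of the pairwise sum (the variable names are illustrative):

```python
import numpy as np

# FM prediction: a linear part plus all pairwise interactions
#   y(x) = w0 + <w, x> + sum_{i<j} <v_i, v_j> x_i x_j,
# where V has one k-dimensional factor vector per feature (k = 5
# in the experiment above). The pairwise sum is computed via the
# identity 0.5 * sum_f ((sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2).
def fm_predict(x, w0, w, V):
    linear = w0 + x @ w
    s = V.T @ x                  # shape (k,): sums per factor
    s2 = (V ** 2).T @ (x ** 2)   # shape (k,): squared sums per factor
    pairwise = 0.5 * np.sum(s ** 2 - s2)
    return linear + pairwise
```

Dropping the `pairwise` term recovers the linear baseline, which makes the 88.09% vs. 99.57% comparison a direct measurement of what the interactions contribute.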

The results further show that with a proper encoding of the features and a powerful algorithm, we can train models that are very accurate, yet also lightweight and simple. Without a doubt there is room for improvement, but with the results we got so far there is no need to spend time tweaking and tuning parameters and features. Still, it is worth mentioning that we are once again very positively surprised by how accurate such a simple method as FM is.

