Conditional Feature Encoding

In the previous post we manually created a set of categorical features to decide whether an item is a movie or not. The idea is that combinations of features are often much more descriptive than any single feature. For instance, the name of the channel, the duration and the hour at which the item starts can, taken together, already lead to very accurate models.

A very simple example is the rule for “prime time” movies: movies that air around 8 PM and run for about 120 minutes. The channel might not be as descriptive as the other features (duration, hour), but because some channels show no movies at all, or at least show them less often than other channels, the information can still be vital. In a nutshell, pairs like duration=120 and hour=08pm, or channel=Movie-Channel and duration=90, are a good start for separating item classes such as movies and series/shows.
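To make the idea of feature pairs concrete, here is a minimal sketch of how single categorical features could be turned into explicit pair features. The function name `encode_pairs` and the `name=value` / `&` string format are just assumptions for illustration, not the encoding used in the previous post.

```python
from itertools import combinations


def encode_pairs(item):
    """Return single and pairwise categorical features for one item.

    `item` maps a feature name to its (already discretized) value.
    Hypothetical helper, only meant to illustrate the idea.
    """
    # Sort by name so the same pair always gets the same string.
    singles = [f"{name}={value}" for name, value in sorted(item.items())]
    # Every unordered pair of single features becomes one new feature.
    pairs = [f"{a}&{b}" for a, b in combinations(singles, 2)]
    return singles + pairs


item = {"channel": "Movie-Channel", "duration": 120, "hour": "08pm"}
features = encode_pairs(item)
for feature in features:
    print(feature)
```

With three single features we get three pairs, among them `duration=120&hour=08pm`, which is exactly the “prime time” rule from above. Note that the number of pairs grows quadratically with the number of single features, so the input gets sparse very quickly.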

In other words, even the best model will fail if the feature encoding is not powerful enough. On the other hand, if we find a very creative encoding, maybe even a linear model suffices to separate the item classes. And since there are no limits on what can be encoded, even very crude ideas might lead to success. For instance, we could encode the number of terms in the title, or we could count special characters. We could also use the weekday, or audio information such as whether 5.1 sound is available.
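The crude ideas mentioned here are cheap to try. The following is only a sketch: the function `crude_features`, its arguments and the example title are made up, and real meta data will of course look different.

```python
from datetime import datetime

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]


def crude_features(title, start, has_51_sound):
    """Derive a few simple features from the title, start time and audio flag."""
    # Special characters: everything that is neither alphanumeric nor whitespace.
    special = sum(1 for ch in title if not ch.isalnum() and not ch.isspace())
    return {
        "title_terms": len(title.split()),
        "special_chars": special,
        "weekday": WEEKDAYS[start.weekday()],
        "audio_51": has_51_sound,
    }


# Hypothetical item, not taken from any real data set.
feats = crude_features("Space Patrol: Part 2!", datetime(2017, 5, 18, 20, 15), True)
print(feats)
```

Each of these values can then be treated as a categorical feature, like `title_terms=4` or `weekday=Thu`, and combined with the others in the same way as channel, duration and hour.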

However, it is very important that a minimum number of features is present in all items, because in the end the accuracy can only be measured with those “common” features. Furthermore, categorical features lead to very sparse input vectors, which require a model that benefits from sparsity, like Factorization Machines (FM). Why do FMs work so well with sparse input data? Because the parameters of the factor matrix V are shared: each feature i gets a k-dimensional factor vector v_i, and the weight of a feature pair (i,j) is the inner product of v_i and v_j, so not every pair has its own parameter. This makes it possible to model even interactions that are not present in the training data. The only drawback is the additional hyper-parameter “k” for the number of factors, but especially for smaller datasets, finding a good value is straightforward.
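To illustrate the parameter sharing, here is a small sketch of the standard second-order FM score for binary sparse inputs, written in plain Python. The names `fm_score` and `pair_weight` and all numbers are made up for illustration; the pairwise term uses the well-known reformulation that is linear in the number of active features.

```python
def pair_weight(V, i, j):
    """Interaction weight of features i and j: inner product of their factors."""
    return sum(a * b for a, b in zip(V[i], V[j]))


def fm_score(active, w0, w, V):
    """FM score for a binary input given as the list of active feature indices."""
    k = len(V[0])
    linear = sum(w[i] for i in active)
    pairwise = 0.0
    for f in range(k):
        s = sum(V[i][f] for i in active)          # sum of factor f
        s_sq = sum(V[i][f] ** 2 for i in active)  # sum of squared factor f
        pairwise += 0.5 * (s * s - s_sq)          # equals the sum over all pairs i<j
    return w0 + linear + pairwise


# Toy parameters: 4 features, k=2 factors (values chosen by hand).
w0 = 0.0
w = [0.5, 0.25, 0.0, 0.25]
V = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, 1.0]]

score = fm_score([0, 3], w0, w, V)
unseen = pair_weight(V, 1, 2)
print(score, unseen)
```

The interesting part is `pair_weight(V, 1, 2)`: even if features 1 and 2 never occurred together in the training data, their factor vectors were learned from other pairs, so the model still has an interaction weight for them.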

With this in mind, it is very likely that we can improve a per-user recommender by adding categorical features that express preferences or relations between different domains. One example is a pattern where a channel shows very specific sci-fi movies that a user enjoys, which means that these particular sci-fi movies should be scored higher for this user. Such a model is no longer independent, but conditioned on extra data that is not directly related to the movie. This might look like a restriction, but users rarely like all sci-fi movies, only those with specific themes. That means that if the meta data does not contain such information, we could use the relation between the movie and the channel to capture those user preferences, at least partly.

