The Tiny Feature Handbook

Even today, when deep learning is applied to almost everything, feature engineering remains an important task for many problems. Unlike images or audio, many data domains still require handcrafted features to help models find regularities in the data. It is therefore important to encode the semantics of each feature so that models can reach the best possible precision. And since we are currently rewriting our feature pipeline, we thought it would be a good idea to give a very brief overview of common types.

Let’s start with the most basic type, the Categorical feature. It is the classical 1-out-of-n encoding where a feature takes exactly one of n possible values, like the type of an item: series or movie. Since an item cannot be both, the encoding contains exactly one ‘1’ and ‘0’s everywhere else.
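As a minimal sketch of the 1-out-of-n encoding described above: the helper name one_hot and the ITEM_TYPES list are our own illustration and not part of the pipeline.

```python
def one_hot(value, categories):
    """Return a 1-out-of-n encoding of `value` over the ordered `categories`."""
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    # Exactly one position is 1, all others are 0.
    return [1 if value == c else 0 for c in categories]


# Hypothetical list of item types, taken from the example in the text.
ITEM_TYPES = ["series", "movie"]

encoded_movie = one_hot("movie", ITEM_TYPES)
encoded_series = one_hot("series", ITEM_TYPES)
```

The order of the categories fixes the position of each value in the vector, so the same list has to be used for training and for scoring.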

If the value is a float, like the rating of a movie (which does not need to be limited to stars), the feature is Continuous. In theory, the feature can take any value, although ratings are usually bounded and may even be discrete (1, 2, 3, 4, 5), which brings us to the next type, the Discrete feature. Examples of this feature are the year of release, for instance 1997 or 2016, or the duration of the movie.

Even if discrete features are limited, the range of possible values can be too large and/or biased, and then it makes sense to group them. This brings us to the Bucket feature, where nearby values are mapped to the same value. For instance, if we want to bucketize the year into decades, we can compress values with: f(year) = year - (year % 10)
Then 1991, 1995 and 1998 all go into the 1990 bucket. The benefit is that we reduce the feature space, with the drawback that we lose information. The duration is another candidate, because a runtime of 120 or 121 minutes does not make much of a difference, but larger differences might carry some meaning.
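The decade formula generalizes to any bucket width. The following is a small sketch; the function name bucketize and the 15-minute width for the runtime are assumptions chosen only for illustration.

```python
def bucketize(value, width):
    """Map `value` to the lower bound of its bucket of size `width`."""
    return value - (value % width)


# Years from the example in the text, bucketized into decades.
decades = [bucketize(year, 10) for year in (1991, 1995, 1998)]

# Runtimes of 120 and 121 minutes end up in the same (assumed) 15-minute bucket.
runtime_buckets = [bucketize(minutes, 15) for minutes in (120, 121)]
```

The bucket width controls the trade-off mentioned above: wider buckets shrink the feature space further but discard more information.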

A more exotic type is the Or feature, a logical feature that emits a “1” if any value from a pre-defined list is present. How can this be useful? Let’s assume that we want to specify whether a movie is for children. A good indicator is a genre like ‘childrens/family’, a dedicated flag like ‘suitable for kids’, or the rating. With the Or feature, we emit a “1” if any of the criteria matches.
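A sketch of the Or logic in plain Python could look as follows. The function name or_feature and the tag values are hypothetical, and for simplicity the criteria are modeled as plain string tags on the item.

```python
def or_feature(item_values, triggers):
    """Emit 1 if at least one of `triggers` occurs in `item_values`, else 0."""
    trigger_set = set(triggers)
    return int(any(value in trigger_set for value in item_values))


# Hypothetical criteria that indicate a movie for children.
KIDS_TRIGGERS = ["childrens/family", "suitable for kids"]

family_movie = or_feature(["comedy", "childrens/family"], KIDS_TRIGGERS)
horror_movie = or_feature(["horror", "thriller"], KIDS_TRIGGERS)
```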

If we require all criteria instead of any, we get the And feature, which emits a “1” only if all of the criteria match, for instance the genres ‘scifi’ and ‘action’. If the two conditions come from two different features, we get the Cross feature, which emits a “1” if a specific pair of feature values is present, like a particular director together with a particular actor. With the help of the Cross feature, we exploit the co-occurrence of values from two feature domains, which often improves the expressiveness of a model considerably, especially for linear models.
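Both variants can be sketched in a few lines. The names and_feature and cross_feature, the dictionary layout of an item, and the placeholder values director_a and actor_b are assumptions made for this example.

```python
def and_feature(item_values, required):
    """Emit 1 only if every value in `required` occurs in `item_values`."""
    return int(set(required) <= set(item_values))


def cross_feature(item, field_a, value_a, field_b, value_b):
    """Emit 1 if `value_a` is in `field_a` and `value_b` is in `field_b`."""
    in_a = value_a in item.get(field_a, [])
    in_b = value_b in item.get(field_b, [])
    return int(in_a and in_b)


# Hypothetical item; fields hold lists of values.
item = {
    "genres": ["scifi", "action"],
    "directors": ["director_a"],
    "actors": ["actor_b", "actor_c"],
}

scifi_action = and_feature(item["genres"], ["scifi", "action"])
scifi_horror = and_feature(item["genres"], ["scifi", "horror"])
pair_present = cross_feature(item, "directors", "director_a", "actors", "actor_b")
pair_missing = cross_feature(item, "directors", "director_a", "actors", "actor_x")
```

For a linear model, the cross feature adds a dedicated weight for the pair, which the model could not express from the two individual features alone.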

Sometimes we have a list of pre-defined topics, each described by a list of keywords, and we would like to measure how well an item matches such a topic. This can be implemented with a Counter feature that counts how many keywords from the list occur in the item. In other words, it counts the overlap, with an optional normalization step to avoid a bias towards longer lists.
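The counting step can be sketched like this. The function name counter_feature is hypothetical, and dividing by the length of the topic list is only one possible normalization, assumed here for illustration.

```python
def counter_feature(item_keywords, topic_keywords, normalize=False):
    """Count how many topic keywords occur in the item.

    With normalize=True the count is divided by the number of topic
    keywords (an assumed normalization), giving a value between 0 and 1.
    """
    topic = set(topic_keywords)
    overlap = len(topic & set(item_keywords))
    if normalize and topic:
        return overlap / len(topic)
    return overlap


# Hypothetical topic definition and item keywords.
topic = ["zombie", "undead", "virus", "outbreak"]
keywords = ["zombie", "virus", "city"]

raw_count = counter_feature(keywords, topic)
normalized = counter_feature(keywords, topic, normalize=True)
```

Using sets means that a keyword repeated in the item is counted only once, so the feature measures overlap rather than frequency.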

With these types wrapped into some nice Python modules, we can easily define a column for a specific feature like this:
decades = DiscretizeColumn('year', coll.distinct('year'), steps=10)

This code defines the decades of all documents in the given collection using all unique values of the ‘year’ field. The result will be something like [1970, 1980, 1990, 2000, 2010].

To capture “zombie” films specifically, we could define a dedicated feature:
zombies = OrColumn('keywords', ["zombie", "undead"])

The code defines a feature that is one if any of the given keywords is present in the item. This is very useful for manually encoding (temporal) trends.

Without a doubt, we have not even scratched the surface, but we wanted to emphasize that it is extremely important to have proper features at your disposal. We could also rely on a non-linear model to find all the correlations, but especially for the data we are using, optimizing such models can be very challenging, and even then they often fail to capture some important regularities. Thus, we decided to invest some time in feature engineering, with the benefit that a linear SVM often suffices. The advantage is that training the model is extremely fast, and so is the evaluation of the scores for items, because of the sparsity.

Bottom line, it’s the old truth that without good features even the best and deepest model will fail to achieve its goal. And since we cannot automatically learn features from the ground up, we need some form of kick-starter that allows us to train more powerful models.

