Simultaneous Feature Selection

We already mentioned that an SVM with an L1 penalty can be used for feature selection. The drawback is that we have to train a separate one-vs-rest model for each domain we want to filter. Furthermore, SVMs tend to struggle when the two classes are imbalanced.

That is the reason why we started to think about a method that is easier to use but nonetheless as expressive as the L1 SVM approach. Because it would be silly to reject simple ideas without first trying them, we started with a model that directly tries to reconstruct the genre distribution of a movie. In our example, we try to connect movie themes with their genres, where all features are binary. To be more precise, the input data X represents the themes of a movie and Y stands for its genres. The parameters of the model consist of a weight matrix W with |X| rows and |Y| columns, plus a bias. The non-linearity is the sigmoid function.
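As a sketch, the forward pass of such a model could look like the following NumPy snippet (the function names are our own; we assume X holds binary theme indicators per movie):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W, bias):
    # X:    (n_movies, n_themes) binary theme indicators
    # W:    (n_themes, n_genres) weight matrix
    # bias: (n_genres,) per-genre bias
    # returns predicted genre probabilities, shape (n_movies, n_genres)
    return sigmoid(X @ W + bias)
```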

Thus, we try to minimize the cross-entropy between Y and sigmoid(XW + bias). The intention of the model is not to predict the genre distribution for a new movie, but to select the “most important” features for a genre. In our case, important means that the weight of a feature is positive and as large as possible. To select the final features, we discarded all negative weights and sorted the rest by magnitude. We further defined a threshold and only kept features whose weights lie above it.
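A minimal sketch of this training-and-selection loop, using plain gradient descent on the cross-entropy (the function names, hyper-parameters, and the toy threshold are our own choices, not part of any library):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, Y, epochs=500, lr=0.5):
    # Minimise the element-wise cross-entropy between Y and sigmoid(X W + b)
    # by batch gradient descent.
    n, d = X.shape
    k = Y.shape[1]
    W = np.zeros((d, k))
    b = np.zeros(k)
    for _ in range(epochs):
        P = sigmoid(X @ W + b)   # predicted genre probabilities
        grad = (P - Y) / n       # gradient of the cross-entropy w.r.t. logits
        W -= lr * (X.T @ grad)
        b -= lr * grad.sum(axis=0)
    return W, b

def select_features(W, genre, names, threshold=0.1):
    # Keep only positive weights above the threshold, sorted by magnitude.
    w = W[:, genre]
    order = np.argsort(-w)
    return [names[i] for i in order if w[i] > threshold]
```

On a tiny toy dataset where “dragon” and “witch” only occur in fantasy movies, the selected features for the fantasy column contain exactly those themes.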

The results are very promising. Here are the top-k features for some distinctive genres.

fantasy : dragon, witch, afterlife, angel, wizard
crime : heist, gangster, hitman, double-cross
historic: king, colony, independent, war
western : outlaw-western, gunfight, frontier, cowboy

Compared to the SVM approach, we do not have to train several models and we do not have to care about the balance of classes. A disadvantage is that the problem is non-convex, so we have to live with the possibility of a poor solution due to local minima.

To sum up, our rather simple approach can be used to simultaneously find descriptive features for a set of genres. Some more work has to be done to find the best threshold, but we consider that fine-tuning. Beyond selection, the method could also be used to weight features in a given context (genre), since the keyword ‘robot’ should obviously be given more weight for a sci-fi action film than for a romantic comedy.
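Such a contextual weighting could be read straight off the learned weight matrix. A hypothetical helper (the names and the clipping of negative weights to zero are our own assumptions):

```python
import numpy as np

def contextual_weight(W, theme_names, theme, genre_index):
    # Look up the learned weight of a theme in the context of one genre.
    # Negative weights are clipped to zero, matching the selection rule
    # that only keeps positive weights. W and theme_names come from a
    # previously trained model (hypothetical inputs).
    return max(float(W[theme_names.index(theme), genre_index]), 0.0)
```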

