# Detour II: Feature Selection

Recently we introduced a way to learn features with an RBM. Now we want to compare that approach with a simple feature selector based on an L1-regularized linear SVM. Again, we do not describe Support Vector Machines in detail, since there are many good introductory texts about this classifier. The basic idea of an SVM (we only consider the linear case here) is to find a hyperplane that linearly separates the data points. In other words, each movie is represented by a single feature vector X that is multiplied by a weight vector W, and the sign of the dot product is the prediction: -1 if X·W < 0, +1 if X·W >= 0.
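The decision rule above can be sketched in a few lines; the vectors here are made up for illustration, not taken from the actual model:

```python
import numpy as np

# Hypothetical example: a movie as a binary keyword vector X and a
# learned weight vector W; the sign of the dot product is the prediction.
X = np.array([1.0, 0.0, 1.0, 1.0])    # which keywords the movie contains
W = np.array([0.5, -0.2, 0.3, -0.1])  # learned weights, one per keyword

score = X @ W                          # 0.5 + 0.3 - 0.1 = 0.7
prediction = 1 if score >= 0 else -1
print(prediction)                      # -> 1
```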

However, since we only use the classifier as a feature selector, we never predict any class labels; we only use the weight vector. More precisely, we use a mapping that assigns each keyword to a fixed position in the weight vector, which lets us read off which features are important for a specific genre.
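Such a keyword-to-position mapping can be sketched as follows; the vocabulary and helper function are hypothetical, chosen only to show the idea:

```python
# Hypothetical vocabulary: each keyword gets a fixed position, so every
# movie becomes a binary vector and each weight in the trained model can
# be traced back to its keyword afterwards.
vocab = {"high-school": 0, "popularity": 1, "teenagers": 2, "vampire": 3}

def to_vector(keywords):
    """Convert a movie's keyword set into a fixed-length binary vector."""
    v = [0.0] * len(vocab)
    for kw in keywords:
        if kw in vocab:
            v[vocab[kw]] = 1.0
    return v

movie = ["high-school", "popularity"]
print(to_vector(movie))  # -> [1.0, 1.0, 0.0, 0.0]
```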

The setup is as follows. We have a data set of movies where each movie is described by a set of topic keywords. We map each keyword to a fixed position, which allows us to convert each movie into a single training vector. We use the keywords from the most frequent movie genres and only consider keywords that occur at least N times. Next, we train a linear SVM for each genre on all chosen keywords. It should be noted that the input data is already very sparse, since each movie contains only a very small subset of all possible topic keywords. The idea is to penalize the weights with the L1 norm during training, which drives ‘unused’ weights to zero. The result is a sparse weight vector where most weights (features) are zero or very close to it, and the remaining weights (hopefully) characterize the training data.
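A minimal sketch of this per-genre training step with scikit-learn's LinearSVC; the data here is synthetic (dimensions, density, and the C value are made up for illustration), but the L1 penalty behaves the same way on real keyword vectors:

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the movie/keyword matrix: 200 movies, 50 keywords,
# ~10% density. Only the first 3 keywords correlate with the genre label.
n_movies, n_keywords = 200, 50
X = csr_matrix((rng.random((n_movies, n_keywords)) < 0.1).astype(float))
y = (np.asarray(X[:, :3].sum(axis=1)).ravel() > 0).astype(int)

# The L1 penalty drives most weights to exactly zero (requires dual=False).
clf = LinearSVC(penalty="l1", dual=False, C=0.5)
clf.fit(X, y)

w = clf.coef_.ravel()
print(np.count_nonzero(w), "of", n_keywords, "weights are non-zero")
```

Most of the fifty weights end up exactly zero, which is what makes the model usable as a feature selector.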

Since the outcome is much easier to describe with an example, we reuse the example from the RBM detour. We present the six features with the highest weights from the model trained on the ‘teen-movie’ genre. This is the outcome:

- high-school -> ~0.50
- popularity -> ~0.35
- high-school-life -> ~0.30
- college-life -> ~0.29
- teenagers -> ~0.27
- college -> ~0.24
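A ranking like the one above can be produced by sorting the weight vector and mapping positions back to keywords. A minimal sketch, with made-up weights standing in for the trained model's coefficients:

```python
import numpy as np

# Hypothetical weights and vocabulary; in practice w comes from the
# trained L1-SVM and idx_to_word from the keyword-to-position mapping.
idx_to_word = ["high-school", "popularity", "vampire", "teenagers", "space"]
w = np.array([0.50, 0.35, 0.0, 0.27, 0.0])

top = np.argsort(w)[::-1][:3]  # indices of the 3 largest weights
for i in top:
    print(f"{idx_to_word[i]} -> {w[i]:.2f}")
# -> high-school -> 0.50
# -> popularity -> 0.35
# -> teenagers -> 0.27
```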

As we can clearly see, the chosen keywords give a good summary of the concept of ‘teen movies’: the most important aspects are teenagers, school and popularity. Other important keywords are ‘boyfriend’, ‘cheer-leading’ and ‘summer-camp’, which also fit the concept very well.

In contrast to the RBM example, we do not get neurons that represent specific details of the concept, only a global summary of it. Still, as we can see, very simple models like L1-regularized SVMs can select useful features for further tasks.

In a separate posting, we plan to demonstrate that the genre features can be used to build a very simple clustering model that goes beyond simple movie genre categorizations.