Detour I: Feature Learning

Before we describe our initial system, we take a short detour into feature generation. There are many approaches to extracting features from data, but most of them require the data to be dense. That is a big problem for our kind of data, since it is usually very sparse.

After some research, we decided to give Restricted Boltzmann Machines (RBMs) a try. The rationale is that these models have already been used for similar tasks and that the standard RBM is well suited to modeling binary data. Since there are many good introductory texts about RBMs, we will not describe the approach in detail here. RBMs can be used for many tasks, for instance feature learning, topic modeling, or dimensionality reduction, which shows their potential and versatility.

If we treat the RBM as a black box, we feed the algorithm some data X, where X is a sequence of binary vectors x (samples) that represent our movies, and we get a latent (hidden) representation H of the data. RBMs are unsupervised, which means we do not use any labels to create the model. Instead, the algorithm fits a model that tries to describe the data X as well as possible via the latent variables H.

To be more specific, let us try to model a specific topic of movies, ‘teen movies’, with a fixed number of pre-defined story keywords. That means we have a sequence of vectors, where each vector represents a single movie. The vector has a ‘1’ at position i if keyword i is present and a ‘0’ otherwise. As we mentioned before, the data X is usually sparse, since each movie contains only a small subset of all possible story keywords.
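The binary encoding described above can be sketched in a few lines of Python. The vocabulary and the two movies below are hypothetical and serve only as an illustration; the real keyword list has about 160 entries.

```python
# Hypothetical keyword vocabulary and movies, for illustration only.
vocabulary = ["high-school", "new-kid-in-town", "coming-of-age", "vampires", "friendship"]

movies = {
    "movie_a": {"high-school", "new-kid-in-town"},
    "movie_b": {"vampires", "friendship", "high-school"},
}


def encode(keywords, vocabulary):
    """Return a binary vector with 1 at position i if vocabulary[i] is present."""
    return [1 if word in keywords else 0 for word in vocabulary]


# One binary row per movie, in the insertion order of the dictionary.
X = [encode(keywords, vocabulary) for keywords in movies.values()]
```

With a realistic vocabulary most entries of each row are zero, which is exactly the sparsity mentioned above.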

Now we need to fit a model using only movies of this specific genre, with the pre-defined keywords converted to the binary representation. We will not describe the details of the training here, only that we used 16 hidden nodes (H) and about 160 visible nodes (X).
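For readers who want something concrete, here is a minimal sketch of how such a model could be fitted. It is not the training procedure we used: it assumes one-step contrastive divergence (CD-1) on the full batch, and the learning rate, the number of epochs and the synthetic data are arbitrary choices for illustration. Only the layer sizes (16 hidden, 160 visible) are taken from the text.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_rbm(X, n_hidden=16, learning_rate=0.1, n_epochs=50, seed=0):
    """Fit a Bernoulli RBM with one-step contrastive divergence (CD-1).

    Returns (W, b_visible, b_hidden), where W has shape (n_visible, n_hidden).
    """
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    n_samples, n_visible = X.shape
    W = 0.01 * rng.standard_normal((n_visible, n_hidden))
    b_visible = np.zeros(n_visible)
    b_hidden = np.zeros(n_hidden)
    for _ in range(n_epochs):
        # Positive phase: hidden probabilities given the data.
        h_prob = sigmoid(X @ W + b_hidden)
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        # Negative phase: one Gibbs step down to the visible layer and up again.
        v_recon = sigmoid(h_sample @ W.T + b_visible)
        h_recon = sigmoid(v_recon @ W + b_hidden)
        # CD-1 update: data statistics minus reconstruction statistics.
        W += learning_rate * (X.T @ h_prob - v_recon.T @ h_recon) / n_samples
        b_visible += learning_rate * (X - v_recon).mean(axis=0)
        b_hidden += learning_rate * (h_prob - h_recon).mean(axis=0)
    return W, b_visible, b_hidden


# Synthetic sparse binary data standing in for the real movie vectors.
data_rng = np.random.default_rng(42)
X = (data_rng.random((200, 160)) < 0.05).astype(float)
W, b_visible, b_hidden = train_rbm(X)
```

Each column of W then holds the weights between one hidden node and all visible nodes, which is what the visualization in the next step looks at.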

The next question is how to visualize what the model has learned. A naive approach is to treat each hidden node h as a neuron that has learned a concept: its weights to the input nodes indicate whether a keyword contributes positively or negatively to the concept, and whether it matters at all. Since each hidden node h has a weighted connection to every visible node v, we can consider only the connections from h to v with a large positive weight. Doing this for all hidden nodes gives a top-k list of keywords for each node.
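The top-k selection can be sketched as follows. The function name, the vocabulary and the weight values are hypothetical, and the weights are stored with one row per hidden node purely for readability.

```python
def top_k_keywords(weights, vocabulary, k=3):
    """For each hidden node, return up to k keywords with the largest positive weights.

    weights[j][i] is the weight between hidden node j and visible node i.
    """
    topics = []
    for row in weights:
        ranked = sorted(zip(vocabulary, row), key=lambda pair: pair[1], reverse=True)
        # Keep only positive weights, since negative ones speak against the concept.
        topics.append([(word, w) for word, w in ranked[:k] if w > 0])
    return topics


# Hypothetical weights for two hidden nodes over five keywords.
vocabulary = ["high-school", "new-kid-in-town", "vampires", "friendship", "neighbor"]
weights = [
    [4.0, 1.5, -0.3, 0.2, -1.0],
    [1.4, -0.5, 1.6, 2.7, 1.2],
]

for j, topic in enumerate(top_k_keywords(weights, vocabulary)):
    print(j, topic)
```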

The following topic was learned by one neuron of our model; the left-hand side is the keyword and the right-hand side is the rounded weight:

high-school -> ~4.0
new-kid-in-town -> ~1.5
coming-of-age -> ~1.3
cliques -> ~1.2
teenagers -> ~1.1
high-school-life -> ~1.0
student -> ~1.0
teen -> ~1.0

In this example, it is easy to interpret the topic as a classic scheme in teen movies. A teenager moves with his family to a new town, which makes him the new kid in town and at school. He then has to integrate himself into the existing cliques at high school, with all the problems that come with it.

Another learned topic of our model:

friendship -> ~2.7
vampires -> ~1.6
friend -> ~1.4
high-school -> ~1.4
fantastic-reality -> ~1.4
neighbor -> ~1.2
monster -> ~1.2
amateur-sleuths -> ~1.1

The output indicates that the model has not only learned to describe concepts, but also to cluster similar words like ‘friendship’ and ‘friend’. Since only a limited number of movies was available for fitting our model, ‘underfitting’ is a serious issue, which might be visible in the top-k weights of a concept.
First, the concept is not as broad as before but more specific, and it seems that the neuron represents multiple vampire-like concepts. On the one hand, we can see the resemblance to movies where the neighbor is some kind of monster or vampire and the teenagers find out and try to defeat the monster, possibly with detective elements. On the other hand, the neuron likely describes movies where vampires are a key element in combination with high schools and friendship.

In a nutshell: even with the few samples we used to train the model, it can be clearly seen that RBMs are able to extract useful features from our movie data and even to describe the data in terms of higher-level concepts that can be further used to learn better features. The results could be used to feed a supervised learning scheme, by transforming the raw movie data into the feature space, or to cluster movies into semantically useful groups, just to name a few possibilities.
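Transforming a raw movie vector into the feature space amounts to computing the activation probability of each hidden node. The sketch below assumes the standard sigmoid activation of a binary RBM; the weights, biases and input vector are made up for illustration.

```python
import math


def hidden_features(x, weights, hidden_bias):
    """Map a binary keyword vector x to hidden activation probabilities.

    weights[j][i] connects hidden node j to visible node i.
    """
    features = []
    for row, bias in zip(weights, hidden_bias):
        activation = bias + sum(w * v for w, v in zip(row, x))
        features.append(1.0 / (1.0 + math.exp(-activation)))
    return features


# Hypothetical model with two hidden nodes and three keywords.
weights = [[4.0, 1.5, -0.3], [1.4, -0.5, 1.6]]
hidden_bias = [-2.0, -1.0]
x = [1, 1, 0]

features = hidden_features(x, weights, hidden_bias)
```

The resulting vector has one entry per hidden node and could serve as input to a classifier or a clustering algorithm.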

