A Proper Criterion For Supervision

In the literature, the unsupervised learning phase is often followed by a supervised fine-tuning phase to improve the discriminative ability of a model. For standard data sets like MNIST, it is clear which label to use for this process (the class of the digit), and the challenge lies in the parameter selection. In our case, however, the situation is very different.

It would be much easier if we pre-trained a data model and then fine-tuned it for a specific user. That is surely possible, but the result would be a user-specific model. What we want instead is a model that describes the data very well, which we then fine-tune on a specific aspect of the data to improve the overall discrimination. Let us illustrate this with an example:

We have a set of movies, where each movie is described by dedicated keywords. The keywords might describe the mood, parts of the story or some flags. We treat each keyword as a binary feature. Then, we select the most frequent keywords and convert each movie to a binary vector. The vector contains a 1 at position i if keyword i is present, and 0 otherwise. The first step is to find latent topics in the data set; in other words, we try to find a good model to describe P(data). For this purpose, we usually use some kind of RBM (restricted Boltzmann machine). This step results in a model where the weight matrix from the visible to the hidden units describes latent high-level concepts of the data.
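To make the encoding step concrete, here is a minimal sketch of how the binary vectors could be built. The movie titles, the keywords, and the helper names `build_vocabulary` and `to_binary_matrix` are made up for illustration and are not taken from our actual pipeline.

```python
from collections import Counter

import numpy as np


def build_vocabulary(movies, top_k):
    """Return the top_k most frequent keywords, most frequent first."""
    # Count each keyword at most once per movie.
    counts = Counter(kw for keywords in movies.values() for kw in set(keywords))
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [kw for kw, _ in ranked[:top_k]]


def to_binary_matrix(movies, vocabulary):
    """Encode every movie as a 0/1 vector over the vocabulary."""
    index = {kw: i for i, kw in enumerate(vocabulary)}
    titles = sorted(movies)
    X = np.zeros((len(titles), len(vocabulary)), dtype=np.float32)
    for row, title in enumerate(titles):
        for kw in movies[title]:
            if kw in index:  # keywords outside the vocabulary are ignored
                X[row, index[kw]] = 1.0
    return titles, X


# Hypothetical toy data, for illustration only.
movies = {
    "movie_a": ["dark", "revenge", "based-on-novel"],
    "movie_b": ["dark", "heist"],
    "movie_c": ["romance", "based-on-novel", "dark"],
}
vocabulary = build_vocabulary(movies, top_k=3)
titles, X = to_binary_matrix(movies, vocabulary)
```

Each row of `X` is one movie, and each column corresponds to one of the selected keywords. A matrix like this is what the RBM would see as its visible units.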

The activation probabilities of the hidden nodes can then be treated as an indicator of whether a certain topic is present in the movie. We call this the semantic space, and it can be used for classification or clustering. Now we want to fine-tune the model with a criterion that is neither too restrictive nor too general.
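As a rough sketch of what "RBM plus semantic space" means in code, the following trains a small Bernoulli RBM with one step of contrastive divergence (CD-1) and then reads off the hidden activation probabilities. The class name `TinyRBM`, the layer sizes, the learning rate, and the random toy data are all assumptions for illustration; our real models differ in the details.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


class TinyRBM:
    """Minimal Bernoulli RBM trained with CD-1 (illustrative sketch only)."""

    def __init__(self, n_visible, n_hidden, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_visible = np.zeros(n_visible)
        self.b_hidden = np.zeros(n_hidden)

    def hidden_probabilities(self, V):
        # P(h_j = 1 | v): one value per latent topic and movie.
        return sigmoid(V @ self.W + self.b_hidden)

    def visible_probabilities(self, H):
        return sigmoid(H @ self.W.T + self.b_visible)

    def cd1_step(self, V, learning_rate=0.1):
        # Positive phase: hidden probabilities driven by the data.
        h_prob = self.hidden_probabilities(V)
        h_sample = (self.rng.random(h_prob.shape) < h_prob).astype(V.dtype)
        # Negative phase: one Gibbs step, using probabilities for the
        # reconstruction instead of samples to reduce noise.
        v_recon = self.visible_probabilities(h_sample)
        h_recon = self.hidden_probabilities(v_recon)
        n = V.shape[0]
        self.W += learning_rate * (V.T @ h_prob - v_recon.T @ h_recon) / n
        self.b_visible += learning_rate * (V - v_recon).mean(axis=0)
        self.b_hidden += learning_rate * (h_prob - h_recon).mean(axis=0)
        # Mean squared reconstruction error, only as a rough progress signal.
        return float(np.mean((V - v_recon) ** 2))


# Random binary "keyword" data as a stand-in for real movies.
data_rng = np.random.default_rng(1)
X = (data_rng.random((20, 8)) < 0.3).astype(np.float64)

rbm = TinyRBM(n_visible=8, n_hidden=4)
for epoch in range(50):
    error = rbm.cd1_step(X)

# Each row is a movie in the semantic space, each column a latent topic.
semantic_space = rbm.hidden_probabilities(X)
```

The rows of `semantic_space` are the features one would hand to a classifier or a clustering method.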

The most obvious choice would be to use the genre of a movie. We can assign one or more genres to each movie and use some classifier, a softmax for instance, to fine-tune the weights by back-propagating the errors through the network. This can be done by adding the classifier as a new layer on top of our existing network. This approach is used in the literature with a lot of success.
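The fine-tuning procedure can be sketched as follows: a softmax layer is stacked on top of the hidden layer, and the cross-entropy error is propagated back into the weights of the hidden layer. This is a toy illustration under several assumptions: `W` stands in for pre-trained weights (here it is just random), the data is a made-up 4x4 identity matrix, and there are only two genres. For a movie with several genres, one option would be a target row that spreads the probability mass evenly over its genres, but that is a modeling choice rather than a fixed rule.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def fine_tune_step(X, T, W, b_hidden, U, b_out, learning_rate=0.5):
    """One full-batch gradient step; parameters are updated in place.

    Returns the mean cross-entropy measured before the update.
    Rows of T must sum to 1 for the (P - T) gradient to be valid.
    """
    n = X.shape[0]
    H = sigmoid(X @ W + b_hidden)   # topic activations
    P = softmax(H @ U + b_out)      # genre probabilities
    loss = -np.mean(np.sum(T * np.log(P + 1e-12), axis=1))

    # Backward pass: softmax + cross-entropy gives (P - T) at the logits.
    d_logits = (P - T) / n
    d_hidden = (d_logits @ U.T) * H * (1.0 - H)

    # Compute all gradients before touching any parameter.
    grad_U = H.T @ d_logits
    grad_W = X.T @ d_hidden

    U -= learning_rate * grad_U
    b_out -= learning_rate * d_logits.sum(axis=0)
    W -= learning_rate * grad_W     # this is the actual fine-tuning
    b_hidden -= learning_rate * d_hidden.sum(axis=0)
    return float(loss)


rng = np.random.default_rng(0)
X = np.eye(4)                                   # four toy "movies"
T = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]])
W = 0.5 * rng.standard_normal((4, 3))           # stand-in for pre-trained weights
b_hidden = np.zeros(3)
U = np.zeros((3, 2))                            # new softmax layer, starts at zero
b_out = np.zeros(2)

losses = [fine_tune_step(X, T, W, b_hidden, U, b_out) for _ in range(200)]
```

Note that the update of `W` is what distinguishes fine-tuning from simply training a classifier on fixed features: the latent topics themselves are moved toward whatever the label rewards.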

The problem is not the procedure itself, but the label we are using. In the literature, the approach is used to lower the test error, which works quite well. But since we do not have a classification problem, it is not obvious whether such an “aspect fine-tuning” is beneficial at all. We conducted some experiments with ReLU autoencoders (AEs), and the learned topics were often useful. We then initialized the AE with the weights from a pre-trained model to avoid solutions stuck in poor local minima.

We are still working on a measure to assess the results, but it seems that the genre information might be too coarse for this task. Stated differently, it is possible that a latent topic belongs to more than one genre, so that a clear-cut discrimination is not easily possible. In such a case, the weights might be adjusted in one direction and then in the opposite one, which cancels out the adjustment.
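The cancellation effect can be illustrated with a tiny, hand-made example: two movies share the same keywords and therefore activate the same topic unit, but they carry different genre labels. All numbers below (the keyword vector `x`, the topic weights `w`, and the topic-to-genre weights `u`) are hypothetical and chosen only to make the signs easy to see.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def topic_weight_gradient(x, w, u, target):
    """Gradient of the cross-entropy w.r.t. the weights of one topic unit."""
    h = sigmoid(x @ w)           # activation of the single topic unit
    p = softmax(h * u)           # genre probabilities from that topic alone
    d_logits = p - target        # softmax + cross-entropy gradient
    d_h = d_logits @ u           # error arriving at the topic unit
    return d_h * h * (1.0 - h) * x


x = np.array([1.0, 1.0, 0.0])    # both movies have keywords 0 and 1
w = np.array([0.8, 0.6, -0.4])   # hypothetical topic weights
u = np.array([0.7, -0.7])        # hypothetical topic-to-genre weights

grad_a = topic_weight_gradient(x, w, u, np.array([1.0, 0.0]))  # genre 0
grad_b = topic_weight_gradient(x, w, u, np.array([0.0, 1.0]))  # genre 1
grad_sum = grad_a + grad_b
```

In this sketch the two gradients point in opposite directions for the active keywords, so the combined update is smaller than the larger of the two individual updates. Whether this is what actually happens in our network is still an open question.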

There is still lots of work to do, but with each new day, we can see a little more light at the end of the tunnel.

