We have written before about combining an autoencoder (AE) objective, which reconstructs the data through a bottleneck, with supervised and/or discriminative criteria. In our case, we mostly care about building a semantic space in which similar movies cluster together while dissimilar movies are pushed apart. A paper we recently found describes an AE variant with discriminative capabilities: its objective is to reconstruct the positive examples well and to push the negative ones away.
To test the approach on our data, we split all sub-genres into crime vs. rest and trained a discriminative AE. The positive movies (crime) were labeled +1 and all others -1. In the spirit of SVMs, the objective is a customized hinge loss that controls the reconstruction error with a threshold for each of the classes (T for -1). The first results were not mind-blowing, but after further investigation the output of the model made much more sense. As usual, we used the top-k keywords as features. The following example demonstrates the power, but also the limitations, of the objective.
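To make the idea of the objective more concrete, here is a minimal, dependency-free sketch of a hinge loss on the reconstruction error. It is our reading of the idea, not necessarily the exact formulation from the paper: the squared Euclidean error, the names `t_pos` and `t_neg`, and the averaging over the batch are all assumptions made for illustration.

```python
def reconstruction_error(x, x_hat):
    """Squared Euclidean distance between an input and its reconstruction."""
    return sum((a - b) ** 2 for a, b in zip(x, x_hat))


def discriminative_hinge_loss(errors, labels, t_pos, t_neg):
    """Mean hinge loss over per-example reconstruction errors.

    Assumed form: max(0, y * (error - threshold_y)).
    Positives (+1) are penalized when their error exceeds t_pos,
    negatives (-1) when their error falls below t_neg.
    """
    total = 0.0
    for err, y in zip(errors, labels):
        threshold = t_pos if y == 1 else t_neg
        total += max(0.0, y * (err - threshold))
    return total / len(errors)


# Toy batch: two crime movies (+1) and two non-crime movies (-1).
errors = [0.1, 2.0, 3.0, 0.5]
labels = [1, 1, -1, -1]
loss = discriminative_hinge_loss(errors, labels, t_pos=0.5, t_neg=2.0)
```

In this toy batch, only the second positive (error too high) and the second negative (error too low) contribute to the loss; the other two examples already sit on the right side of their thresholds.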
We analyze the quality by using a movie as a query and retrieving its k nearest neighbors. We use two 007 movies. The first one has very few keywords, while the other has more:
For query I), we did not expect top results, because its few keywords may also be present in a lot of non-crime movies. And indeed, there are hits like ‘Orca’ or ‘Hard Target’ which are clearly not crime but at least contain a subset of the query keywords. Thus, the model was able to capture the relevant keywords and assigned such movies more or less correctly to the proper genre.
For query II), the results are much better. The first five matches clearly belong to the crime genre, and the remaining results have, without a doubt, a strong crime theme, which can be seen in their sub-genres and keywords.
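The retrieval step used for both queries can be sketched roughly as follows. We assume here that each movie is represented by its hidden code from the AE and that neighbors are ranked by cosine similarity; the helper names and the toy vectors are made up for illustration and are not our real data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def nearest_neighbors(query_title, codes, k=5):
    """Return the k titles whose codes are most similar to the query's code."""
    query = codes[query_title]
    scored = [
        (cosine_similarity(query, vec), title)
        for title, vec in codes.items()
        if title != query_title  # never return the query itself
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [title for _, title in scored[:k]]


# Hypothetical 2-d codes, only to show the call.
codes = {
    "query": [1.0, 0.0],
    "movie_a": [0.9, 0.1],
    "movie_b": [0.0, 1.0],
    "movie_c": [-1.0, 0.0],
}
top2 = nearest_neighbors("query", codes, k=2)
```

With only a handful of keywords, the code of a query is driven by very few input dimensions, so many unrelated movies can end up at a similar angle, which matches what we saw for query I).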
In a nutshell: when each movie is described by a sufficient number of keywords, a discriminative AE is a very useful way to train a model that splits the space into two regions. Furthermore, since the label is only required during training, it is straightforward to ‘classify’ new movies.
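A rough sketch of how such a ‘classification’ of an unseen movie could look: we reconstruct the movie and compare the error against a cutoff. The `reconstruct` callable stands in for a trained model, and the decision rule with a single `threshold` is an assumption on our side, not something prescribed by the objective.

```python
def classify(x, reconstruct, threshold):
    """Label x as +1 if the model reconstructs it well, otherwise -1."""
    x_hat = reconstruct(x)
    error = sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return 1 if error <= threshold else -1


# Stand-in for a trained AE: it can only reproduce the first feature.
def toy_reconstruct(x):
    return [x[0], 0.0]


label_a = classify([1.0, 0.1], toy_reconstruct, threshold=0.5)  # small error
label_b = classify([0.0, 2.0], toy_reconstruct, threshold=0.5)  # large error
```

No label is needed at this point, only the features of the new movie and the trained model.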