To train a good model, you need sufficient data. What exactly this means depends on the model, but more data usually does not hurt performance; on the contrary, it tends to help, especially for unsupervised models. That is why we invested some extra time to gather more data. And because our data is sparse, more of it increases the chance that rare pairs of keywords occur often enough to learn useful patterns.
This puts us in a bit of a pickle. First, to capture higher-order relations in the data, or stated differently, to disentangle the explanatory factors, we need bigger models. In other words, several layers are required to get useful predictions or an accurate model of the data. However, a bigger model also means more parameters to learn, which in turn requires a sufficient amount of data. For that reason, we decided to evaluate whether more layers are actually necessary, and because we have labels for all the movies, we started with a supervised problem. We already talked about simple methods to turn sub-genres into a flat taxonomy, which we use here as labels. That gives us a multi-label problem, which can be addressed with a sigmoid layer on top and cross-entropy as the cost function.
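To make the multi-label setup concrete, here is a minimal NumPy sketch of a sigmoid output combined with cross-entropy. It is only an illustration, not our actual training code: the function names, the toy logits, and the three example labels are all assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic function, one independent probability per label."""
    return 1.0 / (1.0 + np.exp(-z))

def multilabel_cross_entropy(logits, targets, eps=1e-7):
    """Binary cross-entropy summed over labels and averaged over movies."""
    # Clip to avoid log(0) for very confident predictions.
    probs = np.clip(sigmoid(logits), eps, 1.0 - eps)
    per_label = -(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))
    return per_label.sum(axis=1).mean()

# Toy batch (assumed): 2 movies, 3 labels, e.g. action / horror / crime.
logits = np.array([[2.0, -1.0, 0.5],
                   [-3.0, 0.0, 4.0]])
targets = np.array([[1.0, 0.0, 1.0],
                    [0.0, 1.0, 1.0]])
loss = multilabel_cross_entropy(logits, targets)
print(loss)
```

Because every label gets its own sigmoid, a movie can be "action" and "crime" at the same time, which a softmax over the labels would not allow.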
We further use AdaGrad because the data is very sparse and feature frequencies vary widely, which means a single global learning rate is likely to hurt the accuracy of the model a lot. All hidden neurons are ReLUs, and we use dropout as a kind of regularization. The input data consists of the Top-K keywords from the data, and the output is the binary encoding of the sub-genre taxonomy. In other words, we try to estimate the action/horror/crime/… aspects of movies with the given keywords as features. For starters, we compare a model with one hidden layer against a model with two. Since our data is still limited, we have to be careful with the number of layers; in addition, vanishing gradients become more likely with more layers.
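The reason AdaGrad suits sparse features can be seen in a small sketch of its update rule. This is the textbook form written in plain NumPy, not the code of our model; the learning rate, the two toy features, and the gradient values are assumptions chosen for illustration.

```python
import numpy as np

def adagrad_update(weights, grad, grad_sq_sum, lr=0.1, eps=1e-8):
    """One AdaGrad step; returns new weights, new accumulator and the step taken."""
    # Accumulate squared gradients separately for every parameter.
    grad_sq_sum = grad_sq_sum + grad ** 2
    # Parameters with a long gradient history get a smaller effective rate.
    step = lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return weights - step, grad_sq_sum, step

weights = np.zeros(2)
grad_sq_sum = np.zeros(2)
for t in range(10):
    # Feature 0 (frequent keyword) fires at every step,
    # feature 1 (rare keyword) only at the very last one.
    grad = np.array([1.0, 1.0 if t == 9 else 0.0])
    weights, grad_sq_sum, step = adagrad_update(weights, grad, grad_sq_sum)

print(step)  # the step for the rare feature is larger than for the frequent one
```

After ten updates, the frequent feature moves with roughly lr / sqrt(10), while the rare feature still moves with roughly the full lr on its first appearance.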
The comparison is done by taking the output of the model, selecting an arbitrary movie, and using the L2 distance to find its nearest neighbors. The better the model works, the more “similar” movies, or at least movies with a very similar distribution of aspects, should be present in the Top-K results. For the one-layer model, especially for movies with a complex distribution of aspects, some of the Top-K results seem very unusual. If we check the actual keywords of such a result, we can at least relate it to the query movie, but there are definitely better matches in the data. With the two-layer model, the results are much more consistent and outliers are harder to find. That is no real surprise, because a single layer can hardly combine input features into higher-level concepts.
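The neighbor lookup used for this comparison can be sketched in a few lines of NumPy. The helper name and the toy output matrix are assumptions for illustration; in practice, each row would be the sigmoid output of the model for one movie.

```python
import numpy as np

def nearest_neighbors(outputs, query_idx, k=3):
    """Indices and L2 distances of the k rows closest to the query row."""
    diffs = outputs - outputs[query_idx]
    dists = np.sqrt((diffs ** 2).sum(axis=1))
    dists[query_idx] = np.inf  # never return the query movie itself
    order = np.argsort(dists)[:k]
    return order, dists[order]

# Toy model outputs (assumed): one row per movie, one column per aspect.
outputs = np.array([
    [0.9, 0.1, 0.8],  # movie 0, used as the query
    [0.8, 0.2, 0.7],  # movie 1, similar mix of aspects
    [0.1, 0.9, 0.1],  # movie 2, very different
    [0.9, 0.1, 0.6],  # movie 3, similar but weaker third aspect
])
neighbors, distances = nearest_neighbors(outputs, query_idx=0, k=2)
print(neighbors, distances)
```

Here movies 1 and 3 are returned for the query, while movie 2 with its very different distribution of aspects is left out.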
In a nutshell, the depth of the model is the key to success. In other domains, it has been shown that the accuracy of a model decreases drastically when layers are removed from a reference model. We toyed with the idea, borrowed from the image domain, of learning a “base” model from huge amounts of data and then using the high-level features of this model for transfer learning. However, because we use handcrafted input features, it is very unlikely that we can re-use model features for different tasks.