Stated very simply, one difference between unsupervised and supervised learning is that the former tries to find features that describe the data, while the latter tries to find features that discriminate between classes of data. In other words, if we just care whether a movie is sci-fi or not, we do not need a perfect model of the data, only one that allows us to draw a line between the two categories of movies.
This makes it obvious that a plain label usually does not carry much information. For instance, the label tells us nothing about the theme of a sci-fi movie, whether it is about aliens and space or some kind of new wave. All we know is that it is a sci-fi movie or not.
Of course, this is nothing new, and it is one reason why our supervised models trained on these 'flat' labels usually did not perform very well. So, we thought about ways to increase the information density of labels. One possibility is to encode each movie with a taxonomy as a kind of pseudo label with more than one dimension. The values could be binary, to indicate whether a feature is present or not, or real-valued, to simulate probabilities. The cost function to minimize could be the cross-entropy function. Thus, our ultimate goal is to reconstruct a concept vector instead of just classifying a movie into a single category.
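To make the idea concrete, here is a minimal sketch of such a multi-dimensional pseudo label and a cross-entropy cost. The concept dimensions and all numbers are made up for illustration; the real taxonomy would of course be much larger.

```python
import numpy as np

# Hypothetical pseudo label: each dimension marks one concept of the
# taxonomy (e.g. aliens, space, dystopia, ...). Values may be binary
# or probability-like.
target = np.array([1.0, 0.8, 0.0, 0.3])

# Model output after a sigmoid: one probability per concept dimension.
pred = np.array([0.9, 0.6, 0.1, 0.4])

def binary_cross_entropy(p, t, eps=1e-12):
    """Cross entropy per concept dimension, averaged over the vector."""
    p = np.clip(p, eps, 1 - eps)  # avoid log(0)
    return float(np.mean(-(t * np.log(p) + (1 - t) * np.log(1 - p))))

loss = binary_cross_entropy(pred, target)
```

Minimizing this loss pushes the predicted vector towards the full concept vector, so the model has to reconstruct every theme of the movie instead of a single category bit.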
We ran a very simple example, where we derived the concept vectors from the factorization of the co-occurrence matrix with a non-negative matrix factorization. The idea was to use the Jaccard coefficient to measure the overlap between each latent topic and the movie keywords. Since we also consider the context of keywords, we can better model the case where two movies are very similar but do not have any keywords in common, like "robbery" and "heist" or "cowboy" and "sheriff".
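The pipeline can be sketched roughly like this, with a tiny made-up vocabulary and co-occurrence counts; the real data, factorization rank, and the choice of how many top keywords define a topic are of course different.

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy keyword vocabulary and a symmetric co-occurrence matrix (assumed
# counts; in practice these come from the keyword statistics of the corpus).
keywords = ["robbery", "heist", "bank", "cowboy", "sheriff", "duel"]
C = np.array([
    [2, 2, 1, 0, 0, 0],
    [2, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 0],
    [0, 0, 0, 2, 2, 1],
    [0, 0, 0, 2, 2, 2],
    [0, 0, 0, 1, 2, 2],
], dtype=float)

# Non-negative factorization C ~ W @ H; each row of H is a latent topic,
# expressed as weights over the keyword vocabulary.
nmf = NMF(n_components=2, init="nndsvda", max_iter=500, random_state=0)
W = nmf.fit_transform(C)
H = nmf.components_

def topic_keywords(h, top=3):
    """Keywords with the largest weight in one latent topic."""
    return {keywords[i] for i in np.argsort(h)[::-1][:top]}

def jaccard(a, b):
    """Overlap of two keyword sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b) if a | b else 0.0

# Score one movie's keywords against every latent topic.
movie = {"robbery", "bank"}
scores = [jaccard(topic_keywords(h), movie) for h in H]
```

Because "robbery" and "heist" co-occur, they end up in the same latent topic, so a "heist" movie and a "robbery" movie score high against the same topic even with no keywords in common.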
The results looked quite promising, but the chosen label vectors are not powerful enough to build a proper semantic clustering, one that also works for the challenging movies that combine many themes and contain ambiguous keywords.