Without a doubt, backprop is a very efficient learning rule for training (supervised) neural networks, and as long as nobody comes up with a better method, it will be our companion for the coming years. This is not a real problem, because recent advances in the optimization of deep nets have been very fruitful, with computer vision as a prominent example. However, despite this success, training a network that is driven only by weak labels is clumsy and not very efficient.
The problem is that the error provided by a label, like “western” for a genre, in combination with an objective, like the categorical cross-entropy, often leads to very weak learning signals. For example, when training has just started, all weights are random. The first sample is fed into the network, which produces an initial guess that is very likely wrong. The mistake is then backpropagated through the network to adjust the weights towards a correct prediction. Thus, the whole learning process is motivated only by avoiding classification errors as measured by the loss function. In other words, the model does not need to understand the whole data, it just has to find enough regularities to explain the labels. On the one hand, this often makes training very easy, but on the other hand, it can severely limit the power of the model.
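To make the size of this signal concrete, here is a minimal sketch of what a single label contributes under softmax and categorical cross-entropy. The four genres and the all-zero logits (standing in for a network with no preference yet) are assumptions made for illustration, not part of any real model.

```python
import numpy as np

genres = ["western", "horror", "comedy", "drama"]  # hypothetical label set

def softmax(z):
    z = z - np.max(z)          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Assumption: an untrained network that has no preference for any genre.
logits = np.zeros(len(genres))
p = softmax(logits)

target = genres.index("western")
loss = -np.log(p[target])                  # categorical cross-entropy
one_hot = np.eye(len(genres))[target]
grad_logits = p - one_hot                  # gradient of the loss w.r.t. the logits

print("loss:", loss)
print("error signal:", dict(zip(genres, grad_logits)))
```

No matter how rich the movie description is, everything that flows back into the network starts from this one small vector with a single entry per genre. With four equally likely genres, a label can carry at most two bits of information per sample.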
For example, if we have two genres, “western” and “horror”, which can be perfectly separated by a single keyword like ‘cowboy’, the learning will effectively stop as soon as the network has found this pattern. The problem is that if the keyword ever appears in a “horror” movie, the prediction is garbage, because the world of the network no longer makes sense. In general, discriminative models do a very good job if the data contains enough patterns that allow the network to generalize to unseen data. However, this approach usually requires a lot of examples to catch all relevant regularities of the data, and even then, the learning is purely driven by labels which carry very little information. Figuratively speaking, the network sees the real world only through “label glasses” which hide all the details that are not important for the discriminative task. For the toy example, this means every object is represented by a single note that contains either the single word “cowboy” or no text at all. Stated differently, the model knows _nothing_ about the concept of horror or western.
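The keyword shortcut can be reproduced with a very small sketch. The four-word vocabulary, the six toy movies and the plain logistic regression trained by gradient descent are all assumptions chosen for illustration; a real network and a real corpus are of course far larger.

```python
import numpy as np

vocab = ["cowboy", "ghost", "night", "town"]

# Hypothetical toy corpus: every western mentions "cowboy", no horror movie does.
X = np.array([
    [1, 1, 0, 0],  # western
    [1, 0, 1, 0],  # western
    [1, 0, 0, 1],  # western
    [0, 1, 0, 0],  # horror
    [0, 0, 1, 0],  # horror
    [0, 0, 0, 1],  # horror
], dtype=float)
y = np.array([1, 1, 1, 0, 0, 0], dtype=float)  # 1 = western, 0 = horror

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.zeros(len(vocab))
b = 0.0
lr = 0.5
for _ in range(500):
    p = sigmoid(X @ w + b)
    error = p - y                       # gradient of the cross-entropy w.r.t. the logits
    w -= lr * (X.T @ error) / len(y)    # full-batch gradient descent step
    b -= lr * error.mean()

# An unseen horror movie that happens to feature a cowboy.
horror_with_cowboy = np.array([1, 1, 1, 0], dtype=float)
p_western = float(sigmoid(horror_with_cowboy @ w + b))
strongest = vocab[int(np.argmax(np.abs(w)))]

print("weights:", dict(zip(vocab, np.round(w, 2))))
print("most important word:", strongest)
print("P(western | horror movie with a cowboy):", round(p_western, 3))
```

In this setup the classifier puts its largest weight on “cowboy” and labels the unseen horror movie as a western, although “ghost” and “night” are both present. The labels never forced it to learn anything beyond the shortcut.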
Let’s try something new: We lock out the teacher and go nuts! All we do is look at samples and model correlations between raw features, without using any kind of label information. In the most elementary setup, we try to reconstruct the data by using concept bases which are continually refined with each visited sample. This is nothing more than an auto-encoder. In the end, we have hopefully learned features that can explain most of the data, and because we did not use any labels, the features should be more general and not restricted to a specific task. In contrast to classification, we model P(“data”) and not P(“label”|“data”).
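As a minimal sketch of this idea, the following NumPy code trains a tiny auto-encoder purely on reconstruction error. The synthetic data (8 raw features driven by 2 hidden factors), the layer sizes, the tanh encoder and the learning rate are assumptions chosen to keep the example small; they are not taken from a real setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy data: 200 samples with 8 raw features driven by 2 hidden factors.
factors = rng.normal(size=(200, 2))
mixing = rng.normal(size=(2, 8))
X = factors @ mixing + 0.1 * rng.normal(size=(200, 8))
X = (X - X.mean(axis=0)) / X.std(axis=0)      # standardize every feature

n_hidden = 2
W_enc = 0.1 * rng.normal(size=(8, n_hidden))  # encoder weights
W_dec = 0.1 * rng.normal(size=(n_hidden, 8))  # decoder weights ("concept bases")

lr = 0.05
losses = []
for step in range(2000):
    H = np.tanh(X @ W_enc)                    # encode: hidden features
    X_hat = H @ W_dec                         # decode: reconstruction
    R = X_hat - X
    losses.append(float(np.mean(R ** 2)))     # reconstruction error, no labels involved

    # Backprop through decoder and encoder.
    grad_out = 2.0 * R / R.size
    grad_dec = H.T @ grad_out
    grad_hidden = (grad_out @ W_dec.T) * (1.0 - H ** 2)
    grad_enc = X.T @ grad_hidden

    W_dec -= lr * grad_dec
    W_enc -= lr * grad_enc

codes = np.tanh(X @ W_enc)                    # learned features for every sample
print("reconstruction error: %.3f -> %.3f" % (losses[0], losses[-1]))
```

The only teacher here is the data itself: the error is computed against the input, so every feature contributes to the learning signal. The rows of `codes` are the learned representation that could later be handed to a simple classifier.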
Why is this step also useful for a more robust classification? First, the training is not driven by a single aspect, the label, but by the need to model all aspects of the data that help to reconstruct it. The idea is that a model which knows how to reconstruct individual samples, or, in the case of generative models, how to generate new ones, must have a good grasp of the distribution P(“data”) that the data comes from. In other words, unsupervised learning helps to disentangle the explanatory factors of the data, which allows a compact representation in the learned feature space. Furthermore, because these new representations are more likely to be linearly separable, we can use very simple models for the classification.
In the domain of images, the concept can be illustrated very easily. Instead of using low-entropy labels to learn a model, we try to model the image content directly. This is done by starting with correlations of pixel values which lead to edges, then lines, contours and finally shapes. With such a model, we can describe the content of an image more effectively and provide a summary of the factors as features. In a simplified way, features can be thought of as template detectors that activate if a certain patch of an image matches the concept (“eye” or “nose” or “face”). A high-level concept can then be described by a group of activated templates; for example, a “face” requires active templates for “eye”, “nose”, “mouth” and “ears”. Of course, the modeling of real-world data is more complex, but even there, it is done in a hierarchical way that starts with simple concepts which are then composed into more powerful ones.
The idea is not restricted to images, but for sparse bag-of-words data, like movie descriptions, the illustration is more limited. At the bottom, we have a set of words which form topics, and a set of topics in turn forms a movie, and so forth. Nevertheless, even for this kind of data, unsupervised learning helps a lot to capture the structure of the data. This includes correlations of words which can be used to disentangle the various factors and to form the latent topics that are present in the data. With such a learned feature space, we are able to solve a broad range of problems, like the classification of genres, a clustering of “similar” movies (those with related topics), or the training of preference-based models to suggest movies to users.
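To illustrate how word correlations turn into latent topics, here is a minimal sketch that uses a truncated SVD of the word-count matrix (the classic latent semantic analysis trick). This is a stand-in chosen for brevity, not the method discussed above, and the vocabulary and the six toy movies are made up.

```python
import numpy as np

vocab = ["cowboy", "sheriff", "saloon", "ghost", "haunted", "scream"]
titles = ["western_a", "western_b", "western_c", "horror_a", "horror_b", "horror_c"]

# Hypothetical word counts, one row per movie description.
X = np.array([
    [1, 1, 0, 0, 0, 0],  # western_a: cowboy, sheriff
    [0, 1, 1, 0, 0, 0],  # western_b: sheriff, saloon
    [0, 0, 1, 0, 0, 0],  # western_c: saloon
    [0, 0, 0, 2, 1, 0],  # horror_a: ghost (twice), haunted
    [0, 0, 0, 0, 1, 1],  # horror_b: haunted, scream
    [0, 0, 0, 0, 0, 1],  # horror_c: scream
], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Truncated SVD: keep the k strongest directions of word co-occurrence.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
topics = U[:, :k] * S[:k]      # every movie as a mixture of k latent topics

raw_sim = cosine(X[0], X[2])                  # western_a vs western_c, raw words
latent_same = cosine(topics[0], topics[2])    # same pair in topic space
latent_cross = cosine(topics[0], topics[3])   # western_a vs horror_a in topic space

print("raw word overlap      :", round(raw_sim, 3))
print("topic space, same kind:", round(latent_same, 3))
print("topic space, different:", round(latent_cross, 3))
```

The first and third western share no word at all, so their raw similarity is zero. In the topic space they end up pointing in the same direction, because “sheriff” and “saloon” co-occur in the second western and tie the vocabulary together. No genre label was used at any point.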
Bottom line: in contrast to discriminative models, unsupervised methods have the benefit that they can be used as a generic building block for other models. This is possible because they try to capture all regularities in the data and not only specific aspects of it. In terms of our glasses analogy, the network now sees a lot more details of the world, even if they are simplified, which makes it easier to compose existing knowledge to solve new problems. What makes our domain so challenging is the nature of the concepts involved. Everybody knows what a cat is, because it is a specific concept that can be decomposed into parts. Doing the same for a sci-fi horror movie is a totally different matter, because even the label is man-made and therefore subjective.