There are several ways to find structure in data. The simplest case is to label your data with some categories, train a model, and hope that the model captures enough structure to correctly predict the label for unseen data. To be fair, the approach often works pretty well if there is enough data, good features, and clean labels.
For the domain of movies, a simple classifier could predict whether a movie has a strong “western theme” or not. But even such a simple model requires good features to correctly separate the two classes; with such features, a linear model might even suffice for the task. If the explanatory factors of the data are entangled, a more powerful model with more layers is required to separate them. But if the features do not carry the required information, even the deepest network cannot deliver reasonable performance. However, this is not the limitation we are talking about.
We borrow an example from a paper about development and learning by Jeff Elman to demonstrate the problem. Let us assume that we have the following data with the given labels:
1 0 1 1 0 1 | 1
0 0 0 0 0 0 | 1
0 0 1 1 0 0 | 1
0 1 0 1 1 0 | 0
1 1 1 0 1 1 | 0
0 0 0 1 1 1 | 0
Several patterns separate the two classes. For instance, class one is symmetric around the center and all members have even parity. A pattern for class zero is that the fifth bit is always set. A model could have learned any of these patterns. Now, an unseen sample “0 1 1 1 0 1” is presented to the model, and the question is which class it will predict. The fifth bit is “zero”, so the prediction should be “1”, but the sample is also non-symmetric, so “0” would be reasonable as well.
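The ambiguity can be checked directly. The following is a minimal sketch, not taken from the original paper: the three rule functions and their names are our own illustration of the patterns mentioned above. Each rule classifies all six training samples correctly, yet the rules disagree on the unseen sample.

```python
# Training data from the example above: (bits, label).
TRAIN = [
    ((1, 0, 1, 1, 0, 1), 1),
    ((0, 0, 0, 0, 0, 0), 1),
    ((0, 0, 1, 1, 0, 0), 1),
    ((0, 1, 0, 1, 1, 0), 0),
    ((1, 1, 1, 0, 1, 1), 0),
    ((0, 0, 0, 1, 1, 1), 0),
]
UNSEEN = (0, 1, 1, 1, 0, 1)


def by_symmetry(bits):
    """Class 1 if the pattern reads the same in both directions."""
    return 1 if bits == bits[::-1] else 0


def by_parity(bits):
    """Class 1 if the number of set bits is even."""
    return 1 if sum(bits) % 2 == 0 else 0


def by_fifth_bit(bits):
    """Class 0 if the fifth bit is set, class 1 otherwise."""
    return 0 if bits[4] == 1 else 1


# Hypothetical names for the candidate rules, used only for printing.
RULES = {"symmetry": by_symmetry, "parity": by_parity, "fifth bit": by_fifth_bit}


def training_accuracy(rule):
    """Fraction of training samples the rule labels correctly."""
    return sum(rule(bits) == label for bits, label in TRAIN) / len(TRAIN)


for name, rule in RULES.items():
    print(f"{name}: training accuracy {training_accuracy(rule):.2f}, "
          f"unseen sample -> class {rule(UNSEEN)}")
```

All three rules are consistent with the labels, so the training data alone gives no reason to prefer one of them. The symmetry rule assigns the unseen sample to class “0”, while the parity rule and the fifth-bit rule assign it to class “1”.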
As discussed in earlier posts, a major problem is that the label does not describe *why* a sample belongs to a class, only that it *does* belong there. For instance, say we have 20 movies: 10 are somehow about dogs, and the other 10 are about cats *and* trees. So, we label the dog movies with “0” and the other movies with “1”. But the real question is what the label “1” really means. That the movies are about cats? About trees? Or about both? More of this but little of that? For unseen movies about dogs, the situation is much clearer, but if we present a new movie to the model that is about dogs *and* trees, what would the prediction be? Or a movie with cats but no trees, or trees but no cats? Not to forget that the whole content of a movie is condensed into a set of keywords that needs to encode all this information.
The first insight, though definitely not a new one, is that a classifier learns just enough to correctly predict the labels. Therefore, if a single pattern suffices to separate the classes and the loss reaches “zero”, the model stops learning because its work is done. The drawback is that we cannot control which patterns the model learns to complete the task. On the one hand, this is a positive aspect, since the model might know better which pattern to learn. On the other hand, since the information in the labels is extremely limited, the learned pattern might not capture all *intended* purposes of a label. In the dog-cat-tree example, a model just needs to learn dog vs. tree, since all movies about cats are also about trees. However, if a new movie is about trees but not cats, the model still predicts class “1”, because the world of the model consists only of dogs and trees.
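To make the dog-cat-tree example concrete, here is a minimal sketch under a few assumptions of our own: each movie is reduced to three binary keyword features (dog, cat, tree), and a plain perceptron stands in for "the model". Neither choice comes from an actual movie dataset; they only illustrate how training stops once the labels are predicted correctly.

```python
# Hypothetical keyword features per movie: (dog, cat, tree).
DOG = (1, 0, 0)
CAT_TREE = (0, 1, 1)
TRAIN = [(DOG, 0)] * 10 + [(CAT_TREE, 1)] * 10


def predict(weights, bias, x):
    """Linear threshold unit: class 1 if the weighted sum is positive."""
    score = sum(w * xi for w, xi in zip(weights, x)) + bias
    return 1 if score > 0 else 0


def train_perceptron(samples, max_epochs=100):
    """Classic perceptron rule, starting from zero weights."""
    weights, bias = [0, 0, 0], 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            error = y - predict(weights, bias, x)
            if error != 0:
                mistakes += 1
                weights = [w + error * xi for w, xi in zip(weights, x)]
                bias += error
        if mistakes == 0:
            # Every training label is predicted correctly: nothing left to learn.
            break
    return weights, bias


weights, bias = train_perceptron(TRAIN)
print("weights (dog, cat, tree):", weights, "bias:", bias)
print("trees but no cats ->", predict(weights, bias, (0, 0, 1)))
print("cats but no trees ->", predict(weights, bias, (0, 1, 0)))
```

Because cats and trees always occur together in the training data, the two weights receive identical updates and the model cannot tell the keywords apart. In this sketch, a movie with only trees and a movie with only cats are both assigned to class “1”, although the labels never said which of the two aspects was intended.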
The conclusion is fairly simple: a model can learn only those patterns that were present in the samples used for training. For example, if we had also provided movies with dogs *and* trees, the model would have been forced to learn at least three aspects, instead of just two, to properly reduce the loss on the training data. Nevertheless, the solution is not always that easy, because samples usually contain a variety of aspects that are not important for the actual classification task and, even worse, might be a distraction from it.
In the end, it is a trade-off: we need to limit the capacity of a model, otherwise it would just memorize all the data and not learn any patterns, but at the same time, we need to force the model to learn most regularities of the data so that it correctly predicts the labels for unseen data. That means we either have to use fine-grained labels that encode more information, like “spaghetti-western” instead of just “western”, or we have to increase the capacity of the model in a controlled way. In the latter case, we could add a regularizer to the model to guide the learned representation. This could be done in an unsupervised way, using extra data that has no labels but is related, to learn more patterns.