In the movie domain in particular, a few genres are very popular. For instance, in our data set, the top five genres are:
– drama (~27%)
– comedy (22%)
– action (8%)
– crime (7%)
– thriller (6%)
In other words, if we pick a random movie, the chance that it is a “drama” is a bit more than 1 in 4, which is quite a lot. As discussed in an earlier post, the theme “drama” is very broad and can mean a lot of things, and the situation is similar for the other genres.
So, if we create a training set by sampling random items from the data set, the label distribution is biased towards the top-k genres. That means a model will have a hard time finding useful features for niche genres. Furthermore, since “drama” is so diverse, the model might also fail to find good discriminative features to capture such a complex concept.
The situation seems pretty awkward, but we can improve it for movies with multiple genres. If a movie has the genres “drama” and “horror”, we can at least try to classify the movie as “horror”. In general, labels carry very little information, and therefore we should not trust them too much. They are a good indicator, but we should put more emphasis on the actual data.
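One way to read the “drama” plus “horror” idea is to label each multi-genre movie with its least frequent genre. The following is only a minimal sketch under that assumption; the function name `rarest_genre` and the toy movie list are made up for illustration and are not taken from our data set or pipeline.

```python
from collections import Counter

def rarest_genre(genres, genre_counts):
    """Return the genre from `genres` that occurs least often overall."""
    # Genres missing from the counts are treated as count 0, i.e. rarest.
    return min(genres, key=lambda g: genre_counts.get(g, 0))

# Toy data (hypothetical): each movie is just its list of genres.
movies = [
    ["drama"],
    ["drama", "horror"],
    ["drama", "comedy"],
    ["comedy"],
    ["drama", "crime"],
    ["comedy", "crime"],
]

# Count how often each genre appears across all movies.
genre_counts = Counter(g for genres in movies for g in genres)

# Assign one label per movie, preferring the niche genre.
labels = [rarest_genre(genres, genre_counts) for genres in movies]
```

With this rule, “drama” is only used as a label when a movie has no rarer genre, which shifts some weight towards the niche genres without touching the actual data.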
Back to our actual problem: we have to re-balance the data somehow, otherwise the model will focus on only a few genres, which leads to poor generalization for the minor genres. Furthermore, especially with piecewise linear units like ReLU, an uneven distribution of the data can lead to very poor models. Why? If too many samples have the same genre, the training will focus on the region where all those samples are located. For other regions, those with far fewer samples, learning the boundary that separates the regions is less efficient.
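There are several ways to re-balance; one simple option is to sample training items with a probability inversely proportional to the frequency of their label. The sketch below illustrates that idea only and is not necessarily the scheme we use; the helper names `inverse_frequency_weights` and `sample_balanced` and the toy labels are assumptions made for this example.

```python
import random
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each sample by 1 / (count of its label).

    The weights of all samples sharing a label then sum to 1,
    so every label gets the same total weight.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]

def sample_balanced(items, labels, k, seed=0):
    """Draw k items with replacement, each label equally likely in expectation."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(items, weights=weights, k=k)

# Toy data (hypothetical): 6 dramas, 3 comedies, 1 horror movie.
labels = ["drama"] * 6 + ["comedy"] * 3 + ["horror"] * 1
items = list(range(len(labels)))

weights = inverse_frequency_weights(labels)
batch = sample_balanced(items, labels, k=8)
```

Note that sampling with replacement repeats the few niche items very often, so this kind of oversampling can encourage overfitting on them; it trades one problem for a smaller one rather than removing it.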
What does all this have to do with cats? With a data set that consists of 95% cats and only 5% dogs, eventually everything looks like a cat, at least to the trained network. It is the same if almost every movie is a “drama”, because then the most beneficial features are those that recognize this particular genre. However, such a network is not able to reliably recognize aspects like “horror”, “western” or “mystery”, because to it, everything looks like “drama”.