For an arbitrary problem, it is often very beneficial to have balanced labels. However, for lots of real-world problems, this requirement is not met. For instance, in the case of tag prediction, a very small set of tags usually accounts for more than 80% of the data. In other words, the problem is the “long tail”: there are very few high-frequency tags, but also lots of tags that occur rarely and are nevertheless (highly) discriminative. This is also a well-known problem in the domain of movie metadata. For instance, the “drama” genre is present in more than 50% of all movies and thus has little discriminative power. One solution is to use sampling to create mini-batches that are balanced in terms of the labels.
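As a rough illustration of label-balanced sampling, the sketch below first picks a label uniformly at random and then picks an item that carries it. The function name `balanced_batch` and the list-of-tag-lists input format are assumptions made for this example, not something prescribed above. For multi-label items the result is only approximately balanced, because an item drawn for a rare tag may also carry a frequent one.

```python
import numpy as np

def balanced_batch(labels_per_item, batch_size, rng):
    """Draw item indices so that every label has the same chance of being picked.

    labels_per_item: list where entry i holds the labels (tags) of item i.
    Hypothetical helper for illustration; items without labels are never drawn.
    """
    # Invert the mapping: label -> indices of the items that carry it.
    items_by_label = {}
    for idx, labels in enumerate(labels_per_item):
        for label in labels:
            items_by_label.setdefault(label, []).append(idx)

    all_labels = sorted(items_by_label)
    batch = []
    for _ in range(batch_size):
        # Step 1: pick a label uniformly, ignoring how frequent it is.
        label = all_labels[rng.integers(len(all_labels))]
        # Step 2: pick one of the items carrying that label.
        candidates = items_by_label[label]
        batch.append(candidates[rng.integers(len(candidates))])
    return np.array(batch)

# Toy data: "drama" dominates, "noir" is part of the long tail.
toy_labels = [["drama"]] * 8 + [["noir"]] * 2
toy_batch = balanced_batch(toy_labels, batch_size=16, rng=np.random.default_rng(0))
```

With this scheme, rare labels are oversampled, so the same few items recur often within an epoch; whether that trade-off is acceptable depends on the data.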
However, even with balanced labels, most models give higher importance to frequent features because these can often explain most of the data. Down-weighting those features, with TF-IDF for instance, is important but does not guarantee that the learned features also focus on minor aspects (the “long tail”). This can easily be seen by considering an auto-encoder (AE) trained on bag-of-words data, where each value is treated as a probability that describes how likely a feature is to be enabled (based on TF-IDF). If we analyze the learned “bases”, that is, the weight matrix W of the AE, many of them are likely to be redundant.
To analyze this further, we consider the absolute pairwise cosine similarity of two bases from W. The output of acs = abs(cosim(Wi, Wj)) is around zero if there is only minor overlap between the learned concepts. For topic modeling, it would be optimal if the bases were pairwise orthogonal, which ensures that two bases did not learn the same topic or slightly different variations of it. This usually also leads to more diversity, since a base can then focus on the small set of minor features required for a latent topic. With enough hidden nodes, there is a good chance that some nodes learn latent concepts by using less frequent features from the “long tail.”
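The similarity check can be sketched in a few lines of NumPy. The example assumes that each row of W is one base (shape: hidden nodes × features); if the bases are stored as columns, transpose W first. The function name and the `eps` guard against zero-norm rows are choices made for this illustration.

```python
import numpy as np

def abs_cosine_similarity(W, eps=1e-12):
    """Absolute pairwise cosine similarity between the rows (bases) of W."""
    # Scale every base to unit length; eps avoids division by zero.
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W_unit = W / np.maximum(norms, eps)
    # Entry (i, j) is abs(cosim(W_i, W_j)).
    return np.abs(W_unit @ W_unit.T)

# Toy example: bases 0 and 2 point in the same direction, base 1 is orthogonal.
W_toy = np.array([[1.0, 0.0],
                  [0.0, 2.0],
                  [3.0, 0.0]])
acs = abs_cosine_similarity(W_toy)
```

Values near zero off the diagonal indicate bases with little overlap, while values near one flag pairs that likely learned the same concept.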
Now the question is: how can we use this knowledge to train a better model? For a global assessment, we have to consider all pairs of base vectors, which is computationally very challenging. On the other hand, we need fewer hidden nodes because the expected redundancy is lower. Informally, a very simple loss is the mean of all (0 – acs_ij)**2 values over all pairs of bases i, j. This penalizes correlations between pairs of base vectors. If all base vectors are orthogonal, the loss is zero. Nevertheless, even with a moderate number of hidden nodes, 50 for instance, we have 50 * 49 / 2 = 1225 pairs to evaluate, which is a lot of overhead per gradient step. With this idea in mind, we did some research and found a recent paper that uses a similar approach to model topics, including those from the long tail, in documents with RBMs.
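A minimal NumPy sketch of this loss is shown below, again assuming that the bases are the rows of W and that there are at least two of them. It only evaluates the penalty; in actual training it would be added to the reconstruction loss inside a framework with automatic differentiation, and the weighting between the two terms is left open here. The name `diversity_loss` is made up for this example.

```python
import numpy as np

def diversity_loss(W, eps=1e-12):
    """Mean of (0 - acs_ij)**2 over all pairs of bases i < j (rows of W)."""
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    W_unit = W / np.maximum(norms, eps)
    acs = np.abs(W_unit @ W_unit.T)
    # Indices of the strict upper triangle: every unordered pair exactly once.
    i, j = np.triu_indices(W.shape[0], k=1)
    return np.mean((0.0 - acs[i, j]) ** 2)

# Number of pairs that must be evaluated for 50 hidden nodes.
n_pairs = len(np.triu_indices(50, k=1)[0])

# Orthogonal bases give zero loss; random bases generally do not.
loss_orthogonal = diversity_loss(np.eye(4))
loss_random = diversity_loss(np.random.default_rng(0).normal(size=(50, 200)))
```

Because the number of pairs grows quadratically with the number of hidden nodes, one possible way to reduce the cost is to evaluate only a random subset of pairs per step, at the price of a noisier penalty.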
Bottom line: in recommender systems, the problem of the “long tail” is present in many forms. It shows up for genres, where highly specific but rare genres might help to better explain user preferences, and for keyword-based features that help to discriminate between two movies that would otherwise look identical. Therefore, learning with diversity seems to be fruitful for more than one reason.