Whenever the input data consists of very sparse vectors, we need to ensure that rare features don't get lost during training. That can be done with AdaGrad, a method that maintains a separate learning rate for each feature. Of course other methods exist, but AdaGrad is simple and nevertheless very powerful. A related problem is that not only the features but also the labels can be highly imbalanced, at least in the movie domain. For instance, the top-3 genres are usually drama, comedy, and action, and the top-1 genre alone can cover more than 20% of all movies.
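To make the per-feature idea concrete, here is a minimal sketch of the AdaGrad update in NumPy (function and variable names are our own, not from any library). Each weight accumulates its own sum of squared gradients, so a rarely-updated feature keeps a large effective learning rate while a frequently-updated one decays:

```python
import numpy as np

def adagrad_update(w, grad, accum, lr=0.1, eps=1e-8):
    """One AdaGrad step; `accum` is the running sum of squared gradients per weight."""
    accum += grad ** 2
    w -= lr * grad / (np.sqrt(accum) + eps)
    return w, accum

# Toy demo: feature 0 receives a gradient every step (dense),
# feature 1 only once (rare/sparse).
w = np.zeros(2)
accum = np.zeros(2)
for step in range(10):
    grad = np.array([1.0, 1.0 if step == 0 else 0.0])
    w, accum = adagrad_update(w, grad, accum)
```

After ten steps the dense feature's effective step size has shrunk to `lr / sqrt(10)`, while the rare feature's single update was applied at nearly the full learning rate.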
What does that mean for a classification model? Most of the samples drawn for training come from the top-3 classes, so the model focuses mainly on modeling the patterns of these genres. If, for instance, a western movie is presented at test time, the model is probably baffled: it has seen only very few westerns and tries to explain the sample by combining patterns from the top-k genres.
Because there is no equivalent of AdaGrad for labels, we need to customize the sampling procedure. A straightforward approach is to determine the distribution of all labels and use its inverse to calculate how often each label class is drawn. In other words, the more frequent a label is, the less often it is sampled. This way, the model has a much better chance to find patterns for non-top-k genres, because samples with a rare genre are presented to it more often.
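The idea can be sketched in a few lines of Python (the genre counts below are made up for illustration). Each training example is weighted by the inverse frequency of its label, so every class ends up with the same total sampling mass:

```python
import random
from collections import Counter

def inverse_freq_weights(labels):
    """Weight each example by 1 / count(its label)."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]

# Imbalanced toy dataset: drama dominates, western is rare.
labels = ["drama"] * 50 + ["comedy"] * 30 + ["western"] * 5
weights = inverse_freq_weights(labels)

random.seed(0)
sampled = random.choices(labels, weights=weights, k=10000)
counts = Counter(sampled)
```

With these weights each class contributes a total mass of 1.0, so `counts` comes out roughly uniform: the model now sees westerns about as often as dramas, even though they make up only a tiny fraction of the raw data.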