Or how to find a needle in a 10,000-dimensional haystack. The distribution of keywords is in some sense similar to that of city populations. Why? Usually there are a few very big cities, followed by some medium-sized ones and lots of small ones. Yet the combined population of the biggest cities accounts for a huge portion of the total population, and keywords behave very similarly. Therefore, it is reasonable to assume that keyword data follows a power-law distribution. This is nothing new, but nevertheless very important, because predicting and/or utilizing rare keywords/tags is often a key to success.
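To make this concrete, here is a small sketch of how skewed such a distribution is. The vocabulary size and the Zipf exponent are our assumptions, chosen purely for illustration:

```python
import numpy as np

# Hypothetical vocabulary of 10,000 keywords whose frequencies follow
# a Zipf-style power law (exponent 1), as assumed in the text.
n_keywords = 10_000
ranks = np.arange(1, n_keywords + 1)
freqs = 1.0 / ranks           # frequency proportional to 1/rank
freqs /= freqs.sum()          # normalize to a probability distribution

# Fraction of all keyword occurrences covered by just the top 100 keywords.
top100 = freqs[:100].sum()
print(f"top 100 of {n_keywords} keywords cover {top100:.0%} of occurrences")
```

Under these assumptions, a mere 1% of the keywords accounts for roughly half of all occurrences, while the long tail of rare keywords shares the rest.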
This can be seen with a very simple example. Let’s assume the most frequent keywords are love, friendship and family, and that the most popular one, love, is present in 10% of all movies. Without any down-weighting of these very frequent keywords, the feature construction is biased towards them. Why? With 10K movies, “love” is present in 1,000 of them, while a rare keyword like “architecture” might be present in only 50 movies. Even if there is a pattern for the rare keyword that could be learned, the model is unlikely to pick it up, since the loss induced by movies with frequent features dominates.
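One standard way to counteract this bias is an IDF-style down-weighting, where a keyword's weight shrinks with its document frequency. A minimal sketch using the numbers from the example (the formula is the classic inverse document frequency, not necessarily what the original model used):

```python
import math

# 10,000 movies; "love" is tagged on 1,000 of them,
# the rare keyword "architecture" on only 50.
n_movies = 10_000
doc_freq = {"love": 1_000, "architecture": 50}

# idf(kw) = log(N / df): frequent keywords get small weights,
# rare keywords get large ones.
idf = {kw: math.log(n_movies / df) for kw, df in doc_freq.items()}
print(idf)
```

Here “architecture” ends up with a weight more than twice that of “love”, so the few movies carrying the rare keyword contribute proportionally more to the loss instead of being drowned out.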
Bottom line: a model should encourage diversity and focus on learning patterns for rare features to better explain the data. In an earlier post, we described a way to “force” RBMs to learn orthogonal concepts by using the cosine similarity to measure how alike hidden nodes are. Since the approach is not limited to RBMs, we could also try to use a similar penalty for supervised models.
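Such a penalty could be sketched as follows. This is our own minimal illustration, not the exact formulation from the earlier post: we penalize the squared pairwise cosine similarities between the weight vectors of the hidden units, which is zero exactly when the units are orthogonal.

```python
import numpy as np

def diversity_penalty(W):
    """Sum of squared pairwise cosine similarities between the
    weight vectors (columns) of W. Adding this term to the loss
    pushes hidden units toward orthogonal concepts."""
    # Normalize each column (hidden unit) to unit length.
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    S = Wn.T @ Wn                       # pairwise cosine similarities
    off_diag = S - np.eye(S.shape[1])   # ignore self-similarity (always 1)
    return np.sum(off_diag ** 2)

rng = np.random.default_rng(0)
W = rng.normal(size=(50, 10))   # e.g. 50 visible features, 10 hidden units
print(diversity_penalty(W))
```

Because the penalty is just a differentiable function of the weights, it can be added to the objective of any supervised model as `loss + lam * diversity_penalty(W)`, with `lam` controlling how strongly diversity is enforced.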