Currently, we select our features by taking the Top-K keywords from the movie metadata. The keywords are cleaned up first, but even so, for K=1000, only 6.3 words are present per movie on average. In other words, the sparsity factor is about 99%, which is not unusual for text-based features.
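As a quick sanity check, the sparsity factor can be computed directly from a binary Top-K feature matrix. The matrix below is randomly generated toy data with smaller dimensions than the real K=1000 setup, purely for illustration:

```python
import numpy as np

# Toy binary movie-by-keyword matrix: rows are movies, columns are
# Top-K keywords (K=10 here instead of 1000, just to keep it small).
rng = np.random.default_rng(0)
X = (rng.random((5, 10)) < 0.1).astype(int)

# Sparsity factor: the fraction of entries that are zero. With ~6.3
# active keywords out of K=1000, this would be 1 - 6.3/1000, i.e. ~99.4%.
sparsity = 1.0 - X.mean()
print(f"sparsity: {sparsity:.1%}")
```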
When our goal is to train some classifier, the sparsity is not an insurmountable challenge. However, if we want to train an unsupervised model, the sparsity suddenly becomes a problem, especially if the input data is human-generated content that is likely to contain errors and is neither complete nor consistent. In that case, it is often impractical to reliably recover those six binary values -words- from the noisy data. Thus, it is necessary to adjust the training procedure.
One way out is to group the features -words- semantically and to reconstruct the concept of the group instead of the specific feature that belongs to it. Let us consider the following example. With a non-negative matrix factorization, we learned 100 latent topics, and the first one is clearly a Christmas theme, with top keywords like ‘christmas’, ‘santa-claus’, and ‘holidays’.
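A minimal sketch of learning such latent topics with scikit-learn’s NMF. The data is randomly generated and the dimensions are much smaller than the K=1000 keywords and 100 topics from the text; all sizes here are placeholders:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy binary movie-by-keyword matrix (50 movies, 40 keywords).
rng = np.random.default_rng(42)
X = (rng.random((50, 40)) < 0.15).astype(float)

# Non-negative factorization X ≈ W @ H with 5 latent topics
# (the text uses 100 topics on K=1000 keywords).
model = NMF(n_components=5, init="nndsvda", max_iter=500, random_state=0)
W = model.fit_transform(X)   # movie-to-topic weights, shape (50, 5)
H = model.components_        # topic-to-keyword weights, shape (5, 40)

# The keywords with the largest weights in a row of H describe that
# topic, e.g. a row dominated by 'christmas', 'santa-claus', 'holidays'.
top_keywords = np.argsort(H[0])[::-1][:5]
print("strongest keyword indices for topic 0:", top_keywords)
```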
The movie we try to model is assumed to have a strong “christmas” and “family” theme. Next, each of the movie keywords -present in the Top-K selection- is assigned to one of the learned latent topics. We can do this by determining the overlap, or we can consider the learned weights and use a threshold. Stated differently, we reduce the K=1000 keywords to 100 binary features, where each original keyword is connected to the topic it fits best.
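The assignment step can be sketched as an argmax over the learned topic-to-keyword weights, with an optional threshold for weak matches. The weight matrix below is hand-made toy data, not learned output:

```python
import numpy as np

# Hypothetical topic-to-keyword weight matrix H from a non-negative
# factorization: 3 topics x 8 keywords (100 topics x K=1000 in the text).
H = np.array([
    [0.9, 0.8, 0.7, 0.0, 0.1, 0.0, 0.0, 0.2],  # e.g. a "christmas" topic
    [0.0, 0.1, 0.0, 0.9, 0.8, 0.0, 0.1, 0.0],  # e.g. a "family" topic
    [0.1, 0.0, 0.2, 0.0, 0.0, 0.9, 0.8, 0.7],  # e.g. a "martial-arts" topic
])

# Assign every keyword to the topic where its weight is largest; keywords
# whose best weight stays below the threshold remain unassigned (-1).
threshold = 0.3
best_topic = H.argmax(axis=0)
assigned = np.where(H.max(axis=0) >= threshold, best_topic, -1)
print(assigned)  # → [0 0 0 1 1 2 2 2]
```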
This approach surely loses some detail, because ‘holidays’ and ‘santa-claus’ are now both treated as a “christmas” theme, but this weakness is also a strength of the model: not every keyword has to be reconstructed, only the more general theme the keyword belongs to. We can think of the approach as a special form of pooling. Whenever a word from the pool is present in the movie data, the output is “1” to indicate membership and “0” otherwise.
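The pooling step then reads: a topic feature is “1” if the movie has at least one keyword from that topic’s pool. A small sketch with a hypothetical keyword-to-topic assignment:

```python
import numpy as np

# Hypothetical assignment: keyword_topic[j] is the topic that keyword j
# belongs to (K=8 keywords, 3 topics here instead of 1000 and 100).
keyword_topic = np.array([0, 0, 0, 1, 1, 2, 2, 2])

def pool(movie_keywords, keyword_topic, n_topics):
    """Max-pool over each topic's keyword pool: return a binary vector
    with 1 where the movie has at least one keyword of that topic."""
    out = np.zeros(n_topics, dtype=int)
    for j in movie_keywords:
        out[keyword_topic[j]] = 1
    return out

# A movie tagged with keywords 1 and 4: topics 0 and 1 fire, topic 2 not.
print(pool([1, 4], keyword_topic, n_topics=3))  # → [1 1 0]
```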
At the end, we need to do some fine-tuning to differentiate between intra-class examples, which eventually leads to a hierarchical model. In this model, we start with high-level themes like ‘christmas’ or ‘martial-arts’, and at the next level, we further refine those topics to split the space according to the low-level details of the movies.
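One possible way to sketch such a two-level hierarchy: learn coarse themes first, then factorize only the subspace of one theme again. All sizes, pools, and data below are hypothetical toy values:

```python
import numpy as np
from sklearn.decomposition import NMF

# Toy binary movie-by-keyword matrix (60 movies, 30 keywords).
rng = np.random.default_rng(1)
X = (rng.random((60, 30)) < 0.2).astype(float)

# First level: coarse themes over the full keyword space.
coarse = NMF(n_components=4, init="nndsvda", max_iter=400, random_state=0)
W = coarse.fit_transform(X)

# Suppose keywords 0..9 form the pool of one high-level theme,
# e.g. 'christmas'. Restrict to the movies that touch this pool ...
members = X[:, :10].sum(axis=1) > 0

# ... and split that subspace again into finer sub-topics,
# e.g. separating 'santa-claus' movies from generic 'holidays' ones.
fine = NMF(n_components=2, init="nndsvda", max_iter=400, random_state=0)
W_fine = fine.fit_transform(X[members][:, :10])
print("movies in the theme:", int(members.sum()), "sub-topics:", W_fine.shape[1])
```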