The TF-IDF scheme is a pretty good baseline because it automatically down-weights features that occur very frequently. In the case of movie meta data, however, the term frequency (the TF value) is always 1, which seriously limits the benefit of the scheme for this kind of data. Still, regardless of the domain, rare features are very precious!
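To see why, here is a minimal sketch with a hypothetical set of movies, each represented as a set of keyword features. Since every present feature has TF = 1, the TF-IDF weight collapses to pure IDF weighting, which still rewards rare features:

```python
import math

# Hypothetical movie meta data: each movie is a set of keyword features,
# so the term frequency of every present feature is exactly 1.
movies = {
    "Spider-Man": {"superhero", "high-school", "friendship"},
    "Batman": {"superhero", "crime"},
    "Clueless": {"high-school", "friendship"},
}

def idf(term, docs):
    # With TF fixed at 1, TF-IDF reduces to the IDF factor:
    # rare features get high weights, frequent ones low weights.
    df = sum(1 for feats in docs.values() if term in feats)
    return math.log(len(docs) / df)

# "crime" appears in 1 of 3 movies, "superhero" in 2 of 3,
# so the rarer feature "crime" receives the larger weight.
print(idf("crime", movies) > idf("superhero", movies))
```

The discriminative power between frequent and rare features survives, but all information a real TF would carry is gone, which is what limits the scheme here.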
The problem we are fighting is manifold. First, the input data is very sparse, which we addressed with reconstructive sampling. Second, the meta data is very noisy in terms of notation/spelling; we fought this issue by unifying values with stemming and n-gram matching. And third, features are grouped into themes/keywords/flags that are not present for every movie. Because of this third problem, a top-k feature selection approach cannot distinguish between not present (-1) and not set (0). Stated differently, we need a different way to handle data that is only partially available.
A straightforward approach is to use this data to regularize the model. For instance, in the case of an auto-encoder, we could use it to help the model better separate movies with different flags. This can easily be done by using the output of the hidden layer to train a simple classifier. Since the new model is conditional (supervised training is only done for labeled samples) we also need an indicator function. The new cost is just the cost function of the auto-encoder plus the cost function of the classifier. To balance the two terms, we can use an additional “alpha” value to trade off the importance of the supervision: alpha*cost_ae + (1-alpha)*cost_classifier.
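A minimal NumPy sketch of this combined cost might look as follows. All dimensions, weights, and the choice of squared-error plus cross-entropy losses are assumptions for illustration; the indicator function is realized by simply dropping the supervised term when no label is available:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (hypothetical): 20 input features, 5 hidden units, 3 flag classes.
n_in, n_hid, n_cls = 20, 5, 3
W_enc = rng.normal(scale=0.1, size=(n_in, n_hid))
W_dec = rng.normal(scale=0.1, size=(n_hid, n_in))
W_cls = rng.normal(scale=0.1, size=(n_hid, n_cls))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def combined_cost(x, y, alpha=0.7):
    """alpha*cost_ae + (1-alpha)*cost_classifier; y is None for unlabeled movies."""
    h = sigmoid(x @ W_enc)                # hidden representation
    x_hat = sigmoid(h @ W_dec)            # reconstruction
    cost_ae = np.mean((x - x_hat) ** 2)   # auto-encoder cost

    # Indicator function: the supervised term is only active for labeled samples.
    if y is None:
        return alpha * cost_ae

    logits = h @ W_cls                    # classifier on top of the hidden layer
    p = np.exp(logits - logits.max())
    p /= p.sum()
    cost_cls = -np.log(p[y])              # cross-entropy on the flag label
    return alpha * cost_ae + (1 - alpha) * cost_cls

x = rng.random(n_in)
labeled_cost = combined_cost(x, y=1)      # flag label available
unlabeled_cost = combined_cost(x, y=None) # only the reconstruction term
```

During training, gradients of this cost would update the encoder weights, so labeled movies pull the hidden representation toward separating the flags while unlabeled movies still contribute through reconstruction.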
The point is that it is very important to utilize all kinds of data, even if they are not available for every movie, because otherwise the missing “context” is likely to collapse movies with similar keywords, but different flags/themes, in the feature space. In other words, we would like to disentangle the explanatory factors of the input data. For instance, the movie Spider-Man is clearly a mixture of themes like high school, being a superhero, and friendship, but it is unlikely that the final model has separate neurons for each latent topic. That means we need multiple layers to improve the chance of disentangling topics from the raw feature space.
In a nutshell, we try to use all available data to learn a good representation during training, but without the need to provide that data at test time. The key idea is to give “hints” to the hidden layers by using additional data whenever it is available. This can be done with a supervised cost function or an embedding scheme, just to name a few options.