In the previous post we talked about unbalanced labels and the consequences a strong regularization might have, if only a few errors remain, but the whole dataset is repeatedly feed to the model. At the end of the day, the careful selection actually helped to improve the model. However, there are still some “errors” that are very annoying.
The human brain is remarkable because it can perform an inference step with a minimum amount of power and information, thanks to the long-term memory. For instance, the title of a movie and a minimal description often suffices to decide if we are interested in the movie or not. With the ongoing success of Deep Learning, a lot of people transfer this expectation to the output of machine learning models.
Why is this problematic? In case of content-based methods, the model can be only as good as the features and a 100 layer network won’t change the fact, because even the largest network need to see the whole picture to learn a useful a representation. Thus, if a movie is described by a few keywords, maybe some themes, flags and genres the achievable quality of the model, the inference is bounded by the quality and the completeness of the features.
The theory is supported if we analyze errors made by trained models on unseen movies. As a human, we look at the title and we already have a vague expectation if we like it and the category. Of course we could be wrong, but the point is that in most cases, at least one descriptive feature is missing and thus, the prediction, if we only consider the features, is perfectly right, but totally unusable for us.
The problem is the lack of understanding of users because, when one or two missing keywords can fool the whole system, but even a child could correctly make the right prediction, what is the point of using machine learning at all? Well, for most cases such models work flawless and really help users to make the right decisions, but it is obvious that there is a lack of appreciation, if you tell your users you cannot fix a problem that seems trivial for them, but impossible for a multi-core CPU system.
Bottom line, it is an old hat that content-based systems are only as good as the input data and that collaborative systems are more powerful, but require lots of data before they make useful predictions. In our case, we plan to circumvent the problem by using different modalities that are projected into a single feature space.