One problem with hand-crafted features is that we usually need several iterations before we know whether the representation is powerful enough for the task at hand. Stated differently, if the features are too simple, we are likely to learn patterns that are plausible according to the data, but not useful for the actual task. Here is an example in which we use the movie ‘Doom’ as a reference. We train a simple genre-based classifier and study its last hidden layer to see how well it separates the different classes in the data.
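To make the setup concrete, here is a minimal sketch of what such hand-crafted input features could look like: a multi-hot encoding over genres and keywords. The vocabularies and the helper `encode` are hypothetical, chosen only for illustration; the original pipeline is not described in that detail.

```python
import numpy as np

# Hypothetical vocabularies; a real pipeline would derive these from the data.
GENRES = ["action", "horror", "sci-fi", "animation", "family"]
KEYWORDS = ["monster", "space", "creature film", "demon"]

def encode(movie_genres, movie_keywords):
    """Multi-hot encode a movie over the genre and keyword vocabularies.

    Unknown genres/keywords are silently ignored, which already hints at
    the problem: anything not encoded here is invisible to the model.
    """
    vec = np.zeros(len(GENRES) + len(KEYWORDS))
    for g in movie_genres:
        if g in GENRES:
            vec[GENRES.index(g)] = 1.0
    for k in movie_keywords:
        if k in KEYWORDS:
            vec[len(GENRES) + KEYWORDS.index(k)] = 1.0
    return vec

doom = encode(["action", "horror", "sci-fi"], ["monster", "space", "demon"])
```

A classifier trained on such vectors can only ever separate movies along the dimensions we chose to encode.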
The analysis confirms that most classes are clearly separated, but the model still lacks semantic expressiveness, because some movies that are close in the feature space nevertheless have huge semantic differences. In our case, the movie ‘Mad Monster Party’ is a close neighbor of ‘Doom’, mainly because of the ‘creature film’ sub-genre and the ‘monster’ theme derived from the keywords, but a quick look at its details reveals that the movie is aimed at a younger audience.
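The neighbor analysis itself is just a cosine-similarity lookup over the hidden-layer vectors. A minimal sketch, where the embedding values are entirely made up for illustration (the real ones would come from the trained classifier's last hidden layer):

```python
import numpy as np

# Made-up hidden-layer embeddings for three movies; only the relative
# geometry matters for the illustration.
embeddings = {
    "Doom": np.array([0.9, 0.8, 0.1]),
    "Mad Monster Party": np.array([0.85, 0.75, 0.2]),
    "Toy Story": np.array([0.1, 0.2, 0.9]),
}

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def nearest(title):
    """Return the most similar other movie in the embedding dict."""
    others = [(t, cosine(embeddings[title], v))
              for t, v in embeddings.items() if t != title]
    return max(others, key=lambda pair: pair[1])[0]

print(nearest("Doom"))  # → Mad Monster Party (with these made-up vectors)
```

With these vectors, ‘Mad Monster Party’ comes out as the nearest neighbor of ‘Doom’ even though one is a family film and the other is not, which is exactly the failure mode described above: proximity in feature space says nothing about the audience.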
In a nutshell, the designed features did a good job of clustering movies into high-level concepts, but they failed to capture whether a movie was for children or adults. That means that, in contrast to learned features, we have to encode all the basic knowledge into the features before we can train a model to learn the required patterns, which resembles a chicken-and-egg problem.
It is not new that crafting features is hard work, but in the case of movies it seems almost impossible, because movies combine three domains, text/plot, audio/speech and video/scene, and we do not have access to any of them. All we have is a summary that is usually very short or biased, descriptive plot keywords and other partial information like the involved persons or genres. Furthermore, while we might have fairly accurate information for some movies, we might have (almost) no information for others, which means we cannot embed all movies into the same feature space.
Long story short, with the data at hand we can do lots of things, but the extraction of consistent, powerful and semantic concepts is not one of them. And without proper feature engineering, the explanatory power of models will be limited, which also means that even silver bullets like Deep Learning won’t help, since DL requires that the raw input data contains all the relevant knowledge. The reason why DL seems to work so well these days is a combination of processing power (GPUs) and the utilization of tons of data. For movies, this data exists, but access to it is limited and the required processing power is still too high for real-life applications. With the recent advances in describing images textually, we might some day be able to condense a movie into useful semantic features.