Especially in computer vision, there is a strong tendency to learn features directly from the data instead of handcrafting them. We couldn’t agree more with this approach, because otherwise you never know whether your features are the best ones for the problem at hand.
Computer vision on images is still hard work, but at least you have the image data to work with. It would be a huge benefit if the same were possible for movies. But for movies the task is not to predict or classify something from a single frame, so we would need to consider the context of the pictures. And even if that were feasible, it is not very likely that such an approach would succeed in describing movie details like ‘car chase’ or ‘satire’.
As we noted in previous posts, the situation is different for documents. Like images, documents are more or less self-describing. Understanding them is hard work too, but the features are the text itself. That is what makes both domains similar. In contrast, a movie is always summarized and described by a human, with metadata like genres, keywords or ratings.
So, the question is, what are the best features to describe movies? This is, no doubt, a rhetorical question because there is no correct answer to it. For a TV magazine, it suffices to describe a movie by a short summary, the leading actors, the genre and maybe some kind of star-based rating. With this information, a human can usually classify movies into “worth watching” (+1) or “rather not” (-1). In other words, for a simple classification these features are enough, if you understand the semantics. However, to compare movies, you probably need more details.
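To make the “TV magazine” setting concrete, here is a minimal sketch of such a +1/-1 classification on metadata alone. Everything in it is an assumption made for illustration: the toy movies, the feature names, the four-star threshold, and the choice of a plain perceptron as the classifier. It is not a description of any particular system.

```python
from collections import defaultdict

def encode(movie):
    """Turn TV-magazine style metadata into a set of binary features."""
    feats = {"genre=" + g for g in movie["genres"]}
    feats |= {"actor=" + a for a in movie["actors"]}
    # Hypothetical threshold: four or more stars counts as a good rating.
    feats.add("stars>=4" if movie["stars"] >= 4 else "stars<4")
    return feats

def train_perceptron(data, epochs=10):
    """Classic perceptron: update weights only on misclassified examples."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for movie, label in data:
            feats = encode(movie)
            score = sum(weights[f] for f in feats)
            if label * score <= 0:  # wrong side of (or on) the boundary
                for f in feats:
                    weights[f] += label
    return weights

def predict(weights, movie):
    """Return +1 ("worth watching") or -1 ("rather not")."""
    score = sum(weights.get(f, 0.0) for f in encode(movie))
    return 1 if score > 0 else -1

# Invented toy data, labeled by an imaginary reader of the magazine.
toy_data = [
    ({"genres": ["action"], "actors": ["A"], "stars": 4}, +1),
    ({"genres": ["comedy"], "actors": ["B"], "stars": 2}, -1),
    ({"genres": ["action"], "actors": ["B"], "stars": 5}, +1),
    ({"genres": ["comedy"], "actors": ["A"], "stars": 1}, -1),
]

weights = train_perceptron(toy_data)
unseen = {"genres": ["action"], "actors": ["C"], "stars": 5}
print(predict(weights, unseen))
```

Note what the sketch cannot do: two movies with the same genre, cast and rating get identical features, so the model has no way to tell them apart. That is exactly the limitation when it comes to comparing movies.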
What about some fairy dust, or Deep Learning as it is called today? A layered model would surely help to disentangle the factors in the data, but only if the expressive power of the features is sufficient. For instance, the genres and the actors are definitely not enough to explain all themes in a movie. Stated differently, if we could describe movies with adequate features, Deep Learning would help us to find better representations of them.
But unlike a picture of a cat, or a document that describes how to build a time machine, movies are interpreted differently by different people, so even handcrafting features is a real challenge. A movie is like a storybook, with text _and_ images.