There is no doubt that it takes depth to build abstract concepts from raw data. We recently wrote about the problem that Deep Learning is unlikely to succeed, if the data does not provide enough semantics. For instance, we can consider the problem that a lot of keywords have a frequency of exactly one. If we assume that a large percentage of this words describe animals – squirrel, dog, bird, cat, snake – we could unify those words into a single concept with the name ‘animals’. Of course it would be cooler if we could learn this from the data, but that is not possible with a top-k keyword approach. However, now, we can at least say something about such movies -that they have something to do with animals- and since a lot of movies are about animals, it is much more likely that the word ‘animal’ will become a top-k candidate.
In a recent post, we did something similar for specializations of words, for instance ‘sand shark’ will be mapped on ‘shark’ and with the animal taxonomy, shark will be mapped on ‘animals’. In other words, the rather special description of the movie can be enhanced with more general concepts: sand shark -> shark -> animal. With this approach is it possible to build a hierarchical approach that makes it much easier to find relations between movies on the one side, but also allows to compare details on the other side.
The drawback is obvious, because now, we have not only handcrafted features, but we also need a handcrafted taxonomy to embed some semantic into them. However, since there is no way (yet) to automatically extract features from the raw movie data, handcrafted ones are our only chance.