It is safe to say that a lot of things do not happen by accident. For instance, consider a 32×32 block of binary pixels where each pixel is randomly turned on or off: it is extremely unlikely that the result resembles a face. Why? A face follows a concept that includes two eyes, a nose, a mouth and other parts, and these parts stand in a particular spatial relation to each other. In other words, of all possible 32×32 pixel configurations, only very few represent valid faces.
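To get a feeling for the numbers, here is a small sketch (not from the original text) that counts the configurations of a binary 32×32 image and draws one uniformly at random, which will almost certainly be unstructured noise rather than a face:

```python
import random

# a binary 32x32 image has 2**(32*32) possible configurations
n_pixels = 32 * 32
n_images = 2 ** n_pixels

# the count has roughly 309 decimal digits
print(len(str(n_images)))

# one configuration drawn uniformly at random: pure noise
random.seed(0)
image = [[random.randint(0, 1) for _ in range(32)] for _ in range(32)]
```

Since valid faces make up only a vanishing fraction of these configurations, a uniform draw essentially never lands on one.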
In the domain of images, manifolds are plausible because a face, a cat or a car are all concepts that can be described very precisely. There is still a lot of within-class variance, but nevertheless there are unique attributes and relations between the parts that are distinctive for each class. For example, a human face with three eyes is extremely unlikely.
The domain of movies is much harder to grasp. For instance, a horror movie has distinctive elements that rarely occur in other genres, but there are lots of movies that combine genres, like horror and comedy, or horror and science fiction. Thus, the question is whether we can disentangle the explanatory factors of a movie.
Without a doubt, horror movies are clustered somewhere in feature space, because if we combined a random subset of all feature keywords, it would be strange if the result could be clearly assigned to the horror genre. However, if words like ‘demon’ or ‘monster’ are present, a lot of users would probably agree that there is a horror element in this particular movie, even if all other keywords belong to non-horror genres.
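This cue-based reasoning can be sketched as a toy rule. The keyword lists below are invented for illustration and are not real movie metadata; the point is only that a single strong cue can signal a horror element even among otherwise non-horror keywords:

```python
# hypothetical strong horror cues (illustrative, not a real taxonomy)
HORROR_CUES = {"demon", "monster", "haunting", "possession"}

def has_horror_element(keywords):
    """Flag a movie as carrying a horror element if it contains at least
    one strong horror cue, regardless of its other keywords."""
    return bool(HORROR_CUES & set(keywords))

# a comedy whose keywords still include a horror cue
mixed = ["romance", "small-town", "demon", "slapstick"]
print(has_horror_element(mixed))    # True

# a random set of generic keywords carries no clear horror signal
generic = ["road-trip", "friendship", "coming-of-age"]
print(has_horror_element(generic))  # False
```

Of course, a real model would learn such cues and weight them softly instead of using a hard set intersection, but the sketch mirrors the intuition above.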
Compared to pixel groups, sets of words are much more ambiguous. For instance, a nose and two eyes, regardless of the lighting and the pose, will usually be recognized as a “face”, while the interpretation of words is highly dependent on the context. That is one reason why metric learning for movies is very challenging; another is that the metadata is neither complete nor free of errors.