The discussion of what similarity in context of movies is, is a barrel without a bottom. Why? Because people do not tend to compare descriptive keywords to make a decision, but they consider abstract concepts in movies. A movie that is summarized by: zombie, robbery, space, allows dozens of interpretations what could happen in the movie. A zombie that robs a bank in outer space? A robbery of a space ship that inhabits zombies… the combinations are almost endless. One problem is that those keywords hint at possible topics in a movie but they do not precisely give a good context or narrow them down sufficiently. Not to mention that it is unlikely that all topics, not to forget the latent topics, are properly described by keywords.
So, in the spirit of Deep Learning, we could try to mimic the brain or stated differently, how humans decide if they find a movie interesting or not? In a TV (digital) magazine, you usually have only the title, a very brief description, the genre and maybe a list of involved persons.
We definitely get lots of information from the the title alone. For instance, a title like “Sand Octopus”, in case it is not a paraphrase, let our fictional bells ring, probably also our horror bells. Again, we automatically use external knowledge to infer possible contexts. In this case, that an octopus usually is not found in sand and thus that the movie is probably highly fictional, to put it nicely, plus movies about an octopus tend to belong to the horror scheme. Combined with our knowledge about actors, we might infer that the movie is low- or high-budget. If we combine all data, we only have to decide, if we like to watch a trashy low-budget horror movie with possible a lot of bad acting and CGI in it.
To return to our initial problem, finding good neighbors in high-dimensional spaces with only keywords, genres and themes is very challenging. There are additional sources, like actors, directors, ratings or plots, but the less known a movie is, the higher the probability that this extra information is either very incomplete or not present at all and especially the feature space of persons is extremely sparse and high-dimensional.