We have been doing research in the domain of movie recommendations for quite some time, but even with the smartest methods we had a lot of setbacks. And no, deep learning won’t save us, since a method is only as strong as its features, and the features are the actual problem. For example, to learn the preferences of users, keywords, tags and genres might not suffice; stated differently, there must be “enough” data to describe the movie content with respect to (all) preferences. The problem is mostly time and consistency: it takes a lot of time to label all movies, and furthermore, you have to be consistent with a taxonomy that is often incomplete itself. Bottom line: a model that uses content-based features requires that _all_ movies are sufficiently described by textual features, which is not realistic in practice.
A much better method would be to use the movie content directly, a stream of images, to generate a compact fingerprint of a movie. With modern GPUs this would be possible, but only at a small scale and at a much lower resolution. For example, we could use trailers, since they already summarize the movie while also showing its important parts: explosions, space ships, people kissing, fighting, cars, islands, dinosaurs or whatever. The idea is to take a sequence of frames x1,…,xn and describe it by a single feature representation ~x. This sounds like a job for a recurrent network, but since a trailer contains lots of frames, the training will be very challenging. Not to forget that we need a loss function to optimize, and it is not obvious how to turn our goal into a differentiable one.
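To make the idea concrete, here is a minimal sketch of the simplest possible aggregation: extract a feature vector per frame and mean-pool them into one fixed-length fingerprint ~x. The random projection standing in for a pretrained CNN extractor is purely hypothetical; the point is only the shape of the pipeline, not a real model.

```python
import numpy as np

def fingerprint(frames, extract):
    """Aggregate per-frame features x1..xn into one fixed-length vector ~x."""
    feats = np.stack([extract(x) for x in frames])  # shape (n, d)
    return feats.mean(axis=0)  # mean pooling: order-free, fixed length

# hypothetical stand-in for a pretrained CNN feature extractor
rng = np.random.default_rng(0)
W = rng.normal(size=(64, 32))  # project 64-dim "frames" to 32-dim features
extract = lambda frame: np.maximum(frame @ W, 0.0)  # linear map + ReLU

trailer = rng.normal(size=(100, 64))  # 100 fake frames, 64 dims each
fp = fingerprint(trailer, extract)
print(fp.shape)  # fixed length, independent of the number of frames
```

Mean pooling sidesteps the recurrent-network issues mentioned above (no long sequences to backpropagate through), at the cost of throwing away temporal order.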
However, without access to a massive feature database for movies, in combination with other data like tags and ratings, content-based recommender systems will never provide a real user experience, since the quality of the feature data usually depends on the popularity of the movie. And lots of movies are in the “long tail”, which means they often have very limited or even no feature data. By utilizing the actual movie data, we could at least provide a non-subjective way to compress a movie into a fixed-length feature representation that can then be used to learn user preferences.
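As a sketch of the last step, once every movie has a fixed-length fingerprint, learning a user’s preferences can be as simple as a per-user logistic regression over those fingerprints. The data here is synthetic and the model deliberately minimal; it only illustrates that fixed-length fingerprints plug directly into standard preference learning, with no tags or keywords required.

```python
import numpy as np

def train_user_model(fps, likes, lr=0.1, epochs=200):
    """Logistic regression: does this user like a movie, given its fingerprint?"""
    w = np.zeros(fps.shape[1])
    b = 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(fps @ w + b)))  # predicted like-probability
        grad = p - likes                          # gradient of the log loss
        w -= lr * fps.T @ grad / len(likes)
        b -= lr * grad.mean()
    return w, b

# synthetic setup: the user likes movies whose first fingerprint dim is high
rng = np.random.default_rng(1)
fps = rng.normal(size=(200, 32))          # 200 movie fingerprints
likes = (fps[:, 0] > 0).astype(float)     # binary like/dislike labels
w, b = train_user_model(fps, likes)
preds = (fps @ w + b > 0).astype(float)
print((preds == likes).mean())  # training accuracy on the toy data
```

In a real system the fingerprints would come from the trailer pipeline above and the labels from ratings or watch history, but the interface, a fixed-length vector per movie, stays the same.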