We recently wrote about micro genres and how tags could help to build them. The problem is, as usual, that we have no access to the amount of data such a task requires.
User-generated data from communities or social networks could provide part of the required data, but larger sites are unlikely to share it for free because this information is very valuable to them. However, fresh data is required to learn new trends or global drifts in user preferences.
The other possibility is to use the fixed metadata of movies, but then we depend on a single source or on a community of volunteers to index new movies. Furthermore, integrating metadata from different sources can be problematic. So, what are the ways out? Finding a community that matches the basic requirements would be ideal, but the motives of such a community might not coincide with our goals.
Instead of pre-defined features, we could also use raw text, like plot summaries, overviews, reviews or just comments. However, most of the problems remain: if nobody writes about a movie, no recommendations are possible. Not to mention the recent trend of fabricated reviews, which do not reflect the actual opinion of the crowd, or at least bias it.
Bottom line: without the will to pay for decent information, we need to use every crumb we find and integrate it into our model. That means using data from very different domains, and it makes our features even sparser and higher-dimensional.
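To make the "every crumb" idea a bit more concrete, here is a minimal sketch of how features from different sources could be merged into one sparse representation. Everything in it is an assumption for illustration: the function names `merge_features` and `build_index`, the source prefixes `tag:`, `kw:` and `txt:`, and the toy movies are made up and not taken from any real data set or from our actual model.

```python
import re


def merge_features(tags=(), keywords=(), text=""):
    """Merge features from different (hypothetical) sources into one sparse dict.

    Each feature name is prefixed with its source, so the same word coming
    from a tag, a keyword or free text ends up in a separate dimension.
    """
    features = {}
    for tag in tags:
        features["tag:" + tag.lower()] = 1.0
    for keyword in keywords:
        features["kw:" + keyword.lower()] = 1.0
    # Raw text is reduced to simple lowercase word counts.
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        key = "txt:" + word
        features[key] = features.get(key, 0.0) + 1.0
    return features


def build_index(movies):
    """Map feature names to column ids and store each movie as {column: value}."""
    vocab = {}
    rows = []
    for movie in movies:
        row = {}
        for name, value in merge_features(**movie).items():
            column = vocab.setdefault(name, len(vocab))
            row[column] = value
        rows.append(row)
    return vocab, rows


# Toy data: each movie is only covered by some of the sources.
movies = [
    {"tags": ["Heist"], "text": "A crew plans a heist."},
    {"keywords": ["heist", "revenge"]},
]
vocab, rows = build_index(movies)

# Fraction of non-zero entries in the implied movie-by-feature matrix.
density = sum(len(row) for row in rows) / (len(rows) * len(vocab))
```

Note the design choice: because of the prefixes, "heist" shows up as three different features, one per source. That keeps the domains apart, but it also illustrates the price we pay, since every new source adds dimensions and most movies have entries in only a few of them.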