The progress we recently made gave us some confidence, but we still feel that at least one essential piece is missing. That is why we went back to the literature to see whether other researchers face the same problems. The scenario is still the same: we have data that describe arbitrary items, movies in our case, and we use a simple bag-of-words approach to model the binary features.
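To make the setup concrete, here is a minimal sketch of the binary bag-of-words representation described above. The movie titles and keywords are made-up illustrative data, not from our actual set:

```python
from itertools import chain

# Hypothetical items: each movie is described by a set of binary keywords.
movies = {
    "Movie A": {"space", "robot", "future"},
    "Movie B": {"robot", "comedy"},
    "Movie C": {"space", "drama", "future"},
}

# The vocabulary is the union of all keywords observed across items.
vocabulary = sorted(set(chain.from_iterable(movies.values())))

def encode(keywords, vocabulary):
    """Encode one item as a binary vector: 1 if the keyword applies, else 0."""
    return [1 if word in keywords else 0 for word in vocabulary]

vectors = {title: encode(kw, vocabulary) for title, kw in movies.items()}
```

Note that, unlike word counts in a document, every entry is strictly 0 or 1; a keyword either applies to a movie or it does not.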
Most of the results we found in the literature were about modeling documents. Even though that domain looks similar, it has little in common with our data. A document consists of many words, and it is very likely that a word occurs more than once. In our case, any feature occurs at most once and is strictly binary. There are approaches that treat a document as a set of binary features, but even then the sparsity factor is much smaller than in our data. A further problem is the distribution of the features: even document words with a low frequency are much more likely than a randomly chosen keyword from our data. For instance, a notable number of keywords occur exactly once. Stated differently, the vocabulary of documents is very different from a feature vocabulary.
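The two properties mentioned above, the sparsity factor and the long tail of singleton keywords, are easy to quantify. The following sketch computes both for a toy data set (the keyword sets are invented for illustration):

```python
from collections import Counter
from itertools import chain

# Hypothetical keyword sets for a handful of movies (illustrative only).
movie_keywords = [
    {"space", "robot"},
    {"robot", "noir"},
    {"space", "western", "silent-film"},
]

vocabulary = set(chain.from_iterable(movie_keywords))

# Sparsity: the fraction of (movie, keyword) cells that are zero in the
# implied binary matrix.
cells = len(movie_keywords) * len(vocabulary)
ones = sum(len(keywords) for keywords in movie_keywords)
sparsity = 1 - ones / cells

# Keywords that occur in exactly one movie -- the long tail noted above.
frequency = Counter(chain.from_iterable(movie_keywords))
singletons = sorted(kw for kw, n in frequency.items() if n == 1)
```

In a realistic keyword data set the vocabulary is far larger relative to the few keywords per item, so the sparsity is much closer to 1 than in this toy example, and the singleton tail dominates the vocabulary.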
A domain that is more closely related to our problem is topic modeling for micro-blogging data. These posts are usually very short, and the number of repeated words is small. Nevertheless, the distribution and structure of the words are very different from the purely binary tags of movies.
The bottom line is that we did not find any literature using a data set that shares enough attributes with our data to be helpful, and when a paper does come close, its ultimate goal is almost always the classification of the data, for instance, the sentiment of a review or the category of an article. We believe that a discriminative and/or supervised model is very useful, but only as a second step after we have built a good model for the data.