We are still on our journey to find a suitable model for our data. We have tried different regularizations, different learning algorithms (mainly CD and PCD), and various pre-processing routines for the data, and we are still at the beginning of the journey.
Because one of our ultimate goals is to train a joint model for the whole data, instead of smaller models that are somehow combined later, we think that even rather unusual methods are worth a try. That is why we experimented with an RBM approach that was originally intended for collaborative filtering. The approach can be summarized as follows: a joint model is trained by sharing weights and biases, but there is also a dedicated model for each user, which solves the problem that a user usually rated only a small fraction of all possible movies.
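To make the weight-sharing idea concrete, here is a minimal sketch in numpy. All dimensions, names, and data are hypothetical; the point is only that every user's dedicated RBM reuses the same shared weight matrix, while the visible layer is restricted to the movies that user actually rated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_movies, n_hidden = 5, 3

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# One weight matrix and one set of hidden biases, shared by all users.
W = rng.normal(0.0, 0.1, size=(n_movies, n_hidden))
b_h = np.zeros(n_hidden)

# Hypothetical single user: binary ratings plus a mask marking which
# movies this user rated at all (the user's dedicated RBM only
# contains visible units for the rated movies).
ratings = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
rated   = np.array([1.0, 1.0, 1.0, 0.0, 0.0])

# Hidden activations are driven only by the rated movies; during
# training, the rows of W belonging to unrated movies would simply
# receive no gradient from this user.
p_h = sigmoid(b_h + (ratings * rated) @ W)
```

In a full training loop, the CD/PCD updates from each user would be accumulated into the single shared `W`, which is what ties the per-user models into one joint model.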
A direct adaptation of the method to our data does not seem to be straightforward, since we are trying to model binary states and there is no easy way to translate the movie/user concept to our movie/genre approach. Nevertheless, we got some very useful ideas from the paper that we plan to pursue further.
One aspect we will definitely pursue is the introduction of another set of binary input nodes to model a conditional distribution of the hidden nodes that then depends on two sets of input data. In the original paper they used the new set of nodes to model that a user has seen a particular movie but that there is no rating for this movie available.
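The conditional idea can be sketched briefly: the second set of binary nodes does not interact with the visible layer directly, but shifts the effective bias of the hidden units, so the hidden distribution is conditioned on both inputs. The sketch below is a hypothetical minimal version in numpy; all sizes and parameter names are our own placeholders, not the paper's notation.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical dimensions.
n_visible, n_hidden, n_indicator = 6, 4, 6

# W couples the ordinary visible units to the hidden units;
# D couples the extra binary indicator nodes to the hidden units.
W = rng.normal(0.0, 0.1, size=(n_visible, n_hidden))
D = rng.normal(0.0, 0.1, size=(n_indicator, n_hidden))
b_h = np.zeros(n_hidden)

v = rng.integers(0, 2, size=n_visible).astype(float)    # visible states
r = rng.integers(0, 2, size=n_indicator).astype(float)  # e.g. "seen but unrated"

# The indicators act as an additional, input-dependent hidden bias,
# so p(h | v, r) depends on both sets of input data.
p_h = sigmoid(b_h + v @ W + r @ D)
```

The appeal of this formulation is that the indicator nodes never need to be reconstructed; they only condition the hidden layer, so the extra information comes almost for free during training.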
It definitely seems very clever to incorporate this information into the final model, because if a user decided to watch a movie, there must be a reason for it: maybe the story or the genre of the film, an actor, or the director. In any case, this information can be interpreted as a latent preference towards this movie, so it is very valuable data and should not be ignored.
In our case, we could use the additional nodes in many different ways. For instance, we could connect keywords with actors by setting the value of a node to “1” if a keyword is present in a movie with this actor and “0” otherwise. But that is just one possibility, and we are still thinking about how to maximize the gain for our model.
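Constructing such keyword–actor indicator nodes is straightforward; the following sketch shows one possible encoding on a tiny, entirely made-up dataset. One binary node per (keyword, actor) pair is set to 1 if the keyword occurs in any movie featuring that actor.

```python
# Hypothetical tiny dataset: each movie lists its keywords and actors.
movies = [
    {"keywords": {"space", "alien"}, "actors": {"Weaver"}},
    {"keywords": {"space"},          "actors": {"Hanks"}},
]

keywords = sorted({k for m in movies for k in m["keywords"]})
actors = sorted({a for m in movies for a in m["actors"]})

# One binary node per (keyword, actor) pair: 1 if the keyword appears
# in at least one movie featuring that actor, 0 otherwise.
nodes = {
    (kw, ac): int(any(kw in m["keywords"] and ac in m["actors"]
                      for m in movies))
    for kw in keywords
    for ac in actors
}

# e.g. nodes[("alien", "Weaver")] == 1, nodes[("alien", "Hanks")] == 0
```

The resulting 0/1 vector could then be fed in as the conditioning input of the model, in the same role the seen-but-unrated indicators play in the collaborative-filtering setting.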