We are postponing our research on Conditional RBMs a little longer, because we recently stumbled upon a paper that is very similar to an idea we pursued in the past. The idea is to use an auto-encoder with two inputs X, Y that is trained to reconstruct both X and Y. The aim is to minimize both the cross-reconstruction error and the self-reconstruction error of each input, while maximizing the correlation between the hidden representations of X and Y.
Let us consider the following example. In the domain of movies, X could be a list of plot keywords, while Y is a list of relevant themes. As mentioned, the aim of the model is to reconstruct both X and Y when both inputs are given, but also to reconstruct X given only Y, and Y given only X. In other words, if we only have the keywords of a movie, we would like to reconstruct the relevant themes, and in the other direction, we would like to reconstruct the keywords given only the themes.
Stated differently, the idea of correlational networks (CorrNet) is to feed different modalities into an auto-encoder in order to learn a joint hidden representation that captures the explanatory factors of both domains. These factors can then be used to reconstruct one modality given the other.
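To make the objective concrete, here is a minimal NumPy sketch of a single forward pass of such a model. All dimensions, the linear encoder/decoder, the squared-error reconstruction loss, the flattened correlation term, and the trade-off weight `lam` are our own illustrative assumptions, not values from the paper; the weights are random rather than trained, so this only shows how the loss terms fit together.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions: keyword vector X, theme vector Y, shared hidden layer.
dx, dy, dh, n = 20, 10, 8, 32
X = rng.random((n, dx))
Y = rng.random((n, dy))

# Randomly initialized linear encoder/decoder weights (a real model
# would learn these by gradient descent on the combined loss below).
Wx = rng.normal(0, 0.1, (dx, dh))
Wy = rng.normal(0, 0.1, (dy, dh))
b = np.zeros(dh)
Vx = rng.normal(0, 0.1, (dh, dx))
Vy = rng.normal(0, 0.1, (dh, dy))

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def encode(x, y):
    # Joint hidden representation h(x, y); either view may be zeroed out.
    return sigmoid(x @ Wx + y @ Wy + b)

def decode(h):
    # Reconstruct both views from the shared hidden representation.
    return sigmoid(h @ Vx), sigmoid(h @ Vy)

def recon_loss(x, y, h):
    xr, yr = decode(h)
    return np.mean((xr - x) ** 2) + np.mean((yr - y) ** 2)

# Three reconstruction terms: both views given, X only, Y only.
L_both = recon_loss(X, Y, encode(X, Y))
L_x = recon_loss(X, Y, encode(X, np.zeros_like(Y)))
L_y = recon_loss(X, Y, encode(np.zeros_like(X), Y))

# Correlation between the hidden codes of the two views (to be maximized,
# hence it is subtracted from the loss).
hx = encode(X, np.zeros_like(Y))
hy = encode(np.zeros_like(X), Y)
hx_c, hy_c = hx - hx.mean(axis=0), hy - hy.mean(axis=0)
corr = np.sum(hx_c * hy_c) / (
    np.sqrt(np.sum(hx_c ** 2)) * np.sqrt(np.sum(hy_c ** 2)) + 1e-8)

lam = 0.1  # trade-off weight between reconstruction and correlation
loss = L_both + L_x + L_y - lam * corr
print(float(loss), float(corr))
```

After training, the same `encode`/`decode` pair with one input zeroed out is exactly what reconstructs the themes from the keywords alone, or vice versa.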
Especially in the domain of movies, such a model is extremely useful, because the meta information of movies is very often incomplete. For instance, we might have keywords that describe the plot, but only partial or no information about the themes. With a CorrNet, we can simply feed the keywords (X) into the model to reconstruct the themes (Y). Furthermore, the model can also be used to “enhance” theme information, which means that the CorrNet can suggest themes. For example, a combination of keywords might suggest a “zombie” theme, even though this theme is not present in the meta information for some reason.