The machine learning world seems to be very busy building excellent classifiers for problems with thousands of labels and millions of samples. In our case, however, we are not really concerned with missing labels or predicting them, but with finding labels that suit our needs.
As mentioned before, genres are a great thing, but not as labels for movies, since their discriminative power is insufficient. For that reason, we introduced latent pseudo labels, which exploit the correlation of features to be more descriptive. One problem remains, though: our ultimate goal is to cluster movies, not to classify them. Nevertheless, better labels give us models that are more discriminative. All we need to do, then, is force the model to focus on pair-wise classifications. In other words, it is important that a prediction returns the correct label, but it is even more important that two similar movies end up with the same label, and if they do not, the model should be penalized accordingly.
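One way to sketch this idea is a loss with two terms: the usual cross-entropy per item, plus a penalty whenever a pair marked as similar disagrees on its predicted label distribution. The function below is only an illustration of that intuition, not our actual training code; the names, the symmetric KL disagreement term, and the weight `alpha` are all assumptions for the sketch.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def pairwise_label_loss(logits_a, logits_b, y_a, y_b, similar, alpha=1.0):
    """Cross-entropy for both items plus a disagreement penalty
    when a pair flagged as similar predicts different labels.

    All names and the choice of symmetric KL are illustrative.
    """
    p_a, p_b = softmax(logits_a), softmax(logits_b)
    # Standard classification loss for each item of the pair.
    ce = -np.log(p_a[y_a]) - np.log(p_b[y_b])
    # Symmetric KL divergence measures how much the two predicted
    # label distributions disagree; zero iff they are identical.
    kl = np.sum(p_a * np.log(p_a / p_b)) + np.sum(p_b * np.log(p_b / p_a))
    # Only similar pairs are penalized for disagreeing.
    return ce + (alpha * kl if similar else 0.0)
```

A dissimilar pair (Doom vs. Titanic ending up apart) pays no extra cost, while a similar pair with diverging predictions is pushed toward the same label.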
Stated differently, we should avoid at all costs that movies like Doom and Titanic end up with the same label, while it is perfectly okay for a pair like Doom and Resident Evil. This is a bit like a Siamese network, but instead of forcing proximity in the feature space, we focus only on the label information. The intuition is that this task should be easier, because items with the same label do not have to be close together in space; they just need to fall in the same separated region of the feature space.
In the end, it is the same old story. After training, we throw away the softmax layer and use the last hidden layer to extract our features. Those can be used directly for instance retrieval or as input for a deep neural net. Since we use a supervised cost function, it is unlikely that the learned features disentangle all explanatory factors, but with more powerful pseudo labels, they could suffice for a good clustering.
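The extraction step can be sketched in a few lines: run the forward pass only up to the last hidden layer and simply never apply the softmax weights. The toy two-layer net below uses random weights in place of trained ones, and all shapes and names are assumptions for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer net; in practice W1, b1, W2, b2 come from training.
W1, b1 = rng.normal(size=(10, 4)), np.zeros(4)  # input -> hidden
W2, b2 = rng.normal(size=(4, 3)), np.zeros(3)   # hidden -> softmax logits

def extract_features(x):
    """Forward pass up to the last hidden layer only; the softmax
    layer (W2, b2) is discarded at feature-extraction time."""
    return np.maximum(x @ W1 + b1, 0.0)  # ReLU hidden activations

x = rng.normal(size=10)
features = extract_features(x)  # 4-dim embedding for retrieval or clustering
```

The resulting vectors can then be fed to a nearest-neighbour index for retrieval or to any clustering method.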