# A Lightweight Embedding of Features

The sparsity of the feature data for movies poses a serious limitation in terms of training a good model, because especially the rare features are very useful to derive useful latent topics for movies. That is one reason, why we started to move our efforts from AutoEncoders and friends, to a pairwise learning scheme.

For instance, in case of spectral clustering, we can use all features to build an affinity matrix of all movies and use it to project the movies into some features space to cluster them. However, approaches that are based a full N x N matrix do not scale for larger data sets and thus, we decided to experiment with simpler schemes that use stochastic gradient descent and a pairwise loss function. We start with movie themes to explain our approach.

The goal is to embed all themes into a feature space where “similar” themes are placed closer together while “different” ones are pulled apart. In our case, similar means that two themes are frequently occur together. The idea is to assure that pairs of similar themes are always scored higher than other pairs. More formally, we have a reference theme Tr, a positive match Tp and a negative match Tn. Then we have: f(Tr, Tp) > f(Tr, Tn) + margin, where f() returns the similarity for a pair of themes.

In other words, a theme has a list of neighbors from which Tp is sampled and Tn is chosen at random and is not part of those neighbors. We can convert this constraint into a simple cost function: cost = maximum(0, margin + f(Tr, Tn) – f(Tr, Tp)). Stated differently, there is no loss, if the distance between the positive and the negative similarity score of the theme is larger than the margin!

The training of the model is simple, if we stored a list of neighbors for each theme in advance. Then the procedure to train the model can be summarized as:

– Sample an arbitrary theme Tr

– Sample the positive theme Tp from the neighbors of Tr: nn[Tr]

– Sample an arbitrary theme Tn until Tn is not part of nn[Tr]

– Compute the gradient and adjust the weights of the model to minimize the cost

We used the cosine similarity, parameterized ReLU neurons and Nesterov momentum to speed up the convergence. The learning rate was decayed. We trained the model until convergence or if the learning goot too small. To evaluate the results, we visualized the embedding of all themes with T-SNE as a 2D plot. But since the number of themes is limited, we also picked some themes and studied the next neighbors:

– mad-scientists => plagues-and-epidemics, mutants, split-personalities, zombies

– angels => deal-with-the-devil, miraculous-events, immortality

– gambling => cons-and-scams, alcoholism, families-in-crisis

Even with the limited data, the embedding seems to be plausible for most of the themes.

To sum it up, the experiments were very instructive. First, the training data can be generated by simply using the co-occurrence matrix or by counting pairs of themes. Second, the training takes only a couple of minutes, because SGD does not depend on the number of training examples and makes very fast progress in exchange for noisy gradient updates. At the end, we have a simple weight matrix that can be used to transform all themes, as a one-hot encoding, into the feature space.

Now, if we want to find out how similar two movies with different themes are, we can project the themes into the new space and use the cosine similarity. This way, we can relate movies even if they do not the same theme. To handle multiple schemes is not difficult but needs definitely further investigation.