# Joint Representation Learning of Attributes and Items

Learning dense embeddings for graph-like data is still tremendously popular. For instance there is word2vec, pin2vec, node2vec, doc2vec or tweet2vec and there is no end in sight. The idea to capture semantic information of tokens in a feature representation is very versatile and despite the simplicity also very powerful. However, it is not obvious how to appropriately(!) convert items, for example a document which is a sequence of tokens, into a single representation. The average of all tokens, also the sum, works well, but does not consider the order of the tokens and also neglects other possible structural information. To be clear, our proposal does not address the whole issue but at least allows to capture the statistics of items from the dataset.

As our domain is not text, but movies, there is no clear notion of a sequence for meta data, but we can treat the problem as a bipartite graph with the items on the “left” side and the attributes on the other side. In this graph, items are not directly connected, but by meshed by common attributes. In other words, the length of the shortest path from item A to B is 2 which means A->some_node->B. A simple example is that A,B are both sci-fi movies with a common theme of {spaceship,alien} along with other individual attributes and thus they should be treated at least latently similar.

In this setting, item nodes can be seen as anchors that are used to shape the feature space by using the local neighborhood, but also by walks from a source node to an arbitrary node that is reachable from the source. The power of the embedding lies in the sampling, but for now let’s just focus on the objective: min -log(P(N(u)|u) where u is the source node and N(u) is the set of all neighbors of u. With

P(n_i|u) = exp(f(n_i)*f(u)) / sum(i, exp(f(i)*f(u))) for each neighbor n_i of N(u) with respect to u. In plain English, we want to maximize the probability to observe the neighbors N(u) for the source node u. By using the softmax, we are pushing all pairs of (n_i, u) closer together while we are pulling the other nodes apart.

This is closely related to the word2vec objective with an adapted method to generate training samples. In the original setup, we select a word from a sentence and try to predict the surround words, while we select a node from the graph and try to predict the selected neighborhood. By customizing sampling strategies for the neighborhood, we can model different aspects of the graph and thus guide the learned representation.

Bottom line, instead of learning an embedding just for the attributes, we jointly learn an embedding for movies and attributes. This combines a transductive setting, since new movies cannot be embedded without re-training, but also an inductive one, since we can at least approximate the embedding of a new movie if we know its tags.