Describing Complex Concepts As Folders

In the previous post we described how it is possible to define more sophisticated tags/concepts without explicitly naming them. The idea is to group movies into folders and treat each folder as a concept or, equivalently, an opaque tag. This allows users to define their own categories, similar to mail, and lets us efficiently train a preference model by converting the folders into a ranking problem. As noted before, in contrast to mail, items may carry multiple tags, and the challenge is that our data is much sparser; we also cannot rely on natural language approaches, because the input is a set of keywords without semantics such as word order, or types like noun, verb or adjective.

Because it is always good to start with a simple model, we began with a linear one, which is reasonable since data in high-dimensional sparse spaces is often more easily linearly separable than dense data. Our model borrows heavily from word embeddings, but the objective is different. First, we have a set of keywords that describe our data, and each keyword is assigned a fixed-length feature vector that is learned. The same is done for the tags. In other words, we use a look-up table that maps a keyword ID to a vector. In our first setup, we had 2,000 keywords and 20 tags. To describe an item, we use the average of the feature vectors of all keywords present in the item. With this approach, items and tags have the same feature length.
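
For illustration, a minimal sketch of the look-up tables; the embedding size of 50 and the random initialization are assumptions, only the 2,000 keywords and 20 tags are from our setup:

import numpy as np

rng = np.random.default_rng(0)
dim = 50                                   # feature length (an assumption)
W = rng.normal(0.0, 0.1, (2000, dim))      # one learned vector per keyword
Y = rng.normal(0.0, 0.1, (20, dim))        # one learned vector per tag
b = np.zeros(20)                           # one bias per tag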

To train an embedding, we randomly draw items. Then, we determine the non-zero entries, the keywords, and map each of those keywords to its keyword ID. The result is stored as a list of integers “x”. Next, we randomly draw a tag from this item and look up its ID “y_pos”. For the negative sampling, “y_neg”, we draw K tags that are not present for the item and store their IDs as a list. With a toy example and K=2, it looks like this:

{castle->4, ghost->999, friends->814, scotland->21} => x:[4, 999, 814, 21]
{childrens->0, fantasy->8} => y: [0, 8], y_pos = 0
{horror->14, western->11} => y_neg = [11, 14]
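
As a hedged sketch, the sampling step could look like this; “items”, a list of (keyword IDs, tag IDs) pairs, and the function name are illustrative, not part of the original setup:

import random

def draw_example(items, n_tags=20, K=2):
    x, tags = random.choice(items)             # randomly draw an item
    y_pos = random.choice(tags)                # one tag present in the item
    absent = [t for t in range(n_tags) if t not in tags]
    y_neg = random.sample(absent, K)           # K tags not present
    return x, y_pos, y_neg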

The objective is to score the positive label higher than any drawn negative label. To reduce the complexity, we sample only a subset of all possible negative tags. In NumPy notation:

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# W: lookup-table keywords, shape (#keywords, dim)
# Y: lookup-table tags, shape (#tags, dim)
# b: bias tags, shape (#tags,)
f_mean = np.mean(W[x], axis=0)                            # item = mean of its keyword vectors
y_hat_pos = sigmoid(np.dot(f_mean, Y[y_pos]) + b[y_pos])  # score of the positive tag
y_hat_neg = sigmoid(np.dot(Y[y_neg], f_mean) + b[y_neg])  # one score per negative tag
loss = -np.log(y_hat_pos) - np.sum(np.log(1 - y_hat_neg))
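
The post leaves the parameter update open; as a sketch, a plain SGD step derived from the logistic loss above could look like this, reusing the names from before (the learning rate is an assumption):

lr = 0.05                                         # hypothetical learning rate
g_pos = y_hat_pos - 1.0                           # dloss/dscore, positive tag
g_neg = y_hat_neg                                 # dloss/dscore, negative tags
g_f = g_pos * Y[y_pos] + np.dot(g_neg, Y[y_neg])  # gradient w.r.t. f_mean
Y[y_pos] -= lr * g_pos * f_mean
b[y_pos] -= lr * g_pos
Y[y_neg] -= lr * np.outer(g_neg, f_mean)
b[y_neg] -= lr * g_neg
W[x] -= lr * g_f / len(x)                         # the mean spreads the gradient evenly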

If we think of CBOW, f_mean is the context that describes the surrounding words and y_pos is the word to predict. Stated differently, we use a very similar approach to train an embedding with negative sampling, but the focus is not on the co-occurrence statistics of the data; instead, we embed a context, the item, and a tag into a new, dense feature space. And for sparse data, the complexity of the training does not depend on the total number of keywords, but only on the average sparsity of the items, which is about 99% in our case. After we have trained a model, we can predict tags for new items very quickly. All we have to do is map the set of keywords to IDs, with the same look-up used for the model, and average the feature vectors. Then we get the score of each tag with a single dot product:

y_i = sigmoid(np.dot(f_mean, Y[i]) + b[i])

The procedure can also be executed as a single vector-matrix multiplication. The output of the model is a list of confidence scores in [0..1], one per tag. To turn the scores into a labeling, we need a threshold scheme. For our first setup, we used a constant of 0.6, which led to reasonable results.
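
For completeness, the vectorized form and the thresholding, reusing the names from above:

scores = sigmoid(np.dot(Y, f_mean) + b)    # one confidence score per tag
labels = np.where(scores > 0.6)[0]         # tags above the constant threshold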

Bottom line: we combine ideas from word embeddings, tagging and ranking to learn a preference model for individual, complex tags that users define by assigning movies to virtual folders. With this approach, we only model positive preferences, but it is straightforward to add a “spam” filter that uses a “don’t like” folder for -1 samples, with +1 samples drawn from real folders.
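
As a hedged sketch of that extension, the training pairs could be assembled like this; “folder_items” and “spam_items” are hypothetical lists of keyword-ID lists, and any binary classifier on the averaged features would do:

X_feat = np.array([np.mean(W[x], axis=0) for x in folder_items + spam_items])
y_lab = np.array([+1] * len(folder_items) + [-1] * len(spam_items))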
