
Attention For Bag-of-Words Data

For quite some time now, attention has been a very hot topic, and it has been used very successfully for various problems, such as translation or image captioning. The basic idea is clever and simple: if we consider the input of a model, usually a sequence, some parts of it are likely to be more important for the problem at hand, which is usually a prediction of some kind. However, since in our domain we are not working with sequences but with sets, we require an attention mechanism for unordered data. Let's start with an example.

We consider the domain of movies and in this particular case, we want to predict the genre from a bag-of-words input. Let the input be x=[“town”, “lawman”, “explosion”, “sheriff”, “brother”, “prison”, “ranch”]. So, the question is which features are most important for the decision, or stated differently, do we really need all features for a confident prediction of the genre? For this example, we only consider very basic genres, like western, horror, sci-fi or romance.

Since the input data is not ordered and a prediction should therefore not depend on the order, a recurrent network is not straightforward to use, which is why we use a CBOW-based model. With this method, we have an embedding matrix E with one row per feature. Usually, the final representation of the input is formed by aggregating all input features, either by the sum or the mean value. However, this assumes that all features contribute equally to the final prediction:

import numpy as np

E = np.random.uniform(-1, 1, size=(num_features, num_dim)) * scale  # embedding matrix, one row per feature
x = [i1, i2, i3, ..., ik]  # IDs of the features present in the item
U = E[x]                   # look up the embedding of each feature
h = np.mean(U, axis=0)     # aggregate by averaging

Instead, we want the model to put more focus on the “relevant” aspects:

x = [i1, i2, i3, ..., ik]
U = E[x]
g = np.tanh(np.dot(U, v) + bias)    # one raw score per input feature
a = np.exp(g) / np.sum(np.exp(g))   # softmax turns the scores into weights
h = np.sum(a[:, None] * U, axis=0)  # weighted sum of the feature embeddings

This is in the spirit of [arxiv:1512.08756], where “v” is a vector with num_dim dimensions and “bias” is a scalar.

With such an attention mechanism, we get a vector “a” whose length equals the number of input features, with only positive entries that sum to one, like a=[0.3, 0.6, 0.1]. Then, “h” is a weighted combination of all features:
h = 0.3 * U[0] + 0.6 * U[1] + 0.1 * U[2].
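
To make the shapes concrete, here is a self-contained toy version of the forward pass. The vocabulary, dimensions and parameter values are made up for illustration; in practice E, v and bias would be learned:

import numpy as np

np.random.seed(0)
vocab = ["town", "lawman", "explosion", "sheriff", "brother", "prison", "ranch"]
num_features, num_dim = len(vocab), 8

E = np.random.uniform(-1, 1, size=(num_features, num_dim)) * 0.1  # toy embedding matrix
v = np.random.uniform(-1, 1, size=num_dim) * 0.1                  # toy attention vector
bias = 0.0

x = [vocab.index(w) for w in ["sheriff", "ranch", "explosion"]]   # feature IDs of one item
U = E[x]
g = np.tanh(np.dot(U, v) + bias)
a = np.exp(g) / np.sum(np.exp(g))           # attention weights, positive and summing to one
h = np.sum(a[:, None] * U, axis=0)          # attended representation of the item
print(dict(zip(["sheriff", "ranch", "explosion"], np.round(a, 2))))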

When we think of our initial example, the different weights are likely to reflect the importance of a word with respect to the genre to predict. For instance, “sheriff” and “ranch” are probably more relevant for the western genre than “explosion” or “brother”, assuming that the dataset contains enough classical western movies to back this up.

Bottom line, if the input data is not ordered, it is not obvious how to learn with a recurrent model. On the other hand, bag-of-words models treat all input features equally, which can hurt the performance when the importance of features is conditional. With the illustrated approach, we are able to work with variable-length data and furthermore, we use attention to re-weight portions of the input. And finally, as stated in [arxiv:1512.08756], the evaluation can be done in parallel, since a step does not depend on the previous one, unlike with RNNs.

The conclusion is that we can use a simple feed-forward network in combination with attention to handle bag-of-words data in a very efficient way. The next step is to incorporate the method into existing models and evaluate the benefits, if any at all.

Efficient Embeddings With Theano

Training a word embedding model is pretty old hat and maybe this is nothing new, but we nevertheless thought it would be a good idea to summarize our experiences to avoid possible headaches in case somebody is not so familiar with Theano. To be fair, the Theano FAQ mentions the issue, so consider this post a more detailed summary.

Let’s assume that we want to train an embedding model for words with a
vocabulary size of N. Then we have a lookup matrix “W”

W = theano.shared(np.random.normal(0, 0.1, size=(N, dim)))

where dim is the number of dimensions we use for the embedding.

For the sake of simplicity, a training step consists of a pair (ref, pos) that should be moved closer together and a pair (ref, neg) that is supposed to be pulled apart. In other words, in each step, we only change three rows of W.

This is a possible loss function with the update:

cost = -T.log(T.nnet.sigmoid(T.dot(W[ref], W[pos]))) - T.log(1 - T.nnet.sigmoid(T.dot(W[ref], W[neg])))
grad_W = T.grad(cost, W)             # gradient with respect to the *whole* matrix
updates = [(W, W - lrate * grad_W)]

So, what is the problem here? Even if we only adjust three rows of the matrix, we calculate the gradients for the whole matrix. This is very wasteful and also awfully slow.

The solution is to use advanced indexing and only calculate the gradients for the subset of weights that are actually used in the step:

idx = T.ivector()  # the three row indices: ref, pos, neg
W_part = W[idx]    # view on the rows we actually touch
cost = -T.log(T.nnet.sigmoid(T.dot(W_part[0], W_part[1]))) - T.log(1 - T.nnet.sigmoid(T.dot(W_part[0], W_part[2])))
grad_W_part = T.grad(cost, W_part)   # gradients only for these three rows
updates = [(W, T.set_subtensor(W_part, W_part - lrate * grad_W_part))]

For those who are not familiar with Theano, let's see what is going on here. The variable “idx” is a vector that holds integers, in our case ref, pos and neg, which are used to retrieve particular rows of “W”. In other words, W_part contains references to these rows and is therefore something like a view of “W”.

The trick is to call T.grad with the subset and not the whole matrix to avoid unnecessary computations, because the gradients of all rows, except the ones referenced in W_part, are zero anyway. But since we are working with a view now, we need a different way to update just the three rows in W, which can be done with set_subtensor(). First, we update W_part as usual with gradient descent, but then we use advanced indexing to replace just the referenced rows. We can think of the new update statement as:

W[ref] = W_part[0] - lrate * grad_W_part[0]
W[pos] = W_part[1] - lrate * grad_W_part[1]
W[neg] = W_part[2] - lrate * grad_W_part[2]

And that’s it. It is not really rocket science, but it requires some deeper understanding of how Theano works under the hood.

The lesson is that if you just use a subset of weights during learning, you should only derive gradients for this subset. The notation might look unfamiliar at the beginning, but in case of a lookup matrix with several hundred thousand rows, we are talking about savings in the range of hours and this is worth the little extra effort.
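
For completeness, here is a minimal sketch of how the pieces could be wired into a compiled training function. The sizes, learning rate and word IDs are made up for illustration:

import numpy as np
import theano
import theano.tensor as T

N, dim, lrate = 1000, 50, 0.1  # hypothetical vocabulary size, dimensions and learning rate
W = theano.shared(np.random.normal(0, 0.1, size=(N, dim)))

idx = T.ivector()  # ref, pos, neg
W_part = W[idx]
cost = -T.log(T.nnet.sigmoid(T.dot(W_part[0], W_part[1]))) - T.log(1 - T.nnet.sigmoid(T.dot(W_part[0], W_part[2])))
grad_W_part = T.grad(cost, W_part)
updates = [(W, T.set_subtensor(W_part, W_part - lrate * grad_W_part))]

train_step = theano.function([idx], cost, updates=updates)
loss = train_step(np.asarray([12, 5, 903], dtype=np.int32))  # one (ref, pos, neg) triple of word IDs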

And last but not least, updating the whole matrix leads to very strange side-effects (also briefly mentioned in the FAQ). Let’s assume that we train an embedding model and we use RMSprop as the optimizer. In each step, we update the moving average:

avg_new = 0.9 * avg + 0.1 * grad**2,

where avg_new has the same dimension as the embedding matrix. In case of a batch-size of one, most of the gradients are zero and thus, the update for a row with a zero gradient looks like this:

avg_new[i] = 0.9 * avg[i] + 0.1 * 0,

which means the average is pushed to zero.

Therefore, the dynamics of the optimizer change for all rows, regardless of whether they were referenced in the pairs or not. This is true for all kinds of optimizers that keep a history of values, like momentum, Adam or Adagrad.
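
A tiny numpy sketch of the effect on a single row that is never referenced, assuming the usual RMSprop bookkeeping:

import numpy as np

avg = 0.04  # running average of squared gradients for one untouched row
for step in range(10):
    avg = 0.9 * avg + 0.1 * 0.0**2  # the row's gradient is zero in every step
print(avg)  # ~0.04 * 0.9**10, steadily decaying toward zero

# once the row finally receives a real gradient, the step
# lrate * grad / np.sqrt(avg + eps) is much larger than intended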

Bottom line: while using only the subset means a huge gain in performance for classical SGD, where it is not strictly required, it is mandatory for all more sophisticated optimizers.

Neural Pondering

We did not give up our old dream of purely unsupervised learning, but right now, we seem to have neither the right data nor a powerful enough model. However, we recently stumbled upon a research paper about neural variational inference for text and some of the ideas caught our attention.

The idea is to encode a whole document as a latent representation and then use a softmax to independently generate the words the document consists of. The document is encoded as a bag-of-words with an additional integer index for the non-zero words. This sounds very much like our data: very high-dimensional, very sparse and not ordered. So, we shamelessly took the ideas from the paper and adapted them to our context, but of course we give credit to the authors [arxiv:1511.06038].

Since we mainly use the softmax to ‘reconstruct’ the individual words, our new model looks a lot like CBOW, continuous bag of words, with the exception that now a context is not used to predict a single word, but several words. In Theano notation:

x_idx = T.ivector()  # the list of word IDs present in the document
W = theano.shared(np.random.normal(0, 0.1, size=(V, embed_dim)))   # input embeddings, |V| x embed_dim
b0 = theano.shared(np.zeros(embed_dim))                            # bias, embed_dim
R = theano.shared(np.random.normal(0, 0.1, size=(embed_dim, V)))   # output weights, embed_dim x |V|
b = theano.shared(np.zeros(V))                                     # bias, |V|
h_lin = T.mean(W[x_idx], axis=0)           # mean of all input word embeddings
h = T.nnet.elu(h_lin + b0)
y_hat = T.nnet.softmax(T.dot(h, R) + b)    # softmax returns a 1 x |V| matrix
loss = -T.sum(T.log(y_hat[0, x_idx]))      # maximize the probability of the present words

which means we average the embeddings of all words that are present in the document, optionally with a non-linearity at the end. Then we use this combined representation of the document to maximize the probability of all present words while pushing down the values for all others.
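
A hedged sketch of turning the loss into a plain SGD training step, with a made-up learning rate and without the subset-gradient trick from the previous section:

lrate = 0.01  # hypothetical learning rate
params = [W, b0, R, b]
grads = T.grad(loss, params)
updates = [(p, p - lrate * g) for p, g in zip(params, grads)]
train_step = theano.function([x_idx], loss, updates=updates)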

Again, we do not deny a strong resemblance to CBOW and we are pretty sure the novelty is limited, but maybe this is a stepping stone to something greater. The analysis of word neighbors in the learned space at least confirms that some semantics are definitely captured, but more work needs to be done.

Classification: Linear vs. Embedding

Whenever we are juggling very high-dimensional but also very sparse data, linear models are a good way to start. Why? Because they are fast to train and to evaluate, with a minimal footprint, and often sufficient to deliver good performance, because data in high-dimensional spaces is more likely to be separable. However, the simplicity also comes at a price.

In a linear model, each feature has a scalar weight, like barbecue=1.1, cooking=0.3, racing=-0.7, car=-0.9 and a bias is used as a threshold. To predict if something belongs to the category “food”, we calculate:

y_hat = x_1 * barbecue + x_2 * cooking + x_3 * racing + x_4 * car + bias

where x_i is in {0, 1} depending on whether the corresponding feature is present or not. If y_hat is positive, the answer is yes, otherwise no. It is obvious that the answer SHOULD be positive if a sufficient number of food-related features are present. In case of a mixture, the answer resembles a majority vote. The problem is that linear models completely ignore the context of features, which means the meaning of a feature cannot be adjusted depending on the other features that are present.

This is where embedding models jump in to fill the gap. In such a model, each feature is represented by a vector rather than a scalar, and an item is represented as the average of all its feature vectors. For the labels, we also use vectors, like for linear models, but in the embedding space rather than in the original feature space. Then, we can make a prediction by transforming the item and calculating, for each label:

y_i = f(dot(mean_item, label_i) + bias_i)
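
As a rough numpy sketch of both scoring rules, with toy numbers, random vectors instead of learned ones, and a sigmoid as the squashing function f:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([1, 1, 0, 0])                    # barbecue, cooking, racing, car: present or not
w = np.array([1.1, 0.3, -0.7, -0.9])          # linear model: one scalar weight per feature
bias = 0.1                                    # made-up bias
y_linear = np.dot(x, w) + bias                # positive -> "food"

E = np.random.normal(0, 0.1, size=(4, 20))    # embedding model: one 20-dim vector per feature
label_food = np.random.normal(0, 0.1, size=20)
bias_food = 0.0
mean_item = E[x == 1].mean(axis=0)            # item = average of its present feature vectors
y_food = sigmoid(np.dot(mean_item, label_food) + bias_food)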

What is the difference? Let's assume that we encode the vectors with 20 dimensions, which means we have much more representational power to encode relations between features. For instance, if a certain feature usually belongs to the non-food category, but is strongly related when combined with specific other features, a linear model is likely to have trouble capturing the correlation. More precisely, the weight can be either negative, around zero, or positive. If it is positive, the feature must be related to the category, which is usually not the case; if it is negative, it never contributes to a positive prediction; and if it is very small, neglecting the sign, it does not contribute at all. To be frank, we simplified the situation a lot, but the point is still valid.

In the case of an embedding, we can address the issue because we have more than a single dimension to encode relations with other features. Let's see again what a prediction looks like:

dot(mean_item, label_i) + bias_i

followed by some function f that we ignore, since it only rescales the output value. We can also express this as:

sum(mean_item[j] * label_i[j]) + bias_i

Stated differently, we could say that each dimension has a weighted vote that is summed up and thresholded against the label bias. The more positive votes we get, the higher the chance that (sum_i + bias_i) > 0, which means the final prediction is positive.

To come back to our context example, with the embedding it is possible that some dimensions are sensitive to correlations. For instance, for a specific feature, dimension “k” might be negative, and positive for non-food features, which eliminates the contribution due to the averaging. However, if it is also negative for food-related features, and also negative in the label, the specific feature strengthens the prediction because of the context. Of course, it is not that simple for real-world data, but the argument remains the same: with a vector, the model is able to learn more powerful representations.

Bottom line, linear models are very useful, but depending on the problem, they are too simple. Given sparse data, embedding models are still very efficient, because the complexity does not depend on |features| but on the number of non-zero features, with a sparsity of ~99.9% in our case. With the more powerful representations, it is also likely that the embedding can be re-used, for instance to learn preference-based models or to cluster the data.

Conceptual Similarity With Embeddings

For quite a while, we have studied methods to project movie descriptions, represented by short lists of keywords and themes, into a feature space that preserves a semantic distance between movies. We tried classic models like auto-encoders, restricted Boltzmann machines, siamese networks and lots of other neural network-based models to learn such an encoding, but no model was able to capture enough information to learn a good feature space. The major issues are low-frequency keywords, incomplete descriptions and a data set that is likely too small. For these reasons, we changed our focus to embedding words rather than full movies, which eventually led to a scheme that allowed us to aggregate these embeddings into movie embeddings. We started with classical word embeddings trained with CBOW, which learns an embedding matrix and a context matrix.

To illustrate one problem of movie embeddings, let's consider a simple example of a siamese network. The oracle to decide if two movies are similar might be the genre. In case of very narrow genres, like western, such a model is likely to work, because the vocabulary of western movies is very specific. However, even then the embedding of a new movie could fail, because the movie might have too many ambiguous words and, since no further information is used, it might end up closer to some non-western movies in the feature space.

To be clear, the word embedding approach does not solve all problems, but it helps to tackle some of the known ones. For instance, a word might not be really specific to the western genre, but occur relatively often with terms from this genre. Thus, the embedding of the word, in the context space, is close to some western-specific words. This helps a lot if a movie is encoded as a centroid by averaging the embedding vectors of all its present words.

To be more specific, let us assume that a movie consists only of a single word “w_a”, and that the context of this word, its neighbors in “W_out”, is related to western. However, in the embedding space “W”, non-western words are more similar. As mentioned before, a movie is represented by the average of the word embeddings of its present words:

W_movie[i] = 1/n * (W_out[w_1] + W_out[w_2] + ...)

To find the best matches, we use the cosine similarity of “w_a” with all movie centroids. The intuition is that instead of using the typicality of a word, we use its topicality to find related movies. In terms of documents, the analogy is to find texts that not only mention this single word, but are actually about it. This is why the context is so important: if a word is not directly related to a genre, we still want to find movies where the word fits into the broader topic. To be fair, we got lots of inspiration from [arxiv:1602.01137], where a similar approach is used to find matching documents for a query.
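
A minimal numpy sketch of this lookup, assuming the query word lives in the embedding space “W” and the centroids are built from the context space “W_out”; word_id and movie_word_ids are hypothetical placeholders for the query word and the per-movie keyword IDs:

import numpy as np

def cosine(u, v):
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

w_a = W[word_id]                                        # query word in the embedding space
centroids = np.array([W_out[ids].mean(axis=0)           # one centroid per movie
                      for ids in movie_word_ids])
scores = np.array([cosine(w_a, c) for c in centroids])
best = np.argsort(-scores)[:5]                          # indices of the top-5 matching movies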

However, as noted in the previous post it is not sufficient to model the co-occurrence of words which is why we use the movie itself as the context. More precisely, instead of selecting “surrounding” words and predicting the “center” word, we use all words that are present in a movie as the context and some (abstract) concept as the center. The notation, despite this slight adjustment, remains the same. At the end, we want to bring the average vector, the movie, closer to valid concepts and push it further away from non-valid concepts.

To see if the model actually leads to a useful feature space, we calculated the embeddings of all movies that air in the next few days, selected some random samples and analyzed their neighbors.

#1: Red Dog -> {Dr. Dolitte, Shaggy Dog, Despicable Me, Dr. Dolitte 4, Peter Pan}
#2: Police Python 357 -> {Miami Vice, Columbo, Chaos, The Son of No One}
#3: Star Trek Insurrection -> {S.T First Contact, S.T. Motion Picture, Apollo 13}

We can clearly see that semantically related movies are grouped together. For #1, the movies are for children and family, #2 has a strong crime theme and #3 is sci-fi with a space theme. But it should be noted that the results are also best effort, because if no similar movie airs in the next few days, the nearest neighbor might be very far away.

In a nutshell, the training of an embedding with a more appropriate loss function for the movie scenario already leads to very good results despite the fact that the approach is quite simple. Furthermore, the model is very versatile, since we can use it to predict arbitrary tags for unseen movies and we can use the learned embedding to find semantically related neighbors.

Describing Complex Concepts As Folders

In the previous post we described how it is possible to define more sophisticated tags/concepts without explicitly naming them. The idea is to group movies into folders and treat a folder as some concept or, equivalently, an opaque tag. This allows users to define their own categories, similar to mail, and to efficiently train a preference model by converting the folders into a ranking problem. As noted before, in contrast to mail, it is possible that items have multiple tags, and the challenge is that our data is much sparser and we cannot rely on natural language approaches, because the input data is a set of keywords without semantics like an order, or word types like noun, verb or adjective.

Because it is always good to start with a simple model, we began with a linear model, which is reasonable, since in high-dimensional spaces sparse data is often easier to separate (linearly) than dense data. Our model heavily borrows from word embeddings, but the objective is different. First, we have a set of keywords that describe our data and each keyword is assigned a fixed-length feature vector that is learned. The same is done for the tags. In other words, we use a look-up table that maps a keyword ID to a vector. In our first setup, we had 2,000 keywords and 20 tags. To describe an item, we use the average of the feature vectors of all keywords that are present in the item. With this approach, items and tags have feature vectors of the same length.

To train an embedding, we randomly draw items. Then, we determine the non-zero entries, the keywords, and map each of those keywords to the corresponding keyword ID. The result is stored as a list of integers “x”. Next, we randomly draw a tag from this item and look up its ID “y_pos”. For the negative sampling, “y_neg”, we draw K tags that are not present for the item and store the IDs as a list. As a toy example with K=2, it looks like this:

{castle->4, ghost->999, friends->814, scotland->21} => x:[4, 999, 814, 21]
{childrens->0, fantasy->8} => y: [0, 8], y_pos = 0
{horror->14, western->11} => y_neg = [11, 14]
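
A short numpy sketch of the sampling step, using made-up variable names for the item's tag set:

import numpy as np

num_tags, K = 20, 2
item_tags = [0, 8]                                           # tags assigned to the drawn item
y_pos = np.random.choice(item_tags)                          # one positive tag
candidates = np.setdiff1d(np.arange(num_tags), item_tags)
y_neg = np.random.choice(candidates, size=K, replace=False)  # K negative tags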

The objective is to score the positive label higher than any drawn negative label. To reduce the complexity, we sample only a subset of all possible negative tags. In pseudo numpy-notation:

# W: lookup-table for the keywords, shape (num_keywords, dim)
# Y: lookup-table for the tags, shape (num_tags, dim)
# b: bias per tag, shape (num_tags,)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

f_mean = np.mean(W[x], axis=0)                              # the item is the average of its keyword vectors
y_hat_pos = sigmoid(np.dot(f_mean, Y[y_pos]) + b[y_pos])    # score of the positive tag
y_hat_neg = sigmoid(np.dot(Y[y_neg], f_mean) + b[y_neg])    # one score per negative tag
loss = -np.log(y_hat_pos) - np.sum(np.log(1 - y_hat_neg))

If we think of CBOW, f_mean is the context that describes the surrounding words and y_pos is the word to predict. Stated differently, we use a very similar approach to train an embedding with negative sampling, but the focus is not on the co-variance of the data; instead, we embed a context, the item, and a tag into a new, dense feature space. In case of sparse data, the complexity of the training does not depend on the total number of keywords, but only on the average sparsity of the items, which is about 99% in our case. After we have trained a model, we can predict tags for new items very fast. All we have to do is to map the set of keywords to IDs, with the look-up used for the model, and average the feature vectors. Then, we get the score of a tag by a single dot product per tag:

y_i = sigmoid(np.dot(f_mean, Y[i]) + b[i])

The procedure can also be executed as a single vector-matrix multiplication. The output of the model is a list of confidence scores in [0..1], one for each tag. To turn the scores into a labeling, we need a threshold scheme. For our first setup, we used a constant of 0.6, which led to reasonable results.
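
Using the names from the training snippet above, the prediction for an unseen item, here with hypothetical keyword IDs x_new, could look like:

f_mean = np.mean(W[x_new], axis=0)          # average the keyword vectors of the new item
scores = sigmoid(np.dot(Y, f_mean) + b)     # single vector-matrix product, one score per tag
predicted_tags = np.where(scores > 0.6)[0]  # constant threshold from the text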

Bottom line, we combine ideas from word embeddings, tagging and ranking to learn a preference model for individual, complex tags that are defined by users by assigning movies to virtual folders. With this approach, we only model positive preferences, but it is straightforward to add a “spam” filter that uses a “don't like” folder for -1 samples and draws +1 samples from real folders.

CBOW: Encoding Movie Descriptions As Graphs

In contrast to NLP with texts and real sentences, the meta data of movies is not ordered. A set like {photograph, imagination, daydream} can be shuffled without changing its meaning, which is in contrast to ordinary sentences. However, to train a word embedding, we need an anchor word and a context to predict this particular word. But without an order there is no consistent way to do so, and this is why we treat the data as a graph.

To be more precise, a set like {a,b,c} can be treated as a clique where all nodes are pairwise connected, and the idea generalizes to larger sets. Furthermore, the graph is undirected and the edge between two nodes has a weight that equals the co-occurrence count of these nodes. With this notation, we can sample an anchor word, uniformly from all nodes, and build the context based on the co-occurrence statistics of the anchor word. This can be done by sampling a uniform random value in [0, w_sum], where w_sum is the sum of all edge weights from the anchor word to the other nodes, and then checking which node corresponds to the randomly chosen interval. With this method, we take the co-occurrence of two words into consideration to focus on frequent pairs, which is likely to help capture the semantics and encode them into the embedding.
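
A small numpy sketch of this sampling step; C is the co-occurrence matrix built beforehand and item_words is a hypothetical list of word IDs for one description:

import numpy as np

anchor = np.random.choice(item_words)            # uniform over the nodes of the clique
others = [w for w in item_words if w != anchor]
weights = C[anchor, others].astype(float)        # co-occurrence counts as edge weights
context = np.random.choice(others, p=weights / weights.sum())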

The training procedure is then straightforward. We just need to build the co-occurrence matrix C. Then, we shuffle all descriptions and iterate over the data, one by one. For each item, we randomly select an anchor word “a” and build the context “c” according to the method previously described. A negative sample “n” is selected from C, where only nodes with a frequency #(n,c) < T are considered, or stated differently, words that rarely or never occur within the context.
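
Continuing the sketch, a negative word could then be drawn like this, with T as the frequency threshold from the text:

candidates = np.where(C[context] < T)[0]   # words that rarely or never co-occur with the context
negative = np.random.choice(candidates)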

Bottom line, by treating movie descriptions, which are sets of words, as graphs, it is possible to define a consistent method to train a word embedding on this data, despite the fact that the selection of the anchor word and the context is stochastic.