Attention For Bag-of-Words Data

For quite some time now, attention is a very hot topic and it has been used very successfully for various problems, like translations, or captions for images. The basic idea is clever and simple: if we consider the input of a model, usually a sequence, some parts of it are likely to be more important for the problem which is usually a prediction of some kind. However, since in our domain we are not working with sequences, but sets, we require an attention mechanism for unordered data. Let’s start with an example.

We consider the domain of movies and in this particular case, we want to predict the genre from a bag-of-words input. And let the input be x=[town”, “lawman”, “explosion”, “sheriff”, “brother”, “prison”, “ranch”]. So, the question is which features are most important for the decision, or stated differently, do we really need all features for a confident prediction of the genre? For this example, we only consider very basic genres, like western, horror, scifi or romance.

Since the input data is not ordered and a prediction should therefore not depend on it, a recurrent network is not straightforward to use, which is why we use a CBOW-based model. With this method, we have an embedding matrix E that has #features rows. Usually the final representation of the input is done by aggregating all input features, either by the sum or the mean value. However, this assumes that all features equally contribute to the final prediction:

E = np.random.uniform(-1, 1, size=(#features, #dim))*scale
x = [i1, i2, i3, ..., ik]
U = E[x]
h = np.mean(U, axis=0)

Instead, we want that the model puts more focus on “relevant” aspects:

x = [i1, i2, i3, ..., ik]
U = E[x]
g = tanh(, v) + bias)
a = softmax(g)
h = np.sum(a * U, axis=0)

Which is in the spirit of [arxiv:1512.08756], where “v” is a vector of #dim dimensions and bias is a scalar.

With such an attention mechanism, we get a vector “a”, with a length equal to the number of input features with only positive entries such that the sum equals one, like a=[0.3, 0.6, 0.1]. Then, “h” is a weighted combination of all features:
h = 0.3 * U[0] + 0.6 * U[1] + 0.1 * U[2].

When we think of our initial example, the different weights are likely reflect the importance of a word with respect to the genre to predict. For instance, “sheriff” and “ranch” are probably more relevant for the western genre than “explosion” or “brother”, assuming that the dataset contains enough classical western movies to back this up.

Bottom line, if the input data is not ordered, it is not obvious howto learn with a recurrent model. On the other hand, bag-of-words models treat all input features equal which can hurt the performance when the importance of features is conditional. With the illustrated approach, we are able to work with variable-length data and furthermore, we use attention to re-weight portions of the input. And finally, as stated in [arxiv:1512.08756] the evaluation can be done in parallel, since a step does not depend on the previous one, unlike RNNs.

The conclusion is that we can use a simple feed-forward network in combination with attention to handle bag-of-words data in a very efficient way. The next step is to incorporate and evaluate the method into existing models to study the benefits, if any at all.

Just Getting ML Things Done

It’s true that at some point, you might need full control of the situation, with access to the loss function and maybe even the gradients. But sometimes, all you need is a hammer and a pair of nails without fine-tuning anything. Why? Because you just want to get the job done, ASAP. For example, we had a crazy idea that is likely to NOT work, but if we just need 10 minutes coding and 30 minutes waiting for the results, why not trying it? Without a proper framework, we would need to adjust our own code which can be a problem if there are lots of “best practices” one need to consider, like for instance for recurrent nets. Then it’s probably a better idea to use a mature implementation to avoid common pitfalls, so you can just focus on the actual work. Otherwise frustration is only a matter of time and spending hours on a problem that someone else has already solved will lead you nowhere.

Long story short, what we needed is a front-end for Theano with a clean interface and without the need to write tons of boilerplate code. After some investigations, and by focusing on support for recurrent nets, we decided to give Keras a try. The actual code to build and train the network has less than 20 lines, because of the sophisticated design of the framework. In other words, it allows you to get things done in a nice, but also fast way, if you are willing to to sacrifice the ability of controlling every aspect of the training.

After testing the code with a small dataset, we generated the actual dataset and started the training. What can we say? The whole procedure was painless. Installation? No problem. Writing the code? Piece of cake. The training? Smooth without any fine-tuning. Model deployment? After installing h5py it worked without any problems ;-).

Bottom line, we are still an advocate of Theano, but writing all the code yourself can be a bit of a burden, especially if you throw all the stuff away after a single experiment. Furthermore, if your experiment uses a standard network architecture without the necessity to tune or adjust it, mature code can avoid lots of frustration in form of hours of bug hunting. Plus, it’s likely that the code contains some heuristics to work around some known problems that you might not be aware of.

For clarification, we do not say that Keras is just a high-level front-end and does not allow any customization, what we say is that it does a great just to provide one in case you don’t need all the expert stuff! And last but not least, it allows you to switch the back-end in case you want something else than Theano. We like the concept a lot provide a unified API for different back-ends, because it’s possible that back-ends have different strengths and weaknesses and the freedom to choose allows you, to write code with one API but switching
back-ends dynamically as you need it.

Predicting The Next ‘Thing’

It is the dream of all machine learning guys to find a way to use the available data in combination with some unsupervised learning algorithm to train a useful representation of the data. Yes, we drastically simplifying things here, but the point is to learn without the necessity to label the data which is very expensive.

For example, there are tons of documents available which could be used for learning, but the problem is what cost function do we want to optimize? In case of word2vec and friends, we try to predict surrounding or center words without explicit labels. This works very good, but the result is an embedding of words and besides simple aggregation methods, there is no general way to represent documents with a learned embedding in a meaningful way. However, it is still a simple, but powerful approach that can easily utilize huge amounts of unlabeled text data to learn a useful representation.

Another example is a recently published paper [arxiv:1704.01444] that is also using a large text corpus without labels, at least for the first model, to just predict the next character of the data block. So far, this is nothing new, but it is remarkable that a single unit learned to predict the sentiment of a data block. In other words, all those models learn by predicting the next “thing” which can be, for instance, a word, a character, or some other token.

The interesting part is that such an “autoregression” model can be learned by just taking a sequence, removing the last item and try to predict it, given the previous data. This also works for sets, but the process is not straightforward since sets are not ordered. Furthermore, it is not obvious how to select the item, since there is no “previous” data.

Bottom line, it has been demonstrated several times that it is possible to learn a good representation of data by just predicting the next token. Nevertheless, often such methods are limited to short texts, since processing longer texts require to remember lots of data or context, especially for models based on RNNs.

However, since we are usually dealing with short descriptions of items, with the exception that we handle sets and no sequences, we adapted the method and trained a model to predict a keyword from the set, given the rest of the set, with moderate results. Despite the problems we encountered, we still believe that no (strongly) supervised model will ever be able to learn powerful but also general representation of data. Thus, it seems a good idea to follow the research track and address the existing problems one by one, until we
eventually find a method that addressed the major hurdles.

Top-K-Gating With Theano

In a recently published paper [arxiv:1701.06538], the authors introduced a mixture of experts which is not new. However, the twist is to use only a small subset of those experts which cannot be done with an ordinary softmax, since the output of a softmax is always -slightly- positive. The idea is to keep only the truly top-k experts by setting the values, before applying the softmax operation, of all non-top-k experts to a large negative value. The result is that the actual output value at the corresponding position is zero.

With numpy, x is a vector, this is actually straightforward:

def keep_topk(x, k, neg=-10):
 rest = x.shape[0] - k
 idx = np.argsort(x)[0:rest]
 x[idx] = neg
 return x

We just sort the values of x, getting the indicies for the |x|-k positions and set the values to -10.

But since we want to use all the nice features of Theano, we need to port the code to the tensor world. Frankly, this is no big deal either, but it requires a tiny adaption since we cannot assign values to tensors directly.

def keep_topk(x, k, neg=-10):
 rest = x.shape[0] - k
 idx = T.argsort(x)[0:rest]
 return T.set_subtensor(x[idx], neg)

And that’s it.

The reason why we spent some time with the porting is that we also had the idea to use soft attention to model the final prediction as a decision of a small set of experts. The experts might have different opinions and with the gating, we can blend different confidence levels with different outputs.

Adaptable Data Processing And Presentation

When we began to rewrite our TV recommender application from scratch some patterns repeated a lot which are not new, but extremely useful. The basic pipeline is like that: we have chunked id/key/value data that is converted into a coherent CSV file which is then typified, evaluated and imported into some back-end.

The chunked key-value data is nothing special and we only need to keep track of the current ID to combine all pairs into one item. However, streaming makes our life a lot of easier, because then we can treat the encoding as a pipeline with adaptable stages. In the pythonic world, we are using iterators which can be easily chained to model a pipeline. To conclude the process, need two operations: (1) filter out items that do match a certain criteria and (2) map items by modifying existing values or adding new ones.

In this setting, ML procedures can be often seen as an additional mapping step. For instance, we could use the method to build data that is required for a special model, but we could also directly apply the raw data on existing models. Like to estimate the preference of movies, to determine an “interaction” score with series or to annotate items with tags/concepts. And last but not least, it is also possible to build data for ML step-wise, or to model dependencies from one stage to another.

With this in mind, a simple pipeline could look like that:

csv = ChunkedReader(sys.stdin)
pipe = Pipeline(csv.__iter__())
pipe.add(TypifyMap()) # convert strings to data type
pipe.add(ExpireFilter()) #filter out expired items
pipe.add(DedupFilter()) #ignore duplicates
pipe.add(PrefModelMap()) #score movies
pipe.add(IndexMap()) # tokenize relevant data

The chunked data is read from stdin and as soon as there is a complete item, it is passed to the next stage. In this case, a module that converts the strings into more useful data type. Like datetime() for timestamps, integer for IDs and lists for keys that have multiple values. Next, we want to skip items that have been already aired, or if the primary ID has been already seen in the stream of items. At the end, we use some ML model to access the meta data -if available- of movie items and assign a preference score to them. Finally, all values of relevant keys of the item will be tokenized, unified and indexed to allow to search the content.

What is important is that the sequence of the stages might be internally reorganized since filters need to be executed first to decide if any mapping should be done at all. In other words, if any filter rejects an item, the pipeline skips it altogether and requests a new item from the stream.

So, the basic ingredients are pretty simple: We have a stream of items -the iterator- and we have filters to skip items and maps to adjust items. At the end of the pipeline a transformed item emerges that can be stored into some back-end. The beauty is that we can add arbitrary stages with new functionality without modifying existing modules, as long as the design-by-contract is not violated.

Later, when we use dialogs to present aggregated data to the user, we can use the same principle. For instance, a user might request all western movies with a positive score which is nothing more than two filters to reject non-western movies and all items with a negative score. Another example is when a user clicks on a tag to get a list of all items that share the same tag and there are dozens of other examples like that.

The pattern can be described like that: we need a basic iterator to access the data, probably with a strong filter to reduce the amount, and then we need to adapt the data with combinations of filter/map to access only the relevant parts of it. This approach has the advantage that we can implement basic building blocks that are reused all over the place and new functionality can be built by combining elementary blocks.

However, despite the flexibility we gain with such an approach, it is very challenging to integrate this into a GUI since many views require customized widgets to preserve this flexibility. Furthermore, recommenders should also “learn” the optimal layout with respect to the preferences of users which introduces dynamics that further increase the complexity of the graphical interface.

Bottom line, the success of a recommender system mostly depends on the implementation of the graphical user interface and requires both an intuitive, but also flexible application to avoid frustrated user. Hopefully we can find some time to elaborate on it in a new post.

Joint Representation Learning of Attributes and Items

Learning dense embeddings for graph-like data is still tremendously popular. For instance there is word2vec, pin2vec, node2vec, doc2vec or tweet2vec and there is no end in sight. The idea to capture semantic information of tokens in a feature representation is very versatile and despite the simplicity also very powerful. However, it is not obvious how to appropriately(!) convert items, for example a document which is a sequence of tokens, into a single representation. The average of all tokens, also the sum, works well, but does not consider the order of the tokens and also neglects other possible structural information. To be clear, our proposal does not address the whole issue but at least allows to capture the statistics of items from the dataset.

As our domain is not text, but movies, there is no clear notion of a sequence for meta data, but we can treat the problem as a bipartite graph with the items on the “left” side and the attributes on the other side. In this graph, items are not directly connected, but by meshed by common attributes. In other words, the length of the shortest path from item A to B is 2 which means A->some_node->B. A simple example is that A,B are both sci-fi movies with a common theme of {spaceship,alien} along with other individual attributes and thus they should be treated at least latently similar.

In this setting, item nodes can be seen as anchors that are used to shape the feature space by using the local neighborhood, but also by walks from a source node to an arbitrary node that is reachable from the source. The power of the embedding lies in the sampling, but for now let’s just focus on the objective: min -log(P(N(u)|u) where u is the source node and N(u) is the set of all neighbors of u. With
P(n_i|u) = exp(f(n_i)*f(u)) / sum(i, exp(f(i)*f(u))) for each neighbor n_i of N(u) with respect to u. In plain English, we want to maximize the probability to observe the neighbors N(u) for the source node u. By using the softmax, we are pushing all pairs of (n_i, u) closer together while we are pulling the other nodes apart.

This is closely related to the word2vec objective with an adapted method to generate training samples. In the original setup, we select a word from a sentence and try to predict the surround words, while we select a node from the graph and try to predict the selected neighborhood. By customizing sampling strategies for the neighborhood, we can model different aspects of the graph and thus guide the learned representation.

Bottom line, instead of learning an embedding just for the attributes, we jointly learn an embedding for movies and attributes. This combines a transductive setting, since new movies cannot be embedded without re-training, but also an inductive one, since we can at least approximate the embedding of a new movie if we know its tags.

When Unsuper Is Actually Super, Or Not?

It is no secret that most of the energy has been put into advancing supervised approaches for machine learning. One reason is that lots of problems can be actually phrased as predicting labels and often with very good results. So, the question is, especially for commercial solutions where time and resources are very limited, if it isn’t better to spend some time to label data and train a classifier to get some insights about the data. We got some inspiration from a recent twitter post that suggested a similar approach.

For instance, if we want to predict if an “event” is an outlier or not, we have to decide between supervised and unsupervised methods. The advantage of the latter is that we have access to lots of data, but we have no clear notion of “outliers”, while for the former, we need events that are labeled with the risk that the data is not very representative and therefore, the trained model might be of limited use.

In other words, it is the old story again: A supervised model is usually easier to train, if we have sufficient labeled data at the expense that we get what we “feed”. Thus, more labeled data is likely to improve the model but we can never be sure when we captured all irregularities. On the other hand, unsupervised learning might be able to (fully) disentangle the explaining factors of the data and thus leads to a more powerful model, but coming up with a proper loss function and the actual training can be very hard.

Bottom line, there is some truth in it that if you cannot come up with a good unsupervised model, but you can partly solve the problem with an supervised one, you should start with it. With some luck, the simple model will lead to additional insights that might eventually lead to an unsupervised solution.