We are still in love with Theano and it’s part of our machine learning framework since quite some time, but now and then you need something else to get something done. Of course python is our first choice since it is intuitive, flexible and when combined with low-level modules written in C/C++, the performance is also no problem.
Frankly, one reason we never gave torch a try is because even if learning a new language can be fun, time is very valuable. Plus, in our humble opinion it makes more sense to master one language than to divide your attention between two. However, with the arrival of pytorch things are different now.
It’s not like all our problems are solved, but pytorch introduces a new, very interesting concept of dynamic graphs. Furthermore, pytorch uses the WYSIWYG concept which means that a tensor contains actual values at any moment and not only symbolic references to it. Because of this, there are also no lengthy compilation steps and this also makes debugging much easier.
Even if the state of the project is described as “early-release beta”, our experiments so far did not encounter any serious problems or limitations. But to be fair, our models were fairly straight-forward and thus only used standard components. Nevertheless, describing the model and the actual training worked like a charm without any pitfalls. And with our existing knowledge from Theano and other graph-based frameworks, it was easy to adapt existing code and/or to write new one.
The integration of the automatic differentiation is a little different compared to Theano but this is also no big deal if you spent minimal time on the interface description. Plus, with all the available examples and tutorials, it’s pretty easy to get an overview of all the modules you need for your daily work. Especially recurrent networks are pretty easy to use and require only minimal knowledge of the underlying details in case you just need a standard setup to solve a problem.
Bottom line, we are big fans of frameworks that are easy to use but also versatile. When we stumbled about numpy many years ago, we instantly fall in love because implementing algorithms was straight-forward and also very fast, because of the optimized linear algebra routines and the vectorization. The only drawback is the missing automatic differentiation because doing it manually is burdensome and very error prone. Thus, if a framework extends numpy with this feature, plus the ability to perform calculation on the GPU in a transparent way, the outcome has to be useful ;-).
In other words, if you are familiar with the computational machinery that is required for implementing neural networks, but also other machine learning models, pytorch can make your life a lot of easier. It’s pretty lightweight, fast and easy to use. Maybe it needs a little more time to be “feature complete” and more mature, but our tests did not reveal any severe problems and since the community is pretty active, problems should be addressed pretty soon after they are reported.
The idea of the word2vec method is quite simple, but nevertheless very elegant and, as lots of recently published papers confirmed, very versatile for a broad range of problems. Thus, a lot of problems can be transformed and solved by a generic word2vec implementation. The advantage is that we can reuse a mature and optimized method to solve various problems, but the drawback is that this might not be very flexible, if we have, for instance, special requirements. Like kludges for the data preparation, or the abuse of parameters to emulate a certain behavior.
However, if we just extract the core component of word2vec, we have a fairly generic black box to solve a problem. In the skip-gram case, the input consists of a source word that is used to predict a sequence of context words. This approach also works, if we don’t have actual words, but tokens that are somehow related. Especially for unordered data, like sets, where the context is not well defined, an adaptable preparation step makes a lot of sense.
For example, let’s assume that we have a set of titles and each title has a set of corresponding tags. The intuition is that if two titles share a lot of tags, they are related. In other words, the title is the source and all assigned tags form the context, which can be seen as a local neighborhood. In case we also consider tags from titles that are reachable through shared tags, we gradually move away from a local to more global neighborhood.
This also has been explored in the literature where the local neighborhood can be described as a breadth-first search and the global one as a depth-first search. This is also related to a random walk, since it makes sense to stochastically decide what node to traverse next. For instance, we start at an arbitrary title, then we sample from the corresponding tags, then we sample from the connected titles and so forth. The whole sequence is then the walk. In contrast to a sequence a set is not ordered and thus, we need a different kind of notation for the window size. In [arxiv:1603.04259] it was proposed to consider all pairs in the set, or to shuffle each training sample.
Bottom line, word2vec can be used far beyond the field of NLP which includes graph embedding, collaborative filtering or personalization, just to name a few. Furthermore, in most scenarios, a well-matured implementation[python:gensim.models.Word2Vec] can be used to train an embedding without the necessity to adapt a single line of the code. In other cases, the input data might need to be encoded in a special way, but this is often straightforward and also does not require to change the code.
For quite some time now, attention is a very hot topic and it has been used very successfully for various problems, like translations, or captions for images. The basic idea is clever and simple: if we consider the input of a model, usually a sequence, some parts of it are likely to be more important for the problem which is usually a prediction of some kind. However, since in our domain we are not working with sequences, but sets, we require an attention mechanism for unordered data. Let’s start with an example.
We consider the domain of movies and in this particular case, we want to predict the genre from a bag-of-words input. And let the input be x=[town”, “lawman”, “explosion”, “sheriff”, “brother”, “prison”, “ranch”]. So, the question is which features are most important for the decision, or stated differently, do we really need all features for a confident prediction of the genre? For this example, we only consider very basic genres, like western, horror, scifi or romance.
Since the input data is not ordered and a prediction should therefore not depend on it, a recurrent network is not straightforward to use, which is why we use a CBOW-based model. With this method, we have an embedding matrix E that has #features rows. Usually the final representation of the input is done by aggregating all input features, either by the sum or the mean value. However, this assumes that all features equally contribute to the final prediction:
E = np.random.uniform(-1, 1, size=(#features, #dim))*scale
x = [i1, i2, i3, ..., ik]
U = E[x]
h = np.mean(U, axis=0)
Instead, we want that the model puts more focus on “relevant” aspects:
x = [i1, i2, i3, ..., ik]
U = E[x]
g = tanh(np.dot(U, v) + bias)
a = softmax(g)
h = np.sum(a * U, axis=0)
Which is in the spirit of [arxiv:1512.08756], where “v” is a vector of #dim dimensions and bias is a scalar.
With such an attention mechanism, we get a vector “a”, with a length equal to the number of input features with only positive entries such that the sum equals one, like a=[0.3, 0.6, 0.1]. Then, “h” is a weighted combination of all features:
h = 0.3 * U + 0.6 * U + 0.1 * U.
When we think of our initial example, the different weights are likely reflect the importance of a word with respect to the genre to predict. For instance, “sheriff” and “ranch” are probably more relevant for the western genre than “explosion” or “brother”, assuming that the dataset contains enough classical western movies to back this up.
Bottom line, if the input data is not ordered, it is not obvious howto learn with a recurrent model. On the other hand, bag-of-words models treat all input features equal which can hurt the performance when the importance of features is conditional. With the illustrated approach, we are able to work with variable-length data and furthermore, we use attention to re-weight portions of the input. And finally, as stated in [arxiv:1512.08756] the evaluation can be done in parallel, since a step does not depend on the previous one, unlike RNNs.
The conclusion is that we can use a simple feed-forward network in combination with attention to handle bag-of-words data in a very efficient way. The next step is to incorporate and evaluate the method into existing models to study the benefits, if any at all.
It’s true that at some point, you might need full control of the situation, with access to the loss function and maybe even the gradients. But sometimes, all you need is a hammer and a pair of nails without fine-tuning anything. Why? Because you just want to get the job done, ASAP. For example, we had a crazy idea that is likely to NOT work, but if we just need 10 minutes coding and 30 minutes waiting for the results, why not trying it? Without a proper framework, we would need to adjust our own code which can be a problem if there are lots of “best practices” one need to consider, like for instance for recurrent nets. Then it’s probably a better idea to use a mature implementation to avoid common pitfalls, so you can just focus on the actual work. Otherwise frustration is only a matter of time and spending hours on a problem that someone else has already solved will lead you nowhere.
Long story short, what we needed is a front-end for Theano with a clean interface and without the need to write tons of boilerplate code. After some investigations, and by focusing on support for recurrent nets, we decided to give Keras a try. The actual code to build and train the network has less than 20 lines, because of the sophisticated design of the framework. In other words, it allows you to get things done in a nice, but also fast way, if you are willing to to sacrifice the ability of controlling every aspect of the training.
After testing the code with a small dataset, we generated the actual dataset and started the training. What can we say? The whole procedure was painless. Installation? No problem. Writing the code? Piece of cake. The training? Smooth without any fine-tuning. Model deployment? After installing h5py it worked without any problems ;-).
Bottom line, we are still an advocate of Theano, but writing all the code yourself can be a bit of a burden, especially if you throw all the stuff away after a single experiment. Furthermore, if your experiment uses a standard network architecture without the necessity to tune or adjust it, mature code can avoid lots of frustration in form of hours of bug hunting. Plus, it’s likely that the code contains some heuristics to work around some known problems that you might not be aware of.
For clarification, we do not say that Keras is just a high-level front-end and does not allow any customization, what we say is that it does a great just to provide one in case you don’t need all the expert stuff! And last but not least, it allows you to switch the back-end in case you want something else than Theano. We like the concept a lot provide a unified API for different back-ends, because it’s possible that back-ends have different strengths and weaknesses and the freedom to choose allows you, to write code with one API but switching
back-ends dynamically as you need it.
It is the dream of all machine learning guys to find a way to use the available data in combination with some unsupervised learning algorithm to train a useful representation of the data. Yes, we drastically simplifying things here, but the point is to learn without the necessity to label the data which is very expensive.
For example, there are tons of documents available which could be used for learning, but the problem is what cost function do we want to optimize? In case of word2vec and friends, we try to predict surrounding or center words without explicit labels. This works very good, but the result is an embedding of words and besides simple aggregation methods, there is no general way to represent documents with a learned embedding in a meaningful way. However, it is still a simple, but powerful approach that can easily utilize huge amounts of unlabeled text data to learn a useful representation.
Another example is a recently published paper [arxiv:1704.01444] that is also using a large text corpus without labels, at least for the first model, to just predict the next character of the data block. So far, this is nothing new, but it is remarkable that a single unit learned to predict the sentiment of a data block. In other words, all those models learn by predicting the next “thing” which can be, for instance, a word, a character, or some other token.
The interesting part is that such an “autoregression” model can be learned by just taking a sequence, removing the last item and try to predict it, given the previous data. This also works for sets, but the process is not straightforward since sets are not ordered. Furthermore, it is not obvious how to select the item, since there is no “previous” data.
Bottom line, it has been demonstrated several times that it is possible to learn a good representation of data by just predicting the next token. Nevertheless, often such methods are limited to short texts, since processing longer texts require to remember lots of data or context, especially for models based on RNNs.
However, since we are usually dealing with short descriptions of items, with the exception that we handle sets and no sequences, we adapted the method and trained a model to predict a keyword from the set, given the rest of the set, with moderate results. Despite the problems we encountered, we still believe that no (strongly) supervised model will ever be able to learn powerful but also general representation of data. Thus, it seems a good idea to follow the research track and address the existing problems one by one, until we
eventually find a method that addressed the major hurdles.
In a recently published paper [arxiv:1701.06538], the authors introduced a mixture of experts which is not new. However, the twist is to use only a small subset of those experts which cannot be done with an ordinary softmax, since the output of a softmax is always -slightly- positive. The idea is to keep only the truly top-k experts by setting the values, before applying the softmax operation, of all non-top-k experts to a large negative value. The result is that the actual output value at the corresponding position is zero.
With numpy, x is a vector, this is actually straightforward:
def keep_topk(x, k, neg=-10):
rest = x.shape - k
idx = np.argsort(x)[0:rest]
x[idx] = neg
We just sort the values of x, getting the indicies for the |x|-k positions and set the values to -10.
But since we want to use all the nice features of Theano, we need to port the code to the tensor world. Frankly, this is no big deal either, but it requires a tiny adaption since we cannot assign values to tensors directly.
def keep_topk(x, k, neg=-10):
rest = x.shape - k
idx = T.argsort(x)[0:rest]
return T.set_subtensor(x[idx], neg)
And that’s it.
The reason why we spent some time with the porting is that we also had the idea to use soft attention to model the final prediction as a decision of a small set of experts. The experts might have different opinions and with the gating, we can blend different confidence levels with different outputs.
When we began to rewrite our TV recommender application from scratch some patterns repeated a lot which are not new, but extremely useful. The basic pipeline is like that: we have chunked id/key/value data that is converted into a coherent CSV file which is then typified, evaluated and imported into some back-end.
The chunked key-value data is nothing special and we only need to keep track of the current ID to combine all pairs into one item. However, streaming makes our life a lot of easier, because then we can treat the encoding as a pipeline with adaptable stages. In the pythonic world, we are using iterators which can be easily chained to model a pipeline. To conclude the process, need two operations: (1) filter out items that do match a certain criteria and (2) map items by modifying existing values or adding new ones.
In this setting, ML procedures can be often seen as an additional mapping step. For instance, we could use the method to build data that is required for a special model, but we could also directly apply the raw data on existing models. Like to estimate the preference of movies, to determine an “interaction” score with series or to annotate items with tags/concepts. And last but not least, it is also possible to build data for ML step-wise, or to model dependencies from one stage to another.
With this in mind, a simple pipeline could look like that:
csv = ChunkedReader(sys.stdin)
pipe = Pipeline(csv.__iter__())
pipe.add(TypifyMap()) # convert strings to data type
pipe.add(ExpireFilter()) #filter out expired items
pipe.add(DedupFilter()) #ignore duplicates
pipe.add(PrefModelMap()) #score movies
pipe.add(IndexMap()) # tokenize relevant data
The chunked data is read from stdin and as soon as there is a complete item, it is passed to the next stage. In this case, a module that converts the strings into more useful data type. Like datetime() for timestamps, integer for IDs and lists for keys that have multiple values. Next, we want to skip items that have been already aired, or if the primary ID has been already seen in the stream of items. At the end, we use some ML model to access the meta data -if available- of movie items and assign a preference score to them. Finally, all values of relevant keys of the item will be tokenized, unified and indexed to allow to search the content.
What is important is that the sequence of the stages might be internally reorganized since filters need to be executed first to decide if any mapping should be done at all. In other words, if any filter rejects an item, the pipeline skips it altogether and requests a new item from the stream.
So, the basic ingredients are pretty simple: We have a stream of items -the iterator- and we have filters to skip items and maps to adjust items. At the end of the pipeline a transformed item emerges that can be stored into some back-end. The beauty is that we can add arbitrary stages with new functionality without modifying existing modules, as long as the design-by-contract is not violated.
Later, when we use dialogs to present aggregated data to the user, we can use the same principle. For instance, a user might request all western movies with a positive score which is nothing more than two filters to reject non-western movies and all items with a negative score. Another example is when a user clicks on a tag to get a list of all items that share the same tag and there are dozens of other examples like that.
The pattern can be described like that: we need a basic iterator to access the data, probably with a strong filter to reduce the amount, and then we need to adapt the data with combinations of filter/map to access only the relevant parts of it. This approach has the advantage that we can implement basic building blocks that are reused all over the place and new functionality can be built by combining elementary blocks.
However, despite the flexibility we gain with such an approach, it is very challenging to integrate this into a GUI since many views require customized widgets to preserve this flexibility. Furthermore, recommenders should also “learn” the optimal layout with respect to the preferences of users which introduces dynamics that further increase the complexity of the graphical interface.
Bottom line, the success of a recommender system mostly depends on the implementation of the graphical user interface and requires both an intuitive, but also flexible application to avoid frustrated user. Hopefully we can find some time to elaborate on it in a new post.