# Tagged: rnn

# pytorch: A New Burning Star

We are still in love with Theano and it’s part of our machine learning framework since quite some time, but now and then you need something else to get something done. Of course python is our first choice since it is intuitive, flexible and when combined with low-level modules written in C/C++, the performance is also no problem.

Frankly, one reason we never gave torch a try is because even if learning a new language can be fun, time is very valuable. Plus, in our humble opinion it makes more sense to master one language than to divide your attention between two. However, with the arrival of pytorch things are different now.

It’s not like all our problems are solved, but pytorch introduces a new, very interesting concept of dynamic graphs. Furthermore, pytorch uses the WYSIWYG concept which means that a tensor contains actual values at any moment and not only symbolic references to it. Because of this, there are also no lengthy compilation steps and this also makes debugging much easier.

Even if the state of the project is described as “early-release beta”, our experiments so far did not encounter any serious problems or limitations. But to be fair, our models were fairly straight-forward and thus only used standard components. Nevertheless, describing the model and the actual training worked like a charm without any pitfalls. And with our existing knowledge from Theano and other graph-based frameworks, it was easy to adapt existing code and/or to write new one.

The integration of the automatic differentiation is a little different compared to Theano but this is also no big deal if you spent minimal time on the interface description. Plus, with all the available examples and tutorials, it’s pretty easy to get an overview of all the modules you need for your daily work. Especially recurrent networks are pretty easy to use and require only minimal knowledge of the underlying details in case you just need a standard setup to solve a problem.

Bottom line, we are big fans of frameworks that are easy to use but also versatile. When we stumbled about numpy many years ago, we instantly fall in love because implementing algorithms was straight-forward and also very fast, because of the optimized linear algebra routines and the vectorization. The only drawback is the missing automatic differentiation because doing it manually is burdensome and very error prone. Thus, if a framework extends numpy with this feature, plus the ability to perform calculation on the GPU in a transparent way, the outcome has to be useful ;-).

In other words, if you are familiar with the computational machinery that is required for implementing neural networks, but also other machine learning models, pytorch can make your life a lot of easier. It’s pretty lightweight, fast and easy to use. Maybe it needs a little more time to be “feature complete” and more mature, but our tests did not reveal any severe problems and since the community is pretty active, problems should be addressed pretty soon after they are reported.

# Attention For Bag-of-Words Data

For quite some time now, attention is a very hot topic and it has been used very successfully for various problems, like translations, or captions for images. The basic idea is clever and simple: if we consider the input of a model, usually a sequence, some parts of it are likely to be more important for the problem which is usually a prediction of some kind. However, since in our domain we are not working with sequences, but sets, we require an attention mechanism for unordered data. Let’s start with an example.

We consider the domain of movies and in this particular case, we want to predict the genre from a bag-of-words input. And let the input be x=[town”, “lawman”, “explosion”, “sheriff”, “brother”, “prison”, “ranch”]. So, the question is which features are most important for the decision, or stated differently, do we really need all features for a confident prediction of the genre? For this example, we only consider very basic genres, like western, horror, scifi or romance.

Since the input data is not ordered and a prediction should therefore not depend on it, a recurrent network is not straightforward to use, which is why we use a CBOW-based model. With this method, we have an embedding matrix E that has #features rows. Usually the final representation of the input is done by aggregating all input features, either by the sum or the mean value. However, this assumes that all features equally contribute to the final prediction:

E = np.random.uniform(-1, 1, size=(#features, #dim))*scale

x = [i1, i2, i3, ..., ik]

U = E[x]

h = np.mean(U, axis=0)

Instead, we want that the model puts more focus on “relevant” aspects:

x = [i1, i2, i3, ..., ik]

U = E[x]

g = tanh(np.dot(U, v) + bias)

a = softmax(g)

h = np.sum(a * U, axis=0)

Which is in the spirit of [arxiv:1512.08756], where “v” is a vector of #dim dimensions and bias is a scalar.

With such an attention mechanism, we get a vector “a”, with a length equal to the number of input features with only positive entries such that the sum equals one, like a=[0.3, 0.6, 0.1]. Then, “h” is a weighted combination of all features:

`h = 0.3 * U[0] + 0.6 * U[1] + 0.1 * U[2]`

.

When we think of our initial example, the different weights are likely reflect the importance of a word with respect to the genre to predict. For instance, “sheriff” and “ranch” are probably more relevant for the western genre than “explosion” or “brother”, assuming that the dataset contains enough classical western movies to back this up.

Bottom line, if the input data is not ordered, it is not obvious howto learn with a recurrent model. On the other hand, bag-of-words models treat all input features equal which can hurt the performance when the importance of features is conditional. With the illustrated approach, we are able to work with variable-length data and furthermore, we use attention to re-weight portions of the input. And finally, as stated in [arxiv:1512.08756] the evaluation can be done in parallel, since a step does not depend on the previous one, unlike RNNs.

The conclusion is that we can use a simple feed-forward network in combination with attention to handle bag-of-words data in a very efficient way. The next step is to incorporate and evaluate the method into existing models to study the benefits, if any at all.

# Just Getting ML Things Done

It’s true that at some point, you might need full control of the situation, with access to the loss function and maybe even the gradients. But sometimes, all you need is a hammer and a pair of nails without fine-tuning anything. Why? Because you just want to get the job done, ASAP. For example, we had a crazy idea that is likely to NOT work, but if we just need 10 minutes coding and 30 minutes waiting for the results, why not trying it? Without a proper framework, we would need to adjust our own code which can be a problem if there are lots of “best practices” one need to consider, like for instance for recurrent nets. Then it’s probably a better idea to use a mature implementation to avoid common pitfalls, so you can just focus on the actual work. Otherwise frustration is only a matter of time and spending hours on a problem that someone else has already solved will lead you nowhere.

Long story short, what we needed is a front-end for Theano with a clean interface and without the need to write tons of boilerplate code. After some investigations, and by focusing on support for recurrent nets, we decided to give Keras a try. The actual code to build and train the network has less than 20 lines, because of the sophisticated design of the framework. In other words, it allows you to get things done in a nice, but also fast way, if you are willing to to sacrifice the ability of controlling every aspect of the training.

After testing the code with a small dataset, we generated the actual dataset and started the training. What can we say? The whole procedure was painless. Installation? No problem. Writing the code? Piece of cake. The training? Smooth without any fine-tuning. Model deployment? After installing h5py it worked without any problems ;-).

Bottom line, we are still an advocate of Theano, but writing all the code yourself can be a bit of a burden, especially if you throw all the stuff away after a single experiment. Furthermore, if your experiment uses a standard network architecture without the necessity to tune or adjust it, mature code can avoid lots of frustration in form of hours of bug hunting. Plus, it’s likely that the code contains some heuristics to work around some known problems that you might not be aware of.

For clarification, we do not say that Keras is just a high-level front-end and does not allow any customization, what we say is that it does a great just to provide one in case you don’t need all the expert stuff! And last but not least, it allows you to switch the back-end in case you want something else than Theano. We like the concept a lot provide a unified API for different back-ends, because it’s possible that back-ends have different strengths and weaknesses and the freedom to choose allows you, to write code with one API but switching

back-ends dynamically as you need it.

# movie2vec + RNN = Fun

In the first post about movie2vec we closed with the idea to use a recurrent network to encode a sequence of words as a fixed vector. First, let us summarize what we got so far: We trained an embedding model of plot keywords of movies by constructing a graph that encodes the relation between the words. Thus, each word is represented by x-dim vector in the feature space.

To learn something about the knowledge encoded by such a system, we used the embedding to learn a simple softmax model that predicts if a certain concept is present in a movie or not; we used handpicked genres for this task. To convert the sequence of the embedding vectors into a fixed length vector, we averaged all vectors and the result was used as the input to the softmax. This is our baseline.

But such a model is too simple to capture higher-order correlations which is why we decided to use a recurrent network for the transformation. We got our inspiration from a Theano example that used recurrent networks to predict the sentiment of a review. The idea is to average the hidden representation over time which leads to a fixed vector that is fed into the softmax layer to get the prediction.

This is exactly what we need, to turn a sequence of words, the embeddings to be precise, into a shared representation that can be used to represent a movie. A drawback is that the learned features are driven by labels which limits the use of them. However, a more severe problem is that in contrast to real NLP problems, a sequence of plot keywords does not have an order. Therefore, we need to investigate how this interacts with the model dynamics and how we can find a deterministic way to ensure consistent encoding of sequences.

We started with a very simple model, an recurrent Elman network with 50 hidden tanh units and a softmax with 10 units. The network was trained for some epochs with rmsprop + momentum and the model was then compared to the softmax baseline model. Because we were more interested in the errors of the model, to see what it has learned, we picked some movies and compared the top-2 predictions. It should be noted that not all genres were used for training which explains some approximations made by the model:

1) Zombieland

– softmax: 46.22% horror, 29.22% sci-fi => 75.44% confidence

– recurrent: 48.39% horror, 48.27% sci-fi => 96.66% confidence

2) Doom

– softmax: 60.07% horror, 24.62% sci-fi => 84.69% confidence

– recurrent: 62.48% horror, 36.39% sci-fi => 98.87% confidence

For both models, the predictions are reasonable and for Zombieland the sci-fi portion can be explained because movies with a post apocalyptic theme are often also sci-fi movies. However, it is no coincidence that the confidence of the recurrent model is usually higher than its softmax counterpart. We need to study this further, but we assume that the recurrent model better captures the correlation of the plot words and thus, the predictions have more confidence.

We just started our recurrent journey but with our first results we are confident that this is not the end but the beginning of a long and fruitful travel to the depth of a new world.

# movie2vec – Adapting Word Embeddings For Movies

The word2vec approach is an excellent example that a powerful method does not always come with high computational complexity and because the algorithm does not require any labels, training data can be obtained very easily, even for non-English languages. With asynchronous SGD, such a model can be trained on a corpus with millions of words in the vocabulary without spending weeks to wait for a good accuracy. Since there are already very good posts about the method, we do not describe it any further. But instead of focusing only on word embeddings, we use a more general approach [arxiv:1503.03578] to learn an embedding of data that can be represented as a graph.

In the last months, we progressively encountered problems because of the sparsity in the input data and missing descriptions. Thus, we decided to move our focus from feature-based learning to relation-based learning and more specifically, embeddings, because relations of tuples can be modeled efficiently even with sparse input data. More precisely, we treat each movie as a vertice in a graph that has edges to other vertices (movies) that model arbitrary relations between those movies. For instance, an edge could mean two movies have the same genre, theme or keywords in common. However, as a first step, we decided to use a more lightweight model that is using words to get to know the new method.

The model can briefly describes as follows. The graph consists of vertices that represent plot words and an edge describes that the word pair (i,j) occur together. We use the inverse word frequency as weights for the edges (i->j): w[i], (j->i): w[j]. This is similar to a matrix factorization of the co-occurrence matrix, but with SGD the method scales much better to larger vocabularies. At the end, the model learns an embedding for each word in the feature space that resembles the proximity of words in the original space. To convert the word vectors to movies, we start with averaging all vectors for words which are present in a movie. This serves as a baseline for further experiments. Our base model consists of 4,450 words/vertices and 833,005/edges (* 2 to make the graph directed). With randomly drawn word samples, we tested that the embedding is reasonable (word + most similar words):

– class-reunion => faculty, reunions, campus, football-star

– possession => evil-possession, demonic-possession, supernatural-forces

– stock-car => nascar, car-racing, speed, racing, race-car

– tiger => elephant, lion, crocodile, animal

The results confirm that for words with a sufficient frequency, the embedding delivers a solid performance, but for some words, which are less frequent, it is noticeable that the number of edges was insufficient which lead to bizarre neighbors in the feature space. Since our training corpus is rather small, the issue can be addressed by collecting more movie meta data.

Bottom line, even with a small dataset we can learn a good model to relate words. However, in contrast to sentences, the meta data of movies has no order which is no problem, but requires to generate the training data differently. For instance a sentence that consists of four words

s=(w1, w2, w3, w4)

implies that word w2 is adjacent to word w3 and w1 but not to w4 and therefore there is no edge between w2 and w4. In case of a movie m={w1, w2, w3, w4} we need to consider **all** pairs of words as edges:

(w1, w2), (w1, w3), (w1, w4), (w2, w3), (w2, w4), (w3, w4)

but otherwise the training remains the same.

The problem with averaging the embedding vectors is that not every word contributes equally to the concepts of a movie. Thus, we evaluated the quality of our baseline with a simple classifier that predicts the genre with a simple soft-max model without any hidden layers. The first thing we noted during training was that the model converged very quickly, regardless of the chosen hyper-parameters. This is in indicator that the embedding already did a good job in disentangling various explaining factors in the input data. To check the expressive power of the embedding, we added a single hidden layer to the model but we found out that this does not improve the model which means that the embedding is descriptive but unable to explain some higher-order correlations. This is not surprising if we consider the simplicity of the method. Nevertheless, the soft-max model still delivers a good performance considered its simplicity.

Even though we can now better utilize the existing data, we still need a larger training set to tackle issues like the rare word problem. Furthermore, even if the averaging of vectors provides a good baseline, it is far from being optimal and needs to be replaced with a more powerful model. Despite these issues, we are pretty happy with the results because even such a simple model is already very versatile with applications in clustering and classification.

At the end, we can say that the new direction in our research already helped a lot to see the current issues from different perspectives and we are positive that we can extend the basic model into a more powerful one that improves the representational power of movie vectors. On top of our wish list is to use recurrent networks to convert a sequence of word vectors into a fixed representation of a movie.