Character-Level Positional Encoding

For documents, word-level embeddings work reasonably well, because the vocabulary does not contain too many special cases. However, for tweets or other short texts with a rather dynamic structure, words might not be appropriate, because it is not possible to generalize to unknown words, including words that are very similar to existing ones but not quite the same. The issue can be tackled with recurrent nets that work on characters, but their sequential processing cannot be parallelized easily. The paper “Attention Is All You Need” describes an approach that uses a positional encoding to encode sequences of words without the need for a recurrent net. Thus, the process can be parallelized more easily and therefore performs better. However, we still cannot generalize to unknown words, except via an ‘UNK’ token, which is not very useful in our case.

To be more precise, we try to model titles from TV shows, series, etc. in order to predict the category, and there it is imperative that we can generalize to slight variations of those titles, which are likely to occur due to spelling errors and the different formats used by different TV channels. Adapting the procedure to use characters instead of words is straightforward. We just need to build a lookup map with the values calculated from the position and the dimension index, up to the maximal sequence length present in the training data. Then we can easily calculate the encoding of arbitrary inputs with this formula:

 # char_id_list: the sentence as a sequence of character ids
 # pos_enc_map holds the positional codes, char_embed the character embeddings
 embed = np.sum(pos_enc_map[np.arange(len(char_id_list))] * char_embed[char_id_list], 0)

and one possible lookup could be calculated by:

import numpy as np

# max_seq_len and dim_len come from the training data and the embedding size
# pre-compute a lookup with one row per position and one column per embedding dimension
result = np.zeros((max_seq_len, dim_len))
for pos in range(max_seq_len):
    a = np.arange(dim_len) + 1.
    if pos % 2 == 0:
        t = np.cos(pos / a)
    else:
        t = np.sin(pos / a)
    result[pos] = t
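
For completeness, here is one way the two snippets above could fit together end to end; note that the toy character vocabulary, the random character embeddings and the concrete sizes are merely assumptions for illustration:

import numpy as np

max_seq_len, dim_len = 100, 32
chars = sorted(set("abcdefghijklmnopqrstuvwxyz ,.-0123456789"))   # assumed toy vocabulary
char_to_id = {c: i for i, c in enumerate(chars)}
char_embed = np.random.randn(len(chars), dim_len) * 0.1           # assumed random embeddings

pos_enc_map = np.zeros((max_seq_len, dim_len))                    # the lookup from above
for pos in range(max_seq_len):
    a = np.arange(dim_len) + 1.
    pos_enc_map[pos] = np.cos(pos / a) if pos % 2 == 0 else np.sin(pos / a)

title = "the crack in space, usa 2017"
char_id_list = [char_to_id[c] for c in title if c in char_to_id]
embed = np.sum(pos_enc_map[np.arange(len(char_id_list))] * char_embed[char_id_list], 0)
print(embed.shape)   # (dim_len,)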

The lookup uses different weights for different positions, but it also treats each dimension differently, which brings more diversity into the encoding. Moreover, since the encoding does not depend on previous states, the semantic similarity of tokens that occur in multiple strings is preserved.

Bottom line, a positional encoding allows us to take the order of a sequence into consideration without using a recurrent net, which is very beneficial in many cases. Furthermore, the character-level encoding allows us to classify text sequences we have never seen before, which is important because of the minor variations of titles.


Do We Need More Data or Better Ways to Use It?

One of the first demonstrations of how powerful Deep Learning can be used 1,000 pictures per category and needed quite a lot of steps to build a model that worked. Without a doubt this was seminal work, but it also demonstrated that DL only vaguely resembles how humans learn: if a child had to look at 1,000 cups to grasp the concept, the human lifespan would be too short to survive without strong supervision. Another example is the recent breakthroughs in reinforcement learning, which also come at a certain cost, like a couple of thousand bucks a day for energy. In a lot of cases, data, and even labels, might be no problem, but it often takes days or even weeks to turn them into a useful model. This is in stark contrast to the brain, which uses very little energy and is able to generalize from just a few examples, or even a single one. Again, this is nothing new, but it raises the question of whether we spend too little time on fundamental research and instead try too often to beat state-of-the-art results to earn a place in the hall of fame. This viewpoint is probably too simple, since there is research that focuses on real-world usage, like WaveNet, but that also shows that you need a lot of manpower to do it. Thus, most companies have to rely on global players or public research if they want to build cutting-edge A.I. products. The introduction of GPU clouds definitely helped, because it allows everyone to train larger models without buying the computational machinery, but the cloud is not free either, and it gets worse if the training has to be fast, since you then need to buy lots of GPU time. The topic, in a broader context, has also been debated recently in [1]. In the spirit of that debate, the question is: how can we avoid running against a wall about 1,000 times before we realize it’s not a good idea?

[1] “Will the Future of AI Learning Depend More on Nature or Nurture?”

Stuck In A Local Minimum

It was never easier to get access to very powerful machinery for learning something. There are lots of different frameworks to train your model, and with GPUs you can even train “bigger” models on commodity hardware. Furthermore, there is a lot of published research and there are cool blog posts that explain recent trends and new theories. In other words, almost everything is out there; you just need to find the time to digest all the content and turn it into something that solves your problem, right? Well, honestly, if you have one of those common problems, a solution is probably just around the corner and you don’t need much time or energy to solve it. Maybe there is even a pre-trained network that can be used directly, or you could ask around whether somebody is willing to share one. But frankly, this is the exception rather than the rule, because very often your problem is special and the chance that existing code is available is close to zero. Maybe there are some relevant papers, but it is likely to take time to find them and more time to implement the code. In other words, if the problem is easy, you often don’t need to do any research, but otherwise it can be a very long way with lots of obstacles even to get a hint where to start.

In such cases, if you are lucky, you can ask people in your company or team for some hints, or at least to discuss the problem at eye level. But what if you have no access to such valuable resources? Doing research on your own takes time, and it is not guaranteed to lead anywhere if your time is limited, which is usually the case. So, what to do? It’s like training a neural network with lots of local minima, where in some configurations learning even gets totally stuck. This is nothing new, but sometimes we get the impression that the popular opinion is that all you need is a framework and to turn the knobs for as long as it takes to solve the problem. This is like having a racing car without the proper training to drive it: there is a chance to win a race, but it’s more likely that you wreck it. The question is how to best spend your time when you want to solve a problem. Concentrate on the technology, or on the theory? A little bit of both? Or try to find existing solutions? This is related to our concern that we meet a growing number of ML engineers who seem to be little more than power users of a framework, without the ability to understand what is going on under the hood.

A Canticle For Theano

Back in 2010 there was a presentation about a new computation framework at SciPy, and its name was Theano. Even though it was not the only available framework at the time, there was also Torch for example, it provided a lot of new and very powerful features. Furthermore, it provided an interface very similar to the popular numpy package, which is very versatile and provides everything you need to build machine learning algorithms, especially neural nets.

The idea of describing an algorithm in a purely symbolic way and letting the framework optimize this expression into a computational graph was new on the one hand, and on the other hand it also allowed code to be executed on other devices, like the GPU, which can be much faster. And the cherry on top of this delicious cake was something called automatic differentiation. In contrast to other frameworks, or to implementing algorithms by hand, you can define an arbitrary loss function and let the framework calculate the gradient. This has several advantages: it avoids the error-prone process of deriving the gradient manually, and the required steps are performed in a deterministic way with the help of the computational graph.

In other words, Theano allows you to implement everything you can imagine, as long as it is continuous and differentiable. Of course this does not come for free: since the framework is a low-level library, it requires a solid understanding of the whole workflow, and you have to code a lot of the functionality yourself. But once you have done this, it allows rapid prototyping of different algorithms, since you neither need to derive the gradients manually nor care about the optimization of the data flow, because Theano optimizes the graph for you.
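
As a small illustration of what this looks like in practice, here is a minimal sketch of a logistic regression training step in Theano; the variable names and toy dimensions are our own choices, only T.grad and theano.function are the actual machinery described above:

import numpy as np
import theano
import theano.tensor as T

# symbolic inputs and shared parameters (toy dimensions, assumed for illustration)
X = T.matrix('X')
y = T.vector('y')
w = theano.shared(np.zeros(5, dtype=theano.config.floatX), name='w')

# logistic regression loss, defined purely symbolically
p = T.nnet.sigmoid(T.dot(X, w))
loss = -T.mean(y * T.log(p) + (1 - y) * T.log(1 - p))

# automatic differentiation: no manual gradient derivation needed
grad_w = T.grad(loss, w)
train = theano.function([X, y], loss, updates=[(w, w - 0.1 * grad_w)])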

More than five years ago, we started to use Theano because we love Python, machine learning, including neural nets, and the way Theano does things. In all that time, we never encountered any serious bugs or problems, and we were able to use it on a broad range of devices including CPUs, GPUs and embedded ones. We implemented innumerable algorithms, from rather straightforward nets to deep nets with dozens of layers. But we did not only use it for research; we also used it to build real-world applications which were deployed and are still in use.

In other words, without Theano, the world of machine learning would definitely not be the same. For instance, recent frameworks were very likely inspired by the features Theano provided. So the question is: why did such a framework, with all its awesomeness, never manage to get the attention it deserved? Was it too difficult to use? Does it require the backing of a major company? Was it too close to academia? Missing marketing? We don’t have the answers, and now that the development of Theano will stop soon, the future is very uncertain. Sure, it is open source, but we still need people who are willing to spend time improving it and coordinating further development. But even if we are a little skeptical about the future of Theano, version 1.0 is still a very useful piece of software that can be used to design and deploy your models.

What’s left to say? We would like to thank all the contributors and authors of Theano for their hard work. Theano is an amazing piece of software that allows everyone to express their thoughts as algorithms in a straightforward and efficient way. With it, it was possible to build cutting-edge products and to use GPUs even if you were not a CUDA expert.

Training and Deploying Neural Nets: Two Sides Of One Coin

There are a lot of problems out there and some of them can definitely be solved with neural nets. However, despite the availability of lots of sophisticated frameworks, you still need a solid understanding of the theory and also a bit of luck, at least when you plan to design your own loss function or want to implement a model from a recent research paper. To just train a classifier on some labeled data, you can practically select any of the available frameworks at random. But most real-world problems are not like that.

The truth is that you can do all this with most of the frameworks out there, but what is not said very often is that this can be very time consuming and sometimes even frustrating. In recent months we promoted PyTorch as a framework that goes hand-in-hand with Python and allows you to easily create networks with dynamic graphs, not to mention the ability to debug code on the fly, since tensors have actual values and are not purely symbolic. This increased our productivity a lot and also reduced our level of frustration.
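
To make the contrast with a purely symbolic graph concrete, here is a minimal sketch of what we mean by a dynamic graph, written against today’s PyTorch API; the tiny module and the threshold rule are invented for illustration, the point is only that ordinary Python control flow and printing intermediate tensors just work:

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    def __init__(self):
        super(TinyNet, self).__init__()
        self.fc1 = nn.Linear(8, 8)
        self.fc2 = nn.Linear(8, 2)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        if h.norm() > 1.0:        # dynamic graph: plain Python control flow per forward pass
            h = h * 0.5
        print(h.mean())           # tensors carry real values, so debugging is just printing
        return self.fc2(h)

net = TinyNet()
out = net(torch.randn(4, 8))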

Still, we should not forget that all of this is just technology, and even if a framework has a very large community (and might be backed by a big company), there is no guarantee that it won’t be obsolete next year or in five years. That said, in our humble opinion a framework should allow developers to get things done quickly, not force them to write lots of code that is unrelated to the actual problem. But this can turn into a problem when a framework is very high-level and does not easily allow you to customize your models or adapt the design, which includes, for example, combining different loss functions.

Let’s take NLP as an example: Without a doubt, attention is a very important part of most modern approaches and thus it is very likely that we need to integrate this feature into a real-world model. Despite its effectiveness, the method is not very complicated, and in terms of the computational graph it is also not very hard to implement. But this of course depends on the framework and its API. Does the framework come with native support for it? Is it possible to modify it easily? How well does it fit into the layer landscape? How difficult is it to implement from scratch? Can it be debugged easily?
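
To give an impression of how little graph code is actually needed, here is a minimal sketch of plain dot-product attention in PyTorch; the shapes and names are invented for illustration, and the snippet ignores masking and scaling variants:

import torch
import torch.nn.functional as F

def dot_product_attention(query, keys, values):
    # query: (dim,), keys and values: (seq_len, dim)
    scores = keys @ query                 # one score per position
    weights = F.softmax(scores, dim=0)    # normalize the scores to a distribution
    return weights @ values               # weighted sum of the values

query = torch.randn(16)
keys = torch.randn(10, 16)
values = torch.randn(10, 16)
context = dot_product_attention(query, keys, values)   # shape: (16,)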

Even if we have made a lot of progress in understanding and training neural nets, it still feels more like black magic than science. With a nice framework like Keras, it is not hard to train a neural net from scratch. But what happens if the learning gets stuck and this cannot be fixed trivially by adjusting some options? Then you need to go deeper, which requires a different skill set. In other words, try easy solutions first, since sometimes you don’t need more than a standard model.

This brings us to the question of whether we should use different frameworks for experiments and for production. For the experiments, we need one that is very flexible, easy to debug, focused on understanding what is going on inside, and easy to adapt. For deployment, however, we need one that allows the model to run in heterogeneous environments with very different resources. It is possible that a single framework can do both, but the requirements for the two cases are very different.

Bottom line, once the model is trained, the major focus is maximal performance with minimal resource usage. Issues like flexibility, adaptability and, to some degree, debugging are not that important any longer. That is why we wonder why there is so little information about using neural nets in production environments and how to do it, because integrating models into applications, and also the deployment itself, is far from trivial.

Updating PyTorch

About a week ago, there was an update of the framework (0.2.0), and since we had encountered some minor problems, we decided to test the new version. For convenience we used pip to perform the update. It should be noted that our environment is Python 2.7 with no GPU support. Since the first link did not work (no support for our environment was reported), we tried the second link, and that seemed to work. Everything seemed fine and we could execute a trained network without any problems. However, when we tried to train our network again, we got an “illegal instruction” and the process aborted. We could have tried conda, but we decided to compile the source from scratch to best match our environment.

To avoid messing up a system-wide installation, we used $ python setup.py install --user. After the couple of minutes it took to compile the code, we got a ‘finished’ message and no error. We tried the test part of the network, which worked, and now, to our satisfaction, the training also worked again. So we considered this step successful, but we have the feeling that the selected BLAS routines are a little slower compared to the old version. However, we need to investigate further before we can confirm this.

Bottom line, despite the coolness of the framework, an update does not seem to be straightforward for all environments with respect to the available pre-built packages. However, since building from source works like a charm on a fairly standard system, we can “always” use this option as a fallback.

(Very) Simple Text Segmentation

Despite the fact that we are dealing with text fragments that do not follow a strict format, there are still a lot of local patterns. Those are often not very reliable, but it’s better than nothing and with the power of machine learning, we have a good chance to capture enough regularities to generalize them to unseen data. To be more concrete, we are dealing with text that acts as a “sub-title” to annotate items. Furthermore, we only focus on items that are episodes of series because they contain some very prominent patterns we wish to learn.

Again, it should be noted that the sub-title might contain any sequence of characters, but especially for some channels, they often follow a pattern to include the name of the episode, the year and the country. For instance, “The Blue Milkshake, USA 2017”, or “The Crack in Space, Science-Fiction, USA 2017”. There are several variations present, but it is still easy to see a general pattern here.

Now the question is if we can teach a network to “segment” this text into a summary and a meta data part. This is very similar to POS (part-of-speech) tagging where a network labels each word with a concrete type. In our case, the problem is much easier since we only have two types of labels (0: summary, 1: meta) and a pseudo-structure that is repeated a lot.

Furthermore, we do not consider words but work on the character level, which hopefully allows us to generalize to unseen patterns that are very similar. In other words, we want to learn as much as possible of these regularities without focusing on concrete words. Take the variation “The Crack in Space, Science-Fiction, CDN, 2017”: a word-level model could not classify “CDN” if it was not present in the training data, but char-level models do not have this limitation.

To test a prototype, we use our favorite framework PyTorch, since it is a piece of cake to deal with recurrent networks there. The basic model is pretty simple: an RNN with GRU units and the NLL loss to predict the label at every time step. The data presented to the network is a list of characters (the sub-title) and a list of binaries (the labels) of the same length.
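
For illustration, a minimal sketch of such a tagger could look like the following; the embedding size, hidden size and names are our own choices and not the exact configuration we use:

import torch
import torch.nn as nn
import torch.nn.functional as F

class CharTagger(nn.Module):
    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32):
        super(CharTagger, self).__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, 2)      # two labels: 0 = summary, 1 = meta

    def forward(self, char_ids):                 # char_ids: (1, seq_len)
        h, _ = self.gru(self.embed(char_ids))    # (1, seq_len, hidden_dim)
        return F.log_softmax(self.out(h), dim=-1)

model = CharTagger(vocab_size=64)
loss_fn = nn.NLLLoss()
char_ids = torch.randint(0, 64, (1, 30))         # a dummy sub-title of 30 character ids
labels = torch.randint(0, 2, (1, 30))            # dummy 0/1 labels of the same length
loss = loss_fn(model(char_ids).view(-1, 2), labels.view(-1))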

The manual labeling of the data is also not very hard, since we can store the full strings of all known patterns. The default label is 0. Then we check whether we can find such a sub-string in the current sub-text and, if so, we set the labels of the relevant parts to 1, leaving the rest untouched.
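
A minimal sketch of this labeling step, with a made-up list of known meta patterns, could look like this:

def label_subtitle(subtitle, known_patterns):
    # default label is 0 (summary); every character covered by a known pattern becomes 1 (meta)
    labels = [0] * len(subtitle)
    for pattern in known_patterns:
        start = subtitle.find(pattern)
        if start >= 0:
            for i in range(start, start + len(pattern)):
                labels[i] = 1
    return labels

known_patterns = ["USA 2017", "Science-Fiction"]   # assumed examples of stored meta strings
print(label_subtitle("The Crack in Space, Science-Fiction, USA 2017", known_patterns))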

To test the model, we feed a new sub-text to the network and check which parts it tags with 1 (meta). The results are impressive with respect to the very simple network architecture we have chosen, plus the fact that the dimension of the hidden space is tiny. Of course the network sometimes fails to tag all adjacent parts of the meta data, like ‘S_c_ience Fiction, USA, 2017’ where ‘c’ is tagged as 0, but such issues can often be fixed with a simple post-processing step.
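
One possible post-processing step, sketched here under the assumption that the meta data forms a single contiguous block, is to simply fill short gaps between tagged characters:

def fill_gaps(labels, max_gap=2):
    # flip short runs of 0s to 1 when they are enclosed by 1s on both sides
    labels = list(labels)
    i = 0
    while i < len(labels):
        if labels[i] == 0 and any(labels[:i]) and any(labels[i:]):
            j = i
            while j < len(labels) and labels[j] == 0:
                j += 1
            if j - i <= max_gap:
                for k in range(i, j):
                    labels[k] = 1
            i = j
        else:
            i += 1
    return labels

print(fill_gaps([1, 0, 1, 1, 1, 0, 0, 0]))   # -> [1, 1, 1, 1, 1, 0, 0, 0]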

No doubt this is almost a toy problem compared to other tagging problems in NLP, but in general it is a huge problem to identify the semantic context of text in a description. For instance, the longer description often contains the list of involved persons, a year of release, a summary and maybe additional information like certificates. Identifying all portions correctly is much more challenging than finding simple patterns in the sub-text, but it falls into the same problem category.

We plan to continue this research track since we need text segmentation all over the place to correctly predict actions and/or categories of data.