About a week ago, there was an update of the framework (0.2.0) and since we encountered some minor problems, we decided to test the version. For the convenience we used pip to perform an update. It should be noted that our environment is python 2.7 with no GPU support. Since the first link did not work (no support for our environment was reported), we tried the second link and that seemed to work. Everything seemed fine and we could execute a trained network without any problems. However, when we tried to train our network again, we got an “illegal instruction” and the process aborted itself. We could have tried conda, but we decided to compile the source from scratch to best match our environment.
To avoid to mess up a system-wide installation, we used $ python setup.py install --user. After waiting a couple of minutes that it took to compile the code, we got a ‘finished’ message and no error. We tried the test part of the network which worked and now, to our satisfaction, the training also worked again. So, we considered this step successful but we have the feeling that the selected BLAS routines are a little slower compared to the old version. However, we need further investigation until we can confirm this.
Bottom line, despite the coolness of the framework, an update does not seem to be straightforward for all environments with respect to the available pre-build packages. However, since building from the sources works like a charm on a fairly default system, we can “always” use this option as a fallback.
Despite the fact that we are dealing with text fragments that do not follow a strict format, there are still a lot of local patterns. Those are often not very reliable, but it’s better than nothing and with the power of machine learning, we have a good chance to capture enough regularities to generalize them to unseen data. To be more concrete, we are dealing with text that acts as a “sub-title” to annotate items. Furthermore, we only focus on items that are episodes of series because they contain some very prominent patterns we wish to learn.
Again, it should be noted that the sub-title might contain any sequence of characters, but especially for some channels, they often follow a pattern to include the name of the episode, the year and the country. For instance, “The Blue Milkshake, USA 2017”, or “The Crack in Space, Science-Fiction, USA 2017”. There are several variations present, but it is still easy to see a general pattern here.
Now the question is if we can teach a network to “segment” this text into a summary and a meta data part. This is very similar to POS (part-of-speech) tagging where a network labels each word with a concrete type. In our case, the problem is much easier since we only have two types of labels (0: summary, 1: meta) and a pseudo-structure that is repeated a lot.
Furthermore, we do not consider words, but we work on the character-level which hopefully allows us to generalize to unseen pattern that are very similar. In other words, we want to learn as much as possible of these regularities without focusing on concrete words. Like the variation “The Crack in Space, Science-Fiction, CDN, 2017”. For a word-level model, we could not classify “CDN” if it was not present in the training data, but we do not have this limitation with char-level models.
To test a prototype, we use our favorite framework PyTorch since it is a pice of cake to dealing with recurrent networks there. The basic model is pretty simple. We use a RNN with GRU units and we use the NLL loss function to predict the label at every time step. The data presented to the network is a list of characters (sub-title) and a list of binaries (labels) of the same length.
The manual labeling of the data is also not very hard since we can store the full string of all known patterns. The default label is 0. Then we check if we can find the sub-string in the current sub-text and if so, we set the labels of the relevant parts to 1, leaving the rest untouched.
To test the model, we feed a new sub-text to the network and check what parts it tags with 1 (meta). The results are impressive with respect to the very simple network architecture we have chosen, plus the fact that the dimensions of the hidden space is tiny. Of course the network sometimes fails to tag all adjacent parts of the meta data like ‘S_c_ience Fiction, USA, 2017″ where ‘c’ is tagged as 0, but such issues can be often fixed with a simple post-processing step.
No doubt that this is almost a toy problem compared to other tagging problems on NLP data, but in general it is a huge problem to identify the semantic context of text in a description. For instance, the longer description often contains the list of involved persons, a year of release, a summary and maybe additional information like certificates. To identify all portions correctly is much more challenging than finding simple patterns for sub-text, but it falls into the same problem category.
We plan to continue this research track since we need text segmentation all over the place to correctly predict actions and/or categories of data.
With the increasing popularity of data science, whatever this term actually means, a lot of functionality in software is based on machine learning. And yes, we all know machine learning is a lot of fun, if you get your model to learn something useful. In the academic community a major focus is to find something new, or enhance existing approaches, or to just beat the existing state of the art score. However, in industry, once a model is trained to solve an actual problem, it needs to be deployed and maintained somewhere.
Suddenly, we might not have access to huge GPU/CPU clusters any longer which means the final model might need to run a device with very limited computational power, or even on commodity hardware. Not to speak about the versioning of models and the necessity to re-deploy the actual parameters at some time. At this point, we need to change our point of view from research/science to production/engineering.
In case of python, pickling the whole model is pretty easy, but it requires that a compatible version of the code is used for de-serialization. For long-term storage, it is a much better idea to store the actual model parameters in a way that does not depend on the actual implementation. For instance, if you trained an Elman recurrent network, you have three parameters:
(1) the embedding matrix
(2) the “recurrent” matrix
(3) the bias
which are nothing more than just plain (numpy) arrays which can be even stored in JSON as a list (of lists). To utilize the model, it is straightforward -in any language- to implement the forward propagation, or to use an existing implementation which just requires to initialize the parameters. For example, in major languages like Java or C++, initializing arrays from JSON data is no big deal. Of course there are many other ways, but JSON is a very convenient data transport format.
And since the storage of the parameters is not coupled to any code, a model setup is possible in almost any environment with sufficient resources.
Sure, we are aware that restoring a 100-layer network from a JSON file can be burdensome, but nevertheless it is required to transfer model parameters in a unified, non-language dependent, way. So, we discussed some details of the storage and the deploy, but what about using the model in real applications?
We want to consider a broader context and not only images. Like the implementation of a search / retrieval system. In contrast to experiments, it is mandatory, that we get a result from a model within reasonable time. In other words, nobody says that training is easy, but if a good model need too much time to get a decision, it is useless for real-world applications. For instance, if the output of the model is “rather large”, we need to think about ways for an efficient retrieval in case of information retrieval. As an example: if the final output is 4096 dims (float32) and we have 100K ‘documents’, we need ~1.5 GB to store just the results and not a single bit of meta data.
But even for smaller models, like word2vec, we might have 50K words, each represented by 100 dims, which makes it a non-trivial task to match the entered sequence of words to, say, an existing list of movie titles and to rank the results in real-time and multi-threaded.
We know, we are a bit unfair here, but we have the feeling that way too often most of the energy is put into beating some score, which includes that often even the information in papers are not sufficient to reproduce results, and the models that actually provide new insights are sometimes not usable because they require lots of computational power.
Bottom line, on the one hand, without good models we have nothing to deploy, but on the other hand, machine learning is so much more than just training a good model and then dumping it at some place to let some other guys run it. As a machine learning engineer, we have the responsibility to be part of the whole development process and not just the cool part where we can play with new neural network architectures and let other folks do the “boring” part.
Let’s imagine that we want to train a model that is using an embedding layer for a very large vocabulary. In on-line mode, you only work on very few words per sample which means you get a sparse gradient because most of the words do not need to be touched in any way.
The good news is that it already works with vanilla gradient descent and AdaGrad. However, since the latter eventually decays the learning rate to zero, we might have a problem if we need to visit a lot of samples to achieve a good score. This might be the case for our recent model that is using a triplet loss, because not every update has the same impact and using more recent gradient information would be more efficient. Thus, we decided *not* to use Adagrad.
As a result, we can only use stochastic gradient descent. There is nothing wrong with this optimizer, but it can take lots of time until convergence. We can address parts of the problem with momentum, since it accelerates learning if the gradient follows one direction, but it turned out that enabling momentum turns the sparse update into a dense one and that means we are losing our only computational advantage.
Again, the issue is likely to be also relevant for other frameworks and in general sparse operations always seem a little behind their dense colleagues. Since PyTorch is very young, we don’t mind challenges, as noted before.
We would really like to continue using it for our current experiment, but we also need to ensure that the optimizer is not a bottleneck in terms of training time. The good news is that the results, we got so far, confirm that the model is learning a useful representation. But with the problem of the long tail, we might need more time to perform a sufficiently large number of updates for the rare words.
So, in this particular case we do not have much options, but a good one would suffice and indeed there seems to be a way out it: Asynchronous training with multi processing. We still need to investigate more details, but PyTorch provides a drop-in replacement for “multiprocess” and a quick & dirty example seems to work already.
The idea is to create the model as usual, with the only exception to call “model.share_memory()” to properly share model parameters with fork(). Then, we spawn N new processes that all gets a copy of the model, with tied parameters, but each with its own optimizer and data sampler. In other words, we perform independent N trainings but all processes update the same model parameters. The provided example code from PyTorch is pretty much runnable out of the box:
## borrowed from PyTorch ##
from torch.multiprocessing as mp
model = TheModel()
model.share_memory() # required for async training
# train: function with "model" as the parameter
procs = 
for _ in xrange(4): # no. processes
p = mp.Process(target=train, args=model(,))
for p in procs: # wait until all trainers are done
The training function ‘train’ does not require any special code, or stated differently it, if you call it just once, single-threaded, it works as usual. As noted before, there are likely some issues that need to be taken care of, but a first test seemed to work without any problems.
Bottom line, the parallel training should help to increase the through-put to perform more updates which -hopefully- leads to an earlier convergence of the model. We still need to investigate the quality of the trained model, along with some hyper-parameters, like the number of processes, but we are confident that we find a way to make it work.
In the last weeks, we really learned to appreciate PyTorch. Not because it is flawless, but because it is very intuitive and makes your life so much easier when you need to debug something. And let’s face it, at one point your network is doing something silly and you have no idea why. Then you need to take a look under the hood which can be a bit of a burden with symbolic variables. Furthermore, the dynamic graph approach lets you define some concepts, like recurrence without complicated scan functions which is one more reason why recurrent networks and PyTorch should be best friends. Bottom line, for a beta, the framework feels very stable and except for some numerical instabilities, we never encountered any problems so far.
With arxiv as a hub for research papers, one have access to lots of new ideas. Not all of them are diamonds, but sometimes a paper contains the hint you need to complete a problem or at least to evaluate an idea or to use it as a basis. Then, it is extremely useful if you can do a rapid prototype. Even with Theano it was quite easy possible, but since the framework is a little too low-level, you had to write lots of setup code. With PyTorch and the seamless numpy integration the setup part is almost neglectable.
Furthermore, it is also possible to transfer a learned model from one framework to another, in case one might have concerns to use a particular framework for production, or a particular framework is required. Thus, it is no problem to use the advantages of PyTorch for training and evaluation of a model and then store the model parameters as numpy arrays and build the network with Theano.
It is really hard to imagine to evaluate new, complex networks without all the fancy frameworks, especially automatic differentiation, but also computational graphs in general. For example, the famous AlexNet was written from scratch which was quite an achievement. In other words, there is no excuse today, not to test your ideas quickly, but also systematically with a framework like PyTorch. Yes, you need some Python and math skills, but with all the groundwork done by others, it is much faster than a few years ago.
So, the major problem, which cannot be solved by PyTorch yet, is to get your hands on a sufficiently large set of (labeled) data. However, if you have a dataset you can start with, there are no limits. Try some funky loss functions, or combine them. Can you imagine a world without GANs? Not anymore. But, somebody has to come up with the idea before it can be refined.
Bottom line, it has never been easier to utilize massive amounts of computational power with GPUs. There are many publicly available datasets, but also relatively easy ways to acquire labeled data, so you should be at least able to start your work. In combination with all the available frameworks, you can train very complex networks in a reasonable time which would have been impossible more than ten years ago even with a massive cluster of machines. So, everybody with a clever idea and some luck has the opportunity to make a difference.
In contrast to some academic datasets, real-life datasets are often unbalanced, with a lot noise and more variance than some models can cope with. There are ways to adapt models for those problems, but sometimes a classification is too restrictive. What do we mean by that? With a lot of variance for a single label, it might be possible to squeeze all samples into some corner of a high-dimensional space, but it is probably easier to learn the relative similar of samples.
With a triplet loss, we can move two samples with the same label closer together while we push them further away for a sample of a different label:
maximum(0, margin + f(anchor,neg) - f(anchor,pos)). The approach has the advantage that we can distribute samples across the whole feature space by forming “local clusters” instead of concentrating it in a dense region of the space. The quality of the model depends on the sampling of the triplets, since for a lot of triplets the margin is already preserved and no learning is done. But since the sampling can be done asynchronously, it is usually no problem to find enough violating triplets to continue the training.
If we take for instance items from an electronic program guide (epg) there might be a huge variance for the category “sports”, but only few words to describe the content. Thus, it might be sufficient that a nearest neighbor search -in the feature space- returns any item with the same label to perform a successful “classification”. This has the benefit that we can capture lots of local patterns with a shallower network instead of one global pattern in case of a softmax classification that requires a network that is likely complex. The price of the model is that we cannot simply classify new items with a single forward pass, but the model is probably easier to train because of the relative similarity. Plus, if the similarity of the anchor and the positive sample is higher than the similarity of the anchor and the negative one, with respect to the margin, there is no loss at all and no gradient step is required. In case of a nll loss, for example, there is always a gradient step, even if it is very small.
Furthermore, nearest neighbors searches[arxiv:1703.03129] can be done with matrix multiplications that are very efficient these days thanks to modern CPU and GPU architectures. It is also not required to scan the whole dataset, because it is likely that we can perform a bottom up clustering of the feature space to handle near duplicates.
The next question is how to combine a triplet loss with attention to aggregate data from related items, for example items with similar tags, to group embeddings of similar words. We are using the ideas from the post to encode items with a variable number of words into a fixed length representation.
We slightly adjusted our loss function to a use a hinge-like loss that requires the use of a maximum function that we used pretty often in Theano: T.maximum(0, 0.3 - score). However, in PyTorch the cmax function, that provided the same functionality, was recently removed to simplify the API.
Of course it’s still possible but the new syntax feels a bit strange: torch.clamp(0.3 - score, min=0). In case of the minimum, it’s the other way around: torch.clamp(0.3 - score, max=0).
So, it’s the clamp function again that we previously used, to bound the values of a function. Seems like clamp is our swiss army knife.