About a week ago, there was an update of the framework (0.2.0) and since we encountered some minor problems, we decided to test the version. For the convenience we used pip to perform an update. It should be noted that our environment is python 2.7 with no GPU support. Since the first link did not work (no support for our environment was reported), we tried the second link and that seemed to work. Everything seemed fine and we could execute a trained network without any problems. However, when we tried to train our network again, we got an “illegal instruction” and the process aborted itself. We could have tried conda, but we decided to compile the source from scratch to best match our environment.
To avoid to mess up a system-wide installation, we used $ python setup.py install --user. After waiting a couple of minutes that it took to compile the code, we got a ‘finished’ message and no error. We tried the test part of the network which worked and now, to our satisfaction, the training also worked again. So, we considered this step successful but we have the feeling that the selected BLAS routines are a little slower compared to the old version. However, we need further investigation until we can confirm this.
Bottom line, despite the coolness of the framework, an update does not seem to be straightforward for all environments with respect to the available pre-build packages. However, since building from the sources works like a charm on a fairly default system, we can “always” use this option as a fallback.
Despite the fact that we are dealing with text fragments that do not follow a strict format, there are still a lot of local patterns. Those are often not very reliable, but it’s better than nothing and with the power of machine learning, we have a good chance to capture enough regularities to generalize them to unseen data. To be more concrete, we are dealing with text that acts as a “sub-title” to annotate items. Furthermore, we only focus on items that are episodes of series because they contain some very prominent patterns we wish to learn.
Again, it should be noted that the sub-title might contain any sequence of characters, but especially for some channels, they often follow a pattern to include the name of the episode, the year and the country. For instance, “The Blue Milkshake, USA 2017”, or “The Crack in Space, Science-Fiction, USA 2017”. There are several variations present, but it is still easy to see a general pattern here.
Now the question is if we can teach a network to “segment” this text into a summary and a meta data part. This is very similar to POS (part-of-speech) tagging where a network labels each word with a concrete type. In our case, the problem is much easier since we only have two types of labels (0: summary, 1: meta) and a pseudo-structure that is repeated a lot.
Furthermore, we do not consider words, but we work on the character-level which hopefully allows us to generalize to unseen pattern that are very similar. In other words, we want to learn as much as possible of these regularities without focusing on concrete words. Like the variation “The Crack in Space, Science-Fiction, CDN, 2017”. For a word-level model, we could not classify “CDN” if it was not present in the training data, but we do not have this limitation with char-level models.
To test a prototype, we use our favorite framework PyTorch since it is a pice of cake to dealing with recurrent networks there. The basic model is pretty simple. We use a RNN with GRU units and we use the NLL loss function to predict the label at every time step. The data presented to the network is a list of characters (sub-title) and a list of binaries (labels) of the same length.
The manual labeling of the data is also not very hard since we can store the full string of all known patterns. The default label is 0. Then we check if we can find the sub-string in the current sub-text and if so, we set the labels of the relevant parts to 1, leaving the rest untouched.
To test the model, we feed a new sub-text to the network and check what parts it tags with 1 (meta). The results are impressive with respect to the very simple network architecture we have chosen, plus the fact that the dimensions of the hidden space is tiny. Of course the network sometimes fails to tag all adjacent parts of the meta data like ‘S_c_ience Fiction, USA, 2017″ where ‘c’ is tagged as 0, but such issues can be often fixed with a simple post-processing step.
No doubt that this is almost a toy problem compared to other tagging problems on NLP data, but in general it is a huge problem to identify the semantic context of text in a description. For instance, the longer description often contains the list of involved persons, a year of release, a summary and maybe additional information like certificates. To identify all portions correctly is much more challenging than finding simple patterns for sub-text, but it falls into the same problem category.
We plan to continue this research track since we need text segmentation all over the place to correctly predict actions and/or categories of data.
Let’s imagine that we want to train a model that is using an embedding layer for a very large vocabulary. In on-line mode, you only work on very few words per sample which means you get a sparse gradient because most of the words do not need to be touched in any way.
The good news is that it already works with vanilla gradient descent and AdaGrad. However, since the latter eventually decays the learning rate to zero, we might have a problem if we need to visit a lot of samples to achieve a good score. This might be the case for our recent model that is using a triplet loss, because not every update has the same impact and using more recent gradient information would be more efficient. Thus, we decided *not* to use Adagrad.
As a result, we can only use stochastic gradient descent. There is nothing wrong with this optimizer, but it can take lots of time until convergence. We can address parts of the problem with momentum, since it accelerates learning if the gradient follows one direction, but it turned out that enabling momentum turns the sparse update into a dense one and that means we are losing our only computational advantage.
Again, the issue is likely to be also relevant for other frameworks and in general sparse operations always seem a little behind their dense colleagues. Since PyTorch is very young, we don’t mind challenges, as noted before.
We would really like to continue using it for our current experiment, but we also need to ensure that the optimizer is not a bottleneck in terms of training time. The good news is that the results, we got so far, confirm that the model is learning a useful representation. But with the problem of the long tail, we might need more time to perform a sufficiently large number of updates for the rare words.
So, in this particular case we do not have much options, but a good one would suffice and indeed there seems to be a way out it: Asynchronous training with multi processing. We still need to investigate more details, but PyTorch provides a drop-in replacement for “multiprocess” and a quick & dirty example seems to work already.
The idea is to create the model as usual, with the only exception to call “model.share_memory()” to properly share model parameters with fork(). Then, we spawn N new processes that all gets a copy of the model, with tied parameters, but each with its own optimizer and data sampler. In other words, we perform independent N trainings but all processes update the same model parameters. The provided example code from PyTorch is pretty much runnable out of the box:
## borrowed from PyTorch ##
from torch.multiprocessing as mp
model = TheModel()
model.share_memory() # required for async training
# train: function with "model" as the parameter
procs = 
for _ in xrange(4): # no. processes
p = mp.Process(target=train, args=model(,))
for p in procs: # wait until all trainers are done
The training function ‘train’ does not require any special code, or stated differently it, if you call it just once, single-threaded, it works as usual. As noted before, there are likely some issues that need to be taken care of, but a first test seemed to work without any problems.
Bottom line, the parallel training should help to increase the through-put to perform more updates which -hopefully- leads to an earlier convergence of the model. We still need to investigate the quality of the trained model, along with some hyper-parameters, like the number of processes, but we are confident that we find a way to make it work.
In the last weeks, we really learned to appreciate PyTorch. Not because it is flawless, but because it is very intuitive and makes your life so much easier when you need to debug something. And let’s face it, at one point your network is doing something silly and you have no idea why. Then you need to take a look under the hood which can be a bit of a burden with symbolic variables. Furthermore, the dynamic graph approach lets you define some concepts, like recurrence without complicated scan functions which is one more reason why recurrent networks and PyTorch should be best friends. Bottom line, for a beta, the framework feels very stable and except for some numerical instabilities, we never encountered any problems so far.
With arxiv as a hub for research papers, one have access to lots of new ideas. Not all of them are diamonds, but sometimes a paper contains the hint you need to complete a problem or at least to evaluate an idea or to use it as a basis. Then, it is extremely useful if you can do a rapid prototype. Even with Theano it was quite easy possible, but since the framework is a little too low-level, you had to write lots of setup code. With PyTorch and the seamless numpy integration the setup part is almost neglectable.
Furthermore, it is also possible to transfer a learned model from one framework to another, in case one might have concerns to use a particular framework for production, or a particular framework is required. Thus, it is no problem to use the advantages of PyTorch for training and evaluation of a model and then store the model parameters as numpy arrays and build the network with Theano.
It is really hard to imagine to evaluate new, complex networks without all the fancy frameworks, especially automatic differentiation, but also computational graphs in general. For example, the famous AlexNet was written from scratch which was quite an achievement. In other words, there is no excuse today, not to test your ideas quickly, but also systematically with a framework like PyTorch. Yes, you need some Python and math skills, but with all the groundwork done by others, it is much faster than a few years ago.
So, the major problem, which cannot be solved by PyTorch yet, is to get your hands on a sufficiently large set of (labeled) data. However, if you have a dataset you can start with, there are no limits. Try some funky loss functions, or combine them. Can you imagine a world without GANs? Not anymore. But, somebody has to come up with the idea before it can be refined.
Bottom line, it has never been easier to utilize massive amounts of computational power with GPUs. There are many publicly available datasets, but also relatively easy ways to acquire labeled data, so you should be at least able to start your work. In combination with all the available frameworks, you can train very complex networks in a reasonable time which would have been impossible more than ten years ago even with a massive cluster of machines. So, everybody with a clever idea and some luck has the opportunity to make a difference.
We slightly adjusted our loss function to a use a hinge-like loss that requires the use of a maximum function that we used pretty often in Theano: T.maximum(0, 0.3 - score). However, in PyTorch the cmax function, that provided the same functionality, was recently removed to simplify the API.
Of course it’s still possible but the new syntax feels a bit strange: torch.clamp(0.3 - score, min=0). In case of the minimum, it’s the other way around: torch.clamp(0.3 - score, max=0).
So, it’s the clamp function again that we previously used, to bound the values of a function. Seems like clamp is our swiss army knife.
After we decided to switch to PyTorch for new experiments, we stumbled about some minor problems. It’s no big deal and the workarounds are straightforward, but one should be aware of them to avoid frustration. Furthermore, it should be noted that the framework is flagged as “early beta” and this is part of the adventure mentioned on the website :-).
We extended an existing model by adding a skip-gram like loss to relate samples with tags both in a positive or negative way. For this, we are using the classical sigmoid + log loss:
sigmoid = torch.nn.functional.sigmoid
dot_p = torch.dot(anchor, tag_p)
loss_pos = -torch.log(sigmoid(dot_p)) #(1)
dot_n = torch.dot(anchor, tag_n)
loss_neg = -torch.log(1 - sigmoid(dot_n)) #(2)
The critical point is log(0), since log is undefined for this input, “inf” in PyTorch, and there are two ways how this can happen:
(1) sigmoid(x) = 0, which means x is a “large” negative value.
(2) sigmoid(x) = 1, which means x is a “large” positive value.
In both cases, -log(y) evaluates to zero and a hiccup occurs which leads to a numerical instability that makes further optimization steps useless.
One possible workaround is to bound the values of sigmoid to be slightly above zero and slightly below one, with eps ~1e-4:
value = torch.nn.functional.sigmoid(x)
value = torch.clamp(torch.clamp(value, min=eps), max=1-eps)
With this adjustment, sigmoid(dot_p) is always slightly positive and (1 – sigmoid(dot_n)) also never evaluates to zero.
It might be possible that pre-defined loss functions in PyTorch do not suffer this problem, but since we usually design our own loss function from scratch, numerical instabilities can happen if we combine certain functions. With the described kludge, we did not encounter problems any longer during the training of our model.
Again, we are pretty sure that those issues are addressed over time, but since PyTorch is already very powerful, elegant and fast, we do not want to wait until this happened. In other words, we really appreciate the hard work of the PyTorch team and since we made a choice to use a framework in an “early-release beta”, it’s only fair to be patient. Of course we are willing to help the project, for example by reporting bugs, but in this case someone else already did it (issue #1835).
With the ability to actually see the values of tensors at each step of the computation, PyTorch is our red-hot favorite when it comes to ML frameworks. One reason is that it makes debugging so much easier. There are still some rough edges, but there is also a pretty active community that continually improves the framework and fixes existing bugs.
We recently stumbled about a paper [arxiv:1704.08384] that uses a knowledge-based memory in combination with attention and we wanted to try a similar approach to predict types for fragments of texts that often have very few tokens. The pre-processing took the most time, while the actual training and description of the model, thanks to PyTorch, was a piece of cake. Our idea can be implemented by combining some recently introduced methods and it does not require any new layer or module.
In our first approach we ignore the order of the tokens, but we are using a mask [arxiv:1612.03969] to weight individual dimensions of the embedding:
torch.sum(E(in_words) * M(in_words), 0)
where E, M are both matrices with shape=(#tokens, #dims). This allows us to convert an arbitrary sequence of tokens into a fixed length representation. The mask should be initialized to 1 for all entries which can be done with:
M.weight = torch.nn.Parameter(torch.from_numpy(np.ones((#tokens, #dims))).float())
The problem is that even if an example only references a very small subset of all tokens, the gradient update is dense which means the whole embedding matrix is updated. This problem is not limited to PyTorch, for instance, it is also present in Theano. For the latter we already described one way to fix it. In PyTorch this is of course also possible, but the approach is different.
Usually a model contains an module for the embedding
which leads by default to a dense gradient update. To switch to sparse gradient updates, we only have to adjust the initialization to
torch.nn.Embedding(#tokens, #dims, sparse=True)
and that is all.
However, in our PyTorch version the adjustment only worked with basic optimizers like Adagrad or SGD, but it refused to work with RMSprop or Adam. It seems some functionality is missing
torch.sparse.FloatTensor' object has no attribute 'addcmul_'
but we strongly believe that this is fixed pretty soon.
The performance gain in terms of the sparsity is pretty huge: When everything else is equal, the processing of a block took 7000 ms without sparsity, but only 950 ms with sparsity. This is an improvement of 86%.
Without the memory, the rest of the model is straightforward: First we encode the input tokens to get a fixed length vector, then we use a linear layer in combination with a softmax to predict the type.
To address the issue of unbalanced labels, we introduce a penalty that depends on the inverse frequency of the labels: log(#total / #total(y)). For example, the penalty of an almost common label is 1.17, while it is 3.66 for a rather seldom one.
In a first test, we used ~30 K tokens and five classes and we got reasonable results in less than an hour. After we finish to analyze the results, we plan to integrate the knowledge-base into the model, but this is a completely new story.