Compared to a couple of years ago, training recurrent neural nets is now much easier. For instance, we trained a character-based RNN to classify certain patterns in the subtitles of items collected from an electronic program guide (EPG). All we had to do was collect some training data, label it and use PyTorch to train a model. The model was actually pretty shallow: just one embedding layer fed into some GRU cells, followed by a linear layer that acts as a softmax classifier. With PyTorch's packed interface for RNNs, training took less than two minutes and the precision is above 90%. No pain so far; with respect to training complexity, it felt just like an ordinary feed-forward network!
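To give an idea of the scale: the whole model is a handful of layers. A minimal sketch, where the class name and all sizes (n_chars, n_emb, n_rnn_units, n_classes) are illustrative placeholders, not our exact configuration:

import torch.nn as nn

class EPGClassifier(nn.Module):
    def __init__(self, n_chars, n_emb, n_rnn_units, n_classes):
        super().__init__()
        self.embedding = nn.Embedding(n_chars, n_emb, padding_idx=0)
        self.rnn = nn.GRU(n_emb, n_rnn_units, batch_first=True)
        # trained with cross-entropy, so the linear output acts as
        # a softmax classifier over the pattern classes
        self.linear = nn.Linear(n_rnn_units, n_classes)
    # the forward pass is the packed batching code shown below,
    # followed by self.linear on the last GRU state of each item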
Compared to our previous post on batching RNNs, we have optimized our code, partly thanks to code from the community. The new forward code looks like this:
import torch
from torch.nn.utils.rnn import pad_sequence, pack_padded_sequence, pad_packed_sequence

# batch is a list of 1-D LongTensors where each tensor contains a word
# represented as a sequence of chars mapped to integers, e.g.
# batch = [LongTensor([0, 8, 2]), LongTensor([5, 5]), LongTensor([1, 2, 3, 4])]
# first create a tensor with the length of each item in the batch
# and sort it in decreasing order
batch_len = torch.LongTensor([len(x) for x in batch])
batch_len, perm_idx = batch_len.sort(0, descending=True)
# next, pad the sequences with zeros so they all have the same size
# and adjust the batch order according to the length
batch_pad = pad_sequence(batch, batch_first=True, padding_value=0)
batch_pad = batch_pad[perm_idx]
# apply the embedding on the padded batch and pack the data
batch_emb = self.embedding(batch_pad)
batch_pack = pack_padded_sequence(batch_emb, batch_len, batch_first=True)
# create the initial rnn state h0, feed the packed batch and undo the packing
ht = torch.zeros(1, batch_emb.shape[0], n_rnn_units)
out, ht = self.rnn(batch_pack, ht)
out, _ = pad_packed_sequence(out, batch_first=True)
# retrieve the last state from each item by using the unpadded length(!)
idx = torch.arange(0, len(batch_len)).long()
out = out[idx, batch_len - 1, :]
# undo the sorting by length, which recovers the original order of the input
_, unperm_idx = perm_idx.sort(0)
out = out[unperm_idx]
These steps are necessary for large-scale datasets, since otherwise PyTorch processes one item at a time, which does not benefit from any parallelism and is therefore too slow for real-world data.
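To see what "one item at a time" means, this is the naive, loop-based variant that the packed interface replaces (same assumed layers as above):

for x in batch:
    # embed a single word: shape (1, seq_len, n_emb); the GRU cannot
    # batch anything here, so most of the hardware sits idle
    emb = self.embedding(x.unsqueeze(0))
    out, ht = self.rnn(emb)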
Our next problem was a bit more challenging: we wanted to map all words with the same stem close together in the embedding space and push irrelevant words away. We used a margin-based triplet approach that samples an anchor, a positive reference and a negative word. The margin was set to 0.3 and we used the cosine score for learning. Again, we started with a pretty basic model: an embedding and a single layer of GRU tanh units, followed by a linear layer to allow unbounded continuous values. Without a doubt this problem has a different complexity, since it is not an ordinary classification problem and thus relies on a good sampling strategy. So far, this has nothing to do with recurrent nets, so we decided to evaluate the model on a small subset of the dataset with words that can mostly be mapped pretty easily. Stated differently, we wanted to overfit the model to see if it can successfully "learn", or rather remember, the required patterns.
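The loss itself fits in a few lines. A minimal sketch, assuming the model has already mapped each word of a triplet to an embedding vector; the function name and the batched inputs are illustrative:

import torch
import torch.nn.functional as F

def cosine_triplet_loss(anchor, positive, negative, margin=0.3):
    # anchor, positive, negative: (batch, dim) word embeddings
    # push the positive at least `margin` closer to the anchor
    # (in cosine score) than the negative
    pos = F.cosine_similarity(anchor, positive, dim=1)
    neg = F.cosine_similarity(anchor, negative, dim=1)
    return torch.clamp(margin - pos + neg, min=0).mean()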
We also used the chance to check different RNN styles and found that the vanilla style, so-called Elman nets, either learned very slowly or not at all, so we decided to select GRUs because they have fewer parameters than LSTMs and are often not inferior. Furthermore, without any regularisation, results were often disappointing, since words with the same prefix get mapped pretty close in the embedding space, which motivated us to use dropout with p=0.1 after the embedding layer to fight easy memorization of patterns. With this setting, the final model delivered a solid performance, but it required more tuning and evaluation than the classification model. A trick we borrowed from seq2seq models is to read text not from left to right, but from right to left, which boosted the performance a lot.
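Both tweaks are small changes to the code above. A sketch, assuming the batch is still a list of 1-D tensors; each word is flipped before padding so the zero padding stays at the end:

# in __init__: dropout after the embedding to fight memorization
self.dropout = nn.Dropout(p=0.1)
# in forward: read each word from right to left before padding
batch = [torch.flip(x, dims=[0]) for x in batch]
batch_emb = self.dropout(self.embedding(batch_pad))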
However, the first challenge remains: getting the batch interface up and running completely, without having to undo the sorting outside the model class. The drawback, if we see it correctly, is that you can only use the RNN variants provided by PyTorch, or you have to hack at least some code. PyTorch offers Elman, GRU and LSTM cells, which is sufficient, but there is no easy way to use normalization layers inside the RNNs, except via hooks like weight_norm, which can be tricky.
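For completeness, the hook route looks like this. A sketch that applies weight_norm to the recurrent weights of the first GRU layer, where weight_hh_l0 is PyTorch's name for that parameter:

from torch.nn.utils import weight_norm

rnn = nn.GRU(n_emb, n_rnn_units, batch_first=True)
# reparameterize the hidden-to-hidden weights of layer 0 into
# direction and magnitude; this can clash with cuDNN's flattened
# parameters, which is part of why it is tricky
rnn = weight_norm(rnn, name='weight_hh_l0')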
Bottom line: working with PyTorch is still heaps of fun, but for beginners it can be a bit of a burden to get more complex models, especially recurrent ones, up and running in a large-scale setting. Thanks to the active community, you can always kindly ask for advice, but that does not change the fact that some design decisions are trade-offs between performance and usability. This is perfectly okay, but it can be a bit frustrating if you start to implement a model in a naive, loop-based way and then see that the performance will not let you use it on real-world data without training for a very long time.
But who knows, maybe this problem is only relevant for us, since all the cool kids use transformer networks these days and avoid recurrent networks altogether ;-).