Despite the fact that we are dealing with text fragments that do not follow a strict format, there are still a lot of local patterns. Those are often not very reliable, but it’s better than nothing and with the power of machine learning, we have a good chance to capture enough regularities to generalize them to unseen data. To be more concrete, we are dealing with text that acts as a “sub-title” to annotate items. Furthermore, we only focus on items that are episodes of series because they contain some very prominent patterns we wish to learn.
Again, it should be noted that the sub-title might contain any sequence of characters, but especially for some channels, they often follow a pattern to include the name of the episode, the year and the country. For instance, “The Blue Milkshake, USA 2017”, or “The Crack in Space, Science-Fiction, USA 2017”. There are several variations present, but it is still easy to see a general pattern here.
Now the question is if we can teach a network to “segment” this text into a summary and a meta data part. This is very similar to POS (part-of-speech) tagging where a network labels each word with a concrete type. In our case, the problem is much easier since we only have two types of labels (0: summary, 1: meta) and a pseudo-structure that is repeated a lot.
Furthermore, we do not consider words, but work on the character level, which hopefully allows us to generalize to unseen patterns that are very similar. In other words, we want to learn as many of these regularities as possible without focusing on concrete words. Consider the variation “The Crack in Space, Science-Fiction, CDN, 2017”: a word-level model could not classify “CDN” if it was not present in the training data, but a char-level model does not have this limitation.
To test a prototype, we use our favorite framework, PyTorch, since it is a piece of cake to deal with recurrent networks there. The basic model is pretty simple: we use an RNN with GRU units and the NLL loss function to predict the label at every time step. The data presented to the network is a list of characters (sub-title) and a list of binary labels of the same length.
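To make the setup more tangible, here is a minimal sketch of such a tagger. It is not our exact prototype: the embedding layer, the layer sizes, the optimizer and the `encode` helper (which simply clips character code points) are assumptions made for illustration.

```python
import torch
import torch.nn as nn


class CharTagger(nn.Module):
    """Character-level tagger: one label (0: summary, 1: meta) per character."""

    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_labels)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) -> log-probabilities: (batch, seq_len, num_labels)
        hidden, _ = self.gru(self.embed(char_ids))
        return torch.log_softmax(self.out(hidden), dim=-1)


def encode(text, vocab_size=256):
    # Assumption: use the code point as the character id and map everything
    # beyond the vocabulary to one shared "unknown" id.
    return torch.tensor([[min(ord(c), vocab_size - 1) for c in text]])


torch.manual_seed(0)
text = "The Blue Milkshake, USA 2017"
cut = text.index(",")  # everything from the first comma on is meta data here
labels = torch.tensor([[0] * cut + [1] * (len(text) - cut)])

model = CharTagger(vocab_size=256)
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.NLLLoss()

# One training step: the NLL loss is averaged over all time steps.
log_probs = model(encode(text))
loss = loss_fn(log_probs.reshape(-1, 2), labels.reshape(-1))
optimizer.zero_grad()
loss.backward()
optimizer.step()

predicted = log_probs.argmax(dim=-1)  # predicted label per character
```

In practice the step would of course be repeated over many labeled sub-titles; a single example is only used here to show the shapes involved.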
The manual labeling of the data is also not very hard since we can store the full string of all known patterns. The default label is 0. Then, for each known pattern, we check if we can find it as a sub-string in the current sub-text and if so, we set the labels of the relevant parts to 1, leaving the rest untouched.
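The labeling procedure can be sketched in a few lines of plain Python. The function name `label_subtitle` and the pattern list are hypothetical and only serve as an illustration.

```python
def label_subtitle(subtitle, known_patterns):
    """Return one label per character: 0 (summary) by default, 1 (meta) for matches."""
    labels = [0] * len(subtitle)
    for pattern in known_patterns:
        if not pattern:
            continue  # an empty pattern would match everywhere, so skip it
        start = subtitle.find(pattern)
        while start != -1:
            # Mark every character covered by this occurrence as meta data.
            labels[start:start + len(pattern)] = [1] * len(pattern)
            start = subtitle.find(pattern, start + 1)
    return labels


# Hypothetical list of known meta-data patterns.
patterns = [", USA 2017", ", Science-Fiction"]
subtitle = "The Crack in Space, Science-Fiction, USA 2017"
labels = label_subtitle(subtitle, patterns)
```

For this sub-title, all characters of “The Crack in Space” keep the label 0 and everything from the first comma on is labeled 1.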
To test the model, we feed a new sub-text to the network and check what parts it tags with 1 (meta). The results are impressive given the very simple network architecture we have chosen and the fact that the dimension of the hidden space is tiny. Of course, the network sometimes fails to tag all adjacent parts of the meta data, as in “S_c_ience Fiction, USA, 2017” where ‘c’ is tagged as 0, but such issues can often be fixed with a simple post-processing step.
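One possible post-processing step, sketched under the assumption that a short run of 0s enclosed by 1s on both sides is a tagging error, is to flip such gaps back to 1. The function name `fill_gaps` and the threshold `max_gap` are hypothetical.

```python
def fill_gaps(labels, max_gap=2):
    """Flip runs of 0 with length <= max_gap that are enclosed by 1s on both sides."""
    fixed = list(labels)
    n = len(fixed)
    i = 0
    while i < n:
        if fixed[i] == 0:
            # Find the end of the current run of zeros: [i, j).
            j = i
            while j < n and fixed[j] == 0:
                j += 1
            # Only fill runs that have a 1 on the left (i > 0) and on the right (j < n).
            if i > 0 and j < n and (j - i) <= max_gap:
                fixed[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return fixed


print(fill_gaps([0, 0, 0, 1, 0, 1, 1]))  # -> [0, 0, 0, 1, 1, 1, 1]
```

The leading summary part is left alone because it is not enclosed by meta labels, and longer runs of 0 are kept since they are more likely a real summary segment.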
No doubt that this is almost a toy problem compared to other tagging problems on NLP data, but in general it is a huge problem to identify the semantic context of text in a description. For instance, the longer description often contains the list of involved persons, a year of release, a summary and maybe additional information like certificates. To identify all portions correctly is much more challenging than finding simple patterns for sub-text, but it falls into the same problem category.
We plan to continue this research track since we need text segmentation all over the place to correctly predict actions and/or categories of data.
It is always a good idea to draw inspiration from biological learning. For instance, if we learn something new, it is common to start with simple examples and gradually move to ones that are more difficult to solve. The idea is to learn the basic steps and then to combine those steps into new knowledge which is then combined again and so forth. The same can be done with neural networks, where it is called curriculum learning. At the beginning, the network is fed with examples that are easy to learn. Then, as decided by some schedule, more difficult examples are presented to the network. The idea is, again, that the network learns basic concepts first which can then be combined into more complex ones to classify the challenging examples.
A recently published paper [arxiv:1612.09508] combines this idea with a feedback mechanism to incrementally predict a sequence of classes for an image that ranges from easy to difficult (coarse to fine-grained). The idea is to use a recurrent network in combination with loss functions that are attached to the output of each step in time. The input is the image in combination with the hidden state of the previous time step. In other words, the output of the first time step is fed into the first loss function (the easy class), the output of the next time step is fed into the second loss function (more difficult), and this continues for #T steps, which equals the number of loss functions.
An obvious advantage is that we can model a taxonomy of classes, for example: (animal, vehicle, plane)->([cat, dog, bird], [car, bike, truck], […]) -> ([tabby, …], [shepherd, beagle, …], […]) with this approach. At the top level, the class is very coarse, but with each step, it is further refined, which is the connection to curriculum learning. For instance, to predict if an image contains an animal or a vehicle is much easier than to predict if an image contains a German shepherd or a pick-up truck.
However, in contrast to plain curriculum learning, we do not increase the difficulty per epoch, but per “layer”, because the network needs to predict all classes for an image correctly at every epoch. The idea is related to multi-task learning but with the difference that the loss functions are now connected to a recurrent layer which evolves over time and depends on the output of all previous steps.
So much for the overview of the method, but we are of course not interested in applying the method to images but to movie data. The good news is that we can easily build a simple taxonomy of classes, for instance: top genre->sub genre->theme, and since the idea can be applied to all kinds of data, it is straightforward to feed our movie feature vectors to the network. We start with a very simple network. The input is a 2,000-dim vector with floats within the range [0,1] and the pseudo code looks like this:
x = T.vector() # input vector
y1, y2, y3 = T.iscalar(), T.iscalar(), T.iscalar() # output class labels
h1 = Projection+LayerNormalization+ReLU(x, num_units=64)
r1 = GatedRecurrent+LayerNormalization+ReLU(h1, num_units=64)
r1_step1 = "output of time-step 1"
r1_step2 = "output of time-step 2"
r1_step3 = "output of time-step 3"
y1_hat = Softmax(r1_step1, num_classes=#y1)
y2_hat = Softmax(r1_step2, num_classes=#y2)
y3_hat = Softmax(r1_step3, num_classes=#y3)
loss = nll(y1_hat, y1) + nll(y2_hat, y2) + nll(y3_hat, y3)
For each class, we use a separate softmax to predict the label with the input from the recurrent output at step #t. Thus a training step can be summarized as follows: a movie vector is fed into the network and projected to the feature space spanned by h1. Then we feed h1 three times into the recurrent layer r1 together with the state from the previous step (at #t=0 the state is zero) to produce a prediction for each class. Therefore, there is a two-fold dependency. First, the hidden state is propagated through time, and second, the next state is influenced by the error derivative of the loss from the previous state, which means the representation learned by the recurrent layer must be useful for all labels.
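For readers who prefer something runnable, the pseudo code can be sketched in PyTorch roughly as follows. This is an illustration under assumptions, not our original implementation: the class counts are made up, and applying layer normalization and ReLU to the output of each step, while carrying the raw GRU state forward, is just one possible reading of the r1 layer.

```python
import torch
import torch.nn as nn


class FeedbackClassifier(nn.Module):
    """Feeds the same projected input into a GRU cell once per label level."""

    def __init__(self, input_dim, class_counts, hidden_dim=64):
        super().__init__()
        # h1: projection + layer normalization + ReLU
        self.project = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.LayerNorm(hidden_dim), nn.ReLU()
        )
        # r1: gated recurrent layer, unrolled manually over the label levels
        self.cell = nn.GRUCell(hidden_dim, hidden_dim)
        self.norm = nn.LayerNorm(hidden_dim)
        # One softmax head per level (e.g. top genre, sub genre, theme)
        self.heads = nn.ModuleList([nn.Linear(hidden_dim, n) for n in class_counts])

    def forward(self, x):
        h1 = self.project(x)
        state = torch.zeros_like(h1)  # the state at t=0 is zero
        outputs = []
        for head in self.heads:
            state = self.cell(h1, state)  # same input, updated state
            step_out = torch.relu(self.norm(state))
            outputs.append(torch.log_softmax(head(step_out), dim=-1))
        return outputs


torch.manual_seed(0)
class_counts = [10, 40, 100]  # hypothetical sizes of y1, y2, y3
model = FeedbackClassifier(input_dim=2000, class_counts=class_counts)

x = torch.rand(4, 2000)  # batch of 4 random "movie vectors" in [0, 1)
targets = [torch.randint(0, n, (4,)) for n in class_counts]

nll = nn.NLLLoss()
outputs = model(x)
# The total loss is the sum of the per-step losses, as in the pseudo code.
loss = sum(nll(out, y) for out, y in zip(outputs, targets))
loss.backward()
```

Because all three losses are back-propagated through the same recurrent cell, its parameters receive gradients from every level of the taxonomy.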
To illustrate the method with an example, let’s consider a movie with the top-genre “horror”, the sub-genre “creature-film” and the theme “zombies”. When the network sees the movie, it gets the first hint that it is a horror movie and builds a representation that maximizes the prediction of the first softmax for “horror”. But instead of forgetting the context, it ‘remembers’ that the movie belongs to the horror genre and uses the hint to adjust the representation to predict both classes (horror, creature-film) correctly. This is the second step. Lastly, it uses the given context to adjust the representation again to predict all three classes (horror, creature-film, zombies) correctly. The whole process can be considered as a loop, which is in contrast to typical multi-label learning that uses a single output of a layer to predict all classes correctly.
Bottom line: as we demonstrated, feedback networks are not limited to images, and they are extremely useful if there exists a natural, hierarchical label space for data samples. Furthermore, since the classification of hierarchical labels requires more powerful representations, it is very likely that the learned feature space is more versatile and can also be used as input features for other models.