Although we are dealing with text fragments that do not follow a strict format, there are still a lot of local patterns. They are often not very reliable, but that is better than nothing, and with the power of machine learning we have a good chance to capture enough regularities to generalize to unseen data. To be more concrete, we are dealing with text that acts as a “sub-title” to annotate items. Furthermore, we only focus on items that are episodes of series because they contain some very prominent patterns we wish to learn.
Again, it should be noted that the sub-title might contain any sequence of characters, but especially for some channels, they often follow a pattern to include the name of the episode, the year and the country. For instance, “The Blue Milkshake, USA 2017”, or “The Crack in Space, Science-Fiction, USA 2017”. There are several variations present, but it is still easy to see a general pattern here.
Now the question is if we can teach a network to “segment” this text into a summary and a meta data part. This is very similar to POS (part-of-speech) tagging where a network labels each word with a concrete type. In our case, the problem is much easier since we only have two types of labels (0: summary, 1: meta) and a pseudo-structure that is repeated a lot.
Furthermore, we do not consider words; we work on the character level, which hopefully allows us to generalize to unseen patterns that are very similar. In other words, we want to learn as much as possible of these regularities without focusing on concrete words. Take the variation “The Crack in Space, Science-Fiction, CDN, 2017”: a word-level model could not classify “CDN” if it was not present in the training data, but we do not have this limitation with char-level models.
To test a prototype, we use our favorite framework, PyTorch, since it is a piece of cake to deal with recurrent networks there. The basic model is pretty simple: we use an RNN with GRU units and train it with the NLL loss function to predict the label at every time step. The data presented to the network is a list of characters (sub-title) and a list of binary labels of the same length.
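To make the setup more tangible, here is a minimal sketch of such a tagger. It is not our exact prototype: the class name `CharTagger`, the embedding layer, the layer sizes, the Adam optimizer and the single toy example are all assumptions for illustration. Only the GRU units, the NLL loss and the one-label-per-character setup come from the description above.

```python
import torch
import torch.nn as nn


class CharTagger(nn.Module):
    """Character-level tagger: one label (0: summary, 1: meta) per character."""

    def __init__(self, vocab_size, embed_dim=16, hidden_dim=32, num_labels=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, num_labels)

    def forward(self, char_ids):
        # char_ids: (batch, seq_len) -> log-probabilities: (batch, seq_len, num_labels)
        hidden, _ = self.gru(self.embed(char_ids))
        return torch.log_softmax(self.out(hidden), dim=-1)


# Toy example (assumption): the trailing ", USA 2017" is the meta part.
text = "The Blue Milkshake, USA 2017"
meta = ", USA 2017"
labels = [0] * (len(text) - len(meta)) + [1] * len(meta)

# Character vocabulary built from the example itself; real data needs a full one.
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
x = torch.tensor([[vocab[ch] for ch in text]])  # shape (1, seq_len)
y = torch.tensor([labels])                      # shape (1, seq_len)

torch.manual_seed(0)
model = CharTagger(len(vocab))
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
loss_fn = nn.NLLLoss()

for _ in range(5):
    optimizer.zero_grad()
    log_probs = model(x)
    # NLLLoss expects (N, C) inputs and (N,) targets, so flatten the time axis.
    loss = loss_fn(log_probs.reshape(-1, 2), y.reshape(-1))
    loss.backward()
    optimizer.step()

# One predicted label per character of the sub-title.
predicted = model(x).argmax(dim=-1).squeeze(0).tolist()
```

The sketch uses a unidirectional GRU, so each label only depends on the characters seen so far. A bidirectional variant would also see the right-hand context, but that is a design choice we do not make a claim about here.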
The manual labeling of the data is also not very hard since we can store the full string of all known patterns. The default label is 0. Then we check if we can find the sub-string in the current sub-text and if so, we set the labels of the relevant parts to 1, leaving the rest untouched.
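The labeling procedure can be sketched in a few lines. The function name `label_subtitle` and the example pattern are hypothetical, and whether the leading comma belongs to the meta part is an assumption of this sketch.

```python
def label_subtitle(subtitle, known_patterns):
    """Return one label per character: 1 inside a known meta pattern, else 0."""
    labels = [0] * len(subtitle)  # default label is 0 (summary)
    for pattern in known_patterns:
        if not pattern:
            continue  # an empty pattern would match everywhere
        start = subtitle.find(pattern)
        while start != -1:
            end = start + len(pattern)
            labels[start:end] = [1] * len(pattern)  # mark the match as meta
            start = subtitle.find(pattern, end)     # look for further matches
    return labels


# Hypothetical list of known meta strings.
patterns = [", USA 2017"]
example_labels = label_subtitle("The Blue Milkshake, USA 2017", patterns)
```

Characters outside any match keep the default label, which mirrors the "leaving the rest untouched" rule.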
To test the model, we feed a new sub-text to the network and check what parts it tags with 1 (meta). The results are impressive given the very simple network architecture we have chosen and the fact that the dimension of the hidden space is tiny. Of course, the network sometimes fails to tag all adjacent parts of the meta data, like “S_c_ience Fiction, USA, 2017” where “c” is tagged as 0, but such issues can often be fixed with a simple post-processing step.
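One possible post-processing step, sketched under our own assumptions, is to flip short runs of 0 that are enclosed by 1 on both sides. The function name `fill_gaps` and the threshold `max_gap` are hypothetical and would need tuning on real data.

```python
def fill_gaps(labels, max_gap=2):
    """Relabel runs of 0 as 1 when they are at most max_gap long
    and have a 1 directly before and after them."""
    fixed = list(labels)
    n = len(fixed)
    i = 0
    while i < n:
        if fixed[i] == 0:
            j = i
            while j < n and fixed[j] == 0:
                j += 1
            # The run of zeros covers [i, j). It is enclosed by ones only if
            # it neither starts at the beginning nor reaches the end.
            if i > 0 and j < n and (j - i) <= max_gap:
                fixed[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return fixed


# A single mis-tagged character inside a meta span gets repaired,
# while the long run of zeros and the borders stay untouched.
repaired = fill_gaps([0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0])
```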
No doubt that this is almost a toy problem compared to other tagging problems on NLP data, but in general it is a huge problem to identify the semantic context of text in a description. For instance, the longer description often contains the list of involved persons, a year of release, a summary and maybe additional information like certificates. To identify all portions correctly is much more challenging than finding simple patterns for sub-text, but it falls into the same problem category.
We plan to continue this research track since we need text segmentation all over the place to correctly predict actions and/or categories of data.
In the last weeks we spent lots of time with the analysis of the meta data. We are working on a model that combines several domains but there is nothing definite yet.
In the meantime, we stumbled upon a new, possible source of data. Of course, we knew it was there all the time, but because we were a little afraid that the publicly available data is not sufficient to train a good model, we did not pursue the idea any further. However, we have now decided that it is a good idea to at least check whether the data is able to improve our model notably, and if so, we can worry about data acquisition later.
The experiment we conducted was pretty simple: We used a publicly available data set that contains tags for movies. To simplify our setup, we only considered tags that were used at least 50 times, and the tags were pre-processed to unify them as well as possible. This resulted in 335 tags for 4239 movies. The data was then used to create a binary vector for each movie with a “1” at position i if tag i was assigned to the movie and “0” elsewhere. Next, we used an RBM with binary units in the hidden layer to train a model. As usual, we used weight decay and momentum to regularize the model and to speed up the training.
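The following sketch illustrates the pipeline on toy data. Everything specific in it is an assumption: the helper names, counting a tag once per movie, and training with one step of contrastive divergence (CD-1) on the full batch. The text above only fixes the frequency threshold, the binary encoding, the binary hidden units, and the use of weight decay and momentum.

```python
from collections import Counter

import numpy as np


def build_tag_matrix(movie_tags, min_count=50):
    """Keep tags used by at least min_count movies and encode each movie
    as a binary vector over the remaining tags."""
    counts = Counter(tag for tags in movie_tags.values() for tag in set(tags))
    vocab = sorted(tag for tag, count in counts.items() if count >= min_count)
    index = {tag: i for i, tag in enumerate(vocab)}
    movies = sorted(movie_tags)
    X = np.zeros((len(movies), len(vocab)))
    for row, movie in enumerate(movies):
        for tag in movie_tags[movie]:
            if tag in index:
                X[row, index[tag]] = 1.0
    return movies, vocab, X


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def train_rbm(X, n_hidden=8, epochs=20, lr=0.05, momentum=0.5,
              weight_decay=1e-4, seed=0):
    """Full-batch CD-1 training of a binary-binary RBM (assumed procedure)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(0.0, 0.01, size=(X.shape[1], n_hidden))
    b_v = np.zeros(X.shape[1])
    b_h = np.zeros(n_hidden)
    velocity = np.zeros_like(W)
    for _ in range(epochs):
        h_prob = sigmoid(X @ W + b_h)                    # positive phase
        h_sample = (rng.random(h_prob.shape) < h_prob).astype(float)
        v_recon = sigmoid(h_sample @ W.T + b_v)          # reconstruction
        h_recon = sigmoid(v_recon @ W + b_h)             # negative phase
        grad = (X.T @ h_prob - v_recon.T @ h_recon) / len(X)
        # Momentum smooths the update, weight decay shrinks large weights.
        velocity = momentum * velocity + lr * (grad - weight_decay * W)
        W += velocity
        b_v += lr * (X - v_recon).mean(axis=0)
        b_h += lr * (h_prob - h_recon).mean(axis=0)
    return W, b_v, b_h


# Toy data (hypothetical); with real data min_count would be 50.
toy = {"a": ["007", "bond"], "b": ["007", "bond", "spy"], "c": ["spy", "comedy"]}
movies, vocab, X = build_tag_matrix(toy, min_count=2)
W, b_v, b_h = train_rbm(X, n_hidden=4)
features = sigmoid(X @ W + b_h)  # hidden activations = new feature space
```

The hidden activation probabilities in `features` serve as the new representation of each movie in this sketch.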
To get a better understanding of the model, we randomly selected a “popular” movie and calculated its distance to all other movies in the new feature space; in addition, we determined the Jaccard coefficient (JC) of the tag sets for each pair. The intuition is that the value of the JC decreases as the distance increases. The assumption was checked by calculating the correlation coefficient between distance and JC for some randomly drawn movies.
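A small sketch of this check follows. The Euclidean distance and the Pearson correlation coefficient are assumptions, since the text does not fix either choice, and the tag sets and feature vectors are made up for illustration.

```python
import math


def jaccard(tags_a, tags_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two tag sets."""
    a, b = set(tags_a), set(tags_b)
    union = a | b
    # Convention for this sketch: two empty tag sets get a JC of 0.
    return len(a & b) / len(union) if union else 0.0


def euclidean(u, v):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if std_x == 0 or std_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (std_x * std_y)


# Made-up reference movie and candidates: (tags, feature vector).
reference = ({"007", "bond", "spy"}, [0.9, 0.1])
candidates = [
    ({"007", "bond"}, [0.8, 0.2]),
    ({"spy", "comedy"}, [0.5, 0.6]),
    ({"romance"}, [0.1, 0.9]),
]

distances = [euclidean(reference[1], vec) for _, vec in candidates]
jcs = [jaccard(reference[0], tags) for tags, _ in candidates]
correlation = pearson(distances, jcs)  # expected to be negative
```

A clearly negative correlation supports the intuition; a value close to zero hints that the feature space captures something other than plain tag overlap.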
However, since the model also captures non-linear relations in the data, the interpretation of the results is not always straightforward. Stated differently, some tags are more important than others, and so are some pairs of tags. That means it is still possible that a pair of movies has a notable distance but nevertheless a fairly high JC value. Such titles then appear at a higher position in a ranking because of the valuable tags or tag combinations they contain.
Despite possible obstacles, it is still easy to get a better intuition of the model by considering some classic movies. We used 007 movies for this purpose. The reason for our choice is simple: there are lots of them and we expect tags that are very similar. And the model did not disappoint with the results: For the top-20 results, each of the listed movies shared at least the tags “007” and “bond”. Actually the lowest JC value was 0.36, followed by 0.55.
So, in a nutshell we can say that tags are extremely valuable to perform a semantic clustering or to enhance existing meta data of movies. This fact is well known, regardless of the domain, and already used in the literature to improve the accuracy of classification and other tasks. Now, we are in the dilemma that we need tags for more recent movies and an active community that continually interacts with the data.