Biological learning is often a useful source of inspiration. When we learn something new, we usually start with simple examples and gradually move on to more difficult ones. The idea is to learn the basic steps first and then combine them into new knowledge, which is combined again, and so forth. The same can be done with neural networks, where it is called curriculum learning. At the beginning, the network is fed examples that are easy to learn. Then, according to some schedule, more difficult examples are presented to the network. The idea is, again, that the network learns basic concepts first, which can then be combined into more complex ones to classify the challenging examples.
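To make the notion of a schedule concrete, here is a minimal sketch of one possible curriculum. It assumes that every example comes with a difficulty score and that the share of available examples grows linearly per epoch. The function name curriculum_subset, the start_fraction parameter and the linear growth are our own illustrative choices, not something prescribed by curriculum learning itself.

```python
def curriculum_subset(difficulty, epoch, num_epochs, start_fraction=0.2):
    """Return the indices of the examples available at `epoch`, easiest first.

    Assumption: the available share grows linearly from `start_fraction`
    in the first epoch to 1.0 in the last epoch.
    """
    # Sort example indices by their difficulty score, easiest first.
    order = sorted(range(len(difficulty)), key=lambda i: difficulty[i])
    progress = epoch / max(num_epochs - 1, 1)
    fraction = min(1.0, start_fraction + (1.0 - start_fraction) * progress)
    count = max(1, round(fraction * len(order)))
    return order[:count]


# Hypothetical difficulty scores for five training examples.
scores = [0.9, 0.1, 0.5, 0.3, 0.7]
for epoch in range(5):
    print(epoch, curriculum_subset(scores, epoch, num_epochs=5))
```

In the first epoch only the easiest example is used, and by the last epoch the network sees the full training set.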
A recently published paper [arxiv:1612.09508] combines this idea with a feedback mechanism to incrementally predict a sequence of classes for an image, ranging from easy to difficult (coarse to fine-grained). The idea is to use a recurrent network with a loss function attached to the output of each time step. The input at every step is the image together with the hidden state of the previous time step. In other words, the output of the first step is fed into the first loss function (the easy class), the output of the next step is fed into the second loss function (more difficult), and so on for #T steps, where #T equals the number of loss functions.
An obvious advantage is that this approach lets us model a taxonomy of classes, for example: (animal, vehicle, plane)->([cat, dog, bird], [car, bike, truck], […]) -> ([tabby, …], [shepherd, beagle, …], […]). At the top level the class is very coarse, but with each step it is refined further, which is the connection to curriculum learning. For instance, predicting whether an image contains an animal or a vehicle is much easier than predicting whether it contains a German shepherd or a pick-up truck.
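Such a taxonomy can be written down as a nested dictionary, from which the coarse-to-fine label sequence of a class is easy to read off. The following is only a sketch: it contains just the entries named in the example, the elided parts are left out, and the names TAXONOMY and label_path are hypothetical.

```python
# Only the classes named in the example; elided branches are left empty.
TAXONOMY = {
    "animal": {
        "cat": {"tabby": {}},
        "dog": {"shepherd": {}, "beagle": {}},
        "bird": {},
    },
    "vehicle": {"car": {}, "bike": {}, "truck": {}},
    "plane": {},  # children are elided in the text
}


def label_path(tree, target, prefix=()):
    """Depth-first search for `target`; returns the labels from root to target, or None."""
    for name, children in tree.items():
        path = prefix + (name,)
        if name == target:
            return path
        found = label_path(children, target, path)
        if found is not None:
            return found
    return None


print(label_path(TAXONOMY, "beagle"))
```

Each element of the returned path would serve as the target for one time step, from the coarsest class to the finest.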
However, in contrast to plain curriculum learning, we do not increase the difficulty per epoch, but per “layer”, because the network needs to predict all classes for an image correctly at every epoch. The idea is related to multi-task learning but with the difference that the loss functions are now connected to a recurrent layer which evolves over time and depends on the output of all previous steps.
That concludes the overview of the method. We are, of course, not interested in applying it to images but to movie data. The good news is that we can easily build a simple taxonomy of classes, for instance: top genre->sub genre->theme, and since the idea can be applied to all kinds of data, it is straightforward to feed our movie feature vectors to the network. We start with a very simple network. The input is a 2,000-dimensional vector of floats in the range [0,1], and the pseudo code looks like this:
x = T.vector() # input vector
y1, y2, y3 = T.iscalar(), T.iscalar(), T.iscalar() # output class labels
h1 = Projection+LayerNormalization+ReLU(x, num_units=64)
r1 = GatedRecurrent+LayerNormalization+ReLU(h1, num_units=64)
r1_step1 = "output of time-step 1"
r1_step2 = "output of time-step 2"
r1_step3 = "output of time-step 3"
y1_hat = Softmax(r1_step1, num_classes=#y1)
y2_hat = Softmax(r1_step2, num_classes=#y2)
y3_hat = Softmax(r1_step3, num_classes=#y3)
loss = nll(y1_hat, y1) + nll(y2_hat, y2) + nll(y3_hat, y3)
For each class, we use a separate softmax to predict the label from the recurrent output at step #t. Thus a training step can be summarized as follows: a movie vector is fed into the network and projected to the feature space spanned by h1. Then we feed h1 three times into the recurrent layer r1, each time together with the state from the previous step (at #t=0 the state is zero), to produce a prediction for each class. Therefore, there is a two-fold dependency. First, the hidden state is propagated through time, and second, the next state is influenced by the error derivative of the loss from the previous state, which means the representation learned by the recurrent layer must be useful for all labels.
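The forward pass described above can be sketched in plain NumPy. This is a rough illustration under several assumptions, not our actual implementation: the weights are random and untrained, biases are omitted, the recurrent layer is a standard GRU without the layer normalization and ReLU from the pseudo code (where exactly they are applied is not specified there), and the class counts in NUM_CLASSES as well as the labels are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
IN_DIM, HID = 2000, 64
NUM_CLASSES = (10, 30, 50)  # hypothetical sizes: top genre, sub genre, theme


def init(*shape):
    return rng.normal(0.0, 0.05, size=shape)


def layer_norm(v, eps=1e-5):
    return (v - v.mean()) / np.sqrt(v.var() + eps)


def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))


def softmax(v):
    e = np.exp(v - v.max())  # shift for numerical stability
    return e / e.sum()


params = {
    "W_in": init(IN_DIM, HID),
    # GRU weights: update gate (z), reset gate (r), candidate state (c)
    "W_z": init(HID, HID), "U_z": init(HID, HID),
    "W_r": init(HID, HID), "U_r": init(HID, HID),
    "W_c": init(HID, HID), "U_c": init(HID, HID),
    # One separate softmax head per level of the taxonomy
    "heads": [init(HID, n) for n in NUM_CLASSES],
}


def gru_step(x, s, p):
    z = sigmoid(x @ p["W_z"] + s @ p["U_z"])
    r = sigmoid(x @ p["W_r"] + s @ p["U_r"])
    c = np.tanh(x @ p["W_c"] + (r * s) @ p["U_c"])
    return (1.0 - z) * s + z * c


def forward(x, labels, p):
    # Projection + layer normalization + ReLU
    h1 = np.maximum(layer_norm(x @ p["W_in"]), 0.0)
    state = np.zeros(HID)  # the state is zero before the first step
    loss, probs = 0.0, []
    for head, y in zip(p["heads"], labels):
        state = gru_step(h1, state, p)  # same input h1, evolving state
        y_hat = softmax(state @ head)
        probs.append(y_hat)
        loss += -np.log(y_hat[y])  # negative log-likelihood of this step
    return loss, probs


x = rng.random(IN_DIM)  # stand-in for a movie feature vector in [0, 1)
loss, probs = forward(x, labels=(3, 7, 42), p=params)
print(loss, [p_t.shape for p_t in probs])
```

Note that the same h1 is fed in at every step, so only the recurrent state distinguishes the three predictions. Training would additionally require gradients of the summed loss, which a framework with automatic differentiation provides.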
To illustrate the method with an example, let’s consider a movie with the top-genre “horror”, the sub-genre “creature-film” and the theme “zombies”. When the network sees the movie, it gets the first hint that it is a horror movie and builds a representation that maximizes the prediction of the first softmax for “horror”. But instead of forgetting the context, it ‘remembers’ that the movie belongs to the horror genre and uses the hint to adjust the representation to predict both classes (horror, creature-film) correctly. This is the second step. Lastly, it uses the given context to adjust the representation again to predict all three classes (horror, creature-film, zombies) correctly. The whole process can be considered a loop, in contrast to typical multi-label learning, which uses a single output of a layer to predict all classes.
The bottom line: as we have illustrated, feedback networks are not limited to images, and they are particularly useful when a natural, hierarchical label space exists for the data samples. Furthermore, since classifying hierarchical labels requires more powerful representations, it is likely that the learned feature space is more versatile and can also be used as input features for other models.