In the last couple of weeks, we tried to free our minds by reading papers from different domains, to take a look at what is going on outside the machine learning world. Since our data is very special and limited, we have serious doubts that a classical approach will lead to the holy grail. In other words, we hope to draw inspiration from other domains.
Very recently, we wrote about an approach that fine-tunes each hidden layer jointly with an overall objective function, and the results looked very promising. In general, the idea of giving layer-wise hints to the network to improve the overall learning seems beneficial, and with a tree-based approach, where the level of detail increases with each new layer, we also loosely resemble the idea of curriculum learning, which has proved to be powerful.
Whenever the representational power of the raw features is limited, prior knowledge can be easily incorporated with additional objective functions at one or more hidden layers. For example, in one semi-supervised embedding scheme, a similar approach was used to learn an embedding layer jointly with a supervised cost function. Concretely, if we want to train a movie classifier and want to emphasize that the model is able to separate adult movies from movies for a younger audience, several options are possible: 1) add a new cost function at the top layer, 2) add a new binary feature to the data, or 3) use the output of an arbitrary hidden layer to train a separate classifier.
Choice 2) definitely helps, but it does not allow much control over the influence of the feature, so choices 1) and 3) are more flexible. The idea of using a separate classifier at a lower layer, 3), has the advantage that we get more discriminative features much earlier, which might make the overall task easier. With 1), in contrast, the influence of the additional cost function on the lower layers might not be enough to improve the training process perceptibly. However, in the end, hint-based models also need a careful selection of hyper-parameters and are no silver bullet that solves every problem with just a wink. On the other hand, especially for hand-crafted features, they seem very advantageous, because with a standard model it usually does not help to simply increase the capacity: the additional knowledge cannot be easily incorporated in a straightforward way.
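To make option 3) a little more concrete, here is a minimal sketch, assuming a tiny one-hidden-layer network trained with plain numpy on toy data. The names (`w_hint`, `lam` for the hint weight, the "first input decides the hint label" rule) are illustrative assumptions, not part of any real movie dataset; the point is only that the hint classifier sits on the hidden layer and its gradient flows back into the shared lower weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(p, y):
    # binary cross-entropy with a small epsilon for numerical safety
    eps = 1e-9
    return -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))

# Toy data: the main label depends on both inputs, the hint label only on the
# first (a coarse, easy-to-learn property, like "adult vs. young audience").
X = rng.normal(size=(256, 2))
y_main = (X[:, 0] + X[:, 1] > 0).astype(float)
y_hint = (X[:, 0] > 0).astype(float)

H, lam, lr = 8, 0.5, 0.5            # hidden size, hint weight, learning rate
W1 = rng.normal(scale=0.1, size=(2, H))
w_main = rng.normal(scale=0.1, size=H)
w_hint = rng.normal(scale=0.1, size=H)

def total_loss():
    h = np.tanh(X @ W1)
    return bce(sigmoid(h @ w_main), y_main) + lam * bce(sigmoid(h @ w_hint), y_hint)

loss_before = total_loss()
n = len(X)
for _ in range(300):
    h = np.tanh(X @ W1)                       # shared hidden layer
    d_main = (sigmoid(h @ w_main) - y_main) / n
    d_hint = lam * (sigmoid(h @ w_hint) - y_hint) / n
    # The gradient w.r.t. the hidden activations combines BOTH objectives,
    # so the hint shapes the lower-layer features as well.
    dh = np.outer(d_main, w_main) + np.outer(d_hint, w_hint)
    W1 -= lr * X.T @ (dh * (1 - h ** 2))      # backprop through tanh
    w_main -= lr * h.T @ d_main
    w_hint -= lr * h.T @ d_hint

loss_after = total_loss()
```

The weight `lam` is exactly the kind of control that option 2), a plain extra input feature, does not offer: setting it to zero recovers the standard model, and raising it pushes the hidden layer harder toward separating the hinted property.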
Bottom line: for some kinds of data it might work to use a standard approach and to spend all the time on the selection of parameters and on training, but we doubt that this works for the movie domain in combination with hand-crafted features.