One of the first demonstrations of how powerful Deep Learning can be used 1,000 pictures per category and needed quite a lot of steps to build a model that worked. Without a doubt, this was seminal work, but it also demonstrated that DL only vaguely resembles how humans learn. For instance, if a child had to look at 1,000 cups to grasp the concept of a cup, a human lifespan would be too short to survive without strong supervision. Another example is the recent breakthroughs in reinforcement learning, which also come at a cost, like a couple of thousand bucks a day for energy.
In a lot of cases, data, and even labels, might be no problem, but it often takes days or even weeks to turn them into a useful model. This is in stark contrast to the brain, which uses very little energy and is able to generalize from just a few examples, or even one. Again, this is nothing new, but it raises the question of whether we spend too little time on fundamental research and instead try too often to beat state-of-the-art results to get a place in the hall of fame. This viewpoint is probably too simplistic, since there is research that focuses on real-world usage, like WaveNet, but it also shows that you need lots of manpower to do it. Thus, most companies have to rely on global players or public research if they want to build cutting-edge A.I. products.
The introduction of GPU clouds definitely helped, because it allows everyone to train larger models without buying the hardware, but using the cloud is not free either, and it gets more expensive if the training has to be fast, since you then need to buy lots of GPU time. The topic, in a broader context, has also been recently debated in “Will the Future of AI Learning Depend More on Nature or Nurture?”. In the spirit of the debate, the question is: how can we avoid running into a wall about 1,000 times before we realize it’s not a good idea?
With all the lasting hype about Deep Learning, it seems that all problems can be easily solved with just enough data, a couple of GPUs and a (very) deep neural network. Frankly, this method works quite well for a variety of classification problems, especially in the domains of language and images, but mostly because, despite the possibly large number of categories, classification is a very well understood problem.
For instance, word2vec did not use a deep network but a shallow model, and it worked amazingly well. And even if support vector machines went out of fashion, there are some problems that can be solved very efficiently thanks to the freedom to use (almost) arbitrary kernel functions. Another example is text classification, where even a linear model, in combination with a very high-dimensional input space, is able to reach very good accuracy.
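To illustrate the last point, here is a minimal sketch of a linear text classifier on a bag-of-words input, where every distinct token is one input dimension. The perceptron update rule, the helper names and the four toy sentences are our own choices for illustration and not taken from any particular system.

```python
from collections import defaultdict

def featurize(text):
    """Bag-of-words: every distinct token is one dimension of a sparse vector."""
    counts = defaultdict(float)
    for token in text.lower().split():
        counts[token] += 1.0
    return counts

def score(weights, features):
    """Dot product between the weight vector and a sparse feature vector."""
    return sum(weights.get(tok, 0.0) * val for tok, val in features.items())

def train_perceptron(samples, epochs=10):
    """samples: list of (text, label) pairs with label in {+1, -1}."""
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in samples:
            feats = featurize(text)
            # Update only when the sample is misclassified (or undecided).
            if label * score(weights, feats) <= 0:
                for tok, val in feats.items():
                    weights[tok] += label * val
    return weights

def predict(weights, text):
    return 1 if score(weights, featurize(text)) > 0 else -1

# Toy data, made up for illustration only.
train = [
    ("great acting and a clever plot", 1),
    ("clever and funny", 1),
    ("boring plot and bad acting", -1),
    ("bad and boring", -1),
]
weights = train_perceptron(train)
```

The model is nothing more than one weight per token, which is why it scales to very large vocabularies as long as the input stays sparse.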
The journey to build a “complete” model of the human brain is deeply fascinating, but we should not forget that sometimes all we need to solve a problem is a hammer and not a drill. In other words, if the problem is easy, a simple model should be preferred over one that is theoretically able to solve all problems but requires a huge amount of fine-tuning and training time, and even then might not be (much) better than the simple model. This is a bit like Occam’s Razor.
Bottom line, in machine learning we should be more concerned with a deeper understanding of the problem, which allows us to select and/or build a more efficient model, than with solving everything with a single method just because it is a hot topic. Since Deep Learning is used by the crowd, we increasingly get the impression that DL is treated as a holy grail that solves all problems without the need for a solid understanding of the underlying problem. The impression is supported by frequent posts in DL groups where newbies ask for a complete solution to their problem. To be clear, asking for help is a good thing, but as the famous saying goes, if you cannot (re-)build something, you do not understand it.
The problem is, if somebody gives you a ‘perfect’ solution, it works only as long as your problem remains the same. But if your problem evolves over time, you have to modify the solution. And for this, you need to understand both the problem and the solution to adjust the model. With the black-box approach often used in DL, data in and labels out, encoding prior knowledge or whatever else is necessary to solve the evolved problem requires a much deeper understanding of the whole pipeline.
One problem with hand-crafted features is that we usually need some form of iteration until we know that the representation is powerful enough for the task at hand. Stated differently, if the features are too simple, we likely learn patterns that are plausible according to the data, but not useful for the actual task. Here is an example in which we use the movie ‘Doom’ as a reference. We train a simple genre-based classifier and study the last hidden layer to see how well it separates the different classes in the data.
The analysis confirms that most classes are clearly separated, but the model nevertheless lacks semantic expressiveness, because some movies that are close in the feature space have huge semantic differences. In our case, a movie called ‘Mad Monster Party’ is a close neighbor of ‘Doom’, mainly because of the ‘creature film’ sub-genre and the ‘monster’ theme derived from the keywords, but a quick look at its details reveals that the movie is aimed at a younger audience.
In a nutshell, the designed features did a good job of clustering movies into high-level concepts, but they failed to figure out whether a movie was for children or adults. That means, in contrast to learned features, we have to encode all the basic knowledge right into the features before we can train a model to learn the required patterns. That is a bit of a chicken-and-egg problem.
It is not new that crafting features is hard work, but in the case of movies it seems almost impossible, because movies combine three domains (text/plot, audio/speech and video/scene) and we do not have access to any of them. All we have is a summary that is usually very short or biased, descriptive plot words and other partial information like the people involved or the genres. Furthermore, while we might have more accurate information for some movies, we might have (almost) no information for others, which means we cannot embed all movies into the same feature space.
Long story short, with the data at hand we can do lots of things, but the extraction of consistent, powerful and semantic concepts is not one of them. Plus, without proper feature engineering, the explanatory power of models will be limited, and that also means that even silver bullets like Deep Learning won’t help, since DL requires that the raw input data contains all the knowledge. The reason why DL seems to work so well these days is a combination of processing power (GPUs) and the utilization of tons of data. For movies, this data exists, but access to it is limited and the required processing power is still too much for real-life applications. With the recent advances in describing images in a textual way, we might be able to condense a movie into useful semantic features some day.
Even if the term is very difficult to define, there is no doubt that the combination of massive amounts of data (big data, as it is called nowadays) and the rise of highly efficient processing units (GPUs) is one key to the success of Deep Learning. Today, we are able to learn classifiers for a couple of thousand categories with more than ten million input samples. That is quite impressive, and as the results of some competitions show, the models get better and better every year.
The idea of beginning with simple features and composing them into higher-level features is not only biologically more plausible than handcrafted features, it also seems to work very well. Convolutional networks in particular make it easy to visualize how the model learns to combine primitive features stepwise into concepts like a cat, a car or a flower. In the image domain, all we need is to label images. That is expensive, but we rarely face the problem that it is unclear whether the animal in a picture is a dog or a cat. Sure, sometimes it can be challenging, but in the end it is not a matter of personal taste. We talked about this in earlier posts. For the music domain, it is much harder, because genres might overlap and of course the genres themselves might be ambiguous. However, in the case of unsupervised learning this is no problem, because the features are derived directly from the audio and no labels are used.
Now, let’s talk about movies. As we already explained, in this domain we do not have access to any native features. In other words, we have to rely on handcrafted features to learn concepts, which can be a serious problem because we assume that the features contain enough semantic information to perform hierarchical learning. Besides the features themselves, the procedure of feature selection, usually a top-k scheme, hurts the performance a lot, because rare features are likely to be very discriminative.
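A minimal sketch of why a frequency-based top-k scheme can hurt: the movie names, the keywords and the tie-breaking rule below are made up for illustration. Two movies differ only in one rare keyword each, and after the selection they become indistinguishable.

```python
from collections import Counter

# Toy keyword sets per movie, made up for illustration only.
movies = {
    "movie_a": {"monster", "creature", "horror", "space-marine"},
    "movie_b": {"monster", "creature", "horror", "puppet-animation"},
    "movie_c": {"monster", "horror"},
    "movie_d": {"creature", "horror"},
}

def top_k_vocabulary(docs, k):
    """Keep the k most frequent features; ties are broken alphabetically."""
    freq = Counter(f for feats in docs.values() for f in feats)
    ranked = sorted(freq.items(), key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:k]]

def project(feats, vocabulary):
    """Binary vector of a movie, restricted to the selected vocabulary."""
    return [1 if name in feats else 0 for name in vocabulary]

vocab = top_k_vocabulary(movies, k=3)
vec_a = project(movies["movie_a"], vocab)
vec_b = project(movies["movie_b"], vocab)
# vec_a and vec_b are now identical: the rare keywords that told the two
# movies apart did not survive the top-k cut.
```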
In a nutshell, Deep Learning has been very successful in the domains of music, images, speech and NLP. All these domains provide some kind of native features, so they do not rely on handcrafted ones. However, without access to native features, the success of a deep approach cannot be guaranteed, because the features might lack descriptive information and thus higher layers are not able to compose the inferred information into a useful concept.
Here is the short version of a long story: to train a bigger model, you usually start by pre-training each layer separately with stochastic gradient descent and momentum. The aim of this step is not to find the best minimum, but to move the weights to some region of the parameter space that is close to one. This is done for all layers.
The result is not very useful yet, since there was no joint effort to optimize all layers for a specific task, like reconstructing the input data. Stated differently, the idea is to help each layer make an educated guess instead of starting the parameter search from scratch. After all layers are roughly initialized, they are unfolded into a single, deep auto-encoder that is optimized jointly.
While the pre-training is usually done with good old gradient descent, the fine-tuning often uses a more sophisticated approach, like Conjugate Gradient. The idea is to get rid of the manual adjustment of the learning rate and let the method itself decide on the best step size.
Our favorite library, Theano, comes with gradient descent out of the box, but needs some extra effort to work with external optimization routines. Usually, the optimization function is used as a black box. We start with an initial guess of x and then provide the cost function F and its gradient at x.
To integrate it into Theano, we define our cost function as usual, but then we create two functions: first, a function that evaluates the cost function F on our data and returns the cost, and second, a function that returns the actual gradient at x. Note that x stands for all parameters of the model, but since most optimization APIs expect a 1D vector, we need to flatten our model parameters into a single vector that represents the whole model.
We illustrate the procedure with a simple auto-encoder. The parameters of this model are (weights, bias_hidden, bias_visible). Therefore, x is weights||bias_hidden||bias_visible. Finally, one last step is required: updating the Theano model with the output of the optimization function. To do this, we map the individual parameters to a range of x:
# x is assumed to be a contiguous 1D NumPy array; slices of it are views
off = num_hidden * num_visible  # number of entries in the weight matrix
weights = x[0:off].reshape(num_hidden, num_visible)
bias_hidden = x[off:off + num_hidden]
bias_visible = x[off + num_hidden:off + num_hidden + num_visible]
This has the advantage that an update of x automatically updates the parameters of the model. Now, we are ready to call the optimization function. In pseudo code this should be something like:
fmin_cg(f=cost_function, fprime=grad_function, x0=x)
where “f” is the function to minimize, “fprime” returns the gradient at x and “x0” is our initial guess which was selected by pre-training.
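To make the wiring concrete, here is a minimal sketch in plain NumPy rather than Theano, so the gradient is written by hand instead of being derived symbolically. It assumes tied weights, a sigmoid hidden layer, a linear reconstruction and a squared error; none of these choices is stated above, and passing `data` as a second argument is our own convention.

```python
import numpy as np

num_visible, num_hidden = 6, 3

def unpack(x):
    """Map the flat vector x to (weights, bias_hidden, bias_visible) views."""
    off = num_hidden * num_visible
    weights = x[0:off].reshape(num_hidden, num_visible)
    bias_hidden = x[off:off + num_hidden]
    bias_visible = x[off + num_hidden:off + num_hidden + num_visible]
    return weights, bias_hidden, bias_visible

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def cost_function(x, data):
    """Mean squared reconstruction error of a tied-weight auto-encoder."""
    weights, bias_hidden, bias_visible = unpack(x)
    hidden = sigmoid(data @ weights.T + bias_hidden)
    recon = hidden @ weights + bias_visible
    return 0.5 * np.sum((recon - data) ** 2) / data.shape[0]

def grad_function(x, data):
    """Gradient of cost_function, flattened in the same order as x."""
    weights, bias_hidden, bias_visible = unpack(x)
    hidden = sigmoid(data @ weights.T + bias_hidden)
    recon = hidden @ weights + bias_visible
    d_recon = (recon - data) / data.shape[0]
    d_act = (d_recon @ weights.T) * hidden * (1.0 - hidden)
    # Tied weights: decoder and encoder contributions are summed.
    grad_weights = hidden.T @ d_recon + d_act.T @ data
    return np.concatenate(
        [grad_weights.ravel(), d_act.sum(axis=0), d_recon.sum(axis=0)]
    )

rng = np.random.default_rng(0)
data = rng.random((20, num_visible))
x = 0.1 * rng.standard_normal(num_hidden * num_visible + num_hidden + num_visible)

# The parameters are views into x, so they share memory with it.
weights, bias_hidden, bias_visible = unpack(x)
shares_memory = np.shares_memory(weights, x)
```

If the optimizer were SciPy's `fmin_cg` (an assumption, since the call above is only pseudo code), these two functions could be passed as `f` and `fprime` together with `args=(data,)`. Such routines typically return a new vector instead of writing into `x`, so the result would have to be copied back in place (`x[:] = result`) for the views to see it.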
To sum it up, in this post we considered a very simple example with just a single layer. With more layers, the procedure remains the same and only the mapping of the model parameters to x gets a little harder to read. Plus, the cost function is a little more complex and so is the setup of the initial model parameters. In a nutshell, we use Theano as a black box to calculate gradients and some optimization function as a black box to choose the step size and minimize the cost function. The result is one big fine-tuned auto-encoder model for the data.
Especially in computer vision, there is a strong tendency to learn features directly from the data instead of handcrafting them. We couldn’t agree more with this approach, because otherwise you will never know if the features are the best ones for the problem at hand.
Well, computer vision on images is nevertheless hard work, but at least you have the image data to work with. It would be a huge benefit if the same were possible for movies. But the task is not to predict or classify something from a single frame, which is why we need to consider the context of the pictures. And even if this were feasible, it is not very likely that such an approach would succeed in describing movie details like ‘car chase’ or ‘satire’.
As we noted in previous posts, the situation is different for documents. Documents are also more or less self-describing. As with images, it is hard work to understand them, but the features are the text itself. That is what makes both domains similar. In contrast, a movie is always summarized and described by some human, with metadata like genres, keywords or ratings.
So, the question is, what are the best features to describe movies? This is, no doubt, a rhetorical question, because there is no correct answer to it. For a TV magazine it suffices to describe a movie by a short summary, the leading actors, the genre and maybe some kind of star-based rating. With this information, a human can usually classify movies into “worth watching” (+1) or “rather not” (-1). In other words, for a simple classification these features are enough, if you understand the semantics. However, to compare movies, you probably need more details.
What about some fairy dust, or Deep Learning as it is called today? A layered model would surely help to disentangle the factors in the data, but only if the expressive power of the features is sufficient. For instance, the genres and the actors are definitely not enough to explain all themes in a movie. Stated differently, if we could describe movies with adequate features, Deep Learning would help us to find better representations of them.
But unlike a picture of a cat, or a document that describes how to build a time machine, movies are different, and because people interpret them differently, even handcrafting features is a real challenge. It is like a storybook, with text _and_ images.
In the last weeks, we tested a lot of different approaches: different algorithms, different feature transformations, algorithms with supervision and lots of unsupervised learning. We tuned parameters for one model and discarded another. The lessons learned read like this:
– a good unsupervised model requires a sufficient amount of data
– the quality of the model largely depends on the ability of the raw features to describe the data properly
– great care must be taken for the selection of the hyper-parameters
– the fact that the data is very sparse needs to be incorporated into the model building process
– shallow models are not expressive enough to describe higher-order correlations of the data
With the limited data at hand, it is very unlikely that the training of a larger model would succeed, but without it the final model is very limited. That is why we decided to go another way. Similar to the greedy pre-training in other domains, we decided to train smaller models on subsets of the data. For now, we use the existing genre information of movies to train useful ‘concept neurons’. These smaller models are then combined into a larger network. Next, the network is used to transform the input data into a hidden representation (similar to an ordinary neural network with one hidden layer). Similar to stacked RBMs, we use the output of the model as the input to a new model (layer), with the difference that the previous layer does not consist of a single RBM but of many.
This detour is required because the sparsity of the input data does not allow us to capture the nuances of topics at a large scale. Therefore, we train models on specific genres to extract these details and then train a larger model to find relations between these topics.
A schematic of this approach looks like this:
– train an RBM model for each genre
  keep N hidden nodes (arbitrary, largest L2-norm, …)
  save the top-K features with the largest weights from each node
  store the reduced model
– transform each movie into the new feature space
  each RBM model contributes N input dimensions
  inference is done on the reduced features/weights
– train a new RBM model with |genres| * N input dimensions
  store the model
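The reduction step of the schematic can be sketched as follows. This is only an illustration under our own assumptions: hidden nodes are ranked by the L2-norm of their weight vectors, “largest weights” is read as largest absolute value, and the function name, the dictionary layout and the toy matrix are hypothetical.

```python
import numpy as np

def reduce_rbm(weights, bias_hidden, n_nodes, top_k):
    """Shrink an RBM with weights of shape (num_hidden, num_visible).

    Keeps the n_nodes hidden nodes with the largest L2-norm and, for each
    of them, only the top_k input features with the largest |weight|.
    """
    norms = np.linalg.norm(weights, axis=1)
    keep = np.argsort(norms)[::-1][:n_nodes]
    reduced = []
    for node in keep:
        feat_idx = np.argsort(np.abs(weights[node]))[::-1][:top_k]
        reduced.append({
            "feature_idx": feat_idx,             # which inputs the node reads
            "weights": weights[node, feat_idx],  # their weights
            "bias": bias_hidden[node],
        })
    return reduced

# Toy RBM with 3 hidden nodes and 4 input features, made up for illustration.
toy_weights = np.array([
    [0.1, -0.2, 0.05, 0.0],
    [2.0, -1.0, 0.5, 0.1],
    [0.3, 0.0, -3.0, 1.0],
])
toy_bias = np.array([0.0, 0.5, -0.5])
reduced = reduce_rbm(toy_weights, toy_bias, n_nodes=2, top_k=2)
```

Storing only the indices and weights of the surviving features keeps each reduced model small, which matters when there is one model per genre.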
To transform a new movie into the feature space, we first use the reduced models to get an input for the final model. Then we can infer the final feature representation. The whole process is very fast, since it only consists of matrix-vector multiplications and element-wise sigmoid operations.
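A minimal sketch of this inference path, assuming each reduced genre model is stored as a tuple (feature indices, weights, biases) with shapes (N, K), (N, K) and (N,). The storage layout, the function names and the random toy values are our own assumptions for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def genre_activations(movie, feature_idx, weights, bias):
    """Activations of the N kept nodes of one reduced genre model."""
    # movie[feature_idx] picks, for every node, the K inputs it still reads.
    return sigmoid(np.sum(weights * movie[feature_idx], axis=1) + bias)

def transform(movie, genre_models, final_weights, final_bias):
    """Stack the N outputs of every genre model, then apply the final layer."""
    stacked = np.concatenate(
        [genre_activations(movie, *model) for model in genre_models]
    )
    return sigmoid(final_weights @ stacked + final_bias)

# Toy setup with random values, for illustration only.
rng = np.random.default_rng(0)
num_features, num_genres, N, K, num_final = 50, 3, 4, 5, 6
genre_models = []
for _ in range(num_genres):
    idx = np.stack(
        [rng.choice(num_features, size=K, replace=False) for _ in range(N)]
    )
    genre_models.append((idx, rng.standard_normal((N, K)), np.zeros(N)))
final_weights = 0.1 * rng.standard_normal((num_final, num_genres * N))
final_bias = np.zeros(num_final)

movie = (rng.random(num_features) < 0.1).astype(float)  # sparse binary input
features = transform(movie, genre_models, final_weights, final_bias)
```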
A preliminary analysis was done on randomly sampled movies, and indeed the new model seems to capture the latent concepts of movies much better than the shallow ones. But an obvious bottleneck is still the quality of the features at the lowest level: if a complex movie is only described by very few keywords, the best model in the world is not able to infer latent topics from such a coarse representation.
The other challenge is that deeper models have many more parameters to tune and to learn, and thus more time is required to study the dynamics of the new model. However, the numbers speak for themselves, and we are positive that we haven’t reached the full potential yet.