It is the dream of every machine learning practitioner to find a way to combine the available data with some unsupervised learning algorithm to train a useful representation of the data. Yes, we are drastically simplifying things here, but the point is to learn without the need to label the data, which is very expensive.
For example, there are tons of documents available that could be used for learning, but the problem is: what cost function do we want to optimize? In the case of word2vec and friends, we try to predict surrounding or center words without explicit labels. This works very well, but the result is an embedding of words, and apart from simple aggregation methods, there is no general way to represent documents with a learned embedding in a meaningful way. However, it is still a simple but powerful approach that can easily utilize huge amounts of unlabeled text data to learn a useful representation.
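To make the "no explicit labels" point concrete, here is a minimal sketch of how skip-gram-style training pairs can be generated: the text itself provides both input and target, no annotation needed. The function name and the window size are our illustrative choices, not part of word2vec itself.

```python
# Sketch: generate (center, context) pairs as in word2vec's skip-gram
# variant. The raw text supplies the "labels" for free.
def skipgram_pairs(tokens, window=2):
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
```

Each pair then serves as one training example for predicting a context word from a center word.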
Another example is a recently published paper [arxiv:1704.01444] that also uses a large text corpus without labels, at least for the first model, just to predict the next character of a data block. So far, this is nothing new, but it is remarkable that a single unit learned to predict the sentiment of a data block. In other words, all those models learn by predicting the next “thing”, which can be, for instance, a word, a character, or some other token.
The interesting part is that such an “autoregression” model can be trained by simply taking a sequence, removing the last item and trying to predict it, given the previous data. This also works for sets, but the process is not straightforward since sets are not ordered. Furthermore, it is not obvious how to select the item to hold out, since there is no “previous” data.
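The construction above can be sketched in a few lines. For sequences, we split off the last item as the target; for sets, one possible (and purely illustrative) choice is to hold out each element in turn, since no element is privileged by an ordering:

```python
# (context, target) pair construction for the "autoregression" setup.
def sequence_example(seq):
    # a sequence has a natural "last" item to predict
    return seq[:-1], seq[-1]

def set_examples(items):
    # a set has no order, so we hold out each element in turn
    items = list(items)
    return [(set(items) - {x}, x) for x in items]

print(sequence_example([1, 2, 3, 4]))   # ([1, 2, 3], 4)
print(set_examples(["heist", "police"]))
```

Note that the set variant produces one training example per element, which is one way around the missing notion of “previous” data.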
Bottom line, it has been demonstrated several times that it is possible to learn a good representation of data by just predicting the next token. Nevertheless, such methods are often limited to short texts, since processing longer texts requires remembering lots of data or context, especially for models based on RNNs.
However, since we usually deal with short descriptions of items, with the difference that we handle sets and not sequences, we adapted the method and trained a model to predict a keyword from the set, given the rest of the set, with moderate results. Despite the problems we encountered, we still believe that no (strongly) supervised model will ever be able to learn a powerful but also general representation of data. Thus, it seems a good idea to follow this research track and address the existing problems one by one, until we eventually find a method that addresses the major hurdles.
It is no secret that most of the energy has been put into advancing supervised approaches to machine learning. One reason is that lots of problems can actually be phrased as predicting labels, often with very good results. So the question is, especially for commercial solutions where time and resources are very limited, whether it isn’t better to spend some time labeling data and training a classifier to get some insights about the data. We got some inspiration from a recent Twitter post that suggested a similar approach.
For instance, if we want to predict whether an “event” is an outlier or not, we have to decide between supervised and unsupervised methods. The advantage of the latter is that we have access to lots of data but no clear notion of “outliers”, while for the former, we need labeled events, with the risk that the data is not very representative and the trained model therefore of limited use.
In other words, it is the old story again: a supervised model is usually easier to train if we have sufficient labeled data, at the expense that we get what we “feed”. Thus, more labeled data is likely to improve the model, but we can never be sure that we have captured all irregularities. On the other hand, unsupervised learning might be able to (fully) disentangle the explaining factors of the data and thus lead to a more powerful model, but coming up with a proper loss function and the actual training can be very hard.
Bottom line, there is some truth to the advice that if you cannot come up with a good unsupervised model, but you can partly solve the problem with a supervised one, you should start with the latter. With some luck, the simple model will lead to additional insights that might eventually lead to an unsupervised solution.
The hype about A.I. has reached almost preposterous proportions. Without a doubt, there has been a lot of recent progress, but there is still a long way to go to achieve even modest success in terms of real ‘intelligence’. That’s why it is no shame to say that we are just scratching the surface. With deep neural nets, we are closer than we were ten years ago, but most of the work is still supervised, even if some approaches are *very* clever. Thus, with more data we can likely improve the score of some model, but this does not help to overcome the serious limitations of big but dumb networks. One way out would be unsupervised learning, but the advances in this domain are rather modest, probably because supervised learning works so well for most tasks. It should be noted that for some kinds of problems, more data actually helps a lot and might even solve the whole problem, but it is very unlikely that this is true for most kinds of problems.
For instance, as soon as we use some kind of label, the learning is only driven by the error signal induced by the difference between the actual and the predicted value. Stated differently, if the model is able to correctly predict the labels, there will be no further disentangling of explaining factors in the data, because there is no benefit in terms of the objective function.
But there are real-world problems with limited or no supervision at all, which means there is no direct error signal, yet we still must explain the data. One solution is a generative approach: if we can generate realistic data, we surely understand most of the explaining factors. However, generative models often involve sampling, and learning can be rather slow and/or challenging. Furthermore, for some kinds of data, like sparse textual data, successful generative training can be even more difficult.
With the introduction of memory to networks, models got more powerful, especially in handling “rare events”, but most of the time the overall network is still supervised, and so is the adjustment of the memory. The required supervision is the first problem; the second is that there is no large-scale support for general memory architectures. For instance, non-differentiable memory often requires a nearest-neighbor search, which is a bottleneck, or it requires pre-filling the memory and resetting it after so-called “episodes”.
In a talk, the analogy of a cake was used, where supervised learning is the “icing” but unsupervised learning is the core of the cake, the “heart” of it. In other words, even with unlimited data we cannot make a dumb model smarter, because at some point it stops learning with respect to the supervised loss function. The reason is that it “knows” everything about the data needed for a “perfect” prediction but ignores other details. So it’s the old story again, about choosing an appropriate loss function that actually learns the explaining factors of the data.
Bottom line, getting more data is always a good idea, but only if we can somehow extract knowledge from it. Thus, it should be our first priority to work on models that can learn without any supervision, and also with less data (one-shot learning). But we should also not forget about practical aspects, because models that are slow and require lots of resources are of very limited use.
Without a doubt, backprop as a learning rule is very efficient for training (supervised) neural networks, and as long as nobody comes up with a better method, backprop will be our companion for the coming years. This is not a real problem, because recent advances regarding the optimization of deep nets have been very fruitful, with computer vision as a very prominent example. However, despite the success, training a network driven only by weak labels is clumsy and not very efficient.
The problem is that the error provided by a label, like “western” for a genre, in combination with an objective, like the categorical cross-entropy, often leads to very weak learning signals. For example, when training has just started, everything is random. Then the first sample is fed into the network, which leads to an initial guess that is very likely wrong. The mistake is backpropagated through the network to adjust the weights towards a correct prediction. Thus, the whole learning of the network is only motivated by avoiding classification errors with regard to the loss function. In other words, the model does not need to understand the whole data; it just has to find enough regularities to explain the labels. This often makes the training very easy on the one hand, but on the other hand, the power of the model can also be very limited.
For example, if we have two genres, “western” and “horror”, which can be perfectly separated by a single keyword like ‘cowboy’, learning will stop immediately after the network has found this pattern. The problem is that if the keyword is ever present in a “horror” movie, the prediction is garbage, because the world of the network no longer makes sense. In general, discriminative models do a very good job if the data contains enough patterns to allow the network to generalize to unseen data. However, this approach usually requires a lot of examples to catch all relevant regularities of the data, and even then, the learning is purely driven by labels, which carry very little information. In a figurative way, the network sees the real world only through “label glasses” which hide all the details that are not important for the discriminative task. For the toy example, it means all objects are represented by a single note that contains the single word “cowboy” or no text at all. Stated differently, the model knows _nothing_ about the concepts of horror or western.
Let’s try something new: we lock out the teacher and go nuts! All we do is look at samples and model correlations between raw features, without using any kind of label information. In the most elementary setup, we try to reconstruct the data using concept bases that are continually refined with each visited sample. This is nothing more than an auto-encoder. At the end, we have hopefully learned features that can explain most of the data, and because we did not use any labels, the features should be more general and not restricted to a specific task. In contrast to a classification, we model P(“data”) and not P(“label”|”data”).
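A minimal auto-encoder of this kind fits in a few lines of NumPy. This is only a sketch under our own assumptions: tied weights, a sigmoid non-linearity, no biases, plain SGD and toy-sized dimensions; none of these choices are prescribed by the approach itself.

```python
import numpy as np

rng = np.random.RandomState(0)
X = (rng.rand(100, 20) > 0.8).astype(float)   # toy sparse binary "data"
W = rng.randn(20, 8) * 0.1                     # encoder weights (decoder = W.T)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def reconstruction_loss(X, W):
    H = sigmoid(X @ W)            # encode
    R = sigmoid(H @ W.T)          # decode / reconstruct
    return np.mean((R - X) ** 2)

loss_before = reconstruction_loss(X, W)
lr = 0.5
for _ in range(200):
    H = sigmoid(X @ W)
    R = sigmoid(H @ W.T)
    dR = (R - X) * R * (1 - R)                    # error at the decoder output
    dH = dR @ W                                    # backprop into the hidden layer
    grad = X.T @ (dH * H * (1 - H)) + dR.T @ H    # tied-weight gradient
    W -= lr * grad / len(X)

loss_after = reconstruction_loss(X, W)
```

The only signal driving the weights is the reconstruction error, so the learned columns of W have to summarize whatever regularities the data contains.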
Why is this step also useful for a more robust classification? First, the training is not driven by a single aspect, the label, but by modeling all aspects of the data that help to reconstruct it. The idea is that a model which knows how to reconstruct individual samples, or generate new ones in the case of generative models, must have a good grasp of the distribution P(“data”) the data comes from. In other words, unsupervised learning helps to disentangle the explaining factors of the data, which allows a compact representation in the learned feature space. Furthermore, because these new representations are more likely to be linearly separable, we can use very simple models for classification.
In the domain of images, the concept can be illustrated very easily. Instead of using low-entropy labels to learn a model, we try to model the image content directly. This is done by starting with correlations of pixel values, which lead to edges, then lines, contours and finally shapes. With such a model, we can describe the content of an image more effectively and provide a summary of the factors as features. In a simplified way, features can be thought of as template detectors that activate if a certain patch of an image matches the concept (“eye” or “nose” or “face”). Then, a high-level concept can be described by a group of activated templates; for example, a “face” requires active templates for “eye”, “nose”, “mouth” and “ears”. Of course, the modeling of real-world data is more complex, but even there, it is done in a hierarchical way that starts with simple concepts which are then composed into more powerful ones.
The idea is not restricted to images, but for sparse bag-of-words data, like movie descriptions, the illustration is limited. At the bottom, we have a set of words which then form topics, which are finally combined into a set of topics, a movie, and so forth. Nevertheless, even for this kind of data, unsupervised learning helps a lot to capture the structure of the data. This includes correlations of words, which can be used to disentangle the various factors to form latent topics that are present in the data. With such a learned feature space, we are able to solve a broad range of problems, like the classification of genres, a clustering of “similar” movies with related topics, or training preference-based models to suggest movies to users.
Bottom line, in contrast to discriminative models, unsupervised methods have the benefit that they can be used as a generic building block for other models. This is possible because they try to capture all regularities in the data and not only specific aspects of it. In terms of our glasses analogy, the network now sees a lot more details of the world, even if they are simplified, which makes it easier to compose existing knowledge to solve new problems. What makes our domain so challenging is that everybody knows what a cat is, because it is a specific concept that can be decomposed, but to do the same for a sci-fi horror movie is totally different, because even the label is man-made and therefore subjective.
Like the introduction of the ReLU activation unit, batch normalization, BN for short, has changed the learning landscape a lot. Despite some reports that it might not always improve the learning much, it is still a very powerful tool that has gained a lot of acceptance recently. No doubt BN has already been used for auto-encoders (AE) and friends, but most of the literature is focused on supervised learning. Thus, we would like to summarize our results for the domain of sparse textual input data, starting with a warm-up that is soon followed by more details.
Because BN is applied before the (non-linear) activation, we introduce some notation to illustrate the procedure. In the standard case, we have a projection layer (W, b) for an input x
g(x) -> W*x + b
and then we apply the activation function
f(x) -> maximum(0, g(x))
which is usually non-linear. For BN, it looks like this
g(x) -> W*x
h(x) -> (g(x) - mean(g(x))) / std(g(x))
f(x) -> maximum(0, a * h(x) + b)
The difference is that the projection “g” comes without the bias; then “h” normalizes the pre-activation values with the statistics, mean and standard deviation, of a mini-batch. Finally, “f” is applied to the standardized output, which is scaled with “a” and shifted with the bias “b”.
With this in mind, a ReLU layer can be expressed as:
bn = BatchNormalization(Projection(x))
out = ReLU(bn)
The difference is that the activation function now needs to be a separate “layer” which does not have any parameters.
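The three formulas above can be checked with a small NumPy sketch. The function name, the layer sizes and the epsilon guard against division by zero are our illustrative choices; the structure (bias-free projection, per-unit standardization with mini-batch statistics, scale “a” and shift “b”, then ReLU) follows the notation above.

```python
import numpy as np

def bn_relu_layer(x, W, a, b, eps=1e-5):
    g = x @ W                                           # projection without bias
    h = (g - g.mean(axis=0)) / (g.std(axis=0) + eps)    # standardize per unit
    return np.maximum(0.0, a * h + b)                   # scale, shift, activate

rng = np.random.RandomState(0)
x = rng.randn(64, 10)                                   # a mini-batch of 64 samples
out = bn_relu_layer(x, rng.randn(10, 5), a=np.ones(5), b=np.zeros(5))
```

Note that the batch statistics are computed along the batch axis, so the behavior of the layer depends on the whole mini-batch, not on a single sample.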
The use of sigmoid units has a bitter taste, but only because they saturate, which slows down the learning or even stops it totally. However, with batch normalization this can be avoided, which is very beneficial if the data is binary and sigmoids are thus a natural choice.
Two issues we want to shed some light on are:
1) The need for dropout is reduced with BN, or dropout is even discouraged, but is this also true in the AE setting?
2) Usually BN is not applied to the top layer, but at least one paper that also covers AE and BN mentions that they apply BN to all layers. Thus, we are interested in analyzing the situation for the AE setting.
As usual, we use Theano for all experiments, but no frameworks on top, to stay in full control of all parameters and to make sure we really understand what we are doing ;-).
For the domain of movies, there are lots of labels, for instance ratings created by users, genres assigned to movies, or themes to capture coarser aspects of movies. But regardless of the availability of labeled data, it is very likely that there is much more unlabeled data. In the case of movies, this includes, for example, reviews, descriptions and meta information like budget, certificates, places, music and involved persons, but also information like relations between movies and so on. In other words, there are patterns everywhere, waiting to be extracted.
Because we have no clear goal, except for “mining patterns”, supervised methods are not suited for the task. In other words, we want the “student” to explore the data without the supervision of any teacher and furthermore, we want the process to be as lightweight as possible. This time, we do not use a neural network or a sophisticated clustering algorithm. Instead we use a very simple competitive approach that uses the winner-takes-all strategy.
The description of the algorithm is pretty simple. We have X, which contains our input data, L2-normalized, and W, a random matrix that is used to capture K latent topics in the data, also L2-normalized. The following steps are repeated in a loop:
(1) select an arbitrary x from X
(2) h = dot(W, x)
(2.1) j = argmax(h)
(3) W[j] += learning_rate*x
(3.1) L2 normalize W[j]
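The loop above, written out in NumPy. The data is random here and the sizes, learning rate and decay schedule are illustrative; only the five numbered steps come from the description.

```python
import numpy as np

def normalize_rows(M):
    # L2-normalize each row; the small constant guards against division by zero
    return M / (np.linalg.norm(M, axis=1, keepdims=True) + 1e-8)

rng = np.random.RandomState(0)
X = normalize_rows(rng.rand(200, 50))     # input data, L2-normalized rows
K = 10
W = normalize_rows(rng.rand(K, 50))       # K random latent "topics"

learning_rate = 0.1
for step in range(1000):
    x = X[rng.randint(len(X))]            # (1) select an arbitrary x from X
    h = W @ x                             # (2) cosine similarities to all topics
    j = np.argmax(h)                      # (2.1) winner takes all
    W[j] += learning_rate * x             # (3) pull the winner towards x
    W[j] /= np.linalg.norm(W[j])          # (3.1) L2-normalize the winner
    learning_rate *= 0.999                # decay the learning rate over time

cluster_ids = np.argmax(X @ W.T, axis=1)  # cluster ID for every sample
```

The last line already shows the clustering use case: the index of the most similar topic serves as the cluster ID of a sample.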
The result is very similar to a matrix factorization with rank K.
With the normalization of W and X, the dot product (2) gives the cosine similarity between x and each topic of W, and the winner is the topic with the largest magnitude (2.1). Furthermore, the normalization also prevents the norm of W from growing too large (3.1). The winning topic is updated with the information from x (3), multiplied by a learning_rate that is usually decayed over time. Step (3) ensures that features captured by the chosen topic “j” gradually gain more weight.
Besides the ability to explore potential “topics” in the data, the approach can also be used to cluster the data, since (2.1) returns an integer that represents the cluster ID. The intuition is that the dot product of some input x and a topic in W is large when the overlap of features is maximal. For instance, a crime movie with the keywords x=[heist, robbery, police, officer] is not likely to have much in common with latent topics like sci-fi & aliens, love & romance, or sword & sandals.
Bottom line, it is amazing how much we can learn from the data with such a simple algorithm and without any kind of error signal from a teacher. It is not surprising that the approach is not powerful enough to capture higher-order relations between features, but the results are still impressive.