More Data vs. Better Models

The hype around A.I. has reached almost preposterous proportions. Without a doubt, there has been a lot of recent progress, but there is still a long way to go before we achieve even modest success in terms of real ‘intelligence’. That’s why it is no shame to say that we are just scratching the surface. With deep neural nets we are closer than we were ten years ago, but most of the work is still supervised, even if some approaches are *very* clever. Thus, with more data we can likely improve the score of some model, but this does not help to overcome the serious limitations of big but dumb networks. One way out would be unsupervised learning, but the advances in this domain are rather modest, probably because supervised learning works so well for most tasks. It should be noted that for some kinds of problems more data actually helps a lot and might even solve the whole problem, but it is very unlikely that this holds for most kinds of problems.

For instance, as soon as we use some kind of label, learning is driven only by the error signal induced by the difference between the predicted and the actual value. Stated differently, once the model is able to correctly predict the labels, there will be no further disentangling of the explaining factors in the data, because there is no benefit in terms of the objective function.
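To make this concrete: for a softmax classifier trained with cross-entropy, the gradient with respect to the logits is just the predicted distribution minus the one-hot target. The following minimal numpy sketch (with made-up numbers, purely illustrative) shows that once the label is predicted with near certainty, this error signal vanishes and nothing more is learned about the data.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# Hypothetical 3-class example. For cross-entropy with a softmax output,
# the gradient w.r.t. the logits is softmax(z) - onehot(y).
y = np.array([0.0, 1.0, 0.0])              # one-hot label
z_confident = np.array([-8.0, 9.0, -8.0])  # the model already "knows" the label

grad = softmax(z_confident) - y
print(grad)  # ~[0, 0, 0]: no error signal left, so no further disentangling
```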

But there are real-world problems with limited or no supervision at all, which means there is no direct error signal, yet we still have to explain the data. One solution to the problem is a generative approach, since if we can generate realistic data, we surely understand most of the explaining factors. However, generative models often involve sampling, and learning can be rather slow and/or challenging. Furthermore, for some kinds of data, like sparse textual data, successful generative training can be even more difficult.
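To illustrate where the sampling comes in, here is a minimal sketch of a small restricted Boltzmann machine trained with one step of contrastive divergence (CD-1) on synthetic binary data; all sizes and learning rates are made up, and this is just one possible generative setup, not a recipe from the post. The per-example Gibbs sampling in the inner loop is a typical reason why generative training is slower than plain supervised training.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Synthetic sparse binary data (a stand-in for e.g. bag-of-words vectors).
X = (rng.random((500, 30)) < 0.1).astype(float)

# Small RBM trained with one step of contrastive divergence (CD-1).
n_vis, n_hid = 30, 10
W = rng.normal(scale=0.01, size=(n_vis, n_hid))
b_vis = np.zeros(n_vis)
b_hid = np.zeros(n_hid)
lr = 0.05

for epoch in range(10):
    for v0 in X:
        # Positive phase: sample hidden units given the data.
        p_h0 = sigmoid(v0 @ W + b_hid)
        h0 = (rng.random(n_hid) < p_h0).astype(float)
        # Negative phase: one Gibbs step to get a "fantasy" sample.
        p_v1 = sigmoid(h0 @ W.T + b_vis)
        v1 = (rng.random(n_vis) < p_v1).astype(float)
        p_h1 = sigmoid(v1 @ W + b_hid)
        # CD-1 update: data statistics minus model (sampled) statistics.
        W += lr * (np.outer(v0, p_h0) - np.outer(v1, p_h1))
        b_vis += lr * (v0 - v1)
        b_hid += lr * (p_h0 - p_h1)

recon = sigmoid(sigmoid(X @ W + b_hid) @ W.T + b_vis)
print("mean reconstruction error:", np.mean((X - recon) ** 2))
```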

With the introduction of memory into networks, models became more powerful, especially at handling “rare events”, but most of the time the overall network is still supervised, and so is the adjustment of the memory. The required supervision is the first problem; the second is that there is no large-scale support for general memory architectures. For instance, a non-differentiable memory often requires a nearest-neighbor search [1], which is a bottleneck, or it requires pre-filling the memory and resetting it after so-called “episodes”.
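As a rough sketch of such a non-differentiable memory (the class and parameter names below are hypothetical, and this is not the architecture from [1]), the read operation is a brute-force nearest-neighbor search over all stored keys, which is exactly the part that becomes a bottleneck as the memory grows; pre-filling and resetting between “episodes” would simply amount to clearing the store.

```python
import numpy as np

rng = np.random.default_rng(0)

class SimpleMemory:
    """A tiny non-differentiable key-value memory (illustrative only)."""

    def __init__(self, key_dim, capacity):
        self.keys = np.zeros((capacity, key_dim))
        self.values = np.zeros(capacity, dtype=int)
        self.size = 0

    def write(self, key, value):
        # Pre-filling the memory; resetting after an "episode" would set size = 0.
        self.keys[self.size] = key
        self.values[self.size] = value
        self.size += 1

    def read(self, query, k=1):
        # Brute-force cosine nearest-neighbor search over all stored keys --
        # this linear scan is the bottleneck once the memory gets large.
        keys = self.keys[: self.size]
        sims = keys @ query / (np.linalg.norm(keys, axis=1) * np.linalg.norm(query) + 1e-8)
        nearest = np.argsort(-sims)[:k]
        return self.values[nearest], sims[nearest]

mem = SimpleMemory(key_dim=8, capacity=1000)
for label in range(5):                     # store a handful of "rare events"
    mem.write(rng.normal(size=8), label)

values, sims = mem.read(rng.normal(size=8), k=2)
print(values, sims)
```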

In a talk, the analogy of a cake was used: supervised learning is just the “icing”, while unsupervised learning is the body of the cake, the “heart” of it. In other words, even with unlimited data we cannot make a dumb model smarter, because at some point it stops learning with respect to the supervised loss function. The reason is that it “knows” everything about the data it needs for a “perfect” prediction and ignores all other details. So it’s the old story again: choosing an appropriate loss function, one that actually learns the explaining factors of the data.

Bottom line: getting more data is always a good idea, but only if we can somehow extract knowledge from it. Thus, it should be our first priority to work on models that can learn without any supervision, and also from less data (one-shot learning). But we should also not forget about practical aspects, because models that are slow and require lots of resources are of very limited use.

[1] research.google.com/pubs/pub45801.html
