For quite some time now, we have been studying methods to learn the ‘essence’ of very sparse, high-dimensional binary data. We tried auto-encoders, siamese networks, triplet embeddings, multi-label networks, RBMs, CBOW-based embedding schemes and many more. All of these methods have been used successfully for some kind of problem and data. Stated differently, if a method does not work for our problem, the cause is most likely the data itself, or the way we train, rather than the algorithm as such.
For example, before batch normalization some networks could not be trained successfully, which was not a problem of the network architecture but of the flow of the error signals, the gradients, through the network. Soon after this normalization method appeared, others followed, such as weight normalization and layer normalization. All of them aim to “decouple” the inputs and outputs of different layers and to improve the flow of the gradients, and with them it was “magically” possible to train much deeper networks and optimize highly complicated loss functions.
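To make the idea concrete, here is a minimal sketch of layer normalization in NumPy. It is not tied to any of the networks mentioned above; the function name and the fixed `gamma`/`beta` values are illustrative assumptions, standing in for the learnable parameters a real layer would have:

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Layer normalization: normalize the features of each sample to
    zero mean and unit variance, then apply a scale (gamma) and shift
    (beta). In a real network, gamma and beta would be learned."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Two activation vectors on very different scales ...
acts = np.array([[1.0, 2.0, 3.0], [10.0, 20.0, 30.0]])
out = layer_norm(acts)
# ... end up on a comparable scale, with per-sample mean ~0 and std ~1.
print(out.mean(axis=-1), out.std(axis=-1))
```

Because the statistics are computed per sample rather than per mini-batch, the layer behaves identically at training and inference time, which is one reason this variant is popular.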
Without a doubt, normalization is not the end of the road, because there is still much territory to explore in the world of non-convex optimization, but it is definitely an important step that helps to train bigger networks, or those that are very challenging to optimize. In the beginning, weight initialization was crucial for training a good network, and without it the network might not have learned anything at all. But now, with normalization, the effort to find new, nifty weight initialization methods seems less important, because common heuristics plus normalization often work very well, and there is no need to tune the weights before training.
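One such common heuristic is He initialization, which scales random weights by the fan-in of the layer so that activation variance stays roughly constant across ReLU layers. A minimal sketch, with illustrative layer sizes of our own choosing:

```python
import numpy as np

def he_init(fan_in, fan_out, rng=None):
    """He initialization: draw weights from a normal distribution with
    standard deviation sqrt(2 / fan_in), a common heuristic for layers
    with ReLU activations."""
    rng = rng or np.random.default_rng(0)
    return rng.normal(0.0, np.sqrt(2.0 / fan_in), size=(fan_in, fan_out))

# Hypothetical layer: 512 inputs, 256 outputs.
W = he_init(512, 256)
print(W.std())  # close to sqrt(2/512) ~ 0.0625
```

The point of the paragraph above is that such a simple recipe, combined with a normalization layer, is usually all the initialization tuning one needs.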
We are still far away from an “off-the-shelf” method that works for all kinds of problems without fine-tuning, but equipped with a good high-level framework that supports most of the techniques discovered so far, it is much easier to train good networks than it was about two years ago.