With the rise of Deep Learning, more and more of the ancient knowledge about how to train neural networks is becoming accessible to the masses. Especially useful are check lists that condense best practices: how to avoid common pitfalls, plus heuristics that have turned out to be useful.
To train an “ordinary” network on a “common” data set, those tips usually suffice to get a good model. However, if the data is special, for instance very high-dimensional, extremely sparse, or not from the image domain, things can get hairy. The challenge is to find a loss function that allows the network to learn something useful, and without a proper initialization of the weights this will never happen!
Now, one might argue that the recent trend towards ReLU units fixes those problems, and we agree that training is much easier with them. Nevertheless, ReLU units can also stall. For instance, they have the disadvantage that they can “die”: once a unit’s pre-activation is negative for every input, it outputs zero, the gradient flowing through it is zero as well, and it never becomes active again.
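To make the “dying” effect concrete, here is a minimal sketch in NumPy. We assume a single ReLU unit whose bias was pushed to a large negative value, say by one bad gradient step; the concrete numbers are made up for illustration:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # Derivative of ReLU: 1 where the pre-activation is positive, else 0.
    return (x > 0).astype(float)

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 5))   # 100 samples, 5 features
w = rng.normal(size=5)          # weights of a single unit
b = -20.0                       # hypothetical bias after a bad update

pre = x @ w + b                 # pre-activations are all negative now

print(np.all(pre < 0))          # the unit is silent for every input
print(relu(pre).sum())          # output is identically zero
print(relu_grad(pre).sum())     # gradient is zero too: no update can revive it
```

Since the gradient with respect to `w` and `b` is zero for every sample, gradient descent leaves the unit untouched forever; this is exactly why it is called “dead”.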
The point is that a good setup of the weights can make the difference between a good model and a model that does not learn anything. And often, the line between those two outcomes is very thin. Such a high-level check list is often very useful for training standard models, but in case of highly specialized data, it probably requires a lot of fine-tuning, especially to set up the weights in a proper way.
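As an illustration of how thin that line is, the following sketch pushes a random input through a stack of ReLU layers, once with a naively small weight scale and once with the He scheme (std = sqrt(2 / fan_in)). The depth, widths, and the 0.01 scale are arbitrary choices for the demonstration:

```python
import numpy as np

def forward(x, depth, scale_fn, rng):
    """Push x through `depth` fully-connected ReLU layers."""
    for _ in range(depth):
        fan_in = x.shape[1]
        w = rng.normal(0.0, scale_fn(fan_in), size=(fan_in, fan_in))
        x = np.maximum(0.0, x @ w)   # ReLU layer
    return x

rng = np.random.default_rng(42)
x0 = rng.normal(size=(256, 128))     # 256 samples, 128 features

naive = forward(x0, depth=20, scale_fn=lambda n: 0.01, rng=rng)
he = forward(x0, depth=20, scale_fn=lambda n: np.sqrt(2.0 / n), rng=rng)

print(naive.std())   # shrinks towards zero layer by layer
print(he.std())      # stays in a healthy range
```

With the naive scale, the activations (and hence the gradients) collapse towards zero after a handful of layers, so the network effectively cannot learn; the He scale keeps the signal magnitude roughly constant with depth.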
In a nutshell, the available rules of thumb usually work very well and often suffice to get a good model, but when the learning stalls, we need to plunge deeper into the optimization part, and this requires a fairly solid understanding of the theory.