Most real-world machine learning problems are challenging, and there is no easy step-by-step method for getting a good model. There are plenty of recipes and heuristics, but not all of them apply to every problem. Therefore, reproducible results are extremely important for analyzing the impact of hyper-parameters and weight initializations.
In the previous post, we talked about sampling labels when they are not balanced in the data set. As we demonstrated, this is very important, because otherwise the model will treat everything like “Joe Sixpack”. To analyze the impact of different sampling strategies, we want to keep everything else fixed, which means we initialize all random seeds to constant numbers. This way, all generated values are exactly the same and only the sampling method differs. The same procedure can be used to analyze hyper-parameters such as the momentum value, the drop-out rate, or the number of hidden units.
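To make the idea concrete, here is a minimal sketch using only the Python standard library. The function names, the toy labels, and the seed value are made up for illustration and are not taken from our actual code. Each source of randomness gets its own generator with a constant seed, so the initial weights are identical across runs and only the sampling strategy changes.

```python
import random

SEED = 1234  # arbitrary constant; any fixed value works


def init_weights(n_in, n_out, seed=SEED, std=0.02):
    """Draw an n_in x n_out weight matrix from a dedicated, seeded generator."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(n_out)] for _ in range(n_in)]


def sample_uniform(labels, k, seed=SEED):
    """Strategy A: pick k example indices uniformly at random."""
    rng = random.Random(seed)
    return [rng.randrange(len(labels)) for _ in range(k)]


def sample_balanced(labels, k, seed=SEED):
    """Strategy B: pick a label uniformly first, then an example with that label."""
    rng = random.Random(seed)
    by_label = {}
    for idx, label in enumerate(labels):
        by_label.setdefault(label, []).append(idx)
    classes = sorted(by_label)
    return [rng.choice(by_label[rng.choice(classes)]) for _ in range(k)]


# Hypothetical, heavily imbalanced toy labels.
labels = ["drama"] * 90 + ["western"] * 10

# Both runs start from identical weights, so only the sampling differs.
weights_a = init_weights(4, 3)
weights_b = init_weights(4, 3)
batch_a = sample_uniform(labels, 20)
batch_b = sample_balanced(labels, 20)
```

Giving each component its own generator, rather than seeding one global generator, means that changing the sampler cannot shift the random numbers used for the weights.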
With this approach, we start with a very simple model and tune it until we are satisfied with the results. Then, we introduce a new feature, tune its hyper-parameters the same way, and so forth. At the end, we have a full-fledged model with a reasonable set of hyper-parameters. No doubt this approach is time-consuming and won’t work for some big models, but in our case, it worked pretty well.
Let’s close the post with an example. The model consists of 1,000 visible nodes, 100 hidden nodes (ReLU) and 10 output nodes (Softmax). The data consists of 25K movies described by TF-IDF-like features. The 10 outputs are genres like adventure/crime/western/scifi. Now, we want to replace the ReLU nodes with Leaky ReLUs to see if the model improves. When all weights are initialized exactly the same way, the only difference is the type of neuron. In our case, the final loss of the ReLU network was 2017.12, while the LReLU network converged to 2016.85. The error rate was the same for both: 76.32%. Thus, we can say that the new neuron type does not lead to any noticeable change in performance for this particular train/test split of the data.
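The setup of this comparison can be sketched with NumPy as follows. This is only an illustration under assumptions, not our training code: the slope `alpha=0.01`, the seed, the `std=0.02` initialization, and the random stand-in input are all made up, and only the forward pass is shown. The point is that both networks are built from the same seed, so the activation function is the only thing that differs.

```python
import numpy as np


def relu(x):
    return np.maximum(x, 0.0)


def leaky_relu(x, alpha=0.01):
    # alpha is an assumed slope for negative inputs; the post does not state one.
    return np.where(x > 0.0, x, alpha * x)


def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract the row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)


def init_params(seed, n_in=1000, n_hidden=100, n_out=10, std=0.02):
    """Create all parameters from one seeded generator."""
    rng = np.random.default_rng(seed)
    return {
        "W1": rng.normal(0.0, std, size=(n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, std, size=(n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }


def forward(params, x, activation):
    h = activation(x @ params["W1"] + params["b1"])
    return softmax(h @ params["W2"] + params["b2"])


SEED = 42
x = np.random.default_rng(0).random((5, 1000))  # stand-in for TF-IDF-like rows

params_relu = init_params(SEED)   # same seed ...
params_lrelu = init_params(SEED)  # ... so identical starting weights

probs_relu = forward(params_relu, x, relu)
probs_lrelu = forward(params_lrelu, x, leaky_relu)
```

Any difference between `probs_relu` and `probs_lrelu` can then be attributed to the neuron type alone, which is what makes the two final losses comparable.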
Bottom line. The number of hyper-parameters in a large network can be huge, especially for custom architectures and non-standard types of neurons. When time is not a problem, hyper-parameter search is very useful, because with the wrong hyper-parameters the network will not “get off the ground”, which can be very disappointing. Plus, when only a single aspect of the model is changed, the results should always be comparable.
It cannot be stressed enough that even good recipes like “start with a learning rate of 0.01 and initialize all weights with a normal distribution with std=0.02 and use momentum” sometimes fail because of the input data. Then it is often a real pain to adjust the initial weights for each layer until the network starts to learn something. The “orthogonal initialization” of the weights helped a lot to avoid the black magic of finding proper weights, but even this method does not always lead to success.
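For readers who have not seen orthogonal initialization, here is a minimal NumPy sketch of one common way to do it: take the QR decomposition of a Gaussian random matrix and keep the orthonormal factor. The function name, the `gain` parameter, and the seed are assumptions for illustration, and this is not necessarily the exact variant we used.

```python
import numpy as np


def orthogonal_init(n_in, n_out, gain=1.0, seed=0):
    """Return an (n_in, n_out) matrix with orthonormal columns (or rows)."""
    rng = np.random.default_rng(seed)
    rows, cols = max(n_in, n_out), min(n_in, n_out)
    a = rng.normal(0.0, 1.0, size=(rows, cols))
    q, r = np.linalg.qr(a)       # reduced QR: q has orthonormal columns
    q = q * np.sign(np.diag(r))  # fix column signs so the result is unique
    if n_in < n_out:
        q = q.T                  # wide layer: the rows are orthonormal instead
    return gain * q


W = orthogonal_init(1000, 100)  # visible -> hidden layer from the example
```

With `gain=1.0`, `W.T @ W` is the identity up to rounding, so the layer preserves the norm of its output-sized projections instead of depending on a hand-tuned standard deviation.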
Second bottom line. For standard models and data, a lot of existing recipes work very well and lead to useful models. For very specific problems and/or data, however, a lot of low-level knowledge is often required to train a good model, and even with that knowledge, the process of finding good hyper-parameters can be slow.