We all love to optimize a convex problem. Why? Because there is only one optimum and, even better, the training always converges to it. That’s great, right? It is, but the sad fact is that most of the interesting problems out there are non-convex and thus harder to optimize. For quite some time, we have been using a factorization machine to train a preference-based model that estimates whether a user likes a movie or not. Without a doubt, the model is simple, elegant and very powerful, but even a minor modification of the hyper-parameters can lead to very different results.
What do we mean by ceteris paribus? We keep all but one of the hyper-parameters fixed and search for the best value of the remaining one. To get an overview of how many combinations we need to evaluate, let’s see what we can tune:
(1) the weight initialization (uniform, normal, orthogonal, …)
(2) number of factors (5, 10, 15, …)
(3) weight decay (0.01, 0.001, 0.0001, …)
(4) type of weight decay (l1, l2, elastic net)
(5) additional regularizer (margin between pos. and neg. examples, …)
(6) optimizer (adagrad, rmsprop, nesterov, adam, …)
(7) learning rate (0.1, 0.01, 0.001, …)
(8) sampling type in case of unbalanced classes
(9) hard negative sampling or not
(10) number of epochs / stop criteria
The list is rather small, but even with 10 entries the number of possible combinations is roughly 40K, depending on how many alternatives we evaluate per entry. Even for a simple model like ours, 40K training runs need lots of time, and the evaluation needs even more. But now comes the setback … we cannot tune each hyper-parameter on its own, because they all interact with each other. So, are we doomed? Kind of, since the score of the model is only one indicator of whether the model actually captures important preference tendencies, and those can only be evaluated manually.
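To get a feeling for how quickly the grid grows, here is a minimal sketch of the arithmetic. The number of alternatives per entry is purely an assumption for illustration (three for most entries, fewer for on/off style choices), not the grid we actually evaluated, and the dictionary keys are made-up names for the ten entries above.

```python
from math import prod

# Hypothetical number of alternatives per hyper-parameter.
# These counts are assumptions for illustration only.
grid_sizes = {
    "weight_init": 3,
    "num_factors": 3,
    "weight_decay": 3,
    "decay_type": 3,
    "extra_regularizer": 2,
    "optimizer": 4,
    "learning_rate": 3,
    "sampling_type": 3,
    "hard_negatives": 2,
    "stop_criterion": 3,
}

# Full grid: every alternative combined with every other alternative.
full_grid = prod(grid_sizes.values())

# Ceteris paribus: one baseline run, plus one run for every
# non-baseline alternative of each hyper-parameter.
one_at_a_time = 1 + sum(n - 1 for n in grid_sizes.values())

print(f"full grid:      {full_grid} trainings")
print(f"one at a time:  {one_at_a_time} trainings")
```

With these assumed counts the full grid lands in the tens of thousands of trainings, while the one-at-a-time sweep needs only a couple of dozen. That gap is exactly why ceteris paribus is so tempting, and why the interaction between the hyper-parameters hurts so much.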
In a nutshell, it’s the old story. It is not sufficient to just throw in the data, use some good heuristics for weight initialization and a modern optimizer to train a good model. The major problem is not the training, but determining the utility of a model, which goes far beyond the ranking/prediction score. The other problem is that the features of some items are not sufficient to explain some preferences, which cannot be solved with model tuning at all. In this setting, we need to be very careful when we change one parameter while leaving the others fixed (ceteris paribus).
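Why the caution? A toy example with just two interacting hyper-parameters shows how a one-at-a-time sweep can get stuck. The score table below is completely made up to demonstrate the effect, it is not a result from our model, and the names `toy_score` and `ceteris_paribus` are hypothetical.

```python
from itertools import product

learning_rates = [0.1, 0.01, 0.001]
weight_decays = [0.01, 0.001, 0.0001]

# Made-up validation scores for (learning rate, weight decay).
# The values are invented so that the two parameters interact.
toy_score = {
    (0.1, 0.01): 0.60, (0.1, 0.001): 0.50, (0.1, 0.0001): 0.40,
    (0.01, 0.01): 0.70, (0.01, 0.001): 0.65, (0.01, 0.0001): 0.60,
    (0.001, 0.01): 0.55, (0.001, 0.001): 0.75, (0.001, 0.0001): 0.90,
}


def ceteris_paribus(baseline_wd):
    """Tune one parameter at a time while the other one stays fixed."""
    # Step 1: sweep the learning rate with the weight decay fixed.
    best_lr = max(learning_rates, key=lambda lr: toy_score[(lr, baseline_wd)])
    # Step 2: sweep the weight decay with the chosen learning rate fixed.
    best_wd = max(weight_decays, key=lambda wd: toy_score[(best_lr, wd)])
    return best_lr, best_wd


sweep_choice = ceteris_paribus(baseline_wd=0.01)

# Exhaustive search over all nine combinations for comparison.
grid_choice = max(product(learning_rates, weight_decays), key=toy_score.get)

print("one at a time:", sweep_choice, toy_score[sweep_choice])
print("full grid:    ", grid_choice, toy_score[grid_choice])
```

In this invented table, the sweep settles on a combination that looks best along each axis separately, while the best pair sits in a corner that the sweep never visits. The real situation is worse, since we have ten axes instead of two and the score itself is only part of the story.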