Optimization: Ceteris Paribus

We all love to optimize a convex problem. Why? Because every local optimum is also a global one, so training ends up at the same optimal value no matter where it starts. That’s great, right? It is, but the sad fact is that most of the interesting problems out there are non-convex and thus harder to optimize. For quite some time, we have been using a factorization machine to train a preference-based model that estimates whether a user likes a movie or not. Without a doubt the model is simple, elegant and very powerful, but even a minor modification of the hyper-parameters can lead to very different results.

What do we mean by ceteris paribus? We keep all hyper-parameters but one fixed and search for the best value of the remaining one. To get an overview of how many combinations we need to evaluate, let’s see what we can tune:
(1) the weight initialization (uniform, normal, orthogonal, …)
(2) number of factors (5, 10, 15, …)
(3) weight decay (0.01, 0.001, 0.0001, …)
(4) type of weight decay (l1, l2, elastic net)
(5) additional regularizer (margin between pos. and neg. examples, …)
(6) optimizer (adagrad, rmsprop, nesterov, adam, …)
(7) learning rate (0.1, 0.01, 0.001, …)
(8) sampling type in case of unbalanced classes
(9) hard negative sampling or not
(10) number of epochs / stop criteria

The list is rather small, but even with 10 entries the number of possible combinations is about 40K, depending on how many alternatives we evaluate. Even for a simple model like ours, 40K training runs need lots of time, and the evaluation needs even more. But now comes the setback … we cannot tune each hyper-parameter on its own, because they all interact with each other. So, are we doomed? Kind of, since the evaluation score of the model is only one indicator of whether the model actually captures important preference tendencies, and that can only be judged manually.
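To make the size of the search space concrete, here is a minimal sketch that counts the runs for a full grid versus a ceteris paribus sweep. The number of alternatives per entry is an assumption chosen for illustration (the "…" in the list above is not filled in), so the total lands in the same range as the 40K mentioned above rather than matching it exactly.

```python
from math import prod

# Hypothetical grid: the alternatives per hyper-parameter are assumptions,
# picked only to illustrate how quickly the combinations multiply.
grid = {
    "init": ["uniform", "normal", "orthogonal"],
    "factors": [5, 10, 15],
    "weight_decay": [0.01, 0.001, 0.0001],
    "decay_type": ["l1", "l2", "elastic_net"],
    "extra_regularizer": ["none", "margin"],
    "optimizer": ["adagrad", "rmsprop", "nesterov", "adam"],
    "learning_rate": [0.1, 0.01, 0.001],
    "sampling": ["none", "oversample", "undersample"],
    "hard_negatives": [False, True],
    "epochs": [10, 20, 50],
}

# Full grid: every alternative combined with every other alternative.
full_grid_size = prod(len(values) for values in grid.values())

# Ceteris paribus: one baseline run, plus every non-baseline alternative
# of each hyper-parameter while all the others stay at the baseline.
one_at_a_time = 1 + sum(len(values) - 1 for values in grid.values())

print(full_grid_size)  # 34992
print(one_at_a_time)   # 20
```

The gap between the two numbers is why varying one hyper-parameter at a time is so tempting, and it only works if the hyper-parameters do not interact.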

In a nutshell, it’s the old story. It is not sufficient just to throw in the data, use some good heuristics for weight initialization and a modern optimizer to train a good model. The major problem is not the training, but determining the utility of a model, which goes far beyond the ranking/prediction score. The other problem is that the features of some items are not sufficient to explain some preferences, and no amount of model tuning can fix that. In this setting, we need to be very careful when we change one parameter while leaving the others fixed (ceteris paribus).
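Why the care is needed can be shown with a toy example. The scores below are made up, not measured on our model, and are chosen so that learning rate and weight decay interact. A ceteris paribus sweep from the baseline never leaves the baseline, while the full grid finds a clearly better pair.

```python
# Made-up validation scores for (learning_rate, weight_decay) pairs.
# These numbers are hypothetical and only illustrate an interaction.
scores = {
    (0.1, 0.01): 0.70, (0.1, 0.001): 0.60, (0.1, 0.0001): 0.55,
    (0.01, 0.01): 0.65, (0.01, 0.001): 0.72, (0.01, 0.0001): 0.68,
    (0.001, 0.01): 0.50, (0.001, 0.001): 0.66, (0.001, 0.0001): 0.80,
}
learning_rates = [0.1, 0.01, 0.001]
weight_decays = [0.01, 0.001, 0.0001]

# Baseline configuration from which one value at a time is varied.
base_lr, base_wd = 0.1, 0.01

# Ceteris paribus: sweep the learning rate with the weight decay fixed,
# then sweep the weight decay with the best learning rate fixed.
best_lr = max(learning_rates, key=lambda lr: scores[(lr, base_wd)])
best_wd = max(weight_decays, key=lambda wd: scores[(best_lr, wd)])
ceteris_paribus_best = (best_lr, best_wd)

# Full grid: evaluate every pair.
grid_best = max(scores, key=scores.get)

print(ceteris_paribus_best, scores[ceteris_paribus_best])  # (0.1, 0.01) 0.7
print(grid_best, scores[grid_best])                        # (0.001, 0.0001) 0.8
```

In this sketch the small learning rate only pays off together with the small weight decay, so neither single-parameter sweep ever sees the improvement.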

