# Model Insights

Whenever we need to train a model on a very large dataset, stochastic gradient descent, with a proper optimizer on top of course, is the first choice because the cost of a single update does not depend on the size of the dataset. Furthermore, with online learning, or small mini-batches, learning happens immediately instead of waiting until every sample has been visited, because each sample triggers a noisy step that tries to lower the cost function. We bring this up again because the performance of our preference model could be better, and we saw a good opportunity to improve the whole training workflow.
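To make the per-sample update concrete, here is a minimal NumPy sketch of plain SGD on a toy least-squares problem (not the preference model itself); the data, learning rate, and single-epoch loop are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 1,000 samples, 5 features, linear target plus a bit of noise.
X = rng.normal(size=(1000, 5))
w_true = rng.normal(size=5)
y = X @ w_true + 0.1 * rng.normal(size=1000)

w = np.zeros(5)
lr = 0.01

# One epoch of plain per-sample SGD on squared error: each sample
# triggers an immediate, noisy step, independent of the dataset size.
for i in rng.permutation(len(X)):
    err = X[i] @ w - y[i]
    w -= lr * err * X[i]
```

Each step only looks at one sample, which is exactly why the method scales to large datasets, at the cost of noisy individual updates.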

So, what are the major challenges? First, the labels are not evenly distributed because there is a bias towards positive labels. We address this by uniform sampling, which clearly introduces a new bias, because now samples with negative labels are used more often. Second, we want to present the most informative pairs (x_pos, x_neg) to the model, namely those that are ‘very wrong’, to encourage the model to learn as much as possible. Why is this important? Well, let’s take a closer look at the cost function:

[we use a factorization machine with a paired logistic loss]

y_pos = sigmoid(fm_out_pos)

y_neg = sigmoid(fm_out_neg)

loss_decay = 0.5 * sum(W**2) + 0.5 * sum(V**2)

loss_obj = -log(y_pos) - log(1 - y_neg)

cost = loss_decay + loss_obj
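As a sketch in plain NumPy (the post itself uses Theano), the cost above could look as follows. We assume a standard second-order factorization machine with linear weights `W` and factor matrix `V`; the bias `w0` and the `decay` coefficient are our own additions for completeness:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fm_out(x, w0, W, V):
    # Second-order factorization machine output for one sample x:
    # bias + linear term + pairwise interactions (O(k*n) trick).
    linear = w0 + x @ W
    pairwise = 0.5 * np.sum((x @ V) ** 2 - (x ** 2) @ (V ** 2))
    return linear + pairwise

def pair_cost(x_pos, x_neg, w0, W, V, decay=1.0):
    # Paired logistic loss: push the positive item above the negative one.
    y_pos = sigmoid(fm_out(x_pos, w0, W, V))
    y_neg = sigmoid(fm_out(x_neg, w0, W, V))
    loss_obj = -np.log(y_pos) - np.log(1.0 - y_neg)
    # L2 weight decay on the linear and factor weights.
    loss_decay = decay * (0.5 * np.sum(W ** 2) + 0.5 * np.sum(V ** 2))
    return loss_decay + loss_obj
```

With all weights at zero, both predictions are 0.5 and the cost reduces to `2 * log(2)`, which is the starting point of the dynamics discussed next.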

At the very beginning, the weights are small and the prediction error is large: `loss_obj >> loss_decay`. But as training continues, the predictions get better and eventually we get `loss_decay > loss_obj` for more and more pairs, at least if the model is powerful enough. To be more concrete, at the end of our training, a lot of pairs looked like this: loss_decay = 0.33 and loss_obj in {0.10, 0.15, 0.02}.

For those pairs, the model minimizes the objective function simply by pushing more and more weights that are not important for the predictions towards zero. Getting rid of useless weights is a good thing, but since the weight decay is not coupled with the actual loss, the procedure continues and at some point it becomes harmful.

The point is, if all pairs are correctly classified, which means y_pos =~ 1.0 and y_neg =~ 0.0, there is no reason to continue. But if some pairs are not correctly classified yet, we should not feed all pairs to the network again, because this would just push the weights down; instead, we should feed only those pairs that give the model an opportunity to learn something. This is a bit like curriculum learning, where we start with easy examples, but once we have mastered them, we only spend time on the challenging problems to gather _new_ experiences.

The implementation with Theano is trivial, but it comes at a price: we now have to do a forward pass (fprop) to get the loss for a pair, and only if the loss is large enough do we feed the pair to the model, which results in a full fprop/bprop step. There might be clever ways to avoid doing the fprop twice, but since our dataset is rather small, we use the naive implementation and neglect the overhead.
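A minimal sketch of this filtered training loop; `pair_loss` and `train_step` are hypothetical stand-ins for the two compiled Theano functions (fprop-only, and fprop plus bprop), and `min_loss` is an assumed skip threshold:

```python
def train_epoch(pairs, pair_loss, train_step, min_loss=0.1):
    """Feed only informative pairs to the model; return how many were used."""
    trained = 0
    for x_pos, x_neg in pairs:
        # First fprop: is there still something to learn from this pair?
        if pair_loss(x_pos, x_neg) < min_loss:
            continue  # already classified well enough, skip it
        # Second fprop plus bprop: the actual gradient update.
        train_step(x_pos, x_neg)
        trained += 1
    return trained
```

The duplicated forward pass is the price mentioned above; for a small dataset the simplicity easily wins over the overhead.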

Bottom line: a dataset likely consists of problems that are not equally complex, and the question is how to feed the data to the network to maximize learning while avoiding the computational overhead of “easy” samples. The issue is especially important if a model promotes sparsity with L1/L2 weight decay, because “non-sense” samples can harm the final performance.