# Knowledge Transfer

The practical use of machine learning largely depends on the performance of models in ‘test mode’: a very large and deep model is almost useless if predictions for unseen examples take too much time. This is especially a problem when accuracy is improved with model ensembles, because then several models have to be evaluated per prediction.

A fairly recent trend is to distill the knowledge of a bigger model into a smaller one to get faster predictions in test mode. The approach is also called compression or ‘thinning’. The idea is to guide the smaller model, the student, with concrete hints from the teacher to improve its learning. Informally, the teacher tells the student not only the solution, but also valuable extra information, such as possible correlations between classes. Currently the approach is mainly used for supervised classification, but it is definitely not limited to it.

Recently we trained a ranking SVM to personalize our movie search engine, and we wanted to output probabilities instead of raw ranking scores, because probabilities are more intuitive for users. A simple approach is Platt scaling, which uses a sigmoid to transform scores into the [0, 1] range. The idea is to learn two parameters A, B that turn the score S into a probability P: P(Y=1|S) = sigmoid(A * S + B).

The input of the procedure is a set of pairs (Si, Yi), where Si is the score for movie i and Yi is its label. For the sake of simplicity, we only consider binary labels 1/0 indicating whether a user liked a movie or not.
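To make the procedure concrete, here is a minimal sketch of Platt scaling in Python with NumPy. It fits A and B by plain gradient descent on the cross entropy between sigmoid(A * Si + B) and the labels; the function name, hyperparameters, and example data are illustrative (Platt's original formulation uses a second-order optimizer, which converges faster but is more code):

```python
import numpy as np

def platt_scale(scores, labels, lr=0.1, epochs=2000):
    """Fit A, B so that sigmoid(A*s + B) approximates P(Y=1|s).

    Minimizes the cross entropy between sigmoid(A*s + B) and the
    binary labels via plain gradient descent (a sketch, not Platt's
    original Newton-style procedure).
    """
    s = np.asarray(scores, dtype=float)
    y = np.asarray(labels, dtype=float)
    A, B = 1.0, 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(A * s + B)))  # current probabilities
        grad = p - y                            # dCE/d(A*s + B)
        A -= lr * np.mean(grad * s)
        B -= lr * np.mean(grad)
    return A, B

# Hypothetical scores and hard like/dislike labels
A, B = platt_scale([2.0, 1.0, -1.0, -2.0], [1, 1, 0, 0])
```

Note that the cross entropy is convex in A and B, so gradient descent reliably finds the optimum here.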

However, a good ranking model in combination with hard 1/0 labels will very likely lead to parameters that saturate the sigmoid, so it only outputs values close to 0 and 1. To avoid this problem, we use the probabilities (Ysoft) of a previously trained supervised model to guide the training procedure; stated differently, we transfer part of that model's knowledge into the new one. To learn the parameters A, B, we now use both Ysoft_i and Yi, with a clear focus on the soft labels, and minimize the cross entropy (CE) cost function:

Hi = sigmoid(A*Si + B)

cost = alpha * CE(Hi, Ysoft_i) + (1 - alpha) * CE(Hi, Yi)
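A minimal sketch of this blended objective, again with made-up data and illustrative names. Since cross entropy is linear in the target, the weighted sum of the two CE terms equals a single CE against the blended target alpha * Ysoft + (1 - alpha) * Y, which keeps the code short:

```python
import numpy as np

def soft_platt_scale(scores, y_soft, y_hard, alpha=0.9, lr=0.1, epochs=5000):
    """Fit A, B by minimizing
        alpha * CE(H, y_soft) + (1 - alpha) * CE(H, y_hard)
    with H = sigmoid(A*s + B). Cross entropy is linear in the target,
    so this equals CE against t = alpha*y_soft + (1-alpha)*y_hard.
    """
    s = np.asarray(scores, dtype=float)
    t = alpha * np.asarray(y_soft, float) + (1 - alpha) * np.asarray(y_hard, float)
    A, B = 1.0, 0.0
    for _ in range(epochs):
        h = 1.0 / (1.0 + np.exp(-(A * s + B)))
        grad = h - t          # gradient of the blended CE wrt (A*s + B)
        A -= lr * np.mean(grad * s)
        B -= lr * np.mean(grad)
    return A, B

# Teacher probabilities keep the targets away from 0/1,
# so the fitted sigmoid no longer saturates.
A, B = soft_platt_scale([2.0, 1.0, -1.0, -2.0],
                        [0.9, 0.7, 0.3, 0.1],   # Ysoft from the teacher
                        [1, 1, 0, 0])           # hard labels
```

With alpha close to 1 the fitted probability for the top score stays near the teacher's 0.9 instead of being pushed toward 1, which is exactly the saturation fix described above.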

In a nutshell, we use Ysoft as a pseudo label to guide the learning, but we also consider the real target value to handle the case that the pseudo label from the teacher is wrong. Of course some fine-tuning is required to avoid overfitting and other issues, but at first glance, the estimated parameters seem to be pretty accurate.