The Dark Side of Knowledge

Very recently the idea of model compression has been picked up again under a new, nifty term: ‘Dark Knowledge’. In this post, we will focus on an intuitive description with a detailed example from the movie domain rather than on the theoretical aspects.

The problem is the lack of resources at test time. Imagine that we trained a big ensemble of different models that does very well at assigning useful tags to movies. However, inferring these tags for new movies at test time is computationally too expensive. The idea of model compression is to use the big model as a teacher and a small model as a student. In other words, the big model provides the mapping from the data X to the output Y, and the student uses this knowledge to learn that function directly.

A challenge is that the so-called hard targets, usually a 1-hot representation of the object class, are too restrictive during the learning process. One idea to solve the problem is to use the inputs to the softmax, the logits before the exponentiation; another is to increase the entropy of the softmax output with a transformation that involves a temperature to soften the values.
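To make the temperature trick concrete, here is a minimal sketch in Python/NumPy. The function name and the logits for ‘Doom’ are made up for illustration and only roughly reproduce the numbers below; the point is just how dividing the logits by a temperature T > 1 raises the entropy of the output.

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits divided by temperature T.
    T=1 is the standard softmax; larger T softens the distribution."""
    z = np.asarray(logits, dtype=np.float64) / T
    z -= z.max()              # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

# Hypothetical logits for 'Doom' (horror, action, sci-fi, comedy).
logits = [10.0, 5.8, 3.2, 2.7]

print(softmax_with_temperature(logits, T=1.0))  # peaky, like the raw softmax
print(softmax_with_temperature(logits, T=3.0))  # softened 'dark knowledge'
```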

Imagine that we trained an ensemble of genre classifiers and the averaged probabilities for the movie ‘Doom’ (ground truth: horror, action) are:

98.31% horror
01.52% action
00.14% sci-fi

If we transform the values with a high temperature, we can “zoom in” on the classification results and see the values at a higher resolution[1]:

56.60% horror
19.96% action
01.10% sci-fi
00.67% comedy

The new picture changes the way the student learns the function, because the learning is now focused on several aspects instead of just the horror aspect that clearly dominates the raw softmax output. In other words, instead of using the 1-hot labels, we let the teacher provide “soft labels” for each training example, generated with the big model, to train a simpler model that learns the mapping function.
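As a rough sketch of how the student could consume these soft labels, here is a hypothetical cross-entropy against the teacher’s softened distribution. The function name and all numbers are illustrative; in practice the student’s own softmax is usually run at the same temperature, often combined with a loss on the true hard labels.

```python
import numpy as np

def soft_cross_entropy(student_probs, teacher_probs):
    """Cross-entropy between the teacher's soft targets and the
    student's predicted distribution for one training example."""
    return -np.sum(teacher_probs * np.log(student_probs + 1e-12))

# Hypothetical soft targets from the big model at high temperature,
# renormalized over the four classes shown above.
teacher_soft = np.array([0.5660, 0.1996, 0.0110, 0.0067])
teacher_soft /= teacher_soft.sum()

# Hypothetical current prediction of the small student model.
student_out = np.array([0.50, 0.30, 0.12, 0.08])

loss = soft_cross_entropy(student_out, teacher_soft)
# Minimizing this loss over the training set pushes the student to
# mimic the teacher's whole output distribution, not just its top class.
print(loss)
```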

Of course, we omitted lots of details, but in a nutshell that is the idea of model compression, and it seems to be very useful if we manage to clear all the initial hurdles.

[1]
We do not want to start reading tea leaves, but the second probability distribution makes more sense for the movie ‘Doom’. There clearly is a horror theme, but there is also lots of action, maybe even more than horror, and the sci-fi elements are limited. Yes, it is set on Mars and there is lots of technology, but it still feels like scenery for the plot. What about humor? Well, there is some…
