Fun With Kernels

Nowadays, when everybody is into Deep Learning, kernel machines have gotten a little out of fashion. One reason might be that you cannot stack multiple SVMs to form something like a deep SVM. Another might be that special data sets often require a custom kernel, which can be very challenging, since hand-crafting a good kernel requires experience and a solid understanding of the data at hand. However, regardless of the opinion of some researchers in the ML community that SVMs are mostly template matchers, they are still powerful and very efficient for some problems.

We demonstrate the power of non-linear SVMs with an example from the movie domain. Let us assume that we want to split movies into two sets: those that belong to a specific concept and those that do not. Instead of using a standard kernel, we will create our own. Why? With custom kernels, we can model the similarity of arbitrary objects, such as graphs, documents, or sets, and not only traditional feature vectors.

A drawback of custom kernels is that off-the-shelf libraries often have no intuitive interface to classify new, unseen examples with them. We do not want to dive too deep into the theory, but for the example, we need at least to summarize the required steps.
The training set X is a list of arbitrary objects, in our case the keyword sets of the movies, and we need to create the Gram matrix that encodes the pairwise similarity of all objects with a kernel function K (the Dice coefficient). Thus, if we have N examples, we get a matrix G with N rows and N columns, where G[i,j] encodes the similarity of objects i and j under the function K. The Gram matrix G is the input to the SVM module that fits our model to the given data.
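A minimal sketch of this step, assuming scikit-learn's SVC with kernel="precomputed" and a handful of made-up movies; the helpers dice and gram_matrix as well as the toy keyword sets are purely illustrative:

```python
import numpy as np
from sklearn.svm import SVC

def dice(a, b):
    """Dice coefficient of two keyword sets: 2*|a & b| / (|a| + |b|)."""
    if not a and not b:
        return 1.0
    return 2.0 * len(a & b) / (len(a) + len(b))

def gram_matrix(X, Y=None):
    """Pairwise kernel values between two lists of keyword sets."""
    Y = X if Y is None else Y
    G = np.zeros((len(X), len(Y)))
    for i, x in enumerate(X):
        for j, y in enumerate(Y):
            G[i, j] = dice(x, y)
    return G

# Toy data: each movie is a set of keywords, +1 = belongs to the concept.
X = [{"space", "alien", "ship"}, {"love", "paris"},
     {"alien", "invasion", "space"}, {"romance", "paris", "wedding"}]
y = [1, -1, 1, -1]

clf = SVC(kernel="precomputed", C=1.0)
clf.fit(gram_matrix(X), y)
```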

The trained model consists of a bias and a list of examples, the support vectors, each combined with a coefficient that weights it. To classify a new example x, we sum the similarity of each support vector with x, multiplied by the label and the weight of that support vector:
y_hat = sign(sum(K(sv[i], x) * y[i] * c[i]) + bias)
A positive sum is mapped to +1, a negative one to -1.
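Spelled out as code, the decision rule is just a weighted sum over the stored support vectors. This is a generic sketch; predict_one and its argument names are illustrative:

```python
def predict_one(x, support_vectors, sv_labels, sv_coefs, bias, kernel):
    # y_hat = sign( sum_i K(sv_i, x) * y_i * c_i + bias )
    s = bias
    for sv, y_i, c_i in zip(support_vectors, sv_labels, sv_coefs):
        s += kernel(sv, x) * y_i * c_i
    return 1 if s > 0 else -1
```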

To avoid juggling with the Gram matrix and other intermediate values to make predictions, we need to extract the bias and the coefficients from the model. That means we need to store all examples that are support vectors, along with their labels and coefficient weights, to be able to make straightforward predictions for unseen examples.
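Continuing the scikit-learn sketch from above, these pieces can be read off the fitted model: clf.support_ holds the indices of the support vectors, clf.dual_coef_ already contains the product of label and coefficient, and clf.intercept_ is the bias. The classify helper is an illustrative name:

```python
# Keep only what is needed for prediction.
support_sets = [X[i] for i in clf.support_]  # the support-vector keyword sets
weights = clf.dual_coef_[0]                  # y_i * c_i for each support vector
bias = clf.intercept_[0]

def classify(keywords):
    s = bias + sum(w * dice(sv, keywords) for sv, w in zip(support_sets, weights))
    return 1 if s > 0 else -1

# Should agree with feeding a kernel row into the library model:
new_movie = {"alien", "space", "laser"}
print(classify(new_movie), clf.predict(gram_matrix([new_movie], X))[0])
```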

Here we can see a possible weakness of kernel learning. For large data sets, it can happen that many examples become support vectors, which increases the time and space complexity considerably. In other words, we need to store many examples from the data, and for every prediction we need to sum over all of them and evaluate K each time.

However, a big advantage is that kernel machines work very well with sparse data, or when the data is best described by pairwise dot products in some high-dimensional space. For instance, instead of using only the top-k keywords for classification or clustering, we can use all keywords in the kernel function to model the similarity of two movies. Furthermore, we can use additional data to derive a context that better describes the similarity of objects. A simple example is a factor model that is able to relate movies even if they do not have a single keyword in common.
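As a rough, assumed illustration of such a context, one could blend the keyword kernel with the cosine similarity of latent factor vectors coming from some factor model. The function combined_kernel, the factors dictionary, and the mixing weight alpha are all made-up names for this sketch:

```python
import numpy as np

def combined_kernel(m1, m2, keywords, factors, alpha=0.5):
    """Blend keyword overlap with latent-factor similarity for movies m1, m2.

    keywords: dict movie_id -> set of keywords
    factors:  dict movie_id -> latent vector from some factor model (assumed given)
    """
    k_keywords = dice(keywords[m1], keywords[m2])
    f1, f2 = factors[m1], factors[m2]
    k_factors = float(np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-12))
    return alpha * k_keywords + (1.0 - alpha) * k_factors
```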

Bottom line: there are no real limits except the computational complexity, which becomes an issue when too many support vectors are required to fit the data or the amount of raw feature data gets very large.
