The last post was about the benefits of kernel machines for classifying sparse data with lots of rare features that would drop out if we used a top-k feature selection approach. With linear SVM models, we can describe the model compactly by assigning a weight, positive or negative, to each feature, which gives a good intuition of what the model learned. With the introduction of kernels, this is no longer possible. However, if we remember how the model decides whether a new example x is positive or not, we can use the support vectors (SVs) with positive labels to learn something about the selected features. Since we use a keyword-based kernel function, each SV contains a list of keywords that is used to determine its similarity to a given example x.
For the sake of simplicity, we trained a model that decides whether a movie has a ‘vampire’ theme or not. The kernel is the Dice coefficient. Of course we cannot say that all keywords of positive SVs are valuable, but since the kernel measures the overlapping keywords of two movies, some keywords should be more descriptive of the theme than others. Plus, since movies are usually described by very few keywords, the “spotting” becomes much easier. Here are some “anchor” keywords from the model:
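To make the kernel concrete, here is a minimal sketch of the Dice coefficient on keyword sets; the movie keyword lists are made-up examples, not taken from the actual data set:

```python
def dice(a, b):
    """Dice coefficient between two keyword sets: 2*|A & B| / (|A| + |B|)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))

# Hypothetical keyword lists for two movies
m1 = ["vampire", "dracula", "victorian", "monster"]
m2 = ["vampire", "small-town", "monster", "disease", "immortality"]
print(dice(m1, m2))  # 2 shared keywords out of 4 + 5 -> 4/9 ~ 0.44
```

The coefficient is 1.0 for identical keyword sets and 0.0 for disjoint ones, which makes it a natural similarity measure for short, sparse keyword lists.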
– paranormal, dracula, half-breed, vampire, monster, ghost, immortality, disease, small-town, victorian, bloodsucker
We should also remember that classification with kernels is no black magic. All we do is measure the overlap of a movie with all SVs (which are also movies), weight each by a coefficient, and consider the label. In other words, SVs can be seen as templates, and we check how well a new movie matches the weighted sum of these templates. If the example matches more ‘important’ positive templates, the answer is +1.
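This template view can be sketched directly as the standard kernel SVM decision function; the SVs, coefficients, and bias below are toy values for illustration, not learned ones:

```python
def dice(a, b):
    """Dice coefficient between two keyword sets."""
    a, b = set(a), set(b)
    return 2.0 * len(a & b) / (len(a) + len(b)) if (a or b) else 0.0

def decide(x, svs, alphas, labels, bias, kernel):
    """Kernel SVM decision: sign of the weighted, labeled sum of
    similarities between x and the support vector 'templates'."""
    score = sum(a * y * kernel(sv, x)
                for sv, a, y in zip(svs, alphas, labels))
    return 1 if score + bias > 0 else -1

# Toy model: one positive 'vampire' template, one negative template
svs = [["vampire", "dracula", "monster"], ["romance", "paris", "summer"]]
alphas, labels, bias = [1.0, 1.0], [1, -1], 0.0

print(decide(["vampire", "small-town", "monster"], svs, alphas, labels, bias, dice))  # -> 1
```

A new movie that overlaps more strongly with the positive templates than with the negative ones ends up with a positive score, hence +1.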
Despite the fact that we can now reduce the learning to pairwise dot products, kernel machines for non-numeric data often require a sophisticated kernel and carefully designed features. In the simplest case, for example, we only measure the overlap of the keywords of two movies, and no weighting of keywords is done. This results in a “weight” for each SV, but all features are treated equally. Clearly, this is far from optimal, and therefore we need a better kernel that considers the importance of each single keyword.
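One possible direction, purely as an assumption on our part and not the kernel used in the post, is to weight each keyword by its inverse document frequency, so rare keywords contribute more to the overlap than common ones:

```python
import math

def idf_weights(corpus):
    """IDF weight per keyword over a corpus of keyword lists
    (an assumed weighting scheme, for illustration only)."""
    n = len(corpus)
    df = {}
    for kws in corpus:
        for k in set(kws):
            df[k] = df.get(k, 0) + 1
    return {k: math.log(n / d) for k, d in df.items()}

def weighted_dice(a, b, w):
    """Dice-style kernel where each keyword contributes its weight
    instead of a flat count of 1."""
    a, b = set(a), set(b)
    wa = sum(w.get(k, 0.0) for k in a)
    wb = sum(w.get(k, 0.0) for k in b)
    if wa + wb == 0:
        return 0.0
    return 2.0 * sum(w.get(k, 0.0) for k in a & b) / (wa + wb)

# Toy corpus of keyword lists
corpus = [["vampire", "monster"], ["vampire", "romance"], ["romance", "paris"]]
w = idf_weights(corpus)
print(weighted_dice(["vampire", "monster"], ["vampire", "romance"], w))
```

The kernel still lies in [0, 1] and equals 1.0 for identical sets, but a shared rare keyword now moves the similarity more than a shared ubiquitous one.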
Bottom line, we are in a kind of doom loop. Our ultimate goal is to learn good features for the classification of arbitrary concepts. However, since the data is extremely sparse, traditional methods that are trained on selected keywords usually fail, because rare keywords are essential for good performance. On the other hand, kernel machines only consider pairwise dot products and are therefore perfect for the task. But their drawback is that we need to hand-craft our features and the kernel function to get good performance.
To summarize the situation, we provide a list of key points for each method:

Kernel machines:
– only dot products are required
– user-defined kernel for non-numeric data
– inference and/or training can be computationally expensive
– optimization problem is convex

Methods that learn features from raw data:
– inference is easy and very fast
– non-convex optimization problem
– features are created based on raw data
– training on larger data sets is not prohibitive