In contrast to kernel SVMs, linear SVMs are versatile and lightweight. Their memory footprint is usually negligible, and with the squared hinge loss the cost function is convex. They are therefore an ideal method whenever the input features already disentangle the explanatory factors of the data, or in high-dimensional spaces where the data is more likely to be linearly separable.
Let us consider the task of predicting whether a movie is about a specific theme, for instance “supernatural” with a focus on vampires and other mystical creatures. With a linear SVM, we want to learn a hyperplane that puts the positive examples on one side and the negative ones on the other. So much for the geometric interpretation. The parameters of an SVM consist of a weight vector w and a bias b, and the decision function is sign(x·w + b).
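As a minimal sketch of this decision function, consider a toy keyword space with made-up weights; all numbers here are hypothetical and only illustrate how the sign of x·w + b separates the two classes:

```python
import numpy as np

# Hypothetical toy setup: a learned weight vector over 5 keyword
# features and a bias. The values are invented for illustration.
w = np.array([1.2, 0.8, -0.5, 0.0, 0.1])  # one weight per keyword
b = -0.6                                   # bias term

def decide(x):
    """Linear SVM decision: +1 if x·w + b > 0, else -1."""
    return 1 if x @ w + b > 0 else -1

x_pos = np.array([1, 1, 0, 0, 0])  # movie tagged with descriptive keywords
x_neg = np.array([0, 0, 1, 1, 0])  # movie tagged with contradictory keywords

print(decide(x_pos))  # → 1
print(decide(x_neg))  # → -1
```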
In our case, where the input data is either binary or tf-idf, all feature values are positive. That means the SVM needs to learn which keyword features are important for describing a theme and to assign lower weights to all other keywords. In the end, x·w + b needs to be positive for “supernatural” movies, and because the input data is rather sparse, the learning procedure is easy if the few keywords of each movie are descriptive. Stated differently, if a keyword is present in many movies, there is no reason to give it a higher weight. With an L1 regularizer, we can further drive the weights of “useless” keywords to zero.
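The sparsifying effect of the L1 penalty can be sketched with scikit-learn's LinearSVC on synthetic data; the keyword matrix and labels below are fabricated, and only the first few columns actually carry signal:

```python
import numpy as np
from sklearn.svm import LinearSVC

# Fabricated keyword matrix: rows are movies, columns are binary
# keyword indicators. Only the first 5 keywords matter for the
# (synthetic) theme label.
rng = np.random.RandomState(0)
X = rng.binomial(1, 0.3, size=(200, 50)).astype(float)
y = (X[:, :5].sum(axis=1) >= 2).astype(int)

# An L1 penalty drives the weights of uninformative keywords to zero
# (L1 in LinearSVC requires the primal formulation, dual=False).
clf = LinearSVC(penalty="l1", dual=False, C=0.5, max_iter=5000)
clf.fit(X, y)

n_zero = int(np.sum(np.abs(clf.coef_) < 1e-6))
print(f"{n_zero} of {clf.coef_.size} weights are (near) zero")
```

With enough regularization, most of the 45 noise columns should end up with exactly zero weight, which is precisely the pruning of “useless” keywords described above.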
In neural-network terms, a linear SVM is comparable to a highly specialized neuron that is on (>0) whenever it detects a specific topic and off (<0) otherwise. Moreover, the better some input matches the "neuron", the higher the score. Combined with the ReLU activation, we can model a "real" neuron that needs sufficient input before it is activated, but whose excitation is then linear in the input.
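A minimal sketch of this ReLU-neuron view, with invented weights: the neuron stays silent below the threshold set by the bias and responds linearly above it.

```python
import numpy as np

def relu_neuron(x, w, b):
    """SVM score x·w + b clipped at zero: off below the threshold,
    linear excitation above it."""
    return max(0.0, float(x @ w + b))

# Hypothetical weights: two descriptive keywords, one contradictory one.
w = np.array([1.0, 1.0, -1.0])
b = -0.5

print(relu_neuron(np.array([1, 1, 0]), w, b))  # strong match → 1.5
print(relu_neuron(np.array([0, 0, 1]), w, b))  # contradiction → 0.0
```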
This is easily explained by the fact that the weight vector gives a high value to all keywords that are descriptive of a theme and low, or negative, weights to keywords that are contradictory. And since the weights of "useless" keywords are close to zero, they do not contribute much to the final decision. Therefore, if the majority of a movie's keywords are descriptive and not drawn from disjoint themes, the neuron is activated.
The dilemma is that themes that actually consist of several topics are much harder to describe than highly specialized ones. This can be compared to genres: a horror movie is more specific than a drama or an action movie, and such broader genres require more keywords to describe. Moreover, without proper context, many keywords in broader themes are highly ambiguous and might not be descriptive at all.
The question is whether it suffices to use only very specific themes and treat the learned "ReLU neurons" as a feature representation. And indeed, the approach seems to work pretty well even for top-level genres, although it is noticeable that neurons for broader genres like "drama" or "action" are activated more often.
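The resulting feature representation can be sketched by stacking several learned theme weight vectors into a matrix and taking the ReLU-activated scores per movie; all numbers below are hypothetical placeholders for actual learned weights:

```python
import numpy as np

# Hypothetical learned parameters: one row of W and one entry of b
# per theme neuron (values are made up).
W = np.array([
    [1.2, 0.0, -0.3],   # "supernatural" neuron: specific, sharp weights
    [0.1, 0.9, 0.4],    # "drama" neuron: broader, fires more often
])
b = np.array([-0.4, -0.2])

def theme_features(x):
    """Map a keyword vector x to ReLU-activated theme scores."""
    return np.maximum(0.0, W @ x + b)

x = np.array([1.0, 1.0, 0.0])  # keyword vector of one movie
print(theme_features(x))       # one activation per theme neuron
```

Each movie is thus re-encoded as a small dense vector of theme activations, which can serve as input to a downstream classifier.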