The Power Of Linear Models

Without a doubt Deep Learning is extremely successful, but is it also always required, to solve a particular problem? Stated differently, if we can solve a problem with a linear model, we surely can solve it also with a neural net, since the latter is more expressive. However, in such a case, we likely waste resources, time + space, and it might be even possible that we need to tune the net to achieve the same performance as the linear model. Of course some domains are biased, like images where linear models are known to lack the expressiveness to solve even simple classification problems. But, if we take for instance a classification problem with thousands of human-generated features, even a linear SVM is likely to suffice, since the high-dimensional space increases the chance a lot that the data is already linearly separable or that at least the data is not as entangled as in the case of images.

So, let’s start with a linear SVM by minimizing the hinge loss which is max(0, 1 – y*y_hat). Because there are lots of off-the-shelf solutions available, all we have to take care of is the sparsity in the input. In case of cross-features it can easily happen that we have to cope with 250,000 features but each sample only has about 25. In case of Python support, there is a pretty good chance that there is support for sparse matrices via scipy/numpy and we are done. Nevertheless, to demonstrate how little effort is required to train such a model without any sophisticated libraries, we started from scratch.

The method is straightforward: First, we create a lookup L to map each word to a unique feature ID. The model W is an empty defaultdict of type float. We draw a random sample (x, y), y in [-1, 1], and sum up all features to get y_hat:
`y_hat = sum([W[L[word]] for word in x])`
First the model is empty and y_hat equals 0. Thus, the loss is: max(0, 1 – y*y_hat) = 1. Next, we derive the gradient, which is extremely simple for the hinge loss, and update the parameters accordingly.
``` for word in x:  W[L[word]] -= lrate * (-y) ```
Since we treat all words as equally important, for the sake of simplicity, the update rule just increases/decreases a feature proportional to the learning rate according to the label:
``` w_i = w_i - lrate * (-1) => w_i += lrate w_i = w_i - lrate * (1) => w_i -= lrate ```
This is neither magical nor unexpected, since we want features that are related to +1 samples to be positive and vice versa negative for -1 samples.

The beauty of this method is that we only need to update the model parameters, features, that are present in the drawn sample which is very efficient. Furthermore, with the lookup and the dictionary, every actual parameter update can be also performed very efficiently and there is no scalability problem in case of millions of features. It is also possible to perform on-line learning, because of the modest model size, which means we just update the model parameters continuously for each new sample (x, y). The advantage is that we can automatically cope with drifts in the preferences over time which means nothing more than to slowly shift the weight from some features to others.

Once the model W is trained, the parameters can be easily serialized as a Python dictionary, along with the lookup L. Then, at test time, the prediction just requires us to sum the weights for the present features in a new sample x, which is very fast:
``` y_hat = sum([W[L[word]] for word in x]) ```

Bottom line, it is no secret that the power of a linear model lies in the features. However, if features are already available and there is no need to handcraft them, a simple feature combination and a linear model might already provide a very strong baseline. In other words, even if Deep Learning is awesome, we should always start with the simplest model and only increase the capacity if really needed.