# Classification: Linear vs. Embedding

Whenever we are dealing with very high-dimensional but also very sparse data, linear models are a good place to start. Why? Because they are fast to train and to evaluate, have a minimal memory footprint, and are often sufficient to deliver good performance, since data in high-dimensional spaces is more likely to be linearly separable. However, this simplicity also comes at a price.

In a linear model, each feature has a scalar weight, like barbecue=1.1, cooking=0.3, racing=-0.7, car=-0.9, and a bias is used as a threshold. To predict whether something belongs to the category “food”, we calculate:

y_hat = x_1 * barbecue + x_2 * cooking + x_3 * racing + x_4 * car + bias

where each x_i is in {0, 1} depending on whether the corresponding feature is present or not. If y_hat is positive, the answer is yes; otherwise, no. The answer should be positive if a sufficient number of food-related features are present, and in case of a mixture, the prediction resembles a majority vote. The problem is that linear models completely ignore the context of features, which means the meaning of a feature cannot be adjusted depending on which neighbors are present.
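As a minimal sketch, the prediction could look like this in Python. The weights are the illustrative values from above; the bias is an assumed value, not given in the text:

```python
# Hypothetical learned weights for the "food" category (illustrative values).
weights = {"barbecue": 1.1, "cooking": 0.3, "racing": -0.7, "car": -0.9}
bias = -0.2  # assumed threshold, not from the text

def predict_food(present_features):
    """Linear prediction: sum the weights of all present features, add the bias."""
    y_hat = sum(weights[f] for f in present_features) + bias
    return y_hat > 0

predict_food({"barbecue", "cooking"})  # 1.1 + 0.3 - 0.2 = 1.2 > 0 -> True
predict_food({"racing", "car"})        # -0.7 - 0.9 - 0.2 = -1.8    -> False
```

Note how only the features that are actually present enter the sum, which is what makes the model cheap on sparse data.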

This is where embedding models jump in to fill the gap. In such a model, each feature is represented by a vector instead of a scalar, and an item is represented as the average of all its feature vectors. Each label also gets a vector, as in a linear model, but in the embedding space rather than the original feature space. Then, we can make a prediction by transforming the item and calculating, for each label:

y_i = f(dot(mean_item, label_i) + bias_i)
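A sketch of this prediction, with randomly initialized vectors standing in for learned parameters. The 20 dimensions come from the text below; the feature names, the sigmoid choice for f, and all values are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 20  # embedding size used as an example in the text

# Hypothetical parameters: one vector per feature and one per label.
feature_vecs = {f: rng.normal(size=dim) for f in ["barbecue", "cooking", "racing", "car"]}
label_vec = rng.normal(size=dim)  # "food" label embedding
label_bias = 0.0

def predict(present_features):
    # An item is the average of the vectors of its present features.
    mean_item = np.mean([feature_vecs[f] for f in present_features], axis=0)
    score = np.dot(mean_item, label_vec) + label_bias
    return 1.0 / (1.0 + np.exp(-score))  # f: here a sigmoid, squashing to (0, 1)

p = predict({"barbecue", "cooking"})  # a probability-like score in (0, 1)
```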

What is the difference? Let’s assume we encode the vectors with 20 dimensions, which gives us much more representational power to encode relations between features. For instance, suppose a certain feature usually belongs to the non-food category, but is strongly related to food when combined with specific other features. A linear model is likely to have trouble capturing this correlation. More precisely, the feature’s weight can be negative, around zero, or positive. If it is positive, the feature must always support the category, which is usually not the case; if it is negative, it never contributes to a positive prediction; and if it is very small, neglecting the sign, it does not contribute at all. To be frank, we simplified the situation a lot, but the point is still valid.

With an embedding, we can address the issue because we have more than a single dimension to encode relations with other features. Let’s look again at what a prediction looks like:

dot(mean_item, label_i) + bias_i

followed by some function f that we ignore here, since it only rescales the output value. We can also express this as:

sum_j(mean_item[j] * label_i[j]) + bias_i

Stated differently, each dimension casts a weighted vote; the votes are summed up and thresholded against the label bias. The more positive votes we get, the higher the chance that (sum_i + bias_i) > 0, which means the final prediction is positive.
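Unrolling the dot product makes the vote picture concrete. A toy sketch with made-up 3-dimensional vectors (all values are assumptions for illustration):

```python
import numpy as np

mean_item = np.array([0.4, -0.2, 0.1])  # toy item embedding
label_i   = np.array([1.0, -0.5, 0.3])  # toy label embedding
bias_i = -0.1

# Per-dimension weighted votes; dims where both vectors agree in sign vote positively.
votes = mean_item * label_i   # [0.4, 0.1, 0.03]
score = votes.sum() + bias_i  # 0.53 - 0.1 = 0.43 > 0 -> positive prediction
```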

To come back to our context example: with the embedding, it is possible that some dimensions are sensitive to correlations. For instance, dimension “k” might be negative for a specific feature but positive for non-food features, which eliminates the contribution due to the averaging. However, if the same dimension is also negative for food-related features, and negative in the label vector as well, the specific feature strengthens the prediction because of its context. Of course, it is not that simple for real-world data, but the argument remains the same: with a vector, the model is able to learn more powerful representations.
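A toy numerical illustration of this effect in a single dimension k (all values are assumed, chosen only to show the cancellation and reinforcement):

```python
# Values of dimension k in several feature vectors (toy numbers).
ambiguous_k = -0.6  # the context-dependent feature
non_food_k  =  0.6  # a typical non-food feature: opposite sign
food_k      = -0.6  # a food-related feature: same sign
label_k     = -1.0  # the "food" label is also negative in dim k

# Paired with a non-food feature, averaging cancels dim k's contribution:
cancel = ((ambiguous_k + non_food_k) / 2) * label_k  # cancels to zero
# Paired with a food feature, dim k casts a positive vote for "food":
boost = ((ambiguous_k + food_k) / 2) * label_k       # -> 0.6
```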

Bottom line: linear models are very useful, but depending on the problem, they are too simple. Given sparse data, embedding models are still very efficient, because the complexity does not depend on |features| but on |features > 0|; with a sparsity of ~99.9%, this makes a big difference. Thanks to the more powerful representations, it is also likely that the embedding can be re-used, for instance to learn preference-based models or to cluster data.