In the world of Shallow Learning, we spend most of our time hand-crafting features for models. The reason is that there is usually no way to learn good features because of the limited expressiveness of the original features. Thus, we have to find clever ways to combine all kinds of data into raw features. In the domain of IR and recommender systems, the result is usually a very high-dimensional space with very sparse feature vectors. So, the question is whether non-linear methods are really required to learn good models, or stated differently, how likely is it that the data is already linearly separable for a given problem?
For example, we build a feature space with ~8K features, but on average only 30 features are activated, which is a sparsity of >99%. Thus, it makes sense to use a method that is suited for sparse data, like factorization machines (FMs) or linear SVMs. Since we already analyzed FMs and found that they achieve very good precision, we trained an ordinary L2-SVM to compare the generalization of a linear and a non-linear model.
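To make the sparsity concrete, here is a small sketch (with synthetic data, not our actual feature set) of what such a feature matrix looks like: each row activates roughly 30 out of ~8,000 binary features, stored in a scipy CSR matrix so that only the non-zeros are kept in memory.

```python
import numpy as np
from scipy import sparse

# Synthetic stand-in for the data described above: each example
# activates ~30 of ~8,000 binary features.
rng = np.random.default_rng(0)
n_rows, n_features, n_active = 1000, 8000, 30

rows = np.repeat(np.arange(n_rows), n_active)
cols = rng.integers(0, n_features, size=n_rows * n_active)
data = np.ones(n_rows * n_active)

# CSR stores only the non-zero entries (duplicates are summed).
X = sparse.csr_matrix((data, (rows, cols)), shape=(n_rows, n_features))

sparsity = 1.0 - X.nnz / (n_rows * n_features)
print(f"sparsity: {sparsity:.2%}")  # well above 99%
```

Both FMs and linear SVMs can consume this representation directly, which is exactly why they are a natural fit here.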
It should be noted that training an FM model requires far more decisions than an L2-SVM, which only requires the ‘C’ parameter. For the FM model, we need to decide how to initialize the weights, the number of factors, the learning rate, the optimization method and the regularization (L2 weight decay). So, without a full grid search over the hyper-parameters, we cannot be sure that a better setting does not exist.
To compare the models, we chose a list of movies that are neither in the training nor the test set. Due to the different model parameters, we only care about the relative magnitude of the model outputs. For this test, both models correctly predicted all examples, but the SVM was better at moving “vague” predictions closer to the decision boundary. We continued to handpick movies and compared the outcome for both models, with the result that the linear model seemed to be superior.
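For the linear model, this kind of comparison can be done with the SVM's decision function: the sign gives the predicted class, and the absolute value is the (relative) distance from the decision boundary. A minimal sketch, again on synthetic data standing in for the handpicked movies:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC

# Synthetic data; the last ten rows stand in for movies that are
# neither in the training nor the test set.
X, y = make_classification(n_samples=500, n_features=50, random_state=0)
X_train, y_train = X[:400], y[:400]
X_unseen = X[400:410]

svm = LinearSVC(C=1.0).fit(X_train, y_train)

# decision_function returns signed distances to the hyperplane; we only
# interpret them relative to each other, not as calibrated scores.
scores = svm.decision_function(X_unseen)
for i, s in enumerate(scores):
    print(f"item {i}: class={int(s > 0)}, margin={abs(s):.3f}")
```

The FM output lives on a different scale, which is why only relative magnitudes are comparable across the two models.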
It seems that the regularization worked much better for the linear model: for easy examples, both models predicted the same class, but for ambiguous examples, the output of the linear model was often more precise. In other words, the magnitude of the weights of the FM model seems to induce a decision boundary that is not smooth enough to provide good generalization. But of course, it is also possible that the hyper-parameters of the FM model are simply not optimal.
Bottom line: for the data at hand, even an off-the-shelf linear SVM is comparable to a non-linear model in terms of precision, and it even seems to generalize better to unseen data. This raises the question of whether it is worth spending all that time on tuning and training non-linear (and non-convex) models, or whether it is more important to carefully select and engineer the features and then use a simple linear model. In any case, we are interested in why the generalization of the FM model is not optimal, but if we had to choose right now, we would stick with the linear model, because it is very fast (train + eval) and the results are often better.