Since bag-of-words features are usually very sparse but also high-dimensional, it is reasonable to assume that a linear model already performs well. However, without modeling feature correlations, its expressive power is limited. For that reason, we began to analyze the benefits of Factorization Machines (FMs), because they combine simple training with a powerful non-linear classifier.
In contrast to non-linear SVMs, FMs can be trained in the primal, at the price that the full objective function, which captures pairwise feature correlations, is no longer convex. The advantage is that no support vectors are needed to score new examples and that training an FM scales much better than training a kernel SVM. And despite the non-convex loss function, it is usually no problem to train a good FM model.
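To make the pairwise-correlation part concrete, here is a minimal NumPy sketch of the FM scoring function. It uses Rendle's reformulation of the factorized pairwise sum, which brings the cost down from O(n²) to O(kn); the function name and shapes are our own choices, not from the setup described in the text.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """FM score: global bias + linear term + factorized pairwise interactions.

    x  -- input features, shape (n,)
    w0 -- global bias (scalar)
    w  -- linear weights, shape (n,)
    V  -- latent factors, shape (n, k); the pairwise weight for features
          i and j is the inner product V[i] @ V[j]
    """
    xv = x @ V  # per-factor sums, shape (k,)
    # sum_{i<j} <v_i, v_j> x_i x_j rewritten as
    # 0.5 * sum_f [ (sum_i v_if x_i)^2 - sum_i v_if^2 x_i^2 ]
    pairwise = 0.5 * np.sum(xv ** 2 - (x ** 2) @ (V ** 2))
    return w0 + x @ w + pairwise
```

The reformulated pairwise term is exactly why FMs can be trained in the primal: the score is a closed-form, differentiable function of the parameters, so no support vectors are needed at prediction time.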
However, even though training FMs with stochastic gradient descent is straightforward, it makes sense to use a more sophisticated optimization scheme: the number of model parameters is usually very small, and with a second-order method there is no learning rate to tune and training may also converge faster.
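To put "very small" in perspective, the parameter count of an FM is easy to work out for the configuration used in our experiment below (1750 features, 10 factors):

```python
# An FM with n input features and k latent factors has
#   1 global bias + n linear weights + n*k factor weights.
n_features, k = 1750, 10
n_params = 1 + n_features + n_features * k  # -> 19251
```

Roughly twenty thousand parameters is tiny by the standards of second-order methods, which is what makes an L-BFGS-style optimizer practical here.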
In our second test, we used 1750 raw input features and 10 latent factors to train an FM model with a logistic loss function. Since we only cared about the quality of the model, it was no problem that we did not exploit the sparsity of the input data to speed up the training. For the training, we used Theano in combination with L-BFGS on large randomized mini-batches. The maximum number of epochs was set to 30, and the random seed was fixed for deterministic results.
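The training loop can be sketched as follows. This is not our original Theano code; it is a self-contained NumPy/SciPy stand-in with hand-written gradients (Theano derived them automatically), full-batch rather than mini-batch, and illustrative data shapes. Labels are assumed to be in {-1, +1}.

```python
import numpy as np
from scipy.optimize import minimize

def fit_fm_lbfgs(X, y, k=10, max_iter=30, seed=42):
    """Sketch: fit an FM with logistic loss using SciPy's L-BFGS-B."""
    n, d = X.shape
    rng = np.random.default_rng(seed)

    def unpack(theta):
        return theta[0], theta[1:1 + d], theta[1 + d:].reshape(d, k)

    def loss_grad(theta):
        w0, w, V = unpack(theta)
        XV = X @ V                                            # (n, k)
        scores = w0 + X @ w + 0.5 * np.sum(XV ** 2 - (X ** 2) @ (V ** 2), axis=1)
        z = y * scores
        loss = np.mean(np.logaddexp(0.0, -z))                 # logistic loss
        g = -y / (1.0 + np.exp(z)) / n                        # dloss/dscores
        grad_w0 = g.sum()
        grad_w = X.T @ g
        # dscore/dV_if = x_i * (XV)_f - V_if * x_i^2
        grad_V = X.T @ (g[:, None] * XV) - ((X ** 2).T @ g)[:, None] * V
        return loss, np.concatenate(([grad_w0], grad_w, grad_V.ravel()))

    # Small random init for V: at V = 0 its gradient vanishes (a saddle point).
    theta0 = np.concatenate((np.zeros(1 + d), 0.01 * rng.normal(size=d * k)))
    res = minimize(loss_grad, theta0, jac=True, method="L-BFGS-B",
                   options={"maxiter": max_iter})
    return unpack(res.x)
```

The fixed seed mirrors the deterministic setup in our experiment; note that determinism only holds for a given SciPy/NumPy version, since L-BFGS line-search details can vary.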
The trained model got 30 examples wrong, which corresponds to an accuracy of 96.94%. When we train only the linear part of the model, the accuracy is 94.71% (52 errors), which is still quite impressive. In other words, in the high-dimensional space even a linear model is able to disentangle most of the explanatory factors in the data. Furthermore, for biased user ratings it is likely that the non-linear part of the model is not fully utilized, which means the model's capacity would allow it to find much more complex patterns without a decrease in the final score.
In a nutshell, when the number of model parameters is very small and training runtime does not matter, it is often beneficial to use a more sophisticated optimizer. And since such an optimizer may find a similar solution in fewer iterations, the training overhead is often negligible. What is left is a qualitative comparison of the SGD and L-BFGS models, since each model makes very different errors.