At the end of the previous post, we mentioned that the solution found by L-BFGS made different errors than the model we trained with SGD and momentum. This raises two questions: which solution generalizes better, and, if the two methods focus on different aspects of the data, how exactly do their solutions differ?
To keep the analysis simple but still somewhat realistic, we train a linear SVM classifier (W, bias) for the “werewolf” theme. All movies with that theme are labeled +1, and a random sample of the remaining movies is labeled -1. As features, we use the 1,500 most frequent keywords. All random seeds were fixed, which means both models start from the same “point”.
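The setup above can be sketched roughly as follows. This is only an illustration under assumptions: the function name `build_dataset`, the input format (one list of keywords per movie), the idea that the theme is itself a keyword, binary features, the vocabulary being counted over all movies, and the balanced number of negatives are all hypothetical and not taken from the original experiment.

```python
from collections import Counter

import numpy as np


def build_dataset(movies, theme, vocab_size=1500, seed=0):
    """Build binary keyword features and +1/-1 labels for one theme.

    movies: list of keyword lists, one per movie (assumed input format).
    """
    rng = np.random.default_rng(seed)  # fixed seed, so the sample is repeatable

    positives = [kws for kws in movies if theme in kws]
    rest = [kws for kws in movies if theme not in kws]

    # Assumption: draw as many random negatives as there are positives.
    n_neg = min(len(positives), len(rest))
    picked = rng.choice(len(rest), size=n_neg, replace=False)
    sample = positives + [rest[i] for i in picked]

    # Vocabulary: the most frequent keywords, counted once per movie.
    counts = Counter(kw for kws in movies for kw in set(kws))
    vocab = [kw for kw, _ in counts.most_common(vocab_size)]
    column = {kw: j for j, kw in enumerate(vocab)}

    X = np.zeros((len(sample), len(vocab)))
    for i, kws in enumerate(sample):
        for kw in kws:
            if kw in column:
                X[i, column[kw]] = 1.0

    y = np.array([1.0] * len(positives) + [-1.0] * n_neg)
    return X, y, vocab
```

Because the seed is fixed, repeated calls return the same negatives, which is one way to make sure both optimizers see identical data.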
In our first experiment, we only care about minimizing the training error. The SGD method (I) uses standard momentum and an L1 penalty of 0.0005 in combination with mini-batches. The learning rate and momentum were kept at fixed values. The L-BFGS method (II) minimizes the same loss function. Both methods reached 100% accuracy on the training data, and training was stopped as soon as the error was zero.
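To make method (I) more concrete, here is a minimal sketch of mini-batch SGD with standard momentum on a hinge loss with an L1 penalty. Only the penalty of 0.0005 comes from the text; the hinge loss itself, the learning rate, the momentum value, the batch size, the initialization, and the function names are assumptions chosen for illustration.

```python
import numpy as np


def hinge_l1_loss_grad(W, b, X, y, l1=0.0005):
    """Mean hinge loss plus L1 penalty on W, with a subgradient."""
    margins = 1.0 - y * (X @ W + b)
    active = margins > 0  # samples that violate the margin
    loss = np.maximum(margins, 0.0).mean() + l1 * np.abs(W).sum()
    gW = -(X[active] * y[active, None]).sum(axis=0) / len(y) + l1 * np.sign(W)
    gb = -y[active].sum() / len(y)
    return loss, gW, gb


def train_sgd(X, y, lr=0.05, momentum=0.9, l1=0.0005,
              batch=16, max_epochs=500, seed=0):
    """Mini-batch SGD with momentum; stops once the training error is zero."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=X.shape[1])  # assumed small random init
    b = 0.0
    vW = np.zeros_like(W)
    vb = 0.0
    for _ in range(max_epochs):
        order = rng.permutation(len(y))
        for start in range(0, len(y), batch):
            idx = order[start:start + batch]
            _, gW, gb = hinge_l1_loss_grad(W, b, X[idx], y[idx], l1)
            # Standard momentum: velocity accumulates past gradient steps.
            vW = momentum * vW - lr * gW
            vb = momentum * vb - lr * gb
            W = W + vW
            b = b + vb
        if np.all(np.sign(X @ W + b) == y):  # zero training errors
            break
    return W, b
```

For method (II), the same loss and gradient function could in principle be handed to a quasi-Newton routine such as SciPy's `scipy.optimize.minimize` with `method="L-BFGS-B"`. That is an assumption about tooling; the post does not say which implementation was used.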
(I) loss=1.64000 ||W||=3.56, bias=-0.60811 (SGD)
(II) loss=0.04711 ||W||=3.75, bias=-0.58073 (L-BFGS)
As we can see, the L2 norms of the final weight vectors are similar, and so are the biases. However, we are less interested in absolute norms than in how closely the two solutions are aligned. For that reason, we normalized both weight vectors W to unit length and computed their cosine similarity: correlation = W_sgd.T * W_lbfgs = 0.977.
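The comparison just described boils down to a few lines. The following sketch assumes both weight vectors are available as plain arrays; the function name is made up for illustration.

```python
import numpy as np


def cosine_similarity(w_a, w_b):
    """Dot product of two vectors after scaling each to unit L2 norm."""
    w_a = np.asarray(w_a, dtype=float)
    w_b = np.asarray(w_b, dtype=float)
    unit_a = w_a / np.linalg.norm(w_a)
    unit_b = w_b / np.linalg.norm(w_b)
    return float(unit_a @ unit_b)
```

A value of 1 means the two vectors point in exactly the same direction and 0 means they are orthogonal, so the result does not depend on the norms of the individual solutions.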
Since we have no reference values for judging such a correlation, we also compared the magnitudes of the individual feature weights. More precisely, we looked at the top-5 most important features of each model:
(I) werewolf=0.6652, vampire=0.2394, creature=0.1886, forbidden-love=0.1392, teenagers=0.1372
(II) werewolf=0.6698, vampire=0.2119, monster=0.1531, creature=0.1511, teenagers=0.1279
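A ranking like the one above can be produced by sorting the features by the absolute value of their weights. This is a small sketch, assuming a list of keyword names aligned with the entries of W; the helper name `top_features` is hypothetical.

```python
def top_features(weights, vocab, k=5):
    """Return the k (keyword, weight) pairs with the largest |weight|."""
    ranked = sorted(zip(vocab, weights), key=lambda pair: abs(pair[1]), reverse=True)
    return ranked[:k]


# Weights of model (I) from the listing above, in arbitrary order.
vocab = ["teenagers", "werewolf", "creature", "vampire", "forbidden-love"]
w_sgd = [0.1372, 0.6652, 0.1886, 0.2394, 0.1392]

for keyword, weight in top_features(w_sgd, vocab, k=3):
    print(f"{keyword}={weight:.4f}")
```

Sorting by absolute value means that strongly negative weights would also count as important, which matches the idea of comparing magnitudes.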
If we also consider the top-12 features of both models, which are pretty similar,
(I) werewolf, vampire, creature, forbidden-love, teenagers, monster, pregnancy, undead, curse, supernatural, mansion, bloodsucker
(II) werewolf, vampire, monster, creature, teenagers, curse, forbidden-love, supernatural, pregnancy, hunting, undead, beast
we can see some patterns here: First, a lot of the movies in the dataset seem to combine the theme with love stories that may involve teenagers, which makes sense because this is a very popular pattern these days. Second, vampires and werewolves are very likely to co-occur in the same movie.
Both models learned those patterns, regardless of the optimization method, though with minor differences that show up in the magnitudes of the individual weights in W. However, as the correlation of the parameter vectors confirmed, both solutions are pretty close together.
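One simple way to quantify how much the two top-12 lists agree is to compare them as sets. The sketch below uses the keyword lists exactly as given above; the Jaccard index is an extra measure added for illustration and was not part of the original analysis.

```python
top12_sgd = ["werewolf", "vampire", "creature", "forbidden-love", "teenagers",
             "monster", "pregnancy", "undead", "curse", "supernatural",
             "mansion", "bloodsucker"]
top12_lbfgs = ["werewolf", "vampire", "monster", "creature", "teenagers",
               "curse", "forbidden-love", "supernatural", "pregnancy",
               "hunting", "undead", "beast"]

shared = set(top12_sgd) & set(top12_lbfgs)
only_sgd = sorted(set(top12_sgd) - set(top12_lbfgs))
only_lbfgs = sorted(set(top12_lbfgs) - set(top12_sgd))
jaccard = len(shared) / len(set(top12_sgd) | set(top12_lbfgs))

print(len(shared))   # 10 keywords appear in both lists
print(only_sgd)      # ['bloodsucker', 'mansion']
print(only_lbfgs)    # ['beast', 'hunting']
```

Ten of the twelve keywords are shared, and the ones that differ are still closely related to the theme.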
Bottom line: we should be careful with interpretations since the data at hand was limited. Nevertheless, the results suggest that, with proper initialization and hyper-parameters, good solutions can be achieved with both first-order methods (SGD) and quasi-Newton methods (L-BFGS). Next, we will study the ability of the models to generalize to unseen data.