# Benefits of Adversarial Learning

It is nothing new that neural nets are pretty good at learning a mapping from x to y, where x is the input data and y could be a set of labels. The task is then to predict the correct label given the input x. Around the end of 2013, a paper [arxiv:1312.6199] analyzed “blind spots” in networks. The short summary is that we can slightly modify a sample “x” so that the prediction changes from the correct class to a wrong class with high confidence. Very often the domain of “x” is images, but the problem is not limited to this domain; it is a property of neural nets in general.

Since there is not much literature on adversarial examples for non-image data, we wanted to know whether the problem also exists in our model and whether we could improve the model by training on adversarial examples. Since we are not using a classical neural net with a softmax on top, we had to run some tests. The model we are using is a variation of a factorization machine.

First, a quick reminder of how to generate adversarial examples: we need the gradient of the cost with respect to the input data “x”, not the parameters of the network! Then we take the sign of the gradient, scale it by alpha, and add it to the original data:

x_adv = x + alpha * sign(grad(cost, x, y))

In our case, alpha is 0.003 and cost is the hinge loss:

cost = T.maximum(0, 1 - y_hat*y)

and y_hat is calculated with this formula:

y_hat = 0.5 * T.sum(T.dot(x, V)**2 - T.dot(x**2, V**2)) + bias + T.dot(x, W)
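To make the two formulas concrete, here is a minimal NumPy sketch of the same score and loss for a single sample. The parameter shapes, the toy values, and the labels in {-1, +1} are assumptions for illustration; the naive double loop is only there to check the compact pairwise term against its definition.

```python
import numpy as np

def fm_score(x, V, W, bias):
    """Factorization-machine output y_hat for one sample x (shape [n])."""
    # Pairwise interactions via the usual identity:
    # sum_{i<j} <V_i, V_j> x_i x_j = 0.5 * sum_f ((x.V)_f^2 - (x^2 . V^2)_f)
    pairwise = 0.5 * np.sum(np.dot(x, V) ** 2 - np.dot(x ** 2, V ** 2))
    return pairwise + bias + np.dot(x, W)

def hinge_loss(y_hat, y):
    """Hinge loss, assuming labels y in {-1, +1}."""
    return np.maximum(0.0, 1.0 - y_hat * y)

def fm_score_naive(x, V, W, bias):
    """Reference version: explicit loop over all feature pairs i < j."""
    n = len(x)
    pairwise = sum(np.dot(V[i], V[j]) * x[i] * x[j]
                   for i in range(n) for j in range(i + 1, n))
    return pairwise + bias + np.dot(x, W)

# Toy parameters (illustrative sizes and values, not the real model).
rng = np.random.default_rng(0)
n_features, n_factors = 6, 3
V = rng.normal(scale=0.1, size=(n_features, n_factors))
W = rng.normal(scale=0.1, size=n_features)
bias = 0.05
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])  # binary input

y_hat = fm_score(x, V, W, bias)
loss = hinge_loss(y_hat, y=1.0)
```

For a batch of samples, the sum in `fm_score` would have to run over the factor axis only.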

The general procedure is described in [arxiv:1607.02533]. In other words, since “sign” returns a value in {-1, 0, 1}, the method shifts each value in “x” by +alpha or -alpha (or leaves it unchanged where the gradient is zero). For images, x_adv can hardly be distinguished from “x” if alpha is small.
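The gradient step can be sketched without an autodiff framework, because the gradient of the factorization machine with respect to the input has a closed form. The following NumPy version is an illustration under assumptions (single sample, labels in {-1, +1}, toy parameters), not the code used for the experiments; the function names are made up for this example.

```python
import numpy as np

def fm_score(x, V, W, bias):
    """Factorization-machine output y_hat for a single sample x."""
    return 0.5 * np.sum(np.dot(x, V) ** 2 - np.dot(x ** 2, V ** 2)) + bias + np.dot(x, W)

def hinge(x, y, V, W, bias):
    """Hinge loss max(0, 1 - y_hat*y) for one sample."""
    return max(0.0, 1.0 - y * fm_score(x, V, W, bias))

def hinge_grad_wrt_x(x, y, V, W, bias):
    """Gradient of the hinge loss with respect to the input x."""
    # d y_hat / d x_i = sum_f V[i,f] * (x.V)_f - x_i * sum_f V[i,f]^2 + W[i]
    dyhat_dx = np.dot(V, np.dot(x, V)) - x * np.sum(V ** 2, axis=1) + W
    margin = 1.0 - y * fm_score(x, V, W, bias)
    # The hinge is flat once the margin is satisfied, so the gradient is zero there.
    return -y * dyhat_dx if margin > 0 else np.zeros_like(x)

def fgsm_step(x, y, V, W, bias, alpha=0.003):
    """One gradient-sign step: x + alpha * sign(dcost/dx). No clipping here."""
    return x + alpha * np.sign(hinge_grad_wrt_x(x, y, V, W, bias))

# Small deterministic toy parameters (illustrative values only).
V = np.linspace(-0.1, 0.1, 18).reshape(6, 3)
W = np.linspace(-0.2, 0.2, 6)
bias = 0.05
x = np.array([1.0, 0.0, 1.0, 1.0, 0.0, 0.0])
y = 1.0

x_adv = fgsm_step(x, y, V, W, bias, alpha=0.003)
```

Each entry of `x_adv` differs from `x` by at most alpha, and the step moves in the direction that increases the loss for the true label.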

Since we are working with binary data x_i in {0, 1}, the valid range is [0, 1] and there are some restrictions. First, if x_i + alpha is greater than 1 or x_i - alpha is less than 0, the value has to be clipped back into the range. The clipping is done by

x_adv = T.minimum(1, T.maximum(0, x_adv)).

However, that means that if x_i is 0, only +alpha has an effect, since for -alpha the clipped result is zero again. Conversely, if x_i is 1, only -alpha has an effect, since for +alpha the clipped result is one again.
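A tiny NumPy example makes the effect of the clipping visible. The gradient signs below are hypothetical and chosen so that all four combinations of x_i and sign occur.

```python
import numpy as np

alpha = 0.003
x = np.array([0.0, 0.0, 1.0, 1.0])             # binary features
grad_sign = np.array([1.0, -1.0, 1.0, -1.0])   # hypothetical gradient signs

x_adv = x + alpha * grad_sign
# Same clipping as in the text; np.clip(x_adv, 0.0, 1.0) is equivalent.
x_adv = np.minimum(1.0, np.maximum(0.0, x_adv))

# Only a 0 pushed up or a 1 pushed down actually changes.
changed = x_adv != x
print(x_adv)    # [0.003 0.    1.    0.997]
print(changed)  # [ True False False  True]
```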

Of course, we first need to check whether adversarial examples are a problem at all because, as demonstrated in the paper, some classifiers are almost immune to the problem. Therefore, we trained a model and checked whether the predicted classes for x and x_adv differ. We got different predictions for about 8.25% of the examples; as a side note, the model trained with gradient noise has a value of 6.94%. Compared to the networks used for image classification, our model appears to be more robust against adversarial examples, which might be explained by the non-linearity and the network architecture itself.
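The check itself boils down to comparing predicted classes for each pair (x, x_adv). A minimal sketch, assuming the class is the sign of the model output with a threshold at zero; the scores are made-up numbers, not results from our model:

```python
import numpy as np

def flip_rate(scores_clean, scores_adv):
    """Fraction of examples whose predicted class differs between x and x_adv."""
    # Assumption: the predicted class is the sign of the score (threshold 0).
    pred_clean = np.where(np.asarray(scores_clean) >= 0, 1, -1)
    pred_adv = np.where(np.asarray(scores_adv) >= 0, 1, -1)
    return np.mean(pred_clean != pred_adv)

# Made-up model outputs for four examples and their adversarial versions.
scores_clean = np.array([0.80, 0.02, -0.50, -0.01])
scores_adv = np.array([0.70, -0.01, -0.60, 0.02])

rate = flip_rate(scores_clean, scores_adv)
print(rate)  # 0.5: the two examples close to the boundary flip
```

With alpha as small as 0.003, only examples that already sit close to the decision boundary can flip, which is what the toy numbers imitate.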

Now the question is: can we improve our model if we use adversarial examples for training? To be more specific, do learning and generalization improve if we force the model to classify adversarial examples correctly? To test the hypothesis, we split the cost function into loss_facm and beta*loss_adv, where the first is the loss for the factorization machine and the second is the loss for the prediction of adversarial examples (with the fm output). Note that, since we only use binary input data, we would never encounter such an adversarial example in the wild, because values like (1 - 0.003) and (0 + 0.003) are not possible. Still, such examples demonstrate that a slight change of the input values can move examples to the other side of the decision boundary, which results in a classification with the wrong label.
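The split cost can be sketched as follows. This is an illustration that assumes both terms use the same hinge loss, labels in {-1, +1}, and a mean over the batch; the model outputs are made-up numbers.

```python
import numpy as np

def hinge(y_hat, y):
    """Mean hinge loss over a batch, assuming labels in {-1, +1}."""
    return np.mean(np.maximum(0.0, 1.0 - y_hat * y))

def combined_loss(y_hat_clean, y_hat_adv, y, beta=0.5):
    """loss_facm + beta * loss_adv, both computed against the true labels."""
    loss_facm = hinge(y_hat_clean, y)  # loss on the original inputs x
    loss_adv = hinge(y_hat_adv, y)     # loss on the adversarial inputs x_adv
    return loss_facm + beta * loss_adv

# Made-up outputs for two examples and their adversarial counterparts.
y = np.array([1.0, -1.0])
y_hat_clean = np.array([1.2, -0.4])
y_hat_adv = np.array([0.9, -0.1])

total = combined_loss(y_hat_clean, y_hat_adv, y, beta=0.5)
print(total)  # approximately 0.3 + 0.5 * 0.5 = 0.55
```

During training, x_adv would be regenerated from the current parameters in every step, so that the second term always penalizes the model's current blind spots.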

We trained with beta=0.5 and alpha=0.003, and with these values the error rate for adversarial examples dropped to 1.06%. Stated differently, the model is now more robust to minor changes of the input values. But since those examples are not possible in the wild, we need to check whether the regularization actually improved the predictions. To compare both models, we generated the top-k predictions for a fixed set of movies and checked how well both lists agree. These are the results:

– the preference score for the adv model was lower in 55% of the cases

– the overlap of both sets was 78%

– movies only suggested by the adv model better matched the preferences

– movies only suggested by the normal model were less informative
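The list comparison can be sketched like this. The scores are made-up numbers for five candidate movies, and `top_k` and `overlap` are hypothetical helper names, not functions from our code base.

```python
import numpy as np

def top_k(scores, k):
    """Indices of the k highest-scoring items, best first."""
    return np.argsort(-np.asarray(scores), kind="stable")[:k].tolist()

def overlap(list_a, list_b):
    """Share of items that appear in both top-k lists."""
    return len(set(list_a) & set(list_b)) / len(list_a)

# Made-up preference scores of both models for the same five movies.
scores_normal = [0.90, 0.10, 0.80, 0.30, 0.70]
scores_adv = [0.85, 0.20, 0.40, 0.75, 0.70]

top_normal = top_k(scores_normal, k=3)   # [0, 2, 4]
top_adv = top_k(scores_adv, k=3)         # [0, 3, 4]

shared = overlap(top_normal, top_adv)                # 2 of 3 movies agree
only_adv = sorted(set(top_adv) - set(top_normal))    # [3]
only_normal = sorted(set(top_normal) - set(top_adv)) # [2]
```

The movies in `only_adv` and `only_normal` are the ones that have to be inspected by hand to judge which model suggests the better matches.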

As usual, these few tests do not prove that adversarial training really improved the model but, similar to the test with gradient noise, the method appears to change the dynamics of the model in a positive way. The next step is to find the relation between gradient noise and adversarial training, to better understand the influence these methods have on learning, and to use the insights to further improve the generalization of the model.