# Attention For Bag-of-Words Data

For quite some time now, attention is a very hot topic and it has been used very successfully for various problems, like translations, or captions for images. The basic idea is clever and simple: if we consider the input of a model, usually a sequence, some parts of it are likely to be more important for the problem which is usually a prediction of some kind. However, since in our domain we are not working with sequences, but sets, we require an attention mechanism for unordered data. Let’s start with an example.

We consider the domain of movies and in this particular case, we want to predict the genre from a bag-of-words input. And let the input be x=[town”, “lawman”, “explosion”, “sheriff”, “brother”, “prison”, “ranch”]. So, the question is which features are most important for the decision, or stated differently, do we really need all features for a confident prediction of the genre? For this example, we only consider very basic genres, like western, horror, scifi or romance.

Since the input data is not ordered and a prediction should therefore not depend on it, a recurrent network is not straightforward to use, which is why we use a CBOW-based model. With this method, we have an embedding matrix E that has #features rows. Usually the final representation of the input is done by aggregating all input features, either by the sum or the mean value. However, this assumes that all features equally contribute to the final prediction:

E = np.random.uniform(-1, 1, size=(#features, #dim))*scale

x = [i1, i2, i3, ..., ik]

U = E[x]

h = np.mean(U, axis=0)

Instead, we want that the model puts more focus on “relevant” aspects:

x = [i1, i2, i3, ..., ik]

U = E[x]

g = tanh(np.dot(U, v) + bias)

a = softmax(g)

h = np.sum(a * U, axis=0)

Which is in the spirit of [arxiv:1512.08756], where “v” is a vector of #dim dimensions and bias is a scalar.

With such an attention mechanism, we get a vector “a”, with a length equal to the number of input features with only positive entries such that the sum equals one, like a=[0.3, 0.6, 0.1]. Then, “h” is a weighted combination of all features:

`h = 0.3 * U[0] + 0.6 * U[1] + 0.1 * U[2]`

.

When we think of our initial example, the different weights are likely reflect the importance of a word with respect to the genre to predict. For instance, “sheriff” and “ranch” are probably more relevant for the western genre than “explosion” or “brother”, assuming that the dataset contains enough classical western movies to back this up.

Bottom line, if the input data is not ordered, it is not obvious howto learn with a recurrent model. On the other hand, bag-of-words models treat all input features equal which can hurt the performance when the importance of features is conditional. With the illustrated approach, we are able to work with variable-length data and furthermore, we use attention to re-weight portions of the input. And finally, as stated in [arxiv:1512.08756] the evaluation can be done in parallel, since a step does not depend on the previous one, unlike RNNs.

The conclusion is that we can use a simple feed-forward network in combination with attention to handle bag-of-words data in a very efficient way. The next step is to incorporate and evaluate the method into existing models to study the benefits, if any at all.