### Introduction

The current setup we are working with is more information retrieval than NLP, but of course we also work with textual data. But in contrast to most NLP problems, we treat input not as ordered sequences, but as sets or bag-of-words. This is no problem, since we neither generate output that needs to be meaningful in a grammatical sense, nor do we have any problems were a swapped pair of words would be problematical in any way. Why do we mention this? For the popular Transformer architecture there is a position embedding which is exactly required to turn the input that is a set into a sequence but this is exactly what we do not want. However, without the position embedding, the token can be only used once, since otherwise it would degenerate into the exact same representation.

Long story short, what we want and need is an encoder that turns a variable-length set into a fixed representation that is supposed to summarized the ‘relevant’ parts of the set. Of course the relevance is subjective and is defined by the task and or loss function we use to train the representation. In other words, what we want is a simple model to derive a ‘sentence’ representation but without taking the order of the words in consideration. Furthermore, we do not want to use any supervision or labels, since it is very hard work to keep a dataset up-to-date so it is clean, without too much errors or contradictions, not to mention the time that is required to create one!

### Method

We could use the masked language modelling (MLM) approach, but as we mentioned before, without the position embedding, we cannot use multiple tokens which is a show stopper. We also do not want to use any auto-regressive approach since those approaches usually induce a high computational complexity. So, what’s left? We could use contrastive learning but then we need (lots of) negative samples and depending on the task there might be a good chance that a negative sample is also related but nevertheless pushed away from the ‘anchor’. Plus, for some setups a lot of negative samples are required which also induces a high computational burden, even with very efficient matrix multiplications.

So, is there something else? Yes, there is a simple approach that we described in [1] and the beauty of the method is that we do _not_ need any negative samples. All we need is function that returns different views for the same sample which is easy for images, but a bit harder for text. But instead of spending too much time to come up with a (not so) clever idea, we tried a the following baseline first.

To turn a set of words into two different views, we converted the set into a list, shuffled it and split the list into two halves to get a “left” and a “right” part. The split is done at offset len(input)//2, which ensure that left + right are almost of the same length. The randomization is some kind of augmentation to prevent that the model learns a shortcut in case the two halves are always the same. With this method, the model is forced to to learn a representation that is useful for any permutation of the input. However, even with this approach the importance of ‘common’ words on each side might be overestimated.

### Recap

It fllows a a quick recap how the method works. The augmentation is done outside the model, so for the input x, we get x_left and x_right, where x = x_left || x_right. Both parts are each fed into some encoder f(x) which returns a fixed length vector. All output is l2 normalized. The alignment loss is then just the squared difference of the two ‘views’:

loss_align = ((f(x_left) - f(x_right))**2).sum()

The procedure is repeated for every sample in a mini-batch. So if our mini-batch consists of 64 samples, the output H contains 2*64 representation which can be reshaped into (64, 2, dim) to efficiently ‘batch’ the loss:

loss_align = ((H[:, 0] - H[:, 1])**2).sum(dim=1).mean()

In other words, the loss forces the two views to be close in the representation space which makes sense since the are obviously related.

The uniformity loss is not so simple but still very intuitive. The aim is to spread all representations on the unit circle to avoid collapsed regions and to make use the the whole space. Thus it follows immediately that the two loss functions are competing and if one wines, the representation is useless. For example, it is trivial to collapse all samples into a simple point and then loss_align is always zero. But then, the uniform loss is maximal. On the other hand, if all points are maximally spread, the distance between x_left, x_right of any view is also maximal.

To derive the uniform loss, we need every pairwise distances of the output. PyTorch provides a dedicated function torch.pdist for exactly this task, but since we use a 2nd order optimzation [2] we had to implement our own variant. But for the simplicity, let us use the PyTorch function:

loss_uniform_l = torch.exp(-3*torch.pdist(H_left, p=2)**2).mean().log()

loss_uniform_r = torch.exp(-3*torch.pdist(H_right, p=2)**2).mean().log()

loss_uniform = loss_uniform_r * 0.5 + loss_uniform_l * 0.5

For more details, see the website [3] of the authors. But if all distances are close to zero, the term evalulates to exp(-3*0).mean().log() to zero which is the maximum and we try to minimize the term.

### Contributions

We have to admit that our contributions are minor but let’s take a closer look at the details. First, by avoding negative samples we do not accidentally pushing relevant content away. This is nothing new and exactly the method of [1]. But for text, simple augmentations are not trivial and we empirically verified that the randomized split learns representation that is able to group related content without supervision and those groups often are matching the corresponding ground truth labels. This means that even a very simple ‘view’ suffices to learn representations that capture both semantics but also topically aspects. This is the first contribution.

The second and last one is that we only used very simple building blocks, like hash-based word embeddings [4] and a very shallow model to encode the set into a fixed length representation. Actually, we just used kernelized attention with a learnable token plus some non-linearity which again suffices to learn a useful representation for a vocabulary of ~100K tokens and ~50K samples. We tested the generalization of the method by applying the encoder on a dataset with ~6K tokens and ~1K samples. This test set contains labels to allow a visual summary of the representation based on t-SNE which easily allows to analyze how well related content is grouped into the

feature space.

### Conclusion

We surely did not find the holy grail, but it is satisfying that we don’t need a Transformer model with billions of parameters to learn something useful with respect to a real-world problem. In our case, the semantic retrieval of related content by providing a simple query. The clou is that the model was never trained to rank query + documents, since it just learned to relate content in a local scope, namely the sample itself. The success is likely owed to the limited model complexity that acts as a regularizer and the weight sharing that forces the model to learn a word embedding that captures lots of different regularities.

[1] raberrytv.wordpress.com/2020/06/19/1330/ [Alignment + Uniform]

[2] raberrytv.wordpress.com/2020/11/07/adapsgd-no-learning-rates-hessian-trace/

[3] ssnl.github.io/hypersphere/

[4] raberrytv.wordpress.com/2020/10/03/dynamic-word-embeddings/