HowTo: Load H.264 Videos With dvbcut

Almost a decade ago, in September 2012, we began our work on a personalized TV app. The project is a bit like IPv6: a very long-term effort. Still, we learned a lot about Linux + DVB-S that was very useful back then. Nowadays things are a little easier, but when it comes to editing movies recorded via DVB-S2, not much has changed. There are video editors, but if all you want is a cutter, those apps feel overloaded for the task.

A little elaboration first: We modified gnutv to support H.264 streams, which still works like a charm. The output of gnutv is an MPEG-TS container that can be viewed with mplayer/mpv without any problems. And for ordinary MPEG-2 video inside the container, dvbcut is the first choice if you just want to quickly edit a movie to remove some parts or to adjust the beginning and/or end.

So far everything is fine, but since nowadays people record in high definition, typically H.264 and beyond, those media cannot be loaded into dvbcut. We did some quick research on the status quo but found no real solution. Instead, we use ffmpeg, the swiss army knife for media, to convert the files.

The number of options for ffmpeg can be intimidating, but all you have to do is:

ffmpeg -i your_h264_media_file.mpeg -c:v mpeg2video -qscale:v 2 -c:a mp2 -b:a 192k /tmp/output.ts

The last options handle the audio: MP2 at 192 kbit/s, while the first part re-encodes the H.264 video into the good old MPEG-2 format. But since the operation is not a simple stream copy, it takes a lot of time for longer videos and thus does not come for free.
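For batch jobs, the call can be wrapped in a small script. The following Python sketch (the helper names are ours, not part of any tool) builds exactly the command line from above; the dry-run flag returns the command instead of executing it, so the options can be verified before kicking off a long re-encode:

```python
import subprocess

# Hypothetical helper names; the ffmpeg options are taken verbatim from above.
def mpeg2_command(src, dst):
    return ["ffmpeg", "-i", src, "-c:v", "mpeg2video", "-qscale:v", "2",
            "-c:a", "mp2", "-b:a", "192k", dst]

def convert(src, dst, dry_run=False):
    cmd = mpeg2_command(src, dst)
    if dry_run:
        return " ".join(cmd)  # inspect the command without running ffmpeg
    subprocess.run(cmd, check=True)
```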

Bottom line, the proper solution would still be to implement the feature directly in dvbcut, but since we have no time for this, the only option we are aware of is this lengthy kludge.

Contrastive Learning By Maximizing Mutual Information

Learning useful representations without labels has gained a lot of attention recently, for NLP with Transformers, but also for the domain of images. However, as pointed out in previous posts, like [3], most of the methods require a large batch size to ensure a sufficient number of negative examples. The approach in [2] uses latent cluster assignments combined with optimal transport to avoid negative samples altogether. However, despite its simplicity, the training is not always straightforward and one needs to fiddle with hyperparameters and likely also with the network architecture.

But since the idea of the method can be expressed in many ways, we do not need to explicitly use the optimal transport + swapped assignments design. This made us think of mutual information, which has also been used in relation to contrastive learning. For instance, in [1] the authors presented an approach to train a Part-of-Speech (PoS) tagger in an unsupervised way. The idea is to relate 'past' and 'future' information via mutual information. Details can be found in [1] or in the referenced GitHub code.

The adaptation to our problem is then straightforward. All we need to do is create two views of a sample and assign the first one to 'past' and the other one to 'future'. The loss derived in the paper is semantically similar to the one in [2], with the exception that we do not need to swap anything. The advantage of this objective is that we do not need a separate clustering step and thus no extra hyperparameters. All we need is the softmax temperature, which can be kept from the other approach. And last but not least, we can still derive soft cluster assignments for new data, because the approach from [1] also learns a set of 'clusters'.
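To make this concrete, here is one simple way to instantiate the idea (our own sketch in the spirit of mutual information over cluster assignments, not the exact objective from [1]; the helper name is ours): maximize the mutual information between the soft cluster assignments of the two views, estimated from the joint assignment matrix of a batch.

```python
import torch
import torch.nn.functional as F

def mutual_info_loss(logits_a, logits_b, eps=1e-8):
    # Soft cluster assignments for the two views of each sample.
    pa = F.softmax(logits_a, dim=1)   # (N, K)
    pb = F.softmax(logits_b, dim=1)   # (N, K)
    # Empirical joint distribution over cluster pairs, symmetrized.
    joint = pa.t() @ pb / pa.size(0)  # (K, K)
    joint = 0.5 * (joint + joint.t())
    ma = joint.sum(dim=1, keepdim=True)  # marginal of view a
    mb = joint.sum(dim=0, keepdim=True)  # marginal of view b
    # Negative mutual information I(Z_a; Z_b); minimizing it maximizes MI.
    mi = (joint * (torch.log(joint + eps)
                   - torch.log(ma + eps)
                   - torch.log(mb + eps))).sum()
    return -mi
```

If the two views agree perfectly and all K clusters are used equally, the loss approaches -log(K), its minimum.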

For our smaller test dataset, the approach from [1] was able to perfectly recover the labels by using only pairwise similarities. A t-SNE plot also confirmed that the learned representation is able to group related content but also to separate samples with different concepts.

Bottom line, it is not reasonable to assume that we can solve all our problems just by finding the right loss function, but thanks to all the available papers, we are slowly gaining more insight into how we can tackle the remaining hurdles, one by one.

[1] “Mutual Information Maximization for Simple and Accurate Part-Of-Speech Induction”
[2] “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”

Finding Similarities, Not Differences

Despite all the labeled data out there, it is illusory to assume that every problem can be rephrased as a classification problem. First, annotations are costly, and second, labeling cannot be done by everybody but often requires certain knowledge and thus time. And finally, a label does not carry much information. For instance, a picture with a cat on a tree might be tagged with 'cat' but contains so much more additional information that is useful to understand the content as a whole and its concepts. So, even if everything were labeled, a classical softmax model just learns the features necessary to differentiate between classes, and there is no constraint to understand concepts as a whole to do this. Whatever drives the loss to zero is used, and this information does not always make sense. For instance, a copyright notice could be used for discrimination, or any other watermark. Bottom line, learning with labels feels wrong if your model should truly understand the underlying data and not just learn prototypes to summarize each label.

Unsupervised learning is still the holy grail of AI, but despite all the progress, especially in the domain of images, there is no bulletproof method yet. Learning the underlying factors of data by relating different views is very successful, as shown in [2], but the use of negative examples has always worried us. Why? Because there is no way to decide whether the query instance and the rest actually have nothing in common. Whenever a latent aspect is shared by the query and some negative sample, it is pushed away, or at least treated as less plausible than the positive example: score(query, pos) > score(query, neg_similar) after the gradient step. A possible solution was proposed by [3], which does not use negative examples at all but spreads all embeddings on the hypersphere.

And this brings us to our actual goal, namely to explain the preferences of a user by using only the notion of pairwise similarity. In the classical setting, we learn a utility function that assigns a score to every sample: the higher the score, the better the sample fits the preferences. With a linear SVM, we require positive and negative instances to learn such a function, and we also use labels, which is not optimal for the reasons stated above, like the fact that a label does not reveal which aspects of a sample a user liked or disliked.

Such models often provide a very strong baseline, but expressing the preference for instances just as the sum of feature weights is not very satisfying. The problem is that the total set of preferences is encoded in the positive features of the weight vector, but no fine-grained topics are visible that are expressed by the user and hopefully learned by the model.

To be clear, we consider the domain of text 'documents' of arbitrary length. In our case, the docs contain a brief summary of TV content. So, we could create views by splitting the content into halves, or by drawing two descriptions of the same TV item. But it is also possible to use tags if present, or meta information like directors, or annotated themes. The aim is now to relate two samples in terms of latent topics, cluster prototypes, with the clever trick of swapping the assignments for prediction. In other words, if two samples have something in common, it should be possible to predict this particular aspect of x1 by using the features, i.e. topic assignments, of x2 and vice versa.

Bottom line, it feels much more natural to understand preferences by directly relating content than by contrasting samples with a large set of negative samples for which we cannot guarantee that there are no similarities with the query sample. And not to forget the computational complexity, because contrastive learning often needs a very large batch size for good results, while this method just needs 'enough' samples with respect to the chosen number of clusters.

[1] “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”
[2] “Big Self-Supervised Models are Strong Semi-Supervised Learners”
[3] “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere”

Learning Topics By Contrasting Cluster Assignments

Frankly, we are a bit excited, since we have tried for quite a while to find the missing puzzle piece, but nothing seemed to work, and finally it seems the picture is complete. The goal is to learn a representation of a dataset that disentangles the concepts and topics in it. A further goal is to be as green as possible with respect to resources: no large machinery for training, no runtime of weeks, and the model should have a low memory footprint. But let us come back to part one.

The recently introduced contrastive learning methods caught our attention, especially the uniformity + alignment method [1]. But despite its simplicity, and the fact that the method does not need negative samples at all, the final representation was never powerful enough to perform robust semantic retrieval on unseen data. This is likely not a problem of the method in general, but for our problem it did not work.

The soft nearest neighbor loss [2] also looked promising, but the partitioning of a dataset based on labels is only as good as the labels. In other words, the loss worked as expected, but since our larger dataset is only weakly annotated, and the topics of labels are not disjoint, the generalization was also limited. So, again the problem is that labels do not carry enough information to learn an embedding with respect to the latent concepts of the data, or the labels would need to be very fine-grained, which is intractable for many reasons.

For the task we are trying to solve, negative samples likely hurt the performance, since we never know whether the reference and an arbitrary negative sample have something conceptually in common. This is why we favored the uniformity + alignment loss, since it does not require any negatives. However, we also wish to summarize the dataset with respect to latent concepts, and the loss did not enforce this. We could incorporate a neural topic model into our network, but as argued in [3], those layers easily degenerate and often either only perform averaging or contain lots of 'dead' concepts that are never activated.

Frankly, we missed a paper [4] from June 2020 that contains the last piece of our puzzle. The key element [5] was presented much earlier, in 2013 at NIPS, but the paper combined contrastive learning and clustering with [5] in a way that solved our problem. First, the method does not require any negative samples, which fits perfectly into our setup. Second, and this is even more important, the contrasting is done via learned codebooks: the clusters are forced to capture the concepts of the dataset in order to classify one view of a sample from the cluster assignment of a different view of the same example. For more details, see [4].
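The loss can be sketched as follows (our own condensed reading of [4], not the reference implementation; the temperature, epsilon and iteration count are assumptions): the Sinkhorn-balanced assignment of one view serves as the prediction target for the other view, and vice versa.

```python
import torch
import torch.nn.functional as F

def sinkhorn(scores, eps=0.05, n_iters=3):
    # Balanced soft cluster assignments (the targets), in the spirit of [5].
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()
    n, k = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / k  # each cluster: 1/k of the mass
        Q = Q / Q.sum(dim=1, keepdim=True) / n  # each sample: 1/n of the mass
    return Q * n  # rows are per-sample cluster distributions

def swapped_loss(z1, z2, C, temp=0.1):
    # z1, z2: l2-normalized embeddings of two views of the same samples,
    # C: l2-normalized codebook of shape (n_protos, dim).
    s1, s2 = z1 @ C.t(), z2 @ C.t()
    with torch.no_grad():
        q1, q2 = sinkhorn(s1), sinkhorn(s2)  # targets are not backpropagated
    p1 = F.log_softmax(s1 / temp, dim=1)
    p2 = F.log_softmax(s2 / temp, dim=1)
    # Predict view 1's assignment from view 2 and vice versa.
    return -0.5 * ((q1 * p2).sum(dim=1) + (q2 * p1).sum(dim=1)).mean()
```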

So, it is a bit like AI Christmas for us, because the community was so kind to find the needle in our haystack for us. Furthermore, a by-product of the method is that it provides a lot of useful extra functions:
– We can use the learned embedding for semantic retrieval [main goal]
– The concepts in the dataset are summarized in codebooks
– The soft assignment can be determined for unseen data and returns a topic distribution
– The cluster distributions can be used for semantic retrieval (Kullback-Leibler divergence)
– Individual codebooks can also be used as ‘classifiers’ to detect a particular concept
– And finally we can cluster the data topically with the soft assignments

Bottom line, this is a perfect example that clever ideas do not need to be complex at all. You just need a creative mind, or two, to combine existing approaches with a new ingredient to invent something new. We are thankful that the authors of [4, 5] shared their work with the community, which allows us to stop tinkering with the problem and to continue advancing our idea.

[1] 2005.10242: “Understanding Contrastive Representation Learning through Alignment and Uniformity on the Hypersphere”
[2] 1902.01889: “Analyzing and Improving Representations with the Soft Nearest Neighbor Loss”
[4] 2006.09882: “Unsupervised Learning of Visual Features by Contrasting Cluster Assignments”
[5] Sinkhorn distances: Lightspeed computation of optimal transport, NIPS 2013

Optimal Transport: The Silver Bullet?

In the last years we have tried a lot of different approaches to learn latent topics/factors/concepts in textual data. Neither shallow methods like NMF or LDA, nor any (deep) neural net we tried, led to a representation that generalized well. The problem is that the regularization parameters, and also the loss function, need to be chosen very carefully. Introducing sparsity in shallow methods helped, but no method really learned all relevant concepts, or at least failed to disentangle them, so topics were mixed up and really hard to interpret.

Another approach we tried was the Mixture of Experts (MoE) approach, where the dataset is divided into regions and each expert is responsible for learning a certain aspect of the data, or stated differently, a latent topic. In the literature it is mentioned very often that without extra regularization, a balanced assignment of the data to the experts likely fails. And even with regularization, it is not straightforward to always find the proper schedule to enforce this. The problem is a reinforcement effect: if an expert wins, its connections are updated, and the next time it might attract even more samples than before, while the other experts begin to starve. The problem is not limited to MoE; for example, neural topic models with a softmax activation suffer from the same problem, and so does any other model that tries to bucketize data in a way that buckets are equally filled.

In [1] a method is proposed to produce a soft assignment of samples to a set of clusters in a way that the assignment is balanced; it is like a regularized K-Means. The method is simple and very fast, since it mostly requires matrix multiplications. The Sinkhorn-Knopp method is very versatile, since it just needs a batch of embeddings X of shape (n_batch, dim) and a set of “prototypes” C of shape (n_protos, dim) to work.

Let p = sinkhorn(X @ C.t()) be the soft assignment of the method. Then we can use it to balance assignments for:
– MoE: C is a gating network that decides which expert to choose
– Clustering: C is a matrix of centers to learn
– Neural topic models: C is a matrix of latent topics to learn
– and any other method that aims to learn a balanced partition of the data.
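For completeness, a minimal sketch of the normalization itself (our own simplified variant; epsilon and the number of iterations are assumptions, not values from [1]):

```python
import torch

def sinkhorn(scores, eps=0.05, n_iters=3):
    # scores: (n_batch, n_protos) similarity matrix, e.g. X @ C.t().
    # Alternate column and row normalization so that every prototype
    # receives (approximately) the same total mass.
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()
    n_batch, n_protos = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / n_protos  # balance prototypes
        Q = Q / Q.sum(dim=1, keepdim=True) / n_batch   # one unit of mass per sample
    return Q * n_batch  # rows now sum to 1 -> per-sample distributions
```

With more iterations the column sums converge towards n_batch/n_protos, i.e. a balanced assignment.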

The implementation is also straightforward, as demonstrated in [2], and the additional computational complexity is limited; even if it were more expensive, the benefits would clearly outweigh the extra time.

In terms of interpretation it is also a benefit that the assignment is soft, which is especially relevant for tasks where an input sample contains multiple latent topics. And since the assignment can be treated as a probability distribution, samples can also be matched with respect to their distributions and not only the embedding distance. This allows a ranked retrieval by using the Kullback-Leibler divergence between a query and the samples in a database. Predicting unseen data is also no problem; all we need to do is call p_new = sinkhorn(X_new @ C.t()).

Bottom line, with the default parameters we no longer have to fiddle with a schedule or annealing procedure to ensure that the assignment to experts/clusters is balanced. Another benefit is that the method is agnostic to the modality: even though all the papers we found applied it to the domain of vision, it can also be applied to NLP, which we verified in various experiments.

[1] Sinkhorn distances: Lightspeed computation of optimal transport, NIPS 2013.
[2] Unsupervised Learning of Visual Features by Contrasting Cluster Assignments

Sinkhorn: Doubly Sparse Topic Models

We did a literature review of topic models, and when we encountered the term sparsity, it usually meant that the learned topics should be sparse and thus easier to interpret. In other words, most coefficients per row of the topic matrix W of shape (n_topics, n_words) are zero or very small, which means a topic only consists of a very small subset of the vocabulary. This makes the interpretation of each latent topic much easier, but still does not guarantee a human-readable summary. However, this is not the point. The point is that for bag-of-words data the input is also very sparse, but usually this is not explicitly handled by models.

Why is this a problem? When the model is initialized randomly, like NMF, every feature gets a positive weight, and without regularization spurious patterns might grow, even if the pattern is very rare in the input data or not present at all. With a sparsity constraint, those weights are likely pushed towards zero, either by weight decay or by a hard non-zero sparsity constraint.

Still, this is not optimal, since the sparsity of the input data can easily be 99%, which means roughly 20 tokens out of a vocabulary of 9,000. As a result, patterns should emerge only from combinations present in the input data, and it seems wrong to randomly initialize all connections only to push most of them down to zero if they are irrelevant. Stated differently, all topics should be initialized identically, but then we have the problem that we cannot break the symmetry.

Rather by accident, we stumbled upon [1] while trying to find a way to partition a dataset into latent topics with the constraint that the 'buckets' should be equally filled if possible. A lot of traditional methods like K-Means have the problem that too often very few clusters get most of the attention, and then we end up with lots of 'dead' clusters and some winners with very unbalanced 'bucket' sizes. One reason is the hard assignment of K-Means and that randomly initialized clusters do not work well with very sparse input data.

So, Sinkhorn K-Means to the rescue? Well, yes, because it addresses two major issues:
(1) The hard assignment is turned into a soft one, like applying a softmax as a weighted mean.
(2) Entropy regularization discourages peaky distributions where an input sample is always
  assigned to a single cluster. {More details: Appendix (C) in [1]}
Thanks to the math, the final algorithm {Algorithms 1, 2 in [1]} is very simple but also very condensed, which makes understanding it a bit more challenging. However, since the authors kindly provided the source code, a 'test first and understand later' approach is no problem.

Recap: Lots of topic models have a tendency to degenerate, which means long-tail topics are almost never learned since the training is dominated by popular topics. In terms of clustering this means most of the data is assigned to very few clusters, while a lot of clusters are practically 'starving'. For neural models there is a similar effect: if a topic wins, it gets the gradient update, so a reinforcement effect sets in that must be addressed by explicit regularization. But even then, a degeneration might still occur.

But why is Sinkhorn K-Means suited for sparse input data? A good question that can be easily answered. At first, the clusters are initialized with small Gaussian noise, just to break the symmetry, because without this step the affinity scores would be equal for all clusters. But then the assignment is purely driven by the data, and with sparse data only those terms that are actually present are updated. In the end, since the input data is forced to split 'equally' among all clusters, it is much easier to summarize a cluster with respect to the assigned data. Why? Because, naturally, the weights for tokens that are not present in the assigned inputs stay below a threshold that can be easily derived, or are actually zero.
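As a compact sketch (our own simplified reading, not the authors' code from [1]; epsilon, the iteration counts and the noise scale are assumptions), the whole procedure fits in a few lines:

```python
import torch

def sinkhorn(scores, eps=0.05, n_iters=10):
    # Balanced soft assignment via Sinkhorn-Knopp normalization.
    Q = torch.exp(scores / eps)
    Q = Q / Q.sum()
    n, k = Q.shape
    for _ in range(n_iters):
        Q = Q / Q.sum(dim=0, keepdim=True) / k  # each cluster: 1/k of the mass
        Q = Q / Q.sum(dim=1, keepdim=True) / n  # each sample: 1/n of the mass
    return Q * n  # rows are per-sample cluster distributions

def sinkhorn_kmeans(X, n_clusters, n_steps=20, seed=0):
    # X: (n_samples, dim), rows l2-normalized. Clusters start as small
    # Gaussian noise, just to break the symmetry; the assignment is then
    # purely driven by the data.
    g = torch.Generator().manual_seed(seed)
    C = 0.01 * torch.randn(n_clusters, X.shape[1], generator=g)
    for _ in range(n_steps):
        P = sinkhorn(X @ C.t())  # balanced soft assignment
        C = P.t() @ X            # weighted cluster means
        C = C / C.norm(dim=1, keepdim=True).clamp_min(1e-9)
    return C, P
```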

Example: We trained on a dataset with ~1K rows and 7K distinct tokens. The dataset contains six classes, some of which do not overlap at all, while others might have some topics in common. We trained the model with 50 latent topics (clusters). The assignment is nearly uniform, ~20 samples per cluster, and each cluster has about 250 'distinct' keywords. While 250 * 50 would be 12,500 tokens, which means there is some overlap between topics, the result is pretty good for an off-the-shelf method without any tuning. We also studied the top-k words per cluster and they clearly follow a topic, like wedding, tattoos, animal training, baking, cooking or crime, all of which are present in the data.

Bottom line: We do not always need a Transformer to get awesome results, which is in line with our Green AI research line. We are searching for methods that are both biologically more plausible and very lightweight with respect to computational complexity, both for training and inference. Sinkhorn K-Means might not be biologically plausible, but it is very efficient to train, delivers a very strong baseline compared to NMF or LDA, and further allows partitioning a dataset into a set of (almost disjoint) latent topics with very few resources, if the number of clusters is chosen properly.

[1] 1902.08605 “Are Few-Shot Learning Benchmarks too Simple?”

Neural Network: Building Blocks

Not long ago, there were very few building blocks for neural networks. Weight initialization schemes like 'xavier uniform' were a step forward to learn anything at all, as were all the various normalization methods that modulate the flow of the gradient through the network.

At a higher level, like in ResNets, the building block spans the whole network architecture and not only a part of it. Therefore, in computer vision there is usually no need to come up with something new to solve a broad range of problems: just take a ResNet as a baseline and surprisingly often it suffices. The idea, which is also used in later architectures like the Transformer, is to define simple building blocks and repeat them in a stack-like manner until you reach a certain depth, which defines the capacity of your model.

With the introduction of Transformers, the situation also changed a lot in the domain of NLP, but in contrast to ConvNets, the training of such models is still often non-trivial. One reason is that bigger models need tons of data, and since there is no inductive bias like the convolution + filters of ConvNets for 2D images, you also need very fast machinery for training.

Furthermore, operating nets with billions of learned parameters is also challenging, which is why another building block would be required: to distill those parts of the network one needs for a certain task and ignore the others. But in this area, no real standards have been established yet.

Nevertheless, the Transformer architecture is still useful for lots of tasks even if less data is available. In such a situation you won't get a general NLP model, but very often the benefits of the self-attention building blocks suffice to generalize well for the problem at hand, like classification problems trained in an end-to-end manner. And thanks to the building blocks, you just need to add some new layers if the capacity of the network needs to be increased.

However, we are still far away from a 'formal' method to design a network that matches the complexity of the problem and also a possible budget for training. Not to mention that even with building blocks, the training procedure can still be very painful and often requires patience and experience. But it seems that we are slowly making progress in the right direction, even though it might not be the best idea to bet everything on a single horse, especially if the power of an architecture is purely driven by the number of layers and the amount of training data.

Learning Local Neighborhoods With NCA

Despite its simplicity, Nearest Neighbors (NN for short) is still a very powerful baseline for classification. Except for the number of neighbors, there are no parameters to tune or learn; the accuracy depends solely on the quality of the feature representation. For domains with highly entangled representations, the results might be very disappointing. Thus, it seems reasonable to use some neural net encoder to disentangle the input data first and then to jointly optimize the NN loss on this representation.

We experimented with a soft NN loss function [1] for a while and we also stumbled again upon NCA [2] (Neighborhood Component Analysis). Especially the note on the last slide of [2], "This does not force all points of the same class to be similar.", caught our attention for several reasons. First, in the case of clustering or softmax classifiers, a single template is learned for the whole class, which can be burdensome if a single class contains a lot of variance. In this case, the network needs sufficient depth to disentangle those aspects to learn the final prototype template that allows a linear separation of classes.

However, in the case of NCA, which is a soft version of NN, an arbitrary sample just needs to be closer to some other sample with the same class, but not to all of them, and a bit further away from samples with a different label [2]. So, instead of wasting energy on collapsing all samples of the same class into a single point, we just need to ensure that the probability mass of all samples with the same label is maximal.

i = 0                                       # reference sample
k = torch.arange(len(batch))
dists = ((batch[i] - batch)**2).sum(dim=1)  # squared distances to all samples
dists[i] = 1e9                              # mask out 'self', since dist(x, x) is always 0
probs = torch.softmax(-dists, dim=0)        # dists is 1-D, so softmax over dim=0
probs_leave_one_out = probs[(labels == labels[i]) & (k != i)]
loss = -torch.log(probs_leave_one_out.sum())

In words: we select the zeroth sample from the batch, calculate the distances to all other examples, mask the i-th out, since dist(x, x) is always zero and would otherwise contribute to the softmax, and then we only consider the probabilities of samples with the same class, again leaving out the i-th example.

The classical softmax would push the whole mass towards a single sample, while we are happy if the whole mass belongs to samples with the same label. This makes learning much easier, since it does not matter whether a single positive sample is nearby or a dozen, as long as the negative samples are a bit more unlikely neighbors according to the softmax output.
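The snippet above handles a single reference sample; a vectorized variant over the whole mini-batch (our own sketch, the function name is ours) looks like this:

```python
import torch

def nca_loss(batch, labels):
    # Leave-one-out NCA loss, averaged over all samples in the batch.
    n = len(batch)
    dists = torch.cdist(batch, batch).pow(2)         # (n, n) squared distances
    dists.fill_diagonal_(1e9)                        # mask out 'self'
    probs = torch.softmax(-dists, dim=1)             # per-row neighbor probabilities
    same = labels.view(-1, 1) == labels.view(1, -1)  # same-label mask
    same = same & ~torch.eye(n, dtype=torch.bool)    # leave one out
    mass = (probs * same).sum(dim=1)                 # positive mass per sample
    return -torch.log(mass.clamp_min(1e-9)).mean()
```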

To illustrate this with a numerical example: let us assume that we have 6 examples, where three are positive and three are negative. Then all the following softmax outputs would lead to a loss of (close to) zero

#1 [0.96, 0.02, 0.02, ~0, ~0, ~0]
#2 [0.33, 0.33, 0.33, ~0, ~0, ~0]
#3 [0.50, 0.25, 0.25, ~0, ~0, ~0]

since the sum of the positive samples is always close to one.
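A quick numerical check in plain Python (the distributions are rounded so that each sums to one) confirms that the loss only depends on the total mass of the positive entries:

```python
import math

# Loss = -log(sum of the probabilities of the positive samples),
# here the first three entries of each softmax output.
losses = [-math.log(sum(p[:3])) for p in (
    [0.96, 0.02, 0.02, 0.0, 0.0, 0.0],
    [0.33, 0.33, 0.33, 0.01, 0.0, 0.0],
    [0.50, 0.25, 0.25, 0.0, 0.0, 0.0],
)]
# Each loss is (close to) zero: a flat split like [0.5, 0.25, 0.25] is
# just as good as a peaky [0.96, 0.02, 0.02].
```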

As stated in [2], this means the manifold can be fractured, since there might be different regions for the same label that are not connected. But frankly, we do not care, as long as any new example is mapped into a region where at least one correct example, with respect to the label, is present and any sample with a different label is just a bit more unlikely.

A drawback of NCA is that the optimal solution requires all training examples, which does not scale due to the quadratic complexity of calculating all distances. The problem is that if there exists a neighbor for some input x_i that would preserve the local neighborhood of x_i, but no such neighbor is in the (mini-)batch, the neighborhood will be either distorted or even destroyed. A solution to the problem is memory + momentum encoders, like [3], which make it possible to scan many more samples to find nearest neighbors that might preserve the local neighborhood. For those samples, no gradient is derived, which means the gradient is only propagated through the mini-batch data and the momentum samples are treated as constants.

Bottom line, the strength of NCA is that it is often able to decompose categories into sub-categories, even if the labels are already fine-grained. This works because the network is not forced to summarize a category with a single prototype template, at the cost that an inference step now requires determining the k nearest neighbors from memory.

[1] “Analyzing and Improving Representations with the Soft Nearest Neighbor Loss”
[2] CSC2535 Lecture 11: “Non-linear Dimensionality Reduction”, Hinton – Toronto
[3] “Improving Generalization via Scalable Neighborhood Component Analysis”

Simple Unsupervised Learning of Contextualized Bag-of-Words Representations


The current setup we are working with is more information retrieval than NLP, but of course we also work with textual data. In contrast to most NLP problems, however, we treat the input not as an ordered sequence but as a set, or bag-of-words. This is no problem, since we neither generate output that needs to be meaningful in a grammatical sense, nor do we have any problems where a swapped pair of words would be problematic in any way. Why do we mention this? The popular Transformer architecture has a position embedding, which is exactly what is required to turn the input, a set, into a sequence, but this is exactly what we do not want. However, without the position embedding, a token can only be used once, since otherwise identical tokens would degenerate into the exact same representation.

Long story short, what we want and need is an encoder that turns a variable-length set into a fixed representation that is supposed to summarize the 'relevant' parts of the set. Of course, the relevance is subjective and is defined by the task and/or loss function we use to train the representation. In other words, what we want is a simple model to derive a 'sentence' representation, but without taking the order of the words into consideration. Furthermore, we do not want to use any supervision or labels, since it is very hard work to keep a dataset up-to-date so that it stays clean, without too many errors or contradictions, not to mention the time required to create one!


We could use the masked language modelling (MLM) approach, but as mentioned before, without the position embedding we cannot use repeated tokens, which is a show stopper. We also do not want to use any auto-regressive approach, since those approaches usually induce a high computational complexity. So, what's left? We could use contrastive learning, but then we need (lots of) negative samples, and depending on the task there is a good chance that a negative sample is also related but nevertheless pushed away from the 'anchor'. Plus, for some setups a lot of negative samples are required, which also induces a high computational burden, even with very efficient matrix multiplications.

So, is there something else? Yes, there is a simple approach that we described in [1], and the beauty of the method is that we do _not_ need any negative samples. All we need is a function that returns different views of the same sample, which is easy for images but a bit harder for text. But instead of spending too much time coming up with a (not so) clever idea, we tried the following baseline first.

To turn a set of words into two different views, we convert the set into a list, shuffle it and split the list into two halves to get a "left" and a "right" part. The split is done at offset len(input)//2, which ensures that left and right are almost of the same length. The randomization is a kind of augmentation to prevent the model from learning a shortcut in case the two halves are always the same. With this method, the model is forced to learn a representation that is useful for any permutation of the input. However, even with this approach, the importance of 'common' words on each side might be overestimated.
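The splitting scheme just described is a one-liner in spirit (the helper name is ours):

```python
import random

def two_views(tokens, seed=None):
    # Shuffle a copy of the token list and split it at len(tokens) // 2,
    # so both halves are almost of the same length.
    rng = random.Random(seed)
    tokens = list(tokens)
    rng.shuffle(tokens)
    half = len(tokens) // 2
    return tokens[:half], tokens[half:]
```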


What follows is a quick recap of how the method works. The augmentation is done outside the model, so for the input x, we get x_left and x_right, where x = x_left || x_right. Both parts are fed into the same encoder f(x), which returns a fixed-length vector. All output is l2-normalized. The alignment loss is then just the squared difference of the two ‘views’:

loss_align = ((f(x_left) - f(x_right))**2).sum()

The procedure is repeated for every sample in a mini-batch. So if our mini-batch consists of 64 samples, the output H contains 2*64 representations, which can be reshaped into (64, 2, dim) to efficiently ‘batch’ the loss:

loss_align = ((H[:, 0] - H[:, 1])**2).sum(dim=1).mean()

In other words, the loss forces the two views to be close in the representation space, which makes sense since they are obviously related.

The uniformity loss is not as simple, but still very intuitive. The aim is to spread all representations on the unit hypersphere to avoid collapsed regions and to make use of the whole space. Thus it follows immediately that the two loss functions are competing, and if one wins, the representation is useless. For example, it is trivial to collapse all samples into a single point, and then loss_align is always zero, but the uniformity loss is maximal. On the other hand, if all points are maximally spread, the distance between x_left and x_right of any view is also maximal.

To derive the uniformity loss, we need all pairwise distances of the output. PyTorch provides a dedicated function torch.pdist for exactly this task, but since we use a 2nd-order optimization [2], we had to implement our own variant. For simplicity, let us use the PyTorch function:

loss_uniform_l = torch.exp(-3*torch.pdist(H_left, p=2)**2).mean().log()
loss_uniform_r = torch.exp(-3*torch.pdist(H_right, p=2)**2).mean().log()
loss_uniform = loss_uniform_r * 0.5 + loss_uniform_l * 0.5

For more details, see the website [3] of the authors. But note that if all distances are close to zero, the term evaluates to exp(-3*0).mean().log() = log(1) = 0, which is the maximum of a loss we try to minimize.
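To make the collapse argument concrete, here is a pure-Python rewrite of the uniformity term for tiny examples (the function name uniform_loss and the toy points are ours):

```python
import math
import itertools

def uniform_loss(points, t=3.0):
    # Log of the mean Gaussian potential over all pairwise squared
    # distances; mirrors torch.exp(-t * torch.pdist(H)**2).mean().log()
    pots = [math.exp(-t * sum((a - b) ** 2 for a, b in zip(p, q)))
            for p, q in itertools.combinations(points, 2)]
    return math.log(sum(pots) / len(pots))

# Collapsed case: all representations identical, every distance is zero,
# so the term evaluates to log(exp(0)) = 0, its maximum.
collapsed = [(1.0, 0.0)] * 4

# Spread case: four points spread over the unit circle, the loss is negative.
spread = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
```

uniform_loss(collapsed) gives exactly 0, while uniform_loss(spread) is negative, which is what the minimization pushes towards.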


We have to admit that our contributions are minor, but let’s take a closer look at the details. First, by avoiding negative samples we do not accidentally push relevant content away. This is nothing new and exactly the method of [1]. But for text, simple augmentations are not trivial, and we empirically verified that the randomized split learns representations that are able to group related content without supervision, and those groups often match the corresponding ground-truth labels. This means that even a very simple ‘view’ suffices to learn representations that capture both semantic and topical aspects. This is the first contribution.

The second and last one is that we only used very simple building blocks, like hash-based word embeddings [4] and a very shallow model to encode the set into a fixed-length representation. Actually, we just used kernelized attention with a learnable token plus some non-linearity, which again suffices to learn a useful representation for a vocabulary of ~100K tokens and ~50K samples. We tested the generalization of the method by applying the encoder to a dataset with ~6K tokens and ~1K samples. This test set contains labels to allow a visual summary of the representation based on t-SNE, which makes it easy to analyze how well related content is grouped in the feature space.
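As a rough sketch of such a set encoder with a single learnable query token: here we use an elementwise square as a stand-in for the quadratic kernel feature map, and all names, shapes and the exact kernel are our assumptions, not necessarily the original setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def set_encoder(X, q, phi=lambda x: x ** 2):
    # Kernelized attention pooling with one query token q:
    #   out = phi(q) @ (phi(X).T @ X) / (phi(q) @ phi(X).sum(0))
    # Summing over tokens makes the result permutation-invariant,
    # i.e. the input is treated as a set, not a sequence.
    K = phi(X)                     # (n_tokens, dim) feature-mapped keys
    num = phi(q) @ (K.T @ X)       # (dim,) weighted sum of values
    den = phi(q) @ K.sum(axis=0)   # scalar normalizer
    out = num / den
    return out / np.linalg.norm(out)  # l2-normalize as described above

X = rng.normal(size=(7, 16))  # a set of 7 token embeddings
q = rng.normal(size=16)       # learnable query token (here just random)
z = set_encoder(X, q)         # fixed-length, l2-normalized representation
```

Because keys and values are aggregated with a sum over the tokens, shuffling the rows of X leaves z unchanged, which is exactly the order-invariance we asked for at the beginning.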


We surely did not find the holy grail, but it is satisfying that we don’t need a Transformer model with billions of parameters to learn something useful with respect to a real-world problem; in our case, the semantic retrieval of related content given a simple query. The key point is that the model was never trained to rank query + documents, since it just learned to relate content in a local scope, namely the sample itself. The success is likely owed to the limited model complexity that acts as a regularizer, and to the weight sharing that forces the model to learn a word embedding that captures lots of different regularities.

[1] [Alignment + Uniform]

It is Important to Use Your Heads

We experimented a lot with attention recently, and not only self-attention, since we still think that attention and context are key to building robust and versatile neural nets.

In the literature, like [1], the question is asked how many heads we really need, which is a very good question with no definite answer. Especially for larger Transformers with dozens of layers, it is hard to determine how much a specific head contributes to the final task, and the answer might also differ depending on the task. But since our resources are very limited, we focus on a very simple example to analyze the different settings.

The task is to classify some textual content into one of eight classes. The challenge is that the vocabulary is very dynamic due to spelling errors, proper nouns and slight variations that mean the same thing. This is the reason why we use dynamic word embeddings [2]: we can embed all words, including the ones not seen during training, and there are no out-of-vocab tokens. Because it is unlikely that we need a full-fledged Transformer, we started with a simple baseline. The network consists of the embedding layer and a single attention layer, plus a linear output layer to predict the classes. In contrast to most implementations, we use a kernelized attention layer [3] with a quadratic kernel as the feature map.
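The idea behind such out-of-vocab-free embeddings can be sketched as follows: every string, seen or unseen, is hashed into a few rows of a fixed table whose sum is its embedding. The hashing scheme, the table size and the number of hash functions here are our assumptions, not necessarily those of [2].

```python
import hashlib
import numpy as np

def hash_embed(word, table, n_hashes=2):
    # Map any string to n_hashes deterministic bucket indices and sum
    # the corresponding table rows; unseen words get an embedding too,
    # so there is no out-of-vocab token by construction.
    n_buckets = table.shape[0]
    idx = [int(hashlib.md5(f"{i}:{word}".encode()).hexdigest(), 16) % n_buckets
           for i in range(n_hashes)]
    return table[idx].sum(axis=0)

table = np.random.default_rng(0).normal(size=(1024, 32))  # trainable in practice
v = hash_embed("word-never-seen-in-training", table)
```

Since the hash functions are deterministic, the same word always maps to the same embedding, while typos and variants simply land in (mostly) different buckets.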

To avoid tuning the learning rate, we use our default optimizer AdapSGD [4]. Except for label smoothing and gradient clipping, no other tuning or regularization is done. Now we treat the number of heads as our only hyper-parameter. Since it is burdensome and often impossible to analyze what pattern a single attention head really looks for, we first study the effect on the loss and the prediction errors.

The dimension of the embedding is 100. We evaluated the following numbers of heads: 1, 5, 10, 20, 50. Because multiple heads are implemented by reshaping tensors, a higher number does not introduce additional parameters. In PyTorch this is done by using x.view(1, n_seq, n_head, -1). However, the induced subspace gets smaller and smaller: 1 => 100, 5 => 20, 10 => 10, 20 => 5, 50 => 2.
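A quick numerical check of this reshaping, with numpy's reshape as a stand-in for the torch view call:

```python
import numpy as np

n_seq, dim = 12, 100
x = np.zeros((1, n_seq, dim))

# Mirrors x.view(1, n_seq, n_head, -1): the very same 100-dim vectors are
# just regrouped, so no parameters are added, but each head now attends
# in a subspace of size dim // n_head.
subspace = {}
for n_head in (1, 5, 10, 20, 50):
    subspace[n_head] = x.reshape(1, n_seq, n_head, -1).shape[-1]
```

The resulting subspace sizes are {1: 100, 5: 20, 10: 10, 20: 5, 50: 2}, matching the list above.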

It might be reasonable to assume that there is a peak at some point and then the performance decreases, but astonishingly this is not visible for the training(!) loss + accuracy:

 1: 2.6498 errors=868 of 1024
 5: 2.5302 errors=847 of 1024
10: 2.4477 errors=831 of 1024
20: 2.3499 errors=804 of 1024
50: 2.2531 errors=784 of 1024

For the last setting, we have 50 heads, but each head only consists of a two dimensional subspace which is finally concatenated into a 100 dimensional vector again. Still, we get the best performance.

We trained every setting for five epochs to see how it develops and the trend stays that more heads lead to better results. At the end, the number of errors is:

 1: 447 of 1024
 5: 411 of 1024
10: 371 of 1024
20: 363 of 1024
50: 355 of 1024

During training, the advantage of the bigger number of heads shrinks, but despite the tiny subspaces, the larger settings still perform very well compared to the others.

Of course this is no proof that more heads are always better, since we did not consider the generalization of the network at all and we only considered a single task, but to us the results are a bit surprising. We expected that an attention subspace must be ‘sufficiently’ large to positively contribute to the performance, since for the default Transformer setting, with 12 heads and 768 dims, each subspace has a size of 64.

Bottom line: for our real setting, we ended up with 100 dims and 10 heads, and even if some heads learned the same pattern, or no pattern at all, the final performance, also on the test set, was noticeably better when we doubled the number of heads from 5 to 10. What is disappointing is the fact that there is no easy way to explain what happens inside the model that causes the advantage. Even for our small model, dissecting each head to derive a pattern seems not feasible. Maybe we could do a visual exploration, but this is also burdensome and there is no guarantee of success.

[1] NIPS 2019: Are Sixteen Heads Really Better than One?