One Day In The Neural Treadmill

We are still pursuing our SOSO idea [1] and during the literature research, we stumbled upon mixtures of experts (MoE). The idea dates back to the early 90s and is quite simple: instead of having one big network do all the work, we delegate the decision to one or more experts. The architecture consists of two building blocks:
(1) a controller network that decides which experts to ask, based on the current input
(2) a set of expert networks that are as simple as possible.
The controller outputs a probability distribution and, depending on the design, only those experts are asked whose probability is above a threshold, or only the top-k experts. The idea is to limit the computations required at test time to derive a decision. MoEs can also be used for regression.

While we followed this track, we stumbled upon ‘Active Memory Networks’ [2], which are related to the idea. Since one of our goals is topic learning, we gave it a try: one conclusion of the paper was that the memory cells often learned topics, without explicitly being given topic labels or being encouraged to learn such information with a specific loss function. We found the hyper-parameters in the PhD thesis of the first author.

In contrast to our other posts, this one is about the frustration when training silently fails. In [2] the authors also mention one big issue and propose a solution, but as so often, the solution seems to be data dependent and there is no golden rule in general. The issue is that gating functions based on the softmax often converge to absolutely useless states: either the uniform output, or a single unit gets all the score, which is reinforced over time so that all other units quickly die. We tackled this problem by subtracting the mean activation value [3], while [2] proposes temperature-based annealing. The problem is that in the derivative, the gradient for unit ‘i’ is multiplied with its probability score and thus, if the score is very small, the unit gets almost no update, and if it is close to one, this unit gets all the weight, which is a kind of reinforcement. With the temperature, a near-uniform probability is forced during the early epochs, which means all units get updated. When the annealing is done, the temperature T is set to one, which recovers the classical softmax output. This can be considered a warmup that is slowly turned off to recover the original state in a step-wise, but continual fashion.
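As a sketch, such an annealing schedule could look like the following; the start temperature and the multiplier gamma are made-up values for illustration, not the ones from [2]:

```python
import numpy as np

def gate_probs(logits, T):
    """Softmax with temperature T; a large T pushes the output towards uniform."""
    z = logits / T
    z = z - z.max()          # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 0.5, 0.1])
T, gamma = 10.0, 0.9                 # hypothetical start value and decay multiplier
for epoch in range(50):
    probs = gate_probs(logits, T)
    # ... train one epoch with `probs` as the gating distribution ...
    T = max(1.0, gamma * T)          # the warmup is over once T reaches 1

# after annealing, T = 1 recovers the classical softmax output
```

During the first epochs the gate is close to uniform, so every unit receives gradient; only later does the distribution sharpen.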

As also noted by the authors of [2], it is unlikely that such a complex network automatically adjusts itself to learn a powerful, but also selective representation. One problem is definitely co-adaptation, which means that units likely cooperate and rely on each other in case of errors. To promote diversity, a special loss function is added and dropout is used. And as usual, most hyper-parameters interact with each other, and we further need a working baseline to start tuning at all. So, finally, we need to take care of at least:
– the temperature T, the multiplier gamma and a schedule for when to derive a new T value
– the amount of dropout in the memory cells
– the regularization penalty lambda for the ITL loss
Seems manageable, but the problem is that we need to wait until the warmup is done to analyze the probability distribution of the controller, since earlier it is surely almost uniform.

Just a quick note: this issue is not limited to this particular method, which is why it is so important to mention such pitfalls in a paper and, if there is no general solution, at least give some hints where to start. Still, even with those hints or solutions, there is no guarantee that they will also work with a new dataset, at least not without considerable time spent on finetuning.

Bottom line, it is often just a dream that one can re-implement some neural method found in a paper and it works out-of-the-box with the data at hand. When we read papers, we usually wonder how much time the authors spent until a working solution was found. The problem is that the loss landscape with millions of parameters is very hard to understand, every (hyper-)parameter has an impact, and due to the chain rule of backprop, a tiny modification propagates through the network and might be magnified or damped depending on the dynamics.

At the beginning we said silently fail, but why silently? The final accuracy of the network might be reasonable or even perfect, but the attention mechanism might still be useless, since it often performs either averaging or one-hot selection. For a complex dataset it is reasonable to assume that a real mixture of experts should be used to derive the final representation. However, when the architecture is powerful enough, there is no benefit for the network in learning a non-trivial attention mechanism, since the other parts of the network compensate for the problem.

[2] “Active Memory Networks for Language Modeling”, O. Chen, A. Ragni, M.J.F. Gales, X. Chen

Encouraging Diversity for Softmax-Based Topic Models

These days reinforcement learning is hot stuff, but reinforcement can be a real pain if you try to train a mixture of experts with a softmax. Why is that? At the beginning, you initialize your model, maybe you even l2-normalize your experts and your input data, but usually this does not help either. With normalized vectors, the range of scores is bounded to [-1, +1], or equivalently, the dot product is the cosine score. So, you start by drawing some input and then determining the scores:
scores = torch.softmax(torch.matmul(input, experts_W)/T, dim=1)
where T is the temperature to sharpen the softmax. Depending on T, there is a clear winner, or rather a small subset of winners, and for the rest, the margin between the winner and this particular expert is notable, which is amplified by the softmax. Now, the loss is determined and a gradient step is taken. What happens? The winners get updated according to the loss and the rest stays pretty much the same. The effect accumulates with each new step: a (small) subset of experts matches the data, while the rest is ignored and gets almost no attention. The problem is similar to dying ReLU units: once they are ‘knocked off’ the feature landscape, they never become active again.

However, by drawing inspiration from Hebbian learning, the issue can be fixed very easily. What is the problem? The inner product between the input and each expert from the matrix experts_W is large for those ‘reinforced’ units and small for the rest. So, we keep track of the average ‘activation’ score of each expert and subtract it before it goes into the softmax:
scores = torch.softmax((torch.matmul(input, experts_W)-mean_act_W)/T, dim=1)

To illustrate this with numbers, let us assume that we have three experts, e1, e2, e3 = experts_W, and the average activation is mean_act_W = (0.7, 0.01, 0.1). Depending on the temperature, e1 gets most of the gradient update: let us consider a sharpening factor of T=4 (used as a multiplier here) and scores=(0.8, 0.015, 0.12), then the softmax looks like this: [0.9016, 0.0390, 0.0594]. With each new gradient step, the same experts are reinforced, while the others get only very small updates and finally might die. If we treat the mean activation as an inhibitor, other experts now get a chance to become active: softmax((scores – mean_act_W) * 4) = [0.4149, 0.2838, 0.3013]. This little ‘trick’ is very simple and does not introduce any new parameters, since the mean activation can be easily tracked and does not itself require any gradients. At test time, there is no tracking and mean_act_W is set to 0.

To analyze the potential to learn orthogonal codebooks that capture different topics in some dataset, we started to develop our SOSO method from [1]. Our setup stays the same: the dataset consists of ~1,000 short documents and the vocabulary size is 6,000 tokens. We enforce sparsity by keeping only K experts and masking out the rest, but even with this competition there is no diversity, due to the reinforcement effect. With the inhibitor trick, the problem is gone and all experts learn useful topics.

So, let us take a look at our sub-goals:
(4) orthogonality with SGD is challenging, since we have a sum of loss functions and without proper scaling, one loss might dominate the other or be ignored altogether. Furthermore, the term ||W*W.T – I||_2 does not enforce orthogonality, it just encourages it. However, we checked the correlation of some experts and the numbers confirm that the overlap is indeed low.

(3) sparsity is usually enforced by allowing only K non-zero coefficients and setting the others to zero. We do this with the scores that are fed into the softmax: we use argsort and mask all non-top-K coefficients with -inf, which is similar to how padded tokens are handled in self-attention.

(2) by definition, SGD is online when the batch size is set to one, so this goal is no problem.

(1) well, ‘simple’ is not a precise concept, but at least we can train the method end-to-end and there is not much to tune. The only hyper-parameters are the number of topics and the embedding dimension.

Bottom line, it seems that our journey into classical machine learning ends before it really started and we are back again in the PyTorch + backprop world. This is not necessarily a bad thing, especially since we could solve our first problem with inspiration from neuroscience.


Revisiting Codebook Learning

During the last weeks we felt a little enthusiasm, mainly because our simple models delivered solid results and are often biologically more plausible. So much for the preliminary results, but as soon as we dived deeper, the phase of disillusion came.

Why? We experimented with the methods from [1, 2]. The data we used for training consists of 1,000 sentences, where each input can be described with a label like ‘wedding’, ‘baking’, ‘cooking’ or ‘dogs’. In general, all those methods learn feature detectors to describe patterns in the data. Since the labels are not used, the model acts as an explorer that derives features from patterns it sees during training, and those are then fixed in hidden units.

Now comes the pigeonhole principle: let us assume that the number of topics in the data equals the number of labels, and let us further assume that a model can perfectly recover those labels/topics in an unsupervised way. This is reasonable to assume when the overlap of the topics is minimal, and we already verified that this works for our data with a t-SNE plot of the learned features. So, we assume that we have N labels and K hidden units. Then, different cases are possible:

(1) If K < N, not all concepts can be learned and to describe the data, a unit must detect more than one topic.
Depending on the frequency of each topic, it is likely that the weights for the common topic overshadow the other topic, which means the dot product is smaller for the minor topic.

(2) If K = N, all topics can be learned and by using the labels as a colormap, a ‘perfect’ clustering
should be visible with a t-SNE plot, or at least a separation of the classes and groups of the same class.

(3) If K > N, we have more units than topics, which means a topic will be learned twice, or a slight
variation of it, or a mixture of several topics. It is also possible that a spurious topic is learned
if there is a sufficiently large correlation of some features.

For case (3), method [2] will not learn things twice due to the orthogonalization, but it tries to learn something new, which might not be possible. For instance, an input might be about cake and wedding, which means we could label it with {baking, wedding}, but if this pattern is only present a couple of times, the model might ‘forget’ the topic during training, or it is overwritten by a pattern that is more frequent, but maybe also less interpretable.

Furthermore, depending on the hyper-parameters, method [2] has an odd behavior of assigning a large negative value to one topic, indicating that the input is very contrary to it, but this information is not very valuable. And the performance of the model, at least for our setup, strongly depends on a proper scheduling of the learning rate. Thus, we would need to run a grid search for the optimal settings, but the method does not allow efficient batching, which makes it very slow and probably not feasible without larger machinery.

Bottom line, the dream of a simple but elegant method to learn from unlabeled data, at least for our kind of data, still needs some ‘dream work’. We also tried to learn an overcomplete set of dictionaries and then throw away those that are highly correlated, but this does not feel right and is very wasteful with respect to the required computations.

We know that topic modeling is an old hat, but we dream of a method that is:
(1) Simple, to avoid expensive operations
(2) Online, to avoid solving large optimization problems and also to bound the required space
(3) Sparse, supporting sparsity both in the input and in the features
(4) Orthogonal, to avoid wasting capacity to do things twice

Frankly, we never expected a single off-the-shelf method to fulfill all those requirements, which is why we need to come up with our own SOSO method.


Learning Codebooks With Hebbian Learning

One could say that we are back to the roots. Backprop and neural nets are both very powerful, but we have not yet managed to come up with a loss function that allows us to fully control the learned representation with respect to our data. And there is nothing wrong with non-DL approaches, since, whether we admit it or not, sometimes all we need is a linear regression and no transformer ;-). In [1] we discussed a method to learn feature detectors in an unsupervised way, which worked very well, except that the number of neurons needs to be carefully selected for the input data. Otherwise, the learned concepts are mixed up, which means that for any input x, most concepts are active, which indicates a strong overlap of detected features. In [1] we demonstrated that the number of units should roughly equal the number of latent classes. This is bad for several reasons. First, the latent classes might not be known, and second, we might want to learn more concepts than classes. The problem is that if we increase the number of units, some units might die, because they do not get enough updates or no updates at all, or they learn similar concepts, which is redundant and leads to lots of ‘spikes’ in the extracted features for an arbitrary input.

A long time ago [2], we discussed how to avoid this, namely with orthogonalization, which enforces that the correlation of concepts is low or at best zero. This can be easily checked with the dot product of two detectors: if the values are high, there is a noticeable overlap, which is bad. To be more precise, we want to learn N concepts that share as little information as possible. In the case of NLP, it is obvious that there are concepts that share at least some tokens from the vocabulary, but it does not make sense to waste capacity by encoding two very similar concepts in two different units.

There are different methods available to achieve this, but the one described in [3], Orthogonal Sparse Coding (OSC), is appealing, since it can be implemented with very simple linear algebra routines, and thus in numpy without dedicated optimization routines, since it is online. The drawback is that the iteration cannot be fully batched, but let’s focus on representational power instead of performance first. With the enforced orthogonalization, it should be no problem to train more units than latent classes, and due to the nature of the method, there should also be little overlap between the learned concepts. Depending on the number of patterns in the input data, at some point the units will ‘saturate’ or come up with pseudo patterns, but we have not encountered this problem yet.

We ran our experiments from [1] again, but with 30 units instead of 10, as a quick test. As mentioned in the paper, new input can be easily encoded with a =, x) and using np.argsort to keep just the top-K values. As a first step, we used a t-SNE 2D plot of the features to verify that the training actually ‘converged’, which was clearly visible by using the class labels as a colormap. It should be noted that the method in [1] tends to assign a single concept to each input, like in clustering, while [3] uses several concepts if useful.

So, let’s do a quick recap: the aim is to learn a matrix of orthogonal feature extractors, U of shape (n_units, dim_in), and the feature encoding is simply the linear projection a =, x), where x is an input of shape (dim_in, 1). Even without the top-k step, the sparsification, most of the dot products are close to zero, so with a simple threshold we get a sparse vector that indicates which concepts are present in the input.

Let’s take a second look at what we got here. It is a kind of prototype learning, like LDA or NMF, but with the advantage that no energy is wasted to learn things twice. For factor models, an interpretation of the concepts is not always simple and there are often noticeable overlaps between concepts. Plus, OSC works online and thus has a much lower space complexity than NMF, and it de-correlates concepts by design. In the end, such a model can be understood as a set of linear classifiers that predict how well an input matches each concept, where a > threshold, with threshold >= 0, indicates a positive match.

Bottom line, even though the method is simple, it is still very versatile. We could use the features for ranking with a user-provided query, or we could cluster content, either in a soft way or by using the argmax as a cluster ID. We could also use it for query expansion, since if a query matches a concept, the positively weighted words are likely relevant to describe the concept behind the query. It is also possible to use it for preference-based learning, because liked content can be decomposed into concepts, and those liked concepts can further be used to suggest related content to users. Without a doubt, linear features are limited in use, but they often provide a strong baseline with minimal or even no overhead with respect to model training. So, maybe it is time to be simple again.

[3] “Learning Efficient Data Representations with Orthogonal Sparse Coding”

Information Retrieval With Associative Memories

We have to admit that the resurgence of Hopfield networks caught our attention. Despite the hype title ‘X is all you need’ [2], the content is definitely worth thinking about. By the way, there is an excellent blog post with more details [1]. For quite some time we have been working on (neural) methods for semantic retrieval and autocomplete of search queries, which can be summarized as: finding related content using an arbitrary query as a starting point. How is this related? In classical Hopfield networks, we have a set of patterns X and a query x that is an incomplete version of a pattern. For images, the example is often that x is half of the image and the other half is ‘masked out’. For text, we ignore the order for simplicity and think of x as content that does not contain all words from a pattern, but only a subset of them. We further assume that patterns only contain nouns and proper nouns to be more descriptive. Now, let us assume that a user enters some text, the query x, and we want to retrieve the ‘closest’ pattern (document).

For our experiment, we use a simple bag-of-words encoding with a vocabulary of roughly 7,000 words, 1,000 samples and a binary encoding, which means x[i] = 1 if the word is present and x[i] = 0 otherwise. To simulate user queries, we mask roughly 60% of the active words of an arbitrary memory, to see if we can still retrieve it from the masked query. This is only a rough model, since user queries are usually much shorter, but due to the sparsity in the input data, a ‘document’ contains very few active words on average anyway. With the update formula derived in [1, 2], the code is very compact:

Let X be an array of shape (n_rows, dim), beta be 2.5 and x a query of shape (1, dim) submitted to the ‘memory network’. Then all we have to do is calculate:

scores = softmax(beta *, X.T)) # applied row-wise
query_new =, X) # shape (1, dim) again

Since the output is continuous, we have to binarize it with a simple threshold:

query_bin = 1 * (query_new >= 0.5)

And finally, we want to determine the hamming distance to the original memory pattern:

diff = np.abs(X[query_id] - query_bin).sum()

As described in the paper, beta is the inverse temperature and needs to be ‘sufficiently’ high to avoid unstable states.

Let us do a quick recap of what is happening here. First, we determine the dot product between the masked pattern and all memories; the more the input x agrees with a memory x_i, the higher the inner product score. These scores are then normalized to sum to one. The beta parameter sharpens the scores, which means most irrelevant patterns are pushed to a zero score and the ‘few’ relevant ones get most of the ‘probability mass’. In other words, the reconstructed version of the query is a weighted average of all memories. In a stable state, a single memory should be chosen, but it is also possible that memories are superimposed in case the input pattern is too ambiguous.

For our ‘toy’ data, the reconstruction is always perfect, which means we can recover the full memory from the masked input version in one step. What is quite surprising is the simplicity of the method. First, there is no learning step involved, which means no parameters, except for the beta hyper-parameter. Second, the inference step is just a simple weighted average and can fairly easily be optimized and batched. Another way to understand the method is to see it as a nearest neighbor algorithm with k=1 and a very special ‘distance’ metric.

[2] arxiv:2008.02217 “Hopfield Networks is All You Need”

A Down-To-Earth Look At The NLP Progress

We are all amazed by the possibilities of recent AI models, but let’s pause for a moment to take a second look. We have huge models with billions of parameters, trained with large clusters of GPUs on a big dataset. Despite the costs, literally and CO2-wise, the question is: what do we get? There is still lots of potential, so we can stack more layers, train longer and add more data to get even better models. But again, are there any guarantees that such a model learns useful concepts? For example, to reason, to have humor, to be sensitive, or to admit that it does not know the answer instead of making it up? If it takes 1,000 images to learn a category, and even then the model might be easily fooled, how much data and GPU time do we need to learn a generic NLP model?

Not to mention that training such models is already a huge challenge when it comes to computational complexity: efficient batching, distributing data across different GPUs, and also running such a model in inference mode. The capabilities of larger models look amazing, but it still feels like brute force, since all we do is force models to learn correlations between words to get the cloze test right. Those models still learn a lot of other useful stuff about language, but there is no way to control it, or to correct it, in case of systematic errors in the ‘logic’.

In our case, we started with a simple ‘semantic autocompletion’ task, where the order of words is not important, to get a better understanding of the capabilities of smaller models that are trained on limited data. First, the challenges of training such a network on a single GPU also exist, since without efficient batching the training takes an eternity, even for smaller models. The padding of the sequences and masking for the softmax is no rocket science, but it needs some experience when you implement it from scratch. So, solid engineering skills are a plus. Second, how many layers do we need to fit the data? In the case of small-scale problems, the literature is not very useful to answer this question. The closest thing that exists might be the minGPT project by Karpathy, since he is evaluating GPT on ‘toy’ problems. But even for our very modest problem, it takes some hours until the model converges on a laptop with GPU support. The learned correlations between words make sense, but sometimes the results are weird and obviously wrong, probably owed to the unbalanced distribution of some concepts in the data.

It is really sad that learning even pretty easy concepts with neural nets takes a lot of time, weights and data, and even then it sometimes feels like the network just cleverly memorized the data, despite dropout and other regularization tricks. After deep learning became so popular, different paths like ‘biologically plausible learning’ approaches seem to get less and less attention, since in the supervised case, DL works very well, as demonstrated for images and recently also for text.

But the question is, regardless of whether you have labels or not, does it lead somewhere when you just scale up the networks and throw more data at a problem? For instance, systematically evaluating huge models takes time and likely reveals weaknesses, and then the question is how to fix them when the training takes several days or even weeks. Should we really concentrate on just a single path of research? Alternatives exist, but they are not so popular, since they do not scale or cannot be easily trained end-to-end with backprop. But maybe those approaches that require less data and/or are biologically more plausible just need a bit more tinkering to reach the next level? We don’t know, but as shown in our previous post [1], there are researchers following different tracks. It is probably much harder for them, since funding for those projects is likely limited due to the huge shadow that Deep Learning casts over the research landscape.


The Winner Takes All: WTA Neurons Revisited

Nowadays when people hear about AI, they mostly think about Deep Learning and with DL we mean a big neural net, trained end-to-end. There is nothing wrong with it, except that training such a network is often expensive in terms of money and time, not to mention the tons of data you often need to learn a proper representation.

Before DL became the norm, pre-training networks layer by layer was standard, which means that for the next layer, you freeze all weights of the layers below and only adjust the weights of the current layer. The method described in [1] has some parallels to it, except that learning is not done via backprop, but with a more biologically plausible learning rule. However, the idea is still the same, namely to learn good feature detectors, layer by layer, but with a local update rule.

In contrast to the authors, we used textual data. Our dataset consists of 1,000 sentences and each sentence is labeled with one of six labels. The labels are not used for training, but during the analysis phase where we plot a 2D visualization of the data.

We used the Jupyter code from GitHub, kindly provided by one of the authors. We further simplified the design by always using P=2, in which case the expression:
scores =*np.absolute(W)**(2-1), inputs)
is equal to the plain dot product:
scores =, inputs)
and since the first expression is computationally expensive, the rewrite brings a major performance boost. But before we dive deeper into the code, let’s summarize the idea first.

We have a weight matrix W of shape (n_hidden, n_input_dim) that is initialized randomly from a Gaussian distribution N(0, 1), and we have some input of shape (n_input_dim, 1). A good feature detector should agree with the input and thus, the inner product should be large for matching features and low or negative for everything else. This is pretty standard and not limited to this method. But what is new is that we don’t use any global information, like in a matrix factorization.

So, during training, the inner products of the input and all detectors, the rows of W, are determined. The two largest are kept and the second largest is ‘damped’, which is why a negative value, in the code delta (-0.4), is used. Thus, each learning step just modifies two detectors and the ‘loser’ is pushed away from the input pattern. In other words, all units compete for an update, and if the scores are similar, the runner-up unit is forced to learn a different pattern, or it is at least discouraged to detect the one that is already handled by the winning unit. The novelty of the method is the learning rule, not the WTA approach, which is pretty old.

For our dataset, the number of hidden units was crucial for success, so we treated the problem as a clustering task and set the number to the number of available labels. In this case, the matrix W learns prototypes that act as classifiers learned in an unsupervised way. A different way is to interpret the learned detectors as latent topics, which is confirmed when we further analyze the top-k weights per detector. Those words usually accurately summarize the topic of the sentences, like food, cake, pets, or wedding.

Now, let’s see what the features, F =, W.T), look like. Here is a numerical example of four sentences. The label is appended at the end of each line:

-0.08 -0.08 -0.08 -0.08 +1.26 -0.08 -0.08 -0.08 -0.08 -0.08 (10)
-0.07 -0.07 -0.09 -0.07 +1.41 -0.07 -0.16 -0.07 -0.10 -0.07 (10)
-0.03 -0.03 -0.03 -0.07 +0.74 -0.03 -0.06 -0.03 -0.03 -0.04 (10)
+0.11 +0.09 +0.63 -0.00 -0.01 +0.17 +0.17 +0.03 +0.14 +0.01 ( 6)

For the first three rows, which share the same label, one detector is particularly active, while the others are always negative with a much smaller magnitude. This is what we expect when a sentence has just a single topic/label.

The last row is more interesting, since it has lots of positive scores and the winner is rather small with respect to its magnitude.
Thus, we assume that the sentence contains ambiguities and does not easily allow a single label, which is reasonable since not every sentence contains the necessary keywords for easy tagging.

What we should not forget is that these features were learned without any labels, and no global information was used either; only two winning detectors were updated per step. Furthermore, it is just a one-layer network with no non-linearity.

Depending on the problem, a layer-wise training probably improves the representation, but in our case the second layer converged quickly and the 2D plot did not show any benefits in terms of further disentanglement, probably because our problem was too simple.

Bottom line, why should we care if DL works so well and we have tons of data? Well, it still feels crude that we require so many images to learn an underlying concept, and even then, the network can often be easily fooled. And of course there is the classic argument that most of the learning done by humans is unsupervised, which is very relevant since for some tasks, data is scarce and labeling is expensive or even impossible. In total, the method is no drop-in replacement for DL, but it shows that it is worth pursuing research in different directions as well.

Appendix: The simplified code derived from the Jupyter Notebook [2]

# one update step inside the training loop; n_indim, n_hidden, epoch, max_epoch are given
inputs = np.zeros((n_indim, 1))                  # one input sample as a column vector
W = np.random.randn(n_hidden, n_indim)           # detectors drawn from N(0, 1)
lrate = 2e-4 * (1 - epoch/max_epoch)             # linearly decaying learning rate
scores =, inputs)                       # shape (n_hidden, 1)

winners = np.argsort(-scores, axis=0)[0:2, 0]    # indices of the two largest scores
mask = np.array([1.0, -0.4]).reshape(-1, 1)      # delta = -0.4 damps the runner-up
xx = (mask * scores[winners]).reshape(-1)

t1 =, 1), inputs.reshape(1, -1))
t2 = xx.reshape(-1, 1) * W[winners]
ds = t1 - t2                                     # update direction for both winners
nc = np.maximum(np.amax(np.absolute(ds)), 1e-30) # normalize by the largest magnitude
W[winners] += lrate * np.true_divide(ds, nc)

[1] (arxiv:1806.10181) “Unsupervised Learning by Competing Hidden Units”

How Many Sentences Do We Need to Learn a Concept?

With the recent advances in computation, it became feasible to train larger and larger models with more and more data. Compared to the amount of data used to train recent transformer-like NLP models, the original ImageNet dataset with 1.2M images seems rather small. But what do they still have in common? We need about 1,000 images per class, and thus supervision, to train a good classifier. To be fair, contrastive learning recently enabled us to learn without labels, but the amount of required data is still enormous, not to mention challenges like distributed training due to very large batch sizes, for instance. Bottom line, we don’t need labels any longer, but we still need lots of data to learn a good representation, both for images and text.

Now the problem is: what does “a lot” mean? It has been shown that transformers memorize ‘facts’ in their weights, but this only works if those facts are present in the dataset. A good analogy from a Reddit post is that the model “is googling the question”, but in a non-keyword-based way, which can still be valuable. Both images and text can describe a concept in many different ways. For images, the angle, color, perspective and position of the object can vary a lot, which is no problem for a human but a challenge for the model. And for text, there are also different ways to describe something, or to speak about something. Again, humans have less trouble identifying those relations, but models need much more data to learn something, and then we often don’t know if the concept has really been learned or just cleverly memorized.

Thus, it is a bit unsatisfying that there is no way to determine how much data we need to learn a meaningful representation for a specific problem that only needs part of this ‘world knowledge’. We could still fine-tune an existing model and try to distill the subset of knowledge into a smaller model, but the core problem remains. And if the domains of the problems differ a lot, is there an advantage at all in using an existing model? We think Wikipedia, as a project and as a dataset, is awesome, but it feels a bit like brute-force learning where we throw in everything we have and hope that we learn a giant knowledge graph that generalizes to ’42’.

A further problem is that for text, depending on the language, both the syntax and the semantics can be very different and therefore standard pre-processing pipelines might not work. The common denominator for exchanging ideas in research is definitely English, but we should not forget that at most ~400M people speak it as their first language. Not to forget that Wikipedia differs in size for different languages and that we would need a quality check, performed by native speakers of each language, of the ‘world knowledge’ learned by a particular model.

Bottom line, these are amazing times for NLP and AI, but as mentioned elsewhere, not all challenges are related to machine learning. The value of AI should relate to the resources that are required to train a model, and the value is not only precision, but also how many people benefit from such a system. So, time for a solar-powered GPU farm to train your next big model, and there should be more effort to support a broad range of different languages.

PyTorch: Juggling With Loss Functions

Thanks to the existing AI ecosystem, it has never been easier to try out our own -crazy- ideas. We still prefer PyTorch since it is easy to use and very pythonic, but nevertheless also very powerful. However, the framework is just one ingredient and, despite some voices that might disagree, knowing a bit about the math is definitely also very important ;-).

When it comes to probabilities, we often have to juggle with very small values, which can cause instabilities during learning, and in the worst case the notorious NaN problem happens, due to overflows or underflows of gradient values. That is the reason why almost every framework, and PyTorch is no exception, provides ‘numerically stable’ versions of common operations.

For instance, y = softmax(x), where x is of shape (1, N), does not change if we subtract the row-wise maximum of x: y = softmax(x - x.max(dim=1, keepdim=True).values). The advantage is that the range of values that exp() is applied to is much smaller, since the shift affects every x_i of x. Numeric toy example:

x = np.array([0.50, 4.78, 3.57, 3.50])
np.exp(x): array([1.65, 119.10, 35.52, 33.12])
np.exp(x - x.max()): array([0.01, 1.00, 0.30, 0.28])
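
The shift invariance is easy to verify in a few lines of NumPy; a minimal sketch (the helper name is ours, not a library function):

```python
import numpy as np

def stable_softmax(x):
    # subtracting the row-wise max leaves the result unchanged,
    # but keeps exp() from overflowing for large inputs
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

x = np.array([[0.50, 4.78, 3.57, 3.50]])
naive = np.exp(x) / np.exp(x).sum(axis=-1, keepdims=True)
assert np.allclose(stable_softmax(x), naive)
```

The naive version already fails for inputs around x_i = 1000, where exp() overflows to inf, while the stable version is unaffected by any constant shift.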

In our case, most of the problems we have to tackle are a combination of information retrieval and language modelling. In a recent blog post the paper [1] was referenced which has exactly the same goal: enhancing contextualized embeddings with ‘external’ retrieved knowledge.

The loss function is not very complicated, but again, due to numeric stability, it needs to be rewritten a bit.

L = sum_z( p(y|z, x) * p(z|x) ), where the latter is the retrieval probability and the former the prediction probability. Details can be found in [1].

To further stabilize the negative log-likelihood loss, the log step is often combined into a single function, and in the log domain the ‘*’ becomes a ‘+’ operation: L = p_predict * p_retrieve => log_L = log(p_predict) + log(p_retrieve).

l_predict = torch.log_softmax(p_y_z_x, dim=-1)
l_retrieve = torch.log_softmax(p_z_x, dim=-1)
l_joint = l_predict + l_retrieve

So far, this is pretty standard without any challenges or surprises. The last processing step is the summation:

L = -torch.logsumexp(l_joint, dim=-1)

Again, the aim of the function is to compute the following naive steps in a numerically stable way:

loss = -torch.log(l_joint.exp().sum(dim=-1))
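
Putting the pieces together, a plain NumPy sketch with made-up toy logits (illustrative values, not the REALM code; the helpers mirror what torch.log_softmax and torch.logsumexp do):

```python
import numpy as np

def log_softmax(x):
    # stable log-softmax: log p_i = x_i - logsumexp(x)
    m = x - x.max()
    return m - np.log(np.exp(m).sum())

def logsumexp(x):
    # stable log(sum(exp(x))) via the max trick
    m = x.max()
    return m + np.log(np.exp(x - m).sum())

# toy logits over 3 candidate documents z (illustrative values)
logits_predict = np.array([2.0, -1.0, 0.5])   # -> p(y|z, x)
logits_retrieve = np.array([1.0, 0.2, -0.3])  # -> p(z|x)

l_joint = log_softmax(logits_predict) + log_softmax(logits_retrieve)
loss = -logsumexp(l_joint)  # = -log sum_z p(y|z,x) * p(z|x)
```

Since both factors are proper probability distributions, the marginal sum_z p(y|z,x) * p(z|x) is at most one, so the loss is non-negative, exactly as the naive -log(sum(exp(...))) would compute, just without under- or overflow.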

Bottom line, the required math is not much, just a few rules from exp & log calculus. The only ‘challenge’ is to perform the math and to find the appropriate function in your framework that calculates the output in a numerically stable way. And since a lot of operations are used over and over again, there is a good chance that the required function already exists. In general, it is always a good idea to use existing functions, since they have likely been tested and evaluated thoroughly, not to forget that they might be faster due to the use of low-level functions.

[1] “REALM: Retrieval-Augmented Language Model Pre-Training” [arxiv:2002.08909]

The ‘X is All You Need’ Trend

Without access to a large (GPU) machinery, it is common to choose from the model zoo, which is a good idea since training huge NLP models consumes a lot of energy and it would be wasteful to do it twice when the result is the same. However, since the introduction of transformers, most papers were concerned with applying the networks to different tasks, or with reducing the quadratic complexity of the self-attention step. The latter is definitely useful, but the question still is: why does the architecture work so well? To be fair, there are some papers [2] that take a closer look inside such models, but it still feels unsatisfying if we just continue to build larger and larger models with more data and longer training without knowing what the actual key to success is!

Is attention really everything I need? A recent paper with a very similar name [1] demonstrates that in the lower layers, self-attention often resembles the average operation and, as a result, it is possible to replace self-attention with some kind of non-attention averaging, at least in the first layer. However, for higher layers there is no such clear pattern, and thus attention there cannot be easily replaced. But if not all layers require attention, what is the actual pattern here? One advantage of the transformer is that all layers are identical and thus stacking and scaling the architecture is straightforward. If we could split the architecture into two blocks, bottom and top, it would also be no problem, but the insights from recent papers do not allow us to draw a complete picture.
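
A toy NumPy check of this ‘averaging regime’ (our own illustration, not taken from [1]): when the attention scores for a query are constant, the softmax weights become uniform and the attention output degenerates to the plain mean of the value vectors.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 8))   # 5 tokens, value dimension 8
scores = np.full(5, 3.0)      # constant scores for one query position
weights = np.exp(scores - scores.max())
weights = weights / weights.sum()   # softmax -> uniform (0.2 each)
out = weights @ V                   # attention output = plain average
assert np.allclose(out, V.mean(axis=0))
```

So whenever the learned scores in a layer are nearly flat, that layer is effectively a (learned) mean-pooling operation.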

For instance, with these details the question is how large a minimal transformer can be that does not operate in the ‘averaging regime’, and furthermore, when you just use a single self-attention layer as a post-processing operation, does it ever learn specific patterns or just clever averaging? Maybe the latter suffices to improve the model, but this would probably not meet the expectations of the model builder.

Bottom line, this is not supposed to be a complete review, but we do not think that a single building block is all you need. And yes, the paper titles are not meant to be taken literally, but they still transport a message.

[1] Hopfield Networks is All You Need (arxiv:2008.02217)
[2] Synthesizer: Rethinking self-attention in transformer models (arxiv:2005.00743)