So, let’s start with a linear SVM by minimizing the hinge loss which is max(0, 1 – y*y_hat). Because there are lots of off-the-shelf solutions available, all we have to take care of is the sparsity in the input. In case of cross-features it can easily happen that we have to cope with 250,000 features but each sample only has about 25. In case of Python support, there is a pretty good chance that there is support for sparse matrices via scipy/numpy and we are done. Nevertheless, to demonstrate how little effort is required to train such a model without any sophisticated libraries, we started from scratch.

The method is straightforward: First, we create a lookup L to map each word to a unique feature ID. The model W is an empty defaultdict of type float. We draw a random sample (x, y), y in [-1, 1], and sum up all features to get y_hat:

`y_hat = sum([W[L[word]] for word in x])`

First the model is empty and y_hat equals 0. Thus, the loss is: max(0, 1 – y*y_hat) = 1. Next, we derive the gradient, which is extremely simple for the hinge loss, and update the parameters accordingly.

for word in x:

W[L[word]] -= lrate * (-y)

Since we treat all words as equally important, for the sake of simplicity, the update rule just increases/decreases a feature proportional to the learning rate according to the label:

w_i = w_i - lrate * (-1) => w_i += lrate

w_i = w_i - lrate * (1) => w_i -= lrate

This is neither magical nor unexpected, since we want features that are related to +1 samples to be positive and vice versa negative for -1 samples.

The beauty of this method is that we only need to update the model parameters, features, that are present in the drawn sample which is very efficient. Furthermore, with the lookup and the dictionary, every actual parameter update can be also performed very efficiently and there is no scalability problem in case of millions of features. It is also possible to perform on-line learning, because of the modest model size, which means we just update the model parameters continuously for each new sample (x, y). The advantage is that we can automatically cope with drifts in the preferences over time which means nothing more than to slowly shift the weight from some features to others.

Once the model W is trained, the parameters can be easily serialized as a Python dictionary, along with the lookup L. Then, at test time, the prediction just requires us to sum the weights for the present features in a new sample x, which is very fast:

y_hat = sum([W[L[word]] for word in x])

Bottom line, it is no secret that the power of a linear model lies in the features. However, if features are already available and there is no need to handcraft them, a simple feature combination and a linear model might already provide a very strong baseline. In other words, even if Deep Learning is awesome, we should always start with the simplest model and only increase the capacity if really needed.

]]>languages than English.

However, with a vocab of 2M words and 300 dimensions, those models can be a bit cumbersome to work with. You require approximately 8 GB RAM just to hold the data in memory which is no problem for modern servers, but at your machine at home, it might be quite a lot since you also need to run other services. A further question is about the interface between the data and your favorite language. There is a very nice API for python, but it requires to load the full binary model and also introduces some minor overhead.

So, we decided to analyze the binary format of the model, to see if we can somehow represent the model more compactly. The good thing is that the format is quite simple: there is a dictionary, additional model parameters and at the end of the file there is the input and the output matrix.

For the pre-trained models, we have 2M n-gram buckets and 2M words, and each row has 300 dimensions (float32). A matrix is represented by a fixed 16 octet header: int64 rows, int64 cols and each vector is a sequence of 300 float32 values. Between the input and output matrix is a single octet that indicates if quantization has been used. With this numbers it is easy to go to the correct offset in the file to read the data.

The size of a matrix is: rows*cols*sizeof(float32) which is 2M * 300 * 4 ~ 2.2 GB. Actually, the input matrix is a concatenation of the vocabulary words and the buckets for the n-grams. Thus, the index 0…2M references the words and 2M…4M references the buckets which is why 2M is added to the index returned by the hash function. So, we have three matrices with a total size of ~7 GB.

Now, we come to the coding part which is pretty easy with python. You just need to calculate the total size of the data + 1 octet as a quantization flag and seek to this position relative from the end: lseek(fd, -total_size, 2). After, you parsed the header, you can simply read each vector by `vec = unpack('300f', fd.read(300 * sizeof(float32)))` and store it in a numpy array. It is also possible to use float16 for the array since the loss of precision should not be noteable for similarity lookups. Since there is no direct lookup for each n-gram, we also need to port the hash function from c++ to python which is used to retrieve the row ID for each n-gram. Example:

h = 2166136261

for i in xrange(len(str)):

h = h ^ int(ord(str[i]))

h = h * 16777619

h = h & 0xffffffff

return h % bucket_size

Because we cannot model a int32_t type, we need to ensure the boundaries with the AND mask. And bucket_size is 2M. To retrieve the embedding of a specific ngram, we use the hash function:

v_emb = W_ngrams[hash("%wh")]

To convert a new word into a lookup vector, we first generate all n-grams with a size between 3 and 6. Then we convert each n-gram to a row ID and sum all vectors up and dividing it by the length of the n-gram list, which is nothing more than the average.

id_list = [hash(x) for x in ngrams("%where%")]

emb = np.sum(W_ngrams[id_list], 0) / len(id_list)

It is not complicated to wrap the whole procedure into a single class that outputs an embedding vector for an arbitrary word. So, instead of fiddling with 7 GB we just have 2.2 GB as a single numpy array, in case we need to generate vectors for unknown words.

But there is a minor issue we need to address. The representation of words from the vocabulary is a combination of the actual embedding vector plus the sum of its n-grams. In other words, to perform a similarity lookup, we require both matrices to perform it. Since the accumulated representation does not change and the actual embedding vector is never use stand-alone, we decided to perform the pre-processing step and store the result as a separate matrix. Now, we can use this matrix to directly output word embeddings and/or to perform nearest neighbor queries.

W_words = [..]

W_grams = [..]

id_list = ngrams(vocab[0])

W_words = np.vstack((W_words[0], W_grams[id_list]))

W_word = np.sum(W_words, 0) / W.shape[0]

Finally, we have two matrices: (1) the accumulated vocabulary matrix, which can be compared to the output of a word2vec with n-gram support and (2) one for the 2M n-grams which can be used with the hash function for lookups.

Bottom line, it is very helpful to have access to pre-trained embeddings which were trained on a large-scale corpus since very often, your data at hand is not sufficient to train ‘unbiased’ embeddings that generalize well. Furthermore, the n-gram support allows you to embed also oov words which will definitely occur and neither random inits nor UNK token are really appropriate to address them.

]]>In case of fastText there is a clever, built-in, way to handle oov words: n-grams. The general idea of n-grams is to also consider the structure of a word. For instance, with n=3 and word=’where’: [%wh, whe, her, ere, re%]. The % are used to differ between the word “her” and the ngram her, since the former is encoded as “%her%” which leads to [%he, her, er%]. In case of a new word, the sum of n-grams is used to encode the word which means as long as you have seen the n-grams, you can encode any new word you require. For a sufficiently large text corpus, it is very likely that a large portion, or even all, required n-grams are present.

Since most implementions, also fastText, is using the hashing trick, you cannot directly export the mapping n-gram vector, however, there is a function to query n-grams for a given word:

$ fasttext print-ngrams my_model "gibberish"

There is an open pull request (#289), to export all n-grams for a list of words, but right now to call fasttext for each new word which is very inefficient in case of huge models. Without knowing about the pull request, we did exactly the same. First we slightly modified the code to accept a list of words from stdin and then we performed a sort with duplicate elimination to get a distinct list of all n-grams which are present in the model.

Now, we have a pre-trained model for the fixed vocabulary, but additionally, we also have a model that allows us to map oov words to the same vector space as long as the n-grams are known. We also evaluated the mapping with slightly modified or related words which are not in the vocabulary with very good results.

Bottom line, without a doubt n-grams are not the silver bullet but they help a lot if you work with data that is dynamically changing, which includes spelling errors, variations and/or made-up words. Furthermore, publicly available models often deliver already solid results which takes the burden from you to train a model yourself which might overfit to the problem at hand or is not satisfying at all because you don’t have enough data. In case a model does not come with n-gram support, there is also a good chance to transfer the knowledge encoded in the vectors into n-grams by finding an appropriate loss function that preserves this knowledge in the sum of n-grams.

]]>`gist.github.com/Tushar-N/dfca335e370a2bc3bc79876e6270099e`

which is minimal but at the same time very well documented.

Just a quick note which might lead to the next blog post: After we trained our network, we wanted to do a under-the-hood analysis of the reset and forget gates of the GRU cells in case of the few errors the network makes. However, due to stacking the parameters, for performance reasons, a straightforward analysis needs some more preparation. In general the question is, if we use pre-defined modules, how can we debug the internal states of individual steps and units?

]]>What is astonishing is that PyTorch provides functionality to help you with the issue, but there is no tutorial or example code that contains all the steps. Sure, there are blogs and snippets on the web that explain it, but often a stand-alone, fully working, example allows to retrace the whole process more easily. Indeed, once you know all the details it is fairly simple to implement, since the PyTorch team did a very good job to hide all the nasty details from the users.

So, let’s start to describe the actual problem: During training, RNNs deal with sequences of different lengths which is no problem in single batch mode. However, if you want to use batching, you have to use padding to convert all samples to the same length as a first step. This can be done by using an extra “dummy” entry (“padding_idx”) in the nn.Embedding module which is added to each input at the end

until all inputs in the batch have the same length. But that is only the first step, since the RNN must ignore all those padded tokens for each input sequence while deriving the gradient w.r.t to the loss function.

This sounds a bit complicated because we have to fiddle with the computational graph, but kindly, there are helper functions for this to avoid to get your hands dirty. But let us start at the begin. Let us assume that we have an input X = [A, B, C] and the length of each sequence X_len = [4, 2, 8]. First, we need to pad each sequence to get a uniform length which requires to sort the input in decreasing order:

X = [torch.ones(4), torch.ones(2), torch.ones(8)]

X.sort(key=lambda x: x.shape[0], reverse=True)

X_pad = pad_sequence(X, batch_first=True, padding_value=0).long()

tensor([[ 1, 1, 1, 1, 1, 1, 1, 1], [ 1, 1, 1, 1, 0, 0, 0, 0], [ 1, 1, 0, 0, 0, 0, 0, 0]])

X_len = torch.LongTensor(map(lambda x: x.shape[0], X))

The option “batch_first” just ensures that the shape is always (batch, seq, feature).

As we can see, each sequence has now a length of 8 with “0”s as padding whenever required. Since we need the unpadded length of each sequence later, we also calculate X_len. With X_pad we can already perform a lookup in an nn.Embeding module:

emb = nn.Embedding(2, 5, padding_idx=0) #n_vocab, n_dim

X_emb = emb(X_pad)

Now, we are ready to feed the input to the RNN:

# setup network and initialize hidden states to zero

net = nn.GRU(5, 10, batch_first=True) #n_dim, n_units

hidden = torch.zeros(1, X_emb.shape[0], 10)

# pack batch

X_packed = pack_padded_sequence(X_emb, X_len, batch_first=True)

# forward step

out, hidden = net(X_packed, hidden)

# unpack batch

out, _ = pad_packed_sequence(out, batch_first=True)

# retrieve the last hidden state w.r.t to the original length for each sequence

idx = torch.arange(0, len(X_len)).long()

out_final = out[idx, X_len - 1, :]

The required steps can be easily wrapped into some class that hides all the nasty details and allows to get the output of an arbitrary recurrent network for a batch of (text) sequences in a straightforward way.

However, there is a drawback we need to take care of. For example, if we train a classifier and we sample a mini batch and the corresponding labels (X, Y), the procedure described above changes the order of X, while Y remain the same. The problem arises because of the sorting step that is only applied to X which makes the solution obvious: we also have to sort Y, but by X_len to get the identical order. The following code is not very nifty, but it works:

Y = [-1, 1, -1]

Y_ = zip(Y, X_len)

Y = map(lambda x: x[0], sorted(Y_, key=lambda x: x[1], reverse=True))

Bottom line, single-batch use of RNNs is a piece of cake, but the performance neither let you allow to train bigger networks or larger datasets, nor is the inference performance sufficient for real-world use. Despite the fact that the pad/pack/unpack scheme by PyTorch is not very complicated, it still needs some time to get used to it. But once one mastered it, the performance gain is more than noteworthy and allows to use RNNs at a much larger scale.

]]>There is a very nice post about feature transformation on distill that can be summarized as conditional layer normalization. For example, to answer relational questions about an image, the query is used as a context that guides the learning of the conv net. To be a bit more precise, depending on the question the output of the filters is adapted by scaling and shifting. This way, units can be turned on/off or the magnitude can be adjusted. The idea to not assume a strong prior about the data is a clever trick to avoid to manually engineer to explicit knowledge into a network.

With text for the queries and images to describe the content the task is still challenging, but at least the data is complete in the sense that it is possible to answer the questions with a correctly learned net (whatever this looks like). In our case, we have at least two problems: First, we don’t know if the data at hand is sufficient even for simple tasks and second, we need to find an appropriate

context that is always available.

So, with all this in mind a better question might be if we can somehow measure if the input data is powerful enough to solve the formulated problem at all. And let’s assume for the moment that we have all the computational power we need. In other words, even with the most powerful network and a millions of GPUs to train it, it is still possible that no such function f(x) exists that gives the correct answer given the particular input data x.

This is related to neural architecture search where one tries to find the best network architecture for the data, but here the assumption is that such a function f(x) exists. And such an assumption is reasonable, since usually images contain enough details to let a network give the correct answer. In case of human-engineered textual features, there is no such guarantee that they generalize beyond the purpose that they were created for which is usually purely informative.

At the end, we are back to where we started: Without external knowledge it seems almost hopeless to train networks with such data that can do more than simple classification, at best. But the problem are not the networks, but the lack of proper data, or better how to enhance and incorporate such data. With Wikipedia a lot of knowledge is available, but it is not a trivial task to extract relevant information and assembly them so it can be feed into the network.

There is a recent trend to make larger data sets publicly available, but it is wishful thinking that even big companies have all the data you need and/or the will to release it. Maybe it’s time to work harder on an Open Data Initiative (ODI) for machine learning, or come up at least with a community based service like a model zoo, but for data.

]]>Autoencoders (AEs) have a long history in machine learning and since some years, the convolutional variant became also more and more popular. However, since conv AEs use inverse operations and some advance stuff to recover lost information during the forward-propagation step, we thought it is a good idea to provide a clean, minimal example with some additional hints which help to understand the workflow. Without a doubt there are other examples around, but we did not find one that was exactly matching our domain (audio + conv1d), or at least not a minimal one that does not involve studying lots of unrelated code.

The conv AE consists of two modules, an encoder and a decoder which is not different to the vanilla AE. The encoder part looks a lot like a common convnet with some minor, but important variations:

c1 = nn.Conv1d(in_size, 16, 3)

m1 = nn.MaxPool1d(2, return_indices=True)

i1 = None

c2 = nn.Conv1d(16, 16, 3)

The first layer c1 is an ordinary 1D convoluation with the given in_size channels and 16 kernels with a size of 3×1. The next layer m1 is a max-pool layer with a size of 2×1 and stride 1×1. Additionally the indices of the maximal value will be returned since the information is required in the decoder later. The last layer is again conv 1d layer.

The forward step looks like that:

_c1 = c1(x_in)

_m1, i1 = m1(_c1)

return c2(_m1)

Again this should look pretty familiar, except for the pooling call because it returns both the output and the indices of the maximal value.

Then comes the decoder that uses the input from the encoder step:

d1 = nn.ConvTranspose1d(16, 16, 3)

u1 = nn.MaxUnpool1d(2)

d2 = nn.ConvTranspose1d(16, in_size, 3)

The architecture is reversed which means the last layer of the encoder fits into the first layer in the decoder. Thus, every layer is the inverse operation of the encoder layer: conv->transpose conv, pool->unpool. At the end, the full input is reconstructed again.

With the forward step as follows:

_d1 = d1(x_in)

_i1 = encoder.i1 # pool positions from encoder

_u1 = self.u1(_d1, _i1)

return self.d2(_u1)

Here we can see, that the unpooling uses the position information from the encoder. This is required since after the max pooling is done, no reversing is possible with the index information.

For example: x = (5, 10), maxpool(x, size=2) = 10 but we have no longer the information at which position the value was located: (10, ?) or (?, 10)? With the index from the encoder step, we can at least recover the position of the maximal value, but we still have to set all other values to 0 since this data is not available any longer: (0, 10). As a result, we still lose information but we can at least undo the maxpool step.

The workflow is easier to understand if we analyze the shape of each step:

Encoder: x_in=(1, 128, 44), c1=(1, 16, 42), m1=(1, 16, 21), c2=(1, 16, 19)

Decoder: x_hat_in=(1, 16, 19), d1=(1, 16, 21), u1=(1, 16, 42), d2=(1, 128, 44)

We can see that every shape in the decoder has a matching counterpart in the encoder: d2 x_in, u1 c1, d1 m1, x_hat_in c2.

Now, equipped with this knowledge, which can be also found in the excellent documentation of PyTorch, we can move from this toy example to a real (deep) conv AE with as much layers as we need and furthermore, we are also not limited to audio, but we can also build 2D convolutional AEs for images or even videos.

]]>A possible scenario looks like this: We record a movie via DVB-S and extract the audio stream. Then we convert the raw audio into a more suitable representation and classify all time frames, or time windows, with our learned model with +1/-1. At the end, we have time markers where the trained voice has been detected: [at min 3.1, at min 37.3, ..]. So far for the theory, now let’s turn to reality.

For us it was settled, that PyTorch is our framework of choice. Thus, as a first step we needed audio support. We hoped that in the spirit of torchvision, there is also torchaudio and we were not disappointed. The “load” function allows us to load arbitrary audio files in raw format and return the data as a tensor. However, this format requires a lot of computational resources, since every second is encoded as rate (e.g. 41,000) float values, per channel. Thus, the shape of the tensor is (rate * seconds, channels), which is huge for a full-length movie.

So we are interested in a more compact representation and as a first step, we converted stereo signals to mono (“transforms.DownmixMono”) which reduces the shape to (rate * seconds, 1). But since this is still a lot of data, we did some research to get an overview of popular transformations and we decided to use MEL spectrograms, also because there is an interface in the torchaudio package (“transforms.MEL”). With default values from papers, and re-sampling to 22,1000 Hz, each second of raw audio is now encoded as a (128, 22) matrix. In this setting, the rows are the frequency axis and the columns are the time axis. We further apply a log transformation on the data to avoid exploding gradients, since the magnitude of the spectrogram data can be very high.

Now the question is how to encode this information into a new representation to model the similarity between frames? There are several approaches possible. For instance, we could train an ordinary classifier one-vs-rest that outputs +1 if the frame is spoken by the speaker or -1 otherwise. But we opted for a triplet-based method to better model local neighborhoods. The drawback is that we cannot directly classify unseen frames, but we need some kind of nearest neighbor lookup to decide if the frame is a positive match. Thus, it makes sense that the positive data from training forms a memory component that in combination with a threshold acts like a classifier.

Next, we need to design our network architecture. With the chosen MEL transformation, we could easily train a feed-forward neural net, the input dim would be just 128*22=2816, but dense layers are not invariant to shifts in frequency[arxiv:1709.04396] and thus, a minor change in the input can lead to a larger change in the feature space. Thus, we decided to follow the steps of the early papers that uses convolution over the time axis to learn a representation which is a 1d convolution. The architecture is heavily inspired by the convnets from vision, with the exception that pooling and convolution just uses one channel, not two.

Thanks to PyTorch we have everything we need and a prototype consists just of a few lines of Python. Here is a sketch of the network:

import torch

from torch.nn import Conv1d

from torch.nn import MaxPool1d

from torch.nn import Linear

from torch.autograd import Variable

from torch.nn import functional as F

```
```

`x = Variable(torch.randn(1, 128, 22))`

c1 = Conv1d(in_channels=128, out_channels=32, kernel_size=3)

c2 = Conv1d(in_channels=32, out_channels=32, kernel_size=3)

m1 = MaxPool1d(2)

l1 = Linear(32, 16, bias=False)

h_2d = c2(m1(c1(x)))

h = F.adaptive_avg_pool2d(h_2d, (32, 1)).squeeze()

out = l1(h)

First, there is a convolution, followed by max-pooling, followed by a convolution and at the end, a global average pooling, that returns the mean of each filter map, followed by an affine transformation that represents the final embedding space. Additional blocks like normalization and non-linear activation functions are omitted for clarity. Such an architecture has a lot of benefits: First, we can stack blocks of conv/norm/relu/pool to form a deep network, second the network has also very few trainable parameters and last but not least, the forward step is computationally very efficient.

The training of the network is also pretty straightforward. The data set consists of spoken audio material by the person to recognize, as positive examples and arbitrary audio from other persons as negative examples. Without a doubt the selection of “the rest” impacts the performance of the network, since if all samples are already sufficiently far away from the speaker samples, no learning is done. This issue requires more research, but even our naive selection of negative samples lead to a solid performance.

Next, all audio files are pre-processed and split into frames of ~2 seconds on which the transformation is applied. The order of the frames is not preserved, since the “classification” works on single frames. A learning step consists of a sampling of an anchor and a positive sample and an arbitrary negative sample. Each input to the network represents a single time frame with the possibility to feed a batch of frames to the network. We l2 normalize all network output and use the cosine similarity to determine the triplet loss:

`loss = torch.clamp(margin=0.3 + dot(anchor, negative) - dot(anchor, positive), min=0)`

In other words, if the negative sample is sufficiently far away from the anchor (>= margin) no learning is required, otherwise the parameters are adjusted to push the negative sample away from the anchor.

However, it can be challenging to find good negative samples, since at later stages of the training, most samples are already well separated and thus have a loss of zero. This means, we need to find violators, outside the batch, to further improve the model. This can be computationally expensive, since we need to calculate the loss on many samples until we find enough of them. However, the procedure is required to ensure that we learn a good model and that the learning converges.

When the model is trained, the positive samples are fed to the network and the representation is stored as some kind of “memory”. As a baseline, new frames are classified by performing a nearest neighbor lookup (cosine similarity) on the memory and a frame is marked as “positive” if the mean of the top-5 scores from memory are above a threshold, like 0.7. Astonishingly, this baseline is pretty robust and already allows to reliably mark relevant time windows of audio material without too many false positives.

Bottom line, regardless of the domain, the machine learning pipeline stays pretty much the same. We have a problem, data, cleansing, optional a transformation and we need a good network architecture and a proper loss function to learn a good model. The next steps are more experiments to evaluate the model and to come up with a better way to classify unseen data based only on positive examples.

]]>In our case, we often calculate dot products between vectors and matrices and there are different ways to do the math. For example:

(1) `torch.sum(anchor * examples, 1) # shape: (1, dim) x (n, dim)`

(2) `examples.mm(anchor.view(-1, 1)) # shape: (n, dim) x (dim, 1)`

For both methods there is not much overhead, at least not function-wise, however, after we did some profiling, we found out that method (2) is about 40% faster than the first one. This is probably related to hardware utilization since (2) feels more “batched”.

Frankly, this is nothing new, but it just reminded us that for large-scale learning, using optimal numeric calculation can save you a day or week, or it can give you the opportunity to train a little longer. In our case, by introducing padding we reduced the time by almost 50% and now with the batched dot product, we got another 40%.

]]>The naive implementation only works if all those token lists have the same size, otherwise we are not able to build a LongTensor:

torch.LongTensor([[0, 5, 10], [3, 33, 333]]) [okay]

torch.LongTensor([[0, 5, 10], [3, 33]]) [error]

Since this is a common problem, the nn.Embedding module of PyTorch supports padding with “padding_idx=PAD”. Whenever PAD is found in the long tensor, the output is filled with zeros:

torch.LongTensor([3, 33, PAD]):

x_3_0 ... x_3_d

x_33_0 ... x_33_d

0 ... 0

In other words, this acts like a dummy embedding that does not change the gradient because no actual parameters are used. With this approach, we are able to return the aggregated embeddings (sum) for a batch of samples with different lengths, instead of forwarding each sample separately through the network.

batch = torch.LongTensor([

[0, 5, 10, 15],

[3, 33, PAD, PAD],

[17, PAD, PAD, PAD]])

batch_emb = net(Variable(batch))

We measured the runtime for both approaches and as expected, there is a notable performance gain by using batching: naive=14863 msecs. vs. batched=8294 msecs. which is an improvement of more than 40%.

Actually there is not any magic involved and you just need to make sure that you are working with the correct axis if you perform per-sample transformations. In our case, we normalized each aggregated vector (sum) so it has a unit-norm.

As a last step, let’s go through an example: If we assume that our embed_dim is 10 and we use batch as the input to the network, we get the following output shape: (3, 4, 10) which means we have 3 samples, each with 4 embeddings and each with 10 dimensions. Now, we want to calculate the sum of the embedding for each sample in the batch: batch_emb_sum = torch.sum(batch_emb, 1) with a resulting shape of (3, 10) and finally the normalization step: batch_emb_final = batch_emb_sum / batch_emb_sum.norm(dim=-1, keepdim=True) and that’s it. Thanks to the padding, the zero vectors do not interfere with any steps, since adding zero to something does not change anything.

But we need to be careful when we use an operation that depends on the number of elements, like torch.mean since the padding changes the size of the shape. To be more concrete, if we only have one token, but three PAD entries, the shape is (4, 10) and the mean would be: torch.sum(x, 1) / 4 even if the last three entries do not hold any values. Thus, we need to re-calculate the shape if padding has been used: actual_len = #rows – #pad_rows.

]]>