Since we started our audio project, we have been thinking about ways to learn audio features in an unsupervised way. For instance, for speaker recognition we are more interested in a condensed representation of the speaker characteristics than in a classifier, since there is much more unlabeled data available to learn from. However, without supervision there is always the risk that the learned representation does not help with the task at hand. Still, it’s worth a try since the data is available and so are suitable network architectures.
Autoencoders (AEs) have a long history in machine learning, and for some years now the convolutional variant has become more and more popular. However, since conv AEs use inverse operations and some more advanced machinery to recover information that is lost during the forward-propagation step, we thought it would be a good idea to provide a clean, minimal example with some additional hints that help to understand the workflow. Without a doubt there are other examples around, but we did not find one that exactly matched our domain (audio + conv1d), or at least not a minimal one that does not involve studying lots of unrelated code.
The conv AE consists of two modules, an encoder and a decoder, which is no different from the vanilla AE. The encoder part looks a lot like a common convnet with some minor, but important variations:
c1 = nn.Conv1d(in_size, 16, 3)
m1 = nn.MaxPool1d(2, return_indices=True)
i1 = None
c2 = nn.Conv1d(16, 16, 3)
The first layer c1 is an ordinary 1D convolution with the given in_size channels and 16 kernels of size 3×1. The next layer m1 is a max-pool layer with a size of 2×1 and, since PyTorch defaults the stride to the kernel size, a stride of 2. Additionally, the indices of the maximal values are returned since this information is required in the decoder later. The last layer c2 is again a 1D convolution.
The forward step looks like this:
_c1 = c1(x_in)
_m1, i1 = m1(_c1)  # pooled output and positions of the maxima
_c2 = c2(_m1)
Again this should look pretty familiar, except for the pooling call because it returns both the output and the indices of the maximal value.
Then comes the decoder that uses the input from the encoder step:
d1 = nn.ConvTranspose1d(16, 16, 3)
u1 = nn.MaxUnpool1d(2)
d2 = nn.ConvTranspose1d(16, in_size, 3)
The architecture is reversed which means the last layer of the encoder fits into the first layer in the decoder. Thus, every layer is the inverse operation of the encoder layer: conv->transpose conv, pool->unpool. At the end, the full input is reconstructed again.
With the forward step as follows:
_d1 = d1(x_in)  # x_in is the output of the encoder
_i1 = encoder.i1  # pool positions from encoder
_u1 = u1(_d1, _i1)
_d2 = d2(_u1)
Here we can see that the unpooling uses the position information from the encoder. This is required since after the max pooling is done, no reversing is possible without the index information.
For example: x = (5, 10), maxpool(x, size=2) = 10 but we have no longer the information at which position the value was located: (10, ?) or (?, 10)? With the index from the encoder step, we can at least recover the position of the maximal value, but we still have to set all other values to 0 since this data is not available any longer: (0, 10). As a result, we still lose information but we can at least undo the maxpool step.
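The toy case above can be reproduced directly with the PyTorch pooling layers. This is only a minimal sketch; the tensor layout (batch, channels, length) and the variable names are chosen for illustration.

```python
import torch
from torch import nn

pool = nn.MaxPool1d(2, return_indices=True)
unpool = nn.MaxUnpool1d(2)

# One sample, one channel, two time steps: x = (5, 10)
x = torch.tensor([[[5.0, 10.0]]])

# The pooling keeps the maximum and remembers where it was located
pooled, idx = pool(x)

# The unpooling puts the maximum back at its position and fills the rest with 0
restored = unpool(pooled, idx)

print(pooled, idx, restored)
```

The value 5 is gone for good, but the maximum ends up at its original position again.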
The workflow is easier to understand if we analyze the shape of each step:
Encoder: x_in=(1, 128, 44), c1=(1, 16, 42), m1=(1, 16, 21), c2=(1, 16, 19)
Decoder: x_hat_in=(1, 16, 19), d1=(1, 16, 21), u1=(1, 16, 42), d2=(1, 128, 44)
We can see that every shape in the decoder has a matching counterpart in the encoder: d2 ↔ x_in, u1 ↔ c1, d1 ↔ m1, x_hat_in ↔ c2.
Now, equipped with this knowledge, which can also be found in the excellent documentation of PyTorch, we can move from this toy example to a real (deep) conv AE with as many layers as we need. Furthermore, we are not limited to audio: we can also build 2D convolutional AEs for images or even videos.
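To check the shapes listed above, the layers can be wrapped into two small modules. The following is a sketch under assumptions: the class names Encoder and Decoder, the way the pool indices are handed over, and the mean squared error as reconstruction loss are our illustration and not necessarily what a full model would use.

```python
import torch
from torch import nn


class Encoder(nn.Module):
    def __init__(self, in_size):
        super().__init__()
        self.c1 = nn.Conv1d(in_size, 16, 3)
        self.m1 = nn.MaxPool1d(2, return_indices=True)
        self.c2 = nn.Conv1d(16, 16, 3)
        self.i1 = None  # pool positions, filled in during forward()

    def forward(self, x):
        h = self.c1(x)
        h, self.i1 = self.m1(h)
        return self.c2(h)


class Decoder(nn.Module):
    def __init__(self, in_size):
        super().__init__()
        self.d1 = nn.ConvTranspose1d(16, 16, 3)
        self.u1 = nn.MaxUnpool1d(2)
        self.d2 = nn.ConvTranspose1d(16, in_size, 3)

    def forward(self, h, pool_idx):
        h = self.d1(h)
        h = self.u1(h, pool_idx)  # needs the positions from the encoder
        return self.d2(h)


encoder, decoder = Encoder(128), Decoder(128)
x_in = torch.randn(1, 128, 44)      # (batch, mel bands, time steps)
code = encoder(x_in)                # (1, 16, 19)
x_hat = decoder(code, encoder.i1)   # (1, 128, 44)

# Assumption: plain mean squared error as reconstruction loss
loss = nn.functional.mse_loss(x_hat, x_in)
print(code.shape, x_hat.shape)
```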
Sometimes it is a good idea to try a new direction when you are stuck. In other words, we needed some new inspiration and we thought it would be worth turning to a very different domain, in our case audio. Furthermore, for quite some time we had toyed with the idea of tagging a specific voice in an audio signal by somehow learning a representation of the speaker, so it felt like the way to go.
A possible scenario looks like this: We record a movie via DVB-S and extract the audio stream. Then we convert the raw audio into a more suitable representation and classify all time frames, or time windows, with our learned model as +1/-1. At the end, we have time markers where the trained voice has been detected: [at min 3.1, at min 37.3, ..]. So much for the theory, now let’s turn to reality.
For us it was settled that PyTorch is our framework of choice. Thus, as a first step we needed audio support. We hoped that in the spirit of torchvision, there would also be a torchaudio, and we were not disappointed. The “load” function allows us to load arbitrary audio files in raw format and returns the data as a tensor. However, this format requires a lot of computational resources, since every second is encoded as rate (e.g. 44,100) float values per channel. Thus, the shape of the tensor is (rate * seconds, channels), which is huge for a full-length movie.
So we are interested in a more compact representation and as a first step, we converted stereo signals to mono (“transforms.DownmixMono”), which reduces the shape to (rate * seconds, 1). But since this is still a lot of data, we did some research to get an overview of popular transformations and we decided to use MEL spectrograms, also because there is an interface in the torchaudio package (“transforms.MEL”). With default values from papers, and re-sampling to 22,050 Hz, each second of raw audio is now encoded as a (128, 22) matrix. In this setting, the rows are the frequency axis and the columns are the time axis. We further apply a log transformation to the data to avoid exploding gradients, since the magnitude of the spectrogram data can be very high.
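The last step of this scenario, turning per-frame decisions into time markers, could be sketched as follows. The function name positive_time_markers, the fixed frame length of 2 seconds and the choice to report only the start of each run of positive frames are assumptions made for illustration.

```python
def positive_time_markers(frame_labels, frame_seconds=2.0):
    """Return the start times (in minutes) of runs of consecutive +1 frames."""
    markers = []
    previous = -1
    for i, label in enumerate(frame_labels):
        # A marker is only emitted when a run of positive frames begins
        if label == 1 and previous != 1:
            markers.append(round(i * frame_seconds / 60.0, 1))
        previous = label
    return markers


# Hypothetical classifier output: the voice shows up in frames 93 to 95
labels = [-1] * 93 + [1] * 3 + [-1] * 4
print(positive_time_markers(labels))
```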
Now the question is how to encode this information into a new representation to model the similarity between frames? There are several approaches possible. For instance, we could train an ordinary classifier one-vs-rest that outputs +1 if the frame is spoken by the speaker or -1 otherwise. But we opted for a triplet-based method to better model local neighborhoods. The drawback is that we cannot directly classify unseen frames, but we need some kind of nearest neighbor lookup to decide if the frame is a positive match. Thus, it makes sense that the positive data from training forms a memory component that in combination with a threshold acts like a classifier.
Next, we need to design our network architecture. With the chosen MEL transformation, we could easily train a feed-forward neural net, the input dim would be just 128*22=2816, but dense layers are not invariant to shifts in frequency[arxiv:1709.04396] and thus, a minor change in the input can lead to a larger change in the feature space. Thus, we decided to follow the steps of the early papers that use convolution over the time axis to learn a representation, which is a 1D convolution. The architecture is heavily inspired by the convnets from vision, with the exception that pooling and convolution operate along just one dimension, not two.
Thanks to PyTorch we have everything we need and a prototype consists just of a few lines of Python. Here is a sketch of the network:
import torch
from torch.nn import Conv1d
from torch.nn import MaxPool1d
from torch.nn import Linear
from torch.autograd import Variable
from torch.nn import functional as F
x = Variable(torch.randn(1, 128, 22))
c1 = Conv1d(in_channels=128, out_channels=32, kernel_size=3)
c2 = Conv1d(in_channels=32, out_channels=32, kernel_size=3)
m1 = MaxPool1d(2)
l1 = Linear(32, 16, bias=False)
h_2d = c2(m1(c1(x)))
h = F.adaptive_avg_pool2d(h_2d, (32, 1)).squeeze()
out = l1(h)
First, there is a convolution, followed by max-pooling, followed by a convolution and at the end, a global average pooling, that returns the mean of each filter map, followed by an affine transformation that represents the final embedding space. Additional blocks like normalization and non-linear activation functions are omitted for clarity. Such an architecture has a lot of benefits: First, we can stack blocks of conv/norm/relu/pool to form a deep network, second the network has also very few trainable parameters and last but not least, the forward step is computationally very efficient.
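For completeness, here is how the sketch might look once the omitted blocks are put back in. The class name FrameEmbedder, the use of BatchNorm1d as normalization and the l2 normalization of the output are our assumptions; the layer sizes are taken from the sketch above.

```python
import torch
from torch import nn
import torch.nn.functional as F


class FrameEmbedder(nn.Module):
    def __init__(self, n_mels=128, n_filters=32, emb_dim=16):
        super().__init__()
        # conv / norm / relu / pool block
        self.block1 = nn.Sequential(
            nn.Conv1d(n_mels, n_filters, kernel_size=3),
            nn.BatchNorm1d(n_filters),
            nn.ReLU(),
            nn.MaxPool1d(2),
        )
        # conv / norm / relu block
        self.block2 = nn.Sequential(
            nn.Conv1d(n_filters, n_filters, kernel_size=3),
            nn.BatchNorm1d(n_filters),
            nn.ReLU(),
        )
        self.proj = nn.Linear(n_filters, emb_dim, bias=False)

    def forward(self, x):                 # x: (batch, n_mels, time)
        h = self.block2(self.block1(x))   # (batch, n_filters, time')
        h = h.mean(dim=2)                 # global average pooling over time
        return F.normalize(self.proj(h), dim=1)  # unit-length embeddings


net = FrameEmbedder().eval()
with torch.no_grad():
    emb = net(torch.randn(4, 128, 22))    # a batch of four frames
print(emb.shape)
```

Taking the mean over the time axis gives the same global average pooling as in the sketch, but it also works for a whole batch of frames.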
The training of the network is also pretty straightforward. The data set consists of spoken audio material by the person to recognize as positive examples, and arbitrary audio from other persons as negative examples. Without a doubt the selection of “the rest” impacts the performance of the network, since if all samples are already sufficiently far away from the speaker samples, no learning is done. This issue requires more research, but even our naive selection of negative samples led to a solid performance.
Next, all audio files are pre-processed and split into frames of ~2 seconds on which the transformation is applied. The order of the frames is not preserved, since the “classification” works on single frames. A learning step consists of a sampling of an anchor and a positive sample and an arbitrary negative sample. Each input to the network represents a single time frame with the possibility to feed a batch of frames to the network. We l2 normalize all network output and use the cosine similarity to determine the triplet loss:
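margin = 0.3  # anchor, positive and negative are l2-normalized 1D tensors
loss = torch.clamp(margin + torch.dot(anchor, negative) - torch.dot(anchor, positive), min=0)
In other words, if the anchor is more similar to the positive sample than to the negative sample by at least the margin, the loss is zero and no learning is required; otherwise the parameters are adjusted to pull the positive sample closer to the anchor and to push the negative sample away.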
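For a batch of frames, the same loss can be written in a few lines. This is a sketch: the function name triplet_loss and the choice to normalize inside the function are assumptions, the margin of 0.3 is the value from above.

```python
import torch
import torch.nn.functional as F


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Per-sample triplet loss on cosine similarities; inputs are (batch, dim)."""
    a = F.normalize(anchor, dim=1)
    p = F.normalize(positive, dim=1)
    n = F.normalize(negative, dim=1)
    pos_sim = (a * p).sum(dim=1)   # cosine similarity anchor/positive
    neg_sim = (a * n).sum(dim=1)   # cosine similarity anchor/negative
    return torch.clamp(margin + neg_sim - pos_sim, min=0)


a = torch.tensor([[1.0, 0.0]])
p = torch.tensor([[2.0, 0.0]])       # same direction as the anchor
n_far = torch.tensor([[0.0, 3.0]])   # orthogonal to the anchor
n_near = torch.tensor([[1.0, 0.0]])  # identical to the anchor

print(triplet_loss(a, p, n_far))     # well separated, nothing to learn
print(triplet_loss(a, p, n_near))    # violates the margin
```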
However, it can be challenging to find good negative samples, since at later stages of the training, most samples are already well separated and thus have a loss of zero. This means, we need to find violators, outside the batch, to further improve the model. This can be computationally expensive, since we need to calculate the loss on many samples until we find enough of them. However, the procedure is required to ensure that we learn a good model and that the learning converges.
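A simple way to look for such violators is to score a larger pool of candidate negatives in one go and keep only those with a non-zero loss. The helper find_violators below is a hypothetical sketch and assumes that all embeddings are already l2-normalized.

```python
import torch


def find_violators(anchor, positive, candidates, margin=0.3):
    """anchor, positive: (dim,) unit vectors; candidates: (n, dim) unit vectors.

    Returns the indices of all candidates with a loss greater than zero,
    together with the loss of every candidate.
    """
    pos_sim = torch.dot(anchor, positive)
    neg_sim = candidates.mv(anchor)  # similarity of every candidate, shape (n,)
    losses = torch.clamp(margin + neg_sim - pos_sim, min=0)
    return torch.nonzero(losses > 0).view(-1), losses


anchor = torch.tensor([1.0, 0.0])
positive = torch.tensor([0.8, 0.6])
candidates = torch.tensor([[0.0, 1.0], [0.6, 0.8], [1.0, 0.0]])

idx, losses = find_violators(anchor, positive, candidates)
print(idx, losses)
```

Only the returned indices contribute to a learning step, everything else can be skipped.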
When the model is trained, the positive samples are fed to the network and the representation is stored as some kind of “memory”. As a baseline, new frames are classified by performing a nearest neighbor lookup (cosine similarity) on the memory and a frame is marked as “positive” if the mean of the top-5 scores from memory is above a threshold, like 0.7. Astonishingly, this baseline is pretty robust and already allows us to reliably mark relevant time windows of audio material without too many false positives.
Bottom line, regardless of the domain, the machine learning pipeline stays pretty much the same. We have a problem, data, cleansing, optionally a transformation, and we need a good network architecture and a proper loss function to learn a good model. The next steps are more experiments to evaluate the model and to come up with a better way to classify unseen data based only on positive examples.
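The memory lookup can be sketched in a few lines. The function name is_positive is made up for illustration; k=5 and the threshold of 0.7 are the values mentioned above.

```python
import torch
import torch.nn.functional as F


def is_positive(frame_emb, memory, k=5, threshold=0.7):
    """frame_emb: (dim,) embedding of a new frame; memory: (n, dim) positive embeddings."""
    q = F.normalize(frame_emb, dim=0)
    m = F.normalize(memory, dim=1)
    scores = m.mv(q)                                  # cosine similarity to every entry
    top = torch.topk(scores, min(k, scores.numel())).values
    return bool(top.mean() > threshold)


# Toy memory: six stored embeddings that all point in the same direction
memory = torch.tensor([[1.0, 0.0]] * 6)
print(is_positive(torch.tensor([1.0, 0.0]), memory))
print(is_positive(torch.tensor([0.0, 1.0]), memory))
```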
It might happen that if we start with a new idea, we focus on the clarity of the code but not on the overall performance. Of course the model should not be slow as a snail, but often there is room for improvement. Still, first it is more important to get it working than to be super fast. When everything works well, it’s time to take a closer look at the code and to identify possible bottlenecks.
In our case, we often calculate dot products between vectors and matrices and there are different ways to do the math. For example:
torch.sum(anchor * examples, 1)  # (1) shape: (1, dim) x (n, dim)
examples.mm(anchor.view(-1, 1))  # (2) shape: (n, dim) x (dim, 1)
For both methods there is not much overhead, at least not function-wise, however, after we did some profiling, we found out that method (2) is about 40% faster than the first one. This is probably related to hardware utilization since (2) feels more “batched”.
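A small script to repeat such a measurement could look like this. The tensor sizes and the number of repetitions are arbitrary choices, and the timings depend a lot on the hardware and the PyTorch version, so the numbers will not necessarily match ours.

```python
import timeit
import torch

anchor = torch.randn(1, 64)
examples = torch.randn(10000, 64)


def dot_sum():
    # Element-wise product with broadcasting, then a sum over the feature axis
    return torch.sum(anchor * examples, 1)


def dot_mm():
    # A single matrix product, flattened to the same shape as above
    return examples.mm(anchor.view(-1, 1)).view(-1)


# Both variants have to deliver the same scores (up to rounding)
same = torch.allclose(dot_sum(), dot_mm(), atol=1e-3)

t_sum = timeit.timeit(dot_sum, number=100)
t_mm = timeit.timeit(dot_mm, number=100)
print("equal:", same, "sum: %.4fs" % t_sum, "mm: %.4fs" % t_mm)
```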
Frankly, this is nothing new, but it just reminded us that for large-scale learning, using optimal numeric calculation can save you a day or week, or it can give you the opportunity to train a little longer. In our case, by introducing padding we reduced the time by almost 50% and now with the batched dot product, we got another 40%.
There are quite a few helper functions when it comes to recurrent nets, but in our case we just wanted to speed up the forward step of a model that is just using Embedding layers. Maybe there is also a helper for our problem, but in any case it’s a good idea to manually implement these steps to see how it works under the hood and to learn about possible side effects. Our setup is pretty simple: We have a batch of lists that contain individual tokens and our network shall return the sum of the corresponding embeddings for each sample.
The naive implementation only works if all those token lists have the same size, otherwise we are not able to build a LongTensor:
torch.LongTensor([[0, 5, 10], [3, 33, 333]]) [okay]
torch.LongTensor([[0, 5, 10], [3, 33]]) [error]
Since this is a common problem, the nn.Embedding module of PyTorch supports padding with “padding_idx=PAD”. Whenever PAD is found in the long tensor, the output is filled with zeros:
torch.LongTensor([3, 33, PAD]):
x_3_0 ... x_3_d
x_33_0 ... x_33_d
0 ... 0
In other words, this acts like a dummy embedding that does not change the gradient because no actual parameters are used. With this approach, we are able to return the aggregated embeddings (sum) for a batch of samples with different lengths, instead of forwarding each sample separately through the network.
batch = torch.LongTensor([
[0, 5, 10, 15],
[3, 33, PAD, PAD],
[17, PAD, PAD, PAD]])
batch_emb = net(Variable(batch))
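Building such a padded batch from token lists of different lengths only takes a tiny helper. The name pad_batch and the concrete PAD value are assumptions for this sketch; any index that is not used by a real token will do.

```python
import torch

PAD = 334  # hypothetical padding index, not used by any real token


def pad_batch(token_lists, pad_id):
    """Fill every token list with pad_id up to the length of the longest one."""
    max_len = max(len(tokens) for tokens in token_lists)
    padded = [tokens + [pad_id] * (max_len - len(tokens)) for tokens in token_lists]
    return torch.LongTensor(padded)


batch = pad_batch([[0, 5, 10, 15], [3, 33], [17]], PAD)
print(batch)
```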
We measured the runtime for both approaches and as expected, there is a notable performance gain by using batching: naive=14863 msecs. vs. batched=8294 msecs. which is an improvement of more than 40%.
Actually there is not any magic involved and you just need to make sure that you are working with the correct axis if you perform per-sample transformations. In our case, we normalized each aggregated vector (sum) so it has a unit-norm.
As a last step, let’s go through an example: If we assume that our embed_dim is 10 and we use batch as the input to the network, we get the following output shape: (3, 4, 10) which means we have 3 samples, each with 4 embeddings and each with 10 dimensions. Now, we want to calculate the sum of the embedding for each sample in the batch: batch_emb_sum = torch.sum(batch_emb, 1) with a resulting shape of (3, 10) and finally the normalization step: batch_emb_final = batch_emb_sum / batch_emb_sum.norm(dim=-1, keepdim=True) and that’s it. Thanks to the padding, the zero vectors do not interfere with any steps, since adding zero to something does not change anything.
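The whole example fits into a few lines. In this sketch a bare nn.Embedding stands in for our network, and the vocabulary size of 335 and the PAD index of 334 are made-up values.

```python
import torch
from torch import nn

PAD = 334                                    # hypothetical padding index
emb = nn.Embedding(335, 10, padding_idx=PAD)

batch = torch.LongTensor([
    [0, 5, 10, 15],
    [3, 33, PAD, PAD],
    [17, PAD, PAD, PAD]])

batch_emb = emb(batch)                       # (3, 4, 10)
batch_emb_sum = torch.sum(batch_emb, 1)      # (3, 10), PAD rows add nothing
batch_emb_final = batch_emb_sum / batch_emb_sum.norm(dim=-1, keepdim=True)

print(batch_emb.shape, batch_emb_sum.shape)
```

For the last sample, the sum is simply the embedding of token 17, since the three PAD rows are all zero.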
But we need to be careful when we use an operation that depends on the number of elements, like torch.mean, since the padding changes the size of the shape. To be more concrete, if we only have one token, but three PAD entries, the shape is (4, 10) and the mean would be: torch.sum(x, 0) / 4 even if the last three entries do not hold any values. Thus, we need to re-calculate the length if padding has been used: actual_len = #rows - #pad_rows.
To relate TV titles that come from the electronic program guide (EPG), we have decided to train an embedding that directly optimizes on “sentence-level” instead of just related words, like word2vec. That is in the spirit of StarSpace[arxiv:1709.03856], which is a fairly simple approach but nevertheless a strong baseline.
The idea for the training is straightforward and uses the leave-one-out approach. We use either existing or manually annotated categories for a weak supervision. Then we sample N positive items from one category, and use N-1 items to predict the Nth item. Negative items are sampled from other categories. A title is encoded as the sum of all the embeddings of the words it contains: np.sum(E[word_id_list], 0) (bag-of-words). All vectors are normalized to lie on the unit sphere. Next, we combine all bag-of-words into bag-of-documents: np.mean(V, 0) where V is a matrix with #title rows and #dim columns.
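A mean that ignores the padding can be sketched like this. The PAD index and the sizes are again made-up values, and the clamp is only there to avoid a division by zero for samples that consist of nothing but padding.

```python
import torch
from torch import nn

PAD = 334                                    # hypothetical padding index
emb = nn.Embedding(335, 10, padding_idx=PAD)

batch = torch.LongTensor([
    [17, PAD, PAD, PAD],
    [3, 33, PAD, PAD]])
batch_emb = emb(batch)                       # (2, 4, 10)

# actual_len = #rows - #pad_rows, calculated per sample
actual_len = (batch != PAD).sum(dim=1, keepdim=True).float().clamp(min=1)

masked_mean = batch_emb.sum(dim=1) / actual_len   # divides by 1 and 2
naive_mean = batch_emb.mean(dim=1)                # always divides by 4

print(actual_len.view(-1))
```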
The loss is the well-known triplet loss: np.maximum(0, margin - pos + neg), where pos = np.dot(bag_items, nth_item) and neg = np.dot(bag_items, neg_item) and margin is a hyper-parameter (0.2 by default). The idea is not to learn a hyperplane to separate classes, but to move related items closer and push unrelated items further away. The learning stops when positive and negative pairs are separated by at least the given margin. The performance of such models largely depends on the sampling of positive and negative items, but this is not the concern of this post.
In contrast to titles of books, and to some degree movies, generic TV titles belonging to shows, reports or entertainment are very heterogeneous, with lots of special characters, and they can also include meta information. Therefore, we need a more sophisticated tokenizer to convert titles into “words”, but we also need to address the issue of rare “words”. The long-tail is always a problem for text, but in our case the domain is more like tweets with special emoticons and/or hashtags than traditional text. Throwing away those “words” is no solution, which is why we need to adjust the learning scheme.
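Put together, the encoding and the loss can be sketched with plain numpy. The random embedding matrix E, the word ids and the helper names are made up for illustration, and normalizing the bag-of-documents vector again is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)
E = rng.normal(size=(1000, 50))   # hypothetical embedding matrix: 1000 "words", 50 dims


def unit(v):
    return v / np.linalg.norm(v)


def encode_title(word_id_list):
    # bag-of-words: sum of the word embeddings, scaled to unit length
    return unit(np.sum(E[word_id_list], 0))


def triplet_loss(bag_titles, nth_title, neg_title, margin=0.2):
    # bag-of-documents: mean of the N-1 encoded titles
    V = np.stack([encode_title(t) for t in bag_titles])
    bag_items = unit(np.mean(V, 0))   # assumption: normalized once more
    pos = np.dot(bag_items, encode_title(nth_title))
    neg = np.dot(bag_items, encode_title(neg_title))
    return np.maximum(0, margin - pos + neg)


bag = [[1, 2, 3], [2, 3, 4], [1, 4, 5]]   # N-1 titles from one category
loss = triplet_loss(bag, nth_title=[1, 2, 5], neg_title=[700, 701, 702])
print(loss)
```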
In word2vec down-sampling of frequent words is used, but this does not really address the problem since we do not want to damp the learning signal for frequent “words”, but we want to boost the signal for rare “words”. That is why we decided to scale the gradients with the inverse frequency of the words. The procedure just requires a fixed lookup table: ID->WEIGHT, which is easy to implement.
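Such a lookup table can be built in one pass over the training data. The sketch below is only one possible variant: the scaling relative to the most frequent “word” and the cap on the weight are assumptions, not necessarily the exact scheme we used.

```python
from collections import Counter


def build_inverse_freq_lookup(token_id_lists, cap=10.0):
    """Map every token id to a gradient weight that grows as the token gets rarer."""
    counts = Counter(tok for ids in token_id_lists for tok in ids)
    max_count = max(counts.values())
    # The most frequent token gets weight 1.0, rarer tokens a larger one (up to cap)
    return {tok: min(max_count / count, cap) for tok, count in counts.items()}


titles = [[1, 2], [1, 3], [1, 2]]   # token 1 is frequent, token 3 is rare
lookup = build_inverse_freq_lookup(titles)
print(lookup)
```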
The necessity of the procedure became obvious when we checked the result of our first model. We took an arbitrary title and used the cosine score to rank all other titles. The results looked promising, but from time to time there were outliers and we wanted to find out why. We started by removing single “words” from the offending title and repeated the ranking. We found out that the problems were often related to rare words that did not get many weight updates and thus, their position in the embedding space is somewhat “arbitrary”. When the “word” was removed, the cosine score dropped dramatically. This also worked for other titles.
Thanks to PyTorch, the implementation of re-scaling the gradients was very easy:
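def scale_grad_by_freq(parameters, lookup):
    # only parameters that received a gradient in the last backward pass
    parameters = list(filter(lambda p: p.grad is not None, parameters))
    for p in parameters:
        g = p.grad.data          # sparse gradient of the embedding matrix
        grads = g._values()      # one gradient row per looked-up token
        for j, i in enumerate(g._indices().view(-1)):
            grads[j].mul_(lookup[int(i)])  # in-place scaling by the token's weight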
The multiplication is done in-place and thanks to the sparse flag of the PyTorch Embedding module, we only re-scale a small subset of all embeddings. With this minor modification, a loss that involves rare “words” leads to a stronger error signal which partly compensates the fact that those “words” get fewer updates. This is not the holy grail, but a good example that a deeper understanding of a problem can minimize the level of frustration and gives you more time to enhance or tune your model.
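For larger batches the Python loop over the indices can become a bottleneck. A possible vectorized variant is sketched below; unlike the in-place version it builds a new sparse tensor, and the function name scale_sparse_grad as well as the weight tensor are assumptions for illustration.

```python
import torch
from torch import nn


def scale_sparse_grad(grad, weights):
    """grad: sparse gradient of an embedding matrix; weights: (vocab,) tensor of scales."""
    grad = grad.coalesce()        # merges duplicate indices
    rows = grad.indices()[0]      # ids of the embeddings that got a gradient
    values = grad.values() * weights[rows].unsqueeze(1)
    return torch.sparse_coo_tensor(grad.indices(), values, grad.shape)


emb = nn.Embedding(5, 3, sparse=True)
emb(torch.tensor([0, 2, 2])).sum().backward()   # token 2 is used twice

# Hypothetical weights: token 0 is rare (x10), token 2 is frequent (x0.5)
weights = torch.tensor([10.0, 1.0, 0.5, 1.0, 1.0])
scaled = scale_sparse_grad(emb.weight.grad, weights).to_dense()
print(scaled)
```

Only the rows of tokens 0 and 2 are touched, all other rows of the gradient stay zero.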
Even if the GPU is a mobile version, the performance gain compared to a CPU is very much noticeable. As a result, it makes a lot of sense to have a dedicated GPU in your notebook if you buy a new one. This allows you to play with more complex models, or even to train them, while you are traveling, which might mean that you don’t have easy access to servers with GPU cards all the time. And even if you just want to use a pre-trained model for feature extraction, you can save a lot of time by using the GPU.
There is always the option to use a gaming notebook, but in case you want a lightweight companion there are far fewer options. Our choice was the IdeaPad 720s with a 14″ screen, because it is lightweight, but still powerful with enough RAM, and a dedicated GeForce 940mx GPU that comes with non-shared memory. Without a doubt this is no high-end configuration, but the CUDA capabilities are sufficient to run older nets, or to design your own one, be it a ConvNet or an RNN. Plus, with the huge SSD, you can train on pretty large training sets with better I/O performance than SATA disks.
So much for the theory, but now comes the reality. Especially for newer, or more exotic notebooks, installing Linux on them is not always trivial. To spare others the pain, we summarize the steps we took to get it running. It’s still not perfect, there are minor problems with the WLAN, but we have used it for quite some hours now without any problems and successfully tested PyTorch + CUDA.
The first thing you have to do is to switch from “UEFI” to “Legacy Support” but that’s nothing new. You can enter the BIOS by pressing F2 during boot, or use the Novo “button” on the left side of the notebook. If this worked, you should shrink the NTFS volume, which is pretty straightforward, to make room for a real OS. Just halve the size, so you get about ~250 GB for Linux. After all settings are adjusted, we can start with the Linux installation. Lubuntu seems a good choice since it is also lightweight, but comes with excellent support for detecting more exotic hardware. Make sure you have chosen the correct boot order, so you can boot from a USB stick/DVD drive.
Long story short, the installation failed miserably because the SSD drive was not recognized. But there is no time for panic! Thanks to the active community, we found the answer to that problem pretty fast. You have to switch the “SATA Controller Mode” from “RAID” to “AHCI” in the BIOS. With the new setting, it was possible to create an ext4+swap partition in the free space of the SSD. Then, the actual installation could be done without any problems. Only the GRUB installation seems not optimal, since we get no boot screen and thus, we don’t know if our Windows partition was correctly recognized. From the grub config it does not seem so, but this is not our major concern, since our focus is a working Linux system with GPU support. So, we are blind at startup, but since Linux starts correctly we do not investigate this any further right now.
The next step is to get PyTorch working, which was no problem at all. We used pip, python 2.7 + cuda 8.0 and it worked like a charm. Only torchvision failed, but we solved it by using “pip install --no-deps torchvision” since one dependency is still pytorch 0.1.2. A quick test with ipython confirmed that everything is okay and working. The last step is the installation of the CUDA toolkit, which was also no problem thanks to the apt sources we just had to uncomment in the sources.list file. After “apt-get update” we installed the cuda toolkit packages and all their dependencies. Since CUDA requires a kernel module that is compiled at the end of the installation, a restart is required. To check if the setup was done correctly, run “nvidia-smi” after the reboot and see if the device is listed there.
After we got a prompt again, we downloaded a pre-trained network from the model collection and hacked some code to perform an image classification. Compared to the early days of ConvNets, even the CPU version was pretty “fast”. Next, we checked that cuda is correctly recognized by PyTorch and after that, we moved the model to the GPU and also the tensor we use for classification. As we mentioned at the beginning of the post, the performance boost was clearly visible and except for the first call, which triggered some background activities to set up CUDA, everything went smoothly.
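The check itself is only a few lines. In the sketch below, a tiny untrained network stands in for the pre-trained model, and the code is written against a recent PyTorch version and not the one we installed back then; it falls back to the CPU if CUDA is not available.

```python
import torch
from torch import nn

# Use the GPU if PyTorch can see it, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for a pre-trained image classifier with 10 classes
model = nn.Sequential(
    nn.Conv2d(3, 8, 3),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(8, 10),
).to(device).eval()

# The input has to live on the same device as the model
image = torch.randn(1, 3, 224, 224).to(device)

with torch.no_grad():
    scores = model(image)
predicted = scores.argmax(dim=1).item()

print(device, predicted)
```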
Bottom line, here is the check list again to enjoy the notebook with Linux:
– Shrink the size of your NTFS volume by 50%
– Switch from “UEFI” to “Legacy Support”
– Switch the “SATA Controller Mode” from “RAID” to “AHCI”
Since there is still room for improvements, we might create a successor blog post with additional details.
This year, Santa is a little early, but the gifts are nonetheless impressive. In the last few days, we hacked on a model to predict whether a text sequence belongs to a certain type or not, and we found out that if we do not process the text from left to right but in reverse order, the model learns much better and faster. The idea is not new and was already introduced in neural translation models. Still, it’s amazing that such a little modification has such a huge impact.
Next, we heard about the new PyTorch release and that it might also bring gifts, for instance performance boosts along with other nice goodies. So, we updated, still from source since the provided pre-compiled packages do not work for us, and then ran a short test. The results were pretty stunning, since a cycle now takes about 5 seconds less than with version 0.2.0.
In total, from a machine learning perspective we are pretty happy with the gifts Santa brought for us and with the experiences we made so far, we encourage everybody to update their PyTorch version to also feel the machine learning Christmas spirit.