Since we started with our audio project, we thought about ways how to learn audio features in an unsupervised way. For instance, in case of speaker recognition we are more interested in a condensed representation of the speaker characteristics than in a classifier since there is much more unlabeled data available to learn from. However, without supervision there is always the risk that the learned representation does not help in the task at hand. Still, it’s worth a try since the data is available and so are suited network architectures.
Autoencoders (AEs) have a long history in machine learning and since some years, the convolutional variant became also more and more popular. However, since conv AEs use inverse operations and some advance stuff to recover lost information during the forward-propagation step, we thought it is a good idea to provide a clean, minimal example with some additional hints which help to understand the workflow. Without a doubt there are other examples around, but we did not find one that was exactly matching our domain (audio + conv1d), or at least not a minimal one that does not involve studying lots of unrelated code.
The conv AE consists of two modules, an encoder and a decoder which is not different to the vanilla AE. The encoder part looks a lot like a common convnet with some minor, but important variations:
c1 = nn.Conv1d(in_size, 16, 3)
m1 = nn.MaxPool1d(2, return_indices=True)
i1 = None
c2 = nn.Conv1d(16, 16, 3)
The first layer c1 is an ordinary 1D convoluation with the given in_size channels and 16 kernels with a size of 3×1. The next layer m1 is a max-pool layer with a size of 2×1 and stride 1×1. Additionally the indices of the maximal value will be returned since the information is required in the decoder later. The last layer is again conv 1d layer.
The forward step looks like that:
_c1 = c1(x_in)
_m1, i1 = m1(_c1)
Again this should look pretty familiar, except for the pooling call because it returns both the output and the indices of the maximal value.
Then comes the decoder that uses the input from the encoder step:
d1 = nn.ConvTranspose1d(16, 16, 3)
u1 = nn.MaxUnpool1d(2)
d2 = nn.ConvTranspose1d(16, in_size, 3)
The architecture is reversed which means the last layer of the encoder fits into the first layer in the decoder. Thus, every layer is the inverse operation of the encoder layer: conv->transpose conv, pool->unpool. At the end, the full input is reconstructed again.
With the forward step as follows:
_d1 = d1(x_in)
_i1 = encoder.i1 # pool positions from encoder
_u1 = self.u1(_d1, _i1)
Here we can see, that the unpooling uses the position information from the encoder. This is required since after the max pooling is done, no reversing is possible with the index information.
For example: x = (5, 10), maxpool(x, size=2) = 10 but we have no longer the information at which position the value was located: (10, ?) or (?, 10)? With the index from the encoder step, we can at least recover the position of the maximal value, but we still have to set all other values to 0 since this data is not available any longer: (0, 10). As a result, we still lose information but we can at least undo the maxpool step.
The workflow is easier to understand if we analyze the shape of each step:
Encoder: x_in=(1, 128, 44), c1=(1, 16, 42), m1=(1, 16, 21), c2=(1, 16, 19)
Decoder: x_hat_in=(1, 16, 19), d1=(1, 16, 21), u1=(1, 16, 42), d2=(1, 128, 44)
We can see that every shape in the decoder has a matching counterpart in the encoder: d2 x_in, u1 c1, d1 m1, x_hat_in c2.
Now, equipped with this knowledge, which can be also found in the excellent documentation of PyTorch, we can move from this toy example to a real (deep) conv AE with as much layers as we need and furthermore, we are also not limited to audio, but we can also build 2D convolutional AEs for images or even videos.