And The Winner Takes… All

Learning with a teacher is very popular, but it is not very reasonable to assume that a teacher is always present to correct wrong decisions. Therefore, we need an alternative way to learn something from data.

There are lots of different approaches to finding patterns in data, for instance autoencoders, which use a bottleneck in the representation to force the model to learn patterns that are useful for reconstructing the data. Another method is competitive learning, where a group of neurons competes to contribute to a representation. With this method there can only be one winner, which means the other neurons are “muted” and only the winner contributes. This helps individual neurons to become “experts” for specific patterns, similar to a factorization model where a specific topic is recognized by a particular weight vector.

Actually, WTA -Winner Takes All- neurons are nothing new and have been popular for quite a while. In the beloved Deep Learning, however, they are used rather seldom. The recently introduced “maxout” unit also uses a competitive step that groups neurons and forwards only the maximal value from each group. However, since only the maximum is forwarded, maxout is also a form of dimensionality reduction (“pooling”). In contrast to maxout, WTA sets the output of the non-winning neurons to “0.0” and thus forwards all values to the next layer, but only one value per group is non-zero. Besides being biologically more plausible, WTA also produces a sparse output: in case of 20 neurons with a group size of 2, half of the output values are zero.
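To make the difference concrete, here is a tiny NumPy example (mine, not from the original post) with four pre-activation values and a group size of 2: maxout keeps only the group maxima and halves the dimension, while WTA keeps the full dimension but zeroes the losers.

import numpy as np

pre = np.array([0.3, 1.2, 0.5, 0.7])    # two groups: (0.3, 1.2) and (0.5, 0.7)
groups = pre.reshape(2, 2)

# maxout: forward only the maximum of each group -> dimensionality is halved
maxout = groups.max(axis=1)              # [1.2, 0.7]

# WTA: mask out the non-winners, keep the original dimensionality
mask = groups >= groups.max(axis=1, keepdims=True)
wta = (mask * groups).reshape(-1)        # [0.0, 1.2, 0.0, 0.7] -- half of the values are zero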

To learn more about WTA in the context of unsupervised learning, we implemented a very simple autoencoder with Theano. Except for the implementation of the WTA units, the code is pretty straightforward. Because naive implementations in Theano are often not very efficient, we used ideas from a mailing-list posting that described how to implement a maxout autoencoder. The parameters of the model consist of [W, bias, V_bias], where V_bias is the bias of the visible nodes. We use a group size of two, which means the shape of the weight matrix W is [num_input, num_units*group_size] and the shape of the bias is [num_units*group_size].
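As a rough sketch of the parameter setup (the exact sizes and the small random initialization below are my assumptions, the post does not show them):

import numpy as np
import theano
import theano.tensor as T

num_input, num_units, group_size = 784, 10, 2   # example sizes

rng = np.random.RandomState(0)
# weight matrix of shape [num_input, num_units*group_size]
W = theano.shared(rng.normal(0, 0.01, (num_input, num_units * group_size)).astype(theano.config.floatX), name='W')
# hidden bias of shape [num_units*group_size]
bias = theano.shared(np.zeros(num_units * group_size, dtype=theano.config.floatX), name='bias')
# bias of the visible nodes
V_bias = theano.shared(np.zeros(num_input, dtype=theano.config.floatX), name='V_bias')

params = [W, bias, V_bias]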

The idea is to generate a binary mask that is “1” for the winning neuron and “0” for the rest. First, we compute the pre-activation values and then reshape the result, where block_size is the number of rows in the data batch X:
(1) pre_hidden = dot(X, W) + bias
(2) shaped_hidden = pre_hidden.reshape((block_size, num_units, group_size))
Next, we determine the maximum per group with some shuffling of the dimensions:
(3) maxout = shaped_hidden.max(axis=2)
(4) max_mask = (shaped_hidden >= maxout.dimshuffle(0, 1, 'x'))
And finally, we can calculate the actual output value and reshape accordingly:
(5) (max_mask * shaped_hidden).reshape((block_size, num_units*group_size))
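Putting the five steps together (continuing the parameter sketch above; wta_hidden is just a name for the final output):

X = T.matrix('X')        # a batch of input rows
block_size = X.shape[0]  # batch size, taken symbolically from X

pre_hidden = T.dot(X, W) + bias                                           # (1)
shaped_hidden = pre_hidden.reshape((block_size, num_units, group_size))   # (2)
maxout = shaped_hidden.max(axis=2)                                        # (3)
max_mask = T.cast(shaped_hidden >= maxout.dimshuffle(0, 1, 'x'), theano.config.floatX)  # (4)
wta_hidden = (max_mask * shaped_hidden).reshape((block_size, num_units * group_size))   # (5)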

Since all the magic is done in the “WTA layer”, the autoencoder model does not look different from one with ReLU units, but it is obvious that the required operations are more complex. Furthermore, there are more parameters to learn, since each neuron is now part of a group of at least size 2.
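For completeness, here is one possible way to finish the sketch with a reconstruction and a training function. Tied weights (reusing W.T for the decoder) and a squared-error loss with plain gradient descent are my assumptions, not necessarily what the original code does:

# decode with tied weights and the visible bias (assumption for this sketch)
X_hat = T.dot(wta_hidden, W.T) + V_bias
loss = T.mean(T.sum((X - X_hat) ** 2, axis=1))

# plain gradient descent on all parameters
lr = 0.01
grads = T.grad(loss, params)
updates = [(p, p - lr * g) for p, g in zip(params, grads)]
train = theano.function([X], loss, updates=updates)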

Bottom line: why should we give WTA neurons a chance at all? WTA neurons share the benefits of ReLU units, but do not have the “dying” problem, where the pre-activation value is always negative and the output therefore permanently zero. The price we have to pay is that there are more parameters to learn and that the computational complexity is higher.
