A lot has been written about the benefits of sparsity for the output values of neurons in a network. In our case, with bag-of-words data that is binary and high-dimensional, sparsity allows a compact representation of the input, a disentangling of topical factors and an easier interpretation. For instance, let us assume that we have 64 units and all of them are activated regardless of the input pattern. Such a representation is not very useful, and sparsity would clearly help to analyze which neuron is sensitive to which kind of latent topic. In the case of sigmoid units, with values in the range 0…1, there are different variants to encourage sparsity:
– the L1 norm on the hidden activation
– Student-t penalty log(1 + h**2)
– cross entropy with desired probability, like p=0.05
or p * T.log(p / h) + (1 - p) * T.log((1 - p) / (1 - h))
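The three variants above can be sketched in a few lines. This is only an illustration: NumPy is used as a stand-in for the Theano “T” namespace, the function names are hypothetical, and two details are assumptions that the list does not specify: the penalties are summed over all entries, and the probability-based term is applied to the mean activation of each unit, clipped by a small “eps” to keep the logarithm finite.

```python
import numpy as np

def l1_penalty(h):
    # L1 norm of the hidden activations, summed over batch and units.
    return np.sum(np.abs(h))

def student_t_penalty(h):
    # Student-t penalty log(1 + h**2), summed over batch and units.
    return np.sum(np.log1p(h ** 2))

def target_probability_penalty(h, p=0.05, eps=1e-8):
    # p * log(p / h) + (1 - p) * log((1 - p) / (1 - h)), evaluated on the
    # mean activation of each unit (assumption) and summed over units.
    # Clipping with eps is an assumption to avoid log(0) and division by zero.
    rho = np.clip(h.mean(axis=0), eps, 1.0 - eps)
    return np.sum(p * np.log(p / rho) + (1.0 - p) * np.log((1.0 - p) / (1.0 - rho)))

# Sigmoid-like activations with shape (batchsize, num_units).
h_on_target = np.full((4, 3), 0.05)
h_dense = np.full((4, 3), 0.5)

on_target = target_probability_penalty(h_on_target)  # units already at p
dense = target_probability_penalty(h_dense)          # units far above p
```

The last term vanishes when every unit mean equals “p” and grows as the means move away from it, which is what makes it usable as a sparsity penalty for units bounded to 0…1.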
But since sigmoid units are not really popular these days, approaches that treat the activation as a probability are no longer applicable. With modern units like elu/relu/leaky relu that are unbounded on the positive side, a different approach is required. A method that can be used regardless of the unit type is to encourage the output values to be close to a specific value, like “p=0” for relu units. This can be done with the following penalty term:
– T.sum((0 - T.mean(h, axis=0))**2)
In the case of mini-batch training, “h” has the shape (batchsize, num_units) and we consider the mean activation of each unit over the mini-batch, which is “T.mean(h, axis=0)”. The penalty is the sum of the squared differences between “p” and each unit mean.
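To make the shapes concrete, here is a minimal sketch of the same term, with NumPy standing in for the Theano expression. The function name and the toy batch are made up for illustration.

```python
import numpy as np

def mean_activation_penalty(h, p=0.0):
    # h has shape (batchsize, num_units).
    unit_means = h.mean(axis=0)           # shape (num_units,)
    return np.sum((p - unit_means) ** 2)  # scalar penalty

# Toy mini-batch: 2 samples, 3 relu-like units.
h = np.array([[0.0, 2.0, 1.0],
              [0.0, 0.0, 3.0]])

penalty = mean_activation_penalty(h)  # unit means are [0, 1, 2]
```

The first unit is silent for the whole batch and contributes nothing, while the third unit, which is active for every sample, contributes the most.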
Now the question is: why is such a sparsity constraint required at all when relu units + dropout already lead to sparse activations? Well, the problem is the high-dimensional, but very sparse, input data, which often leads to feature representations where all hidden units are activated. As noted earlier, this is not useful at all and is also a strong indicator that the model was not able to learn the underlying structure of the input data.
Our recent experiment with an autoencoder and threshold units did not suffer from the described problem, but the feature representation was still very dense, with only 10% zeros, despite the fact that we used dropout and rectification. To analyze possible benefits of sparse features, we re-trained the model with the penalty term. With the adjusted model, the proportion of zeros increased to about 56%. The next step is to analyze the qualitative results of the sparse model.
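For reference, the proportion of zeros in a feature matrix can be measured with a one-liner. The sketch below is an assumption about how such a number could be computed, not the exact procedure used in the experiment; the function name, the “tol” parameter and the toy matrix are hypothetical.

```python
import numpy as np

def zero_fraction(h, tol=0.0):
    # Fraction of entries in h whose magnitude is at most tol.
    return float(np.mean(np.abs(h) <= tol))

# Toy feature matrix: 2 samples, 4 hidden units.
features = np.array([[0.0, 1.5, 0.0, 0.0],
                     [2.0, 0.0, 0.0, 0.3]])

sparsity = zero_fraction(features)  # 5 of 8 entries are zero
```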
Bottom line: some network architectures provide natural sparsity and no extra work is required, but sometimes an explicit sparsity term is required, especially for certain kinds of input data.