Since the advent of rectified linear units, ReLU for short, there has been a lot of research about them. Except for some types of models, ReLUs dominate the learning landscape, especially for supervised models. And why not? They are easy to compute, therefore fast, and since they do not saturate, there is usually no vanishing gradient problem. The only drawback is that “dead units” might appear, which means a unit is never activated because max(0, x*w + b) is always negative. But this does not always happen and can often be avoided by adjusting the network architecture. So it is time to celebrate, right?
Well, sort of, because for some kinds of data, manual tuning is required. As a quick reminder, our movie data is high-dimensional, usually scaled to [0,1] and very sparse: on average we have about 8 non-zero entries, which means the sparsity is more than 99% for a 1,000-feature encoding. To demonstrate the problem, we train a simple Auto-Encoder with 50 ReLU units and sigmoid units in the reconstruction layer. The random seed is fixed and the weights are initialized “orthogonal”. We do not really care about the loss, but focus on the learned representation.
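To make the setup concrete, here is a minimal numpy sketch of such an Auto-Encoder forward pass: 50 ReLU hidden units, a sigmoid reconstruction layer, orthogonal weight initialization, and a synthetic input with about 8 non-zero entries out of 1,000. The tied decoder weights and all names are our own assumptions, not the original model definition.

```python
import numpy as np

rng = np.random.RandomState(42)  # fixed seed, as in the text
n_feat, n_hid = 1000, 50

# orthogonal initialization via QR decomposition (one common recipe)
q, _ = np.linalg.qr(rng.randn(n_feat, n_feat))
W = q[:, :n_hid]          # (1000, 50) encoder weights, orthonormal columns
b = np.zeros(n_hid)       # hidden bias (learned during training)
c = np.zeros(n_feat)      # reconstruction bias

def forward(x):
    h = np.maximum(0.0, x @ W + b)            # ReLU encoder
    y = 1.0 / (1.0 + np.exp(-(h @ W.T + c)))  # sigmoid decoder (tied weights)
    return h, y

# a sparse example: ~8 non-zero entries out of 1,000 (>99% sparsity)
x = np.zeros(n_feat)
x[rng.choice(n_feat, 8, replace=False)] = 1.0
h, y = forward(x)
```

Training would minimize the cross-entropy between x and y; we omit the loop here since the post is about the learned representation h, not the loss.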
Before we begin with the analysis, we illustrate how a ReLU unit, max(0, x*w + b), works. The input “x” is a binary vector of 1,000 dims, “w” is a learned weight vector of the same shape, and the scalar “b” shifts the dot product x*w in some direction. If we think of the learned weights as topics, the output of a unit is positive if the sum of the non-zero features of x, weighted by their importance for the topic, sum(x_i * w_i), exceeds the negative of the bias “b”. In other words, a unit learns a set of keywords that describe a topic, like “lawman, gunfighter, west, sheriff, cowboy” to describe western, and the neuron ‘fires’ if a sufficient number of those keywords is present (x*w > -b). The idea is that for non-western movies, negative weights of unrelated keywords push the total sum below zero, which means the unit is off.
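The keyword intuition can be sketched in a few lines of numpy. The weights and the negative bias below are invented for illustration (a tiny 10-dim vocabulary, a hypothetical “western” unit), not learned values from the model:

```python
import numpy as np

def relu_unit(x, w, b):
    """A single ReLU unit: max(0, x*w + b)."""
    return max(0.0, float(np.dot(x, w)) + b)

n_feat = 10
# hypothetical "western" unit: positive weights on western keywords
# (indices 0..4), negative weights on unrelated keywords (indices 5..9)
w = np.array([0.8, 0.7, 0.9, 0.6, 0.5, -0.4, -0.5, -0.3, -0.6, -0.4])
b = -1.0  # negative bias: a single keyword is not enough to fire

western = np.zeros(n_feat)
western[[0, 2, 3]] = 1.0   # three topic keywords present
other = np.zeros(n_feat)
other[[0, 6, 8]] = 1.0     # one keyword plus unrelated features

relu_unit(western, w, b)   # 0.8 + 0.9 + 0.6 - 1.0 = 1.3 -> unit fires
relu_unit(other, w, b)     # 0.8 - 0.5 - 0.6 - 1.0 < 0  -> unit is off
```

The unrelated features' negative weights push the sum below zero, exactly the mechanism described above.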
In a nutshell, each ReLU unit is a hyper-plane that separates the space into two regions, where one region, w*x + b > 0, contains all the movies where the matching topic is present. This behavior is responsible for a natural sparsity in the learned representation because usually movies only consist of a limited number of topics. However, since the model is not able to learn a perfect representation of the data, some topics are “weakly” present.
With this in mind, the result of the trained model is very surprising, since there is not a single disabled unit in the learned representation for the whole dataset. So, what happened? A quick look at the learned values shows that the bias “b” is always positive and on average much larger than 1. Therefore, max(0, x*w + b) is always positive or, equivalently, x*w > -b always holds. A dump of the highest learned "weighted" keywords per neuron also gave no hint what the learned topic could be, which means the neurons are not sensitive to "human-readable" topics. Not to forget that a ReLU unit that always fires is no different from a linear unit.
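This failure mode is easy to check numerically. The sketch below (synthetic data and weights, chosen only to mimic the observed situation: sparse inputs, small pre-activations, average bias well above 1) counts how often each unit is active across a dataset:

```python
import numpy as np

def activity_per_unit(X, W, b):
    """Fraction of samples on which each ReLU unit fires."""
    H = np.maximum(0.0, X @ W + b)
    return (H > 0).mean(axis=0)

rng = np.random.RandomState(0)
X = (rng.rand(200, 1000) < 0.008).astype(float)  # ~8 non-zeros per row
W = rng.randn(1000, 50) * 0.01                   # small learned weights
b = np.full(50, 1.5)                             # bias > 1, as observed

active = activity_per_unit(X, W, b)
# x*w stays tiny for such sparse x, so x*w > -b holds everywhere:
# every unit fires on every sample, i.e. active is 1.0 for all 50 units
```

With the bias dwarfing the pre-activation, the max(0, .) never clips anything, which is exactly why the model behaves linearly.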
In the literature it was suggested that a non-zero bias helps an AE model to avoid trivial solutions, but it does not always improve the learned representation of the data. A simple solution would therefore be to set the bias to zero at test time, but since this would not change the learned weights, we doubt that we can learn the best topic representation this way. A further suggestion was to replace the learned bias with a fixed threshold during training, max(0, x*w - threshold). We tried that, but with a lower threshold than suggested, 0.1 instead of 1.0, due to the massive sparsity of the input data.
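A possible implementation of such a thresholded unit is shown below. The function name and the toy weights are our own; the only thing taken from the text is the fixed threshold of 0.1 in place of a learned bias:

```python
import numpy as np

def threshold_relu(x, W, threshold=0.1):
    """ReLU with a fixed threshold instead of a learned bias:
    max(0, x*w - threshold). We use 0.1 rather than the suggested
    1.0 because the input data is extremely sparse."""
    return np.maximum(0.0, x @ W - threshold)

# toy check with a 2-feature, 1-unit weight matrix (illustrative values)
W = np.array([[0.5], [0.05]])
threshold_relu(np.array([1.0, 0.0]), W)  # 0.5 - 0.1 = 0.4 -> fires
threshold_relu(np.array([0.0, 1.0]), W)  # 0.05 - 0.1 < 0 -> off
```

Since the threshold is constant and never trained, the weights alone must account for the topic structure, which is the point of the zero-bias approach.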
The results of the "zero bias" AE model were very different. First, with the newly learned weights, we could immediately identify a lot of topics like "crime", "western" or "courtroom". Furthermore, the learned representation was indeed sparse: on average, only 15 of 50 topics were enabled (70% sparsity), and some of them only very weakly. So, what happened?
The first step was to compare the weight matrices. The mean correlation of learned topics was 0.023 for the ZAE model, while it was 0.105 for the AE. Both values are low, but the ZAE value was about 4x lower: 2.3% vs. 10.5%. Next, we plotted histograms of the weight values per neuron: for the AE, the distribution was always Gaussian, while for the ZAE there was always a huge negative peak and the distribution only sometimes resembled a Gaussian. The difference can also be seen by plotting the weights of two neurons from different models. For the ZAE, most weights are negative with a fairly stable value, but there are a few weights with very high positive values, which are the "topic keywords". The weights of the AE have a much larger spread, on both the positive and negative side.
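One way to compute such a topic correlation score is the mean absolute off-diagonal entry of the correlation matrix over topic columns; whether the original numbers were computed exactly this way is an assumption on our part:

```python
import numpy as np

def mean_topic_correlation(W):
    """Mean absolute pairwise correlation between learned topics,
    i.e. between the columns of the (features x units) weight matrix."""
    C = np.corrcoef(W.T)                          # (n_hid, n_hid)
    off_diag = C[~np.eye(C.shape[0], dtype=bool)]  # drop the diagonal 1s
    return float(np.abs(off_diag).mean())

# random 1000-dim topics are nearly orthogonal, so the score is near zero
rng = np.random.RandomState(1)
W = rng.randn(1000, 50)
score = mean_topic_correlation(W)
```

A lower score means the units learned more distinct topics, which matches the qualitative finding that the ZAE topics were easier to interpret.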
What we can say is that the experiment confirms that the bias has a huge impact on the learned representation. In our case, the AE model with ReLU units degenerated into a model with linear units because of the large positive bias: all units were always on, and since the pre-activation of a ReLU unit is linear, so is a model whose ReLU units are never off.
It should be noted that the ZAE literature mainly focuses on dense image data and often uses linear units for the reconstruction in combination with a squared loss function. This is in contrast to our approach with sigmoid units and a cross-entropy loss, and to the large positive bias values we observed instead of large negative ones; nevertheless, our experiments confirmed that zero-bias units are also useful for sparse, textual data.