After the introduction of ReLU neurons, many researchers began to analyze why this activation function works so well, beyond the obvious fact that it largely avoids the vanishing-gradient problem. Its success is not limited to convolutional networks, as demonstrated by papers that successfully trained multiple layers of ReLUs without any kind of pre-training.
So, why are these neurons so popular? They are very fast to compute, and they cause far fewer problems with gradients. They also introduce hard sparsity in the output features, which further reduces storage and the computational cost of adjacent layers. The main drawback is that ReLU neurons are unbounded: there is no upper limit on their output value. This property can lead to very large gradients, but it is easily avoided with a lower learning rate or by restricting the norm of the gradient to a fixed value.
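Restricting the gradient norm can be sketched roughly as follows (a minimal NumPy illustration; the function name and `max_norm` parameter are my own, not from any particular framework):

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    # If the L2 norm of the gradient exceeds max_norm, rescale it so
    # its norm equals max_norm; otherwise leave it untouched. This caps
    # the update size when unbounded ReLU outputs blow up the gradient.
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad
```

Deep learning frameworks provide equivalents of this operation out of the box, applied to all parameters at once before the optimizer step.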
For a better understanding of what a ReLU neuron learns, we can imagine each neuron as a hyperplane that divides the input space into two parts. If the dot product of the input and the neuron's weights is negative, the sample lies “left” of the hyperplane, which is interpreted as “zero”. In the other case, a positive dot product means the sample lies on the “right” side of the hyperplane, and the magnitude of the dot product is its “distance” from the hyperplane.
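This geometric picture can be written down in a few lines (a sketch in NumPy; the function name and the optional bias term are illustrative):

```python
import numpy as np

def relu_neuron(x, w, b=0.0):
    # The dot product tells us on which side of the hyperplane
    # (defined by weights w and bias b) the input x lies.
    z = np.dot(w, x) + b
    # ReLU keeps only the "positive side"; inputs on the negative
    # side are mapped to exactly zero.
    return max(0.0, z)
```

For example, with weights `w = [1, -2]`, the input `[1, 1]` lies on the negative side and yields 0, while inputs on the positive side yield their (scaled) distance from the hyperplane.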
Since neither the input nor the weights of a ReLU neuron usually have unit norm, we need to normalize the features first (to unit L2 norm) before we can use them for instance retrieval. With traditional neuron types (sigmoid, tanh) this is not required, because their outputs are bounded to [0, 1] or [-1, 1].
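The normalization step is just a division by the vector's L2 norm (a minimal sketch; the `eps` guard against zero-length vectors is my own addition):

```python
import numpy as np

def l2_normalize(features, eps=1e-12):
    # Scale the feature vector to unit L2 norm, so that the dot
    # product between two normalized vectors equals their cosine
    # similarity -- a sensible distance for instance retrieval.
    norm = np.linalg.norm(features)
    return features / max(norm, eps)
```

After this step, ranking by dot product is equivalent to ranking by cosine similarity, regardless of how large the raw ReLU activations were.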
In a nutshell, we have these advantages:
– the training is usually easier and faster than with saturating neuron types
– ReLUs introduce hard sparsity in the output features
– the activation value is very fast to compute