It is no secret that part of the huge success of deep neural networks is owed to new activation functions like ReLU and its friends. The major advantages of ReLU units are that they are very efficient, non-saturating, and have a very simple derivative. However, for some problems we encounter “dead” units, meaning units that never get activated, or equivalently, units for which (x*W + b) <= 0 for all inputs x.
The question is whether max(x, 0) is always the right choice, since a ReLU unit that returns 0 passes no gradient. This is no problem as long as the error signal can still flow backwards through the net along other paths, but a dead unit never receives a gradient and so can never recover. For that reason, LReLU units are used, where the L stands for leaky: instead of 0, they return a small fraction of the pre-activation, usually 0.01*(x*W + b), whenever (x*W + b) < 0. The difference between max(x, 0) and max(x, 0.01*x) is sometimes noticeable in model accuracy, but often both lead to very similar results.
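To make the gradient argument concrete, here is a minimal NumPy sketch of both units and their derivatives. The function names (relu, leaky_relu and the *_grad helpers) are made up for this illustration, and 0.01 is simply the usual default slope mentioned above.

```python
import numpy as np

def relu(h):
    # max(h, 0): negative pre-activations are clipped to zero
    return np.maximum(h, 0.0)

def relu_grad(h):
    # derivative is 1 for h > 0 and 0 otherwise, so inactive units pass no gradient
    return (h > 0).astype(h.dtype)

def leaky_relu(h, slope=0.01):
    # max(h, slope*h): negative pre-activations are scaled down, not zeroed
    return np.maximum(h, slope * h)

def leaky_relu_grad(h, slope=0.01):
    # derivative is 1 for h > 0 and `slope` otherwise, so some gradient always flows
    return np.where(h > 0, 1.0, slope)

# a few example pre-activations h = x*W + b
h = np.array([-2.0, -0.5, 0.5, 2.0])

print(relu(h), relu_grad(h))
print(leaky_relu(h), leaky_relu_grad(h))
```

For the negative entries the ReLU gradient is exactly zero, while the leaky version still passes a small gradient of 0.01, which is what gives a dead unit a chance to recover.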
The next logical step was to make the ReLU unit more powerful by considering h = dot(x, W) + b; max(h, a*h), or stated differently, by learning the slope a of the leaky version of ReLU, which would otherwise always be fixed to 0.01. This unit is called PReLU, short for parametric ReLU. It is no surprise that PReLU units can have a bigger impact on model accuracy, since they allow the network to learn different activation functions for different neurons. Plus, the number of new parameters to learn is minimal: one scalar per unit, which is the same order as the bias terms. However, to train a successful model we still have to take care of many other things, including avoiding overfitting when there is very little training data. Nevertheless, the choice of activation function is very important, as the introduction of the ReLU unit has shown.
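As a rough sketch of such a unit, the snippet below implements the forward pass with one learnable slope per unit, plus the gradient of the output with respect to that slope. The names prelu and prelu_grad_slope, the random toy data, and the initial slope of 0.25 are assumptions made for this example; the slope is called a here to keep it apart from the bias b.

```python
import numpy as np

def prelu(h, a):
    # h where the unit is active, a*h otherwise; `a` holds one slope per unit
    return np.where(h > 0, h, a * h)

def prelu_grad_slope(h):
    # d prelu / d a: equals h for inactive units and 0 for active ones
    return np.where(h > 0, 0.0, h)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))   # toy batch: 4 samples, 3 input features
W = rng.normal(size=(3, 2))   # 2 hidden units
b = np.zeros(2)               # bias, one scalar per unit
a = np.full(2, 0.25)          # learnable slope, one scalar per unit (assumed init)

h = np.dot(x, W) + b
out = prelu(h, a)

# the slope only receives a gradient from samples where the unit is inactive
grad_a = prelu_grad_slope(h).sum(axis=0)
print(out.shape, grad_a.shape)
```

As long as the slope stays between 0 and 1, this is the same as max(h, a*h). Writing it with np.where keeps the definition valid if training pushes a slope outside that range.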
We can also demonstrate the problem with an example from the movie domain: with a set of latent topics W, learned by some model, we want to project the input data into the feature space. A naive approach would be to use max(0, dot(x, W)). The drawback of such a simple activation function is that a single matching feature between input and topic already triggers the output, even though such a minimal overlap is not really useful. Why? If a sample consists only of the feature “love”, all neurons with W[love] > 0 get activated, but with an overlap of a single keyword we can barely say that the topic learned by a neuron W_i is a good match for the input data. In other words, we need a more powerful activation function that considers both the magnitude of the weights and the overlap of the input features with a topic. Something like a combination of a rectified dot product and the Jaccard coefficient.
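One possible way to combine the two, purely as an illustration, is to scale the rectified dot product by the Jaccard overlap between the active input features and the features a topic puts positive weight on. Everything in this sketch is an assumption: the function name jaccard_relu, the threshold that decides which features belong to a topic, the tiny vocabulary, and the weights.

```python
import numpy as np

def jaccard_relu(x, W, threshold=0.0):
    # features present in the sample, and features each topic "cares" about
    x_on = x > 0                      # shape (n_features,)
    topic_on = W > threshold          # shape (n_features, n_topics)
    inter = np.logical_and(x_on[:, None], topic_on).sum(axis=0)
    union = np.logical_or(x_on[:, None], topic_on).sum(axis=0)
    jaccard = inter / np.maximum(union, 1)   # guard against an empty union
    # rectified dot product, damped by how well input and topic overlap
    return np.maximum(np.dot(x, W), 0.0) * jaccard

# toy vocabulary: rows are "love", "wedding", "kiss", "alien"
W = np.array([[0.9, 0.1],
              [0.8, 0.0],
              [0.7, 0.0],
              [0.0, 0.9]])            # columns: a romance topic, a sci-fi topic

x_love = np.array([1.0, 0.0, 0.0, 0.0])      # only the keyword "love"
x_romance = np.array([1.0, 1.0, 1.0, 0.0])   # "love", "wedding", "kiss"

print(np.maximum(np.dot(x_love, W), 0.0))    # plain rectified dot product
print(jaccard_relu(x_love, W))
print(jaccard_relu(x_romance, W))
```

With these toy numbers, the single keyword “love” still activates the romance topic, but only at a third of the plain ReLU value, while a sample that covers the whole topic keeps its full activation.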
Bottom line, for lots of problems, standard activation functions can be used to get very good models, but sometimes it makes sense to learn the activation function.