Optimisation techniques for Neural Networks II: Activation functions & vanishing gradient

Have you ever stopped to think about why the ReLU function works so well?
In the previous article in this series we presented two techniques to improve the optimisation of a neural network, both based on adaptation: adapting the learning rate α and adapting the gradient descent formula by including new terms in the equation.
Previous Article: How using adaptive methods can help your network perform better.
In this article we are going to focus on another essential element of Artificial Neural Networks (ANNs), the activation function, and we will see what limitations it has and how they can be overcome.
The activation function
We already mentioned the activation function in our introductory article on ANNs, but what does it do exactly, and how can we know which one to use?
The nodes of an ANN are characterised by performing an operation on the information they receive through this activation function, which we’ll denote by φ. Thus, the output of a neuron j, yⱼ, is the result of the activation function φ applied to the information that neuron j receives from its m predecessor neurons:
yⱼ = φ(vⱼ) = φ( Σᵢ₌₁ᵐ ωᵢⱼ yᵢ ), where vⱼ is the weighted sum of the outputs of the m predecessor neurons.
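To make this concrete, here is a minimal NumPy sketch (the weights and inputs are made up for illustration, not taken from the article) of how a single neuron j would compute its output with a sigmoid activation:

```python
import numpy as np

def sigmoid(v):
    """Sigmoid activation: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-v))

# Hypothetical outputs of the m = 3 predecessor neurons
y_prev = np.array([0.2, -1.5, 0.7])
# Hypothetical weights w_ij connecting them to neuron j
w = np.array([0.4, 0.1, -0.8])

v_j = np.dot(w, y_prev)   # weighted input received by neuron j
y_j = sigmoid(v_j)        # output of neuron j
print(v_j, y_j)
```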
Why are these functions not always perfect?
The sigmoid activation function takes the input vⱼ and transforms it into a value between 0 and 1, while the hyperbolic tangent (tanh) transforms it into a value between -1 and 1.
The problem is that these neurons can saturate: arbitrarily large positive inputs always return an output of 1, and arbitrarily large negative inputs return 0 (or -1 in the case of tanh). As a result, these functions are only sensitive to changes when vⱼ is close to 0, the region where the sigmoid outputs values around 0.5 and tanh outputs values around 0. Once the neurons saturate, it becomes very difficult for the algorithm to adapt the weights and improve the performance of the model.
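A quick way to see this saturation numerically (a small sketch of our own, not code from the article) is to evaluate the sigmoid and tanh, together with their derivatives, at a few increasingly large inputs:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def sigmoid_grad(v):
    s = sigmoid(v)
    return s * (1.0 - s)          # at most 0.25, reached at v = 0

def tanh_grad(v):
    return 1.0 - np.tanh(v) ** 2  # at most 1, reached at v = 0

for v in [0.0, 2.0, 5.0, 10.0]:
    print(f"v={v:5.1f}  sigmoid={sigmoid(v):.4f}  sigmoid'={sigmoid_grad(v):.6f}  "
          f"tanh={np.tanh(v):.4f}  tanh'={tanh_grad(v):.6f}")

# At v = 10 both derivatives are essentially zero: the neuron is saturated,
# so gradient descent can barely move the weights that feed into it.
```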
In addition, deep networks (those with many hidden layers) can be difficult to train because of the way the gradients of the first layers are related to those of the final layers. The magnitude of the gradient can decrease exponentially with each additional layer we add, and when that happens the algorithm no longer knows how to adjust the parameters to improve the cost function. This is the well-known vanishing gradient problem.
Understanding the vanishing gradient problem
Let us look at this problem in a little more detail. Suppose we have a network with m hidden layers, each containing a single neuron, and let us denote the weights between layers by ω⁽¹⁾, ω⁽²⁾, …, ω⁽ᵐ⁾. Suppose also that the activation function of each layer is the sigmoid function and that the weights have been randomly initialised with an expected value equal to 1. Let x be the input, y⁽ᵗ⁾ the hidden value of each layer t and φ⁽ᵗ⁾’(v⁽ᵗ⁾) the derivative of the activation function in hidden layer t. From the backpropagation algorithm we know the expression:
∂C/∂y⁽ᵗ⁾ = ω⁽ᵗ⁺¹⁾ · φ⁽ᵗ⁺¹⁾’(v⁽ᵗ⁺¹⁾) · ∂C/∂y⁽ᵗ⁺¹⁾

where C denotes the cost function. In other words, the gradient at layer t is the gradient at layer t+1 multiplied by the weight ω⁽ᵗ⁺¹⁾ and by the derivative of the activation function. As the plots below show, the derivative of the sigmoid is at most 0.25 (reached at v = 0) and that of tanh is at most 1, and both are practically zero once the neuron saturates. With weights whose expected value is 1, each additional sigmoid layer therefore multiplies the gradient by a factor of at most about 0.25, so the gradient reaching the first layers shrinks exponentially with the depth of the network.
Left: Plot of the derivative of the sigmoid function. Right: Plot of the derivative of the tanh function
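To see this exponential decay in numbers, here is a toy sketch (our own illustration, under the same assumptions as above: one sigmoid neuron per layer and weights equal to 1) that backpropagates a gradient through the chain of layers:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

m = 20                   # number of hidden layers, one neuron each
weights = np.ones(m)     # weights with expected value 1, as in the text
x = 0.5                  # arbitrary scalar input

# Forward pass: store the pre-activation v of every layer
vs = []
y = x
for w in weights:
    v = w * y
    vs.append(v)
    y = sigmoid(v)

# Backward pass: start with gradient 1 at the output and apply the chain rule
grad = 1.0
for w, v in zip(reversed(weights), reversed(vs)):
    s = sigmoid(v)
    grad *= w * s * (1.0 - s)   # each factor is at most 0.25

print(f"gradient reaching the first layer after {m} layers: {grad:.2e}")
# Roughly 1e-13 here (each factor is a bit below the 0.25 bound), so the
# first layers receive essentially no learning signal.
```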
The ReLU function
This is still an open problem, and several solutions are known that can be applied to this particular issue. One that has gained traction in recent years is to use the Rectified Linear Unit activation function or, as it is usually known in Data Science circles, the ReLU function. It is defined by the expression φ(v) = max(0, v), and it brings two main advantages (a short code sketch follows the list below):
- Computational simplicity: unlike the sigmoid and tanh functions, it does not require the calculation of an exponential.
- Representational sparsity: a great advantage of this function is that it can return an exact zero. This allows the hidden layers to contain one or more nodes that are completely “off”. This is called a sparse representation, and it is a desirable property since it speeds up and simplifies the model.
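As a minimal sketch (again our own NumPy illustration, not code from the original article), the following shows the two points above and hints at why ReLU also helps with the vanishing gradient: no exponential is needed, negative inputs produce exact zeros, and the derivative is exactly 1 wherever the unit is active, so it does not shrink the gradient the way a saturated sigmoid does:

```python
import numpy as np

def relu(v):
    """ReLU: phi(v) = max(0, v); no exponential involved."""
    return np.maximum(0.0, v)

def relu_grad(v):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return (v > 0).astype(float)

v = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print("relu(v)  =", relu(v))       # exact zeros for negative inputs -> sparsity
print("relu'(v) =", relu_grad(v))  # gradient is exactly 1 where the unit is active
```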
