Leaky ReLU vs ReLU - Explain the difference.

Asked by ananyaPawar in QA Testing on May 10, 2022

I think the advantage of using Leaky ReLU instead of ReLU is that it avoids a vanishing gradient: the slope is never exactly zero, even for negative inputs. Parametric ReLU has the same advantage, with the only difference being that the slope of the output for negative inputs is a learnable parameter, while in Leaky ReLU it is a hyperparameter.
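In code, the way I understand the three activations (a NumPy sketch; the 0.01 slope is just an example value, not a recommendation):

import numpy as np

def relu(x):
    # Zero for negative inputs, identity for non-negative inputs.
    return np.maximum(0.0, x)

def leaky_relu(x, slope=0.01):
    # Same as ReLU, but negative inputs are scaled by a small fixed
    # slope chosen as a hyperparameter (0.01 is a common example).
    return np.where(x >= 0.0, x, slope * x)

def parametric_relu(x, slope):
    # Identical formula to Leaky ReLU, except that `slope` is a
    # learnable parameter updated by back propagation.
    return np.where(x >= 0.0, x, slope * x)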


However, I'm not able to tell if there are cases where it is more convenient to use ReLU instead of Leaky ReLU or Parametric ReLU.

Answered by Andrea Bailey

Leaky ReLU vs ReLU


Comparing ReLU, the hyper-parameterized [1] leaky variant, and the variant with dynamic parameterization during learning conflates two distinct questions. The comparison between ReLU and the leaky variant is closely related to whether there is a need, in the particular ML case at hand, to avoid saturation. Saturation is the loss of signal, either to a zero gradient [2] or to the dominance of chaotic noise arising from digital rounding [3].
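A minimal illustration of the zero-gradient form of saturation (a NumPy sketch; the 0.01 slope is an arbitrary example value): for negative inputs the ReLU derivative is exactly zero, while the leaky variant keeps a small constant derivative.

import numpy as np

x = np.array([-3.0, -0.5, 0.5, 3.0])

# Derivative of ReLU: 0 on the negative domain, 1 on the positive domain.
relu_grad = np.where(x > 0.0, 1.0, 0.0)

# Derivative of Leaky ReLU with an example slope of 0.01: never exactly
# zero, so some gradient always flows back through the unit.
leaky_grad = np.where(x > 0.0, 1.0, 0.01)

print(relu_grad)   # [0.   0.   1. 1.]  -> signal is lost for negative inputs
print(leaky_grad)  # [0.01 0.01 1. 1.]  -> small but non-zero signal survives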

The comparison between training-dynamic activation (called parametric in the literature) and training-static activation must be based on whether the non-linear or non-smooth characteristics of activation have any value related to the rate of convergence [4].
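A sketch of that distinction, using PyTorch purely as an example framework: nn.LeakyReLU carries a fixed negative slope, while nn.PReLU holds the slope as a parameter that receives gradients and is updated during training.

import torch
import torch.nn as nn

x = torch.tensor([-2.0, -1.0, 1.0, 2.0])

# Training-static activation: the negative slope is a fixed hyperparameter.
static_act = nn.LeakyReLU(negative_slope=0.01)

# Training-dynamic ("parametric") activation: the negative slope is a
# learnable parameter, initialised here to PyTorch's default of 0.25.
dynamic_act = nn.PReLU()

dynamic_act(x).sum().backward()

# The slope itself has a gradient, so the optimiser can adjust it.
print(dynamic_act.weight.grad)        # non-zero gradient w.r.t. the slope
print(list(static_act.parameters()))  # empty: nothing to learn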

The reason ReLU is never parametric is that making it so would be redundant. In the negative domain, it is the constant zero. In the non-negative domain, its derivative is constant. Since the activation input vector is already attenuated by a vector-matrix product (where the matrix, cube, or hyper-cube contains the attenuation parameters), there is no useful purpose in adding a parameter to vary the constant derivative for the non-negative domain.
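A small numeric check of that redundancy (a NumPy sketch with made-up weights): a positive scale applied to ReLU's output can always be absorbed into the next layer's weight matrix, so a learnable slope for the non-negative domain would add nothing the weights cannot already express.

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)            # activation inputs (made-up values)
W_next = rng.normal(size=(3, 4))  # next layer's weight matrix (made-up)
a = 1.7                           # a hypothetical learnable positive slope

relu = lambda v: np.maximum(0.0, v)

# Scaling ReLU's output by `a` ...
y1 = W_next @ (a * relu(x))
# ... is identical to folding `a` into the next layer's weights.
y2 = (W_next * a) @ relu(x)

print(np.allclose(y1, y2))  # True: the extra slope parameter is redundant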

When there is curvature in the activation, it is no longer true that all the coefficients of activation are redundant as parameters. Their values may considerably alter the training process and thus the speed and reliability of convergence. For substantially deep networks, the redundancy re-emerges, and there is evidence of this, both in theory and in practice, in the literature. In algebraic terms, the disparity between ReLU and parametrically dynamic activations derived from it approaches zero as the depth (in number of layers) approaches infinity.

In descriptive terms, ReLU can accurately approximate functions with curvature [5] if given a sufficient number of layers to do so. That is why the ELU variety, which is advantageous for averting the saturation issues mentioned above for shallower networks, is not used for deeper ones.
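A rough illustration of ReLU approximating curvature (a hand-constructed sketch, not a trained network): a single layer of ReLU "hinges", with knots and coefficients chosen by hand, reproduces a piecewise-linear approximation of x² on [0, 1].

import numpy as np

relu = lambda v: np.maximum(0.0, v)

# Knots and hinge coefficients chosen by hand so that the sum of ReLU
# units interpolates f(x) = x**2 at x = 0, 0.25, 0.5, 0.75, 1.0.
knots = np.array([0.0, 0.25, 0.5, 0.75])
coefs = np.array([0.25, 0.5, 0.5, 0.5])   # change in slope at each knot

def relu_approx(x):
    # One "hidden layer" of ReLU units, each shifted to a knot,
    # combined by a linear output layer (the coefficients).
    return sum(c * relu(x - k) for c, k in zip(coefs, knots))

xs = np.linspace(0.0, 1.0, 5)
print(relu_approx(xs))  # [0.     0.0625 0.25   0.5625 1.    ]
print(xs ** 2)          # matches exactly at the knots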

So one must decide two things.

Whether parametric activation is helpful is often decided by experimentation with several samples from a statistical population, but there is no need to experiment with it at all if the layer depth is high.

Whether the leaky variant is of value has much to do with the numerical ranges encountered during back propagation. If the gradient becomes vanishingly small at any point in training, a constant (flat) portion of the activation curve may be problematic. In such a case, one of the smooth functions or leaky ReLU, with its two non-zero slopes, may provide an adequate solution.
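One practical way to inform that second decision (a PyTorch sketch with an artificially biased toy layer, not a prescription): monitor how often a unit's pre-activation falls in ReLU's flat region during training, and how much gradient still reaches the inputs; if the flat fraction is high and the gradient is vanishing, a leaky or smooth activation is worth trying.

import torch
import torch.nn as nn

torch.manual_seed(0)

# A toy layer whose bias is forced very negative so that most
# pre-activations land in ReLU's zero-gradient region
# (an artificial "dying ReLU" setup for illustration only).
layer = nn.Linear(8, 8)
with torch.no_grad():
    layer.bias.fill_(-5.0)

x = torch.randn(32, 8, requires_grad=True)
pre = layer(x)

dead_fraction = (pre <= 0).float().mean().item()
print(f"fraction of pre-activations in the flat region: {dead_fraction:.2f}")

# Compare the gradient reaching the inputs under each activation.
for act in (nn.ReLU(), nn.LeakyReLU(negative_slope=0.01)):
    x.grad = None
    act(pre).sum().backward(retain_graph=True)
    print(type(act).__name__, "mean |grad| at input:",
          x.grad.abs().mean().item())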

In summary, the choice is never a choice of convenience.

Footnotes

[1] Hyper-parameters are parameters that affect the signalling through the layer but are not part of the attenuation of inputs for that layer. The attenuation weights are parameters. Any other parametrization is in the set of hyper-parameters. This may include learning rate, dampening of high frequencies in the back propagation, and a wide variety of other learning controls that are set for the entire layer, if not the entire network.

[2] If the gradient is zero, then there cannot be any intelligent adjustment of the parameters because the direction of the adjustment is unknown, and its magnitude must be zero. Learning stops.

[3] If chaotic noise, which can arise as the CPU rounds extremely small values to their closest digital representation, dominates the correction signal that is intended to propagate back to the layers, then the correction becomes nonsense and learning stops.

[4] Rate of convergence is a measure of the speed (either relative to wall-clock time or relative to the iteration index of the algorithm) at which the result of learning (system behaviour) approaches what is considered good enough. That is usually a specified proximity to some formal acceptance criterion for the convergence (learning).

[5] Functions with curvature are ones that are not visualised as straight or flat. A parabola has curvature. A straight line does not. The surface of an egg has curvature. A perfect flat plane does not. Mathematically, if any of the elements of the Hessian of the function is non-zero, the function has curvature.


