author:  ishankarora 
score:  9 / 10 

What is the core idea?

First paper (I believe) to show that using the rectifier function (\(\max(0, x)\)) works better than the \(\textit{logistic sigmoid}\) or \(\textit{hyperbolic tangent}\) activation functions. Approaches this from the field of computational neuroscience.

Proposes the use of rectifying nonlinearities as alternatives to the hyperbolic tangent or sigmoid in deep artificial neural networks, in addition to an \(L_1\) regularizer on the activation values to promote sparsity and prevent potential numerical problems with the unbounded activations.
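A minimal NumPy sketch of this setup; the layer shape, weight scale, and penalty coefficient are illustrative choices of mine, not values from the paper:

```python
import numpy as np

def rectifier(x):
    """Rectifier activation: max(0, x), applied elementwise."""
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(100, 50))  # hypothetical layer: 100 inputs -> 50 hidden units
b = np.zeros(50)
x = rng.normal(size=100)

h = rectifier(x @ W + b)             # hidden representation with many exact zeros
l1_penalty = 0.01 * np.abs(h).sum()  # L1 term added to the training objective
print(f"active units: {(h > 0).mean():.0%}, L1 penalty: {l1_penalty:.3f}")
```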

The rectifier function brings together the fields of computational neuroscience and machine learning.

Rectifying neurons are a better model of biological neurons than hyperbolic tangent networks.

There are gaps between computational neuroscience models and machine learning models. Two main gaps are:

- It’s estimated that only about 1-4% of neurons in the brain are active at the same time. Ordinary feedforward neural nets (without additional regularization such as an \(L_1\) penalty) do not have this property.
  - Ex.: in the steady state, when employing the sigmoid activation function, all neurons are activated at about 50% of their saturation value. This is biologically implausible and hurts gradient-based optimization (a toy sketch follows this list).
- Computational neuroscience uses the leaky integrate-and-fire (LIF) model, whereas the deep learning and neural networks literature most commonly uses the tanh and logistic sigmoid activations (see figure 1).
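A toy NumPy sketch of this contrast (randomly initialized weights, not an experiment from the paper): sigmoid units average about 0.5 activation on every input, while rectifier units leave roughly half of the entries at exactly zero:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(100, 50))
X = rng.normal(size=(1000, 100))        # 1000 random inputs

pre = X @ W
sigmoid_h = 1.0 / (1.0 + np.exp(-pre))  # logistic sigmoid units
rect_h = np.maximum(0.0, pre)           # rectifier units

print(f"mean sigmoid activation: {sigmoid_h.mean():.2f}")                # ~0.5
print(f"rectifier entries at exactly zero: {(rect_h == 0).mean():.0%}")  # ~50%
```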

Advantages of sparsity:

- Information disentangling.
  - A claimed objective of deep learning algorithms (Bengio, 2009) is to disentangle the factors explaining the variations in the data.
  - A dense representation is highly entangled: almost any change in the input modifies most of the entries in the representation vector.
  - A sparse representation that is robust to small input changes, therefore, conserves the set of non-zero features.
- Efficient variable-size representation.
  - Different inputs may contain different amounts of information; this can be reflected by varying amounts of sparsity as a result of the rectifier activation function (see the sketch after this list).
- Linear separability.
  - The sparser the representation, the more likely it is to be linearly separable, simply because the data is represented in a high-dimensional space.
- Distributed but sparse.
  - Information is distributed among the non-zero values, so that if there is some noise and some values have to be discarded, the information loss is low.
  - Storing is easier: only the non-zero values and their locations have to be stored.
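A small NumPy sketch of the variable-size and cheap-storage points (the negative bias is an illustrative choice of mine to make the codes sparser): rectifier codes for different inputs activate different numbers of units, and only the non-zero (index, value) pairs need storing:

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.normal(scale=0.1, size=(100, 50))
b = -0.5 * np.ones(50)        # negative bias makes the codes sparser

for i in range(3):
    x = rng.normal(size=100)
    h = np.maximum(0.0, x @ W + b)
    idx = np.flatnonzero(h)   # sparse storage: (index, value) pairs suffice
    print(f"input {i}: {idx.size} of {h.size} units active")
```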

Disadvantages of sparsity:

- Too much sparsity may hurt “predictive performance for an equal number of neurons” as it reduces the “effective capacity of the model.”

Advantages of rectifier neurons:

- Allows the network to easily obtain sparse representations.
- Is more biologically plausible.
- Computations are also cheaper, as sparsity can be exploited.
- No gradient-vanishing effect due to the activation nonlinearities of sigmoid or tanh units.
- Better gradient flow on active neurons, where computation is linear (see figure 2; a sketch follows this list).
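A quick NumPy sketch of the gradient-flow point (the pre-activation values are arbitrary): the rectifier’s local gradient is exactly 1 on active units, while the sigmoid’s never exceeds 0.25 and tanh’s decays toward 0 under saturation:

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.5, 2.0])  # arbitrary pre-activation values

rect_grad = (x > 0).astype(float)     # rectifier: gradient is exactly 1 on active units
sig = 1.0 / (1.0 + np.exp(-x))
sig_grad = sig * (1.0 - sig)          # sigmoid: gradient is at most 0.25
tanh_grad = 1.0 - np.tanh(x) ** 2     # tanh: gradient shrinks toward 0 as |x| grows

print("rectifier:", rect_grad)
print("sigmoid:  ", sig_grad)
print("tanh:     ", tanh_grad)
```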

Potential problems:

- Hard saturation at 0 may hurt optimization.
  - A smooth version of the rectifying nonlinearity, \(\text{softplus}(x) = \log(1 + e^x)\) (see figure 2), loses exact sparsity but may allow easier training.
- Numerical problems due to the unbounded behaviour of the activations.
  - Use the \(L_1\) penalty on the activation values, which also promotes additional sparsity.


How is it realized (technically)?
- Rectifier function is \(\max(0, x)\).
- A smoother version: \(\text{softplus}(x) = \log(1 + e^x)\) (both sketched in code below).
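A minimal NumPy sketch of both functions (writing softplus via `logaddexp` for numerical stability is my choice, not something specified in the paper):

```python
import numpy as np

def rectifier(x):
    """max(0, x), elementwise; exact zeros for x <= 0."""
    return np.maximum(0.0, x)

def softplus(x):
    """log(1 + e^x); logaddexp(0, x) avoids overflow for large |x|."""
    return np.logaddexp(0.0, x)

x = np.linspace(-3.0, 3.0, 7)
print(rectifier(x))  # hard zeros on the negative side
print(softplus(x))   # smooth and strictly positive everywhere
```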

How well does the paper perform?
- Experiment results:
  - Sparsity does not hurt performance until around 85% of the neurons are 0 (see figure 3; a measurement sketch follows this list).
  - Rectifiers outperform softplus.
  - The rectifier outperforms tanh in image recognition.
  - No improvement from using pre-trained autoencoders, hence just using the rectifier activation function is easier.
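To make the sparsity figure concrete, one way to measure it (a sketch with synthetic activations, not data from the paper):

```python
import numpy as np

def sparsity(H):
    """Fraction of activations that are exactly zero."""
    return float((H == 0).mean())

rng = np.random.default_rng(2)
H = np.maximum(0.0, rng.normal(size=(256, 1000)) - 1.0)  # toy rectifier activations, ~84% zeros
print(f"sparsity: {sparsity(H):.0%}")
```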

What interesting variants are explored?
- Rectifier versus softplus.

TL;DR

- The rectifier activation function is better than the sigmoid and tanh activation functions.
- Sparsity is good: it increases accuracy, is computationally cheaper, and is representative of the biological neuron.
- The best-generalising models have 50-80% sparsity, whereas the brain is hypothesized to have 95% to 99% sparsity.