author: biofizzatreya
score: 7 / 10

This paper attempts to fix the vanishing-gradient problem of the Relu activation function with a new activation the authors call Mish. Relu is a conditionally defined function: it is fixed at 0 for \(x<0\) and linear for \(x>0\). Because Relu outputs 0 for all \(x<0\), gradients for negative inputs vanish and do not propagate to the next layer. Moreover, Relu itself is not a smooth function, since it is conditionally defined at 0, and this creates rugged training landscapes. An alternative to Relu is a function called Swish, defined as \(x\cdot \text{sigmoid}(\beta\cdot x)\), where \(\text{sigmoid}(x)=\frac{1}{1+e^{-x}}\). This gives a smooth activation function with a shape similar to Relu and reduces the vanishing-gradient problem.
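For concreteness, here is a minimal NumPy sketch of these two baseline activations (my own illustration, not code from the paper); it shows how Relu clamps every negative input to 0, which is exactly where its gradient vanishes, while Swish stays smooth:

```python
import numpy as np

def relu(x):
    # Fixed at 0 for x < 0, identity for x > 0; non-smooth at x = 0
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def swish(x, beta=1.0):
    # Swish: x * sigmoid(beta * x), a smooth Relu-like curve
    return x * sigmoid(beta * x)

x = np.array([-3.0, -1.0, 0.0, 1.0, 3.0])
print(relu(x))   # negative inputs are clamped to 0, so their gradient is 0 as well
print(swish(x))  # small negative outputs instead of a hard 0
```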

Mish is influenced by Swish and is defined as \(x\cdot \tanh\left(\ln(1+e^x)\right)\), where \(\ln(1+e^x)\) is also called the softplus function. The authors compare the effect of using different activation functions such as Relu, Swish, and Mish.
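As a small sketch (again mine, not the authors' code), Mish can be written directly from that definition; here np.logaddexp(0, x) is used as a numerically stable softplus:

```python
import numpy as np

def softplus(x):
    # softplus(x) = ln(1 + e^x); logaddexp(0, x) avoids overflow for large x
    return np.logaddexp(0.0, x)

def mish(x):
    # Mish: x * tanh(softplus(x)) -- smooth everywhere, no conditional branch
    return x * np.tanh(softplus(x))

x = np.linspace(-5.0, 5.0, 11)
print(mish(x))  # Relu-like for x > 0, with a small negative dip for x < 0
```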

[Picture 1]

Understandably, Mish is differentiable and not as rough as Relu.

[Picture 2]

Mish compares well with other activation functions.

[Picture 3]

Mish seems to work quite well when it comes to overfitting: compared to other activation functions, its loss does not increase as much as the number of layers is increased. Mish is also more accurate when the training data are corrupted by Gaussian noise. The authors perform other comparisons as well; to distinguish Mish from Swish, they show that Swish decreases Top-1 accuracy in larger models, whereas Mish does not. Mish was also able to substantially reduce the runtime for object detection training, possibly due to its differentiability and lack of conditional statements.

[Picture 4]

TL;DR: Mish, defined as \(x\cdot \tanh\left(\ln(1+e^x)\right)\), is a smooth, Relu-like activation function that avoids Relu's hard zero cutoff and, in the authors' experiments, outperforms Relu and Swish on deeper networks, noisy data, and object detection.