Activation Functions

  • Mohit Goyal
  • Rajan Goyal
  • P. Venkatappa Reddy
  • Brejesh Lall
Part of the Studies in Computational Intelligence book series (SCI, volume 865)


Activation functions lie at the core of deep neural networks, allowing them to learn arbitrarily complex mappings. Without any activation, a neural network would only be able to learn a linear relation between its input and the desired output. This chapter introduces the reader to why activation functions are useful and to their immense importance in making deep learning successful. It then provides a detailed survey of several existing activation functions, covering their functional forms, original motivations, and merits as well as demerits. The chapter also discusses the domain of learnable activation functions and proposes a novel activation, ‘SLAF’, whose shape is learned during the training of a neural network. A working model for SLAF is provided and its performance is demonstrated experimentally on the XOR and MNIST classification tasks.
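The claim that a network without activations can only represent a linear map follows from the fact that a composition of linear layers is itself linear. A minimal numpy sketch (weights and shapes chosen arbitrarily for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Two "layers" with no activation in between: y = W2 @ (W1 @ x)
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((2, 4))
x = rng.standard_normal(3)

y_two_layers = W2 @ (W1 @ x)

# The composition collapses to a single linear layer with weights W2 @ W1,
# so depth adds no expressive power without a nonlinearity.
y_one_layer = (W2 @ W1) @ x
assert np.allclose(y_two_layers, y_one_layer)

# Inserting a nonlinearity (e.g. ReLU) between the layers breaks this
# collapse, which is what lets deep networks fit nonlinear mappings.
relu = lambda v: np.maximum(v, 0.0)
y_nonlinear = W2 @ relu(W1 @ x)
```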


Keywords: Activation functions · Neural networks · Learning deep neural networks · Adaptive activation functions · ReLU · SLAF
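The abstract's idea of an activation "whose shape is learned during training" can be illustrated in a few lines. The sketch below is not the chapter's exact SLAF formulation; it assumes a generic form in which the activation is a weighted sum of fixed basis functions (here a polynomial basis, an assumption) whose coefficients are trained like any other network parameter:

```python
import numpy as np

def basis(x, K):
    # Fixed basis phi_k(x) = x**k for k = 0..K-1 (illustrative choice).
    return np.stack([x**k for k in range(K)], axis=-1)

def learnable_activation(x, coeffs):
    # g(x) = sum_k coeffs[k] * phi_k(x); the learned coefficients
    # determine the activation's shape.
    return basis(x, len(coeffs)) @ coeffs

# Example: coefficients [0, 1, 0.5] give g(x) = x + 0.5 * x**2.
x = np.linspace(-2.0, 2.0, 5)
coeffs = np.array([0.0, 1.0, 0.5])
y = learnable_activation(x, coeffs)
```

Because g is linear in the coefficients, their gradient is simply the basis response at x, so the activation's shape can be updated by the same optimizer that trains the weights.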



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Mohit Goyal (1)
  • Rajan Goyal (1)
  • P. Venkatappa Reddy (1, 2)
  • Brejesh Lall (1)
  1. Department of Electrical Engineering, Indian Institute of Technology Delhi, Delhi, India
  2. Electronics and Communication Engineering, Vignan’s Foundation for Science, Technology & Research, Guntur, India
