Non-convex Optimization Using Parameter Continuation Methods for Deep Neural Networks

Chapter in Deep Learning Applications, Volume 2

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1232)

Abstract

Numerical parameter continuation methods are widely used to optimize non-convex problems. These methods have had many applications in physics and mathematical analysis, such as the bifurcation study of dynamical systems. However, as far as we know, such efficient methods have seen relatively limited use in the optimization of neural networks. In this chapter, we propose a novel training method for deep neural networks based on ideas from parameter continuation methods and compare it with widely practiced methods such as Stochastic Gradient Descent (SGD), AdaGrad, RMSProp, and ADAM. Transfer and curriculum learning have recently shown exceptional performance enhancements in deep learning and are intuitively similar to homotopy or continuation techniques. However, our proposed method leverages decades of theoretical and computational work and can be viewed as an initial bridge between those techniques and deep neural networks. In particular, we illustrate a method that we call Natural Parameter Adaption Continuation with Secant approximation (NPACS). Herein, we transform commonly used activation functions into their homotopic versions. This allows one to decompose the complex optimization problem into a sequence of problems, each of which is provided with a good initial guess based upon the solution of the previous problem. NPACS pairs this decomposition with ADAM to obtain faster convergence. We demonstrate the effectiveness of our method on standard benchmark problems, computing local minima more rapidly and achieving lower generalization error than contemporary techniques in a majority of cases.
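
The homotopy construction described in the abstract can be sketched in a few lines. The following is a minimal illustration (our own sketch, not the NPACS implementation from the chapter or its repository), assuming a PyTorch model whose activations are blended between the identity map and tanh by a continuation parameter lam: at lam = 0 the network is nearly linear and easier to optimize, and at lam = 1 the original problem is recovered. The names HomotopyTanh, continuation_train, and the lams schedule are illustrative assumptions.

```python
# Minimal sketch of homotopy-activation continuation training (assumed names,
# not the chapter's code).  Each sub-problem at a fixed lam is warm-started
# from the previous one and optimized with Adam.
import torch
import torch.nn as nn

class HomotopyTanh(nn.Module):
    """Activation blended between the identity (lam = 0) and tanh (lam = 1)."""
    def __init__(self):
        super().__init__()
        self.lam = 0.0  # continuation parameter, set externally

    def forward(self, x):
        return (1.0 - self.lam) * x + self.lam * torch.tanh(x)

def continuation_train(model, homotopy_acts, loss_fn, loader, lams, epochs_per_lam=1):
    """Sweep lam over the schedule `lams`, warm-starting each sub-problem from the last."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for lam in lams:
        for act in homotopy_acts:
            act.lam = lam          # deform every activation toward the target nonlinearity
        for _ in range(epochs_per_lam):
            for x, y in loader:    # parameters and Adam state carry over from the previous lam
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
    return model

# Example wiring (hypothetical sizes): a two-layer classifier whose hidden
# nonlinearity is gradually "switched on" as lam goes from 0 to 1.
acts = [HomotopyTanh()]
model = nn.Sequential(nn.Linear(784, 128), acts[0], nn.Linear(128, 10))
# continuation_train(model, acts, nn.CrossEntropyLoss(), train_loader,
#                    lams=[0.0, 0.25, 0.5, 0.75, 1.0])
```

The secant step of NPACS, which extrapolates a good initial guess from the two previous solutions, is sketched after the Notes below.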

Notes

  1. Assumptions:

     • \(G: \mathbb{R}^N \times \mathbb{R} \rightarrow \mathbb{R}^N\) is a smooth map

     • \(\| G(\theta_0, \lambda_0) \| \le c\)

     • \(G_0(\theta)\) is non-singular at a known root \((\theta_0, \lambda_0)\)

     See [34] for the Implicit Function Theorem (IFT) and proofs for local continuation; a minimal predictor-corrector sketch along these lines is given after this list.

  2. For PCA, the data are normalized so that each column has zero mean.

  3. https://github.com/harsh306/NPACS.

  4. Dataset collection: https://github.com/harsh306/curriculum-datasets.
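
To make the assumptions in Note 1 concrete, the sketch below shows natural parameter continuation with a secant predictor and a Newton corrector for a smooth map \(G: \mathbb{R}^N \times \mathbb{R} \rightarrow \mathbb{R}^N\). This is an illustrative example under the assumptions above (the helper names, the Newton corrector, and the strictly monotone lambda schedule are our own choices), not code from the chapter or from [34].

```python
# Illustrative natural parameter continuation with a secant predictor
# (assumed helper names, not the chapter's implementation).
import numpy as np

def newton_correct(G, J, theta, lam, tol=1e-10, max_iter=50):
    """Refine theta so that G(theta, lam) ~ 0, using the Jacobian J = dG/dtheta."""
    for _ in range(max_iter):
        r = G(theta, lam)
        if np.linalg.norm(r) < tol:
            break
        theta = theta - np.linalg.solve(J(theta, lam), r)
    return theta

def secant_continuation(G, J, theta0, lams):
    """Track a solution branch of G(theta, lam) = 0 along a strictly monotone schedule `lams`."""
    thetas = [newton_correct(G, J, theta0, lams[0])]
    thetas.append(newton_correct(G, J, thetas[0], lams[1]))
    for k in range(1, len(lams) - 1):
        # Secant predictor: extrapolate from the two previous converged solutions.
        step = (lams[k + 1] - lams[k]) / (lams[k] - lams[k - 1])
        pred = thetas[k] + step * (thetas[k] - thetas[k - 1])
        thetas.append(newton_correct(G, J, pred, lams[k + 1]))
    return thetas
```

In NPACS, as described in the abstract, the corrector role is played by ADAM on the training loss rather than a full Newton solve, while the secant predictor supplies the initial guess for each sub-problem.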

References

  1. I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas et al., Solving rubik’s cube with a robot hand (2019). arXiv preprint arXiv:1910.07113

  2. E. Allgower, K. Georg, Introduction to numerical continuation methods. Soc. Ind. Appl. Math. (2003). https://epubs.siam.org/doi/abs/10.1137/1.9780898719154

  3. Y. Bengio, J. Louradour, R. Collobert, J. Weston, Curriculum learning (2009)

  4. Y. Bengio, M. Mirza, I. Goodfellow, A. Courville, X. Da, An empirical investigation of catastrophic forgetting in gradient-based neural networks (2013)

  5. Z. Cao, M. Long, J. Wang, P.S. Yu, Hashnet: deep learning to hash by continuation. CoRR (2017). arXiv:abs/1702.00758

  6. R. Caruana, Multitask learning. Mach. Learn. 28(1), 41–75 (1997)

  7. T. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, Y. Ma, Pcanet: a simple deep learning baseline for image classification? IEEE Trans. Image Process. 24(12), 5017–5032 (2015). https://doi.org/10.1109/TIP.2015.2475625

  8. A. Choromanska, M. Henaff, M. Mathieu, G.B. Arous, Y. LeCun, The loss surface of multilayer networks. CoRR (2014). arXiv:abs/1412.0233

  9. J. Clune, Ai-gas: ai-generating algorithms, an alternate paradigm for producing general artificial intelligence. CoRR (2019). arXiv:abs/1905.10985

  10. T. Dick, E. Wong, C. Dann, How many random restarts are enough?

  11. E.J. Doedel, T.F. Fairgrieve, B. Sandstede, A.R. Champneys, Y.A. Kuznetsov, X. Wang, Auto-07p: continuation and bifurcation software for ordinary differential equations (2007)

  12. J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011). http://dl.acm.org/citation.cfm?id=1953048.2021068

  13. T. Erez, W.D. Smart, What does shaping mean for computational reinforcement learning? in 2008 7th IEEE International Conference on Development and Learning (2008), pp. 215–219. https://doi.org/10.1109/DEVLRN.2008.4640832

  14. C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in Proceedings of the 34th International Conference on Machine Learning, vol. 70. (JMLR. org, 2017), pp. 1126–1135

  15. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, 2016). http://www.deeplearningbook.org

  16. I.J. Goodfellow, NIPS 2016 tutorial: generative adversarial networks. NIPS (2017). arXiv:abs/1701.00160

  17. I.J. Goodfellow, O. Vinyals, Qualitatively characterizing neural network optimization problems. CoRR (2014). arXiv:abs/1412.6544

  18. A. Graves, M.G. Bellemare, J. Menick, R. Munos, K. Kavukcuoglu, Automated curriculum learning for neural networks. CoRR (2017). arXiv:abs/1704.03003

  19. C. Grenat, S. Baguet, C.H. Lamarque, R. Dufour, A multi-parametric recursive continuation method for nonlinear dynamical systems. Mech. Syst. Signal Process. 127, 276–289 (2019)

  20. M. Grzes, D. Kudenko, Theoretical and empirical analysis of reward shaping in reinforcement learning, in 2009 International Conference on Machine Learning and Applications (2009), pp. 337–344. https://doi.org/10.1109/ICMLA.2009.33

  21. C. Gülçehre, M. Moczulski, M. Denil, Y. Bengio, Noisy activation functions. CoRR (2016). arXiv:abs/1603.00391

  22. C. Gülçehre, M. Moczulski, F. Visin, Y. Bengio, Mollifying networks. CoRR (2016). arXiv:abs/1608.04980

  23. G. Hacohen, D. Weinshall, On the power of curriculum learning in training deep networks. CoRR (2019). arXiv:abs/1904.03626

  24. G. Hinton, L. Deng, D. Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012). https://doi.org/10.1109/MSP.2012.2205597

  25. G. Hinton, N. Srivastava, K. Swersky, Rmsprop: divide the gradient by a running average of its recent magnitude. Neural networks for machine learning, Coursera lecture 6e (2012)

  26. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006). https://doi.org/10.1126/science.1127647, http://science.sciencemag.org/content/313/5786/504

  27. D.J. Im, M. Tao, K. Branson, An empirical analysis of deep network loss surfaces. CoRR (2016). arXiv:abs/1612.04010

  28. D. Jakubovitz, R. Giryes, M.R. Rodrigues, Generalization error in deep learning, in Compressed Sensing and Its Applications (Springer, 2019), pp. 153–193

  29. F. Jalali, J. Seader, Homotopy continuation method in multi-phase multi-reaction equilibrium systems. Comput. Chem. Eng. 23(9), 1319–1331 (1999)

  30. L. Jiang, Z. Zhou, T. Leung, L.J. Li, L. Fei-Fei, Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels, in ICML (2018)

  31. R. Johnson, F. Kiokemeister, Calculus, with Analytic Geometry (Allyn and Bacon, 1964). https://books.google.com/books?id=X4_UAQAACAAJ

  32. T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of gans for improved quality, stability, and variation. CoRR (2017). arXiv:abs/1710.10196

  33. K. Kawaguchi, L.P. Kaelbling, Elimination of all bad local minima in deep learning. CoRR (2019)

  34. H.B. Keller, Numerical solution of bifurcation and nonlinear eigenvalue problems, in Applications of Bifurcation Theory, ed. by P.H. Rabinowitz (Academic Press, New York, 1977), pp. 359–384

  35. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization. CoRR (2014). arXiv:abs/1412.6980

  36. P. Krähenbühl, C. Doersch, J. Donahue, T. Darrell, Data-dependent initializations of convolutional neural networks. CoRR (2015). arXiv:abs/1511.06856

  37. A. Krizhevsky, V. Nair, G. Hinton, Cifar-10 (canadian institute for advanced research). http://www.cs.toronto.edu/kriz/cifar.html

  38. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (2012)

  39. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature (2015). https://www.nature.com/articles/nature14539

  40. C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., Photo-realistic single image super-resolution using a generative adversarial network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 4681–4690

  41. S. Liang, R. Sun, J.D. Lee, R. Srikant, Adding one neuron can eliminate all bad local minima, in S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (eds.), Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., 2018), pp. 4350–4360. http://papers.nips.cc/paper/7688-adding-one-neuron-can-eliminate-all-bad-local-minima.pdf

  42. J. Lorraine, P. Vicol, D. Duvenaud, Optimizing millions of hyperparameters by implicit differentiation (2019). arXiv preprint arXiv:1911.02590

  43. T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc., 2013), pp. 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

  44. H. Mobahi, Training recurrent neural networks by diffusion. CoRR (2016). arXiv:abs/1601.04114

  45. H. Mobahi, J.W. Fisher III, On the link between Gaussian homotopy continuation and convex envelopes, in Lecture Notes in Computer Science (EMMCVPR 2015) (Springer, 2015)

  46. A. Nagabandi, I. Clavera, S. Liu, R.S. Fearing, P. Abbeel, S. Levine, C. Finn, Learning to adapt in dynamic, real-world environments through meta-reinforcement learning (2018). arXiv preprint arXiv:1803.11347

  47. K. Nordhausen, The elements of statistical learning: data mining, inference, and prediction, 2nd edn., by T. Hastie, R. Tibshirani, J. Friedman (eds.). Int. Stat. Rev. 77(3), 482–482

  48. R. Paffenroth, E. Doedel, D. Dichmann, Continuation of periodic orbits around lagrange points and auto2000, in AAS/AIAA Astrodynamics Specialist Conference (Quebec City, Canada, 2001)

  49. R.C. Paffenroth, Mathematical visualization, parameter continuation, and steered computations. Ph.D. thesis, AAI9926816 (College Park, MD, USA, 1999)

  50. H.N. Pathak, Parameter continuation with secant approximation for deep neural networks (2018)

  51. H.N. Pathak, X. Li, S. Minaee, B. Cowan, Efficient super resolution for large-scale images using attentional gan, in 2018 IEEE International Conference on Big Data (Big Data) (IEEE, 2018), pp. 1777–1786

  52. H.N. Pathak, R. Paffenroth, Parameter continuation methods for the optimization of deep neural networks, in 2019 18th IEEE International Conference on Machine Learning And Applications (ICMLA) (IEEE, 2019), pp. 1637–1643

  53. A. Pentina, V. Sharmanska, C.H. Lampert, Curriculum learning of multiple tasks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 5492–5500

  54. J. Rojas-Delgado, R. Trujillo-Rasúa, R. Bello, A continuation approach for training artificial neural networks with meta-heuristics. Pattern Recognit. Lett. 125, 373–380 (2019). https://doi.org/10.1016/j.patrec.2019.05.017, http://www.sciencedirect.com/science/article/pii/S0167865519301667

  55. S. Saxena, O. Tuzel, D. DeCoste, Data parameters: a new family of parameters for learning a differentiable curriculum (2019)

  56. B. Settles, Active Learning Literature Survey. Tech. rep. (University of Wisconsin-Madison Department of Computer Sciences, 2009)

  57. M. Seuret, M. Alberti, R. Ingold, M. Liwicki, Pca-initialized deep neural networks applied to document image analysis. CoRR (2017). arXiv:abs/1702.00177

  58. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)

  59. F.P. Such, A. Rawal, J. Lehman, K. Stanley, J. Clune, Generative teaching networks: accelerating neural architecture search by learning to generate synthetic training data (2020)

  60. I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in International Conference on Machine Learning (2013), pp. 1139–1147

  61. Y. Tsvetkov, M. Faruqui, W. Ling, B. MacWhinney, C. Dyer, Learning the curriculum with Bayesian optimization for task-specific word representation learning, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Long Papers, vol. 1 (Association for Computational Linguistics, Berlin, Germany, 2016), pp. 130–139. https://doi.org/10.18653/v1/P16-1013, https://www.aclweb.org/anthology/P16-1013

  62. R. Vilalta, Y. Drissi, A perspective view and survey of meta-learning. Artif. Intell. Rev. 18(2), 77–95 (2002)

  63. R. Wang, J. Lehman, J. Clune, K.O. Stanley, Paired open-ended trailblazer (POET): endlessly generating increasingly complex and diverse learning environments and their solutions. CoRR (2019). arXiv:abs/1901.01753

  64. W. Wang, Y. Tian, J. Ngiam, Y. Yang, I. Caswell, Z. Parekh, Learning a multitask curriculum for neural machine translation (2019). arXiv preprint arXiv:1908.10940 (2019)

  65. M.A. Wani, F.A. Bhat, S. Afzal, A.I. Khan, Advances in deep learning, in Advances in Deep Learning (Springer, 2020), pp. 1–11

  66. D. Weinshall, G. Cohen, Curriculum learning by transfer learning: theory and experiments with deep networks. CoRR (2018). arXiv:abs/1802.03796

  67. H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms (2017). https://github.com/zalandoresearch/fashion-mnist

  68. H. Xuan, A. Stylianou, R. Pless, Improved embeddings with easy positive triplet mining (2019)

  69. C. Zhou, R.C. Paffenroth, Anomaly detection with robust deep autoencoders, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), pp. 665–674

Author information

Corresponding author

Correspondence to Harsh Nilesh Pathak.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Nilesh Pathak, H., Clinton Paffenroth, R. (2021). Non-convex Optimization Using Parameter Continuation Methods for Deep Neural Networks. In: Wani, M.A., Khoshgoftaar, T.M., Palade, V. (eds) Deep Learning Applications, Volume 2. Advances in Intelligent Systems and Computing, vol 1232. Springer, Singapore. https://doi.org/10.1007/978-981-15-6759-9_12
