Non-convex Optimization Using Parameter Continuation Methods for Deep Neural Networks

Chapter in Deep Learning Applications, Volume 2

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1232)

Abstract

Numerical parameter continuation methods are widely used to optimize non-convex problems. These methods have had many applications in physics and mathematical analysis, such as the bifurcation study of dynamical systems. However, as far as we know, such efficient methods have seen relatively limited use in the optimization of neural networks. In this chapter, we propose a novel training method for deep neural networks based on ideas from parameter continuation methods and compare it with widely practiced methods such as Stochastic Gradient Descent (SGD), AdaGrad, RMSProp, and ADAM. Transfer and curriculum learning have recently shown exceptional performance enhancements in deep learning and are intuitively similar to homotopy or continuation techniques. However, our proposed method leverages decades of theoretical and computational work and can be viewed as an initial bridge between those techniques and deep neural networks. In particular, we illustrate a method that we call Natural Parameter Adaption Continuation with Secant approximation (NPACS). Herein, we transform commonly used activation functions into their homotopic versions. This allows one to decompose the complex optimization problem into a sequence of problems, each of which is provided with a good initial guess based upon the solution of the previous problem. NPACS pairs this decomposition with ADAM to obtain faster convergence. We demonstrate the effectiveness of our method on standard benchmark problems, computing local minima more rapidly and achieving lower generalization error than contemporary techniques in a majority of cases.
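
The homotopy construction described in the abstract can be sketched in a few lines. The following is a minimal illustration (our own sketch, not the NPACS implementation from the chapter or its repository), assuming a PyTorch model whose activations are blended between the identity map and tanh by a continuation parameter lam: at lam = 0 the network is nearly linear and easier to optimize, and at lam = 1 the original problem is recovered. The names HomotopyTanh, continuation_train, and the lams schedule are illustrative assumptions.

```python
# Minimal sketch of homotopy-activation continuation training (assumed names,
# not the chapter's code).  Each sub-problem at a fixed lam is warm-started
# from the previous one and optimized with Adam.
import torch
import torch.nn as nn

class HomotopyTanh(nn.Module):
    """Activation blended between the identity (lam = 0) and tanh (lam = 1)."""
    def __init__(self):
        super().__init__()
        self.lam = 0.0  # continuation parameter, set externally

    def forward(self, x):
        return (1.0 - self.lam) * x + self.lam * torch.tanh(x)

def continuation_train(model, homotopy_acts, loss_fn, loader, lams, epochs_per_lam=1):
    """Sweep lam over the schedule `lams`, warm-starting each sub-problem from the last."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for lam in lams:
        for act in homotopy_acts:
            act.lam = lam          # deform every activation toward the target nonlinearity
        for _ in range(epochs_per_lam):
            for x, y in loader:    # parameters and Adam state carry over from the previous lam
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
    return model

# Example wiring (hypothetical sizes): a two-layer classifier whose hidden
# nonlinearity is gradually "switched on" as lam goes from 0 to 1.
acts = [HomotopyTanh()]
model = nn.Sequential(nn.Linear(784, 128), acts[0], nn.Linear(128, 10))
# continuation_train(model, acts, nn.CrossEntropyLoss(), train_loader,
#                    lams=[0.0, 0.25, 0.5, 0.75, 1.0])
```

The secant step of NPACS, which extrapolates a good initial guess from the two previous solutions, is sketched after the Notes below.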

Notes

  1. Assumptions:

     • \(G: \mathbb{R}^N \times \mathbb{R} \rightarrow \mathbb{R}^N\) is a smooth map

     • \(\| G(\theta_0, \lambda_0) \| \le c\)

     • \(G_0(\theta)\) is non-singular at a known root \((\theta_0, \lambda_0)\)

     See [34] for the Implicit Function Theorem (IFT) and proofs for local continuation; a minimal predictor-corrector sketch along these lines is given after this list.

  2. For PCA, the data are normalized so that each column has zero mean.

  3. https://github.com/harsh306/NPACS.

  4. Dataset collection: https://github.com/harsh306/curriculum-datasets.
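
To make the assumptions in Note 1 concrete, the sketch below shows natural parameter continuation with a secant predictor and a Newton corrector for a smooth map \(G: \mathbb{R}^N \times \mathbb{R} \rightarrow \mathbb{R}^N\). This is an illustrative example under the assumptions above (the helper names, the Newton corrector, and the strictly monotone lambda schedule are our own choices), not code from the chapter or from [34].

```python
# Illustrative natural parameter continuation with a secant predictor
# (assumed helper names, not the chapter's implementation).
import numpy as np

def newton_correct(G, J, theta, lam, tol=1e-10, max_iter=50):
    """Refine theta so that G(theta, lam) ~ 0, using the Jacobian J = dG/dtheta."""
    for _ in range(max_iter):
        r = G(theta, lam)
        if np.linalg.norm(r) < tol:
            break
        theta = theta - np.linalg.solve(J(theta, lam), r)
    return theta

def secant_continuation(G, J, theta0, lams):
    """Track a solution branch of G(theta, lam) = 0 along a strictly monotone schedule `lams`."""
    thetas = [newton_correct(G, J, theta0, lams[0])]
    thetas.append(newton_correct(G, J, thetas[0], lams[1]))
    for k in range(1, len(lams) - 1):
        # Secant predictor: extrapolate from the two previous converged solutions.
        step = (lams[k + 1] - lams[k]) / (lams[k] - lams[k - 1])
        pred = thetas[k] + step * (thetas[k] - thetas[k - 1])
        thetas.append(newton_correct(G, J, pred, lams[k + 1]))
    return thetas
```

In NPACS, as described in the abstract, the corrector role is played by ADAM on the training loss rather than a full Newton solve, while the secant predictor supplies the initial guess for each sub-problem.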

References

  1. I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas et al., Solving rubik’s cube with a robot hand (2019). arXiv preprint arXiv:1910.07113

  2. E. Allgower, K. Georg, Introduction to numerical continuation methods. Soc. Ind. Appl. Math. (2003). https://epubs.siam.org/doi/abs/10.1137/1.9780898719154

  3. Y. Bengio, J. Louradour, R. Collobert, J. Weston, Curriculum learning (2009)

  4. Y. Bengio, M. Mirza, I. Goodfellow, A. Courville, X. Da, An empirical investigation of catastrophic forgetting in gradient-based neural networks (2013)

  5. Z. Cao, M. Long, J. Wang, P.S. Yu, Hashnet: deep learning to hash by continuation. CoRR (2017). arXiv:abs/1702.00758

  6. R. Caruana, Multitask learning. Mach. Learn. 28(1), 41–75 (1997)

  7. T. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, Y. Ma, Pcanet: a simple deep learning baseline for image classification? IEEE Trans. Image Process. 24(12), 5017–5032 (2015). https://doi.org/10.1109/TIP.2015.2475625

  8. A. Choromanska, M. Henaff, M. Mathieu, G.B. Arous, Y. LeCun, The loss surface of multilayer networks. CoRR (2014). arXiv:abs/1412.0233

  9. J. Clune, Ai-gas: ai-generating algorithms, an alternate paradigm for producing general artificial intelligence. CoRR (2019). arXiv:abs/1905.10985

  10. T. Dick, E. Wong, C. Dann, How many random restarts are enough?

  11. E.J. Doedel, T.F. Fairgrieve, B. Sandstede, A.R. Champneys, Y.A. Kuznetsov, X. Wang, Auto-07p: continuation and bifurcation software for ordinary differential equations (2007)

  12. J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011). http://dl.acm.org/citation.cfm?id=1953048.2021068

  13. T. Erez, W.D. Smart, What does shaping mean for computational reinforcement learning? in 2008 7th IEEE International Conference on Development and Learning (2008), pp. 215–219. https://doi.org/10.1109/DEVLRN.2008.4640832

  14. C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in Proceedings of the 34th International Conference on Machine Learning, vol. 70. (JMLR. org, 2017), pp. 1126–1135

  15. I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, 2016). http://www.deeplearningbook.org

  16. I.J. Goodfellow, NIPS 2016 tutorial: generative adversarial networks. NIPS (2017). arXiv:abs/1701.00160

  17. I.J. Goodfellow, O. Vinyals, Qualitatively characterizing neural network optimization problems. CoRR (2014). arXiv:abs/1412.6544

  18. A. Graves, M.G. Bellemare, J. Menick, R. Munos, K. Kavukcuoglu, Automated curriculum learning for neural networks. CoRR (2017). arXiv:abs/1704.03003

  19. C. Grenat, S. Baguet, C.H. Lamarque, R. Dufour, A multi-parametric recursive continuation method for nonlinear dynamical systems. Mech. Syst. Signal Process. 127, 276–289 (2019)

  20. M. Grzes, D. Kudenko, Theoretical and empirical analysis of reward shaping in reinforcement learning, in 2009 International Conference on Machine Learning and Applications (2009), pp. 337–344. https://doi.org/10.1109/ICMLA.2009.33

  21. C. Gülçehre, M. Moczulski, M. Denil, Y. Bengio, Noisy activation functions. CoRR (2016). arXiv:abs/1603.00391

  22. C. Gülçehre, M. Moczulski, F. Visin, Y. Bengio, Mollifying networks. CoRR (2016). arXiv:abs/1608.04980

  23. G. Hacohen, D. Weinshall, On the power of curriculum learning in training deep networks. CoRR (2019). arXiv:abs/1904.03626

  24. G. Hinton, L. Deng, D. Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012). https://doi.org/10.1109/MSP.2012.2205597

  25. G. Hinton, N. Srivastava, K. Swersky, Rmsprop: divide the gradient by a running average of its recent magnitude. Neural networks for machine learning, Coursera lecture 6e (2012)

  26. G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006). https://doi.org/10.1126/science.1127647, http://science.sciencemag.org/content/313/5786/504

  27. D.J. Im, M. Tao, K. Branson, An empirical analysis of deep network loss surfaces. CoRR (2016). arXiv:abs/1612.04010

  28. D. Jakubovitz, R. Giryes, M.R. Rodrigues, Generalization error in deep learning, in Compressed Sensing and Its Applications (Springer, 2019), pp. 153–193

  29. F. Jalali, J. Seader, Homotopy continuation method in multi-phase multi-reaction equilibrium systems. Comput. Chem. Eng. 23(9), 1319–1331 (1999)

  30. L. Jiang, Z. Zhou, T. Leung, L.J. Li, L. Fei-Fei, Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels, in ICML (2018)

  31. R. Johnson, F. Kiokemeister, Calculus, with Analytic Geometry (Allyn and Bacon, 1964). https://books.google.com/books?id=X4_UAQAACAAJ

  32. T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of gans for improved quality, stability, and variation. CoRR (2017). arXiv:abs/1710.10196

  33. K. Kawaguchi, L.P. Kaelbling, Elimination of all bad local minima in deep learning. CoRR (2019)

  34. H.B. Keller, Numerical solution of bifurcation and nonlinear eigenvalue problems, in Applications of Bifurcation Theory, ed. by P.H. Rabinowitz (Academic Press, New York, 1977), pp. 359–384

  35. D.P. Kingma, J. Ba, Adam: a method for stochastic optimization. CoRR (2014). arXiv:abs/1412.6980

  36. P. Krähenbühl, C. Doersch, J. Donahue, T. Darrell, Data-dependent initializations of convolutional neural networks. CoRR (2015). arXiv:abs/1511.06856

  37. A. Krizhevsky, V. Nair, G. Hinton, Cifar-10 (canadian institute for advanced research). http://www.cs.toronto.edu/kriz/cifar.html

  38. A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (2012)

  39. Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature (2015). https://www.nature.com/articles/nature14539

  40. C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., Photo-realistic single image super-resolution using a generative adversarial network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 4681–4690

  41. S. Liang, R. Sun, J.D. Lee, R. Srikant, Adding one neuron can eliminate all bad local minima, in S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (eds.), Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., 2018), pp. 4350–4360. http://papers.nips.cc/paper/7688-adding-one-neuron-can-eliminate-all-bad-local-minima.pdf

  42. J. Lorraine, P. Vicol, D. Duvenaud, Optimizing millions of hyperparameters by implicit differentiation (2019). arXiv preprint arXiv:1911.02590

  43. T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc., 2013), pp. 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf

  44. H. Mobahi, Training recurrent neural networks by diffusion. CoRR (2016). arXiv:abs/1601.04114

  45. H. Mobahi, J.W. Fisher III, On the link between Gaussian homotopy continuation and convex envelopes, in Lecture Notes in Computer Science (EMMCVPR 2015) (Springer, 2015)

  46. A. Nagabandi, I. Clavera, S. Liu, R.S. Fearing, P. Abbeel, S. Levine, C. Finn, Learning to adapt in dynamic, real-world environments through meta-reinforcement learning (2018). arXiv preprint arXiv:1803.11347

  47. K. Nordhausen, The elements of statistical learning: data mining, inference, and prediction, 2nd edn., by T. Hastie, R. Tibshirani, J. Friedman (eds.). Int. Stat. Rev. 77(3), 482–482

  48. R. Paffenroth, E. Doedel, D. Dichmann, Continuation of periodic orbits around lagrange points and auto2000, in AAS/AIAA Astrodynamics Specialist Conference (Quebec City, Canada, 2001)

  49. R.C. Paffenroth, Mathematical visualization, parameter continuation, and steered computations. Ph.D. thesis, AAI9926816 (College Park, MD, USA, 1999)

  50. H.N. Pathak, Parameter continuation with secant approximation for deep neural networks (2018)

  51. H.N. Pathak, X. Li, S. Minaee, B. Cowan, Efficient super resolution for large-scale images using attentional gan, in 2018 IEEE International Conference on Big Data (Big Data) (IEEE, 2018), pp. 1777–1786

  52. H.N. Pathak, R. Paffenroth, Parameter continuation methods for the optimization of deep neural networks, in 2019 18th IEEE International Conference on Machine Learning And Applications (ICMLA) (IEEE, 2019), pp. 1637–1643

  53. A. Pentina, V. Sharmanska, C.H. Lampert, Curriculum learning of multiple tasks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 5492–5500

  54. J. Rojas-Delgado, R. Trujillo-Rasúa, R. Bello, A continuation approach for training artificial neural networks with meta-heuristics. Pattern Recognit. Lett. 125, 373–380 (2019). https://doi.org/10.1016/j.patrec.2019.05.017, http://www.sciencedirect.com/science/article/pii/S0167865519301667

  55. S. Saxena, O. Tuzel, D. DeCoste, Data parameters: a new family of parameters for learning a differentiable curriculum (2019)

  56. B. Settles, Active Learning Literature Survey. Tech. rep. (University of Wisconsin-Madison Department of Computer Sciences, 2009)

  57. M. Seuret, M. Alberti, R. Ingold, M. Liwicki, Pca-initialized deep neural networks applied to document image analysis. CoRR (2017). arXiv:abs/1702.00177

  58. N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)

  59. F.P. Such, A. Rawal, J. Lehman, K. Stanley, J. Clune, Generative teaching networks: accelerating neural architecture search by learning to generate synthetic training data (2020)

  60. I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in International Conference on Machine Learning (2013), pp. 1139–1147

  61. Y. Tsvetkov, M. Faruqui, W. Ling, B. MacWhinney, C. Dyer, Learning the curriculum with Bayesian optimization for task-specific word representation learning, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Long Papers, vol. 1 (Association for Computational Linguistics, Berlin, Germany, 2016), pp. 130–139. https://doi.org/10.18653/v1/P16-1013, https://www.aclweb.org/anthology/P16-1013

  62. R. Vilalta, Y. Drissi, A perspective view and survey of meta-learning. Artif. Intell. Rev. 18(2), 77–95 (2002)

  63. R. Wang, J. Lehman, J. Clune, K.O. Stanley, Paired open-ended trailblazer (POET): endlessly generating increasingly complex and diverse learning environments and their solutions. CoRR (2019). arXiv:abs/1901.01753

  64. W. Wang, Y. Tian, J. Ngiam, Y. Yang, I. Caswell, Z. Parekh, Learning a multitask curriculum for neural machine translation (2019). arXiv preprint arXiv:1908.10940 (2019)

  65. M.A. Wani, F.A. Bhat, S. Afzal, A.I. Khan, Advances in deep learning, in Advances in Deep Learning (Springer, 2020), pp. 1–11

  66. D. Weinshall, G. Cohen, Curriculum learning by transfer learning: theory and experiments with deep networks. CoRR (2018). arXiv:abs/1802.03796

  67. H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms (2017). https://github.com/zalandoresearch/fashion-mnist

  68. H. Xuan, A. Stylianou, R. Pless, Improved embeddings with easy positive triplet mining (2019)

  69. C. Zhou, R.C. Paffenroth, Anomaly detection with robust deep autoencoders, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), pp. 665–674

Author information

Corresponding author

Correspondence to Harsh Nilesh Pathak.

Copyright information

© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this chapter

Cite this chapter

Nilesh Pathak, H., Clinton Paffenroth, R. (2021). Non-convex Optimization Using Parameter Continuation Methods for Deep Neural Networks. In: Wani, M.A., Khoshgoftaar, T.M., Palade, V. (eds) Deep Learning Applications, Volume 2. Advances in Intelligent Systems and Computing, vol 1232. Springer, Singapore. https://doi.org/10.1007/978-981-15-6759-9_12
