Abstract
Numerical parameter continuation methods are widely used to optimize non-convex problems. They have found many applications in physics and mathematical analysis, such as the bifurcation analysis of dynamical systems. However, as far as we know, such efficient methods have seen relatively limited use in the optimization of neural networks. In this chapter, we propose a novel training method for deep neural networks based on ideas from parameter continuation methods and compare it with widely practiced methods such as Stochastic Gradient Descent (SGD), AdaGrad, RMSProp, and ADAM. Transfer and curriculum learning have recently shown exceptional performance enhancements in deep learning and are intuitively similar to homotopy or continuation techniques. Our proposed methods, however, leverage decades of theoretical and computational work and can be viewed as an initial bridge between those techniques and deep neural networks. In particular, we illustrate a method that we call Natural Parameter Adaption Continuation with Secant approximation (NPACS). Herein we transform regularly used activation functions into their homotopic versions. Such a transformation allows one to decompose a complex optimization problem into a sequence of subproblems, each of which is provided with a good initial guess based on the solution of the previous problem. NPACS combines this decomposition with ADAM to obtain faster convergence. We demonstrate the effectiveness of our method on standard benchmark problems, computing local minima more rapidly and achieving lower generalization error than contemporary techniques in a majority of cases.
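As a rough sketch of the idea described above, the following is a minimal, hypothetical illustration and not the chapter's implementation: the activation is a homotopic blend of the identity map and \(\tanh\), the natural parameter \(\lambda\) is swept from 0 to 1 on a fixed grid, a secant-style extrapolation of the two previous solutions warm-starts each subproblem, and a few ADAM steps act as the corrector. All names (homotopic_tanh, TinyNet), the schedule, and the toy data are assumptions chosen only for illustration.

```python
import torch
import torch.nn as nn

def homotopic_tanh(x, lam):
    # Homotopic activation: identity (easier, near-linear problem) at lam=0,
    # full tanh nonlinearity (the original problem) at lam=1.
    return (1.0 - lam) * x + lam * torch.tanh(x)

class TinyNet(nn.Module):
    def __init__(self, d_in=2, d_hidden=16, d_out=1):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, d_out)

    def forward(self, x, lam):
        return self.fc2(homotopic_tanh(self.fc1(x), lam))

def flat_params(model):
    # Concatenate all parameters into one detached vector.
    return torch.cat([p.detach().reshape(-1) for p in model.parameters()])

def load_flat_params(model, vec):
    # Write a flat parameter vector back into the model.
    i = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.copy_(vec[i:i + n].reshape(p.shape))
            i += n

# Toy regression data (illustrative only).
torch.manual_seed(0)
x = torch.randn(256, 2)
y = torch.sin(x.sum(dim=1, keepdim=True))

model = TinyNet()
loss_fn = nn.MSELoss()
lambdas = torch.linspace(0.0, 1.0, steps=11)
prev_solutions = []

for lam in lambdas:
    # Secant-style predictor: with equally spaced lambda steps, extrapolating
    # from the two previous solutions gives 2*theta_k - theta_{k-1}.
    if len(prev_solutions) >= 2:
        guess = 2 * prev_solutions[-1] - prev_solutions[-2]
        load_flat_params(model, guess)
    # Corrector: a few ADAM steps on the lambda-indexed subproblem.
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    for _ in range(200):
        opt.zero_grad()
        loss = loss_fn(model(x, lam), y)
        loss.backward()
        opt.step()
    prev_solutions.append(flat_params(model))
    print(f"lambda={float(lam):.1f}  loss={loss.item():.4f}")
```

In this sketch each subproblem is cheap because it starts from a warm start extrapolated along the solution branch, rather than from a fresh random initialization.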
Notes
- 1.
Assumptions:
  - \(G: R^N \times R \rightarrow R^N\) is a smooth map,
  - \(\| G(\theta_0, \lambda_0) \| \le c\),
  - \(G_0(\theta)\) is non-singular at a known root \((\theta_0, \lambda_0)\).
See [34] for the implicit function theorem (IFT) and proofs for local continuation; a worked illustration is given after this list of notes.
- 2.
For PCA, the data is normalized so that the mean of each column is 0.
- 3.
- 4.
Dataset collection: https://github.com/harsh306/curriculum-datasets.
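As a worked illustration of the continuation setup behind the assumptions in Note 1, one can embed the training problem in a one-parameter family of root-finding problems; the specific form shown here and the equally spaced \(\lambda\) steps are illustrative assumptions, not necessarily the chapter's exact construction:

\[ G(\theta, \lambda) = 0, \qquad G: R^N \times R \rightarrow R^N, \qquad \lambda \in [0, 1], \]

where \(\lambda = 0\) corresponds to an easy problem with a known (or cheaply computed) solution and \(\lambda = 1\) recovers the original training problem. When \(G(\theta_k, \lambda_k) = 0\) and the Jacobian with respect to \(\theta\) is non-singular there, the implicit function theorem guarantees a locally unique solution branch \(\theta(\lambda)\). A secant predictor extrapolates along this branch to warm-start the next subproblem at \(\lambda_{k+1} = \lambda_k + \Delta\lambda\),

\[ \hat{\theta}_{k+1} = \theta_k + \Delta\lambda \, \frac{\theta_k - \theta_{k-1}}{\lambda_k - \lambda_{k-1}}, \]

after which an optimizer such as ADAM acts as the corrector on the \(\lambda_{k+1}\) subproblem.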
References
I. Akkaya, M. Andrychowicz, M. Chociej, M. Litwin, B. McGrew, A. Petron, A. Paino, M. Plappert, G. Powell, R. Ribas et al., Solving rubik’s cube with a robot hand (2019). arXiv preprint arXiv:1910.07113
E. Allgower, K. Georg, Introduction to numerical continuation methods. Soc. Ind. Appl. Math. (2003). https://epubs.siam.org/doi/abs/10.1137/1.9780898719154
Y. Bengio, J. Louradour, R. Collobert, J. Weston, Curriculum learning, in Proceedings of the 26th Annual International Conference on Machine Learning (2009)
Y. Bengio, M. Mirza, I. Goodfellow, A. Courville, X. Da, An empirical investigation of catastrophic forgetting in gradient-based neural networks (2013)
Z. Cao, M. Long, J. Wang, P.S. Yu, Hashnet: deep learning to hash by continuation. CoRR (2017). arXiv:abs/1702.00758
R. Caruana, Multitask learning. Mach. Learn. 28(1), 41–75 (1997)
T. Chan, K. Jia, S. Gao, J. Lu, Z. Zeng, Y. Ma, Pcanet: a simple deep learning baseline for image classification? IEEE Trans. Image Process. 24(12), 5017–5032 (2015). https://doi.org/10.1109/TIP.2015.2475625
A. Choromanska, M. Henaff, M. Mathieu, G.B. Arous, Y. LeCun, The loss surface of multilayer networks. CoRR (2014). arXiv:abs/1412.0233
J. Clune, Ai-gas: ai-generating algorithms, an alternate paradigm for producing general artificial intelligence. CoRR (2019). arXiv:abs/1905.10985
T. Dick, E. Wong, C. Dann, How many random restarts are enough
E.J. Doedel, T.F. Fairgrieve, B. Sandstede, A.R. Champneys, Y.A. Kuznetsov, X. Wang, Auto-07p: continuation and bifurcation software for ordinary differential equations (2007)
J. Duchi, E. Hazan, Y. Singer, Adaptive subgradient methods for online learning and stochastic optimization. J. Mach. Learn. Res. 12, 2121–2159 (2011). http://dl.acm.org/citation.cfm?id=1953048.2021068
T. Erez, W.D. Smart, What does shaping mean for computational reinforcement learning? in 2008 7th IEEE International Conference on Development and Learning (2008), pp. 215–219. https://doi.org/10.1109/DEVLRN.2008.4640832
C. Finn, P. Abbeel, S. Levine, Model-agnostic meta-learning for fast adaptation of deep networks, in Proceedings of the 34th International Conference on Machine Learning, vol. 70. (JMLR. org, 2017), pp. 1126–1135
I. Goodfellow, Y. Bengio, A. Courville, Deep Learning (MIT Press, 2016). http://www.deeplearningbook.org
I.J. Goodfellow, NIPS 2016 tutorial: generative adversarial networks. NIPS (2017). arXiv:abs/1701.00160
I.J. Goodfellow, O. Vinyals, Qualitatively characterizing neural network optimization problems. CoRR (2014). arXiv:abs/1412.6544
A. Graves, M.G. Bellemare, J. Menick, R. Munos, K. Kavukcuoglu, Automated curriculum learning for neural networks. CoRR (2017). arXiv:abs/1704.03003
C. Grenat, S. Baguet, C.H. Lamarque, R. Dufour, A multi-parametric recursive continuation method for nonlinear dynamical systems. Mech. Syst. Signal Process. 127, 276–289 (2019)
M. Grzes, D. Kudenko, Theoretical and empirical analysis of reward shaping in reinforcement learning, in 2009 International Conference on Machine Learning and Applications (2009), pp. 337–344. https://doi.org/10.1109/ICMLA.2009.33
C. Gülçehre, M. Moczulski, M. Denil, Y. Bengio, Noisy activation functions. CoRR (2016). arXiv:abs/1603.00391
C. Gülçehre, M. Moczulski, F. Visin, Y. Bengio, Mollifying networks. CoRR (2016). arXiv:abs/1608.04980
G. Hacohen, D. Weinshall, On the power of curriculum learning in training deep networks. CoRR (2019). arXiv:abs/1904.03626
G. Hinton, L. Deng, D. Yu, G.E. Dahl, A. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, B. Kingsbury, Deep neural networks for acoustic modeling in speech recognition: the shared views of four research groups. IEEE Signal Process. Mag. 29(6), 82–97 (2012). https://doi.org/10.1109/MSP.2012.2205597
G. Hinton, N. Srivastava, K. Swersky, Rmsprop: divide the gradient by a running average of its recent magnitude. Neural networks for machine learning, Coursera lecture 6e (2012)
G.E. Hinton, R.R. Salakhutdinov, Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006). https://doi.org/10.1126/science.1127647, http://science.sciencemag.org/content/313/5786/504
D.J. Im, M. Tao, K. Branson, An empirical analysis of deep network loss surfaces. CoRR (2016). arXiv:abs/1612.04010
D. Jakubovitz, R. Giryes, M.R. Rodrigues, Generalization error in deep learning, in Compressed Sensing and Its Applications (Springer, 2019), pp. 153–193
F. Jalali, J. Seader, Homotopy continuation method in multi-phase multi-reaction equilibrium systems. Comput. Chem. Eng. 23(9), 1319–1331 (1999)
L. Jiang, Z. Zhou, T. Leung, L.J. Li, L. Fei-Fei, Mentornet: learning data-driven curriculum for very deep neural networks on corrupted labels, in ICML (2018)
R. Johnson, F. Kiokemeister, Calculus, with Analytic Geometry (Allyn and Bacon, 1964). https://books.google.com/books?id=X4_UAQAACAAJ
T. Karras, T. Aila, S. Laine, J. Lehtinen, Progressive growing of gans for improved quality, stability, and variation. CoRR (2017). arXiv:abs/1710.10196
K. Kawaguchi, L.P. Kaelbling, Elimination of all bad local minima in deep learning. CoRR (2019)
H.B. Keller, Numerical solution of bifurcation and nonlinear eigenvalue problems, in Applications of Bifurcation Theory, ed. by P.H. Rabinowitz (Academic Press, New York, 1977), pp. 359–384
D.P. Kingma, J. Ba, Adam: a method for stochastic optimization. CoRR (2014). arXiv:abs/1412.6980
P. Krähenbühl, C. Doersch, J. Donahue, T. Darrell, Data-dependent initializations of convolutional neural networks. CoRR (2015). arXiv:abs/1511.06856
A. Krizhevsky, V. Nair, G. Hinton, Cifar-10 (canadian institute for advanced research). http://www.cs.toronto.edu/kriz/cifar.html
A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, in Advances in Neural Information Processing Systems (2012)
Y. LeCun, Y. Bengio, G. Hinton, Deep learning. Nature (2015). https://www.nature.com/articles/nature14539
C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta, A. Aitken, A. Tejani, J. Totz, Z. Wang et al., Photo-realistic single image super-resolution using a generative adversarial network, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017), pp. 4681–4690
S. Liang, R. Sun, J.D. Lee, R. Srikant, Adding one neuron can eliminate all bad local minima, in S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, R. Garnett (eds.), Advances in Neural Information Processing Systems, vol. 31 (Curran Associates, Inc., 2018), pp. 4350–4360. http://papers.nips.cc/paper/7688-adding-one-neuron-can-eliminate-all-bad-local-minima.pdf
J. Lorraine, P. Vicol, D. Duvenaud, Optimizing millions of hyperparameters by implicit differentiation (2019). arXiv preprint arXiv:1911.02590
T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, in C.J.C. Burges, L. Bottou, M. Welling, Z. Ghahramani, K.Q. Weinberger (eds.), Advances in Neural Information Processing Systems, vol. 26 (Curran Associates, Inc., 2013), pp. 3111–3119. http://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf
H. Mobahi, Training recurrent neural networks by diffusion. CoRR (2016). arXiv:abs/1601.04114
H. Mobahi, J.W. Fisher III, On the link between gaussian homotopy continuation and convex envelopes, in Lecture Notes in Computer Science (EMMCVPR 2015) (Springer, 2015)
A. Nagabandi, I. Clavera, S. Liu, R.S. Fearing, P. Abbeel, S. Levine, C. Finn, Learning to adapt in dynamic, real-world environments through meta-reinforcement learning (2018). arXiv preprint arXiv:1803.11347
K. Nordhausen, The elements of statistical learning: data mining, inference, and prediction, 2nd edn., by T. Hastie, R. Tibshirani, J. Friedman. Int. Stat. Rev. 77(3), 482–482 (2009)
R. Paffenroth, E. Doedel, D. Dichmann, Continuation of periodic orbits around lagrange points and auto2000, in AAS/AIAA Astrodynamics Specialist Conference (Quebec City, Canada, 2001)
R.C. Paffenroth, Mathematical visualization, parameter continuation, and steered computations. Ph.D. thesis, AAI9926816 (College Park, MD, USA, 1999)
H.N. Pathak, Parameter continuation with secant approximation for deep neural networks (2018)
H.N. Pathak, X. Li, S. Minaee, B. Cowan, Efficient super resolution for large-scale images using attentional gan, in 2018 IEEE International Conference on Big Data (Big Data) (IEEE, 2018), pp. 1777–1786
H.N. Pathak, R. Paffenroth, Parameter continuation methods for the optimization of deep neural networks, in 2019 18th IEEE International Conference on Machine Learning And Applications (ICMLA) (IEEE, 2019), pp. 1637–1643
A. Pentina, V. Sharmanska, C.H. Lampert, Curriculum learning of multiple tasks, in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 5492–5500
J. Rojas-Delgado, R. Trujillo-Rasúa, R. Bello, A continuation approach for training artificial neural networks with meta-heuristics. Pattern Recognit. Lett. 125, 373–380 (2019). https://doi.org/10.1016/j.patrec.2019.05.017, http://www.sciencedirect.com/science/article/pii/S0167865519301667
S. Saxena, O. Tuzel, D. DeCoste, Data parameters: a new family of parameters for learning a differentiable curriculum (2019)
B. Settles, Active Learning Literature Survey, Tech. rep. (University of Wisconsin-Madison Department of Computer Sciences, 2009)
M. Seuret, M. Alberti, R. Ingold, M. Liwicki, Pca-initialized deep neural networks applied to document image analysis. CoRR (2017). arXiv:abs/1702.00177
N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, R. Salakhutdinov, Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
F.P. Such, A. Rawal, J. Lehman, K. Stanley, J. Clune, Generative teaching networks: accelerating neural architecture search by learning to generate synthetic training data (2020)
I. Sutskever, J. Martens, G. Dahl, G. Hinton, On the importance of initialization and momentum in deep learning, in International Conference on Machine Learning (2013), pp. 1139–1147
Y. Tsvetkov, M. Faruqui, W. Ling, B. MacWhinney, C. Dyer, Learning the curriculum with Bayesian optimization for task-specific word representation learning, in Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, Long Papers, vol. 1 (Association for Computational Linguistics, Berlin, Germany, 2016), pp. 130–139. https://doi.org/10.18653/v1/P16-1013., https://www.aclweb.org/anthology/P16-1013
R. Vilalta, Y. Drissi, A perspective view and survey of meta-learning. Artif. Intell. Rev. 18(2), 77–95 (2002)
R. Wang, J. Lehman, J. Clune, K.O. Stanley, Paired open-ended trailblazer (POET): endlessly generating increasingly complex and diverse learning environments and their solutions. CoRR (2019). arXiv:abs/1901.01753
W. Wang, Y. Tian, J. Ngiam, Y. Yang, I. Caswell, Z. Parekh, Learning a multitask curriculum for neural machine translation (2019). arXiv preprint arXiv:1908.10940
M.A. Wani, F.A. Bhat, S. Afzal, A.I. Khan, Advances in deep learning, in Advances in Deep Learning (Springer, 2020), pp. 1–11
D. Weinshall, G. Cohen, Curriculum learning by transfer learning: theory and experiments with deep networks. CoRR (2018). arXiv:abs/1802.03796
H. Xiao, K. Rasul, R. Vollgraf, Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms (2017). https://github.com/zalandoresearch/fashion-mnist
H. Xuan, A. Stylianou, R. Pless, Improved embeddings with easy positive triplet mining (2019)
C. Zhou, R.C. Paffenroth, Anomaly detection with robust deep autoencoders, in Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM, 2017), pp. 665–674
Copyright information
© 2021 The Editor(s) (if applicable) and The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this chapter
Cite this chapter
Pathak, H.N., Paffenroth, R.C. (2021). Non-convex Optimization Using Parameter Continuation Methods for Deep Neural Networks. In: Wani, M.A., Khoshgoftaar, T.M., Palade, V. (eds) Deep Learning Applications, Volume 2. Advances in Intelligent Systems and Computing, vol 1232. Springer, Singapore. https://doi.org/10.1007/978-981-15-6759-9_12
DOI: https://doi.org/10.1007/978-981-15-6759-9_12
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-6758-2
Online ISBN: 978-981-15-6759-9
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)