Abstract
This paper introduces Boosted Residual Networks, a family of new customised ensemble methodologies that build a boosted ensemble of residual networks by growing the member network at each round of boosting. The proposed approach combines recent developments in residual networks, a method for creating very deep networks by including a shortcut layer between different groups of layers, with Deep Incremental Boosting, a methodology for training fast ensembles of networks of increasing depth through boosting. We also explore a simpler variant of Boosted Residual Networks based on bagging, called Bagged Residual Networks, and analyse how recent developments in ensemble distillation can further improve our results. We demonstrate that the synergy of residual networks and Deep Incremental Boosting has more potential than simply boosting a residual network of fixed structure, or than the equivalent Deep Incremental Boosting without the shortcut layers, as it permits the creation of models with better generalisation in significantly less time.
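As a rough illustration of the procedure summarised above, the following is a minimal sketch in PyTorch (the paper does not prescribe this framework, and names such as `GrowableResNet`, `add_block` and `train_member` are hypothetical). Each boosting round copies the previous ensemble member, grows it by one residual block, retrains it on the boosting-weighted data, and re-weights the examples SAMME-style; the injection point and initialisation of the new block are simplified relative to the paper.

```python
# Hedged sketch of a Boosted-Residual-Networks-style training loop.
# All class and function names are illustrative, not taken from the paper's code.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResidualBlock(nn.Module):
    """conv -> BN -> ReLU -> conv -> BN, added to an identity shortcut."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(x + out)

class GrowableResNet(nn.Module):
    """A small residual network whose block stack can be deepened between rounds."""
    def __init__(self, channels=16, num_classes=10, num_blocks=1):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU())
        self.blocks = nn.ModuleList([ResidualBlock(channels) for _ in range(num_blocks)])
        self.head = nn.Linear(channels, num_classes)

    def add_block(self, channels=16):
        # Grow the member network by one residual block before the next round.
        self.blocks.append(ResidualBlock(channels))

    def forward(self, x):
        x = self.stem(x)
        for block in self.blocks:
            x = block(x)
        x = F.adaptive_avg_pool2d(x, 1).flatten(1)
        return self.head(x)

def train_member(model, x, y, sample_weights, epochs=1, lr=1e-3):
    # Weighted cross-entropy so the boosting distribution over examples is respected.
    model.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        opt.zero_grad()
        loss = (F.cross_entropy(model(x), y, reduction="none") * sample_weights).sum()
        loss.backward()
        opt.step()
    return model

def boosted_residual_networks(x, y, num_classes=10, rounds=3):
    n = x.shape[0]
    w = torch.full((n,), 1.0 / n)                 # boosting distribution over examples
    members, alphas = [], []
    model = GrowableResNet(num_classes=num_classes)
    for t in range(rounds):
        if t > 0:
            model = copy.deepcopy(members[-1])    # warm-start from the previous member...
            model.add_block()                     # ...and grow it by one residual block
        model = train_member(model, x, y, w)
        model.eval()
        with torch.no_grad():
            pred = model(x).argmax(1)
        err = float((w * (pred != y).float()).sum() / w.sum())
        err = min(max(err, 1e-10), 1 - 1e-10)
        # SAMME member weight and example re-weighting.
        alpha = torch.log(torch.tensor((1 - err) / err)) + torch.log(torch.tensor(num_classes - 1.0))
        w = w * torch.exp(alpha * (pred != y).float())
        w = w / w.sum()
        members.append(model)
        alphas.append(alpha)
    return members, alphas

if __name__ == "__main__":
    # Tiny synthetic smoke test (random 1-channel 8x8 "images").
    x = torch.randn(32, 1, 8, 8)
    y = torch.randint(0, 10, (32,))
    members, alphas = boosted_residual_networks(x, y, rounds=2)
    print(len(members), [round(float(a), 3) for a in alphas])
```

A Bagged Residual Networks variant, in this sketch, would replace the example re-weighting with bootstrap resampling of the training set and combine the members with uniform votes.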
Notes
In a few cases BRN is actually faster than DIB, but we believe this to be noise caused by external factors, such as system load and the hardware's affinity for some of the resulting computational graphs over others.
Ethics declarations
Conflict of interest
The authors have received a hardware grant from NVIDIA for this research.
Additional information
The authors gratefully acknowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPUs used for this research.
About this article
Cite this article
Mosca, A., Magoulas, G.D. Customised ensemble methodologies for deep learning: Boosted Residual Networks and related approaches. Neural Comput & Applic 31, 1713–1731 (2019). https://doi.org/10.1007/s00521-018-3922-2