Introduction

Over the past decade, deep learning systems have achieved great success in many application fields [1, 2] and have been drawing increasing attention in both academic and industrial communities. Typical deep learning systems include deep restricted Boltzmann machines (RBM) [3], deep belief networks (DBN) [4, 5] and deep convolutional neural networks (CNN) [6]. Deep learning systems owe their overwhelming performance to their complicated deep structures and deep learning methods. However, they often suffer from very time-consuming training because their complicated deep structures involve a huge number of hyperparameters. Besides, updating their deep structure becomes an extraordinarily tedious task because it requires re-training the whole network. To get rid of complicated deep architectures and effectively avoid very time-consuming training (and even re-training), the broad learning system (BLS for brevity) [7] has recently been invented as an alternative to deep learning systems; it organizes the well-known functional link neural network (FLNN) [8, 9] in a broad learning way. To date, several variants of BLS have been developed. A typical example is fuzzy BLS [10], which is based on Takagi–Sugeno–Kang fuzzy systems [11,12,13]. Theoretical and experimental evidence has revealed BLS's excellent classification/regression performance together with its fast broad learning capability.

Since FLNN has both a simple single-layer feedforward structure and fast learning realized through an analytical solution, BLS takes FLNN as its basic structure and then expands it in a broad learning way. In contrast to FLNN, BLS has two distinctive structural differences: (1) BLS transforms the set of original input features into randomly mapped features and then uses sparse autoencoders (SAE) to optimize them into sparse mapped features [14] so as to capture the intrinsic correlations of the input features. (2) BLS generates the enhancement nodes by applying nonlinear activation functions with randomly generated weights to all the sparse mapped features. All the sparse mapped features and the enhancement nodes are linked to the output layer, where the weights between the hidden nodes and the output layer are determined analytically by ridge regression [15], without the need for iterative weight updates. More importantly, without re-training the whole network, BLS can extend its structure in a broad learning way for the incremental addition of randomly mapped features, enhancement nodes and input data, thereby avoiding very time-consuming training, especially for big data. BLS has been theoretically proved to be a universal approximator [16]. A large amount of experimental evidence indicates that BLS indeed achieves comparable and often much more satisfactory classification/regression performance and strong generalization capability in contrast to the corresponding deep learning systems. However, because the mapped features and enhancement nodes are randomly generated, BLS often needs a huge number of enhancement nodes to achieve the prescribed performance, which may cause both an overwhelming storage requirement and overfitting. Therefore, how to downsize a BLS and simultaneously keep the strong capability of the whole system has become an urgent demand.

In this study, we attempt to tackle this issue by stacking several lightweight BLS sub-systems, with both feature augmentation and residuals boosting, into a stacked structure called D&BLS. The basic idea can be stated as follows.

  1. To avoid the overfitting of the general BLS and simultaneously downsize its structure, D&BLS is proposed to stack several lightweight BLS sub-systems. As a deep classifier, the whole stacked structure and learning of D&BLS is novel in its two joint stacking ways, feature augmentation and residuals boosting, which indeed guarantee the enhanced performance of D&BLS.

  2. Using the bootstrap strategy on the feature nodes of each lightweight BLS sub-system, enhancement nodes are generated, and hence each lightweight BLS sub-system is built to maintain both fast training and good generalization.

  3. When D&BLS has a fixed depth, three incremental learning algorithms are designed for three incremental cases to endow D&BLS with fast learning in the broad expansion of each layer, without re-training the whole network.

  4. Experimental results on six classification datasets demonstrate that D&BLS can reduce the number of enhancement nodes of BLS to about 76% or less with the same or better performance. On ten regression datasets, D&BLS achieves smaller errors with a comparable number of enhancement nodes. On the MNIST dataset, the three incremental algorithms of D&BLS reach performance comparable to that of D&BLS with one-shot construction, and their training time is much less than that of D&BLS with one-shot construction. In addition, D&BLS needs much less running time than BLS when they share the same sizes. In general, D&BLS with a few (i.e., from 2 to 4) lightweight sub-systems can obtain promising learning ability.

The rest of this paper is organized as follows. Section “On BLS” briefly reviews BLS. Section “On D&BLS” proposes the structure of D&BLS, gives its learning algorithm, and discusses its computational complexity; three incremental algorithms of D&BLS are also designed in this section. Experimental results about image/non-image dataset classification, regression and incremental learning are presented in section “Experimental results”. The last section concludes this paper.

On BLS

In this section, we give a brief review of BLS’s structure and its three incremental learning algorithms.

Framework of BLS

In [7], BLS is conceptually composed of four basic parts: (1) original input; (2) feature nodes; (3) enhancement nodes; and (4) output layer. BLS begins by feeding the original input into certain feature mapping algorithms [7] and then generates the feature nodes by using a sparse autoencoder to slightly fine-tune the random features. As such, all the mapped feature nodes form an efficient representation of the original data. Accordingly, BLS generates a series of enhancement nodes by applying certain activation functions with random parameters to all the mapped feature nodes; that is, all the mapped features together generate each enhancement node. Finally, BLS links both feature nodes and enhancement nodes to the output layer through the output weights. As one of its most famous merits, all the output weights can be trained analytically by the well-known ridge regression method [15].

Figure 1 gives an illustrative structure of BLS. According to Fig. 1, we can concretely state how to construct a BLS as follows.

Fig. 1 An illustrative structure of BLS

Without loss of generality, here we only consider the multi-input and single-output case. Suppose the training dataset X contains T original input samples with D dimensions and its output vector Y contains the corresponding T target outputs, i.e., \( \varvec{X} \in {\mathbb{R}}^{T \times D} ,\;\varvec{Y} \in {\mathbb{R}}^{T \times 1} \). We assume BLS has n groups of feature nodes, in which the ith group takes the mapping function \( \phi_{i} ,\;i = 1,2, \ldots ,n \) and contains \( d_{i} \) feature nodes, and m groups of enhancement nodes, in which the jth group takes the activation function \( \xi_{j} ,\;j = 1,2, \ldots ,m \) and contains \( q_{j} \) enhancement nodes. As such, the ith group of feature nodes may be expressed as:

$$ \varvec{M}_{i} = \phi_{i} (\varvec{XW}_{ei} +\varvec{\beta}_{ei} ),\quad \;i = 1,2, \ldots ,n $$
(1)

where the weight matrix \( \varvec{W}_{ei} \in {\mathbb{R}}^{{D \times d_{i} }} \) and the bias matrix \( \varvec{\beta}_{ei} \in {\mathbb{R}}^{{T \times d_{i} }} \) are randomly generated according to a certain distribution, such as the normal distribution. According to [7], a sparse autoencoder is taken to slightly fine-tune the random feature matrix \( \varvec{M}_{i} \) into a sparse and compact feature matrix. Using the alternating direction method of multipliers (ADMM), the weight matrix tuned by the sparse autoencoder can be solved by the following iterative steps:

$$ \left\{ \begin{aligned} & \hat{\varvec{W}}_{k + 1} = \left( {\hat{\varvec{M}}_{i}^{T} \hat{\varvec{M}}_{i} + r\varvec{I}} \right)^{ - 1} \left( {\hat{\varvec{M}}_{i}^{T} \varvec{X} + r\left( {\varvec{o}_{k} - \varvec{u}_{k} } \right)} \right) \\ & \varvec{o}_{k + 1} = \text{S}_{\lambda /r} \left( {\hat{\varvec{W}}_{k + 1} + \varvec{u}_{k} } \right) \\ & \varvec{u}_{k + 1} = \varvec{u}_{k} + \left( {\hat{\varvec{W}}_{k + 1} - \varvec{o}_{k + 1} } \right) \\ \end{aligned} \right. $$
(2)

where \( \hat{\varvec{M}}_{i} \) denotes the random feature matrix obtained in Eq. (1), \( \hat{\varvec{W}} \) is the sparse autoencoder solution, \( \hat{\varvec{W}}_{0} \), \( \varvec{o}_{0} \) and \( \varvec{u}_{0} \) are zero matrices, and \( \lambda \) is the given regularization parameter. In addition, r > 0 and S is the soft thresholding operator, which is defined as:

$$ \text{S}_{t} \left( a \right) = \left\{ {\begin{array}{*{20}l} {a - t,} & {\quad a > t} \\ {0,} & {\quad \left| a \right| \le t} \\ {a + t,} & {\quad a < - t} \\ \end{array} } \right. $$
(3)

After a certain number of iterations, the sparse feature matrix can be obtained by:

$$ \varvec{M}_{i} = \varvec{X\hat{W}} $$
(4)
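
To make the fine-tuning step concrete, the following is a minimal NumPy sketch of the ADMM iteration in Eq. (2) together with the soft thresholding operator of Eq. (3); the function names (`soft_threshold`, `sparse_autoencoder`) are ours rather than from [7], the number of iterations is an arbitrary default, and a transpose is applied when forming the sparse features so that the shapes agree with Eq. (4) when \( d_{i} \ne D \).

```python
import numpy as np

def soft_threshold(a, t):
    """Soft thresholding operator S_t(a) of Eq. (3), applied element-wise."""
    return np.sign(a) * np.maximum(np.abs(a) - t, 0.0)

def sparse_autoencoder(M_hat, X, lam=1e-3, r=1.0, n_iter=50):
    """ADMM iteration of Eq. (2).

    M_hat : (T, d) random feature matrix from Eq. (1)
    X     : (T, D) original inputs
    Returns the sparse solution W_hat of shape (d, D).
    """
    d, D = M_hat.shape[1], X.shape[1]
    W = np.zeros((d, D)); o = np.zeros((d, D)); u = np.zeros((d, D))
    G = np.linalg.inv(M_hat.T @ M_hat + r * np.eye(d))   # cached (M^T M + r I)^{-1}
    MtX = M_hat.T @ X
    for _ in range(n_iter):
        W = G @ (MtX + r * (o - u))         # first line of Eq. (2)
        o = soft_threshold(W + u, lam / r)  # second line, S_{lambda/r}
        u = u + (W - o)                     # third line
    return W

# Sparse mapped features, in the spirit of Eq. (4); the transpose makes the
# shapes agree when the number of feature nodes d_i differs from D.
# M_i = X @ sparse_autoencoder(M_hat_i, X).T
```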

We collect all the groups of feature nodes to form the group \( \varvec{M}^{n} \triangleq \left[ {\varvec{M}_{1} ,\varvec{M}_{2} , \ldots ,\varvec{M}_{n} } \right] \), which will be used to generate the enhancement nodes. Similarly, the jth group of enhancement nodes may be expressed as:

$$ \varvec{E}_{j} = \xi_{j} (\varvec{M}^{n} \varvec{W}_{hj} +\varvec{\beta}_{hj} ),\;j = 1,2, \ldots ,m $$
(5)

where the weight matrix \( \varvec{W}_{hj} \in {\mathbb{R}}^{{\left( {\sum\nolimits_{i = 1}^{n} {d_{i} } } \right) \times q_{j} }} \) and the bias matrix \( \varvec{\beta}_{hj} \in {\mathbb{R}}^{{T \times q_{j} }} \) are also randomly generated according to a certain distribution. We collect all the groups of enhancement nodes \( \varvec{E}_{j} \) to form the group \( \varvec{E}^{m} \triangleq \left[ {\varvec{E}_{1} ,\varvec{E}_{2} , \ldots ,\varvec{E}_{m} } \right] \). In this way, the output vector of such a BLS becomes:

$$ \varvec{Y} = \left[ {\varvec{M}_{1} ,\varvec{M}_{2} , \ldots ,\varvec{M}_{n} |\varvec{E}_{1} ,\varvec{E}_{2} , \ldots ,\varvec{E}_{m} } \right]\varvec{W}_{n}^{m} = \left[ {\varvec{M}^{n} |\varvec{E}^{m} } \right]\varvec{W}_{n}^{m} $$
(6)

where \( \varvec{W}_{n}^{m} \in {\mathbb{R}}^{{\left( {\sum\nolimits_{i = 1}^{n} {d_{i} } + \sum\nolimits_{j = 1}^{m} {q_{j} } } \right) \times 1}} \) is the output weight vector, which can be solved analytically by the ridge regression method. That is to say, after denoting \( \varvec{U}_{n}^{m} = \left[ {\varvec{M}^{n} |\varvec{E}^{m} } \right] \), with the Moore–Penrose inverse of \( \varvec{U}_{n}^{m} \) we have the analytical solution of \( \varvec{W}_{n}^{m} \):

$$ \varvec{W}_{n}^{m} = \left[ {\varvec{U}_{n}^{m} } \right]^{\dag } \varvec{Y} $$
(7)

Using ridge regression [15], the pseudoinverse of \( \varvec{U}_{n}^{m} \) can be solved as follows:

$$ \left[ {\varvec{U}_{n}^{m} } \right]^{\dag } = \left\{ {\begin{array}{*{20}c} {\left( {\lambda \varvec{I} + \left( {\varvec{U}_{n}^{m} } \right)^{T} \varvec{U}_{n}^{m} } \right)^{ - 1} \left( {\varvec{U}_{n}^{m} } \right)^{T} ,} &\quad {\text{if}\,\;T \ge h} \\ {\left( {\varvec{U}_{n}^{m} } \right)^{T} \left( {\lambda \varvec{I} + \varvec{U}_{n}^{m} \left( {\varvec{U}_{n}^{m} } \right)^{T} } \right)^{ - 1} ,} &\quad {\text{if}\,\;T < h} \\ \end{array} } \right. $$
(8)

where \( \lambda \) is the given regularization parameter and \( h = \sum\nolimits_{i = 1}^{n} {d_{i} } + \sum\nolimits_{j = 1}^{m} {q_{j} } \) is the total number of hidden nodes of BLS.

As a result, such a BLS is built quickly, without the slow iterative weight updates required in BP neural network learning [18].
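
As a concrete illustration of Eqs. (1) and (5)-(8), the following is a minimal NumPy sketch of one-shot BLS construction; the names `build_bls` and `ridge_pinv` are ours, a linear mapping stands in for \( \phi_{i} \), tanh stands in for the enhancement activation \( \xi_{j} \), and the sparse-autoencoder fine-tuning of Eqs. (2)-(4) is omitted for brevity.

```python
import numpy as np

def ridge_pinv(U, lam=2**-30):
    """Ridge-regression pseudoinverse of Eq. (8)."""
    T, h = U.shape
    if T >= h:
        return np.linalg.solve(lam * np.eye(h) + U.T @ U, U.T)
    return U.T @ np.linalg.inv(lam * np.eye(T) + U @ U.T)

def build_bls(X, Y, n=10, d=10, m=1, q=100, lam=2**-30, seed=0):
    """One-shot BLS training following Eqs. (1) and (5)-(8)."""
    rng = np.random.default_rng(seed)
    T, D = X.shape
    feats = []
    for _ in range(n):                      # n groups of feature nodes, Eq. (1)
        We = rng.standard_normal((D, d))
        be = rng.standard_normal(d)
        feats.append(X @ We + be)           # linear phi_i (SAE tuning omitted)
    M = np.hstack(feats)                    # M^n = [M_1, ..., M_n]
    enhs = []
    for _ in range(m):                      # m groups of enhancement nodes, Eq. (5)
        Wh = rng.standard_normal((M.shape[1], q))
        bh = rng.standard_normal(q)
        enhs.append(np.tanh(M @ Wh + bh))   # tanh in place of xi_j
    U = np.hstack([M] + enhs)               # U_n^m = [M^n | E^m], Eq. (6)
    W_out = ridge_pinv(U, lam) @ Y          # output weights, Eq. (7)
    return U, W_out
```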

Incremental learning of BLS

In addition to its fast training, another famous merit of BLS lies in its simple yet fast incremental learning methods, which make it very applicable to online scenarios and/or practical situations where more promising performance is expected. According to [7], without re-training the whole network structure, incremental learning of BLS can be realized in three cases.

Increment of enhancement nodes

In this case, BLS is expanded by adding the (m + 1)th group of enhancement nodes. That is to say, we have

$$ \varvec{E}_{m + 1} = \xi_{m + 1} \left( {\varvec{M}^{n} {\kern 1pt} \varvec{W}_{h(m + 1)} +\varvec{\beta}_{h(m + 1)} } \right) $$
(9)
$$ \varvec{U}_{n}^{m + 1} = \left[ {\varvec{U}_{n}^{m} |\varvec{E}_{m + 1} } \right] $$
(10)

After denoting \( \varvec{U}^{ex} = \varvec{E}_{m + 1} \), we can decompose the pseudoinverse of \( \varvec{U}_{n}^{m + 1} \) as,

$$ \left[ {\varvec{U}_{n}^{m + 1} } \right]^{\dag } = \left[ {\begin{array}{*{20}c} {\left[ {\varvec{U}_{n}^{m} } \right]^{\dag } - \varvec{AB}^{T} } \\ {\varvec{B}^{T} } \\ \end{array} } \right] $$
(11)

where

$$ \begin{aligned} & \varvec{A} = \left[ {\varvec{U}_{n}^{m} } \right]^{\dag } \varvec{U}^{ex} \\ & \varvec{B}^{T} = \left\{ \begin{aligned} & \varvec{C}^{\dag } ,{\text{ if }}\varvec{C} \ne 0, \\ & \left( {{\mathbf{1}} + \varvec{A}^{T} \varvec{A}} \right)^{ - 1} \varvec{A}^{T} \left[ {\varvec{U}_{n}^{m} } \right]^{\dag } ,{\text{ if }}\varvec{C} = 0. \\ \end{aligned} \right. \\ & \varvec{C} = \varvec{U}^{ex} - \varvec{U}_{n}^{m} \varvec{A} \\ \end{aligned} $$
(12)

Therefore, the new output weight vector \( \varvec{W}_{n}^{m + 1} \) becomes

$$ \varvec{W}_{n}^{m + 1} = \left[ {\begin{array}{*{20}c} {\varvec{W}_{n}^{m} - \varvec{AB}^{T} \varvec{Y}} \\ {\varvec{B}^{T} \varvec{Y}} \\ \end{array} } \right] $$
(13)

Since \( \varvec{W}_{n}^{m} \), \( \varvec{U}_{n}^{m} \) and \( \left[ {\varvec{U}_{n}^{m} } \right]^{\dag } \) have been obtained in advance, \( \varvec{W}_{n}^{m + 1} \) can be quickly obtained according to Eqs. (12) and (13).
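
The column-wise broadening update of Eqs. (10)-(13) can be sketched as follows; `add_columns` is our name for a helper, Y and W are kept as 2-D column matrices, and the zero test on C uses a small numerical tolerance rather than exact equality.

```python
import numpy as np

def add_columns(U, U_pinv, W, Y, U_ex, tol=1e-12):
    """Append new hidden-node columns U_ex (e.g. a new group of enhancement
    nodes) and update the pseudoinverse and output weights, Eqs. (10)-(13)."""
    A = U_pinv @ U_ex                                  # Eq. (12)
    C = U_ex - U @ A
    if np.linalg.norm(C) > tol:                        # C != 0
        Bt = np.linalg.pinv(C)
    else:                                              # C == 0
        Bt = np.linalg.inv(np.eye(A.shape[1]) + A.T @ A) @ A.T @ U_pinv
    U_new = np.hstack([U, U_ex])                       # Eq. (10)
    U_pinv_new = np.vstack([U_pinv - A @ Bt, Bt])      # Eq. (11)
    W_new = np.vstack([W - A @ (Bt @ Y), Bt @ Y])      # Eq. (13)
    return U_new, U_pinv_new, W_new
```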

Increment of feature nodes

In this case, BLS is expanded by adding the (n + 1)th group of feature nodes. As such, we have,

$$ \varvec{M}_{n + 1} = \phi_{n + 1} \left( {\varvec{XW}_{e(n + 1)} +\varvec{\beta}_{e(n + 1)} } \right) $$
(14)
$$ \varvec{U}_{{n{ + }1}}^{m} = \left[ {\varvec{U}_{n}^{m} |\varvec{M}_{n + 1} |\varvec{E}_{1}^{ex} |\varvec{E}_{2}^{ex} | \cdots |\varvec{E}_{m}^{ex} } \right] $$
(15)

where \( \varvec{E}_{1}^{ex} |\varvec{E}_{2}^{ex} | \cdots |\varvec{E}_{m}^{ex} \) are the additional outputs of the enhancement nodes corresponding to the (n + 1)th group of feature nodes. After denoting \( \varvec{U}^{ex} = \left[ {\varvec{M}_{n + 1} |\varvec{E}_{1}^{ex} |\varvec{E}_{2}^{ex} | \cdots |\varvec{E}_{m}^{ex} } \right] \), we can decompose the pseudoinverse of \( \left[ {\varvec{U}_{{n{ + }1}}^{m} } \right] \) as,

$$ \left[ {\varvec{U}_{{n{ + }1}}^{m} } \right]^{\dag } = \left[ {\begin{array}{*{20}c} {\left[ {\varvec{U}_{n}^{m} } \right]^{\dag } - \varvec{AB}^{T} } \\ {\varvec{B}^{T} } \\ \end{array} } \right] $$
(16)

where A and B can be obtained according to Eq. (12). Therefore, the new output weight vector \( \varvec{W}_{{n{ + }1}}^{m} \) is,

$$ \varvec{W}_{{n{ + }1}}^{m} = \left[ {\begin{array}{*{20}c} {\varvec{W}_{n}^{m} - \varvec{AB}^{T} \varvec{Y}} \\ {\varvec{B}^{T} \varvec{Y}} \\ \end{array} } \right] $$
(17)

Obviously, since \( \varvec{W}_{n}^{m} \), \( \varvec{U}_{n}^{m} \) and \( \left[ {\varvec{U}_{n}^{m} } \right]^{\dag } \) have been obtained in advance, \( \varvec{W}_{{n{ + }1}}^{m} \) can be quickly obtained according to Eqs. (12) and (17).

Increment of input data

In this case, BLS is fed with the additional training data \( {\text{\{ }}\varvec{X}^{ex} ,\varvec{Y}^{ex} {\text{\} }} \). The additional feature nodes \( \varvec{M}^{ex} \) generated from \( \varvec{X}^{ex} \) are

$$ \varvec{M}^{ex} = \left[ {\phi_{1} \left( {\varvec{X}^{ex} \varvec{W}_{e1} +\varvec{\beta}_{e1} } \right), \ldots ,\phi_{n} \left( {\varvec{X}^{ex} \varvec{W}_{en} +\varvec{\beta}_{en} } \right)} \right] $$
(18)

The output of enhancement nodes corresponding to \( \varvec{M}^{ex} \) is

$$ \varvec{E}^{ex} = \left[ {\xi_{1} \left( {\varvec{M}^{ex} \varvec{W}_{h1} +\varvec{\beta}_{h1} } \right), \ldots ,\xi_{m} \left( {\varvec{M}^{ex} \varvec{W}_{hm} +\varvec{\beta}_{hm} } \right)} \right] $$
(19)

Please note that both \( \varvec{W}_{ei} \), \( \varvec{\beta}_{ei} \) (i = 1,…,n) and \( \varvec{W}_{hj} \), \( \varvec{\beta}_{hj} \) (j = 1,…,m) take the same values as in Eqs. (1) and (5). Let

$$ \left( {\varvec{U}_{n}^{m} } \right)^{ex} = \left[ {\begin{array}{*{20}c} {\varvec{U}_{n}^{m} } \\ {\varvec{U}^{ex} } \\ \end{array} } \right] $$
(20)

and \( \varvec{U}^{ex} = \left[ {\varvec{M}^{ex} |\varvec{E}^{ex} } \right] \). Then the pseudoinverse of \( \left( {\varvec{U}_{n}^{m} } \right)^{ex} \) can be decomposed as,

$$ \left( {\left( {\varvec{U}_{n}^{m} } \right)^{ex} } \right)^{\dag } = \left[ {\left( {\varvec{U}_{n}^{m} } \right)^{\dag } - \varvec{BA}^{T} |\varvec{B}} \right] $$
(21)

where

$$ \begin{aligned} & {\kern 1pt} \,\varvec{A} = \left( {\left( {\varvec{U}_{n}^{m} } \right)^{T} } \right)^{\dag } \left( {\varvec{U}^{ex} } \right)^{T} \\ & \varvec{B}^{T} \; = \left\{ \begin{aligned} & \varvec{C}^{\dag } ,{\text{ if }}\varvec{C} \ne 0, \hfill \\ & \left( {{\mathbf{1}} + \varvec{A}^{T} \varvec{A}} \right)^{ - 1} \varvec{A}^{T} \left( {\left( {\varvec{U}_{n}^{m} } \right)^{T} } \right)^{\dag } ,{\text{ if }}\varvec{C}{ = 0} . { } \hfill \\ \end{aligned} \right. \\ & {\kern 1pt} \varvec{C}\;{\kern 1pt} = \;\left( {\varvec{U}^{ex} } \right)^{T} - \left( {\varvec{U}_{n}^{m} } \right)^{T} \varvec{A} \\ \end{aligned} $$
(22)

Therefore, the new output weight vector \( \left( {\varvec{W}_{n}^{m} } \right)^{ex} \) is

$$ \left( {\varvec{W}_{n}^{m} } \right)^{ex} = \varvec{W}_{n}^{m} + \varvec{B}\left( {\varvec{Y}^{ex} - \varvec{U}^{ex} \varvec{W}_{n}^{m} } \right) $$
(23)

Obviously, since \( \varvec{W}_{n}^{m} \), \( \varvec{U}_{n}^{m} \) and \( \left[ {\varvec{U}_{n}^{m} } \right]^{\dag } \) have been obtained in advance, \( \left( {\varvec{W}_{n}^{m} } \right)^{ex} \) can be quickly obtained according to Eqs. (22) and (23).
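
For the new-data case, the row-wise update of Eqs. (20)-(23) can be sketched in the same spirit; `add_rows` is our name, U_ex collects the hidden-layer outputs [M^ex | E^ex] of the new samples, and again a tolerance replaces the exact test C = 0.

```python
import numpy as np

def add_rows(U, U_pinv, W, U_ex, Y_ex, tol=1e-12):
    """Absorb additional samples with hidden-layer outputs U_ex and targets
    Y_ex, updating the stored pseudoinverse and weights, Eqs. (20)-(23)."""
    A = U_pinv.T @ U_ex.T                              # Eq. (22)
    C = U_ex.T - U.T @ A
    if np.linalg.norm(C) > tol:                        # C != 0
        B = np.linalg.pinv(C).T                        # B^T = C^+
    else:                                              # C == 0
        B = (np.linalg.inv(np.eye(A.shape[1]) + A.T @ A) @ A.T @ U_pinv.T).T
    U_new = np.vstack([U, U_ex])                       # Eq. (20)
    U_pinv_new = np.hstack([U_pinv - B @ A.T, B])      # Eq. (21)
    W_new = W + B @ (Y_ex - U_ex @ W)                  # Eq. (23)
    return U_new, U_pinv_new, W_new
```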

While BLS enjoys fast training owing to the flexible yet random construction of both feature nodes and enhancement nodes, it indeed requires a large number of feature nodes and especially enhancement nodes to achieve satisfactory performance, which may cause an overwhelming storage requirement and overfitting. For example, according to [7], the BLS built for the MNIST dataset needs 11,000 enhancement nodes. Therefore, in the next section, we will deepen BLS by stacking BLS structures into the proposed classifier so as to downsize the structure of BLS while keeping its promising advantages.

On D&BLS

In this section, by means of feature augmentation, we first state the deep structure of D&BLS and its learning algorithm and then derive three incremental learning algorithms of D&BLS for three incremental cases. Their computational complexities are also discussed.

Structure and learning of D&BLS

Consider a D&BLS with L layers, the original training dataset \( \varvec{X} \in {\mathbb{R}}^{T \times D} \) and its target output vector \( \varvec{Y} \in {\mathbb{R}}^{T \times 1} \). The parameters of the ith lightweight sub-system include \( n_{i} \) groups of feature nodes in which each group contains \( d_{i} \) feature nodes, the number \( p_{i} \) of feature nodes selected to generate enhancement nodes, and \( m_{i} \) groups of enhancement nodes in which each group contains \( q_{i} \) enhancement nodes. For convenience, we use \( \left( {d_{1} \times n_{1} ,p_{1} ,q_{1} \times m_{1} } \right) - \cdots - \left( {d_{L} \times n_{L} ,p_{L} ,q_{L} \times m_{L} } \right) \) to represent the parameter setting of this D&BLS and \( \left( {\sum\nolimits_{i = 1}^{L} {d_{i} \times n_{i} } ,\;\sum\nolimits_{i = 1}^{L} {q_{i} \times m_{i} } } \right) \) to denote its whole structure.

Figure 2 illustrates its deep structure. According to Fig. 2, D&BLS essentially stacks several lightweight BLS sub-systems into its deep structure simultaneously in two ways: (1) augmenting the original input space with the outputs of the previous layer as the inputs of the current layer; (2) boosting the residuals between the original target outputs and the outputs of all the previous layers as the target outputs of the current layer. According to the stacked generalization principle [17], the use of feature augmentation guarantees the enhanced performance of D&BLS, whereas the use of residual outputs maintains good local approximation of the current BLS sub-system to the subtle output differences remaining at the current layer. In general, a lightweight BLS sub-system takes far fewer feature nodes and enhancement nodes than the BLS system built for the same training dataset. To stack lightweight BLS sub-systems well, the deep structure of D&BLS embodies the following three design considerations.

Fig. 2 Structure of D&BLS

(1) According to the adopted feature augmentation, the ith lightweight sub-system has its augmented training dataset \( \varvec{X}_{i} \), i.e.,

$$ \varvec{X}_{\varvec{i}} { = }\left\{ \begin{aligned} & \varvec{X},\,\;i = 1 \hfill \\ & \left[ {\varvec{X}|\varvec{Y^{\prime}}_{i - 1} \,\,} \right],\quad i = 2,3, \ldots ,L \hfill \\ \end{aligned} \right. $$
(24)

where \( \varvec{Y^{\prime}}_{i - 1} \) corresponds to the augmented feature and denotes the output vector of the (i − 1)th sub-system. In other words, the augmented training dataset \( \varvec{X}_{i} \) contains the output information of the previous sub-system. Please note that each sub-system always has only one augmented feature, whose values are the outputs of the previous layer in D&BLS.

(2) Inspired by the idea of residuals boosting [19,20,21], each target output of the sub-system in the current layer is taken to be the residual between the corresponding original target output and the corresponding outputs of the sub-systems of all the previous layers. As such, the ith sub-system has its target output vector \( \varvec{Y}_{i} \):

$$ \varvec{Y}_{i} = \left\{ \begin{aligned} & \varvec{Y},\quad i = 1 \hfill \\ & \varvec{Y}_{i - 1} - \varvec{Y^{\prime}}_{i - 1} ,\quad i = 2,3, \ldots ,L \hfill \\ \end{aligned} \right. $$
(25)

Obviously, according to the stacked generalization principle, the above stacking way of both feature augmentation and residuals boosting can enhance the performance of D&BLS as the number of sub-systems increases. In other words, for a prescribed performance, a downsized structure, in the sense of the total number of hidden nodes, can be expected in contrast to the general BLS.

(3) When combining all the lightweight sub-systems together, diversity between them should be emphasized to achieve an effective ensemble. To do so, D&BLS only randomly bootstraps [22] some feature nodes to generate an enhancement node. Suppose the ith sub-system selects \( p_{i} \) feature nodes to generate an enhancement node. After denoting \( \varvec{M}^{{p_{i} }} \triangleq \left[ {\varvec{M}_{{c_{1} }} ,\varvec{M}_{{c_{2} }} , \ldots ,\varvec{M}_{{c_{{p_{i} }} }} } \right] \) as the randomly selected feature nodes, where \( c_{1} ,c_{2} , \ldots ,c_{{p_{i} }} \in \left[ {1,n_{i} } \right] \) denote random integers, an enhancement node \( \varvec{E}_{j} \) can be generated as,

$$ \varvec{E}_{j} = \xi_{j} (\varvec{M}^{{p_{i} }} \varvec{W}_{hj} +\varvec{\beta}_{hj} ),\quad j = 1,2, \ldots ,m $$
(26)

Once feature nodes and enhancement nodes of each lightweight BLS system are fixed according to the above strategy, this sub-system can be quickly constructed by the same learning algorithm of BLS as in the last section.
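
A minimal sketch of this bootstrap strategy is given below; `bootstrap_enhancement_group` is our name, indices are drawn with replacement over the feature-node groups \( \varvec{M}_{1} , \ldots ,\varvec{M}_{{n_{i} }} \) (following the notation of Eq. (26)), and tanh stands in for the tansig activation.

```python
import numpy as np

def bootstrap_enhancement_group(M_groups, p_i, q_i, rng):
    """Generate one group of q_i enhancement nodes from p_i randomly chosen
    (bootstrapped) feature-node groups, as in Eq. (26)."""
    idx = rng.integers(0, len(M_groups), size=p_i)   # c_1, ..., c_{p_i}, with replacement
    M_p = np.hstack([M_groups[c] for c in idx])      # M^{p_i}
    Wh = rng.standard_normal((M_p.shape[1], q_i))
    bh = rng.standard_normal(q_i)
    return np.tanh(M_p @ Wh + bh)                    # xi_j applied to the selected features
```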

D&BLS begins with the construction of the first lightweight BLS sub-system on the original training dataset X and its target output vector Y. D&BLS then obtains the output vector \( \varvec{Y^{\prime}}_{1} \) of the first lightweight BLS sub-system. After generating \( \varvec{X}_{2} = \left[ {\varvec{X}|\varvec{Y^{\prime}}_{1} } \right] \) and \( \varvec{Y}_{2} = \varvec{Y} - \varvec{Y^{\prime}}_{1} \), respectively, as the inputs and target outputs of the second lightweight BLS sub-system, the second sub-system is built in the same way as above. This procedure is repeated until the maximum number of sub-systems (i.e., layers) or the prescribed performance of D&BLS is reached. As a result, the final output vector of D&BLS on the original training dataset can be expressed as,

$$ \varvec{Y}^{\prime } = \sum\limits_{{i{\kern 1pt} = {\kern 1pt} 1}}^{L} {\varvec{Y}_{i}^{{{\kern 1pt} \prime }} } $$
(27)
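
The stacking procedure described above can be summarized by the following sketch, which assumes a helper `build_sub` that trains one lightweight BLS sub-system (for instance along the lines of the earlier `build_bls` sketch) and returns a prediction function; the names are ours, and Y and all sub-system outputs are kept as (T, 1) column matrices so they can be concatenated.

```python
import numpy as np

def train_dbls(X, Y, layer_params, build_sub):
    """Stacked D&BLS training: feature augmentation (Eq. (24)) and residuals
    boosting (Eq. (25)); `layer_params` holds one parameter dict per layer."""
    subs = []
    X_i, Y_i, Y_prev = X, Y, None
    for i, params in enumerate(layer_params):
        if i > 0:
            X_i = np.hstack([X, Y_prev])     # augment the original inputs, Eq. (24)
            Y_i = Y_i - Y_prev               # residual target, Eq. (25)
        predict_i = build_sub(X_i, Y_i, **params)
        Y_prev = predict_i(X_i)              # output Y'_i of the i-th sub-system
        subs.append(predict_i)
    return subs

def predict_dbls(subs, X):
    """Final D&BLS output: the sum of all sub-system outputs, Eq. (27)."""
    Y_sum, X_i, Y_prev = 0.0, X, None
    for i, predict_i in enumerate(subs):
        if i > 0:
            X_i = np.hstack([X, Y_prev])
        Y_prev = predict_i(X_i)
        Y_sum = Y_sum + Y_prev
    return Y_sum
```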

In summary, the learning algorithm is given in Algorithm 1. To visualize the effectiveness of Algorithm 1, a simple two-class classification dataset containing 200 training samples (100 positive samples) is taken for a didactic experiment. The dataset is generated with the make_moons function of the Scikit-learn [29] package with zero noise, as shown in Fig. 3a. Four D&BLSs with different numbers of sub-systems are taken, each deeper one obtained by deepening the shallower one. The deepest one is a four-layer D&BLS: (6 × 6, 9, 1 × 3)–(5 × 6, 9, 1 × 2)–(5 × 6, 9, 1 × 2)–(5 × 6, 9, 1 × 2), whose whole structure becomes (126, 9). Here the average results of ten runs are reported. Figure 3b–e shows the decision boundaries of all the D&BLSs.
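
For reference, the didactic dataset can be reproduced with Scikit-learn as follows; the call to the hypothetical `train_dbls`/`predict_dbls` helpers from the sketch above is shown only as a usage illustration and is not the authors' released code.

```python
import numpy as np
from sklearn.datasets import make_moons

# 200 samples (100 per class), zero noise, as in the didactic experiment above.
X, y = make_moons(n_samples=200, noise=0.0, random_state=0)
Y = y.reshape(-1, 1).astype(float)

# Hypothetical usage of the earlier sketch (layer parameters as in the text):
# subs = train_dbls(X, Y, layer_params=[...], build_sub=build_sub)
# labels = (predict_dbls(subs, X) > 0.5).astype(int)
```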

Fig. 3 The moon dataset and the corresponding decision boundaries of four D&BLSs with different numbers of sub-systems

From Fig. 3, the decision boundary of D&BLS becomes more complex to match the distribution of the training samples as D&BLS becomes deeper. This indicates that Algorithm 1 is effective and that the learning ability of D&BLS can be enhanced by adding more sub-systems with appropriate parameters.

The computational complexity of Algorithm 1 can be analyzed in terms of a D&BLS with L sub-systems. Without loss of generality, following the same strategy as BLS in [7], here we suppose that \( \xi \) in step 12 is a sigmoid activation function [23] and that the number of iterations of the sparse autoencoder taken in step 6 is K.

Thus the computational complexity of steps 2 and 3 can be calculated to be \( O\left( {2T} \right) \approx O\left( T \right) \). The computational burden from step 4 to step 7 for the generation of feature nodes takes,

$$ \begin{aligned} & O\left( {n_{i} \left( \begin{aligned} & d_{i} \left( {D + 1} \right) + 2Td_{i} \left( {D + 1} \right) + 3Td_{i} + \hfill \\ & K\left( {Td_{i}^{2} + 2d_{i} + d_{i}^{3} + Td_{i} \left( {D + 1} \right) + 7d_{i} \left( {D{ + }1} \right){ + }d_{i}^{2} \left( {D + 1} \right)} \right) \hfill \\ \end{aligned} \right)} \right) \\ & \quad \approx O\left( {2TDn_{i} d_{i} + Kn_{i} \left( {Td_{i}^{2} + d_{i}^{3} + TDd_{i} + Dd_{i}^{2} } \right)} \right) \\ \end{aligned} $$

In general, \( d_{i} < T \), \( d_{i} < D \) and \( 2 < K \) can be assured, so the computational complexity mentioned above will be reduced to \( O\left( {KTDn_{i} d_{i} } \right) \). Then step 8 requires \( O\left( {Tn_{i} d_{i} } \right) \). The computational burden of the generation of enhancement nodes from step 9 to step 13 takes

$$ O\left( {m_{i} \left( {Tp_{i} + p_{i} q_{i} + Tq_{i} p_{i} + 3Tq_{i} } \right)} \right) \approx O\left( {Tm_{i} q_{i} p_{i} } \right) $$

Then steps 14 and 15 require \( O\left( {2Tm_{i} q_{i} } \right) \approx O\left( {Tm_{i} q_{i} } \right) \). The calculation of the pseudoinverse of \( \left( {U_{{n_{i} }}^{{m_{i} }} } \right)_{i} \) in step 16 requires

$$ \left\{ \begin{aligned} & O\left( {\left( {n_{i} d_{i} + m_{i} q_{i} } \right)^{3} + 2T\left( {n_{i} d_{i} + m_{i} q_{i} } \right)^{2} + 2\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right) \approx O\left( {T\left( {n_{i} d_{i} + m_{i} q_{i} } \right)^{2} } \right),\quad {\text{if }}\left( {n_{i} d_{i} + m_{i} q_{i} } \right) \le T \hfill \\ & O\left( {T^{3} + 2T^{2} \left( {n_{i} d_{i} + m_{i} q_{i} } \right) + 2T} \right) \approx O\left( {T^{2} \left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right),{\text{ if }}\left( {n_{i} d_{i} + m_{i} q_{i} } \right) > T \hfill \\ \end{aligned} \right. $$

while step 17 and step 18 require

$$ O\left( {2T\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right) \approx O\left( {T\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right) $$

Hence, with the total L sub-systems, the computational complexity of Algorithm 1 becomes

$$ \left\{ \begin{aligned}& O\left( {\left( {L - 1} \right)T + \sum\nolimits_{i = 1}^{L} {\left( \begin{aligned} KTDn_{i} d_{i} + Tn_{i} d_{i} + Tm_{i} q_{i} p_{i} + Tm_{i} q_{i} + \hfill \\ T\left( {n_{i} d_{i} + m_{i} q_{i} } \right)^{2} + T\left( {n_{i} d_{i} + m_{i} q_{i} } \right) \hfill \\ \end{aligned} \right)} } \right)\\ &\approx O\left( {\sum\nolimits_{i = 1}^{L} {\left( {KTDn_{i} d_{i} + Tm_{i} q_{i} p_{i} + T\left( {n_{i} d_{i} + m_{i} q_{i} } \right)^{2} } \right)} } \right),\\ & {\text{ if }}\left( {n_{i} d_{i} + m_{i} q_{i} } \right) \le T \\ &O\left( {\left( {L - 1} \right)T + \sum\nolimits_{i = 1}^{L} {\left( \begin{aligned} KTDn_{i} d_{i} + Tn_{i} d_{i} + Tm_{i} q_{i} p_{i} + Tm_{i} q_{i} + \hfill \\ T^{2} \left( {n_{i} d_{i} + m_{i} q_{i} } \right) + T\left( {n_{i} d_{i} + m_{i} q_{i} } \right) \hfill \\ \end{aligned} \right)} } \right) \\ &\approx O\left( {\sum\nolimits_{i = 1}^{L} {\left( {KTDn_{i} d_{i} + Tm_{i} q_{i} p_{i} + T^{2} \left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right)} } \right),\\ & {\text{ if }}\left( {n_{i} d_{i} + m_{i} q_{i} } \right) > T \, \hfill \\ \end{aligned} \right. $$

As we can see, the most time-consuming parts for each sub-system in D&BLS are the generation of feature nodes, the generation of enhancement nodes and the calculation of pseudoinverse. Similarly, we can calculate the computational complexity of BLS for comparison. Suppose BLS contains d × n feature nodes and q × m enhancement nodes, its computational complexity is

$$ \left\{ \begin{aligned} O\left( {KTDnd + Tmqnd + T\left( {nd + mq} \right)^{2} } \right),\quad {\text{if }}\left( {nd + mq} \right) \le T \hfill \\ O\left( {KTDnd + Tmqnd + T^{2} \left( {nd + mq} \right)} \right),\quad {\text{if }}\left( {nd + mq} \right) > T \hfill \\ \end{aligned} \right. $$

For easy comparison, here we suppose each sub-system of D&BLS has the total \( \left\lceil {n/L} \right\rceil \times d \) feature nodes and \( \left\lceil {m/L} \right\rceil \times q \) enhancement nodes. As such, the computational complexity of Algorithm 1 becomes

$$ \left\{ \begin{aligned} O\left( {KTDnd + \frac{Tmq}{L}\sum\nolimits_{i = 1}^{L} {p_{i} } + \frac{T}{L}\left( {nd + mq} \right)^{2} } \right), \, \quad {\text{ if }}\frac{{\left( {nd + mq} \right)}}{L} \le T, \hfill \\ O\left( {KTDnd + \frac{Tmq}{L}\sum\nolimits_{i = 1}^{L} {p_{i} } + T^{2} \left( {nd + mq} \right)} \right), \, \quad {\text{ if }}\frac{{\left( {nd + mq} \right)}}{L} > T \, \hfill \\ \end{aligned} \right. $$

In general, \( \frac{1}{L}\sum\nolimits_{i = 1}^{L} {p_{i} } < nd \) and \( \left( {nd + mq} \right) \le T \) can be assured. Thus the computational complexity is obviously less than that of BLS.

Algorithm 1 The learning algorithm of D&BLS (pseudocode figure)

Incremental learning for D&BLS

To achieve enhanced performance, it seems that we may expand D&BLS by successively adding lightweight BLS sub-systems. However, too many lightweight BLS sub-systems often cause overfitting. Our experimental evidence demonstrates that the total number of layers (i.e., sub-systems) should generally be a small integer (i.e., from 2 to 4). Therefore, a feasible strategy is to develop incremental algorithms with a fixed number of layers. What is more, we should still consider how to expand D&BLS for the case of incremental input data. Here we develop three incremental algorithms of D&BLS for three incremental cases, without the need to re-train the whole classifier.

Increment of enhancement nodes

In this case, each sub-system in D&BLS is expanded by adding \( \bar{m} \) (i.e., a small integer) groups of enhancement nodes in which each group contains \( \bar{q} \) enhancement nodes.

The additional enhancement nodes \( \varvec{E}_{1}^{ex} ,\varvec{E}_{2}^{ex} , \cdots ,\varvec{E}_{{\bar{m}}}^{ex} \) are generated from a portion of the original feature nodes. After denoting \( \varvec{U}^{ex} = \left[ {\varvec{E}_{1}^{ex} ,\varvec{E}_{2}^{ex} , \cdots ,\varvec{E}_{{\bar{m}}}^{ex} } \right] \), the pseudoinverse of \( \varvec{U}_{{n_{i} }}^{{m_{i} + \bar{m}}} = \left[ {\varvec{U}_{{n_{i} }}^{{m_{i} }} |\varvec{U}^{ex} } \right] \) can be calculated by Eqs. (11) and (12). Then the output weight vector can be determined by \( \left( {\varvec{W}_{{n_{i} }}^{{m_{i} + \bar{m}}} } \right)_{i} = \left( {\varvec{U}_{{n_{i} }}^{{m_{i} + \bar{m}}} } \right)^{\dag } \varvec{Y}_{i} \), and hence the output of this sub-system after expansion becomes \( \varvec{Y}_{i}^{\prime } { = }\varvec{U}_{{n_{i} }}^{{m_{i} + \bar{m}}} \left( {\varvec{W}_{{n_{i} }}^{{m_{i} + \bar{m}}} } \right)_{i} \). The incremental algorithm for this case is summarized in Algorithm 2.

Algorithm 2 Increment of enhancement nodes for D&BLS (pseudocode figure)
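
For one sub-system, this expansion can be sketched as follows, reusing the `add_columns` helper from the earlier sketch; its third return value is discarded here and the output weights are instead re-solved against this sub-system's residual target \( \varvec{Y}_{i} \), exactly as stated above, since the updated pseudoinverse is already available.

```python
import numpy as np

def expand_with_enhancement_nodes(U, U_pinv, Y_i, E_ex):
    """Add the extra enhancement nodes E_ex to one lightweight sub-system:
    pseudoinverse via Eqs. (11)-(12) (through add_columns), then
    W = U^+ Y_i and Y'_i = U W as described in the text."""
    W_cur = U_pinv @ Y_i                             # current weights, only needed by add_columns
    U_new, U_pinv_new, _ = add_columns(U, U_pinv, W_cur, Y_i, E_ex)
    W_new = U_pinv_new @ Y_i                         # (W_{n_i}^{m_i + m_bar})_i
    return U_new, U_pinv_new, W_new, U_new @ W_new   # last item is Y'_i after expansion
```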

Below let us analyze the computational complexity of Algorithm 2 in terms of L sub-systems. Because the training of each sub-system in this incremental case only deals with the generation of the incremental enhancement nodes and the calculation of the corresponding pseudoinverse, we mainly examine the corresponding steps in Algorithm 2.

That is to say, the computational complexity from step 3 to step 7 takes \( O\left( {T\bar{m}\bar{q}p_{i} } \right) \), while steps 8 and 9 require \( O\left( {2T\bar{m}\bar{q}} \right) \approx O\left( {T\bar{m}\bar{q}} \right) \). What is more, because the number of the additional enhancement nodes is generally a small integer, i.e., \( \bar{m}\bar{q} < \left( {n_{i} d_{i} + m_{i} q_{i} } \right) \) and \( \bar{m}\bar{q} < T \), the calculation of the pseudoinverse \( \left( {U_{{n_{i} }}^{{m_{i} + \bar{m}}} } \right)^{\dag } \) in step 10 requires

$$ O\left( {3T\bar{m}\bar{q}\left( {n_{i} d_{i} + m_{i} q_{i} } \right) + T\left( {n_{i} d_{i} + m_{i} q_{i} } \right) + T\left( {\bar{m}\bar{q}} \right)^{2} + 2T\bar{m}\bar{q}} \right) \approx O\left( {T\bar{m}\bar{q}\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right) $$

Obviously, step 11 and step 12 require \( O\left( {2T\left( {n_{i} d_{i} + m_{i} q_{i} + \bar{m}\bar{q}} \right)} \right) \approx O\left( {T\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right) \). Hence, with the total L iterations, the computational complexity of Algorithm 2 becomes

$$ \begin{aligned} & O\left( {LT\bar{m}\bar{q} + \sum\limits_{i = 1}^{L} {\left( {T\bar{m}\bar{q}p_{i} + T\bar{m}\bar{q}\left( {n_{i} d_{i} + m_{i} q_{i} } \right) + T\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right)} } \right) \\ & \quad \approx O\left( {\sum\limits_{i = 1}^{L} {\left( {T\bar{m}\bar{q}\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right)} } \right) \\ \end{aligned} $$

If we re-train the whole model based on Algorithm 1 and assume \( \left( {n_{i} d_{i} + m_{i} q_{i} } \right) < T \) is assured, its computational complexity becomes

$$ O\left( {\sum\limits_{i = 1}^{L} {\left( {KTDn_{i} d_{i} + T\left( {m_{i} q_{i} + \bar{m}\bar{q}} \right)p_{i} + T\left( {n_{i} d_{i} + m_{i} q_{i} + \bar{m}\bar{q}} \right)^{2} } \right)} } \right) $$

Obviously, increment of enhancement nodes can save much training time.

Increment of feature nodes

In this case, after \( \bar{n} \) (i.e., a small integer) groups of feature nodes, each with \( \bar{d} \) feature nodes, are added, each sub-system in D&BLS is further expanded by the corresponding additional \( \bar{m} \) (i.e., a small integer) groups of enhancement nodes, each with \( \bar{q} \) enhancement nodes.

After the total \( \bar{n} \) groups of the additional feature nodes \( \varvec{M}^{ex} \) are generated from \( \varvec{X}_{i} \), the total \( \bar{m} \) groups of the additional enhancement nodes \( \varvec{E}^{ex} \) are accordingly generated from a portion of the additional feature nodes \( \varvec{M}^{ex} \). After denoting \( \varvec{U}^{ex} = \left[ {\varvec{M}^{ex} |\varvec{E}^{ex} } \right] \), the pseudoinverse of \( \varvec{U}_{{n_{i} { + }\bar{n}}}^{{m_{i} }} = \left[ {\varvec{U}_{{n_{i} }}^{{m_{i} }} |\varvec{U}^{ex} } \right] \) can be calculated by Eqs. (12) and (16). As such, the output weight vector can be obtained by \( \left( {\varvec{W}_{{n_{i} + \bar{n}}}^{{m_{i} }} } \right)_{i} = \left( {\varvec{U}_{{n_{i} { + }\bar{n}}}^{{m_{i} }} } \right)^{\dag } \varvec{Y}_{i} \) and the output of this sub-system accordingly becomes \( \varvec{Y}_{i}^{\prime } { = }\varvec{U}_{{n_{i} + \bar{n}}}^{{m_{i} }} \left( {\varvec{W}_{{n_{i} + \bar{n}}}^{{m_{i} }} } \right)_{i} \). The incremental algorithm for this case is summarized in following Algorithm 3.

The computational complexity of Algorithm 3 can be analyzed in terms of L sub-systems. Obviously, the training of each sub-system in this incremental case mainly has three stages, namely, the generation of the additional feature nodes, the generation of the corresponding additional enhancement nodes and the calculation of the pseudoinverse. According to Algorithm 3, the computational burden from step 4 to step 7 for the generation of feature nodes takes \( O\left( {KTD\bar{n}\bar{d}} \right) \). Step 8 requires \( O\left( {T\bar{n}\bar{d}} \right) \). The computational complexity from step 9 to step 13 for the generation of enhancement nodes is \( O\left( {T\bar{m}\bar{q}p_{i} } \right) \). Then steps 14 and 15 require \( O\left( {2T\bar{m}\bar{q}} \right) \approx O\left( {T\bar{m}\bar{q}} \right) \), and step 16 requires \( O\left( {T\left( {\bar{n}\bar{d} + \bar{m}\bar{q}} \right)} \right) \). Because the number of both the additional feature nodes and the additional enhancement nodes is generally a small integer, i.e., \( \left( {\bar{n}\bar{d} + \bar{m}\bar{q}} \right) < \left( {n_{i} d_{i} + m_{i} q_{i} } \right) \) and \( \left( {n_{i} d_{i} + m_{i} q_{i} } \right) < T \), the calculation of the pseudoinverse \( \left( {U_{{n_{i} { + }\bar{n}}}^{{m_{i} }} } \right)^{\dag } \) in step 17 requires

$$ \begin{aligned} & O\left( 3T\left( {\bar{n}\bar{d} + \bar{m}\bar{q}} \right)\left( {n_{i} d_{i} + m_{i} q_{i} } \right) + T\left( {n_{i} d_{i} + m_{i} q_{i} } \right)\right.\\ &\quad\left. + T\left( {\bar{n}\bar{d} + \bar{m}\bar{q}} \right)^{2} + 2T\left( {\bar{n}\bar{d} + \bar{m}\bar{q}} \right)\right) \\ & \qquad \approx O\left( {T\left( {\bar{n}\bar{d} + \bar{m}\bar{q}} \right)\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right). \\ \end{aligned} $$

What is more, steps 18 and 19 require \( O\left( {2T\left( {n_{i} d_{i} + m_{i} q_{i} + \bar{n}\bar{d} + \bar{m}\bar{q}} \right)} \right) \approx O\left( {T\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right) \). In summary, with the total L iterations, the computational complexity of Algorithm 3 becomes

$$ \begin{aligned} & O\Bigg( LKTD\bar{n}\bar{d} + 2LT\bar{m}\bar{q} + 2LT\bar{n}\bar{d}\\ &\qquad + \sum\nolimits_{i = 1}^{L} \left( T\bar{m}\bar{q}p_{i} + T\left( {\bar{n}\bar{d} + \bar{m}\bar{q}} \right)\left( {n_{i} d_{i} + m_{i} q_{i} } \right)\right. \\ &\qquad \left. + T\left( {n_{i} d_{i} + m_{i} q_{i} } \right) \right) \Bigg)\hfill \\ & \approx O\left( {\sum\nolimits_{i = 1}^{L} {\left( {T\left( {\bar{n}\bar{d} + \bar{m}\bar{q}} \right)\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right)} } \right) \hfill \\ \end{aligned} $$

If we re-train the whole structure based on Algorithm 1 and suppose \( \left( {n_{i} d_{i} + m_{i} q_{i} } \right) < T \) is assured, its computational complexity becomes

$$ \begin{aligned} & O\left( \sum\nolimits_{i = 1}^{L} \left( KTD\left( {n_{i} d_{i} + \bar{n}\bar{d}} \right) + T\left( {m_{i} q_{i} + \bar{m}\bar{q}} \right)p_{i} \right.\right.\\ & \quad \left.\left. + T\left( {n_{i} d_{i} + m_{i} q_{i} + \bar{n}\bar{d} + \bar{m}\bar{q}} \right)^{2} \right) \right) \end{aligned} $$

Obviously, increment of feature nodes can save much training time.

Algorithm 3 Increment of feature nodes for D&BLS (pseudocode figure)

Increment of input data

In this case, D&BLS is expanded by adding \( \bar{T} \) input data \( \{ \varvec{X}_{{\bar{T}}} ,\varvec{Y}_{{\bar{T}}} \} ,\varvec{X}_{{\bar{T}}} \in {\mathbb{R}}^{{\bar{T} \times D}} ,\varvec{Y}_{{\bar{T}}} \in {\mathbb{R}}^{{\bar{T} \times 1}} \). Suppose that \( \{ \varvec{X}_{{T + \bar{T}}} ,\varvec{Y}_{{T + \bar{T}}} \} \) denotes all the input data after the incremental operation. After the total \( n_{i} \) groups of the additional feature nodes \( \varvec{M}^{ex} \) with \( d_{i} \) feature nodes are generated from \( \varvec{X}_{{\bar{T}}} \), the total \( m_{i} \) groups of the additional enhancement nodes \( \varvec{E}^{ex} \) with \( q_{i} \) enhancement nodes are generated from a portion of the additional feature nodes \( \varvec{M}^{ex} \). After denoting \( \varvec{U}^{ex} = \left[ {\varvec{M}^{ex} |\varvec{E}^{ex} } \right] \), the pseudoinverse of \( \left( {\varvec{U}_{{n_{i} }}^{{m_{i} }} } \right)^{ex} = \left[ {\begin{array}{*{20}c} {\varvec{U}_{{n_{i} }}^{{m_{i} }} } \\ {\varvec{U}^{ex} } \\ \end{array} } \right] \) can be calculated by Eqs. (21) and (22). As a result, the output weight vector can be obtained by \( \left( {\varvec{W}_{{n_{i} }}^{{m_{i} }} } \right)_{i}^{ex} = \left( {\left( {\varvec{U}_{{n_{i} }}^{{m_{i} }} } \right)^{ex} } \right)^{\dag } \varvec{Y}_{i} \) and the output of this sub-system becomes \( \varvec{Y}_{i}^{\prime } = \left( {\varvec{U}_{{n_{i} }}^{{m_{i} }} } \right)^{ex} \left( {\varvec{W}_{{n_{i} }}^{{m_{i} }} } \right)_{i}^{ex} \). The incremental algorithm for this case is summarized in Algorithm 4.

Algorithm 4 Increment of input data for D&BLS (pseudocode figure)

The computational complexity of Algorithm 4 can also be analyzed in terms of L sub-systems. The computational complexity of each sub-system in this incremental case mainly contains three stages: the generation of the additional feature nodes, the generation of the additional enhancement nodes and the calculation of the pseudoinverse. According to Algorithm 4, the computational complexity from step 4 to step 7 for the generation of feature nodes takes \( O\left( {K\bar{T}Dn_{i} d_{i} } \right) \). Step 8 requires \( O\left( {\bar{T}n_{i} d_{i} } \right) \). Then the computational burden from step 9 to step 13 for the generation of enhancement nodes is \( O\left( {\bar{T}m_{i} q_{i} p_{i} } \right) \). Steps 14 and 15 require \( O\left( {2\bar{T}m_{i} q_{i} } \right) \approx O\left( {\bar{T}m_{i} q_{i} } \right) \), and step 16 requires \( O\left( {\bar{T}\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right) \). Supposing that \( \left( {n_{i} d_{i} + m_{i} q_{i} } \right) < \bar{T} \) and \( \left( {n_{i} d_{i} + m_{i} q_{i} } \right) < T \) are assured, the calculation of \( \left( {\left( {U_{{n_{i} }}^{{m_{i} }} } \right)^{ex} } \right)^{\dag } \) in step 17 requires

$$ \begin{aligned} & O\left( \bar{T}\left( {n_{i} d_{i} + m_{i} q_{i} } \right)^{2} + 2\bar{T}\left( {n_{i} d_{i} + m_{i} q_{i} } \right)\right.\\ & \quad \left. + T\left( {n_{i} d_{i} + m_{i} q_{i} } \right) + 3T\bar{T}\left( {n_{i} d_{i} + m_{i} q_{i} } \right) \right)\\ & \qquad \approx O\left( {T\bar{T}\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right) \end{aligned} $$

What is more, steps 18 and 19 require \( O\left( {2\left( {T + \bar{T}} \right)\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right) \approx O\left( {\left( {T + \bar{T}} \right)\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right) \). Supposing that \( \bar{T} < T \) and \( KD < T \) are assured, with the total L iterations, the total computational complexity of Algorithm 4 can be described as

$$ \begin{aligned} & O\left( \sum\nolimits_{i = 1}^{L} \left( K\bar{T}Dn_{i} d_{i} + 2\bar{T}n_{i} d_{i} + \bar{T}m_{i} q_{i} p_{i} + 2\bar{T}m_{i} q_{i}\right.\right.\\ & \qquad \left.\left. + T\bar{T}\left( {n_{i} d_{i} + m_{i} q_{i} } \right) + \left( {T + \bar{T}} \right)\left( {n_{i} d_{i} + m_{i} q_{i} } \right) \right) \right) \\ & \quad \approx O\left( {\sum\nolimits_{i = 1}^{L} {\left( {T\bar{T}\left( {n_{i} d_{i} + m_{i} q_{i} } \right)} \right)} } \right) \\ \end{aligned} $$

If we re-train the whole structure based on Algorithm 1 and suppose \( \left( {n_{i} d_{i} + m_{i} q_{i} } \right) < T + \bar{T} \) is assured, its computational complexity becomes

$$ O\left( {\sum\nolimits_{i = 1}^{L} {\left( {K\left( {T + \bar{T}} \right)Dn_{i} d_{i} + \left( {T + \bar{T}} \right)m_{i} q_{i} p_{i} + \left( {T + \bar{T}} \right)\left( {n_{i} d_{i} + m_{i} q_{i} } \right)^{2} } \right)} } \right) $$

Obviously, increment of input data can save training time.

Finally, let us state how to set appropriate parameters in the above algorithms. As pointed out above, the total number of layers generally takes a small integer (i.e., from 2 to 4). Besides, according to our extensive experiments, the weight matrix \( \varvec{W} \) and the bias matrix \( \varvec{\beta} \) involved in the hidden nodes are generally drawn from standard normal distributions. Since the number \( p_{i} \) of selected feature nodes should generally be much less than the total number of feature nodes, it is set to be at most 75% of the total number of feature nodes. As for the regularization parameters \( \lambda \) and \( r \) involved in the sparse autoencoder, they are simply set to 0.001 and 1, respectively. The regularization parameter \( \lambda \) involved in ridge regression is simply set to \( 2^{-30} \). In our experiments, the number of additional hidden nodes for each sub-system is simply taken to be 10 and the number of additional input data is simply set to 3000.
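
For convenience, the default settings stated above can be gathered into a single configuration; the dictionary and its key names are ours, with the values taken from the text.

```python
# Default hyperparameters collected from the text (key names are ours).
DBLS_DEFAULTS = {
    "n_layers": 3,            # 2-4 lightweight sub-systems generally suffice
    "p_max_ratio": 0.75,      # p_i is at most 75% of the total number of feature nodes
    "sae_lambda": 0.001,      # regularization lambda of the sparse autoencoder
    "sae_r": 1.0,             # ADMM parameter r
    "ridge_lambda": 2**-30,   # ridge-regression regularization lambda
    "incr_hidden_nodes": 10,  # additional hidden nodes per sub-system per incremental step
    "incr_samples": 3000,     # additional input data per incremental step
}
```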

Experimental results

In this section, to evaluate the performance of D&BLS, we organize four groups of experiments. The first group, about classification, is carried out on five image datasets and one UCI [24] classification dataset; the second, about regression, on ten UCI regression datasets; the third, about incremental learning, on the popular MNIST dataset [25]; and the last compares the running time of BLS and D&BLS. All the experiments are carried out on a computer equipped with an Intel i3 3.40 GHz CPU and 4 GB of memory. The experiments on LSSVM are carried out in the MATLAB environment and the other experiments are carried out in the Python environment. Before the experiments, all the datasets are divided into training and testing sets.

The comparative methods include BLS [7], the support vector machine (SVM) [26] and the least squares SVM (LSSVM) [27]. Because the regularization parameter \( C \) and the kernel parameter \( \gamma \) of SVM play an essential role in its performance, they need to be chosen appropriately for a fair comparison. In our experiments, we determine them by a grid search over \( \{ 2^{ - 24} ,2^{ - 23} , \ldots ,2^{24} ,2^{25} \} \) for the parameters \( \left( {C,\gamma } \right) \) of SVM, while the parameters \( \left( {C,\gamma } \right) \) of LSSVM are decided using the LS-SVMlab toolbox.
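
The SVM tuning described above can be reproduced with a standard grid search; the snippet below is a sketch using Scikit-learn's GridSearchCV, where the five-fold cross-validation protocol is our assumption since the text does not specify how the grid is evaluated.

```python
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# (C, gamma) searched over {2^-24, ..., 2^25}, as stated above.
param_grid = {"C": [2.0 ** k for k in range(-24, 26)],
              "gamma": [2.0 ** k for k in range(-24, 26)]}
svc_search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, n_jobs=-1)
# svc_search.fit(X_train, y_train); best = svc_search.best_params_
```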

In our experiments, the original codes of BLS in [7] (classification version) and in [16] (regression version) are used. As for SVM, its classification version SVC and its regression version SVR are taken from the Scikit-learn [29] package.

As for BLS and D&BLS, to make a fair comparison between them, the weight matrix \( \varvec{W} \) and the bias matrix \( \varvec{\beta} \) are drawn from standard normal distributions and the tansig function is taken as the activation function of the enhancement nodes. All other parameters involved in BLS and D&BLS are kept the same, and their settings have been stated in the last section.

Classification

This group of experiments consists of two parts. The first part observes the performance of D&BLS and the comparative methods on image data, while the second part deals with one UCI non-image classification dataset. The five popular image datasets are MNIST [25] and USPS [29] about handwritten digits, COIL20 [30] and COIL100 [31] about 3-D objects, and Extended YaleB [32] about human faces. The UCI classification dataset is Isolet [33] about spoken letter recognition. To obtain good performance, we use the “one vs. one” strategy for LSSVM when faced with multi-class problems and thus omit its parameter setting here.

According to the accuracies obtained by BLS on the adopted datasets, we determine the structure of D&BLS, including the number of lightweight BLS sub-systems, by the trial and error strategy. To make our comparison fair, here we report average results (i.e., means and standard deviations) over ten trials on each dataset.

MNIST

MNIST contains 70,000 handwritten digits of ten classes. Every digit is represented by an image of 28 × 28 grayscale pixels. 60,000 of the handwritten digits are partitioned into the training dataset and the remaining 10,000 into the testing dataset. Here a subset of MNIST is chosen for our experiment, with the first 10,000 training images and all 10,000 testing images. In our experiments, we perform a grid search for the optimal parameters of BLS, including the size of the feature node groups, the number of feature node groups and the number of enhancement nodes, over [10, 30] × [10, 20] × [100, 5000] with steps of 1, 1 and 100, respectively, due to the large search range of enhancement nodes. The parameter settings can be checked in Table 1 and the classification accuracies of all the adopted methods are given in Table 2.

Table 1 Parameters setting of the adopted methods on the classification datasets
Table 2 Accuracies (%) of the adopted methods on the classification datasets

Obviously, D&BLS reaches better accuracy (97.37%) than BLS (97.19%), while D&BLS only needs 3490 enhancement nodes, about 72.71% of the number of enhancement nodes (4800) in BLS. In addition, D&BLS exhibits its superior classification performance over the other comparative methods: SVM (94.32%) and LSSVM (91.40%).

USPS

The US Postal Service (USPS) handwritten digit recognition corpus is another well-known dataset, which contains 9298 gray images of size 16 × 16 pixels covering the ten digit classes 0–9. We randomly select 7500 images for training and the remaining 1798 images for testing. In our experiment, the search range for BLS is [10, 30] × [10, 20] × [100, 5000] with steps of 1, 1 and 100. The parameter settings are shown in Table 1 and the classification accuracies are listed in Table 2.

We can see that D&BLS exceeds the accuracy of BLS (95.90% vs. 95.77%) with far fewer enhancement nodes (only 1877), about half the number of enhancement nodes (3700) in BLS. Compared to SVM (93.44%) and LSSVM (95.27%), D&BLS also achieves the best performance.

COIL20

COIL20 contains 1440 gray images of 20 different 3-D objects with 128 × 128 pixels. The images of each object were taken 5 degrees apart as the object was rotated on a turntable, so each object has 72 images. There are 36 images of each object for training and the remaining 720 images are for testing. In our experiment, the search range for BLS is [10, 30] × [10, 20] × [100, 3000] with steps of 1, 1 and 100. The parameter settings are given in Table 1 and the classification accuracies can be checked in Table 2.

In contrast to BLS, D&BLS has higher accuracy (99.78%) with far fewer enhancement nodes (only 1357), about 61.68% of the number of enhancement nodes (2200) in BLS. D&BLS also achieves the best performance compared to SVM (83.19%) and LSSVM (89.86%).

COIL100

COIL100 contains 7200 color images of 100 different 3-D objects with 128 × 128 pixels. The images of each object were taken 5 degrees apart as the object was rotated on a turntable, so each object has 72 images. In this experiment, each image was resized to 32 × 32 and converted into a gray image. We randomly select 5000 images for training and the remaining 2200 images for testing.

In our experiment, the search range for BLS is [10, 30] × [10, 20] × [100, 5000] with steps of 1, 1 and 100. The parameter settings are listed in Table 1 and the classification accuracies are shown in Table 2.

As we can see, D&BLS exceeds the accuracies of all the adopted methods with far fewer enhancement nodes (only 2710), about 61.59% of the number of enhancement nodes in BLS.

Extended YaleB

Extended YaleB consists of 2414 cropped face images of 38 subjects with a size of 32 × 32 pixels. The images contain large variations in illumination conditions and expressions for each subject. There are 30 images of each person for training and the remaining 1274 images are for testing. Following [16], BLS contains 60 × 30 feature nodes and 1 × 6000 enhancement nodes. The parameter settings of the other models are shown in Table 1 and the classification accuracies of all the adopted methods are listed in Table 2.

From Tables 1 and 2, D&BLS matches the accuracy (97.92%) of BLS with only 232 enhancement nodes, about 3.87% of the number of enhancement nodes in BLS. D&BLS also exceeds the accuracies of SVM (89.32%) and LSSVM (89.95%).

UCI

To verify the effectiveness of D&BLS for non-image data, we choose the Isolet dataset, which contains 1560 spoken letter recognition samples of 26 classes with 617 features per sample. We randomly choose 1092 samples for training and the remaining 468 samples for testing. In our experiment, the search range for BLS is [10, 30] × [10, 20] × [100, 3000] with steps of 1, 1 and 100. The parameter details are shown in Table 1 and the classification performance is given in Table 2.

D&BLS indeed obtains the best result (95.49%) among all the adopted methods and only needs about 76% of the number of enhancement nodes in BLS, which demonstrates the effectiveness of D&BLS for non-image samples.

Regression

In this subsection, we attempt to verify that D&BLS can achieve smaller regression errors than BLS when they both have comparable numbers of enhancement nodes. We take ten UCI regression datasets, whose details are listed in Table 3.

Table 3 Details of the UCI regression datasets

Following [16], the same parameter settings of BLS are obtained by a grid search over [1, 10] × [1, 30] × [1, 200] with steps of 1, and the best root-mean-square errors (RMSE) from ten trials are taken to make a fair comparison. The detailed parameter settings and the experimental results of SVM, LSSVM, BLS and D&BLS on these datasets are given in Tables 4 and 5, respectively.

Table 4 Parameters setting of the adopted methods (1) on the UCI regression datasets and (2) on the UCI datasets for regression
Table 5 RMSEs of the adopted methods on the UCI datasets for regression

Clearly, D&BLS performs comparably to BLS on the Bodyfat and Weather Izmir datasets and better than BLS, with comparable numbers of enhancement nodes, on the remaining eight datasets. It can be concluded that, with almost the same architecture sizes, D&BLS attractively outperforms BLS in terms of testing accuracy on these datasets, which hints that D&BLS may take smaller architectures to achieve regression performance comparable to BLS. Compared to SVM and LSSVM, D&BLS also achieves the best performance.

Incremental algorithms

In this subsection, we observe the average performance (with standard deviations) over ten runs of the incremental learning algorithms of D&BLS on the MNIST dataset in three cases: (1) increment of enhancement nodes (corresponding to Algorithm 2); (2) increment of feature nodes (corresponding to Algorithm 3); and (3) increment of input data (corresponding to Algorithm 4).

Our experiments begin with a three-layer D&BLS, whose whole structure is (696, 275), consisting of (30 × 8, 10, 1 × 100)–(29 × 8, 10, 1 × 95)–(28 × 8, 10, 1 × 80), on the MNIST dataset, where the first 10,000 samples from the total 60,000 training samples are taken as the initial training samples and the same 10,000 testing samples are kept. The parameter \( p_{i} \) for the additional enhancement nodes is also set to 10. Besides, the testing accuracy, training time and testing time of D&BLS with one-shot construction are also provided to highlight the advantage of the proposed three incremental algorithms.

Increment of enhancement nodes

Two groups of experiments are arranged to observe Algorithm 2 in this case. The first group considers a total of 4 incremental operations, each adding ten enhancement nodes in an incremental way. The second group keeps the same setting except that 12 rather than 10 enhancement nodes are added each time. Thus, the whole network with three sub-systems gains 30 or 36 more enhancement nodes at each step of this step-by-step increase. The results are given in Table 6.

Table 6 Experimental results of the MNIST dataset using increment of enhancement nodes

As we can see, the testing accuracy continues to improve as the additional enhancement nodes are inserted into D&BLS in an incremental way and finally reaches a performance comparable to that of D&BLS with one-shot construction (see the last row in Table 6). Obviously, the training time of Algorithm 2 is much less than that of D&BLS with one-shot construction, e.g., 1.8399 s vs. 5.5217 s.

Increment of feature nodes

Two groups of experiments are arranged to observe Algorithm 3 in this case. The first group considers a total of 4 incremental operations, each adding ten feature nodes and ten corresponding enhancement nodes in an incremental way. The second group keeps the same setting except that 12 rather than 10 feature nodes and enhancement nodes are added each time. All the results can be checked in Table 7.

Table 7 Experimental results of the MNIST dataset using increment of feature nodes

It can be seen that the testing accuracy continues to go up as the additional feature nodes are inserted into D&BLS in an incremental way and finally reaches a performance comparable to that of D&BLS with one-shot construction (see the last row in Table 7). Obviously, the training time of Algorithm 3 is much less than that of D&BLS with one-shot construction, e.g., 3.3657 s vs. 6.4858 s.

Increment of input data

Two groups of experiments are arranged to observe Algorithm 4 in this case. The first group considers a total of 4 incremental operations, each adding 3000 training samples, in an incremental way, from the remaining training samples. The second group keeps the same setting except that 8000 rather than 3000 training samples are added each time. The incremental results are shown in Table 8.

Table 8 Experimental results of the MNIST dataset using increment of input data

It can be observed that the testing accuracy declines slightly (e.g., from 94.18 to 93.54%) when the additional input data are fed into D&BLS for the first time, while the testing accuracy then goes steadily up as more input data are fed in an incremental way and finally reaches a performance comparable to that of D&BLS with one-shot construction (see the last row in Table 8). Obviously, the training time of Algorithm 4 is much less than that of D&BLS with one-shot construction, e.g., 3.6203 s vs. 10.4513 s.

Running time comparison

In this subsection, we show another attractive property of D&BLS: it needs much less running time (i.e., training time and testing time) than BLS when they both share the same sizes (i.e., feature nodes and enhancement nodes). The COIL-100 and USPS datasets are taken for these experiments. For a fair comparison, here we report average results and standard deviations over ten trials on each dataset.

D&BLS takes a three-layer structure and each sub-system keeps the same parameter setting; in particular, the number of selected feature nodes is 1. In other words, if a BLS has 30 × 10 feature nodes and 1 × 2100 enhancement nodes, each sub-system in the corresponding D&BLS will contain 10 × 10 feature nodes, 1 selected feature node, and 1 × 700 enhancement nodes. The obtained experimental results are given in Table 9.

Table 9 Running time of BLS and D&BLS (1) on the COIL-100 dataset (s) and (2) on the USPS dataset (s)

According to Table 9, D&BLS indeed needs much less running time than BLS. What is more, the advantage of D&BLS over BLS grows as the number of enhancement nodes increases. This tendency is particularly striking in the case of 3600 enhancement nodes on the USPS dataset: BLS occupies 27.2773 s while D&BLS occupies only 9.8075 s.

Conclusion

While the recently developed broad learning system (BLS) has exhibited promising performance without the need for a deep structure, it often requires a huge number of hidden nodes, which may inevitably cause both an overwhelming storage requirement and overfitting. In this paper, D&BLS is proposed to synthesize both deep and broad learning so as to guarantee enhanced performance, downsize the structure and reduce the running time of the general BLS. D&BLS consists of several lightweight BLS sub-systems, and its structural novelty lies in its joint stacking of feature augmentation and residuals boosting. The whole learning algorithm and three incremental algorithms of D&BLS are proposed and then verified by our experiments on image datasets and UCI datasets.

Since the parameter settings of each sub-system in D&BLS are determined only by the trial and error strategy, a deeper D&BLS sometimes cannot be assured to perform better than a shallower one. As a result, how to determine the structure of D&BLS in a moderate time and how to design a deeper D&BLS with desirable ability for practical application scenarios are our research directions in the near future.