Introduction

Deep neural networks have led to a revolution in machine learning. These networks achieve state-of-the-art performance in numerous applications, such as audio analysis and computer vision [1,2,3]. However, most deep neural networks, such as the deep belief network (DBN) [4], require long training times because of iterative parameter tuning based on the back-propagation algorithm. Meanwhile, the extreme learning machine (ELM) proposed by Huang et al. [5] and its variants [6,7,8] have fast training speeds and good generalization capabilities. Huang also proved that, in contrast to common knowledge and conventional neural network learning tenets, hidden nodes/neurons do not need to be iteratively tuned for a wide range of neural networks and learning models [9]. Therefore, the ELM and its variants achieve higher training speeds without iterative parameter tuning. Similar to deep networks, one extended ELM, the multi-layer extreme learning machine (ML-ELM), stacks the extreme learning machine autoencoder (ELM-AE) layer by layer [10]. The ELM-AE learns deep features faster than stacked autoencoders. However, as with most deep learning algorithms, the ML-ELM learns features entirely from data, and its learning performance largely depends on the probability distribution of the data. In fact, prior knowledge of human cognition may be used to facilitate discriminative deep feature learning.

People can learn new concepts successfully from a single example, while machine learning algorithms typically require tens or hundreds of examples to reach similar accuracies. Artificial intelligence aims to give machines the perceptual abilities of humans. One general way to improve an algorithm is therefore to understand how people perceive and to propose a solution that mimics them. According to our understanding, people quickly focus on the main interesting parts of an image when performing recognition. These interesting parts contain the most significant information for cognition, while the other parts of the image normally offer only assisting information. Based on this understanding, we propose modifying the ML-ELM by exploiting this human perception strategy. From a mathematical point of view, a deep network is simply a function of its inputs; the output is a mapping of the input. Therefore, if the important parts of the input are carefully enhanced, the output will carry more information about these parts. Based on this reasoning, we present our local region-enhanced ML-ELM in the following sections.

In this work, we focus on two points. First, finding the local significant region of the input. Normally, the significant parts differ from one dataset to another, so it is necessary to detect the important parts of the input data. Detecting such interesting parts is, in general, a difficult research problem. Before training a model, data are usually preprocessed so that the most important part is aligned in the middle; therefore, we choose the central local region as the local significant region. Second, training the ML-ELM network with the local significant region. The traditional ML-ELM takes only the data as input. To enhance the importance of the local significant region so that the network captures more of its information, we take both the data and the local significant region as inputs when training the deep network. The experimental results show that our model performs better than the other algorithms.

Brief Review of ML-ELM

Extreme Learning Machine

The ELM is a single hidden-layer feedforward neural network proposed by Huang et al. [5]. The input data is mapped to an L-dimensional ELM random feature space, and the network output is given by Eq. 1:

$$ f_{L}(x)=\sum\limits_{i = 1}^{L}\beta_{i}h_{i}(x)= h(x)\beta $$
(1)

where \(\beta = [\beta_{1},\cdots,\beta_{L}]^{T}\) is the matrix of the output weights from the hidden nodes to the output nodes, \(h(x) = [h_{1}(x),\cdots,h_{L}(x)]\) is the row vector representing the outputs of the L hidden nodes with respect to the input x, and h(x) maps the data from the d-dimensional input space to the L-dimensional hidden-layer feature space (ELM feature space) H.

The above linear equation can be written in matrix form:

$$ H\beta=T $$
(2)

where \(T = [t_{1},\cdots,t_{N}]^{T}\) is the matrix of target labels and \(H = [h^{T}(x_{1}),\cdots,h^{T}(x_{N})]^{T}\) is the hidden-layer output matrix. The output weights β can be calculated by Eq. 3:

$$ \beta=H^{\dagger}T $$
(3)

where \(H^{\dagger}\) is the Moore-Penrose generalized inverse of the matrix H. To achieve better generalization performance and make the solution more robust, one can add a regularization term, in which case the output weights β are calculated by Eq. 4 [11]:

$$ \beta=\left( \frac{I}{C}+H^{T}H\right)^{-1}H^{T}T $$
(4)

where I is the identity matrix and C is a regularization constant.
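To make the training procedure concrete, the following is a minimal NumPy sketch of Eqs. 1-4 (the paper's own experiments were run in MATLAB; the hidden-layer size, the regularization constant C, and the sigmoid activation used here are illustrative choices rather than the paper's exact settings):

```python
import numpy as np

def train_elm(X, T, L=1000, C=1e3, seed=0):
    """Basic regularized ELM (Eqs. 1-4): a random, untuned hidden layer plus a
    ridge-regularized least-squares solution for the output weights beta."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.standard_normal((d, L))          # random input weights (never tuned)
    b = rng.standard_normal(L)               # random hidden biases
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))   # hidden outputs h(x), sigmoid activation
    # beta = (I/C + H^T H)^(-1) H^T T        (Eq. 4)
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ T)
    return W, b, beta

def predict_elm(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta                          # f_L(x) = h(x) beta  (Eq. 1)
```

Here T would typically be a one-hot label matrix, so the predicted class is the index of the largest output.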

Multi-layer Extreme Learning Machine

The multi-layer extreme learning machine performs layer-by-layer unsupervised learning. Each layer is an ELM-AE that represents features based on singular values. The ELM-AE is a modified ELM that performs unsupervised learning: it obtains a representation of the input X at the encoding stage and recovers X at the decoding stage. The ELM-AE network is shown in Fig. 1a. Its parameters are calculated as follows. The input weights w and biases b of the hidden nodes are generated randomly and satisfy the orthogonality constraints \(W^{T}W = I\) and \(b^{T}b = 1\). Orthogonalizing the randomly generated weights and biases tends to improve the generalization performance of the ELM-AE. Minimizing the difference between the reconstructed X and the input X, together with the regularization requirement, the output weights of the ELM-AE are calculated as shown in Eq. 5:

$$ \beta=\left( \frac{I}{C}+H^{T}H\right)^{-1}H^{T}X $$
(5)

where X is the input data. For the symmetrical ELM-AE, the output weights are calculated by singular value decomposition (SVD). Kasun et al. [12] showed that the output weights β of the ELM-AE learn features of the input data through its singular values.
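Assuming the same NumPy setting as the ELM sketch above, an ELM-AE layer can be sketched as follows; the orthogonalization via QR and the sigmoid activation are illustrative details, not necessarily the exact procedure of [12]:

```python
import numpy as np

def train_elm_ae(X, L, C=1e3, seed=0):
    """ELM-AE (Eq. 5): orthogonal random weights and bias, sigmoid hidden layer,
    output weights beta learned to reconstruct the input X itself."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Orthogonalize the random weights: orthonormal columns if L <= d,
    # orthonormal rows otherwise, so W has shape (d, L) in both cases.
    Q, _ = np.linalg.qr(rng.standard_normal((max(d, L), min(d, L))))
    W = Q if d >= L else Q.T
    b = rng.standard_normal(L)
    b /= np.linalg.norm(b)                   # unit-norm bias (b^T b = 1)
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    # beta = (I/C + H^T H)^(-1) H^T X        (Eq. 5)
    beta = np.linalg.solve(np.eye(L) / C + H.T @ H, H.T @ X)
    return beta
```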

Fig. 1
figure 1

The network structure of ML-ELM

To obtain the distributed representation, the ML-ELM network is created by stacking ELM-AEs layer by layer, where each layer takes the output of the previous layer as its input. The parameter β of each layer is calculated as shown in Eq. 5. Moreover, unlike typical deep networks such as the DBM, the ELM-AE does not iteratively tune its parameters, so the network trains faster than other deep networks.
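Using the train_elm_ae sketch above, one possible stacking scheme is shown below. Following the common ML-ELM formulation of Kasun et al. [12], each new layer's representation is obtained by projecting the previous layer's output with the transpose of the learned ELM-AE output weights; the layer sizes, per-layer regularization constants, and the use of a sigmoid at every layer are illustrative assumptions (the networks reported in the experiments use different sizes):

```python
import numpy as np

def train_ml_elm(X, T, layer_sizes=(700, 2000), C_ae=1e3, C_out=1e8):
    """Stack ELM-AEs layer by layer (unsupervised feature learning), then solve
    the final supervised layer by regularized least squares as in Eq. 4."""
    betas, H = [], X
    for i, L in enumerate(layer_sizes):
        beta = train_elm_ae(H, L, C=C_ae, seed=i)      # per-layer ELM-AE (Eq. 5)
        betas.append(beta)
        H = 1.0 / (1.0 + np.exp(-(H @ beta.T)))        # project features with beta^T
    n = H.shape[1]
    W_out = np.linalg.solve(np.eye(n) / C_out + H.T @ H, H.T @ T)
    return betas, W_out
```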

Proposed Methodology

The ELM proposed by Huang et al. was originally inspired by biological learning. They conjectured that some parts of the brain should contain random neurons, and on this basis constructed the universal learning structure of the ELM. All the hidden nodes in the ELM are independent of the training data and of each other [13]. Nevertheless, the final weight calculation still depends on the training data. Furthermore, studies of the selective attention system show that humans focus on local areas of interest. Inspired by this selective attention mechanism, we modify the ML-ELM and improve its representation learning ability so as to mimic the human cognitive method to some extent.

The Main Idea of the Region-Enhanced ML-ELM

In this part, we modify the ML-ELM by adding extra inputs selected from selective attention areas, following the idea of the human visual attention mechanism. The structure of the proposed region-enhanced ML-ELM is illustrated in “Structure of Region-Enhanced ML-ELM”. Our proposed region-enhanced ML-ELM uses the same scheme as most deep networks that learn a distributed representation from the data. However, the original data is not treated equally: the significant areas are repeated and enhanced at the input end, as illustrated in Fig. 2. In this figure, the red rectangle marks the local significant region, and this part is multiplexed at the input end. The significant areas are selected by referring to the human visual selective attention mechanism. Nevertheless, the human selective attention system is complex. To simplify the attention simulation problem and focus on the attention-aided learning algorithm, we use only two typical selective attention areas according to our understanding: one is the center area, and the other is the task-driven area of interest.

Fig. 2
figure 2

Illustration of the region-enhanced ML-ELM input

Definition of Our Local Significant Regions Based on Attention Knowledge

In cognitive psychology, there are several models that describe how visual attention operates. The first of these models to appear in the literature is the spotlight model, in which attention is described as having a focus, a margin, and a fringe [14]. The focus is the area that extracts information from the visual scene with high resolution; its geometric center is where visual attention is directed. Surrounding the focus is the fringe of attention, which extracts information with low resolution. The fringe extends out to a specified area, and its cutoff is called the margin (https://en.wikipedia.org/wiki/Attention). Measurements of the strength and direction of illusory motion at increasing separations from the cue reveal an attentional “perceptive field” with an excitatory center at the cued locus and an inhibitory surround subtending the remaining visual field [15].

To explain the center attention scheme, we use the following example. Five example images of the handwritten digit “3” are shown in Fig. 3.

Fig. 3
figure 3

Illustration of the center-based local significant region

Image (a) is the original image. Images (b) to (e) are each marked with a red rectangle of a different size. It is easy to see that the digit in image (a) is three. However, if you see only the red rectangle blocks from image (b) to image (e), you understand the image increasingly better as the rectangle grows: the red rectangle block of image (e) already shows enough information for people to recognize which digit is in the image, and it seems that the rest of the image is not needed. In other words, the part inside the red rectangle provides more important information than the remaining part. Combining this observation with the spotlight model of visual attention, we extract the center region from the whole image and add it as multiplexed input nodes to increase the resolution of this region. For images with simple scenes, the geometric center is used, as we do in this paper for digit and object recognition on the pre-processed MNIST and NORB databases. For an image with a complex background, the excitatory center at the attended locus must be detected by an attention system or a cognition task.
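As an illustration (not the paper's exact preprocessing code), extracting such a geometric-center region from an image amounts to a simple crop; the patch size below is arbitrary, whereas the experiments later report a local region of size 2*28 for MNIST:

```python
import numpy as np

def center_region(img, region_h, region_w):
    """Crop the central local significant region from a 2-D image, taking the
    geometric center as the focus of attention."""
    h, w = img.shape
    top, left = (h - region_h) // 2, (w - region_w) // 2
    return img[top:top + region_h, left:left + region_w]

digit = np.zeros((28, 28))              # stand-in for a 28*28 MNIST digit
patch = center_region(digit, 14, 14)    # illustrative 14*14 central patch
```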

Some attention theories place new emphasis on the separation of visual attention tasks, which are mediated by supplementary cognitive processes. Duncan et al. state that there is an initial pre-attentive parallel phase of perceptual segmentation and analysis that encompasses all visual items present in a scene [16]. A task is a possible stimulus driving perceptual segmentation. Based on this understanding, we select several segments as local significant regions for a given cognition task. When we perform motion recognition, we focus on the hands and feet in an image. When we perform face recognition, our attention is concentrated on the eyes, nose, and mouth. When we perform facial expression recognition, we care about the eyes, mouth, eyebrows, etc.

In this paper, we take face recognition as the task, extract several relatively significant regions, and then add them as multiplexed inputs in our improved ML-ELM.

In the face recognition task, we take the eye, mouth, and nose parts of the image as the local significant regions. In Fig. 4, we use detection algorithms provided by OpenCV to detect the eyes (b), nose (c), and mouth (d) in image (a). After obtaining the significant regions, we use them in the next step.

Fig. 4
figure 4

Illustration of task-driven local significant regions
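The paper does not state which OpenCV detectors were used; as one plausible sketch, Haar cascade classifiers can locate the parts. The eye cascade ships with OpenCV, whereas nose and mouth cascades and the file name face.jpg below are assumptions:

```python
import cv2

# The eye cascade is bundled with OpenCV; cascades for the nose and mouth would
# have to be supplied separately (e.g., from the OpenCV contrib collection).
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_eye.xml")

def detect_parts(gray_face, cascade):
    """Return bounding boxes (x, y, w, h) of detected parts in a grayscale face."""
    return cascade.detectMultiScale(gray_face, scaleFactor=1.1, minNeighbors=5)

face = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)           # hypothetical image
eye_patches = [face[y:y + h, x:x + w]
               for (x, y, w, h) in detect_parts(face, eye_cascade)]
```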

Implementation of Region-Enhanced ELM-AE

An autoencoder (AE) is an artificial neural network used for learning an efficient coding; its aim is to learn a representation of a set of data. An autoencoder is a feedforward, non-recurrent neural network similar to the multilayer perceptron (MLP), with an input layer, an output layer, and one or more hidden layers connecting them. The difference between autoencoders and MLPs is that in an autoencoder the output layer has the same number of nodes as the input layer: instead of being trained to predict a target value Y given inputs X, autoencoders are trained to reconstruct their own inputs X, as the ELM-AE does. The ELM-AE learns representative features from data in an unsupervised way, based on the requirement that the features produced at the encoding stage can reconstruct (approximate) the source data at the decoding stage. By stacking the encoding networks of ELM-AEs, deep features are then learned with the ML-ELM. Furthermore, we modify the ML-ELM by making use of some human cognition knowledge to improve the performance of its representation learning. The details of our proposed region-enhanced ML-ELM are elaborated as follows.

Structure of Region-Enhanced ML-ELM

The network structure is shown in Fig. 5. Different from the ML-ELM, our proposed region-enhanced ML-ELM adds several input nodes by multiplexing the local significant region or regions at the input end. In other words, the input of the proposed region-enhanced ML-ELM contains two parts: the first part is the source data (X in Fig. 5), and the second part is the data from the local significant region (X-a in Fig. 5). In the training stage, the parameters of each layer are calculated separately based on the ELM-AE, and the first-layer ELM-AE takes both parts as inputs. Since the ELM-AE thereby receives more information about the significant region of the data, its representation learning is improved.

Fig. 5
figure 5

The structure and parameter-training diagram of region-enhanced ML-ELM

Calculation of Network Parameters

Since our region-enhanced ML-ELM has a similar structure to the ML-ELM, the parameter calculation method is also similar, except for the weights of the input layer. The hidden and output weights are therefore calculated as in the ML-ELM, as illustrated in “Brief Review of ML-ELM”. Here, we address how we calculate the weights of the input layer.

The input of the ELM-AE with the local significant region is given by Eq. 6:

$$ X_{\text{region}}= [x, x_{a}] $$
(6)

where x represents the original input and \(x_{a}\) denotes the multiplexed input from the local significant region. Thus, the hidden-layer output of the region-enhanced ELM-AE is calculated by Eq. 7:

$$ H_{\text{region}}=g(a\cdot X_{\text{region}}+b) $$
(7)

where a denotes the orthogonal random weights, b the orthogonal bias, and g the activation function.

The optimization target is to minimize the loss between the reconstructed input and the original one, as in Eq. 8:

$$ J_{\text{cost}}=\min\left\|\tilde{X}_{\text{region}}-X_{\text{region}}\right\| $$
(8)

where \(\tilde{X}_{\text{region}}=H_{\text{region}}\beta\), and the parameter β is the output weight to be computed. Then, β is calculated as in Eq. 9:

$$ \beta={H_{\text{region}}}^{\dagger}X_{\text{region}} $$
(9)
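Continuing the NumPy sketches above, Eqs. 6-9 can be put together as follows; the orthogonalization details and the choice of a sigmoid for g are illustrative assumptions:

```python
import numpy as np

def train_region_enhanced_elm_ae(X, X_a, L, seed=0):
    """Region-enhanced ELM-AE: concatenate the original input with the
    multiplexed significant region (Eq. 6), map it through orthogonal random
    weights (Eq. 7), and solve beta with the pseudoinverse (Eq. 9)."""
    rng = np.random.default_rng(seed)
    X_region = np.hstack([X, X_a])                    # Eq. 6: X_region = [x, x_a]
    d = X_region.shape[1]
    Q, _ = np.linalg.qr(rng.standard_normal((max(d, L), min(d, L))))
    a = Q if d >= L else Q.T                          # orthogonal random weights a
    b = rng.standard_normal(L)
    b /= np.linalg.norm(b)                            # orthogonal (unit-norm) bias b
    H_region = 1.0 / (1.0 + np.exp(-(X_region @ a + b)))  # Eq. 7 with sigmoid g
    beta = np.linalg.pinv(H_region) @ X_region            # Eq. 9
    return beta
```

Here X would be the matrix of flattened images and X_a the matrix of flattened local significant regions (e.g., the central patches extracted earlier), one row per sample.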

Analysis of Reconstructed Results

Figure 6a contains the source sample images and their relative central regions, and Fig. 6b, c contain the corresponding sample images reconstructed with and without the region enhancement, respectively. From Fig. 6b, c, we can see that the reconstructed images are slightly cluttered and fuzzy compared with the source images. Nevertheless, the reconstructed results are acceptable and retain complete information. Meanwhile, when comparing the reconstruction results with and without the region enhancement in Fig. 6b, c, the difference is tiny but can still be found. With the local region enhancement, the edges are a bit clearer and less fuzzy than without it, and more noise can be found in the reconstruction results obtained without the region enhancement; for example, note the extra flaw in the vertical stroke near the bottom-right corner of the center region of the digit “4” image. This, to some extent, demonstrates that our proposed region-enhanced method improves the representation of the important parts, as intended, by increasing the resolution of those parts at the input end through multiplexing of the local significant regions. In the future, more experiments will be done with various resolution-enhancement rates. Of course, considering the random generation of the weights between the input and hidden nodes, the strategy needs to make full use of the a priori knowledge of selectiveness.

Fig. 6
figure 6

Source and reconstruction images

Performance Evaluation and Analysis

Image Datasets

The MNIST dataset is a commonly used dataset for testing the performance of deep networks. It contains 28*28 images of handwritten digits with 60,000 training samples and 10,000 testing samples [17].

The NORB dataset is intended for experiments in 3D object recognition from shape. It contains 24,300 stereo images for training and 24,300 stereo images for testing, showing toys belonging to five generic categories: four-legged animals, human figures, airplanes, trucks, and cars. As shown in Fig. 7a [18], the objects were imaged by 2 cameras under 6 lighting conditions, 9 elevations (30 to 70 degrees, every 5 degrees), and 18 azimuths (0 to 340 degrees, every 20 degrees). Following Prof. Huang’s work, we use the preprocessed version of the NORB dataset with normalized object sizes and uniform backgrounds. The images are 32*32 stereo pairs. Some samples are shown in Fig. 7b [19].

Fig. 7
figure 7

Sample images in the NORB dataset

The ORL dataset is used in the context of a face recognition project. There are ten different images of each of the 40 distinct subjects [6]. For some subjects, the images were taken at different times, with varying lighting conditions, facial expressions, and facial details. All the images were taken against a dark homogeneous background with the subjects in an upright, frontal position. Some sample images from the ORL dataset are shown in Fig. 8.

Fig. 8
figure 8

Sample images in the ORL dataset

Experiments

The experiments were conducted on a desktop computer with an Intel Xeon E5-2687 3.10-GHz processor and 32 GB of RAM, running MATLAB 2013a. The label “RE-ML-ELM” in Tables 1, 2, and 3 denotes the region-enhanced multi-layer extreme learning machine.

Table 1 Performance comparison of the RE-ML-ELM with the state-of-the-art deep networks using the MNIST dataset
Table 2 Performance comparison of the RE-ML-ELM with the ML-ELM and ELM using the NORB dataset
Table 3 Performance comparison of the RE-ML-ELM with the ML-ELM and ELM using the ORL dataset

The ML-ELM (network structure 784-700-15000-10, with ridge parameters \(10^{-1}\) for layer 784-700, \(10^{3}\) for layer 700-15000, and \(10^{8}\) for layer 15000-10) with the sigmoidal hidden-layer activation function achieves an accuracy of 98.97% on the MNIST dataset. The RE-ML-ELM that we propose (network structure 784-700-15000-10, with ridge parameters \(10^{-1}\) for layer 784-700, \(10^{4}\) for layer 700-15000, \(10^{8}\) for layer 15000-10, and 2*28 for the local region size) with the sigmoidal hidden-layer activation function achieves an accuracy of 99.12% on the MNIST dataset. Apart from the items listed above, the ML-ELM uses the same parameters as the RE-ML-ELM. For the ELM, in order to obtain a higher accuracy on the MNIST dataset, we use significantly more hidden nodes, and so it costs much more time. Moreover, the two-layer DBM network produces better results than the three-layer DBM network [20]. We obtained the same test results for the DBN as those reported in [20]. We also performed experiments on the NORB and ORL datasets, and the results are shown in Tables 2 and 3.
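For reference, the reported MNIST settings of the RE-ML-ELM can be collected into a small configuration sketch (the values are copied from the text above; how such a configuration is consumed by the training code is an implementation choice, and interpreting the 2*28 region size as a 2-by-28 shape is an assumption):

```python
# Reported RE-ML-ELM settings for MNIST (values taken from the text above).
RE_ML_ELM_MNIST = {
    "network_structure": (784, 700, 15000, 10),
    "ridge_parameters": {"784-700": 1e-1, "700-15000": 1e4, "15000-10": 1e8},
    "activation": "sigmoid",
    "local_region_size": (2, 28),   # reported as 2*28; shape interpretation assumed
}
```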

As shown in Tables 1, 2, and 3, the RE-ML-ELM outperforms the ML-ELM, ELM, DBN, and SAE, and it takes less time than the DBN and the other deep networks. The RE-ML-ELM takes more data as input: because it incorporates more information than the ML-ELM, its accuracy is higher than that of the ML-ELM. The experiments show that the additional significant information can be utilized to improve the accuracy of the classifier.

Conclusion and Future Work

In this paper, we propose an improved ML-ELM in which the local significant regions are multiplexed at the input end, and we apply it to image classification. By enhancing the center area or the task-selected areas in the corresponding recognition tasks, our proposed RE-ML-ELM achieves the best recognition results and a competitively fast training time compared with the ML-ELM without region enhancement, the ELM, and some state-of-the-art deep networks. This supports our idea of combining a priori knowledge of attention with data-driven learning to improve the performance of the ML-ELM. In fact, our proposed idea of enhancing the local significant regions can be applied to most data-driven algorithms. However, the definition of the significant regions in this paper is not complete or practical for all applications. For the focus center, we used only the geometric center, and we verified our local region-enhanced idea using two datasets with simple backgrounds. In the future, we will extend the region selection to be more comprehensive and practical by referring to psychological research on the selective attention mechanism. As a first step, we will employ a more practical focal attention mechanism that detects the location of the excitatory center and utilizes the region around that locus instead of only the geometric center. From another perspective, our proposed region-enhanced ML-ELM does not consider the spatial structure of images as the CNN and ELM-LRF do: it reshapes the two-dimensional image into a one-dimensional vector. This causes a serious loss of spatial information during image cognition and, accordingly, a lower accuracy rate on the NORB dataset compared with that of the CNN and ELM-LRF reported in [9]. In the next step, we could consider making use of the spatial information in our RE-ML-ELM to improve the cognition computing performance.

In summary, there are several tasks we will address in the future to improve the performance of the current algorithm. (1) We will extract more comprehensive and practical significant regions based on psychological research on the attention mechanism, such as the region around the excitatory locus instead of the simple geometric center. (2) We will extract local regions with more complete semantic information. (3) We will improve the algorithm's scheme by taking spatial information into consideration.