1 Introduction

The convolutional neural network (CNN) is a typical deep learning method based on feature extraction through convolution operations [9]. It is widely applied to prediction, classification [14] and other fields. CNN can solve high-dimensional problems that are difficult for traditional machine learning methods [19]. Its ability to minimize the error between the label and the inference [22] is especially powerful in image processing. The neuron weights [12] of a CNN are adjusted by forward propagation and error back propagation [15]. In recent years, CNNs have become more powerful because distributed computing power has greatly improved. Apart from image recognition [3], CNNs are also applied in other fields [11] such as text classification [26], control systems [1] and target tracking [21].

The development history of CNN is as follows

The earliest study on CNN can be traced back to Fukushima, who mimicked the visual cortex of organisms and proposed the Neocognitron model [7]. The Time-Delay Neural Network (TDNN) was proposed by Alexander Waibel et al. in 1987 [27]. TDNN demonstrated that more hidden layers provide greater feature extraction capability, which became the foundation for further optimization of CNN. After a series of improvements, Kaiming He et al. released ResNet in 2015 [8]; the network skips some neuron nodes through shortcut connections to achieve higher performance. In 2017, Gao Huang et al. proposed DenseNet.

Problems of CNN can be summarized as follows

(1) The precision, accuracy and efficiency of CNN are expected to be improved. (2) High-dimensional information contains more details, which are difficult to learn, as in datasets such as MNIST and CIFAR; even the human brain tends to ignore such high-dimensional information. (3) CNN is more complex than the classical neural network, but the trained CNN model cannot be well explained. It has been shown that a randomly generated CNN can sometimes solve difficult problems better than a carefully designed network. A more intelligent module that can identify more detailed information is expected.

Wavelet transform (WT) is often used in deep learning [5, 16, 24]. Many features can be obtained by the discrete wavelet transform, which has been improved by many studies. Application fields based on WT combined with deep learning methods include image classification [10, 23], computer vision [4, 17], texture classification [6], etc.

The applications based on wavelet neural network (WNN) in deep learning are as follows

In 2019, Pengju Liu et al. proposed the Multi-level Wavelet Convolutional Neural Network (MWCNN) [16], which is shown to enlarge the receptive field by reducing the size of the feature maps. The Multi-Path Learnable Wavelet Neural Network for Image Classification was introduced by De Silva et al. [5]; this model introduces a multi-path layout with several levels of wavelet decomposition. In the domain of prediction, a convolutional LSTM network using wavelet decomposition was proposed in 2018 [28]. It takes the wavelet decomposition as the method of feature extraction instead of manual feature extraction, an approach also validated by Kiskin et al. in 2017 [13].

The advantages of wavelet analysis are as follows

Wavelet analysis has been widely used in signal processing and analysis. The wavelet analysis method is called a mathematical microscope [2, 18] and is considered a powerful tool for zooming into the details of sound, images, etc. Although the wavelet transform adds some complexity [32], its powerful detail-extraction ability is helpful and important for solving the above problems of CNN [20].

The motivation of this research is to address the above problems of CNN based on the advantages of the WT. The importance of the research lies in its focus on improving the CNN neurons themselves. In contrast to increasing the network depth, it is believed that improving each neuron of the CNN can improve the feature identification and learning ability of the whole network [30]. Wavelet analysis is therefore adopted [29] to improve the CNN in this study.

The contributions of this study are as follows

(1) The wavelet-based Convolutional Neural Network (wCNN) is proposed, where the wavelet transform is adopted as the activation function in the Convolutional Pooling Neural Network (CPNN) part of CNN. (2) Based on wCNN, the wavelet-based Convolutional wavelet Neural Network (wCwNN) is proposed, where the Fully Connected Neural Network (FCNN) of wCNN is replaced by a wavelet Neural Network (wNN). (3) Comparative experiments between CNN, wCNN and wCwNN are conducted on the MNIST dataset.

The following sections are organized as follows

The traditional CNN model is described in Section 2. The wavelet transform is introduced in Section 3. The improved wCNN is proposed in Section 4, and the improved wCwNN is proposed in Section 5. The performance of CNN, wCNN and wCwNN is verified, analyzed and compared on the MNIST dataset in Sections 6 and 7. Discussion, conclusions and further research are given in Sections 8 and 9.

2 Model of convolutional neural network (CNN)

2.1 Structure of CNN

The structure of classical CNN is shown in Fig. 1. There are two parts in CNN: the first part is CPNN, and the second part is FCNN. In CPNN, the first layer is an input layer, and the following layers of CPNN are several pairs of convolutional layers and pooling layers. In FCNN, the first layer is an input layer, and the second layer of FCNN is an output layer, both layers of FCNN are fully connected.

Fig. 1 Structure of CNN

The relation and features of CPNN and FCNN are as follows: (1) the input layer of CPNN is the input layer of CNN; (2) the last layer of CPNN is the input layer of FCNN; (3) the output layer of FCNN is the output layer of CNN; (4) the activation function of the convolutional layers in CPNN and of the output layer in FCNN is the sigmoid function; (5) there are no activation functions in the input layer and pooling layers of CPNN or in the input layer of FCNN.

2.2 Algorithm of CNN

The algorithm of CNN can be described as follows: (1) initializing the weights between layers and the biases of neurons. (2) Forward propagation. (3) Calculating the mean square error (MSE) of all samples according to the loss function. (4) Calculating the back-propagated errors of each layer, which are obtained by the chain rule. (5) Applying gradient descent to adjust the weights and biases according to the back-propagated errors. (6) Repeating steps (2) to (5) until the MSE is small enough. (7) Evaluating the accuracy, precision and efficiency.

2.2.1 Forward propagation of CNN

Forward propagation of CNN is the calculation process from the input layer to the output layer, which can be described as follows: (1) The input layer of CNN is filled by a two-dimensional matrix of pixels of an image. (2) Forward propagation is calculated in convolutional and pooling layers (CPNN). (3) Forward propagation is calculated in fully connected layer (FCNN).

Definition 1: net^l and O^l are the input and the output of the neurons in layer l. The output of each neuron can be calculated from its input and its activation function. l is the layer number, e.g. l = 1 stands for the first layer and l = −1 stands for the last layer. i and j are the row and column indices respectively.

According to the above definition, net^{−1}, net^{−2} and net^{−3} stand for the input of the last FCNN layer, the input of the first FCNN layer and the input of the layer before FCNN (i.e. the last pooling layer) respectively. The data structures of net^l and O^l in each layer of CPNN are two-dimensional matrices, while net^l and O^l in each layer of FCNN are one-dimensional vectors.

Definition 2: \( {w}_{ij}^l \) and \( {b}_j^l \) are the weights and bias of layer l. \( {w}_{ij}^{-1} \) and \( {b}_j^{-1} \) are the weights and bias of the last layer of FCNN. If the layer l is a convolutional layer or a pooling layer, the size of the convolutional kernel or the pooling windows can be expressed as sizel × sizel. If layer l is a fully-connected layer, the number of neurons is expressed as sizel.

Definition 3: int(x) is the function for getting the integer part of x, e.g., int(5.1) = int(5.7) = 5.

Forward propagation of convolutional layer

The input of the convolutional layer (net^l) can be calculated according to Eq. (1). \( {net}_{mn}^l \) stands for each input value of the neurons in layer l. convolution(O^{l−1}, w^l, m, n) is the function for the convolution calculation. O^{l−1} is the output of the previous layer. w^l is the matrix of weights between the input of layer l (net^l) and the output of the previous layer (O^{l−1}). b^l is the bias of layer l.

$$ {net}_{mn}^l=\mathrm{convolution}\left({O}^{l-1},{w}^l,m,n\right)+{b}^l=\sum \limits_{i=0}^{size^l-1}\sum \limits_{j=0}^{size^l-1}\left({O}_{m+i,n+j}^{l-1}\cdotp {w}_{i,j}^l+{b}^l\right) $$
(1)

An example of the convolution operation is provided. If \( x=\begin{pmatrix}{x}_{11}& {x}_{12}& {x}_{13}\\ {x}_{21}& {x}_{22}& {x}_{23}\\ {x}_{31}& {x}_{32}& {x}_{33}\end{pmatrix} \) and \( y=\begin{pmatrix}{y}_{11}& {y}_{12}\\ {y}_{21}& {y}_{22}\end{pmatrix} \), the formula of convolution(x, y) can be expressed as Eq. (2):

$$ \mathrm{convolution}\left(x,y\right)={\displaystyle \begin{array}{cc}{x}_{11}{y}_{11}+{x}_{12}{y}_{12}+{x}_{21}{y}_{21}+{x}_{22}{y}_{22}& {x}_{12}{y}_{11}+{x}_{13}{y}_{12}+{x}_{22}{y}_{21}+{x}_{23}{y}_{22}\\ {}{x}_{21}{y}_{11}+{x}_{22}{y}_{12}+{x}_{31}{y}_{21}+{x}_{32}{y}_{22}& {x}_{22}{y}_{11}+{x}_{23}{y}_{12}+{x}_{32}{y}_{21}+{x}_{33}{y}_{22}\end{array}} $$
(2)

The output of the convolutional layer l (\( {O}_{mn}^l \)) can be calculated as Eq. (3), where sigmoid() is the activation function.

$$ {O}_{mn}^l=\mathrm{F}\left({net}_{mn}^l\right)=\mathrm{sigmoid}\left({net}_{mn}^l\right)=\frac{1}{1+{e}^{-{net}_{mn}^l}} $$
(3)
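To make Eqs. (1)-(3) concrete, a minimal NumPy sketch of the convolutional forward pass is given below. It is an illustration rather than the authors' implementation; the bias is added once per output position here, and the indices m and n start at 0.

```python
import numpy as np

def convolution(O_prev, w, m, n):
    """Valid convolution at output position (m, n), as in Eq. (1); the bias is added by the caller."""
    size = w.shape[0]
    return np.sum(O_prev[m:m + size, n:n + size] * w)

def conv_forward(O_prev, w, b):
    """One convolutional layer: Eq. (1) followed by the sigmoid activation of Eq. (3)."""
    size = w.shape[0]
    out_h = O_prev.shape[0] - size + 1
    out_w = O_prev.shape[1] - size + 1
    net = np.zeros((out_h, out_w))
    for m in range(out_h):
        for n in range(out_w):
            net[m, n] = convolution(O_prev, w, m, n) + b
    O = 1.0 / (1.0 + np.exp(-net))   # sigmoid, Eq. (3)
    return net, O

# Reproducing the 3x3 / 2x2 example of Eq. (2) with numeric values:
x = np.arange(1.0, 10.0).reshape(3, 3)      # plays the role of x_11 ... x_33
y = np.array([[1.0, 0.0], [0.0, 1.0]])      # plays the role of y_11 ... y_22
net, O = conv_forward(x, y, b=0.0)
print(net)                                  # the 2x2 matrix of Eq. (2)
```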

Forward propagation of pooling layer

Definition 4: The function pool(x) represents the average pooling of the matrix x. The formula of pool(x) can be expressed as Eq. (4). size^l stands for the size of the pooling window.

$$ {y}_{ij}=\mathrm{pool}\left(x,i,j\right)=\frac{\sum_{m=1}^{size^l}{\sum}_{n=1}^{size^l}{x}_{size^l\times \left(i-1\right)+m,{size}^l\times \left(j-1\right)+n}^{l-1}}{size^l\times {size}^l} $$
(4)

An example of average pooling is provided: if \( x=\begin{pmatrix}{x}_{11}& {x}_{12}& {x}_{13}& {x}_{14}\\ {x}_{21}& {x}_{22}& {x}_{23}& {x}_{24}\\ {x}_{31}& {x}_{32}& {x}_{33}& {x}_{34}\\ {x}_{41}& {x}_{42}& {x}_{43}& {x}_{44}\end{pmatrix} \), the pooling result is calculated as Eq. (5).

$$ \mathrm{pool}(x)={\displaystyle \begin{array}{cc}\frac{x_{11}+{x}_{12}+{x}_{21}+{x}_{22}}{4}& \frac{x_{13}+{x}_{14}+{x}_{23}+{x}_{24}}{4}\\ {}\frac{x_{31}+{x}_{32}+{x}_{41}+{x}_{42}}{4}& \frac{x_{33}+{x}_{34}+{x}_{43}+{x}_{44}}{4}\end{array}} $$
(5)

According to Eq. (4), the output of the pooling layer O^l is calculated directly from the output of the previous layer (O^{l−1}). In other words, the input of the pooling layer l (net^l) is the same as the output of the previous layer (O^{l−1}).
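As an illustration of Eqs. (4)-(5), a small NumPy sketch of average pooling is shown below (the function name pool is assumed, following Definition 4).

```python
import numpy as np

def pool(x, size):
    """Average pooling with a size x size window, as in Eqs. (4)-(5)."""
    out_rows, out_cols = x.shape[0] // size, x.shape[1] // size
    y = np.zeros((out_rows, out_cols))
    for i in range(out_rows):
        for j in range(out_cols):
            # block average over the (i, j)-th pooling window
            y[i, j] = x[i * size:(i + 1) * size, j * size:(j + 1) * size].mean()
    return y

x = np.arange(1.0, 17.0).reshape(4, 4)   # a 4x4 input as in the example above
print(pool(x, 2))                        # 2x2 matrix of block averages, matching Eq. (5)
```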

Forward propagation of fully-connected layer

The total number of neurons in the first layer of FCNN (size^{−2} × 1) is equal to the number of neurons in the last layer of CPNN (size^{−3} × size^{−3}), which can be expressed as size^{−2} × 1 = size^{−3} × size^{−3}. The output of the first layer of FCNN (\( {O}_i^{-2} \)) is transformed from the output of the last layer of CPNN (\( {O}_{mn}^{-3} \)). The transform relation between \( {O}_i^{-2} \) and \( {O}_{mn}^{-3} \) can be expressed as Eq. (6).

$$ {O}_i^{-2}={O}_{mn}^{-3},\kern0.5em m=\operatorname{int}\left(\frac{i}{size^{-2}}\right)+1,n=i-{size}^{-2}\times \left(m-1\right) $$
(6)

The result of forward propagation is \( {\hat{y}}_n \), which can be formulated in Eq. (7) to Eq. (9).

$$ {net}_j^{-1}={\sum}_{i=1}^{size^{-2}}\left({O}_i^{-2}\cdotp {w}_{ij}^{-1}+{b}^{-1}\right),\kern0.5em j=1,2,\dots, {size}^{-1} $$
(7)
$$ {O}_j^{-1}=\mathrm{F}\left({net}_j^{-1}\right)=\mathrm{sigmoid}\left({net}_j^{-1}\right)=\frac{1}{1+{e}^{-{net}_j^{-1}}} $$
(8)
$$ {\hat{y}}_n={O}^{-1} $$
(9)
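A compact sketch of the fully connected forward pass of Eqs. (6)-(9) is given below; it is an illustration under the stated definitions, not the authors' code, and the bias is added once per output neuron.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fcnn_forward(O_pool, w, b):
    """Flatten the last pooling output (Eq. (6)) and apply the fully connected
    output layer with sigmoid activation (Eqs. (7)-(9))."""
    O_in = O_pool.flatten()      # row-major flattening, corresponding to Eq. (6)
    net = O_in @ w + b           # Eq. (7); w has shape (size^-2, size^-1)
    return sigmoid(net)          # Eqs. (8)-(9): this vector is y_hat
```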

Back propagation of CNN

There are three kinds of back propagation (BP) in CNN algorithm: BP in fully-connected layer, BP in pooling layer and BP in convolutional layer.

Definition 5: δl is defined as the input error of layer l.

Definition 6: L is the mean square error (MSE) over all samples, which is formulated as Eq. (10). \( {\hat{y}}_n \) is the prediction for x_n and y_n is the label of x_n; the closer \( {\hat{y}}_n \) is to y_n, the better the training effect. If every \( {\hat{y}}_n \) is very close to y_n, the value of L is very small, which means the trained model fits the data well.

$$ L=\frac{1}{2}\sum \limits_{n=1}^N{\left({\hat{y}}_n-{y}_n\right)}^2 $$
(10)

Back propagation of fully-connected layer

According to the above definition, \( {\updelta}_j^{-1} \) is defined as the input error of the last layer of FCNN, which is formulated as Eq. (11). y_n is the label of the training or testing sample, \( {\hat{y}}_n \) is the predicted result of the sample, n is the index of the sample and N is the total number of samples.

$$ {\updelta}_j^{-1}=\frac{\partial L}{\partial {net}_j^{-1}}=\frac{1}{N}\sum \limits_{n=1}^N\left({\hat{y}}_n-{y}_n\right)\left(1-{net}_j^{-1}\right){net}_j^{-1} $$
(11)

\( {\delta}_i^{-2} \) is defined as the input error of the first layer of FCNN. The size of \( {\delta}_i^{-2} \) is size^{−2} × 1. \( {\delta}_{mn}^{-3} \) is defined as the back-propagated error of the layer before FCNN (the last layer of CPNN). The size of \( {\delta}_{mn}^{-3} \) is size^{−3} × size^{−3}. The transform relation between \( {\delta}_i^{-2} \) and \( {\delta}_{mn}^{-3} \) can be expressed as Eq. (12):

$$ {\delta}_{mn}^{-3}={\delta}_i^{-2},\kern0.5em i={size}^{-2}\times \left(m-1\right)+n $$
(12)

The error back propagation from the first layer of FCNN to the last pooling layer in CPNN is expressed as Eq. (13):

$$ {\delta}_i^{-2}=\frac{\partial L}{\partial {net}_i^{-2}}=\frac{\partial L}{\partial {O}_i^{-2}}=\frac{\partial L}{\partial {net}_j^{-1}}\cdotp \frac{\partial {net}_j^{-1}}{\partial {O}_i^{-2}}=\sum \limits_{j=1}^{size^{-1}}{\delta}_j^{-1}\cdotp {w}_{ij}^{-1},\kern0.5em i=1,2,\dots, {size}^{-2} $$
(13)

Backpropagation of pooling layer

If the layer l is a convolutional layer, the layer l + 1 is a pooling layer. The functions used in this back propagation are expressed in Eq. (14) to Eq. (16):

Definition 7: padding(x): the matrix x is expanded with zeros around its border, as in Eq. (14):

$$ x=\begin{pmatrix}{x}_{11}& {x}_{12}\\ {x}_{21}& {x}_{22}\end{pmatrix},\kern0.5em padding(x)=\begin{pmatrix}0& 0& 0& 0\\ 0& {x}_{11}& {x}_{12}& 0\\ 0& {x}_{21}& {x}_{22}& 0\\ 0& 0& 0& 0\end{pmatrix} $$
(14)

Definition 8: rotate(x): the matrix x is rotated by 180 degrees, as in Eq. (15):

$$ x={\displaystyle \begin{array}{cc}{x}_{11}& {x}_{12}\\ {}{x}_{21}& {x}_{22}\end{array}},\kern0.5em rotate(x)={\displaystyle \begin{array}{cc}{x}_{22}& {x}_{21}\\ {}{x}_{12}& {x}_{11}\end{array}} $$
(15)
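The two helper functions of Definitions 7 and 8 can be sketched in NumPy as follows; note that for the full convolution of Eq. (16) the zero border generally has width size^l − 1, while the example of Eq. (14) uses a width of 1.

```python
import numpy as np

def padding(x, pad=1):
    """Zero-pad the matrix on all sides, as in Eq. (14)."""
    return np.pad(x, pad_width=pad, mode='constant', constant_values=0.0)

def rotate(x):
    """Rotate the matrix by 180 degrees, as in Eq. (15)."""
    return np.rot90(x, 2)        # equivalent to x[::-1, ::-1]

x = np.array([[1.0, 2.0], [3.0, 4.0]])
print(padding(x))                # 4x4 matrix with x in the centre, matching Eq. (14)
print(rotate(x))                 # [[4, 3], [2, 1]], matching Eq. (15)
```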

The input error of convolutional layer is calculated by Eq. 16:

$$ {\delta}_{mn}^l=\frac{\partial L}{\partial {net}_{mn}^l}=\frac{\partial L}{\partial {net}_{mn}^{l+1}}\cdotp \frac{\partial {net}_{mn}^{l+1}}{\partial {O}_{mn}^l}\cdotp \frac{\partial {O}_{mn}^l}{\partial {net}_{mn}^l}= convolution\left( padding\left({\delta}^{l+1}\right), rotate\left({w}^l\right)\right) $$
(16)

Backpropagation of convolutional layer

Definition 9: poolExpand(x): the output of the pooling layer is expanded back to the size of the pooling layer's input. For example, the matrix x_{uv} (output of the pooling layer) is expanded to the matrix y_{mn} (the size of the input of the pooling layer) according to the function poolExpand(x), which is expressed as Eq. (17). size^l is the size of the pooling window.

$$ {y}_{mn}= poolExpand\left({x}_{uv}\right)=\frac{1}{size^l\times {size}^l}\cdotp {x}_{uv},\kern0.5em u=\operatorname{int}\left(\frac{m-1}{size^l}\right)+1,\kern0.5em v=\operatorname{int}\left(\frac{n-1}{size^l}\right)+1 $$
(17)

For example, if \( x={\displaystyle \begin{array}{cc}{x}_{11}& {x}_{12}\\ {}{x}_{21}& {x}_{22}\end{array}} \), the result of poolExpand(x) is calculated as Eq. (18):

$$ \mathrm{poolExpand}(x)=\begin{pmatrix}\frac{x_{11}}{4}& \frac{x_{11}}{4}& \frac{x_{12}}{4}& \frac{x_{12}}{4}\\ \frac{x_{11}}{4}& \frac{x_{11}}{4}& \frac{x_{12}}{4}& \frac{x_{12}}{4}\\ \frac{x_{21}}{4}& \frac{x_{21}}{4}& \frac{x_{22}}{4}& \frac{x_{22}}{4}\\ \frac{x_{21}}{4}& \frac{x_{21}}{4}& \frac{x_{22}}{4}& \frac{x_{22}}{4}\end{pmatrix} $$
(18)
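A NumPy sketch of poolExpand for average pooling is shown below; it reproduces the expansion of Eq. (18).

```python
import numpy as np

def pool_expand(x, size):
    """Spread each entry of x over a size x size block and divide by the window
    area, as in Eqs. (17)-(18) for average pooling."""
    return np.kron(x, np.ones((size, size))) / (size * size)

x = np.array([[1.0, 2.0], [3.0, 4.0]])
print(pool_expand(x, 2))   # each x_uv becomes a 2x2 block of x_uv / 4, as in Eq. (18)
```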

If the layer l is a pooling layer, then the layer (l + 1) is a convolutional layer, and the input error of the pooling layer can be calculated as Eq. (19):

$$ {\displaystyle \begin{array}{c}{\delta}_{mn}^l=\frac{\partial L}{\partial {net}_{mn}^l}=\frac{\partial L}{\partial {net}_{mn}^{l+1}}\cdotp \frac{\partial {net}_{mn}^{l+1}}{\partial {O}_{mn}^l}\cdotp \frac{\partial {O}_{mn}^l}{\partial {net}_{mn}^l}=\mathrm{poolExpand}\left({\delta}^{l+1}\right)\cdotp \frac{\partial {net}_{mn}^{l+1}}{\partial {O}_{mn}^l}\cdotp {F}^{\prime}\left({net}_{mn}^l\right)\\ {}=\mathrm{poolExpand}\left({\delta}^{l+1}\right)\cdotp 1\cdotp \left(1-{net}_{mn}^l\right)\cdotp {net}_{mn}^l=\mathrm{poolExpand}\left({\delta}^{l+1}\right)\cdotp \left(1-{net}_{mn}^l\right)\cdotp {net}_{mn}^l\end{array}} $$
(19)

Adjustment of weights and parameters of CNN

The changes of the weights and biases can be calculated as Eq. (20) to Eq. (23):

$$ \Delta {w}^l=\frac{\partial L}{\partial {w}^l}=\frac{\partial L}{\partial {net}^l}\times \frac{\partial {net}^l}{\partial {w}^l}={\delta}^l\cdotp {O}^{l-1} $$
(20)
$$ \Delta {b}^l=\frac{\partial L}{\partial {b}^l}=\frac{\partial L}{\partial {net}^l}\times \frac{\partial {net}^l}{\partial {b}^l}={\delta}^l $$
(21)
$$ {\Delta w}_{ij}^{-1}=\frac{\partial L}{{\partial w}_{ij}^{-1}}=\frac{\partial L}{\partial {net}_j^{-1}}\times \frac{\partial {net}_j^{-1}}{{\partial w}_{ij}^{-1}}={\updelta}_j^{-1}\cdotp {O}_i^{-2} $$
(22)
$$ \Delta {b}^{-1}=\frac{\partial L}{\partial {b}_j^{-1}}=\frac{\partial L}{\partial {net}_j^{-1}}\times \frac{\partial {net}_j^{-1}}{\partial {b}_j^{-1}}=\frac{1}{size^{-1}}\sum \limits_{j=1}^{size^{-1}}{\updelta}_j^{-1} $$
(23)

The updated values of the weights and biases can be calculated as Eq. (24) to Eq. (27), where η_CPNN is the learning rate of CNN:

$$ {w}^l\left(t+1\right)={w}^l(t)-\eta \_ CPNN\times \Delta {w}^l $$
(24)
$$ {b}^l\left(t+1\right)={b}^l(t)-\eta \_ CPNN\times \Delta {b}^l $$
(25)
$$ {w}_{ij}^{-1}\left(t+1\right)={w}_{ij}^{-1}(t)-\eta \_ CPNN\times {\Delta w}_{ij}^{-1} $$
(26)
$$ {b}^{-1}\left(t+1\right)={b}^{-1}(t)-\eta \_ CPNN\times \Delta {b}^{-1} $$
(27)
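The update step of Eqs. (20)-(27) is plain gradient descent. A schematic sketch is given below, with params and grads as assumed dictionaries of NumPy arrays keyed by parameter name.

```python
def sgd_step(params, grads, eta_CPNN):
    """Apply the updates of Eqs. (24)-(27): parameter <- parameter - learning_rate * gradient."""
    for name in params:
        params[name] = params[name] - eta_CPNN * grads[name]
    return params
```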

Pseudocode of CNN

The training set contains features and labels, which are learned by the CNN model. The weights and biases are adjusted during the training process.

(1) Definition of Adjustment cycle and Simulation process

Definition 10: Adjustment cycle (AC): In each AC, all the weights and biases are adjusted one time according to Eq. (20) to Eq. (27).

Definition 11: Simulation process (SP): each SP is a complete training process. One SP contains many consecutive ACs, starting from the first AC (e.g. t = 1) and ending at the last AC (e.g. t = 6000).

The relationship between an SP and an AC is that one SP is composed of n ACs.

(2) Training algorithm of CNN

The effect of each SP is measured by the loss function, which is expressed as Eq. (28):

$$ \boldsymbol{L}(t)=\frac{1}{2}\sum \limits_{n=1}^N{\left( train\_p(n)- train\_y(n)\right)}^2 $$
(28)

In Eq. (28), n is the index of the training sample and N is the total number of training samples. train_p(n) is the result of forward propagation for the nth training sample, also written \( {\hat{y}}_n \); train_y(n) is the label of the nth training sample, also written y_n.

The pseudocode of CNN is listed in Algorithm 1.

Algorithm 1
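Algorithm 1 is given as a figure in the original paper. For orientation, a schematic Python sketch of the training loop it describes is shown below; the model object with forward, backward and update methods is a hypothetical interface, not the authors' code.

```python
import numpy as np

def train_one_sp(model, train_x, train_y, max_ACs=6000, batch_size=10,
                 eta=0.1, target_err=1e-7):
    """One simulation process (SP): max_ACs adjustment cycles (ACs) of
    mini-batch training with the MSE loss of Eq. (28) and early stopping."""
    errors = []
    for t in range(max_ACs):                                  # one AC per iteration
        idx = np.random.choice(len(train_x), batch_size, replace=False)
        y_hat = model.forward(train_x[idx])                   # forward propagation
        err = 0.5 * np.sum((y_hat - train_y[idx]) ** 2)       # loss of Eq. (28)
        errors.append(err)
        model.backward(train_y[idx])                          # back propagation of errors
        model.update(eta)                                     # weight/bias updates, Eqs. (24)-(27)
        if err < target_err:                                  # stop early if the target error is reached
            break
    return errors
```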

3 Wavelet transform

Wavelet transform (WT) is an ideal method for processing the details of signals. WT provides a time-frequency window which can capture details of a signal at higher and lower resolutions. The problem of the Fourier transform (FT) [25] is that the window size cannot be changed when the frequency changes; this problem is solved by WT. The wavelet transform ψ(a, b) of a signal f(t) with respect to a generating (mother) wavelet φ can be expressed as \( \psi \left(a,b\right)=\frac{1}{\sqrt{a}}{\int}_{-\infty}^{\infty }f(t)\,\varphi \left(\frac{t-b}{a}\right)\, dt \), where a and b are the scale and translation parameters which control the dilation and translation of the wavelet.

The wavelet generating function φ is designed according to the following conditions

(1) The function value is non-zero only in a very small domain and zero elsewhere; in other words, translating the wavelet along the time axis is equivalent to adding a window to the original signal. (2) The integral of the function over the x axis must be 0. (3) The transform must be reversible. There are many wavelet generating functions, such as: (1) the Haar wavelet, (2) the db wavelets [31], (3) the sym wavelets [15], (4) the coif series wavelets, etc. The wavelet function used in this study is \( \varphi (t)=\cos (1.75t)\cdot {e}^{-\frac{t^2}{2}} \).
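For illustration, the wavelet used here and its scaled, translated form can be written in a few lines of Python (a sketch only; the variable names a and b follow the scale and translation parameters above).

```python
import numpy as np

def wavelet(t):
    """Generating function used in this study: cos(1.75 t) * exp(-t^2 / 2)."""
    return np.cos(1.75 * t) * np.exp(-t ** 2 / 2.0)

def scaled_wavelet(t, a, b):
    """Scaled and translated wavelet: (1 / sqrt(a)) * phi((t - b) / a)."""
    return wavelet((t - b) / a) / np.sqrt(a)

t = np.linspace(-5.0, 5.0, 1001)
psi = scaled_wavelet(t, a=2.0, b=1.0)   # e.g. scale a = 2, translation b = 1
```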

The processes of wavelet transform are visualized as follows

In Fig. 2, the error is the difference between the signal reconstructed by the inverse wavelet transform and the original signal. The scale is the parameter of the wavelet function that controls its dilation. When the scale parameter is changed, the ability of the wavelet transform to extract information from the original signal changes.

Fig. 2 Different effects of WT. a scale = 20, error = 2.08. b scale = 100, error = 0.044. c scale = 200, error = 0.00045

In summary, by adjusting the scale and translation, the wavelet transform can learn different features, so richer features can be learned by adding the wavelet transform into the CNN.

4 Model of wavelet convolutional neural network (wCNN)

4.1 Structure of wCNN

The improvement of the proposed wCNN is that the activation function F() of the convolutional layer in CNN is replaced by Ψ(). The F() of CNN is the sigmoid function, while the Ψ() of wCNN is the wavelet scale transformation function.

The structure of the proposed wCNN is as follows: the first part of wCNN is the wavelet Convolutional Pooling Neural Network (wCPNN), and the second part is the Fully Connected Neural Network (FCNN). The structure of wCNN is shown in Fig. 3.

Fig. 3 Structure of wCNN

4.2 Algorithm of wCNN

The difference between wCNN and CNN is the activation function of convolution layer.

The training algorithm of wCNN also has three steps: (1) forward propagation of wCNN; (2) back propagation of wCNN; (3) weight and bias adjustment of wCNN.

4.2.1 Forward propagation of wCNN

The forward propagation of wCNN is the same as that of CNN. The input of wCNN is the features of the training samples. The output of wCNN is calculated from the first layer (input) of wCNN through the convolutional layers, pooling layers and fully connected layers.

(1) Forward propagation of convolution layer

If the layer l is the convolutional layer, the input of this layer (\( {net}_{mn}^l \)) is calculated by Eq. (29):

$$ {net}_{mn}^l=\mathrm{convolution}\left({O}^{l-1},{w}^l\right)+{b}^l=\sum \limits_{i=0}^{size^l-1}\sum \limits_{j=0}^{size^l-1}\left({O}_{m+i,n+j}^{l-1}\cdotp {w}_{i,j}^l\right) $$
(29)

The output of this layer (\( {O}_{mn}^l \)) is calculated by Eq. (30), where ac^l and bc^l are the parameters of the scale transformation in the activation function:

$$ {O}_{mn}^l=\Psi \left(\frac{net_{mn}^l-{ac}^l}{bc^l}\right) $$
(30)

In the convolutional layer of wCNN, the activation function Ψwc(x) is expressed as Eq. (31):

$$ {\Psi}_{wc}(x)=\mathit{\cos}(1.75x)\cdotp {e}^{-\frac{x^2}{2}} $$
(31)

(2) Forward propagation of the pooling layer and the fully connected layer

Forward propagation in the pooling layer and the fully connected layer of wCNN is the same as in CNN, as shown in Eq. (4) and Eq. (7).

4.2.2 Back propagation of wCNN

The predicted values \( {\hat{y}}_n \) of the training samples can be calculated by forward propagation of wCNN, then the MSE (mean square error) of all the training samples can be calculated by loss function as Eq. (10).

Back propagation of the error is necessary for adjusting the weights and biases, and it is calculated in the fully connected layer, the pooling layer and the convolutional layer. The back propagation in the fully connected layer and the convolutional layer is the same as in CNN, while the back propagation in the pooling layer is different:

If the layer l is a pooling layer, the layer (l + 1) is a convolutional layer, and the error of the input of the pooling layer is expressed as Eq. (32):

$$ {\delta}_{mn}^l=\frac{\partial L}{\partial {net}_{mn}^l}=\frac{\partial L}{\partial {net}_{mn}^{l+1}}\cdotp \frac{\partial {net}_{mn}^{l+1}}{\partial {O}_{mn}^l}\cdotp \frac{\partial {O}_{mn}^l}{\partial {net}_{mn}^l}=\frac{1}{bc^l}\mathrm{poolExpand}\left({\delta}_{uv}^{l+1}\right)\cdotp \Psi^{\prime}\left(\frac{net_{mn}^l-{ac}^l}{bc^l}\right) $$
(32)
$$ {\Psi}^{\prime }(x)=-1.75\cdotp \mathit{\sin}(1.75x)\cdotp {e}^{-\frac{x^2}{2}}-x\cdotp \mathit{\cos}(1.75x)\cdotp {e}^{-\frac{x^2}{2}} $$
(33)
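A small sketch of the wavelet activation of Eqs. (30)-(31) and its derivative of Eq. (33), as used in the wCNN convolutional layer, is given below (illustrative only; the variable names ac and bc follow the definitions above).

```python
import numpy as np

def psi_wc(x):
    """Wavelet activation of Eq. (31)."""
    return np.cos(1.75 * x) * np.exp(-x ** 2 / 2.0)

def psi_wc_prime(x):
    """Derivative of the wavelet activation, Eq. (33)."""
    return (-1.75 * np.sin(1.75 * x) - x * np.cos(1.75 * x)) * np.exp(-x ** 2 / 2.0)

def wconv_output(net, ac, bc):
    """Output of the wavelet convolutional layer, Eq. (30): Psi((net - ac) / bc)."""
    return psi_wc((net - ac) / bc)
```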

The gradient descent method is applied to calculate the changes of the weights and parameters w^l, ac^l and bc^l, which can be expressed as Eq. (34) to Eq. (36):

$$ \Delta {w}^l=\frac{\partial L}{\partial {w}^l}=\frac{\partial L}{\partial {net}^l}\times \frac{\partial {net}^l}{\partial {w}^l}={\delta}^l\cdotp {O}^{l-1} $$
(34)
$$ \Delta {ac}^l=\frac{\partial L}{\partial {ac}^l}=\frac{\partial L}{\partial {net}^l}\times \frac{\partial {net}^l}{\partial {ac}^l}=-{\delta}^l\cdotp \frac{1}{bc^l} $$
(35)
$$ \Delta {bc}^l=\frac{\partial L}{\partial {bc}^l}=\frac{\partial L}{\partial {net}^l}\times \frac{\partial {net}^l}{\partial {bc}^l}=-\frac{1}{{bc^l}^2}\cdotp {\delta}^l\cdotp \left({net}_{mn}^l-{ac}^l\right) $$
(36)

The adjusted results, such as \( {w}_{ij}^{-1} \) and \( {b}_j^{-1} \), can be expressed as Eq. (37) to Eq. (41), where η_wCPNN is the learning rate of wCNN:

$$ {w}^l\left(t+1\right)={w}^l(t)-\eta \_ wCPNN\times \Delta {w}^l $$
(37)
$$ {bc}^l\left(t+1\right)={bc}^l(t)-\eta \_ wCPNN\times \Delta {bc}^l $$
(38)
$$ {ac}^l\left(t+1\right)={ac}^l(t)-\eta \_ wCPNN\times \Delta {ac}^l $$
(39)
$$ {w}_{ij}^{-1}\left(t+1\right)={w}_{ij}^{-1}(t)-\eta \_ wCPNN\times {\Delta w}_{ij}^{-1} $$
(40)
$$ {b}_j^{-1}\left(t+1\right)={b}_j^{-1}(t)-\eta \_ wCPNN\times \Delta {b}_j^{-1} $$
(41)

4.3 Pseudocode of wCNN

The training process of wCNN is similar to that of CNN, while the activation function of wCNN is different. The pseudocode of wCNN is listed in Algorithm 2.

Algorithm 2

5 Model of wavelet convolutional wavelet neural network (wCwNN)

5.1 Structure of wCwNN

Based on wCNN, the improvement of wCwNN is that the fully connected network (FCNN) is replaced by a wavelet Neural Network (wNN). The structure of wCwNN has two parts: (1) the wavelet Convolutional Pooling Neural Network (wCPNN), and (2) the wavelet Neural Network (wNN). In the convolutional layers of wCPNN and the hidden layer of wNN, all the activation functions are wavelet scale transformation functions. The structure of wCwNN is drawn in Fig. 4.

Fig. 4 Structure of wCwNN

5.2 Algorithm of wCwNN

The first part of wCwNN is the wCPNN, which is the same as in wCNN. The second part of wCwNN is the wNN, which is different from the second part of wCNN (FCNN). In this section, the forward propagation and back propagation of wNN are described in detail.

Definitions for wNN are listed as follows.

(1) The number of the last layer (output layer) of wNN is expressed as l = −1. size^{−1} is the number of neurons in this output layer.

(2) The number of the second layer (the second to last layer, the hidden layer) of wNN is expressed as l = −2. size^{−2} is the number of neurons in this hidden layer.

(3) The number of the first layer (the third to last layer, the input layer) of wNN is expressed as l = −3. size^{−3} is the number of neurons in this input layer.

(4) The number of the last layer of wCPNN, which is the layer before the input layer of wNN, is defined as l = −4. size^{−4} × size^{−4} is the number of neurons in this output layer of wCPNN.

5.2.1 Forward propagation of wCwNN

The input of the input layer in wNN (\( {O}_i^{-3} \)) comes from the output of the last layer in wCPNN (\( {O}_{mn}^{-4} \)); the two layers have the same number of neurons: size^{−3} = size^{−4} × size^{−4}.

The dimension of matrix \( {O}_{mn}^{-4} \) is m × n, and the dimension of matrix of \( {O}_i^{-3} \) is i × 1. The correspondence between \( {O}_{mn}^{-4} \) and \( {O}_i^{-3} \) can be expressed as Eq. 42:

$$ {O}_i^{-3}={O}_{mn}^{-4},m=\operatorname{int}\left(\frac{i}{size^{-3}}\right)+1,n=i-{size}^{-3}\times \left(m-1\right) $$
(42)

In the hidden layer of wNN, the input matrix is \( {net}_j^{-2} \), and the output matrix is \( {O}_j^{-2} \). \( {O}_j^{-2} \) can be calculated by Eq. (43) to Eq. (44):

$$ {net}_j^{-2}={\sum}_{i=1}^{size^{-3}}\left({O}_i^{-3}\cdotp {w}_{ij}^{-2}\right),\kern0.5em j=1,2,\dots, {size}^{-2} $$
(43)
$$ {O}_j^{-2}=\Psi \left({net}_j^{-2}\right)=\Psi \left(\frac{net_j^{-2}-{ac}^{-2}}{bc^{-2}}\right),\kern0.5em j=1,2,\dots, {size}^{-2} $$
(44)

In the output layer of wNN, the input matrix and the output matrix are \( {net}_k^{-1} \) and \( {O}_k^{-1} \) respectively, and the predicted result of wCwNN is \( {\hat{y}}_n \). \( {O}_k^{-1} \) and \( {\hat{y}}_n \) can be calculated by Eq. (45) to Eq. (47):

$$ {net}_k^{-1}={\sum}_{j=1}^{size^{-2}}\left({O}_j^{-2}\cdotp {w}_{kj}^{-1}+{b}_k^{-1}\right),\kern1.9em k=1,2,\dots, {size}^{-1} $$
(45)
$$ {O}_k^{-1}=F\left({net}_k^{-1}\right)=\mathrm{sigmoid}\left({net}_k^{-1}\right)=\frac{1}{1+{e}^{-{net}_k^{-1}}},\kern2.5em k=1,2,\dots, {size}^{-1} $$
(46)
$$ {\hat{y}}_n={O}^{-1} $$
(47)
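A minimal sketch of the wNN forward pass of Eqs. (42)-(47) is given below (illustrative only; as in the equations, the hidden wavelet layer has no additive bias and the output layer uses the sigmoid).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def psi_wc(x):
    return np.cos(1.75 * x) * np.exp(-x ** 2 / 2.0)

def wnn_forward(O_in, w_hidden, ac, bc, w_out, b_out):
    """Forward pass of the wNN part of wCwNN: a wavelet-activated hidden layer
    followed by a sigmoid output layer (Eqs. (43)-(47))."""
    net_hidden = O_in @ w_hidden                  # Eq. (43)
    O_hidden = psi_wc((net_hidden - ac) / bc)     # Eq. (44)
    net_out = O_hidden @ w_out + b_out            # Eq. (45)
    return sigmoid(net_out)                       # Eqs. (46)-(47): y_hat
```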

5.2.2 Back propagation of wCwNN

In wNN, the back-propagated input errors (\( {\delta}_k^{-1} \) in the output layer, \( {\delta}_j^{-2} \) in the hidden layer and \( {\delta}_i^{-3} \) in the input layer) can be calculated as Eq. (48) to Eq. (50).

$$ {\updelta}_k^{-1}=\frac{\partial L}{\partial {net}_k^{-1}}=\frac{1}{N}\sum \limits_{n=1}^N\left({\hat{y}}_n-{y}_n\right)\left(1-{net}_k^{-1}\right){net}_k^{-1},\kern2em k=1,2,\dots, {size}^{-1} $$
(48)
$$ {\updelta}_j^{-2}=\frac{\partial L}{\partial {net}_j^{-2}}=\frac{\partial L}{\partial {net}_k^{-1}}\cdotp \frac{\partial {net}_k^{-1}}{\partial {O}_j^{-2}}\cdotp \frac{\partial {O}_j^{-2}}{\partial {net}_j^{-2}}=\frac{1}{bc^{-2}}\sum \limits_{k=1}^{size^{-1}}{\updelta}_k^{-1}\cdotp {\Psi}^{\prime}\left(\frac{net_j^{-2}-{ac}^{-2}}{bc^{-2}}\right)\cdotp {w}_{kj}^{-1},\kern2em j=1,2,\dots, {size}^{-2} $$
(49)
$$ {\updelta}_{mn}^{-3}={\updelta}_i^{-3}=\frac{\partial L}{\partial {net}_i^{-3}}=\frac{\partial L}{\partial {O}_i^{-3}}=\frac{\partial L}{\partial {net}_j^{-2}}\cdotp \frac{\partial {net}_j^{-2}}{\partial {O}_i^{-3}}=\sum \limits_{j=1}^{size^{-2}}{\updelta}_j^{-2}\cdotp {w}_{ij}^{-2},\kern2.5em i=1,2,\dots, {size}^{-3} $$
(50)

Gradient descent method is applied to adjust the weights (\( {w}_{ij}^{-2} \) and \( {w}_{kj}^{-1} \)) and bias (ac−2, bc−2 and \( {b}_k^{-1} \)) of wCwNN. The changed values of the above weights and bias can be calculated by Eq. (51) to Eq. (55).

$$ {\Delta w}_{ij}^{-2}=\frac{\partial L}{{\partial w}_{ij}^{-2}}=\frac{\partial L}{\partial {net}_j^{-2}}\times \frac{\partial {net}_j^{-2}}{{\partial w}_{ij}^{-2}}={\updelta}_j^{-2}\cdotp {O}_i^{-3} $$
(51)
$$ \Delta {ac}^{-2}=\frac{\partial L}{\partial {ac}^{-2}}=\frac{\partial L}{\partial {net}_j^{-2}}\times \frac{\partial {net}_j^{-2}}{\partial {ac}^{-2}}=-\frac{1}{size^{-2}}\sum \limits_{j=1}^{size^{-2}}{\delta}_j^{-2}\cdotp \frac{1}{bc^{-2}} $$
(52)
$$ \Delta {bc}^{-2}=\frac{\partial L}{\partial {bc}^{-2}}=\frac{\partial L}{\partial {net}_j^{-2}}\times \frac{\partial {net}_j^{-2}}{\partial {bc}^{-2}}=-\frac{1}{size^{-2}}\sum \limits_{j=1}^{size^{-2}}\frac{1}{{\left({bc}^{-2}\right)}^2}\cdotp {\delta}_j^{-2}\cdotp \left({net}_j^{-2}-{ac}^{-2}\right) $$
(53)
$$ \Delta {w}_{kj}^{-1}=\frac{\partial L}{{\partial w}_{kj}^{-1}}=\frac{\partial L}{\partial {net}_k^{-1}}\times \frac{\partial {net}_k^{-1}}{\partial {w}_{kj}^{-1}}={\updelta}_k^{-1}\cdotp {O}_j^{-2} $$
(54)
$$ \Delta {b}_k^{-1}=\frac{\partial L}{\partial {b}_k^{-1}}=\frac{\partial L}{\partial {net}_k^{-1}}\times \frac{\partial {net}_k^{-1}}{\partial {b}_k^{-1}}={\updelta}_k^{-1} $$
(55)

The adjusted values of the above weights and biases are expressed as Eq. (56) to Eq. (60), where α_wNN, η_wNN and η_wCPNN are the inertia coefficient of wNN, the learning rate of wNN and the learning rate of wCPNN, respectively.

$$ {w}_{ij}^{-2}\left(t+1\right)={w}_{ij}^{-2}(t)-{\Delta w}_{ij}^{-2}\times \eta \_ wNN+{w}_{ij}^{-2}(t)\times \alpha \_ wNN $$
(56)
$$ {ac}^{-2}\left(t+1\right)={ac}^{-2}(t)-{\Delta ac}^{-2}\times \eta \_ wNN+{ac}^{-2}(t)\times \alpha \_ wNN $$
(57)
$$ {bc}^{-2}\left(t+1\right)={bc}^{-2}(t)-\Delta {bc}^{-2}\times \eta \_ wNN+{bc}^{-2}(t)\times \alpha \_ wNN $$
(58)
$$ {w}_{kj}^{-1}\left(t+1\right)={w}_{kj}^{-1}(t)-\Delta {w}_{kj}^{-1}\times \eta \_ wCPNN $$
(59)
$$ {b}_k^{-1}\left(t+1\right)={b}_k^{-1}(t)-\Delta {b}_k^{-1}\times \eta \_ wCPNN $$
(60)

5.2.3 Pseudocode of wCwNN

The training algorithm of wCwNN is similar to that of wCNN; the difference is that the FCNN in wCNN is replaced by the wNN in wCwNN. The pseudocode of wCwNN is listed in Algorithm 3.

Algorithm 3

6 Experiment

The objectives of the experiment are as follows: (1) verify the feasibility of each algorithm (convergence); (2) improve the precision of the three algorithms (reduce the minimum mean square error); (3) improve the accuracy rate (reduce the error rate); (4) analyze the efficiency of the algorithms; (5) find an algorithm with greater classification capability.

The contents of the experiment are as follows: (1) record the error of the three algorithms in every AC and plot the time-error curve; (2) calculate the mean square error of all test samples after training (precision); (3) calculate the error rate (accuracy rate) on the test samples after training; (4) record the time of the training process; (5) analyze the results of each algorithm; (6) analyze the results of each experiment.

6.1 Dataset introduction

The MNIST and CIFAR-10 datasets are adopted for the comparative experiments. MNIST is a well-known dataset from the National Institute of Standards and Technology. The MNIST database of handwritten digits has a training set of 60,000 examples and a test set of 10,000 examples. The training set and test set are shown in Fig. 5 below:

Fig. 5 Data Set of MNIST. a Training set. b Test set

CIFAR-10 is an open dataset containing 60,000 images with a resolution of 32 × 32. The images are divided into 10 categories, and each category contains 6,000 images. There are 50,000 images for training and 10,000 images for testing. The dataset is shown in Fig. 6 below. In this experiment, two categories of images are selected.

Fig. 6 Data Set of CIFAR-10

6.2 Experiment design of CNN

For the following comparative experiments, the structure of CNN is designed as follows: the first layer of CNN is an input layer; the 2nd to 5th layers are two pairs of convolutional-pooling layers; the 6th to 7th layers are fully connected (FCNN).

Firstly, the parameters of the structure of CNN are set as follows (a framework-level sketch of this architecture is given after the list):

(1) Input layer of CNN: The size is 28 × 28.

(2) The first convolutional layer: the input size is 28 × 28. The size of the convolutional kernel is set to 5 × 5. The output size is 24 × 24. The number of output feature maps is set to 6. The activation function is sigmoid.

(3) The first pooling layer: the input size is set to 24 × 24. The size of the pooling window is 2 × 2. The output size is 12 × 12. The number of output feature maps is 6.

(4) The second convolutional layer: the input size is 12 × 12. The convolutional kernel size is 5 × 5. The output size is 8 × 8. The activation function is sigmoid. The number of output feature maps is 12.

(5) The second pooling layer: the input size is 8 × 8. The size of the pooling window is 2 × 2. The output size is 4 × 4. The number of output feature maps is 12.

(6) Input layer of FCNN: The size is 192, which is equal to 4 × 4 × 12.

(7) Hidden layer of FCNN: The size is 10, and the activation function is sigmoid.

(8) Output layer of FCNN: The size is 10, which can represent 10 different classes. For example, if the predicted result is class 1, the output is [1,0,0,0,0,0,0,0,0,0], and if the predicted result is class 2, the output is [0,1,0,0,0,0,0,0,0,0].
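As referenced above, a sketch of a network matching these structural parameters is shown below in PyTorch (an assumed framework; the paper does not specify an implementation, and the FCNN is read here as a single sigmoid layer with 10 outputs, following Section 2.1).

```python
import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    """A sketch matching items (1)-(8) above: two sigmoid conv layers with
    average pooling, followed by a fully connected sigmoid output layer."""
    def __init__(self):
        super().__init__()
        self.cpnn = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),    # 28x28 -> 24x24, 6 feature maps
            nn.Sigmoid(),
            nn.AvgPool2d(2),                   # 24x24 -> 12x12
            nn.Conv2d(6, 12, kernel_size=5),   # 12x12 -> 8x8, 12 feature maps
            nn.Sigmoid(),
            nn.AvgPool2d(2),                   # 8x8 -> 4x4
        )
        self.fcnn = nn.Sequential(
            nn.Flatten(),                      # 4 x 4 x 12 = 192 inputs to the FCNN
            nn.Linear(192, 10),                # 10 outputs, one-hot classes as in item (8)
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.fcnn(self.cpnn(x))

model = SimpleCNN()
print(model(torch.zeros(1, 1, 28, 28)).shape)   # torch.Size([1, 10])
```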

Secondly, parameters for simulations are configured as follows:

(1) Learning rate: η = 0.01 or η = 0.1.

(2) Total number of training processes (SPs): max_SPs = 10.

(3) Total number of adjustment cycles (ACs): max_ACs = 6000.

(4) Target error: target_err = 0.0000001. In each SP, the current error is calculated as current_err = L(), where L() is the loss function (MSE in this study). When current_err is smaller than target_err, the SP is stopped even if the current AC is smaller than max_ACs. In each SP the current error is continuously reduced; therefore, in order to obtain a smaller error, target_err is set to a very small value that cannot be reached within max_ACs in each SP.

(5) Total number of training samples taken in each AC: BatchSize = 10.

6.3 Experiment design of wCNN and wCwNN

The configurations of the network structures and simulation parameters of wCNN and wCwNN are listed in Table 1. There are two groups of experiments, with learning rates η = 0.1 and η = 0.01 respectively.

The comparative results will be recorded as follows: (1) feasibility (whether the algorithm converges), (2) minimum MSE (precision), (3) correct rate (accuracy) on all the test data, (4) running time (efficiency). The structure and experimental parameter configurations of wCNN and wCwNN are shown in Table 1.

Table 1 Configurations of CNN, wCNN and wCwNN experiments

Table 1 shows that the most significant differences are: (1) the activation function of the CPNN in wCNN and wCwNN is the wavelet function, while the activation function of CNN is the sigmoid function; (2) the second part of the network (after the CPNN) of wCwNN is the wavelet neural network (wNN), while the second part of CNN and wCNN is the FCNN.

Table 2 Experiment results of CNN

7 Results

7.1 Results of the experiment of CNN

The learning rate is set to η = 0.01 and η = 0.1 respectively. The results are recorded as follows: (1) the current error (MSE) of each AC in each SP; (2) the error rate on all test samples after each SP; (3) the time spent in each SP. The above results are recorded in Table 2:

In the experiment of CNN, each SP has 6000 ACs. All the errors of the 6000 ACs are recorded and drawn in Fig. 7. The error values are drawn as the orange line and fitted by linear regression to a blue line, which indicates the downward trend of the orange line. Figure 7a shows the result of 1 SP when η = 0.01, Fig. 7b the result of 1 SP when η = 0.1, Fig. 7c the result of 10 SPs when η = 0.01 and Fig. 7d the result of 10 SPs when η = 0.1.

Fig. 7 MSE plot of CNN. a Result of 1 SP when η = 0.01. b Result of 1 SP when η = 0.1. c Result of 10 SPs when η = 0.01. d Result of 10 SPs when η = 0.1

7.2 Results of the experiments of wCNN

In the experiment of wCNN, the learning rate is set to η = 0.01 and η = 0.1 respectively. Each SP contains 6000 ACs. All the MSE values in each AC are recorded and shown in Fig. 8:

Fig. 8 MSE plot of wCNN. a Result of 1 SP when η = 0.01. b Result of 1 SP when η = 0.1. c Result of 10 SPs when η = 0.01. d Result of 10 SPs when η = 0.1

The statistical results of the above simulations are listed in Table 3, including: (1) the final MSE of each SP, (2) the error rate of each SP, and (3) the time consumed by each SP.

Table 3 Experiment results of wCNN

7.3 Results of the experiments of wCwNN

In the experiment of wCwNN, the learning rate is set to η = 0.01 and η = 0.1 respectively. Each SP contains 6000 ACs. All the MSE values in each AC are recorded and shown in Fig. 9.

Fig. 9 MSE plot of wCwNN. a Result of 1 SP when η = 0.01. b Result of 1 SP when η = 0.1. c Result of 10 SPs when η = 0.01. d Result of 10 SPs when η = 0.1

The statistical results of the above simulations are listed in Table 4, including: (1) the final MSE of each SP, (2) the error rate of each SP, and (3) the time consumed by each SP.

Table 4 Experiment results of wCwNN

8 Discussion

8.1 Discussion of results of CNN

According to the experimental results of the CNN presented in Table 2 and Fig. 7, the main findings are as follows:

(1) Classification of MNIST can be completed by CNN; the correct rate is 90.03% (the error rate is 9.97%).

(2) CNN has a good ability for image classification, and the maximum correct rate can reach 90.8% (the minimum error rate is 9.2%).

(3) When the learning rate is increased (from η = 0.01 to η = 0.1), the MSE is significantly decreased (from 0.289 to 0.094) and the error rate is significantly decreased (from 28.07% to 9.97%). The time spent in simulation (η = 0.1) is slightly increased (from 83.39 to 91.04).

(4) The descending process of the MSE is stable, and the results among the 10 SPs are close.

8.2 Discussion of the wCNN results

According to the experimental results of the wCNN presented in Table 3 and Fig. 8, the following findings can be drawn:

(1) The wCNN algorithm is convergent. Classification of MNIST can be completed by wCNN. The average correct rate is 92.78% (the average error rate is 7.22%).

(2) wCNN has a good (better than CNN) ability for image classification, and the maximum correct rate can reach 95.66% (the minimum error rate is 4.34%).

(3) When the learning rate of wCNN is increased (from η = 0.01 to η = 0.1), the MSE is significantly decreased (from 0.184 to 0.088) and the error rate is significantly decreased (from 13.91% to 7.22%). The time spent in simulation (η = 0.1) is slightly increased (from 258.53 to 268.56).

(4) The MSE reduction processes differ: the variance of the final error of wCNN is greater than that of CNN. wCNN achieves a very low error rate (4.34%), but there are also some high error rates (such as 18.71% and 18.35%) in the wCNN experiments. According to the research of Liu in 2015 [15], the wavelet network's ability to reach a smaller training MSE comes at the cost of network stability: it has the advantage of jumping out of local minima, which is not available in the classical BP network and the RBF network, but this advantage also brings the disadvantage that the error decline during learning oscillates more.

8.3 Discussion of experimental results of wCwNN

According to the experimental results of wCwNN presented in Fig. 9 and Table 4, the following conclusions can be drawn:

(1) wCwNN is convergent. It can complete the task of classification for the MNIST dataset, and the average accuracy is 96.57% (the average error rate is 3.43%).

(2) wCwNN has a good ability (better than CNN and wCNN) for image classification, and the maximum correct rate can reach 97.04% (the minimum error rate is 2.96%).

(3) When the learning rate of wCwNN is increased (from η = 0.01 to η = 0.1), the MSE is decreased (from 0.068 to 0.054) and the error rate is significantly decreased (from 5.17% to 3.43%). The time spent in simulation (η = 0.1) is slightly increased (from 357.23 to 368.81).

(4) The descending process of the MSE is stable, and the differences among the 10 SPs are not significant.

9 Conclusion

Firstly, the MNIST dataset is adopted to verify the proposed methods in this study, and CNN is implemented to finish the classification task; the correct rate of CNN is more than 90%. Secondly, wCNN is proposed, in which the activation function of the convolutional network in CNN is replaced by the wavelet function. Thirdly, wCwNN is proposed, in which the FCNN of CNN and wCNN is replaced by the wNN. With the same hyperparameters, the comparative results of the experiments among CNN, wCNN and wCwNN are shown in Tables 5 and 6.

Table 5 Comparative results of CNN, wCNN and wCwNN experiments (Learning Efficiency = 0.01)
Table 6 Comparative results of CNN, wCNN and wCwNN experiments (Learning Efficiency = 0.1)

The MSE of CNN, wCNN and wCwNN in each AC is drawn in Fig. 10.

Fig. 10 MSE plot of CNN, wCNN and wCwNN (learning efficiency is 0.1). a Result of 1 SP of CNN. b Result of 1 SP of wCNN. c Result of 1 SP of wCwNN. d Results of 10 SPs of CNN. e Results of 10 SPs of wCNN. f Results of 10 SPs of wCwNN

We conducted the same experiments on both the MNIST and CIFAR-10 datasets; the results are shown above. The comparison between the MNIST dataset and the CIFAR-10 dataset is as follows:

In the experiment on the CIFAR-10 dataset, the performance of wCNN (mean MSE 0.144948, mean error rate 0.20095) is better than that of CNN (mean MSE 0.29845, mean error rate 0.205107), while the performance of wCwNN (mean MSE 0.134675, mean error rate 0.18145) is better than that of wCNN.

According to the comparative experimental results of CNN, wCNN and wCwNN shown in Tables 5, 6 and 7 and Fig. 10, the following findings can be drawn:

(1) CNN, wCNN and wCwNN are convergent; the task of classification for the MNIST dataset can be completed by all the above methods.

(2) When the learning rate of CNN, wCNN and wCwNN is increased (from η = 0.01 to η = 0.1), the MSE of each algorithm decreases significantly and the error rate of each experiment decreases significantly, while the consumed time increases slightly.

(3) wCwNN has the highest precision (the minimum MSE is 0.045), and the precision of wCNN (the minimum MSE is 0.088) is higher than the precision of CNN (the minimum MSE is 0.094).

(4) wCwNN has the highest accuracy (the minimum error rate is 3.43% when η = 0.1), followed by wCNN (the minimum error rate is 7.22% when η = 0.1); both error rates are lower than the error rate of CNN (9.97%).

(5) The variance of the MSE of the trained wCwNN is the smallest (0.000014), the variance of the MSE of the trained wCNN is the largest (0.000554), and the variance of the MSE of the trained CNN is 0.000069.

(6) wCwNN (average time of one SP is 131.67) consumes more time than wCNN (average time of one SP is 83.57), and wCNN consumes much more time than CNN (average time of one SP is 10.89).

Table 7 Comparative results of different datasets (Learning Efficiency = 0.1)

In summary

(1) The proposed wCwNN is successfully improved based on wCNN, and the proposed wCNN is successfully improved based on CNN. (2) Both wCwNN and wCNN have higher precision (smaller MSE) than CNN, and the precision of wCwNN is the highest. (3) Both wCwNN and wCNN have higher accuracy (smaller error rate) than CNN, and the accuracy of wCwNN is the highest. Both improvements of wCNN and wCwNN lead to more time consumed in each SP.

In the future, we will continue the research as follows

(1) We will look for a mechanism so that the wavelet function does not overshoot the minimum while maintaining its ability to jump out of local minima. (2) We will improve the learning ability of CNN, wCNN and wCwNN by expanding the network structure, such as the depth of the network; furthermore, the proposed wCNN and wCwNN can be used as a neuron to build a deeper neural network. (3) We will conduct more experiments to verify the performance of the improved methods.