
1 Introduction

In the area of Artificial Intelligence there is a great diversity of algorithms for pattern classification. One of the most important is the Multi-Layer Perceptron (MLP), which, through a training process, adjusts the hyperplanes of each neuron in each layer to separate the classes of a dataset [22, 25, 26]. The training is usually based on gradient descent and back-propagation [22]. Since its appearance in 1961 [25], this model has been widely used in pattern recognition. However, there are other classification algorithms, such as the Dendrite Morphological Neuron (DMN), whose training is completely different from back-propagation [22]: instead of iteratively approximating hyperplanes by analyzing each sample of the training set, this type of neuron analyzes the training set as a whole and, based on lattice operations, generates hyperboxes that separate the classes of the training set with a high classification rate.

The success of Deep Neural Networks (DNNs) in recognizing objects in images [12] and speech in audio [9] is well known. The mathematical operations employed in these neurons remain the same as those of an MLP [22]: sums, multiplications and some well-known non-linear functions. In addition, convolutions are used to reduce the number of learning parameters [13]. Thus, the novelty of the last ten years has come from more computing power, more layers, more data and dropout [14, 23]; the last two help avoid the overfitting in deep models that previously prevented MLPs from outperforming Support Vector Machines (SVMs) [3]. It is important to note that these developments do not change the underlying mathematical structure, which leads us to ask whether other mathematical operations can improve recognition performance. In this paper, we start a research project in that direction. In particular, we compare DNNs with DMNs on a specific type of problem: multi-class spirals with several loops in 2D. Even though this classification problem is artificial, it is useful for studying the essential properties of the two models. A first analysis was published in [27]; here we extend it to deeper models and more classes. Both models are compared as classification tools in terms of classification accuracy and training time, which depend directly on the number of parameters that constitute each model.

The rest of the paper is organized as follows. Section 2 provides a brief description of works that have proposed a mathematical structure different from the mainstream of neural networks. Sections 3 and 4 present the architectures of DNNs and DMNs, respectively. Section 5 discusses the experimental results. Finally, in Sect. 6 we give our conclusions and future work.

2 Previous Work

Currently, there are few studies aimed at improving the mathematical structure of deep neural networks. However, before the term “deep learning” was coined, several papers made interesting proposals in this direction. Pessoa and Maragos [18] combined linear filters with rank filters; their architecture recognizes digits in images with results similar to or better than classical MLPs, in shorter training times. Ivakhnenko [11] proposed a multilayer network of polynomials to approximate the decision boundary of classification problems; this was the first deep learning model published in the literature. Durbin and Rumelhart [4] introduced product units into neural networks; these units add complexity to the model in order to use fewer layers. Other mathematical structures have also been proposed, such as higher-order neural networks (NNs) [5], sigma-pi NNs [8], second-order NNs [17], functionally expanded NNs [10], wavelet NNs [29] and Bayesian NNs [16]. Glorot et al. investigated more effective ways of training very deep neural networks using ReLUs as activation functions, achieving results comparable to the state of the art [6]. Bengio [2] argues that, in order to learn complex functions by gradient descent, deep architectures are necessary. In [1], Bengio also analyzes and considers alternatives to standard gradient-descent training, due to the trade-off between efficient learning and latching on to information. In this paper, we compare the performance of DNNs with that of DMNs to show some limitations of DNNs and how morphological operations could improve deep learning.

3 Deep Neural Networks

“A deep learning architecture is a multilayered stack of simple modules with multiple non-linear layers” [14], usually between 5 and 20 layers. Each layer i contains \(n_{i}\) modules, and each module is a neuron with some activation function such as the sigmoid or tanh. An MLP, and its generalization the DNN, is therefore defined by a set of neurons divided into layers: an input layer, one or more intermediate (hidden) layers and an output layer. The DNN architectures constructed to classify our datasets have i intermediate layers with \(n_{i}\) neurons per layer, where \(n_{i-1}\) and \(n_{i}\) are not necessarily equal. In our experiments we used the Rectified Linear Unit (ReLU), which gives better results in DNNs according to [6, 14, 15], so that a neuron is defined by:

$$\begin{aligned} f\left( x\right) =\max \left( 0,w^{T}x\right) \!, \end{aligned}$$
(1)

where x is the input vector of N dimensions and w is the weight vector that multiplies the input vector. In the output layer, the activation function is replaced by a softmax, which is commonly used to predict the probabilities associated with a multinoulli distribution [7] and is defined by

$$\begin{aligned} softmax\left( x\right) _{i}=\frac{\exp \left( x_{i}\right) }{\sum _{j=1}^{n}\exp \left( x_{j}\right) }. \end{aligned}$$
(2)
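As a minimal illustration of Eqs. (1) and (2), the following NumPy sketch implements a single ReLU neuron and the softmax function; the shift by the maximum inside the exponential is a standard numerical-stability detail, not part of Eq. (2).

```python
import numpy as np

def relu_neuron(x, w):
    """Eq. (1): ReLU activation of a single neuron with weight vector w."""
    return max(0.0, np.dot(w, x))

def softmax(x):
    """Eq. (2): softmax over the output-layer activations x."""
    e = np.exp(x - np.max(x))  # stability shift, not part of Eq. (2)
    return e / e.sum()
```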

The general DNN architecture is shown in Fig. 1. It is also common practice to vary the number of neurons contained in each layer of the DNN. The training method used for the DNN is gradient descent with Nesterov momentum of 0.9 and a mini-batch size of 64, which helps achieve more stable and faster convergence.
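The paper does not state which framework was used; the following is a hypothetical Keras sketch of such an architecture and training configuration. The hidden layer sizes and the number of epochs are placeholders.

```python
import tensorflow as tf

def build_dnn(input_dim, hidden_sizes, n_classes):
    """Fully connected DNN: ReLU hidden layers and a softmax output (Fig. 1)."""
    model = tf.keras.Sequential()
    model.add(tf.keras.layers.Dense(hidden_sizes[0], activation="relu",
                                    input_shape=(input_dim,)))
    for n in hidden_sizes[1:]:
        model.add(tf.keras.layers.Dense(n, activation="relu"))
    model.add(tf.keras.layers.Dense(n_classes, activation="softmax"))
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.1, momentum=0.9,
                                          nesterov=True),   # Nesterov momentum 0.9
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"])
    return model

# Example (layer sizes and epochs are illustrative):
# model = build_dnn(input_dim=2, hidden_sizes=[64, 64, 64], n_classes=2)
# model.fit(X_train, T_train, batch_size=64, epochs=200)
```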

Fig. 1. Architecture of a DNN, where the number of hidden layers is another hyper-parameter.

4 Dendrite Morphological Neurons

A DMN segments the input space into hyperboxes of N dimensions. The output y of a neuron is a scalar given by

$$\begin{aligned} y=\underset{k}{argmax}\left( d_{n,k}\right) \!, \end{aligned}$$
(3)

where n is the dendrite number, k is the class number, and \(d_{n,k}\) is the scalar output of a dendrite given by

$$\begin{aligned} d_{n,k}=\underset{i}{\min }\left( \min \left( x_{i}-w_{min,i}^{n},\,w_{max,i}^{n}-x_{i}\right) \right) , \end{aligned}$$
(4)

where x is the input vector and \(w_{min}^{n}\) and \(w_{max}^{n}\) are the dendrite weight vectors. The min operations together check whether x is inside the hyperbox whose extreme points are \(w_{min}^{n}\) and \(w_{max}^{n}\) (see Fig. 2). If \(d_{n,k}>0\), x is inside the hyperbox; if \(d_{n,k}=0\), x is on the hyperbox boundary; otherwise, it is outside. A good property of DMNs is that they can create complex non-linear decision boundaries that separate the classes with only one neuron [20, 21]. The reader can consult [28] for more information.
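As a minimal NumPy sketch of Eqs. (3) and (4), a trained DMN can be stored as a list of hyperboxes, each with its class label; the representation and names below are illustrative, not the paper's implementation.

```python
import numpy as np

def dendrite_output(x, w_min, w_max):
    """Eq. (4): positive iff x lies inside the hyperbox [w_min, w_max],
    zero on its boundary, negative outside."""
    return np.min(np.minimum(x - w_min, w_max - x))

def dmn_classify(x, boxes):
    """Eq. (3): boxes is a list of (w_min, w_max, class_label) tuples,
    one per dendrite; the class of the strongest dendrite wins."""
    responses = [dendrite_output(x, lo, hi) for lo, hi, _ in boxes]
    return boxes[int(np.argmax(responses))][2]

# Example: a single 2D hyperbox [0,1] x [0,1] assigned to class 0.
# dmn_classify(np.array([0.5, 0.5]), [(np.zeros(2), np.ones(2), 0)])  # -> 0
```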

Fig. 2. Dendrite morphological neuron and an example of a hyperbox in 2D generated by its dendrite weights. The hyperbox divides the input space for classification purposes.

The training goal is to determine the number of hyperboxes and the weights needed to classify an input pattern. The regularized divide-and-conquer training method [28] consists of only two steps. The algorithm begins by opening an initial hyperbox \(H_{0}\) that encloses all the samples with a margin distance M on each side of \(H_{0}\), which gives better noise tolerance. Next, the divide-and-conquer strategy is executed recursively. The algorithm chooses a training sample x and generates a sub-hyperbox \(H_{sub}\) around it. It then extracts the samples \(\left( X_{H_{sub}},T_{H_{sub}}\right) \) from \(\left( X,T\right) \) that are enclosed in \(H_{sub}\), where X is the training set represented as a matrix \(X\in \mathbb {R}^{N\times Q_{train}}\), \(Q_{train}\) is the number of training samples, and the target class of each sample is contained in the vector \(T\in \mathbb {R}^{1\times Q_{train}}\). The recursion divides \(H_{0}\) until the error rate \(E_{\%}\) in the hyperbox H is less than or equal to the hyper-parameter \(E_{0}\). The error rate is defined as \(E_{\%}=1-\frac{\left| X_{mode}\right| }{\left| X\right| }\), where \(X_{mode}\) is the subset of samples belonging to the most frequent training class in H [19]. At the end of the recursion, each deepest hyperbox is assigned to its ruling class, which is set to the statistical mode of T. The recursion closes by appending all generated sub-hyperboxes with their corresponding classes, and hyperboxes that share a common hyperface are joined. A complete description of this training method can be found in [24, 28].
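The Python sketch below is only a schematic rendering of the steps described above; the actual method in [24, 28] differs in how the sample x and the sub-hyperbox size are chosen and in how hyperboxes sharing a hyperface are merged, so every name and choice here is an assumption.

```python
import numpy as np

def train_dmn(X, T, margin, E0, sub_size):
    """X: N x Q_train sample matrix, T: length-Q_train array of class labels.
    Returns a list of (w_min, w_max, class_label) hyperboxes."""
    w_min = X.min(axis=1) - margin          # initial hyperbox H0 with margin M
    w_max = X.max(axis=1) + margin
    boxes = []
    _divide(X, T, w_min, w_max, E0, sub_size, boxes)
    return boxes

def _divide(X, T, w_min, w_max, E0, sub_size, boxes):
    labels, counts = np.unique(T, return_counts=True)
    mode_label = labels[np.argmax(counts)]          # statistical mode of T
    error = 1.0 - counts.max() / T.size             # fraction outside the mode class
    if error <= E0 or T.size == 1:
        boxes.append((w_min.copy(), w_max.copy(), mode_label))
        return
    x = X[:, 0]                                     # simplified choice of the sample
    sub_min = np.maximum(x - sub_size, w_min)       # sub-hyperbox H_sub around x
    sub_max = np.minimum(x + sub_size, w_max)
    inside = np.all((X >= sub_min[:, None]) & (X <= sub_max[:, None]), axis=0)
    if inside.all() or not inside.any():            # degenerate split: stop here
        boxes.append((w_min.copy(), w_max.copy(), mode_label))
        return
    _divide(X[:, inside], T[inside], sub_min, sub_max, E0, sub_size, boxes)
    _divide(X[:, ~inside], T[~inside], w_min, w_max, E0, sub_size, boxes)
```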

5 Experiments

The experiments were designed to compare the performance of the two neural networks on the same training sets. The aspects evaluated are the classification accuracy on the validation set, the training time, the number of parameters needed for the network to correctly classify the training set, and the decision boundaries.

5.1 Spiral Datasets

The training sets are synthetic, designed to test the ability of the two types of neural networks to unravel highly entangled classes; that is, the data are generated with a high degree of entanglement and a low degree of overlap between classes. Each spiral dataset consists of 2 to 10 classes wrapped one over the other, with the number of turns varying between 1 and 10. Examples of these training sets are shown in Fig. 3, and the full set of configurations is listed in Table 1.

Table 1. Datasets for spirals with different number of classes \(N_{C}=\{2,3,4,5,10\}\) and increasing number of loops \(N_{L}=\{1,2,...,10\}\). The number of training patterns is \(Q_{train}\) and the number of validation patterns is \(Q_{val}\).
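The paper does not give the generating equations for these spirals; the NumPy sketch below is one plausible construction (class arms offset by a constant phase, \(N_{L}\) turns per arm, plus a small noise term) and is purely illustrative.

```python
import numpy as np

def make_spirals(n_classes, n_loops, points_per_class, noise=0.01, seed=0):
    """Generate interlaced 2D spirals: one arm per class, n_loops turns per arm."""
    rng = np.random.default_rng(seed)
    X, T = [], []
    t = np.linspace(0.0, 1.0, points_per_class)               # position along the arm
    for k in range(n_classes):
        theta = 2.0 * np.pi * (n_loops * t + k / n_classes)   # class-dependent phase
        r = t                                                 # radius grows along the arm
        arm = np.stack([r * np.cos(theta), r * np.sin(theta)], axis=1)
        X.append(arm + noise * rng.standard_normal(arm.shape))
        T.append(np.full(points_per_class, k))
    return np.concatenate(X), np.concatenate(T)

# Example: the two-class, 10-loop spiral of dataset 1 (sizes are illustrative).
# X, T = make_spirals(n_classes=2, n_loops=10, points_per_class=1000)
```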

5.2 Experimental Results for DNNs

To classify the patterns presented in Sect. 5.1, the DNN architecture varies in the number of neurons per layer as well as in the number of hidden layers, while the remaining hyper-parameters are fixed to the following values: learning rate of 0.1, Nesterov momentum of 0.9 and batch size of 64. These values were obtained by running classification tests with the learning rate varied over the range \(\left[ 0.001,1\right] \) in increments of 0.01, as sketched below. Table 2 summarizes the resulting architectures applied to each training set; the column “Dataset” specifies the training set used, \(N_{p}\) the number of parameters in the model, \(T_{a}\) the classification accuracy on the training set, \(V_{a}\) the classification accuracy on the validation set, and \(T_{t}\) the total training and validation time. Figure 5 shows the classification accuracies for each neural network, number of classes and number of loops of each training set, showing better results for the DMN than for the DNN models.
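A hedged illustration of this sweep, reusing the hypothetical `build_dnn` from the sketch in Sect. 3; the datasets `X_train`, `T_train`, `X_val`, `T_val`, the layer sizes and the number of epochs are placeholders.

```python
import numpy as np
import tensorflow as tf

best_lr, best_acc = None, 0.0
for lr in np.arange(0.001, 1.0, 0.01):              # learning rates in [0.001, 1]
    model = build_dnn(input_dim=2, hidden_sizes=[64, 64, 64], n_classes=2)
    model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=lr,
                                                    momentum=0.9, nesterov=True),
                  loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    history = model.fit(X_train, T_train, batch_size=64, epochs=50,
                        validation_data=(X_val, T_val), verbose=0)
    acc = max(history.history["val_accuracy"])
    if acc > best_acc:
        best_lr, best_acc = lr, acc                 # keep the best validation accuracy
```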

Fig. 3. Two-class spiral with interlaced turns (left), five-class spiral with one turn per class (center), and two-class spiral with 10 turns per class (right).

Table 2. Experimental results for DNNs.
Fig. 4. Decision boundary generated by the DMN (first row) and by the DNN (second row).

Fig. 5. (a) Classification percentages for dataset 1 (2 classes, 10 loops); (b) classification percentages for datasets 2–5 (2–10 classes, 1–5 loops); (c) and (d) number of parameters used to classify each dataset; (e) and (f) times required to classify each dataset (\(N_{L}\): number of loops, \(N_{C}\): number of classes).

5.3 Experimental Results for DMNs

In the same way as in Sect. 5.2, Table 3 presents the DMN architectures; the first column shows the training set used and the third column, \(G_{i}\), shows the generalization index of the DMN.

Table 3. Experimental results for DMNs.

5.4 Decision Boundaries

This section compares the decision boundaries generated by the two types of neural network architectures (DNN and DMN) on the same training sets specified in Sect. 5.1. As we observe, the nature of each algorithm is very different, one approximating hyperplanes and the other hyperboxes, yet they yield similar results. However, for the specific datasets used, the generation of hyperboxes of variable size models the training set better, with a higher classification rate and fewer parameters in the DMN model. These results can be observed in Fig. 4: each column shows the decision boundary generated by the DMN (top) and the DNN (bottom). As can be seen in column (b), the decision boundaries are better defined by the DMN (top) than by the DNN (bottom).
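Boundaries like those in Fig. 4 can be reproduced for any trained 2D classifier by evaluating it on a dense grid; the matplotlib sketch below is generic (it is not the authors' plotting code), and `predict` stands for either a DMN or a DNN wrapped as a per-point classifier.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_decision_boundary(predict, X, T, resolution=300):
    """Color the plane by predicted class and overlay the samples.
    X is Q x 2 here; `predict` maps a single 2D point to a class label."""
    x_min, y_min = X.min(axis=0) - 0.1
    x_max, y_max = X.max(axis=0) + 0.1
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, resolution),
                         np.linspace(y_min, y_max, resolution))
    grid = np.column_stack([xx.ravel(), yy.ravel()])
    labels = np.array([predict(p) for p in grid]).reshape(xx.shape)
    plt.contourf(xx, yy, labels, alpha=0.3)        # predicted regions
    plt.scatter(X[:, 0], X[:, 1], c=T, s=4)        # training samples
    plt.show()
```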

6 Conclusion and Future Work

Linear filters with non-linear activation functions (and back-propagation) are today the workhorses of the neural network community. This leads us to ask: are there other mathematical structures that produce better results for some problems, and what advantages would they have? The motivation of this research is to answer these questions. In this paper, we compare DNNs and DMNs on a very simple 2D classification problem: multi-class spirals with an increasing number of loops. We show that the DMNs surpass the DNNs in terms of higher accuracy and a smaller number of learning parameters. Of course, these results are limited to spiral-like problems, which we specifically designed to test the separation ability of the two neural architectures. The DMN training time is longer than the DNN training time, but the classification rate is not compromised; that is, the DNNs can be trained in a shorter time, but their validation accuracy is much lower than that obtained by the DMN.

We conclude that this result is due to the nature of both algorithms. The hyperboxes of DMNs model these types of datasets better because the divide-and-conquer training is based on a geometrical interpretation of the whole dataset and refines the model at each recursion step, while training based on gradient descent is a search guided only by partial information from the dataset and the local behavior of the cost function. From this, we raise the hypothesis that deep learning networks can be improved by adding morphological neurons. This is a direction for future research.