# Training Neural Networks with Krill Herd Algorithm


## Abstract

In recent times, several new metaheuristic algorithms based on natural phenomena have been made available to researchers. One of these is the Krill Herd Algorithm (KHA), which contains many interesting mechanisms. The purpose of this article is to compare the KHA optimization algorithm, used here for learning an artificial neural network (ANN), with other heuristic methods and with more conventional procedures. The proposed ANN training method has been verified on the classification task. For that purpose, benchmark examples drawn from the UCI Machine Learning Repository were employed, with the Classification Error and the Sum of Square Errors used as evaluation criteria. It has been concluded that the application of the KHA offers promising performance, both in terms of the aforementioned metrics and the time needed for ANN training.

## Keywords

Krill Herd Algorithm · Biologically Inspired Algorithm · Metaheuristic · Neural Networks · Optimization

## 1 Introduction

Increasingly, optimization methods and algorithms are becoming essential tools in the engineering profession. However, employing them is, in general, time-consuming. This is due to the extensive computational power needed when deriving solutions, as well as to the inherent properties of the employed methods and algorithms themselves. The methods currently used (to very good effect) for solving optimization problems are gradient methods and heuristic algorithms. Unfortunately, both groups of procedures, in addition to their advantages, have certain drawbacks as well.

The advantage of gradient methods lies in their rapid convergence to the nearest optimum, which is not always the global one. When applied directly to a multi-modal function, such methods very often find only a local optimum. Their disadvantages rest within the conditions imposed on the objective function: it must be continuous and, what is more, its Hessian must be positive definite, while the calculations are performed from a single starting point, which, in turn, significantly restricts the search area. Furthermore, the choice of the starting point has an impact on the convergence of the method, and may cause the results to fall into a local extremum. In practice, many procedures based on this paradigm, e.g. the Back-Propagation Algorithm or the Quasi-Newton methods, can be applied to various optimization tasks, in particular to artificial neural network (ANN) training [26].

Metaheuristic procedures aimed at finding the global optimum (usually with a certain probability) have also been developed. To this group belong the Evolutionary Algorithms [17], the Simulated Annealing algorithms [20], the Immunological Methods [2] and the swarm intelligence procedures [21]. The aforementioned heuristic algorithms require only knowledge of the cost (fitness) function formula, and are quite simple to implement using contemporary computing resources. Nowadays, the group of algorithms belonging to the category of swarm intelligence is very extensive. Among the latest nature-inspired metaheuristics are the Bat Algorithm [8], the Harmony Search [16], the Firefly Algorithm [7, 23], the Cuckoo Search [6, 27], the Flower Pollination Algorithm [22] and the Krill Herd Algorithm [4].

Heuristic methods are often used for the neural network learning process, and they constitute an alternative to the typical traditional gradient algorithms such as, for instance, the error Back-Propagation or the Levenberg–Marquardt procedures. For certain types of neural networks, these conventional methods cannot be used. This comes about due to the impossibility of applying analytical derivative formulas, or because only a slow convergence is achieved by the learning algorithm. The Fuzzy Flip-Flop (FFF) neural network is an example of both of the aforementioned phenomena. For this type of network, dedicated training algorithms have been applied, among them the Bacterial Memetic Algorithm [19] and the Evolutionary Strategy [17]. For a typical network of the multilayer perceptron type, many heuristic optimization methods can also be employed. These range from the Genetic Algorithm, through the Evolutionary Algorithms, to the swarm optimization procedures. An interesting example of the use of a heuristic method for ANN learning is the procedure for the prediction of peak ground acceleration described in [11]. In this work, the authors applied a well-known derivative-free global optimisation algorithm (based on the simulated annealing metaheuristic) so as to improve the neural network's efficiency. Other stimulating illustrations of employing metaheuristics for the ANN learning procedure include the application of the Artificial Bee Colony algorithm [14, 15], Differential Evolution [3] or the plain Particle Swarm Optimization procedure [12, 13]. From the results found in scientific publications, it can be concluded that, very often, using a heuristic algorithm yields positive results much faster than applying gradient methods [16].

The use of the Krill Herd Algorithm (KHA) has become very popular recently [5]. This is because it represents an effective modern optimisation and search procedure [4, 9]. This metaheuristic technique is based on the behaviour of a krill herd [10]. The purpose of this paper is to investigate the possibility of applying the KHA for optimizing the parameters of an ANN. In our work, the results of numerical studies performed on typical benchmark datasets were compared with other heuristic methods and with a gradient algorithm.

The outline of this paper is as follows. Following this introduction, in Sect. 2, information about the Krill Herd Algorithm (KHA) is introduced, while in Sect. 3, the neural network and the tuning of its structure and parameters by means of the KHA are presented. Subsequently, in Sect. 4, application examples are given. Finally, conclusions are drawn in the last part of this work.

## 2 Krill Herd Algorithm

In this section, the optimization algorithm covered by this paper will be briefly described. The KHA is one of the newest optimization procedures of a heuristic character. Its main inspiration lies in following and imitating the biological swarming behaviour of the Antarctic krill (*Euphausia superba*), found in the Southern Ocean. This algorithm was introduced in 2012 by Gandomi and Alavi [4].

In the KHA, the quality of each candidate solution is evaluated by a function *f*, called the ’fitness function’ or the ’cost function’. In essence, the KHA procedure is based on observing such behaviours in the herd as foraging and communicating with other members of the swarm. Therefore, the position of the particular individuals \((i=1,\ldots ,M)\) in the herd is described through the following equation:

$$\frac{dX_i}{dt} = N_i + F_i + D_i$$

where \(X_i\) denotes the position of the *i*-th krill, \(N_i\) is the motion induced by other herd members, \(F_i\) is the foraging motion, and \(D_i\) is the physical (random) diffusion of the *i*-th krill. The movement effected by the presence of other krill can be described using the following formula:

$$N_i^{new} = N^{max}\alpha _i + \omega _n N_i^{old}$$

where \(N^{max}\) denotes the maximal induced speed, \(\alpha _i\) aggregates the local effect of the neighbours and the attraction of the best individual, and \(\omega _n\) is the inertia weight of this motion. The foraging motion is formulated analogously, as \(F_i = V_f \beta _i + \omega _f F_i^{old}\), where \(V_f\) denotes the foraging speed, \(\beta _i\) combines the food attraction and the effect of the best position found so far, and \(\omega _f\) is the corresponding inertia weight. Finally, the random diffusion is given by:

$$D_i = D^{max}\left(1 - \frac{I}{I_{max}}\right)\delta$$

where *I* is the index of the current iteration, \(I_{max}\) indicates the maximum number of iterations, \(D^{max}\in [0.002; 0.1]\) represents the maximal diffusion speed, while \(\delta\) constitutes the random directional vector, with its elements belonging to the interval \([-1; 1]\).
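The three motion components above can be sketched in code as follows. This is an illustrative implementation only: the helper names (`induced_motion`, `diffusion`, `position_step`) and the default parameter values are our own choices, not taken from the original formulation, and the \(\alpha_i\) and \(F_i\) terms are assumed to be supplied by the caller.

```python
import numpy as np

rng = np.random.default_rng(0)

def induced_motion(n_old, alpha, n_max=0.01, omega_n=0.5):
    # N_i^new = N^max * alpha_i + omega_n * N_i^old
    return n_max * alpha + omega_n * n_old

def diffusion(iteration, i_max, dim, d_max=0.05):
    # D_i = D^max * (1 - I / I_max) * delta, with delta drawn from [-1, 1]^dim;
    # the random influence fades linearly as the iterations progress
    delta = rng.uniform(-1.0, 1.0, size=dim)
    return d_max * (1.0 - iteration / i_max) * delta

def position_step(x, n, f, d, dt=0.1):
    # dX_i/dt = N_i + F_i + D_i, integrated with a simple Euler step of size dt
    return x + dt * (n + f + d)
```

Note that at the final iteration (`iteration == i_max`) the diffusion term vanishes entirely, shifting the search from exploration towards exploitation.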

At the last stage of the KHA procedure, genetic operators are used [25]. Primarily, the classical mutation and crossover operators known from Genetic Algorithms are employed. In other studies [24], alternative operators based on Differential Evolution have also been proposed.

The crossover operator exchanges the *m*-th coordinate of the *i*-th krill with the corresponding coordinate of another, randomly selected herd member, as shown below by the formula:

$$x_{i,m} = \begin{cases} x_{r,m}, & \text{with probability } Cr \\ x_{i,m}, & \text{otherwise} \end{cases}$$

where \(r \ne i\) indexes a randomly chosen krill and *Cr* denotes the crossover probability. In turn, the mutation operator modifies the *m*-th coordinate of the *i*-th krill in accordance with the formula:

$$x_{i,m} = \begin{cases} x_{best,m} + \mu (x_{p,m} - x_{q,m}), & \text{with probability } Mu \\ x_{i,m}, & \text{otherwise} \end{cases}$$

where \(x_{best}\) is the best krill found so far, *p* and *q* index two randomly selected herd members, \(\mu\) is a scaling factor and *Mu* denotes the mutation probability.
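The two genetic operators can be sketched as below. Again, this is only an illustrative vectorized implementation (applying the per-coordinate probability to all coordinates at once); the default values of `cr`, `mu` and `scale` are placeholders rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(1)

def crossover(x_i, x_r, cr=0.2):
    # With probability cr, replace the m-th coordinate of the i-th krill
    # with the corresponding coordinate of a randomly chosen krill x_r
    mask = rng.uniform(size=x_i.shape) < cr
    return np.where(mask, x_r, x_i)

def mutate(x_i, x_best, x_p, x_q, mu=0.05, scale=0.5):
    # With probability mu, rebuild the m-th coordinate from the best krill,
    # perturbed by the difference of two randomly chosen herd members
    mask = rng.uniform(size=x_i.shape) < mu
    mutant = x_best + scale * (x_p - x_q)
    return np.where(mask, mutant, x_i)
```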

## 3 KHA's Application for Tuning the Neural Network's Parameters

An optimization procedure based on the KHA described in Sect. 2 has been applied for the training of an Artificial Neural Network (ANN), and the obtained results were compared with those obtained by using Back-Propagation (BP), the Genetic Algorithm (GA) and the Harmony Search (HS), taken from article [16]. In the KHA-based training method, all weights and biases of the ANN are translated into a single vector, which represents the position of an individual krill. In our study, the initial krill positions are obtained using a random value generator with uniform distribution. Moreover, in the presented research, the parameters of the KHA were adopted based on the literature [4] and [18]. In particular, the following parameters were assumed: \(N^{max}=0.01, V_f=0.02, \omega =2.0, \omega _f=2.0\) and finally \(D^{max}=0.7\). In addition, for each example, the scaling factor associated with the element *dt* was determined individually. The training process is terminated when the maximal epoch number has been reached; in this study, it was set to 100.
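The encoding of all weights and biases into a single krill position vector can be sketched as follows; the helper names are hypothetical, and the layer layout is assumed to be a list of weight matrices followed by a list of bias vectors.

```python
import numpy as np

def flatten_params(weights, biases):
    # Concatenate every weight matrix and bias vector into one position vector
    parts = [w.ravel() for w in weights] + [b.ravel() for b in biases]
    return np.concatenate(parts)

def unflatten_params(position, shapes):
    # Rebuild the arrays (weights first, then biases) from a position vector
    arrays, offset = [], 0
    for shape in shapes:
        size = int(np.prod(shape))
        arrays.append(position[offset:offset + size].reshape(shape))
        offset += size
    return arrays
```

For the 4–5–3 Iris network of Table 1, for instance, this yields 4·5 + 5·3 + 5 + 3 = 43 parameters, matching the figure reported there.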

The quality of the trained classifier was assessed with two measures: the Sum of Square Errors

$$SSE = \sum _{p=1}^{P}\sum _{j}\left( y_{j}^{(p)} - d_{j}^{(p)}\right) ^2$$

where \(y_{j}^{(p)}\) and \(d_{j}^{(p)}\) denote the *j*-th network output and the corresponding desired output for the *p*-th example, and the Classification Error (CE), i.e. the percentage of misclassified examples, where *P* is the number of examples under classification. It should be clearly emphasized that the choice of both measures is not accidental. This is because attaining the minimum value of the SSE error does not always entail a marked decrease in the CE error. This can be explained by considering the construction of the neural network. Here, the output layer is a set of neurons representing each of the considered classes. The result of classification is, hence, designated by the index of the neuron with the highest output value. Reducing the values of the output signals during the learning phase thus decreases the overall SSE error, while not always having an impact on the CE error. To circumvent this, a weighted error (with factors 0.5) combining both types of errors (SSE and CE) is employed here.
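The weighted cost described above could be computed as in this sketch, assuming one-hot encoded targets and class decoding by the highest-output neuron; the function name and signature are our own.

```python
import numpy as np

def fitness(outputs, targets, w_sse=0.5, w_ce=0.5):
    # Sum of Square Errors over all examples and output neurons
    sse = np.sum((outputs - targets) ** 2)
    # Classification Error: the predicted class is the neuron with the
    # highest output value; CE is the percentage of misclassified examples
    predicted = np.argmax(outputs, axis=1)
    actual = np.argmax(targets, axis=1)
    ce = 100.0 * np.mean(predicted != actual)
    # Weighted combination, with both factors equal to 0.5 as in the paper
    return w_sse * sse + w_ce * ce
```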

## 4 Numerical Experiments

Table 1. Data sets used for verification

| Data set | No. attributes | No. classes | Training examples | Testing examples | Equinumerosity of examples in classes | No. neurons in hidden layer | No. NN parameters |
|---|---|---|---|---|---|---|---|
| Iris | 4 | 3 | 120 | 30 | Yes | 5 | 43 |
| Ionosphere | 33 | 2 | 281 | 70 | No | 4 | 146 |
| Glass | 9 | 6 | 171 | 43 | No | 12 | 198 |
| Thyroid | 21 | 3 | 5760 | 1440 | No | 15 | 378 |

Additionally, it should be emphasized here that the last two sets are characterized by a very large imbalance in the number of examples within the considered classes.

For the classification of the individual data sets, ANNs were constructed. Their structure is based on the comparative study conducted in article [16]. Thus, all tested networks have three layers, and all neurons use the *tanh* transfer function. Moreover, input and output data values were normalized to the range \([-1, 1]\).
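The normalization to \([-1, 1]\) mentioned above amounts to a simple linear per-column rescaling, as in this sketch (the function name is our own):

```python
import numpy as np

def scale_to_range(X):
    # Linearly rescale each column of X into [-1, 1];
    # assumes no column is constant (otherwise a zero division occurs)
    lo = X.min(axis=0)
    hi = X.max(axis=0)
    return 2.0 * (X - lo) / (hi - lo) - 1.0
```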

In order to enable a comparison with the other results, the data sets were divided as follows: \(80~\%\) of all samples constituted the learning data, while the rest represented the testing data. Naturally, these proportions were maintained for each class. For every data set, the main learning algorithm was executed 20 times, using the KHA procedure with the same parameters as described in the previous section. The exception here is the scaling factor associated with \(\delta t\), which was individually set for each of the considered data sets as 2.0, 3.0, 3.7 and 3.0, respectively.
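The class-preserving 80/20 division described above corresponds to a stratified split, which can be sketched as follows (a plain NumPy illustration; the function name and seed handling are our own):

```python
import numpy as np

def stratified_split(X, y, train_frac=0.8, seed=0):
    # Split each class separately so that the train/test proportion
    # is maintained within every class
    rng = np.random.default_rng(seed)
    train_idx, test_idx = [], []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        rng.shuffle(idx)
        cut = int(round(train_frac * idx.size))
        train_idx.extend(idx[:cut])
        test_idx.extend(idx[cut:])
    return X[train_idx], y[train_idx], X[test_idx], y[test_idx]
```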

All parameters used in this investigation were based on recommendations already provided in the literature [4, 18]. What is more, the selection of the variable factor was based on pilot runs. In the case of the first two benchmark sets (i.e. Iris and Ionosphere), the cardinality of the population was 50, while in the other cases, this figure was set to 70 individuals.

Table 2. Results of the learning and testing process

| Dataset | Method | Learning SSE | \(\sigma _{SSE}\) | Learning CE (%) | \(\sigma _{CE}\) | Accepted It. | Testing SSE | \(\sigma _{SSE}\) | Testing CE (%) | \(\sigma _{CE}\) |
|---|---|---|---|---|---|---|---|---|---|---|
| Iris | KHA | 21.28 | 2.42 | | 0.08 | 30 | 4.88 | 0.42 | | 0.11 |
| | HS | 18.00 | – | 1.67 | – | 162 | – | – | 3.33 | – |
| | GA | 96.00 | – | 10.00 | – | 66 | – | – | 10.0 | – |
| | BP | 7.85 | – | 0.83 | – | 1254 | – | – | 3.33 | – |
| Ionosphere | KHA | | 2.92 | 11.00 | 0.17 | | 13.87 | 4.47 | 8.57 | 0.05 |
| | HS | 106.4 | – | 5.00 | – | 170 | – | – | 5.63 | – |
| | GA | 152 | – | 6.79 | – | 2244 | – | – | 5.63 | – |
| | BP | 8.52 | – | 0.56 | – | 1628 | – | – | 4.23 | – |
| Glass | KHA | 41.21 | 1.88 | 40.94 | 0.91 | 21 | 9.82 | 2.09 | 41.86 | 0.04 |
| | HS | 355.85 | – | 29.82 | – | 177 | – | – | 27.91 | – |
| | GA | 544.00 | – | 42.11 | – | 6123 | – | – | 32.56 | – |
| | BP | 218.06 | – | 18.71 | – | 662 | – | – | 32.56 | – |
| Thyroid | KHA | | 28.31 | | 0.03 | | 35.9 | 3.01 | | 0.08 |
| | HS | 3146.4 | – | 6.94 | – | 94 | – | – | 7.22 | – |
| | GA | 3416.0 | – | 7.42 | – | 167 | – | – | 7.43 | – |
| | BP | 450.0 | – | 1.33 | – | 4201 | – | – | 2.78 | – |

## 5 Discussion of Results

The original Ionosphere data contains 34 elements in the feature vector, but one coordinate was found to take the same value in all cases; therefore, it was removed from the data set. Accordingly, a neural network with a simple \(33{-}4{-}2\) structure was examined. In this particular case, the first class is represented by \(64~\%\) of the examples, so it can be seen that this data set is significantly unbalanced. In the learning process, the lowest SSE error (within the group of heuristic methods) was achieved for the KHA; it amounted to 31.0. On the other hand, the CE and testing errors were not so impressive. However, the learning time required by the other methods (which take much longer to bring about the desired result) possibly indicates that the structure of this data is very complex. In Fig. 3, the convergence of the process of determining the weights of the neural network is illustrated. The optimization process fairly quickly (in about 80 iterations) reaches its final level, and further training brings only a negligible increase in the quality of classification. At this point it is worth noting that such a quickly trained network can be a good starting point for a different heuristic strategy, in particular one related to finding a local minimum.

The Glass data set represents six types of glass. In this case, it must be emphasized that the first and second classes contain \(70~\%\) of all the examples. In Fig. 4, the convergence of the neural network learning process is presented. From the graph it can be seen that within 20 iterations the learning process saturates, and further training practically does not change the quality of the neural classifier. For this set, the proposed algorithm obtained the smallest SSE learning error, and, in addition, the training process lasted only 21 epochs. Moreover, the set's CE error is smaller than in the case of employing the GA, but greater than that of the other methods.

In summary, considering the results achieved in our work, it should be stressed that the data used for the classification task was predominantly complex, multidimensional and multimodal. Despite this, the neural classifiers were optimized by the KH-based method in much less time than by the other heuristic algorithms or the gradient BP method. Indeed, in Table 2, for example, we can see that in the case of the Iris dataset, the execution time is half that of a learning process based on the GA method. Simultaneously, the classic BP method took a vast amount of time: about 42 times more iterations than the proposed algorithm. Similar observations can be drawn for the other data collections.

In conclusion, the proposed algorithm gives the best result among both the heuristic and classical gradient procedures for the Iris dataset. In the case of the other data collections, the obtained result is the best among all the heuristic methods examined. It must be said, however, that an exception here is the Ionosphere dataset, for which the neural network classifier generates a significant classification error. We cautiously deduce that this is due to the large number of features found within the vector representing the examined phenomenon.

## 6 Conclusions

Our study reveals that the KHA, although at first glance seemingly quite complex, can be employed to obtain satisfactory results for ANN training very quickly. Hence, we believe that the proposed method represents a promising tool for neural classification. With regard to its utilization, in some cases the classical gradient method (BP) was shown to yield better results, but the number of iterations it required was incomparably larger.

Furthermore, this method can be used when working with many types of neural networks, especially those for which the gradient learning method cannot be applied, or in situations wherein its use results in poor convergence of the training procedure. In particular, the KHA can be very useful in the case of neural networks with a unique transfer (activation) function, or ones with a diverse set of functions used in the network's layers. The FFF neural network constitutes an example of a network possessing such properties.

The positive results generated through the application of the KHA procedure for supervised learning in the neural network domain have been brought about by the fact that the applied metaheuristic is based on three main components. These are: the movement of individuals within the swarm as affected by the presence of other swarm members; the individual and collective foraging movements; and the individual swarm member's random diffusion within the swarm. These actions describe the mutual influence on the individual swarm member's position, as well as the environmental impacts on the individual and the entire swarm. The first two actions contain both global and local optimisation functions. These components, working together, constitute an optimisation strategy which improves the results of the KHA procedure. Additionally, a randomised search operation (given as random diffusion) and two evolutionary operators (mutation and crossover) are added to the main algorithm so as to create a powerful and effective optimization procedure. Because of the aforementioned elements, the algorithm is built upon an appropriate balance between exploration and exploitation of the solution space.

What is more, during the numerical study, we discovered that the KHA procedure showed a fairly sizeable sensitivity to some of its internal parameters. This issue was initially considered in [18], and we feel that it warrants further systematic investigation.

Further studies will focus on discerning alternative fitness measures that take sample imbalance into account. Planned future experiments will also involve datasets of a mixed-type nature. Furthermore, in a subsequent part of the planned investigation, a KHA procedure designed for optimizing the structure and parameters of an ANN of the FFF type will be developed. In this case, the proposed method will be employed for the prediction of both time series data and data streams. The main intent of these tests will be to demonstrate the advantages of applying the proposed method with respect to other classic algorithms.

## Notes

### Acknowledgments

This research was supported in part by PL-Grid Infrastructure.

## References

- 1. Asuncion A, Newman DJ (2007) UCI Machine Learning Repository. Department of Information and Computer Science, University of California, Irvine. http://www.ics.uci.edu/~mlearn/MLRepository.html
- 2. Castro LN, Timmis J (2002) Artificial immune systems: a new computational intelligence approach. Springer, London
- 3. Chauhan N, Ravi V, Chandra DK (2009) Differential evolution trained wavelet neural networks: application to bankruptcy prediction in banks. Expert Syst Appl 36(4):7659–7665
- 4. Gandomi AH, Alavi AH (2012) Krill Herd: a new bio-inspired optimization algorithm. Commun Nonlinear Sci Numer Simul 17:4831–4845
- 5. Gandomi AH, Alavi AH (2013) An introduction of Krill Herd Algorithm for engineering optimization. J Civ Eng Manag (in press)
- 6. Gandomi AH, Yang X-S, Alavi AH (2011) Cuckoo search algorithm: a metaheuristic approach to solve structural optimization problems. Eng Comput 29:17–35
- 7. Gandomi AH, Yang X-S, Alavi AH (2011) Mixed variable structural optimization using Firefly Algorithm. Comput Struct 89:2325–2336
- 8. Gandomi AH, Yang X-S, Alavi AH, Talatahari S (2012) Bat algorithm for constrained optimization tasks. Neural Comput Appl 22:1239–1255
- 9. Gandomi AH, Alavi AH, Talatahari S (2013) Structural optimization using Krill Herd Algorithm. In: Yang XS et al (eds) Swarm intelligence and bio-inspired computation: theory and applications. Elsevier, London, pp 335–349
- 10. Gandomi AH, Talatahari S, Tadbiri F, Alavi AH (2013) Krill Herd Algorithm for optimum design of truss structures. Int J Bio-Inspired Comput 5(5):281–288
- 11. Gandomi M, Soltanpour M, Zolfaghari MR, Gandomi AH (2014) Prediction of peak ground acceleration of Iran's tectonic regions using a hybrid soft computing technique. Geosci Front. http://www.sciencedirect.com/science/article/pii/S1674987114001376
- 12. Grimaldi EM, Grimaccia F, Mussetta M, Zich RE (2004) PSO as an effective learning algorithm for neural network applications. In: Proceedings of the 3rd international conference on computational electromagnetics and its applications. IEEE, pp 557–560
- 13. Gudise VG, Venayagamoorthy GK (2003) Comparison of particle swarm optimization and backpropagation as training algorithms for neural networks. In: Proceedings of the swarm intelligence symposium. IEEE, New York, pp 110–117
- 14. Karaboga D, Ozturk C (2009) Neural networks training by artificial bee colony algorithm on pattern classification. Neural Netw World 19(3):279–292
- 15. Karaboga D, Akay B, Ozturk C (2007) Artificial bee colony (ABC) optimization algorithm for training feed-forward neural networks. In: Modeling decisions for artificial intelligence. Springer, Berlin, pp 318–329
- 16. Kattan A, Abdullah R (2013) Training feed-forward artificial neural networks for pattern-classification using the harmony search algorithm. In: The second international conference on digital enterprise and information systems (DEIS2013). The Society of Digital Information and Wireless Communication, pp 84–97
- 17. Kowalski PA (2013) Evolutionary strategy for the Fuzzy Flip-Flop neural networks supervised learning procedure. In: Rutkowski L, Korytkowski M, Scherer R, Tadeusiewicz R, Zadeh LA, Zurada JM (eds) Artificial intelligence and soft computing. Lecture notes in computer science, vol 7894. Springer, Berlin, pp 294–305
- 18. Kowalski PA, Łukasik S (2014) Experimental study of selected parameters of the Krill Herd algorithm. In: Filev D, Jabkowski J, Kacprzyk J, Popchev I, Rutkowski L, Sgurev V, Sotirova E, Szynkarczyk P, Zadrozny S (eds) Intelligent Systems' 2014. Springer, Berlin, pp 473–486
- 19. Lovassy R, Koczy LT, Gal L (2009) Optimizing Fuzzy Flip-Flop based neural networks by bacterial memetic algorithm. In: Proceedings of IFSA/EUSFLAT, Lisbon, pp 1508–1513
- 20. Łukasik S, Kulczycki P (2011) An algorithm for sample and data dimensionality reduction using fast simulated annealing. In: Tang J, King I, Chen L, Wang J (eds) ADMA 2011 Part I. Lecture notes in computer science, vol 7120. Springer, Heidelberg, pp 152–161
- 21. Łukasik S, Kowalski PA (2014) Fully informed swarm optimization algorithms: basic concepts, variants and experimental evaluation. In: Ganzha M, Maciaszek LA, Paprzycki M (eds) Annals of computer science and information systems, vol 2. IEEE, New York, pp 155–161
- 22. Łukasik S, Kowalski PA (2014) Study of flower pollination algorithm for continuous optimization. In: Filev D, Jabkowski J, Kacprzyk J, Popchev I, Rutkowski L, Sgurev V, Sotirova E, Szynkarczyk P, Zadrozny S (eds) Intelligent Systems' 2014. Springer, Berlin, pp 451–460
- 23. Łukasik S, Zak S (2009) Firefly algorithm for continuous constrained optimization tasks. In: Nguyen NT, Kowalczyk R, Chen SM (eds) Computational collective intelligence, semantic web, social networks and multiagent systems. Lecture notes in computer science, vol 5796. Springer, Berlin, pp 97–106
- 24. Wang GG, Gandomi AH, Alavi AH, Hao GS (2014) Hybrid Krill Herd Algorithm with differential evolution for global numerical optimization. Neural Comput Appl 25:297–308
- 25. Wang GG, Guo L, Wang H, Duan H, Liu L, Li J (2014) Incorporating mutation scheme into Krill Herd Algorithm for global numerical optimization. Neural Comput Appl 24:853–871
- 26. Werbos PJ (1994) The roots of backpropagation: from ordered derivatives to neural networks and political forecasting. Wiley-Interscience, New York
- 27. Yang X-S, Deb S (2013) Cuckoo search: recent advances and applications. Neural Comput Appl 24(1):169–174

## Copyright information

**Open Access** This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.