Pruning of recurrent neural models: an optimal brain damage approach

This paper considers the problem of pruning recurrent neural models of perceptron type with one hidden layer which may be used for modelling of dynamic system. In order to reduce the number of model parameters (i.e. the number of weights), the Optimal Brain Damage (OBD) pruning algorithm is adopted for the recurrent neural models. Efficiency of the OBD algorithm is demonstrated for pruning neural models of a neutralisation reactor benchmark process. For the considered neutralisation system, the OBD algorithm makes it possible to reduce as many as 60% of model parameters and reduce the validation error by some 30% when compared to the full (unpruned) models.

tems in order to make online decisions. Additionally, dynamic models are used in state estimation [6], simulation [7,8], time-series forecasting [9], numerical optimisation [10,11] and they are necessary for development of soft sensors [12]. Models are also necessary in recognition and interpretation of medical images [13]. That is why finding precise and uncomplicated models is the first step, but a fundamental one in development of the mentioned algorithms. In the mentioned advanced algorithms the model is used not only offline, during its development, but also for online calculations. For example, in Model Predictive Control (MPC) [2] an optimisation procedure calculates online at each sampling instant the best possible control policy considering future predictions of the dynamic model. A precise model results in excellent performance of the MPC controller, but, the opposite is also true, i.e. when the accuracy of the model is poor, the controller makes decisions using false predictions and the resulting control quality may be below expectations.
Two approaches may be used to find the model: modelling and identification. In the first case, all the phenomena taking place in the process must be described analytically which leads to a fundamental (first-principles) model [7,8]. Theoretically, the fundamental models have very good accuracy, but from a practical point of view they need many technological parameters, whose values may be difficult to determine. Moreover, in practice dynamic fundamental models may consist of many differential equations, whose online solution may be difficult and time-consuming in predictive control, fault detection and fault-tolerant control. That is why black-box models are frequently used in many applications. In such cases, the structure of the model is chosen arbitrarily and its parameters are optimised in such a way that the discrepancy between model output and a recorded set of data is minimised [14]. Taking into account that a good model should be not only precise, but also that it is desirable to have a model which may be easily used in the aforementioned algorithms [15], one may easily conclude that neural networks of different structures [16,17] are very good options. In particular, the recurrent neural models of perceptron type with one hidden layer [16,18] are successfully used for approximation of numerous dynamic systems, e.g. a polystyrene batch chemical reactor [19], an ethylene-ethane distillation column and a polymerisation reactor [1], a neutralisation reactor [20], a fluid catalytic cracking unit [21].
Unlike fundamental models, the neural ones have a very simple structure and they do not consist of differential as well as algebraic equations, which greatly simplifies their usage. On the other hand, the basic question is the number of hidden nodes, which affects the overall number of model parameters (weights). The higher number of model parameters, the better accuracy for the training data set, but, at the same time, the higher risk of low generalisation ability. This means that too complex neural models tend to approximate specific data sets rather than to mimic behaviour of the dynamic processes. A frequent approach used in practice is to train neural models and next to remove the weights of the lowest importance (the process of pruning). As a result of pruning, one obtains networks of good accuracy and good generalisation, which also have a low number of parameters. There are numerous pruning methods, e.g. the Tukey-Kramer multiple comparison procedure [22], pruning using cross-validation [23], the pruning method optimised with a Particle Swarm Optimisation (PSO) algorithm [24], Bayesian regularisation [25], pruning using Minimum Validation error regulariser [26], Optimal Brain Damage (OBD) [27], optimal brain surgeon [28], and other novel approaches [29]. In particular, the OBD algorithm is very effective as reported in the literature. The strategy used in this algorithm assumes deleting parameters (setting its value permanently to 0) which have the least effect on the training error of the model. It has been successfully used to prune models next used in different fields, e.g. for monitoring of exhaust valves [30], in classification of spectral features for automatic modulation recognition [31], in modelling two-mass drive system [25], in motor fault diagnosis [32], for load forecasting of a power system [33], for simultaneous determination of phenol isomers in binary mixtures [34], for microbial growth prediction in food [35]. The applications of the OBD algorithm reported in the literature are concerned with non-recurrent model configuration whereas in the case of dynamic systems the recurrent training mode is a straightforward option.
The motivation of this work is the necessity to obtain precise dynamic models capable of long-range prediction that have moderate number of parameters. In general, two model configurations are possible: serialparallel and parallel [36]. In the non-recurrent serialparallel one the model output signal is a function of the process input and output signal values from previous discrete sampling instants (real measurements). Hence, the serial-parallel model should be only used for onestep-ahead prediction. In the recurrent parallel configuration, the model output signal depends on its values at some previous sampling instants. Since in MPC, fault detection, fault-tolerant control, process optimisation and simulation it is necessary to calculate precise predictions of the process output variable over long horizons (for multiple steps ahead), it is obvious that for such applications the recurrent parallel model should be used, not the simple non-recurrent one. In the context of MPC, demonstration of this fact for linear models is discussed in [37,38], considerations for nonlinear models are given in [39,40]. Finally, it is essential to stress the fact that the models characterised by a moderate number of parameters are preferred. It is important not only because such models have good generalisation ability, but also because in practical applications resources of computational units used in online process control, fault detection and optimisation are typically limited and models with too many parameters are likely to slow down calculations repeated in real time.
Contribution of this work is twofold. Firstly, the rudimentary OBD pruning algorithm [27] is derived for a particular model of recurrent dynamic model-a neural network with one hidden layer. Implementation details of the algorithm are given. Since the focus is entirely on the recurrent neural network, this work is an extension of the original paper [27] in which the OBD algorithm has been introduced, but only static models have been considered. Secondly, effectiveness of the derived OBD algorithm for recurrent neural models is demonstrated for a neutralisation (pH) reactor which is a classical benchmark in process control. The process has significantly nonlinear steadystate and dynamic properties and is frequently used to compare dynamic models and their identification algorithms as well as advanced control methods (e.g. [41][42][43][44][45][46]). Detailed discussion how to use the described algorithm to obtain precise models with good generalisation ability is given. The remainder of the paper is organised in the following way. Firstly, in Sect. 2, the structure of the neural model is defined and its training algorithm is shortly discussed. Section 3, which is the main part of the paper, details the OBD algorithm for the recurrent neural models. Next, Sect. 4 presents simulation results concerned with training and pruning of recurrent neural models of a neutralisation process. Finally, Sect. 5 concludes the paper.

Structure of the model
It is assumed that the input variable of the dynamic process under consideration is denoted by u and the output variable is denoted by y. There are two possible configurations of dynamic models: serial-parallel and parallel [36]. In the non-recurrent serial-parallel structure the output signal of the model for the current sampling instant k, y mod (k), is a function f : R n A +n B −τ +1 → R of the process input and output signal values from some previous instants where the positive integers n A and n B define the order of model dynamics and τ ≤ n B is the time-delay. In the recurrent parallel model, which may be also named the simulation model, the past process outputs are replaced by the model outputs calculated at previous sampling instants, i.e. the output signal of the model is a function of the past process input signal values and of the output signal values calculated from the model at some previous instants The serial-parallel structure is a one-step-ahead predictor since the real measurements of the process output variable are necessary whereas the parallel structure is more appropriate not only for simulation, but also in applications where the very recent measurements are not available, e.g. in long-range prediction, including MPC algorithms, model-based fault detection and fault isolation. As the dynamic model of the process the neural network the structure of which is depicted in Fig. 1 is used. The most popular Multi-Layer Perceptron (MLP) feedforward neural network structure with two layers is used [16,17]. The network has n A + n B − τ + 1 input nodes, which corresponds with the arguments of the general parallel model (1), K nonlinear hidden nodes with the nonlinear transfer function ϕ : R → R, e.g. of the tanh type, one linear output element (sum) and one output y mod (k). The additional constant unitary inputs of the first and the second layers are biases. The weights of the first layer are denoted by w 1 i, j , where i = 1, . . . , K , j = 0, . . . , n A + n B − τ + 1, the weights of the second layer are denoted by w 2 i , where i = 0, . . . , K . The output signal of the neural network can be expressed by where the output signal of the ith hidden node is denoted by v i (k), z i (k) is the sum of input signals of the ith hidden node: and I u = n B − τ + 1. From Eqs. (2) and (3), the output signal of the network is Fig. 1 Structure of the neural model

Training of the model
The objective of training is to find such a set of weights so that the value of the model error has an acceptable minimal value. The error of the model is defined by the following sum of squared errors where y mod (k) is the output of the neural network for the sampling instant k, y(k) is an output signal of the real process (the training pattern), S = max{n A , n B } + 1, P is the number of training samples and all the weights form a vector of parameters During training, the model error function (5) is minimised. Due to the nonlinear hidden layer transfer function ϕ, this is an unconstrained nonlinear optimisation problem. The general gradient-based training algorithm, leading to minimisation of the model error function (5)  3. If the model error or a norm of its gradient satisfies a stopping criterion, the algorithm is stopped. 4. The optimisation direction p t is calculated. 5. The optimal step-length η t along the direction p t is calculated using, e.g. the golden section approach or the Armijo's rule [48]. 6. The model weights are updated w t+1 = w t + η t p t , the training algorithm goes to step 1.
The following stopping criteria may be used: -the training algorithm terminates when the maximal number of iterations is exceeded, i.e. when t > t max , -the algorithm terminates when the change of weights in two consecutive iterations is smaller than arbitrarily small quantity ε w > 0, i.e. when w t − w t−1 < ε w , -the algorithm terminates when the norm of the gradient of the minimised error function is small, i.e.
trarily small positive quantity.
The simplest approach to find the optimisation direction is to use the steepest-descent technique, in which the direction is opposite to the current gradient of the model error with respect to the optimised weights [47][48][49], i.e.
and I = n A + n B − τ + 1. Due to very slow convergence of the steepest-descent method, a quasi-Newton algorithms [47][48][49] are recommended in this work. In these algorithms, the direction is calculated from the general formula given by where H(w t ) is the Hessian matrix of the error function E(w t ), the structure of which is given by Eq. (6). Because analytical calculation of the inverse of the Hessian matrix, i.e. [H(w t )] −1 , is quite complex, it is not calculated analytically, but approximated numerically. In this work, a very efficient Broyden-Fletcher-Goldfarb-Shanno (BFGS) method is used [47][48][49]. In each iteration of the training (weight optimisation) where the increment of the weights vector is s t = w t − w t−1 , increment of the gradient vector of the weights vector is denoted by The gradients of the error function are determined analytically at each iteration of the training algorithm. Differentiating Eq. (5) with respect to the weights of the first and the second layer, one obtains where v n (k) = 1 f o r n = 0 ϕ(z n (k)) for n = 0 .
where the partial derivative ∂ϕ(z n (k)) ∂z n (k) depends on the type of the transfer function used. When ϕ(z n (k)) = tanh(z n (k)) Differentiating Eq. (3) with respect to the weights of the first layer gives (3) with respect to the weights of the second layer leads to The above formulae are universal for the considered recurrent neural model. When an alternative transfer function is used in place of the tanh function, it is only necessary to calculate the first-order derivatives ∂ϕ(z n (k)) ∂(z n (k)) , n = 1, . . . , K , used in Eqs. (10) for the specific transfer function.
Two data sets are used: the training data set and the validation set. The first set is used only for model training, i.e. the value of the error function E is minimised only for this set. In order to assess generalisation ability of the trained model, the value of the error is also calculated for the validation set. Model selection, e.g. among a few compared models of different structures and/or initial weights, is accomplished taking into account only the validation error.

Pruning of the neural dynamic model
In the Optimal Brain Damage (OBD) pruning algorithm [27] the weights of the neural network with small saliency [i.e. those whose removal have the least influence on the error (5)] are deleted. In order to do so, a local model of the error function is formulated and the effect of perturbing the weights is analysed. The minimised error function (5) is approximated by means of a Taylor series. A perturbation w of the weight vector w changes the error function by where w i denotes the perturbation of the ith weight, g i is the ith element of the gradient vector with respect to the weight w i , i.e. g i = ∂ E ∂w i , h i j denotes the element of the Hessian matrix, i.e. the second-order derivative of the error E, h i j = ∂ 2 E ∂w i ∂w j . It is not recommended to remove weights during training because small saliency of the network with respect to some weight may result from a temporary value of that weight or because the value of the network error is huge. Therefore, it is recommended to remove weights after completing training. In the OBD algorithm, it is assumed that a local or a global minimum of the error function E is reached. In such a case, all gradients of the error with respect to weights are (approximately) 0 and the first term of the right side in Eq. (14) may be neglected. It is also assumed that the Hessian matrix H (6) is positivedefinite and diagonally dominant. Hence, only the diagonal elements h ii are considered, off-diagonal elements are assumed to be 0. Therefore, in the OBD algorithm the following quadratic approximation is used in place of Eq. (14) In the OBD algorithm a weight of the network is removed when its saliency is small. For the first layer the saliency is where i = 1, . . . , K , j = 0, . . . , n A + n B − τ + 1 whereas for the second layer the saliency is where i = 0, . . . , K . To carry out the OBD procedure, the following steps must be considered: 1. The initial structure of the full network is selected and the network is trained (a local or a global minimum of the error function E is reached). 2. The second-order derivatives of the error function with respect to all the weights are calculated, i.e.
3. The saliency value (S 1 i, j and S 2 i ) for each model weight is calculated using Eqs. (15) and (16). 4. The weights are sorted by their saliency value and then some weights with the lowest saliency value are deleted. 5. The pruned network is retrained (a local or a global minimum of the error function E is reached).
6. The algorithm returns to step 2.
The saliency value of each weight is calculated using the training data set. After the training, the validation error is calculated and assessed. If its value before and after removing one or some more weights increases too much, the OBD algorithm is stopped and the network before the last pruning is used.

Process description
The process under consideration, shown schematically in Fig. 2, is a pH neutralisation reactor [50]. The reactor is a classical benchmark used for model identification and control, e.g. [20]. A base (NaOH) stream q 1 , a buffer (NaHCO 3 ) stream q 2 and an acid (HNO 3 ) stream q 3 are mixed in a constant volume tank. The output pH may be controlled by manipulating the base flow rate q 1 (ml/s). The buffer and acid streams are assumed to be constant (q 2 = 0.55 ml/s, q 3 = 16.60 ml/s). Therefore, the reactor has one input (manipulated) variable q 1 and one output (controlled) variable buffer (q 2 ) acid (q 3 ) base (q 1 ) pH pH V effluent solution Fig. 2 Schematic representation of the pH neutralisation process pH. The reactor is described by two continuous-time nonlinear ordinary differential equations and one algebraic output equation 0 = W a (t) + 10 pH(t)−14 − 10 −pH(t) + W b (t) 1 + 2 × 10 pH(t)−pK 2 1 + 10 pK 1 −pH(t) + 10 pH(t)−pK 2 .

Training of the initial full model
At first the fundamental model given by Eqs. (21), (22) and (23) is simulated for a random sequence of steps in the input variable, the sampling time is 10 s. As a result Table 1 Parameters of the fundamental model of the pH neutralisation process W a1 = − 3.05×10 −3 mol W b1 = 5 × 10 −5 mol V = 2900 ml W a2 = − 3 × 10 −2 mol W b2 = 3×10 −2 mol pK 1 = 6.35 W a3 = 3 × 10 −3 mol W b3 = 0 mol pK 2 = 10.25 In order to eliminate the problem with saturation of hidden nodes, the process variables q 1 and pH are scaled where in the initial operating point q 10 = 15.55 ml/s, pH 0 = 7. Analogously to some previous work [51,52], the second-order of model dynamics is used, i.e. τ = 1, n A = n B = 2. From Eq. (1), the recurrent model is described by the following general relation In this study two different configurations of the MLP neural model are considered: the network with 20 hidden nodes (Fig. 5a) and the network with 30 hidden nodes (Fig. 6a). The initial full models (i.e. with all weights) have quite a large number of parameters: the first structure has as many as 121 weights whereas the second one-181 weights. Because training is a nonlinear optimisation problem of the model error function (5), which may be badly affected by local minima, for each model configuration as many as 10 networks with different initial weights initialised randomly are trained and pruned. The results presented next show the best 3 networks for each model configuration. All models are trained using the BFGS nonlinear optimisation algorithm, the golden section procedure is used for step-length calculation. Because in the OBD algorithm it is assumed that before pruning the network is well trained, during training of the initial full models the maximal number of iterations of the BFGS algorithm is 2500 and the stopping criterion is defined as The chosen full neural models (trained) with 20 hidden nodes are denoted by N 1 20 , N 2 20 and N 3 20 , the full models with 30 nodes are denoted by N 1 30 , N 2 30 and N 3 30 . Training and validation errors for the full networks are given in Table 2. In general, the full networks are able to approximate behaviour of the process quite well, but it is interesting to note that although all networks have similar values of the training error, the validation error in some cases is significantly bigger. Such property of the networks may be associated with the fact that they have too many parameters as they tend to work fine only for the training data set. Figure 7 compares the validation data set and the outputs of the best initial full network with 20 hidden nodes-the network N 2 20 . Figure 8 depicts a similar comparison for the best network with 30 nodes, that is the structure N 1 30 . Inaccuracy of the models can be noticed especially for the samples 300-350 and 550-600. Figures 9 and 10 depict the difference between the output signal taken    Tables 3 and 4 give values of the training and validation errors for networks with removed selected numbers of weights. It is clear that deleting weights causes firstly drop and then the raise of both training and validation errors. This behaviour is expected, because each removal causes model to move from the point where the error function gradient is zeroed, and therefore escaping from a local minimum. What is more, there are less optimisation parameter, therefore there are less local minima of the error function which is minimised. Unfortunately removal of the weight might cause the model to be moved closer to local minimum which yields lower model quality-it is visible as rapid increase in the error values. Global optimisation algorithms might be used to try to obtain more stable behaviour. When the model is oversimplified for the process to be approximated accurately, then the error values start to climb up. During experiments, it is observed that the classical stopping criterion of the OBD algorithm (i.e. termination of the algorithm if the validation error grows) may be misleading. For example, for the initial network N 1 20 the validation error grows rapidly after removing 28 weights (Fig. 11a), which may suggest that there is no point in continuing the OBD algorithm. Nevertheless, it is continued and although the next iteration gives an additional increase in the error, after removing 30 weights the validation error drops rapidly, resulting in even better fitted model than the original one. Similar significant temporal increases in the validation error may be observed after removing 22 and 55 weights from the initial network N 2 20 (Fig. 11b) and after removing 45 weights from the network N 3 20 (Fig. 11c). That is why it is reasonable to continue to remove weights even if the validation error indicates that the whole pruning procedure should be stopped. Finally, when too many weights are removed from the network, the model error grows permanently which means that further pruning must be stopped. Table 5 includes training and validation error values calculated for fully pruned models (when there are no more weights that can be pruned and the model cannot be trained any further) and the number of resulting weights. As it was mentioned, although the number of parameters is low, the error values are unacceptable. It is a question how to choose a satisfying compromise between model accuracy and its simplicity. Taking that into consideration, only the models that are considered to have reasonable accuracy and a relatively low number of parameters are chosen. Training and validation errors of such models and their number of weights are given in Table 6. Their errors are significantly lower when compared to fully pruned models, and the numbers of weights left are maintained on the  Taking into account Tables 2 and 6, one may easily find changes of the number of parameters and model errors of the full and pruned networks, which are given in Table 7. In general, the OBD algorithm makes it possible to remove a big portion of weights and may give precise models. In particular, the best models N 3 20 and N 3 30 , when compared to their full versions, have approximately 60% less weights. It is also interesting to note that in case of the best models the OBD algorithm results in reduction in the validation error by some 30%. The approximation of the validation data set quality using models N 3 20 and N 3 30 is still almost as good as using unpruned networks. Their comparison with validation data set is depicted in Figs. 15 and 16 for structure N 3 20 and N 3 30 , respectively. It is worth noting, that there are models, where it is highly difficult to choose at which iteration the OBD algorithm should be stopped. The computational complexity and the memory that has to be used to store these models can be determined easily based on the model configuration. The complexity is closely related to the number of parameters, whereas the accuracy is difficult to predict. For example, the error values of the model N 1 30 slowly grow with each iteration of the OBD algorithm. There are almost only inaccurate and simple models or accurate and complex ones. Almost nothing in between. A similar problem can be faced for the model N 2 20 -even tough with each removed weight model complexity drops, the accuracy drops as well. On the other hand, there are models for which at the consecutive iterations of the OBD algorithm the model Table 3 Training (E t ) and validation (E v ) errors for the networks with K = 20 hidden nodes after removing the given number of weights Removed weights Initial network N 1 20 Initial network N 2 20 Initial network N 3   simplicity and accuracy grows at the same time. It is because removal of one weight acts like a small disturbance, that helps to leave a local minimum and get closer to a global one. Nevertheless, each of the mod-   and N 3 30 . It is interesting to note that the OBD algo- the iteration no. 95, there are only negative values, and zeros corresponding to removed weights. The saliency of 0 appears when the weight's value or the secondorder derivative of the error function with respect to this weight equals 0 as in Eqs. (15) and (16). The second condition takes place when, for example, a hidden neuron has no input signals (i.e. there are no connections between this node and any node in the first layer), then saliency of weight connecting this and summing nodes is 0. It is because it has no influence on the output signal, and so the second-order derivative of the error function equals 0. An analogous case takes place if there are no connections between the first layer's node, and any of the second layer's node. Worth mentioning is that if saliency equals 0, then that corresponding weight will be removed as fast as possible, what is consistent with intuition-weight linked with node that is of no use, should be removed. The use of the OBD algorithm requires that the diagonal of Hessian matrix (6) is positive-definite [so that minimum of error function (5) is achieved]. That implies that the saliency values are nonnegative and the removal of weight can be carried out. Computational constraints cause that reaching exact minimum is not always possible, and that cause the saliency values not to be the way we are expecting them to be. Nonetheless even with this kind of inaccuracy, implementation of OBD algorithm allows to obtain reasonable results.

Conclusions
This work describes derivation and implementation details of the OBD algorithm for pruning the recurrent dynamic neural models with one hidden layer. The neutralisation reactor benchmark process is considered to demonstrate effectiveness of the algorithm. The problem resulting from computational inaccuracy is shown, as well as its possible consequences. The models of two different architectures have been trained and pruned using the discussed implementation of the OBD algorithm. Considering only the best results, for the considered neutralisation process, reduction in the number of weights is approximately 60% and the validation error is some 30% smaller when compared to the full models. Choosing the model that has a moderate number of parameters and is precise is not a simple task. It requires a compromise between error values and total number of weights of the model. Although this procedure is time-consuming, it is worth repeating several times to achieve the best model configuration.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.