Pruning of recurrent neural models: an optimal brain damage approach
Abstract
This paper considers the problem of pruning recurrent neural models of the perceptron type with one hidden layer, which may be used for modelling of dynamic systems. In order to reduce the number of model parameters (i.e. the number of weights), the Optimal Brain Damage (OBD) pruning algorithm is adapted to the recurrent neural models. The efficiency of the OBD algorithm is demonstrated by pruning neural models of a neutralisation reactor benchmark process. For the considered neutralisation system, the OBD algorithm makes it possible to remove as many as 60% of the model parameters and to reduce the validation error by some 30% when compared to the full (unpruned) models.
Keywords
Neural networks · Dynamic systems · Model pruning · Model structure optimisation
1 Introduction
Advanced control methods [1, 2], fault detection techniques [3, 4] and fault-tolerant control approaches [5] directly use models of dynamic systems in order to make online decisions. Additionally, dynamic models are used in state estimation [6], simulation [7, 8], time-series forecasting [9] and numerical optimisation [10, 11], and they are necessary for the development of soft sensors [12]. Models are also necessary in the recognition and interpretation of medical images [13]. That is why finding precise and uncomplicated models is the first, but fundamental, step in the development of the mentioned algorithms. In these advanced algorithms the model is used not only offline, during its development, but also for online calculations. For example, in Model Predictive Control (MPC) [2] an optimisation procedure calculates online, at each sampling instant, the best possible control policy considering future predictions of the dynamic model. A precise model results in excellent performance of the MPC controller, but the opposite is also true: when the accuracy of the model is poor, the controller makes decisions using false predictions and the resulting control quality may be below expectations.
Two approaches may be used to find the model: modelling and identification. In the first case, all the phenomena taking place in the process must be described analytically, which leads to a fundamental (first-principles) model [7, 8]. Theoretically, fundamental models have very good accuracy, but from a practical point of view they need many technological parameters whose values may be difficult to determine. Moreover, in practice dynamic fundamental models may consist of many differential equations whose online solution may be difficult and time-consuming in predictive control, fault detection and fault-tolerant control. That is why black-box models are frequently used in many applications. In such cases, the structure of the model is chosen arbitrarily and its parameters are optimised in such a way that the discrepancy between the model output and a recorded set of data is minimised [14]. Taking into account that a good model should be not only precise but also easy to use in the aforementioned algorithms [15], one may conclude that neural networks of different structures [16, 17] are very good options. In particular, recurrent neural models of the perceptron type with one hidden layer [16, 18] have been successfully used for approximation of numerous dynamic systems, e.g. a polystyrene batch chemical reactor [19], an ethylene-ethane distillation column and a polymerisation reactor [1], a neutralisation reactor [20] and a fluid catalytic cracking unit [21].
Unlike fundamental models, neural ones have a very simple structure and do not consist of differential or algebraic equations, which greatly simplifies their usage. On the other hand, the basic question is the number of hidden nodes, which affects the overall number of model parameters (weights). The higher the number of model parameters, the better the accuracy for the training data set but, at the same time, the higher the risk of low generalisation ability. This means that too complex neural models tend to approximate specific data sets rather than to mimic the behaviour of the dynamic processes. A frequent approach used in practice is to train neural models and then remove the weights of the lowest importance (the process of pruning). As a result of pruning, one obtains networks of good accuracy and good generalisation, which also have a low number of parameters. There are numerous pruning methods, e.g. the Tukey–Kramer multiple comparison procedure [22], pruning using cross-validation [23], a pruning method optimised with a Particle Swarm Optimisation (PSO) algorithm [24], Bayesian regularisation [25], pruning using a minimum validation error regulariser [26], Optimal Brain Damage (OBD) [27], Optimal Brain Surgeon [28], and other novel approaches [29]. In particular, the OBD algorithm is very effective, as reported in the literature. The strategy used in this algorithm is to delete the parameters (setting their values permanently to 0) which have the least effect on the training error of the model. It has been successfully used to prune models later applied in different fields, e.g. for monitoring of exhaust valves [30], in classification of spectral features for automatic modulation recognition [31], in modelling a two-mass drive system [25], in motor fault diagnosis [32], for load forecasting of a power system [33], for simultaneous determination of phenol isomers in binary mixtures [34] and for microbial growth prediction in food [35].
The applications of the OBD algorithm reported in the literature are concerned with the non-recurrent model configuration, whereas in the case of dynamic systems the recurrent training mode is the straightforward option.
The motivation of this work is the necessity to obtain precise dynamic models capable of long-range prediction that have a moderate number of parameters. In general, two model configurations are possible: serial-parallel and parallel [36]. In the non-recurrent serial-parallel one, the model output signal is a function of the process input and output signal values from previous discrete sampling instants (real measurements). Hence, the serial-parallel model should only be used for one-step-ahead prediction. In the recurrent parallel configuration, the model output signal depends on its own values at some previous sampling instants. Since in MPC, fault detection, fault-tolerant control, process optimisation and simulation it is necessary to calculate precise predictions of the process output variable over long horizons (for multiple steps ahead), it is obvious that for such applications the recurrent parallel model should be used, not the simple non-recurrent one. In the context of MPC, a demonstration of this fact for linear models is given in [37, 38]; considerations for nonlinear models are given in [39, 40]. Finally, it is essential to stress that models characterised by a moderate number of parameters are preferred. This is important not only because such models have good generalisation ability, but also because in practical applications the resources of the computational units used in online process control, fault detection and optimisation are typically limited, and models with too many parameters are likely to slow down calculations repeated in real time.
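The difference between the two configurations can be sketched in a few lines of Python. A toy stand-in map f replaces the trained neural network, and all names here are illustrative, not taken from the paper:

```python
import numpy as np

# Toy stand-in for a trained one-step model y(k) = f(u(k-1), y(k-1));
# the map below is illustrative, not the paper's neural network.
def f(u_prev, y_prev):
    return 0.5 * y_prev + np.tanh(u_prev)

def serial_parallel(u, y_meas):
    """One-step-ahead prediction: past *measured* outputs are fed in."""
    return np.array([f(u[k - 1], y_meas[k - 1]) for k in range(1, len(u))])

def parallel(u, y0):
    """Recurrent (multi-step) simulation: the model's own past output is fed back."""
    y = [y0]
    for k in range(1, len(u)):
        y.append(f(u[k - 1], y[-1]))
    return np.array(y)

u = np.linspace(-1.0, 1.0, 20)   # example input trajectory
y_sim = parallel(u, y0=0.0)      # long-range prediction, as needed in MPC
```

The parallel simulation never consults process measurements after the initial condition, which is exactly why it is the right configuration for multi-step prediction.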
The contribution of this work is twofold. Firstly, the rudimentary OBD pruning algorithm [27] is derived for a particular class of recurrent dynamic models: neural networks with one hidden layer. Implementation details of the algorithm are given. Since the focus is entirely on the recurrent neural network, this work is an extension of the original paper [27], in which the OBD algorithm was introduced but only static models were considered. Secondly, the effectiveness of the derived OBD algorithm for recurrent neural models is demonstrated for a neutralisation (pH) reactor, which is a classical benchmark in process control. The process has significantly nonlinear steady-state and dynamic properties and is frequently used to compare dynamic models and their identification algorithms as well as advanced control methods (e.g. [41, 42, 43, 44, 45, 46]). A detailed discussion of how to use the described algorithm to obtain precise models with good generalisation ability is given. The remainder of the paper is organised as follows. Firstly, in Sect. 2, the structure of the neural model is defined and its training algorithm is briefly discussed. Section 3, which is the main part of the paper, details the OBD algorithm for the recurrent neural models. Next, Sect. 4 presents simulation results concerned with training and pruning of recurrent neural models of a neutralisation process. Finally, Sect. 5 concludes the paper.
2 Neural dynamic model
2.1 Structure of the model
2.2 Training of the model
0. Initialisation of the weights \(\varvec{w}\): random values are usually chosen from the range \([-1,1]\).
1. The model output signal \(y^{\mathrm {mod}}(k)\) for the sampling instants \(k=S,\ldots ,P\) and for the current weights \(\varvec{w}\) is calculated from Eq. (4).
2. The model error for the whole training data set is calculated from Eq. (5).
3. If the model error or a norm of its gradient satisfies a stopping criterion, the algorithm is stopped.
4. The optimisation direction \(\varvec{p}_{t}\) is calculated.
5. The optimal step length \(\eta _{t}\) along the direction \(\varvec{p}_{t}\) is calculated using, e.g., the golden-section approach or Armijo's rule [48].
6. The model weights are updated, \(\varvec{w}_{t+1}=\varvec{w}_{t}+{\eta _{t}}\varvec{p}_{t}\), and the training algorithm goes to step 1.

In step 3, the following stopping criteria are used:
- the training algorithm terminates when the maximal number of iterations is exceeded, i.e. when \(t > t_\mathrm {max}\),
- the algorithm terminates when the change of the weights in two consecutive iterations is smaller than an arbitrarily small quantity \(\varepsilon _{\Delta \varvec{w}}>0\), i.e. when \(\left\| \varvec{w}_t - \varvec{w}_{t-1}\right\| < \varepsilon _{\Delta \varvec{w}}\),
- the algorithm terminates when the norm of the gradient of the minimised error function is small, i.e. when \(\left\| \left. \frac{\mathrm {d} E(\varvec{w})}{\mathrm {d} \varvec{w}}\right|_{\varvec{w} = \varvec{w}_t} \right\| < \varepsilon _{\nabla E}\), where \(\varepsilon _{\nabla E}\) is an arbitrarily small positive quantity.
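The training loop above can be sketched as follows. A toy quadratic error stands in for the model error of Eq. (5), and a fixed step length replaces the BFGS direction and line search; all names are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy quadratic error standing in for the model error of Eq. (5);
# its gradient is known in closed form here.
target = np.array([1.0, -2.0, 0.5])     # illustrative "perfect" weights
def error(w):    return 0.5 * np.sum((w - target) ** 2)
def gradient(w): return w - target

t_max, eps_dw, eps_grad = 2500, 1e-9, 1e-9
w = rng.uniform(-1.0, 1.0, size=3)      # step 0: random initialisation in [-1, 1]

for t in range(t_max):                  # criterion 1: maximal number of iterations
    g = gradient(w)
    if np.linalg.norm(g) < eps_grad:    # criterion 3: small gradient norm
        break
    p = -g                              # step 4: steepest-descent direction
    eta = 0.5                           # step 5: fixed step length (the paper uses a line search)
    w_new = w + eta * p                 # step 6: weight update
    if np.linalg.norm(w_new - w) < eps_dw:  # criterion 2: small weight change
        w = w_new
        break
    w = w_new
```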
Two data sets are used: the training data set and the validation set. The first set is used only for model training, i.e. the value of the error function E is minimised only for this set. In order to assess the generalisation ability of the trained model, the value of the error is also calculated for the validation set. Model selection, e.g. among a few compared models of different structures and/or initial weights, is accomplished taking into account only the validation error.
3 Pruning of the neural dynamic model
1. The initial structure of the full network is selected and the network is trained (a local or a global minimum of the error function E is reached).
2. The second-order derivatives of the error function with respect to all the weights are calculated, i.e. \(\frac{\partial ^2 E}{\partial (w_{i,j}^1)^2}\) for \(i=1,\ldots ,K\), \(j=0,\ldots ,n_{\mathrm {A}}+n_{\mathrm {B}}-\tau +1\) and \(\frac{\partial ^2 E}{\partial (w_i^2)^2}\) for \(i=0,\ldots ,K\).
3. The saliency values (\(S^1_{i,j}\) and \(S^2_i\)) for each model weight are calculated using Eqs. (15) and (16).
4. The weights are sorted by their saliency values and some weights with the lowest saliency are deleted.
5. The pruned network is retrained (a local or a global minimum of the error function E is reached).
6. The algorithm returns to step 2.
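A minimal sketch of this loop, assuming a toy least-squares model in place of the recurrent network and a mask marking deleted weights; the saliency formulas of Eqs. (15) and (16) are replaced by the generic OBD saliency \(S_i=\frac{1}{2}\,h_{ii}w_i^2\), where \(h_{ii}\) is a diagonal entry of the Hessian:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy least-squares model standing in for the trained network: only the
# first two of five weights matter, so OBD should prune the other three.
X = rng.normal(size=(200, 5))
y = 2.0 * X[:, 0] - 1.0 * X[:, 1]

def error(w):
    r = X @ w - y
    return 0.5 * np.sum(r ** 2)

def train(w, mask, iters=300, eta=0.002):
    """Gradient descent restricted to the still-present (mask == 1) weights."""
    for _ in range(iters):
        w = w - eta * (X.T @ (X @ w - y)) * mask
    return w * mask

mask = np.ones(5)                       # 1 = weight kept, 0 = weight deleted
w = train(rng.uniform(-1.0, 1.0, 5), mask)   # step 1: train the full model

for _ in range(3):                      # three passes of steps 2-6
    h_diag = np.sum(X ** 2, axis=0)     # step 2: diagonal second derivatives (exact here)
    saliency = 0.5 * h_diag * w ** 2    # step 3: generic OBD saliency
    saliency[mask == 0] = np.inf        # ignore already-deleted weights
    mask[np.argmin(saliency)] = 0.0     # step 4: delete the least salient weight
    w = train(w, mask)                  # step 5: retrain the pruned model
```

After three passes only the two informative weights survive, illustrating why retraining between deletions lets OBD remove a large fraction of parameters without harming accuracy.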
4 Simulation results
4.1 Process description
Parameters of the fundamental model of the pH neutralisation process
\(W_{\mathrm {a}_1}=-3.05\times 10^{-3} \ \mathrm {mol}\)  \(W_{\mathrm {b}_1}=5\times 10^{-5} \ \mathrm {mol}\)  \(V=2900 \ \mathrm {ml}\)
\(W_{\mathrm {a}_2}=-3\times 10^{-2} \ \mathrm {mol}\)  \(W_{\mathrm {b}_2}=3\times 10^{-2} \ \mathrm {mol}\)  \(\mathrm {pK}_1=6.35\)
\(W_{\mathrm {a}_3}=3\times 10^{-3} \ \mathrm {mol}\)  \(W_{\mathrm {b}_3}=0 \ \mathrm {mol}\)  \(\mathrm {pK}_2=10.25\)
4.2 Training of the initial full model
In this study, two different configurations of the MLP neural model are considered: a network with 20 hidden nodes (Fig. 5a) and a network with 30 hidden nodes (Fig. 6a). The initial full models (i.e. with all weights) have quite a large number of parameters: the first structure has as many as 121 weights, the second one 181 weights. Because training is a nonlinear optimisation problem of the model error function (5), which may be badly affected by local minima, for each model configuration as many as 10 networks with randomly initialised weights are trained and pruned. The results presented next show the best 3 networks for each model configuration. All models are trained using the BFGS nonlinear optimisation algorithm; the golden-section procedure is used for step-length calculation. Because the OBD algorithm assumes that the network is well trained before pruning, during training of the initial full models the maximal number of iterations of the BFGS algorithm is 2500 and the stopping criterion is \(\left\| \left. \frac{\mathrm {d} E(\varvec{w})}{\mathrm {d} \varvec{w}}\right|_{\varvec{w} = \varvec{w}_t} \right\| \le 10^{-9}\) or \(\left\| \varvec{w}_t - \varvec{w}_{t-1}\right\| \le 10^{-9}\).
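The golden-section step-length procedure mentioned above can be sketched as follows. This is a minimal version for a unimodal function on a bracketed interval; the bracket [0, 1] and the tolerance are illustrative choices, not the paper's settings:

```python
import math

def golden_section(phi, a=0.0, b=1.0, tol=1e-8):
    """Minimise a unimodal one-dimensional function phi on [a, b]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0      # golden-ratio conjugate, ~0.618
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if phi(c) < phi(d):                     # minimum lies in [a, d]
            b, d = d, c
            c = b - inv_phi * (b - a)
        else:                                   # minimum lies in [c, b]
            a, c = c, d
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

# Example: step length along direction p = -1 starting from w = 1 for
# E(w) = w^2; the minimiser of (1 - eta)^2 on [0, 1] is eta = 1.
eta = golden_section(lambda t: (1.0 - t) ** 2)
```

In training, phi(t) would be the model error evaluated at \(\varvec{w}_t + t\,\varvec{p}_t\), so each iteration of the search costs one error evaluation.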
Training (\(E_{\mathrm {t}}\)) and validation (\(E_{\mathrm {v}}\)) errors for the initial full networks
Network  No. of weights  \(E_{\mathrm {t}}\)  \(E_{\mathrm {v}}\) 

\(N_{20}^{1}\)  121  0.1785  1.0763 
\(N_{20}^{2}\)  121  0.1907  0.8960 
\(N_{20}^{3}\)  121  0.2094  1.2360 
\(N_{30}^{1}\)  181  0.1672  0.5958 
\(N_{30}^{2}\)  181  0.2505  3.2746 
\(N_{30}^{3}\)  181  0.1839  1.1960 
4.3 Model pruning
Training (\(E_{\mathrm {t}}\)) and validation (\(E_{\mathrm {v}}\)) errors for the networks with \(K=20\) hidden nodes after removing the given number of weights
Removed weights  Initial network \(N_{20}^{1}\)  Initial network \(N_{20}^{2}\)  Initial network \(N_{20}^{3}\)  

\(E_{\mathrm {t}}\)  \(E_{\mathrm {v}}\)  \(E_{\mathrm {t}}\)  \(E_{\mathrm {v}}\)  \(E_{\mathrm {t}}\)  \(E_{\mathrm {v}}\)  
10  0.1538  0.7704  0.1710  0.7555  0.1886  0.9096 
20  0.1488  0.6015  0.1662  0.6708  0.1901  1.4407 
30  0.1554  1.4820  0.1771  0.4896  0.2280  0.6061 
40  0.1870  0.8471  0.3548  1.3461  0.1887  0.3792 
50  0.1805  0.7399  0.2785  1.7369  0.2390  0.5405 
60  0.1978  0.8347  5002.5945  13622.8370  0.3975  1.2945 
70  0.2628  2.5504  0.8618  4.1967  0.4308  1.1162 
80  0.2720  1.7519  –  –  –  – 
90  882.8008  687.5612  –  –  –  – 
Training (\(E_{\mathrm {t}}\)) and validation (\(E_{\mathrm {v}}\)) errors for the networks with \(K=30\) hidden nodes after removing the given number of weights
Removed weights  Initial network \(N_{30}^{1}\)  Initial network \(N_{30}^{2}\)  Initial network \(N_{30}^{3}\)  

\(E_{\mathrm {t}}\)  \(E_{\mathrm {v}}\)  \(E_{\mathrm {t}}\)  \(E_{\mathrm {v}}\)  \(E_{\mathrm {t}}\)  \(E_{\mathrm {v}}\)  
10  0.1610  0.5179  0.2249  2.3317  0.1614  0.6359 
20  0.1658  0.3658  0.2174  2.2419  0.1567  0.5921 
30  0.1583  0.3528  0.2093  1.9629  0.1468  0.4569 
40  0.1490  0.2654  0.1996  0.9401  0.1376  0.5026 
50  0.1421  0.3178  0.1932  1.0448  0.1355  0.4300 
60  0.1729  0.6745  0.1744  1.1082  0.1292  0.3667 
70  0.1688  0.5825  1.2428  10.6029  0.1250  0.2689 
80  0.1783  0.6548  –  –  0.2060  0.3978 
90  0.1755  0.6589  –  –  0.2206  0.4772 
100  0.2275  1.2958  –  –  0.3265  0.8097 
110  0.2059  2.2145  –  –  7765.5044  7093.5566 
120  0.2429  2.8449  –  –  –  – 
130  0.9478  8.9890  –  –  –  – 
Training (\(E_{\mathrm {t}}\)) and validation (\(E_{\mathrm {v}}\)) errors for the fully pruned networks
Initial network  No. of weights  \(E_{\mathrm {t}}\)  \(E_{\mathrm {v}}\) 

\(N_{20}^{1}\)  26  1086.5534  674.4024 
\(N_{20}^{2}\)  44  17458.9152  15841.2064 
\(N_{20}^{3}\)  47  11993.3168  161.5995 
\(N_{30}^{1}\)  44  2071.2067  1266.7459 
\(N_{30}^{2}\)  108  2562.9705  760.8429 
\(N_{30}^{3}\)  70  12110.3688  7067.0067 
Training (\(E_{\mathrm {t}}\)) and validation (\(E_{\mathrm {v}}\)) errors for the best pruned networks
Initial network  No. of weights  \(E_{\mathrm {t}}\)  \(E_{\mathrm {v}}\) 

\(N_{20}^{1}\)  40  0.2748  1.6894 
\(N_{20}^{2}\)  46  0.8542  4.7684 
\(N_{20}^{3}\)  52  0.4141  0.9105 
\(N_{30}^{1}\)  50  1.2299  5.7530 
\(N_{30}^{2}\)  114  0.1695  1.0717 
\(N_{30}^{3}\)  80  0.3260  0.8157 
Percentage of removed weights, and ratio (in %) of the best pruned networks' training (\(E^\mathrm {p}_{\mathrm {t}}\)) and validation (\(E^\mathrm {p}_{\mathrm {v}}\)) errors to those of the initial full networks (\(E^\mathrm {f}_{\mathrm {t}}\), \(E^\mathrm {f}_{\mathrm {v}}\))
Initial network  Removed weights (%)  \(\left( E^\mathrm {p}_{\mathrm {t}}/E^\mathrm {f}_{\mathrm {t}}\right) \times 100\%\)  \(\left( E^\mathrm {p}_{\mathrm {v}}/E^\mathrm {f}_{\mathrm {v}}\right) \times 100\%\)

\(N_{20}^{1}\)  66.94  153.95  156.96 
\(N_{20}^{2}\)  61.98  447.93  532.19 
\(N_{20}^{3}\)  57.02  197.76  73.67 
\(N_{30}^{1}\)  72.38  735.59  965.59 
\(N_{30}^{2}\)  37.02  67.66  32.73 
\(N_{30}^{3}\)  55.80  177.27  68.20 
Figure 5b–d depicts the architectures of the pruned networks with 20 hidden nodes, i.e. the networks \(N_{20}^{1}\), \(N_{20}^{2}\) and \(N_{20}^{3}\), whereas Fig. 6b–d depicts the pruned networks with 30 hidden nodes, i.e. the networks \(N_{30}^{1}\), \(N_{30}^{2}\) and \(N_{30}^{3}\). It is interesting to note that the OBD algorithm works in an intelligent way: if all weights in the first layer connected to some hidden node are deleted, it also removes the corresponding weight in the second layer.
In this work, the OBD algorithm was run as long as there were saliency values greater than or equal to 0. Since the saliency values of removed weights are set to 0, the vector of saliency values at the end of model pruning consists only of negative values and of zeros corresponding to the removed weights, as shown in Figs. 17 and 18. Most of the time the vector of saliency values contains non-negative entries only (Fig. 18); however, as in Fig. 17, after 60 iterations of the OBD algorithm there are only non-positive values in the saliency vector. The algorithm does not stop there, because there are weights whose saliency equals 0 but which have not yet been removed; one of them is removed in the next iteration. At iteration 95 there are only negative values and the zeros corresponding to the removed weights. A saliency of 0 appears when the weight's value or the second-order derivative of the error function with respect to this weight equals 0, as follows from Eqs. (15) and (16). The second case occurs when, for example, a hidden neuron has no input signals (i.e. there are no connections between this node and any node in the first layer); the saliency of the weight connecting this node and the summing node is then 0, because it has no influence on the output signal, and so the second-order derivative of the error function equals 0. An analogous case takes place if there are no connections between a first-layer node and any second-layer node. It is worth mentioning that a weight whose saliency equals 0 is removed as soon as possible, which is consistent with intuition: a weight linked with a node that is of no use should be removed.
The use of the OBD algorithm requires the diagonal of the Hessian matrix (6) to be positive definite (which holds when a minimum of the error function (5) is reached). The saliency values are then non-negative and weight removal can be carried out. Because of computational limitations, reaching the exact minimum is not always possible, and the saliency values may then differ from what is expected. Nonetheless, even with this kind of inaccuracy, the OBD algorithm yields reasonable results.
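A minimal numerical illustration of this effect, with hypothetical diagonal-Hessian estimates (one of them negative, as may happen away from the exact minimum; the values are invented for illustration):

```python
import numpy as np

# Hypothetical diagonal-Hessian estimates at an *inexact* minimum: the
# third entry is negative, so the corresponding saliency is negative too.
h_diag  = np.array([4.0, 0.0, -0.5, 2.0])   # illustrative values
weights = np.array([0.3, 1.2, 0.8, 0.0])

saliency = 0.5 * h_diag * weights ** 2      # generic OBD saliency

removable = saliency >= 0.0   # OBD only deletes weights with non-negative saliency
```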
5 Conclusions
This work describes the derivation and implementation details of the OBD algorithm for pruning recurrent dynamic neural models with one hidden layer. The neutralisation reactor benchmark process is used to demonstrate the effectiveness of the algorithm. The problem resulting from computational inaccuracy is discussed, as well as its possible consequences. Models of two different architectures have been trained and pruned using the described implementation of the OBD algorithm. Considering only the best results for the neutralisation process, the number of weights is reduced by approximately 60% and the validation error is some 30% smaller when compared to the full models.
Choosing a model that is precise and has a moderate number of parameters is not a simple task: it requires a compromise between the error values and the total number of weights of the model. Although this procedure is time-consuming, it is worth repeating several times to achieve the best model configuration.
References
1. Ławryńczuk, M.: Computationally Efficient Model Predictive Control Algorithms: A Neural Network Approach. Studies in Systems, Decision and Control, vol. 3. Springer, Heidelberg (2014)
2. Tatjewski, P.: Advanced Control of Industrial Processes: Structures and Algorithms. Springer, London (2007)
3. Korbicz, J., Koscielny, J.M., Kowalczuk, Z., Cholewa, W.: Fault Diagnosis: Models, Artificial Intelligence, Applications. Springer, London (2004)
4. Witczak, M.: Modelling and estimation strategies for fault diagnosis of nonlinear systems: from analytical to soft computing approaches. In: Lecture Notes in Control and Information Sciences, vol. 354. Springer, Berlin (2007)
5. Witczak, M.: Fault diagnosis and fault-tolerant control strategies for nonlinear systems: analytical and soft computing approaches. In: Lecture Notes in Electrical Engineering, vol. 266. Springer, Berlin (2014)
6. Simon, D.: Optimal State Estimation: Kalman, \({\rm H}_{\infty }\) and Nonlinear Approaches. Wiley, Hoboken (2006)
7. Luyben, W.L.: Process Modelling, Simulation and Control for Chemical Engineers. McGraw-Hill, New York (1990)
8. Marlin, T.E.: Process Control. McGraw-Hill, New York (1995)
9. Palit, A.K., Popovic, D.: Computational Intelligence in Time Series Forecasting: Theory and Engineering Applications. Springer, Berlin (2005)
10. Yan, Z., Wang, J.: Nonlinear model predictive control based on collective neurodynamic optimization. IEEE Trans. Neural Netw. Learn. Syst. 26, 840–850 (2015)
11. Yan, Z., Wang, J.: Robust model predictive control of nonlinear systems with unmodeled dynamics and bounded uncertainties based on neural networks. IEEE Trans. Neural Netw. Learn. Syst. 25, 457–469 (2014)
12. Fortuna, L., Graziani, S., Rizzo, A., Xibilia, M.G.: Soft Sensors for Monitoring and Control of Industrial Processes. Springer, Berlin (2007)
13. Ogiela, M., Tadeusiewicz, R.: Modern Computational Intelligence Methods for the Interpretation of Medical Images. Studies in Computational Intelligence, vol. 84. Springer, Heidelberg (2006)
14. Nelles, O.: Nonlinear System Identification. From Classical Approaches to Neural Networks and Fuzzy Models. Springer, Berlin (2001)
15. Pearson, R.K.: Selecting nonlinear model structures for computer control. J. Process Control 13, 1–26 (2003)
16. Haykin, S.: Neural Networks and Learning Machines. Prentice Hall, Upper Saddle River (2009)
17. Ripley, B.D.: Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge (1996)
18. Mandic, D.P., Chambers, J.: Recurrent Neural Networks for Prediction: Learning Algorithms, Architectures and Stability. Wiley, New York (2001)
19. Hosen, M.A., Hussain, M.A., Mjalli, F.S.: Control of polystyrene batch reactors using neural network based model predictive control (NNMPC): an experimental investigation. Control Eng. Pract. 19, 454–467 (2011)
20. Ławryńczuk, M.: Practical nonlinear predictive control algorithms for neural Wiener models. J. Process Control 23, 696–714 (2013)
21. Vieira, W.G., Santos, V.M.L., Carvalho, F.R., Pereira, J.A.F.R., Fileti, A.M.F.: Identification and predictive control of a FCC unit using a MIMO neural model. Chem. Eng. Process. 44, 855–868 (2005)
22. De Carvalho, R.M., Mello, C., Kubota, L.T.: Simultaneous determination of phenol isomers in binary mixtures by differential pulse voltammetry using carbon fibre electrode and neural network with pruning as a multivariate calibration tool. Anal. Chim. Acta 420(1), 109–121 (2000)
23. Ghani, N., Lamontagne, R.: Neural networks applied to the classification of spectral features for automatic modulation recognition. In: Military Communications Conference, MILCOM '93, Conference Record, IEEE, vol. 1, pp. 111–115 (1993)
24. Giles, C.L., Omlin, C.W.: Pruning recurrent neural networks for improved generalization performance. IEEE Trans. Neural Netw. 5(5), 848–851 (1994)
25. Hassibi, B., Stork, D.G., Wolff, G.J.: Optimal brain surgeon and general network pruning. In: IEEE International Conference on Neural Networks, vol. 1, pp. 293–299 (1993)
26. Hintz-Madsen, M., Hansen, L.K., Larsen, J., With Pedersen, M., Larsen, M.: Neural classifier construction using regularization, pruning and test error estimation. Neural Netw. 11(9), 1659–1670 (1998)
27. Le Cun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Touretzky, D. (ed.) Advances in NIPS 2, pp. 598–605. Morgan Kaufmann, San Mateo (1990)
28. Goh, Y.S., Tan, E.C.: Pruning neural networks during training by backpropagation. In: Proceedings of TENCON '94, IEEE Region 10's 9th Annual International Conference: 'Frontiers of Computer Technology', vol. 2, pp. 805–808 (1994)
29. Mauch, L., Yang, B.: A novel layerwise pruning method for model reduction of fully connected deep neural networks. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2382–2386 (2017)
30. Endisch, C., Stolze, P., Endisch, P., Hackl, C., Kennel, R.: Levenberg–Marquardt-based OBS algorithm using adaptive pruning interval for system identification with dynamic neural networks. In: IEEE International Conference on Systems, Man and Cybernetics, SMC 2009, pp. 3402–3408 (2009)
31. Fog, T.L., Larsen, J., Hansen, L.K.: Training and evaluation of neural networks for multivariate time series processing. In: Proceedings of the IEEE International Conference on Neural Networks, vol. 2, pp. 1194–1199 (1995)
32. Huynh, T.Q., Setiono, R.: Effective neural network pruning using cross-validation. In: Proceedings of the International Joint Conference on Neural Networks, vol. 2, pp. 972–977 (2005)
33. Kaminski, M., Orlowska-Kowalska, T.: Comparison of Bayesian regularization and optimal brain damage methods in optimization of neural estimators for two-mass drive system. In: IEEE International Symposium on Industrial Electronics (ISIE), pp. 102–107 (2010)
34. Setiono, R., Gaweda, A.: Neural network pruning for function approximation. In: Proceedings of the IEEE-INNS-ENNS International Joint Conference on Neural Networks, IJCNN 2000, vol. 6, pp. 443–448 (2000)
35. Silvestre, M.R., Ling, L.L.: Pruning methods to MLP neural networks considering proportional apparent error rate for classification problems with unbalanced data. Measurement 56, 88–94 (2014)
36. Narendra, K.S., Parthasarathy, K.: Identification and control of dynamical systems using neural networks. IEEE Trans. Neural Netw. 1, 4–27 (1990)
37. Shook, D.S., Mohtadi, C., Shah, S.L.: Identification for long-range predictive control. IEE Proc. D Control Theory Appl. 138(1), 75–84 (1991)
38. Shook, D.S., Mohtadi, C., Shah, S.L.: A control-relevant identification strategy for GPC. IEEE Trans. Autom. Control 37(7), 975–980 (1992)
39. Ławryńczuk, M., Tatjewski, P.: Nonlinear predictive control based on neural multi-models. Int. J. Appl. Math. Comput. Sci. 20(1), 7–21 (2010)
40. Ławryńczuk, M.: Training of neural models for predictive control. Neurocomputing 73(7), 1332–1343 (2010)
41. Böling, J.M., Seborg, D.E., Hespanha, J.P.: Multi-model adaptive control of a simulated pH neutralization process. Control Eng. Pract. 15(6), 663–672 (2007)
42. Grancharova, A., Kocijan, J., Johansen, T.A.: Explicit output-feedback nonlinear predictive control based on black-box models. Eng. Appl. Artif. Intell. 24(2), 388–397 (2011)
43. Henson, M.A., Seborg, D.E.: Adaptive nonlinear control of a pH neutralization process. IEEE Trans. Control Syst. Technol. 2(3), 169–182 (1994)
44. Karasakal, O., Guzelkaya, M., Eksin, I., Yesil, E., Kumbasar, T.: Online tuning of fuzzy PID controllers via rule weighing based on normalized acceleration. Eng. Appl. Artif. Intell. 26(1), 184–197 (2013)
45. Kumbasar, T., Eksin, I., Guzelkaya, M., Yesil, E.: Type-2 fuzzy model based controller design for neutralization processes. ISA Trans. 51(2), 277–287 (2012)
46. Oblak, S., Škrjanc, I.: Continuous-time Wiener-model predictive control of a pH process based on a PWL approximation. Chem. Eng. Sci. 65(5), 1720–1728 (2010)
47. Bonnans, J.F., Gilbert, J.C., Lemarechal, C., Sagastizabal, C.A.: Numerical Optimization: Theoretical and Practical Aspects. Springer, Berlin (2006)
48. Nocedal, J., Wright, S.J.: Numerical Optimization. Springer, Berlin (2006)
49. Ruszczyński, A.: Nonlinear Optimization. Princeton University Press, Princeton (2006)
50. Gómez, J.C., Jutan, A., Baeyens, E.: Wiener model identification and predictive control of a pH neutralisation process. IEE Proc. Part D Control Theory Appl. 151, 329–338 (2004)
51. Ławryńczuk, M.: Modelling and predictive control of a neutralisation reactor using sparse support vector machine Wiener models. Neurocomputing 205, 311–328 (2016)
52. Yang, Y., Wu, Q.: A neural network PID control for pH neutralization process. In: 2016 35th Chinese Control Conference (CCC), pp. 3480–3483 (2016)
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.