Parameter Estimation and Optimization

  • Jun Zhao
  • Wei Wang
  • Chunyang Sheng
Part of the Information Fusion and Data Science book series (IFDS)


The selection of parameters or hyper-parameters has a great impact on the performance of a data-driven model. This chapter introduces some commonly used parameter optimization and estimation methods, including gradient-based methods (e.g., gradient descent, Newton's method, and the conjugate gradient method) and intelligent optimization methods (e.g., the genetic algorithm, the differential evolution algorithm, and particle swarm optimization). In particular, the conjugate gradient method is employed in this chapter to optimize the hyper-parameters of an LSSVM model based on noise estimation, which helps to alleviate the impact of noise on the performance of the LSSVM. As for dynamic models, this chapter introduces nonlinear Kalman filter methods for parameter estimation; well-known variants include the extended Kalman filter, the unscented Kalman filter, and the cubature Kalman filter. Here, a dual estimation model based on two Kalman filters is illustrated, which simultaneously estimates the uncertainties of the internal states and the output. In addition, probabilistic methods for parameter estimation are introduced, where a Bayesian model, in particular a variational inference framework, is elaborated in detail. Within this framework, a variational relevance vector machine (RVM) model based on an automatic relevance determination kernel is introduced, which provides approximate posterior distributions over the kernel parameters. Finally, several case studies employing industrial data are presented.
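To make the gradient-based family concrete, the following is a minimal Python sketch of the Fletcher–Reeves nonlinear conjugate gradient method with a backtracking (Armijo) line search, the kind of routine that could drive the hyper-parameter optimization described above. It is an illustrative textbook version, not the chapter's implementation: the function names, the line-search constants, and the quadratic test objective are all assumptions chosen for the example.

    import numpy as np

    def conjugate_gradient_fr(f, grad, x0, tol=1e-6, max_iter=200):
        # Fletcher-Reeves nonlinear conjugate gradient with Armijo backtracking.
        x = np.asarray(x0, dtype=float)
        g = grad(x)
        d = -g                                   # first direction: steepest descent
        for _ in range(max_iter):
            if np.linalg.norm(g) < tol:
                break
            alpha, c, rho = 1.0, 1e-4, 0.5       # Armijo line-search constants
            while f(x + alpha * d) > f(x) + c * alpha * g.dot(d):
                alpha *= rho                     # shrink step until sufficient decrease
            x_new = x + alpha * d
            g_new = grad(x_new)
            beta = g_new.dot(g_new) / g.dot(g)   # Fletcher-Reeves coefficient
            d = -g_new + beta * d                # conjugate search direction
            if g_new.dot(d) >= 0:                # safeguard: restart if not a descent direction
                d = -g_new
            x, g = x_new, g_new
        return x

    # Hypothetical test problem: a 2-D quadratic whose minimizer is A^{-1} b.
    A = np.array([[3.0, 1.0], [1.0, 2.0]])
    b = np.array([1.0, -1.0])
    x_star = conjugate_gradient_fr(lambda x: 0.5 * x @ A @ x - b @ x,
                                   lambda x: A @ x - b,
                                   np.zeros(2))
    print(x_star)                                # approximately [0.6, -0.8]

On a quadratic like this, conjugate directions let the method converge in a number of steps on the order of the problem dimension, which is what makes it attractive for hyper-parameter tuning compared with plain gradient descent.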


Keywords: Parameter estimation and optimization · Gradient-based method · Intelligent optimization · Genetic algorithm · Differential evolution algorithm · Particle swarm optimization · Nonlinear Kalman filter · Extended Kalman filter · Dual estimation · Probabilistic methods · Bayesian method · Variational inference



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  • Jun Zhao (1)
  • Wei Wang (1)
  • Chunyang Sheng (2)

  1. Dalian University of Technology, Dalian, China
  2. Shandong University of Science and Technology, Qingdao, China
