Applied Intelligence, Volume 27, Issue 1, pp 79–88

Semi-parametric optimization for missing data imputation

  • Yongsong Qin
  • Shichao Zhang (Email author)
  • Xiaofeng Zhu
  • Jilian Zhang
  • Chengqi Zhang
Article

Abstract

Missing data imputation is an important issue in machine learning and data mining. In this paper, we propose a new and efficient imputation method for one kind of missing data: semi-parametric data. Our imputation method aims at an optimal evaluation of the Root Mean Square Error (RMSE), the distribution function, and the quantiles after the missing data are imputed. We evaluate our approaches experimentally on both simulated and real data, and demonstrate that our stochastic semi-parametric regression imputation is much better than existing deterministic semi-parametric regression imputation in both efficiency and effectiveness.
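
To make the contrast between deterministic and stochastic regression imputation concrete, the following is a minimal sketch and not the authors' estimator, which the abstract does not specify. It assumes a partially linear model y = beta*x + g(t) + noise, fits beta with a Robinson-style double-residual step, and estimates g with a Nadaraya-Watson kernel smoother. The names `nw_smooth`, `impute`, and `rmse`, the bandwidth `h`, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np

def nw_smooth(t_train, v_train, t_eval, h):
    """Nadaraya-Watson kernel regression of v on t with a Gaussian kernel."""
    d = (t_eval[:, None] - t_train[None, :]) / h
    w = np.exp(-0.5 * d ** 2)
    return (w @ v_train) / w.sum(axis=1)

def impute(x, t, y, observed, h=0.05, stochastic=True, rng=None):
    """Impute missing y under the assumed model y = beta*x + g(t) + noise."""
    rng = np.random.default_rng() if rng is None else rng
    xo, to, yo = x[observed], t[observed], y[observed]
    # Double-residual step: remove the smooth dependence on t from x and y,
    # then regress the y-residuals on the x-residuals to estimate beta.
    ex = xo - nw_smooth(to, xo, to, h)
    ey = yo - nw_smooth(to, yo, to, h)
    beta = (ex @ ey) / (ex @ ex)
    # Estimate g by smoothing what is left after removing the linear part.
    partial = yo - beta * xo
    resid = partial - nw_smooth(to, partial, to, h)
    miss = ~observed
    y_hat = beta * x[miss] + nw_smooth(to, partial, t[miss], h)
    if stochastic:
        # Stochastic variant: add a resampled, centred residual to each fit.
        centred = resid - resid.mean()
        y_hat = y_hat + rng.choice(centred, size=miss.sum(), replace=True)
    out = y.copy()
    out[miss] = y_hat
    return out

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Simulated example (assumed data-generating process, for illustration only).
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
t = rng.uniform(size=n)
y_true = 2.0 * x + np.sin(2 * np.pi * t) + 0.5 * rng.normal(size=n)
observed = rng.uniform(size=n) > 0.3      # roughly 30% of responses missing
y = np.where(observed, y_true, np.nan)

y_det = impute(x, t, y, observed, stochastic=False)
y_sto = impute(x, t, y, observed, stochastic=True, rng=rng)
print("RMSE deterministic:", rmse(y_det[~observed], y_true[~observed]))
print("RMSE stochastic:   ", rmse(y_sto[~observed], y_true[~observed]))
```

In a sketch like this, the deterministic variant fills each gap with the fitted value alone, so the imputed values carry no residual spread. The stochastic variant adds a resampled residual, which generally preserves the spread of the response better when the target is a distribution function or a quantile rather than a mean.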

Keywords

Missing data, Missing data imputation, Semi-parametric data

Copyright information

© Springer Science+Business Media, LLC 2006

Authors and Affiliations

  • Yongsong Qin (2)
  • Shichao Zhang (1), Email author
  • Xiaofeng Zhu (2)
  • Jilian Zhang (2)
  • Chengqi Zhang (1)
  1. School of Automation, Beihang University, Beijing, China
  2. Department of Computer Science, Guangxi Normal University, Beijing, China
