Missing Values Imputation for a Clustering Genetic Algorithm

  • Eduardo R. Hruschka
  • Estevam R. Hruschka Jr.
  • Nelson F. F. Ebecken
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3612)


The substitution of missing values, also called imputation, is an important data preparation task for data mining applications. This paper describes a nearest-neighbor method to impute missing values and shows that it can be useful for a clustering genetic algorithm. The proposed method is assessed by means of simulations performed on two datasets that are benchmarks for data mining methods: Wisconsin Breast Cancer and Congressional Voting Records. The efficacy of the approach is evaluated in both prediction and clustering scenarios. Empirical results show that the employed imputation method is a suitable data preparation tool.
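To make the idea of nearest-neighbor imputation concrete, the following is a minimal sketch of a generic k-nearest-neighbor imputer. It is not the authors' exact procedure, which the abstract does not specify. The function names `distance` and `knn_impute`, the parameter `k`, the use of `None` to mark a missing value, the restriction to numeric attributes, and the choice of averaging the neighbors' values are all assumptions made for illustration.

```python
import math


def distance(a, b):
    """Root-mean-square difference over attributes observed in both rows.

    Assumption: attributes are numeric and missing values are None.
    Returns infinity when the two rows share no observed attribute.
    """
    shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    if not shared:
        return math.inf
    return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))


def knn_impute(rows, k=2):
    """Return a copy of rows with each missing value replaced by the mean
    of that attribute over the k nearest rows in which it is observed.

    Distances are always computed on the original (unimputed) rows, so
    earlier substitutions do not influence later ones.
    """
    imputed = [list(r) for r in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbors: other rows where attribute j is observed.
            candidates = [
                (distance(row, other), other[j])
                for m, other in enumerate(rows)
                if m != i and other[j] is not None
            ]
            # Drop rows that cannot be compared with the current row.
            candidates = [c for c in candidates if c[0] < math.inf]
            candidates.sort(key=lambda c: c[0])
            nearest = candidates[:k]
            if nearest:  # leave the gap if no comparable neighbor exists
                imputed[i][j] = sum(v for _, v in nearest) / len(nearest)
    return imputed


if __name__ == "__main__":
    data = [
        [1.0, 2.0, 3.0],
        [1.0, 2.0, None],  # third attribute is missing
        [9.0, 9.0, 9.0],
        [1.2, 2.1, 3.4],
    ]
    print(knn_impute(data, k=2)[1])
```

In this toy input the second row is closest to the first and fourth rows, so its missing attribute is filled with the average of their values. For categorical attributes, such as those in the Congressional Voting Records data, a mode (majority vote) over the neighbors and a mismatch-based distance would be a more natural choice than the mean used here.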





Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Eduardo R. Hruschka (1)
  • Estevam R. Hruschka Jr. (2)
  • Nelson F. F. Ebecken (3)

  1. Catholic University of Santos (UniSantos), Santos, Brazil
  2. Federal University of São Carlos, São Carlos, Brazil
  3. COPPE / Federal University of Rio de Janeiro, Rio de Janeiro, Brazil
