Data Pre-processing to Apply Multiple Imputation Techniques: A Case Study on Real-World Census Data

  • Zoila Ruiz-Chavez
  • Jaime Salvador-Meneses
  • Jose Garcia-Rodriguez
  • Antonio J. Tallón-Ballesteros
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11315)


Improving accuracy and reducing computational cost are the main goals of machine learning techniques, but both depend heavily on the data used. This is even more so for real-world data such as censuses, surveys or tokens that contain a high level of missing values. Missing data and the presence of outliers are problems that must be treated carefully before any process related to data analysis. This work presents an overview of data pre-processing and describes the steps to follow before processing large volumes of high-dimensionality data with categorical variables. As part of the dimensionality reduction process, when one or more variables have a high level of missing values, we use the Pairwise and Listwise Deletion methods. In addition, m clusters are generated using the Kohonen Self-Organizing Maps (SOM) algorithm with H2O over R, dividing the data into groups of similar records. Multiple Imputation algorithms are then applied within each cluster, creating m different values to impute each missing value.
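The deletion and imputation steps named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, which uses H2O over R. Several things here are assumptions made only for illustration: the toy records, the encoding of missing answers as `None`, and all function names. The cluster labels are supplied by hand in place of SOM output, and a simple random draw from the observed values of the same cluster stands in for a full multiple-imputation model.

```python
import random

MISSING = None  # assumption: a missing answer is encoded as None

# Toy categorical records (invented for illustration, not census data).
rows = [
    {"sex": "F", "area": "urban", "educ": "primary"},
    {"sex": "M", "area": "urban", "educ": MISSING},
    {"sex": "F", "area": MISSING, "educ": "secondary"},
    {"sex": "M", "area": "rural", "educ": "secondary"},
    {"sex": "F", "area": "rural", "educ": MISSING},
]
# Hand-assigned cluster labels; in the paper these would come from the SOM step.
clusters = [0, 0, 1, 1, 1]


def listwise_delete(records):
    """Listwise deletion: keep only records with no missing value at all."""
    return [r for r in records if all(v is not MISSING for v in r.values())]


def pairwise_rows(records, col_a, col_b):
    """Pairwise deletion: for an analysis of two variables,
    keep every record that is observed on both of them."""
    return [
        r for r in records
        if r[col_a] is not MISSING and r[col_b] is not MISSING
    ]


def impute_m_values(records, labels, column, m, seed=0):
    """For each missing cell in `column`, draw m candidate values from the
    observed values of records in the same cluster (a simplified stand-in
    for a multiple-imputation model)."""
    rng = random.Random(seed)
    donors = {}
    for record, label in zip(records, labels):
        if record[column] is not MISSING:
            donors.setdefault(label, []).append(record[column])
    candidates = {}
    for index, (record, label) in enumerate(zip(records, labels)):
        if record[column] is MISSING and donors.get(label):
            candidates[index] = [rng.choice(donors[label]) for _ in range(m)]
    return candidates


complete_cases = listwise_delete(rows)             # 2 records survive
sex_area_cases = pairwise_rows(rows, "sex", "area")  # 4 records survive
educ_candidates = impute_m_values(rows, clusters, "educ", m=3)
```

On this toy data, listwise deletion keeps two of the five records, while pairwise deletion for the `sex` and `area` variables keeps four, which shows why pairwise deletion discards less information. Each missing `educ` cell receives three candidate values drawn only from its own cluster.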


Machine learning · Data preparation · Data mining · Imputation methods



Copyright information

© Springer Nature Switzerland AG 2018

Authors and Affiliations

  • Zoila Ruiz-Chavez (1), corresponding author
  • Jaime Salvador-Meneses (1)
  • Jose Garcia-Rodriguez (2)
  • Antonio J. Tallón-Ballesteros (3)

  1. Universidad Central del Ecuador, Quito, Ecuador
  2. Universidad de Alicante, Alicante, Spain
  3. University of Seville, Seville, Spain
