Data Preprocessing for Decision Making in Medical Informatics: Potential and Analysis

  • H. Benhar
  • A. Idri
  • J. L. Fernández-Alemán
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 746)


Clinical databases often contain noisy, inconsistent, missing, imbalanced, and high-dimensional data. These problems may reduce the performance of data mining (DM) techniques. Data preprocessing is therefore an essential step for making medical datasets suitable for mining with DM algorithms. The objective of this work is to carry out a systematic mapping study that reviews the use of preprocessing techniques on clinical datasets. As a result, 110 papers published between January 2000 and March 2017 were selected, analyzed, and classified according to publication year and channel, research type, and the preprocessing tasks used. The study shows that researchers have paid considerable attention to preprocessing in medical DM over the last decade, and that a significant number of the selected studies used data reduction and data cleaning tasks.


Mapping study; Medical data mining; Data preprocessing; Clinical data; Electronic health records



This research is part of the project PPR1/09: “mPHR in Morocco”, financed by the Ministry of Higher Education and Scientific Research in Morocco and CNRST, 2015-2017, and part of the GINSENG project (TIN2015-70259-C2-2-R), supported by the Spanish Ministry of Economy and Competitiveness and European FEDER funds.



Copyright information

© Springer International Publishing AG, part of Springer Nature 2018

Authors and Affiliations

  1. Software Project Management Research Team, ENSIAS, University Mohammed V, Rabat, Morocco
  2. Department of Informatics and Systems, Faculty of Computer Science, University of Murcia, Murcia, Spain
