A Data Preparation Methodology in Data Mining Applied to Mortality Population Databases

  • Joaquín Pérez
  • Emmanuel Iturbide
  • Victor Olivares
  • Miguel Hidalgo
  • Nelva Almanza
  • Alicia Martínez
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 353)


The data preparation phase is known to be the most time-consuming phase of the data mining process: it takes between 50% and 70% of the total project time, and its results directly affect the quality of that process. Current data mining methodologies are general purpose; one of their limitations is that they do not provide guidance on which particular tasks to carry out in a particular domain. This paper presents a new data preparation methodology oriented to the epidemiological domain, in which we have identified two sets of tasks: General Data Preparation and Specific Data Preparation. For both sets, the Cross-Industry Standard Process for Data Mining (CRISP-DM) is adopted as a guideline. The main contribution of our methodology is fourteen specialized tasks for this domain. To validate the proposed methodology, we developed a data mining system and applied the entire process to real mortality databases. The results were encouraging: on the one hand, the use of the methodology reduced some of the time-consuming tasks; on the other hand, the data mining system revealed previously unknown and potentially useful patterns for the public health services in Mexico.


Data Preparation Methodology Mortality Databases Epidemiology 





Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Joaquín Pérez (1)
  • Emmanuel Iturbide (1)
  • Victor Olivares (1)
  • Miguel Hidalgo (1)
  • Nelva Almanza (1)
  • Alicia Martínez (1)

  1. Department of Computer Science, CENIDET, Cuernavaca, México
