A Data Preparation Methodology in Data Mining Applied to Mortality Population Databases

  • Conference paper
New Contributions in Information Systems and Technologies

Abstract

Data preparation is known to be the most time-consuming phase of the data mining process: it takes between 50% and 70% of the total project time, and its results directly affect the quality of the overall process. Current data mining methodologies are general-purpose; one of their limitations is that they give no guidance on which specific tasks to carry out in a particular domain. This paper presents a new data preparation methodology for the epidemiological domain, in which we identify two sets of tasks: General Data Preparation and Specific Data Preparation. For both sets, the Cross-Industry Standard Process for Data Mining (CRISP-DM) is adopted as a guideline. The main contribution of our methodology is a set of fourteen specialized tasks for this domain. To validate the proposed methodology, we developed a data mining system and applied the entire process to real mortality databases. The results were encouraging: on the one hand, we observed that using the methodology reduced some of the time-consuming tasks; on the other hand, the data mining system revealed previously unknown and potentially useful patterns for the public health services in Mexico.

Author information

Correspondence to Joaquín Pérez.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Pérez, J., Iturbide, E., Olivares, V., Hidalgo, M., Almanza, N., Martínez, A. (2015). A Data Preparation Methodology in Data Mining Applied to Mortality Population Databases. In: Rocha, A., Correia, A., Costanzo, S., Reis, L. (eds) New Contributions in Information Systems and Technologies. Advances in Intelligent Systems and Computing, vol 353. Springer, Cham. https://doi.org/10.1007/978-3-319-16486-1_116

  • DOI: https://doi.org/10.1007/978-3-319-16486-1_116

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-16485-4

  • Online ISBN: 978-3-319-16486-1

  • eBook Packages: Computer Science, Computer Science (R0)
