Abstract
The data preparation phase is known to be the most time-consuming phase of the data mining process: it takes between 50% and 70% of the total project time, and its results directly affect the quality of the process as a whole. Current data mining methodologies are general-purpose; one of their limitations is that they do not provide guidance on which particular tasks to carry out in a particular domain. This paper presents a new data preparation methodology oriented to the epidemiological domain, in which we have identified two sets of tasks: General Data Preparation and Specific Data Preparation. For both sets, the Cross-Industry Standard Process for Data Mining (CRISP-DM) is adopted as a guideline. The main contribution of our methodology is a set of fourteen specialized tasks for this domain. To validate the proposed methodology, we developed a data mining system and applied the entire process to real mortality databases. The results were encouraging: on the one hand, we observed that the use of the methodology reduced some of the time-consuming tasks and, on the other hand, the data mining system revealed previously unknown and potentially useful patterns for the public health services in Mexico.
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Pérez, J., Iturbide, E., Olivares, V., Hidalgo, M., Almanza, N., Martínez, A. (2015). A Data Preparation Methodology in Data Mining Applied to Mortality Population Databases. In: Rocha, A., Correia, A., Costanzo, S., Reis, L. (eds) New Contributions in Information Systems and Technologies. Advances in Intelligent Systems and Computing, vol 353. Springer, Cham. https://doi.org/10.1007/978-3-319-16486-1_116
DOI: https://doi.org/10.1007/978-3-319-16486-1_116
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-16485-4
Online ISBN: 978-3-319-16486-1
eBook Packages: Computer Science (R0)