Rough Set Theory as a Data Mining Technique: A Case Study in Epidemiology and Cancer Incidence Prediction
A major challenge in epidemiology is data pre-processing, specifically feature selection, on large-scale data sets with a high-dimensional feature set. In this paper, this challenge is tackled using a recently established distributed and scalable version of Rough Set Theory (RST). The study considers epidemiological data collected from three international institutions for the purpose of cancer incidence prediction. The concrete data set used aggregates about 5,495 risk factors (features), spanning 32 years and 38 countries. Detailed experiments demonstrate that RST is relevant to real-world big data applications: it can offer insights into the selected risk factors, speed up the learning process, preserve the performance of the cancer incidence prediction model without substantial information loss, and simplify the learned model for epidemiologists. Code related to this paper is available at: https://github.com/zeinebchelly/Sp-RST.
Keywords: Big data, Rough set theory, Feature selection, Epidemiology, Cancer incidence prediction, Application