Rough Set Theory as a Data Mining Technique: A Case Study in Epidemiology and Cancer Incidence Prediction

  • Zaineb Chelly Dagdia
  • Christine Zarges
  • Benjamin Schannes
  • Martin Micalef
  • Lino Galiana
  • Benoît Rolland
  • Olivier de Fresnoye
  • Mehdi Benchoufi
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 11053)


A major challenge in epidemiology is data pre-processing, specifically feature selection, on large-scale data sets with high-dimensional feature sets. This paper tackles that challenge using a recently established distributed and scalable version of Rough Set Theory (RST). It considers epidemiological data collected from three international institutions for the purpose of cancer incidence prediction. The concrete data set aggregates about 5 495 risk factors (features), spanning 32 years and 38 countries. Detailed experiments demonstrate that RST is relevant to real-world big data applications: it can offer insights into the selected risk factors, speed up the learning process, maintain the performance of the cancer incidence prediction model without substantial information loss, and simplify the learned model for epidemiologists. Code related to this paper is available at:
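
To make the idea of rough-set feature selection concrete, the following is a minimal sketch of the classical, single-machine approach: compute the dependency degree of the decision attribute on a feature subset, then greedily add features until the subset is as informative as the full set. It is not the distributed algorithm used in the paper. The function names (`partition`, `dependency`, `quick_reduct`), the toy table, and its risk-factor names are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def partition(rows, attrs):
    """Group row indices into equivalence classes by their values on attrs."""
    blocks = defaultdict(list)
    for i, row in enumerate(rows):
        blocks[tuple(row[a] for a in attrs)].append(i)
    return list(blocks.values())

def dependency(rows, attrs, decision):
    """Dependency degree: share of rows whose equivalence class has one decision value."""
    if not rows:
        return 0.0
    positive = 0
    for block in partition(rows, attrs):
        if len({rows[i][decision] for i in block}) == 1:
            positive += len(block)
    return positive / len(rows)

def quick_reduct(rows, features, decision):
    """Greedy forward selection until the subset matches the full set's dependency."""
    target = dependency(rows, features, decision)
    selected = []
    while dependency(rows, selected, decision) < target:
        remaining = [f for f in features if f not in selected]
        best = max(remaining,
                   key=lambda f: dependency(rows, selected + [f], decision))
        selected.append(best)
    return selected

# Hypothetical toy table: three discretised risk factors and an incidence label.
rows = [
    {"smoking": "high", "alcohol": "low",  "region": "north", "incidence": "high"},
    {"smoking": "high", "alcohol": "high", "region": "south", "incidence": "high"},
    {"smoking": "low",  "alcohol": "low",  "region": "north", "incidence": "low"},
    {"smoking": "low",  "alcohol": "high", "region": "south", "incidence": "high"},
    {"smoking": "low",  "alcohol": "low",  "region": "south", "incidence": "low"},
    {"smoking": "high", "alcohol": "low",  "region": "south", "incidence": "high"},
]
features = ["smoking", "alcohol", "region"]

if __name__ == "__main__":
    print(quick_reduct(rows, features, "incidence"))
```

On this toy table the greedy search keeps `smoking` and `alcohol` and drops `region`, because the first two already determine the incidence label for every row. Real epidemiological features would first need discretisation, and the greedy search yields a reduct candidate rather than a guaranteed minimal one.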


Big data · Rough set theory · Feature selection · Epidemiology · Cancer incidence prediction · Application



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  • Zaineb Chelly Dagdia (1, 2)
  • Christine Zarges (1)
  • Benjamin Schannes (3)
  • Martin Micalef (4)
  • Lino Galiana (5)
  • Benoît Rolland (6)
  • Olivier de Fresnoye (7)
  • Mehdi Benchoufi (7, 8, 9)

  1. Department of Computer Science, Aberystwyth University, Aberystwyth, UK
  2. LARODEC, Institut Supérieur de Gestion de Tunis, Tunis, Tunisia
  3. Department of Statistics, ENSAE, Palaiseau, France
  4. Actuaris, Paris, France
  5. ENS Lyon, Lyon Cedex 07, France
  6. Altran Technologies S.A., Neuilly-sur-Seine, France
  7. Coordinateur Scientifique Programme Épidemium, Paris, France
  8. Centre d’Épidémiologie Clinique, Hôpital Hôtel Dieu, Assistance Publique-Hôpitaux de Paris, Paris, France
  9. Faculté de Médecine, Université Paris Descartes and INSERM UMR1153, Paris, France
