A Nearest Neighbor Approach to Build a Readable Risk Score for Breast Cancer

Real World Data Mining Applications

Part of the book series: Annals of Information Systems ((AOIS,volume 17))

Abstract

According to the World Health Organization, cancer has been the leading cause of death worldwide since 2010. Preventing cancers at the major sites through a quantified assessment of risk factors is a major concern for reducing their impact on society. Our objective is to test the performance of a modeling method that meets the needs and constraints of end users. In this article, we follow a data mining process to build a reliable assessment tool for primary breast cancer risk. A k-nearest-neighbor algorithm is used to compute a risk score for different profiles from a public database. We show empirically that it is possible to match the performance of logistic regression with fewer attributes and a more readable model. The process includes an offline step in which a domain expert helps select one of the many model variants, balancing physician expectations against performance. The chosen risk score uses four parameters: age, breast density, number of affected first-degree relatives, and breast biopsy. Detection performance, measured by the area under the ROC curve, is 0.637. A graphical user interface is presented to show how users will interact with this risk score.
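
The abstract names the technique but gives no implementation details, so the following is only a minimal sketch of how a k-nearest-neighbor risk score and its area under the ROC curve could be computed. Everything specific here is an assumption for illustration: the Euclidean distance, the choice of k, the score defined as the fraction of cancer cases among the k nearest records, the function names, and the made-up toy records. None of it is taken from the chapter.

```python
from math import sqrt


def knn_risk_score(profile, records, k):
    """Fraction of the k nearest records that are cancer cases.

    Assumption: Euclidean distance on numerically coded attributes.
    records is a list of (features, label) pairs, label 1 = case, 0 = non-case.
    """
    by_distance = sorted(
        records,
        key=lambda rec: sqrt(sum((a - b) ** 2 for a, b in zip(profile, rec[0]))),
    )
    neighbors = by_distance[:k]
    return sum(label for _, label in neighbors) / k


def roc_auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a random
    case receives a higher score than a random non-case (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical toy data, coded as
# (age group, breast density, affected first-degree relatives, breast biopsy).
toy_records = [
    ((1, 1, 0, 0), 0),
    ((1, 2, 0, 0), 0),
    ((2, 2, 0, 0), 0),
    ((3, 3, 1, 1), 1),
    ((3, 4, 1, 1), 1),
    ((2, 3, 1, 0), 0),
]

score = knn_risk_score((3, 3, 1, 1), toy_records, k=3)
auc = roc_auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1])
```

A score built this way stays readable in the sense the abstract emphasizes: it can be explained as "the share of similar profiles in the database that developed cancer". An area under the ROC curve of 0.5 corresponds to random ranking and 1.0 to perfect ranking, which gives context for the reported value of 0.637.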

Corresponding author

Correspondence to Émilien Gauthier.

Copyright information

© 2015 Springer International Publishing Switzerland

About this chapter

Cite this chapter

Gauthier, É., Brisson, L., Lenca, P., Ragusa, S. (2015). A Nearest Neighbor Approach to Build a Readable Risk Score for Breast Cancer. In: Abou-Nasr, M., Lessmann, S., Stahlbock, R., Weiss, G. (eds) Real World Data Mining Applications. Annals of Information Systems, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-07812-0_13
