A Nearest Neighbor Approach to Build a Readable Risk Score for Breast Cancer

  • Émilien Gauthier
  • Laurent Brisson
  • Philippe Lenca
  • Stéphane Ragusa
Part of the Annals of Information Systems book series (AOIS, volume 17)


According to the World Health Organization, cancer has been the leading cause of death worldwide since 2010. Preventing the major cancer localizations through a quantified assessment of risk factors is a major concern for reducing their impact on society. Our objective is to test the performance of a modeling method that meets the needs and constraints of end users. In this article, we follow a data mining process to build a reliable assessment tool for primary breast cancer risk. A k-nearest-neighbor algorithm is used to compute a risk score for different profiles from a public database. We show empirically that it is possible to achieve the same performance as logistic regression with fewer attributes and a more readable model. The process includes an offline step in which a domain expert helps to select one of the numerous model variations by best combining physician expectations and performance. The chosen risk score is built from four parameters: age, breast density, number of affected first-degree relatives, and breast biopsy. Its detection performance, measured as the area under the ROC curve, is 0.637. A graphical user interface is presented to show how users will interact with this risk score.
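
To make the two technical ideas in the abstract concrete, here is a minimal sketch of a nearest-neighbor risk score and of the area under the ROC curve. It is an illustration under stated assumptions, not the chapter's actual method: the function names, the unweighted Euclidean distance on raw attribute codes, the definition of the score as the fraction of cases among the k nearest profiles, and the toy data are all hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def knn_risk_score(profile, training, k):
    """Fraction of cases among the k training profiles nearest to `profile`.

    `training` is a list of (features, label) pairs, label 1 = case, 0 = non-case.
    Assumption: plain Euclidean distance on unscaled attribute codes.
    """
    nearest = sorted(training, key=lambda row: dist(row[0], profile))[:k]
    return sum(label for _, label in nearest) / k

def auc(scores, labels):
    """Area under the ROC curve, computed as the probability that a random
    case receives a higher score than a random non-case (ties count 1/2)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data invented for illustration only. Hypothetical attribute coding:
# (age group, breast density, affected first-degree relatives, prior biopsy)
training = [
    ((1, 1, 0, 0), 0),
    ((1, 2, 0, 0), 0),
    ((2, 2, 0, 0), 0),
    ((2, 3, 1, 0), 1),
    ((3, 3, 1, 1), 1),
    ((3, 4, 2, 1), 1),
]

low = knn_risk_score((1, 1, 0, 0), training, k=3)
high = knn_risk_score((3, 4, 1, 1), training, k=3)
print(low, high)
print(auc([0.1, 0.4, 0.35, 0.8], [0, 0, 1, 1]))
```

An AUC of 0.5 corresponds to a score that ranks cases and non-cases at random, and 1.0 to a perfect ranking, which gives a scale for reading the 0.637 reported above.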


Keywords: Breast Cancer; Breast Cancer Risk; Receiver Operating Characteristic Curve; Risk Score; Breast Density
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.



Copyright information

© Springer International Publishing Switzerland 2015

Authors and Affiliations

  • Émilien Gauthier (1, 2, 3), corresponding author
  • Laurent Brisson (2, 3)
  • Philippe Lenca (2, 3)
  • Stéphane Ragusa (1)

  1. Statlife company, Institut Gustave Roussy, Villejuif Cedex, France
  2. UMR CNRS 6285 Lab-STICC, Institut Telecom, Telecom Bretagne, Brest Cedex 3, France
  3. Université Européenne de Bretagne, Bretagne, France
