A Nearest Neighbor Approach to Build a Readable Risk Score for Breast Cancer

Gauthier, Émilien; Brisson, Laurent; Lenca, Philippe; Ragusa, Stéphane

doi:10.1007/978-3-319-07812-0_13

Part of the book series: Annals of Information Systems (AOIS, volume 17)

Abstract

According to the World Health Organization, cancer has been the leading cause of death worldwide since 2010. Preventing the major cancer localizations through a quantified assessment of risk factors is a major concern for reducing their impact on society. Our objective is to test the performance of a modeling method that meets the needs and constraints of end users. In this article, we follow a data mining process to build a reliable assessment tool for primary breast cancer risk. A k-nearest-neighbor algorithm is used to compute a risk score for different profiles from a public database. We show empirically that it is possible to achieve the same performance as logistic regression with fewer attributes and a more readable model. During an offline step of the process, a domain expert helps to select one of the numerous model variations by finding the best trade-off between physician expectations and performance. The chosen risk score is made of four parameters: age, breast density, number of affected first-degree relatives, and breast biopsy. Its detection performance, measured by the area under the ROC curve, is 0.637. A graphical user interface is presented to show how users will interact with this risk score.
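To make the idea of a nearest-neighbor risk score concrete, here is a minimal Python sketch. It is not the chapter's implementation: the abstract does not give the distance function, the value of k, any neighbor weighting, or how the four attributes are encoded, so the Euclidean distance, the integer encoding, the function names, and the toy records below are all assumptions made for illustration. The score is the fraction of the k most similar records that are cancer cases, and the area under the ROC curve is computed as the probability that a randomly chosen case receives a higher score than a randomly chosen non-case.

```python
import math


def knn_risk_score(profile, records, k):
    """Fraction of the k nearest records that are cases (label 1).

    Assumption: plain Euclidean distance on integer-encoded attributes.
    """
    by_distance = sorted(records, key=lambda rec: math.dist(profile, rec[0]))
    nearest = by_distance[:k]
    return sum(label for _, label in nearest) / len(nearest)


def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    Counts how often a case outscores a non-case; ties count one half.
    """
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((c > n) + 0.5 * (c == n) for c in cases for n in controls)
    return wins / (len(cases) * len(controls))


# Hypothetical toy records, NOT real data. Each profile is encoded as
# (age group, breast density category, affected first-degree relatives,
#  prior breast biopsy), followed by a label: 1 = case, 0 = non-case.
TOY_RECORDS = [
    ((0, 0, 0, 0), 0),
    ((0, 0, 0, 1), 0),
    ((1, 1, 0, 0), 0),
    ((3, 3, 2, 1), 1),
    ((3, 2, 2, 1), 1),
    ((2, 3, 1, 1), 1),
]

if __name__ == "__main__":
    scores = [knn_risk_score(p, TOY_RECORDS, k=3) for p, _ in TOY_RECORDS]
    labels = [y for _, y in TOY_RECORDS]
    print("risk scores:", scores)
    print("AUC on the toy records:", auc(scores, labels))
```

Because the toy records are scored against themselves, the AUC printed here says nothing about real performance; a held-out evaluation on the actual database would be needed to obtain a figure comparable to the 0.637 reported above.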

Author information

Authors and Affiliations

Statlife company, Institut Gustave Roussy, 114 rue Édouard Vaillant, 94805, Villejuif Cedex, France
Émilien Gauthier & Stéphane Ragusa
UMR CNRS 6285 Lab-STICC, Institut Telecom, Telecom Bretagne, Technopôle Brest Iroise CS 83818, 29238, Brest Cedex 3, France
Émilien Gauthier, Laurent Brisson & Philippe Lenca
Université Européenne de Bretagne, Bretagne, France
Émilien Gauthier, Laurent Brisson & Philippe Lenca

Corresponding author

Correspondence to Émilien Gauthier.

Editor information

Editors and Affiliations

Research & Advanced Engineering, Ford Motor Company, Dearborn, Michigan, USA
Mahmoud Abou-Nasr
Universität Hamburg Inst. Wirtschaftsinformatik, Hamburg, Germany
Stefan Lessmann
Universität Hamburg Inst. Wirtschaftsinformatik, Hamburg, Germany
Robert Stahlbock
Department of Computer & Information Science, Fordham University, Bronx, New York, USA
Gary M. Weiss

About this chapter

Cite this chapter

Gauthier, É., Brisson, L., Lenca, P., Ragusa, S. (2015). A Nearest Neighbor Approach to Build a Readable Risk Score for Breast Cancer. In: Abou-Nasr, M., Lessmann, S., Stahlbock, R., Weiss, G. (eds) Real World Data Mining Applications. Annals of Information Systems, vol 17. Springer, Cham. https://doi.org/10.1007/978-3-319-07812-0_13

DOI: https://doi.org/10.1007/978-3-319-07812-0_13
Published: 14 November 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-07811-3
Online ISBN: 978-3-319-07812-0
eBook Packages: Business and Economics, Business and Management (R0)
