Imputation techniques on missing values in breast cancer treatment and fertility data

  • Xuetong Wu
  • Hadi Akbarzadeh Khorshidi
  • Uwe Aickelin
  • Zobaida Edib
  • Michelle Peate
Part of the following topical collections:
  1. Special Issue on Artificial Intelligence in Health Informatics


Clinical decision support using data mining techniques has, in the last few years, offered a more intelligent way to reduce decision error. However, clinical datasets often suffer from high missingness, which adversely impacts the quality of modelling if handled improperly. Imputing missing values provides an opportunity to resolve the issue. Conventional approaches rely on simple statistical treatment, such as mean imputation or discarding cases with missing values, which have many limitations and thus degrade learning performance. This study examines a series of machine-learning-based imputation methods and suggests an efficient approach to preparing a good-quality breast cancer (BC) dataset for finding the relationship between BC treatment and chemotherapy-related amenorrhoea, where performance is evaluated by prediction accuracy. To this end, the reliability and robustness of six well-known imputation methods are evaluated. Our results show that imputation leads to a significant boost in classification performance compared to model prediction based on listwise deletion. Furthermore, the results reveal that most methods achieve strong robustness and discriminant power even when the dataset has a high missing rate (> 50%).
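To make the contrast between listwise deletion and imputation concrete, the following is a minimal sketch in plain Python. It is not the authors' pipeline: the toy records and column meanings are hypothetical, and the k-nearest-neighbour variant is included only as an assumed example of a machine-learning-style imputer, since the abstract does not name the six methods evaluated.

```python
from math import sqrt
from statistics import mean

# Hypothetical toy records (not from the study): [age, hormone level].
# None marks a missing value.
rows = [
    [34.0, 2.1],
    [41.0, None],
    [29.0, 3.4],
    [None, 1.0],
    [45.0, 0.8],
]


def listwise_delete(data):
    """Keep only the rows that have no missing entries."""
    return [list(r) for r in data if None not in r]


def mean_impute(data):
    """Replace each missing entry with the mean of the observed values in its column."""
    columns = list(zip(*data))
    col_means = [mean(v for v in col if v is not None) for col in columns]
    return [[col_means[j] if v is None else v for j, v in enumerate(r)] for r in data]


def knn_impute(data, k=2):
    """Replace each missing entry with the column mean over the k nearest complete rows.

    Distance is Euclidean, computed only over the features that the
    incomplete row actually has observed.
    """
    complete = listwise_delete(data)
    imputed = []
    for r in data:
        if None not in r:
            imputed.append(list(r))
            continue
        observed = [j for j, v in enumerate(r) if v is not None]

        def dist(c, r=r, observed=observed):
            return sqrt(sum((r[j] - c[j]) ** 2 for j in observed))

        nearest = sorted(complete, key=dist)[:k]
        imputed.append(
            [mean(c[j] for c in nearest) if v is None else v for j, v in enumerate(r)]
        )
    return imputed


if __name__ == "__main__":
    print("listwise deletion keeps", len(listwise_delete(rows)), "of", len(rows), "rows")
    print("mean imputation:", mean_impute(rows))
    print("kNN imputation: ", knn_impute(rows, k=2))
```

On this toy data, listwise deletion discards two of the five records, whereas both imputers keep all five for downstream classification. With a missing rate above 50%, as discussed in the abstract, deletion would leave very few complete cases to learn from.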


Missing data, Imputation, Classification, Breast cancer, Post-treatment amenorrhoea



This work is fully funded by Melbourne Research Scholarships (MRS), Grant No. 385545, and partially supported by the Fertility After Cancer Predictor (FoRECAsT) Study. Michelle Peate is currently supported by an MDHS Fellowship, University of Melbourne. The FoRECAsT study is supported by the FoRECAsT consortium and the Victorian Government through a Victorian Cancer Agency (Early Career Seed Grant) awarded to Michelle Peate.



Copyright information

© Springer Nature Switzerland AG 2019

Authors and Affiliations

  1. Department of Computing and Information Systems, University of Melbourne, Parkville, Australia
  2. Department of Obstetrics and Gynaecology, University of Melbourne, Parkville, Australia
