Melanoma risk modeling from limited positive samples

  • Aaron N. Richter
  • Taghi M. Khoshgoftaar
Original Article


The key to effective cancer treatment is early detection. Risk models built from routinely collected clinical data can improve early detection by identifying high-risk patients. In this study, we explored various machine learning techniques for building a melanoma skin cancer risk model. The dataset contains records of routine dermatology office visits from 9,531,408 patients across the United States. Of these patients, 17,246 (0.18%) developed melanoma. We conducted extensive experiments to learn effectively from this dataset despite its limited positive samples. We derived datasets with more severe class imbalance and tested several classifiers with different data sampling techniques to build the best possible model. Additionally, we explored various properties of the datasets to determine relationships between class distributions and model performance. We found that randomly removing negative cases from the training datasets significantly improved model performance. K-means clustering of different groups of instances shows greater homogeneity among negative samples, and the model results reflect this: removing these samples increases overall model performance. This experiment provides a reference framework for future risk models, since most datasets will contain a plethora of healthy patients but only a few key patients who are at high risk of developing a disease.
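The random removal of negative cases described in the abstract is commonly called random undersampling. The following is a minimal sketch of that idea on synthetic data, not the authors' pipeline: the function name `random_undersample`, the `neg_to_pos_ratio` parameter, the 1:1 target ratio, and the toy feature matrix are all assumptions made for illustration, and the abstract does not state which ratios or tools were actually used.

```python
import numpy as np


def random_undersample(X, y, neg_to_pos_ratio=1.0, seed=0):
    """Keep every positive row and a random subset of the negative rows.

    Hypothetical helper for illustration; labels are assumed to be 0/1.
    """
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Number of negatives to keep, capped at the number available.
    n_keep = min(len(neg_idx), int(round(neg_to_pos_ratio * len(pos_idx))))
    kept_neg = rng.choice(neg_idx, size=n_keep, replace=False)
    keep = np.sort(np.concatenate([pos_idx, kept_neg]))
    return X[keep], y[keep]


# Synthetic stand-in for a training split with roughly 0.18% positives.
rng = np.random.default_rng(42)
n_rows, n_pos = 100_000, 180
y = np.zeros(n_rows, dtype=int)
y[:n_pos] = 1
rng.shuffle(y)
X = rng.normal(size=(n_rows, 5))  # toy features, not clinical data

X_bal, y_bal = random_undersample(X, y, neg_to_pos_ratio=1.0)
print(f"before: {y.mean():.4%} positive, after: {y_bal.mean():.1%} positive")
```

In a setup like this, sampling would normally be applied to the training split only, so that evaluation still reflects the original class distribution.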


Cancer risk, Melanoma, Big data, Class imbalance, Machine learning



The authors would like to thank the anonymous reviewers for their constructive evaluation of this paper, and the various members of the Data Mining and Machine Learning Laboratory, Florida Atlantic University, for assistance with the reviews.

Compliance with ethical standards

Conflict of interest

The authors declare that they have no conflict of interest.



Copyright information

© Springer-Verlag GmbH Austria, part of Springer Nature 2019

Authors and Affiliations

  1. Department of Computer and Electrical Engineering and Computer Science, Florida Atlantic University, Boca Raton, USA
