ROC operating point selection for classification of imbalanced data with application to computer-aided polyp detection in CT colonography

  • Bowen Song
  • Guopeng Zhang
  • Wei Zhu
  • Zhengrong Liang
Original Article



   Computer-aided detection and diagnosis (CAD) of colonic polyps routinely faces the challenge of classifying imbalanced data. In this paper, three new operating point selection strategies based on the receiver operating characteristic (ROC) curve are proposed to address the problem.


   A major reason classifiers perform poorly on imbalanced data is that the best decision threshold shifts with the degree of imbalance. To address this threshold-shifting issue, three operating point selection strategies, i.e., shortest distance, harmonic mean and anti-harmonic mean, are proposed and their performance is investigated.
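   As a rough illustration of operating point selection, the sketch below scans candidate thresholds on a toy score list and picks one by two of the named criteria. The abstract does not give the formulas, so the definitions used here are assumptions: "shortest distance" is taken as the Euclidean distance from the ROC point to the ideal corner (false-positive rate 0, sensitivity 1), and "harmonic mean" as the harmonic mean of sensitivity and specificity. The anti-harmonic mean is left out because its definition cannot be recovered from this text. All function names and the toy data are hypothetical.

```python
import math


def roc_points(scores, labels):
    """Return (threshold, sensitivity, specificity) for every distinct score
    used as a cut-off, where a sample is called positive if score >= threshold."""
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        points.append((t, tp / pos, tn / neg))
    return points


def corner_distance(sens, spec):
    # Assumed definition: distance from (1 - spec, sens) to the ideal corner (0, 1).
    return math.hypot(1.0 - spec, 1.0 - sens)


def harmonic_mean(sens, spec):
    # Assumed definition: harmonic mean of sensitivity and specificity.
    return 0.0 if sens + spec == 0 else 2.0 * sens * spec / (sens + spec)


def select_operating_point(points, strategy):
    """Pick the (threshold, sensitivity, specificity) triple preferred by a strategy."""
    if strategy == "shortest_distance":
        return min(points, key=lambda p: corner_distance(p[1], p[2]))
    if strategy == "harmonic_mean":
        return max(points, key=lambda p: harmonic_mean(p[1], p[2]))
    raise ValueError(f"unknown strategy: {strategy}")


# Toy, made-up classifier scores: 2 positives among 10 samples.
labels = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
scores = [0.45, 0.30, 0.05, 0.10, 0.12, 0.15, 0.20, 0.22, 0.25, 0.35]

points = roc_points(scores, labels)
by_distance = select_operating_point(points, "shortest_distance")
by_harmonic = select_operating_point(points, "harmonic_mean")
print(by_distance, by_harmonic)
```

   In this toy example no positive sample scores above 0.5, so a fixed 0.5 cut-off would miss every positive, while both criteria move the threshold down to a point where sensitivity and specificity are jointly high.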


   Experiments were conducted on a class-imbalanced database containing 64 polyps among 786 polyp candidates. Support vector machines (SVM) and random forests (RFs) were employed as base classifiers. Two imbalance-correction techniques, i.e., cost-sensitive learning and training-data down-sampling, were applied to SVM and RFs, and their performance was compared with that of the proposed strategies. Compared with the original thresholding method, which gave 0.488 sensitivity and 0.986 specificity for RFs and 0.526 sensitivity and 0.977 specificity for SVM, our strategies achieved more balanced results: around 0.89 sensitivity and 0.92 specificity for RFs and 0.88 sensitivity and 0.90 specificity for SVM. Moreover, their performance remained at the same level regardless of whether the other correction methods were used.
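   To make the degree of imbalance and the two correction techniques concrete, the following sketch works from the stated counts (64 polyps among 786 candidates). The down-sampling ratio and the inverse-frequency class weights are illustrative assumptions; the abstract does not state which ratio or cost values were used, and the helper names are hypothetical.

```python
import random

N_CANDIDATES = 786                   # polyp candidates in the database (from the text)
N_POLYPS = 64                        # true polyps (from the text)
N_FALSE = N_CANDIDATES - N_POLYPS    # remaining candidates, i.e. non-polyps

prevalence = N_POLYPS / N_CANDIDATES     # fraction of candidates that are polyps
imbalance_ratio = N_FALSE / N_POLYPS     # non-polyps per polyp


def downsample_majority(labels, ratio=1.0, seed=0):
    """Keep all positives and a random subset of negatives.

    `ratio` is the assumed number of negatives kept per positive.
    Returns the sorted indices of the retained samples.
    """
    rng = random.Random(seed)
    pos_idx = [i for i, y in enumerate(labels) if y == 1]
    neg_idx = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(neg_idx), int(round(ratio * len(pos_idx))))
    return sorted(pos_idx + rng.sample(neg_idx, n_keep))


def inverse_frequency_weights(labels):
    """Assumed cost scheme: weight each class inversely to its frequency."""
    n = len(labels)
    n_pos = sum(labels)
    return {1: n / (2 * n_pos), 0: n / (2 * (n - n_pos))}


labels = [1] * N_POLYPS + [0] * N_FALSE
kept = downsample_majority(labels, ratio=1.0)
weights = inverse_frequency_weights(labels)

print(f"prevalence = {prevalence:.3f}, imbalance = {imbalance_ratio:.1f}:1")
print(f"training samples after down-sampling = {len(kept)}")
print(f"class weights = {weights}")
```

   With these counts, roughly one candidate in twelve is a true polyp, which is why a classifier tuned for overall accuracy tends to favor specificity over sensitivity.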


   Based on the above experiments, the gain from the proposed strategies is noticeable: sensitivity improved from about 0.5 to nearly 0.9 (0.88 to 0.89) for both RFs and SVM while retaining a relatively high specificity, i.e., 0.92 for RFs and 0.90 for SVM. The strategies were adaptive and robust across different levels of data imbalance. This indicates a feasible solution to the threshold-shifting problem, yielding favorable sensitivity and specificity in CAD of polyps from imbalanced data.


Computer-aided detection and diagnosis (CAD); Computed tomography colonography (CTC); Random forests; Harmonic mean; Support vector machine (SVM); Receiver operating characteristic (ROC)



This work was supported in part by the NIH/NCI under Grants #CA082402 and #CA143111.

Conflict of Interest

Bowen Song, Guopeng Zhang, Wei Zhu and Zhengrong Liang declare that they have no conflict of interest.



Copyright information

© CARS 2013

Authors and Affiliations

  • Bowen Song (1, 2)
  • Guopeng Zhang (3)
  • Wei Zhu (2)
  • Zhengrong Liang (1), corresponding author
  1. Department of Radiology, Stony Brook University, Stony Brook, USA
  2. Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, USA
  3. Department of Biomedical Engineering, Fourth Military Medical University, Xi’an, China
