Oversampling Methods for Classification of Imbalanced Breast Cancer Malignancy Data

  • Bartosz Krawczyk
  • Łukasz Jeleń
  • Adam Krzyżak
  • Thomas Fevens
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7594)


In breast cancer malignancy grading, the main problem that directly influences classification is the imbalanced number of cases among the malignancy classes. This poses a challenge for pattern recognition algorithms and leads to a significant decrease in classification accuracy for the minority class. In this paper we present an approach that ameliorates this problem. We describe and compare several state-of-the-art methods based on oversampling, i.e. the introduction of artificial objects into the dataset to eliminate the disproportion among classes. We also describe the automatic thresholding and fuzzy c-means algorithms used for nuclei segmentation from fine needle aspirates. From the segmented images, a set of 15 features used in the classification process was extracted.
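The oversampling idea summarised above — generating artificial minority-class objects to remove the class disproportion — is the core of SMOTE-style methods. The following is a minimal illustrative sketch of that idea, not the authors' implementation; the function name and parameters are our own:

```python
import numpy as np

def smote_like_oversample(minority, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between each sample and one of its k nearest minority-class
    neighbours (the core idea of SMOTE; illustrative sketch only)."""
    rng = np.random.default_rng(rng)
    minority = np.asarray(minority, dtype=float)
    n = len(minority)
    # pairwise distances within the minority class
    d = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)          # a point is not its own neighbour
    neighbours = np.argsort(d, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(n)                              # pick a minority sample
        j = neighbours[i, rng.integers(min(k, n - 1))]   # pick one of its neighbours
        gap = rng.random()                               # interpolation coefficient
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)
```

Because each synthetic object is a convex combination of two existing minority samples, new points stay inside the region already occupied by the minority class.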


pattern recognition, image processing, imbalanced classification, oversampling, classifier ensemble, breast cancer, nuclei segmentation
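The automatic thresholding mentioned in the abstract for nuclei segmentation can be done with Ridler–Calvard iterative selection. A minimal sketch, assuming a greyscale image supplied as a NumPy array (function name and tolerance are our own):

```python
import numpy as np

def ridler_calvard_threshold(image, tol=0.5):
    """Iterative selection thresholding (Ridler & Calvard, 1978):
    repeatedly set the threshold to the midpoint of the mean
    intensities of the two classes it induces, until it stabilises."""
    img = np.asarray(image, dtype=float)
    t = img.mean()                       # initial guess: global mean intensity
    while True:
        fg = img[img > t]                # tentative foreground (nuclei)
        bg = img[img <= t]               # tentative background
        # guard against an empty class by reusing the current threshold
        new_t = 0.5 * ((fg.mean() if fg.size else t) +
                       (bg.mean() if bg.size else t))
        if abs(new_t - t) < tol:
            return new_t
        t = new_t
```

On a bimodal intensity histogram, the fixed point lands between the two modes, separating nuclei from the background.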




Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Bartosz Krawczyk (1)
  • Łukasz Jeleń (2, 3)
  • Adam Krzyżak (4)
  • Thomas Fevens (4)
  1. Department of Systems and Computer Networks, Wrocław University of Technology, Wrocław, Poland
  2. Wrocław School of Applied Informatics, Wrocław, Poland
  3. Institute of Agricultural Engineering, Wrocław University of Environmental and Life Sciences, Wrocław, Poland
  4. Department of Computer Science and Software Engineering, Concordia University, West Montréal, Canada
