
Detecting malignant patients via modified boosted tree

  • Research Papers
  • Published:
Science China Information Sciences

Abstract

As one of the most effective means of extracting useful information from medical databases and supporting scientific decision-making in the diagnosis and treatment of disease, medical data mining has become an increasingly active research topic in recent years. Intrinsic characteristics of medical databases, such as their huge volume, imbalanced samples, and stringent performance standards, make this mining process particularly challenging. By examining the various challenges posed by Task 1 of the KDD Cup 2008 competition, this paper analyzes potential solutions to these problems and presents a modified boosted tree as the final classification model. This model ranked fourth among all solutions to Task 1. We hope that our analysis of and solutions to these challenges will contribute to the development of medical data mining applications.
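The abstract does not detail the specific modification applied to the boosted tree, so the sketch below is only a generic illustration of the underlying idea: a gradient-boosted tree classifier trained on heavily imbalanced binary data, with per-sample reweighting to compensate for the rare positive (malignant) class. The synthetic data, feature count, and imbalance ratio are assumptions chosen for illustration; this is not the KDD Cup 2008 dataset and not the authors' method.

```python
# Minimal sketch (not the authors' method): a boosted-tree baseline for
# imbalanced binary classification, with the rare positive class up-weighted
# via per-sample weights. All data here is synthetic and illustrative.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_sample_weight
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in for candidate-level features: roughly 0.6% positives
# to mimic a severely imbalanced screening setting (assumed ratio).
n_samples, n_features = 20000, 25
X = rng.normal(size=(n_samples, n_features))
y = (rng.random(n_samples) < 0.006).astype(int)
X[y == 1] += 0.8  # give the positive class a weak, learnable signal

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# "Balanced" weights up-weight minority samples so each boosting round
# optimizes a loss that is not dominated by the majority (benign) class.
w_tr = compute_sample_weight(class_weight="balanced", y=y_tr)

clf = GradientBoostingClassifier(n_estimators=200, max_depth=3,
                                 learning_rate=0.1, random_state=0)
clf.fit(X_tr, y_tr, sample_weight=w_tr)

scores = clf.predict_proba(X_te)[:, 1]
print("ROC AUC on held-out synthetic data:", roc_auc_score(y_te, scores))
```

Reweighting is only one of several standard ways to handle class imbalance (under-sampling, over-sampling, and cost-sensitive learning are common alternatives); it is used here purely to keep the sketch short.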

Author information

Corresponding author

Correspondence to YiLong Yin.

About this article

Cite this article

Dong, C., Yin, Y. & Yang, X. Detecting malignant patients via modified boosted tree. Sci. China Inf. Sci. 53, 1369–1378 (2010). https://doi.org/10.1007/s11432-010-3107-9
