Abstract
As one of the most effective ways to extract useful information from medical databases and to support scientific decision-making in the diagnosis and treatment of disease, medical data mining has become an increasingly active topic in recent years. Intrinsic characteristics of medical databases, such as their huge volume, imbalanced samples, and stringent performance requirements, make the mining process particularly challenging. By elaborating the various challenges posed by Task 1 of the KDD Cup 2008 competition, this paper analyzes potential solutions to these problems and presents a modified boosted tree as the final classification model. This model ranked fourth among all solutions to Task 1. We hope that our analysis of, and solutions to, these challenges will contribute to the development of medical data mining applications.
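The abstract's core ingredients — a boosted ensemble of trees, with modifications for imbalanced samples — can be illustrated with a minimal sketch. The code below is not the authors' model; it is a plain AdaBoost over one-feature decision stumps, where a `pos_weight` parameter (a hypothetical stand-in for the paper's imbalance handling) up-weights the rare positive class in the initial sample distribution.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    # A decision stump: a one-feature threshold rule returning +1 or -1.
    return polarity if x[feature] >= threshold else -polarity

def train_adaboost(X, y, n_rounds=10, pos_weight=1.0):
    """Minimal AdaBoost on decision stumps (labels in {-1, +1}).

    pos_weight > 1 skews the initial sample weights toward the rare
    positive class -- an illustrative assumption, not the exact
    modification used in the paper.
    """
    n_features = len(X[0])
    w = [pos_weight if yi > 0 else 1.0 for yi in y]
    s = sum(w)
    w = [wi / s for wi in w]
    ensemble = []
    for _ in range(n_rounds):
        # Exhaustively pick the stump with the lowest weighted error.
        best, best_err = None, float("inf")
        for f in range(n_features):
            for t in sorted({x[f] for x in X}):
                for pol in (+1, -1):
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if stump_predict(xi, f, t, pol) != yi)
                    if err < best_err:
                        best_err, best = err, (f, t, pol)
        eps = max(best_err, 1e-10)
        if eps >= 0.5:        # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        f, t, pol = best
        ensemble.append((alpha, f, t, pol))
        # Re-weight: misclassified samples gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, f, t, pol))
             for xi, yi, wi in zip(X, y, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    # Weighted vote of all stumps; sign gives the class.
    score = sum(alpha * stump_predict(x, f, t, pol)
                for alpha, f, t, pol in ensemble)
    return 1 if score >= 0 else -1
```

On a toy imbalanced set such as `X = [[0], [1], [2], [3], [8], [9]]`, `y = [-1, -1, -1, -1, 1, 1]`, a model trained with `pos_weight=2.0` separates the two positives from the four negatives with a single threshold stump.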
Cite this article
Dong, C., Yin, Y. & Yang, X. Detecting malignant patients via modified boosted tree. Sci. China Inf. Sci. 53, 1369–1378 (2010). https://doi.org/10.1007/s11432-010-3107-9