Abstract
As one of the most effective ways to extract useful information from medical databases and to support scientific decision-making in the diagnosis and treatment of disease, medical data mining has become an increasingly active topic in recent years. Intrinsic characteristics of medical databases, such as their huge volume, imbalanced samples, and stringent performance requirements, make the mining process particularly challenging. By elaborating the various challenges posed by Task 1 of the KDD Cup 2008 competition, this paper analyzes potential solutions to these problems and presents a modified boosted tree as the final classification model. This model ranked fourth among all solutions to Task 1. We hope that our analysis of, and solutions to, these challenges will contribute to the development of medical data mining applications.
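The abstract's core ingredients — a boosted ensemble of trees, with modifications for imbalanced samples — can be illustrated with a minimal sketch. The code below is not the authors' model; it is a plain AdaBoost over one-feature decision stumps, where a `pos_weight` parameter (a hypothetical stand-in for the paper's imbalance handling) up-weights the rare positive class in the initial sample distribution.

```python
import math

def stump_predict(x, feature, threshold, polarity):
    # A decision stump: a one-feature threshold rule returning +1 or -1.
    return polarity if x[feature] >= threshold else -polarity

def train_adaboost(X, y, n_rounds=10, pos_weight=1.0):
    """Minimal AdaBoost on decision stumps (labels in {-1, +1}).

    pos_weight > 1 skews the initial sample weights toward the rare
    positive class -- an illustrative assumption, not the exact
    modification used in the paper.
    """
    n_features = len(X[0])
    w = [pos_weight if yi > 0 else 1.0 for yi in y]
    s = sum(w)
    w = [wi / s for wi in w]
    ensemble = []
    for _ in range(n_rounds):
        # Exhaustively pick the stump with the lowest weighted error.
        best, best_err = None, float("inf")
        for f in range(n_features):
            for t in sorted({x[f] for x in X}):
                for pol in (+1, -1):
                    err = sum(wi for xi, yi, wi in zip(X, y, w)
                              if stump_predict(xi, f, t, pol) != yi)
                    if err < best_err:
                        best_err, best = err, (f, t, pol)
        eps = max(best_err, 1e-10)
        if eps >= 0.5:        # weak learner no better than chance
            break
        alpha = 0.5 * math.log((1 - eps) / eps)
        f, t, pol = best
        ensemble.append((alpha, f, t, pol))
        # Re-weight: misclassified samples gain weight, correct ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(xi, f, t, pol))
             for xi, yi, wi in zip(X, y, w)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    # Weighted vote of all stumps; sign gives the class.
    score = sum(alpha * stump_predict(x, f, t, pol)
                for alpha, f, t, pol in ensemble)
    return 1 if score >= 0 else -1
```

On a toy imbalanced set such as `X = [[0], [1], [2], [3], [8], [9]]`, `y = [-1, -1, -1, -1, 1, 1]`, a model trained with `pos_weight=2.0` separates the two positives from the four negatives with a single threshold stump.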
Cite this article
Dong, C., Yin, Y. & Yang, X. Detecting malignant patients via modified boosted tree. Sci. China Inf. Sci. 53, 1369–1378 (2010). https://doi.org/10.1007/s11432-010-3107-9