Abstract
Purpose
A trained classification model is often applied to operating data that deviates from the training data because of noise. This paper tests an ensemble method, Diversified Multiple Trees (DMT), on its capability to classify instances from a new laboratory using a classifier built on instances from another laboratory.
Methods
DMT is tested on three real-world biomedical data sets from different laboratories, in comparison with four benchmark ensemble methods: AdaBoost, Bagging, Random Forests, and Random Trees. Experiments have also been conducted to study the limitations of DMT and its possible variations.
Results
Experimental results show that DMT is significantly more accurate than the benchmark ensemble classifiers at classifying new instances from a laboratory different from the one whose instances were used to build the classifier.
Conclusions
This paper demonstrates that the ensemble classifier DMT is more robust in classifying noisy data than other widely used ensemble methods. DMT works on data sets that support building multiple simple trees.
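The full DMT algorithm is specified in the paper itself; as a rough illustration of the diversification idea behind it, the sketch below builds several one-level decision stumps on disjoint features (each tree is forbidden from reusing a feature chosen by an earlier tree) and combines them by majority vote. The function names (`fit_stump`, `fit_dmt`, `predict_dmt`) and the synthetic data are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def fit_stump(X, y, allowed):
    """Best single-feature threshold split (by training error) among allowed features."""
    best = None  # (error, feature, threshold, label_left, label_right)
    for f in allowed:
        values = np.unique(X[:, f])
        for t in (values[:-1] + values[1:]) / 2.0:  # midpoints between sorted values
            left = X[:, f] <= t
            for ll, rl in ((0, 1), (1, 0)):
                err = np.mean(np.where(left, ll, rl) != y)
                if best is None or err < best[0]:
                    best = (err, f, t, ll, rl)
    return best

def fit_dmt(X, y, n_trees):
    """Train up to n_trees stumps, each on features unused by earlier stumps."""
    allowed = set(range(X.shape[1]))
    stumps = []
    for _ in range(n_trees):
        if not allowed:
            break
        _, f, t, ll, rl = fit_stump(X, y, sorted(allowed))
        stumps.append((f, t, ll, rl))
        allowed.remove(f)  # forced diversity: no feature is reused
    return stumps

def predict_dmt(stumps, X):
    """Majority vote over the stumps' predictions."""
    votes = np.array([np.where(X[:, f] <= t, ll, rl) for f, t, ll, rl in stumps])
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Toy demonstration: three features, each a noisy copy of the binary label,
# standing in for redundant informative genes in a microarray data set.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 200)
X = np.column_stack([y + rng.normal(0.0, 0.4, 200) for _ in range(3)])
stumps = fit_dmt(X, y, n_trees=3)
```

Because no feature is shared between trees, noise corrupting one feature affects only one vote, which is the intuition behind DMT's robustness claim; the real algorithm builds full (simple) decision trees rather than stumps.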
Acknowledgements
The work has been partially supported by ARC Discovery Grants DP0559090 and DP140103617.
Cite this article
Li, J., Liu, L., Liu, J. et al. Building Diversified Multiple Trees for classification in high dimensional noisy biomedical data. Health Inf Sci Syst 5, 5 (2017). https://doi.org/10.1007/s13755-017-0025-x