
Building Diversified Multiple Trees for classification in high dimensional noisy biomedical data

Research · Published in Health Information Science and Systems

Abstract

Purpose

It is common for a trained classification model to be applied to operating data that deviates from the training data because of noise. This paper tests an ensemble method, Diversified Multiple Tree (DMT), on its capability to classify instances from one laboratory using a classifier built on instances from another laboratory.

Methods

DMT is tested on three real-world biomedical data sets from different laboratories, in comparison with four benchmark ensemble methods: AdaBoost, Bagging, Random Forests, and Random Trees. Experiments have also been conducted to study the limitations of DMT and its possible variations.

Results

Experimental results show that DMT is significantly more accurate than the benchmark ensemble classifiers in classifying new instances from a laboratory different from the one whose instances were used to build the classifier.

Conclusions

This paper demonstrates that the ensemble classifier DMT is more robust in classifying noisy data than other widely used ensemble methods. DMT works on data sets that support multiple simple trees.
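The diversification idea behind an ensemble of multiple simple trees can be illustrated with a minimal sketch: each successive tree is trained only on features not used by any earlier tree, so the trees are forced to be structurally different, and predictions are combined by majority vote. This is an illustrative reconstruction under those stated assumptions, not the authors' implementation; all function names and parameters below are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def train_diversified_trees(X, y, n_trees=3, max_depth=3, random_state=0):
    """Build trees on progressively disjoint feature sets:
    features split on by one tree are withheld from later trees."""
    remaining = list(range(X.shape[1]))
    trees = []
    for _ in range(n_trees):
        if not remaining:
            break  # no unused features left to diversify on
        tree = DecisionTreeClassifier(max_depth=max_depth,
                                      random_state=random_state)
        tree.fit(X[:, remaining], y)
        # tree_.feature holds split-feature indices (negative for leaves),
        # expressed relative to the columns the tree was trained on
        used = {remaining[i] for i in np.unique(tree.tree_.feature) if i >= 0}
        trees.append((tree, list(remaining)))
        remaining = [f for f in remaining if f not in used]
    return trees

def predict_majority(trees, X):
    """Combine the diversified trees by unweighted majority vote."""
    votes = np.stack([t.predict(X[:, cols]) for t, cols in trees])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

# Synthetic high-dimensional data standing in for gene-expression profiles
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=1)
trees = train_diversified_trees(X, y)
acc = (predict_majority(trees, X) == y).mean()
```

Because each tree is shallow and trees share no split features, the ensemble avoids relying on any single noisy feature, which is the intuition behind its robustness when the operating data is noisier than the training data.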



Acknowledgements

The work has been partially supported by ARC Discovery Grants DP0559090 and DP140103617.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Correspondence to Jixue Liu.


About this article


Cite this article

Li, J., Liu, L., Liu, J. et al. Building Diversified Multiple Trees for classification in high dimensional noisy biomedical data. Health Inf Sci Syst 5, 5 (2017). https://doi.org/10.1007/s13755-017-0025-x

