
Building Diversified Multiple Trees for classification in high dimensional noisy biomedical data

Research · Published in Health Information Science and Systems

Abstract

Purpose

It is common for a trained classification model to be applied to operating data that deviates from the training data because of noise. This paper tests an ensemble method, Diversified Multiple Tree (DMT), on its capability to classify instances from one laboratory using a classifier built on instances from another laboratory.

Methods

DMT is tested on three real-world biomedical data sets from different laboratories, in comparison with four benchmark ensemble methods: AdaBoost, Bagging, Random Forests, and Random Trees. Experiments have also been conducted to study the limitations of DMT and its possible variations.

Results

Experimental results show that DMT is significantly more accurate than the benchmark ensemble classifiers in classifying new instances from a laboratory different from the one whose instances were used to build the classifier.

Conclusions

This paper demonstrates that the ensemble classifier DMT is more robust in classifying noisy data than other widely used ensemble methods. DMT works on data sets that support multiple simple trees.
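The diversification idea behind an ensemble of multiple simple trees can be illustrated with a minimal sketch: each successive tree is trained only on features not used by any earlier tree, so the trees are forced to be structurally different, and predictions are combined by majority vote. This is an illustrative reconstruction under those stated assumptions, not the authors' implementation; all function names and parameters below are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

def train_diversified_trees(X, y, n_trees=3, max_depth=3, random_state=0):
    """Build trees on progressively disjoint feature sets:
    features split on by one tree are withheld from later trees."""
    remaining = list(range(X.shape[1]))
    trees = []
    for _ in range(n_trees):
        if not remaining:
            break  # no unused features left to diversify on
        tree = DecisionTreeClassifier(max_depth=max_depth,
                                      random_state=random_state)
        tree.fit(X[:, remaining], y)
        # tree_.feature holds split-feature indices (negative for leaves),
        # expressed relative to the columns the tree was trained on
        used = {remaining[i] for i in np.unique(tree.tree_.feature) if i >= 0}
        trees.append((tree, list(remaining)))
        remaining = [f for f in remaining if f not in used]
    return trees

def predict_majority(trees, X):
    """Combine the diversified trees by unweighted majority vote."""
    votes = np.stack([t.predict(X[:, cols]) for t, cols in trees])
    return np.apply_along_axis(
        lambda v: np.bincount(v.astype(int)).argmax(), 0, votes)

# Synthetic high-dimensional data standing in for gene-expression profiles
X, y = make_classification(n_samples=200, n_features=50,
                           n_informative=10, random_state=1)
trees = train_diversified_trees(X, y)
acc = (predict_majority(trees, X) == y).mean()
```

Because each tree is shallow and trees share no split features, the ensemble avoids relying on any single noisy feature, which is the intuition behind its robustness when the operating data is noisier than the training data.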



Acknowledgements

The work has been partially supported by ARC Discovery Grants DP0559090 and DP140103617.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Correspondence to Jixue Liu.


About this article


Cite this article

Li, J., Liu, L., Liu, J. et al. Building Diversified Multiple Trees for classification in high dimensional noisy biomedical data. Health Inf Sci Syst 5, 5 (2017). https://doi.org/10.1007/s13755-017-0025-x

