A new co-training-style random forest for computer aided diagnosis
Machine learning techniques used in computer aided diagnosis (CAD) systems learn a hypothesis that helps medical experts make future diagnoses. Learning a well-performing hypothesis requires a large number of expert-diagnosed examples, which places a heavy burden on experts. By exploiting large amounts of undiagnosed examples and the power of ensemble learning, the co-training-style random forest (Co-Forest) relieves this burden and produces well-performing hypotheses. However, Co-Forest may suffer from a problem common to co-training-style algorithms: unlabeled examples may be assigned wrong labels, and these mislabeled examples accumulate during training. This happens because the limited number of originally labeled examples usually produces poor component classifiers that lack diversity and accuracy. In this paper, a new Co-Forest algorithm named Co-Forest with Adaptive Data Editing (ADE-Co-Forest) is proposed. It exploits a specific data-editing technique to identify and discard possibly mislabeled examples throughout the co-labeling iterations, and it employs an adaptive strategy to decide whether to trigger the editing operation in each case. The adaptive strategy combines five pre-conditional theorems, all of which ensure an iterative reduction of classification error and an increase in the scale of new training sets under PAC learning theory. Experiments on UCI datasets and an application to small pulmonary nodule detection in chest CT images show that ADE-Co-Forest enhances the performance of a learned hypothesis more effectively than Co-Forest and DE-Co-Forest (Co-Forest with Data Editing but without the adaptive strategy).
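The data-editing idea can be illustrated with a minimal sketch. The abstract does not say which editing technique ADE-Co-Forest uses, so the nearest-neighbor rule below is an assumption chosen only for illustration: a newly labeled example is kept only if its assigned label agrees with the majority label of its k nearest originally labeled examples. The function names `knn_majority_label` and `edit_pseudo_labels`, the parameter `k`, and the toy data are all hypothetical.

```python
from collections import Counter
import math


def knn_majority_label(x, labeled, k=3):
    """Return the majority label among the k labeled points nearest to x."""
    nearest = sorted(labeled, key=lambda pair: math.dist(x, pair[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def edit_pseudo_labels(labeled, candidates, k=3):
    """Keep only candidates whose assigned label matches the neighbor majority.

    Illustrative editing rule (an assumption, not the paper's stated method).
    """
    return [(x, y) for x, y in candidates
            if knn_majority_label(x, labeled, k) == y]


# Toy data: two well-separated classes of originally labeled examples.
labeled = [((0.0, 0.0), 0), ((0.1, 0.2), 0), ((0.2, 0.1), 0),
           ((1.0, 1.0), 1), ((0.9, 1.1), 1), ((1.1, 0.9), 1)]

# Examples labeled by the ensemble; the second one carries a suspicious label.
candidates = [((0.05, 0.1), 0), ((0.95, 1.0), 0), ((1.0, 0.95), 1)]

kept = edit_pseudo_labels(labeled, candidates)
print(kept)
```

In this toy run the second candidate lies among class-1 examples but was labeled 0, so the filter discards it before it can enter a training set. An adaptive variant would additionally decide, per iteration, whether to apply such a filter at all.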
Keywords: Semi-supervised learning; Co-training; Co-forest; Adaptive data editing; PAC theory; Pulmonary nodules detection; Computer aided diagnosis
This work is supported by the National Science Foundation of China under Grant Nos. 60702033, 60772076 and 2007307000189, the National High-Tech Research and Development Plan of China under Grant No. 2007AA01Z171, the Heilongjiang Science Foundation Key Project under Grant No. ZJG0705, and the Science Foundation for Distinguished Young Scholars of Heilongjiang Province in China under Grant No. JC200611. The authors thank partners from the 2nd Affiliated Hospital of Harbin Medical University for collecting and labeling the CT images.