Abstract
For easy comprehensibility, rules are preferrable to non-linear kernel functions in the analysis of bio-medical data. In this paper, we describe two rule induction approaches—C4.5 and our PCL classifier—for discovering rules from both traditional clinical data and recent gene expression or proteomic profiling data. C4.5 is a widely used method, but it has two weaknesses, the single coverage constraint and the fragmentation problem, that affect its accuracy. PCL is a new rule-based classifier that overcomes these two weaknesses of decision trees by using many significant rules. We present a thorough comparison to show that our PCL method is much more accurate than C4.5, and it is also superior to Bagging and Boosting in general.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
References
Blake, C.L., Murphy, P.M.: The UCI machine learning repository. In: Department of Information and Computer Science. University of California, Irvine (1998), http://www.cs. uci.edu/Â mlearn/MLRepository.html
Breiman, L., Friedman, J., Olshen, R., Stone, C.: Classification and Regression Trees. Wadsworth International Group, Belmont (1984)
Breiman, L.: Bagging predictors. Machine Learning 24, 123–140 (1996)
Dong, G., Li, J.: Efficient mining of emerging patterns: Discovering trends and differences. In: Chaudhuri, S., Madigan, D. (eds.) Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, pp. 43–52. ACM Press, New York (1999)
Fayyad, U.M., Irani, K.B.: Multi-interval discretization of continuous-valued attributes for classification learning. In: Bajcsy, R. (ed.) Proceedings of the Thirteenth International Joint Conference on Artificial Intelligence, pp. 1022–1029. Morgan Kaufmann, San Francisco (1993)
Freund, Y., Schapire, R.E.: Experiments with a new boosting algorithm. In: Saitta, L. (ed.) Machine Learning: Proceedings of the Thirteenth International Conference, Bari, Italy, July 1996, pp. 148–156. Morgan Kaufmann, San Francisco (1996)
Friedman, J.H., Kohavi, R., Yun, Y.: Lazy decision trees. In: Proceedings of the Thirteenth National Conference on Artificial Intelligence, AAAI 1996, Portland, Oregon, August 1996, pp. 717–724. AAAI Press, Menlo Park (1996)
Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P., Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander, E.S.: Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science 286, 531–537 (1999)
Gordon, G.J., Jensen, R.V., Hsiao, L.-L., Gullans, S.R., Blumenstock, J.E., Ramaswamy, S., Richards, W.G., Sugarbaker, D.J., Bueno, R.: Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Research 62, 4963–4967 (2002)
Li, J., Liu, H., Downing, J.R., Yeoh, A.E.-J., Wong, L.: Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. Bioinformatics 19, 71–78 (2003)
Li, J., Ramamohanarao, K., Dong, G.: The space of jumping emerging patterns and its incremental maintenance algorithms. In: Proceedings of the Seventeenth International Conference on Machine Learning, Stanford, CA, USA, June 2000, pp. 551–558. Morgan Kaufmann, San Francisco (2000)
Li, J., Wong, L.: Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns. Bioinformatics 18, 725–734 (2002)
Li, J., Wong, L.: Solving the fragmentation problem of decision trees by discovering boundary emerging patterns. In: Proceedings of 2002 IEEE International Conference on Data Mining (ICDM 2002), Maebashi City, Japan, pp. 653–656. IEEE Computer Society, Los Alamitos (2002)
Pagallo, G., Haussler, D.: Boolean feature discovery in empirical learning. Machine Learning 5, 71–99 (1990)
Petricoin, E.F., MArdekani, A., Hitt, B.A., Levine, P.J., Fusaro, V.A., Steinberg, S.M., Mills, G.B., Simone, C., Fishman, D.A., Kohn, E.C., Liotta, L.A.: Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359, 572–577 (2002)
Quinlan, J.R.: C4.5: Programs for Machine Learning. Morgan Kaufmann, San Mateo (1993)
Singh, D., Febbol, P.G., Ross, K., Jackson, D.G., Manola, J., Ladd, C., Tamayo, P., Renshaw, A.A., D’Amico, A.V., Richie, J.P., Lander, E.S., Loda, M., Kantoff, P.W., Golub, T.R., Sellers, W.R.: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell 1, 203–209 (2002)
Yeoh, E.-J., Ross, M.E., Shurtleff, S.A., Williams, W.K., Patel, D., Mahfouz, R., Behm, F.G., Raimondi, S.C., Relling, M.V., Patel, A., Cheng, C., Campana, D., Wilkins, D., Zhou, X., Li, J., Liu, H., Pui, C.-H., Evans, W.E., Naeve, C., Wong, L., Downing, J.R.: Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 1, 133–143 (2002)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Li, J., Wong, L. (2003). Using Rules to Analyse Bio-medical Data: A Comparison between C4.5 and PCL. In: Dong, G., Tang, C., Wang, W. (eds) Advances in Web-Age Information Management. WAIM 2003. Lecture Notes in Computer Science, vol 2762. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-45160-0_25
Download citation
DOI: https://doi.org/10.1007/978-3-540-45160-0_25
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40715-7
Online ISBN: 978-3-540-45160-0
eBook Packages: Springer Book Archive