A Tree-Based Approach to the Discovery of Diagnostic Biomarkers for Ovarian Cancer

  • Jinyan Li
  • Kotagiri Ramamohanarao
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3056)


Computational diagnosis of cancer is a classification problem with two special requirements on a learning algorithm: perfect accuracy and a small number of features used in the classifier. This paper presents our results on an ovarian cancer data set. The data set is described by 15154 features and consists of 253 samples, each corresponding to a woman who either has ovarian cancer or does not. The raw data were generated by mass spectrometry, which measures the intensities of 15154 protein or peptide features in a blood sample from each woman. The purpose is to identify a small subset of the features that can be used as biomarkers to separate the two classes of samples with high accuracy. The identified features could then be used in routine clinical diagnosis to replace labour-intensive and expensive conventional diagnostic methods. Our new tree-based method achieves perfect 100% accuracy in 10-fold cross-validation on this data set, and it directly outputs a small set of biomarkers. We then explain why support vector machines, naive Bayes, and k-nearest neighbour cannot fulfil this purpose. This study also aims to strengthen the communication between contemporary cancer research and data mining techniques.
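The workflow the abstract describes can be illustrated with a minimal sketch. This is not the paper's actual tree-based committee method; it only shows the general idea under stated assumptions: fit a decision tree to high-dimensional, mass-spectrometry-style data, read off the handful of features the tree actually splits on (candidate "biomarkers"), and score the classifier with 10-fold cross-validation. The data here are synthetic, and `make_classification`, `DecisionTreeClassifier`, and `cross_val_score` are scikit-learn utilities standing in for the authors' own implementation.

```python
# Illustrative sketch only -- not the paper's method.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the ovarian cancer set: 253 samples, many
# features, only a few of which are informative (mimicking the small
# biomarker subset the paper seeks among 15154 features).
X, y = make_classification(n_samples=253, n_features=500, n_informative=5,
                           n_redundant=0, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Features with nonzero importance are the ones the tree splits on --
# the analogue of the small biomarker set a tree-based method outputs
# directly, unlike SVM, naive Bayes, or k-NN.
biomarkers = np.flatnonzero(tree.feature_importances_)
print("candidate biomarker indices:", biomarkers)

# 10-fold cross-validation accuracy of a single tree.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=10)
print("mean 10-fold CV accuracy: %.3f" % scores.mean())
```

Note the design point this makes concrete: a tree classifier performs feature selection as a by-product of training, so the list of used features falls out of the fitted model, whereas distance- or margin-based classifiers weight all 15154 features and yield no comparably small, interpretable subset.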


Keywords: decision trees, committee method, ovarian cancer, biomarkers, classification





Copyright information

© Springer-Verlag Berlin Heidelberg 2004

Authors and Affiliations

  • Jinyan Li (1)
  • Kotagiri Ramamohanarao (2)
  1. Institute for Infocomm Research, Singapore
  2. Dept. of CSSE, The University of Melbourne, Australia
