Knowledge Discovery in Multi-label Phenotype Data

  • Amanda Clare
  • Ross D. King
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2168)


The biological sciences are undergoing an explosion in the amount of available data, and new data analysis methods are needed to deal with it. We present work using knowledge discovery in databases (KDD) to analyse data from mutant phenotype growth experiments with the yeast S. cerevisiae in order to predict novel gene functions. The analysis presented a number of challenges: multiple class labels per gene, a large number of sparsely populated classes, the need to learn a set of accurate rules (not a complete classification), and a very large number of missing values. We developed resampling strategies and modified the C4.5 algorithm to deal with these problems. The rules learnt are accurate and biologically meaningful. They predict the function of 83 putative genes of currently unknown function at an estimated accuracy of over 80%.
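As an illustration of what adapting a decision-tree learner to multi-label data can involve, the sketch below replaces the usual single-label entropy with a sum of per-class membership entropies, so that an example carrying several labels contributes to every class it belongs to. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the representation of each example as a set of label strings, and the toy labels are all hypothetical.

```python
import math
from typing import Iterable, Sequence, Set


def multilabel_entropy(examples: Sequence[Set[str]], classes: Iterable[str]) -> float:
    """Sum, over all classes, of the binary entropy of class membership.

    Assumption (illustrative only): each example is a set of class labels,
    and an example may belong to any number of classes.
    """
    n = len(examples)
    if n == 0:
        return 0.0
    total = 0.0
    for c in classes:
        # p is the fraction of examples that carry label c; 1 - p do not.
        p = sum(1 for labels in examples if c in labels) / n
        for prob in (p, 1.0 - p):
            if prob > 0.0:  # treat 0 * log(0) as 0
                total -= prob * math.log2(prob)
    return total


# Hypothetical toy data: four genes, two functional classes "a" and "b".
toy = [{"a", "b"}, {"a"}, {"b"}, {"a", "b"}]
mixed = multilabel_entropy(toy, ["a", "b"])

# A node in which every example has the same label set has zero entropy.
pure = multilabel_entropy([{"a"}, {"a"}], ["a", "b"])
```

A split criterion built on this quantity would prefer partitions that reduce it, in the same way that information gain is used for single-label trees.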


Keywords (added by machine, not by the authors): Knowledge Discovery; Functional Class; Decision Tree Algorithm; Functional Hierarchy; International Human Genome Sequencing Consortium



Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Amanda Clare (1)
  • Ross D. King (1)
  1. Department of Computer Science, University of Wales Aberystwyth, UK
