Divide and Conquer Machine Learning for a Genomics Analogy Problem

(Progress Report)
  • Ming Ouyang
  • John Case
  • Joan Burnside
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2226)


Genomic strings are not of fixed length, but provide one-dimensional spatial data that do not divide, for conquering by machine learning, into manageable fixed-size chunks obeying Dietterich's independent and identically distributed assumption. We nonetheless need to divide genomic strings for conquering by machine learning, in this case for genomic prediction. Orthologs are genomic strings derived from a common ancestor and having the same biological function. Ortholog detection is biologically interesting since it informs us about protein divergence through evolution and, in the present context, also has important agricultural applications. The present paper indicates a means of obtaining an associated (fixed-size) attribute vector for genomic string data and of dividing and conquering the machine learning problem of ortholog detection, herein seen as an analogy problem. The attributes are based both on the typical string similarity measures of bioinformatics and on a large number of differential metrics, many new to bioinformatics. Many of the differential metrics are based on evolutionary considerations, both theoretical and empirically observed, in some cases observed by the authors. C5.0 with AdaBoosting activated was employed, and the preliminary results reported herein on complete cDNA strings are very encouraging for eventually and usefully employing the techniques described for ortholog detection on the more readily available EST (incomplete) genomic data.
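The idea of mapping a pair of variable-length genomic strings to a fixed-size attribute vector can be sketched as follows. This is only an illustration under stated assumptions: the three attributes (a generic string similarity ratio, a length difference, and a GC-content difference) and the function names `gc_content` and `pair_attributes` are hypothetical stand-ins chosen for brevity, not the similarity measures or differential metrics used in the paper.

```python
from difflib import SequenceMatcher


def gc_content(seq: str) -> float:
    """Fraction of G or C bases in a DNA string (0.0 for an empty string)."""
    if not seq:
        return 0.0
    return sum(base in "GC" for base in seq) / len(seq)


def pair_attributes(a: str, b: str) -> list[float]:
    """Map two variable-length DNA strings to a fixed-size attribute vector.

    The attributes are illustrative assumptions, not the paper's own:
      [0] generic string similarity ratio in [0, 1]
      [1] absolute difference in length
      [2] absolute difference in GC content
    """
    a, b = a.upper(), b.upper()
    similarity = SequenceMatcher(None, a, b, autojunk=False).ratio()
    length_diff = float(abs(len(a) - len(b)))
    gc_diff = abs(gc_content(a) - gc_content(b))
    return [similarity, length_diff, gc_diff]
```

Whatever the lengths of the two input strings, the result always has three entries, so vectors of this kind, one per candidate pair, could be passed to any fixed-width classifier such as a boosted decision tree learner.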





Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Ming Ouyang
    • 1
  • John Case
    • 2
  • Joan Burnside
    • 3
  1. Environmental and Occupational Health Sciences Institute, UMDNJ Robert Wood Johnson Medical School and Rutgers, The State University of New Jersey, Piscataway, USA
  2. Department of CIS, University of Delaware, Newark, USA
  3. Department of Animal & Food Sciences, University of Delaware, Newark, USA
