Discovery Science

Volume 2226 of the series Lecture Notes in Computer Science pp 290-303


Divide and Conquer Machine Learning for a Genomics Analogy Problem

(Progress Report)
  • Ming Ouyang, Environmental and Occupational Health Sciences Institute, UMDNJ Robert Wood Johnson Medical School and Rutgers, The State University of New Jersey
  • John Case, Department of CIS, University of Delaware
  • Joan Burnside, Department of Animal & Food Sciences, University of Delaware

Genomic strings are not of fixed length, but provide one-dimensional spatial data that do not divide, for conquering by machine learning, into manageable fixed-size chunks obeying Dietterich's independent and identically distributed assumption. We nonetheless need to divide genomic strings for conquering by machine learning, in this case for genomic prediction. Orthologs are genomic strings derived from a common ancestor and having the same biological function. Ortholog detection is biologically interesting since it informs us about protein divergence through evolution and, in the present context, also has important agricultural applications. The present paper indicates a means of obtaining an associated (fixed-size) attribute vector for genomic string data, and of dividing and conquering the machine learning problem of ortholog detection, seen here as an analogy problem. The attributes are based both on the typical string similarity measures of bioinformatics and on a large number of differential metrics, many new to bioinformatics. Many of the differential metrics are based on evolutionary considerations, both theoretical and empirically observed, in some cases observed by the authors. C5.0 with AdaBoosting activated was employed, and the preliminary results reported herein for complete cDNA strings are very encouraging for eventually and usefully employing the techniques described for ortholog detection on the more readily available EST (incomplete) genomic data.
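
The central idea of mapping a pair of variable-length genomic strings to a fixed-size attribute vector can be illustrated with a minimal Python sketch. The four attributes below (a generic similarity ratio, a relative length difference, a GC-content difference, and a k-mer profile distance) are simple stand-ins chosen for illustration only; they are assumptions, not the similarity measures or differential metrics used in the paper, and the function names are hypothetical. The classification step (C5.0 with boosting) is not shown.

```python
from collections import Counter
from difflib import SequenceMatcher


def kmer_profile(seq, k=3):
    """Relative frequency of each overlapping k-mer in seq (illustrative helper)."""
    n_kmers = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_kmers))
    total = max(n_kmers, 1)  # avoid division by zero for very short strings
    return {kmer: n / total for kmer, n in counts.items()}


def gc_content(seq):
    """Fraction of G and C characters in seq; 0.0 for an empty string."""
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def pair_attributes(a, b, k=3):
    """Map a pair of variable-length strings to a fixed-length attribute vector.

    The attributes are stand-ins for illustration, not the paper's metrics.
    """
    pa, pb = kmer_profile(a, k), kmer_profile(b, k)
    kmer_distance = sum(
        abs(pa.get(m, 0.0) - pb.get(m, 0.0)) for m in set(pa) | set(pb)
    )
    return [
        SequenceMatcher(None, a, b, autojunk=False).ratio(),  # string similarity
        abs(len(a) - len(b)) / max(len(a), len(b), 1),        # relative length difference
        abs(gc_content(a) - gc_content(b)),                   # composition difference
        kmer_distance,                                        # k-mer profile distance
    ]
```

Whatever the lengths of the two input strings, `pair_attributes` returns a list of four numbers, so a collection of labeled string pairs becomes a table of fixed-size rows that a decision-tree learner could consume.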