A Comparative Study of Content Statistics of Coding Regions in an Evolutionary Computation Framework for Gene Prediction

  • Javier Pérez-Rodríguez
  • Alexis G. Arroyo-Peña
  • Nicolás García-Pedrajas
Part of the Lecture Notes in Computer Science book series (LNCS, volume 7345)

Abstract

The determination of which parts of a DNA sequence are coding is an unsolved and relevant problem in the field of bioinformatics. This problem is called gene prediction or gene finding, and it consists of locating the most likely gene structure in a genomic sequence.

Taking into account some restrictions, gene structure prediction may be considered as a search problem. To address the problem, evolutionary computation approaches can be used, although their performance will depend on the discriminative power of the statistical measures employed to extract useful features from the sequence.

In this study, we test six different content statistics to determine which of them are most relevant in an evolutionary search for coding and non-coding regions of human DNA. We conduct this comparative study on human chromosomes 3, 19 and 21.
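The abstract does not list the six content statistics. As an illustration only, the sketch below computes average mutual information (a term that appears in the keywords) between nucleotides separated by a distance k, using the standard information-theoretic definition. The function name, the choice of base-2 logarithm, and the estimation of marginals from the observed pairs are assumptions made for this example, not the authors' implementation.

```python
import math
from collections import Counter


def average_mutual_information(seq: str, k: int) -> float:
    """Return I(k) = sum_{a,b} p_ab(k) * log2(p_ab(k) / (p_a * p_b)) in bits.

    Hypothetical helper for illustration: p_ab(k) is the observed frequency
    of symbol a followed by symbol b at distance k, and p_a, p_b are the
    marginals estimated from those same pairs.
    """
    seq = seq.upper()
    # All symbol pairs separated by exactly k positions.
    pairs = [(seq[i], seq[i + k]) for i in range(len(seq) - k)]
    if not pairs:
        return 0.0  # sequence too short for this distance

    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)

    ami = 0.0
    for (a, b), count in joint.items():
        p_ab = count / n
        ami += p_ab * math.log2(p_ab / ((left[a] / n) * (right[b] / n)))
    return ami


if __name__ == "__main__":
    # Toy sequence, not real genomic data.
    toy = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
    for distance in (1, 2, 3):
        print(distance, round(average_mutual_information(toy, distance), 4))
```

In a search framework of the kind the abstract describes, a statistic like this would presumably be evaluated over candidate regions and fed into the fitness function; how the authors actually combine their statistics is not stated on this page.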

Keywords

  • Codon Usage
  • Synonymous Codon
  • Content Statistic
  • Average Mutual Information
  • Translation Initiation Site

Copyright information

© Springer-Verlag Berlin Heidelberg 2012

Authors and Affiliations

  • Javier Pérez-Rodríguez (1)
  • Alexis G. Arroyo-Peña (1)
  • Nicolás García-Pedrajas (1)

  1. Department of Computing and Numerical Analysis, University of Córdoba, Spain
