Open Systems & Information Dynamics

Volume 10, Issue 4, pp 321–333

Self-Organizing Approach for Automated Gene Identification

  • Audrey Yu. Zinovyev
  • Alexander N. Gorban
  • Tatyana G. Popova


A self-training technique for automated gene recognition, applicable both to entire genomes and to unassembled ones, is proposed. It is based on a simple measure, namely the vector of frequencies of non-overlapping triplets in a sliding window, and requires neither predetermined information nor preliminary learning. The sliding window length is the only tuning parameter; it should be chosen close to the average exon length typical of the DNA text under investigation. An essential feature of the proposed technique is preliminary visualization of the set of vectors in the subspace of the first three principal components. The distribution of DNA sites was shown to have a bullet-like structure, with one central cluster (corresponding to non-coding sites) and three or six flank clusters (corresponding to protein-coding sites). This bullet-like structure is itself an interesting illustration of triplet usage in DNA sequences. The method was tested on several genomes (the mitochondrion of P. wickerhamii, the bacterium C. crescentus, and the primitive eukaryote S. cerevisiae). The percentage of correctly predicted nucleotides exceeds 90%.
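The measure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the step between windows, the skipping of triplets containing non-ACGT characters, and the use of an SVD to obtain the principal components are all assumptions made for illustration, and the clustering step that follows visualization is not shown.

```python
import numpy as np
from itertools import product

# The 64 possible triplets over the DNA alphabet, in a fixed order.
TRIPLETS = ["".join(p) for p in product("ACGT", repeat=3)]
INDEX = {t: i for i, t in enumerate(TRIPLETS)}


def triplet_frequencies(window):
    """Frequencies of non-overlapping triplets, read from the window start."""
    counts = np.zeros(64)
    for i in range(0, len(window) - 2, 3):
        idx = INDEX.get(window[i:i + 3])
        if idx is not None:  # assumption: skip triplets with other symbols
            counts[idx] += 1
    total = counts.sum()
    return counts / total if total else counts


def window_vectors(seq, width, step):
    """One 64-dimensional frequency vector per sliding-window position."""
    starts = range(0, len(seq) - width + 1, step)
    return np.array([triplet_frequencies(seq[s:s + width]) for s in starts])


def first_three_components(X):
    """Project the centered vectors onto their first three principal axes."""
    centered = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:3].T
```

In this sketch `width` plays the role of the single tuning parameter (the window length, chosen near the average exon length), and the rows returned by `first_three_components` are the three-dimensional points one would plot to inspect the cluster structure. Shifting the window start by one or two nucleotides changes the reading phase of the triplets, which is one way to see why coding regions could fall into several flank clusters.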


Keywords (machine-generated, not supplied by the authors): Essential Feature; Central Cluster; Tuning Parameter; Gene Identification; Entire Genome





Copyright information

© Kluwer Academic Publishers 2003

Authors and Affiliations

  • Audrey Yu. Zinovyev (1)
  • Alexander N. Gorban (2)
  • Tatyana G. Popova (2)
  1. Institut des Hautes Etudes Scientifiques, France
  2. Institute of Computational Modeling of Russian Academy of Sciences, Akademgorodok, Krasnoyarsk, Russia
