A Novel Algorithm for Automatic Species Identification Using Principal Component Analysis

  • Shreyas Sen
  • Seetharam Narasimhan
  • Amit Konar
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3776)


This paper describes a novel scheme for automatically identifying a species from its genomic data. Random samples of a fixed length (10,000 elements) are taken from the genome sequence of a particular species. A set of 64 keywords is generated from all possible 3-tuples of the 4 letters A (adenine), T (thymine), C (cytosine) and G (guanine), which represent the four types of nucleotide base in a DNA strand. Each sample is searched for these 4³ = 64 keywords, and the frequency of occurrence of each keyword is recorded. Repeating this process for N randomly selected samples from the genome sequence yields an N × 64 matrix of frequency counts. Principal Component Analysis is then applied to this matrix to obtain a Feature Descriptor of reduced dimension (1 × 64). Comparing descriptors computed from different species, and from different samples of the same species, shows that the descriptor is unique to a particular species, while wide differences exist between the descriptors of different species. Because the variance of the descriptors for a given genome sequence is negligible, the proposed scheme has extensive applications in automatic species identification.
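The pipeline described in the abstract can be sketched in a few functions. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how PCA produces the 1 × 64 descriptor, so the sketch assumes it is the leading principal component of the N × 64 count matrix; keywords are assumed to be counted as overlapping windows; and all names (`count_keywords`, `frequency_matrix`, `feature_descriptor`) are hypothetical. Power iteration is used only to keep the example free of external libraries; in practice a linear algebra package would likely be used.

```python
import itertools
import random

# All 4^3 = 64 three-letter keywords over the DNA alphabet, in a fixed order.
KEYWORDS = ["".join(p) for p in itertools.product("ATCG", repeat=3)]
INDEX = {kw: i for i, kw in enumerate(KEYWORDS)}


def count_keywords(sample):
    """Return 64 counts of overlapping 3-letter keywords in one sample."""
    counts = [0] * len(KEYWORDS)
    for i in range(len(sample) - 2):
        # Windows containing other symbols (e.g. N) are skipped (assumption).
        j = INDEX.get(sample[i:i + 3])
        if j is not None:
            counts[j] += 1
    return counts


def frequency_matrix(genome, n_samples, sample_len, rng):
    """Build the N x 64 matrix from N randomly positioned samples."""
    rows = []
    for _ in range(n_samples):
        start = rng.randrange(len(genome) - sample_len + 1)
        rows.append(count_keywords(genome[start:start + sample_len]))
    return rows


def feature_descriptor(matrix, iterations=200):
    """Assumed descriptor: leading principal component, via power iteration."""
    n, d = len(matrix), len(matrix[0])
    means = [sum(row[j] for row in matrix) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in matrix]
    v = [1.0 / (j + 1) for j in range(d)]  # arbitrary nonzero start vector
    for _ in range(iterations):
        # Apply the covariance matrix implicitly as X^T (X v); the 1/(N-1)
        # scale factor is dropped because v is renormalised each step.
        proj = [sum(r[j] * v[j] for j in range(d)) for r in centered]
        w = [sum(centered[i][j] * proj[i] for i in range(n)) for j in range(d)]
        norm = sum(x * x for x in w) ** 0.5
        if norm == 0.0:  # all samples identical: no variance to describe
            return [0.0] * d
        v = [x / norm for x in w]
    # Fix the sign so descriptors from different runs are comparable.
    pivot = max(range(d), key=lambda j: abs(v[j]))
    return v if v[pivot] > 0 else [-x for x in v]
```

A call such as `feature_descriptor(frequency_matrix(genome, 20, 10_000, random.Random(0)))` would return a list of 64 numbers for a genome string of at least 10,000 letters; the number of samples N is not given in the abstract, so 20 is only a placeholder.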


Keywords: Principal Component Analysis, Genome Sequence, Feature Descriptor, Frequency Count, Global Alignment



Copyright information

© Springer-Verlag Berlin Heidelberg 2005

Authors and Affiliations

  • Shreyas Sen (1)
  • Seetharam Narasimhan (1)
  • Amit Konar (1)
  1. Dept. of Electronics and Telecommunication Engineering, Jadavpur University, Kolkata, India
