Applied Bioinformatics

Volume 3, Issue 2–3, pp 137–148

A Sequence Alignment-Independent Method for Protein Classification

  • John K. Vries
  • Rajan Munshi
  • Dror Tobi
  • Judith Klein-Seetharaman
  • Panayiotis V. Benos
  • Ivet Bahar
Original Research


Annotation of the rapidly accumulating body of sequence data relies heavily on the detection of remote homologues and functional motifs in protein families. The most popular methods rely on sequence alignment. These include programs that use a scoring matrix to compare the probability of a potential alignment with that expected by random chance, and programs that use curated multiple alignments to train profile hidden Markov models (HMMs). Related approaches depend on bootstrapping multiple alignments from a single sequence. However, alignment-based programs have limitations. They assume that contiguity is conserved between homologous segments, which may not hold after genetic recombination or horizontal transfer. Alignments also become ambiguous when sequence similarity drops below 40%. This has kindled interest in classification methods that do not rely on alignment.

An approach to classification without alignment, based on the distribution of contiguous sequences of four amino acids (4-grams), was developed. Interest in 4-grams stemmed from the observation that almost all of the 20^4 (160,000) theoretically possible 4-grams occur in natural sequences and that the majority of 4-grams are uniformly distributed. This implies that the probability of finding identical 4-grams by random chance in unrelated sequences is low. A Bayesian probabilistic model was developed to test this hypothesis. For each protein family in Pfam-A and PIR-PSD, a feature vector called a probe was constructed from the set of 4-grams that best characterised the family.

In rigorous jackknife tests, unknown sequences from Pfam-A and PIR-PSD were compared with the probes for each family. A classification result was deemed a true positive if the probe match with the highest probability was in first place in a rank-ordered list. This was achieved in 70% of cases. Analysis of false positives suggested that the precision might approach 85% if selected families were clustered into subsets. Case studies indicated that the 4-grams in common between an unknown and the best matching probe correlated with functional motifs from PRINTS. The results showed that remote homologues and functional motifs could be identified from an analysis of 4-gram patterns.
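The 4-gram idea summarised above can be illustrated with a minimal sketch. This is not the Bayesian model used in the study, whose details are not given here. It substitutes a simple shared-4-gram count as the score, and the function names (`four_grams`, `build_probe`, `rank_families`), the `top_k` parameter and the toy sequences are all hypothetical.

```python
from collections import Counter

def four_grams(seq, n=4):
    """Return all overlapping contiguous n-grams of a protein sequence."""
    return [seq[i:i + n] for i in range(len(seq) - n + 1)]

def build_probe(family_seqs, top_k=50):
    """Hypothetical probe: the top_k 4-grams present in the most family members."""
    counts = Counter()
    for seq in family_seqs:
        counts.update(set(four_grams(seq)))  # count each 4-gram once per sequence
    return {gram for gram, _ in counts.most_common(top_k)}

def rank_families(unknown, probes):
    """Rank families by the number of 4-grams shared with the unknown sequence.

    A stand-in score (assumption): the study ranks by a Bayesian probability.
    """
    grams = set(four_grams(unknown))
    scores = {name: len(grams & probe) for name, probe in probes.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy sequences, invented for illustration only.
probes = {
    "famA": build_probe(["MKTAYIAKQR", "MKTAYLAKQR"]),
    "famB": build_probe(["GSHMLEDPVV", "GSHMLEDPAV"]),
}
ranking = rank_families("AAMKTAYIAK", probes)
print(ranking)
```

In this sketch a classification would count as a true positive when the unknown's own family appears first in `ranking`, mirroring the rank-ordered criterion described in the abstract.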



Copyright information

© Adis Data Information BV 2004

Authors and Affiliations

  • John K. Vries (1)
  • Rajan Munshi (1)
  • Dror Tobi (1)
  • Judith Klein-Seetharaman (2, 3)
  • Panayiotis V. Benos (4)
  • Ivet Bahar (1)

  1. Department of Molecular Genetics and Biochemistry, School of Medicine, Center for Computational Biology and Bioinformatics, University of Pittsburgh, Pittsburgh, USA
  2. Department of Pharmacology, School of Medicine, University of Pittsburgh, USA
  3. Language Technologies Institute, Institute for Software Research International, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
  4. School of Medicine, Department of Human Genetics, Graduate School of Public Health, Center for Computational Biology and Bioinformatics, University of Pittsburgh, Pittsburgh, USA
