Principal components analysis of protein sequence clusters

  • Bo Wang
  • Michael A. Kennedy


Sequence analysis of large protein families can reveal sub-clusters even within a single family. In some cases, it is of interest to know precisely which amino acid position variations are most responsible for driving the separation into sub-clusters. For large families composed of large proteins, assigning relative importance to specific amino acid positions can be quite challenging. Principal components analysis (PCA) is well suited to this task: the problem is posed in a large variable space, whose dimension is the number of amino acids that make up the protein sequence, and PCA reduces the dimensionality of such problems by projecting the data into an eigenspace whose axes represent the directions of greatest variation. However, PCA of aligned protein sequence families is complicated by the fact that protein sequences are traditionally represented by single-letter alphabetic codes, whereas PCA requires numerical input, so the sequence information must first be converted into a numerical representation. Here, we introduce a new amino acid sequence conversion algorithm optimized for PCA data input. The method is demonstrated on a small artificial dataset that illustrates the characteristics and performance of the algorithm, on a small protein sequence family of nine members, COG2263, and finally on a large protein sequence family, Pfam04237, which contains more than 1,800 sequences that group into two sub-clusters.
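The general idea of converting an alignment to numbers and reading position importance from the PC1 loadings can be sketched as follows. This is only an illustration under stated assumptions: it uses plain one-hot (indicator) encoding as a stand-in, not the conversion algorithm introduced in this work, and it estimates the first principal component by power iteration rather than a full eigendecomposition. The toy alignment and all function names are hypothetical.

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus the gap character


def one_hot_encode(alignment):
    """Turn equal-length aligned sequences into rows of 0/1 indicator values.

    Assumption: one-hot encoding is a generic stand-in, not the paper's method.
    """
    index = {aa: i for i, aa in enumerate(ALPHABET)}
    rows = []
    for seq in alignment:
        row = [0.0] * (len(seq) * len(ALPHABET))
        for pos, aa in enumerate(seq):
            row[pos * len(ALPHABET) + index[aa]] = 1.0
        rows.append(row)
    return rows


def center_columns(rows):
    """Subtract each column mean so PCA describes variation about the mean."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[x - m for x, m in zip(row, means)] for row in rows]


def first_principal_component(centered, iterations=200):
    """Estimate the PC1 loading vector by power iteration on X^T X."""
    n_cols = len(centered[0])
    # Deterministic starting vector with distinct entries.
    v = [1.0 + 0.01 * j for j in range(n_cols)]
    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        v = [sum(s * row[j] for s, row in zip(scores, centered))
             for j in range(n_cols)]
        norm = math.sqrt(sum(w * w for w in v))
        v = [w / norm for w in v]
    return v


def position_importance(loadings, n_positions):
    """Collapse per-residue loadings into one magnitude per alignment position."""
    k = len(ALPHABET)
    return [
        math.sqrt(sum(w * w for w in loadings[p * k:(p + 1) * k]))
        for p in range(n_positions)
    ]


# Hypothetical toy alignment: two sub-clusters that differ at position 2
# (0-based, A vs D), plus minor unrelated variation at position 3 (V vs I).
alignment = ["MKAVL", "MKAVL", "MKAIL", "MKDVL", "MKDVL", "MKDIL"]

encoded = one_hot_encode(alignment)
centered = center_columns(encoded)
pc1 = first_principal_component(centered)
importance = position_importance(pc1, len(alignment[0]))
scores = [sum(x * w for x, w in zip(row, pc1)) for row in centered]

print("Per-position PC1 loading magnitude:", [round(v, 3) for v in importance])
print("PC1 scores:", [round(s, 3) for s in scores])
```

In this toy case, position 2 should carry essentially all of the PC1 loading, and the PC1 scores of the first three sequences should have the opposite sign to those of the last three. The overall sign of a principal component is arbitrary, so only the relative signs are meaningful.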


Keywords: Principal components analysis, PCA, Protein sequence analysis



This work was supported by the National Institute of General Medical Sciences, Protein Structure Initiative-Biology Program (Grant Number U54-GM094597). The calculations were performed at the Ohio Center of Excellence in Biomedicine in Structural Biology and Metabonomics at Miami University.

Supplementary material

10969_2014_9173_MOESM1_ESM.tiff (310 kb)
CLANS analysis of the 1414 sequences that form two sub-clusters within Pfam04237. In the plot, 1040 sequences fall into the sub-cluster on the left, whereas 374 sequences fall into the sub-cluster on the right. (TIFF 310 kb)
10969_2014_9173_MOESM2_ESM.tiff (1.1 mb)
PCA loadings data mapped onto a representative structure from Pfam04237 based on PC1 only. The loadings plot of PC1 vs PC2 is shown in a). The absolute values of the PC1 loadings were used, and a cutoff of 30% of the largest value was applied to filter the data (red broken lines). In the figure, 24 out of 196 positions were considered important for variance (red<orange<cyan<green). Cartoons of protein 3H9X (two views), with positions colored according to the same color sequence as in the loadings plot, are shown in c) and d). (TIFF 1124 kb)
10969_2014_9173_MOESM3_ESM.tiff (535 kb)
Analysis of the PCA loadings data for Pfam04237 to identify conserved residue positions using only the first PC. A maximum Euler distance of 5% of the maximum was used to filter the data (red dashed lines). In the figure, 103 out of 196 positions were considered low-variance residues (red<orange<yellow<cyan<green). Panels b) and c) show cartoons of protein 3H9X (two views), with positions colored according to the same color sequence as in the loadings plot. (TIFF 534 kb)
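The two loading-based filters described in these captions can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes that "important" positions are those whose absolute PC1 loading is at least 30% of the largest absolute loading, and that "conserved" positions are those at or below 5% of it. The function name, the inclusive comparisons, and the example loadings are hypothetical.

```python
def split_by_loading(pc1_loadings, variable_fraction=0.30, conserved_fraction=0.05):
    """Return (variable, conserved) position indices from PC1 loadings.

    Assumption: cutoffs are fractions of the largest absolute PC1 loading,
    following one reading of the 30% and 5% thresholds in the captions.
    """
    magnitudes = [abs(w) for w in pc1_loadings]
    largest = max(magnitudes)
    variable = [i for i, m in enumerate(magnitudes)
                if m >= variable_fraction * largest]
    conserved = [i for i, m in enumerate(magnitudes)
                 if m <= conserved_fraction * largest]
    return variable, conserved


# Hypothetical per-position PC1 loadings for a six-residue alignment.
example_loadings = [0.02, -0.80, 0.10, 0.01, 0.35, -0.03]
variable, conserved = split_by_loading(example_loadings)

print("Positions driving variance:", variable)
print("Low-variance positions:", conserved)
```

Positions that fall between the two cutoffs (index 2 in this example) are assigned to neither group.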



Copyright information

© Springer Science+Business Media Dordrecht 2014

Authors and Affiliations

  1. Department of Chemistry and Biochemistry, Miami University, Oxford, USA
