Principal components analysis of protein sequence clusters
Sequence analysis of large protein families can reveal sub-clusters even within a single family. In some cases, it is of interest to know precisely which amino acid positions vary in ways that drive the separation into sub-clusters. For large families composed of large proteins, assigning relative importance to specific amino acid positions can be quite challenging. Principal components analysis (PCA) is well suited to this task: the problem is posed in a large variable space, i.e., one variable per amino acid position in the protein sequence, and PCA reduces the dimensionality of such problems by projecting the data onto an eigenspace whose axes are the directions of greatest variation. However, PCA of aligned protein sequence families is complicated by the fact that protein sequences are traditionally represented by single-letter alphabetic codes, whereas PCA requires a numerical representation of the sequence information. Here, we introduce a new amino acid sequence conversion algorithm optimized for PCA data input. The method is demonstrated first on a small artificial dataset that illustrates the characteristics and performance of the algorithm, then on a small protein sequence family of nine members, COG2263, and finally on a large protein sequence family, Pfam04237, which contains more than 1,800 sequences that group into two sub-clusters.
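The general workflow described in the abstract (convert aligned sequences to numbers, run PCA, then ask which alignment positions drive the separation) can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the authors' conversion algorithm, so a plain one-hot (indicator) encoding is used here as a stand-in, and the toy alignment, the function names `one_hot_encode`, `pca`, and `position_importance`, and the squared-loading importance score are all hypothetical choices rather than the published method.

```python
import numpy as np

# Hypothetical toy alignment (not data from the paper): two sub-clusters that
# differ at alignment columns 1 and 3 (0-based); column 4 varies within clusters.
ALIGNMENT = [
    "MKVLA",
    "MKVLA",
    "MKVLG",
    "MRVIA",
    "MRVIA",
    "MRVIG",
]
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"  # 20 amino acids plus a gap symbol


def one_hot_encode(seqs, alphabet=ALPHABET):
    """Turn aligned sequences into an (n_seqs, n_cols * len(alphabet)) 0/1 matrix.

    Assumption: one-hot encoding stands in for the paper's own conversion.
    """
    index = {aa: i for i, aa in enumerate(alphabet)}
    n_cols = len(seqs[0])
    X = np.zeros((len(seqs), n_cols * len(alphabet)))
    for row, seq in enumerate(seqs):
        for col, aa in enumerate(seq):
            X[row, col * len(alphabet) + index[aa]] = 1.0
    return X


def pca(X, n_components=2):
    """PCA via SVD of the column-centered matrix.

    Returns sample scores, component loadings, and explained-variance ratios.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]
    explained = S**2 / np.sum(S**2)
    return scores, Vt[:n_components], explained[:n_components]


def position_importance(loading, n_cols, alphabet=ALPHABET):
    """Sum squared loadings over the residue indicators of each alignment column."""
    return (loading.reshape(n_cols, len(alphabet)) ** 2).sum(axis=1)


X = one_hot_encode(ALIGNMENT)
scores, loadings, explained = pca(X)
importance = position_importance(loadings[0], n_cols=len(ALIGNMENT[0]))
# The two alignment columns with the largest contribution to PC1.
top_positions = sorted(np.argsort(importance)[::-1][:2].tolist())

print("PC1 scores:", np.round(scores[:, 0], 2))
print("PC1 explained variance ratio:", round(float(explained[0]), 3))
print("Columns driving PC1:", top_positions)
```

In this toy case the first principal component separates the first three sequences from the last three, and the per-column importance singles out the two columns where the sub-clusters differ. The sign of each component is arbitrary, so only the relative grouping of the scores is meaningful.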
Keywords: Principal components analysis; PCA; Protein sequence analysis
This work was supported by the National Institute of General Medical Sciences; Protein Structure Initiative-Biology Program; Grant Number U54-GM094597. The calculations were performed at the Ohio Center of Excellence in Biomedicine in Structural Biology and Metabonomics at Miami University.