Background

The biological function of a protein can often be inferred from its similarity to sequences of known function in sequence databases using single-sequence similarity algorithms such as BLAST [1] or FASTA [2]. Such algorithms are suitable for detecting highly similar sequences, but are not sensitive enough to capture highly divergent ones. Therefore, many members of an evolutionarily diverse family of proteins may be overlooked. Within the last decade, the sensitivity of sequence searching techniques has been improved by profile- or motif-based analysis, which uses information derived from multiple sequence alignments (MSAs) to construct and search for sequence patterns [3-6]. Unlike single-sequence similarity, a profile or motif can exploit additional information, such as the position and identity of residues that are conserved throughout the family, as well as variable insertion and deletion probabilities.

Currently, the most widely used profile and motif databases are: BLOCKS [4], which stores ungapped MSAs corresponding to the most conserved regions of protein families; PROSITE [3], which uses single consensus patterns and profiles to characterize each family of sequences; Pfam [7] and SMART [8], which use profile hidden Markov models (HMMs) to find commonly occurring protein domains; and PRINTS-S [9], which is similar to both PROSITE and BLOCKS, except that it uses "fingerprints" composed of more than one pattern to characterize a protein family. Recently, a new profile and motif database, InterPro [10, 11], an amalgamation of PROSITE, ProDom [12], Pfam and the PRINTS fingerprint database, was used in the automatic annotation of complete proteomes including fly [13] and human [14].

Most of these databases strive to classify protein sequences into broad families, with the exception of the PRINTS-S fingerprint database, which has both family- and subfamily-specific fingerprints [9]. The ability to classify query proteins into subfamilies within superfamilies is useful in providing more specific functional annotations. Therefore, we propose a method based on HMMs to find windows of residues that are distinct in protein subfamilies. Although HMMs are expensive, both in terms of memory and computation time, they provide a solid statistical foundation for the modeling of information in an MSA. Our method works by constructing an HMM database representing a sliding window of residues for the MSA of each subfamily and then comparing the HMM database across an entire sequence database of the protein superfamily (Fig. 1). To demonstrate the utility of our approach, it has been applied to two well-studied protein superfamilies: the cadherin superfamily [15] and the EF-hand superfamily [16].

Figure 1
Flow diagram of the method. First, filter a primary database using a profile or motif database to obtain the subset of sequences that will comprise the protein superfamily database. Then, partition the protein superfamily database into subfamilies according to the chosen subfamily criterion. Next, build an MSA for each subfamily and build HMMs for all windows of width w in the MSA. Finally, tabulate matches with an e-value under 100 to identify subfamily signatures for the HMM database of the superfamily, and tabulate matches with an e-value under 0.1 to identify potentially significant functional regions in the subfamily.

Results and discussion

Subfamily partitioning

The purpose of subfamily partitioning is to create an MSA of each subfamily; however, if quality MSAs of subfamilies already exist, the analysis can commence at that point, as is done for PRINTS-S [9]. This section outlines a simple procedure for partitioning, although other methods exist that may be preferable [17-21]. Many methods, like the one described herein, use a tree clustering approach based on sequence distance or identity.

The members of a protein family can be identified by collecting the sequences that match profile or motif databases such as the ones described in the Background. This initial set of sequences is designated as the superfamily database, and the total number of sequences in it is denoted n_T. How a protein subfamily is selected and delimited depends on the researcher who defines it. Subfamilies can be partitioned based on sequence or function; while function-based methods are valid, sequence-based methods can be automated.

To divide the sequences into subfamilies, construct a square similarity matrix, S, of dimensions n_T by n_T, where S_{i,j} is the percent similarity between sequence i and sequence j. The alignment between a pair of sequences is determined in CLUSTALW by performing a global alignment [22] with an opening gap penalty of 10, an extension gap penalty of 0.1 and a Gonnet scoring matrix [23, 24]. The percent similarity is estimated by dividing the pairwise alignment score by the maximum score obtained when each of the two sequences is aligned to itself.
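The normalization described above can be sketched in Python as follows. This is a minimal illustration, not the authors' in-house program: it assumes the pairwise global alignment scores have already been obtained (for example, parsed from CLUSTALW output), the function names and the dictionary layout are hypothetical, and it interprets the "maximum" self-alignment score as the larger of the two self-scores.

```python
def percent_similarity(scores, i, j):
    """Normalize a pairwise alignment score by the larger self-alignment score.

    `scores[(a, b)]` is assumed to hold the global alignment score of
    sequences a and b with a <= b (an illustrative layout, not a real API).
    """
    pair = scores[(min(i, j), max(i, j))]
    best_self = max(scores[(i, i)], scores[(j, j)])
    return 100.0 * pair / best_self


def similarity_matrix(scores, n_total):
    """Build the n_T x n_T matrix S of percent similarities."""
    return [[percent_similarity(scores, i, j) for j in range(n_total)]
            for i in range(n_total)]


# Toy scores for three sequences (made-up numbers, not taken from the paper).
toy_scores = {
    (0, 0): 200.0, (1, 1): 180.0, (2, 2): 220.0,
    (0, 1): 150.0, (0, 2): 44.0, (1, 2): 55.0,
}
S = similarity_matrix(toy_scores, 3)
```

Normalizing by the larger self-score keeps every entry at or below 100% and makes S symmetric, which is what the clustering step expects.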

The similarity matrix is used to build a tree by the UPGMA (unweighted pair group method using arithmetic averages) clustering algorithm [25] for the purpose of partitioning sequences based on sequence similarity. As Sjolander [20] pointed out, any partition of this tree may be meaningful. Indeed, no partitioning criterion is objectively better than another. In the end, the biologist must decide on the most appropriate partitioning criterion from their own perspective, given their experience with the protein superfamily. Therefore, the introduction of complementary methods may be important for consistent and reproducible analysis.

Our aim is to achieve a high-quality MSA of each subfamily. A benchmark of the quality of an MSA is how well it reflects the structural alignment. Comparative homology modeling allows us to predict the three-dimensional structure of a target protein based on its alignment to one or more proteins with a known template structure [26]. It has been observed that as the sequence identity between the target sequence and the template increases, the average structural similarity between the template and the target also increases, and for closely related protein sequences with identity over 40%, the alignment is almost always correct [27]. Therefore, if a similarity threshold greater than 40% is used for partitioning, the resulting MSAs should be of reasonably high quality and well correlated with the structure. Since Dayhoff used 60% identity as the threshold for a subfamily [17], we adopt a universal 60% similarity threshold as a slight modification. This strict threshold may create multiple partitions of the same subfamily; however, careful inspection of the sequence descriptions hints at which partitions can be joined.

Let n_S be the number of subfamilies and n_i be the number of sequences in the ith subfamily. The number of sequences that cannot be partitioned, n_H, can then be expressed by the following equation:

\[ n_H = n_T - \sum_{i=1}^{n_S} n_i \]

These sequences are less than 60% similar to each other and to sequences in any subfamily. Note that n_H will never be zero due to the intermediate nodes of the initial tree. Also note that n_H will increase as the similarity of the sequences in the superfamily decreases.
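As an illustration of the partitioning step, the following Python sketch performs average-linkage (UPGMA-style) clustering directly on a percent-similarity matrix and stops merging once no pair of clusters reaches the 60% threshold, which is equivalent to cutting the tree at that level. The function name and the toy matrix are hypothetical, and treating every singleton cluster as an unpartitioned sequence is an assumption made for this sketch.

```python
def upgma_partition(S, threshold=60.0):
    """Average-linkage clustering on a similarity matrix, stopping when no
    pair of clusters has an average similarity >= threshold."""
    clusters = [[i] for i in range(len(S))]

    def avg_sim(a, b):
        # Arithmetic mean over all cross-cluster sequence pairs (UPGMA).
        return sum(S[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > 1:
        best, best_pair = None, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = avg_sim(clusters[x], clusters[y])
                if best is None or s > best:
                    best, best_pair = s, (x, y)
        if best < threshold:
            break
        x, y = best_pair
        merged = clusters[x] + clusters[y]
        clusters = [c for k, c in enumerate(clusters) if k not in (x, y)]
        clusters.append(merged)

    subfamilies = [sorted(c) for c in clusters if len(c) > 1]
    unpartitioned = sorted(c[0] for c in clusters if len(c) == 1)
    return subfamilies, unpartitioned


# Toy 5-sequence similarity matrix (made-up percentages).
toy_S = [
    [100, 80, 70, 20, 10],
    [80, 100, 75, 25, 15],
    [70, 75, 100, 30, 20],
    [20, 25, 30, 100, 40],
    [10, 15, 20, 40, 100],
]
subfamilies, unpartitioned = upgma_partition(toy_S)
n_T = len(toy_S)
n_H = n_T - sum(len(members) for members in subfamilies)
```

In this toy case sequences 0, 1 and 2 form one subfamily, while sequences 3 and 4 remain unpartitioned, so the count of unpartitioned sequences equals the total minus the sum of the subfamily sizes.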

Creating an HMM histogram for one subfamily

The creation of an HMM histogram for a subfamily commences with an MSA, which can be acquired from manual or automatic sequence alignment of the sequences in each subfamily. If another method was used for partitioning subfamilies, it is necessary to check if the automatically generated MSAs are correct; however, using the outlined partitioning procedure, an automatic MSA method such as CLUSTALW should produce a structurally correlated MSA, since the sequences in the subfamilies have a greater than 40% sequence identity.

Sliding MSA windows with a width of w are created. Let a_i be the width of the MSA of the ith subfamily; then the number of MSA windows for the ith subfamily, b_i, is:

\[ b_i = a_i - w + 1 \]
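A minimal Python sketch of the window extraction is given below. It assumes the MSA is held as a list of equal-length aligned strings; the function name and the toy alignment are hypothetical. Sliding a full-width window one column at a time over an alignment of width a yields a - w + 1 windows.

```python
def msa_windows(msa, w):
    """Yield (start, window) pairs; each window holds every aligned row
    restricted to columns start .. start + w - 1 (0-based)."""
    width = len(msa[0])
    if any(len(row) != width for row in msa):
        raise ValueError("all aligned rows must have the same width")
    for start in range(width - w + 1):
        yield start, [row[start:start + w] for row in msa]


# Toy alignment of three sequences, 8 columns wide (made-up residues).
toy_msa = ["MKV-LAGT", "MRVQLAGS", "MKI-LSGT"]
windows = list(msa_windows(toy_msa, w=5))
```

Each window could then be written to its own alignment file and passed to the HMM-building step.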

An HMM is created for each sliding MSA window of the subfamily by the HMMER software package [28]. The HMM database of the subfamily is created from the concatenation of all these individual HMMs and calibrated with a sample size of 10000 sequences. The superfamily sequence database is then searched with the HMM database and an HMM histogram is constructed from the number of matches of each window. Let the HMM histogram of the ith subfamily be represented by f_i(x), where x is the starting position of the window.
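The tallying step can be sketched as follows. The sketch assumes the search results have already been parsed into (window start, sequence identifier, e-value) tuples; that record layout and the function name are hypothetical, and counting each sequence at most once per window is an assumption about how matches are tabulated.

```python
from collections import defaultdict


def hmm_histogram(hits, n_windows, evalue_cutoff):
    """Count, for each window start x, the distinct sequences whose match
    to that window's HMM has an e-value under the cutoff."""
    matched = defaultdict(set)
    for start, seq_id, evalue in hits:
        if evalue < evalue_cutoff:
            matched[start].add(seq_id)
    return [len(matched[x]) for x in range(n_windows)]


# Toy hit list (made-up identifiers and e-values).
toy_hits = [
    (0, "A", 1e-5), (0, "B", 3.0), (0, "C", 250.0),
    (1, "A", 0.05), (1, "A", 0.01),
    (2, "B", 99.0),
]
f_tolerant = hmm_histogram(toy_hits, n_windows=4, evalue_cutoff=100.0)
f_stringent = hmm_histogram(toy_hits, n_windows=4, evalue_cutoff=0.1)
```

The same hit list produces different histograms under the tolerant and stringent cutoffs, which is why the threshold has to be chosen to suit the purpose of the analysis.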

The window width (w) is a critical parameter in the generation of the HMM histogram. A small value of w is desirable because it makes the features of an HMM histogram more evident. As w increases from 20 to 80, the shape of the HMM histogram becomes smoother (Fig. 2). Empirically, it was determined that a good value of w is approximately 20 because lower values may create models that are statistically insignificant. If necessary, we suggest gradually increasing that number to achieve an acceptable balance between significance and window resolution.

Figure 2
HMM histograms of epithelial cadherin. This figure shows the HMM histograms of epithelial cadherin with varying window widths (w). The x-axis represents the starting position of the window in the MSA of the subfamily; the y-axis represents the number of times that window was found in the cadherin superfamily database. The shape of the HMM histogram becomes smoother as the size of w increases from 20 to 40 to 80 residues because the score is calculated over a larger region.

Using HMM histograms to find subfamily signatures

Finding signatures involves discovering MSA windows that can distinguish this subfamily from all other subfamilies. A particular MSA window can fall into one of three categories: divergent window (a window that is not shared by the subfamily), superfamily window (shared by the superfamily), or subfamily window (shared by the subfamily). Divergent windows can be easily identified from an MSA by a stretch of positions that do not align well; however, superfamily and subfamily windows cannot be separated because they will both align well.

From an HMM histogram, however, the three categories can be distinguished by comparing the number of matches, f_i(x), with the number of sequences in the subfamily MSA, n_i: subfamily windows have f_i(x) = n_i; superfamily windows, f_i(x) > n_i; and divergent windows, f_i(x) < n_i. Since the HMM histogram sweeps across the MSA with a window size of w, a subfamily signature longer than w positions will be identified by consecutive subfamily windows.
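The three-way comparison and the consolidation of consecutive subfamily windows can be sketched in Python as follows. The function names are hypothetical, and reporting each signature as the span of alignment columns covered by its run of windows is an assumption of this sketch rather than a detail stated in the text.

```python
def classify_windows(f, n_i):
    """Label each window by comparing its match count with the subfamily size."""
    labels = []
    for count in f:
        if count == n_i:
            labels.append("subfamily")
        elif count > n_i:
            labels.append("superfamily")
        else:
            labels.append("divergent")
    return labels


def subfamily_signatures(f, n_i, w):
    """Merge runs of consecutive subfamily windows into (first_column,
    last_column) spans, 0-based and inclusive."""
    spans = []
    run_start = None
    for x, count in enumerate(f):
        if count == n_i:
            if run_start is None:
                run_start = x
        elif run_start is not None:
            # The last window of the run starts at x - 1 and covers w columns.
            spans.append((run_start, x - 1 + w - 1))
            run_start = None
    if run_start is not None:
        spans.append((run_start, len(f) - 1 + w - 1))
    return spans


# Toy histogram for a subfamily of 8 sequences (made-up counts).
toy_f = [12, 12, 8, 8, 8, 3, 8, 15]
labels = classify_windows(toy_f, n_i=8)
signatures = subfamily_signatures(toy_f, n_i=8, w=20)
```

Here the run of three subfamily windows starting at position 2 is merged into a single signature, while the isolated subfamily window at position 6 yields a signature exactly w columns long.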

To define an HMM match, HMMER returns both a score and an e-value. The score is the base two logarithm of the ratio between the probability that the query sequence was generated by the HMM and the probability that it was generated by a random model. The e-value represents the number of sequences expected by chance to have a score greater than or equal to the returned HMM score. Decreasing the e-value threshold makes the reported matches more likely to be true positives, whereas increasing it makes the sequences that remain unmatched more likely to be true negatives. For finding subfamily signatures, a tolerant e-value of 100 is used because windows that match only sequences in the subfamily, even under such loose conditions, are characteristic of the subfamily.

The complete set of HMMs created from all subfamily signatures is concatenated to build the HMM database for the protein superfamily. The analysis of a query sequence follows a two-step process. First, search the query sequence for the conserved domain of the protein superfamily (e.g. the cadherin repeat or the EF-hand motif). If the conserved domain is found, then search for subfamily signatures. If subfamily signatures are found, the sequence belongs to the subfamily whose signature has the lowest e-value (Fig. 3). Otherwise, the sequence is classified only to the protein superfamily, and the classification system has achieved a level of success equivalent to that of most profile and motif databases. To cross-validate the analysis, remove 5% of the sequences in the initial superfamily database (the test set) prior to building the HMM histograms. The test set is then searched against the constructed HMM database of the superfamily, and the sequences in the test set should fall into the expected subfamilies within an acceptable error rate. We suggest a 5% acceptable error rate, but more stringent rates may also be appropriate.
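The two-step query procedure and the cross-validation check can be sketched as follows. This is a hedged illustration only: the function names, the input layout (a best domain e-value plus a list of signature hits) and the 0.1 domain cutoff used as a default are assumptions, not part of the described software.

```python
def classify_query(domain_evalue, signature_hits, domain_cutoff=0.1):
    """Two-step classification of one query sequence.

    domain_evalue: e-value of the best match to the superfamily's conserved
        domain model, or None if there was no match (assumed input).
    signature_hits: list of (subfamily_name, evalue) matches against the
        subfamily signature HMMs (assumed input).
    """
    if domain_evalue is None or domain_evalue >= domain_cutoff:
        return "not in superfamily"
    if not signature_hits:
        return "superfamily only"
    # The subfamily whose signature matched with the lowest e-value wins.
    best_subfamily, _ = min(signature_hits, key=lambda hit: hit[1])
    return best_subfamily


def error_rate(predicted, expected):
    """Fraction of test-set sequences assigned to an unexpected class."""
    wrong = sum(1 for p, e in zip(predicted, expected) if p != e)
    return wrong / len(expected)


# Toy test set of three queries (made-up e-values and names).
predicted = [
    classify_query(1e-12, [("E-cadherin", 2e-8), ("N-cadherin", 0.5)]),
    classify_query(1e-6, []),
    classify_query(None, []),
]
expected = ["E-cadherin", "superfamily only", "not in superfamily"]
rate = error_rate(predicted, expected)
```

A test set would pass the suggested criterion when `rate` is at most 0.05.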

Figure 3
Flow diagram of a query into the HMM database of the superfamily

Using HMM histograms to visualize functional regions

In the previous section, to identify subfamily signatures, we focused on subfamily windows. However, superfamily windows also may provide insight into which regions in the subfamily share functional significance relative to the superfamily. Peaks in the HMM histogram can suggest which regions are particularly well conserved across the entire superfamily.

To extract this data, a few modifications to the method are needed. First, create an HMM histogram of the ith subfamily as previously described, but with an e-value threshold of 0.1. This is a stringent threshold because, for this purpose, it is important to favor true positives. Thus far, the HMM histograms presented are functions of the starting position of the window, f_i(x). While this is convenient for identifying subfamily signatures, HMM histograms as a function of the position in the alignment, g_i(x), are useful for assessing the contribution of individual positions.

The mapping from f_i(x) to g_i(x) is determined by tabulating a count of 1 for each position in the window when a match is found. Since position x is covered by every window that starts between x - w + 1 and x, the mapping equation is expressed as follows:

\[ g_i(x) = \sum_{k=\max(1,\,x-w+1)}^{\min(x,\,b_i)} f_i(k) \]

Peaks in g_i(x) may hint at positions that have functional importance.
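The tabulation from window start positions to alignment positions can be sketched in Python as follows, using 0-based indices. The function name is hypothetical; the logic simply credits every column covered by a window with that window's match count.

```python
def positional_histogram(f, w):
    """Map a histogram over window start positions, f, to a histogram over
    alignment positions, g, by adding each window's match count to all w
    columns that the window covers."""
    g = [0] * (len(f) + w - 1)
    for x, count in enumerate(f):
        for pos in range(x, x + w):
            g[pos] += count
    return g


# Toy example: three windows of width 3 over a 5-column alignment.
toy_g = positional_histogram([1, 0, 2], w=3)
```

Because each match contributes to exactly w columns, the total of g is always w times the total of f, which provides a quick consistency check.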

Analysis of the cadherin superfamily

Cadherins represent a large family of proteins with diverse functions including cell-cell adhesion, morphogenesis, synapse formation, cell polarization, cell sorting, cell migration, and cell rearrangements [15]. All members of the cadherin superfamily possess a cadherin repeat (CR). Using Pfam's HMM of the CR, 203 sequences that match the model with an e-value below 0.1 were filtered from the SWISS-PROT sequence database (Release 39).

Subfamily clustering produced 21 known subfamilies of cadherins with an average of 8 members each (Table 1). To cross-validate the effectiveness of the final HMM database of the superfamily in classifying subfamilies, 11 sequences (representing 5% of the sequence data) were removed to form the test set. The analysis to create the HMM database of the superfamily was performed using the sequences in the superfamily database minus the test set. HMM histograms of the subfamilies were created from MSAs generated by CLUSTALW (Fig. 4). In total, 95 subfamily signatures were extracted by consolidating consecutive subfamily windows. Finally, the HMM database of the superfamily was created from the concatenation of HMMs constructed from the subfamily signatures. Cross-validation revealed that all the sequences in the test set were classified into the expected subfamily.

Figure 4
HMM histograms of cadherin subfamilies. The HMM histograms were constructed with a window width of 20 and an e-value threshold of 100. The signature regions are highlighted in yellow for various subfamilies in the cadherin superfamily. A) Protocadherin-γA B) Liver Intestine cadherin C) Truncated cadherin

Table 1 Tabulation of sequences in cadherin subfamilies

From the solved crystal structure of the first and second N-terminal CRs (CR1 and CR2) of epithelial cadherin [29], it was shown that the homodimerization of epithelial cadherin is stabilized by the Ca2+ ions bound in the linker region between CRs. Single amino acid substitutions in the Ca2+ binding site could disrupt the cell adhesion function [30]. The HMM histogram of the epithelial cadherin subfamily was plotted on the solved crystal structure (Fig. 5A) where, interestingly, the Ca2+ binding linker between CR1 and CR2 had the highest counts. Furthermore, the peaks of the HMM histogram were found within one or two positions of 6 of the 8 residues critical in Ca2+ binding (Fig. 5B,C).

Figure 5
Mapping the HMM histogram to the crystal structure of epithelial cadherin. A) The HMM histogram mapped onto the crystal structure (PDB code: 1EDH) of the first and second cadherin repeats of epithelial cadherin. Ca2+ ions are depicted as yellow spheres. The regions of high occurrence map to the Ca2+ binding site (blue represents low occurrence and red represents high occurrence). B) The HMM histogram of the epithelial cadherin subfamily with an e-value cutoff of 0.1. The orange bars in the histogram reflect positions involved in Ca2+ binding. Below the histogram is the domain layout. The features are colored: cadherin repeat (CR), blue rectangle; cytoplasmic domain (Cyt), green rectangle; catenin binding sites in the cytoplasmic domain, pink rectangles. The segment between the last CR and the cytoplasmic domain is the single-pass transmembrane domain. C) The MSA of the segment involved in Ca2+ binding between the first and second CRs. The SWISS-PROT code of each sequence is shown on the left and the 8 residues involved in Ca2+ binding are highlighted in orange.

Various biochemical and structural studies have suggested that Ca2+ binding occurs between all CRs [31]. These Ca2+ binding linkers seem to play critical roles in the cell-adhesion function of cadherins, as they are directly involved in molecular assembly [29]. The high peak at the linker between CR2 and CR3 in the HMM histogram (Fig. 5B) strongly suggests the functional importance of this domain linker. Interestingly, the two linkers between the last 3 CRs do not display an intense peak in the HMM histogram. These findings may suggest that the two N-terminal linkers are functionally more essential than the two C-terminal linkers. Further structural and mutagenesis studies are required to test this hypothesis derived from our sequence analysis.

Analysis of the EF-hand superfamily

Kretsinger and Nockolds [32] discovered the EF-hand motif in the crystal structure of parvalbumin in 1973. The EF-hand motif has a characteristic helix-loop-helix structure, consisting of approximately 30 residues. Numerous proteins that interact with Ca2+ contain the EF-hand motif [33]. The most prevalent classification of the EF-hand superfamily based on domain relations has been reported previously [16].

Using Pfam's HMM of the EF-hand, 736 sequences were filtered from SWISS-PROT (Rel. 39) to comprise our EF-hand superfamily database. The subfamily partitioning methodology presented here produced 26 known EF-hand subfamilies, each consisting of approximately 15 members (Table 2). The subfamily partitioning identified a significant portion of the previously classified EF-hand subfamilies, but not all of them. This is because our subfamily partitioning is based entirely on sequence similarity, while previous classifications utilized not only sequence similarities but also other information available from experimental studies. In addition, a large portion of the superfamily could not be partitioned using sequence similarity alone, suggesting that sequences in the EF-hand superfamily are significantly dissimilar and that a complementary approach may be needed to fully partition all subfamilies.

Table 2 Tabulation of sequences in EF-hand subfamilies

Similar to the cross-validation analysis on the cadherin superfamily, 37 sequences (representing 5% of the sequence data) were removed to form the test set. Again, HMM histograms of the subfamilies were created from the reduced set of superfamily sequences (Fig. 6). In total, 40 subfamily signatures were extracted. The HMM database of the EF-hand superfamily was created from the subfamily signatures. Again, cross-validation revealed that all the sequences in the test set were classified into the expected subfamily. This suggested that the method can classify sequences with a high specificity.

Figure 6
HMM histograms of EF-hand subfamilies. Using the same conventions as Fig. 4, the HMM histograms were constructed for various subfamilies in the EF-hand superfamily. A) Calbindin D28k B) Calcineurin B C) Caltractin.

The peaks in the HMM histograms corresponded to windows that include EF-hand motifs (Fig. 6). Calbindin D28k, for example, has six EF-hands (designated EF1-EF6). Ca2+ binding studies have shown that EF2 does not bind Ca2+ and EF6 binds Ca2+ with a lower affinity than the other four functional sites [34]. Interestingly, the HMM histogram of calbindin D28k shows no peaks at the locations of EF2 and EF6 (Fig. 6A). Calcineurin B contains four EF-hands, all shown to bind Ca2+ [35]. The HMM histogram clearly shows the presence of four functionally active Ca2+ binding EF-hands in calcineurin B (Fig. 6B). Caltractin also possesses four EF-hands: two of higher affinity and two of lower affinity [36]. Similarly, the HMM histogram shows four peaks corresponding to the four EF-hands (Fig. 6C). Parvalbumin is a Ca2+ buffering protein involved in the relaxation of muscle after contraction by binding free Ca2+ in the cell [37, 38]. The HMM histogram was mapped onto the solved crystal structure of parvalbumin B [39] (Fig. 7A). Parvalbumin B has three EF-hands, and the first N-terminal EF-hand does not bind Ca2+ [39]. The HMM histogram clearly displayed the lack of a functional N-terminal EF-hand and the existence of two active C-terminal EF-hands (Fig. 7B,C). These examples demonstrate that HMM histograms are useful not only for finding subfamily signatures but also for locating functionally significant regions of subfamilies.

Figure 7
Mapping the HMM histogram to the crystal structure of parvalbumin B. A) The HMM histogram mapped onto the crystal structure (PDB code: 1CDP) of parvalbumin B. Using the same conventions as Fig. 5A, the Ca2+ binding loops of two EF-hand motifs have a high occurrence level. B) The HMM histogram of the parvalbumin B subfamily with an e-value cutoff of 0.1. The orange regions in the histogram reflect the segments encoding the Ca2+ binding loops. Below the histogram is the domain layout. The blue rectangle represents the EF-hand motif (EF). C) The MSA of the parvalbumin B subfamily. Using the same conventions as Fig. 5C, the Ca2+ binding loops are highlighted orange.

Conclusions

We developed a method to decipher signature regions of protein subfamilies, which can be used to build HMM databases for diagnosing subfamilies of large protein superfamilies. Using this method, we identified subfamily signatures and built HMM databases for two well-studied superfamilies of cadherins and EF-hand proteins. Additionally, peaks in the HMM histogram plots of subfamilies were found to coincide with functionally important regions (i.e. Ca2+ binding sites and loops). Future work should include the comparison between different subfamily partitioning techniques and also the creation of richly annotated databases for subfamilies of superfamilies for possible application in automated genomic annotation in conjunction with other motif and profile databases.

Materials and methods

The studies were performed using a variety of tools and, whenever necessary, in-house programs were written to pre- and post-process data from the different applications. MSAs were generated using CLUSTALW [40] and all HMMs were created using the HMMER package [28]. Data was stored in an Oracle relational database management system and Microsoft FoxPro was used as an ODBC (Open Database Connectivity) client for querying and joining tables from the database. Microsoft Excel was used for dynamic charting of data. Perl was used for shell scripting, text manipulation and pattern matching with regular expressions. HMMER, CLUSTALW, the Oracle database server (version 8) and the Perl scripts were executed on a machine with dual 750 MHz UltraSPARC-III processors and 4 GB of RAM running SunOS 5.8. Microsoft FoxPro and Excel were executed on a machine with a 500 MHz Intel Celeron processor and 128 MB of RAM running the Windows 98 operating system.

The time required to analyze one superfamily depended largely on the computation platform, the number of sequences in the superfamily and the average width of the subfamily MSAs. Using the computation platforms described, the computation time to generate the MSAs using CLUSTALW was ~3 hours for the cadherin superfamily (~200 sequences, ~800 average width) and ~9 hours for the EF-hand superfamily (~700 sequences, ~200 average width). The computation time for the creation of a calibrated HMM database (window size of 20) was ~6 hours for an average cadherin subfamily and ~45 minutes for an average EF-hand subfamily. The execution time for searching the superfamily database with the HMM database of an average subfamily was ~12 hours for a cadherin subfamily and ~7 hours for an EF-hand subfamily. The computation time was extensive, but the analysis could easily be adapted to a parallel computing system.

The HMM database created for the cadherin and EF-hand superfamilies and all glue programs that were used for the analysis are available upon request.