Background

Most protein families can be divided into functionally distinct subfamilies. Such subfamilies exhibit characteristic properties which manifest, for instance, as binding specificity of regulatory proteins, substrate specificity of enzymes, and pore selectivity of channels and transporters. Functional differences are often linked to sequence characteristics in regions which are conserved throughout the protein superfamily. This is because conserved domains define the fold of the functional protein core or provide catalytic residues. Recognition of subfamily-specific deviations at such sites can be valuable for elucidating mechanistic principles of the protein family by site-directed mutagenesis and subsequent functional analysis of the mutants. An automated approach to identify relevant deviations should (i) be able to take into account a large number of reference sequences, (ii) determine sequence conservation, i.e. positions of high information content, and (iii) visualize deviations, i.e. subfamily characteristics, relative to the information content in a graphical output which is easy to comprehend.

Implementation

One sophisticated way of presenting sequence conservation is to display a sequence logo [6]. Here, the information content $I(P_i)$ of each alignment position $i$ is defined inversely to the uncertainty $H(P_i)$ by the equation

$$I(P_i) = \log_2 |\Sigma| - H(P_i) = \log_2 |\Sigma| + \sum_{j \in |\Sigma|} P_{ij} \cdot \log_2 P_{ij}$$

with $|\Sigma|$ being the cardinality of the alphabet used, i.e. 4 for DNA and 20 for protein sequences, and $P_{ij}$ being the frequency of residue $j$ at this position (variables according to [7]). Each position is displayed as a stack of residue symbols whose heights $l_{ij}$ represent their proportion of the information content:

$$l_{ij} = P_{ij} \cdot I(P_i)$$
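
The two formulas above can be sketched in a few lines of Python. This is only an illustrative sketch, not the TEXshade implementation: the function names are hypothetical, a column is assumed to be a string of one-letter residue codes, gap characters ("-") are assumed to be ignored, and no background frequency correction is applied.

```python
import math
from collections import Counter


def residue_frequencies(column):
    """Return {residue: P_ij} for one alignment column (assumption: gaps are ignored)."""
    residues = [r for r in column if r != "-"]
    total = len(residues)
    if total == 0:
        return {}
    return {res: n / total for res, n in Counter(residues).items()}


def information_content(freqs, alphabet_size=20):
    """I(P_i) = log2|Sigma| + sum_j P_ij * log2 P_ij, in bits."""
    return math.log2(alphabet_size) + sum(
        p * math.log2(p) for p in freqs.values() if p > 0
    )


def stack_heights(column, alphabet_size=20):
    """Symbol heights l_ij = P_ij * I(P_i) for every residue present in the column."""
    freqs = residue_frequencies(column)
    if not freqs:
        return {}
    info = information_content(freqs, alphabet_size)
    return {res: p * info for res, p in freqs.items()}


# Hypothetical column: seven Asn and one Ser at the same alignment position.
print(stack_heights("NNNNNNNS"))
```

A fully conserved protein column yields a single symbol of height log2 20, the maximal information content for a 20-letter alphabet.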

Protein sequence logos are often adjusted to the background frequency of each amino acid in the alignment [7]. For simplicity, the variable name $I(P_i)$ will be used in the following for the information content both with and without frequency correction. Both approaches are compatible with subfamily logos and have been implemented in the algorithm.

In contrast to a sequence logo, which depicts sequence conservation, the aim here is to display the relevance of deviations at conserved positions. The recently published pairwise HMM logo approach aligns the sequence logos of two subfamilies [8]. This certainly facilitates the identification of relevant deviant positions, but one still has to inspect position by position and judge different symbol heights by eye. Subfamily logos provide a very intuitive display. They are derived by subtracting, for each position $i$, the frequency $R_{ij}$ of a residue $j$ in the remaining set of sequences from its frequency $S_{ij}$ within a pre-defined subset of sequences, i.e. a subfamily. The difference is then weighted by the overall information content $I(P_i)$ computed from all sequences, and the residue is plotted with a symbol height of $s_{ij}$:

$$s_{ij} = (S_{ij} - R_{ij}) \cdot f_{\tilde{S}\tilde{R}} \cdot I(P_i)$$

The term $(S_{ij} - R_{ij})$ gives values from -1 to 1. Positive values correspond to residues which are characteristic for the subfamily (shown upright in the output), negative values to those that are typical for the remaining sequences (shown upside-down). Positions with an equal distribution of residue $j$ result in a zero value.
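
As a rough illustration of the difference term alone, the following Python sketch computes the subfamily frequency minus the remaining-set frequency for one column. The function names and the example residues are hypothetical, columns are assumed to be strings of one-letter codes, and gaps are assumed to be ignored; the weighting by the information content and the correction factor is left out here.

```python
from collections import Counter


def frequencies(column):
    """Return {residue: frequency} for one column (assumption: gaps are ignored)."""
    residues = [r for r in column if r != "-"]
    total = len(residues)
    if total == 0:
        return {}
    return {res: n / total for res, n in Counter(residues).items()}


def frequency_difference(sub_column, rest_column):
    """Per-residue difference (subfamily frequency minus remaining-set frequency).

    Values lie between -1 and 1: positive for residues typical of the
    subfamily, negative for residues typical of the remaining sequences.
    """
    s = frequencies(sub_column)
    r = frequencies(rest_column)
    return {res: s.get(res, 0.0) - r.get(res, 0.0) for res in sorted(set(s) | set(r))}


# Hypothetical position: the subfamily carries only Asp, the remaining set mostly Ser.
diff = frequency_difference("DDDD", "SSSA")
print(diff)
```

A residue that is equally frequent in both groups gets a difference of zero and therefore would not appear in the logo.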

The need for a correction factor $f_{\tilde{S}\tilde{R}}$ is illustrated by the following example. Assume an alignment with an equal number of sequences in the subfamily and in the remaining set of sequences. Further, assume a position $i$ within the alignment where all sequences in the subfamily carry amino acid $a$ and all remaining sequences carry amino acid $b$ with $a \neq b$. This situation can be considered the best possible discrimination between the subfamily and the remaining set of sequences and results in the frequencies $P_{ia} = 0.5$, $P_{ib} = 0.5$ and all other $P_{ij} = 0$. The overall information content at this position is thus $I(P_i) = \log_2 20 + 0.5 \cdot \log_2 0.5 + 0.5 \cdot \log_2 0.5 = \log_2 20 - 1$, i.e. one bit less than the maximal information content. For either group of sequences, however, the information content should be maximal due to the frequencies $S_{ia} = 1$ (subfamily) and $R_{ib} = 1$ (remaining sequences). The decrease in the apparent information content depends on the fraction of sequences in the subfamily ($\tilde{S}$) and in the remaining set ($\tilde{R}$). Hence, the factor $f_{\tilde{S}\tilde{R}}$ was introduced, which follows the form shown in the example above and corrects for the described error:

$$f_{\tilde{S}\tilde{R}} = \frac{\log_2 |\Sigma|}{\log_2 |\Sigma| + \tilde{S} \cdot \log_2 \tilde{S} + \tilde{R} \cdot \log_2 \tilde{R}}$$
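
The effect of the correction factor on the example with equally sized groups can be checked numerically. The Python sketch below is an illustration under stated assumptions, not the TEXshade code: the function names are hypothetical, and the fraction of the remaining set is assumed to be one minus the subfamily fraction, since the two groups together make up the whole alignment.

```python
import math


def correction_factor(sub_fraction, alphabet_size=20):
    """f = log2|Sigma| / (log2|Sigma| + S~ * log2 S~ + R~ * log2 R~).

    Assumption: R~ = 1 - S~, and both groups are non-empty.
    """
    if not 0.0 < sub_fraction < 1.0:
        raise ValueError("both groups must contain at least one sequence")
    rest_fraction = 1.0 - sub_fraction
    max_info = math.log2(alphabet_size)
    split_term = (sub_fraction * math.log2(sub_fraction)
                  + rest_fraction * math.log2(rest_fraction))
    return max_info / (max_info + split_term)


def subfamily_height(difference, info_content, factor):
    """s_ij = (S_ij - R_ij) * f * I(P_i)."""
    return difference * factor * info_content


# Example from the text: equal group sizes, residue a only in the subfamily,
# residue b only in the remaining set, so I(P_i) = log2(20) - 1.
max_info = math.log2(20)
f = correction_factor(0.5)
height_a = subfamily_height(1.0, max_info - 1.0, f)   # upright symbol
height_b = subfamily_height(-1.0, max_info - 1.0, f)  # upside-down symbol
print(f, height_a, height_b)
```

With the factor applied, the perfectly discriminating residues reach the maximal height of log2 20 bits in either direction, which is the behaviour the example calls for.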

Results and discussion

Fig. 1 displays two sections of a protein alignment (pos. 42–71 and 173–202) which consists of a total of 135 aquaporin sequences. Two functionally distinct subfamilies are represented by 32 aquaglyceroporins (GlpFs; permeability for water and glycerol), and 103 water-specific aquaporins (AQPs). From the latter, another water-specific subfamily consisting of 11 plant tonoplast intrinsic proteins (TIPs) can be separated.

Figure 1

Subfamily logos in comparison to classical sequence logos. Sections of three aquaporin subfamilies are shown, i.e. water/glycerol channels (GlpFs), water-specific channels (AQPs), and tonoplast intrinsic proteins (TIPs). Subfamily-specific residues are displayed upright, residues that are typical for the remaining sequences as tinted upside-down characters. The unit of the ordinates is bits. Triangles mark known positions of relevant subfamily-specific deviations. Asterisks were computed by the subfamily logo algorithm to label subfamily-specific residues.

The frequency-corrected sequence logo on the top highlights conserved positions around the two canonical Asn-Pro-Ala (NPA) motifs. The scale of the ordinate is in bits. Sequence conservation is further indicated by a color scale below the logo based on a structural matrix integrated into TEXshade. The triangles mark positions where the GlpF, AQP, and TIP subfamilies deviate as shown before in various publications [2, 4, 5]. These positions are directly connected to function because they contribute to the layout of the selective pore constriction.

Three frequency-corrected subfamily logos are shown below. Readability is greatly improved when upside-down symbols are tinted by 50%. This gives the impression of a reflective surface with a focus on the positive, subfamily-relevant residue symbols. The output is intuitive and largely self-explanatory. Positions which are conserved throughout do not appear in the subfamily logos, see for instance the NPA motifs at positions 63–65 and 194–196. However, sequence deviations become visible depending on the information content, e.g. Val197 vs. Arg in the TIP subfamily, or Asp198 and Ser198, respectively, in the GlpF or AQP subfamilies. Deviations are less pronounced at positions with a higher number of possible residues due to the lower information content. Nevertheless, subfamily characteristics are still visible if relevant, e.g. at positions 43, 182, and 202. The algorithm further accepts a threshold bit-value above which a deviant residue is additionally highlighted by a symbol (asterisks in Fig. 1). Empirically, this value is set to $\log_2 5$ (2.322 bit) for proteins, which corresponds to the presence of one particular residue in 25% of all sequences or 50% of the subfamily, and $\log_2 2$ (1 bit) for DNA sequences. The threshold value can be manually adjusted to match the alignment situation in question. It may also be used in the future to indicate statistical evaluations of the residue distribution. Inherently, best results are obtained when only two subfamilies are compared.
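
The threshold step described above can be sketched as a simple filter over the symbol heights of one position. This is a hedged illustration only: the names are hypothetical, the example heights are made-up values rather than data from Fig. 1, and it is assumed that only upright (positive) symbols exceeding the threshold are marked.

```python
import math

# Default thresholds quoted in the text, in bits.
DEFAULT_THRESHOLDS = {
    "protein": math.log2(5),  # about 2.322 bit
    "dna": math.log2(2),      # 1 bit
}


def relevant_residues(heights, threshold):
    """Return the residues whose subfamily logo height lies above the threshold.

    Assumption: only positive heights can exceed the threshold, so
    upside-down symbols are never marked.
    """
    return sorted(res for res, h in heights.items() if h > threshold)


# Hypothetical heights (in bits) at a single alignment position.
column_heights = {"V": 3.1, "R": -2.9, "I": 0.4}
flagged = relevant_residues(column_heights, DEFAULT_THRESHOLDS["protein"])
print(flagged)
```

Passing a different threshold value corresponds to the manual adjustment mentioned in the text.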

Currently, subfamily logos are implemented in TEXshade [see additional files 1 and 2], a LATEX macro package for setting and shading multiple sequence alignments [1]. Some sample code is displayed in Fig. 2, showing that a small number of commands produces satisfactory output. TEXshade provides numerous additional commands for individual adjustments of the output and comprehensive labeling. However, implementation of a subfamily logo extension into software that provides a graphical user interface and TEXshade output, such as STRAP [3] or the San Diego Supercomputer Center Biology WorkBench (http://workbench.sdsc.edu/), is strongly encouraged. Further, integration of the subfamily logo algorithm into local or web-based sequence logo plotting tools should be straightforward.

Figure 2

Example input for subfamily logo generation. Shown is the code needed to calculate and display positions 42–71 of the subfamily logo for the GlpF aquaporin subfamily displayed in Fig. 1. The input file AQP_all.aln contains a multiple sequence alignment of 135 aquaporin protein sequences.

Conclusion

Subfamily logos are an extension to the classical application of sequence logos. They provide a novel tool to intuitively visualize subfamily sequence characteristics. The validity of the method was confirmed by analysis of 135 aligned aquaporin sequences and correct identification of subfamily-specific sequence deviations. Their relationship to sequence logos makes it easy to integrate them into existing logo software.

Availability and requirements

Project name: TEXshade

Project home page: http://homepages.uni-tuebingen.de/beitz/txe.html or any CTAN site

Operating system(s): Platform independent

Programming language: LATEX

Other requirements: LATEX2ε

License: GNU GPL

Any restrictions to use by non-academics: none