Background

The divergence of protein function following gene duplication – or the colonization of new ecological niches – is of central importance in the evolution of novelty. Changes in protein structure and function are reflected at the level of amino acid sequence. This principle suggests that lineage-specific functional divergence in proteins can be identified by the analysis of primary sequence data. However, many amino acid substitutions have a negligible effect on protein function. This means that a simple comparison of the sequence differences between two clusters of proteins will not reveal the subset of changes responsible for functional divergence. While several methods to identify these biologically important substitutions exist [1], they are not optimized for analyses of large numbers of protein sequences. Here, we present a fast new method for identifying these substitutions across a large phylogenetic tree.

Materials and methods

Our method requires a bifurcating phylogenetic tree and a protein sequence alignment. Each node on the tree is defined by two downstream clades and one or more outgroup sequences. We use BLOSUM [2] scores to quantify how radical or conservative the substitutions in each clade are relative to the outgroup, assign a score to each column of the alignment at each tree node, and test each score for significance [3]. Here, we apply our method to a tree of the GroEL genes from 622 bacterial genomes.
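
The per-column scoring at a single node can be illustrated with a minimal sketch. The scoring rule used here (the difference between the two clades' mean substitution scores against the outgroup) is an assumption made for illustration, not necessarily the formula used by the method, and the significance test [3] is not reproduced. The function names and the small hard-coded score table are hypothetical; the table is a toy excerpt in the style of BLOSUM62, and a real analysis would load the full matrix. Sequences are assumed to be aligned, of equal length, and free of gaps.

```python
from itertools import product
from statistics import mean

# Toy excerpt of BLOSUM62-style scores, keyed by alphabetically sorted
# residue pairs. Hard-coded for illustration only (assumption): a real
# analysis would load the complete substitution matrix.
_SCORES = {
    ("L", "L"): 4, ("I", "I"): 4, ("D", "D"): 6, ("E", "E"): 5,
    ("I", "L"): 2, ("D", "E"): 2,
    ("D", "L"): -4, ("E", "L"): -3, ("D", "I"): -3, ("E", "I"): -3,
}


def sub_score(a, b):
    """Look up the symmetric substitution score for residues a and b."""
    return _SCORES[tuple(sorted((a, b)))]


def mean_score_vs_outgroup(clade_residues, outgroup_residues):
    """Mean substitution score over all clade/outgroup residue pairs."""
    return mean(
        sub_score(c, o) for c, o in product(clade_residues, outgroup_residues)
    )


def column_divergence(clade_a, clade_b, outgroup):
    """Hypothetical column score: positive when clade B is more radical
    than clade A relative to the outgroup, negative for the reverse."""
    return (mean_score_vs_outgroup(clade_a, outgroup)
            - mean_score_vs_outgroup(clade_b, outgroup))


def score_alignment(clade_a_seqs, clade_b_seqs, outgroup_seqs):
    """Score every alignment column at one tree node."""
    columns = zip(zip(*clade_a_seqs), zip(*clade_b_seqs), zip(*outgroup_seqs))
    return [column_divergence(a, b, o) for a, b, o in columns]


# Two-column toy alignment: clade A matches the outgroup in column 0,
# clade B has switched from hydrophobic to acidic residues there.
scores = score_alignment(
    clade_a_seqs=["LD", "LD"],
    clade_b_seqs=["DD", "ED"],
    outgroup_seqs=["LD", "ID"],
)
print(scores)
```

In this toy input the first column receives a large positive score because only clade B has diverged radically from the outgroup, while the fully conserved second column scores zero. Repeating the calculation at every internal node of the tree would yield one score per column per node, which is the quantity subsequently tested for significance.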

Results

GroEL is an important molecular chaperone that helps at least 250 client proteins fold in Escherichia coli [4]. Interestingly, we found that four of the five bacterial lineages most enriched for functional divergence are intracellular pathogens (see Figure 1). Radical change in GroEL has previously been implicated in the adaptation of endosymbiotic bacteria to intracellular life [5], and our results suggest that such change may be a more general response to the population-genetic conditions of an intracellular lifestyle.

Figure 1

Bacterial lineages enriched for functional divergence in GroEL. The thermosome-related sequences are found in certain extremophilic bacteria, perhaps as a result of horizontal gene transfer from archaea. The other highlighted lineages are intracellular pathogens, with the exception of Chloroflexi. The tree was produced by RAxML [6].