Background

Bacterial and archaeal transcriptional regulators typically form large protein families consisting of numerous paralogs (for example the LacI/GntR, AraC and DeoR families [1]). Only three readily detectable clusters of orthologous transcription factors include just one or two representatives from a broad range of diverse branches of bacteria, namely the SOS repressors LexA/DinR, the heat-shock repressor HrcA, and the arginine repressor ArgR/AhrC [2] (Table 1). A comparison of the coevolution of these conserved regulators and their binding sites in DNA could reveal general trends in the evolution of regulons.

Table 1 Comparison of three transcriptional regulator families with predominantly single representatives from each bacterial genome

The signals recognized by LexA in Gram-negative bacteria and by its ortholog DinR in Gram-positive bacteria (the SOS box [3] and the Cheo box [4], respectively) are completely different. Accordingly, the DNA-binding domains of these proteins are divergent (Table 1). The heat-shock regulator HrcA binds CIRCE elements that are located upstream of genes encoding heat-shock proteins (molecular chaperones) in many different genomes [5,6]; in the mycoplasmas, HrcA also regulates heat-shock protease genes [7]. The CIRCE signal is very specific (two complementary nonamers with a 9 base pair (bp) spacer) and is extremely well conserved in all genomes that encode HrcA (not more than five, and usually fewer than three, mismatches to the consensus in all known and predicted sites [7]). The amino acid sequence of HrcA is conserved as well (Table 1).
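The arm-spacer-arm architecture described for CIRCE can be checked mechanically. The following is a minimal sketch only, with hypothetical function names and a made-up sequence rather than a real CIRCE site. It counts positions at which the left arm fails to pair with the reverse complement of the right arm, which measures self-complementarity of a site and not its distance from the published consensus.

```python
# Illustrative sketch; function names and the toy sequence are assumptions.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement of an A/C/G/T sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def inverted_repeat_mismatches(site, arm=9, spacer=9):
    """Count positions where the two arms of a site are not complementary.

    The site is expected to be laid out as arm + spacer + arm; the spacer
    itself is ignored.
    """
    site = site.upper()
    if len(site) != 2 * arm + spacer:
        raise ValueError("site length must equal 2 * arm + spacer")
    left = site[:arm]
    right = site[arm + spacer:]
    return sum(a != b for a, b in zip(left, reverse_complement(right)))


# Toy example: a perfect inverted repeat built from an arbitrary nonamer.
toy_left = "ACGTTGCAA"
toy_site = toy_left + "N" * 9 + reverse_complement(toy_left)
print(inverted_repeat_mismatches(toy_site))
```

Counting mismatches against a specific consensus would need the consensus itself as an extra input, which is deliberately left out here.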

The arginine regulon, which is regulated by the arginine repressor ArgR/AhrC, represents an evolutionary strategy distinct from that of either the SOS or the heat-shock regulons. The DNA-binding domains of the ArgR/AhrC family are less conserved than those of the HrcA family, but more conserved than those of the LexA/DinR family (Table 1, column 5). DNA signals recognized by ArgR/AhrC are also similar in at least several bacterial lineages [8,9,10,11]. These sites often occur in pairs [12,13,14,15], although single-box sites have also been shown to bind ArgR/AhrC, for example the sites in the catabolic operons of B. subtilis [9], the adenine deaminase pathway operon in Bacillus licheniformis [14], and the cer recombination region of the E. coli plasmid ColE1 ([16,17]; see also the study of mutated ArgR [18]). Unlike the CIRCE element, the ARG box seems to be weakly conserved, even within a genome, and the specificity of recognition is often achieved by cooperative interactions between tandem sites, as shown in both experimental [9,12,13] and statistical [19] studies. The set of ARG boxes from different genomes, however, is fairly homogeneous, and indeed, arginine repressors from different bacteria appear to be at least partially interchangeable within major taxonomic groups: there is some cross-binding between ArgR and AhrC [20]; ArgR but not AhrC binds to the Thermus thermophilus sites [21] and AhrC binds to the Streptomyces coelicolor sites [22]. The ARG box consensus was described as TNTGAATWWWWATTCANW in E. coli [8,12], CATGAATAAAAATKCAAK in B. subtilis [9,10] and AWTGCATRWWYATGCAWT in Streptomycetes [11] (where W = A or T, K = G or T, R = A or G, Y = T or C, N = any base; Table 1). In addition, binding of ArgR homologs to sites similar to ARG boxes was reported for other Bacillus species (B. licheniformis [14] and B. stearothermophilus [23,24]), and for Salmonella typhimurium [25]. Several ArgR-binding sites were predicted on the basis of similarity with the E. coli consensus in the upstream regions of various genes involved in arginine metabolism in Moritella [26].
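The degenerate consensus strings quoted above can be compared directly using the letter codes defined in the text. The sketch below is illustrative only; the function names are hypothetical and the test site is an arbitrary sequence, not an experimentally characterized ARG box.

```python
# Degenerate codes as defined in the text (W, K, R, Y, N) plus plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "W": "AT", "K": "GT", "R": "AG", "Y": "CT", "N": "ACGT",
}

# Consensus strings as quoted in the text.
ECOLI = "TNTGAATWWWWATTCANW"
BSUBTILIS = "CATGAATAAAAATKCAAK"
STREPTOMYCES = "AWTGCATRWWYATGCAWT"


def consensus_mismatches(consensus, site):
    """Count positions where a site is not allowed by the consensus."""
    if len(consensus) != len(site):
        raise ValueError("consensus and site must have the same length")
    return sum(
        base not in IUPAC[code]
        for code, base in zip(consensus.upper(), site.upper())
    )


def compatible_positions(first, second):
    """Count positions where two consensus strings share at least one base."""
    if len(first) != len(second):
        raise ValueError("consensus strings must have the same length")
    return sum(
        bool(set(IUPAC[a]) & set(IUPAC[b])) for a, b in zip(first, second)
    )


# Arbitrary 18-mer used only to exercise the functions.
toy_site = "TATGAATAAAAATTCAAA"
print(consensus_mismatches(ECOLI, toy_site))
print(consensus_mismatches(BSUBTILIS, toy_site))
print(compatible_positions(ECOLI, BSUBTILIS))
```

A count of compatible positions is a crude similarity measure; it ignores how often each base actually occurs at a position, which is what a positional profile captures.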

In a previous study [27], we used comparative genomic analysis of regulatory signals to predict the gene composition of the arginine regulon of Haemophilus influenzae using the well characterized E. coli regulon as the starting point. Here we extend this analysis to explore the conservation of the ARG box in all bacteria that encode an ortholog of the ArgR repressor.

Results and discussion

The comparative approach to the analysis of regulation is based on the assumption that regulons (sets of co-regulated genes) are conserved in genomes containing orthologs of the relevant regulatory proteins. Thus true candidate binding sites for the regulator occur upstream of orthologous genes, whereas false positives are scattered at random in the genome. This provides a consistency check that sharply increases the accuracy of prediction.
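The consistency check can be expressed as a simple count over genomes. The sketch below assumes that candidate sites have already been assigned to groups of orthologous genes; the function name, the group labels and the toy data are hypothetical and are not taken from this study.

```python
from collections import Counter


def conserved_candidates(hits_by_genome, min_genomes):
    """Keep ortholog groups with a candidate site in enough genomes.

    hits_by_genome maps a genome name to the set of ortholog-group labels
    that have at least one candidate site upstream in that genome.
    Returns a dict mapping each retained group to its genome count.
    """
    counts = Counter()
    for groups in hits_by_genome.values():
        counts.update(set(groups))  # each genome counts once per group
    return {group: n for group, n in counts.items() if n >= min_genomes}


# Toy data: groupA has candidate sites in all three genomes, whereas
# groupB and groupC appear once each, as scattered false positives would.
toy_hits = {
    "genome1": {"groupA", "groupB"},
    "genome2": {"groupA"},
    "genome3": {"groupA", "groupC"},
}
print(conserved_candidates(toy_hits, min_genomes=2))
```

The threshold is left as a parameter because the appropriate value depends on how many genomes are compared.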

The ARG box profile constructed as described in the Materials and methods section was used to scan the complete genomes of other bacteria (excluding the gamma-proteobacteria). The profile is not very selective: at threshold z-score = 3.75 [27] about 1% of the B. subtilis and M. tuberculosis genes are selected, compared with 7% for T. maritima. Nevertheless, there is a sharp distinction between the arginine-related genes without ARG boxes (for example, argT of E. coli, argF of H. influenzae, carAB of M. tuberculosis, argF of T. maritima and several Deinococcus genes, see Figure 1) and those with relatively strong and probably functional ARG boxes. Only the genes involved in arginine metabolism and transport (see below) have upstream ARG boxes in more than five out of eight of the genomes considered. Thus despite the seeming weakness of individual predictions, the basic assumption of regulon conservation supports the validity of the candidate sites [27,28]. Many weaker sites are second sites in cooperative cassettes. The candidate ArgR-binding sites are listed in Table 2 and shown in Figure 1. Validity of the B. subtilis profile for analysis of other genomes is confirmed by a candidate ARG box with z-score = 3.96 within the region protected when ArgR binds upstream of the argR gene of Thermotoga neapolitana [29] (data not shown).
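The text does not define the z-score; its definition is given in [27]. The sketch below assumes, for illustration only, that it is the profile score of a window standardized against the scores of all windows in the scanned sequence. The function names, the three-column toy profile and the toy sequence are hypothetical.

```python
import statistics


def window_scores(seq, profile):
    """Score every window of the sequence against a positional profile.

    profile is a list of dicts, one per position, mapping base -> weight.
    The sequence is assumed to contain only A, C, G and T.
    """
    seq = seq.upper()
    width = len(profile)
    return [
        sum(profile[i][seq[start + i]] for i in range(width))
        for start in range(len(seq) - width + 1)
    ]


def z_scores(scores):
    """Standardize scores against their own mean and standard deviation."""
    mean = statistics.fmean(scores)
    sd = statistics.pstdev(scores)
    if sd == 0:
        raise ValueError("all window scores are identical")
    return [(score - mean) / sd for score in scores]


def candidate_sites(seq, profile, threshold=3.75):
    """Return (start, z) pairs for windows at or above the threshold."""
    zs = z_scores(window_scores(seq, profile))
    return [(start, z) for start, z in enumerate(zs) if z >= threshold]


# Toy profile rewarding the trinucleotide TGA, and a toy sequence.
toy_profile = [
    {"A": 0, "C": 0, "G": 0, "T": 1},
    {"A": 0, "C": 0, "G": 1, "T": 0},
    {"A": 1, "C": 0, "G": 0, "T": 0},
]
toy_seq = "C" * 20 + "TGA" + "C" * 20
print(candidate_sites(toy_seq, toy_profile))
```

Under this assumption, the fraction of genes selected at a fixed threshold depends on the score distribution of each genome, which is consistent with the different selectivity reported for different genomes.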

Figure 1

Schematic representation of the operon organization and regulation of the arginine metabolism and transport genes. Genes are represented by boxes. ARG boxes in the upstream region are shown by black arrows. The direction of the arrow indicates the direction of transcription. The linear pathway (in E. coli and V. cholerae) involves N-acetylglutamate synthase (argA) and N2-acetylornithine deacetylase (argE). The circular pathway (in other bacteria) involves N2-acetyl-L-ornithine: L-glutamate acetyltransferase (argJ). The common genes are acetylglutamate kinase (argB); acetylglutamate semialdehyde dehydrogenase (argC); acetylornithine delta-aminotransferase (argD); ornithine carbamoyltransferase (argF, argI); argininosuccinate synthase (argG); argininosuccinate lyase (argH); carbamoyl-phosphate synthase (carAB). The H. influenzae genome contains only argH, argG, argF and possibly argD orthologs. There are difficulties in identifying orthologs for argC, argJ and argB in D. radiodurans because there are several paralogous genes encoding proteins that can possibly perform these functions. The B. subtilis roc operons involved in arginine degradation are also regulated by AhrC, as are the anaerobic arginine catabolism genes arcABCD in B. licheniformis [14] (data not shown). The transporter genes are: periplasmic binding protein (white), permease transmembrane protein (light gray), ATPase component (dark gray).

Table 2 Candidate ARG boxes upstream of arginine metabolism related genes and operons

In addition to previously characterized ARG boxes in B. subtilis we identified a candidate ARG box upstream of the yqjN gene (Figure 1, Table 2), a probable product of recent duplication of the rocB gene encoding an arginine utilization protein with unknown biochemical function. Thus it is likely that YqjN has the same function as RocB and is also involved in arginine degradation.

An important outcome of the analysis is that in addition to the genes encoding the arginine metabolism enzymes, ArgR probably regulates ABC-cassette operons or scattered genes responsible for arginine transport in all bacteria except M. tuberculosis and maybe C. pneumoniae (Figure 1). Straightforward resolution of the orthology relationships between genes involved in transport of polar amino acids on the basis of their sequence similarity is impossible (Figure 2, and see COG0834, COG0795, COG1126 in [1]). Therefore the presence of candidate ARG boxes upstream of these genes could be the only indication of their involvement in arginine transport before experimental verification. Nevertheless, the protein tree presented in Figure 2 demonstrates clustering of closely related paralogs within one organism (E. coli, Clostridium acetobutylicum) or orthologs in closely related organisms (E. coli and H. influenzae) that have upstream candidate ARG boxes (Figure 1, Table 2). In the E. coli genome, this family includes two loci, artPIQM-artJ and argT-hisJQMP. In each case the four-gene operon encodes a complete ABC cassette with two transmembrane components, whereas the single-gene operon encodes an additional periplasmic protein. The art genes encode an arginine transport system. The hisJQMP operon encodes a histidine-specific ABC cassette, whereas the product of the upstream gene argT, the lysine-arginine-ornithine-binding periplasmic protein ArgT, can substitute for the periplasmic protein HisJ in binding to the membrane component HisP, thus changing the initial histidine transporter specificity [30]. The operons hisJQMP and argT have no candidate ARG boxes and do not seem to belong to the arginine regulon.

Figure 2

Unrooted, neighbor-joining tree of the predicted polar amino acid periplasmic binding proteins for selected organisms. The tree was reconstructed using the PHYLIP package (SEQBOOT, PROTDIST, NEIGHBOR, CONSENSE and FITCH programs). Nodes with bootstrap value exceeding 60% are marked by open circles. BS, B. subtilis; CA, Cl. acetobutylicum; Cpn, C. pneumoniae; DR, D. radiodurans; EC, E. coli; HI, H. influenzae; Rv, M. tuberculosis; TM, T. maritima. Experimentally established specificity of transporters is indicated in parentheses. Genes with candidate ARG boxes in upstream regions are shown in italic and in a larger font.

In the Pseudomonas aeruginosa genome there are three systems closely related to the above transporters. One is orthologous to hisJQMP and another to artPIQM. These two systems have not been characterized experimentally. The third system, aotQJMP, is closer to hisJQMP than to artPIQM. It encodes transporters of arginine and ornithine, but not lysine [31], and is located within the arginine and ornithine catabolism locus aot-aru. The aot system is positively regulated by an activator, ArgR, which is encoded by the distal gene of the aotJQMOPargR operon [31]. This activator belongs to the AraC family and is not related to the ArgR repressor of E. coli [32].

The situation with the C. pneumoniae genome is not clear. It contains the argR gene but no genes for arginine metabolism. There is a stand-alone artJ gene (encoding an ABC cassette periplasmic protein) and two genes annotated as glnPQ immediately downstream of argR (encoding the transmembrane and ATPase components, respectively). In fact, glnP of C. pneumoniae is the bidirectional best hit of the E. coli gene yecC situated in the flagellar locus. The ABC transporters are not easily amenable to orthology analysis, as their specificity may change at a fast rate. As mentioned above, positional and regulatory analysis is often the only computational technique for determining the cellular role of ABC cassettes before experimental verification. We note a pair of ARG boxes upstream of glnPQ and two ARG boxes with lower z-scores upstream of the artJ operon of C. pneumoniae. Thus it is very tempting to predict that these genes in fact encode an arginine transport system regulated by ArgR. We feel, however, that this prediction cannot be accepted without experimental verification, especially in view of two complicating observations. First, both artJ and glnPQ operons are conserved in the genome of C. trachomatis, despite the fact that the latter has no gene for ArgR. Second, ArgR of C. pneumoniae is closer to the ArgR of gamma-proteobacteria than to the AhrC/ArgR of Gram-positive bacteria, but nevertheless the ARG boxes of C. pneumoniae are visible with the Bacillus profile, but not with the gamma-proteobacteria profile.
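The bidirectional best hit criterion mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in this study: pairwise similarity scores are taken as given, and the function names, gene labels and scores are all hypothetical.

```python
def best_hits(scores):
    """Return the highest-scoring subject for each query.

    scores maps (query, subject) pairs to a similarity score, where a
    larger score means greater similarity.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b, b_vs_a):
    """Return gene pairs that are each other's best hit across two genomes."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Toy scores: a1 and b1 choose each other; a2 prefers b1, but b1 prefers
# a1, so a2 has no bidirectional best hit.
toy_a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 90, ("a2", "b1"): 150, ("a2", "b2"): 60}
toy_b_vs_a = {("b1", "a1"): 200, ("b1", "a2"): 150, ("b2", "a1"): 90, ("b2", "a2"): 60}
print(bidirectional_best_hits(toy_a_vs_b, toy_b_vs_a))
```

As the toy data show, a bidirectional best hit reflects sequence similarity alone, which is why it says little about substrate specificity in families whose specificity changes quickly.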

Taken together these data suggest that ARG regulons represent an interesting (and possibly unique) case which could be considered as an intermediate evolutionary state compared to the HrcA and LexA/DinR regulons. ArgR orthologs retain high similarity on the amino acid level within the major taxonomic groups, and are identifiable between these groups, whereas ARG box conservation is low, although sufficient to be detected in diverged bacterial lineages. Nevertheless, this state seems to be stable and it is not clear what evolutionary forces are responsible for its stability. In this respect it is noteworthy that the structural type of the DNA-binding domain in the protein apparently does not determine the evolutionary relationships with its recognition site. All three aforementioned regulator families, as well as many others, contain the so-called 'winged helix' DNA-binding domain and its conservation is not correlated with conservation of its binding site (Table 1).

Conclusions

The composition of the ARG regulons in different bacteria is known to vary mainly because of diversity in the arginine degradation pathways and species-specific paralogs. The question of the origin of 'additional' ARG boxes thus arises. Because of the low conservation of the ArgR-binding signal, it is possible that some of the sites could be convergent in origin. Moreover, each genome contains a large number of potential ARG box-like sequences that could become actual sites when they become located upstream of an arginine metabolism gene following chromosomal rearrangements [33].

In contrast, CIRCE elements appear to be direct descendants of the ancient regulon present in the common ancestor of the Bacteria, because the variation in the composition of the CIRCE regulon is minimal and the few additional sites found in some genomes are apparently products of duplication. Most other DNA-binding domains of transcriptional regulators (including LexA) seem to undergo considerable changes together with their DNA signals and regulons. Thus, the evolution of the arginine regulon and ARG boxes seems to reflect a tradeoff between maintaining regulon flexibility on one hand and retaining the universal regulatory mechanism on the other.

Another interesting aspect of the arginine regulon strategy is the use of single and cooperative sites. In E. coli, the use of cooperative binding sites by ArgR seems to be a consequence of a requirement for a sharper response to a stimulus (arginine starvation) compared to the SOS response (single sites are usually used by LexA) [19]. Unfortunately, the available data seem to be insufficient to draw any systematic conclusions. In particular, as second sites in the cooperative cassettes are often weak (have low scores), some of them could be missed by the recognition rule. Direct experimental studies are needed to clarify this issue. Another problem that was not directly addressed in this study is the role of the E. coli arginine repressor in recombination and its binding to the cer site, which contains a single ARG box [16,17]. We have noted, however, conservation of this box in the monomerization site ckr of the plasmid ColK [34].

There are a few more transcription factor families (biotin operon repressor, COG1654; putative stress-responsive transcriptional regulator PcpC, COG1983; Bvg accessory factor homologs, COG1521 [1]) with a single representative per genome, and it would be interesting to compare them as well. They do not, however, contain a sufficient number of experimentally determined binding sites and are not so ubiquitous in the bacterial genomes as the three regulators discussed previously. With more available genomes, we hope that our approach, combined with positional analysis aimed at finding co-localized, and thus possibly functionally related enzymes and regulator genes [35,36], will enable us to make this comparison. On the other hand, we feel that the predictions made in this study, especially identification of the Art family ABC transporters in several diverse genomes, are sufficiently interesting to warrant experimental verification.

Materials and methods

The profile for ARG box identification was constructed as follows. Upstream regions of B. subtilis operons involved in arginine metabolism were selected. An iterative signal search procedure was applied as described previously [28]. The resulting ARG box profile was constructed using the four sites upstream of argC, argG, rocA and rocD. These formally identified sites are a subset of the experimentally known sites [9]. Gamma-proteobacteria were analyzed using the longer E. coli ARG box profile taken from [18]. Only genes having candidate sites in five or more out of the eight genomes analyzed were considered as candidate regulon members and were retained for further analysis. This procedure could lead to the loss of some true sites, but ensured that false sites were not accepted.
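A positional profile of the kind described above can be derived from a set of aligned sites. The sketch below uses a generic log-odds weighting with a pseudocount, which is an assumption and not necessarily the weighting used in [28]; the function names are hypothetical and the aligned sites are toy sequences, not the sites upstream of argC, argG, rocA and rocD.

```python
import math
from collections import Counter

BASES = "ACGT"


def build_profile(sites, pseudocount=0.5, background=0.25):
    """Build a log-odds positional weight matrix from aligned sites.

    Returns a list with one dict per position, mapping base -> weight.
    The pseudocount keeps unobserved bases from getting a weight of
    minus infinity when only a few sites are available.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    profile = []
    for position in range(width):
        counts = Counter(site[position].upper() for site in sites)
        profile.append({
            base: math.log((counts[base] + pseudocount) / total / background)
            for base in BASES
        })
    return profile


def score(profile, site):
    """Sum the positional weights of a candidate site."""
    if len(site) != len(profile):
        raise ValueError("site length must match profile width")
    return sum(column[base] for column, base in zip(profile, site.upper()))


# Four toy aligned sites, mirroring the number of sites but not their content.
toy_sites = ["TGAAT", "TGAAT", "TGCAT", "TGAAA"]
toy_profile = build_profile(toy_sites)
print(score(toy_profile, "TGAAT") > score(toy_profile, "CCCCC"))
```

With only four training sites the weights depend strongly on the pseudocount, which is one reason a cross-genome consistency filter is useful alongside the profile score.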

The complete genomes of E. coli, H. influenzae, Vibrio cholerae, B. subtilis, Mycobacterium tuberculosis, Thermotoga maritima, Chlamydia pneumoniae and Deinococcus radiodurans were downloaded from GenBank [37]. The complete genome of Clostridium acetobutylicum was obtained at [38].