Introduction

Streptomyces are a group of aerobic, high %G+C, Gram-positive bacteria that undergo complex differentiation to form filamentous mycelium, aerial hyphae and spores. In addition, they produce a broad range of secondary metabolites including antibiotics, antiparasitic agents, herbicides, anti-cancer drugs and various enzymes of industrial importance. Two Streptomyces species have had their complete genome sequences published, namely the model organism Streptomyces coelicolor (%G+C = 72.1) and the avermectin producer Streptomyces avermitilis (%G+C = 70.7) (Bentley et al. 2002; Ikeda et al. 2003). Two important aspects of the genome structure of Streptomyces were supported by the sequence data. Firstly, the genome of Streptomyces is large compared to those of other bacteria: 8,667,507 bp (7,825 protein coding genes) for S. coelicolor and 9,025,608 bp (7,577 protein coding genes) for S. avermitilis. Secondly, the genomes of these two species are linear and both ends contain unique terminal inverted repeats that probably covalently bind a terminal protein. Terminal inverted repeats and covalently bound terminal proteins are not found in the limited number of other bacteria that have linear chromosomes, such as Borrelia burgdorferi and Agrobacterium tumefaciens, and, up to the present, seem to be unique to the Streptomyces and perhaps other Actinobacteria (Lin et al. 1993; Chen et al. 2002; Goodner et al. 1999; Huang et al. 2004). Over 2,500 Streptomyces strains are present in the Ribosomal Database Project (http://www.rdp.cme.msu.edu), over 1,500 are available at the American Type Culture Collection (http://www.atcc.org/) and many more are held in both public and private culture collections throughout the world. Analysis of the small subunit ribosomal RNA gene sequences of Streptomyces confirms that they form a monophyletic clade, but one with considerable diversity.
In addition, there is significant gene diversity at the interspecies level across the genomes of the two completely sequenced Streptomyces, with 2,291 genes unique to S. avermitilis and 2,307 genes unique to S. coelicolor. This makes them particularly interesting targets for comparative genomic studies. In this study we chose four species to begin an analysis of the genomic diversity of the Streptomyces. S. avermitilis was chosen because of the availability of the complete genome sequence of this species, while Streptomyces maritimus was chosen because of its intermediate position in terms of phylogeny within the Streptomyces. Streptomyces cattleya, a β-lactam producing species, was chosen because, based on small subunit ribosomal RNA sequence, it is phylogenetically quite divergent from S. coelicolor and branches near the root of the Streptomyces clade. Finally, Kitasatospora aureofaciens was chosen as this genus is very closely related to the Streptomyces.

The availability of two microarrays for S. coelicolor (Lum et al. 2004; Huang et al. 2001; Vinciotti et al. 2005; http://www.surrey.ac.uk/SBMS/Fgenomics/Microarrays/index.html) makes a comparative genomic analysis of Streptomyces species possible. The genes that make up the genome of S. coelicolor have been classified based on the scheme of Riley and colleagues for E. coli, as modified for S. coelicolor (http://www.sanger.ac.uk/Projects/S_coelicolor/scheme.shtml). A microarray analysis of the genomes of these Streptomyces using the S. coelicolor microarray can provide a wide-ranging comparative analysis of their conserved genome content. This type of approach, where a heterologous microarray is used to analyze the genome content of a range of strains or species, has been successfully used in a wide range of organisms (Akman and Aksoy 2001; Akman et al. 2001; Behr et al. 1999; Chan et al. 2003; Cho and Tiedje 2001; Dorrell et al. 2001; Dziejman et al. 2002; Fitzgerald et al. 2001; Gill et al. 2002; Leonard et al. 2003; Murray et al. 2001; Porwollik et al. 2002; Salama et al. 2000; Israel et al. 2001; Rajashekara et al. 2004). The comparisons made using this approach range from intraspecies comparisons, such as those for Campylobacter jejuni, Vibrio cholerae and Staphylococcus aureus (Dorrell et al. 2001; Dziejman et al. 2002; Fitzgerald et al. 2001), to interspecies comparisons, such as Sodalis glossinidius versus an Escherichia coli array, Salmonella bongori versus a Salmonella enterica array, Shewanella species versus Shewanella oneidensis and E. coli arrays, and Brucella species versus a Brucella melitensis array (Akman et al. 2001; Chan et al. 2003; Murray et al. 2001; Rajashekara et al. 2004).

In this study, we used both versions of the S. coelicolor genome microarrays to compare the gene complements of the three Streptomyces species and one Kitasatospora species. The genus Kitasatospora is closely related to the genus Streptomyces in terms of morphology, chemical taxonomy and small subunit ribosomal RNA sequence analysis. Thus, a species from this genus acts as a potential outgroup in terms of overall genome structure. Among the conserved genes, those of particular interest include genes involved in secondary metabolism, genes involved in chromosome replication, genes in the terminal regions of the chromosome, sigma factors, genes involved in differentiation and hypothetical genes. In terms of gene absence, the distribution of such genes along the chromosome and the apparent absence of any major housekeeping genes in a specific species are of interest. This information provides insights into the genes that make up the core complement for a member of the Streptomyces and into which genes are central to defining a Streptomyces species.

Materials and methods

16S phylogeny

The phylogenetic analysis was carried out on selected small subunit 16S ribosomal RNA gene sequences obtained from Ribosomal Database Project-II Release 9 (http://www.rdp.cme.msu.edu/index.jsp) and aligned using CLUSTALX (Thompson et al. 1997). The tree was constructed using the Neighbor-Joining algorithm from the same program. In the case of S. maritimus, the taxonomy of the strain was confirmed by DNA sequencing of the 16S ribosomal RNA gene.

Arrays

Two series of arrays that cover about 97% of the complete genome of Streptomyces coelicolor A3(2) (Lum et al. 2004; http://www.surrey.ac.uk/SBMS/Fgenomics/Microarrays/index.html) were used in this study. Both arrays are PCR arrays, but they come from different sources, namely Stanford University, USA and the University of Surrey, UK, and are made up of different PCR products. The Stanford array as used in this study contained sequences covering 7,603 open reading frames. The Surrey microarray is made up of 7,758 unique PCR amplified sequences, 7,563 from the chromosome and 195 from SCP1. An additional 376 non-unique, alternative and cross-hybridizing sequences are also spotted on to the array, together with no-probe spots and control spots. The two types of arrays were used to improve validation in a system that relies on heterologous hybridization; however, only the University of Surrey array was hybridized and analyzed in duplicate. The major difference between the two arrays was that the Surrey array did not include a number of transposition element related genes, although there were other differences in overlap. The sequences of the PCR products are not available for either array due to intellectual property protection requirements.

Strains and growth conditions

S. coelicolor A3(2) (SCP1+) 104, S. avermitilis ATCC 31267, S. cattleya ATCC 35852, S. maritimus Yang-Ming and K. aureofaciens ATCC 10762 were used in these studies. Fresh spores were collected and mycelium cultured in TSB liquid medium with 0.5% glycine at 30°C overnight.

Preparation of labeled DNA

Genomic DNA from a stationary phase culture was purified by the salting out procedure (Pospiech and Neumann, 1995) and sonicated to < 2 kb. Four to six micrograms of sonicated genomic DNA were used as template and this was denatured in the presence of 12 μg of 72%-GC-content random hexamers in a total volume of 25 μl at 100°C for 10 min. The mixture was then snap-cooled on ice before adding the remaining reaction components: 1.5 μl of Cy3-dCTP or Cy5-dCTP (Amersham Pharmacia Biotech), 4 μl Klenow fragment (NEB #212), 5 μl Klenow buffer, 0.5 μl dNTP (4 mM dATP, 4 mM dTTP, 10 mM dGTP, and 0.2 mM dCTP), and 14 μl ddH2O. The random primed labeling reaction was carried out for 2–3 h at 37°C. Buffer exchange, purification and concentration of the DNA products were accomplished by three cycles of diluting the reaction mixture in 0.5 ml TE buffer (10 mM Tris and 1 mM EDTA pH 8.0) and filtering through a Microcon-30 microconcentrator (Millipore).
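As a quick check on the recipe above, the listed volumes can be summed and the final unlabeled nucleotide concentrations derived. This is only a sketch: it assumes that the millimolar values in parentheses describe the dNTP stock mix, and it ignores the contribution of the labeled Cy3-dCTP or Cy5-dCTP, whose stock concentration is not given in the text.

```python
# Volumes (microlitres) of the labeling reaction as listed in the text.
components_ul = {
    "denatured DNA + random hexamers": 25.0,
    "Cy3-dCTP or Cy5-dCTP": 1.5,
    "Klenow fragment": 4.0,
    "Klenow buffer": 5.0,
    "dNTP mix": 0.5,
    "ddH2O": 14.0,
}
total_ul = sum(components_ul.values())

# Assumption: these are the concentrations (mM) of the dNTP stock mix.
dntp_stock_mM = {"dATP": 4.0, "dTTP": 4.0, "dGTP": 10.0, "dCTP": 0.2}


def final_uM(stock_mM, added_ul, reaction_ul):
    """Final concentration in micromolar after dilution into the reaction."""
    return stock_mM * 1000.0 * added_ul / reaction_ul


final_dntp_uM = {
    name: final_uM(conc, components_ul["dNTP mix"], total_ul)
    for name, conc in dntp_stock_mM.items()
}

print(f"total reaction volume: {total_ul} ul")
for name, conc in final_dntp_uM.items():
    print(f"{name}: {conc:.1f} uM")
```

Under these assumptions the unlabeled dCTP is kept far below the other nucleotides, which is consistent with a protocol designed to favor incorporation of the dye-coupled dCTP.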

Microarray hybridization and data analysis

The two DNA pools to be compared were mixed and applied to an array in a hybridization mixture that contained 3.68 × SSC, 0.18% SDS, and 1 μg yeast tRNA (total 16.3 μl), which had been heated at 100°C for 5 min before being applied to the array. Hybridization took place under a glass coverslip sealed by glue in a humidified Omnislide (Thermo Hybaid) at 60°C for 12–14 h. The slides were washed, dried and scanned for fluorescence using a GenePix 4000B scanner (Axon Instruments). Average signal intensity and local background measurements were obtained for each spot on each array using GenePix Pro software. The dataset was screened for aberrant spots and these were eliminated from the analysis after manual checking. Most genes are present in duplicate on the two arrays and the signal from each pair of spots was entered into ScanAlyze (Eisen et al. 1998; Gollub et al. 2003). The data were then processed into a mean log2 Cy3/Cy5 ratio format. The dataset was normalized for each array separately and exported to Excel where, after checking the alignment of the datasets from each array, a mean signal for each common gene was calculated. Genes that were absent from either array, mostly transposon related genes missing from the University of Surrey array, were not included in the analysis. Following Bentley et al. (2002), the mean signal and standard deviation were calculated for the core region genes from SCO2050 to SCO5800. The standard deviation was used to set a cut-off for gene absence at 2SD below the core mean. The microarray data are presented relative to the S. coelicolor standard in two ways: either as a color plot of the genes produced with the program Treeview (Eisen et al. 1998), where green represents a negative hybridization signal, black represents an equal hybridization signal and red indicates a positive hybridization signal, or as numeric values for the signal from each gene.
The microarray data for the four species described here and additional unpublished species can be accessed via rkirby@ym.edu.tw.
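The absence call described above can be sketched in a few lines. This is an illustration under stated assumptions, not the pipeline actually used (which relied on ScanAlyze and Excel): the function names are hypothetical, the per-gene values are assumed to be normalized log2 ratios oriented so that weaker hybridization of the test species gives a more negative value, and the sample standard deviation is used because the text does not specify which estimator was applied.

```python
from statistics import mean, stdev

CORE_FIRST, CORE_LAST = 2050, 5800  # SCO2050 to SCO5800, as in the text


def sco_number(gene_id):
    """Return the integer part of an identifier such as 'SCO2050'."""
    return int(gene_id[3:])


def mean_signal(per_array):
    """Average per-gene signals over arrays, keeping only genes found on every array."""
    common = set.intersection(*(set(a) for a in per_array))
    return {g: mean(a[g] for a in per_array) for g in common}


def absence_cutoff(signals, n_sd=2.0):
    """Cutoff placed n_sd standard deviations below the mean of the core-region genes."""
    core = [s for g, s in signals.items()
            if CORE_FIRST <= sco_number(g) <= CORE_LAST]
    return mean(core) - n_sd * stdev(core)


def call_absent(signals, n_sd=2.0):
    """Map each gene to True when its signal falls below the cutoff."""
    cutoff = absence_cutoff(signals, n_sd)
    return {g: s < cutoff for g, s in signals.items()}


# Toy values, invented purely for illustration.
toy = {"SCO0001": -2.5, "SCO2050": 0.1, "SCO3000": -0.1,
       "SCO4000": 0.0, "SCO5800": 0.2}
cutoff = absence_cutoff(toy)
calls = call_absent(toy)
merged = mean_signal([{"SCO0001": -2.0, "SCO2050": 0.2},
                      {"SCO0001": -3.0, "SCO2050": 0.0, "SCO9999": 1.0}])
```

Anchoring the cutoff to the core region rather than to the whole chromosome means that the threshold is set by genes expected to be conserved, so the poorly conserved terminal regions do not inflate the standard deviation.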

Comparison of the microarray dataset for S. avermitilis with the complete genome sequence

The nucleotide sequences for all the identified open reading frames from the S. avermitilis genome sequence (Ikeda et al. 2003) were compared with the genome sequence of S. coelicolor using blastn, limiting the output to the best match. The resulting E value dataset for the genes was then aligned with the S. avermitilis microarray dataset and the comparison was plotted as a scatterplot. Genes showing disagreement between the two datasets were identified based on a 2 standard deviation (SD) cutoff for the microarray dataset and an E value cutoff of 1e-10 for blastn.
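The comparison of the two datasets amounts to sorting each gene into one of four classes. The sketch below shows one way to do this; the function names, class labels and toy values are hypothetical, and it assumes that each gene has already been paired with its microarray signal and its best blastn E value (a gene with no hit could be given an E value of `float("inf")`).

```python
from collections import Counter


def classify(signal, evalue, signal_cutoff, evalue_cutoff=1e-10):
    """Assign a gene to one of four agreement classes."""
    array_present = signal >= signal_cutoff
    blast_present = evalue <= evalue_cutoff
    if array_present and blast_present:
        return "present in both"
    if not array_present and not blast_present:
        return "absent in both"
    if blast_present:
        return "array absent, blastn present"
    return "array present, blastn absent"


def tally(genes, signal_cutoff, evalue_cutoff=1e-10):
    """Count genes per class; genes maps gene id -> (signal, E value)."""
    return Counter(
        classify(s, e, signal_cutoff, evalue_cutoff) for s, e in genes.values()
    )


# Toy values, invented purely for illustration.
genes = {
    "geneA": (0.1, 1e-80),
    "geneB": (-2.0, 5.0),
    "geneC": (-1.5, 1e-40),
    "geneD": (0.0, 1e-3),
}
counts = tally(genes, signal_cutoff=-0.5)
```

The two mixed classes are the disagreements between the datasets; the two remaining classes are the genes on which the microarray and the genome sequence agree.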

Analysis of gene presence across the chromosome

A graphical display was created by counting the number of genes detected as present, based on the 2SD cutoff applied to each normalized microarray dataset, using a moving window of 10 genes in steps of one.
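A moving-window count of this kind can be sketched as follows. The function name is hypothetical, and the input is assumed to be a list of presence calls (True or 1 for present) already sorted in chromosomal order.

```python
def window_presence(present, width=10):
    """Count present genes in every window of `width` consecutive genes, step 1."""
    if len(present) < width:
        return []
    running = sum(present[:width])
    counts = [running]
    for i in range(width, len(present)):
        # Slide the window by one gene: add the new gene, drop the oldest one.
        running += present[i] - present[i - width]
        counts.append(running)
    return counts


# Toy input: ten present genes followed by five absent genes.
example = window_presence([1] * 10 + [0] * 5)
print(example)  # [10, 9, 8, 7, 6, 5]
```

Each value lies between 0 and the window width, so with a window of 10 genes a plotted value of 10 marks a fully conserved stretch and low values mark clusters of absent or divergent genes.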

Results and discussion

Comparison of S. avermitilis, S. cattleya, S. maritimus and K. aureofaciens with the S. coelicolor genome

After spot and data validation, a total of 7,083 open reading frames were included in this analysis as being present on both types of array and giving an analyzable signal on all three arrays. Validity in this study was initially obtained by using microarrays from two sources that presumably use different PCR products to create the arrays. In addition, the University of Surrey array was hybridized and analyzed in duplicate. In terms of gene absence based on two standard deviations as described in the “Materials and methods” section, the agreement between the Stanford array and the duplicated University of Surrey array was about 95%, while the agreement between the two University of Surrey arrays was about 98%. In order to minimize the effect of divergent individual array spots, the mean signal for each gene from the three arrays was used throughout this study.
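The text does not state how the agreement percentages were computed. One plausible reading, sketched here as an assumption, is the fraction of shared genes that receive the same absence call on two arrays; the function name and toy calls are hypothetical.

```python
def agreement(calls_a, calls_b):
    """Fraction of shared genes given the same absence call on two arrays."""
    shared = calls_a.keys() & calls_b.keys()
    if not shared:
        raise ValueError("the two arrays share no genes")
    same = sum(calls_a[g] == calls_b[g] for g in shared)
    return same / len(shared)


# Toy absence calls (True = absent), invented purely for illustration.
stanford = {"g1": True, "g2": False, "g3": False, "g4": False}
surrey = {"g1": True, "g2": False, "g3": True, "g4": False}
score = agreement(stanford, surrey)
print(f"{score:.0%}")  # 75%
```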

In this study, the genomic content of three Streptomyces species and one Kitasatospora species with divergent taxonomy, antibiotic production and SSU rRNA sequence is compared using two different S. coelicolor microarrays. It is clear that there are inherent limitations to this approach. Firstly, only gene absence or divergence, rather than the presence of new genes, can be identified. Secondly, it is not possible to clearly separate the absence of a gene from the presence of a divergent homologue of the same gene. Finally, although the order of the genes in S. coelicolor and S. avermitilis is known from their complete genome sequences and is well conserved, this does not mean that the synteny of most of them is conserved in other Streptomyces species. However, the detection of synteny across Actinobacteria including Mycobacterium tuberculosis, Corynebacterium glutamicum and other species (Bentley et al. 2002 and unpublished data) supports a conserved central core structure to the genomes of the Actinomycetes and, a priori, most Streptomyces. Thus, although major chromosomal reorganizations in the central core region cannot be detected by microarray data, a basic chromosomal structure can be assumed as a first approximation; namely, a linear chromosome with variable terminal regions and a relatively well conserved core region.

When the pooled data from the two arrays for the four species were analyzed using Cy3-labeled S. coelicolor A3(2) chromosomal DNA compared to heterologous Cy5-labeled chromosomal DNA, a wide range of signal variation could be noted and this is shown in Supplementary Fig. 1. The SSU rRNA tree places the divergence of these four strains from S. coelicolor as S. cattleya > K. aureofaciens > S. maritimus > S. avermitilis (Fig. 1). Gene differences were present in the order S. cattleya > K. aureofaciens > S. avermitilis > S. maritimus based on a cutoff of 2SD below the mean signal for the core region genes. The microarray data thus show general agreement with the tree, with S. cattleya and K. aureofaciens being more divergent and the other two species being relatively closer. It is interesting to note that the Kitasatospora species used in this study, K. aureofaciens, shows the same general structure as the Streptomyces species. This is not unexpected; it confirms the close relationship between Kitasatospora and Streptomyces and agrees with the SSU rRNA tree data.

Fig. 1

SSU rRNA phylogenetic tree of selected Streptomyces species and other Actinomycetes that have known complete genome sequences. The species analyzed by microarray are indicated in bold

Further support for the reliability of the data comes from a comparison of the blastn E values for all genes with the microarray data, as shown in the Fig. 2 scatterplot. This indicated that 232 out of 6,832 genes show gene absence by microarray when they seem to be present by blastn and that 268 out of 6,832 genes show gene presence by microarray when they seem to be absent by blastn; both results are based on cutoffs of 2SD below the core mean for the microarray data and 1e-10 for the E value. This gives an overall reliability for S. coelicolor compared to S. avermitilis of 93%. Potential sources of error include, for the former type of error, poor spotting of the array at that point and the choice of the PCR product sequence (the comparison is with the whole gene, as the PCR product sequences are not available) and, for the latter type, cross-hybridization between multiple gene copies or an unreliable hybridization signal due to poor washing in that area. Overall, the results for S. avermitilis clearly support the reliability of the genome comparisons produced by this study.
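The 93% figure follows directly from the two discordant counts given above. The short calculation below reproduces it; the variable names are illustrative only.

```python
total_genes = 6832
array_absent_blast_present = 232
array_present_blast_absent = 268

# Genes on which the microarray call and the blastn call disagree.
discordant = array_absent_blast_present + array_present_blast_absent
concordant = total_genes - discordant
reliability = concordant / total_genes

print(discordant, concordant)   # 500 6332
print(f"{reliability:.1%}")     # 92.7%
```

The value of 92.7% rounds to the 93% quoted in the text.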

Fig. 2

Scatterplot comparing gene presence/absence based on the microarray data and gene presence/absence based on blastn between Streptomyces coelicolor and Streptomyces avermitilis. See “Materials and methods” for details. Box A and Box C include genes identified as absent in S. avermitilis by the microarray dataset but present using blastn and genes present in S. avermitilis using blastn, but identified as absent by the microarray dataset. Box B includes genes that are correctly identified as absent by the microarray dataset

Distribution of gene differences across the complete chromosome of S. coelicolor for all four other Streptomyces species

The whole chromosome microarray dataset supports the following structure for the Streptomyces chromosome. Based on Fig. 3 and Supplementary Fig. 1, there is a central core of conserved, probably syntenous genes that can be found across many Actinomycetes; in the S. coelicolor genome this core reaches from about SCO2050 to SCO5800 (Bentley et al. 2002). The regions between SCO1100 and SCO2050 and between SCO5800 and SCO7600 are also quite well conserved between the Streptomyces studied here, as well as being syntenous between the S. coelicolor and S. avermitilis genome sequences. However, they are not present when the genomes of these two species are compared bioinformatically to other divergent Actinomycetes. These two regions therefore seem to be genus specific. Figure 3 also clearly shows that gene conservation drops off dramatically in the terminal regions. The regions from the left terminus to SCO1100 and from SCO7600 to the right terminus show much higher gene divergence than the rest of the chromosome. This agrees with the results of the sequencing studies of the terminal regions of S. ambofaciens (Choulet et al. 2006a, b). The gene conservation levels averaged across the four species are as follows: left terminal region (SCO0001–SCO1100) 40.9%; left genus specific region (SCO1101–SCO2050) 84.8%; core region (SCO2050–SCO5800) 79.4%; right genus specific region (SCO5801–SCO7600) 69.6% and right terminal region (SCO7601–SCO7845) 50.3%. It is noticeable that neither the size nor the distribution of conserved genes is symmetrical between the two terminal regions or the two genus specific regions. Notably, the left genus specific region actually has a higher frequency of gene conservation than the core region as a whole, and the left terminal region is much larger than the right terminal region. This possibly represents horizontal exchange of terminal regions by recombination between strains/species that involves only one terminal region.
Such an event would give rise to asymmetric gene conservation similar to that detected here.
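Regional conservation figures of the kind quoted above can be derived from the per-gene presence calls. The sketch below uses the region boundaries given in the text, but the averaging scheme is an assumption: it takes the mean of the per-species percentages, whereas the text does not say whether species were averaged or pooled. The function name and toy data are hypothetical.

```python
# Region boundaries (SCO numbers, inclusive) as given in the text.
REGIONS = {
    "left terminal": (1, 1100),
    "left genus specific": (1101, 2050),
    "core": (2050, 5800),
    "right genus specific": (5801, 7600),
    "right terminal": (7601, 7845),
}


def percent_conserved(present_by_species, first, last):
    """Mean over species of the percentage of genes called present in a region.

    present_by_species maps species -> {SCO number: True if present}.
    Returns None when no species has a gene in the region.
    """
    per_species = []
    for calls in present_by_species.values():
        in_region = [p for n, p in calls.items() if first <= n <= last]
        if in_region:
            per_species.append(100.0 * sum(in_region) / len(in_region))
    if not per_species:
        return None
    return sum(per_species) / len(per_species)


# Toy calls, invented purely for illustration.
toy = {
    "species A": {1: True, 2: False, 3000: True},
    "species B": {1: False, 2: False, 3000: True},
}
summary = {name: percent_conserved(toy, lo, hi)
           for name, (lo, hi) in REGIONS.items()}
```

Because each region is evaluated independently, the shared boundary at SCO2050 in the published ranges does not need to be resolved for the sketch to run.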

Fig. 3

Analysis of “gene presence” across the four species. Created using a moving window of 10 genes and counting the number of genes whose microarray signal lies above the cutoff of 2SD below the mean for the core region genes. The Y axis is the count for “gene presence”

In the Karoonuthaisiri et al. (2005) study of regional gene expression in S. coelicolor, the boundaries for higher transcript levels during vegetative growth were placed at 1.5 Mb for the left arm and 2.3 Mb for the right arm. The former is midway across the left genus specific region and the latter approximately agrees with the boundary between the core and the right genus specific region. As the core region boundaries are also defined in terms of synteny with the Mycobacterium and Corynebacterium genomes as well as by the data presented here, this supports the idea that the S. coelicolor chromosome structure is asymmetrical with respect to both gene conservation and gene function. Because we are using only S. coelicolor as the source of the array data, the results do not imply that the genomes of S. cattleya, S. maritimus and K. aureofaciens are asymmetric. However, it should be noted that the S. avermitilis genome is also asymmetric (Ikeda et al. 2003).

Notably, there are 22 identifiable regions outside of the terminal regions where all four species show a significant degree of concurrent gene absence (Table 1). These regions of high gene divergence are shown in detail in Supplementary Fig. 2. Previously, Bentley et al. (2002) identified 14 regions in the S. coelicolor chromosome that were potentially laterally acquired. The present analysis pinpoints all of these regions quite accurately, usually to within one or two open reading frames. This suggests that the designation of the other eight regions as potential lateral transfer regions is probably quite robust. It also supports the usefulness of the microarray approach. All 22 regions were analyzed using Frame Plot (Artemis v7.1) and, except for region B, they show abnormalities for at least some of the open reading frames compared to the G+C bias expected for the 1st, 2nd and 3rd codon positions of Streptomyces genes. Eight regions, A, B, F, I, M, O, Q and T, contain transposon related genes near to or within the region. Four regions, H, N, P and R, are flanked by highly conserved genes, such as ribosomal protein or sigma factor genes, which could encourage interspecific recombination. Finally, five regions consist largely of hypothetical proteins with no similarity to any known protein as yet; these regions are G, J, L, S and W. Region L is particularly interesting as there is a central core of conserved genes flanked by two subregions that are poorly conserved. One of these genes is a putative spore septum determining protein, while the rest have unknown functions. Taken as a whole, the results suggest that S. coelicolor may have recently acquired all these regions either by transposition or by interspecific/intraspecific recombination (Wolf et al. 2002; Zhang et al. 2002). It is also unlikely that they were acquired from any of the four species studied here.
There are other regions that could potentially be identified as lateral transfer positions using less stringent criteria, and a wider screening of genomes might help to support these additional regions as being involved in horizontal transfer. In addition, such a wider screen might allow the identification of possible origins of these regions in other species.

Table 1 Areas of the Streptomyces coelicolor genome identified as potentially horizontally transferred regions based on microarray parallel gene absence in all four species

Gene conservation in the terminal regions of the four Streptomyces species

As mentioned earlier, the two regions at either terminus are much less well conserved than the central core region; these extend from SCO0001 to about SCO1100 on the left arm of the chromosome and from about SCO7600 to SCO7845 on the right arm. The boundaries of these regions are not absolutely clear-cut, but gene conservation clearly increases beyond these points as one moves towards the centre of the genome. This can be seen in Fig. 3, where gene conservation is plotted using a moving window for the four species. It is also clear that the lack of conservation is not uniform across the terminal regions and that areas of higher gene conservation can be identified.

The significant interest in the terminal regions arises because the genomes of all Streptomyces that have been examined are linear and the problem of how the termini of such a molecule replicate is of particular importance. Recent studies have indicated that two genes in particular, tpgA (SCO7734) and tapA (SCO7733), are involved in this process (Yang et al. 2002; Bao and Cohen 2001). tpgA, which encodes the terminal protein that covalently binds to the termini of many linear Streptomyces replicons, is conserved across all four species. In S. avermitilis this is also true based on sequence data and, furthermore, there are multiple copies of tpgA, unlike in S. coelicolor. The signal level of the S. avermitilis gene at +1.2 supports the presence of these multiple copies. The signal levels for the other three species are between about −0.3 and −0.1, which supports a single, slightly divergent copy of this gene in these species. However, if two copies are present then the sequence divergence may be higher. Furthermore, tapA is also conserved except in S. maritimus, where it seems to be more divergent at −0.8. It should be noted that the presence of these two genes is not a criterion for defining a genome with a linear topology, but the presence of one or both is certainly suggestive (Dary et al. 2000; Wang et al. 1999; Huang et al. 1998; Lin and Chen 1997). Finally, ttrA is known to be involved in chromosomal transfer and is found very close to the telomere of S. coelicolor and S. avermitilis. This gene is also conserved in all four species, suggesting that genetic exchange is highly important in Streptomyces and related species.

The two terminal regions encompass the major areas that are prone to deletion in many Streptomyces species and are therefore not essential except for linear terminal replication and genetic exchange. Given the relatively high lack of conservation of genes in this region, genes that are present in all four species represent an interesting class. A full list of all genes conserved in all four species in the terminal regions is provided in Tables 2a and 2b. There are 36 hypothetical genes that show high similarity in the two terminal regions. Analysis of these groups of conserved genes using Artemis v7 (The Sanger Institute) identifies a total of five groups of genes that may make up possible single transcriptional units. These are SCO0551–SCO0552, SCO0705–SCO0710, SCO1021–SCO1024, SCO7677–SCO7680 and SCO7682–SCO7688. In addition to TpgA and TapA, it is possible that there are other genes involved in terminal replication and these may be among the conserved genes present in the terminal regions. Although possible candidates can be deduced from a direct comparison of the two known Streptomyces genome sequences, they are many in number. Using the microarray analysis of the Actinomycetes in this study, the candidates can be reduced significantly. From candidates in Tables 2a and 2b, two possible transcriptional units seem to be potential candidates for involvement in terminal replication; these are SCO1021–SCO1024 (hypothetical proteins), and SCO7677–SCO7689 (including hypothetical proteins, an AMP-binding ligase and membrane proteins). Gene knockout studies may be able to identify possible functions for these and other gene candidates, especially the other hypothetical proteins that are conserved in these four species.

Table 2 Genes from the (a) left terminal, (b) right terminal region of Streptomyces coelicolor showing microarray conservation in all four species

Conservation of functional groups of genes across the four Streptomyces species

One approach to analyzing genetic variation across these four species is to look at the functional groupings of genes. Such an approach should allow the identification of strain versus genus specific genes, especially when there are large numbers of genes with related functions, such as sigma factors, or where there are two copies of a gene, such as ftsK. However, because microarray data paint a broad picture across a whole genome, it is essential that, once a gene or genes have been targeted based on microarray data, experimental verification by other means is carried out. Nevertheless, it is hoped that this dataset will help researchers to prioritize their gene targets better. The genes of the S. coelicolor chromosome have been grouped based on the scheme of M. Riley and colleagues for E. coli (ecocyc.org) as modified for S. coelicolor (http://www.sanger.ac.uk/Projects/S_coelicolor/scheme.shtml) and we used this classification. The genes involved in ribosomal protein synthesis and modification should be highly conserved and the results indicate that almost all of them are present in all four species (Table 3; Supplementary Fig. 4). The only exceptions are SCO0436, SCO0509, SCO3430 and SCO3909 in S. avermitilis and SCO4716 and SCO5514 in K. aureofaciens. Of these genes, SCO0436, SCO0509 and SCO5514 represent duplicate genes in the S. coelicolor genome and therefore the choice of the microarray sequence will have had a significant effect on the heterologous hybridization. There is no obvious explanation for the failure of the remaining genes to hybridize but, as a whole, this dataset supports the integrity of the array system for analysis of genome content, as these genes are scattered across the whole Streptomyces genome.

Table 3 Microarray data for ribosomal proteins from the four species

Table 4 shows genes identified as possible sigma factors, anti-sigma factors and anti-sigma factor antagonists. The genes found in the central core region are more conserved. As would be expected, the major sigma factors such as hrdA, hrdB, hrdC and hrdD are conserved, as are many of the other studied sigma factors of S. coelicolor such as sigA, sigE, sigF, sigG, sigR, sigT and whiG. Overall, fewer regulatory genes from this group (anti-sigma factors and anti-anti-sigma factors) are conserved than sigma factors themselves. This analysis allows the identification of new candidate sigma factors for further study beyond the well studied ones, both within S. coelicolor and in other species. Overall, the results support the hypothesis that there is a core of sigma factors essential to keeping protein synthesis in Streptomyces running smoothly. The functionality of the rest may vary and include complete silence of some gene fragments, duplication of function, involvement in specific secondary metabolic activities and species/genus specific functions.

Table 4 Conservation across the four species of genes annotated as sigma factors or related proteins in Streptomyces coelicolor

All four species studied here undergo differentiation and spore formation and as such would be expected to retain most genes involved in cell division/sporulation/differentiation. This is supported by Table 5. K. aureofaciens shows greater gene divergence for certain genes when compared to the three Streptomyces species; these are specifically ftsI (SCO2090) and a putative cell division protein (SCO2968). However, in general, the same genes show higher divergence in all four species, for example sapA, which encodes a protein associated with spore surface hydrophobicity. As spore morphology varies considerably in the Streptomyces, high variability/gene loss in such a gene is not unexpected. Other genes that show higher divergence are those involved in partitioning and cell division. This suggests that the genes, and thus the proteins, involved in these functions may differ from species to species in order to create the variation seen in aerial mycelium and spore structure across Streptomyces species. Specifically, SCO3934, an ftsK family protein gene, is less well conserved than its homologue. This suggests that SCO5750 may produce the major ftsK protein. Other Fts proteins show a similar pattern, with at least one homologue being well conserved. This may contribute to an understanding of the relationships between the genes involved in cell division and will allow better identification of specific targets for further study. One anomaly that stands out is bldB. This gene consistently shows a low level of hybridization. A comparison of the bldB gene sequence between S. coelicolor and S. avermitilis shows a nucleotide identity of about 87%, which ought to give a signal in the region of 0.0 or better. As two different arrays are used in this study, mechanical problems with this spot can probably be eliminated as the source of the anomaly. We suggest that, because this is a relatively small gene, the PCR product chosen for both arrays may be the reason for this result.
This emphasizes that array data should be used with a degree of caution and needs to be backed up by other experimental evidence when specific genes are being investigated.

Table 5 Conservation across the four species of genes in Streptomyces coelicolor annotated as involved in cell division, sporulation and differentiation

The genes involved in DNA replication, repair and restriction/modification are shown in Table 6, and only about 20% of these genes are not relatively well conserved across all four species. This is to be expected, as DNA replication and repair are core functions. Most of the genes that show higher levels of divergence are found in the terminal regions of the linear chromosome and probably perform functions that are not essential to cell survival, because the terminal regions of Streptomyces chromosomes are unstable and liable to deletion without lethality. Of particular interest are SCO0183 and SCO0842 (deoxyribopyrimidine photolyases); this repair system would seem to be absent in S. lividans and S. maritimus, but a homologue is present in S. avermitilis (confirmed by the genome sequence) and in S. cattleya. This confirms the high variability found for this repair function across the Streptomyces (Kobayashi et al. 1989). A similar situation of high variability is found for the mutT homologues, potential 8-hydroxy-dGTP hydrolases. Knockout of this gene has been shown to increase the A:T to G:T mutation rate and thus it has a possible repair function (Kamiya et al. 2004). The genes recA (SCO5769), recF (SCO3876) and recR (SCO3618) are present in all four species; however, recX (SCO5770) is more divergent and gives a low signal for S. cattleya and S. maritimus. SCO6405, a putative DNA recombinase, is scored as absent in all four species, suggesting either that there is redundancy in the Streptomyces genes concerned with recombination or that this gene is transposon related. The latter is supported by low homology to S. avermitilis putative integrases/recombinases. There are four genes encoding DNA gyrases on the microarray, namely gyrA, DNA gyrase subunit A (SCO3873), and gyrB, DNA gyrase subunit B (SCO3874), together with SCO5836 and SCO5822, which may be TopIV homologues involved in resolving chromosome concatenates. All are conserved, although the conservation of SCO5822 gyrB2 is lower. Thus both sets of gyrase genes would seem to be important. As expected, SCO1518, a ruvB Holliday junction protein gene, and SCO1520, a ruvC crossover junction endonuclease gene, are conserved across all the species. Unexpectedly, although probably present in all species, SCO1519 ruvA is much more divergent than the other two genes in this Holliday junction complex. This diversity is not easily explicable, except by the possibility that recombination in Streptomyces occurs via a more variable mechanism than in other groups of bacteria and that this is reflected in the greater divergence of SCO1519 ruvA. All three genes annotated as DNA polymerase I homologues are conserved, as are four out of the five DNA polymerase III homologues, suggesting that there are roles for all of these conserved genes in Streptomyces. Two other unclassified DNA polymerase type genes, SCO4495 and SCO6084, are also conserved and thus may have important functions. There is, however, more diversity among the helicases and methylases/methyltransferases. Among the helicases, three out of 14 show significant divergence, and therefore most of the helicases probably have important cellular roles. Four out of nine methylases/methyltransferases show divergence. As some of these genes may be involved in the DNA modification part of restriction/modification, such diversity across strains is not unexpected. Finally, four out of six ligases show divergence, perhaps reflecting the possibility that a number of these ligases originated from bacteriophages.

Table 6 Conservation across the four species of genes in Streptomyces coelicolor annotated as involved in DNA replication, repair, restriction and modification

Table 7 shows the genes involved in peptidoglycan and teichoic acid synthesis. In this area of metabolism there is also a relatively high level of gene conservation, particularly for the murA, murA2, murB, murD, murE, murF, murG and murX genes. Also conserved are the shape-determining genes SCO2609, SCO2610 and SCO2611, which may form an operon. Together with the genes involved in cell wall biosynthesis, these probably represent a core of genes needed to give a basic structure to the cells of any Streptomyces species. The penicillin binding proteins show a higher degree of variability, except for SCO2897, SCO4013 and SCO5301. The peptidases SCO3580, SCO3596, SCO3011 and SCO4439 and the D-alanine:D-lactate ligase SCO3595 all show a low level of gene conservation, perhaps because they are involved in relatively broad cellular functions and are not under a great deal of selective pressure.

Table 7 Conservation across the four species of genes in Streptomyces coelicolor annotated as involved in peptidoglycan biosynthesis

Conserved genes with no known function

Genes with no known function and no homologue outside of S. avermitilis that are conserved across the other three Streptomyces species should represent genes that are specifically important to being a myceliate actinobacterium, and the phenotypes of knockout strains for these genes will be particularly interesting in terms of Streptomyces biology. Based on the dataset here, 936 genes annotated as either conserved hypothetical genes or non-conserved hypothetical genes can be identified, and these are shown in Supplementary Table 1. The proportion of these genes that is conserved across all four species is 9%, 20%, 13%, 16% and 12% for the left terminal region, left Streptomyces specific region, core region, right Streptomyces specific region and right terminal region, respectively. There is also a low frequency of conserved hypothetical genes in the left terminal region and right Streptomyces specific region, 0.78% and 0.96%, respectively, compared to 3.4% for the left Streptomyces specific region, 1.80% for the core region and 2.11% for the right terminal region. It is clear that these genes need to be screened further by increasing the range of Streptomyces species analyzed by microarray hybridization. This will reduce the set to a manageable number and will allow prioritization of genes for knockout and detailed phenotypic analysis.
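The per-region proportions quoted above amount to a simple tally, which can be sketched as follows. This is only an illustration under stated assumptions: the record layout (region label plus a conserved flag), the function name `conserved_fraction_by_region` and the example records are hypothetical and are not taken from Supplementary Table 1.

```python
from collections import Counter

def conserved_fraction_by_region(records):
    """Return {region: fraction of genes flagged as conserved in all species}.

    `records` is an iterable of (region, is_conserved) pairs; this layout is
    an assumption made for the sketch, not the format used in the study.
    """
    totals, conserved = Counter(), Counter()
    for region, is_conserved in records:
        totals[region] += 1
        if is_conserved:
            conserved[region] += 1
    return {region: conserved[region] / totals[region] for region in totals}

# Made-up records for illustration only.
example_records = [
    ("core", True),
    ("core", False),
    ("left_terminal", False),
    ("left_terminal", False),
    ("core", True),
    ("core", True),
]
fractions = conserved_fraction_by_region(example_records)
print(fractions)
```

Any threshold used to decide whether a gene counts as conserved would have to be applied before this step; the sketch takes the flag as given.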

Another approach to the problem of identifying functionally important genes is to pinpoint functional groups of such genes that may form a transcriptional unit. Blocks of three or more hypothetical genes that are conserved across all species were identified and are shown in Table 8. It is possible that these groups represent conserved functional groups of genes essential to core functions that make Streptomyces different from other bacteria. They are found mostly in the area between the Streptomyces terminal regions and the central core region. There are seven groups of conserved hypothetical genes larger than five genes (SCO1407–SCO1413, SCO2362–SCO2370, SCO2911–SCO2919, SCO3846–SCO3854, SCO5536–SCO5543, SCO5762–SCO5767 and SCO6522–SCO6528). It is likely, given the nature of the genes in the vicinity of SCO3846–SCO3854, that this complex is involved in cell division, development and DNA partitioning. The function of the other groups is unknown. Interestingly, none of these gene groups is upregulated on shifting from exponential phase to stationary phase or under stress shift, as indicated by Karoonuthaisiri et al. (2005).
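The search for blocks of three or more adjacent conserved hypothetical genes can be sketched as a scan for runs along the chromosome. This is a minimal illustration and not necessarily the procedure used in this study: the record layout, the function name `find_conserved_blocks` and the gene identifiers in the example are hypothetical, and the genes are assumed to be listed in chromosomal order.

```python
from typing import List, Tuple

# Each record: (gene_id, is_hypothetical, conserved_in_all_species).
# Records are assumed to be sorted in chromosomal order. The values used
# below are made up for illustration and are not data from the study.
Gene = Tuple[str, bool, bool]

def find_conserved_blocks(genes: List[Gene], min_len: int = 3) -> List[List[str]]:
    """Return runs of at least `min_len` adjacent genes that are both
    hypothetical and conserved across all species."""
    blocks: List[List[str]] = []
    current: List[str] = []
    for gene_id, hypothetical, conserved in genes:
        if hypothetical and conserved:
            current.append(gene_id)
        else:
            # The run is broken; keep it only if it is long enough.
            if len(current) >= min_len:
                blocks.append(current)
            current = []
    # Handle a run that extends to the end of the list.
    if len(current) >= min_len:
        blocks.append(current)
    return blocks

example = [
    ("G001", True, True),
    ("G002", True, True),
    ("G003", False, True),   # annotated gene: breaks the run
    ("G004", True, True),
    ("G005", True, True),
    ("G006", True, True),
    ("G007", True, False),   # not conserved: breaks the run
]
blocks = find_conserved_blocks(example)
print(blocks)
```

Raising `min_len` to six would select only the larger groups, analogous to the groups of more than five genes listed above.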

Table 8 Hypothetical genes in S. coelicolor conserved as a group in the four species analyzed

Conservation of genes involved in secondary metabolism and similar functions

Genes that are involved in secondary metabolism and antibiotic production are widely distributed in the Streptomyces and many, if not most, may have been involved in horizontal transfer. However, there is significant similarity between genes involved in similar pathways and thus significant cross-hybridization may occur between similar metabolic pathways. A large number of genes are also involved in secondary metabolism (165) and polyketide synthesis (102) in the S. coelicolor genome. These are grouped together in 23 pathway groups and are displayed in Supplementary Fig. 3. Genes identified as secondary metabolic genes but existing on their own, and not in a group of secondary metabolic genes, have been eliminated to simplify the analysis, leaving only genes involved in these functions that occur in groups of two or more. These include specific pathways producing secondary metabolic products, such as the melanin, actinorhodin, CDA and Red pathways. Many of the other potential pathways have not been studied in detail and the functions of these genes are unknown. Because of evolutionary similarity, the presence of genes hybridizing to a particular pathway does not mean that the specific pathway is present, but possibly that a related one is present. Similarly, a high level of hybridization can mean either a very close relationship between the pathways in the two species or the presence of multiple copies of related pathways. In general terms, S. maritimus shows the greatest absence of secondary metabolic pathways that are present in S. coelicolor. Interestingly, S. cattleya and K. aureofaciens seem to have pathways related to many of the S. coelicolor secondary metabolic pathways present in their genomes, although they are phylogenetically more distant than S. maritimus. The actinorhodin pathway would seem to be absent from S. avermitilis (as expected from the genome data), S. cattleya and S. maritimus, although some related genes do seem to be present in K. aureofaciens. The WhiE pathway is conserved in all species, but with some genes showing a very low level of hybridization in certain cases; these include whiE protein VII and the acyl carrier protein. Genes from the Red pathway show varying levels of hybridization, suggesting that distantly related pathways may be present in these species. The CDA pathway is conserved in all four species and in certain cases the genes seem to be over-represented, suggesting multiple examples of the same type of pathway in S. cattleya and S. maritimus. The presence of similar pathways at a level of about 50% for K. aureofaciens supports the well-established idea that horizontal gene transfer of secondary metabolic pathways may have played a significant role in the evolution of the Streptomyces and any related genus.

Because the natural environment of Streptomyces is the soil, they are thought to play an important role in the recycling of lignocellulose material. However, there is relatively little information on which genes are involved in this process. Interestingly, melC1 and melC2, which encode tyrosinase (monophenol monooxygenase, SCO2700) and its cofactor (SCO2701) (Leu et al. 1992), are conserved across the three Streptomyces species and probably also Kitasatospora (SCO2700 −0.76, SCO2701 0.08). On the other hand, the duplicate MelD1 (SCO2701) and MelD2 (SCO2700) genes found in S. coelicolor are not conserved and are phylogenetically distinct from MelC1 and MelC2 found in other Streptomyces (unpublished results). This perhaps represents a divergence of function between these two gene pairs. S. coelicolor does not produce a detectable amount of black melanin pigment, and these results suggest that these enzymes may be involved in the metabolic conversion of lignocellulose byproducts rather than pigment formation. Evolutionary conservation of these genes to serve this function under particular conditions of induction would make more sense than retention of inducible black pigment formation. Other enzymes with a possible role in the lignocellulose cycle that are conserved across the species are shown in Table 9. These include many oxygenases that may have a role in producing oxygen radicals capable of attacking lignin, genes involved in the sensing and breakdown of hydrogen peroxide, cellulose metabolism genes, cellobiose metabolism genes, etc. Those found in the terminal regions may represent gene groups that are not conserved in a syntenous manner and are subject to horizontal gene transfer, while those within the core and intermediate regions may be part of the basic group of genes essential to Streptomyces in the soil environment. Lignocellulose degradation is a difficult topic to study in the Actinomycetales and these candidate genes may therefore help to solve some of the associated problems.

Table 9 Genes conserved in the four Streptomyces species that are potentially involved in lignocellulose cycling

Conclusions

This study confirms that within the Streptomyces analyzed here there is conservation of a core set of genes in the middle of the linear S. coelicolor/S. avermitilis chromosome structure. This is associated with a much higher diversity of genes in the terminal regions of the linear chromosome. Linking these regions are two intermediate regions where there seems to be conservation of genus-specific genes and gene clusters. Based on a classification of genes into conserved and non-conserved groups, this study also identifies candidate genes that may be involved in terminal replication and other functions related to myceliate growth. It also provides insights into which genes play a more significant role in the biochemical network of S. coelicolor, Streptomyces and myceliate Actinobacteria in general. Finally, the degree of gene conservation detected between the four species implies that the genome model of S. coelicolor may extend well beyond the borders of the Streptomyces. It includes at least one Kitasatospora species; furthermore, a similar structure has been found by microarray analysis for Saccharomonospora viridis and Streptosporangium roseum, but not Streptomyces rimosus ATCC10970 (unpublished data). Thus, the microarray approach to genome content analysis and exploration of genome evolution may be fairly widely applicable in the various Actinomycete genera close to Streptomyces that undergo complex morphogenesis.