Spatial Patterns of Gene Expression in Bacterial Genomes

Gene expression in bacteria is a tightly controlled and intricate process influenced by many factors. One such factor is the position of a gene within the bacterial genome. Genes located near the origin of replication generally have higher expression levels and increased dosage, and are often more conserved than genes located farther from the origin of replication. Most of the studies reporting these findings observed this phenomenon only in a single gene or cluster of genes that was relocated to pre-determined positions within a bacterial genome. In this work, we look at the overall expression levels from eleven bacterial data sets from Escherichia coli, Bacillus subtilis, Streptomyces, and Sinorhizobium meliloti. We have confirmed that gene expression tends to decrease when moving away from the origin of replication in the majority of the replicons analysed in this study. This study sheds light on the impact of genomic location on molecular trends such as gene expression and highlights the importance of accounting for spatial trends in bacterial molecular analysis. Electronic Supplementary Material: The online version of this article (10.1007/s00239-020-09951-3) contains supplementary material, which is available to authorized users.


Correlation of Gene Expression Over Datasets
To assess whether expression was consistent across data sets for bacteria with multiple data sets, we examined the mean normalized expression values. For every data set with multiple replicates, the replicates were combined by finding the median normalized CPM expression value for each gene. For each gene $i$, the mean normalized expression value $\bar{x}_i$ was calculated across all data sets, where $x_{ij}$ is the normalized median expression value of gene $i$ in data set $j$. The absolute difference between each data set's value and the mean across all data sets was then computed ($|x_{ij} - \bar{x}_i|$). The distributions of $|x_{ij} - \bar{x}_i|$ across all genes are found in Figures S1 and S2. All data sets are well mixed, implying that the expression levels are consistent across all data sets. Only E. coli and B. subtilis had multiple expression data sets available, so they are the only ones that were analyzed.
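The deviation calculation described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' script: the function names, the dictionary layout (gene name mapped to normalized CPM), and the toy values are all assumptions made for the example.

```python
from statistics import mean, median


def combine_replicates(replicates):
    """Median normalized CPM per gene across the replicates of one data set.

    `replicates` is a list of dicts mapping gene name -> normalized CPM
    (a hypothetical layout chosen for this sketch).
    """
    genes = replicates[0].keys()
    return {gene: median(rep[gene] for rep in replicates) for gene in genes}


def absolute_deviations(datasets):
    """For each gene i, return [|x_ij - mean_i| for every data set j]."""
    shared_genes = set.intersection(*(set(d) for d in datasets))
    deviations = {}
    for gene in sorted(shared_genes):
        values = [d[gene] for d in datasets]   # x_ij for j = 1..n
        gene_mean = mean(values)               # mean across data sets
        deviations[gene] = [abs(v - gene_mean) for v in values]
    return deviations


# Toy values, invented for illustration only.
set_a = combine_replicates([
    {"geneA": 10.0, "geneB": 2.0},
    {"geneA": 14.0, "geneB": 4.0},
])
set_b = combine_replicates([{"geneA": 8.0, "geneB": 5.0}])
deviations = absolute_deviations([set_a, set_b])
```

Plotting the distribution of these deviations per data set would give the kind of comparison that Figures S1 and S2 report.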
Streptomyces and all replicons of S. meliloti had only one data set each and therefore were not analyzed.

Table S5: Linear regression analysis of total added expression and distance from the origin of replication. The total added expression values were calculated by summing the counts per million expression values in each 10Kbp section of the genome. Linear regression was calculated after the origin of replication was moved to the beginning of the genome and all subsequent positions were scaled around the origin, accounting for bidirectional replication. NS indicates Not Significant at P ≤ 0.05. A grey row indicates a significant negative trend.
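As a rough illustration of the procedure summarized in the Table S5 caption, the sketch below rescales positions around the origin, sums expression per 10Kbp section, and fits a least-squares line. It makes simplifying assumptions that are not taken from the paper: the replicon is circular, the terminus lies directly opposite the origin, and no significance test is computed. All function names and values are hypothetical.

```python
from collections import defaultdict


def bidirectional_distance(position, origin, genome_length):
    """Distance from the origin along the nearer replichore.

    Assumes a circular replicon with the terminus opposite the origin.
    """
    shifted = (position - origin) % genome_length  # origin moved to position 0
    return min(shifted, genome_length - shifted)


def summed_expression_per_bin(genes, origin, genome_length, bin_size=10_000):
    """Sum CPM values within each `bin_size` section of origin-scaled distance."""
    totals = defaultdict(float)
    for position, cpm in genes:
        distance = bidirectional_distance(position, origin, genome_length)
        totals[distance // bin_size * bin_size] += cpm
    return dict(sorted(totals.items()))


def least_squares(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Toy replicon: (gene position in bp, CPM), invented for illustration.
genome_length = 100_000
origin = 90_000
genes = [(91_000, 50.0), (89_000, 40.0), (5_000, 30.0),
         (75_000, 25.0), (30_000, 10.0), (50_000, 5.0)]

binned = summed_expression_per_bin(genes, origin, genome_length)
slope, intercept = least_squares(list(binned.keys()), list(binned.values()))
```

A negative slope in this toy example corresponds to expression decreasing with distance from the origin; assessing significance would require a proper regression routine, such as `lm` in R.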

Leading and Lagging Strand
A two-sample Wilcoxon test was computed to compare the expression of genes on the leading strand and the lagging strand. We found no significant difference between gene expression on the leading and lagging strands of any of the bacterial replicons.
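For readers unfamiliar with the test, the sketch below implements a two-sample Wilcoxon rank-sum (Mann-Whitney U) comparison in plain Python using the normal approximation. It is an illustration only: it omits the tie and continuity corrections, so its P values will differ slightly from those of a statistics package such as R's `wilcox.test`, and the expression values are invented.

```python
from math import erfc, sqrt


def rank_with_ties(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def wilcoxon_rank_sum(sample_a, sample_b):
    """Return (U statistic for sample_a, two-sided P value).

    Uses the normal approximation without tie or continuity correction.
    """
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = rank_with_ties(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u - mean_u) / sd_u
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail of the standard normal
    return u, p_value


# Invented expression values for genes on each strand.
leading = [5.1, 6.3, 4.8, 7.0, 5.9]
lagging = [5.0, 6.1, 5.2, 6.8, 5.5]
u_stat, p_value = wilcoxon_rank_sum(leading, lagging)
```

With these toy values the two samples overlap heavily, so the P value is well above 0.05, mirroring the kind of non-significant result reported above.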

COG Analysis
A supplementary analysis of the spatial distribution of COG categories for each bacterial replicon was performed. For a full list of COG categories, please refer to Table S7.
This supplementary analysis shows that no COG category universally increases or decreases with distance from the origin across the bacterial replicons in this analysis.

COG Data
Whole genomes of different strains and species of E. coli, B. subtilis, Streptomyces and S. meliloti were downloaded (Table S8). For multi-replicon bacteria, the analysis was performed on each replicon separately, so each of the S. meliloti replicons was analyzed on its own. The COG database information was downloaded on February 27, 2017 and spans the years 2003-2014. These data can be found on GitHub (https://github.com/dlato/Spatial_Patterns_of_Gene_Expression.git). The only available data in the COG database for Streptomyces was for Streptomyces bingchenggensis and not S. coelicolor. We were therefore limited to using the annotation for Streptomyces bingchenggensis.
Using simple Python scripts, the COG protein ID and functional category were obtained for each known protein of each bacterial replicon in this analysis. This information was combined with the GenBank accession number and protein genome location to obtain the functional category of each protein and its midpoint location in the genome. The midpoint of each protein was calculated as the single point halfway between the start and the end of the protein. This was done to simplify the statistical calculations used to verify the spatial trends of each COG category.
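The midpoint step can be illustrated with a short Python sketch. It is not the authors' script: the protein IDs, coordinates, and record layout are hypothetical, and it assumes that a protein assigned to several COG categories is annotated with a multi-letter string such as "EG".

```python
def protein_midpoint(start, end):
    """Single point halfway between the start and end coordinates (bp)."""
    return (start + end) / 2


def annotate_proteins(cog_categories, locations):
    """Join COG categories with genome locations, keeping annotated proteins only.

    `cog_categories` maps protein ID -> category letters (assumed format);
    `locations` maps protein ID -> (start, end) in bp.
    """
    records = []
    for protein_id, (start, end) in sorted(locations.items()):
        letters = cog_categories.get(protein_id)
        if letters is None:
            continue  # protein has no COG annotation in this toy table
        records.append({
            "protein_id": protein_id,
            "categories": list(letters),  # one entry per COG category
            "midpoint": protein_midpoint(start, end),
        })
    return records


# Hypothetical proteins and coordinates.
cog_categories = {"prot_1": "J", "prot_2": "EG"}
locations = {"prot_1": (100, 400), "prot_2": (1_000, 2_501), "prot_3": (5_000, 5_300)}
records = annotate_proteins(cog_categories, locations)
```

Reducing each protein to one coordinate means every protein contributes a single data point per category in the downstream regression.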
The locations of the origin and terminus of replication, and the bidirectional nature of bacterial replication, were accounted for using the same methods as in the gene expression analysis. See the main paper, Spatial Patterns of Gene Expression in Bacterial Genomes, for detailed methods. To determine whether each COG category increased or decreased with increasing distance from the origin, a logistic regression was performed on each COG category for each replicon. Each protein was considered present (1) or absent (0) in each COG category. Proteins that were classified under more than one COG category had a present (1) data point for each of those categories. The binary nature of the COG data allowed a simple logistic regression to be performed for each COG category using R. Logistic regression results are found in Table S9.
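The presence/absence encoding and the per-category regression can be sketched as follows. The paper's analysis was run in R; this Python version is only an illustration that fits a one-predictor logistic model by Newton-Raphson. The function names, the choice to express distance in Mbp, and the toy records are assumptions.

```python
from math import exp


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + exp(-z))
    e = exp(z)
    return e / (1.0 + e)


def fit_logistic(xs, ys, iterations=25, tolerance=1e-10):
    """Fit P(y = 1) = sigmoid(b0 + b1 * x) by Newton-Raphson; return (b0, b1)."""
    b0 = b1 = 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b0 + b1 * x)
            w = p * (1.0 - p)
            g0 += y - p            # gradient of the log-likelihood
            g1 += (y - p) * x
            h00 += w               # entries of X'WX
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        if det < 1e-12:
            break                  # ill-conditioned; stop rather than divide
        step0 = (h11 * g0 - h01 * g1) / det
        step1 = (h00 * g1 - h01 * g0) / det
        b0 += step0
        b1 += step1
        if abs(step0) + abs(step1) < tolerance:
            break
    return b0, b1


def presence_vector(records, category):
    """1 if the protein belongs to `category`, else 0."""
    return [1 if category in cats else 0 for _, cats in records]


# Toy records: (distance from origin in Mbp, COG categories), invented values.
records = [(0.1, ["J"]), (0.2, ["J", "K"]), (0.3, ["E"]), (0.4, ["J"]),
           (0.5, ["J"]), (0.6, ["G"]), (0.7, ["E"]), (0.8, ["J", "E"]),
           (0.9, ["K"]), (1.0, ["G"])]
distances = [d for d, _ in records]
intercept, slope = fit_logistic(distances, presence_vector(records, "J"))
```

In this toy example the slope for category J is negative, meaning the probability of a protein belonging to J falls with distance from the origin. A protein in several categories, such as the one at 0.8 Mbp, contributes a 1 to the fit for each of its categories.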
A visualization of the proportional distribution of the COG categories for each replicon can be seen in the supplementary histogram figures.

Figure S7: Histogram of COG categories across pSymA of S. meliloti. Axes: Position in Genome (bp) and % of COG Categories. Bidirectional distance from the origin of replication is along the x-axis. Each bar represents a 50Kbp segment of the genome. The grey graph represents the total number of genes in each 50Kbp section of the genome. The colourful graph represents the percentage of COG categories in each 50Kbp section of the genome. The full name for each COG category can be found in Table S7.

Figure S8: Histogram of COG categories across pSymA of S. meliloti. Axes: Position in Genome (bp) and % of COG Categories. Bidirectional distance from the origin of replication is along the x-axis. Each bar represents a 50Kbp segment of the genome. The grey graph represents the total number of genes in each 50Kbp section of the genome. The colourful graph represents the percentage of COG categories in each 50Kbp section of the genome. The full name for each COG category can be found in Table S7.