The Trithorax group protein ASH1 requires a combination of BAH domain and AT hooks, but not the SET domain, for mitotic chromatin binding and survival

The Drosophila Trithorax group (TrxG) protein ASH1 remains associated with mitotic chromatin through mechanisms that are poorly understood. ASH1 dimethylates histone H3 at lysine 36 via its SET domain. Here, we identify domains of the TrxG protein ASH1 that are required for mitotic chromatin attachment in living Drosophila. Quantitative live imaging demonstrates that ASH1 requires AT hooks and the BAH domain but not the SET domain for full chromatin binding in metaphase, and that none of these domains are essential for interphase binding. Genetic experiments show that disruptions of the AT hooks and the BAH domain together, but not deletion of the SET domain alone, are lethal. Transcriptional profiling demonstrates that intact ASH1 AT hooks and the BAH domain are required to maintain expression levels of a specific set of genes, including several involved in cell identity and survival. This study identifies in vivo roles for specific ASH1 domains in mitotic binding, gene regulation, and survival that are distinct from its functions as a histone methyltransferase. Supplementary Information The online version contains supplementary material available at 10.1007/s00412-021-00762-z.


Analysis work flow
The main steps of the analysis of gene expression data are presented in Figure 1. The first step is aligning the transcriptome sequence reads to the reference genome and counting the number of reads for each gene (Section 3). This is followed by the general data inspection (Section 4) during which the general data quality and sample relations are being studied using various measures and visualizations. If low quality samples are detected they may be excluded from further analysis at this point. The data are also normalized (Section 3) to reduce systematic noise caused by non-biological sources and to improve the comparability of the samples.
After the preprocessing, statistical testing is performed between the compared sample groups (Section 5). The results from the testing are used to filter the so called DE or differentially expressed genes. The filtering is based on the statistical significance and the size of the difference in the mean expression levels between the sample groups (Section 5.1). Annotated result files for all genes and for the filtered gene lists are included in the results (Section 7).
Functional analysis is a term used for describing all of the advanced analyses where the information related to the gene functions is taken into account during the analysis (Section 6). These analyses can be run for the filtered DE gene lists or the whole data to detect enrichments in various functional categories. One functional analysis for both filtered DE gene lists and the whole data has been performed in this data analysis. Input files and usage instructions for two other commonly used enrichment analysis tools have also been included in the report (see 6.3 for more information). The usage of these additional functional analysis tools requires good knowledge of biological background of the experimental set up and thus the purpose of this report is to provide comprehensive support and guidance for the independent use of these tools.
Various statistical methods are used for the analysis of gene expression profiling data. These methods and the related technology will be thoroughly explained in the report so that the results will be interpretable even without deep knowledge in statistics.
All the analyses have been performed with R language and environment for statistical computing [7]. Many of the analyses have been performed using R related Bioconductor module [1].

Samples
General description of each sample group is given in the table below.

Group
Number of Group  name  replicates  definition  WT  3  wild type  nAT  3  nAT rescue  dBAH  3  dBAH rescue  dBAHmut1 3 dBAH mut 1 rescue 3 Data preprocessing and normalization

HiSeq sequencing run information
The samples were sequenced with the HiSeq2500 instrument using single-end sequencing with 50bp read length. The analysed samples were sequenced in three separate runs with differing amounts of samples multiplexed on one lane.

Aligning reads to the reference genome
The reads obtained from the instrument were base called using the instrument manufacturer's base calling software. The reads were aligned against the Drosophila melanogaster reference genome (BDGP5.25 assembly) using TopHat version 2.0.9 [4]. Summary of the mapping statistics is provided in

Calculating normalized gene counts
After alignment, the reads were associated with known genes based on annotations derived from Ensembl and the number of reads aligned within each gene was counted using HTSeq tool version 0.5.4p3.
The data are normalised to remove variation between samples caused by non-biological reasons and to make the values comparable across the sample set. Here the counts were normalised using the TMM normalisation method of the edgeR R/Bioconductor package (R version 3.0.1, Bioconductor version 2.12). The method takes the variable number of total reads across samples into account by calculating specific scaling factors between the samples.

Files for genome browsers
Various different genome browsers can be used for viewing the aligned count data in genomic context with a variety of different annotations. Figure 2 shows an example of the data viewed with IGV Genome Browser (IGV (Integrative Genome Viewer) Genome Browser at the Broad Institute: http://www.broadinstitute.org/igv/)

Figure 2: Snapshot from IGV Genome Browser
The *.bam and *.bai files needed for the genome browser are delivered separately by request since these files are very large. IGV genome browser is easy to use and comes with good manuals and tutorials. After choosing the correct genome, the data can be downloaded from File -> 'Load from file...' and choosing the appropriate *.bam and *.bai files.

Data quality control
The general quality of the samples and their relationships will be investigated in the following sections. Figure 3 visualizes the expression value distribution across the sample set. Table 3 shows the minimum, median, mode, mean and maximum expression values of the normalised samples. The mode can be seen as a rough measure of the background since most of the genes in the genome are not expressed in most experiments.  The expression value distributions across the count values of all samples can be visualized by box plots. In the Figure 5 the asinh-transformed expression value distributions of the samples are visualized as box plots before and after normalization. The structure of a box plot is described in Figure 4.

Expression values
Large variations in the sample expression value distributions may indicate quality problems.
In these data there are some differences which is quite typical. This variation is compensated by the means of normalization after which the expression value distributions of the samples are almost equal. Normalization brings the expression values in to the same register so that the samples can be compared to each other.

Sample relations
The sample relationships will be studied in the following sections by the means of correlation and cluster analysis.

Correlations
Between sample correlation values describe the similarity between the samples in a general level, when all measurement features of all samples are taken into consideration. In this analysis the so called Spearman's metrics is used which describes the between sample similarity on a scale of 0-1. Value 0 means perfect uncorrelation between the samples where as value 1 means perfect correlation between them. As a rule of thumb, correlation of at least 0.95 is usually expected from samples that are biological replicates.
The correlation values between all possible pairs of samples are visualized in Figure 6. Correlations between samples vary between 0.937 and 0.973. This depicts a high correlation between the samples. Groupwise minimum and mean correlations are given in the

Hierarchical clustering
In hierarchical clustering the samples are grouped according to their general similarity when all the measurements of all the samples are taken into consideration. In this analysis the samples were clustered with Euclidean metrics.
The result of the cluster analysis can be visualised as a dendrogram which is an out branching graph where the most similar samples (in another words best correlating) can be found in the branches that are nearest to one another.
The dendrogram produced by cluster analysis is shown for normalized data in Figure 7. In the hierarchical clustering, the WT samples group into a cluster of their own. The other samples group according to the sequencing runs. This is quite expected when clustering whole data and as the correlation values are high, the overall differences between the samples are likely quite small.

PCA
The sample relations can also be studied by the means of a Principal Component Analysis (PCA) which is an ordination technique complementary to clustering. Ordination orders objects so that similar objects are placed near each other and dissimilar objects are placed further from each other.
In PCA analysis the sample relationships can be visualized in three dimensional space. Figure 8 shows the PCA analysis for all samples. The samples group similarly in the PCA plot as they did in the clustering dendrogram.

Summary of the sample quality inspections
Sample correlation values are high and technically the sample data is of high quality.
In the hierarchical clustering the WT sample group clustered in a group of its own and the other samples clustered according to the sequencing run. The PCA gave similar results as the hierarchical clustering. The correlation values were high and therefore the differences between the samples are likely to be small.

Group comparisons
The following comparisons were performed to detect differentially expressed genes between groups: • nAT rescue vs. WT • dBAH rescue vs. nAT rescue • dBAH mut 1 rescue vs. dBAH rescue • dBAH rescue vs. WT • dBAH mut 1 rescue vs. WT • dBAH mut 1 rescue vs. nAT rescue R package Limma [8] was used for performing the statistical testing. More information on the package can also be found in appendix B.

Filtering parameters
When filtering up-and down-regulated (i.e. differentially expressed = DE ) genes between certain conditions (groups) fold changes and p-values (or corrected p-values) calculated during statistical testing are used as filtering criteria.
All of the measured genes are filtered to list those that show the strongest evidence for being differentially expressed between the compared groups.

Short descriptions of fold change and p-values:
FC/logFC Fold change describes the size of the difference in gene expression between the compared groups. In this analysis it results from linear modeling process performed with Limma package (see Appendix B). Fold changes are often expressed as log2-transformed, where value 0 means 'no change' and 1 means doubled value and -1 means halved value. The values are always in relation to the group used as a base level group (reference).
p-value P-value describes the reliability of the change in expression value between the compared groups. Better (i.e. smaller) p-value is given for those genes that show homogeneous behaviour inside each group and yet clearly differ between the compared groups. In this analysis the P-values used for filtering can be either so called modified t-test p-values or FDR (false-discovery-rate) p-values which are both produced by Limma (see B). Modified t-test p-values are not corrected for multiple testing. FDR p-values are used to control the rate of false positive findings in the result list and have been generally found to perform better than traditional p-values.

Choosing thresholds for filtering
The choice of the thresholds for p-value and fold change used for filtering the differentially expressed (DE) genes is not a trivial task. There is no one correct way or method to determine the thresholds but the choice is based on different aspects of each study. Different thresholds can also be used for filtering the data for different purposes. For example, often very strict thresholds are chosen when the data is filtered to be included in a publication. Then the result list will contain very few false positive findings but on the other hand many true positives are left outside the result set. Because of this it is typically useful to use less stringent thresholds for filtering data for internal research purposes or functional analysis when a larger proportion of possible false positive findings can be tolerated. Cluster analysis of the filtered genes can also be used as a means for choosing the filtering thresholds: such thresholds should be chosen, that the samples are grouping according to the known sample groups in the cluster analysis of the filtered genes.
Thresholds used in filtering the DE genes and the corresponding numbers of results: The tables below shows the numbers of filtered genes and thresholds used for each comparison. The thresholds were chosen to list a reasonable number of the most differentially expressed genes. With the chosen thresholds the filtered genes clustered according to the sample groups giving evidence of condition dependent behavior of the selected genes.   Result files for each comparison 'annotatedData<comparisonName>.xls' include results from the statistical testing as well as the annotations for all targets on the array. These files can be used to check results for genes not appearing in the lists of differentially expressed targets.
In these files only the group mean intensities and standard deviations are reported. The normalised intensities for individual samples can be found from the file 'NormalizedData.xls' file by the corresponding target identifiers. The unnormalised intensities can be found in the file 'RawData.xls'.
The results are organised by average ranking (based both on p-value and fold change), in decreasing order of significance. As p-values alone ignore the size of the gene expression change and fold changes alone ignore the consistency of the changes, ordering the genes in the result files based only on one of these two is not the best approach. Thus we have developed a method based on average ranking for ordering the genes which is more likely to capture the true biological order of importance. The method works as follows: First the genes inside each comparison are ranked independently based on p-values (increasing order) and absolute fold changes (decreasing order). Next the average ranks for these two are calculated. As these values do not follow the original value range, these average ranks are further ranked again to get the ranks from 1 to the number of genes, rank of 1 meaning the highest possibble rank (most differentially expressed gene based both on p-value and fold change). In tie situations, the order of the results is randomized.

Visualization of the DE features
Volcano plot(s) The following figure(s) show the results of the comparison(s) as Volcano plot(s).
In a Volcano plot the log10 of the p-values is on the y-axis and the logFC calculated for the comparison group vs. base level group is on the x-axis. In this plot it can be seen how the reliability values of the measurement features behave in relation to the fold change. The thresholds used in filtering are marked in the plot with tickled lines, up-regulated genes are coloured red and down-regulated genes green.

Heatmap(s)
The following figure(s) show the heat map clustering of the differentially expressed features for the comparison(s). Pearson's metrics has been used in hierarchical clustering of the samples and filtered features. The clustering is based on the general expression measurement similarity. In the plot red colour means high expression and green low expression. Each row represents one DE feature and each column represents one sample.