Modeling the relationship of diverse genomic signatures to gene expression levels with the regulation of long-range enhancer-promoter interactions

Enhancer–promoter (E–P) interaction is an essential component of cis-regulatory regulation for gene expression. However, to comprehensively study the gene expression with the regulation of long-range E–P interactions is a major challenge in the regulatory networks. As these types of gene expression are regulated by diverse genomic signatures, we presented a computational method to study the relationships between gene expression levels and diverse genomic signatures. In this paper, based on the datasets of long-range E–P interactions, we extracted feature parameters from multiple signatures ( e.g. , epigenetic marks, transcription factors) and used regression models to predict the gene expression levels. In our results, we found that the predicted expression values correlated well with the measured expression values in both the interacting and non-interacting sets, and the correlation values of the interacting set were higher than that of the corresponding non-interacting set in each cell line, which indicated that the distal enhancers would cooperate with diverse genomic signatures to facilitate the expression level of target genes. By comparing the important signature features for the gene expression levels between the interacting and non-interacting sets in the same cell line, we found that the important speciﬁc signatures affect the gene expression regulated by distal enhancers. Our research provided additional insights about the roles of diverse signatures in gene expression with the regulation of distal enhancers.


INTRODUCTION
In various processes of eukaryotic cells, numerous DNA reactions are regulated by transcriptional promoters, enhancers and insulators (Bejerano et al. 2004;The ENCODE Project Consortium 2007). Among these cisregulatory elements, the enhancers can be away from their target genes over tens of kilobases distances (Miele and Dekker 2008;de Laat and Duboule 2013). It was reported that the enhancers can act as a key role in the regulation of tissue-specific gene expression (Visel et al. 2009a, b), and many specific enhancers would take part in the regulation of several human diseases (Williamson et al. 2011). Thus, elucidating the mechanism that how an enhancer selects its target promoter(s) and regulates gene expression is a major challenge so far.
Recently, several types of experimental techniques enable us to study the E-P interactions, including fluorescence in situ hybridization (FISH) (Williamson et al. 2014), chromosome conformation capture (3C) and other 3C-based techniques (e.g., 4C, 5C, Hi-C) (Dekker et al. 2002;Sanyal et al. 2012;Rao et al. 2014;van de Werken et al. 2012). Generally, the 3C-based techniques determine cross-linked interactions in a cell where a pair of sequence fragments are close enough in threedimensional (3D) space, and these experimental techniques are expensive and time consuming. In the past years, based on the availability of sequence, epigenetics and other genomic data, computational approaches for understanding the interactions between enhancers and promoters have made a significant advance, such as PreSTIGE (Corradin et al. 2013), IM-PET (He et al. 2014), HIPPIE (Hwang et al. 2015), RIPPLE (Roy et al. 2015), Target Finder (Whalen et al. 2016) and our group's works (Feng and Li 2017;Feng et al. 2018). Furthermore, many deep sequencing data profiles of transcription factors and epigenetic marks were generated from different sequencing techniques. Thus, the omics data enable us to study the relationships between related features and gene expression levels, which offered insights into the roles of the regulatory proteins, RNAs and other factors in the regulation of gene expression on gene expression (Budden et al. 2015;Cheng et al. 2011;Cheng and Gerstein 2012;Karlić et al. 2010;Mcleay et al. 2012;Ouyang et al. 2009). However, for the regulation of gene expression related to distal enhancers, there is still much work to do.
In this study, based on the mass-sequencing data, we mainly aim to develop computational models to study the relationship between diverse genomic signatures and gene expression levels with the regulation of distal enhancers. Thus, we extracted feature parameters from large numbers of epigenetic marks and other signatures including histone modifications, transcription factors, DNase I, enhancer RNAs (eRNAs), DNA methylation and nucleosome occupancy. Then, by integrating the combined features and ensemble classification strategy, we used Random Forest (RF), gradient boosting machine (GBM) and support vector regression (SVR) algorithms to predict the gene expression levels with the regulation of distal enhancers. Meanwhile, we studied the different types of important features for the gene expression associated with long-range E-P interactions and their location-specificity differences between the positive and negative sets, which would be beneficial to understand the distal regulatory mechanisms deeply.

RESULTS AND DISCUSSION
Modeling gene expression levels with the regulation of distal enhancers by diverse genomic features We aimed to explore the quantitative relationships between diverse genomic signatures and gene expression levels in the long-range E-P interactions datasets. In this paper, we constructed three computational models (i.e., the SVR, RF and GBM algorithms) to predict the gene expression levels. In the prediction models, we incorporated the signature features of different regulatory regions (enhancers, promoters and loop regions) together from diverse genomic signatures (i.e., histone modifications, transcription factors, DNase I, FAIRE-seq, DNA methylation, nucleosome position and eRNAs). Using the combined features for SVR, RF and GBM models, we showed the prediction results of 10-fold cross-validation test in Table 1. Because of the advantages of ensemble learning algorithm in big data mining, the RF and GBM models obtained better performance results than that of SVR model in different cell lines. It showed that the best R 2 values of the interacting sets measured up to 0.7636, 0.8293, 0.9085 and 0.9021 in Gm12878 (B-lymphocyte), H1hesc (embryonic stem cells), Hela (cervical carcinoma) and K562 (leukemia) cell lines, respectively. However, the best R 2 values of the non-interacting sets were 8.23%, 24.45%, 14.05%, 16.48% lower in the same corresponding cell lines, respectively. Moreover, the above tendency was similar in the three models.
Based on the above results, we further discussed the Pearson correlation coefficient (PCC) between the predicted expression values and experimental measured levels in the RF model. In Fig. 1, we showed the results of interacting and non-interacting sets in four cell lines, respectively. In our observation, the PCC values of the interacting sets measured up to 0.8736, 0.9052, 0.9532 and 0.9498 in Gm12878, H1hesc, Hela and K562 cell lines, respectively. In contrast, the corresponding values of the non-interacting sets in above cell lines decreased to 0.8521, 0.7583, 8764 and 0.8587, respectively. Although the predicted expression values correlated well with the measured expression values in both the interacting and non-interacting sets, the PCC values of the interacting set were higher than that of the corresponding non-interacting set in each cell line, which indicated that the distal enhancers would cooperate with diverse genomic signatures to facilitate the expression level of target genes.

The contributions of diverse signatures to gene expression levels with the regulation of distal enhancers
To quantify the relative contributions of diverse signatures to gene expression levels, according to the previous works (Feng et al. 2018;Roy et al. 2015), we used the ''%IncMSE'' error in RF model to compute the relative importance of diverse variables in the prediction models of gene expression levels. One feature with the higher ''%IncMSE'' value contributes more to the prediction model. Here, we presented the importance scores of the top 50 variables of interacting and noninteracting sets in each cell line (Hela and K562 in Fig. 2; Gm12878 and H1hesc in Fig. S1).
In the prediction of gene expression levels regulated by distal enhancers, multiple types of signatures were identified as top-ranked variables, including histone modifications, DNase I, transcription factors, nucleosome occupancy, eRNAs and DNA methylation. Especially in promoter regions, several ones tended to appear robustly in the top-ranked positions in both the interacting and non-interacting sets of multiple cell lines, such as eRNA, H3k36me3, H3k79me2, H3k27me3, H4k20me1, H3k4me3, H3k9me3, H2az, Pol2, DNase I, DNA methylation and nucleosome occupancy. Among the top variables of interacting/non-interacting data sets, the histone modification features respectively accounted for 42%/ 36%, 52%/42%, 44%/48% and 48%/42% of the total 50 features in Gm12878, H1hesc, Hela and K562 cell lines. While the transcription factor features, respectively, accounted for 38%/50%, 36%/48%, 40%/38% and 34%/40% of the total 50 features in the four cell lines. Although the distal enhancers can interact with their target promoter regions in the complex cis-regulatory system, the promoter regions always occupy the more important position in the gene expression regulation. In both of the interacting and non-interacting sets, the genomic signatures tended to be promoter-specific among the three regulatory regions. In the top 50 features, we observed that the features of promoter regions occupied the largest proportion in the interacting and non-interacting sets, which respectively accounted for 48%/60%, 44%/40%, 44%/36% and 58%/44% of the total features in Gm12878, H1hesc, Hela and K562 cell lines (Table 2). Among the above variables, there were many features with minor changes in score, and the features with great changes related to the specific gene expression regulation of distal enhancers. Thus, there would not be a higher percentage of important enhancer features in the interacting set than the non-interacting set. We observed that there was an obvious difference in the regulation of gene expression level between the histone modifications and transcription factors. For histone modifications, the features of promoter and loop regions took a large proportion of the top 50 variables of multiple cell lines. However, for transcription factors, the features of promoter and enhancer regions took a larger proportion (Blackwood and Kadonaga 1998;Pennacchio et al. 2013). Consistent with previous studies (Budden et al. 2015;Cheng et al. 2011;Cheng and Gerstein 2012;Karlić et al. 2010;Ouyang et al. 2009), the histone modifications and transcription factors in the promoter regions play an important role in the gene expression.

The changes of diverse signatures in the importance scores between interacting and non-interacting sets
Among the important features in one cell line, some signatures ranked well in both the interacting and noninteracting sets, however, their changes of importance scores between interacting and non-interacting sets were not obvious, such as H3k36me3, H3k27me3, H3k4me1, H3K79me2 (Fig. 1). To find important signatures that regulate gene expression levels with the regulation of distal enhancers, we made a study on the score changes of top-ranked features between interacting and non-interacting sets in the same cell line. We observed that some features had significant changes in score, such as H3k27me3_p, Mef2e_e, Ikzf1_e, Nfya_e, Rfx5_e, Irf4_p and eRNA_p in Gm12878 cell line, and H3k36me3_p, H3k79me2_p, H3K4me3_p, H2az_p, nucleosome_p and eRNA_w in K562 cell line (Fig. 3). Among the features with great changes, the transcription factors and histone modifications took a large proportion, and other features (e.g., DNA methylation, eRNAs) also had cell line-specific changes. In Fig. 3, we can see that in the top-ranked 30 features, the transcription factor features respectively accounted for 56.7%, 60%, 43.3% and 36.6%, and the histone modification features, respectively, accounted for 23.3%, 36.67%, 43.3% and 46.7% in Gm12878, H1hesc, Hela and K562 cell lines. These features with great changes would play a cell line-specific role in the gene expression level by the regulation of distal enhancer. According to previous research (Heintzman et al. 2009), many specific histone modifications in enhancers regions would functionally regulate the gene expression in a tissue-specific manner. The eRNA transcription requires the presence of the promoter at the same time, and eRNA levels can correlate with mRNA levels at adjacent genes (Krivega and Dean 2012). In eukaryotes, by binding to enhancer regions of DNA adjacent to regulated genes, many transcription factors can precisely regulate the expression of various genes in the right cell and at the right time. For example, cohesion Rad21 is involved in chromatin looping and plays an important role in maintaining chromosomal loops (Merkenschlager and Odom 2013; Nitzsche et al. 2011). P300 was known to bind the enhancer regions and has been applied to predict enhancers (Blow et al. 2010;Visel et al. 2009a, b). MafK transcriptional activation can be mediated by an enhancer and regulated by different GATA factors in both hematopoietic and cardiac tissues (Katsuoka et al. 2000). However, in the same cell line, there were also many top-ranked features with minor changes in the importance score between interacting and non-interacting sets, such as H3k36me3, H2az, H3k27me3, H3k4me1, H3K79me2 (Fig. 4). In our observation, the histone modifications in loop and promoter regions had a large proportion of the features. These features with minor changes would play a fundamental role in gene expression with the regulation of long-range E-P interactions in a specific cell line. For example, associated with elongating RNA polymerase II, the H3K36me3 is indicative of actively transcribed genes (Li et al. 2002). However, as clear markers of gene repression, the H3k27me3 is likely bound by other proteins to exert a repressive function (Cao et al. 2002) and H3K9me3 is a well-characterised marker for heterochromatin (Bannister et al. 2001;Lachner et al. 2001 Predicted expression

Predicted expression
Gm12878 ( Fig. 2 The importance scores of the top 50 features in the interacting and non-interacting sets in Hela and K562 cell lines. The ''%IncMSE'' error in the RF model was used to compute feature importance scores. The longer the bar, the more important in prediction model. The way to encode the features of enhancer, promoter and loop region was the same with our previous work (Feng et al. 2018) 1995), SAP30 (Huang et al. 2003), FKBP3 (Yang et al. 2001) and RYBP (García et al. 1999). Taken together, on the basis of the genomic signatures with basic biological functions, the distal enhancers would cooperate with multiple specific signatures and regulate the gene expression levels in complex regulatory networks.

CONCLUSION
The enhancers act as an important role in cell linespecific gene expression (Visel et al. 2009a, b). In the regulation, enhancers can specifically select their longrange target promoter(s) and facilitate expression levels of related genes. In the cis-regulatory mechanism of gene expression, diverse genomic signatures would affect the enhancers and cooperatively establish a complex regulatory network. Recently, previous researchers have made great effort to study the relationships between the general gene expression levels and associated genomic signatures in promoters (Budden et al. 2015;Cheng et al. 2011;Cheng and Gerstein 2012;Karlić et al. 2010;Mcleay et al. 2012;Ouyang et al. 2009). However, due to the cell line-specificity of long-range E-P interactions and the limited experimental data, the gene expression levels with the regulation of distal enhancers does not been further studied. In this paper, based on the available experimental data from multiple platforms, we developed computational   H1hesc Fig. 4 The top-ranked features with minor changes in scores between interacting and non-interacting sets in the same cell line. The bars represent the importance scores of interacting and non-interacting sets. The D values of the above signatures between the interacting and non-interacting sets were less than 10% of the corresponding positive values of diverse genomic signatures on the gene expression levels with the regulation of long-range E-P interactions.

Datasets of long-range E-P pairs
In our work, the long-range E-P pair datasets of four human cell lines came from Roy et al.'s work (2015), the four cell lines were K562, H1hesc, Hela and Gm12878. The datasets of the above four cell lines respectively contained 1754, 674, 1530 and 952 E-P pairs. Each cell line respectively contains an interacting set and a noninteracting set, and the two sets have an equal number of samples. For an arbitrary E-P pair in the datasets, the distance between the enhancer and the transcription start site was larger than 2500 bp. Because the enhancers tend to interact with their target promoters depending on their chromatin distances, the non-interacting set and interacting set in each cell line were constructed with the same distribution of chromatin distance.

The extraction of feature parameters
Tissue-specific gene expression with the regulation of long-range E-P interactions would be affected by the diverse genomic signatures. For example, DNA methylation can regulate tissue-specific gene expression, and it is related to multiple human cancers (Ball et al. 2009;Song and He 2012). The nucleosome reorganization in regulatory regions of enhancers and promoters accompany the gene activation by T cell receptor signaling (Schones et al. 2008). The eRNAs are short noncoding RNAs (Smith and Shilatifard 2014), and they would enhance or stabilize the long-range E-P interactions (Fan et al. 2013;Wang et al. 2011). In each cell line, we downloaded the available data of several genomic signatures, including DNase I, FAIRE-seq, transcription factors, histone modifications, DNA methylation, nucleosome position, eRNAs and RNA-seq. The sequencing techniques and the available data sources of corresponding signatures were shown in Table 3. For the sequencing data of above signatures, we processed data and extracted the corresponding features of enhancer, promoter and loop regions by using the same method in our previous works (Feng et al. 2018). For the two and more biological replicates, we averaged the feature values for the corresponding regions.

Predictive models for gene expression levels
In this work, we used three algorithms to predict the gene expression levels, including RF, GBM and SVR. The above prediction algorithms were implemented in R packages of 'randomForest', 'gbm' and 'e1071', respectively. For RF model, we set the parameter of n.trees = 500 (number of built trees), and selected the best mtry (number of variables randomly sampled as candidates at each split). For the SVR model, we selected the nonlinear radial basis kernel. In the GBM model, we fit the parameters: interaction.depth = 5, n.trees = 5000, shrinkage = 0.01, n.minobsinnode = 10 and used R squared to select the optimal model with the largest value. For the evaluation methods of statistical prediction, we used the 10-fold cross-validation test in the same cell line to evaluate the prediction performances. According to the previous works (Liu et al. 2016), we computed gene expression levels [log(RPKM)]. In the regression model, we calculated the standard deviation of regression model (RMSE) which was used to quantify the amount of dispersion of a set of data values. Then we calculated the R 2 (coefficient of determination) which indicates the proportion of variation of the gene expression levels that has been explained by the model (Cheng and Gerstein 2012;Everitt 2002). Finally, we computed the PCC (Stigler 1989) between the predicted expression values and experimental measured values in the testing data.

5C
Carbon copy chromosome capture conformation E-P Enhancer-promoter EPI Enhancer-promoter interaction GBM Gradient-Boosting Machine RF Random Forest RPKM The reads per kilobase of exon model per million mapped reads