INTRODUCTION

In various processes of eukaryotic cells, numerous DNA reactions are regulated by transcriptional promoters, enhancers and insulators (Bejerano et al. 2004; The ENCODE Project Consortium 2007). Among these cis-regulatory elements, the enhancers can be away from their target genes over tens of kilobases distances (Miele and Dekker 2008; de Laat and Duboule 2013). It was reported that the enhancers can act as a key role in the regulation of tissue-specific gene expression (Visel et al. 2009a, b), and many specific enhancers would take part in the regulation of several human diseases (Williamson et al. 2011). Thus, elucidating the mechanism that how an enhancer selects its target promoter(s) and regulates gene expression is a major challenge so far. Recently, several types of experimental techniques enable us to study the E–P interactions, including fluorescence in situ hybridization (FISH) (Williamson et al. 2014), chromosome conformation capture (3C) and other 3C-based techniques (e.g., 4C, 5C, Hi-C) (Dekker et al. 2002; Sanyal et al. 2012; Rao et al. 2014; van de Werken et al. 2012). Generally, the 3C-based techniques determine cross-linked interactions in a cell where a pair of sequence fragments are close enough in three-dimensional (3D) space, and these experimental techniques are expensive and time consuming. In the past years, based on the availability of sequence, epigenetics and other genomic data, computational approaches for understanding the interactions between enhancers and promoters have made a significant advance, such as PreSTIGE (Corradin et al. 2013), IM-PET (He et al. 2014), HIPPIE (Hwang et al. 2015), RIPPLE (Roy et al. 2015), Target Finder (Whalen et al. 2016) and our group’s works (Feng and Li 2017; Feng et al. 2018). Furthermore, many deep sequencing data profiles of transcription factors and epigenetic marks were generated from different sequencing techniques. Thus, the omics data enable us to study the relationships between related features and gene expression levels, which offered insights into the roles of the regulatory proteins, RNAs and other factors in the regulation of gene expression on gene expression (Budden et al. 2015; Cheng et al. 2011; Cheng and Gerstein 2012; Karlić et al. 2010; Mcleay et al. 2012; Ouyang et al. 2009). However, for the regulation of gene expression related to distal enhancers, there is still much work to do.

In this study, based on the mass-sequencing data, we mainly aim to develop computational models to study the relationship between diverse genomic signatures and gene expression levels with the regulation of distal enhancers. Thus, we extracted feature parameters from large numbers of epigenetic marks and other signatures including histone modifications, transcription factors, DNase I, enhancer RNAs (eRNAs), DNA methylation and nucleosome occupancy. Then, by integrating the combined features and ensemble classification strategy, we used Random Forest (RF), gradient boosting machine (GBM) and support vector regression (SVR) algorithms to predict the gene expression levels with the regulation of distal enhancers. Meanwhile, we studied the different types of important features for the gene expression associated with long-range E–P interactions and their location-specificity differences between the positive and negative sets, which would be beneficial to understand the distal regulatory mechanisms deeply.

RESULTS AND DISCUSSION

Modeling gene expression levels with the regulation of distal enhancers by diverse genomic features

We aimed to explore the quantitative relationships between diverse genomic signatures and gene expression levels in the long-range E–P interactions datasets. In this paper, we constructed three computational models (i.e., the SVR, RF and GBM algorithms) to predict the gene expression levels. In the prediction models, we incorporated the signature features of different regulatory regions (enhancers, promoters and loop regions) together from diverse genomic signatures (i.e., histone modifications, transcription factors, DNase I, FAIRE-seq, DNA methylation, nucleosome position and eRNAs). Using the combined features for SVR, RF and GBM models, we showed the prediction results of 10-fold cross-validation test in Table 1. Because of the advantages of ensemble learning algorithm in big data mining, the RF and GBM models obtained better performance results than that of SVR model in different cell lines. It showed that the best R2 values of the interacting sets measured up to 0.7636, 0.8293, 0.9085 and 0.9021 in Gm12878 (B-lymphocyte), H1hesc (embryonic stem cells), Hela (cervical carcinoma) and K562 (leukemia) cell lines, respectively. However, the best R2 values of the non-interacting sets were 8.23%, 24.45%, 14.05%, 16.48% lower in the same corresponding cell lines, respectively. Moreover, the above tendency was similar in the three models.

Table 1 The prediction results of the full feature sets using 10-fold cross-validation test

Based on the above results, we further discussed the Pearson correlation coefficient (PCC) between the predicted expression values and experimental measured levels in the RF model. In Fig. 1, we showed the results of interacting and non-interacting sets in four cell lines, respectively. In our observation, the PCC values of the interacting sets measured up to 0.8736, 0.9052, 0.9532 and 0.9498 in Gm12878, H1hesc, Hela and K562 cell lines, respectively. In contrast, the corresponding values of the non-interacting sets in above cell lines decreased to 0.8521, 0.7583, 8764 and 0.8587, respectively. Although the predicted expression values correlated well with the measured expression values in both the interacting and non-interacting sets, the PCC values of the interacting set were higher than that of the corresponding non-interacting set in each cell line, which indicated that the distal enhancers would cooperate with diverse genomic signatures to facilitate the expression level of target genes.

Fig. 1
figure 1

Pearson correlation of experimental gene expression levels versus predicted expression levels in RF model. The scatter plots of interacting sets and non-interacting sets were plotted in red and blue, respectively. The horizontal axis is the measured expression value, and vertical axis is the predicted expression values

The contributions of diverse signatures to gene expression levels with the regulation of distal enhancers

To quantify the relative contributions of diverse signatures to gene expression levels, according to the previous works (Feng et al. 2018; Roy et al. 2015), we used the “%IncMSE” error in RF model to compute the relative importance of diverse variables in the prediction models of gene expression levels. One feature with the higher “%IncMSE” value contributes more to the prediction model. Here, we presented the importance scores of the top 50 variables of interacting and non-interacting sets in each cell line (Hela and K562 in Fig. 2; Gm12878 and H1hesc in Fig. S1).

Fig. 2
figure 2

The importance scores of the top 50 features in the interacting and non-interacting sets in Hela and K562 cell lines. The “%IncMSE” error in the RF model was used to compute feature importance scores. The longer the bar, the more important in prediction model. The way to encode the features of enhancer, promoter and loop region was the same with our previous work (Feng et al. 2018)

In the prediction of gene expression levels regulated by distal enhancers, multiple types of signatures were identified as top-ranked variables, including histone modifications, DNase I, transcription factors, nucleosome occupancy, eRNAs and DNA methylation. Especially in promoter regions, several ones tended to appear robustly in the top-ranked positions in both the interacting and non-interacting sets of multiple cell lines, such as eRNA, H3k36me3, H3k79me2, H3k27me3, H4k20me1, H3k4me3, H3k9me3, H2az, Pol2, DNase I, DNA methylation and nucleosome occupancy. Among the top variables of interacting/non-interacting data sets, the histone modification features respectively accounted for 42%/36%, 52%/42%, 44%/48% and 48%/42% of the total 50 features in Gm12878, H1hesc, Hela and K562 cell lines. While the transcription factor features, respectively, accounted for 38%/50%, 36%/48%, 40%/38% and 34%/40% of the total 50 features in the four cell lines. Although the distal enhancers can interact with their target promoter regions in the complex cis-regulatory system, the promoter regions always occupy the more important position in the gene expression regulation. In both of the interacting and non-interacting sets, the genomic signatures tended to be promoter-specific among the three regulatory regions. In the top 50 features, we observed that the features of promoter regions occupied the largest proportion in the interacting and non-interacting sets, which respectively accounted for 48%/60%, 44%/40%, 44%/36% and 58%/44% of the total features in Gm12878, H1hesc, Hela and K562 cell lines (Table 2). Among the above variables, there were many features with minor changes in score, and the features with great changes related to the specific gene expression regulation of distal enhancers. Thus, there would not be a higher percentage of important enhancer features in the interacting set than the non-interacting set. We observed that there was an obvious difference in the regulation of gene expression level between the histone modifications and transcription factors. For histone modifications, the features of promoter and loop regions took a large proportion of the top 50 variables of multiple cell lines. However, for transcription factors, the features of promoter and enhancer regions took a larger proportion (Blackwood and Kadonaga 1998; Pennacchio et al. 2013). Consistent with previous studies (Budden et al. 2015; Cheng et al. 2011; Cheng and Gerstein 2012; Karlić et al. 2010; Ouyang et al. 2009), the histone modifications and transcription factors in the promoter regions play an important role in the gene expression.

Table 2 The proportion of the features of enhancer, promoter and loop regions in the top 50 variables

The changes of diverse signatures in the importance scores between interacting and non-interacting sets

Among the important features in one cell line, some signatures ranked well in both the interacting and non-interacting sets, however, their changes of importance scores between interacting and non-interacting sets were not obvious, such as H3k36me3, H3k27me3, H3k4me1, H3K79me2 (Fig. 1). To find important signatures that regulate gene expression levels with the regulation of distal enhancers, we made a study on the score changes of top-ranked features between interacting and non-interacting sets in the same cell line. We observed that some features had significant changes in score, such as H3k27me3_p, Mef2e_e, Ikzf1_e, Nfya_e, Rfx5_e, Irf4_p and eRNA_p in Gm12878 cell line, and H3k36me3_p, H3k79me2_p, H3K4me3_p, H2az_p, nucleosome_p and eRNA_w in K562 cell line (Fig. 3). Among the features with great changes, the transcription factors and histone modifications took a large proportion, and other features (e.g., DNA methylation, eRNAs) also had cell line-specific changes. In Fig. 3, we can see that in the top-ranked 30 features, the transcription factor features respectively accounted for 56.7%, 60%, 43.3% and 36.6%, and the histone modification features, respectively, accounted for 23.3%, 36.67%, 43.3% and 46.7% in Gm12878, H1hesc, Hela and K562 cell lines. These features with great changes would play a cell line-specific role in the gene expression level by the regulation of distal enhancer. According to previous research (Heintzman et al. 2009), many specific histone modifications in enhancers regions would functionally regulate the gene expression in a tissue-specific manner. The eRNA transcription requires the presence of the promoter at the same time, and eRNA levels can correlate with mRNA levels at adjacent genes (Krivega and Dean 2012). In eukaryotes, by binding to enhancer regions of DNA adjacent to regulated genes, many transcription factors can precisely regulate the expression of various genes in the right cell and at the right time. For example, cohesion Rad21 is involved in chromatin looping and plays an important role in maintaining chromosomal loops (Merkenschlager and Odom 2013; Nitzsche et al. 2011). P300 was known to bind the enhancer regions and has been applied to predict enhancers (Blow et al. 2010; Visel et al. 2009a, b). MafK transcriptional activation can be mediated by an enhancer and regulated by different GATA factors in both hematopoietic and cardiac tissues (Katsuoka et al. 2000).

Fig. 3
figure 3

The 30 top-ranked features with large changes in importance scores between the interacting and non-interacting sets in the same cell line. The bars represent the changed D values between the interacting and non-interacting sets. The D values above the abscissa axis indicate the larger values of interacting set, while the values below the abscissa axis indicate the larger values of non-interacting set. The longer the bar, the greater the difference between the interacting and non-interacting sets

However, in the same cell line, there were also many top-ranked features with minor changes in the importance score between interacting and non-interacting sets, such as H3k36me3, H2az, H3k27me3, H3k4me1, H3K79me2 (Fig. 4). In our observation, the histone modifications in loop and promoter regions had a large proportion of the features. These features with minor changes would play a fundamental role in gene expression with the regulation of long-range E–P interactions in a specific cell line. For example, associated with elongating RNA polymerase II, the H3K36me3 is indicative of actively transcribed genes (Li et al. 2002). However, as clear markers of gene repression, the H3k27me3 is likely bound by other proteins to exert a repressive function (Cao et al. 2002) and H3K9me3 is a well-characterised marker for heterochromatin (Bannister et al. 2001; Lachner et al. 2001). As a multiprotein complex, Pol II catalyzes the transcription of DNA to synthesize precursors of mRNA and most snRNA and microRNA (Kornberg 1999; Sims et al. 2004). Belonging to the GLI-Kruppel class of zinc finger proteins, transcription factor YY1 involved in repressing and activating a diverse number of promoters and interacts with multiple regulatory elements, such as Myc (Shrivastava et al. 1993), ATF6 (Li et al. 2000), EP300 (Lee et al. 1995), SAP30 (Huang et al. 2003), FKBP3 (Yang et al. 2001) and RYBP (García et al. 1999).

Fig. 4
figure 4

The top-ranked features with minor changes in scores between interacting and non-interacting sets in the same cell line. The bars represent the importance scores of interacting and non-interacting sets. The D values of the above signatures between the interacting and non-interacting sets were less than 10% of the corresponding positive values

Taken together, on the basis of the genomic signatures with basic biological functions, the distal enhancers would cooperate with multiple specific signatures and regulate the gene expression levels in complex regulatory networks.

CONCLUSION

The enhancers act as an important role in cell line-specific gene expression (Visel et al. 2009a, b). In the regulation, enhancers can specifically select their long-range target promoter(s) and facilitate expression levels of related genes. In the cis-regulatory mechanism of gene expression, diverse genomic signatures would affect the enhancers and cooperatively establish a complex regulatory network. Recently, previous researchers have made great effort to study the relationships between the general gene expression levels and associated genomic signatures in promoters (Budden et al. 2015; Cheng et al. 2011; Cheng and Gerstein 2012; Karlić et al. 2010; Mcleay et al. 2012; Ouyang et al. 2009). However, due to the cell line-specificity of long-range E–P interactions and the limited experimental data, the gene expression levels with the regulation of distal enhancers does not been further studied. In this paper, based on the available experimental data from multiple platforms, we developed computational methods to study the relationships between diverse genomic signatures and gene expression levels regulated by long-range E–P interactions in four human cell lines. On the basis of the long-range E–P interaction datasets from the 5C experiments (Roy et al. 2015), we extracted features from diverse genomic signatures and built three regression models of RF, GBM and SVR. First, through the results of regression models, we found that the predicted expression levels correlated well with the measured expression levels in the interacting sets, in which the best PCC values measured up to 0.8736, 0.9052, 0.9532 and 0.9498 in Gm12878, H1hesc, Hela and K562 cell lines, respectively. However, the values of the non-interacting set were lower in the same cell line, which indicated that the distal enhancers would cooperate with diverse genomic signatures to facilitate the expression levels of far target genes. Second, according to analysis of the variable importance by RF models in different cell lines, we discovered the key roles of diverse signature features in the gene expression level with long-range E–P interactions, such as eRNA, H3k36me3, H3k79me2, H3k27me3, H3k4me3, H2az, Pol2, DNase I, DNA methylation and nucleosome occupancy. For histone modifications, features of promoter and loop regions took a large proportion of the top variables of multiple cell lines. However, for transcription factors, the features of promoter and enhancer regions took a large proportion of the top variables. Finally, through the comparison of the interacting and non-interacting sets in the same cell line, we found the important genomic signatures with great changes would play specific roles in the gene expression level of the regulation of distal enhancer, such as H3k27me3_p, Mef2e_e, Ikzf1_e, Nfya_e, Rfx5_e, Irf4_p and eRNA_p in Gm12878 cell lines. In addition, we also found the important features with minor changes would play a fundamental role in gene expression with long-range E–P interactions in a specific cell line, such as H3k36me3_p, H2az_p, H3k27me3_w, H3k4me1, H3K79me2. Taken together, this work studied the roles of diverse genomic signatures on the gene expression levels with the regulation of long-range E–P interactions.

MATERIALS AND METHODS

Datasets of long-range E–P pairs

In our work, the long-range E–P pair datasets of four human cell lines came from Roy et al.’s work (2015), the four cell lines were K562, H1hesc, Hela and Gm12878. The datasets of the above four cell lines respectively contained 1754, 674, 1530 and 952 E–P pairs. Each cell line respectively contains an interacting set and a non-interacting set, and the two sets have an equal number of samples. For an arbitrary E–P pair in the datasets, the distance between the enhancer and the transcription start site was larger than 2500 bp. Because the enhancers tend to interact with their target promoters depending on their chromatin distances, the non-interacting set and interacting set in each cell line were constructed with the same distribution of chromatin distance.

The extraction of feature parameters

Tissue-specific gene expression with the regulation of long-range E–P interactions would be affected by the diverse genomic signatures. For example, DNA methylation can regulate tissue-specific gene expression, and it is related to multiple human cancers (Ball et al. 2009; Song and He 2012). The nucleosome reorganization in regulatory regions of enhancers and promoters accompany the gene activation by T cell receptor signaling (Schones et al. 2008). The eRNAs are short non-coding RNAs (Smith and Shilatifard 2014), and they would enhance or stabilize the long-range E–P interactions (Fan et al. 2013; Wang et al. 2011). In each cell line, we downloaded the available data of several genomic signatures, including DNase I, FAIRE-seq, transcription factors, histone modifications, DNA methylation, nucleosome position, eRNAs and RNA-seq. The sequencing techniques and the available data sources of corresponding signatures were shown in Table 3. For the sequencing data of above signatures, we processed data and extracted the corresponding features of enhancer, promoter and loop regions by using the same method in our previous works (Feng et al. 2018). For the two and more biological replicates, we averaged the feature values for the corresponding regions.

Table 3 The sequencing techniques and the available data sources of corresponding signatures

Predictive models for gene expression levels

In this work, we used three algorithms to predict the gene expression levels, including RF, GBM and SVR. The above prediction algorithms were implemented in R packages of ‘randomForest’, ‘gbm’ and ‘e1071’, respectively. For RF model, we set the parameter of n.trees = 500 (number of built trees), and selected the best mtry (number of variables randomly sampled as candidates at each split). For the SVR model, we selected the non-linear radial basis kernel. In the GBM model, we fit the parameters: interaction.depth = 5, n.trees = 5000, shrinkage = 0.01, n.minobsinnode = 10 and used Rsquared to select the optimal model with the largest value.

For the evaluation methods of statistical prediction, we used the 10-fold cross-validation test in the same cell line to evaluate the prediction performances. According to the previous works (Liu et al. 2016), we computed gene expression levels [log(RPKM)]. In the regression model, we calculated the standard deviation of regression model (RMSE) which was used to quantify the amount of dispersion of a set of data values. Then we calculated the R2 (coefficient of determination) which indicates the proportion of variation of the gene expression levels that has been explained by the model (Cheng and Gerstein 2012; Everitt 2002). Finally, we computed the PCC (Stigler 1989) between the predicted expression values and experimental measured values in the testing data.