Background

HBV is the main cause of viral hepatitis and hepatocellular carcinoma (HCC) [1]. HBV, a small DNA virus that replicates via an RNA intermediate, can integrate into the host genome [1]. Integrated viral DNA has been detected in the human genome in 85–90% of HBV-related HCCs [2]. HBV attaches to and enters hepatocytes, then transports its nucleocapsid, which contains a relaxed circular DNA (rcDNA), to the host nucleus. There, rcDNA is converted into covalently closed circular DNA (cccDNA), which is transcribed into messenger RNA (mRNA) and pre-genomic RNA (pgRNA). pgRNA is then reverse-transcribed into new rcDNA and double-stranded linear DNA (dslDNA), the latter of which tends to integrate into the host cell genome [3].

A previous study showed that HBV integration breakpoints are distributed across the whole genome, largely at random but with a handful of hotspots [7]. Further analysis revealed an association between HBV integration and genomic instability during these insertional events [8]. Moreover, significant enrichment of HBV integration was found near repetitive regions, fragile sites, CpG islands, and telomeres in tumors compared with non-tumor tissues [3]. For instance, HBV integration recurs in the telomerase reverse transcriptase (TERT) and myeloid/lymphoid or mixed-lineage leukemia 4 (MLL4, also known as KMT2B) genes, and these insertional events are accompanied by altered expression of the affected gene [3, 7, 9], indicating important biological impacts on the local genome. However, the pattern and mechanism of HBV integration remain to be explored. Many HBV integration sites are distributed throughout the human genome and appear completely random [8, 10, 11]. Whether the features and patterns of these “random” viral integration events can be learned and extracted remains an open question; answering it would greatly improve our understanding of HBV integration-related carcinogenesis.

Deep learning has shown a promising ability to discover intricate structure in multidimensional data and to extract features from these data automatically [12]. Given big data and a well-designed network structure, deep learning models often achieve better predictive performance than conventional approaches. Deep learning has also performed excellently in computational biology, for example in medical image identification [13] and protein sequence motif discovery [14]. The convolutional neural network (CNN) is one of the most important deep learning architectures, enabling a computer to learn features directly from training data [15]. Although deep learning performs well in various fields, how a model reaches its decisions is difficult to explain because of its black-box nature. The attention mechanism, which highlights the most informative parts of the input and connects the encoder and the decoder, was therefore introduced to help open this “black box” [16, 17].

This study developed DeepHBV, an attention-based deep learning model for predicting HBV integration sites. The attention mechanism highlights the regions that DeepHBV concentrates on and helps reveal the patterns the model has learned. DeepHBV predicts HBV integration sites accurately and specifically, and the attention mechanism highlights positions with potentially important biological meaning. Our work identified novel transcription factor-binding sites (TFBSs) near HBV integration hotspots, providing new insights into HBV-induced cancer.

Results

DeepHBV effectively predicts HBV integration sites by adding genomic features

The DeepHBV model structure and the scheme for encoding a 2000 bp sample into a binary matrix are shown in Fig. 1. The DeepHBV model was tested using the HBV integration sites database dsVIS (http://dsvis.wuhansoftware.com). DNA sequences around HBV integration sites were prepared as positive samples, and negative samples were extracted as described in the Methods; twice as many negative as positive samples were used to keep the data reasonably balanced and improve the confidence level. The positive samples were divided into a training dataset of 2902 and a testing dataset of 1264 samples, and 5804 and 2528 negative samples were extracted as the negative training and testing datasets, respectively. DeepHINT, an existing deep learning model that predicts HIV integration sites from the surrounding sequence [18], was also trained and evaluated on the HBV integration sequences; its input preparation likewise uses 2000 bp sequences around integration sites [18]. We tested DNA sequence samples with lengths of 500 bp, 1000 bp, 2000 bp, and 4000 bp. As shown in Additional file 4: Table S5, the 2000 bp length achieved the highest accuracy (0.7368), sensitivity (0.7695), specificity (0.7321), and AUROC (0.6901), the best performance among all tested lengths on most metrics. Accordingly, 2000 bp sequences were used in our study. The ReLU activation function and the nearly identical encoding scheme made it possible to test DeepHINT on HBV integration sequences. Both models were trained on the same HBV integration training dataset and evaluated on the same testing dataset. DeepHBV achieved an AUROC of 0.6363 and an AUPR of 0.5471 on HBV integration sequences, while DeepHINT achieved an AUROC of 0.6199 and an AUPR of 0.5152 (Fig. 2). In addition to AUROC and AUPR, tenfold cross-validation and the confusion matrix (true positives, true negatives, false positives, and false negatives), together with accuracy, specificity, sensitivity, Matthews’ correlation coefficient, and F1 score, were used to evaluate the predictive model; the results are shown in Table 1 and Additional file 4: Table S8.
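For reference, all of the reported metrics can be derived from the model’s predicted probabilities and the true labels. The sketch below, using scikit-learn (which DeepHBV also relies on; see Methods), is our illustration rather than the authors’ exact evaluation code; the 0.5 decision threshold is an assumption.

```python
import numpy as np
from sklearn.metrics import (roc_auc_score, average_precision_score,
                             confusion_matrix, matthews_corrcoef, f1_score)

def evaluate(y_true, y_score, threshold=0.5):
    """Compute the reported metrics from binary labels and predicted scores."""
    y_pred = (np.asarray(y_score) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "AUROC": roc_auc_score(y_true, y_score),
        "AUPR": average_precision_score(y_true, y_score),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),   # recall on positive samples
        "specificity": tn / (tn + fp),
        "MCC": matthews_corrcoef(y_true, y_pred),
        "F1": f1_score(y_true, y_pred),
    }
```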

Fig. 1
figure 1

The deep learning framework applied in DeepHBV. (a) Scheme of encoding a 2 kb DNA sequence into a binary matrix using one-hot code; (b) a brief flowchart of the DeepHBV structure; matrix shapes are given in brackets, and a detailed flowchart is provided in Additional file 1: Fig. S1

Fig. 2
figure 2

Evaluation of DeepHBV model prediction performance on the test dataset: (a) receiver-operating characteristic (ROC) curves and (b) precision-recall (PR) curves

Table 1 The testing results of DNA sequence samples of 2000 bp length

Several previous studies have shown that HBV integration prefers genomic features in its surroundings, such as repeats, histone marks, and CpG islands [3, 8]. Thus, we added these genomic features to DeepHBV by mixing genomic feature samples with HBV integration sequences as new datasets and then trained and tested the updated DeepHBV models. We downloaded the following genomic features from different datasets [19,20,21] and organized them into four subgroups: (1) DNase Clusters, Fragile site, RepeatMasker; (2) CpG islands, GeneHancer; (3) Cons 20 Mammals, TCGA Pan-Cancer; (4) H3K4Me3 ChIP-seq, H3K27ac ChIP-seq (Additional file 2: Fig. S2a, b). After obtaining the genomic feature positions (sources are listed in Additional file 4: Table S2), we extended each position to 2000 bp and extracted the corresponding sequences from the hg38 reference genome. These sequences were defined as positive genomic feature samples. We then mixed HBV integration sequences, positive genomic feature samples, and randomly picked negative genomic feature samples (see Methods) and trained the DeepHBV model. Whenever a subgroup performed well, we re-tested each genomic feature in that subgroup to determine which specific features significantly affected model performance (Additional file 2: Fig. S2; AUROC and AUPR values are recorded in Additional file 4: Table S3). From the ROC and PR curves, we found that adding the genomic features repeat (AUROC: 0.8378, AUPR: 0.7535) or TCGA Pan-Cancer (AUROC: 0.9430, AUPR: 0.9310) significantly improved HBV integration site prediction compared with DeepHBV trained on HBV integration sequences alone (Fig. 2). We performed the same test on DeepHINT but did not find a subgroup that substantially improved its performance (results in Additional file 4: Table S3). Thus, supplementing HBV integration sequences with repeat or TCGA Pan-Cancer peaks significantly improves DeepHBV model performance.
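As an illustration of this sample preparation, the sketch below centers a 2000 bp window on a feature peak and extracts its sequence from a local hg38 FASTA. The use of pyfaidx and the file name are our assumptions, not the authors’ stated tooling.

```python
from pyfaidx import Fasta  # assumed FASTA reader; any equivalent works

genome = Fasta("hg38.fa")  # local copy of the hg38 reference genome

def feature_window(chrom, start, end, size=2000):
    """Extend a genomic feature peak to a centered 2000 bp window
    and return the corresponding reference sequence."""
    mid = (start + end) // 2
    w_start = max(0, mid - size // 2)
    return str(genome[chrom][w_start:w_start + size]).upper()

# Example: a positive genomic feature sample around a hypothetical peak
seq = feature_window("chr1", 1_000_000, 1_000_300)
```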

Validation of DeepHBV using the VISDB independent dataset

To assess whether DeepHBV generalizes to other datasets, we tested the pre-trained DeepHBV models (DeepHBV with HBV integration sequences + repeat peaks and DeepHBV with HBV integration sequences + TCGA Pan-Cancer peaks) on the HBV integration sites from another virus integration site (VIS) database, VISDB [22]. The model trained with HBV integration sequences + repeat peaks achieved an AUROC of 0.6657 and an AUPR of 0.5737, while the model trained with HBV integration sequences + TCGA Pan-Cancer peaks achieved an AUROC of 0.7603 and an AUPR of 0.6189.

The DeepHBV model with HBV integration sequences + TCGA Pan-Cancer peaks performed better than the model with HBV integration sequences + repeat peaks and was more robust on both the testing dataset from dsVIS (AUROC: 0.9430, AUPR: 0.9310) and the independent testing dataset from VISDB (AUROC: 0.7603, AUPR: 0.6189). We therefore used this model for subsequent HBV integration site predictions. With the repeat or TCGA Pan-Cancer genomic features, DeepHBV can accurately predict the probability that a 2000 bp input DNA sequence from the human genome contains an HBV integration site.

Studying HBV integration site selection preference through important sequence elements

Through its pooling operations, DeepHBV extracts features with translation invariance, enabling it to recognize patterns even when the corresponding features are slightly shifted. The attention mechanism in the DeepHBV framework can partly open the deep learning black box by assigning an attention weight to each position; each weight represents the computed importance of that position in the DeepHBV decision. The attention weights were extracted from the attention layer after two de-convolution and one de-pooling operations, giving an output of shape 667 × 1, in which each value represents the attention weight of a 3 bp region. Positions with higher attention weights are likely to have a stronger influence on the pattern recognition of DeepHBV, meaning that they might be critical points for identifying HBV integration sites. We averaged the attention values at each position across all HBV integration sequences and normalized them to the mean over all positions. Visualizing these normalized attention fractions revealed a peak-valley-peak pattern only in positive samples (Fig. 3). We focused on the positions with higher attention weights. In addition, in the attention weight distribution of DeepHBV with HBV integration sequences + TCGA Pan-Cancer, positive samples often showed a cluster of attention weights much higher than the other weights in the same sample, whereas this pattern was not observed for DeepHBV with HBV integration sequences + repeats (Fig. 3).

Fig. 3
figure 3

Attention weight distributions analysed by DeepHBV with HBV integration sequences + genomic features. (a) DeepHBV with HBV integration sequences + TCGA Pan-Cancer peaks; (b) DeepHBV with HBV integration sequences + repeat peaks. The left graphs show the fractions of attention weight averaged among all samples and normalized to the average over all positions; each index represents a 3 bp region owing to the multiple convolution and pooling operations. The graphs on the right show representative attention weight distributions for positive and negative samples

To further explore the pattern behind the positions with higher attention weights, we defined the sites with the top 5% of attention weights as attention-intensive sites and the regions within 10 bp of them as attention-intensive regions. We mapped these attention-intensive sites onto the hg38 reference genome together with genomic features (Fig. 4) but found no clear positional relationship between the attention-intensive sites and the genomic features.
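In code, this definition reduces to a simple percentile cutoff. The sketch below assumes `attention` is the 667-dimensional weight vector of one sample and maps each index back to its approximate offset in the 2000 bp input (3 bp per index); the helper name is ours.

```python
import numpy as np

def attention_intensive(attention, top=0.05, flank=10):
    """Return attention-intensive sites (top 5% of weights) and the
    attention-intensive regions within 10 bp of each site."""
    attention = np.asarray(attention)
    cutoff = np.quantile(attention, 1 - top)         # top-5% threshold
    sites = np.flatnonzero(attention >= cutoff) * 3  # each index spans 3 bp
    regions = [(max(0, s - flank), s + flank) for s in sites]
    return sites, regions
```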

Fig. 4
figure 4

Attention-intensive regions highlighted essential local genomic features in predicting HBV integration sites. Representative examples show the positional relationship between attention-intensive sites and several genomic features using the DeepHBV with HBV integration sequences + TCGA Pan-Cancer model on (a) chr5:1,294,063-1,296,063 (hg38) and (b) chr5:1,291,277-1,293,277 (hg38)

In deep learning, the convolution and pooling module learns patterns with translation invariance: the network tends to learn domains that recur among samples within the same pooling matrix even when the learned feature occupies different positions in different samples [23, 24]. Because of this translation invariance, attention-intensive regions are likely to correspond to conserved elements. The accurate prediction results suggested that these conserved regions could provide hints about the selection preference of HBV integration sites. We therefore attempted to enrich transcription factor-binding site (TFBS) motifs, which are conserved genomic elements, in the attention-intensive regions. Using all HBV integration samples with prediction values above 0.95 from dsVIS and VISDB separately, we enriched local TFBS motifs in attention-intensive regions with HOMER v4.11.1 [25] and the vertebrate transcription factor databases provided by HOMER. Several TFBSs were enriched near attention-intensive sites (Table 2). For DeepHBV with HBV integration sequences + TCGA Pan-Cancer, the binding sites of AR-halfsite, Arnt, Atf1, bHLHE40, bHLHE41, BMAL1, CLOCK, c-Myc, COUP-TFII, E2A, EBF1, Erra, Foxo3, HEB, HIC1, HIF-1b, LRF, Meis1, MITF, MNT, Myog, n-Myc, NPAS2, NPAS, Nr5a2, Ptf1a, Snail1, Tbx5, Tbx6, TCF7, TEAD1, TEAD3, TEAD4, TEAD, Tgif1, Tgif2, THRb, USF1, Usf2, Zac1, ZEB1, ZFX, ZNF692, and ZNF711 were enriched in the attention-intensive regions of both the dsVIS and VISDB sequences. For a more intuitive display, we selected two representative samples and aligned genomic features, HBV integration sites from dsVIS and VISDB, attention-intensive sites, and TFBSs on the hg38 reference genome (Fig. 4). Most attention-intensive sites could be mapped to the enriched TF motifs. The clusters of high attention weight output by DeepHBV with HBV integration sequences + TCGA Pan-Cancer in Fig. 4 coincide with the binding site of the tumor suppressor gene HIC1 or of the circadian clock-related elements BMAL1, CLOCK, c-Myc, and NPAS2. Together, these data provide novel insights into HBV integration site selection preference and reveal biological importance that warrants future experimental confirmation.
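For concreteness, the motif enrichment step can be driven from Python roughly as below, by writing the attention-intensive regions to a BED file and invoking HOMER’s findMotifsGenome.pl. The file names, example coordinates, and BED-writing helper are illustrative, not the authors’ pipeline.

```python
import subprocess

def write_bed(regions, path):
    """Write (chrom, start, end) attention-intensive regions to a BED file."""
    with open(path, "w") as fh:
        for chrom, start, end in regions:
            fh.write(f"{chrom}\t{start}\t{end}\n")

regions = [("chr5", 1294500, 1294520)]  # illustrative attention-intensive regions
write_bed(regions, "attention_regions.bed")

# HOMER motif enrichment against hg38, using the regions at their given size
subprocess.run(
    ["findMotifsGenome.pl", "attention_regions.bed", "hg38",
     "homer_output", "-size", "given"],
    check=True,
)
```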

Table 2 Enriched TFBS from attention intensive regions of DeepHBV with HBV integration sites + TCGA Pan Cancer peaks

Discussion

This study developed an explainable attention-based deep learning model, DeepHBV, to predict HBV integration sites. In the comparison between DeepHBV and DeepHINT for predicting HBV integration sites (Additional file 4: Table S3), DeepHBV outperformed DeepHINT after genomic features were added, owing to a model structure and parameters better suited to recognizing the surroundings of HBV integration sites. DeepHBV uses two convolution layers (first layer: 128 convolution kernels with a kernel size of 8; second layer: 256 convolution kernels with a kernel size of 6) and one pooling layer (with a pooling size of 3), whereas DeepHINT has only one convolution layer (64 convolution kernels with a kernel size of 6) and one pooling layer (with a pool size of 3). Adding convolution layers enables information to be extracted from higher dimensions, and increasing the number of convolution kernels enables more feature information to be extracted [26].
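A minimal Keras sketch of this convolution-and-pooling backbone is given below. The padding choice (which makes the pooled length 667, matching the attention output shape mentioned in the Results) is our assumption, and the published model additionally contains the attention branch described in the Methods.

```python
from tensorflow.keras import layers, models

def conv_pool_backbone(seq_len=2000):
    """DeepHBV-style backbone: two convolution layers (128 kernels of size 8,
    then 256 kernels of size 6) followed by one max-pooling layer of size 3."""
    inputs = layers.Input(shape=(seq_len, 4))                # one-hot DNA
    x = layers.Conv1D(128, 8, padding="same", activation="relu")(inputs)
    x = layers.Conv1D(256, 6, padding="same", activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=3, padding="same")(x)  # -> (667, 256)
    return models.Model(inputs, x)
```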

We trained the DeepHBV model using three strategies: (1) DNA sequences near HBV integration sites (HBV integration sequences), (2) HBV integration sequences + TCGA Pan-Cancer peaks, and (3) HBV integration sequences + repeat peaks. Adding TCGA Pan-Cancer or repeat peaks to the HBV integration sequences significantly improved model performance, and DeepHBV with HBV integration sequences + TCGA Pan-Cancer peaks performed better on the VISDB independent test dataset. However, the attention-intensive regions could not be well aligned to these genomic features. We therefore inferred that other features, such as TFBS motifs, could drive the predictions of DeepHBV. HOMER was applied to recognize these TFBSs, and we found that the enriched motifs might be related to HBV-related diseases or cancer development.

We noticed that the attention-intensive regions identified by the attention mechanism of DeepHBV with HBV integration sequences + TCGA Pan-Cancer focused strongly on the binding sites of the HIC1 tumor suppressor gene, the circadian clock-related elements BMAL1, CLOCK, c-Myc, and NPAS2, and other transcription factors such as TEAD and Nr5a2. These DNA-binding proteins have been reported to be related to tumor development [27,28,29,30,31,32,33]. For instance, HIC1 is a tumor suppressor gene associated with hepatocarcinogenesis [27, 28]. BMAL1, CLOCK, c-Myc, and NPAS2 all participate in the regulation of the circadian clock [29], which is closely related to HBV-related diseases [30, 31] (Additional file 4: Table S4). Together, these TFBSs on the human genome are closely associated with HBV integration, and their biological significance should be further verified experimentally.

Conclusion

This study developed an explainable attention-based deep learning model, DeepHBV, to predict HBV integration sites. DeepHBV is a robust deep learning model for predicting HBV integration sites and is the first attempt to use CNNs for HBV integration prediction. The attention mechanism in DeepHBV can be used to highlight the genomic preference for HBV integration and offer a deeper understanding of the mechanism underlying HBV-related cancer.

Methods

Data preparation

For DeepHBV model training and testing, 1000 bp of DNA sequence was extracted upstream and downstream of each HBV integration site, giving 2000 bp positive samples. Each sample was denoted as \(S=({n}_{1},{n}_{2},\dots ,{n}_{2000})\), where \({n}_{\mathrm{i}}\) represents the nucleotide at position i. DNA sequences that do not contain HBV integration sites served as negative samples. Because HBV integration hot spots can contain several integration events within a 30–100,000 bp range [34], the background regions were chosen at a sufficient distance from known HBV integration sites: regions within 50,000 bp of known HBV integration sites in the hg38 reference genome were excluded, and 2000 bp sequences containing no HBV integration sites were randomly selected from the remaining regions as negative samples.
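A minimal sketch of this negative-sampling rule, assuming per-chromosome sampling against a list of known integration site positions; the function name and rejection-sampling loop are ours.

```python
import random

def sample_negatives(chrom_len, known_sites, n, window=2000, exclusion=50000):
    """Randomly pick 2000 bp windows whose centers lie more than 50 kb
    from every known HBV integration site on the chromosome."""
    negatives = []
    while len(negatives) < n:
        start = random.randrange(0, chrom_len - window)
        center = start + window // 2
        if all(abs(center - site) > exclusion for site in known_sites):
            negatives.append((start, start + window))
    return negatives
```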

The extracted DNA sequences were one-hot encoded so that similarities and distances between features could be computed more accurately during training. Each DNA sequence was converted into a binary matrix with four channels, one per nucleotide type.
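A one-hot encoder consistent with this description might look as follows; the channel order (A, C, G, T) and the all-zero handling of ambiguous bases are assumptions.

```python
import numpy as np

CHANNELS = {"A": 0, "C": 1, "G": 2, "T": 3}  # assumed channel order

def one_hot(seq):
    """Encode a DNA string as a 4 x len(seq) binary matrix E,
    one channel per nucleotide type."""
    E = np.zeros((4, len(seq)), dtype=np.int8)
    for i, base in enumerate(seq.upper()):
        if base in CHANNELS:            # ambiguous bases (e.g. N) stay all-zero
            E[CHANNELS[base], i] = 1
    return E

E = one_hot("ACGT")  # 4 x 4 matrix with one 1 per column
```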

Feature extraction

The DeepHBV model first applied convolution and pooling modules to learn sequence features around HBV integration sites (Additional file 1: Fig. S1). Specifically, the model employed multiple convolution kernels to extract different features. A DNA sequence is denoted as \(S = \left({n}_{1},{n}_{2},\dots ,{n}_{2000}\right)\) and encoded into a binary matrix E. Each binary matrix enters the convolution and pooling module for convolution calculations, \(X=conv(E)\), which can be written as:

$$X_{{k,i}} = \mathop \sum \limits_{{j = 0}}^{{p - 1}} \mathop \sum \limits_{{l = 1}}^{{L}} W_{{k,j,l}} E_{{l,i + j}}$$
(1)

Here, 1 ≤ k ≤ d, where d is the number of convolution kernels; 1 ≤ i ≤ n − p + 1, where \(i\) is the position index; p is the convolution kernel size; n is the input sequence length; L is the number of nucleotide channels (here, L = 4); and \(W\) is the convolution kernel weight.
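Eq. (1) transcribes directly into the (unoptimized) NumPy loop below, using 0-based indices; the kernel layout is an assumption.

```python
import numpy as np

def conv_forward(E, W):
    """Direct transcription of Eq. (1).

    E: L x n one-hot matrix (L = 4 channels, n = sequence length).
    W: d x p x L kernel tensor (d kernels of size p).
    Returns X: d x (n - p + 1) feature map, X[k, i] as in Eq. (1).
    """
    d, p, L = W.shape
    n = E.shape[1]
    X = np.zeros((d, n - p + 1))
    for k in range(d):
        for i in range(n - p + 1):
            X[k, i] = sum(W[k, j, l] * E[l, i + j]
                          for j in range(p) for l in range(L))
    return X
```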

After extracting the eigenvectors, the convolutional layer activated them using a rectified linear unit (ReLU), mapping the elements onto a sparse matrix. Next, the model applied a max-pooling strategy to reduce the dimensionality while retaining the most predictive information, and the final eigenvector \({F}_{\mathrm{c}}\) was extracted.

The attention mechanism in the DeepHBV model

The attention mechanism was applied in DeepHBV to determine the contribution of each position to the extracted eigenvector \({F}_{\mathrm{c}}\). In the attention layer, each eigenvalue was assigned a weight representing the contribution of that position within the convolutional neural network (CNN).

The output from the convolution-and-pooling module, eigenvector \({F}_{\mathrm{c}}\), is the input of the attention layer, and the output is the weight vector \(W\), which can be denoted as

$$W = att\left( {a_{1} ,a_{2} , \ldots ,a_{q} } \right)$$
(2)

Here, \(att()\) refers to the attention mechanism, \({a}_{i}\) is the eigenvector in the \({i}^{th}\) dimension in the eigenmatrix, and \(W\) refers to the dataset containing the contribution values of each position in the eigenmatrix extracted by the convolution-and-pooling module.

All contribution values were normalized and combined into a dense eigenvector, denoted \({F}_{a}\):

$$F_{a} = \mathop \sum \limits_{{j = 1}}^{q} a_{j} v_{j}$$
(3)

where \({a}_{j}\) refers to the relevant normalization value, and \({v}_{j}\) refers to the eigenvector at position \(j\) of the input eigenmatrix. Each position refers to an extracted eigenvector in each convolution kernel.
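One common realization of Eqs. (2)-(3) is a learned scoring vector followed by a softmax, as sketched below; the exact parameterization of \(att()\) in DeepHBV may differ, so treat this as illustrative.

```python
import numpy as np

def attention_pool(V, score_w):
    """Eqs. (2)-(3): score each position, normalize, and form the
    attention-weighted sum F_a.

    V: q x m eigenmatrix (q positions, m features per position).
    score_w: m-dimensional scoring vector (an assumed parameterization).
    """
    scores = V @ score_w                  # one raw score per position
    a = np.exp(scores - scores.max())
    a /= a.sum()                          # normalized weights a_j
    F_a = (a[:, None] * V).sum(axis=0)    # Eq. (3): sum_j a_j * v_j
    return a, F_a
```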

The convolution-and-pooling module and the attention mechanism module must be combined for model prediction. In other words, the eigenvector \({F}_{c}\) and the attention eigenvector \({F}_{a}\) work together in predicting HBV integration sites.

The values in the eigenvector \({F}_{c}\) were linearly mapped to a new vector, \({F}_{v}\), which is

$$F_{v} = dense\left( {flatten\left( {F_{{\text{c}}} } \right)} \right)$$
(4)

In this step, the flatten layer applies \(flatten()\) to reduce dimensions and concatenate the data, and the dense layer applies \(dense()\) to map the dimension-reduced data to a single value. The vector obtained by concatenating \({F}_{v}\) and \({F}_{a}\) then enters the linear prediction classifier to calculate the probability that HBV integration occurs within the current sequence:

$$P = sigmoid\left( {concat\left( {F_{a} ,F_{v} } \right)} \right)$$
(5)

where \(P\) is the predicted value, \(sigmoid()\) is the activation function acting as the classifier in the final output, and \(concat()\) is the concatenation operation.
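In Keras terms, Eqs. (4)-(5) correspond roughly to the head below; the single-unit dense layers reflect the “map to a single value” and “linear prediction classifier” descriptions rather than published layer sizes, and the tensors F_c and F_a come from the preceding modules.

```python
from tensorflow.keras import layers

def prediction_head(F_c, F_a):
    """Eqs. (4)-(5): flatten F_c into F_v, concatenate with F_a,
    and apply a sigmoid-activated linear classifier."""
    F_v = layers.Dense(1)(layers.Flatten()(F_c))           # Eq. (4)
    merged = layers.Concatenate()([F_a, F_v])              # concat(F_a, F_v)
    return layers.Dense(1, activation="sigmoid")(merged)   # Eq. (5)
```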

At the same time, the output eigenvector \({F}_{c}\) from the convolution-and-pooling module serves as the input to the attention mechanism, which produces the weight vector \(W\):

$$W = att\left( {a_{1} ,a_{2} , \ldots ,a_{q} } \right)$$
(6)

with \(att()\), \({a}_{i}\), and \(W\) defined as in Eq. (2).

DeepHBV model evaluation

After each parameter in DeepHBV was confirmed (Additional file 4: Table S1), the DeepHBV deep learning neural network model was trained using binary cross-entropy. The loss function of DeepHBV can be defined as:

$$Loss = - \mathop \sum \limits_{i} \left[ {y_{i} \log \left( {P_{i} } \right) + \left( {1 - y_{i} } \right)\log \left( {1 - P_{i} } \right)} \right]$$
(7)

where \({y}_{i}\) is the binary label of sequence i (positive samples were labeled 1, and negative samples were labeled 0), and \({P}_{i}\) is the predicted value for that sequence.
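As a sanity check, Eq. (7) can be written in a few lines of NumPy; the clipping constant is a standard numerical-stability addition, not part of the published loss.

```python
import numpy as np

def bce_loss(y, P, eps=1e-7):
    """Eq. (7): binary cross-entropy summed over all sequences.

    y: binary labels (1 = positive, 0 = negative); P: predicted probabilities.
    """
    P = np.clip(P, eps, 1 - eps)  # avoid log(0)
    return -np.sum(y * np.log(P) + (1 - y) * np.log(1 - P))
```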

To evaluate the best output of the DeepHBV model, tenfold cross-validation was adopted, together with the confusion matrix (true positives, true negatives, false positives, and false negatives) and the derived metrics accuracy, specificity, sensitivity, AUROC, AUPR, Matthews’ correlation coefficient, and F1 score.

The DeepHBV model was implemented with TensorFlow 1.13.1 and scikit-learn 0.24 [35] and run on an NVIDIA Tesla V100-PCIE-32G GPU (NVIDIA Corporation, California, USA).