Background

Approximately 45% of the human genome is composed of transposable elements (TEs) [1,2,3]. Of these, a fraction of TEs are retroelements (REs), which move via a ‘copy and paste’ mechanism involving the reverse transcription of an RNA intermediate and insertion of its cDNA copy at a new position within the host genome [3, 4]. One class of REs, human endogenous retroviruses (HERVs), result from ancestral infections by exogenous retroviruses that became incorporated into the germline DNA and evolutionarily fixed in the genome; HERVs are estimated to comprise approximately 8% of the human genome [5]. HERVs can be transmitted vertically as proviruses in a Mendelian fashion but are not inherently infectious [6,7,8]. HERVs are structurally similar to the proviruses of common retroviruses, in which the gag, pol, and env genes are flanked by two long terminal repeats (LTRs) that act as promoters [5]. Most HERV families have the pro genes, but some families, such as HERV-K HML10, have none [9]. These elements are usually inactive due to the accumulation of substitutions, deletions, and insertions [10, 11]. However, integrated LTR elements have been shown to influence gene regulation by providing regulatory elements such as enhancers, promoters, and splice- and polyadenylation sites for various host genes [4].

HERVs have been divided into three classes, namely, Class I (gamma retrovirus-like), Class II (beta retrovirus-like), and Class III (vaguely spuma retrovirus-like) [12]. The classification of HERVs is complex, with several different classification systems in use. In addition to a system based on pol sequence identity, a system based on the tRNA molecule used by retroviruses as a primer during replication is also used. The primer binding site (PBS) regions of Class II HERVs are complementary to lysine (K) tRNA molecules; thus, these HERVs have been designated HERV-Ks [13]. HERV-K proviruses appeared approximately 30–35 million years ago (mya) and are divided into subfamilies from HML-1 through HML-10 [14, 15]. HML-2 of HERV-K, the clade of beta retrovirus-like endogenous retroviruses, is recognized as the most biologically active subgroup, and many of its members still have transcriptional activity [16,17,18,19,20].

The distribution of HERV elements is usually enriched outside transcription units in the human genome. In addition, the few HERV elements within transcription units exhibit a strong orientation bias, such that the orientation of the viral genome is usually opposite to that of host gene transcription. Both of these trends in the location of HERV distribution are likely to be the result of purifying selection. In this case, the harmful HERV provirus within a transcription unit is subject to negative selection and disappears over the course of evolution [12, 15, 21,22,23,24]. Because the splicing and poly(A) addition signals of HERV are present in the antisense direction, HERV transcription in the opposite direction to that of the gene may be the least disruptive to mRNA synthesis [15, 21, 22, 25]. A recent study proposed a correlation between silencing mechanisms and the evolutionary age of HERVs. CpG-rich young LTRs were found to be repressed by DNA methylation, while middle-aged LTRs were silenced mainly by posttranslational histone modifications such as H3K9me3 [26].

The first study of the relationship between the expression of the reverse transcriptase (RT) protein of HERV-K and cancer was reported in the early 1970s [27, 28]. Correlations between HERVs and human cancers such as melanoma, breast cancer, germ cell tumors, and ovarian cancer have been described, with significant differences in the protein expression of HERV-K (HML-2/HML-6) in cancer tissues compared to normal tissues. In addition, HERV-K (HML-2) is associated with autoimmune diseases and motor neuron diseases, such as rheumatoid arthritis and amyotrophic lateral sclerosis [29,30,31,32,33,34,35,36]. HERV-K (HML-2) proviruses are classified as type 1 or type 2 based on the presence or absence of a 292 nt deletion at the pol-env junction [37]. Type 2 proviruses without these deletions encode Rec or Env. The type 1 provirus with the 292 nt deletion encodes the Np9 protein [38, 39]. Env protein can act as a tumor-specific antigen that impacts both innate and adaptive immune responses, leading to B- and T-cell stimulation and activation and inducing antibody production and cytotoxic T-cell responses [40]. Elevated levels of the Rec or Np9 protein have been observed in breast cancer, ovarian cancer, and leukemia [18, 41, 42]. HERV-H is transcriptionally repressed in adult tissues through DNMT1-dependent cytosine methylation, which contributes to blocks its transcription and translation, potentially triggering an autoimmune response [43, 44]. However, histone deacetylation alone is not responsible for the repression of HERV family members (HERV-K (HML-2), HERV-W, HERV-FRD), and HDACi treatment did not significantly upregulate HERVs in either latent cell lines or primary T cells infected with HIV-1 [45].

Characterization of the genomic distribution of the HML-9 group is critical to understanding the regulation of the expression of this family and its relationship with human health and disease. To date, there has been only limited characterization of and research on HML-9. In light of this, a precise and updated HML-9 genomic map is urgently needed to better evaluate the role of these elements in human health.

Here, we report a comprehensive analysis of the presence and distribution of HERV-K HML-9 elements within the human genome, with a detailed characterization of the structural and phylogenetic properties of the group. Additionally, we analyzed the provirus integration time and the genes that may be  regulated by these elements. Overall, the results provide a clear characterization of HML-9 and a comprehensive background for subsequent functional studies.

Materials and methods

HML-9 identification, localization, and genomic distribution

To evaluate the HML-9 provirus and solo LTR distribution in the human genome, we performed HML-9 identification by using the Genome Reference Consortium assembly GRCh38/hg38 (released Dec. 2013) as the human genome reference. A traditional BLAT search tool [46] in the UCSC Genome Browser database [47] was used to identify the integrated HML-9 elements. DNA BLAT works by keeping an index of the entire genome in its memory. The index consists of all overlapping 11-mers stepped by 5, except for those heavily involved in repeats (http://genome.ucsc.edu/cgi-bin/hgBlat). The assembled LTR14C-HERVK14C-LTR14C sequence was used as a query. Generally, there are two resources that can be selected as references: consensus representatives or single best representative strains. The major advantage of consensus representatives is their much broader representation [9, 48]. Therefore, they are used as references or queries in most studies. The assembled LTR14C-HERVK14C-LTR14C in the current work is from the Dfam database. Additionally, the expected distribution of the HML-9 loci on each chromosome was predicted according to the formula e = Cl × n/Tl (e is the number of integrations expected in the chromosome, Cl represents the ungapped length of the chromosome, n is the total number of actual HML-9 loci identified in the human genome, and Tl represents the sum ungapped length of all chromosomes) [49]. The difference between the expected number of integrations and the actual number of HML-9 loci was analyzed via the chi-square (χ2) test, and statistical significance was estimated according to the p value.

Structural characterization

The identified HML-9 elements were aligned to the proviral reference LTR14C-HERVK14C-LTR14C. Alignments were analyzed on the BioEdit software platform [50]. All insertions and deletions were annotated.

Phylogenetic analyses

Maximum likelihood (ML) phylogenetic trees were built with MEGA7 [51] to confirm the assignment of the identified HML-9 elements. The 44 out of 47 solo LTR sequences that were longer than 90% of LTR14C and the 5 out of 23 proviral sequences that were longer than 80% of LTR14C-HERVK14C-LTR14C were used to construct phylogenetic trees. The best-fit models of nucleotide substitution for solo LTRs and full-length proviruses were calculated as K2 + G and GTR + G + I by the Model Selection function in MEGA7, respectively. For the 4 coding regions, elements longer than 90% of the corresponding section of HML-9 were screened to construct phylogenetic trees. According to the model selection function of MEGA7, the best-fit models of nucleotide substitution for gag, pro, pol, and env analysis were HKY + G + I, GTR + G + I, GTR + G, and HKY + G, respectively. Tree topologies were searched using the nearest neighbor interchange (NNI) procedure. The confidence of each node in phylogenetic trees was determined using bootstrap testing with 500 bootstrap replicates. The final ML trees were visualized using iTOL [52].

Estimation of the integration time of HML-9

To estimate the time of integration, we used the substitution rate of 0.2%/nucleotide/million years to assess the effect of divergence on each HML-9 element [53]. D is the percentage of divergent nucleotides, and the D of each HML-9 member was estimated between (1) the 5′ and 3′ LTRs of each provirus and (2) each HML-9 internal element (gag, pro, pol, and env genes) and its generated consensus. The divergence values were estimated with MEGA7. For the 4 internal regions, the integration time was calculated based on the formula T = D/0.2, in which T represents the estimated time of integration (in million years). For the flanking LTR regions, the provirus integration time was calculated based on the formula T = D/0.2/2.

Functional prediction of cis-regulatory regions and enrichment analysis

Noncoding regions typically lack biological function annotations. To understand the biological significance of both HML-9 solo LTRs and proviral LTRs, an analysis of the annotations of genes adjacent to LTRs was performed based on the Genomic Regions Enrichment of Annotations Tool (GREAT) [54]. The association rule was as follows: basal + extension: 5000 bp upstream, 1000 bp downstream; 1,000,000 bp max extension; curated regulatory domains included. After identifying potential regulatory genes, the WEB-based Gene SeT AnaLysis Toolkit (WebGestalt) [55] was used to analyze their functional enrichment (http://www.webgestalt.org), which is crucial for interpreting the list of genes of interest. The enrichment method used in the current work was over-representation analysis (ORA). The parameters for the enrichment analysis were as follows: minimum number of IDs in the category: 5; maximum number of IDs in the category: 2000; FDR Method: Benjamini–Hochberg (BH); and significance level: top 10.

In silico examination of conserved transcription factor binding sites

The transcription factor binding sites of the HML-9 LTR consensus reference sequence were predicted from the JASPAR (https://jaspar.genereg.net/) database. The taxon was vertebrates, and the species was Homo sapiens. The data chosen for the prediction of transcription factors were ChIP-seq data in JASPAR with a relative profile score threshold of 95%. The alignment and annotation of the HML-9 LTR reference sequence with the 4 proviral sequences (length of 5′LTR > 90%) were performed using Geneious software [56].

Primer binding site feature representation

The composition of the primer binding sites (PBSs) of 11 near-full-length proviruses (LTR length > 80%) and the HML-9 reference sequence were all analyzed using MEGA7 and BioEdit. The degree of conservation at each position was represented by a logo built from WebLogo at http://weblogo.berkeley.edu [57]. Then, the PBS type was identified with tRNAdb (http://trna.bioinf.uni-leipzig.de/) [58].

Results

HML-9 element identification, localization, and distribution in hg38

According to the BLAT results for LTR14C-HERVK14C-LTR14C in GRCh38/hg38, we characterized a total of 23 HERV-K HML-9 proviral elements. Each HML-9 element was named according to the genomic locus of integration, as a previously proposed nomenclature for HERV-Ks [16] (Table 1). Element length analysis indicated that 6 elements were longer than 70% of the full length of the reference, 9 elements were between 40 and 70% of the reference length, and the remaining 8 elements were between 17.11 and 34.26% of the reference length. Moreover, a total of 47 solo LTR elements of HERV-K HML-9 were characterized. Of these, 44 solo LTRs (93.62%) were longer than 90% of the representative reference LTR14C. The nucleotide sequence of each element is shown in Additional file 1: Dataset S1, Additional file 2: Dataset S2. The overall HML-9 element distribution is displayed based on the data representation obtained from the Ensemble website (www.ensembl.org) (Fig. 1A).

Table 1 HML-9 provirus distribution
Fig. 1
figure 1

Chromosomal distribution of HML-9 loci. A All HML-9 elements (red arrows) are displayed on the human karyotype (www.ensembl.org). The number of HML-9 proviral elements (B) and solo LTRs (C) integrated into each human chromosome was determined and compared to the expected number of insertion events. The expected number of sequences in each chromosome is marked in blue, and the actual number of sequences detected is marked in orange

Next, the expected number of integrations of HML-9 elements per chromosome was predicted and compared with the number of actually detected sites to assess whether HML-9 is randomly distributed in the human genome. The number of HML-9 integration events observed was often inconsistent with the expected number (Fig. 1B, C). For the proviral elements, the number of HML-9 insertions on chromosomes 8, 13, 15, 16, 19, 21, and Y was higher than expected. In particular, the number of proviral elements on the Y chromosome was significantly higher than that predicted by the chi-square test (p < 0.05). On chromosomes 1, 2, 4, 5, 6, 7, 10, and 12, the actual numbers identified were lower than expected (Fig. 1B). Notably, no HML-9 provirus integration was detected on chromosomes 3, 9, 11, 14, 17, 18, 20, 22, and X. With respect to the solo LTR elements, the number of HML-9 solo LTRs on chromosomes 2, 3, 14, 15, 18, 21, X, and Y was higher than expected. On chromosomes 1, 4, 5, 6, 7, 8, 10, 11, 12, 13, 17, and 20, the actual numbers identified were lower than expected. In particular, no HML-9 solo LTR integration was detected on chromosomes 9, 16, 19, and 22 (Fig. 1C). This analysis revealed that HML-9 provirus and solo LTR integration is nonrandom among human chromosomes.

Furthermore, all 23 identified proviral elements and 47 solo LTRs were analyzed to determine their locations in intergenic regions, introns, or exons (Tables 1, 2). The results showed that 13 proviral elements were located in intergenic regions, accounting for 56.52% of all proviral elements. Four proviral elements (17.39%) were located in introns. Six proviral elements (26.09%) were located in both introns and exons (Table 1). With respect to solo LTRs, 28 (59.57%) were located in intergenic regions, and the remaining 19 (40.43%) were located in introns (Table 2). The results revealed an apparent distribution preference for intergenic regions and introns. Previously, Brady et al. [15] demonstrated that the accumulation of HML-2 proviruses in introns and intergenic regions is not a result of integration bias but selection against proviruses that integrate into exons and genic regions. This conclusion also applies to the current study. The proviruses in genes and their relative transcriptional orientation are presented in Additional file 3: Table S1, Additional file 4: Table S2.

Table 2 HML-9 solo LTR tracks distribution

Structural characterization

To define the structural characteristics of HML-9 elements, the 23 proviruses were further analyzed by comparing them with the reference LTR14C-HERVK14C-LTR14C. According to the annotation information summarized in the Dfam database (https://www.dfam.org/family/DF0000193/features), the complete HML-9 reference exhibits a typical proviral structure, containing 4 open reading frames (ORFs) and 2 flanking LTRs. Specifically, the 5′ LTR is located from nucleotides 1 to 587, the CDS range of the HERVK14C_gag protein is from nucleotides 758 to 2548, the CDS range of the HERVK14C_pro protein is from nucleotides 2548 to 3435, the CDS range of the HERVK14C_pol protein is from nucleotides 3411 to 6060, the CDS range of the HERVK14C_env protein is from nucleotides 5975 to 8020, and the 3′ LTR is from nucleotides 8022 to 8608. We aligned the 23 HML-9 proviral sequences and annotated the position of the single retroviral component and deletions to describe the structure of each HML-9 provirus element (Fig. 2). HML-9 16p12.3, 2p12, 15q21.1, 8p11.1, 13q31.1 and 4q33 are longer than 70% of the complete reference sequence in length. Furthermore, the integrity of 6 separate regions (5′ LTR, gag, pro, pol, env, and 3’ LTR) is summarized in Table 3.

Fig. 2
figure 2

HML-9 provirus structural characterization. Each HML-9 provirus element was analyzed and compared to the Dfam reference sequence. The LTRs and the gag, pro, pol, and env genes were annotated. Black lines represent deleted regions

Table 3 The integrity of 6 separate regions relative to the corresponding sections of reference

Phylogenetic analyses

To characterize the phylogenetic relationships among the HML-9 group, 5 proviral sequences (longer than 80% of the HML-9 reference length) were screened together with Dfam HERV-K groups (HML-1–10) and exogenous betaretroviruses as representatives to generate ML trees. The 5 identified proviruses all clustered with the Dfam HML-9 reference by 100% of bootstrap support (Fig. 3A). Subsequently, phylogenetic trees of a total of 44 solo LTRs identified to be longer than 90% of LTR14C were constructed together with the LTR reference (Fig. 3B). Next, 4 ML trees were constructed for subregions whose lengths were longer than 90% of the corresponding section of the reference sequence, including 10 gag elements, 8 pro elements, 11 pol elements, and 13 env elements (Fig. 3C–F). These phylogenetic groups of different regions of HML-9 all clustered together and were clearly separated from the other HERV-K groups (HML1-8, 10). Two distinct clusters in the pro and pol groups were observed. They were statistically supported by 100% of bootstrap values and were named HML-9 type a and type b. The results showed that HML-9 21q21.1, HML-9 1q22, and HML-9 5q33.3 were included in HML-9 type a, whereas HML-9 15q21.1, HML-9 16p12.3, HML-9 4q33, HML-9 8p11.1, and HML-9 2p12 were included in HML-9 type b. HML-9 type b sequences included the Dfam HML-9 reference, whereas HML-9 type a elements showed more divergence relative to the group references.

Fig. 3
figure 3

Phylogenetic analysis of the HML-9 near-full-length proviruses, solo LTRs, and 4 subregions by the maximum likelihood method. Phylogenetic analyses of 5 HML-9 proviral elements (A), 44 solo LTRs (B), 10 gag elements (C), 8 pro elements (D), 11 pol elements (E), and 13 env elements (F), together with references. The two intragroup clusters of the pro and pol genes (types a and b) were annotated and depicted with brown and orange background colors, respectively. The resulting phylogeny was tested by the bootstrap method with 500 replicates. The branch length indicates the number of substitutions per site

Estimated time of integration

The HML-9 proviral age in the human genome was estimated based on the LTRs and the gag, pro, pol, and env regions. Each region whose length exceeded 90% of the corresponding reference sequence was selected to calculate the integration time. For LTRs, the 5′ and 3′ LTRs of a given provirus are identical at the time of integration and then accumulate random substitutions in an independent manner [53]; therefore, the T value was estimated by the relation T = D/0.2/2. For the gag, pro, pol, and env regions, the ancestral sequences of each region were generated via MEGA7 following the ML method based on multiple alignments of all elements. The T value (integration time) was estimated by the relation T = D/0.2, where 0.2 represents the human genome neutral mutation rate expressed in substitutions/nucleotide/million years. For each provirus region, we provide details on the period of provirus formation in Table 4. Overall, the estimated time of integration based on LTR elements is later than that estimated based on the four regions (gag, pro, pol, and env). The LTRs integrated between 17.5 and 48.5 mya. The average time of integration was 28.83 mya. However, the majority of HML-9 elements (gag, pro, pol, and env) found in the human genome integrated between 37.5 and 151.5 mya. The average time of integration was 76 mya. There exists a very large discrepancy between the two analyses. A reasonable explanation for the difference between the two methods is as follows. The two flanking LTRs (5' LTR and 3' LTR) were identical when the provirus was integrated into the host genome. However, the internal regions contain multiple sequence differences due to the mutations accumulated during viral replication cycles, with a much higher error rate. This difference would inevitably lead to LTRs being a more accurate timing starting point for integration time estimation.

Table 4 Estimated time of HML-9 elements integration

Functional prediction of cis-regulatory regions and enrichment analysis

GREAT analysis can predict possible regulated genes based on spatial proximity. The results describing the associations between each solo LTR and its putative regulated gene(s) are shown in Additional file 5: Table S3. A total of 69 genes were predicted. Among these, 5 solo LTRs were not associated with any genes, 15 solo LTRs were associated with 1 gene, and 27 solo LTRs were associated with 2 genes (Fig. 4A). The absolute distances of 3 genes to the transcription start site (TSS) were less than 5 kb. The absolute distances of 13 genes to the TSS were between 5 and 50 kb. The absolute distances of 34 genes to the TSS were between 50 and 500 kb. The absolute distances of 19 genes to the TSS were more than 500 kb (Fig. 4B, C).

Fig. 4
figure 4

The genes associated with solo LTRs and GO Slim summaries. A The number of associated genes per solo LTR. B Binned by orientation and distance to TSS. C Binned by absolute distance to TSS. Biological process (D), cellular component (E), and molecular function (F) summaries are represented by red, blue, and green bars, respectively. The height of the bar represents the number of IDs in the gene list and in the category

To analyze the biological classification of key genes related to solo LTRs, GO Slim summaries were generated. The biological processes (BP) summary revealed that these genes were mainly enriched in biological regulation, metabolic process, multicellular organismal process, response to stimulus, cell communication, developmental process, localization, and cellular component organization (Fig. 4D). The GO Slim cellular component (CC) summary showed that these genes were significantly enriched in the membrane and nucleus, and the GO Slim molecular function (MF) summary showed that these genes were significantly enriched in protein binding and ion binding (Fig. 4E, F).

Furthermore, these potential regulatory genes were all annotated to the selected functional categories and subjected to enrichment analysis. The top 10 most significant GO terms according to FDR value for BPs included the regulation of endothelial cell chemotaxis, the regulation of natural killer cell-mediated immunity, the positive regulation of synapse assembly, natural killer cell-mediated immunity, the regulation of synapse assembly, the positive regulation of chemotaxis, synapse assembly, the regulation of synapse organization, the regulation of synapse structure or activity, and synapse organization (Fig. 5A). The bar chart shows the enrichment ratio of the results. Bars representing categories with an FDR ≤ 0.05 are shown in a darker shade (Fig. 5A). The volcano plot in Fig. 5B shows the log2 of the FDR versus the enrichment ratio for all the functional categories in the database, highlighting the degree to which the significant categories are separated from the background. The size and color of a dot are proportional to the number of overlaps (for ORA). The significantly enriched categories are labeled, and the labels are positioned automatically by a force field-based algorithm at startup. The enrichment results for the CC and MF categories are illustrated in Fig. 5C–F. It must be noted that these results are entirely prediction-based and that future research is required to confirm any of the implied associations between solo LTRs and nearby genes.

Fig. 5
figure 5

Enrichment result categories binned by biological process, cellular component, and molecular function. A, B Bar chart and customizable volcano plot of the biological process enrichment results. C and D, Bar chart and customizable volcano plot of the cellular component enrichment results. E and F, Bar chart and customizable volcano plot of molecular function enrichment results

Similar to the approach used for solo LTRs, GREAT prediction of genes putative regulated by proviral LTRs was also performed. Enrichment analysis was carried out as described for solo LTRs. The results describing the associations between each proviral LTR and its putative regulated gene(s) are shown in Additional file 6: Table S4. A total of 36 genes were predicted. Among these, 4 proviral LTRs were not associated with any genes, 6 proviral LTRs were associated with 1 gene, and 15 proviral LTRs were associated with 2 genes (Additional file 8: Figure S1A). Of these, the 4 proviral LTRs associated with no genes belong to two pairs of 5′ and 3′ LTRs of the same provirus, 2p12 and Yp11.2, respectively. In particular, 2p12 is a rather complete provirus. No genes with an absolute distance of less than 5 kb from the LTR to the TSS were found. The absolute distances to the TSSs were between 5 and 50 kb for 13 genes. The absolute distances to the TSSs were between 50 and 500 kb for 15 genes. The absolute distances to the TSSs were more than 500 kb for 8 genes (Additional file 8: Figure S1B, C). The GO Slim summaries for the biological classification of key genes related to proviral LTRs are shown in Additional file 8: Figure S1D–F. The enrichment results for BP, CC, and MF categories are shown in Additional file 9: Figure S2, Additional file 10: Figure S3, Additional file 11: Figure S4.

In silico examination of the conserved transcription factor binding sites

Specific base insertions in HML-9 elements may influence the complexity of LTR transcriptional regulation [16]. A complete view of the putative transcription factor binding sites within the HML-9 LTR is shown in Fig. 6A. A total of 22 human transcription binding sites were predicted for 19 transcription factors: EHF, SOX10, FOS, FOSL1, FOSL2, JUNB, JUND, ETV4, KLF1, KLF5, KLF7, ZNF263, THAP1, SP4, RBPJ, HAND2, MAZ, NEUROG2, and NEUROD1. The motifs are marked on the sense strand and antisense strand of the consensus sequence.

Fig. 6
figure 6

In silico examination of the conserved transcription factor binding sites and logos representing the PBSs of HML-9. A The forward arrow indicates the sense strand, and the reverse arrow indicates the antisense strand. Different transcription factors are marked with different colors. B PBS nucleotide sequence

PBS type of HML-9 sequences

Traditionally, HERVs have been named according to the tRNA that binds their RT enzyme and PBS [59]. Thus, HERV-K is named after the lysine-tRNA. In the 11 proviral and consensus sequences of HML-9 elements analyzed, the PBS was located approximately 3–20 nucleotides downstream of the 5′LTR. To summarize the general variation of the PBS sequence within the HML-9 group, we generated a logo in which the letter height is proportional to the nucleotide conservation at each position (Fig. 6B). The results showed that the TGG starting nucleotides were the most conserved among the 18 bases. However, only the 15q21.1 and 8p11.1 PBSs belong to lysine, confirming the relatively low taxonomic value of this feature (Additional file 7: Table S5) [61, 62].

Discussion

At present, the HML1-8 and HML10 groups have been characterized and identified [9, 16, 48, 49, 60,61,62,63,64]. However, existing research on HML-9 elements is very limited [44]. A descriptive study of HML-9 elements, especially the characterization and description of their features, is still lacking. Characterization of the genomic distribution of the HML-9 group is critical to understanding the regulation of the expression of this family in healthy humans and its relationship with diseases. Therefore, it is necessary to perform a systematic and comprehensive characterization of HML-9.

Our research followed the approach carried out in previously published studies [9, 48], completely mapping out the HML-9 proviruses and solo LTRs in the human genome and thus providing an exhaustive characterization of this group, including genomic distribution, structural characterization, phylogenesis, integration time analysis and regulatory function prediction. A total of 23 HERV-K HML-9 proviruses and 47 solo LTR elements were characterized. The chromosomal distribution of these proviruses and the solo LTRs revealed a nonrandom integration pattern. HERV-K HML-9 elements are usually enriched outside transcription units in the human genome [15, 65]. The results showed that these elements are mainly distributed in intergenic regions and introns. This may be because the integration of a HERV provirus within the transcription unit is harmful and therefore subject to negative selection and elimination during evolution [12, 15, 21,22,23,24]. In particular, the number of proviruses on the Y chromosome was significantly different from that predicted by the chi-square test (p = 0.01), which indicates that the male-specific region of the Y chromosome (MSY) accumulates higher densities of HERVs and associated sequences, consistent with previous studies [65].

Phylogenetic analyses showed that 5 sequences of HML-9 near-full-length proviruses as well as 10 gag elements, 8 pro elements, 11 pol elements, and 13 env sequences formed a unique monophyletic cluster, clearly divided from other HML groups, supported by the maximum bootstrap value. The phylogenetic trees of the pro and pol regions both revealed the presence of two well-supported clusters, identified here as HML-9 types a and b, which were statistically supported by bootstrap values of 100. The HML-9 type b cluster included the Dfam HML-9 reference, whereas the HML-9 type a cluster showed more divergence relative to the group references. In addition, the integration time of HML-9 proviruses was calculated using the LTR, gag, pro, pol, and env regions. The results indicated that the LTRs integrated between 17.5 and 48.5 mya. However, the main period of HML-9 integration based on 4 internal regions is between 37.5 and 151.5 mya. The difference in estimated integration time between the two methods likely occurred because internal coding regions can accumulate mutations during every replication cycle, while two identical LTRs integrate into the host genome during the integration phase [66]. Therefore, it is more reasonable to use LTRs to evaluate the integration time.

Furthermore, we performed prediction and cluster analysis of potential regulatory genes for both the HML-9 provirus and solo LTRs. A total of 69 genes were predicted. BP and MF analyses showed that these genes were associated with synapses. Previous studies have shown that HERV-W can interfere with neuronal protrusions and alter N-methyl-D-aspartate receptor (NMDAR)-mediated synaptic organization and plasticity through glia- and cytokine-dependent changes [67]. Here, our work suggested that HML-9 LTR-regulated genes may also be widely involved in the function of synapses. Furthermore, the prediction of TFBSs in HML-9 elements by JASPAR also indicated that HML-9 is likely to play a role in regulating downstream genes. In addition, for the PBS analysis of HML-9 elements, the results showed that the TGG starting nucleotides were the most conserved among the 18 bases. Similar to previous work [68, 69], we identified 11 proviral PBS sequences and confirmed that this nomenclature is imprecise because although HML-9 belongs to the HERV-K subgroup, only the PBSs of 15q21.1 and 8p11.1 belong to lysine. It should be noted that these results are entirely prediction based. Experimental validation studies are required to confirm the associations between these elements and these genes.

Conclusion

A previous study of HML-9 (HERVK14C) indicated that HML-9 could exert its effects in different tissues under physiological conditions as well as during disease development, possibly contributing to immune regulation and antiviral defense [44]. To systematically study the important role of HML-9 in pathological and physiological processes, the current work provides a clear and detailed description of all HML-9 elements integrated into the human genome, which could contribute to better defining the real impact of these elements and their contribution to the genome.