1 Introduction

The biophysical properties of DNA have been extensively studied in the context of mutagenesis, leading to carcinogenesis. The flexibility of DNA is influenced by temperature, ionic strength, and protein binding [1]. The DNA secondary structure can create regions of increased tension, rendering them more susceptible to breakage or mutagenic events [2, 3].

Both internal and external factors can trigger mutations. External factors include chemical carcinogens such as tobacco smoke [4], ultraviolet rays [5], exposure to ionising radiation [6], and infectious pathogens such as human papillomavirus (HPV) [7], hepatitis B virus (HBV) [8], and leukaemia-inducing retroviruses, including human T-cell leukaemia retrovirus I (HTLV-I) [9] and murine leukaemia retrovirus (MLV) [10]. Intrinsic factors, in contrast, encompass replication errors, reactive oxygen species generated during metabolism, and inherited genetic anomalies. There remains an ongoing debate about the existence of specific target sequences prone to mutations. For instance, ionising radiation often elicits random mutations, whereas carcinogens like benzopyrene specifically target genes like p53 through intercalation into the groove of DNA [11]. Furthermore, oncogenes such as c-myc [12], N-myc [13, 14], Stat5 [15], and Zfp521 [16] serve as hotspots for mutations induced by MLV, ultimately culminating in the formation of lymphomas arising from the mutated cells.

After the retroviral infection, MLV genomic RNA undergoes reverse transcription within host cells to synthesise DNA. This DNA encodes an integrase enzyme that catalyses the insertion of mutations, specifically the insertion of viral DNA into the host DNA. Subsequently, this enzyme associates with the MLV reverse-transcribed DNA ends to form a pre-integration complex (PIC). The PIC navigates through the nuclear membrane to engage with the host cell DNA. Notably, hydroxyl groups at the termini of MLV-DNA within the PIC attack the host DNA, creating a staggered nick before and after a specific four-base region. This process leads to the duplication of these bases (Fig. 1). Consequently, retroviral DNA is integrated into infected cells, thereby becoming a carrier for the retroviral genome. Integrated retroviral DNA, termed the provirus, may affect the control of host DNA gene activation, potentially contributing to various diseases [17]. The perturbation of growth regulation genes within host cells may give rise to tumours, such as leukaemia/lymphoma and soft tissue tumours [18]. Historically, the interaction between the PIC and host DNA has been perceived as random and unaffected by the primary DNA sequence of the host [19]. However, recent insights have suggested a potential preference for palindromic sequence motifs [20]. A palindrome is a segment where the sequence reads the same forwards on one strand as it does backwards on the complementary strand. Dimeric transcription factors and retroviral integrases can recognise these sequences. Long palindromic-like sequences have the potential to form secondary structures due to intramolecular hydrogen bonding within the same DNA strand, possibly forming tertiary structures within the strand [3, 21].

Fig. 1

Pre-integration complex (PIC) and insertion into the target segment 5'-AAAC-3' within the Zfp521 gene. The PIC comprises MLV-DNA and dimeric integrases that bind to the termini of the DNA. For simplicity, the integrase is omitted in this figure (see Fig. 3B). The red double-lined curve represents the PIC DNA. The terminal –OH group attacks phosphodiester bonds on opposite strands of the DNA, introducing nicks before and after the 5'-AAAC-3' sequence, and the target oligonucleotides are duplicated after insertion. The 5' and 3' labels indicate the DNA ends and denote its direction, with genes being transcribed in the 5' → 3' direction

The Zfp521 gene has been reported to contain the target sequence for MLV genome insertion [16]. Below, we consider this model of insertion into the target sequence. However, the mechanism underlying the distribution of mutation positions within the target sequence remains enigmatic. Here, we propose two models. The first posits that the PIC collides with the target DNA segment, resulting either in insertion at the collision position or in insertion during subsequent re-collisions with neighbouring sites. The second model posits that PIC-DNA interactions follow a probabilistic process, treating the Zfp521 DNA as a diffraction lattice and the PIC as a presumed wave. Consequently, we emphasise the similarity between the probability density function of the insertion position, which represents the frequency of insertional mutations, and the lattice scattering pattern generated by the presumed PIC wave.

2 Methods

2.1 Mouse strains

To examine the MLV insertion mechanism, we have employed an inbred mouse strain, SL/Kh (Materials and Methods). SL/Kh mice possess a hereditary MLV genome (AKV1 provirus) from their progenitor strain, AKR, shared with AKXD mice [2]. In other strains, such an inherited MLV genome contributes to the development of lymphoma [2, 15, 16] and autoimmune diseases [17]. SL/Kh mice spontaneously produce MLV originating from the AKV1 provirus at birth. These MLV particles infect early lineage B cells in the bone marrow, leading to the insertion of MLV genome DNA, following reverse transcription of the RNA genome, into the genomic DNA of these B cells. By the age of three months, a considerable majority (> 95%) of SL/Kh mice acquire AKV1 genome insertion events within Zfp521 in B cells, which subsequently instigate the clonal proliferation of lymphocytes [16, 21] (Fig. 2). This gene has been postulated to play a vital role in B cell development [16] and chondrocyte maturation [22]. SL/Kh strains were bred and maintained at the RIKEN BioResource Centre (http://en.brc.riken.jp/) [15]. All mice used in this study were handled in strict adherence to the guidelines for good animal practice, as defined by the pertinent national and local animal welfare bodies. Additionally, all animal experiments were approved by the Kyoto University Ethics Committee for Animal Experiments. The SL/Kh mice were bred and propagated over a span of 5 years without any intervention and were subjected to continuous observation throughout the study. Notably, these mice developed lymphoma within a timeframe ranging from 3 to 6 months after birth.

Fig. 2

The structure of the Zfp521 gene. Exons 1, 2, and 3, and the position of the target segment in the second intron between Exon 2 and Exon 3 are shown. Illustrated are three boxes in grey, black, and white representing the long terminal repeat (LTR), which functions in reverse transcription and insertion reactions and acts as a promoter for the transcription of downstream genes. A wavy line connecting these LTRs indicates the integrated retroviral genome, indicated by boxes. LTRs are located at both ends of the viral genome. PCR primers are set within the host cell DNA for Zfp521 and inside MLV, allowing for the amplification of the junction between MLV and Zfp521 in two-step PCR reactions [16]

2.2 Polymerase chain reaction for identification of murine leukaemia retrovirus insertion site

Upon the development of lymphoma in the SL/Kh mice (typically developing around the sixth month after birth), DNA was extracted from the lymphoma tissue. To directly identify MLV insertion into the second intron of Zfp521, a Zfp521 forward primer (1F: 5'-CGGCCACGAGGAAGTGTAG-3'), an MLV reverse primer (1R: 5'-TACAGAAGCGAGAAGCGAGC-3'), and a nested reverse PCR primer (2R: 5'-AGTGCTTGACCACAGATATCC-3') (8233–8252) were used (Fig. 2). The virus–host junctions were amplified in a 50 μL reaction mixture containing 2.5 mM dNTP, 10 pmol/mL primer, and 0.25 U of Taq polymerase (Expand Long Template PCR System) (Roche Diagnostics, Mannheim, Germany). The PCR amplification process was executed over three steps, conducted under the following conditions in a thermal cycler (Perkin–Elmer/Cetus): The initial phase consisted of 10 cycles (30 s at 94 °C, 40 s at 62 °C, and 1 min at 68 °C), preceded by an initial denaturation step (1 min at 95 °C). The second phase encompassed 20 cycles (30 s at 94 °C, 40 s at 62 °C, and 4 min plus an extended 20 s during one cycle at 68 °C). The final PCR phase comprised elongation (10 min at 72 °C).

2.3 Computational simulation

A comprehensive description of the algorithm is provided in the appendix. The simulation was performed using Mathematica version 11 (Wolfram, Champaign, IL) and Python 3.0 on Jupyter Notebook (6.3.0). At the onset, each particle was positioned at coordinate x = 25 (Fig. 3A), at the centre of the designated hotspot segment (x = 1 to 55). Subsequently, as collisions ensued, each particle either remained stationary or was re-located to another position, according to established probabilities. Re-location typically occurs in increments of 10 base pairs, with a 1 to 3-base pair adjustment. In the first collision model, as the code progresses through iterations, each PIC either stays in place or is displaced based on its dynamically updated stationary probability. If a PIC is displaced, the algorithm encoded in the program determines its re-location, taking into account probabilistic behaviours. Upon completion of the default simulation duration for an ensemble of 170 PICs, the resulting PIC distributions are visually represented as a histogram. As iterations progressed, there was a gradual increase in the likelihood of particles retaining their stationary positions.

3 Results

3.1 Frequent murine leukaemia retrovirus insertion into Zfp521

The MLV insertion sites were identified using primers for MLV and Zfp521 via the PCR method. Each insertion site is located in the approximate 50 bp region within the second intron (Fig. 3A). Considering that the total length of the Mus musculus genome is approximately 2.7–2.9 Gb, it is highly specific for the MLV PIC complex to approach a particular gene within a narrow 50 bp region.

Fig. 3

MLV insertion site and the presumed secondary structure. A Insertion sites and the counts. The X-axis depicts the positions of these integration sites within an approximate 55 bp region situated in the second intron of Zfp521. The Y-axis represents the counts of insertion at each specific position. The red circles signify the count of insertions (total number of mice = 136, total insertions = 170). The grey bars represent the insertion counts by the simulation at each position. Error bars indicate the standard deviation of the count. The DNA sequence at the bottom of the figure denotes the Zfp521 target segment, with underlined red and blue letters in the sequence symbolising the alternating palindromic motifs. A model of the double-stranded DNA structure is also displayed at the bottom of the figure. The numbers by the DNA indicate the groove number. The wedge sites, labelled “p”, “q”, “r”, “s”, and “t”, denote hot spots of frequent MLV genome insertion. Notably, these spots are positioned at an approximate 10-base interval, corresponding to the period of a DNA helix turn. B Presumed structure in the alternating palindromic sequence. p, q, and r are hotspots for insertion mutations, and p', q', and r' indicate the locations on the complementary strand. The PIC is bound to MLV-DNA and integrase at both ends, which catalyses the MLV-DNA insertion reaction. –OH is the hydroxyl group located at the end of MLV-DNA (refer to Fig. 1). The arrow direction indicates the transcription of the target Zfp521 DNA

Zfp521 contains the target 55-base segment including an alternating palindromic sequence (Fig. 3A). The sequence is “CTGAATTGAAAC: AACTTCAGCTGTTT”, which contains two palindromic pairs: AAACA vs TGTTT and CTGAA vs TTCAG. The most frequently observed insertion site is indicated by “:”. The insertion within the segment is not random; it frequently occurs at 10 bp intervals, distributed symmetrically on both sides of hotspot r. This 10-base interval corresponds to one turn of the DNA helix (Fig. 3A, illustrated at the bottom), suggesting a potential relationship between the insertion and the periodic structure of DNA. Such an alternating palindromic sequence may form a quadruplet structure akin to a G-quadruplex [23] (Fig. 3B). In the presumed secondary structure, “r” is the most accessible point for the approaching pre-integration complex. However, these target segments are significantly shorter than the Kuhn length of DNA [24]. In addition, since there are no strong hydrogen bonds between base pairs, the probability of such a secondary structure being formed by the bending of DNA from the given segment is low. Herein, we modelled this target segment as an almost planar configuration in the following simulation for simplification. In this context, it is imperative to underscore the uniqueness of this target sequence when compared to the target segments observed in other host gene loci for MLV genome insertion, such as Stat5 [15] and c-myc [12, 21], in which the secondary structure is anticipated.

3.2 Pre-integration complex-DNA collision model

We first present a biophysical model designed to explain the dynamic behaviour of the PIC upon collision with the target DNA. The foundation of this model is a scenario where the PIC congregates at a designated locus, indicated by x = 25 in the X-axis (Fig. 3A), which was assigned to the middle position in the insertion target segment (Fig. 2). The insertions were hypothesised to result from collisions between the PIC and host DNA (Fig. 4, the simulation code is given in supplementary text).

Fig. 4

A collision and repulsion of the PIC and the target DNA. In this case, 5'-AAAC-3' and the complementary oligonucleotide 3'-GTTA-5' in the host DNA are the target. The OH-group at the end of the PIC collides with the target. The downward arrows indicate the direction of the collision, while other arrows represent the subsequent re-location after the repulsion of the PIC complex along the target DNA segment

The implemented code for this simulation was structured to first position all PICs at the initial collision point, 5'-AAAC-3' (the C position is x = 25 on the X-axis in Fig. 3A) (Fig. 4, Supplementary text). With progressive iterations, the decision of each PIC is determined by its dynamically updated stationary probability. If a PIC is displaced, its movement is governed by the encoded algorithm, taking into account probabilistic behaviours. Boundary conditions have been enforced to ensure PICs remain within the defined simulation space, with any exceeding displacements being automatically corrected. Upon completion of the default simulation duration for an ensemble of 170 PICs, the resultant PIC distributions are visually rendered as a histogram.

Following the initial collision at x = 25, each PIC particle is confronted with a probabilistic decision: either to remain at the collision site (the staying probability is set to 0.35) for subsequent insertion, or to relocate and engage in re-collision at a site 9–10 base pairs (bp) away (x = 15 and x = 34 or 35). If a PIC collides at x = 15, 34, or 35, it undergoes a subsequent determination regarding whether it will be inserted at that location or not, and this process continues as described above. Furthermore, we assume that with each successive collision between the PIC and DNA, the probability of the PIC remaining at that location increases by 20%. The collision probability for regions approximately 10 bp distant was established at 40%, with the collision probabilities for regions ± 1, ± 2, ± 3 bp distant set sequentially at 30%, 20%, and 10%. Subsequent to the delineation of this probabilistic framework, simulations were initiated, encompassing a total of 170 particles, corresponding to the observed insertion number in the murine lymphoma tissues. Employing this model, the simulations successfully emulated the probability density functions evidenced in the experimental data (Fig. 3A). The 15:25 and 35:25 ratios denote the ratios of insertion counts at x = 15 to x = 25 and at x = 35 to x = 25, respectively, obtained from the simulation. These ratios align with the observed data from actual experiments, thereby effectively accounting for the presence of the target sequence and the insertion hot spot.
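The rules described above can be sketched in a few lines of Python. This is an illustrative reading of the text, not the supplementary code: the function name `simulate_insertions`, the multiplicative interpretation of the 20% increase, the equal left/right jump probability, the random sign of the 1 to 3 bp adjustment, and the clamping at the segment boundaries are all assumptions made here.

```python
import random
from collections import Counter


def simulate_insertions(n_pics=170, start=25, p_stay=0.35, gain=1.2,
                        lo=1, hi=55, seed=0):
    """Return a Counter of insertion positions for n_pics particles.

    Assumptions (not specified in the text): the stay probability is
    multiplied by `gain` after each re-collision, jumps go left or right
    with equal probability, and positions are clamped to [lo, hi].
    """
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_pics):
        x, p = start, p_stay
        # A PIC keeps re-colliding until it "stays", i.e. inserts.
        while rng.random() >= p:
            jump = rng.choice((-10, 10))  # about one helix turn away
            # Adjustment of 0, 1, 2, 3 bp with weights 40/30/20/10 %.
            offset = rng.choices((0, 1, 2, 3),
                                 weights=(0.4, 0.3, 0.2, 0.1))[0]
            x += jump + rng.choice((-1, 1)) * offset
            x = min(max(x, lo), hi)       # boundary correction
            p = min(1.0, p * gain)        # staying becomes more likely
        counts[x] += 1
    return counts


counts = simulate_insertions()
for position in sorted(counts):
    print(position, counts[position])
```

Because the stay probability eventually reaches 1, every particle is guaranteed to insert somewhere. Changing `seed` gives a different histogram on each run, which is the run-to-run variability that limits this model.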

However, it is critical to acknowledge that in these numerical simulations, aligning with the observed results required an arbitrary specification of the aforementioned retention and collision probabilities. These parameters were retrospectively and arbitrarily defined, grounded on insertion frequency data, and are not reflective of the intrinsic structural characteristics of DNA. This delineation underscores potential inherent limitations in modelling the interactions between PIC particles and DNA through rudimentary collision scenarios in scientific exploration.

3.3 The probability function model

Next, our investigation delved into the intricate interplay between the target segment and the PIC based on the probability density function model. We hypothesised a DNA grating composed of ten slits, with each slit separated by an interval five times its width (Fig. 5). This proportion closely mirrors the actual structure of the DNA. Figure 5A depicts the rectangular diffraction grating. The N slits are spaced at intervals of d. It is assumed that the PIC approaches the grating along the y-axis like a presumed planar wave. Consequently, the amplitude of the collision probability at point P can be expressed using direction cosine (u, w) as follows:

$$\begin{aligned} U\left( P \right) & = - \frac{iA}{{\lambda s_{0} }}\exp \left( iks_{0} \right)\sum \limits_{n} \iint\limits_{A} {\exp \left( - ik\left( {\left( {nd + \lambda} \right)u + \mu w} \right) \right){\text{d}}\lambda {\text{d}}\mu } \\ & = U^{\left( 0 \right)} \left( P \right)\mathop \sum \limits_{n = 0}^{N - 1} \exp \left( { - ikndu } \right) = U^{\left( 0 \right)} \left( P \right)\frac{{1 - \exp \left( { - ikNdu } \right)}}{{1 - \exp \left( { - ikdu } \right)}} \\ \end{aligned}$$
(1)
Fig. 5

Probability function predicting murine leukaemia retrovirus (MLV) insertion into Zfp521. A Scheme of a ten-slit DNA grating. a and b represent the slit size. Ten rectangular slits are arranged: P shows the observation point, d is the interval distance between individual slits, and s0 is the distance between the observation point P and the centre of the rectangular slit PIC. The target DNA is indicated by a lattice structure in blue. The upper section illustrates a five-turned 55-bp DNA helix corresponding to the shown lattice. The PIC approaches the DNA grating from the y-axis direction. B Plot of the MLV insertion sites. The x-axis depicts the positions of these insertion sites within an approximate 50 bp region situated in the second intron of the Zfp521 gene. The y-axis represents the counts of insertion (I) at each specific position. The blue curve illustrates the function of the light diffraction pattern resulting from the 10-slit grating (Eq. (1)). The dotted red curve illustrates the count of insertions, the same as the red circles plotted in Fig. 3A. The Zfp521 target segment sequence is the same as in Fig. 3A. A double-stranded 55-bp DNA structural model is shown at the bottom of the figure. The q and r positions in B are identical to those previously reported in another mouse strain AKXD [25]

In the above,

$$U^{\left( 0 \right)} \left( P \right) = - \frac{{2iA\exp \left( { - iks_{0} } \right)}}{{\lambda s_{0} }}\frac{\sin kua}{{kua}}\frac{\sin kw b}{{kw b}}$$
(2)

and

$$u = x/s_{0} ,\; w = z/s_{0}$$
(3)

In Eq. (2), A is a constant coefficient, and k is the wave number. s0 represents the distance between the observation point and the centre of the rectangular slit. The small area element is given by dS(λ, μ) = dλ dμ (Fig. 5A). Additionally, the terms that depend on u and w in this equation indicate their dependence on the wave propagation direction. The intensity I(0)(P) at P through a single slit is expressed as follows:

$$I^{\left( 0 \right)} \left( P \right) \propto \left( {\frac{AS}{{\lambda s_{0} }}} \right)\left( {\frac{{\sin \left( {ku a} \right)}}{kua}} \right)^{2} \left( {\frac{{\sin \left( {kw b} \right)}}{kw b}} \right)^{2}$$
(4)

Subsequently, we considered N rectangles arranged at equal intervals of d. The plane wave is assumed to propagate along the y-axis. The centre of the nth rectangle is given by (nd, 0). Accordingly, the intensity I(P) is expressed as follows:

$$I\left( P \right) \propto \left| {U\left( P \right)} \right|^{2} = \frac{1 - \cos Nkdu }{{1 - \cos kdu }}I^{\left( 0 \right)} \left( P \right) = \left( {\frac{{\sin \frac{Nkdu }{2}}}{{\sin \frac{kdu}{2}}}} \right)^{2} \left( {\frac{{\sin \left( {ku a} \right)}}{ku a}} \right)^{2} \left( {\frac{{\sin \left( {kw b} \right)}}{kw b}} \right)^{2}$$
(5)

where I(0)(P) denotes the square of the absolute value of the right side of Eq. (2), which gives the intensity of the incident PIC wave passing through a single slit. N indicates the slit number. In Fig. 5B, I on the y-axis represents the prospective number of lymphoma tissues harbouring the inserted viral DNA at the Δx-nucleotide position from the cytosine at x = 25 (position r), where the insertion occurred most frequently. The insertion count at individual sites is plotted in Fig. 5B using dotted lines. We set kua = Δx, d = 5a, and N = 10 in Eq. (5). The slit number, 10, notably coincided with the number of grooves and the helix turn within the target segment. The factor of 5 relating d to a coincided with the number of nucleotide pairs between the minor and major grooves. The maximum wave intensity was set at 100. The curve was calculated as follows:

$$I\left( P \right) \propto \left( {\frac{{\sin \frac{50\Delta x}{2}}}{{\sin \frac{5\Delta x}{2}}}} \right)^{2} \left( {\frac{\sin \Delta x}{{\Delta x}}} \right)^{2}$$
(6)
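For readers tracing the algebra, the grating factor in Eq. (5) follows from the geometric sum in Eq. (1), and the arguments appearing in Eq. (6) follow from the stated substitutions. The shorthand φ = kdu is introduced here only for compactness and is not part of the original notation.

$$\begin{aligned} \left| {\frac{{1 - \exp \left( { - iN\varphi } \right)}}{{1 - \exp \left( { - i\varphi } \right)}}} \right|^{2} & = \frac{1 - \cos N\varphi }{{1 - \cos \varphi }} = \left( {\frac{{\sin \frac{N\varphi }{2}}}{{\sin \frac{\varphi }{2}}}} \right)^{2} ,\quad \varphi = kdu \\ kua = \Delta x,\; d = 5a,\; N = 10\; & \Rightarrow \;\frac{kdu}{2} = \frac{5\Delta x}{2},\quad \frac{Nkdu}{2} = \frac{50\Delta x}{2} \\ \end{aligned}$$

At the principal maxima, where φ is a multiple of 2π, this factor tends to N² = 100, which appears to correspond to the maximum wave intensity of 100 mentioned above.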

These observations suggest that the DNA structure, described by a, b, d, and N, contributes to the simulated distribution of insertion sites (Fig. 5B). To match the maximum value of the probability intensity function to the actual count of insertional mutations in Zfp521, the proportionality coefficient in Eq. (6) was set to 0.35 by aligning the maximum value at the most frequent insertion site (position r). Finally, we obtained a diffraction intensity described by the Fresnel–Kirchhoff theory as follows:

$$I\left( P \right) = 0.35\left( {\frac{\sin \pi \Delta x}{{\sin \frac{\pi \Delta x}{{10}}}}} \right)^{2} \left( {\frac{\sin \pi \Delta x/25}{{\pi \Delta x/25}}} \right)^{2}$$
(7)

where I(P) = I(Δx) signifies the prospective count of insertions at the Δx-nucleotide position from position r. For example, at Δx = 10 (bp), corresponding to the q and s positions, I(Δx) was close to 21, the actual insertion count. In summary, Eq. (7) gives the probability function of the MLV insertion.
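As a numerical check, Eq. (7) can be evaluated directly. The sketch below assumes that the grating factor enters squared, as in Eq. (5), and treats the removable 0/0 points at multiples of 10 bp by their limiting value N. The function name `insertion_intensity` and the constants `N_SLITS` and `SCALE` are illustrative names, not taken from the original code.

```python
import numpy as np

N_SLITS = 10   # number of slits, N in Eq. (5)
SCALE = 0.35   # proportionality coefficient quoted in the text


def insertion_intensity(dx):
    """Predicted insertion count at offset dx (bp) from position r."""
    dx = np.asarray(dx, dtype=float)
    theta = np.pi * dx / N_SLITS              # plays the role of kdu/2
    den = np.sin(theta)
    singular = np.isclose(den, 0.0, atol=1e-12)
    # At multiples of 10 bp, sin(N*theta)/sin(theta) tends to +/-N;
    # the sign is irrelevant because the ratio is squared below.
    ratio = np.where(singular, float(N_SLITS),
                     np.sin(N_SLITS * theta) / np.where(singular, 1.0, den))
    # np.sinc(t) is sin(pi*t)/(pi*t), i.e. the single-slit envelope.
    envelope = np.sinc(dx / 25.0) ** 2
    return SCALE * ratio ** 2 * envelope


for dx in (-10, 0, 10):
    print(dx, round(float(insertion_intensity(dx)), 2))
```

Under these assumptions the function gives 35 at Δx = 0 and about 20 at Δx = ±10, in line with the count of roughly 21 quoted for the q and s positions.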

4 Discussion

In this study, we hypothesised that (i) the PIC initially targets palindromic motifs and (ii) subsequently inserts into the site according to a probability function. We reasoned that within a short range of 55 base pairs, the influence of DNA flexibility is constrained, thereby minimising the potential for substantial structural alterations. However, it cannot be ruled out that structural changes in DNA, as depicted in Fig. 3B, may be formed with the assistance of DNA-binding proteins, such as histones.

In the first model, collisions between the PIC and host DNA can introduce nicks and initiate the insertion of the MLV genome into the host DNA. Otherwise, the PIC can be repelled, resulting in collisions with neighbouring sites, typically situated 10 bp away, or with other sites, after which insertion ensues. We attribute this 10-base-pair spacing to the relationship between the three-dimensional configuration of the PIC and the target DNA, which likely recurs periodically every 10 base pairs. In the collision model, a random probability model describes the number of insertions at positions other than x = 25; however, the simulation results inherently vary from run to run and lack robustness.

In contrast, the DNA grating model is based on the structural factors of DNA. The coincidence of the simulated probability function with Fresnel–Kirchhoff’s diffraction theory may evoke the earlier proposition by Arndt et al., who suggested a wave–particle duality for macromolecules such as C60 and, by extension, virus particles. This concept extends to the idea that insertional mutation selectivity can be explained through a diffraction pattern [26].

Notably, other than SL/Kh mice, there are few known mice wherein insertion mutations concentrate within a single gene. Moreover, repeated insertional mutations occur in a sufficiently large number of cases to permit the derivation of probable distribution patterns. Although further investigation is required to establish the universality of this observed phenomenon across all MLV insertions, it is worth noting that a similar insertion event within Zfp521 has been identified in AKXD mice, a strain closely related to SL/Kh mice (see Fig. 5B), with specific insertion sites being identical [26]. This fascinating phenomenon reinforces the notion that Zfp521 harbours a targeting sequence for the insertional mutation and lends credence to the broader applicability of the model presented here to other contexts of the insertional mutation of DNA.

The conclusion of this study, predicated on the existence of a precise target sequence, may hold broader implications for our comprehension of mutagenesis. Recent advancements in genome analysis technology have revealed analogous hotspots wherein frequent mutations are documented, as exemplified by the EGFR gene in human lung cancer [27]. The underlying reasons for the recurring incidence of probabilistic processes, such as mutations in identical locations, remain incompletely understood. However, an illuminating avenue for deeper comprehension involves biophysical investigations based on analysing DNA properties and molecular interactions, incorporating principles from quantum mechanics.

In conclusion, our study underscores the feasibility of predicting the locations of DNA mutations through the meticulous examination of probability density functions. This predictive capacity emerges from the interaction between the mutagenic PIC and host DNA. It stands as an intriguing research endeavour to delve deeper into the mechanisms underlying these mutations from the vantage point of biophysical mechanisms and the inherent physical properties of DNA.