Introduction

DNA and RNA molecules are essential for life. For example, mice lacking a precise DNA repair mechanism exhibit multi-level abnormalities Wilson and Thompson (1997). The studies of DNA and RNA conformational changes and self-organization are extremely computationally demanding. Even a small RNA loop, such as UUCG poses a significant challenge for computational methods Mráziková et al. (2020)The use of molecular dynamics in all-atom representation has been applied to study non-coding RNA functions (Palermo et al. 2019), RNA and DNA self-assembly (Yoo et al. 2020) and local DNA flexibility (Mazur 2011). However, for the huge systems (over 1,000,000 atoms) time scales are limited to the \(\mu\)s range Yan et al. (2015), Hang et al. (2015). Therefore, for systems requiring longer time scales, coarse-graining is needed. In coarse-grained force fields, several atoms are merged into one center of interaction enabling a significant speed-up of the conformational sampling due to the smaller number of interaction centers and smoothening of the potential energy surface. Coarse-grain application to various macromolecules has been described in a recent review Dhamankar and Webb (2021).

CG models of biomolecules are constructed in a wide range of resolutions. One of the most commonly used model is Martini. For nucleic acids, it uses 6-7 interaction sites per nucleotide Uusitalo et al. (2015). PRIMONA, a CG model with 12-13 interaction sites per nucleotide can retain atomistic accuracy, although the simulation speed-up is moderate Gopal et al. (2010). Lower-resolution models include 3SPN (named for having 3 sites per nucleotide) (Sambriski et al. 2009; Hinckley et al. 2013) and oxDNA (with four sites per nucleotide) (Ouldridge et al. 2010), which can handle longer polynucleotide chains. One of the most promising models is the coarse-grained (CG), Nucleic-Acid united RESidue 2-points (NARES-2P) model which has only two centers of interaction per repeating unit (He et al. 2013; Liwo et al. 2014; He et al. 2015). It has been successfully applied to RNA and DNA structure prediction (Sieradzan et al. 2018), DNA association studies (He et al. 2013) and telomere properties predictions (Sieradzan et al. 2017). It can speed up the simulation by several orders of magnitude as in this model several atoms are merged into one center of interaction. NARES-2P is part of the CG force field family UNICORN (Liwo et al. 2021). For a part of UNICORN dedicated to proteins (UNRES) 4000-speed up is obtained (Khalili et al. 2005) in comparison with all-atom simulation. However, CG has several drawbacks: Usually, the accuracy of the structure is lower than for an all-atom (AA) model, energy function can be significantly more complicated and temperature-dependent (especially if the model is intended to be transferable), and finally CG model requires back-mapping (reconstruction of AA structure) to regain atomistic details. Multi-scale simulations which use different levels of representation reveal a promising potential (Peter and Kremer 2009). There are known, successful, back-mapping algorithms (Wassenaar et al. 2014; Poblete et al. 2018) for 3(5) interaction centers per repeating unit for SPQR coarse-grained force field (Poblete et al. 2017) or 4 heavy atoms to 1 interaction center MARTINI coarse-grained force field Uusitalo et al. (2015). Another successful, example of back-mapping for a very coarse-grained representation (only 3 centers of interaction per repeating unit) was developed by Shimizu and Takada (2018).

Previously for NARES-2P model, we used the xleap software (Case et al. 2016) to convert the coarse-grained structure to all-atom representation Sieradzan et al. (2017). In that procedure, the position of the center of mass of the sugar ring was assigned as the C1’ atom and the position of the nucleoside center was assigned as the C5’ atom. This led to significant structure distortion and the recovered structure could only be used for visualization purposes in cartoon representation. In this article, we present new software that restores the AA structure from the NARES-2P model which makes it possible to grasp atomistic details while enabling large times scale simulations. Compared to the back-mapping algorithms referred to above (Wassenaar et al. 2014; Poblete et al. 2018, 2017; Uusitalo et al. 2015) converting the NARES-2P structures to all-atom structures was more challenging since only two interaction sites are present in the NARES-2P model.

Methods

NARES model of the nucleic-acid chain

In the NARES-2P model He et al. (2013); Liwo et al. (2014), a polynucleotide chain is represented by a sequence of deoxyribose (for DNA) or ribose (for RNA) ring centers (S) with united nucleoside (sugar-nucleic base;B) centers attached to them and united phosphate groups (P) located in the midpoint between the consecutive ring centers (Fig. 1). As the C\(\alpha\)s in UNRES, the virtual atoms S only serve to define the geometry of a virtual chain, while the B and P centers are the interaction sites.

Fig. 1
figure 1

Comparison between the all-atom representation of the nucleotide chain (sticks) and the coarse-grained NARES-2P model (circles, ellipsoids, and thick black lines). Red circles represent the united sugar groups (S), which serve only as geometric reference points. Yellow circles represent the united phosphate groups (P). Ellipsoids (light blue) represent united nucleosides, with their geometric centers shown as solid green circles. The P’s are located halfway between two consecutive sugar atoms. Dipoles are located on the bases to represent their electrostatic interaction. The electrostatic part of the base-base interactions is represented by the mean-field interactions of the base dipoles (red arrows). Virtual-bond angles, \(\theta\), and virtual-bond dihedral angles, \(\gamma\), are used to describe the backbone geometry. The base orientation angles \(\alpha\) and torsional angles \(\beta\), which define the location of a base with respect to the backbone, are also indicated. Adapted from Liwo, A., Baranowski, M., Czaplewski, C. et al. A unified coarse-grained model of biological macromolecules based on mean-field multipole-multipole interactions. J. Mol. Model. 20, 2306 (2014)

Back-mapping algorithm

The algorithm (Fig. 2) for back-mapping starts with reading the coarse-grained coordinates and sequence from a NARES-2P PDB file.

Fig. 2
figure 2

The backbone reconstruction algorithm block diagram

Subsequently, the nucleoside library is read. In the library, the B-DNA and A-RNA nucleosides are used as a reference for the DNA and RNA, respectively. For each nucleoside in the library there is one structure (one rotamer). Nucleosides in the library were constructed with the use of 3DNA (Li et al. 2019) with calf thymus DNA model which corresponds to \(\hbox {B}_{II}\) structure. The O5’ oxygen atom is not present in the library. As a first step, the nucleosides are reconstructed (Fig. 3). Moving through the chain from 5’ to 3’-end nucleoside by nucleoside an all-atom structure is taken from the library. The nucleoside geometries in the library are described in a local coordinate system based on the following four points: the geometric center of the sugar ring (corresponding to S), nucleobase center of mass (corresponding to B), as well as C5’ and O3’ atoms. The local coordinate system is set up so that the X axis is coaxial with the S-B vector, the origin is in the midpoint of the S-B vector, and the Y axis is parallel to the C5’-O3’ vector’s component normal to the S-B vector.

Fig. 3
figure 3

Superposition of the NARES-2P side-chain (blue spheroid) with the reconstructed all-atom structure from the library. The center of mass of the sugar ring of the all-atom representation and the center of mass of the nucleobase are indicated as black circles and termed S and B, respectively. The center of the NARES-2P side-chain and the NARES-2P sugar-base vector are indicated as a pink circle and black vector, respectively

During reconstruction the nucleosides are placed so that the origin of the local coordinate system is in the midpoint of the CG model’s S-B vector, the X axis of the local coordinate system coincides with the CG model’s S-B vector, and the Y axis of the local coordinate system is parallel to the CG model’s \(B_{(i-1)}-B_{(i+1)}\) vector, where i refers to the index number of the residue. In the case of the first and last nucleoside \(B_i\) replaces the nonexistent \(B_{-1}\) and \(B_{(n+1)}\) points (where n is the total number of residues in the oligonucleotide chain) (Fig. 4).

Fig. 4
figure 4

Schematic representation of sugar-base rotation during back-mapping. The black circle represents the center of mass of the sugar (S) and a nucleobase(B). The pink circle indicates the midpoint of the sugar-base vector. The red arrow is the normal vector to the S-B vector in direction of the previous sugar

Once the nucleoside is recovered the missing P and O5’ atoms are reconstructed in the next step. These atoms are placed on the plane containing the \(O3'_i\) and \(C5'_{(i+1)}\) atoms reconstructed in the previous step, as well as the midpoint of CG \(B_i-B_{(i+1)}\) vector. O5’ is placed in 0.735 fraction of all-atom \(O3'_i\) and C5\(_{i+1}\) vector and then shifted by 1.32Å in direction perpendicular to \(O3'_i-C5_{i+1}\) vector and opposite to the NARES-2P \(\hbox {B}_i\) to \(\hbox {B}_{i+1}\) vector midpoint. P atom is placed in 0.238 fraction of all-atom \(O3'_i\) and C5\(_{i+1}\) vector and then shifted by 1.32Å in direction perpendicular to \(O3'_i-C5_{i+1}\) vector and opposite to the NARES-2P \(\hbox {B}_i\) to \(\hbox {B}_{i+1}\) vector (Fig. 5). Two remaining oxygen atoms (OP1 and OP2) are remapped with the constraint of the tetrahedral structure of the phosphate group and constant bond length.

Fig. 5
figure 5

Schematic representation of backbone phosphorus atom and oxygen atom reconstruction. The pink circles indicate the three points used for P and O5’ plane reconstruction. The green triangle determines the plane for placement of the P and O5’ and the green arrows are vectors of shifting of P and O5’ on the indicated plane from O3’ - C5 vector (black line)

The fully reconstructed structure sometimes has clashes and/or inappropriate bond lengths in the sugar-phosphate backbone. Therefore, a short minimization with the use of the AMBER14 ff14SB force field is used with implicit solvent Maier et al. (2015). Firstly, 400 steps of the steepest descent minimization without any restraints are performed to remove any clashes, bad bond lengths, etc. Secondly, 500 steps of the steepest descent minimization with restraints are performed. The torsional restraints were imposed on the 4 atoms adjacent to: C1’, C2’, and C3’ atoms. The flat bottom restraints with force constant 25 kcal/mol/rad\(^2\) and the flat bottom with of 20\(^\circ\) were imposed. The restraints are based on the initially reconstructed structure. This drives toward the initial NARES-2P structure and allows us to obtain a structure as close to the initial structure without any clashes.

Application of the back-mapping algorithm

We transformed several experimental structures to coarse-grained NARES-2P representation: (a) the pyrimidine motif triple helix in the Kluyveromyces lactis telomerase RNA pseudoknot (PDB code: 2M8K) (Cash et al. 2013) (b) the 222-parallel G4 quadruplex (GGTTGGTTGGTTGG) (c) the NMR Structure of a 21 bp DNA duplex preferentially cleaved by Human Topoisomerase II (PDB code: 2JYK) (Masliah et al. 2008) (d) the crystal structure of a nucleic acid four-way junction (PDB code: 1EGK)(Nowakowski et al. 2000) (e) high-resolution Dickerson-Drew Dodecamer B-DNA (PDB code: 4C64) (Lercher et al. 2014). The reduced representation was then subjected to the back-mapping algorithm. The obtained all-atom structure was refined as explained in the back-mapping procedure.

As the next target to test the algorithm a ‘real-case’ scenario was used. The DNA duplex structure (PDB code: 9BNA) (Westhof 1987). The DNA duplex chains were separated by the distance of 7Å and each chain was extended. Multiplexed replica-exchange molecular dynamics (MREMD)(Czaplewski et al. 2009) was performed with a temperature range 200–350 K with 10 K intervals and two replicas in each temperature (32 trajectories per system in total). For each trajectory, 60 mln steps with time step 0.498fs were performed. The Berendsen thermostat was used with a coupling time of 49.8 fs. The exchange was attempted every 20000 steps. After simulation bin-less weighted histogram analysis (Kumar et al. 1992; Czaplewski et al. 2009) was performed to obtain different ensemble properties, and finally clustering at 220K was performed to obtain five dominant clusters. For the all-atom structure reconstruction, only the most probable cluster was used. The all-atom structure was reconstructed and refined as explained in the earlier section.

Data analysis

After recovering all-atom representation, the structure was subject to performance analysis. The total RMSD, by root-mean-square deviation per residue (root-mean-square fluctuations, RMSF), was computed, Euclidian RMSD (eRMSD) (Bottaro et al. 2014), which indicates the similarity of contact maps between two structures, hydrogen bonds pattern, and the distribution of the torsional angles. The angles (\(\alpha\)-\(\chi\)) definition is shown in Fig. 6 (Frellsen et al. 2009). The eRMSD and the relevant dihedral angles were computed with Barnaba software (Bottaro et al. 2019). The angles’ probability was computed with a bin size of 15\(^\circ\). The hydrogen bond pattern was computed with cpptraj with default options.

Fig. 6
figure 6

The seven relevant dihedral angles in the central nucleotide (\(\alpha\) to \(\zeta\) ) are indicated with red labels. Each label is placed on the central bond of the four consecutive atoms that define the dihedral angle. The \(\chi\) angle describes the rotation of the base relative to the RNA backbone, while the six other angles define the course of the backbone. All atoms in the central nucleotide are labeled and colored according to atom type (oxygen: red, phosphor: yellow, nitrogen: blue, and carbon/hydrogen: grey). For clarity, the base is only partly shown. Reproduced from Frellsen et al. (2009)

Results and discussion

In Table 1, the RMSD and the eRMSD of the back-mapped structures before and after minimization are shown. In all cases when the idealized coarse-grained structure is used the RMSD of the back-mapped structure is below 1.5Å and the eRMSD is below 0.9. This indicates the high accuracy of the back-mapping algorithm. In the case of the simulated association of the DNA, the RMSD is less than 4Å for the dominant cluster.

Table 1 RMSD(Å) and eRMSD for structures reconstructed before and after minimization

The RNA pseudoknot experimental structure (Fig. 7A) was converted into coarse-grain representation (Fig. S1B) and then back-mapped to all-atom representation (Fig. S1C). The structure has some clashes which arise from using the limited library of rotamers - only one rotamer available (Fig. S1C). A short minimization leads to the removal of all clashes and the final RMSD of the all-atom structure is 0.90Å (Fig. 7B) which is a very good result as the back-mapping algorithm recovers all-atom structure from a low-resolution one.

Fig. 7
figure 7

Comparison of the RNA pseudoknot experimental structure (A) with structure converted to coarse-grain representation and reconstructed by NARall software after short refinement (B). The coloring is blue to red based on the RMSD of each residue

The largest differences are observed in the loop-forming triplex structure (Fig. S2). The 14 native H-bonds and 19 non-native H-bonds were observed, with the eRMSD value of 0.65. This indicates that the contact pattern is preserved, though the H-bonds for triplex structures are not well preserved and the short structure optimization leads to the occurrence of more hydrogen bonds than in the native structure.

The 222-parallel G4 quadruplex (Fig. 8A) was converted into coarse-grained representation (Fig. S3B) and then back-mapped to all-atom representation (Fig. S3C) and refined with all-atom minimization (Fig. 8B). The final structure after short minimization revealed a low RMSD of 1.35Å revealing the largest deviation in the loop region (Fig. S4). The short refinement improved the quality of the backbone conformation (Fig. S3C and D). None of the hydrogen bonds were correctly reproduced, though the 5 out 6 hydrogen bonds had either proper donor or acceptor atoms, with eRMSD of 0.87. This indicates that even though the h-bonds are reproduced correctly, the rotamers in the case of quadruplex require improvements (Fig. 8B).

Fig. 8
figure 8

Comparison of the 222 G-quadruplex structure (A) with structure converted to coarse-grain representation and reconstructed by NARall software after short refinement (B). The coloring is blue to red based on the RMSD of each residue

The NMR Structure of a 21 bp DNA duplex preferentially cleaved by Human Topoisomerase II (Fig. 9A) was converted into coarse-grained representation (Fig. S5B) and then back-mapped to all-atom representation (Fig. S5C) and refined with all-atom minimization (Fig. 9B). The final structure after short minimization has an RMSD of 0.75Å which is significantly lower than the RMSD of the experimental structure ensemble (4.03Å). The largest deviations are observed at C3’ and C5’ ends (Fig. S6), especially in the backbone position of the phosphate groups (Fig. 9B). The eRMSD is 0.224 revealing that all contacts are preserved, moreover, 40 out of 40 h-bond obtained after structure recovery are present in the native structure(0 non-native h-bonds), however, 9 h-bonds were missing.

Fig. 9
figure 9

Comparison of the NMR Structure of a 21 bp DNA duplex preferentially cleaved by Human Topoisomerase II (PDB code: 2JYK) (A) with structure converted to coarse-grain representation and reconstructed by NARall software after short refinement (B). The coloring is blue to red based on the RMSD of each residue

The crystal structure of a nucleic acid four-way junction, which is an RNA/DNA hybrid (Fig. 10A) was converted into a coarse-grained representation (Fig. S7B) and then back-mapped to all-atom representation (Fig. S7C) and refined with all-atom minimization (Fig. 10D). The structure after recovery has 0.85Å RMSD to the native crystal structure, which is significantly below the crystal structure resolution (3.1Å). It should be noted that there is no direct relation between the RMSD of the prediction or the back-mapped structure to the crystal structure resolution. However, both measurements indicate the quality of the prediction. The contacts between the bases are similar to the crystal structure with the eRMSD of 0.497. The rebuilt structure has 78 h-bonds (61 native and 17 non-natives) and 20 h-bonds that are present in the crystal structure are missing in the rebuilt structure. The RNA is rebuilt more accurately than the DNA (Fig. S8).

Fig. 10
figure 10

Comparison of the crystal structure of a nucleic acid four-way junction (PDB code: 1EGK) (A) with structure converted to coarse-grain representation and reconstructed by NARall software after short refinement (B). The coloring is blue to red based on the RMSD of each residue

The ultra-high resolution Dickerson dodecamer (CGCGAATTCGCG; PDB code: 4C64) was converted to coarse-grained representation (Fig. S9B) and back-mapped with the NARall algorithm (Fig. S9C). After back-mapping the RMSD is 0.51Å, while the eRMSD is 0.38 (Fig. S9), whereas after short minimization the RMSD increases to 0.61Å, while eRMSD drops to 0.18 (Fig 11). The RMSF plots reveal the uniform distribution of quality of back-mapping along the chain without any significant raise at the ends (Fig. S10). Despite very low RMSD, out of 26 h-bonds present in the native structure: 15 h-bond were predicted correctly with 11 h-bonds missing and 3 h-bonds predicted incorrectly.

Fig. 11
figure 11

Comparison of the crystal structure of a Dickerson dodecamer (PDB code: 4C64) (A) with structure converted to coarse-grained structure and then obtained from NARall after short refinement (B). The coloring is blue to red based on the RMSD of each residue

We analyzed the angle distribution of the four rebuilt structures and compared them with experimentally determined (Fig. 12). The angles \(\alpha\) and \(\gamma\) reveal a single maximum in the experimental structures while in our rebuilt structures those reveal two maxima with one overlapping with the experimental and one maximum shifted. The \(\beta\) angles are predicted correctly as the distributions overlap well. The \(\delta\) angles have bimodal distribution both in experimental and in our rebuilt structures. One maximum (\(-75^\circ\) angle) overlaps well while the second overlap only partially with angles shifted by approximately 30\(^\circ\). The \(\epsilon\) angle distribution reveals a single maximum in the experimental structures while our rebuilt structures reveal two maxima, however, those two maxima overlap well with the experimental distribution. In the case of the \(\zeta\) angles, the distribution reveals two well-separated maxima, while in our rebuilt structures those maxima are not well separated leading to an excessively narrow distribution of this angle. In the case of \(\chi\) angle distribution, the two maxima revealed are similar in our rebuilt structures but are shifted by approximately 15\(^\circ\).

Fig. 12
figure 12

The \(\alpha\), \(\gamma\) (A), \(\beta\), \(\delta\) (B) \(\epsilon\), \(\zeta\) (C), and \(\chi\) (D) angles distribution for the experimentally determined structures and the rebuilt structures with our algorithm for four chosen nucleic acids

In the case of DNA duplex folding the dominant structure obtained after clustering is a properly paired DNA duplex (Fig. 13B). However, the NARES-2P force field has the tendency to produce excessively stretched structures (the \(5^\prime\) to \(3^\prime\) distance is too large) in comparison with the native duplex (Fig. 13A). This leads to multiple clashes after the recovery of the all-atom structure (Fig. 13C) as the base pairs are too close to each other. Those clashes are easily removed by the short refinement (Fig. 13D). Despite starting from separated chains and simulating folding, the final obtained structure after refinement has a low RMSD of 3.90Å. The largest deviations are observed in the terminal fragments of the chains (Fig. S11). The rebuilt structure has 17 native hydrogen bonds and 4 non-native hydrogen bonds, 11 hydrogen bonds were missing, while the eRMSD value is 1.54. This indicates that the ab initio folded structure has most of the contacts and local structures predicted correctly.

Fig. 13
figure 13

Comparison of the DNA duplex experimental structure (A) with coarse-grain dominant structure obtained after MREMD simulation (B), reconstructed coarse-grained structure directly from NARall software (C) and after short refinement (D). The coloring is blue to red based on the RMSD of each residue

Conclusions

In this article, we presented a new algorithm to back map NARES-2P coarse-grained structures to all-atom structures. Our back-mapping algorithm was applied to both idealized structures and real-case scenarios. This algorithm enables recovering all-atom details with good precision, however, the result depends on the quality of the NARES-2P model and the topology of the system. Our algorithm enables studying DNA/RNA dynamics in longer time ranges with only mild loss of resolution. Finally, as we now have a tool to recover the all-atom structure of the NARES-2P, we can take part in the RNA-puzzles experiment (Cruz et al. 2012) and we are currently taking part in the ongoing CAPRI experiment which has a protein-nucleic acids interface. The most of atomistic details are preserved as the RMSD and eRMSD to the reference structures are low, but the precision of h-bond recovery could be improved, we plan to improve this algorithm by introducing nucleobase-rotamers library dependence of the backbone conformation and improve the angle distribution mainly for \(\alpha\) and \(\gamma\) angles. Our method should be handled with care when applied to study hydrogen bonds pattern, and when the conformation of the backbone is essential for the non-helical structures.

The software is available from www.unres.pl webpage (NARall software).