Introduction

Advances in biological sciences and technologies have led to the accumulation of an unprecedented amount of biomolecular data. Generally speaking, biological data can be classified into several categories, including genomics, transcriptomics, proteomics and metabolomics, which are deposited in several major databanks. A quick look at these databanks gives a general perspective of the amount of biological data that is available. Currently, in GenBank, there are more than 100 million gene sequences, comprising more than 1 billion bases. In the Protein Data Bank (PDB)1, there are about 150,000 three-dimensional biomolecular structures. The availability of this tremendous amount of biological data has created unprecedented opportunities for researchers from all areas. With great opportunities come great challenges. The high dimensionality, complexity, and variety of the biological data have rendered most of the powerful traditional methods and models ineffective. Data analysis methods and models, including statistical learning, machine learning, data mining, manifold learning, graph/network models, topological data analysis (TDA), etc., have shown great promise in the big data era and have become more and more popular in bioinformatics and computational biology in the past two decades. Among these models, TDA has drawn special attention from mathematicians and computational scientists due to its unique characteristics. Unlike general data analysis models, TDA studies topological invariants, which are global intrinsic structural properties. Roughly speaking, TDA can identify the “shape of the data”, and thus works as a powerful tool for simplification and dimensionality reduction. The key component of TDA is persistent homology (PH), which is developed from computational topology and algebraic topology. By assigning a geometric measurement to topological invariants, PH provides a bridge between geometry and topology. 
Recently, PH based machine learning models have delivered some of the best results in the prediction of protein-ligand binding affinity, partition coefficients, and mutation-induced folding energy variation2,3,4,5,6,7 and ranked first in several categories of the recent D3R Grand Challenges8, which are widely regarded as the most difficult challenge in drug design.

The great success of persistent homology based machine learning models depends on the multiscale topological features obtained from biomolecular structures9,10,11. Unlike all previous geometric and topological models, which either focus on local structure information or study qualitative global properties, persistent homology embeds geometric information into topological invariants, and thus provides the first quantitative topological measurements. With the proposed filtration process, a series of nested simplicial complexes, which encode structural topological information at different scales, are generated in PH. These simplicial complexes provide a multiscale topological profile of the structures. Topological features, such as individual components, holes, circles, and voids, can be evaluated from them. More importantly, some features persist while others die quickly during the filtration. The “lifespans” or “persisting times” provide a size measurement of the topological features12,13,14. With its unique power in data simplification and structure representation, PH has already been applied to various fields, including shape recognition15, network structure16,17,18, image analysis19,20,21,22,23, data analysis24,25,26,27,28, chaotic dynamics verification29, computer vision23, computational biology30,31,32,33,34,35,36,37, amorphous material structures38,39, etc. Many powerful software packages, including JavaPlex40, Perseus41, Dipha42, Dionysus43, jHoles44, GUDHI45, Ripser46, PHAT47, DIPHA48, R-TDA package49, etc., have been developed. The persisting times of topological features can be represented or visualized by several models, including persistent diagram (PD)14, persistent barcode (PB)50, persistent landscape51,52, persistent image53, persistent curves54,55, etc. However, traditional persistent homology models use only one filtration parameter, which significantly hinders their application in revealing heterogeneous properties. 
To overcome this problem, several multidimensional filtration PH models have been proposed. These models can greatly boost the performance of traditional PH models. Another important approach is to design weighted persistent homology (WPH). The essential idea of WPH is to introduce weight information, which reflects certain physical, chemical or biological properties, into the simplicial complex generation or homological generator calculation. In this way, the topological features, obtained from WPH, will characterize more heterogeneous biomolecular properties.

Generally speaking, weighted persistent homology models can be categorized into three major types: vertex-weighted35,56,57,58,59,60, edge-weighted2,5,35,44,61,62, and simplex-weighted models63,64,65. For vertex-weighted models, a weight value is defined on each vertex. Among these methods, the weighted alpha complex is the first model that has been used in biomolecular structure characterization58. By assigning a weight value to each atom, a modified distance function can be proposed and further used to generate the weighted Voronoi cell. The weighted alpha complex is a subcomplex of the weighted Delaunay complex, which is the dual of the weighted Voronoi diagram. Other vertex-weighted models consider different types of weighted distance functions56,57,59. In a weighted Vietoris-Rips and Čech complex model, a new distance function is proposed as the minimal value of the scaled Euclidean distances from the current position to all atoms56. The inverse image of the distance function represents the union of balls centered at the atoms, and naturally induces weighted Čech complexes through the nerve theorem56,59. Weighted Vietoris-Rips complexes can be constructed by scaling the Euclidean distance between any two atoms by their weights56. Similarly, k-distance functions are proposed and can be used to produce weighted Čech complexes59. To characterize the multiscale properties of biomolecules, a multiscale rigidity function is proposed35,60,66. Each atom is associated with a weighted kernel function with a scale parameter. A rigidity function is defined as the summation of all the kernel functions and can be used to generate a series of nested Morse complexes in persistent homology. Unlike vertex-weighted models, edge-weighted models usually specify unique weight values on edges44,61. 
Using the weight value as a filtration parameter, a Vietoris-Rips or clique complex can be defined as a maximal simplicial complex whose 1-skeleton has weight values larger (or smaller) than a certain filtration value57. For the weighted clique rank homology model61, a network/graph with a weight value on each edge is considered. A clique complex can be defined on the subgraph composed of edges with weight larger than the filtration value. A series of physics-aware models are proposed to characterize interactions within and between biomolecules2,3,4,5,6,7,8,35,62. Various modified or generalized distance matrices are used in these models. Further, simplex-weighted persistent homology models, based on weighted boundary operators, are proposed63,64,65. In these models, weight values are defined on simplices of different dimensions. To ensure the consistency of the homology definition, weight values on different simplices need to satisfy certain constraints or relations, so that a weighted boundary operation can be well-defined.

In this paper, we propose a localized weighted persistent homology (LWPH). Our LWPH model is inspired by the recent great success of element specific persistent homology (ESPH) models as mentioned above. Unlike all previous weighted persistent homology models, which treat biomolecules as an inseparable system, ESPH decomposes the structure into a series of sub-structures made of certain type(s) of atoms. The subnetworks or subgraphs, especially those from protein-ligand complexes, have been shown to capture important biological properties, such as hydrophobic or hydrophilic interactions. In this way, ESPH models have delivered excellent results in biomolecular data analysis2,3,4,5,6,7,8. Further, our LWPH models are different from traditional persistent local homology (PLH)67,68,69,70,71,72. PLH studies the relative homology groups between a topological space and its subspace. It is usually used to assess the local structure around a special point within a topological space. In our LWPH, the biomolecular structures and configurations are decomposed into a series of local domains that may overlap with each other. The general persistent homology or weighted persistent homology analysis is then applied to each of these local domains. In this way, DNA local structure, dynamics and functional properties can be embedded in our LWPH models. Our model has been used in the analysis of DNA. It has been found that our LWPH based features can be used to successfully discriminate A-, B-, and Z-types of DNA. More importantly, our LWPH based principal component analysis (PCA) model can identify two configurational states of a DNA system in an ionic liquid environment, which otherwise can only be revealed by the complicated helical coordinate representation. The close consistency with the helical-coordinate model demonstrates that our model captures the local structure variations so well that it is comparable with geometric models. 
Moreover, geometric measurements are usually defined in very local regions, for instance the helical-coordinate system is limited to one or two basepairs. However, our localized weighted homology can quantitatively characterize structure information in a much larger domain, where traditional geometrical measurements fail.

Methods

In this section, we will provide a brief review of the weighted persistent homology models and their applications in biomolecular data analysis. After that, a detailed discussion of our localized weighted persistent homology model will be presented.

Weighted persistent homology

The essential idea of weighted persistent homology models is to introduce a specially-designed weight function/parameter that incorporates the biomolecular physical, chemical or biological properties, into the construction of simplicial complexes or homology generator evaluation. Generally speaking, all these WPH models can be classified into three types, including vertex-weighted35,56,57,58,59,60, edge-weighted2,5,35,44,61,62, and simplex-weighted models63,64,65.

To facilitate our discussion, we define a weighted point set as \((X,V)\) with \(X=\{{x}_{i}{|}_{i=1,2,\ldots ,N}\}\) and \(V=\{{v}_{i}{|}_{i=1,2,\ldots ,N}\}\). Each point \({x}_{i}\) is assigned a weight \({v}_{i}\). We use \(d({x}_{i},{x}_{j})\) to represent the Euclidean distance between two points \({x}_{i}\) and \({x}_{j}\). Biologically, the weight \({v}_{i}\) is usually chosen to be the atomic radius, atomic number, etc.

Vertex-weighted persistent homology

Weighted alpha complex: In weighted alpha complex58, a weighted distance is defined as \({d}^{\alpha }(x,{x}_{i})=\sqrt{{v}_{i}^{2}+d{(x,{x}_{i})}^{2}}\). The weighted Voronoi region or Voronoi cell can be defined as

$$V{C}_{i}=\{x|{d}^{\alpha }(x,{x}_{i})\le {d}^{\alpha }(x,{x}_{j}),for\,all\,j\ne i\}.$$

A t-weighted closed ball for \({x}_{i}\) is defined as \({\bar{B}}_{i}^{\alpha }(t)=\{x|{d}^{\alpha }(x,{x}_{i})\le t\}\). The intersection of \(t\)-weighted closed balls and Voronoi cells is \({R}_{i}(t)={\bar{B}}_{i}^{\alpha }(t)\cap V{C}_{i}\). In this way, the weighted alpha complex can be expressed as,

$$A(t)=\{\sigma \subset X|\mathop{\cap }\limits_{{x}_{i}\in \sigma }{R}_{i}(t)\ne \emptyset \}.$$

Essentially, the weighted alpha complex is a subcomplex of the weighted Delaunay complex, which is the dual of the weighted Voronoi diagram.

Weighted Vietoris-Rips and Čech: Bell et al. have proposed a weighted Vietoris-Rips model and a weighted Čech model56. The weighted Čech complex is defined as Čech\((X,V)={\mathscr{N}}\{\bar{B}({x}_{i},{v}_{i}){|}_{i=1,2,\ldots ,N}\}\). Here \(\bar{B}({x}_{i},{v}_{i})=\{x|d(x,{x}_{i})\le {v}_{i}\}\) is the closed ball centered at \({x}_{i}\) with radius \({v}_{i}\). The nerve \({\mathscr{N}}\) is the abstract simplicial complex generated from the closed balls. The weighted Vietoris-Rips complex is defined as \(VR(X,V)\) = \(\{\sigma \subset X|d({x}_{i},{x}_{j})\le {v}_{i}+{v}_{j},for\,all\,{x}_{i},{x}_{j}\in \sigma \,with\,{x}_{i}\ne {x}_{j}\}\).

For a filtration parameter \(t\ge 0\), weighted Čech complex at scale t can be denoted as

$${\check{C}}ech(X,V,t)={\mathscr{N}}\{\bar{B}({x}_{i},t{v}_{i});i=1,2,\ldots ,N\},$$
(1)

and the weighted Vietoris-Rips Complex at scale t can be expressed as

$$VR(X,V,t)=\{\sigma \subset X|d({x}_{i},{x}_{j})\le t{v}_{i}+t{v}_{j},for\,all\,{x}_{i},{x}_{j}\in \sigma \,with\,{x}_{i}\ne {x}_{j}\}.$$

Moreover, we can define a distance function as \({f}_{X,V}(x)={{\rm{\min }}}_{{x}_{i}\in X}\{\frac{d(x,{x}_{i})}{{v}_{i}}\}\). Its sublevel set at scale t is the union of the scaled closed balls,

$${f}_{X,V}^{-1}([0,t])=\mathop{\cup }\limits_{{x}_{i}\in X}\bar{B}({x}_{i},t{v}_{i}),$$
(2)

and it is homotopy equivalent to Čech\((X,V,t)\) as in Eq. (1).

Computationally, the weighted Vietoris-Rips Complex56 can be constructed by using a weighted distance matrix \(M=\{{M}_{ij}{|}_{i,j=1,2,\ldots ,N}\}\) with

$${M}_{i,j}=\frac{d({x}_{i},{x}_{j})}{{v}_{i}+{v}_{j}}.$$

Various software packages, such as JavaPlex40, Perseus41, Dipha42, GUDHI45, Ripser46, PHAT47, DIPHA48, R-TDA package49, can use a distance matrix as their input data.
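As an illustration of the weighted distance matrix above, the following is a minimal Python sketch, not the implementation used in the cited works. The function name `weighted_distance_matrix` and the sample coordinates and weights are hypothetical and chosen only for demonstration.

```python
import math


def weighted_distance_matrix(points, weights):
    """Return M with M[i][j] = d(x_i, x_j) / (v_i + v_j) and zeros on the diagonal."""
    n = len(points)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            scaled = math.dist(points[i], points[j]) / (weights[i] + weights[j])
            matrix[i][j] = matrix[j][i] = scaled
    return matrix


# Hypothetical toy input: three points with assumed weights (e.g. radii).
points = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
weights = [1.0, 2.0, 1.0]
M = weighted_distance_matrix(points, weights)
```

With this scaling, a simplex belongs to \(VR(X,V,t)\) exactly when all of its pairwise entries in M are at most t, so M can be passed to a distance-matrix-based Vietoris-Rips routine as if it were an ordinary distance matrix.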

k-distance based model

Definition 1.

For any point set X and any positive integer k, the k-distance57,59 is defined as

$${d}_{X,k}^{2}(x)=\frac{1}{k}\,\sum _{{x}_{i}\in N{N}_{X}^{k}(x)}\,{d}^{2}(x,{x}_{i})$$

where \(N{N}_{X}^{k}(x)\) denotes the set of k nearest neighbors in X of the point x.

Further, it can be expressed as a power distance as follows,
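The k-distance in Definition 1 can be evaluated directly from its definition. The sketch below is a brute-force illustration under the assumption that X is small enough to sort all distances; the helper name `k_distance` and the one-dimensional sample points are hypothetical.

```python
import math


def k_distance(x, points, k):
    """Square root of the mean squared distance from x to its k nearest neighbors in points."""
    squared = sorted(math.dist(x, p) ** 2 for p in points)
    return math.sqrt(sum(squared[:k]) / k)


# Hypothetical toy point set on a line.
points = [(0.0,), (1.0,), (3.0,)]
d1 = k_distance((0.0,), points, 1)  # nearest neighbor is the point itself
d2 = k_distance((0.0,), points, 2)  # averages squared distances 0 and 1
```

For k = 1 the function reduces to the ordinary distance to the point set, while larger k makes the value less sensitive to isolated points.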

$${d}_{X,k}^{2}(x)=\,{\rm{\min }}\,\{{d}^{2}(x,\bar{x})-{w}_{\bar{x}};\bar{x}\in Bar{y}^{k}(X)\}.$$

Here \(Bar{y}^{k}(X)\) denotes the set of barycenters of all subsets of k points of X and \({w}_{\bar{x}}=-\,\frac{1}{k}\,{\sum }_{1\le i\le k}\,{d}^{2}(\bar{x},{x}_{i})\). Moreover, the sublevel sets of the k-distance \({d}_{X,k}\) are finite unions of balls,

$${d}_{X,k}^{-1}([0,t])=\mathop{\cup }\limits_{\bar{x}\in Bar{y}^{k}(X)}B(\bar{x},{({t}^{2}+{w}_{\bar{x}})}^{1/2}).$$
(3)

Similar to Eq. (2), the inverse of this distance function is homotopy equivalent to a weighted Čech complex, which is the nerve of the closed balls59.

Rigidity function based models: Multiscale topological simplification models have been proposed35,60,66. The key part of these models is the multiscale rigidity function,

$$\mu (x)=\mathop{\sum }\limits_{i=1}^{N}\,{v}_{i}\Phi (d(x,{x}_{i});{\eta }_{i}).$$

Here \({\eta }_{i}\) is the scale parameter and \(\Phi (d(x,{x}_{i});{\eta }_{i})\) can be chosen from any monotonically-decreasing functions, such as the generalized power-law equation,

$$\Phi (d(x,{x}_{i});{\eta }_{i},\nu )=\frac{1}{1+{(\frac{d(x,{x}_{i})}{{\eta }_{i}})}^{\nu }},$$
(4)

and the generalized exponential equation,

$$\Phi (d(x,{x}_{i});{\eta }_{i};\nu )={e}^{-{(\frac{d(x,{x}_{i})}{{\eta }_{i}})}^{\nu }}.$$
(5)

Unlike the previous distance functions in Eqs. (2) and (3), the inverse image of the rigidity function \({\mu }^{-1}([0,t])\) may not be expressible as a union of balls. However, it can generate Morse complexes. Computationally, discrete Morse models can be used to evaluate the persistent homology.
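To make the rigidity function and the kernels in Eqs. (4) and (5) concrete, a minimal Python sketch is given below. It only evaluates \(\mu (x)\) at a query point and does not construct the Morse complex. All function names and the numerical values are hypothetical.

```python
import math


def power_law_kernel(r, eta, nu):
    """Generalized power-law kernel of Eq. (4)."""
    return 1.0 / (1.0 + (r / eta) ** nu)


def exponential_kernel(r, eta, nu):
    """Generalized exponential kernel of Eq. (5)."""
    return math.exp(-((r / eta) ** nu))


def rigidity(x, points, weights, etas, nu, kernel):
    """mu(x) = sum_i v_i * Phi(d(x, x_i); eta_i) for the chosen kernel."""
    return sum(
        v * kernel(math.dist(x, p), eta, nu)
        for p, v, eta in zip(points, weights, etas)
    )


# Hypothetical single-atom example evaluated at distance r = eta from the atom.
atoms = [(0.0, 0.0, 0.0)]
mu_power = rigidity((1.0, 0.0, 0.0), atoms, [2.0], [1.0], 2, power_law_kernel)
mu_exp = rigidity((1.0, 0.0, 0.0), atoms, [2.0], [1.0], 2, exponential_kernel)
```

In practice, \(\mu \) would be sampled on a grid and the resulting scalar field passed to a cubical or discrete Morse persistence routine; that step is outside the scope of this sketch.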

Edge-weighted persistent homology

The essential idea for the edge-weighted persistent homology models is to assign a weight to each edge. With weight as filtration parameter, the Vietoris-Rips complex can be defined as the maximal simplicial complex whose 1-skeleton has weight values larger (or smaller) than the filtration value. Computationally, a weighted distance matrix is usually proposed. The filtration is achieved through the increasing (or decreasing) of the weighted distance value.

Weighted clique rank homology: The weighted clique rank homology is defined on weighted complex networks44,61. For weighted networks, each edge/link has a weight on it. The filtration goes from the largest weight to the lowest one. At each filtration value t, a subgraph composed of edges with weight larger than t is formed. Based on the subgraph, clique complex can be constructed. In this way, with the decrease of filtration value, a series of clique complexes are built and their homology and persistence can be calculated.

Physics-aware models: Recently, a series of new persistent homology models have been proposed to characterize the various physical interactions within and between biomolecules2,5,35,62. In these models, the distance matrix between atoms is modified based on their physical properties, including covalent bonds, protein-ligand interactions, electrostatic interactions, etc. To avoid confusion, we call them physics-aware persistent homology models.

For a biomolecule or biomolecular complex, we denote its atomic coordinates as \(X=\{{x}_{i}{|}_{i=1,2,\ldots ,N}\}\). A distance matrix can then be constructed as \(M=\{{M}_{ij}=d({x}_{i},{x}_{j}){|}_{i,j=1,2,\ldots ,N}\}\). Various modified distance matrices are proposed to characterize different physical, chemical and biological properties of the biomolecular structure.

Definition 2.

Multi-level persistent homology model2 considers a modified distance matrix as follows,

$${M}_{ij}=\{\begin{array}{ll}d({x}_{i},{x}_{j}), & {\rm{if}}\,{\rm{atoms}}\,i\,{\rm{and}}\,j\,{\rm{are}}\,{\rm{not}}\,{\rm{bonded}};\\ \infty , & {\rm{if}}\,{\rm{atoms}}\,i\,{\rm{and}}\,j\,{\rm{are}}\,{\rm{bonded}}.\end{array}$$
(6)

In computation, we can take \(\infty \) as any value larger than the filtration size. More generally, we can define an n-th level matrix2 as

$${M}_{ij}=\{\begin{array}{ll}\infty , & d({x}_{i},{x}_{j})\le n;\\ d({x}_{i},{x}_{j}), & {\rm{otherwise}}.\end{array}$$

It has been found that when the modified matrices are employed, the barcode representation is significantly enriched and is able to capture the tiny structure perturbation between the conformations. Further, an interactive persistent homology model is proposed for protein-ligand binding analysis.

Definition 3.

An interactive persistent homology model is based on the revised distance matrix as follows,

$${M}_{ij}=\{\begin{array}{l}\begin{array}{ll}d({x}_{i},{x}_{j}), & {\rm{if}}\,{\rm{atoms}}\,i\,{\rm{and}}\,j\,{\rm{are}}\,{\rm{from}}\,{\rm{different}}\,{\rm{molecules}};\\ \infty , & {\rm{otherwise}}.\end{array}\end{array}$$
(7)

In this way, interactions between two molecules, such as protein-protein, protein-DNA/RNA, protein-ligand, DNA/RNA-ligand, etc, can be incorporated into topological invariants.

Essentially, physics-aware persistent homology models2 are all based on the generalized matrix \({M}_{ij}=\Phi ({x}_{i},{x}_{j})\). Here \(\Phi ({x}_{i},{x}_{j})\) can be any function that characterizes pairwise properties, including van der Waals interaction, electrostatic potential, or any other generalized correlations.
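The modified matrices in Eqs. (6) and (7) share one pattern: an entry keeps the Euclidean distance for selected atom pairs and is set to infinity otherwise. The following Python sketch illustrates this pattern under stated assumptions. The helper `modified_distance_matrix`, the predicate `keep_pair`, and the toy coordinates, bond list and molecule labels are all hypothetical.

```python
import math


def modified_distance_matrix(coords, keep_pair):
    """Keep d(x_i, x_j) where keep_pair(i, j) is True; use infinity elsewhere."""
    n = len(coords)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            entry = math.dist(coords[i], coords[j]) if keep_pair(i, j) else math.inf
            matrix[i][j] = matrix[j][i] = entry
    return matrix


# Hypothetical toy system: three bonded "protein" atoms and one "ligand" atom.
coords = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (4.0, 0.0, 0.0)]
bonds = {frozenset((0, 1)), frozenset((1, 2))}
molecule = ["protein", "protein", "protein", "ligand"]

# Eq. (6): bonded pairs are pushed to infinity.
multi_level = modified_distance_matrix(
    coords, lambda i, j: frozenset((i, j)) not in bonds
)
# Eq. (7): only pairs from different molecules keep their distance.
interactive = modified_distance_matrix(
    coords, lambda i, j: molecule[i] != molecule[j]
)
```

As noted in the text, software that does not accept infinite entries can be given any finite value larger than the maximal filtration size instead of `math.inf`.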

Simplex-weighted persistent homology

Weighted simplicial homology: Weighted simplicial homology is a generalization of simplicial homology63,64,65. Every simplex has a weight in a ring R, and the boundary map is weighted accordingly. When all the simplices have the same weight \(a\in R\backslash \{0\}\), the resulting weighted homology is the same as the usual simplicial homology. We list some of the key definitions and results below.

Definition 4.

A weighted simplicial complex is a pair \((K,w)\) consisting of a simplicial complex K and a weight function \(w:K\to R\), where R is a commutative ring, such that for any \({\sigma }_{1}\), \({\sigma }_{2}\) with \({\sigma }_{1}\subseteq {\sigma }_{2}\), we have \(w({\sigma }_{1})|w({\sigma }_{2})\).

Theorem 1.

Let I be an ideal of a commutative ring R. Let \((K,w)\) be a weighted simplicial complex, where \(w:K\to R\) is a weight function. Then \(K\backslash {w}^{-1}(I)\) is a simplicial subcomplex of K.

For the definition of homology of weighted simplicial complexes, we require R to be an integral domain with unity 1.

Definition 5.

The weighted boundary map \({\partial }_{n}:{C}_{n}(K)\to {C}_{n-1}(K)\) is the map:

$${\partial }_{n}(\sigma )=\mathop{\sum }\limits_{i=0}^{n}\,\frac{w(\sigma )}{w({d}_{i}(\sigma ))}{(-1)}^{i}{d}_{i}(\sigma )$$

where the face maps di are defined as:

$${d}_{i}(\sigma )=[{v}_{0},\ldots ,{\hat{v}}_{i},\ldots ,{v}_{n}]\,({\rm{deleting}}\,{\rm{the}}\,{\rm{vertex}}\,{v}_{i})$$

for any n-simplex \(\sigma =[{v}_{0},\ldots ,{v}_{n}]\).
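The weighted boundary map of Definition 5 can be checked on a single triangle. The sketch below assumes integer weights and, purely for illustration, takes the weight of a simplex to be the product of hypothetical vertex weights, which satisfies the divisibility condition of Definition 4. The names `vertex_weight`, `weight` and `weighted_boundary` are not from the source.

```python
from math import prod

# Hypothetical integer vertex weights; w(simplex) is their product,
# so the weight of every face divides the weight of the simplex.
vertex_weight = {0: 2, 1: 3, 2: 5}


def weight(simplex):
    return prod(vertex_weight[v] for v in simplex)


def weighted_boundary(chain):
    """Apply the weighted boundary map to a chain given as {simplex tuple: coefficient}."""
    result = {}
    for simplex, coeff in chain.items():
        if len(simplex) == 1:
            continue  # the boundary of a 0-simplex is zero
        for i in range(len(simplex)):
            face = simplex[:i] + simplex[i + 1:]
            factor = weight(simplex) // weight(face)  # exact by divisibility
            result[face] = result.get(face, 0) + coeff * (-1) ** i * factor
    return {s: c for s, c in result.items() if c != 0}


triangle = {(0, 1, 2): 1}
edges = weighted_boundary(triangle)
twice = weighted_boundary(edges)
```

Here `edges` carries the coefficients 2, -3 and 5 on the faces (1, 2), (0, 2) and (0, 1), and applying the map a second time gives the empty chain, which illustrates that the weighted boundary map still squares to zero for this choice of weights.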

Theorem 2.

Let \(f:K\to L\) be a simplicial map. Then \({f}_{\#}\partial =\partial {f}_{\#}\), where \(\partial \) refers to the relevant weighted boundary map.

Definition 6.

We define the weighted homology of a weighted simplicial complex to be

$${H}_{n}(K,w)=ker({\partial }_{n})/Im({\partial }_{n+1}),$$

where \({\partial }_{n}\) is the weighted boundary map.

Proposition 2.1.

If all the simplices in \((K,w)\) have the same weight \(a\in R\backslash \{0\}\), the weighted homology functor is the same as the usual simplicial homology functor.

Weighted persistent homology: Given a weighted filtered complex \((K,w)={\{({K}^{i},w)\}}_{i}\), for the i-th complex \({K}^{i}\), we have the associated weighted boundary maps \({\partial }_{k}^{i}\) and chain group \({C}_{k}^{i}\), cycle group \({Z}_{k}^{i}\), boundary group \({B}_{k}^{i}\), and homology group \({H}_{k}^{i}\) for all integers i and k.

Definition 7.

The weighted boundary map \({\partial }_{k}^{i}\), where i denotes the filtration index, is the weighted boundary map of the i-th complex \({K}^{i}\). That is, \({\partial }_{k}^{i}\) is the map \({\partial }_{k}^{i}:{C}_{k}({K}^{i},w)\to {C}_{k-1}({K}^{i},w)\). The chain group \({C}_{k}^{i}\) is the group \({C}_{k}({K}^{i},w)\). The cycle group \({Z}_{k}^{i}\) is the group \(ker({\partial }_{k}^{i})\), while the boundary group \({B}_{k}^{i}\) is the group \(Im({\partial }_{k+1}^{i})\). The homology group \({H}_{k}^{i}\) is the quotient group \({Z}_{k}^{i}/{B}_{k}^{i}\).

Definition 8.

The p-persistent k-th homology group of \((K,w)={\{({K}^{i},w)\}}_{i\ge 0}\) is defined as

$${H}_{k}^{i,p}(K,w)\,:={Z}_{k}^{i}/({B}_{k}^{i+p}\cap {Z}_{k}^{i})$$

Localized weighted persistent homology

In all the above WPH models, weights are defined on the whole system to reveal the intrinsic global structural properties. Stated differently, all these models treat a biomolecule structure as an inseparable system, and explore their topological properties using the whole structure. In contrast, ESPH models2,3,4,5,6,7,8 decompose a biomolecular structure into a series of sub-structures made of certain type(s) of atoms. It has been found that the generated subnetworks or subgraphs, especially those from protein-ligand complexes, can capture important biological properties, such as hydrophobic or hydrophilic interactions2. Different from WPH and ESPH models, persistent local homology considers the relative homology groups between a topological space and its subspace. Usually, persistent local homology focuses on the local structure around a special point in a certain topological space.

Motivated by the success of ESPH models, we introduce our localized weighted persistent homology. Instead of decomposing biomolecular structures by their atom types, we focus on local biomolecular regions or domains and study their topological properties. For a better introduction of our LWPH, we will briefly review ESPH and PLH first.

Element specific persistent homology: Biomolecules are made of various atoms with different properties. In proteins, DNA or RNA, there are five common types of atoms, including C, N, O, P, and S, and several metal ions, such as Fe, Mn, Zn, etc. Ligands are small molecules that interact with the protein, DNA or RNA. Other than the five common types of atoms, they may have some other unique atoms, such as F, Cl, Br, I, etc.

Generally speaking, ESPH is proposed to characterize the topological properties within the structure formed by one or several types of atoms2,3,4,5,7. Mathematically, it is achieved by assigning weight value 1 to the selected types of atoms, and weight value 0 to all the remaining atoms. For instance, all C atoms from both protein and ligand can be selected to form C-networks or graphs. The topological properties of these structures characterize the hydrophobic interactions between protein and ligands. Similarly, hydrophilic interactions can be well captured by networks from protein nitrogen atoms and ligand oxygen atoms.

One of the most important properties for ESPH is to generate a series of sub-structures from the biomolecule, and systematically explore their topological properties. These element based sub-structures reveal more structure information that is directly related to the biomolecular physical, chemical, and biological properties.

Persistent local homology: The persistent local homology67,68,69,70,71 is based on the algebraic topological concept called local homology groups72.

Definition 9.

If X is a space and if \(x\in X\) is a point, then the local homology groups of X at x are the singular homology groups \({H}_{k}(X,X-x)\).

From the excision theorem, we have the following Lemma.

Lemma 2.1.

Let X be a Hausdorff space and let \(A\subset X\). If A contains a neighborhood of the point x, then \({H}_{k}(X,X-x)\simeq {H}_{k}(A,A-x)\). Therefore, for Hausdorff spaces X and Y, if \(x\in X\) and \(y\in Y\) have neighbourhoods U, V, respectively, such that \((U,x)\) is homeomorphic to \((V,y)\), then the local homology groups of X at x and of Y at y are isomorphic.

Generally speaking, local homology is used to evaluate the local structure of a topological space. Persistent local homology can be used in dimension reduction and manifold dimension detection69,70,71.

Localized weighted persistent homology: The essential idea of our LWPH is to focus on local regions of biomolecular structures or configurations that incorporate important physical, chemical or biological information, and to perform the persistent homology analysis on them. In many situations, there may be several domains that are of great interest, and we need to perform the analysis on each domain. More generally, we can decompose biomolecular structures into a series of (overlapping) domains. For each domain, we carry out LWPH on all (or certain type(s) of) atoms that are of particular interest. In this way, a series of local topological properties can be obtained and we call them localized topological fingerprints. To avoid confusion, if the general persistent homology is considered, we call it localized persistent homology (LPH). If a weighted persistent homology is considered, we call it localized weighted persistent homology (LWPH).

More specifically, consider a biomolecule or biomolecular complex with atomic coordinates \(X=\{{x}_{i}{|}_{i=1,2,\ldots ,N}\}\). The coordinate set X can be decomposed into a series of domains \({X}^{I}\), with \(X={\cup }_{I=1}^{m}\,{X}^{I}\). A distance matrix \({M}^{I}\) similar to those in Eqs. (6) and (7) can be constructed on each of the domains. In this paper, we will focus on the study of DNA, which is made of paired nucleotides. We can consider the weighted distance matrix on the domain \({X}^{I}\) as follows,

$${M}_{ij}^{I}=\{\begin{array}{ll}d({x}_{i},{x}_{j}), & {x}_{i},{x}_{j}\in {X}^{I},\,{\rm{atoms}}\,i\,{\rm{and}}\,j\,{\rm{are}}\,{\rm{from}}\,{\rm{different}}\,{\rm{nucleotide}}\,{\rm{residues}};\\ \infty , & {\rm{otherwise}}.\end{array}$$
(8)
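A minimal Python sketch of the domain matrix in Eq. (8) is given below. It assumes that each atom carries a residue identifier and that a domain is given as a set of atom indices; the matrix is restricted to the atoms of the domain, which leaves out the atoms that Eq. (8) places at infinite distance. The function name `domain_matrix` and the sample data are hypothetical.

```python
import math


def domain_matrix(coords, residue_ids, domain):
    """Return (atom indices, M^I) for one domain, following Eq. (8).

    Pairs from different residues keep their Euclidean distance;
    pairs from the same residue are set to infinity.
    """
    idx = sorted(domain)
    matrix = [[0.0] * len(idx) for _ in idx]
    for a, i in enumerate(idx):
        for b, j in enumerate(idx):
            if a < b:
                same_residue = residue_ids[i] == residue_ids[j]
                entry = math.inf if same_residue else math.dist(coords[i], coords[j])
                matrix[a][b] = matrix[b][a] = entry
    return idx, matrix


# Hypothetical toy data: four atoms in three residues, domain = first two residues.
coords = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 3.0, 0.0), (9.0, 9.0, 9.0)]
residue_ids = ["A1", "A1", "T2", "G3"]
idx, M_I = domain_matrix(coords, residue_ids, {0, 1, 2})
```

Overlapping domains, for example consecutive base steps, can be handled by calling the function once per domain with index sets that share atoms.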

By using different weight values, LWPH can be designed to capture various local properties in biomolecular structures. Applications of LPH and LWPH can be found in the following sections.

It should be noted that we still regard our LPH as a weighted persistent homology model. Essentially, we can set the weight as a vector composed of only 1 and 0, in which only atoms within the local region have weight value 1.

Results

In this section, we discuss the application of our localized persistent homology and localized weighted persistent homology in the study of DNA structures. To avoid confusion, only the weight definition as in Eq. (8) is used in our LWPH. The persistent barcodes are used for the representation and visualization of LPH and LWPH results. The persistent Betti numbers (PBNs) are evaluated from barcodes. A systematic evaluation of PBNs under helical coordinates demonstrates the incorporation of geometric information in our LPH and LWPH models. Further, we show that PCA of the feature vectors from LWPH based PBNs can be successfully used in the classification of the A-, B-, and Z-types of DNA. Moreover, we explore DNA structure variations in both water and ionic liquid (IL) environments using molecular dynamics. With LWPH based feature vectors, we can not only reveal the confinement effect on DNA configurations from water to IL environment, but also identify two DNA configurational states in the IL environment. Detailed analysis shows that global-scale PCA models, including atom-coordinate-based PCA, common PH-based PCA, and ESPH-based PCA, all fail in clustering the DNA configurational states in IL. In contrast, LWPH based all-atom or selected-atom models are always able to characterize the DNA structure variations in the IL environment. The LWPH results are highly consistent with the helical-coordinate based PCA model.

DNA local topological features

A DNA molecule has a double helical structure composed of four types of nucleotides, i.e., A, T, G, and C. It has remarkably different scales, ranging from nucleobases, minor and major grooves, to larger structures like nucleosome, chromatin, and then to chromosome. We have proposed a multi-resolution PH model to characterize structural topological features or topological fingerprints of DNA structures at different scales60,66. However, that model focuses on the global DNA topology. In this section, we explore DNA topological features from local structures, i.e., DNA localized topological fingerprints. We study the barcodes for both LPH and LWPH models, i.e., one with traditional persistent homology and the other with the weighted persistent homology as in Eq. (8). Different local structures can be systematically studied. In the current paper, our focus is on local topological features from different DNA base pairs or base steps.

To facilitate a better description, we introduce some basic notations. In general, results from PH can be represented as pairs of “birth” and “death” times, i.e., the filtration values for homology generators to appear and disappear. We denote them as follows,

$${L}_{k}=\{{l}_{j}^{k}|{l}_{j}^{k}={b}_{j}^{k}-{a}_{j}^{k};k\in {\mathbb{N}};j\in \{1,2,\ldots ,{N}_{k}\}\},$$

where \({a}_{j}^{k}\), \({b}_{j}^{k}\), and \({l}_{j}^{k}\) represent the “birth”, “death”, and “persistence” of the \(j\)-th generator of the \(k\)-th dimensional Betti number, respectively, and \({N}_{k}\) is the total number of \(k\)-th dimensional topological generators. Due to the limited number of atoms in a local structure, we only consider dimensions \(k\) equal to 0 and 1. Further, different PH based functions are proposed for the visualization, representation and modeling of topological information24,51,54,73. Persistent Betti number, or Betti curve, is one of them. It is defined as the summation over all the \(k\)-th dimensional bars,

$${f}_{{\rm{PBN}}}(x;{L}_{k})=\sum _{j}\,{\chi }_{[{a}_{j}^{k},{b}_{j}^{k}]}(x).$$
(9)

The function \({\chi }_{[{a}_{j}^{k},{b}_{j}^{k}]}(x)\) is a step function, which equals one in the region \([{a}_{j}^{k},{b}_{j}^{k}]\) and zero otherwise.
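The persistent Betti number of Eq. (9) simply counts the bars that contain a given filtration value. A minimal Python sketch is given below; the function name and the sample bars are hypothetical and only illustrate the counting.

```python
def persistent_betti_number(x, bars):
    """Count the bars [birth, death] that contain the filtration value x, as in Eq. (9)."""
    return sum(1 for birth, death in bars if birth <= x <= death)


# Hypothetical dimension-0 barcode with one bar that never dies.
bars = [(0.0, 1.5), (0.0, 2.8), (0.0, float("inf"))]
curve = [persistent_betti_number(x, bars) for x in (0.0, 1.0, 2.0, 3.0)]
```

Sampling the function on a fixed grid of filtration values, as in `curve`, gives a fixed-length feature vector of the kind that can be passed to PCA or other learning models.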

As a simple one-dimensional function, PBN has been used in data analysis for dimensionality and complexity reduction. Moreover, PBN based feature vectors can be input into various machine learning models2,3,4,5,6,7,8. In this section, PBN based PCA models are used in DNA structure classification and trajectory clustering.

LPH and LWPH for DNA structure representation

As stated above, we consider two types of LPH: one is LPH with the Euclidean distance matrix, and the other is LWPH with a weighted distance matrix similar to that in Eq. (8). For LWPH, we set the distance between two atoms from the same nucleotide residue to infinity, and the distance between two atoms from different residues to their Euclidean distance. In this way, our LPH characterizes the topology of the covalent-bond-formed structure, while our LWPH reveals topological information on non-covalent bond properties. The DNA base-pair structures are generated using the 3DNA software74.
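The distance rule described above can be sketched as follows. This is a minimal illustration that assumes atom coordinates and residue labels are already available as arrays; the function name `lwph_distance_matrix` is hypothetical, and keeping the diagonal at zero is a convention assumed here rather than something specified in the text.

```python
import numpy as np

def lwph_distance_matrix(coords, residue_ids):
    """Euclidean distance between atoms of different residues, infinity otherwise."""
    coords = np.asarray(coords, dtype=float)
    residue_ids = np.asarray(residue_ids)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    same_residue = residue_ids[:, None] == residue_ids[None, :]
    dist[same_residue] = np.inf      # no finite edge inside a nucleotide
    np.fill_diagonal(dist, 0.0)      # assumed convention: zero self-distance
    return dist

# Three hypothetical atoms: two in residue 0 and one in residue 1.
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
dist = lwph_distance_matrix(coords, residue_ids=[0, 0, 1])
print(dist)
```

A matrix of this kind can be passed to any Vietoris-Rips implementation that accepts a precomputed distance matrix, whereas the LPH case uses the plain Euclidean matrix without the masking step.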

Figure 1 illustrates the LPH barcodes for different combinations of base pairs. Shorter \({\beta }_{0}\) bars (with lengths of around 1.0 Å to 1.5 Å) correspond to covalent bonds, while longer \({\beta }_{0}\) bars with lengths of around 2.8 Å correspond to the hydrogen bonds between paired bases. For \({\beta }_{1}\) bars, those that appear in the earlier stage of the filtration, i.e., from around 1.4 Å to 2.4 Å, correspond to sugar rings and nitrogenous base rings. The later \({\beta }_{1}\) bars range from around 2.9 Å to 3.6 Å and represent loops between paired bases. It can be seen that each A-T pair contributes only one such later \({\beta }_{1}\) bar, while each G-C pair contributes two.

Figure 1

The LPH based barcodes for different combinations of DNA base pairs. It can be seen that A-T and G-C base pairs all have three local \({\beta }_{1}\) bars from around 1.4 Å to 2.4 Å. But they differ greatly in the global region, where A-T pair contributes one significant \({\beta }_{1}\) bar from around 2.9 Å to 3.6 Å, while G-C pair generates two. These barcode fingerprints characterize the intrinsic DNA structure properties, i.e., local and global loop/ring motifs.

Figure 2 illustrates the corresponding LWPH barcodes for different combinations of base pairs. Similar to the LPH results, the total number of \({\beta }_{0}\) bars is exactly the number of atoms, and the shorter \({\beta }_{0}\) bars with lengths of around 2.8 Å correspond to hydrogen bonds between paired bases. Different from the LPH results, more hydrogen-bond related \({\beta }_{0}\) bars appear in the LWPH barcodes than in the LPH barcodes. The lengths of the \({\beta }_{0}\) bars for LWPH are also systematically longer than those for the LPH model, indicating that more information related to long-range interactions is preserved in our LWPH model. Moreover, the \({\beta }_{1}\) barcodes for LWPH are much more complicated, and their geometric meanings are not as straightforward as in the LPH models. Generally speaking, a \({\beta }_{1}\) bar in LWPH represents a loop or ring structure with edges between different nucleotides. In this way, a much larger number of \({\beta }_{1}\) bars are generated. Moreover, when there are only two nucleotides (or one base pair), the \({\beta }_{1}\) bars persist forever.
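The last observation can be made quantitative with a short count, offered as a sketch under the distance rule stated above rather than as a separately reported result. Suppose the two nucleotides contain \(m\) and \(n\) atoms (symbols introduced here for illustration). Every finite edge joins atoms from different nucleotides, so the complex at any filtration value is a bipartite graph and contains no triangle. For a graph with \(V\) vertices, \(E\) edges and \({\beta }_{0}\) connected components,

$${\beta }_{1}=E-V+{\beta }_{0}.$$

Once the filtration value exceeds the largest inter-nucleotide distance, \(E=mn\), \(V=m+n\) and \({\beta }_{0}=1\), so that

$${\beta }_{1}=mn-(m+n)+1=(m-1)(n-1).$$

Since no 2-simplex is ever formed, no \({\beta }_{1}\) bar can be terminated, and this is also the total number of \({\beta }_{1}\) bars, all of which persist forever.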

Figure 2

The LWPH based barcodes for different combinations of DNA base pairs. The weighted distance matrix in Eq. (8) is considered. Only the distances between two atoms from different nucleobases are set to be their Euclidean distance, while other distances are set to infinity. The shortest \({\beta }_{0}\) bars are distances between adjacent atoms from two bases. They characterize the hydrogen bonds between two nucleobases. For a single base pair (A-T or G-C) situation, the generated \({\beta }_{1}\) bars will never be “killed” as no 2-simplexes can be formed.

In general, LPH and LWPH characterize different structural properties: the former is more about covalent-bond related topology, while the latter is more about topology from non-covalent bonds. Physically, covalent bonds are much stronger than non-covalent bonds. As a result, LPH is relatively more “stable” and less sensitive to structure variations under thermal fluctuations, whereas LWPH is less “stable” and changes much more easily under external perturbations.

LPH and LWPH based DNA structural analysis

With the embedded geometric information, LPH and LWPH can be used in not only qualitative but also quantitative analysis of different structures. To assess LPH and LWPH based quantitative analysis, we systematically generate a series of DNA base-pair configurations using DNA helical coordinates. According to the Cambridge University Engineering Department Helix computation Scheme (CEHS)75, the motion of a base pair or two neighbouring base pairs can be described by 12 helical parameters, including 6 one-base-step related parameters, i.e., shear, stretch, stagger, buckle, propeller and opening, and another 6 two-base-step related parameters, i.e., shift, slide, rise, tilt, roll and twist. For each parameter, we prepare 11 DNA structures using 3DNA74, with the parameter value equally spaced from \({\mu }_{i}-2{\sigma }_{i}\) to \({\mu }_{i}+2{\sigma }_{i}\). Here \({\mu }_{i}\) and \({\sigma }_{i}\) are the mean value and standard deviation of parameter i. The rest of the helical parameters are held constant at their mean values. The mean value and standard deviation of each parameter can be obtained from crystal structures75. In the DNA helical coordinate evaluation, only the base atoms and C1′ of the sugar ring are considered. For a fair comparison, the same atoms are used in our LPH and LWPH models.
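The sampling of each helical parameter can be sketched in a few lines. The function name `parameter_sweep` and the numerical mean and standard deviation below are hypothetical placeholders, not the crystal-structure statistics used in this work.

```python
import numpy as np

def parameter_sweep(mu, sigma, n_values=11):
    """Return n_values equally spaced values from mu - 2*sigma to mu + 2*sigma."""
    return np.linspace(mu - 2.0 * sigma, mu + 2.0 * sigma, n_values)

# Hypothetical mean and standard deviation, for illustration only.
values = parameter_sweep(mu=3.3, sigma=0.2)
print(values)
```

Each value in the sweep would then be used to build one structure, with the remaining helical parameters kept at their means.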

We apply our LPH and LWPH to these series of DNA local structures and check whether the quantitative structure variations are reflected in their barcodes. To facilitate a systematic comparison, we consider the PBN function in Eq. (9). For each helical parameter, we take the natural logarithm of all its PBN functions and stack them together to form a two-dimensional image. Note that we add 1 to all PBN functions to avoid the computational problem of evaluating ln 0. Figures 3 and 4 illustrate the results of our LPH and LWPH for two-base-step parameters, respectively. The AT/AT base steps are considered. It can be seen that, rather than remaining unchanged across helical parameter values, both the \({\beta }_{0}\) and \({\beta }_{1}\) PBN functions for the LPH and LWPH models vary greatly, indicating that both models are sensitive to subtle structure variations. More specifically, in LPH based PBN, the \({\beta }_{1}\) functions seem to have comparably larger variations than the \({\beta }_{0}\) functions, whereas in LWPH based PBN, the \({\beta }_{0}\) functions seem to have greater variations than the \({\beta }_{1}\) functions. We also check the \({\beta }_{0}\) and \({\beta }_{1}\) PBN functions for the LPH and LWPH models for one base step. The results are shown in Figs. 9 and 10 in Supplementary. Both functions in those two models vary with the helical parameter value.
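The construction of the PBN image can be sketched as follows, assuming one barcode per structure in the parameter series. The helper names `pbn_curve` and `log_pbn_image`, the coarse filtration grid, and the three toy barcodes are hypothetical; the actual series contains 11 structures per parameter.

```python
import numpy as np

def pbn_curve(bars, grid):
    """Count the bars [birth, death] that contain each grid value."""
    bars = np.asarray(bars, dtype=float).reshape(-1, 2)
    return ((bars[:, :1] <= grid) & (grid <= bars[:, 1:])).sum(axis=0)

def log_pbn_image(barcodes, grid):
    """Stack ln(PBN + 1) curves, one row per structure in the series."""
    rows = [np.log(pbn_curve(bars, grid) + 1.0) for bars in barcodes]
    return np.vstack(rows)

# Three hypothetical barcodes standing in for a parameter series.
grid = np.linspace(0.0, 4.0, 5)
barcodes = [
    [(0.5, 2.5)],
    [(0.5, 2.5), (1.5, 3.5)],
    [(1.5, 3.5)],
]
image = log_pbn_image(barcodes, grid)
print(image.shape)
```

Each row of `image` corresponds to one helical parameter value and each column to one filtration value, so the array can be displayed directly as a color map.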

Figure 3

The LPH based PBN image representation for two-base-step (AT/AT) at different helical parameter values. In the i-th PBN image, we systematically change the i-th helical parameter value from \({\mu }_{i}-2{\sigma }_{i}\) to \({\mu }_{i}+2{\sigma }_{i}\), with all other helical parameters held constant, to deliver a series of base-step structures. The PBN is calculated for each two-base-step structure, and all of them are stacked together to form a two-dimensional image. It can be seen that both the \({\beta }_{0}\) (upper figures) and \({\beta }_{1}\) (lower figures) PBN functions vary with the change of helical parameter value. The change of the \({\beta }_{1}\) PBN functions seems to be more dramatic. Note that the color values are the logarithms of the PBNs.

Figure 4

The LWPH based PBN image representation for each two-base-step (AT/AT) at different helical parameter values. Base-step structures are prepared in the same way as in Fig. 3. It can be seen that, similar to the LPH based PBNs, the LWPH based \({\beta }_{0}\) (upper figures) and \({\beta }_{1}\) (lower figures) PBN functions vary with the change of helical parameter value. However, the change of the \({\beta }_{0}\) PBN functions seems to be more dramatic in the LWPH models.

Since PBN functions from LPH and LWPH are sensitive to the structure variations, we can use them as measurements for DNA structure, function and dynamics analysis. In the following sections, PBN and PBN based features are used in the classification of DNA types and clustering of DNA trajectories.

Local topological feature based DNA classification and clustering

In this section, we study local topological feature based DNA classification and clustering. Essentially, topological features are extracted from PBNs, which are generated from LPH or LWPH. Figure 5 illustrates the LPH and LWPH based DNA featurization. Note that the results from LPH and LWPH can also be represented as persistent diagram14, persistent barcode50, persistent landscape51,52, persistent image53, persistent curves55, etc. Based on these representations, other featurization forms can also be considered76. In this paper, we focus on PBN based featurization.

Figure 5

LPH and LWPH based DNA featurization. In our DNA cases, the local region is defined as the two adjacent base-steps. The common Euclidean distance matrix is considered in LPH, while the weighted distance matrix as in Eq. (8) is used in LWPH. From LPH or LWPH, PBNs can be calculated and then discretized into vectors, which can be used as a representation of DNA structures.

Classification of three typical DNA forms

We consider three types of DNA structures, namely the A-, B- and Z-forms. We randomly pick 10 PDB files for each form of DNA; the PDB IDs are listed in Table 2 in Supplementary. In LPH and LWPH, the same atom combination for each base step is chosen as in the above case, i.e., base atoms and C1′ of the sugar ring. The PBN function is calculated for each base step. To systematically compare the PBNs for the three DNA forms, we sum all the PBNs from the same DNA form and then compute the average.

The results from LWPH are demonstrated in Fig. 6. It can be seen that the three PBN profiles have very different \({\beta }_{1}\) components, particularly over the filtration range from 4.0 Å to 7.0 Å. Further, we consider PCA for DNA classification. For each DNA structure, we define a vector made of the average \({\beta }_{0}\) and \({\beta }_{1}\) PBN values sampled uniformly from 2.0 Å to 8.0 Å at an interval of 0.1 Å. In this way, a feature vector with 120 elements is defined for each of the 30 DNA structures. The PCA results are also shown in Fig. 6. Here the x-axis and y-axis represent the first and second eigenvectors (principal components), respectively. It can be seen that the three forms of DNA are located in different regions with clear boundaries, which further confirms that LWPH based features can distinguish subtle conformational deviations. Figure 11 in Supplementary shows the results from LPH. In comparison with LWPH, LPH based PBNs show no obvious difference, and the PBN based PCA does not separate the dataset into three individual clusters.
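The 120-element feature vector can be sketched as below. The sketch assumes that each dimension is sampled at the 60 grid values 2.0, 2.1, ..., 7.9 Å, which is one reading consistent with 120 elements in total; the exact endpoint convention is an assumption, as are the helper names and the toy barcodes.

```python
import numpy as np

# Assumed sampling grid: 60 values 2.0, 2.1, ..., 7.9 (angstroms) per dimension.
GRID = 2.0 + 0.1 * np.arange(60)

def pbn_curve(bars, grid):
    """Count the bars [birth, death] that contain each grid value."""
    bars = np.asarray(bars, dtype=float).reshape(-1, 2)
    return ((bars[:, :1] <= grid) & (grid <= bars[:, 1:])).sum(axis=0)

def structure_feature(base_step_barcodes, grid=GRID):
    """Average the beta_0 and beta_1 PBN curves over base steps, then concatenate."""
    curves0 = [pbn_curve(b0, grid) for b0, _ in base_step_barcodes]
    curves1 = [pbn_curve(b1, grid) for _, b1 in base_step_barcodes]
    return np.concatenate([np.mean(curves0, axis=0), np.mean(curves1, axis=0)])

# Two hypothetical base steps, each given as (beta_0 bars, beta_1 bars).
barcodes = [
    ([(0.0, 2.85)], [(4.05, 6.95)]),
    ([(0.0, 3.05)], [(5.05, 5.95)]),
]
feature = structure_feature(barcodes)
print(feature.shape)
```

The first 60 entries hold the averaged \({\beta }_{0}\) values and the last 60 the averaged \({\beta }_{1}\) values; stacking one such vector per structure gives the matrix on which PCA is performed.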

Figure 6

LWPH based classification of three DNA types, i.e., A-DNA, B-DNA, and Z-DNA. The average persistent Betti number (aPBN) from our LWPH for three types of DNAs. We discretize the aPBN equally into a series of numbers and use these values as features for PCA. It can be seen that LWPH based aPBN and PCA results can clearly discriminate the three DNA types.

Clustering of DNA conformations in different environments

We have demonstrated DNA structure classification with our LWPH. However, the A-, B- and Z-forms of DNA are static and have relatively “large” configurational differences. In the following, we consider a more challenging case: the clustering of molecular dynamics simulations of the same DNA molecule in different solvent environments.

Molecular dynamics setting

A brief introduction to the MD procedure is presented as follows. The initial structure of a 16-mer DNA duplex is prepared using 3DNA74 and centered in a cubic box. Two different solution environments are used: ionic liquid (IL) and water (WAT). For the IL environment, 600 BMIM+ and 600 \({{\rm{BF}}}_{4}^{-}\) ions are first inserted, and the box is then solvated with TIP3P water and Na+. For the WAT environment, the box is directly solvated with water and Na+. After 100 ps with a thermostat and 100 ps with a barostat, the system goes through a 100 ns production MD run. Under each environment setting, we conduct 3 repeated MD simulations, so we obtain 6 trajectories in total. We denote them as IL1-3 (trajectories 1 to 3 in IL) and WAT1-3 (trajectories 1 to 3 in water). All the simulations are conducted using the GROMACS 4.6 package77. In our data analysis, we consider 5000 frames evenly sampled from the last 10 ns of the 6 trajectories. The detailed MD simulation settings and parameters can be found in the related paper78.

Weighted persistent homology modeling

For DNA conformation clustering, we extract 13 non-terminal DNA base steps for each frame of the simulation data. Similar to the DNA-type classification case, for each base step, we construct a 120-element PBN feature from the LWPH. Then, we concatenate all 13 sets of PBN values into a single feature vector for each DNA configuration. This LWPH based feature vector is used in the PCA of DNA trajectories in different environments. More specifically, a covariance matrix of the feature vectors for all the frames is built and then eigen-decomposed into principal components (eigenvectors). The first two eigenvectors span a plane, and all feature vectors are projected onto it. For comparison, in both IL and water, we apply PCA not only to the three individual MD trajectories, but also to the ensemble made of all three trajectories together. The results are demonstrated in Fig. 7. The projected points are illustrated as contour values for better visualization. To avoid confusion, Fig. 7(a1–a4) are for DNA trajectories in the IL environment. Among them, Fig. 7(a1) is for the ensemble of all three trajectories, and Fig. 7(a2–a4) are for the three individual trajectories, respectively. Figure 7(b1–b4) are for DNA trajectories in the water environment. Among them, Fig. 7(b1) is for the ensemble of all three trajectories, and Fig. 7(b2–b4) are for the three individual trajectories, respectively.
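The PCA step described above can be sketched with a direct eigen-decomposition of the covariance matrix. The function name `pca_project` is hypothetical, and the random matrix stands in for the real feature vectors, which would have 13 × 120 = 1560 columns; a small column count is used here only to keep the illustration light.

```python
import numpy as np

def pca_project(features, n_components=2):
    """Project feature vectors onto the leading eigenvectors of their covariance."""
    X = np.asarray(features, dtype=float)
    centered = X - X.mean(axis=0)
    cov = np.cov(centered, rowvar=False)        # (n_features, n_features)
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    components = eigvecs[:, order]              # leading principal directions
    return centered @ components, eigvals[order]

# Synthetic stand-in data: 200 frames with 6 features each.
rng = np.random.default_rng(0)
features = rng.normal(size=(200, 6)) * np.array([5.0, 2.0, 1.0, 0.5, 0.5, 0.5])
projected, variances = pca_project(features)
print(projected.shape, variances)
```

The two columns of `projected` are the coordinates in the plane of the first two principal components, which is the plane on which contour maps of the trajectories are drawn.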

Figure 7

The contour map generated from our LWPH based PCA models for DNA configurations in IL and WAT environments. The x-axis and y-axis are the first and second principal components. (a1) The DNA configuration ensemble for all three trajectories in IL. (a2–a4) Three DNA configuration trajectories from the MD simulation with IL. (b1) The DNA configuration ensemble for all three trajectories in the water environment. (b2–b4) Three DNA configuration trajectories from the MD simulation using water solution. It can be seen clearly that the areas of the contour maps in IL are much smaller than those in water, indicating the confinement effect on the DNA configurations in IL. Further, the contour graph for IL1 (a2) shows a clear difference from those for IL2 (a3) and IL3 (a4), meaning there is a subtle change of the ion-DNA binding mode in trajectory IL1.

Several unique properties can be seen from Fig. 7. Firstly, the confinement effect, i.e., the reduction of the distribution area, can be clearly observed in the IL environment. In fact, we can count the area of the distribution in IL and water, and the results are listed in Table 1. It can be seen that the areas of the distribution maps in IL are up to 2/3 of those in water. The smaller area confirms that the fluctuation of DNA in IL is greatly attenuated. Secondly, the contour graphs for IL and WAT show significantly different patterns. For the IL solution, two centers can be clearly identified from the contour graph in Fig. 7(a1). One is located in the upper left and the other in the right part. In contrast, for the water solution, only one center can be found, and its position differs greatly from those of the IL systems. This large difference indicates the change of DNA conformations in different environments. Further, for the water solution, all contour graphs have nearly the same distribution, indicating that all three trajectories behave quite similarly. In contrast, for the IL solution, the contour graph for IL1 shows a clear difference from those for IL2 and IL3, meaning there is a subtle change of the ion-DNA binding mode in trajectory IL1. Figure 12 in Supplementary shows the observed variation of ion-DNA binding in different metastates in IL1. Our LWPH based PCA is sensitive enough to capture this subtle structure variation.
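One simple way to quantify a distribution area is to count the occupied cells of a two-dimensional histogram of the projected points. The sketch below illustrates this idea only; the counting procedure, the function name `occupied_area`, the bin settings and the toy points are all assumptions and are not the procedure or data behind Table 1.

```python
import numpy as np

def occupied_area(points, bins, value_range, min_count=1):
    """Area of the histogram cells that contain at least min_count points."""
    points = np.asarray(points, dtype=float)
    counts, xedges, yedges = np.histogram2d(
        points[:, 0], points[:, 1], bins=bins, range=value_range
    )
    cell_area = (xedges[1] - xedges[0]) * (yedges[1] - yedges[0])
    return (counts >= min_count).sum() * cell_area

# A shared range and bin count keep the two areas comparable.
shared_range = [[-4.0, 4.0], [-4.0, 4.0]]
narrow = [[0.5, 0.5], [0.6, 0.4], [1.5, 0.5]]
wide = [[0.5, 0.5], [1.5, 0.5], [-2.5, -2.5], [3.5, -1.5]]
area_narrow = occupied_area(narrow, bins=8, value_range=shared_range)
area_wide = occupied_area(wide, bins=8, value_range=shared_range)
print(area_narrow, area_wide)
```

Using the same range and bin count for both ensembles matters, because the estimated area depends on the cell size.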

Table 1 The confinement effect in ionic liquid (IL) solution.

Our LWPH based results are highly consistent with previous results from the helical parameter based model78. As demonstrated in Fig. 8, a similar confinement effect is also observed in the CEHS based contour graphs for the IL solution. Further, the CEHS results for IL show two different centers, while the WAT contour graphs have only one center. This means that the CEHS model also captures the subtle change of the ion-DNA binding mode in trajectory IL1. To further check whether our LWPH and CEHS models identify the same type of ion-DNA configurational changes, we decompose the contour graph of IL1 into 10 separate subgraphs, each of which represents the DNA trajectory in 1 ns. We can clearly identify four center regions from these subgraphs, and they are consistent with the results from the helical-parameter (HP) based PCA. This further indicates that our LWPH based models are highly sensitive to DNA local structural variations. The results are demonstrated in Figs. 13 and 14 in Supplementary.

Figure 8

The contour map generated from helical-parameter based PCA models for DNA configurations in IL and WAT environments. The x-axis and y-axis are the first and second principal components. (a1) The DNA configuration ensemble for all three trajectories in IL. (a2–a4) Three DNA configuration trajectories from the MD simulation with IL. (b1) The DNA configuration ensemble for all three trajectories in the water environment. (b2–b4) Three DNA configuration trajectories from the MD simulation using water solution. The same confinement effect and two-center distribution of DNA configurations in the IL environment as in Fig. 7 are observed.

Further, it should be noted that the local DNA structure variations in IL1 cannot be captured by the general global models. As demonstrated in Fig. 15 in Supplementary, the general atom-coordinate based PCA fails to capture the DNA structure variation in IL1, even though it manages to preserve the confinement effect (details in Table 1). Similarly, the LPH based model also fails to reveal the variation. Moreover, it cannot even reveal the confinement effect of the IL environment. Details can be found in Table 1 and Fig. 16 in Supplementary. This is largely because the LPH based model focuses more on the covalent bonds and their related structures. Even when different combinations of atoms are considered, both coordinate-based PCA and LPH-based PCA are unable to identify the structure variation. In contrast, not only does the general LWPH model work, but LWPH models constructed from different combinations of backbone atoms and base atoms at the local scale also work. For instance, according to the CEHS scheme, we can take \(C8\), \(C4\), \(N1\) and C1′ of the purine base and \(N3\), \(C6\) and C1′ of the pyrimidine base, as shown in Fig. 17 in Supplementary. The LWPH based on these selected atoms also captures the confinement effect and ion-DNA configurational changes very well. The corresponding trajectories also show great consistency with both the helical-parameter based and the LWPH based results. Figures 18 and 19 in Supplementary show the corresponding results.

Lastly, our LWPH based models are very flexible and easy to combine with machine learning models. Traditional CEHS helical coordinate systems work well only for one and two base steps. They tend to fail if the local structure variation is induced by three or more adjacent base steps. Moreover, CEHS models are only suitable for DNA or RNA and cannot be used for proteins or other biomolecules. In comparison, our LWPH models are more general and can be used for any local structures from DNAs, RNAs, proteins, biomolecular complexes, or biomolecular assemblies. Another important property of LWPH based feature vectors is that they are unit free and can be used to compare local structures of different sizes. In this way, these feature vectors are well suited for machine learning models.

Concluding Remarks

In this paper, we discuss weighted persistent homology models and their applications in biomolecular structure, function, and dynamics analysis. We briefly review the WPH approaches, including vertex-weighted, edge-weighted, and simplex-weighted models. Essentially, weight values, which reflect physical, chemical and biological properties, are assigned to vertices (atom centers), edges (bonds), or higher order simplexes (clusters of atoms), depending on the biomolecular structure, function, and dynamics properties.

Further, we propose the first localized persistent homology and localized weighted persistent homology models and apply them to DNA structure classification and clustering. Our LPH and LWPH models are inspired by the great success of element specific persistent homology. In our models, biomolecules are not treated as an inseparable system; instead, they are decomposed into a series of local domains, which may overlap with each other. The general persistent homology or weighted persistent homology analysis is then applied to each of these local domains. In this way, functional properties that are embedded in localized topological invariants can be revealed. Our models characterize structural variations at any level and provide a new featurization of biomolecules for machine learning models.