1 Introduction

The research area of computational molecular biophysics has experienced, in the past few decades, impressive advancements in two complementary and strictly intertwined fields: on the one hand, the steadily growing and increasingly cheaper computational power has enabled the simulation of ever larger systems with atomistic resolution [1, 2]; on the other hand, there has been an explosion of diverse coarse-grained (CG) models [3,4,5], i.e. simpler representations of molecules in terms of relatively few sites interacting through effective potentials: these have filled several gaps between the length- and time-scales of interest and the current capability of all-atom methods to cover them. The scientific efforts making use of one or both these techniques have cracked several important problems open, ranging from protein folding to cell growth [6,7,8].

The development of a successful CG model is strongly dependent on the choice of the reduced representation, or CG mapping, and on the correct parametrization of the effective interactions [4, 8]. The latter challenge has received an enormous amount of attention, leading, e.g. in the case of proteins, to extremely accurate and sophisticated CG potentials such as OPEP [9, 10], AWSEM [11, 12] and UNRES [13, 14]. The former task has been the object of a smaller number of works, however its centrality in and beyond the process of coarse graining has recently started to emerge [8, 15]; indeed, a deep relationship exists between the degrees of freedom one selects to construct a CG model of the system, and those one employs to analyse the system’s behaviour from a more detailed representation.

On the one hand, high-resolution, fully atomistic models are necessarily required to let the properties and behaviour of complex biomolecular systems emerge; on the other hand, the interpretation and understanding of this behaviour requires a reduction of the mountain of data and its synthesis in a smaller amount of information. In a nutshell, while the generative process has to be high-resolution to be useful, its outcome has to be low-resolution to be intelligible. An intuitive example of this concept is given by the representation of a protein structure in terms of its \(C_{\alpha }\)’s, i.e. the alpha carbons of the backbone: this mapping is not only extensively employed in the development of CG models [16, 17](that is, models in which the whole amino acid is represented as a single bead whose position coincides with that of the \(C_\alpha \)), but it is also extremely common in the analysis of structures sampled in fully atomistic simulations [18, 19].

A few different strategies have been developed that aim at identifying the optimal CG mapping to describe a molecule, which differ most notably in the observable used to drive the optimisation. There exists a first class of algorithms that rely on a completely static, graph-based description of the system [20, 21], such as the recent one proposed by Webb et al. in which a bottom-up approach iteratively aggregates separate nodes of the molecular graph in CG sites [20]. A second group of approaches makes use of the dynamics of the system, obtained through models with more [22, 23] or less [24, 25] detailed force fields. For instance, a recent protocol proposed by us [23] revolves around the analysis of an all-atom molecular dynamics (MD) [26, 27] simulation trajectory of a protein in terms of a subset of the molecule’s atoms; a physics-driven choice of the latter allows one to identify the one or few mappings that return the most parsimonious yet informative simplified description of the system.

Each of these methods can be the most appropriate to investigate specific properties of the system at hand; at the same time, the majority of them performs the search for solutions of an optimisation problem within the overwhelmingly large space of all possible CG representations that can be assigned to the system. As an exhaustive exploration of this mapping space is hardly feasible in practice, the outcome of such schemes is often a pool of optimal CG representations, that is, an ensemble of local minima in the—likely—rugged landscape of the cost function that defines the mapping optimisation procedure.

Given the complexity of this problem, several natural questions arise: how degenerate is the space of solutions? Are its elements all markedly distinct from one another, or do they only represent mostly neutral and equivalent variations of a “typical” optimal CG mapping? Furthermore, how are the solutions distributed across the space of possible mappings? How much do they differ from, e.g. randomly chosen CG representations?

To be in the position of answering these questions, the most basic instrument one needs is a meter, i.e. a tool to measure distances in mapping space and assess the degree of “similarity” among its elements in a quantitative manner. The definition of such metric should be independent of the choice of the function subsequently employed to quantify the quality of a given reduced representation, in the same manner, e.g. in which the Euclidean distance separating two particles is independent of the interaction potential acting between them.

The objective of the present work is, thus, to contribute a tool for the exploration and characterisation of the mapping space, its metric and topological properties, and the relations among its instances. To this end, we introduce a notion of scalar product, and consequently of norm and distance, between reduced representations of a system. We first make use of these instruments in the exploration of some basic, bare metric and topological properties of mappings of a single, static molecular structure, i.e. without reference to its interactions and/or conformational sampling, but rather solely considering the coordinates of its constituent atoms. This provides a notion of mapping distance based on purely geometric properties of the molecule. Through the application of an enhanced sampling algorithm, namely the Wang–Landau method [28, 29], we characterise this mapping space in its entirety, and associate its properties to structural features of the underlying molecule. Furthermore, the isomorphism between the problem of exploring the possible mappings of a molecule and that of a lattice gas in a finite volume enables to highlight the emergence of first-order phase transitions in the latter, distinguishing CG representations with qualitatively different properties. We then investigate the topology of the mapping space making use of the distance between reduced representations, which enables a low-dimensional embedding that highlights its general features. This analysis is performed both in absence and in presence of a cost function, namely the mapping entropy [23, 24, 30,31,32], which gauges the quality of a given CG representation according to an information theoretical measure. Finally, we suggest a possible manner to extend the tools we introduced to characterise the mapping space in the static case so as to incorporate information about the reference system’s exploration of conformational space and thus, indirectly, about its energetics as well.

The paper is organised as follows: in Sect. 2 we develop a scalar product between decimation mappings of a macromolecular structure in a static conformation, and derive from it a notion of norm and distance in the mapping space; in Sect. 3 we study CG representations in terms of the distribution of values of the squared norm for mappings having a given number of retained sites N, first through random sampling, then making use of the Wang–Landau enhanced sampling method; in Sect. 4 we exploit a duality between the problem of mappings of a macromolecule and that of an interacting lattice gas in a finite volume to investigate the properties the molecule’s reduced representations; in Sect. 5 we discuss some topological features of the mapping space; in Sect. 6 we discuss an extension of the structure-based definition of the norm that includes information about the system’s energetics; in Sect. 7 we sum up the results of this work and discuss its future perspectives.

2 Theory

The construction of a CG model for a macromolecular system starts with the selection of a mapping M, that is, the projection operator connecting a microscopic, detailed configuration \(\mathbf{r}_i,\;i=1,...,n\) to a low-resolution one \(\mathbf{R}_I,\; I=1,...,N<n\),

$$\begin{aligned} M= & {} \left\{ \mathbf{M}_I(\mathbf{r}),\;\; I=1,...,N \right\} , \nonumber \\&\mathbf{M}_I(\mathbf{r})=\mathbf{R}_I=\sum _{i=1}^n c_{Ii} \mathbf{r}_i, \end{aligned}$$
(1)

where n and N are the number of atoms in the system and the number of effective interaction sites employed in its CG simplified picture, respectively. In Eq. 1, the weights \(c_{Ii}\) are positive, spatially homogeneous—i.e. independent of the configuration \(\mathbf{r}\)—and subject to the normalization condition \(\sum _{i=1}^n c_{Ii}=1\) to preserve translational invariance [4]. While a particular choice of these coefficients corresponds to a specific CG representation of the system, by varying them, along with changing the degree of CG’ing N, one spans the mapping space \(\mathcal {M}\), whose elements are all the possible low-resolution descriptions that can be assigned to a macromolecule.

In the perspective of quantitatively characterising the properties of such space, the cardinality of \(\mathcal {M}\) in the continuous definition presented in Eq. 1 makes its thorough exploration, although appealing, hard to handle in practice. In this work, we, thus, restrict our analysis to the discrete subspace of CG representations that can be obtained for a system through a decimation [33, 34] of its microscopic degrees of freedom: a subset of N constituent atoms is retained while the remaining ones are neglected.

By selectively discarding a subset of the system’s atoms, this structural simplification procedure can be applied to systems of arbitrary size or environments, e.g. molecules in solution, embedded in lipid membranes, or in the gas phase; the set of potentially retained atoms can be extended to the solvent as well, so as to expand the definition of the system under examination to include e.g. water molecules that form hydrogen bonds with the solute. Clearly, the filtering brought forward by the decimation mapping can be applied to a single configuration as well as to an ensemble of conformations sampled, e.g. in a molecular dynamics simulation. The outcome would consist, in the first case, in a single subset of coordinates, which thus preserves no information about the system other than the positions of the retained atoms; in the second case, the filtering would result in a ensemble of reduced representations whose distribution in configuration space reverberates the intra- and intermolecular interactions of the underlying high-resolution sample.

In this work, we will focus on a particular type of molecular structure, namely a protein, and restrict our analysis to decimation mappings in which retained atoms are only selected among those composing the molecule. Initially, calculations will be performed only considering the static, crystallographic configuration of the molecule; subsequently, its energetics will be explicitly accounted for in Sect. 6, where we provide some preliminary results on the extension of the theory to an ensemble of conformations generated through a molecular dynamics simulation of the protein in explicit solvent. Also in this latter case, the environment is not explicitly retained by the CG mapping—again, only protein degrees of freedom are considered—but its effects are implicitly encoded in the distributions of conformations sampled by the molecule.

Let us, thus, consider a protein composed by n constituent atoms; the number of representations \(\Omega _N\) that can be constructed by selecting N of them as effective CG sites is

$$\begin{aligned} \Omega _N=\frac{n!}{N!(n-N)!}, \end{aligned}$$
(2)

so that the total number of possible decimation mappings \(\Omega \) reads

$$\begin{aligned} \Omega =\sum _{N=1}^n \Omega _N=\sum _{N=1}^n \frac{n!}{(n-N)!N!}=2^n -1, \end{aligned}$$
(3)

which becomes prohibitively large as the size of the system increases. Consequently, in the following we only consider the heavy atoms of the molecule as candidate CG sites, indicating with \(\mathcal {M}\) the subspace of mappings obtained according to these prescriptions.

The investigation of the topological structure of \(\mathcal {M}\) calls for the introduction of a distance \(\mathcal {D}(M,M'),\ M,M' \in \mathcal {M}\), able to quantify the “separation” between pairs of points M and \(M'\) belonging to the space of decimation mappings, that is, pairs of CG representations employed to represent the system that differ in the choice of the retained atoms. Such distance must be equipped with all the associated metric properties, namely identity, symmetry, and triangle inequality.

To construct \(\mathcal {D}(M,M')\), we consider a static configuration of the molecule, namely the crystallograpic one, with (heavy) atoms located in positions \({\mathbf {r}}_i,~i=1,...,n\) and a set of selection operators \(\chi _{M,i},~i=1,..,n\) defining mapping M,

$$\begin{aligned} \chi _{M,i}= & {} \left\{ \begin{array}{l@{\quad }l} 1 &{} \,\text{ if } \text{ atom } i \text{ is } \text{ retained, }\\ 0 &{} \,\text{ if } \text{ atom } i \text{ is } \text{ not } \text{ retained, } \end{array} \right. \end{aligned}$$
(4)
$$\begin{aligned} \sum _{i = 1}^n \chi _{M,i}= & {} N(M), \end{aligned}$$
(5)

where N(M) is the number of retained atoms in the mapping. Taking inspiration from the Smooth Overlap of Atomic Positions method (SOAP) developed by Csány et al. [35, 36], we associate with each \(M\in \mathcal {M}\) an element \(\phi _M(\mathbf{r})\) of the Hilbert space of square-integrable real functions \(L_2({\mathbb {R}}^3)\) as

$$\begin{aligned} \phi _M({\mathbf {r}})=\sum _{i=1}^n\phi _{M,i}({\mathbf {r}})=\sum _{i=1}^n Ce^{-({{\mathbf {r}}}-{{\mathbf {r}}}_i)^2/2\sigma ^2}\chi _{M,i}, \end{aligned}$$
(6)

obtained by centering a three-dimensional Gaussian—whose normalization factor C will be fixed in the following—on the position of each atom of the macromolecule retained in the mapping.

The inner product \(\langle \phi _M,\phi _{M'}\rangle \) of \(L_2({\mathbb {R}}^3)\) between two mappings M and \(M'\),

$$\begin{aligned} \left\langle \phi _M,\phi _{M'}\right\rangle =\int \mathrm{d}{\mathbf {r}}\ \phi _M({\mathbf {r}})\phi _{M'}({\mathbf {r}}), \end{aligned}$$
(7)

induces a norm \(||\phi _M||\) for mapping M, with

$$\begin{aligned} \mathcal {E}(M)=||\phi _M||^2=\langle \phi _M,\phi _M\rangle , \end{aligned}$$
(8)

starting from which the distance \(\mathcal {D}(M,M')\) can be defined as

$$\begin{aligned} \mathcal {D}\left( M,M'\right)= & {} ||\phi _M-\phi _{M'}||\nonumber \\= & {} \left\langle \phi _M-\phi _{M'},\phi _M-\phi _{M'}\right\rangle ^{\frac{1}{2}}, \end{aligned}$$
(9)

\(\mathcal {D}(M,M')\) satisfying all the aforementioned metric properties.Footnote 1

By inserting Eq. 6 in Eq. 7, the inner product \(\langle \phi _M, \phi _{M'}\rangle \) between mappings generated by two distinct selection operators \(\chi _{M}\) and \(\chi _{M'}\) becomes

$$\begin{aligned} \left\langle \phi _M, \phi _{M'}\right\rangle =\sum _{i,j=1}^n J_{ij}\chi _{M,i}\chi _{M',j}, \end{aligned}$$
(10)

while the associated distance \(\mathcal {D}(M,M')\) in Eq. 9 reads

$$\begin{aligned}&\mathcal {D}\left( M,M'\right) = \left( \mathcal {E}(M)+\mathcal {E}(M')-2\langle \phi _M,\phi _{M'}\rangle \right) ^{\frac{1}{2}}\nonumber \\&\quad =\left( \sum _{i,j=1}^n J_{ij}\chi _{M,i}\chi _{M,j} \;+\; \sum _{i,j=1}^n J_{ij}\chi _{M',i}\chi _{M',j}\;+\right. \nonumber \\&\qquad \left. -2\sum _{i,j=1}^n J_{ij}\chi _{M,i}\chi _{M',j}\right) ^{\frac{1}{2}}. \end{aligned}$$
(11)

In Eqs. 10 and 11, the coupling constant \(J_{ij}=J_{ij}({\mathbf {r}}_i,{\mathbf {r}}_j)\) between two atoms i and j is given by

$$\begin{aligned} J_{ij}({\mathbf {r}}_i,{\mathbf {r}}_j)=C^2\int \mathrm{d}{\mathbf {r}}\ e^{-\left[ ({\mathbf {r}}-{\mathbf {r}}_i)^2+({\mathbf {r}}-{\mathbf {r}}_j)^2\right] /2\sigma ^2}, \end{aligned}$$
(12)

with

$$\begin{aligned} J_{ij}\left( {\mathbf {r}}_i,{\mathbf {r}}_j\right) =J_{ij}\left( |{\mathbf {r}}_i-{\mathbf {r}}_j|\right) =J_{ij}\left( r_{ij}\right) . \end{aligned}$$
(13)

due to translational and rotational invariance. By introducing polar coordinates in Eq. 12, one has

$$\begin{aligned}&J_{ij}(r_{ij})=2\pi C^2\int \mathrm{d}r \mathrm{d}\theta \ r^2 \sin \theta e^{-\frac{1}{2\sigma ^2}\left( 2r^2+r_{ij}^2-2rr_{ij}\cos \theta \right) }\nonumber \\&\quad =\frac{4\pi \sigma ^2}{r_{ij}}C^2 e^{-r^2_{ij}/2\sigma ^2}\int \mathrm{d}r\ re^{-r^2/\sigma ^2}\sinh \left( \frac{rr_{ij}}{\sigma ^2}\right) , \end{aligned}$$
(14)

and a chain of Gaussian integrals provides

$$\begin{aligned} J_{ij}(r_{ij})=\pi ^{3/2}C^2\sigma ^3 e^{-r^2_{ij}/4\sigma ^2}=e^{-r^2_{ij}/4\sigma ^2}, \end{aligned}$$
(15)

where the last equality has been obtained by setting, without loss of generality,

$$\begin{aligned} C^2=\frac{1}{\pi ^{3/2}\sigma ^3}. \end{aligned}$$
(16)

Finally, by combining Eqs. 10 and 15 the inner product \(\langle \phi _M,\phi _{M'}\rangle \) reads

$$\begin{aligned} \left\langle \phi _M,\phi _{M'}\right\rangle =\sum _{i,j=1}^n e^{-r^2_{ij}/4\sigma ^2}\chi _{M,i}\chi _{M',j}, \end{aligned}$$
(17)

i.e. a sum of Gaussian factors over the positions of all pairs of atoms retained in the two mappings. Notably, the factorization with respect to the operators \(\chi _{M}\) and \(\chi _{M'}\) in Eqs. 10 and 17 enables the inner product (and therefore the distance \(\mathcal {D}\) and the squared norm \(\mathcal {E}\)) to be determined starting from a matrix \(J_{ij}\) that can be calculated a priori over the static structure of the molecule.

One might ask what kind of information the previously defined quantities provide about the possible CG representations of a system. To answer this question, we first focus on the squared norm of a mapping \(\mathcal {E}(M)\), see Eqs. 8 and 17,

$$\begin{aligned} \mathcal {E}(M)=\langle \phi _M,\phi _{M}\rangle =\sum _{i,j=1}^n e^{-r^2_{ij}/4\sigma ^2}\chi _{M,i}\chi _{M,j}. \end{aligned}$$
(18)

For a given retained atom i, the sum over j in Eq. 18,

$$\begin{aligned} Z_i(M)=\sum _{j=1}^n e^{-r^2_{ij}/4\sigma ^2}\chi _{M,j}, \end{aligned}$$
(19)

approximately represents its CG coordination number, that is, the number of retained atoms in the mapping that are located within a sphere of radius \(\sqrt{2}\sigma \) from i. By fixing the degree of coarse-graining N, \(\mathcal {E}(M)\) scales as

$$\begin{aligned} \mathcal {E}(M)= & {} N{\bar{Z}}(M), \end{aligned}$$
(20)
$$\begin{aligned} {\bar{Z}}(M)= & {} \frac{1}{N}\sum _{i=1}^n Z_i(M)\chi _{M,i} \end{aligned}$$
(21)

showing that the dependence of the norm on the specific selection of atoms is dictated by \({\bar{Z}}(M)\), the average CG coordination number. Let us now consider two limiting cases: (i) extremely sparse and homogeneous CG representations, in which each retained atom does not have any retained neighbour within a radius of order \(\sqrt{2}\sigma \)—this condition can only be fulfilled provided that N is not too large, vide infra, or \(\sigma \) is much smaller than the typical interatomic distance. In this case, one has \({\bar{Z}}(M)\approx 1\) and consequently \(\mathcal {E}(M)\approx N\); (ii) globular mappings characterised by densely populated (i.e. almost atomistic) regions of retained sites surrounded by “empty” ones. In this case, the average coordination number \({\bar{Z}}(M)\) will roughly resemble its atomistic counterpart, the latter being defined as

$$\begin{aligned} {\bar{z}}=\frac{1}{n}\sum _{i,j=1}^n e^{-r^2_{ij}/4\sigma ^2}, \end{aligned}$$
(22)

and thus \(\mathcal {E}(M)\approx N{\bar{z}}\). It follows that the squared norm \(\mathcal {E}(M)\) captures the average homogeneity of a CG representation, that is, whether the associated retained atoms are uniformly distributed across the macromolecule or are mainly localized in well-defined regions of it. In Fig. 1, we report examples of CG mappings extracted for these two extreme categories in the case of adenylate kinase (see Sect. 3 for further details on this protein) together with a CG representation in which the retained atoms are randomly selected.

Fig. 1
figure 1

Top row: Example of possible CG representations for adenylate kinase with \(N=214\) sites (represented as beads) characterised by a low (a), intermediate (b) and high (c) mapping squared norm \(\mathcal {E}\). By increasing \(\mathcal {E}\) we move from maximally homogeneous to extremely globular CG representations. Bottom row: Examples of CG mappings with \(N=53\) sites that are approximately parallel (d) and orthogonal (e) to a given one. The atoms composing the reference CG representation are represented as black beads. Parallel (resp. orthogonal) mappings tend to displace CG sites on similar (resp. complementary) regions of the system

An analogous discussion can be performed for the inner product \(\langle \phi _M,\phi _{M'}\rangle \) in Eq. 17, calculated between two mappings M and \(M'\) that respectively retain N and \(N'\) atoms of the system. For a given atom i in mapping M,

$$\begin{aligned} T_i\left( M'\right) =\sum _{j=1}^n e^{-r^2_{ij}/4\sigma ^2}\chi _{M',j}, \end{aligned}$$
(23)

approximately counts the number of neighbours j in mapping \(M'\) located within a sphere of radius \(\sqrt{2}\sigma \) from i. The inner product scales as

$$\begin{aligned} \langle \phi _M,\phi _{M'}\rangle= & {} N{\bar{T}}\left( M,M'\right) , \end{aligned}$$
(24)
$$\begin{aligned} {\bar{T}}\left( M,M'\right)= & {} \frac{1}{N}\sum _{i=1}^n T_i\left( M'\right) \chi _{M,i}, \end{aligned}$$
(25)

where \({\bar{T}}(M,M')\) is again the average number of neighbours an atom in mapping M has that belong to mapping \(M'\). Eqs. 23, 24 and 25 provide a very intuitive explanation of the orthogonality of mappings, \(\langle \phi _M,\phi _{M'}\rangle \approx 0\): it is sufficient that each atom in mapping M does not have any neighbour in \(M'\) (and obviously vice-versa). As such, orthogonal mappings cover complementary regions of the system.

In general, the existence of an inner product enables the definition of an angle \(\theta _{M,M'}\) between mappings, whose cosine reads

$$\begin{aligned} \cos \theta _{M,M'}=\frac{\langle \phi _M,\phi _{M'}\rangle }{\left( \mathcal {E}(M)\mathcal {E}(M')\right) ^{\frac{1}{2}}}. \end{aligned}$$
(26)

While the orthogonality of mappings (\(\cos \theta _{M,M'}\approx 0\)) has a relatively straightforward interpretation in terms of their spatial complementarity, the condition of parallelism, \(\cos \theta _{M,M'}\approx 1\), is a bit less intuitive. If the mappings M and \(M'\) have the same number of atoms N, by inserting Eqs. 20 and 24 in Eq. 26 one obtains

$$\begin{aligned} \cos \theta _{M,M'}=\frac{{\bar{T}}\left( M,M'\right) }{\left( {\bar{Z}}(M){\bar{Z}}(M')\right) ^{\frac{1}{2}}}. \end{aligned}$$
(27)

If furthermore the two mappings show also roughly the same “globularity”, \({\bar{Z}}(M)\approx {\bar{Z}}(M')\), their parallelism requires

$$\begin{aligned} {\bar{T}}\left( M,M'\right) \approx {\bar{Z}}(M), \end{aligned}$$
(28)

that is, the average number of neighbors one atom of M has that belong to mapping \(M^\prime \) has to be equal to the average number of neighbors the atom has that belong to M itself. This means that the two mappings must place retained atoms across the macromolecule in a similar fashion. Examples of approximately parallel and orthogonal CG representations for adenylate kinase are presented in Fig. 1.

It follows that while \(\mathcal {E}(M)\) quantifies the average sparseness of a CG representation, \(\langle \phi _M,\phi _{M'}\rangle \)—or equivalently \(\cos \theta _{M,M'}\)—characterises the average degree of spatial similarity between two different decimations of the microscopic degrees of freedom of the system. The distance \(\mathcal {D}(M,M')\) in Eq. 11 combines these two notions to extract how “far” a pair of CG representations is in the space of possible mappings \(\mathcal {M}\).

Based on these observations, we implemented a slight modification to the inner product \(\langle \phi _M,\phi _{M'}\rangle \)—and hence to the squared norm \(\mathcal {E}(M)\) and distance \(\mathcal {D}(M,M')\)—with respect to the definition originally presented in Eq. 17, which, however, does not change its overall properties or interpretation. We have previously discussed how in the limiting cases of extremely sparse and globular mappings one respectively obtains \(\mathcal {E}(M)\approx N\) and \(\mathcal {E}(M)\approx N{\bar{z}}\), where \({\bar{z}}\) is the atomistic coordination number in Eq. 22. As the number of CG sites N increases, however, it will be extremely hard for a retained site not to have any retained neighbor within a sphere of radius of order \(\sigma \), so that the exact scaling of \(\mathcal {E}(M)\) on the degree of CG’ing N in the case of sparse mappings will be hardly observed. We thus divide the inner product in Eq. 17 by the average atomistic coordination number, and define

$$\begin{aligned} \langle \phi _M,\phi _{M'}\rangle _{{\bar{z}}}~=\frac{1}{{\bar{z}}}~\langle \phi _M,\phi _{M'}\rangle . \end{aligned}$$
(29)

Consequently, one has

$$\begin{aligned} \mathcal {E}_{{\bar{z}}}(M)= & {} \frac{1}{{\bar{z}}}~\mathcal {E}(M), \end{aligned}$$
(30)
$$\begin{aligned} \mathcal {D}_{{\bar{z}}}\left( M,M'\right)= & {} \frac{1}{\sqrt{{\bar{z}}}}~\mathcal {D}\left( M,M'\right) , \end{aligned}$$
(31)

while the cosine between two mappings \(\cos \theta _{M,M'}\) is not affected by the rescaling. With this choice, globular mappings are now associated to \(\mathcal {E}(M)_{{\bar{z}}}\approx N\), which can always be observed also in the case of low degrees of CG’ing, that is, high N. Note that the definition of \(\langle \phi _M,\phi _{M'}\rangle _{{\bar{z}}}\) in Eq. 29 corresponds to a rescaling of the coupling constant \(J_{ij}\) in Eq. 15 to

$$\begin{aligned} J_{ij}=\frac{1}{{\bar{z}}}~e^{-r^2_{ij}/4\sigma ^2}. \end{aligned}$$
(32)

For notational convenience, in the following, we will omit the subscript \({\bar{z}}\) and refer to \(\mathcal {E}(M)_{{\bar{z}}}\), \(\langle \phi _M,\phi _{M'}\rangle _{{\bar{z}}}\) and \(\mathcal {D}_{{\bar{z}}}(M,M')\) as \(\mathcal {E}(M)\), \(\langle \phi _M,\phi _{M'}\rangle \) and \(\mathcal {D}(M,M')\), respectively.

3 Exploration of the mapping space

Starting from the definitions introduced in Sect. 2, we now proceed to perform a quantitative analysis of the high-dimensional space \(\mathcal {M}\) of CG representations that can be constructed for a macromolecule through a decimation of its atomistic degrees of freedom. As a testbed system we consider adenylate kinase (AKE), a 214 residue-long phosphotransferase enzyme catalysing the interconversion between adenine nucleotides, namely adenine diphosphate (ADP), adenine monophosphate (AMP), and the adenine triphosphate complex (ATP) [37]. The structure of adenylate kinase can be divided in three main building blocks [38, 39], with the mobile LID and NMP domains exhibiting a conformational rearrangement around a hinge, the stable CORE domain, which results in an overall open \(\leftrightarrow \) closed transition of the enzyme [40, 41]. Our calculations require in input only the value of the \(\sigma \) parameter and a static configuration \({\mathbf {r}}_i,~i=1,...,n\) of the system to determine the set of Gaussian couplings \(J_{ij}\) in Eq. 32. We here set \(\sigma = 1.9\)Å (that is, half the separation between two consecutive \(\alpha \) carbons), and rely on the open crystal conformation of adenylate kinase (PDB code 4AKE), excluding from the analysis all hydrogens composing the biomolecule, resulting in a total of 1656 heavy atoms.

The investigation of the topological structure of the decimation mapping space of AKE calls for an extensive characterisation of the relational properties among its points, achievable by analysing the behaviour of the distance \(\mathcal {D}(M,M')\) over an ensemble of prototypical CG representations extracted from \(\mathcal {M}\). The discussion carried out in Sect. 2, however, highlighted that \(\mathcal {D}(M,M')\) strictly depends on two factors: the globularity of each mapping—encoded in the squared norm \(\mathcal {E}(M)\)—and their mutual spatial complementarity—that is, the inner product \(\langle \phi _M,\phi _{M'}\rangle \) or equivalently the cosine \(\cos \theta _{M,M'}\). It is then useful to first focus on these one- and two-“body” ingredients before combining them into the distance \(\mathcal {D}(M,M')\). As such, in Sects. 3.1, 3.2 and 4 we will respectively discuss the behaviour of \(\mathcal {E}(M)\) and \(\cos \theta _{M,M'}\) across the mapping space of AKE; the analysis of the distance \(\mathcal {D}\), and hence of the topology of \(\mathcal {M}\), will be presented in Sect. 5.

3.1 Norm distributions

Let us first consider the squared norm \(\mathcal {E}(M)\) of a CG representation M defined in Eq. 30. As previously discussed, this quantity provides information about the spatial homogeneity of a mapping with a given degree of CG’in N; that is to say, it recapitulates how the retained atoms are distributed across the molecular structure, from uniformly scattered (\(\mathcal {E}(M)\approx N/{\bar{z}}\)) to mainly concentrated in well-defined, almost atomistic domains emerging out of a severely CG’ed background (\(\mathcal {E}(M)\approx N\)).

It is important to stress that mappings belonging to the two aforementioned extreme cases are routinely employed by the CG’ing community in the description of a biomolecular system. In proteins, examples from the homogeneous class include physically intuitive, residue-based CG representations of the molecule in terms of its \(\alpha \) carbons or backbone atoms [5, 8]; homogeneity, on the other hand, is often abruptly broken in chemically informed, multiscale mappings, in which a higher level of detail, up to the atomistic one, is sharply localized on the biologically/chemically relevant regions of the system—e.g. the active sites of the protein—while the reminder is treated at extremely low resolution [8]. Furthermore, moving away from these limiting cases, an increasing attention is being posed in employing CG descriptions in which the level of detail is, although inhomogeneously, quasi-continuosly modulated throughout the molecular structure [8].

Be they fully homogeneous, markedly inhomogeneous, or smoothly interpolating between these two classes, the CG representations that are usually adopted in the literature to simplify a biomolecule are often selected a priori by relying on general and intuitive criteria. Critically, such representations only constitute elements, isolated instances extracted from the high-dimensional mapping space \(\mathcal {M}\) of the system. One natural question follows: how representative are these “common” mappings of the diversity of the space \(\mathcal {M}\)? In other words, how spatially homogeneous are the possible CG descriptions that can be designed for a macromolecule when no prior knowledge about its chemical structure or biological function is exploited to guide the mapping construction?

Table 1 Average mapping squared norm \(\langle \mathcal {E}\rangle _N\) and associated standard deviation \(\sigma _{\mathcal {E},N}\) at different degrees of coarse-graining N, calculated over the mapping space \(\mathcal {M}\) of AKE. We present random sampling results (RS), as well as those obtained from a saddle-point approximation to the density of states \(\Omega _N(\mathcal {E})\) determined through the Wang–Landau method (WL-SP), see text

To answer this question, we start by introducing the number of mappings \(\Omega _N(\mathcal {E})\) that attain a particular value \(\mathcal {E}\) of the squared norm for a given number of CG sites N, which is given by:

$$\begin{aligned}&\Omega _N(\mathcal {E}) = \sum _{M\in \mathcal {M}}\delta (N(M),N)\delta ({\mathcal {E}(M),\mathcal {E}}) \end{aligned}$$
(33)

with

$$\begin{aligned}&\sum _{M\in \mathcal {M}}\mathcal {O}(M)=\sum _{\chi _1=0,1}...\sum _{\chi _n=0,1}\mathcal {O}(\lbrace \chi _i\rbrace ), \end{aligned}$$
(34)

where \(\mathcal {O}\) is a generic observable that depends on the mapping through the operators \(\chi _i\). Normalizing Eq. 33 by the total number of mappings with N sites, \(\Omega _N\), we define the conditional probability of having a mapping with norm \(\mathcal {E}\) given that the degree of coarse-graining is N, that is

$$\begin{aligned} P_N(\mathcal {E}) = \frac{\Omega _N(\mathcal {E})}{\Omega _N}, \end{aligned}$$
(35)

which satisfies the normalization condition

$$\begin{aligned} \sum _{\mathcal {E}}P_N(\mathcal {E})=1 \end{aligned}$$
(36)

regardless of the number of retained sites. \(P_N(\mathcal {E})\) can be rewritten as

(37)

where the primed sum runs over all mappings with fixed resolution N, i.e. over all values of the set of operators \(\chi _i=0,1,~i=1,..,n\) satisfying

$$\begin{aligned} \sum _{i = 1}^n \chi _i = N. \end{aligned}$$
(38)

By providing direct insight on the degree of spatial uniformity characterising the ensemble of all possible CG descriptions of a macromolecular system, \(P_N(\mathcal {E})\) represents a first important ingredient in the investigation of the structure of the mapping space \(\mathcal {M}\). We, thus, aimed at analysing the behaviour of the conditional probability \(P_N(\mathcal {E})\) across the decimation mapping space \(\mathcal {M}\) of AKE for a set of 16 values of N ranging from \(N=53\) to 1605, see Table 1. However, even restricted to these cases, an exhaustive enumeration of all possible CG representations of the system is unfeasible in practice: for example, in the case of AKE (\(n=1656\)), roughly \(10^{276}\) possible CG representations can be constructed that describe the enzyme in terms of a subset of \(N=214\) heavy atoms (one for each residue). This number grows to \(10^{496}\) for \(N=856\) (four heavy atoms per residue on average), that is, close to the maximum of the binomial coefficient, obtained for \(N=n/2\), see Eq. 2.

To overcome this combinatorial challenge, for each degree of CG’ing we generated \({\widetilde{\Omega }}_\mathrm{tot}=2\times 10^6\) uniformly distributed random mappings as strings \(\chi _i,~i=1,...,n\) of zeros and ones compatible with Eq. 38, and calculated the associated squared norm \(\mathcal {E}\). Results for each N were then binned along the \(\mathcal {E}\) axis in intervals of \(\delta \mathcal {E}=0.1\), and the corresponding \(P_N(\mathcal {E})\) was estimated as

$$\begin{aligned} P_N(\mathcal {E})=\frac{1}{\delta \mathcal {E}}\frac{{\widetilde{\Omega }}_N(\mathcal {E})}{{\widetilde{\Omega }}_\mathrm{tot}}, \end{aligned}$$
(39)

where \({\widetilde{\Omega }}_N(\mathcal {E})\) is the number of sampled mappings with squared norm falling between \(\mathcal {E}\) and \(\mathcal {E}+\delta \mathcal {E}\). Note that in this way we are approximately treating as continuous the intrinsically discrete, unevenly spaced spectrum of possible norms, and the density \(P_N(\mathcal {E})\)—and consequently \(\Omega _N(\mathcal {E})\)—as piecewise constant. In this “continuous” limit, the normalization condition of \(P_N(\mathcal {E})\) becomes

$$\begin{aligned} 1=\sum _{\mathcal {E}}P_N(\mathcal {E})\delta {\mathcal {E}}\simeq \int \mathrm{d}\mathcal {E}P_N(\mathcal {E}). \end{aligned}$$
(40)

The set of distributions \(P_N(\mathcal {E})\) obtained from our random sampling of the mapping space of AKE are displayed in Fig. 2. We observe that, for each value of the CG resolution N, \(P_N(\mathcal {E})\) is unimodal and narrowly peaked around its average squared norm,

$$\begin{aligned} \langle \mathcal {E}\rangle _N=\int \mathrm{d}\mathcal {E}P_N(\mathcal {E})\mathcal {E}, \end{aligned}$$
(41)

\(\langle \mathcal {E}\rangle _N\) being an increasing function of N. On the other hand, the standard deviation \(\sigma _{\mathcal {E},N}\),

$$\begin{aligned} \sigma _{\mathcal {E},N}=\left( \int \mathrm{d}\mathcal {E}P_N(\mathcal {E})(\mathcal {E}-\langle \mathcal {E}\rangle _N)^2\right) ^{\frac{1}{2}}, \end{aligned}$$
(42)

is non-monotonic in the degree of CG’ing: starting from extremely small values in the case of few retained atoms (e.g. \(N=53,107\) and 214), \(\sigma _{\mathcal {E},N}\) increases roughly up to \(N\approx 3n/4\) and then starts to decrease, reaching zero for \(N=n\)—in this case, only one possible mapping exists, namely the atomistic one. These features are further highlighted in Table 1 and Fig. 3, in which we report the dependence of \(\langle \mathcal {E}\rangle _N\) and \(\sigma _{\mathcal {E},N}\) on the degree of CG’ing N as obtained from the distributions \(P_N(\mathcal {E})\) in Fig. 2.

Fig. 2
figure 2

Probability \(P_N(\mathcal {E})\) of the norm of the mapping \(\mathcal {E}\) for AKE calculated at various degrees of CG’ing N, as obtained from a random sampling of the mapping space \(\mathcal {M}\). Arrows indicate the values of N for which a reconstruction of the density of states \(\Omega _N(\mathcal {E})\) through the Wang–Landau algorithm has been performed

Fig. 3
figure 3

Inset: Standard deviation \(\sigma _{\mathcal {E},N}\) of the mapping norm \(\mathcal {E}\) as a function of the degree of CG’ing N obtained from a random sampling of the mapping space \(\mathcal {M}\) of AKE. Main plot: N-dependence of the average squared norm \(\langle \mathcal {E}\rangle _N\) (“Random”, black line) and associated \(3\sigma _{\mathcal {E},N}\) confidence interval (khaki area) as obtained from a random sampling of the mapping space of AKE, superimposed to the region covered by the set of single-window, preliminary WL runs (purple area). The minimum (“WL-min”, blue line) and maximum (“WL-max”, red line) squared norms reached by the preliminary runs are highlighted. “WL-max” also corresponds to the scaling \(\mathcal {E}\approx N\) obtained in the case of inhomogeneous, globular mappings

\(\langle \mathcal {E}\rangle _N\) quantifies the average spatial homogeneity of the ensemble of CG representations that can be randomly assigned to AKE at a specific resolution. As previously discussed, maximally inhomogenous mappings, in which a chiseled chunk of the biomolecule is treated atomistically while the remainder is almost neglected, are characterised by \(\mathcal {E}\approx N\). Critically, Fig. 3 displays that such linear scaling lies always above the average \(\langle \mathcal {E}\rangle _N\) for all degrees of coarse-graining investigated. The deviation between the two curves is non-monotonic, with a maximum obtained for \(N=n/2\), and only vanishes for \(N\rightarrow n\), where mappings become very dense as they collapse towards the atomistic representation. As a consequence, the CG representations one encounters by randomly probing the mapping space \(\mathcal {M}\) tend to be “sparse” rather than compact. Furthermore, the difference between the squared norm of the globular case and \(\langle \mathcal {E}\rangle _N\) is always (but for \(N{\approx } n\)) one or two orders of magnitudes larger than the standard deviation of the corresponding \(P_N(\mathcal {E})\), see Fig. 3. It follows that inhomogeneous mappings lie extremely far away in the right tails of the distributions displayed in Fig. 2, thus constituting an exponentially vanishing subset of the space \(\mathcal {M}\).

The suppression of the statistical weight associated with high-norm, globular CG representations of AKE in the space of all possible ones is not surprising, and is solely driven by entropic effects. Indeed, at least for small and intermediate N, it is extremely unlikely that a completely random selection of retained atoms across the biomolecule will result in their dense confinement within sharply defined spatial domains of the system, just as it is unlikely for a gas to occupy only a small fraction of the volume in which it is enclosed. Interestingly, this latter analogy can be pushed further by noting that the squared norm \(\mathcal {E}(M)\), see Eqs. 30 and 18, is akin to the negative configurational energy of a lattice gas living on the irregular grid defined by the protein’s conformation, whose particle interact via a hard-core, short-range potential followed by an attractive Gaussian tail. In this context, the selection operators \(\chi _{M,i}=0,1,~i=1,...,n\) of a mapping M with N retained atoms can be interpreted as the set of occupation numbers describing a distribution of the N particles of the gas on the n available lattice sites. It follows that compact CG representations of AKE, located in the large-\(\mathcal {E}\) limit of \(P_N(\mathcal {E})\), are just as challenging to randomly sample within the space \(\mathcal {M}\) as are the low-energy configurations of the gas in which the N particles spontaneously occupy only a fraction of the available volume. The implications of this analogy will be thoroughly explored in Sect. 4.

The strongly entropy-driven distribution of mappings calls for the introduction of enhanced sampling techniques to boost the exploration of the mapping space; in this work, we resort to the algorithm proposed by Wang and Landau (WL) [28, 29, 42, 43]. For each CG resolution N, the aim is to obtain a uniform sampling of the possible mapping norms \(\mathcal {E}\) across the space \(\mathcal {M}\), in contrast to the set of narrowly peaked probability distributions displayed in Fig. 2. In principle, this is attained by setting up a Markov chain Monte Carlo simulation in which a transition between two subsequent mappings M and \(M'\)—both retaining N atoms—is accepted with probability \(\alpha \) given by [28]

$$\begin{aligned}&\alpha _{M\rightarrow M'}=\text {min}\left[ 1,\frac{\Omega _N(\mathcal {E}(M))}{\Omega _N(\mathcal {E}(M'))}\right] \nonumber \\&\quad =\text {min}\left[ 1,\exp {\left( -[S_N(\mathcal {E}(M'))- S_N(\mathcal {E}(M))]\right) }\right] , \quad \quad \end{aligned}$$
(43)

where \(\Omega _N(\mathcal {E})\) is the density of states defined in Eq. 35 while \(S_N(\mathcal {E})=\ln [\Omega _N(\mathcal {E})]\) is the corresponding microcanonical entropy. When compounded with a symmetric proposal probability \(\pi \) for the attempted move, \(\pi _{M\rightarrow M'}=\pi _{M'\rightarrow M}\), the Markov chain in Eq. 43 would generate, after an initial relaxation transient, CG representations distributed according to \(p(M)\sim 1/\Omega _N(\mathcal {E}(M))\) [28], resulting in a flat histogram \(P_N(\mathcal {E})\) of visited norms over the whole range of possible ones [43].

Fig. 4
figure 4

Left: Logarithm of the density of states \(\Omega _{N}(\mathcal {E})\) of AKE, \(S_N(\mathcal {E})=\ln [\Omega _{N}(\mathcal {E})]\), for \(N=856\). We report results obtained via (i) Wang–Landau sampling (“WL”, red dotted line), vertically shifting the data so that the minimum of \(S_N\) over the range of investigated norms is zero; (ii) a saddle-point approximation of the WL predictions (“SP-approx”, orange dashed line); and (iii) a random drawing of CG representations (“Random”, black line), in this latter case shifting the curve so that its maximum coincides with the one of the WL profile. In the figure we also report the squared norm associated to the mapping in which all the heavy atoms composing the backbone of AKE are retained (“backbone”, dashed blue line), a CG representation that is commonly employed when CG’ing a protein [5, 8]. Right: First (main plot) and second (inset) derivatives \(S'_N(\mathcal {E})\) and \(S''_N(\mathcal {E})\) of the entropy \(S_N(\mathcal {E})\) determined via WL sampling for \(N=856\)

In practice, however, the density of states in Eq. 43 is not known a priori. The power of WL approach resides in its ability to self-consistently obtain \(\Omega _N(\mathcal {E})\) through a sequence \(k=1,...,K\) of non-equilibrium simulations in which increasingly accurate approximations \({\bar{\Omega }}^k_N(\mathcal {E})\) to the exact result are generated, iterations being stopped when the desired precision is achieved [28, 29]. For the sake of brevity, we here omit an exhaustive discussion of the general algorithmic workflow behind WL sampling as well as an in-depth description of the specific implementation employed in this work; these details are provided in Appendix A.

In the WL reconstruction of a density of states such as \(\Omega _N(\mathcal {E})\), knowledge of the sampling boundaries proves extremely beneficial to the accuracy and rate of convergence of the self-consistent scheme [44]. For each degree of CG’ing investigated, we, thus, initially performed a preliminary, non-iterative WL run to approximately locate the minimum and maximum mapping norms \(\mathcal {E}_\mathrm{min}(N)\) and \(\mathcal {E}_\mathrm{max}(N)\) achievable for AKE at that specific CG resolution, and consequently bound the support of the corresponding \(\Omega _N(\mathcal {E})\).

The results for \(\mathcal {E}_\mathrm{min}(N)\) and \(\mathcal {E}_\mathrm{max}(N)\) obtained from this analysis are presented in Fig. 3 and Table 2 of Appendix A. We observe that the mapping norms visited by the set of preliminary WL runs extend, for all values of N, over a significantly wider range compared to the one obtained by random sampling. Remarkably, the maximum norm \(\mathcal {E}_\mathrm{max}(N)\) exhibits a linear dependence on N that is fully compatible with the one associated to globular CG representations, \(\mathcal {E}_\mathrm{max}(N)\approx N\), highlighting that the WL approach succeeds in exploring this entropically suppressed region of the mapping space. Furthermore, Fig. 3 displays that the minimum norm \(\mathcal {E}_\mathrm{min}(N)\) identified by the preliminary runs lies always below the average \(\langle \mathcal {E}\rangle _N\) for all values of N. In contrast to globular mappings, CG representations living in this low \(\mathcal {E}\) limit are maximally homogeneous, that is, retained atoms are scattered throughout the molecular structure as uniformly as possible. This class constitutes another exponentially vanishing subset of the mapping space: in the gas picture, it would correspond to the ensemble of configurations in which gas particles are regularly distributed within the available volume.

Having approximately identified the range of mapping norms achievable for AKE at each CG resolution, we subsequently moved to the determination of the associated densities of states \(\Omega _N(\mathcal {E})\) via the iterative WL scheme, see Appendix A for all technical details. Calculations were only performed for a subset of degrees of CG’ing, namely those in which the number of retained atoms N is an integer multiple of the number of residues composing the biomolecule, \(N=i\cdot 214,~i=1,...,7\).

To speed-up convergence of the algorithm, for each N we slightly reduced the range of norms \([\mathcal {E}_\mathrm{min},\mathcal {E}_\mathrm{max}]\) with respect to the one predicted by the explorative WL runs, see Table 2 in Appendix A. This interval was then divided into a set of overlapping windows in which independent WL simulations were performed [29]. The resulting partial densities of states were a posteriori combined to determine the cumulative \(\Omega _N(\mathcal {E})\) up to a global multiplicative factor, or, in our case, the entropy \(S_N(\mathcal {E})=\ln [\Omega _N(\mathcal {E})]\) up to an additive constant.

WL estimates of the entropy \(S_N(\mathcal {E})\) are presented in Fig. 4 for \(N=856\), while results for all the other degrees of CG’ing are reported in Fig. 12 of Appendix A. In all cases, we observe that the behaviour of \(S_N\) is non-monotonic in \(\mathcal {E}\), exhibiting a unique maximum as the mapping norm moves from the left to right boundary of the range of investigated ones—that is, in transitioning from extremely homogeneous to maximally globular CG representations. As \(\Omega _N(\mathcal {E})=\exp [S_N(\mathcal {E})]\), this result confirms how these two limiting classes of mappings constitute regions of exponentially vanishing size within the broad space \(\mathcal {M}\). At the same time, the overall shape of \(S_N\) strongly depends on the degree of CG’ing: while for high N entropy profiles are nearly symmetric around their maximum, they become increasingly skewed as fewer and fewer atoms are employed to represent the macromolecule, see Fig. 12. This asymmetry becomes apparent by performing, for each CG resolution, a quadratic expansion of \(S_N\) around its maximum,

$$\begin{aligned} S_N(\mathcal {E})\simeq S_N(\tilde{\mathcal {E}}(N))+\frac{1}{2}S''_N\left( \tilde{\mathcal {E}}(N)\right) \left( \mathcal {E}-\tilde{\mathcal {E}}(N)\right) ^2,\nonumber \\ \end{aligned}$$
(44)

where \(\tilde{\mathcal {E}}(N)\) is the norm at which the first derivative \(S'_N\) of the entropy vanishes, and \(S''_N(\tilde{\mathcal {E}}(N))\) is the corresponding second derivative—the dependence of \(S'_N\) and \(S''_N\) on \(\mathcal {E}\) being displayed in Fig. 4 for \(N=856\). The accuracy of this parabolic, symmetric approximation in reproducing the exact \(S_N\) over the whole \(\mathcal {E}\)-range increases with the number of retained atoms, see Figs. 4 and 12, especially as far as the limit of high mapping norms is concerned.

Finally, it is interesting to test the predictions of WL sampling against the results obtained via a completely random exploration of the mapping space. To this end, Fig. 4 and Fig. 12 include a comparison between the WL entropies \(S_N\) and their random counterparts \(S^{ran}_N\), the latter defined as \(S^{ran}_N(\mathcal {E})=\ln [P_N(\mathcal {E})]+C_N\), where \(P_N(\mathcal {E})\) are the probability densities presented in Fig. 2 and the constants \(C_N\) are set so that the maxima of \(S^\mathrm{ran}_N\) and \(S_N\) coincide. For each value of N the two profiles are in perfect agreement, thus confirming the accuracy of the self-consistent WL scheme in determining the density of states of a system. Critically, results for \(S^\mathrm{ran}_N\) only extend over a very narrow range of mapping norms, centred around the value \(\tilde{\mathcal {E}}(N)\) for which the maximum of the entropy is attained. It is, therefore, largely unfeasible, by randomly drawing CG representations, to exhaustively explore the mapping space \(\mathcal {M}\) of a macromolecule. In this respect it is worth to inspect the position, on the \(\mathcal {E}\) axis, of the \(C_{\alpha }\) and backbone mappings (which in AKE retain \(N=214\) and \(N=856\) sites, respectively), two reduced representations that are routinely employed for CG’ing proteins [5, 8]. These turn out to be located in the vicinity of the class of “prototypical” random ones, for which the entropy \(S_N\) reaches its maximum; however, their intrinsic regularity, dictated by the position of the retained sites on the peptide chain, makes these mappings slightly more homogeneous than the random ones, see Figs. 4 and 12.

To provide a more quantitative measure of the consistency between random and WL sampling results, for each degree of CG’ing, we recalculated the average and variance of the mapping norm, see Eqs. 41 and 42, starting from the WL entropies \(S_N\). These are used to compute \(P_N(\mathcal {E})\) making use of a saddle-point approximation of Eq. 35, namely

$$\begin{aligned}&P_N(\mathcal {E})=\frac{\Omega _N(\mathcal {E})}{\Omega _N} =\frac{\exp [S_N(\mathcal {E})]}{\int \mathrm{d}\mathcal {E}\exp [S_N(\mathcal {E})]} \nonumber \\&\simeq \left( \frac{|S''_N(\tilde{\mathcal {E}}(N))|}{2\pi }\right) ^{\frac{1}{2}} \exp \left[ {\frac{1}{2}S''_N(\tilde{\mathcal {E}}(N))(\mathcal {E}-\tilde{\mathcal {E}}(N))^2}\right] ,\nonumber \\ \end{aligned}$$
(45)

where in the last step of Eq. 45 we made use of the quadratic expansion of \(S_N\) defined in Eq. 44. Within the saddle point approximation, one has \(\langle \mathcal {E}\rangle _N=\tilde{\mathcal {E}}(N)\), \(\tilde{\mathcal {E}}(N)\) being the position of the maximum of \(S_N\), and \(\sigma _{\mathcal {E},N}=|S''_N(\tilde{\mathcal {E}}(N))|^{-\frac{1}{2}}\): these predictions are found to be in perfect agreement with their random sampling counterparts, results being presented in Table 1.

Fig. 5
figure 5

Left: histogram of cosine values extracted from random (yellow) and WL CG mappings (purple, see main text) for AKE with \(N=856\) sites. Elements of \(\mathcal {M}\) with the lowest value of the cosine (\(\cos \theta = 0.457\)) are shown; such value corresponds to an angle of 63.25 degrees. Right: range of cosine values covered by the two data sets when N is changed. The dotted black line shows the average value of \(\cos \theta \) over the different random data sets and the yellow region represents the points within \(3 \sigma \) from the mean. The red (blue) dotted lines report the maximum (minimum) values of \(\cos \theta \) inside WL data sets, respectively

3.2 Inner product distributions

We now proceed to the description of the mapping space \(\mathcal {M}\) from the perspective of the inner product between its elements. Following the same scheme of Sect. 3.1, we here focus on the cosine between mappings that are constrained to share the same resolution N, and introduce the probability \(P_{NN}(\cos \theta )\) of observing a value of \(\cos \theta \) provided that this constraint is satisfied:

$$\begin{aligned} P_{NN}(\cos \theta ) = \frac{\Omega _{NN}(\cos \theta )}{\Omega ^2_{N}}, \end{aligned}$$
(46)

that is, the ratio between the number of mapping pairs whose cosine is equal to \(\cos \theta \), \(\Omega ^2_{NN}(\cos \theta )\), and the total number of possible pairs \(\Omega ^2_{N}\). We can now investigate how the average degree of parallelism between two mappings changes when considering randomly selected mappings or more peculiar elements of \(\mathcal {M}\).

In this section, we compare two data sets, each one containing \(10^6\) elements: the first one was obtained by computing the cosine between two mappings in which the retained sites were picked randomly; the second data set was instead constructed in a more sophisticated manner, making use of the WL sampling scheme to collect mappings that uniformly span the range \([\mathcal {E}_{min},\mathcal {E}_{max}]\) of accessible values of \(\mathcal {E}\) identified in the previous section. More specifically, we started a WL exploration as in Sect. 3.1 over this range and, when all the reference bins were visited at least once, we began saving a mapping every 1656 Monte Carlo moves. Mappings were saved in different macro-bins, each one covering an interval of amplitude 20 (in terms of units of \(\mathcal {E}\)). Sampling ended when 5000 mappings were saved in each box, without considering the convergence of the WL algorithm. The data set was then generated by computing the cosine (Eq. 26) between randomly selected pairs of mappings extracted through this procedure. Importantly, the WL sampling scheme produces a pool of potentially correlated mappings, so that the chance of collecting similar elements of \(\mathcal {M}\) cannot be excluded.

The normalised histograms of cos(\(\theta \)) values obtained from the two datasets are displayed in Fig. 5a for \(N=856\). We observe that while the random cosine distribution displays a narrow peak around its average the WL histogram is more smeared, reflecting the increased diversity of the data set. Indeed, the latter histogram spans values that range from \(\approx 1\), obtained when two mappings are perfectly parallel, to 0.457, when two mappings are as orthogonal as possible given the properties of the lattice and the selected number of retained sites. In Fig. 5a, we also report a graphical rendering of the two maximally orthogonal mappings, which possess a high value of \(\mathcal {E}\) (\(\mathcal {E}=847.32\) and \(\mathcal {E}=843.82\), respectively) and cover different regions of the enzyme’s structure.

In Fig. 5b, we extend these considerations to different values of N, namely those employed in Sect. 3.1. The random distribution is always confined in a narrow interval of values of \(\cos \theta \), while WL data sets are capable of spanning a much wider range. In particular, for sufficiently small values of N, it is possible to retrieve maximally parallel (\(\cos \theta = 1\)) and maximally orthogonal (\(\cos \theta = 0\)) mappings inside the WL dataset. Reaching orthogonality is made possible by the fact that, at such low values of N, it is possible to confine retained sites in two separate regions of the protein structure.

4 Lattice gas analogy and phase transitions

As anticipated in Sect. 3, the reduced representation discussed in the present work, in which a mapping is defined in terms of a decimation of the atoms available on the molecular structure, suggests the analogy with a lattice gas. Also in this case, in fact, we have a number n of nodes that can be occupied by \(N \le n\) sites, each node being accessible to a single site at a time—thus implementing a hard-core repulsion. This analogy is a classic of statistical mechanics, and enables one, e.g. to map an Ising model to a gas of interacting particles, thus making it manifest that the spontaneous magnetisation in the former and the liquid–gas phase transition in the latter belong to the same universality class [45]. Here, we investigate the consequences of the lattice gas interpretation of reduced representations in order to tackle the issue of characterising the mapping space from a different perspective. Specifically, we mutuate concepts from equilibrium statistical mechanics to show that sharp transitions can occur that separate one or more phases corresponding to classes of reduced representations endowed with markedly distinct structural properties. While the previously performed analysis of the smooth and continuous densities of states \(\Omega _N(\mathcal {E})\) already suggested the existence of such classes, see Sect. 3, for particular numbers of retained sites these are shown to be as distinct as two or more phases of a fluid can be when observed through the perspective of this statistical mechanical analogue.

The role of the energy can be played by the norm of the mapping: in analogy with a lattice gas, we expect that if two retained sites are close to each other, they feel an attractive interaction, thereby reducing the energy. We thus define the energy of the system as

$$\begin{aligned} E(M) = - {\mathcal {E}}(M). \end{aligned}$$
(47)

In the previous sections, we obtained the density of states in terms of the mapping norm, \(\Omega _N = \Omega _N({\mathcal {E}})\). Making use of Eq. 47 we can, thus, write

$$\begin{aligned} \Omega _N(E) = \Omega _N(- {\mathcal {E}}). \end{aligned}$$
(48)

Let us now consider a system governed by the lattice Hamiltonian in Eq. 47 at equilibrium with a reservoir at temperature \(T = \beta ^{-1}\). The partition function of such system can be expressed in terms of \(\Omega _N(E)\) via

$$\begin{aligned} {\mathcal {Z}}_N(\beta )= & {} \int \mathrm{d}E\ e^{-\beta E}\Omega _N(E)\nonumber \\\equiv & {} \int \mathrm{d}E\ e^{- (\beta E - S_N(E))}, \end{aligned}$$
(49)

where we used the relation \(S_N(E) = \ln \Omega _N(E)\) to define the entropy. Equation 49 enables us to compute the dimensionless Helmholtz free energy as

$$\begin{aligned} \beta F_N(\beta )= & {} - \ln {\mathcal {Z}}_N(\beta )\nonumber \\= & {} - \ln \int \mathrm{d}E\ e^{- (\beta E - S_N(E))}. \end{aligned}$$
(50)

While the logarithm of the integral can be theoretically and numerically cumbersome to compute, it is possible to obtain a reasonable estimate of \(\beta F_N\) through a saddle point approximation. Specifically, we can expect that the integral is approximately equal to the largest integrand, so that

$$\begin{aligned} \int \mathrm{d}E\ e^{- \left( \beta E - S_N(E)\right) } \simeq C\ \max _E \left( e^{- (\beta E - S_N(E))}\right) ,\nonumber \\ \end{aligned}$$
(51)

where C is an immaterial constant. This approximation provides us with a definition of the free energy that is equivalent to the Legendre-Fenchel transform:

$$\begin{aligned} \beta F_N(\beta ) \simeq \min _E \left( \beta E - S_N(E)\right) . \end{aligned}$$
(52)

The thermodynamics of the lattice gas at thermal equilibrium can thus be retrieved computing Eq. 52 for a given value of N at all values of \(\beta \).

It is particularly instructive to investigate the temperature dependence of \(E^\star \), defined as the value of the energy for which \(\beta E - S(E)\) reaches its minimum. In Fig. 6 (blue curve, left ordinate), we report this function for \(N = 214\): it is possible to observe that \(E^\star = E^\star (\beta )\) decreases monotonically, i.e. the lower the temperature, the lower the value of the energy—which corresponds to higher values of the mapping norm. At a particular value \(\beta _{gl}\) of the inverse temperature, however, \(E^\star \) drops abruptly: in this context, such behaviour is suggestive of a first-order, discontinuous phase transition.

Fig. 6
figure 6

Heat capacity \(C_V\) (red circles, right ordinate) and value of the energy \(E^\star \) corresponding to the minimum of the free energy (blue triangles, left ordinate) as functions of the inverse temperature \(\beta \) for the system with \(N = 214\). \(E^\star \) decreases monotonically with \(\beta \), indicating that higher temperatures correspond to higher values of the average internal energy of the lattice gas, as expected; however, a jump discontinuity in \(E^\star \) appears in correspondence of the same value \(\beta _{gl}\) for which the heat capacity features a sharp peak, suggesting the occurrence of a first-order phase transition that separates two distinct phases: a gas (low \(\beta \)) from a liquid (high \(\beta \)) for the lattice gas model, and, correspondingly, a sparse phase from a dense, localised phase in the case of mappings

To gain further insight, we computed the shapes of \(\beta E - S(E)\) for values before and after \(\beta _{gl}\). These functions, reported in Fig. 7, indeed show two minima separated by a relatively low barrier; increasing \(\beta \), the absolute minimum shifts from the right to the left, crossing a point for which the two are essentially degenerate. This is the point of coexistence of two distinct “phases” of our lattice gas: a low density one corresponding to distributed mappings (high energy), and one ascribable to more dense, compact conglomerates of sites (low energy). The critical nature of the transition from one regime to the other is confirmed by the inspection of the heat capacity, computed as

$$\begin{aligned} C_V = - \beta ^2 \frac{\partial ^2 (\beta F)}{\partial \beta ^2} \end{aligned}$$
(53)

and reported in Fig. 6 (red curve, right ordinate). The sharp, asymmetric peak in \(C_V\), located at the value \(\beta _{gl}\) of the inverse temperature, shows that the lattice gas crosses a phase transition between a gas and a liquid phase.

A crucial role in this behaviour is played by the number of coarse-grained sites. In fact, as N increases, the system acquires the possibility of crossing a second phase transition: for example, in the case of \(N = 1070\), besides the gas–liquid one, it is possible to observe a second, even sharper discontinuity in \(E^\star \) for a value of the inverse temperature \(\beta _{ls} > \beta _{gl}\). This temperature separates the liquid from the solid phase: when the lattice gas particles are sufficiently many, and the temperature sufficiently low, the system can “freeze” in particularly dense mappings with very low entropy. Also in this case, the inspection of the heat capacity (Fig. 13 in Appendix B) supports the interpretation of this as a phase transition. Finally, if the number of sites is too large (e.g. \(N = 1498\)) no transition is observed, see Fig. 13.

Fig. 7
figure 7

Helmholtz free energy \(\beta F\) of the lattice gas as a function of the energy for different values of the inverse temperature \(\beta \). For low values of \(\beta \) the curves have a unique and absolute minimum; however, as \(\beta \) increases, a metastable minimum appears that, for a particular value of the inverse temperature, becomes degenerate with the previous one. The presence of a small but appreciable barrier between the two minima makes the position of the absolute minimum, \(E^\star \), shift abruptly from one to the other, as can be seen in Fig. 6, thus making \(E^\star (\beta )\) discontinuous

The observations reported in this section resonate with those made by Foley and collaborators in a recent work [15]: there, they observed a phase transition in a system whose degrees of freedom were the retained sites of a reduced model of proteins. In that case, the energy of a given mapping was obtained from the calculation of the spectral quality of the associated model, a quantity related to the sum of the eigenvalues of the covariance matrix obtained integrating exactly a Gaussian network model (GNM). While apparently very distinct, the spectral quality and the norm of the mapping might bear substantial similarities: in fact, the former entails information about a very simple model, whose mechanical and thermodynamical properties are completely determined by the contact matrix of the underlying protein structure. It is, thus, reasonable to guess that the mapping norm provides, in an effective and efficient manner, information akin to that entailed in the spectral quality about the sparsity or localisation of the retained sites in a given mapping. If and up to which degree these two quantities are related, and how intimately this relation depends on the Gaussian nature of the GNM, requires further investigations that will be the object of future studies.

In conclusion of this section, we note that the observed phase transitions separate mappings so structurally diverse that they can be associated to qualitatively different phases. It is, thus, natural to wonder if and how these phases are organised in the metric space induced by the norm of the mapping, and what information the exploration of the latter can bring about the system it is applied to. To provide an answer to these questions, the next section is devoted to the topological characterisation of the mapping space.

5 Topology

In the previous sections, we analyzed the mapping space \(\mathcal {M}\) in terms of the mapping norm \(\mathcal {E}\) and of the cosine between its constituent elements. Here, we discuss the distance \(\mathcal {D}\) (Eqs. 11, 31) between members of \(\mathcal {M}\) with the aim of showing, once again, that a peculiar choice of retained CG sites, i.e. one impossible to obtain with random sampling, displays non-trivial statistical properties that reflect in the topological organization of the mapping space.

5.1 Topology of the mapping norm space

Without loss of generality,Footnote 2 we restrict our investigation to the case \(N=214\), namely the number of amino acids of adenylate kinase. We generated a data set of mappings following the protocol explained in Sect. 3.2; in this case, the range of values of \(\mathcal {E}\) was narrower and only 10 macro-bins of amplitude 20 were explored. The data set was constructed by randomly selecting 100 elements for each of the macro-bins, resulting in 1000 CG mappings that homogeneously span the accessible values of \(\mathcal {E}\).

The sketch map algorithm [46, 47] was employed to embed 1000 points from the high-dimensional space of mappings \({\mathcal {M}}\) into a two-dimensional plane, at the same time preserving as faithfully as possible the relative distances among them—that is to say that nearby points in the mapping space are mapped onto nearby points on the 2D space, see Fig. 8. The two critical parameters of the algorithm are \(\sigma _d\) and \(\sigma _D\), which modulate how far and close points are in the low and high resolution space, respectively [46]. To provide the reader with a feeling of the impact that these parameters have on the structure of the low-dimensional representation, we report the embeddings obtained for a low (Fig. 8a) and high (Fig. 8b) value of \(\sigma _d\) and \(\sigma _D\).

Fig. 8
figure 8

Top: topology of the mapping space \(\mathcal {M}\) in 2D obtained with the sketch map algorithm [46, 47]. The algorithm requires six parameters, namely \(\sigma _d\), \(a_d\), \(b_d\) in the low resolution space and \(\sigma _D\), \(a_D\), \(b_D\) in the original, high resolution one. We select \(\sigma _D = \sigma _d = 2\) for subfigure (a) and \(\sigma _D = \sigma _d = 20\) for subfigure (b), while \(a_d = b_d = 2\) and \(a_D = b_D = 5\) in both cases. Mappings are depicted with different colors depending on their norm \(\mathcal {E}\). We note that a different choice for \(\sigma _D\) and \(\sigma _d\) results in a completely different 2D embedding (see [46] for a detailed explanation). Bottom: three different mappings located in three separated regions of the plane in (a, b). Mappings in subfigures (c) and (e) possess very high values of \(\mathcal {E}\) and are localised in different domains of the protein. It is interesting to notice that sparse mappings, such as the one in subfigure (d), are clustered in the same region in (b) but not in (a)

In the first case, presented in Fig. 8a and referring to low values of the \(\sigma \) parameters, data points are in general very sparse and uniformly distributed on the plane, with the exception of a group of points that accumulate in a denser cluster: these are particularly compact mappings localised in a specific region of the molecule. Such mappings remain close to each other even when the \(\sigma \) parameters are increased, thus “squeezing” all points in the low-D embedding, see Fig. 8b. At the same time, we observe that, in this latter scenario, points corresponding to low-\(\mathcal {E}\), uniform mappings collapse in a small region of the embedding space. Furthermore, a third group of points corresponding to compact mappings appears, distinct from the ones previously discussed, and absent in the low-\(\sigma \) embedding.

The high-\(\sigma \) embedding, thus, highlights two relevant features: first, the presence of specific regions with qualitatively distinct mapping properties; these are either sparse, but necessarily similar one to the other (Fig. 8d), or dense, with atoms localised in different domains of the molecule (Figs. 8c, e). The distance among the latter is necessarily large, since the retained sites cover non-overlapping regions.

The second relevant feature is that different groups of points, associated to qualitatively distinct types of mappings, can be connected one to the other only “passing through” a third one, as in the case, see Fig. 8, of mapping c going going to e through d. This is suggestive of the presence of routes in mapping space that join points having the same value of the norm, which, however, cannot be connected through “iso-\({\mathcal {E}}\)” paths: to transform mappings such as that in c into that in e through a sequence of single-site changes (i.e. one retained atom is discarded, a formerly discarded one is now retained) one cannot but increase or decrease the value of the norm.

5.2 Topology of mapping entropy space

While the mapping norm \(\mathcal {E}\) can be employed to investigate the structure of \(\mathcal {M}\) itself, the quality of a CG representation can be determined by means of an appropriate cost function. One such function is, e.g. the mapping entropy \(S_\mathrm{map}\) [23, 24, 30,31,32], which is a measure of the intrinsic information loss that is inherent to the process of dimensionality reduction operated by a mapping. This quantity is defined as

$$\begin{aligned} S_\mathrm{map}(M) = k_\mathrm{B}\int \mathrm{d}{} \mathbf{r}\ p_r(\mathbf{r}) \ln \left[ \frac{p_r(\mathbf{r})}{{\bar{p}}_r(\mathbf{r})} \right] , \end{aligned}$$
(54)

where \(p_r(\mathbf{r})\propto \exp (-\beta u(\mathbf{r}))\) is the Boltzmann weight associated to the atomistic configuration \(\mathbf{r}\), while \({\bar{p}}_r(\mathbf{r})\) represents the “smeared” weight of \(\mathbf{r}\) upon coarse-graining the system by means of a CG mapping M. More specifically, one introduces the probability of sampling the CG configuration \(\mathbf{R}\), given by

$$\begin{aligned} p_R(\mathbf{R})=\int \mathrm{d}{} \mathbf{r}\ p_r(\mathbf{r})\delta (\mathbf{M}(\mathbf{r}) - \mathbf{R}), \end{aligned}$$
(55)

where \(\mathbf{M}(\mathbf{r})\) is the projection operator defined in Eq. 1, as well as the number of high-resolution microstates \(\mathbf{r}\) mapping onto it,

$$\begin{aligned} \Omega _1(\mathbf{R})=\int d\mathbf{r}\ \delta \left( \mathbf{M}(\mathbf{r}) - \mathbf{R}\right) . \end{aligned}$$
(56)

The probability \({\bar{p}}_r(\mathbf{r})\) is then defined as [31]

$$\begin{aligned} {\bar{p}}_r(\mathbf{r}) = {p_R(\mathbf{M}(\mathbf{r}))}/{\Omega _1(\mathbf{M}(\mathbf{r}))}. \end{aligned}$$
(57)

Critically, while both \(p_r(\mathbf{r})\) and \({\bar{p}}_r(\mathbf{r})\) are functions of the atomistic coordinates, they differ in assigning the probability to a given high-resolution configuration or microstate, in that \({\bar{p}}_r(\mathbf{r})\) associates the same probability with all microstates that map onto the same macrostate \(\mathbf{R}\). Minimising the mapping entropy \(S_\mathrm{map}\) in the space of possible CG representations of the system thus implies maximising the consistency between the reconstructed probability distribution \({\bar{p}}_r(\mathbf{r})\) and the all-atom one. In Ref. [23] we derived an approximate expression for Eq. 54, which allows one to compute this observable provided a set of configurations and their energies are available, e.g. sampled from the canonical ensemble by means of a MD simulation:

$$\begin{aligned} S_\mathrm{map}\simeq & {} k_B \frac{\beta ^2}{2} \int \mathrm{d} \mathbf{R} p_R(\mathbf{R}) \left\langle (u-\langle u\rangle _{\beta |\mathbf{R}})^2\right\rangle _{\beta |\mathbf{R}},\nonumber \\ \end{aligned}$$
(58)

where \(\langle (u-\langle u\rangle _{\beta |\mathbf{R}})^2\rangle _{\beta |\mathbf{R}}\) is the variance of the energies of the atomistic microstates mapping onto macrostate R.

While the norm \(\mathcal {E}\) depends only on the geometric properties of a single protein conformation, \(S_\mathrm{map}\) is calculated from an ensemble of configurations sampled according to the Boltzmann distribution; \(S_\mathrm{map} (M)\), thus, contains more information than \(\mathcal {E} (M)\), since it makes explicit use of the average structural and thermodynamical properties of the system.

Here we employ a data set of 1968 CG mappings of AKE with \(N=214\) generated by us in a previous work [48] and covering a wide range of values of \(S_\mathrm{map}\); the relations among these mappings are then quantified in terms of their distance \(\mathcal {D}\), taking the enzyme crystal structure as a reference. With respect to this, it is worth keeping in mind that \(\mathcal {D}\) intimately depends on this reference, and mappings that lie close to each other when a given structure is employed might turn out to be closer or further away from each other when a different conformation is used.

Fig. 9
figure 9

Application of the sketch map algorithm to a distance matrix obtained calculating \(\mathcal {D}\) (Eqs. 11 and 30) over a data set of 1968 mappings [48] that span a wide range of values of mapping entropy. The x component separates very well the data points according to their value of \(S_\mathrm{map}\), thus suggesting that informative mappings can be distinguished among the elements of \(\mathcal {M}\) according to a measure of geometrical similarity such as \(\mathcal {D}\). The parameters fed to the algorithm are the following: \(\sigma _D = \sigma _d = a_D = b_D = 5\), \(a_d = b_d = 2\)

Figure 9 shows that the two-dimensional embedding obtained through the application of the sketch map algorithm separates the CG mappings according to a gradient of \(S_\mathrm{map}\). In particular, the x component of the sketch map and the mapping entropy \(S_\mathrm{map}\) display a clear anticorrelation. The results suggest that highly informative mappings, characterised by low values of \(S_\mathrm{map}\), share geometrical features that are not present in less informative (high \(S_\mathrm{map}\)) representations. In other words, the peculiar resolution distribution found in low-\(S_\mathrm{map}\) mappings separates them from the other elements of \(\mathcal {M}\). The relevant features that the mapping entropy highlights thus reverberate in the merely structural characterisation provided by the mapping distance; this connection among the norm \({\mathcal {E}}\), the distance \({\mathcal {D}}\), and highly informative representations is potentially interesting and deserves to be further investigated.

6 Extension of the theory to equilibrium sampling: preliminary results

Insofar, our analysis of the mapping space has relied on a definition of a scalar product between CG representations based on a single, static structure of the reference protein. Proteins and other biologically relevant macromolecules, however, are not static objects, but rather flexible entities which, in a typically aqueous environment, undergo fluctuations and deformations. It is therefore natural to extend our metric to incorporate such structural variability; in this Section, we will, thus, present some preliminary results obtained by performing such an extension, restricting, for the sake of brevity, the discussion to the case of the mapping norm \(\mathcal {E}\).

We assume our high-resolution (i.e. atomistic) system, constituted by the protein (whose atomic coordinates are indicated with \(\mathbf{r}\)) and its environment (indicated with \(\mathbf{s}\)), to be subject to an interaction potential \(u(\mathbf{r}, \mathbf{s})\). In the canonical ensemble the probability density to sample a given configuration is proportional to the Boltzmann weight, that is,

$$\begin{aligned} p_{r}(\mathbf{r}, \mathbf{s}) = \frac{e^{-\beta u(\mathbf{r}, \mathbf{s})}}{Z}, \end{aligned}$$
(59)

where \(Z = \int \mathrm{d}{} \mathbf{r}\mathrm{d}{} \mathbf{s}\ e^{-\beta u (\mathbf{r}, \mathbf{s})}\) is the configurational partition function of the system.

The norm \(\mathcal {E} \) of a mapping in Eqs. 30 and 8 only depends on a single conformation of the molecule under examination; however, one can straightforwardly extend the definition of \(\mathcal {E}\)—and analogously of the scalar product and the distance between mappings—to account for the whole conformational space sampled by the system, in that the canonical average of the norm is taken:

$$\begin{aligned} \langle \mathcal {E} \rangle= & {} \int \mathrm{d}{} \mathbf{r}\mathrm{d}{} \mathbf{s}\ p_{r}(\mathbf{r}, \mathbf{s})\ \mathcal {E}(\mathbf{r}) \nonumber \\= & {} \int \mathrm{d}{} \mathbf{r}\mathrm{d}{} \mathbf{s}\ p_{r}(\mathbf{r}, \mathbf{s})\ \frac{1}{{\bar{z}}(\mathbf{r})} \left( \sum _{i,j=1}^n e^{-r^2_{ij}/4\sigma ^2}\chi _{M,i}\chi _{M,j} \right) \nonumber \\= & {} \sum _{i,j=1}^n \langle J_{ij} \rangle \chi _{M,i}\chi _{M,j}. \end{aligned}$$
(60)

Note that the average is carried out both over the protein and environment degrees of freedom; at the same time, for mappings that only retain protein degrees of freedom, the couplings \(J_{ij}\)—and thus the norm \(\mathcal {E}\)—only depend on the latter. The linearity of the norm with respect to the couplings allows one to first compute their thermal average, that is,

$$\begin{aligned} \langle J_{ij} \rangle= & {} \int \mathrm{d}{} \mathbf{r}\mathrm{d}{} \mathbf{s}\ p_{r}(\mathbf{r}, \mathbf{s}) J_{ij} (\mathbf{r}) \nonumber \\= & {} \int \mathrm{d}{} \mathbf{r}\mathrm{d}{} \mathbf{s}\ \frac{e^{-\beta u (\mathbf{r}, \mathbf{s})}}{Z} \frac{1}{{\bar{z}}(\mathbf{r})} e^{-r^{2}_{ij}/4 \sigma ^2}, \end{aligned}$$
(61)

and subsequently employ them for the calculation of norms, scalar products, and distances, in the same manner as it was done insofar. In this case, however, the resulting values entail information about the conformational space sampled by the whole system, including the environment, described in terms of a high-resolution model.

To investigate the effect that accounting for the conformational variability of the system has on the norm of a mapping, Fig. 10 displays a comparison between the value of \(\mathcal {E}\) computed on the crystal structure of AKE and its canonical average \(\langle \mathcal {E} \rangle \) obtained through molecular dynamics sampling. Each point in the plot represents a \(\mathcal {E}\)\(\langle \mathcal {E} \rangle \) pair out of \(5 \times 10^4\) mappings with \(N = 214\) extracted so as to homogeneously span all the possible values of \(\mathcal {E}\), see Sect. 3.2. The ensemble average is performed over \(10^4\) configurations of a 200 ns long NVT simulation, the technical details of which are available in the SI of Ref. [23].

Interestingly, points are very narrowly dispersed along the diagonal, with a Pearson correlation coefficient very close to unity. This suggests that, at least in this case, the canonical average of \(\mathcal {E}\) is robust to structural changes: we ascribe this behavior to the fact that at the outset of the simulation the protein is in its native state and, due to the strong constraints present in the molecule, the local environment of each atom generally performs small-amplitude fluctuations about a well-defined average. In this particular case, the couplings computed explicitly accounting for the energetics of the system do not induce significant deviations in the value of the norm with respect to their static-structure counterparts it is hence reasonable to expect that the same will hold for the metric and topological properties of the mapping space discussed insofar.

However, this consistency will not be observed when secondary and tertiary structures heavily change, as, e.g. in the case of protein folding: the value of \(\mathcal {E}\) calculated over the unfolded polypeptide chain will not match its canonical average performed over a sample containing folded, more globular configurations. A more detailed understanding of how equilibrium sampling can change the metric properties of the mapping space—especially in the presence of large-amplitude conformational rearrangements—is required, and will be the subject of future work.

Fig. 10
figure 10

Scatter plot of the single-conformation mapping norm \(\mathcal {E}\), calculated on the crystal structure of AKE, against its canonical average \(\langle \mathcal {E} \rangle \) (Eq. 60) for \(5 \times 10^4\) CG mappings with \(N = 214\). The red straight line with slope one serves as a guide to the eye. The Pearson correlation coefficient is 0.9997

7 Conclusions

In this work we have addressed the problem of defining a measure to quantify the distance between two low-resolution representations of a macromolecule, and to “explore” the metric space induced by it.

The recent advances in the computational investigation of soft and biological matter have provided us with the tools to perform large-scale simulations of large and complex systems; however, due to the sheer size of the data produced, one has to filter out the large amount of detail with which the system is described [8], thus relying on a coarse-grained description of it.

Decimation mappings offer a simple and intuitive way of applying this filter, in that only a subset of a molecule’s atoms is retained; however, not all mappings entail or deliver the same amount of information, and the identification of the most informative ones allows one to highlight relevant properties of the system. Various methods have been devised [20,21,22,23,24,25] to identify the most informative mappings as the solution to an optimisation problem, which thus relies on the definition of an appropriate cost function. Since the landscape induced by the latter is typically a rather rugged one, as it is often the case in the field of complex systems [49, 50], it is to be expected that more than one “optimal” solution will be found. Hence, to understand the relationship among such solutions, as well as between structural representation and physical properties in general, it is of fundamental importance to possess an instrument to measure the difference, or distance, among mappings.

The metrics proposed here, which builds on the SOAP measure proposed by Csány and coworkers [35, 36], has been employed to quantify the number, dissimilarity, and structural features of different mappings of a macromolecule in a static conformation, thereby providing the basis for quantitative analysis of the aforementioned relationship.

The exploration of the mapping space relied on the application of the Wang–Landau enhanced sampling algorithm [28, 29], which allowed us to compute the (logarithm of the) density of states for mappings with a given number of CG sites, as a function of their squared norm. On the one hand, these calculations brought to the surface information about “special” (i.e. atypical) representations that, just due to their lower number with respect to randomly sampled ones, are exponentially suppressed; on the other hand, we made use of the densities of states to implement a lattice-gas analogy in terms of which we have interpreted mappings of qualitatively different types as different phases of the same physical system undergoing a phase transition. We have then made use of the distance between mappings to investigate the properties of optimal reduced representations obtained by minimising the mapping entropy, a measure of the amount of information that a given mapping can return about the underlying system at thermal equilibrium: this last analysis has shown that optimal mappings are markedly distant, and therefore qualitatively different, from randomly sampled ones, thus corroborating the idea that the former belong to a particular subregion of the mapping space endowed with nontrivial properties. Finally, we proposed a possible extension of the theory to samples of conformations at thermal equilibrium, focusing on the case of the mapping norm. In this manner, the \(J_{ij}\) couplings are weighted with the probabilities associated to each configuration, thus indirectly accounting for the system’s energy.

A number of questions remain open, which could not be addressed in this work. As a first thing, in the application of the theory to the system under examination we have observed a substantial consistency between the values of the mapping norm computed with single-structure couplings and their averaged counterparts; however, it is reasonable to expect that this won’t be a general case. Consequently, the inclusion of the reference system’s conformational variability might lead to interesting outcomes in the analysis of structural and topological properties of the mapping space. This relevant line of research is currently under investigation.

A second open issue, partially related to the former, is that the phase transitions that we observed are analogous to the ones reported in a previous work [15], where explicit reference to the molecule’s free energy was made: this connection might entail important insights in the relationship between the properties of mappings as elements of the mapping space and the functional characteristics of the underlying system, and it certainly deserves to be further inspected.

Finally, the general character of the tools developed in this work make them suitable to be easily combined with other methods. For example, they can be employed to quantitatively gauge similarities and differences among the solutions to the mapping optimisation problem obtained making use of the various protocols proposed in the literature; or to boost the accurate determination of cost functions profiles, whose computation can be accelerated by a preliminary exploration of the mapping space followed by a cycle of biased enhanced sampling simulations [48].

In conclusion, the mathematical, biophysical, and computational methods developed and applied in this work have served to start gathering the treasure of information buried in the relationship between how we look at a system and the properties it is endowed with, of which we think that what has been reported here has just scratched the surface.