Enumerating and indexing many-body intramolecular interactions: a graph theoretic approach

The central idea observes a recursive mapping of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n$$\end{document}n-body intramolecular interactions to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$(n+1)$$\end{document}(n+1)-body terms that is consistent with the molecular topology. Iterative application of the line graph transformation is identified as a natural and elegant tool to accomplish the recursion. The procedure readily generalizes to arbitrary \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n$$\end{document}n-body potentials. In particular, the method yields a complete characterization of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$4$$\end{document}4-body interactions. The hierarchical structure of atomic index lists for each interaction order \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n$$\end{document}n is compactly expressed as a directed acyclic graph. A pseudo-code description of the generating algorithm is given. With suitable data structures (e.g., edge lists or adjacency matrices), automatic enumeration and indexing of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n$$\end{document}n-body interactions can be implemented straightforwardly to handle large bio-molecular systems. Explicit examples are discussed, including a chemically relevant effective potential model of taurocholate bile salt.


Introduction
The implementation of computer code for realistically simulating the configurations and motion of molecular objects requires modelling of many-body through-bond inter- actions. In turn, it is necessary to identify the participating atoms in each interaction. This paper presents a novel and exhaustive enumeration procedure exploiting the line graph transformation of the graph that encodes the molecular structure. In principle, by virtue of the recursive nature of the algorithm, straightforward extension to arbitrarily high order interactions is possible.
Application of graph theoretic methods to the general study of molecular structure, and to equilibrium statistical mechanics in particular, are not new. Significant examples include the development of molecular branching rules [1], the enumeration of isomers and the definition of topological indexes [2], as well as the analysis of discrete lattice models [3] and Mayer's cluster decomposition of the 2-body configuration integral [4].
Modelling and theory distinguish between simple materials, comprised of weakly interacting elementary units, and complex materials including objects with internal structure characterized by relatively strong coupling and typically mimicking the covalent architecture of molecular species [5]. Furthermore, intramolecular interactions can be treated either quantum mechanically (Car-Parrinello method [6] with density functional theory of electronic structure [7]) or by a classical effective potential obtained through some more or less ad hoc coarse-graining procedure [8,9]. In molecular dynamics or equilibrium Monte Carlo simulations, for example, the total intramolecular potential energy is typically decomposed as [10] U intra r s = bonds U 2 r i , r j topology + bends U 3 r i , r j , r k + torsions U 4 r i , r j , r k , r l + · · · flexibility , (1) where the sums range over suitable n-body potentials U n that depend on the spatial configuration r s of the constituent atoms. Similar expansions are developed for nonbonded forces too, and may also include 1-body coupling to an external field, but the indiscriminate character of these through-space interactions usually means that the identification of participating atoms is determined by a simple range parameter. For intramolecular interactions, however, the enumeration and indexing of atoms involving in each sum of (1) is subject to the constraints of molecular topology. Indeed, the 2-body bond interactions U 2 define the molecular framework, while the higher order terms in (1) serve to model the more or less restricted molecular flexibility associated with bond hybridization or electronic delocalisation (e.g., aromaticity and resonance structures). The selection of these higher order potentials is often based on chemical intuition and are typically supplied to simulation software as user defined input. Mature and widely used simulation packages (e.g., GROMACS, LAMMPS, NAMD, etc.) invariably support highly optimized force fields (e.g., CHARMM, AMBER, OPLS, MMFF, etc.) that faithfully represent detailed atomistic structures. An alternative approach is to input only the molecular topology, then systematically generate all possible many-body index lists from this information and invite the user to select non-zero force constants and appropriate functional forms for the required potentials. This paradigm is more natural for the implementation of coarse-grained models derived by thermodynamic considerations (e.g., MARTINI [11]). To facilitate this semi-automatic procedure, a graph theoretic construction is developed here to exhaustively enumerate and index arbitrary n-body intramolecular interactions starting from the description of 2-body adjacency.
The central idea in this work observes the correspondence between the hierarchy of n-body intramolecular interactions and iteration of the line graph [12] transformation L(G) on a connected "molecular" graph G.

Graph theory
A simple, connected and undirected graph G = V (G), E(G) is composed of a finite nonempty vertex set V (G) = v 1 , . . . , v p of p ∈ N vertices v i and a finite edge set E(G) = e 1 , . . . , e q comprising q ∈ N distinct unordered pairs e α = v i , v j of distinct vertices such that each element from V (G) appears at least once [13]. The number of vertices p = |V (G)| and edges q = |E(G)| are called the order and size of G, respectively. By restricting edges to ensure bond uniqueness (any vertex pair is joined by at most one edge) and forbid loops (each edge must join distinct vertices), molecular structures are naturally represented by such graphs G.
The p × p square, symmetric and binary vertex adjacency matrix A(G) of a graph G, defined by is in one-to-one correspondence with the molecular structure and is arguably the most natural algebraic representation of topological connectivity. An elementary inductive argument [14] establishes that A m i j (G) is the total number of m length walks (a sequence of vertices joined consecutively by edges) between vertices v i and v j . In particular, the degree of vertex v i corresponding to the valency of atom i in the molecular structure is so that the p × p diagonal degree matrix becomes where the e i are natural basis vectors in the coordinate vector space R p . Another useful representation of graph connectivity is the p × q binary vertex-edge incidence matrix Z(G) defined by For each graph G there is an associated line graph L(G) (also called the "derivative" graph [15]) such that V L(G) is in bijective correspondence with E(G) and where I q is the q × q identity matrix. In other words, each edge of G is mapped to a vertex of L(G), while two vertices of the line graph are adjacent if and only if their corresponding edges are incident in G (that is, they share a common endpoint). Furthermore, the associated Kirchhoff matrix The spectrum of K(G) provides a useful consistency check since the algebraic multiplicity of the zero eigenvalue is equal to the number of connected components in G [16]. Hence, for a physically sensible molecular graph G, the rank of K(G) must be p − 1.
We recall the following definitions and terminology. A cycle graph C r comprises p = r vertices, all of degree 2, connected in a closed chain by q = r edges. Removing a single edge produces a path graph P r (of order p = r and size q = r − 1) with two terminal vertices of degree 1. The complete graph K r on p = r vertices is We will also have occasion to consider directed graphs G = V (G), A(G) where edges are replaced by arrows specified by ordered pairs v i , v j ∈ A(G) and oriented with the tail at vertex v i pointing towards the head at vertex v j . Associated with each

Intramolecular interactions
In a molecular structure, covalently bonded atom pairs are considered to be adjacent. Similarly, a pair of adjacent bonds sharing a common "hinge" atom form a more or less flexible bend. The corresponding 3-body interaction is often described by an effective potential in terms of the external bond angle θ supplementary to the angle subtended at the hinge atom [17]. A 4-body dihedral interaction is associated with a pair of adjacent bends that share a common bond. Two situations are possible [18] (see Fig. 1): "proper" torsions arise when both hinge atoms are distinct, so that one bend is rotated about the other through a dihedral angle φ; while "improper" dihedral interactions link two bends through a common hinge atom, and are defined by a wag angle ω. Proper torsions typically account for geometric restrictions conferred by implicit substituents (usually protons) or lone electron pairs and may be alternatively characterized by a bond lying along the dihedral axis. Conversely, the dihedral axis of an improper torsion does not contain a bond and these interactions are used to constrain planar groups (like rings) or to hinder interconversion of stereocenters.
In graph theoretical terms, this hierarchical organization of interactions is precisely captured by iterated application of the line graph construction inductively defined by A graph G establishes adjacency of vertices (i.e., bonds); the graph L(G) encodes the adjacency of bonds (i.e., bends) in G; the graph L L(G) = L 2 (G) encodes the adjacency of bends (i.e., dihedrals) in G; and so on. Generally, the graph L ν−2 (G) encodes the adjacency of ν-body interactions in G. It has been shown [19] that the sequence of graphs L n (G) with n = 0, 1, 2, . . . has only four possible outcomes as n → ∞: 1. if G ∼ = C r (a cycle graph on r vertices), then L n (G) ∼ = G for all n ∈ N (cycle graphs are the only connected graphs for which for all n ∈ N; 3. if G ∼ = P r (a path graph on r vertices), then L n (G) ∼ = P max{0,r −n} so each subsequent graph is a shorter path until eventually the sequence terminates at the trivial null graph; 4. otherwise, G is a "prolific" graph [20] so that the sizes of the graphs in the sequence eventually increase without bound, Ghebleh and Khatirinejad [21] have proved an interesting and chemically relevant result concerning the smallest non-negative integer m such that L m (G) is nonplanar: In particular, if G is not prolific it is easy to see that L n (G) is planar for all n ≥ 0, but for G prolific then 0 ≤ ξ(G) ≤ 4 and a complete characterization of these graphs is possible [21].

Enumeration
Given a molecular graph G, the total number of intramolecular interactions N n (G) involving n ∈ N connected atoms (vertices on G) is generally given by Here, N 1 (G) simply counts the number of 1-body interactions in the presence of an external field. Elementary combinatorial arguments also establish the handshaking lemma [22]: the number of bonds is just half the total number of incident edges over all vertices so that where u = p i=1 e i is a p-vector of ones. By summing the number of possible bond pairs over each vertex, the total bend count is obtained

Similarly reckoning the combinations of bond triplets over all vertices yields the number of improper dihedral interactions
Other possible arrangements of three contiguous bonds are just the proper dihedrals and triangular 3-cycles. In total, these arrangements can be enumerated as follows: for each bond, calculate the product of the number of free edges otherwise incident on each of the two vertices; then sum over all edges to get where we have observed the identity p j=1 A i j (G) = D ii (G). The number of 3cycles, however, can be obtained directly from the adjacency matrix as

The total 4-body interaction number N 4 includes torsions of both types as well as any 3-cycles present
but over-counts the improper dihedrals and triangles by distinguishing the three rotational permutations of labels. Combining these results gives the useful consistency checksums that involve only the molecular graph G.

Indexing
By definition of the line graph transformation, for A i j L n (G) = 1, the indexes i = α and j = β point to the edges e α , e β ∈ E L n−1 (G) that share a common vertex v l ∈ V L n−1 (G) . In turn, these edges e α = v k , v l and e β = v l , v m are associated with the adjacency matrix of the previous graph in the recursive hierarchy so that A kl L n−1 (G) = A lm L n−1 (G) = 1. Successively backtracking to the source graph G yields, for each order n, a sequence of atomic indexes that participate in an n-body interaction. A formal pseudo-code implementation of this procedure is set out in Algorithm 1. Details of each sequence structure completely characterizes the interaction form. For concreteness, specify a maximum order ν for the multibody interactions of interest so that n-body terms with n = 1, 2, . . . , ν are considered. Most commonly in  (G) . Moreover, the hierarchy of line graphs provides a natural topological order on H ν (G). Further, the adjacency matrix takes the block tridiagonal form By virtue of this DAG structure, each vertex v jn ∈ V H ν (G) is associated with a unique sequence of 2 n−1 atomic indexes v i1 that define each n-body interaction. For the trivial cases n = 1 and n = 2, these sequences correspond respectively with the individual atom list V (G) and the atom pairs E(G) defining the molecular bonds. Bends (n = 3) are signified by a pattern of 2 2 = 4 atoms with a single repeated index identifying the common hinge atom. An 8-atom sequence classifies a 4-body interaction and distinguishes between three types as illustrated in Fig. 1. A pair of triply repeated atom indexes identifies the distinct hinge atoms of a proper torsion, provided the remaining two indexes are different (Fig. 1a). Otherwise, a common pair of "flap" atoms signals a degenerate 3-cycle where only three atom indexes appear (Fig. 1c). Each triangle generates 3 C 2 = 3 such sequences by the arbitrary choice of a single common bond. The single common hinge of an improper dihedral corresponds to a fourfold repeated index among four distinct labels (Fig. 1b). Again, three sequences are generated for each improper dihedral by arbitrary assignment of the shared bond.

A toy model: methylcyclopropane
The molecular graph G indicated in Fig. 2 presents amongst other possibilities a plausible lumped model for methylcyclopropane where hydrogen atoms are absorbed onto the carbon backbone in the usual way. Direct inspection of G immediately establishes four 1-body interactions in the presence of an external field (that is, the number of "atoms" N 1 = 4) and also four 2-body bonds N 2 = 4 . Atom 2 is the hinge for three distinct 3-body bends with a further two hinged at atoms 3 and 4 respectively, to give N 3 = 5. Among the 4-body interactions there are two proper torsions N prop = 2 with distinct hinge atoms 2-3 and 2-4, respectively, as well as a single improper dihedral N impr = 1 with common hinge atom 2. A single 3-cycle is present N 3cyc = 1 so that N 4 = 2 + 3 × (1 + 1) = 8. Adjacency matrices for the iterated line graphs necessary for describing interactions up to the 4-body level are given by For added clarity, the edge index associated with each adjacent vertex pair is indicated by the superscript. From these matrices, the DAG obtained that represents the line graph hierarchy is shown in Fig. 2. The atomic index sequences automatically generated on the DAG, particularly at the 4-body level (n = 4), confirm the informal analysis of the molecular graph. It is easy to show that the complete graph on five vertices K 5 is a minor of L 3 (G), whence it follows from the theorem of Wagner [23] that L 3 (G) is   [24]. The molecular line graph L 3 (G). To simplify the diagram, subgraphs corresponding to complete graphs K q on q vertices ( q-cliques of L 3 (G)) have been rendered as pentagons (q = 5) and hexagons (q = 6)  2), (2, 4)) (((1, 2), (2, 3)), ((2, 3), (2, 4))) i nonplanar. All of G, L(G) and L 2 (G) are manifestly planar (see Fig. 2) so the line index ξ(G) = 3 in accord with the result of Ghebleh and Khatirinejad [21].

Bile salt: taurocholate
A chemically relevant example, central to the hydrolysis and solubilisation of lipid associated with food digestion in the human lower gastrointestinal tract, are the bile salt and bile acid surfactants. Vila Verde and Frenkel [24] have proposed an effective coarse-grained model of trihydroxy bile salts (taurocholate) for a molecular dynamics study of micelle formation that is related to the rate and extent of nutrient absorption by intestinal cells. The authors proposed a "three-to-one" mapping scheme, that groups three carbon or nitrogen atoms into a single bead, to arrive at the molecular graph G shown in Fig. 3. Iterated line graphs up to L 3 (G) are collected in Figs. 3 and 4. Table 1 lists the vertices for the corresponding DAG description of the hierarchy as sequences of bead indexes.
Overall, the counts of bonds, bends, proper torsions and improper dihedrals are obtained as follows, where N 4 = 22 + 3 × 4 = 34. Clearly, there are no 3-cycles in this example so N cyc = 0. It is easy to verify that L 2 (G) admits the minor K 5 and hence, by Wagner's theorem [23], it follows that L 2 (G) is nonplanar. Both G and L(G) are manifestly planar so the line index ξ(G) = 2 in accord with the result of Ghebleh and Khatirinejad [21].

Conclusion
The line graph transformation provides a practical and elegant theoretical tool for exhaustively enumerating and indexing many-body intramolecular interactions. Given a suitable graphical representation of a molecular structure, an explicit pseudo-code implementation of the recursive line graph algorithm is given for automatically generating complete canonical lists of atomic indexes associated with each interaction order. No attempt has been made to computationally optimize this algorithm or the associated data structures. Instead, clarity of exposition is the main objective here. We anticipate the main application will involve embedding the algorithm within a Monte Carlo or Molecular Dynamics simulation code where other implementation details will determine the most efficient realization. In accord with common practice, intramolecular interactions up to order 4 have been considered (bonds, bends and dihedrals), but the method can be extended to arbitrarily many atomic centers. Higher order interactions will involve increasingly many sub-type variations and polycyclic structures. Two specific examples are discussed: a toy model of methylcyclopropane and a published effective potential model of taurocholate bile salt [24] that is relevant for the study of digestive processes in the human lower gastrointestinal tract.