Background

In computational molecular biology, various types of data have been utilized, which include sequences, gene expression patterns, and protein structures. Graph structured data have also been extensively utilized, which include metabolic pathways, protein-protein interaction networks, gene regulatory networks, and chemical graphs. Much attention has recently been paid to the analysis of chemical graphs due to its potential applications to computer-aided drug design. One of the major approaches to computer-aided drug design is quantitative structure activity/property relationship (QSAR/QSPR) analysis, the purpose of which is to derive quantitative relationships between chemical structures and their activities/properties. Furthermore, inverse QSAR/QSPR has been extensively studied [1, 2], the purpose of which is to infer chemical structures from given chemical activities/properties. Inverse QSAR/QSPR is often formulated as an optimization problem to find a chemical structure maximizing (or minimizing) an objective function under various constraints.

In both QSAR/QSPR and inverse QSAR/QSPR, chemical compounds are usually represented as vectors of real or integer numbers, which are often called descriptors and correspond to feature vectors in machine learning. Using these chemical descriptors, various heuristic and statistical methods have been developed for finding optimal or nearly optimal graph structures under given objective functions [1, 3, 4]. Inference or enumeration of graph structures from a given feature vector is a crucial subtask in many of such methods. Various methods have been developed for this enumeration problem [5,6,7,8] and the computational complexity of the inference problem has been analyzed [9, 10]. On the other hand, enumeration in itself is a challenging task, since the number of molecules (i.e., chemical graphs) with up to 30 atoms (vertices) C, N, O, and S, may exceed \(10^{60}\) [11].

As a new approach, artificial neural network (ANN) and deep learning technologies have recently been applied to inverse QSAR/QSPR. For example, variational autoencoders [12], recurrent neural networks [13, 14], and grammar variational autoencoders [15] have been applied. In these approaches, new chemical graphs are generated by solving a kind of inverse problems on neural networks that are trained using known chemical compound/activity pairs. However, the optimality of the solution is not necessarily guaranteed in these approaches. In order to guarantee the optimality mathematically, a novel approach has been proposed [16] for ANNs, using mixed integer linear programming (MILP).

Recently, a new framework has been proposed [17,18,19] by combining two previous approaches: efficient enumeration of tree-like graphs [5], and MILP-based formulation of the inverse problem on ANNs [16]. This combined framework for inverse QSAR/QSPR mainly consists of two phases. The first phase solves (I) Prediction Problem, where a feature vector f(G) of a chemical graph G is introduced and a prediction function \(\psi _{{{\mathcal {N}}}}\) on a chemical property \(\pi \) is constructed with an ANN \({{\mathcal {N}}}\) using a data set of chemical compounds G and their values a(G) of \(\pi \). The second phase solves (II) Inverse Problem, where (II-a) given a target value \(y^*\) of the chemical property \(\pi \), a feature vector \(x^*\) is inferred from the trained ANN \({{\mathcal {N}}}\) so that \(\psi _{{{\mathcal {N}}}}(x^*)\) is close to \(y^*\) and (II-b) then a set of chemical structures \(G^*\) such that \(f(G^*)= x^*\) is enumerated by a graph search algorithm. In (II-a) of the above-mentioned previous methods [17,18,19], an MILP is formulated for acyclic chemical compounds. Afterwards, Ito et al. [20] and Zhu et al. [21] designed a method of inferring chemical graphs with cycle index 1 and 2, respectively, by formulating a new MILP and using an efficient algorithm for enumerating chemical graphs with cycle index 1 [22] and cycle index 2 [23, 24]. The computational results conducted on instances with n non-hydrogen atoms show that a feature vector \(x^*\) can be inferred for up to around \(n=40\) whereas graphs \(G^*\) can be enumerated for up to around \(n=15\).

In this paper, we present a new characterization of graph structure, called “branch-height.” Based on this, we can treat a class of acyclic chemical graphs with a structure that is topologically restricted but frequently appears in a chemical database, formulate a new MILP formulation that can handle acyclic graphs with a large diameter, and design a new graph search algorithm that generates acyclic chemical graphs with up to around 50 vertices. The results of computational experiments using such chemical properties as octanol/water partition coefficient, boiling point and heat of combustion suggest that the proposed method is much more useful than the previous method.

The paper is organized as follows. "Preliminary" section introduces some notions on graphs, a modeling of chemical compounds and a choice of descriptors. "A method for inferring chemical graphs" section reviews the framework for inferring chemical compounds based on ANNs and MILPs. "MILPs for chemical acyclic graphs with bounded branch-height" section introduces a new method of modeling acyclic chemical graphs and proposes a new MILP formulation that represents an acyclic chemical graph G with n vertices, where our MILP requires only O(n) variables and constraints when the branch-parameter k and the k-branch height in G (graph topological parameters newly introduced in this paper) is constant. "A new graph search algorithm" section describes the idea of our new dynamic programming type of algorithm that enumerates a given number of acyclic chemical graphs for a given feature vector. "Experimental results" section reports the results on some computational experiments conducted for chemical properties such as octanol/water partition coefficient, boiling point and heat of combustion. "Concluding remarks" section makes some concluding remarks. Appendix A provides the statistical distribution of structural features of acyclic chemical graphs in a chemical graph database. Appendices B and C describe the idea of our MILP formulation and the details of all variables and constraints in the MILP formulation, respectively. Appendix D presents descriptions of our new graph search algorithm.

Preliminary

This section introduces some notions and terminology on graphs, a modeling of chemical compounds and our choice of descriptors.

Let \({\mathbb {R}}\), \({\mathbb {Z}}\) and \({\mathbb {Z}}_+\) denote the sets of reals, integers and non-negative integers, respectively. For two integers a and b, let [ab] denote the set of integers i with \(a\le i\le b\).

Graphs

A graph stands for a simple undirected graph, where an edge joining two vertices u and v is denoted by uv \((= vu)\). The sets of vertices and edges of a graph H are denoted by V(H) and E(H), respectively. Let \(H=(V,E)\) be a graph with a set V of vertices and a set E of edges. For a vertex \(v\in V\), the set of neighbors of v in H is denoted by \(N_H(v)\), and the degree \(\deg _H(v)\) of v is defined to be \(|N_H(v)|\). The length of a path is defined to be the number of edges in the path. The distance \({\text {dist}}_H(u,v)\) between two vertices \(u,v\in V\) is defined to be the minimum length of a path connecting u and v in H. The diameter \({\text {dia}}(H)\) of H is defined to be the maximum distance between two vertices in H; i.e., \({\text {dia}}(H)\triangleq \max _{u,v\in V}{\text {dist}}_H(u,v)\). Denote by \(\ell (P)\) the length of a path P.

Centers of trees For a tree T with an even (resp., odd) diameter d, the center is defined to be the vertex v (resp., the adjacent vertex pair \(\{v,v'\}\)) that situates in the middle of one of the longest paths, with length d. The center of each tree is uniquely determined.

Rooted trees A rooted tree is defined to be a tree where a vertex (or a pair of adjacent vertices) is designated as the root. Let T be a rooted tree, where for two adjacent vertices u and v, vertex u is called the parent of v if u is closer to the root than v is. The height \({\text {height}}(v)\) of a vertex v in T is defined to be the maximum length of a path from v to a leaf u in the descendants of v, where \({\text {height}}(v)=0\) for each leaf v in T. Figure 1a and b illustrate examples of trees rooted at the center.

Fig. 1
figure 1

An illustration of rooted trees and a 2-branch-tree: a A tree \(H_1\) with odd diameter 11; b A tree \(H_2\) with even diameter 10; c The 2-branch-tree of \(H_2\)

Degree-bounded trees For positive integers ab and c with \(b\ge 2\), let T(abc) denote the rooted tree such that the number of children of the root is a, the number of children of each non-root internal vertex is b and the distance from the root to each leaf is c. We see that the number of vertices in T(abc) is \(a(b^c-1)/(b-1)+1\), and the number of non-leaf vertices in T(abc) is \(a(b^{c-1}-1)/(b-1)+1\). In the rooted tree T(abc), we denote the vertices by \(v_1,v_2,\ldots ,v_n\) with a breadth-first-search order, and denote the edge between a vertex \(v_i\) with \(i\in [2,n]\) and its parent by \(e_i\), where \(n=a(b^c-1)/(b-1)+1\) and each vertex \(v_i\) with \(i\in [1, a(b^{c-1}-1)/(b-1)+1]\) is a non-leaf vertex. For each vertex \(v_i\) in T(abc), let \(\text {Cld}(i)\) denote the set of indices j such that \(v_j\) is a child of \(v_i\), and \(\text {prt}(i)\) denote the index j such that \(v_j\) is the parent of \(v_i\) when \(i\in [2,n]\). Let \(P_{\text {prc}}(a,b,c)\) be a set of ordered index pairs (ij) of vertices \(v_i\) and \(v_j\) in T(abc). We call \(P_{\text {prc}}(a,b,c)\) proper if the next conditions hold:

  1. (a)

    For each pair of vertices \(v_i\) and \(v_j\) in T(abc) such that \(v_i\) is the parent of \(v_j\), there is a sequence \((i_1,i_2),(i_2,i_3),\ldots ,(i_{k-1},i_k)\) of index pairs in \(P_{\text {prc}}(a,b,c)\) such that \(i_1=i\) and \(i_k=j\); and

  2. (b)

    Each subtree \(H=(V,E)\) of T(abc) with \(v_1\in V\) is isomorphic to a subtree \(H'=(V',E')\) by a graph isomorphism \(\psi :V\rightarrow V'\) with \(\psi (v_1)=v_1\) so that if \(v_j\in V'\) for a pair \((i,j)\in P_{\text {prc}}(a,b,c)\) then \(v_i\in V'\).

Note that a proper set \(P_{\text {prc}}(a,b,c)\) is not necessarily unique.

Branch-height in trees In this paper, we introduce “branch-height” of a tree as a new measure to the “agglomeration degree” of trees. We specify a non-negative integer k, called a branch-parameter to define branch-height. First we regard T as a rooted tree by choosing the center of T as the root. Figure 1a, b illustrate examples of rooted trees. We introduce the following terminology on a rooted tree T.

  • A leaf k-branch: A non-root vertex v in T such that \({\text {height}}(v)= k\).

  • A non-leaf k-branch: A non-root vertex v in T such that v has at least two children, and for each child u of v it holds that \({\text {height}}(u)\ge k\). We call a leaf or a non-leaf k-branch a k-branch. Figure 2a–c illustrate the k-branches of the rooted tree \(H_2\) in Fig. 1b for \(k=1,2\) and 3, respectively.

  • A k-branch-path: A path P in T that joins two vertices u and \(u'\) such that each of u and \(u'\) is the root or a k-branch and P does not contain the root or a k-branch as an internal vertex.

  • The k-branch-subtree of T: The subtree of T that consists of the edges in all k-branch-paths of T. We call a vertex (resp., an edge) in T a k-internal vertex (resp., a k-internal edge) if it is contained in the k-branch-subtree of T and a k-external vertex (resp., a k-external edge) otherwise. Let \(V^{\text {in}}\) and \(V^\text {ex}\) (resp., \(E^{\text {in}}\) and \(E^\text {ex}\)) denote the sets of k-internal and k-external vertices (resp., edges) in T.

  • The k-branch-tree of T: The rooted tree obtained from the k-branch-subtree of T by replacing each k-branch-path with a single edge. Figure 1c illustrates the 2-branch-tree of the rooted tree \(H_2\) in Fig. 1b. Notice that by our definitions, leaf k-branches and non-leaf k-branches are leaves and branching points in the k-branch-tree.

  • A k-fringe-tree: One of the connected components that consists of the edges not in the k-branch-subtree. Each k-fringe-tree \(T'\) contains exactly one vertex v in the k-branch-subtree, where \(T'\) is regarded as a tree rooted at v. Note that the height of any k-fringe-tree is at most k. Figure 2a–c illustrate the k-fringe-trees of the rooted tree \(H_2\) in Fig. 1b for \(k=1, 2\) and 3, respectively.

  • The k-branch-leaf number \({\text {bl}}_k(T)\): The number of leaf k-branches in T. For the trees \(H_i\), \(i=1,2\) in Fig. 1a, b, it holds that \({\text {bl}}_0(H_1)= {\text {bl}}_0(H_2)=8\), \({\text {bl}}_1(H_1)= {\text {bl}}_1(H_2)=5\), \({\text {bl}}_2(H_1)= {\text {bl}}_2(H_2)=3\) and \({\text {bl}}_3(H_1)= {\text {bl}}_3(H_2)=2\).

  • The k-branch height \(\text {bh}_k(T)\) of T: The maximum number of k-branches along a path from the root to a leaf of T; i.e., \(\text {bh}_k(T)\) is the height of the k-branch-tree \(T^*\) (the maximum length of a path from the root to a leaf in \(T^*\)). For the example of trees \(H_i\), \(i=1,2\) in Fig. 1a, b, it holds that \(\text {bh}_0(H_1)=\text {bh}_0(H_2)=3\), \(\text {bh}_1(H_1)=\text {bh}_1(H_2)=3\), \(\text {bh}_2(H_1)=\text {bh}_2(H_2)=2\) and \(\text {bh}_3(H_1)=\text {bh}_3(H_2)=1\).

Fig. 2
figure 2

An illustration of the k-branches (depicted by gray circles), the k-branch-subtree (depicted by solid lines) and k-fringe-trees (depicted by dashed lines) of \(H_2\): a \(k=1\); b \(k=2\); c \(k=3\)

Even though this paper deals exclusively with acyclic graphs, we formally introduce the k-branch height for chemical cyclic graphs (chemical graphs that contain at least one cycle). The core of a chemical cyclic graph G is defined to be the induced subgraph \(G'\) of G that consists of vertices in a cycle or the vertices in a path joining two cycles. A vertex in the core (not in the core) is called a core vertex (resp., a non-core vertex). The edges not in the core of a chemical cyclic graph G form a collection of trees T, which we call a non-core tree. Each non-core tree contains exactly one core vertex and is regarded as a tree rooted at the core vertex. The k-branch height of a chemical cyclic graph G is defined to be the maximum of k-branch heights over all non-core trees. We observe that most chemical graphs G with at most 50 non-hydrogen atoms satisfy \(\text {bh}_2(G)\le 2\). See Appendix A for a summary of statistical feature distribution of chemical graphs registered in the chemical database PubChem [25].

For convenient reference, we summarize the graph-related notation used throughout this paper in Table 1.

Table 1 Graph-theoretic notation

Modeling of chemical compounds

We represent the graph structure of a chemical compound as a graph with labels on vertices and multiplicity on edges in a hydrogen-suppressed model. Let \(\Lambda \) be a set of labels each of which represents a chemical element such as C (carbon), O (oxygen), N (nitrogen) and so on, where we assume that \(\Lambda \) does not contain H (hydrogen). Let \(\text {mass}({\mathtt{a}})\) and \({\text {val}}({\mathtt{a}})\) denote the mass and valence of a chemical element \({\mathtt{a}}\in \Lambda \), respectively. In our model, we use integer \(\text {mass}^*({\mathtt{a}})=\lfloor 10\cdot \text {mass}({\mathtt{a}})\rfloor \), \({\mathtt{a}}\in \Lambda \), and assume that each chemical element \({\mathtt{a}}\in \Lambda \) has a unique valence \({\text {val}}({\mathtt{a}})\in [1, 4]\).

We introduce a total order < over the elements in \(\Lambda \) according to their mass values; i.e., we write \(\mathtt{a<b}\) for chemical elements \(\mathtt{a,b}\in \Lambda \) with \(\text {mass}({\mathtt{a}})<\text {mass}(\mathtt{b})\). A pair of two atoms \({\mathtt{a}}\) and \(\mathtt{b}\), \(\mathtt{a, b} \in \Lambda \), joined with a bond-multiplicity \(m \in [1, 3]\), where \(m=1, 2, 3,\) correspond to single, double, and triple bonds, respectively, is denoted by a tuple \(\gamma =(\mathtt{a, b}, m)\), called the adjacency-configuration of the atom pair. Choose a set \(\Gamma _{<}\) of tuples \(\gamma =(\mathtt{a,b},m)\in \Lambda \times \Lambda \times [1,3]\) such that \(\mathtt{a<b}\). For a tuple \(\gamma =(\mathtt{a,b},m)\in \Lambda \times \Lambda \times [1,3]\), let \(\overline{\gamma }\) denote the tuple \((\mathtt{b,a},m)\). Set \(\Gamma _{>}=\{\overline{\gamma }\mid \gamma \in \Gamma _{<}\}\) and \(\Gamma _{=}=\{(\mathtt{a,a},m)\mid {\mathtt{a}}\in \Lambda , m\in [1,3]\}\), and \(\Gamma = \Gamma _{<}\cup \Gamma _{=}\).

We use a hydrogen-suppressed model because hydrogen atoms can be added at the final stage.

Let \((H,\alpha ,\beta )\) be a tuple of a graph \(H=(V,E)\), a function \(\alpha :V\rightarrow \Lambda \) and a function \(\beta : E\rightarrow [1,3]\), where \(\alpha (v)={\mathtt{a}}\) and \(\beta (e)=m\) mean that a chemical element \({\mathtt{a}}\) is assigned to a vertex v and a bond-multiplicity m is assigned to an edge e, respectively. For a notational convenience, we denote the sum of bond-multiplicities of edges incident to a vertex \(u \in V\) by

  • \(\beta (u) \triangleq \sum _{uv \in E}\beta (uv)\).

A tuple \(G=(H,\alpha ,\beta )\) is called a chemical graph over \(\Lambda \) and \(\Gamma _{<}\cup \Gamma _{=}\) if the following holds:

  1. (i)

    H is connected;

  2. (ii)

    \((\alpha (u),\alpha (v),\beta (uv))\in \Gamma _{<}\cup \Gamma _{=}\) for each edge \(uv\in E\); and

  3. (iii)

    \(\beta (u) \le {\text {val}}(\alpha (u))\) for each vertex \(u\in V\).

A chemical graph \(G=(H,\alpha ,\beta )\) is called a “chemical acyclic graph” if the graph H is an acyclic graph. Similarly for other types of graphs for H.

We define the bond-configuration of an edge \(e=uv \in E\) in a chemical graph G to be a tuple \((\deg _H(u),\deg _H(v),\beta (e))\) such that \(\deg _H(u)\le \deg _H(v)\) for the end-vertices u and v of e. Let \(\text {Bc}\) denote the set of bond-configurations \(\mu =(d_1,d_2,m)\in [1,4]\times [1,4]\times [1,3]\) such that \(\max \{d_1,d_2\}+m \le 5\). We regard that \((d_1,d_2,m)=(d_2,d_1,m)\).

In summary, we give the notation on modeling chemical compounds used throughout this paper in Table 2.

Table 2 Notation adopted for modeling chemical compounds

Descriptors

In our method, we use only graph-theoretical descriptors for defining a feature vector, which facilitates our design of an algorithm for constructing graphs. Given a chemical acyclic graph \(G=(H,\alpha ,\beta )\), we define a feature vector f(G) that consists of the following 11 kinds of descriptors. We choose an integer \(k^*\in [1,4]\) as a branch-parameter.

General chemical graph descriptors

  • n(G): the number |V| of vertices.

  • \(\overline{\text {dia}}(G)\triangleq \text {dia}(H)/n(G)\): the diameter of H divided by \(n(G)=|V|\).

  • \(\overline{\text {ms}}\triangleq \sum _{v\in V}\text {mass}^*(\alpha (v))/n(G)\): the average \(\hbox {mass}^*\) of atoms in G.

  • \(n_\mathtt{H}(G)\): the number of hydrogen atoms to be added to G.

Descriptors for vertices of certain degree

  • \(\text {dg}_i^\text {t}(G)\triangleq |\{v\in V^\text {t}\mid \deg _{H}(v)=i\}|,\) \(i\in [1,4],\) \(\text {t}\in \{{\text {in}},\text {ex}\}\): the number of \(k^*\)-internal/\(k^*\)-external vertices of degree i in H, where the bond-multiplicity of edges incident to a vertex v is ignored in the degree of v.

Descriptors for branch-leaf number and branch-height

  • \({\text {bl}}_{k^*}(G)\): the \(k^*\)-branch-leaf number of G.

  • \(\text {bh}_{k^*}(G)\): the \(k^*\)-branch height of G.

Descriptors for vertex labels

  • \(\text {ce}_{\mathtt{a}}^\text {t}(G)\triangleq |\{ v\in V^\text {t}\mid \alpha (v)={\mathtt{a}}\}|,\) \({\mathtt{a}}\in \Lambda,\) \(\text {t}\in \{{\text {in}},\text {ex}\}\): the number of \(k^*\)-internal/\(k^*\)-external vertices with chemical element \({\mathtt{a}}\in \Lambda \).

Descriptors for the number of bonds

  • \(\text {bd}_m^\text {t}(G)\triangleq \{e\in E^\text {t}\mid \beta (e)=m\}\), \(m=2, 3\), \(\text {t}\in \{{\text {in}},\text {ex}\}\): the number of \(k^*\)-internal/\(k^*\)-external edges with bond-multiplicity m.

Descriptors for adjacency-configurations

  • \(\text {ac}_{\gamma }^\text {t}(G)\), \(\gamma \in \Gamma \), \(\text {t}\in \{{\text {in}},\text {ex}\}\): the number of \(k^*\)-internal/\(k^*\)-external edges \(e=uv\) with adjacency-configuration \(\gamma =(\mathtt{a,b},m)\) (i.e., \(\alpha (u)={\mathtt{a}},\alpha (v)=\mathtt{b}\) and \(\beta (e)=m\)) in G.

Descriptors for bond-configurations

  • \(\text {bc}_{\mu }^\text {t}(G)\), \(\mu \in \text {Bc}\), \(\text {t}\in \{{\text {in}},\text {ex}\}\): the number of \(k^*\)-internal/\(k^*\)-external edges \(e=uv\) with bond-configuration \(\mu =(d,d',m)\) (i.e., \(\deg _H(u)=d, \deg _H(v)=d'\) and \(\beta (e)=m\)) in G.

Note that

$$\begin{aligned} n_\mathtt{H}(G)&\triangleq \sum _{\begin{array}{c} {\mathtt{a}}\in \Lambda , \\ \mathtt{t} \in \{{\text {in}},\text {ex}\} \end{array}} {\text {val}}({\mathtt{a}})\text {ce}_{\mathtt{a}}^\text {t}(G) - \sum _{\begin{array}{c} \gamma =(\mathtt{a, b}, m)\in \Gamma ,\\ \mathtt{t} \in \{{\text {in}},\text {ex}\} \end{array}} 2m\cdot \text {ac}_{\gamma }^\text {t}(G) \\&= \sum _{\begin{array}{c} {\mathtt{a}}\in \Lambda , \\ \mathtt{t}\in \{{\text {in}},\text {ex}\} \end{array}} {\text {val}}({\mathtt{a}})\text {ce}_{\mathtt{a}}^\text {t}(G) -2(n(G)-1 + \sum _{\begin{array}{c} m\in [2,3], \\ \mathtt{t}\in \{{\text {in}},\text {ex}\} \end{array}} (m-1) \cdot \text {bd}_m^\text {t}(G)). \end{aligned}$$

The number K of descriptors in our feature vector \(x=f(G)\) is \(K=2|\Lambda |+2|\Gamma |+50\). Note that the above K descriptors are not independent in the sense that some descriptors depend on the combination of other descriptors. For example, descriptor \(\text {bd}_i^{\text {in}}(G)\) can be determined by \(\sum _{\gamma =(\mathtt{a,b},m)\in \Gamma : m=i }\text {ac}_{\gamma }^{\text {in}}(G)\).

A method for inferring chemical graphs

Framework for the Inverse QSAR/QSPR

We review the framework that solves the inverse QSAR/QSPR by using MILPs [20, 21], which is illustrated in Fig. 3. For a specified chemical property \(\pi \) such as boiling point, we denote by a(G) the observed value of the property \(\pi \) for a chemical compound G. As the first phase, we solve (I) Prediction Problem with the following three steps.

Fig. 3
figure 3

ac An illustration of Phase 1: a Stage 1 for preparing a data set \(D_{\pi }\) for a graph class \({\mathcal {G}}\) and a specified chemical property \(\pi \); b Stage 2 for introducing a feature function f with descriptors; c Stage 3 for constructing a prediction function \(\psi _{{\mathcal {N}}}\) with an ANN \({{\mathcal {N}}}\); de An illustration of Phase 2: (d) Stage 4 for formulating an MILP \({{\mathcal {M}}}(x,y,g;{\mathcal {C}}_1,{\mathcal {C}}_2)\) and finding a feasible solution \((x^*,g^*)\) of the MILP for a target value \(y^*\) so that \(\psi _{{\mathcal {N}}}(x^*)=y^*\) (possibly detecting that no target graph \(G^*\) exists); (e) Stage 5 for enumerating graphs \(G^*\in {\mathcal {G}}\) such that \(f(G^*)=x^*\)

Phase 1.

Stage 1: Let \(\text {DB}\) be a set of chemical graphs. For a specified chemical property \(\pi \), choose a class \({\mathcal {G}}\) of graphs such as acyclic graphs or monocyclic graphs. Prepare a data set \(D_{\pi }=\{G_i\mid i=1,2,\ldots ,m\}\subseteq {\mathcal {G}}\cap \text {DB}\) such that the value \(a(G_i)\) of each chemical graph \(G_i\), \(i=1,2,\ldots ,m\) is available. Set reals \({\underline{a}}, {\overline{a}}\in {\mathbb {R}}\) so that \({\underline{a}}\le a(G_i)\le {\overline{a}}\), \(i=1,2,\ldots ,m\).

Stage 2: Introduce a feature function \(f: {\mathcal {G}}\rightarrow {\mathbb {R}}^K\) for a positive integer K. We call f(G) the feature vector of \(G\in {\mathcal {G}}\), and call each entry of a vector f(G) a descriptor of G.

Stage 3: Construct a prediction function \(\psi _{{\mathcal {N}}}\) with an ANN \({{\mathcal {N}}}\) that, given a vector in \({\mathbb {R}}^K\), returns a real number in the range \([{\underline{a}},{\overline{a}}]\) so that \(\psi _{{\mathcal {N}}}(f(G))\) takes a value nearly equal to a(G) for many chemical graphs in \(\text {DB}\). See Fig. 3a–c for an illustration of Stages 1, 2, and 3 in Phase 1.

In this paper, we use the range-based method to define an applicability domain (AD) [26] to our inverse QSAR/QSPR. Set \(\underline{x_j}\) and \(\overline{x_j}\) to be the minimum and maximum values of the j-th descriptor \(x_j\) in \(f(G_i)\), respectively, over all graphs \(G_i\), \(i=1,2,\ldots ,m\), where we possibly normalize some descriptors such as \(\text {ce}_{\mathtt{a}}^{\text {in}}(G)\), which is normalized with \(\text {ce}_{\mathtt{a}}^{\text {in}}(G)/n(G)\). Define our AD \({\mathcal {D}}\) to be the set of vectors \(x\in {\mathbb {R}}^K\) such that \(\underline{x_j}\le x_j\le \overline{x_j}\) for the variable \(x_j\) of each j-th descriptor, \(j=1,2,\ldots ,k\).

In the second phase, we try to find a vector \(x^*\in {\mathbb {R}}^K\) from a target value \(y^*\) of the chemical propery \(\pi \) such that \(\psi _{{\mathcal {N}}}(x^*)=y^*\). Based on the method due to Akutsu and Nagamochi [16], Chiewvanichakorn et al. [18] showed that this problem can be formulated as an MILP. By including a set of linear constraints such that \(x\in {\mathcal {D}}\) into their MILP, we obtain the next result.

Theorem 1

([20, 21]) Let \({{\mathcal {N}}}\) be an ANN with a piecewise-linear activation function for an input vector \(x\in {\mathbb {R}}^K,\) \(n_A\) denote the number of nodes in the architecture and \(n_B\) denote the total number of break-points over all activation functions. Then there is an MILP \({{\mathcal {M}}}(x,y;{\mathcal {C}}_1)\) that consists of variable vectors \(x\in {\mathcal {D}}~(\subseteq {\mathbb {R}}^K)\), \(y\in {\mathbb {R}}\), and an auxiliary variable vector \(z\in {\mathbb {R}}^p\) for some integer \(p=O(n_A+n_B)\) and a set \({\mathcal {C}}_1\) of \(O(n_A+n_B)\) constraints on these variables such that: \(\psi _{{{\mathcal {N}}}}(x^*)=y^*\) if and only if there is a vector \((x^*,y^*)\) feasible to \({{\mathcal {M}}}(x,y;{\mathcal {C}}_1)\).

See Appendix “Upper and lower bounds on descriptors” for the set of constraints to define our AD \({\mathcal {D}}\) in the MILP \({{\mathcal {M}}}(x,y;{\mathcal {C}}_1)\) in Theorem 1.

A vector \(x\in {\mathbb {R}}^K\) is called admissible if there is a chemical graph \(G\in {\mathcal {G}}\) such that \(f(G)=x\) [17]. Let \({\mathcal {A}}\) denote the set of admissible vectors \(x\in {\mathbb {R}}^K\). To ensure that a vector \(x^*\) inferred from a given target value \(y^*\) becomes admissible, we introduce a new vector variable \(g\in {\mathbb {R}}^{q}\) for an integer q. For the class \({\mathcal {G}}\) of chemical acyclic graphs, Azam et al. [17] introduced a set \({\mathcal {C}}_2\) of new constraints with a new vector variable \(g\in {\mathbb {R}}^{q}\) for an integer q so that

  • A feasible solution \((x^*,g^*)\) of a new MILP for a target value \(y^*\) delivers a vector \(x^*\) with \(\psi _{{{\mathcal {N}}}}(x^*)=y^*\), and

  • A vector \(g^*\) that represents a chemical acyclic graph \(G^*\in {\mathcal {G}}\).

Afterwards, for the classes of chemical graphs with cycle index 1 and 2, Ito et al. [17] and Zhu et al. [21] presented such a set \({\mathcal {C}}_2\) of constraints so that a vector \(g^*\) in a feasible solution \((x^*,g^*)\) of a new MILP can represent a chemical graph \(G^*\) in the class \({\mathcal {G}}\), respectively.

As the second phase, we solve (II) Inverse Problem for the inverse QSAR/QSPR by treating the following inference problems.

(II-a) Inference of Vectors

Input: A real \(y^*\) with \({\underline{a}}\le y^*\le {\overline{a}}\).

Output: Vectors \(x^*\in {\mathcal {A}}\cap {\mathcal {D}}\) and \(g^*\in {\mathbb {R}}^{q}\) such that \(\psi _{{\mathcal {N}}}(x^*)=y^*\) and \(g^*\) forms a chemical graph \(G^*\in {\mathcal {G}}\) with \(f(G^*)=x^*\).

(II-b) Inference of Graphs

Input: A vector \(x^*\in {\mathcal {A}}\cap {\mathcal {D}}\).

Output: All graphs \(G^*\in {\mathcal {G}}\) such that \(f(G^*)=x^*\).

The second phase consists of the next two steps.

Phase 2.

Stage 4:  Formulate Problem (II-a) as the above MILP \({{\mathcal {M}}}(x,y,g;{\mathcal {C}}_1,{\mathcal {C}}_2)\) based on \({\mathcal {G}}\) and \({{\mathcal {N}}}\). Find a feasible solution \((x^*,g^*)\) of the MILP such that

  • \(x^*\in {\mathcal {A}}\cap {\mathcal {D}}\) and \(\psi _{{\mathcal {N}}}(x^*)=y^*\).

The second requirement may be replaced with inequalities \((1-\varepsilon )y^* \le \psi _{{\mathcal {N}}}(x^*) \le (1+\varepsilon )y^*\) for a tolerance \(\varepsilon >0.\)

Stage 5:  To solve Problem (II-b), enumerate all (or a specified number) of graphs \(G^*\in {\mathcal {G}}\) such that \(f(G^*)=x^*\) for the inferred vector \(x^*\). See Fig. 3d, e for an illustration of Stages 4 and 5 in Phase 2.

In practical applications, there would be many criteria that a target chemical compound needs to satisfy rather than a single chemical property \(\pi \), such as stability and synthesizability. The above five steps in the framework are rather schematic in the sense that it would be necessary to adjust several settings in each stage in order to find a collection of chemical graphs that meet many of those criteria after a repeated application of the framework. For example, we can include in an MILP formulation in Stage 4 additional conditions such as lower and upper bounds on the frequency of adjacency-configurations and extra requirements on substructures of a target chemical graph as long as these conditions can be expressed as linear constraints with integer/real variables. Also an efficient algorithm in Stage 5 can quickly offer a large number of isomers of the same feature vectors, to which we can apply a further screening to choose promising candidates for chemical graphs.

Our target graph class

In this paper, we choose a branch-parameter \(k\ge 1\) and define a class \({\mathcal {G}}\) of chemical acyclic graphs G such that

  • The maximum degree in G is at most 4;

  • The k-branch height \(\text {bh}_k(G)\) is bounded for a specified branch-parameter k; and

  • The size of each k-fringe-tree in G is bounded.

The reason why we restrict ourselves to the graphs in \({\mathcal {G}}\) is that this class \({\mathcal {G}}\) covers a large part of the acyclic chemical compounds registered in the chemical database PubChem. See Appendix A for a summary of the statistical features of the chemical graphs in PubChem in terms of k-branch height and the size of 2-fringe-trees. According to this, over 55% (resp., 99%) of acyclic chemical compounds with up to 100 non-hydrogen atoms in PubChem have the maximum degree 3 (resp., 4); and nearly 87% (resp., 99%) of acyclic chemical compounds with up to 50 non-hydrogen atoms in PubChem have the 2-branch height at most 1 (resp., 2). This implies that \(k=2\) is sufficient to cover most of chemical acyclic graphs. For \(k=2\), over 92% of 2-fringe-trees of chemical compounds with up to 100 non-hydrogen atoms in PubChem obey the following size constraint:

$$\begin{aligned} n(T) \le 2\deg _T(r) + 2\hbox { for each 2-fringe-tree }T \hbox { with the root }r. \end{aligned}$$
(1)

We formulate an MILP in Stage 4 that, given a target value \(y^*\), infers a vector \(x^*\in {\mathbb {Z}}_+^K\) with \(\psi _{{\mathcal {N}}}(x^*)=y^*\) and a chemical acyclic graph \(G^*=(H,\alpha ,\beta )\in {\mathcal {G}}\) with \(f(G^*)=x^*\). We here specify some of the features of a graph \(G^*\in {\mathcal {G}}\) such as the number of non-hydrogen atoms in order to control the graph structure of target graphs to be inferred and to simplify MILP formulations. In this paper, we specify the following features on a graph \(G\in {\mathcal {G}}\): a set \(\Lambda \) of chemical elements, a set \(\Gamma _{<}\) of adjacency-configurations, the maximum degree, the number of non-hydrogen atoms, the diameter, the k-branch height and the k-branch-leaf number for a branch-parameter k.

More formally, given specified integers \(n^*\), \(d_\text {max}\), \(\text {dia}^*\), \(k^*\), \(\text {bh}^*\), \({\text {bl}}^*\in {\mathbb {Z}}\) other than \(\Lambda \) and \(\Gamma \), let \({\mathcal {H}}(n^*, d_\text {max}, \text {dia}^*, k^*, \text {bh}^*, {\text {bl}}^*)\) denote the set of acyclic graphs H such that

  • The maximum degree of a vertex in H is at most 3 when \(d_\text {max}=3\) (or equal to 4 when \(d_\text {max}=4\)),

  • The number n(H) of vertices in H is \(n^*\),

  • The diameter \(\text {dia}(H)\) of H is \(\text {dia}^*\),

  • The \(k^*\)-branch height \(\text {bh}_{k^*}(H)\) is \(\text {bh}^*\),

  • The \(k^*\)-branch-leaf number \({\text {bl}}_{k^*}(H)\) is \({\text {bl}}^*\) and

  • (1) holds.

To design Stage 4 for our class \({\mathcal {G}}\), we formulate an MILP \({{\mathcal {M}}}(x,g; {\mathcal {C}}_2)\) that infers a chemical graph \(G^*=(H,\alpha ,\beta )\in {\mathcal {G}}\) with \(H\in {\mathcal {H}}(n^*, d_\text {max}, \text {dia}^*, k^*, \text {bh}^*, {\text {bl}}^*)\) for a given specification \((\Lambda ,\Gamma ,n^*, d_\text {max}, \text {dia}^*, k^*, \text {bh}^*, {\text {bl}}^*)\). The details will be given in "MILPs for chemical acyclic graphs with bounded branch-height" section and Appendix C.

Design of Stage 5, i.e., generating chemical graphs \(G^*\) that satisfy \(f(G^*)=x^*\) for a given feature vector \(x^*\in {\mathbb {Z}}_+^K\) is still challenging for a relatively large instance with size \(n(G^*)\ge 20\). There have been proposed algorithms for generating chemical graphs \(G^*\) in Stage 5 for the classes of graphs with cycle index 0 to 2 [5, 22,23,24]. All of these are designed based on the branch-and-bound method and can generate a target chemical graph with size \(n(G^*)\le 20\). To break this barrier, we newly employ the dynamic programming method for designing an algorithm in Stage 5 in order to generate a target chemical graph \(G^*\) with size \(n(G^*)=50\). For this, we further restrict the structure of acyclic graphs G so that the number \({\text {bl}}_2(G)\) of leaf 2-branches is at most 3. Among all acyclic chemical compounds with up to 50 non-hydrogen atoms in the chemical database PubChem, the ratio of the number of acyclic chemical compounds G with \({\text {bl}}_2(G)\le 2\) (resp., \({\text {bl}}_2(G)\le 3\)) is 78% (resp., 95%). See "A new graph search algorithm" section and Appendix D for the details on the new algorithm in Stage 5.

To conclude the description of the target graph class to be inferred by the inverse QSAR/QSPR framework developed in this paper, we summarize the global parameters in Table 3.

Table 3 Fixed parameters of target graphs

MILPs for chemical acyclic graphs with bounded branch-height

In this section, we describe an idea of formulating an MILP \({{\mathcal {M}}}(x,g;{\mathcal {C}}_2)\) to infer a chemical acyclic graph G in the class \({\mathcal {G}}\) for a given specification \((\Lambda ,\Gamma ,n^*, d_\text {max}, \text {dia}^*,\) \( k^*, \text {bh}^*, {\text {bl}}^*)\) defined in the previous section. Please refer to Table 3 for a summary of the parameters that we assume to be fixed for a target graph.

Scheme graphs

Our new idea of constructing an acyclic graph H is as follows. See a rooted tree \(T_B=T(d_\text {max},d_\text {max}-1,\text {bh}^*)\) in Fig. 4a.

  • From the tree \(T_B\), we first choose a subtree T including the root \(u_1\). We use T as the \(k^*\)-branch-tree of H.

  • Next, we choose some edges in the tree T and replace each of the edges \(e=u_i u_j\) with a path \(P_e\) between vertices \(u_i\) and \(u_j\). Let \(T^*\) denote the resulting tree. We use \(T^*\) as the \(k^*\)-branch-subtree of H.

  • Finally, we append to the tree \(T^*\) rooted trees with height at most k as the \(k^*\)-fringe-trees of H. The resulting tree is a required rooted tree H.

Fig. 4
figure 4

An illustration of scheme graph \({\text {SG}}( d_\text {max}, k^*, {\text {bh}}^*, t^*)\) with \(d_{\text {max}}=3\), \(k^*=2\), \({\text {bh}}^*=2\), and \(t^*=5\), where the vertices in \(T_B\) (resp., in \(P_{t^*}\)) are depicted with black (resp., gray) circles: a A base-tree \(T_B\) and a link-path \(P_{t^*}\) are joined with directed edges between them; b A tree \(S_s\) rooted at a vertex \(u_s=u_{s,1}\in V_B\); c A tree \(T_t\) rooted at a vertex \(v_t=v_{t,1}\in V_P\)

In our MILP, we prepare a binary variable for each of the vertices and edges in \(T_B\) so that a subtree T of \(T_B\) can be selected as one of the combinations of these binary values.

To represent a replacement of an edge e with a path \(P_e\) in our MILP, we introduce a path \(P_{t^*}=(v_{1,1},v_{2,1},\ldots ,v_{t^*,1})\) of a sufficiently large length \(t^*-1\), and a set F of directed edges between the vertices in \(T_B\) and \(P_{t^*}\) as shown in Fig. 4a. We also introduce a binary variable for each of the vertices and edges in \(P_{t^*}\) and F in our MILP. When an edge \(e=u_i u_j\) is replaced with a path \(P_e\), we select an edge from \(u_i\) to a vertex \(v_{h,1}\) in \(P_{t^*}\) and an edge from a vertex \(v_{h+p,1}\) so that the edges \((u_i,v_{h,1})\) and \((v_{h+p,1},u_j)\) and the subpath \((v_{h,1},v_{h+1,1},\ldots , v_{h+p,1})\) of \(P_{t^*}\) form a path \(P_e\). Such a path \(P_e\) can be selected as one of the combinations of these binary values. To append rooted trees to tree \(T^*\), we prepare a rooted tree with a sufficiently large size at each vertex in \(T_B\) and \(P_{t^*}\) and introduce a binary variable for each of the vertices and edges in these rooted trees in our MILP. A rooted subtree from each of such rooted trees as a \(k^*\)-fringe-tree can be selected as one of the combinations of these binary values.

We call the graph that consists of all the above graphs \(T_B\), \(P_{t^*}\) and the edge set F and the set of rooted trees at the vertices in \(T_B\) and \(P_{t^*}\) a scheme graph \(\text {SG}( d_\text {max}, k^*, \text {bh}^*, t^*)\).

Fig. 5
figure 5

An illustration of selecting a subgraph H from the scheme graph \(\text {SG}( d_\text {max}, k^*, \text {bh}^*, t^*=n^*-{\text {bl}}^*-1)\): a An acyclic graph \(H\in {\mathcal {H}}(n^*, d_\text {max}, \text {dia}^*, k^*, \text {bh}^*, {\text {bl}}^*)\) with \(n^*=37\), \(d_\text {max}=3\), \(\text {dia}^*(H)=17\), \(k^*=2\), \(\text {bh}^*=2\) and \({\text {bl}}^*=3\), where the labels of some vertices indicate the corresponding vertices in the scheme graph \(\text {SG}( d_\text {max}, k^*, \text {bh}^*, t^*)\); b The \(k^*\)-branch-tree of H for \(k^*=2\); c An acyclic graph \(H'\) selected from \(\text {SG}( d_\text {max}, k^*, \text {bh}^*, t^*)\) as a graph that is isomorphic to H in (a)

Figure 5a illustrates an acyclic graph H with \(n(H)=37\), \(\text {dia}(H)=17\), \(\text {bh}_2(H)=2\) and \({\text {bl}}_2(H)=3\), where the maximum degree of a vertex is 3. Figure 5b illustrates the 2-branch-tree of the acyclic graph H in Fig. 5a. Figure 5c illustrates a subgraph \(H'\) of the scheme graph \(\text {SG}( d_\text {max}, k^*, \text {bh}^*, t^*=n^*-{\text {bl}}^*-1)\) such that \(H'\) is isomorphic to the acyclic graph H in Fig. 5a.

In this paper, we obtain the following result.

Theorem 2

Let \(\Lambda \) be a set of chemical elements, \(\Gamma \) be a set of adjacency-configurations, where \(|\Lambda |\le |\Gamma |\), and \(K = 2|\Lambda | + 2|\Gamma | + 50\). Given non-negative integers \(n^*\ge 3\), \(d_\text {max}\in \{3,4\}\), \(\text {dia}^*\ge 3\), \(k^*\ge 1\), \(\text {bh}^*\ge 1\) and \({\text {bl}}^*\ge 2\), there is an MILP \({{\mathcal {M}}}(x,g;{\mathcal {C}}_2)\) that consists of variable vectors \(x\in {\mathbb {R}}^K\) and \(g\in {\mathbb {R}}^q\) for an integer \(q=O( |\Gamma |\cdot [ (d_\text {max}-1)^{\text {bh}^*+k^*} +n^*\cdot (d_\text {max}-1)^{\max \{\text {bh}^*,k^*\}})])\) and a set \({\mathcal {C}}_2\) of constraints on x and g with size \(O(|\Gamma | + (d_\text {max}-1)^{\text {bh}^*+k^*} +n^*\cdot (d_\text {max}-1)^{\max \{\text {bh}^*, k^*\}}) )\) such that: \((x^*, g^*)\) is feasible to \({{\mathcal {M}}}(x,g; {\mathcal {C}}_2)\) if and only if \(g^*\) forms a chemical acyclic graph \(G=(H,\alpha ,\beta )\) such that \(H\in {\mathcal {H}}(n^*, d_\text {max}, \text {dia}^*, k^*, \text {bh}^*,{\text {bl}}^*)\) and \(f(G)=x^*\).

Note that our MILP requires only \(O(n^*)\) variables and constraints when the branch-parameter \(k^*\), the \(k^*\)-branch height and \(|\Gamma |\) are constant.

See Appendices B and C for the details of the MILP formulation and the set of all variables and constraints in the MILP formulation, respectively.

A new graph search algorithm

Previous methods of inferring chemical graphs [17,18,19] use a graph search algorithm based on the branch-and-bound algorithm proposed by Fujiwara et al. [5], where an enormous number of chemical graphs are constructed by repeatedly appending and removing a vertex one by one until a target chemical graph is constructed. Their algorithm cannot generate even one acyclic chemical graph when n(G) is larger than around 20.

This section introduces a new dynamic programming method for designing an algorithm in Stage 5. We consider the following aspects:

  1. (a)

    Treat acyclic graphs with a certain limited structure that frequently appears among chemical compounds registered in the chemical database; and

  2. (b)

    Instead of manipulating acyclic graphs directly, first compute the frequency vectors \(\pmb {f}(G')\) (sub-vectors of the feature vectors \(f(G')\), see Appendix D) of subtrees \(G'\) of all target acyclic graphs and then construct a limited number of target graphs G from the process of computing the vectors.

In (a), we choose a branch-parameter \(k^*=2\) and treat acyclic graphs G that have a small 2-branch number such as \({\text {bl}}_2(G)\in [2,3]\) and satisfy the size constraint (1) on 2-fringe-trees. Figure 6a, b illustrate chemical acyclic graphs G with \({\text {bl}}_2(G)=2\) and \({\text {bl}}_2(G)= 3\), respectively.

Fig. 6
figure 6

An illustration of chemical acyclic graphs G with diameter \(\text {dia}^*\) and \({\text {bl}}_2(G)=2,3\): a A chemical acyclic graph G with two leaf 2-branches \(v_1\) and \(v_2\); b A chemical acyclic graph G with three leaf 2-branches \(v_1, v_2\) and \(v_3\)

We design a method in (b) based on the mechanism of dynamic programming in the following way. Define a frequency vector \(\pmb {f}(T)\) of each chemical rooted tree T to be a vector that consists of the frequency of each chemical element \({\mathtt{a}}\in \Lambda \), each adjacency-configuration \({\mathtt{a}}\in \Lambda \), each bond-configuration \(\mu \in \text {Bc}\), and each degree \(\text {dg}i\in \text {Dg}\) in T. We are given a vector \(\pmb {x}^*\) that is the frequency vector \(\pmb {f}(G)\) of a chemical acyclic graph G to be inferred.

We first construct a set \(\text {FT}\) of chemical rooted trees with height at most \(k^*=2\) and compute the frequency vector \(\pmb {f}(T)\) of each chemical rooted tree \(T\in \text {FT}\) to obtain the set \(\text {W}(\text {FT})\) of frequency vectors \(\pmb {f}(T), T\in \text {FT}\). Note that a large number of chemical rooted trees \(T\in \text {FT}\) maps to the same frequency vector \(\pmb {w}\) and the size \(|\text {W}(\text {FT})|\) is considerably smaller than the size \(|\text {FT}|\).

We next combine two chemical rooted trees \(T_a,T_b\in \text {FT}\) to construct a chemical tree \(T_{a,b}\) by joining their roots \(r_a\) and \(r_b\) with an edge \(e=r_a r_b\) of a bond-multiplicity m, as illustrated in Fig. 6a. In fact, we compute only the feature vector \(\pmb {f}(T_{a,b})\) of such a tree \(T_{a,b}\) without directly treating the graph structures of \(T_a\), \(T_b\) and \(T_{a,b}\). For this, we add two frequency vectors \(\pmb {w}_a,\pmb {w}_b\in \text {W}(\text {FT})\) together with an additional term from the bond-multiplicity m to obtain the frequency vector \(\pmb {w}_{a,b}~(=\pmb {f}(T_{a,b}))\) of such a tree \(T_{a,b}\). Given such a vector \(\pmb {w}_{a,b}\), we can actually construct a chemical tree \(T_{a,b}\) with \(\pmb {f}(T_{a,b})=\pmb {w}_{a,b}\) by choosing trees \(T_a,T_b\in \text {FT}\) and combining them with an edge of bond-multiplicity m.

Our algorithm for generating a chemical acyclic graph G with \({\text {bl}}_2(G)=2\) continues to compute a set \(\text {W}^{(p)}\) of frequency vectors of chemical trees that can be obtained by combining p trees in \(\text {FT}\) for each \(p=2,3,\ldots , \lceil (\text {dia}^*-5)/2\rceil \). Finally, we find a vector pair \((\pmb {w}^1,\pmb {w}^2)\) with \(\pmb {w}^1\in \text {W}^{(\lfloor (\text {dia}^*-5)/2\rfloor )}\) and \(\pmb {w}^2\in \text {W}^{(\lceil (\text {dia}^*-5)/2\rceil )}\) such that a vector with \(\pmb {w}^1\), \(\pmb {w}^2\) and a bond-multiplicity m is equal to the given vector \(\pmb {x}^*\); i.e., a chemical acyclic graph G with \(\pmb {f}(G)=\pmb {x}^*\) is obtained by joining chemical trees \(T^1\) and \(T^2\) with \(\pmb {w}^i=\pmb {f}(T_i), i=1,2\) with an edge of bond-multiplicity m.

With a slight modification, the algorithm can generate a chemical acyclic graph G with \({\text {bl}}_2(G)=3\).

Appendix D presents the details of our new algorithms for generating acyclic graphs G with \({\text {bl}}_2(G)\in [2,3]\).

Experimental results

We implemented our method of Stages 1 to 5 for inferring chemical acyclic graphs and conducted experiments to evaluate the computational efficiency for three chemical properties \(\pi \): octanol/water partition coefficient (Kow), boiling point (Bp) and heat of combustion (Hc). We executed the experiments on a PC with Two Intel Xeon CPUs E5-1660 v3 @3.00GHz, 32 GB of RAM running under OS: Ubuntu 14.04.6 LTS. We show 2D drawings of some of the inferred chemical graphs, where ChemDoodle version 10.2.0 was used for constructing the drawings.

Table 4 Results of Stage 1 in Phase 1

Results on Phase 1. We implemented Stages 1, 2, and 3, in Phase 1 as follows.

Stage 1. We set a graph class \( {\mathcal {G}}\) to be the set of all chemical acyclic graphs, and set a branch-parameter \(k^*\) to be 2. For each property \(\pi \in \{\) Kow, Bp, Hc\(\}\), we first select a set \(\Lambda \) of chemical elements and then collected a data set \(D_{\pi }\) on chemical acyclic graphs over the set \(\Lambda \) of chemical elements provided by the Hazardous Substances Data Bank (HSDB) of PubChem. To construct the data set, we eliminated chemical compounds that have at most three carbon atoms or contain a charged element such as \(\mathtt{N}^+\) or an element \({\mathtt{a}}\in \Lambda \) whose valence is different from our setting of valence function \({\text {val}}\).

Table 4 shows the size and range of data sets that we prepared for each chemical property in Stage 1, where we denote the following:

  • \(\pi \): one of the chemical properties Kow, Bp and Hc;

  • \(\Lambda \): the set of selected chemical elements (hydrogen atoms are added at the final stage);

  • \(|D_{\pi }|\): the size of data set \(D_{\pi }\) over \(\Lambda \) for property \(\pi \);

  • \(|\Gamma |\): the number of different adjacency-configurations over the compounds in \(D_{\pi }\);

  • \([{\underline{n}},{\overline{n}}]\): the minimum and maximum number n(G) of non-hydrogen atoms over the compounds G in \(D_{\pi }\);

  • \([{\underline{{\text {bl}}}},\overline{{\text {bl}}}]\): the minimum and maximum numbers \({\text {bl}}_2(G)\) of leaf 2-branches over the compounds G in \(D_{\pi }\);

  • \([{\underline{\text {bh}}},\overline{\text {bh}}]\): the minimum and maximum values of the 2-branch height \(\text {bh}_2(G)\) over the compounds G in \(D_{\pi }\); and

  • \([{\underline{a}},{\overline{a}}]\): the minimum and maximum values of a(G) for \(\pi \) over compounds G in \(D_{\pi }\).

Stage 2. We used a feature function f that consists of the descriptors defined in “Descriptors” section.

Table 5 Results of Stages 2 and 3 in Phase 1

Stage 3. We used scikit-learn version 0.21.6 with Python 3.7.4 to construct ANNs \({{\mathcal {N}}}\) where the tool and activation function are set to be MLPRegressor and ReLU, respectively. We tested several different architectures of ANNs for each chemical property. To evaluate the performance of the resulting prediction function \(\psi _{{\mathcal {N}}}\) with cross-validation, we partition a given data set \(D_{\pi }\) into five subsets \(D_{\pi }^{(i)}\), \(i\in [1,5]\) randomly, where \(D_{\pi }\setminus D_{\pi }^{(i)}\) is used for a training set and \(D_{\pi }^{(i)}\) is used for a test set in five trials \(i\in [1,5]\). For a set \(\{y_1,y_2,\ldots ,y_N\}\) of observed values and a set \(\{\psi _1,\psi _2,\ldots ,\psi _N\}\) of predicted values, we define the coefficient of determination to be \(\text {R}^2\triangleq 1- \frac{\sum _{j\in [1,N]}(y_j-\psi _j)^2}{\sum _{j\in [1,N]}(y_j-{\overline{y}})^2}\), where \({\overline{y}}= \frac{1}{N}\sum _{j\in [1,N]}y_j\). Table 5 shows the results on Stages 2 and 3, where

  • K: the number of descriptors for the chemical compounds in data set \(D_{\pi }\) for property \(\pi \);

  • Activation: the choice of activation function;

  • Architecture: (ab, 1) consists of an input layer with a nodes, a hidden layer with b nodes and an output layer with a single node, where a is equal to the number K of descriptors;

  • L-time: the average time (in seconds) to construct ANNs for each trial;

  • test \(\text {R}^2\) (ave.): the average of coefficient of determination over the five tests; and

  • test \(\text {R}^2\) (best): the largest value of coefficient of determination over the five test sets.

From Table 5, we see that the execution of Stage 3 was successful, where the average of test \(\text {R}^2\) is over 0.9 for all three chemical properties.

For each chemical property \(\pi \), we selected the ANN \({{\mathcal {N}}}\) that attained the best test \(\text {R}^2\) score among the five ANNs to formulate an MILP \({{\mathcal {M}}}(x,y,z;{\mathcal {C}}_1)\) which will be used in Phase 2.

Results on Phase 2. We implemented Stages 4 and 5 in Phase 2 as follows.

Stage 4. In this step, we solve the MILP \({{\mathcal {M}}}(x,y,g;{\mathcal {C}}_1,{\mathcal {C}}_2)\) formulated based on the ANN \({{\mathcal {N}}}\) obtained in Phase 1. To solve an MILP in Stage 4, we use CPLEX version 12.10. In our experiment, we choose a target value \(y^* \in [{\underline{a}}, {\overline{a}}]\) and fix or bound some descriptors in our feature vector as follows:

  • Set the 2-leaf-branch number \({\text {bl}}^*\) to be each of 2 and 3;

  • Fix the instance size \(n^*=n(G)\) to be each integer in \(\{26,32,38,44,50\}\);

  • Set the diameter \(\text {dia}^*=\text {dia}(G)\) be one of the integers in \(\{ \lceil (2/5)n^*\rceil , \lceil (3/5)n^*\rceil \}\).

  • Set the maximum degree \(d_\text {max}:=3\) for \(\text {dia}^*=\lceil (2/5)n^*\rceil \) and \(d_\text {max}:=4\) for \(\text {dia}^*= \lceil (3/5)n^*\rceil \);

  • For each instance size \(n^*\), test a target value \(y^*_{\pi }\) for each chemical property \(\pi \in \{\) Kow, Bp, Hc\(\}\).

Based on the above setting, we generated six instances for each instance size \(n^*\). We set \(\varepsilon =0.02\) in Stage 4.

Tables 6, 7 (resp., Tables 8, 9) show the results on Stage 4 for \({\text {bl}}^*=2\) (resp., \({\text {bl}}^*=3\)), where we denote the following:

  • \(y^*_{\pi }\): a target value in \([{\underline{a}},{\overline{a}}]\) for a property \(\pi \);

  • \(n^*\): a specified number of vertices in \([{\underline{n}},{\overline{n}}]\);

  • \(\text {dia}^*\): a specified diameter in \(\{ \lceil (2/5)n^*\rceil , \lceil (3/5)n^*\rceil \}\);

  • IP-time: the time (sec.) to an MILP instance to find vectors \(x^*\) and \(g^*\).

We observe that most of the MILP instances with \({\text {bl}}^*=2\), \(n^*\le 50\) and \(\text {dia}^*\le 30\) (resp., \({\text {bl}}^*=3\), \(n^*\le 50\) and \(\text {dia}^*\le 30\)) are solved within one minute (resp., in a few minutes). The previously most efficient MILP formulation for inferring chemical acyclic graphs due to Zhang et al. [19] could solve instances with a relatively small diameter of \(\text {dia}^*=9\) for the case of \(d_\text {max}=4\) and \(n^*=20\) and \(\text {dia}^*=8\) for the case of \(d_\text {max}=3\) and \(n^*=50\). Our new MILP formulation on chemical acyclic graphs with bounded 2-branch height considerably improved the tractable size of chemical acyclic graphs in Stage 4 for the inference problem (II-a).

Fig. 7
figure 7

An illustration of chemical acyclic graphs G with \(n(G)=50\), \({\text {bl}}_2(G)=2\) and \(d_\text {max}=4\) obtained in Stage 4 by solving an MILP: a \(y^*_{\text {Kow}}=9\), \(\text {dia}(G)=\lceil (2/5)n^*\rceil =20\); b \(y^*_{\text {Bp}}=880\), \(\text {dia}(G)= n^*/2 =25\); c \(y^*_{\text {Hc}}=25000\), \(\text {dia}(G)=\lceil (3/5)n^*\rceil =30\)

Fig. 8
figure 8

An illustration of chemical acyclic graphs G with \(n(G)=50\), \({\text {bl}}_2(G)=3\) and \(d_\text {max}=4\) obtained in Stage 4 by solving an MILP: a \(y^*_{\text {Kow}}=9\), \(\text {dia}(G)=\lceil (2/5)n^*\rceil =20\); b \(y^*_{\text {Bp}}=880\), \(\text {dia}(G)= n^*/2 =25\); c \(y^*_{\text {Hc}}=25,000\), \(\text {dia}(G)=\lceil (3/5)n^*\rceil =30\)

Figure 7ac illustrate some chemical acyclic graphs G with \({\text {bl}}_2(G)=2\) obtained in Stage 4 by solving an MILP. Remember that these chemical graphs obey the AD \({\mathcal {D}}\) defined in Appendix A.

Figure 8ac illustrate some chemical acyclic graphs G with \({\text {bl}}_2(G)=3\) obtained in Stage 4 by solving an MILP.

Stage 5. In this stage, we execute our new graph search algorithms for generating target graphs \(G\in {\mathcal {G}}(\pmb {x}^*)\) with \({\text {bl}}_2(G)\in \{2,3\}\) for a given feature vector \(\pmb {x}^*\) obtained in Stage 4.

We introduce a time limit of 10 minutes for each iteration h in Step 2 and an execution of Steps 1 and 3 for \({\text {bl}}^*=2\) (resp., each iteration h in Steps 2 and 3 and \(\delta _1\) in Step 4 and an execution of Steps 1 and 5 for \({\text {bl}}^*=3\)). In the last step, we choose at most 100 feasible vector pairs and generate a target graph from each of these feasible vector pairs. We also impose an upper bound \(\text {UB}\) on the size \(|\text {W}|\) of a vector set \(\text {W}\) that we maintain during an execution of the algorithm. We executed the algorithm for each of the three bounds \(\text {UB}=10^6, 10^7, 10^8\) until a feasible vector pair is found or the running time exceeds a global time limitation of two hours.

When no feasible vector pair is found by the graph search algorithms, we output the target graph \(G^*\) constructed from the vector \(g^*\) in Stage 4.

Tables 6, 7 (resp., Tables 8, 9) show the results of Stage 5 for \({\text {bl}}^*=2\) (resp., \({\text {bl}}^*=3\)), where we denote the following:

  • \(\#\)FP: the number of feasible vector pairs obtained by an execution of the graph search algorithm for a given feature vector \(\pmb {x}^*\);

  • G-LB: a lower bound on the number of all target graphs \(G\in {\mathcal {G}}(\pmb {x}^*)\) for a given feature vector \(\pmb {x}^*\);

  • \(\#\)G: the number of all (or up to 100) chemical acyclic graphs G such that \(f(G)=x^*\) (where at least one such graph G has been found from the vector \(g^*\) in Stage 4);

  • G-time: the running time (sec.) to execute Stage 5 for a given feature vector \(\pmb {x}^*\), where “> 2 hours” means that the running time exceeds two hours.

Previously, an instance of chemical acyclic graphs with size \(n^*\) up to 16 was solved in Stage 5 by Azam et al. [17]. For the classes of chemical graphs with cycle index 1 and 2, the maximum size of instances solved in Stage 5 by Ito et al. [17] and Zhu et al. [21] was around 18 and 15, respectively. Our new algorithm based on dynamic programming solves instances with \(n^*=50\). In our experiments, we also computed a lower bound G-LB on the number of target graphs. We observe that there are over \(10^{10}\) or \(10^{14}\) target graphs in some cases. Remember that these lower bounds are computed without actually generating each target graph one by one. So when a lower bound is enormously large, this would suggest that we may need to impose some more constraints on the structure of graphs or the range of descriptors to narrow a family of target graphs to be inferred.

Table 6 Results of Stages 4 and 5 for \({\text {bl}}^*= 2\), \(d_\text {max}=3\) and \(\text {dia}^* = \lceil \frac{2}{5}n^* \rceil \)
Table 7 Results of Stages 4 and 5 for \({\text {bl}}^*= 2\), \(d_\text {max}=4\) and \(\text {dia}^* = \lceil \frac{3}{5}n^* \rceil \)
Table 8 Results of Stages 4 and 5 for \({\text {bl}}^*= 3\), \(d_\text {max}=3\) and \(\text {dia}^* = \lceil \frac{2}{5}n^* \rceil \)
Table 9 Results of Stages 4 and 5 for \({\text {bl}}^*= 3\), \(d_\text {max}=4\) and \(\text {dia}^* = \lceil \frac{3}{5}n^* \rceil \)

An additional experiment We also conducted some additional experiment to demonstrate that our MILP-based method is flexible to control conditions on inference of chemical graphs. In Stage 3, we constructed an ANN \({{\mathcal {N}}}_{\pi }\) for each of the three chemical properties \(\pi \in \{\) Kow, Bp, Hc\(\}\), and formulated the inverse problem of each ANN \({{\mathcal {N}}}_{\pi }\) as an MILP \({{\mathcal {M}}}_{\pi }\). Since the set of descriptors is common to all three properties Kow, Bp and Hc, it is possible to infer a chemical acyclic graph G that satisfies a target value \(y^*_{\pi }\) for each of the three properties at the same time (if one exists). We specify the size of graph so that \(n^* =50\), \({\text {bl}}^* =2\), \(\text {dia}^* = 25\) and \(d_\text {max}=4\), and set target values with \(y^*_{\text {Kow}} =4.0\), \(y^*_{\text {Bp}} =400.0\) and \(y^*_{\text {Hc}} =13000.0\) in an MILP that consists of the three MILP \({{\mathcal {M}}}_{\text {Kow}}\), \({{\mathcal {M}}}_{\text {Hc}}\) and \({{\mathcal {M}}}_{\text {Bp}}\). The MILP was solved in 18930 seconds and we obtained a chemical acyclic graph G illustrated in Fig. 9. We continued to execute Stage 5 for this instance to generate more target graphs \(G^*\). Table 10 shows that 100 target graphs are generated by our new dynamic programming algorithm.

Fig. 9
figure 9

An illustration of a chemical acyclic graph G inferred for three chemical properties Kow, Bp and Hc simultaneously, where \(y^*_{\text {Kow}} =4.0\), \(y^*_{\text {Bp}} =400.0\) and \(y^*_{\text {Hc}} =13000.0\), \(n^* =50\), \({\text {bl}}^* =2\), \(\text {dia}^* = 25\), and \(d_\text {max}=4\)

Table 10 Results of Stages 4 and 5 for \({\text {bl}}^*= 2\), \(d_\text {max}=4\), \(n^* = 50\) and \(\text {dia}^* = 25\)

Concluding remarks

In this paper, we introduced a new measure, branch-height of a tree, and showed that many chemical compounds in the chemical database have a simple structure where the number of 2-branches is small. Based on this, we proposed a new method of applying the framework for inverse QSAR/QSPR [17,18,19] to the case of acyclic chemical graphs where Azam et al. [17] inferred chemical graphs with around 20 non-hydrogen atoms and Zhang et al. [19] solved an MILP of inferring a feature vector for an instance with diameter 9. In our method, we formulated a new MILP in Stage 4 specialized for acyclic chemical graphs with a small branch number and designed a new graph search algorithm in Stage 5 that computes frequency vectors of graphs in a dynamic programming scheme.

We implemented our new method and conducted some experiments on chemical properties such as octanol/water partition coefficient, boiling point and heat of combustion.

The resulting method improved the performance so that chemical graphs with around 50 non-hydrogen atoms and around diameter 30 can be inferred. Since there are many acyclic chemical compounds having large diameters, this is a significant improvement.

It is left as a future work to design MILPs and graph search algorithms based on the new idea of the paper for classes of graphs with a higher rank. Recently, a method for inferring a chemical cyclic graph with any rank has been designed by Akutsu and Nagamochi [27] based on the ideas in this paper. The method is also designed so that a target chemical graph to be inferred can be specified in a more flexible way, where we can include a prescribed substructure of graphs such as a benzene ring into a target chemical graph while imposing constraints on a global topological structure of a target graph at the same time.