Abstract
In de novo drug design, chemical compounds are quantized as real-valued vectors called chemical descriptors, and an optimization algorithm runs on known drug-like chemical compounds in a database and outputs an optimal chemical descriptor. Since structural information is needed for chemical synthesis, we must infer chemical graphs from the obtained descriptor. This is formalized as the problem of inferring a graph from a real-valued vector. By generalizing subword histories, which were originally introduced in formal language theory to extract numerical information about words and languages by counting, we propose a comprehensive framework for investigating the computational complexity of chemical graph inference. We also propose a (pseudo-)polynomial-time algorithm for inferring graphs in a class of practical importance from spectra.
Abbreviations
- \(\Upsigma\): Alphabet
- \({\overrightarrow{{\mathcal{G}}}}\): The class of (\(\Upsigma\)-labeled, loopless, (weakly-)connected) directed multigraphs
- \({{\mathcal{G}}}\): The class of (\(\Upsigma\)-labeled, loopless, connected) undirected multigraphs
- d(v): The degree of a vertex v
- T: A tree
- h(T): The height of tree T
- ∂T K: The frontier vector of T of level K
- tw(G): The tree-width of G
- \(\Upupsilon\): The class of trees
- \(\Upupsilon_h\): The class of trees of height at most h
- \({{\mathcal{SPG}}}\): The class of series-parallel graphs
- \({{\mathcal{PLG}}}\): The class of planar graphs
- \(\mathcal{TW}(w)\): The class of graphs of tree-width at most w
- \({{\overrightarrow{\mathcal{SSG}}}}\): The class of scattered subword graphs
- \({{\overrightarrow{\mathcal{CSG}}}}\): The class of continuous subword graphs
- WH: A walk history
- \({{\mathcal{SWH}}}\): The class of systems of walk histories
- \({{\mathcal{SLWH}}}\): The class of systems of linear walk histories
- \({{\mathcal{COUNT}}}\): The class of counting systems
- \(\mathcal{WH}\): The class of systems of single walk history
- \(\mathcal{LWH}\): The class of systems of single linear walk history
- A–F algorithm: Akutsu–Fukagawa algorithm
References
Akutsu T, Fukagawa D (2005) Inferring a graph from path frequency. In: Apostolico A, Crochemore M, Park K (eds) CPM 2005. Lecture notes in computer science, vol 3537. Springer, New York, pp 371–382
Bakir GH, Weston J, Schölkopf B (2004a) Learning to find pre-images. In: Advances in neural information processing systems, pp 449–456
Bakir GH, Zien A, Tsuda K (2004b) Learning to find graph pre-images. In: Proceedings of the 26th DAGM symposium. Lecture notes in computer science, vol 3175, Springer, New York, pp 253–261
Bodlaender H (1998) A partial k-arboretum of graphs with bounded treewidth. Theor Comput Sci 209(1–2): 1–45
Diestel R (2010) Graph theory, 4th edn. Springer, New York
Fraigniaud P, Nisse N (2006) Connected treewidth and connected graph searching. In: LATIN 2006. Lecture notes in computer science, vol 3887, Springer, New York, pp 479–490
Fujiwara H et al. (2008) Enumerating treelike chemical graphs with given path frequency. J Chem Inf Model 48:1345–1357
Garey MR, Johnson DS (1979) Computers and intractability: a guide to the theory of NP-completeness. W. H. Freeman and Co, New York
Goto S et al. (2002) LIGAND: Database of chemical compounds and reactions in biological pathways. Nucleic Acids Res 30:402–404
Ibarra OH (1978) Reversal-bounded multicounter machines and their decision problems. J ACM 25:116–133
Leslie C, Eskin E, Noble WS (2002) The spectrum kernel: a string kernel for SVM protein classification. In: Proceedings of the 7th Pacific symposium on biocomputing. pp 564–575
Mateescu A, Salomaa A, Yu S (2004) Subword histories and Parikh matrices. J Comput Syst Sci 68:1–21
Matiyasevich Y (1970) Solution of the tenth problem of Hilbert. Matematikai Lapok 21:83–87
Matiyasevich Y (1993) Hilbert’s tenth problem. MIT Press, Cambridge
Nagamochi H (2009) A detachment algorithm for inferring a graph from path frequency. Algorithmica 53:207–224
Parikh RJ (1966) On context-free languages. J Assoc Comput Mach 13:570–581
Robertson N, Seymour PD (1986) Graph minors. II. Algorithmic aspects of tree-width. J Algor 7:309–322
Rozenberg G, Salomaa A (eds) (1997) Handbook of formal languages, vol 1. Springer, New York
Seki S (2011) Absoluteness of subword inequality is undecidable. Theor Comput Sci 418:116–120
Shannon CE, Weaver W (1949) The mathematical theory of communication. The University of Illinois Press, Urbana
Yamaguchi A, Aoki KF, Mamitsuka H (2003) Graph complexity of chemical compounds in biological pathways. Genome Inform 14:376–377
Acknowledgements
We wish to express our gratitude to the anonymous referees for carefully and thoroughly reviewing the earlier version of this manuscript and for their valuable comments and suggestions. Shinnosuke Seki expresses his sincere gratitude to Professor Mark Daley, Professor Oscar H. Ibarra, Professor Helmut Jürgensen, Professor Lila Kari, and Professor Arto Salomaa for the creative discussions with them on the research topic of this paper. This research was carried out with the financial support of the JSPS Postdoctoral Fellowship P10827 to Szilárd Zsolt Fazekas, of the Funding Program for Next Generation World-Leading Researchers (NEXT program) to Yasushi Okuno, and of the Kyoto University Start-up Grant-in-Aid for Young Scientists, No. 021530, to Shinnosuke Seki. The work of Shinnosuke Seki was also financially supported by the Department of Information and Computer Science, Aalto University.
Appendix: Proof of Theorem 3
We first establish two results that are useful in constructing a pseudo-polynomial-time transformation from 3-PARTITION to \(\textsc{Solvability}(\boldsymbol{S}_2, \mathcal{PLG})\). The graphs considered from now on are assumed to be undirected. A vertex \(v_1\) of a graph is singly connected with another vertex \(v_2\) if there exists exactly one edge between them.
Lemma 12
For an undirected graph \(G\), if \(G\) contains exactly one \(a\)-vertex and \(|G|_{ab} = |G|_{aba} = n\) for some \(n\), then there exist \(n\) \(b\)-vertices that are singly connected with the \(a\)-vertex.
Proof
Let \(v\) be the \(a\)-vertex of \(G\). Each \(ab\)-walk contributes 1 to \(|G|_{aba}\): starting from the \(a\)-vertex, we arrive at the \(b\)-vertex via the edge on this walk and return via the same edge. Hence, \(|G|_{ab} \le |G|_{aba}\) holds.
Suppose that a \(b\)-vertex \(v_1\) is connected with \(v\) by two edges \(e_1, e_2\). Then, apart from the \(aba\)-walks explained above, we have two extra \(aba\)-walks, namely \(v e_1 v_1 e_2 v\) and \(v e_2 v_1 e_1 v\). Then \(G\) would contain at least \(n+2\) \(aba\)-walks, a contradiction.
Since \(G\) contains \(n\) \(ab\)-walks, exactly \(n\) \(b\)-vertices are singly connected with the \(a\)-vertex. \(\square\)
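The counting argument in this proof can be checked on small examples. The following is a minimal sketch, not part of the original paper: the function name `count_walks` and the edge-list representation are assumptions made for illustration, and a walk is taken to be an alternating sequence of vertices and edges whose vertex labels spell the given word, in a loopless undirected multigraph.

```python
from collections import Counter


def count_walks(labels, edges, word):
    """Count walks whose vertex labels spell `word` in an undirected multigraph.

    labels: dict mapping each vertex to its label.
    edges:  list of (u, v) pairs; a repeated pair is a parallel edge.
    The graph is assumed to be loopless.
    """
    # mult[(u, v)] = number of parallel edges between u and v (symmetric).
    mult = Counter()
    for u, v in edges:
        mult[(u, v)] += 1
        mult[(v, u)] += 1
    # ways[v] = number of walks spelling the prefix read so far and ending at v.
    ways = {v: 1 for v, lab in labels.items() if lab == word[0]}
    for letter in word[1:]:
        nxt = Counter()
        for (u, v), k in mult.items():
            if u in ways and labels[v] == letter:
                nxt[v] += ways[u] * k
        ways = nxt
    return sum(ways.values())


labels = {0: "a", 1: "b", 2: "b", 3: "b"}
star = [(0, 1), (0, 2), (0, 3)]        # three b-vertices, all singly connected
doubled = [(0, 1), (0, 1), (0, 2)]     # vertex 1 is attached by two edges
print(count_walks(labels, star, "ab"), count_walks(labels, star, "aba"))
print(count_walks(labels, doubled, "ab"), count_walks(labels, doubled, "aba"))
```

With single edges the two counts agree (3 and 3), whereas the doubled edge yields 3 \(ab\)-walks but 5 \(aba\)-walks, matching the \(n+2\) bound used in the proof.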
Lemma 13
For an undirected graph \(G\), if \(G\) contains \(n\) \(a\)-vertices and \(n\) \(b\)-vertices and \(|G|_{ab} = |G|_{aba} = |G|_{bab} = n\) for some \(n\), then there exist \(n\) pairwise distinct pairs of an \(a\)-vertex and a \(b\)-vertex that are singly connected.
Our proof of Theorem 3 borrows several basic terms from topology. Let \(G\) be a planar multigraph, that is, \(G\) can be embedded in the plane \(\mathbb{R}^2\). The regions of \(\mathbb{R}^2 \setminus G\) are called the faces of \(G\). Since we can place \(G\) inside some sufficiently large disc \(D\), exactly one of its faces is unbounded, namely the face that contains \(\mathbb{R}^2 \setminus D\). This face is called the outer face of \(G\), and the others are called its inner faces.
Now, we are ready to prove Theorem 3.
Proof
The basic idea is from (Akutsu and Fukagawa 2005): a pseudo-polynomial-time transformation from 3-PARTITION, which is defined as follows. Given a set \(X\) that consists of \(3m\) elements \(x_1, \ldots, x_{3m}\) along with their integer weights \(w(x_i)\) and a positive integer \(B\) such that \(B/4 < w(x_i) < B/2\) for \(1 \le i \le 3m\), find a partition of \(X\) into \(m\) (disjoint) sets \(A_1, \ldots, A_m\) of cardinality 3 such that \(A_j = \{x_{j,1}, x_{j,2}, x_{j,3}\}\) and \(w(x_{j,1}) + w(x_{j,2}) + w(x_{j,3}) = B\) for \(1 \le j \le m\), where \(x_{j,1}, x_{j,2}, x_{j,3} \in X\).
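To make the source problem concrete, the following is a minimal brute-force sketch of 3-PARTITION. It is an illustration only and plays no role in the proof: the function name `three_partition` is hypothetical, the search is exponential in the worst case, and the bounds \(B/4 < w(x_i) < B/2\) are assumed to hold for the input rather than checked.

```python
from itertools import combinations


def three_partition(weights, B):
    """Return a list of index triples, each with weights summing to B, or None."""

    def solve(remaining):
        if not remaining:
            return []
        first, rest = remaining[0], remaining[1:]
        # The smallest remaining index must lie in some triple; try all partners.
        for j, k in combinations(rest, 2):
            if weights[first] + weights[j] + weights[k] == B:
                left = [i for i in rest if i not in (j, k)]
                sub = solve(left)
                if sub is not None:
                    return [(first, j, k)] + sub
        return None

    # A solution needs 3m elements whose total weight is exactly m * B.
    if len(weights) % 3 != 0 or sum(weights) != B * (len(weights) // 3):
        return None
    return solve(list(range(len(weights))))


print(three_partition([6, 7, 7, 6, 6, 8], 20))
print(three_partition([6, 6, 6, 6, 7, 9], 20))
```

The first instance has the solution `[(0, 1, 2), (3, 4, 5)]`; the second has none, because no two of the remaining weights sum to 11 alongside the weight 9.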
Let \(\Upsigma = X \cup \{a_1, \ldots, a_m\} \cup \{a, b, c, d, f_1, f_2\}\). From a given instance of 3-PARTITION, we construct a feature vector \(v\) of level 2 specified as follows; we write \(x_i = 1\) to indicate that the \(x_i\)-coordinate of \(v\) has value 1. For any \(u\), if the value of the \(u\)-coordinate of \(v\) is not mentioned below, then it is 0, that is, no \(u\)-walk is found in the target graph. For \(1 \le i \le 3m\) and \(1 \le h \le m\),
- VERTICES: \(x_i = 1\), \(a = Bm\), \(b = 3m\), \(c = 3m + 1\), \(d = 1\), \(f_1 = f_2 = 3m\), and \(a_h = 1\);
- WALKS-C(enter): \(d a_h = 1\), \(a_h b = a_h b a_h = 3\), and \(a_1 a_2 = a_2 a_3 = \cdots = a_{m-1} a_m = 1\);
- WALKS-B(lock): for \(s \in \{1, 2\}\), \(b f_s = b f_s b = f_s b f_s = 3m\), \(x_i f_s = 1\), \(ba = bab = Bm\), \(x_i a = x_i a x_i = w(x_i)\);
- WALKS-BC: \(x_i d = 1\), \(f_1 c f_2 = 3m - 1\), \(a_h c = a_h c a_h = 3\), \(f_1 c a_1 = 3\), \(f_2 c a_1 = 2\), \(f_1 c a_\ell = 3\), \(f_2 c a_\ell = 3\) for \(1 < \ell < m\), \(f_1 c a_m = 3\), \(f_2 c a_m = 4\), and \(a_h b a = B\);
- WALKS-I(nhibited): for any \(1 \le j, k \le m\) with \(j \ne k\), \(x_j f_1 x_k = x_j f_2 x_k = a_j b a_k = a_j c a_k = 0\).
For example, \(x_i = 1\) in VERTICES means that a target graph must contain exactly one \(x_i\)-vertex.
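The specification above can be transcribed mechanically. The following is a minimal sketch under stated assumptions: the function name `feature_vector`, the label strings such as `"x1"` and `"f2"`, and the dictionary representation (tuples of labels as keys, absent keys meaning 0) are all choices made here for illustration, and \(m \ge 2\) is assumed so that \(a_1\) and \(a_m\) are distinct.

```python
def feature_vector(weights, B):
    """Transcribe the listed nonzero coordinates of v as a dict word -> count.

    Words are tuples of labels; coordinates absent from the dict are 0.
    Assumes len(weights) == 3*m with m >= 2 (for m == 1, a_1 and a_m coincide).
    """
    m, rem = divmod(len(weights), 3)
    if rem or m < 2:
        raise ValueError("need 3m weights with m >= 2")
    xs = [f"x{i}" for i in range(1, 3 * m + 1)]
    a_hs = [f"a{h}" for h in range(1, m + 1)]
    v = {}
    # VERTICES
    v.update({(x,): 1 for x in xs})
    v.update({(ah,): 1 for ah in a_hs})
    v.update({("a",): B * m, ("b",): 3 * m, ("c",): 3 * m + 1,
              ("d",): 1, ("f1",): 3 * m, ("f2",): 3 * m})
    # WALKS-C
    for ah in a_hs:
        v[("d", ah)] = 1
        v[(ah, "b")] = v[(ah, "b", ah)] = 3
    for left, right in zip(a_hs, a_hs[1:]):
        v[(left, right)] = 1
    # WALKS-B
    for fs in ("f1", "f2"):
        v[("b", fs)] = v[("b", fs, "b")] = v[(fs, "b", fs)] = 3 * m
        for x in xs:
            v[(x, fs)] = 1
    v[("b", "a")] = v[("b", "a", "b")] = B * m
    for x, w in zip(xs, weights):
        v[(x, "a")] = v[(x, "a", x)] = w
    # WALKS-BC
    for x in xs:
        v[(x, "d")] = 1
    v[("f1", "c", "f2")] = 3 * m - 1
    for ah in a_hs:
        v[(ah, "c")] = v[(ah, "c", ah)] = 3
        v[("f1", "c", ah)] = 3
        v[("f2", "c", ah)] = 3   # overridden below for a_1 and a_m
        v[(ah, "b", "a")] = B
    v[("f2", "c", a_hs[0])] = 2
    v[("f2", "c", a_hs[-1])] = 4
    # WALKS-I: inhibited walks are simply left out of the dict (value 0).
    return v


v = feature_vector([6, 7, 7, 6, 6, 8], 20)
print(v[("a",)], v[("x3", "a", "x3")], v[("f2", "c", "a1")], v[("f2", "c", "a2")])
```

For this instance with \(m = 2\) and \(B = 20\), the printed coordinates are 40, 7, 2, and 4. Note that the \(m\) coordinates \(a_h b a = B\) sum to \(Bm\), the prescribed number of \(a\)-vertices.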
Let us give a topological characterization of graphs \(G \in \mathcal{PLG}\) that satisfy \(|G|_{\boldsymbol{S}_2} = v\). Indeed, we shall see that \(v\) uniquely determines a structure that consists of the center graph (an \(m\)-star centered at the \(d\)-vertex with the \(a_h\)-vertices (\(1 \le h \le m\)) as leaves, to each of which 3 \(b\)-vertices are singly connected) and \(3m\) rhombuses, each bounded by the cycle \(b f_1 x_i f_2 b\) (the \(x_i\)-rhombus), within which exactly \(w(x_i)\) \(a\)-vertices are forced to be fenced and singly connected with the \(b\)-vertex on the rhombus (see Fig. 9). Once confirmed, this structure and its uniqueness enable us to conclude that the given instance of 3-PARTITION has a solution if and only if there exists a planar graph whose feature vector of level 2 is \(v\); this is because \(a_h b a = B\) must be satisfied for \(1 \le h \le m\). Note that our construction of the system of inequalities is a pseudo-polynomial-time transformation.
First of all, VERTICES, WALKS-C, and WALKS-I determine the center graph. By Lemma 12, \(a_h b = a_h b a_h = 3\) forces exactly 3 \(b\)-vertices to be singly connected with each \(a_h\)-vertex, and \(a_j b a_k = 0\) in WALKS-I inhibits a \(b\)-vertex from being connected with more than one of the \(a_1, \ldots, a_m\)-vertices. For \(1 \le \ell < m\), the \(a_\ell\)-vertex has to be connected with the \(a_{\ell+1}\)-vertex in order to satisfy \(a_\ell a_{\ell+1} = 1\).
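As a sanity check on this paragraph, one can build the center graph for a small \(m\) and confirm that it realizes the WALKS-C coordinates. This is a minimal sketch, not the authors' code: the helper names `walks_spelling` and `center_graph`, the vertex names, and the adjacency-list representation are assumptions made for illustration.

```python
def walks_spelling(adj, labels, word):
    """Count walks whose vertex labels spell `word` (a tuple of labels).

    adj[u] lists the neighbours of u, with one entry per (parallel) edge.
    """
    def extend(u, rest):
        if not rest:
            return 1
        return sum(extend(w, rest[1:]) for w in adj[u] if labels[w] == rest[0])

    return sum(extend(u, word[1:]) for u in labels if labels[u] == word[0])


def center_graph(m):
    """Build the center graph: d joined to a_1..a_m, the path a_1-...-a_m,
    and three b-vertices attached to each a_h by single edges."""
    labels, adj = {"d": "d"}, {"d": []}

    def add_edge(u, v):
        adj[u].append(v)
        adj[v].append(u)

    for h in range(1, m + 1):
        ah = f"a{h}"
        labels[ah], adj[ah] = ah, []
        add_edge("d", ah)
        if h > 1:
            add_edge(f"a{h - 1}", ah)
        for k in range(3):
            b = f"b{h}_{k}"
            labels[b], adj[b] = "b", []
            add_edge(ah, b)
    return adj, labels


adj, labels = center_graph(3)
for word in [("d", "a2"), ("a2", "b"), ("a2", "b", "a2"), ("a1", "a2"), ("a1", "b", "a2")]:
    print(word, walks_spelling(adj, labels, word))
```

For \(m = 3\) the counts are 1, 3, 3, 1, and 0 respectively, in agreement with \(d a_h = 1\), \(a_h b = a_h b a_h = 3\), \(a_1 a_2 = 1\), and the inhibited \(a_j b a_k\)-walks.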
We now shift our focus to the \(x_i\)-rhombuses and the \(w(x_i)\) \(a\)-vertices. Arguing as above but using Lemma 13, one can easily see that exactly one of the \(3m\) \(f_1\)-vertices (respectively, \(f_2\)-vertices) must be singly connected to each of the \(3m\) \(b\)-vertices of the center graph in a one-to-one manner. To each of these \(f_1\)-vertices, a distinct \(x_i\)-vertex must be singly connected, since we need exactly one \(x_i f_1\)-walk and the \(x_j f_1 x_k\)-walk is inhibited whenever \(j \ne k\). This fact allows us to index the \(f_1\)-vertex and the \(b\)-vertex on the walk from the \(x_i\)-vertex to the \(d\)-vertex by the subscript \(i\) as \(f_{1,i}\) and \(b_i\), and the \(f_2\)-vertex that is connected to the \(b_i\)-vertex is accordingly indexed as \(f_{2,i}\); this indexing is only for ease of explanation. In the following, we denote the three of the \(x_1, \ldots, x_{3m}\)-vertices that have been thus connected with the \(a_1\)-vertex by \(x_{1,1}\), \(x_{1,2}\), and \(x_{1,3}\) for convenience (see Fig. 9), but note that we do not know, nor should we, which of \(x_1, \ldots, x_{3m}\) is \(x_{1,1}\). The extended center graph built so far is still a tree, and hence has only one face. Now we draw edges from each of these \(x_i\)-vertices to the \(d\)-vertex, but due to the \(a_1 a_2 \cdots a_m\)-walk, these edges have no choice but to pass between the \(a_1\)-vertex and the \(a_m\)-vertex, as shown in Fig. 9. These edges separate the face into \(3m + 1\) faces, that is, the face bounded by the \(d a_1 b f_1 x_{1,1}\)-walk, one bounded by the \(d x_{1,1} f_1 b a_1 b f_1 x_{1,2} d\)-walk, and so on. Since the \(x_j f_2 x_k\)-walk is inhibited whenever \(j \ne k\), each \(x_i\)-vertex must be singly connected with a distinct \(f_2\)-vertex. The edges from \(x_i\) to \(d\) now force the \(x_i\)-vertex to be connected with the \(f_{2,i}\)-vertex. As a result, the \(x_i\)-rhombus has been formed.
Now we fence in \(w(x_i)\) \(a\)-vertices in the \(x_i\)-rhombus. To this end, we connect the \(3m\) \(f_1\)-vertices and \(f_2\)-vertices via \(3m - 1\) \(f_1 c f_2\)-walks. The reader should by now be familiar enough with the technique based on Lemma 12 and WALKS-I to check that exactly 3 \(c\)-vertices are singly connected with each \(a_h\)-vertex, and that these are distinct. This means that the \(c\)-vertex on any of these \(f_1 c f_2\)-walks must be connected with the \(a_h\)-vertex for some \(h\), and hence, no two of these walks can share their \(f_1\)-vertex and \(f_2\)-vertex. It is left to the reader to check that the way illustrated in Fig. 9 is the only way to draw \(3m - 1\) \(f_1 c f_2\)-walks so as to satisfy all of these requirements and \(f_1 c a_1 = 3\), \(f_2 c a_1 = 2\), \(f_1 c a_\ell = 3\), \(f_2 c a_\ell = 3\) for \(1 < \ell < m\), \(f_1 c a_m = 3\), \(f_2 c a_m = 4\). These newly added structures prevent an \(a\)-vertex from being connected both with a \(b\)-vertex and with an \(x_i\)-vertex unless it is placed in the \(x_i\)-rhombus. One can check that each \(a\)-vertex must be singly connected with exactly one \(b\)-vertex, and that \(w(x_i)\) \(a\)-vertices must be singly connected with the \(x_i\)-vertex. Thus, the \(x_i\)-rhombus must contain exactly \(w(x_i)\) \(a\)-vertices, and they have to be singly connected with the \(b_i\)-vertex. \(\square\)
Cite this article
Fazekas, S.Z., Ito, H., Okuno, Y. et al. On computational complexity of graph inference from counting. Nat Comput 12, 589–603 (2013). https://doi.org/10.1007/s11047-012-9349-2
DOI: https://doi.org/10.1007/s11047-012-9349-2