Outline
The objective of the backward prediction is to create a chemical structure S whose p properties \(\mathbf{Y} = (Y_1, \ldots , Y_p)^{\mathrm {T}} \in \mathbb {R}^p\) lie in a desired region U. Bayesian molecular design relies on Bayes’ law, sometimes called the inverse law of conditional probability,
$$\begin{aligned} p(S|\mathbf{Y} \in U) \propto p(\mathbf{Y} \in U|S)p(S). \end{aligned}$$
(1)
This law states that the posterior distribution \(p(S|\mathbf{Y} \in U)\) is proportional to the product of the likelihood \(p(\mathbf{Y} \in U| S)\) and the prior p(S). By exploring high-probability regions of the posterior, we aim to identify promising hypothetical structures S whose properties fall in the desired region U.
Along with Eq. (1), three internal steps linking the forward and backward analyses are outlined (see also Fig. 1):
-
Forward prediction: A set of QSPR models for the p properties is trained on structure–property relationship data. This defines the forward prediction model \(p(\mathbf{Y}|S) = \prod _{j=1}^p p(Y_j| S)\), which underlies the likelihood on the right-hand side of Eq. (1).
-
Prior: The prior distribution p(S) serves as a regularizer that assigns low probability mass to chemically unfavorable structures in the posterior distribution.
-
Backward prediction: Bayes’ law inverts the forward model \(p(\mathbf{Y}|S)\) into the backward model \(p(S|\mathbf{Y} \in U)\), in which the desired property region U is specified in the conditioning. A Monte Carlo calculation is conducted to generate a random sample of molecules \(\{S^{r} | r = 1, \ldots , R\}\) of size R from the posterior distribution.
In this study, a chemical structure is described by a SMILES string. As will be detailed, a chemical language model defines the conditional distribution \(S' \sim p(S'|S)\) from which the current structure S is randomly modified into a new structure \(S'\). By learning the SMILES language from tens of thousands of existing compounds, the structural patterns of real molecules are compressed into a probabilistic language model. In combination with SMC, the trained model, which has acquired an implicit notion of ‘chemically unfavorable structures’, is used to modify SMILES strings toward a given U while suppressing the emergence of structures unlikely to occur in reality. Furthermore, the trained language model serves as the prior in Eq. (1).
Forward prediction
A structure–property data set \(\mathcal {D}_j = \{Y_{ij}, S_i: i = 1, \ldots , N\}\) on property j is given, where \(Y_{ij} \in \mathbb {R}\) and \(S_i\) constitute the ith sample. With the N observations, a QSPR model is trained by linear regression, \(Y_j = \mathbf{w}_j^{\mathrm {T}} \varvec{\psi }_j(S) + \epsilon\), with a d-dimensional fingerprint descriptor \(\varvec{\psi }_j(S) \in \{0,1\}^d\). To simplify the notation, the property index j is temporarily omitted. The noise \(\epsilon\) is independently and identically distributed according to the normal distribution \(\mathrm {N}(\epsilon |0, \sigma ^2)\). The unknown parameters consist of the coefficient vector \(\mathbf{w} \in \mathbb {R}^d\) and the noise variance \(\sigma ^2 \in \mathbb {R}_+\). Placing the normal prior \(\mathbf{w} \sim \mathrm {N}(\mathbf{w}|\mathbf{0}, \sigma^2 \mathbf{V})\) and the inverse-gamma prior \(\sigma ^2 \sim \mathrm {IG}(\sigma ^2|a, b)\) on the unknowns, we derive the predictive distribution of the property Y for an arbitrary input S:
$$\begin{aligned} &p(Y|S, \mathcal {D})= \mathrm {T}_{2 a_*}\Bigg (Y \Big | \mathbf{w}_*^{\mathrm {T}} \varvec{\psi }(S), \frac{b_*}{a_*} (1 + \varvec{\psi }(S)^{\mathrm {T}} \mathbf{V}_* \varvec{\psi }(S))\Bigg ), \\ &\mathbf{V}_*= (\mathbf{V}^{-1} + {\varvec{\Psi }}^{\mathrm {T}}{\varvec{\Psi }})^{-1}, \\& \mathbf{w}_*= \mathbf{V}_* {\varvec{\Psi }}^{\mathrm {T}}\mathbf{y}, \\ &a_*= a + N/2, \ \mathrm {and}\\ &b_*= b + \frac{1}{2} (\mathbf{y} - {\varvec{\Psi }} \mathbf{w}_*)^{\mathrm {T}} (\mathbf{I} + {\varvec{\Psi }} \mathbf{V}{\varvec{\Psi }}^{\mathrm {T}})^{-1} (\mathbf{y} - {\varvec{\Psi }} \mathbf{w}_*), \end{aligned}$$
where \({\varvec{\Psi }}^{\mathrm {T}} = (\varvec{\psi }(S_1), \ldots , \varvec{\psi }(S_N))\) and \(\mathbf{y}^{\mathrm {T}} = (Y_1, \ldots , Y_N)\). Here, \(\mathbf{I}\) denotes the identity matrix, and \(\mathrm {T}_{\nu }(Y | \mu , \lambda )\) denotes the density function of the t-distribution with mean \(\mu\), scale \(\lambda\), and \(\nu\) degrees of freedom. The predicted value of the property is given by the mean \(\mathbf{w}_*^{\mathrm {T}} \varvec{\psi }(S)\) of the predictive distribution.
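For concreteness, these posterior quantities can be computed with a few lines of base R. The following is a minimal sketch; the names fit_qspr, predict_qspr, V0, a0, and b0 are ours (not part of the iqspr package), Psi is the N × d fingerprint matrix, and y the N-vector of observed property values.
```r
# Minimal sketch of the Bayesian linear QSPR model described above.
# Psi: N x d binary fingerprint matrix; y: N-vector of property values;
# V0, a0, b0: prior hyperparameters (V, a, b in the text).
fit_qspr <- function(Psi, y, V0, a0, b0) {
  N <- length(y)
  V_star <- solve(solve(V0) + t(Psi) %*% Psi)
  w_star <- V_star %*% t(Psi) %*% y
  a_star <- a0 + N / 2
  r      <- y - Psi %*% w_star
  b_star <- b0 + 0.5 * drop(t(r) %*% solve(diag(N) + Psi %*% V0 %*% t(Psi)) %*% r)
  list(w = w_star, V = V_star, a = a_star, b = b_star)
}

# Predictive t-distribution for a new fingerprint vector psi:
# mean, scale, and degrees of freedom as in the displayed equations.
predict_qspr <- function(fit, psi) {
  list(mean  = drop(t(fit$w) %*% psi),
       scale = (fit$b / fit$a) * (1 + drop(t(psi) %*% fit$V %*% psi)),
       df    = 2 * fit$a)
}
```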
The prediction models on the p properties, \(p(Y_j|S, \mathcal {D}_j)\) (\(j=1, \ldots , p\)), are obtained individually from the respective training sets. We then define the likelihood in Bayes’ law with a desired property region \(U=U_1 \times \cdots \times U_p\) as
$$\begin{aligned} p(\mathbf{Y} \in U | S) = \prod _{j=1}^p \int _{U_j} p(Y_j | S, \mathcal {D}_j) \mathrm {d} Y_j. \end{aligned}$$
(2)
For brevity, we write \(p(\mathbf{Y} \in U| S) = p(\mathbf{Y} \in U| S, \mathcal {D})\).
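As a hedged illustration, the interval probability in Eq. (2) can be evaluated with the t-distribution CDF. The helper predict_qspr() refers to the sketch above, and the scale is treated here as a squared scale parameter; this is our simplification, not code from the iqspr package.
```r
# Probability that property j falls in U_j = [lo, hi] under the predictive
# t-distribution, and the product over the p properties as in Eq. (2).
interval_prob <- function(fit, psi, lo, hi) {
  pr <- predict_qspr(fit, psi)
  sd <- sqrt(pr$scale)   # treating 'scale' as the squared scale of the t-density
  pt((hi - pr$mean) / sd, df = pr$df) - pt((lo - pr$mean) / sd, df = pr$df)
}

likelihood_U <- function(fits, psis, los, his) {
  # fits and psis are lists over the p properties; los and his define U_1 x ... x U_p
  prod(mapply(interval_prob, fits, psis, los, his))
}
```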
Though a simple instance of QSPR modeling is described here, more advanced supervised learning techniques, such as state-of-the-art deep learning or ensemble learning algorithms, can be exploited. When dealing with a discrete-valued property, the regression should be replaced by a classification model. This study uses conventional fingerprints as descriptors, but in practice it is highly beneficial to use more advanced descriptors, for example, molecular graph kernels coupled with kernel machine learning [23–25].
Table 1 Correspondence table between the formal and modified rules of SMILES
Chemical language model
With the SMILES chemical language, a molecule is translated into a linearly arranged string \(S = s_1 s_2 \ldots s_g\) of length g. A SMILES string consists entirely of symbols that indicate element types, bond types, and the openings and closings of ring closures and branching components. The opening and closing of a ring closure are designated by a common digit: ‘1’, ‘2’, and so on. A branch is enclosed in parentheses, ‘(’ and ‘)’. Substrings corresponding to multiple rings and branches can be nested or overlapped. In addition to the formal SMILES rules, all strings are revised to end with the termination code ‘$’. Inclusion of this symbol is necessary to automatically terminate the recursive string elongation process. For instance, once a string pattern ...CCC=O is present, any further elongation is prohibited and is terminated at once by appending ‘$’. In addition, the digits indicating the openings and closings of rings are represented by ‘&’. The revised representation rules are listed in Table 1. Appendix 1 in Supplementary Materials provides an illustrative example.
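The two modifications described here can be sketched in a few lines of R. This is only an illustration of the text above (multi-character tokens such as two-digit ring labels or bracket atoms are not handled), not the full rule set of Table 1.
```r
# Convert a formal SMILES string to the modified representation described above:
# ring-opening/closing digits become '&' and the termination code '$' is appended.
to_modified_smiles <- function(s) {
  s <- gsub("[0-9]", "&", s)  # ring indicators lose their pairing digits
  paste0(s, "$")              # explicit termination code
}

to_modified_smiles("c1ccccc1O")  # returns "c&ccccc&O$"
```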
Without loss of generality, the prior p(S) can be expressed as the product of conditional probabilities:
$$\begin{aligned} p(S) = p(s_1) \prod _{i=2}^{g} p(s_i | s_{1:i-1}). \end{aligned}$$
(3)
The occurrence probability of character \(s_i\) depends on the preceding substring \(s_{1:i-1} = s_{1} \cdots s_{i-1}\). In general, non-canonical SMILES encodes a chemical structure into many equivalent strings that correspond to different atom orderings. We treat such structurally equivalent strings as distinct instances of S.
The fundamental idea of chemical language modeling is as follows: (i) the conditional probability \(p(s_{i} | s_{1:i-1})\) is estimated from the observed frequencies of substring patterns in known compounds, and (ii) the trained model is anticipated to capture the implied context of the chemical language. For a given substring \(s_{1:i-1}\), the model is used to generate the remaining components: until the termination code appears, subsequent characters are recursively added according to the conditional probabilities, embedding the acquired chemical reality into the resulting structure.
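The recursive elongation can be sketched as follows. Here cond_prob is a stand-in for any conditional model \(p(s_i|s_{1:i-1})\) (for example, the back-off estimator described below) and sigma is the character alphabet including ‘$’; both names are ours, not part of the iqspr package.
```r
# Grow a string by repeatedly sampling the next character from a conditional
# model until the termination code '$' appears (or a length cap is reached).
grow_string <- function(prefix, cond_prob, sigma, max_len = 200) {
  s <- prefix
  while (nchar(s) < max_len) {
    p   <- cond_prob(s, sigma)          # probabilities over the alphabet
    nxt <- sample(sigma, 1, prob = p)   # draw the next character
    if (nxt == "$") break               # termination code stops the elongation
    s <- paste0(s, nxt)
  }
  s
}
```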
The SMILES generator should create grammatically valid strings. In particular, we focus on two technical difficulties, both of which concern the grammatical rules for expressing rings and branching components.
-
(i)
Unclosed ring and branch indicators must be prohibited. For instance, any string extended rightward from \(s_{1:6} = \mathtt{CC(C(C}\) must contain two closing characters, ‘)’, somewhere in the remainder.
-
(ii)
Neighbors in a chemical string are not always adjacent in the original molecular graph. Consider a structure expressed by CCCCC(CCCCC)C. The substring in the parentheses is a branch of the main chain. The main chain consists of six tandemly arranged carbons, split by the branch into parts before and after it. In this case, the occurrence probability of the final character \(s_{13} = \mathtt{C}\) should be affected more by characters in the main chain than by those in the branch. In other words, the conditional probability of \(s_{i}\) should depend selectively on a preferred subset of the conditioning string \(s_{1:i-1}\), according to the overall context of \(s_{1:i-1}\) and \(s_{i}\). The same holds when one or more rings appear in the conditioning string, e.g., c1ccc2ccccc2c1C.
To remedy these issues, the conditional probability is modeled as
$$\begin{aligned} p(s_i | s_{1:i-1}) = \prod _{k=1}^{20} p(s_i | \phi _{n-1}(s_{1:i-1}), \mathcal {A}_k)^{I(s_{1:i-1} \in \mathcal {A}_k)}, \end{aligned}$$
(4)
where \(I(\cdot )\) denotes the indicator function, which takes the value one if the argument is true and zero otherwise. One of the 20 different models \(p(\cdot | \cdot , \mathcal {A}_k)\) (\(k = 1, \ldots , 20\)) becomes active when the state of the preceding sequence \(s_{1:i-1}\) falls into one of the mutually exclusive “conditions” \(\mathcal {A}_k\) (\(k=1, \ldots , 20\)). The 20 (\(= 2 \times 10\)) conditions are classified according to the presence or absence of unclosed branches and the number \(\{0, 1, \ldots , 9\}\) of unclosed ring indicators in \(s_{1:i-1}\). For instance, if \(s_{1:i-1}\) contains two unclosed branch indicators, e.g., CCCC(CC(, the corresponding model should be probabilistically biased toward producing the two closing characters ‘)’ in subsequent characters. In addition, the substring selector \(\phi _{n-1} (s_{1:i-1})\) is introduced to treat the second problem. Its definition is as follows:
-
Contraction: Suppose that \(s_{1:i-1}\) contains a substring \(t = t_1 \cdots t_q\) enclosed in closed parentheses such that t itself is not enclosed in any other closed parentheses; in other words, t is the substring inside an outermost pair of closed parentheses. The substring is then reduced as \(t \rightarrow t' = t_1\) by removing all characters of t except the first character, \(t_1\), i.e., the character immediately to the right of the opening ‘(’ of the outermost closed parentheses.
-
Extraction: The selector \(\phi _{n-1} (s_{1:i-1})\) outputs the last \(n-1\) characters of the reduced string of \(s_{1:i-1}\).
The substring selector is illustrated with several examples in Fig. 2. The operation reduces the substring inside any nested closed parentheses to a single character that indicates the atom adjacent to the branching point. The occurrence probability of \(s_i\) is then conditioned on its \(n-1\) preceding characters in the reduced string, which correspond to neighbors in the molecular graph.
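The condition classification of Eq. (4) and the substring selector can be sketched as follows. This is an illustrative implementation of our own, operating on the formal digit-based ring labels rather than the ‘&’ notation and assuming branches are never empty; the actual iqspr implementation may differ.
```r
# Condition index k in 1..20: presence/absence of an unclosed branch (2 levels)
# times the number 0..9 of unclosed ring indicators (digits with odd counts).
condition_index <- function(s) {
  ch <- strsplit(s, "")[[1]]
  has_open_branch <- sum(ch == "(") > sum(ch == ")")
  n_open_rings <- sum(vapply(as.character(1:9),
                             function(d) sum(ch == d) %% 2 == 1, logical(1)))
  (if (has_open_branch) 10 else 0) + min(n_open_rings, 9) + 1
}

# Contraction: reduce the content of every outermost *closed* pair of
# parentheses to its first character; unclosed '(' are left untouched.
contract <- function(s) {
  ch <- strsplit(s, "")[[1]]
  out <- character(0); i <- 1; n <- length(ch)
  while (i <= n) {
    if (ch[i] == "(") {
      depth <- 1; j <- i + 1
      while (j <= n) {                       # search for the matching ')'
        if (ch[j] == "(") depth <- depth + 1
        if (ch[j] == ")") depth <- depth - 1
        if (depth == 0) break
        j <- j + 1
      }
      if (j <= n && depth == 0) {            # closed pair: keep '(', t_1, ')'
        out <- c(out, "(", ch[i + 1], ")")
        i <- j + 1
      } else {                               # unclosed '(': keep as is
        out <- c(out, ch[i]); i <- i + 1
      }
    } else {
      out <- c(out, ch[i]); i <- i + 1
    }
  }
  paste(out, collapse = "")
}

# Extraction: the last n-1 characters of the contracted string.
phi <- function(s, n_minus_1) {
  r <- contract(s)
  substr(r, max(1, nchar(r) - n_minus_1 + 1), nchar(r))
}

phi("CCCCC(CCCCC)C", 3)  # contracts to "CCCCC(C)C"; returns its last 3 characters, "C)C"
```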
Under the maximum likelihood principle, the conditional probability for \(\mathcal {A}_k\) in Eq. (4) is estimated by the relative frequency of the co-occurring n-gram, \(s_i\) and \(\phi _{n-1}(s_{1:i-1})\), in training instances of known compounds. Let \(f_{\mathcal {A}_k}(s_i, \phi _{n-1}(s_{1:i-1}))\) denote the count of such n-grams in which the conditioning string \(s_{1:i-1}\) is in condition \(\mathcal {A}_k\). We then conduct the back-off procedure [26] separately for all possible substrings \(s_{1:i}\) whose conditioning strings \(s_{1:i-1}\) belong to \(\mathcal {A}_k\):
$$\begin{aligned}&p(s_i | \phi _{n-1}(s_{1:i-1}), \mathcal {A}_k) \\&\quad = {\left\{ \begin{array}{ll} \frac{f_{\mathcal {A}_k}(s_i, \phi _{n-1}(s_{1:i-1}))}{ {\sum _{s_i \in \Sigma }} f_{\mathcal {A}_k}(s_i, \phi _{n-1}(s_{1:i-1}))} &{}\quad \mathrm {if} \ {\displaystyle \sum _{s_i \in \Sigma }} f_{\mathcal {A}_k}(s_i, \phi _{n-1}(s_{1:i-1}))> 0 \\ p(s_i | \phi _{n-2}(s_{1:i-1}), \mathcal {A}_k) &{}\quad \mathrm {otherwise} \end{array}\right. } , \end{aligned}$$
where \(\Sigma\) denotes the set of all possible characters. This is a recursive formula across the orders \(n = 1, 2, \ldots , n_{\mathrm {max}}\). In the upper case, the estimate is given by the relative frequency of each n-gram instance among the \(\mathcal {A}_k\)-conditioned substrings. If there are no such instances, the estimate from the \((n-1)\)-gram model is substituted, as in the lower case.
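A minimal sketch of the n-gram counting and back-off estimation is given below, reusing the condition_index() and phi() helpers from the previous sketch. The data structure (a named count vector keyed by "k|context|character") is our own simplification, not the iqspr internals.
```r
# Count n-grams of all orders up to n_max from a set of training strings,
# stratified by the condition A_k of the conditioning substring.
count_ngrams <- function(strings, n_max) {
  keys <- character(0)
  for (s in strings) {
    ch <- strsplit(s, "")[[1]]
    for (i in seq_along(ch)[-1]) {
      prefix <- paste(ch[seq_len(i - 1)], collapse = "")
      k <- condition_index(prefix)
      for (n in seq_len(n_max)) {
        keys <- c(keys, paste(k, phi(prefix, n - 1), ch[i], sep = "|"))
      }
    }
  }
  table(keys)
}

lookup <- function(counts, key) {
  v <- counts[key]
  if (is.na(v)) 0 else as.numeric(v)
}

# Back-off estimate of p(s_i | phi_{n-1}(prefix), A_k) over the alphabet sigma:
# use the relative frequency at order n if it has support, otherwise back off.
backoff_prob <- function(s_i, prefix, n, counts, sigma) {
  if (n < 1) return(1 / length(sigma))   # uniform floor if every order is unseen
  k   <- condition_index(prefix)
  ctx <- phi(prefix, n - 1)
  f <- vapply(sigma, function(ch) lookup(counts, paste(k, ctx, ch, sep = "|")),
              numeric(1))
  if (sum(f) > 0) f[[s_i]] / sum(f) else backoff_prob(s_i, prefix, n - 1, counts, sigma)
}
```
A conditional model compatible with the grow_string() sketch above can then be formed as, for example, cond_prob <- function(prefix, sigma) vapply(sigma, backoff_prob, numeric(1), prefix = prefix, n = 4, counts = counts, sigma = sigma).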
Backward prediction
The objective of the backward prediction is to generate chemical strings from the posterior distribution in Eq. (1), conditioned on a desired property region U. The forward models and the trained chemical language model define the posterior as in Eqs. (2) and (3). The SMC algorithm that we developed is shown in Algorithm 1.
In general, diverse molecules exhibit significantly high probabilities under the posterior. In order to better capture the diversity of promising structures, we create a series of tempered target distributions, \(\gamma _t (S)\) (\(t=1, \ldots , T\)), with a non-decreasing sequence of inverse temperatures \(0 \le \beta _1 \le \beta _2 \le \cdots \le \beta _{s-1} \le \beta _s = \cdots = \beta _T = 1\):
$$\begin{aligned} \gamma _t(S) \propto p(\mathbf{Y} \in U | S)^{\beta _t} p(S). \end{aligned}$$
The likelihood function becomes flatter as the inverse temperature decreases, and vice versa. The algorithm begins with a small \(\beta _1 \simeq 0\). The series of target distributions monotonically approaches the posterior as the iteration number increases, and bridges to the posterior at \(\beta _t = 1\), \(\forall t \ge s\).
At the initial step \(t=0\), R structures \(\{S_0^{r} | r =1, \ldots , R\}\) are created by some means. For each subsequent t, each currently obtained structure \(S_{t-1}^{r}\) is mutated randomly to \(S_*^{r}\) (\(r=1, \ldots , R\)) according to a structure manipulation model \(G_{\theta }(S_{t-1}^{r}, S_*^{r})\) with a set of parameters \(\theta = (\kappa , \eta )\), as detailed below. A new population \(\{S_{t}^{r} | r=1, \ldots , R\}\) is then produced by resampling \(\{S_*^{r} | r=1, \ldots , R\}\) with the selection probabilities \(W_{S_*^{r}}\) (\(r = 1, \ldots , R\)), which involve the current tempered distribution \(\gamma _t(S)\). The greater the likelihood a mutated structure achieves, the higher its chance of survival and the more offspring it leaves. In general, this process continues until the population has been updated hundreds or thousands of times. The present algorithm is essentially the same as a GA; the crucial difference lies in the mutation operator \(G_{\theta }(\cdot , \cdot )\).
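A condensed sketch of this loop is shown below. It is our own simplification of Algorithm 1: the selection probabilities are taken proportional to the tempered likelihood of the mutated structures, and mutate(), loglik() (the log of \(p(\mathbf{Y} \in U|S)\)), and the \(\beta\) schedule are supplied by the caller.
```r
# Tempered SMC over a population of SMILES strings.
run_smc <- function(init_pop, loglik, mutate, betas) {
  pop <- init_pop
  R <- length(pop)
  for (t in seq_along(betas)) {
    prop <- vapply(pop, mutate, character(1))             # mutate every structure
    logw <- betas[t] * vapply(prop, loglik, numeric(1))    # tempered likelihood weight
    w    <- exp(logw - max(logw))                          # stabilize before normalizing
    pop  <- prop[sample.int(R, R, replace = TRUE, prob = w)]  # resample
  }
  pop
}

# Example inverse-temperature schedule: ramp up to 1 and then stay there.
betas <- c(seq(0.05, 1, length.out = 100), rep(1, 400))
```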
The structure manipulation model \(G_{\theta }(S, S')\) is designed with the trained SMILES generator as summarized below.
-
(i)
Draw a uniform random number \(z \sim \mathrm {U}(0, 1)\). If S is grammatically correct and z is less than the reordering execution probability \(\kappa\) (=0.2), reorder the string, \(S \rightarrow S^*\), of length g; otherwise, set \(S^*\) to the unprocessed string S. With the first character chosen uniformly at random, Open Babel 2.3.2 [27] is used from the command line with the argument ‘-xf’ to perform the reordering.
-
(ii)
Discard the rightmost m characters of the reordered string to obtain \(S^{**} = s_{1:g-m}^{**}\). The deletion length m is sampled from the binomial distribution \(m \sim \mathrm {B}(m|L, \eta )\) with binomial probability \(\eta\) (=0.5 by default) and maximum length L (=5 by default).
-
(iii)
Extend the reduced string by sequentially adding a new character to its terminal point \(L-m\) times. Each newly added character follows the trained language model, \(s_i \sim p(s_i | s_{1:i-1})\). Once the termination code appears, the elongation is stopped, yielding \(S'\).
The reordering of strings plays a key role in preventing the series of designed molecules from getting stuck in local states. Note that the SMC algorithm can temporarily create structures containing unclosed rings and branching components. In that case, the corresponding start codes of the unclosed rings or branches are temporarily removed to avoid syntax errors when computing a descriptor for the likelihood calculation. In addition, the atom order is rearranged only when the current string is grammatically valid.
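Putting the pieces together, the mutation operator can be sketched as below. Here is_valid() and reorder() stand for a SMILES validity check and the Open Babel atom reordering, respectively, and grow_string() is the elongation sketch given earlier; these names, and the unbounded elongation, are our simplifications rather than the iqspr implementation.
```r
# One application of the structure manipulation model G_theta(S, S').
mutate_structure <- function(s, reorder, is_valid, cond_prob, sigma,
                             kappa = 0.2, eta = 0.5, L = 5) {
  # (i) with probability kappa, reorder the atoms of a grammatically valid string
  if (is_valid(s) && runif(1) < kappa) s <- reorder(s)
  # (ii) delete the rightmost m characters, m ~ Binomial(L, eta)
  m <- rbinom(1, size = L, prob = eta)
  s <- substr(s, 1, max(1, nchar(s) - m))
  # (iii) regrow the string with the trained chemical language model
  grow_string(s, cond_prob, sigma)
}
```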
Software
The iqspr package can be installed through the CRAN repository. Installation of Open Babel 2.3.2 is required to get started. The package consists of a set of functions to perform QSPR model building (QSPRpred reference class) with molecular fingerprints from the rcdk package [28], the inverse-QSPR prediction (SmcChem reference class), and the training and simulation of the chemical language generator (ENgram reference class) with user-specified input SMILES strings. Currently, the chemical language modeling and the inverse analysis cannot deal with isomers or ionic compounds. A sample code is given in Appendix 2 in Supplementary Materials.
Table 2 MAEs of the QSPR models with the eight different fingerprint descriptors for the internal energy and the HOMO-LUMO gap