Introduction

Deep learning (DL) algorithms hold the promise of further accelerating advancements in almost every aspect of scientific research. Recent architectural developments in deep neural networks (DNN) [1] have enabled new applications beyond their initial targets of image recognition and text processing. As evident from the recent surge in relevant publications [2], chemistry has proved a fruitful ground for algorithms originally developed for natural language processing (NLP) [3] and graph processing (GP) [4]. Chemical objects are a natural extension of these algorithms because molecules are commonly represented either as SMILES strings (a text representation) [5] or as molecular graphs (from valence theory). Both NLP and GP methodologies provide predictive and generative capabilities, paving the way for the challenging problem of predicting molecular structures from a desired chemical property.

In this study, we set out to develop a model for the predictive generation of molecular structures with a desired chemical property. As the property of interest, we chose the energy of the highest occupied molecular orbital (HOMO) of small organic molecules, owing to the impact this electronic orbital has on physicochemical behavior. HOMO energy levels affect the reactivity and stability of chemical compounds, influence the properties of materials such as photovoltaics, and directly determine the efficiency of light-to-electricity conversion in solar cells [6]. Our motivation is to facilitate the sampling of virtual chemical space by developing an extended DL approach that encompasses the HOMO prediction task and replaces the explicit enumeration of all possible compounds with the generation of a refined small set of molecules with predefined HOMO properties.

Density Functional Theory (DFT) methods can compute HOMO energies of photovoltaic materials [7], yet these computations are costly in both resources and time. In contrast, machine learning techniques allow fast and accurate prediction of HOMO levels. A comprehensive review of machine learning techniques applied to the prediction of molecular orbital characteristics is given in [8]. In particular, Kernel Ridge Regression (KRR) is used in [9,10,11,12,13,14,15], while Gaussian process regression (GPR), linear regressions (Elastic Net, Bayesian Ridge Regression), and Random Forest (RF) are applied in [14, 16, 17], respectively. Deep learning techniques were used in [15, 18,19,20,21,22,23,24,25,26], including those based on graph molecular representations [21, 23]. These previously developed models treat regression as the principal goal; to the best of our knowledge, no model for generating molecules with a given HOMO level has been developed yet. Our technique is based on the junction tree variational autoencoder (JT VAE) [27] architecture, which uses graph molecular representations as a reliable way of producing chemically valid structures. Our model achieves state-of-the-art results in HOMO energy prediction and generates new molecules with a desired HOMO value. We also suggest and explore several strategies for discovering chemical structures in the embedding space.

Model

VAE and regression

Adapting the approach of [28], we build our model on a variational autoencoder network [29]. To this end, we train modified message passing networks to encode the molecular graph and a GRU-based message passing network to decode it. This approach produces mappings from the space of molecules to the embedding vector space and back, allowing us to later train a regression model from the embedding space to the molecular property of interest (see Fig. 1).

Fig. 1 Simplified VAE architecture

For the regression, we use a feedforward neural network (FFNN) with two hidden layers of size 1024.
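
A minimal PyTorch sketch of this regressor is given below. Only the hidden width of 1024 is fixed by the text above; the latent dimension and other details are placeholders.

import torch
import torch.nn as nn

# Sketch of the property regressor: two hidden layers of width 1024
# mapping a latent vector to a scalar HOMO prediction. The latent
# dimension is an assumed placeholder.
class HomoRegressor(nn.Module):
    def __init__(self, latent_dim: int = 56):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, 1),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        return self.net(z).squeeze(-1)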

Junction tree VAE

In VAE training on graphs, the reconstruction of a molecule is a key challenge arising from the variability and complexity of molecular structures, and the choice of architectures that can reconstruct a graph efficiently is limited. In the present work, the graph is reconstructed by a sequence-to-sequence-type model with a modified gated recurrent unit block [27]. One challenge of this approach lies in the possible formation of long node sequences in the initial graph, which are hard to encode and even harder to reconstruct accurately. For example, molecules with cyclic carbon structures are challenging to decode accurately because of the combinatorial variety of ring placements and side-chain arrangements. Graph encoding and decoding methods can still be applied, but they perform poorly. To simplify molecular graphs and eliminate such complex patterns, we utilize the junction tree mechanism, which constructs an underlying reduced tree-like graph structure (hence the name) via an algorithm that maps the molecule into a unique tree representation and back. The nodes of the junction tree are chemical substructures with a rigid (fixed) spatial shape.
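
For illustration, the sketch below extracts the raw clusters that underlie a junction tree, every ring and every non-ring bond, using RDKit. The full decomposition of [27] additionally merges heavily overlapping rings and assembles the clusters into a tree, so this is a simplified approximation rather than the exact algorithm.

from rdkit import Chem

def clusters(smiles: str):
    # Every ring and every bond outside a ring becomes a cluster,
    # i.e. a candidate node of the junction tree.
    mol = Chem.MolFromSmiles(smiles)
    rings = [set(r) for r in mol.GetRingInfo().AtomRings()]
    bonds = [{b.GetBeginAtomIdx(), b.GetEndAtomIdx()}
             for b in mol.GetBonds() if not b.IsInRing()]
    return rings + bonds

# Toluene: one six-atom ring cluster plus one two-atom bond cluster.
print(clusters("Cc1ccccc1"))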

In our approach, we use the encoded junction tree and the molecular graph to obtain independent embeddings for the molecule and its underlying structure (Fig. 2a). Thus, each molecule has two embeddings, one for the junction tree and one for the molecular graph, which are concatenated into a single stacked vector used for the regression and for the guided search in the embedded space (Fig. 3). The decoding procedure first creates a junction tree structure and then uses the hidden representation of the molecular graph to guide the reconstruction of the junction tree into the final molecule (Fig. 2b).
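
Schematically, the stacked representation is a plain concatenation; the embedding sizes below are arbitrary placeholders.

import torch

z_tree = torch.randn(1, 28)   # junction tree embedding (size assumed)
z_graph = torch.randn(1, 28)  # molecular graph embedding (size assumed)
z = torch.cat([z_tree, z_graph], dim=-1)  # stacked vector fed to the regressor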

Fig. 2 VAE components (FFNN: feed-forward neural network, MPNN: message passing neural network)

Fig. 3 JT-VAE and regressor

Architecture details

We present further details of our model in the following tables. Table 1 gives the configuration of the model used to encode the junction tree of a molecule, Table 2 the configuration of the molecular graph encoder, and Table 3 the configuration of the regressor model. Table 4 describes the model used to decode the molecule from its latent representation, together with the model used to extract the junction tree representation of the reconstructed molecule for structure matching during decoding [27]. In the tables, MPNN stands for the modified Message Passing Neural Network [21, 27] and FFNN for Feed-Forward Neural Network.

Table 1 Junction tree encoder
Table 2 Molecular graph encoder
Table 3 Regressor
Table 4 Junction tree decoder

Training

The VAE model is designed to sample the latent embeddings from a probability distribution. Following [27], the initial training of the encoder-decoder pair was conducted in two phases: we first trained a deterministic autoencoder network and then tuned it by adding a penalty term, with a fixed coefficient, equal to the Kullback-Leibler divergence between the latent distribution and the standard normal distribution. As in [27], our experiments indicate that this is the optimal way of training.
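
The two-phase objective can be sketched as follows, assuming the usual diagonal-Gaussian encoder parameterized by a mean and a log-variance; the first phase corresponds to a zero penalty coefficient and the second to a fixed positive one.

import torch

def vae_loss(recon_loss: torch.Tensor, mu: torch.Tensor,
             logvar: torch.Tensor, beta: float) -> torch.Tensor:
    # KL divergence between N(mu, diag(exp(logvar))) and N(0, I).
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    # Phase 1: beta = 0 (deterministic autoencoder).
    # Phase 2: fixed beta > 0 (KL penalty switched on).
    return recon_loss + beta * kl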

A crucial factor for the training process was the training order of the encoder, decoder, and regressor triplet. We tried three strategies to ensure consensus between the regression and the VAE model.

Enc, Dec \(\rightarrow\) FFNN

In the first strategy (Fig. 4), a feed-forward neural network with ReLU activations (FFNN) is trained as a regression from the latent space to the property value space, while the pretrained VAE encoder and decoder layers are kept frozen.

Fig. 4 Finetuning of regression

Fig. 5 Training of regression

Fig. 6 Training of decoder and regressor

Enc, Dec \(\rightarrow\) Enc, FFNN

In the second strategy (Fig. 5), the VAE encoder-decoder pair is trained first, followed by joint training of the encoder and the regressor; this yields a finetuned encoder, but the VAE's decoder remains unchanged.

Enc, Dec \(\rightarrow\) Enc, FFNN \(\rightarrow\) Dec

The third strategy (Fig. 6) extends the second with an added step of retraining the encoder-decoder pair while keeping the finetuned VAE encoder frozen; in this way we obtain a finetuned decoder.
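
The three strategies amount to different freeze/unfreeze schedules for the encoder, decoder, and regressor. A sketch with placeholder modules (their shapes are arbitrary):

import torch.nn as nn

def set_trainable(module: nn.Module, flag: bool) -> None:
    for p in module.parameters():
        p.requires_grad = flag

# Placeholder modules standing in for the pretrained components.
encoder, decoder, regressor = nn.Linear(8, 4), nn.Linear(4, 8), nn.Linear(4, 1)

# Strategy 1: Enc, Dec -> FFNN. Freeze the VAE, train the regressor only.
set_trainable(encoder, False); set_trainable(decoder, False)
set_trainable(regressor, True)

# Strategy 2: Enc, Dec -> Enc, FFNN. Also finetune the encoder;
# the decoder stays frozen and no longer matches the encoder.
set_trainable(encoder, True)

# Strategy 3: ... -> Dec. Freeze the finetuned encoder again and
# retrain the decoder so that it matches the finetuned encoder.
set_trainable(encoder, False); set_trainable(decoder, True)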

Prediction of molecular structures with a given property value

The main problem addressed in this work is the accurate reconstruction of molecular structures from a given value of a chemical property of interest, namely the HOMO energy. Reverse quantitative structure-activity relationship (reverse QSAR) raises two related issues: first, whether a molecule with the given property value exists, and second, how to choose the most interesting structure when several molecules share the same HOMO value. We assume that a molecule exists for each given HOMO value. In an ideal case, we would deal with a differentiable mapping \(f\) from the space \(G\) of molecules to the space \(V\) of real-valued HOMO values. Assume that \(v_0\) is the desired HOMO value. Then, for a molecule \(g\), we can introduce a loss function \(L_{v_0}(g) = |v_0 - f(g)|_p\) for some norm \(|\cdot|_p\). In such a scenario, it would be possible to search for the optimal structure by minimizing \(L_{v_0}\) using gradient descent methods. Unfortunately, since the space \(G\) of molecules is discrete, the mapping \(f\) is not differentiable, and we cannot apply the optimization approach directly. We therefore use a VAE-type architecture, first to map molecules from \(G\) to the corresponding latent space of representations \(R_n\) for some \(n\) (encoder mapping \(E: G \rightarrow R_n\)) and back from \(R_n\) to \(G\) (decoder mapping \(D: R_n \rightarrow G\)). Next, we construct a regression mapping \(f_R: R_n \rightarrow V\) and apply the above-mentioned optimization process to the function \(f_R\).

All these functions \(E\), \(D\), and \(f_R\) can be realized as neural networks. Once the models are trained (see Sect. "Training"), we can apply automatic differentiation [28] to the neural network \(f_R\) in order to search the molecule embedding space \(R_n\). The minimization of \(L_{v_0}\) consists in fixing the weights of the neural network and moving the argument \(Z\) against the gradient of \(L_{v_0}(Z)\) in \(R_n\). Once the gradient descent has converged to some vector \(Z_0\) in \(R_n\) such that \(f_R(Z_0)\) is close enough to \(v_0\), we can reconstruct a molecule \(D(Z_0)\) by applying the decoder mapping \(D\) (Fig. 7).
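
A minimal sketch of this search, assuming \(f_R\) is a trained PyTorch module; for illustration we take the \(L_1\) norm and use Adam in place of plain gradient descent.

import torch

def latent_descent(f_R, z_init: torch.Tensor, v0: float,
                   steps: int = 100, lr: float = 0.05) -> torch.Tensor:
    # Fix the weights of f_R and optimize only the latent vector Z.
    z = z_init.clone().detach().requires_grad_(True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = (f_R(z) - v0).abs().sum()  # L_{v0}(Z) with the L1 norm
        loss.backward()
        opt.step()
    return z.detach()  # pass Z_0 to the decoder D to obtain D(Z_0)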

Fig. 7 Latent space gradient descent

We used three initialization approaches and two structure optimization methods for the prediction of molecules from the test dataset, resulting in six methods for searching for molecules with prescribed property values.

Vector initialization in the embedding space

We had to choose a method of initializing the embedding vector that is fed to the gradient descent procedure described above. A good initialization is of crucial importance for the result of the algorithm, since the embedding space can have multiple minima (several suitable molecules) or no minimum at all for a given output value. Moreover, there is no guarantee that a local minimum corresponds to the molecule with the best-fitting property. Fortunately, due to the inherent properties of the VAE architecture, similar structures tend to lie close to each other in the embedding space.

In our work, we studied different approaches to initializing a molecule (and thus the corresponding embedding vector). We introduce the following notation: let \(\text {REGR}_{\text {TR}}\) and \(\text {REGR}_{\text {VAL}}\) be the datasets used in the regression training procedure; CTV denotes the molecule from \(\text {REGR}_{\text {TR}}\) closest to a given one by target property value, and CHV the molecule from \(\text {REGR}_{\text {TR}}\) closest to a given one by hidden vector in the latent space.

Fig. 8 Closest by value initialization

Fig. 9 Closest by hidden vector initialization

Gaussian:

The initial approach samples the vector in the latent space \(R_n\) from a multivariate normal distribution with identity covariance matrix.

CTV + Gaussian:

We assume that molecules with similar properties have close projections in the embedding space \(R_n\). Thus, to invent molecules with the property value \(v_0\), we take the molecules in the training set \(\text {REGR}_{\text {TR}}\) whose property values are closest to \(v_0\) and use their embeddings in \(R_n\) as starting vectors, adding centered Gaussian noise with identity covariance in order to augment the set of initializations (Fig. 8).

CHV + Gaussian:

This initialization method simulates the generation of a molecule with a given property value that belongs to some particular class of chemical compounds. We take a molecule \(m_{int}\) of the given class and start the gradient descent from the molecule in \(\text {REGR}_{\text {TR}}\) whose embedding is closest in \(L_2\) distance to the embedding of \(m_{int}\). In our experiments we used the extreme case of this initialization, trying to reconstruct molecules from the validation set (Fig. 9). All three initializations are sketched below.
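
The sketch assumes the training embeddings and their property values are stored as tensors; the added noise has identity covariance, as described above.

import torch

def init_gaussian(n: int) -> torch.Tensor:
    # Gaussian: sample directly from N(0, I) in the latent space R_n.
    return torch.randn(n)

def init_ctv(v0: float, train_values: torch.Tensor,
             train_embeddings: torch.Tensor) -> torch.Tensor:
    # CTV + Gaussian: start from the training molecule whose property
    # value is closest to the target v0, plus centered Gaussian noise.
    i = (train_values - v0).abs().argmin()
    return train_embeddings[i] + torch.randn_like(train_embeddings[i])

def init_chv(z_query: torch.Tensor,
             train_embeddings: torch.Tensor) -> torch.Tensor:
    # CHV + Gaussian: start from the training molecule closest in L2
    # distance to the embedding of the molecule of interest.
    i = (train_embeddings - z_query).norm(dim=1).argmin()
    return train_embeddings[i] + torch.randn_like(z_query)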

Molecule optimization algorithms

Now we describe the optimization algorithms we used.

Algorithm A is a formalization of the gradient descent in the molecule embedding space described at the beginning of Sect. "Prediction of molecular structures with a given property value".

Algorithm A

According to the VAE paradigm, a molecule corresponds not to a single vector but to a whole domain in the embedding space \(R_n\). However, the regression function attributes different property values to different embeddings of the same molecule, making the prediction ambiguous. We used the mean vector of the normal distribution produced by the VAE encoder as the input to the regression model. After convergence in the latent space, the resulting vector does not a priori correspond to the mean of any molecule. A natural way around this problem is to decode a molecule from the found vector and then apply the encoder to compute the mean embedding vector of the predicted molecule. Following this intuition, we suggest Algorithm B for molecule optimization. This approach allows us to look for an embedding that corresponds to the mean of some molecule and has the property value closest possible to the value of interest.

Algorithm B
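
A sketch of the decode-encode loop behind Algorithm B, assuming E, D, and \(f_R\) are the trained encoder, decoder, and regressor, and reusing the latent_descent routine sketched earlier; the stopping criterion of the actual algorithm may differ.

def algorithm_b(E, D, f_R, z, v0, rounds=10, steps=100):
    # Alternate Algorithm A with a decode-encode projection so that the
    # search stays on mean embeddings of actual molecules.
    best_err, best_mol = float("inf"), None
    for _ in range(rounds):
        z = latent_descent(f_R, z, v0, steps=steps)  # Algorithm A step
        mol = D(z)   # decode the optimized vector into a molecule
        z = E(mol)   # re-encode: mean embedding of that molecule
        err = abs(float(f_R(z)) - v0)
        if err < best_err:
            best_err, best_mol = err, mol
    return best_mol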

Datasets

We use the ZINC [30] and QM9 [31] datasets in our work.

The QM9 dataset [31] is a widely used benchmark for the prediction of physical properties of molecules in the equilibrium state. It consists of around 130k small organic molecules with up to nine heavy atoms (C, O, N, or F), with properties computed using DFT calculations. Among this information are the HOMO energies of the molecules, which we use for the regression task and to test our structure optimization approach.

The ZINC dataset [30] is a subset of a free database [32] of commercially available compounds for virtual screening.

A dataset mixture to train the encoder-decoder pair. Originally, JT VAE [27] was studied on the ZINC dataset [30]. It was discovered that ZINC lacks variety in molecule types, to the extent that models trained on it lack efficiency in real tasks [8]. To avoid biasing the training toward a particular dataset, we extended the previous approach to a mixture of the QM9 and ZINC datasets (we call it MIX), which improved the model's generalization properties and overall reconstruction accuracy.

The JT VAE method decomposes every molecule into building blocks. One of the central assumptions in our work is that the dictionary of simple structures forming the molecules is large enough to decompose any possible molecule. This assumption is rarely satisfied when moving from one chemical database to another. Therefore, for the MIX dataset we expanded the vocabulary of simple compounds as much as possible: we grew the dictionary from the 780 original building blocks covering the ZINC database in JT VAE to 2327 basic objects spanning MIX. The total number of molecules in MIX is 355,796.
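
As an illustration of how such a dictionary is assembled, the sketch below collects canonical fragment SMILES over a dataset, reusing the simplified clusters helper from the junction tree sketch; the actual JT VAE vocabulary construction differs in detail, so it will not reproduce the exact counts above.

from rdkit import Chem

def build_vocab(smiles_list):
    vocab = set()
    for s in smiles_list:
        mol = Chem.MolFromSmiles(s)
        for c in clusters(s):  # helper from the junction tree sketch
            # Canonical SMILES of the fragment induced by the cluster.
            vocab.add(Chem.MolFragmentToSmiles(mol, atomsToUse=sorted(c)))
    return vocab

print(sorted(build_vocab(["Cc1ccccc1", "CCO"])))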

Datasets used to train the regression. For the property of interest, e.g., molecular HOMO energy levels, merging two different databases is often challenging: there is no unique way to correctly join data from various sources, due to variation in theoretical calculation techniques, measurement conditions, etc. To avoid these consistency issues, we performed the regression training on the QM9 dataset only.

First, we partitioned the \(\text {QM9}\) dataset into train, validation, and test subsets (\(\text {QM9}_{\text {TR}}\), \(\text {QM9}_{\text {VAL}}\), and \(\text {QM9}_{\text {TEST}}\), respectively), with 2500 molecules in the validation part and 5000 in the test part. For consistency, the VAE was trained on the combined \(\text {ZINC} + \text {QM9}_{\text {TR}}\) dataset with \(\text {QM9}_{\text {VAL}}\) used for validation; the regression was trained on \(\text {QM9}_{\text {TR}}\) and \(\text {QM9}_{\text {VAL}}\); and \(\text {QM9}_{\text {TEST}}\) served as test data for both HOMO prediction and molecule reconstruction.
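
A sketch of this partition, where qm9 stands for an assumed list of (SMILES, HOMO) records loaded beforehand:

import random

qm9 = [("C", -10.5), ("N", -10.2)]  # placeholder records; real loader omitted
random.seed(0)
random.shuffle(qm9)
qm9_val, qm9_test, qm9_tr = qm9[:2500], qm9[2500:7500], qm9[7500:]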

Results

Basic auto encoding

In Table 5 we give the results of the basic JT-VAE training. The columns Train and Test list the datasets used in the process; Acc is the percentage of molecules accurately reconstructed by the encoder-decoder pair; Chemical Validity is the fraction of chemically valid output molecules; and KL indicates whether the Kullback-Leibler divergence penalty term was used in the second stage of training (see Subsection "VAE and regression").

Table 5 Results of reconstruction accuracy

The results of the basic encoder-decoder training presented in Table 5 serve as a baseline for the multi-step VAE training described in Sect. "Training".

Regression and multi-step auto encoding

Table 6 lists the results reported in [8] for different datasets and HOMO energy prediction methods, trained and tested on the indicated datasets; a more complete description of the methods can be found in [8]. The last four lines, colored gray, correspond to our results. The lines Ridge Regr. and Elastic Net correspond to training the eponymous regressions from the latent space to the property of interest (HOMO). The last two lines correspond to training the unfrozen encoder (pretrained during the basic JT VAE training) jointly with a two-layer feed-forward neural network with ReLU activations.

We measure the HOMO accuracy (HOMO Acc columns in Tables 6 and 7) by the MAE loss, and the VAE accuracy (VAE Acc column in Table 7) as the percentage of reconstructed molecules whose SMILES strings are identical to the initial SMILES strings.
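
Both metrics are straightforward to compute; a sketch:

import numpy as np

def homo_acc(pred, true):
    # HOMO Acc: mean absolute error between predicted and reference values.
    return float(np.mean(np.abs(np.asarray(pred) - np.asarray(true))))

def vae_acc(reconstructed, original):
    # VAE Acc: percentage of reconstructed SMILES identical to the input.
    matches = sum(r == o for r, o in zip(reconstructed, original))
    return 100.0 * matches / len(original)

print(vae_acc(["CCO", "c1ccccc1"], ["CCO", "c1ccco1"]))  # 50.0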

Table 6 Results of regression accuracy

In Table 7 we give the results of the joint VAE + regressor training with the three strategies introduced in Section "Training". We find that the third strategy is a good trade-off between the quality of the regressor's property value prediction and the molecule reconstruction accuracy of the encoder-decoder pair. Note that the \(0\%\) VAE accuracy in the second line of Table 7 arises because the encoder trained together with the regressor no longer matches the decoder; this is why the third joint VAE and regression training strategy was introduced. The best regression [22] in Table 6 performs slightly better than our JT-ENC+FFNN regression model, but coupled with the decoder, our model can generate new molecules, which the other models in Table 6 cannot.

Table 7 Comparison of joint VAE and Regression training strategies

Molecule reconstruction

In Table 8, the methods in the columns Initialization and Type of search are described in Subsection "Vector initialization in the embedding space", and the column Reconstruction shows the rate of molecules from the \(\text {QM9}_{\text {TEST}}\) dataset recovered by the search.

Table 8 Experiments on type of initialization
Fig. 10 Reconstructed molecules from the \(\text {QM9}_{\text {VAL}}\) dataset. The target HOMO value coincides with the predicted HOMO value, which is specified under each molecule

The low values in the reconstruction percentage column of Table 8 are due to the fact that the model performed an unguided structure search for the HOMO values of the molecules from the \(\text {QM9}_{\text {TEST}}\) dataset, and also because multiple molecules can have close (or equal) HOMO values. We are unable to compute the true HOMO values of all the molecules suggested by our model because of the high cost of DFT computations, but we attest that the model correctly generated the "reconstructed" molecules according to their HOMO values. Given that the \(\text {QM9}_{\text {TEST}}\) dataset was never exposed to our model during training, we argue that our algorithm provides both the validity of the predicted molecules and high accuracy of the HOMO value prediction. In Fig. 10 we show several molecules from the \(\text {QM9}_\text {TEST}\) dataset that were suggested by our algorithm, so we know that their HOMO values were predicted perfectly.

In addition, we tested method B with a combination of CHV and Gaussian initialization on independent data. For this purpose, we took a molecule from [33] that is not in any way connected to the MIX or QM9 datasets. The proposed method successfully reconstructed the selected molecule c1c2ccccc2co1 from [33], which corroborates the reliability of our method. In Fig. 11 we give the reconstructed molecule and other examples of proposed structures; the molecule of interest is marked with a square.

Fig. 11 Reconstruction of a molecule from an independent dataset and alternative structures

Fig. 12 Guided optimization

Guided structure optimization

Guided structure optimization is a particular case of CHV; the idea is to modify a given molecule to obtain similar molecules with desired properties. With the JT VAE architecture, we can influence how the molecular structure changes, since the molecular decomposition is fully determined by the hidden vector that represents the junction tree. By modifying this vector slightly, we can introduce a change in the molecule and thus enhance a certain property value. For instance, in Fig. 12 we showcase the molecules (together with their HOMO values predicted by our regressor model) obtained after 100 iterations of Algorithm B applied to the arbitrarily chosen molecule CC1C(O)CC1C1CC1 from MIX, with a base HOMO value of -6.8025, where the target value was arbitrarily set to -5.93722 (the original molecule is marked with a square in the figure).
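
A sketch of one guided step under our assumption (not stated in [27]) that the stacked latent vector is laid out as [tree | graph], so that only the first tree_dim coordinates, the junction tree part, are updated:

import torch

def guided_step(f_R, z, v0, tree_dim, lr=0.05):
    # Compute the property loss, but apply the gradient only to the
    # junction tree part of the stacked latent vector, steering how the
    # molecular decomposition (and thus the structure) changes.
    z = z.clone().detach().requires_grad_(True)
    loss = (f_R(z) - v0).abs().sum()
    loss.backward()
    with torch.no_grad():
        z[..., :tree_dim] -= lr * z.grad[..., :tree_dim]  # tree part only
    return z.detach()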

Conclusion

The proposed model allows the prediction of important chemical property values, namely HOMO energy levels, which are otherwise costly to compute. While our model performs at the same level of accuracy as the current state-of-the-art regression models, the junction tree variational autoencoder coupled with a regression algorithm also allows the creation of new molecular structures with a desired HOMO value. We have proposed two strategies for performing the guided search for such structures and experimentally shown that our model produces molecules with the desired target properties.

Further developments

In our work we explored the applications of the JT-VAE architecture [27] to molecule design. The approach can be extended in various directions: choosing different functions for the gradient descent, multi-criteria search (for instance, for HOMO and LUMO energies simultaneously), different graph VAE architectures, or, instead, Generative Adversarial Networks (GANs) [34, 35] and Wasserstein GANs (WGANs) [36, 37].