Electronic Properties of Molecules: A Computational Intelligence Approach

. The availability of BIG molecular databases derived from quantum mechanics computations represent an opportunity for computational intelligence practitioners to develop new tools with same accuracy but much lower computational complexity compared to the costly Schr¨odinger equation. In this study, unsupervised and supervised learning methods are applied to investigate the internal structure of the data and to learn the mapping between the atomic coordinates of molecules and their properties. Low dimensional spaces revealed a well deﬁned clustering structure as deﬁned by the measures used for comparing molecules based their atom distributions and chemical composition. Supervised learning techniques were applied on the original predictor variables, as well as on a subset of selected variables found using evolutionary algorithms guided by residual variance analysis (Gamma Test). Black and white box modeling approaches were used (random forests, neural networks and model trees and adaptive regression respectively). All of them delivered good performance, error and correlation-wise, with neural networks producing the best results. In particular white box techniques obtained explicit functional dependencies, some of them achieving considerably reduction of the feature set and expressed as simple models.


Introduction
Computational quantum mechanics derived from first principle has traditionally been used for the discovery and design of de-novo molecules and for the study of their structures and electronic properties [1]. More recently, the availability of huge molecular databases derived from quantum mechanics computations has given rise to new methods based machine learning [2][3][4]. These quantum mechanics machine learning models have shown great promises, approaching the same accuracy as first principle quantum mechanics computations at a much lower computational cost.
There are two main models in machine learning: discriminative (forward) and generative (inverse). In the context of quantum mechanics and discriminative learning which is the aim of this study, the goal is to learn a mapping from molecule x to a property y. In general, given a dataset [x i → y i , with i = 1 to N] that consists of N molecules (x i ) with their associated properties (y i ), the discriminative model will learn a mapping from x i → y i and use that mapping to predict new molecules. Several approaches have been explored in the literature to tackle these problems [4].
In these approaches, observations are described in terms of collections of variables/attributes, having several kinds of mutual dependencies, redundancies and noise. However, such a description may affect performance statistical and machine learning procedures because of the curse of dimensionality. Often the data concentrate in low dimension nonlinear manifolds, embedded within the high dimensional space in which the data is represented, either using an instanceattribute (the present case) or a dissimilarity representation. The implication is that in fact the data is not often really high dimensional. The dimension of those manifolds is considered to be the intrinsic dimension and usually it is much smaller than that of the original data representation space. Learning and uncovering these manifolds is important and useful for understanding the internal structure of the data, as well as for improving the performance of data analytic methods like clustering, classification and regression. In this study, we explored unsupervised and supervised learning approaches to map the atomic coordinates of the molecules to their electronics properties as follows. (a) From the atomic coordinate of each molecule, its Coulomb matrix is computed [2]. (b) Rows and columns of these matrices are sorted in decreasing order according to their column norm. This gives rise to an N × M feature matrix, where N is the total number of molecules and M the number of atoms that make the largest molecule in the set. (c) The Gamma test is performed to estimate the level of noise in the data. (d) M5 Model trees, Random Forests, adaptive regression and Neural networks are used to learn a mapping from the feature matrix to the electronics properties of the molecules with correlation coefficient close to 0.996.
The rest of this paper is organized as follows. In Sect. 2 we described the molecular dataset used in this study. Section 3 presents the machine learning techniques. Section 4 presents the experimental settings and the results obtained while we conclude in Sect. 5.

Molecules
The QM7 dataset used in this study is a subset of the GDB-13 dataset and was downloaded from [2]. This set consists of 7102 small organic molecules and their associated atomization energy. Initial Cartesian coordinates were generated and subsequently relaxed using the Universal Force Field106 as implemented in OpenBabel107 (version 2.3.2). Structures were further relaxed and self-consistent field energies calculated at the density functional level of theory (DFT) using the Perdew-Burke-Ernzerhof (PBE0) functional with def2-TZVP basis set as implemented in Gaussian (version 09 rev. D.01). Atomization energies were then obtained by subtracting free atom energies computed in the same fashion. More information relative to this dataset can be obtained at [2].

Coulomb Matrix
The inputs to our machine learning models are the same descriptors that also enter the Schrodinger equations i.e. the nuclear charges Z i and the atomic positions R i . Our machine learning model, instead of finding the wavefunction which maps the system's Hamiltonian to its energy, it directly learns a mapping from the system to energy based on examples given for training. The Coulomb matrix used in this case is directly obtained from Z i and R i .
Z i is the atomic number or nuclear charge of atom i, and R i is its position in atomic units. The Coulomb matrix M is symmetric and has as many rows and columns as the number of atoms in the molecule.
While the Coulomb matrix is invariant to translation and rotation of the molecule, it is not invariant to re-indexing of its atoms. One remedy is to sort the columns and rows of the Coulomb matrices by descending order relative to their norm 2 [2]. That is, for each molecule in the dataset, compute its Coulomb matrix. Pad each matrix to the right and bottom with zeros so they all have the same size that is 23 × 23, which is the maximum number of atoms per molecule in the QM7 dataset. Compute the norm-2 of each molecules and sort rows and columns in descending order. Given that the Coulomb matrix is symmetrical, only the lower triangular part is kept. Finally they are unfolded into 1D vector representation of the molecule. For the 7102 QM7 molecules the representation has a matrix of 7102 × 276 feature, where each row represents the signature of a molecule. This matrix was extended by including five extra features given by the chemical composition of the molecule with respect to the number of atoms of Carbon, Hydrogen, Nitrogen, and Sulfur. Both matrices were converted to z-scores (column-wise) by subtracting the mean and dividing by the standard deviation. The final data matrix was composed of 7102 × 282 features (281 predictors and the target property: Atomization energy).

Unsupervised Analysis and Data Exploration
Low Dimensional Spaces for Data Exploration. It is possible to create spaces for data exploration and visualization by computing low dimensional spaces that preserve chosen properties of the original dissimilarity matrix describing the objects [17]. Many techniques have been developed based on different principles. Among them are the Sammon nonlinear mapping and the tdistributed Stochastic Neighbor Embedding (t-SNE).
Sammon's nonlinear mapping [15] transforms vectors of two spaces of different dimensions (D > m) by means of a transformation like ϕ : The weight term δ −1 ij gives more importance to the preservation of smaller distances rather than larger ones and is determined by the dissimilarity distribution in the data space. Moreover, they are fixed, which is referred to as lack of plasticity.
t-SNE is an enhancement of SNE [10], where the mapping from higher dimensional space to lower dimensional space is based on the consideration of the similarity of conditional probabilities between datapoints. A conditional probability p j |i is the probability of datapoint x i to have x j as a neighbor based on a i is the variance of datapoint x i and k is a perplexity parameter related to selected local neighbors size. For the lower dimensional space, SNE utilizes conditional probabilities q j|i of datapoints x i based on another Gaussian distribution. The goal is to minimize the difference between the probability distributions of the two spaces, expressed as the sum of Kullback-Leibler divergences: C = i j p j|i log p j|i q j|i . One drawback of SNE, is the low cost when representing widely separated points. t-SNE applies a symmetric cost function and uses the Student's t-distribution in the target space, which has a heavier tail [18]. These modifications represent a notable improvement.

Supervised Analysis
Gamma Test. The Gamma test is a nonparametric technique aimed at estimating the variance of the noise present in a dataset [6,12,16], very useful in the construction of data-driven models. Noise is any source of variation in the target variable that cannot be explained by a smooth function (model) relating the target with the predictor variables. The gamma estimate indicates whether it is possible to explain the target variable by a smooth deterministic model based on the predictor variables. From this, an assessments can be made on (i) whether it is hopeful or hopeless to find a smooth model to the data, (ii) whether more explanatory variables should be incorporated to the data, (iii) how many observations are minimally required in order to build a model, (iv) appropriate thresholds in order to avoid overfitting during training and (v) what is the overall quality of the data. The most important assumptions of the procedure are (i) the model function f is continuous within the input space, (ii) the noise is independent of the input vector ← − x and (iii) the function f has bounded first and second partial derivatives.
Let S be a system described in terms of a set of variables and with y ∈ R being a variable of interest, potentially related to a set of m variables ← − x ∈ R m expressed as y = f ( ← − x ) + r, where f is a smooth unknown function representing the system, ← − x is a set of predictor variables and r is a random variable representing noise or unexplained variation. Let M be the number of observations and p is the number of nearest neighbors considered. If ← − x N i,k is the k-th nearest neighbor of object ← − x i , for every k ∈ [1, p], a sequence of estimates of E 1 2 (y − y) 2 based on sample means is computed as where E denotes the mathematical expectation and |.| Euclidean distance. The relationship between γ M (k) and δ M (k) is assumed linear as δ M (k) → 0 and an estimate for the variance of the noise Γ = V ar(r) is obtained by linear regression From Eq. 3 the vRatio (V r ) is defined as a normalized Γ value with respect to the variance of the target variable. Since V r ∈ [0, 1], it allows comparisons across different datasets: Assessing the relevance of the predictor variables is approached by searching for subsets with good Γ -statistics. In real-world cases the search space is determined by the power set of the predictor variables and evolutionary computation methods provide an alternative to the prohibitive brute force. A genetic algorithms explores subsets of predictors represented as binary vectors ← − ϑ = {0, 1} m ∈ R m (masks). Each represents a subset determined by the predictors present in the vector and the target y. The potential of each subset of variables is given by the Γ -statistics, which could be specified in different ways. A single-objective cost function can be formulated as a linear combination of partial fitness coming from (i) the MSE as associated to V r (the I f term), (ii) 'model smoothness' as associated to G (the G f term) and (iii) 'model complexity' given by the relative number of predictors (the L f term).
where W i = 0.8, W g = 0.1, W l = 0.1 are the weights of the contributing fitness terms, the largest of which is given to I f , directly related to the estimated MSE.
The choice of the weights {W i , W g , W l } is a compromise between the importance given to the partial fitness components coming from the subset's V r , the model complexity G and the model's cardinality (the smaller, the simpler, since it contains fewer predictors). The practical form of (6) is a heuristic emerging from many different applications. This use of GammaT est statistics has been very successful elsewhere [21][22][23].

Modeling Techniques
Several black and white box approaches have been used for learning predictive models for the Atomization Energy property. Namely, Neural networks (fully connected multilayer perceptrons) and Random Forests [5,13] as black box representatives, with M5 model trees [11,14,19,20] and Multivariate adaptive regression splines (MARS) 1 [7,8] as white box instances.

Unsupervised Analysis and Data Exploration
Since it is not possible to properly display 3D content on hard media, snapshots from fixed perspectives are presented. In order to simplify the representation, the original 7102 objects were pre-clustered using the leader algorithm [9] with an Euclidean distance threshold of 12.33 , which produced 1003 clusters (leaders). They are shown as semi-transparent spheres with sizes proportional to the cluster sizes. For both methods, the Sammon and the t-SNE mappings, the 3D transformations clearly exhibit the presence of well defined structures composed of different clusters. In the case of Sammon mapping (Fig. 1, Top), there is a lower density structure (upper right, mostly composed of outlying elements), well differentiated from a left area of much higher density composed of a sequence of clusters which progressively become more sparse. Under t-SNE, the mapping exhibits outlying elements at the top and the right respectively (Fig. 1, Bottom). Several clusters are also well defined and they correspond to the major structures of the Sammon mapping. This initial exploration of the data using unsupervised visualization techniques reveals the existence of well differentiated classes of molecules, determined by their Coulomb matrices and atomic composition. These structures would be exploited by supervised techniques aiming at predicting molecular properties.

Supervised Analysis
An orientative estimation of the predictive information contained in the data was obtained by computing the vRatio index (Eq. 4). The value obtained was quite low (V r = −0.0056) indicating that the information contained within the 281 descriptor variables has a high predictive power for the Atomization Energy target variable (in z-score form). This result provides a quantitative assessment of what the exploratory methods of Sect. 3.1. Several supervised machine learning approaches were used to construct models for the Atomic Energy property: random forests, model trees, neural networks (multilayer perceptrons) and multivariate adaptive regression splines. In all cases, a 10-fold crossvalidation scheme was used, with standard performance measures: (1) Mean absolute error The neural networks models used were multilayer perceptrons with one and two fully connected hidden layers and one output layer in 20 × 15 × 1, 30 × 15 × 1 architectures respectively. All activation functions were of the relu type, with MAE as the loss function, optimized by the 'Adam' procedure (using a pythonkeras implementation). The number of epochs was set to 200, with 10% of the training set in the fold used for validation. Other parameters were: learning rate = 0.001, beta 1 = 0.9, beta 2 = 0.999 and no AMSGrad. The random forests model had 10 trees in the forest, with a number of features equal to log(N i ) + 1, where N i is the number of inputs. M5 model trees were generated with a minimum of 4 instances/leaf node and pruned rules. In a second application, bagging of this type of models was used with 10 iterations of the bagging procedure, and 100% of the training set as the size of each bag. The Adaptive Regression models were applied with a maximum number of terms allowed set to 100 in two variants and with a maximum interaction degree of K = 5.
Two rounds of modeling experiments were conducted. In the first round the entire set of 281 predictor variables were used. For the second round, the power set of predictors were explored with a genetic algorithm using binary vectors as individuals, as described in Sect. 3.2. The objective function and the balancing weights W were those of Eq. 5, using the partial fitness from Eq. 6. The genetic algorithm settings were: population size = 100 individuals, one-point crossover operator with rate = 0.5 and bit mutation operator with rate = 0.05. At the end of the evolution process, the best individual had 156 selected predictors (55.5% of the originals) with a V r = 5.759 10 −8 , indicating both a high degree of predictability of the target variable and a high degree of redundancy in the original set of predictors. The modeling results for the two round of experiments are shown in Table 1.
All modeling techniques produced predicted outputs which are highly correlated with the observed Atomization Energies. The minimum correlation coeffi- In the case of Adaptive Regression models, they did not achieve a competitive error-wise performance with respect to neural networks. However, they ranked similarly with Random Forest with respect to both error and correlation measures. The later is specially important when looking at the model structure. Both Adaptive Regression models for data with 281 and 156 predictors respectively, used only 5 variables (1.78% and 3.2% respectively). Considering the original high dimensionality of the descriptor space, this represents a considerable reduction from a feature selection point of view. An important aspect is that this reduction is performed simultaneously with learning the underlying functional dependencies. The explicit models when using data with 281 and 156 predictors respectively are where x k i , K i ∈ R, i ∈ N + are constants found during the learning process and Max(p, q) is the maximum between p and q. In models of this kind, it is possible not only to explicit the important predictor variables, but also the way in which they interact, which is transparently shown by the multiplicative terms involving the basis functions (e.g. max(0, −x 1 +k 11 )max(0, x 1 −k 12 )). As discussed above, this model has a performance that matches the one from a random forest, which is a widely used, well established machine learning technique. From the point of view of performance, neural networks outperformed all other models. However, explicit, transparent models were capable of providing good results, at the level of other well established (black box) techniques, while working with significantly fewer number of predictors, with the advantage of exposing the nature of their interrelations and producing simple models.

Conclusions
The visualization of low-dimensional mappings from Coulomb matrices and atomic composition provided understanding of the structure of the data. They revealed the existence of well defined clusters from the point of view of both local distance preservation and consistency of conditional probability distributions between the original and the target spaces. The results obtained with different machine learning techniques aiming at modeling Atomization Energy (random forests, neural networks, model trees and adaptive regression), proved effective at capturing the functional dependencies between molecular structure and composition, and molecular properties, like Atomization Energy. Black and white models were produced that combine performance with transparency and explanation, identifying small subsets of relevant variables. Performance-wise, neural network models were superior, but adaptive regression in particular, produced relatively compact and transparent models, with accuracies comparable to those provided by well established techniques like random forests. Future work will expand the studies to other molecular properties, as well as to mechanisms for deriving molecular structure from desired properties.