Machine-learning correction to density-functional crystal structure optimization

Density functional theory (DFT) is routinely applied to predict crystal structures. The most common exchange-correlation functionals used to this end are the Perdew–Burke–Ernzerhof (PBE) approximation and its variant PBEsol. We investigate the performance of these functionals for the prediction of lattice parameters and show how to enhance their accuracy using machine learning. Our data set consists of experimental crystal structures from the Inorganic Crystal Structure Database (ICSD) matched with PBE-optimized structures stored in the Materials Project database, which we complement with PBEsol calculations. We demonstrate that the accuracy and precision of PBE/PBEsol volume predictions can be noticeably improved a posteriori by employing simple, explainable machine learning models. These models improve PBE unit cell volumes to match the accuracy of PBEsol calculations and reduce the error of the latter with respect to experiment by 35 percent. Furthermore, the error of PBE lattice constants is reduced by a factor of 3–5. A further benefit of our approach is the implicit correction of finite-temperature effects without performing phonon calculations.

Knowledge of the crystal structure of solids is essential for describing their elastic and electronic properties. In particular, accurate structure prediction is a prerequisite for predicting the electronic properties of not-yet-synthesized materials. Lattice parameters are most commonly calculated by density functional theory using the Perdew–Burke–Ernzerhof (PBE) approximation and its variant PBEsol as exchange-correlation functionals. These are successful in describing materials properties but do not always achieve the desired accuracy in comparison with experiments. We propose a computationally efficient scheme based on interpretable machine learning to optimize crystal structures and demonstrate that it noticeably enhances the accuracy of PBE and PBEsol structures.
In particular, the PBE unit cells available in materials databases can be improved to the level of the more accurate PBEsol calculations, and the error of the latter with respect to experiment can be reduced by 35 percent. An additional advantage of our scheme is the implicit inclusion of finite-temperature corrections, which makes expensive phonon calculations unnecessary.


Introduction
Computational high-throughput studies form the basis for the discovery of new materials in modern materials science. In solid-state physics, these studies are mostly performed within Kohn-Sham density functional theory (DFT) [1][2][3]. While DFT formally provides an exact description of the many-body Schrödinger equation, in practice it relies on approximations for the exchange-correlation energy. In solid-state physics, one commonly utilizes the Perdew–Burke–Ernzerhof (PBE) functional [4]. While PBE and its variants are successful in predicting structural and electronic properties of solids, they may nevertheless yield non-negligible deviations from experiments. Specifically, PBE tends to underbind, overestimating bond lengths [5][6][7] and, thus, lattice constants and volumes. Variants of PBE such as the PBE for solids (PBEsol) [8] were designed to improve upon this problem [9]. However, they still do not achieve the desired accuracy in comparison with experiments.
The correctness of the lattice constants and the corresponding unit cell volumes is indispensable for a reliable prediction of bulk electronic properties [10][11][12] and when considering experimental realizations of composite materials. For instance, the lattice mismatch between growth substrates and films can be a source of major problems in experiments. Another reason to focus on lattice parameters is the fact that this is the material property for which it is possible to find the largest amount of experimental data, collected in the Inorganic Crystal Structure Database (ICSD) [13].
In this work, we demonstrate that machine learning methods can improve the lattice volume predictions based on PBE/PBEsol without increasing the computational effort. Over the past few years, machine learning has enjoyed great success in a wide variety of applications [14], ranging from property predictions of band gaps [15][16][17] and elastic moduli [18] to the stability analysis of crystals [19,20] and molecular force-field estimations [21]. Recently, the prediction of lattice constants and volumes has generated much interest [22][23][24][25][26][27][28][29][30]. The majority of these studies are, however, limited to a particular crystal structure. In contrast to previous studies, which mostly predict lattice parameters directly, we build our approach on first-principles calculations, aiming to improve their accuracy in comparison with the corresponding experiments. Our approach is not limited to a specific crystal structure or a subset of chemical elements. We focus here on applying explainable machine learning methods [31,32] to correct errors of PBE/PBEsol calculations of crystal structures of newly predicted materials. Specifically, we employ model agnostic supervised local explanations (MAPLEs) [33] in combination with a random forest model [34] to combine the high accuracy of tree models for small datasets with the interpretability of MAPLE models.
Before diving into details, we can observe in figure 1 the primitive unit cell volumes V_pred, predicted from DFT calculations using the PBE and PBEsol functionals, plotted against the experimental unit cell volumes V_exp. We remark that the primitive cell volume is the simplest quantity that can be directly compared, independently of the specific details of the crystal structure and chemical composition. Their correlation gives a first impression of the accuracy (systematic error) and precision (variability) of the theoretically estimated unit cell volumes. Calculations with the PBE functional (magenta circles) significantly overestimate the measured volumes by roughly 11 %, while PBEsol (green squares) provides a much better approximation of them.

Fig. 1 (a) Correlation between measured primitive unit cell volumes V_exp and predicted ones V_pred. Ideal predictions match the black dashed line. The blue dash-dotted line depicts PBE's linear regression. The inset additionally considers the machine learning prediction pred = PBE+MAPLE marked by gray tripods.
On closer inspection, one sees that V_exp constitutes a soft lower bound on the PBE volumes, in the sense that about 90 % of the predicted volumes lie above it. This is a consequence of the tendency of PBE to underbind. This soft bound entails a skewness of the predicted PBE volume distribution, which we revisit later. The inset of figure 1 shows a close-up view of primitive unit cell volumes of about 0.32 nm³ to better distinguish the individual data points. We additionally include in the inset the volumes obtained by correcting PBE calculations with machine learning (gray tripods) to anticipate visually the strong error reduction. We discuss the machine-learning corrections thoroughly in the next sections.
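The soft-bound statistic and the skewness quoted above can be checked directly on matched volume pairs; a minimal sketch in Python, where the arrays `v_exp` and `v_pred` are placeholders for the matched experimental and DFT volumes (not the authors' actual pipeline):

```python
import numpy as np

def overestimation_stats(v_exp, v_pred):
    """Fraction of predicted volumes lying above experiment and the
    skewness of the residual distribution V_exp - V_pred."""
    v_exp = np.asarray(v_exp, dtype=float)
    v_pred = np.asarray(v_pred, dtype=float)
    # Share of compounds whose predicted volume exceeds the measured one
    frac_above = float(np.mean(v_pred > v_exp))
    # Standardized third moment of the residuals
    res = v_exp - v_pred
    skew = float(np.mean((res - res.mean()) ** 3) / res.std() ** 3)
    return frac_above, skew
```

For PBE, the first number comes out at about 0.9, reflecting the soft lower bound discussed above.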
The remainder of this article is organized as follows. In section 2, we present the employed machine learning models. Details on the experimental and theoretical datasets and their matching are discussed in section 3. In section 4, we analyze our predictive models and compare their performance with that of the underlying DFT calculations. In section 5, we discuss the correction of the lattice constants. The last section is devoted to our conclusions.

Predictive models
Tree-ensemble-based models such as random forests [34] and gradient boosting [35,36] are known to be suitable for the description of material properties for relatively small datasets [37,38], but they are not restricted to them [39]. A drawback of employing multiple decision trees is, however, their general lack of interpretability [32]. An appropriate combination with local linear models [40][41][42], as in model agnostic supervised local explanations (MAPLEs) [33], overcomes this deficiency by providing local and example-based explanations. The former address causal relations between specific input features of an individual prediction (such as lattice constants) and its outcome by identifying their importances [33,43,44]. The latter ask instead for the contribution of specific training points [45][46][47].
In this work, we employ the MAPLE implementation of Plumb et al. [33] as well as tree models and utility functions from Ref. [48]. We evaluate the machine learning models by tenfold Monte-Carlo cross-validation. We choose this approach instead of using an independent test set, since our full dataset exhibits a large variance in structures and elements while being relatively small in size. In each independent run of the cross-validation scheme, the full dataset is randomly split at a ratio of 1:9 without replacement into a test and a training set. The hyperparameter optimization has been performed on a separate random split. Here, the number of trees forming the tree ensemble turns out to be the most important hyperparameter; the minimal number of samples controlling the splitting is of minor importance. The theoretical crystal structures calculated using DFT at the PBE and PBEsol levels, respectively, serve as input parameters for the training, complemented with composition-specific features provided by Matminer [49].
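The cross-validation scheme described above amounts to repeating a random 1:9 test/training split and averaging the test error; a minimal sketch using scikit-learn, with a plain random forest standing in for the MAPLE model (function and parameter names are illustrative, not the authors' implementation):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

def monte_carlo_cv(X, y, n_splits=10, test_frac=0.1, seed=0):
    """Average MAE over repeated random test/training splits (ratio 1:9)."""
    rng = np.random.RandomState(seed)
    errors = []
    for _ in range(n_splits):
        # Random split without replacement; a new split in every run
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_frac, random_state=rng)
        model = RandomForestRegressor(n_estimators=100, random_state=0)
        model.fit(X_tr, y_tr)
        errors.append(mean_absolute_error(y_te, model.predict(X_te)))
    # Mean and standard deviation of the individual test errors
    return float(np.mean(errors)), float(np.std(errors))
```

In contrast to k-fold cross-validation, the test sets of different runs may overlap here, which is why the splits are drawn independently in every run.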
By training the MAPLE models with the datasets discussed in Sec. 3, we find that the primitive-cell volume prediction is, indeed, to a large extent based on structural quantities, see Table A1 in the Appendix. Here, we use the average of the root-node impurity over the decision-tree ensemble as an estimator to quantify the relevance of the features. The binary splittings of an individual decision tree are constructed such that they minimize its impurity. In this sense, the first splitting and, therewith, the respective root-node feature has a major impact on the decision-tree structure and the model prediction. In particular, in more than 50 % of the cases the splitting of the training set with respect to the root feature is directly related to the crystal structure of the compounds, through, e.g., the volume V, the lattice constants a, b, c, and the corresponding angles α, β, γ. Concerning the compositional features, the periodic-table-based features and averaged thermal properties of the elements play by far the largest role, exceeding even two of the lattice angles in importance. Our models are available at Ref. [50].
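One concrete reading of this root-node estimator, attributing each tree's root-split impurity decrease to the feature chosen there, can be sketched with scikit-learn's tree internals as follows (the helper name and feature labels are illustrative, not the authors' implementation):

```python
import numpy as np
from collections import defaultdict
from sklearn.ensemble import RandomForestRegressor

def root_node_importance(forest, feature_names):
    """Average the impurity decrease of each tree's first (root) split,
    attributed to the feature used at that split."""
    scores = defaultdict(list)
    for est in forest.estimators_:
        t = est.tree_
        if t.node_count < 3:  # degenerate tree without any split
            continue
        f = t.feature[0]  # feature index used at the root node
        left, right = t.children_left[0], t.children_right[0]
        n = t.n_node_samples[0]
        n_l, n_r = t.n_node_samples[left], t.n_node_samples[right]
        # Impurity decrease of the root split (sample-weighted children)
        decrease = t.impurity[0] - (n_l * t.impurity[left]
                                    + n_r * t.impurity[right]) / n
        scores[feature_names[f]].append(decrease)
    return {name: float(np.mean(v)) for name, v in scores.items()}
```

Features that dominate the first split of many trees, such as the DFT volume, accumulate the largest averages in this tally.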

Dataset
For our analysis and model training, we consider roughly 2000 PBE structures from the Materials Project (MP) [51]. The corresponding PBEsol calculations are available from Ref. [52]. The experimental crystal structures are extracted from the ICSD [13]. A mapping of ICSD and MP identifiers is provided in Table B2 of the Appendix.
We remark that experiments are conducted at finite temperature (2 K to 373 K) and pressure (≤1 bar), while DFT calculations describe equilibrium structures at zero temperature and pressure. In order to obtain PBE(sol) crystal structures under the same thermodynamic conditions as the experimental samples, they would have to be corrected by expensive phonon calculations for thermal expansion and zero-point effects arising from finite lattice fluctuations [53][54][55][56]. For small molecules, the ambient pressure additionally has to be taken into account [57], but it may be neglected for solids [53]. Training the predictive models on finite-temperature volumes V_exp as target variables has the advantage of implicitly including finite-temperature corrections. In principle, the ambient temperature of the measurements could be included as an input parameter for machine learning. However, it turns out that the resulting models are mostly independent of temperature, since the large majority of the experiments are performed at roughly the same temperature (about 293 K, both median and mode). The mean of the temperature distribution is 271 K.

Fig. 2 Violin plot (a) and probability densities (b) of the volume residuals V_exp − V_pred for the indicated test sets. The suffix '+lr' indicates linear regression and Q_k is the k-th quartile.

Volume predictions
In this section, we compare volume predictions obtained with various machine learning models. First, we give an overview of these predictions by discussing the central characteristics of their volume residuals. Then, we study their cross-validation error and finally address the convergence of the model training. We show in figure 2(a) violin plots [58] of the volume residuals V_exp − V_pred. One sees clearly that the median of the DFT-PBE calculations (red line in magenta violin) of about Q₂ ≈ −7.4 Å³ is, indeed, corrected by simple linear regression (blue violin). Its interquartile range IQR ≡ Q₃ − Q₁ of about 14 Å³ is also reduced by roughly 2 Å³, with the drawback that already well-predicted volumes worsen. The skewness remains. Employing MAPLE cures the skewness and further reduces the spread, down to about a third of its initial value (see gray violin). Intriguingly, its volume forecast is comparable to the simple linear-regression prediction starting from PBEsol volumes. Beyond the median and the interquartile range, the violin shapes in Fig. 2(a) estimate the entire probability densities of the volume residuals. For the purpose of assessing their estimation quality, we compare the DFT-PBE estimate with the corresponding normalized histogram depicted in Fig. 2(b). The estimate captures the curve progression well but is less pronounced around its mode, located at −1.3 Å³.
Additionally, we show in panel (b) the normalized histogram of the MAPLE prediction that corrects PBE volumes. As prefigured, it is considerably narrower and can be well approximated by a slightly biased Lorentzian (gray solid line) with a linewidth of roughly its interquartile range. For a more quantitative comparison of the different models, we focus in the following on the cross-validation error. The cross-validation error of a specific model is obtained by evaluating for each test set the error of its prediction with respect to the measured value and taking the arithmetic mean of these errors. Additionally, we determine the standard deviation of the individual errors. As error metrics we choose the mean absolute error, MAE = (1/n) Σ_{k=1}^{n} |y_{k,exp} − y_k|, and the mean absolute percentage error, MAPE = (100/n) Σ_{k=1}^{n} |y_{k,exp} − y_k| / |y_{k,exp}|, where y_{k,exp} (y_k) denotes the measured (predicted) property of the k-th sample. Since all experimental volumes are finite, the MAPE is well defined. In figure 3, we show the cross-validation errors of the predicted primitive unit-cell volumes for the different models. Their mean values and standard deviations are listed in Table 1. As expected, DFT-PBE itself leads overall to the worst cross-validation errors, while PBE corrected with linear regression (+lr) improves the MAE, leaving the MAPE unchanged. The MAPLE model based on PBE volumes reduces the PBE MAE by about 50 % and is slightly better than DFT-PBEsol volumes. However, PBEsol corrected with linear regression is once again better. Most importantly, the MAPLE model based on PBEsol shows the best MAE, improving by roughly 35 % upon DFT calculations with this functional alone. The IQRs in Table 1 show the same tendency as the MAE and MAPE, supporting the conclusion regarding the possible improvements achievable with the MAPLE models. Additionally, we report therein the MAPLE models using the measured temperature as an input feature. They perform, however, very similarly to the models without this feature, as expected given that most experiments were conducted at about the same temperature.
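For reference, the two error metrics can be written out explicitly; a minimal sketch:

```python
import numpy as np

def mae(y_exp, y_pred):
    """Mean absolute error: MAE = (1/n) * sum_k |y_k,exp - y_k|."""
    y_exp, y_pred = np.asarray(y_exp, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_exp - y_pred)))

def mape(y_exp, y_pred):
    """Mean absolute percentage error, in percent; well defined here
    because all experimental volumes are finite and nonzero."""
    y_exp, y_pred = np.asarray(y_exp, float), np.asarray(y_pred, float)
    return float(100.0 * np.mean(np.abs(y_exp - y_pred) / np.abs(y_exp)))
```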
To assess the learning progress of the MAPLE models, we study in figure 4 the dependence of their MAE on the training-set size n_train. The MAEs are again obtained by tenfold cross-validation. As expected, they decrease polynomially with n_train [59,60]. In particular, with MAE ∝ n_train^(−0.28) for PBE+MAPLE and MAE ∝ n_train^(−0.24) for PBEsol+MAPLE, we find a similar learning behaviour for both predictive models. Including all data available in the ICSD, the cross-validation error could potentially be reduced by 50 %. The relatively fast decay of the cross-validation errors with the training-set size makes these correction procedures applicable already for small training sets. We expect the approach to generalize to other material properties, such as bulk moduli or formation energies, for which very few experimental data are available and DFT results are a worse starting point.
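Such learning exponents follow from a straight-line fit of log MAE versus log n_train; a minimal sketch (the fitting helper is ours, for illustration):

```python
import numpy as np

def learning_exponent(n_train, mae_values):
    """Fit MAE ~ A * n_train**(-alpha) by linear regression in
    log-log space; returns alpha (positive for a decreasing error)."""
    slope, _ = np.polyfit(np.log(n_train), np.log(mae_values), 1)
    return float(-slope)
```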

Lattice constants
So far, we have discussed volume corrections. To a certain extent, we can thereby also improve the lattice constants, as we show in this section. To this end, we recall how the volume is calculated. The unit cell volume V(a, b, c) is obtained from the triple product of the three lattice vectors a, b, c and can be written as the product V = abc |polsin(α, β, γ)| of a factor abc depending only on their lengths and a dimensionless factor |polsin(α, β, γ)| depending only on their interior angles. The latter is well predicted by PBE and PBEsol: its MAPE is of the order of 0.02 % for cubic crystal systems and does not exceed 0.6 % even for lower-symmetry systems. Exploiting the fact that all lattice constants coincide for cubic crystals, we can extract the corrected lattice constant as the cube root of the corrected volume obtained from DFT calculations using PBE(sol). Therewith, the MAPE of the lattice vectors is reduced by roughly a factor of 5, see figure 5. If we use the same prescription as an approximate way to correct the lattice constants of noncubic systems, we observe a consistent reduction of the MAPE by a factor of 3–5. In particular, we observe that when the three lattice parameters display different errors, the largest errors are reduced most effectively, suppressing the MAPE of PBE lattice constants overall to less than 1 %.
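The decomposition V = abc |polsin(α, β, γ)| and the cubic prescription can be made concrete; a minimal sketch, where polsin is the standard angular factor of the triclinic cell volume and the corrected cubic lattice constant is taken as the cube root of the corrected volume (helper names are ours, not the authors'):

```python
import numpy as np

def polsin(alpha, beta, gamma):
    """Dimensionless angular factor of the cell volume (angles in
    degrees): V = a * b * c * polsin(alpha, beta, gamma)."""
    ca, cb, cg = np.cos(np.radians([alpha, beta, gamma]))
    return np.sqrt(1.0 - ca**2 - cb**2 - cg**2 + 2.0 * ca * cb * cg)

def cell_volume(a, b, c, alpha, beta, gamma):
    """Triclinic unit cell volume from lattice lengths and angles."""
    return a * b * c * polsin(alpha, beta, gamma)

def corrected_cubic_lattice_constant(v_corrected):
    """For cubic cells (a = b = c, all angles 90 degrees) V = a**3,
    so the corrected lattice constant is the cube root of the
    corrected volume."""
    return v_corrected ** (1.0 / 3.0)
```

For a cubic cell the angular factor is exactly 1, so correcting the volume and taking the cube root corrects the lattice constant directly; for noncubic cells the same cube-root rescaling is only an approximation, as discussed above.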

Conclusions
We have investigated machine-learning-based unit cell volume corrections for density functional theory calculations. Model agnostic supervised local explanations improve both PBE's and PBEsol's volume prediction of the primitive unit cell. By applying MAPLE to PBE, one achieves overall improvements on the level of PBEsol calculations, thereby reducing PBE's volume deviations from experiment by about 50 %. This is of great convenience, since all large solid-state databases rely on PBE calculations. We provide our implementation at Ref. [50]. Furthermore, PBEsol+MAPLE outperforms PBEsol with a roughly 1.5 times smaller mean absolute error. The most relevant features contributing to the MAPLE models are, indeed, the lattice parameters calculated with DFT, while composition-specific features are significantly less important. A further benefit of our approach is the implicit correction of finite-temperature effects, rendering time-consuming phonon calculations unnecessary. Since the considered experiments are mostly performed at the same (room) temperature, our trained MAPLE models are, however, not expected to generalize well to other temperatures. We plan to address this point in the future by training on datasets with a larger temperature variation.

Appendix A Feature importances
The importance of the features for the MAPLE models is listed in Table A1.

Appendix B Mapping of crystal structures
The mapping between experimental structures from the ICSD and the corresponding PBE structures from the Materials Project is listed in Table B2.