Introduction

As a promising technique for atomistic modeling, a machine learning interatomic potential (MLIP) uses machine learning (ML) models to predict the energies and atomistic forces of atomic structures by training on data from density functional theory (DFT) calculations. Current state-of-the-art MLIPs are reported to reach near DFT-level accuracies as measured by small values of the root-mean-square error (RMSE) and mean absolute error in total energies and atomistic forces1,2,3,4,5,6,7. However, despite the widely reported low average errors in total energies and atomistic forces, a variety of discrepancies have been observed in atomic dynamics and physical properties, including defect structures, defect formation energies8,9, atom vibrations, migration barriers10,11, and erroneous forces12 during MLIP-based molecular dynamics (MD) simulations. These discrepancies can be attributed to the atom configurations, such as defects and transition states, that determine the properties8,9 but are significantly underrepresented in the training and testing datasets. The testing dataset is dominated by near-equilibrium atom configurations and thus yields the low average errors widely reported in the literature8,12. Identifying potential discrepancies and errors of MLIPs in the predictions of properties or atom dynamics in atomistic simulations is a crucial step toward guiding further improvements of MLIPs.

It is time-consuming and computationally expensive to study these errors and discrepancies of MLIPs, because obtaining the physical properties derived from atom dynamics, and their errors, requires running both ab initio- and MLIP-based MD simulations over long periods of time. The use of error metrics that are well correlated with these properties and their errors is an effective strategy to mitigate this issue. Liu et al.8 proposed the forces on rare event (RE) atoms as an error metric for diffusional properties and demonstrated that the MLIPs scoring higher on the proposed metrics showed significantly improved accuracy in predicting diffusional properties in MD simulations. A significant advantage of employing such error metrics is that the MLIPs’ errors on these properties can be assessed without running time-consuming MD simulations, which greatly accelerates the iterations of validation and testing for improving the MLIPs8,13. To overcome these errors and discrepancies of MLIPs, the development of effective error metrics correlated with these properties is highly desired.

For the applications of MLIPs in atomistic modeling research14,15,16,17,18, it is also critical for MLIPs to accurately predict a wide range of physical phenomena and properties19. The properties commonly encountered in atomistic modeling include defects, REs, diffusion, phonons, thermal conduction, and elastic moduli. Each category of properties includes a number of quantities and metrics. It is computationally expensive to test the MLIPs on these many properties. Obtaining MLIPs that simultaneously reproduce a large number of properties with small errors would be difficult and time-consuming, as multiple iterations of training and testing are often required.

To overcome this challenge, we here propose to identify a reduced number of representative properties that are correlated with other properties in their MLIP prediction errors, by studying the correlations among a large array of different property errors from numerous MLIP models. Since many material properties are physically correlated19, the training, validation, and testing of the MLIPs can focus on this reduced number of representative properties to achieve improved joint performance simultaneously across all properties. For example, mechanical properties such as elasticity are often found to be correlated with thermal properties because both stem from the strength of interatomic bonding20,21, and it is reasonable to expect that the errors in mechanical and thermal properties may also be correlated. Given the limited knowledge and studies on MLIP errors across a variety of properties, there is an urgent need to systematically examine and understand the errors of many properties and their underlying correlations to guide the future development of MLIPs toward good joint performance on many properties.

In this study, we propose an analysis process comprising the data generation, sampling, evaluation, benchmarking, and multi-dimensional statistical analyses of MLIPs’ joint performances on a diverse range of properties (Fig. 1). Given the complex behaviors of MLIP models based on high-dimensional functions and descriptors, it is important to analyze a large number of MLIP models and establish the statistical correlations of their behaviors, as results based on only a few optimal MLIP models may suggest biased correlations (see Supplementary Note 1 in Supporting Information). To achieve this, our proposed analysis process samples a large number of MLIPs from the validation pool of MLIP models generated by hyperparameter tuning during the validation steps of the typical MLIP training process (Fig. 1a). Thus, no additional computation cost is incurred to sample MLIP models from the existing validation pool. Then the errors of many physical properties are evaluated for each sampled MLIP model, and the performance data of these many sampled models are obtained in comparison to the corresponding benchmarks (Section 2.1) (Fig. 1b). These performance data of many MLIP models are then further analyzed. These analyses include: identifying the properties that are difficult for the MLIPs to predict accurately (Section 2.2) and their pairs (Section 2.3) (Fig. 1c), analyzing the high-dimensional Pareto fronts of multiple property errors for the joint performances of MLIPs (Sections 2.4, 2.5), analyzing the statistical correlations of MLIP models (Section 2.5) and property errors (Section 2.6), and identifying the representative properties (Section 2.6) for more efficient training and testing of MLIPs (Fig. 1d).

Fig. 1: The analysis processes.

a The steps of MLIP training. b Many MLIP models are sampled from the pool generated during the validation steps, and their performance on many properties is evaluated. These performance data of many MLIP models are analyzed c to identify challenging properties and d by the high-dimensional correlation analyses.

Results

Analysis process of MLIPs and error metrics

In this study, we implement and conduct this proposed analysis process on current state-of-the-art MLIPs, including the Gaussian Approximation Potential (GAP), Neural Network Potential (NNP), Moment Tensor Potential (MTP), Spectral Neighbor Analysis Potential (SNAP), Deep Potential (DeePMD), and Deep Potential-Smooth Edition (DeepPot-SE)2,3,4,6,22,23,24, evaluating their performances on a wide array of properties, using Si as a study case. We analyze the performance of 2300 MLIP models (the MLIP training and generating performance data of many MLIP models, Fig. 1) of these six MLIP types trained for silicon in the previous study8 on five different training datasets, namely \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) from ref. 1, \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) (the RE-enhanced training dataset in ref. 8), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) (the vacancy-enhanced training dataset in ref. 8), \({{\mathcal{D}}}^{2-{\rm{I}}}\), and \({{\mathcal{D}}}^{4-{\rm{I}}}\) (Methods) (Fig. 2a). Every training dataset contains 219 configurations over a wide range of selected structures of solid Si, liquid Si, strained or distorted Si, Si surfaces, and bulk Si with a single vacancy from ab initio molecular dynamics (AIMD) simulations (Methods). To increase the data diversity among training datasets, \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\), \({{\mathcal{D}}}^{2-{\rm{I}}}\), and \({{\mathcal{D}}}^{4-{\rm{I}}}\) are generated by replacing 54.7% (120 configurations) of the structures in \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) with configurations of Si bulk containing a single interstitial, di-interstitials, and four interstitials, respectively, from the AIMD simulations (Methods).

Fig. 2: Summary of MLIP models and properties.

a The distribution of 2300 MLIP models of six different types and trained by five different datasets, \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) (green), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) (red), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) (purple), \({{\mathcal{D}}}^{2-{\rm{I}}}\) (orange), and \({{\mathcal{D}}}^{4-{\rm{I}}}\) (blue). b The 60 calculated properties, performance scores, and evaluation metrics over seven categories and the corresponding benchmarks (colored, Table 1), including 30 properties using the differences between DFT K4 and K1, \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\), as benchmarks (blue), 20 properties in elastic constants using the differences between DFT K2 and K1, \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}2}\), as benchmarks (yellow), and 10 handpicked values as benchmarks (purple).

The 2300 MLIP models are selected from the validation pools in the validation steps of the conventional training (Fig. 1 and Supplementary Table 1). The first half of the 2300 models comprises the MLIP models with the lowest validation scores (i.e., the lowest errors) calculated on our validation dataset (Methods). The other half of the MLIP models is randomly selected from the rest of the validation pool, excluding those already selected in the first half. This second half of MLIP models may not exhibit low validation scores but may show good predictions for certain properties. Given the complex inter-dependence between the hyperparameters and the performance on multiple properties, this random sampling of MLIP models over the hyperparameter space can provide a more comprehensive understanding of the multi-property performance of MLIP models (see Supplementary Note 1 in Supporting Information).

In the next step of the analysis process (Fig. 1b), we evaluate each MLIP model for a range of material properties and errors (e.g., RMSEs in the predictions of energies and forces), which are collectively termed ‘properties’ in the remaining text. As summarized in Table 1 and Fig. 2b, these properties include (1) the formation energies EfDefect of four different types of point defects, namely vacancy, split-<110> interstitial, tetrahedral interstitial, and hexagonal interstitial, (2) the elastic constants of bulk crystalline Si and the Si supercells with a point defect, (3) the lattice parameters of bulk crystalline Si and the Si supercells with a point defect, (4) the energy rankings (as defined in ref. 25) with multiple vacancies for four sets of Si configurations, (5) the free energy Efree, the entropy S, and the heat capacity of the bulk crystalline Si and the Si supercells with a vacancy, and (6) other properties based on the performance scores and evaluation metrics (Table 1) as in ref. 8, including the magnitude and directional errors of forces on RE atoms in interstitial or vacancy diffusion, the normalized area of curve (NAC) NAC(\(\delta\), \({\mathcal{D}}\)RE), the RMSEs of energies and forces on RE atoms, such as \({\sigma }_{{\rm{E}}}^{{\rm{RE}}-{\rm{I}}}\), \({\sigma }_{{\rm{F}}}^{{\rm{RE}}-{\rm{I}}}\), \({\sigma }_{{\rm{E}}}^{{\rm{RE}}-{\rm{V}}}\), and \({\sigma }_{{\rm{F}}}^{{\rm{RE}}-{\rm{V}}}\), evaluated on the \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\) datasets, and \({\sigma }_{{\rm{E}}}^{{\rm{enhanced}}-{\rm{I}}}\), \({\sigma }_{{\rm{F}}}^{{\rm{enhanced}}-{\rm{I}}}\), \({\sigma }_{{\rm{E}}}^{{\rm{enhanced}}-{\rm{V}}}\), and \({\sigma }_{{\rm{F}}}^{{\rm{enhanced}}-{\rm{V}}}\), evaluated on the \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) datasets.

Table 1 Summary of the 60 error metrics, their corresponding benchmarks, and the fractions of MLIP models that meet the benchmarks

In order to have a fair comparison among these different properties, the ‘error metric’ \(\delta M\) for each property is normalized as

$$\delta M={\log }_{10}|{P}_{{\rm{predicted}}}-{P}_{{\rm{ideal}}}|$$
(1)

where Ppredicted is the property predicted by the MLIP, and Pideal is the ideal value. For most properties, Pideal is given by DFT calculations using a k-point mesh of 4 × 4 × 4 (DFT K4), except for the 20 elastic constants calculated using a k-point mesh of 2 × 2 × 2 (DFT K2) due to the prohibitively high computation cost of DFT K4 for the large supercells of defected structures (Fig. 2b). The Pideal values are 1 for the performance scores, such as the NAC force scores NAC(\(\delta\), \({\mathcal{D}}\)), and 0 for the RMSEs of energies or forces (Methods). Lower values of these normalized ‘error metrics’ indicate more accurate predictions of MLIPs.
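
As a minimal illustration, Eq. (1) can be evaluated in a few lines of Python; the property values below are hypothetical placeholders rather than results from this work.

```python
import numpy as np

def error_metric(p_predicted: float, p_ideal: float) -> float:
    """Normalized error metric of Eq. (1): log10 of the absolute deviation of
    the MLIP-predicted property from its ideal value (DFT K4, or 1/0 for scores)."""
    return float(np.log10(abs(p_predicted - p_ideal)))

# Hypothetical examples: a defect formation energy with a DFT K4 reference value,
# and an NAC performance score whose ideal value is 1 by definition.
print(error_metric(p_predicted=3.72, p_ideal=3.65))  # log10(0.07) ~ -1.15
print(error_metric(p_predicted=0.91, p_ideal=1.00))  # log10(0.09) ~ -1.05
```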

The benchmarks of the error metrics are set to identify which properties are challenging (Table 1 and Fig. 2b). For most properties, the benchmarks are set by the differences \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) between DFT K4 and DFT calculations with a single Γ-point (DFT K1) using a ~1000 Å3 supercell (except for elastic constants, which use \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}2}\)). DFT K1 is used to set the benchmarks for two reasons: (1) it is easy to compute, unlike experimental values that may not be readily available, and (2) it is widely adopted in AIMD simulations. Therefore, the DFT K1 benchmarks \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) provide practical baselines of MLIP accuracy for further simulations. The MLIP models that meet these benchmarks may be considered viable substitutes for DFT or AIMD simulations at greatly reduced computation costs. A number of benchmarks are manually set and labeled as ‘handpicked’ (Table 1 and Fig. 2b): 15 meV atom−1 for all energy RMSEs \({\sigma }_{{\rm{E}}}\); 0.12 eV Å−1 for some force RMSEs (Table 1); and 0.1 eV for EfDefect (because even DFT K1 calculations produce errors as large as 1 eV in EfDefect) (see “The benchmarks” section in Methods).

Identifying challenging properties of MLIPs

Among all properties, we first identify those that are difficult for the MLIPs to predict accurately (Fig. 1c). A property is identified as ‘challenging’ if fewer than 15% of MLIP models in the sampled pool achieve errors below the benchmark. The analysis reveals 35 challenging properties (Supplementary Table 2), mostly in four categories: (1) the defect formation energies EfDefect, with only 2.8–5% of MLIP predictions achieving the 0.1 eV benchmark, (2) 13 metrics in the elastic constants, with only 1–13% of MLIPs achieving the \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}2}\) benchmarks, (3) all 12 metrics for energy rankings, with fewer than four MLIP models, or 0.2%, achieving the \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) benchmarks for each metric, and (4) five metrics for the forces on RE atoms in the REs category (Fig. 3a–d and Supplementary Table 2), in addition to the entropy of Si bulk, Sbulk(T). The energy rankings of multi-defect configurations are particularly challenging for MLIPs, as even the models with the lowest errors lie far above the benchmark \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) (Fig. 3a). For the ranking error rate metric \({P}^{{\mathcal{D}}1}\)ranking error evaluated on the ranking dataset \({\mathcal{D}}\)1 of two vacancies in the supercells with 214 atoms (Table 1), the top 10% of MLIP models (except for two outliers) have ranking error rates in the range of 18–38%, compared to the \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) benchmark of 7.6% (Fig. 3a). The ranking error rates of multi-defects in Si are much higher than those of the elemental orderings (mostly below 10%) in Li-Al intermediate phases studied in our previous study25. In summary, our analysis of a large number of MLIP models based on different MLIP types and training datasets identifies energy rankings, EfDefect, elastic constants, and force predictions on RE atoms as challenging properties for MLIPs to predict with small errors.
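
The ‘challenging’ label follows directly from the fraction of sampled models whose error falls below the benchmark. A minimal sketch of this screening, with hypothetical arrays standing in for the real performance data and benchmark values, is:

```python
import numpy as np

def challenging_properties(errors, benchmarks, names, threshold=0.15):
    """Return the properties for which fewer than `threshold` of the sampled
    MLIP models achieve an error below the benchmark.

    errors: (n_models, n_properties) array of |P_predicted - P_ideal|
    benchmarks: (n_properties,) array of benchmark values
    """
    fractions = (errors < benchmarks).mean(axis=0)  # fraction of models meeting each benchmark
    return {n: f for n, f in zip(names, fractions) if f < threshold}

# Hypothetical toy data: 1000 models and 3 properties.
rng = np.random.default_rng(0)
errors = rng.lognormal(mean=-1.0, sigma=1.0, size=(1000, 3))
benchmarks = np.array([0.10, 0.05, 1.0])
print(challenging_properties(errors, benchmarks, ["E_f_vacancy", "C11", "sigma_E"]))
```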

Fig. 3: ‘Challenging’ properties of MLIPs.

The cumulative distribution function (CDF) curves of the sampled MLIP models for a the ranking error rates \({P}^{{\mathcal{D}}1}\)ranking error, b the NAC of errors on force directions, |\(\Delta {NAC}({\delta }_{{{\theta }}},{{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}})\)|, c the error of the formation energy of hexagonal interstitial, \(|\delta {E}_{{\rm{f}}}^{{\rm{hexagonal}}}|\), and d the error of the elastic constant C12 for the supercell with hexagonal interstitial, \(|\delta {C}_{12}^{{\rm{hexagonal}}}|\). The benchmarks of the property errors are shown in red dashed lines (\({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) shown in black dashed line if different), and the values show the fractions below the benchmarks.

The Pareto front of MLIPs: the trade-off between property errors

We analyze how accurately MLIPs can predict two properties simultaneously by constructing the Pareto fronts of the MLIPs from their error metrics for pairs of properties (Fig. 1c). As a concept widely used in multi-objective optimization26, the Pareto front consists of a set of optimal data points in which no objective can be improved without degrading some of the other objectives. Here, our Pareto fronts consist of a set of MLIP models, which generally represent some of the best-performing models for different property combinations (Methods). Pairs of properties from the 48 properties are examined, excluding the energy rankings of multi-defects given their high errors. The models at the vertices of the Pareto fronts are called the optimal MLIP models (Fig. 4a, b, and Supplementary Figs. 2–4). The Pareto fronts show the trade-offs for the optimal MLIP models in predicting the two properties: the MLIP models with lower errors on one property have higher errors on the other. This trade-off poses challenges for training MLIP models to achieve high accuracies for both properties.
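
For a pair of error metrics, the Pareto front reduces to the set of non-dominated models under minimization of both errors. The sketch below illustrates the idea; the actual fronts in this work were built by enumerating all models with an established multi-objective optimization routine (Methods).

```python
import numpy as np

def pareto_front_2d(errors):
    """Indices of non-dominated MLIP models for two error metrics (both minimized).
    errors: (n_models, 2) array."""
    order = np.argsort(errors[:, 0])      # scan models by increasing first error
    front, best_second = [], np.inf
    for i in order:
        if errors[i, 1] < best_second:    # keep models that improve the second error
            front.append(i)
            best_second = errors[i, 1]
    return np.array(front)

# Hypothetical example: errors of 500 models on two properties.
rng = np.random.default_rng(1)
err = rng.random((500, 2))
print(len(pareto_front_2d(err)), "optimal models on the 2D Pareto front")
```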

Fig. 4: Challenging property pairs shown by the Pareto fronts.

The scatter plots of the MLIP models for two property errors: a |\(\Delta {NAC}(|{\delta }_{{\rm{F}}}|,{{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}})\)| versus \(|\delta {E}_{{\rm{f}}}^{{\rm{tetrahedral}}}|\), and b \(|\delta {C}_{11}^{{\rm{hexagonal}}}|\) versus \(|\delta {E}_{{\rm{f}}}^{{\rm{vacancy}}}|\), with the Pareto fronts shown in the black lines. The benchmarks of the property error are shown in red dashed lines.

The Pareto front analyses also reveal pairs of properties for which it is difficult to simultaneously achieve both benchmarks. According to our analysis, there are 20 property pairs for which no optimal MLIP model can simultaneously meet or exceed the benchmarks for both properties (Supplementary Table 3). These challenging pairs include seven pairs in EfDefect–elastic constants, ten pairs in EfDefect–REs, and three pairs in EfDefect–E & F RMSEs (Supplementary Table 3), which are drawn from the four categories of defect formation energy EfDefect, elastic constants, REs, and E & F RMSEs. Many of these are among the challenging properties identified in Section 2.2. As shown by these analysis results, identifying the property combinations that are difficult to simultaneously meet the benchmarks, and their trade-offs, can provide valuable guidance for improving the training of MLIPs.

The high-dimensional Pareto fronts for multi-property performance of MLIPs

Here, we further examine the joint performances of MLIPs on the prediction errors of multiple properties (i.e., in dimensions higher than 2D). Achieving low errors on many properties simultaneously is expected to be even more difficult than on two properties. First, we study the joint performances using all four properties in the EfDefect category and all eight properties in the REs category (Table 1), which are mostly challenging properties identified in Section 2.2. For each type of MLIP, we construct a high-dimensional Pareto front on all properties (e.g., a 4D Pareto front for the EfDefect category and an 8D Pareto front for the REs category, shown as the last rows in Supplementary Tables 4–7) and obtain the optimal MLIP models. For all these optimal models of all six types of MLIPs, we calculate their similarities based on the error metrics of the properties and cluster the MLIP models that have similar joint performances by applying a graph clustering algorithm, the Louvain community detection algorithm (Methods). The similarity of MLIP model i and MLIP model j is measured by the inverse of the Euclidean distance between their vectors of error metrics,

$${Similarity}(i,\,j)=\frac{1}{\sqrt{\sum _{k}{({{\delta M}_{i}^{k}}^{{\prime} }-\delta {{M}_{j}^{k}}^{{\prime} })}^{2}}},$$
(2)

where the standardized value of error metric \({\delta M}_{i}^{{\prime} }\) for MLIP model i is calculated as

$${\delta M}_{i}^{{\prime} }=\frac{{\delta M}_{i}- < \delta M > }{{\sigma }_{{\rm{\delta }}{\rm{M}}}}$$
(3)

where <\(\delta\)M> is the mean value and \({\sigma }_{{{\delta }}{\rm{M}}}\) is the standard deviation of the error metric \(\delta M\).
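
A compact way to evaluate Eqs. (2) and (3) over a pool of models is sketched below; the error matrix is a hypothetical stand-in for the real performance data, and the standardization is taken over whatever set of models is passed in.

```python
import numpy as np

def similarity_matrix(error_metrics):
    """Pairwise similarities between MLIP models following Eqs. (2)-(3).
    error_metrics: (n_models, n_properties) array of error metrics deltaM."""
    # Eq. (3): standardize each error metric over the supplied models.
    z = (error_metrics - error_metrics.mean(axis=0)) / error_metrics.std(axis=0)
    # Eq. (2): inverse Euclidean distance between standardized metric vectors.
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    with np.errstate(divide="ignore"):
        return 1.0 / dist                 # diagonal becomes inf (self-similarity)

rng = np.random.default_rng(2)
sim = similarity_matrix(rng.random((100, 8)))  # e.g., 100 optimal models, 8 REs metrics
```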

These clusters of MLIP models based on the similarities of the error metrics allow the analysis of the joint performances of MLIPs on many properties in high-dimensional space. Generally, the models that are clustered together exhibit similar errors on the same properties. For the MLIP models that do not belong to any cluster, their errors vary across properties, i.e., some models have high errors on some properties, while other models have high errors on others. For the clustering based on the error metrics of the defect formation energy category, 359 optimal MLIP models of all types are clustered into four major communities (more than 10 MLIP models for each community): one mainly of GAPs, one of SNAPs, and two communities of DeepPot-SE models (Fig. 5e and Supplementary Fig. 6, see Supplementary Note 3 in Supporting Information), along with scattered models based on NNP, DeePMD, and MTP. The scattering of the NNP and DeePMD models in the similarity clustering based on the defect formation energy category suggests that NNPs and DeePMDs make good predictions only for some properties in the defect formation energy category but perform poorly on others.

Fig. 5: The clustering and the performances of MLIPs in different property categories.

The optimal MLIP models on the Pareto fronts for the property pairs of b \(|\delta {E}_{{\rm{f}}}^{{\rm{vacancy}}}|\) versus \(|\delta {E}_{{\rm{f}}}^{{\rm{hexagonal}}}|\), d \(|\delta {E}_{{\rm{f}}}^{{\rm{vacancy}}}|\) versus \(|\delta {E}_{{\rm{f}}}^{{\rm{tetrahedral}}}|\), g |\(\Delta {NAC}(|{\delta }_{{\rm{F}}}|,{{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}})\)| versus |\(\Delta {NAC}(|{\delta }_{{\rm{F}}}|,{{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}})\)|, and i \(|{\sigma }_{{\rm{F}}}^{{\rm{RE}}-{\rm{V}}}|\) versus |\(\Delta {NAC}(|{\delta }_{{\rm{F}}}|,{{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}})\)|. The minimum values of error metrics are indicated by black lines, and their cross points are the reference points to calculate the hypervolume (HV) and the inverted generational distance (IGD) scores. The fraction of the optimal MLIP models for each type of MLIPs with the lowest and the 2nd lowest a HV scores and c IGD scores on all property pairs in the defect formation energy category and f, h in the REs category. The clusters of optimal MLIP models connected by the similarities (Methods) based on e all four properties in the defect formation energy category and j all eight properties in the REs category.

To quantitatively measure and compare the joint performances of MLIPs on many properties in high-dimensional space, we employ the hypervolume (HV) and inverted generational distance (IGD) scores, which are commonly used to evaluate Pareto fronts. The HV is calculated as the total volume between the Pareto fronts and the reference points, which are the points with the minimum error metric values for each property on the Pareto front (Fig. 5). The IGD is the closest Euclidean distance between the error metric vectors of the Pareto fronts and the reference points (Methods). The Pareto fronts closer to the lower left corner of Fig. 5a, b, h, i have lower HV and IGD scores, indicating lower errors and better performances. Among the property pairs in the EfDefect category, the Pareto fronts of the NNP models have the lowest HV scores and the lowest IGD scores in 83% and 67%, respectively, of all property pairs, and the Pareto fronts of the DeePMD models account for 67% of the 2nd lowest HV and 50% of the 2nd lowest IGD scores (Fig. 5c, d, Supplementary Fig. 2, Supplementary Table 4, and Supplementary Table 5). However, NNP and DeePMD only perform well for two properties at a time based on the two-dimensional HV and IGD scores, while they perform poorly over many properties (D > 2), given the scattering of the NNP and DeePMD models in the similarity clustering based on the EfDefect category (Fig. 5e). Indeed, none of the 35 optimal NNPs on the 4D Pareto front in the EfDefect category meets all benchmarks (Supplementary Table 9). Therefore, combining these results with the clustering analyses above reveals critical insights into the performance of MLIP models on multiple properties.

For the clustering of the REs category, 533 optimal MLIPs are clustered into seven major communities (Fig. 5j and Supplementary Fig. 6). In addition, 86% of the Pareto fronts of MTPs have the lowest HV, 79% of GAPs have the 2nd lowest HV, 68% of MTPs have the lowest IGD score, and 57% of GAPs have the 2nd lowest IGD for the property pairs in the REs category (Fig. 5h, i, Supplementary Fig. 3, Supplementary Table 6, and Supplementary Table 7). As indicated by the results of these two analyses, many MTP and GAP models show good joint performances in predicting the energies and forces on RE atoms. Indeed, a total of 20 optimal MTP models meet all REs benchmarks (Supplementary Table 8). More analyses, including the HV and IGD scores for different types of MLIPs in the elastic constants category, are given in Supplementary Note 2 in Supporting Information (Supplementary Figs. 4 and 5).

As illustrated in this section, the combined analyses of quantitative scores of the high-dimensional Pareto fronts and the clustering of error similarities reveal insights into the statistical behavior of the many MLIP models on a large range of property errors. Such analysis can guide the selection of MLIP models and the next round of training and testing.

‘Curse of dimensionality’ on the joint performances of MLIPs

In this analysis, we construct the high-dimensional Pareto fronts with different numbers D of properties in all their possible combinations, to understand the effect of dimensionality on the joint performances of MLIP models. We count the optimal MLIP models that meet all benchmarks, \({N}_{{\rm{optimal}}}^{{\rm{all}}}\), for different numbers D of properties. Low values of \({N}_{{\rm{optimal}}}^{{\rm{all}}}\) indicate the difficulty of MLIPs meeting all benchmarks for many properties in high-dimensional space. As shown in the cumulative distribution function (CDF) of \({N}_{{\rm{optimal}}}^{{\rm{all}}}\) for D = 2–5 (Fig. 6a), the probability of having no optimal MLIP meeting all benchmarks, P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1), increases significantly with D. For most pairs of two properties, at least one optimal MLIP model on the Pareto front satisfies both benchmarks, as shown by P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1) = 2% for the 2D Pareto fronts. It is much more difficult to find MLIP models that meet all five property benchmarks, as 46% of the 5D Pareto fronts have no optimal MLIP that meets all five benchmarks. This increase of P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1) with dimensionality D is observed for MLIPs trained on various datasets (Methods), including our original dataset of 2300 MLIP models with the full 60 properties, \({{\mathcal{D}}}_{2300}\), and the dataset of 124 MLIP models with 31 properties, \({{\mathcal{D}}}_{124}\) (Fig. 6b and Supplementary Fig. 7). As shown by these results, it can be difficult to find MLIP models that beat the benchmarks of three or more properties. Our result highlights the ‘curse of dimensionality’ in developing MLIPs that can accurately predict many properties.
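
The statistic behind Fig. 6, P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1), can be estimated by enumerating every D-property combination, building its Pareto front, and checking whether any optimal model meets all D benchmarks. A simplified sketch with hypothetical data (not the production code of this study) is:

```python
import numpy as np
from itertools import combinations

def is_pareto(errors):
    """Boolean mask of non-dominated rows (all error metrics minimized)."""
    n = errors.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        if mask[i]:
            # rows dominated by i: no better in any metric and strictly worse in one
            dominated = np.all(errors >= errors[i], axis=1) & np.any(errors > errors[i], axis=1)
            mask[dominated] = False
    return mask

def prob_no_optimal_meets_all(errors, benchmarks, D):
    """Fraction of D-property combinations whose Pareto front contains no
    optimal model meeting all D benchmarks, i.e., P(N_optimal_all < 1)."""
    benchmarks = np.asarray(benchmarks)
    failures = total = 0
    for combo in combinations(range(errors.shape[1]), D):
        sub = errors[:, combo]
        front = sub[is_pareto(sub)]
        meets_all = np.all(front < benchmarks[list(combo)], axis=1)
        failures += int(not meets_all.any())
        total += 1
    return failures / total

# Hypothetical toy data: 300 models, 8 properties, unit benchmarks.
rng = np.random.default_rng(3)
err = rng.lognormal(size=(300, 8))
print([round(prob_no_optimal_meets_all(err, np.ones(8), D), 2) for D in (2, 3, 4)])
```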

Fig. 6: The joint performances of MLIPs for all combinations of properties.

a The cumulative distribution function (CDF) curves of the number of optimal MLIPs that meet all benchmarks, \({N}_{{\rm{optimal}}}^{{\rm{all}}}\), for two (2D, blue), three (3D, orange), four (4D, green), and five (5D, red) properties. The probabilities of no optimal MLIP meeting all benchmarks, P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1), are indicated at the black dot-dash line (values shown in the legend). b P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1) as a function of the property dimension D for different datasets.

Investigating the correlations of property errors

As shown by the above analyses, obtaining MLIPs that simultaneously meet the benchmarks for a large number of properties would be difficult. During the multiple iterations of training and testing to improve MLIPs, evaluating the errors of these many properties is time-consuming and computationally expensive. To overcome this challenge, we here identify a reduced number of representative properties that are correlated with other properties in the MLIP predictions. (Note: here the term property generically refers to the error metrics of material properties and performance scores as listed in Table 1.) We first study the correlations of property errors by constructing the correlation graph of the property error metrics (defined in Eq. (3) (Methods)) for each MLIP type (Fig. 7a–d). For the analysis of all MLIP models (Fig. 7a), the error metrics of the energy ranking properties are highly correlated (green nodes in the lower center of Fig. 7a), and the errors of elastic constants and thermal properties are also correlated (as shown by the clustered orange and red nodes in the upper right corner of Fig. 7a). Based on the strong correlations of the error metrics in each cluster, i.e., the connections in the graph, we can select representative properties (Methods) to represent each cluster (nodes with red edges in Fig. 7). Further analyses are conducted to analyze the inter-dependence of these property errors and to identify the prediction relations using some property errors as input to predict others, as demonstrated in Supplementary Note 4 in Supporting Information (Supplementary Figs. 9–11, and Supplementary Table 12).
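
The correlation graphs in Fig. 7 connect two error metrics whenever the squared Pearson correlation of their values over the analyzed models exceeds 0.6; a minimal construction sketch (with hypothetical property names) is:

```python
import numpy as np
import networkx as nx
from scipy.stats import pearsonr

def correlation_graph(error_metrics, names, r2_cut=0.6):
    """Graph with one node per error metric; an edge connects two metrics whose
    Pearson r^2 across the MLIP models exceeds r2_cut."""
    g = nx.Graph()
    g.add_nodes_from(names)
    n_props = error_metrics.shape[1]
    for i in range(n_props):
        for j in range(i + 1, n_props):
            r, _ = pearsonr(error_metrics[:, i], error_metrics[:, j])
            if r ** 2 > r2_cut:
                g.add_edge(names[i], names[j], weight=r ** 2)
    return g

# Hypothetical data: 2300 models, 5 properties.
rng = np.random.default_rng(4)
g = correlation_graph(rng.random((2300, 5)),
                      ["E_f_vacancy", "C11", "S_bulk", "a_bulk", "sigma_F"])
```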

Fig. 7: The correlation graphs of error metrics for 60 properties.

The error correlation graphs of the properties from MLIP models of a all types, b GAP, c NNP, and d MTP on all 60 properties. Each node represents one error metric, and two nodes are connected if their correlation satisfies r2 > 0.6. The representative properties for each cluster are shown with red node edges.

The training and testing of the MLIPs can focus on these representative properties (Table 2). For example, for MTPs, three metrics, the \({P}_{{\rm{ranking}}\; {\rm{error}}}^{{\mathcal{D}}1}\), the maximum \({\Delta E}_{{\rm{DFT}}}^{{\mathcal{D}}1}\), and the mean \({\Delta E}_{{\rm{DFT}}}^{{\mathcal{D}}2}\), can well represent all 12 metrics in the energy ranking category (Fig. 7d). As a training strategy for developing MTPs with better performances on energy rankings of multiple defects, one may first focus on these three metrics. More generally, to develop MLIPs that can achieve the benchmarks for all properties, the training may initially focus on the 21 representative properties out of the original 60 properties, a reduction of roughly two-thirds (Table 2). The above analysis using correlation graphs can be repeated on the newly trained MLIP models to identify the representative quantities for the next iteration, as the representative properties and their quantities may change for MLIP models trained differently.

Table 2 The representative properties selected by their correlation graphs for different types of MLIPs

In summary, this analysis reveals the correlation of the errors of different properties predicted by MLIPs and identifies the representative properties, which can serve as guidance for further development and improvement of MLIPs.

Discussion

In this study, we propose and demonstrate an analysis process (Fig. 1) for evaluating the prediction errors of a large number of MLIP models across a vast array of properties. Our analysis process presents several key features, such as evaluating a large number of MLIP models to capture the statistical behavior of the MLIPs, constructing the Pareto fronts to identify challenging properties and their combinations, and revealing the correlations among the property errors of MLIP models. Our analysis process utilizes the many MLIP models generated in the validation pool, most of which would be discarded in a typical training process, and thus incurs little additional computation cost for generating models. By examining a large number of MLIP models, the analysis process can provide an understanding of and insights into the MLIP performances on a large array of properties, which are essential for guiding the further training of MLIPs.

Our study highlights the challenge of developing MLIPs that achieve good performances, or low errors, across multiple properties. As shown by the Pareto fronts, the optimal MLIP models may exhibit low errors in some properties but often show higher errors in many others. One might expect that this challenge can simply be overcome by selecting different MLIP models, with different sets of hyperparameters, that have lower average errors in energies and forces. However, the Pareto fronts from our analysis suggest that strong trade-offs exist among some properties, and these trade-offs cannot simply be addressed by choosing a different model with a different set of hyperparameters. The challenge that MLIPs can make accurate predictions only for certain properties, while failing to perform well across all properties, is often overlooked. While many studies of MLIPs focus on a limited set of properties, it is critical to also assess the joint performances of MLIPs on a broad range of properties.

This challenge of MLIPs can be further illustrated by the high-dimensional analysis of many properties, such as evaluating the high-dimensional Pareto fronts and clustering MLIP models by their joint performances on multiple property errors. As illustrated in our high-dimensional analysis, there is increasing difficulty in finding optimized MLIP models that meet all benchmarks as the number of properties increases, which is a type of ‘curse of dimensionality’ in MLIP development. Furthermore, the high-dimensional correlation analysis is demonstrated to identify key challenging properties, which would be valuable for guiding the training of better MLIPs. In our particular study case of Si, using specific choices of model types, training data, and hyperparameters, the challenging properties generally involve the formation energies of different defects, the energy rankings of different defect configurations, and the elastic constants of defected supercells. Overall, our analysis process can provide crucial information for overcoming the fundamental challenges of developing MLIPs that accurately predict a large array of properties.

For future MLIPs that may involve more elements, phases, structures, and defects, this challenge will be more pronounced, and our analysis process should be conducted to identify the critical properties, the number of which will further increase exponentially as we aim to capture material properties across the materials system. Given the large composition space and the many intermediate phases of multi-element systems25, sampling schemes covering a wide variety of relevant, representative phases and compositions in conjunction with a large number of properties need to be devised. Future studies are needed to extend this framework to more complex systems, e.g., high-entropy alloys. Moreover, our analysis framework can be applied to testing other MLIPs that are not covered in this work, e.g., MLIPs based on graph neural networks18,27,28.

It is critical to train reliable MLIP models that meet the performance benchmarks for a large number of material properties. To achieve that goal, our analysis process (Fig. 1) could serve as an essential component. In addition to the typical training process, the analysis process would be conducted on a pool of MLIP models from the validation steps to gain an understanding of the trained models, such as identifying the challenging properties, the challenging pairs/combinations, and the representative properties. This information would be used to guide the next round or iteration of training. For example, the data related to these representative challenging properties can be added or overweighted into the training dataset, or the corresponding scores/metrics based on these representative properties can be used in the validation step for the next round of training. Then, the analysis process can be performed for the new batch of trained models to check if improved performance is achieved. If necessary, additional rounds of training and testing can be iteratively performed, as the challenging and representative properties may change after the additional training process. Through this iterative training and analysis process, the MLIPs can be improved to give accurate predictions on many properties.

In conclusion, our analysis process has effectively identified key challenges in the accurate predictions of MLIPs across many properties as demonstrated in the model system of Si. This analysis process can be generally applied for the MLIP training in any materials and can be further developed. Given the current state of rapid adoption of MLIPs29,30,31, developing the processes of benchmarking and evaluating MLIP models is increasingly important. Our proposed analysis process can offer valuable guidance for future research and development of MLIPs with enhanced performance across a broad range of properties.

Methods

First-principles computation

DFT calculations were performed to compute the energies and forces of configurations and to relax structures. Since a majority of models were retrieved from ref. 8, all DFT calculations were performed as described in ref. 8, using the Vienna ab initio simulation package32 (VASP) with the projector augmented-wave approach. The generalized-gradient approximation (GGA) with Perdew-Burke-Ernzerhof33 (PBE) functionals was used. All true values of energies, forces, physical properties, and evaluation metrics were calculated with a 4 × 4 × 4 k-point mesh (K4), except for elastic constants, which were calculated with a 2 × 2 × 2 k-point mesh (K2). As specified in Section 2.1, most of the benchmarks of the properties were obtained by DFT calculations using a Γ-centered single k-point 1 × 1 × 1 (K1). All DFT calculations were spin-polarized, with an electronic relaxation convergence criterion of 10−5 eV, an energy cutoff of 520 eV, and other parameters set as in the Materials Project34,35.

AIMD simulation

AIMD simulations were performed to generate atomistic configurations for the training, validation, and testing datasets. The supercell models of bulk or defected Si for AIMD simulations have lattice parameters larger than 10 Å. AIMD simulations were non-spin-polarized, with an electronic energy convergence criterion of 10−4 eV and a time step of 2 fs. A Γ-centered single k-point of 1 × 1 × 1 was used for the AIMD simulations. All AIMD simulations to obtain migrating (RE) atoms were performed at 1000 K or 1230 K, following the same scheme described in the main text and in ref. 8. Each AIMD simulation began with an initial heating period in which the statically relaxed supercells were heated from 100 K to the final temperature using velocity scaling at a constant rate over 2 ps. Then, the AIMD simulations were conducted in the NVT ensemble with a Nosé-Hoover thermostat.

Constructing training, validation, and testing datasets

This study used a number of training datasets, \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}\; {\rm{al}}}.}\), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\), \({{\mathcal{D}}}^{2-{\rm{I}}}\), and \({{\mathcal{D}}}^{4-{\rm{I}}}\), a validation dataset, \({{\mathcal{D}}}^{{\rm{validation}}}\), and two testing datasets, \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\), each consisting of a diverse range of AIMD snapshots and bulk, defected, and distorted Si configurations. The training dataset \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) was adopted from ref. 1, and the other datasets were adopted from ref. 8, except for \({{\mathcal{D}}}^{2-{\rm{I}}}\) and \({{\mathcal{D}}}^{4-{\rm{I}}}\), which were generated in the same way as \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) as described below. All true values of energies and forces were DFT K4 calculation results.

The training dataset \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) consists of 219 configurations covering a wide range of Si structures, including solid Si, melted Si, distorted Si, thin slabs of surface Si, and Si bulks with a single vacancy, from AIMD simulations at different temperatures as described in ref. 1. To construct the other datasets as in ref. 8, 120 configurations from liquid Si, AIMD simulations of Si bulk, and the strained Si bulk in \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) were randomly selected and removed, and were replaced by 120 snapshots generated by AIMD simulations with different defects: \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) uses snapshots of Si bulk with a single vacancy; \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) uses snapshots of Si bulk with a single interstitial; \({{\mathcal{D}}}^{2-{\rm{I}}}\) uses Si bulk with two interstitials; and \({{\mathcal{D}}}^{4-{\rm{I}}}\) uses Si bulk with four interstitials. The \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) are the RE-enhanced and vacancy-enhanced training datasets in ref. 8, containing more RE interstitials and vacancies. Each of these training datasets has a total of 219 configurations, from solid Si, melted liquid Si, strained Si, Si surfaces, and Si defects (single vacancy, single interstitial, two interstitials, or four interstitials). To generate these configurations, the AIMD simulations were performed at 1000 K or 1230 K for defected supercells (single vacancy, split-<110> interstitial, tetrahedral interstitial, hexagonal interstitial, 2-interstitial, and 4-interstitial supercells) with −0.4%, 0.3%, 0.3%, 0.3%, 0.5%, and 1.6% lattice strains from the perfect Si bulk, respectively. All energies and atomic forces of the AIMD snapshots were converged using single-step self-consistent DFT K4 without relaxing atom positions or lattices. All snapshots contained at least one identified RE atom, defined as a migrating atom for which the difference between its distances to the 1st and 2nd nearest neighbor atoms is less than 0.75 Å (~31% of the distance between two nearest static sites), as in ref. 8.

The validation dataset \({{\mathcal{D}}}^{{\rm{validation}}}\) is the enhanced validation set constructed in ref. 8, consisting of 50 Si configurations. Twenty of them are randomly selected from the 120 structures removed from the original \({{\mathcal{D}}}^{{\rm{Zuo\; et\; al}}.}\), 11 configurations are AIMD snapshots with a single vacancy RE, and 19 configurations are AIMD snapshots with a single interstitial RE, both from AIMD simulations at 1230 K.

The two testing datasets, \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\), are the interstitial RE testing set and the vacancy RE testing set used in ref. 8. Each testing dataset consists of 100 AIMD snapshots at 1230 K with a single interstitial (for \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\)) or a single vacancy (for \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\)) as the identified RE.

Other performance datasets

Three performance datasets, \({{\mathcal{D}}}_{2300}\), \({{\mathcal{D}}}_{2300}^{48D}\), and \({{\mathcal{D}}}_{124}\) are used as shown in Fig. 6 and Supplementary Fig. 7. The \({{\mathcal{D}}}_{2300}\) is the full MLIP dataset with all 60 properties. The \({{\mathcal{D}}}_{2300}^{48D}\) is the subset of \({{\mathcal{D}}}_{2300}\) for 48 properties, excluding the 12 properties in the energy ranking category. The \({{\mathcal{D}}}_{124}\) is a dataset consisting of 124 MLIP models with 31 properties from ref. 8 (Supplementary Tables 10 and 11). The 124 MLIP models in \({{\mathcal{D}}}_{124}\) are trained by \({{\mathcal{D}}}^{{\rm{Zuo\; et\; al}}.}\), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\), or \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) using six MLIP types, such as GAP, NNP, DeePMD, DeepPot-SE, MTP, and SNAP.

Training MLIPs

We adopted the same approach to training MLIPs as in ref. 8 and in Zuo et al.1. The mlearn Python package and the corresponding MLIP source codes, namely QUIP for GAP4, N2P2 for NNP36, MLIP for MTP6,37, and the SNAP23 implementation embedded in LAMMPS, were used to train these models. We also used the DeePMD-kit package to train the DeePMD and DeepPot-SE models, following ref. 38 and ref. 2, respectively. During the validation step, many MLIP models with different hyperparameter values were obtained, following the same approach as in ref. 8. A grid search was adopted using two to ten values for each hyperparameter, including the number of radial basis functions and the band limit of spherical harmonic basis functions for GAP, the cutoff radius and the size of the neural network for NNP, the choice of radial basis function sets for MTP, and the cutoff radius and the number of iteration steps for DeePMD and DeepPot-SE. Each validation pool contains 194 to 2304 MLIP models with various hyperparameters, depending on the MLIP type and training dataset. In total, 25,327 models were generated across all our validation pools.

Optimizing MLIPs and selecting the hyperparameter sets

The validation scores of the trained MLIP models were calculated as follows. We first calculated the RMSEs of the energies and the forces for the corresponding training dataset, \({\sigma }_{{\rm{E}}}^{{\rm{train}}}\) and \({\sigma }_{{\rm{F}}}^{{\rm{train}}}\), and for the validation dataset \({{\mathcal{D}}}^{{\rm{validation}}}\), \({\sigma }_{{\rm{E}}}^{{\rm{validation}}}\) and \({\sigma }_{{\rm{F}}}^{{\rm{validation}}}\). These four RMSEs were then normalized to have similar distributions, so that lower values of these criteria indicate better performances, as

$${\sigma }_{k}^{{\mathcal{D}}{{{\prime} }}}=\frac{{\sigma }_{k}^{{\mathcal{D}}}-{({\sigma }_{k}^{{\mathcal{D}}})}_{\min }}{{({\sigma }_{k}^{{\mathcal{D}}})}_{{\rm{median}}}-{({\sigma }_{k}^{{\mathcal{D}}})}_{\min }},\,k\in \left\{E,\,F\right\}\,{\rm{and}}\,{\mathcal{D}}\in \{{\rm{train}},{\rm{validation}}\}$$
(4)

where \({\sigma }_{k}^{{\mathcal{D}}}\) is one of the four RMSEs on dataset \({\mathcal{D}}\), and \({({\sigma }_{k}^{{\mathcal{D}}})}_{\min }\) and \({({\sigma }_{k}^{{\mathcal{D}}})}_{{\rm{median}}}\) are the minimum and the median RMSE, respectively. The validation score was calculated as:

$${Score}=\sqrt{{\sum }_{k,{\mathcal{D}}}{[{\sigma }_{k}^{{\mathcal{D}}{\prime} }]}^{2}},$$
(5)

so that lower validation scores correspond to better MLIP performances.
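
A minimal sketch of Eqs. (4) and (5) for scoring one model from its four RMSEs is given below; the pooled minima and medians, and all numerical values, are hypothetical inputs.

```python
import numpy as np

KEYS = [("E", "train"), ("F", "train"), ("E", "validation"), ("F", "validation")]

def validation_score(rmses, pool_min, pool_median):
    """Validation score of Eqs. (4)-(5): normalize the four RMSEs by the minimum
    and median over the validation pool, then take the root of the sum of squares."""
    normed = [(rmses[k] - pool_min[k]) / (pool_median[k] - pool_min[k]) for k in KEYS]  # Eq. (4)
    return float(np.sqrt(np.sum(np.square(normed))))                                    # Eq. (5)

rmses = dict(zip(KEYS, [0.006, 0.11, 0.009, 0.15]))      # hypothetical RMSEs of one model
pool_min = dict(zip(KEYS, [0.004, 0.08, 0.005, 0.10]))   # hypothetical pool minima
pool_med = dict(zip(KEYS, [0.012, 0.20, 0.015, 0.25]))   # hypothetical pool medians
print(validation_score(rmses, pool_min, pool_med))
```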

Sampling MLIP models

We used the following two steps to sample a total of 2300 MLIP models for the analysis process in our study. If the pool size in the validation step of the training process was larger than 1000, the top 10% of MLIP models with the lowest validation scores were picked; then, according to the ranking of validation scores, 50 MLIP models with even ranking spacing were chosen. If the pool size was smaller than 1000, every second model (a total of 50) was chosen from the top 100 models with the lowest validation scores. After picking these optimized models, another 50 models were randomly selected from the remaining pool. This two-step selection was conducted for each type of MLIP and for each training dataset, yielding a total of 2300 sampled MLIP models.
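
The two-step sampling can be sketched as below for a single validation pool; `validation_scores` is a hypothetical array, and the returned indices would be collected over every MLIP type and training dataset.

```python
import numpy as np

def sample_pool(validation_scores, n_best=50, n_random=50, seed=0):
    """Two-step sampling of MLIP models from one validation pool."""
    rng = np.random.default_rng(seed)
    order = np.argsort(validation_scores)        # ascending: lowest score = best model
    pool_size = len(validation_scores)
    if pool_size > 1000:
        top = order[: pool_size // 10]                                   # top 10% by score
        best = top[np.linspace(0, len(top) - 1, n_best, dtype=int)]      # evenly spaced ranks
    else:
        best = order[:100][::2][:n_best]                                 # every 2nd of top 100
    rest = np.setdiff1d(order, best)
    random_pick = rng.choice(rest, size=n_random, replace=False)         # step 2: random picks
    return np.concatenate([best, random_pick])

picked = sample_pool(np.random.default_rng(5).random(1500))   # hypothetical pool of 1500 models
```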

Calculating different physical properties

The methods used to calculate the physical properties are described as follows. The supercells of Si bulk, single vacancy, and different Si interstitials, namely split-<110>, tetrahedral, and hexagonal interstitials, were relaxed by DFT K4. The true values of all properties, such as energies, forces, energy RMSEs, force RMSEs, lattice parameters, energy rankings, the NAC, and the force constants, were calculated by DFT K4, except for the elastic constants of bulk and defected Si. For the energies and forces of atoms from the AIMD snapshots, single-step self-consistent DFT K4 calculations were used without further relaxation. The same configurations from the DFT calculations were used for the testing of MLIPs, except for lattice parameters. The methods for calculating the properties in each category are described below.

Defect formation energy EfDefect

The supercells with defects were constructed using perfect Si bulk with 2 × 2 × 2 conventional unit cells (64 atoms) and then relaxed by DFT K4. The defects considered were the single vacancy and the single interstitial in split-<110>, tetrahedral, or hexagonal configurations. The defect formation energy EfDefect was calculated as

$${E}_{{\rm{f}}}^{{\rm{Defect}}}={E}^{{\rm{defect}}}-\frac{{N}^{{\rm{defect}}}}{{N}^{{\rm{bulk}}}}{E}^{{\rm{bulk}}},$$
(6)

where \({E}^{{\rm{defect}}}\) is the energy of defected configuration, \({E}^{{\rm{bulk}}}\) is the energy of the crystalline bulk supercell, \({N}^{{\rm{bulk}}}\) is the number of atoms in perfect bulk supercell, and \({N}^{{\rm{defect}}}\) is the number of atoms of defected supercell.
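Eq. (6) translates directly into code; the energies and atom counts below are hypothetical placeholders for DFT- or MLIP-computed values.

```python
def defect_formation_energy(e_defect, n_defect, e_bulk, n_bulk):
    """Defect formation energy of Eq. (6), referenced to the perfect bulk supercell."""
    return e_defect - (n_defect / n_bulk) * e_bulk

# Hypothetical example: a 64-atom bulk supercell and a 63-atom vacancy supercell.
print(defect_formation_energy(e_defect=-339.8, n_defect=63, e_bulk=-346.9, n_bulk=64))
```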

Elastic constants

The true values of elastic constants were calculated by the finite differences approach via a stress-strain relationship implemented in VASP (IBRION = 6 and ISIF = 3). The width of the displacement of each ion was set to 0.015 Å (POTIM = 0.015), and the number of the ionic displacements was set to 2 (NFREE = 2). Since the defect supercells require high computation costs due to low symmetry, the true values were calculated by DFT K2. For GAP, NNP, MTP, and SNAP, the functions implemented in the maml Python package39 were used to calculate elastic constants. For DeePMD and DeepPot-SE, the LAMMPS scripts from the maml package were utilized to compute the elastic constants.

Lattice parameter

The lattice parameters were calculated from the supercells relaxed by the MLIPs and compared to the true values obtained by DFT.

Energy rankings

The MLIP performances on the energy rankings of different orderings of multiple defects in Si were evaluated using the following four datasets: (1) 2-vacancies in Si bulk in the supercell with 3 × 3 × 3 conventional unit cells (214 atoms); (2) 3-vacancies in Si bulk in the supercell with 3 × 3 × 2 conventional unit cells (141 atoms); (3) 3-vacancies in Si bulk in the supercell with 3 × 2 × 2 conventional unit cells (93 atoms); (4) 3-vacancies in Si bulk in the supercell with 2 × 2 × 2 conventional unit cells (61 atoms). For each set, we randomly removed two to three Si atoms and generated up to 30 symmetrically distinct configurations (using the Pymatgen package34). All configurations had fixed lattices and atom positions, which were not relaxed.

The energy rankings were assessed by three metrics: the ranking error rate, \({P}_{{\rm{ranking}}\; {\rm{error}}}^{{\mathcal{D}}}\), for the dataset \({\mathcal{D}}\), and the mean and the maximum values of the energy differences \(\Delta\)EDFT of the ranking errors, the mean \({\Delta E}_{{\rm{DFT}}}^{{\mathcal{D}}}\) and the maximum \({\Delta E}_{{\rm{DFT}}}^{{\mathcal{D}}}\). Following the definitions and methods in ref. 25, the ranking error rates \({P}_{{\rm{ranking\; error}}}^{{\mathcal{D}}}\) were quantified by comparing all pairs of energies, in the same way as the calculation of the concordance index40. We calculated the energy difference \(\Delta\)EDFT as the difference between the DFT energies of the mismatched pair for each ranking error in the dataset \({\mathcal{D}}\) and identified the mean and the maximum \({\Delta E}_{{\rm{DFT}}}^{{\mathcal{D}}}\).
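
The ranking error rate is evaluated by comparing all pairs of configurations, in the same spirit as a concordance index; a minimal sketch with hypothetical energies is:

```python
import numpy as np
from itertools import combinations

def ranking_error_stats(e_dft, e_mlip):
    """Ranking error rate and the mean/maximum DFT energy gaps of mismatched pairs."""
    gaps = []
    pairs = list(combinations(range(len(e_dft)), 2))
    for i, j in pairs:
        # A ranking error occurs when the MLIP orders a pair of configurations
        # differently from DFT.
        if np.sign(e_mlip[i] - e_mlip[j]) != np.sign(e_dft[i] - e_dft[j]):
            gaps.append(abs(e_dft[i] - e_dft[j]))
    rate = len(gaps) / len(pairs)
    return rate, (np.mean(gaps) if gaps else 0.0), (np.max(gaps) if gaps else 0.0)

e_dft = np.array([0.00, 0.12, 0.31, 0.45])    # hypothetical relative energies (eV)
e_mlip = np.array([0.05, 0.02, 0.36, 0.40])
print(ranking_error_stats(e_dft, e_mlip))     # error rate, mean and max DFT gap of errors
```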

Rare events

The NAC metric proposed in ref. 8 was used to evaluate the magnitude and directional errors of the forces on RE atoms. The errors in the predicted forces on the RE atoms identified from \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\) were calculated and used to estimate the CDFs of the force magnitude errors, \({\delta }_{{\rm{F}}}\), and force directional errors, \({\delta }_{{\rm{\theta }}}\). The NACs of the CDF curves were then calculated within the error ranges of 0 < |\({\delta }_{{\rm{F}}}\)| < 1 eV Å−1 for the magnitude errors, NAC(|\({\delta }_{{\rm{F}}}\)|, \({\mathcal{D}}\)), and 0° < \({\delta }_{{\rm{\theta }}}\) < 60° for the directional errors, NAC(\({\delta }_{{\rm{\theta }}}\), \({\mathcal{D}}\)). The energy RMSEs \({\sigma }_{{\rm{E}}}^{{\mathcal{D}}}\) of \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\), and the force RMSEs \({\sigma }_{{\rm{F}}}^{{\mathcal{D}}}\) on the identified RE atoms from \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\), were also calculated.

E & F RMSEs

The energy and force RMSEs of all configurations and all atomistic forces were calculated using the \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) datasets.

Thermal properties

We calculated the thermal properties for both Si bulk and a supercell with 2 × 2 × 2 conventional unit cells containing a vacancy (63 atoms). The configurations with displaced atoms were generated according to the symmetry of the given supercell, and the atomistic forces and force constants were calculated. Then, the equation of state (EOS) curves from 0 to 1000 K (with 10 K spacing) were generated using Phonopy with a 40 × 40 × 40 q-point mesh. The curves of entropy versus temperature S(T), free energy versus temperature Efree(T), and heat capacity versus temperature c(T) were calculated using the Phonopy Python package41,42,43.

Calculating property errors

The errors of the physical properties were evaluated by comparing the MLIP predicted values to the true values by DFT in the same configurations. For lattice parameters, the supercell was fully relaxed by DFT and MLIPs to obtain the error.

For the EOS curves of thermal properties, the error metrics \({\Delta }_{{\rm{EOS}}}\) of the curves were calculated as

$${\Delta }_{{\rm{EOS}}}=\sqrt{\frac{\mathop{\sum }\nolimits_{i=1}^{{\rm{N}}}{[{X}^{a}({T}_{i})-{X}^{{\rm{K}}4}({T}_{i})]}^{2}}{N}},$$
(7)

where \({X}^{a}({T}_{i})\) denotes property X (either entropy S, free energy Efree, or heat capacity c) computed using method a (either MLIP or DFT K1) at temperature Ti, \({X}^{{\rm{K}}4}\) is the property X calculated by DFT K4, and N is the total number of temperatures on the curve.
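Eq. (7) is simply an RMSE over the sampled temperature grid; a minimal sketch with hypothetical curves is:

```python
import numpy as np

def eos_curve_error(x_model, x_dft_k4):
    """Error metric of Eq. (7): RMSE between a property-versus-temperature curve
    from an MLIP (or DFT K1) and the DFT K4 reference curve on the same grid."""
    x_model, x_dft_k4 = np.asarray(x_model), np.asarray(x_dft_k4)
    return float(np.sqrt(np.mean((x_model - x_dft_k4) ** 2)))

temps = np.arange(0, 1001, 10)                   # 0-1000 K with 10 K spacing
s_k4 = np.log1p(temps / 300.0)                   # hypothetical reference entropy curve
s_mlip = s_k4 + 0.02 * np.sin(temps / 150.0)     # hypothetical MLIP-predicted curve
print(eos_curve_error(s_mlip, s_k4))
```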

For the energy rankings, REs, and E & F RMSEs categories, the errors were the differences between these properties and their ‘ideal’ values, which were 1 for NAC force performance scores on RE atoms and 0 for all the others.

The benchmarks

The benchmarks were calculated by DFT K1, except for the following ‘handpicked’ benchmarks. The benchmarks of EfDefect were set to 0.1 eV, because the differences between DFT K4 and K1 were much larger. The benchmarks of energy RMSEs for \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\) datasets were set to 15 meV atom−1 in the REs category. The benchmarks of energy RMSEs and of force RMSEs for \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) datasets were set to 15 meV atom−1 and 0.12 eV Å−1, respectively, in the E & F RMSEs category.

Clustering MLIPs by their performances

The clustering graphs (Fig. 5) of the optimal MLIP models by the similarities in their joint performance were constructed as follows. First, we picked the properties from the target categories, e.g., all four properties in the EfDefect category and all eight properties in the REs category, and constructed six Pareto fronts, one for each type of MLIP. The Pareto fronts were constructed by enumerating all models following an established multi-objective optimization algorithm (see the repository in the “Code availability statement”). After collecting the optimal models on all six Pareto fronts, we used the standardized error metrics from Eq. (3) and calculated the similarities between each pair of MLIP models following Eq. (2). A grid search was conducted over 80 values of the similarity cutoff in the range from 10−2 to 102. For each similarity cutoff, the similarity graph was constructed by connecting pairs of MLIP models with similarities above the cutoff. The communities of MLIP models were identified using the Louvain community detection algorithm implemented in the networkx Python package. In the Louvain community detection algorithm, we used the calculated similarities as the edge weights on the graph and set the convergence threshold to 10−7. The similarity cutoff was 1.4 for the properties in the EfDefect category and 1.3 for the REs category in Fig. 5. The cutoff was chosen as the similarity cutoff with the highest adjusted Rand score (ARS) when matching the detected communities to the labels of the different types of models. The ARS is a clustering performance metric, implemented in the scikit-learn package, which evaluates how closely the predicted labels match the reference labels.
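
A minimal sketch of this clustering step, assuming a precomputed similarity matrix from Eq. (2) and the networkx Louvain implementation, is shown below; the cutoff value is illustrative rather than the ARS-selected one.

```python
import numpy as np
import networkx as nx

def cluster_models(similarity, cutoff=1.3):
    """Connect optimal MLIP models whose similarity exceeds the cutoff and detect
    communities with the Louvain algorithm, using similarities as edge weights."""
    n = similarity.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i, j] > cutoff:
                g.add_edge(i, j, weight=similarity[i, j])
    return nx.community.louvain_communities(g, weight="weight", threshold=1e-7, seed=0)

# Hypothetical similarity matrix for 50 optimal models with 4 error metrics.
rng = np.random.default_rng(6)
z = rng.random((50, 4))
dist = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
communities = cluster_models(1.0 / np.where(dist > 0, dist, np.inf))
```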

The HV and the IGD scores of the Pareto fronts

The HV and the IGD scores of the Pareto fronts were calculated using the hypervolume function from pygmo package and the equations provided in ref. 44, respectively. The reference points for calculating HV and IGD scores were the minimum values of the selected properties among all optimal models. The HV score measures the space between the Pareto fronts and the reference points, and IGD is the closest Euclidean distance between the optimal MLIP models on the Pareto fronts and the reference points. The lower scores of the Pareto fronts indicate better performances of the MLIP models.
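
With a single reference point per front, the IGD used here reduces to the shortest Euclidean distance from that point to the Pareto front; a minimal sketch is below (the HV was computed with the pygmo hypervolume routine and is not reproduced here).

```python
import numpy as np

def igd_score(front, reference_point):
    """IGD as used here: the shortest Euclidean distance between the reference
    point (per-property minima over the optimal models) and the Pareto front."""
    front = np.asarray(front)
    return float(np.min(np.linalg.norm(front - np.asarray(reference_point), axis=1)))

front = np.array([[0.1, 0.9], [0.3, 0.4], [0.8, 0.2]])   # hypothetical 2D front
reference = front.min(axis=0)                            # per-property minima
print(igd_score(front, reference))
```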

Constructing the correlation graphs of property errors and selecting representative properties

To construct the correlation graphs, Pearson’s correlation coefficient r between each pair of properties was calculated using the scipy package. The correlation graphs (Fig. 7 and Supplementary Fig. 8) were constructed by connecting the pairs of properties with high correlations of r2 > 0.6. The representative properties of the correlation graphs in Fig. 7 were selected as a minimal set of vertices (properties) that connect to all other vertices in the graph. A recursive procedure involving the following three sub-steps was used for the selection (a code sketch follows at the end of this subsection):

(1) Isolated Vertex Screening: We first identified isolated vertices (properties) in the graph. These isolated vertices were added directly to the final set of representative properties, as no other connections would cover them.

(2) Vertex Ranking and Selection: Next, we ranked all connected vertices in the graph using the PageRank algorithm (via the networkx package), with the correlations (r²) serving as weights. The algorithm parameters were set to a maximum of 100 iterations, a damping parameter (alpha) of 0.85, and an error tolerance of 10−6. The highest-ranked vertex was then added to the set of representative properties, and both this vertex and any vertices connected to it were removed from the graph.

(3) Recursion: The remaining graph was passed to the next recursion step.

This process was repeated recursively until no vertices remained or all remaining vertices in the graph were isolated. The final set of representative properties comprises all isolated vertices in the original graph and the highest-ranked vertices from each recursion step.
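
A sketch of this recursive selection, assuming a networkx correlation graph with r² edge weights as constructed above, is given below; the toy graph and property names are hypothetical.

```python
import networkx as nx

def representative_properties(graph):
    """Select representative properties: isolated vertices are kept directly,
    then the highest-PageRank vertex is selected and its neighbors removed,
    and the procedure recurses on the remaining graph."""
    g = graph.copy()
    selected = []
    while g.number_of_nodes() > 0:
        isolated = [n for n in g if g.degree(n) == 0]     # sub-step 1: isolated vertex screening
        selected.extend(isolated)
        g.remove_nodes_from(isolated)
        if g.number_of_nodes() == 0:
            break
        # Sub-step 2: rank the remaining vertices with PageRank, r^2 as edge weights.
        ranks = nx.pagerank(g, alpha=0.85, max_iter=100, tol=1e-6, weight="weight")
        best = max(ranks, key=ranks.get)
        selected.append(best)
        g.remove_nodes_from(list(g.neighbors(best)) + [best])  # sub-step 3: recurse on the rest
    return selected

# Hypothetical toy graph: two correlated clusters and one isolated property.
g = nx.Graph()
g.add_weighted_edges_from([("C11", "C12", 0.9), ("C12", "S_bulk", 0.8),
                           ("E_f_vacancy", "E_f_hexagonal", 0.7)])
g.add_node("sigma_F")
print(representative_properties(g))
```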