Introduction

As a promising technique for atomistic modeling, a machine learning interatomic potential (MLIP) uses machine learning (ML) models to predict the energies and atomistic forces of atomic structures by training on data from density functional theory (DFT) calculations. Current state-of-the-art MLIPs are reported to reach near DFT-level accuracies as measured by small values of the root-mean-square error (RMSE) and mean absolute error in total energies and atomistic forces1,2,3,4,5,6,7. However, despite the widely reported low average errors in total energies and atomistic forces, a variety of discrepancies have been observed in atomic dynamics and physical properties, including defect structures, defect formation energies8,9, atom vibrations, migration barriers10,11, and erroneous forces12 during MLIP-based molecular dynamics (MD) simulations. These discrepancies can be attributed to the atom configurations, such as defects and transition states, that determine the properties8,9 but are significantly underrepresented in the training and testing datasets. The testing dataset is dominated by near-equilibrium atom configurations and thus yields the low average errors widely reported in the literature8,12. Identifying potential discrepancies and errors of MLIPs in the predictions of properties or atom dynamics in atomistic simulations is a crucial step toward guiding further improvements of MLIPs.

It is time-consuming and computationally expensive to study these errors and discrepancies of MLIPs, because obtaining the physical properties derived from atom dynamics, and their errors, requires running both ab initio- and MLIP-based MD simulations over long periods of time. The use of error metrics that are well correlated with these properties and their errors is an effective strategy to mitigate this issue. Liu et al.8 proposed the forces on rare event (RE) atoms as an error metric for diffusional properties and demonstrated that the MLIPs scoring higher on the proposed metrics showed significantly improved accuracy in predicting diffusional properties in MD simulations. A significant advantage of employing such error metrics is that the MLIPs’ errors on these properties can be assessed without running time-consuming MD simulations, which greatly accelerates the iterations of validation and testing for improving the MLIPs8,13. To overcome these errors and discrepancies of MLIPs, the development of effective error metrics correlated with these properties is highly desired.

For the applications of MLIPs in atomistic modeling research14,15,16,17,18, it is also critical for MLIPs to accurately predict a wide range of physical phenomena and properties19. The properties commonly encountered in atomistic modeling include defects, REs, diffusion, phonons, thermal conduction, and elastic moduli. Each category of properties includes a number of quantities and metrics. It is computationally expensive to test the MLIPs on these many properties. Obtaining MLIPs that simultaneously reproduce a large number of properties with small errors would be difficult and time-consuming, as multiple iterations of training and testing are often required.

To overcome this challenge, we here propose to identify a reduced number of representative properties that are correlated with other properties in their MLIP prediction errors, by studying the correlations among a large array of different property errors from numerous MLIP models. Since many material properties are physically correlated19, the training, validation, and testing of the MLIPs can focus on this reduced number of representative properties to achieve improved joint performance simultaneously across all properties. For example, mechanical properties such as elasticity are often found to be correlated with thermal properties because both stem from the strength of interatomic bonding20,21, and it is reasonable to expect that the errors in mechanical and thermal properties may also be correlated. Given the limited knowledge and studies on MLIP errors across a variety of properties, there is an urgent need to systematically examine and understand the errors of many properties and their underlying correlations to guide the future development of MLIPs toward good joint performance on many properties.

In this study, we propose an analysis process comprising the data generation, sampling, evaluation, benchmarking, and multi-dimensional statistical analyses of MLIPs’ joint performances on a diverse range of properties (Fig. 1). Given the complex behaviors of MLIP models based on high-dimensional functions and descriptors, it is important to analyze a large number of MLIP models and establish the statistical correlations of their behaviors, as results based on only a few optimal MLIP models may suggest biased correlations (see Supplementary Note 1 in Supporting Information). To achieve this, our proposed analysis process samples a large number of MLIPs from the validation pool of MLIP models generated by hyperparameter tuning during the validation steps of the typical MLIP training process (Fig. 1a). Thus, no additional computation cost is incurred to sample MLIP models from the existing validation pool. Then the errors of many physical properties are evaluated for each sampled MLIP model, and the performance data of these many sampled models are obtained in comparison to the corresponding benchmarks (Section 2.1) (Fig. 1b). These performance data of many MLIP models are then further analyzed. These analyses include: identifying the properties that are difficult for the MLIPs to predict accurately (Section 2.2) and their pairs (Section 2.3) (Fig. 1c), analyzing the high-dimensional Pareto fronts of multiple property errors for the joint performances of MLIPs (Sections 2.4, 2.5), analyzing the statistical correlations of MLIP models (Section 2.5) and property errors (Section 2.6), and identifying the representative properties (Section 2.6) for more efficient training and testing of MLIPs (Fig. 1d).

Fig. 1: The analysis processes.

a The steps of MLIP training. b Many MLIP models are sampled from the pool generated during the validation steps, and their performance on many properties is evaluated. These performance data of many MLIP models are analyzed c to identify challenging properties and d by the high-dimensional correlation analyses.

Results

Analysis process of MLIPs and error metrics

In this study, we implement and conduct this proposed analysis process on current state-of-the-art MLIPs, including the Gaussian Approximation Potential (GAP), Neural Network Potential (NNP), Moment Tensor Potential (MTP), Spectral Neighbor Analysis Potential (SNAP), Deep Potential (DeePMD), and Deep Potential-Smooth Edition (DeepPot-SE)2,3,4,6,22,23,24, evaluating their performances on a wide array of properties, using Si as a study case. We analyze the performance of 2300 MLIP models (the MLIP training and generating performance data of many MLIP models, Fig. 1) of these six MLIP types trained for silicon in the previous study8 on five different training datasets, namely \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) from ref. 1, \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) (the RE-enhanced training dataset in ref. 8), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) (the vacancy-enhanced training dataset in ref. 8), \({{\mathcal{D}}}^{2-{\rm{I}}}\), and \({{\mathcal{D}}}^{4-{\rm{I}}}\) (Methods) (Fig. 2a). Every training dataset contains 219 configurations over a wide range of selected structures of solid Si, liquid Si, strained or distorted Si, Si surfaces, and bulk Si with a single vacancy from ab initio molecular dynamics (AIMD) simulations (Methods). To increase the data diversity among training datasets, \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\), \({{\mathcal{D}}}^{2-{\rm{I}}}\), and \({{\mathcal{D}}}^{4-{\rm{I}}}\) are generated by replacing 54.7% (120 configurations) of the structures in \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) with configurations of Si bulk containing a single interstitial, di-interstitials, and four interstitials, respectively, from the AIMD simulations (Methods).

Fig. 2: Summary of MLIP models and properties.

a The distribution of 2300 MLIP models of six different types and trained by five different datasets, \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) (green), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) (red), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) (purple), \({{\mathcal{D}}}^{2-{\rm{I}}}\) (orange), and \({{\mathcal{D}}}^{4-{\rm{I}}}\) (blue). b The 60 calculated properties, performance scores, and evaluation metrics over seven categories and the corresponding benchmarks (colored, Table 1), including 30 properties using the differences between DFT K4 and K1, \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\), as benchmarks (blue), 20 properties in elastic constants using the differences between DFT K2 and K1, \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}2}\), as benchmarks (yellow), and 10 handpicked values as benchmarks (purple).

The 2300 MLIP models are selected from the validation pools in the validation steps of the conventional training (Fig. 1 and Supplementary Table 1). The first half of the 2300 models comprises the MLIP models with the lowest validation scores (i.e., the lowest errors) calculated on our validation dataset (Methods). The other half of the MLIP models is randomly selected from the rest of the validation pool, excluding those already selected in the first half. This second half of MLIP models may not exhibit low validation scores but may show good predictions for certain properties. Given the complex inter-dependence between the hyperparameters and the performance on multiple properties, this random sampling of MLIP models over the hyperparameter space can provide a more comprehensive understanding of the multi-property performance of MLIP models (see Supplementary Note 1 in Supporting Information).

In the next step of the analysis process (Fig. 1b), we evaluate each MLIP model for a range of material properties and errors (e.g., RMSEs in the predictions of energies and forces), which are collectively termed ‘properties’ in the remaining text. As summarized in Table 1 and Fig. 2b, these properties include (1) the formation energies EfDefect of four different types of point defects, namely vacancy, split-<110> interstitial, tetrahedral interstitial, and hexagonal interstitial, (2) the elastic constants of bulk crystalline Si and the Si supercells with a point defect, (3) the lattice parameters of bulk crystalline Si and the Si supercells with a point defect, (4) the energy rankings (as defined in ref. 25) with multiple vacancies for four sets of Si configurations, (5) the free energy Efree, the entropy S, and the heat capacity of the bulk crystalline Si and the Si supercells with a vacancy, and (6) other properties based on the performance scores and evaluation metrics (Table 1) as in ref. 8, including the magnitude and directional errors of forces on RE atoms in interstitial or vacancy diffusion, the normalized area of curve (NAC) NAC(\(\delta\), \({\mathcal{D}}\)RE), the RMSEs of energies and forces on RE atoms, such as \({\sigma }_{{\rm{E}}}^{{\rm{RE}}-{\rm{I}}}\), \({\sigma }_{{\rm{F}}}^{{\rm{RE}}-{\rm{I}}}\), \({\sigma }_{{\rm{E}}}^{{\rm{RE}}-{\rm{V}}}\), and \({\sigma }_{{\rm{F}}}^{{\rm{RE}}-{\rm{V}}}\), evaluated on the \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\) datasets, and \({\sigma }_{{\rm{E}}}^{{\rm{enhanced}}-{\rm{I}}}\), \({\sigma }_{{\rm{F}}}^{{\rm{enhanced}}-{\rm{I}}}\), \({\sigma }_{{\rm{E}}}^{{\rm{enhanced}}-{\rm{V}}}\), and \({\sigma }_{{\rm{F}}}^{{\rm{enhanced}}-{\rm{V}}}\), evaluated on the \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) datasets.

Table 1 Summary of the 60 error metrics, their corresponding benchmarks, and the fractions of MLIP models that meet the benchmarks

In order to have a fair comparison among these different properties, the ‘error metric’ \(\delta M\) for each property is normalized as

$$\delta M={\log }_{10}|{P}_{{\rm{predicted}}}-{P}_{{\rm{ideal}}}|$$
(1)

where Ppredicted is the property predicted by the MLIP, and Pideal is the ideal value. For most properties, Pideal is given by DFT calculations using a k-point mesh of 4 × 4 × 4 (DFT K4), except for the 20 elastic constants calculated using a k-point mesh of 2 × 2 × 2 (DFT K2) due to the prohibitively high computation cost of DFT K4 for the large supercells of defected structures (Fig. 2b). The Pideal values are 1 for the performance scores, such as the NAC force scores NAC(\(\delta\), \({\mathcal{D}}\)), and 0 for the RMSEs of energies or forces (Methods). Lower values of these normalized ‘error metrics’ indicate more accurate predictions of MLIPs.
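
As a minimal illustration, Eq. (1) can be evaluated in a few lines of Python; the property values below are hypothetical placeholders rather than results from this work.

```python
import numpy as np

def error_metric(p_predicted: float, p_ideal: float) -> float:
    """Normalized error metric of Eq. (1): log10 of the absolute deviation of
    the MLIP-predicted property from its ideal value (DFT K4, or 1/0 for scores)."""
    return float(np.log10(abs(p_predicted - p_ideal)))

# Hypothetical examples: a defect formation energy with a DFT K4 reference value,
# and an NAC performance score whose ideal value is 1 by definition.
print(error_metric(p_predicted=3.72, p_ideal=3.65))  # log10(0.07) ~ -1.15
print(error_metric(p_predicted=0.91, p_ideal=1.00))  # log10(0.09) ~ -1.05
```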

The benchmarks of the error metrics are set to identify which properties are challenging (Table 1 and Fig. 2b). For most properties, the benchmarks are set by the differences \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) between DFT K4 and DFT calculations with a single Γ-point (DFT K1) using a ~1000 Å3 supercell (except for elastic constants, which use \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}2}\)). DFT K1 is used to set the benchmarks for two reasons: (1) it is easy to compute, unlike experimental values that may not be readily available, and (2) it is widely adopted in AIMD simulations. Therefore, the DFT K1 benchmarks \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) provide practical baselines of MLIP accuracy for further simulations. The MLIP models that meet these benchmarks may be considered viable substitutes for DFT or AIMD simulations at greatly reduced computation costs. A number of benchmarks are manually set and labeled as ‘handpicked’ (Table 1 and Fig. 2b): 15 meV atom−1 for all energy RMSEs \({\sigma }_{{\rm{E}}}\); 0.12 eV Å−1 for some force RMSEs (Table 1); and 0.1 eV for EfDefect (because even DFT K1 calculations produce errors as large as 1 eV in EfDefect) (see “The benchmarks” section in Methods).

Identifying challenging properties of MLIPs

Among all properties, we first identify those that are difficult for the MLIPs to predict accurately (Fig. 1c). A property is identified as ‘challenging’ if fewer than 15% of MLIP models in the sampled pool achieve errors below the benchmark. The analysis reveals 35 challenging properties (Supplementary Table 2), mostly in four categories: (1) the defect formation energies EfDefect, with only 2.8–5% of MLIP predictions achieving the 0.1 eV benchmark, (2) 13 metrics in the elastic constants, with only 1–13% of MLIPs achieving the \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}2}\) benchmarks, (3) all 12 metrics for energy rankings, with fewer than four MLIP models, or 0.2%, achieving the \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) benchmarks for each metric, and (4) five metrics for the forces on RE atoms in the REs category (Fig. 3a–d and Supplementary Table 2), in addition to the entropy of Si bulk, Sbulk(T). The energy rankings of multi-defect configurations are particularly challenging for MLIPs, as even the models with the lowest errors lie far above the benchmark \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) (Fig. 3a). For the ranking error rate metric \({P}^{{\mathcal{D}}1}\)ranking error evaluated on the ranking dataset \({\mathcal{D}}\)1 of two vacancies in the supercells with 214 atoms (Table 1), the top 10% of MLIP models (except for two outliers) have ranking error rates in the range of 18–38%, compared to the \({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) benchmark of 7.6% (Fig. 3a). The ranking error rates of multi-defects in Si are much higher than those of the elemental orderings (mostly below 10%) in Li-Al intermediate phases studied in our previous study25. In summary, our analysis of a large number of MLIP models based on different MLIP types and training datasets identifies energy rankings, EfDefect, elastic constants, and force predictions on RE atoms as challenging properties for MLIPs to predict with small errors.
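
The ‘challenging’ label follows directly from the fraction of sampled models whose error falls below the benchmark. A minimal sketch of this screening, with hypothetical arrays standing in for the real performance data and benchmark values, is:

```python
import numpy as np

def challenging_properties(errors, benchmarks, names, threshold=0.15):
    """Return the properties for which fewer than `threshold` of the sampled
    MLIP models achieve an error below the benchmark.

    errors: (n_models, n_properties) array of |P_predicted - P_ideal|
    benchmarks: (n_properties,) array of benchmark values
    """
    fractions = (errors < benchmarks).mean(axis=0)  # fraction of models meeting each benchmark
    return {n: f for n, f in zip(names, fractions) if f < threshold}

# Hypothetical toy data: 1000 models and 3 properties.
rng = np.random.default_rng(0)
errors = rng.lognormal(mean=-1.0, sigma=1.0, size=(1000, 3))
benchmarks = np.array([0.10, 0.05, 1.0])
print(challenging_properties(errors, benchmarks, ["E_f_vacancy", "C11", "sigma_E"]))
```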

Fig. 3: ‘Challenging’ properties of MLIPs.

The cumulative distribution function (CDF) curves of the sampled MLIP models for a the ranking error rates \({P}^{{\mathcal{D}}1}\)ranking error, b the NAC of errors on force directions, |\(\Delta {NAC}({\delta }_{{{\theta }}},{{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}})\)|, c the error of the formation energy of hexagonal interstitial, \(|\delta {E}_{{\rm{f}}}^{{\rm{hexagonal}}}|\), and d the error of the elastic constant C12 for the supercell with hexagonal interstitial, \(|\delta {C}_{12}^{{\rm{hexagonal}}}|\). The benchmarks of the property errors are shown in red dashed lines (\({\Delta }_{{\rm{DFT}}}^{{\rm{K}}1{\rm{to}}\; {\rm{K}}4}\) shown in black dashed line if different), and the values show the fractions below the benchmarks.

The Pareto front of MLIPs: the trade-off between property errors

We analyze how accurately MLIPs can predict two properties simultaneously by constructing the Pareto fronts of the MLIPs from their error metrics for pairs of properties (Fig. 1c). As a concept widely used in multi-objective optimization26, the Pareto front consists of a set of optimal data points in which no objective can be improved without degrading some of the other objectives. Here, our Pareto fronts consist of a set of MLIP models, which generally represent some of the best-performing models for different property combinations (Methods). Pairs of properties from the 48 properties are examined, excluding the energy rankings of multi-defects given their high errors. The models at the vertices of the Pareto fronts are called the optimal MLIP models (Fig. 4a, b, and Supplementary Figs. 2–4). The Pareto fronts show the trade-offs for the optimal MLIP models in predicting the two properties: the MLIP models with lower errors on one property have higher errors on the other. This trade-off poses challenges for training MLIP models to achieve high accuracies for both properties.
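
For a pair of error metrics, the Pareto front reduces to the set of non-dominated models under minimization of both errors. The sketch below illustrates the idea; the actual fronts in this work were built by enumerating all models with an established multi-objective optimization routine (Methods).

```python
import numpy as np

def pareto_front_2d(errors):
    """Indices of non-dominated MLIP models for two error metrics (both minimized).
    errors: (n_models, 2) array."""
    order = np.argsort(errors[:, 0])      # scan models by increasing first error
    front, best_second = [], np.inf
    for i in order:
        if errors[i, 1] < best_second:    # keep models that improve the second error
            front.append(i)
            best_second = errors[i, 1]
    return np.array(front)

# Hypothetical example: errors of 500 models on two properties.
rng = np.random.default_rng(1)
err = rng.random((500, 2))
print(len(pareto_front_2d(err)), "optimal models on the 2D Pareto front")
```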

Fig. 4: Challenging property pairs shown by the Pareto fronts.

The scatter plots of the MLIP models for two property errors: a |\(\Delta {NAC}(|{\delta }_{{\rm{F}}}|,{{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}})\)| versus \(|\delta {E}_{{\rm{f}}}^{{\rm{tetrahedral}}}|\), and b \(|\delta {C}_{11}^{{\rm{hexagonal}}}|\) versus \(|\delta {E}_{{\rm{f}}}^{{\rm{vacancy}}}|\), with the Pareto fronts shown in the black lines. The benchmarks of the property error are shown in red dashed lines.

The Pareto front analyses also reveal pairs of properties for which it is difficult to simultaneously achieve both benchmarks. According to our analysis, there are 20 property pairs for which no optimal MLIP model can simultaneously meet or exceed the benchmarks for both properties (Supplementary Table 3). These challenging pairs include seven pairs in EfDefect–elastic constants, ten pairs in EfDefect–REs, and three pairs in EfDefect–E & F RMSEs (Supplementary Table 3), which are drawn from the four categories of defect formation energy EfDefect, elastic constants, REs, and E & F RMSEs. Many of these are among the challenging properties identified in Section 2.2. As shown by these analysis results, identifying the property combinations that are difficult to simultaneously meet the benchmarks, and their trade-offs, can provide valuable guidance for improving the training of MLIPs.

The high-dimensional Pareto fronts for multi-property performance of MLIPs

Here, we further examine the joint performances of MLIPs on the prediction errors of multiple properties (i.e., in dimensions higher than 2D). Achieving low errors on many properties simultaneously is expected to be even more difficult than on two properties. First, we study the joint performances using all four properties in the EfDefect category and all eight properties in the REs category (Table 1), which are mostly challenging properties identified in Section 2.2. For each type of MLIP, we construct a high-dimensional Pareto front on all properties (e.g., a 4D Pareto front for the EfDefect category and an 8D Pareto front for the REs category, shown as the last rows in Supplementary Tables 4–7) and obtain the optimal MLIP models. For all these optimal models of all six types of MLIPs, we calculate their similarities based on the error metrics of the properties and cluster the MLIP models that have similar joint performances by applying a graph clustering algorithm, the Louvain community detection algorithm (Methods). The similarity of MLIP model i and MLIP model j is measured by the inverse of the Euclidean distance between their vectors of error metrics,

$${Similarity}(i,\,j)=\frac{1}{\sqrt{\sum _{k}{({{\delta M}_{i}^{k}}^{{\prime} }-\delta {{M}_{j}^{k}}^{{\prime} })}^{2}}},$$
(2)

where the standardized value of error metric \({\delta M}_{i}^{{\prime} }\) for MLIP model i is calculated as

$${\delta M}_{i}^{{\prime} }=\frac{{\delta M}_{i}- < \delta M > }{{\sigma }_{{\rm{\delta }}{\rm{M}}}}$$
(3)

where <\(\delta\)M> is the mean value and \({\sigma }_{{{\delta }}{\rm{M}}}\) is the standard deviation of the error metric \(\delta M\).
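
A compact way to evaluate Eqs. (2) and (3) over a pool of models is sketched below; the error matrix is a hypothetical stand-in for the real performance data, and the standardization is taken over whatever set of models is passed in.

```python
import numpy as np

def similarity_matrix(error_metrics):
    """Pairwise similarities between MLIP models following Eqs. (2)-(3).
    error_metrics: (n_models, n_properties) array of error metrics deltaM."""
    # Eq. (3): standardize each error metric over the supplied models.
    z = (error_metrics - error_metrics.mean(axis=0)) / error_metrics.std(axis=0)
    # Eq. (2): inverse Euclidean distance between standardized metric vectors.
    diff = z[:, None, :] - z[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    with np.errstate(divide="ignore"):
        return 1.0 / dist                 # diagonal becomes inf (self-similarity)

rng = np.random.default_rng(2)
sim = similarity_matrix(rng.random((100, 8)))  # e.g., 100 optimal models, 8 REs metrics
```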

These clusters of MLIP models based on the similarities of the error metrics allow the analysis of the joint performances of MLIPs on many properties in high-dimensional space. Generally, the models that are clustered together exhibit similar errors on the same properties. For the MLIP models that do not belong to any cluster, their errors vary across properties, i.e., some models have high errors on some properties, while other models have high errors on others. For the clustering based on the error metrics of the defect formation energy category, 359 optimal MLIP models of all types are clustered into four major communities (more than 10 MLIP models for each community): one mainly of GAPs, one of SNAPs, and two communities of DeepPot-SE models (Fig. 5e and Supplementary Fig. 6, see Supplementary Note 3 in Supporting Information), along with scattered models based on NNP, DeePMD, and MTP. The scattering of the NNP and DeePMD models in the similarity clustering based on the defect formation energy category suggests that NNPs and DeePMDs make good predictions only for some properties in the defect formation energy category but perform poorly on others.

Fig. 5: The clustering and the performances of MLIPs in different property categories.

The optimal MLIP models on the Pareto fronts for the property pairs of b \(|\delta {E}_{{\rm{f}}}^{{\rm{vacancy}}}|\) versus \(|\delta {E}_{{\rm{f}}}^{{\rm{hexagonal}}}|\), d \(|\delta {E}_{{\rm{f}}}^{{\rm{vacancy}}}|\) versus \(|\delta {E}_{{\rm{f}}}^{{\rm{tetrahedral}}}|\), g |\(\Delta {NAC}(|{\delta }_{{\rm{F}}}|,{{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}})\)| versus |\(\Delta {NAC}(|{\delta }_{{\rm{F}}}|,{{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}})\)|, and i \(|{\sigma }_{{\rm{F}}}^{{\rm{RE}}-{\rm{V}}}|\) versus |\(\Delta {NAC}(|{\delta }_{{\rm{F}}}|,{{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}})\)|. The minimum values of error metrics are indicated by black lines, and their cross points are the reference points to calculate the hypervolume (HV) and the inverted generational distance (IGD) scores. The fraction of the optimal MLIP models for each type of MLIPs with the lowest and the 2nd lowest a HV scores and c IGD scores on all property pairs in the defect formation energy category and f, h in the REs category. The clusters of optimal MLIP models connected by the similarities (Methods) based on e all four properties in the defect formation energy category and j all eight properties in the REs category.

To quantitatively measure and compare the joint performances of MLIPs on many properties in high-dimensional space, we employ the hypervolume (HV) and inverted generational distance (IGD) scores, which are commonly used to evaluate Pareto fronts. The HV is calculated as the total volume between the Pareto fronts and the reference points, which are the points with the minimum error metric values for each property on the Pareto front (Fig. 5). The IGD is the closest Euclidean distance between the error metric vectors of the Pareto fronts and the reference points (Methods). The Pareto fronts closer to the lower left corner of Fig. 5a, b, h, i have lower HV and IGD scores, indicating lower errors and better performances. Among the property pairs in the EfDefect category, the Pareto fronts of the NNP models have the lowest HV scores and the lowest IGD scores in 83% and 67%, respectively, of all property pairs, and the Pareto fronts of the DeePMD models account for 67% of the 2nd lowest HV and 50% of the 2nd lowest IGD scores (Fig. 5c, d, Supplementary Fig. 2, Supplementary Table 4, and Supplementary Table 5). However, NNP and DeePMD only perform well for two properties at a time based on the two-dimensional HV and IGD scores, while they perform poorly over many properties (D > 2), given the scattering of the NNP and DeePMD models in the similarity clustering based on the EfDefect category (Fig. 5e). Indeed, none of the 35 optimal NNPs on the 4D Pareto front in the EfDefect category meets all benchmarks (Supplementary Table 9). Therefore, combining these results with the clustering analyses above reveals critical insights into the performance of MLIP models on multiple properties.

For the clustering of the REs category, 533 optimal MLIPs are clustered into seven major communities (Fig. 5j and Supplementary Fig. 6). In addition, 86% of the Pareto fronts of MTPs have the lowest HV, 79% of GAPs have the 2nd lowest HV, 68% of MTPs have the lowest IGD score, and 57% of GAPs have the 2nd lowest IGD for the property pairs in the REs category (Fig. 5h, i, Supplementary Fig. 3, Supplementary Table 6, and Supplementary Table 7). As indicated by the results of these two analyses, many MTP and GAP models show good joint performances in predicting the energies and forces on RE atoms. Indeed, a total of 20 optimal MTP models meet all REs benchmarks (Supplementary Table 8). More analyses, including the HV and IGD scores for different types of MLIPs in the elastic constants category, are given in Supplementary Note 2 in Supporting Information (Supplementary Figs. 4 and 5).

As illustrated in this section, the combined analyses of quantitative scores of the high-dimensional Pareto fronts and the clustering of error similarities reveal insights into the statistical behavior of the many MLIP models on a large range of property errors. Such analysis can guide the selection of MLIP models and the next round of training and testing.

‘Curse of dimensionality’ on the joint performances of MLIPs

In this analysis, we construct the high-dimensional Pareto fronts with different numbers D of properties in all their possible combinations, to understand the effect of dimensionality on the joint performances of MLIP models. We count the optimal MLIP models that meet all benchmarks, \({N}_{{\rm{optimal}}}^{{\rm{all}}}\), for different numbers D of properties. Low values of \({N}_{{\rm{optimal}}}^{{\rm{all}}}\) indicate the difficulty of MLIPs meeting all benchmarks for many properties in high-dimensional space. As shown in the cumulative distribution function (CDF) of \({N}_{{\rm{optimal}}}^{{\rm{all}}}\) for D = 2–5 (Fig. 6a), the probability of having no optimal MLIP meeting all benchmarks, P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1), increases significantly with D. For most pairs of two properties, at least one optimal MLIP model on the Pareto front satisfies both benchmarks, as shown by P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1) = 2% for the 2D Pareto fronts. It is much more difficult to find MLIP models that meet all five property benchmarks, as 46% of the 5D Pareto fronts have no optimal MLIP that meets all five benchmarks. This increase of P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1) with dimensionality D is observed for MLIPs trained on various datasets (Methods), including our original dataset of 2300 MLIP models with the full 60 properties, \({{\mathcal{D}}}_{2300}\), and the dataset of 124 MLIP models with 31 properties, \({{\mathcal{D}}}_{124}\) (Fig. 6b and Supplementary Fig. 7). As shown by these results, it can be difficult to find MLIP models that beat the benchmarks of three or more properties. Our result highlights the ‘curse of dimensionality’ in developing MLIPs that can accurately predict many properties.
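
The statistic behind Fig. 6, P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1), can be estimated by enumerating every D-property combination, building its Pareto front, and checking whether any optimal model meets all D benchmarks. A simplified sketch with hypothetical data (not the production code of this study) is:

```python
import numpy as np
from itertools import combinations

def is_pareto(errors):
    """Boolean mask of non-dominated rows (all error metrics minimized)."""
    n = errors.shape[0]
    mask = np.ones(n, dtype=bool)
    for i in range(n):
        if mask[i]:
            # rows dominated by i: no better in any metric and strictly worse in one
            dominated = np.all(errors >= errors[i], axis=1) & np.any(errors > errors[i], axis=1)
            mask[dominated] = False
    return mask

def prob_no_optimal_meets_all(errors, benchmarks, D):
    """Fraction of D-property combinations whose Pareto front contains no
    optimal model meeting all D benchmarks, i.e., P(N_optimal_all < 1)."""
    benchmarks = np.asarray(benchmarks)
    failures = total = 0
    for combo in combinations(range(errors.shape[1]), D):
        sub = errors[:, combo]
        front = sub[is_pareto(sub)]
        meets_all = np.all(front < benchmarks[list(combo)], axis=1)
        failures += int(not meets_all.any())
        total += 1
    return failures / total

# Hypothetical toy data: 300 models, 8 properties, unit benchmarks.
rng = np.random.default_rng(3)
err = rng.lognormal(size=(300, 8))
print([round(prob_no_optimal_meets_all(err, np.ones(8), D), 2) for D in (2, 3, 4)])
```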

Fig. 6: The joint performances of MLIPs for all combinations of properties.

a The cumulative distribution function (CDF) curves of the number of optimal MLIPs that meet all benchmarks, \({N}_{{\rm{optimal}}}^{{\rm{all}}}\), for two (2D, blue), three (3D, orange), four (4D, green), and five (5D, red) properties. The probabilities of no optimal MLIP meeting all benchmarks, P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1), are indicated at the black dot-dash line (values shown in the legend). b P(\({N}_{{\rm{optimal}}}^{{\rm{all}}}\) < 1) as a function of the property dimension D for different datasets.

Investigating the correlations of property errors

As shown by the above analyses, obtaining MLIPs that simultaneously meet the benchmarks for a large number of properties would be difficult. During the multiple iterations of training and testing to improve MLIPs, evaluating the errors of these many properties is time-consuming and computationally expensive. To overcome this challenge, we here identify a reduced number of representative properties that are correlated with other properties in the MLIP predictions. (Note: here the term property generically refers to the error metrics of material properties and performance scores as listed in Table 1.) We first study the correlations of property errors by constructing the correlation graph of the property error metrics (defined in Eq. (3) (Methods)) for each MLIP type (Fig. 7a–d). For the analysis of all MLIP models (Fig. 7a), the error metrics of the energy ranking properties are highly correlated (green nodes in the lower center of Fig. 7a), and the errors of elastic constants and thermal properties are also correlated (as shown by the clustered orange and red nodes in the upper right corner of Fig. 7a). Based on the strong correlations of the error metrics in each cluster, i.e., the connections in the graph, we can select representative properties (Methods) to represent each cluster (nodes with red edges in Fig. 7). Further analyses are conducted to analyze the inter-dependence of these property errors and to identify the prediction relations using some property errors as input to predict others, as demonstrated in Supplementary Note 4 in Supporting Information (Supplementary Figs. 9–11, and Supplementary Table 12).
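
The correlation graphs in Fig. 7 connect two error metrics whenever the squared Pearson correlation of their values over the analyzed models exceeds 0.6; a minimal construction sketch (with hypothetical property names) is:

```python
import numpy as np
import networkx as nx
from scipy.stats import pearsonr

def correlation_graph(error_metrics, names, r2_cut=0.6):
    """Graph with one node per error metric; an edge connects two metrics whose
    Pearson r^2 across the MLIP models exceeds r2_cut."""
    g = nx.Graph()
    g.add_nodes_from(names)
    n_props = error_metrics.shape[1]
    for i in range(n_props):
        for j in range(i + 1, n_props):
            r, _ = pearsonr(error_metrics[:, i], error_metrics[:, j])
            if r ** 2 > r2_cut:
                g.add_edge(names[i], names[j], weight=r ** 2)
    return g

# Hypothetical data: 2300 models, 5 properties.
rng = np.random.default_rng(4)
g = correlation_graph(rng.random((2300, 5)),
                      ["E_f_vacancy", "C11", "S_bulk", "a_bulk", "sigma_F"])
```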

Fig. 7: The correlation graphs of error metrics for 60 properties.

The error correlation graphs of the properties from MLIP models of a all types, b GAP, c NNP, and d MTP on all 60 properties. Each node represents one error metric, and two nodes are connected if their correlation satisfies r2 > 0.6. The representative properties for each cluster are shown with red node edges.

The training and testing of the MLIPs can focus on these representative properties (Table 2). For example, for MTPs, three metrics, the \({P}_{{\rm{ranking}}\; {\rm{error}}}^{{\mathcal{D}}1}\), the maximum \({\Delta E}_{{\rm{DFT}}}^{{\mathcal{D}}1}\), and the mean \({\Delta E}_{{\rm{DFT}}}^{{\mathcal{D}}2}\), can well represent all 12 metrics in the energy ranking category (Fig. 7d). As a training strategy for developing MTPs with better performances on energy rankings of multiple defects, one may first focus on these three metrics. More generally, to develop MLIPs that can achieve the benchmarks for all properties, the training may initially focus on the 21 representative properties out of the original 60 properties, a reduction of roughly two-thirds (Table 2). The above analysis using correlation graphs can be repeated on the newly trained MLIP models to identify the representative quantities for the next iteration, as the representative properties and their quantities may change for MLIP models trained differently.

Table 2 The representative properties selected by their correlation graphs for different types of MLIPs

In summary, this analysis reveals the correlation of the errors of different properties predicted by MLIPs and identifies the representative properties, which can serve as guidance for further development and improvement of MLIPs.

Discussion

In this study, we propose and demonstrate an analysis process (Fig. 1) for evaluating the prediction errors of a large number of MLIP models across a vast array of properties. Our analysis process presents several key features, such as evaluating a large number of MLIP models to capture the statistical behavior of the MLIPs, constructing the Pareto fronts to identify challenging properties and their combinations, and revealing the correlations among the property errors of MLIP models. Our analysis process utilizes the many MLIP models generated in the validation pool, most of which would be discarded in a typical training process, and thus incurs little additional computation cost for generating models. By examining a large number of MLIP models, the analysis process can provide an understanding of and insights into the MLIP performances on a large array of properties, which are essential for guiding the further training of MLIPs.

Our study highlights the challenge of developing MLIPs that achieve good performances, or low errors, across multiple properties. As shown by the Pareto fronts, the optimal MLIP models may exhibit low errors in some properties but often show higher errors in many others. One might expect that this challenge can simply be overcome by selecting different MLIP models, with different sets of hyperparameters, that have lower average errors in energies and forces. However, the Pareto fronts from our analysis suggest that strong trade-offs exist among some properties, and these trade-offs cannot simply be addressed by choosing a different model with a different set of hyperparameters. The challenge that MLIPs can make accurate predictions only for certain properties, while failing to perform well across all properties, is often overlooked. While many studies of MLIPs focus on a limited set of properties, it is critical to also assess the joint performances of MLIPs on a broad range of properties.

This challenge of MLIPs can be further illustrated by the high-dimensional analysis of many properties, such as evaluating the high-dimensional Pareto fronts and clustering MLIP models by their joint performances on multiple property errors. As illustrated in our high-dimensional analysis, there is increasing difficulty in finding optimized MLIP models that meet all benchmarks as the number of properties increases, which is a type of ‘curse of dimensionality’ in MLIP development. Furthermore, the high-dimensional correlation analysis is demonstrated to identify key challenging properties, which would be valuable for guiding the training of better MLIPs. In our particular study case of Si, using specific choices of model types, training data, and hyperparameters, the challenging properties generally involve the formation energies of different defects, the energy rankings of different defect configurations, and the elastic constants of defected supercells. Overall, our analysis process can provide crucial information for overcoming the fundamental challenges of developing MLIPs that accurately predict a large array of properties.

For future MLIPs that may involve more elements, phases, structures, and defects, this challenge will be more pronounced, and our analysis process should be conducted to identify the critical properties, the number of which will further increase exponentially as we aim to capture material properties across the materials system. Given the large composition space and the many intermediate phases of multi-element systems25, sampling schemes covering a wide variety of relevant, representative phases and compositions in conjunction with a large number of properties need to be devised. Future studies are needed to extend this framework to more complex systems, e.g., high-entropy alloys. Moreover, our analysis framework can be applied to testing other MLIPs that are not covered in this work, e.g., MLIPs based on graph neural networks18,27,28.

It is critical to train reliable MLIP models that meet the performance benchmarks for a large number of material properties. To achieve that goal, our analysis process (Fig. 1) could serve as an essential component. In addition to the typical training process, the analysis process would be conducted on a pool of MLIP models from the validation steps to gain an understanding of the trained models, such as identifying the challenging properties, the challenging pairs/combinations, and the representative properties. This information would be used to guide the next round or iteration of training. For example, the data related to these representative challenging properties can be added or overweighted into the training dataset, or the corresponding scores/metrics based on these representative properties can be used in the validation step for the next round of training. Then, the analysis process can be performed for the new batch of trained models to check if improved performance is achieved. If necessary, additional rounds of training and testing can be iteratively performed, as the challenging and representative properties may change after the additional training process. Through this iterative training and analysis process, the MLIPs can be improved to give accurate predictions on many properties.

In conclusion, our analysis process has effectively identified key challenges in the accurate predictions of MLIPs across many properties as demonstrated in the model system of Si. This analysis process can be generally applied for the MLIP training in any materials and can be further developed. Given the current state of rapid adoption of MLIPs29,30,31, developing the processes of benchmarking and evaluating MLIP models is increasingly important. Our proposed analysis process can offer valuable guidance for future research and development of MLIPs with enhanced performance across a broad range of properties.

Methods

First-principles computation

DFT calculations were performed to compute the energies and forces of configurations and to relax structures. Since a majority of models were retrieved from ref. 8, all DFT calculations were performed as described in ref. 8, using the Vienna ab initio simulation package32 (VASP) with the projector augmented-wave approach. The generalized-gradient approximation (GGA) with Perdew-Burke-Ernzerhof33 (PBE) functionals was used. All true values of energies, forces, physical properties, and evaluation metrics were calculated with a 4 × 4 × 4 k-point mesh (K4), except for elastic constants, which were calculated with a 2 × 2 × 2 k-point mesh (K2). As specified in Section 2.1, most of the benchmarks of the properties were obtained by DFT calculations using a Γ-centered single k-point 1 × 1 × 1 (K1). All DFT calculations were spin-polarized, with an electronic relaxation convergence criterion of 10−5 eV, an energy cutoff of 520 eV, and other parameters set as in the Materials Project34,35.

AIMD simulation

AIMD simulations were performed to generate atomistic configurations for the training, validation, and testing datasets. The supercell models of bulk or defected Si for AIMD simulations have lattice parameters larger than 10 Å. AIMD simulations were non-spin-polarized, with an electronic energy convergence criterion of 10−4 eV and a time step of 2 fs. A Γ-centered single k-point of 1 × 1 × 1 was used for the AIMD simulations. All AIMD simulations to obtain migrating (RE) atoms were performed at 1000 K or 1230 K, following the same scheme described in the main text and in ref. 8. Each AIMD simulation began with an initial heating period in which the statically relaxed supercells were heated from 100 K to the final temperature using velocity scaling at a constant rate over 2 ps. Then, the AIMD simulations were conducted in the NVT ensemble with a Nosé-Hoover thermostat.

Constructing training, validation, and testing datasets

This study used a number of training datasets, \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}\; {\rm{al}}}.}\), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\), \({{\mathcal{D}}}^{2-{\rm{I}}}\), and \({{\mathcal{D}}}^{4-{\rm{I}}}\), a validation dataset, \({{\mathcal{D}}}^{{\rm{validation}}}\), and two testing datasets, \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\), each consisting of a diverse range of AIMD snapshots and bulk, defected, and distorted Si configurations. The training dataset \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) was adopted from ref. 1, and the other datasets were adopted from ref. 8, except for \({{\mathcal{D}}}^{2-{\rm{I}}}\) and \({{\mathcal{D}}}^{4-{\rm{I}}}\), which were generated in the same way as \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) as described below. All true values of energies and forces were DFT K4 calculation results.

The training dataset \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) consists of 219 configurations covering a wide range of Si structures, including solid Si, melted Si, distorted Si, thin slabs of surface Si, and Si bulks with a single vacancy, from AIMD simulations at different temperatures as described in ref. 1. To construct the other datasets as in ref. 8, 120 configurations from liquid Si, AIMD simulations of Si bulk, and the strained Si bulk in \({{\mathcal{D}}}^{{\rm{Zuo}}\; {\rm{et}}\; {\rm{al}}.}\) were randomly selected and removed, and were replaced by 120 snapshots generated by AIMD simulations with different defects: \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) uses snapshots of Si bulk with a single vacancy; \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) uses snapshots of Si bulk with a single interstitial; \({{\mathcal{D}}}^{2-{\rm{I}}}\) uses Si bulk with two interstitials; and \({{\mathcal{D}}}^{4-{\rm{I}}}\) uses Si bulk with four interstitials. The \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) are the RE-enhanced and vacancy-enhanced training datasets in ref. 8, containing more RE interstitials and vacancies. Each of these training datasets has a total of 219 configurations, from solid Si, melted liquid Si, strained Si, Si surfaces, and Si defects (single vacancy, single interstitial, two interstitials, or four interstitials). To generate these configurations, the AIMD simulations were performed at 1000 K or 1230 K for defected supercells (single vacancy, split-<110> interstitial, tetrahedral interstitial, hexagonal interstitial, 2-interstitial, and 4-interstitial supercells) with −0.4%, 0.3%, 0.3%, 0.3%, 0.5%, and 1.6% lattice strains from the perfect Si bulk, respectively. All energies and atomic forces of the AIMD snapshots were converged using single-step self-consistent DFT K4 without relaxing atom positions or lattices. All snapshots contained at least one identified RE atom, defined as a migrating atom for which the difference between its distances to the 1st and 2nd nearest neighbor atoms is less than 0.75 Å (~31% of the distance between two nearest static sites), as in ref. 8.

The validation dataset \({{\mathcal{D}}}^{{\rm{validation}}}\) is the enhanced validation set constructed in ref. 8, consisting of 50 Si configurations. Twenty of them are randomly selected from the 120 structures removed from the original \({{\mathcal{D}}}^{{\rm{Zuo\; et\; al}}.}\), 11 configurations are AIMD snapshots with a single vacancy RE, and 19 configurations are AIMD snapshots with a single interstitial RE, both from AIMD simulations at 1230 K.

The two testing datasets, \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\), are the interstitial RE testing set and the vacancy RE testing set used in ref. 8. Each testing dataset consists of 100 AIMD snapshots at 1230 K with a single interstitial (for \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\)) or a single vacancy (for \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\)) as the identified RE.

Other performance datasets

Three performance datasets, \({{\mathcal{D}}}_{2300}\), \({{\mathcal{D}}}_{2300}^{48D}\), and \({{\mathcal{D}}}_{124}\) are used as shown in Fig. 6 and Supplementary Fig. 7. The \({{\mathcal{D}}}_{2300}\) is the full MLIP dataset with all 60 properties. The \({{\mathcal{D}}}_{2300}^{48D}\) is the subset of \({{\mathcal{D}}}_{2300}\) for 48 properties, excluding the 12 properties in the energy ranking category. The \({{\mathcal{D}}}_{124}\) is a dataset consisting of 124 MLIP models with 31 properties from ref. 8 (Supplementary Tables 10 and 11). The 124 MLIP models in \({{\mathcal{D}}}_{124}\) are trained by \({{\mathcal{D}}}^{{\rm{Zuo\; et\; al}}.}\), \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\), or \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) using six MLIP types, such as GAP, NNP, DeePMD, DeepPot-SE, MTP, and SNAP.

Training MLIPs

We adopted the same approach to training MLIPs as in ref. 8 and in Zuo et al.1. The mlearn Python package and the corresponding MLIP source codes, namely QUIP for GAP4, N2P2 for NNP36, MLIP for MTP6,37, and the SNAP23 implementation embedded in LAMMPS, were used to train these models. We also used the DeePMD-kit package to train the DeePMD and DeepPot-SE models, following ref. 38 and ref. 2, respectively. During the validation step, many MLIP models with different hyperparameter values were obtained, following the same approach as in ref. 8. A grid search was adopted using two to ten values for each hyperparameter, including the number of radial basis functions and the band limit of spherical harmonic basis functions for GAP, the cutoff radius and the size of the neural network for NNP, the choice of radial basis function sets for MTP, and the cutoff radius and the number of iteration steps for DeePMD and DeepPot-SE. Each validation pool contains 194 to 2304 MLIP models with various hyperparameters, depending on the MLIP type and training dataset. In total, 25,327 models were generated across all our validation pools.

Optimizing MLIPs and selecting the hyperparameter sets

The validation scores of the trained MLIP models were calculated as follows. We first calculated the RMSEs of the energies and the forces for the corresponding training dataset, \({\sigma }_{{\rm{E}}}^{{\rm{train}}}\) and \({\sigma }_{{\rm{F}}}^{{\rm{train}}}\), and for the validation dataset \({{\mathcal{D}}}^{{\rm{validation}}}\), \({\sigma }_{{\rm{E}}}^{{\rm{validation}}}\) and \({\sigma }_{{\rm{F}}}^{{\rm{validation}}}\). These four RMSEs were then normalized to have similar distributions, so that lower values of these criteria indicate better performances, as

$${\sigma }_{k}^{{\mathcal{D}}{{{\prime} }}}=\frac{{\sigma }_{k}^{{\mathcal{D}}}-{({\sigma }_{k}^{{\mathcal{D}}})}_{\min }}{{({\sigma }_{k}^{{\mathcal{D}}})}_{{\rm{median}}}-{({\sigma }_{k}^{{\mathcal{D}}})}_{\min }},\,k\in \left\{E,\,F\right\}\,{\rm{and}}\,{\mathcal{D}}\in \{{\rm{train}},{\rm{validation}}\}$$
(4)

where \({\sigma }_{k}^{{\mathcal{D}}}\) is one of the four RMSEs on dataset \({\mathcal{D}}\), and \({({\sigma }_{k}^{{\mathcal{D}}})}_{\min }\) and \({({\sigma }_{k}^{{\mathcal{D}}})}_{{\rm{median}}}\) are the minimum and the median RMSE, respectively. The validation score was calculated as:

$${Score}=\sqrt{{\sum }_{k,{\mathcal{D}}}{[{\sigma }_{k}^{{\mathcal{D}}{\prime} }]}^{2}},$$
(5)

so that lower validation scores correspond to better MLIP performances.
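
A minimal sketch of Eqs. (4) and (5) for scoring one model from its four RMSEs is given below; the pooled minima and medians, and all numerical values, are hypothetical inputs.

```python
import numpy as np

KEYS = [("E", "train"), ("F", "train"), ("E", "validation"), ("F", "validation")]

def validation_score(rmses, pool_min, pool_median):
    """Validation score of Eqs. (4)-(5): normalize the four RMSEs by the minimum
    and median over the validation pool, then take the root of the sum of squares."""
    normed = [(rmses[k] - pool_min[k]) / (pool_median[k] - pool_min[k]) for k in KEYS]  # Eq. (4)
    return float(np.sqrt(np.sum(np.square(normed))))                                    # Eq. (5)

rmses = dict(zip(KEYS, [0.006, 0.11, 0.009, 0.15]))      # hypothetical RMSEs of one model
pool_min = dict(zip(KEYS, [0.004, 0.08, 0.005, 0.10]))   # hypothetical pool minima
pool_med = dict(zip(KEYS, [0.012, 0.20, 0.015, 0.25]))   # hypothetical pool medians
print(validation_score(rmses, pool_min, pool_med))
```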

Sampling MLIP models

We used the following two steps to sample a total of 2300 MLIP models for the analysis process in our study. If the pool size in the validation step of the training process was larger than 1000, the top 10% of MLIP models with the lowest validation scores were picked; then, according to the ranking of validation scores, 50 MLIP models with even ranking spacing were chosen. If the pool size was smaller than 1000, every second model (a total of 50) was chosen from the top 100 models with the lowest validation scores. After picking these optimized models, another 50 models were randomly selected from the remaining pool. This two-step selection was conducted for each type of MLIP and for each training dataset, yielding a total of 2300 sampled MLIP models.
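
The two-step sampling can be sketched as below for a single validation pool; `validation_scores` is a hypothetical array, and the returned indices would be collected over every MLIP type and training dataset.

```python
import numpy as np

def sample_pool(validation_scores, n_best=50, n_random=50, seed=0):
    """Two-step sampling of MLIP models from one validation pool."""
    rng = np.random.default_rng(seed)
    order = np.argsort(validation_scores)        # ascending: lowest score = best model
    pool_size = len(validation_scores)
    if pool_size > 1000:
        top = order[: pool_size // 10]                                   # top 10% by score
        best = top[np.linspace(0, len(top) - 1, n_best, dtype=int)]      # evenly spaced ranks
    else:
        best = order[:100][::2][:n_best]                                 # every 2nd of top 100
    rest = np.setdiff1d(order, best)
    random_pick = rng.choice(rest, size=n_random, replace=False)         # step 2: random picks
    return np.concatenate([best, random_pick])

picked = sample_pool(np.random.default_rng(5).random(1500))   # hypothetical pool of 1500 models
```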

Calculating different physical properties

The methods used to calculate the physical properties are described as follows. The supercells of Si bulk, single vacancy, and different Si interstitials, namely split-<110>, tetrahedral, and hexagonal interstitials, were relaxed by DFT K4. The true values of all properties, such as energies, forces, energy RMSEs, force RMSEs, lattice parameters, energy rankings, the NAC, and the force constants, were calculated by DFT K4, except for the elastic constants of bulk and defected Si. For the energies and forces of atoms from the AIMD snapshots, single-step self-consistent DFT K4 calculations were used without further relaxation. The same configurations from the DFT calculations were used for the testing of MLIPs, except for lattice parameters. The methods for calculating the properties in each category are described below.

Defect formation energy EfDefect

The supercells with defects were constructed using perfect Si bulk with 2 × 2 × 2 conventional unit cells (64 atoms) and then relaxed by DFT K4. The defects considered were the single vacancy and the single interstitial in split-<110>, tetrahedral, or hexagonal configurations. The defect formation energy EfDefect was calculated as

$${E}_{{\rm{f}}}^{{\rm{Defect}}}={E}^{{\rm{defect}}}-\frac{{N}^{{\rm{defect}}}}{{N}^{{\rm{bulk}}}}{E}^{{\rm{bulk}}},$$
(6)

where \({E}^{{\rm{defect}}}\) is the energy of defected configuration, \({E}^{{\rm{bulk}}}\) is the energy of the crystalline bulk supercell, \({N}^{{\rm{bulk}}}\) is the number of atoms in perfect bulk supercell, and \({N}^{{\rm{defect}}}\) is the number of atoms of defected supercell.
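Eq. (6) translates directly into code; the energies and atom counts below are hypothetical placeholders for DFT- or MLIP-computed values.

```python
def defect_formation_energy(e_defect, n_defect, e_bulk, n_bulk):
    """Defect formation energy of Eq. (6), referenced to the perfect bulk supercell."""
    return e_defect - (n_defect / n_bulk) * e_bulk

# Hypothetical example: a 64-atom bulk supercell and a 63-atom vacancy supercell.
print(defect_formation_energy(e_defect=-339.8, n_defect=63, e_bulk=-346.9, n_bulk=64))
```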

Elastic constants

The true values of elastic constants were calculated by the finite differences approach via a stress-strain relationship implemented in VASP (IBRION = 6 and ISIF = 3). The width of the displacement of each ion was set to 0.015 Å (POTIM = 0.015), and the number of the ionic displacements was set to 2 (NFREE = 2). Since the defect supercells require high computation costs due to low symmetry, the true values were calculated by DFT K2. For GAP, NNP, MTP, and SNAP, the functions implemented in the maml Python package39 were used to calculate elastic constants. For DeePMD and DeepPot-SE, the LAMMPS scripts from the maml package were utilized to compute the elastic constants.

Lattice parameter

The lattice parameters were calculated from the supercells relaxed by the MLIPs and compared to the true values obtained by DFT.

Energy rankings

The MLIP performances on the energy rankings of different orderings of multiple defects in Si were evaluated using the following four datasets: (1) 2-vacancies in Si bulk in the supercell with 3 × 3 × 3 conventional unit cells (214 atoms); (2) 3-vacancies in Si bulk in the supercell with 3 × 3 × 2 conventional unit cells (141 atoms); (3) 3-vacancies in Si bulk in the supercell with 3 × 2 × 2 conventional unit cells (93 atoms); (4) 3-vacancies in Si bulk in the supercell with 2 × 2 × 2 conventional unit cells (61 atoms). For each set, we randomly removed two to three Si atoms and generated up to 30 symmetrically distinct configurations (using the Pymatgen package34). All configurations had fixed lattices and atom positions, which were not relaxed.

The energy rankings were assessed by three metrics: the ranking error rate, \({P}_{{\rm{ranking}}\; {\rm{error}}}^{{\mathcal{D}}}\), for the dataset \({\mathcal{D}}\), and the mean and the maximum values of the energy differences \(\Delta\)EDFT of the ranking errors, the mean \({\Delta E}_{{\rm{DFT}}}^{{\mathcal{D}}}\) and the maximum \({\Delta E}_{{\rm{DFT}}}^{{\mathcal{D}}}\). Following the definitions and methods in ref. 25, the ranking error rates \({P}_{{\rm{ranking\; error}}}^{{\mathcal{D}}}\) were quantified by comparing all pairs of energies, in the same way as the calculation of the concordance index40. We calculated the energy difference \(\Delta\)EDFT as the difference between the DFT energies of the mismatched pair for each ranking error in the dataset \({\mathcal{D}}\) and identified the mean and the maximum \({\Delta E}_{{\rm{DFT}}}^{{\mathcal{D}}}\).
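
The ranking error rate is evaluated by comparing all pairs of configurations, in the same spirit as a concordance index; a minimal sketch with hypothetical energies is:

```python
import numpy as np
from itertools import combinations

def ranking_error_stats(e_dft, e_mlip):
    """Ranking error rate and the mean/maximum DFT energy gaps of mismatched pairs."""
    gaps = []
    pairs = list(combinations(range(len(e_dft)), 2))
    for i, j in pairs:
        # A ranking error occurs when the MLIP orders a pair of configurations
        # differently from DFT.
        if np.sign(e_mlip[i] - e_mlip[j]) != np.sign(e_dft[i] - e_dft[j]):
            gaps.append(abs(e_dft[i] - e_dft[j]))
    rate = len(gaps) / len(pairs)
    return rate, (np.mean(gaps) if gaps else 0.0), (np.max(gaps) if gaps else 0.0)

e_dft = np.array([0.00, 0.12, 0.31, 0.45])    # hypothetical relative energies (eV)
e_mlip = np.array([0.05, 0.02, 0.36, 0.40])
print(ranking_error_stats(e_dft, e_mlip))     # error rate, mean and max DFT gap of errors
```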

Rare events

The NAC metric proposed in ref. 8 was used to evaluate the magnitude and directional errors of the forces on RE atoms. The errors in the predicted forces on the RE atoms identified from \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\) were calculated and used to estimate the CDFs of the force magnitude errors, \({\delta }_{{\rm{F}}}\), and force directional errors, \({\delta }_{{\rm{\theta }}}\). The NACs of the CDF curves were then calculated within the error ranges of 0 < |\({\delta }_{{\rm{F}}}\)| < 1 eV Å−1 for the magnitude errors, NAC(|\({\delta }_{{\rm{F}}}\)|, \({\mathcal{D}}\)), and 0° < \({\delta }_{{\rm{\theta }}}\) < 60° for the directional errors, NAC(\({\delta }_{{\rm{\theta }}}\), \({\mathcal{D}}\)). The energy RMSEs \({\sigma }_{{\rm{E}}}^{{\mathcal{D}}}\) of \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\), and the force RMSEs \({\sigma }_{{\rm{F}}}^{{\mathcal{D}}}\) on the identified RE atoms from \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\), were also calculated.

E & F RMSEs

The energy and force RMSEs of all configurations and all atomistic forces were calculated using the \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) datasets.

Thermal properties

We calculated the thermal properties for both Si bulk and a supercell with 2 × 2 × 2 conventional unit cells containing a vacancy (63 atoms). The configurations with displaced atoms were generated according to the symmetry of the given supercell, and the atomistic forces and force constants were calculated. Then, the equation of state (EOS) curves from 0 to 1000 K (with 10 K spacing) were generated using Phonopy with a 40 × 40 × 40 q-point mesh. The curves of entropy versus temperature S(T), free energy versus temperature Efree(T), and heat capacity versus temperature c(T) were calculated using the Phonopy Python package41,42,43.

Calculating property errors

The errors of the physical properties were evaluated by comparing the MLIP predicted values to the true values by DFT in the same configurations. For lattice parameters, the supercell was fully relaxed by DFT and MLIPs to obtain the error.

For the EOS curves of thermal properties, the error metrics \({\Delta }_{{\rm{EOS}}}\) of the curves were calculated as

$${\Delta }_{{\rm{EOS}}}=\sqrt{\frac{\mathop{\sum }\nolimits_{i=1}^{{\rm{N}}}{[{X}^{a}({T}_{i})-{X}^{{\rm{K}}4}({T}_{i})]}^{2}}{N}},$$
(7)

where \({X}^{a}({T}_{i})\) denotes property X (either entropy S, free energy Efree, or heat capacity c) computed using method a (either MLIP or DFT K1) at temperature Ti, \({X}^{{\rm{K}}4}\) is the property X calculated by DFT K4, and N is the total number of temperatures on the curve.
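Eq. (7) is simply an RMSE over the sampled temperature grid; a minimal sketch with hypothetical curves is:

```python
import numpy as np

def eos_curve_error(x_model, x_dft_k4):
    """Error metric of Eq. (7): RMSE between a property-versus-temperature curve
    from an MLIP (or DFT K1) and the DFT K4 reference curve on the same grid."""
    x_model, x_dft_k4 = np.asarray(x_model), np.asarray(x_dft_k4)
    return float(np.sqrt(np.mean((x_model - x_dft_k4) ** 2)))

temps = np.arange(0, 1001, 10)                   # 0-1000 K with 10 K spacing
s_k4 = np.log1p(temps / 300.0)                   # hypothetical reference entropy curve
s_mlip = s_k4 + 0.02 * np.sin(temps / 150.0)     # hypothetical MLIP-predicted curve
print(eos_curve_error(s_mlip, s_k4))
```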

For the energy rankings, REs, and E & F RMSEs categories, the errors were the differences between these properties and their ‘ideal’ values, which were 1 for NAC force performance scores on RE atoms and 0 for all the others.

The benchmarks

The benchmarks were calculated by DFT K1, except for the following ‘handpicked’ benchmarks. The benchmarks of EfDefect were set to 0.1 eV, because the differences between DFT K4 and K1 were much larger. The benchmarks of energy RMSEs for \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{RE}}-{\rm{V}}}\) datasets were set to 15 meV atom−1 in the REs category. The benchmarks of energy RMSEs and of force RMSEs for \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{I}}}\) and \({{\mathcal{D}}}^{{\rm{enhanced}}-{\rm{V}}}\) datasets were set to 15 meV atom−1 and 0.12 eV Å−1, respectively, in the E & F RMSEs category.

Clustering MLIPs by their performances

The clustering graphs (Fig. 5) of the optimal MLIP models by the similarities in their joint performance were constructed as follows. First, we picked the properties from the target categories, e.g., all four properties in the EfDefect category and all eight properties in the REs category, and constructed six Pareto fronts, one for each type of MLIP. The Pareto fronts were constructed by enumerating all models following an established multi-objective optimization algorithm (see the repository in the “Code availability statement”). After collecting the optimal models on all six Pareto fronts, we used the standardized error metrics from Eq. (3) and calculated the similarities between each pair of MLIP models following Eq. (2). A grid search was conducted over 80 values of the similarity cutoff in the range from 10−2 to 102. For each similarity cutoff, the similarity graph was constructed by connecting pairs of MLIP models with similarities above the cutoff. The communities of MLIP models were identified using the Louvain community detection algorithm implemented in the networkx Python package. In the Louvain community detection algorithm, we used the calculated similarities as the edge weights on the graph and set the convergence threshold to 10−7. The similarity cutoff was 1.4 for the properties in the EfDefect category and 1.3 for the REs category in Fig. 5. The cutoff was chosen as the similarity cutoff with the highest adjusted Rand score (ARS) when matching the detected communities to the labels of the different types of models. The ARS is a clustering performance metric, implemented in the scikit-learn package, which evaluates how closely the predicted labels match the reference labels.
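
A minimal sketch of this clustering step, assuming a precomputed similarity matrix from Eq. (2) and the networkx Louvain implementation, is shown below; the cutoff value is illustrative rather than the ARS-selected one.

```python
import numpy as np
import networkx as nx

def cluster_models(similarity, cutoff=1.3):
    """Connect optimal MLIP models whose similarity exceeds the cutoff and detect
    communities with the Louvain algorithm, using similarities as edge weights."""
    n = similarity.shape[0]
    g = nx.Graph()
    g.add_nodes_from(range(n))
    for i in range(n):
        for j in range(i + 1, n):
            if similarity[i, j] > cutoff:
                g.add_edge(i, j, weight=similarity[i, j])
    return nx.community.louvain_communities(g, weight="weight", threshold=1e-7, seed=0)

# Hypothetical similarity matrix for 50 optimal models with 4 error metrics.
rng = np.random.default_rng(6)
z = rng.random((50, 4))
dist = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
communities = cluster_models(1.0 / np.where(dist > 0, dist, np.inf))
```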

The HV and the IGD scores of the Pareto fronts

The HV and the IGD scores of the Pareto fronts were calculated using the hypervolume function from pygmo package and the equations provided in ref. 44, respectively. The reference points for calculating HV and IGD scores were the minimum values of the selected properties among all optimal models. The HV score measures the space between the Pareto fronts and the reference points, and IGD is the closest Euclidean distance between the optimal MLIP models on the Pareto fronts and the reference points. The lower scores of the Pareto fronts indicate better performances of the MLIP models.
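
With a single reference point per front, the IGD used here reduces to the shortest Euclidean distance from that point to the Pareto front; a minimal sketch is below (the HV was computed with the pygmo hypervolume routine and is not reproduced here).

```python
import numpy as np

def igd_score(front, reference_point):
    """IGD as used here: the shortest Euclidean distance between the reference
    point (per-property minima over the optimal models) and the Pareto front."""
    front = np.asarray(front)
    return float(np.min(np.linalg.norm(front - np.asarray(reference_point), axis=1)))

front = np.array([[0.1, 0.9], [0.3, 0.4], [0.8, 0.2]])   # hypothetical 2D front
reference = front.min(axis=0)                            # per-property minima
print(igd_score(front, reference))
```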

Constructing the correlation graphs of property errors and selecting representative properties

To construct the correlation graphs, Pearson’s correlation coefficient r between each pair of properties was calculated using the scipy package. The correlation graphs (Fig. 7 and Supplementary Fig. 8) were constructed by connecting the pairs of properties with high correlations of r2 > 0.6. The representative properties of the correlation graphs in Fig. 7 were selected as a minimal set of vertices (properties) that connect to all other vertices in the graph. A recursive procedure involving the following three sub-steps was used for the selection (a code sketch follows at the end of this subsection):

(1) Isolated Vertex Screening: We first identified isolated vertices (properties) in the graph. These isolated vertices were added directly to the final set of representative properties, as no other connections would cover them.

(2) Vertex Ranking and Selection: Next, we ranked all connected vertices in the graph using the PageRank algorithm (via the networkx package), with the correlations (r²) serving as weights. The algorithm parameters were set to a maximum of 100 iterations, a damping parameter (alpha) of 0.85, and an error tolerance of 10−6. The highest-ranked vertex was then added to the set of representative properties, and both this vertex and any vertices connected to it were removed from the graph.

(3) Recursion: The remaining graph was passed to the next recursion step.

This process was repeated recursively until no vertices remained or all remaining vertices in the graph were isolated. The final set of representative properties comprises all isolated vertices in the original graph and the highest-ranked vertices from each recursion step.
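
A sketch of this recursive selection, assuming a networkx correlation graph with r² edge weights as constructed above, is given below; the toy graph and property names are hypothetical.

```python
import networkx as nx

def representative_properties(graph):
    """Select representative properties: isolated vertices are kept directly,
    then the highest-PageRank vertex is selected and its neighbors removed,
    and the procedure recurses on the remaining graph."""
    g = graph.copy()
    selected = []
    while g.number_of_nodes() > 0:
        isolated = [n for n in g if g.degree(n) == 0]     # sub-step 1: isolated vertex screening
        selected.extend(isolated)
        g.remove_nodes_from(isolated)
        if g.number_of_nodes() == 0:
            break
        # Sub-step 2: rank the remaining vertices with PageRank, r^2 as edge weights.
        ranks = nx.pagerank(g, alpha=0.85, max_iter=100, tol=1e-6, weight="weight")
        best = max(ranks, key=ranks.get)
        selected.append(best)
        g.remove_nodes_from(list(g.neighbors(best)) + [best])  # sub-step 3: recurse on the rest
    return selected

# Hypothetical toy graph: two correlated clusters and one isolated property.
g = nx.Graph()
g.add_weighted_edges_from([("C11", "C12", 0.9), ("C12", "S_bulk", 0.8),
                           ("E_f_vacancy", "E_f_hexagonal", 0.7)])
g.add_node("sigma_F")
print(representative_properties(g))
```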