Introduction

Predicting the crystal structure for a given composition before experimental synthesis is central to computation-guided materials discovery. The state-of-the-art approaches for crystal structure prediction (CSP) rely on efficient search algorithms such as simulated annealing (SA)1,2,3, genetic algorithms (GA)4,5,6, and particle-swarm optimization (PSO)7,8,9. To search for the global minimum on the potential energy surface, SA focuses on overcoming energy barriers, GA utilizes self-improving methods, and PSO takes advantage of the collective intelligence of particles. In these conventional methods, extensive energy and force evaluations by density functional theory (DFT)10,11 are required when exploring the configuration space. As the numbers of atoms and species increase, the number of configurations grows exponentially, leading to an intolerable consumption of time and resources. In this context, machine learning (ML) is particularly powerful in reducing the computational cost by adopting a surrogate model, e.g., the crystal graph convolutional neural network (CGCNN)12 and other graph-based prediction models13,14,15. For instance, CGCNN encodes the crystal topology into undirected multigraphs, which efficiently integrate structural features and can be used to predict physical properties in place of DFT calculations. More recently, ML has been employed in CSP approaches16,17, demonstrating significant speedups in search algorithms by replacing DFT calculations with surrogate models. However, these methods typically require a substantial amount of training data to build a general potential, and their effectiveness for compositions unavailable in the database remains uncertain.

After a large number of structural searches, extracting the property-related structural features is essential for the exploration of materials. Numerous efforts have been made to explore and visualize the underlying relationships between global and local atomic structures and physical properties such as stability and conductivity18,19. For example, the transformation between folded and unfolded states in protein-folding dynamics has been unveiled by encoding the entire mapping from biomolecular coordinates to a Markov state model20; similarly, the transitions that contribute to Li-ion conduction have been clearly identified by using a graph dynamical network to learn local atomic environments instead of global dynamics21. These studies imply that local atomic-scale structural motifs play a critical role in physical properties. However, this relationship remains unclear in the field of structure generation because of the huge population of possible materials and the complexity of interatomic bonding, which are difficult to analyze by conventional methods. An ML-based framework for structural search and data analysis is thus in critical demand to improve the efficiency of exploring materials.

Two-dimensional (2D) materials are under extensive research, especially after the successful syntheses of 2D materials such as carbon biphenylene22 and T-carbon nanowires23,24, owing to fascinating physical phenomena induced by special structural features, e.g., nonhexagonal bonding and carbon tetrahedra. Since the differences in atomic mass and electronegativity among boron, carbon, and nitrogen are small, these elements can be combined into abundant planar BxCyN1-x-y compounds25,26,27, which offer the flexibility to modulate stability and electronic structure by tuning the alloy composition28. Nevertheless, systematic structural searches for the B-C-N alloy system are still rare29,30.

In this work, we construct a prediction-analysis framework that combines a symmetry-based combinatorial crystal optimization program (SCCOP) for structural search of target compositions with an additive feature attribution model for data analysis. The generality of SCCOP is showcased by testing it on 35 typical compounds. A practical demonstration is performed for the 2D B-C-N system to illustrate the high-throughput structural search and the ability to extract structural features. We first convert the structures generated from the 17 plane space groups to crystal vectors by a graph neural network (GNN) and predict their energies. Bayesian optimization is performed to explore structures at the minima of the potential energy surface. The desired structures are then optimized with ML-accelerated SA, in conjunction with a limited number of DFT calculations, to obtain the lowest-energy configurations. We further demonstrate that the additive feature attribution model can efficiently capture the structural features that dominate the energy and bandgap. We identify five low-energy semiconductors among all the B-C-N compounds, which have bandgaps and mechanical performance comparable to those of 2D hexagonal BN. Finally, we compare the performance of three methods, SCCOP, DFT-GA, and DFT-PSO, which indicates that SCCOP is about 10 times faster while maintaining comparable accuracy.

Results and discussion

We apply SCCOP to 35 representative 2D compounds to demonstrate its generality. We then employ SCCOP to explore 82 different compositions of the B-C-N system (see Supplementary Figs. 1–4); for each composition, we select the structures up to 0.5 eV atom−1 above the convex hull, yielding a total of 2623 structures. Further, we analyze the average energies and bandgaps with structural features extracted by the additive feature attribution model. By these approaches, five N-rich wide-bandgap insulators are discovered. Lastly, we compare SCCOP with other DFT-based methods, namely DFT-GA and DFT-PSO, which are employed in the mainstream USPEX9 and CALYPSO5 structural search codes, respectively.

Generality of SCCOP

We begin by demonstrating the generality of SCCOP in CSP; the workflow is shown in Fig. 1. We apply SCCOP to 2D materials and select 35 representative compounds to evaluate its performance. To ensure fairness, the data of these 35 compounds are excluded from all training, validation, and testing processes. By employing transfer learning, we adapt the GNN model, enabling us to reuse the pre-trained model for different compounds. This approach significantly reduces the number of DFT calculations required, with only 180 single-point energy calculations performed per compound. Figure 2a compares the lowest-energy structure in the 2D material database with the structures discovered by SCCOP. The results demonstrate that SCCOP successfully reaches the lowest energy level for 30 compounds, with an average computational time of 6.38 minutes. The remaining five compounds, namely Bi2Se3, GaS, MoS2, Ti2C, and WS2, form puckered structures. SCCOP utilizes the 17 plane space groups to search for 2D materials and introduces perturbations in the z-direction to generate puckered structures. It is important to note that while SCCOP guarantees coverage of all planar structures, not all puckered structures are covered.

Fig. 1: Workflow of SCCOP for the search of two-dimensional materials.
figure 1

Step 1: generating structures by symmetry. Step 2: characterizing structures as crystal vectors and exploring the potential energy surface by Bayesian optimization. Step 3: updating the energy prediction model. Step 4: optimizing structures to obtain the lowest-energy configuration by ML and DFT. The whole program runs in a closed loop.

Fig. 2: Performance of SCCOP on 35 compounds.
figure 2

a Time cost and lowest energy for each compound; all energy calculations are performed by DFT. b The three lowest-energy structures searched by SCCOP for eight compounds. Each compound is explored 5 times by SCCOP, with up to 10 atoms in the unit cell.

Additionally, we present the three lowest-energy structures for eight compounds in Fig. 2b. The results show that seven of these structures exhibit lower energies than those in the database. For example, in the case of AgI, the lowest-energy structure in the database corresponds to a honeycomb structure (−2.308 eV atom−1), while SCCOP identifies a more energetically favorable puckered structure in space group P21/m (−2.37 eV atom−1). Furthermore, MgCl2 is recorded with four-fold coordination (−3.509 eV atom−1) in the database, but SCCOP discovers that a six-fold coordination exhibits lower energy (−3.591 eV atom−1). These findings underscore the broad applicability of SCCOP to 2D materials. Additional results to support the generality of SCCOP are provided in Supplementary Figs. 5–9.

Energy-related feature extraction

For a thorough understanding of the connection between stability and structural features, we first plot the ternary phase diagram of the B-C-N system in Fig. 3a. In addition to 11 previously reported structures (blue circles)31,32,33, 28 dynamically stable low-energy structures are discovered (red hollow triangles). Figure 3b provides representative examples, including more stable structures found for previously explored compositions such as B1C1, B3C5, B2N3, and C4N1, as well as structures found for previously unexplored compositions such as B1N2, C5N4, B2C1N2, and B2C3N1. The stable phases of the B-C-N system have thus been greatly extended by the systematic search via SCCOP. We note that the low-energy structures are located on a line where the stoichiometric ratio of B:N is 1:1, e.g., B1N1, B1C1N1, B1C2N1, and B1C4N1, since the valence electrons of boron and nitrogen can be fully paired to reduce the energy of the structure. In contrast, the average numbers of valence electrons of boron carbides and carbon nitrides are either less than or greater than four, which hinders electron pairing in both cases; their formation energies are therefore relatively high. The phonon spectra of all searched stable structures are shown in Supplementary Figs. 10–12.

Fig. 3: Searched structures and extracted structural features of B-C-N system.
figure 3

a Ternary phase diagram of the B-C-N system. All calculations are carried out at 0 K. Borophene, graphene, and nitrogen are chosen as the corners of the Gibbs triangle. Blue circles and red hollow triangles represent previously reported stable compounds and newly searched stable structures, respectively; the gray dashed line indicates the compositions with a B:N ratio of 1. b Illustration of typical stable structures of four compounds searched by SCCOP. c Distribution of two-dimensional crystal vectors on a 2D plane using TSNE dimensionality reduction. Energy contributions of the structural motifs in four compounds are listed on the sides; each motif contains a center atom and its neighbor atoms.

Next, we cluster structures by their crystal vectors and extract stable structural features in Fig. 3c. The crystal vectors strongly relate to the atomic species of the compounds and can be clearly grouped into four clusters: carbon nitrides (CxN1-x), boron carbides (BxC1-x), boron nitrides (BxN1-x), and boron carbon nitrides (BxCyN1-x-y). This indicates that the compounds in the same cluster have similar electronic structures and form structural features with similar energies, making it possible for the GNN to predict energy from these features. For all four classes of compounds, ML finds that sp2 hybridization with bond angles of 120° is a universal structural feature, as the average number of valence electrons is close to four per atom; the honeycomb structure might thus be energetically favorable. In addition, the B-centered structural features contribute less to the energy than those centered on carbon and nitrogen, primarily due to the electron-deficient bonding nature of boron34. Carbon and nitrogen atoms can, however, form conjugated \(\pi\) bonds or fill empty p orbitals with lone pairs of electrons to enhance the stability. In the carbon nitrides, the two most common types of nitrogen atoms, i.e., pyridinic-N (−0.68 eV) and graphitic-N (−0.61 eV)31, are found. For pyridinic-N, the nitrogen atom is coordinated to two carbons and one orbital is occupied by a lone pair of electrons, while graphitic-N is characterized by sp2 hybridization of nitrogen with three carbon atoms. In the boron carbides and boron nitrides, the boron atoms tend to bond with more than three atoms, implying that boron can stabilize the structure by forming coordination bonds or multi-centered bonds29. Moreover, because of the good match of chemical valences, three-fold coordination dominates the structural features of the boron carbon nitrides. These extracted structural features deepen the understanding of structural stability and may guide future searches for low-energy B-C-N materials.

Bandgap-related feature extraction

To find out how elemental composition and bandgap are related, the bandgap distribution of the B-C-N system is plotted in Fig. 4a, which shows narrower bandgaps for the B-rich and C-rich compositions and wider bandgaps for the N-rich compositions. Interestingly, two metallic-phase regimes are located on the two sides of the line with a B:N ratio of 1 (see the red dashed line in Fig. 4); this is because of the mismatch of valence electrons, which forms bands crossing the Fermi level. Suitable compositions (e.g., B:N = 3:1 and 1:3) help to open the bandgap, while the N-rich compounds are more likely to have larger bandgaps. We cluster the structural features by coordination number in Fig. 4b. Two-fold and three-fold coordinated carbon atoms play a key role in closing the bandgap due to their free p electrons. In contrast, four-fold coordinated carbon, strongly electronegative nitrogen, and six-fold coordinated boron contribute little to the electrical conductivity, owing to either fully paired electrons or the absence of free electrons. Overall, ML enables the bandgap analysis from the perspective of coordination number, allowing conclusions to be drawn that are consistent with physical intuition.

Fig. 4: Bandgap distribution and extracted gap features of B-C-N system.
figure 4

a Bandgap distribution of the B-C-N system. For each composition, the bandgap of the lowest-energy structure is calculated. The red dashed line indicates the compositions with a B:N ratio of 1. b Contributions to the bandgap from different coordination numbers. Brown, pink, and blue denote carbon, nitrogen, and boron, respectively. c Structural features for opening or closing the bandgap of four typical structures; the spatial distribution of the valence band-edge states, the band structure near the Fermi level, and the density of states (DOS) are also depicted. The bandgap contributions and structural features are obtained from the additive feature attribution model.

Furthermore, we consider the contribution of larger structural features comprising several atoms to the bandgap. The percentage contribution is defined by \(F={\sum }_{i}\frac{{G}_{i}}{{G}_{{\rm{tot}}}}\times 100 \%\), where the summation runs over the atoms in the selected structural feature and \({G}_{\rm{tot}}\) is the total contribution to opening or closing the bandgap. A greater F therefore implies that the structural feature is more important to the bandgap. Four structures are given as examples in Fig. 4c to show the main factors identified by ML that relate to the formation of the bandgap. In C4N1, the band-edge states are mainly projected on the N-C-C-N chain, and ML identifies that this chain provides an 86% contribution to the band-edge states. The N-C-C-N chain introduces a localized low-energy impurity level near the Fermi level, thus leading to the splitting of the electron cloud in the 5-, 6-, and 8-membered rings. In B1C1, C chains are identified to be the central factor in closing the bandgap (100% contribution), as they enable the formation of continuous electron clouds that spread to the empty orbitals of adjacent boron atoms.

In B2C1N2 and B2N3, 6- and 8-membered rings of alternating B-N bonds contribute 100% and 75% to the band-edge states to enlarge the bandgap, respectively. Both are formed by the same structural motif, characterized by nitrogen coordinated with boron atoms and electrons localized on nitrogen. The direct wide-bandgap insulator hexagonal BN (h-BN) is composed entirely of this feature. In general, ML can quantify the contribution percentage of a given structural feature to rationalize the formation of a bandgap. However, the selection of multi-atom structural features still requires human assistance to verify its rationality; a general method for the selection of features is still in demand.
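As an illustration of the contribution percentage defined above, the following minimal Python sketch evaluates F for one selected motif from hypothetical per-atom contributions \(G_i\); the atom labels and numerical values are placeholders, and the use of absolute values in the sums is an assumption.

```python
# Hypothetical per-atom bandgap contributions G_i (placeholder values)
g_all = {"N1": 0.42, "C1": 0.35, "C2": 0.33, "N2": 0.40, "B1": 0.05, "B2": 0.04}
feature_atoms = ["N1", "C1", "C2", "N2"]  # atoms of the selected motif, e.g., an N-C-C-N chain

# G_tot: total contribution to opening/closing the bandgap (absolute values assumed)
g_tot = sum(abs(g) for g in g_all.values())
F = 100.0 * sum(abs(g_all[a]) for a in feature_atoms) / g_tot
print(f"F = {F:.0f}%")  # relative importance of the motif to the bandgap
```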

Wide-bandgap insulators

It is known to be challenging to predict N-rich materials, since two nitrogen atoms can easily combine into a nitrogen molecule, resulting in ill-formed structures during structural searches. SCCOP resolves this dilemma by quickly screening a large number of structures, with which we identify five stable wide-bandgap materials in the N-rich area with bandgaps, mechanical performance, and structural motifs similar to those of h-BN (see Fig. 5, Supplementary Figs. 10 and 11, and Table 1). B1N2, B3N4, and B4N5 are direct-gap, while B3N5 and B1C1N3 are indirect-gap. In particular, B1N2 has a bandgap (5.32 eV) that is even greater than that of h-BN, because the formation of fully occupied N-p dangling-bond states reduces the hybridization and width of the band-edge states and thus enlarges the bandgap. The Young's modulus, Poisson's ratio, and shear modulus of B3N4 are 180.24 N m−1, 0.19, and 75.90 N m−1, respectively. The abundant strong bonding between boron and nitrogen in the plane gives B3N4 mechanical properties comparable to those of h-BN, which is essential for reliability in practical applications. Moreover, the thermal conductivity of B1N2 is 10.13 W m−1 K−1, about 70 times smaller than that of B1N1 (708.07 W m−1 K−1).

Fig. 5: Electronic band structures for six searched structures.
figure 5

Electronic band structures and density of states (DOS) for a h-BN and b-f the discovered wide-bandgap materials.

Table 1 Calculated Young’s modulus (\(E\)), Poisson’s ratio (\(\nu\)), shear moduli (\(G\)), and lattice thermal conductivity (\(\kappa\)) at 300 K for \(h\)-BN (B1N1) and the discovered wide-bandgap materials.

The dramatic drop in the thermal conductivity is mainly caused by the asymmetric distribution of boron, carbon, and nitrogen atoms, which activates phonon anharmonicity and hence enhances phonon-phonon scattering, hindering thermal transport. Overall, owing to their exotic optoelectronic properties, good mechanical robustness, and low thermal conductivity, the discovered materials have numerous potential applications, e.g., in ultraviolet photodetectors35,36, thermal insulation materials37,38, and energy storage devices39,40.

Method comparison

Finally, we compare the computational performance of SCCOP with commonly used DFT-based search approaches, namely DFT-GA in USPEX and DFT-PSO in CALYPSO (Fig. 6). All three methods are tested on the 82 compositions while keeping the parameter setup and computational resources as consistent as possible. Notably, SCCOP is the most time-saving of the three methods and performs well in most cases. For a more concise comparison of the three methods, we summarize the key results in Table 2. We find that SCCOP identifies the lowest-energy structures for 45 of the compositions, with an average time of 5.7 minutes, which is about 10 times faster than DFT-GA and DFT-PSO; the success rate of SCCOP is comparable to or even greater than that of the other two. Therefore, we are confident that SCCOP can greatly reduce the search time while maintaining an accuracy comparable to the state-of-the-art DFT-based search approaches. Since the GNN model is trained on DFT-calculated data, it cannot surpass the accuracy of DFT results. However, owing to effective feature extraction and its relatively simple computation, the GNN can predict energies faster than DFT by 3–5 orders of magnitude15,16,41 while keeping comparable accuracy. Hence, the GNN-enhanced efficiency of SCCOP significantly reduces the time spent on initial structure screening and structural optimization, which is the main reason why SCCOP can outperform DFT-based prediction methods.

Fig. 6: Performance comparison of three methods.
figure 6

Comparison of the computational time cost and the lowest energy found after one iteration in (a) carbon nitrides, (b) boron carbides, (c) boron nitrides, and (d) boron carbon nitrides by the SCCOP, DFT-GA, and DFT-PSO approaches. The left y-axis shows the time cost on a log scale and the right y-axis shows the energy of the searched structures. The computational time is measured for runs on 2× GTX 1080 GPUs and 12× Xeon Gold 6248 CPUs.

Table 2 Comparison of the time cost and success rate of three structural search methods for the four classes of compounds, where the success rate is the fraction of compositions for which the method finds the lowest-energy structure among the three methods.

In summary, we have developed an ML-based framework for crystal structure prediction and analysis, which consists of five parts: i) generating abundant random structures in the asymmetric unit (AU) with symmetry and distance constraints, ii) Bayesian optimization with a crystal-graph representation of the structures to be searched, iii) modifying the energy prediction model to adapt to the target composition by transfer learning, iv) carrying out GNN-accelerated SA for structural optimization, and v) constructing an additive feature attribution model for feature extraction from the search results. We checked the generality of SCCOP by testing it on 35 typical compositions. We demonstrated this framework by applying it to predict the crystal structures of 82 compositions in the B-C-N system. In addition to the successful identification of previously unknown crystal structures, we were also able to extract the key features for structural stabilization, establish the relationship between bandgap and coordination number, and discover the critical factors for bandgap formation in specific structures. Five stable wide-bandgap materials with good mechanical properties and low thermal conductivities have been discovered. Compared to conventional DFT-based prediction approaches and domain-knowledge analysis methods, this integrated prediction-analysis framework, which takes full advantage of ML, can greatly shorten the discovery and design cycle of novel functional materials.

Methods

The prediction-analysis framework consists of five parts: i) random sampling, ii) structural search, iii) prediction model update, iv) structural optimization, and v) structural analysis. The workflow of SCCOP is illustrated in Fig. 1, where the GNN characterizes the crystal structures and connects the parts into an iterative loop.

Random sampling

In the first step of SCCOP, to roughly sample the potential energy surface, unbiased initial structures are randomly generated from the 17 plane space groups (PSGs), which cover all types of symmetry of 2D materials, as shown by step 1 in Fig. 1. To determine a structure with a target composition, only the periodic lattice \({\boldsymbol{L}}=\left({{\boldsymbol{l}}}_{1},{{\boldsymbol{l}}}_{2},{{\boldsymbol{l}}}_{3}\right)\in {{\mathbb{R}}}^{3\times 3}\), the PSG, the atom types \({\boldsymbol{A}}=({a}_{1},\ldots ,{a}_{N})\), and the atomic positions \({\boldsymbol{X}}=\left({{\boldsymbol{x}}}_{1},\ldots ,{{\boldsymbol{x}}}_{N}\right)\in {{\mathbb{R}}}^{N\times 3}\) are necessary. The \(n\) atoms of a structure are placed in an asymmetric unit (AU)42, which is the irreducible space that fills the primitive cell upon applying symmetry operations, enabling efficient configurational evolution. Space discretization and minimal interatomic distance techniques43 are employed to reduce the search space. This fast sampling method in the AU guarantees the generation of a set of reasonable crystal structures \({\mathcal{C}}\) with different space groups in a short time. All asymmetric units used in SCCOP are listed in Supplementary Tables 1–3.
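A schematic Python sketch of this sampling step is given below: atoms are placed randomly in the AU, the cell is filled by symmetry operations, and configurations violating a minimal-distance constraint are rejected. The operation matrices, AU bounds, lattice, and distance cutoff are illustrative assumptions rather than SCCOP's actual implementation, and periodic images are neglected for brevity.

```python
import numpy as np

def sample_structure(n_atoms, sym_ops, au_bounds, lattice, d_min=1.2, max_try=200):
    """Place atoms randomly in the asymmetric unit, expand by symmetry,
    and reject configurations with too-short interatomic distances."""
    lo, hi = au_bounds                                       # fractional bounds of the AU
    for _ in range(max_try):
        au_frac = lo + np.random.rand(n_atoms, 3) * (hi - lo)
        # Apply each (rotation, translation) operation to fill the primitive cell
        frac = np.concatenate([au_frac @ R.T + t for R, t in sym_ops]) % 1.0
        cart = frac @ lattice
        d = np.linalg.norm(cart[:, None] - cart[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        if d.min() > d_min:                                  # minimal-distance constraint
            return frac
    return None

# Example: a plane group represented by the identity and a two-fold rotation (assumed)
ops = [(np.eye(3), np.zeros(3)),
       (np.diag([-1.0, -1.0, 1.0]), np.zeros(3))]
au = (np.array([0.0, 0.0, 0.4]), np.array([0.5, 1.0, 0.6]))  # hypothetical AU bounds
coords = sample_structure(3, ops, au, lattice=np.diag([6.0, 6.0, 15.0]))
```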

Structural search

To further constrain the search space, Bayesian optimization is applied to redistribute the sampling probability in order to find the energetically favorable structures, as illustrated in step 2 in Fig. 1. In this step, crystal structures are first converted to crystal vectors \({\boldsymbol{c}}\) for characterization. A crystal graph \({\mathcal{G}}\) is built upon the atoms in the AU to maximize the efficiency of the GNN (Supplementary Table 4), and the graph convolutional operator12 is defined as \({{\boldsymbol{v}}}_{i}^{(t+1)}={\rm{Conv}}\left({{\boldsymbol{v}}}_{i}^{\left(t\right)},{{\boldsymbol{v}}}_{j}^{\left(t\right)},{{\boldsymbol{u}}}_{\left(i,j\right)}\right)\), where \({{\boldsymbol{v}}}_{i}^{\left(t\right)}\), \({{\boldsymbol{v}}}_{j}^{\left(t\right)}\), and \({{\boldsymbol{u}}}_{\left(i,j\right)}\) are the atom feature vectors and bond feature vectors at the \(t\)-th convolution, respectively. After \(K\) convolutions, the crystal vector \({\boldsymbol{c}}={{\boldsymbol{W}}}_{{\rm{m}}}^{{\rm{T}}}{\boldsymbol{V}}\) is the weighted sum of the atom vectors \({\boldsymbol{V}}=\left({{\boldsymbol{v}}}_{1}^{\left(K\right)},\ldots ,{{\boldsymbol{v}}}_{n}^{\left(K\right)}\right)\in {{\mathbb{R}}}^{n\times 64}\), where \({{\boldsymbol{W}}}_{{\rm{m}}}=\left({w}_{1},\ldots ,{w}_{n}\right)\in {{\mathbb{R}}}^{n\times 1}\) denotes the multiplicity weight matrix that depends on the symmetry of the atoms. Lastly, two dense layers map the crystal vector \({\boldsymbol{c}}\) to \(\hat{E}\); hence, a rough energy estimation and structural clustering of the samples in \({\mathcal{C}}\) can be realized by the GNN model. A few low-\(\hat{E}\) structures in each cluster are selected to obtain more precise energies by DFT calculations for the Bayesian optimization.
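The multiplicity-weighted pooling that forms the crystal vector can be written in a few lines; the sketch below is a minimal numerical illustration, assuming normalized multiplicity weights and random placeholder atom vectors rather than the output of SCCOP's trained GNN.

```python
import numpy as np

n_au = 3                                   # atoms in the asymmetric unit
V = np.random.rand(n_au, 64)               # atom vectors v_i^(K) after K convolutions (placeholders)
multiplicity = np.array([4.0, 2.0, 2.0])   # symmetry multiplicity of each AU site (assumed)
W_m = multiplicity / multiplicity.sum()    # multiplicity weights (normalization assumed)

c = W_m @ V                                # crystal vector, shape (64,)
# Two dense layers would then map c to the predicted energy E_hat.
```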

Approximating the function \(E=U({\boldsymbol{c}})\) between energy and structure is key for the Bayesian optimization. Here we characterize the structures by their crystal vectors and use samples from precise DFT calculations to fit the function \(U\) with a Gaussian process model44. The probability of improvement45 is adopted as the acquisition function, \({PI}\left({\boldsymbol{c}}\right)=1-\Phi [(\mu \left({\boldsymbol{c}}\right)-U\left({{\boldsymbol{c}}}^{* }\right)-\xi )/\sigma ({\boldsymbol{c}})]\), where \({{\boldsymbol{c}}}^{* }={{\rm{argmin}}}_{i}U({{\boldsymbol{c}}}_{i})\); \(\mu ({\boldsymbol{c}})\) and \(\sigma ({\boldsymbol{c}})\) are the mean and standard deviation of the posterior distribution at \({\boldsymbol{c}}\) from the Gaussian process, respectively, and \(\Phi\) is the cumulative distribution function of the standard normal distribution. The parameter \(\xi\) balances the trade-off between exploitation and exploration. We calculate \({PI}\) over \({\mathcal{C}}\) and choose high-acquisition-value structures for further structural optimization. In this way, abundant initial structures are screened and clustered by the GNN, enabling the location of low-energy structures and the exploration of potential candidates through Bayesian optimization.
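A minimal sketch of this acquisition step is shown below, assuming a scikit-learn Gaussian process as the surrogate and an illustrative value of ξ; the candidate-selection logic mirrors the formula in the text but is not SCCOP's actual code.

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def probability_of_improvement(gp, candidates, e_best, xi=0.01):
    """PI(c) = 1 - Phi[(mu(c) - U(c*) - xi) / sigma(c)], as in the text."""
    mu, sigma = gp.predict(candidates, return_std=True)
    sigma = np.maximum(sigma, 1e-9)                    # guard against zero variance
    return 1.0 - norm.cdf((mu - e_best - xi) / sigma)

# Fit the surrogate U on DFT-labeled crystal vectors, then rank the candidate pool:
# gp = GaussianProcessRegressor().fit(dft_vectors, dft_energies)
# scores = probability_of_improvement(gp, candidate_vectors, dft_energies.min())
# selected = candidate_vectors[np.argsort(scores)[::-1][:20]]   # top-k for optimization
```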

Prediction model update

For target compositions, the pretrained GNN prediction model should be slightly updated to reach a better accuracy, as shown in step 3 in Fig. 1. The pretrained model is trained on the 2D material databases JARVIS-DFT46, C2DB47, and 2DMATPedia48, which contain 10751 crystals covering 85 elements, 4 lattice systems, and the 17 PSGs. The train:validation:test ratio is 60%:20%:20%; a batch size of 128 with the Adam optimizer49 is used, and the best-performing model on the validation set is chosen as the pretrained model. The lowest mean absolute error (MAE) on the validation set is 0.1468 eV atom−1, with an even smaller MAE of 0.1451 eV atom−1 on the test set, implying that the model has strong generalization ability (shown in Supplementary Fig. 13). Following transfer learning techniques50, when a small amount of DFT data is used in the search, the prediction model freezes the parameters of the graph convolutional layers and only optimizes the fully connected layers, which prevents overfitting to the DFT data and improves the capability of distinguishing the energy changes of different predicted structures. Hence, only a small number of single-point energy calculations are needed to adapt the pretrained GNN model, greatly reducing the size of the training dataset required for newly added or unknown compounds.
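The freezing strategy can be expressed in a few lines of PyTorch, as sketched below; the attribute names (model.convs, model.fc) and the learning rate are placeholders rather than SCCOP's actual interface.

```python
import torch

def prepare_for_finetuning(model, lr=1e-3):
    """Freeze the graph convolutional layers; fine-tune only the fully connected head."""
    for p in model.convs.parameters():     # hypothetical attribute holding the conv stack
        p.requires_grad = False
    return torch.optim.Adam(model.fc.parameters(), lr=lr)

# optimizer = prepare_for_finetuning(pretrained_model)
# loss = torch.nn.functional.l1_loss(pretrained_model(batch), e_dft)   # MAE objective
```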

Structural optimization

To obtain more accurate structural parameters and energies of the target structures, SCCOP optimizes the structures first by ML and then by DFT, as illustrated in step 4 in Fig. 1. The structures selected in the search step typically occupy relatively high-energy regions of the potential energy surface. We first optimize the structural candidates with ML-accelerated SA: ML adjusts the structures by displacing the atomic positions and distorting the lattice vectors according to the Metropolis criterion1, i.e., the probability \(\exp (-\Delta \hat{E}/{k}_{{\rm{B}}}T)\) decides whether a change is accepted based on the energy difference \(\Delta \hat{E}\) given by the GNN prediction model. For the ML-optimized structures, \(t\)-distributed stochastic neighbor embedding (TSNE)51 is performed to reduce the dimension of the crystal vectors, and the K-means method52 is used to group the vectors into clusters. Then DFT optimization is performed to rigorously relax the lowest-energy structure in each cluster to the local minimum on the potential energy surface. The optimized lattice in this step is employed as the initial lattice for sampling crystal structures in the next search iteration. DFT is thus applied to the ML-searched structures to ensure that they satisfy the physical constraints.
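The ML-accelerated SA step can be sketched as follows, using the GNN prediction as the surrogate energy inside a Metropolis loop; the perturbation scale, cooling schedule, and predictor interface are illustrative assumptions, and lattice distortion is omitted for brevity.

```python
import numpy as np

def ml_simulated_annealing(positions, predict_energy, t_start=1.0, t_end=0.01,
                           steps=500, step_size=0.05):
    """Anneal atomic positions with a surrogate energy model (kB*T folded into t)."""
    e_cur = predict_energy(positions)
    for i in range(steps):
        t = t_start * (t_end / t_start) ** (i / steps)            # geometric cooling
        trial = positions + np.random.normal(0.0, step_size, positions.shape)
        e_trial = predict_energy(trial)
        delta = e_trial - e_cur
        if delta < 0 or np.random.rand() < np.exp(-delta / t):    # Metropolis criterion
            positions, e_cur = trial, e_trial
    return positions, e_cur
```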

Structural analysis

An additive feature attribution model18,53 is applied to extract property-related features from massive amounts of data (as shown in Fig. 7). The averaged total energy per atom is predicted as a sum over different local chemical environments, i.e., \(\hat{E}=\mathop{\sum }\nolimits_{i}^{N}{\hat{E}}_{i}/N\), where \({\hat{E}}_{i}={{\boldsymbol{W}}}_{l}{{\boldsymbol{v}}}_{i}^{T}+{b}_{l}\) is built from the atom feature vector \({{\boldsymbol{v}}}_{i}^{T}\), the weight \({{\boldsymbol{W}}}_{l}\), and the bias \({b}_{l}\). To focus on the environment consisting of a center atom and its neighbor atoms, we calculate its contribution to the energy, \({\bar{E}}_{i}\), by averaging \({\hat{E}}_{i}\) over the data clustered by coordination atoms, bond lengths, and bond angles. In this way, the energy contribution from each structural motif can be accessed independently, and a lower \(\bar{E}\) means higher local structural stability. Meanwhile, for solid-solution systems, the bandgap \(\hat{G}=\mathop{\sum }\nolimits_{i}^{N}{\hat{G}}_{i}/N\) is analyzed in the same way. \({\hat{G}}_{i}\) is also calculated by a linear transformation acting on \({{\boldsymbol{v}}}_{i}^{T}\), with a specifically designed loss function \({\mathcal{L}}={\hat{{\mathbb{E}}}}_{G > 0}\left[{\left(G-\hat{G}\right)}^{2}\right]+{\hat{{\mathbb{E}}}}_{G=0}\left[{\left(G-\max \left(\hat{G},0\right)\right)}^{2}\right]\), where the expectation \(\hat{{\mathbb{E}}}[\ldots ]\) indicates an average over a finite batch of samples and \(G\) is the bandgap computed from DFT. Structures with zero or negative \(\hat{G}\) are therefore classified as metals, which makes \({\hat{G}}_{i}\) a physically meaningful term: a positive \({\hat{G}}_{i}\) opens the bandgap, whereas a negative one closes it. Both analysis models are trained with 80% of the data and validated with the remaining 20%; the best-performing model on the validation set is selected. A comparison between the additive feature attribution model and cluster expansion is provided in Supplementary Fig. 14.
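The bandgap loss can be written directly from the formula above; the following PyTorch sketch is a minimal implementation over one batch, assuming g_pred and g_dft are 1D tensors of per-structure predicted and DFT bandgaps.

```python
import torch

def gap_loss(g_pred, g_dft):
    """L = E_{G>0}[(G - G_hat)^2] + E_{G=0}[(G - max(G_hat, 0))^2]."""
    metal = g_dft == 0
    semi = ~metal
    loss = g_pred.new_zeros(())
    if semi.any():
        loss = loss + ((g_dft[semi] - g_pred[semi]) ** 2).mean()
    if metal.any():   # negative predictions on metallic samples are not penalized
        loss = loss + (torch.clamp(g_pred[metal], min=0.0) ** 2).mean()
    return loss
```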

Fig. 7: Illustration of additive feature attribution model based on GNN.
figure 7

The total energy is the summation of energies from atoms in different chemical environments.

DFT calculations

The DFT relaxations and the energy and bandgap calculations for the searched structures are carried out using the Vienna Ab initio Simulation Package (VASP)54,55,56. For structural relaxations and energy evaluations, the generalized gradient approximation (GGA) with the Perdew-Burke-Ernzerhof (PBE) exchange-correlation functional57 is used. The ion-electron interactions are treated by the projector-augmented-wave (PAW)58,59 method. The plane-wave energy cutoff is set to 520 eV. The Brillouin zone associated with the primitive cell is sampled using a Monkhorst-Pack \(k\)-point mesh of \(4\times 4\times 1\). A vacuum space of 15 Å is applied to avoid artificial interactions between periodic images. All structures are relaxed until the energies and forces converge to 10−5 eV and 0.01 eV Å−1, respectively. The electronic band structures are calculated with the HSE06 hybrid functional60. The phonon thermal conductivity is predicted by the ShengBTE code61.
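For reference, the settings listed above translate into VASP-style input tags roughly as follows; this is a sketch of the relaxation/energy stage only, and tags not mentioned in the text (e.g., IBRION, ISIF) are assumptions left at typical values.

```python
# VASP-style settings collected from the text (relaxation/energy stage, PBE-PAW)
incar_relax = {
    "ENCUT": 520,      # plane-wave cutoff (eV)
    "EDIFF": 1e-5,     # electronic energy convergence (eV)
    "EDIFFG": -0.01,   # ionic force convergence (eV/Angstrom)
    "IBRION": 2,       # conjugate-gradient relaxation (assumed)
    "ISIF": 3,         # relax ions and cell (assumed)
}
kpoints_mesh = (4, 4, 1)   # Monkhorst-Pack mesh for the primitive cell
vacuum_thickness = 15.0    # Angstrom along the out-of-plane direction
```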