Introduction

Understanding the relationship of structures and functions is one of the most important questions in many disciplines such as chemistry, materials science, biology, which is critical for rational design of structures for achieving specific e.g. molecule, protein, or materials functions. However, the sophistication of the physical, chemical, and geometric atomic interactions makes it challenging to exhaustively enumerate the “design rules” for heuristic design approaches. On the other hand, the lack of sufficient data, the diversity of the hypothetical structure types, along with the strong physicochemical constraints makes it hard even for data driven approaches to structure design.

Here, we propose a physics guided deep learning model for generative design of crystal materials. Solid crystals such as ionic conductors, perovskites, photovoltaics, and piezoelectrics, play an important role in modern industries. Over centuries, humanity has dedicated significant amount of efforts to discovering high-performance functional materials. However, to date, only around 250,000 inorganic materials have been experimentally determined as collected by the ICSD database1, which only covers a tiny portion of the almost infinite material design space considering the combinatorial space with the number of elements cross the periodic table and the total 230 possible symmetries of crystal structures. Traditional trial-and-error tinkering methods for materials discovery are mainly reliant on domain experts’ knowledge2, which is time-consuming and labor-intensive. To meet the high demand for new functional materials, we need more efficient strategies to explore the vast chemical space to accelerate the materials discovery process.

Currently, a popular approach to generating materials is based on element substitution of existing materials combined with high-throughout virtual screening (HTVS)3. The whole process contains three steps: (1) combinatorially substituting elements in known crystal structures, (2) optimizing the candidate structures using density functional theory (DFT) calculations, and (3) experimental verification. Typical large computational materials databases created by HTVS are Materials Project (MP)4 and Open Quantum Materials Database (OQMD)5. Despite its promising usage in material design, a fundamental drawback of HTVS is that it cannot generate materials beyond the structural prototypes of existing materials. It is also extremely computationally intensive and its success rate heavily depends on experts’ intuitions.

One way to overcome the drawbacks of HTVS for discovering materials is to perform crystal structure prediction for candidate material compositions using global optimization techniques, which are used to identify their stable structural phases. Simulated annealing has been used to predict the structures of alloys6 and boron nitride7. The minima hopping8 is another algorithm for finding unknown crystalline structures9. Two widely used crystal structure prediction (CSP) algorithms are USPEX10 and CALYPSO11, which use evolutionary algorithms and particle swarm optimization for finding crystal structures. Despite their success in a variety of cases, these CSP based approaches for materials discovery suffer from their limited applicability to only relative simple structures usually with small number of atoms in unit cell.

Another promising approach to design solid materials beyond known crystal structure prototypes is generative deep learning models12,13,14, which can learn data distribution (knowledge of forming stable crystal structures) from known materials and then sample from it to generate materials. Variational Auto-encoder (VAE)15 and Generative Adversarial Network (GAN)16 are two popular generative models used to generate materials. A VAE15 model consists of two deep neural networks, an encoder and a decoder. The encoder is trained to encode materials into latent vectors and the decoder reconstructs the materials from the latent vectors. After training, different strategies can be used to sample the latent space and then use the decoder to generate materials. iMatGen2 is believed to be the first work that uses VAE to realize the inverse design of solid materials. It encodes unit cells into 3D grid based representations, and spherical linear interpolation and Gaussian random sampling are used to sample from the latent space to generate materials. Hoffmann et al.17 extend iMatGen by combining a UNet module to segment reconstructed 3D voxel images into atoms. Based on iMatGen and Hoffmann et al., ICSG3D18 integrates formation energy per atom into 3D voxelized solid crystals, which enables the VAE to encode materials and energy simultaneously. This makes it possible to generate materials subject to user-defined formation condition. Another approach to represent 3D crystals is to encode 2D crystallographic representations as the combination of the real space and the reciprocal-space Fourier-transformed features19. In CDVAE20, a diffusion network is trained to generate material structures21, in which a diffusion process within their diffusion variational autoencoder moves atoms into positions in the lower energy space to generate stable crystals. All these models have difficulty in generation of high quality structures with high symmetry (e.g. space group number >=62) due to their neglecting the structure symmetry in their generation models, a major special characteristic of periodic crystal structures. A GAN model16 consist of two deep neural networks, a generator and a discriminator (critic). The generator creates fake materials with inputs of random vectors with or without conditioning on elements and space groups while the discriminator tries to tell real materials from generated ones. With learnt knowledge of forming crystals, the generator can directly create materials. The first method to generate materials using GAN is CrystalGAN22, which leverages a CycleGAN23 to generate ternary materials from existing binaries. However, it remains uncertain whether CrystalGAN can be extended to produce more complex crystals. GANCSP24 and CubicGAN12 are two GAN based generation models that directly encode crystal structures as matrices containing information of fractional coordinates, element properties, and lattice parameters, which are fed as inputs to build models that generate crystals conditioned on composition or both composition and space group. The major difference between them is that GANCSP can only generate structures of a specific chemical system (e.g. Mg-Mn-O system) while CubicGAN can generate structures of diverse systems of three cubic space groups. In CCDCGAN25, Long et al. use 3D voxelized crystals as inputs for their autoencoder model, which then converts them to 2D crystal graphs, which is used as the inputs to the GAN model. A formation energy based constraint module is trained with the discriminator, which automatically guides the search for local minima in the latent space. More recently, modern generative models such as normalizing flow26,27 and diffusion models have also been20 (CDVAE) or planned to be28 applied to crystal structure generation. Less related works include MatGAN29 and CondGAN(xbp)30 developed for generating only chemical compositions.

Despite the success of VAEs and GANs in material generation2,12,20, all current generative models have several major drawbacks. For example, the iMatGen algorithm2 can only generate structures of a specific chemical system such as vanadium oxides and only several metastable VxOy materials were discovered out of 20,000 generated hypothetical materials. Similarly, GANCSP24 and CrystalGAN22 only generate for a given chemical system (e.g. Mg-Mn-O system and hydride systems). VAE-UNet pipeline developed in18 expands the diversity of generated materials and can reconstruct the atom coordinates more accurately by incorporating UNet segmentation and conditioning on properties. However, VAE-UNet still confines itself to cubic crystal system generation and the number of atoms in a unit cell is restricted to no more than 40. All above discussed works do not realize high-throughout generation of crystal materials. CubicGAN12 is an early public example of a high-throughput generative deep learning model for (cubic) crystal structures, which has discovered four prototypes with 506 materials confirmed to be stable by DFT calculations. Although CubicGAN has generated millions of crystal structures with hundreds of stable ones confirmed, the generated structures are limited to three space groups in the cubic crystal system, of which the atom coordinates are assumed to be multiples of 1/4: it is not capable of generating generic atom coordinates. While these works open the door to generative design of materials, several unique challenges still remain that prevents effective generative design: (1) how to learn the physical atomic constraints of stable materials to enable efficient sampling; (2) how to achieve precise generation of atom fractional coordinates and lattice parameters; (3) how to handle the extreme bias of the distribution of materials in the 230 space groups; (4) how to exploit the high symmetry of crystal structures in the generation process.

In this work. we introduce a physics guided crystal generative model (PGCGM) to exploit the physical rules for addressing aforementioned challenges. Our contributions are summarized as follows:

  1. 1.

    We present a physics guided deep generative model for crystal generation that combines the space group affine transformation and an efficient self-augmentation method.

  2. 2.

    We propose two physics-oriented losses based on atomic pairwise distance constraints and structural symmetry to fuse the physical laws into deep learning model training.

  3. 3.

    We evaluate our model against two baselines to show its superiority and perform DFT calculations to validate our generated structures with high success rate (93.5% can be optimized successfully).

Results

Dataset

We collect our material data from MP4, ICSD1 and OQMD (v1.4)5. In total, 42,072 ternary materials with 20 space groups (5 crystal systems) are curated when we start this project (The choice of 20 space groups is due to the limited structure samples for other remaining space groups. Our model is applicable to any space group given sufficient training samples). We use a 80–20 random training/validation split for all of our experiments. We term the dataset with 42,072 materials as MIO. When conducting this project, the latest version of OQMD is just yet released. There are 9441 ternary materials that are filtered by the same criteria and are brand-new materials in the latest OQMD (v1.5). We use these 9441 ternary materials as our test dataset TST to compare our method with two baselines. Details regarding dataset collection are in Dataset Curation section of supplementary materials.

Generation performance

We compare PGCGM with two latest algorithms that can generate crystals with multiple chemical systems instead of only a special group of materials (e.g. VxOy and Mg-Mn-O systems)2,24. FTCP19 combines real space properties (e.g., atom coordinates) and momentum-space properties (e.g. diffraction pattern) to represent crystal structures. Then a CNN based VAE is trained for materials generation. CubicGAN12 trains a WGAN-GP31 to generate cubic structures in three space groups and here we expand the original method to 20 space groups.

The performance is shown in Table 1. For each method, we sample 500,000 structures and for PGCGM and CubicGAN, we perform atom clustering and merging. However, our atom clustering and merging cannot proceed with materials generated by FTCP and then we did not perform atom clustering and merging on those materials. The percentage of Crystallographic Information Files (CIFs) that are readable by pymatgen32 are shown in the CIFs column. Here readable means it can be proceeded by pymatgen.core.structure.Structure.from_file. We can find that PGCGM+dist has the largest percentage of materials left and PGCGM+dist+coor comes next. It tells us that distance and coordinates losses play a big part in generating readable materials. For later percentage related metrics, we use the number of CIFs left of each method as denominator. Our model significantly outperforms FTCP by 36.4% in terms of distance validity and is slightly better than CubicGAN. In terms of distance validty, our model outperforms FTCP and CubicGAN by 6.5% and 27.0%, respectively. Since validity are relatively weak metrics, property distribution is further used to provide a stronger metric to evaluate whether the generated materials are realistic. Our model significantly outperforms both two baselines. In terms of minimum atom distance, PGCGM decreases wasserstein distance (WD) by 1.461 compared to FTCP and by 0.402 compared to CubicGAN. In terms of maximum atom distance, PGCGM+dist+coor decreases WD by 0.264 compared to FTCP and by 2.986 compared to CubicGAN. Although CubicGAN has a close minimum atom distance distribution to PGCGM, the much bigger gap of maximum atom distance distribution between CubicGAN and PGCGM+dist+coor indicates that CubicGAN tends to generate larger crystal structures. In terms of density, PGCGM+dist decreases WD by 2.130 compared to FTCP and by 3.106 compared to CubicGAN. PGCGM also achieves the best diversity score even though it generates more readable CIFs than FTCP, which further shows that FTCP is not able to generate not only physically realistic materials but also materials with restricted diversity of formulas. We choose PGCGM+dist+coor as our finalized model to generate materials for late analysis since PGCGM+dist+coor has better properties distribution performance than PGCGM and PGCGM+dist on average.

Table 1 Material generation performance.

Analysis of materials optimized by Bayesian optimization with symmetry relaxation

Bayesian optimization with Symmetry Relaxation (BOWSR) algorithm33 is an approach that uses Bayesian optimization to iteratively search lower energy surface to optimize the crystal structures based on the properties predicted by deep learning methods, such as CGCNN34 and MEGNet35. Instead of directly using expensive DFT for relaxing generated materials, we first use BOWSR to optimize structures generated by our model and two baseline models and then use DFT calculation to further relax them. We randomly select 2,000 generated materials with less than or equal to 32 atoms for FTCP, CubicGAN and PGCGM. We select 100 materials for 20 space groups equally generated by PGCGM. Note that we also use the same 2,000 materials of PGCGM for further DFT analysis. Because some space groups are underrepresented (with less than 100 materials) in CubicGAN-generated materials, we select all materials under these space groups and then we select materials for the rest of space groups proportionally to obtain 2000 materials. For FTCP, materials that can be successfully analyzed to have space groups by pymatgen get_space_group_info with symprec=0.132 surprisingly all belong to space group P1, which means FTCP loses the significant symmetric constraints when generating materials. Our methods PGCGM and CubicGAN are much better than FTCP in terms of space groups retention. Moreover, it takes more than 10 times time to optimize materials generated by FTCP than by PGCGM and CubicGAN using BOWSR. We use StructureMatcher from pymatgen32 to match the generated materials with the corresponding optimized materials by BOWSR.

Table 2 shows the match rate and RMS displacement. The match rate is the percentage of materials satisfying the criteria ltol=0.2, stol=0.2, angle_tol=0.5 and then we calculate RMS displacement for the matched materials. Firstly we find that our method has a slightly higher number of successfully optimized materials by BOWSR. However, our method significantly outperforms FTCP and CubicGAN by 4200% and 34.23% in terms of match rate, respectively. It seems that FTCP has the best RMS displace but the extremely low match rate might tell us that BOWSR is hard to optimize materials with low symmetry, such as space group P1. CubicGAN comes next in terms of RMS displacement.

Table 2 Bayesian optimization performance.

Analysis of rediscovering materials in training and test datasets

It would be helpful to show how fast our model can rediscover materials in training datase MIO and test dataset TST. To do this, we sample different number of materials and then calculate the percentage of materials rediscovered in generated materials. “Reduced Formula - Space Group ID - # of Atoms” is defined as prototype to identify unique materials in the existing and generated materials. Figure 1a shows the change of unique crystals and rediscovery rate over size of sampling materials. We start to sample materials from half million and the number ends at 60 million eventually. It is found that the percentage of unique materials (cyan line) are decreasing and gradually tend to grow flat as number of sampling materials increases. The rediscover rate of MIO (orange bars) increases consistently over the sampling process and it soars quickly to 42.6% when 35 million materials are sampled. Starting from 42.6%, the percentage of rediscovered materials in MIO grows smoothly and it reaches 52.0% when the sampling size is 60 million. Similar growing patterns can be observed for the rediscover rate for the test dataset, as shown by blue line in Fig. 1a. The rediscover rate reaches 43% at the end of 60 million sampling size. This percentage is lower than that of training dataset because of the different proportions over the 20 space groups in training and test datasets as shown in Supplementary Table 1 and Supplementary Table 2.

Fig. 1: Analysis of rediscovered materials.
figure 1

a The discovery rate of prototypes in MIO and TST. The number of unique prototypes is 26,463 (42,072) in MIO and the number of unique prototypes is 8103 (9441) in TST. b The parity plots for lattice lengths of generated materials (red dots) and corresponding ground truth materials that match the generated materials in MIO and TST datasets. Top row is for the training dataset and bottom row is for the test dataset (OQMD v1.5), respectively. The blue dots represent the results generated by a perfect generator that rediscover all training and testing samples. R2 and RMSE are also used to evaluate the performance of generated lattice lengths compared to existing ones.

After rediscovering the materials in MIO and TST from the generated materials when sampling size is 60 million, we utilize StructureMatcher from pymatgen32 to test whether the generated materials match the rediscovered materials and to calculate RMS displacement between two matched structures considering all invariances of materials. Because one prototype might correspond to multiple structures in existing and generated materials, we only show the least RMS displacement by exhausting each pair of existing and generated materials for this prototype. The match rate is the percentage of materials satisfying the criteria ltol=0.2, stol=0.3, angle_tol=5.0. The match rate and RMS are 25.4% and 0.05 for training dataset and are 17.7% and 0.085 for test dataset, respectively. Figure 1b shows parity plot that compares generated lattice lengths against DFT calculated lattice lengths. Surprisingly, the co-relation between the discovered materials in test dataset and generated materials is better than in training dataset in terms of R2. The R2 for lattice a, b, and c in test dataset are 0.606, 0.616, and 0.606, respectively as in Fig. 1b, which increases R2 as in training dataset by a factor of six except for lattice c. The rediscovered materials in training dataset have larger lattice a and b and we find that these materials mostly are with cubic space groups. It seems that our approach tends to generate more realistic lattice for non-cubic space groups than cubic space groups in rediscovered materials.

DFT validation

We use the same 2000 materials as in Bayesian optimization for DFT verification. Out of 2000 generated crystals, 93.5% (1869) are successfully optimized, which is significantly better than 33.8% of CubicGAN as reported in12. Figure 2a demonstrates the distribution of formation energy of successfully optimized materials after removing 6 materials with formation energy larger than 10 eV. It is observed that most of the materials have formation energy around 0 eV and 39.6% of them have negative formation energy. Negative formation energy indicates potentially stable materials. Figure 2b shows the distribution of 1579 materials that have energy-above-hull after removing one material with super large energy-above-hull (1160.8 eV). The energy-above-hulls are calculated using the Pymatgen’s Phase diagram analyzer32. Energy above hull is a stronger indicator whether the materials are stable or not. Overall, 3 materials with energy above hull of 0 eV/atom and 106 (5.3%) ones with energy above hull less than 0.25 eV/atom, which further indicates our model can generate reliably stable materials. All the optimized materials are included in the supplementary materials.

Fig. 2: The distribution of formation energy for 1863 materials and energy above hull for 1579 materials.
figure 2

a 39.6% materials are with negative formation energy. b Three materials are with energy above hull equal to zero and 106 ones with energy above hull less than 0.25 eV/atom among 1579 materials.

Pair-wise atom distance based loss not only constrains the two atoms in a reasonable range, but also helps generate lattice lengths close to DFT-calculated ones. To demonstrate this, we calculate relative error, R2, RMSE, and O (outliers percentage) for lattice lengths for 1869 materials as shown in left panel of Fig. 3 and for only 293 cubic materials by PGCGM and 14,432 cubic materials by CubicGAN as shown in right panel of Fig. 3. In terms of relative error, we can find that the mean relative error of lattice lengths is much more close to zero regardless of when comparing 1869 materials or just cubic materials by PGCGM with cubic materials by CubicGAN, which indicates that PGCGM tends to generated precise lattice lengths. In addition, the outliers of lattice lengths in 1869 materials by PGCGM scatter across 100% and cubic materials from 1869 ones only have two outliers compared to CubicGAN whose outliers cluster near to 150% even though CubicGAN overall has a lower outliers percentage. We also evaluate the lattice lengths generation performance between PGCGM and CubicGAN with R2 and RMSE. For 1869 materials, the generated lattice lengths by PGCGM better fit to the DFT calculated lattice lengths than CubicGAN in terms of R2. In terms of RMSE, PGCGM is generally slight better CubicGAN. When only comparing cubic materials, PGCGM significantly outperforms CubicGAN in terms of both R2 and RMSE. Although it is not a direct comparison between PGCGM and CubicGAN, all this findings indicate that our model can generate high quality materials with reasonable lattice lengths.

Fig. 3: The comparison of lattice parameters P* generation performance for 20 space groups and space groups in cubic system.
figure 3

Lattice angles are constrained to fixed values by virtue of crystal systems of 20 space groups. Relative error is calculate by (lengthGEN − lengthDFT)/lengthDFT, where lengthGEN is the generated lattice length and lengthDFT is the relaxed lattice length. The error boxplot in subfigure (a) and (b) extends from the first quartile to the third quartile of the relative error, with the line in the median. Two whiskers extend from the box by 1.5x the inter-quartile range. Outliers lie out of the whiskers. The bounding boxes correspond to each box plot above them and R2, RMSE, and O are used to evaluate the lattice lengths generation performance. O means the percentage of outliers in the box plots. a The error distribution of three lattice lengths for 1869 materials generated/relaxed in PGCGM. b The error distribution of one lattice length for cubic materials generated/relaxed in PGCGM and CubicGAN, respectively. There are 293 cubic materials optimized in PGCGM and 14,432 cubic materials optimized in CubicGAN successfully.

In Table 3, 20 structures with lowest formation energy are selected for 20 space groups as in the dataset. Before any post-processing, both materials has a large number of atoms. It is easily found that the atoms of the same elements are crowded together in column of GEN. After clustering and merging the atoms of the same elements, the number of atoms drop rapidly, such as space group 164 (from 36 to 14) and space group 227 (from 432 to 32) as shown in column of MER. Column OPT shows the crystal structures after DFT optimization. Columns of Formula (GEN) and Fromula (MER) show the formulas before and after clustering and merging, respectively. We can find that the formulas may be changed after clustering and merging even though we only keep materials that do not change space group when conducting clustering and merging. The reason behind this is because we employ base atom sites in generated materials. Note that line 10 of Algorithm 1 tends to fail because of the decimal mantissa of base atom sites which in return can easily lead to the large number of atoms when converting from base atom sites to full atom sites via Algorithm 1, particularly space groups with large affine matrix, such as 227 and 225 as shown in Table 3.

Table 3 20 example optimized crystals with lowest energy for 20 space group.

Case study of three example stable materials

We discovered three compounds with Mg2GaIr, SrYO6 and ZnTe2S6 chemical formulas, which are thermodynamically stable with negative formation energies and zero energy above hull. The structures we found have P6_3/mc [194] (hexagonal), Pm-3 [200] (cubic) and [148] (trigonal) space group symmetries [space group number] (crystal systems) for Mg2GaIr, SrYO6 and ZnTe2S6 compounds, respectively. Figure 4a shows the structures of those three materials. The lattice parameters for Mg2GaIr material are a = b = 4.38 Å, c = 8.54 Å, α = β = 900 and γ = 1200, while that for SrYO6 material are a = b = c = 4.61 Å and α = β = γ = 900. Moreover, the lattice parameters for ZnTe2S6 material are a = b = c = 7.57 Å and α = β = γ = 49.380. As shown in Table 4, our spin-polarized DFT calculations show that the both Mg2GaIr and ZnTe2S6 compounds have the non-magnetic ground states, whereas SrYO6 material has a ferromagnetic ground state with a total magnetic moment of 1 μB. Figure 4b contains the electronic band structures for each stable material. It is clear that both Mg2GaIr and ZnTe2S6 compounds are metals. However, we can see spin-splitting in SrYO6 ferromagnetic material. In this compound only spin-down electrons cross the Fermi level, while spin-up electrons have a band gap of 3.09 eV. Thus, this is a half-metal where spin-down electrons show metallic character, while spin-up electrons are insulating. Half metallicity is widely investigated for spintronics and it is vital for developing memory devices and computer processors36.

Fig. 4: The structures of Mg2GaIr, SrYO6 and ZnTe2S6 and their corresponding DFT calculated properties.
figure 4

The (a) structures, (b) electronic band structures, and (c) phonon dispersion of the stable materials.

Table 4 The electronic and magnetic properties of the stable materials.

Table 5 contains the elastic constants and the mechanical properties of the stable materials. The elastic stability criteria (Born criteria) for Mg2GaIr with P6_3/mm space group symmetry are C11 > ∣C12∣, \(2* {C}_{13}^{2} < {C}_{33}({C}_{11}+{C}_{12})\), and C44 > 0. The Born criteria for SrYO6 with Pm-3 space group symmetry are C11 − C12 > 0, C11 + 2C12 > 0, C44 > 0, and that for ZnTe2S6 with R-3 space group symmetry are C11 > ∣C12∣, \({C}_{13}^{2} < 0.5* {C}_{33}({C}_{11}+{C}_{12})\) and \({C}_{14}^{2}+{C}_{15}^{2} < 0.5* {C}_{44}* ({C}_{11}-{C}_{12})\), and C44 > 037. It is clear that all those three materials comply with their elastic stability criteria implying they are mechanically stable. Table 5 also has the bulk (B) modulus, shear (G) modulus and Young’s (Y) modulus and Poisson’s ratio (ν) for Mg2GaIr, SrYO6, and ZnTe2S6 compounds. We used Hill approximation as implemented in vaspkit code38,39. It is clear that Mg2GaIr and SrYO6 materials have approximately same B, G, and Y values, while those values for ZnTe2S6 compound is considerably lower. In Table 5, ν of Mg2GaIr has the highest value, while lowest ν can be obtained from ZnTe2S6. Furthermore, we used Phonopy code40 to calculate the phonon dispersion relations for the above materials. As shown in Fig. 4c, there are no imaginary phonon modes (negative frequencies) indicating those three materials are dynamically stable at 0 K.

Table 5 The elastic constants (Cij), bulk (B) modulus, shear (G) modulus and Young’s (Y) modulus in GPa and Poisson’s ratio (ν) for the stable materials.

In this work, we propose a physics guided deep crystal generative model (PGCGM), in which two kinds of physics based losses are invented in the generator to improve the quality of generated materials. The atom distance based losses constrain the atom distance in a certain range in the generated materials and thus the generated lattice parameters fall into reasonable range too. To fulfill the symmetry requirements, the model transforms the implicit rules between base atoms sites and full atom sites into explicit cost functions. Two baseline methods are compared and PGCGM achieves the best performance across all evaluation metrics. In particular, PGCGM significantly outperforms the two baseline models in terms of property distribution metric which is a much stronger indicator to show the reality of the generated materials20. In addition, we use BOWSR to optimize 2000 randomly selected materials in each method. Our approach has the best match rate calculated between the Generative model-generated materials and BOWSR-optimized materials, which further demonstrate our method can generate realistic materials.

In order to see how our approach can rediscover materials in existing databases, we sample different size of materials and calculate rediscover rate for training and test datasets. We can observe a clear trend of increased rediscover rate over sampling size. There is no clear saturation point of rediscover rate at the end of 60 million sampled materials as in CubicGAN12. The reasons are: (1) the possible design space of 20 space groups (5 crystal systems) in this work are much bigger than 3 space groups (only cubic crystal systems) in CubicGAN; (2) CubicGAN uses special fractional coordinates while PGCGM generates fractional coordinates in full space, which means PGCGM has a significantly broader space to explore new materials. Furthermore, 1869 out of 2000 materials are successfully optimized by DFT calculation. Among 1869 materials, 39.6% possess negative formation energy and 5.3% with energy above hull less than 0.25 eV/atom, indicating that invented physics guided losses help generate stable crystal structures effectively. This research gives a deep insight into how physics losses help generate realistic materials and offers an approach to expand the diversity of generated materials.

Due to the difficulty to generate 500,000 structures by the diffusion based generative model CDVAE, we conduct a small-scale evaluation and comparison. We generated 1100 structures using CDVAE, among which 78.2% (860) are pymatgen readable. We then analyzed the space groups of these 860 structures and find out of the top 10 space groups with the most structures (a total of 790), 97.8% (773) structures have low symmetry with space group number less or equal to 25 and 582 (73.7%) of them have no symmetry. However, it is shown41 that high-symmetry materials tend to form stable structures and have larger potential to have good functional properties. It seems that like other VAE based generator models, CDVAE also has difficulty in generating high-symmetry structures as PGCGM does.

Methods

Problem statement and notations

Our data-driven generative models of crystal structures are first trained with known crystal structures in materials databases. Since crystal materials are periodic structures, instead of representing the infinite structures, the structure of an inorganic material is represented by a unit cell in material science, which is the smallest unit that completely reflects the arrangement of atoms in the 3D space. Given a generated unit cell, it can be used to build the periodic structure of an inorganic material by repeating it multiple or infinite times along three directions to form a super cell. A material \({{{\mathcal{M}}}}\) can then be denoted as following:

$${{{\mathcal{M}}}}=({{{\bf{E}}}},{{{\bf{B}}}},{{{\bf{P}}}},{{{\bf{O}}}})$$
(1)

where

  1. (a)

    \({{{\bf{E}}}}=({e}_{0},{e}_{1},{e}_{2})\in {\mathbb{E}}\) denotes elements in materials, where \({\mathbb{E}}\) is the element set in periodic table. In this work, we only deal with ternary materials so that there are only three unique elements in the unit cell;

  2. (b)

    \({{{\bf{B}}}}=({{{{\bf{b}}}}}_{0},{{{{\bf{b}}}}}_{1},{{{{\bf{b}}}}}_{2})\in {{\mathbb{R}}}^{3\times 3}\) denotes the base atom sites, which are the symmetry equivalent positions. termed . bi is fractional coordinates of an atom denoted by \({\left[u,v,w\right]}^{T}\). We choose materials that one element only has one base atom site so that three atom sites can be used to represent the atom positions. Moreover, any one atom site of each element can be considered as the base atom site for that element;

  3. (c)

    \({{{\bf{P}}}}=(a,b,c,{{{\boldsymbol{\alpha }}}},{{{\boldsymbol{\beta }}}},{{{\boldsymbol{\gamma }}}})\in {\mathbb{R}}\) are six lattice parameters that define three lengths and three angles of the unit cell;

  4. (d)

    \({{{\bf{O}}}}=({{{{\bf{t}}}}}_{0},{{{{\bf{t}}}}}_{1},\ldots ,{{{{\bf{t}}}}}_{n})\in {{\mathbb{R}}}^{n\times 4\times 4}\) denotes affine matrix that represents the symmetry operations defined by space groups sgp. tj is one affine operator containing the rotation and translation matrices. n is determined by space groups. Generally the higher symmetry of a space group, the larger n. n can be as small as 1 or as large as 192.

Now we can model the generation of materials as follows:

$$({{{\bf{B}}}},{{{\bf{P}}}})={f}_{\theta }({{{\bf{Z}}}},{{{\bf{E}}}},{{{\rm{sgp}}}}),$$
(2)

where fθ is the generative model that learns the knowledge of forming crystal structures given inputs of random noise Z, element set E, and space group sgp.

Overall architecture of physics guided crystal generative model

Our Physics Guided Crystal Generative Model (PGCGM) is shown in Fig. 5. The PGCGM mainly consists of four major components: (1) discriminator, (2) generator, (3) self-augmentation, and (4) atom distance matrix/loss calculation module. In the generator and discriminator, affine matrix is integrated into the training to generate fake materials and tell fake materials from real ones, respectively. Affine matrix is related to symmetry information for space groups. The implicit combination of affine matrix and base atom sites can help keep the symmetry when generating materials. Self-augmentation increases the training materials for underrepresented space groups by randomly forming base atom sites. With three sets of base atom sites, we can not only have a fixed size of input to the discriminator, but also deduce more physical information for crystals to help the discriminator better distinguish real materials from fake ones. Furthermore, we design two kinds of physics guided losses. Any set of base atom sites can be converted to full set of unique atom sites. When generating three sets of base atom sites, it implicitly can be stated that the three sets of base atom sites should be different but the full atom sites converted from them separately ought to be same. Hence a specific loss is invented to explicitly incorporate this rule into training of the generator. In order to restrict the two atoms in the 3D space to be not too close or not too distant, inter- and intra-atom distance losses are proposed. With distance loss, the generator further can generate reasonable lattice parameters in order to push any pair of atoms to fall into a certain range.

Fig. 5: The overview of physics guided crystal generative model (PGCGM).
figure 5

The PGCGM comprises four components. a The generator takes affine matrix O, random noise Z, and element properties E as inputs. The affine matrix and random noise are projected to two vectors by 2D convolutional networks and fully connected layers, respectively and then the two vectors are merged and projected to generate lattice parameters P* by fully connected layers. The element properties are projected to a vector by 1D convolutional networks and then it is merged with the vector projected from random noise to generate three sets of base atom sites \(({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2})\). b The discriminator has two input branches. It shares with the same affine matrix branch as in the generator. The assembled crystal representation matrix from three sets of base atom sites, lattice parameters, and properties calculated from them is used as the input to 2D convolutional networks. The assembled matrix is zero-padded to form a matrix with shape of 3 × 8 × 8. c The self-augmentation performed on the base atom sites. We choose three sets of base atom sites from three elements randomly and with space group, we can calculate more crystal information to assemble the input matrix for the discriminator. d Inter- and intra-atom distance matrices (Hintra and Hinter) are calculated from three sets of base atom sites for both real and fake materials. Then we design distance based losses to constrain the distance between two atoms in a certain range as shown in the grey area form by two circles.

Discriminator

There are two input branches for crystal representation and affine matrix in Discriminator as in sub-figure (b) of Fig. 5. Each branch is forwarded to a 2D convolutional block and the learnt features are concatenated together. The concatenated vector is sent to a couple of fully connected layers to get the discriminative score. We have three different sets of base atom sites in our inputs and with the affine matrix branch, it helps to implicitly learn the knowledge of how affine matrix transforms base atom sites into full atom sites. The detailed architectures of two convolutional blocks can be found in Table S3 in the supplementary materials.

Generator

The architecture of generator is shown in sub-figure (a) of Fig. 5, which contains three branches. Conditioning on element constituents and space group, the generator outputs three sets of base atom sites \(\left({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2}\right)\) and unit cell length P*. Then we re-formalize Eq. (2) as follow:

$$\left({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2},{{{{\bf{P}}}}}_{fake}^{* }\right)={f}_{\theta }^{* }({{{\bf{Z}}}},{{{\bf{E}}}},{{{\rm{sgp}}}}).$$
(3)

Taking random noise Z, space group sgp, and element properties matrix E as inputs, the generator can generate a material with the same lattice parameters and space group but different representations of the base atom sites when merely sampling one material. Our goal here is that the generated three sets of base atom sites belong to the same material. Random noise Z is mapped to a dense vector a fully connected layer. The space group branch is the same as in discriminator. Element matrix E is forwarded to a 1D convolutional layer (Conv1D). The outputs of random noise and space group branches are combined together as the inputs to a multi-layer perceptron (MLP) block to generate unit cell length \({{{{\bf{P}}}}}_{fake}^{* }\). The outputs of random noise and element branches are combined together as the inputs to 2D deconvolutional layers (ConvTran2D) to generate three sets of base atom sites \(({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2})\). The detailed descriptions for MLP, Conv1D, and ConvTran2D can be found in Supplementary Table 4.

Physics guided loss function

The original GAN16 is notoriously hard to train because of saturation and mode collapse in discriminator. We take advantage of WGAN-GP31 with gradient penalty to enhance the training stability in our network. WGAN-GP changes the Sigmoid function of the discriminator to a 1-Lipschitz function while introducing a gradients penalty term to enforce the norm of gradients to be close to 1. The loss function is described in Eq. (4):

$$\begin{array}{l}{\hat{{{{\mathcal{M}}}}}}^{* }=\epsilon {{{{\mathcal{M}}}}}_{real}^{* }+(1-\epsilon ){{{{\mathcal{M}}}}}_{fake}^{* },\quad \epsilon \sim U(0,1),\\ {{{{\mathcal{L}}}}}_{dis}=D\left({{{{\mathcal{M}}}}}_{fake}^{* }\right)-D\left({{{{\mathcal{M}}}}}_{real}^{* }\right)+{\lambda }_{d}{\left({\left\Vert {\nabla }_{{\hat{{{{\mathcal{M}}}}}}^{* }}D({\hat{{{{\mathcal{M}}}}}}^{* })\right\Vert }_{2}-1\right)}^{2},\\ {{{{\mathcal{L}}}}}_{adv}=-D\left({{{{\mathcal{M}}}}}_{fake}^{* }\right),\end{array}$$
(4)

where \({\hat{{{{\mathcal{M}}}}}}^{* }\) is linearly interpolated between real materials \({{{{\mathcal{M}}}}}_{real}^{* }\) and fake materials \({{{{\mathcal{M}}}}}_{fake}^{* }\) and ϵ is uniformly sampled from 0 and 1. \({{{{\mathcal{L}}}}}_{dis}\) and \({{{{\mathcal{L}}}}}_{adv}\) represent the loss function of the discriminator and adversarial loss for generator respectively. The third term in \({{{{\mathcal{L}}}}}_{dis}\) is the gradient penalty and λd is set to 10. D(.) means the score result from the discriminator.

Atom Distance Losses. To ensure that the atoms in generated crystal structures are not crowded or not too far apart from each other, we introduce the inter- and intra-atom distance based losses as following:

$$\begin{array}{lll}{{{{\mathcal{L}}}}}_{inter}=\frac{1}{{{{\rm{9N}}}}}\mathop{\sum }\limits_{i=1}^{N}\left\{{\left[\max \left({{{{\bf{H}}}}}_{inter},{\phi }_{inter}^{upper}{{{{\bf{S}}}}}_{inter}\right)-{\phi}_{inter}^{{upper}}{{{{\bf{S}}}}}_{inter}\right]}^{2}\right.\\ \qquad\qquad+\left.{\left[\min \left({{{{\bf{H}}}}}_{inter},{\phi }_{inter}^{lower}{{{{\bf{S}}}}}_{inter}\right)-{\phi}_{inter}^{{lower}}{{{{\bf{S}}}}}_{inter}\right]}^{2}\right\},\\ {{{{\mathcal{L}}}}}_{intra}=\frac{1}{{{{\rm{9N}}}}}\mathop{\sum }\limits_{i=1}^{N}\left\{{\left[\max \left({{{{\bf{H}}}}}_{intra},{\phi }_{intra}^{upper}{{{{\bf{S}}}}}_{intra}\right)-{\phi}_{intra}^{{upper}}{{{{\bf{S}}}}}_{intra}\right]}^{2}\right.\\ \qquad\qquad+\left.{\left[\min \left({{{{\bf{H}}}}}_{intra},{\phi }_{intra}^{lower}{{{{\bf{S}}}}}_{intra}\right)-{\phi}_{intra}^{{lower}}{{{{\bf{S}}}}}_{intra}\right]}^{2}\right\},\end{array}$$
(5)

where \({{{{\mathcal{L}}}}}_{inter}\) constrains the distance in Hinter which describes inter-atom distance matrices. \({\left[\max ({{{{\bf{H}}}}}_{inter},{\phi }_{inter}^{upper}{{{{\bf{S}}}}}_{inter})-{\phi }_{inter}^{upper}{{{{\bf{S}}}}}_{inter}\right]}^{2}\) enforces the atom distance to be smaller than \({\phi }_{inter}^{upper}{{{{\bf{S}}}}}_{inter}\) and \({\left[\min ({{{{\bf{H}}}}}_{inter},{\phi }_{inter}^{lower}{{{{\bf{S}}}}}_{inter})-{\phi }_{inter}^{lower}{{{{\bf{S}}}}}_{inter}\right]}^{2}\) enforces the atom distance to be bigger than \({\phi }_{inter}^{lower}{{{{\bf{S}}}}}_{inter}\). Sinter are atom radius sum corresponding to each pair of atoms in Hinter and \({\phi }_{inter}^{upper}\) and \({\phi }_{inter}^{lower}\) are control weights for upper and lower bound of inter-atom distance, respectively. In this way, the distance of two atoms is constrained to be in the grey area indicated by two circles in sub-figure (d) of Fig. 5. Similarly, \({{{{\mathcal{L}}}}}_{intra}\) constrains the distance in a range in Hintra which describes intra-atom distance matrices. Sintra are atom radius sum corresponding to each pair of atoms in Hintra and \({\phi }_{intra}^{upper}\) and \({\phi }_{intra}^{lower}\) are control weights for upper and lower bound of inter-atom distance, respectively. N is batch size and 9 is the number of distance value in Hinter and Hintra.

Symmetry-compliant Base and Average Full Coordinates Losses. The generator generates three sets of base atom sites \(({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2})\) which are used to generate the full site coordinates using the symmetric operation defined by the space group. The averaged transformation to \(({{{{\bf{F}}}}}_{fake}^{0},{{{{\bf{F}}}}}_{fake}^{1},{{{{\bf{F}}}}}_{fake}^{2})\) from base atom sites should be exactly same. With these implicit rules, we design two losses to explicitly enforce them in the generator as expressed below:

$$\begin{array}{lll}{{{{\mathcal{L}}}}}_{full}=\frac{1}{{{{\rm{9N}}}}}\mathop{\sum }\limits_{i=1}^{N}\left\{\max \left(0,\cos \left(\frac{{\bf{F}}_{fake}^{{0}}}{{\left\Vert {\bf{F}}_{fake}^{{0}}\right\Vert }_{2}},\frac{{\bf{F}}_{fake}^{{1}}}{{\left\Vert {\bf{F}}_{fake}^{{1}}\right\Vert }_{2}}\right)\right)\right.{{{{\mathcal{L}}}}}_{base}=\frac{1}{{{{\rm{9N}}}}}\mathop{\sum }\limits_{i=1}^{N}\left\{\left(1-\cos \left(\frac{{{{{\bf{B}}}}}_{fake}^{0}}{{\left\Vert {{{{\bf{B}}}}}_{fake}^{0}\right\Vert }_{2}},\frac{{{{{\bf{B}}}}}_{fake}^{1}}{{\left\Vert {{{{\bf{B}}}}}_{fake}^{1}\right\Vert }_{2}}\right)\right)\right.\\ \qquad\qquad+\max \left(0,\cos \left(\frac{{\bf{F}}_{fake}^{{1}}}{{\left\Vert {\bf{F}}_{fake}^{{1}{{{{\bf{F}}}}}}\right\Vert }_{2}},\frac{{\bf{F}}_{fake}^{{2}}}{{\left\Vert {\bf{F}}_{fake}^{{2}}\right\Vert }_{2}}\right)\right)\qquad\qquad \qquad+\left(1-\cos \left(\frac{{{{{\bf{B}}}}}_{fake}^{1}}{{\left\Vert {{{{\bf{B}}}}}_{fake}^{1}\right\Vert }_{2}},\frac{{{{{\bf{B}}}}}_{fake}^{2}}{{\left\Vert {{{{\bf{B}}}}}_{fake}^{2}\right\Vert }_{2}}\right)\right)\\ \qquad\qquad+\left.\max \left(0,\cos \left(\frac{{\bf{F}}_{fake}^{{0}}}{{\left\Vert {\bf{F}}_{fake}^{{0}}\right\Vert }_{2}},\frac{{\bf{F}}_{fake}^{{2}}}{{\left\Vert {\bf{F}}_{fake}^{{2}}\right\Vert }_{2}}\right)\right)\right\},\qquad\qquad\quad+\left.\left(1-\cos \left(\frac{{{{{\bf{B}}}}}_{fake}^{0}}{{\left\Vert {{{{\bf{B}}}}}_{fake}^{0}\right\Vert }_{2}},\frac{{{{{\bf{B}}}}}_{fake}^{2}}{{\left\Vert {{{{\bf{B}}}}}_{fake}^{2}\right\Vert }_{2}}\right)\right)\right\},\end{array}$$
(6)

where cos is cosine similarity function. We normalize each coordinate value across the mini-batch of size N. 9 is the number of coordinates.

Full Generator Loss. By combining above losses, we can achieve our full loss for the generator:

$${{{{\mathcal{L}}}}}_{gen}={{{{\mathcal{L}}}}}_{adv}+{\lambda }_{1}{{{{\mathcal{L}}}}}_{inter}+{\lambda }_{2}{{{{\mathcal{L}}}}}_{intra}+{\lambda }_{3}{{{{\mathcal{L}}}}}_{full}+{\lambda }_{4}{{{{\mathcal{L}}}}}_{base}$$
(7)

where λ1, λ2, λ3, and λ4 are balancing parameters.

Crystal symmetry based self-augmentation

Data augmentation is commonly used for images in which operations such as rotation of an image does not change its label. Similarly, self-augmentation as we define here is used to do data augmentation based on the symmetry-oriented Wyckoff position representation of CIF files. In the representation of symmetric crystals, the coordinates of the non-equivalent positions (Wyckoff positions) are just one of a set of possible positions as defined by the symmetric operations of the space group. So, for each structure file, we can use the set of symmetric operations of the space group to transform the Wyckoff position coordinates without changing the structure, which can then generate more equivalent structure samples. The generation of atom coordinates that meet the symmetry constraints is one of the most challenging tasks in crystal generation. In order to make the fixed size of representation for crystals (details in Table 7), we use base atom sites. As shown in sub-figure (c) of Fig. 5, we can use any atom site of each element to form a set of base atom sites. Instead of randomly selecting them, we choose three atoms for three elements individually using steps as shown in below:

  1. 1.

    Randomly select the first element e0 and one atom position b0 for it;

  2. 2.

    Randomly select the second element e1 from the rest two elements and find the closest atom b1 to atom b0 in the first step;

  3. 3.

    Calculate the atom distance from the atoms of the last element e2 to the atom b0 and the atom b1 respectively, then sum the atom distance element-wise and the atom of the last element with the smallest sum is considered as the closest atom b2 to the selected atoms in the first and second steps;

  4. 4.

    Repeat Steps 2, 3, and 4 three times to obtain three sets of base atom sites \(({{{{\bf{B}}}}}_{real}^{0},{{{{\bf{B}}}}}_{real}^{1},{{{{\bf{B}}}}}_{real}^{2})\);

  5. 5.

    Repeat last five steps 31 times.

In Step 4, we use three sets of base atom sites as part of inputs to the discriminator so that we can obtain more information from crystal structures. In this work, we obtain three sets of base atom sites 32 times repeatedly as in Step 5.

Atom clustering and merging

For crystals with high symmetry, the number of atoms in the unit cell tends to be very large after conversion by Algorithm 1. We propose a post-processing method to reduce the number of atoms by clustering and merging. Firstly, we cluster the nearby atoms of the same elements by forming flat clusters from hierarchical clustering42,43. The maximum atom distance allowed in our research is 1.2 times the atom radius sum. Secondly, we merge the atoms in the same clusters considering periodic attributes of crystal structures.

Implementation details

Materials representation

We use \({{{\mathcal{M}}}}=({{{\bf{E}}}},{{{\bf{B}}}},{{{\bf{P}}}},{{{\bf{O}}}})\) to completely describe a crystal material. As shown in mainframe of PGCGM, however, we use three sets of base atom sites (B0, B1, B2). Thus here we re-formulate a material as \({{{{\mathcal{M}}}}}^{* }=({{{{\bf{B}}}}}^{0},{{{{\bf{B}}}}}^{1},{{{{\bf{B}}}}}^{2},{{{\bf{P}}}},{{{\bf{E}}}},{{{\rm{sgp}}}})\). The space group sgp is used to link to the affine matrix O. We can use (B0, B1, B2) in \({{{{\mathcal{M}}}}}^{* }\) to calculate physical properties as inputs to the discriminator and to design physics-based losses. Three sets of base atom sites are useful for two reasons: (1) we want to add more crystal information for the discriminator and let the discriminator have enough information to tell real materials from fake ones; (2) With more base atom sites, we can calculate more atom distances as the physical constraints in the generator and the inputs to the discriminator.

All Fractional Coordinates We use affine matrix O to acquire the whole atom sites in the unit cell as shown in Algorithm 1. Since the number of affine operators in O varies in space groups, we zero-pad the affine matrices as large as 192 × 4 × 4. We then transform each base atom site by the affine matrix and get a coordinates matrix Fall with shape of 192 × 3 × 3. Affine transformation leads to duplicate fractional coordinates. In material science, practitioners usually remove the duplicates. However, uniqueness calculation is not differentiable and it requires lots of time to do it. We choose to average along with the first dimension of Fall to get three sets of averaged full fractional coordinates (F0, F1, F2), each of which is with shape of 3 × 3.

For a real material, base atom sites \(({{{{\bf{B}}}}}_{real}^{0},{{{{\bf{B}}}}}_{real}^{1},{{{{\bf{B}}}}}_{real}^{2})\) can be transformed into the same average full fractional coordinates, which means \({{{{\bf{F}}}}}_{real}^{0}={{{{\bf{F}}}}}_{real}^{1}={{{{\bf{F}}}}}_{real}^{2}\). When generating a fake material, base atom sites \(({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2})\) are supposed to belong to the same fake material, which hopefully results in \({{{{\bf{F}}}}}_{fake}^{0}={{{{\bf{F}}}}}_{fake}^{1}={{{{\bf{F}}}}}_{fake}^{2}\). However, the transformation of \(({{{{\bf{B}}}}}_{fake}^{0},{{{{\bf{B}}}}}_{fake}^{1},{{{{\bf{B}}}}}_{fake}^{2})\) might slightly deviate from the goal. Thus using (F0, F1, F2) in real and fake materials implicitly adds physical constraints, which somehow pushes the generator to generate different sets of base atom sites for a same material, which increases chances to generate good materials in return.

Base Cartesian Coordinates Three sets of Cartesian coordinates can be calculated for each set of base atom sites by Eq. (8) and we denote them by (C0, C1, C2).

Atom Distance Matrices Given three sets of base atom sites (B0, B1, B2), we calculate the atom distance matrices Hinter and Hintra as shown in sub-figure (d) of Fig. 5. We firstly calculate pair-wise different atom distance matrix for each base atom site Bj, j = 0, 1, 2 and return only values in upper triangle of corresponding distance matrix termed by Hinter. Then we select three atoms belonging to the same element to form a set of three atom sites for three elements and calculate pair-wise same atom distance matrix and again return only values in upper triangle of corresponding distance matrix termed by Hintra. The final shape of Hintra and Hinter both is 3 × 3.

Lattice Parameters The volume of the unit cell can be calculated by lattice parameters P. We repeat the scalar volume three times to get the volume vector V. We also use the lattice matrix A in Eq. (9) as part of the inputs to the discriminator.

Element Properties We use 23 properties as shown in Table 6 to formalize element matrix E.

Table 6 3 element properties used for element embedding in PGCGM.

Now we list all parts of inputs to the discriminator in Table 7. P* only contains the lengths because the angles are either (90°, 90°, 90°) or (90°, 90°, 120°) in the training materials. Thus instead of generating three angles in P for fake materials, we build a map between angles and the space group sgp. Then we concatenate all parts and a zero matrix of shape 3 × 3 into a matrix of shape of 3 × 64. The matrix is finally reshaped into 3 × 8 × 8 as the inputs to the discriminator.

Table 7 Symbols and their shape used in inputs to the discriminator.

Mathematical conversion in crystal representations

Fractional coordinates can be converted to Cartesian coordinates \({\left[x,y,z\right]}^{T}\) using44:

$$\left[\begin{array}{c}x\\ y\\ z\end{array}\right]={{{\bf{A}}}}\cdot \left[\begin{array}{c}u\\ v\\ w\end{array}\right],$$
(8)

where A is a lattice matrix calculated by lattice parameters P using:

$${{{\bf{A}}}}=\left[\begin{array}{ccc}a&b\cos {{{\boldsymbol{\gamma }}}}&c\cos {{{\boldsymbol{\beta }}}}\\ 0&b\sin {{{\boldsymbol{\gamma }}}}&c\frac{\cos {{{\boldsymbol{\alpha }}}}-\cos {{{\boldsymbol{\beta }}}}\cos {{{\boldsymbol{\gamma }}}}}{\sin {{{\boldsymbol{\gamma }}}}}\\ 0&0&\frac{V}{ab\sin {{{\boldsymbol{\gamma }}}}}\end{array}\right],$$
(9)

where \(V=abc\sqrt{1-{\cos }^{2}{{{\boldsymbol{\alpha }}}}-{\cos }^{2}{{{\boldsymbol{\beta }}}}-{\cos }^{2}{{{\boldsymbol{\gamma }}}}+2\cos {{{\boldsymbol{\alpha }}}}\cos {{{\boldsymbol{\beta }}}}\cos {{{\boldsymbol{\gamma }}}}}\) is the volume of the unit cell.

In order to acquire all atom positions in the unit cell, each base atom site can be converted by affine matrix O. The conversion procedure is summarized in Algorithm 1. Different materials vary from the number of atoms and the number of elements. In order to make a fixed size of inputs, we only use ternary materials in this research. After conversion shown in Algorithm 1, the number of atom (sites) also differs from materials. That is the reason why base atom sites (one element one base site) are used to represent atom positions. In addition, it should be noted that the calculation of the uniqueness at line 10 of Algorithm 1 is not differentiable and time-consuming.

Algorithm 1

Generate unique coordinates using base sites () and affine matrix

Require: The space group sgp, the base atom sites B

1: O ← GetAffineMatrices(sgp)

2: n ← len(O)

3: coords ← anemptylist

4: for i ← 1 to 3 do

5: add 0 to bi

6: uniq ← anemptylist

7: for j ← 1 to n do

8: c ← bi ⋅ tj − ⌊bi ⋅ tj

9: pop last element from c

10: ifcnotinuniqthen

11: addctouniq

12: end if

13: end for

14: adduniqtocoords

15: end for

16: return coords

Evaluation metrics

Past studies in crystal generation used different evaluation metrics, making it hard to compare different methods. Here, we create a set of metrics to evaluate our method and two baselines. (1) Validity20. Following18, we consider a crystal structure as valid when the shortest distance between any two atoms is bigger than 0.5 Å. Following CubicGAN, we calculate the overall charge of a crystal structure using pymatgen32 and if it is neutral, then it is valid. Also, we count the number of structures after post-processing in our method and we apply the same post-processing onto the CubicGAN. (2) Property distribution20 We calculate wasserstein distance (WD) between the property distribution of generated materials and materials in test dataset TST. The properties we used are minimum atom distance, maximum atom distance, and density. (3) Diversity. We calculate the diversity of compositions, which means the ratio of unique number of compositions in generated structures. (4) DFT validation of generated structures. We find that both the pair-wise atomic distance and property distribution based performance metrics are indirect weak criteria in crystal structure generation as the major challenge in such models is to generate stable structures. It is thus critical to check the success rate of DFT-based structural relaxation, the thermodynamic stability (e.g. evaluated by DFT phonon dispersion calculation), and synthesizability (e.g. based on energy-above-hull calculation).