Roles and opportunities for machine learning in organic molecular crystal structure prediction and its applications

The field of crystal structure prediction (CSP) has changed dramatically over the past decade and methods now exist that will strongly influence the way that new materials are discovered, in areas such as pharmaceutical materials and the discovery of new, functional molecular materials with targeted properties. Machine learning (ML) methods, which are being applied in many areas of chemistry, are starting to be explored for CSP. This article discusses the areas where ML is expected to have the greatest impact on CSP and its applications: improving the evaluation of energies; analyzing the landscapes of predicted structures and for the identification of promising molecules for a target property.


Introduction
The principal goal of crystal structure prediction (CSP) is to calculate the potential crystal structures of any given material from its chemical composition. Most CSP methods involve a search for local minima on the energy surface defined by the structural variables describing a crystal structure (unit cell dimensions, molecular positions, and orientations), and ranking of these minima by their calculated energies. 1 The difficulty of CSP is highlighted by the occurrence of polymorphism, where a molecule can exist in one of several crystalline phases that may possess different physical and chemical properties. It has been shown that most polymorphs of organic molecules are separated by less than 2 kJ/mol in lattice energy, 2 thus highlighting the need for accurate computational methods to correctly predict the energy ranking of structures. The large number of possible structures involved in CSP for a given molecule means that accurate energies must be achieved at as low a computational cost as possible. Despite these challenges, considerable progress has been achieved in the past few decades and it is now possible to predict the crystal structure landscape of even quite complex systems, including multicomponent crystals (salts, co-crystals, and solvates), 3,4 and pharmaceuticals that can adopt many molecular conformations. [5][6][7] Therefore, CSP is becoming more widely used, for example, in the pharmaceutical industry to complement experimental polymorph screening, 8 and to prioritize candidates for synthesis for functional materials discovery. 9 Predicting the likely crystal structures of a molecule is crucial for predicting many properties that depend on the relative arrangement of molecules in a crystal: a few examples include mechanical properties, porosity, the kinetics of dissolution and electron and hole mobilities in organic semiconductors.
The results of CSP can provide crucial information for experimental design. For example, predictions were able to show that the formation of caffeine-benzoic acid co-crystals is thermodynamically favorable, leading to the design of a seeding experiment that ultimately enabled the synthesis of the elusive crystal structure. 10 More recently, new polymorphs of iproniazid 11 and dalcetrapib 12 have been discovered from crystallizations under pressure, which were guided from the results of CSP that discovered thermodynamically stable structures with high densities. CSP has also been applied in the area of porous molecular materials, helping to understand and modify the crystalline properties of porous organic cages 13,14 and guiding the discovery of extrinsically porous molecular crystals for gas storage and molecular separations. 9 In recent years, machine learning (ML) methods have been employed across many areas of chemistry and are starting to be explored for CSP. [15][16][17] This includes their use for making highly accurate predictions of the relative energies of crystals, and in the analysis of CSP landscapes where the high dimensionality of the structural space can be simplified using ML algorithms.
This article highlights the areas where ML methods are expected to have greatest impact on CSP methods and their applications. Although many of the same challenges exist for other types of crystalline materials, we focus our discussion on organic molecular CSP.

ML of the relative stabilities of putative structures
A requirement of CSP is the evaluation of accurate energies for large numbers of computer-generated crystal structures. A large-scale computational study has shown that free-energy differences are <2 kJ/mol in over half of observed polymorphs for small organic molecules, exceeding 6.4 kJ/mol in only 5% of cases. 2 These differences are usually dominated by the lattice energy-the energy of the static arrangement of molecules in a crystal-with differences in entropy due to lattice vibrations usually being smaller than lattice energy differences. 2 Thus, temperature effects, including thermal expansion, are usually treated as a minor energetic contribution; it has been estimated that about one in five polymorph pairs swap their order of stability between 0 K and their melting point. 18 Therefore, the focus of CSP has largely been on obtaining accurate lattice energies.
Traditionally, the choice has been between force fields, which describe the interactions between atoms using physically motivated functional forms, and more expensive quantum mechanical electronic structure methods (typically, solid-state density functional theory, DFT). Many of the available methods have been benchmarked using the measured sublimation enthalpies of the X23 benchmark set of molecular organic crystals, [19][20][21] showing that the best force fields have mean absolute errors of 9 kJ/mol, whereas errors in the best DFT methods are about half this magnitude. Therefore, the fact that CSP is ever successful, given the small energy separations between polymorphs and predicted crystal structures, relies on cancellation of errors. For force fields, in particular, much of the errors are systematic, so do not affect the energetic ranking of predicted structures. Nevertheless, the increased accuracy of DFT methods is often necessary, particularly where polarization or charge-transfer interactions make important contributions to intermolecular interactions, or where changes in molecular geometry are significant. However, DFT energy calculations can be 10 3 -10 5 times more computationally expensive than force fields, 17 even for small molecules. Thus, ML has been investigated as a means to achieve DFT-quality energies in CSP studies at more affordable computational costs.
The main requirement of an ML model is the ability to model the nonlinear relationship between energy and geometric descriptors of the crystal structure, which are usually represented by their local atomic environments, such as in the smooth overlap of atomic positions (SOAP) 22 and atomcentered symmetry function (ACSF) 23 approaches. The use of a kernel, or covariance, matrix describing similarity between atomic configurations in Gaussian process regression (GPR) models is popular in chemical applications of ML, as are neural networks. Although ML models have been thoroughly tested for accurately predicting the energies and properties of inorganic crystal structures, 24 and small organic molecules, 25 applications to organic molecular crystal structures are more challenging, in part due to the number of atoms involved. 26 In the field of inorganic structure prediction, Tong et al. 27 demonstrated that a GPR model could be trained to high-level DFT calculations on-the-fly during a structure search for predicting boron clusters, saving an estimated 1-2 orders of magnitude in computational cost compared to full DFT calculations, and suggested that their work could apply to periodic systems. Deringer et al. 28 used GPR to train a potential for CSP of elemental phosphorus, initially training on DFT calculated energies of randomly generated structures, and refining the potential during the structure search, so that it could eventually identify complex structures whose size puts them out-of-reach of DFT calculations.
In the area of organic molecular CSP, several approaches have shown how applying ML allows for the use of quantum mechanics methods, more affordably than running such high-level calculations on all predicted crystal structures of a molecule. As a first demonstration in this area, Musil et al. 15 showed that GPR using the SOAP description of structural similarity could predict DFT lattice energies of pentacene CSP structures with less than 1 kJ/mol error, although the errors are higher for chemically more complex molecules. A Δ-ML approach, which learns the difference in energy between lower (force field) and higher (DFT) levels of theory led to lower errors and more uniform performance for different molecules. The approach has been extended by Egorova et al., who developed a multilevel ML approach to correct the relative stabilities of predicted structures, 17 further reducing the required amount of the most computationally expensive, high-level energy calculations.
Other work includes training on a finite molecular cluster from the crystal structure at the target level using the manybody expansion of the lattice energy, instead of using periodic calculations. McDonagh et al. explored ML models for learning individual two-body (i.e., dimer) corrections to forcefield-calculated energies while keeping the long-range interactions at low level to reduce the cost. 29 This approach allows more accurate quantum chemistry models, such as correlated wave-function methods, which are currently unaffordable for calculations on periodic structures. A similar approach was described by Wengert et al. 30 who included larger molecular clusters and reported that the use of ML reduces the computing time taken for 10,000 crystal structures of a small organic molecule from 30 million CPU h to 80,000 CPU h.
This dramatic reduction in cost means that ML models can be used to assess more computationally demanding freeenergy differences of crystal structures and include contributions from lattice vibrations, which are known to be important for polymorph relative stabilities, 2,18 and nuclear quantum effects. Kapil and Engel 31 demonstrated such an approach, training a neural network model that was used for calculating free-energy differences between crystal structures of benzene, glycine, and succinic acid.

ML-enabled analysis of structural landscapes
An aim of CSP is to produce a set of all energetically feasible crystal packings of the studied system. These structure sets are rich with information on structure-function relationships, and are analyzed to identify which structures are most likely to be experimentally realized. Key barriers in such analysis include uncertainty in the predictions themselves and the high dimensionality of CSP landscapes, which creates a challenge for visualizing the distribution of predicted structures. The dimensionality of the energy surface depends on the symmetry and number of molecules in the crystallographic unit cell. As an illustrative example, a structure with four rigid molecules in the unit cell is defined by up to 30 degrees of freedom: the unit cell dimensions, along with the orientations and positions of each molecule within the unit cell.
Reducing this dimensionality can help identify structure-function relationships. A common solution is to visualize the structures in only a few dimensions; for example, CSP results are often presented as a plot of relative energies against densities of the predicted structures. Chemical intuition can also be applied to classify structures by the presence of certain interactions (e.g., hydrogen bonds) 32 or packing features. 33 The information loss and bias in analyzing sets of predicted structures may be minimized by applying dimensionality reduction methods to identify features that capture the greatest structural variation across the set of structures, and to form a lower dimensional representation in which similarity and dissimilarity of structures is preserved. As with ML for energies of crystal structures, dimensionality reduction relies on descriptors of each structure; again, such descriptors usually describe the local environment of atoms, such as ACSFs and SOAP. As a part of their work on learning energies, Egorova et al. 17 performed principal component analysis (PCA) of the sets of predicted crystal structures of a series of small molecules, each described using ACSFs. This work found that a small number of principal components (linear combinations of ACSFs) capture most of the variability across predicted structures, demonstrating that stable crystal structures tend to be found in a lower-dimensional manifold of the full dimensionality of the energy surface.
A related approach, useful for classification of structures, is clustering-an unsupervised ML method that optimizes the separation of data points into clusters of similar points and defines each point by just one descriptor-its cluster index. As an example, Musil et al. 15 combined the nonlinear dimensionality reduction technique, sketch-map, 34 with clustering methods to produce two-dimensional (2D) mappings of crystal structure landscapes of pentacene (Figure 1a) and two azapentacenes proposed as promising organic semiconductors. The reduced mapping for pentacene showed clear groupings of structures, which, when classified using hierarchical densitybased clustering, 35 reproduced results from heuristic classification of the structures, according to the known structural classes of the crystal structures of polyaromatic hydrocarbons (sheet-like, herringbone, etc.). This demonstration that known structural classes can be identified algorithmically supports the application of these methods for analysis of CSP results.
The approach was applied to analyze the combined set of predicted crystal structures of 28 molecules in a single clustered mapping 36 (Figure 1b), which revealed relationships between molecular structure, preferred crystal packing, and electron mobility. These findings demonstrate potential for ML to accelerate structure classification and navigation of the combined space of molecular structures, crystal structure, and materials properties. Thus, similar approaches have been applied in exploration for porous molecular crystals. 37,38 These studies also highlighted several challenges. For instance, not all structure sets can be effectively clustered, as was found for a second azapentacene in Reference 15, and the mappings can be sensitive to choices in the representation; for example, the cutoff distance around each atom in SOAP influences the relative importance of inter-and intramolecular similarity when comparing crystal structures of different molecules. 36 Crystal structure representations can sometimes be constructed with a particular application in mind: for example, Moosavi et al. 38 showed that representations capturing the topological features of pores within predicted crystal structures lead to mappings that group structures with similar calculated methane deliverable capacities.
The same need to choose "the right tool for the job" applies to dimensionality reduction algorithms. Direct comparison of algorithms for dimensionality reduction has been underexplored in the context of analyzing crystal structure landscapes. In their CSP study of pyrrole azaphenacenes, Yang et al. observed important differences in the distribution of points in the mappings produced using four different dimensionality reduction algorithms, 39 demonstrating that researcher expertise is still required in algorithm selection.
Although the examples discussed so far have focused on using unsupervised ML for visualization and the identification of structure-function relationships, a related method called the generalized convex hull (GCH) 40 attempts to identify which predicted crystal structures should be synthesizable. A conventional convex hull (CH) examines the energy of a material with respect to stoichiometry or some structural variable, such as molar volume. Only structures on the CH are considered thermodynamically stable. This analysis, however, relies on intuitively chosen features. The GCH algorithm uses dimensionality reduction, via kernel PCA, 41 to select data-driven coordinates for CH construction. In this way, the GCH identifies structures that are low in energy or extremal in geometry in some respect and could, therefore, be stabilizable. Uncertainty in the structural features and energies is also addressed. The GCH samples the hull points probabilistically across many iterations in which the data points are randomized within boundaries determined by a machinelearned estimation of their uncertainties. The approach was demonstrated for the identification of crystalline phases of hydrogen from CSP at high pressure and identified magnetically stabilizable phases of oxygen. From the perspective of applying CSP for the discovery of functional molecular materials, the GCH was demonstrated to identify predicted crystal structures of pentacene that could be stabilized by chemical modification. 40

Chemical space exploration
To be used as an effective method for discovering materials with targeted properties, CSP must be combined with methods for proposing promising molecules. Exhaustive searches of possible candidates are prohibitively expensive; as an example, for small drug-like molecules a calculated search space of up to 10 60 possible molecules is estimated to exist. 42,43 In contrast, the largest CSP studies to date have assessed groups of 10-30 molecules to identify the candidates with the best predicted properties. 9,36,44 Due to the gulf between the scale of CSP that is currently affordable and the size of chemical space, more targeted methods, such as data-driven techniques, are required to focus effort on the best candidate molecules. 45 Data-driven methods to generate molecules, which have mainly been applied in the area of drug discovery, have also been demonstrated for functional materials discovery.

Molecular representations
Data-driven chemical space exploration requires molecules to be represented in a computer readable manner. 22 Ideal representations are invertible, mapping only to specific molecular structures, and invariant to symmetry operations.
Molecular graphs are a common method of representing structures as bonds and atoms within a molecule can be represented as the edges and vertices of a graph. A popular method converts a 2D molecular graph into a string of ascii characters called simplified molecular-input lineentry system (SMILES) strings. These are invertible, but not unique-one molecular structure can map to multiple SMILES strings. 46 Suggested improvements, all canonical, include the unique, non-standardized canonical SMILES, 47 InChi-a standardized string identifier 48 -and SELFIES, which always represent valid molecules. 49

High-throughput virtual screening
The conceptually simplest approach for finding high-performing molecules is high-throughput virtual screening (HTVS), where molecular data sets are tested for a targeted property via computational predictions. HTVS is often performed using a funneling method (Figure 2a), to reduce cost while allowing properties to be determined more accurately for the later candidate pools. Quantitative structure-property relationship and ML models for property prediction have been applied as steps in the funneling strategy. 45,50 HTVS can use generative models or existing chemical databases, such as ZINC, 51 the Cambridge Structural Database, 52,53 and the Harvard Clean Energy Project, 54 to build the initial populations of compounds to screen. Although CSP has not been applied in truly high-throughput studies, improvements in efficiency of the methods have made it possible to perform CSP, followed by property prediction for the sets of low-energy predicted crystal structures, for up to about 30 small molecules. 36 The properties of molecular materials often depend on intrinsic molecular properties, as well as properties that emerge due to the way that molecules are arranged in the solid. HTVS can take advantage of this: applying initial filtering based on calculated properties of isolated molecules and predicting crystal structures of only the molecules with the most promising properties. With further improvements in methods, coupled with increasingly available high-performance computing, CSP and property predictions should soon be possible on hundreds of candidate molecules on sufficiently short time scales to be useful in guiding experiments.

Generative neural network models (GNNs)
An advantage of sequence-based descriptors, such as SMILES, is that recurrent NNs can be trained to generate new descriptor sequences corresponding to new molecules. This approach has been demonstrated for molecular discovery of drug-like and other small organic molecules. 55,56 Other NN approaches involve training a generator to sample latent space for candidate molecules (Figure 2b). Two main methods for this are generative adversarial networks (GANs) 57 and variational autoencoders (VAEs). 58 VAEs were demonstrated for molecular design by Gómez-Bombarelli et al. 59 training the model on molecules from the ZINC and QM9 data sets. They demonstrated that the autoencoder can be jointly trained to predict molecules, along with their properties (such as drug-likeness and synthetic accessibility). Similarly, GANs have been demonstrated for the discovery of drug-like molecules and organic photovoltaic molecules. 60,61 A workflow can be envisioned of performing CSP on the optimized generated molecules to predict their materials properties.
As VAEs and GANs typically work on single molecules, CSP could have a similar role here as in HTVS: molecules are generated with optimized properties, followed by prediction of crystal structures and the resulting properties of the materials. Their direct application to the generation of crystal structures has been demonstrated for simple inorganic crystals, 62,63 but is hindered by the challenge of representing three-dimensional crystal structures in a continuous latent space. We envisage further challenges for organic molecular crystals: because of the small energy differences between alternative crystal packings, minor changes in molecular structure can often lead to energy re-rankings of proposed crystal packings and, therefore, the experimental observation of completely different crystal structures, introducing discontinuities in the relationship between molecule and solid-state properties.

Evolutionary algorithms
Issues with previously discussed methods include the computational cost of training the generator and the large amount of training data required to learn from. Evolutionary algorithms (EAs) are one alternative for generating new promising molecules (Figure 2c). EAs are an optimization method inspired by evolution where members of an initial population undergo genetic operations and fitness evaluations to create successive generations, with the fitter candidates more likely to contribute to the next generation. In the area of molecular materials discovery, EAs have been applied to discovery of organic semiconductors, 44 and porous organic cages, 64 where the properties to be optimized were charge-carrier mobilities and persistent porosity, respectively. EAs can be efficient for exploration, requiring calculations on a small fraction of possible molecules during optimization to the best molecules. This opens the possibility for CSP within the fitness evaluation itself, if these methods can be made sufficiently fast for application to hundreds of molecules.
Chemical space networks (CSNs) can be used to monitor the progress of chemical space exploration campaigns, 65 where edges of a graph denote "morphing relationships" used to generate one molecule from another. One can trace a path from each generated molecule back to the initial species, where edges contain information on the operations used. CSNs are powerful tools for the visualization of molecular sets that can reveal potential design rules, as shown by Kunkel et al. 66 Regardless of the approach used to generate molecules for assessment using CSP, the outcome is a set of molecules, each of which has an associated ensemble of predicted crystal structures. Thus, prioritization of molecules must consider the relevant properties (e.g., charge-carrier mobility or porosity) of multiple crystal structures for each molecule. Some methods have been proposed, considering weighted averaged properties over low-energy crystal structures to assess the likelihood that a molecule will lead to a crystal structure with the desired properties, 33,36 along with measures of property variation among low-energy crystal structures to assess uncertainty and risk. 44

Synthetic accessibility
The emerging methods for computational exploration of chemical space are exciting steps on the path to materials discovery, but experimental realization is vital; this requires feasible synthetic pathways to candidate species. When synthetic accessibility (SA) is not considered by a molecular generative model, candidates may be too challenging to synthesize in the laboratory. Approaches to biasing the generation of molecules toward synthetically accessible species have been discussed in detail by Gao and Coley. 67 Evaluation of SA with a post hoc filter (Figure 2d[i]) is the simplest approach: molecular generation is not biased and filtering is applied to the generated species after exploration is complete. Alternatively, heuristic biases can be used to guide molecular generation as part of the optimization function (Figure 2d[ii]). However, typical rapid scoring functions for SA (e.g., Ertl SAScore, 68 SYBA 69 ) focus on molecular complexity; structurally complex candidates are not always synthetically complex, given a reasonable set of starting materials and reactions. Instead, computer-aided synthesis planning (CASP) tools such as AiZynthFinder, 70 involve prediction of full retrosynthetic pathways to a given molecule. Although the process can be computationally expensive, it effectively mimics the process that chemists undertake. Information such as the number of synthetic steps or the availability and cost of precursors can be included in the fitness evaluation of candidates.
A third possibility is to directly influence the generation of molecules (Figure 2d[iii]). Imposing explicit constraints on the building blocks and molecular transformations of a chemical space exploration campaign limits the proportion of the space that can be assessed, but should produce more synthetically accessible candidates. The Synthetically Accessible Virtual Inventory (SAVI) 71 works in this style, where 53 known single-step, two-reactant reactions were applied to 150,000 readily available precursors, generating a 1.75 billion data set of which 1.09 billion compounds scored highly for synthetic accessibility.
As with the field of computer generation of molecules as a whole, synthetic accessibility prediction methods have focused on drug-like targets. These methods might need significant modification to account for the different types of molecular targets and the scale of synthesis required in materials discovery. Bennett et. al. 72 developed a binary classification model for the synthetic difficulty of porous organic cage precursors, learning the responses from experienced synthetic chemists to the question, "Can you make 1 g of the compound in under five steps?" Limiting the number of reaction steps works to reduce the overall yield loss during synthesis. While this limits access to species up to five synthetic steps away from available starting materials, their model was able to find precursors for promising porous materials with easier synthetic requirements.

Outlook
CSP methods can guide and accelerate materials discovery as research in this area shifts from an interesting academic challenge to applied studies. 8 Although much prior progress has built on developments of traditional simulation methodsalgorithms for exploring multidimensional energy landscapes, and models for calculating accurate lattice energies-datadriven, ML methods could lead to further exciting advances. These include acceleration of CSP through ML of accurate energies, methods for visualizing and interpreting the outcomes of large CSP data sets, and approaches to chemical space exploration to identify the best molecules to explore using CSP. Thus, the area will continue to benefit from close collaborations across chemistry, mathematics, and computing science.

Data availability
No data were generated for this review.

Conflict of interest
On behalf of all authors, the corresponding author states that there is no conflict of interest.

Open access
This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this license, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/.