Introduction

Molecular solids refer to substances consisting of discrete molecules that are held together by relatively weak intermolecular forces. For centuries, they have been extensively used as key components in chemical industries such as medicine [1], fertilizers, dyes [2], pesticides [3], and high-energy explosives [4]. In addition, some organic solids were found to exhibit interesting physical properties. In 1921, the first ferroelectric crystal was found in organic Rochelle salt [5]. Organic solids have also received considerable interest in the electronics industry since the discovery of bulk conductivity in polycyclic aromatic compounds in the 1950s [6]. While some molecules can aggregate with no particular order, such as amorphous solids, most organic solids are crystalline, and their physical properties largely depend on regularly repeating intermolecular packing. For example, the organic semiconductor material rubrene usually exhibits outstanding carrier mobility, but changes in its crystal packing can eliminate that mobility entirely [7]. From a materials perspective, the soft molecular nature of organic crystals combined with other advantages (e.g., light weight, non-toxicity, and low raw material cost) has allowed for new optoelectronic technologies across a wide range of applications, such as fully flexible devices for large-area displays, lighting, electrochemical transistors, and solar cells [8, 9].

Developing new materials with targeted properties through the understanding and controlling of intermolecular interactions in the crystal is the central goal of the field of crystal engineering and design. Nowadays, computer simulation has played an increasingly important role in materials development [10, 11]. Substantial progress has been made in the modeling of organic solids. When the crystal structure information is known, many properties, including geometry, stability, elasticity, ground state electronic structure, and even excited state properties can be reliably estimated from the first-principles calculation. Utilizing computational tools, high-throughput screening of chemical compounds have significantly contributed to organic materials discovery in the recent years [12,13,14,15,16,17]. More importantly, it has become possible to discover new materials systematically through computation in the past two decades. The path to this breakthrough has been paved by the development of crystal structure prediction (CSP) methods [18]. There are two largely complementary approaches: one based on existing knowledge and the contents of structural databases (data mining) and the other on powerful exploratory computer algorithms capable of making predictions with little to no pre-existing knowledge. In particular, the latter, based on powerful exploratory algorithms, has the ability to generate completely new knowledge, beyond existing databases and intuition [11, 19, 20]. With exciting developments in both methodologies and ever-increasing computational power, it is reasonable to expect that, in the near future, the use of predictive computational modeling can be accomplished within a time frame that is appreciably shorter than that needed to perform the laboratory synthesis and characterization.

The topics on crystal polymorphism and CSP have been reviewed periodically [11, 19, 21, 22] in the past. However, the CSP and its application have progressed significantly in the past five years. Herein, the authors aim to give a timely review on the recent computational methodology developments based on our own expertise. The manuscript will be organized as follows. First, we will introduce the early efforts in developing the fundamentals of organic chemical crystallography. Furthermore, the recent developmental activities in methodology, algorithm, and software in the organic CSP, as well as representative case studies, will be reviewed. Finally, we will discuss the remaining challenges and future outlook in the end.

Figure 1
figure 1

The close-packing theory and symmetry preference of organic crystals. (a) shows four plane groups that can form a close-packed molecular layer in which each molecule touches six neighbors (according to Kitaigorodskii [23]). (b) lists the 15 most frequent space groups extracted from more than 1 million entries in the up-to-date Cambridge Structure Database 5.42 (2021.1). The arrow in the middle illustrates Kitaigorodskii’s ideas to extend the close-packing theory from 2D plane to 3D space groups.

Early efforts on organic chemical crystallography

Over the last century, small molecule crystallography has received everlasting interest by solid-state chemists since the first determination of organic crystal by X-ray diffraction in the early 1920s [24]. Researchers quickly realized that there exists some fundamental differences between organic crystals and their inorganic counterparts. First, for organic crystals, it is not difficult to picture that molecules can be stacked in many different ways in the three-dimensional (3D) space. Second, the forces holding molecules together in the crystals are much weaker than the regular chemical bonds in the atomic crystals. Although an accurate numerical calculation of intermolecular interaction only became available recently [22], it has been long known that modifying the molecular packing requires much less energy cost than the inorganic crystals. Indeed, crystal polymorphism of organic molecules has been a central subject in chemistry and materials science. Despite its ubiquity, understanding the packing of molecules is a rather elusive subject due to the variety of molecular shapes. Unlike the packing of spheres, geometry analysis on the irregularly shaped molecules appears to be more of an art than a science. In the past, there have been many excellent literature reviews on polymorphism from the perspectives of intermolecular interactions and crystal engineering [19, 21, 22, 25]. In this review, we will focus on two major questions that were proposed in the beginning of contemporary organic chemical crystallography; (i) how to understand the role of symmetry in crystal packing, and (ii) how to map the relation between molecule and crystal structures. To answer these questions, we will go over some early efforts on the development of the fundamental models to describe the intriguing crystal packing behavior.

Figure 2
figure 2

Packing motif analysis on the crystals made of planar aromatic hydrocarbons (PAHs). Desiraju and Gavezotti [26] suggested four representative groups of PAH packing, which are named as (a) herringbone, (b) sandwich, (c) \(\gamma\) and (d) \(\beta\). (e) and (f) show two structures that suggest the limitation of such classification. (g) shows the overall mapping from molecular to crystal structure for 32 PAH molecules as suggested in Ref. [26].

Close-packing principle

There exist 230 space groups in which the translational symmetry of a unit cell (including lattice centering) can be uniquely combined with the point group symmetry operations of reflection, (improper) rotation, screw axis, and glide plane symmetry operations to allow an object to fill three-dimensional space in a periodic manner. Among all possible 230 space groups, it has been found that some space groups, with the symmetry elements of inversion (\(\overline{1}\)), the twofold screw rotation (\(2_1\)) and glide reflection (g), occur much more often than others. Figure 1(b) shows a most recent survey on the Cambridge Structural Database (CSD) [27] by the authors. Among 1075116 entries with 3D coordinates information, over 87.4% can be accounted by the top 10 space groups \((P2_1/c, P2_12_12_1, P\overline{1}, P2_1, C2/c, Pbca, Pna2_1, Cc, Pnma, P1)\), while the contribution from other space groups is much less. In particular, 369,763 (34.39%) entries are found to possess the \(P2_1/c\) symmetry. On the contrary, no single space group has more than 10% of the structures for inorganic compounds [28].

To understand the observed strong symmetry preference, people sought to the close-packing idea [23]. Kitaigorodskii and several others analyzed a series of early crystallographic data and found that the packing coefficients for the majority of crystals are between 0.65 and 0.77, which is the same order as the close-packing coefficients of spheres (0.74). Hence, they put forward the hypothesis that the molecular crystal prefers the arrangement that the projection of one molecule fit into the hollows of adjacent molecules to achieve the dense packing. Kitaigorodskii also examined the possibilities in each of the 17 plane groups to form close-packed molecular layer in which each molecule has six neighbors (following the case of close-packed spheres in a 2D plane). For a molecule in an arbitrary shape, only four plane groups (P1, P2, P1g1, and P2gg) satisfy this requirement [see Fig. 1(a)]. Furthermore, Kitaigorodskii extended the description to 80 layer groups and identified the close-packed layer group symmetries. Finally, the possible close-packing space group can be derived by considering the ways to stack the close-packed layers through (i) monoclinic displacements (t); (ii) inversion centers (\(\overline{1}\)); (iii) screw axis (\(2_1\)); and (iv) glide planes (g). Kitaigorodskii found that closest packing is attainable in only a few space groups (e.g., P1, \(P\overline{1}\), \(P2_1\), \(P2_1/c\), \(Pca2_1\), \(Pna2_1\), \(P2_12_12_1\), .etc.). If the molecule is centrosymmetric, only \(P\overline{1}\), \(P2_1/c\), C2/c and Pbca are allowed. This postulate successfully explained the very non-uniform distribution of organic compounds over space groups, and predomination of \(P2_1/c\) in organic chemical crystallography [as shown in Fig. 1(b)]. More remarkably, Kitaigorodskii was able to apply his close-packing principle to correct a large number of molecular crystals that were assigned to wrong space groups in the early days when X-ray diffraction was applied to small molecule crystallography.

In addition to his close-packing theory, Kitaigorodskii also introduced an atom–atom potential scheme to construct the energy landscape of molecular crystals. Using this simple potential model, he was able to accurately predict the crystal structures for several simple molecules, such as naphthalene and anthracene. He also extended the model to estimate lattice expansion under finite temperature. These approaches have inspired several further methodology developments to understand the molecular packing [26, 29, 30].

It is important to note that, despite its success in the early days of small molecule crystallography, Kitaigorodskii’s model cannot be truly generalized to any case. For instance, it is possible to achieve very high packing density with a coordination number less than six [31]. One also needs to keep in mind that the close-packing theory is a pure geometry analysis; the packing preference is subjective to be complicated by the intermolecular interactions such as hydrogen and halogen bonding. A complete understanding of molecular crystals requires one to fully consider all types of intermolecular interactions and associated energies that sustain molecules in their crystal lattices.

Packing motifs

In addition to understanding the symmetry preference of molecular packing, it is also imperative to recognize, classify, and (even more ambitiously) predict the molecular packing pattern. Obviously, this is a grand challenge in terms of geometry. First, the molecular packing can dramatically change when the molecules adopt a variety of shapes. More importantly, interpreting the relation between molecular structure, packing, and functionality is believed to be an expert system, but it has no universal guidelines. It is visually difficult to capture the pattern even for experienced researchers [32].

To ease the visual challenge, most early works focused on several simple shapes. Among them, Robertson firstly proposed to divide planar aromatic hydrocarbons (PAH) into two categories based on the ratio of molecular area to the thickness [33]. It was found that the disk-like molecules tend to stack together via rigid translation (stack-promoting), while the molecules with smaller areas prefer the glide-like reflections (glide promoting). Using energetic as well as geometrical criteria, Desiraju and Gavezotti [26] extended the categorization of PAHs into four groups and used them as a guide for prediction of crystal packing for new molecules. As shown in Fig. 2, they include

  1. 1.

    herringbone, in which each molecular column has interactions with the nonparallel neighboring molecules;

  2. 2.

    sandwich, consisting of the herringbone motif with sandwich-type diads;

  3. 3.

    \(\gamma\), which looks like the flattened-our herringbone;

  4. 4.

    \(\beta\), as a layered structure made up of graphite-like planes.

The classification is advantageous not only because it accounts for the difference in geometry but also is separable from energetics. In a PAH crystal, the intermolecular forces are either from short-distance C\(\cdots\)C or C\(\cdots\)H interactions. When two parallel molecules are stacked, the main interaction is gained from C\(\cdots\)C. On the other hand, two molecules also form a close pack via the glide operation [34], in which the main interactions are contributed by the C\(\cdots\)H contacts. Therefore, the structures in herringbone or sandwich packing have more C\(\cdots\)H contacts while the \(\beta\) and \(\gamma\) structures are characterized by the strong C\(\cdots\)C interaction due to the flattened angle between two adjacent nonparallel molecules. Desiraju and Gavezotti also devised two parameters to measure the overall glide- (\(S_{\text{ g }}\)) and stack-promoting areas (\(S_{\text {st}}\)) to quantify the intermolecular contributions from C\(\cdots\)H and C\(\cdots\)C. By plotting the correlation between the glide-stack ratio (\(S_{\text{ g }}/S_{\text {st}}\)) and the molecular volume (\(S_{\text{ M }}\)) for about 40 PAH molecules, they suggested a predictive mapping from molecule to crystal structure that can separate four groups of motifs and predict the unknown crystal structures for other PAH molecules [see Fig. 2(g)]. Since this seminal work, many others have utilized and expanded the packing motif concept to more PAHs and to aromatic molecules that have more chemical diversity than PAHs. This packing motif concept has been applied to establish the correlations between these materials’ packing and observed physical properties (e.g., charge transport for organic semiconductors [35] and insensitivity for molecular explosives [36]). Recently, several methods have also been developed to automate the assignment of packing motif for the given molecular crystal [35, 37, 38].

Despite its increasing popularity, the definition of four patterns, like many other chemical nomenclatures, lacks a mathematical rigor. Since there exists a gradual shift of the interplanar angle between two adjacent nonparallel molecules from herringbone to \(\gamma\) and \(\beta\), it is visually difficult to define their boundaries. On the other hand, there is no further separation for the sandwich-type packing. In fact, the dinaphthoanthracene crystal [DNAPAN in Fig. 2(f)] clearly shows a distinct geometry compared to the standard sandwich type. In addition, the original proposal of herringbone, \(\gamma\), and \(\beta\) types seemed to be mainly designed for the crystals consisting of two types of molecular alignments. In these cases, the molecules are nonparallel to the nearest layers, but parallel to the second nearest layers. However, assigning the herringbone pattern to a complex structure with more than two alignments [e.g., the triphenylene structure in \(P2_12_12_1\) symmetry as shown in Fig. 2(e)] is likely to reduce the model’s predictive capability. Therefore, the number of motifs clearly needs to be expanded to characterize more subtle differences due to molecular shapes and symmetry operation.

Figure 3
figure 3

The crystal packing characterized by the strong hydrogen bonds in (a) aspirin form I and (b) \(\alpha\)-resorcionol. The dotted lines indicate the intermolecular hydrogen bondings.

Going beyond the PAH molecules, the close-packing principle tends to be further compromised by the existence of stronger intermolecular interactions, such as the hydrogen and halogen bonds. The energy of hydrogen (halogen) bonds ranges from 0.2 to 40 kcal/mol, which is much stronger than the typical van der Waals forces (0.1–1.0 kcal/mol). These bonds also have a strong preference for linear geometry. As such, one can easily detect the distinct bonding network if the strong hydrogen or halogen bonds exist in a crystal. Figure 3 shows two representative crystal structures based on (a) the dimers in aspirin and (b) an extended 3D hydrogen bonding networks in resorcinol. Around the 1990s, there have been numerous studies on the formation of hydrogen/halogen units connecting molecules in a crystal structure [39]. Desiraju termed such structural fragments supramolecular synthons to underpin conceptual similarities between retrosynthetic analyses in conventional organic synthesis and supramolecular chemistry [40]. To develop different synthons for different purposes of applications is an active research area in crystal growth and design [21, 41].

Progress in computational crystal structure prediction

While the aforementioned works provide a big picture regarding the tendencies of molecular packing, it remains challenging to predict the detailed crystal packing before one conducts the experimental work. In the past four decades, researchers have developed a range of statistical and computational tools to enable the accurate prediction of crystal packing through the combination of powerful structure navigation and accurate energy modeling of organic materials.

Figure 4 shows a typical CSP workflow starting from the information of a single molecule. In general, a successful CSP calculation involves two stages of challenges: (i) the search problem of generating all potential low-energy structures and (ii) the energy ranking problem of computing the relative stabilities of a range of structure candidates. Both of them represent some practical challenges when the system becomes large.

Figure 4
figure 4

The schematic procedures for a CSP calculation. From the starting molecule (a), CSP needs to go through a complete structure search (b), and then energy ranking will be performed for the screened low-energy minimum structures to decide the final thermodynamic stability.

The search space grows exponentially with the number of variables, i.e., the unit cell vectors, symmetry operations, molecular positions, conformation and orientations, as well as the number of molecules per asymmetric unit (\(Z^\prime\)). After a trial structural model is built, it needs to be relaxed to the configuration in a local energy minimum of the potential energy surface. Repeating the procedure for many iterations, one expects to identify a complete set of plausible crystal packings that may exist in the real world. Then, their relative thermodynamic (and kinetic) stabilities need to be evaluated by an accurate energy model.

The stage of energy ranking is also challenging for two reasons. First, the energy differences of different crystal packings are extremely small. It is possible to gather several tens to hundreds (or even thousands) of structures within a narrow energy window (say less than 20 kJ/mol) [19]. To accurately describe the tiny difference, high-quality electronic structure methods, instead of classical force fields or low-end electronic structure approaches, need to be employed. Second, computational limitations usually make it impractical to evaluate the lattice or free energy for each candidate structure with the high-cost model. Therefore, it is popular to adopt a multiple stage screening strategy to compromise between efficiency and accuracy.

The progress in CSP has been reflected by the periodically conducted blind tests of organic crystal structure prediction organized by the Cambridge Crystallographic Data Centre (CCDC) [42,43,44,45,46,47], which have shown that the combination of effective structure generation and energy ranking schemes can predict not only the structure of simple rigid molecules, but also the molecules representing real-life challenges. In the following, we will review the necessary computational techniques that have been developed to address the challenges in both structure searching and energy ranking.

Figure 5
figure 5

The generation of different based on the open-source PyXtal software package [48], (a) a benzene crystal with \(Z^\prime =1\) in \(P\overline{1}\) symmetry; (b) a C60 crystal with \(Z^\prime =1/3\) in \(P6_3mc\) symmetry; (c) a 2D crystal made of ice; (d) a low symmetry aspirin crystal though symmetry breaking from the original \(P2_1\) symmetry. Two short Python scripts are also listed in the text boxes to illustrate the use of PyXtal for generating the models in (a) and (d).

Crystal structure search

Given that molecular packing favors only a few particular symmetry elements, it is natural to utilize these symmetry preferences to reduce the search complexity in a practical CSP calculation. Therefore, most CSP codes pay attention to developing strategies to impose symmetry constraints on all stages of structural search, including the trial structure generation, optimization, and navigation on the entire potential energy surface.

Structure generation

Despite the fact that many CSP programs have their own built-in functions to generate crystals with specific space groups, most of these functions are implemented in the program’s main packages and cannot work in a standalone manner. To our knowledge, there exist only a few free packages [49, 50] allowing the generation of random molecular crystals, and they only support molecules occupying the general Wyckoff sites with \(Z^\prime =1\), except for the recent development of Genarris 2.0 [51] which is able to deal with structures having a non-integer value of \(Z^\prime\). According to the report from the 2015 CSP blind test [47], most research groups attempted to reduce the structure generation to a limited range of space-group choices with one molecule in the asymmetric unit (\(Z^\prime =1\)). Despite the fact that statistical analysis supports the idea that most organic crystals tend to crystallize with \(Z^\prime\) less than or equal to 1, structures with high \(Z^\prime\) are not rare [52, 53], they can exist as either the ground state or metastable states [54,55,56,57]. Hence, it is risky to restrict the extent of \(Z^\prime\) if one aims to do a complete polymorph screening. On the other hand, molecules with high symmetry tend to occupy the special Wyckoff sites with high site symmetry. In fact, a lot of reported crystals with \(P2_1/c\) and C2/c symmetry belong to this group. When it comes to CSP, such structures can be found by searching for structure with \(Z^\prime =1\) in the subgroup symmetry (e.g., the naphthalene crystal in \(P2_1/c\) symmetry with \(Z^\prime =1/2\) can be represented as a \(P2_1\) crystal with the molecules occupying the general Wyckoff site). However, taking the advantage of more symmetry constraints can greatly reduce the searching complexity for more symmetrical molecules, like the Buckyball shape [see Fig. 5(b)] or cage-like molecules [58].

Recently, we started to develop a standalone Python program called PyXtal which can be used for customized structure generation for different dimensional systems, including atomic clusters and 1D/2D/3D atomic/molecular crystals [48]. While PyXtal is designed for general purposes of crystal symmetry analysis, we emphasize some of its unique features for molecular crystals in Fig. 5. It allows one to generate a trial structure with a customized \(Z^\prime\) number. If the molecular symmetry is compatible with the site symmetry of a given Wyckoff site, PyXtal also supports the assignment of molecule to the high symmetry site, leading to a fractional \(Z^\prime\) in crystallographic description. Furthermore, the same scheme has been extended to generate low-dimensional systems that are described by rod/plane/layer groups [see Fig. 5(c)], which may find some applications for the study of 1D and 2D molecular crystals [59].

In addition to building the purely random crystal from scratch, it is often useful to generate the derivative structure from an existing crystal. When one crystal is converted to another by a phase transition, the symmetries of the crystal structures are usually related. The so-called group-subgroup relation has been well discussed mathematically. It is possible to list all possible subgroup types for every space-group type and to specify the subgroups in a general way by formulae. The international crystallography volume [60], as well as the online Bilbao Crystallographic Server [61], have provided the symmetry relations between a given space group G and its possible maximal translationengleiche (t) and klassengleiche (k) subgroups H. Ideally, to complete the transition from G to H, one needs to know the cell transformation matrix, as well as the Wyckoff splitting scheme. For a given crystal structure, PyXtal allows a user to either systematically extract all possible transformations (subject to a cutoff index of symmetry reduction), or randomly pick one possible transformation path between G and H. Figure 5(d) shows an example of transforming a \(P2_1/c\) structure in one 4e site to another structure occupying two 2e in \(P2_1\) by breaking the inversion symmetry. This module can find its use in at least two applications. First, it can be directly used to study ferroelectric or piezoelectric phase transitions [62]. Second, it can be used as an effective way to generate new child structures in global optimizations that will be discussed in the following subsection.

Geometry optimization

Any generated structure has to be relaxed to the energy minimum configuration. An efficient structure relaxation requires the forces and stresses that can be derived from the energy model. In order to reduce the structural complexity, the optimization is usually applied only to the lattice vectors and molecules in the asymmetric unit. Therefore, the crystal symmetry will be retained after structure relaxation. The simplest algorithm is steepest descent, where the crystal variables (e.g., atomic coordinates and lattice vectors) are iteratively updated, moving along the force/stress direction. In this algorithm, only the current positions of atoms and forces are taken into account. The Conjugate gradients algorithm [63], while still requiring only the forces/stresses (i.e., first derivatives of the energy), follows a set of conjugate vectors toward the target minimum during the line search. The choice of conjugate vectors, instead of the gradient direction, can effectively remedy the slow convergence issue when it approaches the minima. The Newton method takes the knowledge of the second derivatives matrix (Hessian) to greatly accelerate convergence to the local minimum, but the computation of the Hessian is very expensive when the number of variables becomes large. Therefore, a number of quasi Newton methods have been developed to construct an approximate numerical Hessian for the first derivatives. Notable examples include DFP, BFGS, and L-BFGS [63]. More efficient algorithms based on conventional molecular dynamics with additional velocity modifications and adaptive time steps also have been made available [64]. Recently, a hyperspatial optimization idea was also proposed to help avoid having a structure trapped in a shallow minima [65].

Figure 6
figure 6

The comparison between steepest descent and conjugate gradient optimization algorithms.

While there exist plenty of choices for efficient optimization algorithms, it is important to note that the no free lunch theorem still holds. Most of the optimization algorithms, other than steepest descent, assume that the starting point can be connected to the target minimum via a quadratic function. If the quadratic approximation is far from the ideal case, the choice of search direction may become worse than the gradient direction. Figure 6 shows such an example. When the initial position is close to the minimum, the conjugate gradients algorithm can effectively identify the minimum with only a few iterations. However, the same algorithm may lead to a worse minimum or even a saddle point when the initial positions become more distant to the target minimum. On the other hand, the steepest descent is more reliable to consistently find the same low-energy minimum as long as the initial position is within the same basin. For the practical application of CSP, it is often the case that the initial structure from a random guess is far from the minimum. Therefore, the simplest steepest descent is highly recommended at the beginning stage of geometry optimization.

Global structure navigation

Combined with global optimization, one can perform an extensive search for the plausible structures. Global optimization is a very large field in applied mathematics, with many methods continually being developed. For the CSP applications, it is important to remember that no existing method can give a guarantee of finding the global minimum in a finite amount of computational time. The following methods have become popular:

  1. 1.

    Random search is the simplest search strategy, starting from the atom–atom potential implementation [66], and then evolving into different versions of ab initio random structure search [67]. In practice, it is crucial to steer the searches toward finding realistic structures while maintaining structural diversity by imposing constraints on symmetry or some pre-defined structural units. When the number of variables is small, it is also possible to sample the configuration space based on the determined quasi-random, low-discrepancy sequences [68, 69]. Such quasi-random sampling has been very successful in a lot of engineering applications [70].

  2. 2.

    Simulated annealing [71, 72] is a strategy inspired by annealing of crystals, where gradual cooling leads to the equilibrium crystal structure. Basin hopping [73] adopts a similar strategy to perturb the crystal by Monte Carlo moves. These two methods mainly differ in whether or not relaxing the structure to the minimum after the perturbation.

  3. 3.

    Metadynamics [74] is designed to overcome large energy barriers through positively biased molecular dynamics simulation. It requires a set of collective variables (CVs) to distinguish between states of the system and scans the low-energy part of the energy landscape by distorting the actual energy landscape by a history-dependent potential, the aim of which is to discourage the system from sampling the states that have already been sampled. In addition to metadynamics, sampling of the CVs can also be achieved using a temperature-accelerated adiabatic free-energy dynamics approach [75]. These approaches are advantageous since they allow direct sampling on the free-energy surface.

  4. 4.

    Evolutionary algorithms are a type of nature inspired method that comes in many flavors [76,77,78]. The common idea is that a population of structures is evolved, driven by natural selection of lower-energy structures to become parents of a new generation of structures. These types of algorithms have been extremely successful in many different engineering applications. For the applications to the CSP, recipes for producing offspring from parents (genetic crossover and mutations) are of key importance [77].

We note that there also exist a variety of other implementations for predicting inorganic crystals. The fields of inorganic and organic CSP are beginning to converge [79]. For a more complete list of global optimization methods, the readers are recommended to check the Ref. [11]. It is expected more algorithms will be developed in the next few years.

It is also possible to consider more than one target function in global optimization. For instance, one may be interested in solving the crystal structure with unindexed powder x-ray diffraction (PXRD) data. In this case, one needs to simultaneously optimize two quantities, i.e., the minimum lattice energy and the maximum similarity with respect to the reference PXRD pattern [80]. Such multi-objective optimization problems can be handled by Pareto optimization [70]. It is also possible to use this idea to design new materials with optimum physical properties.

Models for intermolecular interactions

From a structure search, we expect to generate many trial structures whose energetic and physical properties need to be evaluated in a high-throughput manner. A popular choice is to use low-cost energy models to perform screening at the initial stage, and then employ more expensive electronic structure calculations for a short list of structures in the end. Here we will review the models to be used for both stages of structure optimization and final ranking.

Empirical models

Since the early developed atom-atom potential model, there have been plenty of efforts in developing more accurate empirical models to estimate the crystal packing energy. In molecular simulation, the popular choices are classical force fields (FFs) [81,82,83,84,85], distributed multipole models [86], machine-learning potentials (MLP) [87], and the Harris approximation [88]. Among them, the classical FFs and MLPs have been widely used in recent years. Classical FFs are built upon the total energy decomposition into several analytical terms, including 2/3/4-body intramolecular interactions and non-bonded van der Waals (vdW) dispersion. On the other hand, MLPs express the total energy as the machine-learning function of a high-dimensional array that represents each atomic local environment [89, 90]. In general, both approaches involve a parameterization (or training) from the reference data. The MLPs, usually based on neural networks (NN) regression of a set of symmetry-invariant structure descriptors [90, 91], give better accuracy when the configurations are highly correlated with the training data. However, it is also likely to generate nonphysical results if the input data are outside the training domain. Compared to the MLPs, the classical FFs are computationally cheaper and more robust in handling unknown configurations, but they are often insufficient to describe the subtle energy differences in crystal packing.

Figure 7
figure 7

The benchmark of GAFF [81] and ANI-2x [92] for 345 crystals according to the authors’ study on the C–H–O–N–S molecular systems.

For an accurate simulation, it is important to know the quality of each energy model. In a structure optimization, a good model should guarantee the reproduction of the same geometry for the input crystal. Figure 7 shows our recent survey of 345 crystal data by comparing the differences between experimental and simulated geometries based on a classical FF (GAFF [81]) and a MLP (ANI-2x [92]). The results suggest that both energy models are sufficient to describe most of the experimental geometries for most molecules. However, some systems cannot be handled. This is a common problem for any kind of generic force field since their training is built upon many different types of data.

A possible solution to the accuracy issue is to develop a customized force field [93]. The basic idea is to start with existing FF parameters to run CSP and then collect a new set of reference data by more accurate electron structure calculation. The FF parameters are then refined based on the new dataset. The tailor-made FF [93] has led to remarkable success in several recent CSP blind tests [45,46,47]. In principle, this strategy can be adapted to the construction of MLP as well.

Electronic structure theory

To make a final decision about which polymorph is most likely to exist, a short list of structures needs to be evaluated preferably by more accurate electronic structure methods [22]. Crystal polymorphism originates from the competition between intramolecular and intermolecular interactions for different crystal packings. The current state-of-the-art electronic structure methods based on density functional theory (DFT) are insufficient to capture such weak interaction. Various methods have been proposed to explicitly incorporate vdW interactions. One common approach is to add an extra energy term of \(-C_6/R^6\) to describe the first term of vdW interactions between two dipoles in a multipole expansion, as implemented in DFT-D [94], DFT-D2 [95], and TS [96]. The \(C_6\) term represents the dipole–dipole dispersion coefficient between the two atoms involved and R is the interatomic distance. More advanced correction includes the additional, dipole-quadrupole, and quadrupole–quadrupole contributions [97,98,99]. Another approach is to obtain dispersion interactions by designing functionals that explicitly include nonlocal correlations (though still based on pairwise addition) [100]. Furthermore, Tkatchenko and coworkers proposed the many-body dispersion (MBD) method [101] which describes many-body dipolar interactions up to infinite order and also includes electrodynamic response effects. Several test sets have been proposed to serve as the test bed to assess the performance of different DFT models [98, 102]. It was found that the MBD method substantially outperforms the original TS scheme, in particular for molecular crystals.

Although all the aforementioned vdW corrections were originally introduced to fix the shortcoming of a standard DFT calculation, they can also work with other semi-empirical electronic structure methods. For instance, Mortazavi et al recently combined the TS and MBD with third-order density functional tight binding (DFTB3) [103] via a charge population-based method [104]. They found an overall good performance for the X23 benchmark database of molecular crystals, while the entire calculation was 3000 times faster than the full DFT treatment. Recently, various machine-learning schemes have been introduced to systematically improve approximation of repulsive term in DFTB parameterization, and they showed great promise of achieving the chemical accuracy within the framework of DFTB modeling [105, 106]. Another general-purpose density functional tight-binding method, the semi-empirical extended tight-binding (xTB) model, is also gaining increased popularity due to its extended support for nearly 90 elements and improvements of the underlying theory regarding the treatment of the important electrostatic and dispersion interactions[107]. These exciting developments suggest that such methods can serve as a viable pre-screening tool for CSP in a high-throughput manner.

Successful examples

In the past decades, there have been a wealth of successes in using the CSP for organic crystal structure determination and functional materials design. It is obviously impossible to list each of them. In the following, we will focus on several representative examples in three main areas (i) pharmaceutical polymorph screening; (ii) high \(Z^\prime\) compounds and (iii) new organic semiconductors. The choice of these topics are mostly based on the authors’ expertise and research interests. For other types of applications, the readers are encouraged to review other literature.

Pharmaceutical compounds

Understanding, predicting, and controlling the structure of crystalline pharmaceuticals materials are a topic of tremendous importance because the crystal packing directly affects properties such as solubility, processability, hygroscopicity, chemical and physical stability, and bioavailability. In recent years, CSP has been increasingly successful in aiding the solid form screening of drugs. For instance, a recent computational screening of the pharmaceutical compound Dalcetrapib with 10 torsional degrees of freedom led to the discovery of a new form which was successfully synthesized under high pressure [108]. Inspired by the CSP calculation results, the search for new polymorphs of galunisertib was expanded to an unusual range of experiments, including melt crystallization under pressure, to work around solvate formation and the thermal instability of the molecule. These works led to the identification of ten unsolvated polymorphs of galunisertib were found [109]. As shown in Fig. 8, both molecules have mass over 350 g/mol and nontrivial topology. These successful examples demonstrate how current state-of-the-art CSP methods can be combined with experimental techniques to assess the polymorph risk in drug development.

Figure 8
figure 8

Two representative drug molecules that CSP has been successfully used to guide the polymorph screening [108, 109].

High crystals

A standard CSP calculation often limits the extent of search to \(Z^\prime\) = 1. However, an early database survey suggested that over 8% of compounds have high \(Z^\prime\) numbers [52]. Therefore, it is necessary to go beyond \(Z^\prime\) = 1 for future polymorph screening, if metastable polymorphs are of interest, as observed in recent cases [3, 54,55,56, 110]. For instance, coumarin, a rather simple molecule with only one well-known solid phase, turns out to have five polymorphs, one with three molecules in the asymmetric unit [56]. Structures of the \(\alpha\) and \(\beta\) phases of resorcinol, were the first polymorphic pair of molecular crystals solved by X-ray analysis in 1930s [111,112,113]. A new metastable polymorph was recently observed in a melt crystallization experiment. However, its complex structure (\(Z^\prime\) = 2, \(P2_12_12_1\)) was not solved until the aid of CSP efforts [55]. Similar synergy between theory and experiment has also been applied to establish the third ambient polymorph of aspirin [54], which also possesses two molecules in the asymmetric unit with the space-group symmetry of \(P2_1/c\) [Fig. 9(a)].

Figure 9
figure 9

Complex packing of high \(Z^\prime\) crystals of (a) aspirin [54] and (b) ROY [57]. The molecules in the asymmetric units are colored differently to guide the eye.

Another remarkable example is the identification of a long term elusive ROY polymorph. For a decade, the molecule 5-methyl-2-[(2-nitrophenyl)amino]-3-thiophenecarbonitrile has been considered the most polymorphic system [114]. It is called ROY for its red, orange, and yellow crystals. Out of the ten known polymorphs, seven forms have been solved in the 2000s. One of the remaining three unsolved structures, R05, an abbreviation for “red 2005”, was only solved recently through the combination of experiment and CSP calculation. This R05 form [Fig. 9(b)] is both the first acentric ROY polymorph, and the first with \(Z^\prime >1\). Since our successful determination of R05, there has been a renewed interest in ROY. Four new ROY polymorphs have been reported since 2019 through the joint efforts between experiment and computation [115,116,117,118]. It is expected that more ROY polymorphs await future discovery according to the improved energy ranking method by Beran et al. [119].

New organic semiconductors

Polymorphism has been considered as an emerging design strategy for organic functional materials. Structure determination for metastable polymorphs might gain impetus in the near future. One of the appealing industrial applications is the design of organic semiconductors, a special class of organic materials with delocalized molecular orbitals in which charge carriers are mobile due to extended \(\pi\)-conjunction. Sokolov et al. [120] developed a computational screening scheme to search for new compounds derivable from existing molecules and found a new compound with a significantly improved mobility over the parent molecules. With the advance in both computational power and CSP methodology development, a systematic CSP study of energy-structure-function maps for small hypothetical molecules is now possible [35, 121, 122].

The idea of global optimization is not necessarily limited to the navigation of crystal packing. Instead, it can also be used to identify promising molecules from a vast chemical space. Recently, an evolutionary method, combined with the CSP was developed to explore a user-specified region of chemical space to identify promising molecules for organic semiconductors with high charge carrier mobilities [123]. It has been shown the method can efficiently explore a large space for pentacene- and pyridazine-based molecules having both low reorganization energies and high electron affinities.

Remaining challenges

In recent years, the power of CSP has been demonstrated by many successful examples in materials’ development and design. It is clear that CSP has become central to the study of organic materials. New computational tools for such work are being widely developed by different research groups, as shown by the increasing number of participants in the recent blind tests [42,43,44,45,46,47]. Not only materials theorists, many experimental groups are actively using the CSP tool to design and interpret their experimental data. That has been said, structure prediction faces several challenges in handling large molecules, high \(Z^\prime\) packing, temperature effects, and large chemical space. Among them, some challenges (e.g., free-energy ranking) have been under active consideration. However, some issues may not receive much attention. All issues will be discussed as follows. In particular, we will pay more attention to the issues that have been seldom addressed in other review works.

Overlooked historic wisdom

The symmetry correlation between molecule and crystal packing used to be the central subject in the early days of organic crystallography. However, most research efforts have been shifted to numerical simulations for each concrete molecule and crystal. After the seminal work of Kitaigorodskii, the pursuit of a general geometric model became much less active. Nowadays, most CSP practitioners generate the trial structure from space-group symmetry. This has been shown to be sufficient for small, rigid, and simple-shaped molecules, thanks to the continuous methodology and code developments in the past two decades. Nevertheless, it remains challenging to predict the correct packing even when the space group is known. In fact, Kitaigorodskii’s original proposal to study 3D packing from molecular layers may provide a short cut to reduce such search complexity. In the future, it would be worthwhile to revisit those early ideas and check their validity for the new challenges.

In terms of the crystal packing motif, researchers continue to use those highly subjective nomenclatures (as part of historical heritages) to describe the crystal packing and interpret the observed results. For instance, the four groups of PAH crystals [26] are still widely used, despite their apparent limitations as described in the previous section. In fact, understanding the relation between molecular shape and packing is not far from a standard pattern recognition problem in computer vision. The emerging computational techniques in data science, geometry, and computation vision should be imported in a timely manner. However, one must keep in mind that symmetry invariance (e.g., rotation and translation) needs to be taken into account. If the packing pattern can be well grouped, the library of packing motifs should be the next step. Consequently, one can analyze the occurrence of each packing motif in a way similar to the prototype library for inorganic crystals [124, 125]. For a given molecular shape, it is expected that only a small number of template packing structures are likely to be populated as probable observable crystal polymorphs [126]. Such templates can be extracted from the crystal packing library to generate candidate structures for the new molecules [127].

In addition, there has been a small concerted effort to quest the origin of high \(Z^\prime\) organic crystals’ formation [53, 128,129,130]. It has been hypothesized that the high \(Z^\prime\) crystals may be closely related to a \(Z^\prime\) = 1 crystal by a pseudosymmetry via small molecular displacements and rotations [53, 129]. Similar phenomenon has been found in several of our studies. For instance, the newly discovered \(\epsilon\)-resorcinol (\(P2_12_12_1\)) was obtained via melt crystallization with a high cooling rate. With slow cooling, a more stable \(\beta\)-resorcinol phase (\(Pna2_1\)) should be formed. Comparing their crystal packing in Fig. 10, there are some clear hints showing that both polymorphs can be transformed to a common high point group symmetry mmm via a pseudo inversion symmetry. Therefore, it is likely that a high symmetry intermediate phase may form in the beginning and then a symmetry lowering occurs after crystal nucleation upon cooling. It would be interesting to check the validity of such a hypothesis through more rigorous characterization in both experiment and simulation. If such high symmetry intermediate phases do exist and they can be reliably predicted, the high \(Z^\prime\) crystals can be trivially resolved through the rigorous group-subgroup relation without an extensive CSP search.

Figure 10
figure 10

The comparison between two resorcinol polymorphs (a) \(\beta\) and (b) \(\epsilon\). In (b) two molecules in the asymmetric unit are colored differently. The light blue eclipses are marked to highlight the likely pseudo inversion symmetry.

Technical challenges

For the standard CSP calculation, the most urgent need is to be able to measure relative stability at finite temperature. Most current CSP techniques should be regarded as zeroth-order approaches [131] that consider the most stable crystal structure by evaluating lattice energy only. However, the real world is governed by free energies. It is possible that lattice vibration under finite temperature can substantially change the stability ranking [132]. Indeed, several groups improved their final polymorph rankings by considering the vibrational free-energy ranking under harmonic, quasi-harmonic, and even anharmonic treatments [47]. More advanced methods have also been proposed [133]. Although the free-energy calculations are still expensive, one should be optimistic that such calculations can be done within a reasonable time frame in the next couple of years with the further improvement in computational methodology.

Another challenge is to enable a more efficient CSP calculation in a high-throughput manner. In a standard CSP run, there is no general mechanism to terminate the search and one usually attempts to generate the trial crystal structures as many as possible. This exhaustive sampling strategy is suitable when one only needs to deal with a few systems. However, the organic materials development may require one to consider many molecules. To enable such high-throughput polymorph screening in a reasonable time frame, we shall develop a more efficient sampling strategy to avoid exhaustive searches. For a given chemical, a large number of hypothetical structures can exist. Relaxing each of them to their lowest-energy local minimum, the set of trajectories then form the basins of attraction on the potential energy surface. The probability of successfully finding a low-energy structure depends on the shapes and sizes of the hyper-volumes of the basins of attraction and the details of the search [67, 134, 135]. It is possible that some molecules may have fuzzy landscapes compared to others [19]. To validate the concept of landscape complexity, we selected three different molecules to run CSP simulations based on our evolutionary algorithms with a population size of 100. We terminated the search at the Nth generation as long as the experimental structure is found at the end of that generation. Such calculations were repeated for 50 times to collect the statistics. Figure 11 shows the distribution of N for each molecule. Clearly, the results indicate that each molecule has a distinct crystal landscape. For the molecules with a simpler crystal landscape, we can confidently set a small number of generations in the CSP simulation. On the other hand, more generations will be needed for the molecules with complex landscapes. In the future, a more automated strategy should estimate the packing complexity and guide the high-throughput CSP exploration.

Figure 11
figure 11

Comparison of CSP complexity for three molecules taken from the previous CCDC blind tests.

In terms of applications, the recent focus in crystal engineering has progressively shifted from structure to function. In describing the structure–property relation, symmetry plays a deterministic role as it can greatly simplify the calculation of physical quantities, such as elastic constants, vibration modes, and piezo-coefficients. While the relation between symmetry and crystal packing has been widely studied in the past, the symmetry relation between different polymorphs remains largely underexplored. Displacive phase transitions involve the structural changes that follow a low-energy barrier path between two phases with close symmetry relation. Nowadays, symmetry analysis for an organic crystal relies on a lot of human–machine interactions (e.g., extracting the molecular coordinates, looking up the character tables and trying related subgroup conventions). In the future, it is impetus to develop a tool that can automate the symmetry analysis for organic crystals, thus, enabling the studies on organic materials with interesting magnetic [136], ferroelectric [62], piezoelectric, and topological electronic properties [137].

Open software development

In addition to the innovations in methodology and algorithm, it is also worthwhile to discuss the current situation for open-source CSP software developments. In the field of materials’ modeling, there is clearly an ongoing paradigm shift driven by the open science movement. In the past decade, we have witnessed substantial growths of open databases for both inorganic [138,139,140] and organic materials [141,142,143], as well as the open packages that enable the automated process of materials data and property evaluation [144, 145]. As described in the review, conducting a successful organic crystal prediction requires the integration of several different computational pipelines ranging from force field generation, structure search, and (free) energy ranking. Currently, most researchers rely on subscriptions to commercial license to obtain the data, run simulations, and analyze results. Such heavy software dependence largely discourages young scientists from bringing new expertise and algorithmic innovation to this field. Due to the license restriction, it is also hard to reproduce historically published results, even for an experienced researcher. For the CSP of inorganic materials, there have been plenty of choices (e.g., USPEX [76], AIRSS [67], CALYPSO [146], XtalOpt [147], GASP [148]) that are either completely open source or free for academic researchers. However, the codes [69, 76, 78] for the prediction of organic materials are still limited and none of them are. To promote the open-source activity in organic CSP, we have recently launched the open-source project PyXtal [48] to provide easy access to generating and manipulating crystal structure via different symmetry constraints. We hope that this project can provide a platform that allows more researchers to develop or apply their advanced structure search or energy ranking methods for different CSP-related research activities.

Conclusions

In sum, we have presented a comprehensive review of the recent progress of crystal structure prediction. It is clear that we are witnessing an exciting paradigm shift where computer simulation is playing a more deterministic role in the development of new organic materials. Several notable examples include (i) very complex crystals that represent real-life challenges in pharmaceutical research can be predicted from pure blind tests; (ii) the energy ranking methods have been continuously improved; and (iii) new open-source codes have become available to provide invaluable resources to new practitioners with a minimum learning barrier. These exciting developments have significantly boosted the application of CSP in guiding the screening and design of new organic compounds. While there remain some limitations in handling more complex systems and exploring more diverse chemical space, it is reasonable to expect that, in the near future, the use of predictive computational modeling can be accomplished within a time frame that is appreciably shorter than that needed to perform the laboratory synthesis and characterization.