Data-driven science is becoming more and more an integral part of modern material development. Through improvements in experimental techniques and computational power, increasing amounts of data are being generated and more data is available in research facilities as well as in global data repositories [1]. To be able to process and analyse this data, new tools are required which allow detecting patterns and anomalies as well as extracting new relationships between materials quantities which could not be discovered otherwise. Computational materials science is adopting novel data-driven approaches to speed up calculations and improve its predictive power.

In this article we will present two applications of data-driven methods on the basis of two specific examples. The first example describes how computationally expensive calculations based on density functional theory (DFT) can be replaced with regression methods to obtain a more complete description of segregation phenomena in metallic materials. While DFT allows for scanning the periodic table for favourable or unfavourable solute elements at grain boundaries, the description is still idealized and does not fully account for the high degree of complexity which can be encountered at grain boundaries. For this reason, it is required to train machine learning methods to a consistent set of data encoding the DFT results and extend from there to all atomic sites in the grain boundary.

The second example demonstrates how uncertainty quantification can be applied to calibrating physical models to complex materials phenomena and processes. So far, parameter calibration has been mostly uncertainty-agnostic and computational predictions have been rarely provided with an error bar. However, approaches based on Bayesian inference provide a convenient framework for uncertainty quantification, which is crucial in the verification of simulation results [2]. We will demonstrate how such a methodology can be applied to calibrate a thermodynamic model to experimental data of binary W‑Ti alloys and to determine their degree of uncertainty. In this way, the regions in the phase diagram can be identified where further experimental or theoretical data are required for a consistent improvement of the thermodynamic model.

Methodology of Data Driven Methods

Machine Learning of Segregation Energies

In general, the term machine learning (ML) covers different methods including regression, classification or clustering. Common to all these methods is their application on an existing data set which should be large enough to encode the relevant information. To make these data usable, they are represented by a set of features or descriptors which are either taken directly from the data sets or obtained by pre-processing of the data. ML methods are generally divided into supervised and unsupervised learning algorithms. In supervised learning the ML algorithm is handed, in addition to the features, also the target quantity which is then used to train the model. In contrast, for unsupervised learning, no target values are provided. Supervised ML methods cover a variety of methods ranging from the simplest, i.e. linear regression, to more sophisticated methods such as decision tree-based methods or artificial (deep) neural networks. In this work, linear regression (LR) and a particular version of a tree-based ensemble method, gradient boosting regression (GBR), are adopted as implemented in the free Python library scikit-learn [3]. A comparison of the performance of the two regression techniques gives valuable insights into the quality of the employed features.

To carry out LR, a function

$$f\left(\overset{\rightarrow }{x};\overset{\rightarrow }{w}\right)=\overset{\rightarrow }{x}\overset{\rightarrow }{w}+w_{0}$$

is fitted to data points \(\left(\overset{\rightarrow }{x}_{i},y_{i}\right)\) by varying the coefficients \(w_{0},\overset{\rightarrow }{w}\) and minimizing the residual sum of squares (RSS)

$$\min _{\overset{\rightarrow }{w}}\sum _{i}\left(\overset{\rightarrow }{x}_{i}\overset{\rightarrow }{w}+w_{0}-y_{i}\right)^{2}.$$

The \(\overset{\rightarrow }{x}_{i}\) are the features of each data point and the yi are the corresponding target quantities, in the present case the segregation energies. The expression \(\overset{\rightarrow }{x}_{i}\overset{\rightarrow }{w}\) denotes the scalar product between two vectors.

The advantage of a LR is that it can be evaluated quickly, even for large datasets. As the name says, it can only describe linear relations. Therefore, if a feature performs well with LR, it is particularly suitable for ML of the investigated data.

GBR is one of the so-called ensemble methods where several models are trained on a dataset which then give reliable results. For the gradient boosting regression, several models \(h\left(\overset{\rightarrow }{x};a\right)\) are summed together for the model \(F\left(\overset{\rightarrow }{x}\right)\) [4]:

$$F_{M}\left(\overset{\rightarrow }{x}\right)=\sum _{m=0}^{M}\beta _{m}h\left(\overset{\rightarrow }{x};a_{m}\right).$$

The expansion coefficients βm and the parameters am are iteratively fitted to the residual \(\tilde{y}_{m}\) of the previous iterations:

$$\tilde{y}_{m}=\tilde{y}_{m-1}-F_{m-1}\left(\overset{\rightarrow }{x}\right);\tilde{y}_{0}=y.$$

This process is repeated until either the maximum number M is reached or a sufficient accuracy has been reached. For the models \(h\left(\overset{\rightarrow }{x};a\right)\) in Eq. 3, scikit-learn uses decision trees calculated with a modified CART algorithm. Decision trees consist of a sequence of nodes that partitions the dataset in the feature space. This leads in the end to fitting a series of step-functions to the data which can provide highly non-linear relationships. Compared to LR, GBR is therefore more flexible in the functional form of the target function.

Uncertainty Estimation in the Bayesian Framework

In materials science, physical models are often fit to experimental and theoretical data by variation of the parameters of the model to minimize the error in the property to be predicted. Usually, a single value of the model parameters is given as the result without specifying any bounds of uncertainty. But quantification and management of the uncertainty are crucial in materials science to validate the model. Uncertainty in the model and its parameters arise as the data used to fit the model come with an error bar because the model might not be able to represent all the data due to simplifying assumptions or because the knowledge about the system is incomplete.

Both frequentist and Bayesian inference can be used for parameter calibration and uncertainty estimation of model parameters. In the frequentist framework, the measurement is a realization of a random variable, but the model parameters are assumed to be fixed. In the Bayesian framework, the model parameters are treated as a random variable with a probability distribution. Furthermore, the Bayesian framework allows to incorporate existing knowledge about the model parameters (prior knowledge) in the inference process.

The posterior distribution p(Θ|M,D) defines the probability of the parameter Θ of the model M given the existing data D. The prior probability of a parameter p(Θ|M) is updated to the posterior probability, given the data D using Bayes theorem [5]:

$$p\left(\Theta |M,D\right)=\frac{p\left(\Uptheta |M\right)p\left(D|M,\Uptheta \right)}{p\left(D|M\right)}.$$

The prior contains already established knowledge or beliefs on the parameter values of a model M. \(p(D| M)\) is called the evidence, and it is the probability to obtain the data D with the model M. \(p(D| M,\Theta )\), also called likelihood, is the probability that the data D is described by the model M with the parameters Θ. A possibility to define the likelihood is to assume a normal distribution of the measurement around its true value with a standard deviation σ corresponding to the measurement error. The residual Ri is the deviation between data point Di and the model M(Θ). The likelihood for a single data point Di is then calculated as:

$$p\left(D_{i}|M,\Theta \right)=\frac{1}{\sqrt{2\pi \sigma ^{2}}}\exp \left(-\frac{R_{i}^{2}}{2\sigma ^{2}}\right).$$

The total likelihood is the product over the likelihoods of the single data points: \(p\left(D|M,\Theta \right)=\prod _{i}p\left(D_{i}|M,\Theta \right)\). The closed form solution for the posterior probability in Eq. 5 is rarely possible. Markov Chain Monte Carlo (MCMC) can be used to sample the posterior, which allows for simultaneous determination of the optimal parameter set and uncertainty quantification. Starting with an initial set of parameters Θ, a new parameter set \(\Theta '\) is proposed at each iteration. The new parameter set is accepted according to the Metropolis Criterion, where the ratio of the posterior probabilities comes into account

$$p_{\text{accept}}=\min \left(\frac{p\left(\Theta '|M,D\right)}{p\left(\Uptheta |M,D\right)},1\right)=\min \left(\frac{p\left(\Theta '|M\right)p\left(D|M,\Theta '\right)}{p\left(\Uptheta |M\right)p\left(D|M,\Uptheta \right)},1\right).$$

The Markov Chain converges to the optimal parameter set. After the convergence of the Markov Chain, the parameters will fluctuate around the optimum and, in this way, sample the posterior distribution [6]. The probability distribution corresponds to the uncertainty of the parameters which can be propagated to the model output. This is done by sampling the parameter space and evaluating the model with the sampled parameters to arrive at the distribution and uncertainty of the model output. As the Markov Chain already samples the parameter space after convergence, these samples can be used for this purpose.


Machine Learning of Grain Boundary Segregation

This section demonstrates how GPR can be used to predict grain boundary segregation energies based on local atomic environments. Grain boundary (GB) segregation describes the process of trapping of solute atoms to interfaces between different grains in a metal. GB segregation and the resulting GB excess can strongly influence different materials properties, for example intergranular brittleness or phase transformation. GB excess is related to the segregation energy (\(E_{k}^{\mathrm{seg}}\)) via the McLean isotherm [7]

$$\frac{c_{k}}{1-c_{k}}=\frac{c_{0}}{1-c_{0}}\cdot e^{-\frac{E_{k}^{\mathrm{seg}}}{k_{b}T}},$$

which describes the concentration of the solutes at a GB (ck) depending on bulk concentrations (c0), the temperature (T). The \(E_{k}^{\mathrm{seg}}\) can be provided by atomistic modelling. A routinely used and very accurate method is density functional theory (DFT) as it is implemented, for example, in the Vienna Ab-initio Simulation Package (VASP). Based on the atomic structure of a GB, the \(E_{k}^{\mathrm{seg}}\) are calculated as the difference in energy between placing the solute atom at a GB site (EGB,k) and placing it at a reference bulk site (EGB,0) [8]:


This method has been used previously to calculate segregation energies for a W-25 at.% Re alloy [8]. The dataset contains in total 219 segregation energies for segregation sites in 15 different GBs. Fig. 1 shows some of these GBs where the values of \(E_{k}^{\mathrm{seg}}\) are indicated by different colors.

Fig. 1
figure 1

Illustration of selected grain boundaries in a W-25 at.% Re alloy. Circles illustrate different atomic sites in the grain boundary and are colour-coded according to the segregation energies

Since each calculation of a \(E_{k}^{\mathrm{seg}}\) entails a DFT simulation for up to several hundreds of atoms, the associated computational effort is considerable. A way to accelerate the calculation of the \(E_{k}^{\mathrm{seg}}\) is to apply ML. Based on the local atomic structure around a segregation site, the corresponding Eseg can be predicted. To show how the choice of different features influences the result, different descriptions of the local atomic environment are chosen including the cartesian coordinates or the spherical coordinates of the neighbouring atoms. Only the 15 atoms closest to the site are considered and the features are sorted according to the radial distance. In addition we also consider Steinhardt bond order parameters which are evaluated from the spherical harmonics Ylm as [9]:

$$q_{l}=\left[\frac{4\pi }{2l+1}\sum _{m=-l}^{l}\left| \frac{1}{N_{A}}\sum _{n=1}^{N_{A}}Y_{lm}\left(\hat{r}_{n}\right)\right| ^{2}\right]^{\frac{1}{2}}.$$

Here the \(\hat{r}_{n}\) are the unit vectors in direction of the neighbouring atoms and NA is the number of atoms used in the sum. As already pointed out previously, there exists a remarkable correlation of Eseg and the Steinhardt q5 parameter for the W‑Re system [8]. The performance of different ML results is measured via the coefficient of determination, i.e. the R2-score. A score of one signifies a perfect fit, while a score of zero means that the quality of the fit is as good as the mean value of the data.

Fig. 2
figure 2

Predicted versus DFT-calculated segregation energies when applying GBR to cartesian coordinates (left), spherical coordinates (center) and the Steinhardt q5 parameter (right)

Table 1 and Fig. 2 compare the results of different ML methods for three different sets of features as mentioned before. The simplest and most straightforward feature set are the Cartesian coordinates. Applying LR clearly fails, since the corresponding R2 score is negative indicating that the Cartesian coordinates are not suitable features. GBR improves the results somewhat, however the R2 is still far from 1. These results for the Cartesian coordinates motivate the search for alternative features for the problem of GB segregation. Changing from Cartesian coordinates to spherical coordinates already greatly improves the results for LR and GBR. Since segregation energy shows a linear dependence on the Steinhardt q5 parameter, LR produces good results which are not improved any more with GBR. In general, the results highlight that proper feature engineering is decisive to improve regression quality.

TABLE 1 Performance of ML approaches employing different features and regression methods. The scores for the models are calculated using 5‑fold cross-validation

In Fig. 3 we show the benefits of training a ML model to the segregation data. The total effort of the DFT calculations, (red, empty circles in the plot) was about 1 million coreh. With the trained ML algorithm, the segregation profile can now be completed (black points) at essentially zero computational cost. Although the ML predictions do not exactly reproduce the data everywhere, the trends are very well captured.

Fig. 3
figure 3

Profiles of the Esegdepending on the distance from the GBs. Red open circles mark the DFT values while black solid circles show the ML predictions with GBR using spherical coordinates as the features

Our results show that ML-models using spherical coordinates or the Steinhardt q5 parameter already give a reasonable description of the GB segregation. Certainly, more elaborate features engineering can still improve this result. Furthermore, a general application of this methodology also requires extending the method to also include chemistry results. Such investigations are left to future research efforts.

Uncertainty Quantification in CALPHAD Modeling

The CALPHAD (CALculation of PHase Diagrams) method is an important part of materials science and thermodynamic calculations [10]. Starting from a thermodynamic description of the phases in a material, phase diagrams and thermodynamic properties can be estimated and also used as input in subsequent models such as diffusion simulations or phase field modeling of phase transformation.

Every experimental and theoretical data point which can be numerically related to the Gibbs energy can be used as input data for the CALPHAD assessment if a residual function can be defined. Thermochemical data, like heat capacity, enthalpy, and entropy, is used as these quantities can be calculated directly from derivatives of the Gibbs energy. Experimental evaluations of invariant points, transition temperatures, liquidus and solidus temperatures, and phase stabilities are crucial input for the determination of phase diagrams. The definition and evaluation of the residual function for this kind of data is more challenging as it requires a global minimization of the Gibbs energy.

CALPHAD assessments have mostly been deterministic, giving a single value for the parameters, without specifying any uncertainty of the parameters. Although there have been some early works, the topic of uncertainty quantification and propagation in CALPHAD has only gained momentum in recent years, often adopting the Bayesian framework [11]. As the accuracy of the thermodynamic database determines the reliability of the calculated results, the determination of the uncertainty in CALPHAD helps to understand how strongly one can rely on results from thermodynamic simulations. The determination of uncertainty also allows to identify gaps of knowledge in the data and can lead to the identification of the experiment which will contribute the most to the accuracy of the thermodynamic database.

The open source program ESPEI [12] uses the Bayesian framework for the parameter calibration and uncertainty quantification as described in Sect. 2.2. We demonstrate the parameter calibration and uncertainty quantification in a CALPHAD optimization with the example of the binary titanium tungsten (Ti-W) system. The accepted phase diagram of Ti‑W stems from an assessment in 1996 [13] based on experimental data dating back to the 50 s. Using these data we performed a parameter optimization with ESPEI. The Ti‑W binary system has three stable phases, the body-cantered-cubic (BCC), the hexagonal (HCP) and the liquid phase. The experimental phase diagram data includes the determination of the solidus, liquidus, determination of the eutectic temperature and composition, and (rough) determination of the bcc-miscibility gap at higher temperatures.

In total 6 parameters are calibrated against the data, describing the excess terms and, thus, the mixing energy of the two components in the BCC, HCP, and liquid phase. The corner plot in Fig. 4a shows the posterior distribution of the parameters obtained from MCMC. The diagonal plots are histograms of the parameter values from the Markov Chain after convergence, resembling the posterior. The covariance of two parameters is shown by the off-diagonal images. We use parameter set samples from this parameter distribution (which we can take from the Markov Chain) to identity the uncertainty in the phase boundaries. For this we calculate, for each parameter set, the phase diagram, represented by the minimization of the Gibbs energy with the common tangent construction in Fig. 4b. For a qualitative estimation of the uncertainty of the phase boundaries, we plot the calculated phase diagrams on top of each other. When the phase boundaries lie on top of each other, as it is the case for the solidus up to 50 atomic percent tungsten, the uncertainty in the phase boundary is low. With a wider spread of the phase boundaries, there is a larger uncertainty, as we can see for the location of the miscibility gap and the solubility of titanium in BCC tungsten.

Fig. 4
figure 4

Schematic illustration of the process of uncertainty propagation of CALPHAD parameters to phase boundary prediction. More information is provided in the text


Data driven materials modeling offers powerful tools that can be applied to a variety of materials design questions. We have presented two specific applications. The first deals with machine learning of DFT-calculated segregation energies which are key to predict grain boundary chemical composition. We could demonstrate how ML can effectively complete the total segregation profile of GBs in W and save a large amount of computational resources. Further development of such methodology is foreseen to allow investigating advanced co-segregation phenomena and phase transformations at grain boundaries. The second application was about quantification and propagation of uncertainty in thermodynamic modelling of the W‑Ti alloy based on Bayesian inference and Markov chain Monte Carlo sampling. The currently accepted W‑Ti phase diagram was reassessed and completed with uncertainty quantification and propagation. The analysis shows that especially the miscibility gap in the phase diagram is still associated with considerable uncertainty and additional experimental data should be supplied for higher precision of the thermodynamic model.