# High-Dimensional Materials and Process Optimization Using Data-Driven Experimental Design with Well-Calibrated Uncertainty Estimates

## Abstract

The optimization of composition and processing to obtain materials that exhibit desirable characteristics has historically relied on a combination of domain knowledge, trial and error, and luck. We propose a methodology that can accelerate this process by fitting data-driven models to experimental data as it is collected to suggest which experiment should be performed next. This methodology can guide the practitioner to test the most promising candidates earlier and can supplement scientific and engineering intuition with data-driven insights. A key strength of the proposed framework is that it scales to high-dimensional parameter spaces, as are typical in materials discovery applications. Importantly, the data-driven models incorporate uncertainty analysis, so that new experiments are proposed based on a combination of *exploring* high-uncertainty candidates and *exploiting* high-performing regions of parameter space. Over four materials science test cases, our methodology led to the optimal candidate being found with three times fewer required measurements than random guessing on average.

## Keywords

Machine learning, Experimental design, Sequential design, Active learning, Uncertainty quantification

## Introduction

Because of the time intensity of performing experiments and the high dimensionality of many experimental design spaces, exploring an entire parameter space is often prohibitively costly and time-consuming. The field of *sequential learning* (SL) is concerned with choosing the parameter settings for an experiment or series of experiments in order to either maximize information gain or move toward some optimal parameter space. In general, these parameter settings encompass everything from measurement procedures to physical test conditions and test specimen characteristics. Sometimes also called *optimal experimental design* or *active learning*, sequential learning can be used by the experimenter to decide which experiment to perform next to most efficiently explore the parameter space.

Traditional design-of-experiment approaches are typically applied to relatively low-dimensional optimization problems. For example, the Taguchi method relies on performing a set of experiments to create a complete basis of orthogonal arrays, which requires gridding the input parameters *a priori* into a set of plausible values [1]. Fisher’s *analysis of variance* approach, which decomposes the variance of a response variable into the contributions due to different input parameters, also assumes that the input parameters are discrete [2]. These approaches can be powerful in certain applications, such as reducing process variability, but do not scale to larger-dimensional, more exploratory scenarios with real-valued, inter-dependent input parameters. Many SL approaches rely on Bayesian statistics to inform their choice of experiments [3, 4, 5]. In this case, the experimental response function (i.e., the quantity being measured in the experiment) *f*(**x**) is estimated by a surrogate model \(\hat {f}(\mathbf {x} ; \mathbf {\theta })\), where **x** are the experimental parameters and **𝜃** are the surrogate model parameters. In the Bayesian setting, the experimental data are used to estimate an *a posteriori* joint probability distribution function for the model parameters. The Bayesian approach has two main strengths. First, it provides uncertainty bounds on the estimated model response \(\hat {f}(\mathbf {x} ; \mathbf {\theta })\). These uncertainty bounds can be used to inform the choice of experiments. The second advantage of Bayesian optimization, as opposed to gradient-based optimization, is that it uses all previous measurements to inform the next step in the optimization process, resulting in an efficient use of the collected data [6]. On the other hand, Bayesian methods often struggle in high-dimensional spaces due to the curse of dimensionality in constructing a joint probability distribution function between many parameters. 
High-dimensional spaces are typically handled by first applying dimension reduction techniques [7].

Recently, there has been increasing interest in SL approaches for applications in materials science. Wang et al. [8] applied SL to the design of nanostructures for photoactive devices. They proposed a Bayesian SL method that suggested experiments in batches. They used Monte Carlo sampling to estimate the dependence of the response function on their two experimental parameters. Their approach was shown to optimize the nanostructure properties with fewer trials than a greedy approach that did not leverage uncertainty information. Aggarwal et al. [9] applied Bayesian methods to two different applications in materials science: characterizing a substrate under a thin film, and selecting between models for the trapping of helium impurities in metal composites. Ueno et al. [10] presented a Bayesian optimization framework that they applied to determining the atomic structure of crystalline interfaces. Xue et al. [11] investigated the use of SL for discovering shape memory alloys with high transformation temperatures. They used a simple polynomial regression on three material parameters to drive their predictions. Dehghannasiri et al. [12] also proposed the use of SL for the discovery of shape memory alloys. They used the mean objective cost of uncertainty algorithm, an SL approach that performs robust optimization to incorporate the cost of uncertainty. In the test case of designing shape memory alloys with low energy dissipation, their approach was shown to require fewer trials than either random guessing or greedy optimization. These studies highlighted the significant promise of SL for reducing the number of experiments required to achieve specified performance goals in materials science applications. However, these SL approaches were all evaluated on case studies with five or fewer degrees of freedom.

In materials science, it is not always straightforward to describe an experimental design in terms of a small number of real-valued parameters. For example, in trying to determine a new alloy with specific characteristics, how should the alloy formula be parametrized? In an effort to discover new Heusler compounds, Oliynyk et al. [13] parametrized the chemical formula in terms of over 50 different real-valued and categorical features that could be calculated directly from the formula. Such high-dimensional parameter spaces demand a different SL strategy. In this paper, we present an SL approach that uses random Forests with Uncertainty Estimates for Learning Sequentially (FUELS) to scale to high-dimensional (> 50 dimensions) applications. It is worth noting that the focus here is on high-dimensional input parameter spaces, not on multi-objective optimization, which is beyond the scope of the current study. We evaluate the performance of the FUELS framework on four different test cases from materials science applications: one in magnetocalorics, one in superconductors, another in thermoelectrics, and the fourth in steel fatigue strength. There are two main innovations presented in this paper. The first is the implementation of robust uncertainty estimates for random forests, validated for the four test cases. These uncertainty estimates are critical not only for the application of SL, but also for making data-driven predictions in general. For example, if a model for steel fatigue strength predicts only a raw number without uncertainty, it is impossible to know if the model is confident in that prediction or is extrapolating wildly. In this case, it would not be clear whether more data needed to be collected before the model could be trusted as part of the design process. 
Because of the many sources of uncertainty in materials science and the reliability-driven nature of design specifications, it is key that as data-driven models rise in popularity, methods for quantifying their uncertainty are developed and evaluated. The second major innovation is the application of these random forests with uncertainty bounds to high-dimensional SL test cases in materials science. We demonstrate their utility as a practical experimental design tool for materials and process optimization that significantly reduces the number of experiments required.

## Methodology

### Random Forests with Uncertainty

Random forests are composed of an ensemble of decision trees [14, 15]. Decision trees are simple models that recursively partition the input space and define a piece-wise function, typically constant, on the partitions. Single decision trees often have poor predictive power for non-linear relations or noisy data sets [14]. In random forests, these weaknesses are overcome by using an ensemble of decision trees, each one fit to a different random draw, with replacement, of the training data set. For a given test point, the predictions of all the trees in the forest are aggregated, usually by taking the mean.
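The bagging procedure described above can be illustrated with a minimal sketch. It is not the Lolo implementation: it assumes one-dimensional inputs and depth-1 trees (a single split), and the names `fit_stump`, `fit_forest`, and `predict` are hypothetical.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a depth-1 regression tree: one threshold, constant value on each side."""
    best = None
    # Candidate thresholds start at the second-smallest input so both sides are non-empty.
    for threshold in sorted(set(xs))[1:]:
        left = [y for x, y in zip(xs, ys) if x < threshold]
        right = [y for x, y in zip(xs, ys) if x >= threshold]
        left_mean, right_mean = mean(left), mean(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All inputs identical in this bootstrap draw: predict a constant.
        constant = mean(ys)
        return lambda x: constant
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x < threshold else right_mean


def fit_forest(xs, ys, n_trees=50, seed=0):
    """Fit each tree to a different random draw, with replacement, of the training data."""
    rng = random.Random(seed)
    n = len(xs)
    trees = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]
        trees.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return trees


def predict(trees, x):
    """Aggregate the tree predictions by taking their mean."""
    return mean(tree(x) for tree in trees)


# Hypothetical noisy step-function data, for illustration only.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0]
ys = [0.1, -0.1, 0.0, 0.2, 1.0, 0.9, 1.1, 1.0]
forest = fit_forest(xs, ys)
```

Averaging over bootstrap replicates smooths the single hard split of any one stump, which is the variance-reduction effect that motivates the ensemble.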

In the uncertainty estimate of Eq. 1, \(\sigma_i^2(\mathbf{x})\) is the sample-wise variance at test point **x** due to training point *i*, *ω* is the noise threshold in the sample-wise variance estimates, and \(\tilde \sigma (\mathbf {x})\) is an explicit bias function, to be discussed later. In this work, the noise threshold is set to \(\omega = |\min_i \sigma^2(\mathbf{x}_i)|\), the magnitude of the minimum variance over the training data.

In the sample-wise variance estimate, \(\mathrm{Cov}_j\) is the covariance computed over the tree index *j*, \(n_{i,j}\) is the number of instances of the training point *i* used to fit tree *j*, \(t_j(\mathbf{x})\) is the prediction of the *j*th tree, \(\overline {t}_{-i}(\mathbf {x})\) is the average over the trees that were not fit on sample *i*, \(\overline {t}(\mathbf {x})\) is the average over all trees, *e* is Euler’s number, *v* is the variance over all the trees, and *T* is the number of trees.

The sample-wise variance is effective at capturing the uncertainty due to the finite size of the training data. It can, however, underestimate uncertainty due to noise in the training data or unmeasured degrees of freedom. For these reasons, we amended the sample-wise variance with the explicit bias model \(\tilde \sigma (\mathbf {x})\), which should be chosen to be very simple to avoid over-fitting. Here, \(\tilde \sigma (\mathbf {x})\) is a single decision tree limited to depth \(\log_2(S)/2\), where *S* is the number of training points. The random forests and uncertainty estimates used in this study are available in the open-source Lolo Scala library [18].
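For orientation, the symbols defined in this subsection fit together in estimates of the following form. This is a reconstruction from those definitions and from the jackknife-after-bootstrap and infinitesimal jackknife estimators of Wager et al. [17], not a verbatim copy of the article's equations. The summation limits, the overall normalization, and the exact form of the Monte Carlo correction term \(e v / T\) are assumptions.

\[
\sigma(\mathbf{x}) = \sqrt{\sum_{i=1}^{S} \max\left[\sigma_i^2(\mathbf{x}),\, \omega\right] + \tilde\sigma^2(\mathbf{x})}
\]

\[
\sigma_i^2(\mathbf{x}) = \mathrm{Cov}_j\left[n_{i,j},\, t_j(\mathbf{x})\right]^2 + \left[\overline{t}_{-i}(\mathbf{x}) - \overline{t}(\mathbf{x})\right]^2 - \frac{e\,v}{T}
\]

Under this reading, the subtracted correction term can make an individual \(\sigma_i^2(\mathbf{x})\) negative, which is consistent with the noise threshold *ω* being defined as the magnitude of the minimum variance over the training data.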

### Evaluation of Uncertainty Estimates

We evaluated these uncertainty estimates on the four data sets that will be explored as test cases in this paper. These data sets include a magnetocalorics data set [19], a superconductor data set compiled by the internal Citrine team, a thermoelectrics data set [20], and a steel fatigue strength data set [21], which will all be described in more detail in the “Results on Test Cases” section. Models were trained to predict the magnetic deformation, superconducting critical temperature, figure of merit ZT, and fatigue strength, respectively, on these four data sets. The models were evaluated via eightfold cross-validation over 16 trials and the validation error was compared to the combined uncertainty estimates.

The normalized residuals \(r_n\) are given by \(r_{n} = \frac {\hat {f}(\mathbf {x}) - f(\mathbf {x})} {\sigma (\mathbf {x})}\). In other words, \(r_n\) is the difference between the predicted and actual value, divided by the uncertainty estimate. If the uncertainty estimates were perfectly well-calibrated and the samples in the data set were independently distributed, then the normalized residuals would follow a Gaussian distribution with zero mean and unit standard deviation. As the histograms show, the distributions of the normalized residuals are roughly normal, albeit with heavier tails than a normal distribution. Figure 2 also shows the residuals normalized by the root mean square out-of-bag error, which is equivalent to removing the jackknife-based contributions to the uncertainty and using the simplest explicit bias model, i.e., a constant function. In this context, the out-of-bag error on a training example refers to the average error of predictions made using the subset of decision trees that were not trained on that particular training example. The root mean square out-of-bag error is analogous to the conventional cross-validation error, which provides a constant error estimate for all test points. The figure demonstrates that the root mean square out-of-bag error is not a well-calibrated uncertainty metric; it drastically over-estimates the error for a large fraction of the points in the thermoelectrics and superconductor test cases, as demonstrated by the large difference between the standard normal distribution and the residuals near 0 in Fig. 2.
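The calibration check described above can be sketched as follows. This is an illustrative sketch, not the evaluation code used in the study: the function names are hypothetical, and the comparison against Gaussian coverage fractions is one simple way to summarize how close the normalized residuals are to a standard normal distribution.

```python
import math
from statistics import mean, pstdev


def normalized_residuals(predicted, actual, sigma):
    """r_n = (prediction - actual value) / uncertainty estimate, per test point."""
    return [(p - a) / s for p, a, s in zip(predicted, actual, sigma)]


def gaussian_coverage(k):
    """Fraction of a standard normal distribution lying within +/- k."""
    return math.erf(k / math.sqrt(2.0))


def calibration_summary(residuals):
    """Compare the residuals with the zero-mean, unit-variance ideal."""
    n = len(residuals)
    return {
        "mean": mean(residuals),
        "std": pstdev(residuals),
        "within_1_sigma": sum(abs(r) <= 1.0 for r in residuals) / n,
        "within_2_sigma": sum(abs(r) <= 2.0 for r in residuals) / n,
    }


# Hypothetical predictions, measurements, and uncertainties (illustration only).
residuals = normalized_residuals(
    predicted=[1.0, 2.0, 3.0],
    actual=[0.5, 2.0, 4.0],
    sigma=[0.5, 1.0, 0.5],
)
summary = calibration_summary(residuals)
```

For well-calibrated estimates, `within_1_sigma` and `within_2_sigma` should approach `gaussian_coverage(1)` and `gaussian_coverage(2)`. Heavier tails show up as a shortfall in the two-sigma fraction, and over-estimated uncertainties as an excess in the one-sigma fraction.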

The heavy tails shown in Fig. 2a are not unexpected, since the current estimates cannot fully account for all sources of uncertainty, such as uncertainty due to contributions that cannot be explained with the given feature set, i.e., “unknown unknowns.” For example, if the target function is conductivity and different data points were acquired at different temperatures, but those temperatures were not measured and added to the training data, then the missing information can cause the uncertainty estimates to be unreliable. Such unknown unknowns are likely responsible for the few outliers seen in Fig. 2. Nevertheless, this examination of the uncertainty estimates shows that they give a reasonable representation of the random forest model uncertainty. This uncertainty estimation procedure is of broad utility for providing quantitative uncertainty bounds for data-driven random forest models and was used in the present study for the purpose of SL.

### FUELS Framework

The schematic in Fig. 1 outlines how the random forest and uncertainty estimates are applied to SL. In this study, it is assumed that the goal of SL is to determine the optimal material processing and composition from a list of candidate options using the fewest possible number of experiments. Optimality is based on maximizing (or minimizing) some material property, such as the critical temperature for superconductivity.

The first step in the SL framework is to evaluate the response function for an initial set of test candidates in order to fit a random forest model for the response function. In this study, this initial set of test candidates consisted of 10 randomly selected materials from the set of candidates. Future work will investigate the optimal size of this initial set, as well as explore sampling strategies other than random sampling for their selection.

Once a random forest model has been fit, it is evaluated for each of the unmeasured candidates. Three different strategies for selecting the next candidate were assessed: maximum expected improvement (MEI), maximum uncertainty (MU), and maximum likelihood of improvement (MLI). The MEI strategy simply selects the candidate with the highest (or lowest, for minimization) predicted target value. The MU strategy selects the candidate with the greatest uncertainty, entirely independently of its expected value. The MLI strategy selects the candidate that is the most likely to have a higher (or lower, for minimization) target value than the best previously measured material. This strategy uses the uncertainty estimates from Eq. 1 and assumes that the uncertainty for a given prediction obeys a Gaussian distribution. While MLI and MEI are both greedy optimization strategies, the MLI strategy typically favors evaluating candidates with high uncertainty, leading to more exploration of the search space.
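The three selection rules can be sketched for the maximization case as follows. This is a minimal sketch under the Gaussian assumption stated above; the function names and the candidate values are hypothetical, and each candidate is represented only by its predicted mean and uncertainty.

```python
import math


def likelihood_of_improvement(mu, sigma, best_so_far):
    """P(value > best_so_far) for a Gaussian prediction with mean mu and std sigma."""
    if sigma <= 0.0:
        return 1.0 if mu > best_so_far else 0.0
    z = (mu - best_so_far) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def select_next(candidates, best_so_far, strategy):
    """Pick the next candidate; candidates maps name -> (predicted mean, uncertainty)."""
    if strategy == "MEI":
        # Highest predicted target value.
        score = lambda name: candidates[name][0]
    elif strategy == "MU":
        # Greatest uncertainty, independent of the predicted value.
        score = lambda name: candidates[name][1]
    elif strategy == "MLI":
        # Most likely to beat the best previously measured value.
        score = lambda name: likelihood_of_improvement(*candidates[name], best_so_far)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return max(candidates, key=score)


# Hypothetical predictions for three unmeasured candidates.
candidates = {"A": (9.5, 0.2), "B": (9.0, 3.0), "C": (5.0, 5.0)}
best_measured = 10.0
choices = {s: select_next(candidates, best_measured, s) for s in ("MEI", "MU", "MLI")}
```

In this example the three strategies disagree: MEI picks `A` (highest prediction), MU picks `C` (largest uncertainty), and MLI picks `B`, whose combination of a fairly high prediction and a wide uncertainty gives it the best chance of exceeding the current best.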

## Results on Test Cases

The FUELS framework was evaluated for four different application cases from materials science: magnetocalorics, superconductors, thermoelectrics, and steel fatigue strength. In each of these test cases, a data set was already publicly available on Citrination with a list of potential candidate materials and their previously reported target values. The goal of the FUELS process was to identify the candidate with the maximal value of the response function, using measurements of the response from as few candidates as possible. It should be noted that because the test sets consist of candidates that have been previously measured, there is potential sample bias in these data sets: high-performance materials are more likely to have measurements available in public data sets. This sample bias means that there are fewer obviously bad candidates for the SL model to pass over, in effect making the problem more difficult. Future work will test this SL methodology on a case study for which the target values are not previously available.

In each test case, the FUELS methodology was run 30 times for each of the three strategies (MLI, MEI, and MU), in order to collect statistics on the number of measurements required to find the optimal candidate. The FUELS methodology was benchmarked against two other algorithms: random guessing and the COMBO Bayesian SL framework proposed by Ueno et al. [10]. In random guessing, the next candidate was selected randomly from the pool of candidates that had not been previously measured. As a result, the number of evaluations required follows a uniform distribution over the range of the data set size. In the COMBO strategy, a Gaussian process for the target variable is constructed and is queried to determine the next candidate to test. Unlike the FUELS approach, COMBO uses Bayesian methods to obtain uncertainty estimates by propagating uncertainty in model parameters through to the model predictions. COMBO uses state-of-the-art algorithms for scalability to large data sets and is a challenging benchmark strategy against which to compare the performance of FUELS.
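The random-guessing baseline has a closed-form expectation that is easy to check. If the position of the optimal candidate in a random ordering of *N* candidates is uniform on 1 to *N*, the expected number of evaluations is (*N* + 1)/2. The sketch below, with hypothetical function names, computes this value and confirms it by simulation.

```python
import random


def expected_random_steps(n_candidates):
    """Mean of a uniform distribution over 1..n_candidates."""
    return (n_candidates + 1) / 2


def simulate_random_search(n_candidates, n_runs=5000, seed=0):
    """Average number of random draws (without replacement) needed to hit the optimum."""
    rng = random.Random(seed)
    order = list(range(n_candidates))
    total = 0
    for _ in range(n_runs):
        rng.shuffle(order)
        # Candidate 0 stands in for the optimal candidate.
        total += order.index(0) + 1
    return total / n_runs


# Data set sizes of the four test cases in this paper.
sizes = {"magnetocaloric": 167, "superconductor": 546, "thermoelectric": 195, "steel": 437}
expected = {name: expected_random_steps(n) for name, n in sizes.items()}
```

For the four data set sizes this gives 84, 273.5, 98, and 219 evaluations, which agrees with the values reported for random guessing in the results table.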

### Magnetocalorics Test Case

#### Problem Description

A magnetocaloric material exhibits a decrease in entropy when a magnetic field is applied at temperatures near its Curie temperature. This property can be exploited for magnetic refrigeration, with larger entropy changes enabling more efficient cooling. Bocarsly et al. [19] showed that the entropy change of a material is strongly correlated with its magnetic deformation, a property that can be calculated via density functional theory (DFT). They presented a reference data set of 167 candidates for magnetocaloric behavior for which the magnetic deformation had already been calculated.

In this test case, the FUELS framework was used to identify the candidate with the highest value of magnetic deformation. If the FUELS process can efficiently identify candidates with large values of magnetic deformation, then it could be used to more efficiently determine which DFT calculations to perform. These DFT calculations could then, in turn, be used to identify the most promising candidates for experimental testing.

The free parameter in this test case is the material formula. Because the material formula is not in itself a continuously-varying real-valued variable, it was parameterized in terms of 54 real-valued features that could be calculated directly from the formula [22]. These features included quantities such as the mean electron affinity of the atoms in the compound, the orbital filling characteristics, and the mean ionization energy. These 54 features composed the inputs to the FUELS algorithm, and the target was the magnetic deformation.
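To illustrate how a chemical formula can be turned into a real-valued feature, the sketch below computes a composition-weighted mean of an elemental property. It is not the featurization of Ward et al. [22]: the parser handles only simple formulas without parentheses, the function names are hypothetical, and the small lookup table holds approximate first ionization energies for illustration only.

```python
import re


def parse_formula(formula):
    """Parse a simple formula such as 'Na2O' into {element: amount}."""
    amounts = {}
    for element, count in re.findall(r"([A-Z][a-z]?)(\d*\.?\d*)", formula):
        amounts[element] = amounts.get(element, 0.0) + (float(count) if count else 1.0)
    return amounts


def composition_weighted_mean(formula, element_property):
    """Average an elemental property over the atoms in the formula."""
    amounts = parse_formula(formula)
    total = sum(amounts.values())
    return sum(n * element_property[el] for el, n in amounts.items()) / total


# Approximate first ionization energies in eV (illustrative values, not from the paper).
FIRST_IONIZATION_EV = {"Na": 5.139, "Cl": 12.968, "O": 13.618}

mean_ie_nacl = composition_weighted_mean("NaCl", FIRST_IONIZATION_EV)
mean_ie_na2o = composition_weighted_mean("Na2O", FIRST_IONIZATION_EV)
```

Repeating this kind of calculation for many elemental properties and summary statistics yields a fixed-length feature vector for any formula, which is what allows compounds with different numbers of constituent elements to share one input space.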

#### Results

Sample mean and uncertainty in the sample mean at one standard deviation, for the number of steps required for different SL strategies to find the optimal candidate

Test case | Data size | # inputs | FUELS MLI | FUELS MEI | FUELS MU | COMBO | Random |
---|---|---|---|---|---|---|---|
Magnetocaloric | 167 | 54 | 47 ± 3 | 51 ± 4 | 61 ± 6 | 57 ± 6 | 84 |
Superconductor | 546 | 54 | 73 ± 9 | 98 ± 12 | 52 ± 5 | 80 ± 9 | 273 |
Thermoelectric | 195 | 56 | 32 ± 3 | 37 ± 3 | 29 ± 3 | 38 ± 4 | 98 |
Steel fatigue | 437 | 22 | 24 ± 2 | 28 ± 2 | 86 ± 10 | 27 ± 2 | 219 |

### Superconductor Test Case

#### Problem Description

There is significant interest in developing superconductors with higher critical temperatures. For this test case, the data set consisted of 546 material candidates whose critical temperatures have been compiled into a publicly accessible Citrination database [24]. The highest critical temperature among these materials was that of Hg-1223 (\(\mathrm{HgBa_2Ca_2Cu_3O_8}\)) at 134 K. The goal of the SL process was to find this optimal candidate using the fewest measurements possible. The inputs were the same 54 real-valued features derived from the chemical formula as were used in the magnetocaloric test case.

#### Results

### Thermoelectric Test Case

#### Problem Description

In this test case, the data set consisted of 195 materials for which the thermoelectric figures of merit, ZT, as measured at 300 K, have been compiled into an online Citrination database [20]. The inputs to the machine learning algorithm included not only the 54 features calculated from the material formula, but also the semiconductor type (p or n) as well as the crystallinity (*e.g.*, polycrystalline or single crystal) of the material. The goal of the optimization was to find the candidate with the highest value of ZT using the fewest evaluations.

#### Results

### Steel Fatigue Strength Test Case

#### Problem Description

This test case combined both material composition and process optimization. The goal was to find the composition and processing that led to the highest fatigue strength in steel. The data set was based on that of Agrawal et al. [21], which included 437 different combinations of steel composition and processing. The features included the fractional composition of nine different elements (C, Si, Mn, P, S, Ni, Cr, Cu, Mo) as well as 13 processing steps (including tempering temperature, carburization time, and normalization temperature). Agrawal et al. [21] showed that given these inputs, it was possible to fit a data-driven model that could accurately predict the steel fatigue strength when evaluated via cross-validation. The goal of this test case was to find the combination of the 22 input parameters that led to the candidate with the highest fatigue strength.

#### Results

Figure 4 shows that COMBO, FUELS MLI, and FUELS MEI all performed very well on this test case, finding the optimal set of process and composition parameters in fewer than 15% of the evaluations required by random guessing. Interestingly, FUELS MU did not perform well in this case. Since FUELS MU is driven by testing those candidates with high uncertainty, it performs well in cases where the optimal candidate is significantly different in some respect from the rest of the candidates. MLI and MEI, on the other hand, will fare better when the random forest is able to build an accurate model for the target quantity with the limited data from previously measured candidates. Since Agrawal et al. [21] have already shown that it is possible to use these input features to build an accurate model for the steel fatigue strength, these greedy strategies were able to find the optimal candidate very efficiently in this test case.

## Conclusion

A sequential learning methodology based on random forests with uncertainty estimates has been proposed. The uncertainty was calculated using bias-corrected infinitesimal jackknife and jackknife-after-bootstrap estimates, and was shown to be well-calibrated. This result is significant in itself, since well-calibrated uncertainty estimates are critical for data-driven models in materials science and other engineering applications. These results represent some of the first evaluations of random forest uncertainty bounds for scientific applications. An implementation of random forests with these uncertainty bounds has been made available through the open-source Lolo package [18].

The FUELS process has applicability to a wide range of engineering applications with large numbers of free parameters. In this paper, we explored its effectiveness on four test cases from materials science: maximizing the magnetic deformation of magnetocaloric materials, maximizing the critical temperature of superconductors, maximizing the ZT of thermoelectrics, and maximizing fatigue strength in steel. In all of these test cases, the experimental designs were parameterized using between twenty and sixty different features, leveraging the good scaling of FUELS to high-dimensional spaces. In all four cases, FUELS significantly out-performed random guessing. While random guessing might seem like a naive benchmark, it should be noted that the data sets in these initial test cases all comprise materials candidates that were thought promising enough to measure. Future work will evaluate the impact of SL on a real application for which the optimal candidate is not known *a priori*.

t-SNE projection was used to enable visualization of the FUELS candidate selections. Three different FUELS strategies were compared: MLI, MEI, and MU. In these initial test cases, MLI consistently had the highest performance. MEI struggled in cases where more exploration of the parameter space was important, and MU performed poorly when the random forest model could make accurate predictions after being fit to only a few training points. The FUELS approach also compared favorably to the Bayesian optimization COMBO approach, matching its performance in finding the optimal candidate on all four test cases. While the COMBO algorithm was designed for scalability to large data sets, it was less computationally efficient than FUELS for these relatively small, high-dimensional data sets. While rigorous comparisons of computational efficiency were beyond the scope of this study, in our runs on the steel fatigue strength test case, FUELS was an order of magnitude faster than COMBO per iteration on average in determining the next candidate to test. Because the Citrination platform provides publicly accessible, cloud-hosted machine learning capabilities, the computational efficiency of the experimental design process is important.

The consistent success of the FUELS strategies in out-performing random guessing underlines the importance and potential impact of optimal experimental design in materials optimization. With experimental efforts representing a bottleneck in the optimization process, it is critical that they be performed in the most efficient manner possible. It is worth noting that the FUELS methodology is equally applicable to both material composition optimization and process optimization. SL provides a framework for minimizing the number of experiments required to identify high-performance materials and optimal processes. It is not our suggestion that SL be used to replace scientific or engineering domain knowledge. Rather, the SL suggestions can supplement this domain knowledge, providing a quantitative framework that leverages data as it is collected to inform future experiments.

## Notes

### Acknowledgements

The authors would like to thank S. Wager and T. Covert for their discussions regarding random forest uncertainty estimates. The authors would also like to thank the rest of the Citrine Informatics team. S. Paradiso and M. Hutchinson acknowledge support from Argonne National Laboratories through contract 6F-31341, associated with the R2R Manufacturing Consortium funded by the Department of Energy Advanced Manufacturing Office.

## References

- 1. Roy R (2010) A primer on the Taguchi method. Soc Manuf Eng, 1–245
- 2. Fisher RA (1921) On the probable error of a coefficient of correlation deduced from a small sample. Metron 1:3–32
- 3. Chaloner K, Verdinelli I (1995) Bayesian experimental design: a review. Stat Sci 10(3):273–304
- 4. Chernoff H (1959) Sequential design of experiments. Ann Math Stat 30(3):755–770
- 5. Cohn DA, Ghahramani Z, Jordan MI (1996) Active learning with statistical models. J Artif Intell Res 4(1):129–145
- 6. Martinez-Cantin R (2014) BayesOpt: a Bayesian optimization library for nonlinear optimization, experimental design and bandits. J Mach Learn Res 15(1):3735–3739
- 7. Shan S, Wang GG (2010) Survey of modeling and optimization strategies to solve high-dimensional design problems with computationally-expensive black-box functions. Struct Multidiscip Optim 41(2):219–241. doi: 10.1007/s00158-009-0420-2
- 8. Wang Y, Reyes KG, Brown KA, Mirkin CA, Powell WB (2015) Nested-batch-mode learning and stochastic optimization with an application to sequential multistage testing in materials science. SIAM J Sci Comput 37(3):B361–B381. doi: 10.1137/140971117. http://epubs.siam.org/doi/10.1137/140971117
- 9. Aggarwal R, Demkowicz M, Marzouk YM (2015) Information-driven experimental design in materials science. Inf Sci Mater Discov Des 225:13–44. doi: 10.1007/978-3-319-23871-5
- 10. Ueno T, Rhone TD, Hou Z, Mizoguchi T, Tsuda K (2016) COMBO: an efficient Bayesian optimization library for materials science. Mater Discov 4:18–21
- 11. Xue D, Xue D, Yuan R, Zhou Y, Balachandran P, Ding X, Sun J, Lookman T (2017) An informatics approach to transformation temperatures of NiTi-based shape memory alloys. Acta Mater 125:532–541
- 12. Dehghannasiri R, Xue D, Balachandran PV, Yousefi MR, Dalton LA, Lookman T, Dougherty ER (2017) Optimal experimental design for materials discovery. Comput Mater Sci 129:311–322. doi: 10.1016/j.commatsci.2016.11.041
- 13. Oliynyk A, Antono E, Sparks T, Ghadbeigi L, Gaultois M, Meredig B, Mar A (2016) High-throughput machine-learning-driven synthesis of full-Heusler compounds. Chem Mater 28(20):7324–7331
- 14. Breiman L (2001) Random forests. Mach Learn 45(1):5–32
- 15. Ho TK (1998) The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Mach Intell 20(8):832–844
- 16. Efron B (2012) Model selection estimation and bootstrap smoothing. Division of Biostatistics, Stanford University
- 17. Wager S, Hastie T, Efron B (2014) Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 15:1625–1651. http://jmlr.org/papers/v15/wager14a.html, arXiv:1311.4555v2
- 18. Hutchinson M (2016) Citrine Informatics Lolo. https://github.com/CitrineInformatics/lolo. Accessed 2017-03-21
- 19. Bocarsly JD, Levin EE, Garcia CA, Schwennicke K, Wilson SD, Seshadri R (2017) A simple computational proxy for screening magnetocaloric compounds. Chem Mater 29(4):1613–1622
- 20. Sparks T, Gaultois M, Oliynyk A, Brgoch J, Meredig B (2016) Data mining our way to the next generation of thermoelectrics. Scr Mater 111:10–15
- 21. Agrawal A, Deshpande PD, Cecen A, Basavarsu GP, Choudhary AN, Kalidindi SR (2014) Exploration of data science techniques to predict fatigue strength of steel from composition and processing parameters. Integr Mater Manuf Innov 3(1):1–19
- 22. Ward L, Agrawal A, Choudhary A, Wolverton C (2016) A general-purpose machine learning framework for predicting properties of inorganic materials. arXiv preprint
- 23. van der Maaten L, Hinton G (2008) Visualizing data using t-SNE. J Mach Learn Res 9:2579–2605
- 24. O’Mara J, Meredig B, Michel K (2016) Materials data infrastructure: a case study of the citrination platform to examine data import, storage, and access. JOM 68(8):2031–2034