Introduction

Statistical learning methods are gaining popularity in materials science, where their application is rapidly becoming known as “materials data science.” With new data infrastructure platforms such as Citrination [1] and the Materials Data Curation System [2], machine learning (ML) methods are entering the mainstream of materials science. Materials data science and informatics is an emergent field aligned with the goals of the Materials Genome Initiative to reduce the cost and time of materials design, development, and deployment. Building and interpreting machine learning models are indispensable parts of the process of curating materials knowledge. ML methods have been used to predict target properties such as material failure [3, 4], twinning deformation [5], and phase diagrams [6], and to guide experiments and calculations in composition space [7, 8]. Machine learning models learn from “features,” or variables, that describe the problem. An important aspect of the machine learning process is therefore to determine which variables most enable data-driven insights about the problem.

Dimensionality reduction techniques (such as principal component analysis (PCA) [9], kernel PCA [10], autoencoders [11], and feature compression based on information gain [12]) have become popular for producing compact feature representations [13]. They are applied to the feature set to obtain a better representation in a lower-dimensional space, resulting in a smaller dataset and faster model construction [14]. Dimensionality reduction has been used by materials scientists to establish process-structure-property relationships and for exploratory data analysis to understand trends in a multivariate space [15]. For example, ranking-based feature selection methods such as information gain and Pearson correlation have been used during construction of predictive models for the fatigue strength of steel [16]. Kalidindi et al. [17] have used two-point correlations and PCA to describe microstructure-property relationships between local neighborhoods and the localizations in microstructural response. Dey et al. [18] used PCA to analyze the features that cause outliers when predicting bandgaps for new chalcopyrite compounds. Broderick et al. [19] demonstrated how a compact representation (via PCA) makes it easy to visually track the different chemical processing pathways for interpenetrating polymer networks (IPNs) arising from changing composition versus changing polymerization. However, dimensionality reduction techniques change the original representation of the features and hence offer limited interpretability [13]. An alternative route to better models is feature selection: the process of selecting a subset of the original variables such that a model built on only these features performs best. Feature selection avoids overfitting and improves model performance by discarding redundant features, and it has the added advantage of keeping the original feature representation, thus offering better interpretability [13].

Feature selection methods have been used extensively in bioinformatics [20], psychiatry [21], and cheminformatics [22]. Feature selection methods are broadly categorized into filter, wrapper, and embedded methods based on how they interact with the predictor during the selection process. Filter methods rank the variables in a preprocessing step, so feature selection is done before a model is chosen. In the wrapper approach, nested subsets of variables are tested during the learning process to select the optimal subset that works best for the model. Embedded methods incorporate variable selection into the training algorithm itself.

We have used random forest models to study stress hotspot classification in FCC [3] and HCP [4] materials. In this paper, we review feature selection techniques applied to the stress hotspot prediction problem in hexagonal close-packed materials and compare them with respect to future data prediction. We focus on two commonly used techniques from each category: (1) filter methods, namely correlation-based feature selection (CFS) [23] and Pearson correlation [24]; (2) wrapper methods, namely FeaLect [25] and recursive feature elimination (RFE) [13]; and (3) embedded methods, namely random forest permutation accuracy importance (RF-PAI) [26] and the least absolute shrinkage and selection operator (LASSO) [27]. The main contribution of this article is to raise awareness in the materials data science community about how different feature selection techniques can lead to misguided model interpretations and how to avoid them. We point out some of the inadequacies of popular feature selection methods and, finally, extract data-driven insights with a better understanding of the methods used.

Methods

An applied stress is distributed heterogeneously among the grains in a microstructure [28]. Under an applied deformation, some grains are prone to accumulating stress due to their orientation, geometry, and placement with respect to the neighboring grains. These regions of high stress, so-called stress hotspots, are related to void nucleation during ductile fracture [29]. Stress hotspot formation has been studied in face-centered cubic (FCC) [3] and hexagonal close-packed (HCP) [4] materials using a machine learning approach. A set of microstructural descriptors was designed to be used as features in a random forest model for predicting stress hotspots. To achieve data-driven insights into the problem, it is essential to rank these microstructural descriptors (features). In this paper, we review different feature selection techniques applied to the stress hotspot classification problem in HCP materials, which have a complex plasticity landscape due to anisotropic slip system activity.

Let \((x_i, y_i)\), for \(i = 1, \ldots, N\), be N independent and identically distributed (i.i.d.) observations, where \(x_i \in \mathbb{R}^p\) is a p-dimensional vector of grain features and the response variable \(y_i \in \{0, 1\}\) denotes whether the grain is a stress hotspot. The input matrix is denoted by \(X = (x_1, \ldots, x_N) \in \mathbb{R}^{N \times p}\), and \(y \in \{0, 1\}^N\) is the binary outcome. We use lowercase letters to refer to the samples \(x_1, \ldots, x_N\) and uppercase letters to refer to the features \(X_1, \ldots, X_p\) of the input matrix X. Feature importance refers to the metrics that the various feature selection methods use to rank features, such as feature weights in linear models or variable importance in random forest models.

Dataset Studied

The machine learning input dataset of synthetic 3D equiaxed microstructures with different HCP textures was generated using Dream.3D in [4]. Uniaxial tensile deformation of these microstructures was simulated using EVPFFT [30] with constitutive parameters representing a titanium-like HCP material with an anisotropic critical resolved shear stress (CRSS) ratio [4]. The EVPFFT simulation was carried out in 200 strain steps of 0.01% along the sample Z direction, up to a total strain of 2%. The crystal plasticity simulations produce spatially resolved micromechanical stress and strain fields. These fields were averaged to obtain a dataset containing grain-wise values of the equivalent Von Mises stress, along with the corresponding Euler angles and grain connectivity parameters. The steps to reproduce this dataset are discussed in detail in [31].

The grains having stress greater than the 90th percentile of the stress distribution within each microstructure are designated as stress hotspots, giving a binary target. Thirty-four variables were developed to be used as machine learning features. These features (X) describe the grain texture and geometry and are summarized in Table 1. We note that these features are not a complete set, and there are long-range effects that also cause stress hotspots. We use these first-order microstructural descriptors to build stress hotspot prediction models, understanding that the models could be improved by adding the missing features.
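As a concrete illustration, the hotspot labeling step can be sketched as follows, assuming the grain-wise data sits in a pandas DataFrame with a per-grain stress column and a microstructure identifier (the column names here are illustrative, not necessarily those used in the actual dataset [31]):

import pandas as pd

def label_hotspots(grains: pd.DataFrame,
                   stress_col: str = "EqVonMisesStress",
                   micro_col: str = "MicrostructureID") -> pd.Series:
    """Mark grains above the 90th percentile of stress *within each
    microstructure* as hotspots (1), all other grains as 0."""
    # Per-microstructure 90th-percentile threshold, broadcast back to each grain.
    threshold = grains.groupby(micro_col)[stress_col].transform(
        lambda s: s.quantile(0.90))
    return (grains[stress_col] > threshold).astype(int)

# Usage: grains["hotspot"] = label_hotspots(grains)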

Table 1 Feature name descriptions

The microstructures in this dataset represent eight different kinds of textures, and we validate the machine learning models by leave-one-texture-out validation. This divides the dataset into ∼ 85% training and ∼ 15% validation. Note that since only 10% of the grains are stress hotspots, this is an imbalanced classification problem. Hence, model performance is measured by the AUC (area under the ROC curve), a metric for binary classification that is insensitive to class imbalance. An AUC of 100% denotes perfect classification and 50% denotes performance no better than random guessing [32].
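A minimal sketch of this leave-one-texture-out evaluation, assuming a NumPy feature matrix X, binary hotspot labels y, and an array texture_ids assigning each grain to one of the eight textures (the function name and random forest settings are illustrative):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_texture_out_auc(X, y, texture_ids, **rf_kwargs):
    """Average validation AUC over splits that each hold out one texture."""
    aucs = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=texture_ids):
        clf = RandomForestClassifier(**rf_kwargs).fit(X[train_idx], y[train_idx])
        # Use predicted hotspot probabilities so the AUC is well defined.
        scores = clf.predict_proba(X[test_idx])[:, 1]
        aucs.append(roc_auc_score(y[test_idx], scores))
    return np.mean(aucs)

# e.g. leave_one_texture_out_auc(X, y, texture_ids, n_estimators=200, random_state=0)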

We first build a decision tree-based random forest model [26] for stress hotspot classification using all thirty-four variables. We then rank and select variables using different feature selection techniques. The selected variables are used to build new random forest models, and we observe the improvement in model performance. The feature rankings are then used to gain insights into the physics behind stress hotspot formation.

Feature Selection Methods

Filter Methods

Filter methods are based on preprocessing the dataset to extract the features X1,..., Xp that most impact the target Y. Some of these methods are as follows:

Pearson Correlation [24]

This method provides a straightforward way for filtering features according to their correlation coefficient. The Pearson correlation coefficient between a feature Xi and the target Y is:

$$\rho_{i} = \frac{\mathrm{cov}(X_{i}, Y)}{\sigma_{X_{i}}\,\sigma_{Y}}$$

where cov(Xi, Y) is the covariance and σ denotes the standard deviation [24]. The coefficient ranges from −1 (perfect negative correlation) to 1 (perfect positive correlation) and can be used for binary classification and regression problems. It is quick to compute, and features are ranked in decreasing order of the absolute value of their correlation coefficient with the target.
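A minimal sketch of this ranking, assuming the features are held in a pandas DataFrame and the target is the binary hotspot label (with a 0/1 target, the Pearson coefficient reduces to the point-biserial correlation):

import numpy as np
import pandas as pd

def pearson_ranking(X: pd.DataFrame, y: pd.Series) -> pd.Series:
    """Rank features by the absolute Pearson correlation with the target."""
    corr = X.apply(lambda col: np.corrcoef(col, y)[0, 1])
    return corr.abs().sort_values(ascending=False)

# Usage: ranking = pearson_ranking(features_df, hotspot_labels)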

Correlation-Based Feature Selection (CFS) [23]

CFS was developed to select a subset of features with high correlation to the target and low intercorrelation among themselves, thus reducing redundancy and selecting a diverse feature set. CFS assigns a heuristic merit to a feature subset rather than to individual features. It uses the symmetrical uncertainty correlation coefficient, given by:

$$r(X,Y) = 2.0 \times \frac{IG(X|Y)}{H(X)+H(Y)} $$

where IG(X|Y) is the information gain of feature X for the class attribute Y, and H(X) is the entropy of variable X. The following merit metric is used to rank each subset S containing k features:

$$Merit_{S} = \frac{k\overline{r_{cf}}}{\sqrt{k + k(k-1)\overline{r_{ff}}}}$$

where \(\overline {r_{cf}}\) is the mean symmetrical uncertainty correlation between the features (f ∈ S) and the target, and \(\overline {r_{ff}}\) is the average feature-feature intercorrelation. Because evaluating all possible feature subsets is computationally expensive, CFS is usually combined with search strategies such as forward selection, backward elimination, and bidirectional search. In this work, we have used the scikit-learn implementation of CFS [33], which uses symmetrical uncertainty [23] as the correlation metric and explores the subset space using best-first search [34], stopping when it encounters five consecutive fully expanded non-improving subsets.
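The merit computation itself can be sketched as follows, assuming the features have already been discretized (e.g., into quantile bins); the best-first subset search is omitted and the helper names are illustrative:

import numpy as np
from itertools import combinations
from sklearn.metrics import mutual_info_score

def entropy(x):
    """Shannon entropy (in nats) of a discrete-valued array."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def symmetrical_uncertainty(x, y):
    """2 * IG(x; y) / (H(x) + H(y)) for discretized variables."""
    ig = mutual_info_score(x, y)  # information gain = mutual information
    hx, hy = entropy(x), entropy(y)
    return 2.0 * ig / (hx + hy) if (hx + hy) > 0 else 0.0

def cfs_merit(X_disc, y, subset):
    """CFS merit of a feature subset (column indices into X_disc)."""
    k = len(subset)
    r_cf = np.mean([symmetrical_uncertainty(X_disc[:, j], y) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([symmetrical_uncertainty(X_disc[:, i], X_disc[:, j])
                    for i, j in combinations(subset, 2)])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

# Continuous features would first be discretized, e.g. into quartile bins:
# X_disc = np.column_stack([np.digitize(col, np.quantile(col, [0.25, 0.5, 0.75]))
#                           for col in X.T])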

Embedded Methods

These methods are popular because they perform feature selection while constructing the classifier, removing the need for a separate preprocessing feature selection step. Some popular algorithms are support vector machines (SVM) with RFE [35], random forests (RF) [26], and LASSO [27]. We compare LASSO and RF feature selection on the stress hotspot dataset.

Least Absolute Shrinkage and Selection Operator (LASSO) [27]

LASSO is linear regression with L1 regularization [27]. A linear model \(\mathcal {L}\) is constructed

$$\mathcal{L}: \min_{w \in \mathbb{R}^{p}} \frac{1}{2N}\sum\limits_{i = 1}^{N} \left\|y_{i} - w^{T} x_{i}\right\|_{2}^{2} + \lambda\|w\|_{1}$$

on the training data (x_i, y_i), i = 1,..., N, where w is a p-dimensional vector of weights, one per feature. The L1 regularization term (λ||w||_1) aids feature selection by pushing the weights of redundant, correlated features to zero, thus preventing overfitting and improving model performance. The model can be interpreted by ranking the features according to their LASSO weights. However, it has been shown that for a given regularization strength λ, inconsistent subsets can be selected if the features are redundant [36]. Nonetheless, LASSO has been shown to provide good prediction accuracy by reducing model variance without substantially increasing the bias, while providing better model interpretability. We used the scikit-learn implementation to compute our results [37].
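A minimal sketch of LASSO-based feature selection with scikit-learn on min-max scaled inputs; sklearn's Lasso minimizes the same objective as above, with alpha playing the role of λ, although the value used here is only illustrative:

import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import MinMaxScaler

def lasso_selected_features(X, y, feature_names, alpha=0.3):
    """Fit L1-regularized linear regression on [0, 1]-scaled features and
    return the surviving (nonzero-weight) features ranked by |weight|."""
    X_scaled = MinMaxScaler().fit_transform(X)
    model = Lasso(alpha=alpha).fit(X_scaled, y)
    order = np.argsort(-np.abs(model.coef_))
    return [(feature_names[i], model.coef_[i]) for i in order
            if model.coef_[i] != 0.0]

# Usage: lasso_selected_features(X, y, list(features_df.columns), alpha=0.3)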

Random Forest Permutation Accuracy Importance (RF-PAI) [26]

A random forest is a nonlinear multivariate model built on an ensemble of decision trees, and it comes with a built-in feature importance measure [26]. To compute it, the values of a feature Xj are randomly permuted across the samples while all other features are kept unchanged, breaking the association between Xj and the response. When the permuted variable Xj, together with the remaining unchanged variables, is used to predict the response, the number of observations classified correctly decreases substantially if the original variable Xj was associated with the response. Thus, a reasonable measure of feature importance is the difference in prediction accuracy before and after permuting Xj. The feature importance calculated this way is known as the permutation accuracy importance (PAI) and was computed using the scikit-learn package in Python [37].
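A sketch of how a permutation-style importance can be computed with the scikit-learn utility sklearn.inspection.permutation_importance (available in recent versions, which may differ from the implementation used for the results reported here); the estimator settings are illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def rf_permutation_importance(X_train, y_train, X_val, y_val, feature_names):
    """Permutation accuracy importance: drop in validation accuracy when a
    single feature's values are randomly shuffled."""
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    clf.fit(X_train, y_train)
    result = permutation_importance(clf, X_val, y_val, scoring="accuracy",
                                    n_repeats=10, random_state=0)
    order = result.importances_mean.argsort()[::-1]
    return [(feature_names[i], result.importances_mean[i]) for i in order]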

Wrapper Methods

Wrapper methods test feature subsets using a model hypothesis and can therefore detect feature dependencies, i.e., features that become important in the presence of one another. Because exhaustive subset evaluation is computationally expensive, wrapper methods typically rely on greedy search strategies (forward selection and backward elimination [38]), which are fast and avoid overfitting, to find the best nested subset of features.

FeaLect Algorithm [25]

The number of features selected by LASSO depends on the regularization parameter λ, and in the presence of highly correlated features, LASSO arbitrarily selects one feature from each group of correlated features [39]. The set of solutions across all LASSO regularization strengths is given by the regularization path, which can be recovered efficiently using the least angle regression (LARS) algorithm [40]. It has been shown that LASSO selects the relevant variables with probability one and all others with a positive probability [36]. Based on this property, the Bolasso feature selection algorithm, an improvement on LASSO, was developed in 2008 [36]. In this method, the dataset is bootstrapped, and a LASSO model with a fixed regularization strength λ is fit to each bootstrap sample. Finally, the intersection of the features selected by LASSO in each sample is taken to obtain a consistent feature subset.
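FeaLect itself is distributed as an R package [41]; the Python sketch below only illustrates the Bolasso bootstrap-and-intersect idea described above, with illustrative parameter values:

import numpy as np
from sklearn.linear_model import Lasso

def bolasso_selected(X, y, alpha=0.3, n_bootstraps=100, random_state=0):
    """Bolasso-style selection: fit LASSO on bootstrap resamples with a fixed
    regularization strength and keep only the features whose weight is
    nonzero in every resample (the intersection of the selected sets)."""
    rng = np.random.default_rng(random_state)
    n, p = X.shape
    keep = np.ones(p, dtype=bool)
    for _ in range(n_bootstraps):
        idx = rng.integers(0, n, size=n)  # bootstrap sample with replacement
        coef = Lasso(alpha=alpha).fit(X[idx], y[idx]).coef_
        keep &= coef != 0.0
    return np.flatnonzero(keep)  # indices of consistently selected features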

In 2013, the FeaLect algorithm, an improvement over Bolasso, was developed based on a combinatorial analysis of the regression coefficients estimated using LARS [25]. FeaLect considers the full regularization path and computes feature importance using a combinatorial scoring scheme, as opposed to Bolasso's simple intersection. The FeaLect scoring scheme measures the quality of each feature in each bootstrapped sample and averages these scores to select the most relevant features, providing a robust feature selection method. We used the R implementation of FeaLect to compute our results [41].

Recursive Feature Elimination (RFE) [35]

A number of common ML techniques (such as linear regression, SVM, decision trees, Naive Bayes, and the perceptron) provide feature weights that capture multivariate, interacting effects between features [13]. To interpret the relative importance of variables from these model feature weights, RFE was introduced in the context of SVMs [35] for extracting compact gene subsets from DNA microarray data.

To find the best feature subset, instead of performing an exhaustive search over all feature combinations, RFE takes a greedy approach, which has been shown to reduce the effect of correlation bias on variable importance measures [42]. RFE performs backward elimination: given a model (SVM, random forest, linear regression, etc.), it discards the worst feature (by absolute classifier weight or feature ranking) and repeats the process over increasingly smaller feature subsets until the best model hypothesis is achieved. The weights of this optimal model are used to rank the features. Although this ranking might not be the optimal ranking for individual features, it is often used as a variable importance measure [42]. We used the scikit-learn implementation of RFE with a random forest classifier to obtain a feature ranking for our dataset.
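A minimal sketch of RFE with a random forest base model using scikit-learn; the number of retained features and the forest settings are illustrative:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

def rfe_ranking(X, y, feature_names, n_features_to_select=10):
    """Recursive feature elimination with a random forest: repeatedly drop
    the lowest-importance feature until the requested subset size remains."""
    estimator = RandomForestClassifier(n_estimators=200, random_state=0)
    selector = RFE(estimator, n_features_to_select=n_features_to_select, step=1)
    selector.fit(X, y)
    # ranking_ is 1 for retained features; larger values were eliminated earlier.
    return sorted(zip(selector.ranking_, feature_names))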

Results and Discussion

Table 2 shows the feature importances calculated using the filter-based methods Pearson correlation and CFS; the embedded methods RF, linear regression, ridge regression (L2 regularization), and LASSO regression; and finally the wrapper methods RFE and FeaLect. The values in bold font denote the features that were finally selected to build RF models, and the corresponding model performances are noted. The input data was min-max scaled to [0, 1]. Figure 1 shows the correlation matrix for the features and the target.

Table 2 Variable importance measures using different methods for HCP materials with unequal CRSS
Fig. 1 Pearson correlation matrix between the target (EqVonMisesStress) and all the features

Pearson correlation can be used for feature selection, resulting in a good model. However, this measure implicitly assumes orthogonality between variables, and the coefficient does not take mutual information between features into account. Additionally, this method only detects linear correlations, which may not capture many physical phenomena.

The feature subset selected by CFS contains features with higher class correlation and lower redundancy, which translates into a good predictive model. However, although we know grain geometry and neighborhood are important to hotspot formation, CFS does not select any geometry-based features, and it fails to provide an individual feature ranking.

Linear regression, ridge regression, and LASSO are closely related linear models. A simple linear model results in very large weights for some features (NumCells, FeatureVolumes), likely due to overfitting, and hence is unsuitable for deducing variable importance. Ridge regression compensates for this problem by using L2 regularization, but the weights are distributed among the redundant features, which might lead to incorrect conclusions. LASSO regression overcomes this problem by pushing the weights of correlated features to zero, resulting in a good feature subset. The top five features ranked by LASSO with a regularization strength of λ = 0.3 are sin𝜃, AvgMisorientations, cosϕ, sinϕ, and Schmid_1. The first geometry-based feature ranks 10th on the list, which seems to underestimate the physical importance of such features. A drawback of deriving insights from LASSO-selected features is that LASSO arbitrarily selects a few representatives from groups of correlated features, and the number of features selected depends heavily on the regularization strength. The models are therefore unstable: changes in the training subset can result in different selected features. Hence, these methods are not ideal for deriving physical insights from the model.

Random forest models also provide an embedded feature ranking. The RF-PAI importance focuses mostly on the HCP c-axis orientation-derived features (cosϕ, sin𝜃), the average misorientation, and the prismatic < a > Schmid factor, while discounting most of the geometry-derived features. RF-PAI suffers from correlation bias due to the preferential selection of correlated features during the tree building process [43]. As the number of correlated variables increases, the feature importance score of each variable decreases. Often, less relevant variables replace the predictive ones (due to correlation) and thus receive undeserved, boosted importance [44]. Random forest variable importance can also be biased when the features vary in their scale of measurement or number of categories, because the underlying Gini gain splitting criterion is a biased estimator and can be affected by multiple testing effects [45]. From Fig. 1, we found that all the geometry-based features are highly correlated with each other; therefore, deducing physical insights from this ranking is unreliable.

Hence, we move to wrapper-based methods for feature importance. RFE has been shown to reduce the effect of correlation on the importance measure [42]. RFE with an underlying random forest model selects a feature subset containing two geometry-based features (GBEuc and EquivalentDiameter); however, it fails to give an individual ranking among the features.

FeaLect provides a robust feature selection method by compensating for the instability of LASSO, namely its arbitrary selection among correlated variables and the sensitivity of the number of selected variables to the regularization strength. Table 2 lists the FeaLect-selected variables in decreasing order of importance. We find that the top two features are derived from the grain crystallography, with geometry-derived features coming next. This suggests that both texture- and geometry-based features are important. Linear regression-based methods such as these tell us which features are important by themselves, as opposed to RF-PAI, which indicates features that become important through interactions with one another (via RF models) [13]. The FeaLect method provides the best estimate of the feature importance ranking, which can then be used to extract physical insights. This method also divides the features into three classes: informative, irrelevant (causing model overfitting), and redundant [25]. The most informative features are cosϕ, Schmid_1, EquivalentDiameter, GBEuc, Schmid_4, Neighborhoods, sin𝜃, and TJEuc. The irrelevant features, which cause model overfitting, are sinϕ and AvgMisorientations. The remaining features are redundant.

A number of the selected features directly or indirectly represent the HCP c-axis orientation, such as cosϕ, sin𝜃, and the basal Schmid factor (Schmid_1), which is proportional to cos𝜃. It is interesting that the pyramidal < c + a > Schmid factor (Schmid_4) is chosen as important. From Fig. 1, we can see that hot grains form where 𝜃 and ϕ maximize sin𝜃 and sinϕ, i.e., 𝜃 ∼ 90°, ϕ ∼ 90°. This means that the HCP c-axis of hot grains aligns with the sample Y-axis, so these grains have a low elastic modulus. Since the c-axis is perpendicular to the tensile axis (sample Z), the deformation along the tensile direction can be accommodated by prismatic slip in these grains, and if pyramidal slip is occurring, it means they have a very high stress [4]. This explains the high importance of the pyramidal < c + a > Schmid factor. From the Pearson correlation coefficients in Fig. 1, we can observe that stress hotspots form in grains with low basal and pyramidal < c + a > Schmid factors, a high prismatic < a > Schmid factor, and higher values of sin𝜃 and sinϕ.

From Fig. 1, we can see that none of the grain geometry descriptors has a direct correlation with stress, yet they are still selected by FeaLect. This points to the fact that these variables become important in association with others. We analyzed these features in detail in [4] and found that hotspots lie closer to grain boundaries (GBEuc), triple junctions (TJEuc), and quadruple points (QPEuc), and prefer to form in smaller grains.

There is a subtle distinction between the variables that physically affect the target and the variables that work best for a given model. From Table 2, we can see that a random forest model built on the entire feature set, without feature selection, has an AUC of 71.94%. All the feature selection techniques improve the performance of the random forest model to a validation AUC of about 81%. However, to draw physical interpretations, it is important to use a feature selection technique that (1) keeps the original representation of the features, (2) is not biased by correlations/redundancies among features, (3) is insensitive to the scale of the variable values, (4) is stable to changes in the training dataset, (5) takes multivariate dependencies between the features into account, and (6) provides an individual feature ranking measure.

Conclusions

In this work, we have surveyed different feature selection techniques by applying them to the stress hotspot classification problem. These techniques can be divided into three categories: filter, embedded, and wrapper, and we have explored the most commonly used techniques in each category. All of the techniques lead to an improvement in model performance and are suitable for feature selection when the goal is simply to build a better model. However, when the aim is to interpret the model and understand which features might be more causal than others, it is essential to note the limitations of the different techniques. We found that, in the presence of correlated features, the FeaLect method helped us determine the underlying importance of the features. We find that:

  • All feature selection techniques result in ∼ 9% improvement in the AUC metric for stress hotspot classification.

  • Correlation-based feature selection and recursive feature elimination are computationally expensive to run and yield only a feature subset rather than an individual feature ranking.

  • Random forest embedded feature ranking is biased in the presence of correlated features and hence should not be used to derive physical insights.

  • Linear regression-based feature selection techniques can objectively identify the most important features; however, they have their flaws: they can be affected by the scale of the features, by correlations between them, and by the particular dataset.

  • The FeaLect algorithm can compensate for the variability in LASSO regression, providing a robust feature ranking that can be used to derive insights.

  • Stress hotspot formation under uniaxial tensile deformation is determined by a combination of crystallographic and geometric microstructural descriptors.

  • It is essential to choose a feature selection method that can find this dependence even when features are redundant or correlated.