An applied stress is distributed heterogeneously within the grains in a microstructure [28]. Under an applied deformation, some grains are prone to accumulating stress due to their orientation, geometry, and placement with respect to neighboring grains. These regions of high stress, so-called stress hotspots, are related to void nucleation under ductile fracture [29]. Stress hotspot formation has been studied in face-centered cubic (FCC) [3] and hexagonal close-packed (HCP) [4] materials using a machine learning approach. A set of microstructural descriptors was designed to be used as features in a random forest model for predicting stress hotspots. To achieve data-driven insights into the problem, it is essential to rank the microstructural descriptors (features). In this paper, we review different feature selection techniques applied to the stress hotspot classification problem in HCP materials, which have a complex plasticity landscape due to anisotropic slip system activity.
Let (xi, yi), for i = 1, ..., N, be N independent and identically distributed (i.i.d.) observations, where xi ∈ Rp is a p-dimensional vector of grain features and the response variable yi ∈ {0, 1} denotes whether the grain is a stress hotspot. The input matrix is denoted by X = (x1, ..., xN) ∈ RN×p, and y ∈ {0, 1}N is the binary outcome. We use lowercase letters to refer to the samples x1, ..., xN and capital letters to refer to the features X1, ..., Xp of the input matrix X. Feature importance refers to the metrics used by feature selection methods to rank features, such as feature weights in linear models or variable importance in random forest models.
Dataset Studied
The machine learning input dataset of synthetic 3D equiaxed microstructures with different HCP textures was generated using Dream.3D in [4]. Uniaxial tensile deformation was simulated in these microstructures using EVPFFT [30] with constitutive parameters representing a titanium-like HCP material with an anisotropic critical resolved shear stress ratio [4]. The EVPFFT simulation was carried out in 200 strain steps of 0.01% along the sample Z direction, up to a total strain of 2%. The crystal plasticity simulations result in spatially resolved micromechanical stress and strain fields. These data were averaged to obtain a dataset containing grain-wise values of the equivalent von Mises stress, together with the corresponding Euler angles and grain connectivity parameters. Steps to reproduce this dataset are discussed in detail in [31].
Grains whose stress exceeds the 90th percentile of the stress distribution within each microstructure are designated as stress hotspots, giving a binary target. Thirty-four variables to be used as features in machine learning were developed. These features (X) describe the grain texture and geometry and are summarized in Table 1. We note that these features are not a complete set; long-range effects also contribute to stress hotspot formation. We use first-order microstructural descriptors to build stress hotspot prediction models, with the understanding that these models can be improved by adding the missing features.
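As an illustration of this labeling step, the sketch below constructs the binary hotspot target from grain-wise stress values. It assumes the data are held in a pandas DataFrame with hypothetical columns `microstructure_id` and `von_mises_stress`; the column names and layout are illustrative and not taken from the original dataset.

```python
import pandas as pd

def label_hotspots(grains: pd.DataFrame) -> pd.Series:
    """Flag grains above the 90th percentile of stress within each microstructure."""
    # Per-microstructure 90th-percentile threshold, broadcast back to each grain.
    threshold = grains.groupby("microstructure_id")["von_mises_stress"].transform(
        lambda s: s.quantile(0.90)
    )
    return (grains["von_mises_stress"] > threshold).astype(int)

# Usage (hypothetical DataFrame):
# grains["hotspot"] = label_hotspots(grains)
```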
Table 1 Feature name descriptions

The microstructures contained in this dataset represent eight different kinds of textures, and we validate the machine learning models using leave-one-texture-out validation. This divides the dataset into ∼85% training and ∼15% validation data. Note that since only 10% of the grains are stress hotspots, this is an imbalanced classification problem. Hence, model performance is measured by the AUC (area under the ROC curve), a binary classification metric that is insensitive to class imbalance. An AUC of 100% denotes perfect classification, and 50% denotes performance no better than random guessing [32].
We first build a decision tree-based random forest model [26] for stress hotspot classification using all thirty-four variables. We then rank and select the variables using different feature selection techniques. The selected variables are then used to build new random forest models, and we observe the resulting improvement in model performance. The feature rankings are then used to gain insights into the physics behind stress hotspot formation.
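The sketch below illustrates this baseline workflow: a random forest classifier evaluated with leave-one-texture-out cross-validation and scored by AUC. It assumes a NumPy feature matrix `X`, binary labels `y`, and a per-grain texture label array `texture_labels`; the hyperparameters are illustrative and not necessarily those used in the original study.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import LeaveOneGroupOut

def leave_one_texture_out_auc(X, y, texture_labels):
    """Train on all but one texture, validate on the held-out texture, report AUC per fold."""
    aucs = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=texture_labels):
        clf = RandomForestClassifier(
            n_estimators=100, class_weight="balanced", random_state=0
        )
        clf.fit(X[train_idx], y[train_idx])
        scores = clf.predict_proba(X[test_idx])[:, 1]  # hotspot probability
        aucs.append(roc_auc_score(y[test_idx], scores))
    return np.array(aucs)
```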
Feature Selection Methods
Filter Methods
Filter methods preprocess the dataset to identify which of the features X1,..., Xp most impact the target Y, independently of any particular classifier. Some of these methods are as follows:
Pearson Correlation [24]
This method provides a straightforward way of filtering features according to their correlation with the target. The Pearson correlation coefficient between a feature Xi and the target Y is:
$$\rho_{i} = \frac{\mathrm{cov}(X_{i}, Y)}{\sigma_{X_{i}}\sigma_{Y}}$$
where cov(Xi, Y) is the covariance and σ is the standard deviation [24]. It ranges from −1 (negative correlation) to 1 (positive correlation) and can be used for binary classification and regression problems. It is a quick metric: features are ranked in order of the absolute value of their correlation coefficient with the target.
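A minimal NumPy sketch of this ranking, assuming `X` is the N × p feature matrix (as a NumPy array) and `y` is the vector of binary hotspot labels:

```python
import numpy as np

def pearson_rank(X, y):
    """Rank features by the absolute Pearson correlation with the binary target."""
    Xc = X - X.mean(axis=0)                       # center each feature
    yc = y - y.mean()                             # center the target
    cov = Xc.T @ yc / (len(y) - 1)                # sample covariance of each feature with y
    rho = cov / (X.std(axis=0, ddof=1) * y.std(ddof=1))
    order = np.argsort(-np.abs(rho))              # most correlated features first
    return order, rho
```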
Correlation-based feature selection (CFS) [23]
CFS was developed to select a subset of features with high correlation to the target and low intercorrelation among themselves, thus reducing redundancy and selecting a diverse feature set. CFS assigns a heuristic merit to a feature subset rather than to individual features. It uses the symmetrical uncertainty correlation coefficient, given by:
$$r(X,Y) = 2.0 \times \frac{IG(X|Y)}{H(X)+H(Y)} $$
where IG(X|Y ) is the information gain of feature X for the class attribute Y. H(X) is the entropy of variable X. The following merit metric was used to rank each subset S containing k features:
$$Merit_{S} = \frac{k\overline{r_{cf}}}{\sqrt{k + k(k-1)\overline{r_{ff}}}}$$
where \(\overline {r_{cf}}\) is the mean symmetrical uncertainty correlation between the features (f ∈ S) and the target, and \(\overline {r_{ff}}\) is the average feature-feature intercorrelation. To account for the high computational complexity of evaluating all possible feature subsets, CFS is often combined with search strategies such as forward selection, backward elimination, and bidirectional search. In this work, we have used the scikit-learn implementation of CFS [33], which uses symmetrical uncertainty [23] as the correlation metric and explores the subset space using best first search [34], stopping when it encounters five consecutive fully expanded non-improving subsets.
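A minimal sketch of the two quantities CFS relies on, assuming the continuous grain features have already been discretized (e.g., binned) so that entropies are well defined. The best-first search over subsets is omitted, and `mutual_info_score` stands in for the information gain IG(X|Y):

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(x):
    """Shannon entropy of a discrete variable (in nats)."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def symmetrical_uncertainty(x, y):
    """r(X, Y) = 2 * IG(X|Y) / (H(X) + H(Y)), with IG taken as mutual information."""
    return 2.0 * mutual_info_score(x, y) / (entropy(x) + entropy(y))

def cfs_merit(X_disc, y, subset):
    """CFS merit of a feature subset: k * r_cf / sqrt(k + k(k-1) * r_ff)."""
    k = len(subset)
    r_cf = np.mean([symmetrical_uncertainty(X_disc[:, j], y) for j in subset])
    r_ff = (np.mean([symmetrical_uncertainty(X_disc[:, i], X_disc[:, j])
                     for i in subset for j in subset if i < j])
            if k > 1 else 0.0)
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)
```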
Embedded Methods
These methods are popular because they perform feature selection while constructing the classifier, removing the preprocessing feature selection step. Some popular algorithms are support vector machines (SVM) using RFE [35], RF [26], and LASSO [27]. We compare LASSO and RF methods for feature selection on the stress hotspot dataset.
Least Absolute Shrinkage and Selection Operator (LASSO) [27]
LASSO is linear regression with L1 regularization [27]. A linear model \(\mathcal {L}\) is constructed
$$\mathcal{L}:\quad \min_{w\in \mathbb{R}^{p}} \frac{1}{2N}\sum\limits_{i = 1}^{N}\left(y_{i} - w^{T} x_{i}\right)^{2} + \lambda\|w\|_{1}$$
on the training data (xi, yi), i = 1, ..., N, where w is a p-dimensional vector of weights corresponding to the p feature dimensions. The L1 regularization term (λ||w||1) aids feature selection by pushing the weights of redundant (correlated) features to zero, thus preventing overfitting and improving model performance. Model interpretation is possible by ranking the features according to the LASSO feature weights. However, it has been shown that for a given regularization strength λ, if the features have redundancy, inconsistent subsets can be selected [36]. Nonetheless, LASSO has been shown to provide good prediction accuracy by reducing model variance without substantially increasing the bias, while providing better model interpretability. We used the scikit-learn implementation to compute our results [37].
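A minimal scikit-learn sketch of this ranking step; the regularization strength `alpha` (playing the role of λ) is illustrative, and the features are standardized so that the weights are comparable across features. Note that scikit-learn's `Lasso` objective includes the same 1/(2N) factor on the squared-error term as the equation above.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

def lasso_feature_weights(X, y, alpha=0.01):
    """Fit L1-regularized linear regression on the 0/1 hotspot labels and
    return absolute feature weights for ranking (alpha is illustrative)."""
    Xs = StandardScaler().fit_transform(X)   # put features on a common scale
    model = Lasso(alpha=alpha).fit(Xs, y)
    return np.abs(model.coef_)

# Rank features from largest to smallest absolute weight:
# ranking = np.argsort(-lasso_feature_weights(X, y))
```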
Random Forest Permutation Accuracy Importance (RF-PAI) [26]
The random forest is a nonlinear multivariate model built on an ensemble of decision trees. It can be used to determine feature importance using its inbuilt feature importance measure [26]. To assess the importance of a variable Xj, its values are randomly permuted while all other variables are left unchanged. If the original variable Xj was associated with the response, the number of observations classified correctly decreases substantially when the permuted Xj, together with the remaining unchanged variables, is used to predict the response. Thus, a reasonable measure of feature importance is the difference in prediction accuracy before and after permuting Xj. The feature importance calculated this way is known as the permutation accuracy importance (PAI) and was computed using the scikit-learn package in Python [37].
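One way to compute a permutation-based importance with scikit-learn is the model-agnostic `permutation_importance` routine in `sklearn.inspection`; the exact routine used in the original work may differ. The scoring metric, number of repeats, and forest hyperparameters below are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def rf_permutation_importance(X_train, y_train, X_val, y_val):
    """Mean drop in validation AUC when each feature is permuted;
    larger drops indicate more important features."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X_train, y_train)
    result = permutation_importance(
        clf, X_val, y_val, scoring="roc_auc", n_repeats=10, random_state=0
    )
    return result.importances_mean
```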
Wrapper Methods
Wrapper methods test feature subsets using a model hypothesis. They can detect feature dependencies, i.e., features that become important in the presence of each other. Because they are computationally expensive, they are often combined with greedy search strategies (forward selection and backward elimination [38]), which are fast, help avoid overfitting, and yield the best nested subset of features.
FeaLect Algorithm [25]
The number of features selected by LASSO depends on the regularization parameter λ, and in the presence of highly correlated features, LASSO arbitrarily selects one feature from each group of correlated features [39]. The set of possible solutions over all LASSO regularization strengths is given by the regularization path, which can be recovered efficiently using the least angle regression (LARS) algorithm [40]. It has been shown that LASSO selects the relevant variables with probability one and all others with a positive probability [36]. Based on this property, the Bolasso feature selection algorithm, an improvement on LASSO, was developed in 2008 [36]. In this method, the dataset is bootstrapped, and a LASSO model with a fixed regularization strength λ is fit to each bootstrap sample. Finally, the intersection of the features selected on each sample is taken to obtain a consistent feature subset.
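A minimal sketch of the bootstrap-and-intersect idea behind Bolasso, assuming a standardized feature matrix `X` and binary labels `y`; the regularization strength and number of bootstraps are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.utils import resample

def bolasso_selected_features(X, y, alpha=0.01, n_bootstraps=50, tol=1e-8):
    """Fit LASSO on bootstrap resamples and keep the intersection of the
    features with nonzero weights (alpha and n_bootstraps are illustrative)."""
    selected = None
    for b in range(n_bootstraps):
        Xb, yb = resample(X, y, random_state=b)          # bootstrap sample
        coef = Lasso(alpha=alpha).fit(Xb, yb).coef_
        nonzero = set(np.flatnonzero(np.abs(coef) > tol))
        selected = nonzero if selected is None else selected & nonzero
    return sorted(selected)
```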
In 2013, the FeaLect algorithm, an improvement over the Bolasso algorithm, was developed based on the combinatorial analysis of regression coefficients estimated using LARS [25]. FeaLect considers the full regularization path and computes the feature importance using a combinatorial scoring method, as opposed to simply taking the intersection with Bolasso. The FeaLect scoring scheme measures the quality of each feature in each bootstrapped sample and averages them to select the most relevant features, providing a robust feature selection method. We used the R implementation of FeaLect to compute our results [41].
Recursive Feature Elimination (RFE) [35]
A number of common ML techniques (linear regression, SVM, decision trees, Naive Bayes, the perceptron, etc.) provide feature weights that consider multivariate interacting effects between features [13]. To interpret the relative importance of the variables from these model feature weights, RFE was introduced in the context of SVMs [35] to obtain compact gene subsets from DNA microarray data.
To find the best feature subset, instead of performing an exhaustive search over all feature combinations, RFE uses a greedy approach, which has been shown to reduce the effect of correlation bias in variable importance measures [42]. RFE performs backward elimination: it takes the given model (SVM, random forest, linear regression, etc.), discards the worst feature (by absolute classifier weight or feature ranking), and repeats the process over increasingly smaller feature subsets until the best model hypothesis is achieved. The weights of this optimal model are used to rank features. Although this ranking might not be the optimal ranking for individual features, it is often used as a variable importance measure [42]. We used the scikit-learn implementation of RFE with a random forest classifier to obtain a feature ranking for our dataset.
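A minimal sketch of this setup using scikit-learn's `RFE` wrapped around a random forest classifier, which eliminates one feature per iteration based on the forest's importance scores; the target subset size and forest hyperparameters are illustrative.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

def rfe_ranking(X, y, n_features_to_select=10):
    """Recursively drop the least important feature and return the RFE ranking
    (rank 1 = feature kept in the final subset)."""
    estimator = RandomForestClassifier(n_estimators=100, random_state=0)
    selector = RFE(estimator, n_features_to_select=n_features_to_select, step=1)
    selector.fit(X, y)
    return selector.ranking_
```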