The overall proposed approach is illustrated in Figure 1. The raw data is first preprocessed for consistency using domain knowledge. Ranking-based feature selection methods are then used to assess the relative predictive potential of the attributes. Different regression-based predictive modeling methods are applied to the preprocessed and/or transformed data to construct models that predict the fatigue strength, given the composition and processing parameters. All constructed models are evaluated using Leave-One-Out Cross Validation with respect to various prediction-accuracy metrics. Below we present the details of each of the four stages.
Understanding and cleaning the data for proper normalization is one of the most important steps for effective data mining. Appropriate preprocessing, therefore, becomes crucial in any kind of predictive modeling, including that of fatigue strength. The dataset used in this study spans multiple grades of steel, and in some records certain heat treatment processing steps were not performed. In particular, different specimens are subjected to different heat treatment conditions: some are normalized and tempered, some are through hardened and tempered, and others are carburized and tempered. There are also cases where normalization is done prior to carburization and tempering. In order to impose a uniform structure on the database, we included all the key processes in the data: normalization, through hardening, carburization, quenching, and tempering. For the cases where a process was not actually performed, we set the corresponding duration/time variable to zero, with the corresponding temperature set to the austenitization temperature or to the average over the records where the process exists. Setting the time to zero essentially means that no material transformation occurs. An artifact of the resulting data is that temperature and time are treated as independent variables, whereas they are physically meaningful only when considered together.
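The imputation rule described above can be sketched as follows. This is a minimal illustration with hypothetical field names (`tempering_time`, `tempering_temp`); the actual database schema and the austenitization-temperature variant are not shown.

```python
# Sketch of the preprocessing rule: if a heat-treatment step was not applied,
# its duration is set to 0 (no transformation occurs) and its temperature is
# imputed with the mean over the records where the step exists.

def impute_missing_step(records, time_key, temp_key):
    """Fill in records that lack a processing step (time/temp are None)."""
    observed = [r[temp_key] for r in records if r[time_key] is not None]
    mean_temp = sum(observed) / len(observed)
    for r in records:
        if r[time_key] is None:
            r[time_key] = 0.0          # zero duration: no material transformation
            r[temp_key] = mean_temp    # placeholder temperature
    return records

data = [
    {"tempering_time": 2.0, "tempering_temp": 550.0},
    {"tempering_time": 1.0, "tempering_temp": 600.0},
    {"tempering_time": None, "tempering_temp": None},  # step not applied
]
impute_missing_step(data, "tempering_time", "tempering_temp")
print(data[2])  # time -> 0.0, temp -> 575.0
```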
This is an entropy-based metric that evaluates the worth of each attribute independently by measuring the information gain with respect to the target variable:

$$ IG(y, x) = H(y) - H(y \mid x), $$
where H(.) denotes the information entropy. The ranking generated by this method can be useful to get insights about the relative predictive potential of the input features.
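As a concrete illustration, information gain for discretized attributes can be computed directly from the entropy definition (a minimal sketch, not the feature-selection tool used in the study):

```python
import math
from collections import Counter

def entropy(labels):
    """Information entropy H(.) of a discrete label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature, target):
    """IG = H(target) - H(target | feature), for discretized attributes."""
    total = entropy(target)
    n = len(target)
    cond = 0.0
    for v in set(feature):
        subset = [t for f, t in zip(feature, target) if f == v]
        cond += len(subset) / n * entropy(subset)
    return total - cond

# Toy example: the feature perfectly determines the (binned) target,
# so the gain equals the full target entropy of 1 bit.
feature = ["low", "low", "high", "high"]
target  = ["weak", "weak", "strong", "strong"]
print(info_gain(feature, target))  # 1.0
```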
Singular value decomposition (SVD) is a matrix factorization defined as:

$$ D = U S V^{T}, $$
where D is the data matrix in which every observation is represented by a row and each column is an explanatory variable, U is the matrix of left singular vectors, V is the matrix of right singular vectors, and S is the diagonal matrix of singular values. In this case, A = US is a transformation of D in which the data is represented by a new set of explanatory variables, each of which is a known linear combination of the original explanatory parameters. The dimensions of A are also referred to as the Principal Components (PC) of the data.
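The factorization and the principal-component transformation A = US can be verified numerically on synthetic data (illustrative only; the matrix sizes here are arbitrary):

```python
import numpy as np

# D: 5 observations x 3 explanatory variables (synthetic data for illustration)
rng = np.random.default_rng(0)
D = rng.normal(size=(5, 3))

# Thin SVD: D = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(D, full_matrices=False)
A = U * s   # principal-component representation (A = U x S)

assert np.allclose((U * s) @ Vt, D)  # the factorization reconstructs D
assert np.allclose(A, D @ Vt.T)      # equivalently, A is D projected onto V
print(A.shape)  # (5, 3)
```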
We experimented with 12 predictive modeling techniques in this research study, which include the following:
Linear regression is probably the oldest and most widely used predictive model; it represents a regression that is linear in the unknown parameters used in the fit. The most common form of linear regression is least squares fitting. Least squares fitting of lines and polynomials are both forms of linear regression.
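A least-squares line fit, the most common case, reduces to solving a linear system (a minimal numpy sketch on noise-free synthetic data):

```python
import numpy as np

# Least-squares fit of y = a*x + b, the most common form of linear regression.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0                            # noise-free for a clean illustration

X = np.column_stack([x, np.ones_like(x)])    # design matrix [x, 1]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # ~[2.0, 1.0]: slope and intercept are recovered
```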
Pace regression
Pace regression evaluates the effect of each feature and uses a clustering analysis to improve the statistical basis for estimating their contribution to the overall regression. It can be shown that pace regression is optimal when the number of coefficients tends to infinity. We use a version of Pace Regression described in [29, 30].
Regression post non-linear transformation of select input variables
A non-linear transformation of selected input variables can be applied and the resulting dataset used for linear regression. In this study, the temperature-variation effects on the diffusion equation are modelled according to the Arrhenius empirical equation as exp(−1/T), where T is measured in Kelvin.
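The transformation amounts to replacing a raw temperature column by the feature exp(−1/T) before fitting an ordinary linear model, as the following sketch shows (synthetic temperatures and response; not the study's actual data):

```python
import numpy as np

# Replace a raw temperature column T (in Kelvin) by the Arrhenius-style
# feature exp(-1/T) before ordinary linear regression.
T = np.array([500.0, 600.0, 700.0, 800.0])         # temperatures in K
arrhenius_feature = np.exp(-1.0 / T)

# The transformed feature then enters a linear model like any other column:
X = np.column_stack([arrhenius_feature, np.ones_like(T)])
y = 3.0 * arrhenius_feature + 0.5                   # synthetic linear response
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # ~[3.0, 0.5]
```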
Robust fit regression
The robust regression method attempts to mitigate the shortcomings of ordinary linear regression in the presence of outliers in the data or non-normal measurement errors.
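One common way to achieve this is iteratively reweighted least squares with Huber weights, sketched below; this is a generic illustration of the idea, not the specific robust fitter used in the study. On data with a single gross outlier, the robust slope stays near the true value while the ordinary least-squares slope is pulled away:

```python
import numpy as np

def huber_irls(X, y, delta=1.35, iters=50):
    """Robust linear fit: iteratively reweighted least squares with Huber
    weights (a sketch; library robust fitters differ in detail)."""
    w = np.ones(len(y))
    beta = np.zeros(X.shape[1])
    for _ in range(iters):
        sw = np.sqrt(w)
        beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)
        r = y - X @ beta
        scale = np.median(np.abs(r - np.median(r))) / 0.6745 + 1e-12
        a = np.abs(r / scale)
        w = np.where(a <= delta, 1.0, delta / a)   # Huber weight function
    return beta

x = np.arange(10, dtype=float)
y = 2.0 * x + 1.0
y[-1] += 50.0                                      # a single gross outlier

X = np.column_stack([x, np.ones_like(x)])
ols, *_ = np.linalg.lstsq(X, y, rcond=None)
robust = huber_irls(X, y)
print(ols[0], robust[0])  # OLS slope is biased upward; robust slope stays near 2
```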
Multivariate polynomial regression
Ordinary least squares (OLS) regression is governed by the equation:

$$ \beta = (X^{T} X)^{-1} X^{T} Y, $$
where β is the vector of regression coefficients, X is the design matrix and Y is the vector of responses at each data point. Multivariate Polynomial Regression (MPR) is a specialized instance of multivariate OLS regression that assumes that the relationship between regressors and the response variable can be explained with a standard polynomial. Standard polynomial here refers to a polynomial function that contains every polynomial term implied by a multinomial expansion of the regressors with a given degree (sometimes also referred to as a polynomial basis function). Polynomials of various degrees and numbers of variables are interrogated systematically to find the most suitable fit. There is a finite number of possible standard polynomials that can be interrogated due to the degrees of freedom imposed by a particular dataset: the number of terms in the polynomial (and consequently the number of coefficients) cannot exceed the number of data points.
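Building the standard-polynomial design matrix and solving the OLS system can be sketched as follows (a generic illustration of MPR, not the study's implementation; the synthetic quadratic target is recovered exactly because it lies in the polynomial basis):

```python
import numpy as np
from itertools import combinations_with_replacement

def poly_design(X, degree):
    """Standard-polynomial design matrix: every term of the multinomial
    expansion of the regressors up to the given degree (incl. intercept)."""
    n, d = X.shape
    cols = [np.ones(n)]
    for deg in range(1, degree + 1):
        for idx in combinations_with_replacement(range(d), deg):
            cols.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(cols)

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 0] * X[:, 1] + X[:, 1] ** 2  # quadratic truth

P = poly_design(X, degree=2)          # columns: 1, x1, x2, x1^2, x1*x2, x2^2
beta, *_ = np.linalg.lstsq(P, y, rcond=None)
print(P.shape)  # (30, 6); beta recovers the generating polynomial
```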
K-nearest neighbours
This is a lazy predictive modeling technique that implements K-nearest-neighbour (kNN) modeling. It uses the normalized Euclidean distance to find the training instance closest to a given test instance, and predicts the same output value as that training instance. If multiple instances are at the same (smallest) distance from the test instance, the first one found is used. The approach eliminates the need for building models and supports adding new instances to the training database dynamically. However, the zero training time comes at the expense of slow testing, since each test instance must be compared with all the instances in the training data.
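The 1-nearest-neighbour case with normalized Euclidean distance can be written in a few lines (a sketch with toy data; real implementations add efficient index structures):

```python
import numpy as np

def nn1_predict(X_train, y_train, x_query):
    """1-nearest-neighbour prediction with normalized Euclidean distance:
    each attribute is rescaled to [0, 1] over the training data first."""
    lo, hi = X_train.min(axis=0), X_train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    Xn = (X_train - lo) / span
    qn = (x_query - lo) / span
    d = np.sqrt(((Xn - qn) ** 2).sum(axis=1))
    return y_train[np.argmin(d)]       # ties: first (lowest-index) instance wins

X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y = np.array([100.0, 200.0, 300.0])
print(nn1_predict(X, y, np.array([2.1, 21.0])))  # 200.0
```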
KStar
KStar is another lazy, instance-based modeling technique: the class of a test instance is based upon the classes of the training instances similar to it, as determined by some similarity function. It differs from other instance-based learners in that it uses an entropy-based distance function. The underlying technique of summing probabilities over all possible transformation paths is believed to contribute to its good overall performance relative to certain rule-based and instance-based methods. It also allows an integration of both symbolic and real-valued attributes.
Decision tables
A decision table is a rule-based modeling technique that constructs rules involving different combinations of attributes, which are selected using an attribute selection search method. It thus represents one of the simplest and most rudimentary ways of representing the output of a machine learning algorithm: a decision based on the values of a number of attributes of an instance. The number and specific types of attributes can vary to suit the needs of the task. A simple decision table majority classifier has been shown to sometimes outperform state-of-the-art classifiers. Decision tables are easy for humans to understand, especially if the number of rules is not very large.
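The core lookup idea can be sketched as a table mapping selected attribute combinations to the mean target seen for that combination, with a global default for unseen combinations (illustrative only, with hypothetical attribute names; real decision-table learners also search over which attributes to include):

```python
from collections import defaultdict

def build_table(rows, targets, attrs):
    """Map each combination of the selected attributes to its mean target."""
    buckets = defaultdict(list)
    for row, t in zip(rows, targets):
        buckets[tuple(row[a] for a in attrs)].append(t)
    table = {k: sum(v) / len(v) for k, v in buckets.items()}
    default = sum(targets) / len(targets)    # fallback for unseen combinations
    return table, default

def predict(table, default, row, attrs):
    return table.get(tuple(row[a] for a in attrs), default)

rows = [{"hardened": "yes", "tempered": "yes"},
        {"hardened": "yes", "tempered": "no"},
        {"hardened": "no",  "tempered": "yes"}]
targets = [600.0, 500.0, 400.0]
table, default = build_table(rows, targets, ["hardened", "tempered"])
print(predict(table, default, rows[0], ["hardened", "tempered"]))  # 600.0
```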
Support vector machines
SVMs are based on the Structural Risk Minimization (SRM) principle from statistical learning theory; detailed descriptions of SVMs and SRM are available in the literature. In their basic form, SVMs perform classification by constructing hyperplanes in a multidimensional space that separate the cases of different class labels. They support both classification and regression tasks and can handle multiple continuous and nominal variables. Different types of kernels can be used in SVM models, including linear, polynomial, radial basis function (RBF), and sigmoid kernels. Of these, the RBF kernel is the most widely recommended and used, since it has a finite response across the entire range of the real x-axis.
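A support vector regression with the RBF kernel can be fitted in a few lines; scikit-learn is used here purely as an illustration (the study does not prescribe a particular library, and the hyperparameter values below are arbitrary):

```python
import numpy as np
from sklearn.svm import SVR

# RBF-kernel support vector regression on a smooth 1-D function.
X = np.linspace(0.0, 1.0, 40).reshape(-1, 1)
y = np.sin(2 * np.pi * X).ravel()

model = SVR(kernel="rbf", C=100.0, epsilon=0.01).fit(X, y)
pred = model.predict([[0.25]])
print(pred)  # close to sin(pi/2) = 1.0
```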
Artificial neural networks
ANNs are networks of interconnected artificial neurons, and are commonly used for non-linear statistical data modeling to model complex relationships between inputs and outputs. The network includes a hidden layer of multiple artificial neurons connected to the inputs and outputs with different edge weights. The internal edge weights are ‘learnt’ during the training process using techniques like back propagation. Several good descriptions of neural networks are available [36, 37].
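The mechanism, a hidden layer of neurons whose weights are learnt by back propagation, can be demonstrated with a minimal numpy network trained by gradient descent on squared error (purely illustrative; the layer size, learning rate, and synthetic target are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
y = (X[:, 0] + X[:, 1]).reshape(-1, 1)     # simple target function to learn

W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)   # hidden layer
W2 = rng.normal(scale=0.5, size=(8, 1)); b2 = np.zeros(1)   # output neuron

lr, losses = 0.05, []
for _ in range(200):
    h = np.tanh(X @ W1 + b1)               # forward pass: hidden activations
    out = h @ W2 + b2                      # linear output
    err = out - y
    losses.append(float((err ** 2).mean()))
    g_out = 2 * err / len(X)               # dLoss/d(out)
    g_h = (g_out @ W2.T) * (1 - h ** 2)    # back-propagate through tanh
    W2 -= lr * (h.T @ g_out); b2 -= lr * g_out.sum(axis=0)
    W1 -= lr * (X.T @ g_h);   b1 -= lr * g_h.sum(axis=0)

print(losses[0], losses[-1])  # squared error decreases during training
```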
Reduced error pruning trees
A Reduced Error Pruning Tree (REPTree) is an implementation of a fast decision tree learner. A decision tree consists of internal nodes denoting the different attributes and branches denoting the possible values of those attributes, while the leaf nodes indicate the final predicted value of the target variable. REPTree builds a decision/regression tree using information gain/variance and prunes it using reduced-error pruning. In general, decision tree construction begins at the top of the tree (the root node) with all of the data. At each node, splits are made according to the information gain criterion, which divides the data into the corresponding branches. Computation on the remaining nodes continues in the same manner until one of the stopping criteria is met; these include the maximum tree depth, the minimum number of instances in a leaf node, and the minimum variance in a node.
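REPTree itself is a Weka learner; as a rough stand-in for illustration, scikit-learn's variance-reducing regression tree exposes the same kinds of stopping criteria (maximum depth, minimum instances per leaf), though it does not perform reduced-error pruning:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# A variance-splitting regression tree on a piecewise-constant target.
rng = np.random.default_rng(3)
X = rng.uniform(0.0, 10.0, size=(200, 1))
y = np.where(X.ravel() < 5.0, 100.0, 200.0)

# Stopping criteria analogous to those listed above: depth and leaf size.
tree = DecisionTreeRegressor(max_depth=3, min_samples_leaf=5).fit(X, y)
print(tree.predict([[2.0]]), tree.predict([[8.0]]))  # ~100 and ~200
```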
M5 model trees
M5 Model Trees are a reconstruction of Quinlan's M5 algorithm for inducing trees of regression models, which combines a conventional decision tree with the option of linear regression functions at the nodes. The algorithm partitions the training data using decision tree induction, trying to minimize the intra-subset variation in the class values down each branch, followed by back pruning and smoothing, which substantially increases prediction performance. It also uses the techniques of CART to effectively deal with enumerated attributes and missing values.
Traditional regression-based methods such as linear regression are typically evaluated by building the model (a linear equation in the case of linear regression) on the entire available data, and computing prediction errors on the same data. Although this approach works well in general for simple regression methods, it is nonetheless susceptible to over-fitting, and thus can give over-optimistic accuracy numbers. In particular, a data-driven model can, in principle, memorize every single instance of the dataset and thus achieve 100% accuracy on the same data, but will most likely not work well on unseen data. For this reason, advanced data-driven techniques that usually result in black-box models need to be evaluated on data that the model has not seen while training. A simple way to do this is to build the model on a random half of the data and use the remaining half for evaluation. This is called the train-test split setting for model evaluation. Further, the training and testing halves can then be swapped for another round of evaluation, and the results combined to get predictions for all the instances in the dataset. This setting is called 2-fold cross validation, as the dataset is split into 2 parts. It can further be generalized to k-fold cross validation, where the dataset is randomly split into k parts: k−1 parts are used to build the model and the remaining part is used for testing. This process is repeated k times with different test splits, and the results combined to get predictions for all the instances in the dataset using models that did not see them while training. Cross validation is a standard evaluation setting to eliminate any chances of over-fitting. Of course, k-fold cross validation necessitates building k models, which may take a long time on large datasets.
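The k-fold splitting scheme described above can be sketched in a few lines (a generic illustration; any model-fitting code would go inside the loop):

```python
import numpy as np

def kfold_indices(n, k, seed=0):
    """Split indices 0..n-1 into k random, disjoint, exhaustive test folds."""
    idx = np.random.default_rng(seed).permutation(n)
    return np.array_split(idx, k)

folds = kfold_indices(10, 5)
for i, test_idx in enumerate(folds):
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    # A model would be built on train_idx and evaluated on test_idx here.
    print(i, sorted(test_idx.tolist()))
```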
Leave-one-out cross validation
We use leave-one-out cross validation (LOOCV) to evaluate and compare the prediction accuracy of the models. LOOCV is commonly used for this purpose particularly when the dataset is not very large. It is a special case of the more generic k-fold cross validation, with k=N, the number of instances in the dataset. The basic idea here is to estimate the accuracy of the predictive model on unseen input data it may encounter in the future, by withholding part of the data for training the model, and then testing the resulting model on the withheld data. In LOOCV, to predict the target attribute for each data instance, a separate predictive model is built using the remaining N−1 data instances. The resulting N predictions can then be compared with the N actual values to calculate various quantitative metrics for accuracy. In this way, each of the N instances is tested using a model that did not see it while training, thereby maximally utilizing the available data for model building, and at the same time eliminating the chances of over-fitting of the models.
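The LOOCV procedure, fitting N separate models and collecting one held-out prediction each, can be sketched with a simple linear model on noise-free synthetic data (where each held-out point is recovered exactly, since the remaining N−1 points determine the same line):

```python
import numpy as np

# LOOCV sketch: for each instance, fit on the other N-1 points, predict it.
x = np.arange(10, dtype=float)
y = 3.0 * x + 2.0                              # noise-free linear relation
X = np.column_stack([x, np.ones_like(x)])

loo_pred = np.empty_like(y)
for i in range(len(y)):
    mask = np.arange(len(y)) != i              # leave instance i out
    beta, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    loo_pred[i] = X[i] @ beta                  # predict the held-out instance

print(np.allclose(loo_pred, y))  # True: every held-out point is recovered
```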
Quantitative assessments of how closely the models predict the actual outputs are used to evaluate the models' predictive performance. A multi-criteria assessment with various goodness-of-fit statistics was performed using all the data vectors to test the accuracy of the trained models. The criteria employed are the coefficient of correlation (R), the explained variance (R²), the Mean Absolute Error (MAE), the Root Mean Squared Error (RMSE), and the Standard Deviation of Error (SDE) between the actual and predicted values. The last three metrics were further normalized by the actual fatigue strength values to express them as error fractions. The definitions of these evaluation criteria are as follows:

$$ R = \frac{\sum_{i=1}^{N} (y_i - \bar{y})(\hat{y}_i - \bar{\hat{y}})}{\sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2 \sum_{i=1}^{N} (\hat{y}_i - \bar{\hat{y}})^2}} $$

$$ MAE = \frac{1}{N} \sum_{i=1}^{N} \left| y_i - \hat{y}_i \right| $$

$$ RMSE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (y_i - \hat{y}_i)^2} $$

$$ SDE = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( (y_i - \hat{y}_i) - \overline{(y - \hat{y})} \right)^2} $$

where y denotes the actual fatigue strength values (MPa), ŷ denotes the predicted fatigue strength values (MPa), and N is the number of instances in the dataset.
The square of the coefficient of correlation, R², represents the variance explained by the model (higher is better), and is considered one of the most important metrics for evaluating the accuracy of regression-based prediction models. Another useful metric is the fractional mean absolute error, MAE_f, which represents the error rate (lower is better).
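The evaluation criteria above translate directly into numpy (a straightforward sketch on toy actual/predicted values; the fractional variants simply divide each error metric by the actual values before averaging):

```python
import numpy as np

def evaluation_metrics(y, y_hat):
    """R, R^2, MAE, RMSE and SDE between actual and predicted values."""
    e = y - y_hat
    r = np.corrcoef(y, y_hat)[0, 1]
    return {
        "R": r,
        "R2": r ** 2,
        "MAE": np.abs(e).mean(),
        "RMSE": np.sqrt((e ** 2).mean()),
        "SDE": e.std(),                    # standard deviation of the error
    }

y     = np.array([400.0, 500.0, 600.0, 700.0])   # actual values (MPa)
y_hat = np.array([410.0, 490.0, 610.0, 690.0])   # predictions (MPa)
m = evaluation_metrics(y, y_hat)
print(m["MAE"], m["RMSE"])  # 10.0 10.0
```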