L1 regularized logistic regression
L1 regularized logistic regression penalizes the feature coefficients with the L1 norm, shrinking some of them to exactly zero. Consider datapoints \(\{(x_{i},y_{i}), i = 1, \dotsc ,N\}\), where N is the number of observations in the data, \(x_{i} \in {\mathbb {R}}^d\), d is the number of features, and \(y_{i} \in \{0,1\}\) is a binary class label. For classification, the probability of an observation x belonging to class 1 is given by \(P(y=1|x) = \dfrac{1}{1+e^{-(\beta _0+\beta ^Tx)}},\) where \(\beta \) is a vector containing the d feature coefficients and \(\beta _{0}\) is the intercept term.
The cost function to be minimized can be formulated as the negative of the regularized log-likelihood function:
$$\begin{aligned} {\mathcal {L}}(\beta _{0},\beta ) = &- \sum _{i=1}^{N} \Big [ y_i\log \big (P(y=1|x_i)\big ) \\ &+ (1-y_i)\log \big (1-P(y=1|x_i)\big ) \Big ] + \lambda \sum _{j=1}^{d}|\beta _{j}|. \end{aligned}$$
(2)
The last term in the equation is the regularization term: the sum of the absolute values of the feature coefficients (the L1 norm), where \(\lambda \) controls the strength of regularization. The greater the value of \(\lambda \), the more coefficients are shrunk exactly to zero. Including fewer features makes the model simpler and more interpretable. The magnitude of a feature coefficient can be interpreted as the importance of that feature: a larger coefficient means the feature had more relevance in the classification. In addition, the sign of the coefficient indicates whether the feature increases or decreases the probability of belonging to a certain class. The model was trained with the LogisticRegressionCV function and five-fold cross-validation to choose the amount of penalization.
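The training step above can be sketched in scikit-learn as follows. This is a minimal illustration on synthetic stand-in data, not the study's pipeline; the solver, the candidate regularization grid (Cs), and the data generation are assumptions not specified in the text.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (the study's datasets are not reproduced here).
X, y = make_classification(n_samples=200, n_features=10,
                           n_informative=4, random_state=0)
X = StandardScaler().fit_transform(X)

# L1-penalized logistic regression; five-fold cross-validation selects
# the regularization strength. Note scikit-learn parameterizes with C,
# the inverse of lambda: larger lambda = smaller C = more shrinkage.
clf = LogisticRegressionCV(Cs=10, cv=5, penalty="l1",
                           solver="liblinear", random_state=0)
clf.fit(X, y)

# Some coefficients may be shrunk exactly to zero; the rest can be
# ranked by magnitude and interpreted by sign, as described above.
coefficients = clf.coef_.ravel()
```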
Random forest
Random forest is a nonlinear classification and regression method based on building an ensemble of decision trees [8]. Decision trees are tree-like models in which the data is split recursively at each decision node into subsets according to a splitting rule. The leaf nodes represent the outcome for the observation. The predicted outcome of a random forest model is the mode of the individual trees' predictions (majority vote) for classification, or their mean for regression.
Random forests have become very popular, especially in medicine [6, 12, 33], because, despite their nonlinearity, they can be interpreted. They provide feature importance measures by calculating the Gini importance, which in binary classification can be formulated as [23]
$$\begin{aligned} Gini = p_1(1-p_1)+p_2(1-p_2), \end{aligned}$$
(3)
where \(p_1\) and \(p_2\) are the probabilities of classes 1 and 2. The Gini index is minimized when either of the probabilities approaches zero. The total decrease in Gini index (node impurity) is calculated after each node split and then averaged over all trees: the larger the decrease in impurity, the more important the input feature. The model was trained with the RandomForestClassifier function. The maximum number of features to sample at each node and the minimum number of samples required to be at a leaf node were selected with GridSearchCV using five folds and values (3,5,9,11) and (1,5,20), respectively.
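The grid search described above can be sketched as follows; the synthetic data and the grid-search scoring metric are assumptions (the text does not state which score GridSearchCV optimized), while the parameter grids match those given in the text.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data with enough features for the largest
# max_features candidate (11).
X, y = make_classification(n_samples=200, n_features=12, random_state=0)

# Grids from the text: max_features in (3,5,9,11),
# min_samples_leaf in (1,5,20), selected with five-fold CV.
param_grid = {"max_features": [3, 5, 9, 11],
              "min_samples_leaf": [1, 5, 20]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5, scoring="roc_auc")
search.fit(X, y)

# Gini importances of the best model: mean decrease in node impurity
# per feature, averaged over all trees (normalized to sum to 1).
importances = search.best_estimator_.feature_importances_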
Local interpretable model-agnostic explanations
LIME [30] is a recently developed tool providing local interpretability on top of any supervised algorithm. It works by weighting neighbouring observations by their proximity to the observation being explained. The explanation is obtained by training a local linear model on the weighted neighbouring observations. More precisely, if f is the prediction (in our case classification) model, x is the specific observation whose prediction f(x) should be explained, g is an explanation model, and \(\pi _{x}\) defines the proximity of the neighbourhood around x, LIME minimizes the objective function
$$\begin{aligned} \xi = \mathop {{\text {arg}}\,{\text {min}}}\limits _{g \in G} L(f, g, \pi _x) + \varOmega (g), \end{aligned}$$
(4)
where \(L(f, g, \pi _x)\) measures how unfaithfully g approximates f in the neighbourhood defined by \(\pi _x\), and \(\varOmega \) penalizes the complexity of g. This means that from the family of all possible explanations G, the explanation g is chosen that is closest to the predictions of f in the locality of x, while the model complexity \(\varOmega (g)\) is kept low.
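The core idea of the objective can be sketched by hand: perturb around x, weight the perturbations by proximity, and fit a penalized linear surrogate to the black-box predictions. This is a simplified illustration of the principle, not the LIME implementation (which additionally discretizes features and performs feature selection); the kernel width, perturbation scale, and synthetic data are all assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
f = RandomForestClassifier(random_state=0).fit(X, y)  # black-box model f

x0 = X[0]  # observation x whose prediction f(x) should be explained

# Perturb around x0 and weight samples by an RBF proximity kernel pi_x.
Z = x0 + rng.normal(scale=0.5, size=(500, X.shape[1]))
dist2 = ((Z - x0) ** 2).sum(axis=1)
weights = np.exp(-dist2 / 0.75)  # kernel width 0.75 is a free choice

# Local surrogate g: weighted ridge regression on the black-box
# probabilities; the ridge penalty stands in for Omega(g), keeping
# the complexity of the explanation low.
target = f.predict_proba(Z)[:, 1]
g = Ridge(alpha=1.0).fit(Z, target, sample_weight=weights)
local_importances = g.coef_  # per-feature local effect around x0
```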
The explainer was trained with the LimeTabularExplainer function. As looking at every individual observation would be impractical, we decided to focus on the four most interesting observations with LIME. These include the observation correctly classified as benign/healthy with highest probability, correctly classified as malignant/injured with highest probability, misclassified as benign/healthy with highest probability, and misclassified as malignant/injured with highest probability. For each observation, LIME outputs a rule and an importance value for each feature separately.
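Selecting the four observations described above from predicted probabilities can be sketched as follows; the probability and label arrays are hypothetical stand-ins, and the helper `most_confident` is an illustrative name, not a function from the study's code.

```python
import numpy as np

# Hypothetical predicted probabilities of class 1 and true labels.
proba = np.array([0.95, 0.10, 0.80, 0.30, 0.60, 0.05])
y_true = np.array([1,    0,    0,    1,    1,    0])
y_pred = (proba >= 0.5).astype(int)

def most_confident(condition, class1):
    """Index of the observation satisfying `condition` with the most
    confident prediction for the given class (highest probability for
    class 1, lowest for class 0)."""
    idx = np.flatnonzero(condition)
    if idx.size == 0:
        return None
    conf = proba[idx] if class1 else 1 - proba[idx]
    return int(idx[np.argmax(conf)])

correct = y_pred == y_true
cases = {
    "correct class 0":  most_confident(correct & (y_pred == 0), False),
    "correct class 1":  most_confident(correct & (y_pred == 1), True),
    "wrong as class 0": most_confident(~correct & (y_pred == 0), False),
    "wrong as class 1": most_confident(~correct & (y_pred == 1), True),
}
```

Each selected index would then be passed to the explainer's explain_instance method to obtain the per-feature rules and importance values.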
Performance estimation
To estimate the performance of the classification models, we used five-fold cross-validation. Inside each fold, the training data were normalized, and the test data were then normalized using coefficients estimated from the training data. The missing values in the running injury data were imputed inside each fold, separately for the training and test data. Performance was measured using the area under the receiver operating characteristic curve (AUC-ROC) [7, 13], averaged over the five folds. Because the folds are split randomly, results from k-fold cross-validation tend to vary [21, 22]. Therefore, to obtain a reliable estimate of the performance as well as the feature importances, the whole analysis was repeated one hundred times.
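The repeated cross-validation scheme can be sketched as follows. Fitting the scaler inside a pipeline ensures test folds are normalized with coefficients estimated from the training folds only, as described above; the synthetic data and the reduced repetition count are assumptions for brevity (the study uses one hundred repetitions and also imputes missing values per fold, which is omitted here).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Scaler inside the pipeline: fitted on training folds only, then
# applied to the corresponding test fold.
model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(random_state=0))

aucs = []
for rep in range(10):  # the paper repeats one hundred times
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=rep)
    scores = cross_val_score(model, X, y, cv=cv, scoring="roc_auc")
    aucs.append(scores.mean())  # AUC averaged over the five folds

mean_auc = np.mean(aucs)
```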
To confirm the significance of the important features and the achieved performance, we applied an approach introduced in [20] based on permutation tests. Shuffling the class labels in the training data ensured that the model was not simply learning noise in the data and thereby achieving performance and feature importance values above chance level [11]. Cross-validation splits were the same as in the runs with true labels. Pairwise comparisons of the hundred runs with true and shuffled labels were done with the Wilcoxon signed-rank test for the achieved AUC values as well as for the feature importance values of logistic regression and random forest. The significance level was set to \(\alpha =0.05\) and Bonferroni corrected. The whole analysis process is outlined in Fig. 1.
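The label-shuffling comparison can be sketched as follows. A non-stratified KFold with a fixed seed per repetition keeps the splits identical between the true-label and shuffled-label runs, as the text requires; the synthetic data, forest size, and reduced repetition count are assumptions for brevity.

```python
import numpy as np
from scipy.stats import wilcoxon
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)

rng = np.random.default_rng(0)
true_aucs, perm_aucs = [], []
for rep in range(10):  # the paper uses one hundred repetitions
    # KFold (not stratified) so the splits are identical for true
    # and shuffled labels within a repetition.
    cv = KFold(n_splits=5, shuffle=True, random_state=rep)
    true_aucs.append(cross_val_score(model, X, y, cv=cv,
                                     scoring="roc_auc").mean())
    y_perm = rng.permutation(y)  # shuffled labels, same CV splits
    perm_aucs.append(cross_val_score(model, X, y_perm, cv=cv,
                                     scoring="roc_auc").mean())

# Pairwise comparison of true vs. chance-level runs.
stat, p = wilcoxon(true_aucs, perm_aucs)
```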