Multicriteria interpretability driven Deep Learning

Deep Learning methods are renowned for their performance, yet their lack of interpretability prevents their adoption in high-stakes contexts. Recent model-agnostic methods address this problem by providing post-hoc interpretability through reverse-engineering the model's inner workings. However, in many regulated fields interpretability should be kept in mind from the start, which means that post-hoc methods are valid only as a sanity check after model training. Interpretability from the start, in an abstract setting, means posing a set of soft constraints on the model's behavior by injecting knowledge and counteracting possible biases. We propose a Multicriteria technique that allows the control of the features' effects on the model's outcome by injecting knowledge into the objective function. We then extend the technique with a non-linear knowledge function to account for more complex effects and local lack of knowledge. The result is a Deep Learning model that embodies interpretability from the start and aligns with recent regulations. A practical empirical example based on credit risk suggests that our approach yields performant yet robust models capable of overcoming biases derived from data scarcity.


Introduction
Deep Learning (DL) models are used extensively nowadays in many fields, ranging from self-driving cars (Rao and Frtunikj, 2018) to brain-computer interfaces (Zhang et al., 2019) to gaming (Vinyals et al., 2019). Recent software and hardware have democratized DL methods, allowing scholars and practitioners to apply them in their fields. On the software side, frameworks such as TensorFlow (Abadi et al., 2015) and PyTorch (Paszke et al., 2019) make it possible to create complex DL models without writing ad-hoc compilers, as was done by LeCun et al. (1990). On the hardware side, the decrease in the cost of the hardware necessary to train such models allowed many people to build and deploy sophisticated Neural Networks with minimal costs (Zhang et al., 2018). The democratization of such powerful technologies allowed many fields beyond computer science to benefit from them; among those that benefitted the most are Economics (Nosratabadi et al., 2020) and Finance (Ozbayoglu et al., 2020). DL applications have also piqued the interest of governments, who are concerned about possible social implications. It is well known that these models necessitate extra vigilance when it comes to training data in order to minimize biases of any kind, especially in high-stakes judgments (Rudin, 2019). To counter these side effects, governments have enacted several regulatory standards, and jurisprudence has started to elaborate on the right-to-explanation concept (Dexe et al., 2020). In this effort to build interpretable but DL-grounded models, scholars have started developing post-hoc interpretation methods. These approaches, however, are at odds with what is prescribed by recent guidelines, which require interpretability from the start (European Commission, 2019). Another issue is that such approaches focus only on interpretation after a model's training and cannot be used to insert prior information or remove biases.
This work focuses on ensuring the interpretability of DL models from the beginning through knowledge injection, and on investigating their potential in empirical settings such as credit risk prediction. In this regard, we make three contributions to the literature. First, we allow the Decision Maker (DM) to inject previous knowledge and alleviate dataset biases in the model training process by controlling the features' effects. Physics-guided Neural Networks (PGNN) are based on similar methodologies, widely employed in physics-related applications (Daw et al., 2021). Knowledge injection, whether physics-related or not, generally involves some sort of constraint on the features' relationship with the outcome, which can be implemented, as posed by von Rueden et al. (2021), in four different ways: (i) on the training data; (ii) on the hypothesis set; (iii) on the learning algorithm; (iv) on the final hypothesis. Our methodology infuses knowledge at the learning-algorithm level, as this procedure relates to some post-hoc interpretability methods. To the best of our knowledge, these architectures were never proposed outside of the physics and engineering fields. And even in such areas, effect constraints were conditional on the context, as in Muralidhar et al. (2018), or applied to non-DL techniques (Kotłowski and Słowiński, 2009; Lauer and Bloch, 2008; von Kurnatowski et al., 2021). Our approach applies to any DL architecture and is not conditional on the features' context. We test the validity of our technique in credit risk, as the concept of Sustainable AI will considerably impact this field. Recent frameworks, such as the one proposed by Bücker et al.
(2021), do not allow for interpretability from the start as posed by the European Commission (2019). These techniques can spot biases but cannot counter them, as their scope is only explainability and not knowledge injection. Our methodology can handle both these aspects, yielding a model that is compliant with the new guidelines on Sustainable AI. Second, we allow for non-linear effects and a local lack of knowledge by defining ad-hoc knowledge functions on the model's parameters. This additional specification is necessary for two reasons. To begin, the empirical literature in credit risk agrees that the performance of DL models is due primarily to the fact that they capture non-linear patterns (Ciampi and Gordini, 2013). The second reason for including this non-linear pattern is that knowledge may be lacking in some regions of the feature space. Third, we explore the relationship between our approach and model-agnostic post-hoc interpretability methods, as in the case of Accumulated Local Effects (Apley and Zhu, 2020). These methods play two critical roles in our strategy. First, they provide the DM with graphical visualizations that support communication with non-experts. Second, they serve as sanity checks for our methodology and for an explainability-based hyperparameter optimization.
The rest of this paper is structured as follows. Section 2 covers the knowledge injection in the model and the development of the Multicriteria problem. Section 3 briefly discusses the data sample used to test our technique, the software packages, and the hardware. Section 4 summarizes the findings and examines the most important ones. Section 5 concludes.

Deep Learning
DL is an AI subfield and a type of Machine Learning technique aimed at developing systems that can operate in complex environments (Goodfellow et al., 2016). DL systems are underpinned by deep architectures (Bottou et al., 2007), which can be defined as the composition of shallow building blocks:

F(x, W) = f(f(· · · f(x, w_1) · · · , w_{L-1}), w_L),

where f(·, w) is a shallow architecture, as for example the Perceptron proposed by Rosenblatt (1958), whose formal neuron model was anticipated by McCulloch and Pitts (1943). Early architectures of this kind were trained through gradient-based methods (Saad, 1998). One of the first such architectures was the Multilayer Perceptron (MLP) (Rumelhart et al., 1986), followed by the Convolutional Neural Network proposed by LeCun et al. (1990).
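This compositional view can be sketched in a few lines of numpy. The layer sizes, the sigmoid activation, and the random weights below are illustrative assumptions for the sketch, not the paper's actual architecture:

```python
import numpy as np

def shallow(x, W, b):
    # One shallow building block f(x, w): affine map followed by a sigmoid
    return 1.0 / (1.0 + np.exp(-(W @ x + b)))

def deep(x, params):
    # Deep architecture as a composition of shallow blocks:
    # F(x, W) = f(...f(f(x, w_1), w_2)..., w_L)
    h = x
    for W, b in params:
        h = shallow(h, W, b)
    return h

rng = np.random.default_rng(0)
dims = [(8, 4), (8, 8), (1, 8)]            # 4 -> 8 -> 8 -> 1
params = [(rng.normal(size=d), rng.normal(size=d[0])) for d in dims]
out = deep(rng.normal(size=4), params)     # a single scalar score in (0, 1)
```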
This paper tests our approach on two DL architectures: the MLP and the ResNet (RN). The MLP was chosen because it is somewhat the off-the-shelf solution in many use cases, especially in credit risk (Ciampi et al., 2021) and in several works on retail credit risk (Lessmann et al., 2015). The MLP consists of a directed acyclic network of nodes organized in densely connected layers. After being weighted and shifted by a bias term, inputs are fed into each node's activation function and influence each subsequent layer up to the final output layer.
In a binary classification task, the output of a single-hidden-layer MLP can be described, as in Arifovic and Gencay (2001), by:

ŷ = σ(β_0 + Σ_{j=1}^{m} β_j G(γ_j^T x)),

where G and σ are sigmoidal activation functions and β, γ collect the network weights. The RN differs from the canonical MLP architecture in that it has "shortcut connections" that mitigate the problem of degradation in the case of many layers (He et al., 2015). Although the use of shortcut connections is not new in the literature (Venables and Ripley, 1999), the key proposal of He et al. (2015) was to use identity mapping instead of any other non-linear transformation. Figure 1 shows the smallest building block of the ResNet architecture, in which both the first and the second layers are shortcut and the inputs x are added to the output of the second layer.
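The shortcut block just described can be illustrated as follows. The ReLU activations, the dimensions, and the weight scale are assumptions made for the sketch, not He et al.'s exact block:

```python
import numpy as np

def residual_block(x, W1, b1, W2, b2):
    # Smallest ResNet building block: two dense layers whose output is added
    # to the identity-mapped input x (the "shortcut connection")
    h = np.maximum(0.0, W1 @ x + b1)   # first layer, ReLU activation
    h = W2 @ h + b2                    # second layer, pre-activation
    return np.maximum(0.0, h + x)      # identity shortcut, then activation

rng = np.random.default_rng(1)
d = 6
x = rng.normal(size=d)
W1 = rng.normal(size=(d, d)) * 0.1
W2 = rng.normal(size=(d, d)) * 0.1
b1, b2 = np.zeros(d), np.zeros(d)
y = residual_block(x, W1, b1, W2, b2)
```

With all weights set to zero, the block collapses to the identity mapping (through the final activation), which is exactly what the shortcut is meant to guarantee.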

Multicriteria optimization
Multiple Criteria Decision-Making (MCDM) is a branch of Operations Research and Management Science. MCDM represents the set of methods and processes through which concern for several conflicting criteria can be explicitly incorporated into the analytical process (Ehrgott et al., 2002).
Several strategies, spanning from a priori to interactive ones, have been developed to address this issue (Miettinen, 2012). Historically, MCDM's first roots were laid by Pareto at the end of the 19th century. However, modern MCDM has its origins more recently in the work of Koopmans (1951), who modified the canonical single-objective linear programming model by reframing it as a vector optimization problem. An MCDM problem takes the following form:

min_{x ∈ S} (f_1(x), f_2(x), …, f_k(x)),

where f_i : R^n → R is the i-th objective and the vector x contains the decision variables that belong to the feasible set S. Scalarization is a common approach for dealing with Multicriteria optimization problems: by scalarization, a vector optimization problem turns into a single-objective optimization problem. Additional concerns then affect the DM's preference scheme. In our approach, we begin by handling our problem with a weighted-sum scalarization:

min_{x ∈ S} Σ_{i=1}^{k} λ_i f_i(x),

where the weights λ_i express the relative preference of the DM toward a specific goal. The incorporation of such preferences can happen in two ways, either a priori or a posteriori. We use an a posteriori method, as this best suits a DM who may be uncertain about the relative importance of each objective.
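A minimal illustration of weighted-sum scalarization follows. The two quadratic objectives and the grid minimizer are assumptions of the sketch (a stand-in for a proper solver); sweeping the weights a posteriori traces out compromise solutions between the two conflicting goals:

```python
import numpy as np

def scalarize(objectives, weights):
    # Weighted-sum scalarization: collapse a vector of objective values
    # (f_1(x), ..., f_k(x)) into the single objective sum_i lambda_i f_i(x)
    return sum(w * f for w, f in zip(weights, objectives))

# Two conflicting objectives on a scalar decision variable x
f1 = lambda x: (x - 1.0) ** 2   # pulls the minimizer toward x = 1
f2 = lambda x: (x + 1.0) ** 2   # pulls the minimizer toward x = -1

# A posteriori exploration: sweep the preference weights and record the
# minimizer of each scalarized problem (found here on a dense grid)
grid = np.linspace(-2.0, 2.0, 4001)
front = []
for lam in (0.0, 0.25, 0.5, 0.75, 1.0):
    vals = scalarize((f1(grid), f2(grid)), (lam, 1.0 - lam))
    front.append(grid[np.argmin(vals)])
```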

Knowledge injection
Knowledge in this paper is intended as validated information about relations between entities in specific contexts, as in von Rueden et al. (2021).
The key feature of such a definition is that it allows for formalization, implying that such knowledge can somehow be transformed into mathematical constraints.
Let us assume we have a deep architecture such that ŷ = F(x, W) and we observe the true label y. We can then train a supervised model using a differentiable loss function, as in the case of regression with the mean squared error (MSE):

L_1 = (1/N) Σ_{i=1}^{N} (y_i − ŷ_i)^2,

or, in our case of binary classification, with the binary cross-entropy:

L_1 = −(1/N) Σ_{i=1}^{N} [y_i log(ŷ_i) + (1 − y_i) log(1 − ŷ_i)].

This constitutes the first objective of our Multicriteria loss function, that is, data fitting. The knowledge injection instead acts on the features' effects on the model outcome, which means that our knowledge-based objective is:

L_2 = (1/N) Σ_{i=1}^{N} ReLU(k(x_i) ⊙ ∇_x F(x_i, W)),

where the right-hand side of the Hadamard product is the gradient of our DL model at the feature level x, whereas k(x) is a function with range [−1, 1] that penalizes or favors specific signs of the gradient. Since knowledge is rarely spread through the whole feature space, a possible strategy is to define k(x) so that it maps the feature space to the knowledge we expect in that particular feature neighborhood.
In its most straightforward formulation, k(x) can be a scalar, and in this formulation it is easier to investigate its behavior at the model level. If we assume k(x) = 1, what is enforced is that all partial derivatives should be negative, therefore enforcing monotonicity, and in particular decreasing monotonicity. The opposite holds when k(x) = −1. When k(x) = 0 there is no constraint on the gradient's behavior, meaning that knowledge is non-existent and therefore not injected. Following Daw et al.
(2021), we can augment our Multicriteria function with a further term that measures network complexity, for example an L_2 regularization on the weights. The result is the following unconstrained minimization problem:

min_W  λ_1 L_1 + λ_2 L_2 + λ_3 ||W||_2^2.
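The pieces of this scalarized objective can be sketched as follows. The toy logistic-regression model, its closed-form input gradient, and the ReLU-based penalty are assumptions made so the sketch stays self-contained; they stand in for a full DL model and its autodiff gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def model(X, w):
    # Toy differentiable classifier: logistic regression stands in for F(x, W)
    return sigmoid(X @ w)

def input_gradient(X, w):
    # Closed-form gradient of the toy model w.r.t. the features: F(1 - F) * w
    p = model(X, w)
    return (p * (1.0 - p))[:, None] * w[None, :]

def multicriteria_loss(X, y, w, k, lambdas):
    # Scalarized objective: lambda_1 * BCE + lambda_2 * knowledge + lambda_3 * L2.
    # The knowledge term penalizes (via a ReLU) gradient signs that disagree
    # with the knowledge function k(x), whose range is [-1, 1].
    l1, l2, l3 = lambdas
    p = np.clip(model(X, w), 1e-12, 1.0 - 1e-12)
    bce = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    knowledge = np.mean(np.maximum(0.0, k(X) * input_gradient(X, w)))
    return l1 * bce + l2 * knowledge + l3 * np.sum(w ** 2)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = (X[:, 0] < 0).astype(float)              # outcome decreases with feature 0
k_dec = lambda X: np.array([1.0, 0.0, 0.0])  # enforce a decreasing effect of feature 0 only
w = np.array([0.5, -0.2, 0.1])
loss = multicriteria_loss(X, y, w, k_dec, (0.5, 0.3, 0.2))
```

With k = 1 on feature 0, any positive partial derivative along that feature is penalized, matching the decreasing-monotonicity reading above; flipping the sign of k removes the penalty for the same weights.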

Interpretability methods
Model interpretability is gaining popularity due to the increasing application of non-linear models (Molnar et al., 2020). One of the first techniques proposed by the literature was to use the product of the model's gradient with the feature values to explain the model output with respect to the input features (Baehrens et al., 2010). In other words, the score of a multiclass classifier for class c can be locally approximated by:

S_c(x) ≈ w_c^T x + b_c,

where w_c is the gradient of the model for that particular class c in that particular feature configuration I_0. Sundararajan et al. (2017), through an axiomatic approach, questioned the validity of such a procedure, arguing that using only the gradient may result in misleading feature attributions with respect to a baseline observation. They proposed the concept of Integrated Gradients, a path-dependent approach in which the gradients are accumulated over the linear combinations between the observation x and a baseline x′. This implies the evaluation of the following:

IG_i(x) = (x_i − x′_i) ∫_0^1 ∂F(x′ + α(x − x′)) / ∂x_i dα.
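This path integral can be approximated numerically, as in the sketch below; the midpoint-rule discretization and the toy function with a known gradient are assumptions of the illustration:

```python
import numpy as np

def integrated_gradients(grad_f, x, baseline, steps=100):
    # Midpoint-rule approximation of the path integral of gradients along the
    # straight line from the baseline to x, scaled by (x - baseline)
    alphas = (np.arange(steps) + 0.5) / steps
    path = baseline + alphas[:, None] * (x - baseline)
    return (x - baseline) * grad_f(path).mean(axis=0)

# Toy model with a known closed-form gradient: f(x) = x_1^2 + 3 x_2
f = lambda X: X[:, 0] ** 2 + 3.0 * X[:, 1]
grad_f = lambda X: np.stack([2.0 * X[:, 0], np.full(len(X), 3.0)], axis=1)

x = np.array([2.0, 1.0])
baseline = np.zeros(2)
attr = integrated_gradients(grad_f, x, baseline)
```

A useful check is the completeness axiom: the attributions sum to f(x) − f(baseline).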

Model-agnostic interpretability
One of the first model-agnostic techniques was Partial Dependence (PD), proposed by Friedman (2001). PD plots evaluate the change in the average predicted value as specified features vary over their marginal distribution (Goldstein et al., 2015). However, the main limitation of PD is dependence within features, since evaluating the PD carries the risk of computing points outside the data envelope. Apley and Zhu (2020) proposed Accumulated Local Effects (ALEs) to address this flaw: ALEs have the advantage of avoiding the evaluation of the variables' effects outside the data envelope. Computing the ALE implies the evaluation of the following:

ALE_S(x_S) = Σ_{k=1}^{k_S(x_S)} (1/n_S(k)) Σ_{i : x_S^(i) ∈ N_S(k)} [f(z_{k,S}, x_{\S}^(i)) − f(z_{k−1,S}, x_{\S}^(i))] − C,

where f is the black-box model, S is the subset of variable indices, X is the matrix containing all the variables, x is the vector containing the feature values per observation, z identifies the boundaries of the K partitions, such that z_{0,S} = min(x_S), and C is a constant term that centers the plot. To make this formula model-aware, we can substitute the finite differences with the gradient ∂f/∂x_S evaluated at (z_{k,S}, x_{\S}^(i)). Because of that, the resulting model-aware formula is:

ALE_S(x_S) = Σ_{k=1}^{k_S(x_S)} (1/n_S(k)) Σ_{i : x_S^(i) ∈ N_S(k)} ∂f/∂x_S (z_{k,S}, x_{\S}^(i)) (z_{k,S} − z_{k−1,S}) − C.

As a result, knowledge injection as proposed in the previous section will affect the ALEs, because the final model will have a different gradient than the non-knowledge-injected one.
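A first-order ALE can be approximated with finite differences over quantile bins, as in the following sketch; the number of bins, the linear toy model, and the simple per-observation centering scheme are assumptions of this illustration:

```python
import numpy as np

def ale_1d(f, X, j, K=10):
    # First-order Accumulated Local Effect of feature j: accumulate the mean
    # finite difference of f across K quantile bins, then center the curve
    z = np.quantile(X[:, j], np.linspace(0.0, 1.0, K + 1))    # bin boundaries
    idx = np.clip(np.searchsorted(z, X[:, j], side="right") - 1, 0, K - 1)
    effects = np.zeros(K)
    for k in range(K):
        rows = X[idx == k]
        if len(rows) == 0:
            continue
        lo, hi = rows.copy(), rows.copy()
        lo[:, j], hi[:, j] = z[k], z[k + 1]
        effects[k] = np.mean(f(hi) - f(lo))                   # local effect in bin k
    ale = np.concatenate([[0.0], np.cumsum(effects)])         # accumulate
    return z, ale - np.mean(ale[idx + 1])                     # centering constant C

rng = np.random.default_rng(3)
X = rng.normal(size=(300, 2))
f = lambda X: 2.0 * X[:, 0] + X[:, 1]   # linear model: the ALE of feature 0 has slope 2
z, ale = ale_1d(f, X, j=0)
```

For this linear toy model the accumulated effect across the whole feature range recovers the coefficient exactly, which is a convenient sanity check of the implementation.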

Data
To test the goodness of our approach, we provide an empirical application in the context of credit risk, in particular on the problem of bankruptcy prediction. We use a publicly available dataset of Polish enterprises donated to the UCI Machine Learning Repository by Zięba et al. (2016). The data contain information about the financial conditions of Polish companies belonging to the manufacturing sector. The dataset contains 64 financial ratios ranging from liquidity to leverage measures. Moreover, the dataset distinguishes five classification cases that depend on the forecasting period.
In our empirical setting, we focus on bankruptcy status after one year. In this subset of the data, the total number of observations is 5910, out of which only 410 represent bankrupted firms. It is worth noting that we do not counter the class imbalance, although this is commonly done in the literature. We retained the class imbalance to test the robustness of our approach even under scarcity of a particular class, and used robust metrics such as the Area Under the Receiver Operating Characteristic curve (AUROC). Moreover, as our empirical experiment focuses on model interpretability, we restricted the number of features considered to six. This is because ALEs are inspected as plots, and having a plot for each feature increases complexity without providing any additional benefit to the reader or to our approach. We chose to focus on Attr 13, Attr 16, Attr 23, Attr 24, Attr 26, and Attr 27. The attributes were selected by using a ROC-based feature selection (Kuhn and Johnson, 2019).
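For reference, the AUROC used throughout can be computed directly from raw scores via the rank-sum (Mann-Whitney U) identity; the sketch below assumes untied scores and is not the paper's implementation:

```python
import numpy as np

def auroc(y_true, scores):
    # AUROC as the probability that a randomly drawn positive is scored
    # above a randomly drawn negative (rank-sum identity).
    # Simplifying assumption: no tied scores.
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = np.asarray(y_true) == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    u = ranks[pos].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)
```

Because it is a ranking metric, the AUROC is unaffected by the prevalence of the positive class, which is why it is a robust choice under heavy class imbalance.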

Software and hardware
The overall pipeline management is built in R (R Core Team, 2020).
The preprocessing relied on the tidymodels ecosystem (Kuhn and Wickham, 2020) as well as on the tidyverse (Wickham et al., 2019). The DL models are developed in Julia (Bezanson et al., 2017) using the Flux framework (Innes et al., 2018; Innes, 2018). The interoperability between the two languages is possible via the JuliaConnectoR library (Lenz et al., 2021). To debug the model and to check the validity of our approach, we employed the ALEPlot package (Apley, 2018). As for the hardware environment, the pipeline is carried out on a local machine with 12 logical cores (Intel i7-9850H), 16 GB of RAM, and a CUDA-enabled graphics card (NVIDIA Quadro T2000). Both the Julia and R code are freely available for research reproducibility on GitLab, and an ad-hoc Docker container has been created on DockerHub.

Results
To test the performance of our approach on the Polish firms' dataset, we follow standard practice in the field of DL. At first, we split the dataset into training and testing sets: three-quarters of the dataset is used for training the model and the rest for testing its performance. For a model that contains no hyperparameters, a setting like this would suffice. However, in DL this is never the case, as these models require a thorough calibration of hyperparameters. In our setup, the hyperparameters are the elements contained in λ. Therefore, a common strategy is to use the training set for what is called hyperparameter optimization (Goodfellow et al., 2016).
The training set is thus further divided into training and validation sets, and the model is fitted and validated with different parameters. In our case, we drew ten samples from the training set using the bootstrap technique as proposed by Efron and Tibshirani (1997), and we trained our model using different combinations of hyperparameters. As for the hyperparameter search, we relied on grid search, also known as full factorial design (Montgomery, 2017). With the optimal hyperparameters, we then trained the model on the entire training set and classified the bankruptcy state on the test set.
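The bootstrap-based grid search described above can be sketched as follows. Here `fit_score` is a hypothetical callable, not part of the paper's pipeline: it is assumed to train a model on the bootstrap sample and return a validation score (e.g. AUROC) on the out-of-bag rows:

```python
import itertools
import numpy as np

def bootstrap_grid_search(fit_score, X, y, grid, n_boot=10, seed=0):
    # Full-factorial (grid) search: score every hyperparameter combination on
    # n_boot bootstrap resamples, validating on the out-of-bag rows
    rng = np.random.default_rng(seed)
    n = len(X)
    results = {}
    for combo in itertools.product(*grid.values()):
        params = dict(zip(grid.keys(), combo))
        scores = []
        for _ in range(n_boot):
            boot = rng.integers(0, n, size=n)        # sample with replacement
            oob = np.setdiff1d(np.arange(n), boot)   # out-of-bag validation rows
            if len(oob) == 0:
                continue
            scores.append(fit_score(X[boot], y[boot], X[oob], y[oob], params))
        # Mean and standard error of the validation score per combination
        results[combo] = (np.mean(scores), np.std(scores) / np.sqrt(len(scores)))
    best = max(results, key=lambda c: results[c][0])
    return best, results
```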
For a deep understanding of the results, we divided the analysis into three parts. Firstly, we performed hyperparameter optimization and subsequent hold-out testing using both the MLP and the RN, with the former being the best-performing model. Secondly, with the MLP, we analyzed the effect on the interpretation by using ALE plots. Thirdly, we tested the robustness of our approach by diminishing the amount of training data. The performances in Table 1 are promising. However, to precisely measure model generalization error, the performances that need to be considered are those from the test set. Table 2 presents these performances, taking into account only the optimally parametrized models and their baseline, that is, the model with λ_1 = 1. The clear-cut result from this table is that the MLP generalizes far better than its counterpart, with a slight decrease in performance in line with expectations. This result suggests that knowledge injection, corroborated by mild regularization, can enhance the generalization performance of a DL classifier such as the MLP and make it robust to class imbalance.

Performance review
In the following sections, we investigate further the performance of the MLP with and without knowledge injection, in terms of interpretability and robustness to data scarcity. We focus only on the MLP, as it was the most performant architecture; moreover, analyzing the interpretations of a non-performant classifier such as the RN has no useful meaning from a practical point of view.
• Attr 13: also known as the EBITDA-to-Sales ratio, is a profitability indicator. We should therefore expect it to decrease the probability of bankruptcy, especially where the ratio is positive. The opposite occurs instead: an increase of the ratio above zero increases the probability of bankruptcy. This effect is at odds with the literature on the subject, as for example in Platt and Platt (2002).
• Attr 16: is the inverse of, and a proxy for, the Debt-to-EBITDA ratio, which is a leverage ratio. For the inverse of a leverage ratio, we would expect a negative impact on bankruptcy, as in Beaver (1968).
• Attr 23: is the Net profit ratio, a productivity ratio (Lee and Choi, 2013) which tends to have a negative impact on bankruptcy.
To counter these common biased effects, we assumed the following logistic form for all the features' knowledge functions:

k(x_j) = 1 / (1 + e^{−γ x_j}),

with γ > 0 a steepness parameter. Such a knowledge function penalizes only positive effects above zero and retains the correctly captured effects below it. With this setting, in the case of moderate knowledge injection, the effects align with the literature findings.
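A possible concrete form of such a logistic knowledge function is sketched below; the steepness value gamma = 10 is an assumed choice controlling how sharply the penalty switches on above zero:

```python
import numpy as np

def k_logistic(x, gamma=10.0):
    # Logistic knowledge function: close to 1 above zero (penalizing positive
    # effects there) and close to 0 below zero (leaving the correctly
    # captured effects untouched). gamma is an assumed steepness parameter.
    return 1.0 / (1.0 + np.exp(-gamma * x))

x = np.linspace(-1.0, 1.0, 201)
k = k_logistic(x)
```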

Robustness checks
A fundamental question is how model performance deteriorates with less training data. Knowledge injection has indeed been implemented in previous works to alleviate such problems, as in von Kurnatowski et al. (2021). To discover how our approach deals with scarce data, we systematically diminished the training set and measured the corresponding performance on the test set. The results are in Table 3, which depicts the different performances on the test set as the training data proportion decreases. In concordance with the literature on knowledge injection, our approach prevents performance degradation even in extreme cases where only half the dataset is used for training.

Conclusion
In this paper, we presented a novel approach to knowledge injection at the level of the feature effects of a DL model. Model interpretability is a particularly crucial topic, and recent legislation implies interpretability from the start.
As mentioned in the previous sections, post-hoc interpretability methods are essential tools for model debugging and for inspecting any model bias. As a result, we present Figure 2, which shows the MLP ALEs with and without knowledge injection. The ALEs of the model without knowledge injection show several misbehaviors that may be due to class imbalance or hidden biases in the training sample.
(a) Accumulated Local Effects plot of the Multilayer Perceptron architecture, without regularization and knowledge injection (i.e. λ_1 = 1, λ_2 = 0.0, λ_3 = 0.0). (b) Accumulated Local Effects plot of the Multilayer Perceptron architecture, with regularization and knowledge injection optimally selected from the hyperparameter optimization procedure (i.e. λ_1 = 0.5, λ_2 = 0.2, λ_3 = 0.3).

Figure 2: Accumulated Local Effects plot of the Multilayer Perceptron architecture, with and without regularization and knowledge injection.

Table 1: Performances of the Multilayer Perceptron and Residual Network on the training set with various hyperparameter settings. The performance is measured as the mean and standard error of the Area Under the Receiver Operating Characteristic curve in each bootstrap sample. Bold values indicate the best-performing hyperparametrization for each model.

Table 2: Performances of the Multilayer Perceptron and Residual Network on the test set with validated and baseline hyperparameter settings. The performance is measured as the Area Under the Receiver Operating Characteristic curve in the test sample.

Table 3: Performances of the Multilayer Perceptron on the test set under different proportions of the train/test split. The performance is measured as the Area Under the Receiver Operating Characteristic curve in the test sample.