A novel interaction-based methodology towards explainable AI with better understanding of Pneumonia Chest X-ray Images

In the field of eXplainable AI (XAI), robust “blackbox” algorithms such as Convolutional Neural Networks (CNNs) are known for making high prediction performance. However, the ability to explain and interpret these algorithms still require innovation in the understanding of influential and, more importantly, explainable features that directly or indirectly impact the performance of predictivity. A number of methods existing in literature focus on visualization techniques but the concepts of explainability and interpretability still require rigorous definition. In view of the above needs, this paper proposes an interaction-based methodology–Influence score (I-score)—to screen out the noisy and non-informative variables in the images hence it nourishes an environment with explainable and interpretable features that are directly associated to feature predictivity. The selected features with high I-score values can be considered as a group of variables with interactive effect, hence the proposed name interaction-based methodology. We apply the proposed method on a real world application in Pneumonia Chest X-ray Image data set and produced state-of-the-art results. We demonstrate how to apply the proposed approach for more general big data problems by improving the explainability and interpretability without sacrificing the prediction performance. The contribution of this paper opens a novel angle that moves the community closer to the future pipelines of XAI problems. In investigation of Pneumonia Chest X-ray Image data, the proposed method achieves 99.7% Area-Under-Curve (AUC) using less than 20,000 parameters while its peers such as VGG16 and its upgraded versions require at least millions of parameters to achieve on-par performance. Using I-score selected explainable features allows reduction of over 98% of parameters while delivering same or even better prediction results.


Introduction 1.AI Systems for COVID-19 Chest X-ray
The outbreak of novel coronavirus SARS-Cov-2 (previously 2019-nCov, also known as COVID-19) has quickly spread to nearly every country in the world [23] [16] [24].Since the beginning, the top priority for controlling the spread of COVID-19 has been to monitor suspected cases for appropriate quarantine measures and treatment.Pathogenic laboratory tests are the standard procedure (RT-PCR, collected with the invasive nasal swab most readers will be familiar with), but the accuracy of this test suffers from poor results [23].Chest CT scans have a high sensitivity for diagnosis of coronavirus disease 2019 (COVID-19) [2] and can be primary tool for the current COVID-19 detection [2].This innovation, if utilized earlier, could have revolutionized the initial screening procedure of COVID-19 and might prevented many unnecessary deaths.Looking forward, we believe that other related diseases could use similar detection methods to prevent future epidemics and pandemics.Creating these methods now will ensure the development of testing procedures with speed, accuracy, and explainability that will be required to deploy such Artificial Intelligence (AI) systems into future testing environments.AI-centered medical imaging-based deep learning systems have already been developed Convolutional Neural Networks (CNNs) that have shown promising results in feature extraction and learning [24].However, due to increasing complexity of deep CNN models, it is difficult to explain the prediction performance to their human users.To better assist the medical community by providing and deploying Explainable Artificial Intelligence (XAI) systems for analyzing chest scans to save critical time that is necessary for disease control, we call for immediate attention to developing a self-interpretable and explainable Convolutional Neural Network architecture capable of early disease detection.

What is XAI?
In 2016, DARPA initiated the Explainable Artificial Intelligence (XAI) challenge.The goal is to build suites of machine learning algorithms that are interpretable without sacrificing prediction performance (see Figure 1).The trade-off between learning performance and the effectiveness of explanation is illustrated in Figure 1.The work for this paper delivers a novel interaction-based methodology that produce the power of interpretability and explainability while maintaining state-of-the-art learning performance.[21].It presents the relationship between learning performance (usually measured by prediction performance) and effectiveness of explanations (also known as explainability).The proposed method in our work aims to take any deep learning method and provide explainability without sacrificing prediction performance.In the diagram, the proposed method is the orange dot on the upper right corner of the relationship plot.
A popular description of interpretability states the essential element for XAI is the ability to explain or to present in understandable terms to a human [8].Another popular version states interpretability as the degree to which a human can understand the cause of a decision [19].Though intuitive, these definitions lack mathematical formality and rigorousness as certain perspectives [1].
In this paper, we regard the core issue of an XAI problem to be "how features or variables are used to produce the prediction performance".In other words, the effectiveness of explanations in the DARPA document (see Figure 1) is innately a variable set assessment and selection problem.This means the explainability and interpretability of a machine learning algorithm is directly referring to what measures statisticians use to assess how the features or variables affect the final prediction results.In order to establish accountability, responsibility, and transparency in an AI system, one must first address an explainable and interpretable measure to assess the feature importance.We define the following three dimensions, D1, D2, and D3, for a measure to be interpretable and explainable from statistical and variable assessment perspective.In other words, these three dimensions, denoted by D1, D2, and D3, serve as the key premises of the definition of an explainable and interpretable measure.D1.A quantifiable measure for interpretability and explainability does not need to rely on the knowledge of correct specification of the underlying model.D2.Any desirable measure for interpretability and explainability should state clearly how a combination of explanatory variables influence the response variable.Moreover, it can directly compute a score for a set of variables in order to make reasonable comparisons.This means any additional influential variables should raise this measure while any existence of redundant or noisy variables should decrease the magnitude of this measure.

D3.
A measure appropriate for interpretability and explainability needs to state clearly the impact that the influence explanatory variables have on the predictivity of response variable.In other words, this measure should directly speaks for the predictivity (see equation [2] of [17]) of the explanatory variables.

Problems in Image Classification and Deep CNN
In the field of image classification, Convolutional Neural Networks (CNNs) are well-known for their high prediction performances in image classification [15] [13] [22] [11] [10] [6].Though CNNs are successful in its application areas, there is no theoretical proof to explain why it performs so well [3] especially when using a large size of filters [12] (See Table 1).Another challenge is the underlying assumption of using learned feature-maps due to large filter size [12] [13].Hence, it is difficult to explain how every parameter contributes to the prediction performance and to deliver explainability that satisfy the three dimensions (D1, D2, and D3) defined above.
Figure 2: This executive diagram summarizes the key components of the proposed methods of this paper.We start with the COVID-19 Image Data.With a small rolling window defined, we execute the Backward Dropping Algorithm (BDA) to select the important features within this window.Next, the BDA could select {X 1 , X 2 } as a variable module.Then we can construct a new variable using the technique of Interaction-based Feature Engineer (see the construction of X † in equation 4 to appreciate this new design).For more detailed discussion, please see Appendix 5.8.
Table 1: Summary of some well-known CNNs and their number of parameters.

An Interaction-based Convolutional Neural Network (ICNN) to Address XAI Problems
Chernoff, Lo, and Zheng (2009) [5] presents a general intensive approach, based on a method pioneered by Lo and Zheng (2002) [18] for detecting which, out of many potential explanatory variables, have an influence (impact) on a dependent variable Y .This paper presents an interaction-based feature selection methodology incorporating the notion of influence score, I-score, as a major technique to detect the higher-order interactions in complex and large-scale data set.Our work investigates the potential usage of I-score and a novel deep learning framework.The executive diagram can be seen in Figure 2 which outlines the road-map of the proposed methodology and the architecture of a novel Interaction-based Convolutional Neural Network (ICNN).This novel architecture takes full advantage of I-score and Backward Dropping Algorithm.In other words, it produces convolutional layers that are self-interpretable which provides explainable power to the features of image data at any convolutional layer if this design is implemented.In the following section, we discuss the contribution of our work and why the proposed method satisfies the three dimensions defined in §1.With all three dimensions satisfied (D1, D2, and D3), the proposed design of an Interaction-based Convolutional Neural Network (ICNN) is the ideal candidate to address XAI problems.
The key novelty of our proposed approach (see Executive Design in Figure 2) rests on the collection of features identified by I-score.Based on the contributions described above, the proposed method is model-free and hence checks the first condition D1.It describes a quantifiable measure of how much a combination of explanatory variables impact the outcome variable which allows statisticians to make comparisons and screen for influential features.This phenomenon checks the second dimension D2.In addition, the statistics, I-score, provides a measurement for explanatory variables that is directly associated with the predictivity that the explanatory variables have on outcome variable which satisfies the third condition D3.With all three dimensions (D1, D2, and D3) satisfied, the design of the proposed architecture presents an Interaction-based Convolutional Neural Network (ICNN) that is innately interpretable and explainable.It extracts influential information from the image data and generates explanatory features that directly associate with the predictivity of the data.Because all three dimensions are met, the proposed design of CNN is interpretable and it serves as the touchstone in the field of XAI.For more detailed discussion, please see Appendix 5.1.
2 Proposed Method 2.1 Influence Score (I-score) Assume that we have response variable Y to be binary (taking values 0 and 1) and all explanatory variables to be discrete.Consider the partition P k generated by a subset of k explanatory variables {X b1 , ..., X b k }.Assume all variables in this subset to be binary.Then we have 2 k partition elements; see the first paragraph of Section 3 in (Chernoff et al., 2009 [5]).Let n 1 (j) be the number of observations with Y = 1 in partition element j.Let n(j) = n j × π 1 be the expected number of Y = 1 in element j.Under the null hypothesis the subset of explanatory variables has no association with Y , where n j is the total number of observations in element j and π 1 is the proportion of Y = 1 observations in the sample.
In Lo and Zheng (2002) [18], the influence score is defined as The statistics I is the summation of squared deviations of frequency of Y from what is expected under the null hypothesis.There are two properties associated with the statistics I. First, the measure I is non-parametric which requires no need to specify a model for the joint effect of {X b1 , ..., X b k } on Y .This measure I is created to describe the discrepancy between the conditional means of Y on {X b1 , ..., X b k } disregard the form of conditional distribution.Secondly, under the null hypothesis that the subset has no influence on Y , the expectation of I remains non-increasing when dropping variables from the subset.The second property makes I fundamentally different from the Pearson's χ 2 statistic whose expectation is dependent on the degrees of freedom and hence on the number of variables selected to define the partition.We can rewrite statistics I in its general form when Y is not necessarily discrete where Ȳj is the average of Y -observations over the jth partition element (local average) and Ȳ is the global average.Under the same null hypothesis, it is shown (Chernoff et al., 2009 [5]) that the normalized I, I/nσ 2 (where σ 2 is the variance of Y ), is asymptotically distributed as a weighted sum of independent χ 2 random variables of one degree of freedom each such that the total weight is less than one.It is precisely this property that serves the theoretical foundation for the following algorithm.

Backward Dropping Algorithm (BDA)
The Backward Dropping Algorithm is a greedy algorithm to search for the optimal subsets of variables that maximizes the I-score through step-wise elimination of variables from an initial subset sampled in some way from the variable space.The steps of the algorithm are as follows.

Interaction-based Convolutional Layer
This section proposes an Interaction-based Convolutional Neural Network.This action is presented in Figure 3.
In this diagram, we start with an image that is sized 64 by 64 and suppose we use a window that has shape 4 by 4. Thus, this small window has 16 pixels locally.This set of 16 pixels can be converted into binary variables which hence gives us well defined partition.For example, we can discretize these pixels into black and white.In other words, each pixel takes value 1 or 0 (two levels) and the set of 16 pixels would give us a partition that is sized 2 16 .This is the set up for us to run Backward Dropping Algorithm.Based on the definition of the Backward Dropping Algorithm, each round we take turns dropping one variable iteratively.Every time we drop a variable, we compute I-score.We drop the variable at each step such that the I-score is the highest if that variable is dropped.Using this procedure, we are able to select a subset of variables out of the original 16 pixels.A visualization of the real application for this proposed convolutional layer is presented in Row (b) Figure 5.

Interaction-based Feature Engineer
This subsection we define a procedure to mechanically engineer an interaction-based feature.
For an image that is sized 3 × 3 (that is, this image has 9 pixels).We can write these pixels as X 1 , X 2 , ..., X 9 .Consider a small window of 2 × 2 passing from the first row and the first column of this 3 × 3 image.This means we start with {X 1 , X 2 , X 4 , X 5 } within this small 2 by 2 window in this example.We use the Backward Dropping Algorithm to narrow down to {X 1 , X 2 } because we observe this subset of variables deliver us the highest I-score.Since both X 1 and X 2 are discretized into two partitions, we have partition Π {X1,X2} well defined (see equation 3).In other words, assume X 1 and X 2 both only take values 0 and 1.Then the partition Π {X1,X2} is defined as the following In this case, each partition π j while j ∈ {1, 2, 3, 4} we can compute the local mean of response variable Y from observations in training set.Hence, a new interaction-based feature can be constructed using X † := ȳj while ȳj is the local mean of Y based on the j th partition ∈ Π.In general situation, let us consider an input matrix with size s in by s in .Suppose window size is w × w and the stride level to be l (notice in the above example w = 2 and l = 1).By, s out = sin−w l + 1 , we can compute output matrix with size s out by s out (which coincide with the X † matrix).In this case, we define running index b to take values {1, 2, 3, ...s out × s out }.
while X † b is defined as ȳj using observations in training set while j indicates the j th partition of Π that is formed by the variables selected from the b round of Backward Dropping Algorithm.Hence, the relationship from a 3-by-3 input Table 2: This table summarizes the dimension of the data.We download the COVID data from the work by [20].This totalled 576 COVID images and 2,000 non-COVID images.First, we split test set from the total images.The test consists of 60 COVID cases and 60 non-COVID cases.Next, we are left with 516 images for COVID class and 1,940 images for non-COVID class.This is in-sample set which we use for training and validating (short of Tr. and Val.below).For in-sample set, we upsample the images by adding noises drawn from normal distribution.This gives us 5,000 COVID images and 5,000 non-COVID images for training and testing.The out-of-sample test set has 120 observations and this is only used in the end to check the learning performance.
5,000 5,000 (upsampled) matrix into a 2-by-2 output matrix can be visualized using the following diagram input: while the input matrix has size 3 by 3 and the output matrix has size 2 by 2. In the output matrix, the X † 's are defined using equation 4 with observations in training set.

Data
The data is sourced from the work by [20].For detailed discussion of the source of data, please see Appendix 5.6.We present a brief table for the data in Table 2. From observation of Figure 4, it is visible for humans that the chest areas are clear in the healthy chest X-ray images.However, the images for COVID cases are not quite clear.This is an indication that there are inflammatory cells or other related body fluids filled in the chests.Instead of air which shows up on the pictures to be clear area, these areas in COVID cases tend to be cloudy and unclear.

Performance: Proposed Models
This subsection we present the experimental results.A brief summary of comparison of conventional methods literature and the proposed method is presented in Table 3.A detailed report of the proposed methodology is presented in Table 4.We also plot the AUC values for the proposed models in Figure 9.
The report in Table 3 shows results of previous work on this data set.Minaee et al. ( 2020) [20] used a 50-layer Convolutional Neural Network, ResNet50, and delivered a value of AUC at 99%.Their work also proposed SqueezeNet that delivered an even higher performance at 99.2% [20].This near-perfect performance is largely due to the design of We read the experimental results of the proposed model in the following order.As stated in section §3.3, the proposed architecture can be designed as deep or wide as the user desires.All models start with input images with dimensions 128 by 128.In other words, the input data has 128 × 128 = 16, 384 pixels.For simplicity of notation, let us denote to be the set of parameters of starting point of 6, window size of 2 by 2, and a stride level of 2. Let us further denote to be the collection of parameters of starting point of 1, window size of 2 by 2, and a stride level of 2.
For the training of the proposed model and the tuning parameters, please see Appendix 5.2, 5.3, and 5.4.
Model 2 -6.We also provided more updated versions of Model 1. Please see appendix.5.7

Visualization: Images and Convolutional Layer
This section presents visualization of the proposed architecture.These visualizations are presented in Figure 5. Unlike Figure 2 that is an executive summary with each position representing many samples, these visualizations in Figure 5 are sample-wise plots.In other words, the 10 original images that are sized 128 by 128 in Panel A and Panel B are the same samples in the second row, 1st Conv.Layer, and the third row, 2nd Conv.Layer.

Visualization Interpretation
The plot in Figure 5 of the original images for COVID-19 patients has grey and cloudy textures in chest area.Because an X-ray picture is at its brightest when most of the light beams emitted are bounced back from the object, we can observe bones to be the color "white" while the margin to be completely "black".For muscle and organs inside human body, X-ray beams that are emitted can only partially be collected and this causes the greyscale on the X-ray images in chest area.For COVID-19 patients, there are grey and shaded area in the chest X-ray pictures.This is due to the inflammatory fluid when patients exhibit pneumonia-like symptoms.The fluid inside chest area is a consequence of human immune system fighting against outside diseases.This shaded (as seen in Panel A of Figure 5) prevents us from observing the clear location of lungs.This is different in Panel B where the lung areas are dark and almost black, because a healthy lung is filled with air (i.e.normal cases and X-ray image presents color black).The black and white contrast in the two panels is directly related to how much inflammatory fluid there is in human lungs.This contrast translates to greyscale on pictures and it is directly related with COVID cases and non-COVID cases (i.e.response variable Y ).The same contrast can be seen using the new variables (these are X † 's based on equation 4) in the 1st Conv.Layer (sized 61 by 61).For COVID-19 patients, the lung area is cloudy and unclear while the healthy cases it is clearly visible.This is not a surprising coincidence because the proposed new variable modules, X † 's, are engineered using equation 4 which relies on the response variable ȳj in training set.The images sized 61 by 61 from the proposed algorithm is a direct translation of not only the original pixels but also response variable.In other words, this visualization presents how I-score sees image data.

Conclusions
Explainable AI System for Early COVID-19 Screening As the most important contribution of this paper, an Explainable Artificial Intelligence (XAI) system is proposed to assist radiologists for the initial screening of COVID-19 and other related diseases using chest X-ray images for treatment and disease control with accountability, responsibility, and transparency to human users and patients.
A Heuristic and Theoretical Framework of XAI.This paper introduces a heuristic and theoretical framework for addressing the XAI problems in large-scale and high-dimensional data sets.We provide three dimensions (D1, D2, D3) as necessary conditions and premises required for a measure to be regarded as explainable and interpretable.

An Interaction-based Convolutional Neural Network (ICNN).
To address the XAI problems heuristically described above, this paper introduced a novel design of an explainable and self-interpretable Interaction-based Convolutional Neural Network (ICNN) to contribute to the major issues about explainability, interpretability, transparency, and trustworthiness in black-box algorithms.We introduce and implement a non-parametric and interaction-based feature selection methodology and use this as replacement of pre-defined filters that are widely used in ultra-deep CNNs.Under this paradigm, we present an Interaction-based Convolutional Neural Network (ICNN) that extract important features.We believe that the proposed design can be adapted in any type of CNN.Thus, any CNN architecture that adapts the proposed technology can be regarded as Interaction-based Convolutional Neural Network (ICNN or Interaction-based Network).We encourage both the statistics and computer science communities to further explore this area to deliver more transparency, trustworthiness, and accountability to deep learning algorithms and to build a world with truly Responsible A.I..The illustration of the proposed Interaction-based Convolutional Nerual Network (ICNN) is presented in Figure 2.This executive diagram walks reader through a step-by-step procedure how the proposed network architecture is designed and why it satisfies the three dimensions (D1, D2, and D3) in the definition of interpretability and hence can be the benchmark address XAI problems.
Proposed Architecture.The executive diagram for the proposed architecture is presented in Figure 2. First, the architecture starts with image data that consists of X-ray pictures sized 128 by 128 (see detailed discussion of COVID-19 data set in §4).The architecture proposes using a rolling window with size 2 by 2 (we use a 2 by 2 window for simplicity and larger sizes can be applicable as well in practice depending on the data).Since the window size is 2 by 2, this means there would be four variables every time the window rests on a certain location of the image.Within this subset of variables, we execute the proposed method of Backward Dropping Algorithm (BDA).This procedure finely selects a subset of variables that is highly predictive by omitting the noisy variables in this small neighborhood on the image.Next, the selected variables (which can be any subset of the original four) go through a proposed procedure called Interaction-based Feature Engineer (see 4 for definition).The procedure of BDA is illustrated in the bottom left corner of the Figure 2 (we use a 2 by 2 window for demonstration purpose).In addition, we set the starting point to be 12 and this means we start from the pixel at the 12th row and the 12th column.Next, the proposed architecture is transparent at disclosing to its human users what location of the image is important at making decisions about prediction and what location is noisy.In every Interaction-based Convolutional Layer, we can compute the proposed measure I-score for any single variable.We can also compute I-score and finely select predictive variables from any combination of variables.The larger the I-score values the more important the variables are at making predictions.With simple visualization presented in Figure 2, we are able to use a spectrum of different colors to illustrate this phenomenon.We can code high I-score values to be one side of the color spectrum and the low I-score values to be the opposite side.The areas that are informative have high I-score values and have very different color than areas that are noisy.This characteristics of I-score allows the statistician to make comparisons and variable selection assessment.Therefore, it satisfies the second dimension, D2, defined in §1.
Third, the proposed architecture has a direct association with the predictivity (see equation [2] of [17]) of a variable set.This means that the important features screened by I-score provide the statistician how much impact this variable set has on the response variable.In addition, it is beneficial to be able to compute I-score for any variable in any Interaction-based Convolution Layer which implies that the association with the predictivity is well stated each step in the architecture.This can be visualized using the AUC values (see §4.As a summary, we have identified that the proposed Interaction-based Convolutional Neural Network (ICNN) relies on I-score which is a measure satisfies the three dimensions of explainable and interpretable measure.We regard the proposed ICNN to be an explainable and interpretable deep learning algorithm.Thus far, we have not been able to identify other methods and procedures that satisfies all three dimensions defined in §1.

Model Training
For the neural network classifier adapted in this paper, we use a standard neural network architecture with some basic designs (see the diagram after equation 5).This procedure has input variables (known as input layer), an optional hidden layer, and also output layer.The input layer are sets of variables ready to be fed into the machine to build classifier.The hidden layer can consist of any number of units and it is an optional design to deepen the network architecture.The output layer is constructed in order to compare with the output variable.This comparison can be quantified by a loss function.The path from input layer to output layer completes the procedure of forward propagation.Based on the loss function, we are able to compute the gradient which allows us to conduct optimize the weights of the architecture backwards by using gradient descent (or some upgraded version of gradient descent).This completes backward propagation.The training of a neural network model is based on many iterations of forward and backward propagation.
Forward Propagation.To illustrate the procedure of model training.Let us consider a set of input variables to be In the proposed work, this is referring to the variable modules, also notated as X † 's, that we created using Interaction-based Feature Engineer (see equation 4).For this discussion, we define a set of weights {w 1 , w 2 , w 3 } to construct a linear transformation.The symbol Σ below in the following diagram represents this linear transformation that takes the form For simplicity of notation, we write Σ = 3 j=1 w j X † j in short.Then we denote a(•) as an activation function.We choose sigmoid to be this activation function a(•).This means we have output ŷ to be defined as a(Σ).In other words, let us write the following The general form (assuming there are p variable modules) is expressed below Architecture.This architecture of neural network is presented below.For simplicity of drawing this picture, we assume there are 3 input variable modules, {X † 1 , X † 2 , X † 3 }.In practice, the number of variable modules (the total number of X † 's) depends on image data dimensions, window size, stride level, and starting point (please see §4.3 and equation 8 for the exact calculation).
Figure 6: The architecture below presents a feed forward neural network with three interaction-based input variables.The input variables are {X † 1 , X † 2 , X † 3 } which are variable modules created using equation 4.

Inputs
For the loss function, we used the binary cross-entropy loss function.This loss function is designed to minimize the distance between a target probability distribution P and an estimated target distribution Q when the task is a two-class classification problem.The cross-entropy loss function is defined as the following where y i is the ground truth of response variable for the i th observation and ŷi is the predicted value of response variable for the i th observation.The linear transformation, non-linear transformation, and the computation of the loss function completes the forward propagation.
Backward Propagation.To search for the optimal weights, we use an optimizer algorithm called RMSprop (short for Root Mean Square Propagation, a named suggested by Geoffrey Hinton).With the loss function computed above, we derive the gradient of the loss function to be ∇L := ∂L(y, ŷ)/∂w.At each iteration t, we compute v t,∇L := βv t−1,∇L + (1 − β)∇L 2 while β is a tuning parameter.Note that the square term on ∇L is element-wise multiplication.Then we can update the weights using w t := w t−1 − η • ∇L/ √ v t,∇L while η is learning rate.The value of learning rate η is a tuning parameter and it is usually a very small number.This process starts with the loss function and goes back to the beginning to update the weights w = {w 1 , w 2 , w 3 }.Hence, it earns the name backward propagation.

Model Parameters
This section investigates the tuning parameter of proposed method.In the design of Interaction-based Convolutional Neural Network (ICNN), the window size and the level of stride are hyper-parameters.
Window Size.Window size is the size of the local area that we narrow down to run the Backward Dropping Algorithm.For example, in the first Artificial Example in §2.3, the data is sized 6 × 6.A window size of 2 × 2 means that we are starting with a local area that investigates the first row and the first two columns and the second row and the first two columns.For each row i and each column j in this 6 × 6 grid structure, a window size of 2 × 2 means a local area of the following 2 by 2 matrix (i, j) For a 2 by 2 matrix, the Backward Dropping Algorithm will iteratively drop a variable amongst these 4 variables to compute I-score.The result will be a subset of this four variables.
We can also change the size of this window to be 3 × 3 which means we investigate the following matrix This means that the Backward Dropping Algorithm will start with 9 variables.After omitting the noisy variables within this set, the resulting predictive set will be a subset of these 9 variables.
Stride Level.The level of stride is how many rows or columns that get skipped over.This tuning parameter allows the algorithm to move faster but it makes sacrifice by skipping some variables.For example, we investigate stride level of 1 starting from row i and column j.Assume we use a 2 × 2 window and let us start from (i, j).We can visualize this action by using the following diagram Original matrix: (i, j) If we are at the end of the column for a row, we move down by moving to the first column of the next row.For example, in a grid structure of size 6 × 6, assume we are in the last position in a certain row i.The action of stride level 1 can be taken using the following diagram Original matrix in the end of a row: Again assume we are at row i and column j.Suppose we set stride level to be 2 and we want to move down.This means we increase increment of 2 on the number of rows and the action is the following Original matrix: If this window happens to be in the final position of a row, then we move down by skipping one row and we start with the first column.If we have a grid structure of size 6 × 6, this action is shown in the following diagram Original matrix in the end of a row: Starting Point.Another tuning parameter that we recommend to adjust is the starting point.The starting point represents the location of the first pixel that we start the proposed operation.The vanilla starting point is to start the rolling window from the pixel located at the first row and the first column.This can be illustrated in the following matrix Starting point = 1 : while the starting point is colored in red.
Alternatively, we can initiate the starting point to be at a higher level such as two or three.This allows algorithms to run more efficiently in large-scale data sets.For a simple example, in a matrix that is sized 6 × 6 (see §2.3 for the first artificial example), we have the first row of variables to be {X 1 , X 2 , ..., X 6 } and the second row of variables to be {X 7 , X 8 , ...X 12 }.At a starting point of 2, we start with X 8 to pass over the rolling window, because this variable sits at the position with the second row and the second column.This can be illustrated in the following matrix with the starting point colored in red.
This tuning parameter is particularly useful when the edge of the images are noisy and non-informative.For example, in Section §4.1, we notice from training images of COVID-19 Chest X-ray data set that the margins of X-ray images are dark and do not have human body within the range of a few pixels length.This is an example when this tuning parameter can be helpful and it speeds up the training process.
Computation of Dimensions.The above discussion introduced the tuning parameters of window size, stride level, and starting point.These parameters update our input matrix and generate a new matrix with different sizes.Let us denote window size to be w, stride level to be l, and starting point to be p.Given an input matrix with size s in by s in , the output matrix has new dimensions computed as the following For simplicity of this investigation, we assume input matrices to be a square.In other words, in the application of this paper, we process the input images to have the same width and height, i.e. 128 by 128 pixels.For future work, this can be generalized into different shapes.

Evaluation Metrics
This paper focuses on using Area-Under-Curve (or AUC) as the main evaluation metric.The AUC value is obtained from the Receiver Operating Characteristic (ROC) curve, a path constructed using different pairs of specificity and sensitivity.
Sensitivity and Specificity.The notion of sensitivity is interchangeable with recall or true positive rate.In a simple two-class classification problem, the goal is to investigate covariate matrix X in order to produce an estimated value of Y .From the output of a Neural Network model, the predicted values are always between 0 and 1, which acts as a probabilistic statement to describe the chance an observation is class 1 or 0. Given a threshold between 0 and 1, we can compute sensitivity to be the following On the other hand, specificity is also used to create ROC curve.Given a certain threshold between 0 and 1, we can compute specificity using the following Specificity = True Negative Negative = # of Images Classified Non-COVID Correctly # of Non-COVID Images (10) Given different thresholds, a list of pair of sensitivity and specificity can be created.The Area-Under-Curve (AUC) is the area under the path plotted using pairs of sensitivity and specificity that is generated using different thresholds.Area-Under-Curve (AUC).The AUC value is a single number derived from a predicted probability by a classification rule and the true label [9].Given a vector of true label Y and a vector of predicted probability Ŷ , we can use statistical package "pROC"1 to assist this computation.The package uses automatically generated thresholds to convert Ŷ into binary format.For example, we can use threshold of t 1 = 0.3 to convert a vector of predicted probabilities Ŷ = [0, 0.2, 0.4, 0.8] into binary form by writing Ŷt1 = 1( Ŷ > t 1 ) = [0, 0, 1, 1].Let us assume the true label to be Y = [0, 0, 1, 1].Thus, we can compute specificity to be 1 and sensitivity to be 1.We can then change threshold to a different value to compute another pair of specificity and sensitivity.By tracking all pairs of specificity and sensitivity, we can generate a curve called Receiver Operating Characteristic (ROC) [9].The value of Area-Under-Curve is exactly the area under the ROC.Assume the predicted probability to have some mistakes.In other words, let us assume the predicted probability vector to be Ŷ = [0, 0.2, 0.2, 0.8].It is not possible for two observations to have the same probability predicted while they come from different classes.Therefore, there must be a mistake in one of them.This information is reflected using the same procedure.Please see Table 7.

Simulation with Artificial Examples
In this section, we illustrate proposed methodologies on the following artificial examples.

Artificial Example: Variable/Feature Investigation
The first artificial example we demonstrate the procedure to construct Interaction-based Convolutional Layer and engineer Interaction-based Features.
Let us consider the following experiment.We create an artificial data set with 500 data points in training set and 10,000 data points in testing set.These observations are randomly drawn from Bernoulli(1/2) independently.In other words, we independently generate X i ∼ Bernoulli(1/2) while i = {1, 2, 3, ..., 36}.We define the underlying model to be the following (here known as model 11), This model 11 is a two-module example.The first module is a two-way interaction in modulo 2. The second module is a three-way interaction in modulo 2. The response variable Y is defined to be the first module 50% and the second module 50%.In this setup, the individual explanatory variable does not have any predictive power on the response variable Y .
Scenario I. Assume the statistician knows the model in this simulation.This means he is fully aware of S 1 = {X 1 , X 2 } to be an important variable set and S 2 = {X 3 , X 4 , X 5 } to be the other.In other words, he can use the first module as a predictor to make predictions on response variable Y .He is able to compute the theoretical prediction rate of the first variable set to be 75%.This is because the response variable is defined to be the first variable module S 1 = {X 1 , X 2 } exactly 50% of the times so S 1 is able to guess Y correctly at least 50% of the times.The rest 50% of the times the response variable is defined as the second variable module S 2 = {X 3 , X 4 , X 5 }.Since there is no marginal signal, the performance is exactly like random guessing and hence only get the rest of the occurrences correct 50% of the times.In other words, assuming training sample size has n data points, we can write the following This result can be extended to the other variable module as well.If this statistician uses the second variable module S 2 = {X 3 , X 4 , X 5 }, then a similar calculation can be carried out in the following.
The theoretical prediction rate for model 11 is the maximum percentage accuracy delivered by S 1 and S 2 .In this case, we have max(θ c (S 1 ), θ c (S 2 )) = 0.75.Scenario II.In practice, it is often the case that we do not observe exactly what the underlying model is in a given data set.The recommendation is to use I-score to select the important local information and then create an interaction-based feature based on the selected variable.The diagram for this action can be seen in Figure 8.
Classical procedure of Convolutional Neural Network starts with many pre-defined filters (or kernels) from a previous data set.The size of this filter can be 2 × 2, 3 × 3, or any other higher dimension.The procedure starts with the first row and the first column.Then the process takes the filter to pass it over across rows and columns in the image.The proposed methodology shares similar characteristics.However, instead of using a pre-trained filter, we apply the Backward Dropping Algorithm in this local area.
Constructing the Proposed Convolutional Layer.In the proposed artificial example, we have a data set with 36 variables.This means each observation can be reshaped into 6 × 6 grid structure.In other words, each observation may be considered as an image.The first row and the first column is variable X 1 .The first row and the last column will be variable X 6 .This means we can arrange these variables into the following structure Originally: and the red colored 4 variables are first used to the Backward Dropping Algorithm.If the window size is 2 × 2, we would use {X 1 , X 2 , X 7 , X 8 } and execute the Backward Dropping Algorithm on the variables within this window.
Notice that in the diagram (see Figure 8) the blue region indicates a local area where we run the Backward Dropping Algorithm.The pink region indicates the engineering workflow of building Interaction-based Feature using equation 4.
Each window we run Backward Dropping Algorithm.The procedure adopts the steps introduced in §2.2.The steps can be seen in Table 5.First, we start with all 4 variables which are {X 1 , X 2 , X 7 , X 8 }.For each step, we take turns dropping one variable and compute the I-score for the remaining variables.It can be seen that X 8 should be dropped because I-score raises from 160.18 in step 1 to 319.65 to step 2. We conduct the same procedure in step 2 and realize The artificial data has 6 × 6 = 36 variables which can be arranged in a grid structure with shape 6 by 6.We use a window size that is 2 by 2. By passing this window from top left corner of the original 6-by-6 matrix, we would create a new matrix that has dimension 5 by 5.
The number 32 is the number of units in the hidden layer.In this example, there is one hidden layer which is sufficient for the dimension of the data in the example.that X 7 needs to be dropped.Then I-score can increase to 638.17.We can keep dropping variables, but statistician realizes that I-score is the highest when there is only two variables, {X 1 , X 2 }, left.Hence, we put an up-arrow at the step where it indicates the most optimal variable selection (see up-arrow in Table 5).In this experiment, the most optimal selection is the variable module {X 1 , X 2 } with an I-score of 638.17.
Table 5: The table illustrates the steps of the Backward Dropping Algorithm in a 2 × 2 window. Step X 2 I-score 160.18 319.65 638.17 0.65 Each local area with a window size of 2 × 2 we discuss in the above how to finely select the important variable.Similar to the flavor of standard procedure in Convolutional Neural Network to use filter on a local area of an image, the proposed method above also investigate local information.However, instead of relying on many filters that are pre-defined, the proposed approach uses I-score, a model free influence measure that directly associates predictivity.After the variable module is selected within a local window, we adopts formula 4 to create Interaction-based Features.For example, in the local window above, we narrow down to variable module {X 1 , X 2 }.This means we can create the first Interaction-based Feature to be the following while j = {1, 2, 3, 4} and this corresponds to the 2 2 = 4 partitions generated by {X 1 , X 2 }.This gives us the first predictor to allow us to build classifier and to make predictions.
The experimental results from the first artificial example can be seen in Table 6, Table 7 and Table 8.Suppose a statistician knows the model.The theoretical prediction rate is 75% (which is derived using formula 12 and 13).In reality, we suppose that a statistician have no knowledge of the data set.In this case, he can directly proceed using Neural Network or even Convolutional Neural Network to make predictions.However, without the correct model specification the performance for these models are sub-par.We can see that Neural Network and Convolutional Neural Network both performed 50% on test set.This performance is rather like random guessing.Alternatively, this statistician can proceed the experiment using the proposed methods.First, one can generate Interaction-based Convolutional Layer.Instead of using many pre-trained filters for feature mapping, the proposed method adopts I-score and the Backward Dropping Algorithm to construct X † {X1,X2} , X † {X8,X9,X21} , and so on.The proposed methodology also has an exact I-score that is associated to each variable module which allow us to rank feature importance.With I-score, it is obvious that the first variable module X † {X1,X2} is much more influential than the other variable modules.[20] have demonstrated using deep CNNs including ResNet18, ResNet50, DenseNet-121 to classify COVID-19 disease using X-ray images.They have achieved sensitivity rate of 98% and specificity of 90%. on 5,000 Chest X-ray images.

Models
We discuss updated versions of Model 1 in the following.
Model 2. This model builds up the architecture of Model 1.The only difference is that there is one hidden layer with 64 units (or neurons).We fully connect each variable in 1st Conv.layer with each neurons in the hidden layer and afterwards we fully connect the hidden layer with the output layer.This means that from 1st Conv.Layer to the hidden layer there are 3, 721 × 64 = 238, 144 parameters.From the hidden layer of 64 neurons to output layer with 2 units, there are 64 × 2 = 128 parameters.This means in total there are 238, 144 + 128 = 238, 272 parameters.The performance for this architecture is 99.7%.The design of this one hidden layer with 64 units reduced the error rate from 1.5% in Model 1 to 0.3% in Model 2, which is 80% error reduction.
Model 3.This model aims to build two Interaction-based Convolutional Layer.The 1st Conv.Layer uses the set of parameters in and the 2nd Conv.Layer uses the set of parameters in .From the 1st Conv.Layer in Model 1, we are left with 61 × 61 = 3, 721 variables.Using the parameters in , we have new matrix with size (61 − 1 − 2 + 1)/2 + 1 × (61 − 1 − 2 + 1)/2 + 1 = 30 × 30 = 900 variables.These 900 variables can be input layer and we can directly pass these 900 variables into the output layer for making predictions.In other words, the output layer has 900 × 2 = 1, 800 parameters.The test set AUC value is 97.0%.
Model 4. This model is the deepest amongst all six models.More specifically, Model 4 has two Interaction-based Convolutional Layers and one hidden layer.From the 2nd Conv.Layer in Model 3, we are left with 900 variables.The architecture has one hidden layer with 64 units.The 900 variables are fully connected with the hidden layer which create 900 × 64 = 57, 600 variables.From the hidden layer with 64 units to output layer with 2 units, there are 64 × 2 = 128 parameters.In total, there are 57, 600 + 128 = 57, 728 parameters.The prediction performance is 99.6% on test set.Model 5.Both Model 5 and Model 6 build wider convolutional layers instead of aiming for depth.Model 5 has a concatenated of features from both convolutional layers.This means the architecture takes the 7,442 variables from the 1st Conv.Layer and 900 variables from the 2nd Conv.Layer from previous models together as one large convolutional layer.In other words, Model 5 has 1st Conv.Layer of 3, 721 + 900 = 4, 621 variables.These 4,621 variables can be used directly to be fed into the output layer with 2 units.In total, this architecture creates 4, 621 × 2 = 9, 242 parameters with test set performance to be 98.3%.Model 6.The last model, Model 6, is just as wide as the previous model, Model 5. Model 6 also has a first convolutional layer that is concatenation of features.It has 3, 721 + 900 = 4, 621 variables.In addition to Model 5, it has one hidden layer with 64 units.We fully connect the convolutional layer of 4,621 variables with the hidden layer of 64 variables.This gives us 4, 621 × 64 = 295, 744 parameters.The hidden layer with 64 units are then fully connected with the output layer which gives us 64 × 2 = 128 parameters.In total, the model has 295, 744 + 128 = 295, 872 parameters.This model has the highest AUC value on test set, i.e. 99.8%.

Visualization
Discussion for Figure 2.This executive diagram summarizes the key components of the proposed methods of this paper.We start with the COVID-19 Image Data.With a small rolling window defined, we execute the Backward Dropping Algorithm to select the important features within this window.For example, the rolling window may cover 4 variables, {X 1 , X 2 , X 3 , X 4 } and BDA could select {X 1 , X 2 } as a variable module.Then we can construct a new variable using the technique of Interaction-based Feature Engineer (see the construction of X † in equation 4 to appreciate this new design).In other words, using the selected variables X 1 and X 2 , we construct X † .The procedure of BDA is illustrated in the bottom left corner of the figure (we use a 2 by 2 window for simplicity).We set the starting point to be 12 (i.e.start from the pixel at the 12th row and the 12th column).From data (size of 128 by 128) to the 1st layer (58 by 58), this gives us a new dimension that is computed by (128 − 12 − 2 + 1)/2 + 1 = 58 (see equation 8).We can repeat the process in the 2nd Layer and the 3rd Layer.After the 3rd Layer, we shrink the dimension to 14 by 14 (which gives us 196 new variable modules, i.e. the new X † 's).We fully connect these 196 variable modules with the 10 neurons in the hidden layer (in practice the number of hidden layer and the number neurons are tuning parameters).This novel design is fundamentally different than the conventional practice of using pre-trained filters, because it proposes to use an explainable measure I-score to extract and build information directly from images in the training data.For each local variable (in the data it is referring to pixels and in the layers it is referred as variables), we compute the I-score values and also the AUC (see §4.4 for detailed discussion of AUC values) for that variable (using this variable as a predictor alone).We observe that the I-score value fully represents the predictivity of each local variable which can be confirmed by the variable's AUC value.The color spectrum of both I-score and AUC are presented in the bottom part of the diagram.We observe I-score values exhibiting paralleled behavior with AUC values.This means the variables with high I-score values have high AUC values which indicates strong predictive power for the information in that location.This design heavily rely on I-score and has an architecture that is interpretable at each location of the image at each convolutional layer.More importantly, the proposed design satisfies all three dimensions (D1, D2, and D3 in §1 Introduction) of the definition of interpretability and explainability.
Discussion for Figure 5.
This figure presents visualization summary for 10 randomly sampled images from COVID class and non-COVID class (each has 10).Panel A is for COVID patients and Panel B is non-COVID people.The first row plots the original images that are sized 128 by 128.The 1st Conv.Layer generates 61 × 61 = 3, 721 new variables.We plot the same 10 images from both classes using these 3,721 variables in the second row.We also print the predicted COVID probabilities on top left corner of each image.The 2nd Conv.Layer generate 30 × 30 = 900 variables.We plot the same 10 image samples from both classes using these 900 variables in the third row.We also print the predicted COVID probabilities on top left corner of each image assuming using only these 900 variables as predictors.The plot of the original images for COVID-19 patients has grey and cloudy textures in chest area.This is due to inflammatory fluid when patients exhibit pneumonia-like symptoms.This shaded (as seen in Panel A) prevents us from observing the clear location of lungs.This is different in Panel B where the lung areas are dark and almost black which means the lung is filled with air (i.e.normal cases).The black white contrast in the two panels is directly related to how much inflammatory fluid there is in human lungs which translate to greyscale on pictures.The same contrast can be seen using the new variables (these are X † 's based on equation 4) in the 1st Conv.Layer (sized 61 by 61).For COVID-19 patients, the lung area is cloudy and unclear while the healthy cases it is clearly visible.Discussion for Figure 5.  4 to produce the texts that states predicted probability of COVID class.The red color implies ground truth to be COVID class (Panel A) and the green color implies ground truth to be non-COVID class (Panel B).
1st Conv.Layer. to 2nd Conv.Layer.From the resulting matrix of the 1st Conv.Layer, we are left with 3,721 variables.We go through the proposed design in Table 4 and we create a new convolutional layer, i.e. 2nd Conv.Layer.This new layer has 30 × 30 = 900 variables.We take the same 10 sampled images from before and we use these 900 variables to present these images.In this presentation, we resize these 900 variables into shape 30 by 30.In other words, we get a smaller matrix that we can plot that exhibit mini version of similar patterns as before.We use Model 4 to generated the predicted probabilities.These probabilities are printed on the top left corner of each image and they are color coded similarly as before (red probabilities have ground truth of COVID class while green probabilities have ground truth of non-COVID class).

Conclusions
Explainable AI System for Early COVID-19 Screening As the most important contribution of this paper, an Explainable Artificial Intelligence (XAI) system is proposed to assist radiologists for the initial screening of COVID-19 and other related diseases using chest X-ray images for treatment and disease control.This innovation can revolutionize the application of how AI systems are deployed in hospitals and healthcare systems.We anticipate that other related diseases with viral pneumonia signs can use the same detection methods proposed in our paper, which ensure the development of testing procedures with accountability, responsibility, and transparency to human users and patients.
A Heuristic and Theoretical Framework of XAI.This paper introduces a heuristic and theoretical framework for addressing the XAI problems in large-scale and high-dimensional data sets.We provide three dimensions as necessary conditions and premises required for a measure to be regarded as explainable and interpretable.The first dimension, D1, states an interpretable measure does not need to rely on the knowledge of the true model, because any mistakes made in model fitting would be carried over in explaining the features.The second dimension, D2, describes that an interpretable measure should be able to indicate the impact a combination of variables have on the response variable.This means that any inclusion of influential variables would increase this measure while any injection of noisy and useless variables would decrease this measure.This desirable property allows human users to directly make comparisons of the impact the features have when any classifier is trained to make prediction decisions.Though the paper provided detailed work with an arbitrary image data set, the proposed method can be generalized and adapted to any big-data problems.Moreover, it opens up future pipelines for feature selection and dimension reduction in any large-scale and high-dimensional data sets.Last, the third dimension, D3, associates an interpretable measure with the predictivity of a sets of features.This helpful property benefits human users, because it allows us to establish connections and foresee the potential prediction performance (such as AUC values) that a set of features can deliver before any model fitting procedure.
An Interaction-based Convolutional Neural Network (ICNN).To address the XAI problems heuristically described above, this paper introduced a novel design of an explainable and self-interpretable Interaction-based Convolutional Neural Network (ICNN).We provide a flexible approach to contribute to the major issues about explainability, interpretability, transparency, and trustworthiness in black-box algorithms.We introduce and implement a non-parametric and interaction-based feature selection methodology and use this as replacement of pre-defined filters that are widely used in ultra-deep CNNs.Under this paradigm, we present an Interaction-based Convolutional Neural Network (ICNN) that extract important features.The proposed architecture uses these extracted features to construct influential and predictive variable modules that are directly associated with the predictivity of the response variable.The proposed design and its many prudent characteristics provide an extremely flexible pipeline that can learn, extract useful information, and unleash the hidden potential from any largescale or high-dimensional data sets.The proposed methods have been presented with both artificial examples and real data application in the COVID-19 Chest X-ray Image Data.We conclude from both simulation and empirical application results that I-score shows unparalleled potential to explain informative and influential local information in large-scale data set.The high I-score values suggest that local information possess capability to have higher lower bounds of the predictivity which thus leads to not only high prediction performance but also explanatory power.By arranging features according to I-score from high to low, we are able to cater the dimension of our model to any neural network architecture.Furthermore, we have also shown undiscovered field of an interaction-based neural network architecture which can help us move towards Explainable Artificial Intelligence many steps closer.In fact, we believe that the proposed design can be adapted in any type of CNN.Thus, any CNN architecture that adapts the proposed technology can be regarded as Interaction-based Convolutional Neural Network (ICNN or Interaction-based Network).We encourage both the statistics and computer science communities to further explore this area to deliver more transparency, trustworthiness, and accountability to deep learning algorithms and to build a world with truly Responsible A.I..

Figure 1 :
Figure 1: This diagram is a recreation of the figure in DARPA document (DARPA-BAA-16-53) [7][21].It presents the relationship between learning performance (usually measured by prediction performance) and effectiveness of explanations (also known as explainability).The proposed method in our work aims to take any deep learning method and provide explainability without sacrificing prediction performance.In the diagram, the proposed method is the orange dot on the upper right corner of the relationship plot.

Algorithm 1 : 1 nσ 2 j∈P k n 2 j
Procedure of the Backward Dropping Algorithm (BDA) Training Set: Consider a training set {(y 1 , x 1 ), ..., (y n , x n )} of n observations, where x i = (x 1i , ..., x pi )} is a p-dimensional vector of explanatory variables.The size p can be very large.All explanatory variables are discrete.;Sampling from Variable Space: Select an initial subset of k explanatory variables S b = {X b1 , ..., X b k }, b = 1, ..., B; Initialization: Set l = k; while While l >= 1 do Compute Standardized I-score: calculate I(S b ) = ( Ȳj − Ȳ ) 2 .For the rest of the paper, we refer this formula as Influence Measure or Influence Score (I-score).Drop Variables: Tentatively drop each variable in S b and recalculate the I-score with one variable less.Then drop the one that gives the highest I-score.Call this new subset S b which has one variable less than S b .l = |S b | (update l with length of current subset of variables) end

Figure 3 :
Figure 3: This is the architecture of Interaction-based Convolutional Neural Network (ICNN).There are two panels.Panel A proposes a single Interaction-based Convolutional Layer while Panel B proposes a deep but similar design.Panel A Panel B A deeper but similar architecture can be constructed using the following diagram.In Figure 3, we propose a deep Interactionbased Convolutional Neural Network.Instead of building a single Interaction-based Convolutional Layer (ICNN), multiple such layers can be constructed using the same procedure.See Row (b) and Row (c) in Figure 5.
In the example in previous paragraph, b can take value {1, 2, 3, 4} because s out = sin−w l + 1 = (3 − 2)/1 + 1 = 2 thus s out × s out = 4.For each round b of Backward Dropping Algorithm (b takes value from {1, 2, 3, 4} in this example), we can construct a new variable module that takes formula 4, defined below, X † b := ȳj while ȳj is the local mean of Y based on the j th partition ∈ Π

Figure 4 :
Figure 4: The figure presents two panels.Panel A exhibits 16 randomly sampled images from class "COVID".Panel B exhibits 16 randomly sampled images from class "non-COVID".We observe that the images for the COVID class appear more cloudy and unclear than the images for the non-COVID class.This is because in the X-ray images for COVID class there are substance that are not air.These substances can be liquid, germs, or inflammatory fluid which causes the images to have cloudy, unclear, and shady areas.Panel A Panel B

Figure 5 :
Figure 5: This figure presents visualization summary for 10 randomly sampled images from COVID class and non-COVID class (each has 10).Panel A is for COVID and Panel B is non-COVID.Each panel samples 10 images from the data.Each of the 10 images is then used to feed into proposed architecture and the visualization below presents these images after each proposed Interaction-based Convolutional Layer.For detailed discussion, please see Appendix 5.8.Panel A: Panel B True Label: COVID True Label: Non-COVID Input Images: 128 by 128 Input Images: 128 by 128 (Randomly select 10 samples) (Randomly select 10 samples) From data (size of 128 by 128) to the 1st layer (58 by 58), this procedure gives us a new feature matrix with size (128 − 12 − 2 + 1)/2 + 1 = 58 on both edges which means the new feature matrix has 58 by 58 variables (the formula is presented in equation 8).This feature matrix constitutes the first Interaction-based Convolutional Layer.We can then use the same methodology to construct the second and the third Interaction-based Convolutional Layer.The third Interaction-based Convolutional Layer can be used as input layer for a neural network classifier.Each layer we can compute the proposed measure I-score and the AUC value (see §4.4 for detailed discussion of AUC values) for each variable (assuming using this variable as a predictor when computing the AUC value).The paths of I-score and AUC values have parallel behavior and this can be seen in the color palette of the spectrum in Figure2.Why Proposed I-score Satisfies XAI Dimensions.The design of the proposed architecture in Figure2mainly focuses on using I-score and Backward Dropping Algorithm to extract and engineer features from the original images.The proposed measure I-score is non-parametric (see §3.1 for definition of this measure).This means the impact of how explanatory variables affect response variable measured by I-score does not rely on the knowledge of the correct specification of the underlying model.In other words, the computation of I-score does not rely on any model fitting procedure.This characteristic satisfies the first dimension, D1, defined in §1 about interpretable measures.
4 for detailed discussion of AUC values) that are coded onto the same spectrum location of I-score values.Moreover, we observe the path of I-score and AUC values to have parallel behavior.This implies that the values of I-score and AUC move up and down together which means for each variable with high I-score measure would definitely have high AUC values and vice versa.This interesting yet powerful phenomenon allows this architecture to satisfy the third dimension, D3, stated in the definition of an interpretable measure which is novel in the literature.

Figure 7 :
Figure 7: This figure presents two paths of ROC curves.We call them ROC1 and ROC2.For each ROC curve, we can compute AUC value.From ROC1, we can compute AUC1.From ROC2, we can compute AUC2.The mistake discussed in the paragraph is reflected by a reduction of AUC values from path 1 to path 2.

Figure 8 :
Figure 8: This figure demonstrate the network architecture in simulated data.The artificial data has 6 × 6 = 36 variables which can be arranged in a grid structure with shape 6 by 6.We use a window size that is 2 by 2. By passing this window from top left corner of the original 6-by-6 matrix, we would create a new matrix that has dimension 5 by 5.The number 32 is the number of units in the hidden layer.In this example, there is one hidden layer which is sufficient for the dimension of the data in the example.

Figure 9 :
Figure 9: This figure presents the AUC path for all six models in the proposed work.These models are listed in Table4with detailed information including the parameters from each layer and the out-of-sample prediction performance.

Table 3 :
The table presents experimental results of COVID-19 data set from literature.A number of different ultra-deep CNNs are used to classify COVID patients from non-COVID people.The average number of parameters of the ultra-deep CNNs can exceed 25 million parameters with top AUC to be at 99.2%.The proposed methods have average number of parameters to be less than 100k with top AUC of 99.8%.This is a 99% reduction on number of parameters without sacrificing the prediction performance.

Table 4 :
The table presents the summary statistics of the design of the proposed network: Interaction-based Convolutional Neural Network (ICNN).There are models 1-6.Each model can take one or two Interaction-based Convolutional Layer (i.e.1st Conv.or 2nd Conv.).The design of the model can directly go from Interaction-based Convolutional Layer to Output Layer.For example, Model 1 and Model 3 go directly from convolutional layer to output layer, i.e. no hidden layer.theconvolutional layers and fine tuning.Their work show that modern day AI technology such as deep Convolutional Neural Network can provide initial screening of COVID-19 diseased patients with a single scan of an image which could provide reduce workload for radiologists in practice.However, the number of parameters still far exceeds what humans can interpret.Moreover, it is unclear how these conventional methodologies can satisfy the three dimensions (D1, D2, and D3) of the definition of interpretable measures introduced in §1 of this paper.

Table 8 :
Minaee et al. (2020)erformance of the this simulation.In this simulation, we define the underlying model to be 11.The theoretical prediction rate is 75%.The conventional model such as Net-3 and LeNet-5 do not perform well in this example[14][25].The proposed method that uses I-score has prediction performance that is close to the theoretical prediction rate (see §4.4 for detailed discussion of AUC values).The notation "NN" stands for neural network.This is discussed in §4.2 and please read the architecture 6 for detailed description.COVID chest X-ray images and healthy images.Therefore, we do not take other diseases into consideration.We present the summary of these information in Table2.We first randomly select 60 images each from COVID class and non-COVID class and use these images as out-of-sample test set.We do not touch the test set until the very end after training and validation is completed.For the remaining images, we use data augmentation technique by adding noises drawn from normal distribution.This gives us totalled of 5,000 COVID cases and 5,000 non-COVID cases.These 10,000 images consist of the training and validating sets (short for tr. and val. in the Table2).These 10,000 images have two classes: COVID and non-COVID. of COVID-19 diseases on CT scans have been proposed.Bai et al. (2020) provided the model output to radiologists, and demonstrated that AI-assistance significantly improved radiologist diagnostic accuracy from 85% to 90% in distinguishing COVID classes from non-COVID classes[4].Minaee et al. (2020)