Abstract
The remarkable success of deep neural networks (DNNs) in various applications is accompanied by a significant increase in network parameters and arithmetic operations. Such increases in memory and computational demands make deep learning prohibitive for resource-constrained hardware platforms such as mobile devices. Recent efforts aim to reduce these overheads, while preserving model performance as much as possible, and include parameter reduction techniques, parameter quantization, and lossless compression techniques.
In this chapter, we develop and describe a novel quantization paradigm for DNNs. Our method leverages concepts from explainable AI (XAI) and information theory: instead of assigning weight values based solely on their distances to the quantization clusters, the assignment function additionally considers weight relevances obtained from Layer-wise Relevance Propagation (LRP) and the information content of the clusters (entropy optimization). The ultimate goal is to preserve the most relevant weights in quantization clusters of highest information content.
Experimental results show that this novel Entropy-Constrained and XAI-adjusted Quantization (ECQ\(^{\text {x}}\)) method generates ultra low-precision (2–5 bit) and simultaneously sparse neural networks while maintaining or even improving model performance. Due to reduced parameter precision and a high number of zero-elements, the rendered networks are highly compressible in terms of file size, up to 103\(\times \) compared to the full-precision unquantized DNN model. Our approach was evaluated on different types of models and datasets (including Google Speech Commands, CIFAR-10 and Pascal VOC) and compared with previous work.
Keywords
 Neural Network Quantization
 Layer-wise Relevance Propagation (LRP)
 Explainable AI (XAI)
 Neural Network Compression
 Efficient Deep Learning
1 Introduction
Solving increasingly complex real-world problems continuously contributes to the success of deep neural networks (DNNs) [37, 38]. DNNs have long been established in numerous machine learning tasks and have been significantly improved in the past decade. This is often achieved by over-parameterizing models, i.e., their performance is attributed to their growing topology, adding more layers and parameters per layer [18, 41]. Processing a very large number of parameters comes at the expense of memory and computational efficiency. The sheer size of state-of-the-art models makes it difficult to execute them on resource-constrained hardware platforms. In addition, an increasing number of parameters implies higher energy consumption and increasing run times.
Such immense storage and energy requirements, however, contradict the demand for efficient deep learning applications for an increasing number of hardware-constrained devices, e.g., mobile phones, wearable devices, Internet of Things, autonomous vehicles or robots. Specific restrictions of such devices include limited energy, memory, and computational budget. Beyond these, typical applications on such devices, e.g., healthcare monitoring, speech recognition, or autonomous driving, require low latency and/or data privacy. These latter requirements are addressed by running the aforementioned applications directly on the respective devices (also known as “edge computing”) instead of transferring data to third-party cloud providers prior to processing.
In order to tailor deep learning to resource-constrained hardware, a large research community has emerged in recent years [10, 45]. By now, there exists a vast number of tools to reduce the number of operations and model size, as well as tools to reduce the precision of operands and operations (bit width reduction, going from floating point to fixed point). Topics range from neural architecture search (NAS), knowledge distillation, pruning/sparsification, quantization and lossless compression to hardware design.
Above all, quantization and sparsification are very promising and show great improvements in terms of neural network efficiency optimization [21, 43]. Sparsification sets less important neurons or weights to zero and quantization reduces parameters’ bit widths from the default 32 bit float to, e.g., 4 bit integer. These two techniques enable higher computational throughput, memory reduction and skipping of arithmetic operations for zero-valued elements, just to name a few benefits. However, combining both high sparsity and low precision is challenging, especially when relying only on the weight magnitudes as a criterion for the assignment of weights to quantization clusters.
In this work, we propose a novel neural network quantization scheme to render low-bit and sparse DNNs. More precisely, our contributions can be summarized as follows:

1. Extending the state-of-the-art concept of entropy-constrained quantization (ECQ) to utilize concepts of XAI in the clustering assignment function.

2. Using relevances obtained from Layer-wise Relevance Propagation (LRP) at the granularity of per-weight decisions to correct the magnitude-based weight assignment.

3. Obtaining state-of-the-art or better results in terms of the trade-off between efficiency and performance compared to previous work.
The chapter is organized as follows: First, an overview of related work is given in Sect. 2. Second, in Sect. 3, basic concepts of neural network quantization are explained, followed by entropy-constrained quantization. Section 4 describes the ECQ extension towards ECQ\(^{\text {x}}\) as an explainability-driven approach. Here, LRP is introduced and the per-weight relevance derivation for the assignment function is presented. Next, the ECQ\(^{\text {x}}\) algorithm is described in detail. Section 5 presents the experimental setup and obtained results, followed by the final conclusion in Sect. 6.
2 Related Work
A large body of literature exists that has focused on improving DNN model efficiency. Quantization is an approach that has shown great success [14]. While most research focuses on reducing the bit width for inference, [52] and others focus on quantizing weights, gradients and activations to also accelerate the backward pass and training. Quantized models often require fine-tuning or re-training to adjust model parameters and compensate for quantization-induced accuracy degradation. This is especially true for precisions \({<}8\) bit (cf. Fig. 1 in Sect. 3). Trained quantization is often referred to as “quantization-aware training”, for which additional trainable parameters may be introduced (e.g., scaling parameters [6] or directly trained quantization levels (centroids) [53]). A precision reduction to even 1 bit was introduced by BinaryConnect [8]. However, this kind of quantization usually results in severe accuracy drops. As an extension, ternary networks allow weights to be zero, i.e., constraining them to 0 in addition to \(w_{-}\) and \(w_{+}\), which yields results that outperform the binary counterparts [28]. In DNN quantization, most clustering approaches are based on distance measurements between the unquantized weight distribution and the corresponding centroids. The works in [7] and [32] were pioneering in using Hessian-weighted and entropy-constrained clustering techniques. More recently, the work of [34] uses concepts from XAI for DNN quantization. It relies on DeepLIFT importance measures, which are restricted to the granularity of convolutional channels, whereas our proposed ECQ\(^{\text {x}}\) computes LRP relevances per weight.
Another method for reducing the memory footprint and computational cost of DNNs is sparsification. In the scope of sparsification techniques, weights with small saliency (i.e., weights which minimally affect the model’s loss function) are set to zero, resulting in a sparser computational graph and more highly compressible matrices. Thus, it can be interpreted as a special form of quantization, having only one quantization cluster with centroid value 0 to which part of the parameter elements are assigned. This sparsification can be carried out as unstructured sparsification [17], where any weight in the matrix with small saliency is set to zero, independently of its position. Alternatively, a structured sparsification is applied, where an entire regular subset of parameters is set to zero, e.g., entire convolutional filters, matrix rows or columns [19]. “Pruning” is conceptually related to sparsification but actually removes the respective weights rather than setting them to zero. This has the effect of changing the input and output shapes of layers and weight matrices^{Footnote 1}. Most pruning/sparsification approaches are magnitude-based, i.e., weight saliency is approximated by the weight values, which is straightforward. However, since the early 1990s, methods that use, e.g., second-order Taylor information for weight saliency [27] have been used alongside other criteria ranging from random pruning to correlation and similarity measures (for the interested reader we recommend [21]). In [51], LRP relevances were first used for structured pruning.
Generating efficient neural network representations can also be a result of combining multiple techniques. In Deep Compression [16], a three-stage model compression pipeline is described. First, redundant connections are pruned iteratively. Next, the remaining weights are quantized. Finally, entropy coding is applied to further compress the weight matrices in a lossless manner. This three-stage model is also used in the new international ISO/IEC standard on Neural Network compression and Representation (NNR) [24], where efficient data reduction, quantization and entropy coding methods are combined. For coding, the highly efficient universal entropy coder DeepCABAC [47] is used, which yields compression gains of up to 63\(\times \). Although the proposed method achieves high compression gains, the compressed representation of the DNN weights requires decoding prior to performing inference. In contrast, compressed matrix formats like Compressed Sparse Row (CSR) derive a representation that enables inference directly in the compressed format [49].
Orthogonal to the previously described approaches is the research area of Neural Architecture Search (NAS) [12]. Both manual [36] and automated [44] search strategies have played an important role in optimizing DNN architectures in terms of latency, memory footprint, energy consumption, etc. Microstructural changes include, e.g., the replacement of standard convolutional layers by more efficient types like depthwise or pointwise convolutions, layer decomposition or factorization, or kernel size reduction. The macro architecture specifies the type of modules (e.g., inverted residual), their number and connections.
Knowledge distillation (KD) [20] is another active branch of research that aims at generating efficient DNNs. The KD paradigm leverages a large teacher model that is used to train a smaller (more efficient) student model. Instead of using the “hard” class labels to train the student, the key idea of model distillation is to deploy the teacher’s class probabilities, as they can contain more information about the input.
3 Neural Network Quantization
For neural network computing, the default precision used on general hardware like GPUs or CPUs is 32 bit floating-point (“single-precision”), which causes high computational costs, power consumption, arithmetic operation latency and memory requirements [43]. Here, quantization techniques can reduce the number of bits required to represent weight parameters and/or activations of the full-precision neural network, as they map the respective data values to a finite set of discrete quantization levels (clusters). Providing n such clusters allows each data point to be represented in only \(\log _2n\) bit. However, the continuous reduction of the number of clusters generally leads to an increasingly large error and degraded performance (see the EfficientNet-B0^{Footnote 2} example in Fig. 1).
This trade-off is a well-known problem in information theory and is addressed by rate-distortion optimization, a concept in lossy data compression. It aims to determine the minimal number of bits per data symbol (bitrate) at which the reconstruction of the compressed data does not exceed a certain level of distortion. Applying this to the domain of neural network quantization, the objective is to minimize the bitrate of the weight parameters while keeping model degradation caused by quantization below a certain threshold, i.e., the predictive performance of the model should not be affected by reduced parameter precisions. In contrast to multimedia compression approaches, e.g., for audio or video coding, the compression of DNNs has unique challenges and opportunities. Foremost, the neural network parameters to be compressed are not perceived directly by a user, as is the case, e.g., for video data. Therefore, the coding or compression error or distortion cannot be directly used as a performance measure. Instead, such an accuracy measurement needs to be deduced from a subsequent inference step. Moreover, current neural networks are highly over-parameterized [11], which allows for high errors/differences between the full-precision and the quantized parameters (while still maintaining model performance). Also, the various layer types and the location of a layer within the DNN have different impacts on the loss function, and thus different sensitivities to quantization.
Quantization can be further classified into uniform and non-uniform quantization. The most intuitive way to initialize centroids is by arranging them equidistantly over the range of parameter values (uniform). Other quantization schemes make use of non-uniform mapping functions, e.g., k-means clustering, which is determined by the distribution of weight values (see Fig. 2). As non-uniform quantization captures the underlying distribution of parameter values better, it may achieve less distortion compared to equidistantly arranged centroids. However, non-uniform schemes are typically more difficult to deploy on hardware, e.g., they require a codebook (look-up table), whereas uniform quantization can be implemented using a single scaling factor (step size) which allows a very efficient hardware implementation with fixed-point integer logic.
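To make the "single scaling factor" of uniform quantization concrete, the following minimal sketch maps weights to a symmetric integer grid and back. The function names, the symmetric grid with \(2^{b-1}-1\) positive levels, and the max-based step size are illustrative assumptions and are not taken from the chapter.

```python
def uniform_quantize(weights, n_bits):
    """Quantize to a symmetric integer grid; returns (integers, step size)."""
    q_max = 2 ** (n_bits - 1) - 1                    # largest integer level, e.g. 7 for 4 bit
    max_abs = max(abs(w) for w in weights)
    step = max_abs / q_max if max_abs > 0 else 1.0   # the single scaling factor (assumed: max-based)
    ints = [max(-q_max, min(q_max, round(w / step))) for w in weights]
    return ints, step


def dequantize(ints, step):
    """Reconstruct approximate weight values from integers and step size."""
    return [i * step for i in ints]


weights = [-0.7, -0.12, 0.0, 0.04, 0.33, 0.7]
ints, step = uniform_quantize(weights, n_bits=4)
recon = dequantize(ints, step)
```

Only the integers and one floating-point step size need to be stored, whereas a non-uniform scheme would additionally need a codebook that maps each cluster index to its centroid value.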
3.1 Entropy-Constrained Quantization
As discussed in [49], and experimentally shown in [50], lowering the entropy of DNN weights provides benefits in terms of memory as well as computational complexity. The Entropy-Constrained Quantization (ECQ) algorithm is a clustering algorithm that also takes the entropy of the weight distributions into account. More precisely, the first-order entropy \(H = -\sum _c P_c\log _2{P_c}\) is used, where \(P_c\) is the ratio of the number of parameter elements in the c-th cluster to the number of all parameter elements (i.e., the source distribution). To recall, the entropy H is the theoretical limit of the average number of bits required to represent any element of the distribution [39].
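As a small illustration of the first-order entropy, the sketch below computes H from a list of cluster indices. The helper name and the toy assignments are hypothetical; the point is only that a distribution concentrated in one cluster needs fewer bits per element on average than an evenly spread one.

```python
import math
from collections import Counter


def first_order_entropy(assignments):
    """H = -sum_c P_c * log2(P_c), with P_c the fraction of elements in cluster c."""
    counts = Counter(assignments)
    total = len(assignments)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Toy cluster indices for eight weights (hypothetical data).
even_assignment = [0, 1, 2, 3, 0, 1, 2, 3]     # four equally populated clusters
skewed_assignment = [0, 0, 0, 0, 0, 0, 1, 2]   # most weights share one cluster

h_even = first_order_entropy(even_assignment)
h_skewed = first_order_entropy(skewed_assignment)
```

For the evenly populated case the entropy equals the full 2 bit of a four-cluster codebook, while the skewed case falls to roughly 1.06 bit.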
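Thus, ECQ assigns weight values not only based on their distances to the centroids, but also based on the information content of the clusters. Similar to other rate-distortion-optimization methods, ECQ applies Lagrange optimization:
\[ \mathbf{A} ^{(l)}\left( \mathbf{W} ^{(l)}\right) = \underset{c}{\mathrm {argmin}}\; d\left( \mathbf{W} ^{(l)}, w_c^{(l)}\right) - \lambda ^{(l)} \log _2\left( P_c^{(l)}\right) . \tag{1} \]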
Per network layer l, the assignment matrix \(\mathbf{A} ^{(l)}\) maps a centroid to each weight based on a minimization problem consisting of two terms: Given the full-precision weight matrix \(\mathbf{W} ^{(l)}\) and the centroid values \(w_c^{(l)}\), the first term in Eq. (1) measures the squared distance between all weight elements and the centroids, indexed by c. The second term in Eq. (1) is weighted by the scalar Lagrange parameter \(\lambda ^{(l)}\) and describes the entropy constraint. More precisely, the information content I is considered, i.e., \(I=-\log _2(P_c^{(l)})\), where the probability \(P_c^{(l)}\in [0,1]\) defines how likely a weight element \(w_{ij}^{(l)}\in \mathbf{W} ^{(l)}\) is going to be assigned to centroid \(w_c^{(l)}\). Data elements with a high occurrence frequency, or a high probability, contain a low information content, and vice versa. \(P_c^{(l)}\) is calculated layer-wise as \(P_c^{(l)} = N_{w_c}^{(l)} / N_\mathbf{W }^{(l)}\), with \(N_{w_c}^{(l)}\) being the number of full-precision weight elements assigned to the cluster with centroid value \(w_c^{(l)}\) (based on the squared distance), and \(N_\mathbf{W }^{(l)}\) being the total number of parameters in \(\mathbf{W} ^{(l)}\). Note that \(\lambda ^{(l)}\) is scaled with a factor based on the number of parameters a layer has in proportion to other layers in the network to mitigate the constraint for smaller layers.
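The two-term assignment described above can be sketched for a single layer as follows. This is a simplified illustration, not the authors' implementation: the function name, the toy weights and centroids, and the handling of empty clusters (infinite cost) are assumptions, and the layer-dependent scaling of \(\lambda\) is omitted.

```python
import math


def ecq_assign(weights, centroids, lam):
    """Return, per weight, the centroid index minimizing distance + lam * information content."""
    n_clusters = len(centroids)
    # Cluster probabilities P_c from a plain nearest-neighbour (squared distance) assignment.
    nearest = [min(range(n_clusters), key=lambda c: (w - centroids[c]) ** 2)
               for w in weights]
    probs = [nearest.count(c) / len(weights) for c in range(n_clusters)]

    def cost(w, c):
        if probs[c] == 0.0:
            return math.inf                      # assumption: never assign to an empty cluster
        return (w - centroids[c]) ** 2 - lam * math.log2(probs[c])

    return [min(range(n_clusters), key=lambda c: cost(w, c)) for w in weights]


centroids = [-0.5, 0.0, 0.5]
weights = [-0.6, -0.3, -0.05, 0.0, 0.02, 0.1, 0.2, 0.3, 0.55]

nearest_only = ecq_assign(weights, centroids, lam=0.0)    # pure distance-based clustering
entropy_aware = ecq_assign(weights, centroids, lam=0.05)  # entropy term pulls weights to zero
```

With `lam=0.0` the result is ordinary nearest-neighbour clustering. With `lam=0.05` the weights -0.3 and 0.3 move from the outer clusters to the more populated zero cluster, which lowers the entropy and increases sparsity.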
The entropy regularization term motivates sparsity and low-bit weight quantization in order to achieve smaller coded neural network representations. Based on the specific neural network coding optimization, we developed ECQ. This algorithm is based on previous work in Entropy-Constrained Trained Ternarization (EC2T) [28]. EC2T trains sparse and ternary DNNs to state-of-the-art accuracies.
In our developed ECQ, we generalize the EC2T method, such that DNNs of variable bit width can be rendered. Also, ECQ does not train centroid values to facilitate integer arithmetic on general hardware. The proposed quantization-aware training algorithm includes the following steps:

1. Quantize weight parameters by applying ECQ (but keep a copy of the full-precision weights).

2. Apply the Straight-Through Estimator (STE) [5]:

(a) Compute forward and backward pass through the quantized model version.

(b) Update full-precision weights with scaled gradients obtained from the quantized model.
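The training steps listed above can be sketched on a toy linear model. Everything here is a simplifying assumption made for illustration: the quantizer is a plain nearest-centroid mapping standing in for ECQ, the loss is a squared error, the update is plain SGD without gradient scaling, and all names are hypothetical.

```python
def quantize(weights, centroids):
    """Stand-in quantizer: map each weight to its nearest centroid value."""
    return [min(centroids, key=lambda c: (w - c) ** 2) for w in weights]


def ste_step(fp_weights, centroids, x, target, lr):
    """One quantization-aware step with a straight-through estimator."""
    q = quantize(fp_weights, centroids)            # step 1: quantize, FP copy is kept
    pred = sum(qi * xi for qi, xi in zip(q, x))    # step 2a: forward pass, quantized model
    err = pred - target
    grads = [2.0 * err * xi for xi in x]           # step 2a: gradient of err**2 w.r.t. q
    # Step 2b: the quantizer is treated as identity in the backward pass,
    # so the gradient of the quantized weights updates the FP weights.
    new_fp = [w - lr * g for w, g in zip(fp_weights, grads)]
    return new_fp, err ** 2


centroids = [-1.0, 0.0, 1.0]
x, target = [1.0, 2.0], 1.0
fp0 = [0.4, -0.2]
fp1, loss0 = ste_step(fp0, centroids, x, target, lr=0.1)
fp2, loss1 = ste_step(fp1, centroids, x, target, lr=0.1)
```

In this toy run the first step moves the full-precision weight 0.4 across the decision boundary, so that it is quantized to 1.0 in the second step and the loss drops from 1 to 0, even though the quantized weights themselves were never updated directly.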
4 Explainability-Driven Quantization
Explainable AI techniques can be applied to find relevant features in input as well as latent space. Covering large sets of data, identification of relevant and functional model substructures is thus possible. Assuming over-parameterization of DNNs, the authors of [51] exploit this for pruning (of irrelevant filters) to great effect. Their successful implementation shows the potential of applying XAI for the purpose of quantization as well, as sparsification is part of quantization, e.g., by assigning weights to the zero-cluster. Here, XAI opens up the possibility to go beyond regarding model weights as static quantities and to consider the interaction of the model with given (reference) data. This work aims to combine the two orthogonal approaches of ECQ and XAI in order to further improve sparsity and efficiency of DNNs. In the following, the LRP method is introduced, which can be applied to extract relevances of individual neurons, as well as weights.
4.1 Layer-Wise Relevance Propagation
Layer-wise Relevance Propagation (LRP) [3] is an attribution method based on the conservation of flows and proportional decomposition. It is explicitly aligned with the layered structure of machine learning models. Regarding a model with n layers
\[ f(x) = f_n \circ \dots \circ f_1(x), \tag{2} \]
LRP first calculates all activations during the forward pass starting with \(f_1\) until the output layer \(f_n\) is reached. Thereafter, the prediction score f(x) of any chosen model output is redistributed layer-wise as an initial quantity of relevance \(R_n\) back towards the input. During this backward pass, the redistribution process follows a conservation principle analogous to Kirchhoff’s laws in electrical circuits. Specifically, all relevance that flows into a neuron is redistributed towards neurons of the layer below. In the context of neural network predictors, the whole LRP procedure can be efficiently implemented as a forward-backward pass with modified gradient computation, as demonstrated in, e.g., [35].
Considering a layer’s output neuron j, the distribution of its assigned relevance score \(R_j\) towards its lower layer input neurons i can be, in general, achieved by applying the basic decomposition rule
\[ R_{i \leftarrow j} = \frac{z_{ij}}{z_j} R_j, \tag{3} \]
where \(z_{ij}\) describes the contribution of neuron i to the activation of neuron j [3, 29] and \(z_j\) is the aggregation of the pre-activations \(z_{ij}\) at output neuron j, i.e., \(z_j = \sum _i z_{ij}\). Here, the denominator enforces the conservation principle over all i contributing to j, meaning \(\sum _i R_{i \leftarrow j} = R_j\). This is achieved by ensuring the decomposition of \(R_j\) is in proportion to the relative flow of activations \(z_{ij}/z_j\) in the forward pass. The relevance of a neuron i is then simply an aggregation of all incoming relevance quantities
\[ R_i = \sum _j R_{i \leftarrow j}. \tag{4} \]
Given the conservation of relevance in the decomposition step of Eq. (3), this means that \(\sum _i R_i = \sum _j R_j\) holds for consecutive neural network layers. Next to component-wise non-linearities, linearly transforming layers (e.g., dense or convolutional) are by far the most common and basic building blocks of neural networks such as VGG-16 [41] or ResNet [18]. While LRP treats the former via identity backward passes, relevance decomposition formulas can be given for the latter explicitly in terms of weights \(w_{ij}\) and input activations \(a_i\). Let the output of a linear neuron be given as \(z_j = \sum _{i,0} z_{ij} = \sum _{i,0} a_i w_{ij}\) with bias “weight” \(w_{0j}\) and respective activation \(a_0=1\). In accordance with Eq. (3), relevance is then propagated as
\[ R_{i \leftarrow j} = a_i w_{ij} \frac{R_j}{z_j} = a_i \frac{\partial z_j}{\partial a_i} \frac{R_j}{z_j} = w_{ij} \frac{\partial z_j}{\partial w_{ij}} \frac{R_j}{z_j}. \tag{5} \]
Equation (5) exemplifies that the explicit computation of the backward directed relevances \(R_{i \leftarrow j}\) in linear layers can be replaced equivalently by a (modified) “gradient \(\times \) input” approach. Therefore, the activation \(a_i\) or weight \(w_{ij}\) can act as the input and target with respect to which the partial derivative regarding output \(z_j\) is computed. The scaled relevance term \(R_j / z_j\) takes the role of the upstream gradient to be propagated.
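The basic decomposition for a dense layer can be written out explicitly, as in the sketch below. It distributes each \(R_j\) in proportion to \(a_i w_{ij} / z_j\), treats the bias as an extra input with activation 1, and lets the conservation property be checked numerically. The function name, data layout (`w[i][j]`) and toy values are assumptions, and the sketch does not guard against \(z_j = 0\).

```python
def lrp_dense(a, w, bias, relevance_out):
    """Basic LRP rule for a dense layer.

    a: input activations, w[i][j]: weights, bias[j]: biases,
    relevance_out[j]: relevance R_j assigned to output neuron j.
    Returns (R_i per input, relevance absorbed by each bias, messages R_{i<-j}).
    """
    n_in, n_out = len(a), len(bias)
    z = [sum(a[i] * w[i][j] for i in range(n_in)) + bias[j] for j in range(n_out)]
    # a_i * w_ij is "input x gradient" (dz_j/da_i = w_ij); R_j / z_j acts as upstream gradient.
    msgs = [[a[i] * w[i][j] * relevance_out[j] / z[j] for j in range(n_out)]
            for i in range(n_in)]
    r_in = [sum(row) for row in msgs]                          # aggregate over outputs j
    r_bias = [bias[j] * relevance_out[j] / z[j] for j in range(n_out)]
    return r_in, r_bias, msgs


a = [1.0, 2.0]
w = [[0.5, -1.0],
     [0.25, 1.0]]
bias = [0.5, 0.0]
relevance_out = [1.0, 2.0]
r_in, r_bias, msgs = lrp_dense(a, w, bias, relevance_out)
```

Summing `r_in` and `r_bias` recovers the total relevance of the output layer (here 3.0), which is the conservation property stated in the text.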
At this point, LRP offers the possibility to calculate relevances not only of neurons, but also of individual weights, depending on the aggregation strategy, as illustrated in Fig. 3. This can be achieved by aggregating relevances at the corresponding (gradient) targets, i.e., plugging Eq. (5) into Eq. (4). For a dense layer, this yields
\[ R_{w_{ij}} = R_{i \leftarrow j} \tag{6} \]
with an individual weight as the aggregation target contributing (exactly) once to an output. A weight of a convolutional filter, however, is applied multiple times within a neural network layer. Here, we introduce a variable k signifying one such application context, e.g., one specific step in the application of a filter w in a (strided) convolution, mapping the filter’s inputs i to an output j. While the relevance decomposition formula within one such context k does not change from Eq. (3), we can uniquely identify its backwards distributed relevance messages as \(R^k_{i \leftarrow j}\). With that, the aggregation of relevance at the convolutional filter w at a given layer is given by
\[ R_{w} = \sum _k R^k_{i \leftarrow j}, \tag{7} \]
where k iterates over all applications of this filter weight.
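For a shared filter weight, the aggregation over application contexts k can be illustrated with a one-dimensional convolution (stride 1, no padding, no bias). These restrictions, the function name and the toy values are assumptions made to keep the sketch short.

```python
def conv1d_weight_relevance(a, filt, relevance_out):
    """Relevance per filter weight, summed over all positions k where the filter is applied."""
    k_size = len(filt)
    n_out = len(a) - k_size + 1
    r_w = [0.0] * k_size
    for k in range(n_out):                                   # k: one application context
        z_k = sum(a[k + t] * filt[t] for t in range(k_size))  # output of this application
        for t in range(k_size):
            # Message from output k to the input under filter tap t, credited to weight t.
            r_w[t] += a[k + t] * filt[t] * relevance_out[k] / z_k
    return r_w


a = [1.0, 2.0, 3.0]
filt = [1.0, 0.5]
relevance_out = [1.0, 1.0]          # one relevance value per output position
r_w = conv1d_weight_relevance(a, filt, relevance_out)
```

Because no bias is involved, the filter weights together receive the full output relevance. In a dense layer every weight has exactly one application context, so the sum collapses to a single message.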
Note that in modern deep learning frameworks, derivatives with respect to activations or weights can be computed efficiently by leveraging the available automatic differentiation functionality (autograd) [33]. Specifying the gradient target, autograd then already merges the relevance decomposition and aggregation steps outlined above. Thus, computation of relevance scores for filter weights in convolutional layers is also appropriately supported, for Eq. (3), as well as any other relevance decomposition rule which can be formulated as a modified gradient backward pass, such as Eqs. (8) and (9). The ability to compute the relevance of individual weights is a critical ingredient for the eXplainability-driven Entropy-Constrained Quantization strategy introduced in Sect. 4.2.
In the following, we will briefly introduce further LRP decomposition rules used throughout our study. In order to increase numerical stability of the basic decomposition rule in Eq. (3), the LRP \(\varepsilon \)-rule introduces a small term \(\varepsilon \) in the denominator:
\[ R_{i \leftarrow j} = \frac{z_{ij}}{z_j + \varepsilon \cdot \text {sign}(z_j)} R_j. \tag{8} \]
The term \(\varepsilon \) absorbs relevance for weak or contradictory contributions to the activation of neuron j. Note here, in order to avoid divisions by zero, the \(\text {sign}(z)\) function is defined to return 1 if \(z \ge 0\) and −1 otherwise. In the case of a deep rectifier network, it can be shown [1] that the application of this rule to the whole neural network results in an explanation that is similar to (simple) “gradient \(\times \) input” [40]. A common problem within deep neural networks is that the gradient becomes increasingly noisy with network depth [35], partly as a result of gradient shattering [4]. The \(\varepsilon \) parameter is able to suppress the influence of that noise given sufficient magnitude. With the aim of achieving robust decompositions, several purposed rules next to Eqs. (3) and (8) have been proposed in literature (see [29] for an overview).
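The stabilizing effect of \(\varepsilon\) can be seen in a small sketch of the \(\varepsilon\)-rule for a bias-free dense layer. The function names, the default value of `eps` and the toy inputs are illustrative assumptions.

```python
def sign(z):
    """Sign convention from the text: +1 for z >= 0, -1 otherwise."""
    return 1.0 if z >= 0 else -1.0


def lrp_epsilon(a, w, relevance_out, eps=1e-2):
    """LRP epsilon-rule for a dense layer without bias; returns R_i per input."""
    n_in, n_out = len(a), len(relevance_out)
    z = [sum(a[i] * w[i][j] for i in range(n_in)) for j in range(n_out)]
    return [sum(a[i] * w[i][j] * relevance_out[j] / (z[j] + eps * sign(z[j]))
                for j in range(n_out))
            for i in range(n_in)]


# Contradictory contributions that cancel exactly (z_j = 0): the basic rule would divide by zero.
r_cancel = lrp_epsilon([1.0, 1.0], [[1.0], [-1.0]], [1.0], eps=0.1)

# Well-conditioned case: a small share of the relevance is absorbed by epsilon.
r_regular = lrp_epsilon([1.0, 2.0], [[1.0], [1.0]], [1.0], eps=0.01)
```

In the cancelling case the two inputs receive bounded relevances of opposite sign that sum to zero, i.e., the output relevance is fully absorbed. In the regular case the inputs receive slightly less than the output relevance of 1.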
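One particular rule choice, which reduces the problem of gradient shattering and which has been shown to work well in practice, is the \(\alpha \beta \)-rule [3, 30]
\[ R_{i \leftarrow j} = \left( \alpha \frac{\left( z_{ij}\right) ^+}{\left( z_j\right) ^+} - \beta \frac{\left( z_{ij}\right) ^-}{\left( z_j\right) ^-} \right) R_j, \tag{9} \]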
where \((\cdot )^+\) and \((\cdot )^-\) denote the positive and negative parts of the variables \(z_{ij}\) and \(z_j\), respectively. Further, the parameters \(\alpha \) and \(\beta \) are chosen subject to the constraints \(\alpha - \beta = 1\) and \(\beta \ge 0\) (i.e., \(\alpha \ge 1\)) in order to propagate relevance conservatively throughout the network. Setting \(\alpha =1\), the relevance flow is computed only with respect to the positive contributions \(\left( z_{ij}\right) ^+\) in the forward pass. When alternatively parameterizing with, e.g., \(\alpha = 2\) and \(\beta = 1\), which is a common choice in literature, negative contributions are included as well, while favoring positive contributions.
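A minimal sketch of the \(\alpha\beta\)-rule for a bias-free dense layer is given below. It assumes the common convention that the positive (negative) part of \(z_j\) is the sum of the positive (negative) contributions \(z_{ij}\), and it assigns zero to a branch whose denominator vanishes. Both choices, as well as the function name, are assumptions for illustration.

```python
def lrp_alpha_beta(a, w, relevance_out, alpha=2.0, beta=1.0):
    """LRP alpha-beta rule for a dense layer without bias; returns R_i per input."""
    if abs(alpha - beta - 1.0) > 1e-12 or beta < 0:
        raise ValueError("requires alpha - beta = 1 and beta >= 0")
    n_in, n_out = len(a), len(relevance_out)
    r_in = [0.0] * n_in
    for j in range(n_out):
        z_ij = [a[i] * w[i][j] for i in range(n_in)]
        z_pos = sum(max(v, 0.0) for v in z_ij)     # assumed definition of (z_j)^+
        z_neg = sum(min(v, 0.0) for v in z_ij)     # assumed definition of (z_j)^-
        for i in range(n_in):
            pos = max(z_ij[i], 0.0) / z_pos if z_pos > 0 else 0.0
            neg = min(z_ij[i], 0.0) / z_neg if z_neg < 0 else 0.0
            r_in[i] += (alpha * pos - beta * neg) * relevance_out[j]
    return r_in


a = [1.0, 1.0]
w = [[2.0], [-1.0]]                 # one positive and one negative contribution
r_a2b1 = lrp_alpha_beta(a, w, [1.0], alpha=2.0, beta=1.0)
r_a1b0 = lrp_alpha_beta(a, w, [1.0], alpha=1.0, beta=0.0)
```

With \(\alpha=1, \beta=0\) the negatively contributing input receives no relevance at all, whereas \(\alpha=2, \beta=1\) assigns it a non-zero (negative) relevance while still conserving the total of 1.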
Recent works recommend a composite strategy of decomposition rule assignments mapping multiple rules purposedly to different parts of the network [25, 29]. This leads to an increased quality of relevance attributions for the intention of explaining prediction outcomes. In the following, a composite strategy consisting of the \(\varepsilon \)-rule for dense layers and the \(\alpha \beta \)-rule with \(\beta =1\) for convolutional layers is used. Regarding LRP-based pruning, Yeom et al. [51] utilize the \(\alpha \beta \)-rule (9) with \(\beta =0\) for convolutional as well as dense layers. However, using \(\beta =0\), sub-parts of the network that contributed solely negatively might receive no relevance. In our case of quantization, all individual weights have to be considered. Thus, the \(\alpha \beta \)-rule with \(\beta =1\) is used for convolutional layers, because it also includes negative contributions in the relevance distribution process and reduces gradient shattering. The LRP implementation is based on the software package Zennit [2], which offers a flexible integration of composite strategies and readily enables extensions required for the computation of relevance scores for weights.
4.2 eXplainability-Driven Entropy-Constrained Quantization
For our novel eXplainability-driven Entropy-Constrained Quantization (ECQ\(^{\text {x}}\)), we modify the ECQ assignment function to optimally re-assign the weight clustering based on LRP relevances in order to achieve higher performance measures and compression efficiency. The rationale behind using LRP to optimize the ECQ quantization algorithm is twofold:
Assignment Correction: In the quantization process, the entropy regularization term encourages weight assignments to more populated clusters in order to minimize the overall entropy. Since weights are usually normally distributed around zero, the entropy term also strongly encourages sparsity. In practice, this quantization scheme works well, rendering sparse and low-bit neural networks for various machine learning tasks and network architectures [28, 48, 50].
From a scientific point of view, however, one might wonder why the shift of numerous weights from their nearest-neighbor clusters to a more distant cluster does not lead to greater model degradation, especially when assigned to zero. The quantization-aware re-training and fine-tuning can, up to a certain extent, compensate for this shift. Here, the LRP-generated relevances show potential to further improve quantization in two ways: 1) by re-adding “highly relevant” weights (i.e., preventing their assignment to zero if they have a high relevance), and 2) by assigning additional, “irrelevant” weights to zero (i.e., preventing their distance- and entropy-based assignment to a non-zero centroid).
We evaluated the discrepancy between weight relevance and magnitude in a correlation analysis depicted in Fig. 4. Here, all weight values \(w_{ij}\) are plotted against their associated relevance \(R_{w_{ij}}\) for the input layer (left) and output layer (right) of the full-precision model MLP\(\_\)GSC (which will be introduced in Sect. 5.1). In addition, histograms of both parameters are shown above and to the right of each relevance-weight-chart in Fig. 4 to better visualize the correlation between \(w_{ij}\) and \(R_{w_{ij}}\). In particular, a weight of high magnitude is not necessarily also a relevant weight. Conversely, there are also weights of small or medium magnitude that have a high relevance and thus should not be omitted in the quantization process. This phenomenon is especially true for layers closer to the input. The outcome of this analysis strongly motivates the use of LRP relevances for the weight assignment correction process of low-bit and sparse ECQ\(^{\text {x}}\).
Regularizing Effect for Training: Since the previously described re-adding (which is also referred to as “regrowth” in literature) and removing of weights due to LRP depends on the propagated input data, weight relevances can change from data batch to data batch. In our quantization-aware training, we apply the STE, and thus the re-assignment of weights, after each forward-backward pass.
The regularizing effect which occurs due to dynamic re-adding and removing of weights is probably related to the generalization effect which random Dropout [42] has on neural networks. However, as elaborated in the extensive survey by Hoefler et al. [21], in terms of dynamic sparsification, re-adding (“drop in”) the best weights is as crucial as removing (“drop out”) the right ones. Instead of randomly dropping weights, the work in [9] shows that re-adding weights based on largest gradients is related to Hebbian learning and biologically more plausible. LRP relevances go beyond the gradient criterion, which is why we consider them a suitable candidate.
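In order to embed LRP relevances in the assignment function (1), we update the cost for the zero centroid (\(c=0\)) by extending it as
\[ \rho ~\mathbf{R} _{W^{(l)}} \cdot \left( d\left( \mathbf{W} ^{(l)}, w_{c=0}^{(l)}\right) - \lambda ^{(l)} \log _2\left( P_{c=0}^{(l)}\right) \right) \tag{10} \]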
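with relevance matrix \(\mathbf{R} _{W^{(l)}}\) containing all weight relevances \(R_{w_{ij}}\) of layer l with row/input index i and column/output index j, as specified in Eq. (7). The relevance-dependent assignment matrix \(\mathbf{A} _{\text {x}}^{(l)}\) is thus described by:
\[ \mathbf{A} _{\text {x}}^{(l)}\left( \mathbf{W} ^{(l)}\right) = \underset{c}{\mathrm {argmin}} {\left\{ \begin{array}{ll} \rho ~\mathbf{R} _{W^{(l)}} \cdot \left( d\left( \mathbf{W} ^{(l)}, w_{c=0}^{(l)}\right) - \lambda ^{(l)} \log _2\left( P_{c=0}^{(l)}\right) \right) , &{} \text {if } c=0 \\ d\left( \mathbf{W} ^{(l)}, w_{c}^{(l)}\right) - \lambda ^{(l)} \log _2\left( P_{c}^{(l)}\right) , &{} \text {if } c \ne 0 \end{array}\right. } \tag{11} \]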
where \(\rho \) is a normalizing scaling factor, which also takes relevances of the previous data batches into account (momentum). The term \(\rho ~\mathbf{R} _{W^{(l)}}\) increases the assignment cost of the zero cluster for relevant weights and decreases it for irrelevant weights.
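The effect of the relevance term on the zero-cluster cost can be sketched as follows. The sketch assumes that \(\rho\,\mathbf{R}\) acts as a multiplicative factor on the zero-cluster cost, with a factor of 1 leaving the entropy-constrained assignment unchanged. The momentum over previous batches is omitted, and the function name, toy weights and relevance values are hypothetical.

```python
import math


def ecqx_assign(weights, relevances, centroids, lam, rho):
    """Entropy-constrained assignment with a relevance-scaled cost for the zero centroid."""
    n_clusters = len(centroids)
    zero = centroids.index(0.0)                      # index of the zero cluster
    nearest = [min(range(n_clusters), key=lambda c: (w - centroids[c]) ** 2)
               for w in weights]
    probs = [nearest.count(c) / len(weights) for c in range(n_clusters)]

    def cost(w, r, c):
        if probs[c] == 0.0:
            return math.inf                          # assumption: skip empty clusters
        base = (w - centroids[c]) ** 2 - lam * math.log2(probs[c])
        # Relevant weights (rho * r > 1) make the zero cluster more expensive,
        # irrelevant weights (rho * r < 1) make it cheaper.
        return rho * r * base if c == zero else base

    return [min(range(n_clusters), key=lambda c: cost(w, r, c))
            for w, r in zip(weights, relevances)]


centroids = [-0.5, 0.0, 0.5]
weights = [-0.6, -0.3, -0.05, 0.0, 0.02, 0.1, 0.2, 0.3, 0.55]
neutral = [0.25] * len(weights)                      # rho * r = 1 for every weight
relevances = [0.05] + [0.25] * 6 + [1.0, 0.25]       # -0.6 irrelevant, 0.3 highly relevant

without_lrp = ecqx_assign(weights, neutral, centroids, lam=0.05, rho=4.0)
with_lrp = ecqx_assign(weights, relevances, centroids, lam=0.05, rho=4.0)
```

Compared with the neutral case, the highly relevant weight 0.3 is re-added to the non-zero cluster, while the irrelevant weight -0.6 is moved to zero despite its large magnitude.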
Figure 5 shows an example of one ECQ\(^{\text {x}}\) iteration that includes the following steps: 1) ECQ\(^{\text {x}}\) computes a forward-backward pass through the quantized model, deriving its weight gradients. LRP relevances \(\mathbf{R} _W\) are computed by redistributing modified gradients according to Eq. (7). 2) LRP relevances are then scaled by a normalizing scaling factor \(\rho \), and 3) weight gradients are scaled by multiplying them with the non-zero centroid values (e.g., the upper left gradient of \(0.03\) is multiplied by the centroid value 1.36). 4) The scaled gradients are then applied to the full-precision (FP) background model, which is a copy of the initial unquantized neural network and is used only for weight assignment, i.e., it is updated with the scaled gradients of the quantized network but does not perform inference itself. 5) The FP model is updated using the ADAM optimizer [23]. Then, weights are assigned to their nearest-neighbor cluster centroids. 6) Finally, the assignment cost \(\mathbf{A} _{\text {x}}\) of each weight to each centroid is calculated using the \(\lambda \)-scaled information content of clusters (i.e., \(I_{- \text { (blue)}}\approx 1.7\), \(I_{0 \text { (green)}}=1.0\) and \(I_{+ \text { (purple)}}\approx 2.4\) in this example) and \(\rho \)-scaled relevances. Here, relevances above the exemplary threshold (i.e., mean \(\bar{\mathbf{R }}_W\approx 0.3\)) increase the cost for the zero cluster assignment, while relevances below (highlighted in red) decrease it. Each weight is assigned such that the cost function is minimized according to Eq. (11). 7) Depending on the intensity of the entropy and relevance constraints (controlled by \(\lambda \) and \(\rho \)), different assignment candidates can be rendered to fit a specific deep learning task. In the example shown in Fig. 5, an exemplary candidate grid was selected, which is depicted at the top left of the figure. 
The weight at grid coordinate D2, for example, was assigned to the zero cluster due to its irrelevance and the weight at C3 due to the entropy constraint.
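The cost computation of step 6 can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the reference implementation: the cluster counts (5 negative, 8 zero and 3 positive weights out of 16) are chosen only because they reproduce the information contents quoted for the example, and the negative centroid value, the squared-distance term and the values of `lam` and `rho` are hypothetical. The multiplicative relevance factor on the zero cluster follows the description of the \(\rho ~\mathbf{R} _{W^{(l)}}\) term above.

```python
import math

def information_content(counts):
    """Return I_c = -log2(P_c) per cluster, with P_c the empirical cluster probability."""
    total = sum(counts.values())
    return {c: -math.log2(n / total) for c, n in counts.items() if n > 0}

def assign(weight, relevance, centroids, info, lam, rho):
    """Return the name of the centroid with the lowest assignment cost for one weight.

    Assumed cost: squared distance + lam * I_c. For the zero cluster only, the
    cost is additionally multiplied by rho * relevance, so relevant weights
    (rho * relevance > 1) find the zero cluster more expensive.
    """
    best, best_cost = None, math.inf
    for name, value in centroids.items():
        cost = (weight - value) ** 2 + lam * info[name]
        if value == 0.0:
            cost *= rho * relevance
        if cost < best_cost:
            best, best_cost = name, cost
    return best

# Hypothetical cluster occupancy of a 4x4 weight grid.
counts = {"neg": 5, "zero": 8, "pos": 3}
info = information_content(counts)

# 1.36 is the centroid value quoted in the text; -1.2 is made up.
centroids = {"neg": -1.2, "zero": 0.0, "pos": 1.36}

relevant = assign(0.5, 1.0, centroids, info, lam=0.1, rho=4.0)
irrelevant = assign(0.5, 0.1, centroids, info, lam=0.1, rho=4.0)
```

With these numbers, the same weight value of 0.5 is kept in the positive cluster when its relevance is high and is sent to the zero cluster when its relevance is low, which is the behavior the relevance term is meant to produce.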
In the case of dense or convolutional layers, LRP relevances can be computed efficiently using the autograd functionality, as mentioned in Sect. 4.1. For a classification task, it is sensible to use the target class score as a starting point for the LRP backward pass. This way, the relevance of a neuron or weight describes its contribution to the target class prediction. Since the output is propagated throughout the network, all relevance is proportional to the output score. Consequently, relevances of each sample in a training batch are, in general, weighted differently according to their respective model output, or prediction confidence. This implicit weighting is desirable when the aim is to suppress relevances for inaccurate predictions, because a low output score usually corresponds to an unconfident decision of the model.
After the relevance calculation of a whole data batch, the relevance scores \(\mathbf{R} _{W^{(l)}}\) are transformed to their absolute value and normalized, such that \(\mathbf{R} _{W^{(l)}} \in [0, 1]\). Even though negative contributions work against an output, they might still be relevant to the network functionality, and their influence is thus considered instead of omitted. On one hand, they can lead to positive contributions for other classes. On the other, they can be relevant to balancing neuron activations throughout the network.
The relevance matrices \(\mathbf{R} _{W^{(l)}}\) resulting from LRP are usually sparse, as can be seen in the weight histograms of Fig. 4. In order to control the effect of LRP in the assignment function, the relevances are exponentially transformed by \(\beta \), applying a similar effect as for gamma correction in image processing:
with \(\beta \in [0, 1]\). Here, the parameter \(\beta \) is initially chosen such that the mean relevance \(\hat{\mathbf{R }}_{W^{(l)}}\) does not change the assignment, i.e., \(\rho \left( \hat{\mathbf{R }}_{W^{(l)}} \right) ^\beta = 1\) or, equivalently, \(\beta = -\frac{\ln {\rho }}{\ln {\hat{\mathbf{R }}_{W^{(l)}}}}\). In order to further control the sparsity of a layer, the target sparsity p is introduced. If the assignment increases a layer’s sparsity by more than the target sparsity p, parameter \(\beta \) is accordingly minimized. Thus, in ECQ\(^{\text {x}}\), LRP relevances are directly included in the assignment function and their effect can be controlled by parameter p. An experimental validation of the developed ECQ\(^{\text {x}}\) method, including state-of-the-art comparison and parameter variation tests, is given in the following section.
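The relevance preprocessing and the initial choice of \(\beta \) can be illustrated as follows. Solving \(\rho \, \hat{R}^\beta = 1\) for \(\beta \) gives \(\beta = -\ln \rho / \ln \hat{R}\), which is what `initial_beta` computes. The normalization by the maximum magnitude, the function names and the toy relevance values are assumptions made for this sketch; the text only states that relevances are made absolute and mapped to \([0, 1]\).

```python
import math

def normalize(relevances):
    """Take absolute values and rescale to [0, 1] (assumed: divide by the maximum)."""
    magnitudes = [abs(r) for r in relevances]
    peak = max(magnitudes)
    return [m / peak for m in magnitudes] if peak > 0 else magnitudes

def initial_beta(rho, mean_relevance):
    """Solve rho * mean_relevance**beta == 1 for beta (requires 0 < mean_relevance < 1)."""
    return -math.log(rho) / math.log(mean_relevance)

def zero_cluster_factor(relevances, rho, beta):
    """Apply the gamma-like transform rho * R**beta element-wise."""
    return [rho * r ** beta for r in relevances]

# Hypothetical, sparse relevance scores of five weights (signs are discarded).
raw = [-4.0, 1.0, 0.0, 0.0, 0.0]
normalized = normalize(raw)
mean_relevance = sum(normalized) / len(normalized)

rho = 2.0  # hypothetical scaling factor
beta = initial_beta(rho, mean_relevance)
factors = zero_cluster_factor(normalized, rho, beta)
```

In this toy case the mean normalized relevance is 0.25, so \(\beta = 0.5\). The weight whose relevance equals the mean receives a factor of 1 and its assignment is unchanged, the most relevant weight receives a factor above 1, and the weights with zero relevance receive a factor below 1.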
5 Experiments
In the experiments, we evaluate our novel quantization method ECQ\(^{\text {x}}\) using two widely used types of neural network architectures, namely convolutional neural networks (CNNs) and a multilayer perceptron (MLP). More precisely, we deploy VGG16 for the task of small-scale image classification (CIFAR-10), ResNet18 for the Pascal Visual Object Classes Challenge (Pascal VOC) and an MLP with 5 hidden layers and ReLU non-linearities solving the task of keyword spotting in audio data (Google Speech Commands).
In the first subsection, the experimental setup and test conditions are described, while the results are shown and discussed in the second subsection. In particular, results for ECQ\(^{\text {x}}\) hyperparameter variation are shown, followed by a comparison against classical ECQ and results for bit width variation. Finally, overall results for ECQ\(^{\text {x}}\) for different accuracy and compression measurements are shown and discussed.
5.1 Experimental Setup
All experiments were conducted using the PyTorch deep learning framework, version 1.7.1 with torchvision 0.8.2 and torchaudio 0.7.2 extensions. As a hardware platform we used Tesla V100 GPUs with CUDA version 10.2. The quantization-aware training of ECQ\(^{\text {x}}\) was executed for 20 epochs in all experiments. As an optimizer we used ADAM with an initial learning rate of 0.0001. In the scope of the training procedure, we consider all convolutional and fully-connected layers of the neural networks for quantization, including the input and output layers. Note that numerous approaches in related works keep the input and/or output layers in full-precision (32 bit float), which may compensate for the model degradation caused by quantization, but is usually difficult to bring into application and incurs significant overhead in terms of energy consumption.
Google Speech Commands. The Google Speech Commands (GSC [46]) dataset consists of 105,829 utterances of 35 words recorded from 2,618 speakers. The standard is to discriminate ten words “Yes”, “No”, “Up”, “Down”, “Left”, “Right”, “On”, “Off”, “Stop”, and “Go”, and to add two additional labels, one for “Unknown Words”, and another for “Silence” (no speech detected). Following the official TensorFlow example code for training (see Note 3), we implemented the corresponding data augmentation with PyTorch’s torchaudio package. It includes randomly adding background noise with a probability of 80\(\%\) and time shifting the audio by [\(-100, 100\)] ms with a probability of 50\(\%\). To generate features, the audio is transformed to MFCC fingerprints (Mel Frequency Cepstral Coefficients). We use 15 bins and a window length of 2000 ms. To solve GSC, we deploy an MLP (which we name MLP_GSC in the following) consisting of an input layer, five hidden layers and an output layer featuring 512, 512, 256, 256, 128, 128 and 12 output features, respectively. The MLP_GSC was pretrained for 100 epochs using stochastic gradient descent (SGD) optimization with a momentum of 0.9, an initial learning rate of 0.01 and a cosine annealing learning rate schedule.
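The two augmentation steps can be sketched in plain Python on a list of audio samples. This is not the torchaudio-based implementation used in the experiments: the 16 kHz sampling rate, the noise gain of 0.1, the order of the two steps and the zero-padding of shifted clips are assumptions made for illustration. Only the probabilities (80% noise, 50% shift) and the shift range of ±100 ms are taken from the text.

```python
import random

SAMPLE_RATE = 16000  # Hz; assumed sampling rate of the clips

def time_shift(samples, shift_ms, sample_rate=SAMPLE_RATE):
    """Shift a clip in time, padding with silence so that its length is kept."""
    n = int(round(shift_ms * sample_rate / 1000))
    n = max(-len(samples), min(len(samples), n))  # clamp to the clip length
    if n > 0:  # delay: silence is inserted at the start
        return [0.0] * n + list(samples[:len(samples) - n])
    if n < 0:  # advance: silence is appended at the end
        return list(samples[-n:]) + [0.0] * (-n)
    return list(samples)

def augment(samples, noise, rng, p_noise=0.8, p_shift=0.5,
            max_shift_ms=100.0, noise_gain=0.1):
    """Randomly time-shift a clip and mix in background noise.

    `noise` is assumed to be at least as long as `samples`;
    `noise_gain` is a hypothetical mixing factor.
    """
    out = list(samples)
    if rng.random() < p_shift:
        out = time_shift(out, rng.uniform(-max_shift_ms, max_shift_ms))
    if rng.random() < p_noise:
        out = [s + noise_gain * n for s, n in zip(out, noise)]
    return out

rng = random.Random(0)  # seeded for reproducibility
clip = [0.5] * 1600     # 100 ms of dummy audio
background = [0.1] * 1600
augmented = augment(clip, background, rng)
```

Passing a seeded `random.Random` instance keeps the augmentation reproducible across runs, which is convenient when comparing quantization settings on identical training data.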
CIFAR-10. The CIFAR-10 [26] dataset consists of natural images with a resolution of \(32\times 32\) pixels. It contains 10 classes, with 6,000 images per class. Data is split into 50,000 training and 10,000 test images. We use standard data pre-processing, i.e., normalization, random horizontal flipping and cropping. To solve the task, we deploy a VGG16 from the torchvision model zoo (see Note 4). The VGG16 classifier is adapted from 1,000 ImageNet classes to ten CIFAR classes by replacing its three fully-connected layers (with dimensions [25,088, 4,096], [4,096, 4,096], [4,096, 1,000]) by two ([512, 512], [512, 10]), as a consequence of CIFAR’s smaller image size. We also implemented a VGG16 supporting batch normalization (“BatchNorm” in the following), i.e., VGG16_bn from torchvision. The VGGs were transfer-learned for 60 epochs using ADAM optimization and an initial learning rate of 0.0005.
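To make the effect of the classifier replacement concrete, the following snippet counts the weights of the original and the adapted fully-connected head from the layer dimensions given above. Biases are ignored for simplicity, and the helper name `dense_weights` is introduced only for this illustration.

```python
def dense_weights(shapes):
    """Sum the number of weights of dense layers given as (in, out) pairs; biases excluded."""
    return sum(n_in * n_out for n_in, n_out in shapes)

# Layer dimensions as stated in the text.
imagenet_head = [(25088, 4096), (4096, 4096), (4096, 1000)]
cifar_head = [(512, 512), (512, 10)]

imagenet_weights = dense_weights(imagenet_head)  # 123,633,664
cifar_weights = dense_weights(cifar_head)        # 267,264
shrink_factor = imagenet_weights / cifar_weights
```

The adapted head is thus more than 400 times smaller than the ImageNet head, so the convolutional layers account for most of the parameters of the CIFAR-10 model.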
Pascal VOC. The Pascal Visual Object Classes Challenge 2012 (VOC2012) [13] provides 11,540 images associated with 20 classes. The dataset has been split into 80\(\%\) for training/validation and 20\(\%\) for testing. We applied normalization, random horizontal flipping and center cropping to \(224\times 224\) pixels. As a neural network architecture, the pretrained ResNet18 from the torchvision model zoo was deployed. Its classifier was adapted to predict 20 instead of 1,000 classes and the model was transfer-learned for 30 epochs using ADAM optimization with an initial learning rate of 0.0001.
5.2 ECQ\(^{\text {x}}\) Results
In this subsection, we compare ECQ\(^{\text {x}}\) to state-of-the-art ECQ quantization, analysing accuracy preservation vs. sparsity increase. Furthermore, we investigate ECQ\(^{\text {x}}\) compressibility, behavior on BatchNorm layers, and an appropriate choice of hyperparameters.
ECQ\(^\mathbf{x}\) Hyperparameter Variation. In ECQ\(^{\text {x}}\), two important hyperparameters, \(\lambda \) and p, influence the performance and thus are optimized for the comparative experiments described below. The parameter \(\lambda \) increases the intensity of the entropy constraint and thus distributes the working points of each trial over a range of sparsities (see Fig. 6). The p hyperparameter defines an upper bound for the per-layer percentage of zero values, allowing a maximum amount of p additional sparsity, on top of the \(\lambda \)-introduced sparsity. It thus implicitly controls the intensity of the LRP constraint.
Figure 6 shows results using several p values for the 4 bit (\(bw=4\)) quantization of the MLP_GSC model. Note that the variation of bit width bw is discussed below the comparative results. For smaller p, less sparse models are rendered with higher top-1 accuracies in the low-sparsity regime (e.g., \(p=0.02\) or \(p=0.05\) between 30–50% total network sparsity). In the regime of higher sparsity, larger values of p show a better sparsity-accuracy trade-off. Note that larger p do not only set more weights to zero but also re-add relevant weights (regrowth). For \(p=0.4\) and \(p=0.5\), both lines are congruent since no layer is achieving more than \(40\%\) additional LRP-introduced sparsity with the initial \(\beta \) value (cf. Sect. 4.2).
ECQ\(^\mathbf{x}\) vs. ECQ Analysis. As shown in Fig. 7, the LRP-driven ECQ\(^{\text {x}}\) approach renders models with higher performance and simultaneously higher efficiency. In this comparison, efficiency is determined in terms of sparsity, which can be exploited to compress the model more or to skip arithmetic operations with zero values. Both methods achieve a quantization to 4 bit integer without any performance degradation of the model. Performance is even slightly increased due to quantization when compared to the unquantized baseline. In the regime of high sparsity, model accuracy of the previous state-of-the-art (ECQ) drops significantly faster compared to the LRP-adjusted quantization scheme.
Regarding the handling of BatchNorm modules for LRP, it is proposed in literature to merge the BatchNorm layer parameters with the preceding linear layer [15] into a single linear transformation. This canonization process is sensible, because it reduces the number of computational steps in the backward pass while maintaining functional equivalence between the original and the canonized model in the forward pass.
It has further been shown that network canonization can increase explanation quality [15]. However, with the aim of computing weight relevance scores for a BatchNorm layer’s adjacent linear layer in its original (trainable) state, keeping the layers separate is more favorable than merging. Therefore, the \(\alpha \beta \)-rule with \(\beta =1\) is also applied to BatchNorm layers. The quantization results of the VGG architecture with BatchNorm modules and ResNet18 are shown in Fig. 8.
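For reference, the canonization step discussed above (which ECQ\(^{\text {x}}\) deliberately does not apply) can be sketched for a single output unit. The snippet folds inference-mode BatchNorm statistics into the preceding linear unit using the standard folding identity and then evaluates both variants on the same input. All numeric values and function names are hypothetical; `beta` here denotes the BatchNorm shift parameter, not the LRP parameter of the same name.

```python
import math

def linear(x, weights, bias):
    """Compute one output unit of a dense layer."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias

def batchnorm(y, gamma, beta, mean, var, eps=1e-5):
    """Apply inference-mode batch normalization using running statistics."""
    return gamma * (y - mean) / math.sqrt(var + eps) + beta

def fold_batchnorm(weights, bias, gamma, beta, mean, var, eps=1e-5):
    """Merge BatchNorm into the preceding linear unit; returns (weights, bias)."""
    scale = gamma / math.sqrt(var + eps)
    return [w * scale for w in weights], (bias - mean) * scale + beta

# Hypothetical parameters of one linear unit and its BatchNorm channel.
x = [1.0, -2.0, 0.5]
w, b = [0.3, -0.1, 0.8], 0.05
bn = dict(gamma=1.5, beta=-0.2, mean=0.4, var=0.09)

separate = batchnorm(linear(x, w, b), **bn)
merged_w, merged_b = fold_batchnorm(w, b, **bn)
merged = linear(x, merged_w, merged_b)
```

Both variants give the same output up to floating-point error, which is the functional equivalence in the forward pass mentioned above. The merged weights, however, no longer coincide with the trainable weights of the original layer, which is why the layers are kept separate for relevance computation here.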
In order to capture the computational overhead of LRP in terms of additional training time, we compared the average training times of the different model architectures per epoch. Relevance-dependent quantization (ECQ\(^{\text {x}}\)) requires approximately 1.2\(\times \), 2.4\(\times \), and 3.2\(\times \) more processing time than baseline quantization (ECQ) for the MLP_GSC, VGG16, and ResNet18 architectures, respectively. This extra effort can be explained by the additional forward-backward passes performed in Zennit for LRP computation. More concretely, Zennit, used as a plug-in XAI module, computes one additional forward pass layer-wise and redistributes the relevances to the preceding layers according to the decomposition and aggregation rules specified in Sect. 4.1. For redistribution, Zennit computes one additional backward pass for \(\varepsilon \)-rule associated layers and two additional backward passes for \(\alpha \beta \)-rule associated layers in order to derive positive \(\alpha \) and negative \(\beta \) relevance contributions. To recap, in the applied composite strategy, the \(\varepsilon \)-rule is used for dense layers and the \(\alpha \beta \)-rule for convolutional layers and BatchNorm parameters, which results in the extra computational cost for VGG16 and ResNet18 compared to MLP_GSC, which consists solely of dense layers. In addition, the aggregation of relevances needed for convolutional filters is not required for dense layers. Note that the above-mentioned values for the additional computational overhead of ECQ\(^{\text {x}}\) due to relevance computation can be interpreted as an upper bound and that there are options to minimize the effort, e.g., by 1) not considering relevances for cluster assignments in each training iteration, or 2) leveraging pre-computed outputs or even gradients from the quantized base model instead of separately computing forward-backward passes with a model copy in the Zennit module. Whereas 1) corresponds to a change in the quantization setup, 2) requires parallelization optimizations of the software framework.
Bit Width Variation. Bit width reduction has multiple benefits over full-precision in terms of memory, latency, power consumption, and chip area efficiency. For instance, a reduction from standard 32 bit precision to 8 bit or 4 bit directly leads to a memory reduction of almost 4\(\times \) and 8\(\times \), respectively. Arithmetic with lower bit width is substantially faster if the hardware supports it. E.g., since the release of NVIDIA’s Turing architecture, 4 bit integer is supported, which increases the throughput of the RTX 6000 GPU to 522 TOPS (tera operations per second), when compared to 8 bit integer (261 TOPS) or 32 bit floating point (14.2 TFLOPS) [31]. Furthermore, Horowitz showed that, for a 45 nm technology, low-precision logic is significantly more efficient in terms of energy and area [22]. For example, performing 8 bit integer addition and multiplication is 30\(\times \) and 19\(\times \) more energy efficient compared to 32 bit floating point addition and multiplication. The respective chip area efficiency is increased by 116\(\times \) and 27\(\times \) as compared to 32 bit float. It is also shown that memory reads and writes have the highest energy cost, especially when reading data from external DRAM. This further motivates bit width reduction because it can reduce the number of overall RAM accesses since more data fits into the same caches/registers when having a reduced precision.
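The memory and throughput figures above can be reproduced with a few lines of arithmetic. The helper names are introduced only for this sketch, and the size estimate assumes densely packed weights without any container or metadata overhead, which is presumably why the text speaks of "almost" 4\(\times \) and 8\(\times \) in practice.

```python
def dense_size_bytes(num_weights, bits):
    """Size of a densely packed weight tensor, ignoring any file or metadata overhead."""
    return num_weights * bits / 8

def memory_reduction(bits, reference_bits=32):
    """Ideal memory reduction factor relative to the reference precision."""
    return reference_bits / bits

# Peak throughput figures quoted in the text (TOPS for integer, TFLOPS for fp32).
throughput = {"int4": 522.0, "int8": 261.0, "fp32": 14.2}

int4_vs_int8 = throughput["int4"] / throughput["int8"]
int4_vs_fp32 = throughput["int4"] / throughput["fp32"]
size_fp32 = dense_size_bytes(1_000_000, 32)
size_int4 = dense_size_bytes(1_000_000, 4)
```

For the quoted GPU, halving the bit width from 8 to 4 doubles the peak throughput, and 4 bit integer arithmetic offers roughly 37 times the peak throughput of 32 bit floating point.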
In order to investigate different bit widths in the regime of ultra low precision, we compare the compressibility and model performances of the MLP_GSC and VGG16 networks when quantized to 2 bit, 3 bit, 4 bit and 5 bit integer values (see Figs. 9 and 10). Here, we directly encoded the integer tensors with the DeepCABAC codec of the ISO/IEC MPEG NNR standard [24]. The least sparse working points of each trial, i.e., the rightmost data points of each line, show the expected behaviour, namely that compressibility is increased by continuously reducing the bit width from 5 bit to 2 bit. However, this effect decreases or even reverses when the bit width is in the range of 3 bit to 5 bit. In other words, reducing the number of centroids from \(2^5=32\) to \(2^3=8\) does not necessarily lead to a further significant reduction in the resulting bitstream size if sparsity is predominant. The 2 bit quantization still minimizes the size of the bit stream, even if, especially for the VGG model, more accuracy is sacrificed for this purpose. Note that compressibility is only one reason for reducing bit width besides, for example, speeding up model inference due to increased throughput.
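A back-of-the-envelope calculation illustrates why the bit width matters less once sparsity dominates. The sketch below computes the first-order (Shannon) entropy of a quantized tensor in which a fraction `sparsity` of the weights sits in the zero cluster and the remaining weights are spread uniformly over the other centroids. The uniform spread is an illustrative assumption, and DeepCABAC is a context-adaptive coder whose actual bitstream sizes are not modeled here.

```python
import math

def first_order_entropy(sparsity, bits):
    """Entropy in bits per weight for 2**bits centroids.

    One centroid (zero) has probability `sparsity`; the remaining probability
    mass is assumed to be spread uniformly over the other centroids.
    """
    k = 2 ** bits
    probs = [sparsity] + [(1.0 - sparsity) / (k - 1)] * (k - 1)
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Uniform cluster usage: the entropy equals the bit width.
dense_2bit = first_order_entropy(0.25, 2)
dense_5bit = first_order_entropy(1 / 32, 5)

# Highly sparse tensors (90% zeros): the entropy is dominated by the zero cluster.
sparse_2bit = first_order_entropy(0.9, 2)
sparse_5bit = first_order_entropy(0.9, 5)
```

Under these assumptions, going from 5 bit to 2 bit saves 3 bits per weight when the clusters are used uniformly, but only about 0.34 bits per weight at 90% sparsity (roughly 0.96 versus 0.63 bits), which is consistent with the diminishing returns observed above.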
ECQ\(^\mathbf{x}\) Results Overview. In addition to the performance graphs in the previous subsections, all quantization results are summarized in Table 1. Here, ECQ\(^{\text {x}}\) and ECQ are compared specifically for a 2 and 4 bit quantization as these fit particularly well to power-of-two hardware registers. The ECQ\(^{\text {x}}\) 4 bit quantization achieves a compression ratio for VGG16 of 103\(\times \) with a negligible drop in accuracy of \(0.1\%\). In comparison, ECQ achieves the same compression ratio only with a model degradation of \(1.23\%\) top-1 accuracy. For the 4 bit quantization of MLP_GSC, ECQ\(^{\text {x}}\) achieves its highest accuracy (“drop”, i.e., increase of \(+0.71\%\) compared to the unquantized baseline model) with a compression ratio that is almost \(10\%\) larger compared to the highest achievable accuracy of ECQ (\(+0.47\%\)). For sparsities beyond \(70\%\), ECQ significantly reduces the model’s predictive performance, e.g., at a sparsity of \(80.39\%\) ECQ shows a loss of \(1.40\%\) whereas ECQ\(^{\text {x}}\) only degrades by \(0.34\%\). ResNet18 sacrifices performance at each quantization setting, but especially for ECQ\(^{\text {x}}\) the accuracy loss is negligible. The 2 bit representations of ResNet18 sacrifice more than \(5\%\) top-1 accuracy compared to the unquantized model, which may be compensated with more than 20 epochs of quantization-aware training, but is also due to the higher complexity of the Pascal VOC task.
Finally, the 2 bit results in Table 1 show two major findings: 1) with only a minor model degradation, all weight layers of the MLP_GSC and VGG networks can also be quantized to only 4 discrete centroid values while still maintaining a high level of sparsity, and 2) ECQ\(^{\text {x}}\) renders more compressible models than ECQ, as indicated by the higher compression ratios CR.
6 Conclusion
In this chapter we presented a new entropy-constrained neural network quantization method (ECQ\(^{\text {x}}\)), utilizing weight relevance information from Layer-wise Relevance Propagation (LRP). Thus, our novel method combines concepts of explainable AI (XAI) and information theory. In particular, instead of only assigning weight values based on their distances to respective quantization clusters, the assignment function additionally considers weight relevances based on LRP. In detail, each weight’s contribution to inference in interaction with the transformed data, as well as cluster information content, is calculated and applied. For this approach, we first utilized the observation that a weight’s magnitude does not necessarily correlate with its importance or relevance for a model’s inference capability. Next, we verified this observation in a relevance vs. weight (magnitude) correlation analysis and subsequently introduced our ECQ\(^{\text {x}}\) method. As a result, smaller weight parameters that are usually omitted in a classical quantization process are preserved if their relevance score indicates a stronger contribution to the overall neural network accuracy or performance.
The experimental results show that this novel ECQ\(^{\text {x}}\) method generates low bit width (2–5 bit) and sparse neural networks while maintaining or even improving model performance. Therefore, in particular the 2 and 4 bit variants are highly suitable for neural network hardware adaptation tasks. Due to the reduced parameter precision and high number of zero-elements, the rendered networks are also highly compressible in terms of file size, e.g., up to 103\(\times \) compared to the full-precision unquantized DNN model, without degrading the model performance. Our ECQ\(^{\text {x}}\) approach was evaluated on different types of models and datasets (including Google Speech Commands, CIFAR-10 and Pascal VOC). The comparative results vs. state-of-the-art entropy-constrained-only quantization (ECQ) show a performance increase in terms of higher sparsity, as well as a higher compression. Finally, hyperparameter optimization and bit width variation results were also presented, from which the optimal parameter selection for ECQ\(^{\text {x}}\) was derived.
Notes
 1. In practice, pruning is often simulated by masking, instead of actually restructuring the model’s architecture.
 2. https://github.com/lukemelas/EfficientNet-PyTorch, Apache License, Version 2.0 - Copyright (c) 2019 Luke Melas-Kyriazi.
 3.
 4.
References
1. Ancona, M., Ceolini, E., Öztireli, C., Gross, M.: Gradient-based attribution methods. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. LNCS (LNAI), vol. 11700, pp. 169–191. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6_9
2. Anders, C.J., Neumann, D., Samek, W., Müller, K.-R., Lapuschkin, S.: Software for dataset-wide XAI: from local explanations to global insights with Zennit, CoRelAy, and ViRelAy. CoRR abs/2106.13200 (2021)
3. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., Samek, W.: On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PLoS ONE 10(7), e0130140 (2015)
4. Balduzzi, D., Frean, M., Leary, L., Lewis, J., Ma, K.W.D., McWilliams, B.: The shattered gradients problem: if ResNets are the answer, then what is the question? In: International Conference on Machine Learning, pp. 342–350. PMLR (2017)
5. Bengio, Y., Léonard, N., Courville, A.C.: Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR abs/1308.3432 (2013)
6. Bhalgat, Y., Lee, J., Nagel, M., Blankevoort, T., Kwak, N.: LSQ+: improving low-bit quantization through learnable offsets and better initialization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2020
7. Choi, Y., El-Khamy, M., Lee, J.: Towards the limit of network quantization. CoRR abs/1612.01543 (2016)
8. Courbariaux, M., Bengio, Y., David, J.-P.: BinaryConnect: training deep neural networks with binary weights during propagations. In: Advances in Neural Information Processing Systems, pp. 3123–3131 (2015)
9. Dai, X., Yin, H., Jha, N.K.: Nest: a neural network synthesis tool based on a grow-and-prune paradigm. IEEE Trans. Comput. 68(10), 1487–1497 (2019)
10. Deng, B.L., Li, G., Han, S., Shi, L., Xie, Y.: Model compression and hardware acceleration for neural networks: a comprehensive survey. Proc. IEEE 108(4), 485–532 (2020)
11. Denil, M., Shakibi, B., Dinh, L., Ranzato, M., de Freitas, N.: Predicting parameters in deep learning. In: Advances in Neural Information Processing Systems, pp. 2148–2156 (2013)
12. Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20(1), 1997–2017 (2019)
13. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
14. Gholami, A., Kim, S., Dong, Z., Yao, Z., Mahoney, M.W., Keutzer, K.: A survey of quantization methods for efficient neural network inference. CoRR abs/2103.13630 (2021)
15. Guillemot, M., Heusele, C., Korichi, R., Schnebert, S., Chen, L.: Breaking batch normalization for better explainability of deep neural networks through layer-wise relevance propagation. CoRR abs/2002.11018 (2020)
16. Han, S., Mao, H., Dally, W.J.: Deep compression: compressing deep neural network with pruning, trained quantization and Huffman coding. In: 4th International Conference on Learning Representations (ICLR) (2016)
17. Han, S., Pool, J., Tran, J., Dally, W.: Learning both weights and connections for efficient neural network. In: Advances in Neural Information Processing Systems, vol. 28. Curran Associates, Inc. (2015)
18. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
19. He, Y., Zhang, X., Sun, J.: Channel pruning for accelerating very deep neural networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 1389–1397 (2017)
20. Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv abs/1503.02531 (2015)
21. Hoefler, T., Alistarh, D., Ben-Nun, T., Dryden, N., Peste, A.: Sparsity in deep learning: pruning and growth for efficient inference and training in neural networks (2021)
22. Horowitz, M.: 1.1 computing’s energy problem (and what we can do about it). In: 2014 IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 10–14 (2014)
23. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization (2014). arXiv:1412.6980. Published as a conference paper at the 3rd International Conference for Learning Representations, San Diego (2015)
24. Kirchhoffer, H., et al.: Overview of the neural network compression and representation (NNR) standard. IEEE Trans. Circuits Syst. Video Technol. 1–14 (2021). https://doi.org/10.1109/TCSVT.2021.3095970
25. Kohlbrenner, M., Bauer, A., Nakajima, S., Binder, A., Samek, W., Lapuschkin, S.: Towards best practice in explaining neural network decisions with LRP. In: 2020 International Joint Conference on Neural Networks (IJCNN), pp. 1–7. IEEE (2020)
26. Krizhevsky, A.: Learning Multiple Layers of Features from Tiny Images, April 2009
27. LeCun, Y., Denker, J.S., Solla, S.A.: Optimal brain damage. In: Advances in Neural Information Processing Systems, pp. 598–605 (1990)
28. Marban, A., Becking, D., Wiedemann, S., Samek, W.: Learning sparse & ternary neural networks with entropy-constrained trained ternarization (EC2T). In: The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pp. 3105–3113, June 2020
29. Montavon, G., Binder, A., Lapuschkin, S., Samek, W., Müller, K.-R.: Layer-wise relevance propagation: an overview. In: Samek, W., Montavon, G., Vedaldi, A., Hansen, L.K., Müller, K.-R. (eds.) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. LNCS (LNAI), vol. 11700, pp. 193–209. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-28954-6_10
30. Montavon, G., Samek, W., Müller, K.-R.: Methods for interpreting and understanding deep neural networks. Digit. Signal Process. 73, 1–15 (2018)
31. NVIDIA Turing GPU Architecture - Graphics Reinvented. Technical report, WP-09183-001_v01, NVIDIA Corporation (2018)
32. Park, E., Ahn, J., Yoo, S.: Weighted-entropy-based quantization for deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 7197–7205 (2017)
33. Paszke, A., et al.: Automatic differentiation in PyTorch (2017)
34. Sabih, M., Hannig, F., Teich, J.: Utilizing explainable AI for quantization and pruning of deep neural networks. CoRR abs/2008.09072 (2020)
35. Samek, W., Montavon, G., Lapuschkin, S., Anders, C.J., Müller, K.-R.: Explaining deep neural networks and beyond: a review of methods and applications. Proc. IEEE 109(3), 247–278 (2021)
36. Sandler, M., Howard, A., Zhu, M., Zhmoginov, A., Chen, L.-C.: MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4510–4520 (2018)
37. Schütt, K.T., Arbabzadah, F., Chmiela, S., Müller, K.-R., Tkatchenko, A.: Quantum-chemical insights from deep tensor neural networks. Nat. Commun. 8(1), 1–8 (2017)
38. Senior, A.W., et al.: Improved protein structure prediction using potentials from deep learning. Nature 577(7792), 706–710 (2020)
39. Shannon, C.E.: A mathematical theory of communication. Bell Syst. Tech. J. 27(3), 379–423 (1948)
40. Shrikumar, A., Greenside, P., Shcherbina, A., Kundaje, A.: Not just a black box: learning important features through propagating activation differences. CoRR abs/1605.01713 (2016)
41. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
42. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15(1), 1929–1958 (2014)
43. Sze, V., Chen, Y., Yang, T., Emer, J.S.: Efficient processing of deep neural networks: a tutorial and survey. Proc. IEEE 105(12), 2295–2329 (2017)
44. Tan, M., et al.: MnasNet: platform-aware neural architecture search for mobile. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 2820–2828 (2019)
45. Warden, P., Situnayake, D.: TinyML: Machine Learning with TensorFlow Lite on Arduino and Ultra-Low-Power Microcontrollers. O’Reilly Media (2020)
46. Warden, P.: Speech commands: a dataset for limited-vocabulary speech recognition. CoRR abs/1804.03209 (2018)
47. Wiedemann, S., et al.: DeepCABAC: a universal compression algorithm for deep neural networks. IEEE J. Sel. Top. Signal Process. 14(4), 700–714 (2020)
48. Wiedemann, S., Marban, A., Müller, K.-R., Samek, W.: Entropy-constrained training of deep neural networks. In: 2019 International Joint Conference on Neural Networks (IJCNN), pp. 1–8 (2019)
49. Wiedemann, S., Müller, K.-R., Samek, W.: Compact and computationally efficient representation of deep neural networks. IEEE Trans. Neural Netw. Learn. Syst. 31(3), 772–785 (2020)
50. Wiedemann, S., et al.: FantastIC4: a hardware-software co-design approach for efficiently running 4bit-compact multilayer perceptrons. IEEE Open J. Circuits Syst. 2, 407–419 (2021)
51. Yeom, S.-K., et al.: Pruning by explaining: a novel criterion for deep neural network pruning. Pattern Recogn. 115, 107899 (2021)
52. Zhou, S., Ni, Z., Zhou, X., Wen, H., Wu, Y., Zou, Y.: DoReFa-Net: training low bitwidth convolutional neural networks with low bitwidth gradients. CoRR abs/1606.06160 (2016)
53. Zhu, C., Han, S., Mao, H., Dally, W.J.: Trained ternary quantization. In: International Conference on Learning Representations (ICLR) (2017)
Acknowledgements
This work was supported by the German Ministry for Education and Research as BIFOLD (ref. 01IS18025A and ref. 01IS18037A), the European Union’s Horizon 2020 programme (grant no. 965221 and 957059), and the Investitionsbank Berlin under contract No. 10174498 (Pro FIT programme).
Rights and permissions
Open Access This chapter is licensed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license and indicate if changes were made.
The images or other third party material in this chapter are included in the chapter's Creative Commons license, unless indicated otherwise in a credit line to the material. If material is not included in the chapter's Creative Commons license and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
Copyright information
© 2022 The Author(s)
About this chapter
Cite this chapter
Becking, D., Dreyer, M., Samek, W., Müller, K., Lapuschkin, S. (2022). ECQ\(^{\text {x}}\): Explainability-Driven Quantization for Low-Bit and Sparse DNNs. In: Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, K.-R., Samek, W. (eds) xxAI - Beyond Explainable AI. xxAI 2020. Lecture Notes in Computer Science, vol 13200. Springer, Cham. https://doi.org/10.1007/978-3-031-04083-2_14
DOI: https://doi.org/10.1007/978-3-031-04083-2_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-04082-5
Online ISBN: 978-3-031-04083-2
eBook Packages: Computer Science, Computer Science (R0)