1 Introduction

Convolutional Neural Networks (CNNs) are increasingly being used in critical applications such as self-driving cars (Dikmen & Burns, 2016) and automatic quality control (Bergmann et al., 2019). Their outstanding success arises from their powerful approximation capabilities (Lin et al., 2017), which traditionally have come at the cost of a lack of transparency and the hazard of unexpected behaviors. As the consequences in critical systems can be severe, ensuring a better understanding of CNNs and their reliability becomes essential.

We focus on the domain of automatic quality control. In this domain, image classification of defective and non-defective parts ensures quality in mass production processes (Landing AI, 2020). As an example, in the process of metal casting, CNNs can be used to distinguish good parts from parts containing specific defects such as pinholes. These defects are directly related to the structural integrity of the parts, and a misdetection of a faulty part can lead to economic consequences (e.g., a broken machine or a collapsed structure) or even put human life at risk (e.g., a faulty brake disk in a car). We speak of unexpected behavior when these misdetections are caused by biases in the models which are not aligned with human expectations. For example, a detection model could classify a part as defective by focusing on the background or the serial number of a part instead of focusing on the defect. This is clearly undesirable and dangerous.

To detect unexpected behaviors in a human-understandable way, the field of eXplainable Artificial Intelligence (XAI) (Das & Rad, 2020; Arrieta et al., 2020; Burkart & Huber, 2021) proposes the usage of concepts. In this work, a concept refers to an abstract idea, represented by a set of images sharing a specific semantic meaning. As an example, consider the concept of a scratch. Scratches can take many appearances and forms, yet they still share the same semantic meaning. The task of concept extraction consists of analyzing a model to obtain sets of images representing concepts that are important for the model. These concepts can then be presented to a human to better understand which useful representations were learned by a model to solve a specific task. Therefore, concept extraction is an important tool for providing insights into neural networks, which are often considered black boxes.

In the context of automated quality control, current methods have two considerable flaws. First, they focus on analyzing scale-agnostic models and thus rely on interpolation techniques (e.g., bicubic interpolation) for resizing features. In contrast, quality control setups often provide static perspectives of parts, yielding data and models where scale is an important factor of the emerging image features. For example, holes or curves may correspond to normal features or defects depending on their size. Second, existing methods rely on segmentation/superpixel techniques (e.g., SLIC (Achanta et al., 2012)) during the concept extraction process. These techniques segment patches along intensity and color boundaries (e.g., bends, curves, edges), which causes a loss of context around the segmented areas and can introduce artifacts when padding is applied afterwards. As our empirical results highlight, both issues are significant, leading to unreliable results. We address this gap by proposing a concept extraction method that can deal with typical features of industrial datasets (e.g., small defects such as holes) and performs correctly when applied to models sensitive to changes in scale.

1.1 SPACE: Main idea and use

Fig. 1

Visual comparison between (a) SPACE (proposed method) and (b) ACE (state-of-the-art). Both follow a similar structure of (A) patch extraction, (B) concept-image composition, (C) concept clustering, (D) random concept building, and (E) concept testing. However, for steps (A) and (B) SPACE introduces new principles to avoid scaling through interpolation. In (C), SPACE introduces the usage of PCA and OPTICS to extract non-spherical clusters. Finally, (D) is modified to be consistent with (A) and (B).

In this work, we propose the novel algorithm Scale-Preserving Automatic Concept Extraction (SPACE). Our algorithm builds upon the state-of-the-art for image-based concept extraction, ACE (Ghorbani et al., 2019), and introduces significant modifications presented in Sect. 4. Through experiments, we show that it achieves superior results in the context of automatic quality control. SPACE takes as inputs a CNN model, a dataset containing images for all classes that the model was trained on, and a set of (hyper-)parameters (including, for example, the class index as defined in the model output), as described in Fig. 1a. SPACE returns a set of meaningful concepts, where each concept consists of a set of example images (concept-images) and a corresponding importance score.

As an application example, consider the case of detecting pinholes on a metal part (details follow in Sect. 5.4). First, an image dataset of non-defective and defective parts is obtained. Second, a CNN is trained to perform the intended classification task. Third, an expert may use SPACE to validate which concepts are being used to classify images as defective parts. The SPACE algorithm takes the trained CNN, the training dataset, and the chosen (hyper-)parameters as inputs and returns multiple concepts represented through example images and importance scores. Then, a human domain expert may visually inspect the examples of each concept to interpret what concepts are important for the model to solve the classification task. Next, the human can verify if the concepts with high importance are aligned with the intended task (e.g., pinholes), or point towards biases which can generate unexpected behaviors (e.g., background colors, or thickness of an unrelated line). These results can be used by the human expert to identify biases, data leakages and overfitting. Finally, the dataset can be adjusted to remove the discovered biases or data leakages.

In more detail, the five steps of SPACE are (cf. Fig. 1): (A) First, patch extraction takes as input the images of the class under inspection; the images are sliced into square patches, and a subset of the patches is selected based on their aggregated pixel importance. (B) Then, concept-image composition takes each extracted patch and stacks it vertically and horizontally (tiling) to obtain images of the input size of the model. (C) Next, in concept clustering, concept-images are encoded in the latent space of the model, then the dimensionality of the encoding is reduced through principal component analysis (PCA) (Burkart & Huber, 2021), and clusters are extracted using ordering points to identify the clustering structure (OPTICS) (Ankerst et al., 1999). (D) After this, during random concept building, the rest of the dataset (excluding the class under inspection) is used to randomly extract square patches similar to step (A) and compose concept-images as in step (B). These sets of random concept-images are needed for the importance score calculation of the next step. (E) Finally, in concept testing, each extracted and random concept is sampled multiple times and tested using testing with concept activation vectors (TCAV) (Kim et al., 2018a), obtaining a mean TCAV score for each concept. The higher the importance score, the more sensitive the model is to the concept for predicting the class under inspection.

1.2 Contributions

In essence, this paper makes the following contributions: (1) We propose the SPACE algorithm as a new concept extraction method for the global interpretability of CNNs. (2) We compare SPACE to the state-of-the-art method, ACE, using three quality control use cases. (3) We present an application on real-world manufacturing data and show how SPACE allows for a better understanding of trained models, as well as a more reliable detection of defects.

In summary, this paper tackles the gap of creating a concept extraction method capable of analyzing scale-sensitive models and data containing features that are small or related to the morphological structure of the images. Our method provides a tool for concept extraction in the context of automatic quality control.

A Python implementation of the SPACE algorithm based on TCAV (Kim et al., 2018a) is available at:

https://github.com/Data-Science-in-Mechanical-Engineering/SPACE.

1.3 Outline

The paper continues with the related work, presenting the current alternative methods and the overall scientific context (Sect. 2). Then, the required background of concept extraction is described (Sect. 3). Next, the algorithm SPACE is introduced, and the underlying mathematical ideas are explained (Sect. 4). Subsequently, SPACE is compared with ACE through empirical studies over three datasets (Sect. 5). Finally, conclusions and future perspectives are presented (Sect. 6).

2 Related work

This work falls into the general field of XAI, which we briefly discuss next. We start by introducing the general objective of the field, and the taxonomy of explanations. Within the field of XAI, we locate the methods of concept extraction, which encompass the current work. Afterward, we give a brief introduction to concept extraction methods and position our work in this context.

2.1 Explainable artificial intelligence (XAI)

The field of XAI proposes either the creation of AI models which are human-understandable, or methods that allow humans to better understand the decision-making processes of existing models (Das & Rad, 2020; Arrieta et al., 2020; Burkart & Huber, 2021). XAI methods can be classified in terms of their scope (local or global) and usage (ante-hoc or post-hoc) (Das & Rad, 2020). The current work focuses on algorithms that explain a model as a whole (global scope) and are used to analyze trained models without having to modify their architecture (post-hoc usage). In this context, global explanation methods for CNNs have focused on extracting decision-making patterns that can describe the behavior of a model in a human-understandable way (Arrieta et al., 2020). The two paradigms that have been proposed are to distill a simpler model which is human-understandable (e.g. rule-based (Augasta & Kathirvalavakumar, 2012; Zhou et al., 2003), decision trees (Zhang et al., 2019; Schaaf et al., 2019; Zilke et al., 2016; Sato & Tsukimoto, 2001), fuzzy systems (Gu & Cheng, 2020)), or to extract sets of inputs which generate similar activations inside the models and have global relations with resulting predictions such as ACE (Ghorbani et al., 2019), CaCe (Goyal et al., 2019), and VRX (Ge et al., 2021). The current work focuses on the latter, more specifically on the extraction of concept-based explanations (Ghorbani et al., 2019; Goyal et al., 2019; Yeh et al., 2020).

2.2 Concept extraction

The extraction of concepts to make AI models explainable has only recently been proposed. A first approach in this direction was proposed as an ante-hoc method named ProtoPNet (Chen et al., 2019), where the CNN architecture was specifically designed to provide prototypical regions as explanations, which are analogous to concepts. In the field of post-hoc methods, the first work in this direction is by Ghorbani et al., with the method ACE (Ghorbani et al., 2019). This work was directly linked to the concept testing algorithm TCAV (Kim et al., 2018a). Both methods (ACE and TCAV) will be discussed in more detail in Sect. 3. Since then, three main working principles have been introduced for post-hoc concept extraction and concept-based explanations. The main approaches have been based on autoencoders (e.g. CaCe (Goyal et al., 2019), VAEs (Utkin et al., 2021)), changing a model’s architecture to constrain certain layers (e.g. concept bottleneck models (Koh et al., 2020), concept whitening (Chen et al., 2020)), or patch extraction (e.g. ACE (Ghorbani et al., 2019), VRX (Ge et al., 2021), conceptShap (Yeh et al., 2020)).

Our work is directly related to the patch extraction approaches, which encompass methods based on extracting regions/patches from the input images, then clustering and testing the obtained concepts. Region-based explanations were first proposed in XRAI (Kapishnikov et al., 2019). Afterwards, ACE (Ghorbani et al., 2019) focused on extracting patches through superpixel techniques such as SLIC (Achanta et al., 2012). Similarly, VRX (Ge et al., 2021) proposes the usage of gradients to guide the patch discovery of ACE. Other similar approaches such as EFC-CAM (Wang et al., 2021) have introduced the usage of gradients to obtain regions specific to a class. Our current work also proposes a step of patch extraction; yet, we argue that by changing the way patches are extracted, encoded, and tested, better results can be obtained in applications related to quality control.

From the mentioned algorithms, ACE will be taken as the state-of-the-art of concept extraction techniques. Thus, it will be further introduced alongside the testing method TCAV in Sect. 3. Additionally, ACE will be used as an experimental baseline in Sect. 5 to obtain an insightful comparison.

3 Background

In this section, we introduce the two fundamental components required for the correct understanding of SPACE. Specifically, we describe the general methods for concept testing and automatic extraction. First, we explain the method TCAV for concept testing, which is a supervised approach for testing whether a CNN is sensitive to a concept or not. Then, we lay out the general mechanisms of the ACE algorithm, introduced by Ghorbani et al. (2019).

3.1 Testing with concept activation vectors (TCAV)

TCAV was proposed (Kim et al., 2018a) to provide an interpretation of a CNN's internal state employing human-understandable concepts. The key idea is to assess whether two sets of images (one containing a human-defined concept, the other not) generate different activations within the CNN, and how much this difference contributes to the prediction process of the CNN. For this, the method proposes to encode each image as a flattened activation map \(a_l=f_l(x)\) (obtained through the partial evaluation of the CNN until layer l). Then, a linear classifier is trained to differentiate the two groups of encoded images. As a result, the fitted parameters of the linear classifier denote a vector normal to the decision boundary (hyperplane) of the classification. This vector is then used to define the concept and is named Concept Activation Vector (CAV), denoted \(v_{C}^{l}\). Finally, a sensitivity metric \(S_{C,k,l}(x)\) is introduced to express how sensitive the prediction of the class k for the image x is towards the defined concept C,

$$\begin{aligned} S_{C,k,l}(x) = \nabla h_{l,k}(f_l(x))\cdot v_{C}^{l}. \end{aligned}$$
(1)

Here, \(h_{l,k}(f_l(x))\) is the predicted logit for the class k, obtained by evaluating the activation map \(a_l=f_l(x)\) from the layer l onwards. Finally, a score \(TCAV_{Q_{C,k,l}}\) is introduced to quantify the global importance of a concept for the predictions of a specific class. This score is defined as the ratio of images x of class k (\(X_k\)) with a positive sensitivity for a selected concept C,

$$\begin{aligned} TCAV_{Q_{C,k,l}}= \frac{\left| \left\{ x \in X_{k}: S_{C,k,l}(x) > 0 \right\} \right| }{\left| X_{k}\right| }. \end{aligned}$$
(2)
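Eqs. (1) and (2) can be illustrated with a minimal NumPy sketch. It assumes that the gradients \(\nabla h_{l,k}(f_l(x))\) have already been computed for every image of class k and stacked row-wise, and that a CAV is available as a plain vector; the function names and the toy values are hypothetical and not part of the TCAV reference implementation.

```python
import numpy as np


def directional_sensitivities(logit_grads, cav):
    """Eq. (1): dot product between each logit gradient and the CAV.

    logit_grads: array of shape (n_images, n_features), one flattened
                 gradient of the class-k logit w.r.t. layer-l activations per image.
    cav:         array of shape (n_features,), the concept activation vector.
    """
    return logit_grads @ cav


def tcav_score(logit_grads, cav):
    """Eq. (2): fraction of class-k images with positive sensitivity."""
    sensitivities = directional_sensitivities(logit_grads, cav)
    return float(np.mean(sensitivities > 0))


# Toy example (hypothetical values): four images, two-dimensional activations.
grads = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 0.2], [0.0, -1.0]])
cav = np.array([1.0, 0.0])
score = tcav_score(grads, cav)  # two of the four sensitivities are positive
```

In this toy case the sensitivities are 1.0, 0.5, -1.0, and 0.0, so half of the images count as positively influenced by the concept.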

3.2 Automatic concept-based explanations (ACE)

Based on TCAV, the first automatic concept extraction method was introduced by Ghorbani et al. (2019). The general method of ACE is described in Fig. 1b, and is composed of five steps. (A) First, a set of images from a class is given, and each image is segmented at multiple resolutions using SLIC (Achanta et al., 2012). (B) Then, each of the extracted patches/superpixels is padded to a rectangular shape with the mean pixel values and then resized by bicubic interpolation to the original size of the image. (C) Later, the activation maps of the resized patches are flattened and clustered using k-means to obtain a defined number of concepts. (D) Next, sets of random concepts are created by selecting a subset from a pool of random images. (E) Finally, the importance score of each concept is calculated using TCAV (Kim et al., 2018a).

4 SPACE

We propose the technique Scale-Preserving Automatic Concept Extraction (SPACE) to tackle specific challenges of industrial datasets related to applications such as quality control and predictive maintenance. SPACE, as described in Fig. 1a, represents an alternative method for the automatic extraction of concept-based explanations using a distinctive patch extraction technique, coupled with saliency maps to assess the local importance of extracted patches and the usage of tiling for patch resizing. A central algorithmic requirement considered for the SPACE algorithm was the preservation of the scale of all features while extracting and evaluating concepts, because the scale of potentially meaningful concepts in industrial datasets often critically determines their actual semantic meaning.

As described previously, SPACE follows the process of concept extraction as shown in Fig. 1. SPACE takes as input: a dataset \(E:\left\{ (x_{1},y_{1}),...,(x_{N},y_{N}) \right\}\) of labeled data points; a trained CNN \(f_\textrm{M}\); the index k of the class to be analyzed; the number \(n_\textrm{s}\) of horizontal and vertical slices used for patch extraction; the percentage \(n_\textrm{p}\) of patches to be extracted from each image; the number of PCA components \(n_\textrm{pca}\) used to reduce the dimensionality of the activation maps before clustering; the layer \(l_\textrm{gradcam}\) used to perform Grad-CAM; and the layer \(l_\textrm{activ}\) used to extract activations for the clustering and testing. As output, SPACE returns a set of concepts \(\{ (\varepsilon _{0}, \overline{S}_0),(\varepsilon _{1}, \overline{S}_1),...,(\varepsilon _{n-1}, \overline{S}_{n-1})\}\), where \(\varepsilon _{i}\) is a set of examples for the concept \(c_i\), and \(\overline{S}_i\) is the importance of said concept with relation to the class k. In this section, we introduce SPACE’s five functional steps as shown in Algorithm 1.

Algorithm 1

4.1 (A) Patch extraction

The Patch extraction function produces a set P of patches from the subset \(E_k\) of images of the target class k. For each image \(x_j\), SPACE first performs Grad-CAM (Selvaraju et al., 2017) to obtain the saliency map \(S_j\) (pixel-wise importance). Afterwards, the image \(x_j\) is sliced into \(n_\textrm{s} \times n_\textrm{s}\) windows to obtain a set \(P^{*}_{x_j}\) of patches \(p_{x_j,o}\), where o denotes the position of the patch. Similarly, a binary mask \(g_o\) the size of \(x_j\) is created for each patch \(p_{x_j,o}\), obtaining the set \(G^{*}\). Then, the aggregated patch importance score \(\psi _{j,o}=f_{\psi }(S_j, g_o)\) of each patch \(p_{x_j,o}\) is computed by applying the mask \(g_o\) over the saliency map \(S_j\) (element-wise multiplication), summing the importance of each pixel \((a,b)\), and then dividing by the number of pixels with non-zero importance,

$$\begin{aligned} \psi _{j,o}=f_{\psi }(S_j, g_o)=\frac{\sum _{a,b} (S_j \odot g_o)_{(a,b)}}{\sum _{a,b} H (S_j \odot g_o)_{(a,b)}} \end{aligned}$$
(3)

with

$$\begin{aligned} H (x) = {\left\{ \begin{array}{ll} 1 &{} \quad x>0\,,\\ 0 &{} \quad else. \end{array}\right. } \end{aligned}$$
(4)

Then, the patches of \(P^{*}_{x_j}\) are ranked based on \(\psi _{j,o}\), and only the top \(n_\textrm{p}\) percent of the most important patches are selected for the set \(P_{x_j}\). Finally, the set P of extracted patches is obtained as the union of all \(P_{x_j}\) for \(x_j\) in \(E_k\).
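The scoring and selection of step (A) can be sketched in NumPy as follows. The sketch assumes a precomputed, non-negative saliency map whose side lengths are divisible by \(n_\textrm{s}\), and it slices the map directly instead of building the masks \(g_o\), which gives the same sums. The function names are hypothetical, and returning a score of 0 for a window without positive saliency is an assumption, since Eq. (3) leaves this case undefined.

```python
import numpy as np


def patch_scores(saliency, n_s):
    """Eq. (3) for every window of an n_s x n_s grid over the saliency map."""
    height, width = saliency.shape
    ph, pw = height // n_s, width // n_s
    scores = np.zeros((n_s, n_s))
    for row in range(n_s):
        for col in range(n_s):
            window = saliency[row * ph:(row + 1) * ph, col * pw:(col + 1) * pw]
            n_positive = np.count_nonzero(window > 0)  # denominator, Eq. (4)
            # Assumption: windows without positive saliency get a score of 0.
            scores[row, col] = window.sum() / n_positive if n_positive else 0.0
    return scores


def select_top_patches(scores, n_p):
    """Grid positions (row, col) of the top n_p percent of patches."""
    k = max(1, int(round(scores.size * n_p / 100.0)))
    order = np.argsort(scores, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(order, scores.shape)
    return list(zip(rows.tolist(), cols.tolist()))


# Toy saliency map (hypothetical values): one strong pixel versus a weak region.
saliency = np.array([
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 4.0],
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
])
scores = patch_scores(saliency, n_s=2)
top = select_top_patches(scores, n_p=25)
```

Both non-empty windows have the same total saliency, but the window with a single strong pixel receives the higher score because the sum is divided only by the number of positive pixels.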

In contrast with ACE, the proposed patch extraction has two advantages. First, the borders of the extracted patches are uncorrelated with intensity boundaries (e.g., edges) of the input images. This enables a more coherent extraction in the cases where these types of features are important. Second, the ranking and selection of patches centers the analysis on the patches that have an actual impact on the decision-making process of the model. The proposed aggregation approach prefers patches with a few pixels of high importance over patches with many pixels of lower importance, which makes the selection more robust against noisy saliency maps and the chosen patch size. As an example, a homogeneous yet unimportant background would be filtered out in this step, instead of becoming noisy points in the concept clustering step.

4.2 (B) Concept-image composition

The Concept-image composition function transforms each extracted patch \(p_{t}\) from P into an image \(x_{p_{t}}\), yielding the set \(X^{*}\). To do this, SPACE proposes to tile each square patch \(p_{t}\) vertically and horizontally \(n_\textrm{s}\) times to obtain \(x_{p_{t}}\). The resulting concept-image aims to trigger similar activation patterns as the images of E containing similar features.
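Tiling as described above maps directly onto `numpy.tile`. The following minimal sketch assumes a patch stored as a (height, width, channels) array and an image size divisible by \(n_\textrm{s}\), so that the tiled result matches the model input size; the function name is hypothetical.

```python
import numpy as np


def compose_concept_image(patch, n_s):
    """Repeat an (h, w, c) patch n_s times along height and width.

    No interpolation is involved, so every feature keeps its original scale.
    """
    return np.tile(patch, (n_s, n_s, 1))


# Toy example: a 2 x 2 RGB patch tiled with n_s = 4 gives an 8 x 8 RGB image.
patch = np.arange(2 * 2 * 3).reshape(2, 2, 3)
concept_image = compose_concept_image(patch, n_s=4)
```

Every 2 x 2 block of `concept_image` aligned to the grid is an exact copy of `patch`.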

The main argument behind the use of tiling is the fact that convolutions are not scale invariant. As a result, CNNs learn features which are not scale invariant, unless the dataset, training process, or architecture of the CNN are explicitly modified for this goal. Other works have tried to deal with this limitation, either by adding specific architectures such as ensembles (Van Noord & Postma, 2017), using multiple columns/backbones for different scales (Xu et al., 2014), or specific architectures such as scale pyramids for object detection (Kim et al., 2018b). These same works have studied how activation maps and predictions shift as the scale of input images changes. From another perspective, only a subset of an image is required to compute the activations at an arbitrary layer. This region is the receptive field, and, as long as the tiles are bigger than the receptive field required to encode the important features of the dataset, tiling will allow a similar encoding for a patch in comparison to an original image.
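To relate the tile size to the receptive field, the theoretical receptive field of a stack of convolution and pooling layers can be computed with the standard recurrence (the field grows by (kernel - 1) times the accumulated stride at each layer). The sketch below is a generic illustration that ignores dilation; the function name and the toy layer stack are hypothetical and do not correspond to the models analyzed in this work.

```python
def receptive_field(layers):
    """Theoretical receptive field, in input pixels, of a layer stack.

    layers: list of (kernel_size, stride) tuples, ordered from input to output.
    """
    field, jump = 1, 1  # jump: distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        field += (kernel_size - 1) * jump
        jump *= stride
    return field


# Toy stack: two 3x3 convolutions, one 2x2 pooling (stride 2), one 3x3 convolution.
toy_stack = [(3, 1), (3, 1), (2, 2), (3, 1)]
rf = receptive_field(toy_stack)  # 1 -> 3 -> 5 -> 6 -> 10
```

A tile side of at least `rf` pixels would cover the full theoretical field of one output unit of this toy stack; the receptive field actually needed to encode a given feature may be smaller.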

In contrast with other methods, the resulting concept-images are not transformed with respect to their scale. This is especially important when analyzing models that are trained from a single perspective (e.g., quality control of a metal piece where the scale of the features is semantically meaningful). In these cases, the internal representations of the models are not scale-invariant and thus, any activations generated by re-scaled concept-images can differ significantly from the activations of the dataset.

During the tiling process, discontinuities are introduced. This can present a challenge for our methodology in instances where classification relies on similar features. Yet, discontinuities are rarely defining features for quality control classification tasks. In quality control tasks, images of products or objects are scrutinized to identify specific local features. In most instances, the visual cues associated with each class are local morphological features and not discontinuities. Hence, the models that are developed are typically not tuned to detect or be sensitive to such features. This is especially true when these discontinuities are introduced during the training process as a result of data augmentations (e.g. by using random crops or mix-based augmentations).

4.3 (C) Concept clustering

The Concept clustering function extracts clusters from the set \(X^{*}\) of concept-images which are meaningful for the model. First, SPACE computes the activation maps \(a_{l_\textrm{activ},x_{p_t}}\) by partially evaluating each concept-image \(x_{p_t}\) in the model \(f_\textrm{M}\) until the layer \(l_\textrm{activ}\). It then flattens the activation maps, composing the set \(A^{*}\). As a result, the flattened activation maps become the encoding of the image, representing the perception of the model. Second, SPACE performs a PCA over \(A^{*}\), and reduces the dimensionality of each element to \(n_\textrm{pca}\) components, obtaining the set \(A^{*}_\textrm{pca}\). Then, the clustering algorithm OPTICS (Ankerst et al., 1999) is used to identify a set \(C^{*}\) of clusters based on the Manhattan distance between their elements. Each of the extracted clusters \(c_i\) becomes an extracted concept. Finally, all concept-images \(x_{p_t}\) whose \(a_{l_\textrm{activ},x_{p_t}}\) belong to a concept \(c_i\) are used to compose the example set \(\varepsilon _{i}\).

This function implements a step of dimensionality reduction to improve the effectiveness of the clustering algorithm as well as to reduce redundant information caused by previous steps. The joint application of PCA, OPTICS, and the Manhattan distance allows for an effective extraction of density-based clusters consisting of coherent concept-images, leading to meaningful concepts.

We chose OPTICS over alternatives such as k-means due to several factors. OPTICS can identify complex, non-spherical clusters and handle varying densities and quantities. In contrast, k-means extracts a known number of spherical clusters, struggling with density differences or a mismatch of the assumed cluster quantity. With OPTICS, we avoid making assumptions about cluster shapes, densities, and quantities within the CNN’s latent space. Prior to OPTICS, we apply PCA to manage our data’s high dimensionality, thereby improving computational speed and reducing redundancy and noise.
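The clustering step can be sketched with scikit-learn, assuming the flattened activation maps are already available as a 2D array. The function name, the `min_samples` value, and the synthetic data are hypothetical; the text above fixes only the use of PCA, OPTICS, and the Manhattan distance.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.decomposition import PCA


def cluster_concepts(activations, n_pca=30, min_samples=5):
    """Reduce flattened activations with PCA, then cluster with OPTICS.

    activations: array of shape (n_concept_images, n_features).
    Returns the reduced encodings and one label per concept-image,
    where -1 marks points that OPTICS treats as noise.
    """
    # PCA cannot return more components than samples or features.
    n_components = min(n_pca, *activations.shape)
    reduced = PCA(n_components=n_components).fit_transform(activations)
    # min_samples is an assumed value; it is not specified in the text.
    labels = OPTICS(min_samples=min_samples, metric="manhattan").fit_predict(reduced)
    return reduced, labels


# Synthetic stand-in for activations: two well-separated groups in 50 dimensions.
rng = np.random.default_rng(0)
activations = np.vstack([
    rng.normal(0.0, 0.1, size=(40, 50)),
    rng.normal(5.0, 0.1, size=(40, 50)),
])
reduced, labels = cluster_concepts(activations, n_pca=10)
```

Each non-negative label corresponds to one extracted concept \(c_i\); the concept-images sharing that label would form its example set \(\varepsilon _{i}\).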

4.4 (D) Random concept building

The Random concept building function assembles example sets of random concepts. The items of these sets are chosen randomly, and do not have any shared meaning. SPACE proposes to use the images of the analyzed dataset, excluding the ones from the class that is being analyzed. From each image \(x_j\) in \(E \setminus E_k\), a random patch \(p_j\) is cropped, considering the dimensions of the patches in step (A). Then, each random patch \(p_j\) is used to compose concept-images \(x_{p_j}\) using the same approach as step (B). The resulting set of concept-images \(x_{p_j}\) becomes the example set \(\varepsilon _{r_i}\) of a random concept \(c_{r_i}\). As an output of this function, a defined number of random concepts \(c_{r_i}\) and their example sets \(\varepsilon _{r_i}\) are extracted.
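A random concept-image can be sketched by cropping a patch of the same size as in step (A) and tiling it as in step (B). The sketch assumes that the crop position is drawn uniformly and is not restricted to the slicing grid, which the description above does not specify; the function name and the toy image are hypothetical.

```python
import numpy as np


def random_concept_image(image, n_s, rng):
    """Crop a random patch of the step (A) patch size and tile it as in step (B)."""
    height, width = image.shape[:2]
    ph, pw = height // n_s, width // n_s
    # Assumption: the crop may start at any pixel that keeps the patch inside the image.
    top = int(rng.integers(0, height - ph + 1))
    left = int(rng.integers(0, width - pw + 1))
    patch = image[top:top + ph, left:left + pw]
    repetitions = (n_s, n_s) + (1,) * (image.ndim - 2)
    return np.tile(patch, repetitions)


# Toy image standing in for a sample from E \ E_k.
rng = np.random.default_rng(0)
image = rng.random((8, 8, 3))
random_image = random_concept_image(image, n_s=4, rng=rng)
```

Because the random patches go through the same crop-and-tile transform as the extracted patches, both kinds of concept-images share the same scale and the same tiling discontinuities.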

In the case of ACE, the random concepts are composed of complete images, randomly selected from the dataset. This means that during the concept testing step, complete images are being compared with resized patches of other images. We argue that this comparison is flawed, and the composition of the random example sets \(\varepsilon _{r_i}\) should be similar to the composition of the extracted concept example sets \(\varepsilon _{i}\). The main features of the concept-images and the random images should undergo similar transforms, as this minimizes the risk of testing the importance of the transforms instead of the importance of the actual concepts. This is especially important for models that are not robust to changes in scale, as the activation maps generated from these images can differ significantly or, even worse, generate representations that are completely disconnected from the nominal behaviors of the model.

4.5 (E) Concept testing

The Concept testing function is performed for each concept \(c_i\) in accordance with the TCAV algorithm (Kim et al., 2018a), using the layer \(l_\textrm{activ}\). For each analyzed concept \(c_i\), a sample of the example set \(\varepsilon _{i}\) is compared with a sample of a random example set \(\varepsilon _{r}\). The basic idea behind the testing in TCAV is to compute multiple CAVs and TCAV scores by comparing subsets of a group A of patches (\(\varepsilon _{i}\)) with a group B of random patches (\(\varepsilon _{r}\)). This process is repeated multiple times, obtaining the CAV and TCAV score of each repetition. Afterwards, a control group is obtained by repeatedly comparing two random concepts. Then, the two populations of TCAV scores are used to verify the significance of the findings through a two-sided t-test. It is to be noted that the expected value of the TCAV score for a set of random patches is 0.5. This 0.5 TCAV score means that for half of the images in a class, the concept affects the prediction of the CNN positively, and for the other half it affects it negatively. This procedure was introduced in detail by Kim et al. in TCAV (Kim et al., 2018a). The TCAV algorithm yields as a result an average TCAV score, also referred to as the average sensitivity \(\overline{S}_i\). This method of testing concepts is also used in ACE, as it allows for testing whether a concept described by a set of images sharing a semantic meaning is influential in the decision-making process of the model.
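The significance check can be sketched with SciPy, assuming the TCAV scores of the repeated comparisons have already been collected for the concept and for the random control group. The function name, the significance level `alpha`, and the score values are hypothetical and are not taken from the experiments.

```python
import numpy as np
from scipy import stats


def concept_significance(concept_scores, random_scores, alpha=0.05):
    """Mean TCAV score of a concept and a two-sided t-test against the control group.

    concept_scores: TCAV scores from repeated concept-vs-random comparisons.
    random_scores:  TCAV scores from repeated random-vs-random comparisons.
    """
    result = stats.ttest_ind(concept_scores, random_scores)  # two-sided by default
    mean_score = float(np.mean(concept_scores))
    p_value = float(result.pvalue)
    return mean_score, p_value, p_value < alpha


# Hypothetical scores: the control group is centered on the expected value 0.5.
concept_scores = np.array([0.90, 0.85, 0.95, 0.90, 0.88])
random_scores = np.array([0.50, 0.45, 0.55, 0.52, 0.48])
mean_score, p_value, significant = concept_significance(concept_scores, random_scores)
```

Here `mean_score` plays the role of \(\overline{S}_i\); a concept whose scores cannot be distinguished from the control group would not be reported as meaningful.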

5 Results

In this section, we present the aggregated experimental results of the proposed algorithm SPACE and its comparison with ACE. We compare SPACE and ACE across multiple industrial datasets, three CNN architectures, and ten random seeds. We focus on datasets where scale is semantically meaningful, data is scarce or unbalanced, or the intended features are small in comparison with the main features of images. First, we present the results of executing both SPACE and ACE over two open datasets, aggregating the results based on the ratio of extracted concepts aligned with the semantic meaning of the analyzed classes, as shown in Fig. 2. A complementary dataset on leather classification is discussed in Appendix 4. Then, we further explore the datasets through representative cases, highlighting the main issues observed in the experimental setup. As an additional representative case, we present a real-world use case on quality control for metal casting. We show that SPACE outperforms ACE when extracting concepts in the above-defined settings.

5.1 Concept extraction runs (SPACE and ACE)

In our setup, a run refers to the execution of either SPACE or ACE, with a specific set of parameters, to analyze a single model. Each run is defined by the parameters of the concept extraction method, the parameters of the analyzed model, the dataset used for the training of the model, as well as the random seed used for the run itself.

In our experimental setup, we use two open datasets, which highlight different properties common in industrial environments. First, the concrete crack dataset (Çağlar Fırat Özgenel, 2019) provides a test case for defects spanning complete images without other major features. Second, the metal nut anomaly detection dataset (Bergmann et al., 2019) is a test case where a specific object (roughly the size of the image) may contain a defect of a size comparable to other features in the image. A third complementary dataset about leather anomaly detection (Bergmann et al., 2019) will be discussed in Appendix 4.

The CNN architectures used were VGG-16 (Simonyan & Zisserman, 2015), ResNet-18 (He et al., 2015), and DenseNet-121 (Huang et al., 2016). These architectures have significant differences in the interconnection of their layers, which influences how information propagates within them. Each architecture was trained with ten different random seeds on each dataset. The training and validation sets were used for training and verification of convergence, respectively, while the test data was used to assess the model accuracy after the training was over. For both training and testing, the dataset was sampled using a weighted random sampler to mitigate the effects of the unbalanced datasets. After training, the best model of each training run was analyzed with both SPACE and ACE to extract its learned concepts. The overall performance metrics of these models are presented in Appendix 2 and 3.

Our experimental setup focuses on the number of patches extracted by each method, as this is the most sensitive parameter when dealing with features of different scales. For ACE, we test multiple values for \(n_{\textrm{SLIC}}\), which directly refers to the number of superpixels/patches obtained using the segmentation technique SLIC (Achanta et al., 2012). Thus, in Fig. 2, ACE-sl15, ACE-sl50, ACE-sl80, and ACE-sl200 describe ACE runs where \(n_{\textrm{SLIC}}\) is equal to 15, 50, 80, and 200, respectively. The run ACE-full represents a single run as described by (Ghorbani et al., 2019), where the patches of multiple SLIC executions (to extract 15, 50, and 80 segments) are extracted and merged. Other ACE parameters were fixed to the values recommended by (Achanta et al., 2012), \(s_{\textrm{SLIC}}=1.0\) (sigma), \(c_{\textrm{SLIC}}=20.0\) (compactness), \(n_{\textrm{k}}=25\) (clusters), and a gray padding of 128.

For SPACE, we test multiple values of \(n_{\textrm{s}}\), the number of vertical and horizontal windows into which each image is divided. This yields a grid of \(n_{\textrm{s}} \times n_{\textrm{s}}\) windows and thus a total of \(n_{\textrm{s}}^2\) patches per image. In Fig. 2, the runs SPACE-s4 to SPACE-s8 denote SPACE runs with \(n_{\textrm{s}}=4\) to \(n_{\textrm{s}}=8\), respectively. The percentage of selected patches \(n_{\textrm{p}}\) and the dimensionality after reduction before clustering \(n_{\textrm{pca}}\) were fixed to 10% and 30, respectively. Additionally, the last convolutional layer of the model was used for \(l_{\textrm{gradcam}}\), since this layer is expected to offer the best compromise between high-level semantics and detailed spatial information (Selvaraju et al., 2017).
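The slicing and filtering steps described above can be illustrated with a minimal sketch. It assumes that the local importance of a patch is the mean of a Grad-CAM-style heatmap inside its window and that the top \(n_{\textrm{p}}\) fraction is selected per image; neither detail is specified in this section, and the function names `grid_patches`, `mean_importance`, and `select_top_patches` are hypothetical.

```python
def grid_patches(height, width, n_s):
    """Return the (top, left, bottom, right) boxes of an n_s x n_s grid."""
    boxes = []
    for row in range(n_s):
        for col in range(n_s):
            top = row * height // n_s
            bottom = (row + 1) * height // n_s
            left = col * width // n_s
            right = (col + 1) * width // n_s
            boxes.append((top, left, bottom, right))
    return boxes


def mean_importance(heatmap, box):
    """Average heatmap value inside a box (assumed aggregation rule)."""
    top, left, bottom, right = box
    values = [heatmap[r][c] for r in range(top, bottom) for c in range(left, right)]
    return sum(values) / len(values)


def select_top_patches(heatmap, n_s, fraction=0.10):
    """Keep the most important fraction of the n_s**2 patches of one image."""
    height, width = len(heatmap), len(heatmap[0])
    boxes = grid_patches(height, width, n_s)
    ranked = sorted(boxes, key=lambda b: mean_importance(heatmap, b), reverse=True)
    keep = max(1, int(fraction * len(boxes)))  # always keep at least one patch
    return ranked[:keep]
```

For a 210 \(\times\) 210 image and \(n_{\textrm{s}}=5\), this grid contains 25 windows of 42 \(\times\) 42 pixels, of which two are kept at a fraction of 10% under the rounding-down rule assumed here.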

To obtain comparable results, some parameters were set to the same values for the two methods. The layer selected for the TCAV analysis and the clustering step (\(l_{\textrm{activ}}\)) was the one closest to the top of the network, namely the dense layer before the softmax layer. It was chosen to capture the concepts that are most important for the final classifications of the model. In contrast with the original implementation of ACE, we performed a PCA with 30 components before the clustering (k-means or OPTICS) in each method. Similarly, during each run, folders of random concepts were created automatically, in a number chosen as a compromise between computational cost and stable, reliable results.

For each run, we visually inspected the example images of each concept and labeled the concept as aligned or not, based on the intended semantic meaning of the classification task. In the case of concrete crack detection, the task is visually straightforward: cracks are easy to identify in images or concept example images, without major confounding factors that could result in mislabeled concept alignments. Similarly, for the metal nut dataset, we had access to binary ground truth masks, which provide clear indicators of the key visual features. The visual cues related to defects are unambiguous, making the visual inspection of concept examples feasible without significant exposure to confounding factors. Moreover, in the selected datasets, there is no major difference in prior knowledge between experts and non-experts (e.g., the visual cues related to a crack are easy to identify).

We aggregated the results of the runs for the ten random seeds on each parameter configuration. The aggregation was based on the visual inspection of the results of each run, in which concepts were labeled as aligned or not according to the intended semantic meaning of the analyzed class. We then computed the average ratio of aligned concepts across all runs of a model and method configuration. We use this aggregation method to answer the question, “are the right concepts being extracted for the right classes?”. The summarized results are presented in Fig. 2 and detailed in Appendices 2 and 3.
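The aggregation can be written down as a short sketch. It assumes that each run is represented by a list of booleans (one per extracted concept, `True` if labeled as aligned) and that runs without any extracted concept are skipped when averaging; the latter is an assumption of this sketch, not a detail stated in the text. The function names are hypothetical.

```python
def aligned_ratio(labels):
    """Ratio of aligned concepts in one run, or None if the run has no concepts."""
    if not labels:
        return None
    return sum(labels) / len(labels)


def average_aligned_ratio(runs):
    """Average the per-run ratios over all runs that extracted a concept."""
    ratios = [aligned_ratio(labels) for labels in runs]
    ratios = [ratio for ratio in ratios if ratio is not None]
    return sum(ratios) / len(ratios) if ratios else 0.0


# Three hypothetical runs: one aligned concept, one aligned out of two, none extracted.
example_runs = [[True], [True, False], []]
example_average = average_aligned_ratio(example_runs)
```

Averaging per-run ratios, rather than pooling all concepts, gives every seed the same weight regardless of how many concepts it produced.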

Fig. 2 Aggregated results from SPACE and ACE runs. First, we present the number of concepts extracted with more than three items (a, b). Second, the number of aligned concepts extracted in each run is presented (c, d). Third, we present the ratio of aligned concepts (e, f). In both cases, SPACE extracts more relevant concepts across a variety of hyperparameters, achieving a higher alignment ratio

The main points that can be highlighted from the empirical results concern the number and type of extracted concepts. First, SPACE extracts fewer concepts, yet these concepts are more aligned with the semantic meaning of the analyzed classes. An example can be observed in Fig. 2b and f, where the runs SPACE-s5 and SPACE-s6 extracted a single yet aligned concept, whereas the best run of ACE (ACE-full) extracted on average ten valid concepts, of which only half were meaningful. A similar case can be observed for the concrete crack dataset in Fig. 2a and e. Second, when extracting concepts, SPACE is able to focus on the meaningful parts of the images, reducing the effect of outliers and irrelevant regions. This is the result of the patch extraction and concept clustering steps, which allow SPACE to set outliers apart, detect non-spherical clusters, and focus on the most important patches obtained from the images. In contrast, outliers and irrelevant background can influence the k-means process of ACE, diminishing the number of valid concepts (those with more than three samples) as well as the ratio of aligned concepts (as seen in Fig. 2). Third, the nature of the extracted patches differs between the two methods, which affects the features that are extracted as well as their encoding and testing. This leads to good performance of both methods for bigger features, and to better performance of SPACE for smaller features. The latter point is better visualized in the example cases below.
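The notion of a valid concept used above (a cluster with more than three samples) can be made concrete with a small sketch. It assumes that clustering returns one integer label per patch and that outliers carry the label -1, which is the convention of scikit-learn's OPTICS implementation; whether that implementation was used is not stated here, and `valid_concepts` is a hypothetical helper.

```python
from collections import Counter


def valid_concepts(cluster_labels, min_items=3, noise_label=-1):
    """Return the cluster ids that contain more than min_items patches.

    Patches marked with noise_label are treated as outliers and ignored.
    """
    counts = Counter(label for label in cluster_labels if label != noise_label)
    return sorted(cluster for cluster, size in counts.items() if size > min_items)


# Hypothetical labels: cluster 0 (4 patches), cluster 1 (3), cluster 2 (5), 5 outliers.
labels = [0, 0, 0, 0, 1, 1, 1, -1, -1, -1, -1, -1, 2, 2, 2, 2, 2]
kept = valid_concepts(labels)
```

With k-means every patch is assigned to some cluster, so outliers end up inside concepts or form small clusters that fail this threshold, which is consistent with the lower number of valid ACE concepts discussed above.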

To better visualize our findings, we further discuss two example runs of SPACE and three example runs of ACE for each dataset. The parameters of each method were varied to better observe their mechanisms at different scales. In addition, we show the results of both methods for a real-world use case of quality control in metal casting. Afterward, the results with the highest importance scores are presented and compared.

The example runs of SPACE for the three datasets are described in Table 1. The number of slices \(n_{\textrm{s}}\) was varied according to each dataset and the scale of its features. The percentage of filtered patches \(n_{\textrm{p}}\) and the number of components extracted through PCA, \(n_{\textrm{pca}}\), were fixed as before. Table 1 also summarizes all the example runs of ACE and the parameters used for each run, where \(s_{\textrm{SLIC}}\), \(n_{\textrm{SLIC}}\), and \(c_{\textrm{SLIC}}\) refer to the sigma, number of patches, and compactness of the SLIC segmentation. For the ACE runs, the padding of the patches was fixed to 128 as before. In addition, the number of clusters extracted through k-means, \(n_{\textrm{k}}\), was varied to account for the lower number of valid concepts usually extracted when using fewer patches (as seen in Fig. 2a).

Table 1 Specification of the experiments with SPACE and ACE

5.2 Concrete crack dataset

The concrete crack dataset (Çağlar Fırat Özgenel, 2019) consists of two classes, each with 20 000 concrete images of 227 \(\times\) 227 pixels, which were resized to 210 \(\times\) 210. Classes 0 and 1 correspond to good and cracked samples of concrete, respectively (Fig. 4a and b). The main (and only) concept used for labeling the classes is well known, as the cracks span complete images and there are no other significant features present in the dataset. After training, the VGG-16 model obtained a test accuracy of 99.9%. The importance of the extracted concepts is shown in Fig. 3, and examples of the concepts with the highest importance scores are shown in Fig. 4.

Fig. 3 Importance scores of concepts for the concrete crack dataset. Markers indicate the interpreted content of the concept and its statistical significance. SPACE extracted meaningful concepts and scored them consistently. In contrast, ACE runs extracted mixed concepts, scoring them inconsistently

Fig. 4 Concrete crack dataset results for class “cracked” (k=1). Examples of classes in (a) and (b). Top concepts from two SPACE runs and three ACE runs in (c–j). Top concepts with high importance from SPACE-A and SPACE-B contain cracks, which are the most relevant feature of the class. Top concepts from ACE-A and ACE-B contain segmented cracks but are of lower importance (e, h). In contrast, top concepts from ACE-C do not contain cracks at all (i, j)

SPACE extracted a single concept per run, shown in Fig. 4c and d. In both runs, the extracted concepts were statistically significant with the highest possible importance score (1.0). The background-related patches were filtered due to a low aggregated importance. The number of concepts extracted in each ACE run depends on \(n_{\textrm{SLIC}}\). The most notable finding for the ACE runs is the significantly lower importance score of the concepts. Over 48% of concepts had an importance score of 0.0, shown in Fig. 3. Similarly, the highest importance scores for the runs were lower than 0.61.

Patches and importance scores. The concepts extracted with SPACE were square patches directly containing cracks, and they were scored with high importance. In contrast, 80% of the concepts extracted by ACE contained normal concrete patches and only 20% contained cracks. These concepts were scored significantly lower, with 70% receiving a score below 0.1. Even the concepts with the highest importance reached a maximum score of 0.61, as can be seen in Fig. 4e.

Alignment. The discriminative feature of the dataset is the cracks, which is in line with the results obtained through the two SPACE runs, as seen in Fig. 4c and d. In comparison, ACE concepts containing cracks were not always scored with high importance. As seen in Fig. 3, most concepts containing cracks were scored close to 0.0, meaning that they adversely affected the prediction of the cracked class. This points towards issues when using TCAV in the ACE runs, which are caused by how the patches were segmented, the padding of the patches, and the interpolation used for resizing.
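To make the interpretation of the importance scores explicit, the following sketch computes a TCAV-style score as the fraction of class examples whose directional derivative along a concept activation vector (CAV) is positive. It assumes that the gradients of the class logit with respect to the activations of \(l_{\textrm{activ}}\) and the CAV are already available as plain lists of floats; the function name and the toy values are hypothetical, and the statistical significance test against random concepts is not covered.

```python
def tcav_score(gradients, cav):
    """Fraction of examples whose class logit increases along the concept direction.

    gradients: one gradient vector per class example (logit w.r.t. layer activations).
    cav: concept activation vector with the same dimensionality.
    """
    positives = 0
    for gradient in gradients:
        # Directional derivative of the logit along the concept direction.
        directional = sum(g * v for g, v in zip(gradient, cav))
        if directional > 0:
            positives += 1
    return positives / len(gradients)


# Toy example with two-dimensional activations (values are illustrative only).
cav = [1.0, 0.0]
mixed = [[0.5, 2.0], [-0.1, 3.0], [0.2, -1.0], [-0.3, 0.0]]
opposed = [[-0.5, 1.0], [-0.2, 0.3]]
score_mixed = tcav_score(mixed, cav)
score_opposed = tcav_score(opposed, cav)
```

Under this definition, a score of 1.0 means that moving towards the concept raises the class logit for every example, whereas a score of 0.0 means that it lowers the logit (or leaves it unchanged) for every example, which is why scores close to 0.0 are read above as an adverse effect on the prediction.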

5.3 MVTec metal nuts dataset

The MVTec metal nuts dataset (Bergmann et al., 2019) is originally an anomaly detection dataset, which was reframed as a classification task. It consists of five classes, illustrated in Fig. 6a–e. Each image is of size 700\(\times\)700 pixels and contains a metal nut in front of a black background. Classes 0 to 4 are bent, color, flipped, ok, and scratched metal nuts, respectively. The numbers of images for each of the classes are imbalanced (e.g., class 3 has 242 images, whereas class 1 has only 22 images in total), which is why resampling and data augmentation were used for the training data. After training, the obtained test accuracy was 100%. From this dataset, a single class was analyzed with SPACE and ACE, and the importance of the extracted concepts is shown in Fig. 5. Examples of the concepts with the highest importance scores are shown in Fig. 6.
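One common way to resample an imbalanced dataset, in the spirit of the weighted random sampler mentioned in the experimental setup, is to draw each image with a probability inversely proportional to the size of its class. The sketch below assumes this inverse-frequency weighting and sampling with replacement; the exact scheme used in the experiments is not specified here. Only the two class sizes quoted above (242 and 22 images) are used, and the function names are hypothetical.

```python
import random
from collections import Counter


def sample_weights(labels):
    """Weight each sample by the inverse of its class frequency."""
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def balanced_sample(labels, k, seed=0):
    """Draw k sample indices with replacement using inverse-frequency weights."""
    rng = random.Random(seed)
    weights = sample_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=k)


# Class 3 ("ok") with 242 images and class 1 ("color") with 22 images.
labels = [3] * 242 + [1] * 22
indices = balanced_sample(labels, k=2000)
share_color = sum(labels[i] == 1 for i in indices) / len(indices)
```

With these weights, each class receives the same total sampling mass, so the minority class is expected to make up roughly half of the drawn samples instead of about 8% of them.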

Fig. 5 Importance scores of concepts for the metal nut dataset. Markers indicate the interpreted content of the concept and its statistical significance. SPACE consistently extracted and scored meaningful concepts. In contrast, the extraction and scoring of meaningful concepts differed in the ACE runs

Fig. 6 MVTec metal nut dataset results for class “color” (k=1). Examples of classes in (a–e). Top concepts from two SPACE runs and three ACE runs in (f–q). Top concepts with high importance from SPACE-C and SPACE-D contain red marks (g, i, k) and black or blue marks (f, j), which are important for the class (as opposed to normal metallic regions). Concepts from ACE-D contain large segments including red marks (l, m). In contrast, concepts from ACE-E contain metal pieces (n, o). Finally, concepts from ACE-F contain red marks (p) and dark metal patches (q)

SPACE extracted three and eighteen concepts for the runs SPACE-C and SPACE-D, respectively. The two concepts with the highest importance for each SPACE run are shown in Fig. 6f, i and j. All concepts from SPACE-C were statistically significant with a high importance score, as shown in Fig. 5. SPACE-D extracted mostly (>80%) highly important concepts, except for \(c_{16}\), which was not statistically significant, and \(c_{17}\), which had an adverse effect on the predictions of this class. In the ACE runs, the extracted concepts were of mixed importance. In the ACE-D run, half of the extracted concepts were either not statistically significant or did not have enough samples for testing, as shown in Fig. 5. A similar phenomenon was observed in ACE-E. Finally, ACE-F extracted small patches, which were significant more than 80% of the time.

Patches and importance scores. The number of patches used for each analysis had a significant impact on the results. As a clear difference, SPACE-D extracted smaller and more numerous patches than SPACE-C; as a consequence, other less important or less significant concepts were also extracted. Nonetheless, the concepts with the highest importance contained the expected features. In contrast, the ACE runs extracted superpixels following color and intensity boundaries, which were significantly different at each scale (see Figs. 6i, o, and p). The scale of the ACE patches did not adversely affect the scale of the importance scores.

Alignment. The features used for the labelling of this class were the colored marks on the metal nuts. Across scales, SPACE performed more consistently and extracted the discriminative features of the class clearly and with high importance. Other concepts containing dark metallic regions were also extracted, which could indicate a bias. In comparison, the ACE runs varied significantly: ACE-D and ACE-F had high-importance concepts containing the red marks, but ACE-E did not extract a single concept containing this feature.

5.4 Metal casting

The last dataset was obtained during an actual application of an automated quality control process with an industrial partner (Deevio GmbH). It consists of images of a part manufactured through metal casting. The two labeled classes were class 0, consisting of 139 images without defects (e.g., Fig. 8a), and class 1, consisting of 141 images containing defects (e.g., Fig. 8b). Each image has a size of 900 \(\times\) 900 pixels, and the dataset is composed of images of the front and back of the cast part. The predominant defect labeled during the experiment was the pinhole, a common small casting defect that generates a small yet visible porosity in the part. After training the mentioned classification model, the obtained test accuracy was 100%. The importance of the extracted concepts for all runs is shown in Fig. 7, and examples of the concepts with the highest importance scores are shown in Fig. 8.

Fig. 7 Importance scores of concepts for the metal casting dataset. Markers indicate the interpreted content of the concept and its statistical significance. SPACE consistently extracted and scored similar concepts. ACE failed to extract small concepts, as well as to provide a consistent scoring

Fig. 8 Casting dataset results for class “defect” (k=1). Examples of classes in (a) and (b). Two of the top three concepts from two SPACE runs and three ACE runs in (c–l). Top concepts of SPACE-E and SPACE-F contain pinholes (c, f), which is the feature used for labelling the class; they also contain yellow lines (d, e), which is a detected bias. Top concepts from ACE-G contain large segments with yellow lines (g) and large segments of metal (h). Similarly, concepts from ACE-H and ACE-I contain segments of the background (i, j, k) and patches of metal (l). One concept per ACE run was excluded due to confidentiality

SPACE extracted seven and eight concepts for runs SPACE-E and SPACE-F, respectively. For each SPACE run, the extracted concepts had an importance score between 0.71 and 1.0, except for one concept per run that was not statistically significant. The concepts contained either pinholes, a yellow line from an edge, or letters from the backside of the parts, as highlighted in Fig. 7. For confidentiality reasons, the images of the concepts containing the letters from the backside of the parts are omitted from the results. In contrast, the concepts extracted in the ACE runs either contained a big segment of the image (e.g., Fig. 8g) or had a significantly lower importance score (e.g., Figs. 8i and 7).

Patches and importance scores. The two SPACE runs obtained similar results, locating semantically equivalent areas (pinholes, yellow lines, and backside letters). Similarly, the importance scores obtained in the two runs were comparable. In both SPACE runs, the concepts containing pinholes were among the two concepts with the highest importance scores. The ACE runs achieved different results. In ACE-G, large portions of the parts were extracted and labeled with high importance. These concepts (see Fig. 8g) did not specifically contain pinholes, but they did contain the yellow lines mentioned before. No concept from the ACE runs specifically contained pinholes. This may be due to the severe interpolation required for the small patches and the lack of specificity of the bigger ones, which is also a possible cause of the predominantly low importance scores in the ACE runs.

Alignment. The features used for the labeling of this dataset were the pinholes. In the SPACE runs, it was clear that the pinholes were important for the trained models, but the yellow lines and backside letters were also important. In the ACE runs, only bigger features such as the yellow lines were extracted, and importance scores were predominantly low. This erroneously hinted that the model only used the yellow lines for its predictions.

Impact. This dataset was obtained in a real-world scenario, and the current method was used to verify its decision-making process. Through this analysis, it was detected that the model was also using the yellow lines and backside letters for its predictions, which in this context represents a target leakage. In general, target leakage occurs when the training data contains information about the target variable that will not be available at prediction time (Kaufman et al., 2012). In this case, the target leakage not only leads to learning concepts which are not directly helpful to solve the task, but can also become an unexpected bias and diminish robustness during the deployment of the models. After consulting with the domain expert, we learned that defective parts had been created from a different mold than non-defective parts. The different molds had subtle but significant differences in shape and identifying letters, which were learned by the model. Thanks to the detection of data leakage, mitigation was possible with further actions.

5.5 Conclusion

Three important phenomena were observed across all experiments. First, the encoding proposed by SPACE proves to be superior when analyzing scale-variant models, as it introduces fewer artifacts when composing bigger images from the extracted patches. This affects the concept clustering and importance testing, as the encodings are closer to what the networks have learned. In contrast, using interpolation techniques for resizing patches modifies the scale-dependent features seen by the model. This had a significant impact when testing concepts containing pinholes. Second, SPACE improves concept extraction when analyzing features related to edges or boundaries (e.g., cracks). The main reason is that the square slicing and tiling proposed by SPACE is not biased by color or intensity boundaries in images (as opposed to superpixels). Third, SPACE extracts fewer concepts than ACE, yet the extracted concepts are more aligned with the semantic meaning of the analyzed classes. This is a consequence of guiding the patch extraction step by the aggregated local importance of the patches.
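The difference between tiling and interpolation with respect to feature scale can be illustrated with a toy sketch. It assumes that composing a bigger image amounts to repeating a patch until the target size is filled, and it uses nearest-neighbour resizing as a stand-in for the interpolation applied to superpixels; both are simplifications, and all function names are hypothetical.

```python
def tile_patch(patch, out_h, out_w):
    """Fill an out_h x out_w image by repeating the patch (no rescaling)."""
    ph, pw = len(patch), len(patch[0])
    return [[patch[r % ph][c % pw] for c in range(out_w)] for r in range(out_h)]


def resize_nearest(patch, out_h, out_w):
    """Stretch the patch to out_h x out_w with nearest-neighbour interpolation."""
    ph, pw = len(patch), len(patch[0])
    return [[patch[r * ph // out_h][c * pw // out_w] for c in range(out_w)]
            for r in range(out_h)]


def max_run(image, value):
    """Longest horizontal run of `value` in any row, used as a proxy for feature width."""
    best = 0
    for row in image:
        run = 0
        for pixel in row:
            run = run + 1 if pixel == value else 0
            best = max(best, run)
    return best


# A 4 x 4 patch with a single-pixel "pinhole" at position (1, 1).
patch = [[0] * 4 for _ in range(4)]
patch[1][1] = 1
tiled = tile_patch(patch, 8, 8)
resized = resize_nearest(patch, 8, 8)
```

In this toy case, the pinhole keeps its original width of one pixel in the tiled image, while it doubles in width in the resized one. The tiled image instead contains repetitions and seams between tiles, which relates to the discontinuities discussed in the concluding remarks.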

6 Discussion and concluding remarks

The current work proposes the algorithm SPACE, which is specifically designed to perform concept extraction in industrial settings. SPACE was tested on three datasets representative of real-world industrial quality control problems. These datasets contained relevant features of different scales captured from a single perspective. In these scenarios, SPACE outperforms ACE.

The techniques introduced by SPACE for patch extraction, image composition, and concept testing enable better concept extraction in quality control. Specifically, SPACE should be preferred when analyzing models that are sensitive to scale changes and when the intended features of the labeled classes are small or related to intensity and color boundaries.

In real-world scenarios (e.g., metal casting), a model can show perfect accuracy during training and testing. However, the model may contain biases and be influenced by data leakage, which can result in poor performance on unseen data. During this study, two data leakages were identified (the yellow lines and the backside letters in the metal casting study). SPACE allowed a clear identification of these unintended leakages and biases in the data. In cases like this one, actions can be taken to obtain more robust models that are better suited for industrial deployment.

Throughout this work, SPACE has shown a reliable performance in multiple use cases. Yet, some points must be considered when applying it. First, the number of slices \(n_s\) must respect the general size of the features that are analyzed. If the selected number of slices yields patches smaller than the general features of the dataset, suboptimal results will be obtained. Second, SPACE should not be used in cases where datasets already contain features similar to the discontinuities introduced during the tiling process of SPACE.
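The first consideration can be turned into a simple sanity check before running SPACE. The sketch below assumes square images, a rough estimate of the side length of the features of interest in pixels, and the candidate range \(n_s = 4\) to \(n_s = 8\) used in the experiments; the feature sizes in the example and the function names are hypothetical.

```python
def patch_side(image_side, n_s):
    """Side length in pixels of one window when slicing into n_s x n_s patches."""
    return image_side // n_s


def largest_valid_slices(image_side, feature_side, candidates=range(4, 9)):
    """Largest n_s whose patches are still at least as big as the features.

    Returns None when every candidate yields patches smaller than the features.
    """
    valid = [n_s for n_s in candidates if patch_side(image_side, n_s) >= feature_side]
    return max(valid) if valid else None


# Hypothetical feature sizes for 210 x 210 and 900 x 900 images.
n_s_small_image = largest_valid_slices(210, feature_side=40)
n_s_large_feature = largest_valid_slices(900, feature_side=300)
```

When the check returns `None`, none of the candidate grids can contain a complete feature within one patch, which corresponds to the situation where suboptimal results are to be expected.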

While SPACE has shown significant improvement over the state of the art when it comes to scale-sensitive problems, several aspects are worthy of consideration in future work. First, on the practical side, more research is required on concept-based explanations and how they fit compliance frameworks in the industry. Second, offering guarantees for concept-based explanations can have a huge impact on the adoption of these techniques in practical applications. Finally, two important points of this work are the choice of feature representations (the extracted feature maps) and of the explanation mediums (the images presented to the users). Exploring alternatives for both can allow a more precise extraction of local concepts and a more complete information flow towards the users.