Topological Measurement of Deep Neural Networks Using Persistent Homology

The inner representation of deep neural networks (DNNs) is indecipherable, which makes it difficult to tune DNN models, control their training process, and interpret their outputs. In this paper, we propose a novel approach to investigating the inner representation of DNNs through topological data analysis (TDA). Persistent homology (PH), one of the outstanding methods in TDA, was employed for investigating the complexities of trained DNNs. We constructed clique complexes on trained DNNs and calculated the one-dimensional PH of the DNNs. The PH reveals the combined effects of multiple neurons in DNNs at different resolutions, which are difficult to capture without PH. Evaluations were conducted using fully connected networks (FCNs) and networks combining FCNs and convolutional neural networks (CNNs) trained on the MNIST and CIFAR-10 data sets. The evaluation results demonstrate that the PH of DNNs reflects both the excess of neurons and the problem difficulty, making PH one of the prominent methods for investigating the inner representation of DNNs.

Methods enabling the understanding of the inner representation of DNNs have been investigated, including the identification of inputs leading to specific results [2,26,34,44] and the evaluation of similarity between different networks [20,27,30]. At the same time, the complexity of DNNs, which represents the knowledge in trained DNNs, is one of the essential subjects.
Persistent homology (PH) is one of the prominent methods in TDA owing to its three advantages: theoretical foundation, computability in practice, and robustness with small perturbations [28]. These advantages are beneficial for investigating DNNs. Theoretical foundation and computability are fundamental in constructing knowledge from empirical observations, while robustness is indispensable for investigating DNNs involving parameter perturbations.
Bastian et al. investigated the complexity of the inner representation of DNNs using zero-dimensional PH, which counts the number of connected neurons at different resolutions [32]. At the same time, one-dimensional PH can reveal other essential aspects of the knowledge complexity in DNNs because it can examine the combined effects of multiple neurons. To the best of our knowledge, there is no previous work employing one-dimensional PH for investigating the inner representation of DNNs based on the trained weight parameters, except our presentation at a symposium [41].
We constructed clique complexes, which have been employed for analyzing brain networks [31], on trained DNNs. Furthermore, we calculated the one-dimensional PH of fully connected networks (FCNs) and networks combining FCNs and convolutional neural networks (CNNs) trained on the MNIST and CIFAR-10 data sets to demonstrate the effectiveness of one-dimensional PH.
The remainder of this paper is organized as follows. Section 2 presents the intuition behind this study. Background information is presented in Section 3. Clique complexes are constructed on trained DNNs in Section 4. The evaluation setup and results are provided in Sections 5 and 6, respectively. Section 7 discusses the assumptions and applications of the measurement method. Related work is discussed in Section 8. Conclusions and suggestions for future work are presented in Section 9.

Intuition behind topological measurement of DNNs
DNNs work as knowledge distilling pipelines, meaning that the degree of feature abstraction increases with the depth of DNN layers [23]. For example, images of cats are incrementally abstracted from pixels to diagonal lines and ear shapes. Additionally, DNNs can detect cats based on feature combinations [9]. Feature relationships represent the implementation of knowledge in DNNs, which can be investigated from DNN structures.
Previous studies have demonstrated that PH can be used for comparing and characterizing human brains. Cassidy et al. employed PH as a tool for comparing human brains using functional magnetic resonance imaging (fMRI) [8]. Petri et al. demonstrated that psilocybin affects the homological structure of the brain's functional patterns [29]. Furthermore, Sizemore et al. employed PH to highlight the crucial features of human brains from diffusion spectrum imaging (DSI) [36]. However, it is often difficult to quantify the activation of neurons from fMRIs and DSIs. Hence, PH is more useful for analyzing DNNs because their network structures and the activation of neurons can be described mathematically. In this study, we employed PH to investigate the process of training a DNN and evaluate its knowledge representation complexities.

Background
The terms of TDA and PH can be understood from previous studies [11,19,28], and introductory videos explaining TDA and PH are available on on-demand video services.

Persistent homology
The homology groups of orders zero and one represent the number of connected components and holes, respectively. PH is a method for computing the homology groups at different resolutions. While the formal definition of PH is provided below, an intuitive understanding of it is sufficient for interpreting the presented experimental results, which were obtained using standard computational libraries.
Definition 1 An abstract simplicial complex is a finite collection of sets K such that X ∈ K and Y ⊆ X implies Y ∈ K .
The sets X in K denote its simplices. The dimension of a simplex is dim X = card X −1, where card X denotes the cardinality of X. The dimension of an abstract simplicial complex is the maximum dimension of any of its simplices. The vertex set is the set consisting of all the simplices of dimension 0, while the face of a simplex X is a non-empty subset Y ⊆ X.
A p-chain c of a simplicial complex K is a formal sum of p-simplices in K, that is, c = ∑ a_i X_i, where the X_i are p-simplices and the a_i are coefficients. We employ modulo-2 coefficients, that is, the a_i are either 0 or 1 and 1 + 1 = 0. The addition of two p-chains c = ∑ a_i X_i and c' = ∑ b_i X_i is defined as c + c' = ∑ (a_i + b_i) X_i, where the coefficients are added modulo 2. The p-chains form a group denoted as C_p.
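As a concrete illustration (ours, not part of the paper's implementation), modulo-2 chain arithmetic can be sketched in a few lines of Python: a chain is a set of simplices, and addition is the symmetric difference, since 1 + 1 = 0 cancels any simplex appearing in both chains.

```python
def chain_add(c1, c2):
    """Add two p-chains over modulo-2 coefficients.

    Each chain is a set of frozensets (simplices); a simplex appearing
    in both chains cancels because 1 + 1 = 0, which is exactly the
    symmetric difference of the two sets.
    """
    return c1 ^ c2

# Two 1-chains sharing the edge {1, 2}; the shared edge cancels.
c = {frozenset({0, 1}), frozenset({1, 2})}
d = {frozenset({1, 2}), frozenset({2, 3})}
result = chain_add(c, d)
```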
A boundary operator ∂_p is a map from a p-simplex to the sum of its (p − 1)-dimensional faces. Formally, for a p-simplex X = [v_0, . . . , v_p], ∂_p X = ∑_i [v_0, . . . , v̂_i, . . . , v_p], where v̂_i indicates that the vertex v_i is omitted. A p-cycle is a p-chain with an empty boundary, forming a group denoted as Z_p = ker ∂_p. A p-boundary is a p-chain that is the image of a (p + 1)-chain, forming a group denoted as B_p = im ∂_{p+1}.
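The boundary operator, and the identity ∂_{p−1} ∂_p = 0 that underlies the definition of homology, can be sketched as follows (an illustrative sketch, not the paper's code):

```python
from itertools import combinations

def boundary(chain):
    """Boundary of a p-chain over modulo-2 coefficients.

    The chain is a set of frozensets; the boundary of each simplex is
    the set of its (p-1)-faces, and a face shared by two simplices
    cancels (1 + 1 = 0), implemented via symmetric difference.
    """
    out = set()
    for simplex in chain:
        for face in combinations(sorted(simplex), len(simplex) - 1):
            out ^= {frozenset(face)}
    return out

triangle = {frozenset({0, 1, 2})}
edges = boundary(triangle)   # the three edges of the triangle
vertices = boundary(edges)   # empty: the boundary of a boundary is zero
```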

Definition 2 The p-th homology group, denoted as H_p (= Z_p / B_p), is the p-th cycle group modulo the p-th boundary group. The p-th Betti number β_p is the rank of H_p.
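For small complexes, the Betti numbers β_p = dim Z_p − dim B_p can be computed directly from the ranks of the boundary matrices over modulo-2 coefficients. The following sketch (ours, not the paper's implementation) verifies that a hollow triangle has one connected component and one hole:

```python
from itertools import combinations

def gf2_rank(rows):
    """Rank over GF(2) of a binary matrix whose rows are int bitmasks."""
    rank = 0
    rows = list(rows)
    while rows:
        pivot = rows.pop()
        if pivot == 0:
            continue
        rank += 1
        lsb = pivot & -pivot
        rows = [r ^ pivot if r & lsb else r for r in rows]
    return rank

def betti_numbers(simplices):
    """Betti numbers of a complex given as vertex tuples
    (all faces must be listed explicitly)."""
    by_dim = {}
    for s in simplices:
        by_dim.setdefault(len(s) - 1, []).append(tuple(sorted(s)))
    max_dim = max(by_dim)
    rank = [0] * (max_dim + 2)  # rank[p] = rank of the boundary map d_p
    for p in range(1, max_dim + 1):
        index = {f: i for i, f in enumerate(by_dim[p - 1])}
        rows = []
        for s in by_dim[p]:
            row = 0
            for face in combinations(s, p):
                row |= 1 << index[tuple(face)]
            rows.append(row)
        rank[p] = gf2_rank(rows)
    # beta_p = dim C_p - rank d_p - rank d_(p+1)
    return [len(by_dim.get(p, [])) - rank[p] - rank[p + 1]
            for p in range(max_dim + 1)]

# Hollow triangle: beta_0 = 1 (one component), beta_1 = 1 (one hole).
hollow_triangle = [(0,), (1,), (2,), (0, 1), (0, 2), (1, 2)]
```

Filling in the 2-simplex (0, 1, 2) kills the hole, as expected.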
Definition 3 A filtration of the simplicial complex K is a sequence of simplicial complexes such that ∅ = K^0 ⊂ K^1 ⊂ · · · ⊂ K^n = K.
For every i ≤ j, there is an induced homomorphism in each dimension p, f_p^{i,j} : H_p(K^i) → H_p(K^j), induced by the inclusion maps K^i → K^j. A homology class γ is born at K^i if it is not in the image of f_p^{i−1,i}, and it dies entering K^j if it merges into the image of an older class at K^j; its lifetime is the half-open interval [i, j). If γ never dies, γ can be said to live forever, and its lifetime is the interval [i, ∞).

Diagrams
A PH diagram illustrates the birth and death of homologies in a filtration; it was originally introduced in [3]. Fig. 1(a) shows points surrounded by hatched circles in R^2. When the radius of the circles is small, the points are isolated. Two encircled regions appear in R^2 when the circles are gradually enlarged. The appearance of the encircled regions corresponds to the birth of homologies. The regions disappear when the circles are enlarged further, and the disappearances correspond to the death of homologies. Fig. 1(b) shows the PH diagram of Fig. 1(a), in which the X-axis shows the birth of homologies and the Y-axis their death. The two points in Fig. 1(b) correspond to the births and deaths of the two regions. The large region in Fig. 1(a) is stable with regard to the enlargement of the circles, whereas the small region is less stable. The stability of the regions is indicated by the distance from the diagonal line in Fig. 1(b), i.e., the small region is plotted near the diagonal line, whereas the large region is plotted at a distance from the diagonal line.
The barcode is another diagram that conveys the same information as the PH diagram. The barcode diagram of Fig. 1(a) is shown in Fig. 1(c), in which the start and end points of the lines parallel to the X-axis show the birth and death of homologies, respectively. The short and long lines correspond to the small and large regions, respectively. The stability of the regions is indicated by the length of the bars in the barcode diagram.

Construction of clique complexes on DNNs
We consider a set of neurons as vertices V = {v_0, . . . , v_n}, where n + 1 is the number of neurons. DNNs are considered as directed graphs with weights w_ij, where w_ij denotes the weight between v_i and v_j; w_ij is zero if v_i and v_j are not connected. We set the relevance of identical neurons to one and the relevance R_ij between connected neurons v_i and v_j as the normalized positive part of the weight. Formally, we set

R_ii = 1,  R_ij = w+_ij / ∑_k w+_kj,  (1)

where w+_ij denotes the positive part of the weight, i.e., w+_ij = max{0, w_ij}. R_ij indicates the relevance between v_i and v_j because the input to the j-th neuron is calculated as ∑_i a_i w_ij + b_j in DNNs, where a_i is the activation of the i-th neuron and b_j is the bias [9]. We employed the positive part of the weight and ignored the bias, in a manner similar to the z+-rule defined in deep Taylor decomposition [26].
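A minimal sketch of the relevance computation, assuming the per-target normalization of the positive weight parts described above (the function name and matrix representation are ours, not the paper's):

```python
def relevance(W):
    """Relevance matrix from a weight matrix W (list of lists).

    Sketch assuming an Eq. (1)-style normalization: positive parts of
    the weights, normalized over each receiving neuron j, with the
    relevance of identical neurons R[i][i] set to one.
    """
    n = len(W)
    Wp = [[max(0.0, W[i][j]) for j in range(n)] for i in range(n)]
    col = [sum(Wp[i][j] for i in range(n)) for j in range(n)]
    R = [[Wp[i][j] / col[j] if col[j] > 0 else 0.0 for j in range(n)]
         for i in range(n)]
    for i in range(n):
        R[i][i] = 1.0
    return R

# Neuron 1 receives positive weights 0.5 and 1.5; negative weights are clipped.
W = [[0.0, 0.5, 0.0],
     [0.0, 0.0, -1.0],
     [0.0, 1.5, 0.0]]
R = relevance(W)
```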
To construct clique complexes on DNNs, the relevance was extended to indirectly connected neurons. For example, when v_0 and v_2 are connected by a path v_0 → v_1 → v_2, the relevance between v_0 and v_2 corresponding to the path is defined as R_01 R_12. The intuition behind this definition is as follows: R_01 and R_12 indicate the contributions of v_0 and v_1 to the increase in the inputs of v_1 and v_2, respectively; hence, R_01 R_12 indicates the contribution of v_0 to the increase in the input of v_2. Formally, we set

R_ij = max_{(v_i, v_k1, . . . , v_km, v_j) ∈ L_ij} R_{i k1} R_{k1 k2} · · · R_{km j},  (2)

where L_ij denotes the set of all possible paths from v_i to v_j. It is possible to define R_ij by aggregating multiple paths in L_ij; however, the maximum was employed in Eq. (2) to improve computational efficiency. Masulli et al. constructed a clique complex K(G) on a finite directed weighted graph G = (V, E) with vertex set V and edge set E with no self-loops and no double edges [25].
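The max-product extension over paths can be computed with a Floyd-Warshall-style recurrence, since all relevances lie in [0, 1] and cycles therefore cannot inflate a product (a sketch under our own naming, not the paper's algorithm):

```python
def extend_relevance(R):
    """Extend direct relevances to all vertex pairs by an Eq. (2)-style
    max-product over paths, using a Floyd-Warshall-style recurrence.

    Assumes 0 <= R[i][j] <= 1, so taking a longer path can never
    increase the product and the maximum over paths is well defined.
    """
    n = len(R)
    E = [row[:] for row in R]
    for k in range(n):          # allow paths routed through vertex k
        for i in range(n):
            for j in range(n):
                via_k = E[i][k] * E[k][j]
                if via_k > E[i][j]:
                    E[i][j] = via_k
    return E

# Direct relevance v0 -> v2 is 0.1, but the path v0 -> v1 -> v2
# contributes 0.5 * 0.4 = 0.2, which becomes the extended relevance.
R = [[1.0, 0.5, 0.1],
     [0.0, 1.0, 0.4],
     [0.0, 0.0, 1.0]]
E = extend_relevance(R)
```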
Correspondingly, R_ij enables the construction of a clique complex and a filtration on V. The neurons were numbered in ascending order from the output to the input layers. Hence, the numbers of the neurons in the layers closer to the output layer are smaller than those in the farther layers, where the distance is indicated by the number of edges from the output layer. Using this numbering, we set the p-simplices on V as

K^t_p = { [v_i0, . . . , v_ip] | i_0 < i_1 < · · · < i_p, R_{ia ib} ≥ t for all a < b },  (3)

where t is a threshold value (0 ≤ t ≤ 1).
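Given the extended relevance matrix, the p-simplices at a threshold t can be enumerated by brute force for small networks (a sketch in the spirit of Eq. (3), with our own naming):

```python
from itertools import combinations

def clique_complex(R, t, max_dim=2):
    """All simplices at threshold t: vertex subsets {i_0 < ... < i_p}
    in which every pair (i, j) with i < j satisfies R[i][j] >= t.

    Brute-force enumeration; practical only for small vertex counts.
    """
    n = len(R)
    K = [(i,) for i in range(n)]          # 0-simplices: the vertices
    for p in range(1, max_dim + 1):
        for verts in combinations(range(n), p + 1):
            if all(R[i][j] >= t for i, j in combinations(verts, 2)):
                K.append(verts)
    return K

# At t = 0.5 the pair (0, 2) with relevance 0.3 is excluded, so no
# 2-simplex forms; lowering t to 0.2 admits the full triangle.
R = [[1.0, 0.8, 0.3],
     [0.0, 1.0, 0.9],
     [0.0, 0.0, 1.0]]
K_high = clique_complex(R, 0.5)
K_low = clique_complex(R, 0.2)
```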
Proposition 1 Let V = {v_0, . . . , v_n} be a finite set, and {w_ij} (0 ≤ i, j ≤ n) be a set of real numbers. Let R_ij (0 ≤ i, j ≤ n) be the relevance defined by Eqs. (1) and (2) using {w_ij}. Let K^t_p be the p-simplices defined by Eq. (3), where t is a threshold value (0 ≤ t ≤ 1). Then, the finite collection of sets K^t = K^t_0 ∪ K^t_1 ∪ · · · ∪ K^t_n is an abstract simplicial complex.
Let t_1 ≥ t_2 ≥ · · · ≥ t_n be a monotonically decreasing sequence ranging from 1 to 0; then K^{t_1} ⊂ K^{t_2} ⊂ · · · ⊂ K^{t_n} is a filtration.

Fig. 2(a) illustrates a four-layered DNN with an output neuron v_0. The values adjacent to the arrows denote the weights between two neurons, and the weight matrix is presented in Fig. 3(a), where the (i, j) element denotes the weight between the i-th and j-th neurons. Fig. 2(b) illustrates the simplicial complex K^{r=1.0} with Betti number β_0 = 9. The decrease of the Betti number β_0 along the filtration can be observed in Figs. 2(c) to (h). Fig. 2(e) illustrates a 2-simplex represented by the gray triangle. Figs. 2(g) and 2(h) illustrate the increase of the Betti number β_1 corresponding to the occurrences of cycles. If the vertices representing the features of input images are connected straightforwardly to the output neurons, the knowledge in the DNN is considered to be simple because it is equivalent to feature detection. In contrast, an increase of the Betti number β_1 indicates that the DNN classifies the input based on combinations of features. From this viewpoint, we can assume that the increase in the Betti number β_1 reflects the complexity of the knowledge in the DNN. Filtration 10 (Fig. 2(i)) has Betti number β_1 = 1: while [0, 2] is a simplex in Filtration 10, it is not included in another simplex [0, . . . , 10] and thus produces β_1 = 1. The computation of PH involves a combinatorial explosion in complexity as the number of vertices increases; several implementations are publicly available [28]. We employed the GUDHI [6,33,39], JavaPlex [38], and Dionysus 2 [12,13] libraries for computation and visualization. These libraries require registering the simplices of each filtration step to calculate PH.
Algorithm 1 identifies all simplices reachable from a vertex s down to the relevance threshold t using recursive procedure calls. All simplices in each filtration step are identified using this procedure and registered to the libraries. Figs. 3(b) and (c) show the barcode and PH diagrams produced by the GUDHI library, respectively. The library uses red and green to indicate zero- and one-dimensional homologies, respectively. The Betti numbers in Fig. 3(b) correspond to the number of intersections between the bars and lines perpendicular to the X-axis (recalling that the lifetime of a homology is defined by the half-open interval [birth, death)). The GUDHI library illustrates Betti numbers using color shades in the PH diagrams, as shown in Fig. 3(c). PH was also calculated using the Dionysus 2 and JavaPlex libraries, resulting in the same diagrams.
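Algorithm 1 itself is not reproduced in this excerpt; the recursive clique-growing it describes might be sketched as follows (the function names and the exact recursion shape are our assumptions, not the paper's pseudocode):

```python
def grow(R, t, clique, candidates, out):
    """Recursively extend a clique with higher-numbered vertices whose
    relevance to every current member is at least t, recording each
    intermediate clique as a simplex."""
    out.append(tuple(clique))
    for idx, v in enumerate(candidates):
        if all(R[u][v] >= t for u in clique):
            grow(R, t, clique + [v], candidates[idx + 1:], out)

def simplices_from(R, s, t):
    """All simplices starting at vertex s (using only vertices numbered
    s or higher) at relevance threshold t."""
    out = []
    grow(R, t, [s], list(range(s + 1, len(R))), out)
    return out

# Same toy relevance matrix as before: at t = 0.5 the pair (0, 2) is
# below threshold, so the recursion from vertex 0 stops at [0, 1].
R = [[1.0, 0.8, 0.3],
     [0.0, 1.0, 0.9],
     [0.0, 0.0, 1.0]]
```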
A filtration is defined using thresholds of relevance. This study considered 64 threshold values composed of the eight powers (10^0, . . . , 10^-7) and eight evenly spaced values between each pair of adjacent powers. Formally, we considered the simplicial complexes K_n with r = (1 − 0.1 × (l − 1)) × 10^-m (1 ≤ n ≤ 64), where m and l − 1 are the quotient and remainder when n − 1 is divided by 9, respectively. The filtration was defined as K_1(r=1.0) ⊂ K_2(r=0.9) ⊂ · · · ⊂ K_10(r=10^-1) ⊂ K_11(r=0.09) ⊂ · · · ⊂ K_64(r=10^-7). While the thresholds should be chosen depending on the network structure of the DNNs, we set this aside as a task for future work; this study only examined the prominence of the topological measurement of DNNs.
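The 64 thresholds can be generated as follows (a sketch; the index arithmetic reproduces the sequence 1.0, 0.9, . . . , 0.2, 0.1, 0.09, . . . , 10^-7 described above):

```python
def thresholds():
    """The 64 filtration thresholds: the powers 10**0 .. 10**-7 with
    eight evenly spaced values between each adjacent pair of powers."""
    return [(1.0 - 0.1 * ((n - 1) % 9)) * 10.0 ** -((n - 1) // 9)
            for n in range(1, 65)]
```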

Evaluation setup
The MNIST and CIFAR-10 data sets were employed in the evaluation [22,24]. As shown in Table 1, the contents of the MNIST and CIFAR-10 data sets are 28 × 28 grayscale handwritten digits and 32 × 32 color photographs, respectively. The CIFAR-10 data set comprises photographs of 10 types of objects, such as airplanes, automobiles, and birds. All experiments were conducted using Keras and TensorFlow [1,9], and the DNNs were developed based on the examples in Keras 2.3.0. For the classification of the MNIST data set, we employed an FCN with two hidden layers of sizes 300 and 100, the ReLU activation function in the hidden layers, and 10 output neurons with the sigmoid activation function (Fig. 1(d)). The models were trained for 10 epochs with a batch size of 64, and all models achieved an accuracy of over 97% on the test data.
For the classification of the CIFAR-10 data set, we employed DNNs consisting of a CNN and an FCN. The CNN was used to extract features from the photographs, while the FCN was used to classify the photographs based on the combination of the features. The proposed method was applied to the FCN since the purpose of this study was to examine the complexity of the knowledge in DNNs represented in the combination of features.
We employed the CNN from an example network included in Keras 2.3.0 without modifications. This CNN comprises multiple layers, including two-dimensional convolution, max pooling, and dropout layers. Two FCNs with layer sizes of (300, 100, 10) and (512, 512, 10) were used to examine the sensitivity of the proposed method to the network structure. The DNNs were trained for 30 epochs with a batch size of 32.
Evaluation results

MNIST data set
Figs. 4(a-j) illustrate the PH diagrams of the FCNs produced using the Dionysus 2 library, where the number of input digits used to train the FCN models was varied. In particular, we extracted the images of the target digits from the MNIST data set and trained FCN models using the images of digits 0-9 (Fig. 4(a)), digits 0-8 (Fig. 4(b)), and so on. The Dionysus 2 library visualizes the number of overlapping homologies using different colors, as indicated by the legends in Fig. 4. The birth and death values on the axes of the PH diagrams indicate the order of the 64 threshold values defined in Section 4. Letting m and l − 1 be the quotient and remainder when an axis value minus one is divided by 9, the threshold value corresponding to that axis value is (1 − 0.1 × (l − 1)) × 10^-m. This correspondence is consistent throughout the paper.
The following three observations can be made from Figs. 4(a-j): (1) points are plotted in a belt-like area (birth + 5 < death < birth + 20) parallel to the diagonal line; (2) some figures have points below the belt-like area; and (3) some figures have points above the belt-like area.
With respect to observation (2), the number of points below the belt-like area increases from Fig. 4(a) to Fig. 4(g) and decreases from Fig. 4(h) to Fig. 4(j). This pattern reflects both the excess of the output neurons and the problem difficulty. It can further be observed that the diagrams seem to reflect the degree of confidence of the FCN models, i.e., an excess of output neurons reduces the confidence, whereas the simplicity of the problem increases it. For further investigation, we classified five digits using five output neurons (Fig. 4(k)) and 10 digits using 20 output neurons (Fig. 4(l)). In contrast to Fig. 4(f), the points below the belt-like area disappeared in Fig. 4(k). The opposite can be observed in Figs. 4(a) and 4(l). Table 2 lists the numbers of points plotted in Figs. 4(a-e), 4(i), and 4(j). We categorized the points using the representative cycles calculated by JavaPlex based on the following two conditions: (c1) the homology includes unused output neurons, and (c2) the point lies below the belt-like area (death ≤ birth + 5). While the number of points that include unused output neurons in Figs. 4(i) and 4(j) is more than twice that in Fig. 4(e), these points are not plotted below the belt-like area. The simplicity of the problem led to no points being plotted below the belt-like area.

CIFAR-10 data set
Figs. 5(a-j) illustrate the PH diagrams of the DNN models combining a CNN and an FCN (300, 100, 10), where the number of classes used to train the models was varied. In particular, we extracted photographs of the target classes from the CIFAR-10 data set and trained the DNN models using the photographs of 10 classes (Fig. 5(a)), nine classes (Fig. 5(b)), and so on.
As described in Section 5, the content of the CIFAR-10 data set differs from that of the MNIST data set in terms of image size, tone, and represented objects. Unlike the FCN models trained on the MNIST data set, CNNs were employed in addition to FCNs to classify the CIFAR-10 data set.
Despite these differences, transitions similar to those in Figs. 4(a-j) can be observed in Figs. 5(a-j). A further experiment was conducted using the DNN models combining a CNN and an FCN (512, 512, 10). The results of this experiment are illustrated in Fig. 6. A similar pattern regarding the appearance and disappearance of points under the belt-like area can be observed in Fig. 6; that is, only Figs. 6(d-h) and 6(l) have points under the belt-like area. This result suggests that the observation is robust not only to the network type and the content of the data sets but also to the number of neurons in the FCNs.
Two additional observations can be made from Table 3. First, the number of points reflects the difference in expressiveness between the FCN (512, 512, 10) and the FCN (300, 100, 10). The FCN (512, 512, 10) has more parameters than the FCN (300, 100, 10); consequently, its ability to learn knowledge is higher, and it produces more homologies. As a rough approximation, the FCN (512, 512, 10) has 512 × 512 + 512 × 10 weight parameters, whereas the FCN (300, 100, 10) has 300 × 100 + 100 × 10. The ratio 8.62 (= (512 × 512 + 512 × 10)/(300 × 100 + 100 × 10)) explains the increase in the values listed in Table 3. Second, the increase in the size of the convex hull is smaller than that in the number of points, which indicates that the FCNs (512, 512, 10) have duplicated homologies approximately 4 to 8 times more often than the FCNs (300, 100, 10). This implies that the FCNs (512, 512, 10) form duplicated homologies with different neurons, which can be achieved through expressive training on the data set. The interpretation of the PH diagrams requires further investigation, which we leave as a task for future work because the purpose of this study was only to examine the prominence of the topological measurement of DNNs.

Robustness on weight initialization
We conducted additional experiments by varying the initial values of the network weights to investigate the robustness of the PH diagram transitions described in Subsections 6.1 and 6.2. The Keras framework starts training with random initial values of the network weights [9]. We repeated each experiment 10 times, varying the number of input classes from 10 to 1 with the three network types, MNIST with FCN (300, 100, 10), CIFAR-10 with FCN (300, 100, 10), and CIFAR-10 with FCN (512, 512, 10), resulting in a total of 300 additional experiments. Fig. 7 shows the minimum, average, and maximum sizes of the convex hulls of the points in the PH diagrams. The differences between the maximum and minimum values indicate the degree of variation in the experimental results. All three graphs are approximately convex upward, indicating that the PH diagrams change shape in a manner similar to that described in Subsections 6.1 and 6.2, and that the transitions are robust to the initial values of the network weights.
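The "size of the convex hull" used above can be computed, for example, as the area of the hull of the (birth, death) points; the paper's exact size measure is not specified in this excerpt, so the following is an illustrative sketch:

```python
def convex_hull_area(points):
    """Area of the convex hull of 2-D points (Andrew's monotone chain
    followed by the shoelace formula). Returns 0.0 for degenerate input."""
    pts = sorted(set(points))
    if len(pts) < 3:
        return 0.0

    def cross(o, a, b):
        return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

    def half(seq):
        h = []
        for p in seq:
            while len(h) >= 2 and cross(h[-2], h[-1], p) <= 0:
                h.pop()
            h.append(p)
        return h

    hull = half(pts)[:-1] + half(reversed(pts))[:-1]
    area = 0.0
    for i in range(len(hull)):
        x1, y1 = hull[i]
        x2, y2 = hull[(i + 1) % len(hull)]
        area += x1 * y2 - x2 * y1
    return abs(area) / 2.0
```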
In Subsections 6.1 and 6.2, we observed that the number of points near the diagonal line (death ≤ birth + 5) in the PH diagrams changes with the number of input classes. No point near the diagonal line appeared when the number of input classes was set to 10 or 1. Additionally, the number of points near the diagonal line increased as the number of input classes decreased from 10 to 8 and decreased as it went from 3 to 1. Table 5 lists the minimum, average, and maximum numbers of points near the diagonal line in the additional experiments. We observed that no point appeared near the diagonal line when the number of input classes was set to 10 or 1 in any of the additional experiments. Additionally, the increase and decrease followed the same trend in the additional experiments, as shown in Table 5, meaning that the observations obtained in Subsections 6.1 and 6.2 are robust to the initial values of the network weights. These results imply that a shortage of data can be indicated by the PH, that is, the excess of the output neurons produces homologies near the diagonal line. Furthermore, the proposed method is beneficial for selecting appropriate DNN architectures, which is one of the major challenges when utilizing DNNs [35,46].

Related work
Bianchini et al. investigated the upper and lower bounds of network complexity from the viewpoint of PH [5]. Based on their results, Guss et al. empirically analyzed the relationship between the upper bound of network complexity and the data complexity measured by PH to determine an appropriate network architecture for a given data set [15]. However, these two types of complexity are not homogeneous, and their comparability is uncertain. With these considerations in mind, we examined the inner representations of DNNs under small perturbations. Our evaluation results revealed that small perturbations, such as changes in the number of output neurons and in the variety of input data, have a significant impact on PH. Thus, the sensitivity of PH requires careful investigation to secure comparability. Bastian et al. investigated the complexity of the inner representation of DNNs using zero-dimensional PH [32]. Zero-dimensional PH counts the number of connected components in DNNs; Figs. 2(f) and (g) have β_0 = 3 and β_0 = 2, corresponding to their connected components. In contrast, the Betti number β_1 reveals the combinations among neurons illustrated in Fig. 2(g), where neurons one and three collaborate to increase the Betti number β_1. Thus, we believe that one-dimensional PH can reveal the combinations of neurons and access essential aspects of DNNs that are difficult to access using other methods.

Conclusion
This paper introduced a novel approach to investigate the inner representation of DNNs using PH. Evaluations were conducted using FCNs and networks combining a CNN and an FCN trained on the MNIST and CIFAR-10 data sets. The evaluation results demonstrated that the one-dimensional PH of DNNs can reflect both the excess of neurons and problem difficulty, which implies that PH can become one of the prominent methods for investigating the inner representation of DNNs.
The methods for constructing simplicial complexes and defining the filtration were developed on the basis of our own attempts. Their further development will, however, span many research areas, especially given the large variety of network types, including CNNs and recurrent neural networks (RNNs). Furthermore, with regard to computation, considerable effort would be required to apply the topological measurement to very large neural networks, which can have more than 1,000 layers [17]. At the same time, we believe that the topological measurement of DNNs is worth further investigation.