Importance attribution in neural networks by means of persistence landscapes of time series

This article describes a method to analyze time series with a neural network using a matrix of area-normalized persistence landscapes obtained with topological data analysis. The network's architecture includes a gating layer that is able to identify the most relevant landscape levels for a classification task, thus working as an importance attribution system. Next, a matching is performed between the selected landscape levels and the corresponding critical points of the original time series. This matching enables reconstruction of a simplified shape of the time series that gives insight into the grounds of the classification decision. As a use case, this technique is tested in the article with input data from a dataset of electrocardiographic signals. The classification accuracy obtained using only a selection of landscape levels from data was 94.00% ± 0.13 averaged after five runs of a neural network, while the original signals achieved 98.41% ± 0.09 and landscape-reduced signals yielded 97.04% ± 0.14.


Introduction
The use of methods from topological data analysis (TDA) to enhance the performance of neural networks is widespread. In some cases the purpose is to provide versatile vectorizations [4], or to achieve a higher prediction accuracy or classification accuracy [10], or to regularize learning algorithms by feeding topological information extracted from data [6,7,15]. Topology has also been used to reduce the size of datasets without much loss in training accuracy [17]. A survey of TDA methods for time-series analysis in deep learning using Betti numbers is offered in [22]. Tracking changes in the topology of a dataset as it passes through the layers of a well-trained neural network is the subject of [20], while the topology of neuron activations is analyzed in [16]. Assessment of the generalization gap by means of persistence descriptors without the need of a testing set is discussed in [1,8].
The use of landscapes as persistence descriptors was initiated by Bubenik in [3]. Landscapes were used in connection with deep learning in [18] with the goal of improving learnability by adding information on topological features of input data into subsequent layers, but not for explainability purposes. Activation landscapes have also been used as topological summaries of neural network performance in [23], and for personalized arrhythmia classification [24].
In this article we use TDA towards interpretability of classification results in deep learning. More precisely, we use persistence landscapes to retrieve information about the features of the data on which a neural network focuses to perform a classification task. We preprocess data so that the network is fed with a sequence of landscape levels instead of the original signals. The hierarchical structure of persistence landscapes allows us to design a method for finding the most informative levels. For this, we add a layer to a chosen architecture, whose mission is to assign weights to persistence landscape levels of given signals from a dataset. Then we run the network again using only the landscape levels with the highest weights. The results show that the set of selected landscape levels (normally 2 to 4) yields classification accuracies similar to those of the collection of the first 10 levels.
Selecting the most relevant landscape levels for a deep learning classification task opens the possibility of partially reconstructing the given data using only the chosen landscape functions. The resulting simplified version of the given data sheds light on which parts of data signals were most relevant for the network's classification task. This provides not only information about the learning process by the network but also about the most essential features carried by data.
In the context of a heartbeat analysis (Section 3.2) we checked that our neural network obtains similar accuracies when fed with reconstructions of signals from selected landscape levels in comparison with those obtained with raw data. This enhances confidence in the classification results by providing evidence that the network is not focusing on artifactual details during the learning process.
Our reconstruction method is described more precisely in a companion article [14], which addresses some mathematical questions related to the present paper and is connected with the inverse problem in TDA, namely recovering certain types of data from persistence summaries [2,13,21].
As an outcome of our approach, we found that the number of landscape levels that are selected as being most relevant for a dataset is inversely associated with the accuracy of a neural network being trained with the given dataset (Fig. 6 below). Thus, datasets for which it is unclear which landscape levels should be marked as most important for a deep learning classifier tend to be those with a lower classification performance.
Basic facts about persistence landscapes are collected in Section 1, and our attribution algorithm for landscape levels is described in Section 2. In Subsection 3.1 we validate our technique with nine datasets from the UCR Time Series Classification Archive [9], and use it in Subsection 3.2 to test the accuracy of classification of electrocardiographic signals from the MIT-BIH Arrhythmia Database [19]. As discussed in Subsection 3.3, shifting signals may cause a loss of classification accuracy by a neural network, while persistence landscapes and results obtained from them remain invariant under horizontal shifts of the data. Hence there is an advantage in using landscapes for classification in cases where such shifts may be due to undesired effects.

Persistence landscapes for sublevel sets
Time-series arrays can be viewed as one-dimensional continuous piecewise linear functions where persistent homology can be applied to study the evolution of sublevel sets. Thus we consider a sliding parameter $t$ along the y-axis, and for each function $f$ defined on an interval $[a, b]$ and each value of $t$ we compute the number of connected components of the corresponding sublevel set $L_t(f)$, which is defined as
$$L_t(f) = \{x \in [a, b] : f(x) \le t\}.$$
This coincides with the number of connected components of the part of the graph of $f$ which lies at or below height $t$. The collection of all sublevel sets for a given function yields a persistence module whose value at $t$ is the vector space $H_0(L_t(f); \mathbb{R})$, where $H_0$ denotes zero-dimensional homology and coefficients in the field $\mathbb{R}$ of real numbers are used.
For background about persistence modules and their associated barcodes and persistence diagrams, see [12]. A barcode depicts the lifetime of each connected component of a sublevel set, from the height $t = b$ (birth) where it appears until the height $t = d$ (death) at which it merges with some other connected component. The corresponding persistence diagram contains a point $(b, d)$ for each barcode line starting at $b$ and ending at $d$. The infinite ray depicting the essential homology class that survives to infinity is discarded for practical purposes.
Barcodes and persistence diagrams are not optimal for use in deep learning, since neural networks perform best with array-shaped data. In this article we use landscapes as persistence summaries.
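The birth and death heights described above can be computed for a sampled signal with a union-find pass over the samples in increasing order of height. The following is a minimal sketch, not the implementation used for the experiments; the function name `sublevel_persistence` and the tie-breaking choices are assumptions made for illustration.

```python
import numpy as np

def sublevel_persistence(values):
    """Finite (birth, death) pairs of H0 of sublevel sets of a sampled signal.

    The essential class (the component of the global minimum) is discarded.
    """
    values = np.asarray(values, dtype=float)
    n = len(values)
    order = np.argsort(values, kind="stable")  # sweep heights from low to high
    parent = [-1] * n                          # -1 marks samples not yet added
    birth = {}                                 # sample index -> height at which it was added

    def find(i):
        # Root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    pairs = []
    for i in order:
        i = int(i)
        parent[i] = i
        birth[i] = values[i]
        for j in (i - 1, i + 1):
            if 0 <= j < n and parent[j] != -1:
                ri, rj = find(i), find(j)
                if ri == rj:
                    continue
                # Elder rule: the component born later dies at the current height.
                young, old = (ri, rj) if birth[ri] > birth[rj] else (rj, ri)
                if values[i] > birth[young]:
                    pairs.append((float(birth[young]), float(values[i])))
                parent[young] = old
    return sorted(pairs)

pairs = sublevel_persistence([0, 2, 1, 3, 0.5, 4])
```

For the toy signal in the last line, the local minima at heights 1 and 0.5 give rise to components that merge into the component of the global minimum at heights 2 and 3, respectively.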
Persistence landscapes were defined in [3] and, in the case of sublevel sets of signals, they express the evolution of connected components by means of a finite sequence of continuous piecewise linear functions with compact support. Computationally, each landscape function can be expressed as an array of discretized values, which makes it suitable to be introduced into a deep learning system.
The sequence of landscape functions associated with a persistence diagram is defined as follows. For each point $(b, d)$ in the persistence diagram, one considers the corresponding tent function
$$\Lambda_{(b,d)}(t) = \max\{0, \min\{t - b,\; d - t\}\}.$$
Next, a piecewise linear function $\lambda_k : \mathbb{R} \to \mathbb{R}$ is defined for each $k \ge 1$ as
$$\lambda_k(t) = \operatorname{kmax}\{\Lambda_{(b_i, d_i)}(t)\}_i,$$
where $i$ runs over the points $(b_i, d_i)$ of the persistence diagram and kmax returns the $k$-th largest value of a given set of real numbers whose elements are counted with multiplicities, or zero if there is no $k$-th largest value. Therefore, since the number of points in a persistence diagram is finite, $\lambda_k = 0$ for all sufficiently large values of $k$. The first landscape levels $\lambda_1, \lambda_2, \ldots$ depict the most persistent topological features, while the last ones correspond to less persistent phenomena.
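The definition above translates directly into array operations: evaluate one tent function per diagram point on a grid, then sort the values at each grid position in decreasing order. The sketch below assumes a user-chosen evaluation grid and a fixed number of levels; the name `landscape_levels` is hypothetical.

```python
import numpy as np

def landscape_levels(pairs, grid, num_levels):
    """Discretized landscape levels lambda_1..lambda_num_levels on `grid`.

    `pairs` is a list of (birth, death) points; rows beyond the number of
    diagram points are identically zero, as in the definition of kmax.
    """
    pairs = np.asarray(pairs, dtype=float).reshape(-1, 2)
    grid = np.asarray(grid, dtype=float)
    b = pairs[:, 0][:, None]
    d = pairs[:, 1][:, None]
    # One tent function per diagram point, evaluated on the whole grid.
    tents = np.maximum(0.0, np.minimum(grid[None, :] - b, d - grid[None, :]))
    # Sort each column in decreasing order: row k-1 holds the k-th largest value.
    tents = -np.sort(-tents, axis=0)
    levels = np.zeros((num_levels, grid.size))
    m = min(num_levels, tents.shape[0])
    levels[:m] = tents[:m]
    return levels

levels = landscape_levels([(0, 4), (1, 3)], [0, 1, 2, 3, 4], num_levels=3)
```

In this toy diagram the point (0, 4) dominates everywhere, so it alone determines the first level, the nested point (1, 3) determines the second, and the third level is zero.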

Attribution of importance
The fact that persistence landscapes can be stratified into a hierarchical sequence of levels makes it possible to design a mechanism for importance attribution that ranks the landscape levels of a given sample of signals. In [14] a deterministic procedure is described to reconstruct signals from directional persistence landscapes in a number of chosen directions. It is also shown in [14] how to partially reconstruct the given signals using only a subset of selected landscape levels, which is the focus of interest in the present article. By combining this procedure with a machine learning assignment of a sequence of weights to landscapes, we achieve a substantial reduction of the number of critical points of the given data functions without losing much classification accuracy.
To do this, we stack landscape functions from persistence of sublevel sets of the given signals in a matrix that will be fed into a neural network. Landscapes provide a convenient representation, since each landscape level corresponds to a different region of the oscillation of the input signal.
Since our objective is to feed a deep learning model, we decided to normalize the area under each landscape function in order to force the network to focus on their morphology instead of their actual values. This process is illustrated in Fig. 3.
The existence of different levels of information naturally leads to the study of which levels are more important than others for the classification task. In order to implement this idea, we propose the use of a gating layer: we maintain the matrix shape throughout the architecture and, before applying the fully connected layers, each landscape level $\lambda_k$ is multiplied by a positive, less-than-one learnable weight $w_k$. Thus we obtain a set of weights that indicate how influential each landscape level is for the classification task. Typically, a network should regard the first landscape levels as more important than the last ones, given that the first levels contain information about the most persistent topological features.
By building a ranking of all the landscape levels, we are able to decide at which threshold of information the network stops learning. This is helpful in two main ways: first, we are able to reduce the information that we use to train our system by reducing the number of landscape functions that we pass to our network; and second, we can attribute importance to the parts of the original data that are producing the most relevant landscape levels.
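The forward pass of such a gating layer, and the ranking derived from its weights, can be illustrated with a few lines of NumPy. This is only a sketch: the text does not say how the weights are kept between 0 and 1, so the sigmoid parametrization used here is an assumption, as are the names `gate` and `raw_params`.

```python
import numpy as np

def gate(features, raw_params):
    """Multiply each row (one per landscape level) by a weight in (0, 1).

    Assumption: the weight is the sigmoid of an unconstrained parameter;
    the source only states that weights are positive and less than one.
    """
    features = np.asarray(features, dtype=float)
    raw_params = np.asarray(raw_params, dtype=float)
    weights = 1.0 / (1.0 + np.exp(-raw_params))
    return features * weights[:, None], weights

# Three levels with four features each (hypothetical values).
features = np.ones((3, 4))
gated, weights = gate(features, raw_params=[0.0, 10.0, -10.0])

# Rank levels by decreasing weight, using 1-based level numbers.
ranking = np.argsort(-weights) + 1
```

After training, only the weights are needed for attribution: the ranking orders the levels from most to least influential.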

Experimental setting and results
In this section, we present the results of our experiments using a neural network with a fixed architecture and different input signals. Our main aims are to assess the changes in classification accuracy by using only a set of selected landscape levels in comparison with the full landscape and with the original data, while determining which are the most relevant landscape features in each database. Robustness of our method is estimated by applying it to nine databases of very different nature.
Data. We applied our methodology to a collection of datasets taken from the UCR Time Series Classification Archive [9]. The criteria for choosing a dataset were the following: the dataset should have at most five different classes and the total number of samples divided by the number of classes should be greater than or equal to 500. These criteria were adopted in order to avoid dealing with data scarcity problems and difficulties caused by imbalanced classes or by an excessive number of classes. Table 1 contains a summary of the characteristics of each dataset.
Methodology. In order to avoid discrepancies in the accuracy of the method due to the different ranges of values among datasets, input functions have been standardized to have values between 0 and 1. Moreover, when the topological preprocessing is applied, landscapes have been normalized so that the area under each landscape function is equal to 1. In doing so, we force the neural network to study the shape of the landscape, rather than only taking into account its actual values.
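The two preprocessing steps described above can be sketched as follows. Whether the rescaling to [0, 1] is applied per signal or per dataset is not stated, so the per-signal version below is an assumption; the area is approximated with the trapezoidal rule, and identically zero levels are left unchanged. The function names are hypothetical.

```python
import numpy as np

def min_max_scale(signal):
    """Rescale one signal to the range [0, 1] (assumed to be done per signal)."""
    signal = np.asarray(signal, dtype=float)
    lo, hi = signal.min(), signal.max()
    if hi == lo:
        return np.zeros_like(signal)  # constant signal: nothing to rescale
    return (signal - lo) / (hi - lo)

def normalize_area(levels, grid):
    """Divide each landscape level by its area so that the area becomes 1."""
    levels = np.asarray(levels, dtype=float)
    grid = np.asarray(grid, dtype=float)
    widths = np.diff(grid)
    # Trapezoidal rule along each row.
    areas = ((levels[:, 1:] + levels[:, :-1]) * 0.5 * widths).sum(axis=1)
    safe = np.where(areas > 0, areas, 1.0)  # keep zero levels as zero
    return levels / safe[:, None]

levels = np.array([[0, 1, 2, 1, 0], [0, 0, 1, 0, 0], [0, 0, 0, 0, 0]], dtype=float)
normalized = normalize_area(levels, np.arange(5.0))
```

In the toy example the first level has area 4 and is divided by 4, the second already has area 1, and the zero level is untouched.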
The main objective of our study is to compare the ability of landscape levels to capture information against a baseline of the raw data with the only preprocessing of standardization. Furthermore, to assess that the selected landscape levels are sufficient to classify, the results of feeding a neural network with the full landscape and the results of using only the selected levels are compared.
The architecture of the neural network is as follows: three convolutional layers combined with row-preserving max pooling layers followed by two dense layers (Fig. 4). Our gating layer is used for selection and attribution purposes and it is only present when landscape levels are used as input. In that case, the gating layer is placed between the last max pooling layer and the first dense layer. The experiments are conducted using 5-fold cross-validation. Training sets amount to 80% of each dataset. The neural network is trained for 240 epochs, with a starting learning rate of 0.01 that is divided by 5 every 100 epochs. This architecture has been chosen to be rather generic, without attempting to achieve the highest possible accuracy, either with the original data or by means of landscapes. Our purpose was to assess the validity of our method while avoiding possible particularities due to a tailored choice of an optimal architecture.
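The learning rate schedule stated above amounts to a step decay. As a small worked example, assuming epochs are counted from 0 (the source does not say), the rate is 0.01 for epochs 0 to 99, 0.002 for epochs 100 to 199, and 0.0004 for epochs 200 to 239. The function name is hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, factor=5.0, step=100):
    """Step decay: divide the rate by `factor` every `step` epochs (0-based epochs)."""
    return base_lr / factor ** (epoch // step)

# Learning rate at the start of each stage of a 240-epoch run.
schedule = {epoch: learning_rate(epoch) for epoch in (0, 100, 200)}
```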
As for performance metrics, only accuracy is taken into account in this article.
Figure 4: Architecture of the neural network designed for this study. The gating layer is placed immediately before the first dense layer (pink) when landscapes are used as input.

Performance results
We carried out the same experiment for 9 different datasets from [9] to verify the stability of the results (Table 2). For each dataset, we ran a neural network (Fig. 4) with three different inputs: the original data, a sequence of persistence landscape levels, and a selected subset of levels. Since the length of the full sequence of nonzero landscape levels is variable, we chose the first 10 levels $\lambda_1, \ldots, \lambda_{10}$, because in most cases the 10th level was already zero, and fixing a larger number of landscape levels caused memory difficulties during the training process without a significant increase in accuracy. Subsequently, the selection of a smaller number of principal landscape levels was made by choosing the highest weights provided by the gating layer. The number of selected levels ranged from 2 to 5 depending on the dataset (Fig. 5). Further details about the selection of an appropriate subset of landscape levels are given in § 3.1.2.
Table 2 shows the average accuracy and standard deviation of each experiment using 5-fold cross-validation. The table contains average accuracy results using raw data, unnormalized landscapes, normalized landscapes, and a selected subset of normalized landscape levels. The results show that landscapes achieve sufficiently high classification accuracies, especially when they are normalized (third and fourth columns). In that respect, landscape accuracies are statistically comparable up to one standard deviation to using raw data in four out of the nine datasets.
In Table 2, the results obtained by TDA-based strategies that are statistically comparable among them (including the method that achieved maximum accuracy) are highlighted in bold font. Unnormalized landscapes miss relevant information in most cases, and this translates into a significant reduction in accuracy. It is also remarkable that the selected landscape levels achieve performances similar to those of whole (normalized) landscapes. This reinforces the hypothesis that most of the information contained in data is captured by a small subset of landscape levels.
In the PhalangesOC dataset, normalized landscapes perform even better than the original data. As pointed out in the Discussion, this could be due to the inherent elastic deformation invariance provided by the landscape representation.
Table 2: Average accuracies (given as percentages) and standard deviations on test sets from five runs of a neural network (Fig. 4) for nine signal datasets. Accuracies obtained from original data (first column) are compared with those obtained from the first 10 landscape levels without area normalization (second column) and with area normalization (third column), and from the most informative landscape levels (fourth column). The last column indicates how many landscape levels were selected in each case. Statistically comparable accuracies among TDA-based strategies appear in boldface.

Ranking of landscape levels
The keystone of our process is to be able to identify which landscape levels carry the highest amount of information for classification outcomes. The gating layer multiplies each landscape level $\lambda_k$ (with $k = 1, \ldots, 10$) by a learnable weight $w_k$ with $0 \le w_k \le 1$. After the full training process of the neural network, the resulting weights are used to attribute importance to each landscape level.
To ensure significance, we performed the experiment five times and recorded the mean weight value and standard deviation for each landscape level, as seen in Fig. 5. Although there is no obvious numerical method to determine the number of landscape levels that should be considered important in view of their weights, we used the following criterion. If $w_k < \frac{1}{2} w_{k-1}$ for some $k$, we call $k$ a significant drop. If $k$ is the largest significant drop with $w_{k-1} > 0.1$, then we select $\lambda_1, \ldots, \lambda_{k-1}$ as most important landscape levels. If there is no significant drop with $w_{k-1} > 0.1$, then we pick the smallest $k$ such that w
With very few exceptions, the network regards the first landscape levels as more important. These contain information about the most persistent topological features of each signal (connected components of sublevel sets). The first 10 levels were used in all the experiments. In some cases, namely (d) and (f) in Fig. 5, landscape levels $\lambda_k$ with $k > 6$ were zero for all samples in the dataset. In these cases, the gating layer assigned small but not necessarily zero weights to the null levels.
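The main branch of the selection criterion can be written as a short function. The sketch below implements only the rule based on the largest significant drop; the fallback rule for the case without such a drop is not reproduced, so the function returns `None` there. The function name and the weight values in the example are hypothetical.

```python
def select_levels(weights, min_weight=0.1):
    """Number of leading landscape levels to keep, or None if no rule applies.

    `weights[0]` is w_1. An index k (1-based, k >= 2) is a significant drop
    when w_k < w_{k-1} / 2; it is eligible when also w_{k-1} > min_weight.
    """
    drops = [k for k in range(2, len(weights) + 1)
             if weights[k - 1] < 0.5 * weights[k - 2]]
    eligible = [k for k in drops if weights[k - 2] > min_weight]
    if not eligible:
        return None  # fallback case, not covered by this sketch
    # Keep lambda_1, ..., lambda_{k-1} for the largest eligible drop k.
    return max(eligible) - 1

# Hypothetical mean weights for ten levels: the only significant drop is at k = 4.
example = [0.9, 0.8, 0.7, 0.2, 0.15, 0.1, 0.08, 0.05, 0.04, 0.06]
num_selected = select_levels(example)
```

With these illustrative weights the function keeps the first three levels.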
It is remarkable that the terminal landscape level (i.e., the 10th in our study) tends to be consistently more relevant than the immediately preceding ones, except in those cases where it is zero for the whole dataset. This suggests that the terminal landscape level may convey discriminant information, deserving further study.
Fig. 5 shows that for certain datasets all weights are below 0.4, specifically (c) and (f), and marginally also (i). Looking at Table 2, we find that these datasets are precisely the ones that yield accuracies below 90% on test sets after the neural network had been trained with the original data. The datasets where the original data achieved a higher classification accuracy coincide with those with the smallest number of important landscape levels. Indeed, Fig. 6 shows an inverse relationship between accuracies and the number of selected landscape levels.
As examples of unfavorable cases, we now discuss results obtained with the datasets FordA and TwoPatterns from [9]. These datasets share a common property, namely they consist of wave-like signals with a varying wavelength and the key information to classify them is the x-coordinate where the changes in the waves are happening. In one of them (FordA), the original data are difficult to
In Fig. 7 we see that, for the FordA dataset (where the neural network has trouble classifying even with the original data), the weights of persistence landscape levels are all similar and of low relevance. In contrast, in the TwoPatterns case we see a clear ranking of the first landscape levels. Hence, landscape selection yields meaningful information about the dataset even in disadvantageous situations, since there is a consistent inverse relationship between the ability of the neural network to correctly classify the original data and the number of important landscape levels found through our method. In conclusion, Fig. 6 and Fig. 7 provide evidence that the outcome of landscape level selection can be related to how well a neural network can perform.

A use case: Results of a heartbeat analysis
As an application case, we used our algorithm for a classification of electrocardiogram (ECG) signals from the MIT-BIH Arrhythmia Database [19] for evaluation of arrhythmia detectors. The dataset can be retrieved from [11] and it includes 48 half-hour excerpts of 24-hour ECG recordings obtained from 47 subjects (25 men aged 32 to 89 years and 22 women aged 23 to 89 years) studied between 1975 and 1979. Our data sample includes 87,554 heartbeats of five classes: one corresponding to normal beats (82.77%); three classes corresponding to different arrhythmia types, namely supraventricular premature beats (2.54%), premature ventricular contraction (6.61%), and fusion of ventricular and normal beats (0.73%); and one class for unidentifiable heartbeats (7.35%).
Table 3 shows average accuracy after a 5-fold cross-validation. The classification accuracy of our neural network (Fig. 4) fed with the original unprocessed signal (98.41%) is compared with the accuracy of the same architecture using a 10-level landscape (94.55%) and using only the three most important landscape levels (94.00%). Landscapes were area-normalized since Table 2 evidenced an advantage of normalized landscapes versus unnormalized ones. The choice of three levels was based on weights assigned by the network, as shown in Fig. 8, where $k = 4$ is the largest significant drop.
Figure 8: Average weights and standard deviations of the first ten landscape levels for a sample of the MIT-BIH Arrhythmia Database after five runs of a neural network (Fig. 4).
Next we used the partial reconstruction technique described in detail in [14, Section 3] in four examples, corresponding to the classes of (a) normal heartbeats, (b) supraventricular premature beats, (c) premature ventricular contraction, and (d) fusion of ventricular and normal beats. Three landscape levels were used for approximation in each case. The outcome is shown in Fig. 10. Each landscape function $\lambda_k$ was paired with a list of y-values of critical points of the given signal $f$ as specified in [14, Proposition 3.1]. Hence we obtained a list of y-values of critical points of $f$ associated with the subset of selected landscape levels. The values in this list were compared with the list of all critical points of $f$ in order to obtain the matching x-values, and a new graph was drawn by joining the resulting critical points of $f$ in the order of their x-coordinates, as in Fig. 9. The procedure is detailed below in Algorithms 1, 2 and 3. The resulting simplified graphs (Fig. 10) mark the points of interest, according to the neural network used in our experiment, for the classification of ECG samples. Thus they encode the most relevant information on which the network focused for its task.

            Raw data        10 levels       3 levels        Reconstructed
Accuracy    98.41 ± 0.09    94.55 ± 0.16    94.00 ± 0.13    97.04 ± 0.14

Table 3: Accuracy of classification (given in percentages) of our neural network (Fig. 4) fed with unprocessed data versus processed data with ten landscape levels and processed data using the most significant three landscape levels, as well as data approximately reconstructed by means of three landscape levels.

We subsequently introduced the simplified reconstructions of the wave functions (Fig. 9) into the network in order to check if the data features distilled by our reconstruction method were sufficient for the network's classification task. The results can be seen in Table 3 and indicate that the simplified signals gave rise to accuracies (97.04%) similar to those of the original data (98.41%).

Invariance under translations
Persistence summaries are not altered by horizontal shifts of signals and hence the accuracy of a classification task based on landscapes is invariant under such shifts. However, shifts may cause a loss of classification accuracy by a neural network fed with the original data. To demonstrate this effect, we used the same ECG dataset from § 3.2, yet we modified each heartbeat by adding a number of zeros randomly split between the beginning and the end of the beat signal. Thus, while in the original dataset each heartbeat was represented by a vector of length 187, in our experiment we introduced zeros so that the length was increased to 374. Classification of the shifted ECG graphs by means of the same neural network as in § 3.2 with five repetitions resulted in lower accuracy (Table 4) than with the original data. However, shifts do not alter the evolution of connected components of sublevel sets of the graphs and therefore the landscapes associated with the shifted graphs are the same as those of the original data. This illustrates that the use of persistence descriptors can be advantageous in practical cases where translations of data signals are insubstantial in nature and nevertheless can be misinterpreted by a neural network.
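The shifting step used in this experiment can be sketched as follows. The text says that the added zeros are split randomly between the beginning and the end of each beat; drawing the split point uniformly is an assumption of this sketch, and the function name is hypothetical.

```python
import numpy as np

def pad_with_random_split(beat, target_length=374, rng=None):
    """Pad a beat with zeros, randomly split between its start and its end.

    Returns the padded signal and the number of zeros placed at the start.
    Assumption: the split point is drawn uniformly at random.
    """
    beat = np.asarray(beat, dtype=float)
    rng = np.random.default_rng() if rng is None else rng
    extra = target_length - beat.size
    if extra < 0:
        raise ValueError("target_length must be at least the beat length")
    left = int(rng.integers(0, extra + 1))  # zeros before the beat
    padded = np.concatenate([np.zeros(left), beat, np.zeros(extra - left)])
    return padded, left

# A stand-in beat of length 187, padded to length 374.
beat = np.linspace(1.0, 0.0, 187)
shifted, offset = pad_with_random_split(beat, rng=np.random.default_rng(0))
```

The beat itself is unchanged and only its horizontal position within the longer vector varies from sample to sample.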

Discussion
This article highlights an instance of the usefulness of topological data analysis in machine learning, specifically in the field of interpretability of outcomes of neural networks. Our procedure enabled us to distill partial information from the given data sufficiently relevant for classification purposes without a significant loss of accuracy. We used landscapes as descriptors of persistence of sublevel sets of signals, since landscapes come with a hierarchy of levels that enables us to rank the importance of each level by means of weights assigned by a gating layer in a neural network.
We confirmed that using the whole persistence landscape is not necessary for an accurate classification of signals: once we have identified the subset of landscape levels that is most important for the network, running the experiment with only this subset of levels yields a statistically comparable accuracy (Table 2).
A novelty of this study is normalization of landscape functions so that the area below the graph is always equal to one. This was conceived as an attempt to feed the neural network with shapes rather than magnitudes. As Table 2 shows, the accuracies obtained with normalized landscapes were higher than those obtained prior to normalization. Furthermore, the standard deviation of accuracy is lower after normalization in most cases, suggesting that normalization enhances stability.
In addition, our method allows us to partially reconstruct the given signals using the set of selected landscape levels, thus depicting which features of the data are most relevant for classification by means of the chosen architecture. Persistence descriptors are not injective in general and cannot be used to recover the data except in some cases where a collection of directional persistence diagrams is considered [2,5,14,21]. However, our problem at hand is different: we need not fully reconstruct a function with the only knowledge of its persistence diagram, but rather attribute the persistence descriptors to the original information. Hence, our reconstruction task consists of matching points in the persistence landscape with corresponding parts of the given signals. Indeed, y-values of critical points of signals are determined by points in persistence diagrams as in [14, Section 1].
Since we are discretizing the given signals, difficulties regarding numerical precision may arise. When we are searching for peaks in landscape functions, the corresponding y-values of critical points in signals can be computed up to the chosen precision. When comparing those y-values with the original functions, we cannot expect a zero difference, since we are dealing with approximations. Instead, we have to accept differences that are below a certain threshold. In order to avoid picking x-values that do not correspond to the correct critical points, each time we had a candidate x-value we checked in the original function that it indeed came from a minimum or a maximum.
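The matching step discussed above, with a tolerance on y-values and a check that each candidate is a genuine local extremum, can be illustrated as follows. This is a simplified sketch and not the algorithms of [14]: the function names, the default tolerance, and the treatment of endpoints and plateaus as critical points are assumptions.

```python
import numpy as np

def critical_points(signal):
    """Indices of local minima and maxima (endpoints and plateaus included)."""
    s = np.asarray(signal, dtype=float)
    idx = []
    for i in range(len(s)):
        neigh = []
        if i > 0:
            neigh.append(s[i - 1])
        if i < len(s) - 1:
            neigh.append(s[i + 1])
        is_min = all(s[i] <= v for v in neigh)
        is_max = all(s[i] >= v for v in neigh)
        if is_min or is_max:
            idx.append(i)
    return idx

def simplified_signal(signal, y_values, tol=1e-6):
    """Keep the critical points whose height matches a selected y-value.

    Returns the x-indices (in increasing order) and the heights to be joined.
    """
    s = np.asarray(signal, dtype=float)
    keep = [i for i in critical_points(s)
            if any(abs(s[i] - y) <= tol for y in y_values)]
    keep = np.asarray(keep, dtype=int)
    return keep, s[keep]

# y-values known only approximately still match within the tolerance.
xs, ys = simplified_signal([0, 2, 1, 3, 0.5, 4], [0.5000001, 2.9999999])
```

Samples that merely pass through a selected height without being extrema are rejected, which is the purpose of the check described in the text.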
A feature of our method is that persistence diagrams of sublevel sets of signals do not capture information about the distribution of data along the x-axis, but only along the y-axis. As a consequence, persistence summaries such as landscapes are invariant with respect to translations or scale changes on the x-axis, or, more generally, with respect to horizontal elastic deformations. This can be a disadvantage for the use of persistent homology in cases when, for example, the wavelength of periodic or almost periodic functions is crucial for classification purposes, as illustrated by the datasets FordA and TwoPatterns in Subsection 3.1.2. However, it can be an advantage if expansion or contraction along the x-axis produces undesired effects, as in the case of bradycardia and tachycardia in [10] or in the experiment made in Subsection 3.3. Hence, the use of persistence summaries can serve to remove spuriousness due to shifts on the x-coordinate without a physical significance; e.g., when time series are segmented into shorter signals or when random horizontal segments occur within signals.
Regardless of the effect on performance metrics of using persistence descriptors instead of raw data, by using landscapes we gain insight about key patterns used to classify the given data, which makes the process more trustworthy. Thus, our method not only provides information about the focus of the network's learning process but also serves to explore and better understand the dataset.

Figure 1: From left to right, a piecewise linear function, its barcode of zero-dimensional homology of sublevel sets and the corresponding persistence diagram.

Figure 2: Sequence of nonzero levels $\lambda_k$ of a persistence landscape (left).

Figure 3: Extracting information through persistence landscapes to feed a neural network.

Figure 5: Average weights and standard deviations of the first ten landscape levels for nine datasets after five runs of a neural network (Fig. 4) equipped with a gating layer.

Figure 6: Inverse relationship between the accuracy of our neural network (red) trained with the original raw data and the number of landscape levels (blue) that were selected as important. Datasets in the horizontal axis are ordered by increasing accuracy.

Figure 7: Average weights and standard deviations of the first ten landscape levels for two datasets after five runs of a neural network (Fig. 4) equipped with a gating layer.

Figure 9: An ECG signal function (left) and its approximate reconstruction from a set of selected landscape levels (right).
Panel titles: Supraventricular premature beat; (c) Premature ventricular contraction; (d) Fusion of ventricular and normal beat; (e) Unclassifiable beat.

Figure 10: Partial reconstruction of ECG graphs using the three most important landscape levels for each of four types of heartbeats.

Table 1: A summary of the characteristics of each dataset. For each dataset we indicate the total number of samples, the length of each sample, the number of classes and whether the dataset is imbalanced or not.

Table 2: Average accuracies (given as percentages) and standard deviations on test sets from five runs of a neural network (Fig. 4) for nine signal datasets.

Table 4: Accuracies (given in percentages) of our neural network fed with unmodified data versus data modified by inserting zero segments at the beginning and end of each signal so as to double the length of the signals (second column).