Introduction

Nowadays, classifying volcanic activity is delegated to human experts, as it requires a timely consultation of various kinds of instrumental data. This activity is then all the more demanding as the time interval between samples is smaller. It is therefore interesting to know to what extent Machine Learning (ML) approaches can help to solve this problem by automatically processing instrumental data. In particular, among several kinds of geophysical data, such as seismic, ground deformation, chemical, etc., in this paper we chose to consider the seismic tremor, which is widely available in areas of active volcanism. Although the use of volcanic tremor for classifying volcanic activity is not new in literature, see for instance (Langer et al. 2011) who applied Self-Organized Maps (SOM) and Fuzzy clustering, with the aim of detecting imminent eruptive activity at Mt Etna, we must consider that a clear measure of the reliability of ML algorithms to cluster volcanic activity, based on the use of volcanic tremor, is still incomplete and could benefit from additional contributions. In this work, we aim to contribute to this field, providing a reliable estimation of the performances of both supervised and unsupervised classifiers, adopting as metric the f1-score, which, as is known, represents the harmonic mean between Precision and Recall and in terms of Area under the ROC (AuC). Furthermore, we will compare the f1 index obtained by using the unsupervised approaches against a supervised classification algorithm, in order to estimate the performance loss. Results presented in this paper can integrate those recently presented by Spampinato et al. (2019), who measured the reliability of a warning system, based on the use of seismic tremor measured on Etna, in terms of the AuC index, estimating \(AuC\simeq 0.8\). The validity of the results shown in this work is based on a dataset, recorded at Mt Etna from January 2011 and December 2015, which contains a quite high number of paroxysm episodes (namely 48) and therefore allowing a reliable estimate of the classifiers performances.

Review papers concerning clustering approaches are provided by Liao (2005), Xu and Tian (2015), and Bano and Khan (2018), while some application of Machine Learning for the specific field of volcanic seismic signals processing has been recently provided by Malfante et al. (2018). Recent papers dealing with ML application in solid Earth geoscience and seismology have been provided by Bergen et al. (2019) and Kong et al. (2019). Other ML methods and applications in geophysics, with specific examples to classify tremor data can be found in the recent book (Langer et al. 2020).

The paper is organized as follows: in Section “Statement of the problem” the problem is stated. The considered dataset and pre-processing steps are described in Section “Dataset and preliminary analysis”. Considered supervised classification methods are listed in Section “Supervised classification of volcanic activity”, together with a description of the classification metrics. Numerical results concerning supervised methods are reported in Section “Supervised classification”. Unsupervised classification approaches considered and results, in comparison with the supervised approaches are reported in Section “Numerical results concerning unsupervised classification”. In Section “Discussion and Conclusions”, conclusions are drawn. A formal description of considered ML approaches and hints about their implementation is reported in Appendix A.

Statement of the problem

The problem of classifying volcanic activity consists in associating a label to each kind of activity observed at the summit crater area. In particular, in this work we will follow the line drawn by Cannavó et al. (2017) and subsequently continued by Hajian et al. (2019), who classify the volcanic activity into three classes based on the following description:

  • The class of Quite (Q) state, labeled also as 0, indicates quiet, i.e. no relevant activity or unknown activity, characterized by low amplitudes of all physical features that can be related with the volcanic activity. Representing classes with a double symbol, numerical and literal, make easier on one hand the visualization as a timeseries of classes (numeric representation) and the readability (literal representation).

  • the class of Strombolian (S) activity, labeled also as 1, indicates Strombolian activity, essentially characterized by a mildly explosive activity, with medium amplitude of seismic tremor RMS, shallower source of the seismic tremor, presence of clustered infrasonic events, no eruption column, but possible ash emissions.

  • the class of Paroxysm (P) activity, labeled also as 2, indicates paroxysm, i.e. an energetic activity with lava fountains, characterized by high amplitude of seismic tremor RMS, presence of clustered infrasonic events, shallower source of the seismic tremor, eruption column formation and ash emissions.

For example, Mt Etna would appear to an observer as represented in Fig. 1, during a Strombolian activity and as in Fig. 2 during a paroxysmal. Comparing the two images it is evident that the ejected matter reaches even some thousands of meters for Paroxysm and only a few tens or some hundreds of meters for Strombolian activity. This is representative of the fact that the energy levels corresponding to the various kinds of volcanic activity can be quite different.

Fig. 1
figure 1

Example of Strombolian activity at Mount Etna. The image was kindly provided by the INGV-Sez. di Catania

Fig. 2
figure 2

Example of Paroxysm activity at Mt Etna. The image was kindly provided by the INGV-Sez. di Catania

In a well-monitored volcanic area, such as Mt Etna, the onset of this type of activity can be detected by various instruments such as cameras or Doppler radars. However, in remote volcanoes, very often these kinds of sensors are not present, while seismic monitoring networks are more frequently installed.

The classification of volcanic activity at Mt Etna by using a supervised Bayesian Network (BN) classifier, which discriminate the three kinds of activity mentioned above, based on five features, namely the seismic RMS tremor, the infrasonic RMS tremor, the depth of the volcanic source, the radar Doppler and the tilt derivative, was first proposed by Cannavó et al. (2017). Other supervised classification approaches, namely Decision Tree (DT) and K-Nearest-Neighbor (KNN), on the same dataset have been applied by Hajian et al. (2019).

Dataset and preliminary analysis

The dataset taken into account in this study was recorded from January 1, 2011 to December 31, 2015 and after processing was sampled every 10 minutes. During this time interval the eruptive activity of Mt Etna was rich in episodes characterized by phases of degassing, alternating with explosive and effusive activities, to periodic eruptions from fissures opened on its flanks. A detailed description of the activity occurred is outside the scope of this work and the interested readers can refer to Cannavó et al. (2017) and references therein. For the purposes of this paper, it is enough to mention that expert volcanologists have labeled the 262944 samples into 213190 samples of Q activity, 49048 of S activity and 706 samples of P activity. Samples labeled as Q, S and P are the 81.08%, 18,65% and 0.27% of the total, respectively. This time-series of classes, shown in Fig. 3, will be, again, considered in this paper for evaluating to what extent classification of volcanic activity can be performed by unsupervised classifiers (i.e. clustered) and estimate the performance in comparison with supervised ones. Classification will be carried out considering the seismic tremor as the only feature.

Fig. 3
figure 3

Time series of classes from January 2011 to December 2015, recorded at Mt Etna with 10 m sampling time: 0, 1 and 2 represents Quiet, Strombolian and Paroxysmal activity, respectively. The dataset was kindly provided by the INGV-Sez. di Catania

Volcanic seismic tremor

Volcanic seismic tremor, also referred to as harmonic tremor, is a sustained release of seismic and infrasonic energy typically associated with the underground movement of magma, the venting of volcanic gases from magma, or both. It is a long-duration release of seismic energy, with distinct spectral lines, that often precedes or accompanies volcanic activity. More generally, volcanic tremor is a long-duration continuous signal generated by a spatially distributed source, which contrasts distinctly with transient sources of seismic radiation, such as earthquakes and explosions. The relation between long-period events and an imminent eruption was first observed by Chouet (1996). As indicator of the level of volcanic seismic tremor, the root mean square (RMS) amplitudes of seismic signals in the tremor band have been routinely recorded at the Istituto Nazionale di Geofisica e Vulcanologia (INGV), Osservatorio Etneo, for real-time volcano monitoring, through the so-called permanent seismic network, equipped with broadband (40 s cutoff period), three-component seismometer, which records data in real time at a sampling rate of 100 Hz. The original signals are then filtered in the band [0.5,5.5] Hz, since, as suggested by (Chouet 1996), it concentrates the main part of released seismic energy. The geographic coordinates of most of the considered seismic stations are reported in Table 1.

Table 1 StationID and geographic coordinates of the considered seismic stations

Regression analysis of tremor time series

RMS tremor recorded at different stations of Mt Etna are linear correlated. To illustrate this aspect, we have computed linear regression models of the form:

$$ y(t)=mx(t)+b $$
(1)

where:

  • x(t) is the tremor value measured at time t in a given reference station, assumed as input;

  • y(t) is the tremor estimated by the linear model, into another station, assumed as output;

  • m and b are the slope and the offset coefficients, respectively.

Figure 4, report, as an example, the scatter plots of tremor recorded at the couples ESLN-ECZM, assuming the tremor recorded at the ESLN station as input and that at ECZM as output; the strong linear correlation appears evident. In order to highlight that this is almost independent of the particular class of activity, samples belonging to different classes were represented by different colors. The regression coefficient R, the slope and the offset, for the whole set of stations are reported in Table 2. It can be seen that almost all stations exhibit R close to 1. The lower R values corresponds to stations, namely ECNE, ECPN, EPDN, located very close to the active craters areas, which are, probably for this reason, characterized by very high tremor levels and/or other kinds of local phenomena. For classification purposes, the high linear correlation of tremor time series, implies that the dataset dimensionality can be significantly reduced, as it will be discussed in the next section. For now, we mention here that this property suggested to us a strategy for filling missing data at several stations. Indeed, in the considered time interval a not negligible quantity of missing data has plagued the network, as shown in the last column of Table 2. It is possible to see that in the considered time interval, the percentage of missing data ranges between about 19% and 38%, and in particular, only two stations are below 20%.

Fig. 4
figure 4

Scatter plot and corresponding linear regression model at the ECZM - ESLN couple of stations

Table 2 Regression model coefficients assuming the tremor measured at the ESLN station as input. The last column reports the missing data in percent

In order to overcome this shortcoming, which would have involved the loss of many samples useful for classification, among the various candidate strategies, we have chosen to fill missing data by the average value of the tremor calculated using expression (2), learned from (Cannavó et al. 2017),

$$ T_{RMS}(t)=\frac{1}{n}\sum\limits_{i=1}^{n}\frac{RMS_{i}(t)-\mu_{t}\{RMS_{i}(t)\}}{\sigma_{t}\{RMS_{i}(t)\}} $$
(2)

In this expression, the index, i refers to the n seismic stations, μt represents the median operator over two years previous the time t, and σt represents the interquartile range over the same period. An example of filled time series at the station EBCN, in comparison with the true one, is reported in Fig. 5.

Fig. 5
figure 5

Filled and original tremor time series at the ECBD station

Supervised classification of volcanic activity

In this paper, to perform the supervised classification, of volcanic activity at Mount Etna, the six classification models listed below were considered:

  • A Fisher Discriminant model, here simply referred to as DISC, implemented by using the fitcdiscr function of the Matlab Statistical and Machine and Learning Toolbox, choosing a linear kernel. More details about this kind of classifiers are reported in Section A.

  • A Multiclass error-correcting output model. In particular, as ECOC, in this paper we refer to an Ensemble multiclass classifier implemented by using the fitcecoc Matlab function, using a Support Vector Machine (SVM) algorithm, with linear kernel and One-Versus-One (OVO) coding scheme. More details are reported in Section A.

  • An Ensemble model, here referred to as ENSE, implemented through the fitcensemble function of the Matlab Statistical Toolbox. We have adopted as basic learner a Adaboost2 algorithm, building a medium coarse tree with a maximum number of splits set to 20, a maximum number of cycles up to 30 and a learning rate set to 0.1. A short description of Ensemble models is reported in Appendix A.

  • A K-Nearest Neighbor model, here referred to as KNN, implemented through the fitcknn function of the Matlab library, setting Euclidean distance metrics and 10 samples as the maximum number of neighbors. More details in Section A.

  • A Naive-Bayes model, here referred to as NBYE, based on the fitcnb function of the Matlab library, with normal kernel distribution and unbounded support. More details in Section A.

  • A Decision Tree for multiclass classification, here simply referred to as TREE, implemented through the Matlab function fitctree, growing a medium coarse tree, with a maximum number of splits up to 20. The Gini’s index was adopted as a split criterion. More details in Section A.

There are no particular reasons behind the choice of these methods, other that are among the most popular. Obviously, many other supervised classification algorithms could have been chosen, including the popular neural networks. We want to stress here that the main aim of this work is not to compare classification algorithms, but to measure to what extent the described problem can be solved using widely available and easily implementable algorithms. More deep details concerning these approaches can be found in popular textbooks such as Bishop (2006), Hastie et al. (2008), and Goodfellow et al. (2016) or directly from the MATLAB Statistical and Machine Learning Toolbox (Inc 2017).

Reducing the dimensionality of the dataset

As well known, see for instance (Liu and Motoda 2008), performances of classification algorithms significantly depend on the number of features. Therefore, reducing the dimensionality of the dataset is a good practice (Mohamed 2020). We have performed this task by using the Principal Component Analysis (PCA), which consists of projecting each data point onto the main directions in the feature space, in order to obtain lower-dimensional data, while preserving as much of the data variability as possible. The first principal component can be defined as the direction that maximizes the variance of the projected data. Similarly, the ith principal component is the direction orthogonal to the first (i − 1)th principal ones that maximizes the variance of the projected data.

For our data set, the cumulative variability explained, versus the number of principal components, is shown in Fig. 6. It can be seen that taking the first two components it is possible to account for about 96% of the dataset variability. This percentage rises to 99%, if the first three principal components are considered. Therefore, the PCA analysis points out that the original 17-D dataset can be reduced to 2-D or 3-D. In this paper, in order to be more conservative, we have chosen to consider 3 principal components. However, the effectiveness of reducing the dataset dimensionality by using the PCA must be done with some caution, as a known drawback of the PCA approach for classification problems is that the variance is preserved regardless of the classes (see for instance (Lei et al. 2017)). Thus, a subspace with large variance is not necessarily one in which classes are well separated. In addition to the reduction of the dataset dimensionality through the PCA, we have explored the possibility of performing this task by heuristically choosing two stations of the network and finally by averaging the seismic tremor among all the stations in the network through expression (2). Summarizing, the following three reduced data set were considered to perform the supervised classification:

  1. 1.

    The 3-D dataset represented by the first three PCA components;

  2. 2.

    a 2-D dataset consisting of tremor recorded at two selected stations of the network, namely ESLN and EBCN;

  3. 3.

    the 1-D dataset obtained averaging the RMS tremor over the whole network, by using expression (2).

The stations ESLN and EBCN, were selected according to the following criteria:

  • The time series recorded at these stations are not affected by very high tremor levels, as for instance observed at stations ECNE, EPDN and EPIT.

  • From trials performed, it seems that this couple of stations allow a good classification of class P. However, it is to be stressed, that other choices would be possible, without significantly affecting the classifier performances.

Fig. 6
figure 6

Cumulative explained variance versus the principal components

Classification metrics

Given a classification experiment, let us indicate as P(i) and N(i) the number of actual positive and actual negative cases in the ith class, respectively. Moreover, let TP(i), TN(i), FP(i) and FN(i) the number of true positive, true negative, false positive and false negative cases, respectively, recognized by the classifier, for the ith class. Referring to these quantities, the following rates can also be defined:

$$ \begin{array}{@{}rcl@{}} TPR(i) = \frac{TP(i)}{P(i)}=\frac{TP(i)}{TP(i)+FN(i)}=1-FNR(i)& \end{array} $$
(3)
$$ \begin{array}{@{}rcl@{}} TNR(i) = \frac{TN(i)}{N(i)}=\frac{TN(i)}{TN(i)+FP(i)}=1-FPR(i)& \end{array} $$
(4)
$$ \begin{array}{@{}rcl@{}} FNR(i) = \frac{FN(i)}{P(i)}=\frac{FN(i)}{FN(i)+TP(i)}=1-TPR(i)& \end{array} $$
(5)
$$ \begin{array}{@{}rcl@{}} FPR(i) = \frac{FP(i)}{N(i)}=\frac{FP(i)}{FP(i)+TN(i)}=1-TNR(i)& \end{array} $$
(6)

The meaning of the above indices can be expressed as follows:

  • The TPR(i) expresses the proportion of actual positives that are correctly classified by the model as belonging to the ith class. Best values of TPR approach to 1, while in the worst case TPR approach 0. The TPR is referred to also as specificity or Recall (r).

  • The TNR(i) expresses the proportion of actual negatives that are correctly classified as not belonging to the ith class. As for the TPR, best values of TNR approach 1, while worst values approaches 0. The TNR is referred to also as specificity of selectivity.

  • The FNR(i) expresses the proportion of false negatives in the ith class, with respect to all actual positives in the same class. Of course in the best case FNR approaches 0, while in the worst case approaches 1.

  • The FPR(i) expresses the proportion of false positives in the ith class with respect to the total number of actual negatives in the same class. Similar to the FNR in the best case FNR approaches 0, while in the worst case approaches 1.

Another useful index, is the Positive Predicted Value (PPV ) or simply Precision, which, for the generic class i is defined as:

$$ PPV(i) = \frac{TP(i)}{TP(i)+FP(i)} =1-FDR(i) $$
(7)

In expression (7), FDR stands for False Discovery Rate.

For the purposes of this paper we also make use of the f1-score, defined as:

$$ f_{1}(i)=\frac{2TP(i)}{2TP(i)+FP(i)+FN(i)} $$
(8)

It is easy to verify that indicating as r(i) the TPR for the ith class and as p(i) the PPV, the f1 can be also written as:

$$ f_{1}(i)=2\frac{r(i)p(i)}{r(i)+p(i)} $$
(9)

which highlights that it is the harmonic mean between p(i) and r(i). Therefore, f1 is a more reliable measure of performance than the simple Precision and/or Recall. For this reason, in this paper we will always show f1, instead of the simple Precision and/or Recall.

A useful way to collect most of the performance indices shown above is the so-called Confusion Matrix (CM). On the CM, see for instance the next Section “Numerical results concerning unsupervised classification”, the rows correspond to the predicted class (Output Class) and the columns correspond to the true class (Target Class). The diagonal cells correspond to observations that are correctly classified. The off-diagonal cells correspond to incorrectly classified observations. Both the number of observations and the percentage of the total number of observations are shown in each cell. The column on the far right of the plot shows the percentages of all the examples predicted to belong to each class that are correctly and incorrectly classified, i.e the PPV and the FDR. The row at the bottom of the plot shows the percentages of all the examples belonging to each class that are correctly and incorrectly classified, i.e. the TPR and the FNR, respectively. The cell in the bottom right of the plot shows the overall Accuracy. The accuracy can formally be described by using expression (10)

$$ Accuracy=100\frac{{\sum}_{n=1}^{N}I(C(x_{n})=y_{n})}{N} $$
(10)

where

  • I(g) is a function that returns 1 if g is true and 0 otherwise,

  • C(xn) the class label assigned by the classifier to the sample xn

  • yn the true class label of the sample xn

  • N is the number of samples in the testing set.

Another useful index for evaluating the degree of accuracy and reliability of a statistical classification is Cohen’s Kappa, since it compensates for random hits. It can be computed (see for instance (Tallon-Ballesteros and Riquelme 2014)) starting from the CM, by using expression (11)

$$ \kappa=\frac{N{\sum}_{i=1}^{m}{CM_{ii}-{\sum}_{i=1}^{m}C_{itrue}C_{ipred}}}{N^{2}-{\sum}_{i=1}^{m}C_{itrue}C_{ipred}} $$
(11)

where:

  • m is the dimension of the confusion matrix CM, i.e. the number of classes.

  • N is the number of samples in the testing set.

  • CMii,i = 1,…,m the entries of the CM main diagonal.

  • Citrue the true number of labels of class i.

  • Cipred the predicted number of labels of class i.

The range of Kappa values extends from − 1 ≤ κ ≤ 1, with positive one indicating strong agreement, negative one indicating strong disagreement, and zero indicating chance-level agreement.

Another useful tool for evaluating the reliability of supervised classifiers is represented by the Receiver Operating Characteristic (ROC) metric and in particular the so-called area under curve AuC. ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the ideal point, since it is characterized by a false positive rate of zero, and a true positive rate of one. Even though this is not very realistic, it does mean that a larger area under the curve (AUC) is usually better. For a specific application of the ROC curves to evaluate the reliability of a warning system in volcanic areas, the readers can see the recent paper (Spampinato et al. 2019). For more general further details about metrics concerning classifiers, see Powers (2011).

Supervised classification

Preliminary considerations

Before starting with the description of supervised classification results, it can be useful to make some preliminary considerations regarding the separability of the three considered classes, using the seismic tremor as the only feature. While in Section “Reducing the dimensionality of the dataset” we have drawn attention to the fact that the tremor recorded in the various stations is linearly correlated, which implies that the dimensionality of the dataset can be significantly reduced, here, we want to point out that classes, as labeled by the experts, are largely overlapping in the feature space of the RMS tremor. To justify this statement, we can look at Fig. 4, from which it is easy to realize that the three classes are largely overlapping. This is more evident between classes Q and S which therefore, regardless of the classification technique, we may expect that will result poorly separable. Among the three classes, class P seems to have more chances to be correctly classified. To further confirm that the three classes are not easily separable, by using the seismic tremor as the only feature, we have applied the Silhouette criterion (Rouseeuw 1987). The test was performed by using the K-means algorithm and the search was extended in the range from 1 up to 6 clusters. Results reported in Fig. 7 show that that the highest Silhouette value occurs at 2 clusters, suggesting that this is the optimal number of clusters, regardless of the dataset, among the considered ones. It is also to be observed that the K-means operating on the 3-D data set was not able to contribute, since it indicates a silhouette value independent of the number of classes.

Fig. 7
figure 7

Silhouette values versus the number of clusters for the three reduced dataset

Assessing the generalization capabilities of classifiers

A popular strategy to estimate the generalization capabilities of a model is to divide the available samples in at least two subsets for training and for testing, respectively. The former is used for tuning the model parameters and the latter for assessing its performance on fresh data. However, sometimes the testing set is used in some way during the training, for instance to understand when the training must be stopped. In this case, a third set of data, referred to as the validation set is introduced, in order to be more confident that performances are evaluated on a really fresh dataset.

Since we have a limited number of samples in the P class, a validation set is not realistic. To overcome this drawback, we have adopted the k-fold cross validation strategy. This technique consists of splitting the dataset into k groups, after shuffling the samples randomly. Then for each unique group the following steps are performed:

  • A group is held out and used for testing.

  • The remaining groups are used for training a model.

  • The fitted model is evaluated on the testing set.

  • The obtained score is then retained while the model is discarded.

This means that each sample is given the opportunity to be used in the hold out set 1 time and used to train the model k − 1 times. At the end of the rotation process, the skill of the model is evaluated using the sample of model evaluation scores.

In the next section, we report validation accuracy computed for the three different reduced dataset, as above discussed.

Numerical results

We start with the global indices, i.e. computed averaging over the classes, namely the Accuracy, the f1 index and Cohen’s K, which are shown in Fig. 8.

Fig. 8
figure 8

Global indices for the three options

Roughly speaking, it is possible to say that in terms of Accuracy, there are not significant differences between the various classifiers and to the use of a particular kind of reduced dataset. Instead, the global f1 index points out a slight advantage using the ENSE and the KNN.

As for the Cohen K index, it is always higher than 0.6, thus meaning that the classification is globally enough reliable. Furthermore, also this index contributes to assign an advantage to the use of ENSE and KNN.

However, as stressed above, for our application, non all classes have the same importance. In order to evaluate the reliability in discriminating individual classes, we report in Fig. 9, the single class f1 indices. The figure highlights that this index, regardless of the classification method and the data set, is quite high for the Q class, but limited for the S class, which is therefore poorly discriminated. Another aspect, probably the most interesting for the purposes of this work, is that the f1 index for the P class is about 0.7 for almost all the considered models, and in particular for the DISC, ENSE and KNN classifiers, regardless of the reduced dataset. However, even if it may seem trivial, with regard to the classification of class P, there is no substantial difference even to operate with the 1-D dataset.

Fig. 9
figure 9

f1 evaluated for each class and each classification approach

The ROC curves for the six classifiers are shown in Fig. 10. It is possible to see that regardless for the considered classifiers the P class is always best classified among the others. In more detail, the Area under curve (AuC) for the six classifiers and for each class is reported in Fig. 11. It is possible to see that almost all classifiers work well for classifying P class samples, but perform poorly for the Q and S classes. On average, for the P class the AuC index is 0.87, while for the Q and S class are 0.65 and 0.64, respectively.

Fig. 10
figure 10

ROC curves for the six classifiers trained using the 2-D dataset

Fig. 11
figure 11

Area under the ROC for all methods and classes.

Operating with the 2-D dataset offers the possibility to visualize the boundary regions, as shown in Fig. 12. The regions in red, green and blue colors represent couples of values classified as class Q, S and P, respectively. In most of the cases, the shape of these areas can be easily interpreted. For instance, looking at the DISC, which was trained as a classifier with linear boundaries, it is possible to see that low values of RMS at both the ESLN and EBCN are attributed to the Q class, low values at ESLN and high at EBCN are attributed to the S class and, finally, high values at both stations are attributed to the P class. An easy interpretation can also be found for the NBAY, ECOC, ENSE and TREE classifiers, while the interpretation of the KNN is a little more difficult. Indeed, the classification regions for this classifier exhibit, in the middle-upper part of the plan, an area assigned to class S, that the other classifiers definitely assign to class P. This is due to the fact that not only classes Q and S are partially overlapping in the expert classification, but also classes S and P.

Fig. 12
figure 12

Classification regions computed for the six classifiers, trained by using the k-fold approach (axes units represent RMS of seismic tremor). The axes are graduated in RMS of seismic tremor; on the abscissa and ordinate we have the values measured at the ESLN and EBCN stations, respectively

Supervised classification by using a more balanced data set

Among the criticisms that can be made to the results shown in the previous section, there could be that of having carried out the classification on data set in which the number of samples in the various classes are strongly unbalanced, especially comparing class Q and class P. Indeed, as mentioned in Section “Dataset and preliminary analysis”, the original data set, from now referred as the Full dataset, contains 262944 samples labeled into 213190 samples of class Q, 49048 of class S and only 706 of class P, which in percent are the 81.08%, 18,65% and 0.27%, respectively. Therefore, one might be led to think that reducing the weight of the Q class samples, which are overwhelmingly, there could be some advantage in terms of performance. To address this question, we carried out the classification experiment described below.

From the Full dataset we extracted a new dataset, in the following referred as the Subset, which includes all the P samples of the Full data set, almost all the S samples, but a smaller amount of samples of the Q class. In order to extract such a subset we have considered that the Full dataset contains 48 paroxysmal episodes occurred on Mount Etna during 2011-2015. With the aim of preserving all P samples, we extracted 48 data windows, centered on the starting date of each paroxysmal episode and spanning a prefixed number of samples before and after this date. The time series of classes obtained in this way is shown in Fig. 13.

Fig. 13
figure 13

Time series of the classes for the Subset

Such a time series consists of 10278 samples, of which 5834 labeled as Q, 3738 as S and 706 as P. This means that in the Subset the weights of Q, S and P are 56.76%, 36.37% and 6.87% respectively.

With the obtained Subset, the models of six supervised classifiers, were obtained, trained, by using the k-folding technique. These new classifiers were subjected to classifying the Full dataset and the performance indices were estimated and compared with those of the corresponding classifiers, trained and tested with the Full dataset. Such a comparison, in terms of the f1 index, evaluated for the single classes, is shown in Fig. 14.

The following considerations can be made:

  • The f1 indices for the Q class are almost equal for the two intercompared cases, except that the ECOC model trained with the subset was not able to classify samples of this class.

  • As concerning the S class, the f1 index is lower than 0.6 for all classifiers and in particular DISC and ECOC perform very poorly. It is to be stressed that the ECOC model trained with the Subset was not able to correctly classify any samples of the Full dataset assigned to the S class.

  • For the P class there are not substantial differences for the two inter-compared cases.

Fig. 14
figure 14

f1 index obtained for the model trained and tested with the Full data set and the model trained with the Subset and tested with the Full set

In summary, from experimental results, it seems that there are no advantages training the classifiers with the Subset, if it is used for classifying samples of the Full dataset, which of course is what a user would like to do. On the contrary, some classes are worse classified compared to the case in which the training is carried out on the Full dataset.

Numerical results concerning unsupervised classification

In this section we report numerical results obtained by using the following three clustering approaches:

  • k-means.

  • Fuzzy c-means.

  • Gaussian mixture models.

A short description of each algorithm is provided in Section A. The kmeans, fcm and fitgmdist functions of the Matlab Statistical and Machine Learning Toolbox, were considered for implementing the three approaches listed above. The criteria for setting the parameters, left to the user are indicated below. During the model fitting, in this case, dealing with unsupervised classification, the dataset was not split, i.e. the whole dataset was given as input to the classifiers.

Similarly to what was done for the supervised classification, we report results obtained performing the clustering on three dataset with reduced dimension, presented in Section “Supervised classification”. The performances of the unsupervised clustering were evaluated assuming as true the time series of classes shown in Fig. 3. The global indices, in terms of Accuracy, f1 and Cohen K, for the three dataset are shown in Fig. 15. In the figure, in correspondence with the f1 index relating to the 3-D dataset, the values should not be considered zero, but Not a Number. The reason is that, for all trials carried out, choosing different options for the tunable parameters, the considered clustering algorithms, were not able to correctly recognize any sample of the class P class. For this reason, in the following of this section we will report results for the 2-D and 1-D data sets, only.

Fig. 15
figure 15

Global Accuracy, f1 and Cohen K for the 1-D, 2-D and 3-D dataset and for the three considered clustering approaches

Results by using the k-means and the 2-D dataset

As explained in Section A.2.1, using this approach it is necessary to choose the kind of distance. Based on trials performed, among the Euclidean, cityblock, cosine and correlation, we have chosen the former one. The classification regions of the K-means performed on 2-D dataset in the ESLN-EBCN feature plane, are shown in Fig. 16. It can be seen that the algorithm determines well-distinct regions: the one in the lower corner in red color, characterized by low values of the tremor at both stations, represents the Q class, the intermediate one in green represents the S class and, finally, the region in blue color, characterized by high values of tremor at both the stations, represents the P class. For each class, the searched cluster centers are reported in Table 3. Applying the classification mask shown in Fig. 16 to the 2-D dataset, we obtained the pattern distribution shown in Fig. 17. Performances of the k-means classifier, in terms of Confusion Matrix, assuming as target the expert classification, are reported in detail in Fig. 18.

Fig. 16
figure 16

Classification regions obtained performing the K-means clustering; cluster centers for the three classes are also indicated by the × symbol

Table 3 Class centers using the K-means in the ESLN-EBCN plane
Fig. 17
figure 17

Samples classified by the K-means (axes units represent RMS of seismic tremor)

Fig. 18
figure 18

Confusion Matrix obtained classifying the 2-D dataset by using the K-means algorithm

The behavior of the K-means algorithm, referring to the P class, can be considered relatively satisfactory. Indeed, of the 706 samples labeled by the expert as belonging to P, 424 have been correctly attributed, while 255 were recognized as S and only 27 as Q, therefore giving TPR% = 60.1 and PPV % = 69.9. A weak point of the K-means classifier, for the considered dataset is the low value of TPR (8.4%) for the class S. Indeed, 44797 samples of the total 49048 assigned by the human expert as belonging to the S class, are instead recognized as Q. Obviously, this contributes to lowering the classifier’s performance also with respect to the Q class.

The ROC curves obtained applying the K-means algorithm to the 2-D dataset are reported in Fig. 19. The figure confirms that in terms of ROC curves, the only class classified with acceptable accuracy by K-means is class P, being AuC = 0.81. For both the remaining classes Q and S we have AuC = 0.54, therefore indicating a behavior very close to a random classifier.

Fig. 19
figure 19

ROC curves of the K-means classifier for the three classes

Results by using the FCM and 2-D dataset

Working with the FCM algorithm, in addition to the choice of the metrics, which also in this case we have chosen as Euclidean, it is possible to tune the m parameter (see Section “A.2.2”), which controls the degree of fuzzy overlap among the clusters. After some trials, we have experimentally found that best results were obtained by setting m = 2, which is also the default value assumed by the fcm Matlab function. The cluster centers obtained for the FCM model, reported in Table 4, are a bit different from the ones searched by the K-means algorithm.

Table 4 Class centers using the FCM algorithm in the ESLN-EBCN plane

The classification regions provided by the FCM in the ESLN-EBCN feature plane are shown in Fig. 20. Compared with the analogous mask obtained by the K-means it is possible to see that the area assigned to the Q-region is smaller, while that assigned to the P one is larger. The performances of this classifier are reported in Fig. 21 in terms of Confusion Matrix. It is possible to see the FCM, compared with the K-means, gives higher TPR for the P and S classes, but with a lower PPV. In particular, for the P class we have a TPR of 82.9%, but with a PPV = 56%, which in terms of f1 means f1 = 0.67. Therefore, by using the FCM, we get a higher number of samples correctly classified but at the price of a higher level of false positive.

Fig. 20
figure 20

Classification regions obtained by using the FCM algorithm. The cluster centers are indicated by the × symbol

Fig. 21
figure 21

Confusion Matrix obtained classifying the 2-D dataset by using the FCM algorithm

The ROC curves obtained applying the FCM algorithm to the 2-D dataset are reported in Fig. 22 which shows that for this classifier the AuC assumes the values AuC = 0.91 for the P class and about 0.67 for both the S and Q classes. Therefore, while the P class is classified with an acceptable accuracy, the Q and S are poorly classified. However, in terms of ROC curves, the FCM performs better than the K-means.

Fig. 22
figure 22

ROC curves for each class obtained by using the FCM algorithm

Results by using the GMM

Tuning a GMM model offers to the user the possibility of choosing different settings for the covariance matrices. Indeed, even if ultimately the covariance matrices are automatically computed by the software, the choice of their structure is left to the user. After some trials, with the aim of maximizing the model performances against the P class, we have found convenient to assume, diagonal covariance matrices, shared among the classes. The obtained mixing coefficients πi and mean μi parameters of the GMM model are reported in Table 5. The classification mask, in the feature plane ESLN-EBCN, is shown in Fig. 23.

Table 5 Components of the fitted GMM model
Fig. 23
figure 23

Classification regions obtained by using the GMM algorithm

It is possible to see that also this model, similar to the k-means and FCM ones, simply classifies as Q low RMS values recorded at both ESLN and ECBN, as S low RMS at ESLN and high at ECBN and, finally, as P high RMS values at both the stations. Obviously, the position of the boundary regions are different with respect to K-means and FCM.

The classier performances in terms of CM are reported in Fig. 24, while the ROC curves are reported in Fig. 25. This last figure shows that for this classifier the AuC is 0.80 for the P class and about 0.54 for both the S and Q classes. Therefore, while the P class is classified with an acceptable accuracy, the Q and S are poorly classified. In terms of AuC, the GMM performs worse than the FCM and similar to the Kmeans.

Fig. 24
figure 24

Confusion Matrix obtained by using the GMM model

Fig. 25
figure 25

ROC curves for the three classes obtained by using the GMM algorithm

Comparison among the unsupervised classifiers for the 2-D data

A direct comparison among the three unsupervised classifiers, in terms of global Accuracy, f1 and Cohen’s K, by using the 2-D dataset is reported in Fig. 26. It is possible to see that Kmeans and GMM perform quite similar in terms of all the three considered global indices. In particular, while they outperform the FCM model in terms of Accuracy and Cohen’s K, the FCM performs better in terms of global f1.

Fig. 26
figure 26

Comparison among the unsupervised classifiers in terms of global Accuracy, f1 and Cohen’s K, by using the 2-D dataset

Fig. 27
figure 27

Comparison among the unsupervised classifiers in terms of f1 for each individual class by using the 2-D dataset

The performance indices for each single class are shown in Fig. 27. The figure shows that the f1 score for class Q is quite high regardless of the clustering algorithm and so is the Accuracy. For class S both the f1 and Accuracy indices are low. For class P, the f1 index is about 0.7 regardless of the clustering method. Furthermore, it should be noted that the Cohen K index for class Q is very low, unlike the Accuracy index, thus indicating that the statistic on class Q is not very reliable. On the contrary, the statistic on class P is very reliable.

Models in 1-D and comparison with 2-D

Operating in 1-D, i.e. training the models by using the average seismic tremor, estimated model parameters are reported in Tables 6 and 7 for the K-means and FCM model, respectively. The parameters of the fitted GMM model are reported in Table 8.

Table 6 Cluster centers using the Kmeans and the 1-D dataset
Table 7 Cluster centers using the FCM and the 1-D dataset
Table 8 Parameters of the fitted GMM model and the 1-D dataset

A comparison of the performance, in terms of f1 and AuC, Accuracy and Cohen’s K indices, between models fitted by using the 1-D and 2-D dataset, is reported in Figs. 2829, As a general comment it is easy to say that there are not meaningful advantage in performing the unsupervised clustering operating on the 2-D with respect to the 1-D dataset. This result which may appear trivial, is essentially due to the fact that the seismic tremor timeseries at different stations are strongly correlated, as discussed in the previous Section “Regression analysis of tremor time series”.

Fig. 28
figure 28

Performance in terms of the f1 index by using the 2-D and the 1-D dataset

Fig. 29
figure 29

Comparison in terms of AuCindex by using the 2-D and the 1-D dataset

In more details, Fig. 28 shows that the f1 index, averaged among the methods, is about 0.88 for the Q class, about 0.25 for the S class and about 0.66 for the P class. So it might seem that in terms of f1, both class Q and class P are well classified. In reality, if we look at the AuC index reported in Fig. 29, we realize that class Q has an Auc index of 0.58 while class P has an AuC of 0.84. Bearing in mind that a random classifier has an AuC = 0.5, we can conclude that while the value of f1 = 0.88 is not representative of the true performance of the classifier with respect to class Q, the value average value of f1 = 0.66 for class P is representative of the true capabilities of the classifier. In a few words we can conclude that the unique class reliably discriminated is class P.

Another result that could be interesting to observe is that the value of Auc for class P, estimated in this paper, could agree with that obtained by Spampinato et al. (2019), operating with different methodologies on the same area, who computed a value \( AuC \sim 0.8\).

Measuring the performance of the FCM classifier assuming as true the output of the ENSE supervised classifier

In this section we report a comparison between one of the best among the unsupervised classification algorithms, in our case the FCM operating on the 1-D dataset, and one of the best supervised classifiers presented in the previous section. For this comparison we have chosen the ENSE classifier. In order to perform this comparison, we have computed a CM, assuming as target the time series obtained by the ENSE classifier and as output the one provided by the FCM. Such a CM is shown in Fig. 30. It is possible to see that:

  • Supervised and unsupervised algorithms almost fully agree in classifying samples of class Q, being TPR% = 100 and PPV % = 91.6.

  • The two kinds of classifiers almost fully agree also for the P class, being TPR% = 100 and PPV % = 97.6.

  • Supervised and unsupervised algorithms do not fully agree with samples of S class, exhibiting PPV = 100% but TPR = 18.5%.

This experimentally demonstrates that the unsupervised classification of RMS tremor can effectively replace the supervised one, limited to the clustering of paroxysm activity. In other terms, it is possible to reach the same level of performances, avoiding the laborious commitment of the experts or at least they could take the results provided the unsupervised classifier as a good starting point for further developments. Furthermore, without significant loss of performance, it is possible to carry out the classification on the 1-D dataset, that is, using the average RMS tremor time series, obtained using the expression (2).

Fig. 30
figure 30

Confusion Matrix obtained assuming as target the output of the ENSE supervised classifier and as output class the output of the FCM clustering algorithm

Comparing the true and the unsupervised time series of class

The following discussion aims to show that the misclassified samples occur during the transition phases from quiet to paroxysm and vice versa. Indeed, the 706 samples classified by the experts as belonging to the P class, form actually 48 paroxysmal events (or patterns). A few of these episodes are shown in Fig. 31. The time series of classes provided by the unsupervised classifier is reported in yellow color, the average RMS tremor is reported in red color, and the expert class in blue. The figures also show that paroxysm is always preceded by Strombolian and usually also followed. It is also possible to see that the yellow and the blue curves usually agree during paroxysm, while the major differences can be observed during the Strombolian activity. It is also interesting to observe the close correlation between the average level of the RMS tremor (red curve) and the time series of class provided by the unsupervised classifier (yellow curve). Therefore, the greatest uncertainties concern the classification of the S episodes, for which the average level of the RMS tremor is not particularly different from the level it assumes during the activity classified as Q.

Fig. 31
figure 31

Comparison among the true (blue color) and clustered (yellow) time series of classes for a few paroxysm episodes; the red curve is the average RMS tremor

Discussion and Conclusions

In this work we have addressed both the problem of supervised and unsupervised clustering of volcanic activity, based on seismic tremor, a kind of geophysical signal widely available in many volcanoes around the world, in order to objectively evaluate to what extent popular clustering algorithms can automate this task, usually performed by human experts. Results achieved can be summarized as follows:

  • All the considered classification algorithms, both supervised and non, agree that by using the RMS of tremor as the only feature, only events belonging to class P can be classified with a reliability that, measured in terms of the f1 index, is about 0.7 and in terms of AuC about 0.8.

  • Almost all the supervised classifiers considered are able to perform this task of classifying the class P events, but a limited superiority can be assigned to ENSE, KNN and TREE models. Similarly, among the considered unsupervised classifiers a limited superiority can be assigned to the FCM one.

  • The greatest uncertainties of classifiers concern the attribution of events belonging to class S. In this regard, the RMS level of tremor alone is not enough to safely discriminate whether these samples are to be attributed to class Q or to class S.

As a final remark, it is to stress that the expert classification is not necessarily based exclusively on the evaluation of physical signals and the dating of the events can therefore be imprecise. Therefore, a basic problem arises, concerning the establishment of objective criteria that experts must consider to classify volcanic activity. However, this problem is beyond the scope of this paper and is left for future developments.