Introduction

One longstanding hypothesis investigated in the scientific community is that damage mechanisms in multi-phase structural materials, such as composites, can be identified directly from the strain waves, or acoustic emission (AE), they produce [1,2,3,4]. Developing this capability has wide-reaching ramifications for lifetime prediction investigations and in operando monitoring of advanced structural materials. It would allow researchers to augment damage triangulation [5, 6], lifetime prediction [7], and high-resolution optical studies [8, 9] with complementary mechanism-informed data streams.

However, directly mapping a waveform to its source mechanism is non-trivial. In a single experiment, difficult-to-capture factors such as transducer contact, specimen geometry, and loading configuration all influence the waveform structure. Because of these effects, it is infeasible to implement bottom-up approaches in which the measured waveform is directly compared to waveforms generated from computational models [10,11,12,13,14]. While these experimental factors are often difficult to capture in physics-based models, the effect they have on an acoustic signal as it travels from source to sensor is constant. It is therefore more effective to group waveforms according to their differences and assign mechanisms to groups. As a consequence, unsupervised machine learning (ML) methods (i.e., pattern recognition techniques), which classify signals based on differences in signal structure, are an effective strategy for damage mechanism identification; many such frameworks have been developed over the last two decades [15,16,17,18,19,20,21].

A general inspection of these frameworks yields an important observation: there is no ground truth dataset, wherein mechanisms have been directly assigned to each individual signal, which is suitable for benchmarking performance of an AE-ML framework [22, 23]. Previous studies have attempted to create ground truth datasets using a variety of strategies, for example by designing the loading configurations and sample geometry to promote 1–2 damage mechanisms [15, 19, 24]. Generally, such strategies produce datasets which are not usable for quantitative benchmarking because the ground truth is still unknown. For example, in the absence of visual confirmation, it is possible that geometries designed to promote fiber failure in composites may contain numerous signals from other mechanisms, such as interfacial damage, as well [25,26,27,28].

Because the datasets described above are unsuitable for benchmarking accuracy, indirect measures of framework performance have been employed. Metrics that have been studied include the tendency to fall into a local minimum, compactness of clusters, and how well average cluster characteristics match previous literature [19, 29,30,31]. However, these are poor substitutes because they measure neither accuracy nor discriminating power [32,33,34]. Therefore, there is a need for datasets and methodologies that can be used for the standardized, quantitative assessment of AE-ML frameworks.

Toward the goal of generating datasets which can be used to assess discriminating power, pencil lead breaks (PLBs) offer a powerful solution. PLBs emit signals whose frequency content can be controlled by varying the angle of incidence, \(\theta \) [5, 35, 36]. Incremental increases \(\Delta \theta \) to the angle of incidence result in incremental increases to the low-frequency (flexural wave) component of the AE signal.

Since a signal's structure is uniquely determined by its frequency content, sets composed of signals generated at angles \(\theta \) and \(\theta +\Delta \theta \) can be used to quantitatively evaluate the ability of a framework to group signals according to their emitting source. Frameworks that can accurately distinguish between signals generated from \(\theta \) and \(\theta +\Delta \theta \), when \(\Delta \theta \) is small, have higher discriminating power than frameworks which cannot. Therefore, a dataset composed of signals generated from known values of \(\theta \) can be used to quantitatively assess the discriminating power of AE-ML frameworks, investigate how specific changes to a framework impact its discriminating power, and guide decisions on improvement.

In this work, an AE dataset composed of PLB acoustic sources was generated at one reference angle \(\theta _0\) and five benchmarking angles \(\theta _b\). Five ML frameworks from the literature were applied to this dataset, and their performance was assessed. We investigated how feature choice impacts framework discriminating power and found that discriminating power increases when only frequency-domain features are used. Moreover, we show that for discriminating between different PLBs, the choice of ML algorithm was unimportant and a framework’s performance could be attributed primarily to its feature set. Finally, we propose a set of guidelines for standardized benchmarking of AE-ML frameworks, strategies for identifying salient features, and future benchmarking procedures.

Materials and Methods

Data Collection

All pencil lead breaks (PLBs) were conducted with Pentel 0.5 mm HB leads and a nominal free lead length of 4 mm [36]. A Pentel GraphGear 500 mechanical drafting pencil was fixed to a custom-built, displacement-controlled load frame (Fig. 1). The load frame was composed of a rotational stage, which allowed for angle adjustments in increments of \(2^\circ \) (the corresponding manual angle measurement error is \(\frac{1}{2}\) the unit of measurement, or \(\pm 1^{\circ }\)), and two precision-adjust linear stages. The aluminum plate on which the PLBs were conducted had an unsupported span of 200.7 mm, width of 51.0 mm, and thickness of 1.2 mm.

Fig. 1

Photograph of the experimental setup. A mechanical pencil is attached to a rotational stage which controls the angle of incidence \(\theta \). The linear X stage is used to position the tip of the free lead at a consistent location on the aluminum plate. The linear Z stage is used to lower the pencil lead until fracture. The resultant waveform is recorded by the piezoelectric B1025 transducer, located approximately 25 mm from the tip of the free lead

Fig. 2

Workflow diagram of an AE-ML framework. a Waveforms are collected and b pertinent features are extracted from the waveforms, which are then represented as vectors in feature space. c Feature vectors can then be re-scaled and/or re-mapped before d the clustering algorithm is applied and feature vectors are labeled. Every AE-ML framework follows this procedure

Fig. 3

Fast Fourier transform (FFT) of the average signal at each angular condition (Fig. 4). As the angle of incidence increases, the low-frequency components increase in power, consistent with the findings of [35] and [36]

Fig. 4

The mean PLB signal and point-wise standard deviation at each angular condition. Signals generated using the experimental fixture shown in Fig. 1 were found to be repeatable, while still containing variation that might be expected from signals collected during in operando health monitoring

Fig. 5

The ARI of each framework as a function of \(\Delta \theta \). ARI values exceeding 0.4 correspond to good discriminating power, whereas values near 0 correspond to no discriminating power. The discriminating power of each framework increases with \(\Delta \theta \), and frameworks with high ARIs at low values of \(\Delta \theta \) are better suited for clustering signals whose differences are minor. The ability to directly compare accuracy between frameworks allows researchers to choose an appropriate framework for their specific needs

Fig. 6

The a average maximum amplitude and b average rise time of signals generated at each angle of incidence \(\theta \). Error bars correspond to 1 standard deviation. There is no consistent difference between values in either feature. Because it is possible to construct many sets of unique signals with indistinguishable amplitudes and rise times, these should not be considered salient features and should be used with caution

Fig. 7

Adjusted Rand index vs. number of signals per angle. Signals from \(\theta _0=20^\circ \) and \(\theta _b=26^\circ \) (\(\Delta \theta =6^\circ \)) were clustered using an increasing number of signals per angle. As the number of signals increased, the performance of each framework became independent of the addition of new signals, indicating enough data is present to capture stochastic waveform variations

PLBs were recorded at \(20^{\circ }\), \(22^{\circ }\), \(26^{\circ }\), \(30^{\circ }\), \(36^{\circ }\), and \(40^{\circ }\). PLBs were generated by lowering the pencil via the linear Z stage until the lead fractured on the aluminum plate. For each angular condition, the rotational stage was fixed using a set screw, and the set screw was loosened only to change angles. During an angle change, the linear X stage was adjusted to maintain a nominal distance of 25 mm from the PLB source to the sensor. Upon inspection of the collected AE signals, some were found to be reflections. These reflections presented themselves as a second, low-amplitude signal occurring immediately after the initial PLB, and were excluded from the data set. Due to the exclusion of these reflections, each angle has a differing number of signals, between 75 and 111. For the purposes of this analysis, only the first 75 signals of each angle were clustered by an AE-ML framework.

AE was recorded using a piezoelectric B1025 transducer (Digital Wave Corporation, Centennial, CO) with a broadband response of 50–2000 kHz (Fig. 1). The threshold voltage was set to 0.1 V, the number of pre-trigger points was set to 256, and the total length of signal captured was 1024 points at a rate of 10 MHz. The sensor was fixed to the aluminum plate with an alligator clamp using vacuum grease as a coupling agent. The sensor was not remounted at any point during the experiment, meaning PLBs at all angles were conducted with a fixed sensor coupling. The authors note that because the coupling is unchanging during all data collection, an absolute calibration of the sensor as described in [37] is not necessary. Additionally, all signals were collected within a single 3 h span in a temperature-controlled laboratory where environmental factors which might otherwise affect the absolute sensor calibration, such as temperature, were assumed to be unchanging. Unsupervised frameworks group signals according to differences in signal features, rather than the absolute values of those features. Since the absolute sensor calibration is unchanging, any differences in signal features can be attributed to changes in the angle of incidence.

Signals collected at the reference angle \(\theta _0=20^{\circ }\) and signals at a single benchmarking angle \(\theta _b \in \{22^\circ , 26^\circ , 30^\circ , 36^\circ , 40^\circ \}\) were clustered using each of the frameworks described in Sect. Framework Descriptions and Accuracy Metrics, and relative discriminating power was assessed quantitatively using the procedure described below.

Quantitative Benchmarking

The permutation model of the adjusted Rand index (ARI) was used to benchmark frameworks. The ARI measures the accuracy of ML-calculated clusters as compared to the ground truth in a label-agnostic way. It compares the membership similarity of objects in the ML-calculated clustering, A, to the membership similarity of objects in the ground truth clustering, B, and assigns a higher number if similarities are high [38]. In the context of this work, signals from \(\theta _0\) and \(\theta _b\) are fed to an AE-ML framework. These signals are then assigned a label by the framework, either 0 or 1, depending on whether the framework groups the signal with \(\theta _0\) or \(\theta _b\). The ML-assigned label of each signal is then compared to the ground truth, the known angle at which the signal was collected. If the ML-assigned labels and the true angles have similar memberships, the ARI takes on higher values.

The ARI is an adjusted-for-chance version of the Rand index (RI) and is calculated as [38]:

$$\begin{aligned} RI(A,B)=\frac{N_{11}+N_{00}}{\left( {\begin{array}{c}N\\ 2\end{array}}\right) } \end{aligned}$$
(1)

where N is the number of signals, \(N_{11}\) is the number of signal pairs which are grouped into the same cluster in A and B, and \(N_{00}\) is the number of signal pairs that are grouped into different clusters in both A and B. The ARI can then be calculated as [39, 40]:

$$\begin{aligned} ARI(A,B)=\frac{RI(A,B)-\mathbb {E}[RI(A,B)]}{\max [RI(A,B)]-\mathbb {E}[RI(A,B)]} \end{aligned}$$
(2)

where \(\mathbb {E}[RI(A,B)]\) is the expected value of the RI under a random model. The ARI takes a maximum value of 1 for perfectly matching labels, while values near 0 correspond to random label assignments (slightly negative values are possible when agreement is below chance). While many cluster similarity metrics exist, the ARI was chosen to compare partitions because it can be calculated for any number of clusters (provided the number of clusters in each partition is equal), and it accommodates unbalanced cluster sizes [40, 41]. Signals from \(\theta _0\) and \(\theta _b\) were clustered by each framework. The value of \(\Delta \theta =\theta _b-\theta _0\) at which the ARI vanishes represents the point at which the framework has lost all discriminating power and is unable to identify differences between two groups of different signals.
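
For concreteness, this benchmarking step can be reproduced with Scikit-learn's implementation of the ARI. The sketch below is a minimal example with hypothetical label arrays (the counts of mislabeled signals are illustrative only):

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

# Hypothetical example: 75 signals per angle. Cluster IDs are arbitrary;
# the ARI is label-agnostic, so swapping 0 and 1 changes nothing.
true_labels = np.array([0] * 75 + [1] * 75)   # known collection angles
ml_labels = np.array([1] * 70 + [0] * 5       # labels assigned by a framework,
                     + [0] * 72 + [1] * 3)    # with a few mislabeled signals

print(adjusted_rand_score(true_labels, ml_labels))  # near 1: high agreement
```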

Framework Descriptions and Accuracy Metrics

A general AE-ML framework can be described by the workflow in Fig. 2. Following data collection (Fig. 2a), the most important step in the framework is the selection of the feature set (Fig. 2b). Waveforms can only be sorted according to their source mechanism if the feature set captures something fundamental about the waveform-mechanism relationship. Features can be classified as belonging to the time domain, frequency domain, or time-frequency domain. However, there is little consensus as to which category is best suited for damage mechanism identification. In fact, even when two frameworks leverage features within the same domain, their feature sets differ. Consequently, each framework uses a unique feature set, where d pertinent features are identified, extracted, and stored as a feature vector \(\textbf{v}\in \mathbb {R}^d\) (Fig. 2b). The reader is referred to the original investigations for discussions of why particular feature sets were chosen [4, 17, 20, 21, 29].

Next, individual features of a feature vector may be re-scaled or re-mapped with a transformation (Fig. 2c). Similar to the variations in feature sets, each framework utilizes a different set of pre-processing steps. Finally, the ML algorithm is applied to partition feature vectors by assigning them to clusters, where feature vectors in the same cluster are proximal under a chosen distance metric (Fig. 2d).

The frameworks described in the following sub-sections follow this workflow and were adopted directly from the literature. They were chosen to span the current space of diverse feature set types and ML algorithms [22]. The key differences between frameworks are their feature sets, pre-processing steps, and ML algorithms, which are summarized in Table 1. In Sect. Results and Discussion, we provide key findings and discuss the impact of feature selection.

Table 1 Investigated framework summaries

Base Framework

We define a Base framework against which subsequent frameworks, formed by swapping out the feature set, the ML algorithm, or both, are compared. This framework employs a time-domain feature set as investigated in [17]:

1. average frequency (number of counts/signal length)
2. rise frequency (average frequency from signal start to maximum amplitude)
3. ln(energy)
4. ln(rise time/duration)
5. ln(amplitude/rise time)
6. ln(amplitude/decay time)
7. ln(amplitude/average frequency)

The start and end times of an experimental signal were determined by the first and last crossings of a floating 10% voltage threshold.
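
A minimal sketch of one plausible reading of this floating threshold is given below; the exact onset-picking logic of the acquisition software is not specified here, so the helper name and implementation details are assumptions:

```python
import numpy as np

def signal_bounds(v, frac=0.10):
    # Floating threshold: frac (10%) of the signal's maximum absolute voltage.
    thresh = frac * np.max(np.abs(v))
    crossings = np.flatnonzero(np.abs(v) >= thresh)
    return crossings[0], crossings[-1]  # indices of first and last crossings

# start, end = signal_bounds(waveform)  # waveform: 1024-point voltage record
# duration = (end - start) / 10e6       # seconds, at the 10 MHz sampling rate
```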

Each feature was scaled by the maximum observed absolute value of that feature, using the MaxAbsScaler method of [42], such that it fell in the range [-1, 1]. A principal component analysis (PCA) transformation was performed, and principal components containing \(\ge 95\%\) of the variance were retained. Distances, d, between any two feature vectors, \(\textbf{x}, \textbf{y}\), were calculated using a modified Euclidean metric:

$$\begin{aligned} d(\textbf{x}, \textbf{y})=\sqrt{\sum _i\lambda _i(x_i-y_i)^2} \end{aligned}$$
(3)

where \(x_i\) and \(y_i\) are the ith vector components of the feature vectors in the PCA basis, and \(\lambda _i\) is the eigenvalue of the ith PCA axis. As the Scikit-learn implementation of k-means enforces the standard Euclidean metric, a rescaling of feature vectors was required to accommodate the modified Euclidean metric (a proof is given in Appendix 6):

$$\begin{aligned} x_i'=\sqrt{\lambda _i}x_i \end{aligned}$$
(4)

It should be noted that the distance metric in Eq. 3 differs from the standard PCA whitening approach, where distances along axes with large eigenvalues are contracted, rather than elongated [33].

K-means was then applied to the feature vectors. For a detailed description of the k-means algorithm, the reader is referred to [33]. Because k-means is not guaranteed to converge to an optimal solution, it is typically run many times and the initialization with the lowest value of the loss function is taken [43]. To determine the number of re-initializations needed, convergence checks were performed by increasing the number of re-initializations until the loss remained unchanged. The minimum objective function did not change after \(2\times 10^3\) re-initializations. To conservatively ensure a global optimum of the objective function had been reached, the number of re-initializations was set to \(2\times 10^4\). Similarly, an error tolerance of 0.0001 and a maximum of 300 iterations were sufficient to ensure a local optimum was reached within a single initialization of k-means.
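
As an illustration, the complete Base pipeline can be sketched with Scikit-learn. The helper name and the two-cluster setting are ours; the steps follow the description above (MaxAbsScaler, PCA retaining \(\ge 95\%\) of the variance, the Eq. 4 rescaling, and heavily re-initialized k-means):

```python
import numpy as np
from sklearn.preprocessing import MaxAbsScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

def base_framework_labels(X, n_clusters=2):
    # X: (n_signals x 7) matrix of the time-domain features listed above.
    X = MaxAbsScaler().fit_transform(X)        # scale each feature into [-1, 1]
    pca = PCA(n_components=0.95)               # keep >= 95% of the variance
    Z = pca.fit_transform(X)
    Z = Z * np.sqrt(pca.explained_variance_)   # x_i' = sqrt(lambda_i) x_i (Eq. 4)
    km = KMeans(n_clusters=n_clusters, n_init=20000, max_iter=300, tol=1e-4)
    return km.fit_predict(Z)
```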

Agglomerative Framework

The Agglomerative framework [29] used the feature set:

1. amplitude
2. peak frequency

Rather than partitioning feature vectors by k-means as in the Base framework, the Agglomerative framework uses a hierarchical agglomerative approach. In this approach, each data point is initially defined as a cluster. Clusters are then iteratively merged such that the chosen objective function (usually the sum of squared distances) is minimized. For a discussion of this algorithm, the reader is referred to [29, 42].

The linkage type, the criterion defining distances between clusters, was not reported in the original work. Here, each linkage type was tested, and no linkage type was found to outperform another.
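
A minimal Scikit-learn sketch of this framework, exposing the unreported linkage as a parameter (the function name is ours):

```python
from sklearn.cluster import AgglomerativeClustering

def agglomerative_labels(X, linkage="ward", n_clusters=2):
    # X: (n_signals x 2) matrix of [amplitude, peak frequency] features.
    # linkage may be "ward", "complete", "average", or "single"; here no
    # choice was found to consistently outperform the others.
    return AgglomerativeClustering(n_clusters=n_clusters,
                                   linkage=linkage).fit_predict(X)
```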

Spectral Framework

The Spectral framework [21] used the partial power feature set. The ith component of the feature vector is as follows:

$$\begin{aligned} x_i=\frac{\int _{k_{i-1}}^{k_i}F[s(t)]dk}{\int _{k_{0}}^{k_d}F[s(t)]dk} \end{aligned}$$
(5)

where \(F[\cdot ]\) is the Fourier transform operator, s(t) is the recorded signal, \(k_i\) and \(k_{i-1}\) are the frequency bounds over which integration is performed, and d is the number of entries in the feature vector. We set \(k_0=200\) kHz, \(k_d=800\) kHz, and \(d=23\). The width of the integration bands, \(k_i-k_{i-1}\), was set to be equal for all i as in [21].

The Scikit-learn implementation of spectral clustering was used to cluster the feature vectors [42]. A detailed explanation of the algorithm can be found in [44]. The ARPACK eigensolver was used and the number of nearest neighbors was set to \(NN=5\). To ensure cluster membership did not depend on initialization parameters, convergence checks were performed for the error tolerance and maximum number of iterations. The cluster membership was found to stabilize after 10 re-initializations. To conservatively ensure stability, the number of re-initializations was set to 100.
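
A sketch of the partial power extraction (Eq. 5) and clustering step is given below. Discretizing the integrals as sums over FFT bins, and using the FFT magnitude for \(F[s(t)]\), are our assumptions:

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def partial_powers(signal, fs=10e6, k0=200e3, kd=800e3, d=23):
    # Partial power in d equal-width bands between k0 and kd (Eq. 5),
    # approximated by summing FFT magnitudes over each band.
    spectrum = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), 1.0 / fs)
    edges = np.linspace(k0, kd, d + 1)
    total = spectrum[(freqs >= k0) & (freqs < kd)].sum()
    return np.array([spectrum[(freqs >= lo) & (freqs < hi)].sum()
                     for lo, hi in zip(edges[:-1], edges[1:])]) / total

# X = np.vstack([partial_powers(s) for s in signals])
# labels = SpectralClustering(n_clusters=2, affinity="nearest_neighbors",
#                             n_neighbors=5, eigen_solver="arpack",
#                             n_init=100).fit_predict(X)
```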

Frequency Framework

The Frequency framework used a feature set in the frequency domain [4]:

1. average frequency
2. reverberation frequency
3. rise frequency
4. peak frequency
5. frequency centroid
6. weighted peak frequency
7. partial powers from 0–150 kHz, 150–300 kHz, 450–600 kHz, 600–900 kHz, and 900–1200 kHz

Features were independently normalized with the variance scaler, which centers each feature to zero mean and scales it to unit variance. Feature vectors were then clustered with k-means. The same convergence checks as for the Base framework were conducted, and identical parameters were found to be sufficient for convergence.
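
A minimal sketch, interpreting the variance scaler as Scikit-learn's StandardScaler (the helper name is ours):

```python
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

def frequency_framework_labels(X, n_clusters=2):
    # X: (n_signals x 11) matrix of the frequency-domain features above.
    Xs = StandardScaler().fit_transform(X)  # zero mean, unit variance per feature
    return KMeans(n_clusters=n_clusters, n_init=20000,
                  max_iter=300, tol=1e-4).fit_predict(Xs)
```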

WPT Framework

The wavelet packet transform (WPT) framework extracted features through application of a WPT [20]. Waveforms were subjected to a three-level WPT using the Daubechies wavelet of order 2 as the mother wavelet. The fractional energies carried in each node were calculated, and the five least correlated values were retained. These, in addition to the waveform energies read by the AE software, were used as features. Feature vectors were then normalized with the maximum value scaler and subjected to PCA. Principal components containing \(\ge 95\%\) of the variance were retained. The feature vectors were then partitioned via k-means, using the modified Euclidean metric (Eq. 3).

Convergence checks were conducted, and parameters identical to the Base framework were found to be sufficient for convergence. It should be noted that the algorithm described by [30] and used by [20] is k-means optimized by a genetic algorithm; a fully converged k-means solution will not differ from a fully converged genetic solution.
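
The node-energy extraction can be sketched with the PyWavelets package; the dataset-level selection of the five least correlated energies is indicated only in comments:

```python
import numpy as np
import pywt  # PyWavelets

def wpt_fractional_energies(signal, wavelet="db2", level=3):
    # Three-level wavelet packet transform with the Daubechies-2 mother
    # wavelet; level 3 yields 8 terminal nodes.
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    energies = np.array([np.sum(node.data ** 2)
                         for node in wp.get_level(level, order="natural")])
    return energies / energies.sum()  # fractional energy per node

# Across the dataset, retain the five node energies with the lowest pairwise
# correlation, append the AE-software waveform energies, and then apply the
# MaxAbsScaler + PCA + k-means steps of the Base framework.
```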

Results and Discussion

The frequency content of PLB signals from our experimental configuration could be precisely controlled by varying the angle of incidence. The signals generated were found to follow expectations from plate-wave theory [35, 36]; as the angle of incidence increased, the low-frequency components of the signal increased in power (Fig. 3). Moreover, little variation in PLB signals was observed within each angular condition. The mean signal and its standard deviation envelope are shown in Fig. 4. As a result of the small variation between signals for a single angular condition, when signals are represented in feature space (Fig. 2b), the standard deviation of those features is smaller than if the signals had large variability. If a benchmark set were constructed from more repeatable acoustic signals, it would be expected that signals at different angles would form tighter clusters in feature space, and subsequently the ARI of each framework at each value of \(\Delta \theta \) would increase.

By comparing ARI values at a fixed value of \(\Delta \theta \), it is possible to quantitatively evaluate the relative discriminating power of various frameworks. The frameworks listed in Table 1 were applied to group signals according to the procedure described in Sect. Quantitative Benchmarking. The accuracy of each framework was plotted as a function of \(\Delta \theta \) (Fig. 5). The discriminating power of each framework increased with \(\Delta \theta \), which can be attributed to increasing differences in the signal structure as a function of \(\Delta \theta \). Frameworks exhibiting higher ARI values at lower values of \(\Delta \theta \), such as the Spectral and Frequency frameworks, have higher overall discriminating power and will likely be able to distinguish between damage mechanisms that emit similar signals.

Saliency of Features

Between the five frameworks, no single feature set was used. Although this is common in the broader context of AE-ML frameworks [22, 23], the lack of consensus raises an important question: “What features should be used for the purposes of AE signal discrimination?” Addressing this question is of utmost importance, since the discriminating power of a framework hinges on how signals are represented [22, 45].

For signal discrimination, both the exclusion of noisy features and the inclusion of useful features are necessary: a principle known as the Ugly Duckling theorem [45]. To highlight the degree to which this principle impacts discriminating power, features were parametrically excluded from the Frequency framework and Base framework, as sketched below. In the Frequency framework, the ARI was maximized when clustering was performed using average frequency, rise frequency, and partial power from 150–300 kHz. When these three optimal features were used to encode signals, the ARI at \(\Delta \theta =2^{\circ }\) increased from 0.681 to 0.973, representing a change from modest to high discriminating power. A similar procedure was conducted for the Base framework, and when only the average frequency and ln(amplitude/average frequency) were included, the ARI increased from 0.325 to 0.82.
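
Such a parametric exclusion amounts to an exhaustive search over feature subsets scored by the ARI. A minimal sketch is given below; cluster_fn stands in for any of the framework pipelines, and the helper name is ours:

```python
from itertools import combinations
from sklearn.metrics import adjusted_rand_score

def best_feature_subset(X, true_labels, cluster_fn, names):
    # Score every non-empty feature subset by the ARI its clustering achieves.
    best_ari, best_names = 0.0, None
    for r in range(1, len(names) + 1):
        for idx in combinations(range(len(names)), r):
            ari = adjusted_rand_score(true_labels, cluster_fn(X[:, list(idx)]))
            if ari > best_ari:
                best_ari, best_names = ari, [names[i] for i in idx]
    return best_ari, best_names
```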

While such parametric studies can yield insight into which features are useful for a specific dataset, they are less effective at identifying universally salient features. For this purpose, it is necessary to consider the physics of the emitting source on a case-by-case basis and, when possible, exclude features that depend on external and uncontrollable factors unrelated to the source mechanism. For example, although amplitude is correlated with the angle in this dataset (Fig. 6a), it should not be used for sorting signals from multi-phase materials because it is convolved with factors, such as the crack area formed and the source-to-sensor distance, which are unrelated to individual mechanisms [46]. Similarly, even though rise time is commonly used in AE-ML frameworks [18, 19, 30, 47,48,49], it is more strongly related to the source-to-sensor distance, due to the different velocities of the flexural and extensional wave components [5]. Consequently, two signals emitted from similar locations will have similar rise times, even if the emitting mechanisms are different (Fig. 6b).

Limitations of the PLB Dataset

Intuitively, a framework with higher ARI values is a promising candidate for damage mechanism identification, when signals from different mechanisms are expected to be similar. However, the degree to which performance on the PLB dataset translates to performance under more realistic conditions and material systems is unknown. Specifically, the PLB signals in this dataset are collected under the strictest possible conditions; signals are from a single source-to-sensor distance, sensor coupling, and source type (e.g., pencil lead), removing the effect of factors that influence a signal such as dispersion, attenuation, and absolute frequency response. ML approaches for mechanism identification must ultimately be robust to these effects. Although this dataset represents a first step toward quantitative benchmarking, a full characterization of framework performance under realistic conditions is still required.

Another limitation of the dataset we have collected is the angular resolution; the \(\pm 1^\circ \) tolerance of the rotational stage has implications for the measured ARI. For example, if the true \(\Delta \theta \) between two angular conditions were less than the reported value, due to the \(\pm 1^\circ \) tolerance, signals generated at these angles would be more similar than expected. Consequently, the measured ARI would be lower than if the signals had been collected from a true angular condition with a larger value of \(\Delta \theta \). The exact degree to which the ARI would change is highly dependent on how each feature varies with \(\theta \), and is also subject to any data-dependent pre-processing, such as PCA, which would further impact framework performance.

Finally, also due to the angular resolution, the current experimental setup prevents collecting signals at values of \(\Delta \theta < 2^\circ \). In order to properly benchmark frameworks, there must be at least one value of \(\Delta \theta \) where ARIs are not saturated at 1. For example, at \(\Delta \theta =20^\circ \), the Spectral, Frequency, and Base frameworks all perform equally well, but \(\Delta \theta =6^\circ \) allows for comparison of discriminating power (Fig. 5). As the community continues to improve the discriminating power of frameworks, ARI values will increase. Consequently, it will become necessary to collect signals at values of \(\Delta \theta <2^\circ \), below what we have allowed for in this study, to ensure frameworks’ performances can be separated.

Guidelines and Conclusions

While many AE-ML frameworks have been developed and implemented, the lack of ground truth datasets has restricted discussions of their strengths and limitations, particularly with respect to feature choice, and has prevented the development of standardized quantitative benchmarking procedures. In this section, considerations for the quantity of data in benchmarking sets, the types of features that should be included in a framework, and transparent benchmarking practices are discussed.

The performance of an unsupervised framework is intrinsically tied to how well the sampled data represent the population distribution. In the context of AE-ML benchmarking datasets, it is critical to ensure enough signals have been collected to capture statistical variations. If too few waveforms are collected at any angle, it is unlikely that the sampling distribution will represent the population distribution of waveform features (Fig. 2b). Consequently, the addition of new waveforms will lead to spurious changes in the performance of an AE-ML framework. To ensure enough signals are in a benchmarking set, a framework’s performance must be shown to be independent of the number of signals collected. For this benchmarking set, it is demonstrated that 75 signals per angle are sufficient to ensure the ARI values we calculate are independent of the amount of data (Fig. 7).
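
This data-sufficiency check amounts to recomputing the ARI with progressively more signals per angle until the score stabilizes (cf. Fig. 7). A minimal sketch, with cluster_fn again standing in for any framework pipeline:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def ari_vs_sample_size(X0, Xb, cluster_fn, step=5):
    # X0, Xb: feature matrices for the reference and benchmarking angles.
    results = []
    for n in range(step, min(len(X0), len(Xb)) + 1, step):
        X = np.vstack([X0[:n], Xb[:n]])
        truth = np.array([0] * n + [1] * n)
        results.append((n, adjusted_rand_score(truth, cluster_fn(X))))
    return results  # the ARI should plateau once n captures signal variability
```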

Feature selection is of critical importance with respect to maximizing the discriminating power. As demonstrated in Sect. Saliency of Features, the inclusion of non-salient features was directly correlated with poor framework performance. Despite the importance of feature selection, there is little discussion within the literature as to why certain features were chosen [22]. As a result, many modern frameworks continue to include non-salient features (e.g., rise time) which negatively impact framework performance.

Toward better feature selection, universal features should be prioritized, and when possible, the choice of feature set should be explicitly motivated. If it is possible to construct cases where a given feature cannot reliably discriminate between signals emitted by two unique sources, then the feature is likely convolved with factors unrelated to the source mechanism and is therefore not universal. The use of such non-universal features must be treated with caution. For example, although small amplitudes and large rise times have been correlated with Mode II cracks, these features are not universal because they are also a strong function of the source-to-sensor distance [50]. This makes it possible to construct a dataset where unique signals appear artificially similar, with little to no statistical difference between features (Fig. 6).

Although universally salient features will not change between material systems or loading configurations, the values of the features might vary. For example, partial power appears to be a universally salient feature [4, 21, 47], but not every frequency band provides equal discriminating power. As demonstrated by the parametric removal of Frequency framework features, the frequency band from 150–300 kHz was the most useful for discriminating between two PLB signals; however, different frequency bands will be useful when the material system or loading conditions change [12, 26].

Finally, publicly available standardized datasets should be used for quantitative benchmarking of frameworks. Although these types of benchmarking tools are common in other fields [51,52,53,54], they are absent in the AE community. Development and continued maintenance of these types of datasets will provide the tools necessary to assess the strengths and limitations of AE-ML frameworks and will allow for detailed discussions regarding the specific strengths and weaknesses of individual frameworks. In turn, this will provide transparency and trust in the results obtained from such frameworks, promoting their broader use in both scientific and engineering applications.