1 Introduction

Two of the most common forms of arterial disease are stenosis, narrowing of an arterial vessel, and aneurysm, an increase in the area of a vessel. They are estimated to affect between 1 and 20% of the population (Fowkes et al. 2013; Shadman et al. 2004; Mathiesen et al. 2001; Li et al. 2013), and ruptured abdominal aortic aneurysms alone are estimated to cause 6000–8000 deaths per year in the United Kingdom (Darwood et al. 2012). Current methods for the detection of arterial disease are primarily based on direct imaging of the vessels, which can be expensive and hence prohibitive for large-scale screening. If arterial disease can be detected by easily acquirable pressure and flow-rate measurements at select locations within the arterial network, then large-scale screening may be facilitated.

It is likely that the indicative biomarkers of arterial disease in the pressure and flow-rate profiles consist of micro inter- and intra-measurement details. In the past, detection of arterial disease has been proposed through the analysis of waveforms in combination with mathematical models of pulse wave propagation, see for example Sazonov et al. (2017), Stergiopulos et al. (1992). This, however, requires specification or identification of patient-specific network parameters, which is not easy to perform, especially at large scales.

This study explores the use of machine learning (ML) methods for the detection of stenoses and aneurysms in order to facilitate large-scale, low-cost screening and diagnosis. A data-driven ML approach is adopted, which does not require specification of patient-specific parameters. Instead, such algorithms learn patterns and biomarkers from a labelled data set. ML has a history of use in medical applications (Kononenko 2001). Classification algorithms have been shown to predict the presence of irregularities in heart valves (Çomak et al. 2007), arrhythmia (Song et al. 2005), and sleep apnea (Khandoker et al. 2009) from recorded time domain data. Recently, a study reported the successful use of ML methods to estimate pulse wave velocity from radial pressure wave measurements (Jin et al. 2020). Automatic detection, segmentation, and classification of abdominal aortic aneurysms (AAAs) in CT images are presented in Hong and Sheikh (2016), while the growth in severity of AAAs is predicted from CT images in Jiang et al. (2020). A previous study (Chakshu et al. 2020) applied deep-learning methods to AAA classification, using a synthetic data set created by varying seven parameters; accuracies of \(\approx 99.9\%\) are reported for binary classification of AAA based on three pressure measurements. Furthermore, Wang et al. (2021) achieved a sensitivity of 86.8% and a specificity of 86.3% for early detection of AAA from photoplethysmogram pulse waves, using a synthetic data set created by finding the mean and standard deviation of six cardiovascular properties for subjects of each age decade from 55 to 75 years, and then varying each property in combination with every other by ± 1 standard deviation from its age-specific mean value. These studies motivate the application of ML to detect arterial disease, both stenoses and aneurysms, using only pressure and flow-rate measurements at select locations in the arterial network. A previous proof-of-concept study (Jones et al.
2021c) showed promising results that ML classifiers can detect stenosis in a simple three vessel arterial network using only measurements of pressures and flow-rates. Here, these ideas are extended to a significantly larger, physiologically realistic, network of the human arterial system. All the ML methods are trained and tested on the virtual healthy subject database proposed in Jones et al. (2021a), which is augmented to introduce disease into the virtual subjects.

This study is organised as follows. It begins by briefly explaining the healthy VPD proposed in Jones et al. (2021a). Modifications to this database to create four different forms of arterial disease are presented next, along with the parameterisation of disease forms. This is followed by presentation of the ML methodology and metrics used for quantification of classification accuracies. Finally, these accuracies are assessed when using different combinations of pressure and flow-rate measurements, along with the analysis of patterns and behaviours observed in the ML classifiers.

2 Methodology

The ML algorithms are trained and tested on a data set containing both healthy subjects and diseased patients.

2.1 Healthy subjects

A physiologically realistic VPD containing healthy subjects is created in Jones et al. (2021a) and forms the starting point of this study. This database is available in Jones et al. (2021b). The arterial network contains 71 vessel segments and is shown in Fig. 1, along with the locations where disease occurs with high prevalence, and where measurements of pressure and flow-rate can potentially be acquired (Jones et al. 2021a). The healthy patient database of Jones et al. (2021a) contains 28,868 VPs and is referred to as \(\text {VPD}_{\text {H}}\). Disease is introduced into these healthy arterial networks as described next.

Fig. 1
figure 1

The connectivity of the arterial network, taken from Jones et al. (2021a). The locations of the four forms of disease (see Sect. 2.2.1) and of the six pressure and flow-rate measurements (see Sect. 2.3) are highlighted

2.2 Creation of unhealthy VPDs

2.2.1 Disease forms

The four most common forms of arterial disease are carotid artery stenosis (CAS), subclavian artery stenosis (SAS), peripheral arterial disease (PAD, a form of stenosis), and abdominal aortic aneurysm (AAA) (Jones et al. 2021a; Dyken et al. 1974; Kullo and Rooke 2016; Aboyans et al. 2010; Chen et al. 2013; Li et al. 2013). Their prevalence is restricted to the following vessels and shown in Fig. 1:

  • CAS is assumed to only affect the common carotid arteries. For simplification and consistency of notation, these vessels are referred to as the carotid artery chains (\(\hbox {CA}_{{\mathbf {x}}}\)).

  • SAS is assumed to affect the first and second subclavian segments. These two chains of vessels (one on the right and left side) are referred to as the subclavian artery chains (\(\hbox {SA}_{{\mathbf {x}}}\)).

  • PAD is assumed to affect the common iliacs; external iliacs; first and second femoral segments; and the first popliteal segments. These chains are referred to as the peripheral artery chains (\(\hbox {PA}_{{\mathbf {x}}}\)).

  • AAA is assumed to affect the first to fourth abdominal aortic segments. This chain of vessels is referred to as the abdominal aortic chain (\(\hbox {AA}_{{\mathbf {x}}}\)).

It is assumed that each diseased VP has only one of the four forms of arterial disease. Four complementary databases corresponding to \(\text {VPD}_{\text {H}}\) are constructed, each pertaining to one form of arterial disease. To create the diseased VPD corresponding to CAS, referred to as \(\text {VPD}_{\text {CAS}}\), for every subject in \(\text {VPD}_{\text {H}}\), disease is introduced in \(\hbox {CA}_{\mathrm {x}}\) (i.e. the left or right carotid artery). This is achieved by taking the arterial network of a subject from \(\hbox {VPD}_{\text {H}}\), artificially introducing a stenosis in \(\hbox {CA}_{\mathrm {x}}\), and then using a one-dimensional pulse-wave propagation model—which has previously been widely employed, tested, and validated (Boileau et al. 2015; Formaggia et al. 2003; Alastruey et al. 2012; Olufsen et al. 2000; Reymond et al. 2009; Matthys et al. 2007)—to compute the pressure and flow-rate waveforms. Note that this model has also been used to study haemodynamics in both stenosis (Boileau et al. 2018; Carson et al. 2019; Jin and Alastruey 2021) and aneurysms (Sazonov et al. 2017; Chakshu et al. 2020; Jin and Alastruey 2021). The numerical implementation of the pulse-wave propagation model employed here is outlined in Jones et al. (2021a) and validated against a discontinuous Galerkin (DCG) scheme (Alastruey et al. 2012), which in turn has been successfully validated against a 3D model of blood-flow through stenosed arterial vessels (Boileau et al. 2018).

Thus, \(\text {VPD}_{\text {CAS}}\) contains 28,868 VPs with CAS. Similarly, the databases corresponding to SAS, PAD, and AAA are created, and referred to as \(\text {VPD}_{\text {SAS}}\), \(\text {VPD}_{\text {PAD}}\), and \(\text {VPD}_{\text {AAA}}\), respectively. The disease severities, locations, and shapes are varied randomly across these databases as described next.

2.2.2 Parameterisation of diseased vessels

The severity of stenoses (percentage reduction in area) is varied between 50 and 95%. The lower limit of 50% is set for the stenoses to be haemodynamically significant (Aboyans et al. 2010; Subramanian et al. 2005), and the upper limit of 95% reflects near total occlusion. For aneurysms, based on Ernst (1993) and Davis et al. (2013), an allowable range of AAA severities of 4 cm–6 cm diameter is chosen. This corresponds to a cross-sectional area range of \(12.56\text { cm}^2\)–\(28.27\text { cm}^2\). With the abdominal aortic area in the reference network (Jones et al. 2021a) between 1.09 and \(1.76\text { cm}^2\), the corresponding AAA severities are set to vary between 713% (12.56/1.76) and 2593% (28.27/1.09). With the above ranges, the parameterisation of area increase/reduction proposed in Jones et al. (2021c) is adopted, see Fig. 2. For a chain of diseased vessels (\(\hbox {CA}_{\text {x}}\), \(\hbox {SA}_{\text {x}}\), \(\hbox {PA}_{\text {x}}\), or \(\hbox {AA}_{\text {x}}\)), the normalised area \(A_n\) as a function of the normalised x-coordinate, \(x_n\), is represented as:

$$\begin{aligned} A_{n}= {\left\{ \begin{array}{ll} \bigg (1 \mp \dfrac{{\mathcal {S}}}{2} \bigg ) \pm \dfrac{{\mathcal {S}}}{2} \cos \left( \dfrac{2 (x_n-b) \pi }{e-b}\right) &{} \text {for } b\le x_n \le e \\ 1 &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(1)

where \({\mathcal {S}}\) represents the severity, b the normalised starting location of the disease in the vessel chain, and e the normalised end location. \(A_n\) is normalised with respect to the healthy version of the vessel in \(\hbox {VPD}_{\text {H}}\), and the upper and lower signs produce a stenosis and an aneurysm, respectively. In \(\hbox {CA}_{\text {x}}\), \(\hbox {SA}_{\text {x}}\), and \(\hbox {PA}_{\text {x}}\), the left and right side vessels are chosen with equal probability.
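The parameterisation of Eq. (1) can be sketched as follows (a minimal illustration, not the authors' implementation; the function name and the `aneurysm` flag used to select between the two signs are our assumptions):

```python
import numpy as np

def normalised_area(x_n, severity, b, e, aneurysm=False):
    """Normalised area profile of Eq. (1) along a diseased vessel chain.

    x_n      : normalised x-coordinate(s) in [0, 1]
    severity : disease severity S
    b, e     : normalised start and end locations of the disease
    aneurysm : False -> stenosis (area dips to 1 - S at the centre),
               True  -> aneurysm (area rises to 1 + S at the centre)
    """
    x_n = np.asarray(x_n, dtype=float)
    sign = 1.0 if aneurysm else -1.0
    bump = (1.0 + sign * severity / 2.0) \
        - sign * (severity / 2.0) * np.cos(2.0 * np.pi * (x_n - b) / (e - b))
    # Outside the diseased region [b, e] the area equals the healthy area.
    return np.where((x_n >= b) & (x_n <= e), bump, 1.0)
```

At the centre of the diseased region the cosine equals \(-1\), so the area reaches \(1-{\mathcal {S}}\) for a stenosis and \(1+{\mathcal {S}}\) for an aneurysm, while at both ends it smoothly matches the healthy area.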

Fig. 2
figure 2

Examples of a stenosis of severity 0.6 and an aneurysm of severity 8.0 are shown. Both disease profiles are created with a start location of 0.2 and an end location of 0.8

The disease severity \({\mathcal {S}}\), start location b, and end location e are assigned uniform distributions based on physical considerations. To sample values for these parameters, a fourth parameter, the reference location of the disease (represented by r) is introduced. This is included to impose a minimum length of 10% of the chain length on the disease profiles. Thus, the parameters for disease are sampled sequentially from uniform distributions within the following bounds:

$$\begin{aligned} \text {Bounds:} {\left\{ \begin{array}{ll} 0.2 \le r \le 0.8,\\ 0.1 \le b \le r-0.05, \\ r+0.05 \le e \le 0.9,\\ {\left\{ \begin{array}{ll} 0.5 \le {\mathcal {S}} \le 0.95 &\quad\text {for stenoses,}\\ 7.13 \le {\mathcal {S}} \le 25.93 &\quad \text {for aneurysms.}\\ \end{array}\right. } \end{array}\right. } \end{aligned}$$
(2)
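The sequential sampling within the bounds of Eq. (2) might be implemented as sketched below (an illustration; the function name and the use of NumPy's random generator are our assumptions):

```python
import numpy as np

def sample_disease_parameters(rng, stenosis=True):
    """Sequentially sample r, b, e, and S from the uniform bounds of Eq. (2).

    Sampling b below r - 0.05 and e above r + 0.05 guarantees a minimum
    disease length of 10% of the chain length (e - b >= 0.1).
    """
    r = rng.uniform(0.2, 0.8)            # reference location
    b = rng.uniform(0.1, r - 0.05)       # start location
    e = rng.uniform(r + 0.05, 0.9)       # end location
    if stenosis:
        severity = rng.uniform(0.5, 0.95)
    else:
        severity = rng.uniform(7.13, 25.93)
    return severity, b, e
```

Note that the reference location r is discarded after sampling; its only role is to couple the start and end locations so that the disease never degenerates to zero length.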

Based on the above parameterisation, examples of healthy and diseased \(\hbox {SA}_{\text {x}}\), \(\hbox {PA}_{\text {x}}\), and \(\hbox {AA}_{\text {x}}\) area profiles are shown in the left and right columns of Fig. 3, respectively.

Fig. 3
figure 3

Examples of healthy and diseased \(\hbox {SA}_{\text {x}}\), \(\hbox {PA}_{\text {x}}\), and \(\hbox {AA}_{\text {x}}\) area profiles. The geometrical boundaries between vessel segments that form the chains are indicated by red dashed lines

2.3 Measurements

A review of potential measurements that can be acquired in the network is presented in Jones et al. (2021a). Based on this, the locations at which time-varying pressure and flow-rate measurements can be acquired are shown in Fig. 1 and described below.

  • Pressure in the carotid and radial arteries, measured using applanation tonometry (Adji et al. 2006; O’rourke 2015). To simplify annotation and description, the right and left carotid artery pressures are referred to as \(P_1^{\text {(R)}}\) and \(P_1^{\text {(L)}}\), respectively. Similarly, the right and left radial artery pressures are referred to as \(P_3^{\text {(R)}}\) and \(P_3^{\text {(L)}}\), respectively.

  • Pressure in the brachial arteries estimated through reconstruction of finger arterial pressure (Guelen et al. 2008). The right and left brachial artery pressures are referred to as \(P_2^{\text {(R)}}\) and \(P_2^{\text {(L)}}\) , respectively.

  • Flow-rate in the carotid, brachial, and femoral arteries measured using Doppler ultrasound (Byström et al. 1998; Oglat et al. 2018; Radegran 1997). The right and left carotid artery, brachial, and femoral flow-rates are referred to as \(Q_1^{\text {(R)}}\), \(Q_1^{\text {(L)}}\); \(Q_2^{\text {(R)}}\), \(Q_2^{\text {(L)}}\); and \(Q_3^{\text {(R)}}\), \(Q_3^{\text {(L)}}\), respectively.

2.3.1 Provision of measurements to ML classifiers

Unless specified otherwise, the measurements provided to the ML classifiers are bilateral, i.e. when \(Q_1\) is specified, it is implied that both the right and left carotid flow-rates are used:

$$\begin{aligned} Q_1 = \{Q_1^{\text {(R)}}, Q_1^{\text {(L)}}\}. \end{aligned}$$
(3)

There are, therefore, a total of six bilateral measurements available: three pressures and three flow-rates. To reduce the dimensionality required to describe each pressure or flow-rate measurement, the periodic profiles are described through a Fourier series (FS) representation:

$$\begin{aligned} u(t)=\sum _{n=0}^N a_n \sin (n \omega t) + b_n \cos (n \omega t), \end{aligned}$$
(4)

where u represents any pressure or flow-rate profile; \(a_n\) and \(b_n\) represent the \(n{\text {th}}\) sine and cosine FS coefficients, respectively; N represents the truncation order; and \(\omega ={2 \pi }/{T}\), with T as the time period of the cardiac cycle. It is found in Jones et al. (2021c) that haemodynamic profiles can be described by a FS truncated at \(N=5\). Thus, each individual measurement is described by 11 FS coefficients, and each bilateral measurement by 22 FS coefficients.
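The FS coefficients of Eq. (4) can be computed from a uniformly sampled waveform as sketched below (an illustration, not the authors' implementation; it assumes the signal spans exactly one cardiac cycle and normalises time so that T = 1):

```python
import numpy as np

def fourier_coefficients(u, N=5):
    """Sine and cosine FS coefficients a_n, b_n of Eq. (4) for a periodic
    signal u sampled uniformly over one cardiac cycle.

    Since a_0 multiplies sin(0) = 0, only 11 of the 2*(N+1) = 12
    coefficients are informative for N = 5.
    """
    u = np.asarray(u, dtype=float)
    M = len(u)
    t = np.arange(M) / M                  # normalised time, T = 1
    a = np.zeros(N + 1)
    b = np.zeros(N + 1)
    b[0] = u.mean()                       # n = 0 cosine term (the mean)
    for n in range(1, N + 1):
        a[n] = 2.0 * np.mean(u * np.sin(2 * np.pi * n * t))
        b[n] = 2.0 * np.mean(u * np.cos(2 * np.pi * n * t))
    return a, b

def reconstruct(a, b, t, T=1.0):
    """Evaluate the truncated series of Eq. (4) at times t."""
    w = 2 * np.pi / T
    n = np.arange(len(a))[:, None]
    return (a[:, None] * np.sin(n * w * t)
            + b[:, None] * np.cos(n * w * t)).sum(axis=0)
```

A waveform containing only harmonics up to order N is reproduced exactly by this truncated representation; higher harmonics in measured profiles are discarded.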

2.4 Machine learning classifiers

A model mapping a vector of input measurements, \(\varvec{x}\), to a discrete output classification, y, can be described as:

$$\begin{aligned} y = m(\varvec{x}) \quad y \in \{{\mathcal {C}}^{(1)}, {\mathcal {C}}^{(2)}\}, \end{aligned}$$
(5)

where \({\mathcal {C}}^{(j)}\) represents the \(j{\text {th}}\) possible classification. In the context of this study, the measured inputs, \(\varvec{x}\), represent the FS coefficients of a user-defined combination of the haemodynamic measurements \(\{Q_1\), \(Q_2\), \(Q_3\), \(P_1\), \(P_2\), \(P_3\}\) (see Sect. 2.3.1) taken from VPs, and the output classification represents the corresponding health of those VPs: \({\mathcal {C}}^{(1)}\) = ‘healthy’ and \({\mathcal {C}}^{(2)}\) = ‘diseased’. To account for large differences in the magnitudes of the components of \(\varvec{x}\), they are individually transformed with the Z-score standardisation method (Mohamad and Usman 2013) to have zero mean and unit variance.
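The Z-score standardisation step can be sketched as follows (a minimal illustration; in practice the mean and standard deviation would be computed on the training set and reused unchanged for the test set):

```python
import numpy as np

def zscore_fit(X):
    """Per-feature mean and standard deviation (Z-score parameters).

    X is a matrix whose rows are VPs and whose columns are FS coefficients.
    """
    return X.mean(axis=0), X.std(axis=0)

def zscore_apply(X, mean, std):
    """Standardise each feature to zero mean and unit variance."""
    return (X - mean) / std
```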

As previously stated, it is assumed that disease in a patient is limited to only one of the four forms. As a first exploratory study, the ML classifiers are created for each form independently. All classifiers are therefore binary (see Jones et al. 2021c), i.e. four independent classifiers are trained to answer the question: “Does a VP belong to \(\text {VPD}_{\text {H}}\) or \(\text {VPD}_x\)?”, where x can be CAS, SAS, PAD, or AAA.

2.4.1 Training and test sets

Each VP in \(\text {VPD}_{\text {CAS}}\), \(\text {VPD}_{\text {SAS}}\), \(\text {VPD}_{\text {PAD}}\), and \(\text {VPD}_{\text {AAA}}\) shares an identical underlying arterial network, apart from the diseased chain, with the corresponding healthy subject in \(\hbox {VPD}_{\text {H}}\). It is, therefore, important to ensure that the same subset of VPs is not included in both the healthy and diseased data sets used for the ML classifiers. As the four forms of disease are mutually exclusive, four independent training and test sets, each corresponding to one form of disease, are constructed in the following three stages:

  • Step 1: Half of the available VPs are randomly selected from \(\text {VPD}_{\text {H}}\) for inclusion within the ML data set; this subset is referred to as \(\text {VPD}_{\text {H-ML}}\). The unhealthy VPs corresponding to the remaining unused half are taken from the appropriate unhealthy VPD (\(\text {VPD}_{\text {CAS}}\), \(\text {VPD}_{\text {SAS}}\), \(\text {VPD}_{\text {PAD}}\), or \(\text {VPD}_{\text {AAA}}\)) and incorporated into the ML data set. The resulting data sets are referred to as \(\text {VPD}_{\text {CAS-ML}}\), \(\text {VPD}_{\text {SAS-ML}}\), \(\text {VPD}_{\text {PAD-ML}}\), and \(\text {VPD}_{\text {AAA-ML}}\), respectively.

  • Step 2: The data sets of Step 1 are combined to create four complete data sets, each containing 50% healthy and 50% unhealthy VPs:

    1. \(\text {VPD}_{\text {H-ML}}\cup \text {VPD}_{\text {CAS-ML}}\)

    2. \(\text {VPD}_{\text {H-ML}}\cup \text {VPD}_{\text {SAS-ML}}\)

    3. \(\text {VPD}_{\text {H-ML}}\cup \text {VPD}_{\text {PAD-ML}}\)

    4. \(\text {VPD}_{\text {H-ML}}\cup \text {VPD}_{\text {AAA-ML}}\)

  • Step 3: The four data sets of Step 2 are randomly split into a training set containing 2/3 of all the VPs in the data set, and a test set containing 1/3 of all the VPs.

The performance of all ML classifiers is evaluated using fivefold validation: for each fold, the same data set from Step 2 is used, but a different random split is sampled in Step 3 for training and testing.
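The three steps above, together with the requirement that a VP's underlying arterial network never appears in both classes, can be sketched with index arrays (an illustration with hypothetical function and variable names):

```python
import numpy as np

def split_vp_indices(n_vps, seed=0):
    """Steps 1-3 of Sect. 2.4.1 sketched with VP indices.

    Half of the healthy VPs enter the ML data set as healthy; the
    diseased counterparts of the *other* half enter as unhealthy, so no
    underlying arterial network appears in both classes.  The combined
    set is then randomly split 2/3 training and 1/3 test.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_vps)
    healthy_ids = perm[: n_vps // 2]          # Step 1: VPD_H-ML
    diseased_ids = perm[n_vps // 2:]          # Step 1: e.g. VPD_CAS-ML

    # Step 2: combined data set (label 0 = healthy, 1 = diseased).
    ids = np.concatenate([healthy_ids, diseased_ids])
    labels = np.concatenate([np.zeros(len(healthy_ids), int),
                             np.ones(len(diseased_ids), int)])

    # Step 3: random 2/3 - 1/3 train/test split.
    order = rng.permutation(len(ids))
    n_train = (2 * len(ids)) // 3
    train, test = order[:n_train], order[n_train:]
    return ids[train], labels[train], ids[test], labels[test]
```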

2.4.2 ML methods

The purpose of this study is to perform an initial exploratory investigation into the possibility of using ML classifiers to detect different forms of arterial disease. The focus is, therefore, on uncovering patterns and behaviours (such as which haemodynamic measurements are particularly informative) rather than on optimisation to achieve ever higher accuracies. Given this objective, it is not feasible to perform extensive optimisation and analysis on every ML classifier trained and tested. The ML methods are, therefore, chosen for their robustness, i.e. minimal sensitivity to the hyper-parameters and minimal susceptibility to problems such as overfitting, relative to more complex deep learning methods. Five ML methods are employed: random forest, gradient boosting, naive Bayes, support vector machine, and logistic regression. These methods encompass a range of probabilistic and non-probabilistic applications of different modelling approaches (see Table 1) while fulfilling the aforementioned characteristics. Alongside these five ML methods, one deep learning method, the multi-layer perceptron, is employed for comparison. It is expected a priori that multi-layer perceptron classifiers will not perform to their full potential in this study, as they rely more heavily on complex hyper-parameter optimisation and monitoring for overfitting than the five ML methods. Their use will, however, provide some, albeit limited, comparison between ML and deep learning methods. Since standard versions and implementations of these methods are employed without modification, their methodological details are not presented in this study. Instead, the reader is referred to the following references:

  1. Random Forest (RF) (Liaw and Wiener 2002; Breiman 2001)

  2. Gradient Boosting (GB) (Friedman 2001; Elith et al. 2008)

  3. Naive Bayes (NB) (Rish et al. 2001b, a)

  4. Support Vector Machine (SVM) (Kecman 2005)

  5. Logistic Regression (LR) (Sperandei 2014; Hilbe 2009; Jones et al. 2021c)

  6. Multi-layer Perceptron (MLP) (Murtagh 1991)

The implementations of all the above algorithms in the Python package Scikit-learn (Pedregosa et al. 2011) are used. Some of these methods require optimisation of hyper-parameters, which is described after the performance quantification metrics are presented in the next section.
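In Scikit-learn, the six methods can be instantiated as sketched below (a minimal illustration; only the SVM and LR settings stated in Sect. 2.5.1 are shown, with all other hyper-parameters left at their defaults rather than the grid-searched values reported later):

```python
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# One binary classifier per method; each would be trained independently
# on one of the four combined healthy/diseased data sets.
classifiers = {
    "RF": RandomForestClassifier(),
    "GB": GradientBoostingClassifier(),
    "NB": GaussianNB(),                              # normal (Gaussian) NB
    "SVM": SVC(kernel="rbf", gamma="scale"),         # RBF kernel
    "LR": LogisticRegression(solver="liblinear"),    # LIBLINEAR solver
    "MLP": MLPClassifier(activation="logistic"),     # logistic activation
}
```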

Table 1 The four different modelling approaches and how each classification method aligns with these approaches

2.4.3 Quantification of results

Classifier performance is assessed by two metrics: sensitivity and specificity in combination, and the \(F_1\) score. Figure 4 shows the definitions of sensitivity, specificity, and the \(F_1\) score, along with the related concepts of precision and recall commonly used in the assessment of classifiers. It is desirable for both sensitivity and specificity to be high; similarly, a higher \(F_1\) score is desirable. Since the \(F_1\) score is a single scalar metric that balances both precision and recall, it is a good metric for comparing classifiers when tuning the hyper-parameters of ML algorithms. For a discussion of these metrics and their relevance, please refer to Jones et al. (2021c).

Fig. 4
figure 4

The relationship between sensitivity, specificity, recall, and precision. TP: True Positive, representing VPs belonging to a classification correctly identified; FN: False Negative, representing VPs belonging to a classification incorrectly identified; FP: False Positive, representing VPs not belonging to a classification incorrectly identified; and TN: True Negative, representing VPs not belonging to a classification correctly identified
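From the confusion-matrix entries defined above, the metrics can be computed as follows (a sketch using the standard definitions: recall equals sensitivity, precision is TP/(TP+FP), and the \(F_1\) score is the harmonic mean of precision and recall):

```python
def classification_metrics(tp, fn, fp, tn):
    """Sensitivity, specificity, and F1 score from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1
```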

2.5 Hyper-parameter optimisation

The architecture of LR, NB, and SVM classifiers can all be considered to be problem independent. While these three algorithms are able to undergo varying levels of problem specific optimisation, the underlying structure of the classifier usually does not change. The architectures of RF, MLP, and GB classifiers, however, are dependent on the specific problem. The architecture choices for the classifiers and associated hyper-parameter optimisation are described next. For all six methods, all hyper-parameters that are neither optimised nor specified in the text are set to their default values within Scikit-learn (Pedregosa et al. 2011).

2.5.1 LR, SVM, and NB

For LR, the ‘LIBLINEAR’ solver offered by the Scikit-learn (Pedregosa et al. 2011) package is chosen. In the case of SVM, a kernel is typically chosen to map the input measurements to a higher order feature space (Jakkula 2006). All SVM classifiers use a radial basis function kernel (Scholkopf et al. 1997), with the Scikit-learn hyper-parameter ‘gamma’ set to ‘scale’. In the case of NB, the distribution of input measurements across the data set is chosen to be normal (Murphy et al. 2006).

2.5.2 Random Forest

In the case of RF, the number of trees in the ensemble and the maximum depth of each tree are optimised. Other hyper-parameters that can be tuned include the minimum number of data points allowed in a leaf node and the maximum number of different features considered for splitting each node; their effect, however, is not investigated here. To optimise the two chosen hyper-parameters, a grid search is carried out. A grid is constructed by discretising the possible number of trees within the ensemble between 10 and 400 at intervals of 10, and the possible depth of each tree between 20 and 200 at intervals of 10. RF classifiers are trained for every combination with all six pressure and flow-rate measurements (see Sect. 2.3.1) across all four forms of arterial disease. The hyper-parameters producing the highest \(F_1\) score are found for each form of disease, and this combination of hyper-parameters is then used for all subsequent classifiers. The optimal hyper-parameters for each of the four forms of disease are shown in Table 2, along with the \(F_1\) score achieved by each.
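The grid search can be sketched as below (an illustration using Scikit-learn; the function name and fixed random seed are our assumptions, and in practice the \(F_1\) score would be averaged across folds rather than taken from a single split):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score

def rf_grid_search(X_tr, y_tr, X_te, y_te,
                   n_trees_grid=range(10, 401, 10),
                   depth_grid=range(20, 201, 10)):
    """Exhaustive search over ensemble size and tree depth, keeping the
    architecture that produces the highest F1 score on the test data."""
    best = (None, None, -1.0)
    for n_trees in n_trees_grid:
        for depth in depth_grid:
            clf = RandomForestClassifier(n_estimators=n_trees,
                                         max_depth=depth,
                                         random_state=0)
            clf.fit(X_tr, y_tr)
            f1 = f1_score(y_te, clf.predict(X_te))
            if f1 > best[2]:
                best = (n_trees, depth, f1)
    return best  # (n_trees, depth, f1)
```

The same loop structure applies to the GB and MLP grid searches of the following sections, with the estimator and grids swapped accordingly.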

Table 2 The hyper-parameters describing the architecture of the RF classifiers that produce the highest \(F_1\) scores, when using all six pressure and flow-rate measurements

It is unlikely that a single architecture will consistently produce the best results when varying the combination of input measurements. In this study, re-optimisation of the hyper-parameters when varying the input measurement combination is not performed, to minimise computational cost. It is found that when using all six pressure and flow-rate measurements, the \(F_1\) score produced is relatively insensitive to the hyper-parameters used. Thus, it is likely that a reasonable representation of the maximum achievable accuracy can be obtained for various input measurement combinations by a single architecture. It should be noted, however, that further improvements in classification accuracy may be possible with such re-optimisation.

2.5.3 Gradient Boosting

Similar to the RF architecture, the GB architecture is optimised by varying the number of trees within the ensemble and the maximum depth of each tree. Other hyper-parameters that may be varied, but are not considered here, include the minimum number of data points allowed in a leaf node, the maximum number of different features considered for splitting each node, and the impact of each tree on the final outcome (i.e. the learning rate). A grid search is carried out to find the combination producing the highest \(F_1\) score when using all six input measurements. It is common for GB classifiers to use weaker, shallower decision trees (relative to RF classifiers) to deliberately create high bias and low variance (Hastie et al. 2009). The possible depth of each tree is, therefore, discretised between 2 and 20 at intervals of 1. As a high number of trees is not required to compensate for overfitting, contrary to the RF method, the possible number of trees within the ensemble is discretised between 10 and 100 at intervals of 10. The optimal hyper-parameters for each of the four forms of disease are shown in Table 3.

Table 3 The hyper-parameters describing the architecture of the GB classifiers that produce the highest \(F_1\) scores, when using all six pressure and flow-rate measurements

2.5.4 Multi-layer perceptron

As is common with deep learning methods, there are significantly more hyper-parameters to optimise for the MLP classifiers than for the five ML methods. Examples of hyper-parameters that significantly affect the performance of an MLP classifier include the batch size, learning rate, activation functions, drop-out, and number of units per hidden layer. In keeping with the exploratory stance of this study, only the number of neurons within each hidden layer and the number of hidden layers are optimised. For simplification, all hidden layers are assumed to contain an identical number of neurons, and a logistic activation function is used for all hidden layers. It is likely that this simplistic hyper-parameter optimisation will limit the classification accuracy achieved by the MLP classifiers.

Similar to the RF and GB methodology, the hyper-parameters that produce the highest \(F_1\) score are found through a grid search. The number of neurons within each layer is discretised between 10 and 200 at intervals of 10, and the number of hidden layers between 1 and 6 at intervals of 1. The optimal hyper-parameters found for each of the four forms of disease are shown in Table 4. Relative to RF and GB, there is less consistency in the maximum \(F_1\) scores achieved by MLP: it classifies AAA and CAS to high levels of accuracy, but performs relatively poorly for SAS and PAD.

Table 4 The hyper-parameters describing the architecture of the MLP classifiers that produce the highest \(F_1\) scores, when using all six pressure and flow-rate measurements

2.6 Input measurement combination search

There are 63 possible combinations of input measurements that can be provided to an ML classifier from the six bilateral pressure and flow-rate measurements (see Sect. 2.3.1). A combination search is performed for each of the four forms of disease: for every combination of input measurements, all six ML classification methods are trained and subsequently tested to quantify their performance. The average \(F_1\) score, sensitivity, and specificity for each case across the five folds are recorded. Combinations of interest are then further analysed.
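Enumerating the 63 combinations is straightforward (a sketch; the measurement labels are ours):

```python
from itertools import combinations

# The six bilateral measurements of Sect. 2.3.1.
measurements = ["Q1", "Q2", "Q3", "P1", "P2", "P3"]

# Every non-empty subset: 2**6 - 1 = 63 combinations.
all_combinations = [combo
                    for r in range(1, len(measurements) + 1)
                    for combo in combinations(measurements, r)]
```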

2.7 Overfitting and early stopping criterion

To assess any overfitting by the ML and deep-learning methods, the log loss costs across the training and test sets are recorded at each sequential iteration of the training process (up to the 200\({\text {th}}\) iteration). At a low number of training iterations, both the training and test costs are expected to be high, as the classifiers can neither fit the training data nor generalise to the test data. As the training process progresses, the training and test costs are both expected to decay before converging to stable values in the absence of overfitting. In the case of overfitting, however, the training costs continue to decrease while the test costs, after reaching a minimum, successively increase. In such cases, an early stopping criterion (Prechelt 1998; Yao et al. 2007) is adopted to avoid overfitting. A third partition of the available data (the validation set) is introduced. The combined healthy and unhealthy data sets described in Sect. 2.4.1 are split so that the training set contains 50%, the validation set 25%, and the test set 25% of the available data. Classifiers are trained on the training set; however, the stopping criterion is based on the log loss cost across the validation set, computed at each sequential iteration of the training process. If more than 75 iterations have been performed, and the improvement in the validation log loss cost between two consecutive iterations is less than \(1\times 10^{-3}\), training is stopped. The final classifier accuracy is assessed on the test set.
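The stopping rule can be sketched generically (an illustration; the `step` callback, which performs one training iteration and returns the current validation log loss, is a hypothetical interface):

```python
def train_with_early_stopping(step, max_iter=200, patience_start=75, tol=1e-3):
    """Iterative training with the early stopping rule of Sect. 2.7.

    Training stops once more than `patience_start` iterations have run
    and the improvement in the validation log loss between two
    consecutive iterations falls below `tol`.
    """
    prev_cost = None
    for it in range(1, max_iter + 1):
        cost = step()
        if (prev_cost is not None and it > patience_start
                and prev_cost - cost < tol):
            break
        prev_cost = cost
    return it, cost
```

With a validation cost that keeps decreasing but ever more slowly, training stops shortly after the 75-iteration threshold, once per-iteration improvements drop below the tolerance.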

3 Results and discussion

The full tables of results for CAS, SAS, PAD, and AAA classification are shown in Appendices A, B, C, and D, respectively. The \(F_1\) scores achieved by each ML method and combination of input measurements are shown for CAS, SAS, PAD, and AAA classification in Figs. 5, 6, 7, and 8, respectively. They show that for all forms of arterial disease, NB and LR classifiers consistently produce low accuracy. It has previously been shown in the PoC (Jones et al. 2021c) that the partition between the pressure and flow-rate profiles of healthy and stenosed patients is likely to be nonlinear. The consistently low accuracy of LR supports this finding, as LR is the only linear classification method used. The low accuracy of the NB classifiers is also consistent with the results of the PoC (Jones et al. 2021c), which found the NB method to be poorly suited to the problem of distinguishing between haemodynamic profiles. On the contrary, across all four forms of disease, the tree-based methods (RF and GB) consistently produce high accuracy results. This contradicts the finding of the PoC (Jones et al. 2021c) and is likely due to inadequate architecture optimisation in, or the unsuitability of RF for, the smaller network used in the PoC (Jones et al. 2021c). The fact that both RF and GB classifiers produce high accuracy classification in this study suggests that tree-based methods are well suited to distinguishing between haemodynamic profiles, and emphasises the importance of adequate architecture optimisation.

There is less consistency in the results achieved by the SVM and MLP classifiers across the different forms of disease. SVM classifiers produce accuracies comparable with RF and GB classifiers in the case of AAA detection, but low-accuracy results for the three other forms of disease. MLP classifiers produce accuracies comparable with RF and GB classifiers in the case of CAS and AAA detection, but relatively low-accuracy results for SAS and PAD classification. Overall, it is found that the tree-based methods of RF and GB perform best, with GB performance slightly superior to that of RF. It is important to remember, however, that the results presented here do not necessarily capture the full potential of each method, and instead only reflect the accuracies achieved within the limitations of the simplistic hyper-parameter optimisation, a consideration particularly important for MLP.

Fig. 5
figure 5

The \(F_1\) scores achieved for CAS using each combination of bilateral input measurements are shown. Measurements included within each combination are highlighted with a black square

Fig. 6
figure 6

The \(F_1\) scores achieved for SAS using each combination of bilateral input measurements are shown. Measurements included within each combination are highlighted with a black square

Fig. 7
figure 7

The \(F_1\) scores achieved for PAD using each combination of bilateral input measurements are shown. Measurements included within each combination are highlighted with a black square

Fig. 8
figure 8

The \(F_1\) scores achieved for AAA using each combination of bilateral input measurements are shown. Measurements included within each combination are highlighted with a black square

3.1 Measurement combinations

To investigate the importance of both the number of input measurements provided to the ML algorithms and the specific combination of measurements, the average \(F_1\) scores achieved by all classifiers when providing one, two, three, four, five, or six input measurements are computed. In each case, the specific combinations that achieve the maximum and minimum \(F_1\) scores are also recorded. These results for the different forms of disease are presented next.
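A minimal sketch of this combination search, assuming a hypothetical helper `evaluate_f1` that trains and scores a classifier on a given subset of the six bilateral measurements (the measurement names below follow the paper's \(P_i\)/\(Q_i\) notation):

```python
from itertools import combinations

MEASUREMENTS = ["P1", "P2", "P3", "Q1", "Q2", "Q3"]

def combination_summary(evaluate_f1):
    """For each subset size k = 1..6, return the average, maximum, and
    minimum F1 score over all k-measurement combinations, plus the
    best-scoring combination itself."""
    summary = {}
    for k in range(1, len(MEASUREMENTS) + 1):
        scores = {c: evaluate_f1(c) for c in combinations(MEASUREMENTS, k)}
        best = max(scores, key=scores.get)
        vals = list(scores.values())
        summary[k] = {"avg": sum(vals) / len(vals),
                      "max": scores[best],
                      "min": min(vals),
                      "best": best}
    return summary
```

With six measurements this is \(\sum_{k=1}^{6}\binom{6}{k} = 63\) classifier trainings per method and disease form, which is consistent with exhaustively tabulated results in the appendices.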

3.1.1 CAS classification

The average, maximum, and minimum \(F_1\) scores achieved when providing different numbers of input measurements for CAS classification are shown in Fig. 9.

Fig. 9
figure 9

The average, maximum, and minimum \(F_1\) score achieved by all classifiers trained using different numbers of input measurements are shown for carotid artery stenosis classification. The central markers represent the average score achieved, while the error bars indicate the upper and lower limits

It shows that NB and LR classifiers consistently produce an \(F_1\) score of approximately 0.5, which is comparable to naive classification, i.e. randomly assigning the health of VPs with an equal probability for each outcome. SVM performs slightly better, with \(F_1\) scores averaging 0.5–0.6. The other three classification methods (RF, MLP, and GB) perform significantly better, with \(F_1\) scores generally averaging between 0.7 and 0.95 and showing a clear increase in the average \(F_1\) score as the number of input measurements increases. While the average and minimum \(F_1\) scores achieved by RF and GB classifiers continuously increase, the maximum \(F_1\) score quickly reaches a plateau (at one input measurement for RF and three input measurements for GB). For a fixed number of measurements, the wide range of \(F_1\) scores in Fig. 9 across all classifiers suggests that specific combinations of measurements may be more important than others for optimal classification. To explore this further, the combinations of input measurements that produce the highest \(F_1\) scores, and the corresponding accuracies, when employing the RF and GB methods are shown in Table 5. Two observations are made from this table. First, for a fixed number of measurements, the best combinations are not identical for the two methods. For example, when two measurements are used, the best combination for RF is (\(Q_2\), \(Q_1\)), while the best combination for GB is (\(P_2\), \(P_1\)). This suggests that the best combination of measurements is likely dependent on the particular ML method chosen. Second, some patterns stand out with respect to which measurements may be more informative than others. For example, across Table 5, \(Q_1\) appears in 11 out of 12 combinations, and \(P_1\) appears in 8 out of 12 combinations. This suggests that \(Q_1\) is most informative for identifying the presence of CAS, followed by \(P_1\).
Physiologically, this is not surprising, as \(Q_1\) and \(P_1\) are the flow-rates and pressures in the carotid arteries, and the disease under consideration is carotid artery stenosis. It is encouraging that the ML methods are indeed placing more importance on the relevant physiological measurements. In fact, it is remarkable that RF and GB both achieve \(F_1\) scores above 0.85, and sensitivities and specificities larger than 85%, with only one measurement. Also notable is that these accuracies can be taken beyond 93% (see the GB row for three measurements in Table 5) when two more measurements are added, as long as these additional measurements are carefully chosen.

Table 5 The combinations of input measurements that produce the maximum \(F_1\) scores when providing one to six input measurements and employing the RF and GB methods to detect CAS

An interesting pattern to note is that while the average and minimum \(F_1\) scores achieved by MLP classifiers continuously increase in Fig. 9, the maximum \(F_1\) score decreases beyond three input measurements. The maximum \(F_1\) scores achieved by MLP classifiers, and the corresponding sensitivities and specificities, when using three to six input measurements are shown in Table 6. It shows that the decrease in \(F_1\) scores is accompanied by a decrease in both the sensitivities and specificities, as opposed to a trade-off between them (an increase in sensitivity with a decrease in specificity, or vice versa). This behaviour is unusual, as intuitively more input measurements should generally provide more information. This finding may suggest that MLP classifiers are able to extract maximum information from the haemodynamic profiles using as few as three input measurements, and may be susceptible to overfitting when using more than three measurements, thereby leading to poorer generalisation and consequently decreased accuracies.

Table 6 The combinations of input measurements that produce the maximum \(F_1\) scores when providing three to six input measurements and employing the MLP method to detect CAS

To investigate any overfitting, the log loss costs for the training and test sets during the training process are shown in Fig. 10 for the best measurement combinations identified for the MLP, GB, and RF classifiers (Tables 5 and 6). The RF and GB methods show no signs of overfitting. For the MLP, while the three-measurement case also shows no overfitting, the cases with four, five, and six measurements show an increase in test costs beyond 50–100 training iterations, implying overfitting, the extent of which worsens as the number of measurements increases. Such behaviour for the MLP is also observed for SAS and PAD, and thus for the MLP method an early stopping criterion is adopted (see Sect. 2.7), the results of which are presented in Sect. 3.6.
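Per-iteration cost curves of this kind can be obtained for gradient boosting via scikit-learn's staged prediction interface. The sketch below uses synthetic data in place of the virtual patient databases, which are not reproduced here:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss

# Synthetic stand-in data: label depends deterministically on two features
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(int)
X_tr, X_te, y_tr, y_te = X[:300], X[300:], y[:300], y[300:]

gb = GradientBoostingClassifier(n_estimators=200, random_state=0)
gb.fit(X_tr, y_tr)

# Log loss at every boosting iteration, on the training and test sets;
# a rising test curve while the training curve falls indicates overfitting
train_cost = [log_loss(y_tr, p, labels=[0, 1])
              for p in gb.staged_predict_proba(X_tr)]
test_cost = [log_loss(y_te, p, labels=[0, 1])
             for p in gb.staged_predict_proba(X_te)]
```

Plotting `train_cost` and `test_cost` against iteration number reproduces the style of diagnostic shown in Fig. 10.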

Fig. 10
figure 10

The average log loss cost across the training and test sets during the training process when using the combination of three to six input measurements that achieve highest accuracies for RF, GB, and MLP methods (Tables 5 and 6)

3.1.2 SAS classification

The results of the analysis for SAS classification are shown in Fig. 11. As in the case of CAS classification, Fig. 11 shows that NB, LR, and SVM classifiers consistently produce accuracies comparable to naive classification, irrespective of the number of input measurements used. A clear difference between Figs. 9 and 11 is the accuracy achieved by MLP classifiers. Compared to the CAS case, the MLP performance is further degraded for SAS, though it remains marginally better than NB, LR, and SVM. It is important to consider, however, that the MLP classifiers are experiencing overfitting, as highlighted in Sect. 3.1.1. Results with overfitting avoided by adopting an early stopping criterion are presented in Sect. 3.6.

Fig. 11
figure 11

The average, maximum, and minimum \(F_1\) score achieved by all classifiers trained using different numbers of input measurements are shown for SAS classification. The central markers represent the average score achieved, while the error bars indicate the upper and lower limits

A high degree of similarity can be seen between the behaviours of RF and GB classifiers for CAS and SAS. Figure 11 shows that the average and minimum \(F_1\) scores achieved by RF and GB classifiers continuously increase as the number of input measurements increases. The maximum \(F_1\) score is seen to quickly reach an asymptotic limit (at three input measurements for both RF and GB classifiers). A peak \(F_1\) score of approximately 0.85 is achieved by GB, along with sensitivities and specificities higher than 85%.

The combinations of input measurements that produce the highest \(F_1\) scores, and the corresponding accuracies, are shown in Table 7. It shows a higher degree of consistency between the best combinations for the two methods relative to the CAS case, i.e. the best combinations are generally identical (or with minimal differences) between RF and GB. It also shows that \(Q_1\) is particularly informative, appearing in all of the best combinations. Physiologically, this may be due to its proximity to the disease location.

Table 7 The combinations of input measurements that produce the maximum \(F_1\) scores when providing one to six input measurements and employing the RF and GB methods to detect SAS

3.1.3 PAD classification

The results for PAD classification are shown in Fig. 12. Comparing Figs. 11 and 12, a high degree of similarity is seen between the behaviours of SAS and PAD classification. As previously seen for SAS classification, Fig. 12 shows that the NB, LR, and SVM methods all consistently produce accuracies comparable to naive classification. While the MLP method performs slightly better than naive classification, its accuracy still remains relatively low. High accuracy can be seen in Fig. 12 for the two tree-based methods of RF and GB. As previously seen for CAS and SAS, while the average and minimum \(F_1\) scores achieved by the RF and GB methods increase as the number of input measurements increases, the maximum \(F_1\) score quickly reaches an asymptotic limit (at three input measurements for both the RF and GB methods).

The combinations of input measurements that produce the highest \(F_1\) scores for PAD classification when employing the RF and GB methods are shown in Table 8. It shows not only good consistency between the best combinations for the two ML methods, but also good agreement with the combinations presented in Table 7. Very similar combinations of input measurements (with some minor differences) produce the highest \(F_1\) score for all numbers of input measurements. As previously observed in Tables 5 and 7, the input measurement \(Q_1\) appears to be most informative, appearing in all the best-scoring combinations. Since the location of \(Q_1\) is far from the location of the disease, it is not obvious why this measurement is particularly informative of PAD.

Fig. 12
figure 12

The average, maximum, and minimum \(F_1\) score achieved by all classifiers trained using different numbers of input measurements are shown for PAD classification. The central markers represent the average score achieved, while the error bars indicate the upper and lower limits

Table 8 The combinations of input measurements that produce the maximum \(F_1\) scores when providing one to six input measurements and employing the RF and GB methods to detect PAD

3.1.4 AAA classification

The results for AAA classification are shown in Fig. 13. As previously seen for all three other forms of disease, the NB and LR classifiers consistently produce accuracies comparable to naive classification, irrespective of the number of input measurements used. The consistency of this finding (as seen in Figs. 9, 11, and 12), irrespective of the form of disease being classified, highlights both the importance of nonlinear partitions between healthy and unhealthy VPs and the unsuitability of the NB method for distinguishing between haemodynamic profiles.

Fig. 13
figure 13

The average, maximum, and minimum \(F_1\) score achieved by all classifiers trained using different numbers of input measurements are shown for AAA classification. The central markers represent the average score achieved, while the error bars indicate the upper and lower limits

In the case of AAA classification, the SVM, RF, MLP, and GB methods consistently produce good accuracies. Figure 13 shows that these methods produce high accuracies even with a single input measurement. While there is some increase in the average \(F_1\) score as the number of input measurements increases, due to the very high initial average \(F_1\) score achieved (when using a single input measurement) this increase is limited (as the \(F_1\) score cannot exceed 1). Two possible reasons for the higher accuracies in aneurysm classification relative to stenosis classification are:

  • Aneurysms, owing to an increase in area, as opposed to the decrease in area for stenoses, may produce more significant or consistent biomarkers in the pressure and flow-rate profiles. This hypothesis is supported by Low et al. (2012), which found that even low-severity AAAs have a global impact on the pressure and flow-rate profiles.

  • While the severities of aneurysms cannot be directly compared to the severities of stenoses, it may be that the severity of aneurysms in \(\text {VPD}_{\text {AAA}}\) is disproportionately large relative to the severities of stenoses. The significance of any indicative biomarkers introduced into pressure and flow-rate profiles is likely to be proportional to the severity of the change in area. This implies that the increase in vessel area of 712–2,593% in \(\text {VPD}_{\text {AAA}}\) is perhaps on the extreme end of aneurysm severity, thereby making the classifications relatively easier. This is further explored in Sect. 3.4.

The combinations of input measurements that produce the highest \(F_1\) scores when providing one to six input measurements and employing the RF and GB methods are shown for AAA classification in Table 9. It shows that \(F_1\) scores range from 0.97 to 0.997, and sensitivities and specificities range from 96.5% to 99.8%. Due to the high accuracies across all numbers of measurements, the analysis of specific combinations is not very meaningful. However, the measurement \(Q_1\) again appears in all the best combinations. It should also be noted that the high accuracies for AAA classification are consistent with those reported in Chakshu et al. (2020), where deep-learning methods applied to a VPD created by varying seven network parameters yielded classification accuracies of \(\approx 99.9\%\), and in Wang et al. (2021), where machine learning methods applied to a VPD yielded sensitivities and specificities of \(\approx 86\%\).

Overall, the results show that the physiological changes to the waveforms induced by both stenosis and aneurysms (Stergiopulos et al. 1992; Low et al. 2012) are well captured by the data-driven machine learning methods.

Table 9 The combinations of input measurements that produce the maximum \(F_1\) scores when providing one to six input measurements and employing the RF and GB methods to detect AAA

3.2 Importance of carotid artery flow-rate

Appendices A–D, along with the above analysis, show that classifiers trained using flow-rates in the common carotid arteries (\(Q_1\)) consistently produce the highest accuracy. To analyse this further, the \(F_1\) scores of classifiers with combinations that include and exclude \(Q_1\) are separated and compared for CAS, SAS, PAD, and AAA in Figs. 14, 15, 16, and 17, respectively. These figures show histograms of the \(F_1\) scores, i.e. the number of classifiers (measurement combinations) including and excluding \(Q_1\) within each \(F_1\) score bucket. For each disease form, results are only shown for the classification methods that consistently produce good results for that disease form. The figures show a clear positive shift in the histograms when \(Q_1\) is included, pointing to the particularly informative nature of \(Q_1\). Other behaviours observed from these figures are:

  • While there is generally an increase in \(F_1\) score when including \(Q_1\), it is also observed that the maximum accuracies are relatively less sensitive to the inclusion of \(Q_1\).

  • The greatest distinction between \(F_1\) scores when including or excluding \(Q_1\) is seen for CAS classification when using the RF method. There is no overlap between the two RF histograms in Fig. 14.

  • Observing the lower plots in Figs. 15 and 16, a clear subgroup of low-accuracy classifiers can be seen when excluding \(Q_1\) for SAS and PAD, which does not exist when including \(Q_1\).

Fig. 14
figure 14

The histograms of the \(F_1\) scores achieved for CAS classification are shown for all input measurement combinations that include \(Q_1\) in the upper plot and exclude \(Q_1\) in the lower plot

Fig. 15
figure 15

The histograms of the \(F_1\) scores achieved for SAS classification are shown for all input measurement combinations that include \(Q_1\) in the upper plot and exclude \(Q_1\) in the lower plot

Fig. 16
figure 16

The histograms of the \(F_1\) scores achieved for PAD classification are shown for all input measurement combinations that include \(Q_1\) in the upper plot and exclude \(Q_1\) in the lower plot

Fig. 17
figure 17

The histograms of the \(F_1\) scores achieved for AAA classification are shown for all input measurement combinations that include \(Q_1\) in the upper plot and exclude \(Q_1\) in the lower plot

3.3 Feature importance

An important aspect of the GB method is that the measurement importance, which quantifies the influence that individual measurements have on classification, can be computed. The split-improvement feature importance (Zhou and Hooker 2020) of a feature can be thought of as the contribution of that feature to the total information gain achieved in a decision tree, averaged across all the trees in the ensemble. A high feature importance suggests that the given feature contributes heavily to the classification accuracies achieved. As the features provided to the GB classifiers are the FS coefficients describing the haemodynamic profiles, the total importance of each bilateral pressure or flow-rate measurement is found by summing the feature importances of the associated 22 FS coefficients. The total importance of each input measurement for each disease form is shown in Table 10.
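A sketch of this grouping, assuming the feature vector (e.g. a fitted model's `feature_importances_` array) is ordered measurement-by-measurement with 22 FS coefficients per measurement, as described above:

```python
import numpy as np

def measurement_importance(feature_importances, names, n_coeffs=22):
    """Sum the per-feature split-improvement importances over the FS
    coefficients describing each measurement, and normalise so the
    measurement-level importances sum to one."""
    imp = np.asarray(feature_importances, dtype=float)
    imp = imp.reshape(len(names), n_coeffs)  # one row per measurement
    totals = imp.sum(axis=1)
    return dict(zip(names, totals / totals.sum()))
```

In scikit-learn, for instance, the array passed in would be `gb.feature_importances_` from a fitted `GradientBoostingClassifier`; the ordering of `names` must match how the FS coefficients were concatenated when building the feature matrix.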

Table 10 The total importance of each input measurement, based on the GB classifiers provided with all six measurements

Three important observations from this table are:

  • The input measurement \(Q_1\) consistently produces the highest importance for all forms of disease. This finding supports the findings of Sect. 3.2.

  • The importance of each input measurement changes between disease forms based on the spatial proximity to the disease location. Generally, the measurements in close proximity to the disease location have higher importance. For example, the importance of \(Q_3\) (flow-rate in the femoral arteries) is highest for PAD classification (see Fig. 1 for locations of disease and measurements). Similarly, \(P_1\) (pressure in carotid arteries) has highest importance for CAS and SAS.

  • The feature importances, when viewed collectively, also shed some light on why \(Q_1\) is important for SAS and PAD even though the measurement location is far from the disease location. For SAS, the two most informative measurements are \(Q_1\) and \(Q_2\), and for PAD, these are \(Q_1\) and \(Q_3\). From Fig. 1, it is clear that these combinations form pairs of flow-rates before and after/at the disease location. Thus, the measurement locations bound the disease location to provide more information on the presence of disease.

3.4 Lower severity aneurysms

In Sect. 3.1.4, it is found that AAAs can be classified to a very high level of accuracy with only one input measurement. Whether these accuracies are affected when lower severity aneurysms are considered is assessed here. For this assessment, a new lower severity AAA VPD, referred to as \(\text {VPD}_{\text {AAA-L}}\), is created in an identical manner to the other diseased databases (see Sect. 2.2), with the following two differences:

  • The severity of aneurysms introduced into the virtual subjects (see Sect. 2.2.2) is sampled from a uniform distribution bounded as follows: \(3.0 \le {\mathcal {S}}_{\text {aneurysm}} \le 7.0\).

  • To reduce the computational expense associated with the creation of virtual patients, the size of \(\text {VPD}_{\text {AAA-L}}\) is restricted to 5,000 VPs.

A combination search is carried out with only the GB method as it is the best overall method. The \(F_1\) scores, sensitivities, and specificities achieved by all the measurement combinations are presented in Appendix E. For comparison, the GB \(F_1\) scores for all forms of disease (including AAA-L) are shown in Appendix F. The ratios of the GB \(F_1\) scores achieved for AAA-L classification relative to AAA classification are shown in Fig. 18.

The observations from this figure are:

  • The \(F_1\) scores for AAA-L classification are consistently lower (by 1% to 10%) than those for AAA classification. This finding supports the physiological expectation that the significance of biomarkers in the pressure and flow-rate profiles is proportional to the severity.

  • The ratios of \(F_1\) scores are lowest for combinations of inputs that predominantly rely on pressure measurements. This suggests that pressure measurements are, in general, less informative about disease severity, consistent with the generally lower feature importances of pressure measurements in Table 10.

  • The \(F_1\) score ratios are highest for input combinations that include \(Q_1\). This finding further suggests that \(Q_1\) contains consistent biomarkers.

  • The ratios range between 0.9 and 0.99, implying a maximum degradation of only 10% relative to the high-severity classification accuracies. Thus, even for low-severity aneurysms, many measurement combinations achieve \(F_1\) scores higher than 0.95 and corresponding sensitivities and specificities larger than 95%.

Fig. 18
figure 18

The ratios of the \(F_1\) scores for AAA-L classification relative to AAA classification, when providing each combination of input measurements are shown. Measurements included within each combination are highlighted with a black square

3.5 Unilateral aneurysm measurement tests

Hitherto, all ML classifiers used bilateral measurements, i.e. both the right and left instances of each measurement were simultaneously provided. Here, the ability of unilateral measurements, i.e. only the right or left instance of a measurement, to detect AAAs is assessed. This analysis is restricted to the GB method as it consistently outperforms other methods.

GB classifiers are trained and tested to detect AAAs using four different unilateral measurements:

  • Flow-rate in the right carotid artery, shown in Fig. 1 as \(Q_1^{\text {(R)}}\).

  • Flow-rate in the left carotid artery, shown in Fig. 1 as \(Q_1^{\text {(L)}}\).

  • Pressure in the right radial artery, shown in Fig. 1 as \(P_3^{\text {(R)}}\).

  • Pressure in the left radial artery, shown in Fig. 1 as \(P_3^{\text {(L)}}\).

Carotid artery flow-rate is chosen as it has been shown to be the best measurement for disease classification. Radial artery pressure is chosen due to the location of the radial artery at the human wrist. Recent advancements have resulted in wearable devices capable of measuring continuous radial pressure profiles, such as the TLT Sapphire monitor (Tarilian Laser Technologies, Welwyn Garden City, U.K.) (Lobo et al. 2019); thus, if AAAs can be detected with satisfactory accuracy using these measurements, future home monitoring of abdominal aortic health through such wearables may be possible. The sensitivities and specificities achieved by the four unilateral GB classifiers are shown in Table 11. It shows that, relative to the bilateral case, while there is a decrease in the classification accuracies, the magnitude of the decrease is less than 10%. This finding suggests that there may be sufficient biomarkers of AAA presence captured within the intra-measurement details of a single pressure or flow-rate profile. The fact that similar accuracies are achieved with either the right or left instances of any measurement is likely due to physiological symmetry. While there are some minor asymmetries between the right and left upper extremities, due to the topology of the arterial network (as shown in Fig. 1), changes to the cross-sectional area of the abdominal aorta are expected to produce relatively consistent changes in both the right and left sides of the body.

Table 11 The sensitivities and specificities achieved when using the measurements of flow-rate in the right, left, and both CAs and pressure in the right, left, and both radial arteries

3.6 MLP early stopping to avoid overfitting

It is shown in Sect. 3.1.1 that the accuracy of MLP classifiers is hindered by overfitting. Thus, the early stopping criterion outlined in Sect. 2.7 is adopted for the combinations of three to six measurements that hitherto produced the best results without early stopping. Here, the hyper-parameters describing the MLP architecture, i.e. the number of neurons per layer and the number of layers (depth), are also individually re-optimised for each such case on the validation data set with the early stopping criterion enabled. Thus, for each combination in the grid search, the best validation-set \(F_1\) score is computed with early stopping enabled during training, and the architecture producing the maximum \(F_1\) score is selected. Subsequently, for this optimal architecture, the test scores are computed on the test data set. This analysis is performed for CAS and AAA, as the behaviour of SAS and PAD is very similar to that of CAS.
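The re-optimisation loop can be sketched as follows; the grid values and the helper `evaluate_val_f1` (which would train an early-stopped MLP of the given shape and return its validation-set F1 score) are hypothetical stand-ins, not the study's actual grid:

```python
from itertools import product

def optimise_architecture(evaluate_val_f1,
                          depths=(1, 2, 3, 4), widths=(20, 40, 60, 80)):
    """Grid search over (number of layers, neurons per layer), keeping the
    architecture with the highest validation-set F1 score."""
    best, best_f1 = None, float("-inf")
    for depth, width in product(depths, widths):
        score = evaluate_val_f1(depth, width)
        if score > best_f1:
            best, best_f1 = (depth, width), score
    return best, best_f1
```

The selected architecture is then retrained and assessed once on the held-out test set, as described above.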

Fig. 19
figure 19

MLP: the log loss cost profiles across the training and validation sets when using the best performing combination containing three to six input measurements for CAS classification and employing early stopping

3.6.1 CAS: early stopping

The hyper-parameters describing the optimum architectures with the early stopping criterion for the best combinations are shown in Table 12. It shows a remarkable degree of consistency between the optimum hyper-parameters for varying numbers of input measurements: for four measurements and above, the optimal architecture is identical. This finding supports the previous simplification of using a single architecture for all the MLP classifiers. It is interesting to note, however, that there is less consistency with the previous optimum hyper-parameters presented in Table 4, which found that four layers containing 60 neurons produced the highest \(F_1\) score when providing six input measurements.

The cost profiles for the optimal architectures with early stopping are shown in Fig. 19. It shows that the early stopping criterion generally fulfils its purpose of stopping the training process near the minimum validation cost point, thus minimising overfitting. It is observed that for all numbers of input measurements, training is stopped as soon as the minimum of 75 iterations has been completed. While this early stopping criterion greatly reduces overfitting in all the cases, the minimum number of training iterations (75) is too high for the six-measurement case (the validation cost has already started to rise significantly), suggesting that further refinement may reduce the validation and test costs even further.

A comparison between the \(F_1\) scores achieved with and without early stopping is shown in Table 13. While early stopping has reduced the log loss cost across the validation and test sets, this does not necessarily translate to improvements in the \(F_1\) score. The log loss cost will decrease without increasing the \(F_1\) score if easy-to-classify patients are predicted with a higher degree of certainty (for example, a predicted probability of 95% rather than 75%), even if no additional patients are correctly classified. For the six-measurement case, however, some increase in \(F_1\) score is clearly observed as a benefit of early stopping.
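This decoupling of log loss and \(F_1\) score is easy to demonstrate with a toy example: sharpening the predicted probabilities of already correctly classified patients lowers the log loss while leaving the \(F_1\) score unchanged.

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    # Average binary cross-entropy over the predictions
    p = np.clip(p, eps, 1 - eps)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def f1(y, p, thr=0.5):
    # F1 score of the thresholded predictions
    pred = (p >= thr).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    return 2 * tp / (2 * tp + fp + fn)

y = np.array([1, 1, 0, 0])
p_hesitant = np.array([0.75, 0.75, 0.25, 0.25])   # correct but uncertain
p_confident = np.array([0.95, 0.95, 0.05, 0.05])  # same labels, more certain
```

Both probability vectors classify every patient correctly, so the \(F_1\) score is identical, yet the confident predictions have a markedly lower log loss.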

Table 12 The hyper-parameters describing the architecture of the MLP classifiers that produce the highest \(F_1\) scores on the validation set with early stopping criterion for CAS classification, when using the best performing combinations of three to six input measurements
Table 13 MLP: \(F_1\) scores on the test dataset when using the best three to six input measurement combinations found to produce the highest accuracies for CAS with (Sect. 3.6.1) and without early stopping (Sect. 3.1.1)

3.6.2 AAA: early stopping

The hyper-parameters describing the optimum architectures with the early stopping criterion for the best combinations are shown in Table 14. The best architectures for AAA are less consistent across the numbers of input measurements than those for CAS. It is again observed that the new hyper-parameters are inconsistent with the old ones (Table 4). Initially, this finding may seem to undermine early stopping and individual architecture optimisation for varying numbers of input measurements. However, while the optimum hyper-parameters are inconsistent, the \(F_1\) scores achieved are very similar: 0.9785 in Table 4 and 0.9870 in Table 14. This similarity in \(F_1\) scores may suggest an insensitivity to the architecture used, i.e. the \(F_1\) score surface in the two-dimensional grid-search space is relatively flat for this problem. This again supports the earlier simplification of using a single architecture for all the classifiers.

The cost profiles for the optimal architectures with early stopping are shown in Fig. 20. They show no major signs of overfitting when using MLP classifiers to detect AAA. As a result, the employment of an early stopping criterion has little effect on the final log loss cost achieved across the training and validation data sets. Thus, when comparing the test scores with and without early stopping in Table 15, no significant differences in the \(F_1\) scores are observed for AAA classification.

The aforementioned findings with early stopping enabled for both CAS and AAA classification suggest that to substantially improve the accuracy of MLP classifiers, a more extensive hyper-parameter optimisation strategy, which tunes many other hyper-parameters, is required and should be adopted in future studies.

Table 14 The hyper-parameters describing the architecture of the MLP classifiers that produce the highest \(F_1\) scores on the validation set with early stopping criterion for AAA classification, when using the best performing combinations of three to six input measurements
Table 15 MLP: \(F_1\) scores on the test dataset when using the best three to six input measurement combinations found to produce the highest accuracies for AAA with (Sect. 3.6.2) and without early stopping (Sect. 3.1.4)
Fig. 20

MLP: the log loss cost profiles across the training and validation sets when using the best performing combination containing three to six input measurements for AAA classification and employing early stopping

4 Conclusions

The main conclusion of this study is that machine learning methods have the potential to detect arterial disease, both stenoses and aneurysms, from peripheral measurements of pressure and flow-rate across the network. Amongst the ML methods considered, the tree-based methods of Random Forest and Gradient Boosting perform best for this application (within the limitations of the classifier-specific optimisation performed). Across the different forms of disease, the Gradient Boosting method outperforms Random Forest, Support Vector Machine, Naive Bayes, Logistic Regression, and even the deep learning method of the Multi-layer Perceptron in the setting adopted. It should be noted, however, that the Multi-layer Perceptron results could be improved by problem-specific optimisation of the architecture and fine-tuning of further hyper-parameters. This, however, would come at added complexity and computational cost relative to the easier-to-train methods of Random Forest and Gradient Boosting.

It is demonstrated that maximum \(F_1\) scores larger than 0.9 are achievable for CAS and PAD, larger than 0.85 for SAS, and larger than 0.98 for both low- and high-severity AAAs. The corresponding sensitivities and specificities are both larger than 90% for CAS and PAD, larger than 85% for SAS, and larger than 98% for both low- and high-severity AAAs. While these maximum scores are obtained when all six measurements are used, it is also shown that the performance degradation is less than 5% when using only three measurements and less than 10% when using only two, as long as these measurements are carefully chosen in specific combinations. For the case of AAA, it is further demonstrated that when only a single measurement (on either the left or right side) is used, \(F_1\) scores larger than 0.85 and corresponding sensitivities and specificities larger than 85% are achievable. This encourages AAA monitoring and/or screening through a wearable device, such as the TLT Sapphire monitor (Tarilian Laser Technologies, Welwyn Garden City, U.K.) (Lobo et al. 2019). Confidence in this is further strengthened by the similarly high accuracies reported for AAA classification by Chakshu et al. (2020) (\(\approx 99.9\%\)) and Wang et al. (2021) (sensitivities and specificities of \(\approx 86\%\)). However, the accuracies of multi-class classifiers, as opposed to the binary classifiers assessed here, remain unknown and should be considered to fully assess the ability of machine and deep learning methods for arterial disease detection.

Finally, it is shown through the analysis of several classifiers and their feature importances that, among the measurements, the carotid artery flow-rate is particularly informative for detecting the presence of all four forms of disease considered.
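The kind of feature-importance analysis referred to here can be sketched as follows, assuming scikit-learn; the six measurement names and the synthetic data are illustrative stand-ins, not this study's actual inputs or importance values:

```python
# Sketch: ranking input measurements by impurity-based feature importance
# from a fitted Gradient Boosting classifier. Data and names are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=400, n_features=6,
                           n_informative=3, random_state=0)
names = ["carotid_flow", "radial_pressure", "femoral_pressure",
         "brachial_pressure", "femoral_flow", "radial_flow"]

clf = GradientBoostingClassifier(random_state=0).fit(X, y)
# Importances are normalised to sum to one; sort most informative first.
for name, imp in sorted(zip(names, clf.feature_importances_),
                        key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```

Permutation importance is an alternative that is less biased towards high-cardinality features, and could corroborate such rankings in future work.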

5 Limitations & future work

While high-accuracy classification has been achieved, all classifiers are binary (i.e. the diseases are treated as mutually exclusive). A logical next step, to build on the results presented here, is to relax the assumption of mutually exclusive disease. Thus, classifiers should be built to detect not only the presence of disease, but also to identify its type (potentially concomitant disease in multiple locations), its location, and its severity. This further analysis can be completed in two stages:

  1. The previously created unhealthy VPDs (each containing only one form of disease) can be used to create mixed-disease data sets, i.e. each VP has only one form of disease, but the data sets contain multiple forms of disease. Binary ML classifiers can then be created to predict whether a VP is subject to a particular form of disease, and multiclass classifiers to determine which form of disease a VP has.

  2. New VPDs can be created, in which each VP may contain more than one form of disease. In this case, binary classifiers can be created to predict the presence of each individual form of disease within a VP, and multiclass classifiers to predict the combination of disease forms present.
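The second stage amounts to a multi-label problem, which can be assembled from per-disease binary classifiers. A minimal sketch, assuming scikit-learn and entirely synthetic placeholder data (random measurements and labels, so no meaningful accuracy is implied):

```python
# Sketch: multi-label classification for VPs that may carry several disease
# forms at once, built from one binary classifier per disease form.
# Data and labels are random placeholders for illustration only.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import MultiOutputClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 6))            # six measurements per VP
Y = rng.integers(0, 2, size=(300, 4))    # presence of e.g. CAS, SAS, PAD, AAA

clf = MultiOutputClassifier(RandomForestClassifier(random_state=0)).fit(X, Y)
pred = clf.predict(X[:5])
print(pred.shape)   # one binary prediction per VP per disease form
```

A multiclass formulation over disease combinations would instead encode each combination as a single class, at the cost of exponentially many classes.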

While the results are encouraging, they are produced on a virtual cohort of subjects. Even though the database is physiologically realistic and carefully constructed, real patient behaviour may differ from that in the VPD. Therefore, a future step is to apply the classifiers trained here directly to a small cohort of real-patient measurements. The effect of measurement errors and biases is also ignored in this study and should be considered in future work. Further improvements can also be made, aiming for higher accuracies with fewer, potentially noise- and bias-corrupted, measurements, by:

  • Further optimising the architectures of the machine and deep learning methods (particularly MLP classifiers).

  • Further monitoring individual classifiers for signs of overfitting, and minimising this when needed.