Machine learning for detection of stenoses and aneurysms: application in a physiologically realistic virtual patient database

This study presents an application of machine learning (ML) methods for detecting the presence of stenoses and aneurysms in the human arterial system. Four major forms of arterial disease, carotid artery stenosis (CAS), subclavian artery stenosis (SAS), peripheral arterial disease (PAD), and abdominal aortic aneurysms (AAA), are considered. The ML methods are trained and tested on a physiologically realistic virtual patient database (VPD) containing 28,868 healthy subjects, adapted from the authors' previous work and augmented to include disease. It is found that the tree-based methods of Random Forest and Gradient Boosting outperform other approaches. The performance of the ML methods is quantified through the F_1 score and computation of sensitivities and specificities. When using six haemodynamic measurements (pressure in the common carotid, brachial, and radial arteries; and flow-rate in the common carotid, brachial, and femoral arteries), maximum F_1 scores larger than 0.9 are achieved for CAS and PAD, larger than 0.85 for SAS, and larger than 0.98 for both low- and high-severity AAAs. Corresponding sensitivities and specificities are larger than 90% for CAS and PAD, larger than 85% for SAS, and larger than 98% for both low- and high-severity AAAs. When reducing the number of measurements, performance is degraded by less than 5% when three measurements are used, and by less than 10% when only two measurements are used for classification.
For AAA, it is shown that F_1 scores larger than 0.85, and corresponding sensitivities and specificities larger than 85%, are achievable when using only a single measurement. These results are encouraging for pursuing AAA monitoring and screening through wearable devices that can reliably measure pressure or flow-rate.


Introduction
Two of the most common forms of arterial disease are stenosis, a narrowing of an arterial vessel, and aneurysm, an increase in the area of a vessel. They are estimated to affect between 1 and 20% of the population (Fowkes et al. 2013; Shadman et al. 2004; Mathiesen et al. 2001; Li et al. 2013), and ruptured abdominal aortic aneurysms alone are estimated to cause 6000-8000 deaths per year in the United Kingdom (Darwood et al. 2012). Current methods for the detection of arterial disease are primarily based on direct imaging of the vessels, which can be expensive and hence prohibitive for large-scale screening. If arterial disease can be detected from easily acquirable pressure and flow-rate measurements at select locations within the arterial network, then large-scale screening may be facilitated.
It is likely that the indicative biomarkers of arterial disease in the pressure and flow-rate profiles consist of subtle inter- and intra-measurement details. In the past, detection of arterial disease has been proposed through the analysis of waveforms in combination with mathematical models of pulse wave propagation; see, for example, Sazonov et al. (2017) and Stergiopulos et al. (1992). This, however, requires specification or identification of patient-specific network parameters, which is not easy to perform, especially at large scales.
This study explores the use of machine learning (ML) methods for the detection of stenoses and aneurysms in order to facilitate large-scale, low-cost screening/diagnosis. A data-driven ML approach is adopted, which does not require specification of patient-specific parameters. Instead, such algorithms learn patterns and biomarkers from a labelled data set. ML has a history of being used for medical applications (Kononenko 2001). Classification algorithms have been shown to be able to predict the presence of irregularities in heart valves (Çomak et al. 2007), arrhythmia (Song et al. 2005), and sleep apnea (Khandoker et al. 2009) from recorded time-domain data. Recently, a study reported the successful use of ML methods to estimate pulse wave velocity from radial pressure wave measurements (Jin et al. 2020). Automatic detection, segmentation, and classification of AAAs in CT images are presented in Hong and Sheikh (2016), while severity growth of AAAs is predicted from CT images in Jiang et al. (2020). A previous study (Chakshu et al. 2020) applied deep-learning methods to AAA classification, using a synthetic data set created by varying seven parameters; in that study, accuracies of approximately 99.9% are reported for binary classification of AAA based on three pressure measurements. Furthermore, Wang et al. (2021) achieved a sensitivity of 86.8% and a specificity of 86.3% for early detection of AAA from photoplethysmogram pulse waves, using a synthetic data set created by finding the mean and standard deviation of six cardiovascular properties for subjects of each age decade from 55 to 75 years, and then varying each property, in combination with the others, by ±1 standard deviation from its age-specific mean value. These studies motivate the application of ML to detect arterial disease, both stenoses and aneurysms, using only pressure and flow-rate measurements at select locations in the arterial network. A previous proof-of-concept study (Jones et al. 2021c) showed promising results that ML classifiers can detect stenosis in a simple three-vessel arterial network using only measurements of pressures and flow-rates. Here, these ideas are extended to a significantly larger, physiologically realistic network of the human arterial system. All ML methods are trained and tested on the virtual healthy subject database proposed in Jones et al. (2021a), which is augmented to introduce disease into the virtual subjects.
This study is organised as follows. It begins by briefly explaining the healthy VPD proposed in Jones et al. (2021a). Modifications to this database to create four different forms of arterial disease are presented next, along with the parameterisation of disease forms. This is followed by presentation of the ML methodology and metrics used for quantification of classification accuracies. Finally, these accuracies are assessed when using different combinations of pressure and flow-rate measurements, along with the analysis of patterns and behaviours observed in the ML classifiers.

Methodology
The ML algorithms are trained and tested on a data set containing both healthy subjects and diseased patients.

Healthy subjects
A physiologically realistic VPD containing healthy subjects is created in Jones et al. (2021a) and forms the starting point of this study. This database is available in Jones et al. (2021b). The arterial network contains 71 vessel segments and is shown in Fig. 1, along with the locations where disease occurs with high prevalence, and where measurements of pressure and flow-rate can potentially be acquired (Jones et al. 2021a). The healthy patient database of Jones et al. (2021a) contains 28,868 VPs and is referred to as VPD_H. Disease is introduced into these healthy arterial networks as described next.

Fig. 1 The connectivity of the arterial network, taken from Jones et al. (2021a). The locations of the four forms of disease (see Sect. 2.2.1) and of the six pressure and flow-rate measurements (see Sect. 2.3) are highlighted

Disease forms
The four most common forms of arterial disease are carotid artery stenosis (CAS), subclavian artery stenosis (SAS), peripheral arterial disease (PAD, a form of stenosis), and abdominal aortic aneurysm (AAA) (Jones et al. 2021a; Dyken et al. 1974; Kullo and Rooke 2016; Aboyans et al. 2010; Chen et al. 2013; Li et al. 2013). Their prevalence is restricted to the following vessels, shown in Fig. 1:
- CAS is assumed to affect only the common carotid arteries. For simplification and consistency of notation, these vessels are referred to as the carotid artery chains (CA).
- SAS is assumed to affect the first and second subclavian segments. These two chains of vessels (one on the right and one on the left side) are referred to as the subclavian artery chains (SA).
- PAD is assumed to affect the common iliacs; external iliacs; first and second femoral segments; and the first popliteal segments. These chains are referred to as the peripheral artery chains (PA).
- AAA is assumed to affect the first to fourth abdominal aorta segments. This chain of vessels is referred to as the abdominal aortic chain (AA).
It is assumed that each diseased VP has only one of the four forms of arterial disease. Four complementary databases corresponding to VPD_H are constructed, each pertaining to one form of arterial disease. To create the diseased VPD corresponding to CAS, referred to as VPD_CAS, for every subject in VPD_H, disease is introduced in CA_x (i.e. the left or right carotid artery). This is achieved by taking the arterial network of a subject from VPD_H, artificially introducing a stenosis in CA_x, and then computing the pressure and flow-rate waveforms with a one-dimensional pulse-wave propagation model, which has previously been widely employed, tested, and validated (Boileau et al. 2015; Formaggia et al. 2003; Alastruey et al. 2012; Olufsen et al. 2000; Reymond et al. 2009; Matthys et al. 2007). Note that this model has also been used to study haemodynamics in both stenoses (Boileau et al. 2018; Carson et al. 2019; Jin and Alastruey 2021) and aneurysms (Sazonov et al. 2017; Chakshu et al. 2020; Jin and Alastruey 2021). The numerical implementation of the pulse-wave propagation model employed here is outlined in Jones et al. (2021a) and validated against a discontinuous Galerkin (DCG) scheme (Alastruey et al. 2012), which in turn has been successfully validated against a 3D model of blood-flow through stenosed arterial vessels (Boileau et al. 2018). Thus, VPD_CAS contains 28,868 VPs with CAS. Similarly, the databases corresponding to SAS, PAD, and AAA are created, and referred to as VPD_SAS, VPD_PAD, and VPD_AAA, respectively. The disease severities, locations, and shapes are varied randomly across these databases as described next.

Parameterisation of diseased vessels
The severity of stenoses (percentage reduction in area) is varied between 50 and 95%. The lower limit of 50% is set so that the stenoses are haemodynamically significant (Aboyans et al. 2010; Subramanian et al. 2005), and the upper limit of 95% reflects near-total occlusion. For aneurysms, based on Ernst (1993) and Davis et al. (2013), an allowable range of AAA severities of 4-6 cm diameter is chosen. This corresponds to a cross-sectional area variation of 12.56 to 28.27 cm². With the abdominal aortic area in the reference network (Jones et al. 2021a) lying between 1.76 and 1.09 cm², the corresponding AAA severities are set to vary between 713% (12.56/1.76) and 2,593% (28.27/1.09).

With the above ranges, the parameterisation of area increase/reduction proposed in Jones et al. (2021c) is adopted, see Fig. 2. For a chain of diseased vessels (CA_x, SA_x, PA_x, or AA_x), the normalised area A_n is prescribed as a smooth function of the normalised x-coordinate x_n (Eq. 1 of Jones et al. 2021c), where S represents the severity, b represents the normalised starting location of the disease in the vessel chain, and e represents the normalised end location; A_n is normalised with respect to the healthy version of the vessel in VPD_H, and the sign of the area change creates an aneurysm or a stenosis, respectively. In CA_x, SA_x, and PA_x, the left and right side vessels are chosen with equal probability. The disease severity S, start location b, and end location e are assigned uniform distributions based on physical considerations. To sample values for these parameters, a fourth parameter, the reference location of the disease (represented by r), is introduced; it is used to impose a minimum disease length of 10% of the chain length on the disease profiles. The parameters for disease are thus sampled sequentially from uniform distributions, with the severity bounded as

0.50 ≤ S ≤ 0.95 for stenoses, 7.13 ≤ S ≤ 25.93 for aneurysms. (2)

Based on the above parameterisation, examples of healthy and diseased SA_x, PA_x, and AA_x area profiles are shown in the left and right columns of Fig. 3, respectively.
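The sequential sampling described above (severity first, then a reference location used to enforce a minimum disease length of 10% of the chain) can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name and the exact ordering of the draws are illustrative assumptions, and the paper's Eq. (2) gives the authoritative bounds.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_disease_parameters(min_length=0.1):
    """Sequentially sample stenosis severity and location (a sketch)."""
    # Severity: uniform within the stenosis range quoted in the text
    # (50-95% area reduction); for aneurysms the range 7.13-25.93 applies.
    S = rng.uniform(0.50, 0.95)
    # Reference location r is drawn first so that a minimum disease
    # length of 10% of the chain can be enforced around it.
    r = rng.uniform(0.0, 1.0 - min_length)
    # Start b lies at or before r; end e lies at least min_length after b.
    b = rng.uniform(0.0, r)
    e = rng.uniform(b + min_length, 1.0)
    return S, b, e

S, b, e = sample_disease_parameters()
```

Drawing r before b and e guarantees that a valid interval of at least 10% of the chain length always exists, which is the stated purpose of the fourth parameter.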

Measurements
A review of potential measurements that can be acquired in the network is presented in Jones et al. (2021a). Based on this, the locations at which time-varying pressure and flow-rate measurements can be acquired are shown in Fig. 1 and described below.
To simplify annotation and description, the measurements are as follows:
- Pressure in the carotid arteries: the right and left carotid artery pressures are referred to as P_1^(R) and P_1^(L), respectively.
- Pressure in the radial arteries: referred to as P_3^(R) and P_3^(L), respectively.
- Pressure in the brachial arteries, estimated through reconstruction of finger arterial pressure (Guelen et al. 2008): the right and left brachial artery pressures are referred to as P_2^(R) and P_2^(L), respectively.
- Flow-rate in the carotid, brachial, and femoral arteries, measured using Doppler ultrasound (Byström et al. 1998; Oglat et al. 2018; Radegran 1997): the right and left carotid, brachial, and femoral flow-rates are referred to as Q_1^(R/L), Q_2^(R/L), and Q_3^(R/L), respectively.

Provision of measurements to ML classifiers
Unless specified otherwise, the measurements provided to the ML classifiers are bilateral, i.e. when Q_1 is specified it is implied that both the right and left carotid flow-rates are used. There are, therefore, a total of six bilateral measurements available: three pressures and three flow-rates. To reduce the dimensionality required to describe each pressure or flow-rate measurement, the periodic profiles are described through a Fourier series (FS) representation:

u(t) = Σ_{n=0}^{N} [a_n sin(nωt) + b_n cos(nωt)],

where u represents any pressure or flow-rate profile; a_n and b_n represent the nth sine and cosine FS coefficients, respectively; N represents the truncation order; and ω = 2π/T, with T the time period of the cardiac cycle. It is found in Jones et al. (2021c) that haemodynamic profiles can be described by a FS truncated at N = 5. Thus, each individual measurement is described by 11 FS coefficients, and each bilateral measurement by 22 FS coefficients.

Fig. 3 Examples of healthy and diseased SA_x, PA_x, and AA_x area profiles. The geometrical boundaries between vessel segments that form the chains are indicated by red dashed lines
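As a concrete illustration of this representation, the 11 FS coefficients (with N = 5, the n = 0 sine term vanishes) can be recovered from a sampled waveform by a least-squares fit of the truncated series. The paper does not specify the fitting procedure, so the helper below is an assumed sketch:

```python
import numpy as np

def fourier_coefficients(u, t, T, N=5):
    """Least-squares fit of u(t) ≈ Σ_{n=0}^{N} a_n sin(nωt) + b_n cos(nωt),
    with ω = 2π/T; returns 2N + 1 = 11 coefficients for N = 5."""
    w = 2.0 * np.pi / T
    cols = [np.ones_like(t)]           # b_0 (constant cosine) term
    for n in range(1, N + 1):
        cols.append(np.sin(n * w * t))  # a_n term
        cols.append(np.cos(n * w * t))  # b_n term
    A = np.column_stack(cols)
    coeffs, *_ = np.linalg.lstsq(A, u, rcond=None)
    return coeffs  # order: [b_0, a_1, b_1, a_2, b_2, ..., a_5, b_5]

# Example: recover the coefficients of a synthetic periodic waveform.
T = 1.0
t = np.linspace(0.0, T, 200, endpoint=False)
u = 80 + 20 * np.sin(2 * np.pi * t) + 5 * np.cos(4 * np.pi * t)
c = fourier_coefficients(u, t, T)
```

Because the synthetic waveform lies exactly in the span of the truncated basis, the fit recovers its coefficients (mean 80, first sine harmonic 20, second cosine harmonic 5) to machine precision.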

Machine learning classifiers
A model mapping a vector of input measurements, x, to a discrete output classification, y, can be described as:

y = f(x), with y ∈ {C^(1), C^(2), ...},

where C^(j) represents the jth possible classification. In the context of this study, the measured inputs x represent the FS coefficients of a user-defined combination of the haemodynamic measurements {Q_1, Q_2, Q_3, P_1, P_2, P_3} (see Sect. 2.3.1) taken from VPs, and the output classification represents the corresponding health of those VPs: C^(1) = 'healthy' and C^(2) = 'diseased'. To account for large differences in the magnitudes of the components of x, they are individually transformed with the Z-score standardisation method (Mohamad and Usman 2013) to have zero mean and unit variance.
As previously stated, it is assumed that disease in a patient is limited to only one of the four forms. As a first exploratory study, the ML classifiers are created for each form independently. All classifiers are therefore binary (see Jones et al. 2021c), i.e. four independent classifiers are trained to answer the following question independently: "Does a VP belong to VPD_H or VPD_x?", where x is one of CAS, SAS, PAD, or AAA.

Training and test sets
Each VP in VPD_CAS, VPD_SAS, VPD_PAD, and VPD_AAA shares an identical underlying arterial network, apart from the diseased chain, with the corresponding healthy subject in VPD_H. It is, therefore, important to ensure that the same subset of VPs is not included in both the healthy and diseased data sets used for the ML classifiers. As each form of disease is mutually exclusive, four independent training and test sets, each corresponding to one form of disease, are constructed in the following three stages:
- Step 1: Half of the available VPs are randomly selected from VPD_H for inclusion within the ML data set; this subset is referred to as VPD_H-ML. The unhealthy VPs corresponding to the remaining unused half are taken from the appropriate unhealthy VPD (VPD_CAS, VPD_SAS, VPD_PAD, or VPD_AAA) and incorporated into the ML data set. These subsets are referred to as VPD_CAS-ML, VPD_SAS-ML, VPD_PAD-ML, and VPD_AAA-ML.
- Step 2: The data sets of Step 1 are combined to create four complete data sets, each containing 50% healthy and 50% unhealthy VPs.
- Step 3: The four data sets of Step 2 are randomly split into a training set containing 2/3 of all the VPs in the data set, and a test set containing the remaining 1/3.
The performance of all ML classifiers is evaluated using a fivefold validation. For each fold, the same data set from Step 2 is used, but different subsets are sampled in Step 3 for training and testing.
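The three construction steps and the five-fold resampling can be sketched as follows. This is a minimal sketch operating on subject indices only; the variable names are illustrative, not the authors' code.

```python
import numpy as np

rng = np.random.default_rng(42)
n_subjects = 28_868  # size of VPD_H (and of each diseased database)

# Step 1: healthy VPs from one half; the diseased counterparts of the
# *other* half, so no underlying network appears in both classes.
idx = rng.permutation(n_subjects)
healthy_ids = idx[: n_subjects // 2]
diseased_ids = idx[n_subjects // 2 :]

# Step 2: combined data set of subject ids and labels, 50% per class.
subjects = np.concatenate([healthy_ids, diseased_ids])
labels = np.concatenate([np.zeros(len(healthy_ids), dtype=int),
                         np.ones(len(diseased_ids), dtype=int)])

# Step 3, repeated for each of the five folds: a fresh random 2/3-1/3 split.
folds = []
for _ in range(5):
    order = rng.permutation(len(subjects))
    n_train = (2 * len(subjects)) // 3
    folds.append((order[:n_train], order[n_train:]))
```

The disjointness of `healthy_ids` and `diseased_ids` is what prevents a classifier from seeing (nearly) the same underlying arterial network in both classes.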

ML methods
The purpose of this study is to perform an initial exploratory investigation into the possibility of using ML classifiers to detect different forms of arterial disease. The focus is, therefore, on uncovering patterns and behaviours, such as which haemodynamic measurements are particularly informative, rather than on optimisation to achieve ever higher accuracies. Given this objective, it is not feasible to perform extensive optimisation and analysis on every single ML classifier trained and tested. Thus, the ML methods used are chosen based on their "robustness", i.e. minimal sensitivity to the hyper-parameters and minimal susceptibility to problems such as overfitting, relative to more complex deep learning methods. Five ML methods are employed (see Table 1): Random Forest (RF); Gradient Boosting (GB) (Friedman 2001; Elith et al. 2008); Naive Bayes (NB) (Rish et al. 2001a, b); Support Vector Machine (SVM) (Kecman 2005); and Logistic Regression (LR) (Sperandei 2014; Hilbe 2009; Jones et al. 2021c). These methods encompass a range of probabilistic and non-probabilistic modelling approaches, while fulfilling the aforementioned characteristics. Along with these five ML methods, one deep learning method, the Multi-layer Perceptron (MLP) (Murtagh 1991), is also employed for comparison. It is a priori expected that MLP classifiers will not perform to their full potential in this study, as they are more reliant on complex hyper-parameter optimisation and monitoring for overfitting than the five ML methods. The use of MLPs will, however, provide some, albeit limited, comparison of ML and deep learning methods. Standard implementations of all the above algorithms from the Python package Scikit-learn (Pedregosa et al. 2011) are used.
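The six classifier types can be instantiated with their Scikit-learn implementations as below. This is a sketch on synthetic stand-in data (22 features mimicking one bilateral measurement); apart from the options named later in the text (the 'liblinear' LR solver, the RBF SVM kernel with 'scale' gamma, the logistic MLP activation), all settings are assumed defaults.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the FS-coefficient inputs of one bilateral measurement.
X, y = make_classification(n_samples=600, n_features=22, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=1/3, random_state=0)

classifiers = {
    "RF": RandomForestClassifier(random_state=0),
    "GB": GradientBoostingClassifier(random_state=0),
    "NB": GaussianNB(),                       # normal input distribution
    "SVM": SVC(kernel="rbf", gamma="scale"),  # RBF kernel, as in the text
    "LR": LogisticRegression(solver="liblinear"),
    "MLP": MLPClassifier(activation="logistic", max_iter=500, random_state=0),
}
scores = {name: clf.fit(X_tr, y_tr).score(X_te, y_te)
          for name, clf in classifiers.items()}
```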
Some of these methods require optimisation of the hyperparameters. This is described after presenting performance quantification metrics in the next section.

Quantification of results
Classifier performance is assessed by two metrics: sensitivity and specificity in combination, and the F_1 score. Figure 4 shows the definitions of sensitivity, specificity, and the F_1 score, along with the related concepts of precision and recall commonly used in the assessment of classifiers. It is desirable for both sensitivity and specificity to be high. Similarly, a higher F_1 score is desirable. Since the F_1 score is a single scalar metric that balances both precision and recall, it is a good metric for comparing classifiers when tuning the hyper-parameters of ML algorithms. For a discussion of these metrics and their relevance, please refer to Jones et al. (2021c).
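The three metrics follow directly from the confusion-matrix counts defined in Fig. 4, as the small helper below shows (the function name is illustrative):

```python
def sensitivity_specificity_f1(tp, fn, fp, tn):
    """Metrics from confusion-matrix counts, per the definitions in Fig. 4:
    sensitivity = recall = TP/(TP+FN); specificity = TN/(TN+FP);
    F1 = harmonic mean of precision and recall."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return sensitivity, specificity, f1

# Example: 90 of 100 diseased and 95 of 100 healthy VPs identified correctly.
sens, spec, f1 = sensitivity_specificity_f1(tp=90, fn=10, fp=5, tn=95)
```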

Hyper-parameter optimisation
The architecture of LR, NB, and SVM classifiers can all be considered to be problem independent. While these three algorithms are able to undergo varying levels of problem specific optimisation, the underlying structure of the classifier usually does not change. The architectures of RF, MLP, and GB classifiers, however, are dependent on the specific problem. The architecture choices for the classifiers and associated hyper-parameter optimisation are described next. For all six methods, all hyper-parameters that are neither optimised nor specified in the text are set to their default values within Scikit-learn (Pedregosa et al. 2011).

LR, SVM, and NB
For LR, the 'LIBLINEAR' solver offered by the Scikit-learn (Pedregosa et al. 2011) package is chosen. In the case of SVM, a kernel is typically chosen to map the input measurements to a higher order feature space (Jakkula 2006). All SVM classifiers use a radial basis function kernel (Scholkopf et al. 1997), with the Scikit-learn hyper-parameter 'gamma' set to 'scale'. In the case of NB, the distribution of input measurements across the data set is chosen to be normal (Murphy et al. 2006).

Random Forest
In the case of RF, the number of trees in the ensemble and the maximum depth of each tree are optimised. Other hyper-parameters that can be tuned include the minimum number of data points allowed in a leaf node, and the maximum number of different features considered for splitting each node. However, the effect of these is not investigated here. To optimise the two hyper-parameters, a grid search is carried out. A grid is constructed by discretising the possible number of trees within the ensemble between 10 and 400 at intervals of 10, and the possible depth of each tree between 20 and 200 at intervals of 10. RF classifiers are trained for every combination with all six pressure and flow-rate measurements (see Sect. 2.3.1) across all four forms of arterial disease. The hyper-parameters describing the architecture that produces the highest F_1 score are found for each form of disease, and this combination of hyper-parameters is then chosen for all subsequent classifiers. The optimal hyper-parameters for each of the four forms of disease are shown in Table 2, along with the F_1 score achieved by each.

Fig. 4 The relationship between sensitivity, specificity, recall, and precision. TP: True Positive, representing VPs belonging to a classification correctly identified; FN: False Negative, representing VPs belonging to a classification incorrectly identified; FP: False Positive, representing VPs not belonging to a classification incorrectly identified; and TN: True Negative, representing VPs not belonging to a classification correctly identified
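A grid search of this kind can be sketched with `GridSearchCV`; a much coarser grid and synthetic data are used here to keep the example cheap (the paper's grid spans 10-400 trees and depths 20-200), and the F_1 score is the selection criterion as in the text:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data; 22 features mimic one bilateral measurement.
X, y = make_classification(n_samples=400, n_features=22, random_state=0)

# Coarsened version of the paper's grid, for illustration only.
grid = {"n_estimators": [10, 50, 100], "max_depth": [20, 60, 100]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      grid, scoring="f1", cv=3)
search.fit(X, y)
best = search.best_params_  # architecture with the highest F1 score
```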
It is unlikely that a single architecture will consistently produce the best results when varying the combination of input measurements. In this study, re-optimisation of the hyper-parameters when varying the input measurement combination is not performed, to minimise computational cost. It is found that when using all six pressure and flow-rate measurements, the F_1 score produced is relatively insensitive to the hyper-parameters used. Thus, it is likely that a reasonable representation of the maximum achievable accuracy can be obtained for various input measurement combinations by a single architecture. It should be noted, however, that further improvements in classification accuracy may be possible with such re-optimisation.

Gradient Boosting
Similar to the RF architecture, the GB architecture is optimised by varying the number of trees within the ensemble and the maximum depth of each tree. Other hyper-parameters that may be varied, but are not considered here, are the minimum number of data points allowed in a leaf node, the maximum number of different features considered for splitting each node, and the impact of each tree on the final outcome (i.e. the learning rate). A grid search is carried out to find the combination producing the highest F_1 score when using all six input measurements. It is common for GB classifiers to use weaker, shallower decision trees (relative to RF classifiers) to deliberately create high bias and low variance (Hastie et al. 2009). The possible depth of each tree is, therefore, discretised between 2 and 20 at intervals of 1. As a high number of trees is not required to compensate for overfitting, contrary to the RF method, the possible number of trees within the ensemble is discretised between 10 and 100 at intervals of 10. The optimal hyper-parameters for each of the four forms of disease are shown in Table 3.

Multi-layer perceptron
As is common with deep learning methods, MLP classifiers have significantly more hyper-parameters that can be optimised relative to the five ML methods, such as Gradient Boosting or Random Forest. Examples of hyper-parameters that significantly affect the performance of an MLP classifier include the batch size, learning rate, activation functions, drop-out, and the number of units per hidden layer. In keeping with the exploratory stance of this study, only the number of neurons within each hidden layer and the number of hidden layers are optimised. For simplification, it is assumed that all hidden layers contain an identical number of neurons. A logistic activation function is used for all hidden layers. It is likely that this simplistic hyper-parameter optimisation will limit the classification accuracy achieved by the MLP classifiers.
Similar to the RF and GB methodology, the hyper-parameters that produce the highest F_1 score are found through a grid search. The number of neurons within each layer is discretised between 10 and 200 at intervals of 10, and the number of hidden layers is discretised between 1 and 6 at intervals of 1. The optimal hyper-parameters found for each of the four forms of disease are shown in Table 4. It shows that, relative to RF and GB, there is less consistency in the maximum F_1 scores achieved by MLP: it classifies AAA and CAS to high levels of accuracy, but performs relatively poorly for SAS and PAD.

Input measurement combination search
There are 63 possible combinations of input measurements that can be provided to an ML classifier from the six bilateral pressure and flow-rate measurements (see Sect. 2.3.1). A combination search is performed for each of the four forms of disease. For every combination of input measurements, all six ML classification methods are trained and subsequently tested to quantify their performance. The average F_1 score, sensitivity, and specificity for each case across five folds are recorded. Combinations of interest are then further analysed.
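The 63 combinations are simply the non-empty subsets of the six bilateral measurements, which can be enumerated directly (the measurement labels follow the notation of Sect. 2.3.1):

```python
# Enumerating the 63 (= 2^6 - 1) non-empty combinations of the six
# bilateral measurements that can be supplied to a classifier.
from itertools import combinations

measurements = ["P1", "P2", "P3", "Q1", "Q2", "Q3"]
subsets = [c for r in range(1, len(measurements) + 1)
           for c in combinations(measurements, r)]
```

Each subset would then define one input vector of concatenated FS coefficients (22 per bilateral measurement) for the classifiers described above.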

Overfitting and early stopping criterion
To assess any overfitting by the ML and deep-learning methods, the log-loss costs across the training and test sets are recorded at each sequential iteration of the training process (up to the 200th iteration). At a low number of training iterations, both the training and test costs are expected to be high, as the classifiers can neither fit the training data nor generalise to the test data. As the training process progresses, the training and test costs are both expected to decay before converging to stable values in the absence of overfitting. In the case of overfitting, however, while the training costs continue to decrease, the test costs pass through a minimum and then successively increase. In such cases, an early stopping criterion (Prechelt 1998; Yao et al. 2007) is adopted to avoid overfitting. A third partition of the available data (the validation set) is introduced. The combined healthy and unhealthy data sets described in Sect. 2.4.1 are split so that the training set contains 50%, the validation set 25%, and the test set 25% of the available data. Classifiers are trained on the training set; the stopping criterion, however, is based on the log-loss cost on the validation set. At each sequential iteration in the training process, the average log-loss cost is computed across the validation set. If more than 75 iterations have been performed, and the improvement in the validation log-loss cost between two consecutive iterations is less than 1 × 10⁻³, training is stopped. The final classifier accuracy is assessed on the test set.
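The stopping rule itself (more than 75 iterations done, and a consecutive-iteration improvement in validation log-loss below 1e-3) can be sketched as a standalone function; the function name and the synthetic cost curve are illustrative assumptions:

```python
def stop_iteration(val_costs, min_iters=75, tol=1e-3):
    """Return the 1-based iteration at which training stops under the rule
    described above, or the last iteration if the rule never triggers."""
    for i in range(1, len(val_costs)):
        improvement = val_costs[i - 1] - val_costs[i]
        # Stop only once more than min_iters iterations have completed and
        # the validation log-loss improvement falls below the tolerance.
        if (i + 1) > min_iters and improvement < tol:
            return i + 1
    return len(val_costs)

# Synthetic validation-cost curve: rapid decay, then a slowly flattening tail.
costs = [1.0 / (k + 1) for k in range(200)]
stopped_at = stop_iteration(costs)
```

For this curve the per-iteration improvement is already below 1e-3 well before iteration 75, so training halts at the first iteration allowed by the minimum-iteration guard, iteration 76.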

Results and discussion
The full tables of results achieved for CAS, SAS, PAD, and AAA classification are shown in Appendices A, B, C and D, respectively. The scores achieved by each ML method and combination of input measurements are shown for CAS, SAS, PAD, and AAA classification in Figs. 5, 6, 7, and 8, respectively. They show that for all forms of arterial disease, NB and LR classifiers consistently produce low accuracy. It has previously been shown in the PoC (Jones et al. 2021c) that the partition between the pressure and flow-rate profiles taken from healthy and stenosed patients is likely to be nonlinear. The fact that LR consistently produces low-accuracy results supports this finding, as LR is the only linear classification method used. The finding that NB classifiers produce low-accuracy classification is also consistent with the results of the PoC (Jones et al. 2021c), which found that the NB method is poorly suited to the problem of distinguishing between haemodynamic profiles. On the contrary, across all four forms of disease, the tree-based methods (RF and GB) consistently produce high-accuracy results. This finding contradicts that of the PoC (Jones et al. 2021c) and is likely due to inadequate architecture optimisation in, or the unsuitability of RF to, the smaller network used in the PoC (Jones et al. 2021c). The fact that both RF and GB classifiers produce high-accuracy classification in this study not only suggests that tree-based methods are well suited to distinguishing between haemodynamic profiles, but also emphasises the importance of adequate architecture optimisation. There is less consistency in the results achieved by SVM and MLP classifiers when detecting different forms of disease.

Fig. 5 The F_1 scores achieved for CAS using each combination of bilateral input measurements. Measurements included within each combination are highlighted with a black square
SVM classifiers produce accuracies comparable with RF and GB classifiers in the case of AAA detection; however, they produce low-accuracy results for the three other forms of disease. MLP classifiers produce accuracies comparable with RF and GB classifiers in the case of CAS and AAA detection; however, they produce relatively low-accuracy results for SAS and PAD classification. Overall, it is found that the tree-based methods of RF and GB perform best, with GB performance slightly superior to that of RF. It is important to remember, however, that the results presented here do not necessarily capture the full potential of each method, and instead only reflect the accuracies achieved within the limitations of the simplistic hyper-parameter optimisation, a consideration particularly important for MLP.

Measurement combinations
To investigate the importance of both the number of input measurements provided to the ML algorithms and the specific combination of measurements, the average F_1 scores achieved by all classifiers when providing only one, two, three, four, five, or six input measurements are found. In each case, the specific combinations that achieve the maximum and minimum F_1 scores are also recorded. These results for the different forms of disease are presented next.

CAS classification
The average, maximum, and minimum F 1 scores achieved when providing different numbers of input measurements for CAS classification are shown in Fig. 9.
It shows that the NB and LR classifiers consistently produce an F 1 score of approximately 0.5, which is comparable to naive classification, i.e. randomly assigning the health of VPs with an equal probability to each outcome. SVM performs slightly better, with F 1 scores averaging 0.5 to 0.6. The other three classification methods (RF, MLP, and GB) perform significantly better, with F 1 scores generally averaging between 0.7 and 0.95 and showing a clear increase in the average F 1 score as the number of input measurements increases. While the average and minimum F 1 scores achieved by the RF and GB classifiers continuously increase, the maximum F 1 score quickly reaches a plateau (at one input measurement for RF and three input measurements for GB). For a fixed number of measurements, the wide range of F 1 scores in Fig. 9 across all classifiers suggests that specific combinations of measurements may be more important than others for optimal classification. To explore this further, the combinations of input measurements that produce the highest F 1 scores, and the corresponding accuracies, when employing the RF and GB methods are shown in Table 5. Two observations are made from this table. First, for a fixed number of measurements, the best combinations are not identical for the two methods. For example, when two measurements are used, the best combination for RF is (Q 2 , Q 1 ), while the best combination for GB is (P 2 , P 1 ). This suggests that the best combination of measurements is likely dependent on the particular ML method chosen. Second, some patterns stand out with respect to which measurements may be more informative than others. For example, across Table 5, Q 1 appears in 11 out of 12 combinations, and P 1 appears in 8 out of 12 combinations. This suggests that Q 1 is most informative for identifying the presence of CAS, followed by P 1 .

Fig. 6
The F 1 scores achieved for SAS using each combination of bilateral input measurements are shown. Measurements included within each combination are highlighted with a black square
Physiologically, this is not surprising, as Q 1 and P 1 are the flow-rates and pressures in the carotid arteries and the disease under consideration is carotid artery stenosis. It is encouraging that the ML methods are indeed placing more importance on the relevant physiological measurements. In fact, it is remarkable that RF and GB both achieve F 1 scores above 0.85 and sensitivities and specificities larger than 85% with only one measurement. Also notable is that these accuracies can be raised beyond 93% (see the GB row for 3 measurements in Table 5) by adding two more measurements, as long as the additional measurements are carefully chosen.
An interesting pattern to note is that while the average and minimum F 1 scores achieved by the MLP classifiers continuously increase in Fig. 9, the maximum F 1 score decreases beyond three input measurements. The maximum F 1 scores achieved by the MLP classifiers, and the corresponding sensitivities and specificities, when using three to six input measurements are shown in Table 6. It shows that the decrease in F 1 scores is accompanied by an associated decrease in both the sensitivities and specificities, as opposed to a trade-off between them (an increase in sensitivity and decrease in specificity, or vice versa). This behaviour is unusual, as intuitively more input measurements should generally provide more information. This finding may suggest that the MLP classifiers are able to extract maximum information from the haemodynamic profiles using as few as three input measurements, and may be susceptible to overfitting when using more than three measurements, leading to reduced generalisation capabilities and consequently decreased accuracies.

Fig. 7
The F 1 scores achieved for PAD using each combination of bilateral input measurements are shown. Measurements included within each combination are highlighted with a black square
To investigate any overfitting, the log loss costs for the training and test sets during the training process are shown in Fig. 10 for the best measurement combinations identified for the MLP, GB, and RF classifiers (Tables 5 and 6). The RF and GB methods show no signs of overfitting. For the MLP, while the three-measurement case also shows no overfitting, the cases with four, five, and six measurements show an increase in test costs beyond 50-100 training iterations, implying overfitting, the extent of which worsens as the number of measurements increases. Such behaviour for the MLP is also observed for SAS and PAD, and thus for the MLP method an early stopping criterion is adopted (see Sect. 2.7), the results of which are presented in Sect. 3.6.
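The train/test log-loss monitoring used for this diagnosis can be sketched as follows. This is a minimal illustration, assuming synthetic data in place of the VPD and scikit-learn's GB implementation, whose staged_predict_proba exposes the model after each boosting iteration.

```python
# Sketch of tracking log loss on the training and test sets across boosting
# iterations: a rising test cost while the training cost keeps falling
# indicates overfitting.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import log_loss
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=600, n_features=20, random_state=0)
Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)

gb = GradientBoostingClassifier(n_estimators=150, random_state=0).fit(Xtr, ytr)
train_cost = [log_loss(ytr, p) for p in gb.staged_predict_proba(Xtr)]
test_cost = [log_loss(yte, p) for p in gb.staged_predict_proba(Xte)]

# The iteration of minimum test cost is a natural early-stopping point.
best_iter = int(np.argmin(test_cost))
print(best_iter, round(test_cost[best_iter], 4))
```

The same cost-versus-iteration curves, computed per training epoch for the MLP, are what Fig. 10 plots.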

Fig. 8
The F 1 scores achieved for AAA using each combination of bilateral input measurements are shown. Measurements included within each combination are highlighted with a black square

Fig. 9
The average, maximum, and minimum F 1 scores achieved by all classifiers trained using different numbers of input measurements are shown for carotid artery stenosis classification. The central markers represent the average score achieved, while the error bars indicate the upper and lower limits

SAS classification
The results of the analysis for SAS classification are shown in Fig. 11. As is seen in the case of CAS classification, Fig. 11 shows that NB, LR, and SVM classifiers consistently produce accuracies comparable to naive classification, irrespective of the number of input measurements used. A clear difference between Figs. 9 and 11 is the accuracy achieved by MLP classifiers. Compared to the CAS case, the MLP performance is further degraded for SAS, while still being better than NB, LR, and SVM, although only marginally. It is important to consider, however, that the MLP classifiers are experiencing overfitting, as highlighted in Sect. 3.1.1. Results with overfitting avoided by adopting an early stopping criterion are presented in Sect. 3.6.
A high degree of similarity can be seen between the behaviours of the RF and GB classifiers for CAS and SAS. Figure 11 shows that the average and minimum F 1 scores achieved by the RF and GB classifiers continuously increase as the number of input measurements increases. The maximum F 1 score is seen to quickly reach an asymptotic limit (at three input measurements for both RF and GB classifiers). A peak F 1 score of approximately 0.85 is achieved by GB, along with sensitivities and specificities higher than 85%.
The combinations of input measurements that produce the highest F 1 scores and the corresponding accuracies are shown in Table 7. It shows a higher degree of consistency between the best combinations for the two methods relative to the case for CAS, i.e. the best combinations are generally identical (or with minimal differences) between RF and GB. It also shows that Q 1 is particularly informative, with this measurement appearing in all of the best combinations. Physiologically, this may be due to its proximity to the disease location.

Table 5
The combinations of input measurements that produce the maximum F 1 scores when providing one to six input measurements and employing the RF and GB methods to detect CAS. The corresponding sensitivities and specificities are also included. Columns: Method, Combination, F 1 score, Sens., Spec.

Table 6
The combinations of input measurements that produce the maximum F 1 scores when providing three to six input measurements and employing the MLP method to detect CAS. The corresponding sensitivities and specificities are also included. Columns: Number of input measurements, Combination, F 1 score, Sensitivity, Specificity

Fig. 10
The average log loss cost across the training and test sets during the training process when using the combinations of three to six input measurements that achieve the highest accuracies for the RF, GB, and MLP methods (Tables 5 and 6)

PAD classification
The results for PAD classification are shown in Fig. 12. Comparing Figs. 11 and 12, a high degree of similarity is seen between the behaviours of SAS and PAD classification. As previously seen for SAS classification, Fig. 12 shows that the NB, LR, and SVM methods all consistently produce accuracies comparable to naive classification. While the MLP method performs slightly better than naive classification, its accuracy remains relatively low. High accuracy can be seen in Fig. 12 for the two tree-based methods of RF and GB. As previously seen for CAS and SAS, while the average and minimum F 1 scores achieved by the RF and GB methods increase as the number of input measurements increases, the maximum F 1 score quickly reaches an asymptotic limit (at three input measurements for both methods). The combinations of input measurements that produce the highest F 1 scores for PAD classification when employing the RF and GB methods are shown in Table 8. It not only shows good consistency between the combinations that produce the highest F 1 scores for each of the two ML methods, but also good agreement with the combinations presented in Table 7. Very similar combinations of input measurements (with some minor differences) produce the highest F 1 score for all numbers of input measurements. As previously observed in Tables 5 and 7, the input measurement Q 1 appears to be most informative, appearing in all the best scoring classifiers. Since the location of Q 1 is far from the location of the disease, it is not obvious why this measurement is particularly informative of PAD.

AAA classification
The results for AAA classification are shown in Fig. 13. As previously seen for all three other forms of disease, the NB and LR classifiers consistently produce accuracies comparable to naive classification, irrespective of the number of input measurements used. The consistency of this finding (as seen in Figs. 9, 11, and 12), irrespective of the form of disease being classified, highlights both the importance of nonlinear partitions between healthy and unhealthy VPs and the unsuitability of the NB method for distinguishing between haemodynamic profiles. In the case of AAA classification, the SVM, RF, MLP, and GB methods consistently produce good accuracies. Figure 13 shows that these methods produce high accuracies even with a single input measurement. While there is some increase in the average F 1 score as the number of input measurements increases, due to the very high initial average F 1 score achieved (when using a single input measurement) this increase is limited (as the F 1 score cannot exceed 1). Two possible reasons for the higher accuracies in aneurysm classification relative to stenosis classification are:

- Aneurysms, owing to an increase in area as opposed to the decrease in area for stenoses, may produce more significant or consistent biomarkers in the pressure and flow-rate profiles. This hypothesis is supported by Low et al. (2012), which found that even low-severity AAAs have a global impact on the pressure and flow-rate profiles.

- While the severities of aneurysms cannot be directly compared to the severities of stenoses, it may be that the severity of aneurysms in VPD AAA is disproportionately large relative to the severities of stenoses. The significance of any indicative biomarkers introduced into pressure and flow-rate profiles is likely to be proportional to the severity of the change in area. This implies that the increase in vessel area of 712-2,593% in VPD AAA is perhaps on the extreme end of aneurysm severity, thereby making the classifications relatively easier. This is further explored in Sect. 3.4.

Fig. 11
The average, maximum, and minimum F 1 scores achieved by all classifiers trained using different numbers of input measurements are shown for SAS classification. The central markers represent the average score achieved, while the error bars indicate the upper and lower limits

Table 7
The combinations of input measurements that produce the maximum F 1 scores when providing one to six input measurements and employing the RF and GB methods to detect SAS. The corresponding sensitivities and specificities are also included. Columns: Method, Combination
The combinations of input measurements that produce the highest F 1 scores when providing one to six input measurements and employing the RF and GB methods are shown for AAA classification in Table 9. It shows that F 1 scores range from 0.97 to 0.997, and sensitivities and specificities range from 96.5% to 99.8%. Due to the high accuracies across all numbers of measurements, the analysis of specific combinations is not very meaningful. However, the measurement Q 1 again appears in all the best combinations. It should also be noted that the high accuracies for AAA classification are consistent with those reported in Chakshu et al. (2020). Overall, the results show that the physiological changes to the waveforms induced by both stenoses and aneurysms (Stergiopulos et al. 1992; Low et al. 2012) are well captured by the data-driven machine learning methods.

Fig. 12
The average, maximum, and minimum F 1 scores achieved by all classifiers trained using different numbers of input measurements are shown for PAD classification. The central markers represent the average score achieved, while the error bars indicate the upper and lower limits

Table 8
The combinations of input measurements that produce the maximum F 1 scores when providing one to six input measurements and employing the RF and GB methods to detect PAD. The corresponding sensitivities and specificities are also included. Columns: Number of input measurements

Importance of carotid artery flow-rate
Appendices A-D, along with the above analysis, show that classifiers trained using flow-rates in the common carotid arteries (Q 1 ) consistently produce the highest accuracy. To analyse this further, the F 1 scores of classifiers with combinations that include and exclude Q 1 are separated and compared for CAS, SAS, PAD, and AAA in Figs. 14, 15, 16, and 17, respectively. These figures show histograms of the F 1 scores, i.e. the number of occurrences (classifiers/combinations) including and excluding Q 1 in each F 1 score bucket. For each disease form, results are only shown for the classification methods that consistently produce good results for that disease form. The figures show a clear positive shift in the histograms when Q 1 is included, pointing to the particularly informative nature of Q 1 . Other behaviours observed from these figures are:

- While there is generally an increase in F 1 score when including Q 1 , the maximum accuracies are relatively less sensitive to the inclusion of Q 1 .

- The greatest distinction between F 1 scores when including or excluding Q 1 is seen for CAS classification when using the RF method. There is no overlap between the two RF histograms in Fig. 14.

- Observing the lower plots in Figs. 15 and 16, a clear subgroup of low-accuracy classifiers can be seen when excluding Q 1 for SAS and PAD, which does not exist when including Q 1 .
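The grouping behind these histograms can be sketched as follows. The combination-to-score mapping here is purely illustrative (the real values come from the combination search in Appendices A-D), and the measurement labels are the hypothetical Q/P names used throughout.

```python
# Sketch of separating combination F1 scores by whether Q1 is included, then
# binning each group into a shared set of histogram buckets.
import numpy as np

# Hypothetical (combination -> F1 score) results for illustration only
f1_by_combo = {
    ("Q1",): 0.88, ("P1",): 0.71, ("Q1", "P1"): 0.91,
    ("Q2", "P2"): 0.55, ("Q1", "Q2"): 0.90, ("P1", "P2"): 0.66,
}

with_q1 = [s for c, s in f1_by_combo.items() if "Q1" in c]
without_q1 = [s for c, s in f1_by_combo.items() if "Q1" not in c]
print(round(float(np.mean(with_q1)), 3), round(float(np.mean(without_q1)), 3))

# Shared bins make the positive shift of the "with Q1" group visible
bins = np.linspace(0.0, 1.0, 11)
hist_with, _ = np.histogram(with_q1, bins=bins)
hist_without, _ = np.histogram(without_q1, bins=bins)
```

Plotting the two count arrays against the bin edges reproduces the paired upper/lower histograms of Figs. 14-17.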

Fig. 13
The average, maximum, and minimum F 1 scores achieved by all classifiers trained using different numbers of input measurements are shown for AAA classification. The central markers represent the average score achieved, while the error bars indicate the upper and lower limits

Table 9
The combinations of input measurements that produce the maximum F 1 scores when providing one to six input measurements and employing the RF and GB methods to detect AAA. The corresponding sensitivities and specificities are also included. Columns: Number of input measurements

Feature importance
An important aspect of the GB method is that the measurement importance, which quantifies the influence that individual measurements have on classification, can be computed. The split-improvement feature importance (Zhou and Hooker 2020) of a feature can be thought of as the contribution of that feature to the total information gain achieved in a decision tree, averaged across all the trees in the ensemble. A high feature importance suggests that the given feature contributes heavily to the classification accuracies achieved. As the features provided to the GB classifiers are the FS coefficients describing the haemodynamic profiles, the total importance of each bilateral pressure or flow-rate measurement is found by summing the feature importances of the associated 22 FS coefficients. The total importance of each input measurement for each disease form is shown in Table 10. Three important observations from this table are:

- The input measurement Q 1 consistently produces the highest importance for all forms of disease. This finding supports the findings of Sect. 3.2.

- The importance of each input measurement changes between disease forms based on spatial proximity to the disease location. Generally, the measurements in close proximity to the disease location have higher importance. For example, the importance of Q 3 (flow-rate in the femoral arteries) is highest for PAD classification (see Fig. 1 for locations of disease and measurements). Similarly, P 1 (pressure in the carotid arteries) has the highest importance for CAS and SAS.

- The feature importances, when viewed in collection, also shed some light on why Q 1 is important for SAS and PAD even though the measurement location is far from the disease location. For SAS, the two most informative measurements are Q 1 and Q 2 , and for PAD, these are Q 1 and Q 3 . From Fig. 1, it is clear that these combinations form pairs of flow-rates before and after/at the disease location. Thus, the measurement locations bound the disease location, providing more information on the presence of disease.
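The per-measurement aggregation described above can be sketched as follows. This is an illustrative reconstruction assuming scikit-learn's GB implementation and synthetic data: each measurement contributes a contiguous block of FS coefficients, and a class-dependent shift is injected into the first block so that it carries the signal.

```python
# Sketch of summing GB split-improvement feature importances over the 22
# Fourier-series coefficients belonging to each measurement.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
MEASUREMENTS = ["Q1", "Q2", "Q3", "P1", "P2", "P3"]  # hypothetical labels
N_COEFFS = 22  # FS coefficients per measurement, as in the paper

n = 300
y = rng.integers(0, 2, n)
X = rng.normal(0, 1, (n, len(MEASUREMENTS) * N_COEFFS))
X[:, :N_COEFFS] += 1.5 * y[:, None]  # make the 'Q1' block informative

gb = GradientBoostingClassifier(n_estimators=50, random_state=0).fit(X, y)

# feature_importances_ has one entry per FS coefficient; sum per block
per_measurement = {
    m: float(gb.feature_importances_[i * N_COEFFS:(i + 1) * N_COEFFS].sum())
    for i, m in enumerate(MEASUREMENTS)
}
print(max(per_measurement, key=per_measurement.get))  # the informative block
```

Because scikit-learn normalises feature_importances_ to sum to one, the per-measurement totals are directly comparable across disease forms, as in Table 10.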

Lower severity aneurysms
In Sect. 3.1.4, it is found that AAAs can be classified to a very high level of accuracy with only one input measurement. Whether these accuracies are affected when lower severity aneurysms are considered is assessed here. For this assessment, a new lower severity AAA VPD, referred to as VPD AAA-L , is created in an identical manner to the other diseased databases (see Sect. 2.2), with the following two differences:

- The severity of aneurysms introduced into the virtual subjects (see Sect. 2.2.2) is sampled from a uniform distribution bounded as follows: 3.0 ≤ S aneurysm ≤ 7.0.

- To reduce the computational expense associated with the creation of virtual patients, the size of VPD AAA-L is restricted to 5,000 VPs.
A combination search is carried out with only the GB method, as it is the best overall method. The F 1 scores, sensitivities, and specificities achieved by all the measurement combinations are presented in Appendix E. For comparison, the GB F 1 scores for all forms of disease (including AAA-L) are shown in Appendix F. The ratios of the GB F 1 scores achieved for AAA-L classification relative to AAA classification are shown in Fig. 18. The observations from this figure are:

- The F 1 scores for AAA-L classification are consistently lower (by 1% to 10%) than those for AAA classification. This finding supports the physiological expectation that the significance of biomarkers in pressure and flow-rate profiles is proportional to the severity.

- The ratios of F 1 scores are lowest for combinations of inputs that predominantly rely on pressure measurements. This suggests that pressure measurements are, in general, less informative about disease severity, in agreement with the generally lower feature importance of pressure measurements in Table 10.

- The F 1 score ratios are highest for input combinations that include Q 1 . This further suggests that Q 1 contains consistent biomarkers.

- The ratios range between 0.9 and 0.99, implying a maximum degradation of only 10% relative to the high-severity classification accuracies. Thus, even for low-severity aneurysms, many combinations of classifiers achieve F 1 scores higher than 0.95 and corresponding sensitivities and specificities larger than 95%.

Fig. 16
The histograms of the F 1 scores achieved for PAD classification are shown for all input measurement combinations that include Q 1 in the upper plot and exclude Q 1 in the lower plot

Fig. 17
The histograms of the F 1 scores achieved for AAA classification are shown for all input measurement combinations that include Q 1 in the upper plot and exclude Q 1 in the lower plot

Unilateral aneurysm measurement tests
Hitherto, all ML classifiers used bilateral measurements, i.e. both the right and left instances of each measurement were simultaneously provided. Here, the ability of unilateral measurements, i.e. only the right or left instance of a measurement, to detect AAAs is assessed. This analysis is restricted to the GB method as it consistently outperforms other methods.
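The per-classifier evaluation used in this section can be sketched as follows. This is a minimal illustration, not the paper's pipeline: synthetic FS coefficients for a single (unilateral) measurement stand in for the VPD data, and sensitivity and specificity are read off the confusion matrix.

```python
# Sketch of training a GB classifier on one unilateral measurement and
# reporting sensitivity (true-positive rate) and specificity (true-negative
# rate) from the confusion matrix.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n, n_coeffs = 500, 22                      # 22 FS coefficients, as in the paper
y = rng.integers(0, 2, n)                  # 1 = AAA present (synthetic labels)
X = rng.normal(0, 1, (n, n_coeffs)) + 1.2 * y[:, None]

Xtr, Xte, ytr, yte = train_test_split(X, y, random_state=0)
gb = GradientBoostingClassifier(random_state=0).fit(Xtr, ytr)

tn, fp, fn, tp = confusion_matrix(yte, gb.predict(Xte)).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
print(round(sensitivity, 3), round(specificity, 3))
```

Repeating this for each of the four unilateral measurements yields the entries of Table 11.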
GB classifiers are trained and tested to detect AAAs using four different unilateral measurements (locations shown in Fig. 1): flow-rate in the right common carotid artery, flow-rate in the left common carotid artery, pressure in the right radial artery, and pressure in the left radial artery. Carotid artery flow-rate is chosen as it has been shown to be the best measurement for disease classification. Radial artery pressure is chosen due to the location of the radial artery on the human wrist. Recent advancements have resulted in wearable devices capable of measuring continuous radial pressure profiles, such as the TLT Sapphire monitor (Tarilian Laser Technologies, Welwyn Garden City, U.K.) (Lobo et al. 2019), and thus if AAAs can be detected to satisfactory accuracies using these measurements, it may suggest the possibility of future home monitoring of abdominal aortic health through such wearables. The sensitivities and specificities achieved by the four unilateral GB classifiers are shown in Table 11. It shows that, relative to the bilateral case, while there is a decrease in the classification accuracies, the magnitude of the decrease is less than 10%. This finding suggests that there may be sufficient biomarkers of AAA presence captured within the intra-measurement details of a single pressure or flow-rate profile. The fact that similar accuracies are achieved with either the right or left instance of any measurement is likely due to physiological symmetry. While there are some minor asymmetries between the right and left upper extremities, due to the topology of the arterial network (as shown in Fig. 1), changes to the cross-sectional area of the abdominal aorta are expected to produce relatively consistent changes in both the right and left sides of the body.

Fig. 18
The ratios of the F 1 scores for AAA-L classification relative to AAA classification, when providing each combination of input measurements, are shown. Measurements included within each combination are highlighted with a black square

MLP early stopping to avoid overfitting
It is shown in Sect. 3.1.1 that the accuracy of the MLP classifiers is hindered by overfitting. Thus, the early stopping criterion outlined in Sect. 2.7 is adopted for the combinations of three to six measurements that hitherto produced the best results without early stopping. Here, the hyper-parameters describing the MLP architecture (the number of neurons per layer and the number of layers, i.e. depth) for each such case are also individually re-optimised on the validation data set with the early stopping criterion enabled. Thus, for each combination in the grid search, the best validation-set F 1 score is computed with early stopping enabled during training, and the architecture producing the maximum F 1 score is selected. Subsequently, for this optimal architecture, the test scores are computed on the test data set. This analysis is performed for CAS and AAA, as the behaviour of SAS and PAD is very similar to that of CAS.
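This grid search with early stopping can be sketched as follows. The grid, data, and stopping rule are illustrative: scikit-learn's built-in early stopping (validation_fraction, n_iter_no_change) stands in for the paper's criterion from Sect. 2.7, and synthetic data replaces the VPD.

```python
# Sketch of re-optimising the MLP architecture (width x depth) with early
# stopping enabled, selecting the architecture with the best validation F1.
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=600, n_features=24, random_state=0)
Xtr, Xval, ytr, yval = train_test_split(X, y, random_state=0)

best = (None, -1.0)  # (architecture, validation F1)
for depth in (2, 3):                 # number of hidden layers
    for width in (20, 40):           # neurons per layer
        mlp = MLPClassifier(
            hidden_layer_sizes=(width,) * depth,
            early_stopping=True,          # holds out part of Xtr internally
            validation_fraction=0.15,
            n_iter_no_change=10,
            max_iter=300,
            random_state=0,
        )
        mlp.fit(Xtr, ytr)
        score = f1_score(yval, mlp.predict(Xval))
        if score > best[1]:
            best = ((width,) * depth, score)

print(best[0], round(best[1], 3))
```

Only after this selection is the chosen architecture scored once on the held-out test set, keeping the test data out of the optimisation loop.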

CAS: early stopping
The hyper-parameters describing the optimum architectures with the early stopping criterion for the best combinations are shown in Table 12. It shows a remarkable degree of consistency between the optimum hyper-parameters for varying numbers of input measurements: for four measurements and above, the optimal architecture is identical. This finding supports the previous simplification of using a single architecture for all the MLP classifiers. An interesting finding to note, however, is that there is less consistency with the previous optimum hyper-parameters presented in Table 4, which found that four layers containing 60 neurons produced the highest F 1 score when providing six input measurements. The cost profiles for the optimal architectures with early stopping are shown in Fig. 19. It shows that the early stopping criterion generally fulfils its purpose of stopping the training process near the minimum validation cost point, thus minimising overfitting. It is observed that for all numbers of input measurements, training is stopped as soon as the minimum of 75 iterations has been completed. While this early stopping criterion greatly reduces overfitting in all cases, the minimum number of training iterations (75) is too high for the six-measurement case (the validation cost has already started to rise significantly), suggesting that further refinement may reduce the validation and test costs even further.
A comparison between the F 1 scores achieved with and without early stopping is shown in Table 13. While early stopping has reduced the log loss cost across the validation and test sets, this does not necessarily translate to improvements in the F 1 score. The log loss cost will decrease without increasing the F 1 score if easy-to-classify patients are predicted with a higher degree of certainty (for example, predicting 95% probability rather than 75%), even if no additional patients are correctly classified. For the six-measurement case, however, a clear increase in F 1 score is observed as a benefit of early stopping.
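This decoupling of log loss and F 1 score can be demonstrated with a small numeric example: sharpening the probabilities on already-correct predictions lowers the log loss while leaving the F 1 score unchanged. The probability values are illustrative.

```python
# Illustration: more confident probabilities on the same correct labels
# reduce log loss without changing the F1 score.
import numpy as np
from sklearn.metrics import f1_score, log_loss

y_true = np.array([1, 1, 0, 0])
p_before = np.array([0.75, 0.75, 0.25, 0.25])  # all classified correctly
p_after = np.array([0.95, 0.95, 0.05, 0.05])   # same labels, more confident

labels_before = (p_before >= 0.5).astype(int)
labels_after = (p_after >= 0.5).astype(int)

# Both probability sets yield identical (perfect) F1
assert f1_score(y_true, labels_before) == f1_score(y_true, labels_after) == 1.0
print(round(log_loss(y_true, p_before), 4), round(log_loss(y_true, p_after), 4))
```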

AAA: early stopping
The hyper-parameters describing the optimum architectures with the early stopping criterion for the best combinations are shown in Table 14. The consistency of the best architecture for AAA across the number of measurements is lower than that for CAS. It is again observed that the new hyper-parameters are inconsistent with the old ones (Table 4). Initially, this finding may seem to undermine early stopping and individual architecture optimisation for varying numbers of input measurements. However, while the optimum hyper-parameters are inconsistent, the F 1 scores achieved are very similar: 0.9785 in Table 4 and 0.9870 in Table 14. This similarity in F 1 scores may suggest an insensitivity to the architecture used, i.e. the F 1 score surface in the two-dimensional grid-search space is relatively flat for this problem. This again supports the earlier simplification of using a single architecture for all the classifiers.
The cost profiles for the optimal architectures with early stopping are shown in Fig. 20. It shows no major signs of overfitting when using MLP classifiers to detect AAA. As a result, the employment of an early stopping criterion has little effect on the final log loss cost achieved across the training and validation data sets. Thus, when comparing the test scores with and without early stopping in Table 15, little difference is observed.

Table 13
MLP: F 1 scores on the test dataset when using the best three to six input measurement combinations found to produce the highest accuracies for CAS with (Sect. 3.6.1) and without early stopping (Sect. 3.1.1)

Number of input measurements | Combination | F 1 score (test), without early stopping | F 1 score (test), with early stopping
3 | (P 3 , P 2 , P 1 ) | 0.8831 | 0.8621
4 | (Q 3 , Q 1 , P 2 , P 1 ) | 0.8683 | 0.8693
5 | (Q 3 , Q 2 , P 3 , P 2 , P 1 ) | 0.8463 | 0.7975
6 | (Q 3 , Q 2 , Q 1 , P 3 , P 2 , P 1 ) | 0.7785 | 0.8394

Table 14
The hyper-parameters describing the architecture of the MLP classifiers that produce the highest F 1 scores on the validation set with the early stopping criterion for AAA classification, when using the best performing combinations of three to six input measurements

Conclusions
The main conclusion of this study is that machine learning methods have the potential to detect arterial disease, both stenoses and aneurysms, from peripheral measurements of pressure and flow-rate across the network. Amongst the various ML methods, it is found that the tree-based methods of Random Forest and Gradient Boosting perform best for this application (within the limitations of the classifier-specific optimisation performed). Across the different forms of disease, the Gradient Boosting method outperforms Random Forest, Support Vector Machine, Naive Bayes, Logistic Regression, and even the deep learning method of Multi-layer Perceptron in the setting adopted. It should be noted, however, that the Multi-layer Perceptron results could be improved by problem-specific optimisation of the architecture and fine-tuning of further hyper-parameters. This, however, would come at added complexity and computational cost relative to the easier-to-train machine learning methods of Random Forest and Gradient Boosting.
It is demonstrated that maximum F 1 scores larger than 0.9 are achievable for CAS and PAD, larger than 0.85 for SAS, and larger than 0.98 for both low- and high-severity AAAs. The corresponding sensitivities and specificities are both larger than 90% for CAS and PAD, larger than 85% for SAS, and larger than 98% for both low- and high-severity AAAs. While these maximum scores are for the case when all six measurements are used, it is also shown that the performance degradation is less than 5% when using only three measurements and less than 10% when using only two measurements, as long as these measurements are carefully chosen in specific combinations. For the case of AAA, it is further demonstrated that when only a single measurement (either on the left or right side) is used, F 1 scores larger than 0.85 and corresponding sensitivities and specificities larger than 85% are achievable. This encourages the application of AAA monitoring and/or screening through the use of a wearable device, such as the TLT Sapphire monitor (Tarilian Laser Technologies, Welwyn Garden City, U.K.) (Lobo et al. 2019). Confidence in this is further strengthened by the similarly very high accuracies reported for AAA classification by Chakshu et al. (2020) (≈ 99.9%) and Wang et al. (2021) (sensitivities and specificities of ≈ 86%). However, multi-class classifier accuracies, as opposed to only the binary classifiers assessed here, remain unknown and should be considered to fully assess the ability of machine and deep learning methods for arterial disease detection.

Fig. 20
MLP: the log loss cost profiles across the training and validation sets when using the best performing combinations containing three to six input measurements for AAA classification and employing early stopping
Finally, it is shown through the analysis of several classifiers and feature-importance that, among the measurements, the carotid artery flow-rate is a particularly informative measurement to detect the presence of all the four forms of disease considered.

Limitations & future work
While high accuracy classification has been achieved, all classifiers are binary (i.e. diseases are treated as mutually exclusive). A logical next step, to further the results presented here, is to relax the assumption of mutually exclusive disease. Thus, classifiers should be built to detect not only the presence of disease, but also to identify the type of disease (potentially concomitant disease in multiple locations), its location, and its severity. This further analysis can be completed in two stages:

1. The previously created unhealthy VPDs (each containing only one form of disease) can be used to create mixed disease data sets, i.e. each VP has only one form of disease, but the data sets contain multiple forms of disease. Binary ML classifiers can then be created to predict whether a VP is subject to a particular form of disease, and multiclass classifiers to determine which form of disease a VP has.

2. New VPDs can be created, in which each VP may contain more than one form of disease. In this case, binary classifiers can be created to predict the presence of each individual form of disease within a VP, and multiclass classifiers to predict the combination of disease forms present.
While the results are encouraging, they are produced on a virtual cohort of subjects. Even though the database is physiologically realistic and carefully constructed, it may be that real patient behaviour differs from that in the VPD. Therefore, a future step is to apply the classifiers trained here directly to a small cohort of real-patient measurements. The effect of measurement errors and biases is ignored in this study; this aspect can also be considered in future work. Further improvements can also be made, to aim for higher accuracies with fewer, potentially noise- and bias-corrupted, measurements, by:

- Further optimising the architectures of the machine and deep learning methods (particularly the MLP classifiers).

- Further monitoring individual classifiers for signs of overfitting, and minimising it where needed.

CAS combination search results
The F 1 scores, sensitivities, and specificities achieved for CAS classification when using each of the six ML methods are shown in Tables 16, 17, and 18, respectively.

SAS combination search results
The F 1 scores, sensitivities, and specificities achieved for SAS classification when using each of the six ML methods are shown in Tables 19, 20, and 21, respectively.

PAD combination search results
The F 1 scores, sensitivities, and specificities achieved for PAD classification when using each of the six ML methods are shown in Tables 22, 23, and 24, respectively.

AAA combination search results
The F 1 scores, sensitivities, and specificities achieved for AAA classification when using each of the six ML methods are shown in Tables 25, 26, and 27, respectively.