A proof of concept study for machine learning application to stenosis detection

This proof of concept (PoC) assesses the ability of machine learning (ML) classifiers to predict the presence of a stenosis in a three-vessel arterial system consisting of the abdominal aorta bifurcating into the two common iliacs. A virtual patient database (VPD) is created using a one-dimensional pulse wave propagation model of haemodynamics. Four different ML methods are used to train and test a series of classifiers, both binary and multiclass, to distinguish between healthy and unhealthy virtual patients (VPs) using different combinations of pressure and flow-rate measurements. It is found that the ML classifiers achieve specificities larger than 80% and sensitivities ranging from 50 to 75%. The most balanced classifier also achieves an area under the receiver operating characteristic curve of 0.75, outperforming approximately 20 methods used in clinical practice, and thus placing the method as moderately accurate. Other important observations from this study are that (i) a few measurements can provide classification accuracies similar to those obtained when more/all of the measurements are used; (ii) some measurements are more informative than others for classification; and (iii) a modification of standard methods can result in detection of not only the presence of a stenosis, but also the stenosed vessel. Graphical Abstract: An overview of the methodology for the creation of virtual patients and their classification.


Introduction
While there are many forms of arterial disease, one of the most common is stenosis, which refers to the narrowing of an arterial vessel. This is normally caused by a build-up of fatty deposits, known as atherosclerosis. Stenosis can be categorised into several sub-diseases depending on its location. Three of the most common forms of stenosis are peripheral artery disease (PAD), carotid artery stenosis, and subclavian artery stenosis (SS). The prevalence of PAD and SS has been recorded to vary between 1.9 and 18.83% within different demographics [14,45], while carotid artery stenosis has been recorded to affect 3.8% of men and 2.7% of women [28].
Current methods for the detection of arterial disease are primarily based on imaging techniques [25,27,37,49], and so are often impractical for large-scale screening, expensive, or both. If a new inexpensive and non-invasive method for the detection of stenosis is found, the cost effectiveness of large-scale screening could be improved, making both continuous monitoring and screening feasible. One such alternative is to use easily acquirable pressure and flow-rate measurements at accessible peripheral locations within the circulatory system and use them for diagnosis. It is known from the principles of fluid mechanics that if the cross-sectional area of a vessel is changed, the pressure and flow-rate profiles of fluid passing through that vessel will also change [11,31,44,50]. Applying this to arterial disease, the inclusion of a stenosis within a patient's arterial network may create detectable biomarkers within the pressure and flow-rate profiles of blood. This precise hypothesis is explored in this study.
A previous study [44] has explored the use of physics-based models of pulse wave propagation to predict the presence of an aneurysm, another common form of arterial disease, using flow-rate measurements. Its use for disease detection is, however, limited by the need for patient-specific parameters. If a consistent and significant biomarker of arterial disease is found within pressure and flow-rate profiles, irrespective of a patient's individual arterial network, it would be possible to predict the presence of a stenosis using only these measurements. This would allow for inexpensive and non-invasive screening of patients for arterial disease. As opposed to a mechanistic approach to such an inverse problem, this study explores a purely data-driven machine learning approach for finding such biomarkers.
It is likely that the indicative biomarkers of arterial disease held within pressure and flow-rate profiles consist of micro inter- and intra-measurement details. Discovery of these biomarkers through a traditional hypothesis-driven scientific method [53] and a classical inverse problems approach is difficult. If a large database of pressure and flow-rate measurements taken from patients of known arterial health is available, it may be possible for a machine learning (ML) classifier to be trained to not only discover but also exploit any biomarkers within the pressure and flow-rate profiles. In the past, ML has been used for a variety of diagnostic medical applications [24], such as detection of irregularities in heart valves [10], arrhythmia [47], and sleep apnea [22] from recorded time domain data. Other applications include estimation of pulse wave velocity from radial pressure wave measurements [20]; automatic detection, segmentation, and classification of abdominal aneurysms in computed tomography (CT) images [17]; and prediction of aneurysm severity growth from CT images [18]. The wide application and success of ML methods in medical applications motivates exploration of their use for stenosis detection.
The aim of this proof of concept (PoC) study is to carry out an initial investigation into the potential of using ML classification algorithms to predict the presence of stenosis, using haemodynamics measurements. While two previous studies [9,54] have investigated the potential of aneurysm classification, no comparable work has been completed for stenosis to the authors' knowledge. In [9], a seven-parameter synthetic dataset is constructed and deep-learning methods are tested on it for binary detection of abdominal aortic aneurysm (AAA), yielding accuracies of ≈ 99.9% from three pressure measurements. A sensitivity of 86.8% and a specificity of 86.3% for early detection of AAA from photoplethysmogram pulse waves is reported in [54], which used a synthetic dataset created by varying six cardiovascular variables. Extending these ideas to the detection of stenosis, this study investigates (i) the potential for stenosis classification (stenoses are expected to have a more localised effect than aneurysms), and (ii) the ability to classify when using a database created with significantly more variability (25 parameters varied in this study, compared to seven and six parameters in [9] and [54], respectively).
To train and test such ML classifiers, a large database of measurements taken from patients of known arterial health is required. As opposed to using measurements from a real population, which are unavailable, a synthetic virtual patient database (VPD), similar to that presented in [56], is created through the use of a physics-based model of pulse wave propagation. To create the VPD, a priori distributions are first constructed for the parameters describing the arterial networks of virtual patients (VPs) across the resulting VPD. Random realisations are then sampled from these distributions, and the physics-based model is solved to obtain the corresponding pressure and flow-rate profiles. Finally, "Hard" filters, i.e. the direct imposition of bounds on the ranges of pressure profiles, are applied to the VPD to reduce the occurrence of physiologically unrealistic VPs.
This virtual population is then used to train and test a series of ML classifiers to detect arterial disease, and to assess their performance. The focus is on assessing feasibility and uncovering behaviours and patterns in the performance of classification methods, rather than on optimisation and the creation of increasingly complex ML models for maximum accuracy. Understanding the behaviour of the classifiers will allow subsequent, more complex, studies to build on these observations.
In what follows, first the design of the VPD (its motivation, the physics-based model, the arterial network, its parameterisation, the probability distributions, and the filters) is presented. This is followed by the ML setup, its relation to the required size of the VPD, a brief description of the ML methods, and the metrics used to quantify their performance. Finally, the results and analysis of the ML methods' performance are presented, with a focus on uncovering why some ML methods perform better than others and which measurements (and combinations thereof) are more informative.

Motivation
To train and test ML classifiers, a large database of haemodynamics measurements taken from a comprehensive cohort of patients is required, together with the correct arterial health of each patient. As opposed to using measurements taken from real patients, VPs are created using a physics-based model of pulse wave propagation. This VP approach has several advantages:

1. Expense: creating VPs is relatively inexpensive. The primary cost associated with the creation of VPs is computational, and thus negligible in comparison to data acquisition in a real population.
2. Class imbalance: creating VPs allows for control of the distribution of different diseased states. For example, in a real population the rate of arterial disease can vary between 1 and 20%. During the creation of VPs, however, 50% diseased patients can be created to ensure a balanced dataset.
3. Measurement availability: using VPs allows measurements of pressure and flow-rate to be taken at any location within the arterial system. This allows for an a priori assessment of ML classifiers using all possible combinations of pressure and flow-rate measurements.
While there are limitations to the measurements that can be non-invasively and inexpensively obtained for a clinical application, pressure and flow-rate measurements throughout the arterial network are useful as they allow the impact of measurement location on performance to be investigated. This benefit is particularly important for this PoC where feasibility of the ML approach is being assessed. A primary purpose of this study is to gain an understanding of the patterns between the measurements and classification accuracy.

Physics-based model of pulse wave propagation
To compute the pressure and flow-rate waveforms associated with VPs, a physics-based model of one-dimensional pulse wave propagation is adopted [4]. By considering each vessel within the network to be a deforming tube, a system of two governing equations can be derived. These equations represent conservation of mass and momentum balance under the assumptions that blood is incompressible and that the tube walls are impermeable. The system of equations is (see [2] for details):

$$\frac{\partial A}{\partial t} + \frac{\partial (AU)}{\partial x} = 0, \qquad \frac{\partial U}{\partial t} + U\,\frac{\partial U}{\partial x} + \frac{1}{\rho}\frac{\partial P}{\partial x} = \frac{f}{\rho A},$$

where P(x, t), U(x, t), and A(x, t) represent the pressure, flow velocity, and arterial cross-sectional area, respectively, at spatial coordinate x and time t; ρ and μ represent the density and the dynamic viscosity of blood, respectively; and f represents the frictional force per unit length, described as

$$f = -2(\zeta + 2)\,\mu\,\pi\,U,$$

where ζ is a constant that depends on the velocity profile across the arterial cross-section. To close this system of equations, a mechanical model of the displacement of the vessel walls [4] is included:

$$P = P_{\text{ext}} + P_d + \frac{\beta}{A_d}\left(\sqrt{A} - \sqrt{A_d}\right), \qquad \beta = \frac{4}{3}\sqrt{\pi}\,E\,h,$$

where P_ext represents the external pressure, P_d the diastolic blood pressure, A_d the diastolic area of the vessel, E the vessel wall's Young's modulus, and h the vessel wall's thickness. This system of equations has been used and tested extensively [2,4,13,29,34,38]. Such one-dimensional models have been shown to achieve good accuracy in predicting global haemodynamics in the presence of both stenoses and aneurysms when compared against more complex and detailed 3D simulations [5,19], against in vitro measurements [19], and against in vivo clinical measurements [8]. These studies support the idea that 1D models provide a good balance between accuracy and computational cost, and are able to capture global changes in haemodynamics, making them a suitable choice for machine learning applications.
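The tube law above can be evaluated directly. Below is a minimal sketch assuming the standard stiffness form β = (4/3)√π E h; the function name and the parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def tube_law_pressure(A, A_d, E, h, P_d=10000.0, P_ext=0.0):
    """Elastic tube law: P = P_ext + P_d + (beta / A_d) * (sqrt(A) - sqrt(A_d)),
    with wall stiffness beta = (4/3) * sqrt(pi) * E * h (SI units assumed)."""
    beta = (4.0 / 3.0) * np.sqrt(np.pi) * E * h
    return P_ext + P_d + (beta / A_d) * (np.sqrt(A) - np.sqrt(A_d))

# At the diastolic area the pressure equals P_ext + P_d, and it rises
# monotonically as the vessel distends beyond A_d.
p_dia = tube_law_pressure(1.0e-4, 1.0e-4, 4.0e5, 1.5e-3)
p_sys = tube_law_pressure(1.3e-4, 1.0e-4, 4.0e5, 1.5e-3)
```

The sketch illustrates why A_d and P_d appear as a reference state: the tube law anchors the pressure to the diastolic pressure when the area equals the diastolic area.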
While local flow features are not captured in the one-dimensional formulation, which leads to a loss of 3D features such as vortices and recirculation zones, the motivation of this study is to detect changes in global haemodynamics and biomarkers in space-averaged flow features (pressure and flow-rate waveforms), which are naturally included in the 1D formulation and more likely to be available as clinical measurements. Another limitation of the 1D formulation is that the vessel cross-sectional areas are the model variables directly. Thus, the manner in which the cross-sectional areas change because of disease (radially symmetric or asymmetric variations) is not considered in the 1D model. For an initial exploratory study, the advantages of the 1D model in terms of computational cost are deemed to outweigh these limitations.

Arterial network
In this study, the network of interest is the abdominal aorta bifurcating into the two common iliacs. A pre-existing model for this is taken as the reference network from [4]. This is shown in Fig. 1, where the three vessels (the abdominal aorta and the two iliacs) are represented in 1D while suitable boundary conditions are imposed at the inlet and the outlets. At the inlet, a time-varying volumetric flow-rate is prescribed. The terminal outlets are coupled to three-element Windkessel models [55], which replicate the effect of the peripheral arteries. Each Windkessel model, as shown in Fig. 2, consists of two resistors, R1 and R2, which represent the viscous resistances of the large arteries and the microvascular system, respectively, and a capacitor C, which represents the compliance of the large arteries. (Fig. 1 caption: The inlet and outlet boundary conditions of the model; the relation of the model to the aortic bifurcation is also shown through comparison to an angiogram, reprinted from [7] with permission from Elsevier.)
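The behaviour of a three-element Windkessel outlet can be sketched with a simple time integrator. This is an illustrative forward-Euler implementation, not the coupling scheme used in the paper; the variable names and parameter values are assumptions.

```python
import numpy as np

def windkessel_pressure(q, dt, R1, R2, C, p_out=0.0):
    """Proximal pressure P_1D of an RCR Windkessel driven by inflow q(t):
    P_1D = R1 * q + p_c, where the capacitor pressure p_c obeys
    C * dp_c/dt = q - (p_c - p_out) / R2 (integrated with forward Euler)."""
    p_c = 0.0
    p = np.empty_like(np.asarray(q, dtype=float))
    for i, qi in enumerate(q):
        p[i] = R1 * qi + p_c
        p_c += dt * (qi - (p_c - p_out) / R2) / C
    return p

# For a constant inflow the pressure settles to q * (R1 + R2) + p_out,
# i.e. the two resistances act in series at steady state.
p = windkessel_pressure(np.ones(10000), dt=1e-3, R1=1.0, R2=2.0, C=0.1)
```

The steady-state check makes the roles of the elements concrete: R1 sets the instantaneous pressure jump, while R2 and C govern the slow relaxation of the stored (capacitive) pressure.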

Numerical scheme
With the specification of the network mechanical parameters and the boundary conditions, the model of Section 2.2 is solved to compute the pressure and flow-rate waveforms across the network. The system of equations is numerically solved using a discontinuous Galerkin scheme, see [2] for details. This scheme is implemented in an in-house code written in Python/NumPy [15,51], and has been successfully validated against benchmarks [4], and against a 3D model of blood flow through stenosed arterial vessels [5].

Parameterisation of the arterial network
This section presents the parameterisation of the arterial network for the creation of VPs. Once parameterised, the network parameters can be randomly sampled to create VPs. The inlet volumetric flow-rate profile, Q_inlet(t), is described using a Fourier series (FS) representation:

$$Q_{\text{inlet}}(t) = \sum_{n=0}^{N} b_n \cos(n\omega t) + \sum_{n=1}^{N} a_n \sin(n\omega t),$$

where a_n and b_n represent the nth sine and cosine FS coefficients, respectively; N represents the truncation order; and ω = 2π/T, with T the time period of the cardiac cycle. It is found that the time domain inlet flow-rate profile of [4] can be described to a high level of precision using a FS truncated at the 5th order. Thus, the time domain inlet flow-rate profile requires specification of 11 coefficients. (Fig. 2 caption: The configuration of a three-element Windkessel model; Q_1D and P_1D represent the volumetric flow-rate and pressure, respectively, at the terminal boundary of the 1D system.)
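Reconstructing an inlet profile from its 11 FS coefficients can be sketched as follows. The function is illustrative and the coefficient values are placeholders, not those fitted in the paper.

```python
import numpy as np

def inlet_flow(t, b, a, T=1.0):
    """Truncated Fourier series
    Q(t) = sum_{n=0}^{N} b_n cos(n*w*t) + sum_{n=1}^{N} a_n sin(n*w*t),
    with w = 2*pi/T. b holds N+1 cosine coefficients (b[0] is the mean
    flow) and a holds N sine coefficients: 2N + 1 = 11 numbers for N = 5."""
    t = np.asarray(t, dtype=float)
    w = 2.0 * np.pi / T
    q = np.zeros_like(t)
    for n, bn in enumerate(b):
        q += bn * np.cos(n * w * t)
    for n, an in enumerate(a, start=1):
        q += an * np.sin(n * w * t)
    return q

# One cardiac cycle sampled uniformly; placeholder coefficients.
t = np.linspace(0.0, 1.0, 1000, endpoint=False)
q = inlet_flow(t, b=[90.0, 10.0, 0.0, 0.0, 0.0, 0.0], a=[40.0, 0.0, 0.0, 0.0, 0.0])
```

Because every harmonic averages to zero over a full period, the mean of the reconstructed waveform equals the constant coefficient b[0], which is a convenient sanity check on any fitted coefficient set.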
Since the three vessel segments in the network are short, it is assumed that the properties of all three vessels are constant along their lengths. To impose geometric and mechanical symmetry on the lower extremities, the two common iliac arteries are assumed to share identical properties. This symmetry, however, is not extended to the terminal Windkessel model parameters. The parameterisation of the network thus requires specification of the following 25 parameters:

• Six geometric properties: the two common iliac arteries require specification of a single length, a reference area, and a wall thickness; these three properties are also required for the aorta.
• Two mechanical properties: the Young's moduli of the aorta and of the common iliacs.
• Six terminal boundary parameters: each of the Windkessel models requires two resistances and a compliance.
• 11 FS coefficients: the time domain inlet flow-rate profile is described using a FS truncated at the 5th order.
For an ML classifier to be trained to distinguish between healthy and unhealthy patients, examples of both classifications are required within the VPD. A parameterisation must, therefore, be chosen to describe stenosed arterial vessels. For simplicity, all VPs are limited to having a maximum of one diseased vessel. To create this parameterisation, both the length and the cross-sectional area of the vessel are normalised between 0 and 1. This map, for a 60% stenosis, is shown in Fig. 3, where the x-axis represents the reference position along the length of the vessel and the y-axis represents the reference cross-sectional area. For healthy vessels, the normalised reference cross-sectional area is constant and equal to 1. For unhealthy vessels, a cosine curve is used to create a change in area. This cosine curve is scaled using three parameters to create variation in the location and severity of disease between patients. These parameters are the severity S, the start location b, and the end location e of the disease. The normalised cross-sectional area A_n of a diseased vessel at a normalised spatial location x_n is described as:

$$A_n(x_n) = 1 - \frac{S}{2}\left[1 - \cos\left(2\pi\,\frac{x_n - b}{e - b}\right)\right] \;\; \text{for } b \le x_n \le e, \qquad A_n(x_n) = 1 \;\; \text{otherwise}.$$

Thus, in addition to the 25 parameters for the description of a healthy subject, three more parameters are required for the specification of disease. Random realisations of these parameters are sampled and the physics-based model of pulse wave propagation is solved to produce each VP. Examples of pressure and flow-rate profiles taken from one healthy subject and corresponding subjects with 80% stenoses in the aorta and the first iliac are shown in Fig. 4. The probability distributions of the VP parameters are described next.
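The cosine area map can be implemented in a few lines. A minimal sketch (the function name is an assumption) consistent with the description above:

```python
import numpy as np

def stenosis_area(x_n, S, b, e):
    """Normalised cross-sectional area of a diseased vessel: 1 outside
    [b, e]; inside, a cosine dip that equals 1 at x_n = b and x_n = e
    and reaches 1 - S at the stenosis midpoint."""
    x_n = np.asarray(x_n, dtype=float)
    area = np.ones_like(x_n)
    inside = (x_n >= b) & (x_n <= e)
    xi = (x_n[inside] - b) / (e - b)   # runs 0 -> 1 across the stenosis
    area[inside] = 1.0 - 0.5 * S * (1.0 - np.cos(2.0 * np.pi * xi))
    return area

# A 60% stenosis occupying the middle 40% of the vessel (cf. Fig. 3):
a = stenosis_area(np.array([0.0, 0.5, 1.0]), S=0.6, b=0.3, e=0.7)
```

The cosine form guarantees a smooth transition into and out of the narrowed segment, so the area (and hence the 1D model's coefficients) remains continuous along the vessel.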

Probability distributions
Ideally, the distributions of both the arterial network parameters and the resulting pressure and flow-rate profiles should be representative of those measured in a real population. Since one-dimensional arterial network parameters are generally either expensive and invasive to obtain or non-physical (and so cannot be directly measured), their exact distributions are not known. Thus, a priori distributions are assumed for both healthy and diseased virtual subjects, as described next.

Healthy subjects
A priori distributions are assumed for the arterial network parameters, based on values reported in the literature [4]. It is assumed that across a large population, all parameters required to describe the VPs' arterial networks, excluding the disease parameters, are independent and normally distributed. The mean value for each of these parameters is taken from [4] and the standard deviation is set to 20% of the mean, as summarised in Table 1. VPs are assigned disease such that the VPD consists of an expected 50% healthy patients, with an expected equal number of aortic, first iliac, and second iliac stenosis VPs.
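Sampling a healthy VP under these assumptions amounts to independent normal draws with a 20% relative standard deviation. A minimal sketch, where the means shown are placeholders rather than the Table 1 values:

```python
import numpy as np

def sample_healthy_parameters(means, rel_std=0.2, rng=None):
    """Draw one set of network parameters: independent normals with
    means from the reference network and std = rel_std * mean."""
    rng = np.random.default_rng() if rng is None else rng
    means = np.asarray(means, dtype=float)
    return rng.normal(loc=means, scale=rel_std * np.abs(means))

rng = np.random.default_rng(42)
# Placeholder means, e.g. [vessel length (m), reference area (m^2)]:
draws = np.array([sample_healthy_parameters([0.086, 2.3e-4], rng=rng)
                  for _ in range(50000)])
```

Across many draws, the empirical mean and standard deviation of each parameter recover the prescribed mean and 20% spread, which is the property the VPD relies on.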

Diseased patients
In addition to the parameters described for healthy patients above, a diseased patient is characterised by three more parameters (disease severity, start location, and end location), which are assigned uniform distributions based on physical constraints. A fourth parameter, the reference location of the disease (represented by r), is introduced; it is included to impose a minimum stenosis length of 10% of the vessel length. The four parameters are sequentially sampled from uniform distributions subject to constraints on their bounds. The assumption that all arterial network parameters are independent and normally distributed is likely physiologically incorrect. To correct for this assumption, post-simulation filters are applied to discard non-physiological patients. This is described next.

Post simulation filter
Through random sampling, there is a chance that VPs are assigned combinations of arterial network parameters that result in physiologically unrealistic pressure and flow-rate profiles. Thus, to remove these VPs from the VPD, a post-simulation filter is applied. "Hard filters" are applied to VPs, i.e. ranges within which the pressure profiles must fall are directly imposed. Based on ranges reported in the literature [46], a more conservative version is adopted to allow the full range of possible pressure waveforms to be expressed in the VPD. The post-simulation filter imposes three conditions on P_inlet, the vector describing the time domain pressure profile at the inlet of the system. Using the VPD created through the methodology described above, the ability of ML classifiers to distinguish between healthy and unhealthy VPs is assessed, as outlined next.
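A hard filter of this kind reduces to direct bound checks on the waveform. The sketch below is illustrative only: the paper's three conditions and numerical ranges are not reproduced here, so the bounds shown (in mmHg) are placeholders.

```python
import numpy as np

# Placeholder bounds -- NOT the ranges used in the paper.
BOUNDS = {"sys_max": 200.0, "sys_min": 80.0, "dia_min": 40.0}

def passes_hard_filter(p_inlet, bounds=BOUNDS):
    """Keep a VP only if its inlet pressure waveform satisfies direct
    bounds on the systolic peak and diastolic trough."""
    p = np.asarray(p_inlet, dtype=float)
    return bool(bounds["sys_min"] <= p.max() <= bounds["sys_max"]
                and p.min() >= bounds["dia_min"])

kept = passes_hard_filter([62.0, 85.0, 118.0, 95.0])      # plausible cycle
rejected = passes_hard_filter([62.0, 85.0, 250.0, 95.0])  # unrealistic peak
```

In a pipeline, such a predicate would be applied to every simulated VP, and only those returning True would enter the final VPD.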

Representation of measurements
The output of the pulse wave propagation model is the pressure and flow-rate at all temporal and spatial locations. While these vectors of pressure or flow-rate at any spatial location (for example, p = [p(t₀), p(t₀ + Δt), p(t₀ + 2Δt), ⋯, p(t₀ + kΔt)]) can be used directly as measurement inputs to the ML algorithms, their dimensionality is quite large. Furthermore, as the severity of stenosis increases, resulting in additional nonlinearities in the model, the time step Δt for a stable solution can become very small. As pressure and flow-rate profiles are periodic, it is natural to represent the time domain haemodynamic profiles using a FS representation. This representation allows the pressure and flow-rate profiles to be described to a high level of completeness in significantly fewer dimensions. Since the input FS coefficients differ by several orders of magnitude between measurements and VPs, they are individually transformed to have zero mean and unit variance through the widely used Z-score standardisation [32]. The transformed inputs are subsequently used as inputs to the ML algorithms.
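The two steps, FS compression and Z-score standardisation, can be sketched as follows. The helper names are assumptions, and the FFT-based coefficient convention shown may differ from the paper's exact fitting procedure.

```python
import numpy as np

def fourier_features(p, n_harmonics=5):
    """Compress one uniformly sampled cardiac cycle into
    2*n_harmonics + 1 features: the mean plus the real/imaginary parts
    of the first harmonics, extracted via the FFT."""
    c = np.fft.rfft(np.asarray(p, dtype=float)) / len(p)
    feats = [c[0].real]
    for n in range(1, n_harmonics + 1):
        feats += [c[n].real, c[n].imag]
    return np.array(feats)

def zscore_columns(X):
    """Standardise each feature (column) to zero mean, unit variance."""
    mu, sigma = X.mean(axis=0), X.std(axis=0)
    return (X - mu) / sigma

k = np.arange(64)
p = 100.0 + 20.0 * np.cos(2.0 * np.pi * k / 64)   # toy pressure cycle
f = fourier_features(p)                           # 11 features
Z = zscore_columns(np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```

A k-sample waveform is thereby reduced to 11 numbers per measurement site, and standardising column-wise puts slow (large) and fast (small) harmonics on a comparable scale for the classifiers.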
With the above stated method of network parameterisation, its sampling, and measurement representation, the ML setup is described next.

Test/train split
The VPD is split into two parts: testing set and training set. The training set is used for learning in the ML algorithms and is set to two-thirds of the size of the VPD. The remaining one-third of the VPs comprise the testing set, which is used to assess the accuracy of the ML algorithms on previously unseen data, i.e. the data not used while training. The ML algorithms are briefly described next.
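The split can be sketched as a shuffled two-thirds/one-third partition; this is a minimal illustration, as the paper does not specify its shuffling procedure.

```python
import numpy as np

def train_test_split(X, y, train_frac=2.0 / 3.0, rng=None):
    """Shuffle the VPs, keep train_frac of them for training, and hold
    out the rest for testing on previously unseen data."""
    rng = np.random.default_rng() if rng is None else rng
    idx = rng.permutation(len(X))
    n_train = int(round(train_frac * len(X)))
    tr, te = idx[:n_train], idx[n_train:]
    return X[tr], y[tr], X[te], y[te]

X, y = np.arange(18.0).reshape(9, 2), np.arange(9)
X_tr, y_tr, X_te, y_te = train_test_split(X, y, rng=np.random.default_rng(0))
```

Shuffling before splitting matters here because the VPD is generated class by class; without it, the held-out third could over-represent one disease class.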

Machine learning algorithms
A model mapping a vector of input measurements, x, to a discrete output classification, y, can be described as:

$$\hat{y} = f(\mathbf{x}), \qquad y \in \mathcal{C} = \{C^{(1)}, C^{(2)}, \ldots\},$$

where C represents the set describing all possible classifications, and C^(j) represents the jth possible classification.
In the context of this study, the measured inputs x and output classification y represent the haemodynamics measurements taken from VPs and the corresponding health of those VPs, respectively. The following four ML methods are used in this study.

Logistic regression (LR)
The LR classifier [16,48] is a probabilistic binary classification method. Given that patients belong to one of two classifications, i.e. C = {C^(1), C^(2)}, true binary responses τ_i are assigned to all subjects in the training set:

$$\tau_i = \begin{cases} 1 & \text{if } y_i = C^{(1)} \\ 0 & \text{if } y_i = C^{(2)}. \end{cases}$$

To predict the binary health of a patient, an activation function is used. A general equation for an activation function h(x_i, θ) can be written as:

$$h(\mathbf{x}_i, \boldsymbol{\theta}) = p\left(\tau_i = 1 \mid \mathbf{x}_i, \boldsymbol{\theta}\right),$$

where p(τ_i = 1 | x_i, θ) represents the predicted probability that the ith VP belongs to C^(1), given that the patient-specific input measurements x_i have been observed and that the vector of measurement-specific weightings is described by θ. Typical choices for h(x_i, θ) are the sigmoid and tanh functions. The sigmoid function is:

$$h(\mathbf{x}_i, \boldsymbol{\theta}) = \frac{1}{1 + e^{-\boldsymbol{\theta}^{\top}\mathbf{x}_i}}.$$

To obtain optimal measurement-specific weightings θ, the logistic regression algorithm is trained by minimising the mean error between the predicted probability of VPs producing a positive binary response and the known correct classification across the training set, i.e.:

$$\hat{\boldsymbol{\theta}} = \underset{\boldsymbol{\theta}}{\arg\min}\; L(\boldsymbol{\theta}, \mathbf{X}_{\text{train}}, \boldsymbol{\tau}_{\text{train}}), \qquad L = -\frac{1}{m}\sum_{i=1}^{m}\left[\tau_i \log h(\mathbf{x}_i, \boldsymbol{\theta}) + (1 - \tau_i)\log\left(1 - h(\mathbf{x}_i, \boldsymbol{\theta})\right)\right],$$

where L(θ, X_train, τ_train) represents the average cost, in this case computed as a log loss, across the training set; X_train and τ_train represent the matrix of input measurements and the vector of known correct binary classifications for all m VPs in the training set, respectively; x_i and τ_i represent the vector of input measurements and the known correct binary classification corresponding to the ith VP, respectively; and θ represents the measurement-specific weightings.
The numerical minimisation can be carried out using many algorithms, such as gradient descent, gradient descent with momentum [36], the Nesterov accelerated gradient (NAG) [33], Adadelta [57], and the Adam method [23]. Post training, the obtained weightings can be used to predict the health classification of new unseen VPs, i.e. VPs within the test set, through the application of a threshold B, often referred to as the decision boundary, to the predicted probabilities as follows:

$$\hat{y}_i = \begin{cases} C^{(1)} & \text{if } p\left(\tau_i = 1 \mid \mathbf{x}_i, \boldsymbol{\theta}\right) \ge B \\ C^{(2)} & \text{otherwise}, \end{cases}$$

where ŷ_i represents the predicted health classification of the new unseen test VP, p(τ_i = 1 | x_i, θ) represents the predicted probability returned by the activation function, and B represents the chosen decision boundary.
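A minimal NumPy version of the training and thresholding described above, using plain gradient descent on the log loss; an intercept column is appended, which the equations above fold into θ. This is a sketch, not the paper's implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logistic(X, tau, lr=0.5, n_iter=2000):
    """Minimise the mean log loss by gradient descent; the gradient with
    respect to theta is X^T (h - tau) / m."""
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append intercept column
    theta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        theta -= lr * Xb.T @ (sigmoid(Xb @ theta) - tau) / len(X)
    return theta

def predict(X, theta, B=0.5):
    """Apply the decision boundary B to the predicted probabilities."""
    Xb = np.hstack([X, np.ones((len(X), 1))])
    return (sigmoid(Xb @ theta) >= B).astype(int)

# Toy separable data: low inputs -> class 0, high inputs -> class 1.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
tau = np.array([0.0, 0.0, 1.0, 1.0])
theta = fit_logistic(X, tau)
```

Raising or lowering B trades sensitivity against specificity, which is exactly the lever the study later uses to balance its classifiers.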
The remaining three methods are not described in great detail here; their descriptions can be found in the references below. LR is described in more detail above as it is later modified for the application in this study.

Naive Bayes (NB)
An NB classifier [40,41] is a probabilistic multiclass method. It creates a conditional probability model, through the use of Bayes' theorem, that predicts the probability of a VP belonging to each classification, given the measured pressure and flow-rate profiles.

Support vector machine (SVM)
An SVM classifier is a non-probabilistic binary classification method [21]. An SVM finds an optimum partition between positive and negative binary outcomes by maximising the distance between the partition and the nearest instances of both binary outcomes. It is common for SVM classifiers to map the input measurements to a higher-order feature space, typically through the use of a kernel.

Random forest (RF) classification method
An RF classifier is a non-probabilistic multiclass classification method [6,26]. It is an ensemble method, combining the predictions returned by a series of weak decision tree classifiers through bootstrap aggregation. Each decision tree within the ensemble is created by repeatedly splitting the training data into subsets, based on an evaluation criterion, to maximise the homogeneity of the subsets.

Motivation for the chosen ML classifiers
Two characteristics that can be used to distinguish between ML methods are whether they produce linear or non-linear partitions between classifications, and whether they return a probabilistic or non-probabilistic output prediction. The four ML methods above are chosen as they encompass all four combinations of these characteristics, as shown in Table 2. Another attractive feature of these methods is that they all require very little problem-specific optimisation. Before classifiers are trained and tested using all four methods, preliminary tests are carried out using the LR method, as it is computationally inexpensive. Once an initial understanding of the VPD has been gained, further classifiers are trained using the other three methods. The methodologies and considerations required to use the VPD to train and test ML classifiers are explained next.

Required size of the VPD
An important consideration in the creation of the VPD is its size: how many virtual patients are sufficient for the ML algorithms to be applied? Here, an a priori evaluation of the required size of the VPD is presented, while an a posteriori analysis is found in Section 4.1. A common rule of thumb in ML is that to train a classifier, at least 10 examples of each possible classification are required per input dimension, known as events per variable or EPV [52]. While pressure and flow-rate measurements can be obtained at any location within the arterial network, measurements are limited to the inlet and two outlets of the system, shown in Fig. 1 as P1, Q1, P2, Q2, P3, and Q3, respectively. This results in a maximum number of input dimensions of 66 (each measurement is described by 11 FS coefficients, with all six measurements taken). A minimum EPV of any one health classification is chosen to be 12 in this study, in order to be on the conservative side of the rule of 10. Two-thirds of the VPs within the VPD are used for training the classifiers, and the remaining one-third are used for testing. From this, it is calculated that the VPD requires 1188 (3/2 × 12 × 66) VPs with disease in each of the three vessels. Since a balanced dataset is desired, the number of healthy patients required is 3564 (1188 × 3). This results in an EPV of 36 for healthy subjects.
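The sizing arithmetic above can be written out explicitly:

```python
# A priori sizing of the VPD from the EPV rule of thumb.
n_features = 6 * 11                 # six measurement sites x 11 FS coefficients = 66
epv = 12                            # chosen events per variable (conservative side of 10)

train_per_disease = epv * n_features            # 792 training VPs per diseased class
total_per_disease = train_per_disease * 3 // 2  # 1188, since training is 2/3 of the VPD
healthy_total = 3 * total_per_disease           # 3564 healthy VPs for a balanced set
healthy_epv = healthy_total * 2 // 3 // n_features  # 36 for the healthy class
```

The healthy class ends up with a three-fold higher EPV simply because it must balance the three diseased classes combined.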

Classifier configurations
The objectives and configurations of the classifiers can be split into two general categories: binary classifiers and multiclass classifiers. Binary ML classifiers are trained to predict the output classification when it may belong to one of two possible outcomes, i.e. C = {C^(1), C^(2)}. In contrast, when more than two classes are present, multiclass classifiers are necessary.

Binary classifiers
Binary classifiers are created using one of two different configurations.

Individual vessel binary configuration

The first configuration of binary classifiers is the individual vessel binary classifier (IVBC). The purpose of an IVBC is to predict whether a stenosis is present within a particular vessel of a VP's arterial network. When creating IVBCs, an arterial vessel of interest is isolated, and VPs with disease present within this vessel are assigned to the first discrete output classification, C^(1); all other VPs are assigned to the second discrete output classification, C^(2). The assignment of true state classifications to VPs when creating IVBCs is described by:

$$y_i = \begin{cases} C^{(1)} & \text{if vessel } a \text{ of the } i\text{th VP is diseased} \\ C^{(2)} & \text{otherwise}, \end{cases}$$

where y_i represents the true state classification of the ith VP, and a represents the arterial vessel for which the binary health is being predicted.

Entire network binary configuration
The second configuration of binary classifiers is the entire network binary classifier (ENBC). The purpose of an ENBC is to predict the health of a VP's entire arterial network, i.e. irrespective of the vessel in which the disease is located. When creating ENBCs, VPs with no disease present within their arterial network are assigned to the first classification, C^(1), while all other VPs are assigned to the second classification, C^(2). The assignment of true state classifications to VPs when creating ENBCs is described by:

$$y_i = \begin{cases} C^{(1)} & \text{if no disease is present in the } i\text{th VP's network} \\ C^{(2)} & \text{otherwise}. \end{cases}$$

Multiclass ML classifiers are discussed next.
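The two label assignments can be sketched directly. Here the per-VP disease state is encoded as 0 (healthy), 1 (aorta), 2 (first iliac), or 3 (second iliac); the encoding and function names are assumptions, and classes are written as 1/0 rather than C^(1)/C^(2).

```python
import numpy as np

def ivbc_labels(disease_state, vessel):
    """IVBC: positive class iff the stenosis lies in the chosen vessel."""
    return (np.asarray(disease_state) == vessel).astype(int)

def enbc_labels(disease_state):
    """ENBC: positive class iff the entire network is disease-free."""
    return (np.asarray(disease_state) == 0).astype(int)

d = np.array([0, 1, 2, 3, 0])   # five toy VPs
```

The same underlying VPD thus supports both configurations; only the mapping from disease state to binary label changes.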

Multiclass classifiers
Multiclass classifiers predict the output classification when it may belong to more than two classes. The purpose of the multiclass classifiers here is to predict whether a stenosis is present within a VP's arterial network and, if so, in which vessel the disease occurs. Thus, four classifications exist:

$$\mathcal{C} = \{C^{(1)}, C^{(2)}, C^{(3)}, C^{(4)}\},$$

where C^(1), C^(2), C^(3), and C^(4) represent no disease present, and disease present within the aorta, the first iliac, and the second iliac, respectively. It is found through analysis of the binary classification behaviours (Section 4.3) that LR and SVM classifiers consistently achieve higher classification accuracy than NB and RF classifiers. Thus, multiclass classifiers are only created using these two methods. However, the LR and SVM methods are both inherently binary, i.e. only naturally capable of distinguishing between two classes. In order to be used as multiclass classifiers, they can be adapted through strategies such as one-versus-all [39] and one-versus-one [42]. These are described next.

One-versus-all (OVA)
An OVA strategy [39] trains multiple instances of binary classifiers, each designed to predict the probability of a separate classification problem. These probabilities are then combined to make a multiclass prediction. In our problem, the OVA strategy trains four instances of a binary classifier. Each binary classifier prescribes a correct binary health classification of 1 to all VPs belonging to the corresponding classification. All other VPs, irrespective of which of the other three classifications they belong to, are assigned a correct binary health classification of 0: τ(j)_i = 1 if y_i = C(j), and τ(j)_i = 0 otherwise, where τ(j)_i represents the correct binary health classification of the ith VP for the jth instance of a binary classifier. To assign a predicted multiclass classification to a new subject, the predicted probability of producing a positive binary response (y_i = C(j)) is found for all four binary classifiers. The classification that corresponds to the highest predicted probability is then selected as the multiclass prediction.
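The OVA label assignment and combination step can be sketched as below; the class labels and the dictionary of per-class probabilities are illustrative assumptions, and any probabilistic binary classifier could supply the probabilities.

```python
# Sketch of the OVA strategy: one binary labelling per class, and
# prediction by the most confident of the per-class binary classifiers.

def ova_binary_labels(y, target_class):
    """tau_i^(j): 1 for VPs of the j-th class, 0 for all others."""
    return [1 if yi == target_class else 0 for yi in y]

def ova_predict(prob_by_class):
    """Pick the class whose binary classifier is most confident."""
    return max(prob_by_class, key=prob_by_class.get)

y = ["healthy", "aorta", "iliac_1", "aorta"]
print(ova_binary_labels(y, "aorta"))  # [0, 1, 0, 1]
print(ova_predict({"healthy": 0.20, "aorta": 0.55,
                   "iliac_1": 0.15, "iliac_2": 0.35}))  # aorta
```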

One-versus-one (OVO)
An OVO strategy [42] creates binary classifiers for all the pairs of the classes. Thus, if n total classes exist, then n(n − 1)/2 binary classifiers are created. The most frequent class predicted among these binary classifiers is then used as the multiclass prediction.
In our problem, the OVO strategy creates six instances of a binary classifier. Each binary classifier is designed to distinguish between two different classes; thus, the binary classifier created to distinguish between classifications C(j) and C(k) uses only the VPs belonging to these two classifications. When predicting the classification of an unseen test VP, a voting scheme is applied. The input measurements taken from the test VP are passed through each of the six instances of a binary classifier, and the predicted classifications are recorded. The classification that occurs most frequently is selected as the multiclass prediction.
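The pair enumeration and voting scheme can be sketched as follows; the class labels and the example vote list are assumptions for illustration.

```python
# Sketch of the OVO strategy: n(n-1)/2 pairwise classifiers, combined
# by majority vote over their predictions.
from itertools import combinations
from collections import Counter

classes = ["healthy", "aorta", "iliac_1", "iliac_2"]
pairs = list(combinations(classes, 2))  # the six pairwise classifiers
print(len(pairs))  # 6

def ovo_predict(pair_votes):
    """Majority vote over the predictions of the pairwise classifiers."""
    return Counter(pair_votes).most_common(1)[0][0]

votes = ["aorta", "aorta", "healthy", "aorta", "iliac_1", "iliac_2"]
print(ovo_predict(votes))  # aorta
```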
It is found that while both LR classifiers employing an OVA method and SVM classifiers employing an OVO method achieve high aortic, first iliac, and second iliac classification accuracy, they produce very low healthy VP classification accuracy (see Section 4.5). To rectify the low healthy VP classification accuracies a custom probabilistic configuration is developed, as described next.

Custom probabilistic configuration (CPC)
The CPC method assigns all VPs a health classification corresponding to 'no disease' before running any binary classifiers. This strategy treats 'no disease present' as the default, opposite to the three other possible classifications a VP may belong to. The binary classifiers employed in CPC are identical to those of OVA, except that the classifier for 'no disease' is omitted. Thus, as opposed to the four binary classifiers of the OVA strategy, this strategy uses only three binary classifiers, pertaining to a diseased aorta, first iliac, and second iliac, respectively.
The assignment of true state binary outcomes to VPs for the three binary classifiers follows the OVA assignment; note that j = 1, the 'no disease' classification, is not included. To predict a multiclass classification for test VPs, the vessel that produces the highest probability of being diseased among the three binary classifiers is first found. The default multiclass classification is 'no disease' unless this highest probability is greater than a prescribed threshold (decision boundary), in which case the test VP is predicted to have disease in the arterial vessel with this highest probability, i.e.

y_i = C(j*) if p_τ(j*)(x_i; θ(j*)) > B, and y_i = C(1) otherwise, where j* = argmax_j p_τ(j)(x_i; θ(j)),

where p_τ(j) represents the probability of the ith VP being predicted to produce a positive binary response for the jth instance of a classifier within the ensemble; x_i represents the vector of measurements for the test patient; θ(j) represents the measurement-specific weightings for the jth classifier; and B represents the threshold (decision boundary).
As opposed to the classical OVA, where the classification with highest predicted probability, irrespective of the magnitude of this probability, is chosen, CPC requires a minimum certainty of disease being present to be met before the default hypothesis 'no disease' can be overridden. It is not possible to create multiclass classifiers in this manner using non-probabilistic methods, such as SVM.
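The thresholded decision rule can be sketched as below; the vessel names, probability values, and default boundary of 0.5 are illustrative assumptions.

```python
# Sketch of the CPC decision rule: default to 'no disease' unless the
# most probable diseased vessel exceeds the decision boundary B.

def cpc_predict(disease_probs, boundary=0.5):
    """disease_probs maps each vessel to its predicted disease probability;
    only three classifiers exist, with no 'no disease' classifier."""
    vessel = max(disease_probs, key=disease_probs.get)
    return vessel if disease_probs[vessel] > boundary else "no disease"

print(cpc_predict({"aorta": 0.40, "iliac_1": 0.30, "iliac_2": 0.20}))  # no disease
print(cpc_predict({"aorta": 0.70, "iliac_1": 0.30, "iliac_2": 0.20}))  # aorta
```

Lowering `boundary` biases the rule towards diseased predictions; raising it biases towards 'no disease', which is the trade-off explored later via ROC curves.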

Quantification of results
Two different methods are used to quantify and compare the results of different classifiers. The first, and most intuitive, of these is to compute the sensitivity and specificity of classification across the test set. Whether a VP is classified correctly or incorrectly is determined by comparison against the true states, see Table 3. The proportion of VPs belonging to a classification that are correctly classified, i.e. the sensitivity (Se), is computed as Se = TP/(TP+FN), while the proportion of VPs not belonging to a classification that are correctly classified, i.e. the specificity (Sp), is computed as Sp = TN/(TN+FP). The relationships between TP, FN, FP, TN, Se, and Sp with respect to the class C(j) are shown in Fig. 5.
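A minimal sketch of these two formulas, with an assumed 0/1 label encoding and illustrative prediction vectors:

```python
# Sensitivity and specificity from true and predicted labels, treating
# one class as 'positive'. Labels and example data are assumed.

def sensitivity_specificity(y_true, y_pred, positive=1):
    """Se = TP/(TP+FN); Sp = TN/(TN+FP)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 1, 1]
print(sensitivity_specificity(y_true, y_pred))  # (0.75, 0.5)
```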
In the case of multiclass classifiers, assessment of the accuracy of classification requires provision of the sensitivity and specificity corresponding to each classification. In our case, there are four classes, thus requiring specification of eight different numbers (four sensitivities and four specificities). While quantifying the accuracy of ML classifiers through the sensitivity and specificity of each classification is simple and easily understood, the description of results through two different numbers per classification can make comparison of different classifiers difficult.
A more complex, but more easily compared, method for quantifying the accuracy of ML classifiers is the F score [43]. The F score produces a single quantitative score allowing for easy comparison, with higher values implying a better classification. To calculate the F score, the precision (P) and recall (R) of each discrete classification are calculated. A visual explanation of the F score, precision, and recall is shown in Fig. 5. Precision is the proportion of patients predicted to belong to a classification who do in fact belong to that classification. Recall is the proportion of patients belonging to a classification who are correctly classified, and is thus identical to sensitivity. The F score combines the precision and recall as

F_δ = (1 + δ^2) P R / (δ^2 P + R),

where δ represents a hyperparameter. Values of δ above 1 give preference to recall, while values under 1 give preference to precision. Although there is a preference for recall in the proposed application of the classifiers, δ = 1 is used to get a general sense of classifier performance without any bias. As δ = 1 is used, the F score is referred to as the F1 score and is, thus, essentially the harmonic mean of precision and recall. While the F1 score balances the effect of precision and recall, it does not balance the effect of sensitivity and specificity. Given a situation in which there is an equal number of healthy and unhealthy VPs, an ENBC which correctly predicts the health of 80% of healthy VPs (R = Se = 0.8) and 20% of unhealthy VPs (Sp = 0.2) will achieve an F1 score of 0.61. An ENBC that correctly predicts the health of 20% of healthy VPs (R = Se = 0.2) and 80% of unhealthy VPs (Sp = 0.8), however, will achieve an F1 score of 0.28, despite the fact that the total number of VPs correctly classified is unchanged. This highlights the importance of using the F1 score and the sensitivities/specificities in combination.
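The F score and the worked example above can be reproduced with a short sketch; note that in both quoted cases the precision works out to 0.5 (with equal class sizes, FP equals TP in the first case and FN in the second is irrelevant to P).

```python
# F_delta score from precision and recall; delta = 1 is the harmonic
# mean of the two (the F1 score).

def f_score(precision, recall, delta=1.0):
    """F_delta = (1 + delta^2) * P * R / (delta^2 * P + R)."""
    d2 = delta ** 2
    return (1 + d2) * precision * recall / (d2 * precision + recall)

# ENBC example from the text (equal healthy/unhealthy classes): both
# cases give P = 0.5, yet the F1 scores differ markedly.
print(f_score(0.5, 0.8))  # ~0.615, quoted as 0.61 in the text
print(f_score(0.5, 0.2))  # ~0.286, quoted as 0.28 in the text
```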

Empirical evaluation VPD size
While an estimate of the adequacy of the VPD size has been made by calculating the EPV, this can be checked more thoroughly by training and testing a series of classifiers with successively increasing numbers of VPs. This assessment is made for the case with the largest input dimensionality, i.e. when all six measurements (three pressure and three flow-rate profiles) at all three measurement locations are used (see Fig. 1).
To minimise the lowest number of VPs belonging to a single classification, classifiers must be trained to predict the health of each vessel individually. As the VPD has been created so that there is an equal number of healthy and unhealthy VPs, for any given number of available VPs an ENBC will have half of the available VPs belonging to C(1) and half belonging to C(2). On the contrary, three series of IVBCs are created (as described in Section 3.4.1.1), each predicting the health of a different vessel. This results in each instance of an IVBC having 5/6 of the available VPs belonging to the negative binary classification, but only 1/6 belonging to the positive binary classification. By empirically showing there is an adequate number of VPs to train and test classifiers in this extreme situation, it is reasonable to assume there is an adequate number of VPs to train and test ENBCs.
Due to the class imbalance present, i.e. there being significantly more VPs belonging to C(2) than C(1), a weighting w is applied to the cost of VPs belonging to C(1) when training IVBCs. Without this weighting, the IVBCs are biased towards VPs belonging to C(2). The weighting applied to the cost of the prediction of VPs belonging to C(1) for each classifier is calculated by assigning a ratio r to the effective numbers of VPs belonging to classifications C(1) and C(2):

w = r m(2) / m(1),

where m(1) and m(2) represent the numbers of VPs belonging to classes C(1) and C(2), respectively. The corresponding cost (loss) function is modified from Eq. 16 so that the cost contribution of every VP belonging to C(1) is multiplied by w. When r = 1 is used, VPs belonging to C(1) and C(2) have the potential to contribute equally to the total cost of prediction across the training set. If r > 1 is used, bias is given towards VPs belonging to C(1), while r < 1 gives bias towards VPs belonging to C(2). Unless stated otherwise, r = 1 is used.
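The weighting rule can be sketched as follows; the function name and example class counts are assumptions chosen to mirror the 1/6 versus 5/6 IVBC split described above.

```python
# Cost weighting for the minority class C(1): choose w so that the
# effective class sizes satisfy w * m1 : m2 = r : 1.

def c1_weight(m1, m2, r=1.0):
    """Weight applied to the cost of each C(1) VP."""
    return r * m2 / m1

# IVBC case: 1/6 of VPs positive (C(1)), 5/6 negative (C(2))
print(c1_weight(1000, 5000))         # 5.0 (equal effective contributions)
print(c1_weight(1000, 5000, r=2.0))  # 10.0 (bias towards C(1))
```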
For successively increasing numbers of VPs, five instances of each of the three IVBCs, corresponding to disease in the three vessels, are trained and tested. Each of these instances uses a different random subset of VPs for training and testing the classifier. The average performance of the five instances is then computed, thus minimising the effect of the test-train split. This is referred to as five-fold validation. The average F1 scores achieved across the training and test sets, over the five folds, with increasing numbers of VPs are shown in Fig. 6.

Figure 6 shows that both training and test accuracies are low when a small proportion of the VPD is made available to the ML classifiers. This suggests that the classifiers being trained are underfitting the training data, i.e. low variance but high bias: the classifiers can neither fit the training data nor generalise to the test data. As the number of available VPs increases, the behaviour of classifiers differs between the aorta and the two common iliacs. In the case of the aorta, the training accuracy remains relatively constant while the test accuracy increases. In the case of the two common iliac classifiers, both the training and test accuracies increase. These behaviours suggest the classifiers are fitting the training data better, and as a consequence are better able to classify test patients. Initially, between 1000 and 5000 available VPs, the changes made to the partition between VPs belonging to C(1) and C(2) through the input measurement space are significant, and so there are large changes in the training and test accuracies. As the number of available VPs continues to increase, the partition between healthy and unhealthy patients through the input measurement space begins to converge to an optimum solution. This causes the changes in the training and test accuracies to reduce, and eventually flatten off.
Figure 6 suggests that beyond 7000 VPs the VPD contains enough VPs to train and test ML classifiers. This is shown by the fact that the training and test accuracies of each vessel are consistent for the final several numbers of available VPs, and so the partitions between healthy and unhealthy patients are no longer changing.
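The repeated random-split averaging described above can be sketched as below; `train_and_score` is a hypothetical stand-in for training a classifier and returning its test F1, and the 80/20 split fraction is an assumption.

```python
# Average test score over five random train/test splits, minimising the
# effect of any single split (the 'five-fold validation' of the text).
import random

def five_fold_f1(vps, train_and_score, train_frac=0.8, seed=0):
    rng = random.Random(seed)
    scores = []
    for _ in range(5):
        shuffled = vps[:]
        rng.shuffle(shuffled)            # a different random subset each time
        cut = int(train_frac * len(shuffled))
        scores.append(train_and_score(shuffled[:cut], shuffled[cut:]))
    return sum(scores) / len(scores)

# stand-in scorer: a real one would train an IVBC and return its test F1
vps = list(range(7000))
print(five_fold_f1(vps, lambda train, test: len(test) / len(vps)))  # ~0.2
```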

Fig. 6 Analysis of the adequacy of VPD size when using pressure and flow-rate measurements at all three locations: training and test F1 scores with successively increasing VPD size

VPD characteristics
The stenosis characteristics in the VPD are visualised by considering the distributions of the severity S and the location of the stenosis (b + e)/2. These distributions are shown in Fig. 7a and b, respectively. The severity is uniformly distributed within the range considered, and the location (b+e)/2 shows a Gaussian-like distribution centred at the middle of each vessel.

ENBC results
The architecture of LR, NB, and SVM classifiers can all be considered problem independent. While these three algorithms are able to undergo varying levels of problem-specific optimisation, the underlying structure of the classifier cannot be changed. In the case of SVM classifiers, the classifier is optimised for a specific problem by choosing a kernel to map the input measurements to a higher-order feature space; unless otherwise stated, all SVM classifiers use a radial basis function (RBF) kernel. In the case of NB classifiers, the classifier is optimised for a specific problem by choosing the distribution of input measurements across the dataset; here, for NB, it is assumed that all input measurements are normally distributed across the dataset. The architecture of RF classifiers, however, is dependent on the specific problem: the number of trees within the ensemble and the maximum depth of each tree can be optimised. To fit the RF classifiers, a basic grid search is carried out. The hyperparameters describing the architecture that produces the highest F1 score are found empirically, and this combination of hyperparameters is then used for all further classifiers trained and tested.
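A basic grid search of the kind described can be sketched as follows; the hyperparameter values and the toy scoring function (a stand-in for 'train an RF and return its test F1') are assumptions for illustration.

```python
# Exhaustive grid search: evaluate every hyperparameter combination and
# keep the one with the highest score.
from itertools import product

def grid_search(param_grid, score_fn):
    combos = [dict(zip(param_grid, values))
              for values in product(*param_grid.values())]
    return max(combos, key=score_fn)

grid = {"n_trees": [50, 100, 200], "max_depth": [3, 5, 10]}
# toy stand-in scorer, peaked at an assumed optimum of (100, 5)
score = lambda p: -abs(p["n_trees"] - 100) - abs(p["max_depth"] - 5)
print(grid_search(grid, score))  # {'n_trees': 100, 'max_depth': 5}
```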

Like-for-like input measurement comparison
To gain a better understanding of how much difference in F 1 score can be considered insignificant, classifiers that should theoretically achieve identical accuracies are compared.
Exploiting the symmetrical structure of the arterial network (see Fig. 1), classifiers that use symmetric measurements can be identified. These are referred to as like-for-like measurements; two examples of such measurements are shown in Fig. 9. Any discrepancy between the F1 scores achieved by classifiers trained using like-for-like input measurement combinations is therefore introduced due to training and statistical errors. There are 24 possible cases of like-for-like input measurement pairs. The discrepancy in the F1 score achieved by the two classifiers within each of these pairs is computed when using each of the four different classification methods. It is found that NB classifiers show significantly greater magnitudes in the discrepancies of F1 scores than any of the three other methods. The maximum discrepancy in F1 scores produced when using an NB method is 0.18. This large discrepancy points to something beyond statistical and training errors and is, therefore, most likely related to the unsuitability of the NB method to our problem. It is therefore decided to exclude the results achieved by the NB method from all subsequent analysis. The histograms of the discrepancies in F1 score between like-for-like input measurement combinations produced when using the remaining three ML methods are shown in Fig. 10.

Fig. 9 Two examples of like-for-like input measurements

Figure 10 shows that the discrepancies in F1 scores between like-for-like input measurement combinations follow a very similar pattern for both the LR and RF classification methods. For both of these methods, the majority of the 24 like-for-like input measurement combinations produce a discrepancy in F1 score of less than 0.005. There is then a clear exponential decay in the number of occurrences as the F1 score discrepancy increases. Twenty of the 24 LR pairs, and 16 of the 24 RF pairs, achieve a discrepancy of less than 0.01.
When looking at the F1 discrepancies of SVM classifiers, there appears to be no real decay in the number of occurrences as the F1 discrepancy increases; instead, a relatively constant number of SVM pairs produce F1 discrepancies between 0 and 0.025.
The maximum discrepancy in F1 scores between like-for-like input measurement combinations is 0.0231. This discrepancy is measured for two pairs of input measurements when using an SVM method. The first of these two pairs is (Q3, P3) and (Q2, P2). When training an SVM classifier using (Q3, P3), the sensitivity and specificity are 0.71 and 0.51, respectively; when training an SVM classifier using (Q2, P2), the sensitivity and specificity are 0.74 and 0.47, respectively. The second pair of input measurements producing a discrepancy in F1 score of 0.0231 is (Q3, P1) and (Q2, P1). When training SVM classifiers using (Q3, P1) and (Q2, P1), the sensitivities and specificities are 0.76 and 0.50, and 0.80 and 0.46, respectively. For both pairs of input measurements highlighted above, while there are some differences in the sensitivities and specificities produced, the differences in accuracies are relatively low and the behaviours of the two classifiers are relatively consistent. From Fig. 10 and the aforementioned analysis, a difference in F1 score of more than 0.01 between two LR, SVM, or RF classifiers trained using different input measurements can be considered significant and likely due to the behaviour of the classifiers. It is important to remember, however, that a difference in F1 score of approximately 0.025 is required to fully rule out the possibility that patterns are due to training or statistical errors.

Fig. 11 The average, maximum, and minimum F1 scores achieved by all the ENBCs against the number of input measurements. The central markers represent the average score achieved, while the error bars indicate the upper and lower limits. The combination of input measurements that produces the highest F1 score is identified in text for the SVM method

Effect of the number of input measurements
Appendix A and Fig. 8 show that there is a correlation between the number of input measurements used in the ML classifiers and the F1 score. To investigate this further, the average F1 score achieved by all the classifiers using one to six input measurements is found for each of the three classification methods. The maximum and minimum F1 scores are also recorded and shown in Fig. 11. It can be seen that as the number of input measurements increases, the average F1 score achieved by all classification methods also increases, most noticeably for the SVM method. For the LR and RF classification methods, the average F1 score achieved when using one input measurement is approximately 0.5, representing naive classification (Se + Sp = 1). The average F1 score achieved by SVM classifiers trained using one input measurement is marginally better than naive classification. The finding that the average F1 score increases with the number of input measurements is expected, as the discriminatory information available increases, on average, as more measurements are made available.
Observing the range of maximum to minimum F1 scores in Fig. 11, it can be seen that as the number of input measurements increases, the range of F1 scores decreases. An interesting pattern to note is that while the average and minimum F1 scores increase when increasing the number of input measurements from four to six, the maximum remains relatively constant. The maximum and minimum F1 scores are shown in Table 4, along with the corresponding sensitivities and specificities. Table 4 shows that the maximum accuracy of classification, assessed by F1 scores, sensitivities, and specificities, varies insignificantly between four, five, and six measurements. Thus, the analysis indicates that similar levels of accuracy can be achieved using only four measurements compared to the case when all six measurements are used, but one must be judicious in the choice of the four measurements.

Table 4 The combinations of input measurements that produce the maximum and minimum F1 scores when providing four, five, and six input measurements. The corresponding sensitivities and specificities are also included

Importance of inlet pressure and flow-split
A further pattern noticed within the tables in Appendix A and Fig. 8 is that classifiers trained using P1 generally perform better than those that do not use P1. To analyse this further, the F1 scores of classifiers trained with and without P1 are separated and plotted in Fig. 12.

Fig. 12 The histograms of the F1 scores achieved by the ENBCs that include P1 (upper) and exclude P1 (lower)

This pattern is, in part, expected due to design. There are a total of 32 combinations of input measurements that include P1, and 31 combinations that exclude P1. The classifier trained using all six input measurements, and five of the six classifiers trained using five input measurements, contain P1; only one classifier trained using five input measurements does not include P1. It has previously been shown in Fig. 11 that, generally, classifiers trained using more input measurements achieve higher accuracy. There is, therefore, some expected skewing towards higher F1 scores in favour of classifiers trained with P1. This expected behaviour is further amplified by the fact that only one combination of input measurements consists of a single input measurement and contains P1, compared to five combinations that consist of a single input measurement and exclude P1. This results in an expectation of more low-scoring classifiers without P1. Figure 12 shows that, in the case of LR, only 11 of the 32 classifiers trained using P1 achieve an F1 score of less than 0.54. This compares to all 31 LR classifiers trained without P1 achieving an F1 score of less than 0.54. In the case of SVM classifiers, only one combination of input measurements containing P1 achieves an F1 score of less than 0.54, compared to five combinations not containing P1 that achieve an F1 score of less than 0.54.
When the threshold for comparison is increased to 0.6, it is found that 20 of the 32 SVM classifiers trained with P1 exceed this threshold, compared to 14 of the 31 trained without P1. Similar analysis shows that the inclusion or exclusion of Q1 produces similar patterns and behaviours in the F1 scores. Thus, measurements of pressure and flow-rate at the inlet of the system appear to be particularly informative in differentiating between healthy and unhealthy patients.
Another observation can be made by observing the highest-scoring SVM classifiers in Fig. 11. The best-performing classifiers include P1 and a combination of measurements that determines the flow-split between the left and right iliacs. For example, when three measurements are used, the best combination is (Q3, Q1, P1), which enables the flow-split to be known through mass conservation (note that the compliance of the arteries is relatively small) in addition to P1. This observation bears similarity to the classical inverse problem analysis presented in [35], where the authors show that in order to find the parameters of any arterial network, the inlet pressure and the flow-splits to all the outlets should be known.

Linear vs non-linear partitions
Comparing the results achieved by LR and SVM classifiers in all previous analyses, it can be seen that SVM classifiers consistently achieve higher accuracy than LR classifiers. When using all six input measurements, the LR and SVM classifiers achieve sensitivities and specificities of 0.73 and 0.52, and 0.80 and 0.57, respectively. Similarly, the F1 scores for the LR and SVM classifiers are 0.58 and 0.65, respectively. All SVM classifiers trained up to this point have mapped the input measurements to a higher-order feature space through the use of a radial basis function (RBF) kernel. The fact that the accuracy of SVM classifiers is consistently higher than that of LR classifiers suggests that the partition between healthy and unhealthy VPs through the pressure and flow-rate measurement space is likely non-linear. To test the hypothesis that the increase in accuracy seen in SVM classifiers is due to this higher-order mapping, an SVM classifier is trained and tested with a linear kernel. It is found that an SVM classifier trained using all six pressure and flow-rate measurements and a linear kernel produces an average sensitivity and specificity of 0.85 and 0.42, respectively, over five folds of the VPD. This corresponds to an F1 score of 0.53; the corresponding F1 scores for LR and RBF SVM are 0.58 and 0.65, respectively. The non-linear SVM outperforms both the linear SVM and LR (also linear), thus demonstrating that a non-linear mapping is beneficial in discerning between healthy and diseased states.

Fig. 13 Logistic regression: predicted probability of disease against stenosis severity for diseased patients (left) and histogram of predicted probability of disease for healthy subjects (right). TP: true positive; FN: false negative; FP: false positive; TN: true negative
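The benefit of a non-linear kernel can be illustrated with a kernel perceptron, a much simpler stand-in for an SVM used here purely for illustration (it is not the paper's method): XOR-labelled points admit no linear separator, yet become separable under an RBF kernel.

```python
# Kernel perceptron with an RBF kernel on XOR-labelled data, which no
# linear decision boundary can classify correctly.
import math

def rbf(x, z, gamma=1.0):
    """Radial basis function kernel."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, z)))

def kernel_perceptron(X, y, kernel, epochs=10):
    """Dual perceptron: misclassified points get their alpha incremented."""
    alpha = [0.0] * len(X)
    for _ in range(epochs):
        for i, (xi, yi) in enumerate(zip(X, y)):
            f = sum(a * yj * kernel(xj, xi)
                    for a, yj, xj in zip(alpha, y, X))
            if yi * f <= 0:
                alpha[i] += 1.0
    return alpha

X = [(0, 0), (0, 1), (1, 0), (1, 1)]  # XOR labels: not linearly separable
y = [-1, 1, 1, -1]

alpha = kernel_perceptron(X, y, rbf)
decide = lambda xi: 1 if sum(a * yj * rbf(xj, xi)
                             for a, yj, xj in zip(alpha, y, X)) > 0 else -1
print([decide(xi) for xi in X])  # [-1, 1, 1, -1] -- all four correct
```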

Effect of disease severity
Here the effect of disease severity on the accuracy of classification is investigated. This analysis is performed using an SVM classifier employing an RBF kernel and an LR classifier, both using pressure and flow-rate measurements at all the three locations.
A scatter plot of the predicted probability returned by the LR classifier against the severity of disease (i.e. diseased VPs) for false negatives and true positives is shown in Fig. 13 (left), while a histogram of the predicted probabilities for all the healthy VPs (i.e. zero severity) is shown in Fig. 13 (right). Contrary to intuitive reasoning, which suggests that higher severity of stenosis should be easier to detect, no trends are observed in Fig. 13, with classification accuracy being independent of the severity. This suggests that the variability in the pressure and flow-rate waveforms induced by the boundary conditions representing physiology before and after the anatomical network (see Fig. 1) is large and can overshadow the variability induced by stenosis severity alone.
Since the SVM classifiers do not predict a probability of disease, but a direct classification of a healthy or diseased subject, histograms of the distributions of true positives and false negatives across the range of severities for diseased subjects are considered to assess the effect of stenosis severity. These are shown in Fig. 14.

IVBC results
Following an identical procedure to that employed for the ENBC combination search, three IVBC combination searches, one for disease classification in each of the three vessels, are performed using the LR and SVM methods. The IVBC combination searches are limited to these two classification methods due to the higher computational expense, and the fact that these two methods have shown consistently higher accuracy. The full tables of results for the IVBC combination searches are presented in Appendix B. The average, minimum, and maximum F1 scores achieved when using one to six input measurements are shown in Fig. 15. There is good agreement between the overall behaviours seen across the IVBC and ENBC (as shown in Fig. 11) combination searches. These similarities include:
• The average and minimum F1 scores achieved continuously increase when increasing the number of input measurements from one to six.
• The maximum F1 score initially increases rapidly and reaches an asymptotic limit between two and four input measurements.
• The SVM method consistently produces higher accuracy results than the LR method.
For the SVM configurations corresponding to the maximum F1 scores, the sensitivities, specificities, and combinations of measurements are shown in Table 5. It shows that the combinations of input measurements that produce the highest F1 scores in the two common iliacs are not only identical but also symmetrical (with the same input measurements being taken from the right and left sides). While the combinations of input measurements that produce the highest F1 scores differ from the ENBC results (see Table 4), a similarity between the two is that the best-performing classifiers include a pressure measurement and a combination of measurements that determines the flow-split. In Section 4.3.3, it is hypothesised that the combination of pressure at the inlet and the flow-split may be particularly informative. Table 5, however, seems to suggest that it may be the pressure within the diseased vessel and the flow-split that best capture the presence of a stenosis. Comparing Tables 4 and 5 also shows that IVBCs, owing to their more granular characterisation of disease states, lead to higher F1 scores, sensitivities, and specificities relative to the ENBCs. Neither of them is, however, able to point to the precise vessel that is diseased in the network. Note that even an IVBC classifier with perfect accuracy does not lead to knowledge of the precise diseased vessel; for example, the aortic IVBC classifier only determines whether disease is in the aorta, and considers both healthy patients and those with a diseased iliac vessel together in one class (see Section 3.4.1.1). When knowledge of not only the presence of disease but also its precise location is required, multiclass classifiers are necessary, and their results are presented next.

Table 5 The combinations of input measurements that produce the maximum F1 scores when providing four, five, and six input measurements to the IVBCs with the SVM method. The corresponding sensitivities and specificities are also included

Multiclass analysis
Results of the multiclass configurations are presented here. Unlike the ENBC and IVBC classifier results presented above, here the goal is also to determine which vessel the disease is located in. Due to the increased computational expense, a full combination search is not carried out for multiclass classifiers. Instead, multiclass classifiers are trained and tested using the measurements of pressure and flow-rate at all three available locations.
Initially, multiclass classifiers are created using LR employing an OVA method (see Section 3.4.2.1) and SVM employing an OVO method (see Section 3.4.2.2). While these initial classifiers produced high accuracy for aortic, first iliac, and second iliac disease classification, it is found that the sensitivity corresponding to the classification of VPs with 'no disease' present is consistently close to 0. A multiclass classifier is, therefore, created using LR employing a CPC method, as outlined in Section 3.4.2.3.
The results of the OVA, OVO, and CPC classifiers are shown in Table 6. Table 6 shows that for OVA the sensitivities and specificities for the first and second iliac are equivalent. For the aorta, the specificity is relatively higher but comes at the compromise of reduced sensitivity. Finally, the healthy classification sensitivity is poor, close to zero. Similar behaviour is observed for OVO, with almost all classification accuracies lower when compared to OVA; thus, OVA outperforms OVO in all cases and is superior for this application. When comparing OVA against CPC, the largest improvement is seen in the sensitivity of the healthy classification: an increase to ∼50% in CPC compared to ∼0% in OVA. For the aorta and the iliacs, a rebalancing of sensitivities and specificities is observed in relation to OVA: an increase in sensitivity is accompanied by a decrease in specificity, with their averages relatively unchanged. Overall, Table 6 shows that the CPC achieves its purpose of improving the classification accuracy for the healthy ('no disease') class without significantly compromising the other classification accuracies.
When creating CPC multiclass classifiers, preference can be given to healthy or unhealthy VPs by adjusting the decision boundary B in Eq. 24, i.e. the certainty required to override the default classification that a VP has no disease present. Lowering the decision boundary reduces the certainty required to override this default, biasing the classifier towards unhealthy VPs. Raising the decision boundary requires more certainty to classify a VP as diseased, biasing the classifier towards healthy VPs.
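The decision rule just described can be sketched as follows. This is a hypothetical sketch only: Eq. 24 is not reproduced here, and it assumes the CPC supplies a probability for each disease class, with 'healthy' as the default unless the largest disease-class probability exceeds the boundary B.

```python
# Hedged sketch of the CPC decision-boundary logic: a VP defaults to
# class 0 (healthy) unless the most probable disease class exceeds B.
import numpy as np

def cpc_decide(probs: np.ndarray, B: float) -> np.ndarray:
    """Return 0 (healthy) or 1..3 (diseased vessel) per VP row."""
    best = probs.argmax(axis=1)          # most likely diseased vessel
    confident = probs.max(axis=1) > B    # enough certainty to override?
    return np.where(confident, best + 1, 0)

# Rows: per-VP probabilities for the three disease classes (illustrative)
probs = np.array([[0.2, 0.1, 0.1],       # weak evidence -> stays healthy
                  [0.7, 0.1, 0.1]])      # strong evidence -> vessel 1
print(cpc_decide(probs, B=0.5))          # prints [0 1]
```

Lowering B in this sketch flags more VPs as diseased (bias towards unhealthy VPs), while raising it keeps more VPs at the healthy default, mirroring the trade-off described above.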
To analyse the aforementioned effect of the decision boundary on the classification of VPs, receiver operating characteristic (ROC) curves [1] are plotted. ROC curves are obtained by plotting the true positive rate against the false positive rate of each classification as the decision boundary is varied. By recording a series of discrete true-positive/false-positive points for various decision boundaries, a curve is fitted that characterises the accuracy of each classification across all possible decision boundaries. A complete ROC curve must start at the point where both the true positive and false positive rates are 0, i.e. no VPs are predicted to belong to the classification being examined, and must end at the point where both rates are 1, i.e. all VPs are predicted to belong to the classification being examined. A naive classifier, achieving an accuracy of 50%, produces a straight line between these two points, and so the area under the curve (AUC) is equal to 0.5. A perfect classifier ascends vertically along the y-axis between the points (0, 0) and (0, 1), then traverses horizontally between the points (0, 1) and (1, 1), resulting in a perfect AUC score of 1. The point (0, 1) represents a perfect classifier, as 100% of positive VPs are correctly classified, while 0% of negative VPs are incorrectly classified.
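The boundary-sweep construction above can be sketched in a few lines. The scores and labels below are synthetic stand-ins (not CPC output), and the trapezoidal rule is one common way to compute the AUC from the discrete points:

```python
# Sweep the decision boundary, record (false positive rate, true
# positive rate) at each setting, then integrate to obtain the AUC.
import numpy as np

def roc_points(scores, labels, boundaries):
    """(fpr, tpr) pairs for each decision boundary, sorted by fpr."""
    pts = []
    for b in boundaries:
        pred = scores >= b
        tpr = (pred & labels).sum() / labels.sum()
        fpr = (pred & ~labels).sum() / (~labels).sum()
        pts.append((fpr, tpr))
    return np.array(sorted(pts))

# Illustrative classifier scores and true (diseased) labels
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.7, 0.2])
labels = np.array([False, True, True, True, False, False])

pts = roc_points(scores, labels, np.linspace(0.0, 1.0, 101))
# Trapezoidal rule for the area under the (fpr, tpr) staircase
auc = float(np.sum(np.diff(pts[:, 0]) * (pts[1:, 1] + pts[:-1, 1]) / 2))
print(round(auc, 3))                     # prints 0.778
```

Note the curve starts at (0, 0) (boundary 1: nothing flagged) and ends at (1, 1) (boundary 0: everything flagged), matching the completeness requirement described above.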
Within the context of the multiclass CPC, when a decision boundary of 1 is applied, all VPs are classified as healthy, and so the ROC curves of aortic, first iliac, and second iliac disease classification all begin at the true positive/false positive position (0, 0); the corresponding position for healthy classification is (1, 1), i.e. all VPs are assigned to the classification of no disease present. When the decision boundary is set to 0, all VPs are classified as having disease in one of the three vessels, and so the healthy classification reaches the point (0, 0). A complete ROC curve can, therefore, be obtained for healthy classification. However, while a decision boundary of 0 classifies all VPs as having disease in one of the three vessels, this does not ensure that the true positive and false positive rates are equal to 1 for disease classification in each individual vessel. Complete ROC curves can, therefore, not be plotted for aortic, first iliac, and second iliac disease classification. The ROC curve of healthy VP classification accuracy is plotted against the reported true positive and false positive rates of 193 current general screening methods, recorded in [30] and [3], and is shown in Fig. 16. Note that the general screening methods in this figure are not necessarily all cardiovascular related, i.e. this analysis does not compare the results achieved here against directly comparable methods for an identical configuration. Instead, it compares the results achieved against the general landscape of clinically used screening methods. Figure 16 shows that the ROC curve of healthy VP classification follows a desirable profile. The AUC of the ROC curve is computed to be 0.75; an AUC of between 0.7 and 0.9 can be considered moderate accuracy [12].
The overall correct classification of healthy VPs by the CPC outperforms approximately 20 of the current methods.
The ROC curve of classifiers created in this PoC study cannot be fairly compared to current screening methods, as the effects of simplifications, such as using only a simple three-vessel system and limiting the number of diseased vessels to one, are not understood. However, Fig. 16 provides some indication of how the results achieved in this PoC study compare to currently used screening methods. Overall, the results, despite the simplifications and assumptions used in this study, are encouraging and point towards the potential of increased classification accuracies when larger networks and more sophisticated ML or deep learning algorithms are used.

Fig. 16 The ROC curve of healthy VP classification within the CPC ensemble trained and tested using pressure and flow-rate measurements at all three measurement locations. The accuracy of classification is compared to current general screening methods (shown in red) [1]

Conclusions
This is a first-of-its-kind PoC study exploring the application of ML to the detection of stenosis in arterial networks.
The key conclusion is that ML methods are appropriate for the detection of arterial disease, as demonstrated in the three-vessel network. The most balanced classifier, the CPC, achieves specificities larger than 80% and sensitivities ranging from 50 to 75%. The AUC under the ROC curve for this classifier is 0.75, which outperforms approximately 20 of the current methods used in clinical practice for various types of screening (see Section 4.5). This observation motivates further exploration of more sophisticated ML and deep learning methods on virtual databases created on larger networks. This can facilitate home monitoring of disease and/or larger-scale, cost-effective screening programmes.
Among the four ML methods considered, it is shown that LR and SVM perform significantly better than NB and RF, with the further advantage that these require little to no problem-specific optimisation. While this conclusion may be specific to the network considered, the results show that non-linear classification methods, such as the SVM with radial basis functions, perform better than linear classification methods for arterial disease detection. Finally, it is demonstrated that the standard methods can be modified in a custom probabilistic configuration to not only detect the presence of stenosis in the network, but also identify the diseased vessel.
This study also presents a methodological framework to both create the virtual patient database and assess that it is of adequate size for the ML applications. The conclusion from ML classifier performance is that not all measurements are equally informative (or the classifiers are not able to extract the information equally from all the measurements), and that similar classification accuracies can be achieved by using fewer measurements, as long as the measurements are chosen judiciously. For example, the spread of F1 scores for a given number of measurements in Fig. 11 (also see Table 4) shows that the best performing classifier with only three measurements outperforms the worst performing classifiers with four and five measurements. In this context of which measurements are more informative, for the arterial network considered, the importance of inlet pressure, inlet flow-rate, and flow-split at the bifurcation is highlighted.
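For reference, the F1 score used to compare measurement combinations is the harmonic mean of precision and sensitivity. A minimal sketch from confusion-matrix counts (the counts below are illustrative, not taken from the study's tables):

```python
# F1 score from raw confusion-matrix counts: the harmonic mean of
# precision (tp / (tp + fp)) and sensitivity (tp / (tp + fn)).
def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)     # a.k.a. recall
    return 2 * precision * sensitivity / (precision + sensitivity)

# Illustrative counts: 75 true positives, 20 false positives,
# 25 false negatives (i.e. sensitivity of 75%)
print(f1_score(tp=75, fp=20, fn=25))
```

Being a harmonic mean, the F1 score penalises classifiers that trade one of precision or sensitivity heavily against the other, which is why it is a useful single summary when ranking measurement combinations.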

Limitations and future work
Several simplifications and assumptions are made during both the creation of the VPD and the training and testing of ML classifiers. These are likely to affect the classification accuracies achieved within this study. Some of the major limitations are:

• The arterial network, containing only three vessels, is small. It is not clear whether this aids or hinders classification: on the one hand, due to the small size of the network, the signals are less diffused; on the other hand, specific features which may result from unique reflections at certain anatomical locations are not accounted for. However, the small arterial network does allow for a preliminary analysis which, with encouraging results, points towards exploration in larger networks.

• The distributions of all arterial network parameters, excluding disease conditions, across the VPD are described using independent distributions. These simple distributions ignore the complex inter-parameter relationships likely seen within real arterial networks. This simplification likely results in a wider range of pressure and flow-rate profiles across the VPD, making distinction between healthy and unhealthy VPs more difficult. It may potentially be addressed by first determining the probability distributions through an inverse problem approach, for example Markov chain Monte Carlo.

• This study is completed without significant consideration of clinical requirements, for example: which measurements are easily obtainable, and what range of stenosis severities should an ML classifier be able to detect? These questions are best explored on a larger network.

• The pressure and flow-rate profiles are free of noise and other measurement errors. A straightforward inclusion of additive Gaussian white noise in the measurements is unlikely to result in significant changes, as the representation of the measurements is based on a Fourier series: with truncation of the series at low orders, such a representation will filter out most of the added noise. A careful investigation of the errors in the measurements, including biases, is thus needed to make the results more robust. This should be performed while considering the nature and magnitude of errors in the measurement devices used in the clinic, as well as procedural errors, which may include biases.

Appendix A: ENBC combination search results
The F1 scores, sensitivities, and specificities achieved by ENBC classifiers when employing each of the four classification methods are shown in Tables 7, 8, and 9.

Conflict of interest
The authors declare no competing interests.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.