Machine Learning in Hypertension Detection: A Study on World Hypertension Day Data

Many modifiable and non-modifiable risk factors have been associated with hypertension. However, current screening programs are still failing in identifying individuals at higher risk of hypertension. Given the major impact of high blood pressure on cardiovascular events and mortality, there is an urgent need to find new strategies to improve hypertension detection. We aimed to explore whether a machine learning (ML) algorithm can help identifying individuals predictors of hypertension. We analysed the data set generated by the questionnaires administered during the World Hypertension Day from 2015 to 2019. A total of 20206 individuals have been included for analysis. We tested five ML algorithms, exploiting different balancing techniques. Moreover, we computed the performance of the medical protocol currently adopted in the screening programs. Results show that a gain of sensitivity reflects in a loss of specificity, bringing to a scenario where there is not an algorithm and a configuration which properly outperforms against the others. However, Random Forest provides interesting performances (0.818 sensitivity – 0.629 specificity) compared with medical protocols (0.906 sensitivity – 0.230 specificity). Detection of hypertension at a population level still remains challenging and a machine learning approach could help in making screening programs more precise and cost effective, when based on accurate data collection. More studies are needed to identify new features to be acquired and to further improve the performances of ML models.


Introduction
Arterial hypertension still remains the most important modifiable risk factor for cardiovascular disease worldwide.Despite extensive knowledge about ways to prevent and treat hypertension, the global incidence and prevalence of hypertension and its cardiovascular complications are still elevated mainly due to inadequacies in prevention, detection and control [1,2].The high variability characterising blood pressure (BP) values, together with the lack of specific symptoms of this condition, make the detection of hypertension still challenging.Since 2005, the World Hypertension League has been leading a global campaign to raise awareness of the importance of hypertension through annual screening programs.During the 2018 survey, among 502079 participants found to have hypertension, only 59.5% were aware of having such condition [3].This evidence confirms that current screening programs conducted by the National Health Services are still failing in detecting appropriately hypertension.Therefore, there is an urgent need to find new strategies to improve hypertension detection at a population level, given that identifying risk factors for hypertension may facilitate earlier interventions, aimed at preventing future development of hypertension, early detecting its appearance and reducing the incidence of its long-term consequences.
In the recent years, artificial intelligence (AI), which includes all computer systems able to perform tasks normally requiring human intelligence, has been successfully applied to healthcare and shown to be a valid tool in managing different clinical conditions [4,5].
The Italian Society of Hypertension (SIIA) conducts every year a national campaign to increase awareness of the importance of high BP detection.Over the years, several questionnaires related to hypertension have been administered, generating a large dataset.In particular, 37110 individuals participated in the World Hypertension Day campaigns from 2015 to 2019.From the initial dataset, 20206 subjects have been selected for the present study, after removing those with high BP already diagnosed, out of which 4192 (20.75%) with newly discovered hypertension.Data include demographics, risk factor information, questions about general knowledge on hypertension and three measures of systolic and diastolic BP and heart rate.The aim was to apply supervised machine learning (ML) algorithms to such a large dataset in order to find a model capable of detecting unknown hypertension, and comparing their performances with the current screening protocols.Data were split into a training set with 14144 records and a validation set with 6062 samples.We used five ML models -logistic regression, decision tree, random forest, support vector machine and XGBoost -trained performing a 10-fold cross-validation process on the original training set in the first place, and then applying both oversampling and undersampling techniques for managing the imbalanced nature of data.
The performance of the models was evaluated estimating the sensitivity, specificity, accuracy and precision of the different trained models.Results show that, among different ML algorithms and different balancing techniques exploited, there is not an algorithm and a configuration which properly outperforms against the others, even though Random Forest is the most promising one in all the three schemes.The undersampling experiments are those that provide highest sensitivity scores, but a gain in sensitivity often reflects in poor sensitivity and overall accuracy.In particular, the model that best performs in terms of sensitivity was obtained with XGBoost under undersampling scheme, which however showed poor specificity and, consequently, overall low accuracy.The best compromise is given by Random Forest in the undersampling experiment, that provides sensitivity of 0.818, specificity of 0.629, an overall accuracy of 0.681 and AUC of 0.816.
Medical protocols result in a high sensitivity of 0.906 but have a poor specificity of 0.230, thus including in the screening program subjects that likely will not develop hypertension.This result highlights that the current protocols are not optimised and that novel strategies are needed to avoid unnecessary expanses and to reach a larger population.
Future studies should develop and apply new techniques and algorithms with the goal to improve the model performances, possibly evaluating how accurate data acquisition could be enriched with additional new information that may support the automatic prediction of hypertension occurrence.

Related work
In recent years, AI has been successfully applied to healthcare as a valid medical tool in different clinical conditions [4,5].ML in particular is designed for performing high accuracy predictions on individuals' outcome without explicit programming but based on learning patterns from acquired data.
Different ML algorithms have been applied in the field of hypertension with very heterogeneous results [6][7][8][9][10].The review by Martinez-Ríos et al. [11] provides a comprehensive analysis of the literature in the field.The conclusion was that ML has proven to be useful in classifying hypertensive subjects, even though there is not a general agreement on which algorithms perform better, which metrics must be measured to evaluate the model and, mostly, which type of data and features must be acquired to replicate the studies and possibly train predictors that outperform the clinical protocols currently adopted.
To be effective, predictions should be based on data that are accurately, easily and massively collected to screen a population whose size is as large as possible.For instance, predictive models based on genetic data make a mass screening not feasible [7].Promising results are presented in [10] where four ML models were evaluated on 11 easyto-collect variables, risk factors (such as smoking, drinking, family history) and anthropometric data.However, data were all acquired in the same local hospital, making results possibly not generalisable to individuals living in other areas.

Study subjects
Data were collected from questionnaires administered during the World Hypertension Day from 2015 to 2019.Individuals willing to participate were asked to fill in an anonymous questionnaire and had their BP measured according to the European Society of Hypertension (ESH) standards (3 consecutive BP readings were performed by trained health personnel with validated automated devices after 5 min rest) [3].
Demographic information (age, sex, BMI), self-reported information on cardio-vascular risk factors (hypertension, diabetes, smoking, high cholesterol, kidney disease, family history of cardiovascular diseases), sleep complaints (snoring, witnessed apneas, daytime sleepiness) and prior cardiovascular diseases (previous cardio and cerebrovascular events, previous myocardial infarction) as well as information about the awareness of hypertension and its health consequences were collected through the questionnaire.Questionnaires were administered in medical check-up points in different cities streets and squares, thus involving a very diverse population.

Preprocessing
In order to analyse the raw data, a set of preprocessing techniques have been applied to the dataset.We included data of adult subjects only (age ⩾ 18 and ⩽ 100 years) with new-onset hypertension.Patients with known hypertension or already treated for hypertension were excluded.Participants were classified as newly detected hypertension if, computed the mean value of the last two BP measurements, at least one between systolic and diastolic BP was equal or greater than 140 or 90 mmHg respectively [12].Records with missing values have been removed since some of the algorithms adopted in the present study do not support missing values.Outliers for normally distributed data (age, height, weight), defined as those values deviating from the mean more than three times the standard deviation (outside the range ± 3 ) were also excluded from the analysis because were classified as errors in data acquisition.The whole preprocessing procedure is shown in Fig. 1.
Features on top of which our analysis grounds, are listed in Tables 1 and 2. All the features containing categorical values have been converted into a set of dummy variables, each one modelling one of the categories, for a total of 28 dummy variables and 2 continuous ones.For each feature in the table, a statistic description is available, mean/std for continuous variable, or % for dummy variables.Features include the information collected from questionnaires, namely individual's medical history, subject demographic, clinical characteristics, lifestyle factors, anthropometric measurements, and general knowledge on hypertension.Accordingly, our dataset is composed of 37110 individuals before applying exclusion criteria, which led to 20206 subjects being included in the analyses.This cohort includes middle aged subjects almost equally distributed in term of sex (51.65% females), with a normal BMI (24.74 kg/m2).Most of the included subjects were in primary prevention as only 3.52% reported a previous cardiovascular event.

Data analyses with machine learning algorithms
Algorithms have been trained to predict, on top of the features of Tables 1 and 2, an individual's hypertension risk.Since the goal it to detect hypertension, the database was split into two classes: 16014 subjects (79.25 %) with normal BP and 4192 subjects with newly discovered hypertension (20.75 %).
Due to the imbalance of the two groups, in order to avoid that the trained model overfits on the majority class, a combination of techniques was adopted: (i) accuracy was combined with precision and sensitivity to evaluate the model performance and a different coefficient was used to evaluate the errors of the classes (ii) the minority class was oversampled, (iii) the majority class undersampled.Accordingly, in this study, we performed different experiments in order to find the best predictors and achieve better performance.
Data were randomly split into a training set (70%, n = 14144), used for model construction and development, and a validation set (30%, n = 6062), used to test the performance of the derived model.
As a first experiment, we adopted the imbalanced dataset and we scaled errors with weights inversely proportional to the double of class frequencies in the input data.Given N as the samples number, N c the number of classes in the problem and N i the number of occurrences of class i, the weight w i to balance class i is Then, we tested the performances of the algorithms exploiting resampling methods to transform the composition of the training dataset and balance the class distribution.Oversampling consists in creating synthetic samples of the under-represented class, thus generating a training set with two classes equally populated: we adopted to this purpose the SMOTE technique [13] and obtained a training set with 22420 samples (11210 positive samples and 11210 negative samples).Validation has been performed on the same test set as for the previous experiment.Conversely, undersampling consists in reducing the data by eliminating examples belonging to the majority class: we adopted a random strategy to this purpose.
For all the three schemes of experiments, we ran a Grid Search with 10-fold cross-validation to prevent overfitting on the training set and chose the hyperparameters of the prediction model.We chose to adopt sensitivity as the main measure of model to be maximised during the training phase, in order to find the model that optimises the number of positive individuals correctly classified (true positive).The best model built during the training phase has been selected and tested on the validation set.To evaluate the performance of the derived prediction model sensitivity, specificity, precision and accuracy have been computed.Moreover, the receiver operating characteristic (ROC) curve and validated area under the curve (AUC) value were derived.
This workflow has been adopted to evaluate different machine learning algorithms: 1. Logistic Regression: a binary statistical model that estimates the probability of an event occurring, based on a given dataset of independent variables.To map predictions and their probabilities, this method uses a sigmoid logistic function of the form p(x) =1 1+e −x [14]; 2. Decision tree classifier: a scheme that uses a tree-like model of decisions to classify an input [15]; 3. Random Forest classifier: an ensemble learning technique that uses a combination of decision trees as the base classifiers.Each tree is constructed from a sample from the original dataset and collected outputs are combined to obtain the final classification [15].4. Support vector machine (SVM) classifier: a class of linear algorithms that generate a hyperplane to separates different classes of data with as wide a margin as possible [14].5. XGBoost model: a scalable, distributed gradient-boosted decision tree that, similarly to random forests, combines multiple machine learning algorithms to obtain a better model [16].
Since the majority of features, available in the dataset, were categorical, we avoided those machine learning algorithms thought to work on numerical values, e.g., K-Nearest Neighbors (K-NN).All analyses were performed using the scikitlearn library [17].

Data analyses with medical protocols
In order to make our evaluation as complete as possible, we formalised the medical protocol extracted from the WHO list of risk factors 1 .The set of rules derived has been applied to the original dataset and a confusion matrix defined, basically reporting how many positive inidviduals were intercepted and how many were left out, as well as how many negative individuals were classified as positive, bringing to overscreening and unnecessary medical tests.In particular, dealing with the features reported in Table 2 if an individual was positive to one of the factors reported Table 3, he/she was classified as positive.

Results
The clinical goal of this study was to maximise the number of individuals included in the screening who will likely develop hypertension, while minimising included individuals who will not develop hypertension.As such, we needed to select the model that provides the best compromise between sensitivity and specificity.
The evaluation metrics of the different models, for each of the three schemes evaluated, is presented in Table 4, highlighting the highest measure.Moreover, Fig. 2 shows the ROC curves associated with the best model found for each of the 5 classification algorithms from the original dataset.With the exception of decision tree, which provides the worse performances, from Fig. 2 we derive that, training algorithms on the original dataset, none of the ROC curves resulting from the other models outperforms and AUC remains low, thus requiring further experiments.
Extending the study with resampling methods, we observe that oversampling significantly drops sensitivity and slightly improves specificity of the different algorithms, while undersampling affects positively both these indicators.Generally speaking, the model with the highest specificity, accuracy and precision was obtained with Random Forest in oversampling scheme.However, the sensitivity was 0.368 leading far more than 50% of positive individuals being classified as negative.XGBoost performs very well with a sensitivity of 0.855 in oversampling, and a very promising 0.988 in undersampling configuration, but specificity falls to 0.404 and 0.113 respectively, resulting in low overall accuracy.The best compromise was achieved with the Random Forest model, getting to a sensitivity of 0.818 and a specificity of 0.629, and an overall accuracy score of 0.681.The corresponding AUC is 0.816, the highest we obtained among the different experiments and models, and the corresponding ROC curve is reported in Fig. 3 When compared with ML algorithms, use of the known risk factors as performed in standard care yielded to a sensitivity of more than 90% but a specificity of 23%, leading to an accuracy of 37%.

Discussion
Screening for hypertension still remains one of the most critical aspects in cardiovascular prevention.Appropriate screening programs could allow early detection of patients at risk and prevent the development of hypertension mediated organ damage.Big data analysis through AI algorithms could improve sensitivity and specificity of such screening programs.
In this study we evaluated and compared five supervised ML algorithms in predicting hypertension based on a large dataset acquired during a national campaign during the World Hypertension Day.Although the best performing model achieved with the Random Forest algorithm allowed identification of hypertensives and normotensives with acceptable accuracy, sensitivity appears to be suboptimal limiting its use in clinical practice.As such, the results of the present work highlight both the advantages and limitations of ML: if on one hand ML automates the entire data analysis, resulting in more comprehensive, deeper, and faster insights, it strongly depends on the quality of data.Despite the huge amount of scientific literature reporting ML applications in healthcare, suggesting that AI applications can improve accuracy and timeliness of diagnoses, results are not always convincing.Have data the sufficient quality to be adopted for ML analyses?Are data containing the information needed for a clinical evaluation?Accordingly, an appropriate and comprehensive data collection is of fundamental importance as predictive models are only as good as the the quality of data from which they are built.However, adopting the standard approach that includes the evaluation of known risk factors identified by the WHO medical protocol, allowed acceptable sensitivity but a very low specificity, leading to an approach that causes high expanses for the health system and that is not sustainable from an economic point of view.
Nevertheless, integrating the best ML model with clinical algorithms could represent the best compromise that maximises the number of true positives and minimises the false negatives.

Conclusion
Supervised ML algorithms applied to a big dataset collected in Italy by the SIIA, allowed to identify hypertension with modest accuracy and suboptimal sensitivity.Such an approach, if further improved and tested in different cohorts, could help with hypertension screening and might represent a potentially cost-effective alternative to patients' evaluation by physicians.Future research should consider the adoption of ML techniques to develop a clinical model which should combine in practice the accuracy in diagnosing hypertension with the need to reduce the costs particularly in low resource settings.

Sara1
Montagna and Martino Pengo contributed equally to this work.Page 2 of 10

Fig. 2
Fig. 2 ROC curves for the classification algorithms on the original dataset

Table 2
Descriptive statistics for dummy variables.Percentage are of true values

Table 4
Performance in the test set of the three experimented schemes, original, oversampling and undersampling the training set, and of the medical protocol evaluation In each of the three experiments, and for each indicator, the highest value is marked in order to highlight the best performing model