Introduction

The foramen ovale is a bridging structure during embryological development. Normally, it closes spontaneously after birth. When it does not, a patent channel persists, termed a patent foramen ovale (PFO), which carries an increased risk of paradoxical embolism [1, 2]. The association between PFO and stroke was first proposed in 1988 [3]. Since then, numerous studies, including observational studies and randomized controlled trials (RCTs), have suggested a potential causal effect of PFO on cryptogenic stroke (CS) [4,5,6,7,8,9,10]. The prevalence of PFO in the general population is around 25%, but reaches up to 40% in CS patients [3].

Although the first three RCTs suggested a neutral effect of PFO closure on stroke prevention [4,5,6], more recent RCTs [7, 9, 10] and updated meta-analyses [11, 12] all supported a benefit of PFO closure. In view of these data, current guidelines recommend PFO closure in patients with proven PFO-associated stroke (PFO-AS), a concept introduced in 2020 [13, 14]. Although device closure on average decreases the risk of recurrent stroke, treatment effects have varied substantially across studies [15].

In recent years, applications of artificial intelligence and machine learning have shown promising results in health care. They not only support data management, but also aid in disease prediction and patient sub-clustering. For example, in a data-driven machine-learning analysis, the authors applied a hierarchical k-means clustering algorithm to explore potential sources of embolic stroke [16]. In another study, machine learning was used to automatically discriminate cardioembolic from non-cardioembolic strokes in large datasets [17].

Understanding the different causes of stroke and classifying strokes by etiologic subtype are prerequisites for effective treatment. State-of-the-art machine learning may help identify subsets of patients who would benefit most from PFO closure or who remain at elevated risk of recurrent stroke. Thus, in this study, we applied unsupervised machine learning to detect sub-clusters in post-closure PFO patients and to assess their associated risk of adverse outcomes. As a second aim, we used supervised machine learning to identify potential predictors of adverse outcomes.

Methods

Study design and population

The analyzed population came from 7 centers in China: Guangdong Cardiovascular Institute, Zhongnan Hospital of Wuhan University, Wuhan Asian Heart Hospital, Hubei Huiyi Cardiovascular Center, Jiang Men Central Hospital, the First People's Hospital of Foshan and General Hospital of Southern Theatre Command of PLA. Patients with embolic stroke of undetermined source (ESUS) and PFO were included between June 1st, 2013 and May 31st, 2020. The diagnosis of ESUS was systematically assessed by both a neurologist and a cardiologist after excluding the other common etiologies of stroke. PFO was initially detected by transthoracic or transesophageal echocardiography or by right heart contrast echocardiography, and was finally confirmed during cardiac catheterization. Patients with overt alternative causes of their strokes and those who did not receive PFO closure were excluded. We collected demographic information, laboratory data and echocardiographic data for the included subjects.

Outcome ascertainment

Patients were followed by regular telephone interviews or outpatient examinations. The main outcome in our study was a composite of recurrent ischemic stroke, transient ischemic attack (TIA) or all-cause death. Major bleeding and new-onset atrial fibrillation (AF) were examined as secondary outcomes. Cardiac rhythm was assessed by cardiac auscultation, followed by electrocardiography if the auscultation was abnormal. At each telephone interview or outpatient visit, a standardized and validated questionnaire (Questionnaire for verifying stroke-free status) was used to detect potential stroke or TIA.

Statistical analysis

Student's t-test and the chi-square test were used to compare differences between groups. Data are shown as mean (SD), median (interquartile range [IQR]) or number (percentage). A two-sided P < 0.05 was considered significant.

Data pre-processing was conducted before machine learning. The summary of missing data is shown in Additional file 1: eTable1. As suggested by the missing data pattern (Additional file 1: eFigure1), the missingness was considered random, with no particular trend among the variables. Missing values were then imputed by multiple imputation using the R package “mice”. Thirty-four variables from the demographic, laboratory and echocardiographic data were finally included.

We first used principal component analysis (PCA) to reduce the dimensions, using the FAMD (Factor Analysis of Mixed Data) function. Next, we applied cluster analysis to the PCA outputs using the HCPC (Hierarchical Clustering on Principal Components) function in the FactoMineR package. HCPC partitions the data by cutting the hierarchical tree (dendrogram). To consolidate the final partitioning solution, we further performed k-means clustering. Binary data were treated as numeric values before clustering [18]. The average silhouette of observations was computed for different values of k (1 to 10), and the location of the maximum was taken as the optimal number of clusters. Cox proportional hazards regression was then applied to calculate the hazard ratio (HR) and 95% confidence interval (95%CI) of adverse events by cluster. The proportional-hazards assumption was tested and no violation was found. To examine potential bias from the imputed datasets, we performed a complete case analysis as a sensitivity analysis.
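The silhouette-based choice of k described above can be illustrated with a minimal sketch. This is not the authors' R code: it uses a plain Lloyd's k-means and synthetic two-dimensional toy data, and the names `kmeans`, `average_silhouette` and `toy` are hypothetical. Because the silhouette width is only defined when there are at least two clusters, the sketch scans k from 2 upward.

```python
import math
import random

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means on a list of equal-length tuples (illustrative only)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def average_silhouette(points, labels):
    """Mean silhouette width; needs at least two non-empty clusters."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        raise ValueError("silhouette is undefined for fewer than two clusters")
    widths = []
    for i, p in enumerate(points):
        own = [math.dist(p, q) for j, q in enumerate(points)
               if labels[j] == labels[i] and j != i]
        if not own:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(own) / len(own)  # mean distance to the point's own cluster
        b = min(                 # mean distance to the nearest other cluster
            sum(math.dist(p, q) for j, q in enumerate(points) if labels[j] == c)
            / labels.count(c)
            for c in clusters if c != labels[i]
        )
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)

# Synthetic toy data (an assumption, not study data): two separated blobs.
rng = random.Random(42)
toy = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(30)] + \
      [(rng.gauss(5, 0.5), rng.gauss(5, 0.5)) for _ in range(30)]

scores = {k: average_silhouette(toy, kmeans(toy, k)) for k in range(2, 7)}
best_k = max(scores, key=scores.get)  # k with the highest average silhouette
print(best_k, round(scores[best_k], 3))
```

On real mixed data the clustering would be run on the retained principal components rather than on raw variables, as in the workflow described above.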

In supervised learning, we first used all available variables to construct the random forest survival (RFS) model and assessed the variable importance (VIMP). We then selected the top-ranking features to reconstruct the prediction models and assessed their performance using the concordance index (C-index) and Brier score (BS). A higher C-index and a lower BS suggest better prediction performance. In addition, we applied supervised self-organizing maps to help visualize the features associated with each cluster within the studied patients. All statistical analyses were performed using R version 4.1.2 or Stata 15.1 (StataCorp/SE, College Station, TX). Detailed descriptions of the R source code are provided in the supplement.
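To make the two performance metrics concrete, the following is a minimal sketch of Harrell's C-index and of a simplified Brier score at a fixed time horizon. It is an illustration under stated assumptions, not the implementation used in the study: the function names and the five-subject toy data are hypothetical, and the Brier score here simply skips subjects censored before the horizon instead of applying inverse-probability-of-censoring weights.

```python
def harrell_c_index(times, events, risk_scores):
    """Share of comparable pairs in which the higher-risk subject fails first."""
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # Comparable pair: subject i has an observed event strictly
            # before subject j's follow-up time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

def brier_score_at(t, times, events, predicted_survival):
    """Naive Brier score at horizon t (no censoring weights; a simplification)."""
    squared_errors = []
    for time, event, surv in zip(times, events, predicted_survival):
        if time > t:
            observed = 1.0   # still event-free at t
        elif event == 1:
            observed = 0.0   # event occurred by t
        else:
            continue         # censored before t: status unknown, skipped
        squared_errors.append((observed - surv) ** 2)
    return sum(squared_errors) / len(squared_errors)

# Hypothetical toy data: follow-up days, event indicator, model outputs.
times = [100, 200, 300, 400, 500]
events = [1, 0, 1, 0, 0]
risk = [0.9, 0.2, 0.1, 0.3, 0.05]          # higher = predicted higher risk
surv_at_250 = [0.2, 0.8, 0.9, 0.7, 0.95]   # predicted P(event-free at day 250)

c_index = harrell_c_index(times, events, risk)
brier = brier_score_at(250, times, events, surv_at_250)
print(round(c_index, 3), round(brier, 4))
```

A C-index of 0.5 corresponds to random ranking and 1.0 to perfect ranking, whereas the Brier score measures squared error of the predicted probabilities, so the two metrics can move independently.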

Results

A total of 197 PFO patients receiving percutaneous interventions were initially included in this study. The first 12 principal components with an eigenvalue ≥ 1, which cumulatively accounted for 70.62% of the variance in the dataset, were used as input for the HCPC method (Additional file 1: eTable 2). Using HCPC, 3 clusters were identified (Fig. 1-A). Since the middle cluster included only 1 patient, we excluded this cluster, leaving 196 patients for subsequent analysis. Briefly, the average age of the included subjects was 42.7 (12.37) years and 64.80% (127/196) were female. During a median follow-up of 739 (IQR 731–924) days, 22 (11.22%) patients were lost to follow-up. A total of 12 adverse events (12/174, 6.9%) were reported, including 6 recurrent strokes (6/174, 3.45%), 5 TIAs (5/174, 2.87%) and one death (1/174, 0.6%). No AF or major bleeding was documented.
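The denominators in this paragraph can be traced with a few lines of arithmetic. The sketch below only re-derives the counts and percentages reported above; the variable names and the helper `pct` are introduced here for illustration.

```python
# Counts taken from the paragraph above.
enrolled = 197
singleton_cluster = 1          # the one-patient cluster that was excluded
lost_to_follow_up = 22
female = 127
adverse_events = {"recurrent stroke": 6, "TIA": 5, "death": 1}

analysed = enrolled - singleton_cluster      # patients entering the analysis
followed = analysed - lost_to_follow_up      # denominator for event rates

def pct(numerator, denominator):
    """Percentage rounded to two decimals."""
    return round(100 * numerator / denominator, 2)

total_events = sum(adverse_events.values())
print(analysed, followed)                    # 196 174
print(pct(female, analysed))                 # 64.8
print(pct(lost_to_follow_up, analysed))      # 11.22
print(pct(total_events, followed))           # 6.9
for name, count in adverse_events.items():
    print(name, pct(count, followed))
```

Event rates therefore use the 174 patients with follow-up as the denominator, while baseline proportions use all 196 analysed patients.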

Fig. 1

Clusters identified by different methods. A Dendrogram from hierarchical clustering on principal components analysis. B The average silhouette of observations for different values of k (1 to 10) using k-means clustering analysis. The highest average silhouette was located at k = 2

Among the analyzed subjects, 77 (39.29%) patients were assigned to cluster 1 and 119 (60.71%) to cluster 2. Compared with cluster 1, patients in cluster 2 were more likely to be male and had higher systolic and diastolic blood pressure, higher body mass index (BMI), lower high-density lipoprotein cholesterol (HDL-C) and a higher proportion of atrial septal aneurysm (ASA). Red blood cell count, hemoglobin, creatinine, uric acid, left atrium (LA), left ventricular end-diastolic dimension (LVEDD), interventricular septum (IVS) and posterior wall thickness (PW) were also higher in patients in cluster 2. Detailed descriptions of these variables are summarized in Table 1 and visualized in Fig. 2. In Cox regression analysis, patients in cluster 2 had a 21% higher hazard of adverse events than those in cluster 1, although the difference was not statistically significant (HR 1.21, 95%CI 0.62–2.36, P = 0.58, Fig. 3-A).
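The relation between the reported hazard ratio, its confidence interval and the P value can be checked with a short calculation. The sketch below assumes the interval is a Wald interval that is symmetric on the log scale, which is the usual default for Cox models but is not stated in the text; the function name `wald_p_from_hr` is hypothetical.

```python
import math

def wald_p_from_hr(hr, ci_low, ci_high, z_crit=1.959964):
    """Recover the log-scale standard error, z statistic and two-sided P value
    from a hazard ratio and its 95% Wald confidence interval (an assumption)."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(hr) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return se, z, p

se, z, p = wald_p_from_hr(1.21, 0.62, 2.36)
print(round(se, 3), round(z, 3), round(p, 2))
```

Under that assumption the back-calculated P value rounds to the reported 0.58, and because the interval spans 1 the data are compatible with either a lower or a higher hazard in cluster 2.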

Table 1 Baseline characteristics of the study patients according to the clusters (from HCPC)
Fig. 2

Self-organizing maps supervised by clusters identified by HCPC analysis. HCPC, hierarchical clustering on principal components; sbp, systolic blood pressure; dbp, diastolic blood pressure; bmi, body mass index; hdlc, high-density lipoprotein cholesterol; hgb, hemoglobin; ast, aspartate aminotransferase; alt, alanine aminotransferase; alb, albumin; ua, uric acid; asa, atrial septal aneurysm

Fig. 3

Cumulative hazard estimates of adverse events according to the identified clusters. A Clusters from hierarchical clustering on principal components analysis. B Clusters from k-means clustering. HR, hazard ratio

In the k-means clustering analysis, the highest average silhouette was located at k = 2, suggesting 2 as the optimal number of clusters (Fig. 1-B). Baseline characteristics across the 2 clusters are summarized in Additional file 1: eTable 3. Generally, the results were similar to those from HCPC. Cox regression analysis also suggested that patients in the high-risk cluster tended to have a higher hazard of adverse events, although this did not reach statistical significance (HR 2.11, 95% CI 0.63–6.96, P = 0.21, Fig. 3-B). The high-risk cluster was characterized by a higher proportion of males, higher blood pressure, higher BMI and lower HDL-C. Likewise, the findings from the complete case analysis were largely consistent with those of the primary analysis, except that the analyzed sample was substantially smaller (Additional file 1: eFigure2 and eTable4).

Figure 4 plots the variable importance of the full model from the random forest survival analysis. We then selected the eight top-ranking features to construct the prediction models: fasting blood glucose, thickness of the interventricular septum, the ratio of mitral peak early (E) to late (A) diastolic filling velocity, left ventricular end-systolic dimension, BMI, systolic blood pressure, thickness of the posterior wall and PTA. As presented in Table 2, the RFS model had a similar Brier score (2.6% vs 2.4%) but a higher C-index than the traditional Cox proportional hazards regression model (0.87 vs 0.54), suggesting better discrimination of adverse events.

Fig. 4

Random forest variable importance (VIMP). Blue bars indicate positive VIMP, red indicates negative VIMP. Importance is relative to positive length of bars

Table 2 Performance metrics for different prediction models*

Discussion

Increasing evidence supports that PFO closure can significantly reduce the risk of stroke or TIA compared with medical therapy [7, 9,10,11,12]. The reported rate of recurrent stroke or TIA after PFO closure has varied across studies, ranging from 0 to 5.61% [4,5,6,7,8,9,10]. As pointed out by previous researchers, the key determinant of treatment effect is whether the discovered PFO is causally related to the stroke or just an innocent bystander [19]. Currently, two prediction systems are used to evaluate the likelihood of a stroke-related PFO: the risk of paradoxical embolism (RoPE) score and the PASCAL classification system [14, 20].

The main components of the RoPE score are age, smoking status, history of hypertension, diabetes, stroke or TIA, and characteristics of the infarct on imaging [20]. The PASCAL classification system builds on the RoPE score and additionally considers PFO features, such as PFO shunt size and the presence or absence of an ASA [2, 14]. Although these prediction systems are widely used in clinical practice, their estimates can be undermined by violated model assumptions or limited by subjective feature selection.

Machine learning is a promising technique that is increasingly applied in health care. It allows phenotyping or sub-clustering of the analyzed population without knowledge of the outcomes, which may help reveal the underlying etiologies. In this study, we identified two main clusters in post-closure PFO patients, with the high-risk cluster characterized by a higher proportion of males and poorer cardiovascular profiles. Machine learning also enables objective feature selection and efficient construction of prediction models. The RFS analysis further supported the predictive value of traditional risk factors, suggesting that high-risk groups should continue to be targeted to prevent stroke recurrence even after PFO closure [21, 22]. However, whether optimizing the management of these factors would provide extra benefits for these patients warrants further investigation.

To date, few studies have applied machine learning to PFO [16, 17]. Owing to the small number of adverse outcomes, the statistical power to identify independent predictors of recurrent stroke/TIA is often limited when using the traditional Cox regression model [23]. The application of supervised machine learning helps, to some extent, to address this problem [24]. As shown in this study, the RFS model performed better than the Cox regression model after selection of the top-ranking features. Additionally, the RFS model was able to identify predictive features that were neglected in previous studies, for example, the ratio of mitral peak early (E) to late (A) diastolic filling velocity and the thickness of the interventricular septum or posterior wall.

Although this is a pioneering study, several limitations should be acknowledged. First, because this is a post-hoc analysis, some data are not available, for example, PFO shunt size before closure and residual shunting after closure. Second, the small dataset and the missing data could potentially bias the results, although the missing data pattern suggests random missingness and the results from the complete case analysis were similar to those from the imputed datasets. Third, the constructed model was not validated in external datasets, which to some extent limits its generalizability. Finally, the evaluation of AF was based on cardiac assessment during follow-up visits. Occult AF may therefore have been missed, leading to an underestimation of the reported prevalence of AF.

Conclusions

There were 2 main clusters in PFO patients receiving device closure. Both supervised and unsupervised machine learning suggest that traditional cardiovascular profiles remain important predictors of future recurrence of stroke or TIA. However, whether optimizing the management of these factors would provide extra benefits in post-closure PFO patients warrants further investigation.