Introduction

The morbidity and mortality of rheumatic heart disease (RHD) are chiefly due to damage to the cardiac valves, consequent on an autoimmune reaction to Group A Streptococcal infection (typically, childhood sore throat). RHD is the only cardiovascular disease of global impact that has been shown to be completely preventable [1]. Poor social conditions, overcrowding, and limited access to medical resources are key enablers of RHD, which remains a major source of morbidity and mortality in low- and middle-income countries (LMICs) [2]. In excess of 40 million people are currently living with RHD worldwide [3]; most are in countries where advanced medical technologies such as percutaneous or surgical intervention are not accessible [4]. The Global Burden of Disease study has shown that RHD affects nearly five million more people than HIV and causes about 10 million disability-adjusted life years lost globally.

Group A Streptococcus (GAS) is the etiological agent triggering Acute Rheumatic Fever (ARF), with evidence of molecular mimicry by the M protein on the bacteria, which shares an α-helical coiled structure with cardiac proteins such as myosin [5]. Antibodies to the M protein cross-react with heart tissues, leading to carditis and other systemic manifestations such as arthritis [6, 7]. The current dominant (but yet to be proved) understanding is that progression to chronic RHD occurs through a pathway that includes repeated episodes of subclinical ARF in genetically susceptible individuals and interactions between host genes, GAS infections and social conditions of poverty [8].

RHD demonstrates a wide spectrum of symptoms and signs, with no single available confirmatory laboratory test; this adds to the difficulty in the diagnosis and treatment of early RHD cases [9]. Current diagnostic measures for ARF rely on the 2015 revised Jones criteria [10] incorporating echocardiography images of the heart valves [11]; however, the availability of echocardiography is highly limited in poorer countries. A striking mismatch between high prevalences of RHD and low prevalences of previously diagnosed ARF in developing countries has been observed [12, 13], indicating that a significant proportion of ARF cases are undetected, or undetectable with current tools, and that there is a missed opportunity to identify, and intervene in, those at risk for progression to severe RHD [10, 14]. Given the human and financial cost of this inability to recognize the disease until late in its course, a better understanding of the biological underpinnings of ARF and subsequent progression may present important targets for prevention and treatment. This study sought to complement our recent GWAS confirming an association between RHD and genetic susceptibility loci in African individuals [15] through the identification of a plasma protein signature of RHD that may aid biological understanding of the processes involved, and potentially point towards economically feasible interventions to prevent severe RHD in poorer countries based upon repurposing of readily available and inexpensive medicines.

Mass spectrometry of clinical specimens using the SWATH-MS technique implements a Data-Independent Acquisition (DIA) approach for precision identification and accurate quantification of proteins [16]. Briefly, the approach begins with the generation of precursor fragments coupled with further sequentially fragmented windows across the entire mass to charge ratio range. These mass spectra chromatograms are compared to a spectral library with a spectral scoring strategy employed as an in-silico, label-free protein quantification method. SWATH-MS data have been successfully subjected to various informatics techniques, including machine learning (ML) algorithms, to identify and characterize the differentially expressed proteins from the resultant digitized SWATH maps [17]. Here we identify candidate protein biomarkers for ARF and RHD, by applying ML methodology to proteomic data acquired using SWATH-MS, in severe cases of RHD and controls recruited from peri-urban settings across Africa.

Materials and methods

Study design

Two-hundred and fifteen patients with severe RHD, and 230 healthy controls, of various ethnicities recruited in peri-urban settings across the African continent, were included in this study. A breakdown of the contributing countries and sites is shown in Additional file 1: Table S1. There was no age restriction of the cases, and the controls were ethnically matched individuals with no echocardiographic evidence of RHD and who were older than 15 years of age. Case severity was determined by an experienced clinician, who assessed each heart valve lesion referring to echocardiographic images, and categorised valve disease for severity according to the Gewitz/ACC criteria [10]. Informed consent was provided by each participant before inclusion into the study. After the consent process, 5 ml of blood were obtained through standard procedures by a trained on-site nurse and transported for processing to the Cardiovascular Genetics laboratory at the University of Cape Town. Briefly, blood tubes were centrifuged at 3000 rpm for 10 min and plasma aliquoted into vials for storage at −80 °C. The plasma samples of cases and controls were then subjected to SWATH-MS at the Stoller Biomarker Discovery Centre, University of Manchester.

SWATH-MS proteomics

Samples were quality-checked, assigned a unique ID and cases and controls were randomized and prepared for mass spectrometry by tryptic hydrolysis after immunoaffinity depletion of the 12 major proteins found in plasma. To counteract batch effects following machine cleaning, we repeatedly tested plasma from pooled samples or a commercial standard until the Total Ion Current (TIC) Chromatogram stabilised, before running patient samples. Digitized proteomic maps were generated through SWATH-MS analysis performed on a 6600 TripleTOF mass spectrometer (Sciex, Warrington, UK) coupled to a Dionex Ultimate 3000 HPLC (Dionex, Thermo, UK), with specific mass spectrometric conditions (including isolation window size and overlap and total cycle time) as previously described [18].

Spectral libraries were generated by TransProteomic Pipeline (version 4.8.0) [19]. X!Tandem (version 2015.04.01.1) [20] was used to interrogate the SWATH-MS files generated from the samples. More specifically, the samples were pooled together to create a final set of 12 fractions and processed, generating 12 files that were searched against the appropriate database with X!Tandem. These files were further processed with the TransProteomic Pipeline, containing xinteract, InterProphetParser and spectrast, to generate the spectral library. SWATH maps were generated by OpenMS (version 2.0.1) [21] and MSproteomicstools (version 0.4.3). pyProphet (version 0.18.3) was used for the False Discovery Rate (FDR) calculations of the resulting transition groups. Feature alignment tools were used to align multiple pyProphet files with the corrected retention times and FDR scores. As the aligned SWATH maps contain transition-level information, the MSstats() function from the R package MSstats [22] (version 3.13.5) was used to infer protein-level quantification. Parameters chosen were the “top3” option for the parameter “featureSubset” and normalisation with Tukey-Median Polish (TMP). Coefficient of variation (CV) analysis between technical injection replicates was performed on the resulting MSstats-processed data, with samples allowed to go forward to downstream analysis if the median and 75% quantile of the CV were at most 20% and 30%, respectively. Proteins present in at least 40% of the samples were retained in the following biomarker analysis [23]. The 12 purposely physically immunodepleted proteins were removed in silico prior to statistical analysis.
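The two quality filters described above can be illustrated with a minimal Python sketch. It is not the pipeline used in this study (which relied on MSstats in R): the function names, data layout and example intensities are hypothetical, and the CV is assumed here to be computed on untransformed intensities.

```python
from statistics import mean, median, quantiles, stdev


def cv_percent(values):
    """Coefficient of variation (%) across technical injection replicates."""
    return 100.0 * stdev(values) / mean(values)


def sample_passes_qc(replicates, median_max=20.0, q75_max=30.0):
    """replicates: {protein: [intensity per technical injection]} (hypothetical layout).

    A sample goes forward if the median CV is at most 20% and the
    75% quantile of the CV is at most 30%.
    """
    cvs = [cv_percent(v) for v in replicates.values() if len(v) >= 2]
    q75 = quantiles(cvs, n=4, method="inclusive")[2]
    return median(cvs) <= median_max and q75 <= q75_max


def proteins_to_keep(abundance, min_fraction=0.40):
    """abundance: {protein: [value or None per sample]}; None means not quantified."""
    keep = []
    for protein, values in abundance.items():
        present = sum(v is not None for v in values)
        if present / len(values) >= min_fraction:
            keep.append(protein)
    return keep


# Made-up intensities for illustration only.
good_sample = {"P1": [100, 110], "P2": [200, 190], "P3": [50, 55], "P4": [80, 120]}
noisy_sample = {"P1": [100, 200], "P2": [10, 30], "P3": [50, 55]}
abundance = {"A": [1.0, None, None, 2.0, None], "B": [None, None, None, None, 1.0]}
```

In this toy data, protein "A" is quantified in two of five samples (40%) and is retained, whereas "B" (20%) is dropped.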

Statistical analysis

Proteomic data were log2 transformed to stabilize the variance and reduce heteroscedasticity. Baseline phenotypic characteristics were compared between case and control groups using Mann–Whitney U tests for continuous variables and Chi-squared tests for proportions. As some cases were taking Warfarin, we removed proteins known to be Vitamin-K dependent. Relationships between medications prevalent among cases (chiefly Warfarin and penicillin) and individual proteins were explored using Student’s t-tests. Pearson correlation coefficients of protein expression with BMI and age were calculated among case and control samples, and we tested for interaction between case/control status and sex in expression of each protein. An unadjusted bivariate comparison of all proteins between cases and controls was carried out using Student’s t-tests applied to log2 proteomics data; p-values from this analysis were corrected for multiple comparisons using the Bonferroni method.
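As an illustration only, the unadjusted bivariate comparison can be sketched in Python: intensities are log2 transformed, a pooled-variance (Student's) t statistic is computed for one protein, and p-values are multiplied by the number of tests for the Bonferroni correction. Function names and example values are hypothetical; converting the t statistic to a p-value requires the t distribution and is omitted from this sketch.

```python
import math


def log2_transform(intensities):
    """Log2 transform of strictly positive intensities."""
    return [math.log2(x) for x in intensities]


def student_t(a, b):
    """Pooled-variance two-sample t statistic and its degrees of freedom."""
    na, nb = len(a), len(b)
    mean_a, mean_b = sum(a) / na, sum(b) / nb
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((x - mean_b) ** 2 for x in b)
    df = na + nb - 2
    pooled_var = (ss_a + ss_b) / df
    t = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / na + 1 / nb))
    return t, df


def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


# Made-up values for illustration only.
cases = log2_transform([8.0, 16.0, 32.0])     # -> 3, 4, 5 on the log2 scale
controls = log2_transform([2.0, 4.0, 8.0])    # -> 1, 2, 3 on the log2 scale
log2_fold_change = sum(cases) / len(cases) - sum(controls) / len(controls)
t_stat, dof = student_t(cases, controls)
adjusted = bonferroni([0.01, 0.04, 0.5])
```

A log2 fold change of 2 in this toy example corresponds to a four-fold difference in mean intensity on the original scale.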

Feature selection was undertaken using the Boruta algorithm [24], which implements a random forest (RF) procedure comparing each candidate feature’s performance in a classification model with respect to that of a randomly created ‘shadow’ feature. Boruta has wide application in feature selection [25, 26] and has recently been applied to SWATH-MS data [27]. Boruta has been shown to be effective in permutation based feature selection [28]. The Boruta algorithm also has the merit of incorporating data from all collinearly associated proteins instead of randomly selecting one among them, as some other algorithms do. Log2 transformed proteomics data were randomly split into training and testing sets in a ratio of 7:3. The Boruta R package (version 7.0.0), was deployed with the parameter ntree, which defines the number of trees to grow, set to 500 and the parameter maxRuns, which specifies maximum runs the algorithm will iterate, set to 4000; these settings were chosen through an initial training of the model on a subset of the data.
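The shadow-feature idea behind Boruta can be sketched as follows. This is a deliberately simplified Python illustration, not the Boruta R package: the random-forest importance is replaced by a stand-in score (the absolute standardised difference in group means), only the "confirm" decision is modelled, and all names, thresholds and data are hypothetical.

```python
import math
import random


def importance(column, labels):
    """Stand-in importance: |difference in group means| / overall SD.

    Real Boruta uses random-forest permutation importance instead.
    """
    cases = [x for x, y in zip(column, labels) if y == 1]
    ctrls = [x for x, y in zip(column, labels) if y == 0]
    mean_diff = sum(cases) / len(cases) - sum(ctrls) / len(ctrls)
    mu = sum(column) / len(column)
    sd = math.sqrt(sum((x - mu) ** 2 for x in column) / (len(column) - 1))
    return abs(mean_diff) / sd if sd > 0 else 0.0


def binom_upper_tail(hits, runs, p=0.5):
    """P(X >= hits) for X ~ Binomial(runs, p)."""
    return sum(math.comb(runs, k) * p**k * (1 - p) ** (runs - k)
               for k in range(hits, runs + 1))


def shadow_feature_selection(features, labels, runs=50, alpha=0.01, seed=0):
    """Keep features that beat the best shuffled ('shadow') copy more often than chance."""
    rng = random.Random(seed)
    hits = {name: 0 for name in features}
    for _ in range(runs):
        shadow_max = 0.0
        for column in features.values():
            shadow = column[:]        # copy, then shuffle to break any link to the labels
            rng.shuffle(shadow)
            shadow_max = max(shadow_max, importance(shadow, labels))
        for name, column in features.items():
            if importance(column, labels) > shadow_max:
                hits[name] += 1
    return [n for n, h in hits.items() if binom_upper_tail(h, runs) < alpha]


# Toy data: 'signal' separates the groups perfectly, 'noise' has equal group means.
labels = [1] * 10 + [0] * 10
features = {
    "signal": [float(v) for v in range(10, 20)] + [float(v) for v in range(0, 10)],
    "noise": [1.0, 2.0] * 5 + [1.0, 2.0] * 5,
}
selected = shadow_feature_selection(features, labels)
```

Because each real feature is compared against the best of the shuffled copies rather than against other real features, correlated informative features can all be retained, which is the property noted above.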

In order to test the robustness of the biomarkers detected by the Boruta algorithm, the LASSO (Least Absolute Shrinkage and Selection Operator) logistic regression method was applied to the same training and testing datasets as used for the Boruta algorithm. The glmnet() function from the R package glmnet (version 4.1-2) was used to carry out LASSO regression.

The glm() function in R was used to implement a logistic regression (LR) model to yield adjusted betas and per-marker AUCs for each log2 scaled proteomic feature that had emerged as significant from the Boruta analysis. BMI, age and sex were included in the model, as was a BMI*age interaction term. Twenty-three patients with missing BMI information (including 8 controls and 15 cases) were removed from these analyses. The cumulative AUC for the addition of each biomarker, in order of its Boruta importance, was calculated using the Cstat() function from the DescTools package.
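For readers less familiar with these quantities, the following Python sketch shows how an odds ratio with its 95% confidence interval follows from a logistic regression coefficient, and how an AUC (C-statistic) can be computed from predicted scores. It is an illustration under stated assumptions rather than the glm()/Cstat() code used here: the function names and example values are hypothetical, and the interval assumes a Wald-type normal approximation.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Odds ratio per unit of a log2-scaled marker (i.e. per doubling), with a Wald 95% CI."""
    return math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se)


def c_statistic(scores, labels):
    """Probability that a random case scores higher than a random control (ties count 0.5)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    ctrls = [s for s, y in zip(scores, labels) if y == 0]
    concordant = 0.0
    for c in cases:
        for k in ctrls:
            if c > k:
                concordant += 1.0
            elif c == k:
                concordant += 0.5
    return concordant / (len(cases) * len(ctrls))


# Made-up predicted probabilities for two cases (1) and two controls (0).
perfect = c_statistic([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 0])
chance = c_statistic([0.9, 0.3, 0.4, 0.8], [1, 1, 0, 0])
```

A cumulative AUC, as described above, would be obtained by refitting the model with one additional marker at a time and recomputing the C-statistic from the fitted probabilities.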

Enrichment testing using the list of proteins identified by the Boruta algorithm was performed using ClueGo (version 2.5.7), a plug-in application in Cytoscape (version 3.8.2). The following databases were used: GO Biological Process; GO Molecular Functions; GO Immune System Process; KEGG; Reactome Pathways; Wiki Pathways. Following the approach used by others in similar analyses of plasma samples [29], we used the SWATH plasma reference library of 2,559 proteins as background in our principal analyses (analyses using the whole genome as background are presented in Additional file 1: Data). Only pathways with p-value < 0.05 (calculated using a two-sided hypergeometric test and Bonferroni step down correction) and a minimum of two proteins per pathway were considered.
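The enrichment statistics can be sketched in Python as below. This is not ClueGO's implementation: the two-sided p-value is assumed to be the total probability of all overlaps no more likely than the observed one, and the "Bonferroni step down" correction is assumed to correspond to Holm's method. Function names, the pathway size and the overlap in the example are hypothetical; only the background size (2,559) and the number of selected proteins (56) come from the text.

```python
from math import comb


def hypergeom_pmf(k, N, K, n):
    """P(overlap = k) when n proteins are drawn from N, of which K belong to the pathway."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def enrichment_p_two_sided(k, N, K, n):
    """Sum the probabilities of all overlaps that are no more likely than the observed one."""
    p_obs = hypergeom_pmf(k, N, K, n)
    lo, hi = max(0, n - (N - K)), min(K, n)
    total = sum(p for p in (hypergeom_pmf(i, N, K, n) for i in range(lo, hi + 1))
                if p <= p_obs * (1 + 1e-9))
    return min(1.0, total)


def holm_step_down(p_values):
    """Holm adjustment: multiply the i-th smallest p-value by (m - i), enforcing monotonicity."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, min(1.0, (m - rank) * p_values[i]))
        adjusted[i] = running_max
    return adjusted


# Hypothetical pathway: 40 of the 2,559 background proteins, 6 of them among the 56 selected.
p_example = enrichment_p_two_sided(k=6, N=2559, K=40, n=56)
adjusted = holm_step_down([0.01, 0.04, 0.03])
```

Using the plasma reference library rather than the whole genome as background changes N and K, which is why the choice of background can alter which pathways reach significance.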

Results

Demographic information

Among 445 participants in the study, there were 215 cases of severe RHD and 230 controls. Demographic baseline data are shown in Table 1. RHD is typically a disease of young people and, as age-matching was not carried out in population collection, we found cases were significantly younger than controls (p = 0.014; Table 1). Sixty-four RHD patients, but only 13 controls, were below the age of 18 years. RHD cases also had lower BMI than controls (p = 6.02e−12; Table 1). We therefore explored relationships between age, BMI and protein levels in the cohort prior to the case–control proteomic analyses. BMI and age in the cases were correlated with Pearson correlation coefficient r = 0.63, compared to r = 0.23 in control samples; the higher correlation in cases is mainly due to the presence of participants younger than 18 in the case cohort (Additional file 1: Fig. S1). Subsequent LR analyses were therefore adjusted for age, sex, BMI and age*BMI interaction. Regarding medication differences between cases and controls, 111 cases and zero controls in the study were receiving secondary prophylaxis for RHD, comprising regular benzathine penicillin G injections. Twenty-three cases and zero controls were identified as anticoagulated with warfarin. One case received both penicillin G injections and warfarin. Neither penicillin nor warfarin treatment (after the removal of proteins known to be affected by warfarin) was a significant factor in explaining protein differences between cases and controls.

Table 1 Baseline characteristics of included study participants. Data presented as median (IQR) or percentage (%). P-values obtained using the Mann-Whitney U test.

Proteomic baseline data

A total of 940 proteins were quantified in the blood samples and 366 proteins, present in at least 40% of the samples, were kept for downstream analysis (Additional file 1: Fig. S2). The principal reason for protein dropout was abundance level, rather than unacceptable levels of variation. Among these 366 proteins, no significant differences of protein expression were observed between participants taking warfarin or penicillin, compared to those not taking medication (pairwise t-test, adjusted p-value = 1). Correlation coefficients of protein expression and BMI or age were in general weak, and not systematically different between cases and controls (Additional file 1: Fig. S3). No protein showed significantly different expression in males than in females, and there were no significant interactions between case/control status and sex in protein expression (Additional file 1: Fig. S4).

Boruta machine learning analyses

Fold change analyses showed a total of 84 proteins that exhibited significant differences between cases and controls with adjusted p-values < 0.05 (Additional file 1: Table S2). Using the Boruta algorithm, 56 features were identified as important; these are presented in order of their Boruta importance in Table 2. Figure 1a shows the boxplots of the permutation importance of the 56 proteins in order with an emphasis on the top six proteins. Adiponectin (Q15848) and complement factor C7 (P10643) are the strongest differentially expressed proteins in this analysis, followed by quiescin sulfhydryl oxidase 1 (O00391), insulin-like growth factor binding protein acid labile subunit (P35858), pregnancy zone protein (P20742) and glycosylphosphatidylinositol specific phospholipase D1 (P80108). Twenty-four of the proteins identified by the Boruta algorithm were also identified by LASSO regression (Additional file 1: Table S3). However, the Boruta algorithm identified some important additional biomarkers, for example, quiescin sulfhydryl oxidase 1 (O00391), a known marker of cardiac disease, that LASSO regression did not detect.

Table 2 List of biomarkers identified from Boruta package with their log2-scaled mean expression in cases, controls, log2 fold change, mean permutation importance (meanImp); and with Odds Ratios (ORs), 95% Confidence Interval (CI), p-values and AUCs from single-marker LR models adjusted for age, sex, BMI, and age*BMI
Fig. 1

a Boxplot representing the permutation importance of the 56 proteins (from 215 cases; 230 controls) found to be significant by the Boruta algorithm. UniProt IDs are presented in Table 2. b Cumulative AUC for Boruta-identified biomarkers calculated from logistic regression analysis

Logistic regression

Results of the marker-by-marker logistic regression analyses adjusted for age, sex, BMI and age*BMI, for each of the 56 proteins identified by the Boruta algorithm, are presented in Table 2. The top marker from the Boruta analyses, Adiponectin, was higher in cases than controls, exhibiting an OR for disease per unit increase on the log2 scale (i.e. per doubling) of 1.18 [95% CI 1.13–1.24]; p = 2.00e−12. The second placed marker by the Boruta algorithm, complement component C7, had the highest absolute case–control difference of any biomarker in the LR model, with OR = 3.40 [95% CI 2.41–4.93]; p = 2.14e−11. Among other significant markers, Fibulin-1, a known component of cardiac valve matrix, was higher in cases than controls, potentially indicating ongoing significant valve damage in these chronic RHD patients (OR = 1.96; [95% CI 1.46–2.68]; p = 1.44e−05). Also, in keeping with previous analyses [30], we found the complement-activating protein Ficolin-3 to be lower in cases than controls (OR = 0.60; [95% CI 0.47–0.76]; p = 2.65e−05). Ficolin-3 had a strong classification ability similar to Adiponectin and C7 with an individual AUC of 0.81. The cumulative AUC from the logistic regression analyses is shown in Fig. 1b. Incorporating the top 6 biomarkers in the model yielded an AUC of over 0.90 and incorporating the top 12 biomarkers yielded an AUC of ~ 0.95 (Table 2). Thus, the use of SWATH-MS based discovery proteomics identified a candidate biomarker signature that accurately discriminates RHD patients from controls.

Pathway enrichment

Statistically significantly enriched pathways identified by ClueGo functional enrichment conducted on the Boruta-identified proteins are presented in Additional file 1: Table S4. A functionally grouped network of pathways is shown in Fig. 2. Enriched pathways confirmed our inference from the individual protein analyses that the activity of protein networks involved in inflammatory mechanisms was significantly different between cases and controls. For example, proteins involved in the Insulin like Growth Factor (IGF) and IGF-binding protein (IGFBP) pathways, which are of known importance in autoimmunity [31], were significantly enriched (FDR-adjusted p = 1.70e−04). Pathways of previously unsuspected relevance in RHD included serine-type endopeptidase inhibitors (FDR-adjusted p = 4.94e−05), including members of the Serpin family involved in stabilization of the extracellular matrix and inhibiting clotting proteins; and lipoprotein metabolism (FDR-adjusted p = 1.30e−04). Subsidiary analyses using the whole genome as background produced results highly congruent with the plasma reference library analyses (Additional file 1: Table S5).

Fig. 2

Functionally grouped networks of enriched pathways from ClueGO. For the full enrichment analysis results see Additional file 1: Table S4

Discussion

In this study of geographically and ethnically diverse African patients with severe RHD and healthy controls, we identified a proteomic signature consistent with ongoing inflammation, during what has typically been considered a “burned out” phase of disease—when severe chronic valve disease is established.

Previous plasma proteomic studies of RHD have involved smaller numbers of patients than the present study: Mukherjee et al. [32] studied six patients with rheumatic mitral stenosis and six controls; Gao et al. [33] studied 40 RHD patients and 40 controls; and Wu et al. [34] carried out the only previous study of comparable size to the present investigation, involving 160 RHD patients and 160 healthy controls. There was minimal overlap between the proteins identified in those studies and the present investigation, which is the first to employ a machine learning approach to identify differentially expressed proteins. Proteomic studies of rheumatic human valves replaced at surgery offer the potential to more directly interrogate pathological processes; however, these have involved only small numbers of patients, due to limited availability of specimens for study (recently reviewed by Lumngwena et al. [35]). Moreover, while such studies of valve tissue provide directly pathologically relevant information, they do not necessarily inform the basis for a potential field diagnostic. In the following, we discuss certain of the proteins that showed the most significant differences between cases and controls and their potential relevance to RHD.

Adiponectin was the top protein identified in the Boruta and logistic regression analyses. Plasma adiponectin was a mean of 2.2 fold higher in cases than controls. Adiponectin has a complex relationship to inflammation, being currently thought to act as either an anti-inflammatory or a pro-inflammatory protein dependent on context [36]. In the context of diabetes, obesity and coronary artery disease, adiponectin is lower in cases than controls and inversely correlated with C-reactive protein (CRP) levels. By contrast, levels are higher in cases of rheumatoid arthritis, Systemic Lupus Erythematosus (SLE) and inflammatory bowel disease than controls. Thus elevation of adiponectin appears to be a specific autoimmune marker in the context of inflammation, in keeping with the disease process in RHD.

Complement factor 7 was the second most important protein in the Boruta and logistic regression analyses. Plasma C7 was a mean 1.6 fold higher in cases than controls. Unlike some other complement components, C7 is not considered an acute phase reactant, and it is the only terminal complement component not predominantly synthesised by hepatocytes [37]. C7 is often the limiting factor for terminal complement complex generation, and has been found at higher levels in plasma of diabetic patients with kidney disease [38]. Thus far there is no evidence for plasma C7 levels being altered in rheumatoid or autoimmune diseases. The combination of Adiponectin and C7 elevation in cases compared with controls together is therefore, to the best of our knowledge, unique to RHD among inflammatory diseases studied so far, and suggests their combination could have diagnostic utility.

Quiescin sulfhydryl oxidase 1 (QSOX1) was the third most important protein in the machine learning analyses. QSOX1 was on average 43% higher in cases than in controls. When fully adjusted for age/sex/BMI/age*BMI in logistic regression analyses, it fell to 29th position among the identified proteins, but remained statistically significantly different between cases and controls (OR = 1.27 [95% CI 1.12–1.47]; p = 5.58e−04). QSOX1 catalyses disulphide bond formation in fibroblasts, and supports ECM assembly in fibroblast cultures. It has been described as a marker of acute heart failure [39] and is higher in patients admitted with MI who later go on to develop LV dysfunction [40], in which situation it is thought to originate from the infarct border zone. QSOX1 has not previously been implicated in rheumatic or other heart valve disease.

Fibulin-1 is an extracellular matrix protein strongly expressed during development in the cardiac cushions, from which the heart valves develop, and in adult valve tissue [41, 42]. Plasma fibulin-1 levels have been suggested to be an early plasma marker of aortic stenosis [43]. Levels have been positively associated with N-terminal pro-BNP, and left atrial size [44], and fibulin is hypothesised to play a key role in determining aortic stiffness [45]. Our data showing a 46% higher mean value plasma fibulin-1 in RHD cases compared to controls, particularly when coupled with the pro-inflammatory signature constituted by other proteins, tends to support the notion of ongoing valve damage in late-stage RHD. However, this observation could also be consistent with left atrial size increase consequent upon mitral stenosis or regurgitation among a proportion of the cases.

We found Ficolin-3 levels to be about 43% lower in RHD cases than controls. Ficolin-3 is one of three ficolin proteins that bind to microbial surface residues, and play key roles together with the Mannose-binding lectin (MBL)-associated serine proteases 1 and 2 in the cleavage of complement components 4 and 2 to form the C3 convertase C4b2a [46]. The lectin pathway, of which Ficolin-3 is the most abundant plasma component, has been implicated in RHD by multiple previous studies; Ficolin-3 itself binds to the highly conserved N-acetyl-beta-D-glucosamine (GlcNAc) antigen, the main carbohydrate antigen of the Group A Streptococcus cell wall. Recently, a focused ELISA based study of serum Ficolin-3 concentrations showed a 30% lower serum Ficolin-3 among 179 patients with a history of rheumatic fever compared to 170 healthy controls, a result strongly in concordance with our large-hypothesis experiment [30]; although a smaller recent study of Egyptian adolescents did not confirm this result [47]. It is possible that either consumption of Ficolin-3 by an ongoing inflammatory process, or a genetic predisposition to lower Ficolin-3 levels resulting in a greater propensity for streptococcal sore throat to progress to acute rheumatic fever among cases, may explain the association we and others have shown between severe RHD and lower plasma Ficolin-3. Further research will be required to distinguish these possibilities.

Taken together, our results strongly suggest an ongoing inflammatory process involving damage to the cardiac valves among these cases of severe RHD, which to date has remained an unresolved question. Of note, over 50% of the case population were treated with secondary penicillin prophylaxis, and we observed no difference in proteomic profile among those cases who were, and who were not, taking penicillin prophylaxis. This suggests that recent undiagnosed episodes of rheumatic fever would be an unlikely explanation for our observations. This is important in light of alternative plausible hypotheses for the drivers of progressive valve severity that are emerging. For example there is previous work showing that myocarditis remained in its active phase in patients with ARF, months after the disease ventured into the quiescent phase [48] suggesting that continuous valve damage may occur in a similar fashion in chronic RHD patients, with evidence of a continuum of inflammation due to the presence of high levels of CRP [49]. Elsewhere Karthikeyan and colleagues have suggested that a major driver of persistent inflammation and progression of valve disease may be related to the hemodynamic burden and turbulence created by transvalvular pressure gradients across damaged valves [50]. Of interest, Rifaie et al. reported that high concentrations of inflammatory markers present in the sera of patients with chronic rheumatic valvular heart disease subsequently disappeared after administration of anti-inflammatory drugs [51]. Clinical observation tends to support the notion of ongoing valve damage distant from ARF episodes—for example, while pure mitral regurgitation dominates in the young, mixed valvular pathology is the most common finding in chronic RHD, indicating progression [52]. Our results suggest these clinical changes reflect ongoing inflammation-driven valvular scarring and remodelling occurring in RHD, even distant from recurrent episodes of ARF.

Our analyses were able to distinguish a six-protein signature of severe RHD (ADIPOQ, C7, QSOX1, IGFALS, PZP, GPLD1) that correctly classified over 90% of cases; incorporation of the top 12 proteins enabled correct classification of over 95% of cases. Certain features of the signature appear, from the literature, to confer specificity—the combination of high Adiponectin and high C7, higher levels of Fibulin-1, and lower levels of Ficolin-3 in cases. If ongoing inflammation were shown to have prognostic importance in chronic RHD, the protein signature could be used to attempt to stratify RHD patients, and potentially identify opportunities for drug repurposing in future studies. A similar protein signature identifying ARF would be of even greater utility in low-resource settings, where access to experts trained in clinical cardiovascular evaluation, and the use of echocardiography, is very limited. Similar studies to ours will be necessary in ARF patients and controls to investigate this question.

This study has limitations. Although it is the largest study thus far, the only one to date to incorporate machine learning, and the first to use the SWATH-MS or proteomics methodology, replication of our findings in a second cohort of similar size would be of value. Incorporation of genetic information could enable a “Mendelian randomisation” approach to distinguish causal from non-causal association—this could be of particular value, for example, in the case of Ficolin-3 where lower levels could be due to either genetic predisposition or enhanced consumption by an ongoing inflammatory process. Such experiments would require larger samples. Adiponectin exists in three isoforms (trimer, hexamer and multimer) which are known to have differential properties in, for example, induction of chemokine expression in vitro [53]. Our approach could not distinguish these different isoforms, which would require alternative analytic platforms. It is therefore possible that we have underestimated the importance of a particular isoform of Adiponectin. Some of the proteins we identified as among the strongest biomarkers do not, as yet, have plausible mechanisms linking them to RHD; further research will be required to discover these.

In summary, we have identified a plasma protein signature of rheumatic heart disease that suggests an ongoing inflammatory process in the chronic phase of the condition. A small number of proteins considered together accurately classified chronic, severe RHD cases distinct from healthy controls. This work could contribute to opportunities for drug repurposing, guide recommendations for prophylaxis, and/or inform development of near-patient diagnostics.