Introduction

Cerebral malaria (CM), a neurological complication caused by malaria, is one of the most serious neurological disease in the world, with a high mortality rate, and is the primary cause of malaria death [1]. The onset age of CM mainly occurs in children aged 40–45 months, and children under five years of age account for 67% of all malaria deaths [2]. The typical clinical symptoms of CM in children include fever, anorexia, vomiting, cough, convulsions, and coma [3]. It has been reported that approximately 1% of children with Plasmodium falciparum are likely to develop CM. Approximately 11% of childhood survivors of CM have neurological sequelae, such as epilepsy, movement disorders, hemiplegia, speech disorders, cortical blindness, and hypotonia [4]. Therefore, to improve the prognosis of patients, one of the most important factors is early diagnosis [5]. However, due to its complex and nonspecific clinical manifestations, there is no gold standard for the diagnosis of CM [6]. In recent years, the rapid development of bioinformatics technology has greatly promoted research on diagnostic markers of disease. Nowadays, there are many studies on the biomarkers of childhood CM, but there are still few studies on the representative biomarkers of childhood cerebral malaria, and the biomarkers may have population differences [7, 8]. In addition to early diagnosis, effective therapeutic drugs have a critical impact on the prognosis of CM. Recently, the use of quinine or artemisinin derivatives as first-line malaria treatments has significantly reduced Plasmodium infection rates. However, they generally fail to protect against cell death, nerve damage, and cognitive deficits, having less effect on CM [9]. In addition, studies have shown that Plasmodium falciparum resistant to artemisinin derivatives (ART) has emerged in some areas, which is detrimental to the prognosis of patients [10, 11]. To improve the prognosis of patients, it is necessary to understand the molecular mechanisms of the complex biological processes involved in CM.

Therefore, this study applied a variety of bioinformatics tools to study the molecular biological functions, signaling pathway changes and biological targets in the process of CM infection, aiming at discovering the molecular mechanism of the disease and finding key, representative, and highly correlated biomarkers that can be used to diagnose CM and predict the progression and outcome of CM.

Methods and materials

Differentially expressed genes (DEGs) screening and functional enrichment analysis

In this study, we downloaded the GSE117613 dataset from the GEO database (https://www.ncbi.nlm.nih.gov/geo/). The GSE117613 dataset included 17 African pediatric CM samples and 12 control samples [12]. These data were normalized using the normalizeBetweenArrays function in the limma package. Subsequently, the limma package was also applied to identify the DEGs between the CM and control groups. Adjusted p < 0.05 and log |FC|> 1 were considered the statistical screening criterion. In addition, a volcano map and heatmap were made to display the differential expression of DEGs using the ggplot2 package. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed on DEGs to analyze their biological function by the ClusterProfiler R package. GO terms and KEGG pathways with adjusted p < 0.05 were deemed statistically significant.

Weighted gene coexpression network analysis (WGCNA) network construction

WGCNA is a novel method to efficiently detect gene modules and hub genes associated with clinical features [13]. In this study, we performed WGCNA on the GSE117613 gene set using the WGCNA package. The detailed data analysis process is as follows. First, genes in the top 25% of variance were screened for the construction of coexpression networks. Second, hierarchical clustering analysis of the samples was performed to detect and remove outlier samples. Next, according to the WGCNA tutorial, pickSoftThreshold function was applied to determine the best soft-thresholding power to meet the scale-free topology criterion. Thereafter, WGCNA network construction and module detection were performed based on soft-thresholding power and a minimal module size of 200. Eventually, gene significance (GS) and module membership (MM) were calculated to evaluate the correlation between modules and CM. The module with the highest correlation with CM was selected as the key module. The common genes of the key module and DEGs were identified by Venn analysis for further analysis.

Hub genes identification and validation

To screen diagnostic markers for CM, we applied two machine learning algorithms to identify candidate genes for CM diagnosis. The least absolute shrinkage and selection operator (LASSO) logistic regression is a useful variable selection method by the glmnet R package [14]. LASSO compressed the regression coefficients of some variables to zero by imposing constraints on the model parameters (λ), thereby obtaining an interpretable model. In this process, variables with zero regression coefficients were excluded from the model. Therefore, we applied the LASSO to narrowed down CM-related candidate genes. Furthermore, we utilized support vector machine recursive feature elimination (SVM-RFE) by the e1071 R package [15]. SVM-RFE is a machine learning method for screening genes for sample classification from microarray data. Here, SVM-RFE was performed to identify the value of these biomarkers in CM. The intersection genes of LASSO and SVM-RFE were screened out and then validated in the GSE1124 dataset.

Gene set enrichment analysis (GSEA)

To further investigate the potential roles of hub genes, we performed GSEA via the gseKEGG function of the clusterProfiler package on hub genes. Based on the median expression levels of hub genes, 17 CM samples in the GSE117613 dataset were divided into low-expression and high-expression groups to perform enrichment analysis. The enrichment pathways with p-value < 0.05 and FDR < 25% were considered statistically significant.

Results

DEGs filtration and enrichment analysis

Based on p.adj  < 0.05 and log|FC|> 1, 182 DEGs were identified in the GSE117613 dataset, including 124 upregulated and 58 downregulated genes (Fig. 1A). The top 20 upregulated and downregulated DEG profiles are shown in Fig. 1B. GO and KEGG analyses were introduced to analyze the DEGs, thereby identifying their biological functions. GO analysis was composed of biological process (BP), cellular components (CC), and molecular function (MF). For BP, DEGs were enriched in some pathways related to neutrophils, such as neutrophil activation, neutrophil degranulation, neutrophil activation involved in immune response, and neutrophil-mediated immunity (Fig. 2A). For CC, the most significant terms were associated with some lumen structures, such as vesicle lumen and cytoplasmic vesicle lumen (Fig. 2B). For MF, DEGs were enriched in binding-related functions, such as organic acid binding and carboxylic acid binding (Fig. 2C). The significantly enriched KEGG terms were transcriptional misregulation in cancer, the NOD-like receptor signaling pathway, Staphylococcus aureus infection, and the IL-17 signaling pathway (Fig. 2D).

Fig. 1
figure 1

Volcano map and heatmap of differentially expressed genes. A DEGs between cerebral malaria blood samples and control samples. Red dot: upregulated gene, blue dot: downregulated gene. B The top 20 upregulated genes and top 20 downregulated genes in GSE117613 dataset

Fig. 2
figure 2

GO and KEGG enrichment analysis. A GO enrichment in Biological Process terms. B GO enrichment in Cellular Component terms. C GO enrichment in Molecular Function. D Enriched KEGG pathways of the DEGs

Weighted gene coexpression network construction and analysis

Through preliminary screening, a total of 8674 genes, whose variance ranked in the top 25%, were incorporated into the WGCNA network construction, and outlier sample GSM3305179 was excluded after sample cluster analysis. (Additional file 1: Fig. S1). With the pickSoftThreshold algorithm, β = 8 (with R2 = 0.85) was determined as the scale-free topology criterion (Additional file 2: Fig. S2). Then, the coexpression network was established based on soft-thresholding power β = 8 and a cutheight of 0.25. Subsequently, DEGs were clustered into 15 modules (labeled with different colors) with a minimum module size of 200 by hierarchical average linkage clustering (Fig. 3A). The genes in the same modules indicated highly shared biological functions, and unassigned genes were divided into the gray module. The heatmap illustrated the correlation between CM and different modules, in which the purple module was the most highly correlated with CM (cor = 0.78, p = 9e−07) (Fig. 3B and C). Therefore, the purple module was considered the key module. The 68 genes shared by the purple module and DEGs were screened out for further analysis (Fig. 3D).

Fig. 3
figure 3

WCGNA coexpression module construction. A The cluster dendrogram of the top 25% genes with highest variance in GSE117613. Each specified color represents a specific gene module. The genes in a same module have highly shared biological function. B Associations between gene modules and Cerebral Malaria. Each row corresponds to a module eigengene and each column corresponds to a clinical status. Each cell displayed the correlations and p-values between each module and clinical status. The purple module was the highest correlation module with cerebral malaria. C Scatterplot of gene significance vs. module membership in the purple module. D Venn diagram for intersection between DEGs and genes of purple module

Hub genes filtration and verification

To verify the diagnostic value of the genes, 68 candidate genes were subjected to LASSO logistic regression and SVM-RFE algorithms, respectively. LASSO identified 7 genes (Fig. 4A, B), while SVM-RFE screened 17 genes (Fig. 4C, D). The common genes (LEF1, ANXA1, IRAK3, and VCAN) obtained by LASSO and SVM-RFE algorithms were identified as hub genes (Fig. 5). Subsequently, we validated the expression levels of four hub genes in GSE117613 and a separate external dataset, GSE1124. Compared to the control group, IRAK3, VCAN, and ANXA1 were upregulated, while LEF1 was downregulated in the CM group in the GSE117613 dataset (Fig. 6). However, only LEF1 and IRAK3 had significantly different expression levels in the validation dataset GSE1124. (p < 0.05) Thus, LEF1 and IRAK3 were identified as hub genes and subjected to subsequent research.

Fig. 4
figure 4

Selection of diagnostic biomarkers using the machine learning methods. A, B LASSO algorithm to screen candidate genes. C, D SVM-RFE algorithm to screen candidate genes. The point highlighted indicates the optimal accuracy and the lowest error rate, respectively, and the corresponding genes at this point are the best signature selected by SVM-RFE

Fig. 5
figure 5

Venn diagram showing the hub genes shared by LASSO and SVM-RFE algorithms

Fig. 6
figure 6

Validation of hub genes in GSE117613 and GSE1124 datasets

Gene set enrichment analysis (GSEA)

The enriched KEGG pathways are shown in Fig. 7, which may reveal the potential regulatory mechanisms of hub genes among blood in the progression of CM. According to the enrichment results, some immune-related pathways (IL − 17 signaling pathway and Toll-like receptor signaling pathway) were enriched in the low-LEF1 group and high-IRAK3 group (Fig. 7A, B).

Fig. 7
figure 7

Gene set enrichment analysis of hub genes. A The main signaling pathways that are significantly enriched in low-LEF1 group. B The main signaling pathways that are significantly enriched in high-IRAK3 group

Discussion

CM is a life-threatening complication of Plasmodium falciparum [16]. In addition to its high fatality, 10% to 20% of surviving children with CM suffer from severe neurological sequelae [17]. Deciphering the molecular mechanism of malaria is of great significance for early diagnosis and developing new treatment strategies to reduce the burden of CM. Recently, microarray technologies and bioinformatic analyses have become popular methods for exploring disease pathogenesis and identifying biomarkers of disease [18]. The GSE117613 and GSE1124 datasets may provide new insights into the identification of pathophysiology and biomarkers in CM. Some researchers have previously analyzed the GSE117613 and GSE1124 datasets. Nallandhighal, Srinivas et al., who offered the original data for GSE117613, compared the whole-blood transcriptional profile difference between CM and severe malaria anemia and reported their difference in oxidative stress and erythropoietic responses [12]. Boldt, Angelica B W et al. revealed the modifications of gene expression in different stages of P. falciparum infection and identified some potential prognostic markers [19]. Unlike the previous studies mentioned in this report, we utilized WGCNA, a new bioinformatics tool, to investigate the molecular mechanisms underlying CM. By constructing the WGCNA network, we identified the CM-related module and extracted the hub genes. Additionally, the application of machine learning methods contributed to the screening of gene biomarkers with high diagnostic value.

The GO and KEGG enrichment results illustrated the regulating pathways involved in the DEGs. The BP enrichment results revealed that neutrophils and their related pathways played important roles in CM. The results remained the same as those of previous studies. Feintuch et al. reported higher levels of activated neutrophils in malarial retinopathy-positive CM pediatric patients [20]. Some studies hypothesized that neutrophils stimulated by the large numbers of sequestered parasites in the retinal and cerebral microvasculature in CM could degranulate, releasing MMP8 and leading to vascular endothelial barrier damage and vascular leakage [21]. Moreover, these vascular dysfunctions are common features of CM [22]. In terms of CC, DEGs were involved in the form of vesicle lumen. Previous research has shown that iRBCs secrete significantly more extracellular vesicles than uninfected cells [22]. Vesicles contain many poisonous factors, resulting in vascular dysfunction and the severity of disease in CM [23]. For KEGG, the NOD-like signaling pathway is one of the significantly enriched pathways. Some studies [24, 25] reported that hemozoin, a product of hemoglobin (Hb) catabolism, triggered a high level of IL-1β production and recruited neutrophils by activating the NOD-like receptors containing pyrin domain 3 (NLRP3) inflammasome complex in malaria infection. High IL-1β was highly associated with disease severity and death during malaria [26, 27]. Previous studies found that the NLRP3–IL1β axis was involved in acute cerebrovascular dysfunction and progressive neuroinflammation in various brain pathologies [28, 29]. Therefore, inhibition of the NLRP3–IL1β axis may weaken neurological sequelae and improve the efficacy of antimalarial drug treatment of CM [30].

This study also identified some biomarkers for CM by LASSO and SVM-RFE. However, the possible effects of these hub genes in CM have not been reported in previous studies. LEF1 (Lymphoid Enhancer Binding Factor 1) is a protein-coding gene involved in the Wnt signaling pathway [31]. Several studies [32] have reported that activation of the Wnt signaling pathway contributes to maintaining the integrity of the blood–brain barrier (BBB) in various cerebral diseases. Jin, Zhao et al. found that activation of the Wnt signaling pathway may inhibit MMP-9 activation and upregulate LEF1 expression, which alleviated BBB breakdown and reduced brain edema in cerebral ischemia–reperfusion [33]. Therefore, we suggest that the downregulation of LEF1 may imply the absence of the Wnt signaling pathway in CM and poor prognosis. In addition, according to the enriched GSEA terms, the IL-17 signaling pathway was highly enriched in the low-LEF1 group. It's reported that excess expressions of immune-related pathways were associated with disease development or poor prognosis in CM [34]. Huppert, Jula et al. disclosed that IL-17 is involved in the disruption of the BBB [35], which is frequently fatal and related to long-term neurological sequelae [36]. Therefore, the downregulation of LEF1 may suggest the disruption of the BBB and the poor prognosis of CM.

IRAK3 is a member of the interleukin-1 receptor-associated kinase protein family and functions as a negative regulator of Toll-like receptor (TLR) signaling [37]. Dickinson-Copeland, Carmen M et al. reported that the activation of TLR-mediated heme-induced apoptosis, leading to the depletion of endothelial progenitor cells, which contributed to vascular dysfunction and BBB damage [38]. Additionally, IRAK3 is an important inflammatory down-regulator that reduces the transcription of NF-κB-induced cytokines [39]. NF-κB activation contributes to the cause of apoptosis in endothelial cells in CM, leading to BBB dysfunction [40]. Therefore, we suggest that the high expression of IRAK3 contributes to maintaining the integrity of the BBB by inhibiting the Toll-like receptor and NF-κB pathways, thereby improving the prognosis of CM.

In this study, we identified LEF1 and IRAK3 as important biomarkers in CM. Through various bioinformatic analyses, we validated their diagnostic value in CM and found that they were highly associated with the integrity of the BBB. Therefore, we suggest that LEF1 and IRAK3 may act as key targets to improve prognosis, contributing to the diagnosis and treatment of CM.

We acknowledge that there are some limitations in our study. First, the datasets included in our study did not have enough samples. In addition, in vitro and in vivo experiments are required to further validate the value of hub genes in CM.