Introduction

Sepsis is a major public health concern in which a dysregulated host response to infection leads to life-threatening organ dysfunction [1, 2]. Acute respiratory distress syndrome (ARDS), a common and often fatal complication of sepsis, is characterized by damage to the alveolar-capillary membrane that leads to lung edema and hypoxemia [3]. In a large international study, sepsis was the cause of approximately 75% of ARDS cases [4]. According to a US report, over 210,000 cases of sepsis-induced ARDS occur in the US annually [5]. In addition, patients with sepsis-related ARDS have higher overall disease severity, poorer recovery from lung injury and higher mortality than patients with non-sepsis-related acute lung injury (ALI) [6]. Despite the growing understanding of the mechanisms of sepsis-induced ARDS, it remains incompletely understood why only a fraction of septic patients develop ARDS. Furthermore, ARDS develops rapidly after the initial insult, and no consensus has been reached on biomarkers that can be used to directly diagnose ARDS and assess lung injury. It is therefore important to identify diagnostic biomarkers for ARDS.

Gene expression signatures have been an intense focus of research in recent years. Previous studies have indicated that gene expression signatures have considerable predictive value for identifying septic patients with ARDS [7]. In one study, an 8-gene signature was associated with acute lung injury (ALI) and could distinguish patients with ALI from patients with sepsis alone [8]. In another, the expression of neutrophil-related genes was significantly higher in septic patients with ARDS than in patients with sepsis alone [9]. A more recent study also found distinct gene expression profiles in monocytes between patients with sepsis and patients with sepsis and ARDS [10]. Gene signatures derived from expression profiles might therefore serve as novel and accurate biomarkers for identifying patients with ARDS. However, because a large number of genes are involved in the pathophysiological process, identifying those relevant to the diagnosis of ARDS is computationally challenging.

Machine learning is a rapidly developing field that offers powerful tools for handling large, complex and heterogeneous data. It has progressively improved our ability to find relevant features in large, high-dimensional gene expression data [11]. Supervised machine learning has been used successfully to develop classifiers for disease diagnosis and to identify related biomarkers on the basis of the input features [12, 13]. However, studies using machine learning to identify potential diagnostic biomarkers of sepsis-induced ALI are still lacking. Here, we hypothesized that by integrating multiple machine learning algorithms, we could identify gene expression signatures for sepsis-induced ALI that may serve as diagnostic tools. Moreover, functional analysis of the diagnostic genes identified could provide insight into the pathogenic mechanisms of ALI development and uncover druggable targets for its prevention. In this study, we systematically reviewed the available transcriptomic profiling datasets and identified gene biomarkers associated with the diagnosis of sepsis-induced ALI by using a consensus of four different supervised machine learning feature selection techniques. We further explored the role of these biomarkers in the pathogenesis of sepsis-induced ALI and potential candidates for therapeutic intervention.

Methods

Data sources used for analysis

The overall design of this study is shown in Fig. 1. We retrospectively enrolled 5 datasets from the Gene Expression Omnibus (GEO) (http://www.ncbi.nlm.nih.gov/geo) and ArrayExpress (https://www.ebi.ac.uk/arrayexpress/) databases. Datasets published between 2009 and 2020 that contained transcriptomic profiling in Homo sapiens were potentially eligible. Datasets were excluded according to the following criteria: (1) inclusion of pediatric patients; (2) RNA was not measured; (3) patient samples were obtained more than 48 h after admission; (4) a focus on specific pathogens such as Staphylococcus aureus and Pseudomonas aeruginosa; (5) no focus on sepsis, sepsis-associated lung injury or sepsis-associated pneumonia. Additional datasets could be added through a manual search of the references of included studies. The detailed baseline characteristics are summarized in Additional file 1: Table S1. Among these datasets, GSE66890, GSE10474 and GSE32707 were used to develop the diagnostic model. E-MTAB-5273 and E-MTAB-5274 from the ArrayExpress database were then used to evaluate the performance of the diagnostic model in distinguishing patients with sepsis-induced ALI.

Fig. 1

The overall flow of this study

Data preprocessing and identification of differentially expressed genes (DEGs)

The datasets were downloaded from the GEO and ArrayExpress databases, and the probe expression matrix was converted to a gene expression matrix based on the platform annotation file. The expression matrix was then normalized by robust multichip average (RMA). Genes were filtered to keep only those expressed in at least 10% of arrays. Where datasets had missing values, these were imputed using the weighted average of the k-nearest neighbors (KNN). The datasets GSE66890, GSE10474 and GSE32707 were then merged using the “ComBat” function in the sva package to remove the batch effect among the datasets. To evaluate the batch effect, we conducted principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE). The analysis of DEGs between sepsis and sepsis-induced ALI was performed using the limma package. The thresholds for DEGs were |log2 fold change (FC)| > 0.2 and P value < 0.05. The results were visualized in volcano plots and heatmaps constructed with the R packages ggplot2 and pheatmap. The guidelines of the transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD) statement were followed (Additional file 1: Table S2).
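The DEG thresholds described above can be illustrated with a minimal Python sketch. It is not the limma workflow used in the study: it assumes that per-gene log2 fold changes and P values have already been computed, and the gene names, values and the helper `select_degs` are hypothetical.

```python
from typing import NamedTuple


class GeneStat(NamedTuple):
    gene: str
    log2_fc: float   # log2 fold change, ALI versus sepsis (assumed direction)
    p_value: float


def select_degs(stats, lfc_cutoff=0.2, p_cutoff=0.05):
    """Return (upregulated, downregulated) gene lists.

    A gene is kept only if P < p_cutoff and |log2FC| > lfc_cutoff,
    matching the thresholds stated in the text.
    """
    up = [s.gene for s in stats
          if s.p_value < p_cutoff and s.log2_fc > lfc_cutoff]
    down = [s.gene for s in stats
            if s.p_value < p_cutoff and s.log2_fc < -lfc_cutoff]
    return up, down


# Hypothetical per-gene statistics, for illustration only.
example = [
    GeneStat("GENE_A", 0.45, 0.001),   # passes both thresholds, upregulated
    GeneStat("GENE_B", -0.30, 0.020),  # passes both thresholds, downregulated
    GeneStat("GENE_C", 0.10, 0.004),   # effect size too small
    GeneStat("GENE_D", -0.60, 0.200),  # not significant
]
up, down = select_degs(example)
```

Both conditions must hold at the same time, so a gene with a large fold change but a non-significant P value is not reported as a DEG.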

Pathway enrichment analysis

Based on the normalized gene expression matrix, the R package clusterProfiler was used to conduct gene set enrichment analysis (GSEA). The pathway gene sets were downloaded from the Molecular Signatures Database (MSigDB). The normalized enrichment score (NES) and false discovery rate (FDR) were used to quantify enrichment magnitude and statistical significance, respectively [14, 15].

Multivariable DEGs selection and model building

Before multivariable DEG selection, we first removed highly correlated DEGs using a correlation matrix method. For each DEG, the mean absolute correlation was calculated from its pairwise correlations. If a pairwise correlation exceeded 0.5, the DEG with the greater mean absolute correlation was removed using the caret package in R [16].
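The correlation filter can be sketched in Python as follows. This is a simplified illustration of the rule described above, not a port of the caret implementation: it assumes Pearson correlation, at least two genes, and non-constant expression vectors, and the gene names and values are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def filter_correlated(expr, cutoff=0.5):
    """Drop one gene from every pair whose |r| exceeds the cutoff.

    Within such a pair, the gene with the larger mean absolute
    correlation to all other genes is the one removed.
    """
    genes = list(expr)
    corr = {(g, h): abs(pearson(expr[g], expr[h]))
            for g, h in combinations(genes, 2)}

    def r(g, h):
        return corr[(g, h)] if (g, h) in corr else corr[(h, g)]

    mean_abs = {g: sum(r(g, h) for h in genes if h != g) / (len(genes) - 1)
                for g in genes}
    removed = set()
    # Visit the most strongly correlated pairs first.
    for (g, h), value in sorted(corr.items(), key=lambda kv: -kv[1]):
        if value > cutoff and g not in removed and h not in removed:
            removed.add(g if mean_abs[g] >= mean_abs[h] else h)
    return [g for g in genes if g not in removed]


# Hypothetical expression values: G1 and G2 are nearly collinear.
expr = {
    "G1": [1, 2, 3, 4, 5],
    "G2": [2, 4, 6, 8, 11],
    "G3": [5, 1, 4, 2, 3],
}
kept = filter_correlated(expr)
```

In this toy input G1 and G2 are almost perfectly correlated, and G1 is dropped because its mean absolute correlation with the remaining genes is slightly higher.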

For multivariable DEG selection, four independent feature selection methods were applied in parallel to screen candidate biomarkers. The intersection of the features selected by the four machine learning algorithms was considered the set of significant features. For each method, parameters were tuned using stratified tenfold cross-validation (repeated 10 times) on the training set; the stratified cross-validation was also used to mitigate the imbalance of the outcome variable. We then built predictive models for binary classification with a supervised machine learning method, based on the features selected by each algorithm. Elastic net regularized binomial logistic regression was used to select relevant DEGs with the glmnet package in R [17]. We chose the regularization parameter, λ, using tenfold cross-validation with binomial deviance as the criterion. A predicted probability > 0.5 was used to classify a septic patient as having ALI. The support vector machine (SVM) is a supervised learning method that can select relevant features and remove redundant ones; it was implemented with the e1071 R package. Based on the best parameters, we chose the polynomial kernel for the SVM to screen features. Boruta is a random forest wrapper algorithm for feature selection that identifies the relevant variables. We ran 300 iterations of the random forest normalized permutation importance function to assign importance using the Boruta package. We then constructed the random forest model with the DEGs selected by Boruta using the randomForest package [16]. XGBoost is an extreme gradient boosting method that is effective in a range of classification problems and can rank features from most to least important; it was implemented with the xgboost package in R [18]. The final parameters selected are listed in Additional file 1: Table S3. Features contributing more than 1% improvement in accuracy to the branches they were on were considered important.
However, no single algorithm performs feature selection perfectly. We therefore constructed an ensemble supervised machine learning model based on the ‘stacking’ method, which refers to fitting multiple machine learning models on the same dataset and using a secondary model to learn how best to combine their predictions [19]. The supervised machine learning algorithms above were combined to generate a consensus model. Ensemble predictions were generated by averaging the predicted probabilities from the individual supervised machine learning algorithms (Additional file 1: Figure S1). The model with the highest area under the receiver operating characteristic curve (AUROC) in cross-validation was selected as the optimal model.
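Two steps of this procedure, taking the intersection of the features selected by every method and averaging the predicted probabilities across models, can be sketched in Python. The sketch shows only the simple probability averaging described above, not a fitted secondary (meta-learner) model; the gene names, probabilities and helper functions are hypothetical.

```python
def consensus_features(selections):
    """Genes selected by every method (set intersection), sorted by name."""
    sets = [set(genes) for genes in selections.values()]
    return sorted(set.intersection(*sets))


def average_probabilities(predictions):
    """Average the per-sample predicted probabilities across models."""
    models = list(predictions.values())
    n_samples = len(models[0])
    return [sum(m[i] for m in models) / len(models) for i in range(n_samples)]


# Hypothetical feature subsets returned by the four methods.
selections = {
    "elastic_net": ["GENE_A", "GENE_B", "GENE_C"],
    "svm": ["GENE_B", "GENE_C", "GENE_D"],
    "random_forest": ["GENE_C", "GENE_B"],
    "xgboost": ["GENE_B", "GENE_E", "GENE_C"],
}
shared = consensus_features(selections)

# Hypothetical predicted probabilities of ALI for two patients.
predictions = {
    "elastic_net": [0.9, 0.2],
    "svm": [0.7, 0.4],
    "random_forest": [0.8, 0.1],
    "xgboost": [0.6, 0.3],
}
ensemble = average_probabilities(predictions)
# Apply the > 0.5 probability threshold stated in the text.
labels = [1 if p > 0.5 else 0 for p in ensemble]
```

Averaging gives every model equal weight; a stacked meta-learner would instead learn those weights from cross-validated predictions.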

Multivariable classifier performance assessment and validation

To evaluate the performance of the diagnostic model, the areas under the curve (AUCs) of all the methods above were calculated as the average across cross-validation folds over the whole dataset. We also assessed the accuracy of the diagnostic model through external validation: two datasets (E-MTAB-5273 and E-MTAB-5274) from the ArrayExpress database were used for verification, and their AUCs were calculated. To further compare the classifiers, we also examined the performance of each supervised machine learning algorithm using evaluation metrics.
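For reference, the AUC can be computed directly from predicted scores as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The Python sketch below uses this rank-based definition; it is not the pROC implementation used in the study, and the labels and scores are hypothetical.

```python
def auroc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 for sepsis-induced ALI, 0 for sepsis alone (assumed coding).
    scores: predicted probabilities or any monotone risk score.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and model scores for six patients.
y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = auroc(y_true, y_score)
```

Here eight of the nine positive-negative pairs are ranked correctly, so the AUC is 8/9.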

Functional analysis of diagnostic features

To explore how the diagnostic genes contribute to the development of ALI, we defined the patients in the merged dataset with the top 30% and bottom 30% of expression of each diagnostic DEG as the overexpression and low-expression groups, respectively. The differences in pathway activity between the groups were then analyzed by gene set variation analysis (GSVA) [20].
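The grouping step can be sketched in Python as follows. The sketch assumes that the group size is the number of samples multiplied by 0.3 and rounded down, which the text does not specify; the sample identifiers, expression values and helper name are hypothetical.

```python
def split_by_expression(expr_by_sample, fraction=0.3):
    """Return the samples in the bottom and top `fraction` of expression."""
    ranked = sorted(expr_by_sample, key=expr_by_sample.get)  # ascending
    k = int(len(ranked) * fraction)  # group size, rounded down (assumption)
    return {
        "low": ranked[:k],
        "high": ranked[-k:] if k else [],
    }


# Hypothetical expression of one diagnostic gene in ten samples.
expression = {
    "S1": 5.1, "S2": 7.3, "S3": 6.0, "S4": 8.8, "S5": 4.2,
    "S6": 9.5, "S7": 6.6, "S8": 7.9, "S9": 5.7, "S10": 8.1,
}
groups = split_by_expression(expression)
```

Samples in the middle 40% belong to neither group, which sharpens the contrast between the two groups at the cost of a smaller sample size.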

Nomogram, decision curve analysis (DCA) and clinical impact curve (CIC) of predictive model

A nomogram is a graphical tool designed to approximate complicated calculations quickly [21]. The gene signatures selected for the diagnostic model were used to construct a nomogram with the rms package to predict the occurrence of ALI in septic patients. To validate the performance of the nomogram, the concordance index (C-index) was calculated by a bootstrap method with 1000 resamples to assess discrimination. A calibration curve was then plotted to compare the probabilities predicted by the nomogram with the observed rates. Decision curve analysis (DCA) is widely used to measure the clinical utility of a model by weighing the relative value of the benefits and harms associated with its predictions, which overcomes limitations of traditional statistical metrics [22]. A clinical impact curve (CIC) visualizes the overall net benefit of the nomogram across a wide and practical range of threshold probabilities that might affect patient outcomes, and thereby helps indicate whether a diagnostic model has meaningful predictive value [23]. In this study, DCA curves and CICs were therefore used to evaluate the predictive value of the diagnostic model, using the rmda package.
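The quantity plotted in a decision curve is the net benefit at a threshold probability pt, defined as TP/n − (FP/n) × pt/(1 − pt), where TP and FP are the numbers of true and false positives among n patients. The Python sketch below computes it for a model and for the diagnose-all reference strategy; it is an illustration of the formula, not the rmda implementation, and the labels and probabilities are hypothetical.

```python
def net_benefit(labels, probs, threshold):
    """Net benefit of calling a patient positive when prob >= threshold."""
    n = len(labels)
    tp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(labels, probs) if p >= threshold and y == 0)
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(labels, threshold):
    """Net benefit of the reference strategy that calls everyone positive."""
    prevalence = sum(labels) / len(labels)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Hypothetical outcomes (1 = ALI) and predicted probabilities.
y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.7, 0.6, 0.2, 0.8, 0.1, 0.3, 0.4, 0.35, 0.05]

nb_model = net_benefit(y_true, y_prob, threshold=0.5)
nb_all = net_benefit_treat_all(y_true, threshold=0.5)
nb_none = 0.0  # calling no one positive has zero net benefit by definition
```

A model is considered clinically useful at a given threshold when its net benefit exceeds both reference strategies, as it does in this toy example.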

Drugs screened and docking

Based on the functional analysis of the 5 selected features, we screened the 5 protein-coding genes for targeted drugs. Drug selection focused on the expression of the selected features in patients with sepsis-induced ALI. We used AutoDock for molecular docking to examine the interaction between small-molecule compounds and the proteins encoded by the selected genes. First, we obtained the catalog of small-molecule compounds that interact with the selected genes from the CTD database (http://ctdbase.org/), and the structures of these compounds from the PDB database (https://rcsb.org/). Next, we downloaded the macromolecular structures of the selected features from the UniProt database (https://uniprot.org/). Finally, automated docking of the macromolecules and small-molecule compounds was performed according to the standard docking process. The interaction between each compound and macromolecule was determined by the lowest binding energy. PyMOL was used to visualize the results.

Statistical analysis

All data processing and analyses were conducted in R version 4.2.0. Correlations between two continuous variables were assessed by Spearman rank correlation analysis. A nonparametric test was used to compare differences between two groups. ROC curves for binary outcomes were generated with the pROC package. A P value < 0.05 was regarded as statistically significant. Error bars span 95% confidence intervals.

Results

Screening for DEGs and underlying biological mechanisms

After applying the exclusion criteria, 3 raw microarray datasets containing a total of 79 patients with sepsis and 60 septic patients with ALI were included as the training set. The basic information of the included datasets is shown in Additional file 1: Table S2. Gene expression profiles and PCA showed baseline batch differences among the included datasets (Additional file 1: Figure S2A, B). To merge the datasets, the “ComBat” algorithm was applied to remove the batch effect, which could increase statistical power in the subsequent analyses. After batch correction, the batch differences were eliminated (Additional file 1: Figure S2C, D). Five samples (GSM812638, GSM812696, GSM812737, GSM812705 and GSM812721) were removed because they could not be integrated. An initial t-SNE analysis showed some separation between groups (Additional file 1: Figure S2E, F). DEGs were then identified with the limma package using P value < 0.05 and |log2FC| > 0.2. The differential expression analysis revealed 289 DEGs, including 76 upregulated and 193 downregulated genes (Fig. 2A, B).

Fig. 2

Differential expression analysis and functional studies. A, B Volcano plot and heatmap showing the differentially expressed genes. C, D Biological functions associated with the development of sepsis-induced ALI

To decipher the possible biological mechanisms underlying sepsis-induced ALI, we performed GSEA on 21,338 gene sets from MSigDB. The results showed that immune response and metabolism might play important roles in the development of ALI (Fig. 2C, D). Among these processes, the innate immune response, adaptive immune response and monocyte chemotaxis were significantly activated. Moreover, the chemokine secretion, Toll-like receptor and T cell receptor pathways were also upregulated. To further explore the functional changes, we conducted functional enrichment analysis, including KEGG and GO analysis (Additional file 1: Figure S3). The GO analysis suggested that sepsis-induced ALI might be initiated by an inflammatory host response to a microbial pathogen (Additional file 1: Figure S3A). Mitochondria also play an important role in the pathogenesis of sepsis-induced ALI: mitochondrial biogenesis and other processes can be regulated by LPS (via TLR4 activation), involving inflammation and/or oxidative stress in tissues (Additional file 1: Figure S3B and D) [24]. Changes in alveolar epithelial and endothelial cells during sepsis-induced ALI include alterations in cell–cell junction formation and the cell surface glycocalyx, and cell trauma or death (Additional file 1: Figure S3C, D). Together, these results suggest that an aberrant host response to infection disrupts the alveolar-capillary barrier, resulting in lung injury. A dysregulated immune response was associated with the occurrence of sepsis-induced ALI, and monocytes might be the key immune cells contributing to lung injury. Damage to endothelial and epithelial cells appeared to be essential for the progression of ALI.

DEGs selected using supervised machine learning algorithms

In this study, we profiled the DEGs from 77 septic patients without ALI and 57 septic patients with ALI. Because several of the supervised machine learning approaches cannot account for multicollinearity, we removed the DEGs that failed quality control and the DEGs that were highly correlated with each other (Additional file 1: Figure S4). The remaining 70 genes were then used to determine the DEGs most relevant to the diagnosis. Four machine learning methods (elastic net, SVM, random forest and XGBoost) were used to select DEGs and construct diagnostic models. The feature subsets selected by each method differed (Additional file 1: Figure S5A–D), and 5 genes were shared by all four (Fig. 3A, B). Based on feature importance, 27 genes were selected by elastic net, 29 by SVM, 20 by random forest and 33 by XGBoost. The genes selected by all methods (ARHGDIB, ALDH1A1, TREM1, TACR3 and PI3) were used to construct the diagnostic model. The expression levels of the selected features are shown in Fig. 3C–G. To ensure that no individual feature was driving the diagnostic model, a univariable analysis was conducted. For the DEGs selected by at least two methods, expression levels in sepsis and sepsis-induced ALI were compared using the Wilcoxon signed-rank test, with control for multiple testing by Benjamini-Hochberg correction at 0.05. The centered expression values of the DEGs selected by at least two methods are shown in Additional file 1: Figure S6A, B. In total, 32 DEGs identified by the feature selection methods had a P value < 0.05.
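The multiple-testing step can be illustrated with a short Python sketch of the Benjamini-Hochberg adjustment. It assumes that raw P values from the univariable comparison are already available; the values shown are hypothetical and the function is a generic implementation, not the code used in the study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted P values in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # ascending by P value
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw P values for four genes.
raw = [0.01, 0.04, 0.03, 0.20]
adj = benjamini_hochberg(raw)
significant = [p < 0.05 for p in adj]
```

In this toy input three raw P values are below 0.05, but only the smallest remains below 0.05 after adjustment.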

Fig. 3

The DEGs selected by each machine learning method. A Venn diagram showing the intersection of DEGs selected by the four supervised machine learning approaches. B Expression correlation matrix of the DEGs selected by all machine learning algorithms. C–G Expression levels of PI3, ARHGDIB, ALDH1A1, TREM1 and TACR3

Performance of diagnosis for sepsis-induced ALI using selected DEGs

To compare the performance of each feature selection method, we evaluated how each model performed as a classifier on the validation set. As shown in Table 1, the SVM model had the highest AUC (0.846) and accuracy (0.872), whereas the random forest model had the lowest AUC (0.727) and accuracy (0.730) (Fig. 4A–D). Because multivariable methods are known to select features with differing accuracy, we applied the ensemble learning algorithm to the DEGs selected by each model. The ensemble model had a higher AUC (0.876) than the SVM model (Fig. 4E). The number of DEGs selected by each model also differed, with the XGBoost model selecting the most genes and the random forest model the fewest (Table 1). We then focused on the overlapping genes selected by all four feature selection methods and evaluated the performance of each individual gene in diagnosing sepsis-induced ALI. PI3 performed best, with the highest AUC (0.833). The genes selected by all models were then combined to construct the diagnostic model, which had a higher AUC (0.875) than any individual gene (Fig. 4F). These results indicate that the diagnostic model constructed from ARHGDIB, ALDH1A1, TREM1, TACR3 and PI3 had good diagnostic performance. This clear association of the selected features with the diagnosis of sepsis-induced ALI may warrant future investigation of these genes as targets for therapeutic intervention.

Table 1 Model performance of 4 classifiers in validation set
Fig. 4

The performance of each feature selection method. A Elastic net using 27 genes. B SVM using 29 genes. C Random forest using 20 genes. D XGBoost using 33 genes. E Ensemble approach using 53 genes. F Average cross-validated ROC curve for the overlapping genes selected by the four feature selection methods on the validation set

Validation of diagnosis for sepsis-induced ALI by using external datasets

To assess the predictive performance of the diagnostic model, two datasets (E-MTAB-5273 and E-MTAB-5274) obtained from the ArrayExpress database were used for external validation. The overlapping genes selected by the four supervised machine learning algorithms were used for ROC analysis. The AUC was 0.725 in E-MTAB-5273 (Fig. 5A) and 0.833 in E-MTAB-5274 (Fig. 5B). Thus, external validation showed that the diagnostic model constructed from the 5 genes retained good discriminative performance for sepsis-induced ALI.

Fig. 5

External validation of predictive performance in diagnostic model. A The ROC curve of E-MTAB-5273. B The ROC curve of E-MTAB-5274

Visualization of the diagnostic model

To visualize the diagnostic model, we constructed a risk nomogram that integrated the 5 independent predictors of the incidence of sepsis-induced ALI (Fig. 6A). The calibration curve showed a high degree of overlap between the observed incidence and the incidence predicted by the nomogram (Fig. 6B), suggesting that the nomogram has good predictive value. The decision curve analysis (DCA) for the individual diagnostic genes (ARHGDIB, ALDH1A1, TREM1, TACR3 and PI3) and for the model integrating them is presented in Fig. 6C. The DCA showed that if the threshold probability of a patient or doctor is > 10%, using the individual genes or the diagnostic model to predict the occurrence of ALI adds more benefit than either the diagnose-all-patients scheme or the diagnose-none scheme. Within this range, the net benefits were comparable, although the net benefit of the integrated diagnostic model was higher than that of the individual diagnostic genes (Fig. 6C). Based on the DCA results, we further plotted the CIC to assess the clinical utility of the nomogram. The CIC showed that the nomogram had a superior overall net benefit across a wide and practical range of threshold probabilities and affected the diagnosis, suggesting that the diagnostic model had good predictive value (Fig. 6D). The CICs of the individual diagnostic genes showed similar results (Additional file 1: Figure S7).

Fig. 6

The nomogram, DCA and CIC of the diagnostic model. A Nomogram to evaluate the risk of the occurrence of sepsis-induced ALI. B Calibration curves of the nomogram prediction. C DCA curves of the nomogram prediction. D CIC of the nomogram prediction

Functional analysis and small molecular compound docking of diagnostic genes

A good biomarker is characterized not only by high specificity and sensitivity in diagnosing a disease but also by the insight it yields into the pathogenesis of that disease [25]. Understanding the biological roles of specific diagnostic markers for ALI may help elucidate the underlying mechanisms and lead to the identification of novel targets for therapeutic intervention. We therefore explored the functional alterations associated with the 5 diagnostic genes. First, ARHGDIB was significantly downregulated in sepsis-induced ALI. GSVA of the high-expression group showed that the activities of multiple immune response pathways, including neutrophil activation, were upregulated, indicating that ARHGDIB is involved in various immune and pathogen clearance processes in septic patients with ALI. Moreover, the upregulation of ARHGDIB in septic patients with ALI was also associated with negative regulation of the vascular endothelial growth factor receptor signaling pathway, which might be involved in the regulation of vascular permeability (Fig. 7A). ALDH1A1 was expressed at a low level in sepsis-induced ALI. Upregulated ALDH1A1 was involved in the negative regulation of oxidative stress-related pathways such as the respiratory burst. Furthermore, the endothelial cell activation pathway was upregulated in septic patients with ALI, indicating that endothelial cells might synthesize and secrete proteins and cytokines that promote vascular permeability (Fig. 7B). TREM1 is expressed on myeloid cells as an immunoglobulin superfamily receptor that can amplify the inflammatory response by interacting with Toll-like receptors [26]. In this study, septic patients without ALI had higher expression of TREM1, indicating that the inflammatory response has an important role in the development of sepsis.
Septic patients with ALI had lower expression of TREM1, accompanied by mitochondrial dysfunction and downregulated metabolic pathways such as oxidative phosphorylation, which might reduce energy production and further inhibit vascular regeneration (Fig. 7C). Similarly, decreased TACR3 in septic patients with ALI also affected energy production (e.g., the TCA cycle) and cellular replication (Fig. 7D), suggesting that TACR3 has an important role in tissue regeneration. PI3 showed a rapid decrease in patients with ALI, accompanied by extracellular matrix degradation and decreased biosynthesis (Fig. 7E).

Fig. 7

Functional analysis of diagnostic genes. A–E After grouping ARHGDIB (A), ALDH1A1 (B), TREM1 (C), TACR3 (D) and PI3 (E) into high- and low-expression levels, the enriched KEGG and GO pathways were scored by GSVA

We next used the CTD database, drug toxicology studies and automated molecular docking to explore drugs targeting the diagnostic genes. First, we found that Estradiol could bind tightly to ARHGDIB and decrease the expression of ARHGDIB (Fig. 8A). Estradiol, a naturally occurring endogenous hormone in women, has been shown to ameliorate pulmonary inflammation [27] and promote the proliferation of endothelial cells [28]. According to the GSVA results, upregulated ARHGDIB was correlated with increased inflammation and inhibition of vascular endothelium regeneration. The molecular docking analysis therefore indicated that Estradiol might ameliorate lung injury by interacting with ARHGDIB, with an optimal docking binding energy of −7.11 kcal/mol. Acetaminophen (also known as N-acetyl-p-aminophenol or APAP) is a well-known analgesic and antipyretic agent that blocks prostaglandin synthesis from arachidonic acid by inhibiting the enzymes cyclooxygenase (COX)-1 and -2 [29]. Acetaminophen can also affect mitochondrial activity and thereby the TCA cycle [30]. In our study, Acetaminophen could efficiently increase the expression of TACR3, which might enhance energy production by regulating the TCA cycle in mitochondria (Fig. 8B). However, further study is needed to establish the efficacy of Acetaminophen in treating sepsis-induced ALI. Curcumin is a polyphenolic compound derived from the dietary spice turmeric that has several pharmacologic effects, including anti-inflammatory, antioxidant, antiproliferative and antiangiogenic activities [31]. We found that Curcumin could block TREM1 by binding to it with a docking binding energy of −5.39 kcal/mol, which might alleviate inflammation and oxidative stress in septic patients (Fig. 8C). Tretinoin is a retinol (vitamin A) derivative that has been evaluated as a treatment for ARDS.
In this study, Tretinoin could enhance the expression of PI3, with a docking binding energy of −6.71 kcal/mol (Fig. 8D). Dexamethasone is an approved corticosteroid medication that acts as an anti-inflammatory and immunosuppressant agent. It has been widely used to treat a variety of diseases, including ARDS and sepsis. In the molecular docking results, Dexamethasone could bind tightly to ALDH1A1, which would result in decreased gene expression of ALDH1A1 (Fig. 8E).

Fig. 8

The docking results of the proteins encoded by the diagnostic genes with small-molecule compounds. A The docking result of ARHGDIB with Estradiol. B The docking result of TACR3 with Acetaminophen. C The docking result of TREM1 with Curcumin. D The docking result of PI3 with Tretinoin. E The docking result of ALDH1A1 with Dexamethasone

Discussion

ALI is a lethal clinical syndrome that commonly occurs in septic patients, but its pathogenesis remains unclear. The limitations of the current ALI diagnostic system hamper the ability to provide optimal clinical care to septic patients at an early stage, as the clinical diagnosis of sepsis-induced ALI is primarily determined by PaO2/FiO2 and chest imaging, without regard to molecular biological characteristics [32, 33]. With the development of high-throughput sequencing technology and computational biology, numerous studies have proposed predictive gene expression signatures based on various machine learning approaches. However, two questions should be considered: why a particular method should be used, and which solution is the best one. The choice of algorithm may reflect the preferences and biases of researchers. In this study, we therefore integrated gene expression profiles and applied a consensus of machine learning algorithms to generate a consensus signature that identifies septic patients with ALI with high accuracy, as candidates for further investigation. We subsequently performed external validation to assess the feasibility of the diagnostic model in different centers, and the results suggested that the selected genes had good predictive value (AUCs of 0.725 and 0.833). These data indicate that genes selected by combining different methods can reveal diagnostic signatures and provide insights into regulators of disease.

This study identified a five-gene signature (ARHGDIB, ALDH1A1, TREM1, TACR3 and PI3) using several supervised machine learning algorithms (Additional file 1: Figure S8). ARHGDIB, a pivotal molecule in cellular signaling, is mainly expressed in hematopoietic tissues such as B- and T-lymphocyte cell lines and was initially found to act as an inhibitor of GDP dissociation from RhoA [34]. Previous studies have demonstrated that upregulated ARHGDIB can promote macrophage infiltration and increase the production of ROS by regulating the activity of NADPH oxidase in phagocytes [35, 36], indicating that upregulated expression of ARHGDIB might aggravate lung injury. Moreover, ARHGDIB can inhibit vascular endothelial cell migration and regulates vascular tone and other vascular functions [37]. Upregulated ARHGDIB can inhibit the expression of vascular endothelial growth factor (VEGF), which might suppress the regeneration of endothelial cells [38]. In this study, overexpression of ARHGDIB in sepsis-induced ALI increased the activity of immune cells, and ARHGDIB had a significant negative correlation with the regeneration of vascular endothelial cells. This indicates that ARHGDIB promotes the development of ALI by affecting the immune response and regulating vascular activity, resulting in damage to vascular endothelial cells and lung edema. The key role of ALDH1A1 has been reported to be the oxidation of retinaldehyde to retinoic acid, forming transcriptional regulators critical for normal cell growth and differentiation [39]. Furthermore, overexpression of ALDH1A1 is closely associated with systemic metabolism and inflammation. Studies have found that high expression of ALDH1A1 predicts a poor prognosis because of dysregulated metabolism and inflammatory response [40, 41]. Interestingly, ALDH1A1 expression was low in septic patients with ALI.
With low expression of ALDH1A1 in sepsis-induced ALI, immune tolerance was decreased and the activities of pathways related to intercellular connectivity were also decreased, indicating that low expression of ALDH1A1 might promote damage to the alveolar-endothelium barrier. TREM1, a member of the immunoglobulin superfamily, is mainly expressed in neutrophils and monocytes/macrophages and, when bound to its ligand, stimulates the release of proinflammatory cytokines (e.g., TNF-α and IL-1β). TREM1 has been reported to serve as a diagnostic and prognostic biomarker for sepsis, indicating its potential diagnostic value [42]. Upregulated expression of TREM1 in response to infection is believed to augment the inflammatory response, which not only removes pathogens but also aggravates organ damage [42, 43, 44]. In this study, we found decreased expression of TREM1 in septic patients with ALI, which might impair the clearance of pathogens. In addition, TREM1 is involved in mitochondrial metabolism and energy production [45, 46]. Downregulation of TREM1 will lead to mitochondrial metabolism disorder and reduced energy production, which affect cell proliferation and repair. Our research also found that downregulation of TACR3 was associated with decreased energy production and enhanced oxidative stress. We speculate that redox imbalance and disturbed energy metabolism were induced by downregulation of TACR3, leading to the development of ALI. PI3 is a neutrophil serine proteinase inhibitor with a crucial role in preventing excessive tissue injury during inflammatory events. It has previously been identified as significantly downregulated in the acute stage of ARDS, in concordance with our findings [47]. Plasma PI3 levels could be used for the early diagnosis of ARDS, indicating that direct analysis of blood from ARDS patients may provide valuable information [47].
Furthermore, the expression of and polymorphisms in the PI3 gene were significantly associated with ARDS risk, suggesting that PI3 could be regarded as a prognostic marker [48, 49]. After injury, the epithelium is repaired through secretion of extracellular matrix to restore the epithelial barrier [50]. However, downregulation of PI3 affected the secretion of extracellular matrix proteins, which might delay tissue repair [47]. These results suggest that a dysregulated immune response and enhanced oxidative stress might be the crucial initial mechanisms that damage the alveolar-endothelium barrier, increasing permeability to liquid and protein across the lung endothelium, which in turn leads to edema in the lung interstitium. In addition, mitochondrial dysfunction and bioenergetic dysfunction also contribute substantially to the progression of sepsis-associated ALI. Thus, understanding the function of the diagnostic genes will help to clarify the pathogenesis of sepsis-induced ALI and to propose targeted therapy options.

Drug repurposing is a novel strategy for disease treatment. As the mechanisms of ARDS continue to be revealed and treatment plans continue to be refined, a variety of drugs have been applied to treat ALI/ARDS. In COVID-19-associated ARDS, many drugs were explored for treating COVID-19 patients even though they had not previously been used to treat lung diseases [51, 52]. Therefore, following this strategy, we performed targeted drug screening against the diagnostic genes to propose a novel therapeutic approach for inhibiting the development of sepsis-associated ALI. As a small-molecule compound, Estradiol could efficiently bind to ARHGDIB and decrease its expression. Estrogen receptors are expressed in all immune cells and can regulate cellular functions as transcription factors. Treatment with Estradiol decreases the accumulation of immune cells (e.g., neutrophils and monocytes) and suppresses the production of proinflammatory cytokines, which could relieve lung inflammation [53, 54]. However, excessive intake of estrogen results in side effects such as vomiting, nausea and thrombosis [55]. Acetaminophen, one of the most popular analgesic and antipyretic agents, showed exceptional performance in increasing TACR3 expression. Previous studies have demonstrated that treating sepsis patients with Acetaminophen reduces oxidative stress and inhibits the excessive innate immune response [56, 57], which is beneficial for tissue repair. The toxicity of Acetaminophen should be noted, as an overdose can lead to acute liver failure [58]. The herbal compound Curcumin has been reported to have beneficial effects in treating inflammatory diseases, neurological diseases, cardiovascular diseases, pulmonary diseases, metabolic diseases, liver diseases, and cancers [59].
In sepsis-induced ALI, intranasal Curcumin could significantly reduce the expression of oxidative stress markers (e.g., nitric oxide (NO) and malondialdehyde (MDA)) and inflammatory cytokines (e.g., TNF-α). In addition, Curcumin improves lung permeability and reduces capillary leakage [60]. Yuan et al. further demonstrated that Curcumin exerts anti-inflammatory and anti-oxidant effects through regulation of TREM-1 gene activity, which is in line with our study [61]. Tretinoin (a vitamin A derivative) was one of the compounds that upregulate PI3 and exhibited a high-affinity docking binding energy. Tretinoin is a medicine with anti-inflammatory and immunomodulating properties in sepsis. Treatment with Tretinoin in sepsis inhibits the activation of NF-κB and related target genes such as IL-6, MCP-1 and COX-2 [62]. Furthermore, Tretinoin attenuated fibroblast degradation of extracellular matrix, suggesting that Tretinoin could modify tissue injury and ameliorate lung fibrosis [63]. Therefore, the interaction between Tretinoin and PI3 might improve lung inflammation and fibrosis. Dexamethasone is recognized as one of the most efficient anti-inflammatory medicines and is used in various inflammatory diseases. Early administration of Dexamethasone could reduce overall mortality in ARDS patients [64]. Paradoxically, when these hormones were given to patients with sepsis and pneumonia, no beneficial therapeutic efficacy was found [65, 66]. In our study, we found that Dexamethasone could increase the expression of ALDH1A1 in septic patients with ALI, which might prevent lung inflammation and improve lung permeability. However, when administered systemically, Dexamethasone can elicit severe side effects, such as hyperglycemia, hypertension, hydro-electrolytic disorders and peptic ulcers [67].
Thus, based on drug screening targeting the five diagnostic genes, our study proposes a novel targeted therapy strategy combining multiple drugs, which might prevent the development of sepsis-induced ALI driven by the five diagnostic genes and improve patient prognosis. However, the gene-targeted drugs selected in this study primarily act by regulating the mRNA expression of the targeted genes. Further research is needed to explore novel biomaterials for delivering drugs to the targeted genes.

The novelty of this study lies in the integration of multiple machine learning algorithms to construct a consensus model for distinguishing septic patients with and without ALI. We first used a correlation matrix to eliminate multicollinearity and then applied multiple supervised machine learning approaches to construct the diagnostic model. We then used external datasets to validate the accuracy of the diagnostic model. The further investigation of gene function and targeted drugs is also novel in this research. However, this study still has some limitations. First, although we performed batch correction across the datasets, an inherent batch effect still exists. Future integration studies could begin with the raw sequencing files to ensure consistency and accuracy. Second, many genes were excluded during the merging of datasets and the elimination of multicollinearity, resulting in the loss of some important genes. However, to validate the model in independent datasets, we had to ensure that the genes used for model construction were available in the testing sets. Third, some clinical and molecular traits were not adequately provided in the public datasets, which limited further exploration of the potential associations between the diagnostic genes and these traits. Finally, while our study provides a framework for early diagnosis through the assessment of specific genes, the results remain at the analytical and speculative stage without experimental validation, and we recognize that assessing these diagnostic genes by microarray may be time-consuming. However, real-time PCR offers a quick and relatively straightforward method of assessing the expression of these five genes for early recognition of sepsis-associated ALI. Thus, measurement of the five genes by real-time PCR may represent a promising step towards meeting the urgent diagnostic needs in rapidly progressing conditions such as sepsis-associated ALI.
Future research may further refine this method and explore its integration into clinical practice to enhance its usability and effectiveness. In addition, the combined therapeutic value of the five targeted drugs at the cellular and animal levels requires further study. Based on the diagnostic model, we hope to establish a shared platform to aid clinical diagnosis and treatment of sepsis-induced ALI.
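The correlation-based removal of multicollinear genes mentioned above can be sketched as a greedy filter. This is only an illustration under stated assumptions: the 0.9 cut-off, the greedy keep-first rule and the toy expression values (GENE_A, GENE_B, GENE_C) are hypothetical and are not taken from this study.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant profile carries no correlation information
    return cov / (sx * sy)


def drop_collinear(expr, threshold=0.9):
    """Keep a gene only if |r| <= threshold against every gene already kept."""
    kept = []
    for gene in expr:  # dict order decides which gene of a correlated pair survives
        if all(abs(pearson(expr[gene], expr[k])) <= threshold for k in kept):
            kept.append(gene)
    return kept


# Toy expression profiles across five samples (illustrative values only).
expr = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 4.0, 6.2, 8.1, 9.9],  # almost a multiple of GENE_A
    "GENE_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}

print(drop_collinear(expr))  # GENE_B is removed as redundant with GENE_A
```

Because the filter keeps whichever gene of a correlated pair comes first, the input order matters; ranking genes by a relevance score before filtering is one way to make the choice less arbitrary.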

Conclusion

Using four supervised machine learning feature selection approaches, our study identified a five-gene signature for sepsis-induced ALI from patient whole blood. These diagnostic genes could be used to construct a diagnostic model with good predictive value that effectively distinguishes septic patients with ALI from those without. The selected signature suggests that damage to the alveolar-endothelium barrier and dysfunction of mitochondrial metabolism may be crucial mechanisms in the development of sepsis-associated ALI. Lastly, the diagnostic genes may be putative future drug targets, and the drugs screened against them offer new insight into targeted therapy.