Background

The neonatal period is the first 28 days of life, which is the stage of developing physiological adaptations for extra-uterine life. This time is a vulnerable period and the high neonatal mortality rate is due to the high level of vulnerability in this period [1]. Neonatal and children death is a major health indicator [2] and mortality prediction is applied for reviewing and benchmarking, looking the results in neonatal intensive care units (NICUs) and evaluating efficacy [3]. About two-thirds of infant deaths and about half of the under-five deaths occur in neonatal period [4]. Predictions show that between 2019 and 2030, approximately 52 million children under the age of 5 will die, approximately half of whom will be neonate [2]. However, the under-5 mortality rate has declined around the world, but the neonatal mortality rate is still an alarming issue [5].

In order to public health policy-making and management of pregnancy, childbirth and neonate periods, including the proper selection of risk factors and development of selective care pathways for high-risk pregnancies, it is important to predict high-risk neonates [6]. Furthermore, early identification of neonates who are at risk for death can help physicians provide early treatment and has a direct impact on their survival and decreasing their morbidity [7].

Machine Learning (ML) is a subset of Artificial Intelligence (AI), which incorporates all methods that permit machines to learn from data [8]. The expectation of ML is to train machines based on the provided data and algorithms. The machines learn how to make autonomous decisions using large sets of data inputs and outputs [9, 10]. In NICUs, decision-making is a complex and important process, and the use of artificial intelligence and machine learning techniques can improve the quality of neonatal care by providing early warnings to healthcare providers [11].

According to some studies, the use of machine learning methods in predicting the neonatal mortality was promising. For example, Mboya et al. [5] showed that the predictive ability of perinatal death in machine learning algorithms was considerably superior over the logistic regression method. However, despite there are many studies in this field, most of them have been done on specific groups of neonates, such as premature or Very Low Birth Weight (VLBW) neonates or in general settings rather than specifically in the NICUs. For instance, in a cohort study in Tanzania (2020), perinatal death prediction using machine learning models were compared to logistic regression. The results showed that there was no significant difference in perinatal death prediction between machine learning and regression models, except for bagging method. In addition, the machine learning algorithms had a superior net benefit and its predictive ability was greatly higher than regression model [5]. In a 2019 study, researchers in Bangladesh developed and evaluated regression models to predict the risk of neonatal death based on known characteristics in the beginning of pregnancy, beginning of delivery and five minutes after delivery. According to results, the predictive ability of the model was moderate at the beginning of pregnancy (AUC = 0.59). At the beginning of delivery, the predictive ability was significantly better (AUC = 0.73) and at 5 min after birth, the predictive ability was good (AUC = 0.85) [6]. Researchers in Ohio (2018) predicted postoperative neonatal deaths using a superlearning method (including 14 algorithms) and showed that performance of the superlearner algorithm was better than any of the other algorithms alone [12].

As for studies related to NICUs, researchers in Iran (2020) used neural network and logistic regression to predict the probability of mortality in preterm neonates after admission to NICU and showed that neural network with 60 neurons in hidden layer had a more acceptable performance [3]. In 2020, researchers in Finland used 9 different classifiers to predict mortality and morbidity among very low birth weight infants on time series data and clinical variables. The results showed that random forest had the best results compared to the other classifiers in predicting death (AUROC = 0.922 and F1-score = 0.493) [13].

Despite of these models to predict the neonatal death risk, the performance of different algorithms on different datasets has been different. Furthermore, the best AUC on neonatal death obtained from these studies was 0.96 [3].

Unlike the previous studies, which were performed on premature or VLBW neonates [3, 13], or in general settings other than NICUs [5, 6, 12], the present study considers all neonates without birth weight limitation in NICU settings. Furthermore, the majority of previous models were not prospectively applied and evaluated [5, 6, 13]; however, in the current study, we prospectively evaluated our models in a NICU to better examine the performance of the models from a clinical perspective.

On the other hand, according to studies, neonatal mortality is more prevalent in the developing countries and may follow a different pattern, so appropriate models need to be developed in these countries based on the internal conditions. Therefore, the purpose of this study was to present a neonatal death risk prediction model using machine learning algorithms and apply these models in an NICU to predict the neonatal death and compare the results with final status of neonates to evaluate the performance of these models.

Methods

At first, important risk factors in neonatal death were identified through a literature review, neonatologists’ opinions and different feature selection methods. Then, several machine learning models were developed and evaluated in a prospective study for external validation of models. The overall methodology briefly is described in Fig. 1.

Fig. 1
figure 1

Overall methodology

Identification of the neonatal mortality risk factors

A literature review was conducted to identify neonatal mortality risk factors, which was reported elsewhere [14] and then ultimately 21 important risk factors for neonatal death were identified by neonatologists’ opinions. However, four of these variables were not recorded in the neonatal registry that we used. Therefore, these four variables were excluded and 17 variables were selected for the analysis. There are also different feature selection methods in practice and according to previous studies [15,16,17], four well-known feature selection methods were applied to identify the most important features from these 17 features. First, we used univariate statistical analysis (non-parametric Mann–Whitney and Chi-square tests) to identify significantly different variables among alive and dead neonates. We considered p-value ≤ 0.05 as our selection criterion and identified 13 features. Some statisticians believe that marginal (p-value < 0.2) significant variables identified from univariate should be included in multivariate analysis. Using this criterion, 14 features were selected. We also used IBM SPSS modeler feature selection node and identified 9 important features. Also, we used ‘CfsSubsetEva’ method in Weka. This method measures the significance of attributes on the basis of predictive ability of attributes and its degree of redundancy. The subsets which are having less inter-correlation but highly correlated to the target class are preferred and according to studies, variables selected by this method have the best results in term of the percentage of correctly classified instances compared to other feature selection methods available in Weka [16, 17] In our dataset, this method identified 6 important features. In Table 1, different feature selection methods are presented.

Table 1 Different feature selection results

After implementing several machine learning algorithms on these five sets of features, we found that the better results were obtained for models developed based on 17 and 12 features, however, most models developed based on 17 features had the highest performance. Therefore, we considered this feature set for the further analyses. One of the examples of our experiments on 17 and 12 features on one dataset are presented in Additional file 1: Table 1.

Neonatal data

The data was collected from a neonatal registry database in “Maternal, Fetal and Neonatal Research Center”, Tehran University of Medical Sciences, Tehran, Iran. This registry contains neonatal records from teaching hospitals in Tehran. We extracted data from 1 May 2017 to 31 July 2018. Based on the previous phase, 17 confirmed neonatal risk factors were extracted from this registry. Our dataset consisted of 1762 records in two classes (dead, n = 138 and survived, n = 1624).

Data pre-processing

The various models were firstly developed using the original data; however, due to the low sample in “dead” class, the model performance was not acceptable, especially in terms of their specificity. Hence, to obtain the best models, data pre-processing techniques were applied.

Missing data imputation

Data imputation is usually used to improve the data quality and performance of machine learning models. For this purpose, many methods are well-documented such as replacing the mean or mode of a class group [18, 19]. In our dataset, 14 of the 17 variables had less than one percent missing values. Details on frequency of missing data in different features in the dead and survived classes are presented in Table 2. We used the mean and the most frequent category of each class (dead vs. survived) to impute continuous and Boolean variables, respectively [20]. The missing values were imputed using IBM SPSS modelers. For example, for “congenital malformation”, most records had a value of "No", so the missing value was replaced with "No".

Table 2 Frequency of missing data in different features

Data balancing

Our dataset was imbalanced regarding to the frequency of each class, and the “dead” class contained only 138 records (7.83%). Hence, we applied two minority oversampling techniques, the Synthetic Minority Over-Sampling Technique (SMOTE) [21] and Adaptive Synthetic (ADASYN) [22] to balance the data using R software version 4.0.4. SMOTE is the most famous method to balance the data to improve random oversampling [23]. This technique works by increasing the sample of minority class. By using this method, majority instances do not change [24] and more data from the minority class is added to the dataset so that the amount of data in the minority and majority classes reaches a more balanced level. Although there are many versions of this technique, most of them do not outperform than the original version, therefore, we relied on the original SMOTE [23]. Furthermore, the goal of ADASYN is to utilize a weighted distribution for different minority class samples relevant to their difficulty level of learning, where more synthetic data is made for minority class samples that are difficult to learn compared to those minority samples that are simple to learn [22].

Based on these two methods, we created new datasets by using R software. We added records to the “dead” class using SMOTE by 3, 4, 5, and 11 times and named them as SMOTE-oversampled dataset1 (live class:1624; dead class:552; ratio: 2.96), SMOTE-oversampled dataset2 (live class:1624; dead class:690; ratio: 2.35), SMOTE-oversampled dataset3 (live class:1624; dead class:828; ratio: 1.96), and SMOTE-oversampled dataset4 (live class: 1624; dead class: 1656, ratio: 1.01). We also used ADASYN (with k = 3) and created the ADASYN-oversampled dataset (live class: 1624; dead class: 1583, ratio: 1.02).

In order to achieve the best results, we implemented different machine learning algorithms on the original data, and the above-mentioned oversampled datasets and compared the performance of the models in terms of confusion matrix measures (AUC and F1 measure).

Model development

In this phase, we used the selected variables as input variables to develop the machine learning models. The data were randomly splited into two groups: 70% for training and 30% for testing data and then different machine learning algorithms including ANN, decision tree (RF, C5.0 and CHAID tree), SVM and Bayesian network were developed on the original and the oversampled datasets. Also, each algorithm was executed 10 times with different randomly selected train and test sets.

Artificial neural network (ANN)

These networks, similar to the natural neural networks, process the input variables through neural processing units [20]. ANN is a set of connected input/output units and each connection has a weight. During the training phase, it adjusts the weights to learn how to predict the output class [25]. There are many kinds of ANNs [26] out of which, we used Radial Basic Function (RBF) and Multiple Layer Perceptron (MLP) networks with different number of processing units in each hidden layer. The selection of the network architecture was done by trial and error, and finally, the network with the best performance was selected.

Decision tree

Decision tree classifies data to discrete ones applying tree structure algorithms [25]. The main purpose of these classifiers is to display the structural information stored in the data. This technique generates a decision tree from a set of labeled training samples [18]. The advantages of this method are its ease and speed, ability to handle high dimensional data and its understandable representation [25]. We used the RF, C5.0 and CHAID tree algorithms to construct the decision tree. C5.0 algorithm utilized a pruning method. Also, this method uses a boosting method to build and merge multiple classifiers to deliver improved accuracy [27]. The Chi-squared Detection of Automatic Interaction (CHAID) tree is one of the oldest decision trees for prediction made by repeatedly splitting the subset space into two or more subgroups [28, 29]. CHAID investigates the relationship between a dependent variable and the predictors by maximizing the significance of a Chi-square statistics [30, 31].

RF is one of the most famous machine learning techniques for prediction problems [32]. RF is an Ensemble method that developed multiple decision trees through bootstrap aggregation. Each time an input is supplied to RF, each of the developed decision trees is passed on to that input. Every tree independently predicts a classification and "votes" for the corresponding class. The overall RF forecast is determined by the majority of the votes. Inherently, this combined vote of multiple decision trees is less noisy and less prone to outliers than a single decision tree [13, 33, 34].

Support vector machine (SVM)

SVM is an appropriate technique for binary classification. This method is very popular due to its features such as dealing with complex nonlinear data points in the health field. SVM is one of most accurate methods and is less prone to over-fitting than other methods [25, 35]. Furthermore, it is a suitable classifier without the need for any prior knowledge and has high precision and robustness [36, 37]. In addition to linear problems, SVM can also be used as a nonlinear kernel function. The most common kernel functions in SVM include Linear, Polynomial and RBF [25]. In this study, we developed Linear, RBF and Polynomial kernel functions and selected the model with the best performance.

Bayesian network

We also applied the Bayesian Network algorithm. These networks are known as statistical classifiers which predict the probability of membership of a given sample in a specific class. Accuracy and speed of this network is high for large databases [38, 39] and the performance of this classifier is also robust [40].

Ensemble model

An Ensemble model is one of the methods to increase the classification accuracy. In this technique, a classification model that combines several classification methods is selected. Each classifier returns its vote and the final result is determined by calculating the frequency of votes by each individual classifier [18]. In the current study, the performance measures of different models were reviewed, the best models were selected and combined by two Ensemble methods including Ensemble-Confidence weighted voting and Voting method and their results were compared with each individual models.

Prospective evaluation and external validation

In order to perform external validation, we conducted a one-month prospective study in the NICU of “Yas” hospital affiliated with Tehran University of Medical Science. We applied our models to predict outcomes of all neonates admitted to this NICU from 20 April 2020 to 20 July 2020 (92 neonates) and then followed them up to their discharge or death (dead, n = 18 and survived, n = 74) and compared the model results with the actual final status of these neonates.

Implementation and data analysis

We used Statistical Package for the Social Sciences (SPSS) version 23 and Waikato Environment for Knowledge Analysis (Weka) software to analyze the data and identify the important features, respectively. Moreover, we used R software version 4.0.4 and IBM SPSS Modeler version 18 to balance the data and develop the machine learning models, respectively. In this regard, we used confusion matrix and performance measures including accuracy, precision, sensitivity, specificity, F-Score and AUC.

Results

The included variables

Based on neonatologist’s opinions, 21 important risk factors were identified and four of them were excluded from the analysis due to high missing data in the dataset. Then, different feature selection methods were applied on identified risk factors (Table 1).

Description of the neonates

Table 3 indicates the distribution of the quantitative and qualitative variables for all neonates and also dead and survived ones.

Table 3 Distribution of qualitative and quantitative features in the original dataset

The mean of BW was 2323.2 gr (1643.5 gr in dead and 2566 gr in survived neonates). Furthermore, the mean of GA was 240.49 days (216 and 249.4 days in dead and survived neonates, respectively). 56.5% of all neonates were preterm and 20.5% were SGA. Additionally, 97.2% of mothers received routine perinatal care during pregnancy (97.3% in survived vs. 95.7% in dead neonates) and only 20.8% of mothers suffered from chronic diseases such as gestational and chronic diabetes, chronic and gestational hypertension and other diseases during pregnancy; 31.3% of neonates had RDS (29.2% in survived vs. 55.8% in dead neonates), and steroid and surfactant administration were seen only in 0.5% and 23.7% of all neonates, respectively. There was also 1.9% pulmonary hemorrhage (0.2% in survived vs. 21.7% in dead neonates), heart disease (22.5%), NEC (2%), sepsis (45.8%) and IVH (15.7%) in neonates. Additionally, the most of neonates had no asphyxia (98.1%), intubation (90.2%) and ventilation need (84.6%).

The machine learning algorithms and their evaluation

The performance of selected models on the original data indicated that the specificity was not appropriate mainly because of our imbalanced dataset. Therefore, we initially created 5 oversampled datasets. Considering that SMOTE-oversampled dataset1 and SMOTE-oversampled dataset4 had the best results (based on AUC and F-score) for the most models, we only report our results for the original, SMOTE-oversampled dataset1, SMOTE-oversampled dataset4 and ADASYN-oversampled dataset in Table 4. The details of the results for the other oversampled data are presented in Additional file 1: Table 1.

Table 4 Performance measures of the selected machine learning models on original, SMOTE-oversampled dataset1, SMOTE-oversampled dataset4 and ADASYN-oversampled dataset

According to the Table 4, most models developed on SMOTE-oversampled dataset1 shows slightly better results. Among RF models, RF with 100 built models and 10 maximum tree depths had the best accuracy (0.92) and AUC (0.97). Top RF decision rules are presented in Additional file 1: Table 2. Additional file 1: Table 3 shows different neural network architectures. We found that the best performance in ANNs was obtained in MLP network model with 17 input variables and 9 units in one hidden layer with accuracy (0.91) and AUC (0.96) (Additional file 1: Fig. 1). In addition, the best result for C5.0 was observed in the tree with 11 levels tree depth in terms of the accuracy (0.92) and AUC (0.94), respectively. The C5.0 decision rules are presented in Additional file 1: Table 4. As for SVM, we found that RBF function with stopping criteria = 1.0E-3, Regularization parameter (C) = 10, Regression precision = 0.1 and RBF Gamma = 0.1 had the best accuracy and AUC with values of 0.94 and 0.98, respectively. The details of different SVM models are present in supplement (Additional file 1: Table 5).

Table 5 Prospective evaluation results of the selected machine learning models

In addition, for CHAID tree, the 5-depth tree had the best accuracy (0.90) and AUC (0.96). Details on CHAID decision rules are presented in Additional file 1: Table 6. We developed a Bayesian network with two TAN and Markov structures. The network with TAN structure, parameter maximum likelihood learning method, and maximum conditioning size set = 5 had the best accuracy (0.90) and AUC (0.95), respectively on the test data. In Additional file 1 (Fig. 2), the selected Bayesian network structures are shown. Also, all setting, configurations and their values for the best performing models are presented in Additional file 1: Table 7.

Table 6 Comparison of our results with related literature
Fig. 2
figure 2

Receiver Operating Characteristic (ROC) curves in a selected algorithms b Ensemble method on the test data

Finally, the best RF, ANN, C5.0, SVM, CHAID tree and Bayesian network models were combined using Ensemble methods (Table 4 and Fig. 2). The results indicated that the accuracy and F-score for the SVM was better than the other methods. Comparing different models indicated that the RF had the highest precision and specificity compared to other models. The Ensemble and SVM had the best AUC on the test data. Details on confusion matrix result for each model are presented in Table 8 in Additional file 1.

Among all variables, only the “intubation” was significant in all six models. Then, “GA” was significant in RF, ANN, C5.0, SVM and CHAID tree models. BW, pulmonary hemorrhage and SGA were significant variables that used by at least four models. Congenital malformation, NEC, prenatal care, preterm birth and sepsis were significant variables that used by at least three models. Asphyxia, mother disease and ventilation variables were significant in just two models. Some variables such as steroid therapy, surfactant administration and RDS variables were only significant in one model.

Prospective evaluation

We conducted a prospective evaluation on the best RF, ANN, C5.0, SVM, CHAID tree, Bayesian network and Ensemble models. Table 5 shows the performance of the models developed based on the SMOTE-oversampled dataset1 in external evaluation. Details on the confusion matrix and the results of the prospective evaluation conducted by other models (developed on other datasets) in Additional file 1: Table 9 and Table 10 . According to the results, most models developed on SMOTE-oversampled dataset1 (except RF) completely outperformed and among them (Table 5), the highest accuracy, sensitivity, F-score and AUC were observed for the ANN; however, C5.0 and CHAID trees had the highest precision and specificity. Therefore, we finally selected models developed based on the SMOTE-oversampled dataset1 on 17 features.

Discussion

In this study, we developed prediction models for neonatal death in NICU using machine learning algorithms and 17 important variables. Cooper et al. [12] performed the superlearning algorithm on 68 variables. Safdari et al. [41] and Beluzon et al. [42] considered 14 and 23 variables, respectively. Ravelli et al. [43] developed the antenatal prediction of neonatal mortality in very premature infants on 13 variables. Mboya et al. [5] considered 32 predictive variables for perinatal death prediction.

In our study, “intubation” was the only identified important neonatal mortality risk factor were also significant in all six developed predictive models. Studies in Thailand [44], Brazil [45] and Iran [46] also identified this variable as one of the most important risk factors in neonatal death.

GA and BW were also identified as important risk factors by at least four models. Similarly in UK [47], Ethiopia [48],China [49], Brazil [50], Iran [51], Mexico [52], Finland [13] and Brazil [53], these two variables were identified as important risk factors for neonatal mortality. Furthermore, a systematic review indicated the importance of these risk factors for neonatal mortality in NICUs; GA and BW were the most cited risk factors for neonatal death [14]. Some risk factors such as “pulmonary hemorrhage” was found in at least four models of our study and also stated in [45, 54,55,56], but not mentioned in other machine learning studies [13, 43].

We developed RF, ANN, C5.0, SVM, CHAID tree and Bayesian network as well as Ensemble models and found that the SVM had the highest accuracy, F-score and sensitivity than other models. Also, SVM and Ensemble methods resulted in the highest AUC. The best performance in terms of specificity and precision was for RF. Additionally, the results from prospective evaluation showed the highest accuracy, sensitivity, F-score and AUC was for the ANN model and the highest precision and specificity was for the C5.0 and CHAID tree. This result indicates that ANN, C5.0 and CHAID tree models are more generalizable and applicable for external data.

There are several studies in this respect. Although the studies have been conducted on different data and are not comparable, but as shown in Table 6, Jaskari’s study [13] for predicting the neonatal mortality showed AUC (0.922) and F-score (0.477) for RF classifier. Beluzos et al. [42] proposed a novel support decision method to classify newborns based on their neonatal mortality risk and indicated that the accuracy and AUC were 93% and 0.965, respectively. Rezaeian et al. [3] developed models for prediction of mortality of premature neonates and presented AUC (95.99%), accuracy (96.79%), sensitivity (86.20%) and specificity (98.37%). Cooper’s study [12] for predicting the postoperative neonatal death showed that the AUC for model development and validation were 0.91 and 0.87, respectively. Ravelli et al. [43] developed a model to predict neonatal mortality in very premature infants and indicated that the AUC and accuracy were 0.83 and 0.65, respectively.

Because of different datasets, and targeted neonates, comparing our results with other studies is difficult; however, in general, in comparison with the best results from previous studies, we achieved the highest AUC (98% vs. 95.99% [3]), sensitivity (95% vs. 86.20% [3]) and F-score (96% vs 0.477 [13]) than other similar studies. The highest accuracy (94% for SVM) and specificity (94% for RF) in our study are much better than Vianna’s study (83% and 62%, respectively) [57] and Ribeiro’s study (88.2% and 91.7%, respectively) [53] but less than Rezaeian’s study (96.79% and 98.37% respectively). It should be mentioned that Rezaeian’s models are only applicable for premature neonates [3]. Among studies focused on all neonates (not premature), our models showed better results.

Study limitations and future studies

One of the limitations of our study was the necessity to balance the original data. Comparison of models developed on the original (imbalanced) with those developed on the oversampled data showed an improved performance especially in terms of specificity. Although, oversampling technique is well-defined in machine learning [21], it produces artificial data that may effect on the results. Therefore, it is suggested that the study should be replicated with a larger dataset that is more balanced. Also, it is recommended to use different minority oversampling techniques other than “SMOTE” and “ADASYN” to balance dataset and compare the results. Additionally, we excluded some of the variables from our analysis because of unavailability of data. It is highly recommended to consider these variables in future studies. Also, given that this dataset was specific to this study, as far as we know, there are no other studies (methods) on this dataset, therefore we were not able to conduct such a comparison. Developing other models on the same dataset is recommended. In addition, no decision support system has been implemented yet. Indeed, future studies should be focused on developing such systems and evaluating the impact of these models and systems on health outcomes.

Implication

The main audiences of this study are physicians and neonatologists in NICUs. They can consider the models and the different risk factors that are identified as important factors by these models in their decision making in NICUs. Artificial intelligence researchers and developers who are interested in developing predictive models or decision support systems for neonatal mortality can also use the results of this study to select the best models for the prediction of neonatal death.

Conclusion

We developed several machine learning-based models including RF, ANN, SVM, C5.0, CHAID tree, Bayesian network and Ensemble methods using different feature selection methods to predict neonatal deaths in NICUs. As a result, models developed using feature selected by neonatologists (17 features). The ANN models had the best results in prospective evaluation. Therefore, it is suggested for implementing on similar projects.