1 Introduction

Neonatal medical data are used for various purposes such as evaluating the health status of neonates, making clinical decisions, diagnosing diseases, and supporting patient treatment and care processes. By analyzing the important information contained in this data with artificial intelligence algorithms, it may be possible to achieve important goals such as early disease diagnosis, creating appropriate treatment plans and conditions, and supporting healthy growth and development [1]. Machine learning algorithms can also contribute to scientific research on neonatal health, thanks to their ability to process large amounts of data and uncover complex relationships [2, 3]. By employing these algorithms, it becomes possible to monitor the health status of newborns, anticipate potential treatment-related challenges, and proactively take necessary precautions [4].

Length of stay (LoS) is a clinical measure that indicates the time between the patient’s admission to the hospital and discharge. According to the World Health Organization (WHO), the length of hospital stay is a significant performance measurement and monitoring indicator [5]. Hospitals need to anticipate patient discharge times in order to manage resources such as beds, intensive care units, and especially devices like ventilators efficiently, to ensure timely access to critical medical care, to control hospital costs, and to improve service efficiency. Furthermore, predicting LoS is essential not only for healthcare institutions but also for patients, their families, and insurance companies, as it aids in requirements and cost planning [6, 7].

Similarly, predicting LoS for neonatal intensive care patients serves as an indicator of health status and holds paramount significance in bed allocation, ensuring a consistent healthcare staff roster, and evaluating the resources within healthcare facilities [6].

Recently, there has been an increasing trend toward medical applications utilizing artificial intelligence techniques. Machine learning, big data analysis, and deep learning (DL) are among the artificial intelligence techniques employed in predicting LoS. However, the fact that the model often functions as a black box is not desirable for medical professionals. Consequently, Explainable Artificial Intelligence (XAI) studies have become widespread [8]. Explainable systems can be categorized into two main groups: post hoc and ante hoc systems. Post hoc systems explain the relationships between features obtained through the model after feature extraction, while ante hoc systems produce inherently understandable relationships. Examples of ante hoc systems include methods like Fuzzy Logic and decision trees (DT) [9, 10]. Since machine learning methods are primarily ante hoc systems, they are generally preferred in studies aiming to predict LoS.

The main motivations and contributions of this study are as follows:

  • In the literature, there is a noticeable scarcity of studies related to artificial intelligence and machine learning in the analysis of neonatal medical data. Addressing the deficiency in this area is one of our primary motivations.

  • The study introduces a hybrid approach called Classifier Fusion-LoS (CF-LoS), which comprises two stages for predicting LoS.

  • CF-LoS involves testing both known and novel classifiers. Subsequently, classification accuracy is significantly enhanced by employing the Voting Classifier technique to combine the high-performing results of these diverse classifiers. This innovative approach presents a new method to improve research performance.

This study aims to develop a machine learning-based prediction model that classifies the length of stay (LoS) of patients hospitalized in the Neonatal Intensive Care Unit (NICU) as either long-term or short-term hospitalization, using an original dataset. For this purpose, the epicrisis reports of 512 babies hospitalized in the NICU of the Private Eastern Anatolia Hospital in Elazig, Turkey were retrospectively examined with the necessary ethical permissions. Patients referred to another hospital were excluded from the study. Consequently, 12 features from 453 patients were included in the dataset used for predicting LoS. The dataset underwent preprocessing, including the conversion of categorical data to numerical form. Subsequently, the testing and validation processes were conducted in two stages within the CF-LoS method. In the first stage, various classifiers, including Logistic Regression (LR), ExtraTrees (ET), Random Forest (RF), K-Nearest Neighbor (KNN), SVC, AdaBoost, GradientBoosting (GB), XGBoost (XGB), and CatBoost, were employed. In the second stage, Voting Classifiers, namely Soft Voting and Hard Voting, were applied to enhance the overall classification accuracy. With CF-LoS, LoS was predicted by binary classification as either long-term or short-term hospitalization, using the categorical and numerical data of patients hospitalized in the NICU. Classification performance was evaluated using metrics such as Accuracy, Precision, Recall, F1-score, and Area Under the Curve (AUC), and then compared with similar studies in the literature.

The article is organized as follows: First, in Sect. 2, LoS prediction studies for NICU patients are extensively covered. Additionally, other artificial intelligence studies for LoS prediction are mentioned in this section. Section 3 provides detailed explanations of the dataset and classifiers used. Section 4 encompasses the application and results. Section 5 includes comparisons with other studies in the literature and discusses limitations. The article concludes with the Conclusion section.

2 Literature review

2.1 LoS prediction studies for NICU patients

Research in the field of LoS prediction for NICU patients is relatively scarce. One of the earliest studies in this area was conducted by Zernikow et al. (1999), utilizing Multiple Linear Regression (MLR) and Artificial Neural Network (ANN) techniques. They retrospectively collected data from 2144 prematurely born patients over 7 years, excluding those who were transferred to another hospital without discharge or who passed away during their stay. The performance of the study was assessed using the linear correlation coefficient (r). For the MLR model, the calculated r ranged from 0.85 to 0.90, and a range of 0.87–0.92 was observed for ANN [11].

In their study, Feldman and Chawla (2014) developed a machine learning model called ADMIT to estimate LoS, which they consider important for appropriately adjusting the number of nurses caring for babies treated in the Neonatal Intensive Care Unit. For this purpose, they used a dataset containing the characteristics of 28,061 babies. They reduced the dataset by selecting 7 patient features. They achieved an average AUC of 0.97 with the ADMIT model, an improved version of the AdaBoost Algorithm [12]. In another study, Thompson et al. (2018) developed a long-term LoS prediction application for newborns using machine learning methods. They aimed to predict costs for healthcare services. They utilized the 2014 Florida HCUP SID dataset, which includes administrative data and simple clinical data. They experimented with methods such as Naive Bayes (NB), SVM, C4.5, and Random Tree, achieving the best result with the RF method with an AUC of 0.877 [6]. Mansouri et al. (2022) applied machine learning-based methods to estimate LoS for patients hospitalized in the NICU. They utilized demographic information and post-initial presentation diagnostic data sourced from the MIMIC III database. Their regression analysis study achieved an R2 value of 0.78, indicating its performance [13]. Lin et al. (2022) estimated LoS for very-low-birth-weight babies in a hospital in Taiwan using machine learning. They pursued two distinct approaches, seeking continuous and categorical results. For this, they used a dataset consisting of 21 features and 6 different algorithms. Patients who died before 7 days were not included in the dataset. Among the models with binary classification as late discharge/early discharge, LR achieved an AUC of 0.724, while the RF method achieved an AUC of 0.712 [14]. Juraev et al. (2022) implemented a machine learning application that predicts neonatal mortality and LoS using a multilayer dynamic ensemble model. They employed classification for mortality prediction and regression for LoS prediction, utilizing the MIMIC III dataset, which includes time-series data from 3133 patients. The predictive performance resulted in an RMSE of 12.509 and an R2 of 78.345 [15].

2.2 Other artificial intelligence studies for LoS prediction

In recent years, there has been a notable rise in studies focused on LoS prediction. The majority of these studies utilize machine learning methods to predict LoS based on patient information collected at admission. A smaller subset explores LoS prediction through big data analysis and DL. Age, gender, and diagnostic criteria are commonly used features across many of these studies, while other features vary according to the studies [7].

The pioneering LoS estimation study was conducted by Tu and Guerriere (1993). They employed the Neural Network (NN) using a dataset of 1409 patients to predict the LoS in the Intensive Care Unit (ICU) following cardiac surgery. They classified LoS into three groups: low, medium, and high. Their model achieved an AUC of 0.69 in the test [16]. Lowell and Davis (1994) conducted a similar study to estimate LoS with NN for psychiatric patients admitted to the hospital. The dataset included demographic information, place of residence, and disease severity, in addition to factors influencing LoS. The accuracy of the proposed method was compared with the predictions of the clinical treatment team. In all cases, the proposed method provided the same or better predictions than those of the treatment team [17].

Hachesu et al. (2013) applied DT, SVM, and ANN to a dataset containing 36 different features of 4948 patients for LoS prediction. Among these three algorithms, the SVM method achieved the best result with an accuracy of 96.4%. On the other hand, the advantage of DT lies in creating easily understandable model rules—a quality highly valued by medical experts. As a result, the longer LoS in patients with lung diseases, respiratory disorders, and high blood pressure has attracted attention [18]. Tsai et al. (2016) estimated LoS using two different datasets. While the first uses the data collected at the pre-admission stage, the second uses all data from hospitalized patients. Thus, the LoS estimate at the discharge stage was used to evaluate the LoS estimate at the pre-admission stage. The dataset includes 2377 cardiac patients diagnosed with Heart Failure (HF), Acute Myocardial Infarction (AMI), and Coronary Atherosclerosis (CAS). The highest accuracy achieved for CAS before discharge is 89.64%, whereas during the pre-admission phase, it reaches 89.65%. Similarly, for HF, the highest accuracy before discharge is 64.12%, compared to 63.69% during the pre-admission stage. In the case of AMI, the highest accuracy before discharge stands at 66.07%, while during the pre-admission phase, it reaches 65.72% [19].

Tang et al. (2018) performed classification for four different tasks using data from the MIMIC III database, which included intensive care patients. These tasks encompass mortality prediction, LoS prediction, differential diagnosis, and readmission prediction. In LoS estimation, a multiple classification model was created for the periods (1–2), (3–5), (5–8), (8–14), (14–21), (21–30), (30+). Among the proposed models, the Collapse models consist of standard machine learning models such as SVM, RF, GB, LR, and Multilayer Perceptron (MLP), while the Sequential models include two RNN models, LSTM, and CNN-LSTM. In terms of LoS prediction, the Collapse models achieved AUC values ranging from 0.64 to 0.88 for different classes, while the Sequential models obtained AUC values ranging from 0.65 to 0.93 [20]. Jalali et al. (2020) used five different methods to determine postoperative death and long-term hospital stay. These methods included Ridge Regression (RR), DT, RF, GB, and Deep Neural Network. The Deep Neural Network model outperformed the others, achieving an accuracy of 89 ± 4% and an AUC of 0.95 ± 0.02 [21].

Castineira et al. (2020) retrospectively collected vital continuous time data and clinical history data of 284 patients hospitalized in the Pediatric Intensive Care Unit of Boston Children’s Hospital. They estimated LoS using a machine learning method based on the Gradient Boosting Decision Tree (GBDT), similar to our study, classifying it into short or long duration. They achieved an AUC of 83% using only vital data, whereas the combination of vital and clinical history data yielded an AUC of 90% [22]. Bacchi et al. (2021) used ANN and LR models to predict hospital discharge probabilities within 2 and 7 days based on daily epicrisis reports of patients in General Medicine at Royal Adelaide Hospital. The models achieved an AUC of 0.80 for discharge within 2 days for ANN and an AUC of 0.78 for LR. For discharge within 7 days, the AUC was 0.68 for ANN and 0.61 for LR [23].

Unlike many previous studies, Abd-Elrazek et al. (2021) utilized general medical characteristics to estimate LoS, in addition to the disease diagnosis collected at the patient admission stage. They employed various machine learning methods for LoS prediction, including Classification Trees (CT), ANN, Tree Bagger (TB), RF, FL, SVM, KNN, Regression Tree (RT), and NB. Among these methods, FL achieved the highest accuracy at 92% [5]. Wu et al. (2021) employed four different machine learning models to predict prolonged LoS in the General Intensive Care Unit. These models included RF, SVM, DL, and GBDT. They utilized eICU and MIMIC III databases to train and validate the models. As a result, the GBDT model achieved the highest performance with an AUC value of 0.747. Comparisons with the Simplified Acute Physiology Score (SAPS-II) showed that all machine learning methods achieved more successful results [24].

Sinha et al. (2021) developed models for initially diagnosing COVID-19 and subsequently predicting LoS. They employed the Catboost model, a multilayer data processing approach, for LoS prediction. The average prediction accuracy for the two models was calculated as 92.33% [25]. Kalgotra and Sharda (2021) proposed a big data analysis and deep learning-based model for predicting LoS. They investigated hidden comorbidities using the records of 10 million patients hospitalized for different diseases in hospitals in the USA. As a result, their model achieved an MAE of 29.8% in LoS estimation [26].

Chrusciel et al. (2021) conducted two different LoS predictions employing the RF model. One data model utilized clinical reports, while the other incorporated age, gender, and primary diagnosis. They retrospectively collected records of patients admitted from the Emergency Department with hospital stays exceeding 2 days. The unstructured model, which received data from plain text, achieved an accuracy of 75.0%, while the structured data model achieved an accuracy of 74.1%. A similar evaluation focused solely on intensive care patients yielded results of 76.3% and 75.0% for the unstructured and structured data models, respectively [27].

In another study, Bacchi et al. (2022) prospectively obtained data for 4 months from the Royal Adelaide Hospital and 12 months from the Queen Elizabeth Hospital for LoS estimation. They estimated LoS using ANN with the Natural Language Processing technique applied to patient admission reports. The models achieved an AUC of 0.74 for discharge within 2 days and 0.67 for discharge within 7 days [28].

Kim et al. (2022) used data from 27,747 patients to estimate the LoS in Emergency Departments. They employed machine learning models including LR, XGBoost, NGBoost, SVM, and DT. The SVM and XGBoost methods achieved the highest accuracy of 0.896 [29]. Zeng (2022) predicted LoS using five machine learning algorithms named Lasso Regression (LSR), RR, RF, Lightweight Gradient Boosting Machine (LightGBM), and XGBoost and six feature-coding methods. The dataset, sourced from hospitalization records provided by the New York State Department of Health, consisted of 34 features encompassing patient characteristics, diagnosis, treatment, services, and costs. Performance was assessed using MSE and R2, yielding values of 2.231 and 96.0%, respectively [30]. Alabbad et al. (2022) used four machine learning models, namely RF, Gradient Boosting (GB), XGBoost, and Ensemble, to predict intensive care needs and LoS for COVID-19 patients during the epidemic. For this purpose, they utilized the dataset of 895 COVID-19 patients admitted to King Fahad University Hospital in Saudi Arabia. The RF model reached the highest accuracy at 94.16%. Their findings highlighted the significance of age, CRP, and days on nasal oxygen support as key factors influencing LoS in the Intensive Care Unit [31].

Saravi et al. (2022) employed machine learning and DL algorithms to predict LoS for patients undergoing lumbar spine decompression surgery. The study involved the use of K-means and Unsupervised Learning for data clustering. Then, for LoS classification, LR, RF, Stochastic Gradient Descent (SGD), KNN, DT, Gaussian Naive Bayes (GaussianNB), SVM, CNN, MLP, and Radial Basis Function Neural Network (RBNN) models were used. A decision support system based on the CHAID algorithm was also added to the models. In the dataset containing 236 patients, 14 features were used to train the models. The proposed models achieved AUC values ranging from 67.5 to 87.3%. As a result, it has been observed that the most decisive feature in the LoS prediction of the proposed models is the operation time [32].

Kadri et al. (2023) applied a generative adversarial network (GAN) model to estimate the LoS for patients admitted to a Pediatric Emergency Department in France. They compared the results with different deep learning (Deep Belief Networks, CNN, Stacked Auto-Encoder) and machine learning (SVM, RF, AdaBoost, DT) models. It has been observed that the GAN model gives better results than other models. They achieved an R2 and EV score of 0.87, along with an MAE score of 61.72 [33].

The prominent limitations of LoS prediction studies for NICU can be listed as follows:

  1. It is challenging to predict extremely long LoS effectively using regression analysis. Zernikow et al. [11] emphasized this limitation, reporting that their models were generally unable to accurately predict hospital stays exceeding 140 days. To address this constraint, we adopted a classification approach to enhance the precision of neonatal LoS predictions.

  2. In machine learning studies, a large dataset can sometimes lead to overfitting rather than demonstrate the success of the model. Feldman and Chawla [12] reported difficulties in applying the same method to other medical units, despite having data on 28,061 babies. To overcome this limitation, we used a dataset of more moderate size.

  3. One of the most important limitations of LoS studies is the lack of data unity. Thompson et al. [6] emphasized the importance of ensuring data unity to apply studies to different states. This also limits the realistic comparison of proposed methods in the literature. In order to ensure data unity in future studies, we obtained the dataset from routinely collected epicrisis reports.

  4. Lin et al. [14] stated, as another limitation, that discharge criteria differed among the hospitals that comprised their dataset. To overcome this limitation, we obtained our data from the NICU of a single hospital. Additionally, Lin et al. [14] state that they faced an overfitting problem because they used the entire dataset for training in their study. In our study, we addressed the overfitting limitation by holding out 30% of the dataset as test data.

  5. In certain studies, the utilization of unnecessary features can increase the cost, particularly when paid datasets are involved [15]. Our study did not have any cost limitations, and we chose to include a reasonable number of features (12 features). Juraev et al. [15] also noted the use of classical machine learning methods as a limitation in their study. In our study, both classical machine learning methods and ensemble methods such as CatBoost were used. Additionally, an innovative approach has been developed using techniques such as the Voting Classifier to increase classification performance. Juraev et al. [15] also listed among the limitations of their study that they focused on presenting an explainable model and did not focus on accuracy and robustness. Although our study did not focus on explainability, it achieved high success in all criteria such as accuracy, precision, and sensitivity.

3 Materials and methods

A detailed flowchart describing the CF-LoS method developed for NICU LoS Prediction and the entire application is presented in Fig. 1.

Fig. 1 NICU LoS prediction flowchart using the CF-LoS method

3.1 Dataset

The epicrisis reports of 512 babies hospitalized in the Neonatal Intensive Care Unit at the Private Eastern Anatolia Hospital in Elazig, Turkey, between 2021 and 2022 were retrospectively examined with the necessary ethical permissions. Among these neonates, the data of the babies who were transferred from the relevant unit were not included in the study because the exact LoS could not be known. Consequently, using the reports of 453 babies, 12 features of the baby and its mother were added to the dataset. These features encompassed both categorical and continuous data types. Table 1 contains the features, types, and definitions found in the dataset.

Table 1 Feature description of the dataset

The LoS of the patients in the dataset ranges from 1 to 83 days. More than half of the patients are discharged within the first 25 days. Details about the dataset are represented in Fig. 2 by boxplots.

Fig. 2 Dataset overview

3.2 CatBoost classifier

CatBoost Classifier is a gradient boosting-based machine learning algorithm developed specifically for datasets where categorical variables are effectively utilized. CatBoost offers a unique approach that enables the direct use of categorical features, significantly reducing preprocessing steps. Consequently, it delivers higher performance and easier usability compared to other traditional gradient boosting algorithms.

CatBoost builds symmetrical trees, thereby achieving a high prediction rate without the need to construct excessively deep trees, thus mitigating the risk of overfitting. Moreover, CatBoost employs its own regularization technique to prevent overfitting, considering the internal structure of discrete features. This approach enables the algorithm to achieve high levels of accuracy with low learning speed. Additionally, CatBoost enhances sample weight management through the utilization of category-specific statistical information, resulting in a more balanced modeling approach [34].

CatBoost Classifier is particularly effective on datasets abundant in categorical variables and characterized by complex relationships. Its proficiency in handling such data, along with its high performance, makes it a favored choice for applications demanding precise classification accuracy [35].

3.3 Voting classifier

Voting Classifier is a robust ensemble learning technique designed to achieve an optimal solution by combining various models. It integrates the insights and predictions from these individual models into a unified approach. This unified model makes decisions by favoring the most commonly agreed-upon choice among all the models for each output class [36].

This algorithm aggregates the results of each classifier within the Voting Classifier and considers the majority vote when determining the final output class. By combining predictions from multiple models, the voting classifier can incorporate diverse perspectives and benefit from the collective knowledge of the ensemble.

A notable advantage of employing a voting classifier lies in its resilience against errors or misclassifications from individual models. Should one model underperform in specific cases, it can benefit from the strong performance of other models. This ensemble approach helps reduce the influence of weaknesses in individual models, thereby enhancing the overall predictive accuracy and stability of the classifier.

Essentially, the Voting Classifier capitalizes on the principle that the combined decision-making of multiple models can surpass the performance of any single model. This leads to more precise and dependable predictions by leveraging the strengths of diverse models, offering a comprehensive and well-balanced predictive mechanism for addressing intricate classification problems [37].

A Hard Voting classifier creates an ensemble containing many different classifiers. Each classifier makes its own prediction based on the data. The hard voting classifier determines the final prediction by taking the majority of these predictions [38]. For example, if three classifiers choose class A, one chooses class B, and another chooses class C, the hard voting classifier will choose class A because it received the most votes.

Hard Voting is based on the majority of votes among classes: each classifier has equal weight and casts a single vote, and the class with the most votes wins. This ensemble technique aims to create a stronger and more reliable classifier by blending diverse insights from various models. In situations where one model may make errors, accurate predictions from other models can offset these mistakes. Figure 3 illustrates the Hard Voting method [39].
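The vote counting described above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the implementation used in the study; the function name `hard_vote` and the tie-breaking rule (first label encountered wins) are assumptions, since the text does not specify how ties are resolved.

```python
from collections import Counter

def hard_vote(predictions):
    """Return the class label chosen by the largest number of classifiers.

    `predictions` holds one label per classifier; every vote has equal weight.
    Ties go to the label seen first (an assumption made for this sketch).
    """
    counts = Counter(predictions)
    # most_common keeps first-seen order among labels with equal counts.
    return counts.most_common(1)[0][0]

# The example from the text: three votes for A, one for B, one for C.
votes = ["A", "A", "A", "B", "C"]
print(hard_vote(votes))
```

Because only the predicted labels are counted, two classifiers that are barely confident in a class outvote one classifier that is highly confident in another.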

Fig. 3 Hard voting [39]

The Soft Voting classifier is another ensemble learning method of the voting type. It utilizes the probability values provided by the different classifiers during classification [40].

Consider three distinct models, each providing class predictions for a given input. Soft voting is more effective when we want not only to determine which class the input belongs to but also to assess the reliability of this prediction.

Each model generates a probability distribution for every class. For instance, the probabilities for Class A are as follows: Model 1 (0.30), Model 2 (0.47), Model 3 (0.53). Similarly, for Class B, the probabilities are Model 1 (0.20), Model 2 (0.32), Model 3 (0.40).

Soft voting aggregates these predictions by considering the probability values. For instance, for Class A, the average probability is calculated as (0.30 + 0.47 + 0.53)/3 = 0.4333, and for Class B, it is calculated as (0.20 + 0.32 + 0.40)/3 = 0.3067.

Finally, according to the soft voting approach, the class with the highest average probability is selected as the final prediction. In this scenario, Class A is chosen as the final prediction class because it has a higher average probability [41].

This approach takes into account not only the prediction result of the classifiers but also the confidence levels of these predictions. Thus, it aims to obtain a more precise and reliable estimate. Figure 4 shows the Soft voting method.
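The averaging step in the worked example above can be reproduced with a short plain-Python sketch. The function name `soft_vote` and the dictionary layout are hypothetical choices made for illustration; only the two classes whose probabilities are listed in the text are included, so each model's values do not sum to 1.

```python
def soft_vote(model_probabilities):
    """Average per-class probabilities across models and pick the best class.

    `model_probabilities` is a list with one dict per model, each mapping a
    class label to that model's predicted probability for the label.
    """
    labels = list(model_probabilities[0])
    n_models = len(model_probabilities)
    averages = {
        label: sum(probs[label] for probs in model_probabilities) / n_models
        for label in labels
    }
    # The class with the highest average probability is the final prediction.
    winner = max(averages, key=averages.get)
    return winner, averages

# Probabilities for Class A and Class B taken from the example in the text.
model_outputs = [
    {"A": 0.30, "B": 0.20},  # Model 1
    {"A": 0.47, "B": 0.32},  # Model 2
    {"A": 0.53, "B": 0.40},  # Model 3
]
winner, averages = soft_vote(model_outputs)
print(winner, {label: round(value, 4) for label, value in averages.items()})
```

Rounded to four decimals, the averages match the values given in the text (0.4333 for Class A and 0.3067 for Class B), and Class A is selected.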

Fig. 4 Soft voting [39]

4 Application and results

The study was implemented in Python. The application was developed and the results were obtained on a workstation equipped with an i7 processor and a GeForce 1070 GPU with 8 GB of memory. In this study, we conducted Length of Hospital Stay (LoS) estimation through binary classification. Specifically, a LoS ranging from 1 to 25 days was categorized as ‘short-term’, while a LoS of 26 days or more was designated as ‘long-term.’ The workflow of the study is illustrated in Fig. 5.
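The labelling rule stated above (1 to 25 days is short-term, 26 days or more is long-term) can be written as a small helper. The sketch below is illustrative only; the names `SHORT_TERM_MAX_DAYS` and `label_los` are hypothetical and do not come from the study's code.

```python
SHORT_TERM_MAX_DAYS = 25  # threshold stated in the text

def label_los(days):
    """Map a length of stay in days to the binary class used in the study."""
    if days < 1:
        raise ValueError("length of stay must be at least 1 day")
    return "short-term" if days <= SHORT_TERM_MAX_DAYS else "long-term"

# A few stays within the 1 to 83 day range reported for the dataset.
for days in (1, 25, 26, 83):
    print(days, label_los(days))
```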

Fig. 5 The workflow of the study

The original dataset, provided by NICU experts with the necessary ethical permissions, underwent initial data preprocessing, including data cleaning and digitization. Following this stage, a training data ratio of 0.7 and a testing data ratio of 0.3 were established for each classifier. As a result of the training and testing processes using these ratios, the performance of each classifier was evaluated.
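A hold-out split with these ratios can be sketched as follows. This is a minimal standard-library illustration rather than the study's code; the function name `holdout_split`, the fixed seed, and the rounding of the test-set size are assumptions.

```python
import random

def holdout_split(n_samples, test_ratio=0.3, seed=0):
    """Shuffle sample indices and split them into training and test sets."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_test = round(n_samples * test_ratio)
    test_indices = indices[:n_test]
    train_indices = indices[n_test:]
    return train_indices, test_indices

# 453 patients, as in the dataset described in Sect. 3.1.
train_idx, test_idx = holdout_split(453, test_ratio=0.3)
print(len(train_idx), len(test_idx))
```

Each classifier would then be fitted on the training indices only and scored on the held-out test indices.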

To verify the consistency of the results, the k-fold value was determined as 5. Through this value, the performance of all classifiers was evaluated in more detail. For each classifier, the data were divided into 5 different parts, and the training and testing stages were repeated. Thus, the performance of each classifier was evaluated by the k-fold cross-validation method.
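The fold construction behind k-fold cross-validation can be sketched as below. The sketch assumes contiguous, unshuffled folds for simplicity; the text does not state whether the folds were shuffled or stratified, and the name `kfold_indices` is hypothetical.

```python
def kfold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) pairs for k contiguous folds.

    Every sample appears in exactly one test fold; the first
    n_samples % k folds receive one extra sample.
    """
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

folds = list(kfold_indices(453, k=5))
print([len(test) for _, test in folds])
```

A classifier is trained and evaluated once per fold, and the per-fold scores are averaged to obtain the cross-validated result.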

In the final stage, a Voting Classifier was used to optimally combine the obtained results. This approach aimed to achieve superior outcomes by consolidating predictions from various classifiers. Through this method, we strived to bolster our results by harnessing the collective performance of classifier combinations.

In our study, we used Logistic Regression (LR), ExtraTrees (ET), Random Forest (RF), K-Nearest Neighbor (KNN), SVC, AdaBoost, GradientBoosting (GB), XGBoost (XGB), and CatBoost algorithms based on machine learning as well as Voting Classifier technique. Table 2 shows the parameter settings for all algorithms used in this study.

Table 2 Parameter settings for the algorithms

In our study, the results obtained by initially setting the test size to 0.3 are presented in Table 3. These results encompass a range of metrics, including Accuracy, Precision, Recall, F1-Score, Kappa [42,43,44], MSE, RMSE, and MAE [45,46,47], which were obtained for both classical and ensemble classifiers.
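For reference, the metrics listed above can be computed from true and predicted labels as in the following sketch. It assumes labels coded as 0 and 1 with 1 as the positive class and uses the positive-class definitions of Precision, Recall, and F1-Score; the averaging scheme behind the reported tables is not stated in the text, so this is an assumption. The function name and the toy labels are illustrative and are not study data.

```python
import math

def binary_metrics(y_true, y_pred):
    """Compute common classification metrics for 0/1 labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    n = len(pairs)

    accuracy = (tp + tn) / n
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)

    # Cohen's kappa: observed agreement corrected for chance agreement.
    expected = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n ** 2
    kappa = (accuracy - expected) / (1 - expected) if expected != 1 else 0.0

    # Error metrics computed directly on the 0/1 labels.
    mse = sum((t - p) ** 2 for t, p in pairs) / n
    mae = sum(abs(t - p) for t, p in pairs) / n
    return {"accuracy": accuracy, "precision": precision, "recall": recall,
            "f1": f1, "kappa": kappa, "mse": mse,
            "rmse": math.sqrt(mse), "mae": mae}

# Toy labels for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]
print(binary_metrics(y_true, y_pred))
```

With 0/1 labels, MSE and MAE both equal the misclassification rate, that is, one minus the accuracy.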

Table 3 Performance results with test size = 0.3

As observed in Table 3, the Random Forest classifier consistently yields the best results, boasting an accuracy of 0.9779, precision of 0.9784, and recall of 0.9779. Conversely, the Logistic Regression classifier demonstrates the lowest performance among the classifiers. It is worth noting that the relatively new CatBoost classifier also exhibits promising results for this dataset, with an accuracy of 0.9632. In summary, Random Forest exhibits the highest overall performance, while CatBoost shows strong potential for this specific dataset.

In the subsequent stage, a cross-validation was conducted using a k-fold value of 5. The outcomes of this cross-validation are presented in Table 4.

Table 4 Results for k-fold = 5

When we examine the results obtained when the k-fold value is set to 5, it is seen that the Random Forest classifier exhibits the best performance. While the accuracy achieved with a test data ratio of 0.3 stands at 0.9779, it registers as 0.9425 with cross-validation using k-fold. Similarly, it is observed that the best results in other metrics are again obtained from the RF classifier. In contrast, the CatBoost classifier demonstrates a respectable accuracy of 0.9403, indicating its effectiveness as a valid classifier for this dataset. In summary, the Random Forest classifier shines prominently in terms of Accuracy, Precision, Recall, and F1-Score metrics, while CatBoost closely follows suit in performance.

In the final phase of the study, the Voting Classifier, an ensemble method discussed in Sect. 3.3, was employed. Voting Classifier utilizes two distinct methods: Hard Voting and Soft Voting. The outcomes derived from these methods are presented in Table 5.

Table 5 Voting classifier results

Upon reviewing Table 5, it becomes apparent that the accuracy achieved with the Soft Voting method is 0.9632, whereas the Hard Voting method yields an accuracy of 0.9558. These findings underscore the influence of various voting methods on performance. Notably, the Soft Voting method exhibits a slightly higher accuracy rate.

For precision, the highest value was achieved by the Hard Voting method at 0.9166, compared with 0.8888 for the Soft Voting method. Precision is the proportion of correct positive predictions among all positive predictions. In this instance, the Hard Voting method therefore has the higher precision.

When we examine the other metrics, it becomes evident that the Soft Voting method generally produces superior results. Soft Voting averages the class probabilities predicted by the base classifiers and selects the class with the highest average, whereas Hard Voting selects the most frequently predicted class label without considering confidence scores. Because it uses this additional confidence information, Soft Voting tends to deliver better overall performance here.
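The difference between the two voting rules can be shown with a small worked sketch for a binary problem: hard voting counts predicted class labels, while soft voting averages the predicted probabilities. The function names `hard_vote` and `soft_vote`, the 0.5 threshold, and the three probability values are hypothetical and chosen only so that the two rules disagree; they are not taken from the study.

```python
from collections import Counter


def hard_vote(positive_probs, threshold=0.5):
    """Majority vote over the class labels implied by each model's probability."""
    labels = [1 if p >= threshold else 0 for p in positive_probs]
    # With an odd number of models and two classes there can be no tie.
    return Counter(labels).most_common(1)[0][0]


def soft_vote(positive_probs, weights=None, threshold=0.5):
    """Threshold the (optionally weighted) average of the predicted probabilities."""
    if weights is None:
        weights = [1.0] * len(positive_probs)
    average = sum(w * p for w, p in zip(weights, positive_probs)) / sum(weights)
    return 1 if average >= threshold else 0


# Hypothetical probabilities of the positive class from three base models.
probs = [0.45, 0.45, 0.90]
hard_label = hard_vote(probs)  # two of three models vote for class 0
soft_label = soft_vote(probs)  # the average probability is about 0.6
print(hard_label, soft_label)
```

Here two weakly negative models outvote one confident positive model under hard voting, while soft voting follows the confident model. Libraries such as scikit-learn offer both rules through `VotingClassifier` with `voting="hard"` or `voting="soft"`.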

Furthermore, Table 5 also includes the value of Average CV Recall [48] calculated with a k-fold value of 5. Recall measures the ratio of true positives to all actual positives and thus indicates how many actual positive cases the classifier missed (false negatives). Upon reviewing these values, it becomes evident that the Soft Voting method yields the highest results.
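As a brief illustration of how precision, recall, and F1-Score relate to confusion-matrix counts, consider the sketch below. The function name `precision_recall_f1` and the counts are hypothetical and are not taken from the study's tables.

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0  # correct share of positive predictions
    recall = tp / (tp + fn) if (tp + fn) else 0.0     # share of actual positives that were found
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0  # harmonic mean of the two
    return precision, recall, f1


# Hypothetical counts: 30 true positives, 10 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=30, fp=10, fn=20)
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

False positives lower precision only, and false negatives lower recall only, which is why one voting method can lead on precision while the other leads on recall.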

In summary, Table 5 illustrates that the Soft Voting method generally yields superior results, while Hard Voting tends to offer higher precision. Moreover, Soft Voting exhibits greater success in terms of average recall value. In addition to the metrics presented in the table, the AUC value for Soft Voting was calculated to be 0.947.

The ROC curve and the Precision–Recall curve for Soft Voting are given below. Figure 6 displays the ROC curve for Soft Voting.

Fig. 6 Voting classifier with soft voting average precision

Figure 7 displays the Precision–Recall curve (PR curve) for Soft Voting, for which the average precision value is 0.95.

Fig. 7 Precision–recall curve (PR curve) for voting classifier with soft voting
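The two summary values read from these curves, AUC and average precision, can be computed from predicted scores as in the following minimal sketch. The function names `roc_auc` and `average_precision`, the labels, and the scores are hypothetical and only illustrate the definitions; the average precision routine assumes that no two scores are tied.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(y_true, scores):
    """Mean of the precision values at each positive, scanning scores from high to low.

    Assumes no tied scores.
    """
    ranked = sorted(zip(scores, y_true), reverse=True)
    hits, precisions = 0, []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions)


# Hypothetical labels and predicted scores for six samples.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, scores)
ap = average_precision(y_true, scores)
print(round(auc, 4), round(ap, 4))
```

Both values depend only on the ranking of the scores, not on a particular decision threshold, which is why they complement threshold-based metrics such as Accuracy and F1-Score.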

5 Discussion

For our LoS prediction study in the NICU, which is a classification application, we used Accuracy, Precision, Recall, F1-Score, Avg CV Recall, Standard Deviation of CV Recall, and AUC as performance metrics. Our study achieved an Accuracy of 0.96, and the Recall of 0.92, F1-Score of 0.90, and Standard Deviation of CV Recall of 0.11 reflect its strong performance. In addition, the AUC value of 0.95 supports the success of CF-LoS. Table 6 compares recent LoS estimation applications in different fields with the LoS estimation studies conducted for the NICU, in terms of method, performance criteria, data types used, dataset location, and size.

Table 6 Comparison of our results with the related studies in the literature

Considering Table 6, it can be seen that a standard has not yet been developed in LoS studies. Regarding this, Stone et al. [7] suggested using routinely collected hospital records in future LoS studies. They emphasized that this would be more suitable in terms of data uniformity and comparison opportunities. In our study, instead of non-routine instruments such as surveys, the information in routinely kept epicrisis reports was used to estimate LoS. This is a strength of our study.

When Table 6 is examined, it is seen that the diversity of metrics used in LoS estimation studies makes performance comparison difficult. Some LoS studies take the form of regression analysis and report performance measures such as R², EV, RMSE, and MAE. In applications that estimate LoS as a classification problem, as in our study, metrics such as Accuracy, AUC, Precision, Recall, and F1-Score are generally preferred. This is a limitation that reduces the possibility of comparison [49]. In contrast, Bacchi et al. [28], Kim et al. [29], Alabbad et al. [31], Saravi et al. [32], Feldman and Chawla [12], Thompson et al. [6], and Lin et al. [14] presented their results with Accuracy and AUC metrics. Among them, the best results in terms of Accuracy and AUC were obtained by CF-LoS. Only Feldman and Chawla [12] presented their results in the form of Avg AUC. In that study, the AUC value was calculated separately for each class, so it is not fully suitable for comparing classification performance. The accuracy closest to that of our study, 94.16%, was obtained with the method suggested by Alabbad et al. [31]. Similarly, the closest result in terms of AUC, 0.93, was achieved with the method proposed by Kim et al. [29].

A limitation of the study of Zernikow et al. [11] is its low success rate in predicting LoS of 140 days and above. Feldman and Chawla’s [12] study, which utilized a large dataset, may not accurately reflect the model’s success, making its application in other locations challenging. Thompson et al. [6] noted that their model may be difficult to apply uniformly across different states due to the lack of a standardized dataset. Lin et al. [14] identified several limitations in their study, including variations in discharge criteria among hospitals in the dataset and the presence of an overfitting problem due to using all patients for model training. Additionally, they mentioned the inability to obtain some features as another limitation. Juraev et al. [15] outlined several limitations in their study, such as the high cost associated with using numerous features, reliance on classical machine learning models instead of more effective deep learning models, and less focus on robustness and fairness while prioritizing explainability.

One limitation that emerged in interdisciplinary studies was the challenge of collecting professional data. To address this, we collaborated closely with healthcare professionals to accurately transfer data from epicrisis reports to dataset tables. Another limitation of our study is the presence of missing data in the epicrisis, resulting in a reduction in the available dataset for analysis and thereby affecting model performance. However, to address this challenge, we enhanced model accuracy by employing various classifiers. Furthermore, we identified the classifier with the highest accuracy among the candidates by utilizing the Voting Classifier.

The Voting Classifier is an ensemble learning technique used in the field of machine learning. This technique combines predictions from multiple different algorithms, making the final prediction more robust and stable. It has several advantages. First, using different algorithms enhances model diversity and improves generalization ability. Combining multiple models can also reduce the risk of overfitting, since the errors of individual models tend to partly offset one another. Additionally, combining different algorithms allows for better adaptation to the dataset, resulting in a broader perspective. Thus, the Voting Classifier is a powerful tool for solving complex problems, often yielding better performance and more reliable results than a single algorithm alone. Soft and hard voting are different types of Voting Classifiers, and each has its advantages.

Hard voting takes the class predicted by each base model and selects the class with the most votes, so the final result is determined by a simple majority vote. The advantage of hard voting is that it needs only the class labels predicted by the base models to make the final decision. It therefore has a lower computational cost and is faster to apply when making the classification decision.

Soft voting, on the other hand, takes the class probabilities predicted by the base models and makes the final prediction by averaging these probabilities, optionally with weights. By taking into account the confidence level of each base model, this method generally provides better performance and a more balanced, robust decision-making process. In particular, when one of the base models is more reliable than the others, this information can be better used through soft voting. However, the computational cost of soft voting is slightly higher than that of hard voting because probability values must be calculated.

As a result, the use of the Voting Classifier, in addition to the advantages listed above, also increased the model accuracy in the study.

6 Conclusion

The objective of this study is to predict the LoS for neonates in the NICU by using machine learning algorithms to analyze neonatal medical data. The aim is to provide valuable insights about the LoS to the relatives of the patient, NICU staff, and hospital management.

In our study, we retrospectively examined the epicrisis reports of 512 patients admitted to the NICU, following ethical guidelines. Patients who were transferred from the NICU to another hospital were excluded from the dataset. The prenatal and clinical data of the remaining 453 neonates formed our original dataset. CF-LoS performs binary classification to predict long- and short-term LoS using various machine learning algorithms such as LR, ET, RF, KNN, SVC, AdaBoost, GB, XGB, and CatBoost. To enhance classifier performance, we introduced a hybrid approach by incorporating the Voting Classifier technique. Our findings revealed the RF classifier’s superior accuracy, precision, and recall, indicating that the RF model can accurately classify neonatal medical data. The CatBoost classifier also performed strongly, with a high accuracy rate. Comparing the hard and soft voting methods, soft voting exhibited superior accuracy, while hard voting achieved higher precision. This indicates that soft voting excels in overall accuracy, whereas hard voting produces a smaller share of false positives among its positive predictions. As a result, CF-LoS, leveraging machine learning for neonatal medical data analysis, achieved noteworthy accuracy, precision, and recall.

The obtained results have enabled us to predict the timing of incubator availability in the Neonatal Intensive Care Unit (NICU), a facility not uniformly present in every city in Turkey and constrained by a limited incubator capacity. This information is crucial for the unit manager in making decisions regarding the acceptance of new patients. With the LoS estimation, the incubators of patients expected to stay for a long time can be anticipated to remain occupied, so new patients cannot be admitted to them for an extended period. Conversely, a new patient may be admitted in place of a baby whose stay is expected to be short. Thus, effective utilization of hospital resources is facilitated. Moreover, the LoS estimation contributes to healthcare coordination. Hospital administrators can use these estimates to determine the required number of doctors, nurses, and allied health personnel for a given monthly period, facilitating task allocation. This information also aids in planning medical supplies based on anticipated patient numbers. Additionally, managers can make informed budgetary decisions aligned with the projected patient load. Lastly, responsible doctors could provide more concrete answers to parents’ inquiries about the expected LoS of their infants in the NICU, thereby fostering clearer communication and understanding between healthcare providers and families. Consequently, CF-LoS can provide reliable guidance in making important decisions regarding neonatal health and optimizing treatment processes.

In our study, we experienced difficulties in converting the categorical and numerical data available in the epicrisis notes into regular dataset tables, due to missing data and occasional inconsistent naming of the same categorical data. These limitations were overcome by seeking assistance from expert personnel during the organization of the data. Another limitation was that we had to choose classification instead of regression analysis because our original dataset contained data from a limited number of babies. Furthermore, the diversity of metrics used in related studies in the literature made performance comparison difficult. The use of age and gender criteria to support the generalizability of the study is significant in establishing standards for future research. However, although we aimed to include disease diagnosis features, doing so reduced accuracy because the data were limited relative to the large number of disease types. Thus, the exclusion of diagnostic criteria from our study constitutes another limitation. To summarize, the primary limitations of this study stem from the fact that the algorithms can only make inferences based on their training data, which highlights the importance of a larger dataset with accurate, complete, and representative data.

In the future, conducting a deep-learning study to predict mortality in newborns would be of interest. Additionally, future studies could focus on improving data quality to enhance algorithm performance and achieve accurate results. This can be achieved through meticulous management of the model selection and training process. At this stage, to ensure valuable inferences with high accuracy, it is advisable for hospital managers to ensure the professional maintenance of patient records, making them more accessible for analysis by artificial intelligence experts. It is also recommended to provide healthcare professionals with the opportunity to assess the clinical impact of artificial intelligence studies, thereby facilitating the refinement and development of these studies as needed.