Introduction

In recent years, there has been growing recognition of the profound impact that adverse birth outcomes can have on the long-term health and well-being of individuals, spanning from infancy to adulthood [1, 2]. Adverse birth outcomes, including preterm birth, stillbirth, macrosomia, and low birth weight, occur during the period of fetal development and the early weeks of life [3].

While considerable progress has been made in reducing adverse birth outcomes globally, substantial disparities persist between developing and developed countries, with Sub-Saharan Africa bearing a disproportionate burden [4,5,6,7]. Approximately 30 million low-birth-weight babies are born worldwide each year, accounting for around 23.4% of all births [8, 9]. Low birth weight poses various health risks for the baby, including fetal and neonatal mortality, morbidity, and impaired growth and cognitive development, as well as chronic diseases later in life [10].

Sub-Saharan Africa and Southeast Asia were the hotspots of low birth weight, together accounting for three-quarters of the global burden in 2020; Sub-Saharan Africa alone accounted for 27.1% [9]. Both low birth weight and macrosomia (high birth weight) are critical indicators of infant health. According to the DHS report, the pooled prevalence of macrosomia in this region was 8.33% [11].

Prematurity is also a major cause of neonatal mortality worldwide: approximately 15 million babies are born preterm each year, around 11% of all deliveries globally [12], and prematurity accounts for 35% of all newborn deaths [7]. According to recent estimates, about 65% of preterm births occur in Southern Asia and Sub-Saharan Africa [6]. In 2019, an estimated 2.0 million babies were stillborn worldwide, a global rate of approximately 13.9 stillbirths per 1,000 total births. Sub-Saharan Africa was the most affected region, accounting for 43.6% of the worldwide burden of stillbirths [13].

The health sector’s main focus in the Sustainable Development Goals (SDGs) is reducing infant and child mortality, which is closely linked to adverse birth outcomes [14]. Alleviating adverse birth outcomes is therefore vital for tackling infant and childhood mortality and, in turn, achieving the SDGs. As part of the SDGs, various stakeholders have undertaken strategies to reduce adverse birth outcomes such as low birth weight, stillbirth, and preterm birth [15]. Despite these initiatives, adverse birth outcomes remain a significant challenge in Sub-Saharan Africa, with an overall pooled prevalence of 29.7% based on previous DHS data [16].

Previous literature highlighted several possible factors for adverse birth outcomes, including the place of residence [17,18,19,20,21], maternal education [3, 16, 19, 20, 22, 23], mode of delivery [22], substance use [22], distance to healthcare facilities [16], gender of the child [3, 16, 17, 19, 24, 25], maternal age [3, 18, 26,27,28,29], religion [20], number of children [24], wealth index [16, 19,20,21,22, 25], parity [16,17,18,19, 22, 27, 29], contraceptive use history [30], marital status [3, 21, 22], husband’s education level [3, 24], being anemic [31], iron supplementation [20, 24, 31], type of pregnancy [3, 16, 21, 24], number of ANC visits [16, 18, 23, 25, 26, 30, 32], birth interval [3, 21, 30, 33], unwanted pregnancy [22], place of delivery [3, 20, 29], and geographic region [19, 22, 34].

While previous studies have examined the factors influencing adverse birth outcomes using DHS data from various countries in Sub-Saharan Africa [16, 28, 33, 35], a deeper understanding requires the use of advanced machine learning algorithms and data science techniques. This approach enables the discovery of hidden patterns and relationships that may not be easily identified through traditional statistical methods. Additionally, our study’s use of recent DHS data allows us to capture the most up-to-date trends and factors influencing adverse birth outcomes in the region.

Accordingly, we proposed a study aimed at developing a predictive model for adverse birth outcomes among childbearing women in Sub-Saharan Africa using advanced machine learning techniques. Additionally, this study aimed to employ novel data science interpretability techniques to identify the key risk factors and quantify the impact of each feature on the model prediction.

Method

Data source

Secondary data from the most recent DHS, which was carried out in 26 countries in Sub-Saharan Africa, were used in this analysis. The following are the nations and corresponding survey years: Angola (2015–2016), Benin (2017–2018), Burkina Faso (2021), Burundi (2016–2017), Cameroon (2018), Ivory Coast (2021), Ethiopia (2016), Gabon (2019–2021), Gambia (2019–2020), Ghana (2022), Guinea (2018), Kenya (2022), Liberia (2019–2020), Madagascar (2021), Malawi (2015–2016), Mali (2018), Mauritania (2019–2021), Nigeria (2018), Rwanda (2019–2020), Senegal (2019), Sierra Leone (2019), South Africa (2016), Tanzania (2022), Uganda (2016), Zambia (2018), and Zimbabwe (2015). The data for this study were obtained from the official database of the DHS program, which can be accessed at the following URL: https://dhsprogram.com/data/available-datasets.cfm.

Standardized cross-sectional surveys were conducted by the DHS Program in more than 90 countries to collect thorough and representative data on nutrition, HIV, health, and population. A multi-stage stratified sampling approach was used in these surveys to choose respondents from households within specific clusters. To select the enumeration areas, sampling strata based on the urban and rural sectors were created, and probability proportional to size was used in the process. Households were selected using an equal-probability systematic sampling procedure within the designated enumeration areas. Detailed information regarding the sampling procedure and methodology employed in the survey can be accessed on the DHS website and their provided guidelines [36].

The study focused on women of childbearing age (15–49 years) in Sub-Saharan African countries who had given birth within the five years prior to data collection. The analysis encompassed a sample of 139,659 women who met the specified criteria across the 26 countries included in the study. The dataset used in this study contained 28 features that were considered during the analysis.

Study variables and measurements

The main outcome measure was adverse birth outcome, defined as the occurrence of one or more of the following conditions in the most recent pregnancy: low birth weight, macrosomia, preterm birth (gestational age of fewer than 37 weeks), or stillbirth. This definition was based on the items available in the DHS. We combined these four conditions into a single binary variable, coded 1 if any adverse birth outcome was reported and 0 if none was recorded. A previous study provides detailed information on how this variable was recoded from the DHS dataset [33].
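As an illustration of this recoding, a minimal pandas sketch might look as follows; the indicator names are ours for illustration, not actual DHS variable codes.

```python
import pandas as pd

# Hypothetical recoded indicators for the most recent birth
# (illustrative names, not actual DHS variable codes).
df = pd.DataFrame({
    "low_birth_weight": [0, 1, 0],
    "macrosomia":       [0, 0, 0],
    "preterm":          [1, 0, 0],  # gestational age < 37 weeks
    "stillbirth":       [0, 0, 0],
})

# Code 1 if any adverse condition was reported, 0 otherwise.
df["adverse_birth_outcome"] = (
    df[["low_birth_weight", "macrosomia", "preterm", "stillbirth"]]
    .any(axis=1)
    .astype(int)
)
print(df["adverse_birth_outcome"].tolist())  # [1, 1, 0]
```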

Following a review of the literature, the study looked at several independent variables. The variables that were included in this study were: the place of residence [17,18,19,20,21], maternal education [3, 16, 19, 20, 22, 23], gender of the household head [16, 22], ownership of a mobile phone [17,18,19,20,21], mode of delivery (cesarean section or vaginal) [22], smoking [22], distance to healthcare facilities [16], maternal employment status [16, 22], gender of the child [3, 16, 17, 19, 24, 25], maternal age [3, 18, 26,27,28,29], religion [20], number of children in the family [24], wealth index [16, 19,20,21,22, 25], parity (number of previous pregnancies) [16,17,18,19, 22, 27, 29], contraceptive use history [30], marital status [3, 20, 29, 30], husband’s education level [3, 20, 24, 29], iron supplementation [20, 24, 31], type of pregnancy (single or twin) [3, 16, 21, 24], number of ANC visits [16, 18, 23, 25, 26, 30, 32], birth interval [3, 21, 30, 33], desired pregnancy status (wanted or unwanted pregnancy) [22], first ANC check-up (early or late) [16, 22], place of delivery [3, 20, 29], and media exposure [20, 24, 30, 31].

The birth interval variable was recoded into short (less than 24 months), normal (24 to 59 months), and long (greater than 59 months) categories based on the World Health Organization (WHO)’s guidelines for inter-pregnancy interval [37]. Regarding the timing of first ANC check-up, mothers who started their first ANC visit during the first trimester were considered as early ANC check-ups, and those who started in the second and third trimesters were considered late ANC check-ups [38].
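A minimal pandas sketch of these two recodes, assuming illustrative column names for the interval in months and the month of the first ANC visit:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"birth_interval_months": [18, 30, 72],
                   "anc_first_visit_month": [2, 5, 7]})

# Short: < 24 months; normal: 24-59 months; long: > 59 months.
df["birth_interval_cat"] = pd.cut(
    df["birth_interval_months"],
    bins=[0, 23, 59, np.inf],
    labels=["short", "normal", "long"],
)

# First ANC visit within the first trimester (months 1-3) counts as early.
df["anc_timing"] = np.where(df["anc_first_visit_month"] <= 3, "early", "late")
print(df)
```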

In addition, community-level variables such as country and region (including sub-regions of Sub-Saharan Africa such as West, East, North, Central, and South Africa) were included after repeated experimentation to assess the potential significance of those features in the study. The inclusion of community-level variables, such as country and region, was informed by a review of previous literature [19, 22, 34].

To further assess the importance of these variables, we conducted a series of experiments: we included the community-level variables one by one and evaluated the performance of the different machine learning algorithms to determine the impact of these factors on the prediction of adverse birth outcomes. By systematically testing the contribution of the community-level variables, we were able to build a more comprehensive and robust predictive model that accounted for both individual-level and contextual factors.

Data preprocessing

Data pre-processing, the first step in a machine learning workflow, prepares the data to ensure it is suitable for analysis [39]. For our models, we followed a continuous-improvement methodology in our machine learning workflow, carrying out a number of activities: feature selection, data balancing, model selection, training, evaluation, parameter optimization via cross-validation, final model selection, and deployment to forecast adverse birth outcomes [40].

Throughout this iterative approach, we continuously improved our models to achieve better outcomes. While Fig. 1 shows the high-level workflow phases visually, it may not capture all the repetitive tasks we completed as part of this process. Beyond the activities depicted in the figure, we engaged in numerous additional iterative steps to systematically refine our models.

Fig. 1 Study workflow diagram

The iterative tasks included trying different feature engineering methods to enhance the model inputs, experimenting with various data balancing techniques to address class imbalances, evaluating the performance of different kernel functions for the support vector machine (SVM) model, exploring various model tuning and hyperparameter optimization approaches, and assessing the impact of different n-fold cross-validation schemes. By continuously iterating and improving our models through this cyclical process, we were able to enhance the accuracy of our adverse birth outcome forecasts.

Model evaluation, in which the trained model is assessed on held-out test data to obtain an unbiased estimate of its performance, was conducted before model optimization. We also evaluated the models after optimization in order to compare the optimized models and select the final best model.

Data cleaning

Throughout the data analysis process, we used a rigorous approach to ensure the quality and dependability of the dataset. We began with a comprehensive manual audit, guided by the DHS guideline [36], to identify extraneous data entries and outliers. We identified a few outliers and removed them after assessing their impact through sensitivity analysis [41, 42].

Our analysis showed that the missing rate was under 10% for each variable used in this study. To deal with these missing values, we employed K-Nearest Neighbors (KNN) imputation, a popular method that fills in missing values using information from neighboring data points. We chose KNN imputation because it can handle a wide range of data formats, incorporate contextual information, preserve the dataset’s structure, and resolve missing values consistently [43, 44].
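The paper does not name the implementation; a minimal sketch using scikit-learn's KNNImputer, with illustrative column names, could look like this:

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "maternal_age": [24, 31, np.nan, 19, 27],
    "anc_visits":   [4, np.nan, 2, 1, 8],
    "parity":       [1, 3, 2, np.nan, 4],
})

# Each missing entry is replaced by the average of that feature over the
# k nearest rows, with distances computed on the observed values.
imputer = KNNImputer(n_neighbors=3)
imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(imputed)
```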

Additionally, we used the correlation matrix to assess multicollinearity, considering a correlation between two variables high if it exceeded 0.8 [45, 46]. No pair of variables in our data crossed this threshold, indicating no multicollinearity.
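A sketch of this multicollinearity screen, flagging any feature pair whose absolute Pearson correlation exceeds the 0.8 cut-off:

```python
import pandas as pd

def high_correlation_pairs(df: pd.DataFrame, threshold: float = 0.8):
    """Return (feature_a, feature_b, r) for all pairs with |r| > threshold."""
    corr = df.corr(numeric_only=True).abs()
    cols = corr.columns
    return [
        (cols[i], cols[j], corr.iloc[i, j])
        for i in range(len(cols))
        for j in range(i + 1, len(cols))
        if corr.iloc[i, j] > threshold
    ]

# In this study the returned list would be empty, since no pair of
# variables crossed the 0.8 threshold.
```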

Feature engineering

Feature engineering is the process of carefully identifying, extracting, and transforming pertinent features from the data in order to build precise and useful machine learning models [47]. We applied different encoding schemes to different types of categorical variables: one-hot encoding for nominal variables and label encoding for ordinal variables [48].

The rationale for choosing these encoding methods was based on the nature of the categorical variables in the dataset and the common best practices in the field of machine learning. One-hot encoding and label encoding are well-established techniques that are often the starting point for handling categorical variables, as they are straightforward to implement and can provide meaningful representations for the machine learning models [49, 50].
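A short sketch of the two encoding schemes with pandas, assuming hypothetical nominal (place of residence) and ordinal (wealth index) columns:

```python
import pandas as pd

df = pd.DataFrame({
    "residence":    ["urban", "rural", "rural"],   # nominal
    "wealth_index": ["poor", "middle", "rich"],    # ordinal
})

# One-hot encoding for the nominal variable: one binary column per level.
df = pd.get_dummies(df, columns=["residence"])

# Label (ordinal) encoding that preserves the natural category order.
wealth_order = {"poor": 0, "middle": 1, "rich": 2}
df["wealth_index"] = df["wealth_index"].map(wealth_order)
print(df)
```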

Dimensionality reduction

We used multiple dimensionality reduction techniques to improve model performance, including Boruta-based feature selection, principal component analysis (PCA), lasso regression, random forest feature selection, recursive feature elimination (RFE), and univariate selection. After rigorous testing and comparison of the outcomes across these techniques, we found that RFE and random forest-based feature selection were the most effective strategies in terms of accuracy and robustness.

In our experimentation, we systematically evaluated the performance of various machine learning algorithms before and after applying each feature selection technique. We also compared the performance of the machine learning algorithms across each feature selection technique. This allowed us to determine the optimal dimensionality reduction strategy based on the resulting model improvements.

RFE is an iterative technique that methodically eliminates features from the dataset based on their significance or contribution to the model’s performance: the model is trained on all features, and the least important ones are removed step by step until a predetermined number of features or level of performance is reached [51]. Random forest-based feature selection, on the other hand, identifies the most relevant features by ranking them according to the importance scores computed by a random forest.
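A sketch of these two strategies using scikit-learn, with synthetic data standing in for the DHS features:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = make_classification(n_samples=500, n_features=10, random_state=42)

# RFE: repeatedly refit and drop the weakest feature until the
# requested number of features remains.
rfe = RFE(RandomForestClassifier(random_state=42), n_features_to_select=8)
rfe.fit(X, y)
print("Kept by RFE:", rfe.support_)

# Random forest-based selection: rank features by impurity importance.
rf = RandomForestClassifier(random_state=42).fit(X, y)
print("Importances:", rf.feature_importances_.round(3))
```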

Using these methods, we removed two features: smoking and the number of children in the family. Both RFE and random forest-based feature selection ranked these as the least important factors.

After removing the two variables, the performance of each machine learning algorithm showed a slight improvement compared to applying other feature selection techniques or no dimensionality reduction at all. For instance, the performance of random forest improved from an accuracy of 83.0% and AUC of 92.0% before feature selection, to an accuracy of 87.0% and AUC of 95.0% after removing the two variables using RFE and random forest-based selection. Additional Material 1 provides a detailed performance comparison of each algorithm before and after feature selection.

Data balancing

In our study, we compared several data balancing techniques to address the class imbalance in our dataset. Six strategies were evaluated: the NearMiss algorithm, SMOTE with edited nearest neighbors (SMOTE-ENN), Adaptive Synthetic sampling (ADASYN), random under-sampling, random over-sampling, and the synthetic minority over-sampling technique (SMOTE).

We carried out a comprehensive analysis of model performance, taking several performance metrics into account, to identify the best data balancing method. We trained the machine learning algorithms on data balanced with each approach and compared their performance, and through this meticulous process we determined the best strategy for resolving the class imbalance in our dataset [52, 53].
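The paper does not name the implementation; a sketch assuming the imbalanced-learn (imblearn) package, which provides all six samplers, might look like this:

```python
from collections import Counter

from imblearn.combine import SMOTEENN
from imblearn.over_sampling import ADASYN, SMOTE, RandomOverSampler
from imblearn.under_sampling import NearMiss, RandomUnderSampler
from sklearn.datasets import make_classification

# Imbalanced toy data standing in for the outcome distribution.
X, y = make_classification(n_samples=2000, weights=[0.7, 0.3],
                           random_state=42)

samplers = {
    "SMOTE":        SMOTE(random_state=42),
    "SMOTE-ENN":    SMOTEENN(random_state=42),
    "ADASYN":       ADASYN(random_state=42),
    "NearMiss":     NearMiss(),
    "Random over":  RandomOverSampler(random_state=42),
    "Random under": RandomUnderSampler(random_state=42),
}

for name, sampler in samplers.items():
    X_res, y_res = sampler.fit_resample(X, y)
    print(f"{name:12s} class counts after resampling: {Counter(y_res)}")
```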

Model selection and development

We selected ten advanced machine learning models to predict adverse birth outcomes. Jupyter Notebook and scikit-learn version 1.3.2 were used in this study. These machine learning algorithms were selected because they are suited to classification tasks and aligned well with the features of our dataset [54,55,56].

We used the following algorithms: AdaBoost, eXtreme Gradient Boosting (XGBoost), CatBoost, Gaussian Naive Bayes (GNB), logistic regression, decision tree, random forest, SVM, KNN, and a feedforward neural network (FFNN). For detailed descriptions of each machine learning algorithm, see Additional Material 2.
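As a sketch, the ten classifiers might be instantiated as follows. XGBoost and CatBoost ship in their own packages outside scikit-learn, and we assume scikit-learn's MLPClassifier as the FFNN, since the paper does not name that implementation:

```python
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier        # separate xgboost package
from catboost import CatBoostClassifier  # separate catboost package

models = {
    "AdaBoost":            AdaBoostClassifier(),
    "XGBoost":             XGBClassifier(),
    "CatBoost":            CatBoostClassifier(verbose=0),
    "GNB":                 GaussianNB(),
    "Logistic regression": LogisticRegression(max_iter=1000),
    "Decision tree":       DecisionTreeClassifier(),
    "Random forest":       RandomForestClassifier(),
    "SVM":                 SVC(probability=True),
    "KNN":                 KNeighborsClassifier(),
    # MLPClassifier is one way to realize a feedforward neural network
    # in scikit-learn; the paper does not name the FFNN implementation.
    "FFNN":                MLPClassifier(max_iter=500),
}
```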

Model training and evaluation

We split the data into two sets: a training set of 80% (111,727 cases) and a testing set of 20% (27,932 cases). The choice of an 80/20 split was based on careful consideration of the characteristics of our dataset, the complexity of the models employed, and the specific requirements of our project. To validate the appropriateness of this split, we also compared the performance of each algorithm under a 70/30 split. The results showed comparably good performance, and we considered the 80/20 split the most suitable for our study [57, 58].
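A minimal sketch of this split with scikit-learn; stratifying on the outcome is our assumption, not stated in the paper:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Toy stand-in for the 139,659-row feature matrix and outcome vector.
X, y = make_classification(n_samples=1000, random_state=42)

# The 80/20 split described above; stratify=y (our assumption) keeps
# the adverse-outcome rate identical in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)
```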

We employed a variety of metrics, including accuracy, precision, recall/sensitivity, F1-score, specificity, and AUC, to evaluate each predictive model’s performance; a computation sketch follows the definitions below.

1. Accuracy: the proportion of all predictions that are correct, calculated by dividing the number of correct predictions by the total number of predictions [59].

2. Precision: how reliable the model’s positive predictions are; the proportion of predicted positives that are actually positive [59]. Precision = TP / (TP + FP), where TP stands for true positives and FP for false positives.

3. Recall: the model’s capacity to detect every positive instance, also referred to as sensitivity or the true positive rate; the proportion of actual positives that are correctly predicted [59]. Recall = TP / (TP + FN), where FN stands for false negatives.

4. F1-score: the harmonic mean of precision and recall, providing a balanced assessment of a model’s performance [59]. F1-score = 2 × (Precision × Recall) / (Precision + Recall).

5. Specificity: the proportion of actual negatives that are correctly predicted, indicating how well the model detects negative instances. Specificity = TN / (TN + FP), where TN stands for true negatives.

6. AUC: the area under the receiver operating characteristic (ROC) curve, obtained by plotting the true positive rate against the false positive rate at different classification thresholds. A higher AUC value indicates a stronger capacity to discriminate between classes [59, 60].
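These metrics can be computed directly with scikit-learn; a brief sketch on toy predictions (specificity is derived from the confusion matrix, since scikit-learn has no dedicated function for it):

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

y_true = [0, 1, 1, 0, 1, 0, 1, 1]                   # actual labels
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]                   # predicted labels
y_prob = [0.2, 0.9, 0.4, 0.1, 0.8, 0.6, 0.7, 0.95]  # P(outcome = 1)

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Accuracy:   ", accuracy_score(y_true, y_pred))
print("Precision:  ", precision_score(y_true, y_pred))
print("Recall:     ", recall_score(y_true, y_pred))
print("F1-score:   ", f1_score(y_true, y_pred))
print("Specificity:", tn / (tn + fp))  # no built-in sklearn function
print("AUC:        ", roc_auc_score(y_true, y_prob))
```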

Model optimization, cross-validation, calibration, and kernel techniques

Model optimization

To maximize the performance of the model, we conducted a comprehensive analysis of the hyperparameters using three common tuning approaches: grid search, random search, and Bayesian optimization [61, 62]. For the grid search, we defined a parameter grid that included key hyperparameters such as learning rate, number of estimators, max depth, and min samples split. We systematically tested a range of values for each hyperparameter and performed an exhaustive evaluation of all possible combinations to identify the optimal configuration.

The random search approach involved randomly sampling hyperparameter values from predefined distributions, allowing us to explore a larger portion of the hyperparameter space in a more efficient manner.

In addition, we leveraged Bayesian optimization, which models the objective function (model performance) as a Gaussian process and intelligently selects the next hyperparameter values to test based on the observed performance so far.

After comparing the results from these three optimization techniques, we found that the grid search approach yielded the best overall performance for our specific dataset and problem. Therefore, we selected the grid search as the primary method for hyperparameter tuning.
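A sketch of the grid and random searches with scikit-learn; the grid values are illustrative, not the exact ones tuned in the study:

```python
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Grid search: exhaustively evaluate every combination in the grid.
param_grid = {
    "n_estimators":      [100, 300, 500],
    "max_depth":         [None, 10, 20],
    "min_samples_split": [2, 5, 10],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid,
                    cv=10, scoring="roc_auc", n_jobs=-1)
# grid.fit(X_train, y_train); grid.best_params_ gives the winning combination.

# Random search: sample candidate settings from distributions.
param_dist = {"n_estimators": randint(100, 600), "max_depth": randint(5, 30)}
rand = RandomizedSearchCV(RandomForestClassifier(random_state=42), param_dist,
                          n_iter=20, cv=10, scoring="roc_auc", random_state=42)

# Bayesian optimization is not part of scikit-learn; packages such as
# scikit-optimize provide a BayesSearchCV with a similar interface.
```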

Cross-validation

To ensure a thorough and reliable evaluation of the model’s performance, we employed three common cross-validation techniques: 3-fold, 5-fold, and 10-fold cross-validation [63].

In the 3-fold cross-validation, we divided the training data into three equal-sized partitions, trained the model on two partitions, and evaluated it on the held-out third partition. We repeated this process three times, using a different partition for the test set each time.

The 5-fold and 10-fold cross-validation followed a similar logic but with five and ten partitions, respectively. By using these varying cross-validation approaches, we were able to obtain more stable and reliable performance estimates, as the smaller the number of folds, the higher the variance in the performance metrics.

After analyzing the results from the different cross-validation methods, we determined that the 10-fold cross-validation provided the most accurate and consistent evaluation of the model’s performance for our specific dataset.
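The comparison of fold counts can be sketched with scikit-learn's cross_val_score; the model and scoring below mirror the study's setup on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X_train, y_train = make_classification(n_samples=1000, random_state=42)
model = RandomForestClassifier(random_state=42)

# Compare 3-, 5-, and 10-fold cross-validation on the training data.
for k in (3, 5, 10):
    scores = cross_val_score(model, X_train, y_train, cv=k,
                             scoring="roc_auc")
    print(f"{k:2d}-fold: mean AUC = {scores.mean():.3f} "
          f"(std = {scores.std():.3f})")
```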

Model calibration

To improve the model’s accuracy and reliability, we conducted model calibration using isotonic regression [64]. Isotonic regression is a non-parametric technique that adjusts the model’s output probabilities to better align with the true likelihood of the predicted outcomes.

The calibration process involved fitting an isotonic regression model to the model’s output probabilities and the true class labels. This allowed us to transform the original probabilities in a way that minimized the discrepancy between the predicted probabilities and the actual probabilities of the target classes.
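One plausible implementation of this step is scikit-learn's CalibratedClassifierCV, sketched here under our own assumptions about the base model:

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=42)

# method="isotonic" fits an isotonic regression that maps the raw
# predicted probabilities onto better-calibrated ones, using
# cross-validation on the training data.
calibrated = CalibratedClassifierCV(RandomForestClassifier(random_state=42),
                                    method="isotonic", cv=5)
calibrated.fit(X_tr, y_tr)
calibrated_probs = calibrated.predict_proba(X_te)[:, 1]
```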

By calibrating the model’s output probabilities, we were able to enhance the reliability of the model’s predictions and ensure that the reported probabilities accurately represented the likelihood of the outcomes.

Kernel approaches for SVM

In addition to the other models explored, we also investigated the impact of different kernel functions on the performance of the Support Vector Machine (SVM) algorithm. We evaluated five common kernel types: linear, polynomial, radial basis function (RBF), sigmoid, and Gaussian [65].

By comparing the performance of SVM models with these various kernel types, we were able to identify the optimal choice for our specific problem and dataset. The polynomial kernel emerged as the top performer, allowing for non-linear decision boundaries by mapping the input features to a higher-dimensional space using a polynomial function [66].
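A sketch of the kernel comparison with scikit-learn's SVC; note that scikit-learn exposes the Gaussian kernel as "rbf", so a separately parameterized Gaussian kernel would require a custom kernel callable, and this sketch covers only the four built-in kernels:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=42)

# Cross-validated accuracy for each built-in kernel.
for kernel in ("linear", "poly", "rbf", "sigmoid"):
    svm = SVC(kernel=kernel, degree=3)  # degree applies to "poly" only
    acc = cross_val_score(svm, X, y, cv=5, scoring="accuracy").mean()
    print(f"{kernel:8s} mean accuracy = {acc:.3f}")
```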

To quantify the performance of the different kernel functions, we analyzed accuracy and AUC, as well as other relevant metrics. The linear kernel achieved an accuracy of 71.0% and an AUC of 0.76. The polynomial kernel provided the highest accuracy at 76.0% and an AUC of 0.83, outperforming the other kernel functions. The RBF kernel had an accuracy of 75.0% and an AUC of 0.81, while the sigmoid kernel exhibited the lowest performance, with an accuracy of 63.0% and an AUC of 0.61. The Gaussian kernel achieved an accuracy of 69.0% and an AUC of 0.75.

Based on these results, we conclude that the polynomial kernel was the optimal choice for our problem and dataset, as it demonstrated the highest performance across the metrics evaluated.

Model interpretability

To reveal hidden patterns and correlations in the dataset, we employed association rule mining, which enhanced the interpretability of our model. For this, we used the well-known Apriori algorithm, which was developed specifically for association rule mining. Applying this method, we identified frequently occurring item sets and constructed important association rules using the support, confidence, and lift metrics.

The lift measure quantifies the strength of the association between variables, showing the effect of one variable on the occurrence of another. Confidence validates an association rule by showing how often the consequent occurs when the antecedent is present. The support measure additionally captures how frequently specific item sets occur in the dataset, corroborating the association findings. Together, these measures gave us a comprehensive understanding of the relationships and patterns in the dataset and enhanced the interpretability of our model [67,68,69].
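The paper does not name the Apriori implementation; a sketch assuming the mlxtend package, with illustrative one-hot item names rather than DHS codes, might look like this:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# One-hot "transactions"; item names are illustrative, not DHS codes.
transactions = pd.DataFrame({
    "home_delivery":      [1, 1, 0, 1, 0, 1],
    "no_iron_supplement": [1, 1, 0, 1, 1, 0],
    "adverse_outcome":    [1, 1, 0, 1, 0, 1],
}).astype(bool)

# Frequent item sets above a minimum support, then rules filtered
# by confidence; each rule also reports its support and lift.
frequent = apriori(transactions, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "support", "confidence", "lift"]])
```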

We also employed SHAP (Shapley Additive Explanations) analysis to show how each feature affects the model’s predictions. SHAP analysis is a useful technique for quantifying the importance and contribution of individual features in a machine learning model.
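A minimal SHAP sketch for a fitted random forest, assuming the shap package (the return type of shap_values varies by version, so the sketch handles both common formats):

```python
import shap
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=8, random_state=42)
model = RandomForestClassifier(random_state=42).fit(X, y)

# TreeExplainer computes exact Shapley values efficiently for tree
# ensembles such as random forests.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Depending on the shap version, a binary classifier returns either a
# list of per-class arrays or one 3-D array; keep the positive class.
if isinstance(shap_values, list):
    shap_values = shap_values[1]
elif shap_values.ndim == 3:
    shap_values = shap_values[:, :, 1]

# Beeswarm-style summary ranking features by mean |SHAP value|, as in Fig. 5.
shap.summary_plot(shap_values, X)
```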

Results

Descriptive results of the socio-demographic and economic characteristics

We analyzed the descriptive, socio-demographic, and economic characteristics of the 139,659 childbearing women in the sample. The majority of participants, 70,275 (50.32%), were in the 20–29 age group, and 86,386 (61.85%) of the survey participants were rural residents. For more detailed information, please refer to Table 1.

Table 1 The socio-demographic and economic characteristics of study participants in Sub-saharan Africa based on recent DHS (2015–2024)

Maternal obstetrics characteristics of the study participants in Sub-saharan Africa

The data showed that the majority of study participants, 91.38% (127,618), delivered their babies vaginally, and 93.04% (129,933) received their delivery services at health institutions. A substantial proportion, 59.74% (83,426), had never used contraception, suggesting a relatively high rate of non-use of contraceptive methods among the study participants. For more detailed information, please refer to Table 2.

Table 2 Maternal obstetrics characteristics of the study participants in Sub-Saharan Africa based on recent DHS (2015–2024)

Adverse birth outcome in Sub-Saharan Africa

According to our analysis of the most recent DHS dataset, about 28.59% (95% CI: 28.36–28.83) of childbearing women in Sub-Saharan Africa had adverse birth outcomes. Our findings also revealed significant variation between countries: Sierra Leone had the lowest prevalence of adverse birth outcomes at 14.08% (95% CI: 13.22–14.94%), while Mali had the highest at 45.78% (95% CI: 44.10–47.45%). Figure 2 provides a detailed breakdown of the adverse birth outcome rates for each country, ordered from highest to lowest prevalence.

Fig. 2 The prevalence of adverse birth outcomes among childbearing women in Sub-Saharan African countries (forest plot)

Machine learning analysis of adverse birth outcome

Data balancing

Our analysis of machine learning algorithms using various data balancing techniques is shown in Table 3. The results showed that SMOTE-ENN consistently outperformed the other data balancing strategies in terms of accuracy across all models evaluated. Notably, the random forest approach achieved the highest accuracy (87.0%) and AUC (95.0%) after the data were balanced using SMOTE-ENN. We therefore used SMOTE-ENN for further experimentation and optimization. Please see Additional Material 3 for detailed performance metrics of the machine learning algorithms under the different data balancing techniques.

Table 3 Comparison of imbalanced data handling techniques across the machine learning algorithms

Development and performance comparisons of machine learning models

We carried out various optimization techniques to improve the performance of these machine learning models. We applied random search tuning, Bayesian optimization, and grid search hyperparameter tuning to explore the performance of each algorithm across the different optimization techniques.

Our results showed that grid search optimization produced consistently higher performance metrics for every algorithm implemented in our study. While Bayesian optimization yielded higher performance metrics for a few of the algorithms, specifically KNN and CatBoost, it lacked the overall consistency observed with grid search tuning.

Additionally, random search tuning showed higher performance for the CatBoost algorithm. However, we ultimately chose grid search as the final model optimization technique for the ten machine learning algorithms in our study, based on its ability to consistently improve performance metrics across all the algorithms. Random forest was the top-performing algorithm under all three tuning techniques.

For a detailed comparison of the machine learning algorithm performance under the random search, Bayesian optimizations, and grid search tuning approaches, please refer to Table 4. The table presents the accuracy and AUC metrics for each algorithm, after applying tenfold cross-validation and calibration during hyperparameter optimization steps.

Table 4 Accuracy and AUC metrics for selected machine learning algorithms after data balancing and hyperparameter optimization

After extensive experimentation and evaluation, we identified the random forest model as the top-performing machine learning model for our dataset, achieving an AUC of 0.95, accuracy of 0.88, recall of 0.94, specificity of 0.87, precision of 0.88, and F1-score of 0.91. For a detailed analysis of the performance metrics, please refer to Fig. 3 (ROC curve) and Fig. 4 (overall metric performance) in our study.

Fig. 3 ROC curves of the models trained on balanced data and optimized using grid search tuning

Fig. 4 Comprehensive performance metrics of the employed machine learning algorithms trained on balanced data and optimized using grid search tuning

Model interpretability

Association rule mining

The principal predictor variables associated with adverse birth outcomes were identified using association rule mining. The main factors linked to a high risk of adverse birth outcomes were home delivery, lack of prenatal iron supplementation, fewer than four ANC visits, short and long birth intervals, unwanted pregnancy, primiparous (first-time) mothers, and residence in the West African region.

The top 5 association rules and their corresponding statistics were:

Rule 1

If mothers do not receive iron supplementation, the pregnancy is unwanted, and the delivery occurs at home, the probability of an adverse birth outcome is 94.9% (confidence = 0.949 and lift = 2.96).

Rule 2

If mothers have fewer than four prenatal visits, the pregnancy is unwanted, and the delivery occurs at home, the probability of an adverse birth outcome is 92.5% (confidence = 0.925 and lift = 2.88).

Rule 3

If mothers have a short birth interval, the pregnancy is unwanted, and the delivery occurs at home, the probability of adverse birth outcomes is 91.7% (confidence = 0.917 and lift = 2.86).

Rule 4

If mothers are primiparous, the pregnancy is unwanted, and their geographic location is the West Africa region, the probability of adverse birth outcome is 90.8% (confidence = 0.908 and lift = 2.83).

Rule 5

If mothers have a long birth interval, the pregnancy is unwanted, and the delivery occurs at home, the probability of adverse birth outcomes is 90.1% (confidence = 0.901 and lift = 2.81).

SHAP feature impact on model prediction

As depicted in Fig. 5, the most influential factors for the model’s predictions were region, number of ANC visits, parity, place of delivery, maternal education, pregnancy wantedness, and iron supplementation.

Fig. 5 SHAP feature impact on model prediction

Discussion

To predict adverse birth outcomes in Sub-Saharan Africa, our research focused on applying advanced machine learning techniques. Our investigation identified significant variables linked to adverse birth outcomes and evaluated the effectiveness of ten advanced machine learning algorithms for this task.

According to our findings based on the analysis of the most recent DHS dataset, approximately 28.59% (95% CI: 28.36, 28.83) of childbearing women in sub-Saharan Africa experienced adverse birth outcomes. This prevalence is slightly lower than the previously reported rate for the region [16]. The difference in adverse birth outcome rates might be attributable to the varying periods in which the individual country surveys were conducted.

Regarding the employed algorithm, the findings indicated that the random forest approach was the best-performing algorithm. This finding is consistent with other studies [70,71,72,73,74], suggesting that the random forest method is a valuable tool for predicting adverse birth outcomes. The random forest performs well due to its ability to handle complex and nonlinear relationships within the data [75]. This finding implies that advanced machine learning techniques, specifically the random forest algorithm, hold immense potential as powerful tools to tackle this critical public health challenge of adverse birth outcomes in Sub-Saharan Africa.

Our association rule mining approach uncovered multiple factors associated with poor birth outcomes in Sub-Saharan Africa. Inadequate iron supplementation was found to be a major risk factor, highlighting the role of iron in healthy fetal development and the prevention of complications. This underscores the importance of interventions and strategies that ensure adequate iron supplementation for pregnant mothers in the region, in line with previous research findings [20, 24, 31].

Unwanted pregnancy was another important factor that increased the likelihood of poor outcomes at birth. This may be because unintended pregnancies present women with a variety of challenges, such as psychological stress, inadequate prenatal care, and delayed initiation of antenatal services, all of which can adversely affect birth outcomes. The finding is consistent with previous research [22].

Access to skilled delivery care is essential for a safe delivery experience, as evidenced by the strong association between home delivery and adverse birth outcomes in our data. As other researchers have already discovered [3, 20, 29], this could be because prompt interventions and emergency obstetric care delivered at healthcare facilities can reduce the risk of complications and improve outcomes for the mother and newborn.

Our findings showed that inadequate ANC visits, particularly fewer than the recommended four visits, were associated with an increased risk of adverse birth outcomes. This result is in line with previous research from other parts of the world [16, 18, 23, 25, 26, 30, 32]. A likely explanation is that inadequate ANC attendance means missed opportunities for early detection and intervention. The finding implies that ANC services should be strengthened and pregnant women encouraged to attend the prescribed number of visits.

Our association rule mining results showed that both short and long birth intervals increase the likelihood of adverse birth outcomes, a finding supported by previous research elsewhere in the world [3, 21, 30, 33, 76]. Long birth intervals are linked to older maternal age and more challenging pregnancies, and lengthy gaps between pregnancies may adversely affect the mother’s body and fetal growth. A short birth interval, by contrast, may not give the mother adequate time to recover and replenish her nutrient stores. The finding implies that encouraging optimal birth intervals may improve the health of pregnant women and their children.

According to our findings, being a primiparous mother was another factor that raised the risk of adverse birth outcomes. This finding is consistent with previous research [16,17,18,19, 22, 27, 29]. A possible explanation is that first-time mothers can encounter particular difficulties because they lack experience and knowledge about pregnancy and childbirth. The finding implies that educating and providing targeted interventions to first-time mothers can help reduce the risk of adverse birth outcomes.

Within Sub-Saharan Africa, our data showed considerable regional differences in adverse birth outcomes. In particular, we discovered that, in comparison to other regions in Sub-Saharan Africa, living in the West Africa region increased the likelihood of adverse birth outcomes. For this study, the West Africa region was defined to include countries such as Burkina Faso, Ivory Coast, Gambia, Ghana, Guinea, Liberia, Senegal, Sierra Leone, Nigeria, Zambia, Zimbabwe, Benin, Mali, and Mauritania.

This finding implies that regional variations in healthcare quality, access, socioeconomic status, and cultural practices throughout Sub-Saharan Africa have a significant impact on the health outcomes of mothers and children. Strengthening this finding, our further analysis implied that Western Africa had the highest rates of unintended pregnancy and late ANC checkups compared to other sub-regions of Sub-Saharan Africa, which may be one possible justification for it being a high-risk region for adverse birth outcomes. The finding is consistent with previous research elsewhere in the world [19, 22, 34].

Conclusion

The region continues to face persistent adverse birth outcomes, emphasizing the urgent need for increased attention and action. Fortunately, the findings of this study make it abundantly clear that advanced machine learning techniques, specifically the random forest algorithm, hold immense potential as powerful tools for governments and healthcare providers in Sub-Saharan Africa to tackle this critical issue head-on.

Building upon the invaluable insights obtained from these employed innovative techniques, we strongly recommend that certain key groups of women receive more intensive interventions as soon as their pregnancy is confirmed. Specifically, this includes women who are pregnant for the first time, women with a short (less than 24 months) or long (over 59 months) gap between pregnancies, women with unintended pregnancies, and women living in the high-risk western Africa region identified by the study.

These vulnerable women should be immediately provided with iron supplements and scheduled for comprehensive prenatal care. We should also strongly encourage them to deliver their babies in a medical facility or with the assistance of a qualified health professional, even if they had originally planned a home birth.

Strengths and limitations of the study

This study had several key strengths. First and foremost, we conducted a thorough analysis, testing ten different advanced machine learning algorithms and carefully tuning their parameters to get the best performance from each. This rigorous approach allowed us to identify the most suitable algorithm for predicting adverse birth outcomes. Additionally, the study provided valuable insights by highlighting the significant factors that contribute to adverse birth outcomes, using innovative machine learning techniques to uncover the key driving indicators.

However, there were also limitations to take into account. First, the study depended on secondary data from the DHS surveys, which have certain drawbacks and may not have included all relevant variables. As a result, we were limited in our ability to fully explore potential internal correlations among the identified risk factors.

The secondary nature of the data also constrained the authors’ capacity to investigate the complex causal pathways underlying the observed associations. Future studies utilizing primary data collection would be valuable to further elucidate these intricate relationships and provide a more comprehensive understanding of the determinants of adverse birth outcomes in sub-Saharan Africa.