Background

An unintended pregnancy is a pregnancy that is either unwanted or mistimed, such as one that occurs earlier than desired. It is one of the most important issues currently facing public health systems, and it imposes significant economic and social costs on society, including decreased workforce productivity and quality of life [1, 2]. Between 2015 and 2019, there were 121 million unintended pregnancies worldwide each year, and 61% of these ended in abortion. Although unintended births have decreased globally, the burden remains unevenly distributed between high-income and low-income nations. The reported rate was 66 per 1000 pregnancies in high-income countries, compared with 93 per 100 women in middle- and low-income countries. The burden of unintended pregnancy still weighs heavily on Ethiopia despite the availability of broad family planning services. According to a systematic review conducted in Ethiopia, the overall prevalence of unintended pregnancy was 28%. Results from the 2016 Ethiopian Demographic and Health Survey (EDHS) are consistent with this finding, indicating that 25% of births in the previous five years and of ongoing pregnancies were unintended [3,4,5].

Globally, unintended pregnancies have a variety of detrimental effects on mothers and fetuses. One of the most common negative consequences is induced abortion and its complications: six out of 10 unintended pregnancies end in induced abortion. People with unintended pregnancies frequently turn to unsafe abortion when they encounter obstacles to obtaining a safe, timely, affordable, geographically accessible, respectful, and non-discriminatory abortion [6, 7]. The child of an unintended pregnancy is more likely to be maltreated, to be born weighing less than 2,500 g, to die within the first year of life, and to lack the resources necessary for healthy growth. Mothers are also affected: their relationship with their spouse is more likely to end in divorce, and they may be more likely to experience physical violence. Both parents can experience financial difficulty and fall short of their career and educational aspirations [8].

According to different studies, the prevalence of unintended pregnancy was highest among women who were between the ages of 18 and 24 years, had never used family planning methods, had low income (less than 100% of the federal poverty level), had not completed high school, had a birth interval of fewer than two years, were living in rural areas, became pregnant only by their husband’s decision, had gravidity greater than or equal to five, were non-Hispanic black or African American, or were cohabiting but had never married [6,7,8,9,10,11,12].

The following recommendations are made to help reduce unintended pregnancies: increasing access to contraception; raising awareness of the importance of feelings, attitudes, and motivation in using contraception and preventing unintended pregnancies; developing and meticulously evaluating a variety of local programs, and encouraging research to create new contraceptives. The new guideline of abortion care also recommends straightforward primary care interventions that involve assuring access to medical abortion pills, ensuring that correct care information is available to all those who need it, improving the quality of abortion care delivered to women and girls, and task-sharing by a wider range of health providers [8, 13].

Numerous studies have investigated the potential causes of unintended pregnancy using traditional statistical analysis methods [6, 14,15,16,17]. However, no prior study has used machine learning to predict unintended pregnancy and identify its predictive factors. As the number of input variables and the potential correlations among them rise, traditional statistical procedures become less accurate and can produce incorrect conclusions [18]. Machine learning methods have been applied effectively to such problems [19] because they can capture complicated and nonlinear relationships in the data, improving prediction accuracy over traditional regression models. The purpose of this work was therefore to use advanced machine learning models to predict unintended pregnancy and identify its predictors.

Method

Data source and population

This study relied on the 2016 Ethiopian Demographic and Health Survey (EDHS), a nationally representative survey conducted from January 18 to June 27, 2016. The survey sample was stratified and selected in two stages. In the first stage, a total of 645 enumeration areas (EAs; 202 in urban areas and 443 in rural areas) were selected with probability proportional to EA size. In the second stage, a fixed number of 28 households per cluster were selected by systematic sampling with equal probability. Data were gathered from 16,650 households, 15,683 female respondents, and 12,688 male respondents on topics including fertility and fertility preferences, marriage, awareness and use of family planning methods, reproductive health, adult and childhood morbidity and mortality, awareness of and attitudes toward HIV/AIDS, and other significant public health issues [4]. A total weighted sample of 7590 women of reproductive age (15–49 years) who had given birth within the five years before the survey was used.

Study features

The dependent variable was unintended pregnancy, which encompasses unwanted and mistimed (wanted later) pregnancies. The independent features for this study were maternal age, maternal occupation, marital status, religion, region, parity, household size, wealth index, husband’s occupational status, husband’s education, residence, past miscarriages, knowledge of the ovulation cycle, distance from the health facility, ideal number of children, age at first sex, refusal of sex, total births, and age at first birth. Where necessary, independent variables were recoded or categorized to make them appropriate for analysis.

Data processing and analysis

A high-quality dataset is required for machine learning to make reliable predictions. Managing missing data is therefore an essential step in pre-processing the dataset. Encoding is another fundamental pre-processing step. Categorical variables are variables whose values are discrete rather than continuous and fall into two or more categories. In this work, categorical variables were encoded using one-hot encoding and label encoding [20].
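As an illustration of the two encoding schemes named above, the following minimal Python sketch applies label encoding and one-hot encoding to a single categorical column. It uses only the standard library; the function names and the example wealth categories are assumptions made for illustration and are not taken from the study’s code.

```python
def label_encode(values):
    """Map each distinct category to an integer code (sorted for reproducibility)."""
    categories = sorted(set(values))
    mapping = {cat: code for code, cat in enumerate(categories)}
    return [mapping[v] for v in values], mapping


def one_hot_encode(values):
    """Return one binary indicator column per distinct category."""
    categories = sorted(set(values))
    rows = [[1 if v == cat else 0 for cat in categories] for v in values]
    return rows, categories


# Hypothetical wealth-index values, for illustration only.
wealth = ["poor", "rich", "middle", "poor"]
codes, mapping = label_encode(wealth)
onehot, columns = one_hot_encode(wealth)

print(codes)    # integer code per record
print(columns)  # order of the indicator columns
print(onehot)   # one row of 0/1 indicators per record
```

Label encoding imposes an arbitrary order on the categories, which is usually acceptable for tree-based models, whereas one-hot encoding avoids that implied order at the cost of more columns.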

Data analysis

In this study, descriptive statistics were used to describe the socio-demographic characteristics using frequency and percentage. Data analysis stages included pre-processing the data, feature selection, data splitting, addressing imbalanced data, model building, and model performance testing. Python version 3 was the tool used in this study.

Feature selection method

The goal of feature selection is to rank and prioritize the most important predictors in the dataset. This was done by computing information gain (importance) values for each candidate variable. To find the major factors associated with unintended pregnancy, we used a decision tree classifier, an extra trees classifier, an XGBoost classifier, a gradient boosting classifier, and a random forest model. Higher values indicate that a variable is strongly associated with the outcome class. The ten features with the highest values were retained. Feature selection is a relatively effective method for reducing model complexity and accelerating the processing of machine learning algorithms [21].
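One way to combine importance scores from several tree-based classifiers is to take a per-feature median and keep the highest-ranked features. The sketch below illustrates that idea in plain Python. The importance values, the feature subset, and the function name `rank_features` are hypothetical and are not the study’s results or code.

```python
from statistics import median

# Hypothetical importance scores per classifier; values are illustrative only.
importances = {
    "decision_tree": {"region": 0.30, "wealth_index": 0.20, "religion": 0.10, "residence": 0.05},
    "random_forest": {"region": 0.25, "wealth_index": 0.22, "religion": 0.15, "residence": 0.04},
    "extra_trees": {"region": 0.28, "wealth_index": 0.18, "religion": 0.12, "residence": 0.06},
}


def rank_features(importances, top_n):
    """Rank features by the median of their importance across classifiers."""
    features = list(next(iter(importances.values())))
    scores = {
        feature: median(model[feature] for model in importances.values())
        for feature in features
    }
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n], scores


top_features, scores = rank_features(importances, top_n=3)
print(top_features)
```

The median is less sensitive than the mean to a single classifier that assigns an unusually high or low score to a feature.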

Data split

In machine learning approaches, a dataset is typically divided at random into two parts: a training dataset used to fit the model and a test dataset used to predict the response variable and check whether the predicted outcomes match the actual outcomes. A validation dataset may also be set aside for estimating parameters to be incorporated into the trained models [22]. In this study, however, the complete dataset was divided into ten folds using stratified tenfold cross-validation.
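The stratified tenfold split described above can be sketched as follows. This is a simplified, standard-library illustration of the idea (each fold keeps roughly the same class proportions as the full dataset), not the study’s implementation; the function name, the seed, and the toy labels are assumptions.

```python
import random
from collections import defaultdict


def stratified_kfold(labels, n_splits=10, seed=0):
    """Yield (train_idx, test_idx) pairs with similar class proportions per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the indices of each class round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# Hypothetical imbalanced outcome: 1 = unintended, 0 = intended.
labels = [1] * 30 + [0] * 70
splits = list(stratified_kfold(labels, n_splits=10))
train_idx, test_idx = splits[0]
print(len(train_idx), len(test_idx))
```

Each record appears in exactly one test fold, so every observation is used for both training and evaluation across the ten iterations.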

Imbalance data handling

The effectiveness of machine learning algorithms is frequently assessed using predictive accuracy; however, when the data are imbalanced, accuracy can be misleading and it becomes challenging to identify the factors behind unintended pregnancy. To balance the majority and minority classes in this study, the Synthetic Minority Oversampling Technique (SMOTE) [23] was employed. SMOTE is a pre-processing method that handles class imbalance by oversampling the minority class. It generates a new sample by linearly interpolating, at a random position, between a minority-class sample and one of its nearest neighbors [24].
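The interpolation step at the core of SMOTE can be sketched in a few lines of Python. This is a simplified illustration using only the standard library, not the implementation used in the study; the function name, the default of five neighbors, and the toy points are assumptions.

```python
import random


def smote_oversample(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority-class neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Sort the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # random position on the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical two-feature minority-class samples.
minority = [[1.0, 2.0], [2.0, 3.0], [3.0, 1.0], [2.5, 2.5]]
new_points = smote_oversample(minority, n_new=6, k=2)
print(len(new_points))
```

Because each synthetic point lies on a line segment between two existing minority samples, the interpolation assumes numeric features; for categorical features, variants such as SMOTE-NC are typically used instead.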

Method of building a predictive model

After the data were prepared and split into training and testing samples, candidate models were selected for training. Because the outcome variable is categorical, the problem is a classification task and required appropriate classifiers. Six supervised classification methods were employed in this work: the ExtraTrees classifier, random forest, decision tree, logistic regression, gradient boosting, and XGBoost. The algorithms were chosen for their accuracy, training time, ability to handle missing data, and ease of understanding and learning.

Performance evaluation for predictive model

Following model training, the performance of each model was assessed and compared with that of the others on the basis of the confusion matrix. Accuracy, precision, sensitivity, specificity, F1-score, and the area under the receiver operating characteristic curve (AUC-ROC) were used to evaluate model performance.
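The AUC-ROC can be interpreted as the probability that a randomly chosen positive case receives a higher predicted score than a randomly chosen negative case, with ties counted as one half. The sketch below computes it directly from that definition in plain Python. The function name and the toy labels and scores are assumptions for illustration; library implementations compute the same quantity more efficiently.

```python
def roc_auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Hypothetical labels (1 = unintended) and predicted probabilities.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
print(auc)  # 0.75: three of the four positive-negative pairs are ordered correctly
```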

The confusion matrix is a common performance measurement tool in machine learning classification tasks; it summarizes a model’s predictions against the actual classes, here for a binary outcome (Table 1) [25]. The performance of the ML models was also visualized using the receiver operating characteristic (ROC) curve.

Table 1 Confusion matrix and different derived metrics adapted from [25]

From the confusion matrix above, recall (sensitivity), specificity, precision, accuracy, and the F1-score were derived as follows:

$$\text{Recall (Sensitivity)} = \frac{\text{TP}}{\text{TP} + \text{FN}}$$
(1)
$$\text{Specificity} = \frac{\text{TN}}{\text{TN} + \text{FP}}$$
(2)
$$\text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}}$$
(3)
$$\text{Accuracy} = \frac{\text{TP} + \text{TN}}{\text{TN} + \text{TP} + \text{FP} + \text{FN}}$$
(4)
$$\text{F1} = 2 \times \frac{\text{Recall} \times \text{Precision}}{\text{Recall} + \text{Precision}}$$
(5)
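As a worked example of Eqs. 1 to 5, the following Python sketch computes the five metrics from the four cells of a binary confusion matrix. The counts are hypothetical and chosen only to make the arithmetic easy to follow; the function name is an assumption.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute the metrics of Eqs. 1-5 from confusion-matrix counts."""
    recall = tp / (tp + fn)                      # Eq. 1 (sensitivity)
    specificity = tn / (tn + fp)                 # Eq. 2
    precision = tp / (tp + fp)                   # Eq. 3
    accuracy = (tp + tn) / (tn + tp + fp + fn)   # Eq. 4
    f1 = 2 * recall * precision / (recall + precision)  # Eq. 5
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        "f1": f1,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=80, fp=20, tn=70, fn=30)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, precision is 80/100 = 0.80 and accuracy is 150/200 = 0.75, while recall is 80/110, so the F1-score falls between recall and precision.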

In summary, Fig. 1 shows the machine learning process used in this study.

Fig. 1 Workflow of machine learning for unintended pregnancy prediction

Results

Sociodemographic characteristics of participants

This study included 7589 women of reproductive age. Of these, 69.36% (5264) were between the ages of 20 and 34, 92.58% (7026) were married, 79% (6042) lived in rural areas, and 41.24% (3129) lived in Oromia. More than half of the women had no formal education, and 33.10% (2512) were Orthodox (Table 2).

Table 2 Sociodemographic characteristics of reproductive age women in Ethiopia, EDHS 2016, March 2023 (n = 7589)

Imbalance data handling

Handling class imbalance was a key strategy in this study for boosting the performance of the machine learning algorithms. The imbalanced dataset was balanced using the SMOTE sampling method, and the accuracy and AUC of the chosen machine learning algorithms were compared between the balanced and unbalanced datasets. On the unbalanced dataset, gradient boosting performed better than the other classifiers with an AUC of 0.682, while logistic regression had an AUC of 0.668. On the SMOTE-balanced dataset, the ExtraTrees classifier had the highest accuracy (84.93%) and an AUC of 0.926. Random forest ranked next after the ExtraTrees classifier on the balanced dataset, with a test accuracy and AUC of 84.40% and 0.924, respectively (Table 3).

Table 3 Comparison of imbalanced data handling techniques using accuracy and area under the curve (AUC)

Machine learning is difficult with unbalanced data because instances of the minority, or rarely occurring, class are often wrongly categorized as instances of the majority class, which lowers the performance of the classification algorithm. The classifier is overwhelmed by the dominant class and ignores the minority class, which here is the unintended class. After SMOTE was applied to the unbalanced dataset, the overall number of records rose (Fig. 2). We mainly used AUC to compare the classifiers and the balanced sampling method.

Fig. 2 Target feature before and after balancing

Implementation of unintended pregnancy prediction models

In this study, the data were split into training and test sets comprising 90% and 10% of the total data, respectively. Model performance was compared across the machine learning classifiers using the prediction evaluation metrics. To avoid overfitting, the popular 10-fold cross-validation method was applied. The experiments were divided into two parts: the first trained the classification algorithms using 32 features from the imbalanced dataset, and the second applied a balanced sampling strategy to determine which model with 32 features performed best. High accuracy, precision, sensitivity, specificity, F1-score, and AUC were obtained by applying the machine learning classification algorithms (logistic regression, decision tree, random forest, gradient boosting, XGBoost, and ExtraTrees) to the data balanced using SMOTE. The ExtraTrees classifier (AUC = 0.928) outperformed all other classifiers across the performance metrics and was the best at predicting unintended pregnancies, as shown by the ROC curve in Fig. 3. The other classifiers achieved the following AUC values: random forest 0.924, XGBoost 0.898, gradient boosting 0.824, logistic regression 0.775, and decision tree 0.76 (Fig. 3).

Fig. 3 ROC curves of the classifiers on the dataset balanced using SMOTE

ExtraTrees classifier performance

On the balanced dataset, the ExtraTrees classifier performed strongly compared with the other selected classifiers. Hyperparameter tuning and feature selection were carried out after the best model had been chosen, and the important predictors of unintended pregnancy were then identified.

Tuning an ExtraTrees classifier with grid search CV

After selecting the best model, hyperparameter tuning was applied and the tuned model was compared with the model using default hyperparameters. Figure 4 shows that the ExtraTrees classifier with default hyperparameters outperformed the ExtraTrees classifier with tuned hyperparameters. Therefore, this study used the ExtraTrees classifier with default hyperparameters, which achieved a sensitivity, specificity, precision, and F1-score of 83.79%, 83.94%, 83.94%, and 84.04%, respectively. The classifier with default hyperparameters also had the highest AUC value, indicating that it correctly distinguished unintended pregnancies (Fig. 4).

Fig. 4 Comparison of tuned and default hyperparameters using the ExtraTrees classifier

Top features from the chosen classifier

This experiment was performed to examine the classifiers’ ability to predict unintended pregnancy and the impact of feature selection. The features that best predict unintended pregnancy were identified from the feature importance of all chosen classifiers, and the cumulative result across classifiers was used to identify the most reliable predictors. Region, ideal number of children, religion, wealth index, age at first sex, husband education, refusal sex, total birth, age at first birth, and mother’s educational status were the factors that had the greatest impact on unintended pregnancy. The top ten features were selected using the median results across the chosen classifiers (Table 4).

Table 4 Comparison of selected machine learning models in choosing the top features

ExtraTrees classifier feature importance

In the plot of features ranked by the ExtraTrees classifier, the features shown at the bottom were identified as the top predictors of unintended pregnancy. Of all features, region, ideal number of children, religion, wealth index, age at first sex, husband education, household size, refusal sex, total birth, and decision on marriage were the top predictors (Fig. 5).

Fig. 5 Relevant features selected by ExtraTrees feature importance

Discussion

According to earlier research on this topic, Ethiopia has one of the highest rates of unintended pregnancy worldwide [26,27,28,29]. Findings also revealed that although the prevalence of unintended pregnancy in the nation has declined at times, more work is still needed to sustain this pattern and manage its undesirable consequences. Machine learning models are regarded as state-of-the-art approaches for quick and accurate problem-solving. This study aimed to predict unintended pregnancy, identify its predictors, and build the best-performing machine learning classifier. Six machine learning algorithms, namely logistic regression, decision tree, random forest, gradient boosting, XGBoost, and ExtraTrees, were applied to predict unintended pregnancy in Ethiopia using EDHS 2016 data.

The above models were chosen to build and evaluate the best predictive model using the key predictors, with the aim of increasing prediction accuracy and generalizability. Stratified 10-fold cross-validation was used to train the classifiers on the training data. To determine the optimal accuracy, several tests were conducted on both balanced and unbalanced datasets. The results demonstrated that the imbalanced data produced low performance metrics. To balance the data, this study used the SMOTE sampling approach. AUC, recall, precision, and accuracy evaluations showed that the ExtraTrees classifier performed better than all other selected classifiers (reported values: 84.75%, 84.66%, 84.81%, and an AUC of 0.925). As a result, this classifier was selected in our study for the prediction of unintended pregnancy.

Using the relevance values of independent features for the ExtraTrees classifier, this study identified the key influencing factors for unintended pregnancy. The most significant variables that contribute to higher performance in unintended pregnancy prediction were found using the average results of all classifiers used in the feature selection process. The important predictors of unintended pregnancy among all independent characteristics included region, the ideal number of children, religion, wealth index, age at first sex, husband education, refusal sex, total birth, age at first birth, and mother’s educational status.

Region was a very important predictor of unintended pregnancy. This finding is consistent with a systematic review and meta-analysis of observational studies in Ethiopia [6], as well as a study using EDHS 2016 data, which found that rates of unintended pregnancy differed across regions of Ethiopia [14]. Sociodemographic variation between the populations of each region could be one of the causes.

The machine learning classifier identified the wealth index as a highly important feature for predicting unintended pregnancy. This finding is supported by research conducted in India [30], Bangladesh [31], Iran [32], Nepal [33], Nigeria [34], Kenya [35], and different parts of Ethiopia [36,37,38], which revealed that women with higher wealth status are more empowered to take charge of their sexual and reproductive health than women with poorer wealth status. The relationship between income status, occupation, and unintended pregnancy may be explained by the connection between formal employment, social networks, and earning potential [39].

Religion was an important feature in predicting unintended pregnancy. Previous studies conducted in Bangladesh [31], Nepal [40], and Addis Zemen, Ethiopia [41], and a study using EDHS 2016 data in Ethiopia [14] revealed that religious affiliation was strongly associated with unintended pregnancy. Possible explanations for the association include the belief that every child is a gift from God and religious teaching that discourages the use of contraception. Thus, mothers who follow a particular faith may not regard a pregnancy as unintended.

Based on the findings of this study, the husband’s education and the mother’s educational status were other relevant features for predicting unintended pregnancy. This finding is supported by research conducted in Russia [42], Bangladesh [43], Uganda [44], Malawi [45], and Southern Ethiopia [46], which reported that the educational status of husbands and mothers was associated with unintended pregnancy. The possible explanation might be due.

Other relevant features of unintended pregnancy were the Ideal number of children, age at first sex, refusal of sex, total birth, and age at first birth.

Overall, our study shows that machine learning techniques can be used to identify predictive characteristics related to unintended pregnancy. Machine learning methods appear to be useful for determining which indicators are most important for predicting an unintended pregnancy. Our model might help address the crucial public health problem of identifying and managing unintended pregnancies.

For predicting unintended pregnancy, the suggested method had the best AUC, accuracy, precision, recall, and specificity among the classifiers compared. Such predictions can support the provision of comprehensive services and extended working hours for women. Effective predictive modeling may raise medical care standards and increase maternal survival. The prediction models developed in our work can therefore contribute by identifying women at risk of unintended pregnancy and by guiding the adoption of the most effective supportive measures, such as training or other forms of information provision. This might reduce misunderstanding by providing quantitative, unbiased, and research-based models for risk classification, prediction, and ultimately care planning. This study has limitations. In contrast to a statistical model, the machine learning model does not produce coefficients or odds ratios, making it challenging to determine how much and in which direction each factor affects the outcome. In addition, machine learning algorithms are generally less interpretable because they rank variables by their importance to the prediction of unintended pregnancy rather than estimating parameters for them.

Conclusion

In predicting unintended pregnancy in Ethiopia, the ExtraTrees classifier had a somewhat higher predictive ability than the other selected machine learning classifiers. Using the ExtraTrees classifier to select the features related to unintended pregnancy, we found that region, ideal number of children, religion, wealth index, age at first sex, husband education, refusal sex, total birth, age at first birth, and mother’s educational status were the significant predictors of unintended pregnancy. This work highlights the use of machine learning algorithms to predict unintended pregnancy and to better understand its most significant predictors in order to inform policy directions.