Background

Globally, infant and child mortality rates are fundamental indicators of a country’s population health, quality of life, and socioeconomic situation [1]. Under-five mortality has declined by a remarkable 60% over the last three decades. However, an estimated 7.4 million deaths still occur globally each year from preventable and treatable diseases in young children. Moreover, 70% of these deaths occur in children under the age of 5 years, and 95% are from South Asia and sub-Saharan Africa. On average, about 1 in 13 children in sub-Saharan Africa dies before the age of five [2]. Various factors may contribute to high mortality rates, such as poor living conditions and other socio-economic factors. Childhood acute respiratory infections (ARI) remain among the leading morbidities in low-income countries, particularly in sub-Saharan Africa [3].

ARI disease and its related symptoms are typically caused by contagious viral and bacterial infections that spread rapidly, either from person to person through droplets or through contaminated food or drinking water due to poor hygiene [4]. According to the WHO (2019), ARI diseases are the fourth most common childhood disease among those with a higher rate of morbidity. When combined with malaria, ARI diseases become the top communicable diseases, causing more deaths than other comorbidities [5,6,7]. In addition, when the symptoms of ARI disease coincide with those of diarrhea and malaria, they could lead to childhood death [8].

In Uganda, ARI has remained the leading cause of morbidity and mortality in children under the age of 5 years, accounting for about 9% of the ARI prevalence, with 81.3% in urban areas. The under-five mortality rate accounted for 1 in 16 child deaths, and 42% of these deaths occurred in the neonatal period [9]. The heavy loss of young lives from childhood ARI mortalities poses a heavy burden to families and healthcare providers in Uganda. Therefore, conducting research regarding the assessment of risk factors related to such diseases can greatly help policy decision-making and reduce these morbidity and mortality rates, especially in under-five children.

Traditional analysis methods such as logistic regression and the chi-square test are commonly applied in the social science and medical literature. However, in diagnosing cardiopulmonary diseases using medical data, machine learning tools have become popular and are frequently used in recent research [10]. Appropriately used, machine learning algorithms have shown strong performance in the prediction and classification of disease outcomes [11]. This study aims to determine potential risk factors contributing to ARI disease symptoms in children under the age of 5 years in Uganda, using the better-performing methods among traditional and machine learning analysis methods for predicting ARI symptom outcomes. The study findings could help in making research-based decisions to address the risk factors associated with ARI disease symptoms that are relevant to controlling the disease and its spread among children.

Methods

Data source

This study used secondary data from the most recent five-year cross-sectional survey, the Uganda Demographic and Health Survey (UDHS), which was conducted between June and December 2016. The UDHS is conducted by the Ugandan Bureau of Statistics in collaboration with the DHS program to collect up-to-date data on fundamental demographic and health indicators relevant to policymakers and program managers, in order to evaluate the national population’s health and nutritional programs [9]. The DHS data collected in different developing countries can be downloaded from the website of the DHS program after approval.

Design and sampling

We used a cross-sectional study design to collect characteristics and information regarding the prevalence of ARI disease symptoms among children under the age of 5 years in Uganda, using UDHS data collected in 2016. A two-stage stratified sampling design was used to select the sample. The first stage involved selecting 697 geographic areas called enumeration areas (EAs) (535 rural and 162 urban EAs) that covered 130 households on average, and the second stage involved the selection of households to be included in each EA. All the EAs with more than 300 households were segmented into one EA, and the households in the EAs were selected with a probability proportional to the size of the segment [9].

Population and sample

The target population of this study was comprised of male and female children under the age of 5 years from different regions of Uganda. The data and recorded information for 13,493 children were used as the sample for this study. The total sample of children was divided into two groups: 75% for analysis and 25% for testing the performance of the various methods of analysis used in the study.
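The 75%/25% partition described above can be sketched as follows. This is a minimal illustration in Python rather than the R workflow used in the study; the function name, the seed, and the rounding rule are assumptions, and the source does not state whether the split was stratified by outcome, so the sketch uses a simple unstratified random split.

```python
import random


def split_train_test(ids, train_fraction=0.75, seed=2016):
    """Shuffle record identifiers and split them into training and testing sets.

    The seed and the rounding rule are illustrative assumptions.
    """
    shuffled = list(ids)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(train_fraction * len(shuffled))
    return shuffled[:n_train], shuffled[n_train:]


# 13,493 children, as reported in the text; the integer IDs are placeholders.
train_ids, test_ids = split_train_test(range(13493))
print(len(train_ids), len(test_ids))
```

Under these assumptions the two subsets are disjoint and together cover the full sample.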

Variables of interest

In this study, we used various characteristics that were measured in the 2016 UDHS [9] and factors from other related literature that were included in the survey dataset (Fig. 1). Behavioral, environmental, and socio-demographic characteristics of children, mothers, and households were used to analyze and determine potential risk factors associated with the symptoms of ARI disease in children under the age of five in Uganda. During the survey, mothers aged 15–49 years who had children under the age of 5 years in the selected households were asked whether their children had experienced ARI disease symptoms, such as coughing accompanied by short, rapid, or difficult breathing, in the 2 weeks before the survey. The responses regarding the ARI disease symptoms were considered subjective since they were mothers’ perceptions without validation from medical personnel. The explanations for the variables used in this study are presented in the supplementary file in Tables A1 and A2.

Fig. 1 A framework of factors of childhood ARI disease symptoms

Analysis methods

This study focuses primarily on determining the potential risk factors of ARI disease symptoms using the better-performing methods among the traditional and machine learning methods of analysis most often applied in the social sciences and medical research [12].

Logistic regression

We used a binary logistic regression (LR) model, shown in Eq. 1, to analyze the log-linear association of k variables (i.e., factors) X1, X2, …, Xk, with corresponding effects b1, b2, …, bk, on the outcome Y of ARI disease symptoms, coded 1 if the “child had ARI symptoms” and 0 “otherwise”, where π indicates the probability that a child had ARI symptoms [13]. Stepwise variable selection procedures were also used to select influential factors associated with the outcome of interest, i.e., the symptoms of ARI disease.

$$Y=\ln\left(\frac{\pi}{1-\pi}\right)=b_0+b_1X_1+b_2X_2+\dots+b_kX_k$$
(1)
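As a small illustration of Eq. 1, the sketch below maps a linear predictor to the probability π and back to the log-odds. The coefficient values and the two binary covariates are hypothetical and are not estimates from this study; the function names are likewise illustrative.

```python
import math


def linear_predictor(b0, b, x):
    """Right-hand side of Eq. 1: b0 + b1*x1 + ... + bk*xk."""
    return b0 + sum(bj * xj for bj, xj in zip(b, x))


def probability(b0, b, x):
    """Invert the logit link to obtain pi, the probability of ARI symptoms."""
    eta = linear_predictor(b0, b, x)
    return 1.0 / (1.0 + math.exp(-eta))


def log_odds(p):
    """Left-hand side of Eq. 1: ln(pi / (1 - pi))."""
    return math.log(p / (1.0 - p))


# Hypothetical coefficients and covariate values, for illustration only.
b0, b = -0.4, [0.24, -0.19]
x = [1, 1]
p = probability(b0, b, x)
print(round(p, 3), round(log_odds(p), 3))
```

Applying the log-odds transform to the computed probability recovers the linear predictor, which is the identity that Eq. 1 states.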

Elastic net regression

The elastic net logistic regression (EN) model shown in Eq. 2 was used in addition to the LR model in Eq. 1 to handle correlation between features and thereby reduce the overfitting that could arise in the analysis of risk factors associated with the outcomes of the ARI disease symptoms [14]. The EN method penalizes and shrinks the effects b1, b2, …, bk of the non-informative variables x1, x2, …, xk using the non-negative tuning parameters α ∈ [0, 1] and λ, chosen with ten-fold cross-validation [2].

$$\left(\widehat{b}_0,\widehat{\mathbf{b}}\right)=\arg\min\left\{-\sum_{i=1}^{n}\left[y_i\left(b_0+\mathbf{x}_i^{T}\mathbf{b}\right)-\ln\left(1+\exp\left\{b_0+\mathbf{x}_i^{T}\mathbf{b}\right\}\right)\right]+\lambda\left[\frac{1}{2}\left(1-\alpha\right)\sum_{j=1}^{k}b_j^{2}+\alpha\sum_{j=1}^{k}\left|b_j\right|\right]\right\}$$
(2)
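The objective in Eq. 2 can be evaluated directly for given coefficients, which makes the roles of α and λ concrete: α = 1 gives a pure absolute-value (lasso) penalty, α = 0 a pure squared (ridge) penalty. The sketch below only evaluates the objective as written in Eq. 2; it is not a fitting routine and not the software used in the study, and the toy data and coefficients are hypothetical.

```python
import math


def elastic_net_objective(b0, b, X, y, lam, alpha):
    """Evaluate the penalized negative log-likelihood of Eq. 2.

    X is a list of covariate rows, y a list of 0/1 outcomes,
    lam >= 0 the penalty strength and alpha in [0, 1] the mixing parameter.
    """
    neg_log_lik = 0.0
    for xi, yi in zip(X, y):
        eta = b0 + sum(bj * xij for bj, xij in zip(b, xi))
        neg_log_lik -= yi * eta - math.log(1.0 + math.exp(eta))
    ridge = 0.5 * (1.0 - alpha) * sum(bj ** 2 for bj in b)
    lasso = alpha * sum(abs(bj) for bj in b)
    return neg_log_lik + lam * (ridge + lasso)


# Hypothetical toy data and coefficients, for illustration only.
X = [[1, 0], [0, 1]]
y = [1, 0]
unpenalized = elastic_net_objective(0.0, [1.0, -2.0], X, y, lam=0.0, alpha=0.5)
lasso_only = elastic_net_objective(0.0, [1.0, -2.0], X, y, lam=1.0, alpha=1.0)
ridge_only = elastic_net_objective(0.0, [1.0, -2.0], X, y, lam=1.0, alpha=0.0)
print(lasso_only - unpenalized, ridge_only - unpenalized)
```

With coefficients (1, −2) and λ = 1, the lasso term adds |1| + |−2| = 3 and the ridge term adds ½(1 + 4) = 2.5 to the unpenalized value.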

Machine learning methods

In addition, machine learning algorithms, namely the decision tree (DT) and random forest (RF) methods, were compared with the regression methods in predicting the outcomes of ARI disease symptoms in children under the age of 5 years in Uganda [15, 16]. The DT algorithm was used because its tree-like structure is simple and easy to learn and interpret. The RF algorithm was used as an extension of the DT method because it reduces the variance of the prediction by combining many decision trees, each generated from a random sample of the data [17].
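The variance-reduction idea behind RF, as described above, can be sketched with a deliberately simplified ensemble: each "tree" is a one-split decision stump fitted on a bootstrap sample using a random subset of features, and predictions are combined by majority vote. This is an illustrative simplification, not the random forest implementation used in the study; all function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows, labels, candidate_features):
    """Fit a one-split tree: pick the candidate feature whose
    value -> majority-label rule makes the fewest training errors."""
    best = None
    for j in candidate_features:
        counts = {}
        for x, y in zip(rows, labels):
            counts.setdefault(x[j], Counter())[y] += 1
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errors = sum(rule[x[j]] != y for x, y in zip(rows, labels))
        if best is None or errors < best[0]:
            best = (errors, j, rule)
    default = Counter(labels).most_common(1)[0][0]  # fallback for unseen values
    return best[1], best[2], default


def predict_stump(stump, x):
    feature, rule, default = stump
    return rule.get(x[feature], default)


def fit_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Fit stumps on bootstrap samples, each with a random feature subset."""
    rng = random.Random(seed)
    n, p = len(rows), len(rows[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample rows with replacement
        features = rng.sample(range(p), n_features)  # random feature subset
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], features))
    return forest


def predict_forest(forest, x):
    """Combine the individual predictions by majority vote."""
    return Counter(predict_stump(s, x) for s in forest).most_common(1)[0][0]


# Hypothetical toy data: features 0 and 1 carry the signal, feature 2 is noise.
labels = [i % 2 for i in range(40)]
rows = [(y, 1 - y, (i // 2) % 2) for i, y in enumerate(labels)]
forest = fit_forest(rows, labels, n_trees=25, n_features=2, seed=1)
print(predict_forest(forest, (1, 0, 0)), predict_forest(forest, (0, 1, 1)))
```

A full random forest grows deeper trees and re-draws the feature subset at every split, but the two sources of randomness and the voting step are the same in spirit.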

Measures of evaluation

To evaluate the performance of the methods used in this study, we considered several metrics derived from the contingency matrix, which are commonly used in diagnosing ill patients in medical research [18]. Accuracy (Eq. 3) measures the proportion of all children, with and without reported ARI disease symptoms, who are correctly predicted by the method. Precision (Eq. 4) is the proportion of children predicted as having ARI disease symptoms who actually had them. Selectivity (Eq. 5) is the proportion of children reported as not having ARI symptoms who are correctly predicted as not having them. Recall (Eq. 6), also called sensitivity, is the proportion of children predicted as symptomatic among all children with ARI symptoms in the study. We also used the area under the curve (AUC) of the receiver operating characteristic (ROC) curve, based on the true and predicted outcomes of ARI symptoms [10]. We used STATA version 17.0 for data management and R, with functions from the caret package, for data analysis.

$$\mathrm{Accuracy}=\frac{\left( TP+ TN\right)}{\left( TP+ FP\right)+\left( TN+ FN\right)}$$
(3)
$$\mathrm{Precision}=\frac{TP}{\left( TP+ FP\right)}$$
(4)
$$\mathrm{Selectivity}=\frac{TN}{\left( TN+ FP\right)}$$
(5)
$$\mathrm{Recall}\ \mathrm{or}\ \mathrm{Sensitivity}=\frac{TP}{\left( TP+ FN\right)}$$
(6)

Where TP, TN, FP, and FN represent the number of true positives, true negatives, false positives, and false negatives, respectively.
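Equations 3 to 6 translate directly into code. The sketch below computes the four measures from the cells of a contingency matrix; the counts are hypothetical and chosen only to make the arithmetic easy to follow, not taken from Table 3.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the measures of Eqs. 3-6 from contingency-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),  # Eq. 3
        "precision": tp / (tp + fp),                  # Eq. 4
        "selectivity": tn / (tn + fp),                # Eq. 5
        "recall": tp / (tp + fn),                     # Eq. 6
    }


# Hypothetical counts for 100 children, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts, accuracy is 85/100, precision 40/45, selectivity 45/50, and recall 40/50, which shows how precision and recall can differ even when both are computed from the same true-positive count.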

Results

In this study, a sample of 13,493 children under the age of 5 years in Uganda was analyzed. Overall, 5437 children (40.3%) had ARI disease symptoms and 8056 (59.7%) did not (Fig. 2). Tables 1 and 2 show that the prevalence of ARI symptoms was slightly higher in males (50.7%) than in females (49.3%). About 44.5% of children with ARI symptoms were under 24 months of age, 33.8% had mothers under 25 years of age, and 47.4% lived in a lower-income class. About 74.2% of children had mothers whose education was below the secondary level, and only 56.6% were breastfed. The majority of children reported lived in households exposed to wood smoke from firewood used as cooking energy (77.9%), and 53.4% were reported in the dry season.

Fig. 2 Prevalence of ARI disease symptoms in children under the age of 5 years in Uganda

Table 1 Distribution of the ARI symptoms’ prevalence based on socio-economic and demographic characteristics
Table 2 Distribution of the ARI symptoms’ prevalence based on behavioral and environmental characteristics

Comparison of method performances

In the analysis, we used 75% of the total sample as a training sample and the remaining 25% for testing the methods’ performance, with ten-fold cross-validation. Table 3 shows the results of the performance comparisons between the logistic regression (LR), elastic net logistic regression (EN), decision tree (DT), and random forest (RF) methods. The RF method showed the highest accuracy (88.7%) and precision (93.10%) in predicting childhood ARI symptoms: about 88.7% of children reported as having or not having symptoms of ARI were correctly predicted by the RF method, and 93.1% of children predicted to have ARI symptoms actually had them. The LR method followed with 62.0% accuracy and 88.03% precision. The other methods, the EN method (61.7% accuracy and 86.25% precision) and the DT method (61.2% accuracy and 83.0% precision), showed the lowest performance in the prediction of ARI disease symptoms in this study. The AUC results for the receiver operating characteristic curves comparing these methods are presented in Fig. 3. Therefore, we used the random forest and logistic regression methods to determine potential risk factors for ARI disease symptoms among children under the age of 5 years in Uganda.

Table 3 Comparison of predictive performances for the methods
Fig. 3 ROC curves and AUC values for method performances
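The AUC can be read as the probability that a randomly chosen child with ARI symptoms receives a higher predicted score than a randomly chosen child without symptoms, with ties counted as one half. The sketch below computes it by direct pairwise comparison; the labels and scores are hypothetical and are not model outputs from this study.

```python
def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of scores.

    labels are 0/1 outcomes and scores are predicted probabilities
    (or any ranking score) for the positive class.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for sp in positives:
        for sn in negatives:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5  # ties count as half a concordant pair
    return wins / (len(positives) * len(negatives))


# Hypothetical outcomes and predicted scores, for illustration only.
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.7, 0.6, 0.2, 0.4, 0.5]
print(round(auc(labels, scores), 4))
```

In this example, 7 of the 9 positive-negative pairs are ranked correctly, so the AUC is 7/9. The pairwise form is quadratic in the sample size; for large samples a rank-based computation gives the same value more efficiently.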

Potential risk factors contributing to the childhood ARI disease symptoms

Tables 4 and 5 summarize the results of the logistic regression and random forest methods. The 75% subsample of children used in the performance comparisons between these methods was used to determine the potential risk factors for ARI disease symptoms among children in Uganda. Adjusting for other factors, the LR model results (Table 4) reveal that children aged 12–23 months had 1.27 times the odds (95% CI: 1.12–1.44) of developing ARI disease symptoms compared with infants, while children aged 48–59 months had lower odds of ARI symptoms (0.69 times; 95% CI: 0.60–0.80) than children at earlier ages. Children living in regions other than the central region had a lower risk of developing ARI symptoms. Children of teen mothers had a significantly higher risk of having ARI symptoms than children of mothers in their middle ages (21–24 years old). Children of mothers employed in farming (1.25 times; 95% CI: 1.11–1.42) or in other employment (1.93 times; 95% CI: 1.71–2.19) had higher odds of ARI disease symptoms than children of unemployed mothers, who had enough time to care for them.

Table 4 Logistic regression estimates of risk factors associated with childhood ARI symptoms in Uganda
Table 5 Potential risk factors contributing to the childhood ARI disease in both random forest and logistic regression methods

Among behavioral and environmental factors, breastfed children had lower odds of ARI symptoms (0.83 times; 95% CI: 0.76–0.92) than children who were not breastfed. Children exposed to charcoal cooking smoke had lower odds of developing ARI disease symptoms (0.77 times; 95% CI: 0.69–0.87) than children exposed to firewood smoke, and children were less likely to develop symptoms of ARI disease in the rainy season than in the dry season in Uganda (0.66 times; 95% CI: 0.61–0.72).
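Estimates of this kind from a logistic regression are usually exponentiated coefficients with intervals computed on the log scale. The sketch below shows that relationship, assuming a Wald-type 95% interval (exp(b ± 1.96·SE)); the source does not state how the intervals were computed, so this is an assumption. As a check, it back-calculates the implied coefficient and standard error from the reported estimate for children aged 12–23 months, 1.27 (95% CI: 1.12–1.44), and regenerates the interval.

```python
import math

Z_95 = 1.959963984540054  # standard normal quantile for a two-sided 95% interval


def odds_ratio_ci(b, se, z=Z_95):
    """Exponentiate a coefficient and its Wald interval: exp(b), exp(b -/+ z*se)."""
    return math.exp(b), math.exp(b - z * se), math.exp(b + z * se)


def coefficient_from_reported(estimate, lower, upper, z=Z_95):
    """Recover the log-scale coefficient and standard error from a reported
    ratio and its interval, assuming the interval is a Wald interval."""
    b = math.log(estimate)
    se = (math.log(upper) - math.log(lower)) / (2.0 * z)
    return b, se


# Reported estimate for children aged 12-23 months (from the text above).
b, se = coefficient_from_reported(1.27, 1.12, 1.44)
estimate, lower, upper = odds_ratio_ci(b, se)
print(round(b, 3), round(se, 3), round(lower, 2), round(upper, 2))
```

The regenerated bounds agree with the reported interval to two decimal places, which is consistent with, though it does not prove, a symmetric interval on the log scale.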

For the random forest method, the important factors contributing to the prediction of ARI disease symptoms are shown in Table 5. Mother’s employment, season effect, region of residence, cooking energy, mother’s wealth status, place of delivery, and mother’s education were the potential risk factors contributing to ARI disease symptoms among children under the age of 5 years.

Discussion

This study builds upon the analysis of risk factors for ARI disease symptoms among children under the age of 5 years and compares the performance of various methods in predicting childhood ARI symptom outcomes. Using the better-performing methods, we analyzed socio-demographic, behavioral, and environmental factors contributing to childhood ARI disease symptoms in Uganda using the 2016 UDHS dataset. The results revealed that the random forest method was more accurate than the other methods considered in the analysis, followed by the logistic regression method (Table 3). In both methods, the employment of mothers in farming activities, the season effect, the region of residence, and the fuel used for cooking, such as firewood and charcoal, were found to be potential risk factors contributing to childhood ARI disease symptoms in Uganda. In addition, the young ages of mothers and children, breastfeeding, and wealth status were also found to be factors associated with ARI disease symptoms among children in this study.

The high prevalence of childhood ARI diseases found here is consistent with other studies conducted in Uganda [19, 20], and the high risk of childhood ARI disease symptoms associated with season and geographical region is in line with findings from studies conducted in neighboring countries such as Rwanda and Kenya [21,22,23]. A vulnerable region in Uganda, like the northern region, where people were forced to settle in camps because of the civil war in 1986, suffered from overcrowding and poor sanitation that accelerated the occurrence of disease [19]. The government should strengthen sanitation and appropriate health services in high-risk areas to eliminate regional differences in ARI diseases. However, a study conducted in the Gulu district, northern Uganda, reported that children living in urban areas were more likely to develop ARI symptoms than those living in rural areas [24].

The study findings also revealed that ARI symptoms were more common among children exposed to firewood smoke than among those exposed to charcoal smoke. This association between wood fuel and ARI symptoms is similar to the results of other studies conducted in sub-Saharan African countries [23, 25,26,27,28]. According to WHO reports, “Children exposed to cooking fuels and parental smoking are more likely to be at a high risk of having pneumonia and other respiratory infection diseases” [8]. Parents and communities need to be educated about the dangers of smoke to children, especially in places where smoking and firewood use are frequent [29].

The ARI factors, such as the education and employment of mothers, are consistent with other results found in Kenya, Ethiopia, and Rwanda [22, 30, 31]. However, these factors contradict findings from another study conducted in northern Uganda, because of the discrepancies in living standards and characteristics of the population studied. In the northern part, people suffered from overcrowding and poor sanitation, and most people were living in camps that encouraged disease occurrence and easy spread [19]. In the current study, children younger than 1 year old showed a higher risk of having ARI disease symptoms than children aged 48–59 months. These findings are supported by similar findings [29, 32,33,34]. The factors were related to the low rates of immunization in young children, low maternal literacy, and young mothers’ farming activities, which leave little time for the care of young children, particularly in sub-Saharan African countries, where health facilities and maternal healthcare education have to be improved.

Aside from the foregoing, this study provides evidence on parental behavior factors such as breastfeeding, which contradicts other findings [19, 20]. The current study showed that non-breastfed children whose mothers were teenagers were more likely to develop ARI disease symptoms than breastfed ones; in general, breastfeeding is important to the child’s nutrition and the good functioning of the child’s immune system.

Despite the strengths, limitations also have to be discussed. Parental smoking and child birth weight were found to be significantly associated with ARI disease among children under the age of five in other studies [26, 29, 35, 36]. Because much of the information on these two variables was missing in the 2016 UDHS dataset, these two risk factors could not be adequately assessed in the current study. In general, smoking harms the natural human defense of the respiratory system [37], especially in low birth-weight children. Government and community campaigns should educate people about the dangers of smoking to people’s health, particularly in households with young children.

In terms of the analysis methods, we used both new and traditional supervised analysis methods, such as machine learning algorithms and multivariate regression methods, to predict the outcomes of childhood ARI disease symptoms. These findings complement other comparative machine learning findings [38,39,40,41] in providing evidence of the better performance of the random forest algorithm (88.7%) compared with traditional methods of analysis. However, other studies [42, 43] contradicted these findings. Further research is needed to overcome these challenges and compare various analysis methods using nationwide cross-sectional survey datasets like the DHS data. Moreover, longitudinal data analysis can better examine the potential risk factors of ARI disease in children under the age of 5 years.

In summary, this paper revealed that the mother’s employment and age, child age, breastfeeding, wealth status, season effect, region of residence, and cooking fuel such as firewood and charcoal were potential risk factors for ARI disease symptoms in children under the age of 5 years. In this study, non-breastfed children whose mothers were teenagers were significantly more likely to develop ARI disease symptoms. Based on the results, policy-makers and health stakeholders should initiate target-oriented approaches to address the problems of poor child healthcare, improper environmental conditions, and childcare facilities. For the sake of early child care, government and family-level interventions should promote maternal education and, especially, child breastfeeding.