Application of random survival forests in understanding the determinants of under-five child mortality in Uganda in the presence of covariates that satisfy the proportional and non-proportional hazards assumption

Nasejje, Justine B.; Mwambi, Henry

doi:10.1186/s13104-017-2775-6

Application of random survival forests in understanding the determinants of under-five child mortality in Uganda in the presence of covariates that satisfy the proportional and non-proportional hazards assumption

Research article
Open access
Published: 07 September 2017

Volume 10, article number 459, (2017)
Cite this article

Download PDF

You have full access to this open access article

BMC Research Notes Aims and scope Submit manuscript

Application of random survival forests in understanding the determinants of under-five child mortality in Uganda in the presence of covariates that satisfy the proportional and non-proportional hazards assumption

Download PDF

Justine B. Nasejje¹ &
Henry Mwambi²

3594 Accesses
18 Citations
Explore all metrics

Abstract

Background

Uganda just like any other Sub-Saharan African country, has a high under-five child mortality rate. To inform policy on intervention strategies, sound statistical methods are required to critically identify factors strongly associated with under-five child mortality rates. The Cox proportional hazards model has been a common choice in analysing data to understand factors strongly associated with high child mortality rates taking age as the time-to-event variable. However, due to its restrictive proportional hazards (PH) assumption, some covariates of interest which do not satisfy the assumption are often excluded in the analysis to avoid mis-specifying the model. Otherwise using covariates that clearly violate the assumption would mean invalid results.

Methods

Survival trees and random survival forests are increasingly becoming popular in analysing survival data particularly in the case of large survey data and could be attractive alternatives to models with the restrictive PH assumption. In this article, we adopt random survival forests which have never been used in understanding factors affecting under-five child mortality rates in Uganda using Demographic and Health Survey data. Thus the first part of the analysis is based on the use of the classical Cox PH model and the second part of the analysis is based on the use of random survival forests in the presence of covariates that do not necessarily satisfy the PH assumption.

Results

Random survival forests and the Cox proportional hazards model agree that the sex of the household head, sex of the child, number of births in the past 1 year are strongly associated to under-five child mortality in Uganda given all the three covariates satisfy the PH assumption. Random survival forests further demonstrated that covariates that were originally excluded from the earlier analysis due to violation of the PH assumption were important in explaining under-five child mortality rates. These covariates include the number of children under the age of five in a household, number of births in the past 5 years, wealth index, total number of children ever born and the child’s birth order. The results further indicated that the predictive performance for random survival forests built using covariates including those that violate the PH assumption was higher than that for random survival forests built using only covariates that satisfy the PH assumption.

Conclusions

Random survival forests are appealing methods in analysing public health data to understand factors strongly associated with under-five child mortality rates especially in the presence of covariates that violate the proportional hazards assumption.

Understanding the social determinants of child mortality in Latin America over the last two decades: a machine learning approach

Article Open access 27 November 2023

Machine learning approach for predicting under-five mortality determinants in Ethiopia: evidence from the 2016 Ethiopian Demographic and Health Survey

Article Open access 04 November 2020

Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests

Article Open access 25 March 2022

Background

The third sustainable development goal states that ensuring healthy lives and promoting the well-being for all at all ages is essential to sustainable development [1, 2]. Critical among these age groups are the children under the age of five. In 2015, the United Nations recorded that a total of 17,000 fewer children died each day than was the case in 1990. However, more than six million children still die before their fifth birthday each year. Most of these deaths occur in Sub-Saharan Africa. Uganda in particular recorded an under-five mortality rate of 71.28 per 1000 live births in the period of 2005–2011 [3]. This rate is approximately 3 times the third sustainable development goal target of at least as low as 25 per 1000 live births [4].

Identifying factors strongly associated with under-five child mortality rates is a topic of increased research interest for most of the countries in Sub-Saharan Africa, Uganda included. Several statistical methods have been used in studies aimed at identifying factors that are strongly associated with under-five child mortality rates [5,6,7]. Most studies have employed standard survival methodologies like the Cox-proportional hazards model [8,9,10,11]. However, the model has constantly been criticized for its restrictive assumption commonly referred to as the proportional hazards (PH) assumption [12,13,14].

Extensions for this model to deal with survival data in situations where the PH assumption is violated have been suggested such as the extended Cox model [15,16,17]. The extended Cox model is more flexible and most importantly relaxes the standard assumptions of the original Cox model, this however, comes at a cost of a more complicated model. For example, employing a smooth spline helps one to explicitly specify the functions for the Cox regression relationship but it requires one to specify correct degrees of freedom, number and placement of the knot points and order of the regression spline model (which could be quadratic, cubic, quartic, some combination of different orders, among others). In addition, polynomial spline models must be constrained by goodness-of-fit characteristics based on the actual data, resulting in penalty functions and other such criteria that cannot be universally applied to varying datasets [18,19,20]. This implies therefore that the hazard estimates of the extended Cox model are dependent on the parameter and model specification considered. Estimates of both nonlinearity and time-dependence vary depending upon the degrees of freedom and other parameters. Furthermore, models that fit the data equally well can have different shapes for the hazard function and result in different hazard estimates. Relying heavily on hazard estimates based on these models may require a more skilled user methodologically because there is no standardized method for determining which parameters are most appropriate [20]. However, it should be noted that when all the covariates being considered satisfy the PH assumption then the Cox PH model is preferred.

Survival trees and random survival forests formally implemented in R [21, 22], are simple but robust methods that have been considered to be an attractive alternative model choice for survival data. These methods are extensions of classification and regression trees (CART) and random forests [23, 24]. The methods are fully non parametric, have fewer assumptions and can easily deal with high dimensional data [25]. Random survival forests do not impose a restrictive structure on how the variables should be combined. If the relationship between the predictor variables and the response variable is complex with non linear patterns and interactions then random survival forests are capable of incorporating this automatically [26, 27]. Most often researchers who use the Cox PH model for time-to-event data go ahead and use it even when covariates in the model do not satisfy the PH assumption and make interpretations as if the PH assumption holds for each covariate in the model. Random survival forests do not rely on this assumption for their validity thus this can protect a user who is not familiar with model enhancements such as the extended Cox model to deal with covariates that do not satisfy the restrictive PH assumption.

In a study to identify factors strongly associated to under-five child mortality rates in Uganda [3], many of the covariates were excluded from the Cox PH model analysis due to their violation of the PH assumption. Random survival forests were recommended as alternative methods for the study [3]. These methods have been found appropriate to use in the presence of covariates that do not satisfy the PH assumption or in situations where the relationship between the response and the covariates may be complicated [26, 27]. In this study, we re-analyse the dataset used in the study by [3] using both the Cox PH model and random survival forests where the former is used to emphasize the difference between them. We also investigate the predictive performance of the two random survival forest models used in this study in the presence of covariates that violate the PH assumption and compared these results with the predictive performance of the models used in the presence of only those covariates that satisfied the PH assumption.

Objective of the study

We implement random survival forests on Uganda Demographic Health Survey data for 2011 to determine factors strongly associated to under-five child mortality rates. First we compare the results from random survival forests with those of the Cox PH model in the presence of covariates that satisfy the PH assumption. We also fit random survival forests on our dataset including covariates that violate the PH assumption which were excluded in the first analysis [3]. We further discuss our findings on predictive performance for random survival forests in the presence of covariates that violate and those that do not violate the PH assumption.

The article is structured as follows: in the “Methods” section, we discuss the data and the methods used. The “Results” section presents results from the methods used. In the “Predictive performance” section, we present the results on predictive performance of the methods used. We state the general discussion and conclusions from this study in the “Discussion” and “Conclusions” section, respectively. Appendices 1 and 2 are provided as additional materials to describe the models and the methods used to evaluate the models, respectively.

Methods

Data

To understand factors affecting under-five child mortality rates in Uganda, the 2011 Uganda Demographic Health Survey (UDHS) data was used [3]. This dataset was collected from May 2011 through to December 2011. This was the fifth comprehensive survey conducted in Uganda as part of the worldwide Demographic and Health Surveys [28]. A representative sample of 10,086 households was selected during the 2011 UDHS. The sample was selected in two stages. A total of 404 enumeration areas (EAs) were selected from among a list of clusters sampled for the 2009/10 Uganda National Household Survey (2010 UNHS). In the second stage of sampling, households in each cluster were selected from a complete listing of households. Eligible women for the interview were aged between 15 and 49 years of age who were either usual residents or visitors present in the selected household on the night before the survey. Out of 9247 eligible women, 8674 were successively interviewed with a response rate of \(94 \%\,(91 \%\) in urban and \(95\%\) in rural areas). The study population for this analysis includes infants born between exactly one and 5 years preceding the 2011 UDHS.

Exploratory data analysis

Covariates

In this study, 19 covariates are considered as candidates for analysis and their choice was based on related literature [29,30,31]. To some extent, other limitations like high level of missingness in the dataset influenced our covariate choice. The covariates include; mother’s age group (<20, 20–29, 30–39, 40+ years); type of residence (urban, rural); mother’s level of education (illiterate, primary, secondary and higher); partner’s level of education (illiterate, primary, secondary and higher); birth status (singleton birth, multiple births); sex of the child (male, female); wealth index (poorest, poorer, middle, richer, richest); children ever born (one child, two children, three children, four and more); birth order (first child, second to third child, 4th–6th child); religion (Catholic, Muslim, other Christians, others); types of toilet facility (flush toilet, pit latrine, no facility); mother’s occupation (not-working, sales and service, agriculture); current working status (working, not working); births in the past 1 year (no births, 1-birth, 2-births); births in the past 5 years (1-birth, 2-births, 3-births, 4-births); children under the age of five in the household (no child, one child, two children, three children, four children); sex of the household head (male, female); source of drinking water (piped water, borehole, well, surface/rain/pond/lake, others); mother’s age at first birth (less than 20, 20–29, 30–39 years). Note that all covariates are categorical. The categories of covariates that were not originally categorical, were created based on other similar studies in literature [31].

Table 1 The distribution of births and deaths by survival determinants

Full size table

Table 1 shows the distribution of deaths for children under the age of five across all covariates considered in the study. The percentages of deaths for each of the covariate categories is stated in the second column of Table 1. For example, \(7.7\%\) of children born to mothers with no education died before celebrating their fifth birthday. This is the highest percentage compared to those children born of mothers with primary education which is \(6.4\%\) and secondary or higher education which is \(4.2\%\). Covariates with categories that have the highest percentage of deaths include number of children in the household under the age of five, number of births in the past 5 years, number of births in the past 1 year, birth status and lastly age of the mother at first birth.

Dependent variable

Under-five child mortality rate is defined as the mortality rate from the age of 1 month to the age of 59 months. Thus the dependent variable used in our analysis is the time-to-event which in our case is the age of a child reported at the time of the interview (survey) for those still alive or the age of the child when he/she died. Thus children under the age of five that were still alive at the date of the interview were considered to be right censored.

Analysis methods

The Cox proportional hazards model and random survival forests are both used in this analysis to identify factors that affect under-five child survival in Uganda. Two random survival forest implementations are used. The first forest is constructed on survival trees that are built using the log-rank split-rule. The second forest is constructed on survival trees built using the log-rank score split-rule. Note that the split-rule based on the log-rank score is desirable in the presence of tied event times. To evaluate the predictive performance for the models used, cross-validated integrated brier scores are used. The Cox PH model and the two random survival forest implementations are described in detail in Additional file 1: Appendix 1. To evaluate the predictive performance for the models used, cross-validated integrated brier scores are used and these are described in detail in Additional file 1: Appendix 2. Note that Appendices 1 and 2 are given as additional material in Addition file 1: Appendices 1 and 2.

Results

Proportional hazards analysis

Cox proportional hazards model

To use the Cox PH model, it is important to establish which covariates in the dataset satisfy the PH assumption. We used the Schoenfeld residual test [32,33,34] in R an open source software [35] using the command cox.zph. Under this test, it is assumed that regression parameters are constant over time, hence the corresponding hazard ratios are constant over time. All those regression parameters (covariate effects) that changed with time, do not satisfy the PH assumption and therefore do not qualify to be entered in the final Cox PH model. Note that as our first step, we fitted a Cox PH model on all covariates considered in the study and then obtained Schoenfeld residuals. Results from this analysis are presented in Table 2. Covariates that violated the PH assumption include: mother’s education level, total number of children ever born, type of residence, wealth index, birth order, number of births in the past 5 years, mother’s occupation and type of birth. These covariates were, therefore, not included in the final Cox PH analysis.

Table 2 Testing the proportional hazard assumption using scaled Schoenfeld residuals

Full size table

It is important to note that graphical methods can also be used to identify covariates that may potentially violate the PH assumption but are not statistical tests except for an initial exploratory assessment before a formal statistical test. Covariates with categories whose survival curves intersect or diverge disproportionately from each other over time are known to violate the PH assumption.

Figures 1 and 2 illustrate a graphical method mentioned above for assessing PH assumption using two covariates that have been identified as those that violate the PH assumption. Both figures give supporting evidence to violate the PH assumption by the two covariates considered. We fitted a univariate and a multivariate Cox PH model on all covariates that did not violate the PH assumption. The results from this analysis are presented in Table 3. Sex of the child, sex of the household head and number of births in the past 1 year are the factors strongly associated with under-five child mortality rate in Uganda. The results suggest that a girl child has a \(17\%\) lower hazard of death compared to the boy child. Children born in households headed by females have a \(30\%\) higher hazard of death than those born in households headed by males. The results further suggest that mothers who had more than one birth in a year put their children at a higher hazard of death than those with no birth. The hazard of death for children born of mothers who had 2 births in the past 1 year was 2.34-fold higher than those born of mothers with no birth in the past 1 year. Lastly, children whose fathers had secondary and higher education were at a lower hazard of death compared to those born of illiterate fathers.

Table 3 The adjusted and unadjusted hazard ratios from fitting the Cox-proportional hazard model for only those covariates that satisfy the proportionality hazard assumption

Full size table

Using the Akaike information criteria (AIC) [36], the best fitting Cox PH model had four covariates namely: father’s education, sex of the child, mother’s age group and sex of the household head.

Table 4 The best fitting Cox proportional hazards model

Full size table

Results presented in Table 4 confirm that sex of the child, sex of the household head and number of births in the last 1 year were strongly associated with under-five child mortality rates in Uganda. Children whose father’s education level is secondary and higher had a lower hazard of death compared to children whose fathers were illiterate. There was no significant difference in the hazard of death for children whose fathers were illiterate or had primary education. Mother’s age group was not significant but the age groups considered gave some interesting results. Children born of mothers below 20 years of age had a higher hazard of death than those born of mothers aged between 20 and 29 years of age. There was no significant difference between the hazard of death for children under the age of five born of mothers below 20 years and those who were 40\(+\) years of age. This indicates that women who give birth before 20 years of age and those who give birth after 40 years of age, put their children at an equally higher hazard of death before celebrating their fifth birthday.

We graphically illustrate the results for two of the covariates considered to be strongly associated to under-five child mortality rates in Uganda using survival curves.

Figures 3 and 4 illustrate survival curves for the two selected covariates. The survival curve for girls is above that of boys and hence indicates a better survival rate for girls. Female headed households were also associated with a higher hazard of death for children under the age of five compared to male headed households.

Random survival forests built using covariates that satisfy the PH property

We fitted two random survival forest models on the dataset, that is, the one based on survival trees built using the log-rank and the log-rank score split-rules, respectively. Note that these two models were built using only covariates that were identified as satisfying the PH assumption. Characteristics of the two forests are presented in Table 5 below.

Table 5 Characteristics of the two fitted forests

Full size table

To identify the most important covariates in explaining survival of children under the age of five in Uganda, permutation importance was used as the measure of variable importance [22, 26, 37]. Results from fitting a random survival forest of 1000 survival trees built using the log-rank split-rule are summarised in Fig. 5. They indicate that sex of the household head (SHH), religion (RELI), father’s education (FE), source of drinking water (SDW), number of births in the past 1 year (BP1Y) and sex of the child (SC) are the most important covariates strongly associated to under-five child mortality rates in Uganda. These results are in agreement with the results obtained from fitting a multivariate Cox PH model presented in Table 3 as far as significant effects are concerned but it is interesting to note that the random survival forest model did pick other covariates as important, namely, religion and source of drinking water. The error rate for any new prediction and in this case the out-of-bag prediction error rate was \(47.32\%\).

For comparison, we also fitted a random survival forest model with survival trees built using the log-rank score split-rule.

The results on variable importance presented in Fig. 5 are similar to the results in Fig. 6. The figures further indicate that the two survival forest models have an approximately equal error rate which confirms or is in agreement with a study by [38] where the two models were found to have a similar predictive performance.

Random survival forests built using covariates with or without the PH property

Survival trees and random survival forests divide the covariate space into subgroups of good and poor survival experience predictors. They are therefore promising methods in analysing survival data in the presence of non-proportional hazards [27]. We fitted random survival forest models under the two split rules (log-rank and log-rank score, respectively) on the 2011 Uganda Demographic Health Survey dataset. We considered all covariates in the analysis including those that violated the PH assumption. The characteristics of these two forests are presented in Table 6 below.

Table 6 Characteristics of the two fitted forests

Full size table

The error rates from the out-of-bag sample for the forests built with survival trees based on the log-rank and the log-rank score split-rules are 17.29 and 19.69, respectively. These two error rates are much lower compared to the error rates for survival forests built based on only covariates that satisfy the PH assumption. This result confirms the improved performance of random survival forests in the presence of non-proportional hazards covariates [27]. However, making this conclusion based on the out-of-bag error rate may not be sufficient. It is also important to note that it is expected of the error rate to decrease with addition of more covariates. However, the key point in the above analysis is that the importance of covariates that satisfied and those that violated the PH assumption were evaluated.

The results on factors associated with under-five mortality rate, together with the prediction error rate curves for the two random survival forest models, are presented in Figs. 7 and 8.

Results from both forests indicate that the number of children under the age of five in the household (CUF) highly influences under-five child mortality rate in Uganda. Other covariates that are strongly associated to under-five child mortality in Uganda as ranked by the forest according to their importance include: the number of births in the past 5 years (BP5Y), birth order (BORD), wealth index (WI) and the total number of children ever born (CEB). Note that the number of children under the age of five in the household had the highest percentage of death as seen in Table 1.

Covariates that were strongly associated to under-five child mortality rates in Uganda in the presence of proportional hazards show up among other covariates but do not appear to be highly ranked. This result indicates that excluding covariates in the analysis of survival data due to violation of the PH assumption leads to loss of information. We see this as a very important property for random survival forests demonstrated in these two analyses namely, the choice of covariates in the model do not need a priori to rely on the too restrictive PH assumption. This is a demonstration of flexibility on the part of random survival forests as an additional attractive property compared to models that rely on the strict PH assumption. We can, therefore, conclude that random survival forests are good alternative models to use while identifying factors affecting under-five mortality rates especially in the presence of non-proportional hazards covariates. To verify this results, we used integrated brier scores [39] as a measure of predictive performance as presented in the next section.

Predictive performance

The predictive performance for the models used was evaluated using the integrated brier scores [39], presented in Additional file 1: Appendix 2. We used the pec package [40] in R [35] for this analysis. Prediction error rates of \(50\%\) or higher are useless because they are no better than tossing a coin [26, 41].

The results in Fig. 9 show that models used in this analysis have a good predictive performance. In the presence of non-proportional hazards covariates, random survival forest models under the two split rules (log-rank and log-rank score, respectively) show a much better predictive performance. Their predictive performance exhibited is better than that of models based strictly on the PH assumption. In the presence of proportional hazards, however, the Cox model shows a better predictive performance compared to the two random survival forests models. This strengthens the recommendation that if all covariates satisfy the PH assumption, the Cox PH model is preferable.

The good predictive performance for random survival forests in the presence of non-proportional hazards covariates is an appealing result in the analysis of survival data especially that from public health. This is because covariates with non-proportional hazards have often been excluded in the analysis of survival data especially when the standard Cox proportional hazards model was being used for analysis. In some cases, other models like the extended Cox model have been used but they are known to have some restrictive formulation complexities. Using a stratified Cox PH model is another alternative to dealing with covariates that do not satisfy the PH assumption. However, the downside of this approach is that if a covariate is used as a stratifying variable its effect on the outcome cannot be estimated yet a researcher(s) might be interested in its effect. Random survival forests are flexible and have fewer assumptions. They are, therefore, plausible alternative models in analysing survival data to understand factors affecting under-five mortality rates in the presence of proportional and non-proportional hazards. However, further research is required on the merits and demerits of the methods.

Discussion

Survival trees and random survival forests are increasingly becoming popular alternative models for the analysis of time-to-event outcomes [42]. They have been identified as suitable models in analysing survival data in situations where the proportional hazards assumption is violated [27, 43]. However, not much literature is available to confirm the assertion. In this study, we have therefore compared the predictive performance of the Cox proportional hazards model to the random survival forests by re-analysing a dataset that was first analysed by [3]. The study further compares the performance of random survival forests on the same dataset in the presence of covariates that violate the proportional hazards assumption to that when these covariates are excluded. Under the PH assumption, the three models show that sex of the household head, sex of the child and the number of births in the past 1 year are strongly associated to under-five child mortality rate in Uganda.

Other covariates such as source of drinking water, Father’s education and religion show up as important in explaining under-five child mortality rates in Uganda with random survival forest models. However, these covariates did not appear to be very strongly associated to under-five child mortality rate in the Cox proportional model. It is interesting to note that random survival forest models give additional information in regard to variable importance.

Results from the two forest models in the presence of non-proportional hazards show that the number of children under the age of five in a household, greatly influences under-five child mortality rates. This ranks top in the two random survival forest models. Other factors ranked as important in understanding under-five child mortality rates by random survival forests in the presence of non-proportional hazards covariates are: births in the past 5 years, wealth index, birth order and total number of children ever born. Similar factors have emerged to be strongly associated to under-five child mortality rates in other studies [3, 29, 44, 45].

To compare the predictive performance of these three models on the scenarios considered, we used integrated brier scores via cross-validation. The Cox proportional hazards model had a better predictive performance in the presence of only those covariates that satisfy the proportional hazards assumption compared to the two random survival forest models. This result may not be seen as a surprise because the Cox PH model works best under this assumption from which its original formulation by [8] is based. The result is further confirmed because the two random survival models had a high out-of-bag error rate of 47.36 and 47.32%, respectively. The out-of-bag error rate for the two random survival forest models (RSFLR, RSFLRS) in the presence of proportional hazards are higher compared to those of random survival forest models (RSFLRNON, RSFLRSNON) in the presence of non-proportional hazards covariates. This implies that excluding covariates that have non-proportional hazards in the analysis gives less informative results. The results further confirm that random survival forests are robust in approximating complex survival functions, including functions based on covariates with non-proportional hazards, while maintaining low prediction error rates [27, 46,47,48].

However, since most aspects of these models are under development, it is recommended that one uses them hand in hand with the standard methods like the Cox proportional hazards model. The same recommendation was made in other studies related to random forests [42, 47, 49, 50]. It has also been established that random survival forests are useful in situations where the relationship between the response and the predictors may be complicated [26]. However, there are concerns that survival trees are built using the log-rank split-rule whose power to discriminate between two groups is highest when the proportionality hazards assumption holds. This may have an impact on the predictive performance of the survival forest model. This is important especially when the survival (or hazard) functions cross each other in the two groups being compared [51]. However, more research is needed to fully ascertain this fact especially in the presence of non-proportional hazards. More research will also guide scholars to the best split-rule that may help in such circumstances. A recent study [51] has recommended the use of the integrated absolute difference between the two daughter nodes’ survival functions as the splitting rule in circumstances where the hazard function cross. They have concluded that forests built with this rule produce very good results in general, and that they are often better compared to forests built with the log-rank splitting rule.

Conclusions

The study confirms that random survival forests have a good predictive performance in the presence of non-proportional hazards [27]. It is, therefore, clear that these methods are promising alternatives to models that rely heavily on the proportional hazards assumption where the presence of covariates that violate the proportional hazards assumption is inevitable.

This study has demonstrated that the Cox PH model and random survival forests could cleverly be used in a complementary manner to fully model and analyse survival data in the presence of proportional and non-proportional hazards. The good predictive performance shown by the two random survival forest models in the presence of non-proportional hazards covariates for this dataset implies that these models could be alternative models in analysing survival datasets especially when the assumption is violated. Our conclusions on the use of random survival forests to analyse survival data are in agreement with the recommendations by [26, 50]. Obvious extensions that came to light when dealing with large survey data is when there are outcomes and covariates with missing data. We propose combining random survival forests with multiple imputation methods to reduce the loss of information. The combined approach will be to apply random survival forests after multiple imputation. A limitations to this study is that we have used random survival forest models that have been identified to favour to covariates with many split points in survival tree building [52,53,54,55]. Given the fact that most of our covariates were categorical with more than two categorises, biased results on estimates such as variable importance are inevitable [53, 55]. Our recent study[56] has therefore recommended the use of conditional inference forests suggested by [57] in the presence of covariates with many split points.

Abbreviations

PH:: proportional hazards
HR:: hazard ratio
CART:: classification and regression trees
UDHS:: Uganda Demographic Health Survey
SHH:: sex of the household head
RELI:: religion
FE:: Father’s education
SDW:: source of drinking water
BP1Y:: number of births in the past one year
SC:: sex of the child
CUF:: number of children under the age of five in the household
BP5Y:: number of births in the past five years
BORD:: birth order
WI:: wealth index
CEB:: total number of children ever born
RSFLR:: random survival forests with log-rank split-rule on data with proportional hazards
RSFLRS:: random survival forests with log-rank score split-rule on data with proportional hazards
RSFLRNON:: random survival forests with log-rank split-rule on data with both proportional and non-proportional hazards
RSFLRSNON:: random survival forests with log-rank score split-rule on data with both proportional and non-proportional hazards
OOB:: out-of-bag
IBS:: integrated brier scores

References

Norheim OF, Jha P, Admasu K, Godal T, Hum RJ, Kruk ME, Gómez-Dantés O, Mathers CD, Pan H, Sepúlveda J, et al. Avoiding 40% of the premature deaths in each country, 2010–30: review of national mortality trends to help quantify the un sustainable development goal for health. Lancet. 2015;385(9964):239–52.
Article PubMed Google Scholar
Sachs JD. From millennium development goals to sustainable development goals. Lancet. 2012;379(9832):2206–11.
Article PubMed Google Scholar
Nasejje JB, Mwambi HG, Achia TN. Understanding the determinants of under-five child mortality in uganda including the estimation of unobserved household and community effects using both frequentist and bayesian survival analysis approaches. BMC Public Health. 2015;15(1):1.
Article Google Scholar
World Health Organization et al. Global health observatory (gho) data. Life expectancy; 2015.
Rutstein SO. Effects of preceding birth intervals on neonatal, infant and under-five years mortality and nutritional status in developing countries: evidence from the demographic and health surveys. Int J Gynecol Obstet. 2005;89:7–24.
Article Google Scholar
Åsling-Monemi K, Tabassum Naved R, Persson LÅ. Violence against women and the risk of under-five mortality: analysis of community-based data from rural bangladesh. Acta Paediatrica. 2008;97(2):226–32.
Article PubMed Google Scholar
Rajaratnam JK, Tran LN, Lopez AD, Murray CJ. Measuring under-five mortality: validation of new low-cost methods. PLoS Med. 2010;7(4):1000253.
Article Google Scholar
Cox DR. Regression models and life-tables. J R Stat Soc Ser B (Methodological). 1972:187–220.
Manda SO. Birth intervals, breastfeeding and determinants of childhood mortality in malawi. Soc Sci Med. 1999;48(3):301–12.
Article CAS PubMed Google Scholar
Asefa M, Drewett R, Tessema F. A birth cohort study in south-west ethiopia to identify factors associated with infant mortality that are amenable for intervention. Ethiop J Health Dev. 2000;14(2):161–8.
Article Google Scholar
Kembo J, Van Ginneken JK. Determinants of infant and child mortality in Zimbabwe: eesults of multivariate hazard analysis. Demogr Res. 2009;21:367–84.
Article Google Scholar
Platt RW, Joseph K, Ananth CV, Grondines J, Abrahamowicz M, Kramer MS. A proportional hazards model with time-dependent covariates and time-varying effects for analysis of fetal and infant death. Am J Epidemiol. 2004;160(3):199–206.
Article PubMed Google Scholar
Ng’andu NH. An empirical comparison of statistical tests for assessing the proportional hazards assumption of Cox’s model. Stat Med. 1997;16(6):611–26.
Article PubMed Google Scholar
Fisher LD, Lin DY. Time-dependent covariates in the Cox proportional-hazards regression model. Annu Rev Public Health. 1999;20(1):145–57.
Article CAS PubMed Google Scholar
Therneau TM. Extending the Cox model. In: Proceedings of the First Seattle symposium in biostatistics. New York: Springer; 1997. p. 51–84.
Wei L. The accelerated failure time model: a useful alternative to the Cox regression model in survival analysis. Stat Med. 1992;11(14–15):1871–9.
Article CAS PubMed Google Scholar
Therneau TM, Grambsch PM. Modelings survival data: extending the Cox model. New York: Springer-Verlag; 2000.
Book Google Scholar
Schwartz J, Coull B, Laden F, Ryan L. The effect of dose and timing of dose on the association between airborne particles and survival. Environ Health Perspect. 2008;116:64–9.
Article PubMed Google Scholar
Abrahamowicz M, Schopflocher T, Leffondré K, du Berger R, Krewski D. Flexible modeling of exposure-response relationship between long-term average levels of particulate air pollution and mortality in the american cancer society study. J Toxicol Environ Health Part A. 2003;66(16–19):1625–54.
Article CAS PubMed Google Scholar
Krewski D, Burnett RT, Goldberg MS, Hoover K, Siemiatycki J. Special report reanalysis of the Harvard six cities study and the American Cancer Society Study of particulate air pollution and mortality part ii: sensitivity analyses appendix c. Flexible modeling of the effects of fine particles. Boston: Health Effects Institute; 2000.
Google Scholar
Hothorn T, Hornik K, Strobl C, Zeileis A. Party: a laboratory for recursive partytioning. 2010. https://cran.r-project.org/, R package version 1.2-3.
Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;2:841–60.
Article Google Scholar
Breiman L, Friedman J, Stone CJ, Olshen RA. Classification and regression trees. Wadsworth: Belmont; 1984.
Google Scholar
Breiman L. Random forests. Mach Learn. 2001;45(1):5–32.
Article Google Scholar
Fernández T, Rivera N, Teh YW. Gaussian processes for survival analysis. In: Advances in neural information processing systems. New York: Curran Associates; 2016. p. 5015–23.
Taylor JM. Random survival forests. J Thorac Oncol. 2011;6(12):1974–5.
Article PubMed Google Scholar
Ehrlinger J, Rajeswaran J, Blackstone EH. ggrandomforests: exploring random forest survival. R Vignette; 2016.
Corsi DJ, Neuman M, Finlay JE, Subramanian S. Demographic and health surveys: a profile. Int J Epidemiol. 2012;41(6):1602–13.
Article PubMed Google Scholar
Ssewanyana S, Younger SD. Infant mortality in uganda: determinants, trends and the millennium development goals. J Afr Econ. 2008;17(1):34–61.
Article Google Scholar
Ayiko R, Antai D, Kulane A. Trends and determinants of under-five mortality in Uganda. East Afr J Public Health. 2009;6(2):136–40.
PubMed Google Scholar
Demombynes G, Trommlerová SK. What has driven the decline of infant mortality in kenya? Policy research working paper No. WPS 60572010. Washington: World Bank; 2012.
Book Google Scholar
Grambsch PM, Therneau TM. Proportional hazards tests and diagnostics based on weighted residuals. Biometrika. 1994;81(3):515–26.
Article Google Scholar
Schoenfeld D. Partial residuals for the proportional hazards regression model. Biometrika. 1982;69(1):239–41.
Article Google Scholar
Hess KR. Graphical methods for assessing violations of the proportional hazards assumption in Cox regression. Stat Med. 1995;14(15):1707–23.
Article CAS PubMed Google Scholar
R Core Team. R: a a language and environment for statistical computing. Vienna: R Foundation for Statistical Computing; 2016. https://www.r-project.org.
Akaike H. Likelihood of a model and information criteria. J Econom. 1981;16(1):3–14.
Article Google Scholar
Strobl C, Boulesteix A, Kneib T, Augustin T, Zeileis A. Conditional variable importance for random forests. BMC Bioinform. 2008;9.
Omurlu IK, Ture M, Tokatli F. The comparisons of random survival forests and Cox regression analysis with simulation and an application related to breast cancer. Expert Syst Appl. 2009;36(4):8582–8.
Article Google Scholar
Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999;18(17–18):2529–45.
Article CAS PubMed Google Scholar
Gerds TA, doMC C, Gerds MTA. Prediction error curves for survival models; r package pec. Version 2.5.3. Vienna: R Foundation for Statistical Computing; 2015. urlhttps://cran.r-project.org/web/packages/pec/index.html.
Chen G, Kim S, Taylor JM, Wang Z, Lee O, Ramnath N, Reddy RM, Lin J, Chang AC, Orringer MB, et al. Development and validation of a quantitative real-time polymerase chain reaction classifier for lung cancer prognosis. J Thorac Oncol. 2011;6(9):1481–7.
Article PubMed PubMed Central Google Scholar
Segal MR, Bloch DA. A comparison of estimated proportional hazards models and regression trees. Stat Med. 1989;8(5):539–50.
Article CAS PubMed Google Scholar
Gerds TA, Kattan MW, Schumacher M, Yu C. Estimating a time-dependent concordance index for survival prediction models with covariate dependent censoring. Stat Med. 2013;32(13):2173–84.
Article PubMed Google Scholar
Buor D. Mothers’ education and childhood mortality in Ghana. Health Policy. 2003;64(3):297–309.
Article PubMed Google Scholar
Mondal N, Hossain K, Ali K, et al. Factors influencing infant and child mortality: a case study of Rajshahi district, Bangladesh. J Hum Ecol. 2009;26(1):31–9.
Google Scholar
Hsich E, Gorodeski EZ, Blackstone EH, Ishwaran H, Lauer MS. Identifying important risk factors for survival in patient with systolic heart failure using random survival forests. Circulation. 2011;4(1):39–45.
PubMed Google Scholar
Datema FR, Moya A, Krause P, Bäck T, Willmes L, Langeveld T, de Jong B, Robert J, Blom HM. Novel head and neck cancer survival analysis approach: random survival forests versus Cox proportional hazards regression. Head Neck. 2012;34(1):50–8.
Article PubMed Google Scholar
Hamidi O, Poorolajal J, Farhadian M, Tapak L. Identifying important risk factors for survival in kidney graft failure patients using random survival forests. Iran J Public Health. 2016;45(1):27.
PubMed PubMed Central Google Scholar
Walschaerts M, Leconte E, Besse P. Stable variable selection for right censored data: comparison of methods. Toulouse: Toulouse School of Economics (TSE); 2012. p. 1903–4928.
Google Scholar
Jones Z, Linder F. Exploratory data analysis using random forests. In: Prepared for the 73rd Annual MPSA Conference; 2015.
Moradian H, Larocque D, Bellavance F. L_1 splitting rules in survival forests. Lifetime Data Anal. 2016: 1–21.
Ziegler A, König IR. Mining data with random forests: current options for real-world applications. Wiley Interdiscip Rev. 2014;4(1):55–63.
Google Scholar
Strobl C, Boulesteix A-L, Zeileis A, Hothorn T. Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinform. 2007;8(1):1.
Article Google Scholar
Loh W-Y. Fifty years of classification and regression trees. Int Stat Rev. 2014;82(3):329–48.
Article Google Scholar
Wright MN, Dankowski T, Ziegler A. Unbiased split variable selection for random survival forests using maximally selected rank statistics. Stat Med. 2017;36(8):1272–84. doi:10.1002/sim.7212.
Article PubMed Google Scholar
Nasejje JB, Mwambi H, Dheda K, Lesosky M. A comparison of the conditional inference survival forest model to random survival forests based on a simulation study as well as on two applications with time-to-event data. BMC Med Res Methodol. 2017;17(1):115.
Article PubMed PubMed Central Google Scholar
Hothorn T, Hornik K, Zeileis A. Unbiased recursive partitioning: a conditional inference framework. J Comput Gr Stat. 2006;15:651–74.
Article Google Scholar

Download references

Authors’ contributions

Authors, JBN and HM conceived the concept, JBN, analysed the data. JBN and HM prepared the manuscript. Both authors read and approved the final manuscript.

Acknowledgements

We thank the German Academic Exchange Service (DAAD) and the University of Kwazulu-Natal for supporting the first author in her Ph.D. studies. We sincerely thank The Demographic Health Survey program for providing the data used in this study.

Competing interests

Both authors declare that they have no competing interests.

Availability of data and materials

The authors confirm that all data underlying the findings are fully available without restriction. The data is held by the Demographic and Health Survey Program and freely available to the public but a request has to be sent to the Demographic and Health Survey Program.

Consent to publish

The Demographic Health Survey Data is collected according to the rules and guidelines stipulated by WHO World Health Survey on consent from the participants. Some of these rules include but not limited to; participation in the survey is voluntary and the respondent can refuse to be interviewed. The interviewer is responsible for explaining what the survey is about, providing all the necessary information, and making sure the respondent understands the implications of his/her participation before giving his/her consent. The information given should be simple and clear and adapted to the respondent’s level of understanding. Consents must be documented by asking the respondents to sign an Informed Consent Forms (Household Informant Consent Form; Individual Consent Form) before doing the interview. These forms must mention who will be doing the study, the types of questions that will be asked, why the study is being done, and who will have access to the information provided.

Ethics approval and consent to participate

The statement is available on the DHS ethical clearance certificate and it states that: The IRB-approved procedures for DHS public-use datasets do not in any way allow respondents, households, or sample communities to be identified. There are no names of individuals or household addresses in the data files. The geographic identifiers only go down to the regional level (where regions are typically very large geographical areas encompassing several states/provinces). Each enumeration area (Primary Sampling Unit) has a PSU number in the data file, but the PSU numbers do not have any labels to indicate their names or locations. Before each interview is conducted, an informed consent statement is read to the respondent, who may accept or decline to participate. A parent or guardian must provide consent prior to participation by a child or adolescent. The informed consent statement emphasizes that participation is voluntary; that the respondent may refuse to answer any question, or terminate participation at any time; and that the respondent’s identity and information will be kept strictly confidential. The study was also submitted to the ethics committee of University of Kwazulu-Natal and they stated that no ethical clearance was required.

Funding

We thank the University of Kwazulu-Natal for supporting JN in her PH.D. work. The first author is also financially supported by DAAD.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Author information

Authors and Affiliations

School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, 250 King Edward Avenue, Scottsville, Pietermaritzburg, 3201, South Africa
Justine B. Nasejje
School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Scottsville, Pietermaritzburg, 3209, South Africa
Henry Mwambi

Authors

Justine B. Nasejje
View author publications
You can also search for this author in PubMed Google Scholar
Henry Mwambi
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Justine B. Nasejje.

Additional file

13104_2017_2775_MOESM1_ESM.pdf

Additional file 1. Methods and model evaluation techniques. The file contains algorithms for the methods used in this study together with their evaluation technique.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Nasejje, J.B., Mwambi, H. Application of random survival forests in understanding the determinants of under-five child mortality in Uganda in the presence of covariates that satisfy the proportional and non-proportional hazards assumption. BMC Res Notes 10, 459 (2017). https://doi.org/10.1186/s13104-017-2775-6

Download citation

Received: 01 May 2016
Accepted: 31 August 2017
Published: 07 September 2017
DOI: https://doi.org/10.1186/s13104-017-2775-6

Application of random survival forests in understanding the determinants of under-five child mortality in Uganda in the presence of covariates that satisfy the proportional and non-proportional hazards assumption

Abstract

Background

Methods

Results

Conclusions

Similar content being viewed by others

Understanding the social determinants of child mortality in Latin America over the last two decades: a machine learning approach

Machine learning approach for predicting under-five mortality determinants in Ethiopia: evidence from the 2016 Ethiopian Demographic and Health Survey

Applying machine learning to social datasets: a study of migration in southwestern Bangladesh using random forests

Background

Objective of the study

Methods

Data

Exploratory data analysis

Covariates

Dependent variable

Analysis methods

Results

Proportional hazards analysis

Cox proportional hazards model

Random survival forests built using covariates that satisfy the PH property

Random survival forests built using covariates with or without the PH property

Predictive performance

Discussion

Conclusions

Abbreviations

References

Authors’ contributions

Acknowledgements

Competing interests

Availability of data and materials

Consent to publish

Ethics approval and consent to participate

Funding

Publisher’s Note

Author information

Authors and Affiliations

Corresponding author

Additional file

13104_2017_2775_MOESM1_ESM.pdf

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation