Background

Population aging is a challenge for all countries, and it is particularly acute in low- and middle-income countries [1]. The incidence of physical disability increases with age, and older populations are highly heterogeneous in this respect [2]. Disability refers to difficulty with, or inability to perform, tasks essential to everyday life; it affects social roles and may result from physical, emotional, cognitive, or sensory limitations [3]. The main pathway to disability comprises four successive stages: pathology (the existence of disease or injury), impairment (dysfunction or structural abnormality), functional limitation (limitation of basic physical or mental activity), and disability (difficulty performing activities of daily living, ADL) [4]. Depression accelerates this process, especially in its early and late stages, so reducing depression may help slow the progression of disability in people at those stages [5]. Compared with older adults without disability, disabled older adults are more likely to have symptoms of depression [6, 7]. Özlem and Ünsal found that the rate of depression was 57.8% (n = 201) among disabled people [8]. Depression among the disabled elderly therefore deserves social attention.

Previous studies have also examined the factors that influence depression in disabled adults [8]. In Özlem’s cross-sectional study, depression in disabled adults was associated with demographic characteristics (married people were more likely to develop depression), health behaviors (people who smoked or drank were more likely to develop depression), and family status (people with three or more children were more likely to develop depression) [9]. Examining the factors that influence exit from depression among the disabled, JunSu found that socioeconomic factors (gender, marital status, and regional location), psychosocial factors (self-esteem and satisfaction with leisure and recreation), and disability- and health-related factors affected recovery from depression [10].

Previous studies have shown that the risk of depression differs between urban and rural elderly, possibly because of differences in their living environments [8]. Among Chinese older adults, Yu found that the prevalence of depression was 16.4% in urban areas and 30.0% in rural areas [11]. However, the dynamics of depression among urban and rural people are complex and may vary with health status, population, and national context [8, 12].

Machine learning is increasingly used in depression research [13,14,15]. Compared with human experts, medical decision support systems can handle relatively large amounts of data, high computational complexity, and large storage requirements [16]. Random forest is a flexible and easy-to-use machine learning algorithm that can be used for both classification and regression. Previous studies have applied random forest classifiers to predict depression in different populations. In predicting depression among caregivers of patients with Alzheimer’s disease, Byeon showed that gender, subjective health status, disease or accident experience within the past 2 weeks, the frequency of meeting relatives, economic activity, and mean monthly household income were significant predictors [17]. Gokten and Uyulan used a random forest classifier to predict the development of depression and post-traumatic stress disorder in sexually abused children and found that the most important features of the prediction model were time after abuse, type of abuse, and smoking [14]. Because the study populations differ, the important predictors differ greatly.

However, to the best of our knowledge, few studies have built a machine learning-based model to predict depression among the disabled elderly, and little research has examined how the factors influencing depressive symptoms differ between urban and rural elderly or how much each factor contributes to the disparity in depressive symptoms.

Methods

Data collection

The data were derived from the China Family Panel Studies (CFPS), a biennial longitudinal survey conducted by the Institute of Social Science Survey at Peking University. The survey was launched in 2010, and five waves of data have been publicly released (2010, 2012, 2014, 2016, and 2018). The sample covers 25 provinces, which account for 95% of the total population of China. The CFPS covers the demographics, socioeconomic conditions, education, and health of respondents. Given the purpose of this research, we chose the 2016 data because the waves before 2016 did not include the depression scale.

The study population was older adults with limitations in daily living. Disability was measured with items based on the IADL scale [18]: outdoor activities, dining, kitchen activities, using public transportation, shopping, cleaning, and laundry. A participant who was unable to complete at least one of the seven activities independently was defined as disabled in this paper. In total, 1460 participants met the requirements, including 841 rural elderly and 619 urban elderly. The sample selection flowchart is shown in Fig. 1.

Fig. 1 Flowchart of participant selection

Study design and variables

In the 2016 CFPS questionnaire, depressive symptoms were measured with the Center for Epidemiologic Studies Depression Scale (CES-D). Participants were asked how often they had experienced feelings such as happiness, loneliness, and hope in the previous week. Respondents rated each item on a four-point scale: “rarely or never (less than 1 day)”, “not too often (1–2 days)”, “sometimes or half the time (3–4 days)”, and “most of the time (5–7 days)”. Responses to negative-feeling items were scored 0, 1, 2, and 3, and responses to positive-feeling items were reverse-scored as 3, 2, 1, and 0. The total score ranged from 0 to 60, and adults scoring more than 16 were considered to screen positive for depression [19]. Two versions of the scale were used in the survey: 20% of respondents were randomly assigned the CES-D 20 and the remaining 80% the CES-D 8. CES-D 8 scores were converted to CES-D 20 scores by equipercentile transformation. After transformation, the percentile distributions of CES-D 20 scores in the two groups were similar, with almost the same mean, standard deviation, kurtosis, and skewness. The output variable was the CES-D 20 result, coded “1” for depression and “0” for non-depression.
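The scoring rule described above can be sketched as follows. This is a minimal illustration, not the CFPS processing code: the function names are hypothetical, the zero-based positions of the positive-feeling items are an assumption passed in by the caller (shown here for the standard CES-D 20 item order), and the cutoff follows the "more than 16" rule stated in the text.

```python
from typing import Sequence, Set

# Assumption: zero-based positions of the positive-feeling items
# (items 4, 8, 12 and 16 in the standard CES-D 20 ordering).
POSITIVE_ITEMS: Set[int] = {3, 7, 11, 15}


def cesd_total(responses: Sequence[int], positive_items: Set[int]) -> int:
    """Sum item scores; each response is coded 0..3 in answer-option order.

    Negative-feeling items keep their code (0, 1, 2, 3);
    positive-feeling items are reverse-scored (3, 2, 1, 0).
    """
    total = 0
    for position, answer in enumerate(responses):
        if answer not in (0, 1, 2, 3):
            raise ValueError(f"item {position}: answer must be 0..3")
        total += (3 - answer) if position in positive_items else answer
    return total


def screen_positive(total: int, cutoff: int = 16) -> int:
    """Return 1 (depression) if the total exceeds the cutoff, else 0."""
    return 1 if total > cutoff else 0


# Toy respondent who answers "not too often" (code 1) to all 20 items.
toy_total = cesd_total([1] * 20, POSITIVE_ITEMS)
toy_label = screen_positive(toy_total)
```

With 20 items scored 0 to 3 the total stays within the 0 to 60 range given in the text. The equipercentile conversion from CES-D 8 to CES-D 20 is not shown here.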

The predictors were divided into six categories: demographic characteristics (gender, age, marital status, years of education, family per capita income, and urban or rural residence), health behavior (smoking, drinking more than three times a week, sleep duration, and regular exercise), health status (chronic disease, BMI, disease or accident experience within the past 2 weeks, hospitalization within 1 year, total medical expenses within 1 year, self-rated health, and change in perceived health), family relations (number of family members, number of children, closeness to children, receiving financial assistance from children, and weekly family dinner frequency), social relations (neighborhood help, neighborhood relationship, community emotion, participating in organizations, and trusting people), and subjective attitude (life satisfaction and having trust in the future).

Statistical analysis

The random forest (RF) algorithm is an ensemble method that combines many models to predict the response and can be used for both classification and regression, that is, for both categorical and continuous outcomes. This paper uses a random forest classifier (RFC), which consists of many individual decision trees that operate as an ensemble. Each tree in the forest makes a prediction and casts a vote, and the class with the most votes becomes the prediction of the whole model. Each tree acts like a team member, and the members work together to reach the final prediction, which generally performs better than a single decision tree. RFC is suitable for binary classification. It can cope with datasets in which the number of variables exceeds the number of observations and with mixtures of continuous and categorical predictors. RFC is also robust to noise, can process high-dimensional data without feature selection, and provides a ranking of variable importance. For both the rural and the urban model, the data were randomly divided into a training set (70% of the sample) and a testing set (30% of the sample).
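The random 70/30 split and the voting step can be illustrated with a small standard-library sketch. It is not the fitted model from this study: the three "trees" are hand-written toy rules with hypothetical thresholds, the feature names are borrowed from the predictor list only for readability, and the seed is arbitrary.

```python
import random
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Row = Dict[str, float]


def split_indices(n: int, test_fraction: float = 0.3,
                  seed: int = 0) -> Tuple[List[int], List[int]]:
    """Shuffle row indices and return (train, test) index lists."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    n_test = round(n * test_fraction)
    return indices[n_test:], indices[:n_test]


def majority_vote(votes: Sequence[int]) -> int:
    """Return the most frequent class label (ties go to the label seen first)."""
    return Counter(votes).most_common(1)[0][0]


def forest_predict(trees: Sequence[Callable[[Row], int]], row: Row) -> int:
    """Let every tree vote on one row and return the winning class."""
    return majority_vote([tree(row) for tree in trees])


# Toy stand-ins for fitted decision trees (hypothetical rules, not from the paper).
toy_trees = [
    lambda r: int(r["chronic_disease"] == 1),
    lambda r: int(r["self_rated_health"] <= 2),
    lambda r: int(r["life_satisfaction"] <= 2),
]

# Split the 841 rural participants into training and testing indices.
train_idx, test_idx = split_indices(841)

# Two of the three toy trees vote "1" for this row, so the forest predicts 1.
example_row = {"chronic_disease": 1, "self_rated_health": 4, "life_satisfaction": 2}
prediction = forest_predict(toy_trees, example_row)
```

In a real random forest each tree is grown on a bootstrap sample with a random subset of features considered at each split; only the aggregation step is shown here.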

We tuned the hyper-parameters of the RFC by balancing generalization error against model complexity. Limiting tree growth simplifies each decision tree by removing terminal nodes, which helps avoid overfitting and can improve the predictive power of the model. We tuned four hyper-parameters commonly used in RFC [20]: (1) n_estimators (number of trees in the forest); (2) max_depth (maximum depth of each tree); (3) min_samples_split (minimum number of data points in a node before the node is split); and (4) min_samples_leaf (minimum number of data points allowed in a leaf node). Learning curves were used to evaluate the two sets of hyper-parameters and to select the combination that optimized the algorithm’s performance. Table 1 shows the results for rural and urban areas.

Table 1 Random forest model and training parameters

Results

Participant characteristics

A total of 1460 individuals were included in the analysis. The prevalence of depression differed substantially between urban and rural older adults: 44.59% among urban, 57.67% among rural, and 52.12% among all disabled older adults. Table 2 summarizes the demographic characteristics, health behavior, health status, family relations, social relations, and subjective attitudes of the disabled elderly in rural and urban areas. Rural and urban disabled elderly differed significantly in years of education, family per capita income, regular exercise, BMI, disease or accident experience within the past 2 weeks, hospitalization within 1 year, total medical expenses within 1 year, self-rated health, change in perceived health, number of family members, number of children, receiving financial assistance from children, community emotion, participating in organizations, life satisfaction, having trust in the future, and depression. We therefore constructed separate depression prediction models for the disabled elderly in urban and rural areas.

Table 2 Sociodemographic data and characteristics of rural and urban disabled elderly

Detecting potential predictors

In Table 2, we first used Chi-square tests and t-tests to examine differences between rural and urban disabled elderly on each variable (P value1). We then used Chi-square tests and t-tests to examine differences between depressed and non-depressed participants, separately for urban and rural disabled elderly. Variables with P < 0.1 were included in the subsequent RFC model.
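The screening rule (keep a variable when P < 0.1) can be sketched for the simplest case of a binary variable cross-tabulated against depression status. The sketch below is an assumption-laden illustration: it uses the Pearson chi-square test without continuity correction on a 2x2 table, the function names and the counts are hypothetical, and the t-tests used for continuous variables are not shown.

```python
import math
from typing import Dict, List, Tuple

Table2x2 = Tuple[int, int, int, int]  # (a, b, c, d) for the table [[a, b], [c, d]]


def chi_square_2x2(a: int, b: int, c: int, d: int) -> Tuple[float, float]:
    """Pearson chi-square statistic and P value for a 2x2 table (1 degree of freedom)."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs at least one observation")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def screen_features(tables: Dict[str, Table2x2], alpha: float = 0.1) -> List[str]:
    """Return the names of the variables whose P value is below alpha."""
    return [name for name, t in tables.items() if chi_square_2x2(*t)[1] < alpha]


# Hypothetical counts: rows = variable present / absent,
# columns = depressed / not depressed.
toy_tables = {
    "feature_a": (30, 10, 10, 30),  # strongly associated with the outcome
    "feature_b": (20, 20, 20, 20),  # no association
}
selected = screen_features(toy_tables)
```

A threshold of 0.1 is more permissive than the usual 0.05, which keeps borderline variables available to the classifier.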

Therefore, the input variables (features) for the rural group were as follows: age, gender, marital status, education, family per capita income, drinking more than three times a week, sleep duration, regular exercise, BMI, disease or accident experience within the past 2 weeks, chronic disease, hospitalization within 1 year, total medical expenses within 1 year, self-rated health, change in perceived health, receiving financial assistance from children, weekly family dinner, neighborhood help, neighborhood relationship, community emotion, life satisfaction, having trust in the future, and trusting people.

The input variables (features) for the urban group were as follows: family per capita income, drinking more than three times a week, sleep duration, regular exercise, BMI, disease or accident experience within the past 2 weeks, chronic disease, hospitalization within 1 year, total medical expenses within 1 year, self-rated health, change in perceived health, weekly family dinner, neighborhood help, neighborhood relationship, community emotion, life satisfaction, having trust in the future, trusting people, and closeness to children.

Testing prediction accuracy of potential predictors

For the random forest analysis, the total sample was divided into two sub-samples: a training set and a test set. Figure 2 shows the test-set confusion matrices for rural and urban disabled elderly. Following previous studies [21], sensitivity and specificity were calculated in addition to accuracy.

Fig. 2 a Confusion matrices for rural disabled elderly b Confusion matrices for urban disabled elderly

Sensitivity refers to the true positive rate, which is calculated as follows:

$$\mathrm{Sensitivity}=\mathrm{TP}\ \left(\mathrm{True}\ \mathrm{Positive}\right)/\left(\mathrm{TP}\ \left(\mathrm{True}\ \mathrm{Positive}\right)+\mathrm{FN}\ \left(\mathrm{False}\ \mathrm{Negative}\right)\right)$$

In this study, sensitivity is the proportion of disabled older adults with depression who were correctly predicted. The sensitivity for rural disabled elderly was 80.6%, and that for urban disabled elderly was 64.2%.

Specificity is the true negative rate, which is calculated with the following formula:

$$\mathrm{Specificity}=\mathrm{TN}\ \left(\mathrm{True}\ \mathrm{Negative}\right)/\left(\mathrm{TN}\ \left(\mathrm{True}\ \mathrm{Negative}\right)+\mathrm{FP}\ \left(\mathrm{False}\ \mathrm{Positive}\right)\right)$$

The specificity score for rural disabled elderly was 65.3%, and that of urban elderly was 78.1%.
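The two formulas above can be checked on a small example. The sketch below counts the four cells of a confusion matrix from true and predicted labels and then applies the formulas; the labels are toy values chosen for illustration, not the study's test-set predictions, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def confusion_counts(y_true: Sequence[int],
                     y_pred: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FN, TN, FP) for binary labels, where 1 = depression."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp, fn, tn, fp


def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: TP / (TP + FN)."""
    return tp / (tp + fn)


def specificity(tn: int, fp: int) -> float:
    """True negative rate: TN / (TN + FP)."""
    return tn / (tn + fp)


# Toy labels: five depressed and five non-depressed participants.
y_true = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 1, 0, 0, 0, 0, 1, 1]

tp, fn, tn, fp = confusion_counts(y_true, y_pred)
toy_sensitivity = sensitivity(tp, fn)  # 4 of 5 depressed cases found
toy_specificity = specificity(tn, fp)  # 3 of 5 non-depressed cases found
```

A model with high sensitivity and lower specificity, as reported for the rural group, misses few depressed participants at the cost of more false alarms.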

The classifier’s performance was then tested by 10-fold cross-validation; the mean score was 0.71 for rural areas and 0.70 for urban areas.
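As a sketch of how 10-fold cross-validation partitions the data, the code below builds the folds and averages a score over them. It is illustrative only: `fit_and_score` is a hypothetical callback standing in for training and scoring the classifier, the majority-class baseline is a toy stand-in for the RFC, and the text does not state which score was averaged.

```python
import random
from typing import Callable, Iterator, List, Sequence, Tuple


def k_fold_indices(n: int, k: int = 10,
                   seed: int = 0) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists; every row is tested exactly once."""
    if not 2 <= k <= n:
        raise ValueError("k must be between 2 and n")
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        train = [j for f, fold in enumerate(folds) if f != held_out for j in fold]
        yield train, folds[held_out]


def cross_val_mean(n: int, k: int,
                   fit_and_score: Callable[[List[int], List[int]], float],
                   seed: int = 0) -> float:
    """Average the score returned by fit_and_score over the k folds."""
    scores = [fit_and_score(train, test) for train, test in k_fold_indices(n, k, seed)]
    return sum(scores) / len(scores)


def majority_baseline(labels: Sequence[int]) -> Callable[[List[int], List[int]], float]:
    """Toy model: predict the majority class of the training fold, score accuracy."""
    def fit_and_score(train: List[int], test: List[int]) -> float:
        majority = 1 if 2 * sum(labels[i] for i in train) >= len(train) else 0
        return sum(1 for i in test if labels[i] == majority) / len(test)
    return fit_and_score


# Toy labels in which every participant is coded 1, so the baseline scores 1.0.
toy_mean = cross_val_mean(20, 5, majority_baseline([1] * 20))
```

Because each observation serves as test data exactly once, the averaged score is less sensitive to one lucky or unlucky split than a single 70/30 partition.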

A receiver operating characteristic (ROC) curve plots the true positive rate on the Y-axis against the false positive rate on the X-axis. The upper left corner of the graph is the ideal point, where the false positive rate is 0 and the true positive rate is 1, so a larger area under the curve (AUC) is usually better. The AUCs for rural and urban disabled elderly were 0.7905 (see Fig. 3a) and 0.7781 (see Fig. 3b), respectively.
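One way to make the AUC concrete is its rank interpretation: it equals the probability that a randomly chosen depressed participant receives a higher predicted score than a randomly chosen non-depressed participant, with ties counted as one half. The sketch below computes it by comparing all such pairs; the scores are toy values and the function name is hypothetical.

```python
from typing import Sequence


def roc_auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0   # positive ranked above negative
            elif p == q:
                wins += 0.5   # tie counts as half
    return wins / (len(positives) * len(negatives))


# Toy predicted probabilities of depression for four participants.
toy_labels = [1, 1, 0, 0]
toy_scores = [0.9, 0.4, 0.5, 0.1]
toy_auc = roc_auc(toy_labels, toy_scores)  # 3 of the 4 pairs are ordered correctly
```

A value of 0.5 corresponds to random ranking and 1.0 to perfect separation of the two classes.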

Fig. 3 ROC curves for depression a Rural b Urban

Feature importance

To explain the results of the classification model, we used SHAP values, a local interpretability technique. Figures 4 and 5 show the importance of the features in each model in descending order. The y-axis lists the features. The color represents the magnitude of the feature value, and the more widely the points of a feature are spread along the x-axis, the greater the influence of that feature on the depression prediction.

Fig. 4 Feature importance in Random Forest Classifier. a Mean SHAP value b SHAP value (Rural)

Fig. 5 Feature importance in Random Forest Classifier. a Mean SHAP value b SHAP value (Urban)

Red indicates a relatively high feature value, and blue indicates a relatively low one. The farther to the right a point lies (the larger its SHAP value), the greater its positive contribution to the prediction of depression; the farther to the left (the smaller its SHAP value), the greater its negative contribution. If the red and blue points of a feature are clearly separated along the SHAP axis, high and low values of that feature have different effects on the prediction.
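The idea behind a SHAP value can be illustrated by computing exact Shapley values for a tiny model. This is a simplified sketch, not the procedure used for Figs. 4 and 5: a feature that is "absent" from a coalition is replaced by a single baseline value (an assumption made here for brevity), the model is a hypothetical two-feature linear function, and the exhaustive enumeration is only practical for a handful of features.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def shapley_values(model: Callable[[List[float]], float],
                   x: Sequence[float],
                   baseline: Sequence[float]) -> List[float]:
    """Exact Shapley value of each feature for one prediction."""
    n = len(x)

    def value(present: frozenset) -> float:
        # Features outside the coalition take their baseline value.
        return model([x[i] if i in present else baseline[i] for i in range(n)])

    contributions = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                coalition = frozenset(subset)
                total += weight * (value(coalition | {i}) - value(coalition))
        contributions.append(total)
    return contributions


def toy_model(z: List[float]) -> float:
    """Hypothetical risk score: feature 0 raises it, feature 1 lowers it."""
    return 0.2 + 0.3 * z[0] - 0.1 * z[1]


# Contributions for one toy participant relative to an all-zero baseline.
toy_phi = shapley_values(toy_model, x=[1.0, 1.0], baseline=[0.0, 0.0])
```

The contributions sum to the difference between the prediction for the participant and the prediction for the baseline, which is why positive values push the prediction toward depression and negative values push it away.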

As shown in Fig. 4a, the top 10 predictors for rural disabled elderly were chronic disease, self-rated health, having trust in the future, neighborhood relationship, total medical expenses within 1 year, life satisfaction, BMI, disease or accident experience within the past 2 weeks, change in perceived health, and trusting people.

Figure 4b shows that, for rural disabled elderly, the absence of chronic disease, better self-rated health, more confidence in the future, better neighborhood relationships, lower total medical expenses within 1 year, higher life satisfaction, higher BMI, no disease or accident experience within the past 2 weeks, better or unchanged perceived health, and deeper trust in people made greater negative contributions to the prediction of depression.

As shown in Fig. 5a, the top 10 predictors for urban disabled elderly were self-rated health, life satisfaction, disease or accident experience within the past 2 weeks, having trust in the future, change in perceived health, sleep duration, BMI, family per capita income, community emotion, and trusting people.

Figure 5b shows that, for urban disabled elderly, better self-rated health, higher life satisfaction, no disease or accident experience within the past 2 weeks, more confidence in the future, better or unchanged perceived health, longer sleep duration, higher BMI, higher family per capita income, deeper community emotion, and deeper trust in people made greater negative contributions to the prediction of depression.

Discussion

In this study, we built machine learning-based predictive models for rural and urban disabled older people. The depression rate of rural disabled elderly was 57.67%, higher than that of urban disabled elderly (44.59%). The mean 10-fold cross-validation score was 0.71 in rural areas and 0.70 in urban areas. The AUC, specificity, and sensitivity were 0.79, 65.3%, and 80.6% for rural disabled elderly and 0.78, 78.1%, and 64.2% for urban disabled elderly, respectively. These results suggest that the model could be used in practice to screen rural and urban disabled elderly people who are prone to depression. Other studies have also used machine learning to predict depression.

Zhang predicted depression among pregnant women using Weill Cornell Medicine (WCM) data, and the AUC of their model was 0.937 (development dataset) and 0.886 (validation dataset). Their predictors included clinical features related to mental health history, medical comorbidity, obstetric complications, medication prescription orders, and patient demographic characteristics [22]. Dinga predicted the naturalistic course of depression from a wide range of clinical, psychological, and biological data, with AUC values ranging from 0.66 to 0.69 [23]. Other scholars have used data without clinical symptoms or indicators. Gokten and Uyulan used sociodemographic data and characteristics of sexual abuse to predict the development of depression in sexually abused children, and the accuracy of their model was 72.0% [14]. Although all of these studies used machine learning methods, their prediction accuracy varied, which may be related to differences in predictors, study populations, and sample sources.

The ability of machine learning to detect key features in complex data sets is one of its main strengths. Our study found differences in the top ten predictors between the rural and urban disabled elderly. For the rural elderly, the most important feature of the random forest classifier was chronic disease, whereas chronic disease was not among the top ten predictors for the urban elderly. This difference may arise because urban and rural areas differ in socioeconomic status and geographical location, which affects how promptly medical care is received and, in turn, the severity of chronic diseases [24,25,26,27]. For some chronic diseases, such as diabetes, heart disease, and cerebrovascular disease, treatment depends on the economic status of rural residents, whereas urban residents would visit a doctor if they had any disease [28]. Social relationships are also an important factor in preventing and alleviating depression [29, 30]. Older adults exhibited better mental health in neighborhoods where positive neighborly interactions prevailed over individual adversities [31]. Our study found that neighborhood relationships were a more prominent predictor among the disabled elderly in rural areas. A possible reason is that social ties appear more consequential for attachment among rural than among urban people, and that social cohesion differs between rural and urban areas [32,33,34].

The prevalence of depression varied substantially between urban and rural older adults: 44.59% among urban, 57.67% among rural, and 52.12% among all disabled older adults. Overall, the depression rate of the disabled in this study was lower than that in the Özlem and Ünsal study (57.8%) [8]. The most likely reason is the difference in how disability was measured. Our study used IADL to measure disability, while the Özlem and Ünsal study was based on data provided by disabled individuals registered with the Turkish Disabled Association Branch. He found that the prevalence of baseline depressive symptoms was 29.5%, 58.0%, and 73.6% in subjects with basic ADL scores of 0, 1, and ≥ 2, respectively, which shows that depression rates differ with the method used to measure disability [23].

Among the factors that predicted depression in both rural and urban disabled elderly, health status was one of the most important, including self-rated health, disease or accident experience within the past 2 weeks, change in perceived health, and BMI. Self-rated health is a significant predictor mainly because it is a multidimensional construct that includes physiological, psychological, functional, and social variables. Although self-rated health is commonly seen as a manifestation of depressed affect, it seems to predict subsequent mental health outcomes [35]. Previous scholars have found that depression is closely related to self-rated health [36,37,38]. Higher BMI was associated with fewer depressive symptoms, which the “jolly fat” hypothesis can explain: obese people may be happier because they may not be exposed to strict diets that lead to depression [39]. Life satisfaction and having trust in the future were the other significant predictors of depression [40]. We were surprised to find a close relationship between trusting people and depression among the elderly in both rural and urban areas. Trust itself has been shown to be associated with a host of health outcomes [41, 42].

Our findings could be good news for the rural elderly, who have limited access to good health facilities. Predicting depression from risk factors alone means that depression risk can be assessed even before symptoms appear, which allows early intervention. However, the present study has two limitations. First, because of the limited content of the CFPS survey, our model did not include clinical symptoms or indicators. Previous research has shown that clinical symptoms or indicators contribute to the pathophysiology of depression [43,44,45], but our research lacks this information. Second, we did not assess the severity of depression in the rural disabled elderly.

Conclusion

We suggest that depression among the disabled elderly can be predicted by machine learning methods from six aspects: demographic characteristics, health status, health behavior, family relationships, social relationships, and subjective attitude. The top ten predictors differ between the rural and urban disabled elderly. However, clinical symptoms or indicators should be considered in future research.

Using a random forest model to predict depression among the disabled elderly can help detect depression early. The prediction model in this study could support the identification of depression risk and subsequent intervention among rural and urban disabled elderly and improve their health status through early prevention, diagnosis, and treatment.