Interpretable machine learning analysis to identify risk factors for diabetes using the anonymous living census data of Japan

Purpose Diabetes mellitus causes various problems in daily life. Despite the current abundance of big data, some risk factors for diabetes likely remain unidentified. To identify new risk factors for diabetes and explore more efficient use of big data, the non-objective-oriented census data from the Japanese Citizen's Survey of Living Conditions were analyzed using interpretable machine learning methods. Methods Seven interpretable machine learning methods were used to analyze Japanese citizens' census data. First, logistic analysis was used to analyze the risk factors for diabetes from 19 initially selected factors. Then, linear analysis, linear discriminant analysis, Hayashi's quantification method 2, random forest, XGBoost, and SHAP were used to recheck the factors and compare their contributions. Finally, the relationships among the factors were analyzed. Results Four factors, namely the number of family members, insurance type, public pension type, and health awareness level, were identified as risk factors for diabetes mellitus for the first time, while another 11 risk factors were reconfirmed in this analysis. In particular, the insurance type and health awareness level factors contribute more to diabetes than hypertension, hyperlipidemia, and stress in some interpretable models. We also found that work years was identified as a risk factor for diabetes because it is highly correlated with the risk factor of age. Conclusions New risk factors for diabetes mellitus were identified from Japan's non-objective-oriented anonymous census data using interpretable machine learning models. The newly identified risk factors suggest possible new policies for preventing diabetes. Moreover, our analysis confirms that big data can help us find useful knowledge in today's data-rich society. Our study also paves the way for identifying more risk factors and improving the efficiency of big data use.


Introduction
Diabetes mellitus (DM) not only influences daily life but also causes various complications, such as ketoacidosis, hypertension, kidney disease, and foot complications [1].
The World Health Organization (WHO) reports that 422 million adults have DM worldwide, which makes one of every eleven people a DM patient [2]. In Japan, the prevalence of diabetes has been steadily increasing and is expected to grow by 10% in 2030 [3]. Moreover, researchers found that diabetes patients are more susceptible to COVID-19 [4]. To prevent the severe effects caused by DM, many institutions make various efforts to prevent diabetes. WHO publishes a yearly report regarding diabetes [2]. The US Centre for Disease Control and Prevention initiated the National Diabetes Prevention Program to prevent or delay type 2 diabetes [5]. The Certification Board for Diabetes Educators in Japan [6] trains doctors and nurses in the essentials of assisting diabetes patients. The Japan Preventive Association of Life-style Related Disease [7] informs citizens how to prevent diabetes by improving their life habits. The Japan Diabetes Society [8] organizes an annual conference and promotes diabetes research. Diabetes Network [9] of Japan collects data about diabetes and organizes various events to educate the public and help prevent diabetes in Japan.
Even though governments have made various efforts to prevent diabetes mellitus, its profound influence persists. In addition, COVID-19 has changed our life habits, so efforts to stop DM remain necessary. Identifying new risk factors for DM can help us make more efficient prevention policies. Researchers have therefore made various efforts to find new risk factors associated with DM. For example, Aidin et al. [10] found a relationship between diabetes patients' mortality and cardiovascular disease. Age [11] and gender [11,12] were also identified as affecting the prevalence of DM, and dietary factors were found in some studies [13][14][15]. Moreover, several metabolic and anthropometric traits were associated with DM: BMI [16], overweight [17][18][19], and obesity [17,18,20]. Considering lifestyle and environmental factors, more risk factors associated with DM were identified: socioeconomic status [21], life environment [22], life habits [23], lifestyle [24], smoking status [25][26][27][28][29][30][31][32][33][34][35][36][37][38] or cigarette consumption [10,13,26], alcohol consumption [13,39], occupation [40], work stress [12,41], work years [40,42], and weekly work hours [41]. Research by Bellou et al. [13] indicated that a low level of education and conscientiousness, decreased physical activity, high sedentary time, duration of television watching, and air pollution presented robust evidence for increased risk of type 2 DM. Although current works have achieved great success, all their factors were identified from objective-oriented datasets. They overlook the fact that we live in a data-rich society shaped by the technological development of the Internet of Things (IoT). Furthermore, these methods cannot handle large-scale, complex analysis of DM-related risk factors in the current big data society. Thus, there is a pressing need to design a model that can identify more risk factors from large-scale, complex DM-related data.
Fortunately, with the development of technology, various reliable and robust machine learning methods [43][44][45] have been proposed and used to classify or predict complex risk factors of multiple diseases. Notably, interpretable machine learning (IML) models capture the "extraction of relevant knowledge from a machine-learning model concerning relationships either contained in data or learned by the model" [46]. With the development of AI technology, explainable AI (XAI) models help us understand black-box AI models and provide local or global explanations of them. The XAI, or explainable machine learning (XML), methods LIME [47] and SHAP [48] have been used in various fields and shown to be effective, especially in medical and clinical areas. Some research confirmed that LIME [49][50][51][52] and SHAP [51][52][53][54][55] can be used to explain models and give reasons for model decisions. Because of the robustness of XML methods, we used them to analyze our data and obtained the results we hoped for. The IML methods were therefore used to analyze the non-objective-oriented data, helping us identify new risk factors for DM. The knowledge from IML models will also help individuals take better actions to prevent DM and guide governments' policymaking. We started this study hoping to find new knowledge about DM using Japan's non-objective-oriented anonymous census data. The significant contributions of our study are as follows:
• We propose using interpretable machine learning to obtain new risk factors that can inform DM prevention in current society.
• As far as we know, our analysis is a significant new attempt to use non-objective-oriented data to find knowledge in the booming big data society.
• Our study paves the way for finding more useful knowledge using interpretable machine learning methods.
The rest of the paper is organized as follows. Section 2 describes the datasets used. Section 3 introduces the methodology. Section 4 presents the detailed results of our study. Section 5 discusses the results and the limitations of our study. Finally, Section 6 discusses future directions and concludes the paper.

Data source
Generally, research on DM is based on experimentally designed statistical data. However, many other kinds of data exist in our current society, and it is necessary to use such non-objective-oriented data to find knowledge. Therefore, hoping to find new risk factors for DM, we analyzed the Japanese citizens' living survey data, which are collected by the Ministry of Healthcare, Labor, and Welfare (MHLW) [56] of Japan to understand the daily life of Japanese citizens. MHLW has compiled a comprehensive census of Japanese citizens' living conditions every four years since 1995. It includes many information items: personal information (age, sex, marriage situation, etc.); information about work (profession, company description, weekly work hours, weekly workdays, and work years since starting work); information about the family (number of family members, etc.); living situation (the space of houses, the numbers of rooms, etc.); hospital information and healthcare situation (stress, various diseases information, etc.). These anonymous census data help the Japanese government understand citizens' living conditions and improve them. Because the MHLW data collection process is not designed specifically for DM, census data without an objective-oriented design make it possible to identify hidden risk factors for DM. The census data from 2013 were used in this analysis, and 28,292 samples were extracted from 192,519 anonymous samples by deleting the samples with missing values. The extracted samples contain 19 factors: number of family members, spending of one year (Spence), room space, number of rooms for one family (room_num), age, sex, insurance type in Japan, national public pension type in Japan, weekly work hours, weekly workdays, total work years, profession category, obesity, hyperlipidemia, hypertension, healthcare awareness level, health investigation situation in Japan, stress, and smoking status. Among all the factors, sex, obesity, hyperlipidemia, hypertension, and anxiety are dichotomous, with values binarized as 0 or 1, while the other factors are ordinal. All the samples were analyzed using seven interpretable machine learning methods.
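The sample extraction described above (removing every sample with a missing value) can be sketched as follows. This is a minimal illustration on made-up records, not the processing actually applied to the MHLW data: the helper name `complete_cases`, the field names, and the use of `None` or an absent key to mark a missing value are all assumptions, since the encoding of missing values in the census files is not described here.

```python
def complete_cases(records, fields):
    """Keep only records in which every listed field is present and not None."""
    return [r for r in records if all(r.get(f) is not None for f in fields)]


# Hypothetical toy records; the real data have 19 factors per sample.
records = [
    {"age": 4, "sex": 1, "obesity": 0},
    {"age": None, "sex": 0, "obesity": 1},  # missing value, dropped
    {"age": 6, "sex": 1},                   # missing field, dropped
    {"age": 2, "sex": 0, "obesity": 1},
]
fields = ["age", "sex", "obesity"]

kept = complete_cases(records, fields)
print(len(kept), "of", len(records), "records kept")
```

Deleting incomplete samples in this way is simple, but it shrinks the dataset considerably, as the reduction from 192,519 to 28,292 samples shows.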

Methodology
Besides model accuracy, the reliability of models has become essential as analysis technology has developed. Interpretable machine learning models can make decisions and tell us why a decision was made. Therefore, we used interpretable machine learning models to analyze the MHLW census data. Generally, interpretable machine learning (IML) models are of two types: intrinsic (rule-based) or post hoc [57]. Intrinsic models obtain knowledge by restricting the rules of machine learning models. In contrast, post hoc models apply interpretation after training, as in local interpretable model-agnostic explanations (LIME) [47] and Shapley additive explanations (SHAP) [48]. In particular, SHAP has been used in various fields and shown to be robust [58][59][60][61][62][63]. Therefore, we used SHAP to explain a multilayer perceptron (MLP) model and calculate the feature importance. The intrinsic models generally include logistic analysis (LA), linear regression, linear discriminant analysis (LDA), Hayashi's quantification method 2 (qt2) [64], random forest, and XGBoost. In our study, we used both types of models to find and check the feature contributions in the DM classification.
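To make the idea behind SHAP more concrete, the sketch below computes exact Shapley values for a tiny hand-written scoring function. It illustrates only the underlying definition: the SHAP library uses approximations and was applied in this study to an MLP, neither of which is reproduced here. The function `toy_model`, its coefficients, and the all-zero baseline are hypothetical choices made for illustration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x relative to a baseline input.

    Features outside a coalition are replaced by their baseline value.
    The cost grows exponentially with the number of features.
    """
    n = len(x)

    def value(subset):
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return model(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                s = set(s)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical risk score: two linear terms plus one interaction term.
    return 0.5 * z[0] + 0.2 * z[1] + 0.1 * z[0] * z[2]


x = [1.0, 1.0, 1.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print([round(p, 3) for p in phi])
```

The contributions sum to the difference between the model output at `x` and at the baseline, and the interaction term is split equally between the two features involved. This additive property is what allows per-feature contributions to be read as feature importance.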
First, the commonly used logistic analysis was used to find the risk factors associated with DM, and the other IML methods were used to check the importance of each factor. Logistic analysis is one of the most widely used methods for analyzing statistical data in the biology and healthcare fields (Table 4). Like linear regression, logistic analysis can identify related factors, and it can also predict the probability of disease occurrence. LA can also quantify the increase in risk associated with a factor by calculating the odds ratio (OR), especially for dichotomous factors (Table 1). The detailed steps of our analysis are shown in Fig. 1. First, all factors were tested using a univariate logistic regression model, and 15 strongly related factors (P < 0.05) were identified. Then the OR of all dichotomous variables was calculated (Table 2). The associated risk factors of DM were rechecked using multiple linear regression to review multi-related risk factors of DM (Table 3). Consequently, the associated risk factors of DM were identified (Table 1). The OR of the associated ordinal risk factors was also checked (Table 2). Then, the other six interpretable machine learning methods (linear regression, LDA, qt2, random forest, XGBoost, and SHAP) were used to recheck the factors' importance (Table 3 and Fig. 2). In Table 3, a higher LDA factor contribution indicates a more important factor in the LDA analysis. Likewise, a higher qt2 factor importance indicates a more important factor in the qt2 analysis. Similarly, higher values in the random forest, XGBoost, and SHAP methods suggest that a factor contributes more to the classification of DM. Moreover, the correlation coefficients among all the factors were checked (Fig. 3) to understand the relationships among factors. Finally, factors identified in previous research were reviewed and compared with our analysis (Table 4).
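For a dichotomous factor, the odds ratio mentioned above can be computed directly from a 2×2 table of counts. The sketch below is illustrative only: the counts are invented rather than taken from the census data, the function name `odds_ratio_with_ci` is hypothetical, and the Wald interval on the log scale is one common choice that the study may or may not have used.

```python
import math


def odds_ratio_with_ci(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval from a 2x2 table.

    a: exposed with DM        b: exposed without DM
    c: unexposed with DM      d: unexposed without DM
    z = 1.96 gives an approximate 95% interval.
    """
    odds_ratio = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper


# Made-up counts for a binary factor such as obesity (1) versus not (0).
or_value, ci_low, ci_high = odds_ratio_with_ci(a=60, b=140, c=40, d=560)
print(f"OR = {or_value:.2f}, 95% CI = ({ci_low:.2f}, {ci_high:.2f})")
```

An OR above 1 with a confidence interval that excludes 1 indicates that the factor is associated with higher odds of DM. In a univariate logistic regression with a single binary predictor, the exponentiated coefficient equals this same ratio.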

The identified risk factors
After applying LA, 15 risk factors associated with DM (p < 0.005) were identified (Table 1) among the total 19 factors. Notably, insurance type, public pension, the number of family members (family_num), and health awareness level are identified as risk factors for DM using non-objective-oriented data for the first time. Meanwhile, 11 factors: age, sex, obesity, hyperlipidemia, hypertension, profession, years of working, weekly workdays, weekly work hours, stress, and smoking status were also re-identified as risk factors.
The newly identified risk factor health awareness level was confirmed as a risk factor for DM in the LA, linear analysis, qt2, and LDA analyses. The importance of health awareness level in the qt2 analysis is 1.01, following the factors obesity (4.33) and age (2.95). Health awareness level also has a comparatively higher LDA factor contribution than factors identified in previous research: smoking status, stress, work time, and age (Table 3). Similarly, in the decision tree models, health awareness level has higher importance than the generally identified factors stress, obesity, and gender. Fig. 3 shows that health awareness level correlates more strongly with stress than with other factors. Similar to health awareness level, the public pension type in Japan was not only reconfirmed as a potential risk factor by single and multiple linear regression, but it also has a higher factor contribution (0.31) than stress (0.07) and hypertension (0.22) in the LDA analysis. The insurance type factor was not reconfirmed as a risk factor in multiple linear regression. However, its qt2 importance (0.6) is higher than those of general risk factors: stress (0.07), hyperlipidemia (0.29), and hypertension (0.22). The newly identified factor number of family members was identified as a risk factor in the LA and qt2 analyses, and its qt2 factor importance is higher than those of stress, hypertension, and hyperlipidemia. For the reconfirmed factors, Table 1 and Table 2 show that obesity is a significant risk factor for DM, with an OR over 6 (Table 1), while both its LDA factor contribution and its qt2 factor importance are over 4 (Table 3). For the commonly accepted factors stress and smoking status, the LDA factor contribution and qt2 factor importance are lower than those of the newly identified risk factor health awareness level. In particular, the qt2 factor importance of stress is the lowest among the potentially associated risk factors of DM. The factor worktime (years of working) was reconfirmed as a risk factor for DM in this analysis, while Fig. 3 shows that worktime is strongly correlated with age.
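The relationship between worktime and age can be illustrated with a correlation coefficient. The sketch below assumes the Pearson coefficient, which is not stated explicitly for Fig. 3, and uses invented values rather than census data. The function name `pearson_r` and the variable names are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Made-up values: total work years tends to rise with age.
age = [25, 32, 41, 48, 55, 63]
work_years = [3, 9, 18, 26, 30, 40]

r = pearson_r(age, work_years)
print(f"r = {r:.3f}")
```

When two factors are this strongly correlated, a univariate model can flag one of them as a risk factor largely because it carries the information of the other, which is the interpretation given above for worktime and age.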
However, in the decision tree and SHAP analysis, the factor room space contributes the most to the DM classification models. In contrast, the commonly recognized risk factors of age and gender make comparatively lower contributions in XGBoost and SHAP models.

Comparison between our analysis and other factor identification analyses
After our risk factor analysis, we compared our analysis with other studies. Our analysis used various XML methods (Table 4) to identify risk factors for DM, while other researchers generally used only one kind of method (LA, MA, etc.). Like previous analyses, we used national-level data and identified new risk factors for DM. However, our data are not specifically designed for DM: we identified new risk factors using non-objective-designed data, while other studies commonly used objective-designed data. We not only found new risk factors for DM but also reconfirmed some risk factors identified by previous research (Table 4). Because of data limitations, there are some factors that we could not reconfirm in this analysis.

Discussion
After using seven IML methods to analyze the anonymous census data of Japan, four new potential risk factors for DM were identified for the first time, and another 11 risk factors were reconfirmed. In contrast to Mika et al. [23], who identified that life environment affects DM, our analysis showed for the first time that insurance type and national pension lighten the risk of DM in some respects in Japan. Compared with stress, hypertension, and hyperlipidemia, the higher factor contribution of the national pension type suggests one possible direction for preventing DM: improving a country's pension system. Meanwhile, older citizens with insurance have fewer associated risk factors than those with other insurance (Table 2) in Japan. This indicates that Japan's insurance system helps prevent diabetes in some respects. Japan has a unique insurance system and national pension system to protect citizens, and because of this unique healthcare system, insurance type and national pension were identified as factors associated with DM in this study. Nevertheless, the identification of insurance type and national public pension shows a possible direction for governmental efforts to prevent diabetes: offering an efficient healthcare system. This agrees with the opinion of Magriplis et al. [65] that immediate public health intervention is the primary prevention of type 2 diabetes.
Health awareness level was also identified as a possible risk factor associated with DM for the first time in this analysis. Our results indicate that people with good health awareness had fewer risk factors associated with DM (Table 2) than people with a general health awareness level. Health awareness level also has a comparatively higher correlation with stress than with other factors, the reason for which is currently unclear. Deeper research is necessary to understand the complex relationships among the various risk factors of DM.
Chen et al. [11] found that age affected the prevalence of DM. Our analysis likewise found that the probability and the risk of DM increase with age, which matches the common understanding that older people are more prone to various illnesses. Meanwhile, like previous research [11,12], which confirmed that gender affects the prevalence of DM, the gender factor was also associated with the risk of DM in our analysis (Table 2). The difference in risk between males and females tells us that males should be aware of their higher likelihood of having DM in Japan, and more efforts are needed to limit DM occurrence among males in Japan.
As one significant risk factor for DM, obesity was reconfirmed as a severe risk factor (OR > 6) in our study, which agrees with the findings of A. Brown et al. [21]. The higher risk of DM for obese Japanese people reminds us again that more effort is needed to halt obesity. At the same time, hypertension and hyperlipidemia do not have as strongly adverse an effect on DM as we imagined when compared with the insurance type factor (Table 3). The comparatively lower factor contributions of hypertension and hyperlipidemia relative to health awareness level tell us that more work should be focused on helping citizens improve their health awareness level to prevent DM in Japan. Meanwhile, in line with Norito Kawakami et al. [42], we found that the professional category affects the associated risk factors for DM (Table 2) in Japan, which suggests that potential DM patients should be treated differently depending on their professional type.
Like previous research [12,41,43], our analysis also confirms that stress raises the risk of DM (OR > 1). To clarify the causes of stress in Japan, the statistical data on reasons for stress (Appendix 5) in Japan were checked. The top three reasons for stress in Japan were disease and nursery, the problem of income balance, and problems with work. This tells us that the Japanese should take care to ease the stress from ailments, income balance, and work, especially the stress from work, which was already confirmed in previous research [12,41,43].
In contrast to other studies [10,13][25][26][27][28][29][30][31][32][33][34][35][36][37][38], which found that smoking status (or tobacco consumption) affects type 2 DM, we found that smoking status did not contribute strongly to DM compared with risk factors such as obesity, age, and gender, or with the newly identified factor health awareness level. We also found that people who smoked every day had the same number of associated risk factors for DM as people who did not smoke in Japan (Table 2).
Unlike other studies, we identified the number of family members as a risk factor associated with DM for the first time in Japan. However, the comparatively strong relationship among the number of family members, room space (Table 3 and Fig. 2), and room number makes it difficult to explain the effects of the number of family members. Future work needs to find how family structure can influence the risk of DM.

Limitation
Certainly, limitations exist in our study. First, the data used in this study are from 2013 because the Japanese government had not released the newest data when we started this analysis. More recent data will be used in future investigations. Second, the data do not distinguish the categories of DM because of the anonymous census data type. However, because the census data are non-objective-oriented, we can find new knowledge using interpretable machine learning methods. Finally, because of the complex situation and relations of risk factors in real society, we should have analyzed the combined relationships among factors. Future studies should consider the compound relationships among risk factors. Despite these limitations, our findings point to a different approach for identifying unknown risk factors of DM using non-objective-oriented data. Meanwhile, analysis using census data broadens the usage of big data, especially in today's prosperous and intelligent society.

Conclusion
Non-objective-oriented census data were analyzed using various interpretable machine learning methods in this study, and 15 risk factors for DM were identified. Specifically, four new risk factors for DM, namely the number of family members, insurance type, national public pension type, and health awareness level, were found for the first time in this analysis. Our study confirms that interpretable machine learning methods can help us find new knowledge in our current big data society. Moreover, our analysis results provide a new direction for preventing diabetes in the current AI society. Certainly, our analysis clarifies only some aspects of DM, and more risk factors of DM still exist. However, our study can inspire research to find more risk factors associated with DM using non-objective-oriented data in the current data-rich society. Our study is also an efficient endeavor at data mining in the contemporary intelligent and big data society, which will widen the research border for future studies.

Table 4 The comparison of our identified factors with other research. The MA method is a kind of paper review study; therefore, it does not offer data information in the paper. NOD: non-objective-designed; OD: objective-designed. Remarks: × stands for no information item in our dataset. Columns: Risk factor, Reference, Remarks, Methodology, Data type, Data region level. Row group: Identified factors in this analysis.
Authors' contributions All authors contributed to the study's conception and design. Data collection was performed by Dr. Hiroyuki Suzuki. Data analysis was performed and the first draft of the manuscript was written by Pei Jiang, and all authors commented on previous versions of the manuscript. All authors read and approved the final manuscript.
Funding The authors declare that no funds and grants were received during the preparation of this manuscript. However, we thank the Ministry of Healthcare, Labor, and Welfare of Japan for supplying the anonymous data for our study. We are also grateful to Mr Dongyuan Li and Dr. Aloupogianni Eleni for their help revising this paper's English.
Code availability Not applicable.

Availability of data
The data used in this study need usage approval from the Ministry of Healthcare, Labor, and Welfare of Japan.

Declarations
Competing interests The authors have no relevant financial or nonfinancial interests to disclose.

Table 5 The statistical data on stress reasons in Japan. Columns: Number, Reasons, Number.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.