Introduction

The positive relationship between physical activity (PA) and health has been well established [1, 2], yet many adults worldwide perform insufficient PA [3]. Thus, understanding the factors associated with PA behavior is essential to develop and improve public health interventions [3,4,5]. Many studies have investigated the association of various factors, including personal, societal, and environmental factors, with different PA behavior indices such as the daily amount of moderate-to-vigorous PA (MVPA) or sedentariness [5,6,7]. Despite much progress in research into correlates, only a few studies have followed analytical approaches that account for both the existence of several levels of influence [5, 7, 8] and the complexity and multimodality of PA behavior [5, 9, 10]. To further advance correlates research, there have been calls for more research using both sophisticated statistical assessment that can capture the multilevel nature of correlates [4, 5] and PA behavior definitions that better reflect everyday life rather than unidimensional metrics such as daily MVPA [1, 5, 9, 10].

Using classical statistical modeling (such as regression analyses), studies have generally examined whether and how various factors are associated with different PA metrics [6, 11]. In classical statistics, these analyses can remain restricted to data analysts’ decisions about how associations and interactions are hypothesized (knowledge-driven), mainly because the factors included in the analyses are chosen subjectively according to their conceptual relevance and, in some cases, initial empirical associations [11, 12]. This may limit the recognition of new and innovative correlate categories, which are needed in this field for further progress [5, 11]. Ecological approaches that integrate ideas from several theories have also been used in correlates research, often to overcome the limitations of classical statistical analysis [5]. They have been used both to conceptualize the factors and their interrelationships at all levels explaining PA behavior (such as the interconnections between individuals and their social and physical environments) [13] and to guide variable selection for analyses [5, 11]. However, ecological approaches are also knowledge-driven [6] and, to some extent, rely on very well-established correlates [6, 8], which might result in missing some factors and interrelationships associated with PA behavior.

We have now entered a data-intensive era, with an increasing popularity of data mining approaches [14]. Such approaches originated from statistics but are known to capture hidden and novel insights buried in large amounts of data and to generate data-driven hypotheses [14, 15]. These developments are also relevant to PA research, in which there is a need for more complex approaches to identify the next generation of PA behavior correlates, understand their relative importance, and capture the complex interrelations among the factors at different levels [5, 6, 8]. Several studies have applied data mining approaches [16,17,18,19], mostly to establish data-driven correlate hierarchies [16, 17], but they used a limited number of factors and self-reported measurement of PA or sedentary behavior.

The present study applied a predictive data mining approach to classify individuals’ PA behavior (defined as active or inactive) using an extensive list of individual, demographical, psychological, behavioral, environmental, and physical factors. PA behavior, to better represent everyday life, was defined based on machine-learned activity profiles established previously using a multidimensional (clustering) approach applied to continuous accelerometer-measured activity intensities over one week [20]. This cross-sectional study sought to build a data-driven hierarchy of PA behavior correlates from empirical data and, as a secondary purpose, to methodologically identify PA behavior correlates from a wide list of factors.

Materials and methods

Data for the present study were from the population-based Northern Finland Birth Cohort 1966 study (NFBC1966). NFBC1966 is a life-course study involving participants whose dates of birth were expected to be in 1966 in Finland’s two northernmost provinces, Oulu and Lapland (n = 12,058, 96.3% of all live births in the study area). The present cross-sectional study included NFBC1966 cohort members who participated in the latest follow-up at age 46 and agreed to wear accelerometers for device-based physical activity measurements [21]. A total of 10,321 NFBC1966 cohort members (85.6% of all cohort members) were alive in Finland in 2012 and were invited to the follow-up, of whom 5621 (46.6% of all cohort members and 54.4% of those who were invited) participated and wore accelerometers (Fig. 1). With respect to the measurement tools and techniques, the collected data can be categorized into four groups: self-reported measures, clinical measures, objective built and natural environmental measures, and objective physical activity measures.

Fig. 1 The collected data in the latest follow-up of the Northern Finland Birth Cohort 1966 (a), and the selection of the study population, input variables, and outcome variables for data mining in the present study (b)

Questionnaires and measurements

Questionnaires

A postal questionnaire was sent to all living cohort members with known addresses. The questionnaire included items on social background, frequency and type of habitual exercises, physical and psychological health and well-being, and work–life and socioeconomic situation. In addition, health-related behaviors were assessed by a separate questionnaire, the Quality Of Life Questionnaire (15D©), to rate health-related quality of life [22]. A further separate survey addressed opinions and experiences and included questions from the Temperament and Character Inventory (TCI) questionnaire [23]. The temperament and personality trait scores were then computed from the responses to the TCI items. More details on the self-reported measures can be found elsewhere [24].

Clinical examination and measurement of physical activity

Participants were also invited to attend a clinical examination. The clinical examinations included measurement of anthropometry, body composition, and cardiorespiratory fitness. Participants’ height, weight, blood pressure, and waist-hip ratio were measured, and body mass index (BMI) was calculated. Participants’ body composition was measured with bio-impedance measurement (InBody720, InBody, Seoul, Korea). A static back muscle strength test (Biering-Sorensen trunk extension test) was performed to evaluate physical performance. A submaximal four-minute single-step test, during which heart rate was continuously monitored, was performed to assess cardiorespiratory fitness. Further details on the clinical examination protocol and measures are presented elsewhere [25, 26].

Objective measurement of physical activity was initiated during the clinical examination using a wrist-worn accelerometer (Polar Active, Polar Electro Oy, Kempele, Finland). Participants were instructed to wear the monitor on the wrist of their non-dominant hand continuously, 24 h a day, for 14 days. Polar Active has a uniaxial accelerometer that outputs estimated energy expenditure in metabolic equivalent (MET) values every 30 s. The validity of Polar Active under free-living conditions against the doubly labeled water technique has been shown elsewhere [27].

Environmental measures

We obtained the residential coordinates of all participants whose residences were available at the time of the 46-year follow-up data collection (2012–2014) from the Finnish Population Register Centre. We used a geographic information system (ArcGIS 10.3) to calculate built, natural, and socioeconomic environment variables (Supplementary file 1, Table S1) that might describe the conduciveness of participants’ residential environment to PA. We calculated all variables in the year the participant attended the 46-year data collection. We also determined quantitative environmental features using a one-kilometer-radius circular buffer around the residential locations, and the distances (as the crow flies) to amenities were measured using road network data.
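The circular-buffer idea can be illustrated with a minimal sketch. The study computed its variables in ArcGIS, so the code below is not the authors' procedure; it assumes latitude/longitude coordinates and a great-circle (haversine) distance, and the function names and example coordinates are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in metres (assumed constant)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in metres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def count_within_buffer(home, points, radius_m=1000.0):
    """Count features (e.g., bus stops) inside a circular buffer around home."""
    return sum(
        1 for lat, lon in points
        if haversine_m(home[0], home[1], lat, lon) <= radius_m
    )

# Hypothetical residence and two hypothetical amenities.
home = (65.0, 25.5)
amenities = [(65.005, 25.5), (65.02, 25.5)]
n_nearby = count_within_buffer(home, amenities)
```

A projected coordinate system with planar distances would serve the same purpose; the haversine form is used here only because it needs no external library.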

Data related to community structure; land use; amenities such as retail, recreation, office, and community institutions; and socioeconomic factors were derived from the Finnish community structure database [28]. Street network data, including the number of bus stops, intersection density, and length of cycle paths, were based on the Finnish national road and street database (Digiroad) [29]. Data on indoor and outdoor sport facilities were obtained from the Finnish database of sport facilities [30]. Natural environment features such as distances to the closest forests and parks and residential area greenness were assessed with the land cover data from the Finnish Environment Institute [31].

Data mining using a decision tree

We selected a decision tree technique to establish a data-driven model for classifying PA behavior. A decision tree model is created by partitioning the data on the basis of several independent input variables (or predictors) to form homogenous subgroups with respect to the outcome variable. A decision tree-produced hierarchy has a flow chart-like structure that enables identifying the relative importance of input variables in predicting the outcomes; the predictors in the higher layers of the hierarchy are the more important predictors [32]. In clinical applications and several other areas in which interpreting the results is of vital importance, decision trees are one of the most widely used classification methods [12, 14, 32, 33].

We used the Chi-squared Automatic Interaction Detection (CHAID) decision tree algorithm to create the model [34]. CHAID has been repeatedly used in studies with clinical applications whose main purpose was to identify key factors related to the outcomes of interest [35, 36]. In this algorithm, homogenous groups may be formed by any possible combination of the known values of a categorical predictor, or by setting cut-off points at any values of a continuous predictor. The number of independent predictors selected for creating the model, together with the number of categories (for categorical and ordinal predictors) and intervals (for continuous predictors) of the selected predictors, depends on the results of the Chi-square analyses and on whether the differences are significant. Since the correlates of PA behavior could be of mixed data types, CHAID is an appropriate candidate because it uses a nonparametric procedure with no assumptions about the underlying data and is designed to include continuous, ordinal, and categorical predictors [33].
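The Chi-square test that underlies each CHAID split can be sketched as follows. This is a simplified illustration, not the SPSS implementation used in the study: it omits CHAID's category merging and multiplicity adjustments, handles the p-value only for a 2 × 2 table (one degree of freedom), and uses hypothetical counts and an assumed significance level of 0.05.

```python
import math

def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    return stat

def p_value_df1(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))

# Hypothetical counts: rows are two candidate groups of a predictor,
# columns are (active, inactive) participants.
candidate_split = [[60, 40], [30, 70]]
stat = chi_square_statistic(candidate_split)
p = p_value_df1(stat)
split_is_significant = p < 0.05  # assumed threshold, not stated in the text
```

In this framing, a predictor whose best grouping yields the smallest p-value would be the candidate for splitting a node, and a node with no significant predictor would become a terminal subgroup.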

Decision tree model construction and validation

Input variables (predictors) and physical activity behavior (outcome variable)

The questionnaire and clinical and environmental measures, except those with more than ~ 10% missing values, were used as input variables. Recent evidence suggests that any single unidimensional metric (including the most commonly used criterion, which defines physical inactivity as an activity level insufficient to meet present recommendations [1]) might not be enough to define individuals’ PA behavior [10, 37,38,39]. We therefore used participants’ activity profiles, which we built in a previous study using a multidimensional approach and continuous accelerometer data, to define the PA behaviors for the present study [20]. A distinct aspect of this approach is that continuous accelerometer-measured activity intensities in one full week across the whole intensity continuum, including sedentary (SED), light PA (LPA), and MVPA, were incorporated into a machine learning approach to create the activity profiles.

The details about how the activity profiles were established have been presented elsewhere [20]. Briefly, the X-means clustering algorithm was applied to accelerometer-based MET-level data of participants who had seven consecutive valid measurement days (N = 4582), and four distinct activity profiles (clusters) were derived. A total of 1008 features/variables (10-min averages of the original 30-s MET data, resulting in 144 MET values for each of the 7 valid measurement days) for each participant were fed into the clustering algorithm for creating the profiles [20]. A valid measurement day was defined as at least 600 min of activity monitor wearing time per day during waking hours. Seven consecutive valid measurement days were used as a criterion to enable analyzing one full week including both weekdays and weekends. The activity profiles were named with respect to the temporal and intensity patterns of participants’ daily activities in each cluster: Inactive (N = 1881), Moderately active (N = 802), Evening active (N = 1297), and Very active (N = 602). Our initial experiments revealed that the decision trees induced for classifying the four activity clusters had poor performance and generalizability, primarily because the outcome variable had problems of both class imbalance (i.e., 41% Inactive, 18% Moderately active, 28% Evening active, and 13% Very active) and class overlap (i.e., those in the Moderately active, Evening active, and Very active clusters had comparable activity profiles with different temporal patterns) [40]. Previous research has shown that the effects of these two interrelated problems in limiting the performance and generalizability of classification trees are best minimized with a near-balanced class distribution in the outcome variable [41]. We therefore defined those in the Moderately active, Evening active, or Very active clusters as active (N = 2701), and the remaining participants, who were in the Inactive cluster, as inactive (N = 1881). We used the input variables in their original form to classify the two PA behavior categories: active and inactive.
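The arithmetic behind the 1008 clustering features and the merged outcome classes can be made concrete with a short sketch. It assumes each day is supplied as a complete vector of 2880 thirty-second MET values; how non-wear epochs were handled is not described here, and the function names are hypothetical.

```python
EPOCHS_PER_DAY = 24 * 60 * 2        # 30-s epochs in one day (2880)
EPOCHS_PER_WINDOW = 10 * 60 // 30   # 30-s epochs in a 10-min window (20)

def day_to_features(met_epochs):
    """Average one day of 30-s MET values into 144 ten-minute means."""
    if len(met_epochs) != EPOCHS_PER_DAY:
        raise ValueError("expected 2880 epochs for a full day")
    return [
        sum(met_epochs[i:i + EPOCHS_PER_WINDOW]) / EPOCHS_PER_WINDOW
        for i in range(0, EPOCHS_PER_DAY, EPOCHS_PER_WINDOW)
    ]

def week_to_features(days):
    """Concatenate seven days of 10-min means into one 1008-value vector."""
    if len(days) != 7:
        raise ValueError("expected seven valid days")
    features = []
    for day in days:
        features.extend(day_to_features(day))
    return features

def to_binary_outcome(profile):
    """Collapse the four activity profiles into active / inactive."""
    return "inactive" if profile == "Inactive" else "active"

# Cluster sizes as reported in the text.
cluster_sizes = {"Inactive": 1881, "Moderately active": 802,
                 "Evening active": 1297, "Very active": 602}
binary_sizes = {"active": 0, "inactive": 0}
for profile, n in cluster_sizes.items():
    binary_sizes[to_binary_outcome(profile)] += n
```

Merging the three active clusters moves the class distribution from roughly 41/18/28/13% to roughly 59/41%, which is the near-balanced outcome used for the decision tree.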

Missing values and algorithm parameters

Missing values were included in the analysis as a separate category that was allowed to merge with other categories in the decision tree. The imputation of missing values of input variables was therefore unnecessary [35]. A previous study has shown that a decision tree developed in the presence of missing values in its input variables has reasonable misclassification rates, especially when the proportion of missing values is not very high (e.g., 20%) [42].
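A minimal sketch of this handling, for categorical inputs only, is given below. It combines the ~10% exclusion rule with recoding the remaining missing entries as their own category; the label `"missing"`, the function names, and the use of `None` for missing entries are assumptions for illustration, and the later merging of categories is left to the tree algorithm.

```python
MISSING = "missing"          # label for the extra category (hypothetical name)
MAX_MISSING_SHARE = 0.10     # approximate exclusion threshold from the text

def missing_share(values):
    """Fraction of entries that are missing (represented here as None)."""
    return sum(v is None for v in values) / len(values)

def prepare_inputs(columns):
    """Drop variables with too many missing values; recode the rest."""
    prepared = {}
    for name, values in columns.items():
        if missing_share(values) > MAX_MISSING_SHARE:
            continue  # variable excluded from the input list
        prepared[name] = [MISSING if v is None else v for v in values]
    return prepared

# Hypothetical toy data: "x" has 10% missing (kept), "y" has 50% (dropped).
columns = {
    "x": ["a", None, "b", "a", "b", "a", "b", "a", "b", "a"],
    "y": [None, None, "a", "b"],
}
prepared = prepare_inputs(columns)
```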

Several parameters must be set prior to constructing a decision tree model. Of these parameters, the pruning criteria are the most important for limiting the size of the tree and preventing overfitting [14]. The pruning criteria were set such that groups smaller than 80 were not split any further (minimum number of participants in a parent node), and no group smaller than 40 was formed (minimum number of participants in a child node). The tree growth was limited to 10 layers, meaning that a maximum of 10 factors could be selected to form a group.
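These three stopping rules can be expressed as a single check applied to every candidate split. The sketch below is illustrative only; the constant and function names are hypothetical, and it assumes the root node is counted as depth 0.

```python
MIN_PARENT_SIZE = 80   # groups smaller than this are not split further
MIN_CHILD_SIZE = 40    # no resulting subgroup may be smaller than this
MAX_DEPTH = 10         # maximum number of layers below the root

def split_allowed(parent_size, child_sizes, parent_depth):
    """Return True if a candidate split satisfies all three stopping rules."""
    if parent_size < MIN_PARENT_SIZE:
        return False
    if parent_depth >= MAX_DEPTH:
        return False
    return all(size >= MIN_CHILD_SIZE for size in child_sizes)
```

For example, a node of 100 participants could be split into groups of 60 and 40, but not into groups of 61 and 39.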

Model validation and visualization

We created and validated the model using 10-fold cross-validation. To evaluate the accuracy of the final decision tree model, we used the confusion matrix, which shows the proportion of participants in each outcome category that was correctly and incorrectly classified. In the visualization of the final tree, the percentage of active and inactive participants in each subgroup, along with the response index (RI), was presented. The RI is the percentage of inactive participants in each subgroup relative to that of inactive participants in the total sample (i.e., 41.1%). Similar to an odds ratio, RI is an indicator of the direction and strength of the association [16].
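The two evaluation quantities described above can be sketched as follows. The confusion matrix and accuracy follow their standard definitions; the RI is computed as a ratio of proportions, which matches the definition given in the text. The subgroup counts in the example are hypothetical, while the totals (1881 inactive of 4582) come from the study.

```python
def confusion_matrix(actual, predicted, labels=("active", "inactive")):
    """Count (actual, predicted) pairs; outer keys are actual classes."""
    counts = {a: {p: 0 for p in labels} for a in labels}
    for a, p in zip(actual, predicted):
        counts[a][p] += 1
    return counts

def accuracy(matrix):
    """Share of participants whose predicted class equals the actual class."""
    correct = sum(matrix[label][label] for label in matrix)
    total = sum(sum(row.values()) for row in matrix.values())
    return correct / total

def response_index(n_inactive_subgroup, n_subgroup, n_inactive_total, n_total):
    """Inactive share in a subgroup relative to the inactive share overall."""
    subgroup_share = n_inactive_subgroup / n_subgroup
    overall_share = n_inactive_total / n_total
    return subgroup_share / overall_share

# Hypothetical subgroup of 100 participants, 60 of whom are inactive.
ri = response_index(60, 100, 1881, 4582)

# Hypothetical toy predictions.
actual = ["active", "active", "inactive", "inactive"]
predicted = ["active", "inactive", "inactive", "inactive"]
matrix = confusion_matrix(actual, predicted)
```

An RI above 1 indicates a subgroup with a higher-than-average share of inactive participants, and an RI below 1 indicates a lower-than-average share.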

Activity patterns in decision tree-formed subgroups of participants

Given that the outcome variable was formed with a multidimensional approach, we also calculated Z-scores of three PA metrics, namely average daily time (minutes per day [min/day]) spent in SED, LPA, and MVPA, in each decision tree-formed subgroup of participants. A Z-score indicates how many standard deviations the mean of a measure in a subgroup is away from the corresponding mean in the whole study population. As such, we could compare the variation of the three activity intensities across different subgroups with respect to the study population means. We calculated these three PA metrics from the same seven consecutive valid measurement days that were used to establish the activity profiles [20], using cut-points previously validated by the accelerometer manufacturer (SED, 1–1.99 MET; LPA, 2–3.49 MET; and MVPA, ≥ 3.5 MET) [43].
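The cut-point classification and the subgroup Z-score can be sketched as below. The sketch assumes that each 30-s epoch contributes half a minute, that values below 1 MET are left uncategorized, and that the sample standard deviation of the whole population is used; none of these details are specified in the text, and the function names are hypothetical.

```python
import statistics

def intensity_category(met):
    """Map a MET value to SED, LPA, or MVPA using the reported cut-points."""
    if met >= 3.5:
        return "MVPA"
    if met >= 2.0:
        return "LPA"
    if met >= 1.0:
        return "SED"
    return None  # below 1 MET: left uncategorized (assumption)

def minutes_per_category(met_epochs, epoch_minutes=0.5):
    """Sum time in each intensity category over a series of 30-s epochs."""
    totals = {"SED": 0.0, "LPA": 0.0, "MVPA": 0.0}
    for met in met_epochs:
        category = intensity_category(met)
        if category is not None:
            totals[category] += epoch_minutes
    return totals

def subgroup_z_score(subgroup_values, population_values):
    """Distance of the subgroup mean from the population mean, in SDs."""
    pop_mean = statistics.mean(population_values)
    pop_sd = statistics.stdev(population_values)  # sample SD (assumption)
    return (statistics.mean(subgroup_values) - pop_mean) / pop_sd
```

A positive Z-score for MVPA, for instance, would indicate that a subgroup accumulates more MVPA minutes per day than the study population on average.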

Association analysis

The same PA metrics (SED, LPA, and MVPA) were also used for association analyses. We examined the association between factors emerging from the model and these PA metrics to determine the significance and relative importance of the methodologically identified factors. We used adjusted generalized linear mixed models, including urban–rural area as a random effect, to examine the associations between each independent variable (factor emerging in the decision tree) separately with min/day in SED, LPA, and MVPA. Age and gender were used as covariates in all models. We standardized the continuous independent variables to obtain a mean of zero and a standard deviation (SD) of 1 before including them in regression analyses. As such, we could interpret coefficients (B) from the models encompassing a continuous independent variable as a change in the outcome (e.g., min/day of LPA) for every 1 SD change in the independent variable and therefore compare them to each other across a similar outcome in terms of magnitude regardless of the unit. We included the categorical and ordinal independent variables in the regression analyses in the form of dummy variables and set response categories at the lowest end as the reference category. A p-value below 0.05 was considered statistically significant. All analyses (including data mining) were performed with IBM SPSS Statistics for Windows, version 25.0 (IBM Corporation, Armonk, USA).
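The two preprocessing steps described above, standardizing continuous predictors and dummy-coding categorical or ordinal ones against the lowest category, can be sketched as follows. The analyses themselves were run in SPSS; this sketch covers only the preprocessing, uses the sample SD as an assumption, and its function and column names are hypothetical.

```python
import statistics

def standardize(values):
    """Rescale values to a mean of 0 and a standard deviation of 1."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (assumption)
    return [(v - mean) / sd for v in values]

def dummy_code(values, ordered_categories):
    """Dummy-code a variable, using the lowest category as the reference."""
    reference, *others = ordered_categories
    return [{f"is_{c}": int(v == c) for c in others} for v in values]

# Hypothetical body fat percentages and a hypothetical ordinal variable.
z_body_fat = standardize([20.0, 25.0, 30.0])
dummies = dummy_code(["low", "high", "mid"], ["low", "mid", "high"])
```

After standardization, a coefficient B for a continuous predictor reads as the change in min/day of the outcome per 1 SD change in that predictor, which is what allows magnitudes to be compared across predictors measured in different units.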

Results

Participants

A total of 4582 participants (38% of all cohort members and 44.4% of those invited to the 46-year follow-up) had enough valid PA data to be included in the cluster analysis study [20] and, accordingly, sufficient information on the outcome variable (active or inactive profile) for inclusion in the present study. The numbers of participants with active and inactive profiles were 2701 (58.9%) and 1881 (41.1%), respectively. The characteristics of the study’s participants for the whole sample and for each of the two outcome categories are shown in Table 1. These descriptive results are identical to those reported in the cluster analysis study [20].

Table 1 The characteristics of the study participants

Input variables

We used a total of 168 factors as input variables after eliminating those with over ~ 10% missing values. Overall, the factors related to medication use and diseases had the highest number of missing values (~ 20–50%), while the number of missing values in environmental and adiposity-related factors was lowest (~ 1–5%). Of these 168 factors, 82 were continuous, 19 were categorical, and 67 were ordinal. All 168 input variables are given in Supplementary file 1, Tables S1–S3.

Decision tree model

The prediction results are presented in Table 2. The overall classification accuracy was 69.7%. The final decision tree is shown in Fig. 2. The decision tree algorithm selected a total of 36 different factors from different domains, by which 54 subgroups of participants were formed (marked in Fig. 2 as S1-S54), 26 predicted as active and 28 as inactive. The most frequently appearing factor in the model, appearing three times, was ‘average weekday total sitting time’, followed by ‘average weekday sitting time at the office or such places’, ‘body fat percentage’, ‘frequency of exercise through walking’, ‘urban-rural areas’, and ‘difficulty of a 5-kilometer run without breaks’, each of which appeared twice. Other variables appeared only once. The number of layers (or factors) for forming subgroups ranged from two to seven, even though the allowed maximum number of layers was 10.

Table 2 Confusion matrix showing the performance of the model with 10-fold cross-validation

Fig. 2 The Chi-squared Automatic Interaction Detection tree illustrating the hierarchy of the factors predicting Active and Inactive participants. The thickness of branches is based on the number of participants in the branch. Categories (for categorical and ordinal variables) and cut-off values (for continuous variables) are shown in italicized text, and the variables in normal text. In interval notations between brackets, inclusiveness and exclusiveness are shown with square and round brackets, respectively

Overall, participants with higher body fat percentage (> 31%) were more likely to be inactive (RI range: 1.16–1.49) compared with those with lower body fat percentage (< 28.3%). The largest subgroup of inactive participants (n = 193, RI = 1.55) included those with the highest body fat percentage who reported physical activity through gardening more than once a month and had a normalized heart rate recovery slope < 55% per second. The largest active subgroup (n = 335, RI = 0.39) was composed of participants with the lowest body fat percentage in the study population and with a normalized heart rate recovery 60 s after exercise > 25 beats per minute. Participants who lived in city/rural centers and had a physically demanding occupation (i.e., process and transport workers, forestry workers and farmers, and other workers) had the lowest risk of being inactive (RI = 0.11).

SED, LPA, and MVPA variations in the decision tree-formed subgroups of participants

The variations in the three activity intensities in the 54 decision tree-formed subgroups of participants are shown in Fig. 3. Most inactive and active subgroups had different accumulation patterns of SED, LPA, and MVPA. In general, although most active subgroups had a lower SED level than the population mean, some subgroups had noticeably higher levels of MVPA (e.g., subgroups 3, 6, and 7), while others had noticeably higher levels of LPA (e.g., subgroups 20, 32, 33, and 52). Inactive subgroups generally had a higher SED level and a lower MVPA level than the population mean, while several subgroups had noticeably lower levels of both LPA and MVPA (e.g., subgroups 41, 46, 49, and 51).

Fig. 3 Z-scores of sedentary (SED), light physical activity (LPA), and moderate-to-vigorous physical activity (MVPA) in the 54 decision tree-formed subgroups of participants. S = Subgroup

Association analysis

Tables 3 and 4 show the association between the continuous, categorical, and ordinal explanatory variables from the decision tree model and the three PA metrics in the total study population. All factors except fear of uncertainty and impulsiveness scores were associated with at least one PA metric. Most continuous factors (Table 3) in the relatively high layers of the decision tree model and larger subgroups significantly explained min/day in all three PA metrics. For example, body fat percentage was positively associated with SED level (B = 26.5) and inversely associated with LPA (B = -16.1) and MVPA (B = -11.7) levels. Higher normalized heart rate recovery 60 s after exercise was associated with lower SED (B = -16.1) and higher LPA (B = 9.9) and MVPA (B = 9.6). Categorical factors were also associated with min/day in SED, LPA, and/or MVPA (Table 4). For instance, those with physically strenuous occupations (workers, farmers, service, sales, and care staff compared with managers, advisers, office workers, etc.) spent less time in SED (B = -46.7) and more time in LPA (B = 41.1) and MVPA (B = 3.5). Those who reported a higher frequency of physical activity through gardening (2–3 times a month or more compared with less than once a month or not at all) had lower SED (B = -20.6) and higher LPA (B = 14.4).

Table 3 Associations in the whole study population (N = 4582) between the continuous factors that emerged in the decision tree model and time spent in sedentariness (SED), light physical activity (LPA), and moderate-to-vigorous physical activity (MVPA)
Table 4 Associations in the whole study population (N = 4582) between the categorical and ordinal factors that emerged in the decision tree model and time spent in sedentariness (SED), light physical activity (LPA), and moderate-to-vigorous physical activity (MVPA)

Overall, from the regression coefficients (B) in Tables 3 and 4 (indicative of changes in min/day of SED, LPA, and MVPA for every 1 SD change in the predictor and of changes from the reference response categories, respectively), the associations seemed generally stronger for those factors that emerged in the higher layers and larger subgroups. For instance, higher body fat percentage and lower normalized heart rate recovery slope were associated with lower and higher min/day in MVPA, respectively, but the former, which appeared at a higher level of the decision tree, was associated with MVPA to a greater extent (B = -11.7 vs. 9.5).

Discussion

This study applied the decision tree technique to establish a multilevel data-driven model that predicts adults’ PA behavior, defined as active or inactive based on their machine-learned activity profiles, and to methodologically identify PA behavior correlates. From the 168 factors of different domains used as input variables to create the decision tree model, the final model selected 36 factors from which 54 different participant subgroups with different variations in SED, LPA, and MVPA were formed. The largest subgroup of inactive participants included those with the highest body fat percentage, who were frequently engaged in physically demanding activities through gardening, but who had rather slow heart rate recovery. The largest subgroup of active participants included those with the lowest body fat percentage in the study population with a relatively fast heart rate recovery. The factors that emerged from the decision tree model, such as body fat percentage, normalized heart rate recovery 60 s after exercise, urban–rural areas, average weekday total sitting time, and extravagance score, were associated with SED, LPA, and/or MVPA time. Thus, the present results may inform both multilevel intervention allocation and design.

Consistent with the results of studies focusing on understanding the causation of PA behaviors [5, 8, 13, 44], the established model in the present study indicates that PA behavior is explained by a multilevel hierarchy composed of various factors in different domains. However, our results extend this finding by indicating that PA behavior predictors for different subgroups are different and come from various domains. In addition, our model was driven by empirical data consisting of a range of factors. Studies have generally conceptualized the influences on PA behaviors by theoretically combining common sense and well-established evidence, therefore primarily providing a broad view of PA behavior and its causation for general populations [5, 8, 44]. While previous multilevel models have succeeded in hypothesizing the interaction among factors of different domains, their practical implications have remained limited [8], partially because of their theoretical nature. Two studies have applied a data-driven approach to establish a decision tree–based model, but with a self-reported PA measure and a limited number of factors: one of them used only demographical factors [17], while the other used only sociodemographic factors [16]. Overall, the multilevel model presented here specifies the PA behavior correlates at different levels in each subgroup and may be utilized to target and tailor interventions.

Most factors that emerged in the decision tree model have been recognized as factors associated with PA behavior in past works, such as education level, profession, overall health status, fitness status, and population density [5, 6, 11]. However, some factors in the decision tree model were less established, such as those related to personality and temperament, including extravagance, impulsiveness, and explorative excitability [6]. Such factors (or factors similar to them) were assessed in a few studies but, mostly due to limited or sometimes contradictory evidence, had been neither identified as correlates nor rejected. The other factors that can also be categorized as less established are body composition measures (i.e., lean body mass and skeletal muscle mass) and a few of the psychological and environmental factors (e.g., enjoyment of daily activities and number of road accidents) [6,7,8]. A few measures related to heart rate recovery also emerged in the decision tree model. Even though the association of PA with heart rate recovery measures has been well studied [45], these measures can be considered novel factors associated with PA behavior that are identified in the present study, because our results indicate the existence of another direction of relationship that has not been previously examined.

The less established and previously undiscovered factors found here may be candidates for the next generation of correlates [5]. These factors have likely remained underreported (or unexamined) because of the subjective tendency in the existing literature toward examining only those factors for which evidence of significant associations (positive or negative) with different PA behavior indices has been well-established [11]. It is important to consider that these factors were selected by the decision tree to create the final model from a wide list of input (independent) variables. This suggests that the less established and novel factors that emerged in the decision tree model might be relatively more important correlates and likely surrogates for the other previously less established or well-established factors that the decision tree excluded in creating the model, such as behavioral attributes (e.g., alcohol, smoking, etc.) or socioeconomic status [6]. Nevertheless, one must infer the relative importance of the emergent factors with caution. The study’s participants had a narrow age range (46–48 years), which might explain why some of the well-known PA behavior correlates, including age and gender, did not appear in the final model [5, 6, 11, 46]. This result agrees with the findings of a previous review, speculating that in studies including both men and women with sufficient age diversity, age was found to be inversely associated with PA participation, and significant differences in PA participation existed between men and women (higher in men) [11].

As far as we know, our study is the first to use machine-learned activity profiles to define the PA behavior of participants. Previous studies have generally examined the associations between different factors and unidimensional indices, typically including the daily amount of SED, LPA, and/or MVPA [16, 17, 47, 48]. However, recent evidence from time-use epidemiological studies and beyond suggests that these three activities are interrelated [10, 37,38,39] and should all be considered when studying individuals’ PA behavior [37,38,39]. Although, due to methodological constraints (i.e., outcome variable imbalance and overlap), we merged all the active participants into one class to form a near-balanced and non-overlapping outcome variable [40, 41], it was apparent that the accumulation patterns of SED, LPA, and/or MVPA varied in the 54 decision tree-formed subgroups and across different active and inactive subgroups. This indicates that all three activity intensities, along with their interrelationships, were considered in our definition of active and inactive individuals, which was based on machine-learned activity profiles [20]. Hence, our multidimensional definition of PA behavior might limit the comparability of our results with those of other studies with unidimensional criteria for defining PA behaviors.

Body fat percentage, a direct measure of adiposity, was the primary discriminator in the decision tree model. Even though it is typically assumed that PA affects adiposity-related measures, this result is consistent with the findings of a previous systematic review suggesting a possible bidirectional relationship between adiposity and PA behavior [5]. Other factors for which the reverse direction of the relationship is generally assumed (i.e., PA influencing the factor) also appeared in other layers of the final model, including muscle strength and heart rate recovery measures. Of note is the prognostic value of most of these factors for several chronic health conditions. For example, attenuated heart rate recovery is associated with an increased risk of diabetes [49] and can even indicate the presence of coronary artery disease [50]. Chronic health conditions have been identified both as barriers to and as motivators for PA in different populations [51]. Although self-reported measures addressed the prevalence of diagnosed diseases (e.g., diabetes and hypertension), these direct measures were removed from the list of input variables because of the high number of missing values. Moreover, the study sample was not restricted to healthy individuals. As a result, the factors in our model that have prognostic value for chronic diseases may be acting as partial surrogates for chronic health conditions or risks and their effects on different PA behaviors.
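To illustrate what it means for one variable to be the primary discriminator, the sketch below selects a root split by the largest reduction in Gini impurity. It assumes a CART-style criterion, which may differ from the splitting rule actually used in this study, and all variable names and values are hypothetical toy data in which the outcome is perfectly separated by one feature.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, impurity_decrease) of the best binary split on one feature."""
    parent = gini(labels)
    n = len(labels)
    best = (None, 0.0)
    distinct = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(distinct, distinct[1:]):
        t = (lo + hi) / 2
        left = [y for x, y in zip(values, labels) if x <= t]
        right = [y for x, y in zip(values, labels) if x > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best[1]:
            best = (t, gain)
    return best

def pick_root(features, labels):
    """Choose the feature whose best split reduces impurity the most."""
    scored = {name: best_threshold(vals, labels) for name, vals in features.items()}
    root = max(scored, key=lambda name: scored[name][1])
    return root, scored[root]

# Hypothetical toy data (not study data): eight participants, two features.
labels = ["Active"] * 4 + ["Inactive"] * 4
features = {
    "body_fat_pct": [18, 20, 22, 24, 30, 32, 34, 36],
    "sitting_hours": [5, 9, 6, 10, 7, 11, 8, 12],
}
root, (threshold, gain) = pick_root(features, labels)
```

In this toy example the hypothetical body fat variable separates the two classes completely, so it is chosen as the root. In real data, correlated variables compete for the same split, which is why an excluded factor can still be relevant and why a selected factor may act as a surrogate for others.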

We also performed association analyses between all the factors that emerged in the decision tree model and three PA metrics. Almost all of these factors were significantly associated with SED, LPA, and/or MVPA. The results of the association analyses were, at least for the well-established factors, in line with previous studies. For instance, a better health-related quality of life score was associated with lower levels of SED [52] and higher levels of LPA and MVPA [6]. The results of the association analyses also indicated the relative importance of the identified factors, supporting the use of our results to prioritize the factors associated with PA behavior.

The main strength of the present study is the inclusion of a broad list of factors rather than a few subjectively selected factors [5, 11], which resulted in the discovery of novel predictors. The objective measurement of daily PA is also a strength, as previous studies have typically used self-reported PA measures that are known to be imprecise and biased [53]. Another strength is the discrimination of PA behaviors based on activity profiles built using the whole activity intensity spectrum over the course of one full week [10]. However, the binary categorization of participants (active or inactive) might be a limitation. We used a binary outcome variable because the model’s prediction accuracy degraded significantly when the number of PA behavior categories increased (for example, to Inactive, Moderately active and Evening active, and Very active), mostly because of misclassification between the active categories (results not presented). This is not surprising because, despite the different temporal patterns of activity in the active profiles, their overall activity levels were comparable (overlap problem) and the outcome variable was imbalanced. Although it would have been possible to reduce the dimensionality or select the relevant features prior to decision tree induction [54], or to use more complex learning algorithms (such as ensemble methods) to achieve better performance with a higher number of categories in the outcome variable [33], these steps could have caused a loss of key mechanistic information and reduced the interpretability of the final model [33, 40, 54], limiting the recognition of the novel correlate categories identified here. Another limitation is the cross-sectional study design, which precludes causal inference. In addition, although more than 85% of the original cohort members were alive in Finland during the latest follow-up, fewer than 40% participated and provided valid accelerometer data, and these were possibly the healthier and more active members. This might induce selection bias and limit the generalizability of the results. Additionally, the study sample was homogeneous in terms of age and ethnicity, and some of the emergent factors in the final model were related to cultural and health behaviors. These characteristics might also limit the generalizability of the results, especially to more diverse populations with different cultural and health behaviors.

Conclusion

Using a data mining approach, we established a multilevel model that predicts PA behavior from large-scale empirical data. The model consisted of 36 different factors of relative importance from different domains and may be used to target and tailor interventions. The factors emerging from the decision tree model, such as body fat percentage, normalized heart rate recovery 60 s after exercise, urban-rural areas, average weekday total sitting time, and extravagance score, were associated with SED, LPA, and/or MVPA time. The extensive set of factors identified through this data-driven method can serve as a basis for additional hypothesis testing in PA correlates research. Finally, data mining appeared to be a feasible approach that is capable of identifying different factors, along with their interdependencies, in explaining PA behavior.