1 Introduction

The methods used for data analysis are often divided into confirmatory methods, which are applicable to well-structured research problems and are used to verify (falsify) theoretical models, and exploratory methods, which address unstructured problems and help formulate theoretical propositions that emerge from the analyzed data. This division is supplemented by hybrid approaches that combine methods, often drawn from different analytical traditions and resting on different methodological assumptions.

These types of hybrid methods are used both in confirmatory approaches such as structural modeling with latent variables (Structural Equation Modeling, SEM) and in exploratory approaches, which include the decision tree method.

Examples of hybrid approaches in structural modeling are the so-called automated structural models, which combine structural models with heuristic numerical procedures. Within them, SEM models are combined with heuristic ant colony optimization algorithms (ACO-SEM), genetic algorithms (GA-SEM), tabu search (TS-SEM), destruction and reproduction (ruin-and-recreate, R and R-SEM), or simulated annealing (SA-SEM) (Marcoulides and Ing 2012; Sagan and Perek-Białas 2016).

Their application is associated with attempts at a heuristic search for the model specification (model search). Such a search is motivated by the exploratory nature of specification search, the complexity of the model, and the number of variables and of potential dependencies between them. Without taking into account any theoretical assumptions, the number of possible SEM models built on a given covariance matrix based on k variables is \(n = 4^{\frac{k(k-1)}{2}}\).
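To give a sense of how quickly this search space grows, the formula can be evaluated for a few values of k. The Python sketch below does only that; the function name `count_sem_models` is illustrative and not part of any package.

```python
def count_sem_models(k: int) -> int:
    """Number of candidate SEM specifications for k variables: 4 ** (k(k-1)/2)."""
    if k < 1:
        raise ValueError("k must be a positive integer")
    n_pairs = k * (k - 1) // 2  # number of distinct pairs of variables
    return 4 ** n_pairs


if __name__ == "__main__":
    for k in (2, 3, 6, 8):
        print(k, count_sem_models(k))
```

For k = 8, the number of indicators used later in this paper, the formula gives \(4^{28} = 2^{56} \approx 7.2 \times 10^{16}\) candidate models, which illustrates why an exhaustive search is impractical.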

The idea of hybridizing decision trees with other algorithms is not new. The combination of decision trees (the CHAID algorithm) with logistic regression carried out by Lindahl and Winship (1994) was probably the first attempt to build this kind of hybrid model. Hybridization was based on the sequential use of these analytical tools. After an initial exploration of the data set with the CHAID algorithm, the cases were divided among terminal nodes.

In the second step of the procedure, a separate logistic regression model was built for each leaf. Another concept of hybridization was proposed a few years later by Steinberg and Cardell (1998). It combined the CART (classification and regression trees) algorithm (Breiman et al. 1984) with logit models. This was also a two-step procedure; however, the set of independent variables in the logit model was supplemented with an additional variable whose categories indicated the terminal node to which each case was assigned.

After 1998 several researchers attempted to combine decision trees with logistic regression. They developed new hybrid approaches known as LOTUS (Logistic Tree with Unbiased Selection) (Chan and Loh 2004), LMT (Logistic Model Tree) (Landwehr et al. 2005), or PLUTO (Penalized, Logistic Regression, Unbiased Splitting, Tree Operator) (Zhang and Loh 2014).

Combining clustering with decision trees for building hybrid predictive models has long been of interest to many researchers as well. Some authors combined the results of clustering algorithms (k-means, k-medoid, self-organizing maps (SOM), fuzzy c-means and Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH)) with the results of a boosted C5.0 decision tree. Others proposed a hybrid model for predicting churn in customer relationship management, in which C5.0 decision trees were combined with a Growing Hierarchical Self-Organizing Map (GHSOM).

In general, hybrid models based on decision trees and clustering algorithms are referred to as cascade models, cross-algorithm ensembles, and two-stage classification (Łapczyński and Jefmański 2013; Łapczyński 2016).

One of the hybrid methods that allow combining the advantages of a confirmatory SEM approach and exploratory classification trees is the SEM-Tree model (Brandmaier et al. 2013b).

In the literature there are many papers on the allocation of resources in households, but they do not relate it to altruistic and competitive decision-making strategies and preferences. Kazianga and Wahhaj (2017) investigated the relationship between the family ties of household members and the allocation of resources within that household. Quanjer and Kok (2019), in turn, examined how the distribution of the family budget influences the nutrition of children and their height in adulthood. The subject of research is often gender roles in the household. Authors examine how resource allocation is determined when the wife earns more than her husband (Commuri and Gentry 2005), when the wife spends money in the absence of her husband in a family of Mexican immigrants in the US (Antman 2015), or when the expenses relate to healthcare (Onah and Horton 2018). The worse position of women in decisions on the allocation of household resources is confirmed by research results from many parts of the world: from Kenya (Marinda 2006), Russia (Lacroix and Radtchenko 2011) and India (Fuwa et al. 2006).

The aim of the article is to identify the dimensions that shape the strategies of scarce resource allocation used by members of Polish households and to segment these households. The hybrid approach allows the separation of household segments based on the interrelationships between the decision-making strategies (the altruistic, patriarchal person strategy vs. the competitive, rivalry strategy) and preferences regarding consumer resources, that is, cultural, human and financial capital (reputation, time and money). The applied contribution of the paper is the identification of these relationships in a heterogeneous population using exploratory and hybrid approaches that link structural equation modeling with classification and regression trees in the SEM-Tree model. Because applications of the SEM-Tree approach are still relatively rare in the literature, several SEM-Tree growing methods and n-fold cross-validation settings were used and compared.

2 SEM-Tree approach

Structural Equation Model Trees (SEM Trees) is a method combining a confirmatory approach (SEM) with an exploratory approach (recursive partitioning known from decision trees) (Brandmaier et al. 2013a; Zeileis et al. 2008). The graphical form of the model resembles the structure of a decision tree in which the nodes correspond to estimated structural equation models. The nodes of the decision tree are partitioned on the basis of covariates, which are not used in building the structural models. At each terminal node of the tree, the SEM model is built on the cases (the subset of data) that belong to it.

SEM-Tree is a flexible and useful tool for assessing model invariance with either categorical or continuous partitioning variables. However, a SEM Tree tends to overfit the sample data, so pruning is needed to obtain a more generalizable solution (Brandmaier et al. 2013a).

The elementary algorithm for the induction of SEM Trees is as follows:

  1. Fit a parametric model (the so-called template model) to the whole data set.

  2. Split the data set into two parts in all possible ways, using all covariates. In the child nodes of the decision tree, build structural models and compare their joint fit (the so-called compound model) with the fit of the template model.

  3. Choose the best compound model, that is, the one that best describes the data according to the chosen criterion. If the best compound model fits significantly better than the template model, repeat the procedure from the first step within each child node; otherwise, stop growing the SEM Tree.

The structural model M(\({\theta }\)) based on all cases (built in the root of the decision tree) is called a template model, a pre-split model, or a base model. It is estimated by optimizing a fit function, usually the maximum likelihood fit function (Furnival 1961).

Data set D is represented by an \(n\times (k + l)\) matrix, where n is the number of cases, k is the number of observed variables and l is the number of covariates not included in the structural model. The distinction between observed variables and covariates means that matrix D is divided into two submatrices, \(D_k\) and \(D_l\), respectively.

Covariates are used to split nodes, and their values or categories determine the child node to which an object from the data set is assigned. In order to include continuous (quantitative), ordinal and nominal covariates in the analysis, all multi-valued and multi-categorical variables are converted into dichotomous ones. The way they are transformed depends on their type (measurement scale).

Let N denote the number of categories or distinct values of a covariate. According to the proposed binarization procedure, continuous and ordinal covariates can be transformed in \(N-1\) ways, and nominal covariates in \(2^{(N-1)} - 1\) ways. For every possible split, structural models, called submodels, are built in the child nodes. Taken together, they form what is also called the “post-split model” or “compound model”.
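The two counts can be checked by enumerating the candidate splits directly. The following Python sketch is only an illustration of the binarization rule described above; the function names are hypothetical and do not come from the semtree package.

```python
from itertools import combinations


def ordered_splits(values):
    """Binary splits of an ordinal or continuous covariate: N - 1 cut points."""
    levels = sorted(set(values))
    return [(levels[:i], levels[i:]) for i in range(1, len(levels))]


def nominal_splits(categories):
    """Binary splits of a nominal covariate: 2 ** (N - 1) - 1 two-block partitions."""
    levels = sorted(set(categories))
    first, rest = levels[0], levels[1:]
    splits = []
    # Keep the first category on the left so mirror-image splits are not counted twice.
    for size in range(len(rest)):  # size == len(rest) would leave the right side empty
        for extra in combinations(rest, size):
            left = [first, *extra]
            right = [c for c in rest if c not in extra]
            splits.append((left, right))
    return splits


if __name__ == "__main__":
    print(len(ordered_splits([1, 2, 3, 4, 5])))       # five ordered age groups
    print(len(nominal_splits(["a", "b", "c", "d"])))  # four unordered categories
```

A five-category ordinal covariate such as age therefore yields 4 candidate splits, whereas a nominal covariate with four categories yields 7.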

The pre-split model and the post-split model are algebraically nested if the parameter estimates are obtained with maximum likelihood estimation. Because they are nested, a likelihood ratio test is used to determine whether the model should be split into submodels. Under the null hypothesis, which states that the covariate does not influence the model (the template model is not significantly different from the compound model), the likelihood ratio statistic is asymptotically chi-square distributed. At each level of the tree, the covariate with the highest value of the log-likelihood ratio is selected. This procedure is continued recursively in the subsequent stages of splitting the tree.

There are natural stopping criteria throughout the procedure:

  • there is no covariate that could split tree nodes,

  • the number of cases in the leaf (terminal node) is lower than the threshold determined by the researcher,

  • the desired tree depth has been achieved,

  • the best split of the node is not good enough.

The fourth criterion is used to prevent overfitting.
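The growing procedure and the stopping criteria can be sketched in a few lines of Python. The sketch makes strong simplifying assumptions: a univariate normal model with two free parameters (mean and variance) stands in for the SEM template model, only ordered covariates are handled, and the split is accepted when the likelihood-ratio gain exceeds 5.991, the 5% critical value of the chi-square distribution with 2 degrees of freedom. All names and the demonstration data are hypothetical; this is not the semtree implementation.

```python
import math
from statistics import fmean

CHI2_CRIT_DF2 = 5.991  # 5% critical value of chi-square with (k - 1) * m = 2 df


def neg2ll(y):
    """-2 log-likelihood of a normal model (mean, variance) at its ML estimates."""
    n = len(y)
    mu = fmean(y)
    var = max(sum((v - mu) ** 2 for v in y) / n, 1e-12)  # guard against log(0)
    return n * (math.log(2 * math.pi * var) + 1)


def grow(rows, covariates, min_n=25, max_depth=3, depth=0):
    """Grow a tree from rows of (y, {covariate: ordered value}) pairs."""
    node = {"n": len(rows), "neg2ll": neg2ll([y for y, _ in rows])}
    if depth >= max_depth:                           # stop: desired depth reached
        return node
    best = None
    for name in covariates:
        levels = sorted({c[name] for _, c in rows})
        for cut in levels[1:]:                       # the N - 1 ordered binary splits
            left = [r for r in rows if r[1][name] < cut]
            right = [r for r in rows if r[1][name] >= cut]
            if min(len(left), len(right)) < min_n:   # stop: leaf would be too small
                continue
            gain = (node["neg2ll"]
                    - neg2ll([y for y, _ in left])
                    - neg2ll([y for y, _ in right]))
            if best is None or gain > best[0]:
                best = (gain, name, cut, left, right)
    if best is None or best[0] <= CHI2_CRIT_DF2:     # stop: no split, or not good enough
        return node
    gain, name, cut, left, right = best
    node["split"], node["lr"] = (name, cut), gain
    node["left"] = grow(left, covariates, min_n, max_depth, depth + 1)
    node["right"] = grow(right, covariates, min_n, max_depth, depth + 1)
    return node


def make_demo_rows(n=200):
    """Hypothetical data: y jumps by 5 for age >= 3; sex plays no designed role."""
    rows = []
    for i in range(n):
        age, sex = i % 5 + 1, (i // 5) % 2 + 1
        noise = ((i * 37) % 11 - 5) / 5.0            # deterministic pseudo-noise
        rows.append(((5.0 if age >= 3 else 0.0) + noise, {"age": age, "sex": sex}))
    return rows


if __name__ == "__main__":
    tree = grow(make_demo_rows(), ["age", "sex"])
    print(tree["split"], round(tree["lr"], 1))
```

On the demonstration data the root node is split on age at the cut point 3, which is where the jump in the mean was placed. In a real SEM Tree the `neg2ll` function would be replaced by the fit of the structural model in each node.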

The formal evaluation of the split of the node is as follows:

  1. The template model \(M({\hat{\theta }})\mid {D}\) is characterized by the likelihood function for the parameter set \({\hat{\theta }}\) with m free parameters and the data set D.

  2. The parameter set \({\hat{\theta }}_{F}\) is obtained for the whole data set by minimizing the fit function:

    $$\begin{aligned} f({\hat{\theta }}_F \mid D) \end{aligned} \tag{1}$$

  3. For a given potential split, data set D is divided into k disjoint subsets indexed by \(j=1,2,\ldots ,k\) (here k denotes the number of subsets, not the number of observed variables). The partition is based on covariate \(C_i\), whose values are denoted by \(v_{ij}\).

  4. Since the subsets are disjoint, the parameters of the structural models \({\hat{\theta }}_{v_{ij}}\) are estimated independently by maximizing the likelihood function in each subset \({D}_{v_{ij}}\).

  5. The compound model \(M({\hat{\theta }}_{v_{ij}})\mid {D}_{v_{ij}}\) is now referred to as \(M_{SUB}\). The template model M is nested within \(M_{SUB}\), because M corresponds to \(M_{SUB}\) with additional linear constraints that equate the free parameters across the subsets (\((k-1)m\) constraints in total, matching the degrees of freedom in step 8).

  6. Given this nested structure, a null hypothesis can be formulated, which says that the fit of model M does not differ significantly from that of model \(M_{SUB}\), i.e., for a binary split, \(H_0: {\hat{\theta }}_1 = {\hat{\theta }}_2\).

  7. The log-likelihood ratio between M and \(M_{SUB}\) is expressed by the formula:

    $$\begin{aligned} \varLambda _{i} = -2LL (M({\hat{\theta }})\mid {D}) + \sum _{j=1}^{k} 2LL (M({\hat{\theta }}_{v_{ij}})\mid {D}_{v_{ij}}) \end{aligned} \tag{2}$$

  8. Under the null hypothesis, \(\varLambda _{i}\) is asymptotically chi-square distributed with \((k-1)m\) degrees of freedom.
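The test in steps 7 and 8 can be sketched in Python using only the standard library. The sketch takes the −2 log-likelihoods of the template model and of the submodels as given numbers; the upper-tail chi-square probability is computed from the standard closed-form series for integer degrees of freedom. The function names and the numbers in the usage example are hypothetical.

```python
import math


def chi2_sf(x: float, df: int) -> float:
    """Upper-tail probability P(X > x) of a chi-square variable with integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{r=0}^{df/2 - 1} (x/2)^r / r!
        term = total = 1.0
        for r in range(1, df // 2):
            term *= (x / 2) / r
            total += term
        return math.exp(-x / 2) * total
    # Odd df: erfc(sqrt(x/2)) + sqrt(2x/pi) * exp(-x/2) * sum_r x^(r-1) / (1*3*...*(2r-1))
    total, term = 0.0, 1.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total


def lr_test(neg2ll_template: float, neg2ll_submodels: list, m: int, alpha: float = 0.05):
    """Likelihood-ratio test of one candidate split with (k - 1) * m degrees of freedom."""
    k = len(neg2ll_submodels)
    lam = neg2ll_template - sum(neg2ll_submodels)  # Lambda_i from Eq. (2)
    df = (k - 1) * m
    p_value = chi2_sf(lam, df)
    return lam, df, p_value, p_value < alpha


if __name__ == "__main__":
    # Hypothetical -2LL values for a binary split of a model with m = 5 free parameters.
    print(lr_test(1000.0, [480.0, 500.0], m=5))
```

Because the submodels are fitted independently, their −2 log-likelihoods simply add up, so the statistic is the drop in deviance achieved by the split.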

Taking into account all covariates, all the possible splits of the node are evaluated. The model with the highest increase in goodness of fit is compared with the previously selected threshold determined by the significance level \(\alpha \). If the split is statistically significant, the tree construction procedure is continued.

One of the first applications of the SEM Trees approach in the field of psychology concerned nonsuicidal self-injury (Ammerman et al. 2016). The set of variables included, among others, suicidal thoughts and behavior, symptoms of depression, problems with emotions, anxiety, and symptoms of eating disorders. The covariate used to separate homogeneous subsets of the examined people was the number of self-mutilations.

Another example of the application of the SEM Trees approach was the identification of factors affecting brain health and its cognitive and mental functions at various stages of people’s lives (Walhovda 2018). The research was carried out as part of the “Lifebrain” project, whose vision was to enable targeted prevention of problems related to brain health. Risk factors and protective factors were related to sociodemographic variables, lifestyle variables, birth and early health data, and early intellectual function. The dependent variables were, among others, memory, depression and anxiety, while the moderators were age, sex, brain structure, health factors and brain functions.

The SEM Trees model was used in the study of cognitive decline among older people (Zelinski and Jacobucci 2017). Thanks to this approach, almost 21,000 patients were grouped into subsets using such covariates as race, education, gender and Hispanic ethnicity. Separate structural models built for each of the subsets showed different patterns of cognitive ageing.

The relationship between cognitive functions of the brain and the age of adults has also been the subject of other studies in which SEM Trees models were used for data analysis (de Mooij et al. 2018). The research involved the grey matter and the white matter in the brain as well as cognitive tasks, which included language, memory and fluid intelligence. Using this approach, more than 600 subjects were divided into subsets (tree nodes) on the basis of a continuous covariate, the age of the adult.

In this paper, the SEM-Tree approach is applied in marketing research to identify the structure of preferences and values among Polish consumers. To the authors’ knowledge, this is the first application of the SEM-Tree approach to marketing data.

3 Data characteristics

The SEM-Tree model was built on the basis of data from a nationwide survey of the structure of preferences and values of Polish households in the allocation of resources for consumption, saving and investing. The research was carried out in 2013 on a random sample of 1020 respondents chosen from 410 households (interviews were conducted in families with a father, mother and the oldest child over 16 years old at home) (Sagan 2014).

The variables selected for the SEM model are indicators of three constructs related to attitudes towards resource allocation and decision-making strategy. The first concerns the strategy of making allocation decisions in the family, which ranges from the altruistic pole (the patriarchal person strategy of decision-making, focused on caring for the common good of the family) to the rivalry pole (family members having their own, independent “budgets”). These poles correspond to models of behaviour within a household: altruistic and cooperative (Bergstrom 1996) and negotiating-competitive (McElroy and Horney 1981; Donni and Chiappori 2011).

A female decision maker, a large number of children, and a worse financial standing should be conducive to an altruistic attitude and to common ends of saving and consumption.

A male decision maker, a smaller number of children, and a better financial standing should promote more competitive attitudes and individual ends of consumption and saving, giving preference to the role of the decision maker. The educational factor is related to other factors (the higher the education, the higher the income) and helps to predict more altruistic attitudes.

The measurement was made using 5-point Likert scales and the following statements regarding the altruistic pole of this dimension (A):

A—altruistic value

  • p23—“The family should limit spending on individual needs in order to satisfy common ones”

  • p24—“The common good of the whole family is more important than a family member separately”

  • p25—“The sense of accomplishment is provided by goods intended for the whole family”

  • p26—“The joy of life draws more in the family from the goods that serve everyone”

The other two latent variables concerned compromises (trade-offs) between reputation and income, and between leisure time and the pursuit of higher income and material security (the “reputation” vs. “money” and “time” vs. “money” dilemmas). The most important assets of family members are money (financial capital), leisure time (human capital) and reputation (cultural capital). The research question was therefore: what are the relationships between the strategies of income management and the money (M) vs. leisure time (T) and money (M) vs. reputation (R) trade-offs?

On the basis of the research question, the main hypothesis was formulated: the stronger the tendency to adopt the patriarchal person strategy, the larger the T–M and R–M trade-offs, due to the risky shift effect. The risky shift effect, a form of the group polarization phenomenon, explains the tendency of a group as a whole (or of a group leader) to choose a more extreme course of action than the average of the individual judgments of the group members (Myers and Lamm 1976; Shaw 1976).

The specific latent variables and their indicators (statements with 5-point Likert scales) are given below:

Y1—“time” and “money” trade-off:

  • p51—“Earning and spending money is more important than free time”

  • p52—“Earning a safe tomorrow is more important than free time”

Y2—“reputation” and “money” trade-off:

  • p54—“The end justifies the means - money is more important than own reputation”

  • p55—“You can lose your reputation as a safe financial future”

Basic statistics for the indicators are given in Table 1. The covariance matrix of the indicators is presented in Table 2.

Table 1 Basic statistics of the latent variable indicators
Table 2 Covariance matrix of the latent variable indicators

Additional accompanying variables that participated in the construction of the SEM-Tree (predictors) were:

  • m1—sex: male, female

  • m2—age: 18–24, 25–34, 35–49, 50–64 and 65+

  • m3—education: primary education, lower secondary, vocational education, secondary, higher education (1st level) and higher education (2nd level)

  • m5—subjective financial situation: very good, good, average, bad and very bad

4 Results

4.1 Template SEM model

The basic template SEM model is depicted in Fig. 1.

Fig. 1 Template SEM model. A: altruistic value, \(Y_1\): time-money trade off, \(Y_2\): reputation-money trade off. Source: printout of the lavaan library (R package)

The fit of the model is acceptable. The chi-square test statistic of model fit is 51.533 with 17 df and p-value < 0.001 (chi-square/df = 3.03). The incremental fit indices are Comparative Fit Index (CFI) = 0.975 and Tucker-Lewis Index (TLI) = 0.960. Root Mean Square Error of Approximation (RMSEA) = 0.045 (the 90 percent confidence interval is 0.031 to 0.059, and the p-value for the hypothesis that RMSEA is at most 0.05 is 0.716). Standardized Root Mean Square Residual (SRMR) = 0.031. Factor loadings, path coefficients and error variances of the template model are presented in Table 3.
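As a quick arithmetic check, two of the reported indices can be recomputed from the chi-square statistic alone. The sketch below assumes the sample size N = 1020 given in Sect. 3 and uses one common point-estimate formula for RMSEA, \(\sqrt{\max (\chi ^2 - df, 0) / (df\,(N-1))}\); some software divides by N instead of N − 1, which gives the same value after rounding here. The function names are illustrative.

```python
import math


def chi2_df_ratio(chi2: float, df: int) -> float:
    """Relative chi-square: the test statistic divided by its degrees of freedom."""
    return chi2 / df


def rmsea(chi2: float, df: int, n: int) -> float:
    """RMSEA point estimate: sqrt(max(chi2 - df, 0) / (df * (n - 1)))."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


if __name__ == "__main__":
    chi2, df, n = 51.533, 17, 1020  # n = 1020 is assumed from the sample description
    print(round(chi2_df_ratio(chi2, df), 2), round(rmsea(chi2, df, n), 3))
```

Both recomputed values agree with the figures reported above (3.03 and 0.045).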

Table 3 Parameters of the template path model

The Cronbach’s alpha (CA) and McDonald’s Omega reliability coefficients of the scales are acceptable. For the Y1 scale, CA = 0.59 and Omega = 0.60. For the Y2 scale, CA = 0.80 and Omega = 0.83. For the A scale, CA = 0.62 and Omega = 0.63. For the total model, CA = 0.61 and Omega = 0.75. Average variance extracted (Fornell-Larcker AVE) is 0.43 for the Y1 scale, 0.72 for the Y2 scale, 0.31 for the A scale and 0.45 for the total model.
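For readers who wish to reproduce such coefficients from a covariance matrix and a set of standardized loadings, a minimal Python sketch is given below. It uses the textbook definitions of Cronbach’s alpha and of AVE as the mean squared standardized loading; the input values in the usage example are hypothetical and are not taken from Tables 2 or 3.

```python
def cronbach_alpha(cov):
    """Cronbach's alpha from a k x k covariance matrix of the scale items."""
    k = len(cov)
    if k < 2:
        raise ValueError("need at least two items")
    item_var = sum(cov[i][i] for i in range(k))   # sum of item variances
    total_var = sum(sum(row) for row in cov)      # variance of the sum score
    return k / (k - 1) * (1 - item_var / total_var)


def ave(std_loadings):
    """Average variance extracted: mean of the squared standardized loadings."""
    return sum(l ** 2 for l in std_loadings) / len(std_loadings)


if __name__ == "__main__":
    # Hypothetical two-item scale with unit variances and covariance 0.5.
    print(cronbach_alpha([[1.0, 0.5], [0.5, 1.0]]))
    # Hypothetical standardized loadings.
    print(ave([0.6, 0.8]))
```

For a two-item scale such as Y1 or Y2, alpha depends only on the two item variances and their covariance, which is why short scales tend to show modest values.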

The parameters of the structural part of the model are significant and show that the stronger the altruistic attitude towards income allocation, the higher the time-money trade-off (A–Y1 = 0.44) and the reputation-money trade-off (A–Y2 = 0.22). This confirms the main research hypothesis concerning the risky shift effect. Also, the higher the time-money trade-off, the higher the reputation-money trade-off (Y1–Y2 = 0.18). Decomposition of the mediation effect indicates that the indirect effect (ab = 0.039, p = 0.04) significantly explains part of the relationship between the altruistic attitude and the reputation-money trade-off.

4.2 SEM-Tree model

A disadvantage of SEM models built in the leaves of a classical classification tree is the need to take a predefined dependent variable into account in the tree construction process.

In the SEM-Tree algorithm presented in Section 2, the splits cover all combinations of the binarized predictors without the need to consider a dependent variable. In the model building process, several models based on different splitting methods and stopping rules were developed. The semtree package offers several tree growing methods:

  1. “naive” splitting takes the best split value of all possible splits on each covariate.

  2. “fair” selection tests all splits on half of the data, then tests the best split value for each covariate on the other half of the data.

  3. “fair3” adds a step that retests all of the split values on the best covariate found in the second phase.

  4. “crossvalidation” partitions the data to maximize splits on each variable and then compares the maximum splits across variables on the rest of the data.

An additional criterion was the size of the final leaf (25, 50, 100, 200 and 400). In total, 20 models (four growing methods combined with five leaf sizes) were developed and compared.
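The design of this comparison can be written down as a simple grid. The Python sketch below only enumerates the 20 combinations and shows how the configuration with the lowest deviance could be selected once the models have been fitted; the dictionary keys, the helper `best_by_deviance` and the deviance values in the example are hypothetical.

```python
from itertools import product

METHODS = ("naive", "fair", "fair3", "crossvalidation")
LEAF_SIZES = (25, 50, 100, 200, 400)

# One configuration per (growing method, final leaf size) pair.
GRID = [{"method": m, "leaf_size": n} for m, n in product(METHODS, LEAF_SIZES)]


def best_by_deviance(results):
    """Return the (method, leaf_size) key with the lowest deviance (-2LL of the leaves)."""
    return min(results, key=results.get)


if __name__ == "__main__":
    print(len(GRID))
    # Hypothetical deviances for three of the configurations.
    demo = {("naive", 25): 21050.0, ("fair", 50): 20980.0, ("fair3", 50): 20995.0}
    print(best_by_deviance(demo))
```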

Model comparisons are based on deviance (−2LL of the leaf (terminal) nodes). Figure 2 presents the model comparisons. It should be noted that for \(N>25\) all methods give approximately similar results. The most consistent results appear at N = 50 (in general, the higher the number of respondents in the final leaf, the higher the deviance).

Fig. 2 Comparison of SEM-Tree growing methods. CV: crossvalidation partitions for maximizing splits on each variable, fair: partitions based on all splits on half of the data, fair 3: partitions based on retesting all of the split values on the best covariate, naive: best split value of all possible splits

Additionally, several models based on n-fold cross-validation with N = 50 were developed. Figure 3 presents the results of that simulation for 3-fold up to 15-fold cross-validation. The simulation shows that 7-fold cross-validation performs best. In the authors’ opinion, too many folds in the cross-validation procedure make the subsets used to test the model too small. As a consequence, the variance is high and the final model error is also high.
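The effect of the number of folds on the size of the test subsets can be illustrated with a generic n-fold partition. The Python sketch below is not the partitioning routine of the semtree package; it simply shuffles the case indices and deals them into folds, assuming the sample size of 1020 respondents. The function name and the seed are illustrative.

```python
import random


def kfold_indices(n: int, k: int, seed: int = 0):
    """Shuffle range(n) and deal the indices into k folds of nearly equal size."""
    if not 2 <= k <= n:
        raise ValueError("k must satisfy 2 <= k <= n")
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


if __name__ == "__main__":
    for k in (3, 7, 15):
        sizes = sorted(len(fold) for fold in kfold_indices(1020, k))
        print(k, sizes[0], sizes[-1])  # smallest and largest test subset
```

With 1020 cases, each test subset contains 340 cases for 3 folds, 145 or 146 cases for 7 folds, and only 68 cases for 15 folds, before any further division into tree leaves.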

Fig. 3 SEM-Tree model cross-validation comparison

Figure 4 shows the final hybrid SEM-Tree model calculated using the semtree library of the R program. The model was created using the CV method with 7-fold cross validation.

Fig. 4 SEM-Tree model of resource preferences. m1: sex (1 = male, 2 = female), m2: age (1 = 18–24, 2 = 25–34, 3 = 35–49, 4 = 50–64, 5 = 65+), m5: subjective financial situation (1 = very good, 2 = good, 3 = average, 4 = bad, 5 = very bad), l1: Y1-p52 loading, l2: Y2-p55 loading, l3: A-p24 loading, l4: A-p25 loading, l5: A-p26 loading. Source: printout of the semtree library (R package)

The model shows that the most important predictors turned out to be age, financial situation and gender. The structure of the segments is as follows:

Segment 1 refers to the young (18–34). The patriarchal person strategy is characterized by a stronger relationship with the R–M trade-off and a weaker relationship with the T–M trade-off. People in this segment possess relatively weak cultural capital, and they are able to sacrifice their own reputation in order to reach a higher financial position. This segment consists of 439 respondents.

Segment 2 refers to mature (above 35) rich males. The patriarchal person strategy is characterized by strong relationships with both the R–M and the T–M trade-off. People in this segment possess relatively strong financial but weak cultural and human capital, and they are able to sacrifice both their own reputation and free time in order to reach a higher financial position. This segment consists of 130 respondents. The findings confirm the main hypothesis, but they do not demonstrate the patriarchal role of women.

Segment 3 refers to mature rich (very good and good financial situation) females. The patriarchal person strategy is characterized by a very strong relationship with the T–M trade-off and a negative relationship with the R–M trade-off. People in this segment possess relatively strong financial and cultural capital but weak human capital, and they are able to sacrifice only free time in order to reach a higher financial position, keeping their own reputation intact. This segment consists of 137 respondents. Within this segment the main hypothesis is confirmed and the results indicate the patriarchal role of women.

Segment 4 refers to the mature poor (average, bad and very bad financial situation). The patriarchal person strategy is characterized by a very strong relationship with the T–M trade-off and an insignificant relationship with the R–M trade-off. People in this segment possess relatively weak financial and cultural but strong human capital, and they are able to sacrifice only free time in order to reach a higher financial position, keeping their own reputation intact. This segment consists of 314 respondents. The results allow the research hypothesis to be accepted and confirm the role of low income (regardless of gender).

The variable “education” did not appear in the tree, because other predictors reduced the variances in child nodes to a higher degree. Interestingly, all segments relate to the patriarchal person strategy. The patriarchal system means that men play the dominant role in the family and in society. This does not mean a lower status of women, but it is related to the fact that the man represents the family, he is to ensure its safety and financial resources. According to the research describing the social roles of men and women (Hofstede et al. 2010), society is called masculine when men have challenging work, have opportunities for high earnings, are assertive and focus on the material situation. An analysis of the masculinity index (MAS) values in European countries shows that in Poland it is quite high (64 points). This level is similar to the one in Germany and Great Britain (66 points), but quite distant from the one in Slovakia (110 points) or Sweden (5 points) and Norway (8 points).

5 Conclusions

The SEM-Tree hybrid models allow exploratory analysis of path relationships in a heterogeneous population. In contrast to classical decision trees with SEM models in leaves, they enable the extraction of predictors associated with variation not only in the level of the dependent variable, but also in the path relations between variables. The model comparisons show that 7-fold cross-validation gives the best model performance.

Both types of models confirm the risky shift hypothesis about the influence of (altruistic) decision-making strategies on the sacrifice of free time and reputation for the sake of a higher household income.

Summing up the results of the model, it should be emphasized that the choice of an altruistic (patriarchal person) strategy (taking care of the “common good” with a strong control function of the family head) strongly encourages the members of the household to submit to the value of money (“struggle for existence”), even at the expense of free time and the loss of one’s reputation. Therefore, it can be assumed that the decentralization of budgets, greater autonomy and thus a greater sense of security among family members tend to reveal a higher preference for free time and for taking care of one’s own reputation, and thus a more harmonious individual life.