1 Introduction

Data and analytics have captured the attention of human resource management (HRM) scholars, as multinational corporations (MNCs) increasingly have at their disposal large volumes of data, and techniques for analyzing them, that could be used to support decision making related to complex problems, such as task organization, employee turnover, career development, and training design. Extrapolating strategic knowledge from large datasets through supervised Machine Learning (ML) techniques is becoming one of the main challenges for decision-makers in MNCs. In the digital era, techniques based on machine learning algorithms play an increasingly important role in extrapolating strategic knowledge from raw data (Canhoto and Clear 2020). Since 1956, when John McCarthy coined the term “Artificial Intelligence” (AI), interest in this topic has grown exponentially, in line with the increasing number of applications, permeating different disciplines and research areas. In this context, management studies have started to analyze the potential applications of ML, which is a subset of AI and represents a way to achieve AI through the development of algorithms capable of improving themselves with experience (Garg et al. 2022). Supervised ML techniques can analyze large volumes of data from different sources to discover hidden patterns with high strategic value for organizations (Pereira et al. 2018), providing insights for prediction, classification, and decision-making purposes (Cui et al. 2006; Naeem et al. 2024). These techniques have some specific features, such as scalability, because they can handle and process large amounts of data; interactivity, because they can learn new variables from new data; and dynamism, because they can periodically reassess and reevaluate hypotheses by taking into account incoming data, even without human interaction (Garg et al. 2022).
With these capabilities, managers could make rapid and contextualized decisions based on data-driven evidence (Gupta et al. 2018; Wirges and Neyer 2023). For instance, supervised machine learning techniques are useful in marketing to estimate the probability of customer churn (Archaux et al. 2004; Gordini and Veglio 2017; Hung et al. 2006; Rosset et al. 2003; Wei and Chiu 2002), in social and economic analyses (Blazquez and Domenech 2018), in smart city applications (Iqbal et al. 2020), in finance to predict customer credit risk (Kruppa et al. 2013), and in HRM in the field of recruitment, performance management, and team dynamics (Garg et al. 2022; Koechling et al. 2023). As has happened in marketing and finance over the past decade, the application of ML techniques in HRM is growing rapidly (Garg et al. 2022; Yang et al. 2023), although only a limited number of studies have aimed at predicting employee turnover (Rombaut and Guerry 2018; Saradhi and Palshikar 2011). Voluntary employee turnover refers to why people leave an organization (Lee et al. 2017) and is considered a serious issue for MNCs (Saradhi and Palshikar 2011). High levels of employee turnover generate unexpected costs in terms of hiring, training, search, selection, and replacement (Mobley 1982; Price 1977; Staw 1980). In fact, the cost of hiring new employees is substantially higher than retaining the existing employees, negatively affecting firm performance and competitiveness (Holtom et al. 2005; Mitchell et al. 2001).

Despite the relevance of this topic, extant research has barely explored the potential, and has not yet investigated the real added value, of applying supervised ML techniques to identify the determinants of voluntary employee turnover (Yang et al. 2023). Particularly in the context of employee turnover, the few existing works employing supervised ML techniques have so far been published in conference proceedings (Garg et al. 2022).

Responding to the call for new methodological approaches (Hom et al. 2017), this study tests the analytical and predictive power of a classification decision tree based on the CHAID (Chi-square Automatic Interaction Detector) algorithm to identify the characteristics of employees who voluntarily leave the company. By analyzing a large dataset of employees in the context of a telco MNC, we apply a CHAID classification decision tree to a sample of 2,932 employees working in the firm’s headquarters in Norway and in a subsidiary located in Denmark. This approach makes it possible to exploit the potential of this technique to identify in advance those groups of employees who have a higher likelihood of leaving the firm and, ultimately, to implement retention practices targeted at these groups.

This study contributes to the turnover literature from a methodological perspective by encouraging researchers to measure and/or analyze employee turnover in nontraditional ways, using supervised ML models to test the validity of well-known historical explanatory constructs in this area of research (Hom et al. 2017). In addition, we contribute to the HRM literature by demonstrating the potential of the classification decision tree as a method for solving complex problems in HRM (Garg et al. 2022), which can be used to complement previous studies and further advance the literature on this topic. Our results suggest the complementary, rather than substitutive, role of supervised ML in assessing the risk of employee turnover. From a practical perspective, our study demonstrates the ability of these techniques to reveal hidden relationships between data, allowing decision-makers and scholars to identify new, previously unknown relationships and evidence. Using this technique, companies could self-assess their own employee turnover risk levels and identify high-risk employees, allowing them to develop timely and effective retention strategies tailored to their needs. In this sense, supervised ML techniques become an additional toolkit that supports, but does not replace, human resource management decisions.

2 Theoretical background

2.1 A brief introduction to employee turnover research

Voluntary employee turnover – employees’ unilateral, unwanted, and often surprising termination of their employment contract – is a phenomenon of practical relevance for all organizations for a variety of reasons (e.g. Lee et al. 2008), as a negative economic impact is generally expected, mainly due to the additional recruitment costs or tacit knowledge drains (Glebbeck and Bax 2004; Holtom et al. 2005; Mitchell et al. 2001; Reiche 2008).

Reflecting its importance, research on employee turnover can be traced back to 1920 (Hom et al. 2017; Lee et al. 2017). Since its inception, the literature has continually grown, resulting in an impressive number of theories and models over the subsequent one hundred years mainly aimed at explaining the motivations, antecedents, processes, and consequences of this organizational phenomenon (Hom et al. 2017; Lee et al. 2017; Rubenstein et al. 2018). Employee turnover is related to a wide range of factors. Researchers agree on the importance of job satisfaction (or dissatisfaction) and individual perceptions of the desirability and ease of moving to another job, based on the assumption that employees who are satisfied with their job and do not have other job options are more likely to stay in the organization (Griffeth et al. 2000; Mobley 1977). The seminal work of March and Simon (1958) triggered a stream of studies attempting to explain voluntary employee turnover by focusing on why people leave (Barrick and Zimmerman 2005). This initial approach was followed by attempts to identify the primary antecedents of employee turnover (e.g. Lee and Mitchell 1994; O'Reilly III et al. 1991). With the aim of retaining employees, academics then attempted to build an employee turnover model that was as accurate as possible to minimize voluntary employee turnover. Over the years, several models ensued, such as Mobley’s (1977) process model, which explains how dissatisfaction leads to turnover in an attempt to explain why employees leave their jobs (Lee et al. 2017). Other scholars focused more on the content than the process, identifying a variety of determinants of turnover, including factors related to the workplace, labor market causes, community, and occupational aspects (Hom et al. 2017; Price 1977, 2001), emphasizing the importance of both individual and environmental attributes.
As scholars have criticized extant turnover models for their lack of explanatory and predictive power, turnover research has included variables that are not necessarily related to the employee’s affective state and the decision to quit (Morrell et al. 2001). Subsequently, researchers studied the role of shocks and jarring events that drive employees to choose alternative career paths (Hom and Griffeth 1991), demonstrating that voluntary leave is not necessarily related to job dissatisfaction alone. The resulting models and analyses of voluntary employee turnover therefore included “shocks” (Lee and Mitchell 1994), while others focused on motives for leaving (Maertz and Campion 2004).

Another research stream has adopted the opposite perspective (Porter and Steers 1973), focusing on why people stay, proposing the “job embeddedness” construct that considers contextual factors both related to the workplaces and off-the-job aspects (Coetzer et al. 2019; Mitchell et al. 2001). Following this radical paradigm shift, scholars developed further conceptualizations, explanations, and empirical approaches (Lee et al. 2017). Along these lines, some researchers more recently proposed integrative frameworks that attempt to explain both why and how employees quit (Maertz and Campion 2004).

In relation to the IT context, which is particularly affected by turnover issues (Rode et al. 2007), Ghapanchi and Aurum (2011) developed a systematic literature review of the antecedents of IT employee turnover, showing that the determinants can generally be grouped into five main categories: individual, organizational, job-related, psychological, and environmental. This summarizing perspective is consistent with the more general perspective that emerges from findings of the broader turnover literature (Lee et al. 2017; Rubenstein et al. 2018). The first category includes individual attributes and motivational and professional behavior constructs. For example, motivational factors may be related to mindset types such as self-positiveness or low core self-evaluation (Hom et al. 2012). Organizational factors relate to individual perceptions of the organization, such as remuneration, benefits, human resource practices, organizational culture (O'Reilly III et al. 1991; Rubenstein et al. 2018) and the centrality of the functional department in the intraorganizational network of the MNE (Castellacci et al. 2018). For instance, in terms of organizational culture, the person-organization fit predicts employee job satisfaction, which also affects turnover (O'Reilly III et al. 1991). Organizational factors also include knowledge sharing practices within the department and with colleagues outside the function (e.g., Cabrera and Cabrera 2005; Dasí et al. 2017; Garg et al. 2022) and knowledge flows across firm boundaries (Gupta and Govindarajan 2000). Job-related antecedents instead concern the characteristics, support, difficulties, and attractiveness of jobs, whereas psychological factors include individuals’ satisfaction in terms of jobs, career prospects, and organizational aspects (e.g. commitment). For instance, job design can influence job characteristics such as competence, autonomy, or task identity, which can motivate knowledge sharing among employees (e.g., Foss et al. 2009; Ryan and Deci 2000), aspects that can also influence employee well-being and turnover. Finally, environmental factors include aspects that are external to the workplace, such as job alternatives, family support, and work-family balance. However, Ghapanchi and Aurum (2011) underline a prevalence of antecedents at the job-related, organizational, and psychological levels, whereas fewer antecedents pertain to the remaining categories. Similar to other widely and long studied management and organizational phenomena, the study of employee turnover has developed into distinct streams of research. This development, immanent to the research system, has led to ever more complexity and an abundance of highly specialized empirical findings rather than convergence and consolidation. This can also be concluded from the number and increasing specialization of literature reviews and meta-analyses (which are briefly described in Table 4 in the appendix).

A closer look at these reviews reveals that research on voluntary employee turnover has not led to maturity and consolidation, in the sense that the (i) results converge, (ii) the most important factors and cause-effect relationships are clearly known, and (iii) only nuances are debated. Quite the opposite, although research on voluntary employee turnover has developed and propagated advanced models and theories, the findings remain inconsistent and at times even conflicting (Hom et al. 2017). Partly fueled by the paradigm prevailing in the social sciences and peer-reviewed journals documenting the novelty of research by virtue of previously unstudied or understudied determinants, the number of variables studied has increased significantly in recent decades. It seems that research on voluntary employee turnover – like many other organizational phenomena – is in search of the famous needle in the haystack, or as renowned scholars in the field put it, in “Search of the Holy Grail” (Holtom et al. 2008). The search for impact factors that allow forecasting turnover has predominantly adopted regression models (e.g. OLS, logit, and logistic), structural equation modelling (SEM), and other traditional statistical techniques, such as cluster analysis and dimension reduction (Garg et al. 2022; Lee et al. 2017).

As the numerous and increasingly specialized reviews and especially meta-analyses demonstrate, the voluntary employee turnover phenomenon is a mature and increasingly differentiated field of management research with high scientific and practical relevance. Based on these reviews, different approaches can be adopted for the theoretical and empirical development of the field. On the one hand, replication studies could be conducted for known but possibly inconsistent correlations, or testing new independent variables, moderators, or mediators. We refer to this as the ‘more of the same’ approach. On the other hand, the extensive datasets obtained from the increasing digitalization of HRM and web-based surveys (Holtom et al. 2008, p. 259) could be used to identify previously unrecognized influencing factors, effect relationships, and patterns. More recent calls in the field of employee turnover also emphasize the need to apply new and innovative methodological approaches, such as the implementation of machine learning techniques, to predict employee turnover (Choudhury et al. 2021; Garg et al. 2022; Lee et al. 2017; Rombaut and Guerry 2018; Yang et al. 2023).

2.2 Applications of ML techniques in the HRM context

Although HRM is a somewhat unexplored area with regard to big data analytics (BDA) and supervised ML applications (e.g. Ekawati 2019; Sheng et al. 2017), interest has significantly grown as a consequence of the ongoing digitalization of firms (Raguseo and Vitari 2018; Rombaut and Guerry 2018; Saradhi and Palshikar 2011; Sexton et al. 2005; Shah et al. 2017). The comparatively few studies that apply data mining techniques in the HRM field focus on employee selection (Aiolli et al. 2009), employee competences (Zhu et al. 2005), career planning (Lockamy and Service 2011), predicting employee performance and evaluation (Zhao 2008), candidates’ preliminary evaluation and training success (Aviad and Roy 2011), and employee turnover (Quinn et al. 2002; Saradhi and Palshikar 2011; Sexton et al. 2005). Besides the studies adopting other advanced statistical techniques, such as regressions, SEM, and Bayesian Model Averaging (BMA) (e.g., Coetzer et al. 2019; Nandialath et al. 2018; Sandhya and Sulphey 2021), the handful of studies that have applied supervised ML to the issue of employee turnover evaluate or compare neural network solutions (Quinn et al. 2002; Sexton et al. 2005), random forests, support vector machines and naïve Bayes (Saradhi and Palshikar 2011), and classification decision trees (Choudhury et al. 2021; Rombaut and Guerry 2018; Saradhi and Palshikar 2011). Table 1 summarizes the main characteristics and contributions of these studies.

Table 1 Previous applications of ML techniques in voluntary employee turnover research

We next provide a summary of the rather diverse and contradictory assessments of these techniques with a particular focus on their predictive power. While Sexton and colleagues (2005) emphasize the accuracy of neural network techniques in predicting and solving classification business problems, Quinn et al. (2002) find that they perform worse than logistic regression. Saradhi and Palshikar (2011) compare naïve Bayes, support vector machines, decision tree and random forests, highlighting the superiority of support vector machines. Rombaut and Guerry (2018) point out the superiority of decision tree techniques compared to logistic regression. Choudhury and colleagues (2021) partially confirm this finding in their recent comparison of decision trees, random forests, and neural networks, documenting that classification decision trees have high predictive and analytical power in identifying employee turnover probability. However, their study does not address the characteristics of employees highly inclined to voluntarily leave, instead limited to evaluating the statistical performance of different ML techniques. Although these first attempts provide initial evidence of the potential applications of these techniques in the field of HRM, there is still a lack of research that applies the ML technique to identify the characteristics of employees who are more likely to leave. Our study contributes to filling this gap by identifying the determinants of voluntary employee turnover using a classification decision tree.

3 Research method

Through an illustrative example using data from a telco MNC, this study investigates the root causes of voluntary employee turnover with a CHAID classification decision tree. IBM SPSS Statistics (v.27) was used to run the CHAID classification decision tree.

3.1 Data collection, sample, and measures

We used a database derived from an online survey submitted to the employees of a leading telco MNC operating in Northern European countries and Asia. The company has a strong position in mobile, broadband, and TV services with 180 million customers worldwide and annual revenues of approximately USD 12 billion. It is headquartered in Norway with more than 12 subsidiaries in Europe and Asia.

The company’s HRM department provided the dataset. The data was collected in 2016 through an email-based survey sent via the firm’s internal system to all 7,786 employees working in the headquarters and Nordic subsidiaries. Prior to the survey, an invitation letter signed by the CEO was sent out, emphasizing the importance and reasons for the survey, as well as the fact that there was no obligation to respond. Employees were clearly informed of the mechanism in place to protect their privacy. Employee email addresses were retrieved from the central HRM system. Then, when employees returned the questionnaire, the research department temporarily used their email addresses to retrieve some general (e.g. demographic) information from the HRM system and to link responses to previous surveys. The research department ensured that an encrypted employee email address was developed prior to any use of the data to ensure the anonymity of responses.

The survey was sent out via the head office and each subsidiary’s local intranet. After three weeks, the average response rate was around 56% (66% in Norway and 48% in Denmark), which is considered an acceptable response rate for this type of analysis. The decision to focus on both the headquarters and a subsidiary was prompted by a discussion with the CEO, who claimed that voluntary employee turnover was a serious problem in Norway and Denmark, thus confirming the relevance of the setting for our research. Table 2 shows the percentage of the company’s voluntary employee turnover for Norway and Denmark in 2016.

Table 2 Voluntary employee churn

The final database was constructed from several datasets. The HR department merged the survey data with internal data on voluntary employee turnover after two years, resulting in a dataset based on 2,932 usable responses, of which 834 were from Denmark and 2,098 from Norway, and 209 variables. We removed 95 variables from the database because they were unrelated to voluntary employee turnover, and excluded only 9 responses due to missing values. We chose to exclude these responses rather than replace the missing values in order to minimize the risk of bias in our results.

The dependent variable is dichotomous and takes the value 1 if voluntary employee turnover occurred and 0 otherwise. The 113 independent variables are nominal (dichotomous), categorical, and single-item 7-point Likert scales drawn from previous literature. Table 5 in Appendix A shows the variables included in the analysis and their coding. Independent variables pertain to three main categories of determinants: individual attributes, job-related determinants, and organizational determinants, in line with Ghapanchi and Aurum’s (2011) classification.

We employed several procedural remedies to reduce common method bias, including guaranteeing anonymity to respondents, emphasizing the importance and reasons for the survey, collecting data from different sources and at different points in time, using different datasets to build the final database, and including questions with different formats to reduce the risk of response sets (Podsakoff et al. 2012). Moreover, data on the dependent variable were collected from an internal database two years after the survey that collected data on the independent variables; therefore, common method bias is not a relevant concern for the analysis (Kock et al. 2021).

3.2 Research methodology

We applied a supervised ML technique, the classification decision tree based on the CHAID algorithm, to identify the determinants that characterize employees with a high probability of voluntarily leaving the firm. This technique is particularly suitable for discovering patterns of meaningful relationships (both linear and nonlinear) and rules from large databases (Jain et al. 2016). It is particularly efficient with dichotomous, nominal, and scale-ordinal variables (Ture et al. 2009).

This technique has numerous advantages: 1) it is simple to understand and interpret, 2) it requires little data preparation, 3) it can handle both numerical and categorical data, 4) it uses a white box model, 5) it has a high explanatory power, 6) it performs well on large datasets in a short time (Alao and Adeyemo 2013; Choudhury et al. 2021; Perner et al. 2001; Rombaut and Guerry 2018), 7) it is more “fair” as it shows an improved ability to make unbiased decisions (Garg et al. 2022), 8) it provides clear information about the importance of significant factors for prediction or classification (Tso and Yau 2007), 9) multicollinearity is not a problem, and attempts to eliminate it have resulted in poor classification performance (Piramuthu 2008), which removes the need to apply dimensionality reduction techniques such as Principal Component Analysis (PCA) and Linear Discriminant Analysis (LDA) (Reddy et al. 2020), and 10) it can be applied to a large sample for data reduction purposes. This technique can also handle missing values, provides stopping rules that account for statistical significance at the 1%, 5%, or 10% level, does not assume a priori the type of distribution of independent variables, and its probabilistic estimation is based on the chi-square test (Díaz-Pérez and Bethencourt-Cejas 2016; Kass 1980; Nisbet et al. 2018). Despite these advantages, the classification decision tree is prone to overfitting (Giudici 2010), leading to biased results, even with large databases. However, various remedies, such as cross-validation, can be employed to address overfitting concerns. This method divides the sample into several subsamples (typically 10). Tree models are then generated, each excluding the data from one subsample in turn. For each tree, the risk of misclassification is estimated by applying the tree to the subsample excluded in its generation.
This approach produces a single, final tree model for which the cross-validated risk estimate is calculated as the average of the risks for all trees (Blockeel and Struyf 2002).
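The cross-validation procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the SPSS routine used in this study: the function names `fit_one_split` and `cross_validated_risk` are hypothetical, and the "tree" is replaced by a stand-in that predicts the majority class within each category of a single predictor.

```python
import random
from collections import Counter, defaultdict

def fit_one_split(xs, ys):
    """Hypothetical stand-in for tree growing: majority class per predictor category."""
    overall = Counter(ys).most_common(1)[0][0]
    by_category = defaultdict(Counter)
    for x, y in zip(xs, ys):
        by_category[x][y] += 1
    rules = {cat: counts.most_common(1)[0][0] for cat, counts in by_category.items()}
    # Categories unseen during training fall back to the overall majority class.
    return lambda x: rules.get(x, overall)

def cross_validated_risk(xs, ys, k=10, seed=0):
    """Average misclassification risk over k held-out subsamples (assumes len(xs) >= k)."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint subsamples
    risks = []
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        # Fit on everything except the held-out subsample ...
        predict = fit_one_split([xs[i] for i in train], [ys[i] for i in train])
        # ... then estimate the risk on the subsample excluded from fitting.
        errors = sum(predict(xs[i]) != ys[i] for i in held_out)
        risks.append(errors / len(held_out))
    return sum(risks) / k
```

A cross-validated risk close to the risk estimated on the full sample suggests that the model is not merely memorizing the training data.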

The CHAID classification decision tree systematically breaks down data to classify patterns found in the data set and make rule-based predictions (Berry and Linoff 2000). It can be seen as a recursive procedure in which a set of \(n\) observations is progressively partitioned into groups according to a division rule, based on the \(p\) value derived from multiple chi-square tests, aimed at maximizing a measure of homogeneity or purity of the dependent variable in each of the obtained groups (Giudici 2010). In a further step, the predictor that best splits the dependent variable is identified as the first splitting variable. To this end, a chi-square test is calculated for each contingency table derived from the intersection of the dependent variable with each individual predictor (Cerchiello and Giudici 2012). The first split falls on the predictor with the highest chi-square value and the lowest \(p\) value. The tree continues to branch into child nodes until it reaches the terminal node for each branch (Jain et al. 2016). Each terminal node identifies subgroups defined by different sets of predictors (Tan et al. 2006). The procedure stops when the chosen stopping rule is satisfied (Cerchiello and Giudici 2016). To obtain a final partition of the observations, it is necessary to specify stopping criteria for the division process. The criteria developed for selecting the best partition are often based on the degree of impurity of the child nodes (Tan et al. 2005). The concept of impurity refers to a measure of the variability of the response values of the observations (Giudici 2010). The lower the degree of impurity, the more skewed the class distribution.
Specifically, in a regression tree, a node is pure if it has zero variance (all observations are equal) and impure if the variance of the observations is high, while for classification decision trees, alternative measures such as misclassification impurity, Gini impurity, and entropy impurity should be considered (Tan et al. 2005). In this case, the misclassification impurity (the distance between the observed and expected frequencies) was applied to obtain the final partition of the tree. The expected frequencies are calculated using the hypotheses of homogeneity for the observations in the node considered or using the Chi-square index (Giudici 2010). Assuming that a final partition consisting of \(g\) groups (\(g<n\)) has been reached, then for any given response variable observation \({y}_{i},\) a CHAID classification decision tree will produce the fitted value \({\widehat{y}}_{i}\) or the fitted probabilities of belonging to a single group, assuming only two classes (binary classification). Then, the fitted probability of success is given by the equation:

$${\widehat{y}}_{i}= \frac{{\sum }_{l=1}^{{n}_{m}}{y}_{lm}}{{n}_{m}} ,$$

where \({n}_{m}\) is the number of observations in group \(m\), each observation \({y}_{lm}\) can take the value 0 or 1, and the fitted probability corresponds to the observed proportion of successes in group \(m\), with \({\widehat{y}}_{i}\) constant for all observations in the group (Giudici and Figini 2009; Linoff and Berry 2011; Tan et al. 2005). Figure 1 shows the basic structure of classification decision trees.
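To make the fitted probability concrete, the following minimal Python sketch computes, for each observation, the proportion of successes in its terminal group. The function name `fitted_probabilities` and the group labels in the usage note are hypothetical and serve only as an illustration.

```python
from collections import defaultdict

def fitted_probabilities(groups, outcomes):
    """Assign each observation the share of 1-outcomes observed in its group."""
    totals = defaultdict(int)     # number of observations per group
    successes = defaultdict(int)  # number of 1-outcomes per group
    for g, y in zip(groups, outcomes):
        totals[g] += 1
        successes[g] += y
    share = {g: successes[g] / totals[g] for g in totals}
    # The fitted value is constant for all observations in the same group.
    return [share[g] for g in groups]
```

For instance, with made-up groups `["m1", "m1", "m1", "m1", "m2", "m2"]` and outcomes `[1, 0, 0, 0, 1, 1]`, every observation in `m1` receives 0.25 and every observation in `m2` receives 1.0.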

Fig. 1 Structure of the classification decision tree model. Source: Adapted from Li et al. (2022)
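The choice of the first split, as described in the text, can be illustrated with a simplified Python sketch. It is not a full CHAID implementation: it omits the category merging and Bonferroni adjustment of the original algorithm (Kass 1980), and the names `chi_square`, `chi_square_p_value`, and `best_split` are hypothetical. The \(p\) value uses the closed-form upper tail of the chi-square distribution for integer degrees of freedom.

```python
import math
from collections import Counter

def chi_square(xs, ys):
    """Pearson chi-square statistic and degrees of freedom for predictor xs vs. outcome ys."""
    n = len(xs)
    row, col, cell = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    stat = 0.0
    for r, n_r in row.items():
        for c, n_c in col.items():
            expected = n_r * n_c / n  # expected frequency under homogeneity
            stat += (cell[(r, c)] - expected) ** 2 / expected
    return stat, (len(row) - 1) * (len(col) - 1)

def chi_square_p_value(stat, df):
    """Upper-tail probability of a chi-square variable with integer df."""
    if df == 0:
        return 1.0  # constant predictor or outcome: no evidence of association
    half = stat / 2.0
    if df % 2 == 0:
        term = total = math.exp(-half)
        for k in range(1, df // 2):
            term *= half / k
            total += term
        return total
    total = math.erfc(math.sqrt(half))
    for j in range((df - 1) // 2):
        total += math.exp(-half) * half ** (j + 0.5) / math.gamma(j + 1.5)
    return total

def best_split(predictors, ys):
    """Return (name, p_value) of the predictor with the lowest chi-square p value."""
    p_values = {}
    for name, xs in predictors.items():
        stat, df = chi_square(xs, ys)
        p_values[name] = chi_square_p_value(stat, df)
    name = min(p_values, key=p_values.get)
    return name, p_values[name]
```

Applying `best_split` recursively within each resulting child node, until a stopping rule is met, yields the tree structure shown in Fig. 1.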

The statistical performance of the classification decision tree was assessed through the area under the ROC curve (AUC) (Deng et al. 2016; Hanley and McNeil 1982; Pendharkar 2009; Tan et al. 2006) and the cross-validation test (Choudhury et al. 2021). AUC values equal to 0.5 indicate that the test is not informative, values between 0.5 and 0.7 indicate an inaccurate test, and values between 0.7 and 0.9 indicate a moderately accurate test, whereas values between 0.9 and 1 indicate that the test is highly accurate and a value of 1 indicates a perfect test (Swets 1988). However, values of AUC in the range between 0.9 and 1 may indicate overfitting and analysis bias (Foucher and Danger 2012).
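As a rough illustration of how the AUC can be obtained and read against these bands, the sketch below computes it as the probability that a randomly chosen leaver receives a higher fitted probability than a randomly chosen stayer, with ties counted as one half. The function names `auc` and `interpret` are hypothetical, and the treatment of the band boundaries (and of values below 0.5) is an assumption, since the text does not specify it.

```python
def auc(scores, labels):
    """Share of (leaver, stayer) pairs ranked correctly; ties count one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def interpret(value):
    """Map an AUC value to the bands given in the text (boundary handling is assumed)."""
    if value >= 0.9:
        return "highly accurate"
    if value >= 0.7:
        return "moderate"
    if value > 0.5:
        return "inaccurate"
    return "not informative"
```

Counting ties as one half matters for decision trees, because all observations in the same terminal node share the same fitted probability.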

4 Findings

Figure 2 shows the predictive and analytical power of the CHAID classification decision tree. The tree starts with the root node (node 0), which indicates that 8.6% of permanent employees have recently left their job voluntarily. The classification tree then identifies three layers of predictors with different predictive power (from highest to lowest), which can profile groups of employees with a high propensity to voluntarily quit the firm.

Fig. 2 CHAID classification decision tree

The first layer (node 1 and node 2) pinpoints the most powerful predictor that influences voluntary employee turnover. The country location of employees is the predictor with the greatest discriminatory power. Two different scenarios—in terms of employee turnover determinants—emerge from the analysis.

In Norway, the classification decision tree identifies three different groups of employees at risk of voluntary turnover. The first group includes employees with very low variety in their jobs (e.g. in terms of tasks) who also consider social media (e.g. Facebook) as important tools to share work-related knowledge in the firm (20.7%). The second group includes employees with low-medium job variety who are highly inclined to share work-related knowledge with colleagues to get promoted (16.9%). The third group of employees includes females with high job variety (12.1%). In Denmark, by contrast, the classification decision tree identifies six groups of employees with a high risk of voluntarily leaving. The first three groups, respectively, include the employees with very low, low, and medium job freedom in terms of deciding how to manage their work (33.3%, 13.7%, 34.1%). The fourth group identifies the employees with high freedom in managing their jobs who also work in organizational contexts that consider job rotation an important tool to share work-related knowledge within the organization (28.3%). The fifth group includes employees with high job freedom working in organizational contexts where there is a medium propensity to consider job rotation as an important activity to share work-related knowledge (16.4%). Finally, another important group includes the employees with high job freedom working in contexts where there is a lower propensity to share work-related knowledge through job rotation (8.4%).

Table 3 summarizes the terminal nodes (or leaves) of the CHAID classification decision tree highlighting the split value used in the development of the predictive classification and the p-value for each predictor. Predictors such as country, job variety, gender, and job freedom have high statistical relevance (p < 0.000) in predicting the determinants of the voluntary employee turnover, while “knowledge-sharing with colleagues in order to get promoted”, “importance of knowledge-sharing through social media in the company” and “importance of job rotation as an activity to share work-related knowledge within the organization” have a medium statistical relevance (p < 0.001). The predictor related to knowledge-sharing through social media has low predictive value (p < 0.05).

Table 3 Classification decision tree terminal nodes
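
To make the role of the p-values concrete, the following minimal Python sketch shows how a CHAID-style algorithm can rank candidate predictors by a chi-square test of independence against the turnover outcome and split on the most significant one. The counts, the predictor names, and the helper functions (`chi2_sf`, `chi2_independence`, `best_split`) are hypothetical illustrations, not the data or the implementation used in this study; the category merging and Bonferroni adjustment that CHAID also performs are omitted.

```python
import math
from typing import Dict, List, Tuple


def chi2_sf(x: float, df: int) -> float:
    """P(X >= x) for a chi-square variable with a positive integer df."""
    if df < 1:
        raise ValueError("df must be a positive integer")
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed-form starting values for df = 1 (odd) and df = 2 (even).
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    # Recurrence: Q(k + 2) = Q(k) + (x/2)^(k/2) * exp(-x/2) / Gamma(k/2 + 1).
    while k < df:
        q += math.exp(-half + (k / 2) * math.log(half) - math.lgamma(k / 2 + 1))
        k += 2
    return q


def chi2_independence(table: List[List[int]]) -> Tuple[float, int, float]:
    """Pearson chi-square test on a contingency table.

    Rows are predictor categories, columns are outcomes (left, stayed).
    Assumes every row and column total is non-zero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df, chi2_sf(stat, df)


def best_split(candidates: Dict[str, List[List[int]]]) -> str:
    """Return the predictor with the smallest p-value."""
    return min(candidates, key=lambda name: chi2_independence(candidates[name])[2])


# Hypothetical toy counts: [left, stayed] per category of each predictor.
candidates = {
    "job_variety": [[30, 70], [15, 85], [5, 95]],  # low, medium, high
    "gender": [[26, 124], [24, 126]],
}

for name, table in candidates.items():
    stat, df, p = chi2_independence(table)
    print(f"{name}: chi2={stat:.3f}, df={df}, p={p:.6f}")
print("split on:", best_split(candidates))
```

In this toy example the tree would split first on job variety, because its association with the outcome yields the smallest p-value; the same procedure is then repeated within each resulting node.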

The goodness of fit of the CHAID classification decision tree is moderate, as indicated by an AUC of 74.5%. The cross-validation test confirms the moderate accuracy of the model and suggests that overfitting is not a concern. Figure 3 provides a representation of the ROC curve.

Fig. 3 ROC (receiver operating characteristic) curve
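
As a reminder of what the reported AUC measures, the minimal Python sketch below builds ROC points from predicted leaving probabilities and integrates them with the trapezoidal rule. The labels and scores are invented toy values (leaf-level scores produce ties, as is typical for a decision tree), and the function names `roc_points` and `auc` are hypothetical; the sketch does not reproduce the model or the cross-validation procedure of this study.

```python
from typing import List, Sequence, Tuple


def roc_points(labels: Sequence[int], scores: Sequence[float]) -> List[Tuple[float, float]]:
    """Return (false positive rate, true positive rate) points.

    labels: 1 = employee left, 0 = employee stayed.
    scores: predicted probability of leaving; tied scores are handled as one step.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every observation that shares the current score.
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points


def auc(points: Sequence[Tuple[float, float]]) -> float:
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical toy data: three terminal nodes with different leaving rates.
labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.33, 0.33, 0.33, 0.21, 0.21, 0.08, 0.08, 0.08]

print(round(auc(roc_points(labels, scores)), 5))  # 0.65625
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of leavers from stayers, which is why a value around 0.75 is read as moderate discriminative ability.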

5 Discussion

The CHAID classification decision tree identified seven predictors of voluntary employee turnover. The country location of employees has the highest predictive power, suggesting that the organizational context plays a crucial role in understanding why permanent employees voluntarily leave the firm. This is consistent with previous research showing the importance of organizational variables in this decision (Rubenstein et al. 2018). The model then identified six further predictors, four of which refer to the Norwegian context (job variety, importance of sharing work-related knowledge through social media in the firm, propensity to share work-related knowledge to get promoted, and gender) and two to the Danish one (job freedom, importance of sharing work-related knowledge within the firm through job rotation). The predictors identified are organizational, job-related, and related to individual attributes, in line with previous studies in high-tech contexts (Ghapanchi and Aurum 2011).

The model identifies groups of employees that are internally homogeneous, heterogeneous with respect to each other, and highly likely to leave voluntarily. This approach allows for the profiling of the employees who are more likely to leave voluntarily, highlighting the concurrent effect of predictors for specific groups of employees. For example, our analysis has led to the identification of groups of employees who voluntarily leave the company, which we labelled for the discussion. In the Norwegian context, job variety is the most important predictor, and the model identifies three groups of employees. The first group, which we labelled “bored workers”, includes employees who have relatively repetitive job tasks (Foss et al. 2009) and who want to use social media to share knowledge within the organization (Cabrera and Cabrera 2005). The second group, “ambitious workers”, comprises employees with medium job variety (Foss et al. 2009) who consider it important to share knowledge in order to get promoted (Ryan and Deci 2000). The third group identifies “female workers” who have high job variety (Foss et al. 2009). An imposed high level of job variety may demand more effort and resources from employees who need to change tasks frequently. Further studies could explore these relationships and their motivations in more detail. In Denmark, instead, job freedom is the most significant predictor (Foss et al. 2009), and it identifies six groups of employees, which we labelled as follows. “Rebels I”, “Rebels II” and “Rebels III” are the employees who are at risk of voluntary turnover because they have limited job freedom (Foss et al. 2009). “Impatients I”, “Impatients II” and “Impatients III” are the groups of employees who have a high degree of job freedom but work in an organizational context that views job rotation as a tool for sharing knowledge within the organization (Cabrera and Cabrera 2005).
Employees who are relatively free in their work organization tend to be reluctant to engage in job rotation. Job rotation may involve the reorganization of tasks and active confrontation with other employees and departments, thus threatening to restrict freedom in the way and timing of individual work organization. This suggests that job rotation may have a constraining effect on the employees who are free to organize their work. Future studies could explore the interactions between these variables.

In addition, the CHAID classification decision tree identifies the set or bundle of determinants that influence the decision to leave, enabling a predictive profiling of employees. Thus, unlike traditional statistical techniques (Hom et al. 2017; Garg et al. 2022), this method reveals which combinations of variables influence the decision to voluntarily leave the company by identifying specific groups of employees operating in the organizational context. This method highlights a multiplicity of linear and non-linear effects among different groups of employees, paving the way for new lines of research to explore these aspects.

Our study makes two important theoretical contributions to the field of HRM. First, our study shows both the application and the benefits of CHAID classification decision trees in selecting the determinants that characterize voluntary employee turnover and in profiling groups of employees at risk of voluntary turnover. As suggested by Hom et al. (2017), supervised ML techniques could be used to advance the turnover literature. In this sense, the CHAID classification decision tree can be used to uncover linear and nonlinear relationships in large databases. Supervised ML techniques could be used to analyze in depth relationships that have emerged in the past, complementing past evidence and, eventually, leading to the identification of new relationships, especially in the presence of large databases. Second, in line with the considerations of Garg et al. (2022), the classification decision tree emerges as an effective technique that can be used to solve complex management problems, such as employee turnover. However, this approach could also be combined with other ML tools to support decision making on complex HRM issues that require an understanding of socio-cultural phenomena (Garg et al. 2022; Yang et al. 2023). This suggestion could pave the way for the birth and growth of a research stream lying at the intersection of the HRM and ML literature, which could also properly assess the benefits and risks arising from this interaction.

This article also provides relevant managerial implications. By using ML techniques, in particular classification decision trees, MNCs could develop effective ad hoc retention plans that precisely target the groups of employees who are more likely to leave voluntarily. For instance, our results suggest that a retention strategy that increases job variety and targets female employees in Norway would not be as effective for Danish employees, whose propensity to leave is more related to job freedom (Foss et al. 2009). This shows that MNCs could use this method to effectively self-assess their employee turnover risk, identify employees at risk, and create tailored retention strategies that address the needs of different employee groups. In doing so, they can improve their chances of reducing voluntary employee turnover rates (e.g., Reiche 2008). In fact, developing effective retention strategies allows for the retention of current employees, thereby avoiding additional costs due to turnover (Mobley 1982; Price 1977; Saradhi and Palshikar 2011; Staw 1980) and negative effects on the overall organizational effectiveness and business success (Holtom et al. 2005; Mitchell et al. 2001).

6 Conclusion and limitations

This study shows the predictive and analytical power of ML techniques through the application of the CHAID classification decision tree to predict the determinants of voluntary employee turnover and to profile groups of employees at risk of voluntary turnover. This research goes beyond merely identifying the probability of employees at risk of voluntary turnover, as done in past studies (Choudhury et al. 2021). Indeed, this research shows the predictive and analytical power of the CHAID classification decision tree, and more generally of supervised ML techniques, in analyzing large databases. In particular, through this research we highlight the advantages of this technique in the context of HRM by seeking to open new avenues for future applications related to classification problems in other management contexts where supervised ML techniques could be successfully used to support and improve the quality of the decision-making process (Iqbal et al. 2020; Janssen et al. 2017). ML techniques, especially the CHAID classification decision tree, appear to be a realistic way for decision-makers to obtain strategic knowledge from raw data considered relevant to steering the firm. They could be used not only to solve simple management problems related to recruitment and performance management, but also complex ones such as forecasting employees’ turnover, possibly in combination with other ML techniques (Garg et al. 2022). However, while the mere implementation of supervised ML techniques to support the decision-making process is a good starting point, when not integrated with strategic thinking it risks producing biased results with a negative impact on the quality of business decisions (Choudhury et al. 2021). In fact, we would like to emphasize that supervised ML techniques should not be seen as a replacement for human resource reasoning.

Our research also has some limitations that highlight opportunities for future research. First, the choice of classification algorithm influences the predictive power of the resulting model. However, to date, no complete theory or conceptual guidelines are available to assist researchers in choosing or developing appropriate classification decision tree algorithms. Thus, more research is needed on how to select these algorithms for a specific type of classification problem. In addition, classification decision trees are sensitive to noisy data and may not perform as well as neural networks on non-linear data (Curram and Mingers 1994). Future research should test this technique in different contexts, on different databases, and in comparison with other techniques. Furthermore, future studies could test the application of supervised ML techniques (e.g., decision trees, support vector machines, neural networks) on panel data, which could provide a more robust estimation of employee turnover predictors and represent an effective tool for developing customized retention strategies able to considerably reduce employee turnover risk over time.

To conclude, we hope our study will generate interest and new stimuli in the application of supervised ML techniques, particularly among management scholars, opening new lines of research not limited to HRM but extending to fields where the use of these techniques is still in its embryonic phase.