This section presents the main results. First, casino players and sports bettors are compared by inspecting Shapley values in isolation, meaning that the values of the explanatory variables are not considered at this first step. Following this, the Shapley values are scrutinized in relation to the values of the explanatory variables. The results are further discussed in the next section where the most important findings are emphasized and elaborated on.
Before reporting the Shapley values, some comment on the performance of the trained models is needed. The area under the receiver operating characteristic curve was found to be 0.87 for the casino-gambling model and 0.92 for the sports-betting model. However, the imbalance of the data has to be taken into consideration when interpreting these figures, given that the proportion of positive examples was around 6%. In such cases, the precision and recall metrics are usually preferred. These metrics require converting estimated risk scores, which are values from zero to one, into binary decisions. As mentioned in the Procedure section, this was not necessary for the analysis presented in this paper, since it operated directly on raw risk scores. Nonetheless, for completeness, two thresholds were chosen, one for each model, by optimizing the F score with \(\beta = 0.5\), which weights precision more heavily than recall. The precision and recall were found to be 0.45 and 0.27, respectively, for the casino-gambling model and 0.60 and 0.42, respectively, for the sports-betting model.
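The threshold selection described above can be sketched as follows. This is a minimal illustration and not the procedure used in the study: the function names, the candidate grid (every distinct score), and the toy data are assumptions. The F score is taken in its usual form, \(F_\beta = (1 + \beta^2) P R / (\beta^2 P + R)\), where \(P\) is the precision and \(R\) is the recall.

```python
def precision_recall(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are treated as positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision, recall, beta=0.5):
    """F score; beta < 1 weights precision more heavily than recall."""
    b2 = beta * beta
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def best_threshold(y_true, scores, beta=0.5):
    """Pick the candidate threshold with the highest F score.

    Assumption: candidates are the distinct observed scores.
    """
    candidates = sorted(set(scores))
    return max(
        candidates,
        key=lambda t: f_beta(*precision_recall(y_true, scores, t), beta),
    )


# Toy data (hypothetical): 1 marks an excluded gambler.
toy_labels = [0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.1, 0.2, 0.3, 0.4, 0.5, 0.8, 0.9]
toy_threshold = best_threshold(toy_labels, toy_scores, beta=0.5)
```

With \(\beta = 0.5\), the chosen threshold on the toy data favors a rule with no false positives over one that recovers every positive example, which mirrors the precision-oriented choice described in the text.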
Aggregate Contributions
Figure 1 shows the contributions of the 40 explanatory variables to the risk associated with problem-gambling-related exclusion (refer to Table 1 for the meaning of the variables). The impact was measured in terms of the median absolute value of Shapley values, which was further normalized for convenience. The top ten indicators for each group of gamblers are labeled in the figure.
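The aggregation behind such a ranking can be sketched as follows. The code assumes that the Shapley values are available as one list per explanatory variable, and that the normalization divides each median by the sum of all medians; the text does not state which normalization was applied, so this choice and all names are illustrative.

```python
from statistics import median


def aggregate_contributions(shap_values):
    """Median absolute Shapley value per variable, normalized to sum to one.

    shap_values: dict mapping a variable name to a list of Shapley values,
    one per gambler. The sum-to-one normalization is an assumption.
    """
    raw = {
        name: median([abs(v) for v in values])
        for name, values in shap_values.items()
    }
    total = sum(raw.values())
    return {name: (m / total if total else 0.0) for name, m in raw.items()}


def top_k(contributions, k=10):
    """Names of the k variables with the largest aggregate contribution."""
    return sorted(contributions, key=contributions.get, reverse=True)[:k]


# Toy input (hypothetical variable names and values).
toy_shap = {
    "var_a": [-0.2, 0.1, 0.3],
    "var_b": [0.0, -0.1, 0.05],
    "var_c": [0.4, -0.4, 0.6],
}
toy_contributions = aggregate_contributions(toy_shap)
toy_top = top_k(toy_contributions, k=2)
```

Using the median of absolute values makes the ranking insensitive to the sign of a contribution and to a small number of extreme gamblers.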
Comparing the two groups, it can be seen that there are significant differences in terms of which variables are important. Only six out of ten major contributors in the case of casino players can be found in the top ten of sports bettors. More specifically, the slope of the number of approved deposits denoted by deposit_approved_num_slope, volume of approved deposits denoted by deposit_approved_sum_norm, number of active days denoted by session_day_num_norm, and slope of the number of sessions denoted by session_num_slope are less informative for sports betting compared to casino gambling. Likewise, the number of denied deposits denoted by deposit_denied_num_norm, volume of cash (as opposed to bonus) results denoted by result_cash_sum_norm, proportion of desktop authentication sessions denoted by session_desktop_num_ratio, and standard deviation of the duration of sessions denoted by session_sum_sd are less informative for casino gamblers compared to sports bettors.
The variable indicating that the account was registered in the United Kingdom, which is denoted by country__gb, stands out among other explanatory variables. This is due to the fact that the exclusion rate is significantly higher in this market compared to other markets of operation. It is also interesting to note that the other demographic variables, namely age and gender, play a role, but this role is relatively small compared to other indicators. This suggests that, when problem-gambling-related exclusion is concerned, age and gender are not as informative as one might expect.
Figure 2 shows box plots of Shapley values of all explanatory variables. Outliers are depicted by semi-transparent circles. The variables are sorted by their median Shapley values in the casino-gambling group, and the graph is zoomed in on the interquartile ranges for clarity reasons. It can be seen that the distributions tend to be skewed toward zero. More specifically, the variables with negative medians are right skewed, while those with positive medians are left skewed. One can also note that the interquartile ranges of relatively few variables are located strictly to the left or right of zero. Examples of such variables include the number of days since registration denoted by day_num, which mainly increases the risk score for both groups, and the number of canceled withdrawals denoted by withdrawal_canceled_num_norm, which mainly decreases the risk score for both groups.
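Whether the interquartile range of a variable lies strictly on one side of zero can be checked as sketched below. The quartile method ("inclusive") and the function name are assumptions; the plotting software behind the figure may compute quartiles slightly differently.

```python
from statistics import quantiles


def iqr_side(values):
    """Classify the interquartile range of Shapley values relative to zero.

    Returns "positive" if the whole box is above zero, "negative" if it is
    below zero, and "straddles" otherwise.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    if q1 > 0:
        return "positive"
    if q3 < 0:
        return "negative"
    return "straddles"


# Toy Shapley values (hypothetical).
mostly_increasing = [0.1, 0.2, 0.3, 0.4, 0.5]
mostly_decreasing = [-0.5, -0.4, -0.3, -0.2, -0.1]
mixed = [-0.2, -0.1, 0.0, 0.1, 0.2]
```

A variable classified as "positive" mainly increases the risk score, one classified as "negative" mainly decreases it, and one that straddles zero does both depending on the gambler.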
The distribution of the Shapley values of deposit_approved_sum_norm is much more spread out for casino players. This means that the variable’s contribution to the risk score varies substantially, taking relatively large negative and positive values. For sports bettors, this is not the case; here the variable has a very narrow range of contribution. A similar observation can be made with respect to session_day_num_norm (the number of active days). On the other hand, the contribution of the number of cash wagers denoted by turnover_cash_num_norm is relatively similar across the two groups, and the same holds for self-reported age.
The risk score is noticeably indifferent to certain variables. For casino players, the Shapley values of the following variables are tightly centered at zero: the standard deviation of the volume of approved deposits denoted by deposit_approved_sum_sd, ratio of the volume of cash winnings to the volume of cash wagers denoted by winning_turnover_sum_ratio, volume of bonus wagers denoted by turnover_bonus_sum_norm, standard deviation of the volume of cash wagers denoted by turnover_cash_sum_sd, standard deviation of the number of authentication sessions denoted by session_num_sd, and proportion of cash wagers on Saturdays denoted by turnover_saturday_num_ratio. For sports bettors, such variables are the standard deviation of the number of cash wagers denoted by turnover_cash_num_sd and ratio of the number of cash winnings to the number of cash wagers denoted by winning_turnover_num_ratio. The impact of these variables on the score in the corresponding groups was observed to be minor.
Overall, there were both similarities and dissimilarities between casino players and sports bettors.
Individual Contributions
In this section, the top ten casino-gambling and the top ten sports-betting variables as identified in the previous section are examined. Age is also added to the list, as this is usually of interest. Consequently, the variables of interest comprise the following 15 indicators: age, country__gb (and country__se), deposit_approved_num_norm (the number of approved deposits), deposit_approved_num_slope, deposit_approved_sum_norm, deposit_denied_num_norm, result_cash_sum_norm, session_day_num_norm, session_desktop_num_ratio, session_num_norm (the number of authentication sessions), session_num_slope, session_sum_norm (the duration of authentication sessions), session_sum_sd, turnover_cash_num_norm (the number of cash wagers), and turnover_cash_sum_norm (the volume of cash wagers). Unlike the previous section, the Shapley values in this section are shown in relation to the individual values of the corresponding explanatory variables.
In the majority of the figures that follow, the overall trend is emphasized by a solid line computed using locally estimated scatterplot smoothing (Hastie et al. 2009), and the border between negative and positive Shapley values is highlighted using a dashed line. In addition, many plots have logarithmic scales on their horizontal axes with values of interest being annotated.
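To make the smoothing step concrete, the following is a simplified sketch of locally estimated scatterplot smoothing: a weighted linear fit around each point with tricube weights. It omits the robustness iterations found in full implementations, and the span parameter frac and its default are assumptions, since the settings used for the figures are not stated.

```python
def loess(xs, ys, frac=0.5):
    """Simplified LOESS: local linear fits with tricube weights.

    xs, ys: equally long lists of variable values and Shapley values.
    frac: share of the points used in each local fit (assumed value).
    Returns the smoothed value at every x in xs.
    """
    n = len(xs)
    k = max(2, int(round(frac * n)))
    fitted = []
    for x0 in xs:
        # Bandwidth: distance to the k-th nearest point (x0 itself included).
        dists = sorted(abs(x - x0) for x in xs)
        h = dists[k - 1] or 1e-12
        # Tricube weights; points at or beyond the bandwidth get weight zero.
        ws = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(ws)
        mx = sum(w * x for w, x in zip(ws, xs)) / sw
        my = sum(w * y for w, y in zip(ws, ys)) / sw
        sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
        sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Toy check (hypothetical data): a noiseless line is reproduced by the fit.
toy_x = [float(i) for i in range(10)]
toy_y = [2.0 * x - 3.0 for x in toy_x]
toy_smooth = loess(toy_x, toy_y, frac=0.5)
```

For plots with a logarithmic horizontal axis, one would presumably pass the logarithm of the explanatory variable as xs; whether this was done for the figures is not specified.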
Effect of the Country of Registration
The first explanatory variable analyzed is the country of registration. There were two binary variables considered: country__gb indicating whether the account was created in the United Kingdom and country__se indicating whether the account was created in Sweden. However, it should be noted that the dataset being studied was not constrained to just these two countries. For other countries, both binary variables were zero. The first row in Fig. 3 shows box plots of Shapley values for the two values of each of these binary variables. The UK market stands out in terms of the contribution magnitude, which was explained earlier. Focusing closer on the UK indicator (the bottom four box plots), the situation is similar across the two groups of gamblers when the variable is zero (that is, not registered in the UK). However, when the variable is one, it manifests itself much more strongly in the case of casino players. More specifically, the bulk of the distribution is above 0.5, while it is below 0.5 in the case of sports bettors. This suggests that British casino players are more prone to exclusion than British sports bettors. As for the indicator for Sweden, when the variable is one (that is, registered in Sweden), the risk score is strictly increased for casino players but mostly decreased (although relatively little) for sports bettors. This suggests that Swedish sports bettors tend to not exclude due to gambling-related problems.
Effect of Self-reported Age
The second row in Fig. 3 corresponds to the age that was reported by the gambler at initial registration. It can be seen that the two groups have similar patterns. Low and high values tend to decrease the risk score, while the ones in the middle tend to increase it. However, for casino players, this middle region is narrower and has a larger vertical spread, and the extremum is reached much earlier. For casino players, the most susceptible age for problem-gambling-related exclusion is between 25 and 30 years, whereas for sports bettors, it is between 30 and 40 years.
Effect of Authentication Sessions
The influence of the number of days with authentication sessions, which are also referred to as active days, normalized by the total number of days since registration (that is, session_day_num_norm) is depicted in the third row in Fig. 3. One should be careful reading this plot, since a lot of mass is concentrated at value one, which is due to a large number of new gamblers who have one active day and one day in total. There are differences between casino players and sports bettors. More specifically, the change from negative to positive Shapley values for casino players occurs at one active day per three days. However, there is no clear-cut change point for sports bettors. One can observe that the values in the left tail also tend to increase the risk score. This left tail corresponds to infrequent gamblers with relatively long lifetimes (that is, the time since the initial registration). Such gamblers might decide to permanently close their accounts as redundant, making the model increase the risk score for a reason other than problem gambling.
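A change point of this kind can be read off a smoothed trend by locating where it crosses zero. The sketch below is one way to do so and is not taken from the study: it assumes the trend is given as points sorted by the explanatory variable and interpolates linearly between neighbors. Several returned crossings would correspond to the absence of a clear-cut change point.

```python
def sign_changes(xs, smoothed):
    """Locate zero crossings of a smoothed Shapley trend.

    xs: values of the explanatory variable, sorted in increasing order.
    smoothed: smoothed Shapley values at those points.
    Returns a list of (x, direction) pairs, where direction is
    "neg_to_pos" or "pos_to_neg". Only strict sign changes are reported.
    """
    crossings = []
    for i in range(len(xs) - 1):
        x0, x1 = xs[i], xs[i + 1]
        y0, y1 = smoothed[i], smoothed[i + 1]
        if y0 * y1 < 0:
            # Linear interpolation between the two neighboring points.
            x_cross = x0 + (x1 - x0) * (-y0) / (y1 - y0)
            direction = "neg_to_pos" if y0 < 0 else "pos_to_neg"
            crossings.append((x_cross, direction))
    return crossings


# Toy trends (hypothetical values).
single = sign_changes([0.1, 0.2, 0.3, 0.4, 0.5], [-0.2, -0.1, 0.1, 0.2, 0.3])
several = sign_changes([1.0, 2.0, 3.0], [0.1, -0.1, 0.1])
```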
The proportion of sessions started on a desktop computer including laptops (that is, session_desktop_num_ratio) is depicted in the last row in Fig. 3. A sharp separation can be observed. For casino players, the ratio tends to increase the risk score when it increases to one-quarter or more. For sports bettors, there is an opposite trend. The score starts to decrease as the ratio reaches around one-half. This means that, for casino players, using primarily desktop computers for gambling increases the risk of exclusion, while this mode of gambling decreases the risk for sports bettors.
The impact of the duration of sessions per active day (that is, session_sum_norm) is displayed in the first row in Fig. 4. There is a sharp separation in both groups. However, the sign of the Shapley values changes at different durations: around 70 min for casino players and 100 min for sports bettors. The overall trend declines, which likely relates to the degree of gamblers’ engagement with the product. Gamblers who are willing to spend more time are less inclined toward exclusion. This, in turn, might again hint at the limitations of the target variable.
The number of sessions per active day (that is, session_num_norm) is depicted in the second row in Fig. 4. The trend is as expected here. The risk score increases with the frequency of sessions. As with the previous plot, the change point is slightly different for the two groups. It is around two sessions per day for casino players and three sessions per day for sports bettors.
The utility of the standard deviation of the duration of sessions (that is, session_sum_sd) is illustrated in the third row in Fig. 4. This explanatory variable was available for 84% of the 10,000 gamblers. It can be seen that the variable is informative, and that it manifests itself similarly in the two groups but noticeably more strongly among sports bettors in the negative region of Shapley values.
The slope of the number of sessions per active day over the latest three months, expressed as a correlation coefficient (that is, session_num_slope), is given in the fourth row in Fig. 4. The figure concerns around 65% of the gamblers. The transition of Shapley values from negative to positive happens at different locations: −0.4 for casino players and −0.25 for sports bettors. The trend on the positive half-line is noticeably flatter for sports bettors. In other words, the contribution to the risk score for sports bettors plateaus at a specific point, while it keeps growing for casino players.
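The slope variables are described as correlation coefficients computed over the most recent three months. One possible reading, sketched below, is a Pearson correlation between a time index and the per-period values. The exact definition is given elsewhere in the paper and may differ; the function name, the use of a plain index as the time axis, and the convention of returning None when the coefficient is undefined are all assumptions.

```python
from math import sqrt


def trend_correlation(values):
    """Pearson correlation between a time index (0, 1, 2, ...) and the values.

    Returns a number in [-1, 1], or None when the coefficient is undefined
    (fewer than two observations or no variation in the values).
    """
    n = len(values)
    if n < 2:
        return None
    ts = list(range(n))
    mean_t = sum(ts) / n
    mean_v = sum(values) / n
    s_tt = sum((t - mean_t) ** 2 for t in ts)
    s_vv = sum((v - mean_v) ** 2 for v in values)
    if s_vv == 0:
        return None
    s_tv = sum((t - mean_t) * (v - mean_v) for t, v in zip(ts, values))
    return s_tv / sqrt(s_tt * s_vv)


# Toy series (hypothetical): sessions per active day in consecutive periods.
rising = trend_correlation([1.0, 2.0, 3.0, 4.0])
falling = trend_correlation([4.0, 3.0, 2.0, 1.0])
constant = trend_correlation([2.0, 2.0, 2.0])
```

Under this reading, values near one indicate a steadily increasing activity, values near minus one indicate a steadily decreasing activity, and an undefined coefficient would be one plausible reason for the variable being unavailable for some gamblers.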
Effect of Approved and Denied Deposits
In relation to depositing behavior, the volume of approved deposits per active day (that is, deposit_approved_sum_norm) is depicted in the first row in Fig. 5. The casino-gambling group has a large spread of Shapley values, indicating high informativeness of the variable in this case. For casino players, there is also a clear change point at around €20. A deposit above €20 per active day raises a concern. However, the situation is not as clear for sports bettors. Relative to casino players, the spread of Shapley values appears to be minimal. For the majority of sports bettors, who are located in the middle, the Shapley values fluctuate around zero, meaning that this explanatory variable is not indicative of problem-gambling-related exclusion in the case of sports bettors.
The number of approved deposits per active day (that is, deposit_approved_num_norm) is given in the second row in Fig. 5. In both casino-gambling and sports-betting groups, there is a clear separation between positive and negative Shapley values. For casino players, the critical point is located at one deposit per active day, while it is at one deposit per two active days for sports bettors. In addition, for casino players, the left-hand side is notably flat, meaning that fewer than one approved deposit per active day decreases the risk by a relatively constant amount (independent of the value of the explanatory variable). Finally, sports bettors exhibit another notable change at one deposit per active day; after this point, the contribution exhibits a large jump.
The impact of the number of deposits denied per active day (that is, deposit_denied_num_norm) is shown in the third row in Fig. 5. Deposits are denied by payment service providers, and this can occur for various reasons, such as insufficient funds. In this figure, gamblers without denied deposits are excluded for clarity. The behavior appears to be similar across casino players and sports bettors in terms of the change point but dissimilar in terms of the vertical spread, which is in line with the previous observations.
The correlation coefficient of the number of approved deposits per active day over the most recent three months (that is, deposit_approved_num_slope) is depicted in the last row in Fig. 5. It should be noted that this variable is available for around 35% of the participants. The explanatory variable gives an offset to the risk score that is almost exclusively positive. However, the magnitude of this offset tends to be higher to the right of the origin. This trend is particularly prominent for casino players. Also of note is the fact that the scores of casino players take on values from around zero to 0.5, whereas those of sports bettors lie mostly between 0.25 and 0.5, meaning that the variable increases the risk of problem-gambling-related exclusion for sports bettors much more than casino players.
Effect of Cash Wagers
The volume of cash wagers per active day (that is, turnover_cash_sum_norm), which refers to real money as opposed to bonus money, is displayed in the first row in Fig. 6. The change point for casino players is €90 per active day but only around €50 for sports bettors. In other words, for sports bettors, the risk of problem-gambling-related exclusion starts to be increased by this explanatory variable at a wager that is €40 lower compared to casino players.
The number of cash wagers per active day (that is, turnover_cash_num_norm) is given in the second row in Fig. 6. Here the difference is dramatic. The critical point is 200 bets per active day for casino players and only two for sports bettors. The difference is explained by the nature of the two types of gambling. A casino player generates a wager with every spin of a slot machine. However, each wager is typically of a small monetary value compared to bets in sports. In addition, it should be noted that the spread of Shapley values in the sports-betting group is larger, indicating that this explanatory variable is more discriminative in the case of sports bettors.
Effect of Cash Results
The final variable under examination is the volume of cash results per active day (that is, result_cash_sum_norm), which is the difference between the volume of cash wagers and winnings. Positive and negative results are presented separately. The third row in Fig. 6 shows positive results (in favor of the operator). In general, high losses increase the risk level of problem-gambling-related exclusion. The sign of Shapley values changes at a loss of €10, meaning that after this amount, the risk score starts to be increased by this variable. The last row in Fig. 6 shows negative results (in favor of the gambler). In this case, the small number of data points should be noted when interpreting the results. The Shapley values are all negative, suggesting that gamblers who win tend to not be excluded.