1 Introduction

Learning response patterns from preference and evaluation survey data is crucial for measuring respondents’ perceptions and, subsequently, for understanding their behavior. For instance, churn risk and customer loyalty are usually assessed by asking customers to rate the extent to which they would recommend a given service or product to others. Similar investigations concern the assessment of perceptions and opinions about public policies and interventions on relevant subjects. For these data, the binomial regression model is a well-acknowledged choice since it offers a simple and effective representation of the data generating process (Allik, 2014; Grilli et al., 2015; Pinto da Costa JF. et al., 2008; Raidvee et al., 2012; Zhou & Lange, 2009). The estimable probability parameter conveys a synthesis of the latent feeling with versatile interpretation (satisfaction, preference, agreement, etc., depending on the topic under investigation). Regression techniques and standard variable selection algorithms allow relevant response profiles to be derived in terms of subjects’ characteristics. Automatic identification of these response profiles can be achieved via tree methods: in particular, the rationale of model-based binary trees (Zeileis et al., 2008) hinges on the assumption that a given model can be maintained at each partitioning level, thus performing an iterative search for significant splitting variables, on the basis of parameter instability tests, both to disclose interactions among variables and to optimize model fit. In this framework, the paper discusses the implementation of binomial regression trees for rating data. In order to account for possible framing effects on the response outcome (Hilbig, 2012), the class of CUB mixture models and their extensions (Piccolo & Simone, 2019) provides a research paradigm to assess both feeling and uncertainty of the response distribution in a parsimonious way (Tourangeau et al., 2000).
Indeed, any measure of location and shape summarizing the overall feeling should be accompanied by an assessment of the heterogeneity of the distribution and should be adjusted for possible nuisance effects blurring the underlying sentiment, such as those induced by inflated frequencies and over-dispersion. Under this rationale, if the binomial model is assumed for the feeling component, different uncertainty specifications can be considered: the uniform distribution for the so-called non-contingent response style, or specific models to cope with different response styles (Gottard et al., 2016; Simone & Tutz, 2018). As for every Mixture of Experts System (Gormley & Frühwirth-Schnatter, 2019), variable selection for the regression of the featuring mixture parameters is a challenging task. Recently, model-based trees have been developed for the class of CUB models (Cappelli et al., 2019; Simone et al., 2019), providing useful and innovative tools to perform automatic variable selection and to describe and classify response profiles in terms of uncertainty, feeling, and possible shelter effect, which occurs when an excess of frequency is observed at a category acting as a refuge response. Motivated by the circumstance that the assumed model might not provide an adequate fit to all response profiles, the paper contributes to the state of the art by addressing the issue of model diagnostics on the response profiles learnt from a model-based tree, with a focus on the binomial case. Specifically, the surrogate residual analysis proposed by Liu and Zhang (2018) is exploited to run a local model selection for the best binomial extension that verifies a necessary condition for being correctly specified, within the class of mixture models with uncertainty.

The paper is organized as follows: binomial trees for ordered evaluation data are discussed in Section 2. Section 3 recalls some recent findings on residual analysis for ordinal regression models and provides an overview of how this approach can be exploited as a diagnostic procedure for the chosen class of models and related trees. The ideas put forth in the paper are illustrated in Section 4 on the basis of a customer satisfaction survey as an introductory example. Then, Sections 5 and 6 present two more comprehensive case studies: the analysis of perceived trust towards Television and Press, taken from the ALLBUS German General Social Survey (GESIS - Leibniz-Institut für Sozialwissenschaften, 2016), and the analysis of the satisfaction of Italian PhDs with their doctoral studies, taken from the ISTAT survey on career placement. An outline of future developments is given in Section 7.

2 Models and Regression Trees for Ordered Evaluation Data

Throughout the paper, assume that ordered evaluations are collected on a discrete response scale with m ordered categories, say c1 ≺ c2 ≺ ⋯ ≺ cm: categories will be coded as integers for notational convenience only (cj = j). Let R denote the response rating variable and assume m > 4 for identifiability of all the models that will be considered.

In order to parameterize the latent sentiment and possible framing effects for the response R, the setting of models on the discrete support is advocated. The (shifted) binomial distribution:

$$ \text{b}_{r}(\xi) := Pr(R=r|\xi) = \binom{m-1}{r-1} \xi^{m-r}(1-\xi)^{r-1}, \qquad r=1,\dots,m, $$

provides a parsimonious yet effective choice for the response R, widely acknowledged in the literature (Allik, 2014; Pinto da Costa JF. et al., 2008; Raidvee et al., 2012; Zhou & Lange, 2009). The estimable parameter ξ ∈ (0,1) accounts for the location and shape of the response: specifically, 1 − ξ assesses the probability that a category is preferred over the previous ones, so that the higher the value of ξ, the more right-skewed the distribution, with modal value at one of the lower categories. Thus, if the response scale is oriented so that higher categories correspond to a stronger trait, the measure 1 − ξ is a direct indicator of the overall feeling towards the item being investigated.
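As a quick numerical illustration of the shifted binomial distribution defined above, the following Python sketch evaluates its probabilities for m = 7 and ξ = 0.7. The function name and the parameter values are chosen here for illustration only and do not come from any CUB software.

```python
from math import comb

def shifted_binomial_pmf(m, xi):
    """Return [Pr(R=1), ..., Pr(R=m)] for the shifted binomial b_r(xi)."""
    return [comb(m - 1, r - 1) * xi ** (m - r) * (1 - xi) ** (r - 1)
            for r in range(1, m + 1)]

m, xi = 7, 0.7  # illustrative values, not estimates from the paper
probs = shifted_binomial_pmf(m, xi)

# Expected rating: 1 + (m - 1) * (1 - xi), so it grows with the feeling 1 - xi.
mean_rating = sum(r * p for r, p in enumerate(probs, start=1))
# Modal category (1-based coding of the response scale).
mode = max(range(1, m + 1), key=lambda r: probs[r - 1])
print([round(p, 4) for p in probs], round(mean_rating, 2), mode)
```

With a large ξ most of the probability mass sits on the lower categories, in line with the interpretation of 1 − ξ as the feeling measure.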

Adhering to the rationale supported by some experimental psychologists (Tourangeau et al., 2000), the response process of ordered evaluation data on a discrete support is deemed to be subject to framing effects (Hilbig, 2012). The resulting conceptualization achieved via the class of mixture models with uncertainty (Piccolo & Simone, 2019) involves a combination of actual feeling towards the item and uncertainty, so that:

$$ \text{Pr}(R =r|\boldsymbol{\theta}) = \pi \text{b}_{r}(\xi) + (1-\pi) q_{r}, \qquad r=1,\dots,m, $$

where \(\mathbf {q} = (q_{1},\dots ,q_{m})\) is a suitable model for uncertainty contaminating the feeling model, and π ∈ (0,1] is the mixture parameter weighting feeling importance. The benchmark choice is the uniform specification encompassed by CUB models (\(q_{r}=\frac {1}{m}\) for all \(r=1,\dots ,m\)), to account for heterogeneity of responses and adjust the feeling assessment accordingly (for short, \(R \sim \) CUB(π,ξ)). In several circumstances, a category acts as a refuge option for the response choice, resulting in an inflated frequency for that category, so that uncertainty degenerates solely or partly to a shelter effect (Iannario, 2012). In this case, the feeling measurement should be corrected accordingly, by letting the contaminating distribution in (2) be:

$$ q_{r} = \delta \text{D}_{r}^{(c)} + (1-\delta) \frac{1}{m}, \quad r=1,\dots, m,$$

with \(\text {D}_{r}^{(c)}\) denoting a degenerate distribution with mass concentrated at the shelter category, say c, and with δ ∈ [0,1] measuring the excess frequency at c. The resulting distribution for the observed response is a CUB with shelter effect at category c, namely:

$$ \text{Pr}(R =r|\boldsymbol{\theta}) = \pi_{1} \text{b}_{r}(\xi) + \pi_{2} \frac{1}{m} + (1-\pi_{1}-\pi_{2}) \text{D}_{r}^{(c)}, $$

with shelter effect parameterized by δ = 1 − π1 − π2, and with π1, π2 ∈ (0,1) weighting the importance of the feeling component and of heterogeneity, respectively. This model encompasses the binomial with shelter (π2 = 0). Further uncertainty specifications are possible to encompass different alleged framing effects (Gottard et al., 2016), as well as to consider response styles to middle and extreme categories by choosing the discretized beta distribution (Ursino & Gasparini, 2018; Simone, 2022) as contaminating model (resulting in CAUB: Combination of Adjusted Uncertainty and Binomial, see Simone and Tutz (2018)). In general, any discrete distribution with support in \(\{1,2,\dots , m\}\) (or a discretized version of a continuous distribution) that can be interpreted as a nuisance to feeling can be assumed for the uncertainty component, provided that it does not entail identifiability problems when mixed with the feeling component.
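The mixture with shelter effect can be evaluated directly from its three components. The sketch below is a minimal Python illustration under assumed parameter values; the function names are hypothetical and the numbers are not estimates from the paper.

```python
from math import comb

def shifted_binomial_pmf(m, xi):
    """Feeling component b_r(xi), r = 1..m."""
    return [comb(m - 1, r - 1) * xi ** (m - r) * (1 - xi) ** (r - 1)
            for r in range(1, m + 1)]

def cub_shelter_pmf(m, pi1, pi2, xi, c):
    """Pr(R=r) = pi1*b_r(xi) + pi2/m + (1 - pi1 - pi2)*D_r^(c), r = 1..m."""
    if pi1 < 0 or pi2 < 0 or pi1 + pi2 > 1:
        raise ValueError("weights must be non-negative with pi1 + pi2 <= 1")
    delta = 1.0 - pi1 - pi2  # weight of the shelter (degenerate) component
    b = shifted_binomial_pmf(m, xi)
    return [pi1 * b[r - 1] + pi2 / m + (delta if r == c else 0.0)
            for r in range(1, m + 1)]

m = 7
# Shelter at category c = 1 with delta = 0.1 (assumed values).
with_shelter = cub_shelter_pmf(m, pi1=0.6, pi2=0.3, xi=0.7, c=1)
# pi1 + pi2 = 1 removes the shelter component and gives a plain CUB(0.6, 0.7).
plain_cub = cub_shelter_pmf(m, pi1=0.6, pi2=0.4, xi=0.7, c=1)
# pi2 = 0 gives the binomial with shelter.
binomial_shelter = cub_shelter_pmf(m, pi1=0.9, pi2=0.0, xi=0.7, c=1)
print([round(p, 4) for p in with_shelter])
```

Comparing `with_shelter` and `plain_cub` shows the inflated frequency at the shelter category, compensated by a lower uniform weight on all the others.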

Possibly, the binomial model for feeling can be extended to account for over-dispersion in terms of a beta-binomial distribution, governed by an extra parameter ϕ > 0 related to the excess variability:

$$ \text{g}_{r}(\xi;\phi) := \text{Pr}(R=r|\xi ,\phi) = \binom{m-1}{r-1} \frac{\prod\limits_{k=1}^{r} \big(1-\xi + \phi(k-1)\big) \prod\limits_{k=1}^{m-r+1} \big(\xi + \phi(k-1)\big) }{\big(1-\xi + \phi(r-1)\big)\big(\xi + \phi(m-r)\big) \prod\limits_{k=1}^{m-1} \big(1+\phi(k-1)\big)}, $$

so that the binomial is recovered as \(\phi \rightarrow 0\). As a result, mixtures of the beta-binomial model with the (discrete) uniform distribution or with a degenerate distribution at a fixed category can be designed to account for heterogeneity (resulting in CUBE models: Iannario (2013)) and for shelter effect (beta-binomial with shelter).
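The beta-binomial probabilities can be checked numerically. The Python sketch below evaluates the product formula with both products running over the index k, that is, with factors 1 − ξ + ϕ(k − 1) and ξ + ϕ(k − 1), which is the form under which the probabilities sum to one. Function names and parameter values are illustrative assumptions.

```python
from math import comb, prod

def beta_binomial_pmf(m, xi, phi):
    """Return [g_1, ..., g_m] for the (shifted) beta-binomial with over-dispersion phi >= 0."""
    norm = prod(1 + phi * (k - 1) for k in range(1, m))  # k = 1..m-1
    probs = []
    for r in range(1, m + 1):
        num = (prod(1 - xi + phi * (k - 1) for k in range(1, r + 1)) *
               prod(xi + phi * (k - 1) for k in range(1, m - r + 2)))
        den = (1 - xi + phi * (r - 1)) * (xi + phi * (m - r)) * norm
        probs.append(comb(m - 1, r - 1) * num / den)
    return probs

def variance(probs):
    """Variance of a distribution on the categories 1..m."""
    mean = sum(r * p for r, p in enumerate(probs, start=1))
    return sum((r - mean) ** 2 * p for r, p in enumerate(probs, start=1))

m, xi = 7, 0.3                               # illustrative values
binomial = beta_binomial_pmf(m, xi, 0.0)     # phi = 0 recovers the binomial
overdispersed = beta_binomial_pmf(m, xi, 0.2)
print(round(variance(binomial), 4), round(variance(overdispersed), 4))
```

The two distributions share the same mean, while the variance increases with ϕ, which is the excess variability the extra parameter is meant to capture.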

Given this choice, the implementation of model-based trees (Zeileis et al., 2008) is a natural step to characterize response profiles with synthetic information of different response features (feeling, heterogeneity, shelter, over-dispersion), thus fostering their interpretation and comparative analysis.

2.1 Binomial Trees

Under the binomial model, individual response patterns can be disclosed by letting the probability parameter ξ be subject-dependent with a logit link to a covariate vector xi:

$$ \text{logit}(\xi_{i}) = \gamma_{0} + \mathbf{x}_{i} \gamma,\qquad i=1,\dots,n, $$

where n denotes the sample size. With respect to the intercept term γ0, regression coefficients of the vector γ measure the effects of covariates xi on response feeling: this is a crucial step especially to model multi-modal rating distributions. Model selection and search for significant effects can be performed on the basis of standard likelihood inference methods. Within the model-based approach to classification and regression trees, a binomial tree is proposed here to identify response profiles for the latent feeling of evaluation data. Thus, assuming that the binomial is the maintained model, a binary partitioning algorithm iteratively splits groups of observations (corresponding to tree nodes) in terms of significant variables for the baseline feeling parameter. The algorithm can be summarized as follows:

  1. At step k, consider the subset of observations corresponding to node k with \(n_{k}\) observations (if k = 1, this is called the root node): consider the binomial fit with parameter \(\xi^{(k)}\) for this sample;

  2. Given a set of candidate binary splitting variables {Di}, test the regression:

    $$ \text{logit}(\xi_{i}^{(k)}) = \gamma_{0}^{(k)} + \gamma_{1}^{(k)} \text{D}_{i}, \quad i=1,\dots,n_{k}, $$

    to identify significant effects at a predetermined α level;

  3. Among the candidate splitting variables entailing significant differences on response feeling, determine the one fulfilling a pre-specified optimality criterion;

  4. Accordingly, divide node k into two descendants according to the value of the selected splitting variable: these child nodes will be enumerated as 2k (left descendant) and 2k + 1 (right descendant). Consider the binomial fit conditional on Di for these sub-samples;

  5. For each child node, iterate the procedure from step 1 until at least one stopping rule is met (see below).

Figure 1 displays the general partitioning step of the binomial tree that splits node k into left and right descendants, with feeling parameters that are determined according to (6) conditional to D.

Fig. 1 General partitioning principle for the Binomial tree

Optimality criteria to derive a partitioning rule can be implemented in the same spirit of CUB model regression trees (CUBREMOT (Cappelli et al., 2019; Simone et al., 2019)). Indeed, the binomial model is nested in CUB. Specifically, a splitting criterion can be either based:

  1. on improvements of the log-likelihood deviance from father to child nodes. If \(l(\xi^{(k)})\) is the estimated log-likelihood maximum for the binomial model at node k, then the partitioning principle will select the binary split that entails the largest (absolute) deviance between father’s and descendants’ levels:

    $$ {\Delta} l^{(k)} = l(\xi^{(k)}) -\big(l(\xi^{(2k)}) + l(\xi^{(2k+1)})\big); $$

  2. on maximizing the (normalized) dissimilarity between the estimated distributions \(\hat {\mathbf {p}}^{(2k)}\), \(\hat {\mathbf {p}}^{(2k+1)}\) of child nodes:

    $$ \text{Diss}(\hat{\mathbf{p}}^{(2k)},\hat{\mathbf{p}}^{(2k+1)}) = \frac{1}{2}\sum\limits_{j=1}^{m} | \hat{p}_{j}^{(2k)} - \hat{p}_{j}^{(2k+1)}| \qquad \in [0,1] $$

    to identify the most dissimilar response profiles.
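The two splitting criteria above can be sketched for a single candidate dummy variable as follows. The Python code is a minimal illustration on invented toy ratings, with hypothetical function names. It relies on the closed-form maximum likelihood estimate of ξ for the binomial without covariates, \(\hat{\xi} = (m - \bar{r})/(m-1)\), with \(\bar{r}\) the sample mean of the ratings. Since the logit regression on one dummy fits a separate ξ in each group, the quantity \(-2{\Delta} l^{(k)}\) coincides with the likelihood ratio statistic for \(\gamma_{1}^{(k)} = 0\).

```python
from math import comb, log

def shifted_binomial_pmf(m, xi):
    return [comb(m - 1, r - 1) * xi ** (m - r) * (1 - xi) ** (r - 1)
            for r in range(1, m + 1)]

def fit_xi(ratings, m):
    """Closed-form ML estimate of xi for the binomial without covariates."""
    return (m - sum(ratings) / len(ratings)) / (m - 1)

def loglik(ratings, m, xi):
    probs = shifted_binomial_pmf(m, xi)
    return sum(log(probs[r - 1]) for r in ratings)

def split_criteria(left, right, m):
    """Return (deviance from father to children, dissimilarity between children)."""
    parent = left + right
    xi_k, xi_l, xi_r = fit_xi(parent, m), fit_xi(left, m), fit_xi(right, m)
    delta_l = loglik(parent, m, xi_k) - (loglik(left, m, xi_l) + loglik(right, m, xi_r))
    p_l, p_r = shifted_binomial_pmf(m, xi_l), shifted_binomial_pmf(m, xi_r)
    diss = 0.5 * sum(abs(a - b) for a, b in zip(p_l, p_r))
    return delta_l, diss

m = 7
left = [1, 2, 2, 3, 1, 2, 3, 2]    # toy ratings for D = 0 (invented)
right = [5, 6, 7, 6, 5, 7, 6, 4]   # toy ratings for D = 1 (invented)
delta_l, diss = split_criteria(left, right, m)
print(round(delta_l, 3), round(diss, 3))
```

In a full tree, these quantities would be computed for every significant candidate split at the node, and the split with the largest absolute deviance or the largest dissimilarity would be retained, depending on the chosen criterion.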

Remark 2.1

Customarily, the normalized dissimilarity index Diss(p,f) ∈ [0,1] (Leti G., 1983) is used as a goodness of fit indicator to compare an estimated model p and a relative frequency distribution f, as \(\text {Diss}(\mathbf {p},\mathbf {f}) = \frac {1}{2}{\sum }_{r=1}^{m} |f_{r} - p_{r}|\), so that the lower the value, the better the fit of p. Given this interpretation, the dissimilarity index will hereafter be exploited also to derive a global evaluation of the proposed flexible trees. If \(t_{1}, \dots , t_{p}\) are the terminal nodes of a tree \(\mathcal {T}\), with sizes \(n_{t_{1}},\dots ,n_{t_{p}}\), then consider the following average as an overall measure of fitting performance of \(\mathcal {T}\):

$$ \text{Diss}(\mathcal{T}) = \sum\limits_{i=1}^{p} \frac{n_{t_{i}}}{n} \text{Diss}(t_{i}), $$

where Diss(ti) denotes the dissimilarity between the frequency distribution of observations in node ti and the local model implied by the procedure.
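The overall measure is a size-weighted average of the node-level dissimilarities, as the following Python sketch shows. The two terminal nodes, their sizes, and their distributions are invented for illustration (m = 3 is used only to keep the example short), and the function names are hypothetical.

```python
def dissimilarity(p, f):
    """Normalized dissimilarity between a fitted model p and relative frequencies f."""
    return 0.5 * sum(abs(fr - pr) for pr, fr in zip(p, f))

def tree_dissimilarity(terminal_nodes):
    """terminal_nodes: list of (node size, fitted probabilities, observed frequencies)."""
    n = sum(size for size, _, _ in terminal_nodes)
    return sum(size / n * dissimilarity(p, f) for size, p, f in terminal_nodes)

# Invented terminal nodes.
nodes = [
    (60, [0.5, 0.3, 0.2], [0.4, 0.4, 0.2]),   # Diss = 0.1
    (40, [0.2, 0.3, 0.5], [0.1, 0.2, 0.7]),   # Diss = 0.2
]
diss_tree = tree_dissimilarity(nodes)  # 0.6 * 0.1 + 0.4 * 0.2
print(round(diss_tree, 4))
```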

Pre-pruning of a model-based tree can be performed by specifying a priori stopping rules, applied until the a priori maximum depth of the tree (namely, the number of generations descending from the root node) has been reached. Usually, a node is declared terminal (and the partitioning procedure will stop for it) if any of these stopping rules is met:

  • the sample size of the node is lower than a pre-specified threshold to attempt any splitting procedure;

  • the number of observations for any of the descendants of a candidate (significant) split is lower than a given threshold.

Given a terminal node, the corresponding response profile is determined by the values of covariates corresponding to the edges connecting it to the root.

3 Residual Diagnostics for Ordinal Response Models

For models of the form \(R \sim F_{a}(r; X, \theta )\), where Fa(⋅) is the cumulative distribution function of the assumed model, Liu and Zhang (2018) advocate a jittering approach on the probability scale to define residuals. Briefly, a surrogate variable S is defined by conditionally sampling from a continuous uniform distribution over the sub-interval of (0,1) determined by the observed rating:

$$ S|R=r \sim \mathcal{U}(F_{a}(r-1;\theta),F_{a}(r;\theta)), \quad r=1,\dots,m. $$

On this basis, the residual variable V for the fit of the assumed model can be defined as:

$$ V = S - \mathbb{E}[S| \eta], $$

where η denotes the available information set. One of the main results established in Liu and Zhang (2018, Section 7, Theorem 4) is that \(V| X \sim \mathcal {U}(-\frac {1}{2},\frac {1}{2})\) if the assumed model is correctly specified. Thus, if this necessary condition for the assumed model does not hold, misspecification is detected. The proposal of the paper is to resort to diagnostics of residuals built via the surrogate variable method to determine whether the assumed model can be maintained or whether it is misspecified, either in some model components or because mixed populations are neglected. Beyond graphical inspection of residuals, general tests for comparing continuous distributions (such as the Kolmogorov-Smirnov and Cramér-von Mises tests), as well as specific tests of uniformity, can be considered to perform residual diagnostics. In the following, the Quesenberry-Miller test of uniformity (Quesenberry & Miller, 1977) will be considered to test whether an assumed model verifies the necessary condition for being correctly specified. This choice is motivated by its best power performance against alternative tests (Quesenberry & Miller, 1977).
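The construction of surrogate residuals can be sketched in a few lines of Python. The example below simulates ratings from a CUB model, fits a binomial without covariates, draws the surrogate variable, and centers it at 1/2 (the mean of S under correct specification when no covariates are involved, which is an assumption of this sketch). For simplicity, uniformity is checked with an asymptotic Kolmogorov-Smirnov p-value rather than with the Quesenberry-Miller test used in the paper; all function names, the seed, and the parameter values are illustrative.

```python
import random
from math import comb, exp, sqrt

def binomial_cdf(m, xi):
    """Return [F(0)=0, F(1), ..., F(m)=1] for the shifted binomial."""
    cdf, total = [0.0], 0.0
    for r in range(1, m + 1):
        total += comb(m - 1, r - 1) * xi ** (m - r) * (1 - xi) ** (r - 1)
        cdf.append(total)
    cdf[-1] = 1.0  # guard against rounding
    return cdf

def surrogate_residuals(ratings, cdf, rng):
    """Draw S | R=r ~ U(F(r-1), F(r)) and return V = S - 1/2."""
    return [rng.uniform(cdf[r - 1], cdf[r]) - 0.5 for r in ratings]

def ks_uniform_pvalue(residuals):
    """Asymptotic Kolmogorov-Smirnov p-value against U(-1/2, 1/2)."""
    u = sorted(v + 0.5 for v in residuals)
    n = len(u)
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(u))
    lam = sqrt(n) * d
    if lam < 0.2:  # the Kolmogorov series is numerically 1 in this range
        return 1.0
    p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * (k * lam) ** 2) for k in range(1, 101))
    return min(1.0, max(0.0, p))

def simulate_cub(n, m, pi, xi, rng):
    """Ratings from CUB(pi, xi): binomial with probability pi, discrete uniform otherwise."""
    out = []
    for _ in range(n):
        if rng.random() < pi:
            out.append(1 + sum(rng.random() < 1 - xi for _ in range(m - 1)))
        else:
            out.append(rng.randint(1, m))
    return out

rng = random.Random(2024)  # arbitrary seed
m, n = 7, 1000
p_values = {}
for pi in (1.0, 0.5):  # pi = 1: binomial data; pi = 0.5: strong uncertainty
    ratings = simulate_cub(n, m, pi, 0.7, rng)
    xi_hat = (m - sum(ratings) / n) / (m - 1)  # closed-form binomial ML estimate
    v = surrogate_residuals(ratings, binomial_cdf(m, xi_hat), rng)
    p_values[pi] = ks_uniform_pvalue(v)
print(p_values)
```

When the data come from the binomial itself, the residuals are compatible with uniformity, whereas a sizeable uncertainty component makes the binomial fit fail the necessary condition.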

In the next subsections, we show how to adapt this method to binomial regression trees to perform local uncertainty diagnostics. Specifically, the ultimate goal of the paper is to exploit this necessary condition to determine whether an assumed model (hereafter, the binomial) can be maintained at each step of the partitioning process, or whether a split should preferably be pruned. In the first case, local model selection for the best binomial extension that fulfils the necessary condition for being correctly specified can be pursued to improve the interpretability of the tree.

3.1 Model Misspecification for Missing Component

The goal of this section is to illustrate how residuals’ analysis can be used to check for possible misspecification of the binomial fit to a sample of ratings and, in a similar way, to identify candidate extensions with uncertainty that satisfy the necessary condition for being correctly specified or that should be disregarded instead.

Let \(R \sim \) CUB(π,ξ = 0.7), with m = 7, for varying π ∈ (0,1]: some QQ plots of residuals for the binomial fit on the generated samples, with n = 1000, are shown in Fig. 2 (the reference distribution is the continuous uniform distribution on \([-\frac {1}{2},\frac {1}{2}]\)). Table 1 reports the p-values for the Quesenberry-Miller uniformity tests: results indicate that, unless the weight of the binomial component is very high, significant evidence against the correct specification of the binomial on CUB data is found.

Fig. 2 QQ plot of surrogate residuals for the Binomial fit to data generated according to CUB(π,ξ = 0.7), with m = 7, for varying π ∈ (0,1]

Table 1 p-values for the Quesenberry-Miller uniformity test run on residuals of the estimated binomial and CUB models to data generated according to CUB(π,ξ = 0.7), with m = 7, for varying π ∈ (0,1], n = 1000

Similar tests can be performed to identify the threshold at which misspecification of the binomial due to over-dispersion is significantly detected. Assume that data are generated according to a beta-binomial model as defined in (4). Figure 3 displays the QQ plots comparing the uniform distribution over \((-\frac {1}{2},\frac {1}{2})\) with the distribution of the surrogate residuals for the binomial and the beta-binomial estimated models. Accordingly, Table 2 reports the p-values for the Quesenberry-Miller test of uniformity for the surrogate residuals: it is possible to conclude that neglected over-dispersion is detected even for moderately small values of ϕ.

Fig. 3 QQ plots to compare surrogate residuals for the binomial and beta-binomial fits against quantiles of the (continuous) uniform distribution, if data are generated under a beta-binomial model with the indicated over-dispersion parameter ϕ

Table 2 p-values for the Quesenberry-Miller test of uniformity for surrogate residuals for the binomial and beta-binomial fit, if data are generated under a beta-binomial model (ξ = 0.3, n = 1000, m = 7)

3.2 Neglecting Sub-populations

In order to show how model misspecification can be detected if sub-populations exist and are neglected, consider the surrogate residuals’ distribution for the binomial model with no covariate in case data are generated according to \(R_{i} \sim \text {Bin}(\xi _{i}), \text {logit}(\xi _{i}) = \gamma _{0} + \gamma _{1} \text {D}_{i}\), for a given dummy variable Di, over a scale with m = 7 ordered categories. Figure 4 shows QQ plots of the distributions of the residuals (unconditional and conditional to Di) when the mixed population effect is disregarded or correctly accounted for (left and right panel, respectively).

Fig. 4 QQ plot of surrogate residuals for a binomial model: neglecting (on the left), correctly specifying mixed population effects (on the right)

Thus, for the binomial tree derived with the procedure defined in Section 2, a model-selection procedure can be applied at each terminal node. In particular, if the binomial model verifies the necessary condition for being correctly specified, then the search for fitting improvement is pursued among those extensions that verify this condition in turn. Otherwise, the node is declared terminal and the parent split should be preferably pruned if evidence for misspecification persists even after performing all candidate splits.

Remark 2.2

Since the construction of residuals is based on random generation, the diagnostic procedure hereafter implemented for real data analysis will consider the average p-value of the chosen uniformity test over a set of replications of residual generation for the estimated model. This strategy will be particularly relevant to assess binomial diagnostics at tree nodes, and it allows the uncertainty of parameter estimation to be taken implicitly into account if a large number of replications is considered.
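The averaging strategy of this remark can be sketched as follows in Python. The ratings are invented toy counts on a scale with m = 5 categories, the model is the binomial without covariates, and the uniformity test is an asymptotic Kolmogorov-Smirnov test used here as a stand-in for the Quesenberry-Miller test; function names, seed, and number of replications are illustrative assumptions.

```python
import random
from math import comb, exp, sqrt

def binomial_cdf(m, xi):
    """Return [F(0)=0, F(1), ..., F(m)=1] for the shifted binomial."""
    cdf, total = [0.0], 0.0
    for r in range(1, m + 1):
        total += comb(m - 1, r - 1) * xi ** (m - r) * (1 - xi) ** (r - 1)
        cdf.append(total)
    cdf[-1] = 1.0  # guard against rounding
    return cdf

def ks_pvalue_unit_interval(s):
    """Asymptotic Kolmogorov-Smirnov p-value for uniformity of s on (0, 1)."""
    u, n = sorted(s), len(s)
    d = max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(u))
    lam = sqrt(n) * d
    if lam < 0.2:  # the Kolmogorov series is numerically 1 in this range
        return 1.0
    p = 2.0 * sum((-1) ** (k - 1) * exp(-2.0 * (k * lam) ** 2) for k in range(1, 101))
    return min(1.0, max(0.0, p))

def average_p_value(ratings, m, n_rep, rng):
    """Average the uniformity p-value over n_rep draws of the surrogate variable."""
    xi_hat = (m - sum(ratings) / len(ratings)) / (m - 1)  # binomial ML estimate
    cdf = binomial_cdf(m, xi_hat)
    p_values = []
    for _ in range(n_rep):
        s = [rng.uniform(cdf[r - 1], cdf[r]) for r in ratings]  # S | R = r
        p_values.append(ks_pvalue_unit_interval(s))  # V = S - 1/2 is uniform iff S is
    return sum(p_values) / n_rep, p_values

counts = [8, 30, 62, 66, 34]  # invented frequencies for categories 1..5
ratings = [r for r, c in enumerate(counts, start=1) for _ in range(c)]
avg_p, p_values = average_p_value(ratings, m=5, n_rep=50, rng=random.Random(7))
print(round(avg_p, 3), round(min(p_values), 3), round(max(p_values), 3))
```

The spread between the smallest and largest p-value across replications indicates how much a single draw of residuals could mislead, which is what motivates reporting the average.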

4 Illustrative Example on Customers’ Satisfaction Survey

The ABC Annual Customer Satisfaction Survey refers to a company offering IT solutions to media and telecommunication service providers (Kenett & Salini, 2012, Chapter 2). On a Likert-type scale with m = 5 ordered categories (1 =“very low,” 5 =“very high”), customers were asked to rate their overall satisfaction, along with satisfaction with several aspects of the customer experience, including the following: the extent to which they would recommend the ABC company to a third company (recom); the extent to which they would reconsider the company for further purchases (product); the overall satisfaction with the equipment purchased (equipment); with sales and technical support (sales, technical); with purchasing support (purchase) and with pricing. Due to the small sample size obtained after omitting missing values list-wise (n = 212), only small trees can be grown: thus, this dataset will be used for illustration purposes only.

Table 3 reports the average p-value for the Quesenberry-Miller test of uniformity for surrogate residuals for selected candidate models (50 random generations were considered), showing that — at the 5% level — the binomial is significantly misspecified for ratings concerning recom, equipment, technical. For equipment, the only model that fulfils the necessary condition for correct specification is the binomial with the addition of a shelter effect (at category c = 4).

Table 3 Average p-value for Quesenberry-Miller test of uniformity applied to surrogate residuals for selected candidate models (50 replications)

It is seen that the binomial model (without covariates) cannot be maintained for recom at the fixed significance level. This circumstance may be due to neglected sub-populations: indeed, the primary split of a dissimilarity binomial tree separates recom ratings provided by customers who are not satisfied with the sales support (sales ≤ 2) from those who are satisfied (sales ≥ 3), for which the binomial model can be safely assumed instead. Figure 5 displays diagnostic checks of uniformity of residuals at the root node and at left and right descendants (top row panels), along with the barplots of the frequency distributions at the nodes, with superimposition of the fitted Binomial model.

Fig. 5 Primary split of the dissimilarity binomial tree for recom: comparison between observed and fitted distributions (top) and residuals’ QQ plots (bottom): ABC dataset

Node 2 is declared terminal by the procedure due to a pre-pruning condition relative to the sample size of the node: the split that is selected for node 3, instead, cannot be accepted for the binomial tree since the (conditional) binomial is significantly misspecified for its left descendant (node 6: see Fig. 6). Thus, the tree growing procedure stops and node 3 is declared terminal as well.

Fig. 6 Split at node 3 of the dissimilarity binomial tree for recom: comparison between observed and fitted distributions (top) and residuals’ QQ plot (bottom): ABC dataset

Thus, one should prune this further split under the binomial tree.

As a more comprehensive example in this respect, consider the binomial tree for the overall satisfaction (satis) to disentangle local association for different aspects of the customer experience: here the deviance criterion is considered.

Table 4 reports the average p-value obtained over 50 replications of residual generation for competing models: it is seen that the binomial and all the extensions satisfy the necessary condition for being correctly specified at the given significance level. Thus, local model selection can be performed to identify the best uncertainty specification according to standard criteria.

Table 4 Average p-value (over 50 replications) for the Quesenberry-Miller test of uniformity for nodes of the deviance binomial tree for overall satisfaction: ABC dataset

Accordingly, Table 5 summarizes the main information for each node: estimated feeling measure \(1-\hat {\xi }\), mixing weight \(\hat {\pi }\) of the feeling component, possible shelter effect parameter \(\hat {\delta }\) and corresponding shelter category under the best mixture model with uncertainty, as well as dissimilarity of both the binomial and the best model with respect to the observed frequency distribution. For each inner node of the tree, the selected split variable and split point to determine the left and right descendants are also reported. The best model is selected by jointly considering results from likelihood ratio tests for nested models (in particular, with respect to the baseline binomial), and from BIC comparisons for non-nested models. In case several models are equivalent in terms of BIC index (Burnham & Anderson, 2003), the model with the lowest dissimilarity with the observed frequencies can be chosen.

Table 5 Binomial tree results (deviance criterion) and best uncertainty specification: ABC dataset

From response profiles learnt at terminal nodes (3,5,8,9), it can be claimed that:

  • customers’ propensity to recommend the company is the strongest indicator of overall satisfaction (indeed, node 3 refers to overall satisfaction for those who rated recom= 5);

  • the most influential dimension of overall satisfaction is the satisfaction for the sales support: thus, overall satisfaction can be controlled by focusing primarily on the control of this aspect of the customer experience;

  • a small percentage of structurally dissatisfied customers is present (measured by \(\hat {\delta }\)), stronger for respondents who are moderately satisfied for the sales support (sales= 3);

  • the feeling measure \((1-\hat {\xi })\), weighted for the importance \(\hat {\pi }\) of the feeling component, can be considered as an overall satisfaction indicator, and response profiles can be ranked accordingly. In the present example, customer satisfaction should be improved starting from the response profile associated with node 8 (corresponding to customers such that recom ≤ 4 and sales ≤ 2).

5 Perceived Trust Towards Press and Television

Perceived quality of products and services is related to perceived trust of users in a complex and multi-faceted scheme that involves customers’ satisfaction and loyalty (Bloemer et al., 1999; Chiou & Droge, 2006; Eisingerich & Bell, 2008; Garbarino & Johnson, 1999).

With reference to the ALLBUS German General Social Survey of 2012 (GESIS - Leibniz-Institut für Sozialwissenschaften, 2016), consider the perceived trust expressed by n = 2692 respondents towards Press and Television, as institutions, collected on a rating scale with m = 7 ordered categories (1 = “no trust at all,” 7 = “a great deal of trust”), after list-wise omission of missing values of the considered set of variables.

Among the available covariates used to grow the tree (including gender, employment status, German citizenship, income, and marital status), the procedure has selected:

  • Agec: age of the respondent in ordered classes of years (1 = 18–29, 2 = 30–44, 3 = 45–59; 4 = 60–74; 5 = 75–89; 6 = more than 90);

  • notwork: a dummy indicating if the respondent is unemployed (notwork= 1) or employed (notwork= 0);

  • Internet: a dummy indicating if the respondent uses internet for private purposes (Internet= 1) or not (Internet= 0);

  • left-right: left-right self-placement on political orientation (semantic scale with ten categories running from extreme left (left-right= 1) to extreme right (left-right= 10));

  • univ: a dummy variable to indicate whether the respondent has a university education (univ= 1) or not (univ= 0);

  • west: a dummy variable to indicate if respondent’s residence is in the Old Federal Republic (West Berlin: west= 1) or in former German Democratic Republic (East Berlin: west= 0).

For both ratings on perceived Trust towards Press and Television, the response profiles derived from the dissimilarity binomial tree are discussed. Indeed, for latent traits like perceived trust, which are deemed to exhibit similar sentiment among the population, the dissimilarity criterion can be helpful in determining significant differences in model parameters that highlight more dissimilar response patterns. Given the orientation of the response scale, the feeling measure \(1-\hat {\xi }\) under the binomial model is a direct indicator of perceived trust.

Remark 2.3

Binomial trees are grown up to the fourth generation of descendants from the root node: as further pre-pruning rules, a minimum sample size of 250 observations per node is required to attempt a split, and a split is admissible only if each descendant corresponds to 100 observations at least.
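The pre-pruning rules of this remark translate into a simple admissibility check for a candidate split. The Python sketch below uses the thresholds stated above (maximum depth 4, at least 250 observations to attempt a split, at least 100 observations per descendant) and derives the generation of a node from the 2k, 2k + 1 numbering of Section 2.1; the function names and the example counts are hypothetical.

```python
def node_depth(k):
    """Generation of node k when the children of k are 2k and 2k + 1 (root k = 1 has depth 0)."""
    return k.bit_length() - 1

def split_allowed(k, n_node, n_left, n_right,
                  max_depth=4, min_node=250, min_child=100):
    """Check the a priori stopping rules for a candidate split of node k."""
    if node_depth(k) >= max_depth:   # maximum depth already reached
        return False
    if n_node < min_node:            # too few observations to attempt a split
        return False
    return min(n_left, n_right) >= min_child  # both descendants large enough

# Invented node sizes, for illustration only.
print(split_allowed(1, 2692, 1500, 1192))   # root node, large descendants
print(split_allowed(7, 300, 210, 90))       # one descendant below 100
print(split_allowed(16, 400, 200, 200))     # node in the fourth generation
```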

5.1 Perceived Trust Towards Press

The binomial tree highlights the following response profiles at the terminal nodes:

  • Node 4: Employed residents in former GDR (west= 0);

  • Node 5: Unemployed residents in former GDR (west= 0);

  • Node 6: Residents in former Federal Republic (west= 1) that do not use internet for private purposes;

  • Node 14: Young residents (aged less than 29) in former Federal Republic (west= 1) that use internet for private purposes;

  • Node 15: Adult and elderly respondents, resident in former Federal Republic (west= 1), that use internet for private purposes.

Before commenting on results from the binomial tree, diagnostics of the assumed model need to be performed. For each node of the binomial tree, Table 6 reports the average of the p-values for the Quesenberry-Miller test of uniformity of surrogate residuals, to verify the necessary condition for correct specification of the binomial and its extensions.

Table 6 Average p-values for the chosen uniformity test for the binomial tree — Trust towards Press (500 replicates)

For the sake of completeness, Fig. 7 displays the boxplots of the p-value distribution for terminal nodes.

Fig. 7 Boxplots of p-values for the Quesenberry-Miller test of uniformity of surrogate residuals of alternative models at terminal nodes: Trust towards Press

Assuming the average p-value as a synthesis of the residuals’ generation procedure, it follows that the binomial can be maintained at each level: thus, local model selection for binomial extensions can be pursued to account for possible over-dispersion and to identify the uncertainty source best characterizing each response profile. This selection process is based on a combined analysis of the information on fitting performance provided by the LRT and the BIC index, reported in Tables 7 and 8.

Table 7 LRT statistics for pairs of nested models: Trust towards Press
Table 8 BIC Difference with respect to the minimum BIC for each node: Trust towards Press
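The two quantities tabulated for each node can be obtained from the maximized log-likelihoods of the candidate models. The sketch below uses the standard definitions of the likelihood ratio statistic and of the BIC; the model labels and the numerical values in the usage lines are purely hypothetical and do not reproduce any entry of Tables 7 and 8.

```python
import math


def lrt_statistic(loglik_null, loglik_alt):
    """Likelihood ratio statistic for a nested pair (null inside alt)."""
    return 2.0 * (loglik_alt - loglik_null)


def bic(loglik, n_params, n_obs):
    """Bayesian Information Criterion: -2 loglik + k log(n)."""
    return -2.0 * loglik + n_params * math.log(n_obs)


def bic_differences(fits, n_obs):
    """Difference of each BIC from the minimum BIC at the node.

    fits : dict mapping a model label to (loglik, n_params)
    """
    values = {name: bic(ll, k, n_obs) for name, (ll, k) in fits.items()}
    best = min(values.values())
    return {name: value - best for name, value in values.items()}


# Hypothetical node with 1000 observations and two candidate models.
fits = {"binomial": (-1500.0, 1), "binomial + shelter": (-1480.0, 2)}
print(lrt_statistic(fits["binomial"][0], fits["binomial + shelter"][0]))
print(bic_differences(fits, n_obs=1000))
```

The model with a BIC difference equal to zero is the one preferred at the node according to this index.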

After determining the best model for each node, Table 9 reports the relevant results, indicating the characterizing parameters, the split at the node, the sample size, and the dissimilarity of both the baseline binomial and the best fitting mixture extension with respect to the observed frequencies.

Table 9 Summarizing results for the local model selection on the (dissimilarity) binomial tree: Trust towards Press

It turns out that the fitting performance of the feeling model improves if it is mixed with a shelter effect at the first category c = 1 (thus indicating that distrust is a structural phenomenon, yet with different weights). Focusing on response profiles at terminal nodes (4, 5, 6, 14, 15), perceived trust towards Press is the lowest for unemployed people living in former East Germany: this response profile also corresponds to the strongest structural distrust (as measured by \(\hat {\delta }\)), whereas the strongest trust characterizes responses from people living in former West Germany that do not have access to Internet. Feeling differences due to age are found in former West Germany for those who have access to Internet for private use: in particular, young respondents experience a higher trust towards Press than older adults. For the sake of completeness, Fig. 8 displays observed and fitted response distributions (comparing binomial and selected best model), along with quantile-quantile plots of residuals, for the terminal nodes.
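To fix ideas on how a shelter effect at c = 1 modifies the feeling model, the sketch below builds a two-component mixture of a degenerate distribution at the shelter category, with weight \(\delta\), and a shifted binomial with parameter \(\xi\). This is a minimal sketch under stated assumptions: the parameterization follows the usual convention of the CUB literature, the function names are hypothetical, and the number of categories and the parameter values in the usage lines are illustrative only.

```python
import math


def shifted_binomial_pmf(m, xi):
    """Shifted binomial on 1..m (assumed CUB-style parameterization)."""
    return [math.comb(m - 1, r - 1) * (1 - xi) ** (r - 1) * xi ** (m - r)
            for r in range(1, m + 1)]


def binomial_shelter_pmf(m, xi, delta, c=1):
    """Mixture: delta * (point mass at c) + (1 - delta) * shifted binomial."""
    if not 0.0 <= delta <= 1.0:
        raise ValueError("delta must lie in [0, 1]")
    base = shifted_binomial_pmf(m, xi)
    return [delta * (1.0 if r == c else 0.0) + (1.0 - delta) * base[r - 1]
            for r in range(1, m + 1)]


# Illustrative values only: m = 7 categories, feeling 1 - xi = 0.45,
# and 10% of structural mass at the first category.
pmf = binomial_shelter_pmf(m=7, xi=0.55, delta=0.10, c=1)
print([round(p, 4) for p in pmf])
```

The probability of the first category is at least \(\delta\) whatever the feeling, which is why a large \(\hat {\delta }\) at c = 1 is read as structural distrust.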

Fig. 8

For each terminal node, instance of residuals’ QQ plot (top) and comparisons between observed and fitted distributions for binomial and selected best model (bottom) with reference to Table 9: dissimilarity binomial tree for Trust towards Press, adjusted for best contaminating uncertainty

5.2 Perceived Trust for Television

The binomial tree highlights the following response profiles at the terminal nodes:

  • Node 4: Residents in former East Germany that do not use Internet for private purposes;

  • Node 5: Residents in former West Germany that do not use Internet for private purposes;

  • Node 7: Respondents that use Internet for private purposes, aged more than 60 years;

  • Node 13: Respondents that use Internet for private purposes, aged less than 60 years, with a University education;

  • Node 24: Respondents that use Internet for private purposes, aged less than 60 years, with no University education and with left-wing political orientation (left-right ≤ 4);

  • Node 25: Respondents that use Internet for private purposes, aged less than 60 years, with no University education and with neutral or right-wing political orientation (left-right ≥ 5).

Table 10 reports the average of the distribution of the p-values for the Quesenberry-Miller test of uniformity (500 random generations of residuals). Fig. 9 supplements the discussion with boxplots of these distributions. It follows that, at the root node, evidence for the correct specification of the binomial is quite weak: however, in light of the improvement occurring at the descending nodes 2 and 3, this circumstance can be partly due to the existence of two sub-populations. Similar remarks hold for the subsequent splits of node 3 into nodes 6 and 7, of node 6 into nodes 12 and 13, and finally of node 12 into nodes 24 and 25. As a result, there is sufficient evidence that the binomial can be maintained as the assumed model for the baseline response generating process, possibly after accounting for mixed populations.

Table 10 Average p-value for the chosen uniformity test for dissimilarity binomial tree (500 replicates): Trust towards Television
Fig. 9

Boxplots of the distributions of p-values for Quesenberry-Miller test of uniformity of surrogate residuals of alternative models at terminal nodes (500 replicates): Trust towards Television

Table 11 reports the main results from the local model selection for the best adjustment of the binomial: it can be concluded that the lowest Trust towards Television corresponds to extremely left-wing politically oriented young adults with no university degree who use internet for private purposes; the highest trust, instead, characterizes respondents living in former West Germany that do not use internet. Overall, older people trust Television more than young adults; in addition, structural distrust, as measured by \(\hat {\delta }\), is stronger for young adults than it is for seniors. When comparing former West and East Germany residents, it follows that the latter perceive a lower trust towards television overall, and are also subject to a quite strong structural distrust.

Table 11 Summarizing results for the local model selection on the (dissimilarity) binomial tree: Trust for Television

Results indicate that different uncertainty contaminations are needed at different partitioning levels: in particular, the specification of a shelter effect (at category c = 1 for all nodes except for node 24) improves the fit of either the binomial or CUB models, and a possible moderate over-dispersion effect should be accounted for in certain response profiles. Thus, a fixed model-based tree would have failed to account for the diversified features of the response profiles for the perception of Trust for TV. In order to show the local model adjustment that is needed at the response profiles, Fig. 10 displays, for each terminal node, the QQ plot of residuals for the binomial and its best extension (see Table 11), as well as the fitted distributions superimposed on the barplot of the observed distribution.

Fig. 10

For each terminal node, instance of residuals’ QQ plot and comparisons between observed and fitted distributions for both binomial (solid lines) and selected best model (dashed lines): deviance binomial tree for perceived trust towards Television, adjusted for best contaminating uncertainty

5.2.1 Tree Performance

The fitting performance of the binomial tree can be assessed by averaging the dissimilarity index to compare frequency distribution and the model implied by the tree computed at each terminal node, with weights given by sample sizes, as defined in (9). Similarly, the dissimilarity between frequency distribution and best binomial extension can be computed to evaluate the advantage of resorting to the proposed procedure of local uncertainty diagnostics: results are reported in Table 12 and indicate a noticeable improvement in fitting performance from the binomial tree to the adjusted version.

Table 12 Tree dissimilarity for the case studies on perceived trust towards Press and Television
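The weighted average just described can be sketched as follows. The code assumes the usual normalized dissimilarity index, namely half the sum of the absolute differences between observed relative frequencies and fitted probabilities, and weights each terminal node by its sample size; this is one reading of (9), and the function names are hypothetical.

```python
def dissimilarity(observed_freq, fitted_prob):
    """Normalized dissimilarity between relative frequencies and a model."""
    if len(observed_freq) != len(fitted_prob):
        raise ValueError("distributions must share the same support")
    return 0.5 * sum(abs(f - p) for f, p in zip(observed_freq, fitted_prob))


def tree_dissimilarity(terminal_nodes):
    """Sample-size weighted average of node dissimilarities.

    terminal_nodes : list of (n_k, observed_freq, fitted_prob) tuples,
                     one per terminal node of the tree.
    """
    total = sum(n for n, _, _ in terminal_nodes)
    return sum(n * dissimilarity(f, p) for n, f, p in terminal_nodes) / total


# Illustrative values only: two terminal nodes on a 3-point scale.
nodes = [
    (100, [0.2, 0.5, 0.3], [0.25, 0.5, 0.25]),
    (300, [0.1, 0.3, 0.6], [0.1, 0.4, 0.5]),
]
print(tree_dissimilarity(nodes))
```

Evaluating the same function with the fitted probabilities of the best extension at each terminal node, instead of those of the baseline binomial, gives the figure for the adjusted tree.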

With respect to predictive ability, several measures are available for evaluating the performance of classification procedures, but few proposals are specifically designed for ordinal responses. Among these are the Ranked Probability Score (Gneiting & Raftery, 2007; Murphy, 1971; Simone & Piccolo, 2022) (corresponding to assigning the median of the predictive model to a new observation) and measures that resort to the modal value of the predictive model as a prediction for a new observation, like the proposals introduced by Ballante et al. (2022) and Cardoso & Sousa (2011). However, the modal value may provide an inadequate representation for an ordinal model, even if it is unique. In general, the rationale pursued by the proposed procedure is based on the association of a whole predictive model to a given covariate profile, rather than a single response value, so that the characterizing model parameters can be used to describe future observations from that profile. For instance, consider a sample of 500 observations from the original sample of both case studies on Perceived Trust to be used as a test set. Then, each test observation is assigned, on the basis of its covariate profile, to the terminal node of the binomial tree into which it is classified; finally, with respect to the frequency distribution of the test observations classified in that node, prediction performance can be assessed via the dissimilarity value and the Ranked Probability Score (RPS, see Footnote 5) with the estimated conditional model under the binomial tree and the estimated best mixture with uncertainty. Results reported in Table 13 indicate satisfactory performance.

Table 13 Indicators of prediction performance: average dissimilarity between observed distribution of the test sets classified into terminal nodes and corresponding binomial and best mixture with uncertainty; total RPS and weighted average of RPS (averages are weighted with sample sizes)
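For reference, the Ranked Probability Score of a predictive distribution on m ordered categories can be computed as in the sketch below. It assumes the standard (non-normalized) definition based on squared differences between the predictive distribution function and the step function of the observed category; the exact variant and aggregation adopted for Table 13 may differ, and the function names are hypothetical.

```python
def rps(pmf, y):
    """Ranked Probability Score of the predictive pmf for an observed y.

    pmf : predictive probabilities for categories 1..m
    y   : observed category (integer in 1..m)
    """
    cumulative = 0.0
    score = 0.0
    for k in range(1, len(pmf)):      # k = 1..m-1 (the term at m is zero)
        cumulative += pmf[k - 1]      # predictive P(R <= k)
        observed_step = 1.0 if y <= k else 0.0
        score += (cumulative - observed_step) ** 2
    return score


def node_rps(pmf, test_ratings):
    """Total and average RPS for the test ratings classified into a node."""
    scores = [rps(pmf, y) for y in test_ratings]
    return sum(scores), sum(scores) / len(scores)


# Illustrative values only: a predictive model on a 3-point scale.
print(node_rps([0.2, 0.5, 0.3], [1, 2, 2, 3]))
```

Lower values indicate that the predictive model at the node concentrates its mass close to the observed categories, with errors weighted by their distance along the ordinal scale.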

6 Satisfaction of Italian Ph.D. Awardees

The approach proposed in the paper takes the lead from the introduction of the binomial tree, since it is the simplest model on the discrete support that can be assumed for evaluation data to parameterize the latent feeling. However, the diagnostic procedure and the subsequent local specification adjustments can be implemented for every model-based tree with respect to extensions of the assumed model. For illustration purposes, an example on CUBREMOT is presented (Cappelli et al., 2019; Simone et al., 2019). Consider the overall satisfaction for the doctoral experience rated by Italian Ph.D. holders who were awarded the title in 2012 and 2014, collected within the survey run by the Italian National Statistical Office (ISTAT) to investigate their satisfaction for the professional placement and the Ph.D. programme (available at https://www.istat.it/it/archivio/87536).

Satisfaction for the Ph.D. experience was rated with reference to several dimensions (quality of teaching courses, spaces and tools at disposal, and so on): ratings for overall satisfaction will be considered hereafter as the response variable. All the ratings were collected on a discrete scale with categories coded from 0 to 10: the rating scale has been subsequently modified to a scale with 8 categories due to zero-scores observed in certain categories, so that higher scores along the response scale correspond to higher levels of satisfaction.

After omitting missing values for the variables of interest, n = 3830 observations are used for the analysis. Among the available covariates used to grow the tree (including current employment status, residence, discipline of the Ph.D. program, marital status, participation in research project, and year of Ph.D. completion), the procedure has selected:

  • gender: a dummy indicating if the respondent is male (gender = 0) or female (gender = 1);

  • abroad: a dummy indicating if the respondent had any work or training experience abroad after the Ph.D. completion (abroad = 1) or not (abroad = 0);

  • research: a dummy indicating if the respondent currently works in the research domain (research = 1) or in other fields (research = 0);

  • stem: a dummy indicating whether the Ph.D. program was relative to STEM disciplines (stem = 1) or different ones (stem = 0);

  • north: a dummy indicating if the respondent was awarded the Ph.D. title by a University located in Northern Italy (north = 1) or in a different geographical area (north = 0).

Due to a structural heterogeneity of the distribution, the binomial does not satisfy the necessary condition for being correctly specified at any of the partitioning levels of the corresponding model-based tree, not even after searching for possible neglected sub-populations. Instead, for a general CUB model, the necessary condition for being correctly specified is verified at each stage of the growth of a CUBREMOT with the deviance splitting criterion (results are not reported for brevity). Thus, CUB can be assumed as the maintained model for the response generating process. Then, at each step, the procedure selects the most significant binary split in either feeling or uncertainty parameters. Table 14 reports the main information concerning the nodes composing the tree with respect to the local search of the best model extension, among those verifying the necessary condition for being correctly specified (in particular, candidate models are CUBE and CUB with shelter); see Fig. 11 for a visualization of the results.

Table 14 Summarizing results for the local model selection on the (deviance) CUBREMOT for Ph.D. overall satisfaction
Fig. 11

For each terminal node, residuals QQ plot and comparisons between observed and fitted distributions for both CUB and selected best model: adjusted deviance CUBREMOT tree for Ph.Ds. overall satisfaction

First, it is worth highlighting that a CUBREMOT tree with a pre-specified shelter fixed at all partitioning levels would have entailed sub-optimal descriptions of the response profiles. Once diagnostic checks have confirmed that the baseline CUB can be maintained as the assumed model, the flexible model selection that is performed locally provides, instead, more accurate descriptions of the response patterns (for instance, shelter effects are found at different levels of the scale for different response profiles). In particular, it follows that:

  • Ph. Doctors with studies in disciplines different from STEM experience lower feeling than Ph. Doctors with studies in STEM disciplines, especially if the current job does not involve research. For Ph. Doctors in disciplines different from STEM, feeling of evaluation decreases if the respondent had any work or training experience abroad after the Ph.D. completion;

  • Among the Ph. Doctors working in research, women are less satisfied than men, especially if the Ph.D. program concerned disciplines different from STEM and if no experience abroad has occurred after the Ph.D. completion. In the latter case, the lower feeling towards satisfaction experienced by women is revealed also in terms of the location of the shelter effect, which is found at a lower category than for men (even if in both cases concentration is found at the center of the scale). Thus, even for neutral evaluations, women tend to express their assessments with lower scores.

7 Conclusions and Further Developments

Statistical modelling of preference and evaluation data that accounts for uncertainty can be embedded in the 7th dimension of Information Quality (Kenett & Shmueli, 2014; Kenett, 2016), a paradigm for analytic research that guides scholars and stakeholders towards a qualified experience of data analysis to successfully pursue research goals and decision-making processes. Being framed within this modern debate, the paper proposes a method to perform local diagnostics of model-based trees for evaluation data, assuming the setting of mixture models with uncertainty. The discussion stems from the introduction of binomial regression trees, following the rationale of the CUBREMOT technique (Cappelli et al., 2019; Simone et al., 2019). With respect to the consolidated approach to model-based trees, which assumes a constant model specification at all partitioning levels, the local model selection for the best extension of the binomial model at tree nodes allows for a more accurate assessment of feeling. The adoption of this flexible approach is subject to the fulfilment of a necessary condition for correct specification that can be assessed through diagnostics of surrogate residuals for ordinal data models. Then, a pruning criterion could be derived for those splits where evidence for model misspecification is found. In a similar way, diagnostic checks for model-based trees could be exploited to tune the depth of the tree, or to select the best tree out of a set of alternative ones. The thread of the discussion has been the binomial model for rating data, but the proposal may be adapted to other preference models on the discrete support (as shown in Section 6 for the case of CUBREMOT).
In the end, the flexible binomial tree with uncertainty could possibly be exploited to obtain: a more precise imputation of missing values or prediction rules on the basis of the derived response profiles and their characterizing model parameters; an adjusted synthetic indicator of the latent feeling of the trait being examined. Further research will address the possibility of incorporating the proposed diagnostic methods within the partitioning process to drive the selection of the split and provide more general flexible uncertainty trees: see Banchelli (2019) for a recent proposal of flexible trees on the basis of non-nested models for count data. With respect to residual diagnostics, further methodological research will consider: a comparative perspective with alternative tests and techniques to detect model misspecification; adjustments of the splitting rule on the basis of stability tests of regression parameters, as for the model-based trees introduced in Zeileis et al. (2008); a sensitivity analysis to determine the most suited synthetic indicator of the distribution of the p-values for the uniformity tests for residuals. Finally, the proposal could be exploited for prediction purposes with the implementation of model-based random forests for rating variables. In this setting, specific attention should be devoted to the derivation of variable importance measures due to possible indirect effects of splitting variables (Gottard et al., 2020), which could also affect the interpretation of response profiles read from a single tree.