1 Introduction

Testing whether a given model differs across groups of observations is relevant in many contexts. Group membership is usually coded in the dataset through one or more stratification variables, whose levels denote the different classes, introducing a heterogeneity that must be considered in analyzing and modeling data. For example, in education, the impact of students’ socioeconomic characteristics on their performance and learning achievement may differ by geographic origin (Hansen and Gustafsson 2016) or gender (Baye and Monseur 2016).

The classical approach estimates a single model on the whole dataset, introducing dummy variables to distinguish the different group effects (Gujarati 1970). Following this approach, an F-test is used to assess whether groups are significantly different. The Chow test (Chow 1960) and the Lebart test (Lebart et al. 1979) are two commonly used solutions. Both F-tests compare the restricted deviance with the unrestricted deviance, the former related to a single model estimated on the whole sample, the latter to a separate model for each group of observations associated with the source of heterogeneity. However, this approach has two main drawbacks. Firstly, these tests implicitly assume normality, independence, and homoscedasticity among groups. Secondly, the specific impact of each regressor is not immediately apparent, especially in multiple regression models, unless interaction terms are included in the model, and such terms are not always straightforward to interpret.

A different approach, named multilevel modeling (Gelman 2006; Raudenbush and Bryk 2002; Snijders and Bosker 2011), estimates an additional coefficient, the between-cluster variability, to capture the hierarchical structure of the data, but it is better suited to the case of a large number of groups induced by the stratification variable(s).

An alternative to the analysis of group effects in statistical models has been recently introduced in the context of composite-based path modeling (Vinzi et al. 2013; Hair et al. 2016; Wold 1985). This approach exploits a multigroup perspective, separating data into segments according to the levels of the stratification variable(s), and estimating a separate model for each segment. Resampling and permutation methods are then used to test differences among the separate models (Hair et al. 2018). Since composite-based path modeling essentially consists of simple and multiple regressions aimed at estimating latent variables, the multigroup approach has also been used in the framework of ordinary least squares (OLS).

This paper extends the multigroup approach to quantile regression (QR), a class of models aiming to assess the effects of a set of explicative variables at different locations of the conditional distribution of a response variable. QR (Koenker and Bassett 1978; Davino et al. 2013; Furno and Vistocco 2018) provides a different model for each conditional quantile of interest, without introducing any parametric assumption on the response. The use of the multigroup approach in QR is promising, since it offers simple and interpretable tools to assess if and how heterogeneity acts on different parts of the response distribution. Nevertheless, the problems arising from the use of separate models for each group of observations are amplified in the QR framework, since comparisons must be carried out both for models related to different groups at a given conditional quantile, and for models related to different conditional quantiles for a given group.

This paper focuses on two tests popular in composite-based path modeling, the parametric t-test (Keil et al. 2000) and the permutation test (Chin and Dibbern 2010), adapting and comparing them for the case of quantile regression. The two tests are detailed in Sect. 3, once the basic notation and the essential toolkit have been introduced in Sect. 2. Section 4 validates the proposed multigroup approach through a study on artificial data whose design takes into account possible effects of different sample sizes, as well as the performance of the two tests in detecting low, medium, and high differences among coefficients pertaining to different groups. The practical implications and the relevance of the approach are shown on real data in Sect. 5 through an empirical analysis of MOOC students’ performance, one of the major challenges in learning analytics (Siemens and Long 2011). The empirical application makes it possible to evaluate if and how the effect of learning and engagement, the two main drivers of students’ performance (Carannante et al. 2020; de Barba et al. 2016; Moore and Wang 2021), changes according to the way the courses are offered, namely distinguishing self-paced courses and instructor-paced courses (Fianu et al. 2018; Goopio and Cheung 2020). Finally, a discussion of the main results and the conclusions, with some further research developments to be explored, are included in Sect. 6.

2 The basic notation and the essential toolkit

The multigroup approach consists in testing statistical differences among coefficients of a given model estimated on different groups of observations. The different groups are associated with the levels of one or more stratification variables, making it possible to take into account heterogeneity corresponding to an a priori partition of the whole sample. The approach has been widely used for comparing OLS regression models, especially in the case of composite-based path modeling (Eslami et al. 2013; Hair et al. 2018). This section introduces the basic notation for extending the multigroup approach to compare QR models.

Let \({\textbf {y}}\) denote a response variable observed on \(i = 1, \ldots , n\) observations. Let \({\textbf {X}}\) be the matrix storing the explicative variables. Data are row-partitioned in G groups according to the levels of one or more stratification variables:

$$\begin{aligned} \textbf{y}^T&= \left[ {\textbf {y}}_1, \ldots , {\textbf {y}}_I, \ldots , {\textbf {y}}_J, \ldots , {\textbf {y}}_G \right] \\ \textbf{X}^T&= \left[ {\textbf {X}}_1, \ldots , {\textbf {X}}_I, \ldots , {\textbf {X}}_J, \ldots , {\textbf {X}}_G \right] \end{aligned}$$

where the subscripts I and J refer to two generic groups. Finally, let \(n_g\) be the cardinality of the generic group g, with \(n=\sum _{g=1}^G n_g\). The multigroup approach makes it possible to evaluate whether the impact of the explicative variables differs across groups through a comparison of the coefficients estimated for each group. The number of possible pairwise comparisons clearly depends on the number G of groups, and on the number of coefficients included in the model, i.e. the number of explicative variables.
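To make the count of comparisons concrete, the following minimal Python sketch computes the number of between-group pairwise comparisons. The function name and the optional `n_quantiles` argument are hypothetical and only for illustration; the latter anticipates that, in QR, the same comparisons are repeated at each conditional quantile of interest.

```python
from math import comb


def n_pairwise_comparisons(n_groups, n_coefficients, n_quantiles=1):
    """Between-group pairwise comparisons of coefficients.

    Each pair of groups is compared on every coefficient; with QR the
    comparisons are repeated at each quantile (n_quantiles is an
    illustrative argument, not notation taken from the paper).
    """
    return comb(n_groups, 2) * n_coefficients * n_quantiles


print(n_pairwise_comparisons(2, 1))     # two groups, one regressor
print(n_pairwise_comparisons(3, 2))     # three groups, two regressors
print(n_pairwise_comparisons(3, 2, 5))  # same design at five quantiles
```

With two groups and one regressor there is a single comparison, while three groups, two regressors and five quantiles already require thirty.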

This paper introduces the multigroup approach to compare coefficients of QR models. QR extends classical regression to a set of quantile functions of a response variable \(\textbf{y}\), conditional on a set of covariates \(\textbf{X}\). QR, originally proposed by Koenker and Bassett (1978), is a distribution-free regression approach, since it does not pose any parametric assumption on the response distribution. It aims to estimate the effects of a set of regressors on the quantiles of a response variable. In particular, QR estimates separate models for different \(\theta \in (0,1)\), where \(\theta \) denotes the conditional quantile of interest. Unlike the classical regression model, where the conditional mean of the error is zero, \(E(\epsilon \mid X) = 0\), in QR the \(\theta \)-quantile of the error term is 0, namely \(Q_{\theta }(\epsilon (\theta ) \mid ~ X) = 0\) where \(Q_{\theta }(.\mid .\,)\) is the conditional quantile function. The separate models provided by QR, one for each quantile of interest, are interpretable in terms of regression models for the associated conditional quantiles of the response. The QR model for a given conditional quantile \(\theta \) can be formulated as follows:

$$\begin{aligned} Q_{\theta } (\hat{{\textbf {y}}}\mid X) = {\textbf {X}}\hat{\beta }(\theta ). \end{aligned}$$
(1)

The conditional quantile estimator minimizes the sum of absolute deviations, asymmetrically weighting positive and negative residuals. The bootstrap procedure is typically used for inference (Koenker et al. 2017), so as to avoid the assumptions required by asymptotic theory. Bootstrap offers the flexibility to obtain standard errors and confidence intervals for any estimates and combinations of estimates, keeping the distribution-free nature of QR. The reader interested in the machinery of QR can refer to Koenker and Bassett (1978), Davino et al. (2013), Furno and Vistocco (2018) for details.
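As an illustration of the asymmetric weighting, a minimal Python sketch follows. It is not the estimation routine used in the paper: the function names are hypothetical, and the fit relies on the known property that, with an intercept and one regressor, an optimal QR line passes through at least two observations, so scanning all pairs of points is exact but practical only for small samples.

```python
from itertools import combinations


def check_loss(u, theta):
    """Asymmetric absolute loss: theta*u if u >= 0, (theta - 1)*u otherwise."""
    return u * (theta - (1.0 if u < 0 else 0.0))


def fit_simple_qr(x, y, theta):
    """Intercept and slope of a simple QR fit at quantile theta.

    Illustrative brute-force search (O(n^3)): every line through two
    observations with distinct x is a candidate, and the candidate with
    the smallest total check loss is returned.
    """
    best = None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue  # a vertical line is not a valid candidate
        slope = (y[j] - y[i]) / (x[j] - x[i])
        intercept = y[i] - slope * x[i]
        loss = sum(check_loss(yk - intercept - slope * xk, theta)
                   for xk, yk in zip(x, y))
        if best is None or loss < best[0]:
            best = (loss, intercept, slope)
    if best is None:
        raise ValueError("need at least two distinct x values")
    return best[1], best[2]


# Toy data: four points on y = 1 + 2x and one large outlier.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [1.0, 3.0, 5.0, 7.0, 100.0]
print(fit_simple_qr(x, y, theta=0.5))
```

On the toy data the median fit stays on the line through the first four points, which illustrates why absolute deviations are less sensitive to outlying responses than squared deviations.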

Since QR provides separate models for each conditional quantile, the number of possible pairwise comparisons in the multigroup approach is amplified. The multigroup approach tests the null hypothesis \(H_0: \beta _I(\theta )= \beta _J(\theta )\) versus the presence of a significant difference, \(H_1: \beta _I(\theta ) \ne \beta _J(\theta )\), where \(\beta _I(\theta )\) and \(\beta _J(\theta )\) are the coefficients of a generic explicative variable related to group I and group J, respectively. As pointed out above, the parametric t-test and the permutation test are the two most widely used tests in such a framework. The parametric t-test (Keil et al. 2000) exploits a bootstrap resampling procedure to compare coefficients. In particular, the bootstrap samples are used to approximate the sampling distributions of the coefficient estimators for each segment, providing a reasonable solution if the sample size is sufficiently large. Instead, the permutation test (Chin and Dibbern 2010) evaluates differences between the coefficients of different segments through a permutation procedure: data of the two groups are permuted, preserving the initial group sizes, to obtain the permutation distribution of the difference between groups. The difference observed on the two samples is then compared with such permutation distribution to test the null hypothesis of no significant differences between groups. Details on the two tests adapted to the QR context are offered in Sect. 3.

3 The multigroup approach for quantile regression

Starting from the QR model (1), the aim is to assess if units corresponding to two different groups I and J share the same dependence structure, namely if the impact of the regressors on the response is different with respect to different segments of units. Moreover, since the QR model provides estimates at different conditional quantiles, the comparison can also be carried out for different locations. Starting from two QR models estimated for group I and J, respectively, the multigroup approach tests if the observed difference between coefficients \(\hat{\beta }_I (\theta )\) and \(\hat{\beta }_J (\theta )\) is significant for a given conditional quantile \(\theta \). The two following subsections extend to QR the two main tests proposed in the multigroup literature. To simplify the notation, in the following we refer to a generic \(\beta (\theta )\) coefficient for a given conditional quantile. Clearly, the tests can be carried out for any coefficient \(\beta _p (\theta )\) in the case of multiple regression, as shown in the application in Sect. 5. The null and alternative hypotheses can be defined as follows for both tests:

$$\begin{aligned}&H_0: \beta _I(\theta ) = \beta _J(\theta ) \\&H_1: \beta _I(\theta ) \ne \beta _J(\theta ). \end{aligned}$$

Furthermore, following the QR logic, comparisons for different conditional quantiles are carried out separately for each quantile \(\theta \) of interest.

3.1 The parametric test

The parametric t-test (Keil et al. 2000) combines a classical test statistic for comparing the means of two groups with a bootstrap procedure used to estimate its standard errors. Bootstrap (Efron and Tibshirani 1998) does not pose distributional assumptions but entails a higher computational cost than the standard parametric procedure. Computational costs can be reduced using recent, more efficient bootstrapping methods (Kleiner et al. 2014; Sengupta et al. 2016). Bootstrap estimates are unbiased, even if they introduce additional sources of variability in the process, i.e. sample variability, being based on one single sample from a given population, and resampling variability, exploiting a finite number of replications (Davino et al. 2013).

For our purposes, the resampling procedure is simultaneously applied to the vector of the response variable and to the matrix of regressors, resampling with replacement B times, separately for the two groups I and J, holding fixed the cardinalities \(n_I\) and \(n_J\) of the original groups. This paired resampling preserves the dependence structure among the variables.

The QR model (1) is estimated separately for group I and J, for each bootstrap sample and for each conditional quantile of interest. Therefore, a vector \(\varvec{\hat{\beta }}_{boot}(\theta ) = [\hat{\beta }^1(\theta ), \hat{\beta }^2(\theta ), \dots , \hat{\beta }^B(\theta )]\) is obtained for each coefficient and each quantile of interest. Such vectors are used to estimate the standard errors of the coefficients \(\textrm{SE}_{\hat{\beta }_{boot}}(\theta )\), exploited in the usual t-test statistic for comparing the means of the two groups:

$$\begin{aligned} t(\theta )=\frac{\hat{\beta }_I(\theta ) - \hat{\beta }_J(\theta )}{ \sqrt{\frac{(n_{I}-1)}{n_{I}}\textrm{SE}_{\hat{\beta }_{I_{boot}}}^{2}(\theta ) + \frac{(n_{J}-1)}{n_{J}}\textrm{SE}_{\hat{\beta }_{J_{boot}}}^{2}(\theta )}}. \end{aligned}$$

The statistic is asymptotically t-distributed and the degrees of freedom (df) are determined by means of the Welch–Satterthwaite equation. The formula is derived following Sarstedt et al. (2011):

$$\begin{aligned} df=\Bigg \Vert \frac{\Big (\frac{(n_{I}-1)}{n_{I}}\textrm{SE}_{\hat{\beta }_{I_{boot}}}^{2}(\theta ) + \frac{(n_{J}-1)}{n_{J}}\textrm{SE}_{\hat{\beta }_{J_{boot}}}^{2}(\theta )\Big )^2}{\frac{(n_{I}-1)}{n_{I}^2}\textrm{SE}_{\hat{\beta }_{I_{boot}}}^{4}(\theta ) + \frac{(n_{J}-1)}{n_{J}^2}\textrm{SE}_{\hat{\beta }_{J_{boot}}}^{4}(\theta )} - 2 \Bigg \Vert . \end{aligned}$$
(2)
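The test statistic and the degrees of freedom in (2) can be evaluated as in the following minimal Python sketch. The function name and the numerical inputs are hypothetical, and reading the double bars in (2) as rounding to the nearest integer is an assumption made here.

```python
def multigroup_t(beta_i, beta_j, se_i, se_j, n_i, n_j):
    """t statistic and degrees of freedom for comparing two coefficients.

    beta_* are the coefficients estimated on the two groups, se_* their
    bootstrap standard errors, n_* the group sizes. The double bars of
    Eq. (2) are interpreted as rounding (an assumption of this sketch).
    """
    w_i = (n_i - 1) / n_i * se_i ** 2
    w_j = (n_j - 1) / n_j * se_j ** 2
    t = (beta_i - beta_j) / (w_i + w_j) ** 0.5
    denom = ((n_i - 1) / n_i ** 2 * se_i ** 4
             + (n_j - 1) / n_j ** 2 * se_j ** 4)
    df = round((w_i + w_j) ** 2 / denom - 2)
    return t, df


# Hypothetical inputs, chosen only to show the arithmetic.
t, df = multigroup_t(beta_i=0.9, beta_j=0.5, se_i=0.1, se_j=0.1,
                     n_i=100, n_j=100)
print(t, df)
```

A two-sided p-value would then be read from a t distribution with `df` degrees of freedom, for instance through a statistical library, which is left outside this sketch.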

Algorithm 1 outlines the steps for carrying out the test.
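The resampling part of the procedure (paired bootstrap within a group, re-estimation of the QR model, standard error of the coefficient) can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the authors' implementation: it handles one regressor only, the helper `fit_slope` is a hypothetical brute-force QR fit suitable for small samples, and all names and default values are chosen for the example.

```python
import random
import statistics
from itertools import combinations


def fit_slope(x, y, theta):
    """Slope of a simple QR fit at quantile theta (small samples only).

    Scans the lines through every pair of observations with distinct x
    and keeps the one with the smallest asymmetric absolute loss.
    """
    best_loss, best_slope = None, None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        loss = sum((yk - a - b * xk) * (theta - (yk - a - b * xk < 0))
                   for xk, yk in zip(x, y))
        if best_loss is None or loss < best_loss:
            best_loss, best_slope = loss, b
    if best_slope is None:
        raise ValueError("need at least two distinct x values")
    return best_slope


def bootstrap_se(x, y, theta, n_boot=200, seed=0):
    """Bootstrap standard error of the QR slope for one group.

    Rows (x_k, y_k) are resampled together, with replacement, keeping
    the group size fixed, so the dependence between response and
    regressor is preserved.
    """
    rng = random.Random(seed)
    n = len(x)
    slopes = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # paired resampling
        slopes.append(fit_slope([x[k] for k in idx],
                                [y[k] for k in idx], theta))
    return statistics.stdev(slopes)
```

Applying `bootstrap_se` separately to groups I and J gives the two standard errors that enter the t statistic and the degrees of freedom defined above.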


3.2 The permutation test

The permutation test (Chin and Dibbern 2010) evaluates the differences between the coefficients of two groups exploiting a permutation procedure. The original difference \({DIF} = \left\| \hat{\beta }_I(\theta ) - \hat{\beta }_J(\theta ) \right\| \) between the coefficients estimated on the two groups I and J at a given conditional quantile \(\theta \) is compared with the permutation distribution of the differences, PDIF, computed on P permuted samples. The aim is to appraise how extreme DIF is under the null hypothesis. The two samples \([{\textbf {y}}_I \mid {\textbf {X}}_I]\) and \([{\textbf {y}}_J \mid {\textbf {X}}_J]\), corresponding to the two groups, are merged. Then, P pairs of samples of size \(n_I\) and \(n_J\) are obtained by permuting the data. Under \(H_0\) the units are assumed to be exchangeable, so they can be rearranged without substantially altering the process. This is done by randomly assigning the units to one group or the other, preserving the cardinalities of the two initial segments. Fixing a quantile of interest, model (1) is estimated on the two new groups of units after each permutation, and the permutation difference is computed: \({PDIF}^{(i)} = \left\| \hat{\beta }_{I_{perm}}(\theta )- \hat{\beta }_{J_{perm}} (\theta ) \right\| \) (for \(i=1, \ldots , P\)). Finally, the vector \(\textbf{PDIF}\) of permuted differences is compared with the original difference, computing the p-value as 1 minus the proportion of times in which the original difference is larger than the permuted one. The steps of the permutation test procedure are summarized in Algorithm 2.
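The permutation procedure described above can be sketched in Python as follows. This is an illustrative sketch rather than the authors' Algorithm 2: it considers one regressor only, the helper `fit_slope` is a hypothetical brute-force QR fit suitable for small samples, and the function names and defaults are assumptions of the example.

```python
import random
from itertools import combinations


def fit_slope(x, y, theta):
    """Slope of a simple QR fit at quantile theta (small samples only)."""
    best_loss, best_slope = None, None
    for i, j in combinations(range(len(x)), 2):
        if x[i] == x[j]:
            continue
        b = (y[j] - y[i]) / (x[j] - x[i])
        a = y[i] - b * x[i]
        loss = sum((yk - a - b * xk) * (theta - (yk - a - b * xk < 0))
                   for xk, yk in zip(x, y))
        if best_loss is None or loss < best_loss:
            best_loss, best_slope = loss, b
    if best_slope is None:
        raise ValueError("need at least two distinct x values")
    return best_slope


def permutation_test(xi, yi, xj, yj, theta, n_perm=100, seed=0):
    """Return (DIF, p-value) for the slope difference between two groups.

    xi, yi, xj, yj are lists. Units are pooled and randomly reassigned
    to two groups of the original sizes; the p-value is 1 minus the
    share of permutations in which DIF is larger than PDIF.
    """
    rng = random.Random(seed)
    dif = abs(fit_slope(xi, yi, theta) - fit_slope(xj, yj, theta))
    pooled = list(zip(xi + xj, yi + yj))  # keep each (x, y) row together
    n_i = len(xi)
    larger = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        gi, gj = pooled[:n_i], pooled[n_i:]
        pdif = abs(fit_slope(*zip(*gi), theta)
                   - fit_slope(*zip(*gj), theta))
        larger += dif > pdif
    return dif, 1 - larger / n_perm
```

When the two groups contain identical data, DIF is zero and the p-value equals 1, which gives a quick sanity check of the implementation.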


Following Hair et al. (2012), to guarantee the stability of the results it is advisable to keep the number of permutations large: 500 or 1000 permutations are commonly employed thresholds (Kherad-Pajouh and Renaud 2010). However, depending also on the number of comparisons, a larger number of permutations increases the computation time.

4 Simulation study

This section presents a simulation study aimed at showing the proposed quantile multigroup approach in action, and at comparing the parametric and the permutation test by varying the sample size and evaluating the ability to capture group differences of different magnitude. We focus here on the case of one regressor and two groups, I and J. This setting represents a first step in investigating the behavior of the two tests in the context of quantile regression. For the considered simple regression model, the focus is on possible effects of the quantile of interest, the sample size, and the separability of the groups. The introduction of additional variables would require an extremely complicated simulation design, considering both the case of independent and correlated predictors. Indeed, the problem of multicollinearity must be considered also for QR (Davino et al. 2022). Therefore, a study that considers multiple correlated regressors is postponed to future work. Besides, a multiple regression model with uncorrelated regressors would add little to a first investigation of how the two tests respond to sample size, quantile of interest, and difference between coefficients. Furthermore, in order to study the asymptotic distributions in the case of multiple QR, we plan to compare classical QR with composite QR (Zou and Yuan 2008).

The data generating process exploits a uniform distribution \(U(a~=~0,~b~=~4)\) for the regressor, generated independently for the two segments. Error terms, \(\epsilon _I\) and \(\epsilon _J\), were generated from a standard normal distribution, a skew normal distribution with shape parameter \(\alpha \) equal to 4, and a uniform distribution \(U(a~=~0,~b~=~4)\), assuming homoscedasticity within the two subgroups. This allowed us to consider three different scenarios of normal, asymmetrical, and non-normal residuals. Finally, since it is well known that QR is very useful in the case of heteroscedasticity, the error component was introduced in the model through a multiplicative term with the regressor. Different scenarios have been considered in order to assess the effect of the sample size, the degree of overlapping between the groups (namely, the magnitude of the difference between the coefficients of the two segments), and the quantile of interest. More specifically, the design of the simulation study considers the following factors and levels:

Sample size: The effect played by the cardinality of each group is explored by hypothesizing both equal group sizes {50, 100, 250, and 500}, and unbalanced segments {75 vs 25, 150 vs 50, 375 vs 125, and 750 vs 250}.

Difference between coefficients: Following Lamberti et al. (2016a, b), this study considers the case of no difference between the coefficients estimated in groups I and J, and the cases of small, medium, or large differences. Table 1 shows these four levels specifying the size of the differences (second column) and the value of the coefficients in the two groups (last two columns) used to simulate data.

Quantile of interest: The sensitivity of the tests is evaluated for the QR model estimated at \(\theta = \left\{ 0.1, 0.25, 0.50, 0.75, 0.9\right\} \).

Error term distribution: Error terms were generated exploiting a normal standardized distribution, a skew normal distribution with shape parameter \(\alpha \) equal to 4, and a uniform distribution \(U(a~=~0,~b~=~4)\).
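One possible reading of the data generating process is sketched below in Python. Several elements are assumptions made only for illustration: the functional form \(y = \beta _0 + \beta _1 x + x\,\epsilon \) for the multiplicative error, the function and argument names, and the coefficient values in the usage lines, which are not those of Table 1. The skew normal draws use the standard construction based on two independent standard normal variables.

```python
import math
import random


def skew_normal(rng, alpha):
    """One draw from a skew normal with shape alpha (location 0, scale 1)."""
    delta = alpha / math.sqrt(1 + alpha ** 2)
    u0, u1 = rng.gauss(0, 1), rng.gauss(0, 1)
    return delta * abs(u0) + math.sqrt(1 - delta ** 2) * u1


def simulate_group(n, beta0, beta1, error="normal", seed=0):
    """Simulate one group: x ~ U(0, 4) and a heteroscedastic response.

    ASSUMPTION: the error enters as x * eps; the text only states that
    the error is introduced through a multiplicative term with x.
    """
    rng = random.Random(seed)
    draw = {
        "normal": lambda: rng.gauss(0, 1),
        "skew": lambda: skew_normal(rng, 4),
        "uniform": lambda: rng.uniform(0, 4),
    }[error]
    x = [rng.uniform(0, 4) for _ in range(n)]
    y = [beta0 + beta1 * xk + xk * draw() for xk in x]
    return x, y


# Hypothetical coefficients for two groups of unbalanced sizes.
x_i, y_i = simulate_group(75, beta0=1.0, beta1=2.0, error="skew", seed=1)
x_j, y_j = simulate_group(25, beta0=1.0, beta1=2.5, error="skew", seed=2)
```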

Table 1 Different profiles according to the coefficient differences between group I and J

Considering all possible combinations of sample sizes (balanced and unbalanced), differences between coefficients, quantiles of interest, and error term distributions results in 480 scenarios (2 \(\times \) 4 \(\times \) 4 \(\times \) 5 \(\times \) 3). Concerning the number of permutations, we employed 100 permutations to control the computation time.
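The count of scenarios can be checked by enumerating the full factorial design, as in the following short Python sketch. The labels used for the levels are illustrative; only the number of levels per factor is taken from the text.

```python
from itertools import product

# (design, size level): 2 designs x 4 size levels = 8 combinations
sizes = ([("balanced", s) for s in (50, 100, 250, 500)]
         + [("unbalanced", s) for s in ("75 vs 25", "150 vs 50",
                                        "375 vs 125", "750 vs 250")])
differences = ["equal", "small", "medium", "large"]
quantiles = [0.1, 0.25, 0.5, 0.75, 0.9]
errors = ["normal", "skew normal", "uniform"]

scenarios = list(product(sizes, differences, quantiles, errors))
print(len(scenarios))  # 8 * 4 * 5 * 3 = 480
```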

In order to show the different degrees of separability between the segments, Fig. 1 provides the graphical visualizations of the models estimated for both groups (I and J) for the case of two balanced samples of 500 observations and normal errors. Each panel refers to a given coefficient difference (equal, small, medium, large, top-left by row). Different colors/grey levels distinguish the two segments, while different lines are used for each of the five considered QR models. Analyzing the graphs starting from the top-left, it is evident that the effect of the regressor on both the mean and the conditional quantiles of y is the same in the two segments in the case of equal coefficients, and becomes increasingly different moving towards large differences. Therefore, it becomes important to extend the comparison between segments to parts other than the conditional mean of the dependent variable.

Fig. 1 Two artificial datasets of 500 observations, with normal errors. Different panels refer to the different magnitude of the difference between coefficients: equal, small, medium, and large, top-left, clockwise. The different lines correspond to the five QR models (\(\theta \) = 0.1, 0.25, 0.5, 0.75, and 0.9)

The performance of the parametric and permutation tests is evaluated through the p-values obtained on 100 replications for each experimental condition.
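For a single experimental condition, the p-values collected over the replications can be summarized as in the following Python sketch, which returns the quantities displayed by a boxplot. The function name and the input values are hypothetical, and the share of p-values below the significance level is an additional summary introduced here for illustration, not a quantity reported in the paper.

```python
import statistics


def summarize_pvalues(pvalues, alpha=0.05):
    """Five-number summary of the p-values and share below alpha."""
    q1, median, q3 = statistics.quantiles(pvalues, n=4)
    return {
        "min": min(pvalues),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(pvalues),
        "share_below_alpha": sum(p < alpha for p in pvalues) / len(pvalues),
    }


# Hypothetical p-values from five replications of one scenario.
print(summarize_pvalues([0.01, 0.02, 0.03, 0.04, 0.20]))
```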

Figures 2, 3, and 4 show the results obtained using the two tests with unbalanced samples, and considering the different quantiles of interest for the three different scenarios (normal, skew normal, and uniform errors, respectively). The results with balanced samples are very similar and therefore not shown. Each panel reports the boxplots of the p-values (vertical axis) for both the parametric test (first four columns) and the permutation test (last four columns), in case of small, medium, and large differences between coefficients (horizontal axis). The case of no differences between the coefficients of the two groups is reported in the Appendix. The rows of each graph refer to the different quantiles of interest (0.1, 0.25, 0.5, 0.75, and 0.9, from top to bottom), the columns to the different sizes considered for the unbalanced groups. A dotted horizontal line is drawn at the conventional significance level \(\alpha \) = 0.05. In all the cases, the parametric test provides a better performance than the permutation test. More generally, the effect of sample size is evident, since the tests detect medium and large differences for smaller samples (100 observations). When sample size increases, the sensitivity of the tests improves, even in the presence of small differences between the coefficients. Furthermore, the two tests show a good performance even in the case of extreme quantiles. Results do not substantially differ in case of normal errors (Fig. 2), skew normal errors (Fig. 3), and uniform errors (Fig. 4). Finally, results suggest that at least one of the two samples must have an adequate size to identify small differences as well.

Fig. 2 Boxplots of p-values obtained on 100 replications of the parametric and permutation test. Horizontal panels refer to quantiles (0.1, 0.25, 0.5, 0.75, 0.9), vertical panels to sample sizes (75 vs 25, 150 vs 50, 375 vs 125, and 750 vs 250). Each boxplot describes a different degree of overlapping among groups (small, medium, and large difference between coefficients). Simulations consider standard normal errors and unbalanced segments

Fig. 3 Boxplots of p-values obtained on 100 replications of the parametric and permutation test. Horizontal panels refer to quantiles (0.1, 0.25, 0.5, 0.75, 0.9), vertical panels to sample sizes (75 vs 25, 150 vs 50, 375 vs 125, and 750 vs 250). Each boxplot describes a different degree of overlapping among groups (small, medium, and large difference between coefficients). Simulations consider skew normal errors and unbalanced segments

Fig. 4 Boxplots of p-values obtained on 100 replications of the parametric and permutation test. Horizontal panels refer to quantiles (0.1, 0.25, 0.5, 0.75, 0.9), vertical panels to sample sizes (75 vs 25, 150 vs 50, 375 vs 125, and 750 vs 250). Each boxplot describes a different degree of overlapping among groups (small, medium, and large difference between coefficients). Simulations consider non-normal (uniform) errors and unbalanced segments

5 A real data analysis

The proposed multigroup approach in QR is shown in action through a case study based on real data. The aim is to model students’ performance in a particular type of course, the Massive Open Online Courses, also known as MOOCs. The data and model have recently been published by Carannante et al. (2020).

MOOCs are an increasingly common type of course in education, especially in higher education. The structure and delivery of such courses have a strong impact on the way students attend MOOCs and, inevitably, on their final performance. In the learning analytics framework (Siemens and Long 2011), predicting students’ performance in MOOCs can be considered one of the main challenges. Indeed, there are several elements affecting students’ performance. Some are specifically related to the learning experience (student motivation, learning attitude, engagement), others can be defined as external, being related to personal characteristics of the student or to the specific features of the course.

In this study, we considered two main drivers of students’ performance related to the learning experience: the learning attitude and the students’ involvement in the planned MOOC activities (engagement), and one external factor describing the course type. In particular, the aim is to analyse if and how much the effects of learning and engagement on students’ performance vary according to the course type. To this end, QR allows us to investigate whether the effect of this relationship varies for low, medium, or high performing students, and the proposed multigroup approach allows us to assess whether these effects vary by course type.

5.1 Data and measurements

Data refer to 3578 students who attended two courses in Political Science on the FedericaX platform, the EdX MOOC platform of the “Federica WebLearning” Center at University of Naples Federico II. Each course was offered in two versions: an instructor-paced version and a self-paced version. The instructor-paced course is strictly scheduled, with specific dates for assignments, course materials, exams, and a deadline for learners to complete the course and get a certification. Usually, this modality is integrated into an on-site course delivered in blended mode. Instead, the self-paced version provides all course materials as soon as the course starts, assignments and exams do not have due dates, and therefore learners can progress through the course at their own pace and obtain a passing grade even without completing all of the course materials. Of the 3578 students, 73.1\(\%\) followed the self-paced modality, and 26.9\(\%\) the instructor-paced modality.

The considered model uses performance as the response variable, and learning and engagement as explicative variables. This is in line with the model proposed in Carannante et al. (2020). Performance was measured as the proportion of correct answers to a set of questions. Learning was quantified by considering the quantity of actions undertaken to acquire knowledge. In particular, we exploit three dimensions for its measurement: frequency-based actions (count of activities spent studying), time-based actions (duration of time spent studying) and interactions (discussion on forums and social learning aspects). Engagement was analysed through two sub-dimensions: regularity (how a learner spends her/his time on the platform and how she/he organizes the learning road map), and no-procrastination (the ability of the learner in organizing the learning processes). For more details, consult Carannante et al. (2020), de Barba et al. (2016), Moore and Wang (2021).

Figure 5 shows the presence of a severe left-hand skewness both for the response and the regressors. In particular, the strong asymmetry of the response could lead to problems in using OLS, keeping in mind the classical assumptions of such a model. A comparison of the response distribution between the two groups of students by course type is offered in Fig. 6 through violin plots (Hintze and Nelson 1998). They combine a boxplot and a density plot, and are obtained by rotating two density plots and placing them symmetrically on each side. The length of the horizontal axis shows the range of the observed values, while the shape highlights how values are distributed in terms of variability and skewness. In particular, Fig. 6 reveals differences especially in the right tail of the distribution, showing a larger concentration of higher performing students in the group of instructor-paced courses.

Fig. 5 Histogram and density plot for students’ performance (left), engagement (middle) and learning (right)

Fig. 6 Students’ performance by course type

5.2 Main results

The effect of engagement and learning on students’ performance is explored by comparing the results of classical regression and quantile regression both on the whole sample and on the two subgroups defined by the course type. Results are reported in Table 2 and graphically summarized in Fig. 7. The first key to interpreting such results lies in the comparison of OLS and QR coefficients estimated on the whole sample (third column of Table 2). Then, it is important to compare results obtained on students attending instructor-paced courses (fourth column) with those obtained on students attending self-paced courses (fifth column). In all three cases above, it is important to assess the sign and size of the coefficients but also their significance. The multigroup analysis then highlights any differences between the groups, for each regressor and for each model (the sixth and seventh columns of Table 2 report p-values for the parametric test and the permutation test, respectively).

The comparison of OLS and QR results on the whole sample shows different effects of engagement and learning on students’ performance, although always with a positive sign. Considering the three conditional quartiles, this effect is increasing for learning (0.245, 0.420, 0.886, respectively), and particularly differentiated in the tails of the distribution when compared to the effect on the conditional mean, which is equal to 0.734. Regarding engagement, we have an inverse trend. The effect is decreasing (0.194, 0.183, 0.082), and not very dissimilar from the conditional mean, which is equal to 0.169. All coefficients are significant except the engagement coefficient estimated at the quantile 0.75. From a practical perspective, this means that engagement plays an important role for the lower performing students, at least more important than it does for the high performing students.

To analyse possible differences regarding course type, the same model is estimated separately on the two subgroups of students. The learning component always has a greater effect on performance than engagement, even if the impact is stronger among students who have chosen an instructor-paced course, and who perform better. Most of the coefficients are significant, especially at the considered extreme quantiles, which are very important for the practical use of the study. Coefficients are compared by using both the parametric and the permutation test, the latter employing 500 permutations. The p-values reported in the last two columns of Table 2 highlight how QR can usefully complement OLS results. As an example, considering the effect of learning, OLS coefficients are affected by the high asymmetry of performance and hide a significant difference between the two groups, which emerges looking at the QR results at the quantile 0.5.

Table 2 Comparison of the QR and OLS coefficients (rows) for the global model (third column) and the two models according to the course type (fourth and fifth columns). The p-values for the multigroup comparison using the parametric and the permutation test are in the last two columns.

Figure 7 provides an overview of the most important results, allowing a more immediate assessment. The two plots in the left column refer to the learning dimension, the two plots in the right column to engagement. The first row summarizes the results of the parametric test, the second row the results of the permutation test. The coefficients are depicted on the vertical axis, the conditional quantiles considered on the horizontal axis. QR results on the whole sample are represented by solid lines, while results on the specific groups are represented by dashed lines. OLS results are depicted through three markers: the cross-shaped marker corresponds to the coefficient estimated on the whole sample, while the other two markers correspond to the coefficients estimated for the two groups of students, with a triangle with the vertex at the bottom referring to self-paced courses and a square referring to instructor-paced courses. Filled points indicate significant coefficients (p-value < 0.05) according to the permutation or the parametric test. The four panels share a common scale, making it possible to visually appreciate the predominant impact of the learning component compared to the engagement component, a pattern that becomes more pronounced moving from lower to higher quantiles. However, both the t-test and the permutation test show that there is no significant difference between the two groups in the effect of learning on performance in the lowest 25% of students. The opposite occurs at the top of the performance distribution. The lower performing students in the two groups differ more with respect to engagement and, as already pointed out, engagement is more important among lower performing students who attend self-paced courses.

Fig. 7

Coefficient comparison of QR on the whole sample and on the two subgroups of students according to course type (different lines), with the three conditional quartiles on the horizontal axis. The OLS coefficients are depicted using symbols at the conditional median for the sake of comparison. The first row refers to the parametric test, the second row to the permutation test; the first column refers to the learning component, the second column to engagement.

6 Concluding remarks

Modelling is not an easy task, both because of the possible complexity of the relationships among the components of the phenomenon under investigation, and because of possible heterogeneity in the dependency relationship. In some cases, this heterogeneity is known and defined through one or more stratification variables that identify groups of observations, each requiring a different model. For such models it is necessary to test for differences between the dependence structures. The problem is amplified in the case of QR, because we do not have a unique model, but a separate model for each conditional quantile \(\theta \) of interest. The challenge therefore extends to the analysis of group effects on different parts of the conditional distribution of the dependent variable.

In this paper, we extend the traditional approaches for handling heterogeneity from OLS regression to QR. In particular, we focus on two tests popular in the literature for multigroup analysis, introduced in composite-based path modeling: the parametric t-test (Keil et al. 2000), and the permutation test (Chin and Dibbern 2010). The aim is to assess whether the presence of heterogeneity in the sample involves different effects at different parts of the conditional distribution of the response. The multigroup approach provides results that are easy to interpret, even if the number of comparisons is strictly related to the number of levels of the stratification variable. Indeed, the number of comparisons grows rapidly with the number of levels. With several stratification variables, the comparisons involve all possible combinations of the levels of the variables involved. Thus, we recommend using the multigroup approach for stratification variables with a small number of levels.
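To make the growth in the number of comparisons concrete, the following sketch counts them under the assumption that every pair of groups is compared, for each regressor and each quantile (the text does not spell out the counting rule, so this is only one plausible reading). The second stratification variable `region` and the helper names are hypothetical.

```python
from itertools import combinations, product


def groups_from_levels(levels_per_variable):
    """Groups obtained by crossing the levels of all stratification variables."""
    return list(product(*levels_per_variable))


def pairwise_comparisons(groups):
    """All unordered pairs of groups, i.e. k * (k - 1) / 2 pairs for k groups."""
    return list(combinations(groups, 2))


course_type = ["self-paced", "instructor-paced"]
region = ["A", "B", "C"]  # hypothetical second stratification variable

# One variable with two levels: a single comparison, as in the application.
single = pairwise_comparisons(groups_from_levels([course_type]))

# Two variables: 2 * 3 = 6 groups, hence 15 pairs of groups.
groups = groups_from_levels([course_type, region])
pairs = pairwise_comparisons(groups)

# Each pair is tested for every regressor and every quantile considered.
n_regressors = 2   # learning and engagement
n_quantiles = 3    # the three conditional quartiles
n_tests = len(pairs) * n_regressors * n_quantiles

print(len(single), len(groups), len(pairs), n_tests)
```

Even a modest second variable multiplies the number of tests, which is consistent with the recommendation to restrict the multigroup approach to stratification variables with few levels.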

The effectiveness of the parametric t-test and the permutation test has been illustrated through a real data application. The empirical analysis of MOOC students’ performance showed that both engagement and learning are important drivers of final performance. However, the effect of these variables is not uniform, but varies according to the conditional quantile of interest and to the way in which the course is offered (self-paced or instructor-paced). Simulations confirm the ability of the proposed quantile multigroup approach to detect differences between models. As expected, the sensitivity of both tests increased with larger sample sizes and clearer differences between groups.

The multigroup approach focuses on observed heterogeneity, namely the case in which the groups are defined a priori by the levels of one or more stratification variables. However, there are situations in which an a priori segmentation of the observations is not available. In such cases, some criterion is needed to identify the comparisons to consider. The pathmox tree (Lamberti et al. 2016a) and the MOB procedure (Zeileis et al. 2008) are possible procedures for identifying the most important comparisons through a recursive approach. They exploit multiple comparisons and rank the variables that produce differences in the model coefficients, revealing the most significant comparisons. We leave the study of the feasibility of a recursive approach in the context of quantile regression to future work.