1 Introduction

Many studies are designed explicitly to show the absence of an effect between competing scenarios or theories. However, they sometimes base their findings on failing to reject a null hypothesis rather than on confirming a hypothesis of equivalence. For comparisons of treatment effects, the traditional hypothesis test of difference aims to determine whether the treatment effects differ from one another. Under such conditions, traditional difference tests are inappropriate for establishing equivalence, because failing to reject a no-difference hypothesis does not necessarily support a conclusion of equivalence. There has been growing awareness of, and demand for, appropriate techniques for assessing equivalence and similarity in the behavioral and managerial literature. For example, related discussions of theoretical perspectives and practical issues can be found in Cashen and Geifer [1], Cortina and Folger [2], Edward and Berry [3], Frick [4], Rogers, Howard, and Vessey [5], Seaman and Serlin [6], Stanton [7], Stegner, Bostrom, and Greenfield [8], and Steiger [9], among others.

To assess whether an observed effect size is clinically negligible or practically unimportant, the recommended equivalence test ascertains whether the observed effect size falls inside a selected equivalence range. Technical discussions and fundamental reviews of different types of mean equivalence tests were presented in Berger and Hsu [10], Meyners [11], and Schuirmann [12]. Although more powerful tests exist, two prominent procedures have received considerable attention in the literature: the two one-sided tests (TOST) method of Schuirmann [13] and Westlake [14], and the equivalence approach of Anderson and Hauck [15] and Hauck and Anderson [16]. These two procedures for mean equivalence admit a simple methodological reformulation for assessing equivalence, and their flexible settings allow generalizations to more complex experimental designs. Accordingly, Dixon and Pechmann [17] and Schmidt and Meyer [18] have extended the TOST to assess whether a linear trend is practically negligible in linear regression. Also, Counsell and Cribbie [19] described an extension of the Anderson and Hauck procedure for comparing the slope coefficients of two regression lines.

Despite its conservative nature, the TOST maintains good control of the Type I error rate at the specified level. However, the actual Type I error rate of the TOST can be substantially less than the nominal level, and the rejection region can be empty when the equivalence range is narrow, particularly with small sample sizes. Across the practical and diverse research designs for equivalence assessment, the adopted equivalence bounds and associated sample sizes may not be all that large. Under such circumstances, it is of methodological concern to consider alternative procedures with a proper rejection region and good Type I error control. On the other hand, the normal approximation presented in Counsell and Cribbie [19] for p-value calculations is only one of the three possible methods proposed in Anderson and Hauck [15]. Following the results of an extensive simulation study, Anderson and Hauck [15] recommended the central-t approach instead of the least accurate normal approximation. In view of the absence of vital clarification for theory development and supportive techniques, it is desirable to properly generalize the Anderson and Hauck procedure for linear regression analysis.

The present article aims to contribute to the development of equivalence methodology for linear regression in three respects. First, using the central-t approximation, extended Anderson and Hauck procedures are presented for equivalence testing of the slope coefficient and the mean response in one and two regression lines. The general formulation of asymmetric equivalence ranges permits a wide range of equivalence questions to be tested. Consequently, the procedures are useful for assessing a negligible trend and a similar response in a single regression line, and for evaluating an unimportant interaction-moderation effect and a comparable simple effect between two regression lines. Second, the associated power functions and sample size procedures are derived and compared under the random and fixed model settings. According to the analytic justification and empirical assessment, the exact approaches have a clear advantage over the approximate formulas in accommodating the full stochastic nature of both the response and predictor variables. It should be noted that exact power and sample size calculations were not addressed in Counsell and Cribbie [19]. Third, the proposed equivalence techniques are not available in popular software packages. Computer algorithms are provided for critical value computations, power calculations, and sample size determinations of the extended Anderson and Hauck procedures. The suggested power and sample size calculations should be useful for planning equivalence studies concerning the much-discussed appraisals of interaction-moderation effects and simple effects in behavioral and management research.

2 Single Regression Line

The simple linear regression model is of the form

$$Y_{i} = {\upbeta }_{0} + X_{i} {\upbeta }_{{1}} + \varepsilon_{i},$$
(1)

where Yi is the response score of the ith subject, β0 is the intercept, β1 is the slope coefficient, Xi is the predictor score of the ith subject, and εi are iid N(0, σ2) random variables, i = 1, …, N. The least squares estimator \({\hat{\upbeta}}_{{1}}\) of slope coefficient β1 has the following distribution

$$\hat{\beta} _{1} \sim N(\beta _{1} ,\sigma ^{2} /SSX),$$
(2)

where \(SSX = \sum\nolimits_{{i = 1}}^{N} {(X_{i} - {\bar{X}})^{2} }\) and \({\bar{X}} = \sum\nolimits_{{i = 1}}^{N} {X_{i} /N}\). Also, \({\hat{\upsigma}}^{2} = SSE/\nu\) is the usual unbiased estimator of σ2 where SSE is the error sum of squares and ν = N – 2. Moreover, V = SSE/σ2 ~ χ2(ν), where χ2(ν) denotes a chi-square distribution with ν degrees of freedom.

To detect the difference of slope coefficient in terms of H0: β1 = β10 versus H1: β1 ≠ β10, the test statistic has the form

$$T_{{S0}} = \frac{\hat{\beta}_{1} - \beta_{10}}{(\hat{\sigma}^2/SSX)^{1/2}}$$
(3)

The null hypothesis is rejected at the significance level α if

$$|T_{S0} | \, > t_{\nu ,\alpha /2}$$
(4)

where \(t_{\nu ,\alpha /2}\) is the 100(1 – α/2) percentile of t(ν) and t(ν) is a t distribution with degrees of freedom ν.

2.1 Equivalence Test of Linear Trend

The primary focus of this article is the test of equivalence, for which the null and alternative hypotheses are expressed as

$${\text{H}}_{0} :{\upbeta }_{{1}} \le \Delta_{L} {\text{ or }}\Delta_{U} \le {\upbeta }_{{1}} {\text{ versus H}}_{{1}} :\Delta_{L} < {\upbeta }_{{1}} < \Delta_{U},$$
(5)

where ΔL and ΔU are a priori constants that represent the threshold range for declaring an equivalent effect size. The hypotheses with asymmetric equivalence thresholds can be readily rewritten in terms of symmetric equivalence bounds as

$${\text{H}}_{0} :\upbeta_{1}^{*} \le {-}\Delta {\text{ or }}\Delta \le \upbeta_{1}^{*} {\text{ versus H}}_{{1}} : \, {-}\Delta < \upbeta_{1}^{*} < \Delta ,$$
(6)

where \(\upbeta_{1}^{*}\) = β1 – ΔM, ΔM = (ΔL + ΔU)/2, and Δ = (ΔU – ΔL)/2. An important scenario is to detect a negligible trend by setting ΔU = Δ and ΔL = – Δ so that ΔM = 0 for a bound Δ.

For the given value of the predictor quantity SSX, it is essential to note that

$$T_{S} = \frac{\hat{\upbeta}_{1} - \Delta_M } { ({\hat{\upsigma}}^{2} /SSX)^{1/2} } \sim t(\nu ,\lambda_{S}),$$
(7)

where t(ν, λS) is the noncentral t distribution with degrees of freedom ν and noncentrality parameter λS = (β1 – ΔM)/(σ2/SSX)1/2. To claim the slope coefficient β1 is within the interval (ΔL, ΔU), a natural rejection region to the null hypothesis is

$$\{ \uptau_{SL} < T_{S} < \uptau_{SU} \} ,$$

where the two critical values τSL and τSU are chosen to simultaneously attain the nominal Type I error rate

$$P\{ \uptau_{SL} < T_{S} < \uptau_{SU} |{\upbeta }_{{1}} = \Delta_{L} \} \, = \alpha {\text{ and }}P\{ \uptau_{SL} < T_{S} < \uptau_{SU} |{\upbeta }_{{1}} = \Delta_{U} \} \, = \alpha .$$

Following the properties of a noncentral t distribution as in Johnson, Kotz and Balakrishnan [20], it can be shown that the two conditions can be simultaneously satisfied by the choice of critical values τSL = –τS and τSU = τS where τS > 0. Hence, the rejection region is of the form

$$AH_{S} = \, \{ {-}\uptau_{S} < T_{S} < \uptau_{S} \} ,$$
(8)

where τS is determined by the condition P{–τS < TS < τS| β1 = ΔL} = α or P{–τS < TS < τS| β1 = ΔU} = α. Note that the error variance is generally unknown, so the exact distribution of TS cannot be specified. Following the suggestion in Anderson and Hauck [15], a feasible and accurate approach is to find the critical value τS through the approximation \(T_{S}\text{ } \dot\sim \text{ }T + {\hat{\lambda }}_{S}\) where \(T\sim t(\nu )\), \({\hat{\lambda }}_{S} = \Delta /({\hat{\upsigma }}^{2} /SSX)^{{{1}/{2}}}\), and

$$P\{ {-}\uptau_{S} < T + {\hat{\lambda }}_{S} < \uptau_{S} \} \, = \alpha.$$
(9)

Thus, the optimal quantity τS can be computed by a simple iterative search. Note that the critical value τS is a function of α, Δ, N, \({\hat{\upsigma }}^{{2}}\), and SSX. It does not have an explicit analytic expression and requires a computer program to calculate the actual value. An efficient algorithm is developed for computing the critical value and rejection region for the suggested procedure. Also, the p-value associated with the observed slope estimate \({\hat{\upbeta}}_{1O}\) can be calculated as

$$p{\text{-value }} = P\{ {-}\left| {T_{O} } \right| \, {-}{\hat{\lambda }}_{S} < T < \, \left| {T_{O} } \right| \, {-}{\hat{\lambda }}_{S} \},$$
(10)

where \(T_{O} = ({\hat{\upbeta}}_{{{1}O}} {-}\Delta_{M} )/({\hat{\upsigma }}^{2} /SSX)^{{{1}/{2}}}\). It is apparent that the p-value is computationally easier to obtain than the critical value.
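Because the defining condition is monotone in τS, the iterative search reduces to a one-dimensional root-finding problem. The following minimal sketch, assuming SciPy is available, computes the critical value of (9) and the p-value of (10); the function names are illustrative and not those of the paper's computer programs.

```python
# Critical value and p-value for the extended Anderson-Hauck slope test.
from scipy.stats import t
from scipy.optimize import brentq

def ah_critical_value(alpha, delta, sigma2_hat, ssx, nu):
    """Solve P{-tau < T + lam_hat < tau} = alpha for tau, with T ~ t(nu)
    and lam_hat = Delta / (sigma2_hat / SSX)^(1/2), as in Eq. (9)."""
    lam = delta / (sigma2_hat / ssx) ** 0.5
    # The coverage probability is strictly increasing in tau, so a
    # one-dimensional root search over a bracketing interval suffices.
    g = lambda tau: t.cdf(tau - lam, nu) - t.cdf(-tau - lam, nu) - alpha
    upper = lam + t.ppf(1 - alpha, nu)   # g(upper) > 0 for alpha < 0.5
    return brentq(g, 1e-10, upper)

def ah_p_value(beta1_hat, delta_m, delta, sigma2_hat, ssx, nu):
    """p-value P{-|T_O| - lam_hat < T < |T_O| - lam_hat} of Eq. (10)."""
    se = (sigma2_hat / ssx) ** 0.5
    t_obs = (beta1_hat - delta_m) / se
    lam = delta / se
    return t.cdf(abs(t_obs) - lam, nu) - t.cdf(-abs(t_obs) - lam, nu)
```

The nonequivalence null hypothesis is rejected when |TS| < τS, or equivalently when the p-value falls below α.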

Note that a similar discussion was given in Anderson and Hauck [15] for testing two-group mean equivalence. Because of the computational ease of the p-value, they recommended the p-value approach for reaching a decision. Hence, they did not address the calculation and implementation issues of the rejection region and the corresponding power function. Accordingly, the sample size procedure for mean equivalence in Hauck and Anderson [16] is less transparent and cannot be readily adopted as a general tool in linear regression. Moreover, the Anderson and Hauck procedure has an unbounded rejection region, as do other more powerful tests. The counterintuitive rejection of nonequivalence with arbitrarily large values of the sample variance has been debated extensively in Berger and Hsu [10] and the discussions therein. As a constructive response, they proposed specifying an upper bound on the sample variance beyond which the null hypothesis is never rejected. On the other hand, unlike the TOST, the Anderson and Hauck procedure's advantage in Type I error protection for small sample sizes and tight equivalence bounds should also be taken into consideration. The contrasting behavior of the two test procedures is demonstrated in the subsequent numerical examples.

2.2 Equivalence Test of Mean Response

The equivalence appraisal can also be applied to the mean response μ = β0 + Xβ1 at a focal predictor value XF. The null and alternative hypotheses are presented as

$$\text{H}_0 :\upmu \le \Delta_{L} {\text{ or }}\Delta_{U} \le \upmu \;{\text{versus}}\;\text{H}_1 :\Delta_{L} < \upmu < \Delta_{U},$$
(11)

where ΔL and ΔU are a priori constants that represent the threshold range for declaring practical equivalence. With the least squares estimators \(({\hat{\upbeta}}_{0} , \, {\hat{\upbeta}}_{1} )\) of (β0, β1), the linear estimator \(\hat{\upmu } \, = \, {\hat{\upbeta}}_{0} + {X_{F}\hat{\upbeta}}_{1}\) has the distribution

$$\hat{\upmu }\sim N(\upmu ,\upsigma^{{2}} H_{M} ),$$
(12)

where \(H_{M} = { 1}/N + \, (X_{F} {-}\overline{X})^{2} /SSX\). It is useful to note that

$$T_{M} = \frac{\hat{\upmu } - \Delta_{M}} { (\hat{\sigma}^2H_M ) ^{1/2}}\sim t(\nu ,\lambda_{M} ),$$
(13)

where the noncentrality parameter λM = (μ – ΔM)/(σ2HM)1/2 and ΔM = (ΔL + ΔU)/2.

Following the same principle for slope coefficient assessment, a potential rejection region to the null hypothesis is of the form

$$AH_{M} = \, \{ {-}\uptau_{M} < T_{M} < \uptau_{M} \} ,$$
(14)

where the critical value τM is chosen to attain the nominal Type I error rate when μ = ΔL and ΔU. The proposed approach is to find the critical value through the approximate evaluation

$$P\{ {-}\uptau_{M} < T + {\hat{\lambda }}_{M} < \uptau_{M} \} \, = \alpha$$
(15)

where T ~ t(ν), \({\hat{\lambda }}_{M} = \Delta /({\hat{\upsigma }}^{2} H_{M} )^{1/2}\), and Δ = (ΔU – ΔL)/2. Note that the critical value τM is a function of α, Δ, N, \({\hat{\upsigma }}^{2}\), and HM. Moreover, an iterative algorithm is required to compute the critical value.

2.3 A Numerical Example

The numerical details for the equivalence tests of the slope coefficient and mean response are demonstrated with the data of the training study described in Table 6.1 of Huitema [21], concerning the relation between the response variable (Y: achievement) and the predictor variable (X: aptitude) for three types of training program.

For the first training group with N = 10, the sample means of the predictor and response variables are \(\overline{X} = 52.00\) and \({\overline{Y}} = 30.00\), respectively. Moreover, the least squares estimates of the linear regression line between the achievement and aptitude measurements are \(\{ {\hat{\upbeta}}_{0}, \, {\hat{\upbeta}}_{1}\} \, = \, \{ 4.1033, \, 0.4980\}\), and the sample variance of error is \({\hat{\upsigma }}^{{2}} = 70.5615\). For illustration, an equivalence test of the slope coefficient is performed in terms of H0: β1 ≤ 0.25 or 0.75 ≤ β1 versus H1: 0.25 < β1 < 0.75 (ΔM = 0.50 and Δ = 0.25). With SSX = 2014.00 and α = 0.05, the test statistic and critical value are computed as TS = –0.0106 and τS = 0.1598, respectively. Thus, the nonequivalence null hypothesis is rejected at the significance level 0.05. The conclusion indicates that the slope coefficient is essentially equivalent to 0.50 with no more than a 0.25 difference.

The equivalence test of mean response can also be performed with the estimated mean response \(\hat{\upmu } = 29.0040\) at XF = 50. Using ΔM = 29 and Δ = 4, the equivalence test of mean response is conducted in terms of H0: μ ≤ 25 or 33 ≤ μ versus H1: 25 < μ < 33. The test statistic and critical value are computed as TM = 0.0015 and τM = 0.1966, respectively, for α = 0.05. Hence, the nonequivalence null hypothesis is rejected at the significance level 0.05. The analysis suggests that the mean response at XF = 50 is nearly within a bound of 4 around 29. Moreover, it can be shown that the resulting rejection regions of the TOST procedures are empty sets, so there is no chance of rejecting the nonequivalence null hypothesis for either the slope coefficient or the mean response. Apparently, the TOST approach may not be a reliable procedure when the sample size is small, especially for a tight equivalence range. This deficiency agrees with the explication of the TOST for assessing mean equivalence in Schuirmann [12].
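Both illustrative tests can be reproduced from the reported summary statistics. The sketch below, assuming SciPy (the variable names are ours), recomputes the test statistics and critical values, and also checks the condition Δ < tν,α·SE under which the TOST rejection region {ΔL + tν,α·SE < estimate < ΔU – tν,α·SE} is empty:

```python
from scipy.stats import t
from scipy.optimize import brentq

# Summary statistics of the first training group (N = 10, nu = 8)
alpha, nu = 0.05, 8
sigma2_hat, ssx, xbar = 70.5615, 2014.0, 52.00

def ah_tau(delta, se):
    # tau solving P{-tau < T + delta/se < tau} = alpha, with T ~ t(nu)
    lam = delta / se
    g = lambda tau: t.cdf(tau - lam, nu) - t.cdf(-tau - lam, nu) - alpha
    return brentq(g, 1e-10, lam + t.ppf(1 - alpha, nu))

# Slope test: H1: 0.25 < beta1 < 0.75 (Delta_M = 0.50, Delta = 0.25)
se_s = (sigma2_hat / ssx) ** 0.5
ts = (0.4980 - 0.50) / se_s           # about -0.0106, as reported
tau_s = ah_tau(0.25, se_s)            # about 0.1598
reject_slope = -tau_s < ts < tau_s    # True: conclude equivalence

# Mean response at X_F = 50: H1: 25 < mu < 33 (Delta_M = 29, Delta = 4)
h_m = 1 / 10 + (50 - xbar) ** 2 / ssx
se_m = (sigma2_hat * h_m) ** 0.5
tm = (29.0040 - 29.0) / se_m          # about 0.0015
tau_m = ah_tau(4.0, se_m)             # about 0.1966
reject_mean = -tau_m < tm < tau_m     # True: conclude equivalence

# TOST rejection region is empty whenever Delta < t_{nu,alpha} * SE
t_crit = t.ppf(1 - alpha, nu)
tost_empty_slope = 0.25 < t_crit * se_s   # True for these data
tost_empty_mean = 4.0 < t_crit * se_m     # True for these data
```

The last two checks confirm the contrasting behavior noted above: the Anderson and Hauck procedure rejects nonequivalence while the TOST cannot.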

2.4 Power and Sample Size Calculations

When planning and conducting research, the actual values of the continuous response and predictor measurements for each subject are available only after the observations are obtained. In addition to the randomness of the normal responses, the stochastic nature of the predictor variables has to be taken into account in power analysis under the random and unconditional context of a linear regression study. A useful and convenient framework is to assume the continuous predictor variables {Xi, i = 1, …, N} have independent and identical normal distributions N(μX, \(\upsigma_{X}^{2}\)), as in Shieh [22, 23] within the context of ANCOVA.

Under the prescribed stochastic consideration of {Xi, i = 1, …, N}, it can be readily established that K = SSX/\(\upsigma_{X}^{2}\) ~ χ2(κ) where κ = N – 1. The power function of the equivalence procedure for slope coefficient can be expressed as

$$\Pi_{S} = P\{ {-}\uptau_{S} < T_{S} < \uptau_{S} |\Delta_{L} < \upbeta_{{1}} < \Delta_{U} \} .$$
(16)

Note that the critical value τS depends on the two quantities \({\hat{\upsigma}}^{2}\) and SSX. With \({\hat{\upsigma}}^{2} = \upsigma^{{2}} (V/\nu )\) and \(H_{S} = 1/SSX = 1/(\upsigma_{X}^{2} K)\), the power function ΠS can be rewritten as

$$\Pi_{S} = E_{(K,V)} [\Phi \left( {B_{S} } \right) \, {-}\Phi \left( {A_{S} } \right)],$$
(17)

where BS = (ΔM – β1)/(σ2HS)1/2 + τS(V/ν)1/2, AS = (ΔM – β1)/(σ2HS)1/2 – τS(V/ν)1/2, Φ(⋅) is the cumulative distribution function of the standard normal distribution, and the expectation E(K,V) is taken with respect to the chi-square distributions of K and V.
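Because τS itself varies with \({\hat{\upsigma}}^{2}\) and SSX, the expectation in (17) has no simple closed form. The following Monte Carlo sketch of the exact power (our own illustration, assuming NumPy and SciPy; not the paper's algorithm) draws K and V from their chi-square distributions, re-solves the critical value τS for each realization, and averages Φ(BS) – Φ(AS):

```python
import numpy as np
from scipy.stats import t, norm
from scipy.optimize import brentq

def exact_power_slope(beta1, delta_m, delta, sigma2, sigma2_x, n_size,
                      alpha=0.05, reps=5000, seed=1):
    """Monte Carlo evaluation of Pi_S = E_{(K,V)}[Phi(B_S) - Phi(A_S)],
    re-solving the critical value tau_S for each realized (K, V)."""
    rng = np.random.default_rng(seed)
    nu, kappa = n_size - 2, n_size - 1
    k_draws = rng.chisquare(kappa, reps)   # K = SSX / sigma_X^2
    v_draws = rng.chisquare(nu, reps)      # V = SSE / sigma^2
    total = 0.0
    for ki, vi in zip(k_draws, v_draws):
        ssx = sigma2_x * ki
        sig2_hat = sigma2 * vi / nu
        lam = delta / (sig2_hat / ssx) ** 0.5
        g = lambda tau: t.cdf(tau - lam, nu) - t.cdf(-tau - lam, nu) - alpha
        tau = brentq(g, 1e-10, lam + t.ppf(1 - alpha, nu))
        hs = 1.0 / ssx
        center = (delta_m - beta1) / (sigma2 * hs) ** 0.5
        b_s = center + tau * (vi / nu) ** 0.5
        a_s = center - tau * (vi / nu) ** 0.5
        total += norm.cdf(b_s) - norm.cdf(a_s)
    return total / reps
```

Increasing `reps` trades computing time for Monte Carlo accuracy; exact numerical integration over the joint chi-square density is also possible but more involved.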

Under the random predictor framework, the normality assumption implies that

$$T_{X} = \frac{{{\bar{X}} - X_{F} }}{{{\mkern 1mu} ({\widehat{\sigma}}_{X}^{2} /N)^{{1/2}} }}\sim t(\kappa,\lambda _{X} ),$$
(18)

where \({{\hat{\upsigma }}}_{X}^{2} = SSX/\upkappa\) and \(\lambda _{{\text{X}}} = (\mu _{{\text{X}}} - X_{{\text{F}}} )/(\sigma _{{\text{X}}}^{2} /N)^{{1/2}}\). Also, the power function of the equivalence procedure for mean response is of the form

$$\Pi_{M} = P\{ {-}\uptau_{M} < T_{M} < \uptau_{M} |\Delta_{L} < \upmu < \Delta_{U} \}.$$
(19)

In this case, the critical value τM depends on the two terms \({\hat{\upsigma}}^{2}\) and HM. With \({\hat{\upsigma}}^{2} = \upsigma^{{2}} (V/\nu )\) and \(H_{M} = \, 1/N + T_{X}^{2} /(\upkappa N)\), it follows that the power function ΠM can be expressed as

$$\Pi_{M} = E_{(TX,V)} [\Phi (B_{M} ) \, {-}\Phi (A_{M} )],$$
(20)

where BM = (ΔM – μ)/(σ2HM)1/2 + τM(V/ν)1/2, AM = (ΔM – μ)/(σ2HM)1/2 – τM(V/ν)1/2, and E(TX,V) is taken with respect to the joint distribution of TX and V.

The prescribed power functions ΠS and ΠM for the slope coefficient and mean response involve a mixture of noncentral t distributions through the distributions of K and TX of the predictor variables, respectively. Because of their computational complexity, it is appealing to simplify these power functions. Under the normality assumption N(μX, \(\upsigma_{X}^{2}\)) for the predictors {Xi, i = 1, …, N}, standard results show that \(E[\overline{X}] = \upmu_{X}\) and E[SSX] = \(\upkappa \upsigma_{X}^{2}\). Hence, an approximate unconditional distribution can be obtained for the test statistic, \(T_{S}\text{ } \dot\sim \text{ } t(\nu ,\lambda_{SA} )\), where λSA = (β1 – ΔM)/(σ2HSA)\(^{1/2}\) and HSA = 1/(\(\upkappa \upsigma_{X}^{2}\)). This yields a simplified power function for the equivalence test of linear trend

$$\Pi_{SA} = P\{ {-}\uptau_{S} < t(\nu ,\lambda_{SA} ) \, < \uptau_{S} \} .$$
(21)

Moreover, following similar arguments, the test statistic of mean response has the approximate distribution \(T_{M} {\dot{\sim}} t(\nu ,\lambda _{{MA}} )\) where λMA = (μ – ΔM)/(σ2HMA)\(^{1/2}\) and HMA = 1/N + (μX – XF)2/(\(\upkappa \upsigma_{X}^{2}\)). Then, an approximate power function for the equivalence test of mean response is given by

$$\Pi_{MA} = P\{ {-}\uptau_{M} < t(\nu ,\lambda_{MA} ) \, < \uptau_{M} \} .$$
(22)

The approximate power functions of the equivalence procedures provide computational shortcuts to the exact formulas. The simple formulations can be readily implemented with the embedded probability functions of a noncentral t distribution in standard software systems. On the other hand, the prescribed analytic justifications provide statistical support for the exact power functions. An immediate application of the power functions is to compute optimal sample sizes needed for the equivalence procedure to attain the specified power under the designated model configurations. The fundamental discrepancy between the exact and simplified power and sample size calculations will be further assessed in the succeeding numerical investigations.
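As a planning sketch, the approximate power ΠSA in (21) can be evaluated directly with the noncentral t distribution, and the required sample size found by incrementing N until the target power is attained. In this illustration (assuming SciPy), τS is evaluated at \({\hat{\upsigma}}^{2} = \upsigma^{2}\) and SSX = κ\(\upsigma_{X}^{2}\) for the design stage, which is our simplifying assumption rather than a prescription from the paper:

```python
from scipy.stats import nct, t
from scipy.optimize import brentq

def approx_power_slope(beta1, delta_m, delta, sigma2, sigma2_x,
                       n_size, alpha=0.05):
    """Approximate power Pi_SA = P{-tau_S < t(nu, lam_SA) < tau_S}."""
    nu, kappa = n_size - 2, n_size - 1
    se = (sigma2 / (kappa * sigma2_x)) ** 0.5   # (sigma^2 * H_SA)^(1/2)
    lam_hat = delta / se
    g = lambda tau: (t.cdf(tau - lam_hat, nu)
                     - t.cdf(-tau - lam_hat, nu) - alpha)
    tau = brentq(g, 1e-10, lam_hat + t.ppf(1 - alpha, nu))
    lam_sa = (beta1 - delta_m) / se
    return nct.cdf(tau, nu, lam_sa) - nct.cdf(-tau, nu, lam_sa)

def sample_size_slope(beta1, delta_m, delta, sigma2, sigma2_x,
                      power=0.80, alpha=0.05, n_max=10000):
    """Smallest N whose approximate power reaches the target."""
    for n_size in range(5, n_max):
        if approx_power_slope(beta1, delta_m, delta, sigma2, sigma2_x,
                              n_size, alpha) >= power:
            return n_size
    raise ValueError("target power not reached within n_max")
```

The exact calculation replaces the single noncentral t evaluation with the expectation over K and V, which is why its computed sample sizes can differ from this shortcut.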

2.5 Numerical Assessments

As an exemplifying framework, the model configurations follow those of the training study in Huitema [21]. Accordingly, the sample estimates of the regression coefficients and variance component of the first training group are designated as the working configurations: {β0, β1} = {4.1033, 0.4980} and σ2 = 70.5615. The mean and variance of the normal predictors are chosen as {μX, \(\upsigma_{X}^{2}\)} = {52.00, 223.7778}. The equivalence thresholds (ΔL, ΔU) are defined as ΔL = ΔM – Δ and ΔU = ΔM + Δ, and various magnitudes of ΔM and Δ are evaluated. For the equivalence tests of linear trend, the selected values are ΔM = 0.5 with Δ = 0.2, 0.3, and 0.4. The equivalence tests of mean response are examined at XF = 50 with ΔM = 29 for μ = 29.0040 under three equivalence bounds Δ = 4, 5, and 6.

With these specifications, the required sample sizes of both the exact and approximate methods were computed for the chosen power value 1 – β = 0.80 and significance level α = 0.05. The estimated sample sizes for the equivalence tests of linear trend and mean response are presented in Table 1. Note that the resulting sample sizes cover a reasonable range of magnitudes, being neither unrealistic nor excessively large. More importantly, the estimated sample sizes of the exact approach are consistently larger than or equal to those of the approximate procedure for all six cases. To facilitate comparison of the accuracy of the power functions, the estimated or attained powers are also summarized in Table 1. Because of the underlying metric of integer sample sizes, the estimated values of both the exact and approximate procedures are marginally larger than the nominal level for all cases.

Table 1 Computed sample size, estimated power, and simulated power for the Anderson and Hauck test of linear trend and mean response at XF = 50 when Type I error α = 0.05 and nominal power 1 – β = 0.80

In the second stage, Monte Carlo simulation studies were performed to assess the performance of the power and sample size calculations. With the computed sample sizes, parameter configurations, and nominal alpha level, estimates of the true power were computed via Monte Carlo simulation of 10,000 independent data sets. For each replicate, N predictor values were generated from the selected normal distribution. The resulting predictor values were then used to determine the mean responses for generating the normal responses under the specified linear regression model. Next, the equivalence test statistics were computed, and the simulated power was the proportion of the 10,000 replicates whose null hypothesis was rejected at the significance level 0.05. Accordingly, the adequacy of the approximate and exact sample size procedures is determined by the error (= simulated power – estimated power) between the simulated power of the Monte Carlo study and the estimated power computed from the analytic power function. The simulated power and error are also presented in Table 1.
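The simulation steps above can be sketched as follows. This is our own minimal implementation for the linear trend test, assuming NumPy and SciPy; the paper's programs are not shown here:

```python
import numpy as np
from scipy.stats import t
from scipy.optimize import brentq

def simulate_power_slope(beta0, beta1, delta_m, delta, sigma2, mu_x,
                         sigma2_x, n_size, alpha=0.05, reps=2000, seed=1):
    """Data-level Monte Carlo check: draw predictors, generate responses
    from the regression model, run the equivalence test on each data set,
    and return the rejection proportion."""
    rng = np.random.default_rng(seed)
    nu = n_size - 2
    rejections = 0
    for _ in range(reps):
        x = rng.normal(mu_x, sigma2_x ** 0.5, n_size)
        y = beta0 + beta1 * x + rng.normal(0.0, sigma2 ** 0.5, n_size)
        xc = x - x.mean()
        ssx = (xc ** 2).sum()
        b1 = (xc * y).sum() / ssx            # least squares slope
        b0 = y.mean() - b1 * x.mean()
        sse = ((y - b0 - b1 * x) ** 2).sum()
        se = (sse / nu / ssx) ** 0.5
        lam = delta / se
        g = lambda tau: t.cdf(tau - lam, nu) - t.cdf(-tau - lam, nu) - alpha
        tau = brentq(g, 1e-10, lam + t.ppf(1 - alpha, nu))
        if abs((b1 - delta_m) / se) < tau:   # reject nonequivalence
            rejections += 1
    return rejections / reps
```

Comparing the returned proportion with the analytic power functions reproduces the error criterion (= simulated power – estimated power) described above.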

The results reveal that the exact approaches are extremely accurate, as the associated errors of the six cases are all within the small range of –0.0055 to 0.0075. Accordingly, there exists a close agreement between the simulated power and the estimated power of the exact approaches for these settings. On the other hand, the simulated powers for the approximate methods are consistently less than the estimated powers. Specifically, the resulting errors are {–0.0167, –0.0210, –0.0306} and {–0.0057, –0.0069, –0.0177} for the linear trend and mean response, respectively. Although some of the differences are not substantial, this implies that the approximate power functions do not give reliable results for small sample sizes. In short, the adequacy of the approximate power and sample size calculations varies with the model configurations. It is clear that the exact techniques are more reliable and accurate than the approximate methods for all cases of linear trend and mean response considered here.

3 Two Regression Lines

The two-group nonparallel simple linear regression model is expressed as

$$Y_{{{1}i}} = \upbeta_{{0{1}}} + X_{{{1}i}} \upbeta_{{{11}}} + \varepsilon_{{{1}i}} {\text{ and }}Y_{{{2}j}} = \upbeta_{{0{2}}} + X_{{{2}j}} \upbeta_{{{12}}} + \varepsilon_{{{2}j}} ,$$
(23)

where ε1i and ε2j are iid N(0, σ2) random variables, i = 1, …, N1, and j = 1, …, N2. Note that a traditional ANCOVA model assumes that the regression slopes are equal, β11 = β12. Accordingly, a test of slope equality is generally required to justify the use of ANCOVA.

Standard results show that the least squares estimators \({\hat{\upbeta}}_{{{11}}}\) and \({\hat{\upbeta}}_{{{12}}}\) of the slope coefficients β11 and β12 have the following distributions

$${\hat{\upbeta}}_{11}\sim N(\upbeta_{{{11}}} ,\upsigma^{{2}} /SSX_{{1}} ){\text{ and }}{\hat{\upbeta}}_{12}\sim N(\upbeta_{{{12}}} ,\upsigma^{{2}} /SSX_{{2}} ),$$

where \(SSX_{1} = \sum\nolimits_{i = 1}^{{N_{1} }} {(X_{{{1}i}} - \overline{X}_{1} )^{2} }\), \(SSX_{2} = \sum\nolimits_{j = 1}^{{N_{2} }} {(X_{{{2}j}} - {\overline{X}}_{2} )^{2} }\), \({\overline{X}}_{{1}} = \sum\nolimits_{i = 1}^{{N_{1} }} {X_{{{1}i}} /N_{{1}} }\), and \({\overline{X}}_{{2}} = \sum\nolimits_{j = 1}^{{N_{2} }} {X_{{{2}j}} /N_{{2}} }\). The difference of the two slope estimators has the distribution

$${\hat{\upbeta}_D} = {\hat{\upbeta}}_{{{11}}} { - } {\hat{\upbeta}}_{{{12}}} \sim N\{ \upbeta_{D} ,\upsigma^{{2}} H_{DS} \} ,$$
(24)

where βD = β11 – β12 and HDS = 1/SSX1 + 1/SSX2. In this case, \({\hat{\upsigma }}^{{2}} = SSE/\nu_{D}\) is the usual unbiased estimator of σ2 and V = SSE/σ2 ~ χ2(νD), where SSE is the error sum of squares and νD = N1 + N2 – 4.

To detect the difference between two slope coefficients in terms of H0: βD = βD0 versus H1: βD ≠ βD0, the test statistic has the form

$$T_{DS0} = \frac{ {\hat{\upbeta}_D} - \upbeta_{D0}}{{{ (}{\hat{\upsigma }}^{{2}} {H}_{{{DS}}} {)}^{1/2} }}$$
(25)

The null hypothesis is rejected at the significance level α if

$$|T_{DS0} | \, > t_{\nu_{D} \text{, } \alpha /2}$$
(26)

3.1 Equivalence Test of Trend Effect

To conduct equivalence test of trend effect or slope difference, the null and alternative hypotheses are expressed as

$$\text{H}_0 :\upbeta_{D} \le \Delta_{L} \text{ or } \Delta_{U} \le \upbeta_{D} \text{ versus } \text{H}_1 :\Delta_{L} < \upbeta_{D} < \Delta_{U},$$
(27)

where ΔL and ΔU are a priori constants that denote the minimal magnitude for declaring equivalence for trend effect. Under the model assumption, it follows that

$$T_{{DS}} = \frac{ \hat{\beta }_{D} - \Delta _{M}} {{(\hat{\sigma }^{2} H_{{{DS}}} )^{{1/2}} }}\sim t(\nu _{D} ,\lambda _{{DS}} ),$$
(28)

where the noncentrality parameter λDS = (βD – ΔM)/(σ2HDS)1/2 and ΔM = (ΔL + ΔU)/2. To justify the slope difference βD is within the interval (ΔL, ΔU), a feasible rejection region to the null hypothesis is

$$AH_{{DS}} = \{ - \tau _{{DS}} < T_{{DS}} < \tau _{{DS}} \} ,$$
(29)

where the critical value τDS is chosen to simultaneously attain the nominal Type I error rate when βD = ΔL and ΔU. In practice, the exact distribution of TDS is unknown, and the critical value τDS can be determined through the approximation

$$P\{ - {\uptau }_{{DS}} < T + \hat{\lambda }_{{DS}} < {\uptau }_{{DS}} \} = \alpha ,$$
(30)

where T ~ tD), \(\hat{\lambda }\) DS = Δ/(\(\hat{\sigma }^{{2}}\) HDS)1/2, and Δ = (ΔU – ΔL)/2. The optimal quantity τDS is a function of α, Δ, N1, N2, \({\hat{\upsigma }}^{2}\), and HDS. Although the critical value does not have a closed-form expression, it can be computed by a simple iterative search.

As emphasized in Huitema [21], Kutner et al. [24], Rencher and Schaalje [25], and related texts on research methods, the traditional ANCOVA assumes that the slope coefficients relating the predictor variables to the response variables are the same for each treatment group. The assertion of homogeneous regression slopes implies a lack of interaction effects between a categorical moderator and a continuous predictor in a moderation study. Note that the conventional difference test purports to show that the regression lines are nonparallel. Hence, the suggested equivalence procedure for the trend effect is more appropriate for supporting the assumption of equal or comparable slope coefficients in ANCOVA.

3.2 Equivalence Test of Simple Effect

A related and practical scheme for comparing two regression lines is to assess the difference between two mean responses at a designated predictor value. The simple effect or the mean response difference between two regression lines at XF is defined as

$$\upmu_{D} = \upmu_{{1}} {-}\upmu_{{2}} = \, (\upbeta_{{0{1}}} {-}\upbeta_{{0{2}}} ) \, + X_{F} (\upbeta_{{{11}}} {-}\upbeta_{{{12}}} )$$
(31)

The equivalence test of simple effect is conducted under the null and alternative hypotheses:

$${\text{H}}_{0} :\upmu_{D} \le \Delta_{L} {\text{ or }}\Delta_{U} \le \upmu_{D} {\text{ versus H}}_{{1}} :\Delta_{L} < \upmu_{D} < \Delta_{U} ,$$
(32)

where ΔL and ΔU are a priori constants that represent the minimal threshold for declaring essential equivalence.

Using the least squares estimators {\({\hat{\upbeta}}_{{{01}}}\), \({\hat{\upbeta}}_{11}\), \({\hat{\upbeta}}_{{{02}}}\), \({\hat{\upbeta}}_{12}\)} of the intercept and slope coefficients {β01, β11, β02, β12}, the estimated mean responses \(\hat{\upmu }_{{1}}\) and \(\hat{\upmu }_{{2}}\) for the mean values μ1 = β01 + XFβ11 and μ2 = β02 + XFβ12 at a specified value XF are

$$\hat{\upmu }_{1} = {\hat{\upbeta}}_{{{01}}} { + }X_F{\hat{\upbeta}}_{{11}} \, \text{and}\, \hat{\upmu }_{2} = {\hat{\upbeta}}_{02} { + }X_F{\hat{\upbeta}}_{{12}} $$

respectively. A natural and unbiased estimator of μD is \(\hat{\upmu }_{{D}} = \hat{\upmu }_{1} - \hat{\upmu }_{2}\) and

$$\hat{\upmu }_{D} \sim N(\upmu_{D},\text{ } \sigma^{2} H_{DM} ),$$
(33)

where HDM = 1/N1 + 1/N2 + (XF\({\overline{X}}_{{1}}\))2/SSX1 + (XF\({\overline{X}}_{{2}}\))2/SSX2. It is important to note under the model assumption that

$$T_{{DM}} = \frac{\hat{\mu}_D - \Delta_M} {(\hat{\sigma }^2 H_{DM} )^{1/2} } \sim t(\nu _{D} ,\lambda _{{DM}} ),$$
(34)

where the noncentrality parameter λDM = (μD – ΔM)/(σ2HDM)1/2 and ΔM = (ΔL + ΔU)/2. To evaluate whether the simple effect μD is within the interval (ΔL, ΔU), the suggested rejection region is

$$AH_{DM} = \, \{ {-}\uptau_{DM} < T_{DM} < \uptau_{DM} \} ,$$
(35)

where the critical value τDM is chosen to simultaneously attain the nominal Type I error rate when μD = ΔL and ΔU. The assessments can be calculated through the approximation

$$P\{ {-}\uptau_{DM} < T + {\hat{\lambda }}_{DM} < \uptau_{DM} \} \, = \alpha ,$$
(36)

where T ~ tD), \(\hat{\lambda }\) DM = Δ/(\({\hat{\upsigma }}^{2}\) HDM)1/2, and Δ = (ΔU – ΔL)/2. The optimal quantity τDM is a function of α, Δ, N1, N2, \({\hat{\upsigma }}^{2}\), and HDM, and it needs to be calculated by an iterative search algorithm.

It should be noted that the equivalence analysis of the simple effect or response difference between two regression lines is closely related to the Johnson–Neyman problem of Johnson and Neyman [26] and Potthoff [27]. The Johnson–Neyman regions of significance and non-significance are identified with the rejection of, or the failure to reject, the conventional hypothesis of no difference between mean responses. Technical illustrations and implications can be found in Hunka [28], Rogosa [29], and Spiller et al. [30], among others. In contrast, the proposed equivalence test of the simple effect can be used to identify the regions of equivalence and nonequivalence, that is, the ranges of predictor values over which the simple effect is or is not equivalent.

3.3 An Application

The prescribed example of the training study in Table 6.1 of Huitema [21] is utilized to demonstrate the suggested equivalence testing of the trend and simple effects between the first two treatments. In addition to the summary information of the first group, the second training group has N2 = 10, \(\overline{Y}_{2}\) = 39.0000, \({\overline{X}}_{2}\) = 47.0000, and SSX2 = 1798.00. The regression coefficient estimates are \(\{ {\hat{\upbeta}}_{02}, {\hat{\upbeta}}_{12} \} = \{ 15.1863, \, 0.5067\}\) and the sample variance of error is \({\hat{\upsigma }}_{2}^{2}\) = 54.3025. It is readily obtained that \({\hat{\upbeta}_D} = {\hat{\upbeta}}_{11} - {\hat{\upbeta}}_{12} = -0.0087\) and the pooled sample variance is \({\hat{\upsigma }}^{2}\) = 62.4320. The equivalence hypothesis test of the trend effect is presented as H0: βD ≤ –0.25 or 0.25 ≤ βD versus H1: –0.25 < βD < 0.25 (ΔM = 0 and Δ = 0.25). For νD = 16 and α = 0.05, the test statistic is TDS = –0.0338 and the critical value is τDS = 0.1048. Hence, the nonequivalence null hypothesis is rejected at the significance level 0.05. This suggests that the slope coefficients are virtually equivalent, with their difference within the range (–0.25, 0.25).

It is of practical importance to assess the simple effect or the mean response difference between two regression lines. At the particular predictor value XF = 50, the mean response difference is computed as \(\hat{\upmu }_{{\text{D}}} = \hat{\upmu }_{{1}} - \hat{\upmu }_{{2}}\) = –11.5161. For illustration, the equivalence thresholds are set as ΔM = –11 and Δ = 5, and the equivalence test of simple effect is conducted for the hypotheses H0: μD ≤ –16 or –6 ≤ μD versus H1: –16 < μD < –6. With ν = 16 and α = 0.05, the test statistic and critical value are obtained as TDM = –0.1436 and τDM = 0.1684, respectively. Consequently, the nonequivalence null hypothesis is rejected at the significance level 0.05, and the mean response difference is practically equivalent to –11 within the threshold of 5 at XF = 50. In view of the limited features of available software packages, computer programs are developed to facilitate the usage of the proposed equivalence procedures for trend and simple effects.

3.4 Power and Sample Size Calculations

In order to elucidate the critical notion of accommodating the distributional properties of the predictor variables, the continuous covariate variables {X1i, i = 1, …, N1} and {X2j, j = 1, …, N2} are assumed to have the independent normal distributions N(μX1, \(\upsigma_{X1}^{2}\)) and N(μX2, \(\upsigma_{X2}^{2}\)), respectively. It can be readily established that K1 = SSX1/\(\upsigma_{X1}^{2}\) ~ χ2(κ1) and K2 = SSX2/\(\upsigma_{X2}^{2}\) ~ χ2(κ2), where κ1 = N1 – 1 and κ2 = N2 – 1.
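This chi-square result can be checked by simulation. A brief sketch under the stated normality assumption, with parameter values borrowed for illustration from the earlier example:

```python
import numpy as np

rng = np.random.default_rng(1)
N, mu_x, var_x = 10, 52.0, 223.7778  # illustrative settings
reps = 200_000

# Simulate K = SSX / sigma_x^2 and compare its moments with chi-square(N - 1).
x = rng.normal(mu_x, np.sqrt(var_x), size=(reps, N))
ssx = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)
k = ssx / var_x

kappa = N - 1
print(k.mean(), kappa)       # chi-square mean equals kappa
print(k.var(), 2 * kappa)    # chi-square variance equals 2 * kappa
```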

Under the unconditional setting, the power function for trend effect is expressed as

$$\Pi_{DS} = P\{ {-}\uptau_{DS} < T_{DS} < \uptau_{DS} |\Delta_{L} < \upbeta_{D} < \Delta_{U} \}.$$
(37)

Note that the critical value τDS depends on the two statistics \({\hat{\upsigma }}^{2}\) and HDS. With \({\hat{\upsigma }}^{2}\) = σ2(V/ν) and HDS = 1/(\(\upsigma_{X1}^{2}\) K1) + 1/(\(\upsigma_{X2}^{2}\)K2), the power function ΠDS can be rewritten as

$$\Pi_{DS} = E_{(K_1 ,K_2 ,V)} [\Phi \left( {B_{DS} } \right) \, {-}\Phi \left( {A_{DS} } \right)],$$
(38)

where BDS = (ΔM – βD)/(σ2HDS)1/2 + τDS(V/ν)1/2, ADS = (ΔM – βD)/(σ2HDS)1/2 – τDS(V/ν)1/2, and E(K1, K2, V) is taken with respect to the joint distribution of K1, K2, and V.
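The expectation in Eq. (38) can be evaluated by Monte Carlo methods. The sketch below simulates the unconditional rejection rate directly, taking the later numerical settings as assumed population values; the rejection rule is phrased through the Anderson and Hauck p-value, which gives the same rejection region as the critical value τDS:

```python
import numpy as np
from scipy.stats import t

rng = np.random.default_rng(7)

# Assumed population settings, loosely following the training-study example.
beta_d, sigma2 = -0.0087, 62.4320
n1 = n2 = 10
var_x1, var_x2 = 223.7778, 199.7778
dl, du, alpha = -0.25, 0.25, 0.05
nu = n1 + n2 - 4
reps = 100_000

# Random predictor scatter: SSXg / var_xg ~ chi-square(Ng - 1).
ssx1 = var_x1 * rng.chisquare(n1 - 1, reps)
ssx2 = var_x2 * rng.chisquare(n2 - 1, reps)
h_ds = 1.0 / ssx1 + 1.0 / ssx2

# Slope-difference estimate and pooled error variance given the predictors.
bhat = rng.normal(beta_d, np.sqrt(sigma2 * h_ds))
s2 = sigma2 * rng.chisquare(nu, reps) / nu
se = np.sqrt(s2 * h_ds)

# Anderson-Hauck p-value; reject the nonequivalence null when p <= alpha.
ts = (bhat - (dl + du) / 2.0) / se
shift = (du - dl) / (2.0 * se)
p = t.cdf(np.abs(ts) - shift, nu) - t.cdf(-np.abs(ts) - shift, nu)
power_sim = (p <= alpha).mean()
print(power_sim)  # simulated unconditional power
```

With these small group sizes and tight bounds the simulated power is low, which is consistent with the need for the sample size determinations considered later.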

Moreover, the normality assumptions of predictor variables imply that

$$T_{Xg} = \frac{{\overline{X}}_g - X_F } { ({\hat{\upsigma }}_{X_g}^{2} /N_g)^{1/2} } \sim t(\upkappa_g ,\lambda_{X_g} )$$
(39)

where \(\hat{\upsigma}_{Xg}^{2}\) = SSXg/κg and λXg = (μXg – XF)/(\(\upsigma_{Xg}^{2}\)/Ng)1/2 for g = 1 and 2. Following the prescribed power function ΠDS, the power function for mean response difference is presented as
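The noncentral t result in Eq. (39) can likewise be verified numerically; a sketch with the second-group settings of the example taken as assumed population values:

```python
import numpy as np
from scipy.stats import nct

rng = np.random.default_rng(3)

# Second-group settings with XF = 50 (values borrowed for illustration).
n, mu_x, var_x, x_f = 10, 47.0, 199.7778, 50.0
kappa = n - 1
lam = (mu_x - x_f) / np.sqrt(var_x / n)  # lambda_Xg of Eq. (39)

reps = 100_000
x = rng.normal(mu_x, np.sqrt(var_x), size=(reps, n))
xbar = x.mean(axis=1)
s2x = x.var(axis=1, ddof=1)  # SSX / kappa
t_x = (xbar - x_f) / np.sqrt(s2x / n)

# The empirical distribution should match a noncentral t(kappa, lam).
grid = np.array([-2.0, -0.67, 0.0, 1.0])
emp = np.array([(t_x <= g).mean() for g in grid])
dev = np.abs(emp - nct.cdf(grid, kappa, lam)).max()
print(dev)  # maximum deviation over the grid should be small
```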

$$\Pi_{DM} = P\{ {-}\uptau_{DM} < T_{DM} < \uptau_{DM} |\Delta_{L} < \upmu_{D} < \Delta_{U} \} .$$
(40)

Note that the critical value τDM depends on the two terms \({\hat{\upsigma }}^{{2}}\) and HDM. With \({\hat{\upsigma }}^{{2}}\) = σ2(V/ν) and HDM = 1/N1 + 1/N2 + \(T_{X1}^{2}\)/(κ1N1) + \(T_{X2}^{2}\)/(κ2N2), the power function has the alternative form

$$\Pi_{DM} = E_{(T_{X1} ,T_{X2} ,V)} [\Phi \left( {B_{DM} } \right) \, {-}\Phi \left( {A_{DM} } \right)],$$
(41)

where BDM = (ΔM – μD)/(σ2HDM)1/2 + τDM(V/ν)1/2, ADM = (ΔM – μD)/(σ2HDM)1/2 – τDM(V/ν)1/2, and E(TX1, TX2, V) is taken with respect to the joint distribution of TX1, TX2, and V.

It is also tempting to simplify the unconditional distributions of the equivalence test statistics for comparing slope coefficients and mean responses. Conceivably, a straightforward approach is to replace the two means {\({\overline{X}}_{{1}}\), \({\overline{X}}_{2}\)} and sums of squares {SSX1, SSX2} with the corresponding expected values E[\({\overline{X}}_{{1}}\)] = μX1, E[\({\overline{X}}_{2}\)] = μX2, E[SSX1] = κ1\(\upsigma_{X1}^{2}\), and E[SSX2] = κ2\(\upsigma_{X2}^{2}\). Thus, an approximate power function for the equivalence test of trend effect is

$$\Pi_{DSA} = P\{ {-}\uptau_{DS} < t(\nu ,\lambda_{DSA} ) \, < \uptau_{DS} \} ,$$
(42)

where λDSA = (βD – ΔM)/(σ2HDSA)\(^{1/2}\) and HDSA = 1/(κ1\(\upsigma_{X1}^{2}\)) + 1/(κ2\(\upsigma_{X2}^{2}\)). Moreover, the power function of equivalence test of simple effect is expressed as

$$\Pi_{DMA} = P\{ {-}\uptau_{DM} < t(\nu ,\lambda_{DMA} ) \, < \uptau_{DM} \} ,$$
(43)

where λDMA = (μD – ΔM)/(σ2HDMA)\(^{1/2}\) and HDMA = 1/N1 + 1/N2 + (μX1 – XF)2/(κ1\(\upsigma_{X1}^{2}\)) + (μX2 – XF)2/(κ2\(\upsigma_{X2}^{2}\)). Empirical examinations will be conducted to demonstrate the critical differences between the exact and approximate power functions under different levels of information about the predictor variables.
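Both approximate power functions reduce to differences of noncentral t distribution functions, so they can be evaluated in a few lines. A sketch for the trend effect, treating the critical value τDS as given (the value 0.1048 from the application is borrowed for illustration):

```python
import numpy as np
from scipy.stats import nct

def approx_power(tau, nu, lam):
    """Approximate power P{-tau < t(nu, lambda) < tau} as in Eqs. (42)-(43)."""
    return nct.cdf(tau, nu, lam) - nct.cdf(-tau, nu, lam)

# Assumed settings following the trend-effect application (Delta_M = 0).
nu = 16
kappa1 = kappa2 = 9
var_x1, var_x2 = 223.7778, 199.7778
sigma2, beta_d, delta_m = 62.4320, -0.0087, 0.0

h_dsa = 1.0 / (kappa1 * var_x1) + 1.0 / (kappa2 * var_x2)
lam = (beta_d - delta_m) / np.sqrt(sigma2 * h_dsa)  # lambda_DSA
val = approx_power(0.1048, nu, lam)
print(val)  # approximate power at the borrowed critical value
```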

3.5 Numerical Investigations

The model configurations of the first two groups of the training study in Huitema [21] provide a convenient framework for the subsequent simulation study of trend effect and simple effect. For illustration, the key statistics of the response and predictor variables are treated as population parameters to represent potential settings of future investigations for power calculations and sample size determinations. Specifically, the regression coefficients are {β01, β11} = {4.1033, 0.4980} and {β02, β12} = {15.1863, 0.5067}, and the common error variance is σ2 = 62.4320. The means and variances of the two predictor variables are {μX1, \(\upsigma_{X1}^{2}\)} = {52.00, 223.7778} and {μX2, \(\upsigma_{X2}^{2}\)} = {47.00, 199.7778}.

Similar to the prescribed scenario of linear trend and mean response, the numerical investigations contain the determination of optimal sample sizes and a simulation study of power calculations. Throughout the empirical examinations, the Type I error rate and nominal power are fixed as α = 0.05 and 1 – β = 0.80, respectively. First, the trend effect or the slope difference between the two regression lines is βD = –0.0087. Thus, the equivalence tests of trend effect have ΔM = 0 and Δ = 0.2, 0.3, and 0.4 for the equivalence bounds. Second, the mean responses of the two levels of treatment at XF = 50 are μ1 = 29.0040 and μ2 = 40.5200, respectively, and their difference is μD = –11.5161. Accordingly, the equivalence tests of simple effect are performed for ΔM = –11 and Δ = 4, 5, and 6. The optimal sample sizes of both the exact approach and the approximate method were determined for the chosen power value and significance level with balanced and unbalanced structures r = N1/N2 = 1 and 2. The computed sample sizes for the equivalence tests of trend effect and simple effect are presented in Tables 2 and 3, respectively. The results suggest the general pattern that the approximate formulas tend to give smaller sample sizes than the exact techniques. Balanced designs require smaller sample sizes to achieve the nominal power than unbalanced structures. Also, the computed sample size decreases as the threshold bound Δ increases.
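A simplified sketch of the sample size determination for the trend effect is given below. It searches over balanced group sizes with the approximate power function of Eq. (42), obtaining the Anderson and Hauck critical value by root-finding on the p-value; the exact procedure behind Tables 2 and 3 instead integrates over the predictor distributions and will generally return somewhat larger sizes:

```python
import numpy as np
from scipy.stats import t, nct
from scipy.optimize import brentq

def ah_critical(shift, nu, alpha=0.05):
    # Critical value tau: the Anderson-Hauck p-value at |T| = tau equals alpha.
    f = lambda tau: t.cdf(tau - shift, nu) - t.cdf(-tau - shift, nu) - alpha
    return brentq(f, 0.0, shift + 20.0)

def trend_sample_size(beta_d, delta, sigma2, var_x1, var_x2,
                      alpha=0.05, target=0.80, n_max=1000):
    # Smallest balanced per-group size N with approximate power >= target,
    # replacing SSXg by its expected value (Ng - 1) * var_xg.
    for n in range(5, n_max):
        nu = 2 * n - 4
        h = 1.0 / ((n - 1) * var_x1) + 1.0 / ((n - 1) * var_x2)
        se = np.sqrt(sigma2 * h)
        tau = ah_critical(delta / se, nu, alpha)
        lam = beta_d / se  # noncentrality with Delta_M = 0
        power = nct.cdf(tau, nu, lam) - nct.cdf(-tau, nu, lam)
        if power >= target:
            return n, power
    raise ValueError("no size below n_max reaches the target power")

# Illustrative run with Delta = 0.3 and the assumed population settings.
n, power = trend_sample_size(-0.0087, 0.3, 62.4320, 223.7778, 199.7778)
print(n, round(power, 3))
```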

Table 2 Computed sample size, estimated power, and simulated power for the Anderson and Hauck test of trend effect when Type I error α = 0.05 and nominal power 1 – β = 0.80
Table 3 Computed sample size, estimated power, and simulated power for the Anderson and Hauck test of simple effect when XF = 50, Type I error α = 0.05 and nominal power 1 – β = 0.80

To elucidate the accuracy of the sample size calculations, a Monte Carlo simulation study of 10,000 replications was conducted to obtain the simulated powers, which are compared with the estimated powers for the optimal sample sizes. These power values and their associated errors are also presented in the tables. As can be seen from the reported deviations, the exact approaches for trend effect and simple effect maintain small errors in power computation, whereas the approximate methods are not as accurate as their exact counterparts and their performance deteriorates as the sample size decreases. Specifically, the two errors associated with Δ = 0.4 are {–0.0301, –0.0360} and {–0.0172, –0.0157} in Tables 2 and 3, respectively. The overall usefulness of the approximate methods is limited by their undesirable tendencies to underestimate sample sizes and overestimate power levels. According to these findings, the exact power functions and sample size procedures are recommended for general use. The implementation of the suggested power evaluation and sample size determination involves specialized programs not currently available in prevailing statistical packages. Thus, accompanying computer algorithms are presented for conducting the suggested power and sample size calculations.

4 Conclusions

The concept and theory of equivalence have been widely practiced in pharmaceutical sciences and related medical fields. Equivalence testing procedures are also potentially useful in the behavioral and psychological sciences. The technical intuition and computational simplicity of the TOST provide an important motivation to apply appropriate statistical tools for equivalence assessment, rather than the traditional hypothesis tests that purport to detect whether treatment groups significantly differ from one another. Despite its ready applicability, the TOST is generally conservative, and the true Type I error rate can be substantially less than the nominal level for close equivalence bounds and small sample sizes. In contrast, the Anderson and Hauck procedure and other more powerful equivalence tests always have a rejection region with a reasonably controlled significance level.

Within the context of linear regression, one and two regression lines represent the two major scenarios of regression slope appraisal research. Accordingly, the TOST has been applied to assess whether the linear trend is practically negligible in ecological and environmental studies. In view of the potential limitations of the TOST, this study presents extended Anderson and Hauck procedures for equivalence assessment in linear regression analysis. Specifically, equivalence tests are proposed for evaluating the linear trend and mean response of a single regression line, and the trend effect and simple effect between two regression lines. The hypotheses are constructed with asymmetric equivalence bounds and can therefore be readily applied to all equivalence problems concerning regression slopes and mean responses.

Moreover, to enhance the usefulness of the suggested procedures, the advanced issues of power and sample size calculations are also investigated. The proposed power and sample size procedures are derived under the random regression framework and have the distinct feature of accounting for the embedded uncertainty of predictor variables. It is essential to note that the recommended approaches involve statistical evaluations and iterative algorithms not currently available in statistical packages. A full set of computer programs is developed for implementing the suggested equivalence tests and sample size determinations. These research findings expand the conceptual understanding and theoretical development of the Anderson and Hauck procedure for equivalence assessment in linear regression analysis.