1 Introduction

It is very common in applications to transform data before investigating the functional dependence of variables in regression models. The aim of the transformation is to obtain a simpler model, e.g. with a specific structure of the regression function, or a homoscedastic instead of a heteroscedastic model. Typically, flexible parametric classes of transformations are considered from which a suitable one is selected data-dependently. A classical example is the class of Box-Cox power transformations (see Box and Cox (1964)). For purely parametric transformation models, see Carroll and Ruppert (1988) and references therein. Powell (1991) and Mu and He (2007) consider transformation quantile regression models. Nonparametric estimation of the transformation in the context of parametric regression models has been considered by Horowitz (1996) and Chen (2002), among others. Horowitz (2009) reviews estimation in transformation models with parametric regression in the cases where either the transformation or the error distribution or both are modeled nonparametrically. Linton et al. (2008) suggest a profile likelihood estimator for a parametric class of transformations, while the error distribution is estimated nonparametrically and the regression function semi-parametrically. Heuchenne et al. (2015) suggest an estimator of the error distribution in the same model. Neumeyer et al. (2016) consider profile likelihood estimation in heteroscedastic semi-parametric transformation regression models, i.e. the mean and variance functions are modeled nonparametrically, while the transformation function is chosen from a parametric class. A completely nonparametric (homoscedastic) model is considered by Chiappori et al. (2015). Lewbel et al. (2015) provide a test for the validity of such a model. The approach of Chiappori et al. (2015) is modified and corrected by Colling and Van Keilegom (2019). The version of the nonparametric transformation estimator considered in the latter paper is then applied by Colling and Van Keilegom (2020) to suggest a new estimator of the transformation parameter if it is assumed that the transformation belongs to a parametric class.

In general, asymptotic theory for nonparametric transformation estimators is sophisticated, and parametric transformation estimators show much better performance if the parametric model is true. A parametric transformation will thus lead to better estimates of the regression function. Moreover, parametric transformations are easier to interpret and allow for subsequent inference in the transformation model. For the latter purpose note that for transformation models with parametric transformation, lack-of-fit tests for the regression function as well as tests for significance of covariate components have been suggested by Colling and Van Keilegom (2016), Colling and Van Keilegom (2017), Allison et al. (2018) and Kloodt and Neumeyer (2020). Those tests cannot straightforwardly be generalized to nonparametric transformation models because known estimators for that model do not allow for uniform rates of convergence over the whole real line, see Chiappori et al. (2015) and Colling and Van Keilegom (2019).

However, before applying a transformation model with parametric transformation, it would be appropriate to test the goodness-of-fit of the parametric transformation class. In the context of parametric quantile regression, Mu and He (2007) suggest such a goodness-of-fit test. In the context of nonparametric mean regression, Neumeyer et al. (2016) develop a goodness-of-fit test for the parametric transformation class based on an empirical independence process of pairs of residuals and covariates. The latter approach was modified by Hušková et al. (2018), who applied empirical characteristic functions. In a linear regression model with transformation of the response, Szydłowski (2020) suggests a goodness-of-fit test for the parametric transformation class that is based on a distance between the nonparametric transformation estimator considered by Chen (2002) and the parametric class. We will follow a similar approach but consider a nonparametric regression model. The aim of the transformations we consider is to induce independence between errors and covariates. The null hypothesis is that the unknown transformation belongs to a parametric class. Note that when applied to the special case of a class of transformations whose only element is the identity, our test indicates whether a classical homoscedastic regression model (without transformation) is appropriate or whether the response should be transformed first. Our test statistic is based on a minimum distance between a nonparametric transformation and the parametric transformations. We present the asymptotic distribution of the test statistic under the null hypothesis of a parametric transformation and under local alternatives of \(n^{-1/2}\)-rate. Under the null hypothesis, the limit distribution is that of a degenerate U-statistic. With a flexible parametric class, applying an appropriate transformation can reduce the dependence enormously, even if the ‘true’ transformation does not belong to the class. Thus, for the first time in the context of transformation goodness-of-fit tests, we consider testing for so-called precise or relevant hypotheses. Here, the null hypothesis is that the distance between the true transformation and the parametric class is large. If this hypothesis is rejected, then the model with the parametric transformation fits well enough to be considered for further inference. Under the new null hypothesis, the test statistic is asymptotically normally distributed. The term “precise hypotheses” refers to Berger and Delampady (1987). Dette et al. (2020) considered precise hypotheses for comparing mean functions in the context of functional time series. Note that the idea of precise hypotheses is related to that of equivalence tests, which originate from the field of pharmacokinetics (see Lakens (2017)). Throughout, we assume that the nonparametric transformation estimator fulfills an asymptotic linear expansion. It is then shown that the estimator considered by Colling and Van Keilegom (2019) fulfills this expansion and thus can be used for evaluating the test statistic.

The remainder of the paper is organized as follows. In Sect. 2, we present the model and the test statistic. Asymptotic distributions under the null hypothesis of a parametric transformation class and under local alternatives are presented in Sect. 3, which also contains a consistency result and asymptotic results under relevant hypotheses. Section 4 presents a bootstrap algorithm and a simulation study. Section 1 of the supplementary material contains assumptions for bootstrap results, while Section 2 there treats a specific nonparametric transformation estimator and shows that it fulfills the required conditions. The proofs of the main results are given in Section 3 and a rigorous treatment of bootstrap asymptotics is given in Section 4 of the supplement.

2 The model and test statistic

Assume we have observed \((X_i,Y_i)\), \(i=1,\ldots ,n\), which are independent with the same distribution as \((X,Y)\) and fulfill the transformation regression model

$$\begin{aligned} h(Y)=g(X)+\varepsilon , \end{aligned}$$
(1)

where \(E[\varepsilon ]=0\) holds and \(\varepsilon \) is independent of the covariate X, which is \(\mathbb {R}^{d_X}\)-valued, while Y is univariate. The regression function g will be modelled nonparametrically. The transformation \(h:\mathbb {R}\rightarrow \mathbb {R}\) is strictly increasing. Throughout we assume that, given the joint distribution of \((X,Y)\) and some identification conditions, there exists a unique transformation h such that this model is fulfilled. It then follows that the other model components are identified via \(g(x)=E[h(Y)|X=x]\) and \(\varepsilon =h(Y)-g(X)\). See Chiappori et al. (2015) for conditions under which the identifiability of h holds. In particular, conditions are required to fix location and scale, and we will assume throughout that

$$\begin{aligned} h(0)=0\quad \text{ and }\quad h(1)=1. \end{aligned}$$
(2)

Now let \(\{\varLambda _{\theta }:\theta \in \varTheta \}\) be a class of strictly increasing parametric transformation functions \(\varLambda _{\theta }:\mathbb {R}\rightarrow \mathbb {R}\), where \(\varTheta \subseteq {\mathbb {R}}^{d_{\varTheta }}\) is a finite-dimensional parameter space. Our purpose is to test whether a semi-parametric transformation model holds, i.e.

$$\begin{aligned} \varLambda _{\theta _0}(Y)={\tilde{g}}(X)+{{\tilde{\varepsilon }}}, \end{aligned}$$

for some parameter \(\theta _0\in \varTheta \), where \({{\tilde{\varepsilon }}}\) and X are independent. Due to the assumed uniqueness of the transformation h one obtains \(h=h_0\) under validity of the semi-parametric model, where

$$\begin{aligned} h_0(\cdot )=\frac{\varLambda _{\theta _0}(\cdot )-\varLambda _{\theta _0}(0)}{\varLambda _{\theta _0}(1)-\varLambda _{\theta _0}(0)}. \end{aligned}$$

Thus, we can write the null hypothesis as

$$\begin{aligned} H_0:\ h\in \bigg \{\frac{\varLambda _{\theta }(\cdot )-\varLambda _{\theta }(0)}{\varLambda _{\theta }(1)-\varLambda _{\theta }(0)}:\theta \in \varTheta \bigg \} \end{aligned}$$
(3)

which thanks to (2) can be formulated equivalently as

$$\begin{aligned} H_0:\ h\in \bigg \{\frac{\varLambda _{\theta }(\cdot )-c_2}{c_1}:\theta \in \varTheta ,c_1\in \mathbb {R}^+,c_2\in \mathbb {R}\bigg \}. \end{aligned}$$
(4)

Our test statistics will be based on the following \(L^2\)-distance

$$\begin{aligned} d(\varLambda _{\theta },h)= & {} \underset{c_1\in \mathbb {R}^+,c_2\in \mathbb {R}}{\min }\,E\big [w(Y)\{h(Y)c_1+c_2-\varLambda _{\theta }(Y)\}^2\big ], \end{aligned}$$
(5)

where w is a positive weight function with compact support \({\mathcal {Y}}_{w}\). Its empirical counterpart is

$$\begin{aligned} d_n(\varLambda _{\theta },{\hat{h}}):=\underset{c_1\in C_1,c_2\in C_2}{\min }\,\frac{1}{n}\sum _{j=1}^nw(Y_j)\{{\hat{h}}(Y_j)c_1+c_2-\varLambda _{\theta }(Y_j)\}^2, \end{aligned}$$

where \({{\hat{h}}}\) denotes a nonparametric estimator of the true transformation h as discussed below, and \(C_1\subset \mathbb {R}^+\), \(C_2\subset \mathbb {R}\) are compact sets. Assumption (A6) assures that the sets are large enough to contain the true values. Let \(\gamma :=(c_1,c_2,\theta )\) and \(\varUpsilon :=C_1\times C_2 \times \varTheta \). The test statistic is defined as

$$\begin{aligned} T_n=n\min _{\theta \in \varTheta }d_n(\varLambda _{\theta },{\hat{h}})=\underset{\gamma =(c_1,c_2,\theta )\in \varUpsilon }{\min }\,\sum _{j=1}^nw(Y_j)\{{\hat{h}}(Y_j)c_1+c_2-\varLambda _{\theta }(Y_j)\}^2 \end{aligned}$$
(6)

and the null hypothesis should be rejected for large values of the test statistic. If the null hypothesis holds, the minimizing parameters \(c_1,c_2\) in Eq. (5) can be written as \(c_1=\varLambda _{\theta _0}(1)-\varLambda _{\theta _0}(0)\) and \(c_2=\varLambda _{\theta _0}(0)\) for some \(\theta _0\in \varTheta \). Hence, an alternative test statistic

$$\begin{aligned} {\bar{T}}_n=\min _{\theta \in \varTheta }\sum _{j=1}^nw(Y_j)[{\hat{h}}(Y_j)\{\varLambda _{\theta }(1)-\varLambda _{\theta }(0)\}+\varLambda _{\theta }(0)-\varLambda _{\theta }(Y_j)]^2 \end{aligned}$$
(7)

can be considered as well.
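For illustration, a minimal R sketch of how \(T_n\) from (6) and \({\bar{T}}_n\) from (7) could be computed is given below; it is not the implementation used in this paper. The arguments hhat (a nonparametric transformation estimator), Lambda (the parametric class), w (the weight function) and the parameter interval are placeholders, the transformation parameter is assumed to be scalar, and the compactness constraints on \((c_1,c_2)\) are dropped, so that the inner minimization in (6) reduces to a weighted least-squares fit.

```r
# Sketch only: T_n from (6) and bar(T)_n from (7) for a scalar parameter theta.
# hhat, Lambda and w are placeholder functions supplied by the user.
Tn_stat <- function(Y, hhat, Lambda, w, theta_int = c(0, 2)) {
  wY <- w(Y); hY <- hhat(Y)
  obj <- function(theta) {
    # inner minimisation over (c_2, c_1): weighted least squares of Lambda_theta(Y_j) on hhat(Y_j)
    fit <- lm(Lambda(Y, theta) ~ hY, weights = wY)
    sum(wY * residuals(fit)^2)
  }
  opt <- optimize(obj, interval = theta_int)
  list(value = opt$objective, theta = opt$minimum)  # T_n and the minimising theta
}

Tbar_stat <- function(Y, hhat, Lambda, w, theta_int = c(0, 2)) {
  wY <- w(Y); hY <- hhat(Y)
  obj <- function(theta) {
    c1 <- Lambda(1, theta) - Lambda(0, theta)       # c_1, c_2 fixed as in (7)
    c2 <- Lambda(0, theta)
    sum(wY * (hY * c1 + c2 - Lambda(Y, theta))^2)
  }
  optimize(obj, interval = theta_int)$objective
}
```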

We will derive the asymptotic distributions under the null hypothesis and local and fixed alternatives in Section 3 and suggest a bootstrap version of the tests in Section 4.

Remark 2.1

Colling and Van Keilegom (2019) consider the estimators

$$\begin{aligned} {\hat{\theta }}:=\arg \underset{\theta \in \varTheta }{\min }\,d_n(\varLambda _{\theta },{\hat{h}}) \end{aligned}$$

and

$$\begin{aligned} {\tilde{\theta }}:= \arg \min _{\theta \in \varTheta }n^{-1}\sum _{j=1}^nw(Y_j)[{\hat{h}}(Y_j)\{\varLambda _{\theta }(1)-\varLambda _{\theta }(0)\}+\varLambda _{\theta }(0)-\varLambda _{\theta }(Y_j)]^2 \end{aligned}$$

for the transformation parameter (assuming \(H_0\)), corresponding to \(T_n\) and \({\bar{T}}_n\), respectively. They observe that \({\hat{\theta }}\) outperforms \({\tilde{\theta }}\) in simulations.

Nonparametric estimation of the transformation h has been considered by Chiappori et al. (2015) and Colling and Van Keilegom (2019). For our main asymptotic results, we require that \({{\hat{h}}}\) admits a linear expansion, not only under the null hypothesis, but also under fixed alternatives and the local alternatives as defined in the next section. The linear expansion should have the form

$$\begin{aligned} {\hat{h}}(y)-h(y)=\frac{1}{n}\sum _{i=1}^n\psi (Z_i,{\mathcal {T}}(y))+o_P(n^{-1/2}) \text{ uniformly } \text{ in } y\in {\mathcal {Y}}_{w}. \end{aligned}$$
(8)

Here, \(\psi \) needs to fulfil condition (A8) in Section 3, and we use the definitions (\(i=1,\dots ,n\))

$$\begin{aligned} Z_i= & {} (U_i,X_i),\quad U_i\;=\;{\mathcal {T}}(Y_i),\quad {\mathcal {T}}(y)=\frac{F_Y(y)-F_Y(0)}{F_Y(1)-F_Y(0)}, \end{aligned}$$
(9)

where \(F_Y\) denotes the distribution function of Y and is assumed to be strictly increasing on the support of Y. To ensure that \({\mathcal {T}}\) is well-defined, the values 0 and 1 are w.l.o.g. assumed to belong to the support of Y, but they can be replaced by arbitrary values \(a<b\) in the support of Y. The expansion (8) could also be formulated with a linear term \(n^{-1}\sum _{i=1}^n{{\tilde{\psi }}}(X_i,Y_i,y)\). In Section 2 of the supplement, we reproduce the definition of the estimator \({{\hat{h}}}\) that was suggested by Colling and Van Keilegom (2019) as a modification of the estimator by Chiappori et al. (2015). We give regularity assumptions under which the desired expansion holds, see Lemma 1. Other nonparametric estimators of the transformation that fulfill the expansion could be applied as well.
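The following small R helper (an illustration, not part of the estimation procedure of Colling and Van Keilegom (2019)) shows how the map \({\mathcal {T}}\) in (9), and hence the quantities \(U_i={\mathcal {T}}(Y_i)\), could be evaluated when \(F_Y\) is replaced by the empirical distribution function of the responses.

```r
# Sketch only: empirical counterpart of T(y) = (F_Y(y) - F_Y(0)) / (F_Y(1) - F_Y(0)).
That_emp <- function(y, Y) {
  FY <- ecdf(Y)                        # empirical distribution function of the responses
  (FY(y) - FY(0)) / (FY(1) - FY(0))    # standardisation at the points 0 and 1 from (2)
}
# Example: U <- That_emp(Y, Y) gives the estimated U_i = T(Y_i) from (9).
```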

3 Asymptotic results

In this section, we will derive the asymptotic distribution under the null hypothesis and under local and fixed alternatives. For the formulation of the local alternatives, consider the null hypothesis as given in (4), i.e. \(h(\cdot )c_1+c_2=\varLambda _{\theta _0}(\cdot )\) for some \(\theta _0\in \varTheta \), \(c_1\in \mathbb {R}^+\), \(c_2\in \mathbb {R}\), and instead assume

$$\begin{aligned} H_{1,n}: h(\cdot )c_1+c_2=\varLambda _{\theta _0}(\cdot )+n^{-1/2}r(\cdot ) \text{ for some } \theta _0\in \varTheta ,\ c_1\in {\mathbb {R}}^+,\ c_2\in {\mathbb {R}} \text{ and some function } r. \end{aligned}$$

Due to the identifiability conditions (2), one obtains \(c_2=\varLambda _{\theta _0}(0)+n^{-1/2}r(0)\) and \(c_1=\varLambda _{\theta _0}(1)-\varLambda _{\theta _0}(0)+n^{-1/2}(r(1)-r(0))\). Assumption (A5) yields boundedness of r, so that we rewrite the local alternative as

$$\begin{aligned} \nonumber h(\cdot )= & {} \frac{\varLambda _{\theta _0}(\cdot )-\varLambda _{\theta _0}(0)+n^{-1/2}(r(\cdot )-r(0))}{\varLambda _{\theta _0}(1)-\varLambda _{\theta _0}(0) +n^{-1/2}(r(1)-r(0))}\\= & {} h_0(\cdot )+n^{-1/2}r_0(\cdot )+o(n^{-1/2}), \end{aligned}$$
(10)

where \(h_0(\cdot )=(\varLambda _{\theta _0}(\cdot )-\varLambda _{\theta _0}(0))/(\varLambda _{\theta _0}(1)-\varLambda _{\theta _0}(0))\) and

$$\begin{aligned} r_0(\cdot )&=\frac{r(\cdot )-r(0)-h_0(\cdot )(r(1)-r(0))}{\varLambda _{\theta _0}(1)-\varLambda _{\theta _0}(0)}. \end{aligned}$$

Note that the null hypothesis \(H_0\) is included in the local alternative \(H_{1,n}\) by considering \(r\equiv 0\) which gives \(h=h_0\). We assume the following data generating model under the local alternative \(H_{1,n}\). Let the regression function g, the errors \(\varepsilon _i\) and the covariates \(X_i\) be independent of n and define \(Y_i=h^{-1}(g(X_i)+\varepsilon _i)\) (\(i=1,\dots ,n\)), which under local alternatives depends on n through the transformation h. Throughout we use the notation (\(i=1,\dots ,n\))

$$\begin{aligned} S_i=h(Y_i)=g(X_i)+\varepsilon _i. \end{aligned}$$
(11)

Further, recall the definition of \(U_i\) in (9). Note that the distribution of \(U_i\) does not depend on n, even under local alternatives, because \(F_Y(Y_i)\) is uniformly distributed on [0, 1], while \(F_Y(0)=P(Y_i\le 0)=P(h(Y_i)\le h(0))=P(S_i\le 0)\) due to (2), and similarly \(F_Y(1)=P(S_i\le 1)\).

To formulate our main result, we need some more notation. With \(\psi \) from (8), \(Z_i\) from (9) and \(S_i\) from (11) define (\(i=1,\dots ,n\))

$$\begin{aligned} {\dot{\varLambda }}_{\theta }(y)= & {} \bigg (\frac{\partial }{\partial \theta _k}\varLambda _{\theta }(y)\bigg )_{k=1,\ldots ,d_{\varTheta }}\nonumber \\ R(s)= & {} (s,1,-{\dot{\varLambda }}_{\theta _0}(h_0^{-1}(s)))^t \end{aligned}$$
(12)
$$\begin{aligned} R_f(s)= & {} \big ({\dot{\varLambda }}_{\theta }(1)^t-{\dot{\varLambda }}_{\theta }(0)^t,{\dot{\varLambda }}_{\theta }(0)^t,{\dot{\varLambda }}_{\theta _0}(h_0^{-1}(s))^t\big )(s,1,-1)^t \end{aligned}$$
(13)
$$\begin{aligned} \varGamma _0= & {} E[w(h_0^{-1}(S_1))R(S_1)R(S_1)^t] \end{aligned}$$
(14)
$$\begin{aligned} \varGamma _{0,f}= & {} E[w(h_0^{-1}(S_1))R_f(S_1)R_f(S_1)^t] \end{aligned}$$
(15)
$$\begin{aligned} \varphi (z)= & {} E[w(h_0^{-1}(S_2))\psi (Z_1,U_2)R(S_2)\mid Z_1=z] \end{aligned}$$
(16)
$$\begin{aligned} \varphi _f(z)= & {} E[w(h_0^{-1}(S_2))\psi (Z_1,U_2)R_f(S_2)\mid Z_1=z] \end{aligned}$$
(17)
$$\begin{aligned} \zeta (z_1,z_2)= & {} E\Big [w(h_0^{-1}(S_3))\big \{\psi (Z_1,U_3)-\varphi (Z_1)^t\varGamma _0^{-1}R(S_3)\big \} \nonumber \\&\qquad \times \big \{\psi (Z_2,U_3)-\varphi (Z_2)^t\varGamma _0^{-1}R(S_3)\big \}\mid Z_1=z_1,Z_2=z_2\Big ] \end{aligned}$$
(18)
$$\begin{aligned} \zeta _f(z_1,z_2)= & {} E\Big [w(h_0^{-1}(S_3))\big \{\psi (Z_1,U_3)-\varphi _f(Z_1)^t\varGamma _{0,f}^{-1}R_f(S_3)\big \} \nonumber \\&\qquad \times \big \{\psi (Z_2,U_3)-\varphi _f(Z_2)^t\varGamma _{0,f}^{-1}R_f(S_3)\big \}\mid Z_1=z_1,Z_2=z_2\Big ] \nonumber \\\end{aligned}$$
(19)
$$\begin{aligned} {\bar{r}}(s)= & {} r_0(h_0^{-1}(s))-E[w(h_0^{-1}(S_1))r_0(h_0^{-1}(S_1))R(S_1)]^t\varGamma _0^{-1}R(s) \end{aligned}$$
(20)
$$\begin{aligned} {\bar{r}}_f(s)= & {} r_0(h_0^{-1}(s))-E[w(h_0^{-1}(S_1))r_0(h_0^{-1}(S_1))R_f(S_1)]^t\varGamma _{0,f}^{-1}R_f(s) \end{aligned}$$
(21)
$$\begin{aligned} {\tilde{\zeta }}(z_1)= & {} 2E[w(h_0^{-1}(S_2))\psi (Z_1,U_2){\bar{r}}(S_2)\mid Z_1=z_1] \end{aligned}$$
(22)
$$\begin{aligned} {\tilde{\zeta }}_f(z_1)= & {} 2E[w(h_0^{-1}(S_2))\psi (Z_1,U_2){\bar{r}}_f(S_2)\mid Z_1=z_1] \end{aligned}$$
(23)

and let \(P^Z\) and \(F_Z\) denote the law and the distribution function, respectively, of \(Z_i\). The quantities which are marked with an “f”, referring to the “fixed” parameters \(c_1=\varLambda _{{\hat{\theta }}}(1)-\varLambda _{{\hat{\theta }}}(0)\) and \(c_2=\varLambda _{{\hat{\theta }}}(0)\), will be used to describe the asymptotic behaviour of the test statistic \({\bar{T}}_n\).

With these notations, the assumptions for the asymptotic results can be formulated. To this end, let \({\mathcal {Y}}\) denote the support of Y (which depends on n under local alternatives). Further, \(F_S\) denotes the distribution function of \(S_1\) as in (11) and \({\mathcal {T}}_S\) denotes the transformation \(s\mapsto (F_S(s)-F_S(0))/(F_S(1)-F_S(0))\). The following assumptions are used.

(A1):

The sets \(C_1,C_2\) and \(\varTheta \) are compact.

(A2):

The weight function w is continuous with a compact support \({\mathcal {Y}}_{w}\subset {\mathcal {Y}}\).

(A3):

The map \((y,\theta )\mapsto \varLambda _{\theta }(y)\) is twice continuously differentiable on \({\mathcal {Y}}_{w}\) with respect to \(\theta \) and the (partial) derivatives are continuous in \((y,\theta )\in {\mathcal {Y}}_{w}\times \varTheta \).

(A4):

There exists a unique strictly increasing and continuous transformation h such that model (1) holds with X independent of \(\varepsilon \).

(A5):

The function \(h_0\) defined in (10) is strictly increasing and continuously differentiable and r is continuous on \({\mathcal {Y}}_{w}\). \(F_Y\) is strictly increasing on the support of Y.

(A6):

Minimizing the functions \(M:\varUpsilon \rightarrow \mathbb {R},\gamma =(c_1,c_2,\theta )\mapsto E\big [w(Y)(h_0(Y)c_1+c_2-\varLambda _{\theta }(Y))^2\big ]\) and \({\bar{M}}:\varTheta \rightarrow \mathbb {R},\theta \mapsto E\big [w(Y)(h_0(Y)(\varLambda _{\theta }(1)-\varLambda _{\theta }(0))+\varLambda _{\theta }(0)-\varLambda _{\theta }(Y))^2\big ]\) leads to unique solutions \(\gamma _0=(c_{1,0},c_{2,0},\theta _0)\) and \(\theta _0\) in the interior of \(\varUpsilon \) and \(\varTheta \), respectively. For all \(\theta \ne {\tilde{\theta }}\), one has \(\underset{y\in {\text {supp}}(w)}{\sup }\,\big |\frac{\varLambda _{\theta }(y)-\varLambda _{\theta }(0)}{\varLambda _{\theta }(1)-\varLambda _{\theta }(0)}-\frac{\varLambda _{{\tilde{\theta }}}(y)-\varLambda _{{\tilde{\theta }}}(0)}{\varLambda _{{\tilde{\theta }}}(1)-\varLambda _{{\tilde{\theta }}}(0)}\big |>0\).

(A7):

The Hessian matrices \(\varGamma _0:={\text {Hess}}\,M(\gamma _0)\) and \(\varGamma _{0,f}:={\text {Hess}}\,{\bar{M}}(\theta _0)\) are positive definite.

(A8):

The transformation estimator \({\hat{h}}\) fulfils (8) for some function \(\psi \). For some \({\mathcal {U}}_0\) (independent of n under local alternatives) with \({\mathcal {T}}_S(h({\mathcal {Y}}_{w}))\subset {\mathcal {U}}_0\) the function class \(\{z\mapsto \psi (z,t):t\in {\mathcal {U}}_0\}\) is Donsker with respect to \(P^Z\) and \(E[\psi (Z_1,t)]=0\) for all \(t\in {\mathcal {U}}_0\). The fourth moments \(E[w(h_0^{-1}(S_1))\psi (Z_1,U_1)^4]\) and \(E[w(h_0^{-1}(S_1))\psi (Z_2,U_1)^4]\) are finite.

When considering a fixed alternative \(H_1\) or the relevant hypothesis \(H'_0\) below, (A6) and (A8) are replaced by the following Assumptions (A6’) and (A8’) (Assumption (A8’) is only relevant for \(H'_0\)). Note that h is then a fixed function, not depending on n.

(A6’):

Minimizing the functions \(M:\varUpsilon \rightarrow \mathbb {R},\gamma =(c_1,c_2,\theta )\mapsto E\big [w(Y)(h(Y)c_1+c_2-\varLambda _{\theta }(Y))^2\big ]\) and \({\bar{M}}:\varTheta \rightarrow \mathbb {R},\theta \mapsto E\big [w(Y)(h(Y)(\varLambda _{\theta }(1)-\varLambda _{\theta }(0))+\varLambda _{\theta }(0)-\varLambda _{\theta }(Y))^2\big ]\) leads to unique solutions \(\gamma _0=(c_{1,0},c_{2,0},\theta _0)\) and \(\theta _0\) in the interior of \(\varUpsilon \) and \(\varTheta \), respectively. For all \(\theta \ne {\tilde{\theta }}\), one has \(\underset{y\in {\text {supp}}(w)}{\sup }\,\big |\frac{\varLambda _{\theta }(y)-\varLambda _{\theta }(0)}{\varLambda _{\theta }(1)-\varLambda _{\theta }(0)}-\frac{\varLambda _{{\tilde{\theta }}}(y)-\varLambda _{{\tilde{\theta }}}(0)}{\varLambda _{{\tilde{\theta }}}(1)-\varLambda _{{\tilde{\theta }}}(0)}\big |>0\).

(A8’):

The transformation estimator \({\hat{h}}\) fulfills (8) for some function \(\psi \). For some \({\mathcal {U}}_0\supset {\mathcal {T}}_S(h({\mathcal {Y}}_{w}))\), the function class \(\{z\mapsto \psi (z,t):t\in {\mathcal {U}}_0\}\) is Donsker with respect to \(P^Z\) and \(E[\psi (Z_1,t)]=0\) for all \(t\in {\mathcal {U}}_0\). Further, one has \(E[\psi (Z_1,U_2)^2]<\infty \).

Remark 3.1

Assumptions concerning compactness of the parameter spaces, differentiability of model components and uniqueness of the minimizer \(\gamma _0\) are standard assumptions in the context of goodness-of-fit tests. Moreover, it can be shown that the definitions of \(\varGamma _0\) and \(\varGamma _{0,f}\) in (A7) coincide with those in Eqs. (14) and (15), respectively. Assumption (A8) controls the asymptotic behaviour of \({\hat{h}}-h\) and thus the rate of local alternatives which can be detected. The Donsker and boundedness conditions are needed to obtain uniform convergence rates of \({\hat{h}}-h\) and some negligible remainders in the proof. Assumption (A8’) is the counterpart of Assumption (A8) for precise hypotheses as considered in (24).

Theorem 3.2

Assume (A1)–(A8). Let \((\lambda _{k})_{k\in \{1,2,\dots \}}\) and \((\lambda _{k,f})_{k\in \{1,2,\dots \}}\) be the eigenvalues of the operators

$$\begin{aligned} K\rho (z_1):=\int \rho (z_2)\zeta (z_1,z_2)\,{\text {d}}F_{Z}(z_2)\quad \textit{and}\quad K_f\rho (z_1):=\int \rho (z_2)\zeta _f(z_1,z_2)\,{\text {d}}F_{Z}(z_2), \end{aligned}$$

respectively, with corresponding eigenfunctions \((\rho _{k})_{k\in \{1,2,\dots \}}\) and \((\rho _{k,f})_{k\in \{1,2,\dots \}}\), which are each orthonormal in the \(L^2\)-space corresponding to the distribution \(F_Z\). Let \((W_k)_{k\in \{1,2,\dots \}}\) be independent and standard normally distributed random variables and let \(W_0\) be centred normally distributed with variance \(E[{{\tilde{\zeta }}}(Z_1)^2]\) such that for all \(K\in {\mathbb {N}}\) the random vector \((W_0,W_1,\dots ,W_K)^t\) follows a multivariate normal distribution with \({\text {Cov}}(W_0,W_k)=E[{\tilde{\zeta }}(Z_1)\rho _k(Z_1)]\) for all \(k=1,\dots ,K\). Let \(W_{0,f}\) and \((W_{k,f})_{k\in \{1,2,\dots \}}\) be defined similarly with \(E[W_{0,f}^2]=E[{\tilde{\zeta }}_f(Z_1)^2]\) and \({\text {Cov}}(W_{0,f},W_{k,f})=E[{\tilde{\zeta }}_f(Z_1)\rho _{k,f}(Z_1)]\) for all \(k\in {\mathbb {N}}\). Then, under the local alternative \(H_{1,n}\), \(T_n\) converges in distribution to

$$\begin{aligned} (\varLambda _{\theta _0}(1)-\varLambda _{\theta _0}(0))^2\Bigg (\sum _{k=1}^\infty \lambda _{k}W_k^2+W_0+ E\left[ w(h_0^{-1}(S_1)){\bar{r}}(S_1)^2\right] \Bigg ) \end{aligned}$$

and \({\bar{T}}_n\) converges in distribution to

$$\begin{aligned} (\varLambda _{\theta _0}(1)-\varLambda _{\theta _0}(0))^2\Bigg (\sum _{k=1}^\infty \lambda _{k,f}W_{k,f}^2+W_{0,f}+ E\left[ w(h_0^{-1}(S_1)){\bar{r}}_f(S_1)^2\right] \Bigg ). \end{aligned}$$

In particular, under \(H_0\) (i.e. for \(r\equiv 0\)), \(T_n\) and \({\bar{T}}_n\) converge in distribution to

$$\begin{aligned} T=(\varLambda _{\theta _0}(1)-\varLambda _{\theta _0}(0))^2\sum _{k=1}^\infty \lambda _{k}W_k^2\quad \textit{and}\quad {\bar{T}}=(\varLambda _{\theta _0}(1)-\varLambda _{\theta _0}(0))^2\sum _{k=1}^\infty \lambda _{k,f}W_{k,f}^2, \end{aligned}$$

respectively.

The proof is given in Section 3 of the supplementary material. Asymptotic level-\(\alpha \) tests should reject \(H_0\) if \(T_n\) or \({\bar{T}}_n\) are larger than the \((1-\alpha )\)-quantile of the distribution of T or \({\bar{T}}\), respectively. As the distributions of T and \({\bar{T}}\) depend in a complicated way on unknown quantities, we will propose a bootstrap procedure in Section 4. Although most results hold similarly for \(T_n\) and \({\bar{T}}_n\), for ease of presentation, we will mainly focus on results for \(T_n\) in the remainder.

Remark 3.3

   

  1.

    Note that \(\zeta (z_1,z_2)=E[I(z_1)I(z_2)]\) with

    $$\begin{aligned} I(z):=w(h_0^{-1}(S_1))^{1/2}\left( \psi (z,U_1)-\varphi (z)^t\varGamma _0^{-1}R(S_1)\right) \end{aligned}$$

    and \(\psi \) from (8). Thus, the operator K defined in Theorem 3.2 is positive semi-definite.

  2.

    The appearance of \(W_0\) under the local alternative results from asymptotic theory for degenerate U-statistics. Related phenomena occur in the case of quadratic forms. Similar to the proof of Theorem 3.2, consider some \(z_0:=z_n+cn^{-1/2}\), where \(n^{1/2}z_n\) converges to a centred normally distributed random variable, say z, and we have \(c=0\) under \(H_0\). Moreover, consider a quadratic form \(z_n^TA_nz_n\), where \(A_n\) is a positive definite matrix and \(n^{-1} A_n\) converges to a matrix A. Then, under \(H_0\), \(z_0^T A_n z_0=z_n^T A_n z_n\) converges to \(z^TAz\), which has a \(\chi ^2\) distribution. However, under \(H_{1,n}\), we have

    $$\begin{aligned} z_0^T A_n z_0= z_n^T A_n z_n +2 c^T n^{-1/2} A_n z_n+c^T n^{-1}A_nc, \end{aligned}$$

    where the first term on the right-hand side is as before. However, the second term converges to \(2c^TAz\), which is normally distributed and corresponds to \(W_0\) in our context. The last term converges to the constant \(c^TAc\), corresponding to the constant summand in the limit in Theorem 3.2. Note that the limit of \(z_0^T A_n z_0\) cannot be negative due to the positive definiteness of \(A_n\).

Next, we consider fixed alternatives, for which the transformation h does not belong to the parametric class, i.e.

$$\begin{aligned} H_1:\quad d(\varLambda _{\theta },h)>0\quad \text{ for all } \theta \in \varTheta . \end{aligned}$$

Theorem 3.4

Assume (A1)–(A4), (A6’) and let \({\hat{h}}\) estimate h uniformly consistently on compact sets. Then, under \(H_1\), \(\lim _{n\rightarrow \infty }P(T_n>q)=1\) and \(\lim _{n\rightarrow \infty }P({\bar{T}}_n>q)=1\) for all \(q\in \mathbb {R}\), that is, the proposed tests are consistent.

The proof is given in Section 3 of the supplement.

The transformation model with a parametric transformation class might be useful in applications even if the model does not hold exactly. With a good choice of \(\theta \), applying the transformation \(\varLambda _\theta \) can reduce the dependence between covariates and errors enormously. Estimating an appropriate \(\theta \) is much easier than estimating the transformation h nonparametrically. Consequently, one might prefer the semi-parametric transformation model over a completely nonparametric one. It is then of interest how far away we are from the true model. Therefore, in the following, we consider testing precise hypotheses (relevant hypotheses)

$$\begin{aligned} H'_0:\underset{\theta \in \varTheta }{\min }\,d(\varLambda _{\theta },h)\ge \eta \quad \text{ and }\quad H'_1:\underset{\theta \in \varTheta }{\min }\,d(\varLambda _{\theta },h)<\eta . \end{aligned}$$
(24)

If a suitable test rejects \(H_0'\) for some small \(\eta \) (fixed beforehand by the experimenter), the model is considered “good enough” to work with, even if it does not hold exactly. To test these hypotheses, we will use the same test statistic as before, but we have to standardize it differently. Assume \(H_0'\); then h is a transformation which does not belong to the parametric class, i.e. the former fixed alternative \(H_1\) holds. Let

$$\begin{aligned} M(\gamma )=M(c_1,c_2,\theta )=E\{w(Y)(h(Y)c_1+c_2-\varLambda _{\theta }(Y))^2\}, \end{aligned}$$

and let

$$\begin{aligned} \gamma _0=(c_{1,0},c_{2,0},\theta _0):=\arg \underset{(c_1,c_2,\theta )\in \varUpsilon }{\min }\,M(c_1,c_2,\theta ). \end{aligned}$$

If \({\mathbb {R}}^+\) and \({\mathbb {R}}\) are replaced by \(C_1\) and \(C_2\) in the definition of d in (5), one has \(\underset{c_1\in C_1,c_2\in C_2}{\min }\,M(\gamma )=d(\varLambda _{\theta },h)\) for all \(\theta \in \varTheta \). Assume that

$$\begin{aligned} \varGamma '=E\left[ w(Y_1) \left( \begin{array}{ccc} h(Y_1)^2&{}h(Y_1)&{}-h(Y_1){\dot{\varLambda }}_{\theta _0}(Y_1)\\ h(Y_1)&{}1&{}-{\dot{\varLambda }}_{\theta _0}(Y_1)\\ -h(Y_1){\dot{\varLambda }}_{\theta _0}(Y_1)^t&{}-{\dot{\varLambda }}_{\theta _0}(Y_1)^t&{}\varGamma '_{3,3} \end{array} \right) \right] \end{aligned}$$
(25)

is positive definite, where \(\varGamma '_{3,3}={\dot{\varLambda }}_{\theta _0}(Y_1)^t{\dot{\varLambda }}_{\theta _0}(Y_1)-\ddot{\varLambda }_{\theta _0}(Y_1){\tilde{R}}_1\) with

$$\begin{aligned} \ddot{\varLambda }_{\theta }(y)=\bigg (\frac{\partial ^2}{\partial \theta _k\partial \theta _\ell }\varLambda _{\theta }(y)\bigg )_{k,\ell =1,\ldots ,d_{\varTheta }} \end{aligned}$$

and \({\tilde{R}}_i=h(Y_i)c_{1,0}+c_{2,0}-\varLambda _{\theta _0}(Y_i)\) \( (i=1,\dots ,n)\).

Theorem 3.5

Assume (A1)–(A4), (A6’), (A8’), let (A7) hold with \(\gamma _0\) from (A6’) and let \(\varGamma '\) be positive definite. Then,

$$\begin{aligned} n^{1/2}(T_n/n-M(\gamma _0))\overset{{\mathcal {D}}}{\rightarrow }{\mathcal {N}}\big (0,\sigma ^2\big ) \end{aligned}$$

with \(\sigma ^2={\text {Var}}\big (w(Y_1){\tilde{R}}_1^2+\delta (Z_1)\big )\), where \(\delta (Z_1)=2c_{1,0}E[w(Y_2)\psi (Z_1,U_2){\tilde{R}}_2\mid Z_1]\).

The proof is given in Section 3 of the supplementary material. It is conjectured that a similar result can be derived for \({\bar{T}}_n\), although the corresponding Hessian matrix might become more complex.

A consistent asymptotic level-\(\alpha \) test rejects \(H'_0\) if \((T_n-n\eta )/(n{\hat{\sigma }}^2)^{1/2}<u_{\alpha }\), where \(u_{\alpha }\) is the \(\alpha \)-quantile of the standard normal distribution and \({\hat{\sigma }}^2\) is a consistent estimator of \(\sigma ^2\). Further research is required on suitable estimators of \(\sigma ^2\). Let \({\hat{\gamma }}=({\hat{c}}_1,{\hat{c}}_2,{\hat{\theta }})^t\) be the minimizer in Eq. (6). For some intermediate sequences \((m_n)_{n\in {\mathbb {N}}},(q_n)_{n\in {\mathbb {N}}}\) with \(q_n=\lfloor n/m_n\rfloor -1\), we considered

$$\begin{aligned} {\hat{\sigma }}^2:=&\frac{1}{q_n}\sum _{s=1}^{q_n}\bigg (\frac{2{\hat{c}}_1\sqrt{m_n}}{n}\sum _{k=1}^nw(Y_k)\big ({\hat{h}}^{(s)}(Y_k)-{\hat{h}}(Y_k)\big )({\hat{h}}(Y_k){\hat{c}}_1+{\hat{c}}_2-\varLambda _{{\hat{\theta }}}(Y_k))\\&+\frac{1}{\sqrt{m_n}}\sum _{j=(s-1)m_n+1}^{sm_n}\bigg (w(Y_j)({\hat{h}}(Y_j){\hat{c}}_1+{\hat{c}}_2-\varLambda _{{\hat{\theta }}}(Y_j))^2\\&-\frac{1}{n}\sum _{i=1}^nw(Y_i)({\hat{h}}(Y_i){\hat{c}}_1+{\hat{c}}_2-\varLambda _{{\hat{\theta }}}(Y_i))^2\bigg )\bigg )^2 \end{aligned}$$

as an estimator of \(\sigma ^2\), where \({\hat{h}}^{(s)}\) denotes the nonparametric estimator of h based on the subsample \((Y_{(s-1)m_n+1},X_{(s-1)m_n+1}),\ldots ,(Y_{sm_n},X_{sm_n}),s=1,\ldots ,q_n\), but suitable choices for \(m_n\) are still unclear. Alternatively, a self-normalization approach as in Shao (2010), Shao and Zhang (2010) or Dette et al. (2020) can be applied. For this purpose, let \(s\in (0,1)\) and let \({\hat{h}}_{s}\) and \({\hat{\gamma }}_{s}=({\hat{c}}_{1,s},{\hat{c}}_{2,s},{\hat{\theta }}_s)^t\) be defined as \({\hat{h}}\) and \({\hat{\gamma }}\), but based on the subsample \((Y_{1},X_{1}),\ldots ,(Y_{\lfloor ns\rfloor },X_{\lfloor ns\rfloor })\). Moreover, let \(K\in {\mathbb {N}},0<t_1<\cdots<t_K<1\) and let \(\nu \) be a probability measure on (0, 1) with \(\nu (\{t_1,\ldots ,t_K\})=1\). Define

$$\begin{aligned} V_n:=&\int _0^1\bigg (\sum _{k=1}^{\lfloor ns \rfloor }w(Y_k)({\hat{h}}_s(Y_k){\hat{c}}_{1,s}+{\hat{c}}_{2,s}-\varLambda _{{\hat{\theta }}_s}(Y_k))^2\\&\quad -s\sum _{k=1}^{n}w(Y_k)({\hat{h}}(Y_k){\hat{c}}_{1}+{\hat{c}}_{2}-\varLambda _{{\hat{\theta }}}(Y_k))^2\bigg )^2\,\nu ({\text {d}}s) \end{aligned}$$

as well as

$$\begin{aligned} \tilde{T}_n:=\frac{T_n-nM(\gamma _0)}{\sqrt{V_n}}. \end{aligned}$$

In Section 5 of the supplementary material, it is shown that \(\tilde{T}_n\overset{{\mathcal {D}}}{\rightarrow }\tilde{T}\) for some random variable \(\tilde{T}\) and that the distribution of \(\tilde{T}\) does not depend on any unknown parameters. Hence, its quantiles can be simulated and \(\tilde{T}_n\) can be used to test for the hypotheses \(H_0'\) and \(H_1'\).
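To illustrate the self-normalization, the following R sketch computes \(V_n\) for the discrete measure \(\nu \); it assumes a user-supplied function fit_rss(k) (a hypothetical name) that recomputes \({\hat{h}}\) and \({\hat{\gamma }}\) on the first k observations and returns the minimized weighted residual sum over these observations, so that fit_rss(n) equals \(T_n\). Replacing the unknown \(M(\gamma _0)\) by the threshold \(\eta \) in \(\tilde{T}_n\), as indicated in the final comment, is an assumption of this sketch rather than a prescription of the paper.

```r
# Sketch only: the self-normaliser V_n for a discrete measure nu on points t.
Vn_selfnorm <- function(fit_rss, n, t = c(0.6, 0.7, 0.8, 0.9),
                        nu = c(0.1, 0.2, 0.3, 0.4)) {
  Tn_full <- fit_rss(n)                                  # full-sample statistic T_n
  parts <- vapply(seq_along(t), function(k) {
    rss_sub <- fit_rss(floor(n * t[k]))                  # fit based on the first floor(n t_k) observations
    nu[k] * (rss_sub - t[k] * Tn_full)^2
  }, numeric(1))
  sum(parts)
}
# Possible use for testing H_0' (assumption: M(gamma_0) replaced by eta):
# Ttilde <- (fit_rss(n) - n * eta) / sqrt(Vn_selfnorm(fit_rss, n))
```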

Remark 3.6

Note that not rejecting the null hypothesis \(H_0\) does not mean that the null hypothesis is valid. Consequently, alternative approaches, such as increasing the level in order to accept more transformation functions instead of testing the precise hypotheses in (24), do not in general provide evidence for applying a transformation model.

4 A bootstrap version and simulations

Although Theorem 3.2 shows how the test statistic behaves asymptotically under \(H_0\), it is hard to extract any information about how to choose appropriate critical values of a test that rejects \(H_0\) for large values of \(T_n\). The main reasons for this are that, first, the eigenvalues of the operator defined in Theorem 3.2 are unknown for any given function \(\zeta \), that, second, this function is unknown and has to be estimated as well, and that, third, even \(\psi \) (which would be needed to estimate \(\zeta \)) is mostly unknown and rather complex (see e.g. Section 2 of the supplement). Therefore, approximating the \((1-\alpha )\)-quantile, say \(q_\alpha \), of the distribution of T in Theorem 3.2 in a direct way is difficult, and instead we suggest a smooth bootstrap algorithm to approximate \(q_\alpha \).

Algorithm 4.1

Let \((Y_1,X_1),\ldots ,(Y_n,X_n)\) denote the observed data, define

$$\begin{aligned} h_{\theta }(y)=\frac{\varLambda _{\theta }(y)-\varLambda _{\theta }(0)}{\varLambda _{\theta }(1)-\varLambda _{\theta }(0)}\quad \text{ and }\quad g_{\theta }(x)=E[h_{\theta }(Y)|X=x] \end{aligned}$$

and let \({\hat{g}}\) be a consistent estimator of \(g_{\theta _0}\), where \(\theta _0\) is defined as in (A6) under the null hypothesis and as in (A6’) under the alternative. Let \(\kappa \) and \(\ell \) be smooth Lebesgue densities on \({\mathbb {R}}^{d_X}\) and \({\mathbb {R}}\), respectively, where \(\ell \) is strictly positive, \(\kappa \) has bounded support and \(\kappa (0)>0\). Let \((a_n)_n\) and \((b_n)_n\) be positive sequences with \(a_n\rightarrow 0\), \(b_n\rightarrow 0\), \(na_n\rightarrow \infty \), \(nb_n^{d_X}\rightarrow \infty \). Denote by \(m\in {\mathbb {N}}\) the sample size of the bootstrap sample.

(1):

Calculate \({\hat{\gamma }}=({\hat{c}}_1,{\hat{c}}_2,{\hat{\theta }})^t=\arg \underset{\gamma \in \varUpsilon }{\min }\,\sum _{i=1}^nw(Y_i)({\hat{h}}(Y_i)c_1+c_2-\varLambda _{\theta }(Y_i))^2\). Estimate the parametric residuals \(\varepsilon _i(\theta _0)=h_{\theta _0}(Y_i)-g_{\theta _0}(X_i)\) by \({\hat{\varepsilon }}_i=h_{{{\hat{\theta }}}}(Y_i)-{\hat{g}}(X_i)\) and denote centred versions by \({{\tilde{\varepsilon }}}_i={\hat{\varepsilon }}_i-n^{-1}\sum _{j=1}^n{\hat{\varepsilon }}_j\), \(i=1,\dots ,n\).

(2):

Generate \(X_j^*\), \(j=1,\dots ,m\), independently (given the original data) from the density

$$\begin{aligned} f_{X^*}(x)=\frac{1}{nb_n^{d_X}}\sum _{i=1}^n\kappa \bigg (\frac{x-X_i}{b_n}\bigg ) \end{aligned}$$

(which is a kernel density estimator of \(f_X\) with kernel \(\kappa \) and bandwidth \(b_n\)). For \(j=1,\dots ,m\) define bootstrap observations as

$$\begin{aligned} Y_j^*=(h^*)^{-1}\big ({\hat{g}}(X_j^*)+\varepsilon _j^*\big )\quad \text{ for }\quad h^*(\cdot )=\frac{\varLambda _{{\hat{\theta }}}(\cdot )-\varLambda _{{\hat{\theta }}}(0)}{\varLambda _{{\hat{\theta }}}(1)-\varLambda _{{\hat{\theta }}}(0)}, \end{aligned}$$
(26)

where \(\varepsilon _j^*\) is generated independently (given the original data) from the density

$$\begin{aligned} \frac{1}{n} \sum _{i=1}^n \frac{1}{a_n}\ell \left( \frac{{{\tilde{\varepsilon }}}_i-\cdot }{a_n}\right) \end{aligned}$$

(which is a kernel density estimator of the density of \(\varepsilon (\theta _0)\) with kernel \(\ell \) and bandwidth \(a_n\)).

(3):

Calculate the bootstrap estimate \({\hat{h}}^*\) for \(h^*\) from \((Y_j^*,X_j^*),j=1,\ldots ,m\).

(4):

Calculate the bootstrap statistic \(T_{n,m}^*=\underset{(c_1,c_2,\theta )\in \varUpsilon }{\min }\,\sum _{j=1}^mw(Y_j^*)({\hat{h}}^*(Y_j^*)c_1+c_2-\varLambda _{\theta }(Y_j^*))^2\).

(5):

Let \(B\in {\mathbb {N}}\). Repeat steps (2)–(4) B times to obtain the bootstrap statistics \(T_{n,m,1}^*,\dots ,T_{n,m,B}^*\). Let \(q_{\alpha }^*\) denote the \((1-\alpha )\)-quantile of \(T_{n,m}^*\) conditional on \((Y_i,X_i),i=1,\ldots ,n\). Estimate \(q_{\alpha }^*\) by

$$\begin{aligned} {\hat{q}}_{\alpha }^*=\min \,\bigg \{z\in \{T_{n,m,1}^*,\ldots ,T_{n,m,B}^*\}:\frac{1}{B}\sum _{k=1}^BI_{\{T_{n,m,k}^*\le z\}}\ge 1-\alpha \bigg \}. \end{aligned}$$
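The following condensed R sketch illustrates steps (1)–(5) for a univariate covariate and the kernel choices used later in the simulations (\(\kappa \) the \(U([-1,1])\)-density, \(\ell \) the standard normal density); it is not the authors' implementation. The inputs Lambda, Lambda_inv (inverse of the parametric transformation), hhat_est (returning the nonparametric transformation estimator as a function) and ghat are hypothetical placeholders, and Tn_stat refers to the sketch from Section 2.

```r
# Sketch only: smooth bootstrap approximation of the critical value of T_n.
bootstrap_quantile <- function(Y, X, Lambda, Lambda_inv, hhat_est, ghat, w,
                               alpha = 0.05, B = 250, m = length(Y),
                               a_n = 0.1, b_n = 0.1) {
  n <- length(Y)
  hhat <- hhat_est(Y, X)
  # Step (1): theta-hat and centred parametric residuals
  theta_hat <- Tn_stat(Y, hhat, Lambda, w)$theta
  h_par <- function(y) (Lambda(y, theta_hat) - Lambda(0, theta_hat)) /
                       (Lambda(1, theta_hat) - Lambda(0, theta_hat))
  eps <- h_par(Y) - ghat(X)
  eps <- eps - mean(eps)
  Tboot <- replicate(B, {
    # Step (2): smooth resampling of covariates and errors, bootstrap data as in (26)
    Xs <- sample(X, m, replace = TRUE) + b_n * runif(m, -1, 1)  # kappa = U([-1,1]) density
    es <- sample(eps, m, replace = TRUE) + a_n * rnorm(m)       # ell = standard normal density
    Ss <- ghat(Xs) + es
    # (h*)^{-1}; values of Ss outside the range of Lambda_theta-hat would need
    # the modification discussed in Remark 4.2
    Ys <- Lambda_inv(Ss * (Lambda(1, theta_hat) - Lambda(0, theta_hat)) +
                       Lambda(0, theta_hat), theta_hat)
    # Steps (3)-(4): bootstrap transformation estimator and bootstrap statistic
    Tn_stat(Ys, hhat_est(Ys, Xs), Lambda, w)$value
  })
  # Step (5): empirical (1 - alpha)-quantile of the bootstrap statistics
  quantile(Tboot, probs = 1 - alpha, type = 1)
}
```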

Remark 4.2

   

  1.

    The reason for resampling the bootstrap covariates \(X_j^*,j=1,\ldots ,m,\) nonparametrically from a smoothed distribution is the need to mimic the original transformation estimator and its asymptotic behaviour with the bootstrap estimator conditional on the data. Therefore, to proceed in the proof as in Colling and Van Keilegom (2019), it is necessary to smooth the distribution of \(X^*\). The properties \(nb_n^{d_X}\rightarrow \infty \) and \(\kappa (0)>0\) ensure that conditional on the original data \((Y_1,X_1),\ldots ,(Y_n,X_n)\) the support of \(X^*\) contains that of v (from assumption (B7) in Section 2 of the supplement) with probability converging to one. Thus, v can be used for calculating \({\hat{h}}^*\) as well.

  2.

    To proceed as in Algorithm 4.1, it may be necessary to modify \(h^*\) so that \(S_j^*={\hat{g}}(X_j^*)+\varepsilon _{j}^*\) belongs to the domain of \((h^*)^{-1}\) for all \(j=1,\ldots ,m\). As long as these modifications do not have any influence on \(h^*(y)\) for \(y\in {\mathcal {Y}}_w\), the influence on \({\hat{h}}^*\) and \(T_{n,m}^*\) should be asymptotically negligible (which can be proven for the estimator by Colling and Van Keilegom (2019)).

The bootstrap algorithm should fulfil two properties: on the one hand, under the null hypothesis, the algorithm has to provide, conditionally on the original data, consistent estimates of the quantiles of \(T_n\), or rather of its asymptotic distribution from Theorem 3.2. On the other hand, to be consistent under \(H_1\), the bootstrap quantiles have to stabilize or at least converge to infinity at a slower rate than \(T_n\). To formalize this, let \((\varOmega ,{\mathcal {A}},P)\) denote the underlying probability space. Assume that \((\varOmega ,{\mathcal {A}})\) can be written as \(\varOmega =\varOmega _1\times \varOmega _2\) and \({\mathcal {A}}={\mathcal {A}}_1\otimes {\mathcal {A}}_2\) for some measurable spaces \((\varOmega _1,{\mathcal {A}}_1)\) and \((\varOmega _2,{\mathcal {A}}_2)\). Further, assume that P is characterized as the product of a probability measure \(P_1\) on \((\varOmega _1,{\mathcal {A}}_1)\) and a Markov kernel

$$\begin{aligned} P_2^1:\varOmega _1\times {\mathcal {A}}_2\rightarrow [0,1], \end{aligned}$$

that is \(P=P_1\otimes P_2^1\). While randomness with respect to the original data is modelled by \(P_1\), randomness with respect to the bootstrap data and conditional on the original data is modelled by \(P_2^1\). Moreover, assume

$$\begin{aligned} P_2^1(\omega ,A)=P\big (\varOmega _1\times A|(Y_1(\omega ),X_1(\omega )),\ldots ,(Y_n(\omega ),X_n(\omega ))\big )\quad \text{ for all } \omega \in \varOmega _1,\ A\in {\mathcal {A}}_2. \end{aligned}$$

With these notations, the assumptions (A8\(^{*}\)) and (A9\(^{*}\)) from Section 1 of the supplementary material can be formulated.

Theorem 4.3

Let \(q_{\alpha }^*\) denote the bootstrap quantile from Algorithm 4.1.

  1.

    Assume \(H_0\),(A1)–(A8),(A8\(^{*}\)),(A9\(^{*}\)). Then, \(q_{\alpha }^*\) fulfils

    $$\begin{aligned} P_1\Big (\omega \in \varOmega _1:\underset{m\rightarrow \infty }{\limsup }\,|q_{\alpha }^*-q_{\alpha }|>\delta \Big )=o(1) \end{aligned}$$

    for all \(\delta >0\). Hence, \(P(T_n>q_{\alpha }^*)=\alpha +o(1)\) under the null hypothesis.

  2.

    Assume \(H_1\),(A1)–(A4),(A6’),(A8\(^{*}\)). Then, \(q_{\alpha }^*\) fulfils

    $$\begin{aligned} P_1\Big (\omega \in \varOmega _1:T_n>\underset{m\rightarrow \infty }{\limsup }\,q_{\alpha }^*\Big )=1+o(1), \end{aligned}$$

    so that \(P(T_n>q_{\alpha }^*)=1+o(1)\) under the alternative.

The proof is given in the supplement. Since only \({\hat{\theta }}\) is used to generate the bootstrap observations in Algorithm 4.1, it is conjectured that Theorem 4.3 carries over to the use of \({\bar{T}}_n\) from (7) in Algorithm 4.1.

4.1 Simulations

Throughout this section, \(g(X)=4X-1\), \(X\sim {\mathcal {U}}([0,1])\) and \(\varepsilon \sim {\mathcal {N}}(0,1)\) are chosen. Moreover, the null hypothesis of h belonging to the class of Yeo and Johnson (2000) transformations

$$\begin{aligned} \varLambda _{\theta }(Y)=\left\{ \begin{array}{ll}\frac{(Y+1)^{\theta }-1}{\theta },&{}\text{ if } Y\ge 0,\theta \ne 0\\ \log (Y+1),&{}\text{ if } Y\ge 0,\theta =0\\ -\frac{(1-Y)^{2-\theta }-1}{2-\theta },&{}\text{ if } Y<0,\theta \ne 2\\ -\log (1-Y),&{}\text{ if } Y<0,\theta = 2.\end{array}\right. \end{aligned}$$

with parameter \(\theta \in \varTheta _0=[0,2]\) is tested. Under \(H_0\), we generate data using the transformation \(h=(\varLambda _{\theta _0} (\cdot )-\varLambda _{\theta _0}(0))/(\varLambda _{\theta _0} (1)-\varLambda _{\theta _0}(0))\) to match the identification constraints \(h(0)=0, h(1)=1\). Under the alternative, we choose transformations h with an inverse given by the following convex combination,

$$\begin{aligned} h^{-1}(Y)=\frac{(1-c)(\varLambda _{\theta _0}^{-1}(Y)-\varLambda _{\theta _0}^{-1}(0))+c(r(Y)-r(0))}{(1-c)(\varLambda _{\theta _0}^{-1}(1)-\varLambda _{\theta _0}^{-1}(0))+c(r(1)-r(0))} \end{aligned}$$
(27)

for some \(\theta _0\in [0,2]\), some strictly increasing function r and some \(c\in [0,1]\). In general, it is not clear whether a growing factor c leads to a growing distance (5). Indeed, the opposite might be the case if r is in some sense close to the class of transformation functions considered in the null hypothesis. Simulations were conducted for \(r_1(Y)=5\varPhi (Y)\), \(r_2(Y)=\exp (Y)\) and \(r_3(Y)=Y^3\), where \(\varPhi \) denotes the cumulative distribution function of a standard normal distribution, and \(c=0,0.2,0.4,0.6,0.8,1\). The prefactor in the definition of \(r_1\) is introduced because the values of \(\varPhi \) are rather small compared to the values of \(\varLambda _{\theta }\), that is, even when using the convex combination in (27), \(\varLambda _{\theta _0}\) (except for \(c=1\)) would dominate the “alternative part” r of the transformation function without this factor. Note that \(r_2\) and \(\varLambda _{0}^{-1}\) only differ with respect to a different standardization. Therefore, if h is defined via (27) with \(r=r_2\), the resulting function is for \(c=1\) close to the null hypothesis case.
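To make the data-generating mechanism concrete, the following R sketch implements the Yeo-Johnson transformation, its inverse and the generation of \((Y_i,X_i)\) under the null hypothesis and under the alternative (27); the function names are illustrative, and values of S outside the range of \(\varLambda _{\theta _0}\) would need to be handled as discussed in Remark 4.2.

```r
# Sketch only: Yeo-Johnson transformation and data generation as in Section 4.1.
yeo_johnson <- function(y, theta) {
  ifelse(y >= 0,
         if (theta != 0) ((y + 1)^theta - 1) / theta else log(y + 1),
         if (theta != 2) -((1 - y)^(2 - theta) - 1) / (2 - theta) else -log(1 - y))
}
yeo_johnson_inv <- function(s, theta) {
  ifelse(s >= 0,
         if (theta != 0) (theta * s + 1)^(1 / theta) - 1 else exp(s) - 1,
         if (theta != 2) 1 - (1 - (2 - theta) * s)^(1 / (2 - theta)) else 1 - exp(-s))
}

gen_data <- function(n, theta0, r = NULL, c = 0) {
  X <- runif(n); eps <- rnorm(n)
  S <- 4 * X - 1 + eps                               # S = g(X) + eps with g(X) = 4X - 1
  L  <- function(y) yeo_johnson(y, theta0)
  Li <- function(s) yeo_johnson_inv(s, theta0)
  if (is.null(r)) {
    # null hypothesis: h = (Lambda - Lambda(0)) / (Lambda(1) - Lambda(0)), so
    # Y = h^{-1}(S) = Lambda^{-1}(S (Lambda(1) - Lambda(0)) + Lambda(0))
    Y <- Li(S * (L(1) - L(0)) + L(0))
  } else {
    # alternative: Y = h^{-1}(S) with h^{-1} the convex combination in (27)
    hinv <- function(s) ((1 - c) * (Li(s) - Li(0)) + c * (r(s) - r(0))) /
                        ((1 - c) * (Li(1) - Li(0)) + c * (r(1) - r(0)))
    Y <- hinv(S)
  }
  data.frame(Y = Y, X = X)
}
# Example alternative: r1 <- function(y) 5 * pnorm(y); dat <- gen_data(100, theta0 = 2, r = r1, c = 0.6)
```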

For calculating the test statistic, the weight function w was set equal to one. The nonparametric estimator of h was calculated as in Colling and Van Keilegom (2019) (see Section 2 of the supplement for details) with the Epanechnikov kernel \(K(y)=\frac{3}{4}(1-y^2)I_{[-1,1]}(y)\) and a normal reference rule bandwidth (see for example Silverman (1986))

$$\begin{aligned} h_u =\bigg (\frac{40\sqrt{\pi }}{n}\bigg )^{\frac{1}{5}}{\hat{\sigma }}_u,\quad h_x=\bigg (\frac{40\sqrt{\pi }}{n}\bigg )^{\frac{1}{5}}{\hat{\sigma }}_x, \end{aligned}$$

where \({\hat{\sigma }}_u^2\) and \({\hat{\sigma }}_x^2\) are estimators of the variance of \(U={\mathcal {T}}(Y)\) and X, respectively. The number of evaluation points \(N_x\) for the nonparametric estimator of h was set equal to 100 (see Section 2 of the supplement for details). The integral in (S3) was computed by applying the function integrate implemented in R. In each simulation run, \(n=100\) independent and identically distributed random pairs \((Y_1,X_1),\ldots ,(Y_n,X_n)\) were generated as described before, and \(B=250\) bootstrap statistics, each based on \(m=100\) bootstrap observations \((Y_1^*,X_1^*),\ldots ,(Y_m^*,X_m^*)\), were calculated as in Algorithm 4.1, using the \(U([-1,1])\)-density for \(\kappa \), the standard normal density for \(\ell \) and \(a_n=b_n=0.1\). To obtain more precise estimates of the rejection probabilities under the null hypothesis, 800 simulation runs were performed for each choice of \(\theta _0\) under the null hypothesis, whereas in the remaining alternative cases 200 runs were conducted. Among other things, the nonparametric estimation of h, the integration in (S3), the optimization with respect to \(\theta \) and the number of bootstrap repetitions cause the simulations to be quite computationally demanding. Hence, an interface to C++ as well as parallelization were used to conduct the simulations.

Table 1 Rejection probabilities at \(\theta _0\in \{0,0.5,1,2\}\) and \(r\in \{r_1,r_2,r_3\}\) for the test statistic \(T_n\)

The main results of the simulation study are presented in Table 1. There, the rejection probabilities of the settings with \(h=(\varLambda _{\theta _0} (\cdot )-\varLambda _{\theta _0}(0))/(\varLambda _{\theta _0} (1)-\varLambda _{\theta _0}(0))\) under the null hypothesis, and h as in (27) under the alternative with \(r\in \{r_1,r_2,r_3\}\), \(c\in \{0,0.2,0.4,0.6,0.8,1\}\) and \(\theta _0\in \{0,0.5,1,2\}\) are listed. The significance level was set to 0.05 and 0.10. Note that the test holds the level or is even a bit conservative. Under the alternatives, the rejection probabilities not only differ between different choices of r, but also between different transformation parameters \(\theta _0\) that are inserted in (27). While the test shows high power for some alternatives, there are also cases where the rejection probabilities are extremely small. There are certain reasons that explain these observations. First, the class of Yeo-Johnson transformations seems to be quite general, and second, the testing approach itself is rather flexible due to the minimization with respect to \(\gamma \). As can be seen from the definition of the test statistic in (6), it attains small values if the true transformation function can be approximated by a linear transformation of \(\varLambda _{{\tilde{\theta }}}\) for some appropriate \({\tilde{\theta }}\in [0,2]\). In the following, this issue will be explored further by analysing some graphics. All three figures that occur in the following have the same structure and consist of four panels. The upper left panel shows the true transformation function with inverse given by (27). Since Y depends on the transformation function, the values of \(S=h(Y)=g(X)+\varepsilon \), which are displayed on the vertical axis, are kept fixed for a comparison of different transformation functions. Due to the choice of \(g(X)=4X-1\) and \(X\sim {\mathcal {U}}([0,1])\), the vertical axis ranges from \(-1\) to 3, which would be the support of h(Y) if the error were neglected. In the upper right panel, the parametric estimator of this function is displayed. Both of these functions are then plotted against each other in the lower left panel by pairing values with the same S component. Finally, the function \(Y\mapsto \varLambda _{\theta _0}(Y(\varLambda _{\theta _0}^{-1}(1)-\varLambda _{\theta _0}^{-1}(0))+\varLambda _{\theta _0}^{-1}(0))\), which represents the part of h corresponding to the null hypothesis, is plotted against the true transformation function in the last panel.

Fig. 1

Some transformation functions for \(\theta _0=0.5,c=0.6\) and \(r=r_1\)

Fig. 2

Some transformation functions for \(\theta _0=2,c=0.6\) and \(r=r_1\)

Fig. 3

Some transformation functions for \(\theta _0=2,c=0.2\) and \(r=r_3\)

In the lower left panel, one can see whether the true transformation function can be approximated by a linear transformation of \(\varLambda _{{\tilde{\theta }}}\) for some \({\tilde{\theta }}\in [0,2]\), which, as pointed out before, indicates whether the null hypothesis is rejected or not. As already mentioned, the rejection probabilities not only differ between different deviation functions r, but also within these settings. For example, when considering \(r=r_1\) with \(c=0.6\), the rejection probabilities for \(\theta _0=0.5\) amount to 0.035 for \(\alpha =0.05\) and to 0.050 for \(\alpha =0.10\), while for \(\theta _0=2\), they are 0.415 and 0.545. Figures 1 and 2 explain why the rejection probabilities differ so much. While for \(\theta _0=0.5\) the transformation function can be approximated quite well by transforming \(\varLambda _{1.06}\) linearly, the best approximation for \(\theta _0=2\) is given by \(\varLambda _{1.94}\) and is relatively poor. The best approximation for \(c=1\) is reached for \(\theta \) around 1.4. In contrast, considering \(\theta _0=2\) and \(r=r_3\) results in a completely different picture. As can be seen in Fig. 3, even for \(c=0.2\) the resulting h differs so much from the null hypothesis that it cannot be linearly transformed into a Yeo-Johnson transformation (see the lower left panel). Consequently, the rejection probabilities are rather high.

A way to overcome this problem can consist in applying the modified test statistic \({\bar{T}}_n\) from (7). Although Colling and Van Keilegom (2020) showed that the estimator \({\hat{\theta }}\) seems to outperform \({\tilde{\theta }}\) from Remark 2.1 in simulations, fixing \(c_1,c_2\) beforehand might, due to the reduced flexibility of the minimization procedure, lead to higher rejection probabilities when using \({\bar{T}}_n\) instead of \(T_n\). Table 2 contains rejection probabilities which are based on the bootstrap version of \({\bar{T}}_n\). The same simulation settings and procedures as before have been used. Indeed, some of the rejection probabilities have increased compared to Table 1. For example, the rejection probabilities for \(r=r_1,\theta _0=0.5\) and \(c=0.6\) amount to 0.115 and 0.17 instead of 0.035 and 0.05 in Table 1. Nevertheless, this cannot be generalized since the rejection probabilities when using \({\bar{T}}_n\) are sometimes below those for \(T_n\), e.g. for \(\theta _0=0\) and \(r=r_1\) or \(\theta _0=2\) and \(r=r_2\).

Table 2 Rejection probabilities at \(\theta _0\in \{0,0.5,1,2\}\) and \(r\in \{r_1,r_2,r_3\}\) for the test statistic \({\bar{T}}_n\)

Under some alternatives the rejection probabilities are even smaller than the level. This behaviour indicates that from the presented test’s perspective, these models seem to fulfil the null hypothesis more convincingly than the null hypothesis models themselves. The reason for this is shown in Fig. 4 for the setting \(\theta _0=1,c=0.4\) and \(r=r_1\). There, the relationship between the nonparametric estimator of the transformation function and the true transformation function is shown. While the diagonal line represents the identity, the nonparametric estimator seems to flatten the edges of the transformation function. In contrast to this, using \(r=r_1\) in (27) steepens the edges so that both effects neutralize each other. Similar effects cause low rejection probabilities for \(r=r_2\), although the reasoning is slightly more sophisticated and is also associated with the boundedness of the parameter space \(\varTheta _0=[0,2]\).

Fig. 4

Transformation function for \(\theta _0=1,c=0.4\) and \(r=r_1\) on the horizontal axis and its nonparametric estimator on the vertical axis. The identity is displayed in red

One possible solution could consist in adjusting the weight function w such that the boundary of the support of Y no longer belongs to the support of w. In Table 3, the rejection probabilities for a modified weighting approach are presented. There, the weight function was chosen such that the smallest five percent and the largest five percent of observations were omitted to avoid the flattening effect of the nonparametric estimation. Indeed, the resulting rejection probabilities under the alternatives increase and lie above those under the null hypothesis.
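An assumed implementation of this trimmed weight function is sketched below: w is taken as the indicator of the central ninety percent of the observed responses, so that the smallest and largest five percent of the \(Y_i\) receive weight zero.

```r
# Sketch only: weight function omitting the smallest and largest five percent of the responses.
make_trimmed_weight <- function(Y, trim = 0.05) {
  q <- quantile(Y, probs = c(trim, 1 - trim))
  function(y) as.numeric(y >= q[1] & y <= q[2])     # indicator of the central part of the sample
}
# Example: w <- make_trimmed_weight(Y); this w can then be passed to the test statistic.
```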

Table 3 Rejection probabilities at \(\theta _0=1\) and \(\theta _0=2\) for \(r=r_1\)
Table 4 Rejection probabilities at \(\theta _0=0.5\) and \(\theta _0=1\) for precise hypotheses and \(r=r_1,r_2,r_3,r_4\)

Finally, simulations for the precise hypotheses in (24) were conducted. For the sake of brevity, only rejection probabilities resulting from the self-normalized test statistic are presented, since in the simulated settings this approach seems to outperform the one based on the estimator \({\hat{\sigma }}^2\) from Section 3 by far. Since only a fraction of the data is used to calculate \(V_n\), the sample size was increased to \(n=500\). The settings and techniques remain the same as before. The probability measure \(\nu \) was set to

$$\begin{aligned} \nu =\frac{1}{10}\delta _{0.6}+\frac{2}{10}\delta _{0.7}+\frac{3}{10}\delta _{0.8}+\frac{4}{10}\delta _{0.9} \end{aligned}$$

to put a higher weight on those parts of \(V_n\) where more data points are used. Furthermore, the threshold was chosen to be \(\eta =0.02\), which roughly corresponds to plugging the scaled logistic function \(r_4(y):=\frac{5\exp (y)}{1+\exp (y)}\) and \(c=1\) into Eq. (27) and calculating \(\underset{\theta \in \varTheta }{\min }\,d(\varLambda _{\theta },h)\). Hence, we expect the test to reject the null hypothesis \(H_0'\) if \(T_n<n\eta =10\) holds.

A detailed analysis would go beyond the scope of this manuscript, so that only some rejection probabilities are given in Table 4. Moreover, the mean values of the test statistic \(T_n\) are listed to link the rejection probabilities to the distance between the expected value of the test statistic and the threshold \(n\eta =10\). First, the smaller the value of \(T_n\), the more likely the test seems to reject the null hypothesis \(H_0'\). Further, the test holds the level, but is slightly conservative. Alternatives seem to be detected for mean values of \(T_n\) around or below eight. Nevertheless, the power of the test is quite high in scenarios with small expected values of the test statistic, which often correspond to transformation functions that are close to the parametric class. For \(\theta _0=0.5\) and \(\theta _0=1\), the rejection probabilities in these cases are above 0.90 and sometimes even close to one. Although the influence of simulation parameters such as the sample size n or the probability measure \(\nu \) has not been examined, the results indicate that using the self-normalized test statistic can be a good way to test for the precise hypotheses \(H_0'\) and \(H_1'\).