Specification testing in semi-parametric transformation models

In transformation regression models the response is transformed before fitting a regression model to covariates and transformed response. We assume such a model in which the errors are independent of the covariates and the regression function is modeled nonparametrically. We suggest a test for the goodness-of-fit of a parametric transformation class based on a distance between a nonparametric transformation estimator and the parametric class. We present asymptotic theory under the null hypothesis of validity of the semi-parametric model and under local alternatives. A bootstrap algorithm is suggested in order to apply the test. We also consider relevant hypotheses to distinguish between large and small distances of the parametric transformation class to the 'true' transformation.


Introduction
It is very common in applications to transform data before investigating the functional dependence of variables by regression models. The aim of the transformation is to obtain a simpler model, e.g. with a specific structure of the regression function, or a homoscedastic instead of a heteroscedastic model. Typically, flexible parametric classes of transformations are considered from which a suitable one is selected data-dependently. A classical example is the class of Box-Cox power transformations (see Box and Cox (1964)). For purely parametric transformation models see Carroll and Ruppert (1988) and references therein. Powell (1991) and Mu and He (2007) consider transformation quantile regression models. Nonparametric estimation of the transformation in the context of parametric regression models has been considered by Horowitz (1996) and Chen (2002), among others. Horowitz (2009) reviews estimation in transformation models with parametric regression in the cases where either the transformation or the error distribution or both are modeled nonparametrically. Linton, Sperlich, and Van Keilegom (2008) suggest a profile likelihood estimator for a parametric class of transformations, while the error distribution is estimated nonparametrically and the regression function semi-parametrically. Heuchenne, Samb, and Van Keilegom (2015) suggest an estimator for the error distribution in the same model. Neumeyer, Noh, and Van Keilegom (2016) consider profile likelihood estimation in heteroscedastic semi-parametric transformation regression models, i.e. the mean and variance function are modeled nonparametrically, while the transformation function is chosen from a parametric class. A completely nonparametric (homoscedastic) model is considered by Chiappori, Komunjer, and Kristensen (2015). Their approach was modified and corrected by Colling and Van Keilegom (2019). The version of the nonparametric transformation estimator considered in the latter paper was then applied by Colling and Van Keilegom (2018) to suggest a new estimator of the transformation parameter if it is assumed that the transformation belongs to a parametric class.
In general, asymptotic theory for nonparametric transformation estimators is involved, and parametric transformation estimators perform much better when the parametric model is true. A parametric transformation will thus lead to better estimates of the regression function. Moreover, parametric transformations are easier to interpret and allow for subsequent inference in the transformation model. For the latter purpose note that, for transformation models with parametric transformation, lack-of-fit tests for the regression function as well as tests for significance of covariate components have been suggested by Colling and Van Keilegom (2016), Colling and Van Keilegom (2017), Allison, Hušková, and Meintanis (2018) and Kloodt and Neumeyer (2019). Those tests cannot straightforwardly be generalized to nonparametric transformation models because known estimators in that model do not allow for uniform rates of convergence over the whole real line; see Chiappori et al. (2015) and Colling and Van Keilegom (2019).
However, before applying a transformation model with parametric transformation it would be appropriate to test the goodness-of-fit of the parametric transformation class. In the context of parametric quantile regression, Mu and He (2007) suggest such a goodness-of-fit test. In the context of nonparametric mean regression, Neumeyer et al. (2016) develop a goodness-of-fit test for the parametric transformation class based on an empirical independence process of pairs of residuals and covariates. The latter approach was modified by Hušková, Meintanis, Neumeyer, and Pretorius (2018), who applied empirical characteristic functions. In a linear regression model with transformation of the response, Szydłowski (2017) suggests a goodness-of-fit test for the parametric transformation class that is based on a distance between the nonparametric transformation estimator considered by Chen (2002) and the parametric class. We will follow a similar approach but consider a nonparametric regression model. The aim of the transformations we consider is to induce independence between errors and covariates. The null hypothesis is that the unknown transformation belongs to a parametric class. Note that, when applied to the special case of a class of transformations whose only element is the identity, our test indicates whether a classical homoscedastic regression model (without transformation) is appropriate or whether the response should be transformed first. Our test statistic is based on a minimum distance between a nonparametric transformation estimator and the parametric transformations. We present the asymptotic distribution of the test statistic under the null hypothesis of a parametric transformation and under local alternatives of rate n^{-1/2}. Under the null hypothesis the limit distribution is that of a degenerate U-statistic. With a flexible parametric class, applying an appropriate transformation can reduce the dependence enormously, even if the 'true' transformation does not belong to the class. Thus, for the first time in the context of transformation goodness-of-fit tests, we consider testing so-called precise or relevant hypotheses. Here the null hypothesis is that the distance between the true transformation and the parametric class is large. If this hypothesis is rejected, the model with the parametric transformation fits well enough to be considered for further inference. Under the new null hypothesis the test statistic is asymptotically normally distributed. The term "precise hypotheses" refers to Berger and Delampady (1987). Dette, Kokot, and Volgushev (2018) considered precise hypotheses in the context of comparing mean functions of functional time series. Note that the idea of precise hypotheses is related to that of equivalence tests, which originate from the field of pharmacokinetics (see Lakens (2017)). Throughout we assume that the nonparametric transformation estimator fulfills an asymptotic linear expansion. It is then shown that the estimator considered by Colling and Van Keilegom (2019) fulfills this expansion and thus can be used for evaluating the test statistic.
The remainder of the paper is organized as follows. In Section 2 we present the model and the test statistic. Asymptotic distributions under the null hypothesis of a parametric transformation class and under local alternatives are presented in Section 3, which also contains a consistency result and asymptotic results under relevant hypotheses. Section 4 presents a bootstrap algorithm and a simulation study. Appendix A contains assumptions, while Appendix B treats a specific nonparametric transformation estimator and shows that it fulfills the required conditions. The proofs of the main results are given in Appendix C. A supplement contains a rigorous treatment of bootstrap asymptotics.

The model and test statistic
Assume we have observed (X_i, Y_i), i = 1, ..., n, independent with the same distribution as (X, Y), which fulfills the transformation regression model

h(Y) = g(X) + ε,    (2.1)

where E[ε] = 0 holds and ε is independent of the covariate X, which is R^{d_X}-valued, while Y is univariate. The regression function g is modelled nonparametrically. The transformation h : R → R is strictly increasing. Throughout we assume that, given the joint distribution of (X, Y) and some identification conditions, there exists a unique transformation h such that this model is fulfilled. It then follows that the other model components are identified via g(x) = E[h(Y)|X = x] and ε = h(Y) − g(X). See Chiappori et al. (2015) for conditions under which the identifiability of h holds. In particular, conditions are required to fix location and scale, and we will assume throughout that

h(0) = 0 and h(1) = 1.    (2.2)

Now let {Λ_θ : θ ∈ Θ} be a class of strictly increasing parametric transformation functions Λ_θ : R → R, where Θ ⊆ R^{d_Θ} is a finite-dimensional parameter space. Our purpose is to test whether a semi-parametric transformation model holds, i.e.
Λ_{θ_0}(Y) = g̃(X) + ε̃    (2.3)

for some parameter θ_0 ∈ Θ, where ε̃ and X are independent. Due to the assumed uniqueness of the transformation h, one obtains h = h_0 under validity of the semi-parametric model, where h_0 = (Λ_{θ_0} − Λ_{θ_0}(0))/(Λ_{θ_0}(1) − Λ_{θ_0}(0)). Thus we can write the null hypothesis as

H_0 : h ∈ {(Λ_θ − Λ_θ(0))/(Λ_θ(1) − Λ_θ(0)) : θ ∈ Θ},

which thanks to (2.2) can be formulated equivalently as

H_0 : Λ_θ = c_1 h + c_2 for some θ ∈ Θ, c_1 > 0, c_2 ∈ R.

Our test statistic will be based on the following L²-distance,

min_{(c_1,c_2,θ) ∈ C_1×C_2×Θ} E[w(Y)(h(Y)c_1 + c_2 − Λ_θ(Y))²],    (2.5)

where w is a positive weight function with compact support Y_w. Its empirical counterpart is

M_n(c_1, c_2, θ) = Σ_{i=1}^n w(Y_i)(ĥ(Y_i)c_1 + c_2 − Λ_θ(Y_i))²,

where ĥ denotes a nonparametric estimator of the true transformation h as discussed below, and C_1 ⊂ R^+, C_2 ⊂ R are compact sets. Assumption (A6) in Appendix A assures that the sets are large enough to contain the true values. The test statistic is defined as

T_n = min_{(c_1,c_2,θ) ∈ C_1×C_2×Θ} M_n(c_1, c_2, θ),    (2.6)

and the null hypothesis should be rejected for large values of the test statistic. We will derive the asymptotic distribution under the null hypothesis and under local and fixed alternatives in Section 3 and suggest a bootstrap version of the test in Section 4. As a by-product one obtains the estimator (ĉ_1, ĉ_2, θ̂) = arg min_{(c_1,c_2,θ)} M_n(c_1, c_2, θ) for the parametric transformation (assuming H_0), and we observe that θ̂ outperforms the version θ̃ without minimization over c_1, c_2, in which c_1 and c_2 are fixed by the identification constraints (2.2).

Nonparametric estimation of the transformation h has been considered by Chiappori et al. (2015) and Colling and Van Keilegom (2019). For our main asymptotic results we need that ĥ has a linear expansion, not only under the null hypothesis, but also under fixed alternatives and under the local alternatives as defined in the next section. The linear expansion should have the form

ĥ(y) − h(y) = (1/n) Σ_{j=1}^n ψ(Z_j, T(y)) + o_P(n^{-1/2})    (2.7)

uniformly in y ∈ Y_w. Here, ψ needs to fulfil condition (A8) in Appendix A, and we use the definitions (i = 1, ..., n)

Z_i = (X_i, Y_i),  U_i = T(Y_i),  T(y) = (F_Y(y) − F_Y(0))/(F_Y(1) − F_Y(0)),    (2.8)

where F_Y denotes the distribution function of Y and is assumed to be strictly increasing on the support of Y. To ensure that T is well defined, the values 0 and 1 are w.l.o.g. assumed to belong to the support of Y, but can be replaced by arbitrary values a < b ∈ R in the support of Y. The expansion (2.7) could also be formulated with a linear term n^{-1/2} Σ_{i=1}^n ψ̃(X_i, Y_i, y). In Appendix B we reproduce the definition of the estimator ĥ that was suggested by Colling and Van Keilegom (2019) as a modification of the estimator by Chiappori et al. (2015). We give regularity assumptions under which the desired expansion holds; see Lemma B.2. Other nonparametric estimators of the transformation that fulfill the expansion could be applied as well.
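To make the minimum-distance construction concrete, the following R sketch computes M_n(γ) and T_n by numerical minimization over γ = (c_1, c_2, θ). It is only an illustration under simplifying assumptions: hNP stands for a given nonparametric estimate ĥ, Lambda for a parametric family such as the Yeo-Johnson class, and the box constraints play the role of the compact sets C_1, C_2 and Θ; all function names are hypothetical and do not correspond to any particular implementation.

```r
# Sketch: minimum-distance statistic T_n = min_gamma M_n(gamma).
# hNP:    function y -> nonparametric estimate of h(y) (assumed given)
# Lambda: function (y, theta) -> Lambda_theta(y)
# w:      weight function with compact support
Tn_statistic <- function(Y, hNP, Lambda, w = function(y) rep(1, length(y)),
                         start = c(c1 = 1, c2 = 0, theta = 1)) {
  Mn <- function(gamma) {
    c1 <- gamma[1]; c2 <- gamma[2]; theta <- gamma[3]
    sum(w(Y) * (c1 * hNP(Y) + c2 - Lambda(Y, theta))^2)
  }
  # box constraints mimic the compact sets C_1 x C_2 x Theta:
  opt <- optim(start, Mn, method = "L-BFGS-B",
               lower = c(1e-3, -10, 0), upper = c(10, 10, 2))
  list(Tn = opt$value, gamma_hat = opt$par)
}
```

The minimizing argument gamma_hat corresponds to the by-product estimator (ĉ_1, ĉ_2, θ̂) mentioned above.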
To formulate our main result we need some more notation. For notational convenience, define γ = (c_1, c_2, θ) ∈ Υ := C_1 × C_2 × Θ, which is assumed to be compact (see (A1) in Appendix A). Then, note that T_n = min_{γ∈Υ} M_n(γ). Further, with Z_i from (2.8) and S_i from (3.2), define the functions ζ and ζ̃ entering the limit distribution in Theorem 3.1 below, and let P_Z and F_Z denote the law and distribution function, respectively, of Z_i.
Theorem 3.1. Assume (A1)-(A8) given in Appendix A. Let (λ_k)_{k∈{1,2,...}} be the eigenvalues of the operator

(Kρ)(z) = ∫ ζ(z, z̄) ρ(z̄) dF_Z(z̄)

with corresponding eigenfunctions (ρ_k)_{k∈{1,2,...}}, which are orthonormal in the L²-space corresponding to the distribution F_Z. Let (W_k)_{k∈{1,2,...}} be independent and standard normally distributed random variables, and let W_0 be centred normally distributed with variance E[(ζ̃(Z_1))²] such that for all K ∈ N the random vector (W_0, W_1, ..., W_K)^t is multivariate normally distributed. Then, under the local alternative H_{1,n}, T_n converges in distribution to

T = Σ_{k=1}^∞ λ_k W_k² + W_0.

In particular, this covers the null hypothesis H_0 (i.e. r ≡ 0) as a special case. The proof is given in Appendix C. An asymptotic level-α test should reject H_0 if T_n is larger than the (1 − α)-quantile of the distribution of T. As the distribution of T depends in a complicated way on unknown quantities, we will propose a bootstrap procedure in Section 4.
Remark 3.2. The kernel ζ from (3.6) is positive semi-definite; thus, the operator K defined in Theorem 3.1 is positive semi-definite.
Next we consider fixed alternatives of a transformation h that does not belong to the parametric class, i.e.

H_1 : h ∉ {(Λ_θ − Λ_θ(0))/(Λ_θ(1) − Λ_θ(0)) : θ ∈ Θ}.
Theorem 3.3. Assume (A1)-(A4) and (A6') given in the Appendix, and let ĥ estimate h uniformly consistently on compact sets. Then, under H_1, lim_{n→∞} P(T_n > q) = 1 for all q ∈ R, that is, the proposed test is consistent.
The proof is given in Appendix C. The transformation model with a parametric transformation class might be useful in applications even if the model does not hold exactly. With a good choice of θ, applying the transformation Λ_θ can reduce the dependence between covariates and errors enormously. Estimating an appropriate θ is much easier than estimating the transformation h nonparametrically. Consequently, one might prefer the semi-parametric transformation model over a completely nonparametric one. It is then of interest how far away we are from the true model. Therefore, in the following we consider testing the precise hypotheses (relevant hypotheses)

H̄_0 : min_{γ∈Υ} M(γ) ≥ η   versus   H̄_1 : min_{γ∈Υ} M(γ) < η

for some η > 0 fixed beforehand by the experimenter, where M is defined below. If a suitable test rejects H̄_0 for some small η, the model is considered "good enough" to work with, even if it does not hold exactly.
To test those hypotheses we will use the same test statistic as before, but we have to standardize differently. Assume H̄_0; then h is a transformation which does not belong to the parametric class, i.e. the former fixed alternative H_1 holds. Let

M(c_1, c_2, θ) = E[w(Y)(h(Y)c_1 + c_2 − Λ_θ(Y))²]

and let γ_0 = (c_{1,0}, c_{2,0}, θ_0) := arg min_{(c_1,c_2,θ)∈Υ} M(c_1, c_2, θ).

A bootstrap version and simulations
Although Theorem 3.1 shows how the test statistic behaves asymptotically under H_0, it is hard to extract any information about how to choose appropriate critical values for a test that rejects H_0 for large values of T_n. The main reasons are that, first, for any function ζ the eigenvalues of the operator defined in Theorem 3.1 are unknown; second, this function is unknown and has to be estimated as well; and third, even ψ (which would be needed to estimate ζ) is mostly unknown and rather complex (see e.g. Appendix B). Therefore, approximating the α-quantile, say q_α, of the distribution of T in Theorem 3.1 in a direct way is difficult, and instead we suggest a smooth bootstrap algorithm to approximate q_α.
Algorithm 4.1. Let (Y_1, X_1), ..., (Y_n, X_n) denote the observed data, define the residuals ε̂_i, i = 1, ..., n, of the semi-parametric fit, and let ĝ be a consistent estimator of g_{θ_0}, where θ_0 is defined as in (A6) under the null hypothesis and as in (A6') under the alternative (see Appendix A for the assumptions). Let κ and ℓ be smooth Lebesgue densities on R^{d_X} and R, respectively, where ℓ is strictly positive, κ has bounded support and κ(0) > 0. Let (a_n)_n and (b_n)_n be positive sequences with a_n → 0, b_n → 0, na_n → ∞, nb_n^{d_X} → ∞. Denote by m ∈ N the sample size of the bootstrap sample.
(2) Generate X*_j, j = 1, ..., m, independently (given the original data) from the density

f̂_X(x) = (1/n) Σ_{i=1}^n (1/b_n^{d_X}) κ((X_i − x)/b_n)

(which is a kernel density estimator for f_X with kernel κ and bandwidth b_n). For j = 1, ..., m, define bootstrap observations as Y*_j = (h*)^{-1}(ĝ(X*_j) + ε*_j), where ε*_j is generated independently (given the original data) from the density

(1/n) Σ_{i=1}^n (1/a_n) ℓ((ε̂_i − ·)/a_n)

(which is a kernel density estimator for the density of ε(θ_0) with kernel ℓ and bandwidth a_n).
Remark 4.2. 1. The properties nb_n^{d_X} → ∞ and κ(0) > 0 ensure that, conditional on the original data (Y_1, X_1), ..., (Y_n, X_n), the support of X* contains that of v (from assumption (B7) in Appendix B) with probability converging to one. Thus, v can be used for calculating ĥ* as well.
2. To proceed as in Algorithm 4.1 it may be necessary to modify h* so that S*_j = ĝ(X*_j) + ε*_j belongs to the domain of (h*)^{-1} for all j = 1, ..., m. As long as these modifications do not have any influence on h*(y) for y ∈ Y_w, the influence on ĥ* and T*_{n,m} should be asymptotically negligible (which can be proven for the estimator by Colling and Van Keilegom (2019)).
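For illustration, the resampling step (2) can be written compactly by exploiting the standard fact that drawing from a kernel density estimator amounts to resampling an observation and adding scaled kernel noise. The following R sketch assumes a univariate covariate (d_X = 1), κ the U([−1, 1])-density and ℓ the standard normal density (the choices used in the simulations below); eps_hat, ghat and hstar_inv are hypothetical placeholders for the residuals, the estimator of g_{θ_0} and the inverse of h*.

```r
# Sketch of the resampling step of Algorithm 4.1 (d_X = 1 assumed).
smooth_bootstrap <- function(X, eps_hat, ghat, hstar_inv, m,
                             a_n = 0.1, b_n = 0.1) {
  n <- length(eps_hat)
  # X* ~ kernel density estimator of f_X: resample + uniform kernel noise
  Xstar <- sample(X, m, replace = TRUE) + b_n * runif(m, -1, 1)
  # eps* ~ kernel density estimator of the residual density (normal kernel)
  eps_star <- sample(eps_hat, m, replace = TRUE) + a_n * rnorm(m)
  # bootstrap responses: Y* = (h*)^{-1}(ghat(X*) + eps*)
  Ystar <- hstar_inv(ghat(Xstar) + eps_star)
  data.frame(X = Xstar, Y = Ystar)
}
```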
The bootstrap algorithm should fulfil two properties. On the one hand, under the null hypothesis the algorithm has to provide, conditionally on the original data, consistent estimates of the quantiles of T_n, or to be precise, of its asymptotic distribution from Theorem 3.1. To formalize this, let (Ω, A, P) denote the underlying probability space. Assume that (Ω, A) can be written as Ω = Ω_1 × Ω_2 and A = A_1 ⊗ A_2 for some measurable spaces (Ω_1, A_1) and (Ω_2, A_2). Further, assume that P is characterized as the product of a probability measure P_1 on (Ω_1, A_1) and a Markov kernel P_2^1, that is, P = P_1 ⊗ P_2^1. While randomness with respect to the original data is modelled by P_1, randomness with respect to the bootstrap data, conditional on the original data, is modelled by P_2^1. With these notations in mind, for all q ∈ (0, ∞) it would be desirable to obtain

P_1({ω ∈ Ω_1 : |P_2^1(ω, {T*_{n,m} ≤ q}) − P(T ≤ q)| > δ}) → 0    (4.2)

for all δ > 0 and n → ∞. Here, the convention P_2^1(ω, {T*_{n,m} ≤ q}) = P_2^1(ω, {ω̃ ∈ Ω_2 : (ω, ω̃) ∈ {T*_{n,m} ≤ q}}) is used. On the other hand, to be consistent under H_1, the bootstrap quantiles have to stabilize, or at least converge to infinity at a rate slower than that of T_n; the precise requirement, again for all δ > 0, is denoted by (4.3).
In the supplement we give conditions under which the bootstrap Algorithm 4.1 has the desired properties (4.2) and (4.3). In particular we need an expansion of ĥ* as bootstrap counterpart to (2.7). To formulate this, for any realisation ω ∈ Ω_1, define the bootstrap analogues T* and Z*_j of T and Z_j. Then, for any compact set K ⊆ R and for n → ∞, the expansion (4.4) holds, where ψ* fulfils some assumptions given in the supplement (see assumption (A8*) for details). In the supplement we also give conditions under which the expansion is valid for the transformation estimator of Colling and Van Keilegom (2019) (see Lemma D.8).
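By analogy with (2.7), a plausible form of the required expansion (4.4) reads as follows; this is a reconstruction under that analogy (with the sum taken over the m bootstrap observations), stated only as an illustration and not as a quotation of the original display.

```latex
% Conjectured form of (4.4), the bootstrap counterpart of (2.7):
\sup_{y \in K}\Big|\hat h^*(y) - h^*(y)
   - \frac{1}{m}\sum_{j=1}^{m}\psi^*\big(Z_j^*, T^*(y)\big)\Big|
   = o_{P}\big(m^{-1/2}\big)
```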

Simulations
Throughout this section, g(X) = 4X − 1, X ~ U([0, 1]) and ε ~ N(0, 1) are chosen. Moreover, the null hypothesis that h belongs to the class of Yeo and Johnson (2000) transformations with parameter θ ∈ Θ_0 = [0, 2] is tested. Under H_0 we generate data using the transformation h = (Λ_{θ_0}(·) − Λ_{θ_0}(0))/(Λ_{θ_0}(1) − Λ_{θ_0}(0)) to match the identification constraints h(0) = 0, h(1) = 1. Under the alternative we choose transformations h with an inverse given by the convex combination (4.5), for some θ_0 ∈ [0, 2], some strictly increasing function r and some c ∈ [0, 1]. In general it is not clear whether a growing factor c leads to a growing distance (2.5). Indeed, the opposite might be the case if r is somehow close to the class of transformation functions considered in the null hypothesis. Simulations were conducted for r_1(Y) = 5Φ(Y), r_2(Y) = exp(Y) and r_3(Y) = Y³, where Φ denotes the cumulative distribution function of the standard normal distribution, and c = 0, 0.2, 0.4, 0.6, 0.8, 1. The prefactor in the definition of r_1 is introduced because the values of Φ are rather small compared to the values of Λ_θ; that is, without this factor, Λ_{θ_0} would (except for c = 1) dominate the 'alternative part' r of the transformation function even when using the convex combination in (4.5). Note that r_2 and Λ_0 only differ in their standardization. Therefore, if h is defined via (4.5) with r = r_2, the resulting function is close to the null hypothesis case for c = 1.
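For concreteness, data generation under H_0 can be sketched in R as follows; yeo_johnson implements the standard Yeo and Johnson (2000) formula, the transformation is standardized as described above, and the inversion of h is done numerically. The function names are hypothetical, and the sketch assumes the design of this section.

```r
# Yeo-Johnson transformation (standard formula); theta is scalar.
yeo_johnson <- function(y, theta) {
  ifelse(y >= 0,
         if (abs(theta) > 1e-8) ((y + 1)^theta - 1) / theta else log1p(y),
         if (abs(theta - 2) > 1e-8)
           -((1 - y)^(2 - theta) - 1) / (2 - theta) else -log1p(-y))
}

# Generate (X, Y) under H0: h is Lambda_theta0 standardized via (2.2).
simulate_H0 <- function(n, theta0) {
  X <- runif(n); eps <- rnorm(n)
  S <- 4 * X - 1 + eps                      # S = g(X) + eps = h(Y)
  h <- function(y) (yeo_johnson(y, theta0) - yeo_johnson(0, theta0)) /
                   (yeo_johnson(1, theta0) - yeo_johnson(0, theta0))
  # h is strictly increasing, so Y = h^{-1}(S) can be found by root search:
  Y <- vapply(S, function(s)
    uniroot(function(y) h(y) - s, c(-50, 50), extendInt = "yes")$root,
    numeric(1))
  data.frame(X = X, Y = Y)
}
```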
For calculating the test statistic the weight function w was set equal to one. The nonparametric estimator of h was calculated as in Colling and Van Keilegom (2019) (see Appendix B for details) with the Epanechnikov kernel K(y) = (3/4)(1 − y²) I_{[−1,1]}(y) and normal reference rule bandwidths (see for example Silverman (1986)) based on σ̂_u² and σ̂_x², which are estimators for the variances of U = T(Y) and X, respectively. The number of evaluation points N_x for the nonparametric estimator of h was set equal to 100 (see Appendix B for details). The integral in (B.2) was computed by applying the function integrate implemented in R. In each simulation run, n = 100 independent and identically distributed random pairs (Y_1, X_1), ..., (Y_n, X_n) were generated as described before, and 250 bootstrap quantiles, based on m = 100 bootstrap observations (Y*_1, X*_1), ..., (Y*_m, X*_m), were calculated as in Algorithm 4.1, using for κ the U([−1, 1])-density, for ℓ the standard normal density, and a_n = b_n = 0.1. To obtain more precise estimators of the rejection probabilities under the null hypothesis, 800 simulation runs were performed for each choice of θ_0 under the null hypothesis, whereas in the remaining alternative cases 200 runs were conducted. Among other things, the nonparametric estimation of h, the integration in (B.2), the optimization with respect to θ and the number of bootstrap repetitions cause the simulations to be quite computationally demanding. Hence, an interface to C++ as well as parallelization were used to conduct the simulations.

Table 1: Rejection probabilities at θ_0 ∈ {0, 0.5, 1, 2} and r ∈ {r_1, r_2, r_3}.
The main results of the simulation study are presented in Table 1. There, the rejection probabilities are listed for the settings with h = (Λ_{θ_0}(·) − Λ_{θ_0}(0))/(Λ_{θ_0}(1) − Λ_{θ_0}(0)) under the null hypothesis, and h as in (4.5) under the alternative, with r ∈ {r_1, r_2, r_3}, c ∈ {0, 0.2, 0.4, 0.6, 0.8, 1} and θ_0 ∈ {0, 0.5, 1, 2}. The significance level was set equal to 0.05 and 0.10. Note that the test keeps the level or is even a bit conservative. Under the alternatives the rejection probabilities differ not only between different choices of r, but also between different transformation parameters θ_0 inserted in (4.5). While the test shows high power against some alternatives, there are also cases where the rejection probabilities are extremely small. There are certain reasons that explain these observations: first, the class of Yeo-Johnson transformations seems to be quite general, and second, the testing approach itself is rather flexible due to the minimization with respect to γ. A look at the definition of the test statistic in (2.6) shows that it attains small values whenever the true transformation function can be approximated by a linear transformation of Λ_θ̃ for some appropriate θ̃ ∈ [0, 2]. In the following, this issue is explored further by analysing some graphics. All of the figures that occur in the following have the same structure and consist of four panels. The upper left panel shows the true transformation function with inverse function (4.5). Due to the choice of g(X) = 4X − 1 and X ~ U([0, 1]), the vertical axis reaches from −1 to 3, which would be the support of h(Y) if the error were neglected. In the upper right panel the parametric estimator of this function is displayed.
In the lower left panel one can see whether the true transformation function can be approximated by a linear transform of some Λ_θ̃, θ̃ ∈ [0, 2], which is an indicator for rejecting or not rejecting the null hypothesis, as was pointed out before. As already mentioned, the rejection probabilities differ not only between different deviation functions r, but also within these settings. For example, when considering r = r_1 with c = 0.6, the rejection probabilities for θ_0 = 0.5 amount to 0.035 for α = 0.05 and to 0.050 for α = 0.10, while for θ_0 = 2 they are 0.415 and 0.545. Figures 1 and 2 explain why the rejection probabilities differ that much. While for θ_0 = 0.5 the transformation function can be approximated quite well by transforming Λ_{1.06} linearly, the best approximation for θ_0 = 2 is given by Λ_{1.94} and seems to be relatively bad. The best approximation for c = 1 is attained for θ around 1.4. In contrast, considering θ_0 = 2 and r = r_3 results in a completely different picture. As can be seen in Figure 3, even for c = 0.2 the resulting h differs so much from the null hypothesis case that it cannot be transformed linearly into a Yeo-Johnson transformation (see the lower left panel). Consequently, the rejection probabilities are rather high. Under some alternatives the rejection probabilities are even smaller than the level. This behaviour indicates that, from the presented test's perspective, these models seem to fulfil the null hypothesis more convincingly than the null hypothesis models themselves. The reason for this can be seen in Figure 4 for the setting θ_0 = 1, c = 0.4 and r = r_1. There, the relationship between the nonparametric estimator of the transformation function and the true transformation function is shown. While the diagonal line represents the identity, the nonparametric estimator seems to flatten the edges of the transformation function.
In contrast, using r = r_1 in (4.5) steepens the edges, so that both effects neutralize each other. Similar effects cause low rejection probabilities for r = r_2, although the reasoning is slightly more involved and is also associated with the boundedness of the parameter space Θ_0 = [0, 2]. One possible solution consists in adjusting the weight function w such that the boundary of the support of Y no longer belongs to the support of w. In Table 2 the rejection probabilities for a modified weighting approach are presented. There, the weight function was chosen such that the smallest five percent and the largest five percent of observations were omitted, to avoid the flattening effect of the nonparametric estimation. Indeed, the resulting rejection probabilities under the alternatives increase and lie above those under the null hypotheses.
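The modified weighting just described amounts to an indicator weight on the central part of the observed responses; a minimal sketch (with hypothetical names) is the following.

```r
# Trimmed weight: w(y) = 1 on the central 90% of the observed responses,
# omitting the smallest and largest five percent.
make_trimmed_w <- function(Y, trim = 0.05) {
  q <- quantile(Y, c(trim, 1 - trim))
  function(y) as.numeric(y >= q[1] & y <= q[2])
}
```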

A Assumptions for the main results
In the following assumptions, let Y denote the support of Y (which depends on n under local alternatives). Further, F_S denotes the distribution function of S_1 as in (3.2), and T_S denotes the transformation s ↦ (F_S(s) − F_S(0))/(F_S(1) − F_S(0)).

(A1) The set Υ = C_1 × C_2 × Θ is compact.
(A2) The weight function w is continuous with compact support Y_w ⊂ Y.
(A4) There exists a unique strictly increasing and continuous transformation h such that model (2.1) holds with X independent of ε.
(A5) The function h_0 defined in (3.1) is strictly increasing and continuously differentiable, and r is continuous on Y_w. F_Y is strictly increasing on the support of Y.
(A8) The transformation estimator ĥ fulfills (2.7) for some function ψ and some U_0 (independent of n under local alternatives) with T_S(h(Y_w)) ⊂ U_0.

When considering a fixed alternative H_1 or the relevant hypothesis H̄_0, (A6) and (A8) are replaced by the following assumptions (A6') and (A8') (assumption (A8') is only relevant for H̄_0). Note that h is then a fixed function, not depending on n.

B Nonparametric transformation estimation
In this section we consider a transformation estimator which fulfills assumption (A8) and in particular the expansion (2.7). To this end we reproduce the definitions of Colling and Van Keilegom (2019) and prove Lemma B.2 below. Denote the conditional distribution function of U_1 from (2.8), given X_1 = x, by F_{U|X}(·|x), and estimate it by a kernel estimator F̂_{U|X}(·|x).
Here h_x and h_u are bandwidths and K is an appropriate kernel function (as in assumptions (B4) and (B5) below). Further, consider some kernel L and bandwidth b_0 fulfilling assumption (B6), and define the estimator Q̂.

Remark B.1. In principle, the derivative with respect to any other component x_i fulfilling assumption (B3) below can be used in (B.2) as well (similar to Chiappori et al. (2015)). W.l.o.g. only the case i = 1 is considered here.
Let F̂_Y be the empirical distribution function of Y_1, ..., Y_n, define T̂(y) = (F̂_Y(y) − F̂_Y(0))/(F̂_Y(1) − F̂_Y(0)), and estimate U_i from (2.8) by Û_i = T̂(Y_i). The conditional distribution function of U given X = x can then be expressed in terms of the model components; see Colling and Van Keilegom (2019) for details. Then, with Q(·) = h(T^{-1}(·)), the function ψ in the expansion (2.7) can be written explicitly. Note that E[ψ(Z_j, u)] = 0 for all u.
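As an illustration of the building blocks just introduced, the following R sketch computes T̂ from the empirical distribution function together with a simplified kernel estimator of F_{U|X}(u|x). The actual estimator of Colling and Van Keilegom (2019) additionally smooths in the u-direction with bandwidth h_u and proceeds via the ratio of derivatives defining Q̂; the simplified Nadaraya-Watson form below is only meant to convey the idea and is not the estimator analysed in this appendix.

```r
# T-hat: rescaled empirical distribution function, cf. (2.8).
That_fn <- function(Y) {
  FY <- ecdf(Y)
  function(y) (FY(y) - FY(0)) / (FY(1) - FY(0))
}

# Simplified kernel estimator of F_{U|X}(u|x): Epanechnikov weights in x,
# indicator (unsmoothed) in u.
F_UX_hat <- function(u, x, U, X, hx) {
  Kx <- pmax(0.75 * (1 - ((X - x) / hx)^2), 0)
  sum(Kx * (U <= u)) / sum(Kx)
}
```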
In the following, assumptions (adjusted from Colling and Van Keilegom (2019)) are given which ensure (A8) for the estimator ĥ from (B.4). Let, as in (A8), T_S(h(Y_w)) ⊂ U_0, where U_0 is independent of n under local alternatives and lies in the interior of the support of U. Let X ⊂ R^{d_X} denote the support of X.
(B1) The cumulative distribution function of ε is absolutely continuous and has a density that is continuous on its support. Furthermore, X and ε are independent and U_0 is a connected subset of R.
(B4) The bandwidths h_x and h_u satisfy, for an appropriate q ∈ N, rate conditions analogous to those in Colling and Van Keilegom (2019).

(B5) The kernel K is symmetric with a connected and compact support containing a neighbourhood of 0. Further, K is q-times continuously differentiable with K and K′ being of bounded variation. Moreover, ∫ K(z) dz = 1 and ∫ z^l K(z) dz = 0 for all l = 1, ..., q − 1.
(B6) The kernel L is twice continuously differentiable with uniformly bounded derivatives and with median 0, and b = b_n > 0 is a bandwidth sequence satisfying suitable rate conditions.

(B7) v is a weight function with compact support X_0 ⊂ X_{∂1} with nonempty interior. Further, ∫_{X_0} v(x) dx = 1, and v is q-times continuously differentiable with all these derivatives uniformly bounded in the interior.

(B8) The regression function g is continuously differentiable with respect to x_i on X for i = 1, ..., d_X.
(B9) The joint density function f_{Y,X}(y, x) of (Y, X) is uniformly bounded and (q + 2)-times continuously differentiable, with all these derivatives uniformly bounded.

The following result holds under the null hypothesis H_0, under fixed alternatives H_1 and under local alternatives H_{1,n}.
Proof. In the case of a fixed transformation h, the assertion is covered by Theorems 5.1 and 5.2 in Colling and Van Keilegom (2019). Therefore, only local alternatives need to be considered. To this end we consider the transformation class H = {h_α : α ∈ A} for an appropriate set A. In the case of local alternatives as in equation (3.1), consider for example A = {n^{-1/2} : n ∈ N, n ≥ N} for a sufficiently large N ∈ N. The expansion is shown uniformly in α ∈ A, that is, uniformly in h ∈ H, and uniformly in y ∈ Y_w. Nevertheless, most arguments used for fixed h as in Colling and Van Keilegom (2019) are still valid. Note that in our framework S = g(X) + ε does not depend on n, and consider Y = h^{-1}(S) for h ∈ H. First, note that neither the U_i nor the Û_i depend on α, since U_i = T_S(h(Y_i)) = T_S(S_i) and the Û_i are functions of the S_i only. Hence, Q from (B.2) and its estimator Q̂ from (B.1) are independent of α ∈ A. Moreover, Q̂ is uniformly consistent. By standard arguments it can be shown that for T̂ from (B.3) and T from (2.8) one has T̂(y) − T(y) = O_P(n^{-1/2}) uniformly in y ∈ Y_w and α ∈ A, so that

ĥ(y) − h(y) = Q̂(T̂(y)) − Q(T(y))
            = Q̂(T(y)) − Q(T(y)) + Q̂′(T(y))(T̂(y) − T(y)) + o_P(n^{-1/2})
            = Q̂(T(y)) − Q(T(y)) + Q′(T(y))(T̂(y) − T(y)) + o_P(n^{-1/2})
            = (1/n) Σ_{j=1}^n ψ(Z_j, T(y)) + o_P(n^{-1/2}).
Since Γ_0 is positive definite by assumption and M_n(γ_0) and M_n(γ̂) are bounded in probability, one obtains from (C.3) that γ̂ − γ_0 = O_P(n^{-1/2}). Now, again by Taylor expansion, for all values γ with γ − γ_0 = O_P(n^{-1/2}), M_n(γ) can be approximated by a quadratic map q defined using R(·) from (3.3). The minimizer z_0 of the quadratic function q can easily be obtained, where we insert the expansion from (2.7) as well as (3.1) and use the definition β = E[w(h_0^{-1}(S_1)) r_0(h_0^{-1}(S_1)) R(S_1)]. It is also easy to see that (for some γ̄) it is sufficient to consider q(z_0) = min_z q(z) instead of T_n. Inserting the expansion for z_0 into q(·), as well as inserting the expansion for ĥ from (2.7) and (3.1) into M_n(γ_0), gives the leading terms. With some simple calculations of variances one shows that, after centering the multiple sums, those terms where some of the indices coincide are negligible. Considering the (centred) multiple sums with distinct indices only, Hoeffding decompositions are applied to the resulting U-statistics (see, e.g., Section 1.6 in Lee (1990)). Again with simple but tedious calculations of variances one obtains the following dominating terms, where U_n is a U-statistic of order 2 with a symmetric kernel
which coincides with ζ(z_1, z_2) from (3.6). Further, W_{0,n} = n^{-1} Σ_{i=1}^n ζ̃(Z_i), where ζ̃ is defined using r̃ from (3.7). Note that ζ is symmetric. Hence, referring to Witting and Müller-Funk (1995, p. 141), it can be written as ζ(z_1, z_2) = Σ_{k=1}^∞ λ_k ρ_k(z_1) ρ_k(z_2) (in the L² sense corresponding to the distribution F_Z), with notations from Theorem 3.1. Referring to Remark 3.2, ζ is positive semi-definite, which results in λ_k ≥ 0, k ∈ N. From classical results on U-statistics, nU_n converges to Σ_{k=1}^∞ λ_k(W_k² − 1) in distribution (again with notations as in Theorem 3.1); see e.g. Lee (1990), Theorem 1 in Section 3.2.2. On the other hand, n^{1/2} W_{0,n} converges to a normal distribution by the central limit theorem. As U_n and W_{0,n} are dependent, we have to go through some of the steps of the proof of Theorem 1 in Lee (1990, p. 79) to obtain the limiting distribution of T_n. Lee (1990) uses truncated sums with W_{k,n} = n^{-1/2} Σ_{i=1}^n ρ_k(Z_i) and v_{k,n} = n^{-1} Σ_{i=1}^n ρ_k²(Z_i) = 1 + o_P(1), by the law of large numbers and the orthonormality of the eigenfunctions. Now, to obtain convergence of nU_n + n^{1/2} W_{0,n}, note that by the multivariate central limit theorem (W_{0,n}, W_{1,n}, ..., W_{K,n})^t converges in distribution to (W_0, W_1, ..., W_K)^t as defined in Theorem 3.1, for each K. Hence, by the continuous mapping theorem we obtain the convergence of the truncated versions for each K. Proceeding as in the proof of Theorem 1 in Lee (1990, p. 79) by letting K → ∞, one obtains Σ_{k=1}^∞ λ_k(W_k² − 1) + W_0 as the limit of nU_n + n^{1/2} W_{0,n}. Note further that (C.6) in particular leads to b = Σ_{k=1}^∞ λ_k, such that nU_n + b + n^{1/2} W_{0,n} converges to Σ_{k=1}^∞ λ_k W_k² + W_0, which completes the proof of Theorem 3.1.

Proof of Theorem 3.3.
Note that the functions f_γ(y) = w(y)(h(y)c_1 + c_2 − Λ_θ(y))² are bounded, the parameter set C_1 × C_2 × Θ is compact, and for every y the map γ = (c_1, c_2, θ) ↦ w(y)(h(y)c_1 + c_2 − Λ_θ(y))² is continuous. Hence, following Lemma 6.1 in Wellner (2005), the class F = {f_γ : γ ∈ C_1 × C_2 × Θ} is a Glivenko-Cantelli class. This yields the assertion.

Proof of Theorem 3.4. The beginning of the proof is similar to the proof of Theorem 3.1, and we state the main differences. Again, we write T_n = min_{γ∈Υ} M_n(γ) = M_n(γ̂) with M_n as in (C.1). Recall that by (A6'), γ_0 = (c_{1,0}, c_{2,0}, θ_0) is the unique minimizer of M. We can again derive γ̂ − γ_0 = o_P(1) and further (C.3) and (C.4), but now with Γ_0 replaced by Γ as in (3.9), which is the limit of n^{-1} Hess M_n(γ_0). However, M_n(γ_0) and M_n(γ̂) are not bounded in probability under the model considered here. Instead we will show that

M_n(γ_0) − M_n(γ̂) = O_P(n^{1/2}),    (C.7)

and thus we can derive from (C.3) that γ̂ − γ_0 = O_P(n^{-1/4}). To obtain (C.7), define

M̃_n(γ) = Σ_{i=1}^n w(Y_i)(h(Y_i)c_1 + c_2 − Λ_θ(Y_i))²

and let M̃_n(γ*) = min_{γ∈Υ} M̃_n(γ). Note that E[M̃_n(γ)] = nM(γ). Now one obtains (C.8) by applying (2.7), whereas (C.9) holds because the empirical process n^{-1/2}(M̃_n(γ) − nM(γ)), γ ∈ Υ, is Donsker. To derive (C.10), note that one inequality follows by (C.8) and (C.9), while the reverse inequality follows by (C.8) and (C.9) as well. Both inequalities together imply (C.10), and consequently (C.7) holds. Again, similarly to the proof of Theorem 3.1, we obtain by Taylor expansion that, for all values γ with γ − γ_0 = O_P(n^{-1/4}), M_n(γ) can be approximated by a quadratic map q. The minimizer z_0 of the quadratic function q can easily be obtained. To obtain the rate, note that E[w(Y_k)(h(Y_k)c_{1,0} + c_{2,0} − Λ_{θ_0}(Y_k))R_k] = 0, because γ_0 minimizes M(γ) and thus ∇M(γ_0) = 0. As in the proof of Theorem 3.1 it follows that, instead of considering T_n/n, one can consider q(z_0)/n to derive the limiting distribution. To this end, q(z_0)/n is decomposed into three terms. The first term on the right-hand side can be treated as in the proof of Theorem 3.1 by inserting the expansion from (2.7) to obtain the rate o_P(n^{-1/2}). The expectation of the second term on the right-hand side is M(γ_0). Inserting the expansion from (2.7) into the third term, one obtains the U-statistic term (C.11). Applying a Hoeffding decomposition to (C.11), one sees that the degenerate part is negligible and the dominating part is the (centred) linear term n^{-1/2} Σ_{i=1}^n δ(Z_i) with δ from Theorem 3.4. The assertion of the theorem now follows from the classical central limit theorem.

Supplement to: "Specification testing in semi-parametric transformation models"
by Nick Kloodt, Natalie Neumeyer and Ingrid Van Keilegom

D Bootstrap theory
In this section we use the notation for the probability space introduced in Section 4. The expectation with respect to P_2^1(ω, ·) is written as E[·|ω]. Note that the functions h* and ψ* depend on ω via the original sample; this is suppressed in the notation. We formulate the following additional assumptions.
(A8*) The following properties are meant conditional on the data (Y_i, X_i), i = 1, ..., n, and thus define for fixed n ∈ N some subsets A_n ∈ A_1 of Ω_1 on which these properties are valid. Thus, let ω ∈ A_n; then we assume the following.
(i) The transformation estimator ĥ* fulfils a lim sup-condition, the bootstrap analogue of the expansion (2.7) (cf. (4.4)), for all δ > 0, for some function ψ* and h* from Algorithm 4.1.

For A_n as defined above, we assume P(A_n) → 1 for n → ∞.
(A9*) Define the distribution function of Z* for some ω ∈ Ω_1 by F_{Z*}(z) = P_2^1(ω, {Z* ≤ z}) and assume that the corresponding supremum distance vanishes asymptotically. Moreover, for all compact K ⊆ R^{d_X+1} there exists an appropriate C > 0 such that, for n → ∞, a corresponding uniform bound holds for all z ∈ R^{d_X+1}, s ∈ R and for ψ from (A8).
Proof of Theorem D.1. As in the proof of Lemma D.5, the conditional distribution and expectation of (Y*_1, X*_1), ..., (Y*_m, X*_m) given (Y_1, X_1), ..., (Y_n, X_n) are denoted by P* and E*, respectively. Consider ω ∈ A_n with A_n from (A8*). The proof can be divided into two parts: first, the uniform convergence of some bootstrap components appearing in the asymptotic distribution of the bootstrap test statistic is proven; second, the assertion itself is shown via the convergence of the conditional distribution functions in probability. Referring to the definition of h*, the following condition (A6*) is valid.

D.2 Nonparametric transformation estimation in the bootstrap case
The aim of this subsection is to show that the estimation approach developed by Colling and Van Keilegom (2019) can be applied in this context. To this end, validity of assumptions (A8*) and (A9*) needs to be shown for the estimator ĥ from Appendix B and its bootstrap analogue ĥ*, so that Theorems D.1 and D.2 apply and Algorithm 4.1 gives a valid approximation of the critical value. Denote the conditional density of ε(θ_0) (defined in Algorithm 4.1), given X, by f_{ε(θ_0)|X}. Then we need the following assumptions.
(ii) Many of the assumptions in (A10) can be replaced by less complex but more restrictive versions. For example, due to (D.14), assumption (D.10) is implied under the alternative by a corresponding condition holding for all j = 1, ..., r − 1.
(iii) It holds that ‖θ̂ − θ_0‖ = O_P(n^{-1/4}) under the alternative. Unfortunately, the experimenter does not know in advance whether the null hypothesis or the alternative holds, so that in general (D.14) limits a_n to a_n^{-1} = o(n^{r/(4(r+1))}) = o(n^{1/4}).
Proof of Lemma D.3. Only the second assertion is shown, since the first one can be concluded similarly. The proof uses techniques similar to those of ?. First, for the deviation terms R_i = ε̂_i − ε_i(θ_0) and appropriate u*_i, i = 1, ..., n, a Taylor expansion is applied. For appropriate θ̄_y between θ̂ and θ_0, the R_i can be split into terms of the form

(ĝ(X_i) − g_{θ_0}(X_i))^j (j! a_n^j)^{-1} ℓ^{(j)}((u − ε_i(θ_0))/a_n)

for all j = 1, ..., r − 1 and some sufficiently large constant C > 0, so that it suffices to treat the corresponding remainder terms in equation (D.15); negligibility of the last summand directly follows from (D.14) and the boundedness of ℓ^{(r)}. Thanks to ?, to prove the required bound for all j = 0, ..., r − 1 and some constant C > 0, it suffices to show uniform (with respect to u) boundedness of the expectation. Hence, one has the bound for some constant C > 0 (see (D.8)) and thus the assertion for all j = 1, ..., r − 1. Further, R̃_i can be written as a sum of terms whose O_P-rates are independent of i. When inserting R̃_i in equation (D.15), one has, for any δ > 0 and all j = 1, ..., r − 1, that the expected value of the corresponding sum can be bounded by some constant C > 0 by assumptions (D.8) and (D.9), so that the term is negligible by (D.10) and (D.14). The remaining term can be treated similarly by applying (D.11) and (D.12). Altogether, one obtains the assertion uniformly on compact sets.
Proof of Lemma D.5. Note that, conditional on (Y_1, X_1), ..., (Y_n, X_n), the random variables (Y*_1, X*_1), ..., (Y*_m, X*_m) are independent and identically distributed. Moreover, after conditioning on the original data, assumptions (B1)-(B10) are valid for the bootstrap sample with probability converging to one, so that due to Remark 4.2 the same reasoning as in Colling and Van Keilegom (2019) can be applied to obtain (4.4). For notational convenience, the conditional distribution of (Y*_1, X*_1), ..., (Y*_m, X*_m) given (Y_1, X_1), ..., (Y_n, X_n) is written as P*, and the expectation with respect to P* is written as E*. Let F_{Y*|X*} denote the conditional distribution function of Y*_1 given X*_1 (and (Y_1, X_1), ..., (Y_n, X_n)). To verify (A8*), ψ* has to be examined further, and to define ψ* some further notation is needed. Let v be the weighting function from assumption (B7), and define the quantities D*_{p,0}(u, x), ..., D*_{f,1}(u, x) in terms of F_{Y*|X*} and f̂_{X*} from Algorithm 4.1. Then ψ* is defined analogously to ψ. Condition (A8*) for ψ* is implied by the same reasoning as in Colling and Van Keilegom (2019). Note that the first part of Remark 4.2 ensures that v can be used as the weighting function for the bootstrap data as well.
Remark D.7. Roughly speaking, the proof of Lemma D.5 was based on the convergence of ψ* to ψ. If the alternative holds, it is not even clear whether ψ* stabilizes in some sense (see assumption (v)). Hence, additional assumptions are needed. For that purpose define

F_{ε(θ)}(e) = P(ε(θ) ≤ e),   F_S^B(u) = ∫ F_{ε(θ_0)}(u − g_{θ_0}(x)) f_X(x) dx.

While doing so, assume F_S^B(0) < F_S^B(1) to ensure that T_S^B is well defined, and define Φ̃ accordingly. Φ̃ plays a similar role under the alternative as Φ(u|x) = F_{U|X}(u|x) under the null hypothesis and thus needs to be continuously differentiable on U_0 × supp(v), with the same i as in (B3) (here and in the following, o_P- and O_P-terms are with respect to P_1 and for n → ∞), and the asymptotic behaviour of ψ* cannot be reduced to the convergence to ψ. Nevertheless, ψ* can be expressed as in (D.21) with D*_{p,0}, ..., D*_{f,1} as in (D.19)-(D.20). The main idea is to prove uniform convergence of ∂Φ̂*/∂u and ∂Φ̂*/∂x_1 on U_0 × supp(v) to ∂Φ̃/∂u and ∂Φ̃/∂x_1, respectively, while the remaining parts of δ*_j ṽ*_1, δ*_j ṽ*_2 and Q̂* are bounded in probability. Due to (D.23), U_0 is contained in the corresponding interval. Due to Lemma D.3, F̂_{S*} can be written, for appropriate u*_{i,w} ∈ R, i = 1, ..., n, w ∈ supp(κ), as

F̂_{S*}(u) = (1/n²) Σ_{i=1}^n Σ_{k=1}^n ∫ F_ξ((u − ĝ(X_i + b_n w) − ε̂_k + n^{-1} Σ_{l=1}^n ε̂_l)/a_n) κ(w) dw
          = (1/n²) Σ_{i=1}^n Σ_{k=1}^n ∫ F_ξ((u − g_{θ_0}(X_i + b_n w) − ε_k(θ_0))/a_n) κ(w) dw + o_P(1),

where the intermediate step expands in ĝ(X_i + b_n w) − g_{θ_0}(X_i + b_n w) + n^{-1} Σ_{l=1}^n ε̂_l at the points u*_{i,w}.