Behaviour of the Monotone Single Index Model Under Repeated Measurements

The generalized linear model is an important method in the statistical toolkit. The isotonic single index model can be thought of as a further generalization whereby the link function is assumed to be monotone non-decreasing as opposed to known and fixed. Such a shape constraint is quite natural in many statistical problems, and is fulfilled by the usual generalized linear models. In this paper we consider inference in this model in the setting where repeated measurements of predictor values and associated responses are observed. This setting is encountered in medical studies and is very different from the one considered in the classical monotone single index model studied in the literature. Here, we use nonparametric maximum likelihood estimation to infer the unknown regression vector and link function. We present a detailed study of finite and asymptotic properties of this estimator and propose goodness-of-fit tests for the model. Through an extended simulation study, we show that the model has competitive predictive performance. We illustrate our estimation approach using a Leukemia data set.


Introduction
The class of generalized linear models was introduced as a unified, and hence more general, framework which encompasses many popular tools such as linear and logistic regression (Nelder and Wedderburn, 1972). Both these examples, which are widely used in practice, involve an increasing link function: the identity and logit function respectively. In fact, link functions in the usual generalized linear models are increasing. Thus, these generalized linear models can be themselves viewed as a sub-family of the even more general class of isotonic single index models.
Furthermore, isotonic single index models offer flexibility and robustness against picking the wrong model. As rightly pointed out in Härdle et al. (1997, page 213) "the advantage of [the single index] approach is that an interpretable linear single index, a weighted sum of of the predictor variables, is still produced." Estimation in the isotonic single index model has recently received increasing attention, see for example Kakade et al. (2011), Chen and Samworth (2016), Foster et al. (2013) for the high-dimensional setting, and the recent work of Balabdaoui et al. (2019). The model is closely related to the semiparametric single index model which was much popularized in econometrics (Li and Racine, 2007). Unlike the isotonic single index model, no shape assumptions on the link function are made. However, it is necessary to impose some smoothness conditions on the unknown link function, which is typically estimated via kernel-based nonparametric methods. The most popular estimation methods are due to Ichimura (1993) and Klein and Spady (1993), which are implemented in the R package np. Despite their popularity, one of the main downsides of such approaches is the fact that they require bandwidth selection, known to be a hard problem to solve. For a review of existing estimation algorithms we refer the reader to Li and Racine (Li and Racine,Chapter 8). Unlike kernel-based methods, isotonic estimation is fully automatic and no tuning parameter needs to be selected to achieve optimal performance. In addition, Foster et al. (2013) argue that, whenever appropriate, the assumption of monotonicity "noticeably improves performance". The advantages of the monotone single index model have also been investigated in Qin et al. (2019), Wan et al. (2017) where applications to, respectively, causal inference and drug interaction modelling were considered. Qin et al. (2019) considered a similar estimator as we do here assuming continuous design, while Wan et al. (2017) studied a spline approach, which has the drawback of having to use cross-validation in order to select to optimal knot positions.
To the best of our knowledge, theoretical developments for the isotonic single index model are restricted to the setting of random and continuous design where it is assumed that the covariates are generated from a multivariate density. In the recent work of Balabdaoui et al. (2019) it is shown that the rate of convergence of the least squares estimator (LSE) of the isotonic link function in this setting is n 1/3 under some standard regularity assumptions. The methods used there cannot be adapted to datasets where repeated measurements of predictor values are observed. Such discrete design settings occur for example in drug-interaction studies, cf. Carter et al. (1986), Wan et al. (2017). We note that although Wan et al. (2017) considered an estimation method which is not fully automatic like the one presented here, they stressed the benefit to be gained from fitting a monotone single index model. Hence, it is crucial to understand the behaviour of this estimator as well as any potential technical challenges associated with the setting with repeated predictors.
Our main contributions in this paper are to give a precise description of the asymptotic behavior of the maximum likelihood estimator (MLE) in the monotone single index model under discrete design and develop an asymptotic goodness-of-fit test which could be used before applying our approach. Similarly to Qin et al. (2019), Chen and Samworth (2016), we study the case where the sample size n is allowed to grow while the dimension d of the predictors is held fixed. One of our main findings is that, as opposed to the classical monotone single index models where the nonparametric LSE or MLE of the link function converges at the n 1/3 -rate, the asymptotics of the MLE under the discrete design exhibit the faster rate of n 1/2 . We also investigate the finite sample performance of the MLE using synthetic data, where the MLE is found to have competitive predictive performance when compared to the MLE in the true generalized linear model from which the data are drawn.
The manuscript is organized as follows. In Section 2 we give a description of the isotonic single index model under discrete design and prove some of the fundamental properties of the MLE such as existence and lack of uniqueness of the estimator. In Section 3 we prove consistency of the MLE of the link function and give sufficient conditions under which it is asymptotically normal. Furthermore, we propose two goodness-of-fit tests. In Section 4 we illustrate through simulations the performance of the proposed method. In Section 5, following Wan et al. (2017), we give an example involving drug interactions in a Leukemia dataset, where we compare the monotone single model approach with kernel-based nonparametric single-index and fully parametric methods. To keep the reading more fluid, we defer all proofs and technical details to the Supplementary Material.

The Model and Maximum Likelihood Estimation
2.1. The Model and Corresponding Log-Likelihood Let n be the total size of a sample of observed responses organized into m groups. Observations in the i-th group for i = 1, . . . , m are assumed to have expectation of the form where is a design point, is an unknown regression parameter, also called the "index", and f 0 is an unknown increasing link function. We use the term increasing to mean non-decreasing. All values of x i , i = 1, . . . , m are assumed to be distinct. We assume that the density function of the observed response in the i-th group, with respect to a given dominating measure λ which is either Lebesgue measure or counting measure, is given by where c is a normalizing function, φ > 0 is the dispersion parameter, and is a given real valued function with first derivative > 0, and inverse −1 = B . Balabdaoui et al. (2019, Proposition 5.1) study identifiability of the model when the distribution of the predictors has a density with respect to Lebesgue measure. The situation considered here is very different in that we observe f 0 (α T 0 x) with noise only at the discrete design points x = x 1 , . . . , x m . Hence, the model is not necessarily identifiable in the classical sense. However, we can show that the index parameter satisfies a weaker form of identifiability. Let S d−1 denote the unit sphere in d-dimensional Euclidean space.
Proposition 1. Assume that there exist an increasing function f 0 and α 0 ∈ S d−1 such that observations in the i-th group have expectation μ 0i of the form (2.1). Define the set Consider any increasing function f and index α ∈ S d−1 such that μ 0i = f (α T x i ), i = 1, . . . , m. Then, the index α is set identifiable in the sense that α ∈ S 0 .
Heuristically, the set S 0 identifies all α which, via the projected predictors α T x i , preserve the correct order in the components of μ 0 . This can be rephrased as "S 0 is the subset of Assume that we have n i independent observations Y ij (j = 1, . . . , n i ) from group i (i = 1, . . . , m) with density as in Eq. 2.2 with respect to λ, and where μ 0i = f 0 (α T 0 x i ) for some unknown increasing function f 0 and index α 0 ∈ S d−1 . The overall sample size is n = m i=1 n i , and the log-likelihood is Y ij is the average of the observations from group i.

The MLE in the Monotone Single Index Model
The MLE of (f 0 , α 0 ) maximizes l n (f, α) in Eq. 2.4 over all increasing functions f and index vectors α ∈ S d−1 . To maximize l n we use the following profile approach: we first hold α fixed and maximize in f ; then we maximize the resulting likelihood in α.
Consider first maximization in f for a fixed α. Let M denote the class of increasing functions . For , vectors and , let iso(w, W ) denote the weighted isotonic regression of W, which is the unique solution to where M k denotes the class of vectors such that β 1 ≤ . . . ≤ β k . Some well-known properties of isotonic regression, such as uniqueness, are reviewed in the Supplementary Material, Section A.
Proposition 2. Fix α ∈ S d−1 . Let z 1 , . . . , z k denote the unique values of α T x 1 , . . . , α T x m . The maximizer of l n (f, α) over f ∈ M exists and is unique at the points z 1 , . . . , z k . Moreover, if π denotes the unique permutation on {1, . . . , k} such that z π(1) < . . . < z π(k) , then the values of the maximizer at the points z π(1) , . . . , z π(k) are the components of iso(w, W ), where Let f n | α denote the maximizer of l n (f, α) over f ∈ M for fixed α. We now maximize l n ( f n | α , α) in α to find the MLE. The MLE of (f 0 , α 0 ) is any pair ( f n , α n ) where α n ∈ { α n }, where the set { α n } is defined in the next proposition, and f n = f n | αn , whereas the MLE μ n of μ 0 has components μ ni = f n ( α T n x i ), i = 1, . . . , m. The MLE exists, but is not unique, as the following propositions shows.
It is also important to note the ways in which the maximum likelihood estimator is not unique. First, the maximum likelihood estimator f n | α of the link function for a fixed α is uniquely defined only on the set of projections α T x i (i = 1, . . . , m), and any monotone interpolation of the values at these points gives a maximum likelihood estimator. More importantly, with α n as above, it can be seen from the proof of Proposition 3 that any α ∈ S d−1 such that α T x i < α T x j for all i, j satisfying α T n x i < α T n x j , maximizes the likelihood.
3 Asymptotics of the Maximum Likelihood Estimator 3.1. Consistency Assume that the design points x 1 , . . . , x m are fixed. Throughout, we make one of the two following assumptions about the design.
Fixed design: We assume that there are n i (i = 1, . . . , m) observations collected from group i at each design point x i (i = 1, . . . , m) with a non-random n i . We consider what happens as n increases in such a way that λ n,i = n i /n converges to some λ i ∈ (0, 1) for all i = 1, . . . , m.
Random design: We assume that at each design point x i there is a random number n i of observations Y ij where (n 1 , . . . , n m ) follows a multinomial distribution with parameters n and λ i (i = 1, . . . , m). This setting occurs if one observes independent copies of (X, Y ) where X is a discrete variable such that pr(X = x i ) = λ i ∈ (0, 1) for i = 1, . . . , m and Y is from the group i if and only if X = x i . Again, we consider what happens as n increases, and call this assumption (A2).
Recall the definition of S 0 given above in Eq. 2.3.

Proposition 4. Under either assumption (A1) or (A2), almost surely there exists a positive integer
Proposition 4 shows that any monotone single index model MLE of α n is consistent for the set S 0 . The model is set identifiable for α 0 up to the set S 0 , see Proposition 1, and hence a stronger statement should not be expected. However, simulations in Section 4.1 indicate that the method estimates α 0 reasonably well for finite sample sizes, with efficiency similar to or only slightly worse than a fully parametric model.
We also note that, defining d( α n , S 0 ) = min α 0 ∈S 0 α n −α 0 , Proposition 4 implies the following. For all ε > 0 and all rates r n (i.e. r n is a positive number which converges to infinity as n → ∞) we have Lastly, an interesting alternative to the approach above is to consider instead estimation of the entire set S 0 . Alas, algorithms to produce the entire set { α n } are still an open problem, and these are a crucial first step to such lines of inquiry.
One important consequence of Proposition 4 is the following.
Then, under either assumption (A1) or (A2), almost surely, there exists a positive integer N 0 such that for for any convex function Ψ. In particular, we have The preceding result shows the benefits of the proposed single index model over the naive model where nothing is assumed about the true means μ 0i in Eq. 2.2. In the naive model, the natural estimator for μ 0i is Y i , and thus the proposition shows that our estimator has smaller risk than this naive estimator. Similar results were obtained for the discrete Grenander estimator (Jankowski and Wellner, 2009). Some simulations studying Proposition 5 are given in the Supplementary Material, Section C.3.
achieves its minimum over S 0 . Let α denote an arbitrary minimizer. Then, under either assumption (A1) or (A2), Despite its complex appearance, it is simple to sample from the limiting distribution described in Theorem 1, assuming that μ 0 is known.
It follows from Theorem 1 that in the important setting where all components of μ 0 differ from each other, the limiting distribution of This holds in particular if the true link function f 0 is strictly monotone. Practitioners comfortable with this assumption may therefore use the Gaussian limiting distribution to obtain confidence intervals for μ 0 , as the variance can be easily estimated as φ n /λ n,i ( μ n,i ) (estimation of dispersion is reviewed in Section E.2 of the Supplementary Material, while comes from the exponential family assumption). We also note that the limiting distribution depends on the exponential family only through the normalization (φ/ (μ 0i )) 1/2 . The √ n convergence rate is a result of the size of the underlying estimation space. In our set-up, the underlying distribution is discrete, and therefore the space of possible values of the estimators is small enough that, despite the nonparametric nature of the estimator, the convergence rate is still as fast as in the parametric setting. This is not the case, however, in the continuous set-up, where the large size of the underlying space decreases the convergence rate to n 1/3 , see Balabdaoui et al. (2019).
Define the test statistic We consider a weighted sum of squares since the variables where g| α is taken from Theorem 1.
The distribution of the random variable D 2 , depends on several unknowns which require estimation. First, it involves the dispersion φ, though this can be estimated using standard techniques, see Section E.2 in the Supplementary Material. Second, it depends on the unknown true mean μ 0 in a nontrivial manner via S 0 and part(μ 0 ). This dependence makes it difficult to estimate in practice. To circumvent this issue, we calibrate the test using quantiles of a simpler random variable D 2 n which we conjecture to be asymptotically stochastically larger than D 2 .
To define D 2 n , consider independent mean zero Gaussian random variables Z n,1 , . . . , Z n,m with variances 1/λ n,i , i = 1, . . . , m, respectively. For all α ∈ S d−1 , let G α be the set of all vectors such that where φ n is any consistent estimator of the true dispersion parameter φ; see Section E.2 of the Supplementary Material. Clearly, as n → ∞, D 2 n converges in distribution to If μ 0 is constant then part(μ 0 ) contains the single element {1, . . . , m} and S 0 = S d−1 . In this case, we have the equality D 2 = D 2 . Therefore, the asymptotic distribution of D 2 n is the same as D 2 when μ 0 is constant. In tests for monotonicity in the context of shape constrained estimation, the constant setting has previously been shown to be least favorable, see for example Ghosal et al. (2000, Inequality (2.7)) or Durot (2003, Theorem 1). We conjecture that this holds here as well; i.e. we conjecture that D 2 ≤ st D 2 . The conjecture clearly holds if μ 0 is constant. It also holds if μ 0 is such that μ 0i = μ 0j for all 1 ≤ i = j ≤ m (which is true if the link function is strictly increasing) since in this case D 2 = 0. Therefore, our conjecture holds in two key scenarios. To prove that the constant case is indeed least favorable is out of the scope of the paper.
Let q n,1−A denote the (1 − A)-quantile of D 2 n , for some fixed probability A ∈ (0, 1) The goodness-of-fit test rejects H 0 if D 2 n > q n,1−A . This guarantees a hypothesis test with asymptotic level no larger than A when μ 0 is constant and when μ 0i = μ 0j for i = j (i, j = 1, . . . , m). Moreover, the asymptotic level is exact when μ 0 is constant.
A second goodness-of-fit test is based on the following stochastic inequality.
where D 2 is the random variable defined in (3.4). When λ i = 1/m, i = 1, . . . , m, then K 2 is a chi-squared random variable with m − 1 degrees of freedom.
The proof is straightforward and is provided in the Supplementary Material. Note that the same approach can also be used to show that D 2 ≤ st φK 2 , which is consistent with the conjecture that D 2 ≤ st D 2 . A histogram showing the extent of the difference between the distributions of D 2 , D 2 n , and φK 2 for a particular choice of design is given in Fig. 2 in the the Supplementary Material. Section 4.2 studies the finite sample performance of the two proposed tests.

Finite Sample Performance
We next investigate the finite sample performance of the MLE in different scenarios. To compute the MLE, we use a modified stochastic search algorithm over α, and find f n | α (α T x j ), j = 1, . . . , m using PAVA. We use linear interpolation to obtain f n | α (α T x).

Estimation and Prediction
Example 1. This example is taken from Wan et al. (2017), where it forms the basis of their simulation study. The true model is In this example the different combinations of the levels of two drugs, d 1 and d 2 say, are used for prediction. The interaction of these drugs is modeled using the Emax model (Wagner, 1968;Kong and Lee, 2006). The dose levels for both drugs were set to (0, 0.1, 0.5, 1, 2, 4) and each dose level for drug 1 was combined with each dose level for drug 2, for a total number of interactions equal to m = 36. The true mean response curve is given by where again d i is the level of drug i, i = 1, 2. In this model, we can re-write In our simulation study, we considered several estimation methods, as follows.
(a). Generalized linear model with inverse link function as in the true model. We call this the "correct GLM".
(c). Empirical approach where μ 0 (x i ) is estimated by Y i . This yields no estimates of α 0 and cannot be used in predicting the response for new predictor values. d 2 ) and no assumptions are made on f 0 . We use a kernel-based estimator calculated with the R package np (Hayfield and Racine, 2008).
(g). A misspecified generalized linear model with link function equal to the identity and the predictor x as given in Eq. 4.1.
(h). A well-specified single index model with μ 0 (x) = f 0 (α T 0 x) where the predictor x is as given in Eq. 4.1 but no assumptions are made on f 0 . We use a kernel-based estimator calculated with the R package np (Hayfield and Racine, 2008).

(i). A misspecified generalized linear model with identity link function and
x T = (d 1 , d 2 ). Table 1 gives the mean of the estimated α 0 values for approaches (a), (b), (f), and (h)-as these are the only approaches which enable its estimation. Although nothing beats the correct GLM approach, the monotone single index model (b) performs well. Figure 1 shows boxplots of the estimates for models (a) and (b), from which one can note the striking fact that the MLE in the monotone single index model does not exhibit a big loss efficiency when compared with the MLE in the true GLM model. Note that in Table 1 and Fig. 1 we only report the case where σ 2 0 = 1 and n i = 3.  The five other settings performed similarly and hence their results are not reported here. In Tables 2 and 3, we consider estimation and prediction of the response surface: To study the estimation performance, we report in Table 2 the average sum of squared deviations of the different estimators from the true mean response surface at the observed predictors. The prediction performances are compared in Table 3 where we report the average sum of squared errors evaluated at 100 new and randomly selected predictor values. As expected, the correct GLM approach is best in all instances. In terms of prediction and estimation, the second best model is (b), the well-specified montone single index model-with the exception of model (c) which sometimes outperforms (b) in Table 2. Theoretically, we know from Proposition 5 that (b) will outperform (c) asymptotically. Recall also that (c) does not produce estimates of α 0 and cannot be used for prediction. We note that the kernel-based single-index approach (e) and (h) does not perform well throughout. The comparison here may not necessarily be fair, as one would not expect a nonparametric method perform well with small values of m. We include it here to highlight the importance of the monotonicity assumption in method (b). As mentioned in the introduction, Foster et al. (2013) also find that the assumption of monotonicity "noticeably improves performance". In fact, even the misspecified monotone model (d) outperforms methods (e)-(i) in Tables 2 and 3.
Example 2. In our second example, we consider the setting of a logistic regression. We take Y ij to be independent Bernoulli variables, with probability of success given by and consider three different cases of h 0 . We will assume throughout that x = (x 1 , x 2 ) T , and take α T 0 = (1, 1)/ √ 2. For the first variable x 1 , the design levels are set to (0, 0.01, 0.1, 0.25) and (0, 0.01, 0.1, 1.0) for the second variable x 2 . Each level for x 1 was combined with each level for x 2 , for a total of m = 16. This is the same design as that in case study in Section 5, except that the levels have been rescaled. We take values n i = 3, 10, 100 as in Example 1 and consider the following four estimation methods. Note that the samples n i in the data set of Section 5 are close 100.
Again, this yields no estimates of α 0 and cannot be used for making prediction.
(D.) A well-specified single index model with μ 0 (x) = f 0 (α T 0 x) but with no assumptions on f 0 . We use a kernel-based estimator calculated with the R package np (Hayfield and Racine, 2008).
The three different cases for h 0 are chosen to consider a variety of scenarios.
Plots of μ 0 (x i ) under the three scenarios are given in Fig. 2. The results of our simulations follow. Table 4 gives the mean of the estimated α 0 values for approaches (A), (B), (D), and Fig. 3 shows boxplots of the estimates for n i = 100 (this is the setting most similar to the real data set). Table 5 reports the mean sum of squared errors for the different estimators from the true mean response surface at the observed predictors, while Table 6 reports the same, except evaluated at 100 randomly selected predictor values, chosen to be within the range of the fixed design. In case (i), as expected, GLM performs best in all metrics, followed by the monotone SIM. For a semi-parametric approach, the monotone SIM method performs remarkably well when compared with the correct parametric method. In case (ii), where approach (A) is misspecified, the monotone SIM performs better than GLM. Both the GLM and monotone SIM methods also do not perform well for case (iii), as expected. As in Example 1, the kernel-based estimator of the single index model performs poorly for prediction of the response surface, despite being well-specified in all three cases.
We also repeated this example with a Poisson distribution instead of the binomial. The results are similar to those presented here, and are detailed in the Supplementary Material.

Goodness-of-Fit Test
We perform a simulation study of the proposed goodness-of-fit test for all examples of the previous section, with results given in Table 7. For Example 1, Example 2 case (i) and (ii), the conservative nature of the test is clearly visible. Furthermore, in Example 1, Example 2 case (i), the true mean vector is strictly increasing, in which case asymptotically D 2 = 0, contributing to the effect. In Example 2 case (ii), the true mean has a region where it is flat. However, for the particular choice of covariates, this region is quite small. For this reason, we added an artificial Example 2 case (iv) where μ 0i = E[Y i |x i ] = 0.5, i = 1, . . . , m. In this example, the mean function is flat in which case we know that the asymptotic distributions D 2 and D 2 are the same. This fact that can be seen from the observed significance levels. Lastly, in Example 2 case (iii) the model is misspecified. Here, the power performs reasonably well, noting that when n i = 3, the overall sample size is n = 108. Additional simulations  When the exponential family is binomial, it is possible for small sample sizes to observe Y i equal to 0 or 1, which makes computation of D 2 n in Eq. 3.3 difficult, as (Y i ) = 1/0 in this case. Therefore, for the binomial family we replaced the term (Y i ) with ( μ ni ), as μ ni is smoother and hence has fewer such instances. This alteration does not change the asymptotic behaviour of the test statistic. For cases where ( μ ni ) = 1/0, we simply set the term   Wan et al. (2017), we consider a drug interaction data set. Carter et al. (1986) present an analysis of cytotoxicity data, where the effect of treatment combinations from multiple drug agents on the binary response is studied. The response counts the number of surviving human promyelocytic Leukemia cells after being exposed to a combination of doses of methylmethanesulfonate and phorbol 12-myristrate, 13-acetate. In the analysis, the dose-response surface is estimated using multivariate parametric logistic regression. The data consists of a total of 16 design points, four per agent, with anywhere from 83 − 98 observations per point and an overall sample size of n = 1436 (Carter et al. 1986, Table 1, page 125).  The goal of the data analysis in Carter et al. (1986) was to evaluate the dose-response surface for the combination of two drug agents. Following Carter et al. (1986), we similarly transform the methylmethanesulfonate variable. In their paper, the authors perform a second order polynomial logistic regression, resulting in the fitted parametric mean function μ 1 (x) = −1.33 − 0.84x 1 + 0.159x 2 + 0.003883x 2 1 − 0.001308x 2 2 , after backward stepwise model selection. Here, x = (x 1 , x 2 ), where x 1 is the transformed methylmethanesulfonate variable, and x 2 is the phorbol 12-myristrate, 13acetate variable.
We compare their parametric model with a single index model where μ(x) = f 0 (α T 0 x) with x = (x 1 , x 2 ) T as above, under two assumptions: f 0 is either entirely unknown or increasing. When no assumptions are made on f 0 , we use the fitting method of Klein and Spady (1993) and let μ 2 (x) = f n,2 ( α T n,2 x) denote the nonparametric single index estimate. When f 0 is assumed to be increasing, we use the maximum likelihood estimator proposed in this work, and let μ 3 (x) = f n,3 ( α T n,3 x) denote the corresponding estimate. The nonparametric estimate was computed using the R package np (Hayfield and Racine, 2008). We also consider the monotone single index model with elliptical contours, that is, μ 4 (x) = f n,4 ( α T n,4 y) where y = (x 1 , x 2 , x 2 1 , x 1 x 2 , x 2 2 ) T , and the model μ 5 (x) = f n ( α T n,5 y), with y = (x 1 , x 2 , x 2 1 , x 2 2 ) T , where the ellipses have been constrained to have axes parallel to the coordinate axes. We do not include the intercept term as it is not identifiable. We choose the convention that f n,3 , f n,4 and f n,5 are piecewise linear. Figure 4, from left to right, shows grey-scale heatmap plots of μ 1 (x), μ 3 (x), and μ 5 (x). The phorbol 12-myristrate, 13-acetate variable appears on the vertical axis and the transformed methylmethanesulfonate variable appears on the horizontal axis. Black dots indicate the locations of the design points, and the size of the dot is proportional to n i . While all three approaches Figure 4: From left to right: grey-scale heatmap of the response surface for logistic regression, isotonic single index model with linear contours, and isotonic single index model with elliptical contours appear to identify the phorbol 12-myristrate, 13-acetate axis as the more important direction, the heatmaps appear quite different. We also implemented the proposed goodness of fit test for models μ 3 , μ 4 , μ 5 . In all three cases, the test fails to reject at the 5% level the hypothesis that the monotone single index model is valid for the datatset. The residual sum of squares is 264. 04, 264.39, 261.82, and 261.82 for μ 1 , μ 3 , μ 4 , μ 5 respectively. The residual sum of squares is very similar for μ 1 and μ 3 , however, the contour plots away from the design points are more reasonable for μ 3 . The residual sum of squares for μ 4 and μ 5 are the same, as are the fitted values for both models. Additional plots of the models are given in Figs. 5 and 6.
An unusual characteristic of the parametric logistic regression is that it selects a polynomial model where a decrease in the response surface is seen for large values of phorbol 12-myristrate, 13-acetate. Carter et al. (1986) discuss this and suggest that this is due to lysis; High concentrations of the drug render the cells uncountable. Comparing μ 1 with μ 3 explains that this is an artefact of the parametric model, and not a true phenomenon. Figure 7, left, plots the monotone estimate f n,3 (t), dotted line. For comparison, the solid line denoting μ 1 (x) projected, with smoothing, onto the line x has been added. The observed values Y i are shown as grey dots. We immediately see that the parabolic effect of the parametric regression is artificial and not due to lysis. Figure 7, right, similarly compares the parametric regression with the nonparametric single index regression. Here, we see f n,2 (t), dashed line, plotted with μ 1 (x) in solid, projected, with smoothing, onto the line t = α T n,2 x. The observed values Y i are again shown as grey dots. The nonparametric method identifies the methylmethanesulfonate axis as the important direction, which contradicts both the parametric and isotonic results, as well as a three dimensional visual inspection of the data set. We believe that this anomalous result is due to the small value of m. In fact, the comparison of the kernel-based is not necessarily a fair one, as this approach would not be expected to perform well with such a small design. We include it to, yet again, emphasize the benefits of the montonicity assumption in nonparametric single index models. R scripts with the analysis and additional details can be found in the Supplementary Material.

Conclusions
From the results obtained in the simulation study, one can observe that the main advantage of the approach proposed in this work is robustness and protection against model specification. The assumption of monotonicity of the link function is a natural one, and does not require the correct link function to be known which is a clear limitation with the parametric GLM. The only prior knowledge required from the user is the exponential family (e.g. Gaussian, binomial, Poisson) to which the distribution of the response belongs. Our simulation results show also competitiveness of the MLE. Indeed, of all estimation procedures considered, it is rivalled only by the correct parametric GLM approach, which in practice may not be known. Although the index vector in the monotone single index model with discrete design is only set-identifiable, our estimator shows remarkably small loss of efficiency against the MLE in the correct GLM. This is a very important feature which gives much incentive to apply our method in case the user is not ready to impose some parametric form on the link function.
The assumption of monotonicity is mild, and natural in a plethora of practical situations. In our simulations, we have also considered cases when this assumption is violated. Notably the goodness-of-fit test developed here has reasonable performance in terms of its ability to detect these situations and signal to the practitioner that the monotone model is not appropriate, at least for large sample sizes. Studying further the theoretical behaviour of the model in the misspecified setting, in both the continuous and discrete cases, would be of interest though. We conjecture that the MLE will continue to converge at rate √ n to the monotone model closest to the misspecified truth in the discrete setting. Due to set identifiability in our model (see Proposition 1), these projections would require careful consideration.