Shape constrained additive models

A framework is presented for generalized additive modelling under shape constraints on the component functions of the linear predictor of the GAM. We represent shape constrained model components by mildly non-linear extensions of P-splines. Models can contain multiple shape constrained and unconstrained terms as well as shape constrained multi-dimensional smooths. The constraints considered are on the sign of the first and/or the second derivatives of the smooth terms. A key advantage of the approach is that it facilitates efficient estimation of smoothing parameters as an integral part of model estimation, via GCV or AIC, and numerically robust algorithms for this are presented. We also derive simulation free approximate Bayesian confidence intervals for the smooth components, which are shown to achieve close to nominal coverage probabilities. Applications are presented using real data examples including the risk of disease in relation to proximity to municipal incinerators and the association between air pollution and health.


Introduction
This paper is about estimation and inference with the model

g(μ_i) = A_i θ + Σ_j f_j(z_ji) + Σ_k m_k(x_ki),    (1)

where Y_i is a univariate response variable with mean μ_i arising from an exponential family distribution with scale parameter φ (or at least with mean-variance relationship known to within a scale parameter), g is a known smooth monotonic link function, A is a model matrix, θ is a vector of unknown parameters, f_j is an unknown smooth function of predictor variable z_j and m_k is an unknown shape constrained smooth function of predictor variable x_k. The predictors z_j and x_k may be vector valued.
It is the shape constraints on the m k that differentiate this model from a standard generalized additive model (GAM). In many studies it is natural to assume that the relationship between a response variable and one or more predictors obeys certain shape restrictions. For example, the growth of children over time and dose-response curves in medicine are known to be monotonic. The relationships between daily mortality and air pollution concentration, between body mass index and incidence of heart disease are other examples requiring shape restrictions. Unconstrained models might be too flexible and give implausible or un-interpretable results.
Here we develop a general framework for shape constrained generalized additive models (SCAMs), covering estimation, smoothness selection, interval estimation and also allowing for model comparison. The aim is to make SCAMs as routine to use as conventional unconstrained GAMs. To do this we build on the established framework for generalized additive modelling covered, for example, in Wood (2006a). Model smooth terms are represented using spline type penalized basis function expansions; given smoothing parameter values, model coefficients are estimated by maximum penalized likelihood, achieved by an inner iteratively reweighted least squares type algorithm; smoothing parameters are estimated by the outer optimization of a GCV or AIC criterion. Interval estimation is achieved by taking a Bayesian view of the smoothing process, and model comparison can be achieved using AIC, for example. This paper supplies the novel components required to make this basic strategy work, namely:

1. We propose shape constrained P-splines (SCOP-splines), based on a novel mildly non-linear extension of the P-splines of Eilers and Marx (1996), with novel discrete penalties. These allow a variety of shape constraints for one- and multi-dimensional smooths. From a computational viewpoint, they ensure that the penalized likelihood and the GCV/AIC scores are smooth with respect to the model coefficients and smoothing parameters, allowing the development of efficient and stable model estimation methods.

2. We develop stable computational schemes for estimating the model coefficients and smoothing parameters, able to deal with the ill-conditioning that can affect even unconstrained GAM fits (Wood 2004, 2008), while retaining computational efficiency. The extra non-linearity induced by the use of SCOP-splines does not allow the unconstrained GAM methods to be re-used or simply modified; substantially new algorithms are required instead.

3. We provide simulation free approximate Bayesian confidence intervals for the SCOP-spline model components in this setting.

The bulk of this paper concentrates on these new developments, covering standard results on unconstrained GAMs only tersely. We refer the reader to Wood (2006a) for a more complete coverage of this background. Technical details and extensive comparative testing are provided in online supplementary material.
To understand the motivation for our approach, note that it is not difficult to construct shape constrained spline like smoothers by subjecting the spline coefficients to linear inequality constraints (Ramsay 1988; Wood 1994; Zhang 2004; Kelly and Rice 1990; Meyer 2012). However, this approach leads to methodological problems in estimating the smoothing parameters of the spline. The use of linear inequality constraints makes it difficult to optimize standard smoothness selection criteria, such as AIC and GCV, with respect to multiple smoothing parameters. The difficulty arises because the derivatives of these criteria change discontinuously as constraints enter or leave the set of active constraints. This leads to failure of the derivative based optimization schemes which are essential for efficient computation when there are many smoothing parameters to optimize. SCOP-splines circumvent this problem.
Other procedures based on B-splines were proposed by He and Shi (1998), Bollaerts et al. (2006), Rousson (2008), and Wang and Meyer (2011). Meyer (2012) presented a cone projection method for estimating penalized B-splines with monotonicity or convexity constraints and proposed a GCV based test for checking the shape constrained assumptions. Monotonic regression within the Bayesian framework has been considered by Lang and Brezger (2004), Holmes and Heard (2003), Dunson and Neelon (2003), and Dunson (2005). In spite of their diversity these existing approaches also lack the ability to efficiently compute the smoothing parameter in a multiple smooth context. In addition, to our knowledge, except for the bivariate constrained P-spline introduced by Bollaerts et al. (2006), multi-dimensional smooths under shape constraints on either all or a selection of the covariates have not yet been presented in the literature.
The remainder of the paper is structured as follows. The next section introduces SCOP-splines. Section 3.1 shows how SCAMs can be represented for estimation. A penalized likelihood maximization method for SCAM coefficient estimation is discussed in Sect. 3.2. Section 3.3 investigates the selection of multiple smoothing parameters. Interval estimation of the component smooth functions of the model is considered in Sect. 3.4. A simulation study is presented in Sect. 4 while Sect. 5 demonstrates applications of SCAM to two epidemiological examples.

B-spline background
In the smoothing literature B-splines are a common choice for the basis functions because of their smooth interpolation property, flexibility, and local support. The properties of B-splines are thoroughly discussed in De Boor (1978). Eilers and Marx (1996) combined B-spline basis functions with discrete penalties on the basis coefficients to produce the popular 'P-spline' smoothers. Li and Ruppert (2008) established the corresponding asymptotic theory. Specifically, the rate of convergence of the penalized spline to a smooth function depends on the order of the difference penalty, but not on the degree of the B-spline basis or the number of knots, provided that the number of knots grows with the number of data and assuming the function is twice continuously differentiable. Ruppert (2002) and Li and Ruppert (2008) showed that the choice of the basis dimension is not critical but should be above some minimal level which depends on the spline degree. Asymptotic properties of P-splines were also studied in Kauermann et al. (2009) and Claeskens et al. (2009). Here we propose to build on the P-spline idea to produce SCOP-splines.

One-dimensional case
The basic idea is most easily introduced by considering the construction of a monotonically increasing smooth, m, using a B-spline basis. Specifically let

m(x) = Σ_{j=1}^{q} γ_j B_j(x),

where q is the number of basis functions, the B_j are B-spline basis functions of at least second order for representing smooth functions over an interval [a, b], based on equally spaced knots, and the γ_j are the spline coefficients.
It is well known that a sufficient condition for m′(x) ≥ 0 over [a, b] is that γ_j ≥ γ_{j−1} ∀ j (see Supplementary material, S.1, for details). In the case of quadratic splines this condition is also necessary. It is easy to see that this condition can be imposed by re-parameterizing: let γ = Σβ̃, where β̃ = (β_1, exp(β_2), …, exp(β_q))^T and Σ is the lower triangular matrix with Σ_ij = 1 if j ≤ i and Σ_ij = 0 otherwise, so that γ_1 = β_1 and γ_j = β_1 + Σ_{i=2}^{j} exp(β_i) for j ≥ 2. If m is the vector of m values at the observed points x_i, and X is the matrix such that X_ij = B_j(x_i), then we have m = XΣβ̃.
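The re-parameterization described above can be sketched in a few lines of numpy. This is an illustrative stand-in, not code from the scam package: it assumes the lower triangular matrix of ones for Σ and exponentiation of all but the first coefficient, so that the resulting γ sequence is necessarily non-decreasing.

```python
import numpy as np

def scop_coefs(beta):
    """Map an unconstrained beta to a non-decreasing coefficient vector gamma."""
    q = len(beta)
    beta_t = np.concatenate([[beta[0]], np.exp(beta[1:])])  # beta-tilde
    Sigma = np.tril(np.ones((q, q)))   # Sigma[i, j] = 1 for j <= i, else 0
    return Sigma @ beta_t              # gamma_j = beta_1 + sum_{i=2..j} exp(beta_i)

beta = np.array([-1.0, 0.5, -2.0, 0.3])
gamma = scop_coefs(beta)
assert np.all(np.diff(gamma) > 0)      # increments are exp(beta_j) > 0
```

Because each increment is an exponential, monotonicity holds for any real-valued β, which is what removes the need for inequality constraints during optimization.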

Smoothing
In a smoothing context we would also like to have a penalty on m(x) which can be used to control its 'wiggliness'. Eilers and Marx (1996) introduced the notion of directly penalizing differences in the basis coefficients of a B-spline basis, which is used with a relatively large q to avoid underfitting. We can adapt this idea here. For j > 1 our β_j are log differences in the γ_j. We therefore propose penalizing the squared differences between adjacent β_j, starting from β_2, using the penalty ‖Dβ‖², where D is the (q − 2) × q matrix that is all zero except that D_{i,i+1} = −D_{i,i+2} = 1 for i = 1, …, q − 2. The penalty is zeroed when all the β_j after β_1 are equal, so that the γ_j form a uniformly increasing sequence and m(x) is an increasing straight line (see Fig. 1). As a result our penalty shares with a second order P-spline penalty the basic feature of 'smoothing towards a straight line', but in a manner that is computationally convenient for constrained smoothing.
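The difference matrix D just described is easy to construct directly; the following sketch (illustrative helper names, not scam package code) also checks the zero-penalty property, noting that β_1 is left unpenalized:

```python
import numpy as np

def scop_penalty(q):
    """(q-2) x q matrix D with D[i, i+1] = 1, D[i, i+2] = -1 (0-indexed)."""
    D = np.zeros((q - 2, q))
    for i in range(q - 2):
        D[i, i + 1] = 1.0
        D[i, i + 2] = -1.0
    return D

D = scop_penalty(6)
# ||D beta||^2 is zero whenever beta_2 = ... = beta_q; beta_1 is free:
beta = np.array([3.7, 0.5, 0.5, 0.5, 0.5, 0.5])
assert np.allclose(D @ beta, 0.0)
```

With all β_j (j > 1) equal, every γ increment equals the same exponential, which is exactly the uniformly increasing, straight-line limit described above.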
It might be asked whether penalization is necessary at all, given the restrictions imposed by the shape constraints. In practice it is: even with the shape constraint, the unpenalized estimated curve shows a good deal of spurious variation that the penalty removes.

Identifiability, basis dimension
If we were interested solely in smoothing one-dimensional Gaussian data then β would be chosen to minimize

‖y − XΣβ̃‖² + λ‖Dβ‖²,

where λ is a smoothing parameter controlling the trade-off between smoothness and fidelity to the response data y. Here, we are interested in the basis and penalty in order to be able to embed the shape constrained smooth m(x) in a larger model. This requires an additional constraint on m(x) in order to achieve identifiability, avoiding confounding with the intercept of the model in which it is embedded. A convenient way to do this is to use centering constraints on the model matrix columns, i.e. the sum of the values of the smooth is set to zero, Σ_{i=1}^{n} m(x_i) = 0, or equivalently 1^T XΣβ̃ = 0. As with any penalized regression spline approach, the choice of the basis dimension, q, is not crucial, but it should be generous enough to avoid oversmoothing/underfitting (Ruppert 2002; Li and Ruppert 2008). Ruppert (2002) suggested algorithms for basis dimension selection by minimizing GCV over a set of specified values of q, while Kauermann and Opsomer (2011) proposed an equivalent likelihood based scheme.
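For the Gaussian smoothing case the objective above can be minimized directly by a general purpose optimizer. The sketch below does this for simulated data; it is a minimal illustration under stated assumptions, not the scam fitting algorithm: SciPy ≥ 1.8 is assumed for `BSpline.design_matrix`, the smoothing parameter is simply fixed at λ = 1 rather than selected by GCV, and the data, knots and optimizer choice are arbitrary.

```python
import numpy as np
from scipy.interpolate import BSpline
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, q, k = 100, 10, 3                        # sample size, basis dimension, cubic
x = np.sort(rng.uniform(0.0, 1.0, n))
y = 1.0 / (1.0 + np.exp(-8.0 * (x - 0.5))) + rng.normal(0.0, 0.1, n)  # monotone truth

h = 1.0 / (q - k)                           # equally spaced knots with t[k] = 0, t[q] = 1
t = (np.arange(q + k + 1) - k) * h
X = BSpline.design_matrix(x, t, k).toarray()

Sigma = np.tril(np.ones((q, q)))            # SCOP-spline reparameterization matrix
D = np.zeros((q - 2, q))                    # difference penalty matrix
for i in range(q - 2):
    D[i, i + 1], D[i, i + 2] = 1.0, -1.0

def objective(beta, lam=1.0):
    """||y - X Sigma beta_tilde||^2 + lam ||D beta||^2."""
    beta_t = np.concatenate([[beta[0]], np.exp(beta[1:])])
    r = y - X @ Sigma @ beta_t
    return r @ r + lam * np.sum((D @ beta) ** 2)

beta_hat = minimize(objective, np.zeros(q), method="BFGS").x
gamma_hat = Sigma @ np.concatenate([[beta_hat[0]], np.exp(beta_hat[1:])])
fit = X @ gamma_hat
assert np.all(np.diff(fit) >= -1e-6)        # fitted curve respects monotonicity
```

Whatever value BFGS converges to, the fitted curve is monotone by construction; only its smoothness depends on λ.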
This simple monotonically increasing smooth can be extended to a variety of monotonic functions, including decreasing, convex/concave, increasing/decreasing and concave, and increasing/decreasing and convex, the difference between alternative shape constraints being the form of the matrices Σ and D.

Multi-dimensional SCOP-splines
Using the concept of tensor product spline bases it is possible to build up smooths of multiple covariates under the monotonicity constraint, where monotonicity may be assumed on either all or a selection of the covariates. In this section the construction of a multivariable smooth, m(x_1, x_2, …, x_p), with monotonically increasing constraints along all covariates is first considered, followed by a discussion of monotonicity along a single direction.

Tensor product basis
Consider p B-spline bases of dimensions q_1, q_2, …, q_p for representing marginal smooth functions, each of a single covariate,

m_j(x_j) = Σ_{k_j=1}^{q_j} α_{j k_j} B_{k_j}(x_j),

where the B_{k_j}(x_j), j = 1, …, p, are B-spline basis functions, and the α_{j k_j} are spline coefficients.

[Table 1 gives the forms of Σ and D for each univariate shape constraint: monotone increasing; increasing and convex; increasing and concave; decreasing and convex; decreasing and concave.]

Then, following Wood (2006a), the multivariate smooth can be represented by expressing the spline coefficients of the marginal smooths as the B-spline of the following covariate, starting from the first marginal smooth. Denoting B_{k_1…k_p}(x_1, …, x_p) = B_{k_1}(x_1) · … · B_{k_p}(x_p), the smooth of p covariates may be written as

m(x_1, …, x_p) = Σ_{k_1=1}^{q_1} … Σ_{k_p=1}^{q_p} γ_{k_1…k_p} B_{k_1…k_p}(x_1, …, x_p).

Constraints
By extending the univariate case one can see that a sufficient condition for m(x_1, …, x_p) to be monotonically increasing in all covariates is obtained by setting γ = Σβ̃, where Σ = Σ_1 ⊗ Σ_2 ⊗ · · · ⊗ Σ_p. The elements of each Σ_j are the same as for the univariate monotonically increasing smooth (see Table 1). For the multiple monotonically decreasing multivariate function, Σ is the matrix Σ̃ = −Σ_1 ⊗ Σ_2 ⊗ · · · ⊗ Σ_p with its first column replaced by a column of ones. To satisfy the conditions for a monotonically increasing or decreasing smooth with respect to only one covariate the following re-parameterizations are suggested:

1. For the single monotonically increasing constraint along the x_j direction: let Σ_j be defined as previously, let I_s be an identity matrix of size q_s, s ≠ j, and set Σ = I_1 ⊗ · · · ⊗ I_{j−1} ⊗ Σ_j ⊗ I_{j+1} ⊗ · · · ⊗ I_p and γ = Σβ̃, where β̃ is a vector containing a mixture of un-exponentiated and exponentiated coefficients.

2. For the single monotonically decreasing constraint along the x_j direction: the re-parameterization is the same as above, except that Σ_j is as for the univariate smooth with monotonically decreasing constraint (see Table 1).
By analogy it is not difficult to construct tensor products with monotonicity constraints along any number of covariates.
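The Kronecker-product constructions above can be sketched directly with numpy. The variable names are illustrative, and the specific matrix forms (lower triangular ones for the increasing case, negation plus a first column of ones for the decreasing case) follow the description in the text:

```python
import numpy as np

q1, q2 = 4, 3
S1 = np.tril(np.ones((q1, q1)))    # univariate increasing Sigma (as in Table 1)
S2 = np.tril(np.ones((q2, q2)))

# Doubly increasing bivariate smooth: Sigma = Sigma_1 (kron) Sigma_2
Sigma_inc = np.kron(S1, S2)

# Monotonicity along x_1 only: Sigma = Sigma_1 (kron) I_{q2}
Sigma_x1 = np.kron(S1, np.eye(q2))

# Doubly decreasing: -Sigma_1 (kron) Sigma_2, first column replaced by ones
Sigma_dec = -np.kron(S1, S2)
Sigma_dec[:, 0] = 1.0
```

The coefficient ordering implied by `np.kron` here has the x_1 index varying slowest, which must match the ordering used when the tensor product model matrix is built.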

Penalties
For controlling the level of smoothing, the penalty introduced in Sect. 2 can be extended. For multiple monotonicity the penalty in the jth direction may be written as β^T S_j β, where S_j = I_{q_1} ⊗ · · · ⊗ I_{q_{j−1}} ⊗ D_j^T D_j ⊗ I_{q_{j+1}} ⊗ · · · ⊗ I_{q_p} and D_j is as given in Table 1 for a monotone smooth. Penalties for single monotonicity along x_j take the same form, with S_j defined as above; the penalty matrices S̃_i, i ≠ j, in the unconstrained directions can be constructed using the marginal penalty approach described in Wood (2006a). The degree of smoothness in the unconstrained directions can be controlled by second-order difference penalties applied to the non-exponentiated coefficients, and by first-order difference penalties for the exponentiated coefficients. As in the univariate case, these penalties keep the parameter estimates close to each other, resulting in similar increments in the coefficients of the marginal smooths. As λ_j → ∞ such penalization results in straight lines for the marginal curves.
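The Kronecker structure of a directional penalty can be illustrated as follows. This is a sketch under the assumption, stated above, that the marginal difference penalty is embedded between identity blocks; names are illustrative:

```python
import numpy as np

def diff_penalty(q):
    """SCOP-spline first-order difference matrix on coefficients 2..q."""
    D = np.zeros((q - 2, q))
    for i in range(q - 2):
        D[i, i + 1], D[i, i + 2] = 1.0, -1.0
    return D

q1, q2 = 5, 4
D1 = diff_penalty(q1)
# Penalty acting along x_1 of a bivariate smooth: S_1 = (D1^T D1) kron I_{q2}
S1 = np.kron(D1.T @ D1, np.eye(q2))

# A coefficient vector constant along x_1 (x_1 index varying slowest)
# incurs zero penalty, while variation along x_1 is penalized:
beta_flat = np.tile(np.arange(q2, dtype=float), q1)
beta_wiggly = np.repeat(np.arange(q1, dtype=float) ** 2, q2)
assert np.isclose(beta_flat @ S1 @ beta_flat, 0.0)
assert beta_wiggly @ S1 @ beta_wiggly > 0.0
```

So each λ_j controls wiggliness in its own covariate direction only, which is what allows separate smoothing parameters per direction.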

SCAM representation
To represent (1) for computation we now choose basis expansions, penalties and identifiability constraints for all the unconstrained f_j, as described in detail in Wood (2006a), for example. This allows Σ_j f_j(z_ji) to be replaced by F_i γ, where F is a model matrix determined by the basis functions and the constraints, and γ is a vector of coefficients to be estimated. The penalties on the f_j are quadratic in γ. Each shape constrained term m_k is represented by a model matrix of the form XΣ and a corresponding coefficient vector, with identifiability constraints absorbed by column centering. The model matrices for all the m_k are then combined so that we can write Σ_k m_k(x_ki) = M_i β̃, where M is a model matrix and β̃ is a vector containing a mixture of model coefficients (β_i) and exponentiated model coefficients (exp(β_i)). The penalties in this case are quadratic in the coefficients β (not in the β̃).
So (1) becomes g(μ_i) = A_i θ + F_i γ + M_i β̃. For fitting purposes we may as well combine the model matrices column-wise into one model matrix X, and write the model as g(μ_i) = X_i β̃, where β̃ has been enlarged to now contain θ, γ and the original β̃. Similarly there is a corresponding expanded model coefficient vector β containing θ, γ and the original β. The penalties on the terms have the general form β^T S_λ β, where S_λ = Σ_k λ_k S_k, and the S_k are the original penalty matrices padded with zeros everywhere except for the elements which correspond to the coefficients of the kth smooth.

SCAM coefficient estimation
Now consider the estimation of β given values for the smoothing parameters λ. The exponential family chosen determines the form of the log likelihood l(β) of the model, and to control the degree of model smoothness we seek to maximize its penalized version

l_p(β) = l(β) − β^T S_λ β / 2.

However, the non-linear dependence of Xβ̃ on β makes this more difficult than in the case of unconstrained GAMs. In particular we found that optimization via Fisher scoring caused convergence problems for some models, and we therefore use a full Newton approach. The special structure of the model means that it is possible to work entirely in terms of a matrix square root of the Hessian of l when applying Newton's method, thereby improving the numerical stability of computations, so we also adopt this refinement. Also, since SCAM is very much within GAM theory, the same convergence issues might arise as in the case of GAM/GLM fitting (Wood 2006a). In particular, the likelihood might not be uni-modal and the process may converge to different estimates depending on the starting values of the fitting process. However, if the initial values are reasonably selected then it is unlikely that there will be major convergence issues. The following algorithm suggests such initial values. Let V(μ) be the variance function for the model's exponential family distribution, and define the full Newton weight terms α(μ_i) and working weights w_i as in Appendix 1. Penalized likelihood maximization is then achieved as follows:

1. To obtain an initial estimate of β, minimize ‖g(y) − Xβ̃‖² + β̃^T S_λ β̃ w.r.t. β̃, subject to linear inequality constraints ensuring that β̃_j > 0 whenever β̃_j = exp(β_j). This is a standard quadratic programming (QP) problem.
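Step 1 can be sketched as a bound constrained penalized least squares problem, which is a simple stand-in for the QP (a genuine inequality-constrained QP solver could equally be used). Everything here is a placeholder, not the paper's actual model set-up: X, D, λ and the working response z = g(y) are simulated, and the quadratic penalty is absorbed by row-augmenting the least squares problem with a matrix square root of S_λ.

```python
import numpy as np
from scipy.optimize import lsq_linear

rng = np.random.default_rng(2)
n, q = 50, 8
X = rng.normal(size=(n, q))                  # placeholder model matrix
z = rng.normal(size=n)                       # placeholder working response g(y)
lam = 0.5
D = np.zeros((q - 2, q))                     # difference penalty, S = lam * D^T D
for i in range(q - 2):
    D[i, i + 1], D[i, i + 2] = 1.0, -1.0

# Absorb the quadratic penalty by augmenting rows: ||A b_t - b||^2 equals
# ||z - X b_t||^2 + lam ||D b_t||^2.
A = np.vstack([X, np.sqrt(lam) * D])
b = np.concatenate([z, np.zeros(q - 2)])

# The first element of beta-tilde is unconstrained; the components that
# correspond to exponentiated coefficients must be strictly positive.
lb = np.full(q, 1e-8)
lb[0] = -np.inf
res = lsq_linear(A, b, bounds=(lb, np.full(q, np.inf)))
beta_tilde0 = res.x
assert np.all(beta_tilde0[1:] > 0.0)         # positivity constraints respected
```

The log of the positive components of `beta_tilde0` (and its first component unchanged) then provides the starting β for the Newton iteration.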
The algorithm is derived in Appendix 1 which shows several similarities to a standard penalized IRLS scheme for penalized GLM estimation. However, the more complicated structure results from the need to use full Newton, rather than Fisher scoring, while at the same time avoiding computation of the full Hessian, which would approximately square the condition number of the update computations. Two refinements of the basic iteration may be required.
1. If the Hessian of the log likelihood is indefinite then step 10 will fail, because some Λ_ii will exceed 1. In this case a Fisher update step must be substituted, by setting α(μ_i) = 1.

2. There is considerable scope for identifiability issues to hamper computation. In common with unconstrained GAMs, flexible SCAMs with highly correlated covariates can display co-linearity problems between model coefficients, which require careful handling in order to ensure numerical stability of the estimation algorithms. An additional issue is that the non-linear constraints mean that parameters can be poorly identified on flat sections of a fitted curve, where β_j is simply 'very negative', but the data contain no information on how negative. So steps must be taken to deal with unidentifiable parameters. One approach is to work directly with the QR decomposition to calculate which coefficients are unidentifiable at each iteration and to drop these, but a simpler strategy substitutes a singular value decomposition for the R factor at step 8 if it is rank deficient, so that R = UDV^T. Then we set Q = QU and R = DV^T, Q_1 is the first n rows of Q, and everything proceeds as before, except for the inversion of R. We now substitute the pseudoinverse, formed by inverting only those singular values that are not 'too small' and zeroing the rest. 'Too small' is judged relative to the largest singular value, D_11, multiplied by some power (in the range 0.5 to 1) of the machine precision. If all parameters are numerically identifiable then the pseudoinverse is just the inverse.
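The truncated-SVD pseudoinverse with the thresholding rule just described can be sketched as follows (an illustrative helper, not the scam implementation):

```python
import numpy as np

def pseudo_inverse_R(R, power=0.5):
    """Truncated-SVD pseudoinverse of a possibly rank-deficient R factor.

    Singular values below D_11 * eps**power are treated as zero, following
    the thresholding rule described in the text.
    """
    U, d, Vt = np.linalg.svd(R)
    tol = d[0] * np.finfo(float).eps ** power
    keep = d > tol
    d_inv = np.where(keep, 1.0 / np.where(keep, d, 1.0), 0.0)  # safe reciprocal
    return Vt.T @ np.diag(d_inv) @ U.T

# Rank-deficient example: the third column duplicates the second.
R = np.array([[2.0, 1.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 0.0]])
Rp = pseudo_inverse_R(R)
assert np.allclose(R @ Rp @ R, R)   # Moore-Penrose property holds
```

For a full-rank R every singular value survives the threshold, so the function simply returns R's inverse, matching the remark above.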

SCAM smoothing parameter estimation
We propose to estimate the smoothing parameter vector λ by optimizing a prediction error criterion such as AIC (Akaike 1973) or GCV (Craven and Wahba 1979). The model deviance is defined in the standard way, in terms of the difference between the saturated log likelihood, l_max, and l(β̂). When the scale parameter is known we find the λ which minimizes V_u = D(β̂) + 2φγτ, where τ is the effective degrees of freedom (edf) of the model and γ is a parameter that in most cases has the value 1, but is sometimes increased above 1 to obtain smoother models (see Kim and Gu 2004). When the scale parameter is unknown we find the λ minimizing the GCV score, V_g = nD(β̂)/(n − γτ)². For both criteria the dependence on λ is via the dependence of τ and β̂ on λ (see Hastie and Tibshirani (1990) and Wood (2008) for further details).
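The two criteria are trivial to evaluate once the deviance and edf are available; a minimal sketch, following the formulas as given in the text (the deviance, φ and τ values are arbitrary inputs here):

```python
def V_u(deviance, phi, tau, gamma=1.0):
    """UBRE/AIC-type criterion for known scale parameter phi."""
    return deviance + 2.0 * phi * gamma * tau

def V_g(deviance, n, tau, gamma=1.0):
    """GCV criterion for unknown scale parameter."""
    return n * deviance / (n - gamma * tau) ** 2

# With the deviance held fixed, increasing the edf tau increases both
# criteria, so wigglier models are penalized:
assert V_u(10.0, 1.0, 5.0) < V_u(10.0, 1.0, 6.0)
assert V_g(10.0, 100, 5.0) < V_g(10.0, 100, 6.0)
```

In practice the deviance falls as τ grows, and the optimizer balances the two effects over λ.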
The edf can be found following Meyer and Woodroofe (2000); the resulting expression involves a diagonal matrix, L^+, and details are provided in Appendix 2.
Optimization of V* w.r.t. ρ = log(λ) can be achieved by a quasi-Newton method. Each trial ρ vector will require a Sect. 3.2 iteration to find the corresponding β̂ so that the criterion can be evaluated. In addition the first derivative vector of V* w.r.t. ρ will be required, which in turn requires ∂β̂/∂ρ and ∂τ/∂ρ.
As demonstrated in Supplementary material, S.3, implicit differentiation can be used to obtain ∂β̂/∂ρ. The derivatives of D and τ then follow, as S.4 (Supplementary material) shows in tedious detail.

Interval estimation
Having obtained estimates β̂ and λ̂, we have point estimates for the component smooth functions of the model, but it is usually desirable to obtain interval estimates for these functions as well. To facilitate the computation of such intervals we seek distributional results for the β̃, i.e. for the coefficients on which the estimated functions depend linearly.
Here we adopt the Bayesian approach to interval estimation pioneered in Wahba (1983), but following Silverman's (1985) formulation. Such intervals are appealing following Nychka's (1988) analysis showing that they have good frequentist properties by virtue of accounting for both sampling variability and smoothing bias. Specifically, we view the smoothness penalty as equivalent to an improper prior distribution on the model coefficients, β ∼ N(0, S_λ^− φ), where S_λ^− is the Moore-Penrose pseudoinverse of S_λ = Σ_k λ_k S_k. In conjunction with the model likelihood, Bayes theorem then leads to the approximate result β̃|y ∼ N(β̃̂, V_β̃), where V_β̃ = C(C^T X^T WXC + S_λ)^{−1} Cφ, and W is the diagonal matrix of w_i calculated with α(μ_i) = 1. Supplementary material, S.5, derives this result. The deviance or Pearson statistic divided by the effective residual degrees of freedom provides an estimate of φ, if required. To use the result we condition on the smoothing parameter estimates: the intervals display surprisingly good coverage properties despite this (Marra and Wood (2012) provide a theoretical analysis which partly explains this).

Simulations

In this section performance comparison with unconstrained GAM and the QP approach to shape preserving smoothing (Wood 1994) is illustrated on a simulated example of an additive model with a mixture of monotone and unconstrained smooth terms. All simulation studies and data applications were performed using the R package scam, which implements the proposed SCAM approach, and mgcv for the GAM and QP implementations. A more extensive simulation study is given in Supplementary material, S.6. In particular, the first subsection of S.6 shows a comparative study with the constrained P-spline regression of Bollaerts et al. (2006), the monotone piecewise quadratic splines of Meyer (2008), and the shape-restricted penalized B-splines of Meyer (2012), on a simulated example of univariate single smooth term models.
Since there was no mean square error advantage of these approaches over SCAM for the univariate model, since the direct grid search for multiple optimal smoothing parameters is computationally expensive, and since, to the authors' knowledge, R routines implementing these methods are not freely available, the comparison for multivariate and additive examples was performed only with the unconstrained GAM and the QP approach. The additive model considered contains one monotone smooth term, m_1, and one unconstrained term, f_2. For the Gaussian data the values of σ were 0.05, 0.1, 0.2, which gave signal to noise ratios of about 0.97, 0.88, and 0.65. For the Poisson model the noise level was controlled by multiplying g(μ_i) by d, taking values 0.5, 0.7, 1.2, which resulted in signal to noise ratios of about 0.58, 0.84, and 0.99. For the SCAM implementation a cubic SCOP-spline of dimension 30 was used to represent the first, monotonic, smooth term and a cubic P-spline with q = 15 for the second, unconstrained, term. For the unconstrained GAM, P-splines with the same basis dimensions were used for both model components. The models were fitted by penalized likelihood maximization with the smoothing parameter selected using V_g in the Gaussian case and V_u in the Poisson case.
For implementing the QP approach to the monotonicity preserving constraint, we approximated the necessary and sufficient condition f′(x) ≥ 0 via the standard technique (Villalobos and Wahba 1987) of using a fine grid of linear constraints, f(x*_{i+1}) − f(x*_i) ≥ 0, where the x*_i are spread evenly through the range of x (strictly such constraints are necessary, but only sufficient as n → ∞; in practice we observed no violations of monotonicity). Cubic regression spline bases were used here together with the integrated squared second order derivative of the smooth as the penalty. The model fit is obtained by setting up the QP problem within a penalized IRLS loop, given λ chosen via GCV/UBRE from an unconstrained model fit. Cubic regression splines tend to have slightly better MSE performance than P-splines (Wood 2006a) and, moreover, the conditions built on finite differences are not only sufficient but also necessary for monotonicity, so this is a challenging test for SCAM. Three hundred replicates were produced for the Gaussian and Poisson distributions at each of three levels of noise and for two sample sizes, 100 and 200, for the three alternative approaches.
The simulation results for the Gaussian data are illustrated in Fig. 4. The results show that SCAM works better than the two alternative methods in the sense of MSE performance. The performance of GAM was better than that of the QP approach in this case, but the difference in MSE between SCAM and GAM is much smaller than in the one-dimensional simulation studies shown in Supplementary material, S.6. It is also noticeable that GAM reconstructed the truth better than the QP method. The explanation may be that there is only one monotonic term, and both GAM and SCAM gave similar fits for the unconstrained term, f_2. At lower noise levels GAM might also be able to reconstruct the monotone shape of m_1 for some replicates. The results also suggest that SCAM works better than GAM at greater levels of noise, which seems natural since at lower noise levels the shapes of constrained terms can be captured by the unconstrained GAM. The reduction in performance of QP compared to GAM was due to the smoothing parameter being estimated from the unconstrained fit, which sometimes resulted in less smooth estimates. For the Poisson data at the smaller sample size all three methods worked similarly, but with an increase in sample size SCAM outperformed the other two approaches (plots not shown). As in the Gaussian case, the unconstrained GAM worked better than QP. The simulation studies show that SCAM may have practical advantages over the alternative methods considered. It is computationally slower than the GAM and QP approaches; however, GAM obviously cannot impose monotonicity, and the selection of the smoothing parameter for SCAM is well founded, in contrast to the ad hoc method used with QP of choosing λ from an unconstrained fit and then refitting subject to constraint. Finally, the practical MSE performance of SCAM seems to be better than that of the alternatives considered here.

Coverage probabilities
The proposed Bayesian approach to confidence interval construction makes a number of key assumptions: (i) it uses a linear approximation of the exponentiated parameters and, in the case of non-Gaussian models, adopts large sample inference; (ii) the smoothing parameters are treated as fixed.
The simulation example of the previous subsection is used to examine how these restrictions affect the performance of the confidence intervals. The realized coverage probabilities are taken as the measure of their performance. Supplementary material, S.7, presents two further examples for a more thorough assessment of confidence interval performance.
The simulation study of confidence interval performance was conducted in an analogous manner to Wood (2006b). Samples of sizes n = 200 and 500 were generated from (5) for the Gaussian and Poisson distributions. 500 replicates were produced for both distributions at each of three levels of noise and for both sample sizes. For each replicate the realized coverage proportions were calculated as the proportions of the values of the true functions (at each of the covariate values) falling within the constructed confidence interval. Three confidence levels were considered: 90, 95 and 99 %. An overall mean coverage probability and its standard error were obtained from the 500 'across-the-function' coverage proportions. The results of the study show coverage probabilities close to nominal, with the exception of the first monotone smooth, m_1(x_1), at the low signal strength, which may be explained by the fact that the optimal fit inclines toward a straight line model (Marra and Wood 2012).

Examples
This section presents the application of SCAM to two different data sets. The purpose of the first application is to investigate whether proximity to municipal incinerators in Great Britain is associated with increased risk of stomach cancer (Elliott et al. 1996; Shaddick et al. 2007). It is hypothesized that the risk of cancer is a decreasing function of distance from an incinerator. The second application uses data from the National Morbidity, Mortality, and Air Pollution Study (Peng and Welty 2004). The relationship between daily counts of mortality and short-term changes in air pollution concentrations is investigated. It is assumed that increases in concentrations of ozone, sulphur dioxide and particulate matter will be associated with adverse health effects.
Incinerator data: Elliott et al. (1996) presented a large-scale study to investigate whether proximity to incinerators is associated with an increased risk of cancer. They analyzed data from 72 municipal solid waste incinerators in Great Britain and investigated the possibility of a decline in risk with distance from sources of pollution for a number of cancers. There was significant evidence of such a decline for stomach cancer, among several others. Data from a single incinerator of those 72 sources, located in the northeast of England, are analyzed using the SCAM approach in this section. This incinerator had a significant result indicating a monotone decreasing risk with distance (Elliott et al. 1996).
The data are from 44 enumeration districts (census-defined administrative areas), ED, whose geographical centroids lie within 7.5 km of the incinerator. The response variable, Y_i, is the observed number of cases of stomach cancer for each enumeration district. Associated estimates of the expected number of cases, E_i, used for risk determination, risk_i = Y_i/E_i, and obtained using national rates for the whole of Great Britain, standardized for age and sex, were calculated for each ED. The two covariates are the distance (km), dist_i, from the incinerator and a deprivation score, the Carstairs score, cs_i.
Under the model it is assumed that the Y_i are independent Poisson variables, Y_i ∼ Poi(μ_i), where μ_i = λ_i E_i, with E_i the expected number of cases in area i and λ_i the relative risk. Shaddick et al. (2007) proposed a model under which the effect of a covariate, e.g. distance, on cancer risk was linear through an exponential function, i.e. λ_i = exp(β_0 + β_1 dist_i). Since the risk of cancer might be expected to decrease with distance from the incinerator, in this paper a smooth monotonically decreasing function, m_1(dist_i), is suggested for modelling its relationship with risk, giving a single smooth generalized Poisson regression model under monotonicity constraint, where log(E_i) is treated as an offset (a variable with a coefficient equal to 1). Therefore, the SCAM approach can be applied to fit such a model. Carstairs score is known to be a good predictor of cancer rates (Elliott et al. 1996; Shaddick et al. 2007), so its effect may also be included in the model. Four models are considered for this application. Model 1 contains the single monotonically decreasing smooth, m_1(dist_i). Model 2 is the same as model 1 but with m_2(cs_i) as its smooth term instead, with m_2 monotonically increasing. Model 3 combines both smooths, while model 4 takes a bivariate function m_3(−dist_i, cs_i) subject to a double monotone increasing constraint. The univariate smooth terms were represented by third order SCOP-splines with q = 15, while q_1 = q_2 = 6 were used for the bivariate SCOP-spline.

[Fig. 6: The estimated smooth and cancer risk function for monotone and unconstrained versions of model 1 (incinerator data): a the SCAM estimated smooth + 95 % confidence interval; b the SCAM estimated risk as a function of distance; c the GAM estimated smooth + 95 % confidence interval; d the GAM estimated risk as a function of distance. Points show the observed data. As noted in the text, AIC suggests that the shape constrained model is preferred.]
Plots for assessing the suitability of model 1 are given in Supplementary material, S.8. For comparison, the first model was also fitted without the constraint. The estimated smooths and risk functions for both methods are illustrated in Fig. 6. The estimate of the cancer risk function was obtained as risk i = μ i /E i = exp{m 1 (dist i )}. Note that the unconstrained GAM resulted in a non-monotone smooth, which supports the SCAM approach. The AIC score allows us to compare models with and without shape constraints.
The AIC values were 152.35 for GAM and 150.57 for SCAM which favoured the shape constrained model.
In model 2 the number of cases of stomach cancer is modelled by a smooth function of the deprivation score. This function is assumed to be monotonically increasing, since it was shown (Elliott et al. 1996) that, in general, people living closer to incinerators tend to be less affluent (lower Carstairs score). The AIC value for this model was 155.59, whereas the unconstrained version gave AIC = 156.4, both of which are higher than for the previous model. The other three measures of model performance, V u , the adjusted r 2 , and the deviance explained, also gave slightly worse results than those seen for model 1.
Model 3 incorporates both covariates, dist and cs, assuming an additive effect on the log scale. The estimated edf of m 2 (cs) was about zero: this smooth term was insignificant in this model, with all its coefficients near zero, which can be explained by the high correlation between the two covariates. A linear effect of Carstairs score in place of the smooth function, m 2 , as proposed in Shaddick et al. (2007), was also considered.

The bivariate function, m 3 (−dist i , cs i ), is considered in the last model. The perspective plot of the estimated smooth is illustrated in Fig. 7. This plot also supports the previous result, that the Carstairs score does not provide any additional information for modelling cancer risk when distance is included in the model: the estimated smooth shows almost no increasing trend with respect to the second covariate. The measures of model performance, V u , adjusted r 2 , and the percentage of deviance explained, were not as good as for the simple model 1. The equivalent model without shape constraints resulted in AIC = 157.35, whereas the AIC score for the SCAM was 155.4. Hence the AIC-best model is the simple shape constrained model which only includes distance.

Air pollution data: The second application investigates the relationship between non-accidental daily mortality and air pollution. The data are from the National Morbidity, Mortality, and Air Pollution Study (Peng and Welty 2004), which contains 5,114 daily measurements of different variables for 108 cities within the United States. As an example, a single city (Chicago) was examined in Wood (2006a). The response variable is the daily number of deaths in Chicago (death) for the years 1987-1994. Four explanatory variables were considered: average daily temperature (tempd), levels of ozone (o3median), levels of particulate matter (pm10median), and time.
Since it might be expected that increased mortality will be associated with increased concentrations of air pollution, modelling with SCAM may prove useful.
The preliminary modelling and examination of the data showed that the mortality rate on a given day could be better predicted if aggregated air pollution levels and aggregated mean temperature were incorporated into the model, rather than the levels of pollution and temperature on the day in question (Wood 2006a). It was proposed that the aggregation should be the sum of each covariate (except time) over the current day and the three preceding days. Hence the three aggregated predictors are
$$\mathrm{tmp}_i = \sum_{j=0}^{3}\mathrm{tempd}_{i-j},\qquad \mathrm{o3}_i = \sum_{j=0}^{3}\mathrm{o3median}_{i-j},\qquad \mathrm{pm10}_i = \sum_{j=0}^{3}\mathrm{pm10median}_{i-j}.$$
Assuming that the observed numbers of daily deaths are independent Poisson random variables, death i ∼ Poi(μ i ), the following additive model structure can be considered:
$$\log(\mu_i) = f_1(\mathrm{time}_i) + f_2(\mathrm{tmp}_i) + m_2(\mathrm{pm10}_i) + m_3(\mathrm{o3}_i),$$
where monotonically increasing constraints are assumed on m 2 and m 3 , since increased air pollution levels are expected to be associated with increases in mortality. The plots for assessing the suitability of this model, together with the plots of the smooth estimates, are given in Supplementary material, S.8. This model indicates that, although the effect of the ozone level has only one degree of freedom, it is positive and increasing. The rapid increase in the smooth of aggregated mean temperature can be explained by the four highest daily death rates occurring on four consecutive days of very high temperature, which also experienced high levels of ozone (Wood 2006a).
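The aggregation rule above is a simple lagged rolling sum. A minimal sketch, with made-up temperature values, returning NaN where a full four-day history is unavailable:

```python
import numpy as np

def aggregate_lag3(x):
    """Sum of x over the current day and the three preceding days.

    The first three days have no full history, so they are returned as NaN
    (in practice those rows would be dropped before model fitting).
    """
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)
    for i in range(3, len(x)):
        out[i] = x[i - 3:i + 1].sum()
    return out

tempd = np.array([10.0, 12.0, 11.0, 13.0, 15.0])   # hypothetical daily values
print(aggregate_lag3(tempd))  # [nan nan nan 46. 51.]
```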
Since the combination of high temperatures and high levels of ozone might be expected to result in higher mortality, we consider a bivariate smooth of these predictors. The following model is now considered:
$$\log(\mu_i) = f_1(\mathrm{time}_i) + m_2(\mathrm{pm10}_i) + m_3(\mathrm{o3}_i, \mathrm{tmp}_i),$$
where m 2 (pm10 i ) is a monotone increasing function and m 3 (o3 i , tmp i ) is subject to single monotonicity along its first covariate. The diagnostic plots for this model showed a slight improvement over the first model (Supplementary material, S.8). The estimates of the univariate smooths and a perspective plot of the estimated bivariate smooth of model 2 are illustrated in Fig. 8. The second model also has a lower V u score, which implies that model 2 is preferable. The current approach has been applied to the Chicago air pollution data for demonstration purposes only. It would be of interest to apply the same model to other cities, to see whether the relationship between non-accidental mortality and air pollution can be described by the proposed SCAM in other locations.

Discussion
In this paper a framework for generalized additive modelling with a mixture of unconstrained and shape restricted smooth terms, SCAM, has been presented and evaluated on a range of simulated and real data sets. The motivation for this framework is to develop general methods for estimating SCAMs similar to those for a standard unconstrained GAM. SCAMs allow the inclusion of multiple unconstrained and shape constrained smooths, of both univariate and multi-dimensional type, represented by the proposed SCOP-splines. It should be mentioned that the shape constraints are imposed via a sufficient but not necessary condition for cubic and higher order splines. However, for cubic splines this condition is equivalent to that of Fritsch and Carlson (1980), who showed that the sufficient parameter space constitutes a substantial part of the necessary parameter space (see their Fig. 2, p. 242). The sensitivity analysis of Brezger and Steiner (2008) on an empirical application also supports the view that the sufficient condition is not unduly restrictive.

Fig. 8 The estimates of the smooth terms of model 2 (air pollution data). A cubic regression spline was used for f 1 with q = 200, a SCOP-spline of the third order with q = 10 for m 2 , and a bivariate SCOP-spline with marginal basis dimensions q 1 = q 2 = 10 for m 3 .
Since a major challenge for any flexible regression method is implementing it in a computationally efficient and stable manner, numerically robust algorithms for model estimation have been presented. The main benefit of the procedure is that smoothing parameter selection is incorporated into the SCAM parameter estimation scheme, which also produces interval estimates at no additional cost. The approach has the O(nq 2 ) computational cost of standard penalized regression spline based GAM estimation, but typically involves 2-4 times as many O(nq 2 ) steps, because of the additional non-linearities required for the monotonic terms and the need to use quasi-Newton in place of full Newton optimization. However, in contrast to the ad hoc methods of choosing the smoothing parameter used in other approaches, smoothing parameter selection for SCAMs is well founded. Finally, although the simulation-free intervals proposed in this paper show good coverage probabilities, it would be of interest to see whether Bayesian confidence intervals derived from a posterior distribution simulated via MCMC would give better results.
Open Access This article is distributed under the terms of the Creative Commons Attribution License which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Appendix
Appendix 1: Newton method for penalized likelihood estimation of SCAM

This appendix describes a full Newton (Newton-Raphson) method for maximizing the penalized log likelihood of a SCAM. The penalized log likelihood function to be maximized w.r.t. $\beta$ is
$$l_p(\beta) = l(\beta) - \frac{1}{2}\sum_k \lambda_k \beta^T S_k \beta,$$
where the log likelihood of $\beta$ can be written as
$$l(\beta) = \sum_i \left\{\frac{y_i\theta_i - b_i(\theta_i)}{a_i(\phi)} + c_i(y_i,\phi)\right\},$$
where $a_i$, $b_i$, and $c_i$ are arbitrary functions, $\phi$ an arbitrary 'scale' parameter, and $\theta_i$ a 'canonical parameter' of the distribution, related to the linear predictor via the relationship $E(Y_i) = b_i'(\theta_i)$ (Wood 2006a). While the functions $a_i$, $b_i$, and $c_i$ may vary with $i$, the scale parameter $\phi$ is assumed to be constant for all observations. The distribution parameters $\theta_i$ depend on the model coefficients $\beta_j$ via the link between the mean of $Y_i$ and $\theta_i$, $g(\mu_i) = X_i\tilde\beta$, where $\tilde\beta$ is obtained from $\beta$ by exponentiating the elements subject to constraint. Recall that the smoothing parameter vector $\lambda$ is considered to be fixed while estimating $\beta$.

Consider only cases where $a_i(\phi) = \phi/\omega_i$, with $\omega_i$ a known constant, which usually equals 1; almost all probability distributions of interest from the exponential family are covered by such a limitation. Then
$$l(\beta) = \sum_i \omega_i\{y_i\theta_i - b_i(\theta_i)\}/\phi + c_i(y_i,\phi),$$
and the first order derivative of $l_p(\beta)$ w.r.t. $\beta_j$ is
$$\frac{\partial l_p}{\partial \beta_j} = \frac{\partial l}{\partial \beta_j} - S_{\lambda j}\beta,$$
where (for this appendix only) $S_{\lambda j} = \sum_k \lambda_k S_{kj}$, with $S_{kj}$ the $j$th row of the matrix $S_k$. Taking first order derivatives of both sides of the linking equation $E(Y_i) = b_i'(\theta_i)$, we get $\partial\mu_i/\partial\theta_i = b_i''(\theta_i)$. Since $g(\mu_i) = X_i\tilde\beta$, then
$$\frac{\partial\mu_i}{\partial\beta_j} = \frac{[XC]_{ij}}{g'(\mu_i)},$$
where $C$ is a $q\times q$ diagonal matrix with $C_{jj} = \partial\tilde\beta_j/\partial\beta_j$. Hence
$$\frac{\partial l}{\partial\beta_j} = \frac{1}{\phi}\sum_i \frac{\omega_i(y_i - \mu_i)}{b_i''(\theta_i)\,g'(\mu_i)}[XC]_{ij}.$$
Another key point of the exponential family concerns the variance, which is represented in the theory of GLMs in terms of $\mu_i$ as $\mathrm{var}(Y_i) = b_i''(\theta_i)a_i(\phi) = V(\mu_i)\phi/\omega_i$. Let $G$ and $W_1$ be $n\times n$ diagonal matrices with diagonal elements $G_{ii} = g'(\mu_i)$ and $W_{1,ii} = \omega_i/\{V(\mu_i)g'(\mu_i)^2\}$. Then a penalized score vector may be written as
$$u_p(\beta) = C^T X^T W_1 G (y - \mu)/\phi - S_\lambda\beta.\qquad(8)$$
To find the working model parameter estimates, $\hat\beta$, one needs to solve $u_p(\beta) = 0$. These equations are non-linear and have no analytical solution, so numerical methods must be applied.
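As a concrete check of the score expression, consider a Poisson model with log link ($\omega_i = 1$, $V(\mu) = \mu$, $g'(\mu) = 1/\mu$, $\phi = 1$), for which $C^T X^T W_1 G(y-\mu)$ reduces to $C^T X^T(y-\mu)$. The sketch below uses a generic model matrix, exponentiates all but the first coefficient as a stand-in for a SCOP-spline term, and compares the analytic penalized score with finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 50, 4
X = rng.normal(size=(n, q))
y = rng.poisson(2.0, size=n).astype(float)
S_lam = 0.5 * np.eye(q)                    # stand-in penalty matrix S_lambda
beta = rng.normal(scale=0.3, size=q)
expo = np.arange(q) >= 1                   # coefficients under constraint

def beta_tilde(beta):
    bt = beta.copy()
    bt[expo] = np.exp(beta[expo])
    return bt

def pen_loglik(beta):
    eta = X @ beta_tilde(beta)
    mu = np.exp(eta)                       # Poisson, log link
    return np.sum(y * eta - mu) - 0.5 * beta @ S_lam @ beta

def pen_score(beta):
    mu = np.exp(X @ beta_tilde(beta))
    C = np.where(expo, np.exp(beta), 1.0)  # C_jj = d beta_tilde_j / d beta_j
    # For the log link, W1 * G * (y - mu) reduces to (y - mu).
    return C * (X.T @ (y - mu)) - S_lam @ beta

# Central finite-difference check of the analytic penalized score.
eps = 1e-6
fd = np.array([(pen_loglik(beta + eps * np.eye(q)[j]) -
                pen_loglik(beta - eps * np.eye(q)[j])) / (2 * eps)
               for j in range(q)])
print(np.allclose(pen_score(beta), fd, rtol=1e-4, atol=1e-6))  # True
```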
In the case of an unconstrained GAM, the penalized iteratively re-weighted least squares (P-IRLS) scheme based on Fisher scoring is used to solve these equations.
To proceed, the Hessian of the penalized log likelihood is derived from (8):
$$\frac{\partial^2 l_p}{\partial\beta\,\partial\beta^T} = -\left(C^T X^T W X C - E + S_\lambda\right),$$
where $W$ is a diagonal matrix with $w_i = w_{1,i}\,\alpha(\mu_i)$,
$$\alpha(\mu_i) = 1 + (y_i - \mu_i)\left\{\frac{V'(\mu_i)}{V(\mu_i)} + \frac{g''(\mu_i)}{g'(\mu_i)}\right\},$$
and $E$ is a $q\times q$ diagonal matrix whose $j$th diagonal element is the $j$th component of $C^T X^T W_1 G(y-\mu)$ if $\tilde\beta_j = \exp(\beta_j)$, and zero otherwise. Note that for a model with a canonical link function the second term of $\alpha_i$ is zero, since in this case $V'(\mu_i)/V(\mu_i) = -g''(\mu_i)/g'(\mu_i)$. Therefore $\alpha_i = 1$ and the matrices $W_1$ and $W$ are identical.
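As an illustration of $\alpha(\mu_i)$, consider the Poisson family ($V(\mu) = \mu$): under the canonical log link the two terms cancel exactly, while under a non-canonical square-root link $\alpha_i$ varies with the residual. The numbers below just evaluate the formula at a few points:

```python
import numpy as np

# alpha_i = 1 + (y_i - mu_i) * (V'(mu)/V(mu) + g''(mu)/g'(mu)), Poisson: V(mu) = mu.
y = np.array([0.0, 1.0, 4.0, 7.0])
mu = np.array([2.0, 2.0, 2.0, 2.0])

# log link (canonical): V'/V = 1/mu, g''/g' = -1/mu, so alpha = 1 exactly.
alpha_log = 1 + (y - mu) * (1 / mu - 1 / mu)

# sqrt link: g' = 1/(2 sqrt(mu)), g'' = -1/(4 mu^{3/2}), so g''/g' = -1/(2 mu)
# and alpha = 1 + (y - mu)/(2 mu).
alpha_sqrt = 1 + (y - mu) * (1 / mu - 1 / (2 * mu))

print(alpha_log)   # all ones: W and W1 coincide for the canonical link
print(alpha_sqrt)  # varies with the residual y - mu
```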
So, using the Newton method, if $\beta^{[k]}$ is the current estimate of $\beta$, then the next estimate is
$$\beta^{[k+1]} = \beta^{[k]} + \left(C^T X^T W X C - E + S_\lambda\right)^{-1} u_p(\beta^{[k]}),\qquad(11)$$
where the scale parameter $\phi$ is absorbed into the smoothing parameter $\lambda$.
Using (11) directly for estimating β is not efficient, since explicit formation of the Hessian would square the condition number of the working model matrix, √W XC (Golub and van Loan 1996). It should be noted that the Hessian matrix also appears in an expression for the edf of the fitted model (Appendix 2). In the case of the unconstrained model (Wood 2006a), a stable solution for β̂ is based on a QR decomposition of √W X augmented with B, where B T B = S λ . The same approach can be applied to the shape constrained model, i.e. a QR decomposition of the augmented √W XC. However, the values of W can be negative when a non-canonical link function is assumed, so the issue of negative weights must first be handled.
The approach applied here is similar to that given in Sect. 3.3 of Wood (2011). Let $\tilde W$ denote a diagonal matrix with elements $|w_i|$, and $W^-$ a diagonal matrix with $W^-_{ii} = |w_i|$ if $w_i < 0$ and 0 otherwise. Then
$$C^T X^T W X C = C^T X^T \tilde W X C - 2\,C^T X^T W^- X C.$$
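The splitting $W = \tilde W - 2W^-$ can be verified directly; in this sketch a generic matrix X plays the role of the working model matrix XC:

```python
import numpy as np

rng = np.random.default_rng(2)
n, q = 6, 3
X = rng.normal(size=(n, q))
w = np.array([1.5, -0.5, 2.0, -1.0, 0.7, 3.0])   # mixed-sign Newton weights

W     = np.diag(w)
W_abs = np.diag(np.abs(w))                       # W-tilde, elements |w_i|
W_neg = np.diag(np.where(w < 0, np.abs(w), 0))   # W^-, |w_i| where w_i < 0

lhs = X.T @ W @ X
rhs = X.T @ W_abs @ X - 2 * X.T @ W_neg @ X
print(np.allclose(lhs, rhs))  # True: W = W_abs - 2 W_neg entrywise
```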
Now the QR decomposition may be applied to the augmented matrix,
$$\begin{bmatrix} \sqrt{\tilde W}\,XC \\ B \end{bmatrix} = QR,$$
where Q is a rectangular matrix with orthogonal columns and R is upper triangular. Let $Q_1$ be the first $n$ rows of $Q$, so that $\sqrt{\tilde W}\,XC = Q_1 R$. Therefore
$$C^T X^T W X C - E + S_\lambda = R^T\left(I - 2Q_1^T I^- Q_1 - R^{-T} E R^{-1}\right)R,$$
where $I^-$ is an $n\times n$ diagonal matrix with $I^-_{ii} = 1$ if $w_i < 0$ and 0 otherwise. Note that several near non-identifiability issues can arise here. In order to deal with unidentifiable parameters it is proposed to use a singular value decomposition of the R factor of the QR decomposition if it is rank deficient; this step is described in Sect. 3.2. The next step is to apply the eigen-decomposition
$$2Q_1^T I^- Q_1 + R^{-T} E R^{-1} = U\Lambda U^T,$$
which gives
$$\beta^{[k+1]} = \beta^{[k]} + R^{-1}U(I-\Lambda)^{-1}U^T R^{-T}\,u_p(\beta^{[k]}).\qquad(13)$$
Defining $P = R^{-1}U(I-\Lambda)^{-1/2}$ and $K = Q_1 U(I-\Lambda)^{-1/2}$, together with a vector $z$ such that $z_i = (y_i - \mu_i)g'(\mu_i)/\alpha(\mu_i)$ and $\tilde z$ with $\tilde z_i = \mathrm{sign}(w_i)z_i$, (13) may be written as
$$\beta^{[k+1]} = \beta^{[k]} + PK^T\sqrt{\tilde W}\,\tilde z - PP^T S_\lambda\beta^{[k]}.\qquad(14)$$
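The algebra behind (13) can be checked numerically: with random stand-ins for XC, mixed-sign weights, a penalty $S_\lambda$ (with $B^T B = S_\lambda$) and a small diagonal E, the factorization $R^T(I - U\Lambda U^T)R$ reproduces $C^T X^T W X C - E + S_\lambda$. A minimal sketch (all matrices here are arbitrary test values, not fitted quantities):

```python
import numpy as np

rng = np.random.default_rng(3)
n, q = 30, 5
Xc = rng.normal(size=(n, q))             # plays the role of XC
w = rng.normal(size=n)                   # mixed-sign Newton weights
S = np.diag(rng.uniform(0.5, 2.0, q))    # S_lambda (here diagonal, positive definite)
Ed = np.diag(rng.uniform(0.0, 0.1, q))   # the diagonal matrix E

W = np.diag(w)
Wt = np.diag(np.abs(w))                  # W-tilde
Iminus = np.diag((w < 0).astype(float))  # I^-

B = np.linalg.cholesky(S).T              # so that B^T B = S_lambda
Q, R = np.linalg.qr(np.vstack([np.sqrt(Wt) @ Xc, B]))
Q1 = Q[:n]                               # first n rows: sqrt(Wt) Xc = Q1 R

Rinv = np.linalg.inv(R)
M = 2 * Q1.T @ Iminus @ Q1 + Rinv.T @ Ed @ Rinv
lam, U = np.linalg.eigh(M)               # M = U Lambda U^T

lhs = R.T @ (np.eye(q) - U @ np.diag(lam) @ U.T) @ R
rhs = Xc.T @ W @ Xc - Ed + S
print(np.allclose(lhs, rhs))  # True
```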
The last expression has roughly the square root of the condition number of (11) for the unpenalized likelihood maximization problem, since the condition number of $R^{-1}$ equals that of $\sqrt{\tilde W}\,XC$. Another refinement may be required in the last step: if the Hessian of the log likelihood is indefinite, the step (14) will fail because some $\Lambda_{ii}$ will exceed 1. To avoid the indefiniteness problem a Fisher update step is substituted: setting $\alpha(\mu_i) = 1$ so that $w_i \geq 0$ for all $i$, the QR decomposition
$$\begin{bmatrix} \sqrt{W}\,XC \\ B \end{bmatrix} = QR$$
is used as previously, and then
$$\beta^{[k+1]} = R^{-1}Q_1^T\sqrt{W}\,z,\qquad z = G(y-\mu) + XC\beta^{[k]}.$$
If there is an identifiability issue, the singular value decomposition step is applied to the R factor, $R = UDV^T$, resulting in
$$\beta^{[k+1]} = V D^{-1} Q_1^T\sqrt{W}\,z,$$
where $Q_1$ is now the first $n$ rows of $QU$.
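The condition-number point is a general fact about normal equations: for any full-column-rank matrix A, $\mathrm{cond}(A^TA) = \mathrm{cond}(A)^2$, so working with the R factor of the augmented matrix rather than the explicitly formed cross-product halves the number of accurate digits lost. A quick illustration with a generic, deliberately ill-conditioned A:

```python
import numpy as np

rng = np.random.default_rng(4)
n, q = 100, 8
# Scale the columns over four orders of magnitude to make A ill-conditioned.
A = rng.normal(size=(n, q)) * np.logspace(0, 4, q)
R = np.linalg.qr(A)[1]

# cond(A^T A) equals cond(R)^2: forming the cross-product squares the
# condition number, which the QR-based update avoids.
print(np.isclose(np.linalg.cond(A.T @ A), np.linalg.cond(R) ** 2, rtol=1e-2))
```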
Note that in the case of the canonical link function $\alpha_i = 1$ for all $i$, and therefore $\tilde W = W$.
Appendix 2: SCAM degrees of freedom

An un-penalized model would have as many degrees of freedom as the number of unconstrained model parameters. However, the use of penalties decreases the number of degrees of freedom, so that a model with λ → ∞ would have degrees of freedom near 1. Using the concept of the divergence of the maximum likelihood estimator, the edf of the penalized fit can be found as (Meyer and Woodroofe 2000; Wood 2001)
$$\tau = \sum_i \frac{\partial\hat\mu_i}{\partial y_i}.$$
Substituting (11) (Appendix 1) into the model $g(\mu_i) = X_i\tilde\beta$ and taking first order derivatives with respect to $y_i$, we get
$$\frac{\partial\hat\mu_i}{\partial y_i} = \left[G^{-1}XC\left(C^TX^TWXC - E + S_\lambda\right)^{-1}C^TX^TW_1G\right]_{ii},$$
the $i$th diagonal element of the matrix in square brackets. Therefore
$$\tau = \mathrm{tr}(F),\qquad F = \left(C^TX^TWXC - E + S_\lambda\right)^{-1}C^TX^TW_1XC,\qquad(16)$$
where the matrices $W$, $W_1$, $C$, and $E$ are evaluated at convergence. Note that F is the expected Hessian of $l(\beta)$, premultiplied by the inverse of the Hessian of $l_p(\beta)$.
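As a sanity check of (16) in the simplest setting — no shape constraint, so E = 0 and $W_1 = W$, and a ridge-type penalty $S_\lambda = \lambda I$ — the edf $\tau = \mathrm{tr}\{(X^TWX + \lambda I)^{-1}X^TWX\}$ equals q at λ = 0 and decreases monotonically towards zero as λ grows (with a difference penalty it would instead level off at the dimension of the penalty null space):

```python
import numpy as np

rng = np.random.default_rng(5)
n, q = 60, 6
X = rng.normal(size=(n, q))
W = np.diag(rng.uniform(0.5, 2.0, n))   # Fisher weights (all positive here)
XtWX = X.T @ W @ X

def edf(lam):
    # tr(F) with E = 0, W1 = W and S_lambda = lam * I.
    F = np.linalg.solve(XtWX + lam * np.eye(q), XtWX)
    return np.trace(F)

# No penalty: the full q degrees of freedom; heavy penalty: edf -> 0.
print(round(edf(0.0), 6))                                 # 6.0
taus = [edf(l) for l in [0.1, 1.0, 10.0, 1e6]]
print(all(t1 > t2 for t1, t2 in zip(taus, taus[1:])))     # True
```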
Using the approach and notation of Appendix 1, τ can also be obtained in a stable manner. Introducing the $n\times n$ diagonal matrix $L^+$ with $L^+_{ii} = \mathrm{sign}(w_i)/\alpha(\mu_i)$, so that $W_1 = \sqrt{\tilde W}\,L^+\sqrt{\tilde W}$, the expression (16) for the edf becomes
$$\mathrm{tr}(F) = \mathrm{tr}\left(\{C^TX^TWXC - E + S_\lambda\}^{-1}\,C^TX^T\sqrt{\tilde W}\,L^+\sqrt{\tilde W}\,XC\right) = \mathrm{tr}\left(PK^TL^+\sqrt{\tilde W}\,XC\right) = \mathrm{tr}\left(KK^TL^+\right).\qquad(17)$$