A smoothed monotonic regression via L2 regularization
Abstract
Monotonic regression is a standard method for extracting a monotone function from nonmonotonic data, and it is used in many applications. However, a known drawback of this method is that its fitted response is a piecewise constant function, while practical response functions are often required to be continuous. The method proposed in this paper achieves monotonicity and smoothness of the regression by introducing an L2 regularization term. In order to achieve a low computational complexity and at the same time to provide a high predictive power of the method, we introduce a probabilistically motivated approach for selecting the regularization parameters. In addition, we present a technique for correcting inconsistencies on the boundary. We show that the complexity of the proposed method is \(O(n^2)\). Our simulations demonstrate that when the data are large and the expected response is a complicated function (which is typical in machine learning applications) or when there is a change point in the response, the proposed method has a higher predictive power than many of the existing methods.
Keywords
Monotonic regression, Kernel smoothing, Penalized regression, Probabilistic learning
1 Introduction
There is a multitude of applications in which it can be believed that one or more predictors have a monotonic relationship with the response variable, i.e., the response function is nondecreasing (or nonincreasing) when one or more of these predictors increase. Such applications can be found in psychology, biology, signal processing, economics and many other disciplines [1, 2, 21, 22]. Perhaps the most widely known approach which allows for extracting a monotonic dependence is monotonic regression (MR) [5, 38].
There exists a simple algorithm, called pool adjacent violators (PAV), that solves problem (1) in O(n) steps [4]. It is widely used due to its computational efficiency, yet some practitioners are skeptical about using MR in their applications because the fitted response returned by MR resembles a step function, while the expected response in these applications is believed to be continuous and smooth. MR can also be applied when more than one predictor is involved: there are computationally heavy exact algorithms for solving the MR problem, such as the algorithm in Maxwell and Muckstadt [25], and computationally inexpensive approximate MR algorithms, such as the algorithms in Burdakov et al. [11] and Sysoev et al. [40]. However, the fitted response still resembles a piecewise constant function even in the model with many predictors. Recently, a new generalized additive framework was proposed in Chen and Samworth [12], but when the model has one input, this method returns the same response as the PAV algorithm.
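The pooling idea behind PAV can be illustrated with a short sketch. It assumes the usual least-squares formulation of MR with unit observation weights and observations already ordered by the predictor; the function name `pav` is chosen for this example only.

```python
def pav(y):
    """Least-squares nondecreasing fit to y (unit weights, assumed setting)."""
    # Each block stores [sum of its observations, number of observations].
    blocks = []
    for value in y:
        blocks.append([float(value), 1])
        # Pool while the mean of the previous block exceeds the mean of the
        # last block (means are compared by cross-multiplication).
        while (len(blocks) > 1
               and blocks[-2][0] * blocks[-1][1] > blocks[-1][0] * blocks[-2][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    # Expand the block means back to one fitted value per observation.
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted


print(pav([1, 3, 2, 4, 3, 5]))  # [1.0, 2.5, 2.5, 3.5, 3.5, 5.0]
```

Each observation is appended once and each merge removes one block, so the number of operations is linear in n. The output also shows the step-like shape of the fit that motivates the smoothing discussed in this paper.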
In order to handle the lack of smoothness of the MR, various techniques were developed for the model with one predictor. One group of methods [17, 19, 24, 28] combines the MR approach with kernel smoothers, and the fitted response appears to be monotonic and smooth. A particularly simple ad hoc approach is discussed in Mammen [24], where a smooth monotonic estimator \(m_{\mathrm{IS}}\) first applies the monotonic (called isotonic in Mammen [24]) regression and then a kernel smoother; in another estimator \(m_{\mathrm{SI}}\), the smoothing is followed by the monotonization. Since the complexity of the kernel smoothing is \(O(n^2)\) [16] and the complexity of the MR computation is O(n), the computational cost of \(m_{\mathrm{IS}}\) and \(m_{\mathrm{SI}}\) is \(O(n^2)\). Other methods in this group seem to be more computationally heavy: They either are based on a numerical integration or apply quadratic programming.
Another large group of methods produces a smooth monotonic fit by applying smoothing spline techniques [8, 9, 26, 27, 35, 36, 44]. The quality of the fitting and prediction, as well as the computational time of these methods, depends on the number of knots in a spline. If the number of knots (which defines both the complexity and the quality of the model) is fixed, these models are capable of fitting quite large data within an acceptable time. However, modern applications often involve data that are large and at the same time are generated by complicated processes. In order to fit such large and complex data, a large number of knots might be needed, which in turn leads to prohibitively large computational times. We discuss this issue further in Sect. 5. In addition, we include in our simulation studies the method introduced in Pya and Wood [35] (which was shown to be more precise than any of Bollaerts et al. [8], Meyer [26], Meyer [27]).
In the literature, there is also some work based on Bayesian approaches (see, e.g., [20, 23, 30, 31, 37, 39, 42]). In general, these methods involve Markov chain Monte Carlo sampling or another type of stochastic optimization, which makes them computationally heavy compared to the aforementioned frequentist alternatives. Complexity and scalability issues are often dismissed or are not the main focus in these papers, and the authors often consider rather small samples in their simulations. For instance, the method described in Holmes and Heard [20] was presented as ‘extremely fast to simulate,’ and it completes the computations ‘under two minutes’ for a sample containing around 200 observations. In contrast, the method presented in this paper can fit more than a million observations in less than a minute, as illustrated in Burdakov and Sysoev [10]. However, in order to compare the predictive power of these models and our approach, we include in our simulation studies the approach described in Neelon and Dunson [30, 31]. The choice of the method was motivated by the public availability of the code [32], and also by the results of the simulation in Shively et al. [39] demonstrating that there is no obvious winner among the methods in Holmes and Heard [20], Neelon and Dunson [31] and Shively et al. [39]. We have also decided not to include the methods based on Gaussian process optimization (such as [23, 37, 42]) in our simulations because they require at least \(O(n^3)\) numerical operations per iteration, which makes these algorithms too expensive for large data.
The purpose of this work is to develop a method which (a) is fast and scalable, i.e., able to efficiently handle large data sets and to be efficiently applied in iterative algorithms like GAM or gradient boosting; (b) is statistically motivated rather than ad hoc; (c) enables a user to select a desired degree of smoothness, as is the case with splines or kernel smoothers; and (d) has a reasonably good predictive power.
In this work and in an accompanying paper [10], we study problem (2) from different perspectives and present a smoothed pool adjacent violators (SPAV) algorithm that computes the solution of the SMR problem. The paper by Burdakov and Sysoev [10] is focused on convergence properties given a set of penalty parameters \(\Lambda =\left\{ \lambda _1,\ldots , \lambda _{n-1}\right\} \) and contains the proof of the optimality of the SPAV algorithm from the optimization perspective. However, the algorithm presented in Burdakov and Sysoev [10] cannot be directly used in machine learning applications. Firstly, the algorithm requires the set \(\Lambda \) to be specified, and if traditional parameter selection, i.e., cross-validation, is used, the overall computational time becomes exponential in n (because at least \(2^{n-1}\) possible \(\Lambda \) need to be considered). Secondly, the algorithm is only able to deliver predictions for the values of the input variables present in the training data; the prediction for all other input values is undefined. This is a known limitation of the monotonic regression discussed in Sysoev et al. [41].
The current paper introduces a probabilistic framework that allows for resolving the critical issues mentioned above and thus enables employing the SMR in machine learning applications. In particular, this framework induces a motivated choice for the penalty parameters that reduces the overall worst-case computational cost to \(O(n^2)\). This framework also makes it possible to deliver consistent predictions for the input values that are outside the training data. In addition, we present a technique that corrects the boundary inconsistencies. Last but not least, the paper presents a numerical analysis of the predictive performance of the proposed model and some alternative algorithms.
The paper is organized as follows. Section 2 presents the SPAV algorithm and also contains estimates of the complexity of this algorithm given a fixed set \(\Lambda \). In Sect. 3, a probabilistic model inducing the SMR problem is presented, an appropriate choice of the penalty parameters is discussed and a consistent predictive model is presented. In particular, this section introduces a generalized cross-validation for the probabilistic model corresponding to (2). Section 4 considers the problem of inconsistency of the SMR on the boundaries and introduces a correction strategy. In Sect. 5, results of numerical simulations and a comparative study of our smoothing approaches and some alternative algorithms are presented. Section 6 contains conclusions.
2 SPAV algorithm
One can see that the objective function in (2) is strictly convex, because the first term is a strictly convex function and the other one is convex. The constraints are linear. Therefore, the SMR problem (2) is a strictly convex quadratic programming (QP) problem. In principle, it can be solved by conventional QP algorithms [33]. However, these algorithms do not make use of the special structure of the objective function or constraints in (2), whereas taking this structure into account, as one can see below, allows for substantial speed-ups. It should also be noted that the general-purpose QP algorithms are not of polynomial complexity, in contrast to our algorithm.
Theorem 1
Proof
The important feature of system (11) is that the matrix \({\hat{A}}\) is tridiagonal. It can be efficiently solved, e.g., by the Thomas algorithm [13] of complexity O(n). Moreover, since \({\hat{A}}\) does not depend on Y, (11) can be viewed as a linear smoother.
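To make the O(n) cost of solving a tridiagonal system concrete, here is a minimal sketch of the Thomas algorithm. The function name `thomas_solve` and the argument layout (three lists of length n plus the right-hand side) are choices made for this illustration; no pivoting is performed, so the sketch assumes a diagonally dominant or symmetric positive definite matrix.

```python
def thomas_solve(lower, diag, upper, rhs):
    """Solve a tridiagonal linear system in O(n) operations.

    lower[i] multiplies x[i-1] in row i (lower[0] is ignored),
    upper[i] multiplies x[i+1] in row i (upper[-1] is ignored).
    All four lists have the same length n >= 1.
    """
    n = len(diag)
    c = [0.0] * n  # modified super-diagonal
    d = [0.0] * n  # modified right-hand side
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    # Forward sweep: eliminate the sub-diagonal.
    for i in range(1, n):
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot
    # Back substitution.
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


# Rows: 2x0 - x1 = 1, -x0 + 2x1 - x2 = 0, -x1 + 2x2 = 1; solution is (1, 1, 1).
solution = thomas_solve([0.0, -1.0, -1.0], [2.0, 2.0, 2.0], [-1.0, -1.0, 0.0], [1.0, 0.0, 1.0])
print(solution)
```

Both sweeps visit each row once, which is where the linear complexity comes from.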
We call the outlined algorithm for solving SMR problem (2) a smoothed pool adjacent violators (SPAV) algorithm. It employs Theorem 1.
Algorithm 1
Step 3 is called the PAV step because it pools adjacent violators, i.e., indexes i and \(i+1\) satisfying \(\mu _i > \mu _{i+1}\), into one set and enforces \(\mu _i=\mu _{i+1}\) in the following steps. The results of the SPAV algorithm and the traditional PAV algorithm are compared in Fig. 1. The stepwise response produced by the PAV algorithm can sometimes be unrealistic and deviate substantially from the true model. One can see this in Fig. 1, where the SPAV response looks much more natural and closer to the true model.
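The alternation of a smoothing step and a PAV step can be sketched as follows. This is an illustration under stated assumptions, not a transcription of Algorithm 1: the objective is assumed to be \(\sum _i (\mu _i-Y_i)^2+\sum _{i=1}^{n-1}\lambda _i(\mu _{i+1}-\mu _i)^2\) with unit observation weights, all violating pairs found in one pass are pooled at once, and the names `spav` and `solve_tridiagonal` are hypothetical.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm; lower[0] and upper[-1] are ignored."""
    n = len(diag)
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = upper[0] / diag[0], rhs[0] / diag[0]
    for i in range(1, n):
        pivot = diag[i] - lower[i] * c[i - 1]
        c[i] = upper[i] / pivot
        d[i] = (rhs[i] - lower[i] * d[i - 1]) / pivot
    x = [0.0] * n
    x[-1] = d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = d[i] - c[i] * x[i + 1]
    return x


def spav(y, lam):
    """Sketch for the ASSUMED objective
    sum_i (mu_i - y_i)^2 + sum_i lam[i] * (mu_{i+1} - mu_i)^2, mu nondecreasing.
    lam has length len(y) - 1.
    """
    n = len(y)
    if n == 0:
        return []
    blocks = [[i, i] for i in range(n)]  # [first index, last index] per block
    while True:
        m = len(blocks)
        size = [b[1] - b[0] + 1 for b in blocks]
        total = [sum(y[b[0]:b[1] + 1]) for b in blocks]
        # Penalty linking block k to block k+1 is lam at the last index of block k.
        link = [lam[blocks[k][1]] for k in range(m - 1)]
        # Smoothing step: stationarity conditions form a tridiagonal system.
        lower = [0.0] + [-v for v in link]
        upper = [-v for v in link] + [0.0]
        diag = [size[k]
                + (link[k - 1] if k > 0 else 0.0)
                + (link[k] if k < m - 1 else 0.0) for k in range(m)]
        mu = solve_tridiagonal(lower, diag, upper, total)
        # PAV step: pool every adjacent pair of blocks with mu[k-1] > mu[k].
        merged, violated = [blocks[0]], False
        for k in range(1, m):
            if mu[k - 1] > mu[k]:
                merged[-1] = [merged[-1][0], blocks[k][1]]
                violated = True
            else:
                merged.append(blocks[k])
        if not violated:
            break
        blocks = merged
    fitted = []
    for value, b in zip(mu, blocks):
        fitted.extend([value] * (b[1] - b[0] + 1))
    return fitted


print(spav([1, 3, 2, 4], [0.0, 0.0, 0.0]))  # no penalty: coincides with the PAV fit
print(spav([1, 3, 2, 4], [1.0, 1.0, 1.0]))  # positive penalties give a smoother fit
```

Every pass that finds a violator reduces the number of blocks by at least one, so there are at most n passes, each costing O(n). With all penalties set to zero, the sketch reduces to ordinary monotonic regression.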
The following theorem provides an estimate of the complexity of the SPAV algorithm.
Theorem 2
The complexity of the SPAV algorithm is \(O(n^2)\)
Proof
A smoothing step followed by a PAV step can be viewed as one iteration. As mentioned above, the number of iterations does not exceed n. The complexity of step 1 is determined by the complexity of composing and solving a tridiagonal system of equations, which is O(n). Step 2 involves O(n) comparison operations. Step 3 involves a search that requires O(n) operations, plus adding an element to a list and updating the list of blocks, which requires O(n) steps. Therefore, the total complexity of one iteration is O(n), which means that the total complexity of the algorithm is \(O(n^2)\). \(\square \)
Note that the quadratic complexity is a worst-case estimate, which assumes that the number of iterations is n. In practice, we observed that this number was smaller than n, and the CPU time of the SPAV algorithm grew with n much more slowly than \(n^2\).
3 Connections to probabilistic modeling
An important property of our model is presented in the following result.
Theorem 3
Proof
Our analysis enables us to conclude that, using the probabilistic approach, it is possible to select an appropriate structure of \(\lambda _i\) in such a way that it reflects the location of the X components. Without using probabilistic modeling, this structure would perhaps be difficult to determine. Another advantage of using probabilistic methodology is that it provides a strategy for determining consistent predictions for some new values of X, as discussed below.
Prediction for monotonic regression is a special problem, as discussed in Sysoev et al. [41].
In general, the MR is only able to compute unique predictions for the values of the input levels that are present in the training data. This can be a critical obstacle for applying MR in many machine learning applications. However, for the smoothed MR, we can define a consistent prediction model for some new predictor value \(X^*\) as follows.
The prediction model presented by (19) and (20) is a natural extension of the model given by (13) because the two models have the same structure. Accordingly, we can derive the prediction in the same way as it was done for (13), i.e., by computing a maximum a posteriori estimate of \(\mu ^* \mid \mu _i, \mu _{i+1}\).
Theorem 4
Proof
From the practical machine learning perspective, it is crucial that instead of optimizing \(n-1\) parameters in (2), we need to estimate only one parameter \(\lambda \). A possible way of doing this is cross-validation. The standard cross-validation implies that several \(\lambda \) values need to be considered, and the SMR needs to be computed for each \(\lambda \) value. This leads to the overall computational time \(O(n^2)\) (because the number of considered \(\lambda \) values is normally smaller than n and does not depend on n). However, applying the cross-validation may still be computationally expensive.
We suggest a procedure called here generalized cross-validation. It is based on the following idea supported by empirical observations. The expected (true) response should normally be increasing, and there is little chance of observing a region where the response is constant. Since in our model the response is constant for the observations that were merged into one block by the PAV step of the SPAV algorithm, we can conclude that in an ideal model there should be very few executions of the PAV step. Accordingly, removing the PAV steps from this algorithm should not affect this ideal model much, although the monotonicity may be violated. The algorithm can be presented as follows.
Algorithm 2
The fitted responses produced by the SPAV and Smoothing algorithms are presented in Fig. 2, which demonstrates that these responses are quite similar.
4 Correction on the boundaries
The MR is known to be inconsistent on the boundaries, which means that the MR estimates for \(\mu _1\) and \(\mu _n\) do not converge to the true values when n tends to infinity, and some solutions to this problem were suggested [34, 43]. This problem is also known as the spiking problem. An intuitive explanation of this issue is the asymmetry of the PAV algorithm on the boundary: The observation with index n can only be merged into a block with the observations \(i<n\), while the observation with index j located in the middle of the domain of X can be merged into a block with neighbors \(i>j\) and \(i<j\). It may happen that \(Y_n>Y_i\) for all \(i<n\). In this case, the PAV algorithm does not make any error correction and returns \({\hat{\mu }}_n=Y_n\).
A similar kind of asymmetry can be observed in the SMR problem: The objective function in (2) contains two penalty factors (involving \(i+1\) and \(i-1\)) for all \(i=2, \ldots , n-1\), while for \(i=n\) or \(i=1\) there is only one penalty factor. Figure 1 demonstrates that the SPAV fitted response can sometimes behave badly on the boundaries, i.e., it can be shifted toward the observed values of \(Y_1\) and \(Y_n\).
Consider now how to choose the parameter \(\phi \). One possibility is to use cross-validation, as it was suggested for \(\lambda \). However, selecting an optimal combination of two parameters might be computationally expensive, and we describe an alternative approach that appeared to work well in practice.
Algorithm 3
Note that the linear equations in Algorithm 3 involve tridiagonal matrices only, and the rest of the algorithm contains simple vector operations. Therefore, the complexity of computing the correction is O(n), and the overall \(O(n^2)\) complexity of the SPAV algorithm is not affected by introducing the correction.
5 Numerical results

monotonic smoothers \(m_{\mathrm{IS}}\) and \(m_{\mathrm{SI}}\) described in Mammen [24],

PAV solution \(m_{\mathrm{PAV}}\),
Computational time (s) for SCAM algorithm with a number of knots increasing with n
n     100   200   300   400   500   600   700   800   900   1000
Time  0.28  1.25  4.07  8.85  16.7  18.2  30.1  39.4  50.5  77.6

Computational time (s) for BIR algorithm with a number of knots increasing with n
n     100   120   140   160   180   200   220   240   260   280    300
Time  12.2  17.6  24.0  31.2  39.9  49.5  60.3  73.1  86.7  101.6  118.7
Our comparative simulation studies involve the data generated as \(X \sim U[0,A]\), \(Y=f(X)+\epsilon \), \(\epsilon \sim N(0,s)\) for all combinations of the following settings: \(n=100\), 1000 or 10,000; \(s=0.03\) or 0.1; the functional shape f is one of the functions \(f_1(X)=X\), \(f_2(X)=X^2\), \(f_3(X)=(X+\sin (X))/10\), \(f_4(X)=\tanh \left( \frac{n}{10} (X-0.5) \right) \); and \(A=\left\lfloor n/5\right\rfloor \) when \(f_3\) is used and \(A=1\) otherwise. These functions were chosen in order to represent various possible behaviors of the underlying model, such as linearity, nonlinearity, increasing complexity or a sudden level shift. The latter property is dismissed by many methods although change points are often present in machine learning applications. To measure the uncertainty of our results, we generate \(M=100\) instances for the same model.
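The simulation design can be sketched in a few lines of Python. Several details are assumptions of this illustration rather than statements from the paper: s is treated as the standard deviation of the noise (the notation \(N(0,s)\) leaves this open), the argument of tanh in \(f_4\) is read as \(\frac{n}{10}(X-0.5)\), the inputs are sorted for convenience, and the names `make_response` and `simulate` are hypothetical.

```python
import math
import random


def make_response(name, n):
    """Return the response function f and the upper bound A of the input range."""
    if name == "f1":
        return (lambda x: x), 1.0
    if name == "f2":
        return (lambda x: x ** 2), 1.0
    if name == "f3":
        # Complexity grows with n because the range [0, A] widens: A = floor(n / 5).
        return (lambda x: (x + math.sin(x)) / 10.0), float(n // 5)
    if name == "f4":
        # Level shift at x = 0.5 that becomes sharper as n grows.
        return (lambda x: math.tanh(n / 10.0 * (x - 0.5))), 1.0
    raise ValueError("unknown response function: " + name)


def simulate(name, n, s, seed=0):
    """Draw one data set: X ~ U[0, A], Y = f(X) + noise (s assumed to be the sd)."""
    rng = random.Random(seed)
    f, upper = make_response(name, n)
    xs = sorted(rng.uniform(0.0, upper) for _ in range(n))
    ys = [f(x) + rng.gauss(0.0, s) for x in xs]
    return xs, ys


# One instance for each functional shape with n = 100 and s = 0.1.
for shape in ("f1", "f2", "f3", "f4"):
    xs, ys = simulate(shape, 100, 0.1, seed=1)
    print(shape, len(xs), round(xs[-1], 3))
```

Repeating `simulate` with M different seeds gives the M instances used to assess the variability of the results.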

\(m_{1}\): SPAV algorithm with linear kernel with correction where \(\lambda \) was selected by the generalized cross-validation with 10 folds,

\(m_{2}\): SPAV algorithm with linear kernel with correction where \(\lambda \) was selected by the cross-validation with 10 folds,

\(m_{3}\): SPAV algorithm with quadratic kernel without correction where \(\lambda \) was selected by the generalized cross-validation with 10 folds,

\(m_{4}\): SPAV algorithm with linear kernel without correction where \(\lambda \) was selected by the generalized cross-validation with 10 folds,

\(m_{\mathrm{SCAM}}\): SCAM algorithm. Due to time limitations, we fixed \(k=15\) as it was done in the R package documentation of SCAM.

\(m_{\mathrm{BIR}}\): BIR algorithm. Due to time limitations, we fixed \(k=3\) as it was recommended in Neelon and Dunson [30].

monotonic smoothers \(m_{\mathrm{IS}}\) and \(m_{\mathrm{SI}}\),

monotonic regression \(m_{\mathrm{PAV}}\),
By analyzing MSE values, it can be concluded that the SPAV algorithm has two main competitors: the SCAM algorithm and the PAV algorithm. The other tested algorithms appear to have a worse predictive performance than one of these three algorithms.
When the underlying model is simple, i.e., when it is given by functions \(f_1\) or \(f_2\), the average MSE of the SCAM algorithm is smaller than the average MSE values of all other algorithms, including SPAV. However, it must be noted that the SPAV algorithm is ranked second for these cases, and its MSE values are often quite close to those of the SCAM algorithm. Moreover, for some settings the difference is not statistically significant.
When the underlying model is complex or has sharp level shifts (change points), the average MSE values of the SCAM algorithm are quite high even for small n values, and the predictive performance dramatically decreases for large n values. For such underlying models, the SPAV algorithm is significantly more precise than the SCAM algorithm. Our results demonstrate that the SPAV algorithm has the lowest MSE values for these underlying models, while the PAV algorithm is ranked second, and the difference in predictive performance between these two algorithms is statistically significant.
By comparing the variants of the SPAV algorithm in Table 3, it can be concluded that applying the boundary correction often helps to improve the predictive power of the algorithm, and the effect is quite substantial for simpler models. It appears that applying generalized cross-validation instead of the usual cross-validation often leads to similar MSE values, which was expected due to the results presented in Sect. 3. Results presented in the table also indicate that employing a linear kernel seems to lead to a better predictive performance than using a quadratic kernel.
To compare the predictive performance of \(m_1\) and \(m_{\mathrm{SCAM}}\), we apply the following procedure to each of the two datasets. In step 1, we sample (nonoverlapping) training and test data of the size \(n=10^4\) from the given dataset. In step 2, we fit \(m_1\) and \(m_{\mathrm{SCAM}}\) to the training data (in model \(m_1\), the generalized cross-validation is applied to the current training data) and compute the MSE values for the test data. Steps 1 and 2 are repeated \(M=100\) times, and the obtained set of the MSE values is used to compute the mean MSE value and the standard error of the mean MSE for both models. Table 5 illustrates the results. In addition, we compute the test errors by sampling \(n=10^5\) observations from the ‘Gas sensor’ data and fitting the \(m_1\) and \(m_{\mathrm{SCAM}}\) models. The test error for \(m_1\) is \(29.09\times 10^{3}\) while for \(m_{\mathrm{SCAM}}\) it is \(30.85 \times 10^{3}\).
It can be concluded from the presented analysis of real datasets that \(m_1\) has a statistically significantly better performance than \(m_{\mathrm{SCAM}}\) for these data. A possible explanation for this result is that the \(m_1\) model is flexible enough to discover small but significant trends in the data which are not discovered by \(m_{\mathrm{SCAM}}\) (as Fig. 5 illustrates), and \(m_1\) is better than \(m_{\mathrm{SCAM}}\) in modeling sharp level shifts (as Fig. 6 illustrates).
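The repeated sampling procedure in steps 1 and 2 can be sketched generically. The helper name `repeated_holdout_mse` and the `fit` callback interface are hypothetical; any estimator that returns a prediction function could be plugged in, and the sample standard deviation divided by \(\sqrt{M}\) is used as the standard error of the mean.

```python
import math
import random


def repeated_holdout_mse(xs, ys, fit, n, repetitions, seed=0):
    """Mean test MSE and its standard error over repeated train/test splits.

    fit(train_x, train_y) must return a function mapping x to a prediction.
    Requires len(xs) >= 2 * n and repetitions >= 2.
    """
    rng = random.Random(seed)
    indices = list(range(len(xs)))
    errors = []
    for _ in range(repetitions):
        # Step 1: draw nonoverlapping training and test samples of size n each.
        chosen = rng.sample(indices, 2 * n)
        train, test = chosen[:n], chosen[n:]
        # Step 2: fit on the training part and evaluate the MSE on the test part.
        predict = fit([xs[i] for i in train], [ys[i] for i in train])
        mse = sum((ys[i] - predict(xs[i])) ** 2 for i in test) / n
        errors.append(mse)
    mean = sum(errors) / repetitions
    variance = sum((e - mean) ** 2 for e in errors) / (repetitions - 1)
    return mean, math.sqrt(variance / repetitions)


def fit_mean(train_x, train_y):
    """Toy estimator used only for this demonstration: predicts the training mean."""
    level = sum(train_y) / len(train_y)
    return lambda x: level


data_rng = random.Random(0)
xs = [i / 200.0 for i in range(200)]
ys = [x + data_rng.gauss(0.0, 0.1) for x in xs]
print(repeated_holdout_mse(xs, ys, fit_mean, n=50, repetitions=20))
```

Applying the same function to two competing estimators with the same seed evaluates them on identical splits, which makes the resulting mean MSE values directly comparable.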
6 Conclusions
In this work, we introduce an approach for smoothed monotonic regression in one predictor variable. It allows for adjusting the degree of smoothness to the data. We have shown that the worst-case computational complexity of this method is \(O(n^2)\), which makes the method suitable in large-scale settings. Moreover, it is found in the accompanying paper [10] that, in practice, its running time grows almost linearly with n for a given set of penalty factors \(\Lambda \).
In many machine learning applications, the dependence between the response and predictor variables is a complicated function. Our numerical experiments demonstrate that the predictive performance of the SCAM and BIR methods can substantially degrade when complicated data are involved, unless a sufficiently large number of knots is used. At the same time, it may be impossible to choose a proper number of knots in these algorithms without making them prohibitively expensive. The SMR method is free of this shortcoming. Our simulations demonstrate that, in these settings, the SMR method has the best predictive performance compared to the other tested algorithms. It should be emphasized that the computational cost of our approach does not depend on the complexity of the underlying response function.
Another advantage of the SMR method is that it is able to treat both smooth and nonsmooth responses: For small values of the smoothing parameter, a response close to piecewise constant is produced, and increasing the smoothing parameter value increases the degree of smoothness. Our numerical experiments reveal that our approach has a superior predictive performance in situations when there is a sharp change in the level of the response function, which is important for applications characterized by the presence of change points.
The SMR problem was introduced as a regularization problem with n regularization parameters. Then, a connection to a probabilistic model was established which assured the same MAP estimate as the solution to the regularization problem. The probabilistic formulation provided us with a clear strategy for reducing n regularization parameters to just one parameter by properly setting the structure of the prior variance and thus reducing the overall worst-case computational cost to \(O(n^2)\). The prior variance depends on the kernel structure, and the predictive performance was better when linear kernels were used compared to the case with quadratic kernels. In addition, the probabilistic formulation provided us with a consistent predictive model for the input values which are not present in the training data. Further, we motivated a generalized cross-validation strategy for the selection of the smoothing parameter. Our preliminary experiments confirmed that the generalized cross-validation produces results which are similar to those of cross-validation while requiring far less computational effort. Finally, a boundary effect problem was studied, and a strategy for the inconsistency correction was suggested. The proposed strategy had a clear positive effect on the predictive performance of the SMR method.
For estimating the predictive performance of the SMR method, it was natural to use MSE values (as we have done in our numerical experiments). For practical data, measures of uncertainty, such as confidence (credible) limits, can also be derived. A natural way of deriving such intervals would be to use the probabilistic formulation (12) and derive the posterior distribution of the quantity of interest, but this would require a full Bayesian inference for this model. This is avoided in our scalable algorithms. An alternative (which we recommend) is to use bootstrap techniques [14] together with the estimator provided by the result of Algorithms 1 or 3.
Some peculiarities of our model are noteworthy. First, although the fitted response is continuous, it is in general not differentiable. This is because the predictive model uses only two adjacent fitted values to make a prediction for a new input value, so the left and right derivatives of the fitted response may differ at some points. Another peculiarity is that the fitted response and the prediction depend on the choice of the kernel, but this is also typical for many other methods, such as kernel smoothers.
References
1. Acton S, Bovik A (1998) Nonlinear image estimation using piecewise and local image models. IEEE Trans Image Process 7:979–991
2. Aït-Sahalia Y, Duarte J (2003) Nonparametric option pricing under shape restrictions. J Econ 116:9–47
3. Asuncion A, Newman D (2007) UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html
4. Ayer M, Brunk HD, Ewing GM, Reid W, Silverman E (1955) An empirical distribution function for sampling with incomplete information. Ann Math Stat 26(4):641–647
5. Barlow R, Bartholomew D, Bremner J, Brunk H (1972) Statistical inference under order restrictions. Wiley, New York
6. Barsocchi P, Crivello A, La Rosa D, Palumbo F (2016) A multisource and multivariate dataset for indoor localization methods based on WLAN and geomagnetic field fingerprinting. In: International Conference on indoor positioning and indoor navigation (IPIN), 2016. IEEE, pp 1–8
7. Best M (1984) Equivalence of some quadratic programming algorithms. Math Program 30(1):71–87
8. Bollaerts K, Eilers P, Mechelen I (2006) Simple and multiple P-splines regression with shape constraints. Br J Math Stat Psychol 59(2):451–469
9. Brezger A, Steiner W (2008) Monotonic regression based on Bayesian P-splines. J Bus Econ Stat 26(1):90–104
10. Burdakov O, Sysoev O (2017) A dual active-set algorithm for regularized monotonic regression. J Optim Theory Appl 172(3):929–949
11. Burdakov O, Sysoev O, Grimvall A, Hussian M (2006b) An \(o(n^2)\) algorithm for isotonic regression problems. In: Di Pillo G, Roma M (eds) Nonconvex optimization and its applications. Springer, Berlin, pp 25–33
12. Chen Y, Samworth R (2015) Generalized additive and index models with shape constraints. J R Stat Soc Ser B (Stat Methodol) 78:729–754
13. de Boor C (1972) Elementary numerical analysis. McGraw-Hill, New York
14. Efron B, Tibshirani R (1994) An introduction to the bootstrap. CRC Press, Boca Raton
15. Fonollosa J, Sheik S, Huerta R, Marco S (2015) Reservoir computing compensates slow response of chemosensor arrays exposed to fast varying gas concentrations in continuous monitoring. Sens Actuators Chem 215:618–629
16. Friedman J, Hastie T, Tibshirani R (2001) The elements of statistical learning, Springer series in statistics, vol 1. Springer, Berlin
17. Friedman J, Tibshirani R (1984) The monotone smoothing of scatterplots. Technometrics 26(3):243–250
18. Goldfarb D, Idnani A (1983) A numerically stable dual method for solving strictly convex quadratic programs. Math Program 27(1):1–33
19. Hall P, Huang L (2001) Nonparametric kernel regression subject to monotonicity constraints. Ann Stat 29:624–647
20. Holmes C, Heard N (2003) Generalized monotonic regression using random change points. Stat Med 22(4):623–638
21. Kalish ML, Dunn J, Burdakov O, Sysoev O (2016) A statistical test of the equality of latent orders. J Math Psychol 70:1–11
22. Lee C (1996) On estimation for monotone dose-response curves. J Am Stat Assoc 91:1110–1119
23. Lin L, Dunson D (2014) Bayesian monotone regression using Gaussian process projection. Biometrika 101(2):303–317
24. Mammen E (1991) Estimating a smooth monotone regression function. Ann Stat 19:724–740
25. Maxwell WL, Muckstadt JA (1985) Establishing consistent and realistic reorder intervals in production–distribution systems. Oper Res 33(6):1316–1341
26. Meyer M (2008) Inference using shape-restricted regression splines. Ann Appl Stat 2:1013–1033
27. Meyer M (2012) Constrained penalized splines. Can J Stat 40(1):190–206
28. Mukerjee H (1988) Monotone nonparametric regression. Ann Stat 16:741–750
29. Murphy K (2012) Machine learning: a probabilistic perspective. MIT Press, Cambridge
30. Neelon B, Dunson DB (2003) Bayesian isotonic regression and trend analysis. Technical report, University of North Carolina, National Institute of Environmental Health Sciences. http://ftp.stat.duke.edu/WorkingPapers/0315.html. Accessed 11 Sept 2016
31. Neelon B, Dunson DB (2004) Bayesian isotonic regression and trend analysis. Biometrics 60(2):398–406
32. Neelon B, Dunson DB R code for ‘Bayesian isotonic regression and trend analysis’. http://people.musc.edu/~brn200/. Accessed 11 Sept 2016
33. Nocedal J, Wright S (2006) Numerical optimization. Springer, Berlin
34. Pal J (2008) Spiking problem in monotone regression: penalized residual sum of squares. Stat Probab Lett 78(12):1548–1556
35. Pya N, Wood SN (2015) Shape constrained additive models. Stat Comput 25(3):543–559
36. Ramsay JO (1988) Monotone regression splines in action. Stat Sci 3:425–441
37. Riihimäki J, Vehtari A (2010) Gaussian processes with monotonicity information. AISTATS 9:645–652
38. Robertson T, Wright F, Dykstra R (1988) Order restricted statistical inference. Wiley, New York
39. Shively T, Sager T, Walker S (2008) A Bayesian approach to nonparametric monotone function estimation. J R Stat Soc Ser B (Stat Methodol) 71(1):159–175
40. Sysoev O, Burdakov O, Grimvall A (2011) A segmentation-based algorithm for large-scale monotonic regression problems. J Comput Stat Data Anal 55:2463–2476
41. Sysoev O, Grimvall A, Burdakov O (2016) Bootstrap confidence intervals for large-scale multivariate monotonic regression problems. Commun Stat Simul Comput 45(3):1025–1040
42. Wang X, Berger JO (2016) Estimating shape constrained functions using Gaussian processes. SIAM/ASA J Uncertain Quantif 4(1):1–25
43. Wu J, Meyer M, Opsomer J (2015) Penalized isotonic regression. J Stat Plan Inference 161:12–24
44. Zhang J (2004) A simple and efficient monotone smoother using smoothing splines. J Nonparametr Stat 16(5):779–796
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.