
Robust kernel-based regression with bounded influence for outliers

  • General Paper
Journal of the Operational Research Society

Abstract

Kernel-based regression (KBR) methods, such as support vector regression (SVR), are a well-established methodology for estimating the nonlinear functional relationship between a response variable and predictor variables. KBR methods can be very sensitive to influential observations, which in turn have a noticeable impact on the model coefficients. The robustness of KBR methods has recently been the subject of wide-scale investigation, with the aim of obtaining a regression estimator that is insensitive to outlying observations. However, existing robust KBR (RKBR) methods consider only Y-space outliers and, consequently, are sensitive to X-space outliers; even a single anomalous observation in X-space may greatly affect the estimator. To resolve this issue, we propose a new RKBR method that gives reliable results even when the training data set is contaminated with both Y-space and X-space outliers. The proposed method uses a weighting scheme based on the hat matrix that resembles the generalized M-estimator (GM-estimator) of conventional robust linear analysis. The diagonal elements of the hat matrix in the kernel-induced feature space are used as leverage measures to downweight the effects of potential X-space outliers. We show that the kernelized hat diagonal elements can be obtained via the eigendecomposition of the kernel matrix. A regularized version of the kernelized hat diagonal elements is also proposed to handle the case in which the kernel matrix has full rank and the kernelized hat diagonal elements are not suitable as leverage measures. We show that the two kernelized leverage measures, the kernel hat diagonal element and its regularized counterpart, are related to statistical distance measures in the feature space. We also develop an efficient kernelized training algorithm for parameter estimation based on the iteratively reweighted least squares (IRLS) method. Experimental results on simulated examples and real data sets demonstrate the robustness of the proposed method compared with conventional approaches.
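To make the leverage idea concrete, the following is a minimal NumPy sketch of kernelized hat diagonals and their regularized counterpart as described above. The helper names (gaussian_kernel, kernel_hat_diagonals), the Gaussian kernel, the rank threshold, and the value of gamma are illustrative assumptions, not the authors' settings or code.

import numpy as np

def gaussian_kernel(X, sigma=1.0):
    # Pairwise squared Euclidean distances, then the Gaussian (RBF) kernel matrix.
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-d2 / (2.0 * sigma**2))

def kernel_hat_diagonals(K, gamma=None):
    # Leverage-style diagnostics from the eigendecomposition of a kernel matrix K.
    # gamma=None returns the plain kernelized hat diagonals (sum of squared
    # eigenvector entries over the non-zero eigenvalues); a positive gamma
    # returns the regularized version, which stays informative when K has full rank.
    eigvals, U = np.linalg.eigh(K)            # K = U diag(eigvals) U'
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative values
    if gamma is None:
        nonzero = eigvals > 1e-10 * eigvals.max()
        return np.sum(U[:, nonzero]**2, axis=1)
    weights = eigvals / (eigvals + gamma)     # shrink directions with small eigenvalues
    return np.sum(weights * U**2, axis=1)

# Toy usage: flag X-space outliers by their (regularized) kernel leverage.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
X[0] = [8.0, 8.0]                             # an X-space outlier
K = gaussian_kernel(X, sigma=2.0)
lev = kernel_hat_diagonals(K, gamma=0.1)
print(np.argsort(lev)[-3:])                   # indices with the largest leverage

In a robust fit, such leverages would be turned into GM-estimator-style weights that downweight high-leverage points inside an IRLS loop; the outlying point is expected to receive one of the largest leverages here.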




Author information


Corresponding author

Correspondence to Myong K Jeong.

Appendices

Appendix A

The diagonal element of a hat matrix in the feature space

Using the SVD, Φ(X) can be decomposed as Φ(X) = UΛV′, where U is an n × n matrix whose columns are eigenvectors of Φ(X)Φ(X)′, V is a q × q matrix whose columns are eigenvectors of Φ(X)′Φ(X), and Λ is an n × q matrix whose ith diagonal element is the ith singular value of Φ(X), for i = 1, 2, …, min(n, q). Without loss of generality, we assume that the eigenvectors are sorted in descending order of their eigenvalues. An inverse of Φ(X)′Φ(X) can then be obtained as
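Using the orthogonality of V, this is presumably (with a generalized inverse understood when Λ′Λ is singular)

$$ (\Phi(X)'\Phi(X))^{-1} = \big(V\Lambda'\Lambda V'\big)^{-1} = V(\Lambda'\Lambda)^{-1}V'. $$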

Moreover, since U′U and V′V are identity matrices, Equation (3) can be written as
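Assuming Equation (3) is the feature-space hat matrix H = Φ(X)(Φ(X)′Φ(X))^{−1}Φ(X)′, substituting the SVD presumably gives

$$ H = U\Lambda V'\,V(\Lambda'\Lambda)^{-1}V'\,V\Lambda'U' = U\Lambda(\Lambda'\Lambda)^{-1}\Lambda'U'. $$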

It can be shown that Λ(Λ′Λ)−1Λ′ is an n × n diagonal matrix with r ones and (n−r) zeros, where r is the number of non-zero singular values of Φ(X). Note that r is also equal to the number of non-zero eigenvalues of Φ(X)Φ(X)′ and Φ(X)′Φ(X). Therefore,
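In block form this is presumably

$$ H = U\begin{pmatrix} I_r & 0_{r\times(n-r)} \\ 0_{(n-r)\times r} & 0_{(n-r)\times(n-r)} \end{pmatrix}U' = \sum_{j=1}^{r} u_j u_j', $$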

where I_m denotes an m × m identity matrix, 0_{m×n} an m × n matrix all of whose elements are zero, and u_j is a column vector of U. The leverage of the ith observation (the ith diagonal element of the hat matrix H) can now be obtained as
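that is, presumably,

$$ h_{ii} = \sum_{j=1}^{r} u_{ij}^{2}, $$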

where u_ij is the ith element of u_j.

Appendix B

Proof of Proposition 1

Let φ_μ be the empirical mean vector defined as φ_μ = (1/n)∑_{i=1}^{n} φ(x_i) = (1/n)Φ(X)′1_n, where 1_n is an n × 1 vector of all ones. The observations in the feature space are centred by subtracting their mean, such that
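A sketch of the omitted centring step, writing Φ̃(X) for the centred feature matrix (notation introduced here for readability):

$$ \tilde{\Phi}(X) = \Phi(X) - 1_n\,\varphi_\mu' = \Big(I_n - \tfrac{1}{n}1_n1_n'\Big)\Phi(X). $$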

Therefore, the kernel matrix for centred data can be obtained without the explicit form of a mean vector as follows:
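Presumably this is the usual kernel-centring identity, with K = Φ(X)Φ(X)′ and K̃ the centred kernel matrix:

$$ \tilde{K} = \tilde{\Phi}(X)\tilde{\Phi}(X)' = \Big(I_n - \tfrac{1}{n}1_n1_n'\Big)K\Big(I_n - \tfrac{1}{n}1_n1_n'\Big). $$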

It should be noted that centring Φ(X) reduces its rank by 1. Therefore, the rank of the centred feature matrix (and hence of the centred kernel matrix K̃) will be r − 1, where r is the rank of Φ(X).

i)

    The squared Mahalanobis distance to the mean vector in the feature space is defined as:

    where φ̃(x_i) denotes φ(x_i) − φ_μ. Then,

    where r − 1 is the number of non-zero eigenvalues of the centred kernel matrix K̃ and ũ_ij is the ith element of the jth eigenvector of K̃, with eigenvectors sorted in descending order of eigenvalue size (see the sketch after this proof).

ii)

    The unit length-scaled distance SD_i to the mean vector in the space transformed by KPCA can be written as a sum of squared kernel principal component scores, each divided by its variance, since λ_j is the variance of the jth kernel principal component and the score of the ith observation is its projection onto the direction v_j (ie, the jth kernel principal component). Owing to the relationship between u_j and v_j,

SD_i can then be rewritten in terms of ũ_ij, from which the proposition follows; a sketch of these steps is given below.
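A sketch of the omitted displays in (i) and (ii), under standard KPCA conventions; here ũ_j and λ_j denote the eigenvectors and eigenvalues of K̃, e_i is the ith standard basis vector, z_ij = φ̃(x_i)′v_j is the jth kernel principal component score of the ith observation, and the sample covariance S uses divisor n − 1 (this notation and scaling are assumptions):

$$ MD_i^2 = \tilde{\varphi}(x_i)'\,S^{+}\,\tilde{\varphi}(x_i), \qquad S = \tfrac{1}{n-1}\tilde{\Phi}(X)'\tilde{\Phi}(X), $$
$$ MD_i^2 = (n-1)\,e_i'\tilde{\Phi}(X)\big(\tilde{\Phi}(X)'\tilde{\Phi}(X)\big)^{+}\tilde{\Phi}(X)'e_i = (n-1)\sum_{j=1}^{r-1}\tilde{u}_{ij}^{2}, $$
$$ SD_i^2 = \sum_{j=1}^{r-1}\frac{z_{ij}^2}{\lambda_j}, \qquad v_j = \frac{1}{\sqrt{\lambda_j}}\,\tilde{\Phi}(X)'\tilde{u}_j \;\Rightarrow\; z_{ij} = \sqrt{\lambda_j}\,\tilde{u}_{ij}, $$
$$ SD_i^2 = \sum_{j=1}^{r-1}\tilde{u}_{ij}^{2} = \tilde{h}_{ii}. $$

Both distances therefore reduce, up to the factor n − 1, to the kernelized hat diagonal of the centred data.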

Appendix C

Proof of Proposition 2

If r = n, the hat matrix of Appendix A becomes H = ∑_{j=1}^{n} u_j u_j′ = UU′. As U is an orthogonal matrix, UU′ = I_n and hence h_ii = ∑_{j=1}^{n} u_ij² = 1 for all i, so the hat diagonal elements are no longer informative as leverage measures.

Appendix D

Proof of Proposition 3

The proof follows an approach similar to that of Appendix A, based on the SVD. By the spectral decomposition, K can be decomposed as K = UDU′, where U is an n × n matrix whose columns are eigenvectors of K, and D is an n × n diagonal matrix with the eigenvalues λ_i of K, i = 1, …, n, on its diagonal. We assume that λ_i is the ith largest eigenvalue and that the eigenvectors are sorted in descending order of their eigenvalues. An inverse of K + γI_n can then be written as
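Using the orthogonality of U, this is presumably

$$ (K + \gamma I_n)^{-1} = \big(U(D + \gamma I_n)U'\big)^{-1} = U(D + \gamma I_n)^{-1}U'. $$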

Equation (6) can be rewritten as
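Assuming Equation (6) is the regularized hat matrix K(K + γI_n)^{−1} derived in Appendix E, this presumably gives

$$ K(K + \gamma I_n)^{-1} = UDU'\,U(D + \gamma I_n)^{-1}U' = UD(D + \gamma I_n)^{-1}U'. $$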

Therefore, the hat diagonal element of the regularized hat matrix is given by
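that is, presumably,

$$ h_{ii}(\gamma) = \sum_{j=1}^{r}\frac{\lambda_j}{\lambda_j + \gamma}\,u_{ij}^{2}, $$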

where r is the number of non-zero eigenvalues of K, λ_j is the jth largest eigenvalue of K, and u_ij is the ith element of the jth eigenvector of K.

We can verify that parts (i) and (ii) of Proposition 3 can be derived using the above results. If we assume that Φ(X) is centred, the regularized hat diagonal can be described in terms of kernel PCA (see the proof of Proposition 1), since it can be rewritten as
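With the notation of the proof of Proposition 1 (z_ij = √λ_j ũ_ij), a plausible form of the omitted display is

$$ \tilde{h}_{ii}(\gamma) = \sum_{j=1}^{r-1}\frac{\lambda_j}{\lambda_j + \gamma}\,\tilde{u}_{ij}^{2} = \sum_{j=1}^{r-1}\frac{z_{ij}^{2}}{\lambda_j + \gamma}. $$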

Therefore, at a given distance from the centre, the leverage of an observation that lies in the direction of a major principal component (in the feature space) is smaller than the leverage of an observation that lies in the direction of a minor principal component (in the feature space).

Appendix E

Derivation of Equation (6)

For any matrices U and V for which I + UV and I + VU are nonsingular, the following matrix identity holds:
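namely, the standard push-through identity,

$$ (I + UV)^{-1}U = U(I + VU)^{-1}. $$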

Letting U = (1/γ)Φ(X)′ and V = Φ(X), Equation (5) can be rewritten as
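Assuming Equation (5) is the regularized feature-space hat matrix Φ(X)(Φ(X)′Φ(X) + γI_q)^{−1}Φ(X)′, the identity presumably gives

$$ \Phi(X)\big(\Phi(X)'\Phi(X) + \gamma I_q\big)^{-1}\Phi(X)' = \Phi(X)\,\Phi(X)'\big(\Phi(X)\Phi(X)' + \gamma I_n\big)^{-1}. $$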

Let K = [K_ij]_{i,j=1}^{n} be an n × n matrix with entries K_ij = k(x_i, x_j) = φ(x_i)′φ(x_j) for i, j = 1, …, n. Then Equation (5) can be rewritten as the following equation.
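This is presumably Equation (6):

$$ K\big(K + \gamma I_n\big)^{-1}. $$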

Appendix F

Derivation of the estimates of training response values in Section 3.4

From Equation (10), the following results can be obtained.

where

The above expression can then be rewritten by using the matrix identity of Appendix E. Letting U = (1/λ)Φ(X)′ and V = QΦ(X), β is given by

Thus,
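As a sketch, under the assumption that Equation (10) is a weighted regularized least-squares problem with diagonal weight matrix Q (the IRLS weights) and regularization parameter λ, the omitted displays would read

$$ \beta = \big(\Phi(X)'Q\Phi(X) + \lambda I_q\big)^{-1}\Phi(X)'QY = \Phi(X)'\big(QK + \lambda I_n\big)^{-1}QY, $$
$$ \hat{Y} = \Phi(X)\beta = K\big(QK + \lambda I_n\big)^{-1}QY. $$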


About this article


Cite this article

Hwang, S., Kim, D., Jeong, M. et al. Robust kernel-based regression with bounded influence for outliers. J Oper Res Soc 66, 1385–1398 (2015). https://doi.org/10.1057/jors.2014.42
