Abstract
In this work we propose a wavelet-based method for binary classification. Based on a training data set, we construct a classification rule with minimum mean squared error. Under mild assumptions, we present asymptotic results that provide the rates of convergence of our method relative to the Bayes classifier, ensuring universal consistency and strong universal consistency. Furthermore, to evaluate the performance of the proposed methodology for finite samples, we illustrate the approach with Monte Carlo simulations and applications to real data sets. The proposed methodology is compared with two classification methods widely used in the literature: the support vector machine and the logistic regression model. Numerical results show a very competitive performance of the new wavelet-based classifier.
References
Bosq, D. (1998). Nonparametric Statistics for Stochastic Processes: Estimation and Prediction, vol. 10 of Lecture Notes in Statistics, 2nd edn. Springer, New York.
Cai, T.T. and Brown, L.D. (1998). Wavelet shrinkage for nonequispaced samples. Ann. Statist. 26, 1783–1799.
Cai, T.T. and Brown, L.D. (1999). Wavelet estimation for samples with random uniform design. Statist. Probab. Lett. 42, 313–321.
Chang, W., Kim, S.-H. and Vidakovic, B. (2003). Wavelet-based estimation of a discriminant function. Appl. Stoch. Models Bus. Ind. 19, 185–198.
Daubechies, I. (1992). Ten Lectures on Wavelets. Regional Conference Series in Applied Mathematics. Society for Industrial and Applied Mathematics, Philadelphia.
Daubechies, I. and Lagarias, J.C. (1991). Two-scale difference equations I: existence and global regularity of solutions. SIAM J. Math. Anal. 22, 1388–1410.
Daubechies, I. and Lagarias, J.C. (1992). Two-scale difference equations II: local regularity, infinite products of matrices and fractals. SIAM J. Math. Anal. 23, 1031–1079.
Devroye, L., Györfi, L. and Lugosi, G. (1996). A Probabilistic Theory of Pattern Recognition. Stochastic Modelling and Applied Probability. Springer, New York.
Donoho, D.L. (1993). Nonlinear wavelet methods for recovery of signals, densities, and spectra from indirect and noisy data. In Proceedings of Symposia in Applied Mathematics, vol. 47, pp. 173–205. AMS.
Greblicki, W. (1981). Asymptotic efficiency of classifying procedures using the Hermite series estimate of multivariate probability densities. IEEE Trans. Inform. Theory 27, 364–366.
Greblicki, W. and Pawlak, M. (1982). A classification procedure using the multiple Fourier series. Inform. Sci. 26, 115–126.
Greblicki, W. and Pawlak, M. (1983). Almost sure convergence of classification procedures using Hermite series density estimates. Pattern Recognition Letters 2, 13–17.
Greblicki, W. and Rutkowski, L. (1981). Density-free Bayes risk consistency of nonparametric pattern recognition procedures. Proceedings of the IEEE 69, 482–483.
Hastie, T., Tibshirani, R. and Friedman, J.H. (2017). The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd edn. Springer, New York.
Huang, J.Z. and Shen, H. (2004). Functional coefficient regression models for non-linear time series: a polynomial spline approach. Scand. J. Stat. 31, 515–534.
Kulik, R. and Raimondo, M. (2009). Wavelet regression in random design with heteroscedastic dependent errors. Ann. Statist. 37, 3396–3430.
Lichman, M. (2013). UCI Machine Learning Repository. University of California, School of Information and Computer Science, Irvine. http://archive.ics.uci.edu/ml.
Mallat, S. (2008). A Wavelet Tour of Signal Processing: The Sparse Way, 3rd edn. Academic Press, New York.
Mangasarian, O.L., Street, W.N. and Wolberg, W.H. (1995). Breast cancer diagnosis and prognosis via linear programming. Oper. Res. 43, 570–577.
Montoril, M.H., Morettin, P.A. and Chiann, C. (2014). Spline estimation of functional coefficient regression models for time series with correlated errors. Statist. Probab. Lett. 92, 226–231.
Ogden, R.T. (1997). Essential Wavelets for Statistical Applications and Data Analysis. Birkhäuser, Boston.
Pandit, S.M. and Wu, S.-M. (1993). Time Series and System Analysis with Applications. Krieger Publishing Company, Malabar.
Ramírez, P. and Vidakovic, B. (2010). Wavelet density estimation for stratified size-biased sample. J. Statist. Plann. Inference 140, 419–432.
Restrepo, J.M. and Leaf, G.K. (1997). Inner product computations using periodized Daubechies wavelets. Internat. J. Numer. Methods Engrg. 40, 3557–3578.
Van Ryzin, J. (1966). Bayes risk consistency of classification procedures using density estimation. Sankhyā, Ser. A 28, 261–270.
Vidakovic, B. (1999). Statistical Modeling by Wavelets. Wiley Series in Probability and Statistics. Wiley-Interscience, New York.
Zhao, Y., Ogden, R.T. and Reiss, P.T. (2012). Wavelet-based LASSO in functional linear regression. J. Comput. Graph. Statist. 21, 600–617.
Acknowledgments
The authors thank the Editor for the insightful suggestions and comments that led to a considerable improvement of the paper. The first author acknowledges FAPESP (Fundação de Amparo à Pesquisa do Estado de São Paulo) for the visit to the Georgia Institute of Technology (2013/21273-5) and for his post-doc at the University of Campinas (2013/09035-1).
Appendix: Proofs
In this appendix we provide the proofs of the theorems presented in Section 3. The proof of Theorem 1 is partially based on the proof of Theorem 1 in Montoril, Morettin and Chiann (2014) and on the proof of Theorem 1 in Huang and Shen (2004). We use two lemmas and one proposition, which are given below. For the sake of simplicity, we will hereafter use the symbol ≲ to denote the magnitude order O, i.e., we will write \(a_{n} \lesssim b_{n}\) to mean \(a_{n} = O(b_{n})\) for two positive sequences \(a_{n}\) and \(b_{n}\). We will also use the abbreviation a.s. for "almost surely".
Lemma 1.
Under the assumptions of the model in Section 3,
Lemma 2.
Under the assumptions of the model in Section 3, there are 0 < A ≤ B < ∞ such that all the eigenvalues of \( \frac {1}{n} \boldsymbol {Z}^{\top } \boldsymbol {Z}\) fall in [A, B] a.s., as n → ∞, where \(\boldsymbol{Z}\) is an \(n \times 2^{J}\) matrix such that its i-th row corresponds to \(\phi_{Jk}(X_{i})\), \(k = 0, 1, \ldots, 2^{J}-1\).
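The eigenvalue condition in Lemma 2 can be checked numerically in a simple special case. The sketch below is only an illustration, not part of the proof: it uses the Haar scaling basis, for which \(\phi_{Jk}(x) = 2^{J/2}\,\mathbf{1}\{k \le 2^{J}x < k+1\}\), and a uniform design on [0, 1], so that \(\frac{1}{n}\boldsymbol{Z}^{\top}\boldsymbol{Z}\) converges to the identity and the eigenvalues concentrate around 1.

```python
import numpy as np

rng = np.random.default_rng(0)
J, n = 4, 20000
K = 2 ** J
X = rng.uniform(size=n)

# Haar scaling atoms: phi_{Jk}(x) = 2^{J/2} * 1[k <= 2^J x < k+1]
Z = np.zeros((n, K))
idx = np.minimum((2**J * X).astype(int), K - 1)
Z[np.arange(n), idx] = 2 ** (J / 2)

G = Z.T @ Z / n                 # empirical Gram matrix (1/n) Z^T Z
eig = np.linalg.eigvalsh(G)
print(eig.min(), eig.max())     # both concentrate near 1 for uniform design
```

Here the constants A and B of the lemma can be taken slightly below and above 1; for a non-uniform design density bounded away from zero and infinity, the eigenvalues would instead concentrate in an interval determined by those density bounds.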
Proposition 1.
Following the notation in the proof of Lemma 1, for any \(k, l = 0, 1, \ldots, 2^{J}-1\), \(\eta > 0\), \(s \geq 3\), \(2r < \gamma < 1\) and \(0 < \delta < \gamma - 2r\),
Proof of Theorem 1.
Observe that
where \(f_{J}(x) = {\sum }_{k = 0}^{2^{J}-1} c_{Jk} \phi _{Jk}(x)\) is the orthogonal projection of f on VJ. Since \(\| f_{J} - f \|^{2} = {{\rho }_{J}^{2}}\), it will be enough to verify that .
Note that the least squares estimator of the wavelet coefficients cJ can be written as
where \(\boldsymbol{Z}\) is an \(n \times 2^{J}\) matrix such that its i-th row corresponds to \(\phi_{Jk}(X_{i})\), \(k = 0, 1, \ldots, 2^{J}-1\), and \(\boldsymbol {Y}^{*} = ({Y}_{1}^{*}, \ldots , {Y}_{n}^{*})^{\top }\), with \({Y}_{i}^{*} = 2Y_{i} - 1\), as in model (3.1). Denote \(\bar {\boldsymbol {Y}}^{*} = ({\bar {Y}}_{1}^{*}, \ldots , {\bar {Y}}_{n}^{*})^{\top }\), where \({\bar {Y}}_{i}^{*} = f(X_{i})\), and define \(\bar {\boldsymbol {c}}_{J} = (\boldsymbol {Z}^{\top } \boldsymbol {Z})^{-1} \boldsymbol {Z}^{\top } \bar {\boldsymbol {Y}}^{*}\). Thus,
where 𝜖 = (𝜖1,…,𝜖n)⊤, with 𝜖i defined as in the model (3.1).
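For concreteness, the least squares estimator \(\hat{\boldsymbol{c}}_{J} = (\boldsymbol{Z}^{\top}\boldsymbol{Z})^{-1}\boldsymbol{Z}^{\top}\boldsymbol{Y}^{*}\) and the resulting plug-in classification rule can be sketched as follows. This is a minimal illustration with the Haar scaling basis and simulated data: the basis, the level J, the data-generating mechanism, and the sign-based plug-in rule are choices made here for the example, not the settings used in the paper's experiments.

```python
import numpy as np

rng = np.random.default_rng(1)
J = 3

def haar_design(X, J):
    # design matrix Z with Z[i, k] = phi_{Jk}(X_i) for the Haar basis on [0, 1]
    n, K = len(X), 2 ** J
    Z = np.zeros((n, K))
    idx = np.minimum((2**J * X).astype(int), K - 1)
    Z[np.arange(n), idx] = 2 ** (J / 2)
    return Z

# toy training data: P(Y = 1 | X = x) high on [0, 1/2), low on [1/2, 1)
n = 5000
X = rng.uniform(size=n)
Y = (rng.uniform(size=n) < np.where(X < 0.5, 0.9, 0.1)).astype(int)
Ystar = 2 * Y - 1                              # Y* = 2Y - 1, as in model (3.1)

Z = haar_design(X, J)
c_hat = np.linalg.solve(Z.T @ Z, Z.T @ Ystar)  # (Z^T Z)^{-1} Z^T Y*

def f_hat(x):
    # fitted regression function f_hat(x) = sum_k c_hat_k phi_{Jk}(x)
    return haar_design(np.atleast_1d(x), J) @ c_hat

# plug-in classifier: assign label 1 where the fitted regression is positive
g_hat = lambda x: (f_hat(x) > 0).astype(int)
print(g_hat(np.array([0.25, 0.75])))           # expected: [1 0]
```

The point of the sketch is that, once the design matrix is formed, the wavelet coefficients come from an ordinary least squares solve, and classification reduces to the sign of the fitted regression function.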
The errors are assumed to be iid and independent of the covariates. This implies that
Then, by Lemma 2,
Hence, by Parseval’s identity,
Again, applying Parseval's identity and Lemma 2,
Since \(\boldsymbol {Z} \bar {\boldsymbol {c}}_{J} = \boldsymbol {Z}(\boldsymbol {Z}^{\top } \boldsymbol {Z})^{-1} \boldsymbol {Z}^{\top } \bar {\boldsymbol {Y}}^{*} \) is an orthogonal projection, by (i) and (ii),
which implies that
The desired result follows from the fact that , by (A.1) and (A.2).
Proof of Corollary 1.
By Corollary 6.2 in Devroye et al. (1996) and by assumption (ii),
Thus, taking expectations on both sides, the result follows from Theorem 1.
Proof of Theorem 2.
By Corollary 6.2 in Devroye et al. (1996), we see that
In order to verify the convergence of the right-hand side of the inequality above, observe that the integral can be written as
Combining the last two terms above and applying assumption (ii),
Now observe that
where the class of functions \(\mathcal {T}\) is defined by
Since \(y^{*} = -1\) or \(y^{*} = 1\), we have \(|y^{*}| = 1\). Furthermore, since \(|\phi(x)| \leq W\) for some positive constant W, we have that
for some positive constant W1. Observe that we used the fact that there exists a positive constant C such that \({\sum }_{k = 0}^{2^{J} - 1} |c_{Jk}| \leq C 2^{J/2}\).
By Theorem 29.1 in Devroye et al. (1996) and (A.4),
where \({{Z}_{1}^{n}} = (X_{1}, Y_{1}), \ldots , (X_{n},Y_{n})\), and \(\mathcal {N} (\epsilon , \mathcal {T}({{z}_{1}^{n}}))\) is the ℓ1-covering number of \(\mathcal {T}({{z}_{1}^{n}})\), as in Definition 29.1 in Devroye et al. (1996).
For fixed \({{z}_{1}^{n}}\), one can estimate \(\mathcal {N} \left (\frac {\epsilon }{16}, \mathcal {T}({{z}_{1}^{n}})\right )\). For arbitrary functions \(f_{1}, f_{2} \in V_{J}\), denote \(h_{1}(x, y^{*}) = (f_{1}(x) - y^{*})^{2}\) and \(h_{2}(x, y^{*}) = (f_{2}(x) - y^{*})^{2}\). Then, for any probability measure ν on
for some positive constant C1. Thus, for any \({{z}_{1}^{n}} = (x_{1}, {y}_{1}^{*}), \ldots , (x_{n}, {y}_{n}^{*})\) and 𝜖,
where \(V_{J}({{x}_{1}^{n}}) = \{ (f(x_{1}), \ldots , f(x_{n})): f \in V_{J} \}\). Thus, it suffices to estimate the covering number corresponding to \(V_{J}\), which is a subspace of a linear space of functions. Then, following Definitions 12.1 and 12.3 and Theorem 13.9 in Devroye et al. (1996), \( V_{V_{J}^{+}} \leq 2^{J} \). Hence, by Corollary 29.2 in the same reference,
for some positive constant C3.
where C3 and C4 are positive constants, and the convergence follows from assumption (iii). The fact that \(\rho_{J} = o(1)\), together with (A.3) and (A.7), ensures that \(g_{J}\) is universally consistent, as already stated in Corollary 1.
In order to verify that \(\hat {g}_{J}\) is strongly universally consistent, it suffices, by the Borel–Cantelli lemma and Eq. A.7, to verify the convergence of the series
Based on assumption (iii), there are positive constants B, C and D such that
It is easy to see that there exists a natural number \(n_{0}\) such that, for all \(n \geq n_{0}\),
Furthermore, for every \(n \geq 2^{1/r}\),
Then, there exists a natural number m such that, for all n > m,
which ensures the desired result.
Proof of Theorem 3.
By Parseval's identity,
Since \(0 \leq \lambda_{jk} \leq 1\), we have \(0 \leq \lambda _{jk}^{2} \leq 1\) and \(0 \leq (1 - \lambda_{jk})^{2} \leq 1\). Thus, the second term on the right-hand side of the inequality above can be bounded by
because \({\sum }_{j = J_{0}}^{J - 1} {\sum }_{k} d_{jk}^{2} = {\rho }_{J_{0}}^{2} - {{\rho }_{J}^{2}} \).
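The identity invoked here is just Parseval's identity applied at the two resolution levels; a one-line sketch, using the notation that \(\rho_{J}^{2} = \|f - f_{J}\|^{2}\) is the level-J projection error:

```latex
\rho_{J_0}^{2} = \|f - f_{J_0}\|^{2} = \sum_{j \geq J_0} \sum_{k} d_{jk}^{2},
\qquad
\rho_{J}^{2} = \|f - f_{J}\|^{2} = \sum_{j \geq J} \sum_{k} d_{jk}^{2},
```

and subtracting the two expansions leaves exactly the detail energy between the levels, \(\sum_{j = J_0}^{J-1} \sum_{k} d_{jk}^{2} = \rho_{J_0}^{2} - \rho_{J}^{2}\).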
Furthermore, we have observed in the proof of Theorem 1 that
Hence,
which yields the desired result, because \(\rho _{J_{0}} \asymp \rho _{J}\) (due to the fact that J0 and J have the same order of convergence).
The results of universal consistency and strong universal consistency can be derived analogously to the proofs of Corollary 1 and Theorem 2, respectively.
Proof of Lemma 1.
Denote \({E}_{n}(Z_{\cdot}) = \frac{1}{n} {\sum }_{i = 1}^{n} Z_{i}\), where Z1, Z2,… is a stationary sequence. For any f ∈ VJ there is a vector \( \boldsymbol {c} = (c_{J0}, \ldots , c_{J,2^{J}-1})^{\top } \), \(|\boldsymbol{c}_{J}|_{2} < \infty\), such that \( f(x) = {\sum }_{k = 0}^{2^{J}-1} c_{Jk} \phi _{Jk}(x) \). Fix η > 0. If \(|(E_{n} - E)\phi_{Jk}(X_{\cdot})\phi_{Jl}(X_{\cdot})| \leq \eta\), then
where \(I_{k,l}\) is one if the supports of \(\phi_{Jk}\) and \(\phi_{Jl}\) overlap, and zero otherwise.
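The bound \(\sum_{l} I_{k,l} \leq C\) used just below reflects the compact support of the atoms: at a fixed resolution level, each \(\phi_{Jk}\) overlaps only a bounded number of translates, uniformly in J. A small numerical check of this fact (the support length S = 7 below is illustrative, standing in for a generic compactly supported father wavelet):

```python
# Hypothetical check of the overlap bound sum_l I_{k,l} <= C:
# for a scaling function supported on [0, S], the atom phi_{Jk} is
# supported on [k/2^J, (k+S)/2^J], so two atoms at the same level
# overlap only when |k - l| < S, independently of J.
S = 7          # illustrative support length of the father wavelet
J = 6
K = 2 ** J     # number of atoms at level J

def overlaps(k, l):
    # supports [k/2^J, (k+S)/2^J] and [l/2^J, (l+S)/2^J] intersect
    # on a set of positive measure iff |k - l| < S
    a0, a1 = k / 2**J, (k + S) / 2**J
    b0, b1 = l / 2**J, (l + S) / 2**J
    return max(a0, b0) < min(a1, b1)

max_overlaps = max(sum(overlaps(k, l) for l in range(K)) for k in range(K))
print(max_overlaps)   # 2*S - 1, a constant independent of J
```

So the constant C of the proof can be taken as 2S − 1, depending only on the wavelet family, not on the resolution level.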
It is easy to see that \({\sum }_{k} I_{k,l} \leq C\) and \({\sum }_{l} I_{k,l} \leq C\), for some positive constant C. Thus, applying the Cauchy–Schwarz inequality in the first and second inequalities below, and by Parseval's identity,
This implies
Then, since \({\sum }_{k} {\sum }_{l} I_{k,l} \lesssim 2^{J} \asymp n^{r} \),
By Proposition 1 and using \(\delta = \frac {2r-\gamma }{2}\), we have that, for every 2r < γ < 1 and s ≥ 3,
Observe that, whenever 2r < γ < 1 and \(r + 1 + \frac {sr}{2s + 1} - \frac {2s \alpha (1 - \gamma )}{2s + 1} < -1\),
Hence, since η > 0 is arbitrary, we have by the Borel–Cantelli lemma that
The fact that
ensures that it is always possible to find 0 < γ < 1 and s ≥ 3 satisfying
Thus the desired result follows from (A.10), because assumption (ii) ensures .
Proof of Lemma 2.
Let \(\boldsymbol {c}_{J} = (c_{J0}, \ldots , c_{J,2^{J}-1})^{\top }\) be a vector such that \(|\boldsymbol{c}_{J}|_{2} < \infty\), and denote \(f(x; \boldsymbol{c}_{J}) = {\sum }_{k = 0}^{2^{J}-1} c_{Jk} \phi _{Jk}(x)\). Thus, by Lemma 1,
Hence, by assumption (ii) and the Parseval’s identity,
which ensures the desired result.
Proof of Proposition 1.
From Theorem 1.4 of Bosq (1998),
where c is a positive constant, q ∈ [1,n/2],
Observe that, since any compactly supported orthonormal wavelet atom is of order \(\phi _{Jk}(x) \lesssim 2^{J/2}\), by assumption (iii),
which implies that
Let \(q = n^{\gamma }\), \(\gamma \in (0, 1)\). Thus, it is easy to see that
and
By assumption (iv),
Montoril, M.H., Chang, W. & Vidakovic, B. Wavelet-Based Estimation of Generalized Discriminant Functions. Sankhya B 81, 318–349 (2019). https://doi.org/10.1007/s13571-018-0158-1