
On semi-supervised learning


Abstract

Major efforts have been made, mostly in the machine learning literature, to construct good predictors that combine unlabelled and labelled data. These methods are known as semi-supervised. They address the problem of how to take advantage, when possible, of a huge amount of unlabelled data to perform classification when only few labelled data are available. This is not always feasible: it depends on whether the labels can be inferred from the distribution of the unlabelled data. Nevertheless, several algorithms have been proposed recently. In this work, we present a new method that, under almost necessary conditions, asymptotically attains the performance of the best theoretical rule as the size of the unlabelled sample goes to infinity, even if the size of the labelled sample remains fixed. Its performance and computational time are assessed through simulations and on the well-known "Isolet" real data set of phonemes, where a strong dependence on the choice of the initial training sample is shown. The main focus of this work is to elucidate when and why semi-supervised learning works in the asymptotic regime described above. The set of necessary assumptions, although reasonable, shows that semi-supervised methods only attain consistency for very well-conditioned problems.


References

  • Aaron C, Cholaquidis A, Cuevas A (2017) Stochastic detection of some topological and geometric features. Electron J Stat 11(2):4596–4628. https://doi.org/10.1214/17-EJS1370

  • Abdous B, Theodorescu R (1989) On the strong uniform consistency of a new kernel density estimator. Metrika 11:177–194

  • Agrawala AK (1970) Learning with a probabilistic teacher. IEEE Trans Autom Control 19:716–723

  • Arnold A, Nallapati R, Cohen W (2007) A comparative study of methods for transductive transfer learning. In: Seventh IEEE international conference on data mining workshops (ICDMW)

  • Asuncion A, Newman DJ (2007) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html

  • Azizyan M, Singh A, Wasserman L (2013) Density-sensitive semisupervised inference. Ann Stat 41(2):751–771

  • Belkin M, Niyogi P (2004) Semi-supervised learning on Riemannian manifolds. Mach Learn 56:209–239

  • Ben-David S, Lu T, Pal D (2008) Does unlabelled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In: 21st annual conference on learning theory (COLT). Available at http://www.informatik.uni-trier.de/~ley/db/conf/colt/colt2008.html

  • Chapelle O, Schölkopf B, Zien A (eds) (2006) Semi-supervised learning. MIT Press, Cambridge

  • Chapelle O, Zien A (2005) Semi-supervised classification by low density separation. In: AISTATS, pp 57–64

  • Cholaquidis A, Cuevas A, Fraiman R (2014) On Poincaré cone property. Ann Stat 42:255–284

  • Castelli V, Cover TM (1995) On the exponential value of labeled samples. Pattern Recognit Lett 16(1):105–111

  • Castelli V, Cover TM (1996) The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans Inf Theory 42(6):2102–2117

  • Cuevas A, Fraiman R (1997) A plug-in approach to support estimation. Ann Stat 25:2300–2312

  • Cuevas A, Rodríguez-Casal A (2004) On boundary estimation. Adv Appl Probab 36:340–354

  • Cuevas A, Fraiman R, Pateiro-López B (2012) On statistical properties of sets fulfilling rolling-type conditions. Adv Appl Probab 44:311–329

  • Erdös P (1945) Some remarks on the measurability of certain sets. Bull Am Math Soc 51:728–731

  • Fanty M, Cole R (1991) Spoken letter recognition. In: Lippman RP, Moody J, Touretzky DS (eds) Advances in neural information processing systems, 3. Morgan Kaufmann, San Mateo

  • Federer H (1959) Curvature measures. Trans Am Math Soc 93:418–491

  • Fralick SC (1967) Learning to recognize patterns without a teacher. IEEE Trans Inf Theory 13:57–64

  • Haffari G, Sarkar A (2007) Analysis of semi-supervised learning with the Yarowsky algorithm. In: Proceedings of the 23rd conference on uncertainty in artificial intelligence, UAI 2007, July 19–22, 2007. Vancouver, BC

  • Joachims T (1999) Transductive inference for text classification using support vector machines. In: ICML 16

  • Joachims T (2003) Transductive learning via spectral graph partitioning. In: ICML

  • Lafferty J, Wasserman L (2008) Statistical analysis of semi-supervised regression. In: Conference in advances in neural information processing systems, pp 801–808

  • Nadler B, Srebro N, Zhou X (2009) Statistical analysis of semi-supervised learning: the limit of infinite unlabelled data. In: Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A (eds) Advances in neural information processing systems, vol 22. Curran Associates, Inc., pp 1330–1338. http://papers.nips.cc/paper/3652-statistical-analysis-of-semi-supervised-learning-the-limit-of-infinite-unlabelled-data.pdf

  • Niyogi P (2008) Manifold regularization and semi-supervised learning: some theoretical analyses. Technical Report TR-2008-01, Computer Science Dept., Univ. of Chicago. Available at http://people.cs.uchicago.edu/~niyogi/papersps/ssminimax2.pdf

  • Rigollet P (2007) Generalized error bound in semi-supervised classification under the cluster assumption. J Mach Learn Res 8:1369–1392 MR2332435

  • Scudder HJ (1965) Probability of error of some adaptive pattern-recognition machines. IEEE Trans Inf Theory 11:363–371

  • Singh A, Nowak RD, Zhu X (2008) Unlabeled data: now it helps, now it doesn’t. Technical report, ECE Dept., Univ. Wisconsin-Madison. Available at www.cs.cmu.edu/~aarti/pubs/SSL_TR.pdf

  • Sinha K, Belkin M (2009) Semi-supervised learning using sparse eigenfunction bases. In: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A (eds) Advances in neural information processing systems, vol 22. MIT Press, Cambridge, pp 1687–1695

  • Vapnik V (1998) Statistical learning theory. Wiley, Hoboken

  • Wang J, Shen X, Pan W (2007) On transductive support vector machines. Contemp Math 443:7–20

  • Zhu X (2008) Semi-supervised learning literature survey. http://pages.cs.wisc.edu/~jerryzhu/research/ssl/semireview.html

Acknowledgements

We thank two referees and an associate editor for their constructive comments and insightful suggestions, which have improved the presentation of this manuscript. We also thank Damián Scherlis for helpful suggestions.

Author information

Correspondence to A. Cholaquidis.


Appendices

Appendix A

Proof of Proposition 1

Observe that \(\mathbb {P}\big (g_i(\mathcal {X}_l)\ne Y_i \mid \mathcal {X}_l{\setminus } X_i\big )\ge \mathbb {P}(g^*(X_i)\ne Y_i) \), for \(i=1,\ldots , l\). Thus,

$$\begin{aligned} \mathbb {E}\Big (\mathbb {I}_{g_i(\mathcal {X}_l)\ne Y_i}\Big )=\mathbb {P}(g_i(\mathcal {X}_l)\ne Y_i) =\mathbb {E}\Big (\mathbb {P}\big (g_i(\mathcal {X}_l)\ne Y_i|\mathcal {X}_l{\setminus } X_i\big )\Big )\ge \mathbb {P}(g^*(X_i)\ne Y_i), \end{aligned}$$

and therefore, \(L({\mathbf {g}_l})=\mathbb {E}\Big (\frac{1}{l} \sum _{i=1}^l \mathbb {I}_{g_i(\mathcal {X}_l)\ne Y_i}\Big )\ge \mathbb {P}(g^*(X_i)\ne Y_i),\) showing that \(L({\mathbf {g}_l})\ge \mathbb {P}(g^*(X)\ne Y)\), for any \(\mathbf {g}_l=(g_1,\ldots , g_l)\). The lower bound is attained by choosing the ith coordinate of \( \mathbf {g}_l\) equal to \(g^*(X_i)\). Moreover, the accuracy of \(\mathbf {g}^*_l\) equals that of a single coordinate; namely, \(L(\mathbf {g}^*_l)= \mathbb {P}(g^*(X)\ne Y)=L^*\). \(\square \)

Proof of Proposition 2

We will prove that if H1, H2 i) and H4 are satisfied, then \(\mathcal {I}_{0,l}\cap \mathcal {I}_{1,l}\subset \mathcal {F}_l\). Combining this inclusion with H3, we conclude that \(\mathbb {P}(\mathcal {F})=1\). To prove that \(\mathcal {I}_{0,l}\cap \mathcal {I}_{1,l}\subset \mathcal {F}_l\), we will see that if

$$\begin{aligned} I_a\subseteq \bigcup _{X\in \mathcal {X}_{l}\cap I_a } B(X,h_l/2)\;,\quad a=0,1, \end{aligned}$$
(11)

then all the elements of \(\mathcal {X}_l\) are labelled by the algorithm. To do so, note that, by H4, there exists \(X_a^{*}\) in \(\mathcal {X}^n\) such that \(X_a^{*}\in I_a\), for \(a=0,1\). We first prove that the algorithm starts. Since \(X^*_1\) is in \(I_1\) and (11) holds with \(a=1\), there exists \(X_j^1\in \mathcal {X}_l\cap I_1\) with \(d(X_1^{*}, X_j^1)< h_l\). In particular, \(d(\mathcal {X}^n, X_j^1)<h_l\) and so \( X_j^1\in \mathcal {U}_0(h_l)\). This guarantees that \(\mathcal {U}_0(h_l) \not =\emptyset \), and hence, the algorithm can start.

Assume now that we have classified \(j< l\) points of \(\mathcal {X}_l\). We will prove that there exists at least one point satisfying the iteration condition required at step \(j+1\): \(\mathcal {U}_{j}(h_l)\not =\emptyset \). By H1, we can assume that \(\mathcal {U}_j=\mathcal {U}_j\cap (I_0\cup I_1)\). Take \(a\) such that \(\mathcal {U}_j\cap I_a\not =\emptyset \). We now consider two possible cases: (i) if \(\mathcal {X}_l\cap I_a\cap \mathcal {U}_j^c=\emptyset \), then \(\mathcal {X}_l\cap I_a=\mathcal {X}_l\cap I_a\cap \mathcal {U}_j \) and so, by (11), \(X^*_a\in B(X,h_l/2)\) for some \(X\in \mathcal {X}_l\cap \mathcal {U}_j\). Since \(X^*_a\) is in \(\mathcal {Z}_j\) and \(X\in \mathcal {U}_j\), we conclude that \(X\in \mathcal {U}_j(h_l)\). Assume now that (ii) \(\mathcal {X}_l\cap I_a\cap \mathcal {U}_j^c\not =\emptyset \). Since \(I_a\) is connected and (11) holds, the union of \( B(X, h_l/2)\), with \(X \in \mathcal {X}_l\cap I_a\), is also a connected set and, therefore,

$$\begin{aligned} \Bigg (\bigcup _{X\in \mathcal {X}_{l}\cap I_a \cap \mathcal {U}_{j}^c} B(X,h_l/2)\Bigg )\ \ \bigcap \ \Bigg (\bigcup _{X\in \mathcal {X}_{l}\cap I_a \cap \mathcal {U}_j} B(X,h_l/2)\Bigg )\ne \emptyset . \end{aligned}$$

Finally, take \(X\in \mathcal {X}_{l}\cap I_a \cap \mathcal {U}_{j}^c\) and \(\tilde{X}\in \mathcal {X}_{l}\cap I_a \cap \mathcal {U}_j\) such that \(B(X, h_l/2)\cap B(\tilde{X},h_l/2)\not =\emptyset \) to conclude that \(\tilde{X}\in \mathcal {U}_j(h_l)\). \(\square \)
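
To fix ideas, the following is a minimal Python sketch of an iterative self-labelling scheme of the kind analysed in this proof: at each step, among the unlabelled points lying within \(h_l\) of an already-labelled point, the point whose ball \(B(\cdot ,h_l)\) contains the most sample points is labelled by a majority vote over the labelled points in that ball. The function name, data structures and tie-breaking rule are our own illustrative choices; this is not the authors' exact implementation of rule (3).

```python
import numpy as np

def self_label(X_labelled, y_labelled, X_unlabelled, h):
    """Illustrative sketch: iteratively label the unlabelled point that
    (i) lies within distance h of an already-labelled point and
    (ii) has the largest number of sample points in its ball B(x, h);
    the label is the majority label among labelled points in that ball."""
    Z = list(map(tuple, X_labelled))                  # labelled pool (grows)
    labels = {tuple(x): int(y) for x, y in zip(X_labelled, y_labelled)}
    U = [tuple(x) for x in X_unlabelled]              # still unlabelled
    all_pts = np.vstack([X_labelled, X_unlabelled])   # the whole sample

    while U:
        Zarr = np.array(Z)
        # candidates: unlabelled points within h of some labelled point
        cand = [u for u in U
                if np.min(np.linalg.norm(Zarr - np.array(u), axis=1)) < h]
        if not cand:                                  # the algorithm gets stuck
            break
        # pick the candidate whose ball B(u, h) contains the most sample points
        counts = [np.sum(np.linalg.norm(all_pts - np.array(u), axis=1) < h)
                  for u in cand]
        u_star = cand[int(np.argmax(counts))]
        # majority vote among the labelled points falling in B(u_star, h)
        near = [z for z in Z
                if np.linalg.norm(np.array(z) - np.array(u_star)) < h]
        vote = np.mean([labels[z] for z in near])
        labels[u_star] = int(vote >= 0.5)             # ties broken towards 1
        Z.append(u_star)
        U.remove(u_star)
    return labels
```

In the notation of the proofs, the labelled pool plays the role of \(\mathcal {Z}_j\) and the candidate set that of \(\mathcal {U}_j(h_l)\); under the assumptions of Proposition 2, with probability one the loop does not stop before every point of \(\mathcal {X}_l\) has been labelled.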

Proof of Lemma 1

By H1, we can assume that \(\eta (X)\ne 1/2\) for all \(X\in \mathcal {X}^n\cup \mathcal {X}_l\). Assume first that \(\eta (X_{j_{\mathrm{bad}}})>1/2\), that is, \(X_{j_{\mathrm{bad}}}\in I_1\), \(\tilde{Y}_{j_{\mathrm{bad}}}=0\), and all the points labelled up to step \(j_{\mathrm{bad}}-1\) by the algorithm are well classified. Suppose now, by contradiction, that \(X_{j_{\mathrm{bad}}}\not \in B_1^{h_l}\), which means that \(X_{j_{\mathrm{bad}}}\not \in B(I_0, h_l)\) and thus \(B(X_{j_{\mathrm{bad}}}, h_l)\cap I_0=\emptyset \). This implies that \(g^*(X)=1 \) for all \(X\in (\mathcal {X}^n\cup \{X_{i_1},\ldots ,X_{j_{\mathrm{bad}} -1} \})\cap B(X_{j_\mathrm{bad}},h_l)\), contradicting the label assigned to \(X_{j_\mathrm{bad}}\) according to the majority rule used by the algorithm. Thus, \(B(X_{j_\mathrm{bad}}, h_l)\cap I_0\not =\emptyset \), and so \(X_{j_\mathrm{bad}}\in B_1^{h_l}\). Analogously, if \(\eta (X_{j_\mathrm{bad}})<1/2\), we deduce that \(X_{j_\mathrm{bad}}\in B_0^{h_l}\). \(\square \)

Proof of Lemma 2

Given \(\delta <\delta _1\), choose \(\varepsilon \) such that \(\gamma (\delta )-2\varepsilon >0\), for \(\gamma (\delta )\) introduced in H6. We will prove that \(\mathcal {S}_l^\varepsilon = \{\sup _{u \in S}|\hat{f}_l(u)-f(u)|<\varepsilon \}\) is included in \( \mathcal {V}_l^\delta \) provided that \(h_l<\delta \); therefore, from (7), we conclude that \(\mathbb {P}(\mathcal {V}^\delta )=1\).

Now, note that on \(\mathcal {S}_l^\varepsilon \), we get that \(f(u)-\varepsilon< \hat{f}_l(u) < f(u)+\varepsilon ,\) and so, on \(\mathcal {S}_l^\varepsilon \), for \(a\in A_0^{\delta }\cup A_1^{\delta }\) and \(b \in B_1^{h_l}\cup B^{h_l}_0\), \(\hat{f}_l(b)< f(b)+\varepsilon<f(a)-\gamma +\varepsilon < \hat{f}_l(a)+2\varepsilon - \gamma .\) Thus, on \(\mathcal {S}_l^\varepsilon \),

$$\begin{aligned} \sup _{b\in B_0^{h_l}\cup B_1^{h_l}}\hat{f}_l(b) \;\le \; \inf _{a\in A_0^\delta \cup A_1^\delta } \hat{f}_l(a)+2\varepsilon -\gamma < \inf _{a\in A_0^\delta \cup A_1^\delta } \hat{f}_l(a), \end{aligned}$$

since \(2\varepsilon -\gamma <0\). This proves that \(\mathcal {S}_l^\varepsilon \subseteq \mathcal {V}_l^\delta \), for l such that \(8h_l<\delta \). \(\square \)
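
The event \(\mathcal {S}_l^\varepsilon \) involves the kernel density estimator \(\hat{f}_l\). As a point of reference, here is a minimal, purely illustrative sketch of a uniform-kernel density estimator of the type referred to in H7, together with one bandwidth choice satisfying \(lh_l^{2d}/\log (l)\rightarrow \infty \); the function name and the particular bandwidth are our own assumptions, not the paper's exact choices.

```python
import numpy as np
from math import pi, gamma, log

def uniform_kernel_density(x, sample, h):
    """Uniform-kernel density estimate at x:
    (# sample points in B(x, h)) / (l * volume of B(0, h))."""
    sample = np.asarray(sample, dtype=float)
    l, d = sample.shape
    ball_volume = pi ** (d / 2) / gamma(d / 2 + 1) * h ** d   # omega_d * h^d
    count = np.sum(np.linalg.norm(sample - np.asarray(x, dtype=float), axis=1) <= h)
    return count / (l * ball_volume)

# an illustrative bandwidth with l * h^(2d) / log(l) -> infinity (here d = 2)
l, d = 5000, 2
h_l = (log(l) / l) ** (1 / (2 * d + 1))
```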

Proof of Lemma 3

When \(j_\mathrm{bad}=\infty \), \(\mathcal {Z}_{j_\mathrm{bad}-1}=\mathcal {X}^n\cup \mathcal {X}_l\). This fact implies that, on the event \(\mathcal {F}_l\cap \mathcal {B}_l^c\), the following identity holds: \(\mathcal {X}_l\cap (\mathcal {Z}_{j_\mathrm{bad}-1})^c=\emptyset \). Thus, to prove (9), we need to show that, for \(a=0,1\), \(\mathcal {F}_l \cap \mathcal {A}_{a,l}^\delta \cap \mathcal {V}^\delta _l\cap \mathcal {B}_l \;\subset \; \left\{ \mathcal {X}_l\cap A_a^\delta \cap (\mathcal {Z}_{j_\mathrm{bad}-1})^c=\emptyset \right\} .\) We will argue by contradiction, assuming that there exists \(\omega \in \mathcal {F}_l \cap \mathcal {A}_{a,l}^\delta \cap \mathcal {V}^\delta _l\cap \mathcal {B}_l \) for which \(\emptyset \not = \mathcal {X}_l\cap A_a^\delta \cap (\mathcal {Z}_{j_\mathrm{bad}-1})^c=\{W_1, \ldots , W_m\}\). Invoking H8, \(\mathcal {X}^n\subseteq \mathcal {Z}_{j_\mathrm{bad}-1}\) and there exists \(X_a^*\in A_a^\delta \cap \mathcal {X}^n\). These facts guarantee that \(X_a^*\in A_a^\delta \cap \mathcal {Z}_{j_\mathrm{bad}-1}\), and since we are working on \(\mathcal {A}_{a,l}^\delta \), we get that

$$\begin{aligned} X_a^*\in A_a^\delta \subseteq \bigcup _{X\in \mathcal {X}_{l}\cap A_a^\delta } B(X,h_l/2)\quad \hbox {and}\quad X_a^*\in \mathcal {Z}_{j_\mathrm{bad}-1}. \end{aligned}$$
(12)

Next, we will argue that there exists \(W^*\in \{W_1, \ldots , W_m\}\) such that \(d(W^*,\mathcal {Z}_{j_\mathrm{bad}-1})<h_l\). To do so, consider the following two cases:

  1. (i)

    \(\mathcal {X}_l \cap A_a^\delta \cap \mathcal {Z}_{j_\mathrm{bad}-1} =\emptyset \). In such a case, from (12) we get that \(A_a^\delta \) can be covered by balls centred at \(\{W_1, \ldots , W_m\}\) and, since \(X_a^*\in A_a^\delta \), \(X_a^*\in B(W^*, h_l/2)\) for some \(W^*\in \{W_1, \ldots , W_m\}\). Therefore, \(d(X_a^*, W^*)<h_l\). Recalling that, as stated in (12), \(X_a^*\in \mathcal {Z}_{j_\mathrm{bad}-1}\), we conclude that \(d(W^*,\mathcal {Z}_{j_\mathrm{bad}-1})<h_l\).

  2. (ii)

    Assume now that \(\mathcal {X}_l \cap A_a^\delta \cap \mathcal {Z}_{j_\mathrm{bad}-1} \not =\emptyset \). Since \(A_a^\delta \) is connected, the union of balls given in (12) is connected, and then,

    $$\begin{aligned} \Bigg \{\bigcup _{X\in \mathcal {X}_{l}\cap A_a^\delta \cap \mathcal {Z}_{j_\mathrm{bad}-1} } B(X,h_l/2)\Bigg \} \quad \bigcap \quad \Bigg \{\bigcup _{1\le i\le m} B(W_i,h_l/2)\Bigg \}\quad \not =\quad \emptyset . \end{aligned}$$

    Thus, there exist \(X\in \mathcal {Z}_{j_\mathrm{bad}-1}\) and \(W^*\in \{W_1, \ldots , W_m\}\) with \(d(X, W^*)<h_l\), which implies that \(d(W^*,\mathcal {Z}_{j_\mathrm{bad}-1})<h_l\).

To finish the proof, we will show that such a \(W^*\) should have been chosen by the algorithm to be labelled before \(X_{j_\mathrm{bad}}\), which implies that \(W^*\in \mathcal {Z}_{j_\mathrm{bad}-1}\), contradicting that \(W^*\in (\mathcal {Z}_{j_\mathrm{bad}-1})^c\). This contradiction shows that no such \(W^*\) exists, as announced. Since \(d(W^*,\mathcal {Z}_{j_\mathrm{bad}-1})<h_l\), we get that \(W^*\in \mathcal {U}_{j_\mathrm{bad}-1}(h_l)\), the set of candidates to be labelled by the algorithm at step \(j_\mathrm{bad}\). Indeed, since \(W^*\in A_a^\delta \) and \(h_l<\delta \), \(B(W^*, h_l) \subseteq I_a\). Thus, \(\hat{\eta }_{j_\mathrm{bad}-1}(W^*)=a\), implying that \(W^*\) attains the maximum stated in (3). Invoking now Lemma 2, since \(W^*\in A_a^\delta \) while \(X_{j_\mathrm{bad}}\) is in \(B_0^{h_l}\cup B_1^{h_l}\) (see Lemma 1), we know that \(\#\{\mathcal {X}_l\cap B(W^*,h_l)\}\ge \#\{\mathcal {X}_l\cap B(X_{j_\mathrm{bad}},h_l)\}\); thus, \(W^*\) should have been chosen before \(X_{j_\mathrm{bad}}\). This concludes the proof of the result (Fig. 9). \(\square \)

Fig. 9

In black, \(X_{j_\mathrm{bad}}\); in red, \(X_{j_k}\); in blue, the points of \(\mathcal {X}_{l_k}\) belonging to \(B(X_{j_k},h_{l_k})\) and \(B(X_{j_\mathrm{bad}},h_{l_k})\) (color figure online)

Proof of Theorem 1

Recall that \(g_{{n,l,r(i)}}(\mathcal {X}_l)\) denotes the label assigned by the algorithm to the observation \(X_i\in \mathcal {X}_l\). The empirical mean accuracy of classification satisfies

$$\begin{aligned} \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}\ge&\; \frac{1}{l}\sum _{i=1}^{l} \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_{l})= Y_i}\;\mathbb {I}_{g^*(X_i)=Y_i}\;\mathbb {I}_{A_0^\delta \cup A_1^\delta }(X_i)\\ =&\; \frac{1}{l}\sum _{i=1}^{l} \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_{l})= g^*(X_i)} \;\mathbb {I}_{g^*(X_i)=Y_i}\;\mathbb {I}_{A_0^\delta \cup A_1^\delta }(X_i). \end{aligned}$$

Consider \(\mathcal {T}_l^\delta =\mathcal {F}_l\cap \mathcal {A}_{0,l}^\delta \cap \mathcal {A}_{1,l}^\delta \cap \mathcal {V}_l^\delta \), and \(\mathcal {T}^\delta =\bigcup _{l_0}\bigcap _{l\ge l_0}\mathcal {T}_l^\delta \). Combining the results obtained in Proposition 2 and Lemma 2 with condition H5, we conclude that \(\mathbb {P}(\mathcal {T}^\delta )=1\), for \(\delta <\min \{\delta _0, \delta _1,\delta _2\}\). By (10), on \(\mathcal {T}_l^\delta \), we have that \(\mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_{l})= g^*(X_i)}\ge \mathbb {I}_{A_0^\delta \cup A_1^\delta }(X_i)\quad \text { for all }i=1,\ldots ,l,\) and therefore,

$$\begin{aligned} \frac{1}{l}\sum _{i=1}^{l} \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_{l})= g^*(X_i)} \;\mathbb {I}_{g^*(X_i)=Y_i}\;\mathbb {I}_{A_0^\delta \cup A_1^\delta }(X_i)\ge \frac{1}{l}\sum _{i=1}^{l} \mathbb {I}_{g^*(X_i)=Y_i} \;\mathbb {I}_{A_0^\delta \cup A_1^\delta }(X_i). \end{aligned}$$

Then, on \(\mathcal {T}^\delta \), we have that \(\liminf _{l\rightarrow \infty } \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}\ge \mathbb {P}\{g^*(X)=Y, X\in A_0^\delta \cup A_1^\delta \}\) and so

$$\begin{aligned} \liminf _{l\rightarrow \infty } \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i} \ge \mathbb {P}\{g^*(X)=Y\} \quad a.s. \end{aligned}$$
(13)

On the other hand,

$$\begin{aligned}&\frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}= \frac{1}{l}\sum _{i=1}^l \;\mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i} \mathbb {I}_{g^*(X_i)=Y_i}\\&\quad + \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}\;\mathbb {I}_{g^*(X_i)\not =Y_i} \le \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g^*(X_i)=Y_i} + \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)\not = g^*(X_i)}. \end{aligned}$$

From Lemma 3, on \(\mathcal {T}_l^\delta \), \(\mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)\not = g^*(X_i)}\le \mathbb {I}_{(A_0^\delta \cup A_1^\delta )^c}(X_{i})\), and therefore, on \(\mathcal {T}^\delta \),

$$\begin{aligned} \limsup _{l\rightarrow \infty } \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}\le \mathbb {P}(g^*(X)=Y)+\mathbb {P}(X\not \in \{A_0^\delta \cup A_1^\delta \}) . \end{aligned}$$

By H2 (ii), the last term in the previous display converges to zero when \(\delta \rightarrow 0\), and thus,

$$\begin{aligned} \limsup _{l\rightarrow \infty } \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}\le \mathbb {P}(g^*(X)=Y)\quad a.s. \end{aligned}$$
(14)

Combining (13) and (14), we deduce the announced convergence. The consistency defined in (1) follows from the dominated convergence theorem. \(\square \)

Appendix B

In this section, we will prove that under H2, conditions H3 and H6 hold if we impose some geometric restrictions on \(I_0\) and \(I_1\). In order to make this Appendix self-contained, we recall some geometric definitions and include some results that will be invoked.

First, we introduce the concept of Hausdorff distance. Given two non-empty compact sets \(A,C\subset {\mathbb {R}}^d\), the Hausdorff distance (or Hausdorff–Pompeiu distance) between A and C is defined by \(d_H(A,C)=\inf \{\varepsilon > 0: A\subset B(C,\varepsilon )\ \text{ and }\ C\subset B(A,\varepsilon )\}.\)
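
For finite point sets, the Hausdorff distance reduces to a maximum of nearest-neighbour distances. The following short Python function (our own illustration, assuming only NumPy) computes it for two finite clouds and can also be used to approximate \(d_H(\mathcal {Z}_l,Q)\) below by discretising Q.

```python
import numpy as np

def hausdorff_distance(A, C):
    """Hausdorff distance between two finite point sets A, C in R^d:
    the larger of the directed distances max_a d(a, C) and max_c d(c, A)."""
    A, C = np.asarray(A, dtype=float), np.asarray(C, dtype=float)
    D = np.linalg.norm(A[:, None, :] - C[None, :, :], axis=-1)  # pairwise distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

# example: two uniform samples on the unit square
rng = np.random.default_rng(0)
print(hausdorff_distance(rng.random((200, 2)), rng.random((500, 2))))
```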

Next, we define standard sets, according to Cuevas and Fraiman (1997) (see also Cuevas and Rodríguez-Casal 2004).

Definition 1

A bounded set \(S\subset \mathbb {R}^d\) is said to be standard with respect to a Borel measure \(\mu \) if there exist \(\lambda >0\) and \(\beta >0\) such that \(\mu \big (B(x,\varepsilon )\cap S\big )\ge \beta \mu _L(B(x,\varepsilon ))\text { for all }x\in S,\ 0<\varepsilon \le \lambda ,\) where \(\mu _L\) denotes the Lebesgue measure on \(\mathbb {R}^d\).

Roughly speaking, standardness prevents the set from having peaks that are too sharp.
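
For concreteness, here is a simple worked example (our own illustration, not taken from the main text). The unit cube \(S=[0,1]^d\), with \(\mu \) the restriction of the Lebesgue measure to S, is standard: for every \(x\in S\) and \(0<\varepsilon \le 1/2\), each interval \([x_i-\varepsilon ,x_i+\varepsilon ]\) contains a half-interval of length \(\varepsilon \) included in [0, 1], so \(B(x,\varepsilon )\cap S\) contains the intersection of \(B(x,\varepsilon )\) with an orthant at x, and therefore

$$\begin{aligned} \mu \big (B(x,\varepsilon )\cap S\big )\ge 2^{-d}\mu _L\big (B(x,\varepsilon )\big ), \end{aligned}$$

that is, Definition 1 holds with \(\beta =2^{-d}\) and \(\lambda =1/2\). By contrast, a set with an outward cusp, such as \(\{(x_1,x_2):0\le x_1\le 1,\ 0\le x_2\le x_1^2\}\), is not standard with respect to the restricted Lebesgue measure: near the origin, the mass of \(B(0,\varepsilon )\) inside the set is of order \(\varepsilon ^3\), whereas \(\mu _L(B(0,\varepsilon ))\) is of order \(\varepsilon ^2\).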

The following theorem is proved in Cuevas and Rodríguez-Casal (2004).

Theorem 2

(Cuevas and Rodríguez-Casal 2004) Let \(Z_1,Z_2,\dots \) be a sequence of iid observations in \(\mathbb {R}^d\) drawn from a distribution \(P_Z\). Assume that the support Q of \(P_Z\) is compact and standard with respect to \(P_Z\). Then,

$$\begin{aligned} \limsup _{l\rightarrow \infty } \left( \frac{l}{\log (l)}\right) ^{1/d}d_H(\mathcal {Z}_l, Q)\le \left( \frac{2}{\beta \omega _d}\right) ^{1/d}\quad \text { a.s.}, \end{aligned}$$
(15)

where \(\omega _d=\mu _L(B(0,1))\), \(\mathcal {Z}_l=\{Z_1,\ldots ,Z_l\}\) and \(\beta \) is the standardness constant introduced in Definition 1.

Remark 3

Theorem 2 implies that, if we choose \(\epsilon _l=C \left( \frac{\log (l)}{l}\right) ^{1/d}\) with \(C>(2/(\beta \omega _d))^{1/d}\), then \(Q\subset \cup _{i=1}^l B(Z_i,\epsilon _l)\) for l large enough. This in turn implies that, if Q is connected, \(\cup _{i=1}^l B(Z_i,\epsilon _l)\) is connected.
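
To make the rate in Remark 3 tangible, here is a small Monte Carlo sketch (entirely illustrative; the grid approximation of Q, the sample sizes and the seed are our own choices) comparing an approximation of \(d_H(\mathcal {Z}_l,Q)\) with \((\log (l)/l)^{1/d}\) for uniform samples on the unit square.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
d = 2
# fine grid approximating Q = [0, 1]^2
axis = np.linspace(0, 1, 200)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, d)

for l in (500, 2000, 8000):
    Z = rng.random((l, d))                       # uniform sample on Q
    # since Z is contained in Q, d_H(Z, Q) = sup_{q in Q} d(q, Z),
    # approximated here by the largest distance from a grid point to Z
    dist, _ = cKDTree(Z).query(grid)
    rate = (np.log(l) / l) ** (1 / d)
    print(l, round(dist.max(), 4), round(rate, 4))
```

Theorem 2 predicts that the ratio of the two printed quantities remains bounded (almost surely) as l grows.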

As a consequence of Theorem 2, we get the following covering property, which will be used in the proof of Proposition 3 to verify H3 and H5.

Lemma 4

Let \(X_1,X_2,\dots \) be a sequence of iid observations in \(\mathbb {R}^d\) drawn from a distribution \(P_X\) with support S. Let \(Q\subset S\) be compact and standard with respect to \(P_X\) restricted to Q, with \(P_X(Q)>0\). Consider \((h_l)_{l\ge 1}\) such that \(h_l\rightarrow 0\) and \(lh_l^d/\log (l)\rightarrow \infty \). Then, with probability one, for l large enough, \(Q \subset \bigcup _{X\in \mathcal {X}_{l}\cap Q } B(X,h_l/2),\) where \(\mathcal {X}_l=\{X_1,\ldots ,X_l\}\).

Proof

We need to work with \(\mathcal {X}_l\) restricted to Q. To do so, consider the sequence of stopping times defined by \(\tau _0\equiv 0\), \(\tau _1=\inf \{l:X_l\in Q\}\), \(\tau _j=\inf \{l> \tau _{j-1}:X_l\in Q\}\), and the sequence of visits to Q given by \(Z_j:=X_{\tau _j}\). Then, \((Z_j)_{j\ge 1} \) are iid, distributed as \(X\mid (X\in Q)\), with support Q. Observe that the distribution \(P_Z\) of Z is the restriction of \(P_X\) to Q. Since Q is compact and standard with respect to \(P_Z\), we can invoke Theorem 2 for \((Z_j)_{j\ge 1}\) to conclude that there exists a positive constant \(C_Q\), depending on Q, such that for \(k\ge k_0=k_0(\omega )\),

$$\begin{aligned} d_H(\mathcal {Z}_k,Q)\le C_Q (\log (k)/k)^{1/d}, \end{aligned}$$
(16)

where \(\mathcal {Z}_k=\{Z_1, \ldots , Z_k\}\). Define now \(V_l\) as the number of visits to the set Q up to time l. Namely, \(V_l= \sum _{i=1}^l I_{\{X_i\in Q\}}.\) By the law of large numbers, \(V_l/l\rightarrow P(X\in Q)>0\) a.e., and therefore, for l large enough, \(V_l\ge k_0\). Thus, by (16), recalling that \(h_l^d l/\log (l)\rightarrow \infty \), we get that

$$\begin{aligned} d_H(\mathcal {Z}_{V_l},Q)\le C_Q(\log (V_l)/V_l)^{1/d}\le \tilde{C}_Q(\log (l)/l)^{1/d}\le \frac{h_l}{2}. \end{aligned}$$

In particular, \(Q\subseteq \bigcup _{Z_j\in \mathcal {Z}_{V_l}} B(Z_j, h_l/2)=\bigcup _{X\in \mathcal {X}_{l}\cap Q} B(X,h_l/2).\) \(\square \)

This last lemma will be applied to get the covering properties stated in H3 and H5 for \(I_a\) and \(A_a^\delta \). The following results are needed to show that these sets satisfy the conditions imposed in Lemma 4.

Lemma 5

Let \(\nu \) be a distribution with support I such that \(int(I)\ne \emptyset \) and \(reach(\overline{I^c})> 0\). Assume that \(\nu \) has a density f bounded from below by \(f_0> 0\). Let \(Q=\overline{I\ominus B(0,\gamma )}\) be such that \(\nu (Q)>0\); then Q is standard with respect to \(\nu _Q\), the restriction of \(\nu \) to Q (i.e. \(\nu _Q(A)=\nu (A\cap Q)/\nu (Q)\)), for all \(0\le \gamma <reach(\overline{I^c})\), with \(\beta =f_0/(3\nu (Q))\).

Proof

Let \(0\le \gamma < reach(\overline{I^c})\). By corollary 4.9 in Federer (1959) applied to \(I^c\), we get that \(reach(\overline{(I\ominus B(0,\gamma ))^c})\ge reach(\overline{I^c})-\gamma > 0\), and now by proposition 1 in Aaron et al. (2017), \(\nu _Q\) is standard, with \(\beta =f_0/(3\nu (Q))\) (see Definition 1). \(\square \)

Lemma 6

Let \(I\subset \mathbb {R}^{d}\) be a non-empty, connected, compact set with \(reach(\overline{I^c})>0\). Then, for all \(0< \varepsilon \le reach(\overline{I^c})\), \(I\ominus B(0,\varepsilon )\) is connected.

Proof

Let \(0<\varepsilon \le reach(\overline{I^c})\). By corollary 4.9 in Federer (1959) applied to \(I^c\), \(reach(I\ominus B(0,\varepsilon ))> \varepsilon \). Then the function f defined by \(f(x)=x\) if \(x\in I\ominus B(0,\varepsilon )\), and \(f(x)=\pi _{\partial (I \ominus B(0,\varepsilon ))}(x)\) if \(x \in I{\setminus } (I\ominus B(0,\varepsilon ))\), where \(\pi _{\partial S}\) denotes the metric projection onto \(\partial S\), is well defined. By item 4 of theorem 4.8 in Federer (1959), f is continuous, so it follows that \(f(I)=I\ominus B(0,\varepsilon )\) is connected. \(\square \)

Proof of Proposition 3

Since \(reach(\overline{I_a^c})>0\), we have \(P_X(\partial I_a)=0\) (this follows from Propositions 1 and 2 in Cuevas et al. (2012) together with Proposition 2 in Cholaquidis et al. (2014)), and hence \(\mathbb {P}(X\in int(I_a))=\mathbb {P}(X\in I_a)>0\). By Lemma 5, choosing \(\gamma =0\), the set \(\overline{I_a}\) is standard with respect to \(P_X\) restricted to \(\overline{I_a}\), for \(a=0,1\). By Lemma 4, with \(Q=\overline{I_a}\), \(\overline{I_a}\) is coverable; hence H3 is satisfied.

To prove H5 i), observe that the connectedness of \(A_a^\delta \) follows from that of \(I_a\) (H2 i) together with Lemma 6. For H5 ii), take \(\delta \) small enough such that \(\mathbb {P}(X \in A_a^\delta )>0\), which exists by H2 ii). By (1) in Erdös (1945), using that \(\partial A_a^\delta \subset \{x:d(x,\partial I_a)=\delta \}\), we get that \(\mathbb {P}(X\in \partial A_a^\delta )=0\). Finally, to prove the covering stated in H5, first observe that, by Lemma 5, \(\overline{ A_a^\delta }\) is standard with respect to \(P_X\) restricted to \(\overline{A_a^\delta }\). Invoking Lemma 4 with \(Q=\overline{A_a^\delta }\) and recalling that \(\mathbb {P}(X\in \partial A_a^\delta )=0\), we get the covering property stated in H5 iii).

Lastly, the uniform convergence stated in H7 follows from Theorem 6 in Abdous and Theodorescu (1989), since f is uniformly continuous, assumptions (i)–(iii) therein hold for the uniform kernel, and the bandwidth fulfils \(lh_l^{2d}/\log (l)\rightarrow \infty \). \(\square \)


Cite this article

Cholaquidis, A., Fraiman, R. & Sued, M. On semi-supervised learning. TEST 29, 914–937 (2020). https://doi.org/10.1007/s11749-019-00690-2
