
On semi-supervised learning


Abstract

Major efforts have been made, mostly in the machine learning literature, to construct good predictors that combine unlabelled and labelled data. These methods are known as semi-supervised. They address the problem of how to take advantage, when possible, of a huge amount of unlabelled data to perform classification when only few labelled data are available. This is not always feasible: it depends on whether the labels can be inferred from the distribution of the unlabelled data. Nevertheless, several algorithms have been proposed recently. In this work, we present a new method that, under almost necessary conditions, asymptotically attains the performance of the best theoretical rule as the size of the unlabelled sample goes to infinity, even if the size of the labelled sample remains fixed. Its performance and computational time are assessed through simulations and on the well-known "Isolet" real data set of phonemes, where a strong dependence on the choice of the initial training sample is shown. The main focus of this work is to elucidate when and why semi-supervised learning works in the asymptotic regime described above. The set of necessary assumptions, although reasonable, shows that semi-supervised methods only attain consistency for very well-conditioned problems.


References

  • Aaron C, Cholaquidis A, Cuevas A (2017) Stochastic detection of some topological and geometric features. Electron J Stat 11(2):4596–4628. https://doi.org/10.1214/17-EJS1370

  • Abdous B, Theodorescu R (1989) On the strong uniform consistency of a new kernel density estimator. Metrika 11:177–194

  • Agrawala AK (1970) Learning with a probabilistic teacher. IEEE Trans Autom Control 19:716–723

  • Arnold A, Nallapati R, Cohen W (2007) A comparative study of methods for transductive transfer learning. In: Seventh IEEE international conference on data mining workshops (ICDMW)

  • Asuncion A, Newman DJ (2007) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. http://www.ics.uci.edu/~mlearn/MLRepository.html

  • Azizyan M, Singh A, Wasserman L (2013) Density-sensitive semisupervised inference. Ann Stat 41(2):751–771

  • Belkin M, Niyogi P (2004) Semi-supervised learning on Riemannian manifolds. Mach Learn 56:209–239

  • Ben-David S, Lu T, Pal D (2008) Does unlabelled data provably help? Worst-case analysis of the sample complexity of semi-supervised learning. In: 21st annual conference on learning theory (COLT). Available at http://www.informatik.uni-trier.de/~ley/db/conf/colt/colt2008.html

  • Chapelle O, Schölkopf B, Zien A (eds) (2006) Semi-supervised learning. MIT Press, Cambridge

  • Chapelle O, Zien A (2005) Semi-supervised classification by low density separation. In: AISTATS, pp 57–64

  • Cholaquidis A, Cuevas A, Fraiman R (2014) On Poincaré cone property. Ann Stat 42:255–284

  • Castelli V, Cover TM (1995) On the exponential value of labeled samples. Pattern Recognit Lett 16(1):105–111

  • Castelli V, Cover TM (1996) The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Trans Inf Theory 42(6):2102–2117

  • Cuevas A, Fraiman R (1997) A plug-in approach to support estimation. Ann Stat 25:2300–2312

  • Cuevas A, Rodríguez-Casal A (2004) On boundary estimation. Adv Appl Probab 36:340–354

  • Cuevas A, Fraiman R, Pateiro-López B (2012) On statistical properties of sets fulfilling rolling-type conditions. Adv Appl Probab 44:311–329

  • Erdös P (1945) Some remarks on the measurability of certain sets. Bull Am Math Soc 51:728–731

  • Fanty M, Cole R (1991) Spoken letter recognition. In: Lippman RP, Moody J, Touretzky DS (eds) Advances in neural information processing systems, 3. Morgan Kaufmann, San Mateo

  • Federer H (1959) Curvature measures. Trans Am Math Soc 93:418–491

  • Fralick SC (1967) Learning to recognize patterns without a teacher. IEEE Trans Inf Theory 13:57–64

  • Haffari G, Sarkar A (2007) Analysis of semi-supervised learning with the Yarowsky algorithm. In: Proceedings of the 23rd conference on uncertainty in artificial intelligence, UAI 2007, July 19–22, 2007. Vancouver, BC

  • Joachims T (1999) Transductive inference for text classification using support vector machines. In: ICML 16

  • Joachims T (2003) Transductive learning via spectral graph partitioning. In: ICML

  • Lafferty J, Wasserman L (2008) Statistical analysis of semi-supervised regression. In: Conference in advances in neural information processing systems, pp 801–808

  • Nadler B, Srebro N, Zhou X (2009) Statistical analysis of semi-supervised learning: the limit of infinite unlabelled data. In: Bengio Y, Schuurmans D, Lafferty JD, Williams CKI, Culotta A (eds) Advances in neural information processing systems, vol 22. Curran Associates, Inc., pp 1330–1338. http://papers.nips.cc/paper/3652-statistical-analysis-of-semi-supervised-learning-the-limit-of-infinite-unlabelled-data.pdf

  • Niyogi P (2008) Manifold regularization and semi-supervised learning: some theoretical analyses. Technical Report TR-2008-01, Computer Science Dept., Univ. of Chicago. Available at http://people.cs.uchicago.edu/~niyogi/papersps/ssminimax2.pdf

  • Rigollet P (2007) Generalized error bound in semi-supervised classification under the cluster assumption. J Mach Learn Res 8:1369–1392 MR2332435

  • Scudder HJ (1965) Probability of error of some adaptive pattern-recognition machines. IEEE Trans Inf Theory 11:363–371

  • Singh A, Nowak RD, Zhu X (2008) Unlabeled data: now it helps, now it doesn’t. Technical report, ECE Dept., Univ. Wisconsin-Madison. Available at www.cs.cmu.edu/~aarti/pubs/SSL_TR.pdf

  • Sinha K, Belkin M (2009) Semi-supervised learning using sparse eigenfunction bases. In: Bengio Y, Schuurmans D, Lafferty J, Williams CKI, Culotta A (eds) Advances in neural information processing systems, vol 22. MIT Press, Cambridge, pp 1687–1695

  • Vapnik V (1998) Statistical learning theory. Wiley, Hoboken

  • Wang J, Shen X, Pan W (2007) On transductive support vector machines. Contemp Math 443:7–20

  • Zhu X (2008) Semi-supervised learning literature survey. http://pages.cs.wisc.edu/~jerryzhu/research/ssl/semireview.html

Acknowledgements

We thank two referees and an associate editor for their constructive comments and insightful suggestions, which have improved the presentation of this manuscript. We also thank Damián Scherlis for helpful suggestions.

Author information

Correspondence to A. Cholaquidis.


Appendices

Appendix A

Proof of Proposition 1

Observe that \(\mathbb {P}\big (g_i(\mathcal {X}_l)\ne Y_i \mid \mathcal {X}_l{\setminus } X_i\big )\ge \mathbb {P}(g^*(X_i)\ne Y_i) \), for \(i=1,\ldots , l\). Thus,

$$\begin{aligned} \mathbb {E}\Big (\mathbb {I}_{g_i(\mathcal {X}_l)\ne Y_i}\Big )=\mathbb {P}(g_i(\mathcal {X}_l)\ne Y_i) =\mathbb {E}\Big (\mathbb {P}\big (g_i(\mathcal {X}_l)\ne Y_i|\mathcal {X}_l{\setminus } X_i\big )\Big )\ge \mathbb {P}(g^*(X_i)\ne Y_i), \end{aligned}$$

and therefore, \(L({\mathbf {g}_l})=\mathbb {E}\Big (\frac{1}{l} \sum _{i=1}^l \mathbb {I}_{g_i(\mathcal {X}_l)\ne Y_i}\Big )\ge \mathbb {P}(g^*(X_i)\ne Y_i),\) showing that \(L({\mathbf {g}_l})\ge \mathbb {P}(g^*(X)\ne Y)\), for any \(\mathbf {g}_l=(g_1,\ldots , g_l)\). The lower bound is attained by choosing the ith coordinate of \( \mathbf {g}_l\) equal to \(g^*(X_i)\). Moreover, the accuracy of \(\mathbf {g}^*_l\) equals that of a single coordinate; namely, \(L(\mathbf {g}^*_l)= \mathbb {P}(g^*(X)\ne Y)=L^*\). \(\square \)

Proof of Proposition 2

We will prove that if H1, H2 i) and H4 are satisfied, then \(\mathcal {I}_{0,l}\cap \mathcal {I}_{1,l}\subset \mathcal {F}_l\). Combining this inclusion with H3, we conclude that \(\mathbb {P}(\mathcal {F})=1\). To prove that \(\mathcal {I}_{0,l}\cap \mathcal {I}_{1,l}\subset \mathcal {F}_l\), we will see that if

$$\begin{aligned} I_a\subseteq \bigcup _{X\in \mathcal {X}_{l}\cap I_a } B(X,h_l/2)\;,\quad a=0,1, \end{aligned}$$
(11)

then all the elements of \(\mathcal {X}_l\) are labelled by the algorithm. To do so, note that, by H4, there exists \(X_a^{*}\) in \(\mathcal {X}^n\) such that \(X_a^{*}\in I_a\), for \(a=0,1\). We first prove that the algorithm starts. Since \(X^*_1\) is in \(I_1\) and (11) holds with \(a=1\), there exists \(X_j^1\in \mathcal {X}_l\cap I_1\) with \(d(X_1^{*}, X_j^1)< h_l\). In particular, \(d(\mathcal {X}^n, X_j^1)<h_l\) and so \( X_j^1\in \mathcal {U}_0(h_l)\). This guarantees that \(\mathcal {U}_0(h_l) \not =\emptyset \), and hence, the algorithm can start.

Assume now that we have classified \(j< l\) points of \(\mathcal {X}_l\). We will prove that there exists at least one point satisfying the iteration condition required at step \(j+1\): \(\mathcal {U}_{j}(h_l)\not =\emptyset \). By H1, we can assume that \(\mathcal {U}_j=\mathcal {U}_j\cap (I_0\cup I_1)\). Take \(a\) such that \(\mathcal {U}_j\cap I_a\not =\emptyset \). We now consider two possible cases: (i) if \(\mathcal {X}_l\cap I_a\cap \mathcal {U}_j^c=\emptyset \), then \(\mathcal {X}_l\cap I_a=\mathcal {X}_l\cap I_a\cap \mathcal {U}_j \) and so, by (11), \(X^*_a\in B(X,h_l/2)\) for some \(X\in \mathcal {X}_l\cap \mathcal {U}_j\). Since \(X^*_a\) is in \(\mathcal {Z}_j\) and \(X\in \mathcal {U}_j\), we conclude that \(X\in \mathcal {U}_j(h_l)\). Assume now that (ii) \(\mathcal {X}_l\cap I_a\cap \mathcal {U}_j^c\not =\emptyset \). Since \(I_a\) is connected and (11) holds, the union of \( B(X, h_l/2)\), with \(X \in \mathcal {X}_l\cap I_a\), is also a connected set and, therefore,

$$\begin{aligned} \Bigg (\bigcup _{X\in \mathcal {X}_{l}\cap I_a \cap \mathcal {U}_{j}^c} B(X,h_l/2)\Bigg )\ \ \bigcap \ \Bigg (\bigcup _{X\in \mathcal {X}_{l}\cap I_a \cap \mathcal {U}_j} B(X,h_l/2)\Bigg )\ne \emptyset . \end{aligned}$$

Finally, take \(X\in \mathcal {X}_{l}\cap I_a \cap \mathcal {U}_{j}^c\) and \(\tilde{X}\in \mathcal {X}_{l}\cap I_a \cap \mathcal {U}_j\) such that \(B(X, h_l/2)\cap B(\tilde{X},h_l/2)\not =\emptyset \) to conclude that \(\tilde{X}\in \mathcal {U}_j(h_l)\). \(\square \)
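
To fix ideas, the following is a minimal Python sketch of an iterative self-labelling scheme of the kind analysed in this proof: at each step, among the unlabelled points lying within \(h_l\) of an already-labelled point, the point whose ball \(B(\cdot ,h_l)\) contains the most sample points is labelled by a majority vote over the labelled points in that ball. The function name, data structures and tie-breaking rule are our own illustrative choices; this is not the authors' exact implementation of rule (3).

```python
import numpy as np

def self_label(X_labelled, y_labelled, X_unlabelled, h):
    """Illustrative sketch: iteratively label the unlabelled point that
    (i) lies within distance h of an already-labelled point and
    (ii) has the largest number of sample points in its ball B(x, h);
    the label is the majority label among labelled points in that ball."""
    Z = list(map(tuple, X_labelled))                  # labelled pool (grows)
    labels = {tuple(x): int(y) for x, y in zip(X_labelled, y_labelled)}
    U = [tuple(x) for x in X_unlabelled]              # still unlabelled
    all_pts = np.vstack([X_labelled, X_unlabelled])   # the whole sample

    while U:
        Zarr = np.array(Z)
        # candidates: unlabelled points within h of some labelled point
        cand = [u for u in U
                if np.min(np.linalg.norm(Zarr - np.array(u), axis=1)) < h]
        if not cand:                                  # the algorithm gets stuck
            break
        # pick the candidate whose ball B(u, h) contains the most sample points
        counts = [np.sum(np.linalg.norm(all_pts - np.array(u), axis=1) < h)
                  for u in cand]
        u_star = cand[int(np.argmax(counts))]
        # majority vote among the labelled points falling in B(u_star, h)
        near = [z for z in Z
                if np.linalg.norm(np.array(z) - np.array(u_star)) < h]
        vote = np.mean([labels[z] for z in near])
        labels[u_star] = int(vote >= 0.5)             # ties broken towards 1
        Z.append(u_star)
        U.remove(u_star)
    return labels
```

In the notation of the proofs, the labelled pool plays the role of \(\mathcal {Z}_j\) and the candidate set that of \(\mathcal {U}_j(h_l)\); under the assumptions of Proposition 2, with probability one the loop does not stop before every point of \(\mathcal {X}_l\) has been labelled.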

Proof of Lemma 1

By H1, we can assume that \(\eta (X)\ne 1/2\) for all \(X\in \mathcal {X}^n\cup \mathcal {X}_l\). Assume first that \(\eta (X_{j_{\mathrm{bad}}})>1/2\), that is, \(X_{j_{\mathrm{bad}}}\in I_1\), \(\tilde{Y}_{j_{\mathrm{bad}}}=0\), and all the points labelled up to step \(j_{\mathrm{bad}}-1\) by the algorithm are well classified. Suppose now, by contradiction, that \(X_{j_{\mathrm{bad}}}\not \in B_1^{h_l}\), which means that \(X_{j_{\mathrm{bad}}}\not \in B(I_0, h_l)\) and thus \(B(X_{j_{\mathrm{bad}}}, h_l)\cap I_0=\emptyset \). This implies that \(g^*(X)=1 \) for all \(X\in (\mathcal {X}^n\cup \{X_{i_1},\ldots ,X_{j_{\mathrm{bad}} -1} \})\cap B(X_{j_\mathrm{bad}},h_l)\), contradicting the label assigned to \(X_{j_\mathrm{bad}}\) according to the majority rule used by the algorithm. Thus, \(B(X_{j_\mathrm{bad}}, h_l)\cap I_0\not =\emptyset \), and so \(X_{j_\mathrm{bad}}\in B_1^{h_l}\). Analogously, if \(\eta (X_{j_\mathrm{bad}})<1/2\), we deduce that \(X_{j_\mathrm{bad}}\in B_0^{h_l}\). \(\square \)

Proof of Lemma 2

Given \(\delta <\delta _1\), choose \(\varepsilon \) such that \(\gamma (\delta )-2\varepsilon >0\), for \(\gamma (\delta )\) introduced in H6. We will prove that \(\mathcal {S}_l^\varepsilon = \{\sup _{u \in S}|\hat{f}_l(u)-f(u)|<\varepsilon \}\) is included in \( \mathcal {V}_l^\delta \) provided that \(h_l<\delta \); therefore, from (7), we conclude that \(\mathbb {P}(\mathcal {V}^\delta )=1\).

Now, note that on \(\mathcal {S}_l^\varepsilon \), we get that \(f(u)-\varepsilon< \hat{f}_l(u) < f(u)+\varepsilon ,\) and so, on \(\mathcal {S}_l^\varepsilon \), for \(a\in A_0^{\delta }\cup A_1^{\delta }\) and \(b \in B_1^{h_l}\cup B^{h_l}_0\), \(\hat{f}_l(b)< f(b)+\varepsilon<f(a)-\gamma +\varepsilon < \hat{f}_l(a)+2\varepsilon - \gamma .\) Thus, on \(\mathcal {S}_l^\varepsilon \),

$$\begin{aligned} \sup _{b\in B_0^{h_l}\cup B_1^{h_l}}\hat{f}_l(b) \;\le \; \inf _{a\in A_0^\delta \cup A_1^\delta } \hat{f}_l(a)+2\varepsilon -\gamma < \inf _{a\in A_0^\delta \cup A_1^\delta } \hat{f}_l(a), \end{aligned}$$

since \(2\varepsilon -\gamma <0\). This proves that \(\mathcal {S}_l^\varepsilon \subseteq \mathcal {V}_l^\delta \), for l such that \(8h_l<\delta \). \(\square \)
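
The event \(\mathcal {S}_l^\varepsilon \) involves the kernel density estimator \(\hat{f}_l\). As a point of reference, here is a minimal, purely illustrative sketch of a uniform-kernel density estimator of the type referred to in H7, together with one bandwidth choice satisfying \(lh_l^{2d}/\log (l)\rightarrow \infty \); the function name and the particular bandwidth are our own assumptions, not the paper's exact choices.

```python
import numpy as np
from math import pi, gamma, log

def uniform_kernel_density(x, sample, h):
    """Uniform-kernel density estimate at x:
    (# sample points in B(x, h)) / (l * volume of B(0, h))."""
    sample = np.asarray(sample, dtype=float)
    l, d = sample.shape
    ball_volume = pi ** (d / 2) / gamma(d / 2 + 1) * h ** d   # omega_d * h^d
    count = np.sum(np.linalg.norm(sample - np.asarray(x, dtype=float), axis=1) <= h)
    return count / (l * ball_volume)

# an illustrative bandwidth with l * h^(2d) / log(l) -> infinity (here d = 2)
l, d = 5000, 2
h_l = (log(l) / l) ** (1 / (2 * d + 1))
```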

Proof of Lemma 3

When \(j_\mathrm{bad}=\infty \), \(\mathcal {Z}_{j_\mathrm{bad}-1}=\mathcal {X}^n\cup \mathcal {X}_l\). This fact implies that, on the event \(\mathcal {F}_l\cap \mathcal {B}_l^c\), the following identity holds: \(\mathcal {X}_l\cap (\mathcal {Z}_{j_\mathrm{bad}-1})^c=\emptyset \). Thus, to prove (9), we need to show that, for \(a=0,1\), \(\mathcal {F}_l \cap \mathcal {A}_{a,l}^\delta \cap \mathcal {V}^\delta _l\cap \mathcal {B}_l \;\subset \; \left\{ \mathcal {X}_l\cap A_a^\delta \cap (\mathcal {Z}_{j_\mathrm{bad}-1})^c=\emptyset \right\} .\) We will argue by contradiction, assuming that there exists \(\omega \in \mathcal {F}_l \cap \mathcal {A}_{a,l}^\delta \cap \mathcal {V}^\delta _l\cap \mathcal {B}_l \) for which \(\emptyset \not = \mathcal {X}_l\cap A_a^\delta \cap (\mathcal {Z}_{j_\mathrm{bad}-1})^c=\{W_1, \ldots , W_m\}\). Invoking H8, \(\mathcal {X}^n\subseteq \mathcal {Z}_{j_\mathrm{bad}-1}\) and there exists \(X_a^*\in A_a^\delta \cap \mathcal {X}^n\). These facts guarantee that \(X_a^*\in A_a^\delta \cap \mathcal {Z}_{j_\mathrm{bad}-1}\), and since we are working on \(\mathcal {A}_{a,l}^\delta \), we get that

$$\begin{aligned} X_a^*\in A_a^\delta \subseteq \bigcup _{X\in \mathcal {X}_{l}\cap A_a^\delta } B(X,h_l/2)\quad \hbox {and}\quad X_a^*\in \mathcal {Z}_{j_\mathrm{bad}-1}. \end{aligned}$$
(12)

Next, we will argue that there exists \(W^*\in \{W_1, \ldots , W_m\}\) such that \(d(W^*,\mathcal {Z}_{j_\mathrm{bad}-1})<h_l\). To do so, consider the following two cases:

  1. (i)

    \(\mathcal {X}_l \cap A_a^\delta \cap \mathcal {Z}_{j_\mathrm{bad}-1} =\emptyset \). In such a case, from (12) we get that \(A_a^\delta \) can be covered by balls centred at \(\{W_1, \ldots , W_m\}\) and, since \(X_a^*\in A_a^\delta \), \(X_a^*\in B(W^*, h_l/2)\) for some \(W^*\in \{W_1, \ldots , W_m\}\). Therefore, \(d(X_a^*, W^*)<h_l\). Recalling that, as stated in (12), \(X_a^*\in \mathcal {Z}_{j_\mathrm{bad}-1}\), we conclude that \(d(W^*,\mathcal {Z}_{j_\mathrm{bad}-1})<h_l\).

  2. (ii)

    Assume now that \(\mathcal {X}_l \cap A_a^\delta \cap \mathcal {Z}_{j_\mathrm{bad}-1} \not =\emptyset \). Since \(A_a^\delta \) is connected, the union of balls given in (12) is connected, and then,

    $$\begin{aligned} \Bigg \{\bigcup _{X\in \mathcal {X}_{l}\cap A_a^\delta \cap \mathcal {Z}_{j_\mathrm{bad}-1} } B(X,h_l/2)\Bigg \} \quad \bigcap \quad \Bigg \{\bigcup _{1\le i\le m} B(W_i,h_l/2)\Bigg \}\quad \not =\quad \emptyset . \end{aligned}$$

    Thus, there exist \(X\in \mathcal {Z}_{j_\mathrm{bad}-1}\) and \(W^*\in \{W_1, \ldots , W_m\}\) with \(d(X, W^*)<h_l\), which implies that \(d(W^*,\mathcal {Z}_{j_\mathrm{bad}-1})<h_l\).

To finish the proof, we will show that such a \(W^*\) should have been chosen by the algorithm to be labelled before \(X_{j_\mathrm{bad}}\), which implies that \(W^*\in \mathcal {Z}_{j_\mathrm{bad}-1}\), contradicting that \(W^*\in (\mathcal {Z}_{j_\mathrm{bad}-1})^c\). This contradiction shows that no such \(W^*\) exists, as announced. Since \(d(W^*,\mathcal {Z}_{j_\mathrm{bad}-1})<h_l\), we get that \(W^*\in \mathcal {U}_{j_\mathrm{bad}-1}(h_l)\), the set of candidates to be labelled by the algorithm at step \(j_\mathrm{bad}\). Indeed, since \(W^*\in A_a^\delta \) and \(h_l<\delta \), \(B(W^*, h_l) \subseteq I_a\). Thus, \(\hat{\eta }_{j_\mathrm{bad}-1}(W^*)=a\), implying that \(W^*\) attains the maximum stated in (3). Invoking now Lemma 2, since \(W^*\in A_a^\delta \) while \(X_{j_\mathrm{bad}}\) is in \(B_0^{h_l}\cup B_1^{h_l}\) (see Lemma 1), we know that \(\#\{\mathcal {X}_l\cap B(W^*,h_l)\}\ge \#\{\mathcal {X}_l\cap B(X_{j_\mathrm{bad}},h_l)\}\); thus, \(W^*\) should have been chosen before \(X_{j_\mathrm{bad}}\). This concludes the proof of the result (Fig. 9). \(\square \)

Fig. 9

In black, \(X_{j_\mathrm{bad}}\); in red, \(X_{j_k}\); in blue, the points of \(\mathcal {X}_{l_k}\) belonging to \(B(X_{j_k},h_{l_k})\) and \(B(X_{j_\mathrm{bad}},h_{l_k})\) (color figure online)

Proof of Theorem 1

Recall that \(g_{{n,l,r(i)}}(\mathcal {X}_l)\) denotes the label assigned by the algorithm to the observation \(X_i\in \mathcal {X}_l\). The empirical mean accuracy of classification satisfies

$$\begin{aligned} \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}\ge&\; \frac{1}{l}\sum _{i=1}^{l} \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_{l})= Y_i}\;\mathbb {I}_{g^*(X_i)=Y_i}\;\mathbb {I}_{A_0^\delta \cup A_1^\delta }(X_i)\\ =&\; \frac{1}{l}\sum _{i=1}^{l} \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_{l})= g^*(X_i)} \;\mathbb {I}_{g^*(X_i)=Y_i}\;\mathbb {I}_{A_0^\delta \cup A_1^\delta }(X_i). \end{aligned}$$

Consider \(\mathcal {T}_l^\delta =\mathcal {F}_l\cap \mathcal {A}_{0,l}^\delta \cap \mathcal {A}_{1,l}^\delta \cap \mathcal {V}_l^\delta \), and \(\mathcal {T}^\delta =\bigcup _{l_0}\bigcap _{l\ge l_0}\mathcal {T}_l^\delta \). Combining the results obtained in Proposition 2 and Lemma 2 with condition H5, we conclude that \(\mathbb {P}(\mathcal {T}^\delta )=1\), for \(\delta <\min \{\delta _0, \delta _1,\delta _2\}\). By (10), on \(\mathcal {T}_l^\delta \), we have that \(\mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_{l})= g^*(X_i)}\ge \mathbb {I}_{A_0^\delta \cup A_1^\delta }(X_i)\quad \text { for all }i=1,\ldots ,l,\) and therefore,

$$\begin{aligned} \frac{1}{l}\sum _{i=1}^{l} \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_{l})= g^*(X_i)} \;\mathbb {I}_{g^*(X_i)=Y_i}\;\mathbb {I}_{A_0^\delta \cup A_1^\delta }(X_i)\ge \frac{1}{l}\sum _{i=1}^{l} \mathbb {I}_{g^*(X_i)=Y_i} \;\mathbb {I}_{A_0^\delta \cup A_1^\delta }(X_i). \end{aligned}$$

Then, on \(\mathcal {T}^\delta \), we have that \(\liminf _{l\rightarrow \infty } \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}\ge \mathbb {P}\{g^*(X)=Y, X\in A_0^\delta \cup A_1^\delta \}\) and so

$$\begin{aligned} \liminf _{l\rightarrow \infty } \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i} \ge \mathbb {P}\{g^*(X)=Y\} \quad a.s. \end{aligned}$$
(13)

On the other hand,

$$\begin{aligned}&\frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}= \frac{1}{l}\sum _{i=1}^l \;\mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i} \mathbb {I}_{g^*(X_i)=Y_i}\\&\quad + \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}\;\mathbb {I}_{g^*(X_i)\not =Y_i} \le \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g^*(X_i)=Y_i} + \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)\not = g^*(X_i)}. \end{aligned}$$

From Lemma 3, on \(\mathcal {T}_l^\delta \), \(\mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)\not = g^*(X_i)}\le \mathbb {I}_{(A_0^\delta \cup A_1^\delta )^c}(X_{i})\), and therefore, on \(\mathcal {T}^\delta \),

$$\begin{aligned} \limsup _{l\rightarrow \infty } \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}\le \mathbb {P}(g^*(X)=Y)+\mathbb {P}(X\not \in \{A_0^\delta \cup A_1^\delta \}) . \end{aligned}$$

By H2 (ii), the last term in the previous display converges to zero when \(\delta \rightarrow 0\), and thus,

$$\begin{aligned} \limsup _{l\rightarrow \infty } \frac{1}{l}\sum _{i=1}^l \mathbb {I}_{g_{{n,l,r(i)}}(\mathcal {X}_l)= Y_i}\le \mathbb {P}(g^*(X)=Y)\quad a.s. \end{aligned}$$
(14)

Combining (13) and (14), we deduce the announced convergence. The consistency defined in (1) follows from the dominated convergence theorem. \(\square \)

Appendix B

In this section, we will prove that under H2, conditions H3 and H6 hold if we impose some geometric restrictions on \(I_0\) and \(I_1\). In order to make this Appendix self-contained, we recall some geometric definitions and include some results that will be invoked.

First, we introduce the concept of Hausdorff distance. Given two non-empty compact sets \(A,C\subset {\mathbb {R}}^d\), the Hausdorff distance (or Hausdorff–Pompeiu distance) between A and C is defined by \(d_H(A,C)=\inf \{\varepsilon > 0: A\subset B(C,\varepsilon )\ \text{ and }\ C\subset B(A,\varepsilon )\}.\)
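
For finite point sets, the Hausdorff distance reduces to a maximum of nearest-neighbour distances. The following short Python function (our own illustration, assuming only NumPy) computes it for two finite clouds and can also be used to approximate \(d_H(\mathcal {Z}_l,Q)\) below by discretising Q.

```python
import numpy as np

def hausdorff_distance(A, C):
    """Hausdorff distance between two finite point sets A, C in R^d:
    the larger of the directed distances max_a d(a, C) and max_c d(c, A)."""
    A, C = np.asarray(A, dtype=float), np.asarray(C, dtype=float)
    D = np.linalg.norm(A[:, None, :] - C[None, :, :], axis=-1)  # pairwise distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

# example: two uniform samples on the unit square
rng = np.random.default_rng(0)
print(hausdorff_distance(rng.random((200, 2)), rng.random((500, 2))))
```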

Next, we define standard sets, according to Cuevas and Fraiman (1997) (see also Cuevas and Rodríguez-Casal 2004).

Definition 1

A bounded set \(S\subset \mathbb {R}^d\) is said to be standard with respect to a Borel measure \(\mu \) if there exist \(\lambda >0\) and \(\beta >0\) such that \(\mu \big (B(x,\varepsilon )\cap S\big )\ge \beta \mu _L(B(x,\varepsilon ))\text { for all }x\in S,\ 0<\varepsilon \le \lambda ,\) where \(\mu _L\) denotes the Lebesgue measure on \(\mathbb {R}^d\).

Roughly speaking, standardness prevents the set from having peaks that are too sharp.
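
For concreteness, here is a simple worked example (our own illustration, not taken from the main text). The unit cube \(S=[0,1]^d\), with \(\mu \) the restriction of the Lebesgue measure to S, is standard: for every \(x\in S\) and \(0<\varepsilon \le 1/2\), each interval \([x_i-\varepsilon ,x_i+\varepsilon ]\) contains a half-interval of length \(\varepsilon \) included in [0, 1], so \(B(x,\varepsilon )\cap S\) contains the intersection of \(B(x,\varepsilon )\) with an orthant at x, and therefore

$$\begin{aligned} \mu \big (B(x,\varepsilon )\cap S\big )\ge 2^{-d}\mu _L\big (B(x,\varepsilon )\big ), \end{aligned}$$

that is, Definition 1 holds with \(\beta =2^{-d}\) and \(\lambda =1/2\). By contrast, a set with an outward cusp, such as \(\{(x_1,x_2):0\le x_1\le 1,\ 0\le x_2\le x_1^2\}\), is not standard with respect to the restricted Lebesgue measure: near the origin, the mass of \(B(0,\varepsilon )\) inside the set is of order \(\varepsilon ^3\), whereas \(\mu _L(B(0,\varepsilon ))\) is of order \(\varepsilon ^2\).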

The following theorem is proved in Cuevas and Rodríguez-Casal (2004).

Theorem 2

(Cuevas and Rodríguez-Casal 2004) Let \(Z_1,Z_2,\dots \) be a sequence of iid observations in \(\mathbb {R}^d\) drawn from a distribution \(P_Z\). Assume that the support Q of \(P_Z\) is compact and standard with respect to \(P_Z\). Then,

$$\begin{aligned} \limsup _{l\rightarrow \infty } \left( \frac{l}{\log (l)}\right) ^{1/d}d_H(\mathcal {Z}_l, Q)\le \left( \frac{2}{\beta \omega _d}\right) ^{1/d}\quad \text { a.s.}, \end{aligned}$$
(15)

where \(\omega _d=\mu _L(B(0,1))\), \(\mathcal {Z}_l=\{Z_1,\ldots ,Z_l\}\) and \(\beta \) is the standardness constant introduced in Definition 1.

Remark 3

Theorem 2 implies that, if we choose \(\epsilon _l=C \left( \frac{\log (l)}{l}\right) ^{1/d}\) with \(C>(2/(\beta \omega _d))^{1/d}\), then \(Q\subset \cup _{i=1}^l B(Z_i,\epsilon _l)\) for l large enough. This in turn implies that, if Q is connected, \(\cup _{i=1}^l B(Z_i,\epsilon _l)\) is connected.
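
To make the rate in Remark 3 tangible, here is a small Monte Carlo sketch (entirely illustrative; the grid approximation of Q, the sample sizes and the seed are our own choices) comparing an approximation of \(d_H(\mathcal {Z}_l,Q)\) with \((\log (l)/l)^{1/d}\) for uniform samples on the unit square.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(1)
d = 2
# fine grid approximating Q = [0, 1]^2
axis = np.linspace(0, 1, 200)
grid = np.stack(np.meshgrid(axis, axis), axis=-1).reshape(-1, d)

for l in (500, 2000, 8000):
    Z = rng.random((l, d))                       # uniform sample on Q
    # since Z is contained in Q, d_H(Z, Q) = sup_{q in Q} d(q, Z),
    # approximated here by the largest distance from a grid point to Z
    dist, _ = cKDTree(Z).query(grid)
    rate = (np.log(l) / l) ** (1 / d)
    print(l, round(dist.max(), 4), round(rate, 4))
```

Theorem 2 predicts that the ratio of the two printed quantities remains bounded (almost surely) as l grows.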

As a consequence of Theorem 2, we get the following covering property, which will be used in the proof of Proposition 3 to verify H3 and H5.

Lemma 4

Let \(X_1,X_2,\dots \) be a sequence of iid observations in \(\mathbb {R}^d\) drawn from a distribution \(P_X\) with support S. Let \(Q\subset S\) be compact and standard with respect to \(P_X\) restricted to Q, with \(P_X(Q)>0\). Consider \((h_l)_{l\ge 1}\) such that \(h_l\rightarrow 0\) and \(lh_l^d/\log (l)\rightarrow \infty \). Then, with probability one, for l large enough, \(Q \subset \bigcup _{X\in \mathcal {X}_{l}\cap Q } B(X,h_l/2),\) where \(\mathcal {X}_l=\{X_1,\ldots ,X_l\}\).

Proof

We need to work with \(\mathcal {X}_l\) restricted to Q. To do so, consider the sequence of stopping times defined by \(\tau _0\equiv 0\), \(\tau _1=\inf \{l:X_l\in Q\}\), \(\tau _j=\inf \{l> \tau _{j-1}:X_l\in Q\}\), and the sequence of visits to Q given by \(Z_j:=X_{\tau _j}\). Then, \((Z_j)_{j\ge 1} \) are iid, distributed as \(X\mid (X\in Q)\), with support Q. Observe that the distribution \(P_Z\) of Z is the restriction of \(P_X\) to Q. Since Q is compact and standard with respect to \(P_Z\), we can invoke Theorem 2 for \((Z_j)_{j\ge 1}\) to conclude that there exists a positive constant \(C_Q\), depending on Q, such that for \(k\ge k_0=k_0(\omega )\),

$$\begin{aligned} d_H(\mathcal {Z}_k,Q)\le C_Q (\log (k)/k)^{1/d}, \end{aligned}$$
(16)

where \(\mathcal {Z}_k=\{Z_1, \ldots , Z_k\}\). Define now \(V_l\) as the number of visits to the set Q up to time l. Namely, \(V_l= \sum _{i=1}^l I_{\{X_i\in Q\}}.\) By the law of large numbers, \(V_l/l\rightarrow P(X\in Q)>0\) a.e., and therefore, for l large enough, \(V_l\ge k_0\). Thus, by (16), recalling that \(h_l^d l/\log (l)\rightarrow \infty \), we get that

$$\begin{aligned} d_H(\mathcal {Z}_{V_l},Q)\le C_Q(\log (V_l)/V_l)^{1/d}\le \tilde{C}_Q(\log (l)/l)^{1/d}\le \frac{h_l}{2}. \end{aligned}$$

In particular, \(Q\subseteq \bigcup _{Z_j\in \mathcal {Z}_{V_l}} B(Z_j, h_l/2)=\bigcup _{X\in \mathcal {X}_{l}\cap Q} B(X,h_l/2).\) \(\square \)

This last lemma will be applied to get the covering properties stated in H3 and H5 for \(I_a\) and \(A_a^\delta \). The following results are needed to show that these sets satisfy the conditions imposed in Lemma 4.

Lemma 5

Let \(\nu \) be a distribution with support I such that \(int(I)\ne \emptyset \) and \(reach(\overline{I^c})> 0\). Assume that \(\nu \) has a density f bounded from below by \(f_0> 0\). Let \(Q=\overline{I\ominus B(0,\gamma )}\) be such that \(\nu (Q)>0\); then Q is standard with respect to \(\nu _Q\), the restriction of \(\nu \) to Q (i.e. \(\nu _Q(A)=\nu (A\cap Q)/\nu (Q)\)), for all \(0\le \gamma <reach(\overline{I^c})\), with \(\beta =f_0/(3\nu (Q))\).

Proof

Let \(0\le \gamma < reach(\overline{I^c})\). By corollary 4.9 in Federer (1959) applied to \(I^c\), we get that \(reach(\overline{(I\ominus B(0,\gamma ))^c})\ge reach(\overline{I^c})-\gamma > 0\), and now by proposition 1 in Aaron et al. (2017), \(\nu _Q\) is standard, with \(\beta =f_0/(3\nu (Q))\) (see Definition 1). \(\square \)

Lemma 6

Let \(I\subset \mathbb {R}^{d}\) be a non-empty, connected, compact set with \(reach(\overline{I^c})>0\). Then, for all \(0< \varepsilon \le reach(\overline{I^c})\), \(I\ominus B(0,\varepsilon )\) is connected.

Proof

Let \(0<\varepsilon \le reach(\overline{I^c})\). By corollary 4.9 in Federer (1959) applied to \(I^c\), \(reach(I\ominus B(0,\varepsilon ))> \varepsilon \). Then the function f defined by \(f(x)=x\) if \(x\in I\ominus B(0,\varepsilon )\), and \(f(x)=\pi _{\partial (I \ominus B(0,\varepsilon ))}(x)\) if \(x \in I{\setminus } (I\ominus B(0,\varepsilon ))\), where \(\pi _{\partial S}\) denotes the metric projection onto \(\partial S\), is well defined. By item 4 of theorem 4.8 in Federer (1959), f is continuous, so it follows that \(f(I)=I\ominus B(0,\varepsilon )\) is connected. \(\square \)

Proof of Proposition 3

Since \(reach(\overline{I_a^c})>0\), we have \(P_X(\partial I_a)=0\) (this follows from Propositions 1 and 2 in Cuevas et al. (2012) together with Proposition 2 in Cholaquidis et al. (2014)), and hence \(\mathbb {P}(X\in int(I_a))=\mathbb {P}(X\in I_a)>0\). By Lemma 5, choosing \(\gamma =0\), the set \(\overline{I_a}\) is standard with respect to \(P_X\) restricted to \(\overline{I_a}\), for \(a=0,1\). By Lemma 4, with \(Q=\overline{I_a}\), \(\overline{I_a}\) is coverable; hence H3 is satisfied.

To prove H5 i), observe that the connectedness of \(A_a^\delta \) follows from that of \(I_a\) (H2 i) together with Lemma 6. For H5 ii), take \(\delta \) small enough such that \(\mathbb {P}(X \in A_a^\delta )>0\), which exists by H2 ii). By (1) in Erdös (1945), using that \(\partial A_a^\delta \subset \{x:d(x,\partial I_a)=\delta \}\), we get that \(\mathbb {P}(X\in \partial A_a^\delta )=0\). Finally, to prove the covering stated in H5, first observe that, by Lemma 5, \(\overline{ A_a^\delta }\) is standard with respect to \(P_X\) restricted to \(\overline{A_a^\delta }\). Invoking Lemma 4 with \(Q=\overline{A_a^\delta }\) and recalling that \(\mathbb {P}(X\in \partial A_a^\delta )=0\), we get the covering property stated in H5 iii).

Lastly, the uniform convergence stated in H7 follows from Theorem 6 in Abdous and Theodorescu (1989), since f is uniformly continuous, assumptions (i)–(iii) therein hold for the uniform kernel, and the bandwidth fulfils \(lh_l^{2d}/\log (l)\rightarrow \infty \). \(\square \)


Cite this article

Cholaquidis, A., Fraiman, R. & Sued, M. On semi-supervised learning. TEST 29, 914–937 (2020). https://doi.org/10.1007/s11749-019-00690-2
