
Provably accurate and scalable linear classifiers in hyperbolic spaces

Regular Paper, Knowledge and Information Systems

Abstract

Many high-dimensional practical data sets have hierarchical structures induced by graphs or time series. Such data sets are hard to process in Euclidean spaces, and one often seeks low-dimensional embeddings in other space forms to perform the required learning tasks. For hierarchical data, the space of choice is a hyperbolic space because it guarantees low-distortion embeddings for tree-like structures. The geometry of hyperbolic spaces has properties not encountered in Euclidean spaces that pose challenges when trying to rigorously analyze algorithmic solutions. We propose a unified framework for learning scalable and simple hyperbolic linear classifiers with provable performance guarantees. The gist of our approach is to focus on Poincaré ball models and formulate the classification problems using tangent space formalisms. Our results include a new hyperbolic perceptron algorithm as well as an efficient and highly accurate convex optimization setup for hyperbolic support vector machine classifiers. Furthermore, we adapt our approach to accommodate second-order perceptrons, where data are preprocessed based on second-order information (correlation) to accelerate convergence, and strategic perceptrons, where potentially manipulated data arrive in an online manner and decisions are made sequentially. The excellent performance of the Poincaré second-order and strategic perceptrons shows that the proposed framework can be extended to general machine learning problems in hyperbolic spaces. Our experimental results, pertaining to synthetic data, single-cell RNA-seq expression measurements, CIFAR10, Fashion-MNIST and mini-ImageNet, establish that all algorithms provably converge and have complexities comparable to those of their Euclidean counterparts.


References

  1. Chien E, Pan C, Tabaghi P, Milenkovic O (2021) Highly scalable and provably accurate classification in Poincaré balls, In: 2021 IEEE international conference on data mining (ICDM). IEEE, pp 61–70

  2. Krioukov D, Papadopoulos F, Kitsak M, Vahdat A, Boguná M (2010) Hyperbolic geometry of complex networks. Phys Rev E 82(3):036106


  3. Sarkar R (2011) Low distortion Delaunay embedding of trees in hyperbolic plane, In: international symposium on graph drawing. Springer, pp 355–366

  4. Sala F, De Sa C, Gu A, Re C (2018) Representation tradeoffs for hyperbolic embeddings, In: international conference on machine learning, vol. 80. PMLR, pp 4460–4469

  5. Nickel M, Kiela D (2017) Poincaré embeddings for learning hierarchical representations, In: Advances in Neural Information Processing Systems, pp 6338–6347

  6. Papadopoulos F, Aldecoa R, Krioukov D (2015) Network geometry inference using common neighbors. Phys Rev E 92(2):022807


  7. Tifrea A, Becigneul G, Ganea O-E (2019) Poincaré glove: hyperbolic word embeddings, In: international conference on learning representations, [Online]. Available: https://openreview.net/forum?id=Ske5r3AqK7

  8. Linial N, London E, Rabinovich Y (1995) The geometry of graphs and some of its algorithmic applications. Combinatorica 15(2):215–245


  9. Cho H, DeMeo B, Peng J, Berger B (2019) Large-margin classification in hyperbolic space, In: international conference on artificial intelligence and statistics. PMLR, pp 1832–1840

  10. Monath N, Zaheer M, Silva D, McCallum A, Ahmed A (2019) Gradient-based hierarchical clustering using continuous representations of trees in hyperbolic space, In: ACM SIGKDD international conference on knowledge discovery & data mining, pp 714–722

  11. Weber M, Zaheer M, Rawat AS, Menon A, Kumar S (2020) Robust large-margin learning in hyperbolic space, In: Advances in Neural Information Processing Systems

  12. Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20(3):273–297


  13. Ganea O, Bécigneul G, Hofmann T (2018) Hyperbolic neural networks, In: Advances in Neural Information Processing Systems, pp 5345–5355

  14. Shimizu R, Mukuta Y, Harada T (2021) Hyperbolic neural networks++, In: international conference on learning representations, [Online]. Available: https://openreview.net/forum?id=Ec85b0tUwbA

  15. Lee K, Maji S, Ravichandran A, Soatto S (2019) Meta-learning with differentiable convex optimization, In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 10 657–10 665

  16. Cesa-Bianchi N, Conconi A, Gentile C (2005) A second-order perceptron algorithm. SIAM J Comput 34(3):640–668


  17. Ahmadi S, Beyhaghi H, Blum A, Naggita K (2021) The strategic perceptron, In: proceedings of the 22nd ACM conference on economics and computation, pp 6–25

  18. Cesa-Bianchi N, Conconi A, Gentile C (2004) On the generalization ability of online learning algorithms. IEEE Trans Inf Theory 50(9):2050–2057


  19. Olsson A, Venkatasubramanian M, Chaudhri VK, Aronow BJ, Salomonis N, Singh H, Grimes HL (2016) Single-cell analysis of mixed-lineage states leading to a binary cell fate choice. Nature 537(7622):698–702


  20. Krizhevsky A, Hinton G et al (2009) Learning multiple layers of features from tiny images

  21. Xiao H, Rasul K, Vollgraf R (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms, arXiv preprint arXiv:1708.07747

  22. Ravi S, Larochelle H (2017) Optimization as a model for few-shot learning, In: international conference on learning representations, [Online]. Available: https://openreview.net/forum?id=rJY0-Kcll

  23. Brückner M, Scheffer T (2011) Stackelberg games for adversarial prediction problems, In: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, pp 547–555

  24. Hardt M, Megiddo N, Papadimitriou C, Wootters M (2016) Strategic classification, In: proceedings of the 2016 ACM conference on innovations in theoretical computer science, pp 111–122

  25. Liu Q, Nickel M, Kiela D (2019) Hyperbolic graph neural networks, In: Advances in Neural Information Processing Systems, pp 8230–8241

  26. Nagano Y, Yamaguchi S, Fujita Y, Koyama M (2019) A wrapped normal distribution on hyperbolic space for gradient-based learning, In: international conference on machine learning. PMLR, pp 4693–4702

  27. Mathieu E, Lan CL, Maddison CJ, Tomioka R, Teh YW (2019) Continuous hierarchical representations with Poincaré variational auto-encoders, In: Advances in Neural Information Processing Systems

  28. Skopek O, Ganea O-E, Bécigneul G (2020) Mixed-curvature variational autoencoders, In: international conference on learning representations, [Online]. Available: https://openreview.net/forum?id=S1g6xeSKDS

  29. Ungar AA (2008) Analytic hyperbolic geometry and Albert Einstein’s special theory of relativity. World Scientific

  30. Vermeer J (2005) A geometric interpretation of Ungar’s addition and of gyration in the hyperbolic plane. Topol Appl 152(3):226–242


  31. Ratcliffe JG, Axler S, Ribet K (2006) Foundations of hyperbolic manifolds, vol 149. Springer, Berlin


  32. Graham RL (1972) An efficient algorithm for determining the convex hull of a finite planar set. Info Pro Lett 1:132–133


  33. Barber CB, Dobkin DP, Huhdanpaa H (1996) The quickhull algorithm for convex hulls. ACM Trans Math Softw (TOMS) 22(4):469–483


  34. Tabaghi P, Pan C, Chien E, Peng J, Milenković O (2021) Linear classifiers in product space forms, arXiv preprint arXiv:2102.10204

  35. Platt J et al (1999) Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Adv Large Margin Classif 10(3):61–74


  36. Klimovskaia A, Lopez-Paz D, Bottou L, Nickel M (2020) Poincaré maps for analyzing complex hierarchies in single-cell data. Nat Commun 11(1):1–9


  37. Khrulkov V, Mirvakhabova L, Ustinova E, Oseledets I, Lempitsky V (2020) Hyperbolic image embeddings, In: proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 6418–6428

  38. Cannon JW, Floyd WJ, Kenyon R, Parry WR et al (1997) Hyperbolic geometry. Flavors Geom 31:59–115


  39. Sherman J, Morrison WJ (1950) Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. Ann Math Stat 21(1):124–127



Acknowledgements

We thank anonymous reviewers for their very useful comments and suggestions. This work was done while Puoya Tabaghi and Jianhao Peng were at University of Illinois Urbana-Champaign. This work was supported by the NSF grant 1956384.

Author information


Corresponding author

Correspondence to Chao Pan.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Accompanying codes can be found at: https://github.com/thupchnsky/PoincareLinearClassification. A shorter version of this work [1] was presented as a regular paper at the International Conference on Data Mining (ICDM), 2021.

Appendices

Proof of Lemma 4.2

By the definition of Möbius addition, we have

$$\begin{aligned} a\oplus b = \frac{1+2a^{T}b+\Vert b\Vert ^2}{1+2a^{T}b+\Vert a\Vert ^2\Vert b\Vert ^2}a + \frac{1-\Vert a\Vert ^2}{1+2a^{T}b+\Vert a\Vert ^2\Vert b\Vert ^2}b. \end{aligned}$$

Thus,

$$\begin{aligned}&\Vert a\oplus b\Vert ^2 = \frac{\Vert (1+2a^Tb+\Vert b\Vert ^2)a + (1-\Vert a\Vert ^2)b\Vert ^2}{ \big ( 1+2a^Tb+\Vert a\Vert ^2\Vert b\Vert ^2 \big )^2} \nonumber \\&= \frac{\Vert a+b\Vert ^2 \big (1+\Vert b\Vert ^2\Vert a\Vert ^2+2a^Tb \big ) }{ \big ( 1+2a^Tb+\Vert a\Vert ^2\Vert b\Vert ^2 \big )^2} =\frac{\Vert a+b\Vert ^2}{1+2a^Tb+\Vert a\Vert ^2\Vert b\Vert ^2} . \end{aligned}$$
(33)

Next, use \(\Vert b\Vert =r\) and \(a^Tb = r\Vert a\Vert \cos (\theta )\) in the above expression:

$$\begin{aligned} \Vert a\oplus b\Vert ^2&= \frac{\Vert a\Vert ^2+2r\Vert a\Vert \cos (\theta ) + r^2}{1+2r\Vert x\Vert \cos (\theta ) + \Vert a\Vert ^2r^2} \nonumber \\&=1-\frac{(1-r^2)(1-\Vert a\Vert ^2)}{1+2r\Vert x\Vert \cos (\theta ) + \Vert a\Vert ^2r^2}. \end{aligned}$$
(34)

The function in (34) attains its maximum at \(\theta = 0\) and \(r=R\). We also observe that (33) is symmetric in \(a\) and \(b\); thus, the same argument holds for \(\Vert b\oplus a\Vert \).
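The identity in (33) is easy to check numerically. The following minimal sketch (ours, not part of the accompanying code; all names are illustrative) implements the Möbius addition defined above and verifies (33) on random points inside the unit ball:

```python
import numpy as np

def mobius_add(a, b):
    """Möbius addition in the Poincaré ball (curvature -1), as in the proof of Lemma 4.2."""
    ab = np.dot(a, b)
    na2, nb2 = np.dot(a, a), np.dot(b, b)
    return ((1 + 2 * ab + nb2) * a + (1 - na2) * b) / (1 + 2 * ab + na2 * nb2)

rng = np.random.default_rng(0)
for _ in range(5):
    # sample two points strictly inside the unit ball
    a = rng.normal(size=3); a *= 0.9 * rng.random() / np.linalg.norm(a)
    b = rng.normal(size=3); b *= 0.9 * rng.random() / np.linalg.norm(b)
    lhs = np.linalg.norm(mobius_add(a, b)) ** 2
    rhs = np.dot(a + b, a + b) / (1 + 2 * np.dot(a, b) + np.dot(a, a) * np.dot(b, b))
    assert np.isclose(lhs, rhs)  # identity (33)
```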

Convex hull algorithms in the Poincaré ball model

We introduce next a generalization of the Graham scan and Quickhull algorithms for the Poincaré ball model. In a nutshell, we replace lines with geodesics and vectors \(\overrightarrow{AB}\) with tangent vectors \(\log _A(B)\) or equivalently \((-A)\oplus B\). The pseudo-code for the Poincaré version of the Graham scan is listed in Algorithm 2, while Quickhull is listed in Algorithm 6 (both for the two-dimensional case). The Graham scan has worst-case time complexity \(O(N\log N)\), while Quickhull has complexity \(O(N\log N)\) in expectation and \(O(N^2)\) in the worst case. The Graham scan only works for two-dimensional points while Quickhull can be generalized for higher dimensions [33].

(Algorithm 2: Poincaré Graham scan; Algorithm 6: Poincaré Quickhull)
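To make the replacement of Euclidean difference vectors by Möbius differences concrete, the following two-dimensional sketch (our illustration, not a transcription of Algorithms 2 or 6; all names are illustrative) shows the left/right-turn test that a Graham-scan-type sweep would use in the Poincaré disk, evaluated on \((-A)\oplus B\) instead of \(B-A\):

```python
import numpy as np

def mobius_add(a, b):
    ab, na2, nb2 = np.dot(a, b), np.dot(a, a), np.dot(b, b)
    return ((1 + 2 * ab + nb2) * a + (1 - na2) * b) / (1 + 2 * ab + na2 * nb2)

def hyperbolic_turn(o, p, q):
    """Sign of the turn o -> p -> q in the Poincaré disk.

    The Euclidean differences p - o and q - o are replaced by the Möbius
    differences mobius_add(-o, p) and mobius_add(-o, q), which point along
    the geodesics from o to p and from o to q, respectively.
    """
    u = mobius_add(-o, p)
    v = mobius_add(-o, q)
    return np.sign(u[0] * v[1] - u[1] * v[0])  # sign of the 2D cross product

# example: three points in the Poincaré disk
o, p, q = np.array([0.1, 0.0]), np.array([0.5, 0.1]), np.array([0.2, 0.6])
print(hyperbolic_turn(o, p, q))  # +1.0: left (counterclockwise) turn
```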

Hyperboloid perceptron

The hyperboloid model and the definition of hyperplanes. The hyperboloid model \({\mathbb {L}}^n_c\) is another model for representing points in an n-dimensional hyperbolic space with curvature \(-c\;(c>0)\). Specifically, it is a Riemannian manifold \(({\mathbb {L}}^n_c, g^{{\mathbb {L}}})\) for which

$$\begin{aligned} {\mathbb {L}}_c^{n}&=\left\{ x \in {\mathbb {R}}^{n+1}:[x, x]=-\frac{1}{c}, x_{0}>0\right\} , \\ g^{{\mathbb {L}}}(u,v)&=[u, v]=u^{\top } H v,\; u,v\in T_p{\mathbb {L}}_c^n, \; H=\left( \begin{array}{cc} -1 &{} 0^{\top } \\ 0 &{} I_{n} \end{array}\right) . \end{aligned}$$

Throughout the remainder of this section, we restrict our attention to \(c=1\); all results can be easily generalized to arbitrary values of c.

We make use of the following bijection between the Poincaré ball model and hyperboloid model, given by

$$\begin{aligned} \left( x_{0}, \ldots , x_{n}\right) \in {\mathbb {L}}^{n} \Leftrightarrow \left( \frac{x_{1}}{1+x_{0}}, \ldots , \frac{x_{n}}{1+x_{0}}\right) \in {\mathbb {B}}^{n}. \end{aligned}$$
(35)

Additional properties of the hyperboloid model can be found in [38].

The recent work [34] introduced the notion of a hyperboloid hyperplane of the form

$$\begin{aligned} H_w&=\{x\in {\mathbb {L}}^n: \text {asinh}\left( [w,x]\right) =0\} \nonumber \\&=\{x\in {\mathbb {L}}^n: [w,x]=0\} \end{aligned}$$
(36)

where \(w\in {\mathbb {R}}^{n+1}\) and \([w,w]=1\). The second equality holds because \(\text {asinh}(\cdot )\) is increasing and preserves the sign of its argument. Thus, the classification rule associated with the hyperplane is \(\text {sgn}\left( [w,x]\right) \) for some weight vector w, as shown in Fig. 9.

The hyperboloid perceptron. The definition of a linear classifier in the hyperboloid model is inherently different from that of a classifier in the Poincaré ball model, as the former is independent of the choice of reference point p. Using the decision hyperplane defined in (36), the hyperboloid perceptron of [34], described in Algorithm 8, admits an easily established performance guarantee.

(Algorithm 8: hyperboloid perceptron)
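Since Algorithm 8 appears only as a figure in the published version, the following minimal sketch (ours; all names are illustrative) spells out the rule implied by the proof of Theorem C.1: predict with \(\text {sgn}([w,x])\) and, upon a mistake, update \(w \leftarrow w + y H x\).

```python
import numpy as np

def minkowski(u, v):
    """Minkowski product [u, v] = u^T H v with H = diag(-1, 1, ..., 1)."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def hyperboloid_perceptron(X, y, max_epochs=100):
    """Mistake-driven sketch of the hyperboloid perceptron (cf. Algorithm 8).

    X: points on the hyperboloid L^n (rows in R^{n+1}); y: labels in {-1, +1}.
    On a mistake (y * [w, x] <= 0) the update is w <- w + y * H @ x, which is
    exactly the update analyzed in the proof of Theorem C.1.
    """
    d = X.shape[1]                      # d = n + 1
    H = np.diag([-1.0] + [1.0] * (d - 1))
    w = np.zeros(d)
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            if label * minkowski(w, x) <= 0:
                w = w + label * (H @ x)
                mistakes += 1
        if mistakes == 0:
            break
    return w
```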

Theorem C.1

Let \((x_i , y_i )_{i=1}^N\) be a labeled data set from a bounded subset of \({\mathbb {L}}^n\) such that \(\Vert x_i\Vert \le R\;\forall i\in [N]\). Assume that there exists an optimal linear classifier with weight vector \(w^\star \) such that \(y_i\text {asinh}([w^\star , x_i])\ge \varepsilon \) (\(\varepsilon \)-margin). Then, the hyperboloid perceptron in Algorithm 8 will correctly classify all points with at most \(O\left( \frac{1}{\sinh ^2(\varepsilon )}\right) \) updates.

Proof

According to the assumption, the optimal normal vector \(w^\star \) satisfies \(y_t\text {asinh}([w^\star , x_t]) \ge \varepsilon \) and \([w^\star , w^\star ]=1\). So we have

$$\begin{aligned} \left\langle w^\star , w_{k+1} \right\rangle&=\left\langle w^\star , w_{k} \right\rangle +y_{t}\left[ w^\star , x_{t}\right] \nonumber \\&\ge \left\langle w^\star , w_{k} \right\rangle +\sinh (\varepsilon )\nonumber \\&\ge \ldots \ge k \sinh (\varepsilon ), \end{aligned}$$
(37)

where the first inequality holds due to the \(\varepsilon \)-margin assumption and because \(y_t\in \{-1,+1\}\). We can also upper bound \(\Vert w_{k+1}\Vert \) as

$$\begin{aligned} \left\| w_{k+1}\right\| ^{2}&=\left\| w_{k}+y_{t} H x_{t}\right\| ^{2}\nonumber \\&=\left\| w_{k}\right\| ^{2}+\left\| x_{t}\right\| ^{2}+2 y_{t}\left[ w_{k}, x_{t}\right] \nonumber \\&\le \left\| w_{k}\right\| ^{2}+R^{2}\nonumber \\&\le \ldots \le k R^{2} , \end{aligned}$$
(38)

where the first inequality follows from \(y_{t}\left[ w_{k}, x_{t}\right] \le 0,\) corresponding to the case when the classifier makes a mistake. Combining (37) and (38), we have

$$\begin{aligned} k\sinh (\varepsilon )\le \left\langle w^\star , w_{k+1} \right\rangle&\le \Vert w^\star \Vert \Vert w_{k+1}\Vert \le \Vert w^\star \Vert \sqrt{k}R \nonumber \\ \Leftrightarrow k&\le \left( \frac{R\Vert w^\star \Vert }{\sinh (\varepsilon )}\right) ^2, \end{aligned}$$
(39)

which completes the proof. In practice, we can always preprocess the data to control the norm of \(w^\star \). Also, for small classification margins \(\varepsilon \), we have \(\sinh (\varepsilon )\sim \varepsilon \). As a result, for data points that are very close to the decision boundary (\(\varepsilon \) is small), Theorem C.1 shows that the hyperboloid perceptron has roughly the same convergence rate as its Euclidean counterpart, \(\left( \frac{R}{\varepsilon }\right) ^2\).

To experimentally confirm the convergence of Algorithm 8, we run synthetic data experiments similar to those described in Sect. 4. More precisely, we first randomly generate a \(w^\star \) such that \([w^\star , w^\star ] = 1\). Then, we generate a random set of \(N = 5,000\) points \(\{x_i\}_{i=1}^N\) in \({\mathbb {L}}^2\). For margin values \(\varepsilon \in [0.1,1]\), we remove points that violate the required constraint on the distance to the classifier (parameterized by \(w^\star \)). Then, we assign binary labels to each data point according to the optimal classifier so that \(y_i = \text {sgn}([w^\star , x_i])\). We repeat this process for 100 different values of \(\varepsilon \) and compare the classification results with those of Algorithm 1 of [11]. Since the theoretical upper bound \(O\left( \frac{1}{\sinh (\varepsilon )}\right) \) claimed in [11] is smaller than \(O\left( \frac{1}{\sinh ^2(\varepsilon )}\right) \) in Theorem C.1, we also plot the upper bound for comparison. From Fig. 9, one can conclude that (1) Algorithm 8 always converges within the theoretical upper bound provided in Theorem C.1, and (2) both methods disagree with the theoretical convergence rate results of [11].

Fig. 9
figure 9

a Visualization of a linear classifier in the hyperboloid model. Colors are indicative of the classes while the gray region represents the decision hyperplane; b, c A comparison between the classification accuracy of the hyperboloid perceptron from Algorithm 8 and the perceptron of Algorithm 1 of [11], for different values of the margin \(\varepsilon \). The classification accuracy is the average of five independent random trials. The stopping criterion is to either achieve a \(100\%\) classification accuracy or to reach the theoretical upper bound on the number of updates of the weight vector from Theorem 3.1 in [11] (Figure (b)), and Theorem C.1 (Figure (c))

Proof of Theorems 4.2 and 4.3

Let \(x_i\in {\mathbb {B}}^n\) and let \(v_i=\log _{p}(x_i)\) be its logarithmic map value. The distance between the point and the hyperplane defined by \(w\in T_p{\mathbb {B}}^n\) and \(p\in {\mathbb {B}}^n\) can be written as (see also (14))

$$\begin{aligned} d(x_i, H_{w,p}) = \sinh ^{-1}\left( \frac{2 \tanh \left( \sigma _{p}\Vert v_i\Vert /2\right) |\langle v_i, w\rangle |}{\left( 1-\tanh ^2 \left( \sigma _{p}\Vert v_i\Vert /2\right) \right) \Vert w\Vert \sigma _p\Vert v_i\Vert /2}\cdot \frac{\sigma _p}{2}\right) . \end{aligned}$$
(40)

For support vectors, \(|\langle v_i, w\rangle |=1\) and \(\Vert v_i\Vert \ge 1/\Vert w\Vert \). Note that \(f(x)=\frac{2\tanh (x)}{x(1-\tanh ^2(x))}\) is an increasing function in x for \(x>0\) and \(g(y)=\sinh ^{-1}(y)\) is an increasing function in y for \(y\in {\mathbb {R}}\). Thus, the distance in (40) can be lower bounded by

$$\begin{aligned} d(x_i,H_{w, p}) \ge \sinh ^{-1}\left( \frac{2 \tanh \left( \frac{\sigma _p}{2\Vert w\Vert }\right) }{1-\tanh ^2\left( \frac{\sigma _p}{2\Vert w\Vert }\right) }\right) . \end{aligned}$$
(41)

The goal is to maximize the distance in (41). To this end, observe that \(h(x)=\frac{2x}{1-x^2}\) is an increasing function in x for \(x\in (0,1)\) and that \(\tanh \left( \frac{\sigma _p}{2\Vert w\Vert }\right) \in (0,1)\). Hence, maximizing the distance is equivalent to minimizing \(\Vert w\Vert \) (or \(\Vert w\Vert ^2\)), provided that \(\sigma _p\) is fixed. Thus, the Poincaré SVM problem can be converted into the convex problem of Theorem 4.2; the constraints are added to force the hyperplane to correctly classify all points in the hard-margin setting. The formulation in Theorem 4.3 can also be seen as arising from a relaxation of the constraints and consideration of the trade-off between margin values and classification accuracy.
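The resulting program is an ordinary quadratic program over the tangent-space features \(v_i=\log _p(x_i)\). Below is a minimal sketch using cvxpy (our illustration only; we assume the constraints take the standard hard-margin form \(y_i\langle v_i, w\rangle \ge 1\), consistent with the support-vector condition \(|\langle v_i, w\rangle |=1\) used above, while the precise constraint set is the one stated in Theorem 4.2):

```python
import cvxpy as cp
import numpy as np

def poincare_svm_hard_margin(V, y):
    """Hard-margin SVM over tangent-space features V[i] = log_p(x_i).

    Minimizing ||w||^2 maximizes the lower bound (41) on the hyperbolic margin;
    the constraints (assumed here to be y_i <v_i, w> >= 1) force correct
    classification of all training points.
    """
    w = cp.Variable(V.shape[1])
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w)),
                         [cp.multiply(y, V @ w) >= 1])
    problem.solve()
    return w.value

# toy usage with linearly separable tangent vectors
V = np.array([[1.0, 0.2], [0.8, -0.1], [-0.9, 0.1], [-1.1, -0.3]])
y = np.array([1.0, 1.0, -1.0, -1.0])
print(poincare_svm_hard_margin(V, y))
```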

Proof of Theorem 5.1

We generalize the arguments in [16] to hyperbolic spaces. Let \(A_0 = aI\) and let \(t_k\) denote the time index of the \(k\)th error. The matrix \(A_k\) can be computed recursively as \(A_k = A_{k-1} + z_{t_k}z_{t_k}^T\), or equivalently \(A_k = aI + X_kX_k^T\).

$$\begin{aligned}&\xi _k^T A_k^{-1} \xi _k = (\xi _{k-1} + y_{t_k}z_{t_k})^TA_k^{-1}(\xi _{k-1} + y_{t_k}z_{t_k}) \\&= \xi _{k-1}^TA_k^{-1}\xi _{k-1} + z_{t_k}^TA_k^{-1}z_{t_k} + 2y_{t_k}(A_k^{-1}\xi _{k-1})^Tz_{t_k}\\&= \xi _{k-1}^TA_k^{-1}\xi _{k-1} + z_{t_k}^TA_k^{-1}z_{t_k} + 2y_{t_k}(w_{t_k})^Tz_{t_k}\\&\le \xi _{k-1}^TA_k^{-1}\xi _{k-1} + z_{t_k}^TA_k^{-1}z_{t_k} \\&{\mathop {=}\limits ^{\mathrm {(a)}}} \xi _{k-1}^TA_{k-1}^{-1}\xi _{k-1} - \frac{(\xi _{k-1}^TA_{k-1}^{-1}z_{t_k})^2}{1+z_{t_k}^TA_{k-1}^{-1}z_{t_k}} + z_{t_k}^TA_k^{-1}z_{t_k} \\&\le \xi _{k-1}^TA_{k-1}^{-1}\xi _{k-1}+ z_{t_k}^TA_k^{-1}z_{t_k} \end{aligned}$$

where \(\mathrm {(a)}\) is due to the Sherman–Morrison formula [39] below.

Lemma E.1

( [39]) Let A be an arbitrary \(n\times n\) positive-definite matrix. Let \(x\in {\mathbb {R}}^n\). Then, \(B = A+xx^T\) is also a positive-definite matrix and

$$\begin{aligned}&B^{-1} = A^{-1} - \frac{(A^{-1}x)(A^{-1}x)^T}{1+x^TA^{-1}x}. \end{aligned}$$
(42)

Note that the final inequality in the chain above holds since \(A_{k-1}\) is a positive-definite matrix, and thus, so is its inverse. Therefore, we have

$$\begin{aligned}&\xi _k^T A_k^{-1} \xi _k \le \xi _{k-1}^TA_{k-1}^{-1}\xi _{k-1} + z_{t_k}^TA_k^{-1}z_{t_k} \le \sum _{j\in [k]}z_{t_j}^TA_j^{-1}z_{t_j}\\&{\mathop {=}\limits ^{\mathrm {(b)}}} \sum _{j\in [k]}\left( 1-\frac{\text {det}(A_{j-1})}{\text {det}(A_{j})}\right) {\mathop {\le }\limits ^{\mathrm {(c)}}} \sum _{j\in [k]}\log \left( \frac{\text {det}(A_{j})}{\text {det}(A_{j-1})}\right) \\&= \log \left( \frac{\text {det}(A_{k})}{\text {det}(A_{0})}\right) = \log \left( \frac{\text {det}(aI + X_kX_k^T)}{\text {det}(aI)}\right) = \sum _{i\in [n]}\log \left( 1+\frac{\lambda _i}{a}\right) , \end{aligned}$$

where \(\lambda _i\) are the eigenvalues of \(X_kX_k^T\). Claim \(\mathrm {(b)}\) follows from Lemma E.2 while \(\mathrm {(c)}\) is due to the fact \(1-x\le -\log (x),\forall x>0\).

Lemma E.2

( [16]) Let A be an arbitrary \(n\times n\) positive-semidefinite matrix. Let \(x\in {\mathbb {R}}^n\) and \(B = A-xx^T\). Then,

$$\begin{aligned} x^TA^\dagger x = {\left\{ \begin{array}{ll} 1 &{} \text { if } x\notin span(B)\\ 1-\frac{\text {det}_{\ne 0}(B)}{\text {det}_{\ne 0}(A)} <1 &{} \text { if } x \in span(B) \end{array}\right. }, \end{aligned}$$
(43)

where \(\text {det}_{\ne 0}(B)\) is the product of non-zero eigenvalues of B.

This leads to the upper bound for \(\xi _k^TA_k^{-1}\xi _k\). For the lower bound, we have

$$\begin{aligned}&\sqrt{\xi _k^TA_k^{-1}\xi _k} \ge \left\langle A_k^{-1/2}\xi _k, \frac{A_k^{1/2}w^\star }{\Vert A_k^{1/2}w^\star \Vert } \right\rangle = \frac{\left\langle \xi _k, w^\star \right\rangle }{\Vert A_k^{1/2}w^\star \Vert }\ge \frac{k \varepsilon '}{\Vert A_k^{1/2}w^\star \Vert }. \end{aligned}$$

Also, recall \(\xi _k = \sum _{j\in [k]}y_{t_j}z_{t_j}\). Combining the bounds, we get

$$\begin{aligned} \left( \frac{k \varepsilon '}{\Vert A_k^{1/2}w^\star \Vert }\right) ^2 \le \xi _k^TA_k^{-1}\xi _k \le \sum _{i\in [n]}\log \left( 1+\frac{\lambda _i}{a}\right) . \end{aligned}$$

This leads to the bound \(k \le \frac{\Vert A_k^{1/2}w^\star \Vert }{\varepsilon '}\sqrt{\sum _{i\in [n]}\log (1+\frac{\lambda _i}{a})}\). Finally, since \(\Vert w^\star \Vert = 1\), we have

$$\begin{aligned}&\Vert A_k^{1/2}w^\star \Vert ^2 = (w^\star )^T (aI + X_kX_k^T) w^\star = a + \lambda _{w^\star }, \end{aligned}$$

which follows from the definition of \(\lambda _{w^\star }\). Hence,

$$\begin{aligned} k \le \frac{1}{\varepsilon '}\sqrt{(a + \lambda _{w^\star })\sum _{i\in [n]}\log (1+\frac{\lambda _i}{a})}, \end{aligned}$$
(44)

which completes the proof.
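The recursion \(A_k = A_{k-1} + z_{t_k}z_{t_k}^T\) together with Lemma E.1 also explains why the second-order perceptron is cheap to run: \(A_k^{-1}\) can be maintained through rank-one updates instead of being recomputed from scratch. A minimal sketch of this bookkeeping (ours; all names are illustrative):

```python
import numpy as np

def sherman_morrison_update(A_inv, x):
    """Return (A + x x^T)^{-1} given A^{-1}, via the Sherman-Morrison formula (42)."""
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

# maintain A_k^{-1} for A_k = aI + sum of z z^T over the mistaken rounds
a, n = 1.0, 4
A_inv = np.eye(n) / a
rng = np.random.default_rng(1)
Z = rng.normal(size=(10, n))
for z in Z:
    A_inv = sherman_morrison_update(A_inv, z)

# sanity check against the direct inverse
assert np.allclose(A_inv, np.linalg.inv(a * np.eye(n) + Z.T @ Z))
```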

Proof of Theorem 5.2

To prove Theorem 5.2, we need Lemmas F.1 and F.2 below.

Lemma F.1

For any \({\tilde{v}}_t\) in the update rule, we have \(\eta _t y_t \left\langle {\tilde{v}}_t, w^\star \right\rangle \ge \sinh (\varepsilon )\), where \(w^\star \) stands for the optimal classifier in Assumption 4.1. Also, for any \(w_k\), we have \(\left\langle w_k, w^\star \right\rangle \ge 0\).

Proof

The proof is by induction. Initially, \(w_1={\textbf{0}}\) and all arriving points get classified as positive. The first mistake occurs when the first negative point \(z_t\) arrives, which gets classified as positive. In this case, \(w_2=-\eta _t v_t\), where \(v_t=\log _p(z_t)\) and \(\eta _t = \frac{2\tanh \left( \frac{\sigma _p\Vert v_t\Vert }{2}\right) }{\left( 1-\tanh \left( \frac{\sigma _p\Vert v_t\Vert }{2}\right) ^2\right) \left\| v_t\right\| }\). Also, \(v_t\) must be unmanipulated (i.e., \(v_t=u_t\)) since it will always be classified as positive. Therefore, based on Assumption 4.1, we have

$$\begin{aligned} \eta _t \left\langle v_t, w^\star \right\rangle =\eta _t \left\langle u_t, w^\star \right\rangle \le -\sinh (\varepsilon ), \left\langle w_2, w^\star \right\rangle =-\eta _t\left\langle v_t, w^\star \right\rangle \ge 0. \end{aligned}$$
(45)

Next, suppose that \(w_{t-1}\) denotes the weight vector at the end of step \(t-1\) and that \(\left\langle w_{t-1}, w^\star \right\rangle \ge 0\). We need to show that \(\eta _t y_t \left\langle {\tilde{v}}_t, w^\star \right\rangle \ge \sinh (\varepsilon )\). By definition, for any point such that \(\frac{\left\langle v_t, w_{t-1} \right\rangle }{\Vert w_{t-1}\Vert }\ne \frac{\alpha }{\sigma _p}\), we have \({\tilde{v}}_t=v_t\). According to Observation 1, those points are also not manipulated, i.e., \({\tilde{v}}_t=v_t=u_t\). Therefore, the claim holds. For data points such that \(\frac{\left\langle v_t, w_{t-1} \right\rangle }{\Vert w_{t-1}\Vert }=\frac{\alpha }{\sigma _p}\), if they are positive, we have \({\tilde{v}}_t=v_t=u_t+\beta \frac{w_{t-1}}{\Vert w_{t-1}\Vert }\), where \(0\le \beta \le \frac{\alpha }{\sigma _p}\). The reason \(\beta \) is always nonnegative is that all rational agents want to be classified as positive, so the only possible direction of manipulation is along \(w_{t-1}\). Hence,

$$\begin{aligned} \eta _t\left\langle {\tilde{v}}_t, w^\star \right\rangle =\eta _t\left\langle u_t+\beta \frac{w_{t-1}}{\Vert w_{t-1}\Vert }, w^\star \right\rangle \ge \eta _t\left\langle u_t, w^\star \right\rangle \ge \sinh (\varepsilon ). \end{aligned}$$
(46)

For data points with negative labels such that \(\frac{\left\langle v_t, w_{t-1} \right\rangle }{\Vert w_{t-1}\Vert }=\frac{\alpha }{\sigma _p}\), we have \({\tilde{v}}_t=v_t-\frac{\alpha w_{t-1}}{\sigma _p\Vert w_{t-1}\Vert }\) and \(v_t=u_t+\beta \frac{w_{t-1}}{\Vert w_{t-1}\Vert }\). This implies that \({\tilde{v}}_t=u_t+\left( \beta -\frac{\alpha }{\sigma _p}\right) \frac{w_{t-1}}{\Vert w_{t-1}\Vert }\). Therefore,

$$\begin{aligned} \eta _t\left\langle {\tilde{v}}_t, w^\star \right\rangle =\eta _t\left\langle u_t+\left( \beta -\frac{\alpha }{\sigma _p}\right) \frac{w_{t-1}}{\Vert w_{t-1}\Vert }, w^\star \right\rangle \le \eta _t\left\langle u_t, w^\star \right\rangle \le -\sinh (\varepsilon ). \end{aligned}$$
(47)

Combining the above two claims, we get \(\eta _t y_t \left\langle {\tilde{v}}_t, w^\star \right\rangle \ge \sinh (\varepsilon )\).

The last step is to assume that \(\left\langle w_{t-1}, w^\star \right\rangle \ge 0\) and \(\eta _t y_t \left\langle {\tilde{v}}_t, w^\star \right\rangle \ge \sinh (\varepsilon )\), and to show that \(\left\langle w_t, w^\star \right\rangle \ge 0\). If the classifier does not make a mistake at step t, the claim is obviously true since \(w_{t-1}=w_t\). If the classifier makes a mistake, we have

$$\begin{aligned} \left\langle w_t, w^\star \right\rangle = \left\langle w_{t-1}+\eta _t y_t {\tilde{v}}_t, w^\star \right\rangle \ge \left\langle w_{t-1}, w^\star \right\rangle \ge 0. \end{aligned}$$
(48)

This completes the proof.

Lemma F.2

If Algorithm 5 makes a mistake on an observed data point \(v_t\), then \(y_t\left\langle {\tilde{v}}_t, w_{t-1} \right\rangle \le 0\).

Proof

If the algorithm makes a mistake on a positive example, we have \(\frac{\left\langle v_t, w_{t-1} \right\rangle }{\Vert w_{t-1}\Vert }< \frac{\alpha }{\sigma _p}\). By Observation 2, no point will fall within the region \(0<\frac{\left\langle v_t, w_{t-1} \right\rangle }{\Vert w_{t-1}\Vert }< \frac{\alpha }{\sigma _p}\). Thus, one must have \(\frac{\left\langle v_t, w_{t-1} \right\rangle }{\Vert w_{t-1}\Vert }\le 0\). Since \(y_t=+1\), \({\tilde{v}}_t=v_t\). Therefore, \(\left\langle {\tilde{v}}_t, w_{t-1} \right\rangle \le 0\). If the algorithm makes a mistake on a negative point, we have \(\frac{\left\langle v_t, w_{t-1} \right\rangle }{\Vert w_{t-1}\Vert }\ge \frac{\alpha }{\sigma _p}\). For the case \(\frac{\left\langle v_t, w_{t-1} \right\rangle }{\Vert w_{t-1}\Vert } > \frac{\alpha }{\sigma _p}\), we have \({\tilde{v}}_t=v_t\). In this case, \(\left\langle {\tilde{v}}_t, w_{t-1} \right\rangle \ge 0\) obviously holds. For \(\frac{\left\langle v_t, w_{t-1} \right\rangle }{\Vert w_{t-1}\Vert } = \frac{\alpha }{\sigma _p}\), we have

$$\begin{aligned} \left\langle {\tilde{v}}_t, w_{t-1} \right\rangle =\left\langle v_t-\frac{\alpha w_{t-1}}{\sigma _p\Vert w_{t-1}\Vert }, w_{t-1} \right\rangle =0. \end{aligned}$$
(49)

The above equality implies that for a negative sample we have \(\left\langle {\tilde{v}}_t, w_{t-1} \right\rangle \ge 0\). Therefore, for any mistaken data point, \(y_t\left\langle {\tilde{v}}_t, w_{t-1} \right\rangle \le 0\).

We are now ready to prove Theorem 5.2.

Proof

The analysis follows along the same lines as that for the standard Poincaré perceptron algorithm described in Sect. 4. We first lower bound \(\Vert w_{k+1}\Vert \) as

$$\begin{aligned} \Vert w_{k+1}\Vert&\ge \left\langle w_{k+1}, w^\star \right\rangle \nonumber \\&= \left\langle w_k, w^\star \right\rangle + \eta _{i_k}y_{i_k}\left\langle {\tilde{v}}_{i_k}, w^\star \right\rangle \nonumber \\&\ge \left\langle w_k, w^\star \right\rangle + \sinh (\varepsilon ) \ge \cdots \ge k\sinh (\varepsilon ), \end{aligned}$$
(50)

where the first bound follows from the Cauchy–Schwarz inequality, while the second inequality was established in Lemma F.1. Next, we upper bound \(\Vert w_{k+1}\Vert \) as

$$\begin{aligned} \Vert w_{k+1}\Vert ^2&= \Vert w_k + \eta _{i_k} y_{i_k} {\tilde{v}}_{i_k}\Vert ^2 \nonumber \\&= \Vert w_k\Vert ^2 + 2\eta _{i_k} y_{i_k} \left\langle w_k, {\tilde{v}}_{i_k} \right\rangle + \Vert {\tilde{v}}_{i_k}\Vert ^2 \nonumber \\&\le \Vert w_k\Vert ^2 + \Vert {\tilde{v}}_{i_k}\Vert ^2 \nonumber \\&\le \Vert w_k\Vert ^2 + \left( \frac{2\tanh (\frac{\sigma _p\Vert v_{i_k}\Vert }{2})}{1-\tanh (\frac{\sigma _p\Vert v_{i_k}\Vert }{2})^2} + \frac{\alpha }{\sigma _p}\right) ^2 \nonumber \\&\le \Vert w_k\Vert ^2 + \left( \frac{2R_p}{1-R_p^2} + \frac{\alpha }{\sigma _p}\right) ^2 \le \cdots \le k\left( \frac{2R_p\sigma _p+\alpha (1-R_p^2)}{\sigma _p(1-R_p^2)}\right) ^2, \end{aligned}$$
(51)

where the first inequality was established in Lemma F.2, while the second inequality follows from the fact that the manipulation budget is \(\alpha \).

Combining (50) and (51), we obtain

$$\begin{aligned} k^2\sinh (\varepsilon )^2&\le k\left( \frac{2R_p\sigma _p+\alpha (1-R_p^2)}{\sigma _p(1-R_p^2)}\right) ^2 \nonumber \\ k&\le \left( \frac{2R_p\sigma _p+\alpha (1-R_p^2)}{\sigma _p(1-R_p^2)\sinh (\varepsilon )}\right) ^2, \end{aligned}$$
(52)

which completes the proof.
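For reference, the mistake-driven update analyzed above, \(w_{k+1} = w_k + \eta _t y_t {\tilde{v}}_t\) with the data-dependent step size \(\eta _t\) from the proof of Lemma F.1, can be sketched as a single round of the online procedure (our illustration of the behavior described in the proofs, not a transcription of Algorithm 5; all names are illustrative):

```python
import numpy as np

def eta(v, sigma_p):
    """Step size eta_t from the proof of Lemma F.1."""
    t = np.tanh(sigma_p * np.linalg.norm(v) / 2.0)
    return 2.0 * t / ((1.0 - t ** 2) * np.linalg.norm(v))

def strategic_round(w, v, y_true, alpha, sigma_p):
    """One round of the Poincaré strategic perceptron (sketch).

    v is the observed (possibly manipulated) tangent vector. A point reporting
    exactly <v, w>/||w|| = alpha/sigma_p is replaced by the surrogate
    v_tilde = v - (alpha/sigma_p) * w/||w|| before the update; all other points
    are used as observed. The prediction is positive iff the observed point
    clears the shifted threshold alpha/sigma_p (every point is positive while w = 0).
    """
    norm_w = np.linalg.norm(w)
    if norm_w > 0 and np.isclose(v @ w / norm_w, alpha / sigma_p):
        v_tilde = v - (alpha / sigma_p) * w / norm_w
    else:
        v_tilde = v
    y_pred = 1 if norm_w == 0 or v @ w / norm_w >= alpha / sigma_p else -1
    if y_pred != y_true:                  # mistake: update with step size eta_t
        w = w + eta(v, sigma_p) * y_true * v_tilde
    return w
```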

Table 4 Performance of the SVM, linear regression and logistic regression algorithms over 5 independent trials. The results show the mean accuracy \((\%)\;\pm \) standard deviation, and bold numbers indicate the best results

Detailed experimental setting

For the first set of experiments, we use the following hyperparameters. The Poincaré perceptron has no hyperparameters to choose. For the Poincaré second-order perceptron, we adopt the strategy proposed in [16]: instead of tuning the parameter a, we set it to 0 and replace the matrix inverse by the pseudo-inverse. For the Poincaré SVM and the Euclidean SVM, we set \(C=1000\) for all data sets, which theoretically forces the SVM to have a hard decision boundary. For the hyperboloid SVM, we surprisingly find that choosing \(C=1000\) makes the algorithm unstable. Empirically, \(C=10\) in general produces better results, although it leads to softer decision boundaries, and the method still breaks down when the data dimension is large. As the hyperboloid SVM operates in the hyperboloid model of hyperbolic space, we map points from the Poincaré ball to the hyperboloid model as follows. Let \(x\in {\mathbb {B}}^n\) and let \(z\in {\mathbb {L}}^n\) be its corresponding point in the hyperboloid model (Table 4). Then,

$$\begin{aligned}&z_0 = \frac{1-\sum _{i=0}^{n-1}x_i^2}{1+\sum _{i=1}^{n}x_i^2},\;z_j = \frac{2x_j}{1+\sum _{i=1}^{n}x_i^2}\;\forall j\in [n]. \end{aligned}$$
(53)
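A minimal sketch of this conversion (ours; all names are illustrative), together with a check that the image indeed satisfies \([z,z]=-1\) and that composing it with the bijection (35) recovers the original point:

```python
import numpy as np

def poincare_to_hyperboloid(x):
    """Map x in B^n to z in L^n, cf. (53)."""
    s = np.dot(x, x)                              # ||x||^2 < 1
    return np.concatenate(([1.0 + s], 2.0 * x)) / (1.0 - s)

def hyperboloid_to_poincare(z):
    """Inverse map, cf. the bijection (35): (z_1, ..., z_n) / (1 + z_0)."""
    return z[1:] / (1.0 + z[0])

x = np.array([0.3, -0.2, 0.1])
z = poincare_to_hyperboloid(x)
assert np.isclose(-z[0] ** 2 + np.dot(z[1:], z[1:]), -1.0)  # [z, z] = -1
assert np.allclose(hyperboloid_to_poincare(z), x)           # round trip via (35)
```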

The Olsson scRNA-seq data contain 319 points from 8 classes, and we perform a \(70\%/30\%\) random split to obtain training (231) and test (88) point sets. CIFAR10 contains 50,000 training points and 10,000 test points from 10 classes. Fashion-MNIST contains 60,000 training points and 10,000 test points from 10 classes. Mini-ImageNet contains 8,000 data points from 20 classes, and we perform a \(70\%/30\%\) random split to obtain training (5,600) and test (2,400) point sets. For all data sets, we choose the trade-off coefficient \(C=5\) and use it with all three SVM algorithms to ensure a fair comparison. We also find that in practice the performance of all three algorithms remains stable for \(C\in [1,10]\).

Additional experimental results

Here we report the performance of two other commonly used linear classifiers (linear regression and logistic regression) in Euclidean space and compare them with the Poincaré SVM. Note that, at the time of writing, there exist no established linear regression or logistic regression methods for hyperbolic geometries.


About this article


Cite this article

Pan, C., Chien, E., Tabaghi, P. et al. Provably accurate and scalable linear classifiers in hyperbolic spaces. Knowl Inf Syst 65, 1817–1850 (2023). https://doi.org/10.1007/s10115-022-01820-3

