Hyperbolic Deep Learning in Computer Vision: A Survey

Deep representation learning is a ubiquitous part of modern computer vision. While Euclidean space has been the de facto standard manifold for learning visual representations, hyperbolic space has recently gained rapid traction for learning in computer vision. Specifically, hyperbolic learning has shown a strong potential to embed hierarchical structures, learn from limited samples, quantify uncertainty, add robustness, limit error severity, and more. In this paper, we provide a categorization and in-depth overview of current literature on hyperbolic learning for computer vision. We research both supervised and unsupervised literature and identify three main research themes in each direction. We outline how hyperbolic learning is performed in all themes and discuss the main research problems that benefit from current advances in hyperbolic learning for computer vision. Moreover, we provide a high-level intuition behind hyperbolic geometry and outline open research questions to further advance research in this direction.


Introduction
From image segmentation to future frame prediction and from video grounding to generating images, deep representation learning is the central component that drives modern computer vision problems (LeCun et al, 2015). In short succession, many differentiable layers and network architectures have been proposed to tackle visual research problems (Gu et al, 2018; Bommasani et al, 2021; Khan et al, 2022). While different in structure, scope, and inductive biases, all are based on Euclidean operators and therefore, implicitly or explicitly, assume that data is best represented on regular grids.
Euclidean space forms an intuitive and grounded underlying manifold, but its inherent properties are not the best match for all types of data. Consider for example hierarchical structures such as trees, ontologies, and taxonomies. Hierarchies are foundational building blocks across all scientific disciplines to formalize our knowledge (Noy and Hafner, 1997). In hierarchies, the number of nodes grows exponentially with depth, from few coarse-grained to many fine-grained nodes. The volume of a ball in Euclidean space, however, grows only polynomially with its diameter. An alternative geometry is needed to match the nature of hierarchies.
In the quest for a more appropriate geometry for hierarchies, hyperbolic geometry provides a direct fit (Bridson and Haefliger, 2013). In essence, hyperbolic and Euclidean geometry differ in only one aspect: the parallel postulate. In Euclidean space, given a line and a point not on it, there is exactly one line through the point parallel to the given line. In hyperbolic space, there are at least two such parallel lines. This change has many consequences and, as a result, hyperbolic geometry can be seen as a geometry of constant negative curvature. In the context of deep learning, this geometry has many attractive properties, such as its hierarchical structure and exponential expansion.
Empowered by these geometric properties, hierarchical embeddings have in recent years been performed in hyperbolic space with great success (Nickel and Kiela, 2017), leading to unparalleled abilities to embed deep and complex trees with minimal distortion (Ganea et al, 2018a; Sala et al, 2018). This has led to rapid advances in hyperbolic deep learning across many disciplines and research areas, including but not limited to graph networks (Chami et al, 2019; Liu et al, 2019; Dai et al, 2021), text embeddings (Tifrea et al, 2019; Zhu et al, 2020), molecular representation learning (Klimovskaia et al, 2020; Yu et al, 2020; Wu et al, 2021), and recommender systems (Mirvakhabova et al, 2020; Wang et al, 2021; Yang et al, 2022).
In the wake of other disciplines, computer vision has in recent years also benefited from research into deep learning in hyperbolic space. A quickly growing body of literature has shown that hyperbolic embeddings benefit few-shot learning, zero-shot recognition, out-of-distribution generalization, uncertainty quantification, generative learning, and hierarchical representation learning, amongst others. These works show evidence that hyperbolic geometry has a lot of potential for learning in computer vision.
This survey provides an in-depth overview and categorization of the recent boom in hyperbolic computer vision literature. These works have investigated hyperbolic learning across many visual research problems with different solutions. As a result, it is unclear how current literature is connected, what is common and new in each work, and in which direction the field is heading. This survey seeks to fill this void. We investigate both supervised and unsupervised papers. For supervised learning, we identify three shared themes amongst current papers, where samples are matched to either gyroplanes, prototypes, or other samples in hyperbolic space. For unsupervised papers, we dive into the three main axes explored in current papers, namely generative learning, clustering, and self-supervised learning. Peng et al (2021) have recently written a general survey on hyperbolic neural networks, but their scope did not include the computer vision literature on hyperbolic learning; we cover this complementary ground.
The rest of the paper is organised as follows. In Section 2, we provide the background on hyperbolic geometry and foundational papers on hyperbolic embeddings and hyperbolic neural networks. Sections 3 and 4 provide an overview of supervised and unsupervised hyperbolic visual learning literature. Lastly, in Section 5, we outline advantages and improvements reported in current papers, as well as open challenges for the field.
2 Background on hyperbolic geometry

2.1 What is hyperbolic geometry?
Hyperbolic geometry was initially developed in the 19th century by Gauss, Lobachevsky, Bolyai, and others as a concrete example of a non-Euclidean geometry. Soon after, it found important applications in physics, as the mathematical basis of Einstein's special theory of relativity. It can be characterized as the geometry of constant negative curvature, differentiating it from the flat geometry of Euclidean space and the positively curved geometry of spheres and hyperspheres. From the point of view of representation learning, its attractive properties are its exponential expansion and its hierarchical, tree-like structure. Exponential expansion means that the volume of a ball in hyperbolic space grows exponentially with its diameter, in contrast to Euclidean space, where the rate of growth is polynomial. The 'tree-likeness' of a metric space can be quantified by Gromov's hyperbolicity (Bridson and Haefliger, 2013), which is zero for tree graphs, finite (but non-zero) for hyperbolic space, and infinite for Euclidean space.

Models of hyperbolic geometry
Several different, but eventually equivalent, models of hyperbolic geometry exist (Cannon et al, 1997). They differ in their coordinate representations of points and in their expressions for distances, geodesics, and other quantities. Although they can be converted into each other, certain models may be preferred for a given task, for reasons of numerical efficiency, ease of visualization, or simplified calculations. The most commonly used models are the Poincaré model, the hyperboloid (or 'Lorentz') model, the Klein model, and the upper half-space model.
• The Poincaré model D^d represents d-dimensional hyperbolic space by the unit ball which, in the frequently considered case d = 2, becomes the unit disc. Geodesics ('shortest paths') are arcs of Euclidean circles (or lines), meeting the boundary of D^d at a right angle. While distances, areas, and volumes are distorted in comparison to their Euclidean counterparts, the model is conformal, i.e., hyperbolic angles are measured as in Euclidean geometry. In its two-dimensional form as the Poincaré disc, the model is popular for visualizations; it is also the geometric basis of the artworks Circle Limits I-IV of M. C. Escher; see Figure 1.
• The hyperboloid model H^d uses one sheet of the two-sheeted hyperboloid as a model of d-dimensional hyperbolic geometry. Contrary to the other models, its ambient space R^(d+1) adds one dimension to the modeled space.
Many formulas involving the hyperboloid model can be written in concise form by introducing the Lorentz product ⟨x, y⟩_L = −x_0 y_0 + x_1 y_1 + ... + x_d y_d. An advantage of the hyperboloid model is that it retains some linear structure; translations and other isometries, for example, can be represented by linear maps. Expressions for distances and geodesics are simpler compared to other models. Notably, the Poincaré model can be derived as a projection ('stereographic projection') of the hyperboloid model to the unit ball (Cannon et al, 1997; Ratcliffe, 1994).
• The Klein model K^d also uses the unit ball to represent hyperbolic space. In contrast to the Poincaré model, it is not conformal; its geodesics, however, are Euclidean ('straight') lines, which can be beneficial from a computational point of view, e.g., when computing barycenters.

Five core hyperbolic operations
Within the context of deep learning and computer vision, we find that five core operations form the basic building blocks of the vast majority of algorithms. The ability to work with these five operations will cover most of the existing literature:
1. Measuring the distance between two points x and y;
2. Finding the geodesic arc (the distance-minimizing curve) from x to y;
3. Forming a geodesic, by extending a geodesic arc as far as possible;
4. Using the exponential map, to determine the result of following a geodesic in direction u, at speed r, starting at a point x;
5. Moving a cloud of points, while preserving all their pairwise hyperbolic distances, by applying a hyperbolic translation.
The distance between two points is given, in the Poincaré and the hyperboloid model respectively (stated here for κ = 1), by

d_D(x, y) = arcosh(1 + 2‖x − y‖² / ((1 − ‖x‖²)(1 − ‖y‖²))),    d_H(x, y) = arcosh(−⟨x, y⟩_L).

In the less frequently used Klein and upper half-space models, distances are given by

d_K(x, y) = arcosh((1 − ⟨x, y⟩) / (√(1 − ‖x‖²) · √(1 − ‖y‖²))),    d_U(x, y) = arcosh(1 + ‖x − y‖² / (2 x_d y_d)),

see (Ratcliffe, 1994, §6.1). The scaling factor of distances is controlled by the curvature parameter κ ∈ (0, ∞), which is often standardized to κ = 1. The sectional curvature (in the sense of differential geometry) of hyperbolic space is constant, negative, and equal to −κ. Given the distance function, it makes sense to speak of geodesics and geodesic arcs, that is, (locally) distance-minimizing curves, either extending infinitely or connecting two points. In the hyperboloid model, for example, each geodesic is the intersection of H^d with a Euclidean hyperplane in the ambient space R^(d+1). The geodesic at a point x ∈ H^d in direction u can be written as

γ(t) = cosh(t) x + sinh(t) u,    (5)

where u is an element of the tangent space T_x = {u ∈ R^(d+1) : ⟨x, u⟩_L = 0} with ⟨u, u⟩_L = 1. In the Poincaré model, the geodesics are precisely the segments of Euclidean circles and lines that meet the boundary of D^d at a right angle. A convenient formula for the geodesic arc between two points p, q ∈ D^d can be given in terms of gyrovectorspace calculus, see (8).
The value of the exponential map exp_x(tu) is the result of following a geodesic in a normalized direction u at speed t > 0, starting at a given point x in hyperbolic space. Identifying R^d with the tangent space T_x at x, the exponential map provides a convenient way to embed R^d into hyperbolic space with origin at x. The exponential map is the most often used function in hyperbolic learning for computer vision, as it allows us to map visual representations from Euclidean to hyperbolic space. In the hyperboloid model, the exponential map coincides with the expression of the geodesic given in (5). In the Poincaré model, the exponential map can be conveniently written in terms of gyrovectorspace addition and is given in (9). Finally, the hyperbolic translation τ_x, also called Lorentz boost, Möbius transformation, or gyrovectorspace addition, is the unique distance-preserving transformation of hyperbolic space that moves 0 to a given point x. In the hyperboloid model, it can be represented by a linear map of the ambient space R^(d+1). In the Poincaré model, hyperbolic translations are known as gyrovectorspace addition and form the basic operation of gyrovectorspace calculus.
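To make operations 1 and 4 concrete, the Poincaré distance and the exponential map at the origin can be sketched in a few lines of NumPy (a minimal sketch for κ = 1; the function names are ours, not from any specific library):

```python
import numpy as np

def poincare_distance(x, y):
    """Geodesic distance in the Poincare ball model (curvature kappa = 1)."""
    sq = np.dot(x - y, x - y)
    denom = (1.0 - np.dot(x, x)) * (1.0 - np.dot(y, y))
    return np.arccosh(1.0 + 2.0 * sq / denom)

def expmap0(v):
    """Exponential map at the origin, in the convention of Ganea et al (2018b)
    with c = 1. The metric scales tangent vectors by a factor 2 at the origin,
    so the result lies at hyperbolic distance 2 * ||v|| from it."""
    n = np.linalg.norm(v)
    if n < 1e-12:
        return np.zeros_like(v)
    return np.tanh(n) * v / n

# Exponential expansion in practice: a small Euclidean step near the boundary
# covers a large hyperbolic distance.
o = np.zeros(2)
d_inner = poincare_distance(o, np.array([0.9, 0.0]))
d_outer = poincare_distance(o, np.array([0.99, 0.0]))
```

The last two lines illustrate why distances from the origin are unbounded: moving the final 0.09 towards the boundary adds more hyperbolic distance than the first 0.9.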

Gyrovectorspace calculus
Gyrovectorspace calculus, as introduced by Ungar (2005, 2012), provides a convenient and rapidly adopted framework for calculations in the Poincaré ball model. Its first basic operation is the (non-commutative) gyrovectorspace addition

p ⊕ q = ((1 + 2⟨p, q⟩ + ‖q‖²) p + (1 − ‖p‖²) q) / (1 + 2⟨p, q⟩ + ‖p‖² ‖q‖²).

As a secondary operation, the (commutative) gyrovectorspace scalar product

t ⊗ p = tanh(t · artanh(‖p‖)) p / ‖p‖

with a scalar t ∈ R is introduced. Hyperbolic translations are directly given by τ_p(q) = p ⊕ q, and the geodesic arc connecting p and q is

γ(t) = p ⊕ (t ⊗ ((−p) ⊕ q)).    (8)

Letting t range through all of R, a full geodesic line is obtained.
In the context of gyrovectorspace calculus, the Poincaré ball is often rescaled with the square root of curvature, setting D^d_κ = {x ∈ R^d : κ‖x‖² < 1}. The advantage of this rescaling is that Euclidean space is obtained as a continuous limit as κ → 0. In the rescaled model, gyrovectorspace addition and scalar product become

p ⊕_κ q = ((1 + 2κ⟨p, q⟩ + κ‖q‖²) p + (1 − κ‖p‖²) q) / (1 + 2κ⟨p, q⟩ + κ²‖p‖²‖q‖²)

and

t ⊗_κ p = (1/√κ) tanh(t · artanh(√κ ‖p‖)) p / ‖p‖

for p, q ∈ D^d_κ. The exponential map in the direction of a tangent vector v ∈ T_p can then be written as

exp_p(v) = p ⊕_κ (tanh(√κ λ^κ_p ‖v‖ / 2) v / (√κ ‖v‖)),  with λ^κ_p = 2 / (1 − κ‖p‖²),    (9)

for p ∈ D^d_κ, see Ganea et al (2018b).
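These gyrovector operations translate directly into code. The following minimal NumPy sketch (our own function names, curvature κ as a parameter) implements gyrovectorspace addition, the scalar product, and the geodesic arc:

```python
import numpy as np

def mobius_add(p, q, k=1.0):
    """Gyrovectorspace addition p (+) q in the rescaled Poincare ball."""
    pq, pp, qq = np.dot(p, q), np.dot(p, p), np.dot(q, q)
    num = (1 + 2 * k * pq + k * qq) * p + (1 - k * pp) * q
    return num / (1 + 2 * k * pq + k ** 2 * pp * qq)

def mobius_scalar(t, p, k=1.0):
    """Gyrovectorspace scalar product t (x) p."""
    n = np.linalg.norm(p)
    if n < 1e-12:
        return np.zeros_like(p)
    return np.tanh(t * np.arctanh(np.sqrt(k) * n)) * p / (np.sqrt(k) * n)

def geodesic_arc(p, q, t, k=1.0):
    """Point at fraction t along the geodesic arc from p to q."""
    return mobius_add(p, mobius_scalar(t, mobius_add(-p, q, k), k), k)
```

At t = 0 and t = 1 the arc returns its endpoints, and the gyrogroup left cancellation law (−p) ⊕ (p ⊕ q) = q serves as a quick sanity check.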

Non-visual hyperbolic learning
The traction of hyperbolic learning in computer vision is built upon advances in embedding hierarchical structures, designing hyperbolic network layers, and hyperbolic learning on other data types such as graphs, text, and more.Below, we discuss these works and their relevance for hyperbolic visual learning literature.
Hyperbolic embedding of hierarchies. Embedding hierarchical structures like trees and taxonomies in Euclidean space suffers from large distortion (Bachmann et al, 2020) and polynomial volume expansion, limiting the capacity to capture the exponential complexity of hierarchies. Hyperbolic space, in contrast, can be thought of as a continuous version of trees (Nickel and Kiela, 2017) and has tree-like properties (Hamann, 2018; Ungar, 2008), like the exponential growth of distances when moving from the origin towards the boundary. Encouraged by this, Nickel and Kiela (2017) propose to embed hierarchical structures in the Poincaré model. The goal is to learn hyperbolic representations for the nodes of a hierarchy, such that distance in the embedding space is inversely related to semantic similarity. Let D = {(u, v)} denote the set of pairs of nodes connected in a given hierarchy. To embed the nodes in the Poincaré model, Nickel and Kiela (2017) minimize the following loss function:

L = − Σ_{(u,v) ∈ D} log ( e^{−d(u,v)} / Σ_{v′ ∈ N(u)} e^{−d(u,v′)} ),

where N(u) = {v′ | (u, v′) ∉ D} ∪ {v} denotes the set of nodes not related to u, including v, as negative examples. The loss function pushes unrelated nodes farther apart than related ones. To evaluate the embedded hierarchy, the distances between pairs of connected nodes (u, v) are calculated and ranked among the negative pairs of nodes (i.e., the nodes not in D), and the mean average precision (MAP) is calculated based on the ranking. Later, Sala et al (2018) propose a combinatorial construction to embed trees in hyperbolic space without optimization and with low distortion, relieving the optimization problems in existing works. Ganea et al (2018a) address drawbacks of (Nickel and Kiela, 2017), including the collapse of points onto the boundary of the space as a result of the loss function and the incapability of encoding asymmetric relations. They introduce entailment cones to embed hierarchies, using a max-margin loss function:

L = Σ_{(u,v) ∈ P} E(u, v) + Σ_{(u,v) ∈ N} max(0, γ − E(u, v)),

where γ, P, and N indicate the margin and the positive and negative
edges, respectively. E(u, v) is a penalty term that forces child nodes to fall under the cone of their parent node. Amongst others, hyperbolic embeddings have been proposed for multi-relational graphs (Balazevic et al, 2019), low-dimensional knowledge graphs (Chami et al, 2020b), and learning continuous hierarchies in the Lorentz model (Nickel and Kiela, 2018).
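As a toy illustration of the embedding objective of Nickel and Kiela (2017), the per-pair loss can be sketched as follows (a NumPy sketch with our own naming; the actual method optimizes all pairs jointly with Riemannian SGD):

```python
import numpy as np

def poincare_distance(u, v):
    sq = np.dot(u - v, u - v)
    return np.arccosh(1 + 2 * sq / ((1 - np.dot(u, u)) * (1 - np.dot(v, v))))

def pair_loss(u, v, negatives):
    """Negative log-likelihood of the related node v among the negatives N(u),
    which by definition also contains v itself."""
    d_pos = poincare_distance(u, v)
    d_all = np.array([poincare_distance(u, n) for n in negatives] + [d_pos])
    return d_pos + np.log(np.sum(np.exp(-d_all)))
```

Pulling a related node closer to u lowers the loss; pushing unrelated nodes away does the same.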
Hyperbolic neural networks. Foundational in the transition of deep learning towards hyperbolic space is the development of hyperbolic network layers and their optimization. We consider two pivotal papers here that provide such a theoretical foundation, namely Hyperbolic Neural Networks by Ganea et al (2018b) and Hyperbolic Neural Networks++ by Shimizu et al (2021). Ganea et al (2018b) show, among others, how to generalize a linear layer f : R^n → R^m to hyperbolic space as the Möbius version of f, a map from D^n → D^m defined as:

f^⊗(x) = exp_0(f(log_0(x))),

with exp_0 : T_0 D^m_c → D^m_c and log_0 : D^n_c → T_0 D^n_c. They furthermore outline how to create recurrent network layers.
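A Möbius layer of this form is only a few lines in practice. The sketch below (our own naming, c = 1) pulls a point back to the tangent space at the origin, applies a Euclidean linear map, and projects the result back onto the ball:

```python
import numpy as np

def expmap0(v):
    n = np.linalg.norm(v)
    return v.copy() if n < 1e-12 else np.tanh(n) * v / n

def logmap0(x):
    n = np.linalg.norm(x)
    return x.copy() if n < 1e-12 else np.arctanh(n) * x / n

def mobius_linear(W, x):
    """Mobius version of the linear map W: exp_0(W @ log_0(x))."""
    return expmap0(W @ logmap0(x))
```

The output always lands back inside the unit ball, and the identity matrix recovers the input.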
Shimizu et al (2021) reformulate the hyperbolic logistic regression of (Ganea et al, 2018b) to reduce the number of parameters to the same level as Euclidean logistic regression. The new formulation is p(y = k|x) ∝ exp(v_k(x)), where r_k ∈ R and z_k ∈ T_0 B^n_c = R^n are the parameters for each class. Their linear layer is in turn built from this reformulated logistic regression. More importantly for computer vision, they show how to formulate convolutional layers using the Poincaré fully connected layer and β-concatenation. To do so, they generalize the hyperbolic linear layer to image patches through β-splits and β-concatenations, leading in principle to arbitrary-dimensional convolutional layers. Moreover, Poincaré multi-head attention is possible through the same operators.
Beyond text and graphs, hyperbolic learning has been shown to be beneficial for several other research directions, including but not limited to learning representations for molecular/cellular structures (Klimovskaia et al, 2020; Yu et al, 2020; Wu et al, 2021), recommender systems (Mirvakhabova et al, 2020; Wang et al, 2021; Yang et al, 2022), skeletal action recognition (Franco et al, 2023), LiDAR data (Tong et al, 2022; Wang et al, 2023), point clouds (Montanaro et al, 2022; Anvekar and Bazazian, 2023), and 3D shapes (Chen et al, 2020b). In summary, hyperbolic geometry has impacted a wide range of research fields. This survey focuses specifically on its impact and potential in the visual domain.

Supervised hyperbolic visual learning
In Figure 2, we provide an overview of the literature on supervised learning with hyperbolic geometry in computer vision. In current vision works, hyperbolic learning is mostly performed at the embedding- or classifier-level. In other words, current works rely on standard networks for feature learning and transform the output embeddings to hyperbolic space for the final learning stage. For supervised learning in hyperbolic space, we have identified three main optimization strategies:
1. Sample-to-gyroplane learning denotes the setting where classes are represented by hyperbolic hyperplanes, i.e., gyroplanes, with networks optimized based on confidence logit scores between samples and gyroplanes.
2. Sample-to-prototype learning denotes the setting where class semantics are represented as points in hyperbolic space, and networks are optimized to minimize hyperbolic distances between samples and prototypes.
3. Sample-to-sample learning denotes the setting where networks are optimized by learning metrics or contrastive objectives between samples in a batch.
For all strategies, let (x, y) denote the visual input x, which can be an image or a video, and the corresponding label y ∈ Y. Let f_θ(x) ∈ R^D denote its Euclidean embedding after going through a network. This representation is mapped to hyperbolic space using the exponential map, denoted as g(x) = exp_0(f_θ(x)). In many hyperbolic works, additional information about hierarchical relations between classes is assumed. Let H = (Y, P, R), with Y the class labels denoting the leaf nodes of the hierarchy, P the internal nodes, and R the set of hypernym-hyponym relations of the hierarchy. Below, we discuss in turn how current literature tackles each strategy.

Sample-to-gyroplane learning
The most direct way to induce hyperbolic geometry in the classification space is by replacing the classification layer with a hyperbolic alternative. This can be done either by means of hyperbolic logistic regression or through hyperbolic kernel machines.
Hyperbolic logistic regression. Khrulkov et al (2020) incorporate a hyperbolic classifier by taking a standard convolutional network and mapping the outputs of the last hidden layer to hyperbolic space using an exponential map. Afterwards, the hyperbolic multinomial logistic regression described by Ganea et al (2018b) is used to obtain class logits, which can be optimized with cross-entropy. They find that training a hyperbolic classifier on top of a convolutional network makes it possible to obtain uncertainty information based on the distance of the hyperbolic embeddings of images to the origin.
Out-of-distribution samples on average have a smaller norm, making it possible to differentiate in- from out-of-distribution samples by sorting them by their distance to the origin. Hong et al (2022) show that hyperbolic classification is beneficial for visual anomaly recognition tasks, such as out-of-distribution detection in image classification and segmentation. Araño et al (2021) use hyperbolic layers to perform multi-modal sentiment analysis based on the audio, video, and text modalities. Ahmad and Lecue (2022) show the benefit of hyperbolic space for object recognition with ultra-wide field-of-view lenses. Guo et al (2022) address a limitation when training classifiers in hyperbolic space, namely a vanishing gradient problem due to the hybrid architecture of current hyperbolic approaches in computer vision, where Euclidean features are connected to a hyperbolic classifier. Equation 12 highlights that to maximize the likelihood of correct predictions, the distance to hyperbolic gyroplanes needs to be maximized. In practice, embeddings of samples are pushed to the boundary of the Poincaré ball. As a result, the inverse of the Riemannian metric tensor approaches zero, resulting in small gradients. This finding is in line with several other works on vanishing gradients in hyperbolic representation learning (Nickel and Kiela, 2018; Liu et al, 2019).
To combat the vanishing gradient problem, Guo et al (2022) propose to clip the Euclidean embeddings of samples before the exponential mapping, i.e.:

Clip(x) = min(1, r/‖x‖) · x,

with r as a hyperparameter. This trick improves learning with hyperbolic multinomial logistic regression, especially when dealing with many classes, such as on ImageNet. Furthermore, training with clipped hyperbolic classifiers improves out-of-distribution detection over training with Euclidean classifiers, while also being more robust to adversarial attacks.
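The clipping itself is a one-liner in practice; a minimal sketch (our naming, with an illustrative default for r):

```python
import numpy as np

def clip_features(x, r=1.0):
    """Bound the Euclidean feature norm by r before the exponential map, so
    embeddings stay away from the Poincare boundary where gradients vanish."""
    n = np.linalg.norm(x)
    return x.copy() if n <= r else r * x / n
```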
Next to global classification, a few recent works have investigated hyperbolic logistic regression for structured prediction tasks such as object detection and image segmentation. Valada (2022) extends object detection with hyperbolic geometry, amongst others by replacing the classifier head of a two-stage detector like Sparse R-CNN (Sun et al, 2021) with a hyperbolic logistic regression, improving object detection performance in standard and zero-shot settings. Ghadimi Atigh et al (2022) introduce Hyperbolic Image Segmentation, where the final per-pixel classification is performed in hyperbolic space. Starting from the geometric interpretation of hyperbolic gyroplanes of Ganea et al (2018b), they find that simultaneously computing class logits over all pixels of all images in a batch, as is customary in Euclidean networks, is not directly applicable in hyperbolic space. This is because the explicit computation of the Möbius addition requires evaluating a tensor in R^(W×H×|Y|×d) for images of size (W × H) with d embedding dimensions. Instead, they rewrite the Möbius addition as:

x ⊕_c y = α x + β y,  with α = (1 + 2c⟨x, y⟩ + c‖y‖²) / (1 + 2c⟨x, y⟩ + c²‖x‖²‖y‖²) and β = (1 − c‖x‖²) / (1 + 2c⟨x, y⟩ + c²‖x‖²‖y‖²).

This rewrite reduces the addition to adding two tensors in R^(W×H×|Y|), allowing for per-pixel evaluation on image batches. For training, Ghadimi Atigh et al (2022) incorporate hierarchical information by replacing the one-hot softmax with a hierarchical softmax:

p(ŷ = y | x) = ∏_{h ∈ H_y} exp(v_h(x)) / Σ_{s ∈ S_h} exp(v_s(x)),

with H_y = {y} ∪ A_y the set containing y and its ancestors, and S_h the set of siblings of class h. Performing per-pixel classification with hyperbolic hierarchical logistic regression opens up multiple new doors for image segmentation. First, the notion of uncertainty as given by the hyperbolic norm of output embeddings generalizes naturally to the pixel level. As shown in Figure 3, the norm of pixel embeddings correlates with semantic ambiguity; the closer a pixel is to a semantic boundary, the lower its norm. Chen et al (2022) have already used this insight to improve image segmentation. They outline a hyperbolic uncertainty
loss, where the cross-entropy loss of a pixel is weighted according to its hyperbolic uncertainty, based on the ratio between the hyperbolic norm of the pixel embedding and that of the most confident pixel s, with t a hyperparameter set to 1.02 in order to have a wide weight variation while avoiding division by zero. Adding this weight to the cross-entropy pixel loss consistently improves segmentation results for well-known segmentation networks. Other benefits of hyperbolic image segmentation include better zero-label generalization and higher effectiveness with few embedding dimensions compared to Euclidean pixel embeddings.
Hyperbolic kernel machines. Next to logistic regression, Cho et al (2019) provide a general formulation for kernel methods in hyperbolic space with large-margin classifiers. Fang et al (2021) introduce positive definite kernel functions in hyperbolic space and show their potential for computer vision. Specifically, they propose hyperbolic instantiations of tangent kernels, radial basis function kernels, (generalized) Laplace kernels, and binomial kernels. The kernels can be plugged on top of convolutional networks and trained with cross-entropy to benefit from both the representation learning of the convolutional layers and the hyperbolic kernel dynamics in the classifier. Deep learning with hyperbolic kernel methods improves few-shot learning, person re-identification, and knowledge distillation. Zero-shot learning is even enabled through kernel distances between visual embeddings and semantic class representations.

Sample-to-prototype learning
The most popular strategy in hyperbolic learning is to represent classes as prototypes, i.e., as points in hyperbolic space. In this research direction, there are two solutions: embedding classes based on their sample mean, in the spirit of Prototypical Networks (ProtoNet) (Snell et al, 2017), or embedding classes based on a given hierarchy over all classes.
Hyperbolic ProtoNet. In Prototypical Networks (Snell et al, 2017), the prototype of a class k is determined as the mean vector of the samples belonging to that class:

c_k = (1/|S_k|) Σ_{x_i ∈ S_k} f_θ(x_i),

with S_k the set of samples belonging to class k. Inference can in turn be performed by assigning a test sample the label of its nearest prototype. Khrulkov et al (2020) generalize this formulation to Hyperbolic Prototypical Networks. Since computing averages in the Poincaré ball model requires expensive Fréchet mean calculations, they perform averaging using the Einstein midpoint, given in Klein coordinates as:

m_k = Σ_i γ_i g_K(x_i) / Σ_i γ_i,

with γ_i the Lorentz factors:

γ_i = 1 / √(1 − ‖g_K(x_i)‖²).

Since Khrulkov et al (2020) operate in the Poincaré ball model, this averaging operation requires transforming embeddings to and from the Klein model:

g_K(x_i) = 2 g_D(x_i) / (1 + ‖g_D(x_i)‖²),    g_D(x_i) = g_K(x_i) / (1 + √(1 − ‖g_K(x_i)‖²)),

with g_D(x_i) and g_K(x_i) the embeddings of input x_i in respectively the Poincaré ball model and the Klein model. Akin to its Euclidean counterpart, Hyperbolic ProtoNet is used to address few-shot learning, where the sample mean prototype serves as the class representation. Khrulkov et al (2020) show that performing prototypical few-shot learning in hyperbolic space is competitive with Euclidean prototypical learning, even resulting in better accuracy scores when relying on a 4-layer ConvNet as the backbone.
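The Einstein midpoint prototype, including the Poincaré-to-Klein conversions above, can be sketched in a few lines (a minimal NumPy sketch; the function names are ours):

```python
import numpy as np

def poincare_to_klein(x):
    return 2.0 * x / (1.0 + np.dot(x, x))

def klein_to_poincare(x):
    return x / (1.0 + np.sqrt(1.0 - np.dot(x, x)))

def einstein_midpoint(points):
    """Class prototype as the Lorentz-factor weighted mean in Klein coordinates,
    mapped back to the Poincare ball."""
    ks = np.stack([poincare_to_klein(p) for p in points])
    gamma = 1.0 / np.sqrt(1.0 - np.sum(ks * ks, axis=1))  # Lorentz factors
    mid = (gamma[:, None] * ks).sum(axis=0) / gamma.sum()
    return klein_to_poincare(mid)
```

A single point is its own prototype, and symmetric points average to the origin, mirroring the behavior of the Euclidean mean.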
As a follow-up work, Gao et al (2021) show that different tasks and even individual classes in few-shot learning benefit from their own curvature, which they generate per task. The hyperbolic clipping of Guo et al (2022) is also effective for few-shot learning, consistently outperforming the standard ProtoNet and Hyperbolic ProtoNet on the CUB Birds and miniImageNet few-shot benchmarks. A few other works have extended Hyperbolic ProtoNet for few-shot learning with set- and grouplet-based learning and will be discussed in the sample-to-sample learning section.
Recently, Gao et al (2022) investigate feature augmentation in hyperbolic space to address overfitting when dealing with limited data. On top, they introduce a scheme to estimate the feature distribution using neural ODEs. These elements are then plugged into few-shot approaches such as the hyperbolic prototypical networks of Khrulkov et al (2020), improving performance. Choudhary and Reddy (2022) improve hyperbolic few-shot learning by reformulating hyperbolic neural networks through Taylor series expansions of hyperbolic trigonometric functions, showing that this improves scalability and compatibility while outperforming Hyperbolic ProtoNet.
Hierarchical embedding of prototypes. Where Hyperbolic ProtoNets are effective in few-shot settings, a number of works have also investigated prototype-based solutions for general classification. As a starting point, these works commonly assume that the classes in a dataset are organized in a hierarchy, see Figure 4. Long et al (2020) embed an action class hierarchy H in hyperbolic space using hyperbolic entailment cones (Ganea et al, 2018a), with an additional loss to increase the angular separation between leaf nodes to avoid inter-label confusion amongst class labels Y. With L_H(H) as the hyperbolic embedding loss for hierarchy H, a separation-based loss is added over the ℓ2-normalized representations of the leaf nodes P to maximize their pairwise angular separation. By combining the hierarchical and separation-based losses, the hierarchy is embedded to balance hierarchical constraints and discriminative abilities. The embedding is learned a priori, after which video embeddings are projected to the same hyperbolic space and optimized towards their correct class embedding. This approach improves action recognition, zero-shot action classification, and hierarchical action search. In a similar spirit, Dhall et al (2020) show that using hyperbolic entailment cones for image classification empirically outperforms Euclidean entailment cones. Rather than separating hierarchical and visual embedding learning, Yu et al (2022) propose to simultaneously learn hierarchical and visual representations for skin lesion recognition in images. Image embeddings are optimized towards their correct class prototype, while the classes are optimized to abide by their hyperbolic entailment cones, with an extra distortion loss to obtain better hierarchical embeddings. Gulshad et al (2023) propose the Hierarchical Prototype Explainer, a reasoning model in hyperbolic space to provide explainability in video action recognition. Their approach learns hierarchical
prototypes at different levels of granularity, e.g., the parent and grandparent levels, to explain the recognized action in a video. By learning hierarchical prototypes, they can provide explanations at different levels of granularity, including interpretations of the prediction of a specific class label and information on the spatio-temporal parts that contribute to the final prediction. Li et al (2023) investigate the semantic space of action recognition datasets and bridge the gap between different labeling systems. To achieve unified action learning, actions are connected into a hierarchy using VerbNet (Schuler, 2005) and embedded as prototypes in hyperbolic space.
Hierarchical prototype embeddings have also been successfully employed in the zero-shot domain. Liu et al (2020) show how to perform zero-shot learning with hyperbolic embeddings, where classes are embedded using their WordNet-based Poincaré Embeddings (Nickel and Kiela, 2017). For standard classification, Ghadimi Atigh et al (2021) show how to integrate uniformity amongst prototypes in hyperbolic space by embedding classes with maximum separation on the boundary of the Poincaré ball, following (Mettes et al, 2019; Kasarla et al, 2022). With prototypes now at the boundary of the ball, standard distance functions no longer apply, since the boundary is at infinite distance from any point within the ball. To that end, they propose to use the Busemann distance, which is given for a hyperbolic image embedding g(x) and prototype p as:

b_p(g(x)) = log( ‖p − g(x)‖² / (1 − ‖g(x)‖²) ).

By fixing prototypes with maximum separation a priori and minimizing this distance function with an extra regularization towards the origin, it becomes possible to perform hyperbolic prototypical learning with prototypes at the ideal boundary. Ghadimi Atigh et al (2021) show that such an approach has direct links with conventional logistic regression in the binary case, highlighting its inherent properties. Moreover, maximally separated prototypes can also be replaced by prototypes from word embeddings or hierarchical knowledge, depending on the available knowledge and the task at hand. In addition to standard classification, hierarchical hyperbolic embeddings have demonstrated effectiveness in continual learning (Gao et al, 2023). To learn from new data, Gao et al (2023) propose a dynamically expanding geometry through a mixed-curvature space, enabling the learning of complex hierarchies in a data stream. To prevent forgetting, angle-regularization and neighbor-robustness losses are used to preserve the geometry of the old data. Few-shot learning has also been investigated with hierarchical knowledge. Zhang et al (2022) perform such few-shot learning by first training a
network on a joint classification and hierarchical consistency objective.The classification is given as a softmax over the class probabilities, as well as the softmax over the superclasses.In the few-shot inference stage, class prototypes are obtained through hyperbolic graph propagation to deal with the limited sample setting, improving few-shot learning as a result.
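The Busemann distance to an ideal prototype has a simple closed form on the Poincaré ball, sketched below in NumPy. The regularizer form and its weight phi in prototype_loss are assumptions for illustration, not the authors' exact implementation.

```python
import numpy as np

def busemann(z, p):
    # Busemann distance to ideal point p (with ||p|| = 1) from z inside
    # the ball: b_p(z) = log(||p - z||^2 / (1 - ||z||^2))
    z = np.asarray(z, float)
    p = np.asarray(p, float)
    return float(np.log(np.sum((p - z) ** 2) / (1.0 - np.sum(z ** 2))))

def prototype_loss(z, p, phi=0.1):
    # Busemann distance plus a pull toward the origin; the exact form and
    # weight phi of the regularizer are assumptions for illustration
    z = np.asarray(z, float)
    return busemann(z, p) - phi * np.log(1.0 - np.sum(z ** 2))
```

At the origin the distance to any ideal prototype is zero; it decreases toward minus infinity as the embedding approaches its prototype and grows when moving toward other boundary points.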

Sample-to-sample learning
Hyperbolic metric learning. Lastly, a number of recent works have investigated hyperbolic learning by contrasting between samples. Ermolov et al (2022) investigate the potential of hyperbolic embeddings for metric learning. In metric learning, the de facto solution is to match representations of sample pairs based on embeddings given by a pre-trained encoder. Rather than relying on Euclidean distances and contrastive learning for optimization, they propose a hyperbolic pairwise cross-entropy loss. Given a dataset with |Y| classes, each batch samples two samples from each category, i.e., K = 2 · |Y|. The loss for a positive pair (i, j) with the same class label treats the scaled negative distances −D(·, ·)/τ as logits, where D(·, ·) can be either a hyperbolic or a cosine distance and τ denotes a temperature hyperparameter. This loss is computed over all positive pairs (i, j) and (j, i) in a batch. Using supervised (Dosovitskiy et al, 2021) and self-supervised (Caron et al, 2021) vision transformers as encoders, hyperbolic metric learning consistently outperforms Euclidean alternatives and sets the state-of-the-art on fine-grained datasets. Figure 5 shows a 2D projection of the embeddings learned with hyperbolic metric learning on vision transformers, where classes are grouped towards the boundary and latent hierarchical neighborhood relations emerge.
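The pairwise cross-entropy loss can be sketched as follows. This is a minimal NumPy illustration that treats negative hyperbolic distances scaled by τ as logits; the batch construction is simplified relative to Ermolov et al (2022).

```python
import numpy as np

def poincare_dist(x, y):
    # Geodesic distance on the Poincare ball (curvature -1)
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + 2 * sq / denom)

def pairwise_ce_loss(z, labels, tau=0.2):
    # For each positive pair (i, j), treat -D(z_i, z_k)/tau over all
    # k != i as logits and apply cross-entropy toward the positive j
    K = len(z)
    loss, n_pos = 0.0, 0
    for i in range(K):
        for j in range(K):
            if i == j or labels[i] != labels[j]:
                continue
            logits = np.array([-poincare_dist(z[i], z[k]) / tau
                               for k in range(K) if k != i])
            pos_logit = -poincare_dist(z[i], z[j]) / tau
            loss += np.log(np.sum(np.exp(logits))) - pos_logit
            n_pos += 1
    return loss / max(n_pos, 1)
```

A batch whose labels agree with the embedding clusters yields a lower loss than the same batch with shuffled labels, which is the behavior the loss is meant to enforce.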

Hyperbolic metric learning has proven effective in overcoming overfitting and catastrophic forgetting in few-shot class-incremental learning, as explored by Cui et al (2022). This is done by adding a metric learning loss as part of the distillation in continual learning. They also propose a hyperbolic version of Reciprocal Point Learning (Chen et al, 2020a) to provide extra-class space for known categories in the few-shot learning stage. Yan et al (2023) also explore hyperbolic metric learning, incorporating noise-insensitive and adaptive hierarchical similarity to handle noisy labels and multi-level relations. Kim et al (2022) add a hierarchical regularization term on top of metric learning approaches, with the goal of learning hierarchical ancestors in hyperbolic space without any annotation. Hyperbolic metric learning is furthermore effective in semantic hashing (Amin et al, 2022), face recognition via large-margin nearest-neighbor learning (Trpin and Boshkoska, 2022), and multi-modal alignment between videos and knowledge graphs (Guo et al, 2021).
Following the progress of large language models and the success of vision-language models (e.g., CLIP (Radford et al, 2021)) in multimodal representation learning, Desai et al (2023) propose a hyperbolic image-text representation. The proposed method first processes the input image and text using two separate encoders. The generated embeddings are then projected into hyperbolic space, and training is performed using a contrastive and an entailment loss. The paper shows that the proposed approach outperforms the Euclidean CLIP, as it is capable of capturing hierarchical multimodal relations in hyperbolic space.
Hyperbolic set-based learning. Where sample-to-prototype and sample-to-sample approaches compare samples to individual elements, some works have shown that set-based and group-based distances are more effective and robust. Ma et al (2022) introduce an adaptive sample-to-set distance function in the context of few-shot learning. Rather than aggregating support samples into a single prototype, an adaptive sample-to-set approach is proposed to increase robustness to outliers. The sample-to-set function is a weighted average of the distances from the query to all support samples, where each distance is calculated with a small network over the feature maps of the query and support samples. This approach benefits few-shot learning, especially when dealing with outliers.
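The sample-to-set idea can be sketched as a weighted average of query-to-support distances. In the sketch below, softmax weights over negative distances stand in for the learned weighting network of Ma et al (2022); the inverse temperature beta is an assumption.

```python
import numpy as np

def poincare_dist(x, y):
    # Geodesic distance on the Poincare ball (curvature -1)
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + 2 * sq / denom)

def sample_to_set_dist(query, support, beta=5.0):
    # Weighted average of query-to-support distances; softmax weights
    # over negative distances downweight outlying support samples
    d = np.array([poincare_dist(query, s) for s in support])
    w = np.exp(-beta * d)
    w /= w.sum()
    return float(np.dot(w, d))
```

Because far-away support samples receive exponentially smaller weights, the adaptive distance is less affected by an outlier than the plain mean of distances.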
In the context of metric learning, Zhang et al (2021a) argue that sample-to-sample learning is computationally expensive, while sample-to-prototype learning is less accurate. They propose a hybrid strategy based on grouplets. Each grouplet is a random subset of samples, and the set of grouplets is matched with prototypes through differentiable optimal transport. Akin to Ermolov et al (2022), they show that using hyperbolic embedding spaces improves metric learning on fine-grained datasets. Moreover, they provide empirical evidence that other metric-based losses benefit from hyperbolic embeddings, highlighting the general utility of hyperbolic space for metric learning.

Fig. 6: The three major methods for unsupervised hyperbolic learning in computer vision. Current literature performs unsupervised learning in hyperbolic space using (i) generative models, (ii) clustering, and (iii) self-supervised learning.

Unsupervised hyperbolic visual learning
Hyperbolic learning has been actively researched in the unsupervised domain of computer vision. We identify three dominant research directions in which hyperbolic deep learning has found success: generative learning, clustering, and self-supervised learning. Below, each is discussed separately.

Hyperbolic VAEs
Variational autoencoders (VAEs) (Kingma and Welling, 2013; Rezende et al, 2014) with hyperbolic latent spaces have been used to learn representations of images. Nagano et al (2019) propose the hyperbolic wrapped normal distribution and derive algorithms both for reparametrizable sampling and for computing the probability density function. They then derive a hyperbolic β-VAE (Higgins et al, 2017) using the wrapped normal as prior and posterior, replacing the usual (Euclidean) Gaussian distribution. The wrapped normal distribution in a manifold M is the pushforward measure under the exponential map exp^M: a sample z is obtained by drawing from a Euclidean Gaussian in the tangent space and applying the exponential map exp^M_µ of M at µ, where the matrix representation G of the metric of M accounts for the change of volume (Mathieu et al, 2019). To accommodate the geometry of the latent space, exponential and logarithmic maps are added at the end of the VAE encoder and before the start of the VAE decoder, respectively. In order to train their hyperbolic VAE with the typical evidence lower bound, Nagano et al (2019) compute the density of the wrapped normal distribution using the change-of-variables formula. Since their sampling algorithm requires the exponential and parallel transport maps, they compute the log-determinants and inverses of these maps in order to apply the change-of-variables formula. Nagano et al (2019) then use their VAE to learn representations of MNIST and of Atari 2600 Breakout screens. On MNIST, hyperbolic representations outperform Euclidean representations at low latent dimensions, but are overtaken starting at dimension 10.

Mathieu et al (2019) extend the work of Nagano et al (2019) by introducing the Riemannian normal distribution and deriving reparametrizable sampling schemes for both the Riemannian normal and the wrapped normal using hyperbolic polar coordinates. The Riemannian normal views the Euclidean normal distribution as the distribution maximizing the entropy for a given mean and standard deviation, and defines a new normal distribution on hyperbolic space with this property:
N(z | µ, σ²) = (1 / Z_R) exp( −d_M(z, µ)² / (2σ²) ),
where Z_R is a normalizing constant. Mathieu et al (2019) additionally introduce the use of a gyroplane layer as the first layer of the decoder, following Ganea et al (2018b). Noting that a Euclidean affine transform can be written as
f_{a,p}(z) = ⟨a, z − p⟩ = sign(⟨a, z − p⟩) ‖a‖ d(z, H_{a,p}),
where H_{a,p} = {z ∈ R^n | ⟨a, z − p⟩ = 0} is the decision hyperplane, they replace each piece of the formula with its hyperbolic counterpart to obtain
f^c_{a,p}(z) = sign(⟨a, log^c_p(z)⟩) ‖a‖_p d_c(z, H^c_{a,p}),
where H^c_{a,p} = {z ∈ H | ⟨a, log^c_p(z)⟩ = 0}. The closed-form formula for the distance term in the Poincaré ball is
d_c(z, H^c_{a,p}) = (1/√c) sinh⁻¹( 2√c |⟨−p ⊕_c z, a⟩| / ((1 − c‖−p ⊕_c z‖²) ‖a‖) ).
Mathieu et al (2019) also use their hyperbolic VAE to learn representations of MNIST and find that both the Riemannian normal and the gyroplane layer improve test log-likelihoods, especially at low latent dimensions.
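The wrapped normal sampling procedure of Nagano et al (2019) can be sketched in the Lorentz model: draw a Euclidean Gaussian sample in the tangent space at the hyperboloid origin, parallel transport it to µ, and push it onto the manifold with the exponential map. A minimal NumPy sketch:

```python
import numpy as np

def minkowski_dot(x, y):
    # Lorentzian inner product <x, y>_L = -x0*y0 + <x_1:, y_1:>
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def sample_wrapped_normal(mu, Sigma, rng):
    # 1) Gaussian sample in the tangent space at the hyperboloid origin
    n = len(mu) - 1
    mu0 = np.zeros(n + 1)
    mu0[0] = 1.0
    v = np.zeros(n + 1)
    v[1:] = rng.multivariate_normal(np.zeros(n), Sigma)
    # 2) Parallel transport the tangent vector from mu0 to mu
    alpha = -minkowski_dot(mu0, mu)
    u = v + minkowski_dot(mu, v) / (alpha + 1.0) * (mu0 + mu)
    # 3) Exponential map at mu maps the tangent vector onto the manifold
    norm_u = np.sqrt(max(minkowski_dot(u, u), 1e-15))
    return np.cosh(norm_u) * mu + np.sinh(norm_u) * u / norm_u
```

A sanity check is that every sample lies on the hyperboloid, i.e., ⟨z, z⟩_L = −1 with positive time-like coordinate.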
Cho et al (2022) extend the previous two works by proposing a new version of the hyperbolic wrapped normal distribution (HWN). Their primary observation is that for the wrapped normal distribution, the principal axes of the distribution are not aligned with the local standard axes, see Figure 7. They propose a new sampling process that fixes the alignment of the principal axes, resulting in a new distribution which they call the rotated hyperbolic wrapped normal (RoWN). Given a mean µ in the Lorentz model of hyperbolic geometry and a diagonal covariance matrix Σ, samples from the RoWN distribution are drawn as follows: 1. Find the rotation matrix R that rotates the x-axis x̄ = [±1, 0, . . ., 0] to ȳ = µ_{1:}, the space-like part of µ; R can be computed in closed form.

Hyperbolic GANs
Using the intuition that images are organized hierarchically, several works have proposed hyperbolic generative adversarial networks (GANs).

Fig. 8: Hierarchical attribute editing in hyperbolic space is possible due to hyperbolic space's ability to encode semantic hierarchical structure within image data. Changing the high-level, category-relevant details (closest to the origin) changes the category, while changing low-level (farthest from the origin), category-irrelevant attributes varies images within categories. Image courtesy of Li et al (2022).

Qu and Zou (2022) propose HAEGAN, a hyperbolic autoencoder and GAN framework in the Lorentz model L (also known as the hyperboloid model) of hyperbolic geometry. The GAN is based on the structure
of WGAN-GP (Arjovsky et al, 2017; Gulrajani et al, 2017). HAEGAN consists of an encoder, which takes in real data and generates real representations, and a generator, which takes in noise and generates fake representations. A critic is trained to distinguish between the two representations, and a decoder takes the fake representations and produces the final generated object. Qu and Zou (2022) generalize WGAN-GP to hyperbolic space using three operations: the first is the hyperbolic linear layer HLinear_{n,m}: L^n_K → L^m_K of Chen et al (2021), the second is the hyperbolic centroid distance layer HCDist_{n,m}(x) of Liu et al (2019), and the third is a new Lorentz concatenation layer.

Li et al (2022) propose a hyperbolic method for few-shot image generation. The main idea is that hyperbolic space encodes a semantic hierarchy, where the root of the hierarchy (i.e., the center of hyperbolic space) is a category, e.g., dog. At lower levels, we have more fine-grained separations, such as subcategories, e.g., Shih-Tzu and Ridgeback dogs. Finally, at the lowest level, there are category-irrelevant features, e.g., the hair color or pose of the dog (see Figure 8). This method builds on the Euclidean pSp method (Richardson et al, 2021) for image-to-image translation. The pSp method uses a feature pyramid to extract feature maps and a set of projection heads on these feature maps to produce each of the style vectors required by StyleGAN (Karras et al, 2019, 2020), commonly denoted the W+-space. Image-to-image translation can then be done by editing or replacing style vectors. Li et al (2022) generalize to hyperbolic space by mapping the output of a frozen, pre-trained pSp encoder to hyperbolic space and then back to the W+-space of style vectors, and then feeding the style vectors into a frozen, pre-trained StyleGAN. Projection to hyperbolic space is done using the Möbius layer f^{⊗c} of Ganea et al (2018b), with the full projection layer having the
form, with mapping back to the W+-space achieved by a logarithmic map followed by an MLP. Li et al (2022) supervise the hyperbolic latent space with a hyperbolic classification loss based on the multinomial logistic regression formulation of Ganea et al (2018b); after calculating the class probabilities, the loss is simply the negative log-likelihood. The full loss function is the pSp loss function plus this term, excluding the specific facial reconstruction loss used by the pSp method, since Li et al (2022) do not focus on face generation. Li et al (2022) perform image generation as follows: given an image x_i, the image is embedded in hyperbolic space with representation g_D(x_i) and rescaled to the desired radius (i.e., fine-grainedness) r. A random vector is then sampled from the seen categories, and a point is taken on the geodesic between the two points. Li et al (2022) find that their method is competitive with state-of-the-art methods and shows promise for image-to-image transfer.

Hyperbolic Normalizing Flows
Bose et al (2020) propose hyperbolic normalizing flows that generalize the Euclidean normalizing flow RealNVP (Dinh et al, 2016) to hyperbolic space. They propose two types of coupling: the first, which they call tangent coupling, carries out the coupling layer of RealNVP in the tangent space at the hyperbolic origin o: the input is mapped to the tangent space, an affine coupling parameterized by neural networks s and t with a pointwise nonlinearity σ is applied, and the result is mapped back with the exponential map.
The second, wrapped hyperboloid coupling, extends tangent coupling by using parallel transport to map intermediate vectors from the tangent space of the origin to the tangent space of another point in hyperbolic space (see Figure 9). Compared to tangent coupling, wrapped hyperboloid coupling allows the flow to leverage different parts of the manifold instead of just the origin. The paper also derives the inverses and Jacobian determinants of the two flows. As is the case for hyperbolic VAEs, Bose et al (2020) also benchmark on MNIST and find a trend similar to Nagano et al (2019): the performance of hyperbolic models exceeds that of equivalent Euclidean models at low dimension, but as early as latent dimension 6 Euclidean models overtake hyperbolic models. Bose et al (2020) find that hyperbolic normalizing flows outperform hyperbolic VAEs at these low latent dimensions.
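A tangent coupling step can be illustrated with a simplified sketch in the Poincaré ball (Bose et al (2020) work in the Lorentz model); the toy networks s_net and t_net are stand-ins for the learned networks s and t.

```python
import numpy as np

def exp0(v):
    # Exponential map at the origin of the Poincare ball
    n = np.linalg.norm(v)
    return np.tanh(n) * v / n if n > 0 else v

def log0(x):
    # Logarithmic map at the origin of the Poincare ball
    n = np.linalg.norm(x)
    return np.arctanh(n) * x / n if n > 0 else x

def s_net(h):
    # Toy stand-in for the scale network s
    return 0.5 * h

def t_net(h):
    # Toy stand-in for the translation network t
    return 0.1 * h

def tangent_coupling(x):
    # Map to the tangent space at the origin, apply a RealNVP-style
    # affine coupling, and map back onto the manifold
    v = log0(x)
    d = len(v) // 2
    v1, v2 = v[:d], v[d:]
    z2 = v2 * np.exp(s_net(v1)) + t_net(v1)
    return exp0(np.concatenate([v1, z2]))

def tangent_coupling_inv(z):
    # Exact inverse of the coupling step
    v = log0(z)
    d = len(v) // 2
    v1, z2 = v[:d], v[d:]
    v2 = (z2 - t_net(v1)) * np.exp(-s_net(v1))
    return exp0(np.concatenate([v1, v2]))
```

The key property of a flow layer, exact invertibility, is easy to verify: composing the step with its inverse recovers the input, and outputs stay inside the ball because of the tanh in the exponential map.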

Clustering
Due to the close relationship between hyperbolic space, hierarchies, and trees, several works have explored hierarchical clustering using hyperbolic space. Monath et al (2019) propose to perform hierarchical clustering using hyperbolic representations. Given a dataset D, Monath et al (2019) require a hyperbolic representation at the edge of the Poincaré disk D^d for each data point x_i ∈ D; these become the leaves of the hierarchical clustering. Their method creates a hierarchical clustering by optimizing the hyperbolic representations of a fixed number of internal nodes. Parent-child dissimilarity between a child representation z_c and a parent representation z_p is measured by a function that encourages children to have larger norms than their parents, and a discrete tree can then be extracted from the optimized representations. The internal nodes are supervised by two losses: first, a hierarchical clustering loss based on Dasgupta's cost (Dasgupta, 2016) and its continuous extension due to Wang and Wang (2018), which reformulates the cost in terms of lowest common ancestors (LCAs), and second, a parent-child margin objective that encourages parent nodes to have smaller norms than their children. Suppose D has pairwise similarities {w_ij}, i, j ∈ [N]. A hierarchical clustering of D is a rooted tree T such that each leaf is a data point. For leaves i, j ∈ T, denote their LCA by i ∨ j, the subtree rooted at i ∨ j by T[i ∨ j], and the leaves of T[i ∨ j] by leaves(T[i ∨ j]). Finally, say that the relation {i, j | k} holds if i ∨ j is a descendant of i ∨ j ∨ k. Dasgupta's cost can then be expressed through these LCA relations, and Wang and Wang (2018) provide a continuous relaxation of this formulation. Together with the margin-based parent-child dissimilarity, this yields the total objective, and the embedding is alternately optimized between the clustering objective and the parent-child objective. Optimization of the hyperbolic parameters is done via the method of Nickel and Kiela (2017). Using this method, Monath et al (2019) are able to embed ImageNet using representations taken from the last layer of a pre-trained Inception network.

Similar to Monath et al (2019), Chami et al (2020a) base their method on Dasgupta's cost and the LCA-based reformulation of Wang and Wang (2018). Chami et al (2020a) define the LCA of two points in hyperbolic space as the point on the geodesic connecting the two points that is closest to the hyperbolic origin, and provide a formula to calculate this point in the Poincaré disk D. This allows the continuous cost to be optimized directly by replacing the discrete LCA relations with their continuous counterparts. A hierarchical clustering tree can then be produced by iteratively merging the most similar pairs, where similarity is measured by the distance of the pair's hyperbolic LCA to the origin. Unlike the method of Monath et al (2019), Chami et al (2020a) do not require hyperbolic embeddings to be given in advance, and they optimize the hyperbolic embeddings of the whole tree, not just the leaves. Lin et al (2022) propose a neural-network-based framework for the hierarchical clustering of multi-view data. The framework consists of two steps: first, improving representation quality via a reconstruction loss, contrastive learning between different views, and a weighted triplet loss between positive examples and mined hard negative examples, and second, applying the hyperbolic hierarchical clustering framework of Chami et al (2020a).
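The hyperbolic LCA of Chami et al (2020a) can be illustrated numerically: parametrize the geodesic between two points with Möbius operations and search for the point closest to the origin. Chami et al provide a closed form; the grid search below is a simplified stand-in.

```python
import numpy as np

def mobius_add(x, y):
    # Mobius addition on the Poincare ball (curvature -1)
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2)

def geodesic_point(x, y, t):
    # Point at fraction t in [0, 1] along the geodesic from x to y
    v = mobius_add(-x, y)
    n = np.linalg.norm(v)
    if n == 0:
        return x.copy()
    return mobius_add(x, np.tanh(t * np.arctanh(n)) * v / n)

def hyperbolic_lca(x, y, steps=1000):
    # Grid-search the geodesic for the point closest to the origin
    ts = np.linspace(0.0, 1.0, steps)
    pts = np.array([geodesic_point(x, y, t) for t in ts])
    return pts[np.argmin(np.linalg.norm(pts, axis=1))]
```

Since geodesics in the Poincaré disk bow toward the origin, the LCA of two points on opposite sides of the disk lies strictly closer to the origin than either endpoint, matching the intuition that an ancestor is more general than its descendants.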
The contrastive loss in Lin et al (2022) is the usual contrastive loss (see the following section), where positive examples are views of the same object and negative examples are views of different objects. The weighted triplet loss operates on anchor points a_i, positive examples p_i, and negative examples n_i. Positive and negative examples are mined based on the method of Iscen et al (2017), which measures the similarity of a pair of points by estimating the data manifold with k-nearest-neighbor graphs. Lin et al (2022) apply their method to perform multi-view clustering on a variety of multi-view image datasets.

Self-supervised learning
In Section 4.3.1, we describe methods for hyperbolic self-supervision which are primarily based on triplet losses, and in Section 4.3.2 we discuss methods which are primarily based on contrastive losses.

Hyperbolic self-supervision
Based on the idea that biomedical images are inherently hierarchical, Hsu et al (2021) propose hyperbolic self-supervised learning for the unsupervised segmentation of 3D biomedical images. To capture the hierarchical structure of 3D biomedical images, Hsu et al (2021) propose, given a parent patch µ_p, to sample a child patch µ_c which is a sub-patch of the parent patch, and a negative patch µ_n that does not overlap with the parent patch. The hierarchical self-supervised loss is then defined as a margin triplet loss over these three patches. This encourages the representations of sub-patches to be children or descendants of the representation of the main patch, and faraway patches (which likely contain different structures) to lie on other branches of the learned hierarchical representation.
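A minimal sketch of such a hierarchical margin triplet loss on the Poincaré ball; the margin value and exact formulation are assumptions for illustration, not the authors' exact loss.

```python
import numpy as np

def poincare_dist(x, y):
    # Geodesic distance on the Poincare ball (curvature -1)
    sq = np.sum((x - y) ** 2)
    denom = (1 - np.sum(x ** 2)) * (1 - np.sum(y ** 2))
    return np.arccosh(1 + 2 * sq / denom)

def hierarchical_triplet_loss(z_parent, z_child, z_neg, margin=0.1):
    # Pull the child patch toward its parent patch, push the
    # non-overlapping negative patch away by at least the margin
    return max(0.0, poincare_dist(z_parent, z_child)
               - poincare_dist(z_parent, z_neg) + margin)
```

The loss is zero once the child is closer to the parent than the negative by the margin, and positive otherwise, which is what drives sub-patches onto the same branch as their parent.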
To perform unsupervised segmentation, the learned latent representations are extracted and clustered using a hyperbolic k-means algorithm, where the traditional Euclidean mean is replaced with the Fréchet mean. For a manifold M with metric d_M, the Fréchet mean of a set of points {z_i}, i = 1, . . ., k, z_i ∈ M, is defined as the point µ that minimizes the squared distances to all points:
µ = argmin_{µ ∈ M} Σ_i d_M(µ, z_i)²,
and is one way to generalize the concept of a mean to manifolds. Unfortunately, the Fréchet mean on the Poincaré ball does not admit a closed-form solution, so Hsu et al (2021) compute it with the iterative algorithm of Lou et al (2020). The paper finds that this strategy is effective for the unsupervised segmentation of both synthetic biological data and 3D brain tumor MRI scans (Menze et al, 2014; Bakas et al, 2017, 2018).

Weng et al (2021) propose to leverage the hierarchical structure of objects within images to perform weakly-supervised long-tail instance segmentation. To capture this hierarchical structure, Weng et al (2021) learn hyperbolic representations which are supervised with several hyperbolic self-supervised losses. Instance segmentation is done in three stages. First, mask proposals are generated using a pre-trained mask proposal network. Mask proposals consist of bounding boxes {B_i} and masks {M_i}, i = 1, . . ., k. Define x^full_i to be the original image cropped to bounding box B_i, x^bg_i to be the cropped image with the object masked out using mask 1 − M_i, and x^fg_i to be the same cropped image with the background masked out using mask M_i. We refer to these as the full object image, the object background, and the object, respectively.
Second, hyperbolic representations z^bg_i = g(x^bg_i) and z^fg_i = g(x^fg_i) are produced by a pre-trained feature extractor and supervised by a combination of three self-supervised losses, with the representations fixed to latent dimension 2. The first self-supervised loss encourages the representation of the object to be similar to that of the full object image and farther away from the representation of the object background. The second loss is a triplet loss that requires the sampling of positive and negative examples.
The third loss is similar to the hierarchical triplet loss of Hsu et al (2021) described above, except with the origin taking the place of negative samples. Finally, the representations are clustered using hyperbolic k-means clustering. Unlike Hsu et al (2021), to compute the mean they map the representations from the Poincaré disk to the hyperboloid model L and compute the (weighted) hyperboloid midpoint. Compared to the Fréchet mean, this mean has the advantage of a closed-form formula, making it more computationally efficient. Weng et al (2021) find that their method improves over other partially-supervised methods on the LVIS long-tail segmentation dataset (Gupta et al, 2019).
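The two notions of hyperbolic mean can be contrasted in the Lorentz model: a Karcher-flow approximation of the Fréchet mean (a simple stand-in for the exact algorithm of Lou et al (2020)) versus the closed-form hyperboloid midpoint.

```python
import numpy as np

def mdot(x, y):
    # Lorentzian inner product <x, y>_L
    return -x[0] * y[0] + np.dot(x[1:], y[1:])

def exp_mu(mu, u):
    # Exponential map on the hyperboloid at mu for tangent vector u
    n = np.sqrt(max(mdot(u, u), 1e-15))
    return np.cosh(n) * mu + np.sinh(n) * u / n

def log_mu(mu, x):
    # Logarithmic map on the hyperboloid at mu
    alpha = -mdot(mu, x)                       # equals cosh(d(mu, x))
    d = np.arccosh(np.clip(alpha, 1.0, None))
    u = x - alpha * mu
    n = np.sqrt(max(mdot(u, u), 1e-15))
    return d * u / n if n > 1e-7 else np.zeros_like(mu)

def frechet_mean(points, iters=50):
    # Karcher flow: average the log-maps and step along the exponential
    mu = np.asarray(points[0], float)
    for _ in range(iters):
        tangent = np.mean([log_mu(mu, np.asarray(p, float)) for p in points], axis=0)
        mu = exp_mu(mu, tangent)
    return mu

def lorentz_centroid(points, weights=None):
    # Closed-form (weighted) hyperboloid midpoint: project the weighted
    # Euclidean average back onto the hyperboloid
    pts = np.asarray(points, float)
    w = np.ones(len(pts)) if weights is None else np.asarray(weights, float)
    a = (w[:, None] * pts).sum(axis=0)
    return a / np.sqrt(max(-mdot(a, a), 1e-15))
```

For two points placed symmetrically about the origin of the hyperboloid, both means coincide at the origin, but the centroid requires no iteration, which is exactly the efficiency argument made above.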

Hyperbolic contrastive learning
Hyperbolic contrastive learning methods have also been proposed. Surís et al (2021) propose to learn hyperbolic representations for video action prediction because of their ability to combine representing hierarchy with giving a measure of uncertainty (see Figure 10). Surís et al (2021) learn an action hierarchy where more abstract actions are near the origin of the Poincaré disk and more fine-grained actions are near the edge. If the preceding video frames are ambiguous, this hierarchical representation allows the model to predict a more general parent category of action (e.g., greeting) instead of a more fine-grained child category (e.g., handshake or high-five). The parent of two actions is computed as the hyperbolic mean of their hyperbolic representations, which Surís et al (2021) take to be the midpoint of the geodesic connecting the two representations. Surís et al (2021) propose a two-stage framework for video action prediction. Self-supervised pre-training proceeds as follows: let x_t be a frame of the video, with representation z_t = f(x_t) produced by an encoder f. The pretext task is to predict the representation z_{t+δ} of a clip δ frames into the future. The model produces an estimate ẑ_{t+δ} = φ(c_t, δ), where c_t = g(z_1, . . ., z_t) is an encoding of all past video frames. The functions f, g, φ are all parameterized by neural networks. Training is supervised by a contrastive loss, which encourages the positive pairs ẑ_i, z_i to have similar representations while pushing ẑ_i away from the representations of all negative examples z_j. One key feature of this loss is that under uncertainty, say when actions a and b are both probable, the loss is minimized by predicting the midpoint of the geodesic connecting a and b, which is equivalent to moving one level up the hierarchy to the parent of a and b.

Ge et al (2022) propose to improve contrastive learning by incorporating the hierarchical structure of images through a scene-object hierarchy (see Figure 11). Ge et al (2022) use a hyperbolic version of the MoCo architecture (He et al, 2020), which the authors call HCL. They extend the MoCo architecture in several ways: first, unlike previous works on visual contrastive learning, HCL requires object regions to be extracted from the input image. Second, a hyperbolic backbone with a corresponding momentum encoder is added alongside MoCo's Euclidean backbone and its momentum encoder. The Euclidean backbone and momentum encoder are trained the same way as in He et al (2020), but their inputs are the extracted object regions rather than whole images. The hyperbolic branch takes as input a scene region u, an object region v that is a subregion of the scene u, and negative objects N_u = {n_1, . . ., n_k} that are not subregions of the scene u. Let the representations of u, v, n_j be z_u, z_v, z_j, respectively. The hyperbolic branch is then trained with a contrastive loss in which negative hyperbolic distance, scaled by a temperature τ, serves as the similarity measure. This loss encourages representations to form a scene-object hierarchy where scenes have the highest norm (i.e., lie near the edge of the Poincaré ball D) and objects have the smallest norm (i.e., lie near the center of D). The paper finds that the method achieves small gains over the original MoCo and over MoCo augmented with bounding box information. They also examine the representations of out-of-context objects and find that these generally have a higher distance to their scene images. Yue et al (2023) propose a different method for hyperbolic contrastive learning based on SimCLR (Chen et al, 2020c). Like Ge et al (2022), Yue et al (2023) replace the dot-product similarity of the contrastive loss with the hyperbolic distance.

From current works, we identify four main axes of improvement that have come with the recent shift towards learning in hyperbolic space for computer vision:

• Hierarchical learning. The inherent links between hierarchical data and hyperbolic embeddings are well known. It is therefore not surprising that a wide range of works have used hyperbolic learning to improve hierarchical objectives in computer vision. The ability to incorporate hierarchical knowledge, for example through hyperbolic embeddings or hierarchical hyperbolic logistic regression, has been utilized for several problems. Hierarchical learning in hyperbolic space can, among others, reduce error severity, resulting in smaller mistakes and more consistent retrieval. This is a key property in, for example, medical domains, where large mistakes need to be avoided at all costs. Hierarchical learning has also been shown to enable zero-shot generalization. By embedding class hierarchies in hyperbolic space and
mapping examples of seen classes to their corresponding embedding, it becomes possible to generalize to examples of unseen classes.
In general, hierarchical information between classes helps to structure the semantics of the task at hand, and embedding such knowledge in hyperbolic space is preferable over Euclidean space.

• Few-sample learning. Few-shot learning is popular in hyperbolic deep learning for computer vision. Many works have shown that consistent improvements can be made by performing this task with hyperbolic embeddings and prototypes, both with and without hierarchical knowledge. In few-shot learning, samples are scarce when it comes to generalization, and working in hyperbolic space consistently improves accuracy. These results indicate that hyperbolic space can generalize from fewer examples, with potential in domains where examples are scarce. This is already visible in the unsupervised domain, where generative learning works better in hyperbolic space when learning from constrained data sources.

• Robust learning. Across several axes, hyperbolic learning has been shown to be more robust. For example, hyperbolic embeddings improve out-of-distribution detection, provide a natural way to quantify uncertainty about samples, pinpoint out-of-context samples without supervision, and can improve robustness to adversarial attacks. Robustness and uncertainty are key challenges in deep learning in general, and hyperbolic deep learning can provide a natural way to robustify networks.

• Low-dimensional learning. For many applications, networks and embedding spaces need to be constrained, for example when learning on embedded devices or when visualizing data. In the unsupervised domain, hyperbolic learning consistently improves over Euclidean learning when working with smaller embedding spaces. Similarly, the embedding space in supervised problems can be substantially reduced in hyperbolic space while maintaining downstream performance. As such, hyperbolic learning has the potential to enable learning in compressed and embedded domains.

Open research questions
Hyperbolic learning has made an impact on computer vision, with many promising avenues ahead. The field is, however, still in its early stages, facing both challenges and opportunities. Three directions stand out:

• Fully hyperbolic learning. Hyperbolic learning papers in computer vision commonly share one perspective: hyperbolic learning should be done in the embedding space. For the most part, the representation learning of earlier layers is done in Euclidean space, resulting in hybrid networks.

Fig. 1: Circle Limit I (1958). This artwork by M. C. Escher is based on the Poincaré disc model of hyperbolic geometry.

Fig. 2: The three core strategies for supervised hyperbolic learning in computer vision. Current literature performs hyperbolic learning of visual embeddings by learning to match training samples (i) to hyperbolic class hyperplanes, i.e., gyroplanes, (ii) to hyperbolic class prototypes, or (iii) by contrasting to other samples.

Fig. 3: Hyperbolic image segmentation naturally provides per-pixel uncertainty information. Pixels with low hyperbolic norm constitute pixels with high uncertainty and are strongly correlated with closeness to semantic boundaries. Image courtesy of Ghadimi Atigh et al (2022).

Fig. 4: Hierarchical knowledge amongst classes provides a structure for hyperbolic embeddings in computer vision approaches, where classes are represented as points or prototypes in hyperbolic space according to their hypernym-hyponym relations. For example, Dhall et al (2020) exploit hierarchical relations from entomological collections (left), while Yu et al (2022) utilize taxonomies of skin lesion diseases (middle) and Long et al (2020) do the same for action hierarchies (right). Images courtesy of the respective publications.
Such zero-shot prototypes can also combine WordNet-based Poincaré embeddings with text-based Poincaré GloVe embeddings (Tifrea et al, 2019), concatenating both to obtain class prototypes. By optimizing seen training images towards their prototypes, it becomes possible to generalize to unseen classes during testing through a nearest-neighbor search in the concatenated hyperbolic space. Xu et al (2022) also perform hyperbolic zero-shot learning by training hyperbolic graph layers (Chami et al, 2019) on top of hyperbolic word embeddings. Dengxiong and Kong (2023) show the potential of hyperbolic space in generalized open-set recognition, which classifies unknown samples based on side information. A side-information (taxonomy) learning framework is introduced to embed the information in hyperbolic space with low distortion and identify the unknown samples. Moreover, an ancestor search algorithm is outlined to find the most similar ancestor of an unknown sample in the taxonomy of the known classes.

Fig. 5: Embeddings of hyperbolic vision transformers cluster samples based on their label towards the boundary of the Poincaré ball, while simultaneously exhibiting latent hierarchical relations. Image courtesy of Ermolov et al (2022).

Fig. 7: The standard hyperbolic wrapped normal (top) and the rotated hyperbolic wrapped normal (bottom). In (a), the principal axes of the normal distribution are illustrated. In (b), the principal axes of the transported normal distribution are visualized. The densities of the two distributions are visualized in (c). Image courtesy of Cho et al (2022).
2. Rotate Σ by R: Σ′ = RΣRᵀ.
3. Sample as in the usual hyperbolic wrapped normal: draw v ∼ N(0, Σ′) and map it to hyperbolic space as exp_μ(PT_{0→μ}([0, v])).
Cho et al (2022) find that RoWN outperforms HWN in a variety of settings, such as the Atari 2600 Breakout image generation experiment first examined in Nagano et al (2019).
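The sampling steps above can be sketched in the Lorentz model, following the wrapped-normal construction of Nagano et al (2019): rotate the covariance, sample in the tangent space at the origin, parallel transport to μ, and apply the exponential map. This is an illustrative reimplementation under those assumptions, not the authors' code:

```python
import numpy as np

def lorentz_inner(u, v):
    """Lorentzian inner product <u, v>_L = -u_0 v_0 + sum_i u_i v_i."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def sample_rown(mu, Sigma, R, rng):
    """Sample from a rotated hyperbolic wrapped normal (sketch).
    mu: point on the hyperboloid {x : <x, x>_L = -1, x_0 > 0}.
    Sigma: diagonal covariance in the d-dim tangent space at the origin.
    R: the rotation applied to the principal axes."""
    Sigma_rot = R @ Sigma @ R.T                    # step 2: rotate covariance
    v = rng.multivariate_normal(np.zeros(len(Sigma)), Sigma_rot)
    u = np.concatenate([[0.0], v])                 # lift to tangent space at o = (1, 0, ..., 0)
    # Parallel transport from the origin o to mu (Nagano et al, 2019):
    o = np.zeros_like(mu)
    o[0] = 1.0
    alpha = -lorentz_inner(o, mu)                  # equals mu[0]
    pt = u + lorentz_inner(mu, u) / (alpha + 1.0) * (o + mu)
    # Exponential map at mu keeps the sample on the hyperboloid.
    norm = np.sqrt(max(lorentz_inner(pt, pt), 1e-12))
    return np.cosh(norm) * mu + np.sinh(norm) / norm * pt
```

Setting R to the identity recovers the standard hyperbolic wrapped normal of Nagano et al (2019); RoWN's contribution is the choice of R from μ, which is omitted here.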

Fig. 9: The left figure shows the partitioning step of wrapped hyperbolic coupling, and the right figure shows how the vector is transformed, transported, and projected back to hyperbolic space.Image courtesy of Bose et al (2020).

Fig. 10: Surís et al (2021) model uncertainty with hyperbolic representations. If the model is uncertain, it can predict an abstraction of all possible actions (red square), and if it is certain it can predict a more specific action (blue square). The pink circle shows how computing the mean of two representations (pink squares) increases the generality. Image courtesy of Surís et al (2021).

Fig. 11: The learned hierarchy of Ge et al (2022) has objects near the origin of the Poincaré disk and scenes near the edge of hyperbolic space. Image courtesy of Ge et al (2022).

• Lastly, the upper half-space model represents d-dimensional hyperbolic space by the set U^d = {x ∈ R^d : x_d > 0}. It is a conformal model and shares many properties with the Poincaré model; geodesics, for example, are also arcs of Euclidean circles (or lines), meeting the boundary of U^d at a right angle.
Findings from neuroscience indicate that hyperbolic space can also play a prominent role for the earlier layers in neural networks (Chossat, 2020). Recently, Zhang et al (2023) have shown that spatial relations in the hippocampus are more hyperbolic than Euclidean. Learning deep networks fully in hyperbolic space requires rethinking all layers, from convolutions to self-attention and normalization. At the time of writing this survey, two works have made steps in this direction. Bdeir et al (2023) introduce a hyperbolic convolutional network in the Lorentz model of hyperbolic space. They outline how to perform convolutions, batch normalization, and residual connections. Simultaneously, van Spengler et al (2023) introduce Poincaré ResNet, with convolutions, residuals, batch normalization, and better network initialization in the Poincaré ball model. These works provide a foundation towards fully hyperbolic learning, but many open questions remain. Which model is most suitable for fully hyperbolic learning? Or do different layers work best in different models? How can fully hyperbolic learning scale to ImageNet and beyond? Should each stage of the network have the same curvature? And how effective can hyperbolic networks become across all possible tasks compared to Euclidean networks? A lot more research is needed to answer these questions.
• Computational challenges. Performing gradient-based learning in hyperbolic space changes how networks are optimized and how parameters behave. Compared to their Euclidean counterparts, however, hyperbolic networks and embeddings can be numerically more unstable, with issues at the boundary of the ball, vanishing gradients, and more. Moreover, hyperbolic operations can be more involved and computationally heavy depending on the model used, leading to less efficient networks. Such computational challenges are relevant for all domains of hyperbolic learning and a broader topic that is receiving attention.
• Open source community. Modern deep learning libraries are centered around Euclidean geometry. Any new researcher in hyperbolic learning therefore does not have the opportunity to quickly implement networks and layers to gain an intuition into their workings. Moreover, any new advances have to be either implemented from scratch or imported from code repositories of other papers. What is missing is an open-source community and a shared repository that houses advances in hyperbolic learning for computer vision. Such a community and code base are vital to gain further traction and attract a wide audience, including practitioners. Whether as part of existing libraries or as a separate library, continued development of open-source hyperbolic learning code is key for the future of the field.
• Large and multimodal learning. In computer vision, and Artificial Intelligence in general, there is a strong trend towards learning at large scale and learning with multiple modalities, e.g., image-text or video-audio models. It is therefore natural for the field to work towards hyperbolic foundation models. While early work has shown that large-scale and/or multimodal learning is viable with hyperbolic embeddings (Desai et al, 2023), hyperbolic foundation models form a longer-term commitment, as they require solutions to all open problems mentioned above, from stable, fully hyperbolic learning to continued open-source development.
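One widely used mitigation for the boundary instability raised under computational challenges is to re-project embeddings slightly inside the Poincaré ball after each optimization step. The sketch below shows this common trick in isolation (it is a generic stabilization heuristic found in many hyperbolic codebases, not the method of any single cited paper):

```python
import numpy as np

def clip_to_ball(x, eps=1e-5):
    """Rescale any points that drift onto or outside the unit-ball
    boundary back to radius 1 - eps; points safely inside are untouched.
    Without this, the Poincaré distance (which divides by 1 - ||x||^2)
    overflows and gradients explode near the boundary."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    max_norm = 1.0 - eps
    factor = np.where(norm > max_norm, max_norm / np.maximum(norm, 1e-15), 1.0)
    return x * factor
```

Riemannian optimizers sidestep part of this problem by construction, but even then, a projection step like this is a cheap safeguard against accumulated floating-point drift.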