Dimensionality selection for hyperbolic embeddings using decomposed normalized maximum likelihood code-length

Graph embedding methods are effective techniques for representing nodes and their relations in a continuous space. In particular, hyperbolic space is more effective than Euclidean space for embedding graphs with tree-like structures. Thus, how to select the best dimensionality for the hyperbolic space in which a graph is embedded is a critical question. This is because we cannot distinguish nodes well when the dimensionality is too low, whereas the embedded relations are affected by irregularities in the data when the dimensionality is excessively high. We consider this problem from the viewpoint of statistical model selection for latent variable models. We then propose a novel methodology for dimensionality selection based on the minimum description length principle. We introduce a latent variable modeling of hyperbolic embeddings and apply the decomposed normalized maximum likelihood (DNML) code-length to latent variable model selection. We empirically demonstrate the effectiveness of our method using both synthetic and real-world datasets.


Motivation
Graphs are convenient tools for knowledge representation and can be used to represent various types of real-world data. Consequently, graph analysis has garnered significant attention in recent years in various fields, such as biology (e.g., protein-protein interaction networks) [1], social sciences (e.g., friendship networks) [2], and linguistics (e.g., word co-occurrence networks) [3]. Generally, tasks in graph analysis are classified into the following four categories: (1) node classification, (2) link prediction, (3) node clustering, and (4) graph visualization [4].
Graph embeddings, which convert discrete representations into continuous ones, such as vectors in Euclidean space, have become popular tools in graph analysis [5][6][7][8]. They provide effective solutions for the aforementioned tasks: continuous representations can be used as the input in tasks of types (1), (2), and (3), whereas two-dimensional continuous representations are used directly in tasks of type (4).
Dimensionality is one of the most important hyperparameters in graph embeddings. First, node classification, link prediction, and node clustering performance depend on it. Intuitively, we cannot distinguish nodes well when the dimensionality is too low, while the embedded relations are significantly affected by irregularities in the data when the dimensionality is too high. Second, the training time and computational expense depend directly on it. Therefore, the issue of dimensionality selection for graph and word embeddings has garnered significant attention [9][10][11][12][13]. However, most existing studies have focused on Euclidean space, although hyperbolic space is a viable alternative embedding space.
Hyperbolic space is a Riemannian manifold with constant negative curvature. In network science, hyperbolic space is suitable for modeling hierarchical structures [14,15].
In a tree, the number of nodes and leaves at level h grows exponentially in h. The analogies of these two quantities in hyperbolic and Euclidean space are the circumference and area of a circle, respectively. In the two-dimensional hyperbolic space with constant curvature K = −1, the circumference of a circle of hyperbolic radius r is 2π sinh r and its area is 2π(cosh r − 1), both increasing exponentially with r. This analogy demonstrates that hyperbolic space has an affinity for hierarchical structure. In contrast, in the two-dimensional Euclidean space R², the circumference of a circle is 2πr and its area is πr², both increasing only polynomially with r. Thus, increasing the dimensionality is essential for embedding a hierarchical structure in Euclidean space. Owing to these properties, hyperbolic embeddings have been extensively studied in recent years [16][17][18]. However, to the best of our knowledge, there has been no previous research except [19] on dimensionality selection in hyperbolic space.
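The exponential-versus-polynomial contrast above is easy to check numerically. The following sketch compares the growth of the circle circumference and area in H² (K = −1) with their Euclidean counterparts; the function names are ours, for illustration only.

```python
import math

def hyp_circumference(r: float) -> float:
    """Circumference of a circle of hyperbolic radius r in H^2 (K = -1)."""
    return 2.0 * math.pi * math.sinh(r)

def hyp_area(r: float) -> float:
    """Area of a circle of hyperbolic radius r in H^2 (K = -1)."""
    return 2.0 * math.pi * (math.cosh(r) - 1.0)

def euc_circumference(r: float) -> float:
    """Circumference of a Euclidean circle of radius r."""
    return 2.0 * math.pi * r

# Doubling r multiplies the hyperbolic circumference by roughly e^r
# (since sinh r ~ e^r / 2 for large r), while the Euclidean
# circumference merely doubles.
for r in (2.0, 4.0, 8.0):
    print(r, hyp_circumference(r), euc_circumference(r))
```

This is why a fixed-dimensional hyperbolic space can accommodate the exponentially growing number of tree nodes per level, whereas Euclidean space must grow in dimensionality instead.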
In this study, we propose a novel methodology for dimensionality selection of hyperbolic graph embeddings. We address this issue from the viewpoint of statistical model selection. First, we demonstrate that there is a non-identifiability problem in the conventional probabilistic model of hyperbolic embeddings; that is, there is no one-to-one correspondence between the parameter and the probability distribution. This problem invalidates the use of conventional model selection criteria, such as Akaike's information criterion (AIC) [20] and the Bayesian information criterion (BIC) [21]. To overcome this difficulty, we employ two latent variable models of hyperbolic embeddings following pseudo-uniform distributions (PUDs) [14,15] and wrapped normal distributions (WNDs) in hyperbolic space [22]. We thereby introduce a criterion for dimensionality selection based on the minimum description length (MDL) principle [23].
The MDL principle asserts that the best model minimizes the total code-length required for encoding the given data. It exhibits several advantages, such as consistency [24] and rapid convergence in the framework of probably approximately correct (PAC) learning [25]. Although MDL-based dimensionality selection was developed for Word2Vec-type word embeddings into Euclidean space by Hung and Yamanishi [12], their techniques cannot straightforwardly be applied to hyperbolic graph embeddings.
The DNML criterion [26] is a model selection criterion for latent variable models based on the MDL principle, in which the non-identifiability problem is resolved by jointly encoding the observed and latent variables. The smaller the DNML criterion, the better the dimensionality. Herein, we propose to apply DNML to the problem of dimensionality selection for hyperbolic embeddings. The DNML criteria obtained by applying it to PUDs and WNDs are called the decomposed normalized maximum likelihood code-length for pseudo-uniform distributions (DNML-PUD) and the DNML code-length for wrapped normal distributions (DNML-WND), respectively.
The novelty and significance of this study are summarized as follows.

• Proposal of a novel methodology of dimensionality selection for hyperbolic embeddings
We propose DNML-PUD and DNML-WND for selecting the best dimensionality of hyperbolic graph embeddings. We introduce latent variable models of hyperbolic embeddings with PUDs and WNDs and then apply the DNML criterion to their dimensionality selection, based on the MDL principle. One of our significant contributions is the derivation of explicit formulas of DNML for specific cases of PUDs and WNDs.
• Empirical demonstration of the effectiveness of our methodology
We evaluated the proposed method using both synthetic and real-world datasets. For the synthetic datasets, we first generated graphs with known true dimensionality and then tested whether each criterion identified it. For the real-world datasets, we examined the relationship between the selected dimensionality and link prediction performance. Furthermore, we quantified the extent to which the hierarchical structure of a graph was preserved using the WordNet (WN) [27] datasets. Overall, our experimental results confirmed the effectiveness of our method.
The preliminary version of this paper appeared in [28]. The major updates of this paper are summarized below:
• We introduced a new latent variable model, wrapped normal distributions in hyperbolic space [22], and derived an upper bound on its DNML criterion, which we call DNML-WND.
• We added the evaluation of DNML-WND to the experimental results.
• We added a new metric called conciseness to the evaluation of the link prediction task in Sect. 4.3.1.

Related work
Conventionally, the dimensionality is determined heuristically based on domain knowledge.However, in recent years, several studies have proposed more principled approaches for this purpose.
Yin and Shen [9] proposed the pairwise inner product (PIP) loss, which quantifies the performance of embeddings based on the bias-variance trade-off. The PIP loss is applicable to embeddings that can be formulated as low-rank matrix approximations, and its theoretical aspects have been investigated extensively. However, it is not known whether hyperbolic embeddings satisfy this condition; thus, the PIP loss cannot be directly used for hyperbolic embeddings. Gu et al. [10] extended the PIP loss to the normalized embedding loss. It would be applicable to hyperbolic embeddings once a normalized embedding loss for hyperbolic embeddings is defined. However, the empirical observations (for example, the normalized embedding loss following Eq. (2) in [10]) are limited to Euclidean space, and it is still unknown whether such observations are also valid for hyperbolic embeddings.
Luo et al. [11] proposed minimum graph entropy (MinGE) to select a dimensionality that minimizes graph entropy, which is a weighted sum of feature entropy and structure entropy. However, feature entropy depends on a certain probability distribution in Euclidean space, and its extension to hyperbolic space is not straightforward. Moreover, although it was demonstrated to exhibit excellent experimental performance, no particular rationale was given for the selected dimensionality. Wang [13] proposed a method that first learns embeddings in a sufficiently high-dimensional Euclidean space (e.g., the 1000-dimensional Euclidean space), then applies principal component analysis (PCA) to the embeddings, and selects the dimensionality that minimizes a predefined score function. Recently, several hyperbolic dimensionality reduction methods have also been proposed [29][30][31], which indicates the possibility of extending this method to hyperbolic space. To extend the method to the hyperbolic case, the following two points should be discussed: (1) how to define the score function and (2) which dimensionality reduction method should be used.
Recently, the graph neural architecture search method (GraphNAS) was proposed in [32]. GraphNAS selects the best architecture of graph neural networks, including their dimensionality, using reinforcement learning. The most important difference between the proposed method and GraphNAS is that GraphNAS determines the architecture in a task-dependent manner (e.g., via the accuracy in node classification), while the proposed method is task-independent and estimates universal dimensionalities based on the MDL principle. Another difference is that the proposed method targets hyperbolic embeddings, while GraphNAS targets Euclidean embeddings. Thus, it is potentially possible to extend GraphNAS to hyperbolic neural networks and compare its performance with the proposed method. However, to the best of our knowledge, no paper has addressed this extension, and the extension is not straightforward.
Almagro and Boguna [19] proposed a dimensionality selection method for hyperbolic embeddings. In [19], dimensionality was inferred using predictive models, such as the k-nearest neighbors algorithm or deep learning, where the input is the triplet of the mean densities of chordless cycles, squares, and pentagons of a given graph.
Hung and Yamanishi [12] proposed a dimensionality selection method for Word2Vec. They applied the MDL principle to select the optimal dimensionality. However, contrary to our method, they did not employ latent variable models for embeddings and used the sequentially normalized maximum likelihood code-length rather than the DNML code-length.
The remainder of this paper is organized as follows. Section 2 introduces hyperbolic geometry, the non-identifiability problem, and latent variable models of hyperbolic graph embeddings. Section 3 explains the DNML criteria and the algorithms used for optimization. Section 4 presents the results obtained using artificial and real-world datasets. Section 5 presents the conclusions and future work. The "Appendix" section provides the derivation of the DNML code-lengths and experimental details.

Preliminaries
In this section, we first introduce hyperbolic geometry following [18]. Subsequently, the non-identifiability problem of hyperbolic embeddings is discussed. Finally, we introduce two latent variable models for hyperbolic graph embeddings.

Definition of hyperbolic space
There are several models for representing hyperbolic space (e.g., the Poincaré disk model, the Beltrami-Klein model, and the Poincaré half-plane model) [33]. In this study, the hyperboloid model was used. Since all the models introduced above are isometric to each other, the discussion of the distance structure carries over to the other models. Let H^D = (H^D, g^D) be the D-dimensional hyperbolic space, where H^D := {x ∈ R^{D+1} | ⟨x, x⟩_L = −1, x_0 > 0} and ⟨x, y⟩_L := −x_0 y_0 + Σ_{i=1}^{D} x_i y_i denotes the Lorentzian inner product on R^{D+1}. The distance between x, y ∈ H^D is given by d(x, y) = arcosh(−⟨x, y⟩_L), where arcosh(x) := log(x + √(x² − 1)). Note that x_0 is determined by x_0 = √(1 + Σ_{i=1}^{D} x_i²); thus, only D variables are independent.
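A minimal sketch of these definitions: the Lorentzian inner product, the recovery of x_0 from the D free coordinates, and the distance arcosh(−⟨x, y⟩_L). Function names are illustrative, not from the paper.

```python
import math
from typing import Sequence

def lorentz_inner(x: Sequence[float], y: Sequence[float]) -> float:
    """Lorentzian inner product <x, y>_L = -x_0*y_0 + sum_{i>=1} x_i*y_i."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def lift(x_rest: Sequence[float]) -> list:
    """Recover x_0 from the D free coordinates so that <x, x>_L = -1."""
    x0 = math.sqrt(1.0 + sum(v * v for v in x_rest))
    return [x0, *x_rest]

def hyp_dist(x: Sequence[float], y: Sequence[float]) -> float:
    """Hyperbolic distance d(x, y) = arcosh(-<x, y>_L) on the hyperboloid."""
    # Clamp to 1.0 to guard against floating-point round-off below arcosh's domain.
    return math.acosh(max(1.0, -lorentz_inner(x, y)))

x = lift([0.3, -0.2])   # a point on H^2 given its two free coordinates
y = lift([1.0, 0.5])
```

Both constructed points satisfy ⟨x, x⟩_L = −1 up to round-off, confirming they lie on the hyperboloid.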

Coordinate system of hyperbolic space
Next, we explain the coordinate systems of the hyperbolic space. The Cartesian coordinate system of the ambient Euclidean space is used to represent an element of the hyperbolic space (i.e., x = (x_0, x_1, ..., x_D) ∈ H^D). Alternatively, for a maximum hyperbolic radius R > 0, the polar coordinate system (r, θ_1, ..., θ_{D−1}) introduced in [34] is used, where r ∈ [0, R], θ_j ∈ [0, π] for j = 1, ..., D − 2, and θ_{D−1} ∈ [0, 2π). The coordinate transformation is expressed as follows:
x_0 = cosh r,
x_1 = sinh r cos θ_1,
x_2 = sinh r sin θ_1 cos θ_2,
...
x_{D−1} = sinh r sin θ_1 ⋯ sin θ_{D−2} cos θ_{D−1},
x_D = sinh r sin θ_1 ⋯ sin θ_{D−2} sin θ_{D−1}.
In this study, we specify the coordinate system we use when we introduce the notation of elements in the hyperbolic space.
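The transformation above can be sketched as follows; the resulting Cartesian vector always satisfies ⟨x, x⟩_L = −1, which serves as a quick sanity check. The function name is ours.

```python
import math

def polar_to_hyperboloid(r: float, thetas: list) -> list:
    """Map polar coordinates (r, theta_1, ..., theta_{D-1}) to the hyperboloid:
    x_0 = cosh r, x_k = sinh r * sin(theta_1)...sin(theta_{k-1}) * cos(theta_k),
    and the last coordinate uses only sines."""
    x = [math.cosh(r)]
    s = math.sinh(r)
    prod_sin = 1.0
    for th in thetas:
        x.append(s * prod_sin * math.cos(th))
        prod_sin *= math.sin(th)
    x.append(s * prod_sin)  # x_D: product of all sines
    return x
```

For D = 2 this reduces to (cosh r, sinh r cos θ, sinh r sin θ), the familiar polar parameterization of the hyperboloid sheet.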

Tangent space and exponential map
When we introduce wrapped normal distributions and the optimization algorithm, the concepts of the tangent space T_x H^D and the exponential map Exp_x(·) are necessary.
For x ∈ H^D, the tangent space T_x H^D is defined as the set of vectors orthogonal to x with respect to the inner product ⟨·, ·⟩_L. Hence, T_x H^D := {v ∈ R^{D+1} | ⟨x, v⟩_L = 0}. The exponential map Exp_x(·) : T_x H^D → H^D maps a tangent vector v ∈ T_x H^D onto H^D along the geodesic, where geodesics are the generalizations of straight lines to Riemannian manifolds. The explicit forms of the exponential map and its inverse are well known (e.g., [22]) and are given as follows:
Exp_x(v) = cosh(‖v‖_L) x + sinh(‖v‖_L) v / ‖v‖_L, where ‖v‖_L := √⟨v, v⟩_L,
Exp_x^{−1}(y) = (arcosh(−⟨x, y⟩_L) / √(⟨x, y⟩_L² − 1)) (y + ⟨x, y⟩_L x).
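The exponential map can be sketched directly from the closed form above. A quick check: mapping a tangent vector v at the origin μ_0 = (1, 0, ..., 0) lands on the hyperboloid, at distance exactly ‖v‖_L from μ_0.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def exp_map(x, v):
    """Exp_x(v) = cosh(|v|_L) x + sinh(|v|_L) v / |v|_L for v in T_x H^D."""
    norm = math.sqrt(max(lorentz_inner(v, v), 0.0))
    if norm < 1e-15:
        return list(x)
    c, s = math.cosh(norm), math.sinh(norm) / norm
    return [c * xi + s * vi for xi, vi in zip(x, v)]

mu0 = [1.0, 0.0, 0.0]   # origin of H^2
v = [0.0, 0.4, -0.3]    # tangent at mu0: <mu0, v>_L = 0, |v|_L = 0.5
z = exp_map(mu0, v)
```

Since ⟨μ_0, v⟩_L = 0, v is indeed tangent at μ_0, and d(μ_0, z) = arcosh(−⟨μ_0, z⟩_L) recovers ‖v‖_L = 0.5.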

Non-identifiability problem of hyperbolic embeddings
In a non-identifiable model, as pointed out in [26], the central limit theorem (CLT) does not hold for the maximum likelihood estimator uniformly over the parameter space. Thus, under these circumstances, neither AIC nor BIC can be applied to latent variable models, because they are derived under the assumption that the CLT holds uniformly over the parameter space.
For notational simplicity, we omit D from the probability distribution unless noted otherwise. We focus on undirected, unweighted, and simple graphs. Let n ∈ Z_{≥2} be the number of nodes. For k ∈ Z_{≥2}, [k] and [k]₂ are defined as follows: [k] := {1, 2, ..., k} and [k]₂ := {(i, j) | i, j ∈ [k], i < j}. For (i, j) ∈ [n]₂, let y_ij = y_ji ∈ {0, 1} be a random variable that takes the value 1 if the i-th node is connected to the j-th node and 0 otherwise. For D = 2 and i ∈ [n], let φ_i := (r_i, θ_i) ∈ H^D denote the polar coordinates of the i-th node, where r_i ∈ [0, R] and θ_i ∈ [0, 2π). In this model, y := {y_ij}_{(i,j)∈[n]₂} is an observable variable, whereas φ := {φ_i}_{i∈[n]} is a parameter of the probability distribution. For β_max > β_min > 0, γ_max > γ_min > 0, β ∈ [β_min, β_max], and γ ∈ [γ_min, γ_max], we assume that the random variable y is drawn from the following distribution:
p(y; φ, β, γ) = ∏_{(i,j)∈[n]₂} p(y_ij | φ_i, φ_j; β, γ), p(y_ij = 1 | φ_i, φ_j; β, γ) = 1 / (1 + exp(β d(φ_i, φ_j) − γ)). (4)
Then, the following lemma holds.

Lemma 1
We assume that r_j ≠ 0 for some j ∈ [n]. For α ∈ (0, 2π), we define φ′ := {φ′_i}_{i∈[n]} with φ′_i := (r_i, (θ_i + α) mod 2π), and the following equation holds: p(y; φ, β, γ) = p(y; φ′, β, γ). Therefore, the probability distribution of hyperbolic embeddings is non-identifiable.
Proof Since φ′_j ≠ φ_j holds for some j ∈ [n] such that r_j ≠ 0, we have φ ≠ φ′. For all pairs (i, j), the rotation preserves pairwise distances, i.e., d(φ_i, φ_j) = d(φ′_i, φ′_j). Thus, the result follows from Eq. (4).
For D ≥ 3, non-identifiability can be proved by a similar transformation to the (D − 1)-th angular coordinate.
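Lemma 1 can be checked numerically. The sketch below uses a sigmoid link p(y_ij = 1) = 1/(1 + exp(β d − γ)), which is our reading of the edge distribution in this section (an assumption), and verifies that rotating every embedding by a common angle α preserves all pairwise distances, and hence the likelihood of any graph, even though the parameters differ.

```python
import math, random

def hyp_dist_polar(p, q):
    """Distance in H^2 from polar coordinates via the hyperbolic law of cosines:
    cosh d = cosh r1 cosh r2 - sinh r1 sinh r2 cos(theta1 - theta2)."""
    (r1, t1), (r2, t2) = p, q
    c = math.cosh(r1) * math.cosh(r2) \
        - math.sinh(r1) * math.sinh(r2) * math.cos(t1 - t2)
    return math.acosh(max(1.0, c))

def edge_prob(d, beta, gamma):
    """Assumed sigmoid link: p(y_ij = 1) = 1 / (1 + exp(beta * d - gamma))."""
    return 1.0 / (1.0 + math.exp(beta * d - gamma))

random.seed(0)
phi = [(random.uniform(0.1, 3.0), random.uniform(0.0, 2 * math.pi))
       for _ in range(5)]
alpha = 1.2
phi_rot = [(r, (t + alpha) % (2 * math.pi)) for r, t in phi]

# Pairwise distances (hence the likelihood) are unchanged, yet phi != phi_rot:
# non-identifiability in action.
gap = max(abs(hyp_dist_polar(phi[i], phi[j]) - hyp_dist_polar(phi_rot[i], phi_rot[j]))
          for i in range(5) for j in range(i + 1, 5))
```

Any criterion relying on a one-to-one parameter-distribution correspondence is blind to this rotational degree of freedom, which is why the latent variable treatment in the next subsection is needed.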

Latent variable models of hyperbolic embeddings with PUDs and WNDs
To resolve the non-identifiability problem, we introduce two latent variable models, following work on PUDs [14,15,34] and WNDs [22]. In both models, an embedding is regarded as a set of latent variables and edges as observed variables. Among several latent variable models, we chose these two for the following reasons.
• For PUDs, it has been demonstrated in [14,15,34] that the graphs generated with PUDs have two properties: a power-law degree distribution and a high clustering coefficient. These properties are common in real-world graphs [35].
• For WNDs, it has been demonstrated that they improve the experimental performance of various downstream tasks [22].

Latent variable model with PUDs
The generation process of y and z with PUDs can be summarized as follows:
1. For each vertex i ∈ [n]: generate a latent variable z_i according to the PUD p(z_i; σ, R) described below.
2. For each pair of vertices (i, j) with i < j: generate an observable variable y_ij ∼ p(y_ij | z_i, z_j; β, γ) using Eq. (4).
Below we provide an explicit form of the probability distribution of z for PUDs.
For σ_max > σ_min ≥ 0, the random variable u := {u_i}_{i∈[n]}, where u_i = (r_i, θ_{i,1}, ..., θ_{i,D−1}) denotes the polar coordinates of the i-th node, is drawn according to the following distribution with parameter σ ∈ [σ_min, σ_max]:
p(u; σ, R) = ∏_{i=1}^{n} [ sinh^{D−1}(σ r_i) / C_D(σ) ] [ ∏_{j=1}^{D−2} sin^{D−1−j} θ_{i,j} / I_{D,j} ] (1 / 2π),
where I_{D,j} := ∫_0^π sin^{D−1−j} θ dθ and C_D(σ) := ∫_0^R sinh^{D−1}(σ r) dr denote the normalization constants. For p(z; σ, R), because z_{i,0} is determined by the other D variables in Eq. (1), we have
p(z; σ, R) = ∏_{i=1}^{n} p(u_i; σ, R) |J(z_{i,1:D} : u_i)|^{−1},
where z_{i,1:D} := (z_{i,1}, ..., z_{i,D}) and J(z_{i,1:D} : u_i) is the Jacobian of the transformation from u_i to z_{i,1:D}; its explicit form and derivation are provided in "Appendix A." The probability distribution p(z; σ, R) is called the pseudo-uniform distribution because it reduces to the uniform distribution in hyperbolic space when σ = 1.
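The radial part of the PUD, with density proportional to sinh^{D−1}(σr) on [0, R] (the C_D(σ) normalizer above), can be sampled by simple rejection, since the unnormalized density is increasing and peaks at r = R. This sketch is illustrative, not the paper's sampler.

```python
import math, random

def sample_radius_pud(D: int, sigma: float, R: float, rng: random.Random) -> float:
    """Rejection-sample r in [0, R] with density proportional to sinh^{D-1}(sigma*r)."""
    log_max = (D - 1) * math.log(math.sinh(sigma * R))  # density max is at r = R
    while True:
        r = rng.uniform(0.0, R)
        log_p = (D - 1) * math.log(math.sinh(sigma * r)) if r > 0 else -math.inf
        if rng.random() < math.exp(log_p - log_max):
            return r

rng = random.Random(1)
R = math.log(800)   # R = O(log n), as assumed below
samples = [sample_radius_pud(4, 1.0, R, rng) for _ in range(2000)]

# Mass concentrates near the boundary R, as expected for (pseudo-)uniform
# distributions in hyperbolic space.
frac_outer = sum(r > R - 1.0 for r in samples) / len(samples)
```

With D = 4 and σ = 1, roughly 1 − e^{−3} ≈ 95% of the radial mass lies within unit distance of the boundary, illustrating why hyperbolic embeddings push most nodes toward the rim.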
In the following discussion, the value of R is assumed to be constant and to satisfy R = O(log n), where n is the number of nodes, and it is omitted from the description of the probability distribution. This is because the maximum average degree satisfies k_max = O(n) and the minimum average degree satisfies k_min = O(1) under certain conditions, which is a common property of real-world complex networks [34].
In the aforementioned distribution, σ, β, and γ are parameters, and D denotes the model of the probability distribution.

Probability distribution of WNDs
WNDs are a generalization of Euclidean Gaussian distributions to hyperbolic space. Thus, WNDs have two parameters: a mean in hyperbolic space μ ∈ H^D and a positive-definite covariance matrix Σ ∈ R^{D×D}. In our model, we set μ to μ_0, where μ_0 := (1, 0, ..., 0) denotes the origin of H^D. We assume this because a tree-like graph is considered to be radially distributed around the origin μ_0.
The generation process of y and z with WNDs is summarized as follows:
1. For each vertex i ∈ [n]:
(a) Generate v_i ∼ N(0, Σ) and regard ṽ_i := (0, v_i), which is a tangent vector at μ_0.
(b) Set z_i := Exp_{μ_0}(ṽ_i).
2. For each pair of vertices (i, j) with i < j:
(a) Generate an observable variable y_ij ∼ p(y_ij | z_i, z_j; β, γ) using Eq. (4).
Note that the second step is the same as that of the model with PUDs. Below, we provide an explicit form of the probability distribution of z with WNDs.
The random variable v := {v_i}_{i∈[n]} is drawn according to the following distribution:
p(v; Σ) = ∏_{i=1}^{n} N(v_i; 0, Σ).
For p(z; Σ), we have
p(z; Σ) = ∏_{i=1}^{n} N(v_i; 0, Σ) |J(z_{i,1:D} : v_i)|^{−1},
where J(z_{i,1:D} : v_i) is the Jacobian of the transformation from v_i to z_{i,1:D}, whose determinant is (sinh ‖v_i‖ / ‖v_i‖)^{D−1}. The derivation of the Jacobian is given in [22].
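The WND generation process and its density correction can be sketched as follows, for the special case μ = μ_0 and Σ = σ²I (so no parallel transport is needed); the (sinh‖v‖/‖v‖)^{D−1} term is the Jacobian correction of the exponential map at the origin, following [22]. Names are illustrative.

```python
import math, random

def sample_wnd_at_origin(D: int, sig: float, rng: random.Random):
    """Sample z ~ WND(mu_0, sig^2 I) on H^D: draw v ~ N(0, sig^2 I) in the
    tangent space at mu_0 = (1, 0, ..., 0) and push it through Exp_{mu_0}."""
    v = [rng.gauss(0.0, sig) for _ in range(D)]
    norm = math.sqrt(sum(x * x for x in v))
    if norm < 1e-15:
        return [1.0] + [0.0] * D, v
    c, s = math.cosh(norm), math.sinh(norm) / norm
    return [c] + [s * x for x in v], v

def log_density_wnd(v, sig: float) -> float:
    """log p(z) = log N(v; 0, sig^2 I) - (D - 1) log(sinh(|v|) / |v|),
    i.e., the Gaussian log-density minus the log-Jacobian of Exp_{mu_0}."""
    D = len(v)
    norm = math.sqrt(sum(x * x for x in v))
    log_gauss = -0.5 * D * math.log(2 * math.pi * sig * sig) \
                - 0.5 * sum(x * x for x in v) / (sig * sig)
    return log_gauss - (D - 1) * math.log(math.sinh(norm) / norm)

rng = random.Random(0)
z, v = sample_wnd_at_origin(3, 0.8, rng)
```

The sampled point z satisfies ⟨z, z⟩_L = −1, i.e., it lies on H³ as required.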

Dimensionality selection using DNML code-lengths
In this section, we present the calculation of the DNML code-lengths for the two latent variable models. Thereafter, we present the optimization algorithm.

DNML code-lengths with PUDs and WNDs
According to the MDL principle [24], the probabilistic model that minimizes the total code-length required to encode the given data is selected. Data may be encoded using multiple methods. Although the NML code-length [36] is one of the most common encoding methods, its calculation is quite difficult for complex probability distributions such as PUDs and WNDs. Therefore, we employ the DNML code-length [26], whose calculation for latent variable models is relatively easy.
Let 𝒟 denote the set of candidate dimensionalities. We estimate the optimal dimensionality D̂ ∈ 𝒟 and the optimal embedding ẑ that minimize the following criterion, which we call DNML-PUD:
L_DNML(y, z) := L_NML(y | z) + L_NML(z),
where L_NML(y | z) is the NML code-length of y given z, evaluated at the maximum likelihood estimators β̂(y, z) and γ̂(y, z), and L_NML(z) is the NML code-length of z, evaluated at σ̂(z). The parametric complexity terms of these code-lengths involve the Fisher information matrices I_n(β, γ) and I(σ). The derivation is presented in "Appendix B." Practically, I_n(β, γ) and I(σ) are calculated numerically because the analytic solution of the integral terms is not trivial. For WNDs, the DNML criterion is defined analogously, where Σ̂ denotes the maximum likelihood estimator of Σ. Since the exact value of the normalization term ∫ p(z; Σ̂(z)) dz_{1:D} is analytically intractable, we employ an upper bound on it; the resulting criterion is called DNML-WND.
We provide a more detailed explanation of DNML-PUD and DNML-WND. Figures 1 and 2 show L_NML(y | z), L_NML(z), and L_DNML(y, z) for an artificially generated graph with true dimensionality D_true = 8 and n = 6400. The value of L_NML(y | z) decreases as the dimensionality increases, which implies that the graph can be reconstructed more accurately in higher dimensions. However, the value of L_NML(z) increases as the dimensionality increases, because more code-length is required to encode the extra dimensions of the embedding; that is, the model becomes more complex, and L_NML(z) acts as a penalty term.

Optimization algorithm
First, we explain how to optimize the negative joint log-likelihood L(z, β, γ, σ). We apply a stochastic update rule at iteration t using a mini-batch approximation L^{(t)}(z, β, γ, σ), where B^{(t)}, a subset of the node pairs, is the mini-batch for each iteration and |·| denotes the number of elements in a set.
For z_i, we used the geodesic update in the hyperboloid model [18]: the Riemannian gradient is obtained by projecting the rescaled Euclidean gradient onto the tangent space at z_i, and z_i is updated along the geodesic via the exponential map with learning rate η. Through a preliminary experiment using synthetic datasets, we confirmed that σ rarely converges to the true value when using the gradient descent method. Thus, for each epoch, we numerically calculated σ̂(z) by a grid search:
σ̂(z) = argmin_{σ ∈ S} −log p(z; σ),
where S = {σ_min, σ_min + (1/C)(σ_max − σ_min), ..., σ_min + ((C−1)/C)(σ_max − σ_min), σ_max} and C + 1 denotes the number of candidates.
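A sketch of the two ingredients just described: the geodesic (Riemannian) gradient step on the hyperboloid, following the standard recipe in [18], and the grid search over σ. The quadratic loss passed to the grid search is a stand-in; all names are illustrative.

```python
import math

def lorentz_inner(x, y):
    """Lorentzian inner product <x, y>_L."""
    return -x[0] * y[0] + sum(a * b for a, b in zip(x[1:], y[1:]))

def riemannian_grad(x, euc_grad):
    """Flip the sign of the time component, then project onto the tangent
    space at x via proj(h) = h + <x, h>_L x (valid since <x, x>_L = -1)."""
    h = [-euc_grad[0]] + list(euc_grad[1:])
    ip = lorentz_inner(x, h)
    return [hi + ip * xi for hi, xi in zip(h, x)]

def geodesic_step(x, euc_grad, eta):
    """One geodesic descent update x <- Exp_x(-eta * grad)."""
    g = riemannian_grad(x, euc_grad)
    v = [-eta * gi for gi in g]
    norm2 = lorentz_inner(v, v)
    if norm2 <= 1e-18:
        return list(x)
    norm = math.sqrt(norm2)
    c, s = math.cosh(norm), math.sinh(norm) / norm
    return [c * xi + s * vi for xi, vi in zip(x, v)]

def grid_search_sigma(neg_log_lik, s_min, s_max, C):
    """Pick sigma from the (C + 1)-point grid minimizing -log p(z; sigma)."""
    grid = [s_min + k * (s_max - s_min) / C for k in range(C + 1)]
    return min(grid, key=neg_log_lik)

x = [math.cosh(0.5), math.sinh(0.5), 0.0]              # a point on H^2
x_new = geodesic_step(x, [0.1, -0.2, 0.3], eta=0.05)   # stays on H^2
sigma_hat = grid_search_sigma(lambda s: (s - 1.3) ** 2, 0.5, 2.0, C=15)
```

The projection guarantees the step direction is tangent (⟨x, g⟩_L = 0), so the exponential map keeps the iterate exactly on the hyperboloid, with no retraction error to correct.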
For the optimization of L(z, β, γ, Σ), we define L^{(t)}(z, β, γ, Σ) in a similar manner. The update rules for z_i, β, and γ are given by Eqs. (7), (8), and (9), respectively. For each epoch, we optimized Σ by its maximum likelihood estimate computed from the current tangent vectors. The optimization procedure is summarized in Algorithm 1.

Experimental results
This section presents a comparison between the proposed criteria and conventional methods using artificial and real-world datasets. Details of the code, data, and training procedure are presented in "Appendix C."

Methods for comparison
We used three criteria, AIC, BIC, and MinGE, for a comparative analysis of the performance of the proposed method. Here, AIC and BIC are applied to the non-identifiable model; that is, β, γ, and z are interpreted as parameters. Note that these criteria are not guaranteed to work for this model because of the non-identifiability. MinGE [11] is a dimensionality selection criterion for Euclidean graph embeddings. We set the weighting factor λ = 1 and selected the dimensionality at which MinGE was closest to 0. Furthermore, we did not consider cross-validation (CV) for comparison, because CV requires considerable computation time, particularly when learning graph embeddings.

Artificial dataset
In this experiment, we verified whether the proposed DNML criteria could estimate the true dimensionality.

Dataset detail
We considered the case of D_true = 8, where D_true is the true dimensionality of the PUDs. We generated a graph for each combination of parameters from the following candidates: n ∈ {800, 1600, 3200, 6400}, β ∈ {0.5, 0.6, 0.7, 0.8}, and σ ∈ {0.5, 1.0, 2.0}. Furthermore, we set R = log n and γ = β log n. Consequently, we obtained 48 graphs in total, which we call PUD-8. Similarly, we generated PUD-16 with true dimensionality 16, as well as the analogous WND-8 and WND-16 datasets from WNDs.
In the above generation process, the parameters were set such that the generated graphs were sparse; that is, the average degree is low with respect to the number of nodes.
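The generation process can be sketched as follows. This is a simplified illustration (D = 2, radii drawn uniformly rather than from a PUD, and the sigmoid link we assumed in Sect. 2), not the exact generator used for PUD-8.

```python
import math, random

def hyp_dist_polar(p, q):
    """H^2 distance from polar coordinates (hyperbolic law of cosines)."""
    (r1, t1), (r2, t2) = p, q
    c = math.cosh(r1) * math.cosh(r2) \
        - math.sinh(r1) * math.sinh(r2) * math.cos(t1 - t2)
    return math.acosh(max(1.0, c))

def generate_graph(n: int, beta: float, rng: random.Random):
    """Sample n embeddings with R = log n, gamma = beta * log n, and draw
    each edge independently with p = 1 / (1 + exp(beta * d - gamma))."""
    R, gamma = math.log(n), beta * math.log(n)
    phi = [(rng.uniform(0.0, R), rng.uniform(0.0, 2 * math.pi))
           for _ in range(n)]
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            d = hyp_dist_polar(phi[i], phi[j])
            if rng.random() < 1.0 / (1.0 + math.exp(min(50.0, beta * d - gamma))):
                edges.append((i, j))
    return phi, edges

rng = random.Random(7)
phi, edges = generate_graph(200, 0.7, rng)
avg_degree = 2 * len(edges) / 200
```

Nearby pairs (small d) connect with probability near 1 and distant pairs rarely do, so the resulting graph inherits the latent hyperbolic geometry.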

Results
To provide an illustrative example for each criterion, we first compared the selected dimensionalities on PUD-8 and WND-8 with n = 800, 6400. Figures 3 and 4 show the normalized values of each criterion. For n = 800, AIC, BIC, and DNML selected D = 4. Intuitively, a graph with few nodes is expected to be embedded in a low dimensionality, even if its true dimensionality is high. For n = 6400, DNML selected the correct dimensionality D = 8, whereas AIC and BIC selected incorrect dimensionalities. This implies that the DNML criteria can select the true dimensionality given a sufficient amount of data. MinGE selected the maximum dimensionality D = 64 in all cases. This is possibly because MinGE was designed for Euclidean embeddings, which require larger dimensionality than hyperbolic embeddings for hierarchical structures, as discussed in Sect. 1.1.
Next, we provide a quantitative comparison in terms of mean average precision (MAP) [37]. MAP is calculated using the ranking of the dimensionalities, created in ascending order of each criterion's value. Furthermore, we applied DNML-PUD to the WND datasets and DNML-WND to the PUD datasets.
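Assuming the single true dimensionality is the only relevant item in the ranking (our reading of the evaluation), average precision reduces to the reciprocal of the true dimensionality's rank. A minimal sketch with hypothetical criterion values:

```python
def average_precision(scores: dict, true_dim: int) -> float:
    """Rank candidate dimensionalities by criterion value (smaller is better);
    with a single relevant item, average precision equals 1 / rank."""
    ranked = sorted(scores, key=scores.get)
    return 1.0 / (ranked.index(true_dim) + 1)

# Hypothetical criterion values over candidate dimensionalities:
scores = {2: 105.0, 4: 98.0, 8: 90.0, 16: 93.0, 32: 99.0, 64: 110.0}
ap = average_precision(scores, true_dim=8)  # D = 8 has the smallest value
```

A criterion that consistently places the true dimensionality first thus attains MAP = 1 when averaged over graphs.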
Tables 1, 2, 3, and 4 present the results for PUD-8, PUD-16, WND-8, and WND-16, respectively. First, the MAPs of BIC and MinGE were not high, and they always selected D = 4 and D = 64, respectively. Since the selected dimensionalities were constant, BIC and MinGE are less reliable.
For AIC, we observed good performance in many cases; however, it tended to overestimate the true dimensionality for PUD-8 and WND-8 with n = 6400. This is because the penalty term of AIC is smaller than those of the other criteria. For the DNML criteria, in general, when the sample size is sufficiently large, DNML-PUD identifies the true dimensionality of the PUD datasets, and the same tendency holds for DNML-WND and the WND datasets. Thus, we concluded that the proposed DNML criteria are more effective than AIC when the true dimensionality of the given graph is low.
Note that, in general, the performance of DNML-PUD on the WND datasets varied, sometimes being better and sometimes worse than that of DNML-WND. Similarly, on the PUD datasets, the performance of DNML-WND also varied, sometimes being better and sometimes worse than that of DNML-PUD. This is because the theoretical properties of the MDL principle are not valid when the generation process of the given data differs from the generation process assumed in calculating the DNML code-length. Therefore, this observation is an expected result of the mismatch of the generation processes.

Real-world datasets
We used scientific collaboration networks from [38][39][40], a flight network from https://openflights.org, a protein-protein interaction network from [41], and the WN datasets from [27], because they were employed in [16,42,43], which are representative studies in the field of hyperbolic embeddings. The experimental results in [16,42,43] demonstrated that hyperbolic embeddings outperformed Euclidean embeddings in several graph mining tasks performed on these networks. Therefore, we concluded that they are suitable for comparing our proposed method with others.

Link prediction
DNML-PUD, DNML-WND, and the other model selection criteria were applied to eight real-world graphs. In real-world graphs, the true dimensionality is unknown. Therefore, in this experiment, we evaluated the link prediction performance at the selected dimensionalities.

Dataset detail
We list the details of the eight graphs below.
• Scientific collaboration networks. We used AstroPh, CondMat, GrQc, and HepPh from [40], Cora from [38], and PubMed from [39]. These graphs represent the co-authorship of papers, where an edge exists between two people if they are co-authors.
• Flight networks. We used Airport from https://openflights.org/. In this graph, nodes represent airports, and edges represent airline routes.
• Protein-protein interaction (PPI) networks. We used PPI from [41]. This graph represents protein interactions in yeast.
Furthermore, Table 5 summarizes the statistics of these graphs. Each graph was split into a training set y_train ⊂ y and a test set y_test ⊂ y. The test set y_test comprises positive and negative samples. First, we sampled 10% of the positive samples (node pairs with an edge) from a graph. Subsequently, to generate negative samples, we sampled an equal number of node pairs with no edge connection. Finally, we obtained the training set y_train := y \ y_test.
For y_test, we calculated the area under the curve (AUC), which we define as follows. We first calculated the distance between the nodes of each sample in y_test. Subsequently, we calculated the true positive rate and false positive rate with a fixed threshold on the distance. Finally, we obtained the receiver operating characteristic (ROC) curve by varying the threshold; the AUC is defined as the area under the ROC curve.
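The AUC defined above equals the probability that a randomly chosen positive sample outranks a randomly chosen negative one, which gives a compact threshold-free implementation. Here we score each pair by its negative distance (closer pairs rank higher); names are illustrative.

```python
def auc(pos_scores, neg_scores):
    """AUC = P(score_pos > score_neg) + 0.5 * P(tie); this rank statistic
    equals the area under the ROC curve obtained by sweeping a threshold."""
    wins = ties = 0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1
            elif p == q:
                ties += 1
    return (wins + 0.5 * ties) / (len(pos_scores) * len(neg_scores))

# Scores are negative distances: linked pairs should be closer.
pos = [-0.5, -1.0, -2.0]
neg = [-1.5, -3.0, -4.0]
result = auc(pos, neg)
```

The O(|pos| · |neg|) double loop suffices for a test split of the sizes used here; a sort-based O(m log m) version would be preferable for very large graphs.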

Results
Figure 5 shows the AUCs of the optimal embeddings associated with −log p(y, z; β, γ, σ), −log p(y, z; β, γ, Σ), and −log p(y | z; β, γ). Figure 6 shows the normalized values of each criterion, and Table 6 shows the selected dimensionalities. The performance at the dimensionalities selected by DNML-PUD and DNML-WND was not the best, and higher dimensionalities tended to yield higher AUCs. According to [24], the consistency of MDL model selection is theoretically guaranteed; that is, the model with the shortest code-length converges to the true one if it exists. Therefore, the dimensionalities selected by DNML are considered to be close to the dimensionalities of the true probabilistic models that generated the data. However, our results suggest that the dimensionality of the true probabilistic model is not necessarily the best one for link prediction.
In this section, we provide another perspective on the experimental results. As discussed in Sect. 1.1, dimensionality also controls the computation time and memory. Therefore, it is important to select a dimensionality at which relatively high performance is achieved while keeping computational resources low (e.g., when using embeddings on edge devices). To quantify this, we introduce the conciseness, defined as follows. Let 𝒟 := {D_1, ..., D_N} be the candidate dimensionalities, AUC_{D_i} be the AUC at dimensionality D_i, AUC* be the maximum AUC, and Δ_max be a maximum tolerance gap relative to AUC*. For the selected dimensionality D̂, the conciseness c(D̂, Δ_max) is defined so that it takes high values when D̂ is close to D_min, the smallest dimensionality whose AUC is within Δ_max of AUC*; P ∈ Z_{≥1} denotes the number of candidate points used to calculate the conciseness.
Figure 7 shows a typical example of c(D̂, Δ_max). The proposed conciseness measure assumes a situation in which the AUC improves as the dimensionality increases while the extent of the improvement diminishes; with limited computational resources, we want to select a lower dimensionality while staying within the maximum tolerated AUC gap. Based on this motivation, the conciseness measure is designed to take high values when D̂ ∈ 𝒟 is close to D_min.
Since the conciseness significantly depends on Δ_max, we computed it for Δ_max = 0.050 and 0.100. Table 7 shows the average conciseness of the selected dimensionalities. To calculate the conciseness, we used the embeddings associated with −log p(y, z; β, γ, σ) for DNML-PUD, −log p(y, z; β, γ, Σ) for DNML-WND, and −log p(y | z; β, γ) for AIC, BIC, and MinGE. Furthermore, we set AUC* to the maximum AUC over the three embeddings.
The best or second-best performance was achieved by either DNML-PUD or DNML-WND in many cases. For AIC, the performance was relatively high, but not the best in many cases. For BIC, the performance was relatively high, specifically for Δ_max = 0.100. This indicates that BIC is effective when the maximum tolerance gap is large. For MinGE, the performance was close to 0 because the selected dimensionality was considerably high. Overall, the proposed method works well in that it identifies a dimensionality with a relatively high AUC while maintaining low computational resources.
Note that all the performances were 0 for GrQc. This is because higher dimensionalities achieve a considerably higher AUC on GrQc, unlike most other networks, where increasing the dimensionality does not significantly improve the AUC. In such scenarios, the conciseness measure does not take positive values unless Δ_max is set to a considerably high value; however, setting an excessively high tolerance gap (e.g., Δ_max = 0.300) lacks practical meaning, and it is then sufficient to select the maximum dimensionality within the given computational resources.

Preservation of hierarchy
To investigate the extent to which the hierarchical structure was preserved, we used a subset of WordNet [27] closely following the setting in [16,18].

Dataset detail
We first considered the transitive closure of the is-a relationship for all nouns. Subsequently, we took the subset of nouns that have "mammal" as a hypernym and selected the relations in which "mammal" appears as a hyponym or hypernym. Finally, we connected two nouns if they have an is-a relationship. We refer to this dataset as WN-mammal. Similarly, we generated WN-solid, WN-tree, WN-worker, WN-adult, WN-instrument, WN-leader, and WN-implement. Table 8 summarizes the statistics of these datasets.
Each graph is expected to have a hierarchical structure because a hypernym is often related to many hyponyms. We embedded the eight graphs with various dimensionalities and calculated each criterion. Subsequently, we quantified the extent to which the obtained embeddings reflect the true hierarchy of the is-a relation in the data. Following [16], we used the score function below, where r_u and r_v are the radial coordinates of u and v, respectively, and α > 0 is a constant. In general, it can be assumed that a hypernym has a smaller radial coordinate than its hyponyms. Thus, α(r_u − r_v) acts as a penalty when v, which is a hypernym of u, is lower in the embedding hierarchy. Therefore, the score is expected to be high when the embedding reflects the true hierarchy of the data.

Figure 9 shows the average score over all is-a relation pairs for each dimensionality with α = 100. Overall, lower dimensionalities achieved higher is-a scores. Next, we provide a quantitative comparison of the benefits. With an estimated dimensionality D̂ and a maximum tolerance gap T_gap, the benefit is defined as below, where D_best is the dimensionality at which the best is-a score is achieved; the benefit is high when the estimated dimensionality is close to D_best. Table 9 shows the average benefits over the WN datasets. Note that D_best was 2 for all datasets, and we set T_gap = 2 in the experiments. DNML-WND and BIC achieved the best results. This implies that DNML-WND and BIC selected the dimensionality that reflects the true hierarchical structure of each graph.
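The score function described above can be sketched as follows, assuming the form used in [16], score(is-a(u, v)) = −(1 + α(r_v − r_u)) · d(u, v); the sign convention is inferred from the description that the radial-coordinate term penalizes a hypernym placed lower in the embedding hierarchy, and the numeric values are illustrative.

```python
def score_is_a(r_u, r_v, dist_uv, alpha=100.0):
    """Score for the relation "u is-a v" (v is the hypernym).

    r_u, r_v : radial coordinates (hyperbolic distances from the origin)
    dist_uv  : hyperbolic distance between the embeddings of u and v

    The coefficient grows when r_v > r_u, i.e., when the hypernym v sits
    deeper in the embedding than its hyponym u, which penalizes the pair.
    """
    return -(1.0 + alpha * (r_v - r_u)) * dist_uv

# A hypernym near the origin scores higher than one placed far out:
good = score_is_a(r_u=2.0, r_v=0.5, dist_uv=1.0)  # hierarchy respected
bad = score_is_a(r_u=0.5, r_v=2.0, dist_uv=1.0)   # hierarchy inverted
print(good > bad)  # -> True
```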

Discussion
We summarize the experimental results. Firstly, as a general trend, AIC usually selects a higher dimensionality, BIC a lower dimensionality, and the DNML criteria an intermediate dimensionality. This is due to the relative magnitudes of the penalty terms. MinGE always selected the largest dimensionality in all experiments, possibly because MinGE was designed for Euclidean embeddings, which require a higher dimensionality than hyperbolic embeddings for hierarchical structures, as discussed in Sect. 1.1.
The performance of AIC was relatively good on the first and second tasks, but not on the third. In contrast, BIC showed high performance on the third task, but low performance on the first and second. The DNML criteria do not necessarily give the best performance, but they often give the best or next-best performance on all tasks. Therefore, we conclude that the performance of the proposed DNML criteria was good on average across all tasks.

Conclusion and future work
In this study, we proposed a dimensionality selection method for hyperbolic embeddings based on the MDL principle. We demonstrated that there is a non-identifiability problem in hyperbolic embeddings. Therefore, we employed latent variable models of hyperbolic embeddings using PUDs and WNDs to formulate dimensionality selection as statistical model selection for latent variable models. Within this formulation, we proposed the DNML code-length criterion for dimensionality selection based on the MDL principle. For artificial datasets, we experimentally demonstrated that our method is effective when the true dimensionality is low. For real-world datasets, we used the scientific collaboration networks and the WN datasets. For the scientific collaboration networks, we demonstrated that the proposed method selects simple probabilistic models while maintaining high AUCs. For the WN datasets, we confirmed that the proposed method selects dimensionalities that preserve the true hierarchy of the graphs. Overall, the proposed method performed well on average.

Note that we cannot directly use the dimensionality selected by the proposed method in hyperbolic neural network methods, such as [43,44]. Because the probabilistic models used in the proposed method differ from those in [43,44], the optimal dimensionalities within them are inherently different. This implies that transferring the dimensionality selected by the proposed method to [43,44] is not supported by the rationale of the DNML code-length and the MDL principle. However, we would like to emphasize that even for hyperbolic neural networks, it is possible to consider latent variable models and derive DNML code-lengths following a procedure similar to ours. Specifically, we can treat the parameters and outputs of the hidden layers of hyperbolic neural networks as latent variables. In this sense, the proposed method can be generalized. However, as indicated in "Appendix B," the derivation of the DNML code-length requires significant effort and is not straightforward. Therefore, we leave the extension of our method to hyperbolic neural networks as future work.
The latent variable model approach adopted in this study is promising and is not limited to hyperbolic space. In the future, we plan to build a methodology for dimensionality selection in Euclidean and spherical embeddings by introducing latent variable models for the corresponding spaces [45,46].
In the following, we calculate I_n(β, γ). We rewrite p(y_ij | z_i, z_j; β, γ) as p_ij^{y_ij} (1 − p_ij)^{1−y_ij}, where p_ij = 1/(1 + exp(β d_{z_i z_j} − γ)). This form is a logistic model with the constraints β ∈ [β_min, β_max] and γ ∈ [γ_min, γ_max]. Then, L(β, γ) is rewritten accordingly. Hence, we obtain the expression below; since all its terms are independent of y_ij, we obtain
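For concreteness, the Bernoulli/logistic form above can be sketched as follows; `dist` stands for the hyperbolic distance d_{z_i z_j}, and the function names are ours.

```python
import math

def edge_prob(dist, beta, gamma):
    """p_ij = 1 / (1 + exp(beta * d(z_i, z_j) - gamma))."""
    return 1.0 / (1.0 + math.exp(beta * dist - gamma))

def neg_log_lik(y, dist, beta, gamma):
    """Bernoulli code-length contribution -log p(y_ij | z_i, z_j; beta, gamma)
    with p(y_ij | ...) = p_ij^{y_ij} (1 - p_ij)^{1 - y_ij}."""
    p = edge_prob(dist, beta, gamma)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Closer pairs (small d) are more likely to be linked:
print(edge_prob(0.5, beta=1.0, gamma=1.0) > edge_prob(3.0, beta=1.0, gamma=1.0))  # -> True
```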
Thereafter, we use the asymptotic approximation of the parametric complexity according to Rissanen [49]: log C_n(σ) ≈ (1/2) log(n/(2π)) + log ∫ √I(σ) dσ, where I(σ) denotes the Fisher information. In the following, we calculate I(σ).
The negative logarithm of the likelihood is given below. By interchanging the derivative and the integral, we obtain the following; similarly, we get the subsequent expression. The second and third terms are independent of r. The expectation of the first term with respect to r is calculated as follows.
Finally, we derive the following, where Γ(x + (1 − j)/2) appears and Γ(·) denotes the gamma function. Using Rissanen's g-function [24], the parametric complexity of L_NML(z) can be rewritten as below; the first inequality is derived from Thm. 2.13 of [47].
The main differences between our upper bound and those in [50,51] are as follows:
• We fixed the mean to the origin of the Euclidean space, whereas the previous studies did not.
• We removed the restrictions ε_max < 1 and π, where ε_max > 0 is set such that ε_j^2 ≤ ε_max for all j ∈ [D].
Below, we explain the reason. In the discussion in [51], the difference between the code-lengths of any two data is scale-invariant in Gaussian mixture models (GMMs); in other words, the result of model selection is scale-invariant. Thus, we can perform model selection on rescaled data in GMMs, e.g., after multiplying by 1/α. However, in hyperbolic embeddings, the difference of the code-lengths is not scale-invariant owing to the non-Euclidean nature of the space. Thus, in our model, it is necessary to remove these restrictions and set ε_1 and ε_2 to appropriate values such that the raw data are included in Z^D(ε_1, ε_2). Here, y is sampled uniformly at random over Y. Similarly, we approximated I_n(β, γ) as follows:

C.3 Training detail
For all experiments, the training details were the same.

Embeddings
First, the Cartesian coordinates of each node were independently initialized uniformly at random in [−0.001, 0.001]^{D+1} ⊂ R^{D+1}. They were then projected onto the hyperboloid using Eq. (1). When learning the embeddings, ten negative samples were drawn per positive sample to form mini-batches. The number of epochs was 800, and we set η_β^{(t)} = 0.001 and η_γ^{(t)} = 0.001 for all epochs. Similar to [16], the learning process of the embeddings is composed of two steps.
• In the first step, − log p(y | z; β, γ) was optimized for all models with learning rate η_z
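Since Eq. (1) is not reproduced here, the sketch below assumes the standard lift onto the hyperboloid model H^D = {x ∈ R^{D+1} : −x_0² + Σ_k x_k² = −1, x_0 > 0}, which recomputes x_0 = √(1 + ‖x‖²) from the spatial coordinates; the function name is ours.

```python
import math
import random

def lift_to_hyperboloid(x):
    """Given spatial coordinates x in R^D, return the point (x_0, x) on the
    hyperboloid x_0^2 - ||x||^2 = 1, x_0 > 0 (assumed form of Eq. (1))."""
    x0 = math.sqrt(1.0 + sum(v * v for v in x))
    return [x0] + list(x)

random.seed(0)
D = 8
x = [random.uniform(-0.001, 0.001) for _ in range(D)]  # near-origin init
p = lift_to_hyperboloid(x)

# Minkowski constraint: -x_0^2 + sum_k x_k^2 = -1
mink = -p[0] ** 2 + sum(v * v for v in p[1:])
print(abs(mink + 1.0) < 1e-9)  # -> True
```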

Fig. 1
Fig. 1 Example of DNML-PUD. The selected dimensionality and the true dimensionality are D = 8. The graph was generated with parameters n = 6400, β = 0.6, γ = β log n, and σ = 1.0. The scores L_DNML(y, z) and L_NML(y | z) follow the left scale, whereas L_NML(z) follows the right scale

Fig. 2
Fig. 2 Example of DNML-WND. The selected dimensionality and the true dimensionality are D = 8. The graph was generated with parameters n = 6400, β = 1.2, γ = β log n, and Σ = (0.35 log n)^2 I, where I ∈ R^{D×D} is the identity matrix. The scores L_DNML(y, z) and L_NML(y | z) follow the left scale, whereas L_NML(z) follows the right scale

Here, η_z denotes the learning rate, π_z(·) denotes the projection from the Euclidean gradient to the Riemannian gradient, proj_{H^D_R}(·) denotes the projection from H^D to H^D_R := {x ∈ H^D | d(μ_0, x) ≤ R}, and Exp_z(·) is given by Eq. (3). The functions π_z(·) and proj_{H^D_R}(·) are defined as follows:
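The two projections can be sketched as follows for the hyperboloid model. The closed forms are assumed from the standard Riemannian-SGD treatment (tangent projection via the Minkowski inner product, and pulling a point back along the geodesic through μ_0 = (1, 0, …, 0) using d(μ_0, x) = arccosh(x_0)), not taken verbatim from the definitions above.

```python
import math

def minkowski_dot(u, v):
    """<u, v>_L = -u_0 v_0 + sum_{k>=1} u_k v_k."""
    return -u[0] * v[0] + sum(a * b for a, b in zip(u[1:], v[1:]))

def riemannian_grad(z, eucl_grad):
    """pi_z: map the Euclidean gradient to the Riemannian gradient in the
    tangent space at z (flip the time-like coordinate, then project onto
    T_z H^D via h + <z, h>_L z)."""
    h = [-eucl_grad[0]] + list(eucl_grad[1:])       # J * grad, J = diag(-1, 1, ..., 1)
    lam = minkowski_dot(z, h)
    return [hi + lam * zi for hi, zi in zip(h, z)]

def proj_ball(x, R):
    """proj_{H^D_R}: pull x back along the geodesic through the origin
    mu_0 = (1, 0, ..., 0) so that d(mu_0, x) = arccosh(x_0) <= R."""
    d = math.acosh(max(x[0], 1.0))
    if d <= R:
        return list(x)
    nrm = math.sqrt(sum(v * v for v in x[1:]))
    u = [v / nrm for v in x[1:]]                    # unit spatial direction
    return [math.cosh(R)] + [math.sinh(R) * ui for ui in u]
```

A quick consistency check: the Riemannian gradient is Minkowski-orthogonal to z (it lies in the tangent space), and the projected point sits exactly at distance R from μ_0.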

Fig. 6
Fig. 6 Results of each criterion. First row: AstroPh and CondMat. Second row: GrQc and HepPh. Third row: Cora and PubMed. Fourth row: Airport and PPI

Fig. 9
Fig. 9 Is-a scores on the WN datasets. First row: WN-mammal and WN-solid. Second row: WN-tree and WN-worker. Third row: WN-adult and WN-instrument. Fourth row: WN-leader and WN-implement
‖ · ‖ denotes the Euclidean norm.
Given σ, the update rules of β and γ are provided as

Table 1
Average MAPs on PUD-8.The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)

Table 2
Average MAPs on PUD-16.The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)

Table 3
Average MAPs on WND-8.The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)

Table 6
Selected dimensionalities of each method

Table 7
Average conciseness of each method with ε_max = 0.050 and 0.100 (the bold text indicates either the maximum conciseness or conciseness within a 10% decrease from the maximum one)

Table 8
Statistics of the WN datasets

Table 9
Average benefits on WN datasets (the bold text indicates the maximum one)

Table 10
Computing environment in our experiments