1 Introduction

1.1 Motivation

Graphs are convenient tools for knowledge representation and can be used to represent various types of real-world data. Consequently, graph analysis has garnered significant attention in recent years in fields such as biology (e.g., protein–protein interaction networks) [1], social sciences (e.g., friendship networks) [2], and linguistics (e.g., word co-occurrence networks) [3]. Generally, tasks in graph analysis are classified into the following four categories: (1) node classification, (2) link prediction, (3) node clustering, and (4) graph visualization [4].

Graph embeddings, which convert discrete representations into continuous ones, such as vectors in Euclidean space, have become popular tools in graph analysis [5,6,7,8]. They provide effective solutions for the aforementioned tasks, as continuous representations can be used as the input in tasks of types (1), (2), and (3), whereas two-dimensional continuous representations are used directly in tasks of type (4).

Dimensionality is one of the most important hyperparameters in graph embeddings. First, the performance of node classification, link prediction, and node clustering depends on it. Intuitively, nodes cannot be distinguished well when the dimensionality is too low, whereas the embedded relations are strongly affected by irregularities in the data when the dimensionality is too high. Second, the training time and computational cost directly depend on it. Therefore, the issue of dimensionality selection for graph and word embeddings has garnered significant attention [9,10,11,12,13]. However, most existing studies have focused on Euclidean space, although hyperbolic space is a viable alternative embedding space.

Hyperbolic space is a Riemannian manifold with constant negative curvature. In network science, hyperbolic space is suitable for modeling hierarchical structures [14, 15]. In a tree, the number of leaves at level h and the total number of nodes up to level h both grow exponentially in h. The analogues of these two quantities in hyperbolic and Euclidean space are the circumference and area of a circle, respectively. In two-dimensional hyperbolic space with constant curvature \(K=-1\), the circumference of a circle of hyperbolic radius r is \(2\pi \sinh r\) and its area is \(2\pi (\cosh r-1)\), both of which grow exponentially with r. In contrast, in the two-dimensional Euclidean space \({\mathbb {R}}^2\), the circumference of a circle is \(2\pi r\) and its area is \(\pi r^2\), both of which grow only polynomially with r. This analogy demonstrates that hyperbolic space has a natural affinity for hierarchical structures, whereas embedding a hierarchical structure in Euclidean space essentially requires increasing the dimensionality. Owing to these properties, hyperbolic embeddings have been studied extensively in recent years [16,17,18]. However, to the best of our knowledge, there has been no previous research on dimensionality selection in hyperbolic space except [19].

In this study, we propose a novel methodology for dimensionality selection of hyperbolic graph embeddings. We address this issue from the viewpoint of statistical model selection. First, we demonstrate that there is a non-identifiability problem in the conventional probabilistic model of hyperbolic embeddings; that is, there is no one-to-one correspondence between the parameter and the probability distribution. This problem invalidates the use of the conventional model selection criteria, such as Akaike’s information criterion (AIC) [20] and the Bayesian information criterion (BIC) [21]. To overcome this difficulty, we employ two latent variable models of hyperbolic embeddings following pseudo-uniform distributions (PUDs) [14, 15] and wrapped normal distributions (WNDs) in a hyperbolic space [22]. We thereby introduce a criterion for dimensionality selection based on the minimum description length (MDL) principle [23].

The MDL principle asserts that the best model minimizes the total code-length required for encoding the particular data. It exhibits several advantages, such as consistency [24] and rapid convergence in the framework of probably approximately correct (PAC) learning [25]. Although the MDL-based dimensionality selection was developed for Word2Vec-type word embeddings into the Euclidean space by Hung and Yamanishi [12], their techniques cannot straightforwardly be applied to hyperbolic graph embeddings.

The DNML criterion [26] is a model selection criterion for latent variable models based on the MDL principle, where the non-identifiability problem is resolved by jointly encoding the observed and latent variables. The shorter the DNML code-length, the better the dimensionality. Herein, we propose to apply DNML to the problem of dimensionality selection for hyperbolic embeddings. The DNML criteria obtained by applying it to PUDs and WNDs are called the decomposed normalized maximum likelihood code-length for pseudo-uniform distributions (DNML-PUD) and the DNML code-length for wrapped normal distributions (DNML-WND), respectively.

The novelty and significance of this study are summarized as follows.

  • Proposal of a novel methodology of dimensionality selection for hyperbolic embeddings We propose DNML-PUD and DNML-WND for selecting the best dimensionality of hyperbolic graph embeddings. We aim to introduce latent variable models of hyperbolic embeddings with PUDs and WNDs and then apply the DNML criterion to its dimensionality selection, based on the MDL principle. One of our significant contributions is to derive explicit formulas of DNML for specific cases of PUDs and WNDs.

  • Empirical demonstration of the effectiveness of our methodology We evaluated the proposed method using both synthetic and real-world datasets. For the synthetic datasets, we first generated graphs with a known true dimensionality and then examined whether the true dimensionality could be identified. For the real-world datasets, we examined the relationship between the selected dimensionality and link prediction performance. Furthermore, we quantified the extent to which the hierarchical structure of a graph was preserved using the WordNet (WN) [27] datasets. Overall, our experimental results confirmed the effectiveness of our method.

The preliminary version of this paper appeared in [28]. The major updates of this paper are summarized below:

  • We introduced a new latent variable model based on wrapped normal distributions in hyperbolic space [22] and derived an upper bound on its DNML criterion, which we call DNML-WND.

  • In addition, the evaluation of DNML-WND was added to the experimental results.

  • We added a new metric called conciseness to the evaluation of the link prediction task in Sect. 4.3.1.

1.2 Related work

Conventionally, the dimensionality is determined heuristically based on domain knowledge. However, in recent years, several studies have proposed more principled approaches for this purpose.

Yin and Shen [9] proposed a pairwise inner product (PIP) loss, which quantifies the performance of embeddings based on the bias-variance trade-off. The PIP loss is applicable to embeddings that can be formulated as low-rank matrix approximations, and its theoretical aspects have been investigated extensively. However, it is not known whether hyperbolic embeddings satisfy this condition; thus, the PIP loss cannot be directly used for hyperbolic embeddings. Gu et al. [10] extended the PIP loss to the normalized embedding loss. It is applicable to hyperbolic embeddings once the normalized embedding loss of hyperbolic embeddings is defined. However, the supporting empirical observations (for example, the behavior of the normalized embedding loss in Eq. (2) of [10]) are limited to the Euclidean space, and it remains unknown whether such observations also hold for hyperbolic embeddings.

Luo et al. [11] proposed minimum graph entropy (MinGE) to select a dimensionality that minimizes graph entropy, which is a weighted sum of feature entropy and structure entropy. However, feature entropy depends on a certain probability distribution in the Euclidean space, and its extension to hyperbolic space is not straightforward. Moreover, although MinGE was demonstrated to exhibit excellent experimental performance, no particular rationale was provided for the selected dimensionality. Wang [13] proposed a method that first learns embeddings in a sufficiently high-dimensional Euclidean space (e.g., the 1000-dimensional Euclidean space), then applies principal component analysis (PCA) to the embeddings, and finally selects the dimensionality that minimizes a predefined score function. Recently, several hyperbolic dimensionality reduction methods have also been proposed [29,30,31], which indicates the possibility of extending the method to hyperbolic space. To extend the method to the hyperbolic case, the following two points should be discussed: (1) how to define the score function and (2) which dimensionality reduction method should be used.

Recently, the graph neural architecture search method (GraphNAS) has been proposed in [32]. GraphNAS selects the best architecture of graph neural networks, including their dimensionality, using reinforcement learning. The most important difference between the proposed method and GraphNAS is that GraphNAS determines the architecture in a task-dependent manner (e.g., based on the accuracy in node classification), while the proposed method is task-independent and estimates universal dimensionalities based on the MDL principle. Another difference is that the proposed method targets hyperbolic embeddings, while GraphNAS targets Euclidean embeddings. Thus, it is potentially possible to extend GraphNAS to hyperbolic neural networks and compare its performance with the proposed method. However, to the best of our knowledge, no study has addressed this extension, and the extension is not straightforward.

Almagro and Boguna [19] proposed a dimensionality selection method for hyperbolic embeddings. In [19], dimensionality was inferred using predictive models, such as the k-nearest neighbors algorithm or deep learning, where the input is the triplet of the mean densities of chordless cycles, squares, and pentagons of a given graph.

Hung and Yamanishi [12] proposed a dimensionality selection method for Word2Vec. They applied the MDL principle to select the optimal dimensionality. However, contrary to our method, they did not employ latent variable models for embeddings and used the sequentially normalized maximum likelihood code-length rather than the DNML code-length.

The remainder of this paper is organized as follows. Section 2 introduces hyperbolic geometry, the non-identifiability problem, and latent variable models of hyperbolic graph embeddings. Section 3 explains the DNML criteria and algorithms used for optimization. Section 4 presents the results obtained using artificial and real-world datasets. Section 5 presents the conclusions and future work. “Appendix” section provides the derivation of the DNML code-lengths and experimental details.

2 Preliminaries

In this section, we first introduce the hyperbolic geometry following [18]. Subsequently, the non-identifiability problem of hyperbolic embeddings is discussed. Finally, we introduce two latent variable models for hyperbolic graph embeddings.

2.1 Hyperbolic geometry

2.1.1 Definition of hyperbolic space

There are several models for representing hyperbolic space (e.g., the Poincaré disk model, the Beltrami–Klein model, and the Poincaré half-plane model) [33]. In this study, the hyperboloid model is used. Since all the models introduced above are isometric to each other, the discussion of the distance structure carries over to the other models. Let \({\mathcal {H}}^D=({\mathbb {H}}^D, g_{D})\) be the D-dimensional hyperbolic space, where

$$\begin{aligned} g_{D}= \begin{pmatrix} -1 & 0 & \cdots & 0 \\ 0 & 1 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix}\in {\mathbb {R}}^{(D+1) \times (D+1)}, \end{aligned}$$

\({\mathbb {H}}^D:=\{\varvec{x}=(x_0, x_1, \dots , x_D)^\top \mid \varvec{x}\in {\mathbb {R}}^{D+1}, \langle \varvec{x}, \varvec{x} \rangle _{{\mathcal {L}}}=-1, x_0>0 \}\) and \(\langle \varvec{u}, \varvec{v} \rangle _{{\mathcal {L}}}=\varvec{u}^\top g_{D} \varvec{v}\). The associated distance between \(\varvec{u}, \varvec{v}\in {\mathbb {H}}^D\) is provided by \(d_{\varvec{u} \varvec{v}}={{\,\textrm{arcosh}\,}}\bigl (-\langle \varvec{u}, \varvec{v} \rangle _{{\mathcal {L}}} \bigr )\), where \({{\,\textrm{arcosh}\,}}(x):=\log (x+\sqrt{x^2-1})\). Note that \(x_0\) is determined by

$$\begin{aligned} x_{0}=\sqrt{1+x_{1}^2+\cdots +x_{D}^2}. \end{aligned}$$
(1)

Thus, only D variables are independent.
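For concreteness, the following minimal Python sketch (using NumPy; the function names are ours and purely illustrative) implements the Lorentzian inner product, lifts a point onto \({\mathbb {H}}^D\) via Eq. (1), and evaluates the distance \(d_{\varvec{u}\varvec{v}}={{\,\textrm{arcosh}\,}}(-\langle \varvec{u}, \varvec{v} \rangle _{{\mathcal {L}}})\).

```python
import numpy as np

def lorentz_inner(u, v):
    """Lorentzian inner product <u, v>_L = -u_0 v_0 + u_1 v_1 + ... + u_D v_D."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def lift_to_hyperboloid(x_rest):
    """Given (x_1, ..., x_D), recover x_0 via Eq. (1) so that <x, x>_L = -1."""
    x0 = np.sqrt(1.0 + np.dot(x_rest, x_rest))
    return np.concatenate(([x0], x_rest))

def hyperbolic_distance(u, v):
    """d(u, v) = arcosh(-<u, v>_L); clipping guards against rounding slightly below 1."""
    return np.arccosh(np.clip(-lorentz_inner(u, v), 1.0, None))

u = lift_to_hyperboloid(np.array([0.3, -0.2]))   # a point in H^2
v = lift_to_hyperboloid(np.array([-1.0, 0.5]))
print(lorentz_inner(u, u))        # approximately -1
print(hyperbolic_distance(u, v))  # a nonnegative distance
```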

2.1.2 Coordinate system of hyperbolic space

Next, we explain the coordinate systems of the hyperbolic space. The Cartesian coordinate system of the ambient Euclidean space is used to represent an element of the hyperbolic space (i.e., \(\varvec{x}=(x_0, x_1, \dots , x_D)^\top \in {\mathbb {H}}^D\)). Alternatively, for a maximum hyperbolic radius \(R>0\), the polar coordinate system \((r, \theta _{1}, \dots , \theta _{D-1})^\top \) introduced in [34] is used, where \(r\in [0, R]\), \(\theta _{1}, \theta _{2}, \dots , \theta _{D-2}\in [0, \pi )\), and \(\theta _{D-1} \in [0, 2\pi )\). The coordinate transformation is expressed as follows:

$$\begin{aligned} \left\{ \begin{aligned} x_0&= \cosh r, \\ x_1&= \sinh r \cos \theta _1, \\ x_2&= \sinh r \sin \theta _1 \cos \theta _2, \\&\quad \vdots \\ x_{D-1}&= \sinh r \sin \theta _1 \sin \theta _2 \cdots \sin \theta _{D-2} \cos \theta _{D-1}, \\ x_{D}&= \sinh r \sin \theta _1 \sin \theta _2 \cdots \sin \theta _{D-2} \sin \theta _{D-1}. \end{aligned} \right. \end{aligned}$$
(2)

Throughout this study, we specify which coordinate system is used whenever we introduce elements of the hyperbolic space.
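As an illustration of Eq. (2), the following sketch (function names are ours) converts polar coordinates \((r, \theta _1, \dots , \theta _{D-1})\) to the ambient Cartesian coordinates and checks that the resulting point satisfies \(\langle \varvec{x}, \varvec{x} \rangle _{{\mathcal {L}}}=-1\).

```python
import numpy as np

def polar_to_hyperboloid(r, thetas):
    """Map polar coordinates (r, theta_1, ..., theta_{D-1}) to Cartesian
    coordinates (x_0, x_1, ..., x_D) on the hyperboloid via Eq. (2)."""
    D = len(thetas) + 1
    x = np.empty(D + 1)
    x[0] = np.cosh(r)
    prod = np.sinh(r)                      # running product sinh(r) * sin(theta_1) * ...
    for k, theta in enumerate(thetas):
        x[k + 1] = prod * np.cos(theta)
        prod *= np.sin(theta)
    x[D] = prod                            # the last coordinate carries no cosine factor
    return x

x = polar_to_hyperboloid(1.2, np.array([0.7, 4.0]))   # a point in H^3
print(-x[0] ** 2 + np.sum(x[1:] ** 2))                 # approximately -1
```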

2.1.3 Tangent space and exponential map

When we introduce wrapped normal distributions and the optimization algorithm, the concepts of tangent space \({\mathcal {T}}_{\varvec{x}}{\mathbb {H}}^D\) and exponential map \(\textrm{Exp}_{\varvec{x}} (\cdot )\) are necessary.

For \(\varvec{x}\in {\mathbb {H}}^D\), the tangent space \({\mathcal {T}}_{\varvec{x}}{\mathbb {H}}^D\) is defined as the set of vectors orthogonal to \(\varvec{x}\) with respect to the inner product \(\langle \cdot , \cdot \rangle _{{\mathcal {L}}}\). Hence,

$$\begin{aligned} {\mathcal {T}}_{\varvec{x}}{\mathbb {H}}^D :=\{ \varvec{v} \mid \varvec{v} \in {\mathbb {R}}^{D+1}, \langle \varvec{x}, \varvec{v} \rangle _{{\mathcal {L}}} = 0 \}. \end{aligned}$$

Thereafter, the exponential map \(\textrm{Exp}_{\varvec{x}} (\cdot ) :{\mathcal {T}}_{\varvec{x}}{\mathbb {H}}^D \rightarrow {\mathbb {H}}^D\) maps a tangent vector \(\varvec{v}\in {\mathcal {T}}_{\varvec{x}}{\mathbb {H}}^D\) onto \({\mathbb {H}}^D\) along the geodesic, where the geodesics are the generalizations of straight lines to Riemannian manifolds. The explicit forms of the exponential map and its inverse are well known (e.g., [22]) and are defined as follows:

$$\begin{aligned} \textrm{Exp}_{\varvec{x}} (\varvec{v})&:=\cosh \left( \sqrt{\langle \varvec{v}, \varvec{v} \rangle _{{\mathcal {L}}}}\right) \varvec{x} + \sinh \left( \sqrt{\langle \varvec{v}, \varvec{v} \rangle _{{\mathcal {L}}}}\right) \frac{\varvec{v}}{\sqrt{\langle \varvec{v}, \varvec{v} \rangle _{{\mathcal {L}}}}}, \nonumber \\ \textrm{Exp}_{\varvec{x}}^{-1} (\varvec{y})&= \frac{{{\,\textrm{arcosh}\,}}{\left( -\langle \varvec{x}, \varvec{y} \rangle _{{\mathcal {L}}}\right) }}{\sqrt{\langle \varvec{x}, \varvec{y} \rangle _{{\mathcal {L}}}^2-1}}\left( \varvec{y}+\langle \varvec{x}, \varvec{y} \rangle _{{\mathcal {L}}} \varvec{x}\right) . \end{aligned}$$
(3)
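A minimal implementation of Eq. (3) might look as follows (the helper names are ours; the small eps guard for near-zero tangent vectors is an implementation choice, not part of the formula).

```python
import numpy as np

def lorentz_inner(u, v):
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def exp_map(x, v, eps=1e-12):
    """Exp_x(v): move from x along the geodesic with initial velocity v in T_x H^D."""
    norm_v = np.sqrt(max(lorentz_inner(v, v), 0.0))
    if norm_v < eps:                       # Exp_x(0) = x
        return x
    return np.cosh(norm_v) * x + np.sinh(norm_v) * v / norm_v

def log_map(x, y, eps=1e-12):
    """Exp_x^{-1}(y): the tangent vector at x pointing toward y."""
    alpha = -lorentz_inner(x, y)           # equals cosh d(x, y) >= 1
    if alpha < 1.0 + eps:                  # x and y (numerically) coincide
        return np.zeros_like(x)
    return np.arccosh(alpha) / np.sqrt(alpha ** 2 - 1.0) * (y + lorentz_inner(x, y) * x)
```

The exponential map at the origin, exp_map(mu0, v), is exactly the operation used later when sampling from wrapped normal distributions.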

2.2 Non-identifiability problem of hyperbolic embeddings

In a non-identifiable model, as pointed out in [26], the central limit theorem (CLT) does not hold for the maximum likelihood estimator uniformly over the parameter space. Thus, under these circumstances, neither AIC nor BIC can be applied to latent variable models because both are derived under the assumption that the CLT holds uniformly over the parameter space.

For notational simplicity, we omit D from the probability distribution, unless noted otherwise. We focus on undirected, unweighted, and simple graphs. Let \(n\in {\mathbb {Z}}_{\ge 2}\) be the number of nodes. For \(k\in {\mathbb {Z}}_{\ge 2}\), [k] and \(\Lambda _k\) are defined as follows: \([k] :=\{ 1, 2, \dots , k \}\) and \(\Lambda _k:=\{ (i, j) \mid i,j\in [k], i<j \}\). For \((i,j) \in \Lambda _n\), let \(y_{ij}=y_{ji} \in \{0, 1\}\) be a random variable that assumes the value 1 if the i-th node is connected to the j-th node and 0 otherwise.

For \(D=2\) and \(i \in [n]\), let \(\phi _i:=(r_i, \theta _{i})^\top \in {\mathbb {H}}^D\) be the polar coordinates of the i-th node, where \(r_i\in [0, R]\) and \(\theta _{i} \in [0, 2\pi )\). In this model, \(\varvec{y}:=\{ y_{ij} \}_{(i,j)\in \Lambda _n}\) is an observable variable, whereas \(\varvec{\phi }:=\{\phi _i \}_{i \in [n]}\) is a parameter of the probability distribution. For \(\beta _{{\max }}>\beta _{{\min }}>0\), \(\gamma _{{\max }}>\gamma _{{\min }}>0\), \(\beta \in [\beta _{{\min }}, \beta _{{\max }}]\), and \(\gamma \in [\gamma _{{\min }}, \gamma _{{\max }}]\), we assume that the random variable \(\varvec{y}\) is drawn from the following distribution:

$$\begin{aligned} p(\varvec{y}; \varvec{\phi }, \beta , \gamma )&:=\prod _{(i,j)\in \Lambda _n}p(y_{ij};\phi _i, \phi _j, \beta , \gamma ), \nonumber \\ p(y_{ij};\phi _i, \phi _j, \beta , \gamma )&:={\left\{ \begin{array}{ll} \frac{1}{1+\exp (\beta d_{\phi _i \phi _j}-\gamma )} &\quad (y_{ij}=1), \\ 1-\frac{1}{1+\exp (\beta d_{\phi _i \phi _j}-\gamma )} &\quad (y_{ij}=0). \end{array}\right. } \end{aligned}$$
(4)

Then, the following lemma holds.

Lemma 1

We assume that \(r_j \ne 0\) for some \(j\in [n]\). For \(\alpha \in (0, 2\pi )\), we define \(\phi _i^\prime :=(r_i, \theta _i+\alpha )^\top \) for all \(i\in [n]\). Then, \(\varvec{\phi } \ne \varvec{\phi }^\prime :=\{ \phi _i^\prime \}_{i\in [n]}\), and the following equation holds:

$$\begin{aligned} p(\varvec{y};\varvec{\phi }, \beta , \gamma ) = p(\varvec{y}; \varvec{\phi }^\prime , \beta , \gamma ). \end{aligned}$$

Therefore, the probability distribution of hyperbolic embeddings is non-identifiable.

Proof

Since \(\phi _j \ne \phi _j^\prime \) holds for some \(j\in [n]\) such that \(r_j \ne 0\), we have \(\varvec{\phi } \ne \varvec{\phi }^\prime \). For all \((i, j)\in \Lambda _n\), \(d_{\phi _i \phi _j}=d_{\phi _i^\prime \phi _j^\prime }\) because a simple calculation yields \(\langle \phi _i, \phi _j \rangle _{{\mathcal {L}}}=\langle \phi _i^\prime , \phi _j^\prime \rangle _{{\mathcal {L}}}\). Thus, the result follows from Eq. (4). \(\square \)

For \(D\ge 3\), non-identifiability can be proved by applying a similar transformation to the \((D-1)\)-th angular coordinate.
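Lemma 1 can also be checked numerically. The following sketch (for \(D=2\), with arbitrarily chosen \(\beta \), \(\gamma \), and rotation angle \(\alpha \)) verifies that rotating every node by the same angle leaves all pairwise distances, and hence the likelihood in Eq. (4), unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, gamma, alpha = 5, 1.0, 2.0, 0.8

def to_hyperboloid(r, theta):              # the D = 2 case of Eq. (2)
    return np.array([np.cosh(r), np.sinh(r) * np.cos(theta), np.sinh(r) * np.sin(theta)])

def dist(u, v):
    return np.arccosh(np.clip(u[0] * v[0] - u[1] * v[1] - u[2] * v[2], 1.0, None))

r = rng.uniform(0.1, 3.0, n)
theta = rng.uniform(0.0, 2 * np.pi, n)

for i in range(n):
    for j in range(i + 1, n):
        d1 = dist(to_hyperboloid(r[i], theta[i]), to_hyperboloid(r[j], theta[j]))
        d2 = dist(to_hyperboloid(r[i], theta[i] + alpha), to_hyperboloid(r[j], theta[j] + alpha))
        p1 = 1.0 / (1.0 + np.exp(beta * d1 - gamma))    # Eq. (4) with y_ij = 1
        p2 = 1.0 / (1.0 + np.exp(beta * d2 - gamma))
        assert np.isclose(d1, d2) and np.isclose(p1, p2)
print("Distances and edge probabilities are invariant under a common rotation.")
```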

2.3 Latent variable models of hyperbolic embeddings with PUDs and WNDs

To resolve the non-identifiability problem, we introduce two latent variable models, following work on PUDs [14, 15, 34] and WNDs [22]. In both models, an embedding is regarded as a set of latent variables and the edges as observed variables. Among several possible latent variable models, we chose these two for the following reasons.

  • For PUDs, it has been demonstrated in [14, 15, 34] that graphs generated with PUDs have two properties: a power-law degree distribution and a high clustering coefficient. These properties are common in real-world graphs [35].

  • For WNDs, it has been demonstrated that they improve the experimental performance of various downstream tasks [22].

In these models, we consider \(\varvec{\phi }=\{\phi _i \}_{i \in [n]}\) as latent random variables rather than parameters. Thus, we rewrite it as \(\varvec{z}:=\{z_i \}_{i \in [n]}\), where \(z_i :=(z_{i,0}, z_{i,1}, \dots , z_{i,D})^\top \) denotes its Cartesian coordinates.

2.3.1 Latent variable model with PUDs

The generation process of \(\varvec{y}, \varvec{z}\) with PUDs can be summarized as follows:

  1. For each vertex \(i\in [n]\):

     (a) Generate \(u_i :=(r_i, \theta _{i,1}, \dots , \theta _{i, D-1})^\top \sim p(u_i; \sigma , R)\).

     (b) Transform \(u_i\) to \(z_i\) through Eq. (2).

  2. For each pair of vertices \((i,j)\in \Lambda _n\):

     (a) Generate an observable variable \(y_{ij} \sim p(y_{ij} \mid z_i, z_j;\beta , \gamma )\) using Eq. (4).

Below we provide an explicit form of the probability distribution of \(\varvec{z}\) for PUDs.

For \(\sigma _{{\max }}> \sigma _{{\min }} \ge 0\), the random variable \(\varvec{u}:=\{u_i \}_{i \in [n]}\) is drawn according to the following distribution with the parameter \(\sigma \in [\sigma _{{\min }}, \sigma _{{\max }}]\):

$$\begin{aligned} p(\varvec{u}; \sigma , R)&:=\prod _{i\in [n]} p(u_i;\sigma , R), \\ p(u_i; \sigma , R)&:=p(r_i; \sigma , R)\prod _{j=1}^{D-1}p(\theta _{i,j}), \\ p(\theta _{i,j})&:={\left\{ \begin{array}{ll} \frac{\sin ^{D-1-j} \theta _{i,j}}{I_{D, j}} &\quad (j \ne D-1), \\ \frac{1}{2\pi } &\quad (j=D-1), \end{array}\right. } \\ p(r_i; \sigma , R)&:=\frac{\sinh ^{D-1}(\sigma r_i)}{C_D(\sigma )}, \end{aligned}$$

where \(I_{D, j}:=\int ^\pi _0 \sin ^{D-1-j} \theta {\textrm{d}}\theta \) and \(C_D(\sigma ):=\int ^R_0 \sinh ^{D-1}(\sigma r) {\textrm{d}}r\) denote the normalization constants. For \(p(\varvec{z};\sigma , R)\), because \(z_{i,0}\) is determined by the other D variables in Eq. (1), we have

$$\begin{aligned} p(\varvec{z}; \sigma , R)&:=\prod _{i\in [n]} p(z_i;\sigma , R), \\ p(z_i;\sigma , R)&:=\frac{1}{J(z_{i,1:D}:u_i)} p(u_i; \sigma , R), \end{aligned}$$

where \(z_{i,1:D}:=(z_{i,1},\dots ,z_{i,D})^\top \) and \(J(z_{i, 1:D}:u_i)\) is the Jacobian of the transformation from \(u_i\) to \(z_{i, 1:D}\), which is given as

$$\begin{aligned} J(z_{i, 1:D}:u_i) =\cosh (r_i) {\sinh }^{D-1}(r_i) \prod _{j=1}^{D-2} \sin ^{D-j-1}\theta _{i, j}. \end{aligned}$$

The derivation is provided in “Appendix A.” The probability distribution \(p(\varvec{z}; \sigma , R)\) is called the pseudo-uniform distribution because it is reduced to the uniform distribution in hyperbolic space when \(\sigma =1\).

In the following discussion, the value of R is assumed to be fixed and to satisfy \(R=O(\log n)\), where n is the number of nodes, and it is omitted from the description of the probability distribution. This is because the maximum average degree then satisfies \(k_{{\max }}=O(n)\) and the minimum average degree satisfies \(k_{{\min }}=O(1)\) under certain conditions, which is a common property of real-world complex networks [34].

In the aforementioned distribution, \(\sigma , \beta \), and \(\gamma \) are parameters, and D denotes the model of the probability distribution.
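The generation process above can be sketched as follows. The radial and angular densities of the PUD are one-dimensional, so a grid-based inverse-CDF sampler suffices for illustration (the grid size and function names are our own choices); the sampled polar coordinates would then be mapped to the hyperboloid via Eq. (2), and edges drawn from Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_density(grid, unnormalized_pdf, size):
    """Grid-based inverse-CDF sampling from a one-dimensional unnormalized density."""
    cdf = np.cumsum(unnormalized_pdf)
    cdf /= cdf[-1]
    return np.interp(rng.uniform(size=size), cdf, grid)

def sample_pud_polar(n, D, sigma, R, grid_size=10000):
    """Step 1(a): draw n samples (r, theta_1, ..., theta_{D-1}) from the PUD."""
    r_grid = np.linspace(1e-6, R, grid_size)
    r = sample_from_density(r_grid, np.sinh(sigma * r_grid) ** (D - 1), n)
    thetas = np.empty((n, D - 1))
    th_grid = np.linspace(1e-6, np.pi, grid_size)
    for j in range(1, D - 1):              # p(theta_j) proportional to sin^{D-1-j}(theta_j)
        thetas[:, j - 1] = sample_from_density(th_grid, np.sin(th_grid) ** (D - 1 - j), n)
    thetas[:, D - 2] = rng.uniform(0.0, 2 * np.pi, n)   # theta_{D-1} is uniform
    return r, thetas

def sample_edges(dist_matrix, beta, gamma):
    """Step 2(a): y_ij ~ Bernoulli(1 / (1 + exp(beta * d_ij - gamma))) as in Eq. (4)."""
    p = 1.0 / (1.0 + np.exp(beta * dist_matrix - gamma))
    return (rng.uniform(size=p.shape) < p).astype(int)

r, thetas = sample_pud_polar(n=1000, D=8, sigma=1.0, R=np.log(1000))
```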

2.3.2 Probability distribution of WNDs

WNDs are a generalization of Euclidean Gaussian distributions to the hyperbolic space. Thus, WNDs have two parameters: a mean in hyperbolic space \(\varvec{\mu }\in {\mathbb {H}}^D\) and a positive-definite covariance matrix \(\Sigma \in {\mathbb {R}}^{D \times D}\). In our model, we set \(\varvec{\mu }\) to \(\varvec{\mu }_0\), where \(\varvec{\mu }_0:=(1, 0, \dots , 0)^\top \) denotes the origin of \({\mathbb {H}}^D\). We make this assumption because a tree-like graph is considered to be radially distributed around the origin \(\varvec{\mu }_0\).

The generation process of \(\varvec{y}\) and \(\varvec{z}\) with the WNDs is summarized as follows:

  1. For each vertex \(i\in [n]\):

     (a) Generate \(v_i:=(v_{i,1}, \dots , v_{i, D})^\top \sim p(v_i; \Sigma )\).

     (b) Transform \(v_i\) to \(\tilde{v_i}:=[0, v_i^\top ]^\top \), which is a tangent vector at \(\varvec{\mu }_0\).

     (c) Transform \({\tilde{v}}_i\) to \(z_i\) through the exponential map \(\text {Exp}_{\varvec{\mu }_0} ({\tilde{v}}_i)\).

  2. For each pair of vertices \((i,j)\in \Lambda _n\):

     (a) Generate an observable variable \(y_{ij} \sim p(y_{ij} \mid z_i, z_j;\beta , \gamma )\) using Eq. (4).

Note that the second step is the same as that of the model with the PUDs. Below we provide an explicit form of the probability distribution of \(\varvec{z}\) with the WNDs.

A random variable \(\varvec{v}:=\{v_i \}_{i \in [n]}\) is drawn according to the following distribution:

$$\begin{aligned} p(\varvec{v}; \Sigma )&:=\prod _{i\in [n]} p(v_i;\Sigma ), \\ p(v_i; \Sigma )&:=\frac{1}{(2\pi )^{\frac{D}{2}} |\Sigma |^\frac{1}{2}}\exp \biggl (-\frac{1}{2} v_i^\top \Sigma ^{-1} v_i \biggr ). \end{aligned}$$

For \(p(\varvec{z};\Sigma )\), we have that

$$\begin{aligned} p(\varvec{z}; \Sigma )&:=\prod _{i\in [n]} p(z_i;\Sigma ), \\ p(z_i;\Sigma )&:=\frac{1}{J(z_{i,1:D}:v_i)} p(v_i; \Sigma ), \end{aligned}$$

where \(J(z_{i, 1:D}:v_i)\) is the Jacobian of the transformation from \(v_i\) to \(z_{i, 1:D}\), which is provided by

$$\begin{aligned} J(z_{i, 1:D}:v_i) =\biggl \{ \frac{\sinh \Vert v_i \Vert _{{\mathcal {L}}}}{\Vert v_i \Vert _{{\mathcal {L}}}} \biggr \}^{D-1}. \end{aligned}$$

The derivation of the Jacobian is given in [22].
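The WND generation process translates directly into code: sample a Euclidean Gaussian vector, prepend a zero to obtain a tangent vector at \(\varvec{\mu }_0\), and push it onto the hyperboloid with the exponential map of Eq. (3). A minimal sketch (function names ours) is given below.

```python
import numpy as np

rng = np.random.default_rng(0)

def exp_map_at_origin(v_tilde):
    """Exp_{mu_0}(v~) for a tangent vector v~ = (0, v_1, ..., v_D) at the origin mu_0."""
    norm = np.sqrt(np.sum(v_tilde[1:] ** 2))   # <v~, v~>_L reduces to the Euclidean norm here
    mu0 = np.zeros_like(v_tilde)
    mu0[0] = 1.0
    if norm == 0.0:
        return mu0
    return np.cosh(norm) * mu0 + np.sinh(norm) * v_tilde / norm

def sample_wnd(n, Sigma):
    """Steps 1(a)-(c): v_i ~ N(0, Sigma), lift to the tangent space at mu_0, map to H^D."""
    D = Sigma.shape[0]
    v = rng.multivariate_normal(np.zeros(D), Sigma, size=n)       # step (a)
    v_tilde = np.hstack([np.zeros((n, 1)), v])                     # step (b)
    return np.array([exp_map_at_origin(vt) for vt in v_tilde])     # step (c)

z = sample_wnd(1000, np.diag([0.5, 0.5]))                          # points on the 2-D hyperboloid
print(np.allclose(-z[:, 0] ** 2 + np.sum(z[:, 1:] ** 2, axis=1), -1.0))
```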

3 Dimensionality selection using DNML code-lengths

In this section, we present the calculation of the DNML code-lengths for two latent variable models. Thereafter, we present the optimization algorithm.

3.1 DNML code-lengths with PUDs and WNDs

According to the MDL principle [24], the probabilistic model that minimizes the total code-length required to encode the given data is selected. Data may be encoded using multiple methods. Although the NML code-length [36] is one of the most common encoding methods, its calculation is quite difficult for complex probability distributions such as PUDs and WNDs. Therefore, we employ the DNML code-length [26], whose calculation for latent variable models is relatively easy.

Let \({\mathcal {D}}:=\{ D_1, D_2, \dots , D_N \}\) denote a finite set of candidate dimensionalities \((D_i\in {\mathbb {Z}}_{\ge 2})\). We estimate the optimal dimensionality \({\hat{D}}\in {\mathcal {D}}\) and the optimal embedding \(\hat{\varvec{z}}\) that minimize the following criterion, which we call DNML-PUD:

$$\begin{aligned} L_{\text {DNML-PUD}}(\varvec{y}, \varvec{z}) :=\,&L_{\text {NML}}(\varvec{y} \mid \varvec{z})+L_{\text {NML}}(\varvec{z}), \nonumber \\ L_{\text {NML}}(\varvec{y} \mid \varvec{z}) :=\,&-\log p(\varvec{y} \mid \varvec{z};{\hat{\beta }}(\varvec{y}, \varvec{z}), {\hat{\gamma }}(\varvec{y}, \varvec{z})) \nonumber \\&+ \log \sum _{\bar{\varvec{y}}} p(\bar{\varvec{y}} \mid \varvec{z};{\hat{\beta }}(\bar{\varvec{y}}, \varvec{z}), {\hat{\gamma }}(\bar{\varvec{y}}, \varvec{z})), \nonumber \\ L_{\text {NML}}(\varvec{z}) :=\,&-\log p(\varvec{z}; {\hat{\sigma }}(\varvec{z})) +\log \int p(\bar{\varvec{z}}; {\hat{\sigma }}(\bar{\varvec{z}})) {\textrm{d}}\bar{\varvec{z}}_{1:D}, \end{aligned}$$
(5)

where \({\hat{\beta }}(\cdot , \cdot )\), \({\hat{\gamma }}(\cdot , \cdot )\), and \({\hat{\sigma }}(\cdot )\) denote the maximum likelihood estimators, \(\bar{\varvec{z}}_{1:D}:=\{ {\bar{z}}_{i, 1:D} \}_{i\in [n]}\), and the sum and integral are taken over all possible data in the predefined data domain. The second term in each NML code-length is called the parametric complexity. As the exact value of the integral is analytically intractable, we approximate \(L_{\text {NML}}(\varvec{y} \mid \varvec{z})\) and \(L_{\text {NML}}(\varvec{z})\) as follows:

$$\begin{aligned} L_{\text {NML}}(\varvec{y} \mid \varvec{z}) \approx&-\log p(\varvec{y} \mid \varvec{z};{\hat{\beta }}(\varvec{y}, \varvec{z}), {\hat{\gamma }}(\varvec{y}, \varvec{z})) \\&+ \log \frac{n(n-1)}{4\pi } +\log \int _{\gamma _{{\min }}}^{\gamma _{{\max }}}\int _{\beta _{{\min }}}^{\beta _{{\max }}} \sqrt{|I_n(\beta , \gamma ) |}\,{\textrm{d}}\beta \,{\textrm{d}}\gamma , \\ L_{\text {NML}}(\varvec{z}) \approx&-\log p(\varvec{z}; {\hat{\sigma }}(\varvec{z})) + \frac{1}{2} \log \frac{n}{2\pi }+\log \int _{\sigma _{{\min }}}^{\sigma _{{\max }}} \sqrt{|I(\sigma ) |}\,{\textrm{d}}\sigma , \end{aligned}$$

where \(I_n(\beta , \gamma )\) and \(I(\sigma )\) denote Fisher information, which is computed as

$$\begin{aligned} I_n(\beta , \gamma )&= \frac{2}{n(n-1)} \begin{pmatrix} \sum _{(i,j)\in \Lambda _n} \frac{d_{z_i z_j}^2}{4\cosh ^2 \bigl (\frac{\beta d_{z_i z_j}-\gamma }{2}\bigr )} & \sum _{(i,j)\in \Lambda _n} \frac{-d_{z_i z_j}}{4\cosh ^2 \bigl (\frac{\beta d_{z_i z_j}-\gamma }{2}\bigr )}\\ \sum _{(i,j)\in \Lambda _n} \frac{-d_{z_i z_j}}{4\cosh ^2 \bigl (\frac{\beta d_{z_i z_j}-\gamma }{2}\bigr )} & \sum _{(i,j)\in \Lambda _n} \frac{1}{4\cosh ^2 \bigl (\frac{\beta d_{z_i z_j}-\gamma }{2}\bigr )} \end{pmatrix}, \\ I(\sigma )&= (D-1)^2\frac{\int ^R_0 r^2 \cosh ^2(\sigma r)\sinh ^{D-3}(\sigma r)\,{\textrm{d}}r}{C_D(\sigma )} -\biggl \{ \frac{\int ^R_0 (D-1)r \cosh (\sigma r)\sinh ^{D-2}(\sigma r)\,{\textrm{d}}r}{C_D(\sigma )} \biggr \} ^2. \end{aligned}$$

The derivation is presented in “Appendix B.” Practically, \(I_n(\beta , \gamma )\) and \(I(\sigma )\) are calculated numerically because the analytic solution of the integral terms is not trivial.
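As an illustration of this numerical computation, the following sketch evaluates \(I(\sigma )\) and the parametric complexity term of \(L_{\text {NML}}(\varvec{z})\) by one-dimensional quadrature (the integration range \([\sigma _{{\min }}, \sigma _{{\max }}]\) and the use of scipy.integrate.quad are illustrative choices, not the exact implementation used in the paper).

```python
import numpy as np
from scipy.integrate import quad

def fisher_sigma(sigma, D, R):
    """I(sigma) for the radial part of the PUD, evaluated by numerical integration."""
    C = quad(lambda r: np.sinh(sigma * r) ** (D - 1), 0.0, R)[0]
    a = quad(lambda r: r ** 2 * np.cosh(sigma * r) ** 2 * np.sinh(sigma * r) ** (D - 3), 0.0, R)[0]
    b = quad(lambda r: (D - 1) * r * np.cosh(sigma * r) * np.sinh(sigma * r) ** (D - 2), 0.0, R)[0]
    return (D - 1) ** 2 * a / C - (b / C) ** 2

def parametric_complexity_z(n, D, R, sigma_min=0.1, sigma_max=10.0):
    """Approximate penalty part of L_NML(z): (1/2) log(n / 2 pi) + log int sqrt(I(sigma)) d sigma."""
    integral = quad(lambda s: np.sqrt(max(fisher_sigma(s, D, R), 0.0)), sigma_min, sigma_max)[0]
    return 0.5 * np.log(n / (2.0 * np.pi)) + np.log(integral)

print(parametric_complexity_z(n=6400, D=8, R=np.log(6400)))
```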

For WNDs, the DNML criterion is defined as follows:

$$\begin{aligned} L_{\text {DNML-WND}}(\varvec{y}, \varvec{z}) :=&L_{\text {NML}}(\varvec{y} \mid \varvec{z})+L_{\text {NML}}(\varvec{z}), \nonumber \\ L_{\text {NML}}(\varvec{z}) :=&-\log p(\varvec{z}; {\hat{\Sigma }}(\varvec{z})) +\log \int p(\bar{\varvec{z}}; {\hat{\Sigma }}(\bar{\varvec{z}})) {\textrm{d}}\bar{\varvec{z}}_{1:D}, \end{aligned}$$
(6)

where \({\hat{\Sigma }}\) denotes the maximum likelihood estimator of \(\Sigma \). Since the exact value of \(\int p(\bar{\varvec{z}}; {\hat{\Sigma }}(\bar{\varvec{z}})) {\textrm{d}}\bar{\varvec{z}}_{1:D}\) is analytically intractable, we employ the following upper bound.

$$\begin{aligned} \int p(\bar{\varvec{z}}; {\hat{\Sigma }}(\bar{\varvec{z}})) {\textrm{d}}\bar{\varvec{z}}_{1:D} \le \frac{\pi ^{\frac{D^2}{2}} \prod _{i\in [D-1]} \epsilon _{2i}^{D-i} \prod _{j\in [D]} \epsilon _{1j}^{\frac{1-D}{2}}}{\Gamma _D \bigl (\frac{D}{2} \bigr ) \Gamma _D \bigl (\frac{n}{2} \bigr )} \biggl ( \frac{n}{2e} \biggr )^{\frac{nD}{2}} \biggl ( \frac{2}{D-1} \biggr )^D. \end{aligned}$$

The derivation is presented in “Appendix B.”

Fig. 1 Example of DNML-PUD. The selected dimensionality and true dimensionality are \(D=8\). The graph was generated with parameters \(n=6400\), \(\beta =0.6\), \(\gamma = \beta \log n\), and \(\sigma =1.0\). The scores \(L_{\text {DNML}}(\varvec{y}, \varvec{z})\) and \( L_{\text {NML}}(\varvec{y} \mid \varvec{z})\) follow the left scale, whereas \(L_{\text {NML}}(\varvec{z})\) follows the right scale

Fig. 2 Example of DNML-WND. The selected dimensionality and true dimensionality are \(D=8\). The graph was generated with parameters \(n=6400\), \(\beta =1.2\), \(\gamma = \beta \log n\), and \(\Sigma =(0.35 \log n)^2I\), where \(I\in {\mathbb {R}}^{D\times D}\) is the identity matrix. The scores \(L_{\text {DNML}}(\varvec{y}, \varvec{z})\) and \( L_{\text {NML}}(\varvec{y} \mid \varvec{z})\) follow the left scale, whereas \(L_{\text {NML}}(\varvec{z})\) follows the right scale

We provide a more detailed explanation of DNML-PUD and DNML-WND. Figures 1 and 2 show \(L_{\text {NML}}(\varvec{y} \mid \varvec{z}), L_{\text {NML}}(\varvec{z})\) and \(L_{\text {DNML}}(\varvec{y}, \varvec{z})\) of an artificially generated graph with the true dimensionality \(D_{\text {true}}=8\) and \(n=6400\). The value of \(L_{\text {NML}}(\varvec{y} \mid \varvec{z})\) decreases as dimensionality increases. This implies that as the dimensionality increases, the graph can be reconstructed more accurately. However, the value of \(L_{\text {NML}}(\varvec{z})\) increases as dimensionality increases. This is because more code-length is required to encode the extra dimension of the embedding; that is, the model becomes more complex, and \(L_{\text {NML}}(\varvec{z})\) acts as a penalty term. Hence, minimizing \(L_{\text {DNML}}(\varvec{y}, \varvec{z})\) implies that dimensionality is chosen by considering both the accuracy of the reconstruction and the complexity of the model. Therefore, the DNML-PUD and DNML-WND select the true dimensionality \(D_{\text {true}}=8\).

3.2 Optimization

To derive the optimal dimensionality, Eqs. (5) and (6) should be optimized with respect to \(\varvec{z}, \beta , \gamma \), and \(\sigma \) (or \(\Sigma \)) for each dimensionality \(D\in {\mathcal {D}}\). However, direct optimization is difficult because of the analytical intractability of the parametric complexity of \(L_{\text {NML}}(\varvec{y} \mid \varvec{z})\). Instead, we optimize \(L(\varvec{z}, \beta , \gamma , \sigma ) :=-\log p(\varvec{y}, \varvec{z}; \beta , \gamma , \sigma )\) and \(L(\varvec{z},\beta , \gamma , \Sigma ) :=-\log p(\varvec{y}, \varvec{z}; \beta , \gamma , \Sigma )\), which are lower bounds of Eqs. (5) and (6), respectively.

First, we explain how to optimize \(L(\varvec{z}, \beta , \gamma , \sigma )\). We rewrite it as

$$\begin{aligned} L(\varvec{z}, \beta , \gamma , \sigma )&=- \sum _{(i,j)\in \Lambda _n} \log p(y_{ij} \mid z_i, z_j; \beta , \gamma ) - \sum _{i\in [n]} \log p(z_i;\sigma ) \\&= \sum _{(i,j)\in \Lambda _n} \biggl \{ -\log p(y_{ij} \mid z_i, z_j; \beta , \gamma ) - \frac{1}{n-1}\log p(z_i;\sigma ) \\&\quad -\frac{1}{n-1} \log p(z_j; \sigma ) \biggr \}. \end{aligned}$$

We apply the following stochastic update rule at iteration t:

$$\begin{aligned} L^{(t)}(\varvec{z}, \beta , \gamma , \sigma )&:=\frac{1}{|{\mathcal {B}}^{(t)} |}\sum _{(i,j)\in {\mathcal {B}}^{(t)}} \biggl \{ -\log p(y_{ij} \mid z_i, z_j; \beta , \gamma ) \\&\quad - \frac{1}{n-1}\log p(z_i;\sigma )-\frac{1}{n-1} \log p(z_j; \sigma ) \biggr \}, \end{aligned}$$

where \({\mathcal {B}}^{(t)}\subset \Lambda _n\) is the mini-batch for each iteration and \(|\cdot |\) denotes the number of elements in a set.

For \(z_i\), we used the geodesic update in the hyperboloid model [18]. The update rule for \(\varvec{z}\) is given as follows:

$$\begin{aligned} z_i^{(t+1)} \leftarrow \text {proj}_{{\mathbb {H}}^D_R} \left( \text {Exp}_{z_i^{(t)}} \left( -\eta ^{(t)}_{\varvec{z}} \pi _{z_i^{(t)}} \left( g_{D}^{-1}\frac{\partial L^{(t)}}{\partial z_i^{(t)}} \right) \right) \right) , \end{aligned}$$
(7)

where \(\eta ^{(t)}_{\varvec{z}}\) denotes the learning rate, \(\pi _{\varvec{z}}(\cdot )\) denotes the projection from the Euclidean gradient to the Riemannian gradient, \(\text {proj}_{{\mathbb {H}}^D_R}(\cdot )\) denotes the projection from \({\mathbb {H}}^D\) to \({\mathbb {H}}^D_R :=\{ \varvec{x} \mid \varvec{x}\in {\mathbb {H}}^D, d_{\varvec{\mu }_0\varvec{x}}\le R \}\), and \(\text {Exp}_{\varvec{z}} (\cdot )\) is given by Eq. (3). The functions \(\pi _{\varvec{z}}(\cdot )\) and \(\text {proj}_{{\mathbb {H}}^D_R}(\cdot )\) are defined as follows:

$$\begin{aligned} \pi _{\varvec{z}} (\varvec{u})&:=\varvec{u} - \frac{\langle \varvec{z}, \varvec{u} \rangle _{{\mathcal {L}}}}{\langle \varvec{z}, \varvec{z} \rangle _{{\mathcal {L}}}}\varvec{z}=\varvec{u}+\langle \varvec{z}, \varvec{u} \rangle _{{\mathcal {L}}}\varvec{z}, \\ \text {proj}_{{\mathbb {H}}^D_R}(\varvec{z})&:={\left\{ \begin{array}{ll} \varvec{z} &\quad (d_{\varvec{\mu }_0\varvec{z}} \le R), \\ \bigl (\cosh R, \frac{\sinh R}{\Vert \varvec{z}_{1:D} \Vert }z_1, \cdots , \frac{\sinh R}{\Vert \varvec{z}_{1:D} \Vert }z_D \bigr )^\top &\quad (d_{\varvec{\mu }_0\varvec{z}} > R), \end{array}\right. } \end{aligned}$$

where \(\Vert \cdot \Vert \) denotes the Euclidean norm.
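One update step of Eq. (7) can be sketched as follows (function names ours): the Euclidean gradient is first multiplied by \(g_D^{-1}\), projected onto the tangent space by \(\pi _{\varvec{z}}\), moved along the geodesic by the exponential map, and finally projected back into the radius-R ball.

```python
import numpy as np

def lorentz_inner(u, v):
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def riemannian_grad(z, euclid_grad):
    """pi_z(g_D^{-1} grad): negate the time-like component, then project onto T_z H^D."""
    h = euclid_grad.copy()
    h[0] = -h[0]                                    # multiply by g_D^{-1}
    return h + lorentz_inner(z, h) * z              # pi_z(u) = u + <z, u>_L z

def exp_map(z, v, eps=1e-12):
    norm_v = np.sqrt(max(lorentz_inner(v, v), 0.0))
    if norm_v < eps:
        return z
    return np.cosh(norm_v) * z + np.sinh(norm_v) * v / norm_v

def proj_ball(z, R):
    """proj_{H^D_R}: pull z back onto the radius-R ball around the origin if necessary."""
    if np.arccosh(max(z[0], 1.0)) <= R:             # d(mu_0, z) = arcosh(z_0)
        return z
    rest = np.sinh(R) * z[1:] / np.linalg.norm(z[1:])
    return np.concatenate(([np.cosh(R)], rest))

def update_z(z, euclid_grad, lr, R):
    """One step of Eq. (7) for a single embedding vector z_i."""
    return proj_ball(exp_map(z, -lr * riemannian_grad(z, euclid_grad)), R)
```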

With the learning rates \(\eta ^{(t)}_{\beta }\) and \(\eta ^{(t)}_{\gamma }\), the update rules of \(\beta \) and \(\gamma \) are provided as

$$\begin{aligned} \beta ^{(t+1)}&\leftarrow \text {proj}_{[\beta _{{\min }},\beta _{{\max }}]} \biggl ( \beta ^{(t)}-\eta ^{(t)}_{\beta } \frac{\partial L^{(t)}}{\partial \beta ^{(t)}} \biggr ), \end{aligned}$$
(8)
$$\begin{aligned} \gamma ^{(t+1)}&\leftarrow \text {proj}_{[\gamma _{{\min }},\gamma _{{\max }}]} \biggl ( \gamma ^{(t)}-\eta ^{(t)}_{\gamma } \frac{\partial L^{(t)}}{\partial \gamma ^{(t)}} \biggr ), \\ \text {proj}_{[a, b]}(x)&= {\left\{ \begin{array}{ll} b &\quad (x \ge b), \\ x &\quad (a \le x \le b), \\ a &\quad (x<a). \end{array}\right. }\nonumber \end{aligned}$$
(9)

Through a preliminary experiment using synthetic datasets, we confirmed that \(\sigma \) rarely converges to the true value when using the gradient descent method. Thus, for each epoch, we numerically calculated \({\hat{\sigma }}(\varvec{z})\) as

$$\begin{aligned} {\hat{\sigma }}(\varvec{z})=\mathop {\mathrm {arg~min}}\limits _{\sigma \in S} \bigl \{ -\log p(\varvec{z}; \sigma ) \bigr \}, \end{aligned}$$
(10)

where \(S=\{ \sigma _{{\min }}, \sigma _{{\min }} + \frac{1}{C}(\sigma _{{\max }}-\sigma _{{\min }}), \dots , \sigma _{{\min }} + \frac{C-1}{C}(\sigma _{{\max }}-\sigma _{{\min }}), \sigma _{{\max }} \}\), and \(C+1\) denotes the number of candidates.

For the optimization of \(L(\varvec{z}, \beta , \gamma , \Sigma )\), we define \(L^{(t)}(\varvec{z}, \beta , \gamma , \Sigma )\) in a similar manner. The update rules for \(z_i, \beta , \gamma \) are provided by Eqs. (7), (8), and (9), respectively. For each epoch, we optimized \(\Sigma \) using the following equation:

$$\begin{aligned} {\hat{\Sigma }}(\varvec{z})&=\mathop {\mathrm {arg~min}}\limits _{\Sigma } \bigl \{ -\log p(\varvec{z}; \Sigma ) \bigr \} \nonumber \\&= \frac{1}{n} \sum _{i \in [n]} v_i v_i^\top , \end{aligned}$$
(11)

where \(v_i :=\text {Exp}_{\varvec{\mu }_0}^{-1}(z_i)\) for all \(i\in [n]\). The optimization procedure is summarized in Algorithm 1.
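For illustration, the closed-form update of Eq. (11) can be implemented with the inverse exponential map at the origin (a minimal sketch; function names ours).

```python
import numpy as np

def log_map_at_origin(z, eps=1e-12):
    """Exp_{mu_0}^{-1}(z); returns only the last D components (the tangent part)."""
    alpha = z[0]                                    # -<mu_0, z>_L = z_0 = cosh d(mu_0, z)
    if alpha < 1.0 + eps:                           # z is (numerically) the origin
        return np.zeros(len(z) - 1)
    e0 = np.zeros_like(z)
    e0[0] = 1.0
    v_tilde = np.arccosh(alpha) / np.sqrt(alpha ** 2 - 1.0) * (z - alpha * e0)
    return v_tilde[1:]                              # the 0-th component is 0 by construction

def Sigma_hat(z_all):
    """Eq. (11): hat{Sigma} = (1/n) sum_i v_i v_i^T with v_i = Exp_{mu_0}^{-1}(z_i)."""
    V = np.array([log_map_at_origin(z) for z in z_all])
    return V.T @ V / len(z_all)
```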

Subsequently, we analyze the time efficiency of the proposed algorithms. For sufficiently large n, the update of \(\varvec{z}^{(t)}:=\{z_i^{(t)}\}_{i\in [n]}\) is dominant. We assume that \(|{\mathcal {B}}^{(t)} |\) takes a constant value B for all iterations. Since at most 2B nodes are used in each iteration to compute \(L^{(t)}\), we only need to update O(B) nodes per iteration. Thus, \(\varvec{z}^{(t)}\) is updated O(ETB) times in total.

Algorithm 1 Riemannian Gradient Descent for DNML criteria

4 Experimental results

This section presents a comparison between the proposed criteria and conventional methods using artificial and real-world datasets. The code, data, and training details are presented in “Appendix C.”

4.1 Methods for comparison

We used three criteria—\(\text {AIC}\), \(\text {BIC}\), and MinGE—for a comparative analysis of the performance of the proposed method. Here, the AIC and BIC are computed with respect to the non-identifiable model; that is, \(\beta \), \(\gamma \), and \(\varvec{z}\) are interpreted as parameters, and the criteria are defined as follows:

$$\begin{aligned} \text {AIC}(\varvec{y};D):=&-\log p(\varvec{y} \mid \varvec{z};{\hat{\beta }}(\varvec{y}, \varvec{z}), {\hat{\gamma }}(\varvec{y}, \varvec{z}))+(nD+2), \\ \text {BIC}(\varvec{y};D) :=&-\log p(\varvec{y} \mid \varvec{z};{\hat{\beta }}(\varvec{y}, \varvec{z}), {\hat{\gamma }}(\varvec{y}, \varvec{z})) +\frac{nD+2}{2}\log \frac{n(n-1)}{2}. \end{aligned}$$

Note that these criteria are not guaranteed to work for this model because of the non-identifiability. MinGE [11] is a dimensionality selection criterion for Euclidean graph embeddings. We set the weighting factor \(\lambda =1\) and selected the dimensionality at which MinGE was closest to 0.

Furthermore, we did not consider cross-validation (CV) for comparison because CV requires considerable computation time, particularly when learning graph embeddings.

It should be noted that we optimized \(-\log p(\varvec{y}, \varvec{z};\beta , \gamma , \sigma )\) for DNML-PUD, \(-\log p(\varvec{y}, \varvec{z};\beta , \gamma , \Sigma )\) for DNML-WND, and \(-\log p(\varvec{y} \mid \varvec{z}; \beta , \gamma )\) for AIC and BIC. Thus, three embeddings were generated for each graph.

4.2 Artificial dataset

In this experiment, we verified whether the proposed DNML criteria could estimate the true dimensionality.

4.2.1 Dataset detail

We considered the case of \(D_{\text {true}}=8\), where \(D_{\text {true}}\) is the true dimensionality of the PUDs. We generated a graph for each combination of parameters from the following candidates: \(n\in \{ 800, 1600, 3200, 6400\}\), \(\beta \in \{0.5, 0.6, 0.7, 0.8\}\), and \(\sigma \in \{0.5, 1.0, 2.0\}\). Furthermore, we set \(R=\log n\) and \(\gamma = \beta \log n\). Consequently, we obtained 48 graphs in total, which we call PUD-8. Similarly, we generated PUD-16, whose true dimensionality was 16. The other parameters are the same as those of PUD-8.

Subsequently, we generated WND-8. For the true dimensionality \(D_{\text {true}}=8\), we generated a graph for each combination of parameters from the following candidates: \(n\in \{ 800, 1600, 3200, 6400\}, \beta \in \{0.5, 0.6, 0.7, 0.8\}\), and \(\Sigma \in \{(0.35\log n)^2 I\), \((0.375\log n)^2 I\), \((0.40\log n)^2 I\}\), where \(I\in {\mathbb {R}}^{D\times D}\) is the identity matrix. Furthermore, we also set \(R=\log n\) and \(\gamma = \beta \log n\). Similarly, we generated WND-16 with \(\Sigma \in \{(0.225\log n)^2 I\), \((0.25\log n)^2 I\), \((0.275\log n)^2 I\}\). The other parameters are the same as those of WND-8. The set of candidates for the dimensionalities was \({\mathcal {D}}=\{ 2, 4, 8, 16, 32, 64 \}\).

In the above generation process, the parameters were set such that the generated graphs were sparse; that is, the average degree is low relative to the number of nodes.

Table 1 Average MAPs on PUD-8. The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)
Table 2 Average MAPs on PUD-16. The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)
Table 3 Average MAPs on WND-8. The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)
Table 4 Average MAPs on WND-16. The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)
Fig. 3 Results on PUD-8. Left: \(n=800\). Right: \(n=6400\)

Fig. 4 Results on WND-8. Left: \(n=800\). Right: \(n=6400\)

4.2.2 Results

To provide an illustrative example for each criterion, we first compared the selected dimensionality of PUD-8 and WND-8 with \(n=800, 6400\). Figures 3 and 4 show the normalized values for each criterion.

For \(n=800\), AIC, BIC, and DNML selected \(D=4\). Intuitively, a graph with a few nodes is expected to be embedded in low dimensionality, even if its true dimensionality is high. For \(n=6400\), DNML selected the correct dimensionality \(D=8\), whereas AIC and BIC selected incorrect dimensionalities. This implies that the DNML criteria can select the true dimensionality with a sufficient amount of data. MinGE selected the maximum dimensionality \(D=64\) in all cases. This is possibly because MinGE was designed for Euclidean embeddings, which require larger dimensionality than hyperbolic embeddings for hierarchical structures, as discussed in Sect. 1.1.

Next, we provide a quantitative comparison in terms of mean average precision (MAP) [37]. The MAP is calculated from the ranking of the dimensionalities, which was created in ascending order of each criterion. Furthermore, we applied DNML-PUD to the WND datasets and DNML-WND to the PUD datasets.

Tables 1, 2, 3, and 4 present the results for PUD-8, PUD-16, WND-8, and WND-16, respectively. Firstly, the MAPs of BIC and MinGE were relatively low, and they always selected \(D=4\) and \(D=64\), respectively. Since the selected dimensionalities were constant, BIC and MinGE are less reliable.

For AIC, we observed good performance in many cases; however, it tends to overestimate the true dimensionality for PUD-8 and WND-8 with \(n=6400\). This is because the penalty term of AIC is smaller than those of other criteria. For DNML criteria, in general, when the sample size is sufficiently large, DNML-PUD identifies the true dimensionality of the PUD dataset and the same tendency holds for DNML-WND and the WND dataset. Thus, we concluded that the proposed DNML criteria are more effective than AIC when the true dimensionality of the given graph is low.

Note that, in general, the performance of DNML-PUD in the WND dataset varied, sometimes being better and sometimes worse compared to DNML-WND. Similarly, in the PUD dataset, the performance of DNML-WND also varied, sometimes being better and sometimes worse compared to DNML-PUD. This is because the theoretical properties of the MDL principle are not valid when the generation process of the given data and the assumed generation process for calculating DNML code-length are different from each other. Therefore, this observation is an expected result of the mismatch of the generation process.

4.3 Real-world datasets

We used scientific collaboration networks from [38,39,40], flight network from https://openflights.org, protein–protein interaction network from [41], and the WN datasets from [27] for our study because they were employed in [16, 42, 43], which are representative studies in the field of hyperbolic embeddings. The experimental results in [16, 42, 43] demonstrated that hyperbolic embeddings outperformed Euclidean embeddings in several graph mining tasks performed on the networks. Therefore, we concluded that they are suitable for comparing our proposed method with others.

4.3.1 Link prediction

The DNML-PUD, DNML-WND, and other model selection criteria were applied to eight real-world graphs. In real-world graphs, the true dimensionality is unknown. Therefore, in this experiment, the link prediction performance for the selected dimensions was evaluated.

Dataset detail

The details of the eight graphs are listed below.

  • Scientific collaboration networks. We used AstroPh, CondMat, GrQc, and HepPh from [40], Cora from [38], and PubMed from [39]. These graphs are networks that represent the co-authorship of papers, where an edge exists between two people if they are co-authors.

  • Flight networks. We used Airport from https://openflights.org/. In this graph, nodes represent airports, and edges represent airline routes.

  • Protein–protein interaction (PPI) networks. We further used PPI from [41]. This graph represents the protein interactions in yeast.

Furthermore, Table 5 summarizes the statistics of these graphs.

Then, each graph was split into a training set \(\varvec{y}_{\text {train}} \subset \varvec{y}\) and test set \(\varvec{y}_{\text {test}}\subset \varvec{y}\). The test set \(\varvec{y}_{\text {test}}\) comprises the positive and negative samples. First, we sampled \(10\%\) of the positive samples from a graph. Subsequently, to generate negative samples, we sampled an equal number of node pairs with no edge connection. Finally, we obtained the training set \(\varvec{y}_{\text {train}} :=\varvec{y}{\setminus } \varvec{y}_{\text {test}}\).

For \(\varvec{y}_{\text {test}}\), we calculated the area under the curve (AUC), which we define as follows. We first calculated the distance for each sample pair in \(\varvec{y}_{\textrm{test}}\). Subsequently, we calculated the true positive rate and false positive rate with a fixed threshold on the distance. Finally, we obtained the receiver operating characteristic (ROC) curve by varying the threshold, and the AUC is defined as the area under the ROC curve.
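Concretely, the AUC can be computed by using the negated distance as a score, since smaller distances should indicate edges. A minimal sketch with scikit-learn (an illustrative choice of library) is shown below.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(distances, labels):
    """AUC of the ROC curve obtained by thresholding the hyperbolic distance.

    distances: d(z_i, z_j) for each test pair (i, j).
    labels:    1 for positive samples (true edges), 0 for negative samples.
    """
    return roc_auc_score(labels, -np.asarray(distances))

# Toy usage: connected pairs tend to be closer than unconnected ones.
d = np.array([0.5, 1.2, 0.8, 3.1, 2.7, 4.0])
y = np.array([1, 1, 1, 0, 0, 0])
print(link_prediction_auc(d, y))   # 1.0 for this perfectly separated toy example
```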

The AUC was used to quantify the link prediction performance. The candidates of dimensionalities were \({\mathcal {D}}=\{ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64 \}\).

Table 5 Statistics of scientific collaboration networks
Table 6 Selected dimensionalities of each method
Fig. 5 AUCs on link prediction. First row: AstroPh and CondMat. Second row: GrQc and HepPh. Third row: Cora and PubMed. Fourth row: Airport and PPI

Fig. 6 Results of each criterion. First row: AstroPh and CondMat. Second row: GrQc and HepPh. Third row: Cora and PubMed. Fourth row: Airport and PPI

Results

Figure 5 shows the AUCs of the optimal embeddings associated with \(-\log p(\varvec{y}, \varvec{z};\beta , \gamma , \sigma )\), \(-\log p(\varvec{y}, \varvec{z};\beta , \gamma , \Sigma )\), and \(-\log p(\varvec{y} \mid \varvec{z}; \beta , \gamma )\). Figure 6 shows the normalized values of each criterion, and Table 6 shows the selected dimensionalities. The performance at the dimensionalities selected by DNML-PUD and DNML-WND was not the best, and higher dimensionalities tended to yield higher AUCs.

According to [24], the consistency of the MDL model selection is theoretically guaranteed; that is, the model with the shortest code-length would converge to the true one if it exists. Therefore, the dimensionalities selected by DNML were considered to be close to the dimensionalities of the true probabilistic models that generated the data. However, our results suggest that such dimensionality of the true probabilistic model is not necessarily the best one for link prediction.

Table 7 Average conciseness of each method with \(\epsilon _{\textrm{max}} = 0.050\) and 0.100 (the bold text indicates either the maximum conciseness or conciseness within a 10% decrease from the maximum one)
Fig. 7 Typical example of \(c({\hat{D}}, \epsilon )\)

In this section, we provide another perspective on the experimental results. As discussed in Sect. 1.1, dimensionality also controls the computation time and memory. Therefore, it is important to select a dimensionality at which a relatively high performance is achieved while maintaining low computational resources (e.g., using embeddings in edge devices). To quantify this, we introduce conciseness defined as follows: Let \({\mathcal {D}}:=\{D_1, \dots , D_N\}\) be the candidates of dimensionalities, \(\textrm{AUC}_{D_i}\) be the AUC at dimensionality \(D_i\), \(\overline{\textrm{AUC}}\) be the maximum AUC, and \(\epsilon _{{\max }}\) be a maximum tolerance gap relative to \(\overline{\textrm{AUC}}\). For the selected dimensionality \({\hat{D}}\) of each criterion, the conciseness is provided by:

$$\begin{aligned} \textrm{conciseness} ({\hat{D}}, {\epsilon _{\max }})&:=\frac{1}{\epsilon _{{\max }}P} \sum _{i=0, 1, \dots , P} c \biggl ({\hat{D}}, \frac{i}{P}\epsilon _{{\max }} \biggr ), \\ c({\hat{D}}, \epsilon )&:={\left\{ \begin{array}{ll} 1-\frac{\log _2 {\hat{D}} - \log _2 D_{{\min }}}{\log _2 D_{{\max }} - \log _2 D_{{\min }}} &\quad ({\hat{D}} \in {\mathcal {D}}_\epsilon ), \\ 0 &\quad ({\hat{D}} \notin {\mathcal {D}}_\epsilon ), \end{array}\right. } \end{aligned}$$

where \({\mathcal {D}}_\epsilon :=\{ D_i\in {\mathcal {D}} \mid \overline{\textrm{AUC}}-\textrm{AUC}_{D_i} < \epsilon \}\), \(D_{{\min }}:=\min _{D_i \in {\mathcal {D}}_\epsilon } D_i, D_{{\max }}:=\max _{D_i \in {\mathcal {D}}_{\epsilon }} D_i\), and \(P \in {\mathbb {Z}}_{\ge 1}\) is the number of candidate points to calculate the conciseness.

Figure 7 shows a typical example of \(c({\hat{D}}, \epsilon )\). The proposed conciseness measure assumes a situation in which the AUC improves as the dimensionality increases but the extent of the improvement diminishes, and in which, with limited computational resources, we want to select a low dimensionality whose AUC stays within the maximum tolerance gap. Based on this motivation, the conciseness measure is designed to take high values when \({\hat{D}}\in {\mathcal {D}}\) is close to \(D_{{\min }}\).
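For reference, the conciseness measure can be computed as in the following sketch (the handling of the degenerate case \(D_{{\min }}=D_{{\max }}\) and the example values are our own illustrative choices).

```python
import numpy as np

def conciseness(D_hat, dims, aucs, eps_max, P=100):
    """Average of c(D_hat, eps) over eps = 0, eps_max/P, ..., eps_max."""
    dims, aucs = np.asarray(dims), np.asarray(aucs)
    auc_best = aucs.max()
    total = 0.0
    for i in range(P + 1):
        eps = i * eps_max / P
        D_eps = dims[auc_best - aucs < eps]         # dimensionalities within the tolerance gap
        if D_hat not in D_eps:
            c = 0.0
        elif D_eps.min() == D_eps.max():            # degenerate case: only one candidate survives
            c = 1.0
        else:
            lo, hi = np.log2(D_eps.min()), np.log2(D_eps.max())
            c = 1.0 - (np.log2(D_hat) - lo) / (hi - lo)
        total += c
    return total / (eps_max * P)

dims = [2, 4, 8, 16, 32, 64]
aucs = [0.80, 0.86, 0.90, 0.92, 0.93, 0.93]
print(conciseness(D_hat=8, dims=dims, aucs=aucs, eps_max=0.05))
```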

Since the conciseness significantly depends on \(\epsilon _{{\max }}\), we computed it for \(\epsilon _{{\max }}=0.050\) and 0.100. Table 7 shows the average conciseness of the selected dimensionalities. To calculate the conciseness, we used the embeddings associated with \(-\log p(\varvec{y}, \varvec{z}; \beta , \gamma , \sigma )\) for DNML-PUD, \(-\log p(\varvec{y}, \varvec{z}; \beta , \gamma , \Sigma )\) for DNML-WND, and \(-\log p(\varvec{y} \mid \varvec{z}; \beta , \gamma )\) for AIC, BIC, and MinGE. Furthermore, we set \(\overline{\textrm{AUC}}\) as the maximum AUC associated with three embeddings.

The best or second-best performance was achieved by either DNML-PUD or DNML-WND in many cases. For AIC, the performance was relatively high, but not the best in many cases. For BIC, the performance was relatively high, specifically for \(\epsilon _{{\max }} = 0.100\). This indicates that BIC is effective when the maximum tolerance gap is large. For MinGE, the performance was close to 0 because the selected dimensionality was considerably high. Overall, the proposed method works well in that it identifies a dimensionality with a relatively high AUC while maintaining low computational resources.

Note that all the performances were 0 for GrQc. This is because higher dimensionalities achieve considerably higher AUCs in GrQc, unlike most other networks, where increasing the dimensionality does not significantly improve the AUC. In such scenarios, the conciseness measure does not take positive values unless \(\epsilon _{{\max }}\) is set to a considerably high value; however, setting an excessively high tolerance gap (e.g., \(\epsilon _{{\max }}=0.300\)) lacks practical meaning, and it is sufficient to select the maximum dimensionality within the given computational resources.

4.3.2 Preservation of hierarchy

To investigate the extent to which the hierarchical structure was preserved, we used a subset of WordNet [27] closely following the setting in [16, 18].

Table 8 Statistics of WN datasets

Dataset detail

We first considered the transitive closure of the is-a relationship for all the nouns. Subsequently, we took the subset of the nouns that have “mammal” as a hypernym and selected relations that have “mammal” as a hyponym or hypernym. Finally, we connected two nouns if they have an is-a relationship. We refer to this dataset as WN-mammal. Similarly, we generated WN-solid, WN-tree, WN-worker, WN-adult, WN-instrument, WN-leader, and WN-implement. Table 8 summarizes the statistics for these datasets.

Each graph is expected to have a hierarchical structure because a hypernym is often related to many hyponyms. We embedded eight graphs with various dimensionalities and calculated each criterion.

Fig. 8 Results on WN datasets. First row: WN-mammal and WN-solid. Second row: WN-tree and WN-worker. Third row: WN-adult and WN-instrument. Fourth row: WN-leader and WN-implement

Fig. 9 Is-a scores on WN datasets. First row: WN-mammal and WN-solid. Second row: WN-tree and WN-worker. Third row: WN-adult and WN-instrument. Fourth row: WN-leader and WN-implement

Result

Figure 8 shows the normalized criteria. The candidates of dimensionalities were \({\mathcal {D}}=\{ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64 \}\).

Subsequently, we quantified the extent to which the obtained embeddings reflected the true hierarchy of the is-a relation on the data. Following [16], we used the following score function:

$$\begin{aligned} {\text {is-a-score}}(\varvec{u}, \varvec{v}) :=(\alpha (r_u -r_v)-1) d_{\varvec{u}\varvec{v}}, \end{aligned}$$

where \(r_u\) and \(r_v\) are the radius coordinates of \(\varvec{u}\) and \(\varvec{v}\), respectively, and \(\alpha >0\) is a constant. In general, it can be assumed that a hypernym has a lower radial coordinate than its hyponyms. Thus, \(\alpha (r_u -r_v)\) acts as a penalty when v, which is a hypernym of u, is lower in the embedding hierarchy. Therefore, the score is expected to be high when the embedding reflects the true hierarchy of data. Figure 9 shows the average score over all is-a relation pairs for each dimensionality with \(\alpha =100\). Overall, the lower dimensionality achieved higher is-a scores.
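On the hyperboloid, the radial coordinate of a point equals its distance from the origin, \(r_x = {{\,\textrm{arcosh}\,}}(x_0)\), so the is-a score can be computed as in the following sketch (function name ours).

```python
import numpy as np

def isa_score(z_u, z_v, alpha=100.0):
    """is-a-score(u, v) = (alpha * (r_u - r_v) - 1) * d(u, v) on the hyperboloid."""
    d_uv = np.arccosh(np.clip(z_u[0] * z_v[0] - np.dot(z_u[1:], z_v[1:]), 1.0, None))
    r_u, r_v = np.arccosh(z_u[0]), np.arccosh(z_v[0])   # radial coordinate = distance from the origin
    return (alpha * (r_u - r_v) - 1.0) * d_uv
```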

Table 9 Average benefits on WN datasets (the bold text indicates the maximum one)

Next, we provide a quantitative comparison of the benefits. With an estimated dimensionality \({\hat{D}}\) and a maximum tolerance gap \(T_{\text {gap}}\), the benefit is defined as follows:

$$\begin{aligned} b({\hat{D}}, D_{\text {best}}) :=\max \biggl \{ 0, 1-\frac{|\log _2 {\hat{D}}- \log _2 D_{\text {best}} |}{T_{\text {gap}}} \biggr \}, \end{aligned}$$

where \(D_{\textrm{best}}\) is the dimensionality at which the best is-a score is achieved. It has a high value when the estimated dimensionality is close to \(D_{\textrm{best}}\).
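The benefit is straightforward to compute; a small sketch follows (function name ours).

```python
import numpy as np

def benefit(D_hat, D_best, T_gap=2.0):
    """b(D_hat, D_best) = max{0, 1 - |log2 D_hat - log2 D_best| / T_gap}."""
    return max(0.0, 1.0 - abs(np.log2(D_hat) - np.log2(D_best)) / T_gap)

print(benefit(2, 2), benefit(4, 2), benefit(16, 2))   # 1.0, 0.5, 0.0
```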

Table 9 shows the average benefits over the WN datasets. Note that \(D_{\textrm{best}}\) was 2 for all datasets, and we set \(T_{\text {gap}}=2\) in the experiments. The DNML-WND and BIC achieved the best results. This result implies that DNML-WND and BIC selected the dimensionality that reflects the true hierarchical structure of the particular graph.

4.4 Discussion

We summarize the experimental results. Firstly, as a general trend, the AIC usually selects higher dimensionality, the BIC lower dimensionality, and the DNML criteria middle dimensionality. This is due to the relative magnitude of the penalty terms. MinGE always selected the largest dimensionality in all the experiments. This is possibly because MinGE was designed for Euclidean embeddings, which require higher dimensionality than hyperbolic embeddings for hierarchical structures, as discussed in Sect. 1.1.

The performance of the AIC was relatively good on the first and second tasks, but not on the third task. In contrast, the BIC showed high performance on the third task, but low performance on the first and second tasks. The DNML criteria do not necessarily give the best performance, but they often give the best or second-best performance across all tasks. Therefore, it can be concluded that the performance of the proposed DNML criteria was good on average across all tasks.

5 Conclusion and future work

In this study, we proposed a dimensionality selection method for hyperbolic embeddings based on the MDL principle. We demonstrated that there is a non-identifiability problem for hyperbolic embeddings. Therefore, we employed latent variable models of hyperbolic embeddings using PUDs and WNDs to formulate dimensionality selection as statistical model selection for latent variable models. Within this formulation, we proposed the DNML code-length criterion for dimensionality selection based on the MDL principle. For artificial datasets, we experimentally demonstrated that our method is effective when the true dimensionality is low. For real-world datasets, we used the scientific collaboration networks and WN datasets. For the scientific collaboration networks, we demonstrated that the proposed method selected simple probabilistic models while maintaining high AUCs. For the WN datasets, we confirmed that the proposed method selects the dimensionality that preserves the true hierarchy of graphs. Overall, the proposed method performed well on average.

Note that we cannot directly use the dimensionality selected by the proposed method in hyperbolic neural network methods, such as [43, 44]. Because the probabilistic models used in the proposed method differ from those used in [43, 44], the optimal dimensionalities within them are inherently different. This implies that using the dimensionality selected by the proposed method in [43, 44] is not supported by the rationale of the DNML code-length and the MDL principle. However, we would like to emphasize that even in hyperbolic neural networks, it is possible to consider latent variable models and derive DNML code-lengths following a procedure similar to ours. Specifically, we can treat the parameters and outputs of hidden layers of hyperbolic neural networks as latent variables. In this sense, the proposed method can be generalized. However, as indicated in “Appendix B,” the derivation of the DNML code-length requires significant effort and is not straightforward. Therefore, we consider the extension of our proposed method to hyperbolic neural networks as future work.

The latent variable model approach adopted in this study is promising and is not limited to hyperbolic space. In the future, we plan to build a methodology for dimensionality selection in Euclidean and spherical embeddings by introducing latent variable models for the corresponding spaces [45, 46].