1 Introduction

1.1 Motivation

Graphs are convenient tools for knowledge representation and can be used to represent various types of real-world data. Consequently, graph analysis has garnered significant attention in recent years in fields such as biology (e.g., protein–protein interaction networks) [1], social sciences (e.g., friendship networks) [2], and linguistics (e.g., word co-occurrence networks) [3]. Generally, tasks in graph analysis are classified into the following four categories: (1) node classification, (2) link prediction, (3) node clustering, and (4) graph visualization [4].

Graph embeddings, which convert discrete representations into continuous ones, such as vectors in Euclidean space, have become popular tools in graph analysis [5,6,7,8]. They provide effective solutions for the aforementioned tasks, as continuous representations can be used as the input in tasks of types (1), (2), and (3), whereas two-dimensional continuous representations are used directly in tasks of type (4).

Dimensionality is one of the most important hyperparameters in graph embeddings. First, the performance of node classification, link prediction, and node clustering depends on it. Intuitively, nodes cannot be distinguished well when the dimensionality is too low, whereas the embedded relations are strongly affected by irregularities in the data when the dimensionality is too high. Second, the training time and computational cost directly depend on it. Therefore, the issue of dimensionality selection for graph and word embeddings has garnered significant attention [9,10,11,12,13]. However, most existing studies have focused on Euclidean space, although hyperbolic space is a viable alternative embedding space.

Hyperbolic space is a Riemannian manifold with constant negative curvature. In network science, hyperbolic space is suitable for modeling hierarchical structures [14, 15]. In a tree, the number of leaves at level h and the total number of nodes up to level h both grow exponentially in h. The analogues of these two quantities in hyperbolic and Euclidean space are the circumference and area of a circle, respectively. In two-dimensional hyperbolic space with constant curvature \(K=-1\), the circumference of a circle of hyperbolic radius r is \(2\pi \sinh r\) and its area is \(2\pi (\cosh r-1)\), both of which grow exponentially with r. In contrast, in the two-dimensional Euclidean space \({\mathbb {R}}^2\), the circumference of a circle is \(2\pi r\) and its area is \(\pi r^2\), both of which grow only polynomially with r. This analogy demonstrates that hyperbolic space has a natural affinity for hierarchical structures, whereas embedding a hierarchical structure in Euclidean space essentially requires increasing the dimensionality. Owing to these properties, hyperbolic embeddings have been studied extensively in recent years [16,17,18]. However, to the best of our knowledge, there has been no previous research on dimensionality selection in hyperbolic space except [19].

In this study, we propose a novel methodology for dimensionality selection of hyperbolic graph embeddings. We address this issue from the viewpoint of statistical model selection. First, we demonstrate that there is a non-identifiability problem in the conventional probabilistic model of hyperbolic embeddings; that is, there is no one-to-one correspondence between the parameter and the probability distribution. This problem invalidates the use of the conventional model selection criteria, such as Akaike’s information criterion (AIC) [20] and the Bayesian information criterion (BIC) [21]. To overcome this difficulty, we employ two latent variable models of hyperbolic embeddings following pseudo-uniform distributions (PUDs) [14, 15] and wrapped normal distributions (WNDs) in a hyperbolic space [22]. We thereby introduce a criterion for dimensionality selection based on the minimum description length (MDL) principle [23].

The MDL principle asserts that the best model minimizes the total code-length required for encoding the particular data. It exhibits several advantages, such as consistency [24] and rapid convergence in the framework of probably approximately correct (PAC) learning [25]. Although the MDL-based dimensionality selection was developed for Word2Vec-type word embeddings into the Euclidean space by Hung and Yamanishi [12], their techniques cannot straightforwardly be applied to hyperbolic graph embeddings.

The DNML criterion [26] is a model selection criterion for latent variable models based on the MDL principle, where the non-identifiability problem is resolved by jointly encoding the observed and latent variables. The shorter the DNML code-length, the better the dimensionality. Herein, we propose to apply DNML to the problem of dimensionality selection for hyperbolic embeddings. The DNML criteria obtained by applying it to PUDs and WNDs are called the decomposed normalized maximum likelihood code-length for pseudo-uniform distributions (DNML-PUD) and the DNML code-length for wrapped normal distributions (DNML-WND), respectively.

The novelty and significance of this study are summarized as follows.

  • Proposal of a novel methodology of dimensionality selection for hyperbolic embeddings We propose DNML-PUD and DNML-WND for selecting the best dimensionality of hyperbolic graph embeddings. We aim to introduce latent variable models of hyperbolic embeddings with PUDs and WNDs and then apply the DNML criterion to its dimensionality selection, based on the MDL principle. One of our significant contributions is to derive explicit formulas of DNML for specific cases of PUDs and WNDs.

  • Empirical demonstration of the effectiveness of our methodology We evaluated the proposed method using both synthetic and real-world datasets. For the synthetic datasets, we first generated graphs with a known true dimensionality and then examined whether the true dimensionality could be identified. For the real-world datasets, we examined the relationship between the selected dimensionality and link prediction performance. Furthermore, we quantified the extent to which the hierarchical structure of a graph was preserved using the WordNet (WN) [27] datasets. Overall, our experimental results confirmed the effectiveness of our method.

The preliminary version of this paper appeared in [28]. The major updates of this paper are summarized below:

  • We introduced a new latent variable model based on wrapped normal distributions in hyperbolic space [22] and derived an upper bound on its DNML criterion, which we call DNML-WND.

  • In addition, the evaluation of DNML-WND was added to the experimental results.

  • We added a new metric called conciseness to the evaluation of the link prediction task in Sect. 4.3.1.

1.2 Related work

Conventionally, the dimensionality is determined heuristically based on domain knowledge. However, in recent years, several studies have proposed more principled approaches for this purpose.

Yin and Shen [9] proposed a pairwise inner product (PIP) loss, which quantifies the performance of embeddings based on the bias-variance trade-off. The PIP loss is applicable to embeddings that can be formulated as low-rank matrix approximations, and its theoretical aspects have been investigated extensively. However, it is not known whether hyperbolic embeddings satisfy this condition; thus, the PIP loss cannot be directly used for hyperbolic embeddings. Gu et al. [10] extended the PIP loss to the normalized embedding loss. It is applicable to hyperbolic embeddings once the normalized embedding loss of hyperbolic embeddings is defined. However, the supporting empirical observations (for example, the behavior of the normalized embedding loss in Eq. (2) of [10]) are limited to the Euclidean space, and it remains unknown whether such observations also hold for hyperbolic embeddings.

Luo et al. [11] proposed minimum graph entropy (MinGE) to select a dimensionality that minimizes graph entropy, which is a weighted sum of feature entropy and structure entropy. However, feature entropy depends on a certain probability distribution in the Euclidean space, and its extension to hyperbolic space is not straightforward. Moreover, although MinGE was demonstrated to exhibit excellent experimental performance, no particular rationale was provided for the selected dimensionality. Wang [13] proposed a method that first learns embeddings in a sufficiently high-dimensional Euclidean space (e.g., the 1000-dimensional Euclidean space), then applies principal component analysis (PCA) to the embeddings, and finally selects the dimensionality that minimizes a predefined score function. Recently, several hyperbolic dimensionality reduction methods have also been proposed [29,30,31], which indicates the possibility of extending the method to hyperbolic space. To extend the method to the hyperbolic case, the following two points should be discussed: (1) how to define the score function and (2) which dimensionality reduction method should be used.

Recently, the graph neural architecture search method (GraphNAS) has been proposed in [32]. GraphNAS selects the best architecture of graph neural networks, including their dimensionality, using reinforcement learning. The most important difference between the proposed method and GraphNAS is that GraphNAS determines the architecture in a task-dependent manner (e.g., based on the accuracy in node classification), while the proposed method is task-independent and estimates universal dimensionalities based on the MDL principle. Another difference is that the proposed method targets hyperbolic embeddings, while GraphNAS targets Euclidean embeddings. Thus, it is potentially possible to extend GraphNAS to hyperbolic neural networks and compare its performance with the proposed method. However, to the best of our knowledge, no study has addressed this extension, and the extension is not straightforward.

Almagro and Boguna [19] proposed a dimensionality selection method for hyperbolic embeddings. In [19], dimensionality was inferred using predictive models, such as the k-nearest neighbors algorithm or deep learning, where the input is the triplet of the mean densities of chordless cycles, squares, and pentagons of a given graph.

Hung and Yamanishi [12] proposed a dimensionality selection method for Word2Vec. They applied the MDL principle to select the optimal dimensionality. However, contrary to our method, they did not employ latent variable models for embeddings and used the sequentially normalized maximum likelihood code-length rather than the DNML code-length.

The remainder of this paper is organized as follows. Section 2 introduces hyperbolic geometry, the non-identifiability problem, and latent variable models of hyperbolic graph embeddings. Section 3 explains the DNML criteria and algorithms used for optimization. Section 4 presents the results obtained using artificial and real-world datasets. Section 5 presents the conclusions and future work. “Appendix” section provides the derivation of the DNML code-lengths and experimental details.

2 Preliminaries

In this section, we first introduce the hyperbolic geometry following [18]. Subsequently, the non-identifiability problem of hyperbolic embeddings is discussed. Finally, we introduce two latent variable models for hyperbolic graph embeddings.

2.1 Hyperbolic geometry

2.1.1 Definition of hyperbolic space

There are several models for representing hyperbolic space (e.g., the Poincaré disk model, the Beltrami–Klein model, and the Poincaré half-plane model) [33]. In this study, the hyperboloid model is used. Since all the models introduced above are isometric to each other, the discussion of the distance structure carries over to the other models. Let \({\mathcal {H}}^D=({\mathbb {H}}^D, g_{D})\) be the D-dimensional hyperbolic space, where

$$\begin{aligned} g_{D}= \begin{pmatrix} -1 & 0 & \cdots & 0 \\ 0 & 1 & & \vdots \\ \vdots & & \ddots & 0 \\ 0 & \cdots & 0 & 1 \end{pmatrix}\in {\mathbb {R}}^{(D+1) \times (D+1)}, \end{aligned}$$

\({\mathbb {H}}^D:=\{\varvec{x}=(x_0, x_1, \dots , x_D)^\top \mid \varvec{x}\in {\mathbb {R}}^{D+1}, \langle \varvec{x}, \varvec{x} \rangle _{{\mathcal {L}}}=-1, x_0>0 \}\) and \(\langle \varvec{u}, \varvec{v} \rangle _{{\mathcal {L}}}=\varvec{u}^\top g_{D} \varvec{v}\). The associated distance between \(\varvec{u}, \varvec{v}\in {\mathbb {H}}^D\) is provided by \(d_{\varvec{u} \varvec{v}}={{\,\textrm{arcosh}\,}}\bigl (-\langle \varvec{u}, \varvec{v} \rangle _{{\mathcal {L}}} \bigr )\), where \({{\,\textrm{arcosh}\,}}(x):=\log (x+\sqrt{x^2-1})\). Note that \(x_0\) is determined by

$$\begin{aligned} x_{0}=\sqrt{1+x_{1}^2+\cdots +x_{D}^2}. \end{aligned}$$
(1)

Thus, only D variables are independent.
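For concreteness, the following minimal Python sketch (using NumPy; the function names are ours and purely illustrative) implements the Lorentzian inner product, lifts a point onto \({\mathbb {H}}^D\) via Eq. (1), and evaluates the distance \(d_{\varvec{u}\varvec{v}}={{\,\textrm{arcosh}\,}}(-\langle \varvec{u}, \varvec{v} \rangle _{{\mathcal {L}}})\).

```python
import numpy as np

def lorentz_inner(u, v):
    """Lorentzian inner product <u, v>_L = -u_0 v_0 + u_1 v_1 + ... + u_D v_D."""
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def lift_to_hyperboloid(x_rest):
    """Given (x_1, ..., x_D), recover x_0 via Eq. (1) so that <x, x>_L = -1."""
    x0 = np.sqrt(1.0 + np.dot(x_rest, x_rest))
    return np.concatenate(([x0], x_rest))

def hyperbolic_distance(u, v):
    """d(u, v) = arcosh(-<u, v>_L); clipping guards against rounding slightly below 1."""
    return np.arccosh(np.clip(-lorentz_inner(u, v), 1.0, None))

u = lift_to_hyperboloid(np.array([0.3, -0.2]))   # a point in H^2
v = lift_to_hyperboloid(np.array([-1.0, 0.5]))
print(lorentz_inner(u, u))        # approximately -1
print(hyperbolic_distance(u, v))  # a nonnegative distance
```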

2.1.2 Coordinate system of hyperbolic space

Next, we explain the coordinate systems of the hyperbolic space. The Cartesian coordinate system of the ambient Euclidean space is used to represent an element of the hyperbolic space (i.e., \(\varvec{x}=(x_0, x_1, \dots , x_D)^\top \in {\mathbb {H}}^D\)). Alternatively, for a maximum hyperbolic radius \(R>0\), the polar coordinate system \((r, \theta _{1}, \dots , \theta _{D-1})^\top \) introduced in [34] is used, where \(r\in [0, R]\), \(\theta _{1}, \theta _{2}, \dots , \theta _{D-2}\in [0, \pi )\), and \(\theta _{D-1} \in [0, 2\pi )\). The coordinate transformation is expressed as follows:

$$\begin{aligned} \left\{ \begin{aligned} x_0&= \cosh r, \\ x_1&= \sinh r \cos \theta _1, \\ x_2&= \sinh r \sin \theta _1 \cos \theta _2, \\&\quad \vdots \\ x_{D-1}&= \sinh r \sin \theta _1 \sin \theta _2 \cdots \sin \theta _{D-2} \cos \theta _{D-1}, \\ x_{D}&= \sinh r \sin \theta _1 \sin \theta _2 \cdots \sin \theta _{D-2} \sin \theta _{D-1}. \end{aligned} \right. \end{aligned}$$
(2)

Throughout this study, we specify which coordinate system is used whenever we introduce elements of the hyperbolic space.
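As an illustration of Eq. (2), the following sketch (function names are ours) converts polar coordinates \((r, \theta _1, \dots , \theta _{D-1})\) to the ambient Cartesian coordinates and checks that the resulting point satisfies \(\langle \varvec{x}, \varvec{x} \rangle _{{\mathcal {L}}}=-1\).

```python
import numpy as np

def polar_to_hyperboloid(r, thetas):
    """Map polar coordinates (r, theta_1, ..., theta_{D-1}) to Cartesian
    coordinates (x_0, x_1, ..., x_D) on the hyperboloid via Eq. (2)."""
    D = len(thetas) + 1
    x = np.empty(D + 1)
    x[0] = np.cosh(r)
    prod = np.sinh(r)                      # running product sinh(r) * sin(theta_1) * ...
    for k, theta in enumerate(thetas):
        x[k + 1] = prod * np.cos(theta)
        prod *= np.sin(theta)
    x[D] = prod                            # the last coordinate carries no cosine factor
    return x

x = polar_to_hyperboloid(1.2, np.array([0.7, 4.0]))   # a point in H^3
print(-x[0] ** 2 + np.sum(x[1:] ** 2))                 # approximately -1
```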

2.1.3 Tangent space and exponential map

When we introduce wrapped normal distributions and the optimization algorithm, the concepts of tangent space \({\mathcal {T}}_{\varvec{x}}{\mathbb {H}}^D\) and exponential map \(\textrm{Exp}_{\varvec{x}} (\cdot )\) are necessary.

For \(\varvec{x}\in {\mathbb {H}}^D\), the tangent space \({\mathcal {T}}_{\varvec{x}}{\mathbb {H}}^D\) is defined as the set of vectors orthogonal to \(\varvec{x}\) with respect to the inner product \(\langle \cdot , \cdot \rangle _{{\mathcal {L}}}\). Hence,

$$\begin{aligned} {\mathcal {T}}_{\varvec{x}}{\mathbb {H}}^D :=\{ \varvec{v} \mid \varvec{v} \in {\mathbb {R}}^{D+1}, \langle \varvec{x}, \varvec{v} \rangle _{{\mathcal {L}}} = 0 \}. \end{aligned}$$

Thereafter, the exponential map \(\textrm{Exp}_{\varvec{x}} (\cdot ) :{\mathcal {T}}_{\varvec{x}}{\mathbb {H}}^D \rightarrow {\mathbb {H}}^D\) maps a tangent vector \(\varvec{v}\in {\mathcal {T}}_{\varvec{x}}{\mathbb {H}}^D\) onto \({\mathbb {H}}^D\) along the geodesic, where the geodesics are the generalizations of straight lines to Riemannian manifolds. The explicit forms of the exponential map and its inverse are well known (e.g., [22]) and are defined as follows:

$$\begin{aligned} \textrm{Exp}_{\varvec{x}} (\varvec{v})&:=\cosh \left( \sqrt{\langle \varvec{v}, \varvec{v} \rangle _{{\mathcal {L}}}}\right) \varvec{x} + \sinh \left( \sqrt{\langle \varvec{v}, \varvec{v} \rangle _{{\mathcal {L}}}}\right) \frac{\varvec{v}}{\sqrt{\langle \varvec{v}, \varvec{v} \rangle _{{\mathcal {L}}}}}, \nonumber \\ \textrm{Exp}_{\varvec{x}}^{-1} (\varvec{y})&= \frac{{{\,\textrm{arcosh}\,}}{\left( -\langle \varvec{x}, \varvec{y} \rangle _{{\mathcal {L}}}\right) }}{\sqrt{\langle \varvec{x}, \varvec{y} \rangle _{{\mathcal {L}}}^2-1}}\left( \varvec{y}+\langle \varvec{x}, \varvec{y} \rangle _{{\mathcal {L}}} \varvec{x}\right) . \end{aligned}$$
(3)
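A minimal implementation of Eq. (3) might look as follows (the helper names are ours; the small eps guard for near-zero tangent vectors is an implementation choice, not part of the formula).

```python
import numpy as np

def lorentz_inner(u, v):
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def exp_map(x, v, eps=1e-12):
    """Exp_x(v): move from x along the geodesic with initial velocity v in T_x H^D."""
    norm_v = np.sqrt(max(lorentz_inner(v, v), 0.0))
    if norm_v < eps:                       # Exp_x(0) = x
        return x
    return np.cosh(norm_v) * x + np.sinh(norm_v) * v / norm_v

def log_map(x, y, eps=1e-12):
    """Exp_x^{-1}(y): the tangent vector at x pointing toward y."""
    alpha = -lorentz_inner(x, y)           # equals cosh d(x, y) >= 1
    if alpha < 1.0 + eps:                  # x and y (numerically) coincide
        return np.zeros_like(x)
    return np.arccosh(alpha) / np.sqrt(alpha ** 2 - 1.0) * (y + lorentz_inner(x, y) * x)
```

The exponential map at the origin, exp_map(mu0, v), is exactly the operation used later when sampling from wrapped normal distributions.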

2.2 Non-identifiability problem of hyperbolic embeddings

In a non-identifiable model, as pointed out in [26], the central limit theorem (CLT) does not hold for the maximum likelihood estimator uniformly over the parameter space. Thus, under these circumstances, neither AIC nor BIC can be applied to latent variable models because both are derived under the assumption that the CLT holds uniformly over the parameter space.

For notational simplicity, we omit D from the probability distribution, unless noted otherwise. We focus on undirected, unweighted, and simple graphs. Let \(n\in {\mathbb {Z}}_{\ge 2}\) be the number of nodes. For \(k\in {\mathbb {Z}}_{\ge 2}\), [k] and \(\Lambda _k\) are defined as follows: \([k] :=\{ 1, 2, \dots , k \}\) and \(\Lambda _k:=\{ (i, j) \mid i,j\in [k], i<j \}\). For \((i,j) \in \Lambda _n\), let \(y_{ij}=y_{ji} \in \{0, 1\}\) be a random variable that assumes the value 1 if the i-th node is connected to the j-th node and 0 otherwise.

For \(D=2\) and \(i \in [n]\), let \(\phi _i:=(r_i, \theta _{i})^\top \in {\mathbb {H}}^D\) be the polar coordinates of the i-th node, where \(r_i\in [0, R]\) and \(\theta _{i} \in [0, 2\pi )\). In this model, \(\varvec{y}:=\{ y_{ij} \}_{(i,j)\in \Lambda _n}\) is an observable variable, whereas \(\varvec{\phi }:=\{\phi _i \}_{i \in [n]}\) is a parameter of the probability distribution. For \(\beta _{{\max }}>\beta _{{\min }}>0\), \(\gamma _{{\max }}>\gamma _{{\min }}>0\), \(\beta \in [\beta _{{\min }}, \beta _{{\max }}]\), and \(\gamma \in [\gamma _{{\min }}, \gamma _{{\max }}]\), we assume that the random variable \(\varvec{y}\) is drawn from the following distribution:

$$\begin{aligned} p(\varvec{y}; \varvec{\phi }, \beta , \gamma )&:=\prod _{(i,j)\in \Lambda _n}p(y_{ij};\phi _i, \phi _j, \beta , \gamma ), \nonumber \\ p(y_{ij};\phi _i, \phi _j, \beta , \gamma )&:={\left\{ \begin{array}{ll} \frac{1}{1+\exp (\beta d_{\phi _i \phi _j}-\gamma )} &\quad (y_{ij}=1), \\ 1-\frac{1}{1+\exp (\beta d_{\phi _i \phi _j}-\gamma )} &\quad (y_{ij}=0). \end{array}\right. } \end{aligned}$$
(4)

Then, the following lemma holds.

Lemma 1

We assume that \(r_j \ne 0\) for some \(j\in [n]\). For \(\alpha \in (0, 2\pi )\), we define \(\phi _i^\prime :=(r_i, \theta _i+\alpha )^\top \) for all \(i\in [n]\). Then, \(\varvec{\phi } \ne \varvec{\phi }^\prime :=\{ \phi _i^\prime \}_{i\in [n]}\), and the following equation holds:

$$\begin{aligned} p(\varvec{y};\varvec{\phi }, \beta , \gamma ) = p(\varvec{y}; \varvec{\phi }^\prime , \beta , \gamma ). \end{aligned}$$

Therefore, the probability distribution of hyperbolic embeddings is non-identifiable.

Proof

Since \(\phi _j \ne \phi _j^\prime \) holds for some \(j\in [n]\) such that \(r_j \ne 0\), we have \(\varvec{\phi } \ne \varvec{\phi }^\prime \). For all \((i, j)\in \Lambda _n\), \(d_{\phi _i \phi _j}=d_{\phi _i^\prime \phi _j^\prime }\) because a simple calculation yields \(\langle \phi _i, \phi _j \rangle _{{\mathcal {L}}}=\langle \phi _i^\prime , \phi _j^\prime \rangle _{{\mathcal {L}}}\). Thus, the result follows from Eq. (4). \(\square \)

For \(D\ge 3\), non-identifiability can be proved by applying a similar transformation to the \((D-1)\)-th angular coordinate.
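Lemma 1 can also be checked numerically. The following sketch (for \(D=2\), with arbitrarily chosen \(\beta \), \(\gamma \), and rotation angle \(\alpha \)) verifies that rotating every node by the same angle leaves all pairwise distances, and hence the likelihood in Eq. (4), unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta, gamma, alpha = 5, 1.0, 2.0, 0.8

def to_hyperboloid(r, theta):              # the D = 2 case of Eq. (2)
    return np.array([np.cosh(r), np.sinh(r) * np.cos(theta), np.sinh(r) * np.sin(theta)])

def dist(u, v):
    return np.arccosh(np.clip(u[0] * v[0] - u[1] * v[1] - u[2] * v[2], 1.0, None))

r = rng.uniform(0.1, 3.0, n)
theta = rng.uniform(0.0, 2 * np.pi, n)

for i in range(n):
    for j in range(i + 1, n):
        d1 = dist(to_hyperboloid(r[i], theta[i]), to_hyperboloid(r[j], theta[j]))
        d2 = dist(to_hyperboloid(r[i], theta[i] + alpha), to_hyperboloid(r[j], theta[j] + alpha))
        p1 = 1.0 / (1.0 + np.exp(beta * d1 - gamma))    # Eq. (4) with y_ij = 1
        p2 = 1.0 / (1.0 + np.exp(beta * d2 - gamma))
        assert np.isclose(d1, d2) and np.isclose(p1, p2)
print("Distances and edge probabilities are invariant under a common rotation.")
```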

2.3 Latent variable models of hyperbolic embeddings with PUDs and WNDs

To resolve the non-identifiability problem, we introduce two latent variable models, following work on PUDs [14, 15, 34] and WNDs [22]. In both models, an embedding is regarded as a set of latent variables and the edges as observed variables. Among several possible latent variable models, we chose these two for the following reasons.

  • For PUDs, it has been demonstrated in [14, 15, 34] that graphs generated with PUDs have two properties: a power-law degree distribution and a high clustering coefficient. These properties are common in real-world graphs [35].

  • For WNDs, it has been demonstrated that they improve the experimental performance of various downstream tasks [22].

In these models, we consider \(\varvec{\phi }=\{\phi _i \}_{i \in [n]}\) as latent random variables rather than parameters. Thus, we rewrite it as \(\varvec{z}:=\{z_i \}_{i \in [n]}\), where \(z_i :=(z_{i,0}, z_{i,1}, \dots , z_{i,D})^\top \) denotes its Cartesian coordinates.

2.3.1 Latent variable model with PUDs

The generation process of \(\varvec{y}, \varvec{z}\) with PUDs can be summarized as follows:

  1. For each vertex \(i\in [n]\):

     (a) Generate \(u_i :=(r_i, \theta _{i,1}, \dots , \theta _{i, D-1})^\top \sim p(u_i; \sigma , R)\).

     (b) Transform \(u_i\) to \(z_i\) through Eq. (2).

  2. For each pair of vertices \((i,j)\in \Lambda _n\):

     (a) Generate an observable variable \(y_{ij} \sim p(y_{ij} \mid z_i, z_j;\beta , \gamma )\) using Eq. (4).

Below we provide an explicit form of the probability distribution of \(\varvec{z}\) for PUDs.

For \(\sigma _{{\max }}> \sigma _{{\min }} \ge 0\), the random variable \(\varvec{u}:=\{u_i \}_{i \in [n]}\) is drawn according to the following distribution with the parameter \(\sigma \in [\sigma _{{\min }}, \sigma _{{\max }}]\):

$$\begin{aligned} p(\varvec{u}; \sigma , R)&:=\prod _{i\in [n]} p(u_i;\sigma , R), \\ p(u_i; \sigma , R)&:=p(r_i; \sigma , R)\prod _{j=1}^{D-1}p(\theta _{i,j}), \\ p(\theta _{i,j})&:={\left\{ \begin{array}{ll} \frac{\sin ^{D-1-j} \theta _{i,j}}{I_{D, j}} &\quad (j \ne D-1), \\ \frac{1}{2\pi } &\quad (j=D-1), \end{array}\right. } \\ p(r_i; \sigma , R)&:=\frac{\sinh ^{D-1}(\sigma r_i)}{C_D(\sigma )}, \end{aligned}$$

where \(I_{D, j}:=\int ^\pi _0 \sin ^{D-1-j} \theta {\textrm{d}}\theta \) and \(C_D(\sigma ):=\int ^R_0 \sinh ^{D-1}(\sigma r) {\textrm{d}}r\) denote the normalization constants. For \(p(\varvec{z};\sigma , R)\), because \(z_{i,0}\) is determined by the other D variables in Eq. (1), we have

$$\begin{aligned} p(\varvec{z}; \sigma , R)&:=\prod _{i\in [n]} p(z_i;\sigma , R), \\ p(z_i;\sigma , R)&:=\frac{1}{J(z_{i,1:D}:u_i)} p(u_i; \sigma , R), \end{aligned}$$

where \(z_{i,1:D}:=(z_{i,1},\dots ,z_{i,D})^\top \) and \(J(z_{i, 1:D}:u_i)\) is the Jacobian of the transformation from \(u_i\) to \(z_{i, 1:D}\), which is given as

$$\begin{aligned} J(z_{i, 1:D}:u_i) =\cosh (r_i) {\sinh }^{D-1}(r_i) \prod _{j=1}^{D-2} \sin ^{D-j-1}\theta _{i, j}. \end{aligned}$$

The derivation is provided in “Appendix A.” The probability distribution \(p(\varvec{z}; \sigma , R)\) is called the pseudo-uniform distribution because it is reduced to the uniform distribution in hyperbolic space when \(\sigma =1\).

In the following discussion, the value of R is assumed to be fixed and to satisfy \(R=O(\log n)\), where n is the number of nodes, and it is omitted from the description of the probability distribution. This is because the maximum average degree then satisfies \(k_{{\max }}=O(n)\) and the minimum average degree satisfies \(k_{{\min }}=O(1)\) under certain conditions, which is a common property of real-world complex networks [34].

In the aforementioned distribution, \(\sigma , \beta \), and \(\gamma \) are parameters, and D denotes the model of the probability distribution.
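The generation process above can be sketched as follows. The radial and angular densities of the PUD are one-dimensional, so a grid-based inverse-CDF sampler suffices for illustration (the grid size and function names are our own choices); the sampled polar coordinates would then be mapped to the hyperboloid via Eq. (2), and edges drawn from Eq. (4).

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_from_density(grid, unnormalized_pdf, size):
    """Grid-based inverse-CDF sampling from a one-dimensional unnormalized density."""
    cdf = np.cumsum(unnormalized_pdf)
    cdf /= cdf[-1]
    return np.interp(rng.uniform(size=size), cdf, grid)

def sample_pud_polar(n, D, sigma, R, grid_size=10000):
    """Step 1(a): draw n samples (r, theta_1, ..., theta_{D-1}) from the PUD."""
    r_grid = np.linspace(1e-6, R, grid_size)
    r = sample_from_density(r_grid, np.sinh(sigma * r_grid) ** (D - 1), n)
    thetas = np.empty((n, D - 1))
    th_grid = np.linspace(1e-6, np.pi, grid_size)
    for j in range(1, D - 1):              # p(theta_j) proportional to sin^{D-1-j}(theta_j)
        thetas[:, j - 1] = sample_from_density(th_grid, np.sin(th_grid) ** (D - 1 - j), n)
    thetas[:, D - 2] = rng.uniform(0.0, 2 * np.pi, n)   # theta_{D-1} is uniform
    return r, thetas

def sample_edges(dist_matrix, beta, gamma):
    """Step 2(a): y_ij ~ Bernoulli(1 / (1 + exp(beta * d_ij - gamma))) as in Eq. (4)."""
    p = 1.0 / (1.0 + np.exp(beta * dist_matrix - gamma))
    return (rng.uniform(size=p.shape) < p).astype(int)

r, thetas = sample_pud_polar(n=1000, D=8, sigma=1.0, R=np.log(1000))
```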

2.3.2 Probability distribution of WNDs

WNDs are a generalization of Euclidean Gaussian distributions to the hyperbolic space. Thus, WNDs have two parameters: a mean in hyperbolic space \(\varvec{\mu }\in {\mathbb {H}}^D\) and a positive-definite covariance matrix \(\Sigma \in {\mathbb {R}}^{D \times D}\). In our model, we set \(\varvec{\mu }\) to \(\varvec{\mu }_0\), where \(\varvec{\mu }_0:=(1, 0, \dots , 0)^\top \) denotes the origin of \({\mathbb {H}}^D\). We make this assumption because a tree-like graph is considered to be radially distributed around the origin \(\varvec{\mu }_0\).

The generation process of \(\varvec{y}\) and \(\varvec{z}\) with the WNDs is summarized as follows:

  1. For each vertex \(i\in [n]\):

     (a) Generate \(v_i:=(v_{i,1}, \dots , v_{i, D})^\top \sim p(v_i; \Sigma )\).

     (b) Transform \(v_i\) to \(\tilde{v_i}:=[0, v_i^\top ]^\top \), which is a tangent vector at \(\varvec{\mu }_0\).

     (c) Transform \({\tilde{v}}_i\) to \(z_i\) through the exponential map \(\text {Exp}_{\varvec{\mu }_0} ({\tilde{v}}_i)\).

  2. For each pair of vertices \((i,j)\in \Lambda _n\):

     (a) Generate an observable variable \(y_{ij} \sim p(y_{ij} \mid z_i, z_j;\beta , \gamma )\) using Eq. (4).

Note that the second step is the same as that of the model with the PUDs. Below we provide an explicit form of the probability distribution of \(\varvec{z}\) with the WNDs.

A random variable \(\varvec{v}:=\{v_i \}_{i \in [n]}\) is drawn according to the following distribution:

$$\begin{aligned} p(\varvec{v}; \Sigma )&:=\prod _{i\in [n]} p(v_i;\Sigma ), \\ p(v_i; \Sigma )&:=\frac{1}{(2\pi )^{\frac{D}{2}} |\Sigma |^\frac{1}{2}}\exp \biggl (-\frac{1}{2} v_i^\top \Sigma ^{-1} v_i \biggr ). \end{aligned}$$

For \(p(\varvec{z};\Sigma )\), we have that

$$\begin{aligned} p(\varvec{z}; \Sigma )&:=\prod _{i\in [n]} p(z_i;\Sigma ), \\ p(z_i;\Sigma )&:=\frac{1}{J(z_{i,1:D}:v_i)} p(v_i; \Sigma ), \end{aligned}$$

where \(J(z_{i, 1:D}:v_i)\) is the Jacobian of the transformation from \(v_i\) to \(z_{i, 1:D}\), which is provided by

$$\begin{aligned} J(z_{i, 1:D}:v_i) =\biggl \{ \frac{\sinh \Vert v_i \Vert _{{\mathcal {L}}}}{\Vert v_i \Vert _{{\mathcal {L}}}} \biggr \}^{D-1}. \end{aligned}$$

The derivation of the Jacobian is given in [22].
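The WND generation process translates directly into code: sample a Euclidean Gaussian vector, prepend a zero to obtain a tangent vector at \(\varvec{\mu }_0\), and push it onto the hyperboloid with the exponential map of Eq. (3). A minimal sketch (function names ours) is given below.

```python
import numpy as np

rng = np.random.default_rng(0)

def exp_map_at_origin(v_tilde):
    """Exp_{mu_0}(v~) for a tangent vector v~ = (0, v_1, ..., v_D) at the origin mu_0."""
    norm = np.sqrt(np.sum(v_tilde[1:] ** 2))   # <v~, v~>_L reduces to the Euclidean norm here
    mu0 = np.zeros_like(v_tilde)
    mu0[0] = 1.0
    if norm == 0.0:
        return mu0
    return np.cosh(norm) * mu0 + np.sinh(norm) * v_tilde / norm

def sample_wnd(n, Sigma):
    """Steps 1(a)-(c): v_i ~ N(0, Sigma), lift to the tangent space at mu_0, map to H^D."""
    D = Sigma.shape[0]
    v = rng.multivariate_normal(np.zeros(D), Sigma, size=n)       # step (a)
    v_tilde = np.hstack([np.zeros((n, 1)), v])                     # step (b)
    return np.array([exp_map_at_origin(vt) for vt in v_tilde])     # step (c)

z = sample_wnd(1000, np.diag([0.5, 0.5]))                          # points on the 2-D hyperboloid
print(np.allclose(-z[:, 0] ** 2 + np.sum(z[:, 1:] ** 2, axis=1), -1.0))
```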

3 Dimensionality selection using DNML code-lengths

In this section, we present the calculation of the DNML code-lengths for two latent variable models. Thereafter, we present the optimization algorithm.

3.1 DNML code-lengths with PUDs and WNDs

According to the MDL principle [24], the probabilistic model that minimizes the total code-length required to encode the given data is selected. Data may be encoded using multiple methods. Although the NML code-length [36] is one of the most common encoding methods, its calculation is quite difficult for complex probability distributions such as PUDs and WNDs. Therefore, we employ the DNML code-length [26], whose calculation for latent variable models is relatively easy.

Let \({\mathcal {D}}:=\{ D_1, D_2, \dots , D_N \}\) denote a finite set of candidate dimensionalities \((D_i\in {\mathbb {Z}}_{\ge 2})\). We estimate the optimal dimensionality \({\hat{D}}\in {\mathcal {D}}\) and the optimal embedding \(\hat{\varvec{z}}\) that minimize the following criterion, which we call DNML-PUD:

$$\begin{aligned} L_{\text {DNML-PUD}}(\varvec{y}, \varvec{z}) :=\,&L_{\text {NML}}(\varvec{y} \mid \varvec{z})+L_{\text {NML}}(\varvec{z}), \nonumber \\ L_{\text {NML}}(\varvec{y} \mid \varvec{z}) :=\,&-\log p(\varvec{y} \mid \varvec{z};{\hat{\beta }}(\varvec{y}, \varvec{z}), {\hat{\gamma }}(\varvec{y}, \varvec{z})) \nonumber \\&+ \log \sum _{\bar{\varvec{y}}} p(\bar{\varvec{y}} \mid \varvec{z};{\hat{\beta }}(\bar{\varvec{y}}, \varvec{z}), {\hat{\gamma }}(\bar{\varvec{y}}, \varvec{z})), \nonumber \\ L_{\text {NML}}(\varvec{z}) :=\,&-\log p(\varvec{z}; {\hat{\sigma }}(\varvec{z})) +\log \int p(\bar{\varvec{z}}; {\hat{\sigma }}(\bar{\varvec{z}})) {\textrm{d}}\bar{\varvec{z}}_{1:D}, \end{aligned}$$
(5)

where \({\hat{\beta }}(\cdot , \cdot )\), \({\hat{\gamma }}(\cdot , \cdot )\), and \({\hat{\sigma }}(\cdot )\) denote the maximum likelihood estimators, \(\bar{\varvec{z}}_{1:D}:=\{ {\bar{z}}_{i, 1:D} \}_{i\in [n]}\), and the sum and integral are taken over all possible data in the predefined data domain. The second term in each NML code-length is called the parametric complexity. As the exact value of the integral is analytically intractable, we approximate \(L_{\text {NML}}(\varvec{y} \mid \varvec{z})\) and \(L_{\text {NML}}(\varvec{z})\) as follows:

$$\begin{aligned} L_{\text {NML}}(\varvec{y} \mid \varvec{z}) \approx&-\log p(\varvec{y} \mid \varvec{z};{\hat{\beta }}(\varvec{y}, \varvec{z}), {\hat{\gamma }}(\varvec{y}, \varvec{z})) \\&+ \log \frac{n(n-1)}{4\pi } +\log \int _{\gamma _{{\min }}}^{\gamma _{{\max }}}\int _{\beta _{{\min }}}^{\beta _{{\max }}} \sqrt{|I_n(\beta , \gamma ) |}\,{\textrm{d}}\beta \,{\textrm{d}}\gamma , \\ L_{\text {NML}}(\varvec{z}) \approx&-\log p(\varvec{z}; {\hat{\sigma }}(\varvec{z})) + \frac{1}{2} \log \frac{n}{2\pi }+\log \int _{\sigma _{{\min }}}^{\sigma _{{\max }}} \sqrt{|I(\sigma ) |}\,{\textrm{d}}\sigma , \end{aligned}$$

where \(I_n(\beta , \gamma )\) and \(I(\sigma )\) denote Fisher information, which is computed as

$$\begin{aligned} I_n(\beta , \gamma )&= \frac{2}{n(n-1)} \begin{pmatrix} \sum _{(i,j)\in \Lambda _n} \frac{d_{z_i z_j}^2}{4\cosh ^2 \bigl (\frac{\beta d_{z_i z_j}-\gamma }{2}\bigr )} & \sum _{(i,j)\in \Lambda _n} \frac{-d_{z_i z_j}}{4\cosh ^2 \bigl (\frac{\beta d_{z_i z_j}-\gamma }{2}\bigr )}\\ \sum _{(i,j)\in \Lambda _n} \frac{-d_{z_i z_j}}{4\cosh ^2 \bigl (\frac{\beta d_{z_i z_j}-\gamma }{2}\bigr )} & \sum _{(i,j)\in \Lambda _n} \frac{1}{4\cosh ^2 \bigl (\frac{\beta d_{z_i z_j}-\gamma }{2}\bigr )} \end{pmatrix}, \\ I(\sigma )&= (D-1)^2\frac{\int ^R_0 r^2 \cosh ^2(\sigma r)\sinh ^{D-3}(\sigma r)\,{\textrm{d}}r}{C_D(\sigma )} -\biggl \{ \frac{\int ^R_0 (D-1)r \cosh (\sigma r)\sinh ^{D-2}(\sigma r)\,{\textrm{d}}r}{C_D(\sigma )} \biggr \} ^2. \end{aligned}$$

The derivation is presented in “Appendix B.” Practically, \(I_n(\beta , \gamma )\) and \(I(\sigma )\) are calculated numerically because the analytic solution of the integral terms is not trivial.
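As an illustration of this numerical computation, the following sketch evaluates \(I(\sigma )\) and the parametric complexity term of \(L_{\text {NML}}(\varvec{z})\) by one-dimensional quadrature (the integration range \([\sigma _{{\min }}, \sigma _{{\max }}]\) and the use of scipy.integrate.quad are illustrative choices, not the exact implementation used in the paper).

```python
import numpy as np
from scipy.integrate import quad

def fisher_sigma(sigma, D, R):
    """I(sigma) for the radial part of the PUD, evaluated by numerical integration."""
    C = quad(lambda r: np.sinh(sigma * r) ** (D - 1), 0.0, R)[0]
    a = quad(lambda r: r ** 2 * np.cosh(sigma * r) ** 2 * np.sinh(sigma * r) ** (D - 3), 0.0, R)[0]
    b = quad(lambda r: (D - 1) * r * np.cosh(sigma * r) * np.sinh(sigma * r) ** (D - 2), 0.0, R)[0]
    return (D - 1) ** 2 * a / C - (b / C) ** 2

def parametric_complexity_z(n, D, R, sigma_min=0.1, sigma_max=10.0):
    """Approximate penalty part of L_NML(z): (1/2) log(n / 2 pi) + log int sqrt(I(sigma)) d sigma."""
    integral = quad(lambda s: np.sqrt(max(fisher_sigma(s, D, R), 0.0)), sigma_min, sigma_max)[0]
    return 0.5 * np.log(n / (2.0 * np.pi)) + np.log(integral)

print(parametric_complexity_z(n=6400, D=8, R=np.log(6400)))
```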

For WNDs, the DNML criterion is defined as follows:

$$\begin{aligned} L_{\text {DNML-WND}}(\varvec{y}, \varvec{z}) :=&L_{\text {NML}}(\varvec{y} \mid \varvec{z})+L_{\text {NML}}(\varvec{z}), \nonumber \\ L_{\text {NML}}(\varvec{z}) :=&-\log p(\varvec{z}; {\hat{\Sigma }}(\varvec{z})) +\log \int p(\bar{\varvec{z}}; {\hat{\Sigma }}(\bar{\varvec{z}})) {\textrm{d}}\bar{\varvec{z}}_{1:D}, \end{aligned}$$
(6)

where \({\hat{\Sigma }}\) denotes the maximum likelihood estimator of \(\Sigma \). Since the exact value of \(\int p(\bar{\varvec{z}}; {\hat{\Sigma }}(\bar{\varvec{z}})) {\textrm{d}}\bar{\varvec{z}}_{1:D}\) is analytically intractable, we employ the following upper bound.

$$\begin{aligned} \int p(\bar{\varvec{z}}; {\hat{\Sigma }}(\bar{\varvec{z}})) {\textrm{d}}\bar{\varvec{z}}_{1:D} \le \frac{\pi ^{\frac{D^2}{2}} \prod _{i\in [D-1]} \epsilon _{2i}^{D-i} \prod _{j\in [D]} \epsilon _{1j}^{\frac{1-D}{2}}}{\Gamma _D \bigl (\frac{D}{2} \bigr ) \Gamma _D \bigl (\frac{n}{2} \bigr )} \biggl ( \frac{n}{2e} \biggr )^{\frac{nD}{2}} \biggl ( \frac{2}{D-1} \biggr )^D. \end{aligned}$$

The derivation is presented in “Appendix B.”

Fig. 1 Example of DNML-PUD. The selected dimensionality and true dimensionality are \(D=8\). The graph was generated with parameters \(n=6400\), \(\beta =0.6\), \(\gamma = \beta \log n\), and \(\sigma =1.0\). The scores \(L_{\text {DNML}}(\varvec{y}, \varvec{z})\) and \( L_{\text {NML}}(\varvec{y} \mid \varvec{z})\) follow the left scale, whereas \(L_{\text {NML}}(\varvec{z})\) follows the right scale

Fig. 2 Example of DNML-WND. The selected dimensionality and true dimensionality are \(D=8\). The graph was generated with parameters \(n=6400\), \(\beta =1.2\), \(\gamma = \beta \log n\), and \(\Sigma =(0.35 \log n)^2I\), where \(I\in {\mathbb {R}}^{D\times D}\) is the identity matrix. The scores \(L_{\text {DNML}}(\varvec{y}, \varvec{z})\) and \( L_{\text {NML}}(\varvec{y} \mid \varvec{z})\) follow the left scale, whereas \(L_{\text {NML}}(\varvec{z})\) follows the right scale

We provide a more detailed explanation of DNML-PUD and DNML-WND. Figures 1 and 2 show \(L_{\text {NML}}(\varvec{y} \mid \varvec{z}), L_{\text {NML}}(\varvec{z})\) and \(L_{\text {DNML}}(\varvec{y}, \varvec{z})\) of an artificially generated graph with the true dimensionality \(D_{\text {true}}=8\) and \(n=6400\). The value of \(L_{\text {NML}}(\varvec{y} \mid \varvec{z})\) decreases as dimensionality increases. This implies that as the dimensionality increases, the graph can be reconstructed more accurately. However, the value of \(L_{\text {NML}}(\varvec{z})\) increases as dimensionality increases. This is because more code-length is required to encode the extra dimension of the embedding; that is, the model becomes more complex, and \(L_{\text {NML}}(\varvec{z})\) acts as a penalty term. Hence, minimizing \(L_{\text {DNML}}(\varvec{y}, \varvec{z})\) implies that dimensionality is chosen by considering both the accuracy of the reconstruction and the complexity of the model. Therefore, the DNML-PUD and DNML-WND select the true dimensionality \(D_{\text {true}}=8\).

3.2 Optimization

To derive the optimal dimensionality, Eqs. (5) and (6) should be optimized with respect to \(\varvec{z}, \beta , \gamma \), and \(\sigma \) (or \(\Sigma \)) for each dimensionality \(D\in {\mathcal {D}}\). However, direct optimization is difficult because of the analytical intractability of the parametric complexity of \(L_{\text {NML}}(\varvec{y} \mid \varvec{z})\). Instead, we optimize \(L(\varvec{z}, \beta , \gamma , \sigma ) :=-\log p(\varvec{y}, \varvec{z}; \beta , \gamma , \sigma )\) and \(L(\varvec{z},\beta , \gamma , \Sigma ) :=-\log p(\varvec{y}, \varvec{z}; \beta , \gamma , \Sigma )\), which are lower bounds of Eqs. (5) and (6), respectively.

First, we explain how to optimize \(L(\varvec{z}, \beta , \gamma , \sigma )\). We rewrite it as

$$\begin{aligned} L(\varvec{z}, \beta , \gamma , \sigma )&=- \sum _{(i,j)\in \Lambda _n} \log p(y_{ij} \mid z_i, z_j; \beta , \gamma ) - \sum _{i\in [n]} \log p(z_i;\sigma ) \\&= \sum _{(i,j)\in \Lambda _n} \biggl \{ -\log p(y_{ij} \mid z_i, z_j; \beta , \gamma ) - \frac{1}{n-1}\log p(z_i;\sigma ) \\&\quad -\frac{1}{n-1} \log p(z_j; \sigma ) \biggr \}. \end{aligned}$$

We apply the following stochastic update rule at iteration t:

$$\begin{aligned} L^{(t)}(\varvec{z}, \beta , \gamma , \sigma )&:=\frac{1}{|{\mathcal {B}}^{(t)} |}\sum _{(i,j)\in {\mathcal {B}}^{(t)}} \biggl \{ -\log p(y_{ij} \mid z_i, z_j; \beta , \gamma ) \\&\quad - \frac{1}{n-1}\log p(z_i;\sigma )-\frac{1}{n-1} \log p(z_j; \sigma ) \biggr \}, \end{aligned}$$

where \({\mathcal {B}}^{(t)}\subset \Lambda _n\) is the mini-batch for each iteration and \(|\cdot |\) denotes the number of elements in a set.

For \(z_i\), we used the geodesic update in the hyperboloid model [18]. The update rule for \(\varvec{z}\) is given as follows:

$$\begin{aligned} z_i^{(t+1)} \leftarrow \text {proj}_{{\mathbb {H}}^D_R} \left( \text {Exp}_{z_i^{(t)}} \left( -\eta ^{(t)}_{\varvec{z}} \pi _{z_i^{(t)}} \left( g_{D}^{-1}\frac{\partial L^{(t)}}{\partial z_i^{(t)}} \right) \right) \right) , \end{aligned}$$
(7)

where \(\eta ^{(t)}_{\varvec{z}}\) denotes the learning rate, \(\pi _{\varvec{z}}(\cdot )\) denotes the projection from the Euclidean gradient to the Riemannian gradient, \(\text {proj}_{{\mathbb {H}}^D_R}(\cdot )\) denotes the projection from \({\mathbb {H}}^D\) to \({\mathbb {H}}^D_R :=\{ \varvec{x} \mid \varvec{x}\in {\mathbb {H}}^D, d_{\varvec{\mu }_0\varvec{x}}\le R \}\), and \(\text {Exp}_{\varvec{z}} (\cdot )\) is given by Eq. (3). The functions \(\pi _{\varvec{z}}(\cdot )\) and \(\text {proj}_{{\mathbb {H}}^D_R}(\cdot )\) are defined as follows:

$$\begin{aligned} \pi _{\varvec{z}} (\varvec{u})&:=\varvec{u} - \frac{\langle \varvec{z}, \varvec{u} \rangle _{{\mathcal {L}}}}{\langle \varvec{z}, \varvec{z} \rangle _{{\mathcal {L}}}}\varvec{z}=\varvec{u}+\langle \varvec{z}, \varvec{u} \rangle _{{\mathcal {L}}}\varvec{z}, \\ \text {proj}_{{\mathbb {H}}^D_R}(\varvec{z})&:={\left\{ \begin{array}{ll} \varvec{z} &\quad (d_{\varvec{\mu }_0\varvec{z}} \le R), \\ \bigl (\cosh R, \frac{\sinh R}{\Vert \varvec{z}_{1:D} \Vert }z_1, \cdots , \frac{\sinh R}{\Vert \varvec{z}_{1:D} \Vert }z_D \bigr )^\top &\quad (d_{\varvec{\mu }_0\varvec{z}} > R), \end{array}\right. } \end{aligned}$$

where \(\Vert \cdot \Vert \) denotes the Euclidean norm.
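One update step of Eq. (7) can be sketched as follows (function names ours): the Euclidean gradient is first multiplied by \(g_D^{-1}\), projected onto the tangent space by \(\pi _{\varvec{z}}\), moved along the geodesic by the exponential map, and finally projected back into the radius-R ball.

```python
import numpy as np

def lorentz_inner(u, v):
    return -u[0] * v[0] + np.dot(u[1:], v[1:])

def riemannian_grad(z, euclid_grad):
    """pi_z(g_D^{-1} grad): negate the time-like component, then project onto T_z H^D."""
    h = euclid_grad.copy()
    h[0] = -h[0]                                    # multiply by g_D^{-1}
    return h + lorentz_inner(z, h) * z              # pi_z(u) = u + <z, u>_L z

def exp_map(z, v, eps=1e-12):
    norm_v = np.sqrt(max(lorentz_inner(v, v), 0.0))
    if norm_v < eps:
        return z
    return np.cosh(norm_v) * z + np.sinh(norm_v) * v / norm_v

def proj_ball(z, R):
    """proj_{H^D_R}: pull z back onto the radius-R ball around the origin if necessary."""
    if np.arccosh(max(z[0], 1.0)) <= R:             # d(mu_0, z) = arcosh(z_0)
        return z
    rest = np.sinh(R) * z[1:] / np.linalg.norm(z[1:])
    return np.concatenate(([np.cosh(R)], rest))

def update_z(z, euclid_grad, lr, R):
    """One step of Eq. (7) for a single embedding vector z_i."""
    return proj_ball(exp_map(z, -lr * riemannian_grad(z, euclid_grad)), R)
```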

With the learning rates \(\eta ^{(t)}_{\beta }\) and \(\eta ^{(t)}_{\gamma }\), the update rules of \(\beta \) and \(\gamma \) are provided as

$$\begin{aligned} \beta ^{(t+1)}&\leftarrow \text {proj}_{[\beta _{{\min }},\beta _{{\max }}]} \biggl ( \beta ^{(t)}-\eta ^{(t)}_{\beta } \frac{\partial L^{(t)}}{\partial \beta ^{(t)}} \biggr ), \end{aligned}$$
(8)
$$\begin{aligned} \gamma ^{(t+1)}&\leftarrow \text {proj}_{[\gamma _{{\min }},\gamma _{{\max }}]} \biggl ( \gamma ^{(t)}-\eta ^{(t)}_{\gamma } \frac{\partial L^{(t)}}{\partial \gamma ^{(t)}} \biggr ), \\ \text {proj}_{[a, b]}(x)&= {\left\{ \begin{array}{ll} b &\quad (x \ge b), \\ x &\quad (a \le x \le b), \\ a &\quad (x<a). \end{array}\right. }\nonumber \end{aligned}$$
(9)

Through a preliminary experiment using synthetic datasets, we confirmed that \(\sigma \) rarely converges to the true value when using the gradient descent method. Thus, for each epoch, we numerically calculated \({\hat{\sigma }}(\varvec{z})\) as

$$\begin{aligned} {\hat{\sigma }}(\varvec{z})=\mathop {\mathrm {arg~min}}\limits _{\sigma \in S} \bigl \{ -\log p(\varvec{z}; \sigma ) \bigr \}, \end{aligned}$$
(10)

where \(S=\{ \sigma _{{\min }}, \sigma _{{\min }} + \frac{1}{C}(\sigma _{{\max }}-\sigma _{{\min }}), \dots , \sigma _{{\min }} + \frac{C-1}{C}(\sigma _{{\max }}-\sigma _{{\min }}), \sigma _{{\max }} \}\), and \(C+1\) denotes the number of candidates.

For the optimization of \(L(\varvec{z}, \beta , \gamma , \Sigma )\), we define \(L^{(t)}(\varvec{z}, \beta , \gamma , \Sigma )\) in a similar manner. The update rules for \(z_i, \beta , \gamma \) are provided by Eqs. (7), (8), and (9), respectively. For each epoch, we optimized \(\Sigma \) using the following equation:

$$\begin{aligned} {\hat{\Sigma }}(\varvec{z})&=\mathop {\mathrm {arg~min}}\limits _{\Sigma } \bigl \{ -\log p(\varvec{z}; \Sigma ) \bigr \} \nonumber \\&= \frac{1}{n} \sum _{i \in [n]} v_i v_i^\top , \end{aligned}$$
(11)

where \(v_i :=\text {Exp}_{\varvec{\mu }_0}^{-1}(z_i)\) for all \(i\in [n]\). The optimization procedure is summarized in Algorithm 1.
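For illustration, the closed-form update of Eq. (11) can be implemented with the inverse exponential map at the origin (a minimal sketch; function names ours).

```python
import numpy as np

def log_map_at_origin(z, eps=1e-12):
    """Exp_{mu_0}^{-1}(z); returns only the last D components (the tangent part)."""
    alpha = z[0]                                    # -<mu_0, z>_L = z_0 = cosh d(mu_0, z)
    if alpha < 1.0 + eps:                           # z is (numerically) the origin
        return np.zeros(len(z) - 1)
    e0 = np.zeros_like(z)
    e0[0] = 1.0
    v_tilde = np.arccosh(alpha) / np.sqrt(alpha ** 2 - 1.0) * (z - alpha * e0)
    return v_tilde[1:]                              # the 0-th component is 0 by construction

def Sigma_hat(z_all):
    """Eq. (11): hat{Sigma} = (1/n) sum_i v_i v_i^T with v_i = Exp_{mu_0}^{-1}(z_i)."""
    V = np.array([log_map_at_origin(z) for z in z_all])
    return V.T @ V / len(z_all)
```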

Subsequently, we analyze the time efficiency of the proposed algorithms. For sufficiently large n, the update of \(\varvec{z}^{(t)}:=\{z_i^{(t)}\}_{i\in [n]}\) is dominant. We assume that \(|{\mathcal {B}}^{(t)} |\) takes a constant value B for all iterations. Since at most 2B nodes are used in each iteration to compute \(L^{(t)}\), we only need to update O(B) nodes per iteration. Thus, \(\varvec{z}^{(t)}\) is updated O(ETB) times in total.

Algorithm 1 Riemannian Gradient Descent for DNML criteria

4 Experimental results

This section presents a comparison between the proposed criteria and conventional methods using artificial and real-world datasets. The code, data, and training details are presented in “Appendix C.”

4.1 Methods for comparison

We used three criteria—\(\text {AIC}\), \(\text {BIC}\), and MinGE—for a comparative analysis of the performance of the proposed method. Here, the AIC and BIC are computed with respect to the non-identifiable model; that is, \(\beta \), \(\gamma \), and \(\varvec{z}\) are interpreted as parameters, and the criteria are defined as follows:

$$\begin{aligned} \text {AIC}(\varvec{y};D):=&-\log p(\varvec{y} \mid \varvec{z};{\hat{\beta }}(\varvec{y}, \varvec{z}), {\hat{\gamma }}(\varvec{y}, \varvec{z}))+(nD+2), \\ \text {BIC}(\varvec{y};D) :=&-\log p(\varvec{y} \mid \varvec{z};{\hat{\beta }}(\varvec{y}, \varvec{z}), {\hat{\gamma }}(\varvec{y}, \varvec{z})) +\frac{nD+2}{2}\log \frac{n(n-1)}{2}. \end{aligned}$$

Note that these criteria are not guaranteed to work for this model because of the non-identifiability. MinGE [11] is a dimensionality selection criterion for Euclidean graph embeddings. We set the weighting factor \(\lambda =1\) and selected the dimensionality at which MinGE was closest to 0.

Furthermore, we did not consider cross-validation (CV) for comparison because CV requires considerable computation time, particularly when learning graph embeddings.

It should be noted that we optimized \(-\log p(\varvec{y}, \varvec{z};\beta , \gamma , \sigma )\) for DNML-PUD, \(-\log p(\varvec{y}, \varvec{z};\beta , \gamma , \Sigma )\) for DNML-WND, and \(-\log p(\varvec{y} \mid \varvec{z}; \beta , \gamma )\) for AIC and BIC. Thus, three embeddings were generated for each graph.

4.2 Artificial dataset

In this experiment, we verified whether the proposed DNML criteria could estimate the true dimensionality.

4.2.1 Dataset detail

We considered the case of \(D_{\text {true}}=8\), where \(D_{\text {true}}\) is the true dimensionality of the PUDs. We generated a graph for each combination of parameters from the following candidates: \(n\in \{ 800, 1600, 3200, 6400\}\), \(\beta \in \{0.5, 0.6, 0.7, 0.8\}\), and \(\sigma \in \{0.5, 1.0, 2.0\}\). Furthermore, we set \(R=\log n\) and \(\gamma = \beta \log n\). Consequently, we obtained 48 graphs in total, which we call PUD-8. Similarly, we generated PUD-16, whose true dimensionality was 16. The other parameters are the same as those of PUD-8.

Subsequently, we generated WND-8. For the true dimensionality \(D_{\text {true}}=8\), we generated a graph for each combination of parameters from the following candidates: \(n\in \{ 800, 1600, 3200, 6400\}, \beta \in \{0.5, 0.6, 0.7, 0.8\}\), and \(\Sigma \in \{(0.35\log n)^2 I\), \((0.375\log n)^2 I\), \((0.40\log n)^2 I\}\), where \(I\in {\mathbb {R}}^{D\times D}\) is the identity matrix. Furthermore, we also set \(R=\log n\) and \(\gamma = \beta \log n\). Similarly, we generated WND-16 with \(\Sigma \in \{(0.225\log n)^2 I\), \((0.25\log n)^2 I\), \((0.275\log n)^2 I\}\). The other parameters are the same as those of WND-8. The set of candidates for the dimensionalities was \({\mathcal {D}}=\{ 2, 4, 8, 16, 32, 64 \}\).

In the above generation process, the parameters were set such that the generated graphs were sparse; that is, the average degree is low relative to the number of nodes.

Table 1 Average MAPs on PUD-8. The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)
Table 2 Average MAPs on PUD-16. The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)
Table 3 Average MAPs on WND-8. The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)
Table 4 Average MAPs on WND-16. The bold indicates either the maximum MAP or MAP within a 10% decrease from the maximum one (average estimated dimensionality in the parentheses)
Fig. 3 Results on PUD-8. Left: \(n=800\). Right: \(n=6400\)

Fig. 4 Results on WND-8. Left: \(n=800\). Right: \(n=6400\)

4.2.2 Results

To provide an illustrative example for each criterion, we first compared the selected dimensionality of PUD-8 and WND-8 with \(n=800, 6400\). Figures 3 and 4 show the normalized values for each criterion.

For \(n=800\), AIC, BIC, and DNML selected \(D=4\). Intuitively, a graph with a few nodes is expected to be embedded in low dimensionality, even if its true dimensionality is high. For \(n=6400\), DNML selected the correct dimensionality \(D=8\), whereas AIC and BIC selected incorrect dimensionalities. This implies that the DNML criteria can select the true dimensionality with a sufficient amount of data. MinGE selected the maximum dimensionality \(D=64\) in all cases. This is possibly because MinGE was designed for Euclidean embeddings, which require larger dimensionality than hyperbolic embeddings for hierarchical structures, as discussed in Sect. 1.1.

Next, we provide a quantitative comparison in terms of mean average precision (MAP) [37]. The MAP is calculated from the ranking of the dimensionalities, which was created in ascending order of each criterion. Furthermore, we applied DNML-PUD to the WND datasets and DNML-WND to the PUD datasets.

Tables 1, 2, 3, and 4 present the results for PUD-8, PUD-16, WND-8, and WND-16, respectively. Firstly, the MAPs of BIC and MinGE were relatively low, and they always selected \(D=4\) and \(D=64\), respectively. Since the selected dimensionalities were constant, BIC and MinGE are less reliable.

For AIC, we observed good performance in many cases; however, it tends to overestimate the true dimensionality for PUD-8 and WND-8 with \(n=6400\). This is because the penalty term of AIC is smaller than those of other criteria. For DNML criteria, in general, when the sample size is sufficiently large, DNML-PUD identifies the true dimensionality of the PUD dataset and the same tendency holds for DNML-WND and the WND dataset. Thus, we concluded that the proposed DNML criteria are more effective than AIC when the true dimensionality of the given graph is low.

Note that, in general, the performance of DNML-PUD in the WND dataset varied, sometimes being better and sometimes worse compared to DNML-WND. Similarly, in the PUD dataset, the performance of DNML-WND also varied, sometimes being better and sometimes worse compared to DNML-PUD. This is because the theoretical properties of the MDL principle are not valid when the generation process of the given data and the assumed generation process for calculating DNML code-length are different from each other. Therefore, this observation is an expected result of the mismatch of the generation process.

4.3 Real-world datasets

We used scientific collaboration networks from [38,39,40], flight network from https://openflights.org, protein–protein interaction network from [41], and the WN datasets from [27] for our study because they were employed in [16, 42, 43], which are representative studies in the field of hyperbolic embeddings. The experimental results in [16, 42, 43] demonstrated that hyperbolic embeddings outperformed Euclidean embeddings in several graph mining tasks performed on the networks. Therefore, we concluded that they are suitable for comparing our proposed method with others.

4.3.1 Link prediction

The DNML-PUD, DNML-WND, and other model selection criteria were applied to eight real-world graphs. In real-world graphs, the true dimensionality is unknown. Therefore, in this experiment, the link prediction performance for the selected dimensions was evaluated.

Dataset detail

The details of the eight graphs are listed below.

  • Scientific collaboration networks. We used AstroPh, CondMat, GrQc, and HepPh from [40], Cora from [38], and PubMed from [39]. These graphs are networks that represent the co-authorship of papers, where an edge exists between two people if they are co-authors.

  • Flight networks. We used Airport from https://openflights.org/. In this graph, nodes represent airports, and edges represent airline routes.

  • Protein–protein interaction (PPI) networks. We further used PPI from [41]. This graph represents the protein interactions in yeast.

Furthermore, Table 5 summarizes the statistics of these graphs.

Then, each graph was split into a training set \(\varvec{y}_{\text {train}} \subset \varvec{y}\) and test set \(\varvec{y}_{\text {test}}\subset \varvec{y}\). The test set \(\varvec{y}_{\text {test}}\) comprises the positive and negative samples. First, we sampled \(10\%\) of the positive samples from a graph. Subsequently, to generate negative samples, we sampled an equal number of node pairs with no edge connection. Finally, we obtained the training set \(\varvec{y}_{\text {train}} :=\varvec{y}{\setminus } \varvec{y}_{\text {test}}\).

For \(\varvec{y}_{\text {test}}\), we calculated the area under the curve (AUC), which we define as follows. We first calculated the distance for each sample pair in \(\varvec{y}_{\textrm{test}}\). Subsequently, we calculated the true positive rate and false positive rate with a fixed threshold on the distance. Finally, we obtained the receiver operating characteristic (ROC) curve by varying the threshold, and the AUC is defined as the area under the ROC curve.
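Concretely, the AUC can be computed by using the negated distance as a score, since smaller distances should indicate edges. A minimal sketch with scikit-learn (an illustrative choice of library) is shown below.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def link_prediction_auc(distances, labels):
    """AUC of the ROC curve obtained by thresholding the hyperbolic distance.

    distances: d(z_i, z_j) for each test pair (i, j).
    labels:    1 for positive samples (true edges), 0 for negative samples.
    """
    return roc_auc_score(labels, -np.asarray(distances))

# Toy usage: connected pairs tend to be closer than unconnected ones.
d = np.array([0.5, 1.2, 0.8, 3.1, 2.7, 4.0])
y = np.array([1, 1, 1, 0, 0, 0])
print(link_prediction_auc(d, y))   # 1.0 for this perfectly separated toy example
```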

The AUC was used to quantify the link prediction performance. The candidates of dimensionalities were \({\mathcal {D}}=\{ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64 \}\).

Table 5 Statistics of scientific collaboration networks
Table 6 Selected dimensionalities of each method
Fig. 5 AUCs on link prediction. First row: AstroPh and CondMat. Second row: GrQc and HepPh. Third row: Cora and PubMed. Fourth row: Airport and PPI

Fig. 6 Results of each criterion. First row: AstroPh and CondMat. Second row: GrQc and HepPh. Third row: Cora and PubMed. Fourth row: Airport and PPI

Results

Figure 5 shows the AUCs of the optimal embeddings associated with \(-\log p(\varvec{y}, \varvec{z};\beta , \gamma , \sigma )\), \(-\log p(\varvec{y}, \varvec{z};\beta , \gamma , \Sigma )\), and \(-\log p(\varvec{y} \mid \varvec{z}; \beta , \gamma )\). Figure 6 shows the normalized values of each criterion, and Table 6 shows the selected dimensionalities. The performance at the dimensionalities selected by DNML-PUD and DNML-WND was not the best, and higher dimensionalities tended to yield higher AUCs.

According to [24], the consistency of the MDL model selection is theoretically guaranteed; that is, the model with the shortest code-length would converge to the true one if it exists. Therefore, the dimensionalities selected by DNML were considered to be close to the dimensionalities of the true probabilistic models that generated the data. However, our results suggest that such dimensionality of the true probabilistic model is not necessarily the best one for link prediction.

Table 7 Average conciseness of each method with \(\epsilon _{\textrm{max}} = 0.050\) and 0.100 (the bold text indicates either the maximum conciseness or conciseness within a 10% decrease from the maximum one)
Fig. 7 Typical example of \(c({\hat{D}}, \epsilon )\)

In this section, we provide another perspective on the experimental results. As discussed in Sect. 1.1, dimensionality also controls the computation time and memory. Therefore, it is important to select a dimensionality at which a relatively high performance is achieved while maintaining low computational resources (e.g., using embeddings in edge devices). To quantify this, we introduce conciseness defined as follows: Let \({\mathcal {D}}:=\{D_1, \dots , D_N\}\) be the candidates of dimensionalities, \(\textrm{AUC}_{D_i}\) be the AUC at dimensionality \(D_i\), \(\overline{\textrm{AUC}}\) be the maximum AUC, and \(\epsilon _{{\max }}\) be a maximum tolerance gap relative to \(\overline{\textrm{AUC}}\). For the selected dimensionality \({\hat{D}}\) of each criterion, the conciseness is provided by:

$$\begin{aligned} \textrm{conciseness} ({\hat{D}}, {\epsilon _{\max }})&:=\frac{1}{\epsilon _{{\max }}P} \sum _{i=0, 1, \dots , P} c \biggl ({\hat{D}}, \frac{i}{P}\epsilon _{{\max }} \biggr ), \\ c({\hat{D}}, \epsilon )&:={\left\{ \begin{array}{ll} 1-\frac{\log _2 {\hat{D}} - \log _2 D_{{\min }}}{\log _2 D_{{\max }} - \log _2 D_{{\min }}} &\quad ({\hat{D}} \in {\mathcal {D}}_\epsilon ), \\ 0 &\quad ({\hat{D}} \notin {\mathcal {D}}_\epsilon ), \end{array}\right. } \end{aligned}$$

where \({\mathcal {D}}_\epsilon :=\{ D_i\in {\mathcal {D}} \mid \overline{\textrm{AUC}}-\textrm{AUC}_{D_i} < \epsilon \}\), \(D_{{\min }}:=\min _{D_i \in {\mathcal {D}}_\epsilon } D_i, D_{{\max }}:=\max _{D_i \in {\mathcal {D}}_{\epsilon }} D_i\), and \(P \in {\mathbb {Z}}_{\ge 1}\) is the number of candidate points to calculate the conciseness.

Figure 7 shows a typical example of \(c({\hat{D}}, \epsilon )\). The proposed conciseness measure assumes a situation in which the AUC improves as the dimensionality increases but the extent of the improvement diminishes, and in which, with limited computational resources, we want to select a low dimensionality whose AUC stays within the maximum tolerance gap. Based on this motivation, the conciseness measure is designed to take high values when \({\hat{D}}\in {\mathcal {D}}\) is close to \(D_{{\min }}\).
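For reference, the conciseness measure can be computed as in the following sketch (the handling of the degenerate case \(D_{{\min }}=D_{{\max }}\) and the example values are our own illustrative choices).

```python
import numpy as np

def conciseness(D_hat, dims, aucs, eps_max, P=100):
    """Average of c(D_hat, eps) over eps = 0, eps_max/P, ..., eps_max."""
    dims, aucs = np.asarray(dims), np.asarray(aucs)
    auc_best = aucs.max()
    total = 0.0
    for i in range(P + 1):
        eps = i * eps_max / P
        D_eps = dims[auc_best - aucs < eps]         # dimensionalities within the tolerance gap
        if D_hat not in D_eps:
            c = 0.0
        elif D_eps.min() == D_eps.max():            # degenerate case: only one candidate survives
            c = 1.0
        else:
            lo, hi = np.log2(D_eps.min()), np.log2(D_eps.max())
            c = 1.0 - (np.log2(D_hat) - lo) / (hi - lo)
        total += c
    return total / (eps_max * P)

dims = [2, 4, 8, 16, 32, 64]
aucs = [0.80, 0.86, 0.90, 0.92, 0.93, 0.93]
print(conciseness(D_hat=8, dims=dims, aucs=aucs, eps_max=0.05))
```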

Since the conciseness significantly depends on \(\epsilon _{{\max }}\), we computed it for \(\epsilon _{{\max }}=0.050\) and 0.100. Table 7 shows the average conciseness of the selected dimensionalities. To calculate the conciseness, we used the embeddings associated with \(-\log p(\varvec{y}, \varvec{z}; \beta , \gamma , \sigma )\) for DNML-PUD, \(-\log p(\varvec{y}, \varvec{z}; \beta , \gamma , \Sigma )\) for DNML-WND, and \(-\log p(\varvec{y} \mid \varvec{z}; \beta , \gamma )\) for AIC, BIC, and MinGE. Furthermore, we set \(\overline{\textrm{AUC}}\) as the maximum AUC associated with three embeddings.

The best or second-best performance was achieved by either DNML-PUD or DNML-WND in many cases. For AIC, the performance was relatively high, but not the best in many cases. For BIC, the performance was relatively high, specifically for \(\epsilon _{{\max }} = 0.100\). This indicates that BIC is effective when the maximum tolerance gap is large. For MinGE, the performance was close to 0 because the selected dimensionality was considerably high. Overall, the proposed method works well in that it identifies a dimensionality with a relatively high AUC while maintaining low computational resources.

Note that all the performances were 0 for GrQc. This is because higher dimensionalities achieve considerably higher AUCs in GrQc, unlike most other networks, where increasing the dimensionality does not significantly improve the AUC. In such scenarios, the conciseness measure does not take positive values unless \(\epsilon _{{\max }}\) is set to a considerably high value; however, setting an excessively high tolerance gap (e.g., \(\epsilon _{{\max }}=0.300\)) lacks practical meaning, and it is sufficient to select the maximum dimensionality within the given computational resources.

4.3.2 Preservation of hierarchy

To investigate the extent to which the hierarchical structure was preserved, we used a subset of WordNet [27] closely following the setting in [16, 18].

Table 8 Statistics of WN datasets

Dataset detail

We first considered the transitive closure of the is-a relationship for all the nouns. Subsequently, we took the subset of the nouns that have “mammal” as a hypernym and selected relations that have “mammal” as a hyponym or hypernym. Finally, we connected two nouns if they have an is-a relationship. We refer to this dataset as WN-mammal. Similarly, we generated WN-solid, WN-tree, WN-worker, WN-adult, WN-instrument, WN-leader, and WN-implement. Table 8 summarizes the statistics for these datasets.

Each graph is expected to have a hierarchical structure because a hypernym is often related to many hyponyms. We embedded eight graphs with various dimensionalities and calculated each criterion.

Fig. 8 Results on WN datasets. First row: WN-mammal and WN-solid. Second row: WN-tree and WN-worker. Third row: WN-adult and WN-instrument. Fourth row: WN-leader and WN-implement

Fig. 9 Is-a scores on WN datasets. First row: WN-mammal and WN-solid. Second row: WN-tree and WN-worker. Third row: WN-adult and WN-instrument. Fourth row: WN-leader and WN-implement

Result

Figure 8 shows the normalized criteria. The candidates of dimensionalities were \({\mathcal {D}}=\{ 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64 \}\).

Subsequently, we quantified the extent to which the obtained embeddings reflected the true hierarchy of the is-a relation on the data. Following [16], we used the following score function:

$$\begin{aligned} {\text {is-a-score}}(\varvec{u}, \varvec{v}) :=(\alpha (r_u -r_v)-1) d_{\varvec{u}\varvec{v}}, \end{aligned}$$

where \(r_u\) and \(r_v\) are the radius coordinates of \(\varvec{u}\) and \(\varvec{v}\), respectively, and \(\alpha >0\) is a constant. In general, it can be assumed that a hypernym has a lower radial coordinate than its hyponyms. Thus, \(\alpha (r_u -r_v)\) acts as a penalty when v, which is a hypernym of u, is lower in the embedding hierarchy. Therefore, the score is expected to be high when the embedding reflects the true hierarchy of data. Figure 9 shows the average score over all is-a relation pairs for each dimensionality with \(\alpha =100\). Overall, the lower dimensionality achieved higher is-a scores.
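On the hyperboloid, the radial coordinate of a point equals its distance from the origin, \(r_x = {{\,\textrm{arcosh}\,}}(x_0)\), so the is-a score can be computed as in the following sketch (function name ours).

```python
import numpy as np

def isa_score(z_u, z_v, alpha=100.0):
    """is-a-score(u, v) = (alpha * (r_u - r_v) - 1) * d(u, v) on the hyperboloid."""
    d_uv = np.arccosh(np.clip(z_u[0] * z_v[0] - np.dot(z_u[1:], z_v[1:]), 1.0, None))
    r_u, r_v = np.arccosh(z_u[0]), np.arccosh(z_v[0])   # radial coordinate = distance from the origin
    return (alpha * (r_u - r_v) - 1.0) * d_uv
```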

Table 9 Average benefits on WN datasets (the bold text indicates the maximum one)

Next, we provide a quantitative comparison of the benefits. With an estimated dimensionality \({\hat{D}}\) and a maximum tolerance gap \(T_{\text {gap}}\), the benefit is defined as follows:

$$\begin{aligned} b({\hat{D}}, D_{\text {best}}) :=\max \biggl \{ 0, 1-\frac{|\log _2 {\hat{D}}- \log _2 D_{\text {best}} |}{T_{\text {gap}}} \biggr \}, \end{aligned}$$

where \(D_{\textrm{best}}\) is the dimensionality at which the best is-a score is achieved. It has a high value when the estimated dimensionality is close to \(D_{\textrm{best}}\).
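The benefit is straightforward to compute; a small sketch follows (function name ours).

```python
import numpy as np

def benefit(D_hat, D_best, T_gap=2.0):
    """b(D_hat, D_best) = max{0, 1 - |log2 D_hat - log2 D_best| / T_gap}."""
    return max(0.0, 1.0 - abs(np.log2(D_hat) - np.log2(D_best)) / T_gap)

print(benefit(2, 2), benefit(4, 2), benefit(16, 2))   # 1.0, 0.5, 0.0
```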

Table 9 shows the average benefits over the WN datasets. Note that \(D_{\textrm{best}}\) was 2 for all datasets, and we set \(T_{\text {gap}}=2\) in the experiments. The DNML-WND and BIC achieved the best results. This result implies that DNML-WND and BIC selected the dimensionality that reflects the true hierarchical structure of the particular graph.

4.4 Discussion

We summarize the experimental results. Firstly, as a general trend, the AIC usually selects higher dimensionality, the BIC lower dimensionality, and the DNML criteria middle dimensionality. This is due to the relative magnitude of the penalty terms. MinGE always selected the largest dimensionality in all the experiments. This is possibly because MinGE was designed for Euclidean embeddings, which require higher dimensionality than hyperbolic embeddings for hierarchical structures, as discussed in Sect. 1.1.

The performance of the AIC was relatively good on the first and second tasks, but not on the third task. In contrast, the BIC showed high performance on the third task, but low performance on the first and second tasks. The DNML criteria do not necessarily give the best performance, but they often give the best or second-best performance across all tasks. Therefore, it can be concluded that the performance of the proposed DNML criteria was good on average across all tasks.

5 Conclusion and future work

In this study, we proposed a dimensionality selection method for hyperbolic embeddings based on the MDL principle. We demonstrated that there is a non-identifiability problem for hyperbolic embeddings. Therefore, we employed latent variable models of hyperbolic embeddings using PUDs and WNDs to formulate dimensionality selection as statistical model selection for latent variable models. Within this formulation, we proposed the DNML code-length criterion for dimensionality selection based on the MDL principle. For artificial datasets, we experimentally demonstrated that our method is effective when the true dimensionality is low. For real-world datasets, we used the scientific collaboration networks and WN datasets. For the scientific collaboration networks, we demonstrated that the proposed method selected simple probabilistic models while maintaining high AUCs. For the WN datasets, we confirmed that the proposed method selects the dimensionality that preserves the true hierarchy of graphs. Overall, the proposed method performed well on average.

Note that we cannot directly use the dimensionality selected by the proposed method in hyperbolic neural network methods, such as [43, 44]. Because the probabilistic models used in the proposed method differ from those used in [43, 44], the optimal dimensionalities within them are inherently different. This implies that using the dimensionality selected by the proposed method in [43, 44] is not supported by the rationale of the DNML code-length and the MDL principle. However, we would like to emphasize that even in hyperbolic neural networks, it is possible to consider latent variable models and derive DNML code-lengths following a procedure similar to ours. Specifically, we can treat the parameters and outputs of hidden layers of hyperbolic neural networks as latent variables. In this sense, the proposed method can be generalized. However, as indicated in “Appendix B,” the derivation of the DNML code-length requires significant effort and is not straightforward. Therefore, we consider the extension of our proposed method to hyperbolic neural networks as future work.

The latent variable model approach adopted in this study is promising and is not limited to hyperbolic space. In the future, we plan to build a methodology for dimensionality selection in Euclidean and spherical embeddings by introducing latent variable models for the corresponding spaces [45, 46].