Object embedding using an information geometrical perspective

Sugiura, Taiki; Murata, Noboru

doi:10.1007/s41884-023-00114-z


Research Paper
Open access
Published: 27 July 2023

Volume 6, pages 435–462, (2023)

Information Geometry


A Publisher Correction to this article was published on 24 August 2023

This article has been updated

Abstract

Acquiring vector representations of objects is essential for applying machine learning, statistical inference, and visualization. Although various vector acquisition methods have been proposed that consider the relationships between objects in the target data, most of them assume only a single relevance level. In real-world data, however, there are cases where multiple relationships exist between objects, such as time-varying similarity in time-series data or various weighted edges on graph-structured data. In this paper, we propose a vector acquisition method that assigns vectors in a single coordinate system to objects while preserving the information given by multiple relations between objects. In the proposed method, a logarithmic bilinear model parameterized by representation vectors is utilized for approximating relations between objects based on a stochastic embedding idea. The inference algorithm proposed in this study is interpreted in terms of information geometry: the m-projection of the probability distribution constructed from observed relations onto the model manifold and the e-mixture in the model manifold are alternately repeated to estimate the parameters. Finally, the performance of the proposed method is evaluated using artificial data, and a case study is conducted using real data.


1 Introduction

The acquisition of vector representations for objects is useful for information processing. Vector representations allow the visualization of relationships between objects and can be used as a preprocessing tool for statistical inference and machine learning algorithms. However, there are many cases in which objects, such as images and sentences, are difficult to represent as fixed-length vectors, or in which only relationships between objects, such as relevance levels, are observed. In this paper, we propose a method for obtaining an appropriate vector representation in situations where multiple relevance degrees are given for each object. For example, consider the case where a movie is rated in each country in each year. In this case, multiple relevance degrees are given for the movie. Each country has different preferences for movies, and these preferences change over time. The proposed method obtains a vector representation that takes into account multiple degrees of relevance for such data with multiple observations over time. As another example, we can discuss cultural differences and trends in scientific papers. It is assumed that each journal or international conference has a different culture, as different words are used in different venues, and that these cultures also change over time. The proposed method is expected to be applied to the analysis of such data. This case study is discussed in detail in Sect. 4. There are also cases, such as graph data and image data, where only distances and similarities between objects are observed. In this paper, such distances and similarities are treated as special cases of relevance and discussed comprehensively. The proposed method can be applied to obtain vector representations for various data, including time-series data for which similarities are given at different times and graph-structured data whose edges are assumed to carry multiple patterns of weights.

In the following, a set of objects is denoted by ${\mathcal {O}}$, and individual objects are denoted by o, with the subscript i where necessary to distinguish them. The problem of assigning a vector $\varvec{x}$ to an object o so as to approximate the relationships between the objects is generally called object embedding or vectorization. This problem can be divided into two types. The first is “constellation”, in which a vector representation is assigned to each individual object,

$$\begin{aligned} o_i \mapsto \varvec{x}_i, \ \forall i. \end{aligned}$$

The second is “embedding”, in which a function is constructed that assigns vectors to the entire set of target objects,

$$\begin{aligned} \phi : {\mathcal {O}} \rightarrow {\mathcal {X}}. \end{aligned}$$

For the latter embedding problem, the problem of learning the mapping $\phi $ from the data using a function approximator,

$$\begin{aligned} o_i \mapsto \varvec{x}_i = \phi (o_i), \end{aligned}$$

such as a neural network, is considered. In this paper, we deal in particular with the former constellation problem.

Most existing methods construct the vectors independently for each given relevance level, and the comparison or manipulation of vectors estimated from multiple relevance levels is not assumed. In the examples of movies and journals mentioned earlier, such methods could be used to analyze a single year or each year separately, but they are not intended for the analysis of data acquired across years, i.e., at multiple times. Individual constellations for multiple relevance levels result in vectors that do not preserve the data characteristics of the given relevance levels.

Therefore, we propose a constellation method for multiple relevance levels. In this paper, we assign the vectors to objects using the stochastic embedding approach represented by Stochastic Neighbor Embedding (SNE) [1], introduced in Sect. 2. Reformulated in terms of stochastic embedding, the current problem is that of estimating appropriate probability distributions in a situation where multiple empirical distributions are available for each object. Simply estimating a distribution from each empirical distribution separately cannot capture the relationship between the empirical distributions. It is necessary to express this relationship by imposing constraints on the target space. The proposed logarithmic bilinear model (log-bilinear model) approximates the relevance levels using two kinds of vectors as parameters, one for the objects and the other for the features. One of the properties of the log-bilinear model is that, when one set of parameters is fixed, the space spanned by the remaining parameters is an exponential family. In this paper, we use this property to represent the relationship between multiple empirical distributions by sharing some of the parameters of the probability distributions to be estimated. We also show that the weighted sum of the obtained vectors represents a mixture of probability distributions in the space of probability distributions and can be easily visualized after estimation.

The contributions of this paper are described as follows.

Our proposed framework can provide vectors even for multiple relevance levels.
Our algorithm can be interpreted as a simple operation in terms of information geometry.
The proposed method enables vectorization even when additional data are generated.

The structure of this paper is as follows. Section 2 provides an overview of existing mapping methods based on similarity and discrepancy, comparing classical and probabilistic methods. In Sect. 3, we first introduce auxiliary features for objects and then propose approximate representations using a log-bilinear model with two kinds of vectors, for the features and for the objects, and its optimization, on the basis of an information geometric interpretation. In Sect. 4, we evaluate the effectiveness of the proposed method and its visualization using artificial and real data. In Sect. 5, we summarize the proposed method and discuss the remaining issues and its possible applications.

2 Related works

In this section, we provide an overview of existing methods of mapping vectors based on relevance level. First, let us organize the symbols. The set of objects is denoted by ${\mathcal {O}}$, and the number of objects is represented by n. Individual objects are distinguished by subscript i as

$$\begin{aligned} o_i \in {\mathcal {O}}, \quad i \in \Lambda = \{1, \dots , n\}. \end{aligned}$$

Each object is given a d-dimensional vector $\varvec{x} \in {\mathbb {R}}^d$, and the matrix X is defined by arranging the vectors $\varvec{x}_i$ corresponding to the objects $o_i$ as its rows,

$$\begin{aligned} X = \left[ \varvec{x}_1, \varvec{x}_2, \dots , \varvec{x}_n \right] ^{\textsf{T}} \in {\mathbb {R}}^{n \times d}. \end{aligned}$$

To estimate the matrix X, we introduce a matrix $R \in {\mathbb {R}}^{n \times m}$ that represents the relevance between objects and features. Let ${\mathcal {U}}$ be a set of m features. We assume that an object o and a feature u have a relevance $\textrm{rel}(o, u) \in {\mathbb {R}}$ between them, and define

$$\begin{aligned} (R)_{ij} = \textrm{rel}(o_i, u_j), \end{aligned}$$

where $(R)_{ij}$ is the (i, j)th entry of the matrix R. As mentioned above, similarity and distance can be regarded as special cases of the relevance R, in which case ${\mathcal {U}} = {\mathcal {O}}$ and $(R)_{ij}$ gives the similarity or distance between $o_i$ and $o_j$.

In the following, we describe methods for estimating the matrix X from the relevance R. First, Torgerson’s classical metric multidimensional scaling (MDS) is a technique to obtain a low-rank approximation X from the similarity matrix by minimizing the following loss function,

$$\begin{aligned} L_{\textrm{Tor}}(R,X) = \sum _{i,j \in \Lambda }\left( (R)_{ij} - \varvec{x}_i^{\textsf{T}} \varvec{x}_j \right) ^2 = \Vert R - XX^{\textsf{T}} \Vert _F^2, \end{aligned}$$

where $\Vert \cdot \Vert _F$ denotes the Frobenius norm [2]. Note that R is a symmetric matrix where each element represents the degree of similarity between objects. On the other hand, Kruskal’s nonmetric MDS [3] uses the Minkowski distance, a generalization of the Euclidean distance, to find the matrix X that approximates the relevance matrix R,

$$\begin{aligned} L_{\textrm{Kru}}(R,X) = \frac{ \sum _{i,j\in \Lambda } \left( \phi ((R)_{ij}) - \Vert \varvec{x}_i - \varvec{x}_j \Vert _p \right) ^2 }{ \sum _{i,j\in \Lambda } \Vert \varvec{x}_i - \varvec{x}_j \Vert ^2_p }, \end{aligned}$$

where $\Vert \cdot \Vert _p $ denotes the Minkowski distance of order p and $\phi (\cdot )$ is a suitable monotonic transformation. In this case, the relevance R is a matrix such that each element represents the distance between objects.
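As a rough illustration of the Torgerson loss above, the following sketch computes a rank-d factor X from the leading eigenpairs of a symmetric similarity matrix R. It is written in Python with NumPy; the function names `torgerson_embedding` and `torgerson_loss` and the synthetic data are assumptions made for this example and are not taken from the original work.

```python
import numpy as np

def torgerson_embedding(R, d):
    """Rank-d factor X such that X @ X.T approximates a symmetric similarity matrix R."""
    R = 0.5 * (R + R.T)                      # enforce symmetry against round-off
    eigvals, eigvecs = np.linalg.eigh(R)     # eigenvalues are returned in ascending order
    top = np.argsort(eigvals)[::-1][:d]      # indices of the d largest eigenvalues
    lam = np.clip(eigvals[top], 0.0, None)   # negative eigenvalues cannot be matched by X X^T
    return eigvecs[:, top] * np.sqrt(lam)    # scale each eigenvector by sqrt(eigenvalue)

def torgerson_loss(R, X):
    """Sum of squared differences between R and the Gram matrix X X^T."""
    return np.sum((R - X @ X.T) ** 2)

rng = np.random.default_rng(0)
X_true = rng.normal(size=(6, 2))
R = X_true @ X_true.T                        # synthetic similarities of exact rank 2
X_hat = torgerson_embedding(R, d=2)
loss = torgerson_loss(R, X_hat)
```

Because R is built here from a rank-2 factor, the recovered Gram matrix reproduces it up to round-off, although `X_hat` itself agrees with `X_true` only up to an orthogonal transformation.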

There are also other methods such as robust Euclidean embedding [4], which improves robustness to outliers by evaluating differences between relevance levels and estimated vectors with an $L_1$-loss function, and nonmetric MDS, which optimizes ordinal statistics [3]. Since in practical problems with high-dimensional data the essential information is often located in a low-dimensional space, various techniques have been actively studied to reconstruct it appropriately in the low-dimensional space. The methods differ in how they incorporate information from a neighborhood, which is assumed to contain important information. Examples include Isomap [5], which recalculates the geodesic distances from a neighborhood, semidefinite embedding (SDE) [6], which uses a kernel matrix, locally linear embedding (LLE) [7], and Laplacian eigenmaps [8]. For the practical problem of acquiring vector representations of words, a low-rank approximation of the word-document matrix has been proposed [9, 10].

In the classical methods described above, the approximation error is evaluated using a matrix norm. On the other hand, a method called stochastic embedding has been proposed, which evaluates the reconstruction error of vectors based on the statistical distance between probability distributions.

Stochastic neighbor embedding (SNE) is the pioneer of stochastic embedding. In this method, the vectors are mapped by expressing the probability that an object $o'$ is in the neighborhood of an object o as a conditional probability mass function $p_o(o')$,

$$\begin{aligned} p_o(o') = \frac{f(o, o')}{\sum _{o'' \in {\mathcal {O}}} {f(o, o'')}}. \end{aligned}$$

Note that f is a function of the similarity between objects. Let $\varvec{x}$ and $\varvec{x}'$ be the vectors of the objects o and $o'$, respectively. The corresponding conditional probability $q_o(o')$ that $o'$ is in the neighborhood of o is defined from the vectors as

$$\begin{aligned} q_o(o') = \frac{g(\Vert \varvec{x} - \varvec{x}'\Vert )}{\sum _{o'' \in {\mathcal {O}}} {g(\Vert \varvec{x} - \varvec{x}''\Vert )}}, \end{aligned}$$

where g is a function that assigns a higher probability to a smaller distance.

A statistical distance is employed as the evaluation function between probability distributions in stochastic embedding. The Kullback–Leibler divergence (KL divergence) is a commonly used statistical distance and is defined as

$$\begin{aligned} D_{\textrm{KL}}(P \Vert Q) = \sum _{\omega \in \Omega } { p(\omega ) \left( \log p(\omega ) - \log q(\omega ) \right) }. \end{aligned}$$

(1)

In stochastic embedding, the statistical distance between the two probability distributions is used as a loss function with respect to the matrix X,

$$\begin{aligned} L_\textrm{SNE}(D, X)&= \sum _{o \in {\mathcal {O}}} D_\textrm{KL}(P_o \Vert Q_o) \\&= \sum _{o \in {\mathcal {O}}} \sum _{o' \in {\mathcal {O}}} p_o(o') \left( \log p_o(o') - \log q_o(o') \right) . \end{aligned}$$
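The SNE loss above can be sketched numerically as follows. The example assumes a Gaussian kernel $\exp (-\Vert \cdot \Vert ^2)$ for both f and g and excludes each object from its own neighborhood; these choices, the helper names, and the random data are illustrative assumptions rather than the setting of the original work.

```python
import numpy as np

def sq_dists(Z):
    """Pairwise squared Euclidean distances between the rows of Z."""
    sq = np.sum(Z ** 2, axis=1)
    return np.maximum(sq[:, None] + sq[None, :] - 2.0 * Z @ Z.T, 0.0)

def row_normalize(W):
    """Turn nonnegative weights into conditional distributions, one per row."""
    W = W.copy()
    np.fill_diagonal(W, 0.0)                 # an object is not its own neighbor (assumption)
    return W / W.sum(axis=1, keepdims=True)

def sne_loss(P, X):
    """Sum over objects o of KL(P_o || Q_o), with q_o built from a Gaussian kernel."""
    Q = row_normalize(np.exp(-sq_dists(X)))
    mask = P > 0                             # terms with p = 0 contribute nothing
    return np.sum(P[mask] * (np.log(P[mask]) - np.log(Q[mask])))

rng = np.random.default_rng(0)
Z = 0.5 * rng.normal(size=(8, 5))            # hypothetical high-dimensional objects
P = row_normalize(np.exp(-sq_dists(Z)))      # p_o(o') from the observed similarities
X_random = rng.normal(size=(8, 2))           # an arbitrary 2-dimensional constellation
loss_random = sne_loss(P, X_random)
loss_same = sne_loss(P, Z)                   # using Z itself reproduces P exactly
```

In practice X would be updated by gradient descent on `sne_loss`; the sketch only evaluates the loss for two fixed candidates.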

Various methods have been proposed for stochastic embedding, depending on the choice of statistical distance and of the similarity calculation used to capture the relevance levels between objects. t-SNE [11], a variant of SNE, uses a function whose distribution after transformation is the t-distribution to improve estimation robustness and speed up computation. Other proposed methods include UMAP [12], which applies the theory of algebraic topology, and word2vec [13], which estimates the dual vectors using the asymmetry of the conditional probability defined from the similarity. GloVe [14], which acquires vectors by focusing on the ratio of conditional probabilities of objects i and j, is a key technique closely related to the proposed method. The problem setting of this paper is also related to embedding for time-series data. Dynamic t-SNE is an extension of t-SNE for mapping vectors to time-series data [15]. This method maps objects by adding penalties on differences between times to the conventional t-SNE loss function. Several methods, called “Dynamic Word Embedding”, have been proposed to apply word embedding to time-series data [16,17,18]. Bunte et al. describe these methods from a unified perspective [19].

Various techniques have been proposed to construct vectors that approximate relevance, although this problem can be viewed as a low-rank approximation problem of the relevance matrix R by a matrix X,

$$\begin{aligned} R \simeq XX^{\textsf{T}}. \end{aligned}$$

The classical method, metric MDS, evaluates the approximation error by the Frobenius norm, which results in a singular value decomposition of a symmetric matrix,

$$\begin{aligned} L_F(R, X) = \Vert R - XX^{\textsf{T}} \Vert _F. \end{aligned}$$

In this case, approximation errors are evaluated irrespective of the similarity owing to the nature of the Frobenius norm. On the other hand, the stochastic embedding approach considers the probabilistic models defined by relevances and vectors as

$$\begin{aligned} p_o(o') = \frac{f(o, o')}{\sum _{o'' \in {\mathcal {O}}} f(o, o'')},\ q_o(o') = \frac{g(\varvec{x}^{\textsf{T}} \varvec{x'})}{\sum _{o'' \in {\mathcal {O}}} g(\varvec{x}^{\textsf{T}} \varvec{x''})}, \end{aligned}$$

and evaluates the approximation error on the basis of the statistical distance as follows,

$$\begin{aligned} L_{\textrm{KL}}(R, X) = D_{\textrm{KL}} (P_o \Vert Q_o). \end{aligned}$$

In this case, the nature of the statistical distance gives more weight to relevance levels between objects with high similarity, i.e., high probability values. Various frameworks have been proposed for geometric approaches, such as Poincaré embedding [20], which is a method to improve performance by changing the calculation method of the distance between two objects and embedding them in the hyperbolic space. Related research in information geometry is described below, along with the proposed method.

3 Proposed methods

Since existing methods obtain the vectors for a single relevance matrix, as described in Sect. 2, they are not designed for use with multiple matrices. Therefore, it is difficult to compare the obtained vectors when an existing method is applied to multiple matrices. In this section, we propose a method of mapping vectors that can be used when multiple relevance levels are obtained, by assuming two types of vectors: object vectors and feature vectors. The object vectors can be compared by sharing the feature vectors among multiple relevance levels.

The symbols are described here. Let ${\mathcal {O}} = \{o_i\}_{i=1}^{n}$ be the set of objects and ${\mathcal {U}} = \{u_j\}_{j=1}^m$ be the set of features. For example, in the example of movie ratings by country described in the introduction, ${\mathcal {O}}$ is the set of movies and ${\mathcal {U}}$ is the set of countries. The function $\textrm{rel}(o, u)$ gives the relevance of an object $o \in {\mathcal {O}}$ to a feature $u \in {\mathcal {U}}$. In the case of the movie data example, the rating of the movie o in country u is defined as $\textrm{rel}(o, u)$.

In this section, we propose a method to estimate object vectors when K relevance matrices $R^{(1)}, \dots , R^{(K)}$ are given. First, consider the case of a single relevance matrix. The proposed method approximates the relevance of an object and a feature by the inner product of the vector $\varvec{x}_i$ of the object $o_i$ and the vector $\varvec{y}_j$ of the feature $u_j$,

$$\begin{aligned} \textrm{rel}(o_i, u_j) \simeq \varvec{x}_i^{\textsf{T}} \varvec{y}_j. \end{aligned}$$

(2)

Here, consider the two matrices X and Y whose rows are the vectors $\varvec{x}_i$ and $\varvec{y}_j$, respectively,

$$\begin{aligned} X&= \left[ \varvec{x}_1, \varvec{x}_2, \dots , \varvec{x}_n \right] ^{{\textsf{T}}}, \\ Y&= \left[ \varvec{y}_1, \varvec{y}_2, \dots , \varvec{y}_m \right] ^{{\textsf{T}}}. \end{aligned}$$

Equation (2) can be treated as a problem,

$$\begin{aligned} R \simeq&XY^{{\textsf{T}}}, \end{aligned}$$

(3)

which approximates the relevance matrix R by the matrices X and Y. In the following, we assume that $(R)_{ij}$ takes larger values when $o_i$ and $u_j$ are more strongly related.

Next, we define the probability distributions P and Q using the relevance matrix R and the matrices X and Y to evaluate the approximation accuracy of the decomposition in the framework of stochastic embedding. They indicate the probability that an object o and a feature u are observed simultaneously. The proposed method approximates the probability mass function p(o, u) of the empirical distribution P using a log-bilinear model based on the inner product of the vectors $\varvec{x}$ and $\varvec{y}$ corresponding to the object o and the feature u, respectively,

$$\begin{aligned} p(o, u; R)&= \exp (\textrm{rel}(o, u) - \phi (R)) , \end{aligned}$$

(4)

$$\begin{aligned} q(o, u; X, Y)&= \exp (\varvec{x}^{\textsf{T}} \varvec{y} - \zeta (X, Y)), \end{aligned}$$

(5)

where $\phi (R)$ and $\zeta (X,Y)$ are the normalization terms that make P(R) and Q(X, Y) probability distributions. The KL divergence between the joint probability distributions is used as the evaluation function for this problem,

$$\begin{aligned} L_{\textrm{KL}}(R, X, Y)&= D_{\textrm{KL}} (P(R) \Vert Q(X, Y)) \nonumber \\&= \sum _{o \in {\mathcal {O}}} \sum _{u \in {\mathcal {U}}} p(o, u; R) \left( \log p(o, u; R) - \log q(o, u; X, Y) \right) . \end{aligned}$$

(6)
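To make Eqs. (4)–(6) concrete, the following sketch builds the two joint distributions from a relevance matrix R and from the factors X and Y, and evaluates the KL divergence between them. The helper names `log_joint` and `kl_joint` and the random test matrices are assumptions introduced for this illustration.

```python
import numpy as np

def log_joint(S):
    """Log of the joint pmf proportional to exp(S): subtract the log normalizer."""
    m = S.max()                               # shift for numerical stability
    return S - (m + np.log(np.sum(np.exp(S - m))))

def kl_joint(R, X, Y):
    """KL(P(R) || Q(X, Y)) of Eq. (6) over all (object, feature) pairs."""
    log_p = log_joint(R)                      # Eq. (4): log p = rel(o, u) - phi(R)
    log_q = log_joint(X @ Y.T)                # Eq. (5): log q = x^T y - zeta(X, Y)
    return np.sum(np.exp(log_p) * (log_p - log_q))

rng = np.random.default_rng(0)
n, m, d = 5, 4, 2
X = rng.normal(size=(n, d))
Y = rng.normal(size=(m, d))
R_exact = X @ Y.T                             # relevance that the model fits exactly
R_noisy = R_exact + 0.5 * rng.normal(size=(n, m))
kl_exact = kl_joint(R_exact, X, Y)
kl_shifted = kl_joint(R_exact + 3.0, X, Y)    # a constant shift of R is absorbed by phi(R)
kl_noisy = kl_joint(R_noisy, X, Y)
```

The divergence vanishes when R equals $XY^{\textsf{T}}$ up to an additive constant, since the normalization term removes any constant offset, and it is positive otherwise.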

This concept of relying on a log-bilinear model is also the idea used in word2vec [13] and GloVe [14]. However, minimizing the KL divergence separately for multiple relevance matrices does not allow us to estimate appropriate vectors. Let A be a regular $d \times d$ matrix and define

$$\begin{aligned} X' = XA, \quad Y' = Y(A^{-1})^{\textsf{T}}. \end{aligned}$$

Since the product in Eq. (3) takes the same value for the pair X and Y and for the pair $X'$ and $Y'$, Eq. (6) also takes the same value for both pairs,

$$\begin{aligned} R = X'Y'^{{\textsf{T}}} = X A A^{-1} Y^{\textsf{T}} = XY^{\textsf{T}}. \end{aligned}$$

This means that the solution to this problem is determined only up to a regular matrix. For this reason, given multiple relevance levels $\left\{ R^{(k)}\right\} _{k=1}^{K}$, it is difficult to discuss the relationship between their solutions $\left\{ X^{(k)}\right\} $. By using a matrix Y shared for all k, we can derive an operation such that a convex combination of the matrices $X^{(k)}$ can be regarded as a mixture of the probability distributions $Q^{(k)}$,

$$\begin{aligned} X^{(k)}&= \left[ \varvec{x}_1^{(k)}, \dots , \varvec{x}_n^{(k)} \right] ^{{\textsf{T}}}, \\ Y&= \left[ \varvec{y}_1, \dots , \varvec{y}_m \right] ^{{\textsf{T}}}. \end{aligned}$$

With this shared matrix Y, the method can be extended to estimate the matrices $\{X^{(k)}\}$ for multiple relevance matrices. Let us consider an internally dividing point $\bar{\varvec{x}}$ between the vectors $\varvec{x}^{(1)}$ and $\varvec{x}^{(2)}$ estimated using the relevances $R^{(1)}$ and $R^{(2)}$,

$$\begin{aligned} \bar{\varvec{x}} = \alpha \varvec{x}^{(1)} + (1-\alpha ) \varvec{x}^{(2)} , \quad 0 \le \alpha \le 1. \end{aligned}$$

The probability mass function ${\bar{q}}(o,u; {\bar{X}}, Y)$ of the joint probability distribution ${\bar{Q}}({\bar{X}}, Y)$ by $\bar{\varvec{x}}$ can be written as

$$\begin{aligned}&{\bar{q}}(o, u; {\bar{X}},Y)\nonumber \\&\quad =\exp \left( \bar{\varvec{x}}^{{\textsf{T}}} \varvec{y}-{\bar{\zeta }}\left( {\bar{X}}, Y\right) \right) \nonumber \\&\quad =\exp \left( \left( \alpha \varvec{x}^{(1)}+(1-\alpha ) \varvec{x}^{(2)} \right) ^{{\textsf{T}}} \varvec{y}-{\bar{\zeta }} \left( {\bar{X}}, Y\right) \right) \nonumber \\&\quad =\exp \left( \alpha \left( \varvec{x}^{(1){\textsf{T}}} \varvec{y} -\zeta ^{(1)}\left( X^{(1)}, Y\right) \right) \right. \nonumber \\&\qquad \left. +(1-\alpha ) \left( \varvec{x}^{(2){\textsf{T}}} \varvec{y}-\zeta ^{(2)}\left( X^{(2)}, Y\right) \right) -\zeta '\left( {\bar{X}}, Y\right) \right) \nonumber \\&\quad =\left( q^{(1)}(o,u; X^{(1)},Y)\right) ^{\alpha } \left( q^{(2)}(o,u; X^{(2)},Y)\right) ^{1-\alpha } \exp \left( -\zeta '\left( {\bar{X}}, Y\right) \right) . \end{aligned}$$

(7)

where $q^{(k)}$ is the probability mass function of $Q^{(k)}$, and $\exp (-\zeta '({\bar{X}}, Y))$ is the normalization term. The distribution ${\bar{Q}}$ is a convex combination of the two distributions $Q^{(1)}$ and $Q^{(2)}$ under the log transformation,

$$\begin{aligned}&\log {\bar{q}} (o, u; {\bar{X}},Y) \\&\quad = \alpha \log q^{(1)}(o, u; X^{(1)},Y) + (1-\alpha ) \log q^{(2)}(o, u; X^{(2)},Y) - \zeta '({\bar{X}}, Y). \end{aligned}$$

Therefore, the distribution ${\bar{Q}}({\bar{X}},Y)$ is a mixture distribution called the e-mixture. Since the e-mixture emphasizes regions where both distributions have a high probability, it is intuitively an AND-like operation that takes the common part of the distributions. In this way, the mixture of the vectors corresponds to an operation of the distributions. The details of “e-mixture” are described in the next section. In the proposed method, we can estimate vectors for multiple relevance matrices by fixing Y. In this case, the matrices $\{X^{(k)}\}, Y$ are not uniquely determined. In this method, we reduce the degree of freedom of the solution by imposing the following constraints,

$$\begin{aligned} Y^{\textsf{T}}Y = I_d, \end{aligned}$$

(8)

where $I_d \in {\mathbb {R}}^{d\times d}$ is the identity matrix. Note that even with this condition, the estimates of X and Y remain arbitrary up to orthogonal matrices. Given multiple relevance levels, the optimization problem for the matrices is written as

$$\begin{aligned}&\textrm{minimize}\ L_{\textrm{KL}}(R^{(k)}, X^{(k)} Y^{{\textsf{T}}}) \nonumber \\&\quad \mathrm {with\ respect\ to\ } X^{(k)}, \forall k\ \textrm{and}\ Y \nonumber \\&\quad \mathrm {subject\ to\ } Y^{{\textsf{T}}} Y = I_d. \end{aligned}$$

(9)

In this paper, we imposed the constraint Eq. (8), but it could be any other constraint to reduce the degree of freedom of the solution.
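The remaining ambiguity under the constraint of Eq. (8) can be checked with a small numerical sketch. It compares a generic regular matrix with an orthogonal one; the variable names and the random matrices are assumptions made only for this illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 5, 6, 2
X = rng.normal(size=(n, d))
Y, _ = np.linalg.qr(rng.normal(size=(m, d)))   # reduced QR: columns satisfy Y^T Y = I_d

# A generic regular matrix A keeps the product X Y^T but breaks the constraint on Y.
A = rng.normal(size=(d, d)) + 3.0 * np.eye(d)
X_a, Y_a = X @ A, Y @ np.linalg.inv(A).T

# An orthogonal matrix G keeps both the product and the constraint.
G, _ = np.linalg.qr(rng.normal(size=(d, d)))
X_g, Y_g = X @ G, Y @ G                        # for orthogonal G, (G^{-1})^T equals G

same_product_a = np.allclose(X_a @ Y_a.T, X @ Y.T)
constraint_a = np.allclose(Y_a.T @ Y_a, np.eye(d))
same_product_g = np.allclose(X_g @ Y_g.T, X @ Y.T)
constraint_g = np.allclose(Y_g.T @ Y_g, np.eye(d))
```

The constraint therefore removes the freedom of general regular matrices while leaving rotations and reflections of the shared coordinate system undetermined.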

3.1 Preliminary on information geometry

Information geometry is a framework for discussing the behavior of statistical inference and machine learning from a geometric perspective. A wide variety of methods and algorithms have been understood and extended by information geometry. For example, Akaho proposed a method for obtaining vectors by dimension reduction [21]. As another information geometric approach to object embedding, Riccard et al. proposed alpha embedding, which compares two probability distributions using the alpha-divergence [22]. In both studies, the information geometric approach is used to extend conventional methods for embedding and dimension reduction of a single dataset. As previously discussed, when multiple datasets are obtained, it is difficult to directly apply the object embedding approach while considering the relationship between the datasets, so the method must be extended to take this relationship into account. In this section, we describe the basic ideas necessary to explain the proposed method, especially the handling of statistical models and their estimation in information geometry [23].

Let ${\mathcal {Z}}$ be the sample space and Z be the random variable. Let ${\mathcal {S}}$ be the space of all probability distributions on ${\mathcal {Z}}$, where P denotes a probability distribution whose probability mass function is p(z),

$$\begin{aligned} {\mathcal {S}} = \left\{ P \vert p(z) \mathrm {\ is\ a\ probability\ mass\ function\ of\ } P \right\} . \end{aligned}$$

The set of probability distributions with parameter $\varvec{\theta }$ is defined as

$$\begin{aligned} {\mathcal {M}}(\varvec{\theta }) = \left\{ P_{\varvec{\theta }} \vert p(z; \varvec{\theta }) \right\} , \quad {\mathcal {M}}(\varvec{\theta }) \subset {\mathcal {S}}, \end{aligned}$$

which is called the model manifold. Let $z_1, \dots , z_N$ be the observed data sampled from a probability mass function p with appropriate parameters $\varvec{\theta }^*$. Using a delta function, the observed data can be regarded as a probability distribution called the empirical distribution,

$$\begin{aligned} p(z) = \frac{1}{N} \sum _{i=1}^N \delta (z - z_i). \end{aligned}$$
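For a finite sample space, the empirical distribution above reduces to normalized counts. The sketch below assumes that the observations are coded as integers 0, 1, ..., and the function name `empirical_pmf` is introduced here for illustration only.

```python
import numpy as np

def empirical_pmf(samples, num_states):
    """Empirical distribution of integer-coded observations z_1, ..., z_N."""
    counts = np.bincount(samples, minlength=num_states)  # occurrences of each state
    return counts / counts.sum()                          # divide by N

rng = np.random.default_rng(0)
p_true = np.array([0.5, 0.3, 0.2])                        # hypothetical generating pmf
samples = rng.choice(len(p_true), size=100_000, p=p_true)
p_hat = empirical_pmf(samples, num_states=len(p_true))
p_small = empirical_pmf(np.array([0, 0, 2]), num_states=3)
```

With three observations (0, 0, 2), the empirical distribution assigns mass 2/3, 0, and 1/3 to the three states.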

In information geometry, estimating the parameter $\varvec{\theta }$ refers to finding the probability distribution Q on the subspace ${\mathcal {M}}$ that is closest to the empirical distribution P with respect to the statistical distance D [23]. The statistical distance D is a pseudo-distance between two probability distributions, P and Q, and here we employ the KL divergence $D_\textrm{KL}(P \Vert Q)$ in Eq. (1).

We define the sets ${\mathcal {G}}_m$ and ${\mathcal {G}}_e$ of internally dividing points of two probability distributions $P^{(1)}$ and $P^{(2)}$ on the space ${\mathcal {S}}$ of probability distributions,

$$\begin{aligned} {\mathcal {G}}_m =&\left\{ P^\alpha _m \Big \vert p_m(z; \alpha ) = \alpha p^{(1)}(z) + (1-\alpha ) p^{(2)}(z) \right\} , \nonumber \\ {\mathcal {G}}_e\ =&\left\{ P^\alpha _e \Big \vert p_e(z; \alpha ) = \exp \left( \alpha \log \left( p^{(1)}(z) \right) + (1 - \alpha ) \log \left( p^{(2)}(z) \right) - \zeta (\alpha ) \right) \right\} . \end{aligned}$$

(10)

where $\alpha \in [0, 1]$. The subspace ${\mathcal {G}}_m$ in Eq. (10) is called the m-geodesic, and ${\mathcal {G}}_e$ in Eq. (10) is called the e-geodesic. Given K probability distributions, the subspace ${\mathcal {M}}_m$ is defined as the m-flat subspace spanned by their convex combinations, and the subspace ${\mathcal {M}}_e$ is defined as the e-flat subspace spanned by convex combinations of their logarithmic representations,

$$\begin{aligned} {\mathcal {M}}_m&= \left\{ P_m^{(\varvec{\alpha })} \Bigg \vert p_m(z; \varvec{\alpha }) = \sum _{k=1}^{K} \alpha ^{(k)} p^{(k)}(z) \right\} , \\ {\mathcal {M}}_e&= \left\{ P_e^{(\varvec{\alpha })} \Bigg \vert p_e(z; \varvec{\alpha }) = \exp \left( \sum _{k=1}^{K} \alpha ^{(k)} \log \left( p^{(k)}(z) \right) - \zeta (\varvec{\alpha }) \right) \right\} , \end{aligned}$$

where $\varvec{\alpha } = \left[ \alpha ^{(1)}, \dots , \alpha ^{(K)}\right] ^{{\textsf{T}}} \in {\mathbb {R}}^{K},\ \alpha ^{(k)} \ge 0$ and $\sum _{k=1}^K \alpha ^{(k)} = 1$. Here, p(z) is called the m-representation, $\log \left( p(z) \right) $ is called the e-representation, and the operations that construct $P_m$ and $P_e$ by convex combination in each representation are called the m-mixture and the e-mixture, respectively.
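The two mixtures can be compared with a short numerical sketch. It also checks the relation of Eq. (7): for log-bilinear distributions sharing the same Y, interpolating the object vectors gives the e-mixture of the two distributions. The function names and the random matrices are assumptions made for this example.

```python
import numpy as np

def m_mixture(p1, p2, alpha):
    """Convex combination of the pmfs themselves (m-representation)."""
    return alpha * p1 + (1.0 - alpha) * p2

def e_mixture(p1, p2, alpha):
    """Convex combination of the log pmfs, renormalized (e-representation)."""
    log_mix = alpha * np.log(p1) + (1.0 - alpha) * np.log(p2)
    log_mix -= np.log(np.sum(np.exp(log_mix)))   # subtract the normalization term
    return np.exp(log_mix)

def log_bilinear(X, Y):
    """Joint pmf q(o, u) proportional to exp(x_o^T y_u)."""
    S = X @ Y.T
    W = np.exp(S - S.max())                      # shift for numerical stability
    return W / W.sum()

rng = np.random.default_rng(0)
n, m, d = 4, 5, 2
X1, X2 = rng.normal(size=(n, d)), rng.normal(size=(n, d))
Y = rng.normal(size=(m, d))                      # shared feature vectors
alpha = 0.3

q1, q2 = log_bilinear(X1, Y), log_bilinear(X2, Y)
q_bar = log_bilinear(alpha * X1 + (1.0 - alpha) * X2, Y)  # interpolate the vectors
q_e = e_mixture(q1, q2, alpha)
q_m = m_mixture(q1, q2, alpha)
```

Here `q_bar` coincides with `q_e` but generally not with `q_m`, which is why a weighted sum of vectors corresponds to the e-mixture and not to the ordinary mixture of distributions.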

We now consider the KL divergence for the three probability distributions P, Q, and R,

$$\begin{aligned}&D_\textrm{KL} (P \Vert Q) - D_\textrm{KL}(P \Vert R) - D_\textrm{KL}(R \Vert Q) \nonumber \\&\quad = \sum _{z \in {\mathcal {Z}}}\left\{ p(z) \left( \log p(z) - \log q(z) \right) - p(z) \left( \log p(z) - \log r(z) \right) \right. \nonumber \\&\qquad \left. - r(z) \left( \log r(z) - \log q(z) \right) \right\} \nonumber \\&\quad = \sum _{z \in {\mathcal {Z}}} \left( p(z) - r(z)\right) \left( \log r(z) - \log q(z) \right) , \end{aligned}$$

(11)

where r is the probability mass function of R. When the right-hand side of Eq. (11) is zero, that is, when the m-geodesic from the probability distribution P to R and the e-geodesic from R to Q are orthogonal, the Pythagorean theorem holds,

$$\begin{aligned} D_\textrm{KL} (P \Vert Q) = D_\textrm{KL}(P \Vert R) + D_\textrm{KL}(R \Vert Q). \end{aligned}$$

(12)

This is useful for describing parameter estimation in the stochastic models.
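The identity in Eq. (11) holds for any three distributions with full support, which can be confirmed numerically. The sketch below draws three random pmfs; the helper `kl` and the random draws are assumptions introduced for illustration.

```python
import numpy as np

def kl(p, q):
    """KL divergence between two pmfs with full support, as in Eq. (1)."""
    return np.sum(p * (np.log(p) - np.log(q)))

rng = np.random.default_rng(0)
# Three arbitrary pmfs on a sample space with five elements.
p, q, r = (rng.dirichlet(np.ones(5)) for _ in range(3))

lhs = kl(p, q) - kl(p, r) - kl(r, q)                   # left-hand side of Eq. (11)
rhs = np.sum((p - r) * (np.log(r) - np.log(q)))        # right-hand side of Eq. (11)
```

For generic distributions the common value is nonzero; the Pythagorean relation of Eq. (12) corresponds to the special case in which it vanishes.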

We now introduce a set of probability distributions called an exponential family to describe the geometric properties of parameter estimation in statistical models. An exponential family is a set of probability distributions whose probability mass function can be written in the following form:

$$\begin{aligned} {\mathcal {P}} = \left\{ P_{\varvec{\theta }} \Big \vert p(\varvec{z}; \varvec{\theta }) = \exp \left( \varvec{\theta }^{\textsf{T}} \varvec{z} - \zeta (\varvec{\theta }) \right) \right\} \subset {\mathcal {S}}, \end{aligned}$$

where $\varvec{\theta }$ are parameters of P, and $\zeta (\varvec{\theta })$ is the normalization term as

$$\begin{aligned} \zeta (\varvec{\theta }) = \log \sum _{\varvec{z} \in {\mathcal {Z}}} \exp \left( \varvec{\theta }^{\textsf{T}} \varvec{z} \right) . \end{aligned}$$

An exponential family is e-flat [23]. For an e-flat subspace ${\mathcal {M}}_e$, consider probability distributions $P \notin {\mathcal {M}}_e$ and $Q, R \in {\mathcal {M}}_e$, and let these three points satisfy Eq. (12). If Eq. (12) holds for every $Q \in {\mathcal {M}}_e$, then $D_\textrm{KL}(P \Vert Q) \ge D_\textrm{KL}(P \Vert R)$, so the point in ${\mathcal {M}}_e$ closest to P is R, because the m-geodesic from P to R and the e-geodesic from R to Q are orthogonal by the above Pythagorean theorem. This process of estimating R from P is called the m-projection,

$$\begin{aligned} R = \mathop {\mathrm{arg~min}}\limits _{Q \in {\mathcal {M}}_e} D_{\textrm{KL}}(P \Vert Q). \end{aligned}$$

In the next section, we explain how the proposed method can be interpreted in terms of information geometry for the case where multiple empirical distributions P are observed.
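The Pythagorean relation for the m-projection can be illustrated on a toy exponential family. The sketch below is our own construction, not the paper's procedure: it relies on the standard property that the m-projection onto an exponential family matches the expected sufficient statistics, and builds P by perturbing a family member R in a direction that leaves the normalization and those expectations unchanged. The sample space, the parameter values, and all names are assumptions.

```python
import numpy as np

def kl(p, q):
    """KL divergence D_KL(p || q) for strictly positive probability vectors."""
    return float(np.sum(p * (np.log(p) - np.log(q))))

def expfam(theta, Z):
    """pmf p(z; theta) = exp(theta^T z - zeta(theta)) over the rows z of Z."""
    s = Z @ theta
    w = np.exp(s - s.max())              # shift for stability; cancels on normalisation
    return w / w.sum()

rng = np.random.default_rng(1)
Z = rng.normal(size=(6, 2))              # six events with 2-d sufficient statistics (toy)
r = expfam(np.array([0.3, -0.5]), Z)     # R, a member of the e-flat family
q = expfam(np.array([-1.0, 0.8]), Z)     # Q, another member of the family

# Perturb r along a direction orthogonal to the constant vector and to the
# columns of Z, so that p still sums to one and satisfies E_p[z] = E_r[z].
B = np.column_stack([np.ones(len(Z)), Z])
w = rng.normal(size=len(Z))
w -= B @ np.linalg.lstsq(B, w, rcond=None)[0]
p = r + 0.5 * r.min() * w / np.abs(w).max()   # stays strictly positive

gap = kl(p, q) - kl(p, r) - kl(r, q)     # right-hand side of Eq. (11)
print(gap)                               # vanishes up to rounding, i.e. Eq. (12) holds
```

Because the gap vanishes for any Q in the family, $D_\textrm{KL}(P \Vert Q) \ge D_\textrm{KL}(P \Vert R)$, which is the minimization property stated for the m-projection.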

3.2 Algorithms for information geometric interpretation

This section describes an information geometrical interpretation of the proposed model. We use the vec operator $\textrm{vec}: {\mathbb {R}}^{N \times M} \rightarrow {\mathbb {R}}^{NM}$ and the Kronecker product operator $\otimes : {\mathbb {R}}^{N \times M} \times {\mathbb {R}}^{U \times V}\rightarrow {\mathbb {R}}^{NU\times MV}$, defined as

$$\begin{aligned} \textrm{vec}(A)&= \left[ \begin{array}{c} \varvec{a}_1 \\ \varvec{a}_2 \\ \vdots \\ \varvec{a}_M \\ \end{array} \right] \in {\mathbb {R}}^{NM}, \ A = \left[ \begin{array}{cccc} \varvec{a}_1&\varvec{a}_2&\dots&\varvec{a}_M \end{array} \right] \in {\mathbb {R}}^{N \times M}, \ \varvec{a}_i \in {\mathbb {R}}^N, \\ A \otimes B&= \left[ \begin{array}{cccc} (A)_{11} B &{} (A)_{12} B &{} \cdots &{} (A)_{1M} B \\ (A)_{21} B &{} (A)_{22} B &{} \cdots &{} (A)_{2M} B \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ (A)_{N1} B &{} (A)_{N2} B &{} \cdots &{} (A)_{NM} B \end{array} \right] \in {\mathbb {R}}^{NU \times MV}, \end{aligned}$$

where B is a $U \times V$ matrix. The probability that an object o and a feature u are observed simultaneously can be generalized as

$$\begin{aligned} q(o, u; X, Y)&= \exp \left( \sum _{i, j}^{n,m} \delta _i^o \delta _j^u \varvec{x}_i^{\textsf{T}} \varvec{y}_j - \zeta (X, Y) \right) \nonumber \\&= \exp \left( {\varvec{\delta }^o}^{\textsf{T}} XY^{\textsf{T}} \varvec{\delta }^u - \zeta (X, Y) \right) . \end{aligned}$$

(13)

Equation (13) is equivalent to Eq. (5) given $o_i$ and $u_j$ as a concrete event,

$$\begin{aligned} q(o_i, u_j; X, Y) = \exp (\varvec{x}_i^{\textsf{T}} \varvec{y}_j - \zeta (X, Y)). \end{aligned}$$

Here, $\delta _i^o$, $\delta _j^u$, $\varvec{\delta }^o$, and $\varvec{\delta }^{u}$ are

$$\begin{aligned} \delta ^o_i&= {\left\{ \begin{array}{ll} 1 &{} (o = o_i) \\ 0 &{} (o\ne o_i) \end{array}\right. } ,\quad \delta ^u_j = {\left\{ \begin{array}{ll} 1 &{} (u = u_j) \\ 0 &{} (u\ne u_j) \end{array}\right. } \\ \varvec{\delta }^o&= \left[ \begin{array}{c} \delta _1^o \\ \delta _2^o \\ \vdots \\ \delta _n^o \end{array} \right] , \quad \varvec{\delta }^u = \left[ \begin{array}{c} \delta _1^u \\ \delta _2^u \\ \vdots \\ \delta _m^u \end{array} \right] . \end{aligned}$$
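The log-bilinear model of Eq. (13) can be sketched numerically as a softmax over all object-feature pairs. The code below is a minimal illustration under assumed toy sizes; the function name `log_bilinear_pmf` is ours and does not come from the paper.

```python
import numpy as np

def log_bilinear_pmf(X, Y):
    """Joint pmf q(o_i, u_j; X, Y) = exp(x_i^T y_j - zeta(X, Y)) as an n x m array."""
    S = X @ Y.T                      # scores x_i^T y_j
    W = np.exp(S - S.max())          # shift for numerical stability; cancels in the ratio
    return W / W.sum()               # dividing by the total implements exp(-zeta)

rng = np.random.default_rng(0)
n, m, d = 4, 3, 2                    # toy sizes, not taken from the paper
X, Y = rng.normal(size=(n, d)), rng.normal(size=(m, d))
Q = log_bilinear_pmf(X, Y)

# Evaluate Eq. (13) for the concrete event (o_2, u_3) through one-hot indicators.
delta_o, delta_u = np.eye(n)[1], np.eye(m)[2]
zeta = np.log(np.exp(X @ Y.T).sum())
q_23 = np.exp(delta_o @ X @ Y.T @ delta_u - zeta)
print(q_23, Q[1, 2])                 # both routes give the same probability
```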

Using the vec operator and the Kronecker product, we can transform Eq. (13) as

$$\begin{aligned} q(o,u; X, Y)&= \exp \left( \textrm{vec}\left( X^{\textsf{T}} \right) ^{\textsf{T}} \left( \varvec{\delta }^o {\varvec{\delta }^u}^{\textsf{T}} \otimes I_d \right) \textrm{vec}\left( Y^{\textsf{T}} \right) -\zeta (X, Y) \right) \nonumber \\&= \exp \left( \textrm{vec}\left( X^{\textsf{T}} \right) ^{\textsf{T}} \Lambda (o,u)\, \textrm{vec}\left( Y^{\textsf{T}} \right) -\zeta (X, Y) \right) , \end{aligned}$$

(14)

where $\Lambda $ is an $nd \times md$ matrix and $\Delta $ is an $n \times m $ matrix defined as

$$\begin{aligned} \Lambda (o,u)&= \underbrace{\Delta }_{n \times m} \otimes \underbrace{I_d}_{d \times d} = \underbrace{ \left[ \begin{array}{ccc} \left[ \begin{array}{ccc} (\Delta )_{11}&{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} (\Delta )_{11} \end{array} \right] &{} \cdots &{} \left[ \begin{array}{ccc} (\Delta )_{1m}&{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} (\Delta )_{1m} \end{array} \right] \\ \vdots &{} \ddots &{} \vdots \\ \left[ \begin{array}{ccc} (\Delta )_{n1}&{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} (\Delta )_{n1} \end{array} \right] &{} \cdots &{} \left[ \begin{array}{ccc} (\Delta )_{nm}&{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} (\Delta )_{nm} \end{array} \right] \end{array} \right] , }_{nd \times md} \\ (\Delta )_{ij}&= \delta _i^o \delta _j^u. \end{aligned}$$

We see that the probability mass function q expressed in Eq. (14) belongs to an exponential family with respect to $\varvec{\theta }$ when $\Lambda \varvec{\xi }$ is fixed, and likewise to an exponential family with respect to $\varvec{\xi }$ when $\varvec{\theta }^{\textsf{T}} \Lambda $ is fixed. Here, we write $\varvec{v} = \Lambda (o,u) \varvec{\xi }$ and $\varvec{v}' = \left( \varvec{\theta }^{{\textsf{T}}} \Lambda (o,u)\right) ^{\textsf{T}}$, respectively,

$$\begin{aligned} q(o,u; X, Y)&= \exp \left( \varvec{\theta }^{{\textsf{T}}} \Lambda (o,u) \varvec{\xi } -\zeta (X,Y) \right) \\&= \exp \left( \varvec{\theta }^{\textsf{T}} \varvec{v}(o,u,\varvec{\xi }) - \zeta (X,Y) \right) \\&= \exp \left( \varvec{\xi }^{\textsf{T}} \varvec{v}'(o,u,\varvec{\theta }) - \zeta (X,Y) \right) , \end{aligned}$$

where $\varvec{\theta }$ and $\varvec{\xi }$ are

$$\begin{aligned} \varvec{\theta }&= \textrm{vec}(X^{\textsf{T}}) = \left[ \begin{array}{c} \varvec{x}_1 \\ \varvec{x}_2 \\ \vdots \\ \varvec{x}_n \end{array} \right] \in {\mathbb {R}}^{nd} , \quad \varvec{\xi } = \textrm{vec}(Y^{\textsf{T}}) = \left[ \begin{array}{c} \varvec{y}_1 \\ \varvec{y}_2 \\ \vdots \\ \varvec{y}_m \end{array} \right] \in {\mathbb {R}}^{md}. \end{aligned}$$
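The vec and Kronecker rewriting in Eq. (14) can be checked directly with NumPy. This is a small sketch with assumed toy sizes; `np.kron` is NumPy's Kronecker product, and flattening X row by row yields $\textrm{vec}(X^{\textsf{T}})$ because the rows of X are the vectors $\varvec{x}_i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 4, 3, 2                                # toy sizes (assumptions)
X, Y = rng.normal(size=(n, d)), rng.normal(size=(m, d))

theta = X.reshape(-1)                            # vec(X^T): x_1, ..., x_n stacked
xi = Y.reshape(-1)                               # vec(Y^T): y_1, ..., y_m stacked

i, j = 1, 2                                      # concrete event (o_2, u_3), 0-based indices
Delta = np.outer(np.eye(n)[i], np.eye(m)[j])     # n x m indicator matrix delta^o delta^u^T
Lam = np.kron(Delta, np.eye(d))                  # Lambda(o, u), an nd x md matrix

bilinear = theta @ Lam @ xi                      # theta^T Lambda(o, u) xi
v = Lam @ xi                                     # sufficient statistic when xi is fixed
print(bilinear, X[i] @ Y[j])                     # both equal x_i^T y_j
```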

We propose an optimization process of Eq. (9) on the basis of these results. The proposed algorithm estimates vectors from multiple relevance levels by a combination of the m-projection to the model manifold and the e-mixture on the model manifold.

Each of the probability distributions $P^{(1)}, P^{(2)}, \dots , P^{(K)}$ computed from the multiple relevance levels is represented as a single point in the space of probability distributions ${\mathcal {S}}$ as an empirical distribution. Similarly, the model distribution is represented as a single point in ${\mathcal {S}}$ once the parameters X and Y are given. In other words, this problem is equivalent to finding an appropriate mapping from each empirical distribution $P^{(k)}$ to the model manifold ${\mathcal {M}}(X, Y)$,

$$\begin{aligned} {\mathcal {M}}(X, Y) = \left\{ Q(X, Y) \Big \vert q(o, u; X, Y) = \exp \left( \varvec{\theta }^{{\textsf{T}}} \Lambda (o,u) \varvec{\xi } -\zeta (X, Y) \right) \right\} . \end{aligned}$$

Let ${\mathcal {M}}_{Y_\dagger }(X)$ be a subspace of ${\mathcal {M}}(X, Y)$ where Y is fixed to $Y_\dagger $, and X is freely chosen,

$$\begin{aligned} {\mathcal {M}}_{Y_\dagger }(X) = \left\{ Q(X, Y_\dagger ) \Big \vert q(o, u; X, Y_\dagger ) = \exp \left( \varvec{\theta }^{{\textsf{T}}} \Lambda (o,u)\, \varvec{\xi }_\dagger -\zeta (X, Y_\dagger ) \right) \right\} , \end{aligned}$$

where $\varvec{\xi }_\dagger = \textrm{vec}(Y_\dagger ^{\textsf{T}})$. The subspace ${\mathcal {M}}_{Y_\dagger }(X)$ is e-flat with respect to the parameter X because it is an exponential family. This means that the distribution in ${\mathcal {M}}_{Y_\dagger }(X)$ closest to the empirical distribution P is given by the m-projection.

Similarly, let ${\mathcal {M}}_{X_\dagger }(Y)$ be a subspace of ${\mathcal {M}}(X, Y)$ where X is fixed to $X_\dagger $, and Y is freely chosen,

$$\begin{aligned} {\mathcal {M}}_{{X}_\dagger }(Y) = \left\{ Q(X_\dagger , Y) \Big \vert q(o, u; X_\dagger , Y) = \exp \left( \varvec{\theta }_\dagger ^{\textsf{T}} \Lambda (o,u) \varvec{\xi } -\zeta (X_\dagger ,Y) \right) \right\} , \end{aligned}$$

where $\varvec{\theta }_\dagger = \textrm{vec}(X_\dagger ^{\textsf{T}})$.

The convex combination of the matrices on the model manifold corresponds to an e-mixture of probability distributions, as shown in Eq. (7). Therefore, the model manifold ${\mathcal {M}}_{{\bar{Y}}}(X)$ computed from ${\bar{Y}}$ is expected to preserve the properties of $\left\{ Y^{(k)}\right\} $,

$$\begin{aligned} {\bar{Y}} = \sum _{k=1}^K \alpha _k Y^{(k)}, \quad \sum _{k=1}^K \alpha _k = 1. \end{aligned}$$

(15)
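The correspondence between the convex combination in Eq. (15) and an e-mixture can be illustrated numerically. The sketch below assumes the usual definition of an e-mixture as a normalized weighted geometric mean (a log-linear interpolation) of the component distributions, with the object matrix X held fixed; sizes, weights, and names are our assumptions.

```python
import numpy as np

def log_bilinear_pmf(X, Y):
    """Joint pmf with entries exp(x_i^T y_j - zeta(X, Y))."""
    S = X @ Y.T
    W = np.exp(S - S.max())
    return W / W.sum()

rng = np.random.default_rng(0)
n, m, d, K = 5, 4, 2, 3                          # toy sizes (assumptions)
X = rng.normal(size=(n, d))                      # shared object matrix, held fixed
Ys = rng.normal(size=(K, m, d))                  # feature matrices Y^(1), ..., Y^(K)
alpha = np.array([0.5, 0.3, 0.2])                # convex weights summing to one

Y_bar = np.tensordot(alpha, Ys, axes=1)          # Eq. (15): sum_k alpha_k Y^(k)
q_bar = log_bilinear_pmf(X, Y_bar)

# e-mixture: weighted average of the log-probabilities, then renormalisation.
log_mix = sum(a * np.log(log_bilinear_pmf(X, Yk)) for a, Yk in zip(alpha, Ys))
e_mix = np.exp(log_mix) / np.exp(log_mix).sum()
print(np.abs(q_bar - e_mix).max())               # close to zero
```

The agreement follows because $\varvec{x}_i^{\textsf{T}} \bar{\varvec{y}}_j = \sum _k \alpha _k \varvec{x}_i^{\textsf{T}} \varvec{y}_j^{(k)}$, so the log-probabilities combine linearly up to the normalization term.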

The feature matrix Y is constrained by Eq. (8) to limit the degrees of freedom. The set of matrices satisfying this constraint is called a Stiefel manifold, which is defined as

$$\begin{aligned} \textrm{St}(n,d) = \left\{ A \in {\mathbb {R}}^{n \times d} \Big \vert A^{{\textsf{T}}} A = I_d \right\} , \quad n \ge d. \end{aligned}$$

The mixture matrix ${\bar{Y}}$ given in Eq. (15) is not necessarily a matrix on this manifold. Therefore, we introduce a retraction function $\pi : {\bar{Y}} \mapsto Y \in \textrm{St}$.

There are various types of retraction for matrix A, and we use the method based on QR decomposition [24],

$$\begin{aligned} \pi (A) = D , \quad A = DU, \end{aligned}$$

where D is a matrix with orthonormal columns (an orthogonal matrix when A is square), and U is an upper triangular matrix. The proposed algorithm estimates the parameters separately for each dataset and then integrates them, which is computationally less expensive than estimating all parameters simultaneously. However, efficient estimation when the number of objects in a dataset is large remains a future challenge. In particular, the retraction may become a computational bottleneck depending on the method, so it should be chosen carefully. The QR decomposition used here costs ${\mathcal {O}}(n^3)$, so other options should be considered when the data size is large.
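A QR-based retraction of this kind can be sketched with `numpy.linalg.qr`. The sign normalization below is our own assumption (it makes the factorization unique and leaves points already on the manifold unchanged); the paper does not specify this detail, and the matrix sizes are toy values.

```python
import numpy as np

def qr_retraction(A):
    """Map a full-column-rank n x d matrix (n >= d) to St(n, d) via the thin QR factorisation."""
    D, U = np.linalg.qr(A, mode="reduced")       # A = D U with D^T D = I_d
    # Assumption: flip column signs so that diag(U) > 0, making D unique.
    s = np.sign(np.diag(U))
    s[s == 0] = 1.0
    return D * s

rng = np.random.default_rng(0)
# Three points on St(6, 2), standing in for estimated feature matrices Y^(k).
Ys = np.stack([qr_retraction(rng.normal(size=(6, 2))) for _ in range(3)])
Y_bar = Ys.mean(axis=0)                          # convex combination, generally off the manifold
Y_new = qr_retraction(Y_bar)                     # retracted back onto St(6, 2)

print(np.round(Y_bar.T @ Y_bar, 3))              # not the identity
print(np.round(Y_new.T @ Y_new, 3))              # identity up to rounding
```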

We revisit the problem formulation from an information geometric perspective. The task involves projecting multiple empirical distributions onto a large model manifold while preserving the relationships between these empirical distributions.

Intuitively, the proposed algorithm constructs a large model manifold parameterized by X and Y. By applying constraints that fix a subset of the parameters, we project the empirical distributions onto a subspace within the large model manifold. By iterating this process, we expect to estimate the parameters accurately while retaining the characteristics of each empirical distribution.

The concept of the proposed algorithm is shown in Fig. 1, and Algorithm 1 is the proposed algorithm for estimating $\{X^{(k)}\}$ and Y. The figure describes the algorithm as follows. First, it is assumed that $\{P^{(k)}\}_{k=1}^{3}$ have been obtained from the multiple relevance levels $\{R^{(k)}\}_{k=1}^3$, respectively, and that the parameters at iteration t are given as $\{X^{(k)}_t\}$ and $Y_t$. In Fig. 1a, the matrices $\{X^{(k)}_{t+1}\}$ are estimated by the m-projection from $\{P^{(k)}\}$ onto the subspace ${\mathcal {M}}_{Y_t}(X)$ of the model manifold with $Y_t$ fixed. In Fig. 1b, $\{Y^{(k)}_{t+1}\}$ are estimated by the m-projection from $\{P^{(k)}\}$ onto the model manifold with $\{X^{(k)}_{t+1}\}$ fixed. Figure 1c shows that a convex combination of $\{Y^{(k)}_{t+1}\}$ gives the subspace ${\mathcal {M}}_{{\bar{Y}}_{t+1}}(X)$. In Fig. 1d, $Y_{t+1}$ is derived by the retraction of ${\bar{Y}}_{t+1}$ onto the Stiefel manifold to satisfy the constraint of Eq. (8).

The matrices $\{X^{(k)}\}$ and Y estimated by this process maintain the neighborhood relations of $\{P^{(k)}\}$. The shared Y also allows the object matrices to be compared with one another.
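The four steps of Fig. 1 can be put together in a rough sketch. This is not the paper's Algorithm 1: it assumes plain gradient descent for each m-projection (using the gradients of Appendix A in the compact form $(Q - P)Y$ and $(Q - P)^{\textsf{T}} X$), uniform weights $\alpha _k$ by default, and arbitrary step sizes and iteration counts. All function names and hyperparameters are hypothetical.

```python
import numpy as np

def model_pmf(X, Y):
    """Model distribution Q(X, Y) with entries exp(x_i^T y_j - zeta(X, Y))."""
    S = X @ Y.T
    W = np.exp(S - S.max())
    return W / W.sum()

def kl(P, Q):
    return float(np.sum(P * (np.log(P) - np.log(Q))))

def m_project_X(P, X, Y, lr=0.5, steps=50):
    """Approximate m-projection onto M_Y(X): gradient descent over X with Y fixed."""
    for _ in range(steps):
        X = X - lr * (model_pmf(X, Y) - P) @ Y
    return X

def m_project_Y(P, X, Y, lr=0.2, steps=50):
    """Approximate m-projection onto M_X(Y): gradient descent over Y with X fixed."""
    for _ in range(steps):
        Y = Y - lr * (model_pmf(X, Y) - P).T @ X
    return Y

def qr_retraction(A):
    """Retraction onto the Stiefel manifold via thin QR (sign fix is an assumption)."""
    D, U = np.linalg.qr(A, mode="reduced")
    s = np.sign(np.diag(U))
    s[s == 0] = 1.0
    return D * s

def fit(Ps, d, alpha=None, iters=20, seed=0):
    """Alternate the four steps of Fig. 1 for empirical distributions Ps (each n x m)."""
    rng = np.random.default_rng(seed)
    K, (n, m) = len(Ps), Ps[0].shape
    alpha = np.full(K, 1.0 / K) if alpha is None else np.asarray(alpha)
    Xs = [0.1 * rng.normal(size=(n, d)) for _ in range(K)]
    Y = qr_retraction(rng.normal(size=(m, d)))
    for _ in range(iters):
        Xs = [m_project_X(P, X, Y) for P, X in zip(Ps, Xs)]      # Fig. 1a
        Yks = [m_project_Y(P, X, Y) for P, X in zip(Ps, Xs)]     # Fig. 1b
        Y_bar = sum(a * Yk for a, Yk in zip(alpha, Yks))         # Fig. 1c, Eq. (15)
        Y = qr_retraction(Y_bar)                                 # Fig. 1d
    return Xs, Y

# Toy usage: three empirical distributions over 6 objects and 4 features.
rng = np.random.default_rng(1)
Ps = [model_pmf(0.5 * rng.normal(size=(6, 2)), 0.5 * rng.normal(size=(4, 2)))
      for _ in range(3)]
Xs, Y = fit(Ps, d=2, iters=5)
```

Each object matrix is updated using only its own empirical distribution, so the per-level problems stay small; the shared Y is the only quantity that couples them.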

When multiple relevance levels are given, an alternative to introducing two coordinates as in the proposed method is to estimate the vectors collectively, for instance, by vertically concatenating the relevance levels and treating them as a single dataset. In this case, $\{X^{(k)}\}$ can be estimated directly, but the computational cost becomes substantial as the number of steps and data points increases. The proposed method, which solves a smaller problem for each relevance level, therefore offers an advantage. The weighted sum of the coordinates estimated by the proposed method also represents a mixture of probability distributions, which allows the weights to be adjusted to suit subsequent tasks and visualizations after the vectors are estimated.

3.3 Appearance and disappearance of data

The proposed method can naturally estimate the vectors of time-series data, even when new objects appear over time. Let $P^{(1)},P^{(2)}$, and $P^{(3)}$ be the empirical distributions at times $k=1,2,$ and 3, respectively, and let the sets of objects at these times be denoted by ${\mathcal {O}}', {\mathcal {O}}''$, and ${\mathcal {O}}'''$, with $|{\mathcal {O}}' |< |{\mathcal {O}}'' |< |{\mathcal {O}}''' |$. By assuming that the number of features does not change with time and by fixing Y, the proposed method works equally well for time-series data regardless of the number of objects. However, in this case, the information geometrical interpretation changes slightly, as shown in Fig. 2. Since the number of objects changes from time $k=1$ to time $k=2$, the target model manifolds ${\mathcal {M}}'$ and ${\mathcal {M}}''$ are different. At time $k=2$, the model manifold is expanded owing to the expanded domain of the parameter X,

$$\begin{aligned} {\mathcal {M}}'= & {} \left\{ Q(X, Y) \Big \vert q(o, u; X, Y) = \exp \left( \varvec{\theta }^{{\textsf{T}}} \Lambda \varvec{\xi } -\zeta \right) , \, X \in {\mathbb {R}}^{|{\mathcal {O}}' |\times d}, \, o \in {\mathcal {O}}' \right\} , \\ {\mathcal {M}}''= & {} \left\{ Q(X, Y) \Big \vert q(o, u; X, Y) = \exp \left( \varvec{\theta }^{{\textsf{T}}} \Lambda \varvec{\xi } -\zeta \right) , \, X \in {\mathbb {R}}^{|{\mathcal {O}}'' |\times d}, \, o \in {\mathcal {O}}'' \right\} . \end{aligned}$$

The estimated models are the subspaces ${\mathcal {M}}'_{Y_\textrm{fix}}(X) \subset {\mathcal {M}}'(X, Y)$ and ${\mathcal {M}}''_{Y_\textrm{fix}}(X) \subset {\mathcal {M}}''(X, Y)$, respectively, with the additional constraint that $Y=Y_{\textrm{fix}}$ at each time. Therefore, the proposed method can be applied in the same way as Algorithm 1 to estimate the object matrices $X^{(k)}$. For $k=3$, the mapping is processed in the same way.

4 Experiments

Generally, when embedding real data, only the relevance is given, not the actual vectors that would constitute the correct answer. We therefore evaluate the effectiveness of the proposed method using artificial data with the true matrices $X^{(k)}_*$ and $Y_*$, and also show as a case study that the proposed method works and behaves intuitively on real data. We compare the proposed method with three other methods: t–SNE [11] applied independently at each time (ind t–SNE), t–SNE applied to all data at once (concat t–SNE), and Dynamic t–SNE [15], which is a probabilistic embedding method that considers the time series. Spearman’s rank correlation and MAP@H are employed as the evaluation criteria. We introduce some symbols used for MAP@H. Let ${\mathcal {N}}(o_i, X^{(k)}, X^{(k')}, h)$ be the set of the h nearest neighbors of object $o_i$, where $o_i$ is located using the object matrix $X^{(k)}$ and its neighbors using $X^{(k')}$,

$$\begin{aligned}&{\mathcal {N}}(o_i, X^{(k)}, X^{(k')}, h) \\&\quad = \left\{ o_j \Big \vert \Vert \varvec{x}_i^{(k)} - \varvec{x}_j^{(k')}\Vert ^2 \le \Vert \varvec{x}_i^{(k)} - \varvec{x}_l^{(k')}\Vert ^2,\ o_j \in {\mathcal {O}},\ o_l = o(X^{(k')}, h) \right\} . \end{aligned}$$

Here, $o(X^{(k)}, h)$ is the $h$th nearest object to o using the object matrix $X^{(k)}$. MAP@H evaluates the match between the neighborhoods of the true vectors and the estimated vectors as

$$\begin{aligned} \textrm{AP}(o, k, k', H)&=\frac{1}{H} \sum _{h=1}^H \delta (o, k, k',h)\, \frac{|{\mathcal {N}}(o, X^{(k)}, X^{(k')}, h) \cap {\mathcal {N}}(o, X_*^{(k)}, X_*^{(k')},h) |}{h}, \\ \delta (o, k, k', h)&= \left\{ \begin{array}{cl} 1 &{} \left( o(X^{(k)}, h) \in {\mathcal {N}}(o, X_*^{(k)}, X_*^{(k')}, h) \right) \\ 0 &{} (\textrm{otherwise})\\ \end{array} \right. , \\ \textrm{MAP}@H&= \frac{1}{K} \frac{1}{|{\mathcal {O}} |} \sum _{o, k} \textrm{AP}(o, k, k', H). \end{aligned}$$

We also define a criterion to evaluate whether information is preserved across the time series. Normally, MAP@H is calculated within the $k$th relevance matrix ($k'=k$); here, we additionally calculate MAP@H between the $k$th and the $(k-1)$th relevance matrices to verify whether the information between relevance levels is retained, and we call this $\Delta $ MAP@H in this paper.
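The MAP@H and $\Delta $ MAP@H criteria can be sketched as follows. Several details are our assumptions rather than the paper's specification: ties in distance are ignored, so the neighborhood is taken to be the top-h list; the object itself is excluded from its own neighbor list; and for $k'=k-1$ the average runs over the levels for which $k'$ exists. The function names are hypothetical, and H must not exceed the number of objects minus one.

```python
import numpy as np

def ranked_neighbors(A, B, i):
    """Indices of rows of B sorted by squared distance to row i of A (object i excluded)."""
    dist = np.sum((B - A[i]) ** 2, axis=1)
    order = np.argsort(dist, kind="stable")
    return [int(j) for j in order if j != i]

def average_precision(A, B, A_true, B_true, i, H):
    """AP(o_i, k, k', H) with A = X^(k), B = X^(k') and their true counterparts."""
    est = ranked_neighbors(A, B, i)
    true = ranked_neighbors(A_true, B_true, i)
    total = 0.0
    for h in range(1, H + 1):
        true_h = set(true[:h])
        hit = 1.0 if est[h - 1] in true_h else 0.0           # delta(o, k, k', h)
        total += hit * len(set(est[:h]) & true_h) / h
    return total / H

def map_at_h(X_est, X_true, H, lag=0):
    """lag=0 gives MAP@H (k' = k); lag=1 gives the Delta MAP@H variant (k' = k - 1)."""
    K, n = len(X_est), X_est[0].shape[0]
    aps = [average_precision(X_est[k], X_est[k - lag], X_true[k], X_true[k - lag], i, H)
           for k in range(lag, K) for i in range(n)]
    return float(np.mean(aps))

rng = np.random.default_rng(0)
X_true = [rng.normal(size=(10, 2)) for _ in range(3)]        # toy true vectors
X_scaled = [2.0 * X for X in X_true]                         # same neighborhoods
X_random = [rng.normal(size=(10, 2)) for _ in range(3)]      # unrelated vectors
print(map_at_h(X_scaled, X_true, H=5), map_at_h(X_random, X_true, H=5))
```

A uniformly scaled copy of the true vectors keeps every neighbor ranking, so it scores 1, whereas unrelated vectors score lower.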

4.1 Mapping using artificial time-series relevance levels

We evaluate the mapping of time-series data from multiple relevance levels. Each object belongs to one of the clusters and moves toward the center of its cluster over time. We map the vectors to this dataset and evaluate how well the characteristics hold. Let K be the time length and c(i) be the cluster to which each object $o_i$ belongs. The speed at which the object moves toward the center is defined for each cluster and is written as v(c(i)). Artificial time-series data are generated using Algorithm 2.
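Since the generation procedure itself is not reproduced here, the following is only a hypothetical sketch of data with the described behavior, not the paper's Algorithm 2. It assumes that each object moves a fraction v(c(i)) of the remaining distance to its cluster center per time step, and that each relevance matrix is formed as $R^{(k)} = X_*^{(k)} Y_*^{\textsf{T}}$ in line with the symbols of Appendix A. The cluster centers, speeds, and sizes are invented for illustration.

```python
import numpy as np

def generate_series(n, d, K, centers, speeds, seed=0):
    """Hypothetical generator: objects drift toward their cluster centre at rate v(c(i))."""
    rng = np.random.default_rng(seed)
    c = rng.integers(len(centers), size=n)            # cluster assignment c(i)
    X = centers[c] + rng.normal(size=(n, d))          # initial positions around the centres
    series = [X]
    for _ in range(K - 1):
        X = X + speeds[c][:, None] * (centers[c] - X) # move a fraction v of the way
        series.append(X)
    return c, series

centers = np.array([[3.0, 0.0], [-3.0, 0.0]])         # two cluster centres (assumed)
speeds = np.array([0.1, 0.3])                         # v for each cluster (assumed)
c, series = generate_series(n=20, d=2, K=8, centers=centers, speeds=speeds)

# Assumed construction of the relevance levels from true object and feature matrices.
rng = np.random.default_rng(1)
Y_true, _ = np.linalg.qr(rng.normal(size=(10, 2)))    # m = 10 features, orthonormal columns
R_list = [X @ Y_true.T for X in series]               # one n x m relevance matrix per time
```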

The quantitative evaluation results are shown in Table 1. Experimental settings not listed in Table 1 include $K=8$ and $m=10$, with $H=50$ for evaluation. The results of the proposed method show that for $d \ge d_*$, the rank correlation and MAP@H are high, indicating that the original relationship is reconstructed. When the true data dimension exceeds the estimated dimension, i.e., $d_* > d$, the values of each evaluation index are low. The evaluation metric $\Delta $ MAP@H represents the relationships among the estimated vectors across relevance matrices. When t–SNE is applied, we observe a significant decline in this value, suggesting a compromised representation of these relationships. In contrast, with the proposed method this value does not decrease, which suggests that the method maintains the relationships among the estimated vectors across relevance matrices. However, the proposed method is less accurate than the other methods when the dimension of the true data is large. The other methods yield a higher MAP@H and a lower rank correlation than the proposed method, because they express the neighborhood relationship of each object rather than the overall relationship.

Table 1 Mapping results using multiple relevance levels


The results show that even time-series data can be used to estimate vectors that reproduce each relevance level. The mapping results for time-series data generated with true dimension $d_*=2$ and time length $K=8$ are shown in Fig. 3.

We have shown that the proposed method preserves the relationship between times for the true vectors, while reproducing each relevance level.

4.2 Case study: DBLP dataset

In this case study, we will use the DBLP dataset [25] for real data. This dataset consists of meta and citation information on 4,107,340 papers.

This section presents the results of a case study on providing vectors using this dataset. The data for providing the vectors are created by aggregating this dataset using the fields in Table 2. Let ${\mathcal {O}}$ be the set of terms in the titles and ${\mathcal {U}}$ be the set of venues in which the papers were published. The preprocessing used to create the dataset $\{R^{(k)}\}_{k=1}^K$, with parameters $n=300, m=30, K=12$, a start year of 2006, and a last year of 2017, is shown in Fig. 4.

Table 2 Data schema of DBLP dataset (excerpt)


The mapping results for $d=30$ are shown in Fig. 5. Here, we selected the most commonly used words for each venue at each time point in 2006 and 2017, applied PCA on them, and then visualized the results in two dimensions.

This figure shows that “deep” has significantly changed its position near “neural” and “network” from 2010 to 2017, while the positions of other words have not changed much. The reason for the movement of these words is assumed to be that “deep learning” has been discussed in various papers in recent years. This method can capture the trend changes given multiple relevance levels.

We showed that it is possible to simultaneously acquire high-dimensional vectors that retain relevance levels and enable the visualization of specific objects of interest in low dimensions by using PCA after the vectors are acquired in this experiment. Furthermore, it also enables the comparison of the mapping as time-series data by fixing the feature matrix Y for the features at all times.
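The PCA step used for visualization can be sketched with a singular value decomposition. The code below is an illustration only: the random arrays stand in for the estimated 30-dimensional word vectors at two time points, and fitting a single PCA on the stacked vectors (so that both time points share the same two axes) is our assumption about how the comparison is set up. The names `pca_2d`, `V_2006`, and `V_2017` are hypothetical.

```python
import numpy as np

def pca_2d(V):
    """Project the rows of V onto their first two principal components."""
    Vc = V - V.mean(axis=0)                           # centre the data
    _, _, Wt = np.linalg.svd(Vc, full_matrices=False) # rows of Wt are principal directions
    return Vc @ Wt[:2].T

rng = np.random.default_rng(0)
# Random stand-ins for selected word vectors (d = 30) at two time points.
V_2006, V_2017 = rng.normal(size=(20, 30)), rng.normal(size=(20, 30))

coords = pca_2d(np.vstack([V_2006, V_2017]))          # one shared 2-d coordinate system
coords_2006, coords_2017 = coords[:20], coords[20:]   # split back for plotting
```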

5 Conclusion

In this paper, we proposed a method that provides vectors to objects on the basis of multiple relevance levels defined between the objects and features using a stochastic embedding. It is difficult to interpret the results of the existing SNE and t-SNE, which are used for stochastic embedding, when they are applied independently, because they are not designed to acquire vectors for multiple relevance levels (Fig. 3).

We solved these problems using a log-bilinear model that combines the vectors of the objects and features. In this paper, we showed that the estimation of the object vectors using the common feature vectors is equivalent to mapping to the e-flat subspaces on the model manifold. In this method, the manipulation of the vectors corresponds to the manipulation of the probability distributions, since the convex combination of vectors is the e-mixture of the probability distributions.

We evaluated the effectiveness of the proposed method using artificially generated data and the DBLP dataset as real-world time-series data. The results show that the proposed method can provide appropriate vectors to the objects while preserving the trend changes across the multiple relevance levels for both the artificial data and the real-world data. This method enables the visualization of relationships that change over time, such as those in texts and videos. In the future, defining versatile classification tasks that use multiple datasets will facilitate the establishment of a baseline for evaluating embedding methods across multiple datasets. When the goal is visualization itself, it is also expected that quantitative evaluations of the estimated vectors will be conducted through means such as questionnaires and interviews.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

Change history

24 August 2023
A Correction to this paper has been published: https://doi.org/10.1007/s41884-023-00118-9

References

Hinton, G., Roweis, S.T.: Stochastic neighbor embedding. In: NIPS, vol. 15, pp. 833–840 (2002). Citeseer
Torgerson, W.S.: Multidimensional scaling: I. theory and method. Psychometrika 17(4), 401–419 (1952)
Kruskal, J.: Multidimensional scaling optimizing goodness of fit to a non-metric hypotheses. Psychometrika 29, 1–28 (1964)
Cayton, L., Dasgupta, S.: Robust euclidean embedding. In: Proceedings of the 23rd International Conference on Machine Learning. ICML ’06, pp. 169–176. Association for Computing Machinery, New York, NY, USA (2006)
Balasubramanian, M., Schwartz, E.L., Tenenbaum, J.B., de Silva, V., Langford, J.C.: The isomap algorithm and topological stability. Science 295(5552), 7 (2002)
Weinberger, K.Q., Sha, F., Saul, L.K.: Learning a kernel matrix for nonlinear dimensionality reduction. In: Proceedings of the Twenty-first International Conference on Machine Learning, p. 106 (2004)
Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. Science 290(5500), 2323–2326 (2000)
Belkin, M., Niyogi, P.: Laplacian eigenmaps and spectral techniques for embedding and clustering. Adv. Neural Inf. Process. Syst. 14 (2001)
Landauer, T.K., Dumais, S.T.: A solution to plato’s problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychol. Rev. 104(2), 211 (1997)
Bullinaria, J.A., Levy, J.P.: Extracting semantic representations from word co-occurrence statistics: A computational study. Behav. Res. Methods 39, 510–526 (2007)
Maaten, L.V.d., Hinton, G.: Visualizing data using t-sne. J. Mach. Learn. Res. 9(Nov), 2579–2605 (2008)
McInnes, L., Healy, J., Melville, J.: Umap: Uniform manifold approximation and projection for dimension reduction. arXiv preprint arXiv:1802.03426 (2018)
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Proceedings of the 26th International Conference on Neural Information Processing Systems - Volume 2. NIPS’13, pp. 3111–3119. Curran Associates Inc., Red Hook, NY, USA (2013)
Pennington, J., Socher, R., Manning, C.D.: Glove: Global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543 (2014)
Rauber, P.E., Falcão, A.X., Telea, A.C.: Visualizing time-dependent data using dynamic t-sne. In: EuroVis (Short Papers), pp. 73–77 (2016)
Bamler, R., Mandt, S.: Dynamic word embeddings. In: Precup, D., Teh, Y.W. (eds.) Proceedings of the 34th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 70, pp. 380–389. PMLR, Sydney, Australia (2017)
Yao, Z., Sun, Y., Ding, W., Rao, N., Xiong, H.: Dynamic word embeddings for evolving semantic discovery. In: Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. WSDM ’18, pp. 673–681. Association for Computing Machinery, New York, NY, USA (2018)
Rudolph, M., Blei, D.: Dynamic embeddings for language evolution. In: Proceedings of the 2018 World Wide Web Conference. WWW ’18, pp. 1003–1011. International World Wide Web Conferences Steering Committee, Republic and Canton of Geneva, CHE (2018)
Bunte, K., Biehl, M., Hammer, B.: A general framework for dimensionality-reducing data visualization mapping. Neural Comput. 24(3), 771–804 (2012)
Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. Adv. Neural Inf. Process. Syst. 30 (2017)
Akaho, S.: The e-PCA and m-PCA: dimension reduction of parameters by information geometry. In: 2004 IEEE International Joint Conference on Neural Networks (IEEE Cat. No.04CH37541), vol. 1, pp. 129–134. IEEE, Budapest, Hungary (2005)
Volpi, R., Malagò, L.: Evaluating natural alpha embeddings on intrinsic and extrinsic tasks. In: Proceedings of the 5th Workshop on Representation Learning for NLP, pp. 61–71 (2020)
Amari, S.-i., Nagaoka, H.: Methods of Information Geometry vol. 191, pp. 36–91. American Mathematical Society, USA (2000)
Edelman, A., Arias, T.A., Smith, S.T.: The geometry of algorithms with orthogonality constraints. SIAM J. Matrix Anal. Appl. 20(2), 303–353 (1998)
Tang, J., Zhang, J., Yao, L., Li, J., Zhang, L., Su, Z.: Arnetminer: extraction and mining of academic social networks. In: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 990–998 (2008)
Lewis, D.D., Yang, Y., Russell-Rose, T., Li, F.: Rcv1: A new benchmark collection for text categorization research. J. Mach. Learn. Res. 5(Apr), 361–397 (2004)


Acknowledgements

This research was conducted under a contract of “MITIGATE” among “Research and Development for Expansion of Radio Wave Resources(JPJ000254)”, which was supported by the Ministry of Internal Affairs and Communications, Japan.

Author information

Taiki Sugiura and Noboru Murata contributed equally to this work.

Authors and Affiliations

Graduate School of Advanced Science and Engineering, Waseda University, 3-4-1 Ohkubo, Shinjuku-ku, Tokyo, 169–8555, Japan
Taiki Sugiura & Noboru Murata

Authors

Taiki Sugiura
Noboru Murata

Corresponding author

Correspondence to Taiki Sugiura.

Ethics declarations

Conflict of interest

Noboru Murata is currently a board member of the journal. He was not involved in the peer review or handling of the manuscript. Furthermore, he is the corresponding author’s PhD thesis advisor. On behalf of all authors, the corresponding author states that there is no other potential conflict of interest to declare.

Additional information

Communicated by Nihat Ay.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

The original online version of this article was revised: the article was originally published without “Data Availability” and “Conflict of Interest”. The article has been updated with this information.

Appendices

Appendix A: Gradient of loss function

In this section, we describe the details of the calculations for minimizing Eq. (6), i.e., the m-projection onto the model manifold. The following symbols are defined for the description,

$$\begin{aligned} (R)_{ij}&= \varvec{x}_i^{\textsf{T}} \varvec{y}_j,\\ (P)_{ij}&= p(o_i, u_j; R) = \exp \left( \textrm{rel}(o_i, u_j) -\phi (R) \right) ,\\ (Q)_{ij}&= q(o_i, u_j; X, Y) = \exp \left( \varvec{x}_i^{\textsf{T}} \varvec{y}_j - \zeta (X,Y) \right) . \end{aligned}$$

First, the loss function in Eq. (6) is expanded as follows, where Z is a constant value calculated only from the empirical distribution P,

$$\begin{aligned} L_{\textrm{KL}} (R, X, Y)&= D_{\textrm{KL}}(P \Vert Q(X, Y)) \\&= \sum _{i=1}^{n} \sum _{j=1}^{m} (P)_{ij} \log \frac{(P)_{ij}}{(Q)_{ij}} \\&= \sum _{i=1}^{n} \left( \sum _{j=1}^m (P)_{ij} \log (P)_{ij} - \sum _{j=1}^m (P)_{ij} \log (Q)_{ij} \right) \\&= \sum _{i=1}^{n} \left( \sum _{j=1}^m (P)_{ij} \log (P)_{ij} - \sum _{j=1}^m (P)_{ij} \log \left( \frac{\exp \left( \varvec{x}_i^{\textsf{T}} \varvec{y}_j \right) }{\sum _{i', j'}^{n,m} \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'}\right) } \right) \right) \\&= Z - \sum _{i=1}^{n} \sum _{j=1}^m (P)_{ij} \log \left( \frac{\exp \left( \varvec{x}_i^{\textsf{T}} \varvec{y}_j \right) }{\sum _{i'=1}^{n} \sum _{j'=1}^m \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'}\right) } \right) \\&= Z - \sum _{i=1}^n \sum _{j=1}^m (P)_{ij} \varvec{x}_i^{\textsf{T}} \varvec{y}_j + \sum _{i=1}^n \sum _{j=1}^m (P)_{ij} \log \left( \sum _{i'=1}^{n} \sum _{j'=1}^m \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'}\right) \right) . \end{aligned}$$

From here on, we will describe the gradients for X and Y.

A.1 Gradient in the X direction

We first calculate for $\varvec{x}_h$ and then show the gradient for the whole X,

$$\begin{aligned}&\frac{\partial }{\partial \varvec{x}_h} L_{\textrm{KL}} (P \Vert Q(X,Y)) \\&\quad = \frac{\partial }{\partial \varvec{x}_h} Z - \frac{\partial }{\partial \varvec{x}_h} \left( \sum _{i=1}^n \sum _{j=1}^m (P)_{ij} \varvec{x}_i^{\textsf{T}} \varvec{y}_j \right) \\&\qquad + \frac{\partial }{\partial \varvec{x}_h} \left( \log \left( \sum _{i'=1}^{n} \sum _{j'=1}^m \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'}\right) \right) \sum _{i=1}^n \sum _{j=1}^m (P)_{ij} \right) \\&\quad = 0 - \sum _{j=1}^m (P)_{hj} \varvec{y}_j + \frac{\partial }{\partial \varvec{x}_h} \left( 1\ \log \left( \sum _{i'=1}^{n} \sum _{j'=1}^m \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'}\right) \right) \right) \\&\quad = - \sum _{j=1}^m (P)_{hj} \varvec{y}_j + \frac{\sum _{j'=1}^{m} \exp \left( \varvec{x}_h^{\textsf{T}} \varvec{y}_{j'} \right) \varvec{y}_{j'} }{\sum _{i'=1}^{n} \sum _{j'=1}^m \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'}\right) }\\&\quad = - \sum _{j=1}^m (P)_{hj} \varvec{y}_j + \frac{1}{\exp \left( \zeta (X,Y)\right) } \left( \sum _{j'=1}^{m} \exp \left( \varvec{x}_h^{\textsf{T}} \varvec{y}_{j'} \right) \varvec{y}_{j'} \right) . \end{aligned}$$

Thus, the gradient with respect to X is obtained as

$$\begin{aligned} \frac{\partial }{\partial X} L_{\textrm{KL}} (P \Vert Q(X,Y)) = - P Y + \frac{1}{\zeta } \exp \left( R \right) Y \in {\mathbb {R}}^{n \times d}. \end{aligned}$$
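As a sanity check on this closed form, the rows of the gradient can be compared with central finite differences of the loss. The sketch below is illustrative only: it assumes that $\zeta$ is the normalizer $\sum_{i',j'} \exp(\varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'})$ appearing in the derivation, drops the constant Z (which does not depend on X), and uses made-up toy values; all function names are hypothetical.

```python
import math

def dot(u, v):
    """Inner product x^T y of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def loss(P, X, Y):
    """KL loss without the constant Z, using sum_ij P_ij = 1."""
    linear = sum(P[i][j] * dot(X[i], Y[j])
                 for i in range(len(X)) for j in range(len(Y)))
    zeta = sum(math.exp(dot(x, y)) for x in X for y in Y)
    return -linear + math.log(zeta)

def grad_x(P, X, Y):
    """Row h: -sum_j P_hj y_j + (1/zeta) sum_j exp(x_h^T y_j) y_j."""
    zeta = sum(math.exp(dot(x, y)) for x in X for y in Y)
    d = len(X[0])
    G = []
    for h, x in enumerate(X):
        row = [0.0] * d
        for j, y in enumerate(Y):
            weight = math.exp(dot(x, y)) / zeta - P[h][j]
            for k in range(d):
                row[k] += weight * y[k]
        G.append(row)
    return G

def perturbed(X, h, k, delta):
    """Copy of X with entry (h, k) shifted by delta."""
    Xc = [row[:] for row in X]
    Xc[h][k] += delta
    return Xc

def grad_x_numeric(P, X, Y, eps=1e-6):
    """Central finite-difference approximation of the gradient in X."""
    return [[(loss(P, perturbed(X, h, k, eps), Y)
              - loss(P, perturbed(X, h, k, -eps), Y)) / (2 * eps)
             for k in range(len(X[0]))]
            for h in range(len(X))]

# Toy data (assumed): n = 2, m = 3, d = 2; the entries of P sum to 1.
P = [[0.10, 0.20, 0.30],
     [0.25, 0.05, 0.10]]
X = [[0.5, -0.2], [0.1, 0.3]]
Y = [[0.2, 0.4], [-0.3, 0.1], [0.6, -0.5]]

analytic = grad_x(P, X, Y)
numeric = grad_x_numeric(P, X, Y)
max_err = max(abs(a - b)
              for ra, rb in zip(analytic, numeric)
              for a, b in zip(ra, rb))
print(max_err)
```

A small `max_err` indicates that the analytic rows match the numerical derivative on this toy instance; it is a consistency check, not a proof.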

A.2 Gradient in the Y direction

We similarly derive the gradients with respect to $\varvec{y}_h$ and Y,

$$\begin{aligned}&\frac{\partial }{\partial \varvec{y}_h} L_{\textrm{KL}} (P \Vert Q(X,Y)) \\&\quad = \frac{\partial }{\partial \varvec{y}_h} Z - \frac{\partial }{\partial \varvec{y}_h} \left( \sum _{i=1}^n \sum _{j=1}^m (P)_{ij} \varvec{x}_i^{\textsf{T}} \varvec{y}_j \right) \\&\qquad + \frac{\partial }{\partial \varvec{y}_h} \left( \log \left( \sum _{i'=1}^{n} \sum _{j'=1}^m \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'}\right) \right) \sum _{i=1}^n \sum _{j=1}^m (P)_{ij} \right) \\&\quad = 0 - \sum _{i=1}^n (P)_{ih} \varvec{x}_i + \frac{\partial }{\partial \varvec{y}_h} \left( 1 \cdot \log \left( \sum _{i'=1}^{n} \sum _{j'=1}^m \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'}\right) \right) \right) \\&\quad = - \sum _{i=1}^n (P)_{ih} \varvec{x}_i + \frac{\sum _{i'=1}^{n} \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{h} \right) \varvec{x}_{i'} }{\sum _{i'=1}^{n} \sum _{j'=1}^m \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'}\right) }\\&\quad = - \sum _{i=1}^n (P)_{ih} \varvec{x}_i + \frac{1}{\zeta } \left( \sum _{i'=1}^{n} \exp \left( \varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{h} \right) \varvec{x}_{i'} \right) . \end{aligned}$$

Thus, the gradient with respect to Y is similarly obtained as

$$\begin{aligned} \frac{\partial }{\partial Y} L_{\textrm{KL}} (P \Vert Q(X,Y)) = - P^{\textsf{T}} X + \frac{1}{\zeta } \exp \left( R\right) ^{\textsf{T}} X \in {\mathbb {R}}^{m \times d}. \end{aligned}$$
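A compact restatement may help when reading the two results together. Assuming, as the matrix expressions suggest, that $\exp(R)$ denotes the element-wise exponential of R with $(R)_{ij} = \varvec{x}_i^{\textsf{T}} \varvec{y}_j$, and that $\zeta$ is the normalizer $\sum_{i',j'} \exp(\varvec{x}_{i'}^{\textsf{T}} \varvec{y}_{j'})$, we have $Q = \exp(R) / \zeta$, so both gradients can be written in terms of the residual $Q - P$:

$$\begin{aligned} \frac{\partial }{\partial X} L_{\textrm{KL}} (P \Vert Q(X,Y))&= \left( Q - P \right) Y \in {\mathbb {R}}^{n \times d}, \\ \frac{\partial }{\partial Y} L_{\textrm{KL}} (P \Vert Q(X,Y))&= \left( Q - P \right) ^{\textsf{T}} X \in {\mathbb {R}}^{m \times d}. \end{aligned}$$

In this form it is immediate that both gradients vanish whenever the model reproduces the target distribution, that is, whenever $Q = P$.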

Appendix B: Pseudo-data generation procedures

B.1 Procedure for generating multiple relevance levels

The following steps show how $R^{(k)}$ is generated, together with the $X_*^{(k)}$ and $Y_*$ that compose the $R^{(k)}$ used in Sect. 4.1.

Appendix C: Preprocessing of the DBLP dataset

Table 3 shows conditions for term extraction in Sect. 4.2.

Table 3 Conditions for term extraction


Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.


About this article

Cite this article

Sugiura, T., Murata, N. Object embedding using an information geometrical perspective. Info. Geo. 6, 435–462 (2023). https://doi.org/10.1007/s41884-023-00114-z


Received: 02 May 2022
Revised: 09 June 2023
Accepted: 20 June 2023
Published: 27 July 2023
Issue Date: November 2023
DOI: https://doi.org/10.1007/s41884-023-00114-z
