1 Introduction

The acquisition of vector representations for objects is useful for information processing. Vector representations allow the visualization of relationships between objects and can be used as a preprocessing tool for statistical inference and machine learning algorithms. However, there are many cases, such as images and sentences, that are difficult to represent as fixed-length vectors, or cases where only relationships between objects, such as relevance levels, are observed. In particular, we propose a method for obtaining an appropriate vector representation in situations where multiple relevance levels are given for each object. For example, consider the case where a movie is rated in each country each year. In this case, multiple relevance levels are given for the movie. Each country has different preferences for movies, and these preferences change over time. We propose a method to obtain a vector representation that takes into account multiple relevance levels for data with multiple observations over time. As another example, we can discuss cultural differences and trends in scientific papers. It is assumed that each journal or international conference has its own culture, as different words are used in different venues, and that these cultures also change over time. The proposed method is expected to be applicable to the analysis of such data. This example case study is discussed in detail in Sect. 4. For data such as graphs and images, there are also cases where only distances or similarities between objects are observed. In this paper, such distances and similarities are treated as special cases of relevance and discussed in a unified manner. The proposed method can be applied to obtain vector representations for various data, including time-series data for which similarities are given at different times and graph-structured data whose edges are assumed to have multiple patterns of weights.

In the following, a set of objects is denoted by \({\mathcal {O}}\), and individual objects are denoted by o, with the subscript i where necessary to distinguish them. The problem of assigning a vector \(\varvec{x}\) to an object o so as to approximate the relationships between the objects is generally called object embedding or vectorization. This problem can be divided into two types: the first is “constellation”, in which a vector representation is assigned to each individual object,

$$\begin{aligned} o_i \mapsto \varvec{x}_i, \ \forall i. \end{aligned}$$

The second is “embedding”, in which a function is constructed that gives vectors for the entire set of target objects,

$$\begin{aligned} \phi : {\mathcal {O}} \rightarrow {\mathcal {X}}. \end{aligned}$$

For the latter embedding problem, the problem of learning the mapping \(\phi \) from the data using a function approximator,

$$\begin{aligned} o_i \mapsto \varvec{x}_i = \phi (o_i), \end{aligned}$$

such as a neural network, is considered. In this paper, we deal in particular with the former constellation problem.

Most existing methods construct the vectors independently for each given relevance level, and the comparison or manipulation of vectors estimated from multiple relevance levels is not assumed. In the example of movies and journals mentioned earlier, such methods could be used to analyze a single year, or each year separately, but they are not intended for the analysis of data acquired across years, i.e., at multiple times. Constellating each relevance level individually results in vectors that do not preserve the relationships among the given relevance levels.

Therefore, we propose a constellation method for multiple relevance levels. In this paper, we assign vectors to objects using the stochastic embedding approach represented by stochastic neighbor embedding (SNE) [1], introduced in Sect. 2. If we reformulate the problem setting in terms of stochastic embedding, we can view the current problem as one of estimating appropriate probability distributions in a situation where multiple empirical distributions are available for each object. Simply estimating an appropriate distribution from each empirical distribution separately cannot capture the relationships between the empirical distributions; it is necessary to express these relationships by imposing constraints on the target space. The proposed logarithmic bilinear model (log-bilinear model) approximates the relevance levels using two sets of vectors as parameters, one for the objects and the other for the features. One property of the log-bilinear model is that, by fixing one set of parameters, the space spanned by the remaining parameters becomes an exponential family. In this paper, we use this property to represent the relationship between multiple empirical distributions by sharing some of the parameters of the probability distributions to be estimated. We also show that a weighted sum of the obtained vectors represents a mixture of probability distributions in the space of probability distributions and can easily be visualized after estimation.

The contributions of this paper are described as follows.

  • Our proposed framework can provide vectors even for multiple relevance levels.

  • Our algorithm can be interpreted as a simple operation in terms of information geometry.

  • The proposed method enables vectorization even when additional data are generated.

The structure of this paper is as follows. Section 2 provides an overview of existing mapping methods based on similarity and discrepancy, comparing classical and probabilistic methods. In Sect. 3, we first introduce auxiliary features for objects and then propose approximate representations using a log-bilinear model with two sets of vectors, for the features and the objects, together with its optimization, on the basis of an information geometry interpretation. We evaluate the effectiveness of the proposed method and its visualization using artificial and real data in Sect. 4. In Sect. 5, we summarize the proposed method and discuss the remaining issues and its possible applications.

2 Related works

In this section, we provide an overview of existing methods of mapping vectors based on relevance level. First, let us organize the symbols. The set of objects is denoted by \({\mathcal {O}}\), and the number of objects is represented by n. Individual objects are distinguished by subscript i as

$$\begin{aligned} o_i \in {\mathcal {O}}, \quad i \in \Lambda = \{1, \dots , n\}. \end{aligned}$$

Each object is given a d-dimensional vector \(\varvec{x} \in {\mathbb {R}}^d\), and the matrix X is defined by arranging the vectors \(\varvec{x}_i\) corresponding to the objects \(o_i\) as its rows,

$$\begin{aligned} X = \left[ \varvec{x}_1, \varvec{x}_2, \dots , \varvec{x}_n \right] ^{\textsf{T}} \in {\mathbb {R}}^{n \times d}. \end{aligned}$$

Here, we introduce a matrix \(R \in {\mathbb {R}}^{n \times m}\) that represents the relevance between objects and features and from which the matrix X is estimated. We assume that an object o and a feature u have a relevance \(\textrm{rel}(o, u) \in {\mathbb {R}}\) between them, and we estimate the vectors \(\varvec{x}\) from these relevances. Let \({\mathcal {U}}\) be the set of features. The entries of R are given by

$$\begin{aligned} (R)_{ij} = \textrm{rel}(o_i, u_j). \end{aligned}$$

where \((R)_{ij}\) is the (i, j)th entry of the matrix R. As mentioned above, similarity and distance can be regarded as special cases of the relevance R, in which case \({\mathcal {U}} = {\mathcal {O}}\) and \((R)_{ij}\) indicates their values.

In the following, we describe methods for estimating the matrix X from the relevance R. First, Torgerson’s classical metric multidimensional scaling (MDS) is a technique to obtain a low-rank approximation X of the similarity matrix using the following loss function,

$$\begin{aligned} L_{\textrm{Tor}}(R,X) = \sum _{i,j \in \Lambda }\left( (R)_{ij} - \varvec{x}_i^{\textsf{T}} \varvec{x}_j \right) ^2 = \Vert R - XX^{\textsf{T}} \Vert _F^2, \end{aligned}$$

where \(\Vert \cdot \Vert _F \) denotes the Frobenius norm [2]. Note that R is a symmetric matrix whose elements represent the degree of similarity between objects. On the other hand, Kruskal’s non-metric MDS [3] uses the Minkowski distance, a generalization of the Euclidean distance, to find the matrix X that approximates the relevance matrix R,

$$\begin{aligned} L_{\textrm{Kru}}(R,X) = \frac{ \sum _{i,j\in \Lambda } \left( \phi ((R)_{ij}) - \Vert \varvec{x}_i - \varvec{x}_j \Vert _p \right) ^2 }{ \sum _{i,j\in \Lambda } \Vert \varvec{x}_i - \varvec{x}_j \Vert ^2_p }, \end{aligned}$$

where \(\Vert \cdot \Vert _p \) denotes the Minkowski distance of order p and \(\phi (\cdot )\) is a suitable monotonic transformation. In this case, the relevance R is a matrix such that each element represents the distance between objects.
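
As a concrete illustration of the classical approach, the Torgerson loss above can be minimized by an eigendecomposition of the similarity matrix. The following is a minimal Python/NumPy sketch, assuming R is symmetric and approximately positive semidefinite; it is meant only to make the low-rank factorization \(R \simeq XX^{\textsf{T}}\) explicit.

```python
import numpy as np

def classical_mds(R, d):
    """Torgerson-style factorization: find X with R ≈ X X^T for a symmetric
    (approximately positive semidefinite) similarity matrix R."""
    eigval, eigvec = np.linalg.eigh(R)           # eigenvalues in ascending order
    idx = np.argsort(eigval)[::-1][:d]           # keep the d largest eigenvalues
    lam = np.clip(eigval[idx], 0.0, None)        # guard against small negative eigenvalues
    return eigvec[:, idx] * np.sqrt(lam)         # X such that X X^T approximates R

# usage: X = classical_mds(R, d=2)
```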

There are also other methods, such as robust Euclidean embedding [4], which improves robustness to outliers by evaluating differences between relevance levels and estimated vectors with an \(L_1\)-loss function, and non-metric MDS, which optimizes ordinal statistics [3]. Since essential information is often located in a low-dimensional space in practical problems involving high-dimensional data, various techniques have been actively studied to reconstruct it appropriately in a low-dimensional space. The difference among these methods lies in how they incorporate information from a neighborhood that is assumed to contain important information. Examples include Isomap [5], which recalculates geodesic distances from a neighborhood graph; semidefinite embedding (SDE) [6], which uses a kernel matrix; locally linear embedding (LLE) [7]; and Laplacian eigenmaps [8]. For the practical problem of acquiring vector representations of words, low-rank approximations of the word-document matrix have been proposed [9, 10].

In the classical methods described above, the approximation error is evaluated using a matrix norm. On the other hand, a family of methods called stochastic embedding has been proposed, which evaluates the reconstruction error of the vectors on the basis of the statistical distance between probability distributions.

Stochastic neighbor embedding (SNE) is the pioneering stochastic embedding method. In this method, the vectors are mapped by expressing the probability that an object \(o'\) is in the neighborhood of an object o as a conditional probability mass function \(p_o(o')\),

$$\begin{aligned} p_o(o') = \frac{f(o, o')}{\sum _{o'' \in {\mathcal {O}}} {f(o, o'')}}. \end{aligned}$$

Note that f is a function of the similarity between objects. Let \(\varvec{x}\) be the vector of an object o. We denote by q the conditional probability that an object \(o'\) is a neighbor of the object o,

$$\begin{aligned} q_o(o') = \frac{g(\Vert \varvec{x} - \varvec{x}'\Vert )}{\sum _{o'' \in {\mathcal {O}}} {g(\Vert \varvec{x} - \varvec{x}''\Vert )}}, \end{aligned}$$

where g is a function chosen so that the smaller the distance, the higher the probability.

A statistical distance is employed as the evaluation function between probability distributions in stochastic embedding. The Kullback–Leibler divergence (KL divergence) is commonly used as such a statistical distance and is defined as

$$\begin{aligned} D_{\textrm{KL}}(P \Vert Q) = \sum _{\omega \in \Omega } { p(\omega ) \left( \log p(\omega ) - \log q(\omega ) \right) }. \end{aligned}$$
(1)

In stochastic embedding, the statistical distance between the two probability distributions is used as a loss function with respect to the matrix X,

$$\begin{aligned} L_\textrm{SNE}(D, X)&= \sum _{o \in {\mathcal {O}}} D_\textrm{KL}(P_o \Vert Q_o) \\&= \sum _{o \in {\mathcal {O}}} \sum _{o' \in {\mathcal {O}}} p_o(o') \left( \log p_o(o') - \log q_o(o') \right) . \end{aligned}$$
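
For concreteness, the following is a minimal NumPy sketch of the SNE loss above. The particular choices of \(f = \exp(\cdot)\) applied to a given similarity matrix and of a Gaussian kernel for g are our own illustration; SNE additionally tunes per-object bandwidths, which is omitted here.

```python
import numpy as np

def sne_p(S):
    """Neighborhood probabilities p_o(o') from a similarity matrix S,
    with f(o, o') = exp(S_oo'); self-pairs are excluded and each row sums to one."""
    F = np.exp(S - S.max())
    np.fill_diagonal(F, 0.0)
    return F / F.sum(axis=1, keepdims=True)

def sne_q(X):
    """Model probabilities q_o(o') from the vectors X with a Gaussian kernel g."""
    D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)   # squared distances
    G = np.exp(-D)
    np.fill_diagonal(G, 0.0)
    return G / G.sum(axis=1, keepdims=True)

def sne_loss(P, Q, eps=1e-12):
    """L_SNE: sum over objects o of KL(P_o || Q_o)."""
    return float(np.sum(P * (np.log(P + eps) - np.log(Q + eps))))
```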

Various methods have been proposed for stochastic embedding, depending on the choice of statistical distance and of the similarity calculation used to capture the relevance levels between objects. t-SNE [11], a variant of SNE, uses a function whose distribution after transformation is the t-distribution to improve estimation robustness and speed up computation. Other proposed methods include UMAP [12], which applies the theory of algebraic topology, and word2vec [13], which estimates dual vectors using the asymmetry of the conditional probability defined from the similarity. GloVe [14], a key technique closely related to the proposed method, acquires vectors by focusing on the ratio of conditional probabilities of objects i and j. The problem setting of this paper is also related to embedding for time-series data. Dynamic t-SNE is an extension of t-SNE for mapping vectors to time-series data [15]; in addition to the conventional t-SNE loss function, it enables the mapping of objects by adding penalties on differences between times. Several methods, called “Dynamic Word Embedding”, have been proposed to enable the application of word embedding to time-series data [16,17,18]. Bunte et al. describe these methods from a unified perspective [19].

As described above, various techniques have been proposed to construct vectors that approximate relevance. This problem can be viewed as a low-rank approximation of the relevance matrix R by a matrix X,

$$\begin{aligned} R \simeq XX^{\textsf{T}}. \end{aligned}$$

The classical method, metric MDS, evaluates the approximation error with the Frobenius norm and reduces to a singular value decomposition of a symmetric matrix,

$$\begin{aligned} L_F(R, X) = \Vert R - XX^{\textsf{T}} \Vert _F. \end{aligned}$$

In this case, approximation errors are weighted equally regardless of the magnitude of the similarity, owing to the nature of the Frobenius norm. On the other hand, the stochastic embedding approach considers the probabilistic models defined by the relevances and the vectors as

$$\begin{aligned} p_o(o') = \frac{f(o, o')}{\sum _{o'' \in {\mathcal {O}}} f(o, o'')},\ q_o(o') = \frac{g(\varvec{x}^{\textsf{T}} \varvec{x'})}{\sum _{o'' \in {\mathcal {O}}} g(\varvec{x}^{\textsf{T}} \varvec{x''})}, \end{aligned}$$

and evaluates the approximation error on the basis of the statistical distance as follows,

$$\begin{aligned} L_{\textrm{KL}}(R, X) = D_{\textrm{KL}} (P_o \Vert Q_o). \end{aligned}$$

In this case, the nature of the statistical distance gives more weight to relevance levels between objects with high similarity, i.e., high probability values. Various geometric approaches have also been proposed, such as Poincaré embedding [20], which improves performance by changing how the distance between two objects is computed and embedding them in hyperbolic space. Related research in information geometry is described below, along with the proposed method.

3 Proposed methods

Since existing methods, as described in Sect. 2, obtain vectors from a single relevance matrix, they are not designed for multiple matrices. Therefore, it is difficult to compare the obtained vectors when an existing method is applied to multiple matrices separately. In this section, we propose a method of mapping vectors that can be used when multiple relevance levels are given, by assuming two types of vectors: object vectors and feature vectors. The object vectors can then be compared across multiple relevance levels through the shared feature vectors.

The symbols are described here. Let \({\mathcal {O}} = \{o_i\}_{i=1}^{n}\) be the set of objects and \({\mathcal {U}} = \{u_j\}_{j=1}^m\) be the set of features. In the example of movie ratings by country described in the introduction, \({\mathcal {O}}\) is the set of movies and \({\mathcal {U}}\) is the set of countries. The function \(\textrm{rel}(o, u)\) gives the relevance of an object \(o \in {\mathcal {O}}\) to a feature \(u \in {\mathcal {U}}\). In the movie data example, the rating of the movie o in country u is defined as \(\textrm{rel}(o, u)\).

In this section, we propose a method to estimate object vectors when K relevance matrices \(R^{(1)}, \dots , R^{(K)}\) are given. First, consider the case of a single relevance matrix. The proposed method approximates the relevance of an object and a feature by the inner product of their vectors, the vector \(\varvec{x}_i\) of the object \(o_i\) and the vector \(\varvec{y}_j\) of the feature \(u_j\),

$$\begin{aligned} \textrm{rel}(o_i, u_j) \simeq \varvec{x}_i^{\textsf{T}} \varvec{y}_j. \end{aligned}$$
(2)

Here, consider the two matrices X and Y obtained by arranging the vectors \(\varvec{x}_i\) and \(\varvec{y}_j\),

$$\begin{aligned} X&= \left[ \varvec{x}_1, \varvec{x}_2, \dots , \varvec{x}_n \right] ^{{\textsf{T}}}, \\ Y&= \left[ \varvec{y}_1, \varvec{y}_2, \dots , \varvec{y}_m \right] ^{{\textsf{T}}}. \end{aligned}$$

Equation (2) can then be treated as the approximation problem

$$\begin{aligned} R \simeq&XY^{{\textsf{T}}}, \end{aligned}$$
(3)

which approximates the relevance matrix R by the matrices X and Y. In the following, we assume that \((R)_{ij}\) takes larger values when \(o_i\) and \(u_j\) are strongly related.

Next, we define the probability distributions P and Q using the relevance matrix R and the matrices X and Y to evaluate the approximation accuracy of the decomposition in the framework of stochastic embedding. They indicate the probability that an object o and a feature u are observed simultaneously. The proposed method approximates the probability mass function p(o, u) of the empirical distribution P using a log-bilinear model based on the inner product of the vectors \(\varvec{x}\) and \(\varvec{y}\) corresponding to the object o and the feature u, respectively,

$$\begin{aligned} p(o, u; R)&= \exp (\textrm{rel}(o, u) - \phi (R)) , \end{aligned}$$
(4)
$$\begin{aligned} q(o, u; X, Y)&= \exp (\varvec{x}^{\textsf{T}} \varvec{y} - \zeta (X, Y)), \end{aligned}$$
(5)

where \(\phi (R)\) and \(\zeta (X,Y)\) are the normalization terms that make P(R) and Q(X, Y) probability distributions. The KL divergence between the joint probability distributions is used as the evaluation function for this problem,

$$\begin{aligned} L_{\textrm{KL}}(R, X, Y)&= D_{\textrm{KL}} (P(R) \Vert Q(X, Y)) \nonumber \\&= \sum _{o \in {\mathcal {O}}} \sum _{u \in {\mathcal {U}}} p(o, u; R) \left( \log p(o, u; R) - \log q(o, u; X, Y) \right) . \end{aligned}$$
(6)
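
A minimal NumPy sketch of Eqs. (4)–(6) reads as follows; the constant eps is only for numerical stability and is not part of the model.

```python
import numpy as np

def joint_softmax(S):
    """Normalize exp(S) over all (object, feature) pairs, as in Eqs. (4) and (5)."""
    E = np.exp(S - S.max())
    return E / E.sum()

def loss_kl(R, X, Y, eps=1e-12):
    """Eq. (6): KL divergence between the empirical distribution P(R)
    and the log-bilinear model Q(X, Y)."""
    P = joint_softmax(R)           # p(o, u; R)    ∝ exp(rel(o, u))
    Q = joint_softmax(X @ Y.T)     # q(o, u; X, Y) ∝ exp(x_o^T y_u)
    return float(np.sum(P * (np.log(P + eps) - np.log(Q + eps))))
```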

The idea of relying on a log-bilinear model is also used in word2vec [13] and GloVe [14]. However, minimizing the KL divergence for multiple relevance matrices does not allow us to estimate appropriate vectors. If A is any nonsingular matrix, define

$$\begin{aligned} X' = XA, \quad Y' = Y(A^{-1})^{\textsf{T}}. \end{aligned}$$

Since Eq. (3) yields the same R for the pair X and Y and for the pair \(X'\) and \(Y'\), Eq. (6) takes the same value for both solutions,

$$\begin{aligned} R = X'Y'^{{\textsf{T}}} = X A A^{-1} Y^{\textsf{T}} = XY^{\textsf{T}}. \end{aligned}$$

This means that the solution to this problem is determined only up to an arbitrary nonsingular matrix. For this reason, it is difficult to discuss the relationship between the solutions \(\left\{ X^{(k)}\right\} \) obtained for given multiple relevance levels \(\left\{ R^{(k)}\right\} _{k=1}^{K}\). By sharing the matrix Y for all k, we can derive an operation such that a convex combination of the matrices \(X^{(k)}\) can be regarded as a mixture of the probability distributions \(Q^{(k)}\),

$$\begin{aligned} X^{(k)}&= \left[ \varvec{x}_1^{(k)}, \dots , \varvec{x}_n^{(k)} \right] ^{{\textsf{T}}}, \\ Y&= \left[ \varvec{y}_1, \dots , \varvec{y}_m \right] ^{{\textsf{T}}}. \end{aligned}$$

We can extend the method to estimate the matrices \(\{X^{(k)}\}\) for multiple relevance matrices through this shared matrix Y. Let us consider an interior dividing point \(\bar{\varvec{x}}\) between the vectors \(\varvec{x}^{(1)}\) and \(\varvec{x}^{(2)}\) estimated using the relevances \(R^{(1)}\) and \(R^{(2)}\),

$$\begin{aligned} \bar{\varvec{x}} = \alpha \varvec{x}^{(1)} + (1-\alpha ) \varvec{x}^{(2)} , \quad 0 \le \alpha \le 1. \end{aligned}$$

The probability mass function \({\bar{q}}(o,u; {\bar{X}}, Y)\) of the joint probability distribution \({\bar{Q}}({\bar{X}}, Y)\) induced by \(\bar{\varvec{x}}\) can be written as

$$\begin{aligned}&{\bar{q}}(o, u; {\bar{X}},Y)\nonumber \\&\quad =\exp \left( \bar{\varvec{x}}^{{\textsf{T}}} \varvec{y}-{\bar{\zeta }}\left( {\bar{X}}, Y\right) \right) \nonumber \\&\quad =\exp \left( \left( \alpha \varvec{x}^{(1)}+(1-\alpha ) \varvec{x}^{(2)} \right) ^{{\textsf{T}}} \varvec{y}-{\bar{\zeta }} \left( {\bar{X}}, Y\right) \right) \nonumber \\&\quad =\exp \left( \alpha \left( \varvec{x}^{(1){\textsf{T}}} \varvec{y} -\zeta ^{(1)}\left( X^{(1)}, Y\right) \right) \right. \nonumber \\&\qquad \left. +(1-\alpha ) \left( \varvec{x}^{(2){\textsf{T}}} \varvec{y}-\zeta ^{(2)}\left( X^{(2)}, Y\right) \right) -\zeta '\left( {\bar{X}}, Y\right) \right) \nonumber \\&\quad =\left( q^{(1)}(o,u; X^{(1)},Y)\right) ^{\alpha } \left( q^{(2)}(o,u; X^{(2)},Y)\right) ^{1-\alpha } \exp \left( -\zeta '\left( {\bar{X}}, Y\right) \right) . \end{aligned}$$
(7)

where \(q^{(k)}\) is the probability mass function of \(Q^{(k)}\), and \(\exp (-\zeta '({\bar{X}}, Y))\) is the normalization term. The distribution \({\bar{Q}}\) is a convex combination of the two distributions \(Q^{(1)}\) and \(Q^{(2)}\) under log transformation,

$$\begin{aligned}&\log {\bar{q}} (o, u; {\bar{X}},Y) \\&\quad = \alpha \log q^{(1)}(o, u; X^{(1)},Y) + (1-\alpha ) \log q^{(2)}(o, u; X^{(2)},Y) - \zeta '({\bar{X}}, Y). \end{aligned}$$

Therefore, the distribution \({\bar{Q}}({\bar{X}},Y)\) is a mixture distribution called the e-mixture. Since the e-mixture emphasizes regions where both distributions have a high probability, it is intuitively an AND-like operation that takes the common part of the distributions. In this way, mixing the vectors corresponds to an operation on the distributions. The details of the e-mixture are described in the next section. In the proposed method, we can estimate vectors for multiple relevance matrices by fixing Y. In this case, the matrices \(\{X^{(k)}\}\) and Y are not uniquely determined, so we reduce the degrees of freedom of the solution by imposing the following constraint,

$$\begin{aligned} Y^{\textsf{T}}Y = I_d, \end{aligned}$$
(8)

where \(I_d \in {\mathbb {R}}^{d\times d}\) is the identity matrix. Note that even with this condition, the estimates of X and Y remain arbitrary up to an orthogonal matrix. Given multiple relevance levels, the optimization problem for the matrices is written as

$$\begin{aligned}&\textrm{minimize}\ L_{\textrm{KL}}(R^{(k)}, X^{(k)}, Y) \nonumber \\&\quad \mathrm {with\ respect\ to\ } X^{(k)}, \forall k\ \textrm{and}\ Y \nonumber \\&\quad \mathrm {subject\ to\ } Y^{{\textsf{T}}} Y = I_d. \end{aligned}$$
(9)

In this paper, we imposed the constraint in Eq. (8), but any other constraint that reduces the degrees of freedom of the solution could be used instead.

3.1 Preliminary on information geometry

Information geometry is a framework for discussing the behavior of statistical inference and machine learning from a geometric perspective. A wide variety of methods and algorithms have been understood and extended through information geometry. For example, Akaho proposed a method for obtaining vectors by dimension reduction in [21]. As another information-geometric approach to object embedding, Riccard et al. propose alpha embedding, which evaluates the discrepancy between two probability distributions using the alpha-divergence [22]. In both studies, the information geometry approach is used to extend conventional methods for embedding and dimension reduction of a single dataset. As previously mentioned, directly applying the object embedding approach to relationships between multiple datasets is difficult, so the method needs to be extended to take these relationships into account. In this section, we describe the basic ideas necessary to explain the proposed method, especially the handling of statistical models and their estimation in information geometry [23].

Let \({\mathcal {Z}}\) be the sample space and Z be a random variable. Let \({\mathcal {S}}\) be the space containing arbitrary probability distributions, and let P be a probability distribution whose probability mass function is p(z),

$$\begin{aligned} {\mathcal {S}} = \left\{ P \vert p(z) \mathrm {\ is\ a\ probability\ mass\ function\ of\ } P \right\} . \end{aligned}$$

The set of probability distributions with parameter \(\varvec{\theta }\) is defined as

$$\begin{aligned} {\mathcal {M}}(\varvec{\theta }) = \left\{ P_{\varvec{\theta }} \vert p(z; \varvec{\theta }) \right\} , \quad {\mathcal {M}}(\varvec{\theta }) \subset {\mathcal {S}}, \end{aligned}$$

which is called the model manifold \({\mathcal {M}}(\varvec{\theta })\). Let \(z_1, \dots , z_N\) be the observed data sampled from a probability mass function p with appropriate parameters \(\varvec{\theta }^*\). The observed data can be regarded as a probability distribution, called the empirical distribution, using delta functions,

$$\begin{aligned} p(z) = \frac{1}{N} \sum _{i=1}^N \delta (z - z_i). \end{aligned}$$

In information geometry, estimating the parameter \(\varvec{\theta }\) refers to finding the probability distribution Q on the subspace \({\mathcal {M}}\) closest to the empirical distribution P with respect to the statistical distance D [23]. The statistical distance D refers to a pseudo-distance between two probability distributions P and Q, and here we employ the KL divergence \(D_\textrm{KL}(P \Vert Q)\) in Eq. (1).

We define sets \({\mathcal {G}}_m\) and \({\mathcal {G}}_e\) of interior points for two probability distributions \(P^{(1)}\) and \(P^{(2)}\) on the space \({\mathcal {S}}\) of probability distributions,

$$\begin{aligned} {\mathcal {G}}_m =&\left\{ P^\alpha _m \Big \vert p_m(z; \alpha ) = \alpha p^{(1)}(z) + (1-\alpha ) p^{(2)}(z) \right\} , \nonumber \\ {\mathcal {G}}_e\ =&\left\{ P^\alpha _e \Big \vert p_e(z; \alpha ) = \exp \left( \alpha \log \left( p^{(1)}(z) \right) + (1 - \alpha ) \log \left( p^{(2)}(z) \right) - \zeta (\alpha ) \right) \right\} . \end{aligned}$$
(10)

where \(\alpha \in [0, 1]\). The subspace \({\mathcal {G}}_m\) in Eq. (10) is called the m-geodesic, and \({\mathcal {G}}_e\) is called the e-geodesic. Given K probability distributions, the subspace \({\mathcal {M}}_m\) is defined as the m-flat subspace spanned by their convex combinations, and the subspace \({\mathcal {M}}_e\) is defined as the e-flat subspace spanned by convex combinations of their logarithmic representations,

$$\begin{aligned} {\mathcal {M}}_m&= \left\{ P_m^{(\varvec{\alpha })} \Bigg \vert p_m(z; \varvec{\alpha }) = \sum _{k=1}^{K} \alpha ^{(k)} p^{(k)}(z) \right\} , \\ {\mathcal {M}}_e&= \left\{ P_e^{(\varvec{\alpha })} \Bigg \vert p_e(z; \varvec{\alpha }) = \exp \left( \sum _{k=1}^{K} \alpha ^{(k)} \log \left( p^{(k)}(z) \right) - \zeta (\varvec{\alpha }) \right) \right\} , \end{aligned}$$

where \(\varvec{\alpha } = \left[ \alpha ^{(1)}, \dots , \alpha ^{(K)}\right] ^{{\textsf{T}}} \in {\mathbb {R}}^{K},\ \alpha ^{(k)} \ge 0\) and \(\sum _{k=1}^K \alpha ^{(k)} = 1\). Here, p(z) is called the m-representation, \(\log \left( p(z) \right) \) is called the e-representation, and the operations that construct \(P_m\) and \(P_e\) by convex combination in each representation are called the m-mixture and the e-mixture, respectively.
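
The difference between the two mixtures can be seen in a small numerical example; the two distributions below are arbitrary illustrations.

```python
import numpy as np

def m_mixture(p1, p2, alpha):
    """m-mixture: convex combination of the probability mass functions."""
    return alpha * p1 + (1 - alpha) * p2

def e_mixture(p1, p2, alpha, eps=1e-12):
    """e-mixture: convex combination of the log probabilities, then renormalization."""
    log_mix = alpha * np.log(p1 + eps) + (1 - alpha) * np.log(p2 + eps)
    q = np.exp(log_mix - log_mix.max())
    return q / q.sum()

# two distributions on a five-point sample space with mostly disjoint mass
p1 = np.array([0.45, 0.45, 0.08, 0.01, 0.01])
p2 = np.array([0.01, 0.01, 0.08, 0.45, 0.45])
print(m_mixture(p1, p2, 0.5))   # [0.23 0.23 0.08 0.23 0.23]: keeps mass at the disjoint ends
print(e_mixture(p1, p2, 0.5))   # the shared middle point now has the highest probability (AND-like)
```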

We now consider the KL divergence for the three probability distributions P, Q, and R,

$$\begin{aligned}&D_\textrm{KL} (P \Vert Q) - D_\textrm{KL}(P \Vert R) - D_\textrm{KL}(R \Vert Q) \nonumber \\&\quad = \sum _{z \in {\mathcal {Z}}}\left\{ \left( p(z) (\log (p(z)) - \log (q(z)) \right) -\left( p(z) (\log (p(z)) - \log (r(z)) \right) \right. \nonumber \\&\qquad \left. - \left( r(z) (\log (r(z)) - \log (q(z)) \right) \right\} \nonumber \\&\quad = \sum _{z \in {\mathcal {Z}}} \left( p(z) - r(z)\right) \left( \log (r(z)) - \log (q(z))\right) , \end{aligned}$$
(11)

where r is the probability mass function of R. When the right-hand side of Eq. (11) is zero, that is, when the m-geodesic from the probability distribution P to R and the e-geodesic from R to Q are orthogonal, the Pythagorean theorem holds:

$$\begin{aligned} D_\textrm{KL} (P \Vert Q) = D_\textrm{KL}(P \Vert R) + D_\textrm{KL}(R \Vert Q). \end{aligned}$$
(12)

This is useful for describing parameter estimation in the stochastic models.

We now introduce a set of probability distributions called an exponential family to describe the geometric properties of parameter estimation in statistical models. An exponential family is a set of probability distributions whose probability mass functions can be written in the following form:

$$\begin{aligned} {\mathcal {P}} = \left\{ P_{\varvec{\theta }} \Big \vert p(\varvec{z}; \varvec{\theta }) = \exp \left( \varvec{\theta }^{\textsf{T}} \varvec{z} - \zeta (\varvec{\theta }) \right) \right\} \subset {\mathcal {S}}, \end{aligned}$$

where \(\varvec{\theta }\) are the parameters of P, and \(\zeta (\varvec{\theta })\) is the normalization term given by

$$\begin{aligned} \zeta (\varvec{\theta }) = \log \sum _{\varvec{z} \in {\mathcal {Z}}} \exp \left( \varvec{\theta }^{\textsf{T}} \varvec{z} \right) . \end{aligned}$$

Probability distributions in the exponential family are e-flat [23]. For an e-flat subspace \({\mathcal {M}}_e\), suppose there exist probability distributions \(P \notin {\mathcal {M}}_e\) and \(Q, R \in {\mathcal {M}}_e\) satisfying Eq. (12). Then the point Q of \({\mathcal {M}}_e\) closest to P is equal to R, because the m-geodesic from P to R and the e-geodesic from R to Q are orthogonal by the above Pythagorean theorem. This process of estimating R from P is called the m-projection,

$$\begin{aligned} R = \mathop {\mathrm{arg~min}}\limits _{Q \in {\mathcal {M}}_e} D_{\textrm{KL}}(P \Vert Q). \end{aligned}$$
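
As an illustration of the m-projection, the following sketch fits an exponential family on a finite sample space to an empirical distribution by gradient descent on the KL divergence; the sufficient statistics T and the simple solver are our own choices for illustration, not part of the proposed method.

```python
import numpy as np

def m_projection(p, T, n_steps=2000, lr=0.5):
    """m-projection of an empirical distribution p onto the exponential family
    q_theta(z) ∝ exp(theta^T t(z)), where row T[z] holds the sufficient statistic t(z).
    Minimizing KL(p || q_theta) amounts to matching the expectations of t(Z)."""
    theta = np.zeros(T.shape[1])
    target = p @ T                                 # E_P[t(Z)]
    for _ in range(n_steps):
        logits = T @ theta
        q = np.exp(logits - logits.max())
        q /= q.sum()
        theta -= lr * (q @ T - target)             # gradient of KL(p || q_theta)
    return theta, q

# toy usage: project a distribution on five points onto the family with t(z) = (z, z^2)
zs = np.linspace(0.0, 1.0, 5)
T = np.column_stack([zs, zs ** 2])
p = np.array([0.1, 0.2, 0.4, 0.2, 0.1])
theta_hat, q_hat = m_projection(p, T)              # q_hat approximately matches E_P[t(Z)]
```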

In the next section, we explain how the proposed method can be interpreted in terms of information geometry for the case where multiple empirical distributions P are observed.

3.2 Algorithms for information geometric interpretation

This section describes an information geometrical interpretation of the proposed model. We use the vectorization operator \(\textrm{vec}: {\mathbb {R}}^{N \times M} \rightarrow {\mathbb {R}}^{NM}\) and the Kronecker product operator \(\otimes : {\mathbb {R}}^{N \times M} \times {\mathbb {R}}^{U \times V}\rightarrow {\mathbb {R}}^{NU\times MV}\), defined as

$$\begin{aligned} \textrm{vec}(A)&= \left[ \begin{array}{c} \varvec{a}_1 \\ \varvec{a}_2 \\ \vdots \\ \varvec{a}_M \\ \end{array} \right] \in {\mathbb {R}}^{NM}, \ A = \left[ \begin{array}{cccc} \varvec{a}_1&\varvec{a}_2&\dots&\varvec{a}_M \end{array} \right] \in {\mathbb {R}}^{N \times M}, \ \varvec{a}_j \in {\mathbb {R}}^N, \\ A \otimes B&= \left[ \begin{array}{cccc} (A)_{11} B &{} (A)_{12} B &{} \cdots &{} (A)_{1M} B \\ (A)_{21} B &{} (A)_{22} B &{} \cdots &{} (A)_{2M} B \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ (A)_{N1} B &{} (A)_{N2} B &{} \cdots &{} (A)_{NM} B \end{array} \right] \in {\mathbb {R}}^{NU \times MV}, \end{aligned}$$

where B is a \(U \times V\) matrix. The probability that an object o and a feature u are observed simultaneously can be generalized as

$$\begin{aligned} q(o, u; X, Y)&= \exp \left( \sum _{i, j}^{n,m} \delta _i^o \delta _j^u \varvec{x}_i^{\textsf{T}} \varvec{y}_j - \zeta (X, Y) \right) \nonumber \\&= \exp \left( {\varvec{\delta }^o}^{\textsf{T}} XY^{\textsf{T}} \varvec{\delta }^u - \zeta (X, Y) \right) . \end{aligned}$$
(13)

Equation (13) is equivalent to Eq. (5) when \(o_i\) and \(u_j\) are given as a concrete event,

$$\begin{aligned} q(o_i, u_j; X, Y) = \exp (\varvec{x}_i^{\textsf{T}} \varvec{y}_j - \zeta (X, Y)). \end{aligned}$$

Here, \(\delta _i^o\), \(\delta _j^u\), \(\varvec{\delta }^o\), and \(\varvec{\delta }^{u}\) are

$$\begin{aligned} \delta ^o_i&= {\left\{ \begin{array}{ll} 1 &{} (o = o_i) \\ 0 &{} (o\ne o_i) \end{array}\right. } ,\quad \delta ^u_j = {\left\{ \begin{array}{ll} 1 &{} (u = u_j) \\ 0 &{} (u\ne u_j) \end{array}\right. } \\ \varvec{\delta }^o&= \left[ \begin{array}{c} \delta _1^o \\ \delta _2^o \\ \vdots \\ \delta _n^o \end{array} \right] , \quad \varvec{\delta }^u = \left[ \begin{array}{c} \delta _1^u \\ \delta _2^u \\ \vdots \\ \delta _m^u \end{array} \right] . \end{aligned}$$

Using the vec operator and the Kronecker product, we can transform Eq. (13) as

$$\begin{aligned} q(o,u; X, Y)&= \exp \left( \textrm{vec}\left( X^{\textsf{T}} \right) ^{\textsf{T}} \left( \varvec{\delta }^o {\varvec{\delta }^u}^{\textsf{T}} \otimes I_d \right) \textrm{vec}\left( Y^{\textsf{T}} \right) -\zeta (X, Y) \right) \nonumber \\&= \exp \left( \textrm{vec}\left( X^{\textsf{T}} \right) ^{\textsf{T}} \Lambda (o,u)\, \textrm{vec}\left( Y^{\textsf{T}} \right) -\zeta (X, Y) \right) , \end{aligned}$$
(14)

where \(\Lambda \) is an \(nd \times md\) matrix and \(\Delta \) is an \(n \times m \) matrix defined as

$$\begin{aligned} \Lambda (o,u)&= \underbrace{\Delta }_{n \times m} \otimes \underbrace{I_d}_{d \times d} = \underbrace{ \left[ \begin{array}{ccc} \left[ \begin{array}{ccc} (\Delta )_{11}&{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} (\Delta )_{11} \end{array} \right] &{} \cdots &{} \left[ \begin{array}{ccc} (\Delta )_{1m}&{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} (\Delta )_{1m} \end{array} \right] \\ \vdots &{} \ddots &{} \vdots \\ \left[ \begin{array}{ccc} (\Delta )_{n1}&{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} (\Delta )_{n1} \end{array} \right] &{} \cdots &{} \left[ \begin{array}{ccc} (\Delta )_{nm}&{} \cdots &{} 0 \\ \vdots &{} \ddots &{} \vdots \\ 0 &{} \cdots &{} (\Delta )_{nm} \end{array} \right] \end{array} \right] , }_{nd \times md} \\ (\Delta )_{ij}&= \delta _i^o \delta _j^u. \end{aligned}$$

We see that the probability mass function q expressed in Eq. (14) is an exponential family with respect to \(\varvec{\theta }\) when \(\Lambda \varvec{\xi }\) is fixed, and likewise an exponential family with respect to \(\varvec{\xi }\) when \(\varvec{\theta }^{\textsf{T}} \Lambda \) is fixed. Writing \(\varvec{v}(o,u,\varvec{\xi }) = \Lambda (o,u) \varvec{\xi }\) and \(\varvec{v}'(o,u,\varvec{\theta }) = \Lambda (o,u)^{{\textsf{T}}} \varvec{\theta }\), we have

$$\begin{aligned} q(o,u; X, Y)&= \exp \left( \varvec{\theta }^{{\textsf{T}}} \Lambda (o,u) \varvec{\xi } -\zeta (X,Y) \right) \\&= \exp \left( \varvec{\theta }^{\textsf{T}} \varvec{v}(o,u,\varvec{\xi }) - \zeta (X,Y) \right) \\&= \exp \left( \varvec{\xi }^{\textsf{T}} \varvec{v}'(o,u,\varvec{\theta }) - \zeta (X,Y) \right) , \end{aligned}$$

where \(\varvec{\theta }\) and \(\varvec{\xi }\) are

$$\begin{aligned} \varvec{\theta }&= \textrm{vec}(X^{\textsf{T}}) = \left[ \begin{array}{c} \varvec{x}_1 \\ \varvec{x}_2 \\ \vdots \\ \varvec{x}_n \end{array} \right] \in {\mathbb {R}}^{nd} , \quad \varvec{\xi } = \textrm{vec}(Y^{\textsf{T}}) = \left[ \begin{array}{c} \varvec{y}_1 \\ \varvec{y}_2 \\ \vdots \\ \varvec{y}_m \end{array} \right] \in {\mathbb {R}}^{md}. \end{aligned}$$
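
The identity behind Eq. (14) can be checked numerically for a single event \((o_i, u_j)\); the small dimensions below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, d = 4, 3, 2
X, Y = rng.normal(size=(n, d)), rng.normal(size=(m, d))
i, j = 1, 2                                  # pick a concrete event (o_i, u_j)
delta_o, delta_u = np.eye(n)[i], np.eye(m)[j]
Delta = np.outer(delta_o, delta_u)           # the n x m indicator matrix
Lam = np.kron(Delta, np.eye(d))              # Lambda(o, u) = Delta ⊗ I_d
theta, xi = X.reshape(-1), Y.reshape(-1)     # vec(X^T), vec(Y^T): rows stacked
assert np.isclose(theta @ Lam @ xi, X[i] @ Y[j])   # exponent of Eq. (14) equals x_i^T y_j
```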

We propose an optimization process of Eq. (9) on the basis of these results. The proposed algorithm estimates vectors from multiple relevance levels by a combination of the m-projection to the model manifold and the e-mixture on the model manifold.

The probability distributions \(P^{(1)}, P^{(2)}, \dots , P^{(K)}\) computed from multiple relevance levels are each represented as a single point in the space of probability distributions \({\mathcal {S}}\) as empirical distributions. Similarly, the model distribution is represented as a single point on \({\mathcal {S}}\), given the parameters X and Y. In other words, this problem is equivalent to finding an appropriate mapping from the empirical distributions \(P^{(k)}\) to the model manifold \({\mathcal {M}}(X, Y)\),

$$\begin{aligned} {\mathcal {M}}(X, Y) = \left\{ Q(X, Y) \Big \vert q(o, u; X, Y) = \exp \left( \varvec{\theta }^{{\textsf{T}}} \Lambda (o,u) \varvec{\xi } -\zeta (X, Y) \right) \right\} . \end{aligned}$$

Let \({\mathcal {M}}_{Y_\dagger }(X)\) be a subspace of \({\mathcal {M}}(X, Y)\) where Y is fixed to \(Y_\dagger \), and X is freely chosen,

$$\begin{aligned} {\mathcal {M}}_{Y_\dagger }(X) = \left\{ Q(X, Y_\dagger ) \Big \vert q(o, u; X, Y_\dagger ) = \exp \left( \varvec{\theta }^{{\textsf{T}}} \Lambda (o,u)\, \varvec{\xi }_\dagger -\zeta (X, Y_\dagger ) \right) \right\} , \end{aligned}$$

where \(\varvec{\xi }_\dagger = \textrm{vec}(Y_\dagger ^{\textsf{T}})\). The subspace \({\mathcal {M}}_{Y_\dagger }(X)\) is e-flat with respect to the parameter X because it is a member of the exponential family. This means that the distribution in this subspace closest to the empirical distribution P is given by the m-projection.

Similarly, let \({\mathcal {M}}_{X_\dagger }(Y)\) be a subspace of \({\mathcal {M}}(X, Y)\) where X is fixed to \(X_\dagger \), and Y is freely chosen,

$$\begin{aligned} {\mathcal {M}}_{{X}_\dagger }(Y) = \left\{ Q(X_\dagger , Y) \Big \vert q(o, u; X_\dagger , Y) = \exp \left( \varvec{\theta }_\dagger ^{\textsf{T}} \Lambda (o,u) \varvec{\xi } -\zeta (X_\dagger ,Y) \right) \right\} , \end{aligned}$$

where \(\varvec{\theta }_\dagger = \textrm{vec}(X_\dagger ^{\textsf{T}})\).

A convex combination of the matrices on the model manifold corresponds to an e-mixture of probability distributions, as shown in Eq. (7). Therefore, the model manifold \({\mathcal {M}}_{{\bar{Y}}}\) computed from \({\bar{Y}}\) is expected to preserve the properties of \(\left\{ Y^{(k)}\right\} \),

$$\begin{aligned} {\bar{Y}} = \sum _{k=1}^K \alpha _k Y^{(k)}, \quad \sum _{k=1}^K \alpha _k = 1. \end{aligned}$$
(15)

The feature matrix Y is constrained by Eq. (8) to limit its degrees of freedom. The set of matrices satisfying this constraint is called a Stiefel manifold, which is defined as

$$\begin{aligned} \textrm{St}(n,d) = \left\{ A \in {\mathbb {R}}^{n \times d} \Big \vert A^{{\textsf{T}}} A = I_d \right\} , \quad n \ge d. \end{aligned}$$

The mixture matrix \({\bar{Y}}\) given in Eq. (15) is not necessarily a matrix on this manifold. Therefore, we introduce a retraction function \(\pi : {\bar{Y}} \mapsto Y \in \textrm{St}\).

There are various types of retraction for matrix A, and we use the method based on QR decomposition [24],

$$\begin{aligned} \pi (A) = D , \quad A = DU, \end{aligned}$$

where D is an orthogonal matrix and U is an upper triangular matrix. The proposed algorithm estimates parameters separately for each dataset and integrates them, which is computationally less expensive than estimating all parameters simultaneously. However, efficient estimation when the number of objects in a dataset is large remains a future challenge. In particular, depending on the retraction method, the retraction may become a computational bottleneck and should be chosen carefully. Since the QR decomposition selected here costs \({\mathcal {O}}(n^3)\), other options need to be considered when the data size is large.
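
A minimal NumPy sketch of this QR-based retraction is shown below; the sign correction of the columns is a common convention to make the factorization unique and is our own addition.

```python
import numpy as np

def retract_qr(A):
    """Retraction pi(A) onto the Stiefel manifold via the reduced QR decomposition A = D U."""
    D, U = np.linalg.qr(A)
    signs = np.sign(np.diag(U))
    signs[signs == 0] = 1.0
    return D * signs                   # flip column signs so the factorization is unique

# Y_bar = sum_k alpha_k * Y_k  (Eq. (15)); then Y = retract_qr(Y_bar) satisfies Y^T Y = I_d
```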

We revisit the problem formulation from an information geometric perspective. The task involves projecting multiple empirical distributions onto a large model manifold while preserving the relationships between these empirical distributions.

To provide an intuitive explanation of the proposed algorithm, our approach involves constructing a large model manifold parameterized by X and Y. By applying constraints that fix a subset of parameters, we project the empirical distributions onto a subspace within the large model manifold. By iteratively performing this process, we expect to accurately estimate the parameters while retaining the characteristics of each empirical distribution.

The concept of the proposed algorithm is shown in Fig. 1, and Algorithm 1 gives the proposed procedure for estimating \(\{X^{(k)}\}\) and Y. The figure describes the algorithm as follows. First, it is assumed that \(\{P^{(k)}\}_{k=1}^{3}\), computed from the multiple relevance levels \(\{R^{(k)}\}_{k=1}^3\), are obtained. In addition, the parameters at iteration index t are given as \(\{X^{(k)}_t\}\) and \(Y_t\). Figure 1a shows that the matrices \(\{X^{(k)}_{t+1}\}\) are first estimated by the m-projection from \(\{P^{(k)}\}\) to the subspace \({\mathcal {M}}_{Y_t}(X)\) of the model manifold with \(Y_t\) fixed. Then, \(\{Y^{(k)}_{t+1}\}\) are estimated by the m-projection from \(\{P^{(k)}\}\) to the model manifold with \(\{X^{(k)}_{t+1}\}\) fixed, as shown in Fig. 1b. Figure 1c shows that a convex combination of \(\{Y^{(k)}_{t+1}\}\) gives the subspace \({\mathcal {M}}_{{\bar{Y}}_{t+1}}(X)\). In Fig. 1d, \(Y_{t+1}\) is derived by the retraction of \({\bar{Y}}_{t+1}\) onto a Stiefel manifold to satisfy the constraint of Eq. (8).

Fig. 1

An optimization process. The first step a is the m-projection from \(P^{(k)}\) to \({\mathcal {M}}_{Y_t}\) with \(Y_{t}\) fixed. In step b, each \(X^{(k)}_{t+1}\) is fixed, and \(Y^{(k)}_{t+1}\) is estimated with the second m-projection. c Next, we mix \(\{Y_{t+1}^{(k)}\}\) to calculate \({\bar{Y}}_{t+1}\) by using the e-mixture. d Finally, \(Y_{t+1}\) is derived from \({\bar{Y}}_{t+1}\) by the retraction onto the Stiefel manifold

The matrices \(\{X^{(k)}\}\) and Y estimated by this process maintain the neighborhood relations of \(\{P^{(k)}\}\). The shared Y also allows the object matrices to be compared across relevance levels.

Algorithm 1

An algorithm of the proposed mapping method

In addition to introducing two sets of coordinates as in the proposed method when multiple relevance levels are given, another possible approach is to estimate them collectively, for instance, by vertically concatenating the relevance matrices and treating them as a single dataset. In this case, \(\{X^{(k)}\}\) can be estimated directly, but the computational cost becomes substantial as the number of time steps and data points increases. Therefore, the proposed method, which solves a smaller problem for each relevance level, offers an advantage. The weighted sum of the coordinates estimated by the proposed method also represents a mixture of probability distributions, so the weights can be adjusted to accommodate subsequent tasks and visualizations after the vectors are estimated.
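
To make the overall procedure concrete, the following is a minimal NumPy sketch of the alternating estimation described above, under our reading of Algorithm 1: the m-projections are replaced by plain gradient-descent steps on the KL divergence, and the step sizes, iteration counts, and uniform mixture weights are illustrative choices rather than the authors' settings.

```python
import numpy as np

def joint_softmax(S):
    """q(o, u) ∝ exp(S_ou), normalized over all (object, feature) pairs (Eqs. (4), (5))."""
    E = np.exp(S - S.max())
    return E / E.sum()

def grad_X(P, X, Y):
    """Gradient of KL(P || Q(X, Y)) with respect to X."""
    return (joint_softmax(X @ Y.T) - P) @ Y

def grad_Y(P, X, Y):
    """Gradient of KL(P || Q(X, Y)) with respect to Y."""
    return (joint_softmax(X @ Y.T) - P).T @ X

def retract_qr(A):
    """Retraction onto the Stiefel manifold via the reduced QR decomposition."""
    D, U = np.linalg.qr(A)
    signs = np.sign(np.diag(U)); signs[signs == 0] = 1.0
    return D * signs

def fit(R_list, d, alpha=None, n_outer=30, n_inner=50, lr=1.0, seed=0):
    """Alternating estimation of {X^(k)} and a shared Y (cf. Algorithm 1 and Eq. (9)).
    Gradient-descent m-projections; step sizes and iteration counts may need tuning."""
    rng = np.random.default_rng(seed)
    K = len(R_list)
    n, m = R_list[0].shape
    alpha = np.full(K, 1.0 / K) if alpha is None else np.asarray(alpha)
    P_list = [joint_softmax(R) for R in R_list]            # empirical distributions P^(k)
    X_list = [rng.normal(scale=0.1, size=(n, d)) for _ in range(K)]
    Y = retract_qr(rng.normal(size=(m, d)))
    for _ in range(n_outer):
        # (a) update each X^(k) with Y fixed (m-projection onto M_Y(X))
        for k in range(K):
            for _ in range(n_inner):
                X_list[k] -= lr * grad_X(P_list[k], X_list[k], Y)
        # (b) per-dataset feature matrices Y^(k) with X^(k) fixed
        Y_list = []
        for k in range(K):
            Yk = Y.copy()
            for _ in range(n_inner):
                Yk -= lr * grad_Y(P_list[k], X_list[k], Yk)
            Y_list.append(Yk)
        # (c) e-mixture: convex combination of the Y^(k) (Eq. (15))
        Y_bar = sum(a * Yk for a, Yk in zip(alpha, Y_list))
        # (d) retraction onto the Stiefel manifold so that Y^T Y = I_d (Eq. (8))
        Y = retract_qr(Y_bar)
    return X_list, Y
```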

3.3 Appearance and disappearance of data

The proposed method can naturally estimate the vectors of time-series data, even when new objects appear as time passes. Let \(P^{(1)},P^{(2)}\), and \(P^{(3)}\) be the empirical distributions at times \(k=1,2,\) and 3, respectively, and let the sets of objects at each time be denoted by \({\mathcal {O}}', {\mathcal {O}}''\), and \({\mathcal {O}}'''\). In addition, the numbers of objects at the respective times satisfy \(|{\mathcal {O}}' |< |{\mathcal {O}}'' |< |{\mathcal {O}}''' |\). The proposed method works equally well for such time-series data regardless of the number of objects, by assuming that the number of features does not change with time and by fixing Y. However, in this case, the information geometrical interpretation changes slightly, as shown in Fig. 2. Since the number of objects changes from time \(k=1\) to time \(k=2\), the target model manifolds \({\mathcal {M}}'\) and \({\mathcal {M}}''\) are different. At time \(k=2\), the model manifold is expanded owing to the expanded domain of the parameter X,

$$\begin{aligned} {\mathcal {M}}'= & {} \left\{ Q(X, Y) \Big \vert q(o, u; X, Y) = \exp \left( \varvec{\theta }^{{\textsf{T}}} \Lambda \varvec{\xi } -\zeta \right) , \, X \in {\mathbb {R}}^{|{\mathcal {O}}' |\times d}, \, o \in {\mathcal {O}}' \right\} , \\ {\mathcal {M}}''= & {} \left\{ Q(X, Y) \Big \vert q(o, u; X, Y) = \exp \left( \varvec{\theta }^{{\textsf{T}}} \Lambda \varvec{\xi } -\zeta \right) , \, X \in {\mathbb {R}}^{|{\mathcal {O}}'' |\times d}, \, o \in {\mathcal {O}}'' \right\} . \end{aligned}$$

The estimated models are the subspaces \({\mathcal {M}}'_{Y_\textrm{fix}}(X) \subset {\mathcal {M}}'(X, Y)\) and \({\mathcal {M}}''_{Y_\textrm{fix}}(X) \subset {\mathcal {M}}''(X, Y)\), respectively, with the additional constraint that \(Y=Y_{\textrm{fix}}\) at each time. Therefore, the proposed method can be applied in the same way as Algorithm 1 to estimate the object matrices \(X^{(k)}\). In the case of \(k=3\), the mapping is processed in the same way.

Fig. 2

Geometric conceptual figure of data appearance. At time \(k=1\), \({\mathcal {M}}'\) is used as the model manifold, but at times \(k=2,3\), it is extended to \({\mathcal {M}}''\) and \({\mathcal {M}}'''\) owing to the increase in the number of objects. In addition, the vectors are estimated with the model manifold constrained by fixing \(Y=Y_{\textrm{fix}}\)

4 Experiments

Generally, when embedding real data, only the relevance is given, not the ground-truth vectors. We evaluate the effectiveness of the proposed method using artificial data with the true matrices \(X^{(k)}_*\) and \(Y_*\), and we also show as a case study that the proposed method works and behaves intuitively on real data. We compare the proposed method with three other methods: t-SNE [11] applied independently at each time (ind t-SNE), t-SNE applied to all data at once (concat t-SNE), and Dynamic t-SNE [15], which is a probabilistic embedding method that considers the time series. The evaluation criteria employed are Spearman’s rank correlation and MAP@H. We introduce some symbols used for MAP@H. Let \({\mathcal {N}}(o_i, X, X', h)\) be the set of h neighbors of the object \(o_i\) using the object matrices X and \(X'\),

$$\begin{aligned}&{\mathcal {N}}(o_i, X^{(k)}, X^{(k')}, h) \\&\quad = \left\{ o_j \Big \vert \Vert \varvec{x}_i^{(k)} - \varvec{x}_j^{(k')}\Vert ^2 \le \Vert \varvec{x}_i^{(k)} - \varvec{x}_l^{(k')}\Vert ^2,\ o_j \in {\mathcal {O}},\ o_l = o(X^{(k')}, h) \right\} . \end{aligned}$$

Here, \(o(X^{(k)}, h)\) is the h-th nearest object from o using the object matrix \(X^{(k)}\). MAP@H evaluates the match between the neighborhoods of the true vectors and those of the estimated vectors as

$$\begin{aligned} \textrm{AP}(o, k, k', H)&=\frac{1}{H} \sum _{h=1}^H \delta (o, k, k',h)\, \frac{|{\mathcal {N}}(o, X^{(k)}, X^{(k')}, h) \cap {\mathcal {N}}(o, X_*^{(k)}, X_*^{(k')},h) |}{h}, \\ \delta (o, k, k', h)&= \left\{ \begin{array}{cl} 1 &{} \left( o(X^{(k)}, h) \in {\mathcal {N}}(o, X_*^{(k)}, X_*^{(k')}, h) \right) \\ 0 &{} (\textrm{otherwise})\\ \end{array} \right. , \\ \textrm{MAP}@H&= \frac{1}{K} \frac{1}{|{\mathcal {O}} |} \sum _{o, k} \textrm{AP}(o, k, k', H). \end{aligned}$$

We also define a criterion to evaluate whether information is preserved across the time series. Normally, MAP@H is calculated within the kth relevance matrix (\(k'=k\)); in this case, we also calculate MAP@H using the kth and the \((k-1)\)th relevance matrices to verify whether the information between relevance levels is retained, which we call \(\Delta \)MAP@H in this paper.
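
A possible NumPy implementation of the MAP@H criterion, under our reading of the definition above, is sketched below for the case \(k'=k\); for \(\Delta \)MAP@H, the neighbor search would use \(X^{(k)}\) as queries against \(X^{(k-1)}\) instead.

```python
import numpy as np

def nearest(i, A, B, h):
    """Indices of the h rows of B closest (squared Euclidean) to row i of A."""
    dist = np.sum((B - A[i]) ** 2, axis=1)
    return np.argsort(dist)[:h]

def average_precision(i, X_est, X_true, H):
    """AP(o_i) with k' = k: agreement between estimated and true neighborhoods."""
    ap = 0.0
    for h in range(1, H + 1):
        est_h = nearest(i, X_est, X_est, h)
        true_h = nearest(i, X_true, X_true, h)
        hit = est_h[h - 1] in true_h          # the h-th retrieved object is a true neighbor
        ap += hit * len(np.intersect1d(est_h, true_h)) / h
    return ap / H

def map_at_h(X_est_list, X_true_list, H):
    """MAP@H averaged over all objects and all relevance levels."""
    scores = [average_precision(i, Xe, Xt, H)
              for Xe, Xt in zip(X_est_list, X_true_list)
              for i in range(Xe.shape[0])]
    return float(np.mean(scores))
```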

4.1 Mapping using artificial time-series relevance levels

We evaluate the mapping of time-series data from multiple relevance levels. Each object belongs to one of several clusters and moves toward the center of its cluster over time. We map vectors to this dataset and evaluate how well its characteristics are preserved. Let K be the time length and c(i) be the cluster to which each object \(o_i\) belongs. The speed at which an object moves toward the center is defined for each cluster and is written as v(c(i)). Artificial time-series data are then generated using Algorithm 2.
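
Algorithm 2 itself is not reproduced here, so the following generator is only a hypothetical sketch consistent with the description above: objects are assigned to clusters, move toward their cluster center at a per-cluster speed v(c(i)), and the relevance matrices are produced from the true factors.

```python
import numpy as np

def generate_toy_data(n=100, m=10, d=2, K=8, n_clusters=4, seed=0):
    """Hypothetical generator in the spirit of Algorithm 2: each object belongs
    to a cluster and moves toward the cluster center over time."""
    rng = np.random.default_rng(seed)
    c = rng.integers(n_clusters, size=n)                   # cluster label c(i)
    centers = rng.normal(scale=3.0, size=(n_clusters, d))  # cluster centers
    v = rng.uniform(0.05, 0.3, size=n_clusters)            # per-cluster speed v(c(i))
    Y_true = np.linalg.qr(rng.normal(size=(m, d)))[0]      # true feature matrix
    X = centers[c] + rng.normal(size=(n, d))               # initial object vectors
    X_true_list, R_list = [], []
    for _ in range(K):
        X_true_list.append(X.copy())
        R_list.append(X @ Y_true.T)                        # relevance rel(o_i, u_j) = x_i^T y_j
        X = X + v[c][:, None] * (centers[c] - X)           # move toward the cluster center
    return R_list, X_true_list, Y_true
```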

The quantitative evaluations are shown in Table 1. Experimental settings not listed in Table 1 include \(K=8\) and \(m=10\), and \(H=50\) is used for evaluation. The results of the proposed method show that for \(d \ge d_*\), the rank correlation and MAP@H are high, indicating that the original relationships are reconstructed. When the true data dimension exceeds the estimated dimension, i.e., when \(d_* > d\), the values of each evaluation index are low. The evaluation metric \(\Delta \)MAP@H represents the relationships among the vectors estimated from the different relevance matrices. When t-SNE is applied, we observe a significant decline in this value, suggesting a compromised representation of these relationships. In contrast, in the case of our proposed method, this value does not decrease, indicating that our method successfully maintains the relationships among the estimated vectors across relevance matrices. Thus, it is suggested that our proposed method effectively considers the relationships among the estimated vectors across relevance matrices. However, the proposed method is less accurate than the other methods when the dimension of the true data is large. The other methods show higher MAP@H and lower rank correlation than the proposed method, because they express the neighborhood relationship of each object rather than the whole relationship.

Table 1 Mapping results using multiple relevance levels

The results show that even for time-series data, vectors that reproduce each relevance level can be estimated. The mapping results for time-series data generated with true dimension \(d_*=2\) and time length \(K=8\) are shown in Fig. 3:

Fig. 3

The color-coded matrices for each cluster at each time k. In the independent t-SNE, which does not consider the relationships among different relevance levels, the vectors are mapped to distinct positions at each time step, even when they belong to the same cluster. By comparing the results with two other methods that account for the relationships among relevance levels, it becomes evident that the proposed method produces outcomes most similar to the ground truth data. While other methods converge at a uniform speed, the proposed method in particular captures differences in the speed of movement to the center and appropriately represents the cohesiveness of the clusters at varying times

We have shown that the proposed method preserves the relationship between times for the true vectors, while reproducing each relevance level.

4.2 Case study: DBLP dataset

In this case study, we use the DBLP dataset [25] as real data. This dataset consists of meta-information and citation information on 4,107,340 papers.

This section presents the results of a case study on providing vectors using this dataset. The data for providing the vectors are created by aggregating this dataset using the fields in Table 2. Let \({\mathcal {O}}\) be the set of terms in the titles and \({\mathcal {U}}\) be the set of venues in which the papers were published. The preprocessing used to create a dataset with parameters \(n=300\), \(m=30\), \(K=12\), and relevance matrices \(\{R^{(k)}\}_{k=1}^K\), where the start year is 2006 and the last year is 2017, is shown in Fig. 4.

Table 2 Data schema of the DBLP dataset (excerpt)
Fig. 4

Data flow of the preprocessing. First, papers in the target fields of study (FOS) and venues are extracted. Next, the terms in the title of each paper are counted. Stopwords [26] are then removed, and the top n words among those that occur at least once in each year are extracted. Details of the conditions are described in Appendix C

The mapping results for \(d=30\) are shown in Fig. 5. Here, we selected the most commonly used words for each venue at the two time points, 2006 and 2017, applied PCA to them, and then visualized the results in two dimensions.

Fig. 5

Two-dimensional representation of the relationship between the terms in the titles of papers published between 2010 and 2017

This figure shows that “deep” moved significantly to a position near “neural” and “network” from 2010 to 2017, while the positions of the other words did not change much. The reason for the movement of these words is assumed to be that “deep learning” has been discussed in various papers in recent years. The proposed method can thus capture trend changes given multiple relevance levels.

In this experiment, we showed that it is possible to acquire high-dimensional vectors that retain the relevance levels and, by applying PCA after the vectors are acquired, to visualize specific objects of interest in low dimensions. Furthermore, fixing the feature matrix Y for the features at all times also enables the comparison of the mappings as time-series data.
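
A minimal sketch of this post-hoc visualization step is given below; the word list and the matrices X_2010 and X_2017 in the usage comment are hypothetical placeholders for the estimated object matrices at two time steps.

```python
import numpy as np

def pca_2d(vectors):
    """Project row vectors to two dimensions via PCA (SVD of the centered matrix)."""
    Z = vectors - vectors.mean(axis=0)
    _, _, Vt = np.linalg.svd(Z, full_matrices=False)
    return Z @ Vt[:2].T

# hypothetical usage: selected words at two time steps, stacked and projected together
# idx = [vocab.index(w) for w in ["deep", "neural", "network"]]
# coords = pca_2d(np.vstack([X_2010[idx], X_2017[idx]]))
```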

5 Conclusion

In this paper, we proposed a method that provides vectors for objects on the basis of multiple relevance levels defined between the objects and features using stochastic embedding. It is difficult to interpret the results of existing stochastic embedding methods such as SNE and t-SNE when they are applied independently to each relevance level, because they are not designed to acquire vectors for multiple relevance levels (Fig. 3).

We solved these problems using a log-bilinear model that combines vectors of the objects and features. In this paper, we showed that the estimation of the object vectors using common feature vectors is equivalent to mapping to e-flat subspaces of the model manifold. In this method, the manipulation of the vectors corresponds to the manipulation of the probability distributions, since a convex combination of vectors is the e-mixture of the probability distributions.

We evaluated the effectiveness of the proposed method using artificially generated data and the DBLP dataset as real-world time-series data. The results show that the proposed method can provide appropriate vectors for the objects while preserving the trend changes across the multiple relevance levels for both the artificial and the real-world data. This method enables the visualization of relationships that change over time, such as in texts and videos. In the future, defining versatile classification tasks over multiple datasets will facilitate the establishment of a baseline for evaluating embedding methods across multiple datasets. When the goal is visualization itself, quantitative evaluations of the estimated vectors are also expected to be conducted through means such as questionnaires and interviews.