1 Introduction

Tabular data in the real world is stored in computers as multidimensional arrays such as matrices and tensors. Low-rank approximation [1] is a popular task that approximates a multidimensional array by a low-rank array, expressed as a linear combination of a small number of bases or principal components, in order to extract features, find patterns, and store data efficiently with a small memory footprint [2]. Applications are found in computer vision [3], recommender systems [4], denoising [5, 6], and machine learning [7]. The low-rank representation is typically obtained by minimizing the reconstruction error from the input array via singular value decomposition (SVD) or its extension to tensors, the higher-order singular value decomposition (HOSVD).

Non-negative low-rank approximation, a variant of low-rank approximation that imposes non-negativity on both input and output, has been widely studied and applied to many practical problems because non-negative representations are useful in image processing, count data, and audio analysis [8]. For instance, non-negative Tucker decomposition (NTD) approximates a tensor by the mode-product of a non-negative core tensor and matrices [9]. It is applied in various fields such as data mining [10, 11], bioinformatics [12], and mechanical engineering [13]. While non-negative low-rank approximation has many applications, the non-negativity constraint makes SVD-based optimization infeasible. As a result, it is usually implemented with gradient methods that iteratively reduce the reconstruction error [14,15,16]. However, gradient methods require appropriate choices of initial values, stopping criteria, and learning rates, which is often nontrivial in practice.

The present study proposes non-gradient-based methods for non-negative low-rank approximation of multidimensional arrays based on an information geometric analysis. We treat each input array as a probability distribution using a log-linear model on a poset [17], where the structure of the input array is encoded as a partial order. We describe the conditions for a tensor to be rank-reduced in terms of the \(\theta \)- and \(\eta \)-parameters of the model, which are canonically used as coordinate systems of a dually flat manifold in information geometry. We formulate non-negative low-rank approximation with this parameterization and realize it as a projection onto the subspace satisfying the constraints (the low-rank space). In this novel formulation, we show that the rank-r approximation can be realized by combining rank-1 approximations on each mode. By using the known formula for the best rank-1 approximation of a tensor [18], we develop a fast low-Tucker-rank approximation method—Legendre Tucker rank reduction (LTR)—which is not based on the gradient method.

In addition, if the target rank is 1, we show that the low-rank space consists of products of independent distributions. Approximating a joint distribution by a product of independent distributions is called mean-field approximation, a well-established method in physics [19]. The information-geometric formulation of low-rank approximation thus gives us a new perspective: rank-1 approximation can be viewed as a mean-field approximation. The rank-1 approximation is used when we want to quickly extract only the most significant factor from a large-scale input. Particular applications of rank-1 approximation include sound separation [20], analysis of fMRI images [21], 3D structure recovery [22], and optimization [23].

The flexibility of the log-linear model allows us to extend our formulation to related tasks such as non-negative multiple matrix factorization (NMMF), which decomposes multiple matrices simultaneously with shared factors; this task appears in purchase forecast systems [24] and recommender systems [25]. We find the best rank-1 NMMF in closed form by describing the rank-1 condition with shared factors in a dually flat coordinate system. As an application of the formula, we develop an efficient method for rank-1 non-negative matrix factorization (NMF) for matrices with missing entries. The key idea is to transform the cost function of NMF for missing data into that of NMMF by increasing the number of missing values, which enables us to use the closed form.

Finally, we explain the geometric relationship between tensor rank-1 approximation and tensor balancing based on an information-geometric analysis. Tensor balancing—the operation of rescaling the axial sums of a tensor—has already been formulated using the log-linear model on posets [17]. If both the balancing and rank-1 conditions are imposed on a tensor simultaneously, all parameters of the distribution are fixed, so the distribution is uniquely determined. This implies that the balancing of a rank-1 tensor is unique.

Preliminary versions of this paper were presented at NeurIPS 2021 [26] and AISTATS 2022 [27]. In [27], we discussed NMMF for three input matrices; in this study, we extend it to four input matrices and find a more general rank-1 approximation formula in closed form. This enables us to exactly solve the missing NMF with a rank-2 weight matrix, as shown in Sect. 3.2.3. Moreover, we point out for the first time the information-geometric relationship between rank-1 approximation and tensor balancing in Sect. 3.3.8.

2 Preliminaries

We provide the notations in Sect. 2.1 and briefly explain our modeling framework, the log-linear model on a partially ordered set (poset), which was introduced in [17], in Sect. 2.2.

2.1 Notation

Tensors are denoted by calligraphic capital letters such as \(\mathcal {T}\) and \(\mathcal {P}\). Matrices are denoted by bold capital letters such as \(\textbf{X}\) and \(\textbf{Y}\). Each element is denoted as \(\mathcal {T}_{i_1,\dots ,i_d}\) and \(\textbf{X}_{ij}\). The mode-k expansion of a tensor \(\mathcal {T}\) is denoted by \(\textbf{T}^{(k)}\). Vectors are denoted by lower-case bold letters such as \({\varvec{a}}\) and \({\varvec{b}}\). We regard vectors and matrices as particular cases of tensors. The total sum of a vector, a matrix, or a tensor is represented as \(S(\cdot )\). The i-th component of a vector \({\varvec{a}}\) is written with its non-bold letter as \(a_i\). The Kronecker product of two vectors \({\varvec{a}}\) and \({\varvec{b}}\) is denoted by \( ({\varvec{a}} \otimes {\varvec{b}})\), which is a rank-1 matrix whose elements are defined as \(({\varvec{a}}\otimes {\varvec{b}})_{ij} = a_ib_j\). The \(I \times J\) all-one and all-zero matrices are denoted by \({\varvec{1}}_{I\times J}\) and \({\varvec{0}}_{I\times J}\), respectively. The identity matrix is denoted by \(\textbf{I}\). The transpose of a matrix \(\textbf{X}\) is denoted by \(\textbf{X}^\textrm{T}\). The element-wise product of two matrices \(\textbf{A}\) and \(\textbf{B}\) is denoted by \(\textbf{A}\circ \textbf{B}\). For a pair of natural numbers n and \(m\ (\ge n)\), we write \([n,m] = \left\{ n,n+1,\dots ,m-1,m \right\} \) and abbreviate [1, m] as [m]. We use \(\mathcal {P}_{a^{(k)}:b^{(k)}}\) to denote the subtensor obtained by restricting the k-th index to the range from a to b. The set difference of B and A is denoted by \(B \backslash A\). The Kullback–Leibler (KL) divergence \(D(\mathcal {P},\mathcal {T})\) between non-negative tensors \(\mathcal {P}\) and \(\mathcal {T}\) is defined as follows [9]:

$$\begin{aligned} D(\mathcal {P}, \mathcal {T}) = \sum _{i_1=1}^{I_1} \!\dots \! \sum _{i_d=1}^{I_d} \!\left\{ \mathcal {P}_{i_1 \dots i_d} \log { \frac{ \mathcal {P}_{i_1 \dots i_d}}{ \mathcal {T}_{i_1 \dots i_d}} - \mathcal {P}_{i_1 \dots i_d} + \mathcal {T}_{i_1 \dots i_d}} \right\} . \end{aligned}$$

We define \(\textrm{Rank}(\textbf{X})\) as the rank of a matrix \(\textbf{X}\).
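As a concrete reference for this notation, here is a minimal NumPy sketch of the total sum \(S(\cdot )\) and the KL divergence above; the function names are ours:

```python
import numpy as np

def total_sum(t):
    """S(.): the total sum of all entries of a vector, matrix, or tensor."""
    return float(np.sum(t))

def kl_divergence(p, t):
    """Generalized KL divergence D(P, T) for positive arrays of the same shape."""
    p, t = np.asarray(p, dtype=float), np.asarray(t, dtype=float)
    return float(np.sum(p * np.log(p / t) - p + t))

P = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
print(total_sum(P))               # 21.0
print(kl_divergence(P, P))        # 0.0: the divergence vanishes iff P = T
print(kl_divergence(P, 2.0 * P))  # > 0
```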

2.2 Modeling

This paper discusses non-negative low-rank approximations of tensors, matrices, and multiple matrices. To handle the various structures of such inputs, we use the log-linear model on posets, a flexible statistical model in which the structure is specified by designing an appropriate partial order [17]. Furthermore, we describe the low-rank conditions using the model’s parameters. Consequently, we formulate the low-rank approximation as a projection onto a subspace satisfying the conditions that some natural parameters become zero. We provide an overview of the log-linear model on posets in Sect. 2.2.1 and the general projection theory in information geometry in Sect. 2.2.2.

2.2.1 Reminder of the log-linear model on posets

The log-linear model on a poset [17] is a generalization of Boltzmann machines [28], where we can flexibly design interactions between variables using partial orders. A poset \((\Omega , \le )\) is a set \(\Omega \) of elements equipped with a partial order \(\le \) on \(\Omega \), where the relation “\(\le \)” satisfies the following three properties: for all \(x,y,z \in \Omega \), (1) \(x \le x\), (2) \(x \le y, y\le x \Rightarrow x = y\), and (3) \(x \le y,y \le z \Rightarrow x \le z\). We consider a discrete probability distribution p on a poset \((\Omega ,\le )\), treated as a mapping \(p:\Omega \rightarrow (0,1)\) such that \( \sum _{x \in \Omega } p(x) = 1\); each value p(x) is assumed to be strictly larger than zero. We assume that the structured domain \(\Omega \) has the least element \(\perp \); that is, \(\perp \le x\) for all \(x \in \Omega \). The log-linear model for a distribution p on \((\Omega , \le )\) is defined as

$$\begin{aligned} \log {p}(x) = \sum _{s\le x} \theta (s), \end{aligned}$$
(1)

where \(\theta ({\perp })\) corresponds to the normalizing factor (partition function). The convex quantity \(\psi (\theta ) = -\theta (\perp )\), the sign inverse of \(\theta ({\perp })\), is called the Helmholtz free energy of p. This model belongs to the exponential family, and \(\theta \) corresponds to the natural parameters except for \(\theta ({\perp })\).

The log-linear model’s natural parameter \(\theta \) uniquely identifies the distribution p. Using \(\theta \) as a coordinate system in the set of distributions, which is a typical approach in information geometry [29], we can draw the following geometric picture: Each point in the \(\theta \)-coordinate system corresponds to a distribution. Moreover, because the log-linear model belongs to the exponential family, we can also identify a distribution by expectation parameters defined as

$$\begin{aligned} \eta (x) = \sum _{x \le s} p(s). \end{aligned}$$
(2)

Using the Möbius function [30], inductively defined as

$$\begin{aligned} \mu (x,y)&= {\left\{ \begin{array}{ll} 1 &{}\quad \text {if } x=y, \\ -\sum _{x \le s< y} \mu (x,s) &{}\quad \text {if } x < y, \\ 0 &{}\quad \textrm{otherwise}, \end{array}\right. } \end{aligned}$$

each distribution can be described as

$$\begin{aligned} p_\eta (x) = \sum _{s\in \Omega } \mu (x,s)\eta (s). \end{aligned}$$
(3)

We write \(p_\eta \) if p is determined by the expectation parameter \(\eta \). We can also identify each point using the \(\eta \)-coordinate system. Each expectation parameter \(\eta (x)\) coincides with the expected value \(\mathbb {E}[F_x(s)]\) of the function \(F_x(s)\) such that \(F_x(s) = 1\) if \(x \le s\) and 0 otherwise [31]. Technically, \(\eta ({\perp })\) is not an expectation parameter, as it is always 1 by definition.

In addition, the \(\theta \)-coordinate and the \(\eta \)-coordinate are orthogonal to each other, which guarantees that we can combine them into a mixture coordinate; a point specified by the mixture coordinate also identifies a distribution uniquely [29].
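To make the two coordinate systems concrete, the following minimal NumPy sketch computes the \(\theta \)- and \(\eta \)-coordinates of a small positive matrix regarded as a distribution on a grid poset (the tensor versions appear later in Eqs. (13) and (14)); all variable names are ours:

```python
import numpy as np

# A strictly positive 2 x 3 matrix normalized into a distribution p on the
# grid poset Omega = [2] x [3] with (s,t) <= (s',t') iff s <= s' and t <= t'.
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
p = X / X.sum()
logp = np.log(p)

# theta-coordinates: Moebius differences of log p along the partial order;
# theta[0, 0] is the normalizing factor theta(bottom).
theta = np.zeros_like(logp)
theta[0, 0] = logp[0, 0]
theta[1:, 0] = np.diff(logp[:, 0])
theta[0, 1:] = np.diff(logp[0, :])
theta[1:, 1:] = logp[1:, 1:] - logp[:-1, 1:] - logp[1:, :-1] + logp[:-1, :-1]

# eta-coordinates: eta(x) = sum of p over all elements above x in the order.
eta = np.flip(np.cumsum(np.cumsum(np.flip(p), axis=0), axis=1))

# Checks: log p is recovered from theta by cumulative sums, and eta(bottom) = 1.
print(np.allclose(np.cumsum(np.cumsum(theta, axis=0), axis=1), logp))  # True
print(round(eta[0, 0], 6))                                             # 1.0
```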

2.2.2 Projection theory in information geometry

We consider the set of distributions \(\mathcal {S}\). We achieve non-negative low-rank approximations by projection onto a subspace \(\mathcal {Q}\subseteq \mathcal {S}\). In a subspace parameterized by \(\theta \)- and \(\eta \)-coordinate systems, two types of geodesics—m- and e-geodesics from a point \(q_1\in \mathcal {Q}\) to \(q_2\in \mathcal {Q}\)—can be defined as

$$\begin{aligned} \{r_t \mid r_t=(1-t)q_1+tq_2 \},\quad \{r_t \mid \log {r}_t=(1-t)\log {q_1}+t\log {q_2}-\phi (t) \}, \end{aligned}$$

respectively, where \(0\le t\le 1\) and \(\phi (t)\) is a normalization factor that keeps \(r_t\) a distribution. A subspace is called e-flat when any e-geodesic connecting two points in the subspace is included in the subspace. Dropping an m-geodesic orthogonally from a point \(p\in \mathcal {S}\) onto a point q in an e-flat subspace \(\mathcal {Q}_e\) is called m-projection. Similarly, the e-projection is obtained by exchanging the roles of e and m. The flatness of the subspaces guarantees the uniqueness of the projection destination. The projection destination \(r_m\) or \(r_e\) obtained by the m- or e-projection onto \(\mathcal {Q}_e\) or \(\mathcal {Q}_m\) minimizes the following KL divergence:

$$\begin{aligned} r_m = \mathop {\hbox {argmin}}\limits _{q\in \mathcal {Q}_e } D(p;q), \quad r_e = \mathop {\hbox {argmin}}\limits _{q\in \mathcal {Q}_m } D(q;p). \end{aligned}$$

We assume that distributions in \(\mathcal {S}\) are parameterized by N natural parameters. Let \(\mathcal {Q}\) be the set of distributions satisfying the condition \(f(\varvec{\theta }_{1:n})=0\) for a linear function \(f(\cdot )\) of a part of the natural parameters \(\varvec{\theta }_{1:n}= (\theta _{1},\dots ,\theta _{n})\), where we assume that this part runs from 1 to n \((\le N)\) without loss of generality. Since a subspace defined by linear constraints on natural parameters is e-flat [29, Chapter 2.4], the m-projection from p onto \(\mathcal {Q}\) is unique. This m-projection does not change the remaining expectation parameters \(\varvec{\eta }_{n+1:N}=(\eta _{n+1},\dots ,\eta _{N} )\) [32, Chapter 11.3]. We call this property the expectation conservation law of the m-projection, which is the key idea for obtaining analytical solutions to non-negative rank-1 approximations.
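As an illustration in the matrix case, the best rank-1 approximation under the KL divergence (the \(d=2\) case of the closed formula [18] recalled later in Eq. (11)) is the outer product of the row and column sums divided by the total sum, and this m-projection leaves the one-body \(\eta \)-parameters, equivalently the row and column sums, unchanged. A minimal NumPy check:

```python
import numpy as np

P = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])

# Best rank-1 approximation minimizing the generalized KL divergence
# (the d = 2 case of Eq. (11)): outer product of marginals over the total sum.
row, col = P.sum(axis=1), P.sum(axis=0)
P1 = np.outer(row, col) / P.sum()

# Expectation conservation law of the m-projection: the one-body
# eta-parameters, equivalently the row and column sums, are unchanged.
print(np.allclose(P1.sum(axis=1), row), np.allclose(P1.sum(axis=0), col))  # True True
```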

3 The proposed methods

We propose our low-rank approximation methods with a focus on their relationship to information geometry. In the following, we regard a normalized input array as a distribution using the log-linear model on posets. The sample space of the distribution is the index set of the array, and the probability of realizing an index is the corresponding value of the multidimensional array at that index. For example, for a positive matrix \(\textbf{X}\in \mathbb {R}_{> 0}^{2 \times 3}\), the sample space of the model is the index set \(\Omega =[2]\times [3]\) and the probability is given as \(p(i,j)=\textbf{X}_{ij}/ \sum _{ij}\textbf{X}_{ij} \) for random variables \((i,j)\in \Omega \).

The distribution is parameterized by \(\theta \)-parameters and \(\eta \)-parameters. We can describe low-rank conditions using \(\theta \) and \(\eta \) and thereby formulate the low-rank approximation as a projection onto the subspace satisfying the condition.

The advantage of our approach is that low-rank approximation via our formulation is always a convex optimization owing to the flatness of the subspace. The limitation of our approach is that, since the proposed methods are based on the log-linear model, they can only treat positive arrays and cannot handle zero or negative values. Although experimental results show the usefulness of our methods even for input arrays including zeros, we always assume that every element of the input is strictly positive in our theoretical discussion.

Since our approach is based on an information geometric analysis, we naturally adopt the KL divergence as the cost function. It is known that optimizing the KL divergence prevents overfitting for sound data, noisy data, and purchasing data generated from Poisson distributions [33]. This is why many studies of robust low-rank approximation minimize the KL divergence [34, 35].

Before discussing the low-rank approximation of tensors, we begin with a particular case of tensors: matrices. We introduce the best rank-1 NMMF formula in closed form in Sect. 3.1. Then, as an application of the formula, we introduce a faster rank-1 NMF method for missing data in Sect. 3.2. By extending these discussions to tensors, we introduce a fast non-gradient-based low-Tucker-rank approximation method, LTR, in Sect. 3.3.

3.1 Best rank-1 NMMF

In this section, as an example of how an information geometric discussion enables us to formulate non-negative low-rank approximation without gradient methods, we focus on NMMF, a variant of matrix factorization. First, we define rank-1 NMMF in Sect. 3.1.1. We then provide the best rank-1 approximation formula in Sect. 3.1.2. In Sects. 3.1.3–3.1.4, we introduce the information geometric formulation of NMMF using the log-linear model and derive the closed-form solution. Although these sections can be viewed as a proof of the closed-form solution, we provide them in the main text because they include various interesting relations between NMMF and information geometry, particularly the characterization of rank-1 approximation via parameters of the exponential family, which we believe is valuable for further development of this line of research.

Fig. 1 A sketch of rank-1 NMMF with four input matrices for \(I=J=N=M=L=3\). The task approximates four input matrices with shared factors

3.1.1 Rank-1 NMMF

Rank-1 NMMF simultaneously decomposes four matrices \(\textbf{X}\in \mathbb {R}_{\ge 0}^{I \times J}\), \(\textbf{Y}\in \mathbb {R}_{\ge 0}^{N \times J}\), \(\textbf{Z}\in \mathbb {R}_{\ge 0}^{I \times M}\) and \(\textbf{U}\in \mathbb {R}_{\ge 0}^{L \times M}\) into four rank-1 matrices \({\varvec{w}}\otimes {\varvec{h}}\), \({\varvec{a}}\otimes {\varvec{h}}\), \({\varvec{w}}\otimes {\varvec{b}}\) and \({\varvec{c}}\otimes {\varvec{b}}\), respectively, using non-negative vectors \({\varvec{w}}\in \mathbb {R}_{\ge 0}^{I}, {\varvec{h}}\in \mathbb {R}_{\ge 0}^{ J}, {\varvec{a}}\in \mathbb {R}_{\ge 0}^{N}\), \({\varvec{b}}\in \mathbb {R}_{\ge 0}^{M}\) and \({\varvec{c}}\in \mathbb {R}_{\ge 0}^{L}\). The cost function of NMMF is defined as

$$\begin{aligned} D(\textbf{X},{\varvec{w}}\otimes {\varvec{h}})+\alpha D(\textbf{Y},{\varvec{a}}\otimes {\varvec{h}})+\beta D(\textbf{Z},{\varvec{w}}\otimes {\varvec{b}})+\gamma D(\textbf{U}, {\varvec{c}}\otimes {\varvec{b}}), \end{aligned}$$
(4)

and the task of rank-1 NMMF is to find vectors \({\varvec{w}}\), \({\varvec{h}}\), \({\varvec{a}}\), \({\varvec{b}}\) and \({\varvec{c}}\) that minimize the above cost. We assume that the scaling parameters \(\alpha \), \(\beta \) and \(\gamma \) are non-negative real numbers. We provide a sketch of the task in Fig. 1.

3.1.2 A closed formula of the best rank-1 NMMF

We give the following closed form of the best rank-1 NMMF that exactly minimizes the cost function in Eq. (4), which is one of our main theoretical contributions. This formula efficiently extracts only the most dominant shared factors from four input matrices.

Theorem 1

(the best rank-1 NMMF) For any four positive matrices \(\textbf{X} \in \mathbb {R}^{I\times J}_{> 0}\), \(\textbf{Y}\in \mathbb {R}_{> 0}^{N \times J}\), \(\textbf{Z}\in \mathbb {R}_{> 0}^{I \times M}\), and \(\textbf{U}\in \mathbb {R}_{> 0}^{L \times M}\) and three parameters \(\alpha , \beta , \gamma \ge 0\), the five non-negative vectors \({\varvec{w}}\in \mathbb {R}_{\ge 0}^{I}, {\varvec{h}}\in \mathbb {R}_{\ge 0}^{ J}, {\varvec{a}}\in \mathbb {R}_{\ge 0}^{N}, {\varvec{b}}\in \mathbb {R}_{\ge 0}^{M}\), and \({\varvec{c}}\in \mathbb {R}_{\ge 0}^{L}\) that minimize the cost function in Eq. (4) are given as

$$\begin{aligned} \begin{aligned} w_i&=\frac{\sqrt{S(\textbf{X})}}{S(\textbf{X})+\beta S(\textbf{Z})}\left( \sum _{j=1}^J \textbf{X}_{ij} + \beta \sum _{m=1}^M \textbf{Z}_{im} \right) , \\ h_j&=\frac{\sqrt{S(\textbf{X})}}{S(\textbf{X})+\alpha S(\textbf{Y})}\left( \sum _{i=1}^I \textbf{X}_{ij} + \alpha \sum _{n=1}^N \textbf{Y}_{nj} \right) , \\ a_n&= \frac{1}{\sqrt{S(\textbf{X})}}\left( \sum _{j=1}^J \textbf{Y}_{nj}\right) , \ \ \ \ c_l = \frac{\sqrt{S(\textbf{X})}}{S(\textbf{Z})}\left( \sum _{m=1}^M \textbf{U}_{lm}\right) , \\ b_m&= \frac{S(\textbf{Z})}{\beta S(\textbf{Z}) + \gamma S(\textbf{U})} \frac{1}{\sqrt{S(\textbf{X})}} \left( \beta \sum _{i=1}^I \textbf{Z}_{im} + \gamma \sum _{l=1}^L \textbf{U}_{lm}\right) . \end{aligned} \end{aligned}$$

The time complexity to obtain the best rank-1 NMMF is \(O(IJ+NJ+IM+LM)\) because all we need is to take the summation of each column and row of the matrices \(\textbf{X,Y,Z}\) and \(\textbf{U}\). If \(N=M=0\), our result in Theorem 1 coincides with the best rank-1 NMF minimizing the KL divergence from an input matrix \(\textbf{X}\) shown in [36]. There is a known closed formula for the best rank-1 approximation that minimizes the KL divergence from a tensor [18]. The above case for \(N=M=0\) corresponds to the formula with \(d=2\), where d is the order of the tensor.
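The closed formula in Theorem 1 translates directly into a few lines of array code; the following is a minimal NumPy sketch for strictly positive inputs, with names of our own choosing:

```python
import numpy as np

def best_rank1_nmmf(X, Y, Z, U, alpha=1.0, beta=1.0, gamma=1.0):
    """Closed-form best rank-1 NMMF (Theorem 1); all names are ours.

    Returns w, h, a, b, c such that X ~ w (x) h, Y ~ a (x) h,
    Z ~ w (x) b, and U ~ c (x) b minimize the cost in Eq. (4).
    """
    SX, SY, SZ, SU = X.sum(), Y.sum(), Z.sum(), U.sum()
    root = np.sqrt(SX)
    w = root / (SX + beta * SZ) * (X.sum(axis=1) + beta * Z.sum(axis=1))
    h = root / (SX + alpha * SY) * (X.sum(axis=0) + alpha * Y.sum(axis=0))
    a = Y.sum(axis=1) / root
    c = root / SZ * U.sum(axis=1)
    b = SZ / (beta * SZ + gamma * SU) / root * (beta * Z.sum(axis=0) + gamma * U.sum(axis=0))
    return w, h, a, b, c

# Example with random positive inputs, (I, J, N, M, L) = (4, 5, 3, 2, 6).
rng = np.random.default_rng(0)
X, Y = rng.random((4, 5)) + 0.1, rng.random((3, 5)) + 0.1
Z, U = rng.random((4, 2)) + 0.1, rng.random((6, 2)) + 0.1
w, h, a, b, c = best_rank1_nmmf(X, Y, Z, U)
print(np.outer(w, h).shape, np.outer(c, b).shape)  # reconstructions of X and U
```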

3.1.3 Posets for NMMF

Fig. 2 a A partial order structure for NMMF for four input matrices \(\textbf{X}\in \mathbb {R}_{>0}^{I\times J}\), \(\textbf{Y}\in \mathbb {R}_{>0}^{N \times J}\), \(\textbf{Z}\in \mathbb {R}_{>0}^{I\times M}\), and \(\textbf{U}\in \mathbb {R}_{>0}^{L\times M}\). Only \(\theta \)-parameters on gray-colored nodes can have non-zero values if and only if \((\textbf{X},\textbf{Y},\textbf{Z},\textbf{U})\) is simultaneously rank-1 decomposable. b Information geometric view of rank-1 NMMF. Rank-1 NMMF is the m-projection onto the simultaneous rank-1 subspace from a tuple of four input matrices, where one-body \(\eta \)-parameters do not change

The input of NMMF is a tuple \(\left( \textbf{X},\textbf{Y},\textbf{Z},\textbf{U}\right) \), where \(\textbf{X} \in \mathbb {R}^{I\times J}_{> 0}\), \(\textbf{Y}\in \mathbb {R}_{> 0}^{N \times J}\), \(\textbf{Z}\in \mathbb {R}_{> 0}^{I \times M}\), and \(\textbf{U}\in \mathbb {R}_{> 0}^{L \times M}\). For simplicity, we normalize them beforehand so that their total sum is 1; that is, \(S(\textbf{X})+S(\textbf{Y})+S(\textbf{Z})+S(\textbf{U})=1\). It is straightforward to eliminate this assumption using the property of the KL divergence \(\mu D\left( \textbf{X},\textbf{Y}\right) =D\left( \mu \textbf{X},\mu \textbf{Y}\right) \) for any non-negative number \(\mu \). We model these four matrices using a single discrete distribution on a partially ordered sample space.

To make a one-to-one mapping from \(\left( \textbf{X},\textbf{Y},\textbf{Z},\textbf{U}\right) \) to a probability mass function p, we prepare the index set \(\Omega \) as

$$\begin{aligned} \Omega&= \Omega _{\textbf{X}} \cup \Omega _{\textbf{Y}} \cup \Omega _{\textbf{Z}} \cup \Omega _{\textbf{U}}, \ \text {where}\\ \Omega _{\textbf{X}}&= [N+1,I+N] \times [J],\quad \Omega _{\textbf{Y}} = [N] \times [J],\\ \Omega _{\textbf{Z}}&= [N+1, I+N] \times [J+1, J+M], \\ \Omega _{\textbf{U}}&= [N+I+1,N+I+L] \times [J+1,J+M], \end{aligned}$$

where the subspace \(\Omega _{\textbf{X}}\) corresponds to the index set of \(\textbf{X}\), \(\Omega _{\textbf{Y}}\) to \(\textbf{Y}\), \(\Omega _{\textbf{Z}}\) to \(\textbf{Z}\), and \(\Omega _{\textbf{U}}\) to \(\textbf{U}\). Then, we define the following partial order “\(\le \)” between elements (s, t) of the index set \(\Omega \): \((s,t) \le (s',t') \Leftrightarrow s \le s' \mathrm {\ and \ } t \le t'\). We regard the multiple matrices \(\left( \textbf{X},\textbf{Y},\textbf{Z},\textbf{U}\right) \) as a distribution of the log-linear model on the poset \((\Omega , \le )\),

$$\begin{aligned} p(s,t) = \exp \left( \sum _{(s',t') \le (s,t)} \theta _{s't'} \right) , \end{aligned}$$
(5)

where \((s,t)\in \Omega \). The \(\theta \)-parameters \(\{\theta _{{2}1},\dots ,\theta _{N+I+L,J+M}\}\) are identified so that they satisfy

$$\begin{aligned} \textbf{X}_{ij}=p(N+i,j), \ \&\textbf{Y}_{nj}=p(n,j), \\ \textbf{Z}_{im}=p(N+i,J+m), \ \ {}&\textbf{U}_{lm}=p(N+I+l,J+m), \end{aligned}$$

for \(i\in [I]\), \(j\in [J]\), \(n\in [N]\), \(m\in [M]\), and \(l\in [L]\), and the normalizing factor \(\theta _{11}\) is uniquely determined from them. Figure 2 illustrates the partial order for the input tuple \((\textbf{X},\textbf{Y},\textbf{Z},\textbf{U})\).

There are other possible ways to model \(\left( \textbf{X},\textbf{Y},\textbf{Z},\textbf{U}\right) \) as a probability distribution using a different partial order structure. However, the solution formula that we obtain in Theorem 1 does not depend on this modeling choice.

Using results provided in [17] based on the incidence algebra between \(\theta \)- and \(\eta \)-parameters, we can also obtain the \(\eta \)-parameters using the following formula:

$$\begin{aligned} \eta _{st}=\sum _{(s,t)\le (s',t')}p(s',t'). \end{aligned}$$

To make the following discussion clear, for all \(i\in [I],j\in [J],n\in [N],m\in [M]\) and \(l\in [L]\), we define

$$\begin{aligned} \eta ^{\textbf{Y}}_{nj} = \eta _{n,j}, \ \ \eta ^{\textbf{X}}_{ij} = \eta _{N+i,j}, \ \ \eta ^{\textbf{Z}}_{im} = \eta _{N+i,J+m}, \ \ \eta ^{\textbf{U}}_{lm} = \eta _{N+I+l,J+m}. \end{aligned}$$

3.1.4 Derivation of the exact solution of rank-1 NMMF

Let \({\varvec{w}}\in \mathbb {R}_{\ge 0}^I, {\varvec{h}}\in \mathbb {R}_{\ge 0}^J, {\varvec{a}}\in \mathbb {R}_{\ge 0}^N\), \({\varvec{b}}\in \mathbb {R}_{\ge 0}^M\), and \({\varvec{c}}\in \mathbb {R}_{\ge 0}^L\). If four positive matrices \(\textbf{X} \in \mathbb {R}^{I\times J}_{> 0}\), \(\textbf{Y}\in \mathbb {R}_{> 0}^{N \times J}\), \(\textbf{Z}\in \mathbb {R}_{> 0}^{I \times M}\), and \(\textbf{U}\in \mathbb {R}_{> 0}^{L \times M}\) can be decomposed into a form \({\varvec{w}}\otimes {\varvec{h}}, {\varvec{a}}\otimes {\varvec{h}}\), \({\varvec{w}}\otimes {\varvec{b}}\), and \({\varvec{c}}\otimes {\varvec{b}}\), we say that \(\left( \textbf{X,Y,Z,U}\right) \) is simultaneously rank-1 decomposable.

We define the terms one-body and many-body parameters to describe low-rank conditions for multiple matrices and tensors. A one-body parameter is a parameter for which at least \(d-1\) of its indices are 1; for example, \(\theta _{1, 3}\) and \(\eta _{5, 1}\) are one-body parameters for \(d=2\), where d is the order of the tensor. Parameters other than one-body parameters are called many-body parameters. Gray-colored nodes in Fig. 2a correspond to one-body parameters. We treat the matrix case in this section—that is, d is always 2—and we will discuss the general case in Sect. 3.3.5. This categorization of parameters is inspired by the study of the Boltzmann machine [28], a special case of the log-linear model [17], where a one-body parameter corresponds to a bias and a many-body parameter corresponds to a weight. In the Boltzmann machine, if all weights are zero, the distribution modeled by the machine is represented by a product of independent distributions. In the same way, if all the many-body parameters are 0, the distribution is represented by a product of independent distributions. Approximating a joint distribution by a product of independent distributions is called mean-field approximation, an approach frequently used in physics. Furthermore, as the following proposition shows, the rank of a multidimensional array is 1 if all of its many-body parameters are 0. We discuss the relationship between the mean-field approximation and the rank-1 approximation in more detail in Sect. 3.3.5.

Proposition 1

(simultaneous rank-1 \(\theta \)-condition) A tuple \(\left( \textbf{X,Y,Z,U}\right) \) is simultaneously rank-1 decomposable if and only if its all many-body \(\theta \)-parameters are 0.

See the Supplement for the proof. We call a subspace that satisfies the simultaneous rank-1 condition a simultaneous rank-1 subspace. From the viewpoint of information geometry, we can understand the best rank-1 NMMF as follows (see Fig. 2b). The input of NMMF \(\left( \textbf{X},\textbf{Y},\textbf{Z},\textbf{U}\right) \) corresponds to a point in the space described by the \(\theta \)-coordinate system. The best rank-1 NMMF is the m-projection of this point onto the simultaneous rank-1 subspace.

Since the m-projection is a convex optimization, we can obtain the projection destination by a gradient method. However, such a method requires appropriate settings for initial values, stopping criteria, and learning rates.

Our closed analytical formula of the projection destination in Theorem 1 addresses all the drawbacks of the gradient-based optimization. According to the expectation conservation law in this m-projection onto simultaneous rank-1 subspace, one-body \(\eta \)-parameters do not change in the m-projection. That is, for any \(i\in [I]\), \(j\in [J]\), \(n\in [N]\), \(m\in [M]\) and \(l\in [L]\),

$$\begin{aligned} \eta ^{{\textbf{Y}}}_{n1} = \overline{\eta }^{{\textbf{Y}}}_{n1}, \quad \eta ^{{\textbf{Y}}}_{1j} = \overline{\eta }^{{\textbf{Y}}}_{1j}, \quad \eta ^{{\textbf{X}}}_{i1} = \overline{\eta }^{{\textbf{X}}}_{i1}, \quad \eta ^{{\textbf{Z}}}_{1m} = \overline{\eta }^{{\textbf{Z}}}_{1m}, \quad \eta ^{{\textbf{U}}}_{l1} = \overline{\eta }^{{\textbf{U}}}_{l1} \end{aligned}$$
(6)

where \(\eta \) is the expectation parameter of input, and \(\overline{\eta }\) is the expectation parameter after the m-projection onto simultaneous rank-1 subspace. By the definition of expectation parameters, we obtain

$$\begin{aligned} {\eta }^{\textbf{Y}}_{n1} - {\eta }^{\textbf{Y}}_{n+1,1}&= a_n S({\varvec{h}}), \quad {\eta }^{\textbf{X}}_{i1} - {\eta }^{\textbf{X}}_{i+1,1} = w_i\left( S({\varvec{h}}) + S({\varvec{b}}) \right) , \\ {\eta }^{\textbf{U}}_{l1} - {\eta }^{\textbf{U}}_{l+1,1}&= c_lS({\varvec{b}}), \quad {\eta }^{\textbf{Y}}_{1j} - {\eta }^{\textbf{Y}}_{1,j+1} = \left( S({\varvec{a}}) + S({\varvec{w}}) \right) h_j, \\ {\eta }^{\textbf{Z}}_{1m} - {\eta }^{\textbf{Z}}_{1,m+1}&= \left( S({\varvec{w}}) + S({\varvec{c}}) \right) b_m. \end{aligned}$$

The expectation conservation law in Eq. (6) guarantees that the values of the left-hand sides are the same before and after the m-projection. Since the sum of each matrix \(\textbf{X}, \textbf{Y}, \textbf{Z},\) and \(\textbf{U}\) is represented by one-body \(\eta \)-parameters, the sum of each matrix also does not change in the m-projection. Using these facts and combining the above equations, we can derive Theorem 1. A complete proof of Theorem 1 is provided in the Supplementary Material.

3.2 Rank-1 missing NMF

As an application of the closed form in Theorem 1, we develop an efficient method to solve rank-1 NMF for missing data. Rank-1 NMF for a given matrix \(\textbf{X}\in \mathbb {R}_{\ge 0}^{I \times J}\) with missing values (rank-1 missing NMF) is the task of finding two non-negative vectors \({\varvec{w}}\in \mathbb {R}^{I}_{\ge 0}\) and \({\varvec{h}}\in \mathbb {R}^{J}_{\ge 0}\) that minimize a weighted cost function \(D_{\Phi }(\textbf{X},{\varvec{w}}\otimes {\varvec{h}})\) defined as

$$\begin{aligned} D_{\varvec{\Phi }}(\textbf{X},{\varvec{w}}\otimes {\varvec{h}})=D(\varvec{\Phi }\circ \textbf{X}, \varvec{\Phi }\circ \left( {\varvec{w}}\otimes {\varvec{h}}\right) ) \end{aligned}$$
(7)

for a binary weight matrix \(\varvec{\Phi }\in \{0,1\}^{I \times J}\). The weight matrix indicates the position of missing values; that is, \(\varvec{\Phi }_{ij}=0\) if the entry \(\textbf{X}_{ij}\) is missing, \(\varvec{\Phi }_{ij}=1\) otherwise.
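For reference, a minimal NumPy sketch of the weighted cost in Eq. (7); missing (zero-weighted) entries are simply excluded from the generalized KL divergence, and the function name is ours:

```python
import numpy as np

def weighted_kl(Phi, X, W):
    """D_Phi(X, W) in Eq. (7): generalized KL divergence over observed entries.

    Phi is binary with Phi[i, j] = 0 where X[i, j] is missing.
    """
    mask = Phi.astype(bool)
    x, w = X[mask], W[mask]
    return float(np.sum(x * np.log(x / w) - x + w))

X = np.array([[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 6.0, 9.0]])
Phi = np.ones_like(X)
Phi[0, 2] = 0                                   # X[0, 2] is treated as missing
wh = np.outer([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
print(weighted_kl(Phi, X, wh))                  # 0.0: exact fit on observed entries
```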

If the binary weight matrix \(\varvec{\Phi }\) satisfies \(\textrm{Rank}\left( \varvec{\Phi }\right) \le 2\), we can find the exact solution of rank-1 missing NMF. After we explain the relationship between NMMF and missing NMF in Sect. 3.2.1, we demonstrate how to find the best rank-1 missing NMF when \(\textrm{Rank}\left( \varvec{\Phi }\right) \le 2\) in Sects. 3.2.2–3.2.3. We also develop an efficient method for the general case that finds an approximate solution based on the closed formula. The proposed method is described in Sect. 3.2.4.

3.2.1 Connection between NMMF and missing NMF

Our discussion is based on the following two fundamental facts. First, we can regard NMMF as a special case of missing NMF. We assume that a binary weight matrix \(\varvec{\Phi }\in \{0,1\}^{(N+I+L)\times (J+M)}\) and an input matrix \(\textbf{K}\in \mathbb {R}^{(N+I+L)\times (J+M)}_{\ge 0}\) are given in the form of

$$\begin{aligned} \varvec{\Phi } = \begin{bmatrix} \textbf{1}_{NJ} &{}\quad \textbf{0}_{NM} \\ \textbf{1}_{IJ} &{} \quad \textbf{1}_{IM} \\ \textbf{0}_{LJ} &{}\quad \textbf{1}_{LM} \\ \end{bmatrix}, \quad \textbf{K} = \begin{bmatrix} \textbf{Y} &{}\quad \textbf{F} \\ \textbf{X} &{}\quad \textbf{Z} \\ \textbf{E} &{}\quad \textbf{U} \\ \end{bmatrix}, \end{aligned}$$
(8)

where \(\textbf{X} \in \mathbb {R}^{I\times J}_{> 0}\), \(\textbf{Y}\in \mathbb {R}_{> 0}^{N \times J}\), \(\textbf{Z}\in \mathbb {R}_{> 0}^{I \times M}\), \(\textbf{U}\in \mathbb {R}_{> 0}^{L \times M}\), \(\textbf{E}\in \mathbb {R}_{> 0}^{L \times J}\), and \(\textbf{F}\in \mathbb {R}_{> 0}^{N \times M}\). All of the elements of \(\textbf{E}\) and \(\textbf{F}\) are missing. We consider the rank-1 approximation of \(\textbf{K}\) as

$$\begin{aligned} \mathbf {K_1} = \begin{bmatrix} {\varvec{a}}\\ {\varvec{w}}\\ {\varvec{c}}\\ \end{bmatrix} \begin{bmatrix} {\varvec{h}}^\textrm{T} &{} {\varvec{b}}^\textrm{T}\\ \end{bmatrix}. \end{aligned}$$
(9)

In this situation, the cost function of missing NMF is equivalent to that of NMMF [37]:

$$\begin{aligned}&\mathop {\hbox {argmin}}\limits _{\textbf{K}_1;\textrm{rank}(\textbf{K}_1)=1} D_{\varvec{\Phi }}(\textbf{K},\textbf{K}_1 )\\&\qquad = \mathop {\hbox {argmin}}\limits _{{\varvec{w,h,a,b,c}}} D(\textbf{X},{\varvec{w}}\otimes {\varvec{h}})\!+\! D(\textbf{Y},{\varvec{a}}\otimes {\varvec{h}}) \!+\! D(\textbf{Z},{\varvec{w}}\otimes {\varvec{b}}) +D(\textbf{U},{\varvec{c}}\otimes {\varvec{b}}). \end{aligned}$$
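To make the correspondence concrete, the block matrices of Eq. (8) can be assembled as follows; the contents of \(\textbf{E}\) and \(\textbf{F}\) are irrelevant because they are treated as missing (a minimal sketch, all names ours):

```python
import numpy as np

rng = np.random.default_rng(1)
I, J, N, M, L = 4, 5, 3, 2, 6
X, Y = rng.random((I, J)) + 0.1, rng.random((N, J)) + 0.1
Z, U = rng.random((I, M)) + 0.1, rng.random((L, M)) + 0.1
E, F = np.ones((L, J)), np.ones((N, M))   # values are irrelevant: all entries missing

# Block layout of Eq. (8): row blocks Y, X, E and column blocks of widths J and M.
K = np.block([[Y, F], [X, Z], [E, U]])
Phi = np.block([[np.ones((N, J)), np.zeros((N, M))],
                [np.ones((I, J)), np.ones((I, M))],
                [np.zeros((L, J)), np.ones((L, M))]])
print(K.shape, Phi.shape)   # (13, 7) each, i.e., (N+I+L) x (J+M)
```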

The second fundamental fact is the homogeneity of rank-1 missing NMF, which ensures that a factorization after row or column permutations can be reproduced by applying the permutations after the factorization. See the Supplement for the proof.

Proposition 2

(Homogeneity of rank-1 missing NMF) Let \(\textrm{NMF}_1(\varvec{\Phi },\textbf{X})\) be the best rank-1 matrix \({\varvec{w}}\otimes {\varvec{h}}\), which minimizes the cost function in Eq. (7). For any permutation matrices \(\textbf{G}\) and \(\textbf{H}\), it holds that

$$\begin{aligned} \textrm{NMF}_{1}\left( \textbf{G}\varvec{\Phi } \textbf{H},\textbf{GKH}\right) =\textbf{G}^{\textrm{T}}\textrm{NMF}_{1}\left( \varvec{\Phi },\textbf{K}\right) \textbf{H}^{\textrm{T}}. \end{aligned}$$

Therefore, using the closed formula of the best rank-1 NMMF in Theorem 1, we can solve the rank-1 missing NMF whenever we can relocate the missing values into the form of Eq. (8) by row and column permutations.

3.2.2 Rank-1 missing NMF for grid-like missing

We introduce the term grid-like, defined as follows. As we describe in this section, we can regard this case of missing NMF as an NMMF that requires only three matrices \(\textbf{X}, \textbf{Y}, \textbf{Z}\) as input, which corresponds to the case \(L=0\) in Theorem 1.

Definition 1

(grid-like binary weight matrix) Let \(\varvec{\Phi } \in \{0,1\}^{I\times J}\) be a binary weight matrix. If there exist two sets—\(S^{(1)} \subset [I]\) and \(S^{(2)} \subset [J]\)—such that

$$\begin{aligned} {\varvec{{\Phi }}}_{ij}=\left\{ \begin{array}{ll} 0 &{}\quad \textrm{ if } \ \ i \in S^{(1)}\ \textrm{and}\ j \in S^{(2)}, \\ 1 &{}\quad \textrm{otherwise},\end{array}\right. \end{aligned}$$

then \(\varvec{\Phi }\) is said to be grid-like.

It holds that \(\textrm{rank}\left( \varvec{\Phi }\right) =2\) if \(\varvec{\Phi }\) is grid-like, but the converse is not true. We discuss the case \(\textrm{rank}\left( \varvec{\Phi }\right) =2\) but not grid-like in Sect. 3.2.3.

Real-world tabular datasets tend to have missing values only on certain rows or columns. Therefore, the binary weight matrix \(\varvec{\Phi }\) often becomes grid-like in practice (we show example datasets in Sect. 4). Figure 3 illustrates examples of matrices with grid-like missing values.

When \(\varvec{\Phi } \in \{0,1\}^{(I+N)\times (J+M)}\) is grid-like, we can transform it into the form given in Eq. (8) with \(L=0\) using row and column permutations. Let \(S^{(1)} \subset [I+N]\) and \(S^{(2)} \subset [J+M]\) with \(\mid S^{(1)} \mid = C^{(1)}\) and \(\mid S^{(2)} \mid = C^{(2)}\) be the row and column index sets of the zero entries in \(\varvec{\Phi }\). For the block at the upper right of \(\varvec{\Phi }\) whose row and column indices are specified as

$$\begin{aligned} B^{(1)}=[C^{(1)}], \quad B^{(2)}=[J+M-C^{(2)}+1,J+M], \end{aligned}$$

we can collect all the zero entries of \(\varvec{\Phi }\) in the rectangular region \(B^{(1)} \times B^{(2)}\) using row and column permutations. Formally, for a grid-like binary weight matrix \(\varvec{\Phi }\), there are row \(\mathcal {G}: [I+N] \rightarrow [I+N]\) and column \(\mathcal {H}: [J+M] \rightarrow [J+M]\) permutations satisfying

$$\begin{aligned} (\textbf{G}\varvec{\Phi }\textbf{H})_{ij}= \left\{ \begin{array}{ll} 0 &{}\quad \textrm{ if } \ \ i \in B^{(1)} \ \textrm{and}\ j \in B^{(2)} \\ 1 &{}\quad \textrm{otherwise}\end{array}\right. \end{aligned}$$

where \(\textbf{G}\) and \(\textbf{H}\) are corresponding permutation matrices to \(\mathcal {G}\) and \(\mathcal {H}\), respectively.

We can obtain \(\textbf{G}\) and \(\textbf{H}\) as follows. First, we focus on the row permutation \(\mathcal {G}\). We want to move each row index \(k\in S^{(1)}\cap B^{(1)c}\), where \(B^{(1)c}=[I+N]{\setminus } B^{(1)}\), into \(B^{(1)}\) by the row permutation \(\mathcal {G}\), which can be achieved by any one-to-one mapping from \(S^{(1)}\cap B^{(1)c}\) to \(S^{(1)c}\cap B^{(1)}\), where \(S^{(1)c}=[I+N]\setminus S^{(1)}\). Note that \(\mid S^{(1)}\cap B^{(1)c}\mid = \mid S^{(1)c}\cap B^{(1)} \mid \) always holds. The corresponding permutation matrix is given as

$$\begin{aligned} \textbf{G} = \prod _{k \in S^{(1)}\cap B^{(1)c}} \textbf{R}^{k\leftrightarrow \mathcal {G}(k)} \end{aligned}$$

where \(\textbf{R}^{k \leftrightarrow l}\) is a permutation matrix, which switches the k-th row and the l-th row; that is,

$$\begin{aligned} \textbf{R}^{k \leftrightarrow l}_{ij}= \left\{ \begin{array}{llll} &{} 0 &{}\qquad &{}\text {if}\, (i,j) = (k,k) \ \textrm{ or } \ (l,l), \\ &{} 1 &{} &{}\text {if} \,(i,j) = (k,l) \ \textrm{ or } \ (l,k), \\ &{} \textbf{I}_{ij} &{} &{}\text {otherwise}.\end{array}\right. \end{aligned}$$

Since \(S^{(1)}\cap B^{(1)^c}\) and \(S^{(1)c}\cap B^{(1)}\) are disjoint, it holds that \(\textbf{G}=\textbf{G}^\textrm{T}\).

In the same way, any one-to-one mapping from \(S^{(2)}\cap B^{(2)c}\) to \(S^{(2)c}\cap B^{(2)}\) can be \(\mathcal {H}\), where \(S^{(2)c}=[J+M]{\setminus } S^{(2)}\) and \(B^{(2)c}=[J+M]{\setminus } B^{(2)}\). The corresponding permutation matrix is given as

$$\begin{aligned} \textbf{H} = \prod _{k \in S^{(2)}\cap B^{(2)c}} \textbf{R}^{k\leftrightarrow \mathcal {H}(k)}, \end{aligned}$$

which is also a symmetric matrix.

The above discussion leads to the following procedure for the best rank-1 missing NMF of an input matrix \(\textbf{K}\) when the binary weight matrix \(\varvec{\Phi }\) is grid-like. The first step is to find proper permutations \(\textbf{G}\) and \(\textbf{H}\) that collect the missing values in the upper-right corner. In the next step, we obtain \(\textrm{NMF}_1\left( \textbf{G}\varvec{\Phi } \textbf{H},\textbf{GKH}\right) \) using the closed formula of the best rank-1 NMMF. In the final step, we apply the inverse permutations of \(\textbf{G}\) and \(\textbf{H}\) to the result of the previous step; that is, \(\textbf{G}^{-1}\textrm{NMF}_1\left( \textbf{G}\varvec{\Phi } \textbf{H},\textbf{GKH}\right) \textbf{H}^{-1}\). Note that \(\textbf{G}^{-1}=\textbf{G}^\textrm{T} = \textbf{G}\) and \(\textbf{H}^{-1}=\textbf{H}^\textrm{T} = \textbf{H}\) always hold since these permutation matrices are orthogonal and symmetric.
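A minimal NumPy sketch of this gathering step, in which index reorderings play the role of \(\textbf{G}\) and \(\textbf{H}\) (the function name is ours):

```python
import numpy as np

def gather_grid_like(Phi, K):
    """Reorder rows and columns of a grid-like Phi (and K accordingly) so that
    all zeros of Phi form one block in the upper-right corner, as in Eq. (8)
    with L = 0. Index arrays play the role of the permutations G and H."""
    miss_rows = np.where((Phi == 0).any(axis=1))[0]            # S^(1)
    miss_cols = np.where((Phi == 0).any(axis=0))[0]            # S^(2)
    keep_rows = np.setdiff1d(np.arange(Phi.shape[0]), miss_rows)
    keep_cols = np.setdiff1d(np.arange(Phi.shape[1]), miss_cols)
    rows = np.concatenate([miss_rows, keep_rows])              # missing rows on top
    cols = np.concatenate([keep_cols, miss_cols])              # missing columns on the right
    return Phi[np.ix_(rows, cols)], K[np.ix_(rows, cols)], rows, cols

# The final re-permutation uses the inverse orderings np.argsort(rows), np.argsort(cols).
```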

3.2.3 Rank-1 missing NMF with \(\textrm{Rank}\left( \varvec{\Phi }\right) \le 2\)

In this subsection, we show that we can relocate missing values into the form of Eq. (8) by column and row permutations if the binary weight satisfies \(\textrm{rank}\left( \varvec{\Phi }\right) =2\) and solve the rank-1 missing NMF as a rank-1 NMMF.

As we can confirm immediately, there are only two cases when the rank of a binary matrix \(\varvec{\Phi }\) is 1. In the first case, all of the elements \(\varvec{\Phi }_{ij}\) are 1. This does not happen in our case because the number of missing values is assumed to be strictly larger than 0, resulting in \(\varvec{\Phi } \ne {\varvec{1}}\). In the second case, if there are rows or columns with all zero elements in \(\varvec{\Phi }\), the matrix rank of \(\varvec{\Phi }\) can be 1. However, since rows and columns with all zero elements do not contribute to the cost function (7), we ignore such rows and columns. As a result, we discuss only the case of \(\textrm{Rank}(\varvec{\Phi }) = 2\).

We consider a rank-2 weight matrix \(\varvec{\Phi } \in \{0,1\}^{(N+I+L)\times (J+M)}\). There are two linearly independent column bases if and only if \(\textrm{Rank}\left( \varvec{\Phi }\right) =2\) since the row rank and the column rank are always the same. We define them as \({\varvec{a}}\in \left\{ 0,1\right\} ^{N+I+L}\) and \({\varvec{b}}\in \left\{ 0,1\right\} ^{N+I+L}\). We also assume that \({\varvec{a}}\ne {\varvec{0}}\) and \({\varvec{b}}\ne {\varvec{0}}\) since a zero vector cannot be a basis. Then, for the binary matrix \(\varvec{\Phi }=[{\varvec{c}}^{(1)},\dots ,{\varvec{c}}^{(J+M)}]\), it must be possible to write any column \({\varvec{c}}^{(i)}\) as \({\varvec{c}}^{(i)} = \alpha _i {\varvec{a}} + \beta _i {\varvec{b}}\) using the two bases \({\varvec{a}}\) and \({\varvec{b}}\). Since \({\varvec{c}}^{(i)}\) is a binary vector, the possible values of \(\alpha _i\) and \(\beta _i\) are limited, and we analyze them below by separating three cases. In all three cases, we can rearrange \(\varvec{\Phi }\) into the form of Eq. (8) by permutations. To consider the possible values of the pair \((\alpha _i, \beta _i)\), we say that two binary vectors \({\varvec{a}}\) and \({\varvec{b}}\) are disjoint with each other if \(a_k \ne b_k\) for all \(k\in [N+I+L]\). For example, the two vectors \({\varvec{a}}=(0,0,1,0)^\textrm{T}\) and \({\varvec{b}}=(1,1,0,1)^\textrm{T}\) are disjoint, whereas \({\varvec{a}}=\left( 1,0,0,1\right) ^\textrm{T}\) and \({\varvec{b}}=(1,1,0,1)^\textrm{T}\) are not disjoint. Using this concept, we divide the possible pairs \((\alpha _i, \beta _i)\) into the following three cases for a rank-2 \(\varvec{\Phi }\):

Case 1: the bases are disjoint If the bases \({\varvec{a}}\) and \({\varvec{b}}\) are disjoint, it holds that \((\alpha _i,\beta _i) \in \{(1,0),(0,1),(1,1)\}\); that is, \({\varvec{c}}^{(i)}\) can be \({\varvec{a}}\), \({\varvec{b}}\), or \({\varvec{a}}+{\varvec{b}}\). Since \({\varvec{a}}\) and \({\varvec{b}}\) are disjoint, \({\varvec{a}}+{\varvec{b}}={\varvec{1}}\) follows. Then, using only column permutations, we can arrange the binary weight matrix \(\varvec{\Phi }=[{\varvec{c}}^{(1)},\dots ,{\varvec{c}}^{(J+M)}]\) in the form of

$$\begin{aligned} \varvec{\Phi }\textbf{H} = [{\varvec{a}},\dots ,{\varvec{a}},{\textbf {1}},\dots ,{\textbf {1}},{\varvec{b}},\dots , {\varvec{b}}], \end{aligned}$$

where \(\textbf{H}\) is a permutation matrix corresponding to the column permutation. After the column permutation, we conduct a row permutation as follows. First, we define \(S=\{i \mid a_i = 0\}\), \(C = \,\mid \! S \!\mid \), and \(B=[N+I+L-C+1,N+I+L]\). There is a one-to-one mapping \(\mathcal {G}\) from \(S \cap B^c\) to \(S^c \cap B\). The corresponding permutation matrix is given as

$$\begin{aligned} \textbf{G} = \prod _{k \in S \cap B^{c}} \textbf{R}^{k\leftrightarrow \mathcal {G}(k)}, \end{aligned}$$

where \(B^{c}=[N+I+L]\backslash B\). By applying the permutation \(\textbf{G}\) to the bases, we obtain

$$\begin{aligned} \tilde{{\varvec{a}}} \equiv \textbf{G}{\varvec{a}}= \begin{pmatrix} 1,&\dots ,&1,&0,&\dots ,&0 \end{pmatrix}^\textrm{T}, \tilde{{\varvec{b}}} \equiv \textbf{G}{\varvec{b}}= \begin{pmatrix} 0,&\dots ,&0,&1,&\dots ,&1 \end{pmatrix}^\textrm{T}. \end{aligned}$$

Using the fact \(\textbf{G}{\varvec{1}}={\varvec{1}}\), finally, we obtain

$$\begin{aligned} \textbf{G}\varvec{\Phi } \textbf{H}=[\tilde{{\varvec{a}}},\dots ,\tilde{{\varvec{a}}},{\varvec{1}},\dots ,{\varvec{1}},\tilde{{\varvec{b}}},\dots ,\tilde{{\varvec{b}}}], \end{aligned}$$

which means \(\textbf{G}\varvec{\Phi } \textbf{H}\) is in the form of Eq. (8).

Case 2: The bases are not disjoint, but one of them is the all-one vector If the bases \({\varvec{a}}\) and \({\varvec{b}}\) are not disjoint but \({\varvec{a}}={\varvec{1}}\), it holds that \((\alpha _i,\beta _i)\in \{(1,0),(0,1),(1,-1)\}\); that is, \({\varvec{c}}^{(i)}\) can be \({\varvec{1}}\), \({\varvec{b}}\), or \({\varvec{1}} - {\varvec{b}}\). As we can confirm immediately, \({\varvec{b}}\) and \({\varvec{1}} - {\varvec{b}}\) are disjoint since their sum is \({\varvec{1}}\). Then, we can rearrange \(\varvec{\Phi }\) into the form of Eq. (8) in the same way as in Case 1. If the bases \({\varvec{a}}\) and \({\varvec{b}}\) are not disjoint but \({\varvec{b}}={\varvec{1}}\), the situation is symmetric and can be handled in the same way.

Case 3: The bases are not disjoint and neither is the all-one vector If the vectors \({\varvec{a}}\) and \({\varvec{b}}\) are not disjoint and \({\varvec{a}}\ne {\varvec{1}}\) and \({\varvec{b}}\ne {\varvec{1}}\), it holds that \(\left( \alpha _i,\beta _i\right) \in \{(1,0),(0,1)\}\), since \({\varvec{a}} + {\varvec{b}}\) contains an entry equal to 2 and \(\pm ({\varvec{a}}-{\varvec{b}})\) would produce rows whose elements are all 0, which contradicts the assumption. Using only column permutations, we can arrange the binary weight matrix \(\varvec{\Phi }=[{\varvec{c}}^{(1)},\dots ,{\varvec{c}}^{(J+M)}]\) in the form of

$$\begin{aligned} \varvec{\Phi }\textbf{H} = [{\varvec{a}},\dots ,{\varvec{a}},{\varvec{b}},\dots , {\varvec{b}}]. \end{aligned}$$

In the same way as in Case 1, we obtain \(\textbf{G}\varvec{\Phi } \textbf{H}=[\tilde{{\varvec{a}}},\dots ,\tilde{{\varvec{a}}},\tilde{{\varvec{b}}},\dots ,\tilde{{\varvec{b}}}]\), which corresponds to the form of Eq. (8) with \(I=0\), that is, to NMMF for only the two matrices \(\textbf{Y}\) and \(\textbf{U}\).

The above discussion leads to the following procedure for the best rank-1 missing NMF of an input matrix \(\textbf{K}\) when the rank of the binary weight matrix \(\varvec{\Phi }\) is 2. The first step is to find the bases \({\varvec{a}}\) and \({\varvec{b}}\) of \(\varvec{\Phi }\) and proper permutations \(\textbf{G}\) and \(\textbf{H}\), as described above, to collect the missing values in the corners. The remaining steps are the same as those described in the final paragraph of Sect. 3.2.2.

3.2.4 Rank-1 missing NMF for the general case

Fig. 3 Examples of matrices with non-grid-like missing values (left) and grid-like missing values (right). Meshed entries are missing values. We can create grid-like missing values by increasing missing values

Fig. 4 Sketch of the algorithm of A1GM. Meshed entries are missing values. In Step 1, we increase missing values so that they become grid-like. In Step 2, we gather missing values in the block at the upper right by row and column permutations. In Step 3, we use the closed formula of the best rank-1 NMMF in Theorem 1 with \(L=0\). In this example, we get \({\varvec{w}}=(1.9, 1.5, 1.3)^\textrm{T},{\varvec{a}}=\left( 1.9,1.1\right) ^\textrm{T},{\varvec{h}}=(1.8,1.6,1.3)^\textrm{T},{\varvec{b}}=(0.85,3.4)^\textrm{T}\). Finally, we get two vectors as the output by the re-permutation. We use two significant digits in this figure

If the rank of the binary weight matrix \(\varvec{\Phi }\) is strictly larger than 2, the above procedure cannot be applied directly. To treat arbitrary matrices with missing values, our idea is to increase the number of missing values so that the rank of \(\varvec{\Phi }\) becomes 2. However, the optimal way to add missing values so that \(\varvec{\Phi }\) becomes rank-2 is not obvious. Instead, we add missing values so that \(\varvec{\Phi }\) becomes grid-like, for which the optimal way to add them is clear.

Although this strategy is counter-intuitive because we lose some information, which may cause a larger reconstruction error, we gain efficiency in exchange by using our closed-form solution in Theorem 1 with \(L=0\); as we empirically show in Sect. 4.1.2, the increase in error is not significant on many datasets. Examples of this step are illustrated in Fig. 3.

In the worst case, the number of missing values after this step becomes \(k^2\) for k original missing values. If every row and every column has at least one missing value, all entries become missing after this step, in which case our algorithm does not work. Thus, our method is not suitable if there are too many missing values in a matrix.

We illustrate an example of the overall procedure of A1GM in Fig. 4 and show its algorithm in the Supplement. Since the time complexity of each step of A1GM is at most linear in the number of entries of the input matrix, the total time complexity is \(O((I+N)(J+M))\) for an input matrix \(\textbf{X}\in \mathbb {R}^{(I+N)\times (J+M)}\).
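A minimal NumPy sketch of Step 1 of A1GM, which adds missing values until the pattern becomes grid-like (the function name is ours); Steps 2 and 3 then proceed as in Fig. 4:

```python
import numpy as np

def make_grid_like(Phi):
    """Step 1 of A1GM: add missing values (zeros of Phi) so that the missing
    pattern becomes grid-like, i.e., the full grid S1 x S2 is missing, where
    S1 (S2) is the set of rows (columns) containing at least one missing value."""
    Phi = Phi.copy()
    s1 = (Phi == 0).any(axis=1)   # rows with at least one missing value
    s2 = (Phi == 0).any(axis=0)   # columns with at least one missing value
    Phi[np.ix_(s1, s2)] = 0
    return Phi

Phi = np.array([[1, 1, 0, 1],
                [1, 1, 1, 1],
                [1, 0, 1, 1]])
print(make_grid_like(Phi))  # rows {0, 2} x columns {1, 2} (0-based) all become missing
```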

3.3 Legendre Tucker rank reduction

In this section, by designing posets appropriately, we extend the discussion so far to tensors and derive Legendre Tucker rank reduction (LTR), a non-gradient method for non-negative low-Tucker-rank approximation. First, we define the task in Sect. 3.3.1. Then, we overview our fundamental ideas for LTR in Sect. 3.3.2, introduce the LTR algorithm in Sect. 3.3.3, and derive it in Sects. 3.3.4–3.3.6, pointing out the relationship between rank-1 approximation and the mean-field approximation. Finally, we discuss the relationship between the proposed LTR and related work in Sects. 3.3.7 and 3.3.8.

3.3.1 Low-Tucker-rank approximation for tensors

First we define the Tucker rank of tensors and formulate the problem of non-negative low-Tucker-rank approximation. The Tucker rank of a dth-order tensor \(\mathcal {P} \in \mathbb {R}^{I_1 \times \dots \times I_d}\) is defined as a tuple \((\textrm{Rank}(\textbf{P}^{(1)})\), \(\dots \), \(\textrm{Rank}(\textbf{P}^{(d)}))\), where each \(\textbf{P}^{(k)}\in \mathbb {R}^{I_k\times \prod _{m\ne k} I_m}\) is the mode-k expansion of the tensor \(\mathcal {P}\) [38,39,40]. We describe the definition of mode-k expansion at the end of this subsection. If the Tucker rank of a tensor \(\mathcal {P}\) is \((r_1,\dots ,r_d)\), it can always be decomposed as

$$\begin{aligned} \mathcal {P} = \sum \nolimits _{i_1=1}^{r_1} \dots \sum \nolimits _{i_d=1}^{r_d} \mathcal {G}_{i_1,\dots ,i_d} \varvec{a}^{(1)}_{i_1}\otimes \varvec{a}^{(2)}_{i_2}\otimes \dots \otimes \varvec{a}^{(d)}_{i_d} \end{aligned}$$
(10)

with a tensor \(\mathcal {G}\in \mathbb {R}^{r_1 \times \dots \times r_d}\), called the core tensor of \(\mathcal {P}\), and vectors \(\varvec{a}^{(k)}_{i_k} \in \mathbb {R}^{I_k}\), \(i_k \in [r_k]\), for each \(k \in [d]\) [41]. The core tensor and these vectors are often called factors.

In this paper, we say that a tensor is rank-1 if its Tucker rank is \((1,\dots ,1)\). Non-negative low-Tucker-rank approximation approximates a given non-negative tensor \(\mathcal {P}\) by a non-negative lower-Tucker-rank tensor \(\mathcal {T}\) that minimizes the cost function \(D(\mathcal {P},\mathcal {T})\). The task is not a non-negative factorization, which imposes non-negativity on factors, but a low-rank approximation that allows negative factors [42, 43].

Mode-k expansion The mode-k expansion [41] of a tensor \(\mathcal {P}\in \mathbb {R}^{I_1\times \dots \times I_d}\) is an operation to convert \(\mathcal {P}\) into a matrix \(\textbf{P}^{(k)} \in \mathbb {R}^{I_k \times \prod ^{d}_{m=1 (m \ne k)}I_m}\) focusing on kth mode. The relation between tensor \(\mathcal {P}\) and its mode-k expansion \(\textbf{P}^{(k)}\) is formally given as

$$\begin{aligned} \left( \textbf{P}^{(k)}\right) _{i_k,j}&= \mathcal {P}_{i_1,\dots ,i_d}, \quad j = 1 + \sum _{l=1,(l \ne k)}^d \left( i_{l} - 1\right) J_l, \quad J_l = \prod _{m=1,(m \ne k)}^{l-1} I_m. \end{aligned}$$
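A minimal NumPy sketch of the mode-k expansion following the index convention above, in which the first non-k mode varies fastest along the columns (the function name is ours):

```python
import numpy as np

def mode_k_expansion(P, k):
    """Mode-k expansion: an (I_k x prod_{m != k} I_m) matrix whose column index
    follows the formula above (the first non-k mode varies fastest)."""
    return np.moveaxis(P, k, 0).reshape(P.shape[k], -1, order='F')

P = np.arange(2 * 3 * 4).reshape(2, 3, 4)
print(mode_k_expansion(P, 1).shape)   # (3, 8)
```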

3.3.2 Idea of LTR

Fig. 5 An example of reducing the Tucker rank of a tensor \(\mathcal {P}\in \mathbb {R}^{I_1\times I_2 \times I_3}_{>0}\) to at most \((r_1,r_2,I_3)\) by the proposed method LTR. \(\mathcal {F}_1\) is the set of positive tensors with Tucker rank at most \((r_1,I_2,I_3)\) and \(\mathcal {F}_2\) the set with Tucker rank at most \((I_1,r_2,I_3)\). The best approximation tensor exists in \(\mathcal {F}_1\cap \mathcal {F}_2\), enclosed by the dotted lines. For \(m=1,2\), there exist e-flat bingo spaces \(\mathcal {B}^{(m)} \subset {\mathcal {F}_m}\). The projection onto \(\mathcal {B}^{(m)}\) can be performed by dividing \(\mathcal {P}\) into subtensors along the mode-m direction and replacing each subtensor with its rank-1 approximation. The choice of bingo space is not unique

In this subsection, we overview our core idea for LTR. LTR reduces the Tucker rank of an input tensor \(\mathcal {P}\) using the known closed formula of the best rank-1 approximation. As an example, we here reduce the Tucker rank of a positive tensor \(\mathcal {P}\in \mathbb {R}_{>0}^{I_1\times I_2 \times I_3}\) to at most \((r_1,r_2,I_3)\), as shown in Fig. 5. In the space of positive tensors, there exists a subspace \(\mathcal {F}_1\) consisting of positive tensors with Tucker rank at most \((r_1,I_2,I_3)\) and a subspace \(\mathcal {F}_2\) consisting of positive tensors with Tucker rank at most \((I_1,r_2,I_3)\). We want to find a low-Tucker-rank tensor in \(\mathcal {F}_1\cap \mathcal {F}_2\) that approximates \(\mathcal {P}\) as closely as possible.

First, we map a tensor to a probability distribution. Then, using the natural parameters of the distribution, we describe sufficient conditions for reducing the Tucker rank of a tensor, called the bingo rule. For \(m=1\) and 2, we define an e-flat subspace \(\mathcal {B}^{(m)}\subset {\mathcal {F}_m}\), called a bingo space, that satisfies the bingo rule. The projection from \(\mathcal {P}\) onto \(\mathcal {B}^{(1)}\) can be conducted by applying the known rank-1 approximation formula to subtensors of \(\mathcal {P}\). Likewise, the projection from the point on \(\mathcal {B}^{(1)}\) onto \(\mathcal {B}^{(1)}\cap \mathcal {B}^{(2)}\) can be conducted in the same way.

Bingo spaces only cover tensors generated by applying rank-1 approximations to subtensors along each mode of a tensor. Therefore, the search space is smaller than that of traditional low-rank approximation, which approximates the tensor with an appropriately chosen basis and its coefficients, and there is no guarantee that LTR finds the best approximation; however, we can guarantee that LTR finds the tensor in the selected bingo spaces that minimizes the KL divergence from the input tensor. Such a smaller search space derived from the bingo rule makes our algorithm efficient without a gradient method. We discuss this point in more detail in Sect. 3.3.6.

3.3.3 The LTR algorithm

In LTR, we use the rank-1 approximation method that always finds the rank-1 tensor that minimizes the KL divergence from an input tensor [18]. The optimal rank-1 tensor \(\mathcal {{\overline{P}}}\) of \(\mathcal {P}\) is given by

$$\begin{aligned} \mathcal {{\overline{P}}} = S\left( \mathcal {P}\right) ^{1-d} {{\varvec{s}}}^{(1)} \otimes {{\varvec{s}}}^{(2)} \otimes \dots \otimes {{\varvec{s}}}^{(d)}, \end{aligned}$$
(11)

where each \({{\varvec{s}}}^{(k)} = (s_{1}^{(k)}, \dots , s_{I_k}^{(k)})\) with \(k \in [d]\) is defined as

$$\begin{aligned} s_{i_k}^{(k)} = \sum _{i_1=1}^{I_1} \dots \sum _{i_{k-1}=1}^{I_{k-1}} \sum _{i_{k+1}=1}^{I_{k+1}} \dots \sum _{i_d=1}^{I_d} \mathcal {P}_{i_1, \dots ,i_d}. \end{aligned}$$
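The closed form in Eq. (11) can be written compactly; a minimal NumPy sketch with a function name of our own choosing:

```python
import numpy as np

def best_rank1(P):
    """Best rank-1 approximation of a positive tensor under the KL divergence,
    Eq. (11): the outer product of the mode-wise sum vectors s^(1), ..., s^(d)
    scaled by S(P)^(1-d)."""
    d = P.ndim
    s = [P.sum(axis=tuple(m for m in range(d) if m != k)) for k in range(d)]
    out = s[0]
    for v in s[1:]:
        out = np.multiply.outer(out, v)
    return P.sum() ** (1 - d) * out

P = np.random.default_rng(0).random((3, 4, 5)) + 0.1
P1 = best_rank1(P)
print(np.allclose(P1.sum(axis=(1, 2)), P.sum(axis=(1, 2))))  # mode-wise sums are preserved
```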

Now we introduce LTR, which iteratively applies the above rank-1 approximation to subtensors of a tensor \(\mathcal {P} \in \mathbb {R}^{I_1 \times \dots \times I_d}\). When we reduce the Tucker rank of \(\mathcal {P}\) to \((r_1, \dots , r_d)\), LTR performs the following two steps for each \(k \in [d]\):

Step 1: We construct \(C = \{c_1, \dots , c_{r_k}\} \subseteq [I_k]\) by random sampling from \([I_k]\) without replacement, where we always assume that \(c_1 = 1\) and \(c_l < c_{l + 1}\) for every \(l \in [r_k - 1]\).

Step 2: For each \(l \in [r_k]\), if \(c_l \ne c_{l+1} - 1\) holds, we replace the subtensor \(\mathcal {P}_{c^{(k)}_l:c^{(k)}_{l+1}-1}\) of \(\mathcal {P}\) by its rank-1 approximation obtained by Eq. (11), where we use the convention \(c_{r_k+1} = I_k + 1\).

The choice of C in Step 1 is arbitrary, which means that another strategy can be used. For example, if we know that some parts of an input tensor are less important than others, we can directly choose those indices for C instead of random sampling to obtain a more accurate reconstructed tensor. We provide the algorithm of LTR in algorithmic form in the Supplement.
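The two steps above can be sketched as follows, reusing the best_rank1 helper from the previous sketch (repeated so that the block is self-contained); this is a minimal sketch under the random-sampling choice of C, not the full algorithm of the Supplement:

```python
import numpy as np

def best_rank1(P):
    """Best rank-1 approximation under the KL divergence, Eq. (11)."""
    d = P.ndim
    s = [P.sum(axis=tuple(m for m in range(d) if m != k)) for k in range(d)]
    out = s[0]
    for v in s[1:]:
        out = np.multiply.outer(out, v)
    return P.sum() ** (1 - d) * out

def ltr(P, ranks, seed=0):
    """Reduce the Tucker rank of a positive tensor P to at most `ranks`."""
    rng = np.random.default_rng(seed)
    P = P.astype(float).copy()
    for k in range(P.ndim):
        I_k, r_k = P.shape[k], ranks[k]
        # Step 1: cut points C = {c_1, ..., c_{r_k}} with c_1 = 1 (index 0 here).
        cuts = np.sort(rng.choice(np.arange(1, I_k), size=r_k - 1, replace=False)) if r_k > 1 else np.array([], dtype=int)
        c = np.concatenate(([0], cuts, [I_k]))   # sentinel I_k closes the last block
        # Step 2: replace each mode-k block of width > 1 by its rank-1 approximation.
        for l in range(r_k):
            if c[l + 1] - c[l] > 1:
                idx = [slice(None)] * P.ndim
                idx[k] = slice(c[l], c[l + 1])
                P[tuple(idx)] = best_rank1(P[tuple(idx)])
    return P

P = np.random.default_rng(1).random((6, 5, 4)) + 0.1
T = ltr(P, (2, 3, 4))
print([np.linalg.matrix_rank(np.moveaxis(T, k, 0).reshape(T.shape[k], -1)) for k in range(3)])
# The Tucker rank of T is at most (2, 3, 4)
```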

Computational complexity of LTR Step 1 of LTR requires \(O(r_1+r_2+\dots +r_d)\) time since, for each \(k\in [d]\), we only need to sample \(r_k\) integers from \(\{1,2,\dots ,I_k\}\) using the Fisher–Yates method. Since the above procedure repeats the best rank-1 approximation at most \(r_1 r_2 \dots r_d\) times, the worst-case computational complexity of LTR is \(O(r_1 r_2 \dots r_d I_1 I_2 \dots I_d)\).

3.3.4 Posets for LTR

We derive the LTR algorithm via an information-geometric formulation of low-Tucker-rank approximation. The discussion is based on the log-linear model on a poset. For simplicity, we normalize the input tensor beforehand so that its sum is 1. To regard any positive tensor \(\mathcal {P}\in \mathbb {R}^{I_1\times \dots \times I_d}_{>0}\) as a distribution, we introduce the following partial order “\(\le \)” between elements \((i_1,\dots ,i_d)\) of the index set \(\Omega _d=[I_1]\times \dots \times [I_d]\) of the tensor \(\mathcal {P}\):

$$\begin{aligned} (i_1,\dots ,i_d) \le (i_1',\dots ,i_d') \Leftrightarrow i_k \le i_k' \mathrm {\ for \ all \ } k\in [d]. \end{aligned}$$
(12)

See Fig. 6 for an example of the poset \(\Omega _d\) with \(d=3\) and \(I_1=I_2=I_3=3\). We regard \(\mathcal {P}\) as a discrete probability distribution whose sample space is the index set of \(\mathcal {P}\) via the log-linear model on \((\Omega _d, \le )\). Any positive normalized tensor can be described by the natural parameters as

$$\begin{aligned} \mathcal {P}_{i_1,\dots ,i_d} = \exp { \left( \sum _{(i'_1,\dots ,i'_d) \le (i_1,\dots ,i_d)} \theta _{i'_1,\dots ,i'_d}\right) }. \end{aligned}$$
(13)

The parameter vector \((\theta )_{i_1,\dots ,i_d}\), consisting of all \(\theta _{i_1,\dots ,i_d}\) with \((i_1,\dots ,i_d) \ne (1,\dots ,1)\), uniquely identifies the normalized positive tensor \(\mathcal {P}\). Therefore, \((\theta )_{i_1,\dots ,i_d}\) can be used as an alternative representation of \(\mathcal {P}\). Note that \(\theta _{1,\dots ,1}\) is the normalizing factor and is not an independent parameter.

Our model in Eq. (13) clearly belongs to the exponential family. Each entry of the vector of \(\eta \)-parameters \((\eta )_{i_1,\dots ,i_d}\) is written as follows, and this vector again uniquely identifies a normalized positive tensor \(\mathcal {P}\):

$$\begin{aligned} \eta _{i_1,\dots ,i_d} = \sum _{i'_1=i_1}^{I_1} \dots \sum _{i'_d=i_d}^{I_d} \mathcal {P}_{i'_1, \dots ,i'_d}. \end{aligned}$$
(14)

The normalization condition is realized as \(\eta _{1,\dots ,1} = 1\). As shown in [17], by using the Möbius function [30] inductively defined as

$$\begin{aligned} \mu _{i_1, \dots , i_d}^{i'_1, \dots , i'_d} = {\left\{ \begin{array}{ll} 1 &{}\quad \text {if } (i_1, \dots , i_d) = (i'_1, \dots , i'_d), \\ -\sum _{j_1=i_1}^{i'_1-1}\dots \sum _{j_d=i_d}^{i'_d-1} \mu _{i_1, \dots , i_d}^{j_1, \dots , j_d} &{}\quad \text {if } (i_1, \dots , i_d) \ne (i'_1, \dots , i'_d) \text { and } (i_1, \dots , i_d) \le (i'_1, \dots , i'_d), \\ 0 &{}\quad \text {otherwise}, \end{array}\right. } \end{aligned}$$

each distribution \(\mathcal {P}\) can be described as

$$\begin{aligned} \mathcal {P}_{i_1,\dots ,i_d} = \sum _{(i'_1, \dots , i'_d)\in \Omega _d} \mu _{i_1, \dots , i_d}^{i'_1, \dots , i'_d}\, \eta _{i'_1, \dots , i'_d} \end{aligned}$$
(15)

using the \(\eta \)-coordinate system. Note that, to identify the value of \(\mathcal {P}_{i_1,\dots ,i_d}\), we need only \(\eta _{i'_1,\dots ,i'_d}\) with \((i'_1,\dots ,i'_d)\in \{i_1,i_1+1\}\times \{i_2,i_2+1\}\times \dots \times \{i_d,i_d+1\}\), where \(\eta \) with an out-of-range index is treated as 0. For example, if \(d = 2\), it holds that \(\mathcal {P}_{i_1, i_2} = \eta _{i_1, i_2} - \eta _{i_1 + 1, i_2} - \eta _{i_1, i_2 + 1}+\eta _{i_1+1, i_2 + 1}\).
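As an illustration of these two coordinate systems (a small NumPy sketch of ours, for \(d=2\), using the convention that out-of-range \(\eta \) values are 0): the \(\eta \)-parameters of a normalized matrix are reverse cumulative sums of its entries (Eq. (14)), the \(\theta \)-parameters are finite differences of \(\log \mathcal {P}\) (inverting Eq. (13)), and Eq. (15) recovers the entries by inclusion–exclusion.

```python
import numpy as np

rng = np.random.default_rng(0)
P = rng.random((4, 5)) + 0.1
P /= P.sum()                                   # normalized positive matrix (d = 2)

# eta-parameters (Eq. (14)): eta[i, j] = sum of P over indices >= (i, j)
eta = np.flip(np.cumsum(np.cumsum(np.flip(P), axis=0), axis=1))
assert np.isclose(eta[0, 0], 1.0)              # normalization: eta_{1,1} = 1

# theta-parameters (Eq. (13)): log P is the cumulative sum of theta over the
# order ideal, so theta is the 2D finite difference of log P
logP = np.log(P)
theta = np.diff(np.diff(logP, axis=0, prepend=0), axis=1, prepend=0)
assert np.allclose(np.cumsum(np.cumsum(theta, axis=0), axis=1), logP)

# Eq. (15) for d = 2: P_{i,j} = eta_{i,j} - eta_{i+1,j} - eta_{i,j+1} + eta_{i+1,j+1},
# where eta with an out-of-range index is treated as 0
eta_pad = np.zeros((5, 6))
eta_pad[:4, :5] = eta
P_rec = eta_pad[:4, :5] - eta_pad[1:, :5] - eta_pad[:4, 1:] + eta_pad[1:, 1:]
assert np.allclose(P_rec, P)
```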

Fig. 6 a A poset \((\Omega _3,\le )\) corresponding to a \(3\times 3\times 3\) tensor. The parameters on the gray nodes are one-body parameters. b The non-negative rank-1 approximation is formulated as an m-projection from the input tensor onto the rank-1 space, minimizing the KL divergence

3.3.5 Information geometric view of rank-1 approximation

Before we dive into the general case of non-negative low-Tucker-rank approximation, we focus on the problem of rank-1 approximation for positive tensors and show the fundamental relationship with the mean-field theory.

We describe the necessary and sufficient condition for the rank of a tensor to be 1 using \(\theta \)- and \(\eta \)-parameters. We formulate tensor rank-1 approximation as a projection onto the subspace consisting of positive rank-1 tensors, which we call the rank-1 space. We use an overline for rank-1 tensors; that is, \(\mathcal {{\overline{P}}}\) is a rank-1 tensor, and \(\overline{\theta }\) and \(\overline{\eta }\) are its \(\theta \)- and \(\eta \)-parameters.

In the following, we use the one-body parameters and many-body parameters defined in Sect. 3.1.4 to describe the low-rank condition of tensors. For clarity, we also use the following notation for one-body parameters of a dth-order tensor,

$$\begin{aligned} \theta _j^{(k)} \equiv \theta _{\underbrace{1,\dots ,1}_{k-1}, j,\underbrace{1,\dots ,1}_{d-k}}, \quad \eta _j^{(k)} \equiv \eta _{\underbrace{1,\dots ,1}_{k-1}, j,\underbrace{1,\dots ,1}_{d-k}} \quad \text { for each } k \in [d]. \end{aligned}$$

The rank-1 condition for positive tensors is described as follows using the many-body \(\theta \)-parameters, and we also describe the rank-1 condition using the \(\eta \)-parameters. See Supplement for proofs.

Proposition 3

(rank-1 condition in \(\theta \)-form) For any positive tensor \(\mathcal {{\overline{P}}}\), \(\textrm{rank}(\mathcal {{\overline{P}}})=1\) if and only if all of its many-body \(\theta \)-parameters are 0.

Proposition 4

(rank-1 condition in \(\eta \)-form) For any positive dth-order tensor \(\mathcal {{\overline{P}}}\in \mathbb {R}_{>0}^{I_1\times \dots \times I_d}\), \(\textrm{rank}(\mathcal {{\overline{P}}})=1\) if and only if all of its many-body \(\eta \)-parameters are factorizable as

$$\begin{aligned} \overline{\eta }_{i_1,\dots ,i_d} = \prod _{k=1}^d \overline{\eta }_{i_k}^{(k)}. \end{aligned}$$
(16)
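A quick numerical sanity check of Propositions 3 and 4 (our sketch, not part of the paper's proofs): for a normalized positive rank-1 tensor, the many-body \(\theta \)-parameters obtained by Möbius inversion of \(\log \mathcal {P}\) vanish, and every \(\eta \)-parameter is the product of the corresponding one-body \(\eta \)-parameters.

```python
import numpy as np

rng = np.random.default_rng(1)
a, b, c = rng.random(3) + 0.1, rng.random(4) + 0.1, rng.random(5) + 0.1
P = np.einsum('i,j,k->ijk', a, b, c)
P /= P.sum()                                   # normalized positive rank-1 tensor

# theta by Möbius inversion of log P: successive finite differences along each axis
theta = np.log(P)
for ax in range(P.ndim):
    theta = np.diff(theta, axis=ax, prepend=0)

# Proposition 3: many-body theta-parameters (two or more indices > 1) are all zero
counts = (np.indices(P.shape) > 0).sum(axis=0)
assert np.allclose(theta[counts >= 2], 0.0)

# Proposition 4: eta factorizes into one-body eta-parameters
eta = P.copy()
for ax in range(P.ndim):
    eta = np.flip(np.cumsum(np.flip(eta, axis=ax), axis=ax), axis=ax)
eta_factored = np.einsum('i,j,k->ijk', eta[:, 0, 0], eta[0, :, 0], eta[0, 0, :])
assert np.allclose(eta, eta_factored)
```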

Since the rank-1 space is described by linear constraints on the natural parameters, the rank-1 space is e-flat [29, Chapter 2.4]. Using Propositions 3 and 4, we can derive the projection destination without a gradient method, which reproduces the closed formula of the best rank-1 approximation minimizing the KL divergence [18] from the viewpoint of information geometry. We also use the expectation conservation law of the m-projection to prove the following proposition (see Appendix D in the Supplement for the proof). In this case, the one-body \(\eta \)-parameters do not change during the m-projection; that is,

$$\begin{aligned} \eta ^{(k)}_{i_k} = \overline{\eta }_{i_k}^{(k)} \text { for any } i_k\in [I_k] \text { and } k\in [d]. \end{aligned}$$
(17)

Proposition 5

(m-projection onto factorizable subspace) For any positive tensor \(\mathcal {P} \in \mathbb {R}_{> 0}^{I_1 \times \dots \times I_d}\), its m-projection onto the rank-1 space is given as

$$\begin{aligned} \mathcal {{\overline{P}}}_{i_1,\dots ,i_d} = \prod _{k=1}^{d} \left( \sum _{i'_1=1}^{I_1} \dots \sum _{i'_{k-1}=1}^{I_{k-1}} \sum _{i'_{k+1}=1}^{I_{k+1}} \dots \sum _{i'_d=1}^{I_d} \mathcal {P}_{i'_1 ,\dots ,i'_{k-1}, i_k, i'_{k+1} ,\dots , i'_d} \right) . \end{aligned}$$
(18)

Since the m-projection minimizes the KL divergence, it is guaranteed that \(\mathcal {{\overline{P}}}\) obtained by Eq. (18) minimizes the KL divergence from \(\mathcal {P}\) within the set of rank-1 tensors. If a given tensor is not normalized, we need to divide the right-hand side of Eq. (18) by the \((d-1)\)-th power of the sum of all entries of the tensor in order to match the scales of the input and output tensors, which is consistent with Eq. (11).
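The following small NumPy check (ours, not the paper's code) illustrates this remark and Eq. (17): the rescaled Eq. (18) preserves the mode-wise marginal sums (equivalently, the one-body \(\eta \)-parameters), and it is not beaten by randomly drawn positive rank-1 tensors when we measure the generalized KL divergence \(\sum (P\log (P/T) - P + T)\) for unnormalized tensors (our choice of divergence for the unnormalized case).

```python
import numpy as np

def kl(P, T):
    """Generalized KL divergence between positive tensors of the same shape."""
    return np.sum(P * np.log(P / T) - P + T)

rng = np.random.default_rng(2)
P = rng.random((3, 4, 5)) + 0.1                # unnormalized positive tensor
d, S = P.ndim, P.sum()

# Eq. (18) with the (d-1)-th power rescaling for an unnormalized input
marginals = [P.sum(axis=tuple(j for j in range(d) if j != k)) for k in range(d)]
Q = np.einsum('i,j,k->ijk', *marginals) / S ** (d - 1)

# Eq. (17): one-body eta-parameters (mode-wise marginal sums) are conserved
for k in range(d):
    axes = tuple(j for j in range(d) if j != k)
    assert np.allclose(Q.sum(axis=axes), P.sum(axis=axes))

# Q should not be beaten by other positive rank-1 tensors in generalized KL divergence
best = kl(P, Q)
for _ in range(1000):
    R = np.einsum('i,j,k->ijk', rng.random(3) + 0.1, rng.random(4) + 0.1, rng.random(5) + 0.1)
    assert kl(P, R) >= best - 1e-9
```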

Rank-1 approximation as a mean-field approximation We consider a rank-1 positive tensor \(\mathcal {{\overline{P}}} \in \mathbb {R}_{> 0}^{I_1\times \dots \times I_d}\) and show that it is represented as a product of independent distributions, which leads to an analogy with the mean-field theory. In the rank-1 space, by substituting 0 for all many-body parameters in our model in Eq. (13), we obtain

$$\begin{aligned} \mathcal {{\overline{P}}}_{j_1,\dots ,j_d} = \prod _{k=1}^d \frac{\exp {\left( \sum _{j'_k=2}^{j_k} \overline{\theta }_{j'_k}^{(k)} \right) }}{1 + \sum _{i_k=2}^{I_k} \exp {\left( \sum _{i'_k=2}^{i_k} \overline{\theta }_{i'_k}^{(k)} \right) }} \equiv \prod _{k=1}^{d} s_{j_k}^{(k)}, \end{aligned}$$
(19)

where \(\varvec{s}^{(k)} \in \mathbb {R}^{I_k}\) is a positive first-order tensor normalized as \(\sum _{j_k=1}^{I_k} {s}_{j_k}^{(k)} = 1\); we can then regard \({{\varvec{s}}}^{(k)}\) as a probability distribution with a single random variable \(j_k \in [I_k]\). The above discussion means that any positive rank-1 tensor can be represented as a product of normalized independent distributions.

The operation of approximating a joint distribution by a product of independent distributions is called mean-field approximation. The mean-field approximation was invented in physics for discussing phase transitions in ferromagnets [19]. Nowadays, it appears in a wide range of domains such as statistics [44], game theory [45, 46], and information theory [47]. From the viewpoint of information geometry, [48] developed a theory of mean-field approximation for Boltzmann machines [28], which are defined as \(p({{\varvec{x}}}) \propto \exp (\sum _ib_ix_i + \sum _{ij} w_{ij} x_i x_j)\) for a binary random variable vector \({{\varvec{x}}} \in \{0,1\}^n\) with a bias parameter \({{\varvec{b}}} = (b_i) \in \mathbb {R}^{n}\) and an interaction parameter \({{\varvec{W}}} = (w_{ij})\in \mathbb {R}^{n \times n}\). To illustrate that a rank-1 approximation can be regarded as a mean-field approximation, we review the mean-field theory of Boltzmann machines as follows.

The mean-field approximation of Boltzmann machines is formulated as the projection from a given distribution onto the e-flat subspace consisting of distributions whose interaction parameters satisfy \(w_{ij}=0\) for all i and j, which is called a factorizable subspace. Since a distribution with the constraint \(w_{ij}=0\) for all i and j decomposes into a product of independent distributions, we can approximate a given distribution by a product of independent distributions via the projection onto the factorizable subspace. The m-projection onto the factorizable subspace requires us to know the expectation values \(\eta _i \equiv \mathbb {E}[x_i]\) of the input distribution, which costs \(O(2^n)\) computation [49], so we usually approximate it by replacing the m-projection with the e-projection. The e-projection is usually conducted via a self-consistent equation called the mean-field equation. The e-projection finds the distribution \(\mathcal {{\overline{P}}}_e\) that minimizes \(D(\mathcal {{\overline{P}}}_e;\mathcal {P})\) for a given distribution \(\mathcal {P}\), and the projection is conducted by numerically solving the mean-field equations \(\eta _i = \sigma (b_i + \sum _j w_{ij} \eta _j)\), where \(\sigma (\cdot )\) is the sigmoid function. There is no theoretical guarantee that the e-projection destination is uniquely determined, since the factorizable subspace is e-flat but not m-flat. The factorizable subspace has the special property that the expectation values \(\eta _i \equiv \mathbb {E}[x_i]\) can be computed from a distribution in closed form as \(\eta _i=\sigma (b_i)\), and the distribution can be recovered from the expectation values as \(b_i = \log \frac{\eta _i}{1-\eta _i}\).
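For concreteness, here is a minimal sketch (ours, not from the paper) of solving the mean-field equations \(\eta _i = \sigma (b_i + \sum _j w_{ij} \eta _j)\) by damped fixed-point iteration; in line with the remark above, convergence to a unique solution is not guaranteed in general.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mean_field(b, W, n_iter=500, damping=0.5, rng=np.random.default_rng(0)):
    """Damped fixed-point iteration for the mean-field equations
    eta_i = sigmoid(b_i + sum_j w_ij eta_j) of a Boltzmann machine."""
    eta = rng.random(b.shape)
    for _ in range(n_iter):
        eta = (1 - damping) * eta + damping * sigmoid(b + W @ eta)
    return eta

# small example with a symmetric interaction matrix (zero diagonal)
rng = np.random.default_rng(0)
n = 5
W = rng.normal(scale=0.3, size=(n, n))
W = (W + W.T) / 2
np.fill_diagonal(W, 0.0)
b = rng.normal(size=n)
eta = mean_field(b, W)
print(np.max(np.abs(eta - sigmoid(b + W @ eta))))   # residual of the self-consistent equation
```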

The analogy between rank-1 approximation and mean-field theory is now clear. In our modeling, a joint distribution \(\mathcal {P}\) is approximated by a product of independent distributions \({{\varvec{s}}}^{(k)}\) by projecting \(\mathcal {P}\) onto the subspace in which all many-body \(\theta \)-parameters are 0, leading to the rank-1 tensor \(\mathcal {{\overline{P}}}\). Since we can compute the expectation parameters \(\eta \) by simply summing the input positive tensor along each axial direction, the m-projection can be performed directly in our formulation with \(O(I_1\dots I_d)\) cost, whereas it is computationally infeasible for Boltzmann machines due to the \(O(2^n)\) cost. Moreover, as we prove in Proposition 7 in the supplementary material, the rank-1 space has the same property as the factorizable subspace of Boltzmann machines; that is, each parameter can be computed from its dual parameter in closed form:

$$\begin{aligned} \overline{\theta }_{j}^{(k)} = \log { \left( \frac{\overline{\eta }_j^{(k)}-\overline{\eta }_{j+1}^{(k)}}{\overline{\eta }_{j-1}^{(k)}-\overline{\eta }_{j}^{(k)}} \right) } , \qquad \overline{\eta }_{j}^{(k)} = \frac{\sum _{i_k=j}^{I_k} \exp {\left( \sum _{i'_k=2}^{i_k}\overline{\theta }_{i'_k}^{(k)} \right) }}{1 + \sum _{i_k=2}^{I_k} \exp {\left( \sum _{i'_k=2}^{i_k}\overline{\theta }_{i'_k}^{(k)} \right) } }, \end{aligned}$$
for \(j \in [I_k]\setminus \{1\}\), where \(\overline{\eta }_{I_k+1}^{(k)}\) is treated as 0.
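A short numerical round trip of these one-body conversion formulas for a single mode k (our sketch), with the convention \(\overline{\eta }_{I_k+1}^{(k)} = 0\):

```python
import numpy as np

rng = np.random.default_rng(3)
s = rng.random(6) + 0.1
s /= s.sum()                                   # normalized factor s^(k) of a rank-1 tensor

# one-body eta: reverse cumulative sums, eta_1 = 1, with eta_{I_k+1} treated as 0
eta = np.append(np.cumsum(s[::-1])[::-1], 0.0)

# theta from eta (left formula), defined for j = 2, ..., I_k
theta = np.log((eta[1:-1] - eta[2:]) / (eta[:-2] - eta[1:-1]))
assert np.allclose(theta, np.log(s[1:] / s[:-1]))

# eta from theta (right formula)
cum = np.exp(np.concatenate(([0.0], np.cumsum(theta))))   # exp(sum_{i'=2}^{i} theta_{i'}) for i = 1..I_k
eta_back = np.cumsum(cum[::-1])[::-1] / (1.0 + cum[1:].sum())
assert np.allclose(eta_back, eta[:-1])
```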
Fig. 7 Examples of LTR for an (8, 8, 2) tensor. \(\Omega _\mathcal {B}\) is the bingo-index set. Tensor values and their \(\eta \)-parameters on \(\hat{\Omega }_\mathcal {B}\) do not change, and neither do the \(\eta \)-parameters on \(\hat{\Omega }^c_{\mathcal {B}}\cap \Omega ^c_{\mathcal {B}}\), where \(\hat{\Omega }^c_{\mathcal {B}}=\Omega _d\backslash \hat{\Omega }_{\mathcal {B}}\) and \(\Omega ^c_{\mathcal {B}}=\Omega _d\backslash \Omega _{\mathcal {B}}\). For the target rank (8, 5, 2), we first define three bingos on mode-2, as shown in a, since \(8-5=3\), and approximate the contiguous blocks filled in green and blue in b by rank-1 tensors using Eq. (11). In the same way, c shows the case where the target rank is (7, 5, 2). We also define a single bingo on mode-1 since \(8-7=1\). A subtensor approximated by Eq. (11) is filled in yellow. We assume that we project a tensor onto \(\mathcal {B}^{(1)}\), followed by projecting it onto \(\mathcal {B}^{(2)}\). After the second m-projection, the \(\theta \)-parameters on the red panels seem to be overwritten; however, these values remain zero after the second m-projection

3.3.6 Bingo rule for general low-Tucker-rank approximation

In this subsection, we extend the above discussion to arbitrary Tucker ranks. First, we relax the \(\theta \)-representation of the rank-1 condition described in Proposition 3 and introduce the bingo rule as a sufficient condition for a tensor to be rank-reduced. Next, we formulate low-Tucker-rank approximation as a projection onto the subspace that satisfies this bingo rule. Finally, we show that the projection can be achieved by rank-1 approximations of subtensors of the input tensor without using a gradient method.

Definition 2

(Bingo) Let \((\theta )_{ij}^{\left( k\right) } = (\theta _{11}^{(k)}, \dots , \theta _{I_kK}^{(k)})\) with \(K = \prod _{m \not = k}I_m\) be the \(\theta \)-coordinate representation of the mode-k expansion of a tensor \(\mathcal {P} \in \mathbb {R}_{>0}^{I_1 \times \dots \times I_d}\). If there exists an integer \(i \in [I_k] {\setminus } \{1\}\) such that \(\theta ^{(k)}_{ij}=0\) for all \(j \in [K] \setminus \{1\}\), we say that \(\mathcal {P}\) has a bingo on mode-k.

Proposition 6

(Bingo and Tucker rank) If there are \(b_k\) bingos on mode-k of a tensor \(\mathcal {P}\), it holds that

$$\begin{aligned} \textrm{Rank}(\textbf{P}^{(k)}) \le I_k-b_k. \end{aligned}$$

See Supplement for proof. Therefore, for any tensor \(\mathcal {P} \in \mathbb {R}_{>0}^{I_1\times \dots \times I_d}\) that has \(b_k\) bingos on each mode-k, we can always guarantee that its Tucker rank is at most \((I_1 - b_1, \dots , I_d - b_d)\). We define a bingo space as a subspace consisting of tensors with bingos and denote it by \(\mathcal {B}\). A bingo space is always e-flat since a subspace defined by linear constraints on the natural parameters is e-flat [29, Chapter 2.4]. For a given bingo space \(\mathcal {B}\), the set of indices \((i_1,\dots ,i_d)\) on which the bingo constraints \(\theta _{i_1,\dots ,i_d}=0\) are imposed is called the bingo-index set and is denoted by \(\Omega _\mathcal {B}\).

Finally, we prove that LTR successfully reduces the Tucker rank by extending the above discussion. We formulate low-Tucker-rank approximation as an m-projection onto a specific bingo space, which is constructed in Step 1 of LTR. Then, in Step 2, we perform the m-projection using the closed formula of the rank-1 approximation without a gradient method. We first discuss the case in which the rank of only one mode is reduced, and then the case in which the ranks of two modes are reduced.

When the rank of only one mode is reduced Let us assume that the target Tucker rank is \((I_1\), \(\dots \), \(I_{k-1}\), \(r_k\), \(I_{k+1}\), \(\dots \), \(I_d)\) with \(r_k<I_k\) for an input positive tensor \(\mathcal {P}\in \mathbb {R}_{>0}^{I_1 \times \dots \times I_d}\). Let \(\mathcal {B}^{(k)}\) be the set of tensors having \(I_k-r_k\) bingos on mode-k and \(\Omega _{\mathcal {B}^{(k)}}\) be the set of bingo indices for mode-k constructed in Step 1 of LTR:

$$\begin{aligned} \mathcal {B}^{(k)} = \{ \mathcal {P} \mid \theta _{i_1,\dots ,i_d}=0 \text { for } (i_1,\dots ,i_d) \in \Omega _{\mathcal {B}^{(k)}}\}. \end{aligned}$$
(20)

Note that \(\mathcal {P} \in \mathcal {B}^{(k)}\) implies that the Tucker rank of \(\mathcal {P}\) is at most \((I_1, \dots , I_{k - 1}, r_k, I_{k + 1}, \dots , I_d)\). Let \(\mathcal {P}_{(k)}\) be the destination of the m-projection from \(\mathcal {P}\) onto \(\mathcal {B}^{(k)}\), and let \(\tilde{\theta }\) and \(\tilde{\eta }\) be its \(\theta \)- and \(\eta \)-parameters. From the definition of the m-projection and the conservation law of \(\eta \), the parameters of the tensor \(\mathcal {P}_{(k)}\) satisfy

$$\begin{aligned} \tilde{\theta }_{i_1,\dots ,i_d}=0 \text { for } (i_1,\dots ,i_d)\in \Omega _{\mathcal {B}^{(k)}}, \ \ \tilde{\eta }_{i_1,\dots ,i_d}=\eta _{i_1,\dots ,i_d} \text { for } (i_1,\dots ,i_d)\not \in \Omega _{\mathcal {B}^{(k)}} . \end{aligned}$$
(21)

As described in Sect. 3.3.4, we need only the \(\eta \)-parameters on \((i'_1,\dots ,i'_d) \in \{i_1,i_1+1\}\times \dots \times \{i_d,i_d+1\}\) to identify the value of \(\mathcal {P}_{i_1, \dots , i_d}\). It follows that \(\mathcal {P}_{i_1,\dots ,i_d} = {\mathcal {P}_{(k)}}_{i_1,\dots ,i_d}\) for \((i_1,\dots ,i_d)\in \hat{\Omega }_{\mathcal {B}^{(k)}}\), where

$$\begin{aligned} \hat{\Omega }_{\mathcal {B}^{(k)}} = \{(i_1,\dots ,i_d) \mid \{i_1,i_1+1\} \times \dots \times \{i_d,i_d+1\} \cap \Omega _{\mathcal {B}^{(k)}}=\varnothing \}. \end{aligned}$$
(22)

Therefore, all we have to do to reduce the Tucker rank is to change the elements \(\mathcal {P}_{i_1,\dots ,i_d}\) only for \((i_1,\dots ,i_d) \not \in \hat{\Omega }_{\mathcal {B}^{(k)}}\). Such adjustable parts of \(\mathcal {P}\) can be divided into contiguous blocks; we call each of these a subtensor of \(\mathcal {P}\) on mode-k. In Fig. 7a, b, for example, we find two subtensors \(\mathcal {P}_{3^{(1)}:5^{(1)}}\) and \(\mathcal {P}_{7^{(1)}:8^{(1)}}\). By applying the rank-1 approximation introduced in Sect. 3.3.3 to each subtensor, we obtain \(\mathcal {P}_{(k)}\) satisfying Eq. (21), since Proposition 3 and Eq. (17) hold for these rank-1 approximations.
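The following sketch (ours) checks the one-mode case numerically: replacing contiguous blocks of slices along the first mode of a random positive \(8\times 6\times 5\) tensor by their rank-1 approximations (Eq. (11)) lowers the rank of that mode's matricization to at most the number of blocks, in line with Proposition 6, while the untouched slice and the slice sums inside each block (one-body \(\eta \)-parameters, Eq. (17)) are preserved.

```python
import numpy as np

def best_rank1(P):
    """Closed-formula best rank-1 approximation (Eq. (11))."""
    d, total = P.ndim, P.sum()
    out = np.ones([1] * d)
    for k in range(d):
        s = P.sum(axis=tuple(j for j in range(d) if j != k))
        out = out * s.reshape([s.size if j == k else 1 for j in range(d)])
    return total ** (1 - d) * out

rng = np.random.default_rng(4)
P = rng.random((8, 6, 5)) + 0.1
T = P.copy()

# three "subtensors on the first mode": slices 0-3, slice 4 (untouched), slices 5-7
blocks = [slice(0, 4), slice(5, 8)]
for blk in blocks:
    T[blk] = best_rank1(P[blk])

# the matricization rank along the first mode drops from 8 to at most 3
print(np.linalg.matrix_rank(P.reshape(8, -1)), np.linalg.matrix_rank(T.reshape(8, -1)))

# the untouched slice is unchanged, and the slice sums inside each block are conserved
assert np.allclose(T[4], P[4])
for blk in blocks:
    assert np.allclose(T[blk].sum(axis=(1, 2)), P[blk].sum(axis=(1, 2)))
```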

When the ranks of two modes are reduced Let us assume that the target Tucker rank of mode-k is \(r_k<I_k\) and that of mode-l is \(r_l<I_l\). In this case, we need to consider two bingo spaces \(\mathcal {B}^{(k)}\) and \(\mathcal {B}^{(l)}\) associated with the bingo-index sets \(\Omega _{\mathcal {B}^{(k)}}\) and \(\Omega _{\mathcal {B}^{(l)}}\). Let \(\mathcal {P}_{(k)}\) be the resulting tensor of the m-projection of \(\mathcal {P}\) onto \(\mathcal {B}^{(k)}\) and \(\mathcal {P}_{(k, l)}\) be the resulting tensor of the m-projection of \(\mathcal {P}_{(k)}\) onto \(\mathcal {B}^{(l)}\). To obtain \(\mathcal {P}_{(k,l)}\), we consider the m-projection from \(\mathcal {P}_{(k)}\in \mathcal {B}^{(k)}\) onto the bingo space \(\mathcal {B}^{(l)}\). In this projection, the natural parameters that were set to 0 in the previous m-projection from \(\mathcal {P}\) onto \(\mathcal {B}^{(k)}\) seem to be overwritten (see the red panels in Fig. 7c). However, as shown in Proposition 8 in the supplementary material, after the rank-1 approximation of a tensor in which some one-body \(\theta \)-parameters are already zero, these parameters remain zero. As a result, the \(\theta \)- and \(\eta \)-parameters of \(\mathcal {P}_{(k,l)}\) satisfy

$$\begin{aligned} \tilde{\theta }_{i_1,\dots ,i_d}&=0 \text { if } (i_1,\dots ,i_d)\in \Omega _{\mathcal {B}^{(k)}}\cup \Omega _{\mathcal {B}^{(l)}}, \end{aligned}$$
(23)
$$\begin{aligned} \tilde{\eta }_{i_1,\dots ,i_d}&=\eta _{i_1,\dots ,i_d} \text { if } (i_1,\dots ,i_d)\notin \Omega _{\mathcal {B}^{(k)}}\cup \Omega _{\mathcal {B}^{(l)}}, \end{aligned}$$
(24)

where \(\eta \) denotes the expectation parameters of \(\mathcal {P}\). This means that \(\mathcal {P}_{(k,l)}\) is the resulting tensor of the m-projection from \(\mathcal {P}\) onto \(\mathcal {B}=\mathcal {B}^{(k)}\cap \mathcal {B}^{(l)}\), since \(\Omega _{\mathcal {B}}=\Omega _{\mathcal {B}^{(k)}}\cup \Omega _{\mathcal {B}^{(l)}}\). In conclusion, we can obtain the tensor projected onto \(\mathcal {B}\) by rank-1 approximations of each subtensor of \(\mathcal {P}_{(k)}\) on mode-l. We can also immediately confirm \(\mathcal {P}_{(l,k)}=\mathcal {P}_{(k,l)}\); that is, the projection order does not matter. A sketch of the projection is shown in Fig. 8b.

For the general case Based on the above discussion, we can derive Step 2 for the general case of low-Tucker-rank approximation. We formulate non-negative low-Tucker-rank approximation as an m-projection onto the intersection of the bingo spaces of all modes, \(\mathcal {B}=\mathcal {B}^{(1)}\cap \dots \cap \mathcal {B}^{(d)}\). The m-projection destination is obtained by iteratively applying the m-projection d times, first from \(\mathcal {P}\) onto the subspace \(\mathcal {B}^{(1)}\), from there onto the subspace \(\mathcal {B}^{(2)}\), \(\dots \), and finally onto the subspace \(\mathcal {B}^{(d)}\). Note that each m-projection requires only rank-1 approximations of subtensors on the corresponding mode, and the result of LTR does not depend on the projection order. Since the m-projection minimizes the KL divergence from the input onto the bingo space, LTR always provides the best low-rank approximation in the specified bingo space \(\mathcal {B}\); that is, for a given non-negative tensor \(\mathcal {P}\), the output \(\mathcal {T}^*\) of LTR satisfies

$$\begin{aligned} \mathcal {T}^* = \mathop {\hbox {argmin}}\limits _{\mathcal {T} \in \mathcal {B}} D(\mathcal {P}, \mathcal {T}). \end{aligned}$$

The usual low-rank approximation without the bingo-space requirement approximates a tensor by a linear combination of appropriately chosen bases. In contrast, our method with the bingo-space requirement approximates a tensor by scaling bases and therefore has a smaller search space of low-rank tensors. This smaller search space allows us to derive an efficient algorithm without a gradient method that always outputs the globally optimal solution within the space.

3.3.7 Relationship to Legendre decomposition

Our theoretical analysis is closely related to Legendre decomposition [50], which also uses an information-geometric parameterization of tensors and solves the problem of tensor decomposition by a projection onto a subspace. However, its concept differs from ours in the following regard. In Legendre decomposition, an arbitrary point in a subspace with some constraints on the \(\theta \)-coordinates is taken as the initial state and is moved by gradient descent inside the subspace to minimize the KL divergence from an input tensor. This operation is an e-projection, in which the constrained \(\theta \)-coordinates do not change from the initial state. In contrast, we employ the m-projection from the input tensor onto the low-rank space by fixing some \(\eta \)-coordinates. Using the conservation law of the \(\eta \)-coordinates, we obtain an exact analytical representation of the coordinates of the projection destination without using a gradient method. Figure 8 illustrates the relationship between our approach and Legendre decomposition. Moreover, the Tucker rank is not discussed in Legendre decomposition, so it is not guaranteed that Legendre decomposition reduces the Tucker rank, in contrast to our method. In addition, although we derived the algorithm via the dually flat manifold with the \((\theta ,\eta )\)-coordinates, we do not have to know the values of \((\theta , \eta )\) during the algorithm, which also distinguishes our method from the related work in [17, 31].

Fig. 8 a The relationship among rank-1 approximation, Legendre decomposition [50], and the mean-field equation, where we assume that the same bingo space is used in Legendre decomposition. A solid line illustrates the m-projection that fixes the one-body \(\eta \)-parameters. \(\mathcal {P}\) is an input positive tensor and \(\mathcal {{\overline{P}}}\) is the rank-1 tensor that minimizes the KL divergence from \(\mathcal {P}\). \(\mathcal {O}\) is the initial point of Legendre decomposition, which is usually the uniform distribution. \(\mathcal {{\overline{P}}}_t\) is the tensor at the t-th step of gradient descent in Legendre decomposition. b The m-projection from \(\mathcal {P}\) onto the intersection of two different bingo spaces can be achieved by m-projecting onto one bingo space and then m-projecting onto the other

3.3.8 Connection between rank-1 approximation and balancing

So far, we have seen that the rank of a tensor can be reduced by describing the low-rank condition with the many-body \(\theta \)-parameters. Related to this task, it has been reported that characterizing tensors by one-body \(\eta \)-parameters enables an operation called tensor balancing [17], which is often solved by the Sinkhorn algorithm [51]; its quantum information-geometric counterpart has also been studied [52]. Therefore, we summarize the relationship between rank-1 approximation, which constrains the many-body \(\theta \)-parameters, and tensor balancing, which constrains the one-body \(\eta \)-parameters.

First, we introduce tensor balancing for a non-negative tensor \(\mathcal {P}\in \mathbb {R}^{I_1\times \cdots \times I_d}_{\ge 0}\). There are two kinds of balancing: fiber balancing and slice balancing. In this section, we discuss only slice balancing; see the supplementary material for a discussion of fiber balancing. Given d vectors \({\varvec{c}}^{(k)}\in \mathbb {R}^{I_k}_{\ge 0}\) for \(k\in [d]\), the task of \({\varvec{c}}\)-slice balancing is to rescale a tensor so that the sum of each slice satisfies

$$\begin{aligned} \sum ^{I_1}_{i_1=1} \dots \sum ^{I_{k-1}}_{i_{k-1}=1}\sum ^{I_{k+1}}_{i_{k+1}=1} \cdots \sum ^{I_d}_{i_d=1} \mathcal {P}_{i_1,\dots ,i_d}= c^{(k)}_{i_k} \end{aligned}$$
(25)

for \(i_k \in [I_k]\). Note that there is no solution when \(\sum _i c^{(k)}_i \ne \sum _i c^{(l)}_i\), so we always assume that \(\sum _i c^{(k)}_i = 1\) for any \(k\in [d]\) without loss of generality. The information-geometric formulation of tensor balancing has already been given in [17]. Recalling the definition of the \(\eta \)-parameters, the above condition for \({\varvec{c}}\)-slice balancing can be expressed by the one-body \(\eta \)-parameters of a tensor as

$$\begin{aligned} \eta ^{(k)}_{i_k}=\sum ^{I_k}_{i=i_k} c^{(k)}_{i}. \end{aligned}$$
(26)

Let us define the \({\varvec{c}}\)-slice balancing space \(\mathcal {Q}_{{\varvec{c}}}\) as the set of \({\varvec{c}}\)-slice balanced tensors, yielding \(\mathcal {Q}_{{\varvec{c}}} = \{ p_\eta \mid \eta \) satisfies condition (26) \(\}\). By considering \({\varvec{c}}\)-slice balancing and rank-1 approximation simultaneously in the framework of information geometry, we can derive the following property: the \({\varvec{c}}\)-balanced rank-1 tensor always exists uniquely. As discussed in Sect. 2.2, we can identify a distribution using the mixture coordinate system \((\theta , \eta )\) that combines \(\theta \)- and \(\eta \)-coordinates. On the intersection \(\mathcal {Q}_{{\varvec{c}}} \cap \mathcal {B}_1\), where \(\mathcal {B}_1\) is the rank-1 space, all one-body parameters are identified by the \({\varvec{c}}\)-balancing condition in Eq. (26) and the other parameters are identified by the rank-1 condition in Proposition 3. The balancing conditions and the rank-1 condition thus specify all parameters; therefore, the mixture coordinate \((\theta , \eta )\) uniquely identifies the rank-1 \({\varvec{c}}\)-balanced tensor. See Supplement for proof.

Theorem 2

The intersection \(\mathcal {Q}_{{\varvec{c}}} \cap \mathcal {B}_1\) is a singleton.

To gain intuition about the geometric structure of the rank-1 and balancing conditions, we examine the simple case of \(d = 2\) and \(I_1=I_2=2\), illustrated as 3D plots in Fig. 9. Let us consider a \({\varvec{c}}\)-balanced matrix \(\textbf{P} \in \mathbb {R}_{\ge 0}^{2\times 2}\). We obtain the \(\theta \)- and \(\eta \)-coordinates of \(\textbf{P}\in \mathcal {Q}_{{\varvec{c}}}\) in terms of \(\textbf{P}_{22}\):

$$\begin{aligned} \theta \left( {\textbf{P}}\right) = \log {} \begin{bmatrix} (1-{c}^{(1)}_2-{c}^{(2)}_2+\textbf{P}_{22}) &{}\quad \frac{{c}^{(2)}_2-\textbf{P}_{22}}{(1-{c}^{(1)}_2-{c}^{(2)}_2+\textbf{P}_{22})} \\ \frac{{c}^{(1)}_2-\textbf{P}_{22}}{(1-{c}^{(1)}_2-{c}^{(2)}_2+\textbf{P}_{22})} &{}\quad \frac{\textbf{P}_{22}(1-{c}^{(1)}_2-{c}^{(2)}_2+\textbf{P}_{22})}{(c^{(2)}_2-\textbf{P}_{22})({c}^{(1)}_2-\textbf{P}_{22})} \\ \end{bmatrix} , \eta \left( {\textbf{P}}\right) = \begin{bmatrix} 1 &{}\quad {c}^{(2)}_2 \\ {c}^{(1)}_2 &{}\quad \textbf{P}_{22} \\ \end{bmatrix}, \end{aligned}$$

where \(\log \) is the element-wise logarithm. Recall that \(\theta _{11}\) corresponds to the normalizing factor and \(\eta _{11} = 1\). The subspace consisting of \({\varvec{c}}\)-balanced matrices can be drawn as a convex curve in a three-dimensional space by regarding \(\textbf{P}_{22}\) as a free parameter. The curve becomes a straight line in the \(\theta \)-coordinates only when \(\varvec{c}^{(1)} = \varvec{c}^{(2)} = (0.5,0.5)\). In contrast, the set of rank-1 matrices is identified as the plane \((\theta _{21},\theta _{12},0)\) in the \(\theta \)-space, since \(\theta _{22}=0\) ensures \(\textrm{rank}(\textbf{P})=1\), and as the surface \((\eta _{21},\eta _{12},\eta _{21}\eta _{12})\) in the \(\eta \)-space (see Propositions 3 and 4). We can observe that the \({\varvec{c}}\)-slice balancing space \(\mathcal {Q}_{{\varvec{c}}}\) and the rank-1 space \(\mathcal {B}_1\) intersect at a single point, as shown in Fig. 9, which is consistent with Theorem 2. The intersection point changes with \(\varvec{c}^{(1)}\) and \(\varvec{c}^{(2)}\). In addition, Fig. 9 shows that we obtain the same matrix by rank-1 approximation of any matrix on \(\mathcal {Q}_{{\varvec{c}}}\), since rank-1 approximation does not change the values of the one-body \(\eta \)-parameters.
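As a small numerical illustration of Theorem 2 and the remark above (our sketch, using a standard Sinkhorn-style scaling rather than the balancing algorithm of [17]): balancing several different positive matrices to the same \({\varvec{c}}\) and then applying the rank-1 approximation of Eq. (11) yields one and the same matrix, namely the outer product \({\varvec{c}}^{(1)} \otimes {\varvec{c}}^{(2)}\), which is the unique \({\varvec{c}}\)-balanced rank-1 matrix by Propositions 4 and 5.

```python
import numpy as np

def sinkhorn_balance(X, c1, c2, n_iter=200):
    """Rescale rows and columns of a positive matrix so that the row sums
    approach c1 and the column sums approach c2 (with sum(c1) = sum(c2) = 1)."""
    X = X.copy()
    for _ in range(n_iter):
        X *= (c1 / X.sum(axis=1))[:, None]     # match row sums
        X *= (c2 / X.sum(axis=0))[None, :]     # match column sums
    return X

def best_rank1(X):
    """Best rank-1 approximation of a positive matrix in KL divergence (Eq. (11))."""
    return np.outer(X.sum(axis=1), X.sum(axis=0)) / X.sum()

rng = np.random.default_rng(5)
c1, c2 = np.array([0.4, 0.6]), np.array([0.3, 0.7])

results = []
for _ in range(3):
    X = rng.random((2, 2)) + 0.1
    B = sinkhorn_balance(X, c1, c2)            # a point on the balancing space Q_c
    results.append(best_rank1(B))              # m-projection onto the rank-1 space B_1

# every balanced matrix projects to the same rank-1 matrix: the unique point of Q_c ∩ B_1
for R in results:
    assert np.allclose(R, np.outer(c1, c2), atol=1e-6)
```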

Fig. 9 The balancing space \(\mathcal {Q}_{{\varvec{c}}}\) (blue) and the rank-1 space \(\mathcal {B}_1\) (orange) in the \(\theta \)-space (left) and the \(\eta \)-space (right) with \(I_1=I_2=2\) and \(\varvec{c}^{(1)}=\varvec{c}^{(2)}=(0.4,0.6)\)

3.4 Relation between A1GM and LTR

We summarize the relationship between the two proposed algorithms, A1GM and LTR, both of which are based on the log-linear model on posets and its convex optimization via information geometry. The difference between A1GM and LTR lies in the structure of the posets behind the algorithms. After designing proper posets, both algorithms perform an m-projection in which some natural parameters become zero. Interestingly, the \(\theta \)- and \(\eta \)-parameters are not computed explicitly during either algorithm. This is because describing the low-rank condition in a dually flat coordinate system and using a conservation law of the parameters allows us to obtain the projection destination in closed form.

It is also known that the constraints of tensor balancing can be described in terms of expectation parameters, as we saw in Sect. 3.3.8. In our framework, the task to be solved, such as low-rank approximation or balancing, is described as a constraint in a dually flat coordinate system. We expect that higher-rank approximations of multiple matrices could also be made possible by defining bingos as in LTR.

In summary, our approach formulates tasks as convex optimization problems by modeling the input data structure as a proper poset and describing the constraints of the task in a dually flat coordinate system.

4 Numerical experiments

We empirically examined the efficiency and the effectiveness of the proposed methods A1GM and LTR using synthetic and real-world datasets. See Supplement for implementation and dataset details.

4.1 Experiments for A1GM

We used three types of data to empirically investigate the efficiency and effectiveness of A1GM: (i) synthetic data with missing values in the top-right corner, (ii) synthetic data with random grid-like missing values, and (iii) real data with grid-like and non-grid-like missing values. It is guaranteed that A1GM always finds the best solution for any data of types (i) and (ii); thus, we only investigate efficiency in the experiments for (i) and (ii). By contrast, for data of type (iii), the reconstruction error can be worse than that of existing methods because A1GM increases the number of missing values. Therefore, in the experiment for data (iii), we investigate both efficiency (running time) and effectiveness (reconstruction error).

We use KL-WNMF as a comparison method [53]. KL-WNMF is a commonly used gradient method that reduces the KL-based cost in Eq. (7) by multiplicative updates. Although faster NMF methods, such as ALS [54] and ADMM [55], have been developed, they are only about as fast as multiplicative updates when the target rank is small [56]. Moreover, as we show in Sects. 4.1.1 and 4.1.2, KL-WNMF converges within only 2–4 iterations in our experiments. Thus, these techniques are considered ineffective for speeding up rank-1 KL-WNMF, which is why we compared A1GM only with simple KL-WNMF. We implemented KL-WNMF by referring to the original paper [53]. The stopping criterion of KL-WNMF follows the implementation of the standard NMF in scikit-learn [57]. The initial values of KL-WNMF are drawn from the uniform distribution on [0, 1].

4.1.1 Synthetic datasets

Missing values in the top-right corner We prepared synthetic matrices \(\textbf{X}\in \mathbb {R}^{N \times N}\) and their weights \(\varvec{\Phi }\in \{0,1\}^{N \times N}\). We assumed that each input weight \(\varvec{\Phi }\) is of the form of Eq. (8) with \(L=0\). We measured the running time to obtain the rank-1 decomposition of \(\textbf{X}\) while varying the matrix size N. Figure 10a shows that A1GM is an order of magnitude faster than the existing gradient method. The number of iterations of the existing method until convergence was between 2 and 4. A1GM simply applies the closed formula in Theorem 1 to parts of the input matrices.

Fig. 10 Running time comparison of the proposed method A1GM (triangle, dotted line) and KL-WNMF (circle, dashed line) with respect to the matrix size N. a Missing values are in the top-right corner. b Missing value positions are grid-like. We plot the mean ± S.D. of five trials

Random grid-like missing values We also prepared synthetic matrices and their binary weight matrices \(\varvec{\Phi }\in \{0,1\}^{N \times N}\). We assumed that every input weight matrix \(\varvec{\Phi }\) is grid-like and set the ratio of missing values to 5 percent. We measured the running time of A1GM to complete the best rank-1 missing NMF compared with KL-WNMF, varying the matrix size N. Figure 10b shows that our method is always faster than the gradient method. The number of iterations of the existing method required for convergence was between 2 and 4. Note that, for these datasets, A1GM does not need to increase the number of missing values.

4.1.2 Real datasets

We used 20 real datasets. We downloaded tabular datasets that have missing values from the Kaggle databank or the UCI repository. If a dataset contains negative values, we converted them to their absolute values. Zero values in a matrix were replaced with the average value of the matrix to make all entries positive. We evaluated the relative error as \(D_{\varvec{\Phi }}(\textbf{X},\textrm{A1GM}(\textbf{X})) \big / D_{\varvec{\Phi }}(\textbf{X},\textrm{WNMF}(\textbf{X}))\), where \(\textrm{WNMF}(\textbf{X})\) and \(\textrm{A1GM}(\textbf{X})\) are the rank-1 matrices reconstructed by KL-WNMF and A1GM, respectively, and the binary weight matrix \(\varvec{\Phi }\) indicates the locations of the missing values of \(\textbf{X}\). We also compared the running time of A1GM relative to that of KL-WNMF.

The results are summarized in Table 1. In the table, the column increase rate is the ratio of the number of missing values after the addition step in A1GM to the original number of missing values. An increase rate of 1 means that the locations of the missing values of the dataset are originally grid-like. For such datasets, it is theoretically guaranteed that our method A1GM always provides the best rank-1 missing NMF, which minimizes the KL divergence in Eq. (7). It is reasonable that the matrices reconstructed by KL-WNMF and by A1GM are the same, since the cost function (7) is convex in \({\varvec{w}}\) and \({\varvec{h}}\). The number of iterations of the existing method required for convergence was between 2 and 4 for the real datasets.

We can see that A1GM is much faster than KL-WNMF for all datasets. Moreover, the relative error remains low for most datasets even when their missing values are not grid-like. In some real data, missing values are likely to be biased toward a particular row or column; as a result, they become grid-like by adding only a small number of missing values. In these cases, our proposed method can conduct rank-1 missing NMF rapidly with errors competitive with KL-WNMF. By contrast, for some datasets a large amount of information is lost in the step that increases missing values (large increase rates); as a result, our method is not suitable for obtaining an accurately reconstructed rank-1 matrix for them, even though it is much faster than the existing method.

Table 1 Performance of A1GM compared to KL-WNMF on real datasets

4.2 Experiments for LTR

We compared LTR with two existing non-negative low-Tucker-rank approximation methods. The first is non-negative Tucker decomposition, the standard non-negative tensor decomposition method [58], whose cost function is either the least squares (LS) error (NTD_LS) or the KL divergence (NTD_KL). The second is sequential non-negative Tucker decomposition (lraSNTD), which is known to be faster than standard NTD [59]; its cost function is the LS error.

Fig. 11 Experimental results for synthetic (a, b) and real-world (c, d) datasets. Mean errors ± standard error over 20 trials are plotted. a The horizontal axis is r for the target Tucker rank \((r, \dots , r)\). b The horizontal axis is \(n^3\) for an \((n, n, n)\) input tensor. c, d The horizontal axis is the number of elements of the core tensor

Results on synthetic data We created tensors with \(d = 3\) or \(d = 5\), where every \(I_k = n\), and changed n to generate tensors of various sizes. Each element was sampled from the uniform distribution on [0, 1]. To evaluate efficiency, we measured the running time of each method. To evaluate accuracy, we measured the LS reconstruction error, defined as the Frobenius norm between the input and output tensors. Figure 11a shows the running time and the LS reconstruction error for randomly generated tensors with \(d=3\) and \(n= 30\), varying the target Tucker rank. Figure 11b shows the running time and the LS reconstruction error for the target Tucker rank (10, 10, 10), varying the input tensor size n. These plots clearly show that our method is faster than the other methods while retaining competitive approximation accuracy.

Results on real data We evaluated the running time and the LS reconstruction error on two real-world datasets: 4DLFD, a (9, 9, 512, 512, 3) tensor [60], and AttFace, a (92, 112, 400) tensor [61] that is commonly used in tensor decomposition experiments [9, 16, 59]. For the 4DLFD dataset, we chose the target Tucker ranks (1,1,1,1,1), (2,2,2,2,1), (3,3,4,4,1), (3,3,5,5,1), (3,3,6,6,1), (3,3,7,7,1), (3,3,8,8,1), (3,3,16,16,1), (3,3,20,20,1), (3,3,40,40,1), (3,3,60,60,1), and (3,3,80,80,1). For the AttFace dataset, we chose (1,1,1), (3,3,3), (5,5,5), (10,10,10), (15,15,15), (20,20,20), (30,30,30), (40,40,40), (50,50,50), (60,60,60), (70,70,70), and (80,80,80). On both datasets, LTR is always faster than the comparison methods, as shown in Fig. 11c, d, with competitive or better approximation accuracy in terms of the LS error. We obtained almost the same results as in Fig. 11 with the KL reconstruction error and provide these results in tabular format in the Supplement.

As described in Sect. 3.3.6, the search space of LTR is smaller than that of NTD and lraSNTD. Nevertheless, our experiments show that the approximation accuracy of LTR is competitive with the other methods, which suggests that NTD and lraSNTD do not effectively exploit linear combinations of bases.

5 Conclusion

We have discussed non-negative low-rank approximation from an information-geometric viewpoint. Specifically, using a log-linear model with a partial order structure on the sample space, we have treated a tensor as a probability distribution and characterized the rank of the tensor in a dually flat coordinate system. We have pointed out that rank-1 approximation of a non-negative tensor can be geometrically captured as a mean-field approximation, which reduces a many-body problem to a one-body problem and is frequently used in statistical physics. We have also constructed a new algorithm, LTR, that performs low-Tucker-rank approximation by appropriately designing constraints in the dually flat coordinate system.

We have also analyzed the problem of non-negative multiple matrix factorization (NMMF) in the same way and found a closed formula for the globally optimal rank-1 NMMF. By exploiting the equivalence between NMMF and the decomposition of matrices with missing values, we have constructed a new algorithm, A1GM, that quickly computes approximate solutions to rank-1 approximation of matrices with missing values without using a gradient method. A1GM can extract the largest principal component from matrices with missing values.

Our work will form the basis for further investigation of the connections among linear-algebraic matrix operations, statistics, and machine learning via information geometry.