1 Introduction

1.1 Background

Nonnegative tensor decomposition is a powerful tool in signal processing and machine learning [10, 35]. Nonnegative CANDECOMP/PARAFAC (NCP), as an important decomposition method, has been widely applied to the processing of multiway data, such as hyperspectral data [39], electroencephalography (EEG) data [11], fluorescence excitation-emission matrix (EEM) data [13], neural data [46], and many other multiway tensor data [30]. In many cases, the components extracted by NCP are not only nonnegative but also sparse. For example, the spectral components from EEG tensor decomposition are usually very sparse, representing the narrow-band frequencies of certain brain activities [11]. As another example, after decomposing an EEM tensor, a component in the sample mode denotes the concentrations of a compound in all samples [5], which is sometimes also sparse. The nonnegativity constraint in NCP naturally leads to somewhat sparse results. However, this sparsity is only a side effect and cannot be controlled to a desired level [18]. Without properly controlling the sparsity, the intrinsic components in the data cannot be extracted precisely, especially under low signal-to-noise ratio conditions. Therefore, in order to extract meaningful and accurate sparse components, additional sparse regularization is necessary for NCP tensor decomposition.

The design of NCP decomposition with explicit sparse regularization (sparse NCP) can benefit greatly from methods developed for nonnegative matrix factorization (NMF). On the one hand, an early NMF study [18] proposed projecting the components onto sparse vectors at a given sparsity level. However, this method keeps all components at the same fixed sparsity level, which is not in line with the true sparsity of the different components in the data. On the other hand, incorporating sparse regularization terms into the optimization model is a popular approach. The \(l_1\)-norm is a conventional and effective regularizer for imposing sparsity in signal processing [6]. The reason is that, for most underdetermined linear equations, the optimization problem with \(l_1\)-norm regularization can yield strong sparsity [12]. More information about sparse regularization can be found in [2, 34, 50].

Many works have been devoted to tensor decomposition with sparse regularization, but only a few address NCP. The works of [1, 21] and [29] studied sparse regularization for tensor decomposition using the \(l_1\)-norm and the trace norm, but they only considered the unconstrained CP model without the nonnegativity constraint. The works of [14, 28, 31] and [47] proposed methods that impose sparsity by the \(l_1\)-norm on nonnegative Tucker decomposition. However, these methods are not suitable for large-scale problems [47], and their effectiveness for NCP is unknown. Kim et al. considered solving sparse NCP using alternating nonnegative least squares (ANLS) [23]. Nevertheless, ANLS suffers seriously from the rank deficiency caused by high sparsity or zero components in the factor matrices. Recently, Huang et al. proposed an alternating optimization-based ADMM (AO-ADMM) method, which can handle the \(l_1\)-norm regularization term in NCP [19]. However, no experimental evaluation of the sparse NCP is provided in [19]. The work [32] proposed a sparse NCP algorithm targeted at multiway co-clustering. In practical applications, sparse NCP faces the following two major challenges.

One challenge is that, when the tensor data are highly sparse or strong sparse regularization is imposed on the decomposition, more and more zero components will appear in the factor matrices. Thus, the factor matrices are not of full column rank, which causes the rank deficiency problem. Rank deficiency in turn leads to poor convergence of the tensor decomposition algorithm. The proximal algorithm has been introduced as an excellent way to improve the convergence of a mathematical optimization method [4]. In an iterative optimization problem, the proximal algorithm is constructed by adding a proximal regularization term to the original model. This proximal term is the squared Frobenius norm of the difference between the current variable and its value in the previous iteration [4]. The proximal algorithm can naturally be incorporated into tensor decomposition [26].

The other challenge is that, for large-scale tensor data, sparse NCP decomposition might be inefficient. It has been reported that the inexact block coordinate descent scheme can accelerate convergence and is very beneficial for large-scale problems [15, 40]. Hence, the inexact scheme can be employed in the sparse NCP problem.

1.2 Contribution

Firstly, in this paper, we propose a novel sparse NCP method with the \(l_1\)-norm and the proximal algorithm. The proposed sparse NCP overcomes the rank deficiency and guarantees that the decomposition converges to a stationary point. Block coordinate descent (BCD) is one of the main techniques for tensor decomposition, especially the constrained one [24]. In the BCD framework, each factor matrix is updated alternately as a subproblem while the other factor matrices are fixed. With the proximal algorithm, the proximal regularization term makes the subproblems strongly convex [26] and provides a full column rank condition for the sparse NCP.

Secondly, we develop an inexact BCD scheme for the novel sparse NCP model. The inexact scheme speeds up the computation of the sparse NCP, especially in large-scale cases. Specifically, in the inexact BCD scheme, the subproblem of the sparse NCP is iterated several times to update a factor matrix.

Thirdly, in order to demonstrate the viability of the sparse NCP model with the proximal algorithm and the inexact scheme, we employ two efficient optimization algorithms to solve the model: inexact alternating nonnegative quadratic programming and inexact hierarchical alternating least squares. We evaluate the proposed sparse NCP methods on synthetic, real-world, small-scale and large-scale tensor data. With properly selected and tuned sparse regularization, the proposed methods are shown to impose sparsity on the factor matrices effectively and efficiently.

1.3 Organization

The rest of this paper is organized as follows. Section 2 introduces some preliminaries. In Sect. 3, we describe the mathematical model of sparse NCP with the proximal algorithm and the inexact BCD scheme. Section 4 elucidates the solutions to the sparse NCP model using two optimization methods. Section 5 describes the detailed experiments on synthetic and real-world datasets. Some critical observations are discussed in Sect. 6. Finally, we conclude the paper in Sect. 7.

2 Preliminaries

In this paper, the operator \(\circ\) represents the outer product of vectors, \(\odot\) represents the Khatri-Rao product, * represents the Hadamard product (the elementwise matrix product), 〈 〉 represents the inner product, 〚 〛 represents the Kruskal operator and [ ]+ represents the nonnegative projection. \(||\; ||_{F}\) denotes the Frobenius norm, and \(||\; ||_1\) denotes the \(l_1\)-norm. Basics of tensor computation and multilinear algebra can be found in the review papers [25, 35].
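For concreteness, a small MATLAB illustration of these operators on arbitrary toy matrices (not taken from the paper) is given below:

```matlab
% Small numerical illustration of the operators defined above (toy matrices).
A = [1 2; 3 4];
B = [0 1; 5 2];
H  = A .* B;                                          % Hadamard product A * B
KR = [kron(A(:,1), B(:,1)), kron(A(:,2), B(:,2))];    % Khatri-Rao product A ⊙ B (columnwise Kronecker, 4 x 2)
ip = sum(sum(A .* B));                                % inner product <A, B> = 25
Cp = max([-1 2; 3 -4], 0);                            % nonnegative projection [ ]_+ (negatives set to zero)
```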

2.1 Nonnegative CP decomposition

Given an Nth-order nonnegative tensor \(\varvec{\mathscr {X}}\in \mathbb {R}^{I_{1} \times I_{2} \times \cdots \times I_{N}}\) and a positive integer R, nonnegative CANDECOMP/PARAFAC (NCP) solves the following minimization problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)}} \frac{1}{2}&||\varvec{\mathscr {X}}-\llbracket \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \rrbracket ||_{F}^{2}\\ \text {s.t. } \varvec{A}^{(n)}&\geqslant 0 \text { for } n=1, \dotsc ,N, \end{aligned} \end{aligned}$$
(1)

where \(\varvec{A}^{(n)}\in \mathbb {R}^{I_{n} \times R}\) for \(n=1,\dotsc ,N\) are the estimated factor matrices in the different modes, \(I_{n}\) is the size of mode-n, and R is the initial number of components. We use \(\mathscr {F}_{\text {tensor}}\big ( \varvec{A} \big ) = \mathscr {F}_{\text {tensor}}\big ( \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \big )\) to denote the objective function in (1). The estimated factor matrices in the Kruskal operator can be represented by the sum of R rank-1 tensors in outer product form:

$$\begin{aligned} \llbracket \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \rrbracket = \sum _{r=1}^{R} \varvec{\mathscr {Y}}_r = \sum _{r=1}^{R}\varvec{a}_{r}^{(1)} \circ \cdots \circ \varvec{a}_{r}^{(N)}, \end{aligned}$$
(2)

where \(\varvec{a}_{r}^{(n)}\) represents the rth column of \(\varvec{A}^{(n)}\).

Let \(\varvec{X}_{(n)}\in \mathbb {R}^{I_n \times \prod _{\tilde{n}=1,\tilde{n}\ne n}^N I_{\tilde{n}}}\) represent the mode-n unfolding (matricization) of the original tensor \(\varvec{\mathscr {X}}\). The mode-n unfolding of the estimated tensor in the Kruskal operator \(\llbracket \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \rrbracket\) can be written as \(\varvec{A}^{(n)}{\big (\varvec{B}^{(n)}\big )}^T\), in which \(\varvec{B}^{(n)}=\big (\varvec{A}^{(N)}\odot \cdots \odot \varvec{A}^{(n+1)}\odot \varvec{A}^{(n-1)}\odot \cdots \odot \varvec{A}^{(1)}\big )\in \mathbb {R}^{\prod _{\tilde{n}=1,\tilde{n}\ne n}^N I_{\tilde{n}} \times R}\). In the BCD framework, each factor \(\varvec{A}^{(n)}\) is updated alternately by a subproblem in every iteration, which is equivalent to the following minimization problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(n)}} \mathscr {F} \big (\varvec{A}^{(n)}\big )&= \frac{1}{2} ||\varvec{X}_{(n)}-\varvec{A}^{(n)}{\big (\varvec{B}^{(n)}\big )}^T||_{F}^{2}\\&\text {s.t. } \varvec{A}^{(n)} \geqslant 0. \end{aligned} \end{aligned}$$
(3)

The partial gradient (or partial derivative) of \(\mathscr {F} \big (\varvec{A}^{(n)}\big )\) with respect to \(\varvec{A}^{(n)}\) is

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial \varvec{A}^{(n)}} \mathscr {F} \big (\varvec{A}^{(n)}\big ) = \varvec{A}^{(n)} \big [{\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)} \big ]-\varvec{X}_{(n)} \varvec{B}^{(n)}. \end{aligned} \end{aligned}$$
(4)

In (4), the term \(\varvec{X}_{(n)} \varvec{B}^{(n)}\) is called the Matricized Tensor Times Khatri-Rao Product (MTTKRP) [35]. The term \({\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)}\) can be computed efficiently by

$$\begin{aligned} {\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)} =\;&\big [ {\big ( \varvec{A}^{(N)} \big )}^T\!\varvec{A}^{(N)} \big ] *\cdots *\big [ {\big ( \varvec{A}^{(n+1)} \big )}^T\!\varvec{A}^{(n+1)} \big ]\\ *\;&\big [ {\big ( \varvec{A}^{(n-1)} \big )}^T\!\varvec{A}^{(n-1)} \big ] *\cdots *\big [ {\big ( \varvec{A}^{(1)} \big )}^T\!\varvec{A}^{(1)} \big ]. \end{aligned}$$
(5)
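As a concrete illustration of (5), the following MATLAB sketch (with toy factor matrices A2, A3 that are not part of the paper) computes \({(\varvec{B}^{(1)})}^T\varvec{B}^{(1)}\) for a third-order tensor both via the Hadamard product of the small \(R \times R\) Gram matrices and via the explicit Khatri-Rao product:

```matlab
% Toy check of (5) for a third-order tensor and mode n = 1 (illustrative names).
R = 10;
A2 = rand(40, R); A3 = rand(30, R);

% Efficient route: Hadamard product of the R-by-R Gram matrices of the other modes
BtB = (A3' * A3) .* (A2' * A2);

% Direct route for verification: build B^(1) = A3 (Khatri-Rao) A2 explicitly
B = zeros(size(A3,1) * size(A2,1), R);
for r = 1:R
    B(:, r) = kron(A3(:, r), A2(:, r));     % columnwise Kronecker product
end
norm(BtB - B' * B, 'fro')                   % ≈ 0 up to rounding: the two routes agree
```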

2.2 Sparse regularization with \(l_1\)-norm

In order to impose sparsity on the factor matrices, it is natural to incorporate sparse regularization terms based on the \(l_1\)-norm [9, 47] into the objective function in (1), which leads to the following basic sparse NCP problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)}} \frac{1}{2}&||\varvec{\mathscr {X}}-\llbracket \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \rrbracket ||_{F}^{2}+\sum _{n=1}^N\beta _n\sum _{r=1}^R||\varvec{a}_r^{(n)}||_1\\&\text {s.t. } \varvec{A}^{(n)} \geqslant 0 \text { for } n=1, \dotsc ,N, \end{aligned} \end{aligned}$$
(6)

where the \(\beta _n\) are positive sparse regularization parameters collected in the parameter vector \(\varvec{\beta }\in \mathbb {R}^{N\times 1}\). The subproblem can be written as the following optimization problem

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(n)}} \mathscr {F}_0 \big (\varvec{A}^{(n)}\big ) =&\frac{1}{2} \Big \Vert \varvec{X}_{(n)}-\varvec{A}^{(n)}{\big (\varvec{B}^{(n)}\big )}^T\Big \Vert _{F}^{2} + \beta _n\sum _{r=1}^R\Big \Vert \varvec{a}_r^{(n)}\Big \Vert _1\\&\text {s.t. } \varvec{A}^{(n)} \geqslant 0. \end{aligned} \end{aligned}$$
(7)

In the objective function of the subproblem, the sparse regularization is imposed on the factor matrix \(\varvec{A}^{(n)}\) by the \(l_1\)-norm.

3 The proposed sparse NCP model

3.1 Sparse NCP with proximal algorithm

The basic sparse NCP in (6) has a serious drawback. When strong sparse regularization is imposed in (6), many zero columns will appear in the factor matrices \(\varvec{A}^{(n)}\). Thus, neither \(\varvec{A}^{(n)}\) nor \(\varvec{B}^{(n)}\) can be guaranteed to be of full column rank. Therefore, the basic sparse NCP model in (6) suffers from rank deficiency and is not guaranteed to converge.

In order to overcome this drawback, we propose the following sparse NCP model with the proximal algorithm (a proximal regularization term based on the squared Frobenius norm):

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)}}&\bigg \lbrace \frac{1}{2} \Big \Vert \varvec{\mathscr {X}}-\llbracket \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \rrbracket \Big \Vert _{F}^{2}\\ + \sum _{n=1}^N&\frac{\alpha _n}{2}\Big \Vert \widetilde{\varvec{A}}^{(n)} - \varvec{A}^{(n)}\Big \Vert _F^2+\sum _{n=1}^N\beta _n\sum _{r=1}^R\Big \Vert \varvec{a}_r^{(n)}\Big \Vert _1 \bigg \rbrace \\&\text {s.t. } \varvec{A}^{(n)} \geqslant 0 \text { for } n=1, \dotsc ,N, \end{aligned} \end{aligned}$$
(8)

where \(\widetilde{\varvec{A}}^{(n)}\) is the value of the factor \(\varvec{A}^{(n)}\) in the previous iteration during the updates, and the \(\alpha _n\) are positive regularization parameters collected in the vector \(\varvec{\alpha }\in \mathbb {R}^{N\times 1}\).

In the BCD framework, the subproblem of model (8) can be written as the following minimization problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(n)}} \mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big ) = \bigg \lbrace \frac{1}{2}&\Big \Vert \varvec{X}_{(n)}-\varvec{A}^{(n)}{\big (\varvec{B}^{(n)}\big )}^T\Big \Vert _{F}^{2}\\ +\frac{\alpha _n}{2}&\Big \Vert \widetilde{\varvec{A}}^{(n)} - \varvec{A}^{(n)}\Big \Vert _F^2 +\beta _n\sum _{r=1}^R\Big \Vert \varvec{a}_r^{(n)}\Big \Vert _1 \bigg \rbrace \\ \text {s.t. }&\varvec{A}^{(n)} \geqslant 0. \end{aligned} \end{aligned}$$
(9)

The objective function \(\mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big )\) can be further represented by the following form:

$$\begin{aligned} \begin{aligned} \mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big )&=\frac{1}{2} \left \Vert \begin{pmatrix} \varvec{X}_{(n)}^T\\ \sqrt{\alpha _n}{\big (\widetilde{\varvec{A}}^{(n)}\big )}^T \end{pmatrix}- \begin{pmatrix} \varvec{B}^{(n)}\\ \sqrt{\alpha _n}\varvec{I}_R \end{pmatrix} {\big (\varvec{A}^{(n)}\big )}^T \right\Vert_F^2\\&\quad +\beta _n\sum _{r=1}^R\Big \Vert \varvec{a}_r^{(n)}\Big \Vert _1. \end{aligned} \end{aligned}$$
(10)

In (10), it is clear that the stacked matrix \(\begin{pmatrix} \varvec{B}^{(n)}\\ \sqrt{\alpha _n}\varvec{I}_R \end{pmatrix}\) is always of full column rank, even when \(\varvec{B}^{(n)}\) is not. Thus, the proposed sparse NCP with the proximal algorithm successfully overcomes the rank deficiency problem in the objective function.
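A quick numerical check of this claim (with arbitrary toy sizes, not taken from the paper) is sketched below: even when \(\varvec{B}^{(n)}\) contains zero columns, the stacked matrix keeps full column rank.

```matlab
% Toy check: stacking sqrt(alpha_n) * I below B^(n) restores full column rank.
R = 5; alpha_n = 1e-4;
B = rand(200, R);
B(:, [2 4]) = 0;                              % simulate two zero components
rank(B)                                       % = 3 (rank deficient)
rank([B; sqrt(alpha_n) * eye(R)])             % = 5 (full column rank)
```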

3.2 Inexact block coordinate descent scheme

BCD is a main framework for solving tensor decompositions. It has been reported that the inexact BCD scheme can accelerate the computation [15, 40]. Specifically, the factor matrices \(\varvec{A}^{(n)}, n=1,\dotsc , N\), are updated alternately in the outer iterations; meanwhile, within the subproblem (9), the factor \(\varvec{A}^{(n)}\) is updated several times in inner iterations. The procedures of the inexact scheme are listed in Algorithm 1.

[Algorithm 1: inexact BCD scheme for the sparse NCP]

3.3 Convergence analysis

The proposed sparse NCP method in (8) is guaranteed to converge to a stationary point.

Proposition 1

Every limit point of the sequence \({\left\{ \varvec{A}_k^{(1)}, \dotsc , \varvec{A}_k^{(N)} \right\} }_{k=1}^\infty\) generated by the sparse NCP in Algorithm 1 is a stationary point of (6).

Proof

The objective function \(\mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big )\) in (9) with the proximal regularization term is strongly convex [4]. Moreover, \(\mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big )\) is a proximal upper bound [17, 33] of the objective function \(\mathscr {F}_0 \big (\varvec{A}^{(n)}\big )\) in (7). Using the inexact block coordinate descent scheme, the subproblem in Algorithm 1 is updated by a finite number of inner iterations. According to Theorem 2 in [49], every limit point of the sequence \({\left\{ \varvec{A}_k^{(1)}, \dotsc , \varvec{A}_k^{(N)} \right\} }_{k=1}^\infty\) generated by the sparse NCP in Algorithm 1 is a stationary point of (6). \(\square\)

4 Optimization methods for solving sparse NCP

In order to demonstrate the viability and effectiveness of the novel sparse NCP with the proximal algorithm and the inexact scheme, we employ the following two optimization methods to solve the model.

4.1 Alternating nonnegative quadratic programming

First, we utilize a method based on a general form of alternating nonnegative least squares (ANLS). The classical ANLS is an important tool for NMF and NCP [24]. Many efficient optimization algorithms have been proposed to solve the nonnegative least squares (NNLS) subproblems, such as active-set (AS) [20] and block principal pivoting (BPP) [22]. However, there are two limitations to applying ANLS to sparse NCP. The first limitation is that ANLS is very prone to rank deficiency; the proximal algorithm in our sparse NCP model tackles this limitation. The second limitation is that the subproblem of our proposed sparse NCP model cannot be represented in a pure least squares form due to the \(l_1\)-norm regularization, as can be seen clearly in (10). Therefore, a different form of the objective function in (8) has to be considered.

Inspired by [27], the subproblem of the proposed sparse NCP in (9) can be represented in the nonnegative quadratic programming (NNQP) form as the following problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(n)}} \sum _{i=1}^{I_n}&\bigg \{\frac{1}{2}{\big [\varvec{A}^{(n)}\big ]}_{(i,:)}\varvec{M}{\big [\varvec{A}^{(n)}\big ]}_{(i,:)}^{T} + \varvec{N}_{(i,:)} {\big [\varvec{A}^{(n)}\big ]}_{(i,:)}^{T}\\&+ \frac{1}{2}{\big [\varvec{X}_{(n)}\big ]}_{(i,:)}{\big [\varvec{X}_{(n)}\big ]}_{(i,:)}^T + \frac{\alpha _n}{2} {\big [\widetilde{\varvec{A}}^{(n)}\big ]}_{(i,:)}{\big [\widetilde{\varvec{A}}^{(n)}\big ]}_{(i,:)}^T \bigg \}\\ \text {s.t. }&\varvec{A}^{(n)} \geqslant 0, \end{aligned} \end{aligned}$$
(11)

where \({\big [\; \big ]}_{(i,:)}\) represents the ith row of a matrix, \(\varvec{M}={\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)} + \alpha _n \varvec{I}_R\), \(\varvec{N}=\beta _n \varvec{E}-\varvec{X}_{(n)}\varvec{B}^{(n)} - \alpha _n\widetilde{\varvec{A}}^{(n)}\) and \(\varvec{E}\) is a matrix of all ones. In fact, NNQP is a general form of NNLS.
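The following MATLAB sketch (toy sizes and illustrative variable names, not the authors' code) forms \(\varvec{M}\) and \(\varvec{N}\) of (11) for mode 1 of a third-order tensor; each row of \(\varvec{A}^{(1)}\) is then obtained by an NNQP solver such as BPP.

```matlab
% Toy sketch of forming the NNQP quantities in (11) for mode n = 1.
R = 8; alpha_n = 1e-4; beta_n = 0.5;
I1 = 60; I2 = 50; I3 = 40;
A1 = max(0, randn(I1, R)); A2 = rand(I2, R); A3 = rand(I3, R);
X1 = rand(I1, I2 * I3);                         % mode-1 unfolding of the data tensor
A1_tilde = A1;                                  % previous iterate, used in the proximal term

B = zeros(I2 * I3, R);                          % Khatri-Rao product B^(1) = A3 ⊙ A2
for r = 1:R
    B(:, r) = kron(A3(:, r), A2(:, r));
end

M = B' * B + alpha_n * eye(R);                  % quadratic term of (11)
N = beta_n * ones(I1, R) - X1 * B - alpha_n * A1_tilde;   % linear term of (11)
% Each row a = A1(i,:) then solves  min 0.5*a*M*a' + N(i,:)*a'  s.t. a >= 0,
% e.g., with the block principal pivoting (BPP) solver [22].
```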

The optimization methods mentioned above for NNLS can also be used to solve the NNQP problem. In this study, we only use block principal pivoting (BPP) [22] as the NNQP solver, which has been proven to be very efficient [22, 24]. The BPP solver contains multiple inner iterations, and we limit the number of inner iterations to a few in the inexact scheme. We call the method of solving the tensor decomposition using NNQP alternating nonnegative quadratic programming (ANQP), and we abbreviate the method of solving the sparse NCP with the proximal algorithm using ANQP as PROX-ANQP. Algorithm 2 explicates the PROX-ANQP method.

[Algorithm 2: the PROX-ANQP method]

4.2 Inexact hierarchical alternating least squares

Second, we employ an inexact hierarchical alternating least squares (iHALS) method to solve the sparse NCP with the proximal algorithm. The conventional HALS is an efficient method that updates each factor column by column [7, 9]. However, the HALS method has two major drawbacks when applied to the sparse NCP.

First, HALS is also very prone to rank deficiency. Specifically, if a column of the factor matrix \(\varvec{A}^{(n)}\) becomes a zero vector, HALS breaks down [22]. One practical remedy is to replace the zero elements with a small positive value [9], such as \(10^{-16}\). However, with this modification, the obtained factor matrices are no longer sparse.

Second, HALS suffers from the caveat problem (see Section 5.2 in [22]). Specifically, unbalanced scales appear across different columns and factors. For example, one column in the first factor might have a scale of \(10^{-8}\) while the corresponding column in the second factor has a scale of \(10^{8}\). At the same time, another column in the first factor might have a scale of \(10^{16}\) while the corresponding column in the third factor has a scale of \(10^{-16}\). One common way to control the unbalanced scales is to normalize all factor columns to unit vectors [9]. However, with factor normalization, the factor columns can never become zero vectors, so it becomes impossible to impose sparsity effectively.

The proximal algorithm in our sparse NCP overcomes both drawbacks. We have mentioned that the proximal regularization guarantees full column rank in the model. Moreover, the proximal regularization term in the sparse NCP keeps all factor columns at balanced scales.

Next, we introduce the solution of the model in (8) using the iHALS method. For simplicity, we write \(\varvec{a}_r\) and \(\varvec{b}_r\) instead of \(\varvec{a}_r^{(n)}\) and \(\varvec{b}_r^{(n)}\) in this part, which are the rth columns of \(\varvec{A}^{(n)}\) and \(\varvec{B}^{(n)}\), respectively. We also use \({\big [ \varvec{A}^{(n)} \big ]}_{(:,r)}=\varvec{a}_r\in \mathbb {R}^{I_n\times 1}\) to represent a column of a matrix, and \({\big [ \varvec{A}^{(n)} \big ]}_{(i,r)}=a_{ir}^{(n)}\) to represent an element of a matrix.

The objective function in (9) can be further represented as

$$\begin{aligned} \begin{aligned} \mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big ) =&\frac{1}{2}\Big \Vert {\varvec{X}_{(n)}-\sum _{r=1}^R\varvec{a}_r\varvec{b}_r^T}\Big \Vert _F^2\\&+\frac{\alpha _n}{2}\sum _{r=1}^R||\varvec{a}_r - \widetilde{\varvec{a}}_r||_2^2+\beta _n\sum _{r=1}^R||\varvec{a}_r||_1, \end{aligned} \end{aligned}$$
(12)

where \(\widetilde{\varvec{a}}_r\) is the rth column of \(\widetilde{\varvec{A}}^{(n)}\). The minimization problem for (12) can be solved iteratively by columnwise subproblems:

$$\begin{aligned} \begin{aligned} \min _{\varvec{a}_r}\mathscr {F}_r=\frac{1}{2}\Big \Vert {\varvec{Z}_r-\varvec{a}_r\varvec{b}_r^T}\Big \Vert _F^2+&\frac{\alpha _n}{2}||\varvec{a}_r - \widetilde{\varvec{a}}_r||_2^2+\beta _n||\varvec{a}_r||_1\\ \text {s.t. }\varvec{a}_r\geqslant 0&, \end{aligned} \end{aligned}$$
(13)

for \(r=1,\dotsc ,R\), in which

$$\begin{aligned} \varvec{Z}_r=\varvec{X}_{(n)}-\sum _{\tilde{r}=1,\tilde{r}\ne r}^R\varvec{a}_{\tilde{r}}\varvec{b}_{\tilde{r}}^T. \end{aligned}$$
(14)

The partial derivative of \(\mathscr {F}_r\) with respect to \(\varvec{a}_r\) is

$$\begin{aligned} \begin{aligned} \frac{\partial \mathscr {F}_r}{\partial \varvec{a}_r}&=\big (\varvec{a}_r\varvec{b}_r^T-\varvec{Z}_r\big )\varvec{b}_r + \alpha _n\varvec{a}_r - \alpha _n\widetilde{\varvec{a}}_r + \beta _n\varvec{1},\\&= \big (\varvec{b}_r^T\!\varvec{b}_r+\alpha _n \big )\varvec{a}_r - \big (\varvec{Z}_r\varvec{b}_r + \alpha _n\widetilde{\varvec{a}}_r - \beta _n\varvec{1} \big ), \end{aligned} \end{aligned}$$
(15)

where \(\varvec{1}\in \mathbb {R}^{I_n\times 1}\) is a vector with all elements equal to 1. Setting \(\frac{\partial \mathscr {F}_r}{\partial \varvec{a}_r}=0\) and projecting onto the nonnegative orthant, the column vector \(\varvec{a}_r\) is updated as

$$\begin{aligned} \begin{aligned} \varvec{a}_r\leftarrow {\left[ \frac{\varvec{Z}_r\varvec{b}_r + \alpha _n\widetilde{\varvec{a}}_r - \beta _n\varvec{1}}{\varvec{b}_r^T\varvec{b}_r+\alpha _n} \right] }_+, \end{aligned} \end{aligned}$$
(16)

which is a closed-form solution of (13) according to Theorem 2 in [24].

A fast HALS method has been utilized to solve large-scale problems [7, 24]. We use the same idea to solve the sparse NCP problem. \(\varvec{Z}_r\) in (14) can also be represented as

$$\begin{aligned} \varvec{Z}_r=\varvec{X}_{(n)}-\sum _{\tilde{r}=1}^R\varvec{a}_{\tilde{r}}\varvec{b}_{\tilde{r}}^T + \widetilde{\varvec{a}}_r\varvec{b}_r^T. \end{aligned}$$
(17)

Replacing \(\varvec{Z}_r\) in (16) by (17), we obtain the new update rule for \(\varvec{a}_r\) as shown in (18).

$$\begin{aligned} \begin{aligned} \varvec{a}_r\leftarrow&{\bigg [\frac{ \big ( \varvec{X}_{(n)}-\sum _{\tilde{r}=1}^R\varvec{a}_{\tilde{r}}\varvec{b}_{\tilde{r}}^T + \widetilde{\varvec{a}}_r\varvec{b}_r^T \big ) \varvec{b}_r + \alpha _n \widetilde{\varvec{a}}_r -\beta _n\varvec{1} }{\varvec{b}_r^T\varvec{b}_r+\alpha _n} \bigg ]}_+\\&= {\bigg [ \widetilde{\varvec{a}}_r + \frac{ \varvec{X}_{(n)}\varvec{b}_r - \sum _{\tilde{r}=1}^R\varvec{a}_{\tilde{r}}\varvec{b}_{\tilde{r}}^T\varvec{b}_r - \beta _n\varvec{1}}{\varvec{b}_r^T\varvec{b}_r+\alpha _n} \bigg ]}_+\\&=\Bigg [ \widetilde{\varvec{a}}_r + \frac{{\big [\varvec{X}_{(n)}\varvec{B}^{(n)}\big ]}_{(:,r)} - \varvec{A}^{(n)}{\big [ {\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)}\big ]}_{(:,r)} -\beta _n\varvec{1}}{{\big [{\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)}\big ]}_{(r,r)}+\alpha _n} \Bigg ]_+ \end{aligned} \end{aligned}$$
(18)

We implement the above procedure using the inexact scheme. We use PROX-iHALS to denote the inexact hierarchical alternating least squares method for solving the sparse NCP with the proximal algorithm. PROX-iHALS is illustrated in Algorithm 3.

[Algorithm 3: the PROX-iHALS method]
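As an illustration of Algorithm 3, the following self-contained MATLAB sketch (toy data, illustrative names, and an assumed handling of \(\widetilde{\varvec{a}}_r\) as the stored value of the column just before its update) performs a few inexact inner iterations of the columnwise rule (18) for mode 1 of a third-order tensor:

```matlab
% Toy sketch of a few PROX-iHALS inner iterations for mode n = 1, rule (18).
R = 8; alpha_n = 1e-4; beta_n = 0.5; MAX_INNER_ITER = 5;
I1 = 60; I2 = 50; I3 = 40;
A1 = max(0, randn(I1, R)); A2 = rand(I2, R); A3 = rand(I3, R);
X1 = rand(I1, I2 * I3);                         % mode-1 unfolding of the data tensor

B = zeros(I2 * I3, R);                          % B^(1) = A3 ⊙ A2
for r = 1:R
    B(:, r) = kron(A3(:, r), A2(:, r));
end
MKR = X1 * B;                                   % MTTKRP, computed once per subproblem
BtB = (A3' * A3) .* (A2' * A2);                 % (B^(1))' * B^(1) via (5)

for l = 1:MAX_INNER_ITER                        % a few inexact inner iterations
    A1_prev = A1;
    for r = 1:R
        a_tilde = A1(:, r);                     % proximal point for this column
        g = MKR(:, r) - A1 * BtB(:, r) - beta_n;             % numerator of (18), beta_n subtracted elementwise
        A1(:, r) = max(0, a_tilde + g / (BtB(r, r) + alpha_n));   % [ ]_+ projection
    end
    if norm(A1 - A1_prev, 'fro') / norm(A1, 'fro') < 0.01    % relative residual test (21)
        break;
    end
end
```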

4.3 Stopping conditions

4.3.1 Stopping condition for outer loop

We terminate the outer loop according to the change of the relative error over the iterations. The relative error measures the data fitting. In the kth outer iteration, the relative error [48] of the tensor decomposition is defined by

$$\begin{aligned} \text {RelErr}_k=\frac{\Vert \varvec{\mathscr {X}} - \llbracket \varvec{A}_k^{(1)},\dotsc ,\varvec{A}_k^{(N)} \rrbracket \Vert _F}{\Vert \varvec{\mathscr {X}}\Vert _F}. \end{aligned}$$
(19)

Based on the relative error, we terminate the outer loop using the following stopping condition

$$\begin{aligned} |\text {RelErr}_{k-1}-\text {RelErr}_k |< \varepsilon . \end{aligned}$$
(20)

The threshold \(\varepsilon\) can be set to a very small positive value, such as \(1e-8\).

In addition, we also set a maximum running time for the outer loop.
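Since matricization preserves the Frobenius norm, (19) can be computed from any mode-n unfolding as \(\Vert \varvec{X}_{(n)}-\varvec{A}^{(n)}{(\varvec{B}^{(n)})}^T\Vert _F / \Vert \varvec{X}_{(n)}\Vert _F\). A minimal MATLAB sketch with toy data (not part of the paper's implementation) is:

```matlab
% Toy sketch of the relative error (19) and the outer stopping rule (20).
R = 8; I1 = 60; I2 = 50; I3 = 40; epsilon = 1e-8;
A1 = rand(I1, R); A2 = rand(I2, R); A3 = rand(I3, R);
X1 = rand(I1, I2 * I3);                         % mode-1 unfolding of the data tensor
B1 = zeros(I2 * I3, R);
for r = 1:R
    B1(:, r) = kron(A3(:, r), A2(:, r));        % B^(1) = A3 ⊙ A2
end
RelErr = norm(X1 - A1 * B1', 'fro') / norm(X1, 'fro');      % equation (19)
% In the outer loop, stop when |RelErr_prev - RelErr| < epsilon (20),
% or when the maximum allowed running time is exceeded.
```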

4.3.2 Stopping condition for inner loop

In the lth inner iteration, we define the relative residual of the nth factor matrix \(\varvec{A}^{(n)}\) as

$$\begin{aligned} r_l^{(n)}=\frac{\big \Vert \varvec{A}_l^{(n)} - \varvec{A}_{l-1}^{(n)}\big \Vert _F}{\big \Vert \varvec{A}_l^{(n)}\big \Vert _F}. \end{aligned}$$
(21)

For PROX-iHALS, we terminate the inner loop with the stopping condition \(r_l^{(n)}<\delta ^{(n)}\), where \(\delta ^{(n)}\) is a dynamic positive threshold with initial value \(\delta ^{(n)}=0.01\). If the inner loop stops after only one iteration, we tighten the threshold by \(\delta ^{(n)}=\delta ^{(n)}/10\). For PROX-ANQP, the inner loop is terminated according to whether the columns have reached the feasible region of the BPP algorithm [22].

Since we employ the inexact BCD framework, we also set a maximum number of inner iterations (MAX_INNER_ITER) to terminate the inner loop.

We summarize the stopping conditions for both the outer and inner loops in Algorithm 4.

[Algorithm 4: stopping conditions for the outer and inner loops]

4.4 Remarks on convergence

The PROX-ANQP and PROX-iHALS methods in the inexact BCD framework have excellent convergence properties. In Sect. 3.3, we mentioned that the proposed sparse NCP with the proximal algorithm and the inexact BCD scheme is guaranteed to converge to a stationary point. The subproblem with the proximal algorithm in (9) is strongly convex, which yields a unique minimizer [4]. Furthermore, the optimization methods ANQP and iHALS stably decrease the objective value of the subproblem. According to Proposition 3.7.1 in [4], both PROX-ANQP and PROX-iHALS converge to stationary points.

5 Experiments and results

We carried out experiments on synthetic, real-world, dense, sparse, small-scale, and large-scale tensors. We compared the proposed PROX-ANQP and PROX-iHALS methods with the three sparse NCP methods listed below.

  • AO-ADMM: This is the sparse NCP method using the AO-ADMM algorithm [19], which includes multiple inner iterations. The \(l_1\)-norm is handled by a proximal operator.

  • iAPG: We extend the APG method in sparse Tucker decomposition [47] to the sparse NCP problem in (6). In order to make a fair comparison, we implement APG in the inexact scheme using multiple inner iterations, which is abbreviated as iAPG [42]. The \(l_1\)-norm is handled by a proximal operator.

  • iMU: This is the sparse NCP method using the classical MU algorithm [9]. We implement MU in the inexact scheme using multiple inner iterations, which is abbreviated as iMU.

The above three methods can be directly applied to solve the sparse NCP in (6). The \(l_1\)-norm is handled by a proximal operator in AO-ADMM and iAPG; owing to this proximal operator, AO-ADMM and iAPG do not suffer from rank deficiency. Owing to its multiplicative update rule, iMU does not suffer from rank deficiency either.

Table 1 summarizes the computational complexity of all the sparse NCP methods. Only the multiplicative operations for mode-n in one outer iteration are counted. The main time cost of these algorithms is spent on the calculation of the MTTKRP \(\varvec{X}_{(n)} \varvec{B}^{(n)}\), which consists of two parts: the Khatri-Rao product \(\varvec{B}^{(n)}\) and the matrix product of \(\varvec{X}_{(n)}\) and \(\varvec{B}^{(n)}\). The computational complexity of \(\varvec{B}^{(n)}\) reaches \(R\prod _{\tilde{n}=1,\tilde{n}\ne n}^N I_{\tilde{n}}\), and that of \(\varvec{X}_{(n)} \varvec{B}^{(n)}\) reaches \(R\prod _{n=1}^N I_n\). The term \({\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)}\) can be calculated efficiently by (5), whose complexity is \(R^2\sum _{\tilde{n}=1,\tilde{n}\ne n}^N I_{\tilde{n}}\). For the inner loop of the subproblem, \(\bar{K}\) is the average number of inner iterations. Table 1 shows that the complexities of these algorithms are highly comparable; it can therefore be inferred that the convergence time is mainly determined by the number of iterations.

Table 1 Computational Complexity of Multiplicative Operations for Subproblem (9)

Many experimental parameters and settings will affect the performances of a sparse NCP method. Since our purpose in the experiments is only to test the ability to impose sparsity, we fix the following settings for all methods.

  • Initialization. For PROX-ANQP, PROX-iHALS, AO-ADMM and iAPG, all factor matrices were initialized with nonnegative random numbers by the MATLAB function max(0,randn(\(I_n,R\))). Only iMU was initialized by max(0,randn(\(I_n,R\)))+0.1. All initialized factors were scaled by \(\textstyle \varvec{A}_0^{(n)}=\frac{\varvec{A}_0^{(n)}}{||\varvec{A}_0^{(n)}||_F}\times \root N \of {||\varvec{\mathscr {X}}||_F}\) (see the sketch after this list).

  • The factor updating order was fixed as \(1,2,\dotsc ,N\).

  • The maximum number of inner iterations MAX_INNER_ITER was fixed at 5, following the default setting in AO-ADMM [19].

  • For the PROX-ANQP and PROX-iHALS methods, the proximal regularization parameter \(\alpha _n\) was fixed at 1e-4 [45].
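The initialization and scaling described in the first item of this list can be sketched as follows (a minimal MATLAB sketch with toy sizes; normX stands for \(||\varvec{\mathscr {X}}||_F\), which is not computed here):

```matlab
% Minimal sketch of the random initialization and scaling (toy sizes, illustrative names).
N = 3; R = 20; I = [1000 100 100];
normX = 50;                                  % stands for ||X||_F of the data tensor
A0 = cell(1, N);
for n = 1:N
    A0{n} = max(0, randn(I(n), R));          % nonnegative random initialization
    A0{n} = A0{n} / norm(A0{n}, 'fro') * normX^(1/N);   % rescale to match the data norm
end
```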

The \(l_1\)-norm regularization parameters \(\beta _n, n=1,\dotsc ,N,\) in the sparse NCP are the key parameters for imposing sparsity and are the most crucial test parameters in the experiments. We selected a sequence of \(\beta _n\) values in ascending order for each tensor by manual testing. For synthetic tensors, we stopped increasing \(\beta _n\) when the true sparse components were recovered, while for real-world tensors, we stopped increasing \(\beta _n\) when the number of nonzero components was reduced to less than half of the initial number. To simplify the selection and testing of the parameters, we kept \(\beta _n, n=1,\dotsc ,N,\) the same in all modes of the tensor. After choosing \(\beta _n\), we calculated and evaluated the sparsity level [44] of the factor matrices by

$$\begin{aligned} \text {Sparsity}_{\varvec{A}^{(n)}}=\frac{\#\big \lbrace \varvec{A}^{(n)}_{i,r}<T_s \big \rbrace }{I_n \times R}, \end{aligned}$$
(22)

where \(T_s\) is a small positive number and \(\#\left\{ \cdot \right\}\) denotes the number of elements in the factor matrix \(\varvec{A}^{(n)}\) that are smaller than the threshold \(T_s\).
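A one-line MATLAB sketch of (22) on a toy factor matrix (not taken from the paper) is:

```matlab
% Sketch of the sparsity level (22) for a toy nonnegative factor matrix.
T_s = 1e-6;                                     % small positive threshold
A = max(0, randn(100, 20) - 1);                 % toy factor, mostly zeros
sparsity = nnz(A < T_s) / numel(A);             % fraction of (near-)zero entries
```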

In the synthetic tensor experiments, we used prior sparse matrices to construct the data. After decomposition, the accuracy of the recovered sparse signals should be evaluated. Let \(\varvec{S}^{(n)}=[\varvec{s}_1,\dotsc ,\varvec{s}_{R}]\in \mathbb {R}^{L \times R}\) denote the mode-n prior sparse matrix, where R is the true number of components and L is the length of a component. Let \(\varvec{T}^{(n)}=[\varvec{t}_1,\dotsc ,\varvec{t}_{\tilde{R}}]\in \mathbb {R}^{L \times \tilde{R}}\) represent the mode-n estimated sparse matrix, where \(\tilde{R}\) is the estimated number of nonzero components. We evaluate the accuracy of the estimated matrix \(\varvec{T}^{(n)}\) against the original sparse signals \(\varvec{S}^{(n)}\) by the peak signal-to-noise ratio (PSNR, see Chapter 3 in [9])

$$\begin{aligned} \text {PSNR}=\frac{1}{\tilde{R}}\sum _{r=1}^{\tilde{R}}10\text {log}_{10} \frac{L}{\big \Vert \hat{\varvec{t}}_r-\hat{\varvec{s}}_c\big \Vert _2^2}, \end{aligned}$$
(23)

where \(\hat{\varvec{t}}_r\) is the rth normalized estimated sparse signal, and \(\hat{\varvec{s}}_c\) is the normalized reference sparse signal, namely the column of \(\varvec{S}^{(n)}\) that has the highest correlation coefficient with \(\hat{\varvec{t}}_r\).
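A toy MATLAB sketch of (23) is given below, assuming unit \(l_2\)-norm normalization of the columns; the data, sizes and names are illustrative only.

```matlab
% Toy sketch of the PSNR measure (23): each estimated component is matched to
% the reference component with the highest correlation coefficient.
L = 1000; R_true = 10;
S = max(0, randn(L, R_true) - 1);                      % reference sparse components
T = S(:, randperm(R_true)) + 1e-3 * rand(L, R_true);   % permuted, noisy estimates
S_hat = bsxfun(@rdivide, S, sqrt(sum(S.^2, 1)));       % normalize reference columns
T_hat = bsxfun(@rdivide, T, sqrt(sum(T.^2, 1)));       % normalize estimated columns
R_est = size(T, 2);
psnr_sum = 0;
for r = 1:R_est
    c = zeros(1, R_true);
    for j = 1:R_true
        cc = corrcoef(T(:, r), S(:, j));               % correlation with each reference
        c(j) = cc(1, 2);
    end
    [~, best] = max(c);                                % best-matching reference column
    psnr_sum = psnr_sum + 10 * log10(L / norm(T_hat(:, r) - S_hat(:, best))^2);
end
PSNR = psnr_sum / R_est;                               % average PSNR over estimated components
```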

All the experiments were conducted on a computer with an Intel Core i5-4590 3.30 GHz CPU, 8 GB memory, 64-bit Windows 10 and MATLAB R2016b. The fundamental tensor computations were based on Tensor Toolbox 2.6 [3]. The codes are available on the author's website http://deqing.me/.

5.1 Synthetic tensor data

5.1.1 Size \(1000 \times 100 \times 100 \times 5\) with one sparse factor

In this experiment, we constructed a synthetic fourth-order tensor from 10 channels of simulated sparse and nonnegative signals, as shown in Fig. 1a. The signals come from the file VSparse_rand_10.mat in NMFLAB [8]. There are 1000 points in each channel, so the sparse signal matrix is \(\varvec{S}^{(1)}=[\varvec{s}_1,\dotsc ,\varvec{s}_{10}]\in \mathbb {R}^{1000 \times 10}\). Three uniformly distributed random matrices \(\varvec{A}^{(2)}, \varvec{A}^{(3)} \in \mathbb {R}^{100 \times 10}\) and \(\varvec{A}^{(4)} \in \mathbb {R}^{5 \times 10}\) were employed as mixing matrices, generated by the rand function in MATLAB. Afterwards, we synthesized a fourth-order tensor \(\varvec{\mathscr {X}}_{\text {SYN1}}= \llbracket \varvec{S}^{(1)},\varvec{A}^{(2)},\varvec{A}^{(3)},\varvec{A}^{(4)} \rrbracket \in \mathbb {R}^{1000 \times 100 \times 100 \times 5}\). Next, nonnegative noise with an SNR of 40 dB, generated by the code max(0,randn(size(\(\varvec{\mathscr {X}}\)))), was added to the tensor.

Fig. 1

Sparse and nonnegative signals used in synthetic tensor. a shows the original ten channels of signals. b shows the estimated ten channels of signals from the synthetic tensor \(\varvec{\mathscr {X}}_{\text {SYN1}}\) by sparse NCP based on the PROX-ANQP method with \(\beta _n=5\). The PSNR is 90.2698 according to (23)

For all sparse NCP methods, we set \(\varepsilon =1e-8\) as the threshold of the outer stopping condition in (20). We set \(T_s=1e-3\) in (22). The maximum running time was set to 180 seconds. We selected 20 as the number of components for the tensor decomposition. The reason is that we intend to recover the ten channels of the true signals just by imposing sparse regularization during the decomposition, even though the exact optimal number of components is unknown. We selected the values of \(\beta _n=0,0.1,0.5,1,2,3\) for all the optimization methods to evaluate their abilities to impose sparsity. The selection of the sparse regularization parameters depends on the tensor data. After the tensor decomposition, the objective function value (Obj), relative error (RelErr), running time in seconds (wall-clock time), iteration number (Iter), number of nonzero components (NNC), sparsity level (Spars) and PSNR of the estimated signal factor matrix were recorded as performance evaluation criteria. For each optimization method and each \(\beta _n\), the sparse NCP was run 30 times, and the average values of all criteria were computed. The results are shown in Table 2, in which the outstanding performances of the sparse NCP algorithms are highlighted in bold.

From Table 2, it can be seen that all methods are able to impose sparsity with a proper sparse regularization parameter \(\beta _n\). As \(\beta _n\) increases, the sparsity level of the mode-1 factor matrix also increases. After properly tuning \(\beta _n\), weak components are removed (set to 0), weak elements in strong components are suppressed, and the true ten channels of sparse signals are recovered.

When \(\beta _n\) increases to a proper value, the PSNR increases significantly. In this experiment, the higher the PSNR, the better an algorithm recovers the original sparse components. Table 2 clearly shows that PROX-ANQP, PROX-iHALS and iAPG reach higher PSNR with larger sparse regularization parameters, for example \(\beta _n=4,5\). This means that these three methods recover the ten channels of sparse signals more precisely. One of the sparse signal matrices recovered from \(\varvec{\mathscr {X}}_{\text {SYN1}}\) by PROX-ANQP is shown in Fig. 1b.

For the synthetic data, the objective function values and relative errors are very similar across methods for the same \(\beta _n\) value. The convergence speed can be assessed from Table 2. iMU converges slowly compared with the other methods. AO-ADMM is slow with \(\beta _n=0\), but it becomes fast with \(\beta _n>0\). PROX-ANQP, PROX-iHALS and iAPG all perform very well. It can also be concluded from Table 2 that the running time is highly related to the number of outer iterations.

Table 2 Comparison of Sparse NCPs on \(\varvec{\mathscr {X}}_{\text {SYN1}} \in \mathbb {R}^{1000 \times 100 \times 100 \times 5}\)

5.1.2 Size \(500 \times 500 \times 500\) with two sparse factors

For this third-order tensor, the factor matrices were generated using the following MATLAB code.

Factor | Code | Zeros (%)
\(\varvec{S}^{(1)}\in \mathbb {R}^{500 \times 100}\) | max(0,rand(500,100)*10-9); | 90
\(\varvec{S}^{(2)}\in \mathbb {R}^{500 \times 100}\) | max(0,rand(500,100)*2-1); | 50
\(\varvec{A}^{(3)}\in \mathbb {R}^{500 \times 100}\) | rand(500,100); | 0

Afterwards, a third-order tensor was synthesized by \(\varvec{\mathscr {X}}_{\text {SYN2}}=\llbracket \varvec{S}^{(1)},\varvec{S}^{(2)},\varvec{A}^{(3)} \rrbracket\), whose true number of components was 100 (rank = 100). Noise with an SNR of 40 dB was added.

We set the outer stopping condition to \(\varepsilon =1e-6\) and the maximum running time to 600 seconds. The initial number of components was set to 200. The average performances of all sparse NCP methods over 30 runs were computed. We only report the running time in seconds, iteration number (Iter), number of nonzero components (NNC) and sparsity level (Spars) of all estimated factors in Table 3.

Table 3 Comparison of Sparse NCPs on \(\varvec{\mathscr {X}}_{\text {SYN2}} \in \mathbb {R}^{500 \times 500 \times 500}\)

Table 3 shows that all methods are able to impose sparsity on all factor matrices. The PROX-ANQP, PROX-iHALS and iAPG methods perform very well in extracting the true 100 sparse components. Interestingly, the sparsity levels of the factor matrices extracted by PROX-ANQP, PROX-iHALS and iAPG are also very close to the ground-truth values for some \(\beta _n\). iMU and AO-ADMM do not always reach the ground-truth factor sparsity levels. Moreover, iMU converges slowly compared with the other methods.

Table 4 Comparison of Sparse NCPs on Ongoing EEG Tensor \(\varvec{\mathscr {X}}_{\text {EEG}} \in \mathbb {R}^{64 \times 146 \times 510}\)

5.2 Third-order dense tensor data

In this experiment, we used a real-world third-order ongoing EEG tensor. The original data were collected from fourteen healthy, right-handed adult subjects in a music listening experiment. The music stimulus was an 8.5-minute piece of modern tango, Adiós Nonino, by the composer Astor Piazzolla. A Short-Time Fourier Transform (STFT) with a window size of 3 seconds and a hop size of 1 second was used to transform the data of each subject into a third-order tensor. Details of data collection and preprocessing can be found in [43, 44]. We only employ the tensor data of one subject in this experiment. The size of this tensor is \(channel \times frequency \times time = 64 \times 146 \times 510\), in which the 64 channel points represent 64 electrodes on the scalp, the 146 frequency points represent the spectrum in 1-30 Hz, and the 510 time points cover the roughly 8.5-minute duration of the stimulus. Since the spectra from EEG tensors are usually sparse, we wish to recover the sparse spectral components by sparse regularization.

We set \(\varepsilon =1e-8\) in (20) and \(T_s=1e-6\) in (22). The maximum running time was set to 120 seconds. The initial number of components was set to 40 according to previous studies [11, 44]. We tested the values of \(\beta _n=0,1e5,5e5,10e5,15e5,20e5\) for all methods. All methods were run 30 times. The averages of the performance criteria are recorded in Table 4. The results show that all methods are effective in imposing sparsity with \(\beta _n\). iMU is slower than the other four methods.

We selected three groups of components extracted with the PROX-ANQP method with \(\beta _n=0,5e5,10e5\), respectively, as shown in Fig. 2. These three groups show the same brain activity. The spectra clearly become sparser as the sparse regularization parameter increases. With \(\beta _n=5e5,10e5\), more and more unnecessary elements are removed from the spectra, and only the most prominent frequency band is retained. Figure 2 demonstrates that our methods are effective in extracting meaningful sparse components related to brain activities.

Fig. 2

Selected groups of components from the ongoing EEG tensor using the PROX-ANQP algorithm. All groups reveal the same brain activity. In the decomposed EEG data, the spatial component is topography, the spectral component is the spectrum, and the temporal component is the energy evolution series. The components in a were extracted with \(\beta _n=0\), b with \(\beta _n=5 \times 10^5\) and c with \(\beta _n=10 \times 10^5\)

5.3 Third-order large-scale sparse tensor data

In this experiment, we tested the sparse NCP algorithms on a third-order large-scale sparse social network tensor. The data contain Facebook wall-post information among 46,952 users over 1952 days [41]. The size of this sparse tensor \(\varvec{\mathscr {X}}_{\text {Sp1}}\) is \(46{,}952 \times 46{,}951 \times 1952\), and the number of nonzero elements is 737,928.

We set the outer stopping condition to \(\varepsilon =1e-6\) in (20) and the sparsity threshold to \(T_s=1e-6\) in (22). The maximum running time was set to 1200 seconds, and the initial number of components was set to 100. We tested the values of \(\beta _n=0, 0.01, 0.05, 0.1\) for all methods. The average values of the performance criteria over 30 runs are recorded in Table 5. The objective function values of all methods within the first 600 seconds are shown in Fig. 3.

Table 5 Comparison of Sparse NCPs on Facebook Wall Posts Tensor \(\varvec{\mathscr {X}}_{\text {Sp1}} \in \mathbb {R}^{46952 \times 46951 \times 1952}\)
Table 6 Comparison of Sparse NCPs on NIPS Publications Tensor \(\varvec{\mathscr {X}}_{\text {Sp2}} \in \mathbb {R}^{2482 \times 2862 \times 14036 \times 17}\)
Fig. 3

The Objective Function Value Curves of Sparse NCPs on Third-order Facebook Wall Posts Tensor

Fig. 4

The objective function value curves of sparse NCPs on fourth-order NIPS publications tensor

5.4 Fourth-order large-scale sparse tensor data

In this experiment, we evaluated the sparse NCP algorithms on a fourth-order large-scale sparse tensor of NIPS publications [16]. The size of this sparse tensor \(\varvec{\mathscr {X}}_{\text {Sp2}}\) is \(2482 \times 2862 \times 14{,}036 \times 17\); the modes represent 2482 papers, 2862 authors, 14,036 words and 17 years. There are 3,101,609 nonzero elements.

The maximum running time was set to 1800 seconds. The other settings were the same as in the previous third-order case. We tested the values of \(\beta _n=0, 0.1, 0.3, 0.5\) for all methods. The average values of the performance criteria over 30 runs are recorded in Table 6. Figure 4 illustrates the objective function values of all methods within the first 1200 seconds.

The experimental results on both the third-order and the fourth-order large-scale sparse tensors demonstrate that all our algorithms are effective in imposing sparsity on the factor matrices. From Tables 5 and 6, it is clear that, as the sparse regularization parameter \(\beta _n\) increases, the number of nonzero components (NNC) decreases gradually. The factor matrices extracted from the sparse tensors are extremely sparse even when no additional sparse regularization is imposed (\(\beta _n=0\)). Nevertheless, our algorithms with \(\beta _n>0\) can further increase the sparsity level of the factor matrices.

In terms of convergence speed, both the PROX-ANQP and PROX-iHALS methods run fast on the large-scale sparse tensors for different \(\beta _n\) values compared with the other methods, as can be seen from Tables 5 and 6 and Figs. 3 and 4. In contrast, AO-ADMM and iAPG are slow on the large-scale sparse tensors. iMU converges fast, but it reaches higher objective function values for large \(\beta _n\) (e.g., \(\beta _n\) = 0.05, 0.1 in the third-order case, and \(\beta _n\) = 0.3, 0.5 in the fourth-order case). The reason is that, for the same \(\beta _n\) value, iMU yields fewer nonzero components than the other methods.

6 Discussion

We have proposed a novel sparse NCP model using the proximal algorithm and the inexact scheme. The model can be efficiently solved by two algorithms, PROX-ANQP and PROX-iHALS. In order to test the performance of the algorithms, we conducted experiments in different settings, including synthetic and real-world tensors, third-order and fourth-order tensors, dense and sparse tensors, and small-scale and large-scale tensors. Three state-of-the-art sparse NCP methods were also tested for comparison: AO-ADMM, iAPG and iMU. We have the following findings. (1) Compared with the other methods, PROX-ANQP and PROX-iHALS converge fast and impose sparsity effectively in all cases. Their outstanding performances stem from two ingredients: the proximal algorithm, which overcomes rank deficiency, and the inexact scheme, which increases efficiency. (2) The iAPG method contains a proximal operator, which can handle the \(l_1\)-norm; with this proximal operator, iAPG does not suffer from rank deficiency. In the experiments, we find that iAPG is very efficient at imposing sparsity for small-scale dense tensor decomposition. Nevertheless, iAPG runs slowly for large-scale sparse tensor decomposition. (3) The iMU method converges very slowly on dense tensors, but it becomes fast on large-scale sparse tensors. For sparse tensor decomposition, the extracted factor matrices are already extremely sparse, and most elements in the factors are zero. According to the multiplicative update rule, once an element becomes zero, it never changes again. This property might be the reason why iMU converges fast on sparse tensors. (4) AO-ADMM also contains a proximal operator that can handle the \(l_1\)-norm and overcome rank deficiency. However, AO-ADMM converges slowly compared with PROX-ANQP and PROX-iHALS in most cases, particularly on the large-scale sparse tensors. Moreover, AO-ADMM is inferior to iAPG with \(\beta _n=0\) in many cases. In summary, our proposed PROX-ANQP and PROX-iHALS methods have the best performance for sparse NCP and generalize well to different types and scales of datasets.

In addition to the solving methods, another critical issue of sparse NCP is the selection of the sparse regularization parameter \(\beta _n\). Firstly, one purpose of this paper is to demonstrate the effectiveness of the algorithms in imposing sparsity. Therefore, in order to simplify the selection of parameters, we keep \(\beta _n\) the same for all factor matrices in the sparse NCP. Even with the same \(\beta _n\) in all modes, the sparse NCP can still recover highly sparse components and weakly sparse (or even dense) components in different factor matrices. In the future, it would be interesting to investigate how to control the sparsity levels of different factor matrices separately using unbalanced sparse regularization parameters. Secondly, the appropriate value of \(\beta _n\) depends on the tensor to be decomposed. In this study, we selected \(\beta _n\) separately for each tensor in the experiments. When the sparse regularization parameter is larger, the extracted factor matrices are sparser, but the relative error of the decomposition is also larger. The trade-off between the sparsity level and the relative error depends on the requirements of the real application. An example of sparse regularization parameter selection of sparse NCP for ongoing EEG can be found in [44]. It is also possible to select an appropriate parameter for a concrete application using model-order selection methods [37], such as the Bayesian information criterion (BIC).

The third critical issue of sparse NCP is the sparse regularization term itself. In this study, we only investigated the \(l_1\)-norm term. In the future, it is worth incorporating other types of sparse regularization terms [2] into our sparse NCP model besides the \(l_1\)-norm, such as the \(l_q\)-norm (\(0<q<1\)) [36] and the trace norm [29].

7 Conclusion

In this paper, we have investigated nonnegative CANDECOMP/PARAFAC tensor decomposition with \(l_1\)-norm-based sparse regularization (sparse NCP). We have proposed a novel sparse NCP model using the proximal algorithm, which guarantees the full column rank condition and convergence to a stationary point. In addition, an inexact block coordinate descent scheme was presented to accelerate the computation of the sparse NCP; in this inexact scheme, the subproblems are updated using multiple inner iterations. We have employed two algorithms to solve the proposed sparse NCP model with the proximal algorithm: inexact alternating nonnegative quadratic programming (PROX-ANQP) and inexact hierarchical alternating least squares (PROX-iHALS). The experimental results on synthetic, real-world, small-scale and large-scale tensors demonstrated that our sparse NCP methods can impose sparsity and extract meaningful sparse components successfully. Both PROX-ANQP and PROX-iHALS exhibited faster computation and better sparsity-imposing performance than the other sparse NCP algorithms. The experimental results show that the proposed sparse NCP with the proximal algorithm and the inexact scheme is effective and efficient.