1 Introduction

1.1 Background

Nonnegative tensor decomposition is a powerful tool in signal processing and machine learning [10, 35]. Nonnegative CANDECOMP/PARAFAC (NCP), as an important decomposition method, has been widely applied to the processing of multiway data, such as hyperspectral data [39], electroencephalography (EEG) data [11], fluorescence excitation-emission matrix (EEM) data [13], neural data [46], and many other multiway tensor data [30]. In many cases, the components extracted by NCP are not only nonnegative but also sparse. For example, the spectral components from EEG tensor decomposition are usually very sparse, representing the narrow-band frequencies of certain brain activities [11]. As another example, after decomposing an EEM tensor, a component in the sample mode denotes the concentrations of a compound in all samples [5], which is sometimes also sparse. The nonnegativity constraint in NCP naturally leads to somewhat sparse results. However, this sparsity is only a side effect and cannot be controlled to a desired level [18]. Without properly controlling the sparsity, the intrinsic components in the data cannot be extracted precisely, especially under low signal-to-noise ratio conditions. Therefore, in order to extract meaningful and accurate sparse components, additional sparse regularization is necessary for NCP tensor decomposition.

The design of NCP decomposition with explicit sparse regularization (sparse NCP) can benefit greatly from methods developed for nonnegative matrix factorization (NMF). On the one hand, an early NMF study [18] proposed projecting the components onto sparse vectors at a given sparsity level. However, this method keeps all components at the same fixed sparsity level, which is not in line with the true sparsity of the different components in the data. On the other hand, incorporating sparse regularization terms into the optimization model is a popular approach. The \(l_1\)-norm is a conventional and effective regularizer for imposing sparsity in signal processing [6]. The reason is that, for most underdetermined linear equations, the optimization problem with \(l_1\)-norm regularization can yield strong sparsity [12]. More information about sparse regularization can be found in [2, 34, 50].

Many works have been devoted to tensor decomposition with sparse regularization, but only a few address NCP. The works of [1, 21] and [29] studied sparse regularization for tensor decomposition using the \(l_1\)-norm and the trace norm, but they only considered the unconstrained CP model without the nonnegativity constraint. The works of [14, 28, 31] and [47] proposed methods that impose sparsity by the \(l_1\)-norm on nonnegative Tucker decomposition. However, these methods are not suitable for large-scale problems [47], and their effectiveness for NCP is unknown. Kim et al. considered solving sparse NCP using alternating nonnegative least squares (ANLS) [23]. Nevertheless, ANLS suffers seriously from the rank deficiency caused by high sparsity or zero components in the factor matrices. Recently, Huang et al. proposed an alternating optimization-based ADMM (AO-ADMM) method, which can handle the \(l_1\)-norm regularization term in NCP [19]. However, no experimental evaluation of the sparse NCP is provided in [19]. The work [32] proposed a sparse NCP algorithm targeted at multiway co-clustering. In practical applications, sparse NCP faces the following two major challenges.

One challenge is that, when the tensor data are highly sparse or strong sparse regularization is imposed on the decomposition, more and more zero components will appear in the factor matrices. Thus, the factor matrices are not of full column rank, which causes the rank deficiency problem. Rank deficiency in turn leads to poor convergence of the tensor decomposition algorithm. The proximal algorithm has been introduced as an excellent way to improve the convergence of a mathematical optimization method [4]. In an iterative optimization problem, the proximal algorithm is constructed by adding a proximal regularization term to the original model. This proximal term is the squared Frobenius norm of the difference between the current variable and its value in the previous iteration [4]. The proximal algorithm can naturally be incorporated into tensor decomposition [26].

The other challenge is that, for large-scale tensor data, sparse NCP decomposition might be inefficient. It has been reported that the inexact block coordinate descent scheme can accelerate convergence and is very beneficial for large-scale problems [15, 40]. Hence, the inexact scheme can be employed in the sparse NCP problem.

1.2 Contribution

Firstly, in this paper, we propose a novel sparse NCP method with the \(l_1\)-norm and the proximal algorithm. The proposed sparse NCP overcomes the rank deficiency and guarantees that the decomposition converges to a stationary point. Block coordinate descent (BCD) is one of the main techniques for tensor decomposition, especially the constrained one [24]. In the BCD framework, each factor matrix is updated alternately as a subproblem while the other factor matrices are fixed. With the proximal algorithm, the proximal regularization term makes the subproblems strongly convex [26] and provides a full column rank condition for the sparse NCP.

Secondly, we develop an inexact BCD scheme for the novel sparse NCP model. The inexact scheme speeds up the computation of the sparse NCP, especially in large-scale cases. Specifically, in the inexact BCD scheme, the subproblem of the sparse NCP is iterated several times to update a factor matrix.

Thirdly, in order to demonstrate the viability of the sparse NCP model with the proximal algorithm and the inexact scheme, we employ two efficient optimization algorithms to solve the model: inexact alternating nonnegative quadratic programming and inexact hierarchical alternating least squares. We evaluate the proposed sparse NCP methods on synthetic, real-world, small-scale and large-scale tensor data. With properly selected and tuned sparse regularization, the proposed methods are shown to impose sparsity on the factor matrices effectively and efficiently.

1.3 Organization

The rest of this paper is organized as follows. Section 2 introduces some preliminaries. In Sect. 3, we describe the mathematical model of sparse NCP with the proximal algorithm and the inexact BCD scheme. Section 4 elucidates the solutions to the sparse NCP model using two optimization methods. Section 5 describes the detailed experiments on synthetic and real-world datasets. Some critical observations are discussed in Sect. 6. Finally, we conclude the paper in Sect. 7.

2 Preliminaries

In this paper, the operator \(\circ\) represents the outer product of vectors, \(\odot\) represents the Khatri-Rao product, * represents the Hadamard product (the elementwise matrix product), 〈 〉 represents the inner product, 〚 〛 represents the Kruskal operator and [ ]+ represents the nonnegative projection. \(||\; ||_{F}\) denotes the Frobenius norm, and \(||\; ||_1\) denotes the \(l_1\)-norm. Basics of tensor computation and multilinear algebra can be found in the review papers [25, 35].
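For concreteness, a small MATLAB illustration of these operators on arbitrary toy matrices (not taken from the paper) is given below:

```matlab
% Small numerical illustration of the operators defined above (toy matrices).
A = [1 2; 3 4];
B = [0 1; 5 2];
H  = A .* B;                                          % Hadamard product A * B
KR = [kron(A(:,1), B(:,1)), kron(A(:,2), B(:,2))];    % Khatri-Rao product A ⊙ B (columnwise Kronecker, 4 x 2)
ip = sum(sum(A .* B));                                % inner product <A, B> = 25
Cp = max([-1 2; 3 -4], 0);                            % nonnegative projection [ ]_+ (negatives set to zero)
```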

2.1 Nonnegative CP decomposition

Given an Nth-order nonnegative tensor \(\varvec{\mathscr {X}}\in \mathbb {R}^{I_{1} \times I_{2} \times \cdots \times I_{N}}\) and a positive integer R, nonnegative CANDECOMP/PARAFAC (NCP) solves the following minimization problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)}} \frac{1}{2}&||\varvec{\mathscr {X}}-\llbracket \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \rrbracket ||_{F}^{2}\\ \text {s.t. } \varvec{A}^{(n)}&\geqslant 0 \text { for } n=1, \dotsc ,N, \end{aligned} \end{aligned}$$
(1)

where \(\varvec{A}^{(n)}\in \mathbb {R}^{I_{n} \times R}\) for \(n=1,\dotsc ,N\) are the estimated factor matrices in the different modes, \(I_{n}\) is the size of mode-n, and R is the initial number of components. We use \(\mathscr {F}_{\text {tensor}}\big ( \varvec{A} \big ) = \mathscr {F}_{\text {tensor}}\big ( \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \big )\) to denote the objective function in (1). The estimated factor matrices in the Kruskal operator can be represented by the sum of R rank-1 tensors in outer product form:

$$\begin{aligned} \llbracket \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \rrbracket = \sum _{r=1}^{R} \varvec{\mathscr {Y}}_r = \sum _{r=1}^{R}\varvec{a}_{r}^{(1)} \circ \cdots \circ \varvec{a}_{r}^{(N)}, \end{aligned}$$
(2)

where \(\varvec{a}_{r}^{(n)}\) represents the rth column of \(\varvec{A}^{(n)}\).

Let \(\varvec{X}_{(n)}\in \mathbb {R}^{I_n \times \prod _{\tilde{n}=1,\tilde{n}\ne n}^N I_{\tilde{n}}}\) represent the mode-n unfolding (matricization) of the original tensor \(\varvec{\mathscr {X}}\). The mode-n unfolding of the estimated tensor in the Kruskal operator \(\llbracket \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \rrbracket\) can be written as \(\varvec{A}^{(n)}{\big (\varvec{B}^{(n)}\big )}^T\), in which \(\varvec{B}^{(n)}=\big (\varvec{A}^{(N)}\odot \cdots \odot \varvec{A}^{(n+1)}\odot \varvec{A}^{(n-1)}\odot \cdots \odot \varvec{A}^{(1)}\big )\in \mathbb {R}^{\prod _{\tilde{n}=1,\tilde{n}\ne n}^N I_{\tilde{n}} \times R}\). In the BCD framework, each factor \(\varvec{A}^{(n)}\) is updated alternately by a subproblem in every iteration, which is equivalent to the following minimization problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(n)}} \mathscr {F} \big (\varvec{A}^{(n)}\big )&= \frac{1}{2} ||\varvec{X}_{(n)}-\varvec{A}^{(n)}{\big (\varvec{B}^{(n)}\big )}^T||_{F}^{2}\\&\text {s.t. } \varvec{A}^{(n)} \geqslant 0. \end{aligned} \end{aligned}$$
(3)

The partial gradient (or partial derivative) of \(\mathscr {F} \big (\varvec{A}^{(n)}\big )\) with respect to \(\varvec{A}^{(n)}\) is

$$\begin{aligned} \begin{aligned} \frac{\partial }{\partial \varvec{A}^{(n)}} \mathscr {F} \big (\varvec{A}^{(n)}\big ) = \varvec{A}^{(n)} \big [{\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)} \big ]-\varvec{X}_{(n)} \varvec{B}^{(n)}. \end{aligned} \end{aligned}$$
(4)

In (4), the term \(\varvec{X}_{(n)} \varvec{B}^{(n)}\) is called the Matricized Tensor Times Khatri-Rao Product (MTTKRP) [35]. The term \({\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)}\) can be computed efficiently by

$$\begin{aligned} {\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)} =\;&\big [ {\big ( \varvec{A}^{(N)} \big )}^T\!\varvec{A}^{(N)} \big ] *\cdots *\big [ {\big ( \varvec{A}^{(n+1)} \big )}^T\!\varvec{A}^{(n+1)} \big ]\\ *\;&\big [ {\big ( \varvec{A}^{(n-1)} \big )}^T\!\varvec{A}^{(n-1)} \big ] *\cdots *\big [ {\big ( \varvec{A}^{(1)} \big )}^T\!\varvec{A}^{(1)} \big ]. \end{aligned}$$
(5)
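As a concrete illustration of (5), the following MATLAB sketch (with toy factor matrices A2, A3 that are not part of the paper) computes \({(\varvec{B}^{(1)})}^T\varvec{B}^{(1)}\) for a third-order tensor both via the Hadamard product of the small \(R \times R\) Gram matrices and via the explicit Khatri-Rao product:

```matlab
% Toy check of (5) for a third-order tensor and mode n = 1 (illustrative names).
R = 10;
A2 = rand(40, R); A3 = rand(30, R);

% Efficient route: Hadamard product of the R-by-R Gram matrices of the other modes
BtB = (A3' * A3) .* (A2' * A2);

% Direct route for verification: build B^(1) = A3 (Khatri-Rao) A2 explicitly
B = zeros(size(A3,1) * size(A2,1), R);
for r = 1:R
    B(:, r) = kron(A3(:, r), A2(:, r));     % columnwise Kronecker product
end
norm(BtB - B' * B, 'fro')                   % ≈ 0 up to rounding: the two routes agree
```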

2.2 Sparse regularization with \(l_1\)-norm

In order to impose sparsity on the factor matrices, it is natural to incorporate sparse regularization terms based on the \(l_1\)-norm [9, 47] into the objective function in (1), which leads to the following basic sparse NCP problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)}} \frac{1}{2}&||\varvec{\mathscr {X}}-\llbracket \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \rrbracket ||_{F}^{2}+\sum _{n=1}^N\beta _n\sum _{r=1}^R||\varvec{a}_r^{(n)}||_1\\&\text {s.t. } \varvec{A}^{(n)} \geqslant 0 \text { for } n=1, \dotsc ,N, \end{aligned} \end{aligned}$$
(6)

where the \(\beta _n\) are positive sparse regularization parameters collected in the parameter vector \(\varvec{\beta }\in \mathbb {R}^{N\times 1}\). The subproblem can be written as the following optimization problem

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(n)}} \mathscr {F}_0 \big (\varvec{A}^{(n)}\big ) =&\frac{1}{2} \Big \Vert \varvec{X}_{(n)}-\varvec{A}^{(n)}{\big (\varvec{B}^{(n)}\big )}^T\Big \Vert _{F}^{2} + \beta _n\sum _{r=1}^R\Big \Vert \varvec{a}_r^{(n)}\Big \Vert _1\\&\text {s.t. } \varvec{A}^{(n)} \geqslant 0. \end{aligned} \end{aligned}$$
(7)

In the objective function of the subproblem, the sparse regularization is imposed on the factor matrix \(\varvec{A}^{(n)}\) by the \(l_1\)-norm.

3 The proposed sparse NCP model

3.1 Sparse NCP with proximal algorithm

The basic sparse NCP in (6) has a serious drawback. When strong sparse regularization is imposed in (6), many zero columns will appear in the factor matrices \(\varvec{A}^{(n)}\). Thus, neither \(\varvec{A}^{(n)}\) nor \(\varvec{B}^{(n)}\) can be guaranteed to be of full column rank. Therefore, the basic sparse NCP model in (6) suffers from rank deficiency and is not guaranteed to converge.

In order to overcome this drawback, we propose the following sparse NCP model with the proximal algorithm (a proximal regularization term based on the squared Frobenius norm):

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)}}&\bigg \lbrace \frac{1}{2} \Big \Vert \varvec{\mathscr {X}}-\llbracket \varvec{A}^{(1)},\dotsc ,\varvec{A}^{(N)} \rrbracket \Big \Vert _{F}^{2}\\ + \sum _{n=1}^N&\frac{\alpha _n}{2}\Big \Vert \widetilde{\varvec{A}}^{(n)} - \varvec{A}^{(n)}\Big \Vert _F^2+\sum _{n=1}^N\beta _n\sum _{r=1}^R\Big \Vert \varvec{a}_r^{(n)}\Big \Vert _1 \bigg \rbrace \\&\text {s.t. } \varvec{A}^{(n)} \geqslant 0 \text { for } n=1, \dotsc ,N, \end{aligned} \end{aligned}$$
(8)

where \(\widetilde{\varvec{A}}^{(n)}\) is the value of the factor \(\varvec{A}^{(n)}\) in the previous iteration during the updates, and the \(\alpha _n\) are positive regularization parameters collected in the vector \(\varvec{\alpha }\in \mathbb {R}^{N\times 1}\).

In the BCD framework, the subproblem of model (8) can be written as the following minimization problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(n)}} \mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big ) = \bigg \lbrace \frac{1}{2}&\Big \Vert \varvec{X}_{(n)}-\varvec{A}^{(n)}{\big (\varvec{B}^{(n)}\big )}^T\Big \Vert _{F}^{2}\\ +\frac{\alpha _n}{2}&\Big \Vert \widetilde{\varvec{A}}^{(n)} - \varvec{A}^{(n)}\Big \Vert _F^2 +\beta _n\sum _{r=1}^R\Big \Vert \varvec{a}_r^{(n)}\Big \Vert _1 \bigg \rbrace \\ \text {s.t. }&\varvec{A}^{(n)} \geqslant 0. \end{aligned} \end{aligned}$$
(9)

The objective function \(\mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big )\) can be further represented by the following form:

$$\begin{aligned} \begin{aligned} \mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big )&=\frac{1}{2} \left \Vert \begin{pmatrix} \varvec{X}_{(n)}^T\\ \sqrt{\alpha _n}{\big (\widetilde{\varvec{A}}^{(n)}\big )}^T \end{pmatrix}- \begin{pmatrix} \varvec{B}^{(n)}\\ \sqrt{\alpha _n}\varvec{I}_R \end{pmatrix} {\big (\varvec{A}^{(n)}\big )}^T \right\Vert_F^2\\&\quad +\beta _n\sum _{r=1}^R\Big \Vert \varvec{a}_r^{(n)}\Big \Vert _1. \end{aligned} \end{aligned}$$
(10)

In (10), it is clear that the stacked matrix \(\begin{pmatrix} \varvec{B}^{(n)}\\ \sqrt{\alpha _n}\varvec{I}_R \end{pmatrix}\) is always of full column rank, even when \(\varvec{B}^{(n)}\) is not. Thus, the proposed sparse NCP with the proximal algorithm successfully overcomes the rank deficiency problem in the objective function.
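A quick numerical check of this claim (with arbitrary toy sizes, not taken from the paper) is sketched below: even when \(\varvec{B}^{(n)}\) contains zero columns, the stacked matrix keeps full column rank.

```matlab
% Toy check: stacking sqrt(alpha_n) * I below B^(n) restores full column rank.
R = 5; alpha_n = 1e-4;
B = rand(200, R);
B(:, [2 4]) = 0;                              % simulate two zero components
rank(B)                                       % = 3 (rank deficient)
rank([B; sqrt(alpha_n) * eye(R)])             % = 5 (full column rank)
```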

3.2 Inexact block coordinate descent scheme

BCD is a main framework for solving tensor decompositions. It has been reported that the inexact BCD scheme can accelerate the computation [15, 40]. Specifically, the factor matrices \(\varvec{A}^{(n)}, n=1,\dotsc , N\), are updated alternately in the outer iterations; meanwhile, within the subproblem (9), the factor \(\varvec{A}^{(n)}\) is updated several times in inner iterations. The procedures of the inexact scheme are listed in Algorithm 1.

[Algorithm 1: inexact BCD scheme for the sparse NCP]

3.3 Convergence analysis

The proposed sparse NCP method in (8) is guaranteed to converge to a stationary point.

Proposition 1

Every limit point of the sequence \({\left\{ \varvec{A}_k^{(1)}, \dotsc , \varvec{A}_k^{(N)} \right\} }_{k=1}^\infty\) generated by the sparse NCP in Algorithm 1 is a stationary point of (6).

Proof

The objective function \(\mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big )\) in (9) with the proximal regularization term is strongly convex [4]. Moreover, \(\mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big )\) is a proximal upper bound [17, 33] of the objective function \(\mathscr {F}_0 \big (\varvec{A}^{(n)}\big )\) in (7). Using the inexact block coordinate descent scheme, the subproblem in Algorithm 1 is updated by a finite number of inner iterations. According to Theorem 2 in [49], every limit point of the sequence \({\left\{ \varvec{A}_k^{(1)}, \dotsc , \varvec{A}_k^{(N)} \right\} }_{k=1}^\infty\) generated by the sparse NCP in Algorithm 1 is a stationary point of (6). \(\square\)

4 Optimization methods for solving sparse NCP

In order to demonstrate the viability and effectiveness of the novel sparse NCP with the proximal algorithm and the inexact scheme, we employ the following two optimization methods to solve the model.

4.1 Alternating nonnegative quadratic programming

First, we utilize a method based on a general form of alternating nonnegative least squares (ANLS). The classical ANLS is an important tool for NMF and NCP [24]. Many efficient optimization algorithms have been proposed to solve the nonnegative least squares (NNLS) subproblems, such as active-set (AS) [20] and block principal pivoting (BPP) [22]. However, there are two limitations to applying ANLS to sparse NCP. The first limitation is that ANLS is very prone to rank deficiency; the proximal algorithm in our sparse NCP model tackles this limitation. The second limitation is that the subproblem of our proposed sparse NCP model cannot be represented in a pure least squares form due to the \(l_1\)-norm regularization, as can be seen clearly in (10). Therefore, a different form of the objective function in (8) has to be considered.

Inspired by [27], the subproblem of the proposed sparse NCP in (9) can be represented in the nonnegative quadratic programming (NNQP) form as the following problem:

$$\begin{aligned} \begin{aligned} \min _{\varvec{A}^{(n)}} \sum _{i=1}^{I_n}&\bigg \{\frac{1}{2}{\big [\varvec{A}^{(n)}\big ]}_{(i,:)}\varvec{M}{\big [\varvec{A}^{(n)}\big ]}_{(i,:)}^{T} + \varvec{N}_{(i,:)} {\big [\varvec{A}^{(n)}\big ]}_{(i,:)}^{T}\\&+ \frac{1}{2}{\big [\varvec{X}_{(n)}\big ]}_{(i,:)}{\big [\varvec{X}_{(n)}\big ]}_{(i,:)}^T + \frac{\alpha _n}{2} {\big [\widetilde{\varvec{A}}^{(n)}\big ]}_{(i,:)}{\big [\widetilde{\varvec{A}}^{(n)}\big ]}_{(i,:)}^T \bigg \}\\ \text {s.t. }&\varvec{A}^{(n)} \geqslant 0, \end{aligned} \end{aligned}$$
(11)

where \({\big [\; \big ]}_{(i,:)}\) represents the ith row of a matrix, \(\varvec{M}={\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)} + \alpha _n \varvec{I}_R\), \(\varvec{N}=\beta _n \varvec{E}-\varvec{X}_{(n)}\varvec{B}^{(n)} - \alpha _n\widetilde{\varvec{A}}^{(n)}\) and \(\varvec{E}\) is a matrix of all ones. In fact, NNQP is a general form of NNLS.
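The following MATLAB sketch (toy sizes and illustrative variable names, not the authors' code) forms \(\varvec{M}\) and \(\varvec{N}\) of (11) for mode 1 of a third-order tensor; each row of \(\varvec{A}^{(1)}\) is then obtained by an NNQP solver such as BPP.

```matlab
% Toy sketch of forming the NNQP quantities in (11) for mode n = 1.
R = 8; alpha_n = 1e-4; beta_n = 0.5;
I1 = 60; I2 = 50; I3 = 40;
A1 = max(0, randn(I1, R)); A2 = rand(I2, R); A3 = rand(I3, R);
X1 = rand(I1, I2 * I3);                         % mode-1 unfolding of the data tensor
A1_tilde = A1;                                  % previous iterate, used in the proximal term

B = zeros(I2 * I3, R);                          % Khatri-Rao product B^(1) = A3 ⊙ A2
for r = 1:R
    B(:, r) = kron(A3(:, r), A2(:, r));
end

M = B' * B + alpha_n * eye(R);                  % quadratic term of (11)
N = beta_n * ones(I1, R) - X1 * B - alpha_n * A1_tilde;   % linear term of (11)
% Each row a = A1(i,:) then solves  min 0.5*a*M*a' + N(i,:)*a'  s.t. a >= 0,
% e.g., with the block principal pivoting (BPP) solver [22].
```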

The optimization methods mentioned above for NNLS can also be used to solve the NNQP problem. In this study, we only use block principal pivoting (BPP) [22] as the NNQP solver, which has been proven to be very efficient [22, 24]. The BPP solver contains multiple inner iterations, and we limit the number of inner iterations to a few in the inexact scheme. We call the method of solving the tensor decomposition using NNQP alternating nonnegative quadratic programming (ANQP), and we abbreviate the method of solving the sparse NCP with the proximal algorithm using ANQP as PROX-ANQP. Algorithm 2 explicates the PROX-ANQP method.

[Algorithm 2: the PROX-ANQP method]

4.2 Inexact hierarchical alternating least squares

Second, we employ an inexact hierarchical alternating least squares (iHALS) method to solve the sparse NCP with the proximal algorithm. The conventional HALS is an efficient method that updates each factor column by column [7, 9]. However, the HALS method has two major drawbacks when applied to the sparse NCP.

First, HALS is also very prone to rank deficiency. Specifically, if a column of the factor matrix \(\varvec{A}^{(n)}\) becomes a zero vector, HALS breaks down [22]. One practical remedy is to replace the zero elements with a small positive value [9], such as \(10^{-16}\). However, with this modification, the obtained factor matrices are no longer sparse.

Second, HALS suffers from the caveat problem (see Section 5.2 in [22]). Specifically, unbalanced scales appear across different columns and factors. For example, one column in the first factor might have a scale of \(10^{-8}\) while the corresponding column in the second factor has a scale of \(10^{8}\). At the same time, another column in the first factor might have a scale of \(10^{16}\) while the corresponding column in the third factor has a scale of \(10^{-16}\). One common way to control the unbalanced scales is to normalize all factor columns to unit vectors [9]. However, with factor normalization, the factor columns can never become zero vectors, so it becomes impossible to impose sparsity effectively.

The proximal algorithm in our sparse NCP overcomes both drawbacks. We have mentioned that the proximal regularization guarantees full column rank in the model. Moreover, the proximal regularization term in the sparse NCP keeps all factor columns at balanced scales.

Next, we introduce the solution of the model in (8) using the iHALS method. For simplicity, we write \(\varvec{a}_r\) and \(\varvec{b}_r\) instead of \(\varvec{a}_r^{(n)}\) and \(\varvec{b}_r^{(n)}\) in this part, which are the rth columns of \(\varvec{A}^{(n)}\) and \(\varvec{B}^{(n)}\), respectively. We also use \({\big [ \varvec{A}^{(n)} \big ]}_{(:,r)}=\varvec{a}_r\in \mathbb {R}^{I_n\times 1}\) to represent a column of a matrix, and \({\big [ \varvec{A}^{(n)} \big ]}_{(i,r)}=a_{ir}^{(n)}\) to represent an element of a matrix.

The objective function in (9) can be further represented as

$$\begin{aligned} \begin{aligned} \mathscr {F}_{\text {PROX}} \big (\varvec{A}^{(n)}\big ) =&\frac{1}{2}\Big \Vert {\varvec{X}_{(n)}-\sum _{r=1}^R\varvec{a}_r\varvec{b}_r^T}\Big \Vert _F^2\\&+\frac{\alpha _n}{2}\sum _{r=1}^R||\varvec{a}_r - \widetilde{\varvec{a}}_r||_2^2+\beta _n\sum _{r=1}^R||\varvec{a}_r||_1, \end{aligned} \end{aligned}$$
(12)

where \(\widetilde{\varvec{a}}_r\) is the rth column of \(\widetilde{\varvec{A}}^{(n)}\). The minimization problem for (12) can be solved iteratively by columnwise subproblems:

$$\begin{aligned} \begin{aligned} \min _{\varvec{a}_r}\mathscr {F}_r=\frac{1}{2}\Big \Vert {\varvec{Z}_r-\varvec{a}_r\varvec{b}_r^T}\Big \Vert _F^2+&\frac{\alpha _n}{2}||\varvec{a}_r - \widetilde{\varvec{a}}_r||_2^2+\beta _n||\varvec{a}_r||_1\\ \text {s.t. }\varvec{a}_r\geqslant 0&, \end{aligned} \end{aligned}$$
(13)

for \(r=1,\dotsc ,R\), in which

$$\begin{aligned} \varvec{Z}_r=\varvec{X}_{(n)}-\sum _{\tilde{r}=1,\tilde{r}\ne r}^R\varvec{a}_{\tilde{r}}\varvec{b}_{\tilde{r}}^T. \end{aligned}$$
(14)

The partial derivative of \(\mathscr {F}_r\) with respect to \(\varvec{a}_r\) is

$$\begin{aligned} \begin{aligned} \frac{\partial \mathscr {F}_r}{\partial \varvec{a}_r}&=\big (\varvec{a}_r\varvec{b}_r^T-\varvec{Z}_r\big )\varvec{b}_r + \alpha _n\varvec{a}_r - \alpha _n\widetilde{\varvec{a}}_r + \beta _n\varvec{1},\\&= \big (\varvec{b}_r^T\!\varvec{b}_r+\alpha _n \big )\varvec{a}_r - \big (\varvec{Z}_r\varvec{b}_r + \alpha _n\widetilde{\varvec{a}}_r - \beta _n\varvec{1} \big ), \end{aligned} \end{aligned}$$
(15)

where \(\varvec{1}\in \mathbb {R}^{I_n\times 1}\) is a vector with all elements equal to 1. Setting \(\frac{\partial \mathscr {F}_r}{\partial \varvec{a}_r}=0\) and projecting onto the nonnegative orthant, the column vector \(\varvec{a}_r\) is updated as

$$\begin{aligned} \begin{aligned} \varvec{a}_r\leftarrow {\left[ \frac{\varvec{Z}_r\varvec{b}_r + \alpha _n\widetilde{\varvec{a}}_r - \beta _n\varvec{1}}{\varvec{b}_r^T\varvec{b}_r+\alpha _n} \right] }_+, \end{aligned} \end{aligned}$$
(16)

which is a closed-form solution of (13) according to Theorem 2 in [24].

A fast HALS method has been utilized to solve large-scale problems [7, 24]. We use the same idea to solve the sparse NCP problem. \(\varvec{Z}_r\) in (14) can also be represented as

$$\begin{aligned} \varvec{Z}_r=\varvec{X}_{(n)}-\sum _{\tilde{r}=1}^R\varvec{a}_{\tilde{r}}\varvec{b}_{\tilde{r}}^T + \widetilde{\varvec{a}}_r\varvec{b}_r^T. \end{aligned}$$
(17)

Replacing \(\varvec{Z}_r\) in (16) by (17), we obtain the new update rule for \(\varvec{a}_r\) as shown in (18).

$$\begin{aligned} \begin{aligned} \varvec{a}_r\leftarrow&{\bigg [\frac{ \big ( \varvec{X}_{(n)}-\sum _{\tilde{r}=1}^R\varvec{a}_{\tilde{r}}\varvec{b}_{\tilde{r}}^T + \widetilde{\varvec{a}}_r\varvec{b}_r^T \big ) \varvec{b}_r + \alpha _n \widetilde{\varvec{a}}_r -\beta _n\varvec{1} }{\varvec{b}_r^T\varvec{b}_r+\alpha _n} \bigg ]}_+\\&= {\bigg [ \widetilde{\varvec{a}}_r + \frac{ \varvec{X}_{(n)}\varvec{b}_r - \sum _{\tilde{r}=1}^R\varvec{a}_{\tilde{r}}\varvec{b}_{\tilde{r}}^T\varvec{b}_r - \beta _n\varvec{1}}{\varvec{b}_r^T\varvec{b}_r+\alpha _n} \bigg ]}_+\\&=\Bigg [ \widetilde{\varvec{a}}_r + \frac{{\big [\varvec{X}_{(n)}\varvec{B}^{(n)}\big ]}_{(:,r)} - \varvec{A}^{(n)}{\big [ {\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)}\big ]}_{(:,r)} -\beta _n\varvec{1}}{{\big [{\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)}\big ]}_{(r,r)}+\alpha _n} \Bigg ]_+ \end{aligned} \end{aligned}$$
(18)

We implement the above procedure using the inexact scheme. We use PROX-iHALS to denote the inexact hierarchical alternating least squares method for solving the sparse NCP with the proximal algorithm. PROX-iHALS is illustrated in Algorithm 3.

[Algorithm 3: the PROX-iHALS method]
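As an illustration of Algorithm 3, the following self-contained MATLAB sketch (toy data, illustrative names, and an assumed handling of \(\widetilde{\varvec{a}}_r\) as the stored value of the column just before its update) performs a few inexact inner iterations of the columnwise rule (18) for mode 1 of a third-order tensor:

```matlab
% Toy sketch of a few PROX-iHALS inner iterations for mode n = 1, rule (18).
R = 8; alpha_n = 1e-4; beta_n = 0.5; MAX_INNER_ITER = 5;
I1 = 60; I2 = 50; I3 = 40;
A1 = max(0, randn(I1, R)); A2 = rand(I2, R); A3 = rand(I3, R);
X1 = rand(I1, I2 * I3);                         % mode-1 unfolding of the data tensor

B = zeros(I2 * I3, R);                          % B^(1) = A3 ⊙ A2
for r = 1:R
    B(:, r) = kron(A3(:, r), A2(:, r));
end
MKR = X1 * B;                                   % MTTKRP, computed once per subproblem
BtB = (A3' * A3) .* (A2' * A2);                 % (B^(1))' * B^(1) via (5)

for l = 1:MAX_INNER_ITER                        % a few inexact inner iterations
    A1_prev = A1;
    for r = 1:R
        a_tilde = A1(:, r);                     % proximal point for this column
        g = MKR(:, r) - A1 * BtB(:, r) - beta_n;             % numerator of (18), beta_n subtracted elementwise
        A1(:, r) = max(0, a_tilde + g / (BtB(r, r) + alpha_n));   % [ ]_+ projection
    end
    if norm(A1 - A1_prev, 'fro') / norm(A1, 'fro') < 0.01    % relative residual test (21)
        break;
    end
end
```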

4.3 Stopping conditions

4.3.1 Stopping condition for outer loop

We terminate the outer loop according to the change of the relative error over the iterations. The relative error measures the data fitting. In the kth outer iteration, the relative error [48] of the tensor decomposition is defined by

$$\begin{aligned} \text {RelErr}_k=\frac{\Vert \varvec{\mathscr {X}} - \llbracket \varvec{A}_k^{(1)},\dotsc ,\varvec{A}_k^{(N)} \rrbracket \Vert _F}{\Vert \varvec{\mathscr {X}}\Vert _F}. \end{aligned}$$
(19)

Based on the relative error, we terminate the outer loop using the following stopping condition

$$\begin{aligned} |\text {RelErr}_{k-1}-\text {RelErr}_k |< \varepsilon . \end{aligned}$$
(20)

The threshold \(\varepsilon\) can be set to a very small positive value, such as \(1e-8\).

In addition, we also set a maximum running time for the outer loop.
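Since matricization preserves the Frobenius norm, (19) can be computed from any mode-n unfolding as \(\Vert \varvec{X}_{(n)}-\varvec{A}^{(n)}{(\varvec{B}^{(n)})}^T\Vert _F / \Vert \varvec{X}_{(n)}\Vert _F\). A minimal MATLAB sketch with toy data (not part of the paper's implementation) is:

```matlab
% Toy sketch of the relative error (19) and the outer stopping rule (20).
R = 8; I1 = 60; I2 = 50; I3 = 40; epsilon = 1e-8;
A1 = rand(I1, R); A2 = rand(I2, R); A3 = rand(I3, R);
X1 = rand(I1, I2 * I3);                         % mode-1 unfolding of the data tensor
B1 = zeros(I2 * I3, R);
for r = 1:R
    B1(:, r) = kron(A3(:, r), A2(:, r));        % B^(1) = A3 ⊙ A2
end
RelErr = norm(X1 - A1 * B1', 'fro') / norm(X1, 'fro');      % equation (19)
% In the outer loop, stop when |RelErr_prev - RelErr| < epsilon (20),
% or when the maximum allowed running time is exceeded.
```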

4.3.2 Stopping condition for inner loop

In the lth inner iteration, we define the relative residual of the nth factor matrix \(\varvec{A}^{(n)}\) as

$$\begin{aligned} r_l^{(n)}=\frac{\big \Vert \varvec{A}_l^{(n)} - \varvec{A}_{l-1}^{(n)}\big \Vert _F}{\big \Vert \varvec{A}_l^{(n)}\big \Vert _F}. \end{aligned}$$
(21)

For PROX-iHALS, we terminate the inner loop with the stopping condition \(r_l^{(n)}<\delta ^{(n)}\), where \(\delta ^{(n)}\) is a dynamic positive threshold with initial value \(\delta ^{(n)}=0.01\). If the inner loop stops after only one iteration, we tighten the threshold by \(\delta ^{(n)}=\delta ^{(n)}/10\). For PROX-ANQP, the inner loop is terminated according to whether the columns have reached the feasible region of the BPP algorithm [22].

Since we employ the inexact BCD framework, we also set a maximum number of inner iterations (MAX_INNER_ITER) to terminate the inner loop.

We summarize the stopping conditions for both the outer and inner loops in Algorithm 4.

[Algorithm 4: stopping conditions for the outer and inner loops]

4.4 Remarks on convergence

The PROX-ANQP and PROX-iHALS methods in the inexact BCD framework have excellent convergence properties. In Sect. 3.3, we mentioned that the proposed sparse NCP with the proximal algorithm and the inexact BCD scheme is guaranteed to converge to a stationary point. The subproblem with the proximal algorithm in (9) is strongly convex, which yields a unique minimizer [4]. Furthermore, the optimization methods ANQP and iHALS stably decrease the objective value of the subproblem. According to Proposition 3.7.1 in [4], both PROX-ANQP and PROX-iHALS converge to stationary points.

5 Experiments and results

We carried out experiments on synthetic, real-world, dense, sparse, small-scale, and large-scale tensors. We compared the proposed PROX-ANQP and PROX-iHALS methods with the three sparse NCP methods listed below.

  • AO-ADMM: This is the sparse NCP method using the AO-ADMM algorithm [19], which includes multiple inner iterations. The \(l_1\)-norm is handled by a proximal operator.

  • iAPG: We extend the APG method in sparse Tucker decomposition [47] to the sparse NCP problem in (6). In order to make a fair comparison, we implement APG in the inexact scheme using multiple inner iterations, which is abbreviated as iAPG [42]. The \(l_1\)-norm is handled by a proximal operator.

  • iMU: This is the sparse NCP method using the classical MU algorithm [9]. We implement MU in the inexact scheme using multiple inner iterations, which is abbreviated as iMU.

The above three methods can be directly applied to solve the sparse NCP in (6). The \(l_1\)-norm is handled by a proximal operator in AO-ADMM and iAPG; owing to this proximal operator, AO-ADMM and iAPG do not suffer from rank deficiency. Owing to its multiplicative update rule, iMU does not suffer from rank deficiency either.

Table 1 summarizes the computational complexity of all the sparse NCP methods. Only the multiplicative operations for mode-n in one outer iteration are counted. The main time cost of these algorithms is spent on the calculation of the MTTKRP \(\varvec{X}_{(n)} \varvec{B}^{(n)}\), which consists of two parts: the Khatri-Rao product \(\varvec{B}^{(n)}\) and the matrix product of \(\varvec{X}_{(n)}\) and \(\varvec{B}^{(n)}\). The computational complexity of \(\varvec{B}^{(n)}\) reaches \(R\prod _{\tilde{n}=1,\tilde{n}\ne n}^N I_{\tilde{n}}\), and that of \(\varvec{X}_{(n)} \varvec{B}^{(n)}\) reaches \(R\prod _{n=1}^N I_n\). The term \({\big ( \varvec{B}^{(n)} \big )}^T\!\varvec{B}^{(n)}\) can be calculated efficiently by (5), whose complexity is \(R^2\sum _{\tilde{n}=1,\tilde{n}\ne n}^N I_{\tilde{n}}\). For the inner loop of the subproblem, \(\bar{K}\) is the average number of inner iterations. Table 1 shows that the complexities of these algorithms are highly comparable; it can therefore be inferred that the convergence time is mainly determined by the number of iterations.

Table 1 Computational Complexity of Multiplicative Operations for Subproblem (9)

Many experimental parameters and settings will affect the performances of a sparse NCP method. Since our purpose in the experiments is only to test the ability to impose sparsity, we fix the following settings for all methods.

  • Initialization. For PROX-ANQP, PROX-iHALS, AO-ADMM and iAPG, all factor matrices were initialized with nonnegative random numbers by the MATLAB function max(0,randn(\(I_n,R\))). Only iMU was initialized by max(0,randn(\(I_n,R\)))+0.1. All initialized factors were scaled by \(\textstyle \varvec{A}_0^{(n)}=\frac{\varvec{A}_0^{(n)}}{||\varvec{A}_0^{(n)}||_F}\times \root N \of {||\varvec{\mathscr {X}}||_F}\) (see the sketch after this list).

  • The factor updating order was fixed as \(1,2,\dotsc ,N\).

  • The maximum number of inner iterations MAX_INNER_ITER was fixed at 5, following the default setting in AO-ADMM [19].

  • For the PROX-ANQP and PROX-iHALS methods, the proximal regularization parameter \(\alpha _n\) was fixed at 1e-4 [45].
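The initialization and scaling described in the first item of this list can be sketched as follows (a minimal MATLAB sketch with toy sizes; normX stands for \(||\varvec{\mathscr {X}}||_F\), which is not computed here):

```matlab
% Minimal sketch of the random initialization and scaling (toy sizes, illustrative names).
N = 3; R = 20; I = [1000 100 100];
normX = 50;                                  % stands for ||X||_F of the data tensor
A0 = cell(1, N);
for n = 1:N
    A0{n} = max(0, randn(I(n), R));          % nonnegative random initialization
    A0{n} = A0{n} / norm(A0{n}, 'fro') * normX^(1/N);   % rescale to match the data norm
end
```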

The \(l_1\)-norm regularization parameters \(\beta _n, n=1,\dotsc ,N,\) in the sparse NCP are the key parameters for imposing sparsity and are the most crucial test parameters in the experiments. We selected a sequence of \(\beta _n\) values in ascending order for each tensor by manual testing. For synthetic tensors, we stopped increasing \(\beta _n\) when the true sparse components were recovered, while for real-world tensors, we stopped increasing \(\beta _n\) when the number of nonzero components was reduced to less than half of the initial number. To simplify the selection and testing of the parameters, we kept \(\beta _n, n=1,\dotsc ,N,\) the same in all modes of the tensor. After choosing \(\beta _n\), we calculated and evaluated the sparsity level [44] of the factor matrices by

$$\begin{aligned} \text {Sparsity}_{\varvec{A}^{(n)}}=\frac{\#\big \lbrace \varvec{A}^{(n)}_{i,r}<T_s \big \rbrace }{I_n \times R}, \end{aligned}$$
(22)

where \(T_s\) is a small positive number and \(\#\left\{ \cdot \right\}\) denotes the number of elements in the factor matrix \(\varvec{A}^{(n)}\) that are smaller than the threshold \(T_s\).
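A one-line MATLAB sketch of (22) on a toy factor matrix (not taken from the paper) is:

```matlab
% Sketch of the sparsity level (22) for a toy nonnegative factor matrix.
T_s = 1e-6;                                     % small positive threshold
A = max(0, randn(100, 20) - 1);                 % toy factor, mostly zeros
sparsity = nnz(A < T_s) / numel(A);             % fraction of (near-)zero entries
```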

In the synthetic tensor experiments, we used prior sparse matrices to construct the data. After decomposition, the accuracy of the recovered sparse signals should be evaluated. Let \(\varvec{S}^{(n)}=[\varvec{s}_1,\dotsc ,\varvec{s}_{R}]\in \mathbb {R}^{L \times R}\) denote the mode-n prior sparse matrix, where R is the true number of components and L is the length of a component. Let \(\varvec{T}^{(n)}=[\varvec{t}_1,\dotsc ,\varvec{t}_{\tilde{R}}]\in \mathbb {R}^{L \times \tilde{R}}\) represent the mode-n estimated sparse matrix, where \(\tilde{R}\) is the estimated number of nonzero components. We evaluate the accuracy of the estimated matrix \(\varvec{T}^{(n)}\) against the original sparse signals \(\varvec{S}^{(n)}\) by the peak signal-to-noise ratio (PSNR, see Chapter 3 in [9])

$$\begin{aligned} \text {PSNR}=\frac{1}{\tilde{R}}\sum _{r=1}^{\tilde{R}}10\text {log}_{10} \frac{L}{\big \Vert \hat{\varvec{t}}_r-\hat{\varvec{s}}_c\big \Vert _2^2}, \end{aligned}$$
(23)

where \(\hat{\varvec{t}}_r\) is the rth normalized estimated sparse signal, and \(\hat{\varvec{s}}_c\) is the normalized reference sparse signal, namely the column of \(\varvec{S}^{(n)}\) that has the highest correlation coefficient with \(\hat{\varvec{t}}_r\).
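A toy MATLAB sketch of (23) is given below, assuming unit \(l_2\)-norm normalization of the columns; the data, sizes and names are illustrative only.

```matlab
% Toy sketch of the PSNR measure (23): each estimated component is matched to
% the reference component with the highest correlation coefficient.
L = 1000; R_true = 10;
S = max(0, randn(L, R_true) - 1);                      % reference sparse components
T = S(:, randperm(R_true)) + 1e-3 * rand(L, R_true);   % permuted, noisy estimates
S_hat = bsxfun(@rdivide, S, sqrt(sum(S.^2, 1)));       % normalize reference columns
T_hat = bsxfun(@rdivide, T, sqrt(sum(T.^2, 1)));       % normalize estimated columns
R_est = size(T, 2);
psnr_sum = 0;
for r = 1:R_est
    c = zeros(1, R_true);
    for j = 1:R_true
        cc = corrcoef(T(:, r), S(:, j));               % correlation with each reference
        c(j) = cc(1, 2);
    end
    [~, best] = max(c);                                % best-matching reference column
    psnr_sum = psnr_sum + 10 * log10(L / norm(T_hat(:, r) - S_hat(:, best))^2);
end
PSNR = psnr_sum / R_est;                               % average PSNR over estimated components
```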

All the experiments were conducted on a computer with an Intel Core i5-4590 3.30 GHz CPU, 8 GB memory, 64-bit Windows 10 and MATLAB R2016b. The fundamental tensor computations were based on Tensor Toolbox 2.6 [3]. The codes are available on the author's website http://deqing.me/.

5.1 Synthetic tensor data

5.1.1 Size \(1000 \times 100 \times 100 \times 5\) with one sparse factor

In this experiment, we constructed a synthetic fourth-order tensor from 10 channels of simulated sparse and nonnegative signals, as shown in Fig. 1a. The signals come from the file VSparse_rand_10.mat in NMFLAB [8]. There are 1000 points in each channel, so the sparse signal matrix is \(\varvec{S}^{(1)}=[\varvec{s}_1,\dotsc ,\varvec{s}_{10}]\in \mathbb {R}^{1000 \times 10}\). Three uniformly distributed random matrices \(\varvec{A}^{(2)}, \varvec{A}^{(3)} \in \mathbb {R}^{100 \times 10}\) and \(\varvec{A}^{(4)} \in \mathbb {R}^{5 \times 10}\) were employed as mixing matrices, generated by the rand function in MATLAB. Afterwards, we synthesized a fourth-order tensor \(\varvec{\mathscr {X}}_{\text {SYN1}}= \llbracket \varvec{S}^{(1)},\varvec{A}^{(2)},\varvec{A}^{(3)},\varvec{A}^{(4)} \rrbracket \in \mathbb {R}^{1000 \times 100 \times 100 \times 5}\). Next, nonnegative noise with an SNR of 40 dB, generated by the code max(0,randn(size(\(\varvec{\mathscr {X}}\)))), was added to the tensor.

Fig. 1

Sparse and nonnegative signals used in synthetic tensor. a shows the original ten channels of signals. b shows the estimated ten channels of signals from the synthetic tensor \(\varvec{\mathscr {X}}_{\text {SYN1}}\) by sparse NCP based on the PROX-ANQP method with \(\beta _n=5\). The PSNR is 90.2698 according to (23)

For all sparse NCP methods, we set \(\varepsilon =1e-8\) as the threshold of the outer stopping condition in (20). We set \(T_s=1e-3\) in (22). The maximum running time was set to 180 seconds. We selected 20 as the number of components for the tensor decomposition. The reason is that we intend to recover the ten channels of the true signals just by imposing sparse regularization during the decomposition, even though the exact optimal number of components is unknown. We selected the values of \(\beta _n=0,0.1,0.5,1,2,3\) for all the optimization methods to evaluate their abilities to impose sparsity. The selection of the sparse regularization parameters depends on the tensor data. After the tensor decomposition, the objective function value (Obj), relative error (RelErr), running time in seconds (wall-clock time), iteration number (Iter), number of nonzero components (NNC), sparsity level (Spars) and PSNR of the estimated signal factor matrix were recorded as performance evaluation criteria. For each optimization method and each \(\beta _n\), the sparse NCP was run 30 times, and the average values of all criteria were computed. The results are shown in Table 2, in which the outstanding performances of the sparse NCP algorithms are highlighted in bold.

From Table 2, it can be seen that all methods are able to impose sparsity with a proper sparse regularization parameter \(\beta _n\). As \(\beta _n\) increases, the sparsity level of the mode-1 factor matrix also increases. After properly tuning \(\beta _n\), weak components are removed (set to 0), weak elements in strong components are suppressed, and the true ten channels of sparse signals are recovered.

When \(\beta _n\) increases to a proper value, the PSNR increases significantly. In this experiment, the higher the PSNR, the better an algorithm recovers the original sparse components. Table 2 clearly shows that PROX-ANQP, PROX-iHALS and iAPG reach higher PSNR with larger sparse regularization parameters, for example \(\beta _n=4,5\). This means that these three methods recover the ten channels of sparse signals more precisely. One of the sparse signal matrices recovered from \(\varvec{\mathscr {X}}_{\text {SYN1}}\) by PROX-ANQP is shown in Fig. 1b.

For the synthetic data, the objective function values and relative errors are very similar across methods for the same \(\beta _n\) value. The convergence speed can be assessed from Table 2. iMU converges slowly compared with the other methods. AO-ADMM is slow with \(\beta _n=0\), but it becomes fast with \(\beta _n>0\). PROX-ANQP, PROX-iHALS and iAPG all perform very well. It can also be concluded from Table 2 that the running time is highly related to the number of outer iterations.

Table 2 Comparison of Sparse NCPs on \(\varvec{\mathscr {X}}_{\text {SYN1}} \in \mathbb {R}^{1000 \times 100 \times 100 \times 5}\)

5.1.2 Size \(500 \times 500 \times 500\) with two sparse factors

For this third-order tensor, the factor matrices were generated using the following MATLAB code.

Factor | Code | Zeros (%)
\(\varvec{S}^{(1)}\in \mathbb {R}^{500 \times 100}\) | max(0,rand(500,100)*10-9); | 90
\(\varvec{S}^{(2)}\in \mathbb {R}^{500 \times 100}\) | max(0,rand(500,100)*2-1); | 50
\(\varvec{A}^{(3)}\in \mathbb {R}^{500 \times 100}\) | rand(500,100); | 0

Afterwards, a third-order tensor was synthesized by \(\varvec{\mathscr {X}}_{\text {SYN2}}=\llbracket \varvec{S}^{(1)},\varvec{S}^{(2)},\varvec{A}^{(3)} \rrbracket\), whose true number of components was 100 (rank = 100). Noise with an SNR of 40 dB was added.

We set the outer stopping condition to \(\varepsilon =1e-6\) and the maximum running time to 600 seconds. The initial number of components was set to 200. The average performances of all sparse NCP methods over 30 runs were computed. We only report the running time in seconds, iteration number (Iter), number of nonzero components (NNC) and sparsity level (Spars) of all estimated factors in Table 3.

Table 3 Comparison of Sparse NCPs on \(\varvec{\mathscr {X}}_{\text {SYN2}} \in \mathbb {R}^{500 \times 500 \times 500}\)

Table 3 shows that all methods are able to impose sparsity on all factor matrices. The PROX-ANQP, PROX-iHALS and iAPG methods perform very well in extracting the true 100 sparse components. Interestingly, the sparsity levels of the factor matrices extracted by PROX-ANQP, PROX-iHALS and iAPG are also very close to the ground-truth values for some \(\beta _n\). iMU and AO-ADMM do not always reach the ground-truth factor sparsity levels. Moreover, iMU converges slowly compared with the other methods.

Table 4 Comparison of Sparse NCPs on Ongoing EEG Tensor \(\varvec{\mathscr {X}}_{\text {EEG}} \in \mathbb {R}^{64 \times 146 \times 510}\)

5.2 Third-order dense tensor data

In this experiment, we used a real-world third-order ongoing EEG tensor. The original data were collected from fourteen healthy, right-handed adult subjects in a music listening experiment. The music stimulus was an 8.5-minute piece of modern tango, Adiós Nonino, by the composer Astor Piazzolla. A Short-Time Fourier Transform (STFT) with a window size of 3 seconds and a hop size of 1 second was used to transform the data of each subject into a third-order tensor. Details of data collection and preprocessing can be found in [43, 44]. We only employ the tensor data of one subject in this experiment. The size of this tensor is \(channel \times frequency \times time = 64 \times 146 \times 510\), in which the 64 channel points represent 64 electrodes on the scalp, the 146 frequency points represent the spectrum in 1-30 Hz, and the 510 time points cover the roughly 8.5-minute duration of the stimulus. Since the spectra from EEG tensors are usually sparse, we wish to recover the sparse spectral components by sparse regularization.

We set \(\varepsilon =1e-8\) in (20) and \(T_s=1e-6\) in (22). The maximum running time was set to 120 seconds. The initial number of components was set to 40 according to previous studies [11, 44]. We tested the values of \(\beta _n=0,1e5,5e5,10e5,15e5,20e5\) for all methods. All methods were run 30 times. The averages of the performance criteria are recorded in Table 4. The results show that all methods are effective in imposing sparsity with \(\beta _n\). iMU is slower than the other four methods.

We selected three groups of components extracted with the PROX-ANQP method with \(\beta _n=0,5e5,10e5\), respectively, as shown in Fig. 2. These three groups show the same brain activity. The spectra clearly become sparser as the sparse regularization parameter increases. With \(\beta _n=5e5,10e5\), more and more unnecessary elements are removed from the spectra, and only the most prominent frequency band is retained. Figure 2 demonstrates that our methods are effective in extracting meaningful sparse components related to brain activities.

Fig. 2

Selected groups of components from the ongoing EEG tensor using the PROX-ANQP algorithm. All groups reveal the same brain activity. In the decomposed EEG data, the spatial component is topography, the spectral component is the spectrum, and the temporal component is the energy evolution series. The components in a were extracted with \(\beta _n=0\), b with \(\beta _n=5 \times 10^5\) and c with \(\beta _n=10 \times 10^5\)

5.3 Third-order large-scale sparse tensor data

In this experiment, we tested the sparse NCP algorithms on a third-order large-scale sparse social network tensor. The data contain Facebook wall-post information among 46,952 users over 1952 days [41]. The size of this sparse tensor \(\varvec{\mathscr {X}}_{\text {Sp1}}\) is \(46{,}952 \times 46{,}951 \times 1952\), and the number of nonzero elements is 737,928.

We set the outer stopping condition to \(\varepsilon =1e-6\) in (20) and the sparsity threshold to \(T_s=1e-6\) in (22). The maximum running time was set to 1200 seconds, and the initial number of components was set to 100. We tested the values of \(\beta _n=0, 0.01, 0.05, 0.1\) for all methods. The average values of the performance criteria over 30 runs are recorded in Table 5. The objective function values of all methods within the first 600 seconds are shown in Fig. 3.

Table 5 Comparison of Sparse NCPs on Facebook Wall Posts Tensor \(\varvec{\mathscr {X}}_{\text {Sp1}} \in \mathbb {R}^{46952 \times 46951 \times 1952}\)
Table 6 Comparison of Sparse NCPs on NIPS Publications Tensor \(\varvec{\mathscr {X}}_{\text {Sp2}} \in \mathbb {R}^{2482 \times 2862 \times 14036 \times 17}\)
Fig. 3

The Objective Function Value Curves of Sparse NCPs on Third-order Facebook Wall Posts Tensor

Fig. 4

The objective function value curves of sparse NCPs on fourth-order NIPS publications tensor

5.4 Fourth-order large-scale sparse tensor data

In this experiment, we evaluated the sparse NCP algorithms on a fourth-order large-scale sparse tensor of NIPS publications [16]. The size of this sparse tensor \(\varvec{\mathscr {X}}_{\text {Sp2}}\) is \(2482 \times 2862 \times 14{,}036 \times 17\); the modes represent 2482 papers, 2862 authors, 14,036 words and 17 years. There are 3,101,609 nonzero elements.

The maximum running time was set to 1800 seconds. The other settings were the same as in the previous third-order case. We tested the values of \(\beta _n=0, 0.1, 0.3, 0.5\) for all methods. The average values of the performance criteria over 30 runs are recorded in Table 6. Figure 4 illustrates the objective function values of all methods within the first 1200 seconds.

The experimental results on both the third-order and the fourth-order large-scale sparse tensors demonstrate that all our algorithms are effective in imposing sparsity on the factor matrices. From Tables 5 and 6, it is clear that, as the sparse regularization parameter \(\beta _n\) increases, the number of nonzero components (NNC) decreases gradually. The factor matrices extracted from the sparse tensors are extremely sparse even when no additional sparse regularization is imposed (\(\beta _n=0\)). Nevertheless, our algorithms with \(\beta _n>0\) can further increase the sparsity level of the factor matrices.

In terms of convergence speed, both the PROX-ANQP and PROX-iHALS methods run fast on the large-scale sparse tensors for different \(\beta _n\) values compared with the other methods, as can be seen from Tables 5 and 6 and Figs. 3 and 4. In contrast, AO-ADMM and iAPG are slow on the large-scale sparse tensors. iMU converges fast, but it reaches higher objective function values for large \(\beta _n\) (e.g., \(\beta _n\) = 0.05, 0.1 in the third-order case, and \(\beta _n\) = 0.3, 0.5 in the fourth-order case). The reason is that, for the same \(\beta _n\) value, iMU yields fewer nonzero components than the other methods.

6 Discussion

We have proposed a novel sparse NCP model using the proximal algorithm and the inexact scheme. The model can be efficiently solved by two algorithms, PROX-ANQP and PROX-iHALS. In order to test the performance of the algorithms, we conducted experiments in different settings, including synthetic and real-world tensors, third-order and fourth-order tensors, dense and sparse tensors, and small-scale and large-scale tensors. Three state-of-the-art sparse NCP methods were also tested for comparison: AO-ADMM, iAPG and iMU. We have the following findings. (1) Compared with the other methods, PROX-ANQP and PROX-iHALS converge fast and impose sparsity effectively in all cases. Their outstanding performances stem from two ingredients: the proximal algorithm, which overcomes rank deficiency, and the inexact scheme, which increases efficiency. (2) The iAPG method contains a proximal operator, which can handle the \(l_1\)-norm; with this proximal operator, iAPG does not suffer from rank deficiency. In the experiments, we find that iAPG is very efficient at imposing sparsity for small-scale dense tensor decomposition. Nevertheless, iAPG runs slowly for large-scale sparse tensor decomposition. (3) The iMU method converges very slowly on dense tensors, but it becomes fast on large-scale sparse tensors. For sparse tensor decomposition, the extracted factor matrices are already extremely sparse, and most elements in the factors are zero. According to the multiplicative update rule, once an element becomes zero, it never changes again. This property might be the reason why iMU converges fast on sparse tensors. (4) AO-ADMM also contains a proximal operator that can handle the \(l_1\)-norm and overcome rank deficiency. However, AO-ADMM converges slowly compared with PROX-ANQP and PROX-iHALS in most cases, particularly on the large-scale sparse tensors. Moreover, AO-ADMM is inferior to iAPG with \(\beta _n=0\) in many cases. In summary, our proposed PROX-ANQP and PROX-iHALS methods have the best performance for sparse NCP and generalize well to different types and scales of datasets.

In addition to the solving methods, another critical issue of sparse NCP is the selection of the sparse regularization parameter \(\beta _n\). Firstly, one purpose of this paper is to demonstrate the effectiveness of the algorithms in imposing sparsity. Therefore, in order to simplify the selection of parameters, we keep \(\beta _n\) the same for all factor matrices in the sparse NCP. Even with the same \(\beta _n\) in all modes, the sparse NCP can still recover highly sparse components and weakly sparse (or even dense) components in different factor matrices. In the future, it would be interesting to investigate how to control the sparsity levels of different factor matrices separately using unbalanced sparse regularization parameters. Secondly, the appropriate value of \(\beta _n\) depends on the tensor to be decomposed. In this study, we selected \(\beta _n\) separately for each tensor in the experiments. When the sparse regularization parameter is larger, the extracted factor matrices are sparser, but the relative error of the decomposition is also larger. The trade-off between the sparsity level and the relative error depends on the requirements of the real application. An example of sparse regularization parameter selection of sparse NCP for ongoing EEG can be found in [44]. It is also possible to select an appropriate parameter for a concrete application using model-order selection methods [37], such as the Bayesian information criterion (BIC).

The third critical issue of sparse NCP is the sparse regularization term itself. In this study, we only investigated the \(l_1\)-norm term. In the future, it is worth incorporating other types of sparse regularization terms [2] into our sparse NCP model besides the \(l_1\)-norm, such as the \(l_q\)-norm (\(0<q<1\)) [36] and the trace norm [29].

7 Conclusion

In this paper, we have investigated nonnegative CANDECOMP/PARAFAC tensor decomposition with \(l_1\)-norm-based sparse regularization (sparse NCP). We have proposed a novel sparse NCP model using the proximal algorithm, which guarantees the full column rank condition and convergence to a stationary point. In addition, an inexact block coordinate descent scheme was presented to accelerate the computation of the sparse NCP; in this inexact scheme, the subproblems are updated using multiple inner iterations. We have employed two algorithms to solve the proposed sparse NCP model with the proximal algorithm: inexact alternating nonnegative quadratic programming (PROX-ANQP) and inexact hierarchical alternating least squares (PROX-iHALS). The experimental results on synthetic, real-world, small-scale and large-scale tensors demonstrated that our sparse NCP methods can impose sparsity and extract meaningful sparse components successfully. Both PROX-ANQP and PROX-iHALS exhibited faster computation and better sparsity-imposing performance than the other sparse NCP algorithms. The experimental results show that the proposed sparse NCP with the proximal algorithm and the inexact scheme is effective and efficient.