Sparse nonnegative tensor decomposition using proximal algorithm and inexact block coordinate descent scheme

Nonnegative tensor decomposition is a versatile tool for multiway data analysis, by which the extracted components are nonnegative and usually sparse. Nevertheless, the sparsity is only a side effect and cannot be explicitly controlled without additional regularization. In this paper, we investigate the nonnegative CANDECOMP/PARAFAC (NCP) decomposition with a sparse regularization term using the $l_1$-norm (sparse NCP). When high sparsity is imposed, the factor matrices will contain more zero components and will not be of full column rank. Thus, the sparse NCP is prone to rank deficiency, and its algorithms may not converge. In this paper, we propose a novel model of sparse NCP with the proximal algorithm. The subproblems in the new model are strongly convex in the block coordinate descent (BCD) framework. Therefore, the new sparse NCP provides a full column rank condition and is guaranteed to converge to a stationary point. In addition, we propose an inexact BCD scheme for sparse NCP, where each subproblem is updated multiple times to speed up the computation. In order to prove the effectiveness and efficiency of the sparse NCP with the proximal algorithm, we employ two optimization algorithms to solve the model: inexact alternating nonnegative quadratic programming and inexact hierarchical alternating least squares. We evaluate the proposed sparse NCP methods by experiments on synthetic, real-world, small-scale, and large-scale tensor data. The experimental results demonstrate that our proposed algorithms can efficiently impose sparsity on factor matrices, extract meaningful sparse components, and outperform state-of-the-art methods.


Background
Nonnegative tensor decomposition is a powerful tool in signal processing and machine learning [10,35]. Nonnegative CANDECOMP/PARAFAC (NCP), as an important decomposition method, has been widely applied to processing multiway data, such as hyperspectral data [39], electroencephalograph (EEG) data [11], fluorescence excitation-emission matrix (EEM) data [13], neural data [46], and many other multiway tensor data [30]. In many cases, the components extracted by NCP are not only nonnegative but also sparse. For example, the spectral components from EEG tensor decomposition are usually very sparse, representing the narrow-band frequencies of some brain activities [11]. As another example, after decomposing an EEM tensor, a component in the sample mode denotes the concentrations of a compound in all samples [5], which is sometimes also sparse. The nonnegativity constraint in NCP naturally leads to sparse results. However, this sparsity is only a side effect, which cannot be controlled to a certain level [18]. Without properly controlling the sparsity, the intrinsic components in the data cannot be extracted precisely, especially in low signal-to-noise ratio conditions. Therefore, in order to extract meaningful and accurate sparse components, additional sparse regularization is necessary for NCP tensor decomposition.
The design of NCP decomposition with explicit sparse regularization (sparse NCP) can benefit greatly from methods developed for nonnegative matrix factorization (NMF).
On the one hand, an early study of NMF [18] proposed projecting components into sparse vectors at a given sparsity level. However, this method keeps all components at the same fixed sparsity level, which is not in line with the true sparsity of different components in the data. On the other hand, incorporating sparse regularization terms into the optimization model is a popular method. The $l_1$-norm is a conventional and effective regularizer for imposing sparsity in signal processing [6]. The reason is that, for most underdetermined linear equations, the optimization problem with $l_1$-norm regularization can yield strong sparsity [12]. More information about sparse regularization can be found in [2,34,50].
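As a concrete illustration of why the $l_1$-norm induces sparsity (a generic numpy sketch, not code from this paper): under a nonnegativity constraint, the proximal operator of $\beta\|x\|_1$ is a one-sided soft-thresholding that sets small entries exactly to zero.

```python
import numpy as np

def soft_threshold_nonneg(x, beta):
    """Nonnegative soft-thresholding: the proximal operator of beta*||x||_1
    restricted to x >= 0. Entries below beta are set exactly to zero."""
    return np.maximum(x - beta, 0.0)

x = np.array([0.05, 0.5, 0.0, 1.2, 0.09])
print(soft_threshold_nonneg(x, 0.1))  # small entries become exactly zero
```

This exact-zeroing behavior is what distinguishes $l_1$ regularization from, e.g., squared-norm penalties, which only shrink entries toward zero.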
Many works have been devoted to tensor decomposition with sparse regularization, but only a few address NCP. The works [1,21] and [29] studied sparse regularization for tensor decomposition using the $l_1$-norm and the trace norm, but they only focused on the unconstrained CP model without the nonnegativity constraint. The works [14,28,31] and [47] proposed methods of imposing sparsity by the $l_1$-norm on nonnegative Tucker decomposition. However, these methods are not suitable for large-scale problems [47], and their effectiveness for NCP is unknown. Kim et al. considered solving sparse NCP using ANLS [23]. Nevertheless, ANLS seriously suffers from rank deficiency caused by high sparsity or zero components in the factor matrices. Recently, Huang et al. proposed an alternating optimization-based ADMM (AO-ADMM) method, which can handle the $l_1$-norm regularization term in NCP [19]; however, there is no experimental evaluation of sparse NCP in [19]. The work [32] proposed a sparse NCP algorithm targeted at multiway co-clustering. In practical applications, sparse NCP faces the following two major challenges.
One challenge is that, when the tensor data are highly sparse or strong sparse regularization is imposed on the decomposition, more and more zero components will appear in the factor matrices. The factor matrices are then not of full column rank, which causes the rank deficiency problem, and rank deficiency in turn causes poor convergence of the tensor decomposition algorithm. The proximal algorithm is an excellent method for improving the convergence of a mathematical optimization method [4]. In an iterative optimization problem, the proximal algorithm is constructed by adding a proximal regularization term to the original model. This proximal term is the squared Frobenius norm of the difference between the current variable and its value in the previous iteration [4]. The proximal algorithm can naturally be incorporated into tensor decomposition [26].
The other challenge is that, for large-scale tensor data, the sparse NCP decomposition might be inefficient. It is reported that the inexact block coordinate descent scheme can accelerate convergence and is very beneficial for large-scale problems [15,40]. Hence, the inexact scheme can be employed in the sparse NCP problem.

Contribution
Firstly, in this paper, we propose a novel sparse NCP method with the $l_1$-norm and the proximal algorithm. The proposed sparse NCP overcomes rank deficiency and guarantees that the decomposition converges to a stationary point. Block coordinate descent (BCD) is one of the main techniques for tensor decomposition, especially the constrained one [24]. In the BCD framework, each factor matrix is updated alternately as a subproblem while the other factor matrices are fixed. With the proximal algorithm, the proximal regularization term makes the subproblems strongly convex [26] and provides a full column rank condition for the sparse NCP.
Secondly, we develop an inexact BCD scheme for the novel sparse NCP model. The inexact scheme speeds up the computation of the sparse NCP, especially in large-scale cases. Specifically, in the inexact BCD scheme, the subproblem of the sparse NCP is iterated multiple times when updating a factor matrix.
Thirdly, in order to prove the viability of the sparse NCP model with the proximal algorithm and the inexact scheme, we employ two efficient optimization algorithms to solve the model: inexact alternating nonnegative quadratic programming and inexact hierarchical alternating least squares. We evaluate the proposed sparse NCP methods on synthetic, real-world, small-scale and large-scale tensor data. By properly selecting and tuning the sparse regularization, the effectiveness and efficiency of the sparse NCP methods in imposing sparsity on factor matrices are demonstrated.

Organization
The rest of this paper is organized as follows. Section 2 introduces some preliminaries. In Sect. 3, we describe the mathematical model of sparse NCP with the proximal algorithm and inexact BCD scheme. Section 4 elucidates the solutions to the sparse NCP model using the optimization methods. Section 5 describes the detailed experiments on synthetic and real-world datasets. Some critical observations are discussed in Sect. 6. Finally, we conclude our paper in Sect. 7.

Preliminaries
In this paper, the operator $\circ$ represents the outer product of vectors, $\odot$ represents the Khatri-Rao product, $\ast$ represents the Hadamard product (the elementwise matrix product), $\langle\cdot,\cdot\rangle$ represents the inner product, $[\![\cdot]\!]$ represents the Kruskal operator, and $[\cdot]_+$ represents the nonnegative projection. $\|\cdot\|_F$ denotes the Frobenius norm, and $\|\cdot\|_1$ denotes the $l_1$-norm. Basics of tensor computation and multilinear algebra can be found in the review papers [25,35].

Nonnegative CP decomposition
Given an Nth-order nonnegative tensor $\mathcal{X} \in \mathbb{R}_+^{I_1\times I_2\times\cdots\times I_N}$ and a positive number R, nonnegative CANDECOMP/PARAFAC (NCP) is to solve the following minimization problem:

$$\min_{A^{(1)},\dots,A^{(N)}} \; \frac{1}{2}\left\|\mathcal{X} - [\![A^{(1)},\dots,A^{(N)}]\!]\right\|_F^2 \quad \text{s.t.}\;\; A^{(n)} \ge 0,\; n = 1,\dots,N, \qquad (1)$$

where $A^{(n)} \in \mathbb{R}^{I_n\times R}$ for $n = 1,\dots,N$ are the estimated factor matrices in different modes, $I_n$ is the size in mode n, and R is the initial number of components. We use $F\left(A^{(1)},\dots,A^{(N)}\right)$ to denote the objective function in (1). The estimated factor matrices in the Kruskal operator can be represented by the sum of R rank-1 tensors in outer product form:

$$[\![A^{(1)},\dots,A^{(N)}]\!] = \sum_{r=1}^{R} a_r^{(1)} \circ a_r^{(2)} \circ \cdots \circ a_r^{(N)}, \qquad (2)$$

where $a_r^{(n)}$ represents the rth column of $A^{(n)}$.
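The Kruskal form and the unfoldings used below can be checked numerically. The following numpy sketch (the helpers `khatri_rao` and `unfold` are ours; note that with numpy's C-order unfolding the Khatri-Rao product runs over the remaining modes in ascending order, whereas the paper's convention may reverse it) verifies the identity $X_{(n)} = A^{(n)}\left(B^{(n)}\right)^T$ for a small random Kruskal tensor:

```python
import numpy as np

def khatri_rao(mats):
    """Column-wise Khatri-Rao product of matrices that share R columns."""
    R = mats[0].shape[1]
    out = mats[0]
    for M in mats[1:]:
        out = np.einsum('ir,jr->ijr', out, M).reshape(-1, R)
    return out

def unfold(T, n):
    """Mode-n unfolding: mode n becomes the rows, remaining modes are flattened."""
    return np.moveaxis(T, n, 0).reshape(T.shape[n], -1)

rng = np.random.default_rng(0)
A = [np.abs(rng.standard_normal((I, 3))) for I in (4, 5, 6)]
# Kruskal tensor: sum of R = 3 rank-1 outer products a_r^(1) o a_r^(2) o a_r^(3).
X = np.einsum('ir,jr,kr->ijk', A[0], A[1], A[2])

n = 1
# B^(n): Khatri-Rao product of the remaining factors (ascending order for C-order unfolding).
B = khatri_rao([A[m] for m in range(3) if m != n])
print(np.allclose(unfold(X, n), A[n] @ B.T))  # the identity X_(n) = A^(n) (B^(n))^T
mttkrp = unfold(X, n) @ B                     # the MTTKRP X_(n) B^(n)
```

The last line computes the MTTKRP that dominates the cost of every update discussed later.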
Let $X_{(n)} \in \mathbb{R}^{I_n \times \prod_{\tilde n \neq n} I_{\tilde n}}$ represent the mode-n unfolding (matricization) of the original tensor $\mathcal{X}$. The mode-n unfolding of the estimated tensor in the Kruskal operator $[\![A^{(1)},\dots,A^{(N)}]\!]$ can be written as $A^{(n)}\left(B^{(n)}\right)^T$, in which

$$B^{(n)} = A^{(N)} \odot \cdots \odot A^{(n+1)} \odot A^{(n-1)} \odot \cdots \odot A^{(1)}.$$

In the BCD framework, the factor $A^{(n)}$ is updated alternately by a subproblem in every iteration, which is equivalent to the following minimization problem:

$$\min_{A^{(n)} \ge 0}\; F\left(A^{(n)}\right) = \frac{1}{2}\left\|X_{(n)} - A^{(n)}\left(B^{(n)}\right)^T\right\|_F^2. \qquad (3)$$

The partial gradient (or partial derivative) of $F\left(A^{(n)}\right)$ with respect to $A^{(n)}$ is

$$\frac{\partial F}{\partial A^{(n)}} = A^{(n)}\left(B^{(n)}\right)^T B^{(n)} - X_{(n)} B^{(n)}. \qquad (4)$$

In (4), the term $X_{(n)} B^{(n)}$ is called the Matricized Tensor Times Khatri-Rao Product (MTTKRP) [35]. The term $\left(B^{(n)}\right)^T B^{(n)}$ can be computed efficiently by

$$\left(B^{(n)}\right)^T B^{(n)} = \mathop{\ast}_{\tilde n \neq n} \left(A^{(\tilde n)}\right)^T A^{(\tilde n)}. \qquad (5)$$

Sparse regularization with $l_1$-norm

In order to impose sparsity on the factor matrices, it is natural to incorporate sparse regularization terms using the $l_1$-norm [9,47] into the objective function in (1), which leads to the following basic sparse NCP problem:

$$\min_{A^{(1)},\dots,A^{(N)}}\; F\left(A^{(1)},\dots,A^{(N)}\right) + \sum_{n=1}^{N}\beta_n\left\|A^{(n)}\right\|_1 \quad \text{s.t.}\;\; A^{(n)} \ge 0,\; n = 1,\dots,N, \qquad (6)$$

where $\beta_n$ are positive sparse regularization parameters in the parameter vector $\beta \in \mathbb{R}^{N\times 1}$. The subproblem can be written as the following optimization problem:

$$\min_{A^{(n)} \ge 0}\; F_0\left(A^{(n)}\right) = \frac{1}{2}\left\|X_{(n)} - A^{(n)}\left(B^{(n)}\right)^T\right\|_F^2 + \beta_n\left\|A^{(n)}\right\|_1. \qquad (7)$$

In the objective function of the subproblem, the sparse regularization is imposed on the factor matrix $A^{(n)}$ by the $l_1$-norm.

The proposed sparse NCP model

Sparse NCP with proximal algorithm
The basic sparse NCP in (6) has a serious drawback. When strong sparse regularization is imposed in (6), many zero columns will appear in the factor matrices $A^{(n)}$. Thus, neither $A^{(n)}$ nor $B^{(n)}$ is guaranteed to be of full column rank. Therefore, the basic sparse NCP model in (6) will suffer from rank deficiency and cannot guarantee convergence. In order to overcome this drawback, we propose the following sparse NCP model with the proximal algorithm (a proximal regularization term using the squared Frobenius norm):

$$\min_{A^{(1)},\dots,A^{(N)}}\; F\left(A^{(1)},\dots,A^{(N)}\right) + \sum_{n=1}^{N}\beta_n\left\|A^{(n)}\right\|_1 + \sum_{n=1}^{N}\frac{\alpha_n}{2}\left\|A^{(n)} - \widetilde A^{(n)}\right\|_F^2 \quad \text{s.t.}\;\; A^{(n)} \ge 0, \qquad (8)$$

where $\widetilde A^{(n)}$ is the value of the factor $A^{(n)}$ in the previous iteration during updating, and $\alpha_n$ are positive regularization parameters in the vector $\alpha \in \mathbb{R}^{N\times 1}$.
In the BCD framework, the subproblem of model (8) can be written as the following minimization problem:

$$\min_{A^{(n)} \ge 0}\; F_{\mathrm{PROX}}\left(A^{(n)}\right) = \frac{1}{2}\left\|X_{(n)} - A^{(n)}\left(B^{(n)}\right)^T\right\|_F^2 + \beta_n\left\|A^{(n)}\right\|_1 + \frac{\alpha_n}{2}\left\|A^{(n)} - \widetilde A^{(n)}\right\|_F^2. \qquad (9)$$

The objective function $F_{\mathrm{PROX}}\left(A^{(n)}\right)$ can be further represented in the following form:

$$F_{\mathrm{PROX}}\left(A^{(n)}\right) = \frac{1}{2}\left\|\begin{bmatrix} X_{(n)}^T \\ \sqrt{\alpha_n}\,\left(\widetilde A^{(n)}\right)^T\end{bmatrix} - \begin{bmatrix} B^{(n)} \\ \sqrt{\alpha_n}\, I_R \end{bmatrix}\left(A^{(n)}\right)^T\right\|_F^2 + \beta_n\left\|A^{(n)}\right\|_1. \qquad (10)$$

In (10), it is clear that the stacked matrix $\begin{bmatrix} B^{(n)} \\ \sqrt{\alpha_n} I_R \end{bmatrix}$ must be of full column rank even though $B^{(n)}$ may not be. Thus, the proposed sparse NCP with the proximal algorithm successfully overcomes the rank deficiency problem in the objective function.
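The full-column-rank claim is easy to check numerically. In this sketch (illustrative sizes and values, not from the paper), $B$ is made deliberately rank-deficient by duplicating a column, yet stacking $\sqrt{\alpha_n}\,I_R$ underneath restores full column rank, because the Gram matrix becomes $B^T B + \alpha_n I_R$ with smallest eigenvalue at least $\alpha_n$:

```python
import numpy as np

R, alpha = 4, 1e-4
rng = np.random.default_rng(1)
B = rng.random((10, R))
B[:, 3] = B[:, 0]  # duplicate a column: B loses full column rank
B_stacked = np.vstack([B, np.sqrt(alpha) * np.eye(R)])

print(np.linalg.matrix_rank(B))          # 3: rank-deficient
print(np.linalg.matrix_rank(B_stacked))  # 4: full column rank restored
```

This is exactly why the proximal term keeps the least-squares part of every subproblem well-posed regardless of how many factor columns are driven to zero.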

Inexact block coordinate descent scheme
BCD is a main framework for solving tensor decomposition, and it is reported that the inexact BCD scheme can accelerate the computation [15,40]. Specifically, the factor matrices $A^{(n)}, n = 1,\dots,N$, are updated alternately in outer iterations; meanwhile, in the subproblem (9), the factor $A^{(n)}$ is also updated several times in inner iterations. The procedures of the inexact scheme are listed in Algorithm 1.
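The outer/inner structure of the inexact scheme can be sketched on a toy two-block problem (a minimal illustration of the scheme, not Algorithm 1 itself): each block is refined by several projected-gradient inner iterations before the algorithm moves to the next block.

```python
import numpy as np

def inexact_bcd_nmf(X, R, outer_iters=50, inner_iters=5, seed=0):
    """Inexact BCD sketch on a toy NMF problem X ~ A @ B: each block
    (A, then B) gets several projected-gradient inner iterations per
    outer iteration, mirroring the inexact scheme."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    A, B = rng.random((m, R)), rng.random((R, n))
    for _ in range(outer_iters):
        for _ in range(inner_iters):        # inner loop for block A
            G = (A @ B - X) @ B.T
            L = np.linalg.norm(B @ B.T, 2)  # Lipschitz constant of the gradient
            A = np.maximum(A - G / L, 0.0)
        for _ in range(inner_iters):        # inner loop for block B
            G = A.T @ (A @ B - X)
            L = np.linalg.norm(A.T @ A, 2)
            B = np.maximum(B - G / L, 0.0)
    return A, B

rng = np.random.default_rng(3)
X = rng.random((20, 3)) @ rng.random((3, 15))  # exactly rank-3 nonnegative data
A, B = inexact_bcd_nmf(X, R=3)
print(np.linalg.norm(X - A @ B) / np.linalg.norm(X))  # small relative error
```

Running a few cheap inner iterations per block amortizes the expensive shared quantities (here `B @ B.T`; in NCP, the MTTKRP and Gram matrices) over several refinement steps.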

Convergence analysis
The proposed sparse NCP method in (8) is guaranteed to converge to a stationary point.

Proof The objective function $F_{\mathrm{PROX}}\left(A^{(n)}\right)$ in (9) with the proximal regularization term is strongly convex [4]. Moreover, $F_{\mathrm{PROX}}\left(A^{(n)}\right)$ is a proximal upper bound [17,33] of the objective function $F_0\left(A^{(n)}\right)$ in (7). Using the inexact block coordinate descent scheme, the subproblem in Algorithm 1 is updated by a finite number of inner iterations. According to Theorem 2 in [49], every limit point of the sequence of factor matrices generated by Algorithm 1 is a stationary point of (8).

Optimization methods for solving sparse NCP
In order to prove the viability and effectiveness of the novel sparse NCP with the proximal algorithm and inexact scheme, we employ the following two optimization methods to solve the model.

Alternating nonnegative quadratic programming
First, we utilize a method based on a general form of alternating nonnegative least squares (ANLS). The classical ANLS is an important tool for NMF and NCP [24]. Many efficient optimization algorithms have been proposed to solve the nonnegative least squares (NNLS) subproblems, such as active-set (AS) [20] and block principal pivoting (BPP) [22]. However, there are two limitations to applying ANLS to sparse NCP. The first is that ANLS is very prone to rank deficiency; the proximal algorithm tackles this limitation in our sparse NCP model. The second is that the subproblem of our proposed sparse NCP model cannot be represented in a least squares form due to the $l_1$-norm regularization, which can be clearly seen in (10). Therefore, a new form of the objective function in (8) should be considered. Inspired by [27], the subproblem of the proposed sparse NCP in (9) can be represented in the nonnegative quadratic programming (NNQP) form:

$$\min_{A^{(n)} \ge 0}\; \frac{1}{2}\operatorname{tr}\left(A^{(n)} Q^{(n)}\left(A^{(n)}\right)^T\right) + \operatorname{tr}\left(\left(P^{(n)}\right)^T A^{(n)}\right), \qquad (11)$$

where $Q^{(n)} = \left(B^{(n)}\right)^T B^{(n)} + \alpha_n I_R$, $P^{(n)} = \beta_n E - X_{(n)} B^{(n)} - \alpha_n \widetilde A^{(n)}$, and E is a matrix of all ones. In fact, NNQP is a general form of NNLS.
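The NNQP rewriting can be verified numerically: for nonnegative $A^{(n)}$ the $l_1$-norm is just the sum of entries, so expanding the squared norms in (9) gives the quadratic form above plus a constant. The sketch below (random illustrative data; variable names are ours) checks that the two objectives agree:

```python
import numpy as np

rng = np.random.default_rng(2)
I_n, R, alpha, beta = 6, 3, 1e-4, 0.5
Xn = rng.random((I_n, 20))  # mode-n unfolding X_(n)
Bn = rng.random((20, R))    # Khatri-Rao factor B^(n)
A  = rng.random((I_n, R))   # current nonnegative factor A^(n)
At = rng.random((I_n, R))   # previous iterate, playing the role of \tilde{A}^(n)

# Proximal subproblem objective (the l1-norm equals the entry sum for A >= 0).
f = (0.5 * np.linalg.norm(Xn - A @ Bn.T)**2
     + 0.5 * alpha * np.linalg.norm(A - At)**2
     + beta * A.sum())

# NNQP form: 0.5*tr(A Q A^T) + tr(P^T A) + const.
Q = Bn.T @ Bn + alpha * np.eye(R)
P = beta * np.ones((I_n, R)) - Xn @ Bn - alpha * At
g = 0.5 * np.trace(A @ Q @ A.T) + np.trace(P.T @ A)
const = 0.5 * np.linalg.norm(Xn)**2 + 0.5 * alpha * np.linalg.norm(At)**2
print(np.allclose(f, g + const))  # True: the two objectives agree
```

Since the constant does not depend on $A^{(n)}$, minimizing the NNQP form is equivalent to minimizing (9).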
The above-mentioned optimization methods for NNLS can also be used to solve the NNQP problem. In this study, we use block principal pivoting (BPP) [22] as the NNQP solver, which has been proven to be very efficient [22,24]. The BPP solver contains multiple inner iterations, and we limit the number of inner iterations in the inexact scheme. We name the method of solving the tensor decomposition using NNQP alternating nonnegative quadratic programming (ANQP), and we abbreviate the method of solving the sparse NCP with the proximal algorithm using ANQP as PROX-ANQP. Algorithm 2 explicates the PROX-ANQP method.

Inexact hierarchical alternating least squares
Second, we employ an inexact hierarchical alternating least squares (iHALS) method for solving the sparse NCP with the proximal algorithm. The conventional HALS is an efficient method that updates each factor column by column [7,9]. However, the HALS method has two major drawbacks for solving the sparse NCP.
First, HALS is also very prone to rank deficiency. Specifically, if a column of the factor matrix $A^{(n)}$ becomes a zero vector, HALS will break down [22]. One practical remedy is to replace the zero elements with a small positive value [9], such as $10^{-16}$. However, with this modification, the obtained factor matrices are no longer sparse.
Second, HALS suffers from the caveat problem (see Section 5.2 in [22]): unbalanced scales appear across different columns and factors. For example, one column in the first factor might have a scale of $10^{-8}$ while the corresponding column in the second factor has a scale of $10^{8}$; at the same time, another column in the first factor might have a scale of $10^{16}$ while the corresponding column in the third factor has a scale of $10^{-16}$. One common method of controlling the unbalanced scales is to normalize all factor columns to unit vectors [9]. However, with factor normalization, the factor columns can never become zero vectors, so it is impossible to impose sparsity efficiently.
The proximal algorithm in our sparse NCP overcomes both drawbacks. As mentioned above, the proximal regularization guarantees full column rank in the model. Moreover, the proximal regularization term in sparse NCP keeps all columns of the factors on a balanced scale.
Next, we introduce the solution of the model in (8) using the iHALS method. For simplicity, we write $a_r$ and $b_r$ for $a^{(n)}_r$ and $b^{(n)}_r$ in this part, which are the rth columns of $A^{(n)}$ and $B^{(n)}$, respectively. We also use $\left[A^{(n)}\right]_{(:,r)} = a_r \in \mathbb{R}^{I_n\times 1}$ to represent a column of a matrix and $\left[A^{(n)}\right]_{(i,r)} = a^{(n)}_{ir}$ to represent an element of a matrix. The objective function in (9) can be further represented as

$$F_{\mathrm{PROX}} = \frac{1}{2}\left\|X_{(n)} - \sum_{r=1}^{R} a_r b_r^T\right\|_F^2 + \frac{\alpha_n}{2}\sum_{r=1}^{R}\left\|a_r - \widetilde a_r\right\|_2^2 + \beta_n\sum_{r=1}^{R}\left\|a_r\right\|_1, \qquad (12)$$

where $\widetilde a_r$ is the rth column of $\widetilde A^{(n)}$. The minimization problem for (12) can be solved iteratively by the columnwise subproblems

$$\min_{a_r \ge 0}\; F_r(a_r) = \frac{1}{2}\left\|Z_r - a_r b_r^T\right\|_F^2 + \frac{\alpha_n}{2}\left\|a_r - \widetilde a_r\right\|_2^2 + \beta_n\left\|a_r\right\|_1, \qquad (13)$$

where

$$Z_r = X_{(n)} - \sum_{j\neq r} a_j b_j^T. \qquad (14)$$

The partial derivative of $F_r$ with respect to $a_r$ is

$$\frac{\partial F_r}{\partial a_r} = a_r b_r^T b_r - Z_r b_r + \alpha_n a_r - \alpha_n \widetilde a_r + \beta_n \mathbf{1}, \qquad (15)$$

where $\mathbf{1} \in \mathbb{R}^{I_n\times 1}$ is a vector with all elements equal to 1. Setting $\frac{\partial F_r}{\partial a_r} = 0$, the nonnegative column vector $a_r$ can be updated as

$$a_r = \left[\frac{Z_r b_r + \alpha_n \widetilde a_r - \beta_n \mathbf{1}}{b_r^T b_r + \alpha_n}\right]_+, \qquad (16)$$

which is a closed-form solution of (13) according to Theorem 2 in [24]. A fast HALS method was utilized to solve the large-scale problem [7,24], and we use the same idea for the sparse NCP problem. The term $Z_r b_r$ in (16) can also be represented as

$$Z_r b_r = \left[X_{(n)} B^{(n)}\right]_{(:,r)} - A^{(n)}\left[\left(B^{(n)}\right)^T B^{(n)}\right]_{(:,r)} + a_r\left[\left(B^{(n)}\right)^T B^{(n)}\right]_{(r,r)}. \qquad (17)$$

Replacing $Z_r b_r$ in (16) by (17), we obtain the new update rule for $a_r$:

$$a_r = \left[\frac{\left[X_{(n)} B^{(n)}\right]_{(:,r)} - A^{(n)}\left[\left(B^{(n)}\right)^T B^{(n)}\right]_{(:,r)} + a_r\left[\left(B^{(n)}\right)^T B^{(n)}\right]_{(r,r)} + \alpha_n \widetilde a_r - \beta_n \mathbf{1}}{\left[\left(B^{(n)}\right)^T B^{(n)}\right]_{(r,r)} + \alpha_n}\right]_+. \qquad (18)$$
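A columnwise update of the form (16)-(18) can be sketched in numpy as follows (a minimal illustration with names of our own choosing; for brevity the toy check uses a matrix problem $X \approx A B^T$, whereas an Nth-order tensor would supply the inputs via the MTTKRP and the Hadamard product of Gram matrices):

```python
import numpy as np

def ihals_update_factor(A, XB, BtB, alpha, beta, A_tilde, inner_iters=5):
    """One inexact proximal-HALS pass: update each column a_r of A in place
    with a rule of the form (18), given the precomputed MTTKRP
    (XB = X_(n) B^(n)) and Gram matrix (BtB = (B^(n))^T B^(n))."""
    R = A.shape[1]
    for _ in range(inner_iters):
        for r in range(R):
            numer = (XB[:, r] - A @ BtB[:, r] + A[:, r] * BtB[r, r]
                     + alpha * A_tilde[:, r] - beta)
            A[:, r] = np.maximum(numer / (BtB[r, r] + alpha), 0.0)
    return A

rng = np.random.default_rng(4)
X = rng.random((30, 3)) @ rng.random((40, 3)).T  # exactly rank-3, nonnegative
A, B = rng.random((30, 3)), rng.random((40, 3))
for _ in range(50):
    # A.copy() / B.copy() act as the previous iterates (the proximal anchors).
    A = ihals_update_factor(A, X @ B, B.T @ B, 1e-4, 0.0, A.copy())
    B = ihals_update_factor(B, X.T @ A, A.T @ A, 1e-4, 0.0, B.copy())
print(np.linalg.norm(X - A @ B.T) / np.linalg.norm(X))  # small relative error
```

Setting `beta` above zero subtracts a constant from every numerator before projection, which is how the $l_1$ term drives small column entries exactly to zero.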
We implement the above procedures using the inexact scheme. We use PROX-iHALS to denote the inexact hierarchical alternating least squares method for solving the sparse NCP with the proximal algorithm. The PROX-iHALS is illustrated in Algorithm 3.

Stopping condition for outer loop
We terminate the outer loop according to the change of the relative error during iteration; the relative error reflects the data fitting. In the kth outer iteration, the relative error [48] of the tensor decomposition is defined by

$$\mathrm{RelErr}_k = \frac{\left\|\mathcal{X} - [\![A^{(1)}_k,\dots,A^{(N)}_k]\!]\right\|_F}{\left\|\mathcal{X}\right\|_F}. \qquad (19)$$

Based on the relative error, we terminate the outer loop using the following stopping condition:

$$\left|\mathrm{RelErr}_{k-1} - \mathrm{RelErr}_k\right| < \epsilon. \qquad (20)$$

The threshold $\epsilon$ can be set to a very small positive value, such as $10^{-8}$.
In addition, we also set a maximum running time for the outer loop.
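The outer stopping rule can be sketched as follows (a minimal illustration; the time limit `max_time` is an example value, since the paper sets it per experiment):

```python
import numpy as np

def relative_error(X, X_hat):
    """Relative error of the current reconstruction (the data-fitting measure)."""
    return np.linalg.norm(X - X_hat) / np.linalg.norm(X)

def outer_should_stop(err_prev, err_curr, eps=1e-8, elapsed=0.0, max_time=180.0):
    """Stop when the change of relative error between consecutive outer
    iterations falls below eps (as in (20)), or when the running-time
    budget is exhausted."""
    return abs(err_prev - err_curr) < eps or elapsed > max_time

print(outer_should_stop(0.1234567, 0.1234567))      # converged fit
print(outer_should_stop(0.20, 0.12, elapsed=10.0))  # still improving
```

Monitoring the *change* of the relative error, rather than its absolute value, lets the loop terminate even when the regularized model cannot fit the data exactly.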

Stopping condition for inner loop
In the lth inner iteration, we define the relative residual of the nth factor matrix $A^{(n)}$ as

$$r^{(n)}_l = \frac{\left\|A^{(n)}_l - A^{(n)}_{l-1}\right\|_F}{\left\|A^{(n)}_{l-1}\right\|_F}. \qquad (21)$$

For PROX-iHALS, we terminate the inner loop by the stopping condition $r^{(n)}_l < \delta^{(n)}$, where $\delta^{(n)}$ is a dynamic positive threshold: if there is only one iteration in the inner loop, we update $\delta^{(n)}$ by $\delta^{(n)} = \delta^{(n)}/10$. We set the initial value $\delta^{(n)} = 0.01$. For PROX-ANQP, the inner loop is terminated according to the columns in the feasible region of the BPP algorithm [22].
Since we employ the inexact BCD framework, we also set a maximum number of inner iterations (MAX_INNER_ITER) to terminate the inner loop.
We summarize the stopping conditions for both the outer and inner loops in Algorithm 4.
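The inner-loop rule with its dynamic threshold can be sketched as follows (a minimal illustration with our own variable names):

```python
import numpy as np

def inner_should_stop(A_curr, A_prev, delta):
    """Inner-loop rule: stop when the factor's relative change r_l^(n)
    drops below the dynamic threshold delta^(n)."""
    r = np.linalg.norm(A_curr - A_prev) / np.linalg.norm(A_prev)
    return r < delta

# Dynamic threshold: if the inner loop finished after a single iteration,
# tighten the threshold (delta <- delta / 10), starting from 0.01.
delta = 0.01
A_prev = np.ones((4, 3))
A_curr = A_prev + 1e-5            # a tiny change: the first check already passes
stopped_immediately = inner_should_stop(A_curr, A_prev, delta)
if stopped_immediately:
    delta /= 10
print(delta)
```

Tightening the threshold whenever the inner loop exits immediately forces later outer iterations to refine each factor more accurately as the algorithm approaches convergence.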

Remarks on convergence
The PROX-ANQP and PROX-iHALS methods in the inexact BCD framework have outstanding convergence properties. In Sect. 3.3, we mentioned that the proposed sparse NCP using the proximal algorithm and the inexact BCD scheme is guaranteed to converge to a stationary point. The subproblem with the proximal algorithm in (9) is strongly convex, which yields a unique minimum [4]. Furthermore, the optimization methods ANQP and iHALS stably decrease the subproblem objective. According to Proposition 3.7.1 in [4], both PROX-ANQP and PROX-iHALS converge to stationary points.

Experiments and results
We carried out experiments on synthetic, real-world, dense, sparse, small-scale, and large-scale tensors. We compared the proposed PROX-ANQP and PROX-iHALS methods with the three sparse NCP methods listed below.
• AO-ADMM: the sparse NCP method using the AO-ADMM algorithm [19], which includes multiple inner iterations. The $l_1$-norm is handled by a proximal operator.
• iAPG: we extend the APG method in sparse Tucker decomposition [47] to the sparse NCP problem in (6). In order to make a fair comparison, we implement APG in the inexact scheme using multiple inner iterations, which is abbreviated as iAPG [42]. The $l_1$-norm is handled by a proximal operator.
• iMU: the sparse NCP method using the classical MU algorithm [9]. We implement MU in the inexact scheme using multiple inner iterations, which is abbreviated as iMU.
All three methods can be directly applied to solve the sparse NCP in (6). The $l_1$-norm is handled by the proximal operator in AO-ADMM and iAPG; owing to the proximal operator, AO-ADMM and iAPG do not suffer from rank deficiency. Owing to its multiplicative updating rule, iMU does not suffer from rank deficiency either.
In Table 1, we summarize the computational complexity of all the sparse NCP methods. Only the multiplicative operations were counted for mode n in one outer iteration. The main time cost of these algorithms is spent on the calculation of the MTTKRP $X_{(n)} B^{(n)}$, which consists of two parts: the Khatri-Rao product $B^{(n)}$ and the matrix product of $X_{(n)}$ and $B^{(n)}$. The computational complexity of $B^{(n)}$ reaches $R\prod_{\tilde n\neq n} I_{\tilde n}$, and that of $X_{(n)} B^{(n)}$ reaches $R\prod_{n=1}^{N} I_n$. The term $\left(B^{(n)}\right)^T B^{(n)}$ can be calculated efficiently by (5), whose complexity is $R^2\sum_{\tilde n\neq n} I_{\tilde n}$. For the inner loop of the subproblem, K is assumed to be the average number of inner iterations. From Table 1, we can see that the complexities of these algorithms are highly comparable to each other; it can therefore be inferred that the convergence time is mainly related to the number of iterations.
Many experimental parameters and settings affect the performance of a sparse NCP method. Since our purpose in the experiments is only to test the ability to impose sparsity, we fixed the following settings for all methods.
• Initialization. For PROX-ANQP, PROX-iHALS, AO-ADMM and iAPG, all factor matrices were initialized with nonnegative random numbers by the MATLAB function max(0,randn(I_n,R)). Only iMU was initialized by max(0,randn(I_n,R))+0.1. All initialized factors were then rescaled.
• The factor updating order was fixed as 1, 2, ..., N.
• The maximum number of inner iterations MAX_INNER_ITER was fixed at 5, according to the default setting in AO-ADMM [19].
• For the PROX-ANQP and PROX-iHALS methods, the proximal regularization parameter $\alpha_n$ was fixed at $10^{-4}$ [45].
The $l_1$-norm regularization parameters $\beta_n, n = 1,\dots,N$, in sparse NCP are the key elements for imposing sparsity and are the most crucial testing parameters in the experiments. We selected a sequence of $\beta_n$ values in ascending order for each tensor by manual testing. For synthetic tensors, we stop increasing $\beta_n$ when the true sparse components are recovered; for real-world tensors, we stop increasing $\beta_n$ when the number of nonzero components is reduced to less than half of the initial number. In order to make it convenient to select and test the parameters, we kept $\beta_n, n = 1,\dots,N$, the same in all modes of the tensor. After choosing $\beta_n$, we calculated and evaluated the sparsity level [44] of the factor matrices by

$$\mathrm{Spars}\left(A^{(n)}\right) = \frac{\#\left\{a^{(n)}_{ir} < T_s\right\}}{I_n R}, \qquad (22)$$

where $T_s$ is a small positive number and $\#\{\cdot\}$ denotes the number of elements in the factor matrix $A^{(n)}$ that are smaller than the threshold $T_s$.
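The sparsity measure is a one-liner in practice; this sketch follows (22), counting the fraction of entries below the threshold:

```python
import numpy as np

def sparsity_level(A, Ts=1e-6):
    """Fraction of entries of the factor matrix below the threshold Ts,
    following the sparsity measure of (22)."""
    return np.count_nonzero(np.abs(A) < Ts) / A.size

A = np.array([[0.0, 0.8],
              [1e-9, 0.3]])
print(sparsity_level(A))  # 0.5: two of the four entries are (near-)zero
```

A threshold is needed because iterative solvers may leave tiny nonzero residues in entries that are effectively zero.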
In the synthetic tensor experiments, we used prior sparse matrices to construct the data, so after decomposition the accuracy of the recovered sparse signals can be evaluated. Let $S^{(n)} = [s_1,\dots,s_R] \in \mathbb{R}^{L\times R}$ denote the mode-n prior sparse matrix, where R is the true number of components and L is the length of a component. Let $T^{(n)} = [t_1,\dots,t_{\hat R}] \in \mathbb{R}^{L\times \hat R}$ represent the mode-n estimated sparse matrix, where $\hat R$ is the estimated number of nonzero components. We evaluate the accuracy of the estimated matrix $T^{(n)}$ against the original sparse signals $S^{(n)}$ by the peak signal-to-noise ratio (PSNR, see Chapter 3 in [9]):

$$\mathrm{PSNR}_r = 10\log_{10}\frac{L}{\left\|\hat s_c - \hat t_r\right\|_2^2}, \qquad (23)$$

where $\hat t_r$ is the rth normalized estimated sparse signal and $\hat s_c$ is the normalized reference sparse signal; $\hat s_c$ is the column of $S^{(n)}$ that has the highest correlation coefficient with $\hat t_r$.
All the experiments were conducted on the computer with Intel Core i5-4590 3.30 GHz CPU, 8 GB memory, 64-bit Windows 10 and MATLAB R2016b. The fundamental tensor computation was based on Tensor Toolbox 2.6 [3]. The codes are available on the author's website http://deqing.me/.

Synthetic tensor data
Size 1000 × 100 × 100 × 5 with one sparse factor

In this experiment, we constructed a synthetic fourth-order tensor from 10 channels of simulated sparse and nonnegative signals, as shown in Fig. 1a. The signals come from the file VSparse_rand_10.mat in NMFLAB [8]. There are 1000 points in each channel, so the sparse signal matrix is $S^{(1)} = [s_1,\dots,s_{10}] \in \mathbb{R}^{1000\times 10}$. Three uniformly distributed random matrices $A^{(2)}, A^{(3)} \in \mathbb{R}^{100\times 10}$ and $A^{(4)} \in \mathbb{R}^{5\times 10}$ were employed as mixing matrices, generated by the rand function in MATLAB. Afterwards, we synthesized a fourth-order tensor by $\mathcal{X}_{\mathrm{SYN1}} = [\![S^{(1)}, A^{(2)}, A^{(3)}, A^{(4)}]\!]$. Next, nonnegative noise with an SNR of 40 dB, generated by the code max(0,randn(size(X))), was added to the tensor.

[Fig. 1: Sparse and nonnegative signals used in the synthetic tensor. (a) The original ten channels of signals. (b) The estimated ten channels of signals from the synthetic tensor $\mathcal{X}_{\mathrm{SYN1}}$ by sparse NCP based on the PROX-ANQP method with $\beta_n = 5$; the PSNR is 90.2698 according to (23).]

For all sparse NCP methods, we set $\epsilon = 10^{-8}$ as the threshold of the outer stopping condition in (20) and $T_s = 10^{-3}$ in (22). The maximum running time was set to 180 seconds. We selected 20 as the number of components for the tensor decomposition, because we intend to recover the ten channels of true signals just by imposing sparse regularization during decomposition, even though the exact optimal number of components is unknown. (Since ten channels of signals are mixed in the tensor, it would be natural to select 10 as the optimal component number; the number of components might also be estimated by classical methods, such as DIFFIT [38]. However, we selected 20 to test the performance of the sparse regularization.) We selected the values of $\beta_n = 0, 0.1, 0.5, 1, 2, 3, 4, 5$ for all the optimization methods to evaluate their abilities to impose sparsity; the selection of sparse regularization parameters depends on the tensor data. After tensor decomposition, the values of the objective function (Obj), relative error (RelErr), running time in seconds (in wall-clock time),
iteration number (Iter), the number of nonzero components (NNC), sparsity level (Spars) and the PSNR of the estimated signal factor matrix were recorded as the performance evaluation criteria. For each optimization method and each $\beta_n$, the sparse NCP was run 30 times, and the average values of all criteria were computed. The results are shown in Table 2, in which the outstanding performances of the sparse NCP algorithms are emphasized in bold. From Table 2, it can be found that all methods are able to impose sparsity with a proper sparse regularization parameter $\beta_n$. When $\beta_n$ increases, the sparsity level of the mode-1 factor matrix also increases. After properly tuning $\beta_n$, weak components are removed (set to 0), weak elements in strong components are suppressed, and the true ten channels of sparse signals are recovered.
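The tensor construction described above can be sketched in numpy at a reduced size (the paper uses 1000 × 100 × 100 × 5 with the NMFLAB signals; here the sparse signals and the explicit 40 dB noise scaling are our own stand-ins):

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for the sparse signal matrix (the paper uses VSparse_rand_10.mat):
S1 = np.maximum(rng.standard_normal((50, 10)) - 1.0, 0.0)  # sparse, nonnegative
A2, A3 = rng.random((10, 10)), rng.random((10, 10))        # mixing matrices
A4 = rng.random((5, 10))

# Kruskal tensor: sum of R = 10 rank-1 outer products.
X = np.einsum('ir,jr,kr,lr->ijkl', S1, A2, A3, A4)

# Nonnegative noise scaled to an SNR of 40 dB.
noise = np.maximum(rng.standard_normal(X.shape), 0.0)
noise *= np.linalg.norm(X) / np.linalg.norm(noise) * 10 ** (-40 / 20)
X_noisy = X + noise
snr = 20 * np.log10(np.linalg.norm(X) / np.linalg.norm(noise))
print(X_noisy.shape, round(snr, 1))
```

Because both the Kruskal tensor and the noise are nonnegative, the resulting data tensor satisfies the nonnegativity assumption of the NCP model by construction.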
When $\beta_n$ increases to a proper value, the PSNR increases significantly. In this experiment, the higher the PSNR, the better an algorithm recovers the original sparse components. In Table 2, it is clear that PROX-ANQP, PROX-iHALS and iAPG achieve higher PSNR with larger sparse regularization parameters, for example $\beta_n = 4, 5$, which means that these three methods recover the ten channels of sparse signals more precisely. One of the recovered sparse signal matrices from $\mathcal{X}_{\mathrm{SYN1}}$ by PROX-ANQP is shown in Fig. 1b.
For the synthetic data, the objective function values and relative errors are very similar for the same $\beta_n$ value. The convergence speed can be inferred from Table 2: iMU is slow compared with the other methods; AO-ADMM is slow with $\beta_n = 0$ but becomes fast with $\beta_n > 0$; and PROX-ANQP, PROX-iHALS and iAPG all perform very well. It can also be concluded from Table 2 that the running time is highly related to the number of outer iterations.

Size 500 × 500 × 500 with two sparse factors
For this third-order tensor, the factor matrices were generated using the following codes.
We set the outer stopping condition with $\epsilon = 10^{-6}$ and the maximum running time to 600 seconds. The initial number of components was set to 200. The average performances of all sparse NCP methods after 30 runs were computed; we only show the running time in seconds, iteration number (Iter), number of nonzero components (NNC) and sparsity level (Spars) of all estimated factors in Table 3. Table 3 shows that all methods are able to impose sparsity on all factor matrices. The PROX-ANQP, PROX-iHALS, and iAPG methods perform very well in extracting the true 100 sparse components. Interestingly, the sparsity levels of the factor matrices extracted by PROX-ANQP, PROX-iHALS, and iAPG are also very close to the ground-truth sparsity levels of the simulated factors.

Third-order dense tensor data
In this experiment, we used a real-world third-order ongoing EEG tensor. The original data were collected from fourteen right-handed, healthy adult subjects in a music listening experiment. The music stimulus was a piece of 8.5 minutes of the modern tango Adiós Nonino by the composer Astor Piazzolla. The Short-Time Fourier Transform (STFT) with a window size of 3 seconds and a hop size of 1 second was used to transform the data of each subject into a third-order tensor. Details of data collection and preprocessing can be found in [43,44]. We only employ the tensor data of one subject in this experiment. The size of this tensor is channel × frequency × time = 64 × 146 × 510, in which the 64 channel points represent 64 electrodes on the scalp, the 146 frequency points represent the spectrum in 1-30 Hz, and the 510 time points represent the duration of a stimulus of about 8.5 minutes. Since the spectra from the EEG tensor are usually sparse, we wish to recover the sparse spectral components by sparse regularization.

(Footnote to the previous experiment: since we use double the number of true sparse components as the initial number of the tensor decomposition, the ground-truth sparsity of a factor matrix is computed as (x% + 1)/2, where x% is the percentage of zeros in a simulated matrix.)

[Fig. 2: Selected groups of components from the ongoing EEG tensor using the PROX-ANQP algorithm. All groups reveal the same brain activity. In the decomposed EEG data, the spatial component is the topography, the spectral component is the spectrum, and the temporal component is the energy evolution series. The components in (a) were extracted with $\beta_n = 0$, (b) with $\beta_n = 5\times 10^5$, and (c) with $\beta_n = 10\times 10^5$.]

We set $\epsilon = 10^{-8}$ in (20) and $T_s = 10^{-6}$ in (22). The maximum running time was set to 120 seconds. The initial number of components was set to 40, according to previous studies [11,44]. We tested the values of $\beta_n = 0, 1\times 10^5, 5\times 10^5, 10\times 10^5, 15\times 10^5, 20\times 10^5$ for all methods. All methods were run 30 times, and the averages of the performance criteria are recorded in Table 4. The results show that all methods are effective in imposing sparsity with $\beta_n$. The iMU is slower than the other four methods.
We selected three groups of extracted components obtained by the PROX-ANQP method with β_n = 0, 5e5 and 10e5, respectively, as shown in Fig. 2. These three groups reveal the same brain activity. It is easy to see that the spectra become progressively sparser as the sparse regularization parameter increases. With β_n = 5e5 and 10e5, more and more unnecessary elements are removed from the spectra, and only the most prominent frequency band is retained. Figure 2 demonstrates that our methods effectively extract meaningful sparse components related to specific brain activities.
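The mechanism by which a larger β_n removes small spectral elements is the proximal operator of the l1-norm restricted to the nonnegative orthant, i.e., nonnegative soft-thresholding. The sketch below is a generic illustration of that operator, not the paper's solver; the function name is ours.

```python
import numpy as np

def nonneg_soft_threshold(x, beta):
    """Proximal operator of beta*||.||_1 on the nonnegative orthant:
    shrink every entry by beta, then clip at zero."""
    return np.maximum(0.0, x - beta)

x = np.array([-0.5, 0.2, 1.5])
print(nonneg_soft_threshold(x, 0.3))   # larger beta -> more zeros
```

Entries smaller than β are set exactly to zero, which is why increasing β_n keeps only the most prominent frequency band in the spectra.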

Third-order large-scale sparse tensor data
In this experiment, we tested the sparse NCP algorithms on a third-order large-scale sparse social network tensor. The data contain Facebook wall-post information among 46,952 users over 1,952 days [41]. The size of this sparse tensor X_Sp1 is 46,952 × 46,951 × 1,952, and the number of nonzero elements is 737,928.
We set the outer stopping condition ε = 1e−6 in (20) and the sparsity threshold T_s = 1e−6 in (22). The maximum running time was set to 1200 seconds, and the initial number of components was set to 100. We tested the values β_n = 0, 0.01, 0.05, 0.1 for all methods. The average values of the performance criteria after 30 runs are recorded in Table 5. The objective function values of all methods within the first 600 seconds are shown in Fig. 3.
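A tensor of this size cannot be stored densely; it is typically kept in coordinate (COO) form, i.e., a list of index triplets with their values. The sketch below builds a toy COO representation with random triplets (the indices and values are placeholders, not the actual Facebook wall-post data) and computes the tensor's density.

```python
import numpy as np

rng = np.random.default_rng(0)
nnz = 1000                                 # toy nonzero count for illustration
shape = (46952, 46951, 1952)               # size of X_Sp1 in the paper

# COO storage: one row of indices per mode, plus the nonzero values
coords = np.stack([rng.integers(0, s, nnz) for s in shape])  # shape (3, nnz)
values = rng.random(nnz)

# density = nonzeros / total entries (float64 avoids integer overflow)
density = nnz / np.prod(shape, dtype=np.float64)
print(f"density = {density:.2e}")
```

Even with the paper's 737,928 true nonzeros, the density is on the order of 10^-7, which is why sparse storage and sparse matricized-tensor products are essential here.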

Fourth-order large-scale sparse tensor data
In this experiment, we evaluated the sparse NCP algorithms on a fourth-order large-scale sparse tensor of NIPS publications [16]. The size of this sparse tensor X_Sp2 is 2482 × 2862 × 14,036 × 17, whose modes represent 2482 papers, 2862 authors, 14,036 words and 17 years. There are 3,101,609 nonzero elements. The maximum running time was set to 1800 seconds; the other settings are the same as in the previous third-order case. We tested the values β_n = 0, 0.1, 0.3, 0.5 for all methods. The average values of the performance criteria after 30 runs are recorded in Table 6. Figure 4 illustrates the objective function values of all methods within the first 1200 seconds.
The experimental results on both the third-order and fourth-order large-scale sparse tensors demonstrate that all our algorithms effectively impose sparsity on the factor matrices. From Tables 5 and 6, it is clear that, as the sparse regularization parameter β_n increases, the number of nonzero components (NNC) decreases gradually. The factor matrices extracted from the sparse tensors are extremely sparse even when no additional sparse regularization is imposed (β_n = 0). Nevertheless, our algorithms with β_n > 0 can further increase the sparsity level of the factor matrices.
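The two criteria used in the tables can be computed directly from an estimated factor matrix: the sparsity level is the fraction of entries below the threshold T_s, and NNC counts the columns (components) that contain at least one nonzero entry. A minimal sketch, with an illustrative function name of our own:

```python
import numpy as np

def sparsity_and_nnc(A, ts=1e-6):
    """Sparsity = fraction of entries with magnitude below T_s;
    NNC = number of columns with at least one entry above T_s."""
    zero = np.abs(A) < ts
    sparsity = zero.mean()
    nnc = int((~zero).any(axis=0).sum())
    return sparsity, nnc

A = np.array([[0.0, 0.9, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.4, 0.0]])
print(sparsity_and_nnc(A))   # (0.777..., 1): 7 of 9 zeros, one nonzero column
```

Columns counted out of NNC are entirely (near-)zero and can be pruned, which is how sparse regularization effectively reduces the model order.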
From the perspective of convergence speed, both the PROX-ANQP and PROX-iHALS methods run fast on the large-scale sparse tensors for all tested β_n values, as can be seen from Tables 5 and 6 and Figs. 3 and 4. In contrast, AO-ADMM and iAPG are slow on the large-scale sparse tensors. iMU converges fast but reaches higher objective function values for large β_n (e.g., β_n = 0.05, 0.1 in the third-order case and β_n = 0.3, 0.5 in the fourth-order case). The reason is that, for the same β_n value, iMU yields fewer nonzero components than the other methods.

Discussion
We have proposed a novel sparse NCP model using the proximal algorithm and an inexact scheme. The model can be efficiently solved by two algorithms, PROX-ANQP and PROX-iHALS. To test the performance of the algorithms, we conducted experiments covering different cases: synthetic and real-world tensors, third-order and fourth-order tensors, dense and sparse tensors, and small-scale and large-scale tensors. Three state-of-the-art sparse NCP methods, AO-ADMM, iAPG and iMU, were also tested for comparison. We have the following findings: (1) PROX-ANQP and PROX-iHALS exhibit the fastest convergence and excellent ability to impose sparsity in all cases compared with the other methods. Their outstanding performance is due to two points: the proximal algorithm, which overcomes rank deficiency, and the inexact scheme, which increases efficiency. (2) The iAPG method contains a proximal operator that can handle the l1-norm, so iAPG does not suffer from rank deficiency. In the experiments, we found that iAPG is very efficient at imposing sparsity for small-scale dense tensor decomposition; nevertheless, it runs slowly for large-scale sparse tensor decomposition. (3) The iMU method converges very slowly for dense tensors but becomes fast for large-scale sparse tensors. For sparse tensor decomposition, the extracted factor matrices are already extremely sparse, with most elements equal to zero. Under the multiplicative updating rule, once an element becomes zero, it never changes again. This property might be the reason why iMU converges fast on sparse tensors. (4) AO-ADMM also contains a proximal operator that can handle the l1-norm and overcome rank deficiency. However, AO-ADMM converges slowly compared with PROX-ANQP and PROX-iHALS in most cases, particularly for large-scale sparse tensors.
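The zero-locking property noted in finding (3) can be seen in a toy multiplicative update. The sketch below uses a standard NMF-style rule as a stand-in (it is not the paper's iMU implementation): because each entry of H is multiplied by a nonnegative ratio, an entry that is exactly zero stays zero forever.

```python
import numpy as np

# Toy multiplicative update for min ||V - W H||_F^2 over H >= 0:
#   H <- H * (W^T V) / (W^T W H)
# A small constant in the denominator avoids division by zero.
rng = np.random.default_rng(1)
V = rng.random((5, 4))
W = rng.random((5, 3))
H = rng.random((3, 4))
H[0, 0] = 0.0                       # plant an exact zero

for _ in range(10):
    H *= (W.T @ V) / (W.T @ W @ H + 1e-12)

print(H[0, 0])                      # remains exactly 0.0
```

On an already-sparse factor most entries never need updating, which is consistent with iMU's fast convergence on sparse tensors and with its tendency to end up with fewer nonzero components.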
Moreover, AO-ADMM is inferior to iAPG with β_n = 0 in many cases. In short, our proposed PROX-ANQP and PROX-iHALS methods have the best performance for sparse NCP and generalize very well across the different types and scales of datasets. Besides the solving methods, another critical issue of sparse NCP is the selection of the sparse regularization parameter β_n. Firstly, one purpose of this paper is to demonstrate the effectiveness of the algorithms at imposing sparsity; therefore, to simplify parameter selection, we keep β_n the same for all factor matrices in sparse NCP. Even with the same β_n on all modes, sparse NCP can still recover highly sparse components and lowly sparse (or even dense) components in different factor matrices. In the future, it would be interesting to investigate how to separately control the sparsity levels of different factor matrices using unbalanced sparse regularization parameters. Secondly, the appropriate value of β_n depends on the tensor to be decomposed. In this study, we selected β_n separately for each tensor in the experiments. When the sparse regularization parameter is larger, the extracted factor matrices are sparser, but the relative error of the decomposition is also larger. (In the tables, Spars_n denotes the sparsity level of the mode-n estimated factor, where Spars_n ≈ 1 means the factor is very close to a zero matrix, and NNC denotes the number of nonzero components.) The trade-off between the sparsity level and the relative error depends on the requirements of the real application. An example of sparse regularization parameter selection for sparse NCP on ongoing EEG can be found in [44]. It is also possible to select an appropriate parameter for a concrete application using model-order selection methods [37], such as the Bayesian information criterion (BIC). The third critical issue of sparse NCP is the sparse regularization item. In this study, we only investigated the l1-norm item.
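The sparsity/relative-error trade-off that governs the choice of β_n can be illustrated on a one-dimensional toy problem whose solution is known in closed form (this is a didactic sketch, not the paper's selection procedure): for min 0.5*||x − y||² + β*||x||₁ with x ≥ 0, the solution is max(0, y − β), so a larger β yields more zeros but a larger residual.

```python
import numpy as np

y = np.array([0.05, 0.2, 0.8, 1.5])   # toy "data" vector
sparsities, errors = [], []
for beta in [0.0, 0.1, 0.5, 1.0]:
    x = np.maximum(0.0, y - beta)      # closed-form nonneg l1 solution
    sparsities.append(float(np.mean(x == 0)))
    errors.append(float(np.linalg.norm(x - y) / np.linalg.norm(y)))
print(sparsities)   # nondecreasing in beta
print(errors)       # nondecreasing in beta
```

Sweeping β and recording both curves gives a simple Pareto-style picture from which a value matching the application's tolerance for reconstruction error can be picked.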
In the future, it is worth trying to incorporate other types of sparse regularization items [2] into our sparse NCP model besides the l1-norm, such as the lq-norm (0 < q < 1) [36] and the trace norm [29].

Conclusion
In this paper, we have investigated nonnegative CANDECOMP/PARAFAC tensor decomposition with l1-norm-based sparse regularization (sparse NCP). We have proposed a novel sparse NCP model using the proximal algorithm, which guarantees the full column rank condition and convergence to a stationary point. In addition, an inexact block coordinate descent scheme was presented to accelerate the computation of sparse NCP; in this inexact scheme, the subproblems are updated using multiple inner iterations. We have employed two algorithms for solving the proposed sparse NCP model with the proximal algorithm: inexact alternating nonnegative quadratic programming (PROX-ANQP) and inexact hierarchical alternating least squares (PROX-iHALS). The experimental results on synthetic, real-world, small-scale and large-scale tensors demonstrated that our sparse NCP methods can successfully impose sparsity and extract meaningful sparse components. Both PROX-ANQP and PROX-iHALS exhibited faster computational speed and better sparsity-imposing performance compared with the other sparse NCP algorithms. The experimental results proved that the proposed sparse NCP with the proximal algorithm and inexact scheme is effective and efficient.

Funding Open access funding provided by University of Jyväskylä (JYU).

Declarations
Conflict of interest The authors declare that they have no conflict of interest.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.