1 Introduction

Analysis of tagged data, where each data item is labeled with tags from a finite but very large set, is emerging as a problem of immense importance in various industries. A common and familiar example is the tagging of data items with the users associated with them. An illustration of the significance of such analysis is the Netflix competition, where the goal is to predict viewers' ratings for movies given their previous ratings, without any additional information about the viewers or the movies. In addition to the explicit consumer reviews that a company has at its disposal, user comments posted on social media sites are turning out to be valuable resources for analyzing consumer preferences.

It is common in enterprise scenarios to associate tags (such as product names, business units, strategies, relevant industries, etc.) with documents and web and wiki pages, which are used as aids for enterprise search and various other analytics. Another similar example is hashtags assigned by users on Twitter.

For such dyadic data, we assume three central entities, which we will generically refer to as users (e.g., consumers, viewers, researchers), documents (e.g., movie scripts, product literature, research papers) and a vocabulary of words. One of the dyadic associations is between documents and words; the other is the tagging of documents with users.

Although we focus on documents and words, the formulations and algorithms that we propose are equally applicable for discrete-valued dyadic associations between other kinds of entities, such as products and their features, movies and users etc., which we demonstrate experimentally.

Given such dyadic data, the task of interest is topical analysis over words, documents and users with an unobserved dimension, generically called topics. For example, genres correspond to topics in movies, while research and technology areas are examples of topics for enterprise documents. Knowledge of such topics can then be used for various purposes, such as recommendations, designing new products, etc. While topical analysis has been investigated extensively for dyadic data (Rosen-Zvi et al. 2004; Zheng et al. 2011), one important property of topical associations that has largely been overlooked is that of sparsity. While the actual number of topics may be large, the association between topics and documents, and similarly that between topics and users, is typically sparse. For example, most individual consumers prefer a small set of genres or product categories, and individual movies or products correspond to very few genres or categories. Similarly, in the enterprise setting, most documents are associated with small sets of research areas, and any specific researcher usually has expertise in a limited set of research areas.

There is not much prior work that performs sparse topical analysis over dyadic data. Learning the probabilistic author-topic model (Rosen-Zvi et al. 2004) can be interpreted as estimating the three associations of interest to us, i.e., the distribution over words for each topic, the distribution over topics for each document and the distribution over topics for each author. While this model was proposed to accommodate the knowledge of actual authors of documents, it does not directly address any notion of sparsity.

In this paper, we show that this problem can be formulated as a sparse matrix tri-factorization problem. To handle user-document associations, this formulation imposes a support constraint on one of the factors. Additionally, it imposes sparsity constraints on the individual factor matrices, and also on the product of two of the factors. Instead of directly incorporating a sparsity constraint on the product, we introduce a surrogate matrix on which we enforce the sparsity constraint, while making this surrogate as close as possible to the original product of factors. We show that in this formulation the sub-problems for the individual factor matrices are constrained least squares problems. The least squares problem is efficiently solvable using the Fast Iterative Shrinkage-Thresholding Algorithm (FISTA) (Beck and Teboulle 2009). We use a projected variant of FISTA to handle the non-negativity constraints in all of the sub-problems. The overall non-convex problem is solved efficiently within the alternating minimization framework, using the projected variant of FISTA for the sub-problems in each iteration.

There has been significant interest in the matrix factorization problem over the last decade or so (Lee and Seung 1999; Paatero and Tapper 1994). Non-negativity and sparseness have been studied in the context of bi-factorization (Lee and Seung 1999; Hoyer 2004; Kim and Park 2007). In the context of tri-factorization, Block Value Decomposition has been proposed for non-negativity constraints (Long et al. 2005). Other constraints studied in the context of tri-factorization include orthonormality (Ding et al. 2006). Sachan and Srivastava (2013) address the problem of sparse topical analysis of dyadic data using a coupled sparse bi-factorization approach but do not estimate the strength of user-document associations. To the best of our knowledge, the sparsity and support constraints that we consider in the context of tri-factorization have not been investigated before.

We perform experiments over bibliographic and product review data to demonstrate that the proposed sparse matrix tri-factorization formulation results in better generalization ability and factorization accuracy compared to baselines that use sparse bi-factorization.

Our main contributions are as follows.

(A) We formulate the problem of sparse topical analysis of dyadic data as that of matrix tri-factorization with sparsity and support constraints. This has not been studied previously. (B) We propose an efficient solution for this problem using the alternating minimization framework, where the individual non-negatively constrained sparse least-squares sub-problems in each iteration are solved using the projected variant of the FISTA algorithm. (C) We show that our formulation supports parallel implementation, since the individual sub-problems for solving a factor matrix are independent. (D) We demonstrate that the sparse matrix tri-factorization formulation yields topical associations that are more accurate and generalize better than those resulting from state-of-the-art baselines for the problem.

The rest of the paper is organized as follows: Sect. 2 discusses related work. In Sect. 3, we propose the sparse matrix tri-factorization formulation for sparse topical analysis of dyadic data. In Sect. 4, we propose an efficient solution for this formulation using a projected variant of FISTA in an alternating minimization framework. Section 5 deals with experimental work. We conclude our work in Sect. 6.

2 Related work

In this section, we first review existing literature from the point-of-view of sparse topical analysis on dyadic data, and then discuss related work on matrix factorization.

2.1 Sparse topical analysis

Collective non-negative matrix factorization (NMF) with constraints (Sachan and Srivastava 2013) considers a problem similar to ours, where authorship information is provided in addition to the documents. Here, low-rank approximations are sought for both matrices and, in addition to the sparsity requirements on the individual factors, two factor products are also required to be sparse. However, this requirement is only approximated, by minimizing the Frobenius norms of the three factors. The optimization is performed using a dual ascent approach, where the primal variables are exactly optimized in each iteration using gradient descent, and partial updates are made on the dual variables in the direction of the positive gradients. However, this model factorizes a binary author-document association matrix, which may not result in accurate estimation of the factor matrices, and it does not estimate the strength of author-document associations.

In the probabilistic setting, a problem similar to non-negative matrix tri-factorization is addressed by the Author-Topic Model (Rosen-Zvi et al. 2004) in the context of bibliographic data. Here, the distribution over words for each topic, the distribution over topics for each document and the distribution over topics for each author explicitly capture three of the factors of interest. The fourth factor, which is the distribution over authors for each document, is not directly accounted for, but can be incorporated without much difficulty. The notion of sparsity, on the other hand, is not addressed by this model. Sparsity has, however, been extensively studied in recent years for probabilistic admixture models, mainly by replacing the Dirichlet distributions with those that promote sparsity, such as the Indian Buffet Process (Griffiths and Ghahramani 2005; Williamson et al. 2010), in the context of bi-factorization problems such as LDA (Blei et al. 2003).

In the context of tri-factorization, sparsity has been introduced via the hierarchical beta process in the contextual focused topic model (Chen et al. 2012). However, words are required to belong to a topic distribution corresponding to either a publication venue, an author, or a document, and the inference procedure employed in the model is complicated, making it intractable on large datasets. Further, unlike our model, it does not support parallelism.

There have been a few other works on topic modeling in the context of bi-factorization that consider the geometrical structure of the data, namely the locally consistent topic model (LTM) (Cai et al. 2009) and double-latent-layered LDA (D-LDA) (Zhuang et al. 2010) for semi-defined classification of documents. LTM uses the local manifold structure of the data to regularize the learning of probability distributions, so that two sufficiently close documents have similar conditional probability distributions P(z | d), where z is the topic vector for the words in document d; the Kullback–Leibler divergence is used to measure the distance between two conditional probability distributions.

Compared to LDA, D-LDA uses another latent variable y for documents, in addition to the topic variable z for words. It makes use of both supervised classification and unsupervised clustering to classify an unlabeled document into one of the known or unknown classes, and finally clusters the documents belonging to the unknown class to form meaningful groups. The topic mixtures are tied to the class variable y, so that the topics for words z help in inferring the right label for the documents. However, neither LTM nor D-LDA takes into account the association between authors and documents.

2.2 Matrix factorization

The area of matrix bi-factorization with constraints has seen a lot of research over the years. Non-negative matrix factorization (NMF) (Lee and Seung 1999) imposes non-negativity constraints on the two factors. Various methods have been proposed to solve NMF, including projected gradient and projected quasi-Newton techniques (Lin 2007; Kim et al. 2007), and the active set method (Kim and Park 2008a). Though NMF is often found to generate sparse factors, approaches have been proposed to directly control the sparsity of the two factors (Hoyer 2004; Kim and Park 2007). The factorization machine (Rendle 2010) is similar to an SVM with a polynomial kernel, except that the parameter matrix for the interaction between variables is factorized; the parameters learnt under this model are therefore not independent, which yields high-quality parameter estimates under sparsity. This model can be used for regression, binary classification and ranking of vectors. However, it is a supervised model, since it needs labeled training examples to learn the model parameters, whereas our problem requires an unsupervised approach in which we evaluate the clustering quality of document-topic and user-topic associations during the learning phase itself.

In the context of tri-factorization, the Block Value Decomposition (Long et al. 2005) approach extends NMF by minimizing decomposition error using Frobenius norm, while enforcing non-negative constraints on the three factors. The resulting optimization problem can be solved in the alternating non-negative least squares framework using multiplicative update rules.

In the orthogonal tri-factorization formulation (Ding et al. 2006), orthogonality constraints are introduced on the left and right factors, in addition to the non-negativity constraints, so that the low-dimensional embedding has a natural clustering interpretation. The optimization is again done using multiplicative update rules. To the best of our knowledge, no tri-factorization formulation has considered the sparsity and support constraints that we study.

In summary, matrix factorization techniques for dyadic data generally approximate the data matrix as a product of factor matrices, finding unnormalized representations of the topical associations for entities, and without incorporating any prior knowledge on the topical associations. In contrast, dictionary learning approaches for signal and image processing problems enforce unit \(l_2\)-norm constraints on the columns of the left factor matrix (the basis) (Mairal et al. 2010) so that one of the matrices does not become too large and the other too small. This basis is analogous to the word-topic matrix in our problem. However, recent papers (Kim and Park 2008b; Kasiviswanathan et al. 2011) employing NMF for topic modeling do not enforce a unit \(l_2\)-norm on either the word-topic or the topic-document matrix.

On the other hand, probabilistic approaches (Rosen-Zvi et al. 2004; Chen et al. 2012) find the normalized representation of topical associations for the entities as distributions. These distributions are learnt through different criteria for maximizing the likelihood of the model. Additionally, they have the flexibility of incorporating prior knowledge on the topical associations for the entities.

3 Sparse matrix tri-factorization: formulation

We first formalize the notion of dyadic data. Data on m users, n documents and v terms can be captured using two dyadic matrices. The first is a user-document dyadic matrix \(A\in \{0,1\}^{m\times n}\), where \(A_{ui}=1\) indicates that user u is associated with document i, while \(A_{ui}=0\) indicates with certainty that user u cannot be associated with document i. The second is a term-document dyadic matrix \(D\in \mathbb {R}_+^{v\times n}\), where \(D_{ij}\) denotes the number of times term i appears in document j.
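To make the two dyadic matrices concrete, the following minimal sketch (in Python/NumPy, with entirely hypothetical toy data) builds A and D for a handful of users, documents and terms; the variable names are ours and are not part of the formulation.

```python
import numpy as np

# Hypothetical toy data: m = 3 users, n = 4 documents, small vocabulary.
# A[u, i] = 1 iff user u is associated with document i.
A = np.array([[1, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 1, 0, 1]], dtype=float)            # shape (m, n)

docs = [["graph", "query", "graph"],                 # document 0
        ["topic", "model", "topic", "sparse"],       # document 1
        ["query", "index"],                          # document 2
        ["model", "sparse", "sparse"]]               # document 3
vocab = sorted({w for d in docs for w in d})

# D[i, j] = number of times term i appears in document j.
D = np.zeros((len(vocab), len(docs)))                # shape (v, n)
for j, doc in enumerate(docs):
    for w in doc:
        D[vocab.index(w), j] += 1
```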

Given such data, and assuming k topics, we need to approximate D as a product of three factor matrices: \(D\approx {\varPhi }{\varTheta }\mathbb {A}\). The first factor here is \({\varPhi }\in \mathbb {R}_+^{v\times k}\) with \({\varPhi }_{it}\) denoting the association between word i and topic t. The second factor is \({\varTheta }\in \mathbb {R}_+^{k\times m}\) with \({\varTheta }_{tu}\) denoting the preference for topic t of user u. The third factor is \(\mathbb {A}\in \mathbb {R}_+^{m\times n}\) with \(\mathbb {A}_{ui}\) denoting the extent of association between user u and document i.

Let us now come to the constraints on the factors required for our problem. The first set consists of the natural non-negativity constraints: \({\varPhi }\ge 0, {\varTheta }\ge 0, \mathbb {A} \ge 0\). To understand the second constraint, it is important to appreciate the difference between the two user-document matrices A and \(\mathbb {A}\). \(A_{ui}\) is binary-valued, indicating whether or not a specific user is associated with a document, whereas \(\mathbb {A}_{ui}\) is non-negative real-valued, denoting the strength of that association. Clearly, this leads to the constraint that \(\mathbb {A}_{ui}\) can be non-zero only when \(A_{ui}\) is 1. We denote this as \(supp(\mathbb {A}) \subseteq supp(A)\).

Let us now examine the sparsity requirements on the factor matrices. First, the columns of the topic-user matrix \({\varTheta }\) are required to be sparse. The second sparsity requirement is on the topic-document associations. This is captured by the product \({\varTheta }\mathbb {A}\) of the individual factors \({\varTheta }\) and \(\mathbb {A}\). Therefore, we require the columns of the product \({\varTheta }\mathbb {A}\) to be sparse. We enforce sparsity on the column vectors using the vector \(l_1\)-norm (Tibshirani 1996).

Finally, using the Frobenius norm to capture the approximation error between D and \({\varPhi }{\varTheta }\mathbb {A}\), and \(l_1\) regularizers for the sparsity constraints, we obtain the following tri-factorization problem:

$$\begin{aligned}&\arg \min _{{\varPhi }, {\varTheta }, \mathbb {A}} \frac{1}{2} \Vert D-{\varPhi }{\varTheta }\mathbb {A} \Vert _{F}^2 + \lambda _1 \sum _{j=1}^n \Vert ({\varTheta }\mathbb {A})_j\Vert _1 + \lambda _2 \sum _{j=1}^m \Vert {\varTheta }_j\Vert _1 + \lambda _3 \sum _{j=1}^t \Vert {\varPhi }_j\Vert _1 \nonumber \\&\quad \text{ s.t. } supp(\mathbb {A}) \subseteq supp(A),\quad {\varPhi }\ge 0,\quad {\varTheta }\ge 0, \quad \mathbb {A} \ge 0, \end{aligned}$$
(1)

where we use the notation \(M_i\) to denote the ith column of matrix M; this notation applies throughout the rest of the paper. Note that we have additionally enforced sparsity on \({\varPhi }\) to prevent overfitting. \(\lambda _1\), \(\lambda _2\) and \(\lambda _3\) denote the regularization constants for the three sparsity terms.

Typically, the alternating minimization framework is used to solve matrix factorization problems (Lin 2007; Kim and Park 2007, 2008a), where the sub-problems in the individual factor variables are convex with tractable solvers available. However, the sub-problem of (1) involving \({\varTheta }\) looks as follows:

$$\begin{aligned}&\arg \min _{{\varTheta }} \frac{1}{2}\Vert D-{\varPhi }{\varTheta }\mathbb {A} \Vert _{F}^2+\lambda _1 \sum _{j=1}^n \Vert ({\varTheta }\mathbb {A})_j\Vert _1+ \lambda _2 \sum _{j=1}^m \Vert {\varTheta }_j\Vert _1, \nonumber \\&\quad \text{ s.t. } {\varTheta }\ge 0, \end{aligned}$$
(2)

Though the optimization problem in (2) is convex, it does not admit an efficient iterative solution, leading to a large increase in running time, as discussed in the experimental section. Regarding convergence, local convergence for (1) is not guaranteed, as discussed in the “Appendix”.

To get around this, we reformulate our problem, with the motivation of making the individual sub-problems efficiently solvable using (variants of) algorithms such as FISTA (Beck and Teboulle 2009). We introduce a surrogate variable Q to capture the topic-document associations. We then enforce sparsity on the columns of Q using the \(l_1\)-norm, and simultaneously enforce Q to be close to the product \({\varTheta }\mathbb {A}\) by minimizing the Frobenius norm of \(Q - {\varTheta }\mathbb {A}\). This results in the following modified formulation of the sparse matrix tri-factorization (SMTF) problem:

$$\begin{aligned}&\arg \min _{{\varPhi }, {\varTheta },Q , \mathbb {A} } \frac{1}{2}\Vert D-{\varPhi }Q \Vert _{F}^2 + \frac{1}{2}\Vert Q-{\varTheta }\mathbb {A} \Vert _{F}^2 + \lambda _Q \sum _{j=1}^n\Vert Q_j\Vert _1+\lambda _{{\varTheta }} \sum _{j=1}^m\Vert {\varTheta }_j\Vert _1 \nonumber \\&\quad +\,\lambda _{{\varPhi }} \sum _{j=1}^t\Vert {\varPhi }_j\Vert _1 \nonumber \\&s.t. supp(\mathbb {A}) \subseteq supp(A), \quad {\varPhi }\ge 0, \quad {\varTheta }\ge 0, \quad Q \ge 0, \quad \mathbb {A} \ge 0, \end{aligned}$$
(3)

Each of the four sub-problems of (3) now admits an efficient solution using (a projected version of) FISTA, where \(\lambda _Q\), \(\lambda _{\varTheta }\) and \(\lambda _{\varPhi }\) denote the regularization constants for the three sparsity terms. Note that using a surrogate for the product of the factors \({\varPhi }{\varTheta }\) instead would result in enforcing the \(l_1\)-norm on \(\mathbb {A}\) in the sub-problem involving \(\mathbb {A}\), which is not desirable; this can also happen in (1). The details are given in the “Appendix”. We elaborate more in the next section, where we propose an alternating minimization algorithm for the SMTF problem.
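As a quick illustration of the reformulated objective, the sketch below (Python/NumPy, with helper and variable names of our own choosing) evaluates the cost in (3) for given factor matrices; it assumes the support constraint on \(\mathbb {A}\) already holds and only computes the objective value.

```python
import numpy as np

def smtf_objective(D, Phi, Theta, Q, A_real, lam_Q, lam_Theta, lam_Phi):
    """Value of the SMTF objective (3). A_real is the real-valued user-document
    matrix (the blackboard A in the text); its support constraint is assumed to
    hold already and is not checked here."""
    fit_D = 0.5 * np.linalg.norm(D - Phi @ Q, 'fro') ** 2
    fit_Q = 0.5 * np.linalg.norm(Q - Theta @ A_real, 'fro') ** 2
    # For non-negative matrices, the sum of column-wise l1-norms is simply the
    # sum of all entries.
    return (fit_D + fit_Q
            + lam_Q * np.abs(Q).sum()
            + lam_Theta * np.abs(Theta).sum()
            + lam_Phi * np.abs(Phi).sum())
```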

4 Sparse matrix tri-factorization: optimization

We now propose an alternating minimization algorithm for solving the sparse matrix tri-factorization formulation in (3). Recall that the optimization variables are the word-topic matrix \({\varPhi }\), the surrogate topic-document matrix Q, the topic-user matrix \({\varTheta }\) and the user-document matrix \(\mathbb {A}\). In this framework, starting with a suitable initialization of the four variables, we optimize over each of them in an alternating fashion, holding the others fixed as their current values. Local convergence for SMTF formulation (3) using alternating minimization framework is discussed in Sect. 4.2.

As we will show, the most general form for the individual sub-problems is that of \(l_1\)-regularized non-negative least squares (NNLS) (Kim and Park 2008b). Before moving on to the individual sub-problems, we first discuss some efficient algorithms for this general problem. As mentioned before, efficiency is critical, since a large number of instances of this problem are required to be solved in each iteration.

4.1 Efficiently solving sparse non-negative least squares

The constrained non-negative least squares problem arises commonly in the context of matrix factorization and alternating minimization. Commonly used techniques for it include projected gradient (Lin 2007), projected subgradient (Sachan and Srivastava 2013), the active set method (Kim and Park 2008a) and the Alternating Directions Method (ADM) (Yang and Zhang 2011; Kasiviswanathan et al. 2011). Projected gradient for non-negativity constraints operates on the class of differentiable objective functions and is therefore applicable to the non-negative least squares problem; it projects the solution computed by the gradient descent step onto the non-negative orthant at each iteration. The projected subgradient method is an extension of the projected gradient approach to non-differentiable functions, and can therefore also be used to solve \(l_1\)-regularized non-negative least squares problems. In theory, the projected subgradient algorithm converges asymptotically to the minimum for a convex problem.

The block principal pivoting (BPP) algorithm (Kim and Park 2008b) is an extension of the active set method, and can be used in conjunction with an \(l_1\) regularizer on the right factor matrix, but not on the left factor matrix, which is what two of our sub-problems require. Additionally, the BPP algorithm requires the coefficient matrix in the least squares problem to have full column rank. This condition is not guaranteed to hold in the various sub-problems, and empirically we have found it to be violated frequently. We also found empirically that ADM converges very slowly, or may not converge to the true solution, if the columns of the coefficient matrix are not normalized to unit norm, and this column normalization does not suit the alternating minimization framework.

FISTA has recently been shown to have a fast convergence rate for non-smooth convex problems (Beck and Teboulle 2009). FISTA minimizes cost functions of the form \(f(x) + g(x)\), where f is convex, smooth and its gradient is Lipschitz, and g is convex and continuous. FISTA finds the quadratic approximation of \(F(x)=f(x)+g(x)\) at a given point y through the form

$$\begin{aligned} Q_L(x,y):=f(y)+\langle x-y,\bigtriangledown f(y) \rangle +\frac{L}{2}\Vert x-y \Vert ^2+g(x). \end{aligned}$$

It is shown that the minimizer of \(Q_L(x,y)\) over x is unique, and takes the following form:

$$\begin{aligned} p_L(y):=\arg \min _x\left\{ g(x)+\frac{L}{2}\Vert x-\left( y-\frac{1}{L}\bigtriangledown f(y)\right) \Vert ^2\right\} . \end{aligned}$$

FISTA can be used to solve the unconstrained least squares problem \(\arg \min _{h}\frac{1}{2}\Vert c-W h\Vert _{F}^2\) for a column vector h. This is because \(f(h)=\frac{1}{2} \Vert c-W h\Vert _{F}^2\) is convex and smooth, and its gradient is Lipschitz. Further, \(g(h)=0\) is a convex, continuous function. FISTA is also effective when g is the \(l_1\)-norm on h, i.e., \(g(h)=\lambda \Vert h\Vert _1\). It has been proved that the sequence of objective function values \(F(x_k)\), where \(x_k\) is the solution computed by FISTA at the kth iteration, converges to the optimal function value \(F(x^*)\) at a rate no worse than \(O(\frac{1}{k^2})\).

FISTA with constant step size for the unconstrained \(l_1\)-regularized least squares problem \(\arg \min _{h} \frac{1}{2} \Vert c-W h\Vert _{F}^2 +\lambda \Vert h\Vert _1\) is described in Algorithm 1, where \(eigs(W^T W,1)\) is the largest eigen-value of \(W^T W\), \(soft(a,b)=sign(a)\times \max (|a|-b,0)\) (Kasiviswanathan et al. 2011) is the soft-thresholding operator and \(x_c\) is the converged value of \(x_k\). The disadvantage of FISTA with constant step size is that, for large scale problems, the Lipschitz constant (the largest eigen-value) may not be efficiently computable.

Algorithm 1 FISTA with constant step size
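Since the algorithm listing appears as a figure in the published version, the following is a minimal NumPy sketch of FISTA with constant step size for this problem; the function and variable names are ours.

```python
import numpy as np

def soft(a, b):
    """Soft-thresholding operator: sign(a) * max(|a| - b, 0)."""
    return np.sign(a) * np.maximum(np.abs(a) - b, 0.0)

def fista_const(W, c, lam, max_iter=500, eps=1e-4):
    """FISTA with constant step size for  min_h 0.5*||c - W h||^2 + lam*||h||_1."""
    L = np.linalg.eigvalsh(W.T @ W).max()          # Lipschitz constant of the gradient
    x = np.zeros(W.shape[1]); y = x.copy(); t = 1.0
    for _ in range(max_iter):
        grad = W.T @ (W @ y - c)
        x_new = soft(y - grad / L, lam / L)        # p_L(y_k)
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        if np.linalg.norm(x_new - x) <= eps * max(np.linalg.norm(x), 1.0):
            x = x_new
            break
        x, t = x_new, t_new
    return x
```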

FISTA with backtracking avoids the largest eigen-value computation in Algorithm 1 by computing an estimate of the Lipschitz constant (\(L_k\)) at every kth step, and has the same rate of convergence as FISTA with constant step size. FISTA with backtracking is described in Algorithm 2.

Algorithm 2 FISTA with backtracking
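Again, since the listing is given as a figure, a minimal sketch of the backtracking variant is shown below; \(L_0\) and \(\eta \) are the backtracking parameters referred to in the text, and the helper names are ours.

```python
import numpy as np

def fista_backtracking(W, c, lam, L0=1.0, eta=2.0, max_iter=500, eps=1e-4):
    """FISTA with backtracking for  min_h 0.5*||c - W h||^2 + lam*||h||_1."""
    soft = lambda a, b: np.sign(a) * np.maximum(np.abs(a) - b, 0.0)
    f = lambda h: 0.5 * np.sum((c - W @ h) ** 2)
    x = np.zeros(W.shape[1]); y = x.copy(); t = 1.0; L = L0
    for _ in range(max_iter):
        gy = W.T @ (W @ y - c)                      # gradient of f at y
        while True:                                 # backtracking search for L_k
            x_new = soft(y - gy / L, lam / L)
            diff = x_new - y
            if f(x_new) <= f(y) + gy @ diff + 0.5 * L * np.sum(diff ** 2):
                break
            L *= eta
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        if np.linalg.norm(x_new - x) <= eps * max(np.linalg.norm(x), 1.0):
            x = x_new
            break
        x, t = x_new, t_new
    return x
```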

Note that the computational effort required in step 2 of Algorithm 2 depends on the values of \(L_0\) and \(\eta \): smaller values of \(L_0\) and \(\eta \) increase the computation time of step 2, while larger values of \(\eta \) decrease the computation time of step 2 but increase the number of iterations needed to converge.

Projected FISTA for non-negativity constraints: Our sub-problems additionally involve a non-negativity constraint in the \(l_1\)-regularized least squares problem:

$$\begin{aligned} \arg \min _{h \ge 0}\frac{1}{2}\Vert c-W h \Vert _{F}^2 +\lambda \Vert h\Vert _1. \end{aligned}$$
(4)

We use projected FISTA to solve (4), where \(W \in \mathbb {R}^{m\times n}, h \in \mathbb {R}^{n\times 1} , c \in \mathbb {R}^{m\times 1}\). For projection onto the non-negative orthant, projected FISTA contains an additional projection step after the computation of \(p_{L}(y_k)\) and \(p_{L_k}(y_k)\) in Algorithms 1 and 2, respectively. The projection step is \((x_k)_m=0, \forall m\) s.t. \((x_k)_m<0\). This gives two versions of projected FISTA, one with constant step size and one with backtracking.
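A minimal sketch of the constant-step projected variant is given below; the only change relative to the plain FISTA sketch above is the clamping of negative entries after the shrinkage step (the backtracking variant is modified identically). The name projected_fista_const is ours.

```python
import numpy as np

def projected_fista_const(W, c, lam, max_iter=500, eps=1e-4):
    """Projected FISTA (constant step size) for
       min_{h >= 0} 0.5*||c - W h||^2 + lam*||h||_1."""
    soft = lambda a, b: np.sign(a) * np.maximum(np.abs(a) - b, 0.0)
    L = np.linalg.eigvalsh(W.T @ W).max()
    x = np.zeros(W.shape[1]); y = x.copy(); t = 1.0
    for _ in range(max_iter):
        grad = W.T @ (W @ y - c)
        x_new = soft(y - grad / L, lam / L)
        x_new = np.maximum(x_new, 0.0)            # projection: zero out negative entries
        t_new = (1.0 + np.sqrt(1.0 + 4.0 * t * t)) / 2.0
        y = x_new + ((t - 1.0) / t_new) * (x_new - x)
        if np.linalg.norm(x_new - x) <= eps * max(np.linalg.norm(x), 1.0):
            x = x_new
            break
        x, t = x_new, t_new
    return x
```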

We now state a proof of correctness for the convergence of the projected FISTA with non-negative orthant constraint for a broader class of problems.

Theorem 1

If g(x) is fully separable in terms of x, then projected FISTA converges for the problem \({\arg \min _{x \ge 0}} f(x) + g(x)\), when f is convex, smooth and its gradient is Lipschitz, and g is convex and continuous.

Proof

At iteration k, for any of our sub-problems the \(p_L(y_k)\) will look as below:

\(p_L(y_k)=\arg \min _{x \ge 0}\{g(x)+\frac{L}{2}\Vert x-(y_k-\frac{1}{L}\times \bigtriangledown f(y_k) )\Vert ^2_F\}\),

where \(x \in R^{n}\). It is easy to see that \(p_L(y_k)\) can be split into n independent sub-problems as follows:

$$\begin{aligned} p_L(y_k)= & {} \arg \min _{x \ge 0}\sum _{i=1}^n\left\{ g(x_i)+\frac{L}{2}\Vert x_i-(y_k-\frac{1}{L}\times \bigtriangledown f(y_k) )_i\Vert ^2\right\} \\= & {} \left\{ \arg \min _{x_i \ge 0}\left\{ g(x_i)+\frac{L}{2}\Vert x_i-(y_k-\frac{1}{L}\times \bigtriangledown f(y_k))_i\Vert ^2\right\} \right\} _{i=1:n}, \end{aligned}$$

since g(x) and the squared Frobenius norm are separable. After the correct computation of \(p_L(y_k)\) from independent sub-problems, the convergence proof for FISTA (Beck and Teboulle 2009) applies here as well. \(\square \)

It is easy to see that since (4) is of the form in Theorem 1 where \(f(h)=\frac{1}{2}\Vert c-W\times h \Vert _{F}^2\) and \(g(h)=\lambda \Vert h\Vert _1\), the projected FISTA converges for \(l_1\)-regularized non-negative least squares problem.

Sindhwani and Ghoting (2012) study the problem of \(l_1\)-projection using FISTA, and have also suggested that the same strategy works for non-negative orthant projection. However, they do not give a correctness proof for general objective functions or report experiments with non-negative orthant projections using FISTA.

We now have two versions of projected FISTA for non-negative orthant constraint, one with constant step size and the other with backtracking. We use both of these versions for different sub-problems, as we discuss in the next subsection.

4.2 Proposed algorithm

Having looked at projected FISTA for solving the \(l_1\)-regularized non-negative least squares problem, we now investigate the individual sub-problems for the four factor matrices.

Solving \({\varPhi }\): The sub-problem for \({\varPhi }\) is as follows:

$$\begin{aligned}&\arg \min _{{\varPhi }\ge 0}\frac{1}{2}\Vert D-{\varPhi }Q \Vert _{F}^2+\lambda _{\varPhi }\sum _{j=1}^t\Vert {\varPhi }_j\Vert _1 \nonumber \\&\quad = \arg \min _{{\varPhi }\ge 0}\frac{1}{2}\left\| D^T- Q^T {\varPhi }^T \right\| _{F}^2 +\lambda _{\varPhi }\sum _{j=1}^v\Vert ({\varPhi }^T)_j\Vert _1 \nonumber \\&\quad = \left\{ \arg \min _{({\varPhi }^T)_i\ge 0} \frac{1}{2}\Vert (D^T)_i- Q^T ({\varPhi }^T)_i \Vert _{F}^2 +\lambda _{\varPhi }\Vert ({\varPhi }^T)_i\Vert _1\right\} _{i=1:v}, \end{aligned}$$
(5)

where we have split the sub-problem into further independent sub-problems involving individual \(({\varPhi }^T)_i\) variables.

Let us set \(f(({\varPhi }^T)_i)=\frac{1}{2}\Vert (D^T)_i- Q^T ({\varPhi }^T)_i \Vert _{F}^2\) and \(g(({\varPhi }^T)_i)=\lambda _{\varPhi }\Vert ({\varPhi }^T)_i\Vert _1\). Observe that f is convex and smooth and its gradient is Lipschitz, and that g is convex and continuous. Therefore, we can directly use the projected version of Algorithm 1 for solving each individual sub-problem in \(({\varPhi }^T)_i\). We use the version with constant step size because the largest eigen-value computation is required only once per iteration of alternating minimization for solving all of the sub-problems (on the order of thousands) in \({\varPhi }\), whereas projected FISTA with backtracking would require computing the Lipschitz constant \(L_k\) at each step of every individual sub-problem, as discussed in Sect. 4.1.
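The sketch below shows how the \({\varPhi }\) update decomposes into v independent row problems; it assumes a projected-FISTA routine with the calling convention solver(W, c, lam), such as the projected_fista_const sketch above, and the name update_Phi is ours.

```python
import numpy as np

def update_Phi(D, Q, lam_Phi, solver):
    """Solve (5): each row of Phi is an independent l1-regularized non-negative
    least squares problem with the shared coefficient matrix Q^T, so the loop
    below is trivially parallelizable across rows."""
    v = D.shape[0]
    t = Q.shape[0]
    W = Q.T                                  # same coefficient matrix for every row
    Phi = np.zeros((v, t))
    for i in range(v):                       # independent sub-problems
        Phi[i, :] = solver(W, D[i, :], lam_Phi)
    return Phi
```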

Solving \({\varTheta }\): The sub-problem for \({\varTheta }\) looks very similar to that for \({\varPhi }\):

$$\begin{aligned}&\arg \min _{{\varTheta }\ge 0} \frac{1}{2} \Vert Q-{\varTheta }\mathbb {A} \Vert _{F}^2+\lambda _{\varTheta }\sum _{j=1}^m\Vert {\varTheta }_j\Vert _1 \nonumber \\&\quad = \arg \min _{{\varTheta }\ge 0} \frac{1}{2} \left\| Q^T- \mathbb {A}^T {\varTheta }^T \right\| _{F}^2+\lambda _{\varTheta }\sum _{j=1}^t\Vert ({\varTheta }^T)_j\Vert _1 \nonumber \\&\quad = \left\{ \arg \min _{({\varTheta }^T)_i \ge 0}\frac{1}{2} \Vert (Q^T)_i- \mathbb {A}^T ({\varTheta }^T)_i \Vert _{F}^2 +\lambda _{\varTheta }\Vert ({\varTheta }^T)_i\Vert _1 \right\} _{i=1:t}, \end{aligned}$$
(6)

where we have split the last step into t independent sub-problems involving individual \(({\varTheta }^T)_i\) variables. Observe that this problem is identical to that in (5), with Q substituted for D and \(\mathbb A\) for Q. Therefore, for the same reasons, we can use projected version of Algorithm 1 for solving the t sub-problems independently. As for \({\varPhi }\), we use the version with constant step size to solve sub-problems in \({\varTheta }\) also.

Solving Q: The sub-problem for Q is as follows:

$$\begin{aligned}&\arg \min _{Q \ge 0} \frac{1}{2} \Vert D-{\varPhi }Q \Vert _{F}^2+\frac{1}{2} \Vert Q-{\varTheta }\mathbb {A} \Vert _{F}^2+\lambda _Q \sum _{i=1}^n\Vert Q_i\Vert _1 \nonumber \\&\quad = \arg \min _{Q \ge 0} \frac{1}{2} \Vert [D; {\varTheta }\mathbb {A}]-[{\varPhi };eye(t,t)] Q \Vert _{F}^2+\lambda _Q \sum _{i=1}^n\Vert Q_i\Vert _1 \nonumber \\&\quad = \left\{ \arg \min _{Q_i \ge 0} \frac{1}{2} \Vert [D; {\varTheta }\mathbb {A}]_i- [{\varPhi };eye(t,t)] Q_i \Vert _{F}^2 +\lambda _Q \Vert Q_i\Vert _1 \right\} _{i=1:n}, \end{aligned}$$
(7)

where we have split it into independent sub-problems involving individual \(Q_i\) variables. We use the notation eye(n, n) to denote the identity matrix with n rows and n columns, and [ ; ] for vertical concatenation of matrices.

As before, we can set \(f(Q_i)= \frac{1}{2}\Vert [D;{\varTheta }\mathbb {A}]_i- [{\varPhi };eye(t,t)] Q_i \Vert _{F}^2\) and \(g(Q_i)=\lambda _Q \Vert Q_i\Vert _1\). Again, observe that these satisfy the conditions for FISTA. Therefore, we can apply projected version of Algorithm 1 to solve the n sub-problems independently. Again, we use the version with constant step size for similar reasons as before.
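A sketch of the corresponding Q update is shown below, using the vertical stacking described in (7); it again assumes a projected-FISTA routine solver(W, c, lam), and the name update_Q is ours.

```python
import numpy as np

def update_Q(D, Phi, Theta, A_real, lam_Q, solver):
    """Solve (7): stack D over Theta@A and Phi over the identity; each column
    Q_i is then an independent l1-regularized non-negative least squares problem."""
    t = Phi.shape[1]
    n = D.shape[1]
    C = np.vstack([D, Theta @ A_real])       # [D; Theta A]
    W = np.vstack([Phi, np.eye(t)])          # [Phi; eye(t, t)]
    Q = np.zeros((t, n))
    for i in range(n):                       # independent columns, parallelizable
        Q[:, i] = solver(W, C[:, i], lam_Q)
    return Q
```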

Solving \(\mathbb {A}\): The sub-problem for \(\mathbb {A}\) looks as follows:

$$\begin{aligned}&\arg \min _{\mathbb {A} \ge 0}\frac{1}{2} \Vert Q-{\varTheta }\mathbb {A} \Vert _{F}^2 \text{ s.t. } supp(\mathbb {A}) \subseteq supp(A) \nonumber \\&\quad = \arg \min _{\mathbb {A} \ge 0} \frac{1}{2} \Vert Q-{\varTheta }\mathbb {A} \Vert _{F}^2, \mathbb {A}_{ij}=0 \ \forall i,j \text{ s.t. } \mathbb {A}_{ij} \notin supp(A) \nonumber \\&\quad = \left\{ \arg \min _{\mathbb {A}_j \ge 0} \frac{1}{2} \Vert Q_j-{\varTheta }\mathbb {A}_j \Vert _{F}^2, \mathbb {A}_{ij}=0\ \forall i \text{ s.t. } \mathbb {A}_{ij} \notin supp(A) \right\} _{j=1:n}, \end{aligned}$$
(8)

where subscript ij denotes ith row and jth column and in the last line, we have split the problem into n independent sub-problems involving \(\mathbb {A}_j\) variables.

Because of the constraint, the problem in (8) does not fit into the FISTA framework. However, we can reformulate the problem by incorporating the constraints in the definition of the variable as given below.

Let \(G(j) = \{i : \mathbb {A}_{ij} \in supp(A)\}\) denote the indices of potential authors for document j. Let \({\varTheta }_{G(j)}\) be the truncated matrix containing the columns from \({\varTheta }\) indexed by G(j), denoting the topical associations of these potential authors. Let \(\{\mathbb {A}_j\}_{G(j)}\) be the truncated column vector containing the elements indexed by G(j), denoting the document associations of the potential authors. Then it is easy to see that the individual sub-problems in (8) can be rewritten as

$$\begin{aligned} \arg \min _{\{\mathbb {A}_j\}_{G(j)} \ge 0} \frac{1}{2} \Vert Q_j- {\varTheta }_{G(j)} \{\mathbb {A}_j\}_{G(j)} \Vert _{F}^2. \end{aligned}$$
(9)

This is a non-negative least squares problem, which can be solved by the projected version of Algorithm 2. For the sub-problems in \(\mathbb {A}\), we use the backtracking version of FISTA because, with a constant step size, the Lipschitz constant (largest eigen-value) would need to be calculated separately for each of the n sub-problems in \(\mathbb {A}\), each having a different \({\varTheta }_{G(j)}\).
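The sketch below illustrates the support-restricted update of \(\mathbb {A}\) from (8)-(9); it assumes a projected-FISTA solver with backtracking and the calling convention solver(W, c, lam), and the names update_A and A_supp (the binary matrix A) are ours.

```python
import numpy as np

def update_A(Q, Theta, A_supp, solver):
    """Solve (9) for every document: restrict to the potential authors
    G(j) = {i : A_supp[i, j] == 1} and solve a plain non-negative least squares
    problem (regularization weight 0) over the truncated columns of Theta."""
    m, n = A_supp.shape
    A_real = np.zeros((m, n))
    for j in range(n):                          # independent columns, parallelizable
        G = np.flatnonzero(A_supp[:, j])        # indices of potential authors
        if G.size == 0:
            continue                            # document with no associated user
        A_real[G, j] = solver(Theta[:, G], Q[:, j], 0.0)
    return A_real
```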

Note that all the sub-problems involved in solving each of the factor matrices \({\varPhi }, Q, {\varTheta }\) and \(\mathbb {A}\) are independent, supporting parallelism. Solving each of these factor matrices \({\varPhi }, Q, {\varTheta }\) and \(\mathbb {A}\) alternatingly using the projected versions of FISTA constitutes one main iteration of the alternating minimization algorithm. Local convergence for this algorithm is guaranteed by the alternating minimization framework (Lin 2007), as discussed in the “Appendix”.
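Putting the pieces together, one possible outer loop is sketched below; it reuses the update_Phi, update_Q and update_A sketches above, exploits the fact that the \({\varTheta }\) sub-problem (6) has the same form as (5) with Q in place of D and \(\mathbb {A}\) in place of Q, and takes the two projected-FISTA variants as callables. The initialization and the fixed iteration count are simplifying assumptions of ours.

```python
import numpy as np

def smtf_alternating(D, A_supp, t, lam_Q, lam_Theta, lam_Phi,
                     solver_const, solver_bt, n_iter=50, seed=0):
    """One possible alternating-minimization loop for the SMTF formulation (3)."""
    rng = np.random.default_rng(seed)
    v, n = D.shape
    m = A_supp.shape[0]
    Phi = rng.random((v, t))
    Theta = rng.random((t, m))
    Q = rng.random((t, n))
    A_real = A_supp * rng.random((m, n))        # respects supp(A_real) ⊆ supp(A)
    for _ in range(n_iter):
        Phi = update_Phi(D, Q, lam_Phi, solver_const)
        Q = update_Q(D, Phi, Theta, A_real, lam_Q, solver_const)
        # (6) has the same form as (5) with Q playing the role of D and
        # A_real playing the role of Q, so update_Phi can be reused.
        Theta = update_Phi(Q, A_real, lam_Theta, solver_const)
        A_real = update_A(Q, Theta, A_supp, solver_bt)
    return Phi, Theta, Q, A_real
```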

5 Experiments

In this section, we experimentally evaluate various aspects of our proposed approach over synthetic and real world datasets: (A) evaluation of projected FISTA for sparse non-negative least squares; (B) evaluation of SMTF for dyadic data, in terms of generalization ability and accuracy with respect to the available gold-standard; (B1) effect of sparseness on the factor matrices; (B2) comparison with existing baselines; (B3) evaluation for non-document data; (B4) evaluation of the execution time of SMTF and the baselines for each dataset.

5.1 Evaluating projected FISTA

In our first experiment, we empirically compare the convergence rates of projected FISTA and projected subgradient for the \(l_1\)-regularized non-negative least squares problem in (4). We are not aware of any prior experimental evaluation of projected FISTA for this problem. In this section, we report experimental results for projected FISTA with constant step size. The convergence behaviour is similar for projected FISTA with backtracking for smaller values of \(L_0\) and \(\eta \). So, we plot only the objective function values of (4) using projected FISTA with constant step size.

For this experiment, we randomly generate instances of the \(l_1\)-regularized non-negative least squares problem (4) by generating instances of c, W and h. Given dimensions m and n, we generate a non-negative real \(m\times n\) matrix W and a non-negative real \(n\times 1\) column vector h by drawing each entry independently from \(N(\mu ,\sigma )\). We then use a sparsity parameter \(s\in [0,1]\) to set each element of h independently to 0 by drawing from a Bernoulli distribution with parameter s. We call this the synthetic Sparse Non-negative Least Squares (SNNLS_s) data.
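For concreteness, one way to generate such an instance is sketched below; the paper does not state how c is produced or how non-negativity of the Gaussian draws is enforced, so taking absolute values and setting c = W h are our assumptions.

```python
import numpy as np

def make_snnls_instance(m, n, s, mu=1.0, sigma=1.0, seed=0):
    """Generate one SNNLS_s instance (W, h, c) as described in the text."""
    rng = np.random.default_rng(seed)
    W = np.abs(rng.normal(mu, sigma, size=(m, n)))   # non-negative m x n matrix
    h = np.abs(rng.normal(mu, sigma, size=n))        # non-negative n x 1 vector
    h[rng.random(n) < s] = 0.0                       # zero each entry with probability s
    c = W @ h                                        # assumed right-hand side
    return W, h, c
```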

Fig. 1 Convergence plots for (a) projected FISTA and (b) projected subgradient

Based on the data sizes in our real data experiments, we generate (W,h) matrices for \((m=500,n=30)\) and \((m=5000,n=10)\) to evaluate convergence for both small and large matrices. In each case, we experimented with multiple values of the sparseness parameter \(\lambda =0,0.1,1,10\), and we run the algorithm until convergence (with \(\epsilon =10^{-4}\)). In Fig. 1a, we plot the objective function values over iterations for one \((m=5000,n=10)\) sample for \(\lambda =0,0.1,1,10\). We also experimented with \((m=500,n=30)\)-sized matrices and with different samples, and the convergence trends look very similar. We can see that the objective function stabilizes in about 60 iterations, to values depending on \(\lambda \). The corresponding plots for the projected subgradient algorithm are shown in Fig. 1b.

The two algorithms start with the same objective function value, which is 70,177. The plots record values after the first iteration. Observe that the values are so different for the two algorithms from the first iteration onwards that we needed to plot them separately. We can see that projected FISTA converges to the minimum in about 60 iterations. In contrast, projected subgradient converges very slowly, conforming to the theoretical guarantees. The objective function values for projected FISTA and projected subgradient after 60 iterations for \(\lambda =(0,0.1,1,10)\) are (7, 7, 10, 36) and around 9.0e13, respectively. This empirically validates our design of the formulation, in which every sub-problem is efficiently solvable by the projected FISTA approach.

5.2 Evaluating SMTF on dyadic data

We first perform extensive experiments on user-document data, where we investigate the effect of sparsity parameters and compare performance with available baselines, and then briefly explore non-document data.

5.2.1 Evaluations on user-document data

Though our formulation supports arbitrary items and features, in our evaluation for this paper, we focus largely on document-word data, with a brief exploration of other kinds of items and features. Recall that our formulation takes as input a word-document matrix D and a user-document matrix A, and produces as output a word-topic matrix \({\varPhi }\), a topic-user matrix \({\varTheta }\), a topic-document matrix Q and a user-document matrix \(\mathbb A\) using which we evaluate the generalization ability and accuracy of topic-document and user-topic associations.

Datasets: Our first dataset is the DBLP abstracts dataset (DBLP), from which we use a subset of 6320 documents involving 3377 authors covering 8 conferences. Of these, 5533 documents are used for training, and the remaining 787 for testing, ensuring that each author in the test dataset is also present in at least one training document. The split is made by letting each author vote with probability 0.7 and using a majority vote to decide whether a document goes to the training set or the test set. We consider each conference as a topic, so that each document is labeled with exactly one of the 8 topics. After eliminating rare words and stop words, the vocabulary is of size 3989, with a total of 0.57 million word occurrences.

As a second example of author-research paper associations, we use the NIPS dataset (NIPS), a publication dataset from the Neural Information Processing Systems (NIPS) conference proceedings (volumes 0-12). This collection contains 1,740 documents (complete papers) written by a total of 2,037 authors. Of these, 1,514 documents were used for training and 226 documents for testing, split in a similar manner to the DBLP dataset. There are no gold-standard topics available for this dataset. The vocabulary has 7717 words, with a total of 2.2 million word occurrences.

In our third dataset, we consider associations between users and reviews. We use a subset of the Product Review (REV) dataset from amazon.com, containing 9651 reviews written by 5675 reviewers in 10 different product categories (apparel, books, camera, computers, jewelry, kitchen items, magazines, music, sports, video). We create multi-author documents by concatenating all the reviews written by different reviewers for one specific product. This results in 5998 documents, one for each product. Of these, 5040 documents were used for training and 958 documents for testing. We treat each product category as one topic, so that each document is labeled with exactly one of the ten topics. The vocabulary has 8587 words, with a total of 0.53 million word occurrences.

Setting parameters for SMTF: Recall that the SMTF formulation (3) involves three parameters for inducing sparsity, namely \(\lambda _{{\varTheta }}\), \(\lambda _{Q}\) and \(\lambda _{{\varPhi }}\). We perform a grid search over the space of these parameters and evaluate performance for each grid point.

Evaluation measures: We evaluate accuracy of the individual factors against gold-standards where available, and additionally measure error on a test dataset using factors ‘learnt’ from training data. We discuss these in more detail below.

Document topic assignment accuracy: This corresponds to the evaluation of the topic-document matrix Q for SMTF, and the corresponding equivalent for the Collective Matrix Factorization Model (CMF) and the Author-Topic Model (ATM), against an available gold-standard on the training dataset. We denote this as DTa. For DBLP and REV, we have gold-standard document-topic associations available, where each document is associated with exactly one topic. This can be interpreted as a hard clustering of the documents. To evaluate clustering accuracy, we use the F1 measure and the Adjusted Rand Index (ARI) over the pairwise clustering decisions. The Adjusted Rand Index for pairwise decisions is defined as \(\frac{2(ab - cd)}{((a + d)(d + b) + (a + c)(c + b))}\), where a is the number of true positive pairs, b is the number of true negative pairs, c is the number of false positive pairs and d is the number of false negative pairs. The F1 measure is the harmonic mean of precision (P) and recall (R) over pairwise decisions: \(F1 = \frac{2PR}{(P +R)}\), where \(P = \frac{a}{(a + c)}\) and \(R = \frac{a}{(a + d)}\). Higher values of F1 and ARI indicate higher clustering accuracy.
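The following sketch computes the pairwise F1 and ARI exactly as defined above, given a predicted topic label and a gold-standard label per document; the function name is ours, and degenerate cases with zero denominators are not handled.

```python
import numpy as np

def pairwise_f1_ari(pred, gold):
    """Pairwise clustering F1 and Adjusted Rand Index over all document pairs."""
    pred, gold = np.asarray(pred), np.asarray(gold)
    iu, ju = np.triu_indices(len(pred), k=1)       # all unordered document pairs
    same_pred = pred[iu] == pred[ju]
    same_gold = gold[iu] == gold[ju]
    a = np.sum(same_pred & same_gold)              # true positive pairs
    b = np.sum(~same_pred & ~same_gold)            # true negative pairs
    c = np.sum(same_pred & ~same_gold)             # false positive pairs
    d = np.sum(~same_pred & same_gold)             # false negative pairs
    P, R = a / (a + c), a / (a + d)
    f1 = 2 * P * R / (P + R)
    ari = 2 * (a * b - c * d) / ((a + d) * (d + b) + (a + c) * (c + b))
    return f1, ari
```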

Author topic assignment accuracy: This is the evaluation of the topic-user matrix \({\varTheta }\) for SMTF and the corresponding equivalent for CMF and ATM against an available gold-standard on the training dataset. We denote this as ATa. Since DBLP and REV have gold-standard topic labels associated with documents, we consider as ‘gold-standard’ \(\overline{{\varTheta }_{ij}}\), the fraction of documents associated with user j that are labeled with topic i. Since for each user j, \({\varTheta }_{.j}\) (and \(\overline{{\varTheta }_{.j}}\)) potentially assigns the user to multiple topics i with non-zero weights, we measure the soft clustering accuracy of the authors using the categorical clustering distance (CCD) (Zhou et al. 2005). CCD is obtained by solving the following optimization problem: \( CCD ({\varTheta },\overline{{\varTheta }})=\min _{w_{k,j}}\sum _{k=1}^{K}\sum _{j=1}^{J}w_{k,j}\sum _{i=1}^{m}\Vert \overline{{\varTheta }}_{k,i}-{\varTheta }_{j,i}\Vert \), subject to \(w_{k,j}\ge 0\), \(\sum _{k=1}^{K}w_{k,j}=\frac{1}{J}, \sum _{j=1}^{J}w_{k,j}=\frac{1}{K}\) for all kj,  where K and J are the number of topics in \(\overline{{\varTheta }}\) and \({\varTheta }\), respectively, and m is the number of users. Smaller values of CCD indicate better agreement with the gold-standard.

Test error: Assuming a fixed set of authors and topics across corpora, the word-topic (\({\varPhi }\)) and topic-user (\({\varTheta }\)) matrices can be learnt from a training corpus and used for analyzing a held-out test corpus. The quality of the learnt matrices \(\hat{\varPhi }\) and \(\hat{\varTheta }\) can be measured using the Frobenius error in fitting the test corpus \(D_t\), obtained by solving the following optimization problem for SMTF: \(\text{ TE }(D_t;\hat{{\varPhi }}, \hat{{\varTheta }}) = \min _{Q\ge 0 , \mathbb {A}\ge 0 } \frac{1}{2}\Vert D_t-\hat{{\varPhi }} Q \Vert _{F}^2+\frac{1}{2} \Vert Q-\hat{{\varTheta }} \mathbb {A} \Vert _{F}^2\) while imposing sparsity on Q. For CMF, we use the corresponding formulation: \(\min _{Q\ge 0, {\varSigma }_1\ge 0, {\varSigma }_2\ge 0}\frac{1}{2}\Vert D_t-Q {\varSigma }_1 \hat{{\varPhi }}\Vert _{F}^2+\frac{1}{2}\Vert A-\hat{{\varTheta }} {\varSigma }_2 Q^T\Vert _{F}^2\) while imposing sparsity on Q. Note that lower values of test error indicate better generalization performance. We denote this as TE. Since ATM is trained to maximize a very different objective function (the log-likelihood), we do not evaluate this error for it.

Before presenting the experiments that quantitatively evaluate the accuracy of the factor matrices, we show in Table 1 an example list of topics for the DBLP dataset, along with the associations of words and authors to the topics, obtained from the SMTF formulation (3).

Table 1 Illustration of topics for DBLP, with the topmost words and authors associated with each topic

The topics presented in Table 1 correspond to the 8 conferences (AAAI, CIKM, IJCAI, SIGIR, SIGMOD, VLDB, ICDM, KDD) in DBLP dataset. Table 1 contains the list of ten topmost words and authors for the corresponding topics.

(B) In our first experiment with dyadic data, we study the effect of sparseness on the factor matrices in terms of the different evaluation measures outlined in Sect. 5.2.1. Our main focus is on \( SMTF _1\), based on (3), which uses the surrogate topic-document matrix Q. Additionally, we evaluate \( SMTF _2\), based on (1), for comparison with \( SMTF _1\). Because of the large running time of \( SMTF _2\), reported in Table 10, and since local convergence of \( SMTF _2\) is not guaranteed, as discussed in the “Appendix”, we evaluate \( SMTF _2\) only for a few parameter combinations, including those that perform best for \( SMTF _1\).

For \( SMTF _1\), we have seen that sparse \({\varPhi }\) does not help. Therefore, our main focus for the rest of the experiment is on the other two sparsity factors Q and \({\varTheta }\).

(B1) First, for \( SMTF _1\), we check how sparsity on Q using \(\lambda _Q\) affects test error. We wish to verify if imposing sparsity on Q fits the held out data better. Formally we check if: \(\min _{\lambda _{{\varPhi }},\lambda _{{\varTheta }}}\{ TE :\lambda _Q=0\}>\min _{\lambda _{{\varPhi }},\lambda _{{\varTheta }}}\{ TE :\lambda _Q> 0\}\).

Table 2 \( SMTF _1\) Test Error for \(\lambda _Q=0\) and \(\lambda _Q>0\)
Table 3 \( SMTF _2\) Test Error

In Table 2, we evaluate \( SMTF _1\) for different settings of \(\lambda _{{\varPhi }}\) and \(\lambda _{{\varTheta }}\), for \(\lambda _Q=0\) and \(\lambda _Q> 0\), on the three datasets. Our evaluations over different sparsity parameter settings showed that for a fixed value of \((\lambda _{{\varPhi }}, \lambda _{{\varTheta }})\), as \(\lambda _Q\) increases from zero, the TE curve is convex: TE decreases to a minimum and then increases again. The location of the minimum depends on the \((\lambda _{{\varPhi }}, \lambda _{{\varTheta }})\) value. We also noticed that a sparse \({\varTheta }\) does not affect TE, since we minimize the term \(\Vert Q-{\varTheta }\mathbb {A}\Vert _F^2\) so that the surrogate Q and the product \({\varTheta }\mathbb {A}\) remain close. Table 2 records the value of TE for \( SMTF _1\) over all curves when \(\lambda _Q=0\) and when \(\lambda _Q>0\) for the three datasets. We see that in all three datasets, for \( SMTF _1\), lower TE is achieved when \(\lambda _Q>0\), showing the importance of sparsity of Q for generalization.

We evaluate TE for \( SMTF _2\) by solving the following optimization problem: \(\text{ TE }(D_t;\hat{{\varPhi }}, \hat{{\varTheta }}) = \min _{\mathbb {A}\ge 0 } \frac{1}{2}\Vert D_t-\hat{{\varPhi }} \hat{{\varTheta }} \mathbb {A} \Vert _{F}^2\) while imposing sparsity on \(\hat{{\varTheta }} \mathbb {A}\), the analogue of Q in \( SMTF _1\). Table 3 records the value of TE for \( SMTF _2\) over the parameter combinations that perform best for \( SMTF _1\), together with (\(\lambda _{{\varPhi }}=0;\lambda _{{\varTheta }\mathbb {A}}=0\)), (\(\lambda _{{\varTheta }}=0;\lambda _{{\varTheta }\mathbb {A}}=0\)), (\(\lambda _{{\varTheta }\mathbb {A}}=0\)) and (\(\lambda _{{\varTheta }\mathbb {A}}>0\)), for the three different datasets. Unlike \( SMTF _1\), sparsity of \({\varTheta }\mathbb {A}\) in \( SMTF _2\) increases the TE. Sparsity individually in \({\varPhi }\) and in \({\varTheta }\) decreases the TE, and additional sparsity in \({\varTheta }\) along with sparsity in \({\varPhi }\) decreases TE further than sparsity in \({\varPhi }\) alone; however, sparsity in \({\varTheta }\) alone performs best for all three datasets. From Tables 2 and 3, \( SMTF _1\) performs better than \( SMTF _2\) in terms of generalization ability.

Next, we check how sparsity on Q using \(\lambda _Q\) in \( SMTF _1\) affects document-topic assignment accuracy (DTa). Similar to experiments for \( TE \), we hypothesize that sparsity on Q using \(\lambda _Q\) improves document-topic assignment accuracy (DTa) which we formally state as: \(\max _{\lambda _{{\varPhi }},\lambda _{{\varTheta }}}\{DTa-F1,DTa-ARI:\lambda _Q=0\}<\max _{\lambda _{{\varPhi }},\lambda _{{\varTheta }}}\{DTa-F1,DTa-ARI:\lambda _Q> 0\}\).

Table 4 \( SMTF _1\) Document-topic accuracy for \(\lambda _Q=0\) and \(\lambda _Q>0\)
Table 5 \( SMTF _2\) Document-topic accuracy for \(\lambda _{{\varTheta }\mathbb {A}}=0\) and \(\lambda _{{\varTheta }\mathbb {A}}>0\)

In Table 4, we evaluate \( SMTF _1\) for different settings of \(\lambda _{{\varPhi }}\) and \(\lambda _{{\varTheta }}\), for \(\lambda _Q=0\) and \(\lambda _Q> 0\), on DBLP and REV, which have gold-standard topics. As for TE, we see a similar trend for DTa, with the difference that the curve is concave: for a fixed value of \((\lambda _{{\varPhi }}, \lambda _{{\varTheta }})\), as \(\lambda _Q\) increases from zero, DTa first increases to a maximum and then falls off. Here also, DTa is not affected by sparsity on \({\varTheta }\). Table 4 records the value of DTa for \(\lambda _Q=0\) and for \(\lambda _Q>0\) across curves. For DBLP, we see that better DTa (in terms of both F1 and ARI) is achieved with \(\lambda _Q>0\), suggesting that sparsity in Q leads to better recovery of document-topic associations.

For the REV dataset, sparse Q is not as helpful for DTa: DTa improves only marginally for small \(\lambda _Q>0\), and then falls off.

Table 5 records the value of DTa for \( SMTF _2\) over the parameter combinations that perform best with \( SMTF _1\), for \(\lambda _{{\varTheta }\mathbb {A}}=0\) and \(\lambda _{{\varTheta }\mathbb {A}}>0\), on the two datasets. Additional sparsity on \({\varTheta }\) along with sparsity on \({\varTheta }\mathbb {A}\) (the analogue of Q in \( SMTF _1\)) does not help, and decreases DTa (see Table 5). With \(\lambda _{{\varTheta }}=0\), sparsity on \({\varTheta }\mathbb {A}\) has an effect similar to that in \( SMTF _1\) for the REV dataset. For DBLP, sparsity on \({\varTheta }\mathbb {A}\) improves DTa-F1 but not DTa-ARI. Tables 4 and 5 show that \( SMTF _1\) outperforms \( SMTF _2\) in terms of DTa-F1 and DTa-ARI for both datasets.

Table 6 \( SMTF _1\) Author-topic accuracy for \(\lambda _Q=0\) versus \(\lambda _Q>0\) and \(\lambda _{{\varTheta }}=0\) versus \(\lambda _{{\varTheta }}>0\)

(B1) For \( SMTF _1\), we evaluate the impact of the sparsity of Q and \({\varTheta }\), incorporated through \(\lambda _Q\) and \(\lambda _{\varTheta }\), on author-topic assignment accuracy (ATa). Here we check whether sparsity on both Q and \({\varTheta }\) increases author-topic assignment accuracy (ATa), i.e., decreases ATa-CCD. Formally, we check if: \(\min _{\lambda _{{\varPhi }}}\{ ATa \hbox {-} CCD :\lambda _Q=0, \lambda _{{\varTheta }}=0 \} > \min _{\lambda _{{\varPhi }}}\{ ATa \hbox {-} CCD :\lambda _Q > 0 , \lambda _{{\varTheta }}=0\}>\min _{\lambda _{{\varPhi }}}\{ ATa \hbox {-} CCD :\lambda _Q = 0 , \lambda _{{\varTheta }}> 0\}>\min _{\lambda _{{\varPhi }}}\{ ATa \hbox {-} CCD :\lambda _Q > 0 , \lambda _{{\varTheta }}> 0\} \).

In Table 6, we record the values of ATa-CCD for DBLP and REV by evaluating \( SMTF _1\) for the four settings \((\lambda _Q=0, \lambda _{{\varTheta }}=0)\), \((\lambda _Q>0, \lambda _{{\varTheta }}=0)\), \((\lambda _Q=0, \lambda _{{\varTheta }}>0)\) and \((\lambda _Q>0, \lambda _{{\varTheta }}>0)\) over specific values of \(\lambda _{{\varPhi }}\). We see that for both datasets, sparsity individually for Q and \({\varTheta }\) helps. Sparsity of \({\varTheta }\) helps more significantly than sparsity of Q. Sparsity of Q has an effect similar to that of DTa for both datasets. Simultaneous sparsity in both \({\varTheta }\) and Q brings even bigger benefits for both DBLP and REV datasets. The overall pattern is that sparsity in Q and sparsity in \({\varTheta }\) are both beneficial in different extents for author-topic assignment.

For similar settings, we evaluate \( SMTF _2\) for the best parameter combinations in Table 6 to obtain ATa-CCD for DBLP and REV, and record the corresponding results in Table 7. As for \( SMTF _1\), sparsity individually in \({\varTheta }\mathbb {A}\) and in \({\varTheta }\) helps, and sparsity of \({\varTheta }\) helps more than sparsity of \({\varTheta }\mathbb {A}\) here as well, for both datasets. Imposing sparsity on both \({\varTheta }\) and \({\varTheta }\mathbb {A}\) brings even bigger benefits for DBLP, but does not help for the REV dataset, where sparsity on \({\varTheta }\) alone helps the most. From Tables 6 and 7, we can observe that \( SMTF _1\) performs better than \( SMTF _2\) on the ATa-CCD metric.

Table 7 \( SMTF _2\) Author-topic accuracy for \(\lambda _{{\varTheta }\mathbb {A}}=0\) versus \(\lambda _{{\varTheta }\mathbb {A}}>0\) and \(\lambda _{{\varTheta }}=0\) versus \(\lambda _{{\varTheta }}>0\)

(B2) For dyadic data, the Collective Matrix Factorization Model (CMF) (Sachan and Srivastava 2013) and the probabilistic Author-Topic Model (ATM) (Rosen-Zvi et al. 2004) can perform sparse topical analysis, and we compare our proposed SMTF approaches, \( SMTF _1\) and \( SMTF _2\), against both of these baselines. For ATM, we use the available code, and search over the hyper-parameters \(\alpha \) and \(\beta \) to identify the best performing configurations. For CMF, since no source code is available, we use our own MATLAB implementation, following the specifications in the paper. (We impose \(l_1\)-sparsity on the factor matrices \(Q,{\varPhi }\) and \({\varTheta }\) and penalize their Frobenius norms.) We use projected subgradient descent to make an update for the unconstrained optimization and then project the updates onto the constrained space after each iteration. Again, we search over the sparsity parameters of \(Q,{\varPhi }\) and \({\varTheta }\) to identify the best configuration.

In Table 8, we report the best performance for each of the compared models across parameter configurations, in terms of test error, document-topic accuracy and author-topic accuracy, for the three datasets. While the author-topic model performs best at getting the document-topic assignment correct, our model \( SMTF _1\) performs best on another important metric, namely author-topic assignment, which can be used in applications such as automated reviewer recommendation (Rosen-Zvi et al. 2004), recommending researchers with similar interests for academic collaboration, and friend recommendation in social media such as Facebook and Twitter; it also has the best generalization error.

We do not report TE for ATM, since it is a probabilistic model whose training criterion is to maximize the likelihood rather than to minimize the Frobenius error.

Table 8 Performance comparison between SMTF, CMF and ATM

5.2.2 Evaluation on non-document data

(B3) We also performed an experiment to compare the models for non-document items that have a small number of binary features. To simulate non-document items, such as products or movies, which are associated with a small number of binary features, we modify the DBLP dataset as follows. We restrict the vocabulary by taking only the top 3 words from each of the 8 topics (associated with conferences), and select only those documents that contain at least one of these words. For any selected document and word, the corresponding entry in D is set to 1 if the document contains the word, and to 0 otherwise. This results in a dataset containing 5018 training documents and 1145 test documents, with a total of 3366 authors. The vocabulary has 24 words, with a total of 27 thousand word occurrences. We call this the synthetic Product dataset (PROD_s).

The results are recorded in Table 9. We see a similar trend here: \( SMTF _1\) performs best in terms of TE, \( SMTF _2\) in terms of ATa-CCD, ATM and \( SMTF _2\) in terms of DTa-F1, and ATM in terms of the DTa-ARI metric.

Table 9 Performance comparison between SMTF, CMF and ATM for synthetic product dataset
Table 10 Run-time comparison between SMTF, CMF and ATM for three datasets in minutes

5.2.3 Execution time

(B4) Finally, we record in Table 10 the sequential run-time for our model SMTF (\( SMTF _1\) and \( SMTF _2\)) and the baselines CMF and ATM for each dataset, run until convergence on an Intel machine with 6-core processors and 24GB RAM.

We see from Table 10 that ATM has a much smaller sequential execution time than SMTF and CMF. The execution time of \( SMTF _2\) is approximately 3 times higher than that of \( SMTF _1\). This is because, for solving the sub-problem involving \({\varTheta }\) in (1), the Lipschitz constant of the gradient of \(\frac{1}{2}\Vert D-{\varPhi }{\varTheta }\mathbb {A}\Vert ^2_F\) with respect to \({\varTheta }\) is not known, so we choose projected FISTA with backtracking to solve for \({\varTheta }\). Note that this problem cannot be decomposed into a series of independent \(l_1\)-regularized non-negative least squares problems, because \({\varTheta }\) occupies the middle position in the product. Hence, the computation of step 2 in Algorithm 2 and of \(p_{L_k}(y_k)\) requires matrix multiplications, leading to an enormous increase in running time. Additionally, the sub-problem involving \(\mathbb {A}\) takes twice the amount of time required for solving \(\mathbb {A}\) in (3), because the coefficient matrix (\({\varPhi }{\varTheta }\)) in (1) has more rows (v rather than t) than the corresponding coefficient matrix (\({\varTheta }\)) in (3) used to solve \(\mathbb {A}\).

Regarding the parallel implementation of SMTF and the baselines, to the best of our knowledge, no parallel version of ATM is known. In the other baseline, CMF, even though the sub-problems for some of the factor matrices cannot be decomposed into a series of independent sparse non-negative least squares problems, they can be parallelized provided the matrix multiplications in the update step for the factor matrices are executed in parallel.

For the parallel execution of \( SMTF _1\), the following factors may be considered for high scalability:

(A1) Parallel computation of the independent sub-problems (sparse non-negative least squares problems) for each of the factor matrices \({\varPhi }\), \({\varTheta }\), Q and \(\mathbb {A}\); (A2) use of the backtracking version of projected FISTA for solving the \({\varTheta }\) matrix, because computing the Lipschitz constant, i.e., the largest eigen-value of \(PP^T\), would take a long time in projected FISTA with constant step size; (A3) for the sub-problem in the \({\varTheta }\) matrix, parallelizing the computation of \(p_{L_k}(y_k)\), which involves matrix-to-vector multiplications, as well as the computation of the Lipschitz constant in step 2 of Algorithm 2, which contains independent terms involving matrix-to-vector multiplications.

For \( SMTF _2\), we consider similar factors as for \( SMTF _1\) such as:

(C1) Parallel computation of the independent sub-problems (sparse non-negative least squares problems) for the factor matrices \({\varPhi }\) and \(\mathbb {A}\); (C2) use of the backtracking version of projected FISTA for solving the \({\varTheta }\) matrix, as explained above; (C3) for the sub-problem in the \({\varTheta }\) matrix, parallelizing the computation of \(p_{L_k}(y_k)\), which involves matrix multiplications, as well as the computation of the Lipschitz constant in step 2 of Algorithm 2, which contains independent terms involving matrix multiplications.

To summarize the experiments, we have seen the role that sparsity of Q and \({\varTheta }\) plays in the performance of the proposed sparse tri-factorization approach: sparsity leads to improvements in test error and in the recovery of both document-topic and user-topic associations. Additionally, we have experimented with the original sparse tri-factorization formulation without the surrogate topic-document matrix for comparison purposes. Further, for the task of topical analysis of dyadic data, our approach outperforms the existing factorization baselines and is competitive with respect to the probabilistic author-topic model.

6 Conclusion

We explored the problem of sparse topical analysis of dyadic data using a non-negative matrix tri-factorization framework. To make the matrix tri-factorization formulation tractable, we introduced a surrogate matrix for the product of the topic-user (\({\varTheta }\)) and user-document (\(\mathbb {A}\)) matrices, with \(l_1\)-sparsity constraints on the individual factor matrices as well as on the product of \({\varTheta }\) and \(\mathbb {A}\), and support constraints on \(\mathbb {A}\). This has not been studied before. We used projected FISTA for solving each of the factor matrices \({\varPhi }\), Q, \({\varTheta }\) and \(\mathbb {A}\) in an alternating minimization framework that supports parallelism. Experimentally, we have demonstrated that the proposed approach outperforms existing baselines for the task of sparse topical analysis of dyadic data.