1 Introduction

In the field of clustering, and more generally in data mining, one of the biggest open problems is the optimization subject to binary constraints. The imposition of binary constraints in the objective of the optimization corresponds, in the context of clustering, to requiring that each observation clearly belongs to one cluster or not, with no gray areas. More generally, binary constraints arise from a request for interpretable and definite data mining results: is a picture showing a cat? Should a movie be recommended to this user? Should the next chess move be this one? Binary results provide definite yes or no answers to the questions arising when solving data mining tasks.

Many methods are able to solve binary-constrained problems. However, they mostly work under one condition: exclusivity. In particular, the assumption is that if a picture shows a cat, then it cannot show a dog; if a movie is assigned to one cluster (e.g., a genre), then it cannot belong to another cluster (i.e., to another genre); there should be only one possible next chess move, which is the optimal one. From these examples, we can easily observe that the exclusivity assumption may or may not make sense, depending on the specific application. For example, for the movie genre case, and for movie recommendation in general, the exclusivity assumption is inappropriate: there is no one-to-one relation between genres and groups of movies, or between groups of users and movies. This scenario provides an ideal motivation for the fundamental contribution of the biclustering task: the simultaneous clustering (identification of groups) of users and movies, where a bicluster is a selection of users and movies, such that the users give similar ratings for the movies in the group, and the movies have similar ratings from the users in the group. Unfortunately, this motivation only works for the philosophy behind biclustering; many algorithms for biclustering impose the exclusivity constraint, even though it is not necessarily required by the task definition. Provided that they are not constrained by such an exclusivity assumption, biclusters could, e.g., represent a group of science-fiction fans and science-fiction movies. A science-fiction fan (usually) does not exclusively like science-fiction movies. Analogously, a science-fiction movie is not exclusively watched by science-fiction fans. In this respect, the exclusivity assumption clearly imposes stringent, unrealistic constraints.

Similar observations can be drawn in other biclustering applications. For example, biclustering of gene-expression data is employed to identify groups of genes and patients which are strongly linked by similar expression levels. Such an analysis can be used to discover functions of genes related to clinical traits. However, one gene generally does not have one single function in an organism, but is actually involved in multiple biological processes (Pio et al. 2015). On the other hand, not every gene necessarily plays a significant role in the considered conditions. In this case, the exclusivity assumption would force every gene to belong to one cluster. Hence, outliers, or isolated objects, could be improperly modelled in the presence of the exclusivity assumption.

A popular way to circumvent the difficulties of binary optimization is to relax the binary constraint into a nonnegative and/or orthogonal one (cf. Sect. 3.2). However, the resulting fuzzy clusters are not always easy to interpret and are usually difficult to process automatically.

In this paper we propose the method Broccoli (Binary RObust Co-Clustering Optimization through alternating LInearized minimization) to obtain models which can handle cluster overlap and the presence of outliers. Broccoli employs a penalization approach, where the relaxed objective is optimized while the violation of binary constraints is penalized. The penalization degree is optimized during the training, such that the returned minimizer of the objective function satisfies the binary constraints. Our optimization method is based on the theoretically founded framework of Proximal Stochastic Gradient Descent (PSGD) for matrix factorizations (Driggs et al. 2020).

We evaluate Broccoli on synthetic and real-world data, showing that it is able to detect biclusterings of various structures while being robust to noise in the data. A qualitative inspection reveals that Broccoli is able to derive meaningful clusters, which are interpretable by their modular structure.

2 Background of biclustering models

The biclustering task is relevant when relationships in both observations and features (or rows and columns) have to be captured. Several biclustering approaches have been proposed in the literature, including spectral methods (Dhillon 2001; Kluger 2003), iterative greedy methods (Pio et al. 2012, 2013), matrix factorization methods (Long et al. 2005; Yoo and Choi 2010; Del Buono and Pio 2015), and multi-type methods (Barracchia et al. 2020).

In this paper, we focus on biclustering models based on matrix factorization. Matrix factorization is applied in multiple contexts, including image reconstruction (Zhou and Qi 2011; Yokota et al. 2019) and data representation (Cai et al. 2011; Wang et al. 2018). It is also intrinsically linked to clustering, due to its recognized role as a general framework for clustering objectives (Ding et al. 2006a; Pompili et al. 2014).

Existing clustering algorithms based on matrix factorization mainly rely on an iterative optimization approach that aims to find a factorization of the input data matrix into two or more matrices providing cluster membership information. Specifically, given a data matrix \(D \in {\mathbb {R}}^{m \times n}\), there exist suitable numerical methods to optimize a nonnegative matrix tri-factorization:

$$\begin{aligned} \min _{X,Y,C}&\Vert D-YCX^\top \Vert ^2&\text {s.t. } X\in {\mathbb {R}}_+^{n\times r_x}, \, Y\in {\mathbb {R}}_+^{m\times r_y}, \, C\in {\mathbb {R}}_+^{r_y\times r_x}, \end{aligned}$$

where the matrix Y has an interpretation as a fuzzy row-cluster indicator, X as a fuzzy column-cluster indicator, \(r_y\) is the number of row-clusters, and \(r_x\) is the number of column-clusters. The matrix C is the core matrix that assigns a weight to each (fuzzy) bicluster, i.e., to each pair of a row cluster and a column cluster. In particular, \(Y_{js}\) generally represents the degree of membership of the jth row of the data matrix to the row cluster s, while \(X_{it}\) represents the degree of membership of the column i of the data matrix to the column cluster t.

Depending on the application, different constraints can be imposed. For example, if we force Y to be binary and to have orthogonal columns, then the objective above becomes equivalent to the one of k-means clustering (Bauckhage 2015). In this case, the matrix Y indicates the assigned cluster for each observation, while the matrix \(CX^\top \) represents the centroids. We denote the space of \((m\times r_y)\)-dimensional cluster indicator matrices, implementing the exclusivity assumption, with \(\mathbb {1}^{m\times r_y}\). We have \(Y\in \mathbb {1}^{m\times r_y} \) if and only if \(Y\in \{0,1\}^{m\times r_y}\) and \(|Y_{j\cdot }|=1\) for every row j. In essence, \(Y\in \mathbb {1}^{m\times r_y}\) exactly if Y is binary, has orthogonal columns, and contains no zero rows.
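The following toy sketch (not part of our implementation) illustrates this equivalence numerically: with Y restricted to the \(\mathbb {1}\)-space, one alternating step assigns every observation to its nearest centroid and recomputes the centroids (the role played by \(CX^\top \)) as cluster means, i.e., exactly one k-means iteration.

```python
import numpy as np

# Toy illustration: with Y in the 1-space, ||D - Y M||^2 is the k-means
# objective, where M plays the role of C X^T (the centroids).
rng = np.random.default_rng(0)
D = rng.random((6, 4))                       # m=6 observations, n=4 features
M = D[rng.choice(6, 2, replace=False)]       # 2 initial centroids

# assignment step: Y is binary with exactly one 1 per row
dist = ((D[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
Y = np.eye(2)[dist.argmin(axis=1)]

# update step: for fixed Y, the optimal centroids are the cluster means
M = (Y.T @ D) / np.maximum(Y.sum(axis=0)[:, None], 1)

print(np.linalg.norm(D - Y @ M) ** 2)        # the k-means sum of squared errors
```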

If we now want to model biclusters, assigning observations as well as features to clusters, without making use of the exclusivity assumption, then we impose binary constraints on the matrices X and Y:

$$\begin{aligned} \min _{X,Y,C}&\Vert D-YCX^\top \Vert ^2&\text {s.t. } X\in \{0,1\}^{n\times r_x}, \, Y\in \{0,1\}^{m\times r_y}, \, C\in {\mathbb {R}}_+^{r_y\times r_x}. \end{aligned}$$

In this case, \(Y_{js}=1\) (resp., \(Y_{js}=0\)) means that the jth row of the data matrix is (resp., is not) assigned to the row cluster s, while \(X_{it}=1\) (resp., \(X_{it}=0\)) means that the column i of the data matrix is (resp., is not) assigned to the column cluster t. Again, the matrix C assigns a weight to each bicluster, i.e., to each pair of a row cluster and a column cluster.

Table 1 Overview of biclustering objectives based on matrix factorization

In Table 1, we show three common biclustering models: checkerboard, diagonal and binary biclustering.

Fig. 1

Visualization of checkerboard (on the top), block-diagonal (second from the top), binary (third from the top, with non-overlapping row-clusters but overlapping column-clusters) and the Broccoli (on the bottom) biclustering models. Every bicluster is a combination of a row-cluster and a column-cluster, which we visualize here as a block. Real values are indicated by colors, while binary values are indicated in black and white (black represents a one; white represents a zero). Best viewed in color

We observe that the objectives of checkerboard and diagonal biclustering impose the exclusivity by constraining the binary factor matrices to the \(\mathbb {1}\)-space. Since the exclusivity assumption enforces row- and column-clusters not to overlap, for checkerboard and diagonal biclusters there exists a permutation of rows and columns, such that each bicluster appears as a coherent block. This is shown in Fig. 1, where the area of the bicluster combining row-cluster s with column-cluster t is filled with one color, representing the value \(C_{st}\). This clustering concept goes back to Hartigan (1972).

A special case of biclustering arises if the data matrix is binary. In this case, a decomposition into two binary matrices (where the core matrix is equal to the identity matrix) may benefit the interpretability of the result (see Table 1). We refer to such factorizations as binary biclustering. For this task, the exclusivity constraint is usually not imposed on both binary factor matrices. This is because it would lead to a binary block-diagonal model, which is too simplistic for most applications. However, binary biclusters do have an inbuilt penalization of overlap, because overlapping areas of binary biclusters are approximated with a value of 2—introducing an approximation error of at least one. As a result, every pair of overlapping biclusters adds to the approximation error a value equal to the size of the overlapping area. Consequently, binary biclusters have usually either non-overlapping row- or non-overlapping column-clusters, as depicted in Fig. 1 (third row).
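A toy example (ours, for illustration only) makes this inbuilt penalization concrete: two binary biclusters sharing a single cell approximate that cell with the value 2, so a binary data matrix covered by both biclusters incurs an error equal to the size of the overlap.

```python
import numpy as np

# Two overlapping binary biclusters: the middle row and column belong to both.
Y = np.array([[1, 0], [1, 1], [0, 1]])       # row-cluster indicators
X = np.array([[1, 0], [1, 1], [0, 1]])       # column-cluster indicators

approx = Y @ X.T                             # the overlapping cell is approximated by 2
D = (approx > 0).astype(int)                 # a binary data matrix covered by the biclusters
print(approx)
print(np.abs(D - approx).sum())              # error = size of the overlapping area = 1
```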

In this paper, we aim to get rid of restrictive constraints on the clustering structure. Our objective (see Table 1) resembles that of checkerboard biclustering, with the difference that our cluster indicator matrices do not enforce the exclusivity constraint. The result may yield overlapping row- and column-clusters (as shown in Fig. 1), and identify outlier observations (rows of the data matrix which are not assigned to any row-cluster) and outlier features (columns of the data matrix which are not assigned to any column-cluster).

3 Related work

The major challenge faced by clustering methods that do not assume exclusivity comes from the optimization phase. While the fuzzy real-valued counterparts can suitably be optimized by numerical methods, the discrete nature of binary clustering tasks does not allow for a direct adaptation of such optimization methods. Deriving definite cluster assignments is an inherently complex combinatorial problem. The major advantage of using the exclusivity assumption is the possibility to adopt an efficient iterative optimization scheme based on alternating minimization (cf. Sect. 3.1). If the exclusivity assumption is to be dropped, then the usual way is to relax the binary clustering constraints as well, such that numerical optimization methods can be applied (cf. Sect. 3.2). It is noteworthy that the approach proposed in this paper, for the optimization of non-exclusive clusterings, is also based on a relaxation of the constraints. In particular, our approach is related to the penalization approach reviewed in Sect. 3.3. However, in our method, the relaxations are gradually reversed as part of the optimization, such that we obtain binary solutions in the end.

3.1 Methods based on alternating minimization

Alternating minimization for clustering was introduced by Lloyd (1982) with the k-means algorithm. Iteratively, one of the factor matrices is optimized while the other factor matrices are kept fixed. The exclusivity assumption enables the analytical derivation of the optimizer in every iteration (Gaul and Schader 1996; Mirkin et al. 1995). In other words, it is not necessary to apply gradient descent at every optimization step, but it is possible to directly state the optimum for one of the matrices. This way, the optimization subject to binary constraints is facilitated. The alternating minimization scheme has been implemented for checkerboard biclustering (Vichi 2001; Wang et al. 2011; Cho et al. 2004) and for diagonal biclustering (Han et al. 2017; Song et al. 2020). Koyutürk and Grama (2003) and Li (2005) proposed alternating minimization for binary matrix factorization, where one of the factor matrices is constrained to satisfy the exclusivity assumption. In this scenario, row-clusters are always non-overlapping, but column-clusters may overlap, or vice versa. Although alternating minimization is an elegant and theoretically founded optimization method, it has the drawback that the global, yet separate, minimization in every step tends to converge to a minimum which is not far from the starting point. This behavior makes alternating minimization very sensitive to the initialization (Zhou et al. 2015). In contrast, gradient-based methods are generally more likely to explore the hypothesis space before converging to a stationary point (assuming that the chosen step-size is not particularly small).

3.2 Methods based on (soft-) orthogonal relaxations

Since the binary constraints in clustering objectives hinder the application of numerical optimization methods, relaxations of the binary constraints have been adopted in the literature. The most popular approaches are based on nonnegative (soft-) orthogonal relaxations (Ding et al. 2006b; Yoo and Choi 2010; Zha et al. 2001; Dhillon 2001; Nie et al. 2017). These methods aim at solving the following objective:

$$\begin{aligned} \min _{Y\ge 0, C\ge 0, X\ge 0}\left\Vert D-YCX^\top \right\Vert ^2, \text { s.t. } Y^\top Y = I, \ X^\top X = I. \end{aligned}$$
(1)

The orthogonal relaxation is interesting because of its correspondence to hard (binary) clusterings: requiring that the columns of X and Y are orthogonal and nonnegative implies that every row has at most one nonzero entry.

Moreover, matrices having orthogonal columns may contain rows entirely filled with zeros, indicating that the corresponding observation or feature is not assigned to any cluster. However, orthogonal nonnegative matrix factorization is an NP-hard problem (Asteris et al. 2015), and in practice only soft-orthogonality of the matrices is achieved. On the one hand, the resulting fuzzy, soft-orthogonal indication of clusters is in principle suitable to model overlapping clusters as well. On the other hand, the fuzzy indication of clusters requires a posteriori decisions in order to obtain definite cluster membership indications. A straightforward approach would be to assign observation j (or feature i) to the k clusters having the highest values in \(Y_{j\cdot }\) (or \(X_{i\cdot }\)). Of course, imposing exclusivity would correspond to setting \(k = 1\) (Del Buono and Pio 2015; Yoo and Choi 2010), but determining the correct value of k for each observation or feature in an overlapping setting may be very problematic.

The optimization of (soft)-orthogonal factorizations is usually performed with multiplicative updates. These updates can be considered as gradient descent, where the step-size is chosen small enough such that the constraints are not violated. Unfortunately, the conservative choice of the step-size results in a slow convergence rate.

Fig. 2

Binary penalization functions: the Mexican hat function and \(\varLambda \)

3.3 Methods based on penalization of nonbinary values

One of the very few attempts to solve binary constrained biclustering, without imposing the exclusivity, is the penalization approach proposed by Zhang et al. (2007, 2010, 2013). They propose a multiplicative update algorithm to minimize the approximation error together with a term that penalizes non-binary values. The employed penalization term is the Mexican hat function, shown in Fig. 2. Such a penalization scheme has also been applied for Boolean matrix factorization. This is a variant of binary biclustering, where clusters are explicitly allowed to overlap. Here, the matrix multiplication is computed in Boolean algebra, yielding \(1\oplus 1=1\). Note that, in binary matrix factorization, areas where two biclusters overlap are approximated by \(1+1=2\), areas where three biclusters overlap are approximated by \(1+1+1=3\), and so on. Hence, overlap of binary biclusters introduces an approximation error to the binary data matrix. In Boolean algebra, this is not the case and we always obtain a binary approximation by the Boolean product. In this context, Hess et al. (2017) proposed a proximal optimization scheme, employing the penalization function \(\varLambda \) defined in Eq. (3) and shown in Fig. 2. Adopting the function \(\varLambda \) instead of the Mexican hat function has the advantage of leading to more matrix entries which are actually binary, instead of just being close to binary values. The optimization of a Boolean factorization in elementary algebra is approached by an optimization of the penalized objective and a subsequent thresholding to binary matrices. As we will detail in Sect. 4, Broccoli is built upon this approach, while exploiting the fact that the checkerboard biclustering objective does not require the approximation of a multiplication in another algebra.

3.4 Methods to explicitly model overlaps and outliers

In the literature, we can find a few methods which aim to specifically model outliers and overlaps among clusters. For example, Whang and Dhillon (2017) propose a semidefinite program that allows for a specified amount of overlap and a specified amount of outliers for row- and column-clusters. This approach introduces four parameters, which are not easy to estimate beforehand. Laclau and Brault (2019) propose a probabilistic binary biclustering model, which simultaneously facilitates feature selection. In this way, feature-(column-) clusters are not required to be exhaustive, and implicitly discard features that can be considered outliers. However, the same does not hold for the observations, and the identified clusters are still disjoint.

A specific context in which bicluster overlap has been explicitly modelled is that of bioinformatics. Among the approaches proposed in the literature, it is worth mentioning Floc (Yang et al. 2005), that addresses the issues originally present in the sequential approach proposed by Cheng and Church (2000). Floc is a probabilistic algorithm that can discover a set of possibly overlapping biclusters simultaneously, by iteratively and greedily determining the best actions (called moves) to perform for each row and column, as long as an improvement of the objective function is observed. Another relevant method is Fabia (Hochreiter et al. 2010), that is based on a multiplicative model which assumes that two feature vectors are similar if one is a multiple of the other, i.e., if the angle between them is zero or, as realizations of random variables, their correlation coefficient is 1 or − 1. This model was specifically adopted to capture possible linear dependencies between gene expressions and conditions, as well as heavy tailed distributions, which are typically observed in real-world transcriptomic data. However, although generally able to discover overlapping biclusters, such approaches may appear to be inadequate if the underlying assumptions do not hold for the data at hand.

4 A novel stochastic optimization for non-exclusive biclusters

The objective of the biclustering task is nonconvex, as known for matrix factorizations. This entails that there are multiple local optima, which are typically not all suited to reflect a true, underlying clustering structure. Moreover, binary constraints on the matrices make this issue even more evident: every binary matrix induces a local optimum. Indeed, every binary matrix is the only feasible (and, therefore, the best) optimizer within its \(\epsilon \)-ball neighborhood for small enough \(\epsilon \). Therefore, the optimization of the non-binary core matrix C, when fixing two arbitrary binary matrices, leads to a local optimum. It is noteworthy that a real-valued core matrix, as in the case of checkerboard biclustering, can lead to a significant decrease in the approximation error, even if the biclustering represented by the binary matrices is far away from the global optimum. This phenomenon makes it hard to distinguish between local optima and the global optimum by means of the objective function value (i.e., by observing the approximation error). In other words, having a good optimizer is not enough: We need a very good optimizer which simultaneously (i) handles the existence of many local optima that are almost indistinguishable from the global optimum by observing only the objective function, (ii) integrates binary constraints, and (iii) is robust to noise and can handle the presence of outliers.

4.1 Gradually increasing penalization of nonbinary values

We propose a biclustering optimization scheme based on the stochastic, proximal optimization framework SPRING (Driggs et al. 2020). Stochastic optimization computes the gradient descent updates in every iteration on a batch of the data. This makes stochastic optimization suitable for large scale data. Stochastic gradient descent is also known for its generalizing properties (Hardt et al. 2016; Hoffer et al. 2017).

We propose to optimize the following objective:

$$\begin{aligned} \min _{\begin{array}{c} X,Y,C,\\ \varvec{\lambda }_x,{\varvec{\lambda }}_y \end{array}}&\ \frac{1}{mn}\Vert D-YCX^\top \Vert ^2 + \langle {\varvec{\lambda }}_x,\varLambda (X)-{\mathbf {1}}\rangle +\langle {\varvec{\lambda }}_y,\varLambda (Y)-{\mathbf {1}}\rangle + \phi _c(C) \nonumber \\ \text {s.t. }&\ {\varvec{\lambda }}_x\in [0,\theta ]^{n\times r},{\varvec{\lambda }}_y\in [0,\theta ]^{m\times r}. \end{aligned}$$
(2)

The parameter \(\theta \) is employed here as a placeholder for an upper bound on the regularization weights \({\varvec{\lambda }}_x\) and \({\varvec{\lambda }}_y\) that is large enough for the factor matrices Y and X minimizing Eq. (2) to be binary. Bounding the regularization weights from above by \(\theta \) ensures that the objective in Eq. (2) is well-defined. However, we do not need to determine the parameter \(\theta \) in practice.

The regularization term \(\langle {\varvec{\lambda }}_x,\varLambda (X)-{\mathbf {1}}\rangle \) is composed of two parts: the nonbinary penalization term \(\langle {\varvec{\lambda }}_x, \varLambda (X)\rangle \) and the term \(-\langle {\varvec{\lambda }}_x,{\mathbf {1}}\rangle \), which pushes \({\varvec{\lambda }}_x\) to be as large as possible. The matrix \({\mathbf {1}}\) is here a constant one matrix, whose dimensionality is inferred from the context. The parameter matrices \({\varvec{\lambda }}_x\) and \({\varvec{\lambda }}_y\) are the regularization weights of the non-binary penalization functions \(\varLambda (X)\) and \(\varLambda (Y)\). The matrix \(\varLambda (X)=(\varLambda (X_{is}))\) is defined by the elementwise application of the function:

$$\begin{aligned} \varLambda (x) = {\left\{ \begin{array}{ll} -|1-2x|+1 &{} \quad x\in [0,1],\\ \infty &{} \quad \text {otherwise.} \end{array}\right. } \end{aligned}$$
(3)

The function \(\varLambda \) (graphically depicted in Fig. 2) reaches its maximum value (1.0) at 0.5, its minimum value (0.0) at binary values, and returns infinity outside of the interval [0, 1]. It defines the feasible set of the matrices \(X\in [0,1]^{n\times r}\) and \(Y\in [0,1]^{m\times r}\) in the objective function [Eq. (2)].

The Frobenius inner product

$$\begin{aligned} \langle \varvec{\lambda },\varLambda (X)\rangle = \sum _{i,s}\varvec{\lambda }_{is}\varLambda (X_{is}) \end{aligned}$$

sums the elementwise penalization terms weighted by the parameters \(\varvec{\lambda }\).
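As a minimal sketch (with hypothetical variable names), the penalty of Eq. (3) and its weighted sum can be computed as follows:

```python
import numpy as np

def Lambda(X):
    """Elementwise non-binary penalty of Eq. (3): 0 at binary values, 1 at 0.5, inf outside [0, 1]."""
    return np.where((X >= 0) & (X <= 1), 1 - np.abs(1 - 2 * X), np.inf)

def weighted_penalty(lam, X):
    """Frobenius inner product <lam, Lambda(X)> of the weights and the elementwise penalties."""
    return np.sum(lam * Lambda(X))

X = np.array([[0.0, 0.5], [0.9, 1.0]])
lam = np.full(X.shape, 0.1)                  # hypothetical regularization weights
print(Lambda(X))                             # [[0.  1. ] [0.2 0. ]]
print(weighted_penalty(lam, X))              # 0.12
```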

The function \(\varLambda \) is non-smooth, but feasible for optimization by proximal gradient descent. Proximal optimization is a theoretically well-founded way to facilitate the optimization of non-smooth and possibly non-continuous terms of the objective. It is particularly used when the loss is a smooth function, and the constraints are possibly non-smooth but simple penalization terms. The proximal mapping of a function \(\phi \) is a function which returns a matrix solving the following optimization problem:

$$\begin{aligned} {{\,\mathrm{prox}\,}}_\phi (X) \in \mathop {\hbox {arg min}}\limits _{X^*}\left\{ \tfrac{1}{2}\,\Vert X-X^*\Vert ^2+\phi (X^*)\right\} . \end{aligned}$$
(4)

Loosely speaking, the proximal mapping gives its argument a little push into a direction which minimizes \(\phi \). For a detailed discussion, see, e.g., the work by Parikh et al. (2014). Since this operator is employed in every iteration, computing it by means of numerical methods would be infeasible in practice. The trick is to use only simple functions \(\phi \) for which the proximal mapping can be calculated analytically, in closed form.

The proximal operator for \(\varLambda \) has been shown by Hess et al. (2017) to satisfy, for \(x\in {\mathbb {R}}\)

$$\begin{aligned} {{\,\mathrm{prox}\,}}_{\lambda \varLambda }(x)= {\left\{ \begin{array}{ll} \max \{0,x-2\lambda \} &{} \quad x\le 0.5, \\ \min \{1,x+2\lambda \} &{} \quad x>0.5. \end{array}\right. } \end{aligned}$$
(5)

The larger the regularization weight \(\lambda \), the more the corresponding matrix value is pushed into the direction of binary values. After every gradient descent step of one of the cluster indicator matrices, the prox-operator is applied and pushes the matrix towards binary values. As a result, if we choose \(\lambda \) large enough, then we will get binary matrices after a couple of iterations. However, in this case, we also risk converging to a local optimum close to the initialization. This would make our method even more sensitive to the initialization than it already is due to the nonconvexity of the objective. In turn, if we choose too small a value for \(\lambda \), then the minimizer of the penalized objective might not be binary.
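A minimal sketch of the proximal mapping in Eq. (5), assuming \(\lambda \) may be given either as a scalar or as a matrix of elementwise weights:

```python
import numpy as np

def prox_Lambda(X, lam):
    """Elementwise proximal mapping of lam * Lambda, cf. Eq. (5): entries at most 0.5
    are pushed towards 0, entries above 0.5 towards 1, and the result stays in [0, 1].
    lam may be a scalar or a matrix of individual regularization weights."""
    down = np.maximum(0.0, X - 2 * lam)
    up = np.minimum(1.0, X + 2 * lam)
    return np.where(X <= 0.5, down, up)

print(prox_Lambda(np.array([0.05, 0.4, 0.6, 0.97]), 0.1))   # [0.  0.2 0.8 1. ]
```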

In order to circumvent these issues, we gradually increase the regularization weights throughout the optimization process. In addition, we employ individually determined regularization weights. To this end, we introduce the regularization weights as optimization parameters and subtract the sum of the regularization parameters \(\langle {\varvec{\lambda }}_x,{\mathbf {1}}\rangle + \langle {\varvec{\lambda }}_y,{\mathbf {1}}\rangle \) from the objective function value. As a result, matrix entries which naturally fall close to a binary value receive a stronger push towards binary values than those entries which are undecided (i.e., close to 0.5).

4.2 The method Broccoli

We formally describe our method Broccoli in Algorithm 1. The method returns a non-exclusive biclustering model, which is defined by the specification of constraints on the core matrix by \(\phi _c\). In this paper, we focus on the identification of nonnegative checkerboard clusterings. Specifically, we set

$$\begin{aligned}\phi _c(x) = {\left\{ \begin{array}{ll} \infty , &{} \quad \text {if } x<0 \text { or } x>max_c \\ 0, &{} \quad \text {otherwise}. \end{array}\right. } \end{aligned}$$

The proximal operator \({{\,\mathrm{prox}\,}}_{\alpha \phi _c}(C)\) projects the elements of the matrix C onto the interval \(C_{s,t}\in [0,max_c]\), with the effect of bounding the maximum value of C. This prevents an imbalance of the matrices where X or Y have very low values, which are compensated by large values in C.

The input of our method Broccoli is the data matrix D, the rank r of the factorization, and the step-size \(\gamma \) of the gradient descent steps updating the regularization parameters \({\varvec{\lambda }}_x\) and \({\varvec{\lambda }}_y\).

Algorithm 1

In every iteration, a batch of columns of the data matrix is sampled and the matrices C and X, as well as the regularization weights, are updated. Likewise, a batch of rows of the data matrix is sampled, on which the matrices C and Y are updated. Every update computes the proximal mapping of the gradient step, performed on the respective batch. The step-size is chosen by means of the Lipschitz constant of the gradient (cf. Appendix A). Under these circumstances, SPRING guarantees convergence to a local minimum in expectation (Driggs et al. 2020).
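To make one such update concrete, the following sketch shows a single stochastic proximal gradient step on X for a sampled column batch; it illustrates the scheme under our notation and is not the reference PyTorch implementation (the analogous updates of Y, C and of the regularization weights are omitted). The update of C only differs in the proximal mapping, which for \(\phi _c\) reduces to clipping the entries to \([0, max_c]\).

```python
import numpy as np

def update_X_batch(D, Y, C, X, lam_x, cols):
    """One stochastic proximal gradient step on X for a sampled column batch.
    D: m x n data, Y: m x ry, C: ry x rx, X: n x rx, lam_x: n x rx penalty weights."""
    m, b = D.shape[0], len(cols)
    Db, Xb, lam_b = D[:, cols], X[cols], lam_x[cols]

    YC = Y @ C                                   # m x rx
    R = Db - YC @ Xb.T                           # residual on the batch
    grad = -(2.0 / (m * b)) * (R.T @ YC)         # gradient of the batch loss w.r.t. Xb

    # step size from the Lipschitz constant of the gradient (cf. Appendix A)
    L = (2.0 / (m * b)) * np.linalg.norm(YC.T @ YC, 2)
    step = 1.0 / max(L, 1e-12)

    Z = Xb - step * grad
    # proximal mapping of the weighted non-binary penalty, cf. Eq. (5)
    X[cols] = np.where(Z <= 0.5,
                       np.maximum(0.0, Z - 2 * step * lam_b),
                       np.minimum(1.0, Z + 2 * step * lam_b))
    return X
```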

4.2.1 Initialization

As generally known from matrix factorization, the initialization may influence the results. There are multiple issues which can occur during the optimization of matrix factorizations with binary constraints, and which can be alleviated with a good initialization method. One of these issues is an imbalance in the scale of the factor matrices (Zhang et al. 2007). For nonnegative matrix factorization this is not a big issue. However, it is a problem when binary constraints are imposed. Suppose that the ground truth biclustering is given by the matrices \(Y^*,X^*\) and \(C^*\). Let \({\mathbf {x}},{\mathbf {y}}\in {\mathbb {R}}^r\) be positive vectors. Then, for \(Y= {Y^*}{{\,\mathrm{diag}\,}}({\mathbf {y}})^{-1}\), \(C={{\,\mathrm{diag}\,}}({\mathbf {y}})C^*{{\,\mathrm{diag}\,}}({\mathbf {x}})\) and \(X={X^*}{{\,\mathrm{diag}\,}}({\mathbf {x}})^{-1} \), we have:

$$\begin{aligned} Y^*C^*{X^*}^\top = Y^*{{\,\mathrm{diag}\,}}({\mathbf {y}})^{-1}{{\,\mathrm{diag}\,}}({\mathbf {y}})C^*{{\,\mathrm{diag}\,}}({\mathbf {x}}){{\,\mathrm{diag}\,}}({\mathbf {x}})^{-1} {X^*}^\top = YCX^\top . \end{aligned}$$
(6)

As a result, the matrix product \(YCX^\top \) is indistinguishable from the one of the ground truth, although the factor matrices differ from the ground truth factors. During optimization, when the binary constraints are relaxed, we might obtain differently scaled matrices as iterates. If the vector \({\mathbf {x}}\) in Eq. (6) contains large numbers, then the matrix X has columns which are close to zero. In upcoming iterations, the proximal operator pushes these small values even closer to zero, which has to be balanced by the scaling of C. This compensation mechanism works until the values in X reach zero; when that happens, we observe a sudden increase of the approximation error.
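A quick numerical check (a toy example of ours) confirms this ambiguity: the rescaled factors of Eq. (6) produce exactly the same product as the original ones.

```python
import numpy as np

rng = np.random.default_rng(0)
Ys, Cs, Xs = rng.random((5, 2)), rng.random((2, 3)), rng.random((4, 3))
y, x = rng.uniform(1, 3, 2), rng.uniform(1, 3, 3)   # arbitrary positive scaling vectors

Y = Ys @ np.diag(1 / y)
C = np.diag(y) @ Cs @ np.diag(x)
X = Xs @ np.diag(1 / x)

print(np.allclose(Ys @ Cs @ Xs.T, Y @ C @ X.T))     # True: the products are indistinguishable
```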

Algorithm 2

The scaling issue can be alleviated by a suitable initialization. We consider two possible schemes (see Algorithm 2): a baseline approach consisting of a uniformly random initialization of the (relaxed) cluster indicator matrices (henceforth denoted as InitRND), and a more sophisticated initialization based on a briefly optimized nonnegative matrix factorization (henceforth denoted as InitNMF). The latter approach is the one that we propose in this paper. In both cases, the inputs are the data matrix D and the rank r. The method InitNMF has one additional parameter p, specifying the percentage of fuzzy cluster assignments which are set to 1.0 after the initialization. In Broccoli we adopt \(p = 80\), which, according to a set of independent preliminary experiments, appears to be appropriate and reasonable.

The proposed InitNMF employs the uniformly random initialization InitRND for its own initialization. The optimization consists of 100 proximal stochastic gradient descent updates for nonnegative matrix factorization. The nonnegative constraints are incorporated by the regularizing function \(\phi _+\), which returns infinity at negative matrix entries, zero otherwise. The proximal operator of this function is a projection of the argument matrix onto nonnegative values (Bolte et al. 2014).

The resulting nonnegative factors \((X_+,Y_+)\) are then converted into a suitable tri-factorization \((C, X, Y)\). Ideally, the initialization yields factor matrices X and Y which already reflect the optimal distribution of zeros and ones per cluster. Since the optimal distribution is not known in advance, we employ a heuristic strategy. In particular, we determine scaling vectors \({\mathbf {x}}\) and \({\mathbf {y}}\), whose product will reflect the diagonal entries of C (cf. Steps 12 and 13). In the first step, we set the scaling vectors \({\mathbf {x}}\) and \({\mathbf {y}}\) to the pth percentile of the nonnegative indicators of each cluster. In this way, we ensure that the cluster indicator matrices are not too sparse, having at least \((100-p)\%\) of all entries equal to one. At the same time, we need to ensure that the core matrix C is not too sparse. If one row or column of the core matrix is equal to zero, then the corresponding row- or column-cluster is not used. Therefore, we add 0.01 to the diagonal of the core matrix. This is achieved by adding 0.1 to the scaling vectors, where we additionally apply a weighting scheme (cf. Step 11) that provides a higher weight to denser factor matrices than to sparse indicator matrices. Broccoli’s tri-factorization is finally initialized by the scaled nonnegative matrix factorization.

4.2.2 Time complexity analysis

The complexity of Broccoli’s optimization scheme corresponds to the complexity of an update step times the number of required iterations. In the following, we derive the complexity of each update step as \({\mathcal {O}}(mnr)\), resulting in an overall time complexity of \({\mathcal {O}}(Tmnr)\), where T is the number of iterations.

For each update step, we compute the gradient, the Lipschitz constant and the proximal operator. The complexity for computing the gradient and the Lipschitz constant can be derived by the complexity of matrix multiplications, that is \({\mathcal {O}}(mnr)\) for the multiplication of an \(n\times r\) matrix with an \(r\times m\) matrix. Accordingly, the gradients are computed in \({\mathcal {O}}(mnr)\), while the Lipschitz constant is computed in \({\mathcal {O}}(r^2)\), if the required matrix products are reused from the computation of the gradient (cf. Appendix A). Finally, all employed proximal operators require constant time to update one matrix entry [cf. Eq. (5)]. Therefore, the proximal mappings of a factor matrix are computed in \({\mathcal {O}}(\max \{nr, mr, r^2 \})\). Since we can assume that \(r\le \min \{n,m\}\), the dominating factor is the computation of the gradient, resulting in an overall complexity for each update step equal to \({\mathcal {O}}(mnr)\).

As regards the initialization strategy, it is noteworthy that it does not affect the time complexity, since each gradient update step of the more expensive initialization scheme proposed in this paper (InitNMF) also takes \({\mathcal {O}}(mnr)\). Moreover, its number of iterations is constant (\(T = 100\)) and therefore does not asymptotically affect the time complexity.

5 Experiments

We compare the proposed method Broccoli against six competitors: two methods based on a nonnegative relaxation (henceforth denoted by N (Long et al. 2005) and NN (Del Buono and Pio 2015)), two methods based on an orthogonal relaxation (henceforth denoted by O (Yoo and Choi 2010) and OO (Del Buono and Pio 2015)) and the biclustering methods Fabia (Hochreiter et al. 2010) and Floc (Yang et al. 2005). Since N, NN, O and OO return fuzzy membership values for each observation, we binarize the result for comparison purposes. For each sample (observation or feature) we set the top-k fuzzy cluster indicator values to one, where k is the number of ground truth clusters the sample belongs to. Note that in this way we provide our competitors with additional background knowledge, which is not available in real-world scenarios. The goal is to estimate how good the clustering, derived from a relaxed result, could potentially be, if supported by additional knowledge provided (e.g., by domain experts).

Broccoli is implemented in PyTorch, exploits the inbuilt implementation of SGD and batch sampling, and relies on the heavily parallelized execution of matrix multiplications by Graphics Processing Units (GPUs). The batch sampling is performed epoch-wise. In every epoch, the data is partitioned into 10 sets, which are returned as batches in subsequent iterations. Hence, the sizes of the batches are approximately equal to 0.1n and 0.1m, respectively. As the default value for the step-size of the regularization weight updates, we set \(\gamma =10^{-9}\), which is suitable for most datasets. In practice, the step-size should be as low as the run-time allows. In order to speed up convergence without noticeably harming the quality of the results, we double the step-size \(\gamma \) every 2000 epochs. We set the parameter \(max_c\) of the function \(\phi _c\) equal to the maximum value in D.

The rank is set equal to the rank of the ground truth. In some experiments we denote the ranks \(r_y\) and \(r_x\), which are the number of row-clusters and column-clusters, respectively, of the obtained result. Note that the ranks of the returned tri-factorization may be smaller than the specified rank, due to constant zero columns in X or Y, or constant zero columns or rows in C.

We evaluate Broccoli using both the initialization schemes described in Sect. 4.2.1. Specifically, we denote the variant based on InitNMF as Broccoli NMF, and the variant based on InitRND as Broccoli RND.

5.1 Evaluation measures

We quantify how well a computed cluster indicator matrix Y matches the ground truth \(Y^*\) by an adaptation of the averaged \(F_1\)-measure, known from multi-class classification tasks. We compute a one-to-one matching \(\tau \) between computed and ground truth clustering and compute the average \(F_1\)-measure of the matched clusters. Formally, the \(F_1\)-measure of two binary vectors y and \(y^*\) is computed as the harmonic mean of precision and recall:

$$\begin{aligned} {{\,\mathrm{pre}\,}}(y,y^*)= & {} \frac{y^\top y^*}{\Vert y\Vert ^2},\quad {{\,\mathrm{rec}\,}}(y,y^*) = \frac{y^\top y^*}{\Vert y^*\Vert ^2}. \\ F_1(y,y^*)= & {} 2\frac{{{\,\mathrm{pre}\,}}(y,y^*)\cdot {{\,\mathrm{rec}\,}}(y,y^*)}{{{\,\mathrm{pre}\,}}(y,y^*)+{{\,\mathrm{rec}\,}}(y,y^*)} = \frac{2 y^\top y^*}{\Vert y\Vert ^2 + \Vert y^*\Vert ^2}. \end{aligned}$$

The average \(F_1\)-measures for column- and row-clusters are then computed by

$$\begin{aligned} F_1(Y,Y^*)&= \frac{1}{r_y}\sum _{s=1}^{r_y} F_1(Y_{\cdot s},Y^*_{\cdot \tau _y(s)}),&F_1(X,X^*)&= \frac{1}{r_x}\sum _{t=1}^{r_x} F_1(X_{\cdot t},X^*_{\cdot \tau _x(t)}). \end{aligned}$$
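As an illustrative sketch, assuming the matching \(\tau \) is chosen so as to maximize the summed pairwise \(F_1\) scores (e.g., via the Hungarian algorithm), the averaged measure can be computed as follows:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def f1_pair(y, y_star):
    """F1 of two binary indicator vectors (for binary vectors ||y||^2 equals sum(y))."""
    denom = np.sum(y) + np.sum(y_star)
    return 2 * (y @ y_star) / denom if denom > 0 else 0.0

def avg_f1(Y, Y_star):
    """Average F1 over a one-to-one matching of computed and ground-truth clusters,
    obtained here by maximizing the summed pairwise F1 (an assumption of this sketch)."""
    scores = np.array([[f1_pair(Y[:, s], Y_star[:, t])
                        for t in range(Y_star.shape[1])]
                       for s in range(Y.shape[1])])
    rows, cols = linear_sum_assignment(-scores)      # maximum-weight matching
    return scores[rows, cols].sum() / Y.shape[1]
```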

In addition to this matching-based measure, we also compute two agreement measures for overlapping clusterings proposed by Rabbany and Zaïane (2015). Given a cluster indicator matrix Y, these measures are defined as

$$\begin{aligned} I_{\cos }(Y,Y^*)&= \frac{\Vert Y^\top Y^*\Vert ^2}{\Vert Y^\top Y\Vert \Vert {Y^*}^\top Y^* \Vert }&I_{\text {sub}}(Y,Y^*)&= \frac{\Vert Y^\top Y^*\Vert }{\Vert Y\Vert \Vert {Y^*} \Vert }. \end{aligned}$$

The term

$$\begin{aligned} \Vert Y^\top Y^*\Vert ^2 = \sum _{1\le s,t \le r}(Y_{\cdot s}^\top Y^*_{\cdot t})^2 = |YY^\top \circ Y^*{Y^*}^\top |\end{aligned}$$

represents the agreement according to the number of elements that the two clusters \(Y_{\cdot s}\) and \(Y^*_{\cdot t}\) have in common (see Appendix B for a more detailed discussion of the clustering agreement indexes).
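Both agreement indexes can be computed directly from the cluster indicator matrices, as in the following sketch:

```python
import numpy as np

def agreement_indexes(Y, Y_star):
    """I_cos and I_sub of Rabbany and Zaiane (2015) for two cluster indicator matrices."""
    fro = np.linalg.norm
    cross = fro(Y.T @ Y_star)
    i_cos = cross ** 2 / (fro(Y.T @ Y) * fro(Y_star.T @ Y_star))
    i_sub = cross / (fro(Y) * fro(Y_star))
    return i_cos, i_sub
```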

If we are given the ground truth for the row- and column-clusters, then we return the average of the measures:

$$\begin{aligned} F_1&= \tfrac{1}{2}(F_1(Y,Y^*)+F_1(X,X^*))\\ I_{\cos }&= \tfrac{1}{2}(I_{\cos }(Y,Y^*)+I_{\cos }(X,X^*)) \\ I_{\text {sub}}&= \tfrac{1}{2}(I_{\text {sub}}(Y,Y^*)+I_{\text {sub}}(X,X^*)) \end{aligned}$$

All measures range between 0.0 and 1.0; the closer they are to 1.0, the better the computed clustering matches the ground truth. Note that \(I_{\text {sub}}\) generally attains a value smaller than 1.0 even if the cluster indicator matrix perfectly corresponds to the ground truth.

Finally, we report the value of the approximation error of the factorization with respect to the original data matrix, measured through the Mean Squared Error in percentage (MSE%). Formally:

$$\begin{aligned} \text {MSE}\% = \frac{\Vert D-YCX^\top \Vert ^2}{\Vert D\Vert ^2}\cdot 100. \end{aligned}$$

Note that, contrary to the performance measures, the lower the MSE%, the better the approximation: the MSE% represents the relative approximation error (in percent) made by the factorization of the input data matrix. However, we also caution that a low approximation error does not necessarily indicate a correctly identified clustering structure (cf. Sect. 4).

5.2 Synthetic datasets

We create a set of synthetic datasets with overlap and outliers by sampling every cluster indicator matrix from a Bernoulli distribution. Each entry \(X^*_{it}\) and \(Y^*_{js}\) is equal to 1 with probability \(q=0.2\). In addition, we ensure that each cluster contains at least \(1\%\) of the observations/features which are exclusively assigned to that specific cluster. The core matrix is sampled as a sparse matrix containing uniformly distributed values \(C_{st}\in [0,5]\). The probability that a non-diagonal element is nonzero is equal to 1/r. The data matrix is then generated by adding random Gaussian noise to the ground truth factorization:

$$\begin{aligned} D_{ji} = [Y_{j\cdot }^*C{X_{i\cdot }^*}^\top + \epsilon _{ji}]_{\ge 0}, \end{aligned}$$

where \(\epsilon _{ji}\sim {\mathcal {N}}(0,\sigma )\) and the operator \([\cdot ]_{\ge 0}\) projects negative values to zero. We generate five datasets for every noise variance \(\sigma \in \{0,0.2,0.4,\ldots ,2\}\) and dimensionality \((m,n)\in \{(300,200),(1000,800)\}\) (see Table 2 for a summary of the statistics of the generated datasets).

Table 2 Statistics about the synthetic datasets

For the smaller \(300\times 200\) dataset we choose a rank of \(r=3\), while for the larger \(1000\times 800\) dataset we choose a rank of \(r=5\). The characteristics overlap X and overlap Y denote the average number of clusters a feature or observation is assigned to, when it is not an outlier. Outliers are features or observations which are not assigned to any cluster.
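A sketch of this generator, using our own parameter names and omitting the guarantee of at least \(1\%\) exclusive members per cluster, reads:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r, q, sigma = 300, 200, 3, 0.2, 0.4            # one of the considered settings

Y = (rng.random((m, r)) < q).astype(float)           # Bernoulli(q) row-cluster indicators
X = (rng.random((n, r)) < q).astype(float)           # Bernoulli(q) column-cluster indicators

C = np.diag(rng.uniform(0, 5, r))                    # uniform bicluster weights on the diagonal
off = (rng.random((r, r)) < 1.0 / r) & ~np.eye(r, dtype=bool)
C[off] = rng.uniform(0, 5, off.sum())                # off-diagonal weights nonzero with prob. 1/r

D = np.maximum(Y @ C @ X.T + rng.normal(0, sigma, (m, n)), 0)   # add noise, project to >= 0
```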

Fig. 3

Average \(F_1\)-, \(I_{\cos }\)- and \(I_{\text {sub}}\)-measure (the higher the better), plotted against the Gaussian noise parameter \(\sigma \). Error bars have a length of twice the standard deviation

In Fig. 3, we plot the \(F_1\)-, \(I_{\cos }\)- and \(I_{\text {sub}}\)-measures against the Gaussian noise parameter \(\sigma \). For \(\sigma =2\), roughly 1/3 of the noise samples are larger than or equal to 1.0, and about 2/3 of all noise samples have an absolute value larger than or equal to 1.0 in expectation. We can observe that the measures overall indicate a similar ranking of the performance of the algorithms. As the noise increases, Broccoli NMF attains very high scores, while Broccoli RND exhibits slightly lower performance. We also observe an increased variance of Broccoli RND's results (shown by the error bars), indicating a sensitivity to the random initialization. Broccoli's performance drops especially in the \(I_{\cos }\)-measure when the noise exceeds 1.5. Since this drop is more pronounced in the small dataset (\(300\times 200\)), it may be due to the combination of a high amount of noise and a high number of outliers.

The methods N, NN, O and OO, which are based on orthogonal and nonnegative relaxations, seem largely unaffected by the noise parameter. Recall that we provided these methods with the advantage of knowing the true number of clusters for each observation or feature during the binarization. In contrast, this advantage was not provided to Broccoli. Despite such an advantage, they never achieve a score larger than 0.8. Fabia and Floc attain the lowest scores among all the considered competitors, where the scores of Fabia have a tendency to increase with the noise parameter. This behavior is possibly due to the fact that Fabia (together with Floc) does not explicitly handle the possible presence of noise in the data, and is therefore very sensitive to it. The difficulty faced in modeling the true clustering structure with no noise (\(F_1\)- and \(I_{\cos }\)-measure close to 0.5 for \(\sigma = 0\)) may ideally be alleviated in the presence of a large amount of noise, which strongly pushes the method away from the (wrong) local minimum it was possibly stuck in.

5.3 Real-world datasets

Table 3 Characteristics of selected datasets for multi-label classification

We also evaluate our method on a series of real-world datasets, originally designed for multi-label classification tasks. These datasets naturally have multiple classes per observation, which we employ as the ground truth for evaluation purposes. The statistics of these datasets are depicted in Table 3. The birds dataset is derived from audio files (Briggs et al. 2013), emotions addresses the sentiment of music (Trohidis et al. 2008), scene is an image dataset (Boutell et al. 2004) and genbase and yeast are derived from the biological domain (Diplaris et al. 2005; Elisseeff and Weston 2002). We also adopt the 20 Newsgroups (20News) dataset, which in principle is not multi-label. However, since in 20News labels are hierarchically organized, it allows us to emphasize the capabilities of the considered methods in catching overlaps in terms of hierarchically organized clusters (i.e., a more general category is overlapping with—actually includes—more specific categories).

We evaluate how well the computed row-clusters match with the given labels. However, we should keep in mind that the fundamental task of clustering is to find the prevalent structure in the dataset, which does not necessarily correspond to the specific structure encoded by the class labels.

Table 4 Evaluation of the clustering methods on real-world datasets

Table 4 summarizes the results on the selected multilabel datasets. As a benchmark, we train the factor matrix X of a nonnegative matrix factorization when fixing Y to the class-assignment matrix:

$$\begin{aligned} \min _X\ \frac{1}{nm}\Vert D- YX^\top \Vert ^2, \quad \text { s.t. }X\in {\mathbb {R}}_+^{n\times r}. \end{aligned}$$
(7)

The optimization problem in Eq. (7) is convex. Hence, we can compute the global minimum of this objective, which provides a lower bound of the MSE which could possibly be attained by a biclustering where the row-clusters actually reflect the classes. We refer to this method as NMF given Y.

Table 4 does not display MSE% values for the algorithms Fabia and Floc, since these algorithms only return the cluster indicator matrices and not the core matrix. The performance measures of N, NN, O and OO are again derived from a binarization of the fuzzy row-clusters by means of the class-indicator matrix. The MSE% of these methods is reported for the relaxed (fuzzy) factor matrices. For all methods we set the number of clusters r equal to the number of classes.

Table 4 displays how well the competing algorithms match the given class labels. For five out of the six datasets, the row-clustering of Broccoli NMF attains the best score in at least one of the considered measures.

We observe that Broccoli (NMF and RND) returns fewer clusters with respect to the specified rank, for birds and genbase. In this case, we also report in the parentheses the \(F_1\)-score averaged only over the best-matched clusters identified by Broccoli. In other words, the \(F_1\)-scores in parentheses denote how well Broccoli’s clusters match a selection of true classes.

The birds dataset poses an exception for Broccoli NMF. First, Broccoli NMF is not able to factorize this dataset with a low MSE%. This is partly due to the fact that almost half of the observations do not belong to any class and are thus marked as outliers. Additionally, the gap between the 99.9th percentile of data values and the maximum data value is very large (cf. Table 3). Therefore, there are very few values which dominate the mean squared error if these values are not approximated. In this case, the relaxed methods (O, OO, N and NN) are at an advantage, because their relaxed approximation scheme allows for the adaptation to single, exceptionally high values within one bicluster (a similar effect is observable on 20News). In addition, the binarization ensures that all the outliers are also reflected as such in the binary factorization, which positively influences the performance measures.

In contrast, the genbase dataset is a binary dataset which is similarly well-factorized by Broccoli NMF as by the relaxed methods. In particular, Broccoli NMF attains a low MSE% with only three of the possible 19 row-clusters. With these few clusters, Broccoli NMF achieves the highest \(I_{\cos }\) and \(I_{\text {sub}}\) agreement measures, which do not explicitly depend on the number of modeled clusters due to their scaling invariance (cf. Appendix B). Yet, Broccoli NMF and Broccoli RND produce the lowest \(F_1\)-score, which is largely due to the fact that the average \(F_1\)-score returns a score of zero for every non-modeled cluster. On the other hand, the high \(F_1\)-score (shown in the parentheses), averaged over only the modeled three clusters, indicates a proper match of the clusters modeled by Broccoli NMF with true labels.

The genbase dataset is the only one for which Fabia attains a high score in at least one of the measures. On the other datasets, Fabia often reaches good, but not outstanding, scores, which are not too far from the best ones. Floc usually obtains lower \(F_1\)-scores than Fabia, but the results in terms of agreement measures are mixed.

We observe that, for all the other datasets, Broccoli NMF attains a lower MSE% than the benchmark NMF given Y. This shows that the true class labels generally do not yield a suitable clustering which approximates the data well. No clustering which actually aligns with the class labels can obtain a lower MSE% than the one reported for NMF given Y. After all, the MSE% measures how well the identified clustering structure matches the data. Hence, any clustering which obtains an MSE% substantially lower than the baseline of NMF given Y cannot have a performance measure close to 1.0. We further note that Broccoli NMF usually attains an MSE% which is close to that of the relaxed factorizations N, NN, O and OO. For the emotions and the yeast datasets, Broccoli achieves an even lower MSE% than the relaxed factorizations. This demonstrates the strengths of the employed optimization.

Finally, we observe that Broccoli RND, based on a random initialization, generally performs worse than Broccoli NMF. In particular, we can observe significantly higher values of the MSE%, indicating that Broccoli RND often converges to noticeably worse local optima than Broccoli NMF. This result confirms the positive contribution of the proposed initialization strategy which, on the other hand, requires only a negligible amount of additional running time.

5.4 Qualitative evaluation of the 20 newsgroups dataset

Table 5 Summary of the 20 Newsgroup categories

We perform a qualitative evaluation of the results on the 20News dataset. This dataset is a collection of posts assigned to 20 topics which are hierarchically organized (cf. Table 5). We process the textual data as a data matrix corresponding to the term frequency of \(n=6643\) lemmatized words in \(m=11314\) posts (training data only, as provided by scikit-learn). We apply NN, OO, Broccoli RND and Broccoli NMF to derive \(r=20\) row- and column-clusters. Fabia and Floc were not able to process such a large dataset.
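For reference, a term-frequency matrix of this kind can be obtained with scikit-learn as sketched below; the exact preprocessing used for our experiments (e.g., the lemmatization and the selection of the \(n=6643\) words) is not reproduced here.

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer

posts = fetch_20newsgroups(subset="train")                 # the 11314 training posts
vectorizer = CountVectorizer(max_features=6643, stop_words="english")
D = vectorizer.fit_transform(posts.data)                   # sparse posts x words term frequencies
print(D.shape)                                             # (11314, 6643)
```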

Fig. 4

Illustration of derived word-clusters by the method NN on the 20 Newsgroups dataset. The size of a word reflects its weight in the corresponding cluster (\(X_{\cdot s}\))

Fig. 5

Illustration of derived word-clusters by the method OO on the 20 Newsgroups dataset. The size of a word reflects its weight in the corresponding cluster (\(X_{\cdot s}\))

Fig. 6

Illustration of derived word-clusters by Broccoli RND on the 20 Newsgroups dataset. The size of a word reflects its weight in the corresponding cluster (\(X_{\cdot s}\))

Fig. 7

Illustration of derived word-clusters by Broccoli NMF on the 20 Newsgroups dataset. The size of a word reflects its weight in the corresponding cluster (\(X_{\cdot s}\))

5.4.1 Inspection of feature clusters

The obtained column-clusters (the feature clusters which, in this case, are clusters of words) are shown in Figs. 4, 5, 6 and 7. For the fuzzy cluster indicators of NN and OO, the size of the word i in the wordcloud of cluster s corresponds to the assigned weight \(X_{is}\ge 0\). In turn, the binary word-indicators of Broccoli RND and Broccoli NMF are visualized such that the size of a word in the cloud is proportional to the inverse of the number of clusters the word is assigned to. That is, those words which are unique to the respective cluster are larger than those words which are assigned to multiple clusters.

Looking at the visualizations of clusters, we see that the word max pops up prominently in many clusters. The word max attains exceptionally high term frequencies compared to the rest of the vocabulary. The average term frequency of a word is equal to 1.59, and \(99\%\) of all words have a term frequency smaller than or equal to 8. The word max occurs in 149 posts and attains term frequencies in [1, 800]. Hence, the word max reaches these exceptionally high term frequencies in only a few posts and thus plays a special role. The unusually high term frequencies of this word are handled differently by the clustering methods. While NN and OO give a high weight to this word in almost all clusters, Broccoli RND and Broccoli NMF reflect more general clustering structures. It is noteworthy that, as we can observe in Table 4, Broccoli RND and Broccoli NMF obtain a notably higher (worse) MSE% than the competitors. The unusually high frequency of the word max vastly increases the approximation error for any cluster model that does not adapt to (and possibly overfit on) these particularly high word occurrences. Nevertheless, the approximation error exhibited by the model based on the true class labels (i.e., NMF given Y) is comparable to that achieved by Broccoli RND and Broccoli NMF.

Fig. 8

Visualization of the overlap of Broccoli NMF clusters on the 20News dataset and the given categories

In any case, we can detect meaningful clusters which represent a specific topic for all clustering methods. Comparing the topics, we can see that Broccoli NMF provides a distinctive view on the dataset, identifying, for example, a religion cluster, which is not featured by the methods NN and OO. Hence, although Broccoli’s optimization makes use of a relaxed objective, its results still provide another view on the data with respect to that provided by the relaxed counterparts NN and OO.

5.4.2 The distribution of observation-clusters over 20News categories

The differences between the cluster models are evident also when we look at the visual representation of the cluster overlaps with the 20News categories, depicted in Fig. 8. In this figure, on the horizontal axis we show the 20News categories at the bottom and super-categories at the top, while on the vertical axis we list the identified clusters. The intensity of each pixel reflects the normalized count of posts in the corresponding cluster and category. The cluster-category count is normalized such that every cluster has a maximum agreement of 1.0 with at least one of the categories, while the remaining agreement values range between 0.0 and 1.0. Consequently, we see at least one pixel per row having the highest intensity. The presence of multiple pixels per row with a high intensity means that such a cluster covers multiple categories with equally high agreement.
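The normalized counts underlying Fig. 8 can be computed as in the following sketch, where G denotes the (assumed) binary post-category indicator matrix:

```python
import numpy as np

def cluster_category_overlap(Y, G):
    """Normalized cluster/category co-occurrence counts, as visualized in Fig. 8.
    Y: posts x clusters binary indicators, G: posts x categories binary indicators."""
    counts = Y.astype(int).T @ G.astype(int)               # clusters x categories counts
    return counts / np.maximum(counts.max(axis=1, keepdims=True), 1)
```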

In Fig. 8, we can observe that all methods except Broccoli NMF yield clusters which overlap mostly with only one of the categories. For OO and NN, such a category is comp.os.ms-windows, while for Broccoli RND such a category is comp.windows.x. It is interesting to note that comp.os.ms-windows and comp.windows.x are the categories which contain a few posts with an exceptionally high frequency of the word max. From an optimization perspective this is not surprising, since the fuzzy clustering of NN and OO is able to notably decrease the MSE% by specifically focusing on those data matrix entries with very high values, which strongly influence the (squared) approximation error. On the other hand, from a clustering perspective, this behavior is not necessarily desirable, since it can result in a very one-sided representation of the data, as shown in Fig. 8: the clusters overfit 12 posts containing the word max at least 45 times. As regards Broccoli RND, the imposed binary constraints result in slightly more widespread clusters, since they ensure that all features in one bicluster are approximated with the same value. However, Broccoli RND's clusters concentrate around one of the categories with the unusually high word occurrences. Here, the random initialization does not seem to be suitable to find a good starting point for the optimization: neither the performance measures nor the approximation error nor the visual analysis indicates that Broccoli RND's clusters reflect meaningful topic similarities.

Fig. 9 Visualization of the overlap of Broccoli NMF clusters on the 20News dataset for different ranks

In contrast, the clusters identified by Broccoli NMF are diverse and cover various categories. In particular, we can see clusters which clearly originate from only one category (e.g., talk.politics.guns in cluster 8), and others which summarize various topics (e.g., comp in clusters 14–19). A few clusters cover categories from different super-categories. For example, cluster 11 comprises posts from comp.graphics and sci.space, while cluster 5 models posts from talk.politics.mideast and religion.christian. However, even if they span different super-categories, such specific combinations make sense semantically, and emphasize the ability of Broccoli NMF to capture such inter-topic relationships.

5.4.3 Influence of the rank

Besides the optimization aspects, there is one open problem for (bi)clustering applications in practice: the determination of the rank, i.e., of the most appropriate number of clusters. Although specifically addressing such a vast topic is beyond the scope of this paper, in this subsection we qualitatively observe the influence of the rank on the results obtained on the 20News dataset. In Table 4, we have already observed that Broccoli NMF identifies, for some datasets, fewer biclusters than the specified rank. In Fig. 9 we show the category distribution of Broccoli NMF clusters when we specify lower ranks \(r\in \{5,10,15\}\). We observe that Broccoli NMF tends to model more general clusters for smaller ranks. For \(r=5\), there are two clusters focused on sci.crypt, which lean either toward politics or toward comp-related topics, a cluster representing the super-category belief, a cluster encompassing politics and belief, and a cluster encompassing sports and politics. As expected, the more we increase the rank, the more specific the clusters become, until they cover only a few categories. It is noteworthy that, even when choosing a rank which appears too low, Broccoli NMF still groups together categories that are somehow related, or that belong to the same super-category. This behavior is evident starting from \(r = 10\), where the sub-categories related to religion, talks and computers, respectively, appear to be reasonably grouped into distinct clusters.

5.5 Discussion on possible limitations

Our proposed penalized optimization approach is flexible and has the potential to become a novel general approach for the optimization of nonexclusive clustering structures based on matrix factorization. Notably, most of the popular clustering methods are based on (or can be viewed as) a matrix factorization with binary constraints, including k-means, spectral clustering, and variants of deep clustering. Despite these characteristics, in the following we highlight possible limitations of the proposed approach. We should first note that the adopted squared Frobenius norm is sensitive to data matrix entries whose magnitude is much higher than the average. In other words, if there are very few data entries with very high values, then any factorization with a low MSE needs to fit the clustering to these unusually high-valued entries. This phenomenon is in contrast with the general concept of clustering, which is to group recurring patterns. We considered two examples of datasets which have a few unusually high values: birds and 20News. For these datasets, we observe a big gap between the 99.9th percentile and the maximum data value (cf. Table 3). For both datasets, Broccoli NMF is not able to achieve a low MSE and the performance measures give conflicting signals. This does not mean that the clusters identified by Broccoli NMF are not meaningful, as confirmed in the qualitative exploration we performed for 20News. However, outliers with extremely high values aggravate the optimization.
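The following toy computation (an illustration we add here, not an experiment from the paper) makes the sensitivity of the squared Frobenius norm concrete:

    # With the squared Frobenius norm, approximating a single entry of value
    # 800 by the typical value 2 costs (800 - 2)^2 = 636,804, whereas
    # approximating all 9,999 typical entries by 0 costs only 9,999 * 4 = 39,996.
    # A low-MSE factorization is therefore pushed to fit the outlier first.
    import numpy as np

    X = np.full((100, 100), 2.0)   # typical term frequencies around 2
    X[0, 0] = 800.0                # one outlying entry, like the word "max"

    cost_ignore_outlier = (X[0, 0] - 2.0) ** 2   # 636,804.0
    cost_ignore_typical = 4.0 * (X.size - 1)     # 39,996.0
    print(cost_ignore_outlier, cost_ignore_typical)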

This leads to the second limitation of our approach: the initialization. We have been able to derive a suitable heuristic for the initialization, which works well for most of the considered cases. However, other contexts might benefit from a different initialization. In practice, variations of the proposed initialization should be considered in future experiments, by varying, for example, the percentage of cluster indicators which are scaled to one.
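To make the suggested variation more tangible, the following is a loose sketch of one possible parametrized initialization: run a short nonnegative matrix factorization and set the largest entries of each factor to one, with a hypothetical parameter p controlling the fraction of cluster indicators set to one. This is an assumption about the general idea, not the authors' implementation:

    # Hypothetical initialization sketch: short NMF, then the top-p fraction
    # of indicators per cluster is set to one. Parameter names are illustrative.
    import numpy as np
    from sklearn.decomposition import NMF

    def init_indicators(X, rank, p=0.2, nmf_iter=30, seed=0):
        nmf = NMF(n_components=rank, max_iter=nmf_iter, init='random', random_state=seed)
        U = nmf.fit_transform(X)              # observation factors (n x rank)
        V = nmf.components_.T                 # feature factors (m x rank)

        def binarize(F):
            B = np.zeros_like(F)
            k = max(1, int(p * F.shape[0]))   # number of indicators set to one per cluster
            for j in range(F.shape[1]):
                B[np.argsort(F[:, j])[-k:], j] = 1.0
            return B

        return binarize(U), binarize(V)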

Finally, there is the issue of selecting the rank. This issue is not specific to the proposed method, but relates to a general open problem in clustering. One can always apply popular heuristics such as the elbow method, which is based on the identification of an elbow in the curve of the approximation error plotted against the rank. Moreover, we have also observed that, in some datasets, increasing the user-specified rank does not necessarily correspond to a higher number of clusters identified by Broccoli NMF. This means that such heuristics might be adopted to determine an upper bound on the number of clusters Broccoli should identify, but, in general, the determination of the rank on theoretically justified grounds is still an open problem.
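As a simple illustration of the elbow heuristic mentioned above, the sketch below records the approximation error over a range of candidate ranks and returns the rank at the sharpest bend of the error curve; approximation_error is a placeholder for fitting the chosen (bi)clustering model at rank r and returning its MSE, and the curvature proxy is one of several possible choices:

    # Elbow heuristic sketch: pick the rank where the error curve bends most
    # sharply, using the discrete second difference as a curvature proxy.
    # Requires at least three candidate ranks.
    import numpy as np

    def elbow_rank(X, ranks, approximation_error):
        errors = np.array([approximation_error(X, r) for r in ranks])
        curvature = errors[:-2] - 2 * errors[1:-1] + errors[2:]
        return ranks[1 + int(np.argmax(curvature))], errors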

6 Conclusions

We have proposed the biclustering method Broccoli, which employs recent advances in optimization theory for the optimization of nonnegative tri-factorizations with binary constraints. To the best of the authors’ knowledge, Broccoli can be considered the first algorithm that is able to model the possible overlap between clusters, as well as the presence of outliers in the data, without requiring the user to specify characteristics of the obtained clustering (such as the amount of overlap or the number of outliers), while returning definite, non-fuzzy cluster assignments. Employing the well-founded theory of proximal stochastic optimization, our method is guaranteed to converge to a local minimum in expectation (Driggs et al. 2020). Our method is based on the penalization of non-binary terms in the cluster-assignment matrices. The regularization weight of the penalizing terms is automatically updated during training, while the user only has to specify the step size of these updates, which should be as small as possible in practice.

Our experiments on synthetic datasets show that our method is able to detect the underlying clustering structure and that it is robust to noise. Figure 3 shows that the direct optimization of binary cluster indicator matrices performed by Broccoli generally achieves higher performance measures (cf. Sect. 5.1) than the methods which compute a relaxed factorization that is binarized according to the (generally unknown) ground truth. Broccoli relies on an initialization based on a shortly optimized nonnegative matrix factorization. While this initialization did not make a very notable difference for the synthetic data, which have a clear biclustering structure as ground truth, we have seen that the initialization becomes important for real-world data. The experiments on real-world data (cf. Table 4) show that the initialization of Broccoli with a shortly optimized nonnegative matrix factorization (Broccoli NMF) is suitable to find minima which attain a remarkably low approximation error, sometimes even lower than that of the relaxed factorizations. Finally, in our qualitative evaluation, we have observed that Broccoli NMF is able to derive meaningful clusters in terms of the identified feature-clusters (cf. Figs. 4, 5, 6, 7) and observation-clusters (cf. Fig. 8). Furthermore, we have briefly discussed the effect of the setting of the rank for Broccoli NMF (cf. Fig. 9). We found that, for smaller ranks, Broccoli NMF tends to return more encompassing clusters, which span multiple categories and super-categories.

We can conclude that Broccoli allows for the swift adoption of advances in the theory of nonconvex optimization. In addition, techniques to cope with specific data characteristics in matrix factorization can easily be transferred to the optimization scheme adopted in Broccoli. For example, a common technique for handling missing data in matrix factorization consists in minimizing the approximation error only over the observed data matrix entries, i.e., the non-missing data. This approach can be integrated in Broccoli by setting to zero, in the gradient descent updates, the partial derivatives corresponding to the missing data indices, as sketched below.
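The following is a minimal sketch of this masked-gradient idea, shown for a plain two-factor model X ≈ U V^T with a binary observation mask M (1 = observed, 0 = missing); it is illustrative only, since Broccoli itself uses a tri-factorization with proximal stochastic updates:

    # Multiplying the residual by M zeroes the partial derivatives at missing
    # entries, so only observed entries drive the updates.
    import numpy as np

    def masked_gradient_step(X, M, U, V, lr=1e-3):
        R = M * (U @ V.T - X)    # residual, zeroed at missing entries
        grad_U = R @ V           # gradient of 0.5 * ||M * (U V^T - X)||_F^2 w.r.t. U
        grad_V = R.T @ U         # gradient of the same objective w.r.t. V
        return U - lr * grad_U, V - lr * grad_V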

This makes Broccoli a theoretically founded, practically well-performing and flexible approach which has the potential to spark further research on the optimization of non-exclusive clusterings in particular, and on the learning of discrete structures in general.