Column-coherent matrix decomposition

Matrix decomposition is a widely used tool in machine learning with many applications such as dimension reduction or visualization. In this paper we consider decomposing X, a matrix of size n×m\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n \times m$$\end{document}, to a product WS where we require that S, a matrix of size n×k\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$n \times k$$\end{document}, needs to have consecutive ones property. More specifically, we require that each row of S needs to be in the form of 0,…,0,1,…,1,0,…,0\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$0, \ldots , 0, 1, \ldots , 1, 0, \ldots , 0$$\end{document}. Such decompositions are particularly meaningful if X is a matrix where each row represents a time series; in such a case the ones in each row in S represent a time segment. We show that the optimization problem is inapproximable. To solve the problem we propose 5 different algorithms. The first two algorithms are based on solving iteratively S while keeping W fixed and then solving W while keeping S fixed. The next two algorithms are based on greedily optimizing a single row in S and the corresponding column in W. The last algorithm first finds the optimal decomposition of with 2k-1\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$2k - 1$$\end{document}non-overlapping rows, and then greedily combines the rows until k rows remain. We compare the algorithms experimentally, focusing on the quality of the decomposition as well as the computational time. We show experimentally that our algorithms yield interpretable results in practical time.


Introduction
Matrix decomposition, where a matrix X of size n × m is approximated with a product WS, where W and S are both of rank k, has been a standard tool in data mining with applications such as reducing dimensionality and visualization.In order to improve the interpretability of the decomposition, variants have been 1 3 Column-coherent matrix decomposition proposed such as non-negative matrix factorization (Wang and Zhang 2012), sparse matrix factorization (Gupta et al. 1997), or factorizations that prefer smooth rows (Rallapalli et al. 2010;Roughan et al. 2011;Xiong et al. 2010;Hsiang-Fu et al. 2016;Chen and Cichocki 2005).
In this paper we consider approximating X with WS, where S needs to be a binary matrix with consecutive ones property, that is, each row needs to be of form 0, … , 0, 1, … , 1, 0, … , 0 .Such decomposition is interesting if X has a natu- ral column order, for example, X consists of n time-series, each with m aligned time stamps.In such a case, each row in S corresponds to a range, a potentially interesting feature.
We first show that our problem is NP-hard, and even inapproximable.In order to find good decompositions we propose 5 algorithms, see Table 1 for summary.
The first two algorithms are based on the iterative approach where we iteratively update S while fixing W and solve W while fixing S. We show that, unlike finding W, solving S is inapproximable.We propose an exponential time algorithm, IExact, that finds S exactly using a dynamic program.IExact is fixedparameter tractable and is reasonable for smaller values of k.For large values of k we propose an iterative update approach IHIll, where each row of S is optimized while keeping the remaining rows fixed.
The next two algorithms are based on optimizing a row of S and the corresponding column of W simultaneously, while keeping the other rows fixed.The first algorithm GExact solves this subproblem exactly in O m 2 n time.If m is large, then this approach may be infeasible.Consequently, we also propose an algorithm GEst that (1 + )-approximates the optimal row in O(mn∕ ) time.
The last algorithm is a bottom-up algorithm.We start by considering a decomposition WS where the segments of S must also be disjoint.We show that such decomposition with 2k − 1 components is as good as the original decomposition with k components.We argue that we can solve the former in polynomial time with a dynamic program.Once, we have found the initial solution we greedily merge the rows until k rows remain.
The remainder of the paper is organized as follows.We introduce the notation and define the optimization problem in Sect. 2. In Sect. 3 we show the NP-hardness of the problem and discuss how to solve the problem exactly.We introduce the algorithms in Sects.4, 5 and 6.Section 7 is devoted to related work.We present the experimental evaluation in Sect.8, and conclude the paper with discussion in Sect.9.

Preliminaries and problem definition
Throughout the paper we will use X, a matrix of size n × m , as our input data.We will write X ⋅i and X i⋅ to indicate the ith column and row, respectively.We will write to indicate the (squared) Frobenius norm.We are interested in approximating X with WS, where S has a particular shape.We say that a binary matrix S is a C1P matrix if its first row is full 1 s, and the remaining rows have consecutive ones property: every row is of form (0, … , 0, 1, … , 1, 0, … , 0) , that is, the 1 s on every row are all consecutive.
Our main optimization problem is as follows.
Problem 2.1 (DcMp) Given a real-valued matrix X of size n × m and an integer k, find a C1P matrix S of size k × m and a real-valued matrix W of size n × k such that ‖X − WS‖ F is minimized.
We should point out that the reason for having the first row in S equal to 1 is to have W ⋅1 act as a bias term.
In addition, we will consider the two subproblems where either W or S is fixed.We will use these two problems extensively.
Problem 2.2 (DcMp-W) Given a real-valued matrix X of size n × m and a C1P matrix S of size k × m , find a real-valued matrix W of size n × k such that ‖X − WS‖ F is minimized.
Problem 2.3 (DcMp-s) Given a real-valued matrix X of size n × m , a real-valued matrix W of size n × k find a C1P matrix S of length k × m , such that ‖X − WS‖ F is minimized.

Exact algorithm for Dcmp
Before introducing more practical algorithms in the following section, let us first show that DcMp is NP-hard, and consider an exponential algorithm for solving the problem exactly.Proposition 3.1 Dcmp is NP-hard even for n = 1.Proof We will prove the claim by reducing subsEt-suM, a known NP-complete problem, to DcMp.In subsEt-suM we are given a multiset of integers y 1 , … , y 1 3 Column-coherent matrix decomposition with y i > 0 and an integer T, and are asked whether there is an index set I such ∑ i∈I y i = T. Assume an instance of subsEt-suM.We set X be a matrix of size 1 × ( + 1) , where X 1i = ∑ i j=1 y j for i = 1, … , .We also set X 1( +1) = T .Finally, we set k = .We claim that subsEt-suM has a solution if and only if there is C1P-decomposition such that X = WS.
Assume that there are W and S such that X = WS .Since X 12 > X 11 , there must be a row in S where the first 1 is at column 2. Similarly, since X 13 > X 12 , there must be a row in S where the first 1 is at column 3. Applied iteratively, we see that, for each i = 1, … , , there is a row in S where the first 1 is at column i.We can shuffle these rows and safely assume that first 1 in the ith row is at column i.
The assumption that X = WS forces W 1i = y i .Let I = i | S i( +1) = 1 be the indices of rows with 1 s in the last column.Then since X = WS , we have To prove the other direction, assume that there is a set I such that The proof of Proposition 3.1 shows that DcMp is not only NP-hard but also inapproximable.
Corollary 3.1 Let OPT(X, k) be the optimal cost for Dcmp.Assume that there is a polynomial-time algorithm ALG(X, k) finding a decomposition with a cost that yields an approximation guarantee OPT(X, k) ≥ (X, k)ALG(X, k) , for some  > 0 .Then P = NP .This holds even if we limit X to have n = 1.

Proof
The proof in Proposition 3.1 showed that it is NP-complete to test whether there is a C1P-decomposition such that ‖X − WS‖ 2 F = 0 .This immediately implies that there is no algorithm that yields multiplicative approximation guarantee as we would be able to use the algorithm to test whether there is a decomposition with no error.

◻
We can provide more fine-grained complexity result if we assume that Strong Exponential Time Hypothesis (SETH) holds.1

Corollary 3.2 Assume that SETH holds. Assume that we can k-decompose X, a matrix of size
Proof The claim follows from the proof of Proposition 3.1 and Theorem 1 in (Abboud et al. 2022).◻ Proposition 3.1 states that we cannot solve DcMp in polynomial time, unless P = NP .However, we can find the exact solution by enumerating all possible matrices S and solve the sub-problem DcMp-W.
Note that DcMp-W is a standard multivariate least squares problem with the optimal solution being Here, we assume that SS T is invertible.If this does not hold, we can delete the dependent rows, and set the corresponding columns in W to 0.
Assuming, for simplicity, a naive implementations of matrix multiplication and inversion, solving DcMp-W can be done in O k 3 + mk 2 + nmk time.There are We should point out that this enumeration is not practical unless m and k are both small.We present this result mainly to contrast the exponential solution of DcMp-s in the next section.

Iterative algorithm
A natural approach for obtaining a good decomposition is an iterative method, shown in Algorithm 1, where we first fix W and find optimal S (DcMp-s) and then fix S and solve W (DcMp-W).This is repeated until convergence.Note that we need to provide initial W as a parameter.We will use MErGE, described in Sect.6, for initial W.
As we saw in the previous section, we can solve DcMp-W.In this section, we consider two approaches for solving DcMp-s.Unfortunately, DcMp-s, unlike DcMp-W, is an NP-hard problem.We will show how to solve DcMp-s in fixed parameter tractable time, and also provide a polynomial-time variant, IHIll, that optimizes rows of S one at a time while keeping the remaining rows fixed.

Column-coherent matrix decomposition
We should point out that it is possible that after the update S has a smaller rank than k.In that case we remove the dependent rows and add new rows containing a single 1 so that S will have a rank of k.

Exact solution for Dcmp-S
Let us first prove the hardness of DcMp-s.
Proposition 4.1 Dcmp-S is NP-hard even for n = m = 1.

Proof
The proof is a simpler version of the proof of Proposition 3.1.We reduce subsEt-suM to DcMp.
Then it immediately follows that subsEt-suM has a solution if and only if there is C1P matrix S such that X = WS .◻ Moreover, the argument given in Corollary 3.1 can be used to show that DcMp-s is also inapproximable.
Corollary 4.1 Let OPT(X, W) be the optimal cost for Dcmp-S.Assume that there is a polynomial-time algorithm ALG(X, W) finding a decomposition with a cost that yields an approximation guarantee OPT(X, W) ≥ (X, W)ALG(X, W) , for some  > 0 .Then P = NP .This holds even if we limit X to have n = m = 1.Futhermore, we can use SETH to provide a more fine-grained complexity statement.
Corollary 4.2 Assume that SETH holds.Assume that we can solve Dcmp-S for X and W in f (Δ, n, m, k) time, where , n × m is the size of X and n × k is the size of W.Then, for each  > 0 , there is such that f Proof The claim follows from the proof of Proposition 4.1 and Theorem 1 in (Abboud et al. 2022). ◻ Next we will show that even though DcMp-s is inapproximable, the problem is fixed parameter tractable.More precisely, we will show that we can solve DcMp-s in O 3 k km + 2 k nm time.
We can solve DcMp-s exactly with a dynamic program by computing an array q[A, B, j], where A ⊆ B ⊆ {2, … , k} are two index sets and j ∈ [m].
In order to define q[A, B, j], assume that we are given an index j = 1, … , m and two index sets A ⊆ B ⊆ {2, … , k} .Write X ′ to be the first j columns of X.We define q[A, B, j] to be the cost of the optimal C1P decomposition WS of X ′ such that the rows corresponding to B have at least one 1 and the rows corresponding to A do not have 1 at the jth column, that is, Note that, by definition, S ij = 1 if and only if i ∈ B ⧵ A.

Proposition 4.2
The array q can be computed with the dynamic program, and q[A, B, 0] = 0 as the boundary case.
We prove first the only if direction.Condition (1) follows since S is C1P matrix.Condition (3) follows from the definitions of a and b.Finally, by definition, Secondly, by definition of the last column, we have a(S) = A .Since S ′ is C1P, the only C1P violation can occur if S ij = 1 for i ∈ a(S � ) .Since a(S � ) ⊆ A , this cannot happen.◻ Here, the first term corresponds to the error coming from the jth column with indices in B ⧵ A corresponding to the 1 s in the jth column of S. The term r [A, B, j]  is the optimal error up to the (j − 1)th column such that the C1P requirement is met.
Once we have computed q, we can obtain the optimal cost of DcMp-s by computing In order to recover S, let us first write A (m) and B (m) to be the minimizers for Eq. 2.
During the dynamic program, we store the minimizers of r[A, B, j] Eq. 1 for every A, B, and j.We then iteratively define A (j−1) and B (j−1) to be the sets responsible for r[A (j) , B (j) , j] .The optimal S is then defined by setting S ij = 1 if and only if i ∈ B (j) ⧵ A (j) .
Next we show the running time needed to solve S.
Proof We can precompute the vectors (1) where

3
Column-coherent matrix decomposition different subsets S = B ⧵ A and there are m different j.Consequently, computing the needed error terms requires O 2 k (k + nm) time in total.
Consider now r [A, B, j].We can show that holds.This identity allows us to compute r[A, B, j] in O(k) time assuming we have computed r[A � , B � , j] for every subset A ′ and B ′ .Since an index in [k] can either belong to A, or belong to B ⧵ A , or be outside of both sets, there are O 3 k valid pairs of (A, B).Consequently, are O 3 k m cells in r.
Thus computing r requires O 3 k km time and computing q requires O 3 k km + 2 k nm time.◻ We should point out that unlike the exact exponential algorithm given in Sect. 3 this algorithm may be practical as long as k is small.These cases may be particularly useful if we are using the obtained W to visualize the rows of X in a plane.
Proposition 4.3 shows that DcMp-s is a fixed-parameter tractable problem.It is not known whether the main problem DcMp is also fixed-parameter tractable.We conjecture that some techniques used by Fomin et al. (2020) can be adopted to develop a fixed-parameter tractable algorithm.

Hill-climbing algorithm for Dcmp-S
If k is large, the dynamic program given in the previous section becomes impractical.In such cases, we propose the following optimization strategy.Instead of optimizing all rows of S simultaneously, we select one row to optimize while keeping the remaining rows fixed, as shown in Algorithm 2. We repeat this step for every row.
We can find the optimal row S i⋅ using a similar dynamic program as in the previous section, except now the array is going to be significantly smaller.More formally, let j ∈ [m] be an index.We define q[0, j] to be the error of the first j columns with S i⋅ being 0, q[1, j] to be the optimal error of the first j columns with S ij = 1 , and q[2, j] to be the optimal error of the first j columns with and S i⋅ not being full of 0 s.Naturally, when computing q[⋅, j] we require that S i⋅ satisfies the consecutive ones property.
To simplify the notation, let us write E = X − WS � , where S ′ is equal to the current S except that S � i⋅ = 0.
Proof We will prove the claim by induction over j.The case j = 1 is trivial.Assume that the claim holds for j − 1.
The case q[0, j] is trivial.
Let S be responsible for q [1, j].
is the minimum of the two cases, proving the case for q[1, j].
Let S be responsible for q [2, j].
Once q is computed, the optimal error is equal to = min x q[x, m] .In order to restore the corresponding S i⋅ , let us define t j = 1 if q[1, j] < q[2, j] , and 0 otherwise.Let us also define s j = 1 if q[0, j] < q[1, j] , and 0 otherwise.If = q[0, m] , then the corresponding S i⋅ is all zeros.Otherwise, the last column containing 1 s is equal to b = max j | t j = 1 .To extract the starting point of Next we show the time needed to update S. Proof For a fixed i, the matrix E can be computed in O(mn) time by maintaining the matrix, say, Z = X − WS , and using the identity E = Z + W ⋅i S i⋅ .Computing a single entry in q requires O(n) time due to error term.Since there are O(m) cells in q, we can find optimal row in O(nm) time.Since, there are O(k) different values for i, the result follows.◻ Column-coherent matrix decomposition

Greedy algorithm
In the previous section both algorithms kept W fixed while optimizing S. In this section we consider approaches that update W and S at the same time.
More specifically, we propose optimizing a single row, say , of S, the corresponding column W ⋅ , and the first column W ⋅1 , 2 while keeping the remaining cells in W and S fixed.The algorithm, given in Algorithm 3 enumerates over until no gain is possible.
In addition to GExact, we consider a faster variant, Est, that provides (1 + ) approximation of the optimal row.

Solving the greedy step exactly
Our next step is to solve the optimization problem in GrEEDy exactly and as quickly as possible.
To this end, fix .Let us write R = X − W � S , where W ′ are the current weights except that W ⋅1 and W ⋅ are set to 0. In other words, we consider the differences between X and the decomposition without the first and the th dimension.
Assume that we have selected a new S ⋅ such that 1 s start at the ith column and end at the jth column.Then it is straightforward to see that the optimal error is equal to a(i, j) + b(i, j) , where where (3) 2 Recall that the first column is essentially the bias term of the decomposition.
1 3 Here, a(i, j) is the error of the columns between i and j, and b(i, j) is the error of the remaining columns.Moreover, this error is achieved if we set To find the optimal S ⋅ , we simply test every index pair i ≤ j , leading to the follow- ing proposition.
Proposition 5.1 A single iteration of GExact runs in O m 2 nk time.
Proof For a fixed , the matrix R can be computed in O(mn) time by maintaining the matrix, say, Z = X − WS , and using the identity R = Z + W ⋅ S ⋅ + W ⋅1 S 1⋅ .We can compute a(i, j) and b(i, j) in O(n) time by precomputing the cumulative sums and the second moments.That is, we first precompute in O(nm) time the func- tions f (x) = ∑ y≤x R ⋅y and g(x) = ∑ y≤x R 2 ⋅y , where the square is applied element-wise.Then we can compute a(i, j) in O(n) time with the standard identity where the square is applied element-wise and e is a vector of 1 s of length k.The computation of b(i, j) is similar.
Since there are O m 2 pairs of (i, j) and O(k) different values of , the claim fol- lows.◻

Approximating the greedy step fast
Finding the optimal row requires O m 2 n time which may be impractical for large values of m.In this section we design an algorithm that (1 + )-approximates the optimal row in O(mn∕ ) time.
The main idea of the algorithm is as follows: Recall the definition of a and b as given in Eq. 3. Let (i * , j * ) be the index pair minimizing a(i, j) + b(i, j) .Assume that we know that is larger than but close to a(i * , j * ) .If we were to select the largest j such that a(i * , j) ≤ , then a(i * , j) is close to a(i * , j * ) and b(i * , j) ≤ b(i * , j * ) since j * ≤ j.
We will see later that, given , enumerating every maximal interval such that a(i, j) ≤ can be done in linear time.The issue is that we do not know .Instead we compute an upper bound of a(i * , j * ) , say, u and a decrement and test every = u − r , where r is an integer.For each tested pair (i, j) we Column-coherent matrix decomposition compute a(i, j) + b(i, j) , and store the smallest cost, say , as well as the pair that yielded , which is the final output of the algorithm.See Algorithm 4 for the pseudo-code.
We still need to select u and such that the approximation guarantee holds and there are not too many values of that need to tested.To the end, we set u = 2 and = , where = min i≤j max(a(i, j), b(i, j)) .Proposition 5.2 shows that these values are suitable.The complete pseudo-code is shown in Algorithm 5.
Before proving correctness we need to show that we can compute in linear time.The algorithm, MaxsEG, for finding is given in Algorithm 6.The algorithm enumerates index pairs in the following manner: if currently a(i, j) < b(i, j) we increase j as this potentially decreases max(a(i, j), b(i, j)) , otherwise we increase i.
Let us now prove the correctness of MaxsEG.Proof Let i * and j * be the indices yielding the smallest error max(a(i, j), b(i, j)) .If are ties, we select the smallest possible indices.Let i and j be the variables used in MaxsEG.To prove the lemma, we show by induction that MaxsEG has visited (i * , j * ) , or i ≤ i * and i ≤ j * .The induction base i = j = 1 is trivial.
In order to prove the general case assume that the claim holds for (i, j).If Max-sEG has been visited, then we have nothing to prove.Assume otherwise, then by the induction assumption i ≤ i * and j ≤ j * .
If i = i * and j = j * , then we have nothing to prove.If i < i * and j < j * , then the induction step follows trivially since we increase i or j by 1.
Assume i = i * and j < j * .
Assume i < i * and j = j * .
The while-loop in MAXSEG(a, b) is executed at most 2m times.Each a(i, j) or b(i, j) requires O(n) time, proving the claim.◻ Next, we prove the approximation result.Proof Let (i * , j * ) be the indices yielding O .Similarly, let (i � , j � ) be the indices yielding .Lemma 5.1 To summarize ≤ O ≤ 2 .
Let be the smallest variable used by Est such that ≥ a(i * , j * ) .Such variable exist since a(i * , j * ) ≤ O ≤ 2 .During this iteration, let j be the largest variable vis- ited by Est when i = i * .Note that since a(i * , j * ) ≤ , we must have j ≥ j * as other- wise is not maximal.
Since is minimal, we have a(i * , j * ) ≥ − .Consequently, proving the first claim.
To prove the second claim note that the inner loop is executed O(m) times and the outer loop is executed O(u∕ The running time results in Lemma 5.1 and in Proposition 5.2 immediately imply the following result.

Proposition 5.3 A single iteration of GESt runs in O(mnk∕ ) time.
In order to improve the performance in practice, at the end of each iteration we optimize S using the for-loop given in IHIll.This update does not change the running time analysis.
We should point out that the problem relevant to minimizing a(i, j) + b(i, j) is segmentation, where the goal is to partition a sequence into k segments minimizing some error function.The segmentation problem can be solved exactly in quadratic time (Bellman 1961) and approximated in polylogaritmic time (Guha et al. 2006;Tatti 2019).3However, we cannot use these results directly since b(i, j) depends both on the beginning and on the end of the row.

Bottom-up algorithm
In this section we introduce our final algorithm.We start by considering an easier decomposition problem.
We say that a binary matrix S is segment matrix, if all the rows have consecutive ones property, are disjoint, and every column has at least 1.This definition leads to the following optimization problem.Problem 6.1 (sEG) Given a matrix X of size n × m and an integer k, find a segment matrix S of size k × m and a matrix W of size n × k such that ‖X − WS‖ F is minimized.
We should point out that sEG is equivalent to segmenting X into k segments, each segment correspond to 1 s in a row of S, while minimizing L 2 cost.This problem can be solved with a dynamic program in O km 2 n time (Bellman 1961).Proposition 6.1 Assume a C1P matrix S of size k × m and a weight matrix W of size n × k .Let k � = 2k − 1 .Then there is a segment matrix T of size k � × m and a weight matrix U of size n × k � such that WS = UT.Proof We claim that there are at most k � − 1 columns in S that are different than the column on their right.The claim is trivial for k = 1 , and follows immediately by induction since a new row in S introduce at most 2 such columns.
Let i 1 , … , i k � −1 be these columns.Write also i 0 = 0 and i k � = n .We define T such that the jth row has 1 s between i j−1 + 1 and i j , and 0 otherwise.Finally, we set U such that jth column is equal to WS ⋅i j .The proposition follows.◻ 1 3 Column-coherent matrix decomposition The proposition leads to the following approach, for which the pseudo-code is given in Algorithm 7. Given X we find a decomposition with 2k − 1 components WS, where S is a segment matrix.In order to manipulate S we will represent these matrices as a sequence of pairs I = ((a i , b i )) , where a i and b i indicate the column end points containing 1 s at the ith row.
In order to transform S to C1P-matrix with only k segments, we first add a constant row full of 1 s, that is, we add (1, m) to I.
We then test each pair, say (a, b) and (a � , b � ) , in I.We replace these pairs with a new pair of (s, t) = (min(a, a � ), max(b, b � )) , generating a new candidate.We test the new candidate by solving DcMp-W.In addition, if I contains that starts at t + 1 , we generate a new candidate by extending that pair to max(a, a � ) , and test the new candidate.Similarly, if I contains a pair that ends at s − 1 , we generate a new candidate by extending that pair to min(b, b � ) , and test the new canidate.
After the tests, we keep w best candidates, where w is a user parameter specifying the width of the beam search.As a tiebreaker we use the total number of 1 s.This procedure is repeated k times for each of the w candidates, after which only k rows remain in each I.We select the candidate with the lowest score as the final output.During the merging phase we need to solve O wk 3 DcMp-W problems for which we need O k 3 (k 3 + mk 2 + nmk) time.◻ We should point out that MErGE has a high dependency on k, due to the excessive comparisons when merging rows.Luckily, k is typically of moderate size.We leave developing more aggressive merging strategies for large k as a future line of work.

Related work
In this paper, we require that S has a very specific shape, making the neighboring columns of WS to look similar.Instead of having a hard constraint, previous approaches regularized S by punishing large changes between neighboring columns.Hsiang-Fu et al. ( 2016) regularized matrix decomposition with a score based on a Markov model, thus encouraging discovering temporal dependencies.Chen and Cichocki (2005) considered non-negative decompositions WS where the rows S are regularized by the error when compared against exponentially weighted moving average, thus encouraging smooth behavior of rows.In similar spirit, regularizations have been propposed where a column of S is compared to its neighboring columns, encouraging having similar values.(Rallapalli et al. 2010;Roughan et al. 2011;Xiong et al. 2010).
In related work, Tatti and Miettinen (2019) considered permuting and approximating binary matrix X with W T •S where W and S are both C1P-matrices and • is a boolean multiplication.In their setup the order of column and rows is not fixed but one also needs to find a good permutation of rows and columns, as well as find W and S. Discovering such decomposition is closely related to discovering tilings, regions of 1 s in a binary matrix that are dense.Discovering such tilings have been proposed by Miettinen et al. (2008), minimizing the Frobenius norm.In addition, Gionis et al. (2004), Tatti and Vreeken (2012) proposed mining geometric tiles, that is, tiles that are column and row coherent, organized as trees while maximizing a likelihood-based score.Geerts et al. (2004) proposed mining covering binary data with k tiles while maximizing the number of 1 s covered by the tiles.Similarly, Henelius et al. (2016) proposed mining tiles from time series that were columncoherent while maximizing same objective.Kontonasios and De Bie (2010) proposed mining tiles maximizing a score based on a maximum entropy model.Xiang et al. (2008) considered representing the data exactly with tiles while minimizing their border length.Discovering tiles in a data stream was considered by Lam et al. (2014).
The aforementioned approaches are focused on modeling binary data.Similar to tilings, finding biclusters, submatrices in real-valued data that are coherent according to a given objective function have been proposed (Cheng and Church 2000;Madeira and Oliveira 2005;Zhang et al. 2005;Hartigan 1972).These approaches do not regularize or constraint biclusters based on the column/row order.
Our method resembles a problem of segmentation, where the goal is to partition the sequence into k segments such that segments are cohesive according to some 1 3 Column-coherent matrix decomposition error function (Bellman 1961).In our case, the segments would be the neighboring columns in S having equal values.While the standard segmentation problem is solvable in polynomial time (Bellman 1961;Guha et al. 2006;Tatti 2019) our problem is NP-hard because the columns of W may participate in multiple segments.In similar fashion, Gionis and Mannila (2003) considered an NP-hard problem where k segments can only use h < k different centroids.However, their approach cannot be applied to our problem due to the differences in the optimization problems.

Experimental evaluation
In this section we present our experimental evaluation.

Datasets
We used a series of synthetic datasets and 4 real-world datasets as benchmark datasets.
We generate two synthetic dataset series as follows.In the first set, each dataset is a size of 500 × 500 and has the form W gen S gen + N .Here S gen is of size 5 × 500 , the ith row has 1 s between 50i − 49 and 550 − 50i , W gen has real values sampled uniformly between 1 and 2, and N is a matrix of size 500 × 500 , consist- ing of Gaussian noise with 0 mean and 2 variance.We denote this dataset by Mode ( ) , and vary .
In the second dataset series, each dataset has the form W gen S gen + N .Here, S gen is of size 5 × n with intervals of 1 s sampled uniformly.However, we reject cases where two segments share the same end point since the ground truth becomes ambiguous as the shorter row can be subtracted from the longer row.The ith column in W gen has real values sampled uniformly between 1i and 2i.Finally, N is a matrix of size n × m , consisting of Gaussian noise with 0 mean and 2 variance.We denote this dataset by Syn ( ) , and vary .Unless specified, we set n = m = 500.
The first real-world dataset, Milan, consists of monthly averages of maximum daily temperatures in Milan between the years 1763-2007. 4The second dataset, Power, consists of hourly power consumption (variable global_active_ power) of a single household over almost 4 years, a single time series representing a day. 5The third dataset, ECG are heart beat data (Goldberger et al. 2000).We used MLII data of a single patient (id 106) from the MIT-BIH arrhythmia database, 6 containing both normal beats and abnormal beats with premature ventricular contraction.Each time series represent measurements 4 https:// www.ncdc.noaa.gov/.

Setup
We implemented the 4 algorithms using Python and used a laptop with Intel Core i5 (2.3GHz) to conduct our experiments. 8 We decomposed each dataset with 5 algorithms using k = 5 .Since the 4 iterative algorithms require initial decomposition, we used the solution given by MErGE as the starting point.To speed up computation of MErGE, we used approximative segmentation algorithm by Guha et al. (2001) with = 0.05 .We set the beam search width to w = 50 .Finally, we used = 0.1 for GEst.

Results
Synthetic data: Our first goal is to test how well we can recover ground truth from synthetic data as a function of noise.We used MErGE with k = 5 , the other algo- rithms produced identical results.For comparison we used SVD decomposition in the following manner: we first demeaned every row and used these means as the first column in W (with the row being full of 1 s), we then applied SVD with 4 components to obtain the remaining part of the decomposition.
In Figs.1a, 2a we show the error, normalized by nm, of sVD and MErGE as a function of the variance of the noise .As expected, the cost increases as the noise increases.Moreover, sVD has a smaller cost as it does not have C1P requirements on S. In Figs.1b, 2b, we see that the errors of the disovered decompositions are always slightly smaller than the error of the ground truth.This is largely due to the matrix W optimized to the deviations in X due to the noise.
Next, we compare how well the recovered S matches the ground truth S gen .To that end, we computed L 1 distance between S discovered by MErGE, here we took into account all possible row permutations of S. Note that S discovered by sVD 1 3 Column-coherent matrix decomposition has orthogonal rows, which does not hold for S gen .Therefore, in order to make the comparison more fair, when evaluating sVD, we projected S gen to the subspace spanned by S, and computed L 1 distance between the projection and S gen .In Figs.1c, 2c we show the L 1 distance, normalized by mk, of sVD and MErGE as a function of the variance of the noise .We see that in both cases the distance increases as a function of .However, MErGE is more resilient to the noise and outperforms sVD.   2 and  3 we show the errors, running time, and number of required iterations.In order to normalize the errors we used the error of an SVD decomposition with 5 components.
Our first observation is that MErGE typically finds a good solution that the iterative algorithms cannot improve.However, in ECG MErGE finds the worst decomposition, further improved by every iterative algorithm, with IExact yielding the smallest cost.
The error values are between 1 and − 7.5, that is, produced errors are up to 8 times larger than the errors of SVD decomposition.This is expected, as our decomposition is significantly more restricted when compared to SVD.The largest error ratio of 7.5.was achieved for Population.However, in this case both decompositions are accurate: SVD achieves an L 2 error of 8 × 10 −5 for an average row, while MErGE achieves an L 2 error of 6 × 10 −5 for an average row.Recall that a single row is a his- togram summing to 1.
Next, we demonstrate how error behaves as a function of k.A typical example is shown in Fig. 4a for ECG data.Here, the error drops quickly as k increases: an error for k = 20 is 2% of the error for k = 1.
Running time: In our experiments, IHIll was the fastest and GExact or MErGE were the slowest, mostly due to the O m 2 term.We should stress that the running times for the iterative algorithms do not include the running time needed to compute the initial solution.
Figure 3 shows the running time of the algorithms as we increase either n or m.The behavious is as expected.The algorithm scale linearly with n, and GExact and MErGE scale quadratically with m while the remaining algorithms scaler linearly with m.The reported times in Fig. 3 are per iteration, the number of iterations required were between 1 and 3. We see in Table 2 that GEst yields as good decomposition for ECG as GExact, despite solving the subproblem only approximately.On the other hand, GEst is faster than GExact.
Next, let us compare the running times of the two iterative algorithms.Recall that IExact is exponential w.r.t.k while IHIll remains polynomial.This effect is shown in Fig. 4b with ECG data.Here IExact slows down considerably as k increases when compared to IHIll (note that y-axis is logarithmic).
Similarly, GExact requires quadratic time w.r.t.m while GEst is quasilinear.This effect is shown in Fig. 4c when decomposing m first columns of ECG data: GExact slows down faster than GEst as the number of columns increases.
Examples: Finally, as an application we consider scatter plots of ECG and Population datasets, shown in Fig. 6.Here, we used GExact with k = 3 and used 2nd and 3rd columns of W as coordinates of each row in X.We did not plot the first column W as this is a bias term.In Fig. 5a we see that the normal beats and the abnormal beats yield different scatter plots, suggesting that the decomposition was able to find discriminative features.These features are shown in Fig. 6b, compared to the data averages given in Fig. 6a.The y-axis in Fig. 5b differentiates large university cities in Finland from the more rural municipalities.

Concluding remarks
We proposed a matrix decomposition problem that promotes having neighboring columns to have similar values.More specifically, we propose approximating X with WS, where S is a binary matrix such that 1 s on each row of S form a contiguous segment.We showed that the problem is inapproximable and proposed 5 algorithms whose computational complexity is summarized in Table 1.
Of the 5 algorithms, MErGE produces good results that typically cannot be improved by the iterative algorithms.When improvement is possible the greedy algorithms GExact and GEst is a good choice: GExact when the number of columns is small and GEst when the number of columns is large.
We should point out that the consecutive ones requirement is strict.This is by design, since it allows us to summarize each row of S with just two numbers.An immediate extension would be to allow S to have segments of ones, per row; proposed algorithms can be extended to handle such a case.However, we can achieve a more flexible decomposition with the current approach by multiplying k with .
Future lines of work include additional regularization and relaxing the constraints for S, for example, requiring that S must be consistent with a PQ-tree (Booth and Lueker 1976).

m+1 2 ∈
O m 2 different rows of length m with consecutive ones property.A C1P matrix of size k × m has k − 1 such rows.Consequently, there are O m 2k−2 such matrices, leading to the following result.Proposition 3.2 The solution for Dcmp can be found in O m 2k−2 (k 3 + mk 2 + nmk) time.
Next, let us analyze the running time of MErGE.Proposition 6.2 mErGE runs in O knm 2 + wk 3 (k 3 + mk 2 + nmk) time.Proof Finding initial segmentation requires O knm 2 time.

Fig. 1 Fig. 2
Fig. 1 Results Merge and SVD decompositions for Mode( ) as a function of noise ,averages of 10 runs.a cost of the decomposition, normalized by nm.b cost of the decomposition, normalized by the cost of the ground truth decomposition.c L 1 distance between S produced by Merge and the ground truth S gen , normalized by mk

Fig. 4 a
Fig. 3 a, b Running time as a function of n for Syn(1) while m = 500 .c, d Running time as a function of m for Syn(1) while n = 500 .The running times are per iteration, except for MErGE.Note that the time scales differ, and the x-axes are scaled with 1000

Fig. 5 Fig. 6 a
Fig. 5 Scatter plots based on W of ECG and Population.Here, GExact with k = 3 was used

Table 1
Summary of the proposed algorithmsHere T ∈ O k 3 + mk 2 + nmk is the time needed to solve a least squares problem.The second column shows parameters needed in addition to the input matrix and number of components k.The running times are per iteration, except for MErGE

Table 2
Sizes of the datasets and results of the greedy algorithmsEach matrix is a size of n × m .E is the error of the decomposition normalized by the error of SVD decomposition, T is the execution time of an algorithm, R is the number of rounds 8The code is available at https:// versi on.helsi nki.fi/ dacs/ column-coher ent-decom posit ion.

Table 3
Results for MErGE, IHIll, and IExact E is the error of the decomposition normalized by the error of SVD decomposition, T is the execution time of an algorithm, R is the number of rounds