1 Introduction

Matrix decomposition, a.k.a. matrix factorization, has a long history and is an indispensable tool in matrix algebra [14]. Many applications of matrix decomposition to data mining are described in a recent book on massive data mining by Rajaraman et al. [39]. The well-known singular value decomposition (SVD), for example, is a well-established technique that has been applied in diverse areas, ranging from statistics, image processing, and signal processing to data analytics. Although SVD provides a powerful tool in many applications, it suffers from a lack of interpretability in some of them [31]. To address the interpretability issue, researchers have investigated nonnegative matrix factorization (NMF) [3, 25, 26, 47]. In applications such as digital image analysis, DNA analysis, and chemical spectral analysis, for example, it is required that the factor matrices have only nonnegative elements.

To deal with categorical data in data mining, there have recently been intensive research activities in Boolean matrix decomposition (BMD). This problem has appeared and been investigated in many different guises. A good overview can be found in the Ph.D. thesis of Miettinen [32], and Vaidya [45] surveys many problems equivalent to BMD. The essence of these problems can be abstracted as formal concept analysis (FCA) [12]. The bi-clique cover problem [1, 6, 9, 17, 29] is a particularly nice equivalent formulation of BMD. Unfortunately, the bi-clique covering of a bipartite graph, hence BMD, is an NP-hard problem [38], even for chordal bipartite graphs [35]. However, it can be solved in polynomial time for some other subclasses of bipartite graphs [11, 30, 35].

In connection with data mining, BMD has attracted a great deal of research interest in recent years, as evidenced by a large number of recent publications. The seminal work by Miettinen et al. [32, 34] was a catalyst to ignite a wave of interest in BMD and its applications to data mining, for example, [2, 4, 5, 21, 33, 34, 40, 43, 46, 49].

By \(M\in \{0,1\}^{m\times n}\), we mean that M is an \(m\times n\) Boolean matrix. BMD aims to find two matrices \(U\in \{0,1\}^{m\times k}\) and \(V\in \{0,1\}^{k\times n}\) such that the difference \(\Vert M-U\circ V\Vert _L\) under some norm L is minimized with a given k or as small a k as possible. The minimum possible k is called the Boolean rank of M. It is known that the Boolean rank of a binary matrix may be larger or smaller than its real rank [15]. Moreover, the rank of any real matrix can be computed efficiently by Gaussian elimination, while finding the Boolean rank of a binary matrix is NP-hard [37]. The minimization of \(\Vert M-U\circ V\Vert _L\) under the Hamming norm L for a given k is called the discrete basis problem (DBP) [34].
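As a small self-contained illustration of this gap between the two ranks (our own example, not taken from [15]), the complement of the \(6\times 6\) identity matrix has real rank 6 but admits a Boolean factorization with \(k=4\); the sketch below verifies this in NumPy.

```python
import numpy as np
from itertools import combinations

n = 6
M = 1 - np.eye(n, dtype=int)                 # complement of the identity matrix

# Boolean rank-4 factorization: label row/column i with a distinct 2-subset S_i of {0,1,2,3};
# row i then "covers" column j exactly when S_i is not contained in S_j, i.e., when i != j.
subsets = list(combinations(range(4), 2))    # 6 distinct 2-subsets
U = np.array([[int(t in S) for t in range(4)] for S in subsets])      # 6 x 4
V = np.array([[int(t not in S) for S in subsets] for t in range(4)])  # 4 x 6

B = (U @ V > 0).astype(int)                  # Boolean product: threshold the integer product
assert np.array_equal(B, M)
print("real rank:", np.linalg.matrix_rank(M))          # prints 6
print("Boolean factorization verified with k =", U.shape[1])
```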

We can divide \(\Vert M-U\circ V\Vert _L\) into two components [4]: \(E_u\), the number of 1’s in M that are 0’s in \(U\circ V\), and \(E_o\), the number of 0’s in M that are 1’s in \(U\circ V\). In this paper, we require that \(E_o=0\), i.e., \(U\circ V \le M\); in other words, if an element of M is a 0, then the corresponding element of \(U\circ V\) must also be a 0. This condition is called from-below approximation in [4, 5]. We initially require that \(\Vert M-U \circ { V} \Vert _L =0\) under any norm L, namely we are interested in exact BMD. Later in this paper we relax the requirement of exact decomposition, and also discuss from-below approximation to BMD. Unless otherwise specified, \(\Vert M\Vert \) (resp. \(\Vert {\varvec{v}}\Vert \)) shall denote the number of non-0 elements in matrix M (resp. vector \({\varvec{v}}\)), i.e., we adopt the \(l_0\) norm. Since BMD is an NP-hard problem, it is impractical to insist on discovering U and V with the minimum k, especially when the size of M is very large.

Geerts et al. [13] formulate the problem as follows. A tile consists of a set of 1’s in a Boolean matrix that appear at every intersection of a set of rows and a set of columns, and the number of those 1’s is called the area of the tile. A tile is also called a combinatorial rectangle in a communications context [22], and a maximal tile corresponds to a formal concept in FCA [12]. A set of tiles is called a tiling. Geerts et al. [13] investigate several tiling problems cast in the context of databases. We paraphrase some of them as problems of covering 1’s in a given matrix M.

  • Minimum tiling Find a tiling containing the smallest number of tiles that together cover all the 1’s in M.

  • Maximum k-tiling Find a tiling consisting of at most k tiles covering the largest number of 1’s in M.

  • Large tile mining (LTM) Given a minimum threshold \(\sigma \), find all tiles whose area is at least \(\sigma \).

Thus, the difference between maximum k-tiling and the discrete basis problem is that the former imposes the from-below approximation condition, but the latter does not. Geerts et al.’s main interest is in designing an algorithm for maximum k-tiling, which can also be used to solve the minimum tiling problem.

We mentioned nonnegative matrix factorization (NMF) earlier in connection with the interpretability issue. To address this issue from a different angle, Drineas et al. [7, 8] introduced CX- and CUR-decompositions. In the CX-decomposition, a given matrix M is decomposed into two matrices C and X such that the “difference” \(\Vert M-C\circ X\Vert _L\) is minimized, with the column-use condition that the columns of C must be a subset of the columns of M. Note that in CX-decomposition, a parameter k is given and it is required that C have no more than k columns.

In the CUR-decomposition, on the other hand, a given matrix M is decomposed into three matrices C, U, and R, with the condition that the columns of C (resp. rows of R) must be a subset of the columns (resp. rows) of M. Miettinen [31] applies the CX- (resp. CUR-) decomposition to BMD, where all the factor matrices are Boolean, and proposes heuristic algorithms. Miettinen calls these BCX- (resp. BCUR-) decompositions, and imposes the column-use condition that the columns of C form a subset of the columns of M.

We also adopt the column-use condition that the set of columns of the factor matrix U forms a subset of the columns of M in decomposing M into \(U\circ V\). Arguments in support of imposing this condition in some data mining applications can be found in [18, 31]. The role mining problem [46], which is also equivalent to BMD, is particularly useful for explaining/justifying the column-use condition. The human resources department of a company may assign certain permissions to its employees. These permissions can be represented by a Boolean matrix M, where the rows (resp. columns) represent the employees (resp. permissions). Since it is constructed by the management, each permission has a well-defined specific purpose, namely it is clearly interpretable. We now quote a paragraph from Ene et al. [9], which gives some support to the column-use condition. A “role” corresponds to a tile. “We have ignored the qualitative but important question of whether or not these roles are meaningful. Indeed, this is the biggest barrier we have encountered to getting the results of role mining to be used in practice; customers are unwilling to deploy roles that they can’t understand. In practice, role mining alone is not sufficient.”

Our main goal is to solve the minimum tiling problem defined above (or from-below BMD) with the column-use condition. Imposing the column-use condition has the beneficial effect of greatly reducing the search space for candidate tiles. As commented in [4], the number of maximal tiles may be exponential in \(n+m\). The major effort in [13] is on pruning tiles that are not good candidates. Thanks to the column-use condition that we impose, we are spared that task, and our search space consists of only O(n) tiles. Selecting the best k tiles from among a set of candidates is common to both their algorithms and ours, and both use essentially the same set-covering heuristic.

It is clear that exact BMD is easily reducible to the set cover problem (SC). Feige [10] shows that SC can be solved approximately with a guaranteed approximation ratio of \(O(\log n)\) in the worst case. Umetani et al. [44] give a survey on SC algorithms, but new heuristics are still being proposed, e.g., [23]. Bělohlávek et al. [5] comment that using an SC heuristic (without any modification) to solve BMD is not very effective. In another context, Miettinen [31] states that in practice algorithms without provable approximation factors performed better.

1.1 Main contributions of this paper

We present algorithms for minimum tiling (or exact BMD) in a limited search space; our algorithms can also be modified slightly to solve the maximum k-tiling problem (or from-below BMD).

Using elementary matrix calculus, we first derive a simple closed-form formula for matrix J satisfying \(M=M\circ J^\mathrm{T}\), where J is maximal in the sense that if any 0 in J is changed to a 1, then this equality is violated. We then propose two heuristic algorithms for decomposing \(M\in \{0,1\}^{m\times n}\) into \(U\circ V\), where \(U\in \{0,1\}^{m\times k}\) and \(V\in \{0,1\}^{k\times n}\), such that U satisfies the column-use condition and k is minimized. Matrix J greatly facilitates finding the set of all candidate tiles.

Two important performance criteria are (i) how close the common dimension k of the generated U and V is to the (Schein) rank of M, which is the minimum possible, and (ii) how fast U and V can be computed. We demonstrate that our algorithms do rather well in these respects in comparison with other known algorithms without the column-use condition [4, 5, 13, 49]. Obviously, without the column-use condition, one should be able to achieve a smaller (to be exact, not larger) k. When the objective is exact BMD, in spite of this restriction, our algorithms do as well as or better than the others (without the column-use condition) on four out of the five popular datasets we have tested, which we find rather surprising.

We apply one of our algorithms to educational data mining. The “ideal item response matrix” R, the “knowledge state matrix” A, and the “Q-matrix” Q play important roles. As they are related exactly by \(\overline{R}=\overline{A}\circ Q^\mathrm{T}\), given R, we can find A and Q with a small number (k) of interpretable “knowledge states,” using our heuristics.

Our algorithms can be slightly modified to find from-below approximation with competitive coverage (i.e., the fraction of the 1’s covered by the selected tiles). Since matrix operations are available in popular mathematical software packages such as MATLAB, Maple, and the R-language, we made special efforts to state our algorithms in matrix operations. We believe that it has helped to enhance readability.

1.2 Related work

Geerts et al. [13] concentrate on ‘maximum k-tiling’ and ‘large tile mining’ mentioned before. Their algorithm, which we call Tiling, uses the well-known greedy SC heuristic to iteratively find tiles that cover the most uncovered 1’s in the given matrix M. What is new is the way they choose the candidate tiles. Miettinen et al. [34] designed an algorithm, named Asso, to solve the DBP. As such, it is not designed to produce exact BMD with the minimum dimension.

Work by Bělohlávek et al. [4, 5] addresses exact, as well as approximate, BMD. They make use of ideas from FCA [12], and propose two heuristic algorithms, named GreConD and GreEss, which find good from-below approximation as well as exact BMD. They do not impose the column-use condition. In [4], they compare the performance of their algorithms with other known algorithms. Keprt and Snášel  [19] also discuss BMD, from the viewpoint of concept lattice [12].

Another group of researchers, Xiang et al., worked on the “summarization” of a database [49]. Essentially, they also try to find a tiling that covers all 1’s in a given transactional database, which can be represented by a Boolean matrix. However, the objective function that they want to minimize is not the number of tiles in the tiling, but the total size of the “description length,” based on the Minimum Description Length (MDL) Principle. (See Grünwald’s book [16].) They equate the “description length” of a tile with the sum of the number of 1’s in a row of the tile and the number of 1’s in a column of the tile. They propose a heuristic algorithm, named Hyper, to minimize this objective function, and claim that it also tends to minimize the number of tiles, which is the dimension k in our model.

Ene et al. [9] have proposed a very effective heuristic algorithm for the bi-clique cover problem. Their main contribution is to find a small set of candidate tiles in polynomial time, and they use a simple heuristic used in [13, 46] to select the smallest subset from among them. Therefore, it is not relevant to our work reported here, because, thanks to the column-use condition that we adopt, we already have a small number (i.e., O(n)) of candidates.

1.3 Paper organization

The rest of the paper is organized as follows. Section 2 gives some basic definitions which will be used throughout the paper, and reviews a minimal set of Boolean algebra facts needed to understand this paper. Section 3 derives a formula, which forms the theoretical basis for our algorithms. In Sect. 4, we describe two algorithms for decomposing a given M into the unknown U and V, and illustrate them with simple examples. Section 5 presents some experimental results, which are very encouraging. In Sect. 6, as an example of possible practical applications, we show how to apply our algorithms to educational data mining. Section 7 concludes the paper with some discussions.

2 Preliminaries

In this section, the basic notations and definitions used throughout this paper are given. We also cite some basic formulae of Boolean matrix theory. Some standard terms in matrix theory are used without definition since they are readily available, for example, in books by Golub and Van Loan [14] and Kim [20].

2.1 Notations and definitions

Let \(M= [\mu _{ij}] \in \{0,1\}^{m\times n}\). Although there is no intrinsic size or magnitude attribute in the values 0 (False) and 1 (True), we assume that the “larger than” (>) relation \(1> 0\) holds and that \(1-0=1, 1-1=0-0=0\). In expanded form, M is represented as

$$\begin{aligned} M=\left( \begin{array}{c} {\varvec{\mu }}_1 \\ {\varvec{\mu }}_2 \\ \vdots \\ {\varvec{\mu }}_m \\ \end{array} \right) =\left( \begin{array}{cccc} \mu _{11} &{}\quad \mu _{12} &{}\quad \cdots &{}\quad \mu _{1n}\\ \mu _{21} &{}\quad \mu _{22} &{}\quad \cdots &{}\quad \mu _{2n}\\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ \mu _{m1} &{}\quad \mu _{m2} &{}\quad \cdots &{}\quad \mu _{mn}\\ \end{array} \right) \end{aligned}$$
(1)

where \({\varvec{\mu }}_i=[\mu _{i1},\mu _{i2},\ldots ,\mu _{in}]\) is called the ith row vector, and \([\mu _{1j},\mu _{2j},\ldots ,\mu _{mj}]^\mathrm{T}\) is called the jth column vector of M. We also often use M[i,  : ] (resp. M[ : , j]) to denote the ith row (resp. jth column) vector of M. The matrix whose (ij) element is \(\overline{\mu }_{ij}\), where \(\overline{0}=1\) and \(\overline{1}=0\), is called the complement of M and is denoted by \(\overline{M}\). The matrix whose (ij) element is \(\mu _{ji}\) is called the transpose of M, and is denoted by \(M^\mathrm{T}\). The \(n\times n\) identity matrix is denoted by \(I_{n\times n}\), and \([0]_{m\times n}\) shall denote an \({m\times n}\) matrix whose elements are all 0’s. Let \({\mathbb {R}}\) (resp. \({\mathbb {N}}\)) denote the set of all real numbers (resp. natural numbers, including 0).

Definition 1

Let \({\varvec{p}}= [p_1, p_2,\ldots , p_n] \in \{0,1\}^{1\times n}\) and \({\varvec{q}}= [q_1, q_2,\ldots ,q_n] \in \{0,1\}^{1\times n}\). We say that \({\varvec{p}}\) dominates \({\varvec{q}}\) if \(p_i\ge q_i\) for all \(i=1,\ldots , n\), and write \({\varvec{p}} \ge {\varvec{q}}\). We write \({\varvec{p}} > {\varvec{q}}\) if \({\varvec{p}} \ge {\varvec{q}}\) and \(p_{i} > q_{i}\) for some \(i~(1\le i\le n)\), and say that \({\varvec{p}}\) strictly dominates \({\varvec{q}}\). Dominance relation is similarly defined for a pair of column vectors.

Definition 2

We define a partial order “\(\le \)” on a pair of binary matrices \(P= [p_{ij}] \in \{0,1\}^{m\times n}\) and \(Q= [q_{ij}] \in \{0,1\}^{m\times n}\). We write \(P\le Q\), if \(p_{ij}\le q_{ij}\), for all \( i=1,2,\ldots , m\) and \(j=1,2,\ldots ,n\).

Definition 3

Let \(P= [p_{ij}] \in \{0,1\}^{m\times n}\) and \(Q= [q_{ij}] \in \{0,1\}^{m\times n}\) such that \(P\le Q\). We say that P covers the set of 1 entries in Q at \(\{(i,j) \mid p_{ij} =1\}\).

Definition 4

If \(U= [u_{ij}] \in \{0,1\}^{m\times n}\) and \(V= [v_{ij}] \in \{0,1\}^{m\times n}\), the element-wise Boolean sum of U and V is defined by

$$\begin{aligned} U\vee V=[u_{ij}\vee v_{ij}] \in \{0,1\}^{m\times n}, \end{aligned}$$

and element-wise Boolean product of U and V is defined by

$$\begin{aligned} U\wedge V=[u_{ij}\wedge v_{ij}] \in \{0,1\}^{m\times n}, \end{aligned}$$

where \(0\vee 0 = 0\), \(1\vee 0 = 0\vee 1=1\vee 1 = 1\), \(0\wedge 0 = 1\wedge 0 = 0\wedge 1=0\), and \(1\wedge 1 = 1\).

For \(U= [u_{ij}] \in \{0,1\}^{m\times k}\) and \(V= [v_{ij}] \in \{0,1\}^{k\times n}\), their ordinary arithmetic product is defined by

$$\begin{aligned} P= UV =[p_{ij}]\in {\mathbb {R}}^{m\times n},\quad p_{ij}=\sum _{t=1}^k u_{it}v_{tj}. \end{aligned}$$
(2)

Their Boolean product is defined by

$$\begin{aligned} B= U\circ V = [b_{ij}]\in \{0,1\}^{m\times n},\quad b_{ij}=\vee _{t=1}^k (u_{it} \wedge v_{tj}). \end{aligned}$$
(3)

In a Boolean product, 1’s and 0’s are considered as Boolean values, while in an arithmetic product, they are treated as integers. Let M be given by (1) and c be a constant. The matrix whose (ij) element is \(c\mu _{ij}\) is called a scalar multiple of M and is denoted by \(c \cdot M\).
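To make the notation concrete, the following minimal NumPy sketch (our own; the helper name is not from the paper) shows the element-wise operations, the two products, and the \(l_0\) norm used throughout.

```python
import numpy as np

def boolean_product(U, V):
    """Boolean matrix product (3): b_ij = OR over t of (u_it AND v_tj)."""
    return (U @ V > 0).astype(int)

U = np.array([[1, 0],
              [1, 1]])
W = np.array([[0, 1],
              [1, 1]])
V = np.array([[0, 1, 1],
              [1, 0, 1]])

print(U | W)                   # element-wise Boolean sum  U v W
print(U & W)                   # element-wise Boolean product  U ^ W
print(U @ V)                   # ordinary arithmetic product (2)
print(boolean_product(U, V))   # Boolean product (3)
print(np.count_nonzero(U))     # the l_0 norm ||U|| (number of 1's)
```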

2.2 Brief review of matrix algebra relevant to this paper

The materials in this subsection, except Lemma 1, can be found in [14, 20].

Proposition 1

Associativity.

  1. (a)

    \((UV)W=U(VW)\)

  2. (b)

    \((U\circ V)\circ W =U\circ (V\circ W)\).

We can thus write UVW (resp. \(U\circ V\circ W\)) for (a) (resp. (b)) without ambiguity.

Proposition 2

Transpose of product.

  1. (a)

    For \(U \in \{0,1\}^{m\times k}\) and \(V\in \{0,1\}^{k\times n}\), \((U\circ V)^\mathrm{T} = V^\mathrm{T} \circ U^\mathrm{T}\) holds.

  2. (b)

    For \(U\in {\mathbb {R}}^{m\times k}\) and \(V\in {\mathbb {R}}^{k\times n}\), \((U V)^\mathrm{T} = V^\mathrm{T} U^\mathrm{T}\) holds.

Proposition 3

Product expansion.

$$\begin{aligned} M= & {} U\circ V =U[:,1]\circ V[1,:]\vee U[:, 2]\circ V[2,:]\vee \cdots \nonumber \\&\vee U[:,k]\circ V[k,:] \nonumber \\= & {} \vee _{t=1}^k \{U[:, t]\circ V[t,: ]\} \end{aligned}$$
(4)

The following proposition follows directly from (3).

Proposition 4

Let \({\varvec{p}}= \left[ p_1~p_2 \ldots p_m\right] \) and \({\varvec{q}}= \) \(\left[ q_1~q_2\ldots q_n\right] \) be two Boolean row vectors. We have

$$\begin{aligned} {\varvec{p}}^\mathrm{T}\circ {\varvec{q}}= & {} \left[ q_1\cdot {\varvec{p}}^\mathrm{T}~~q_2\cdot {\varvec{p}}^\mathrm{T}\ldots q_n\cdot {\varvec{p}}^\mathrm{T}\right] \end{aligned}$$
(5)
$$\begin{aligned}= & {} \left( \begin{array}{c} p_1 \cdot {\varvec{q}}\\ p_2 \cdot {\varvec{q}}\\ \vdots \\ p_m \cdot {\varvec{q}}\\ \end{array} \right) \in \{0,1\}^{m\times n}. \end{aligned}$$
(6)

For example, if \({\varvec{p}}= \left[ 0~1~0~1~0~1\right] \) and \({\varvec{q}}= \left[ 0~1~0~1~1\right] \), then

$$\begin{aligned} {\varvec{p}}^\mathrm{T}\circ {\varvec{q}} =\left( \begin{array}{llllll} . &{}\quad . &{}\quad . &{}\quad . &{}\quad .\\ . &{}\quad 1 &{}\quad . &{}\quad 1&{}\quad 1\\ . &{}\quad . &{}\quad . &{}\quad . &{}\quad .\\ . &{}\quad 1 &{}\quad . &{}\quad 1&{}\quad 1\\ . &{}\quad . &{}\quad . &{}\quad . &{}\quad .\\ . &{}\quad 1 &{}\quad . &{}\quad 1&{}\quad 1\\ \end{array} \right) \end{aligned}$$
(7)

Thus, \({\varvec{p}}^\mathrm{T}\circ {\varvec{q}}\) represents a tile. We identify \({\varvec{p}}^\mathrm{T}\circ {\varvec{q}}\) with the tile it represents, and sometimes call this expression itself a tile. The formula in the following lemma will be used to simplify our algorithms later.

Lemma 1

Let \({\varvec{p}}= \left[ p_1~p_2 \ldots p_m\right] \) and \({\varvec{q}}= \left[ q_1~q_2\ldots q_n\right] \) be two Boolean row vectors, and let \(C\in \{0,1\}^{m\times n}\). Then, the following equality holds.

$$\begin{aligned} \Vert C \wedge ({\varvec{p}}^\mathrm{T}\circ {\varvec{q}})\Vert = {\varvec{p}} C {\varvec{q}}^\mathrm{T}. \end{aligned}$$
(8)

Proof

The quantity \(\Vert C \wedge ({\varvec{p}}^\mathrm{T}\circ {\varvec{q}})\Vert \) is clearly the number of 1 elements of C such that the corresponding element of \({\varvec{p}}^\mathrm{T}\circ {\varvec{q}}\) is also a 1. By Proposition 4, the (ij) element of \({\varvec{p}}^\mathrm{T}\circ {\varvec{q}}\) is a 1 if \(p_i=q_j=1\), and a 0 otherwise. Note that \(C {\varvec{q}}^\mathrm{T}\in {\mathbb {N}}^{m\times 1}\) on the right-hand side of (8) is a column vector whose ith element is the number of 1’s in the ith row of C that lie in columns j satisfying \(q_j=1\). Now, \({\varvec{p}}(C {\varvec{q}}^\mathrm{T})\) adds up those ith elements of \(C {\varvec{q}}^\mathrm{T}\) for which \(p_i=1\), which is exactly the quantity on the left-hand side. \(\square \)
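A quick randomized NumPy check of Lemma 1 (a sketch; the variable names and the random instance are ours):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 5, 7
p = rng.integers(0, 2, size=m)          # Boolean row vector p of length m
q = rng.integers(0, 2, size=n)          # Boolean row vector q of length n
C = rng.integers(0, 2, size=(m, n))     # C in {0,1}^{m x n}

tile = np.outer(p, q)                   # p^T o q, as in Proposition 4
lhs = np.count_nonzero(C & tile)        # || C AND (p^T o q) ||
rhs = p @ C @ q                         # p C q^T, two ordinary (arithmetic) products
assert lhs == rhs                       # equality (8)
```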

3 BMD theorems

In the rest of this paper, we refer to matrix \(U\in \{0,1\}^{m\times k}\) defined by

$$\begin{aligned} U=\left( \begin{array}{c} {\varvec{u}}_1 \\ {\varvec{u}}_2 \\ \vdots \\ {\varvec{u}}_m \\ \end{array} \right) =\left( \begin{array}{cccc} u_{11} &{}\quad u_{12} &{}\quad \cdots &{}\quad u_{1k}\\ u_{21} &{}\quad u_{22} &{}\quad \cdots &{}\quad u_{2k}\\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ u_{m1} &{}\quad u_{m2} &{}\quad \cdots &{}\quad u_{mk}\\ \end{array} \right) \end{aligned}$$
(9)

and matrix \(V\in \{0,1\}^{k\times n}\) defined by

$$\begin{aligned} V=\left( \begin{array}{c} {\varvec{v}}_1 \\ {\varvec{v}}_2 \\ \vdots \\ {\varvec{v}}_k \\ \end{array} \right) =\left( \begin{array}{cccc} v_{11} &{}\quad v_{12} &{}\quad \cdots &{}\quad v_{1n}\\ v_{21} &{}\quad v_{22} &{}\quad \cdots &{}\quad v_{2n}\\ \vdots &{}\quad \vdots &{}\quad \vdots &{}\quad \vdots \\ v_{k1} &{}\quad v_{k2} &{}\cdots &{}v_{kn}\\ \end{array} \right) \end{aligned}$$
(10)

The following lemma follows easily from the fact that \(1\vee 1 = 1\).

Lemma 2

Define matrices \(G= [g_{ij}]=UV \in {\mathbb {N}}^{m\times n}\) and \(H=[h_{ij}]=U\circ V \in \{0,1\}^{m\times n}\). Then, for \(i=1,2,\ldots ,m\) and \(j=1,2,\ldots ,n\) we have

$$\begin{aligned} g_{ij}=0\Leftrightarrow & {} h_{ij}=0 \nonumber \\ g_{ij} \ge 1\Leftrightarrow & {} h_{ij}=1. \end{aligned}$$
(11)

The following proposition follows easily from definition.

Proposition 5

Let \({\varvec{p}}, {\varvec{q}} \in \{0,1\}^{1\times a}\) be two Boolean row vectors. Then “\({\varvec{p}}\) dominates \({\varvec{q}}\)” can be expressed as

$$\begin{aligned} {\varvec{p}} \ge {\varvec{q}} \Leftrightarrow \overline{{\varvec{p}}}\circ {\varvec{q}}^\mathrm{T} = {\varvec{q}} \circ \overline{{\varvec{p}}}^\mathrm{T}=0 \Leftrightarrow \overline{\overline{{\varvec{p}}} \circ {\varvec{q}}^\mathrm{T}}= \overline{{\varvec{q}} \circ \overline{{\varvec{p}}}^\mathrm{T}}= 1.\nonumber \\ \end{aligned}$$
(12)

Lemma 4 below plays an important role in what follows. In order to prove it, we first need to show a technical lemma.

Lemma 3

Let \(P\in \{0,1\}^{a\times p}\) be an arbitrary Boolean matrix.

  1. (a)

    For any two row vectors \({\varvec{u}}, {\varvec{v}}\in \{0,1\}^{1\times a}\) we have

    $$\begin{aligned}{}[\overline{{\varvec{v}}}=\overline{({\varvec{u}}\circ P)}\circ P^\mathrm{T}] \Rightarrow {\varvec{v}}\ge {\varvec{u}} \end{aligned}$$
    (13)
  2. (b)

    For any two matrices \(U,V\in \{0,1\}^{b\times a}\) we have

    $$\begin{aligned}{}[\overline{V}= \overline{U\circ P}\circ P^\mathrm{T}] \Rightarrow V\ge U. \end{aligned}$$
    (14)

Proof

(a) Suppose \(\overline{{\varvec{v}}}=\overline{({\varvec{u}}\circ P)}\circ P^\mathrm{T}\) holds. Then \(\overline{v}_j=0\) (i.e., \(v_j=1\)) if and only if

$$\begin{aligned} \overline{({\varvec{u}}\circ P)}\circ P[j,:]^\mathrm{T}=0. \end{aligned}$$

By Proposition 5, this implies that \({\varvec{u}}\circ P\) dominates the jth column of \(P^\mathrm{T}\), i.e., the jth row of P. Since this clearly happens if \(u_j=1\), we have \(u_j=1 \Rightarrow v_j=1\). It follows that \({\varvec{v}}\ge {\varvec{u}}\).

(b) Let \({\varvec{u}}_i\) (resp. \({\varvec{v}}_i\)) be the ith row vector of matrix U (resp. V), as in (9) (resp. (10)). Then (13) holds for each \(i~(1\le i \le b)\), namely,

$$\begin{aligned}{}[\overline{{\varvec{v}}}_i=\overline{({\varvec{u}}_i\circ P)}\circ P^\mathrm{T}] \Rightarrow {\varvec{v}}_i\ge {\varvec{u}}_i, \end{aligned}$$

and (14) follows. \(\square \)

Without loss of generality, we assume from now on that the given matrix M has no all-0 row or all-0 column. We now prove the following lemma, which provides a basis for the algorithms given in the next section.

Lemma 4

Let \(M\in \{0,1\}^{m\times n}\), \(U\in \{0,1\}^{m\times k}\), and \(V\in \{0,1\}^{k\times n}\) satisfy \(M=U\circ V\), and define

$$\begin{aligned} K\equiv & {} \overline{\overline{M}^\mathrm{T} \circ U} \end{aligned}$$
(15)

Then we have

  1. (a)

    \(V\le K^\mathrm{T}\), and

  2. (b)

    \(M=U\circ K^\mathrm{T}\)

Proof

(a) From (15), we get

$$\begin{aligned} \overline{K}= & {} \overline{M}^\mathrm{T} \circ U \end{aligned}$$
(16)

Plugging \(M=U\circ V\) into (16) and using Proposition 2(a) and the fact that the order of complementation and transpose is reversible, we obtain

$$\begin{aligned}&\overline{K}= \overline{U\circ V}^\mathrm{T} \circ U =\overline{V^\mathrm{T}\circ U^\mathrm{T}} \circ U. \end{aligned}$$
(17)

If we set \(U=P^\mathrm{T}\) in (17), we get

$$\begin{aligned} \overline{K}= \overline{V^\mathrm{T}\circ P} \circ P^\mathrm{T}. \end{aligned}$$

Thus we get \(K\ge V^\mathrm{T}\) using (14).

(b) Define \(N=U\circ K^\mathrm{T}\). We want to show that \(N=M\). From (15), we get

$$\begin{aligned}&N^\mathrm{T}=K\circ U^\mathrm{T}=\overline{\overline{M}^\mathrm{T} \circ U} \circ U^\mathrm{T}, \end{aligned}$$
(18)

which yields \(\overline{N}^\mathrm{T}\ge \overline{M}^\mathrm{T}\) or \(N\le M\) by (14). On the other hand, from \(V\le K^\mathrm{T}\) (see part (a)), we get \(M=U\circ V\le U\circ K^\mathrm{T}=N\). It follows that \(M= N\). \(\square \)

Example 1

Let

$$\begin{aligned} U=\left( \begin{array}{llll} 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 0 \\ 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad 0 &{}\quad 0 \\ 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 \\ \end{array} \right) ;\quad V = \left( \begin{array}{llllll} 1 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 \\ \end{array} \right) \end{aligned}$$

Then we have

$$\begin{aligned} M= & {} U\circ V = \left( \begin{array}{llllll} 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ \end{array} \right) ;\\ K= & {} \overline{\overline{M}^\mathrm{T} \circ U} = \left( \begin{array}{llll} 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 1\\ 0 &{}\quad 0 &{}\quad 1 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1\\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ \end{array} \right) \end{aligned}$$

Clearly, Lemma 4(a) holds, and we can verify Lemma 4(b) as well.

From now on, we consider the special case in Lemma 4, where \(U=M\), hence

$$\begin{aligned} J=\overline{\overline{M}^\mathrm{T}\circ M}\in \{0,1\}^{n\times n}. \end{aligned}$$
(19)

Lemma 4 has the following important implication.

Corollary 1

Given an arbitrary matrix \(M\in \{0,1\}^{m\times n}\), let J be defined by (19). Then, \(V\le J^\mathrm{T}\) holds for any matrix \(V\in \{0,1\}^{n\times n}\) satisfying \(M=M\circ V\). \(\square \)

Matrix J has a number of other important properties.

Theorem 1

For any \(M\in \{0,1\}^{m\times n}\), matrix J defined by (19) has the following properties.

  1. (a)

    \(J[i,j]=1\Leftrightarrow M[:,i]\ge M[:,j]\), i.e., column i dominates column j of M.

  2. (b)

    \(J[i,j]=J[j,i]=1\Leftrightarrow M[:,i]=M[:,j]\) \(\Leftrightarrow J[:,i]= J[:,j]\) and \(J[i,: ]= J[j,: ]\).

  3. (c)

    \(J[i,j]=1> J[j,i]=0\Leftrightarrow M[:,i]> M[:,j]\) \(\Rightarrow J[:,j]> J[:,i]\) and \(J[j,:] < J[i,:]\).

Proof

(a) If we let \({\varvec{p}}= M^\mathrm{T}[i,:]\) and \({\varvec{q}}= M^\mathrm{T}[j,:]\) in (12), then we get \(M^\mathrm{T}[i,:] \ge M^\mathrm{T}[j,:]\) if and only if

$$\begin{aligned} \overline{\overline{M}^\mathrm{T}[i,:]\circ M[:,j]}=1, \end{aligned}$$

which holds if and only if \(J[i,j]=1\) by (19). Note that \(M^\mathrm{T}[i,:] \ge M^\mathrm{T}[j,:]\) is equivalent to \(M[:,i]\ge M[:,j]\).

(b) By interchanging i and j in part (a), we get \(J[j,i]=1\Leftrightarrow M[:,i]\le M[:,j]\). It follows that \(J[i,j]=J[j,i]=1\Leftrightarrow M[:,i]=M[:,j]\). Thus for any column M[ : , k] we have \(M[:,k]\ge M[:,j] \Leftrightarrow M[:,k]\ge M[:,i]\), i.e., any column that dominates M[ : , j] also dominates M[ : , i], and vice versa. This implies \(J[:,i]= J[:,j]\). Similarly, for any column M[ : , k] we have \(M[:,k]\le M[:,j] \Leftrightarrow M[:,k]\le M[:,i]\), i.e., any column that is dominated by M[ : , j] is also dominated by M[ : , i], and vice versa, which implies \(J[i,:]= J[j,:]\).

(c) \(J[i,j]=1 > J[j,i]=0\) implies that M[ : , i] strictly dominates M[ : , j], i.e., \(M[:,i]> M[:,j]\). Therefore, for any column M[ : , k] we have \(M[:,k]\ge M[:,i] \Rightarrow M[:,k]> M[:,j]\), which implies \(J[:,j]\ge J[:,i]\); the dominance is strict because \(J[j,j]=1\) while \(J[j,i]=0\). Similarly, for any column M[ : , k] we have \(M[:,k]\le M[:,j] \Rightarrow M[:,k]< M[:,i]\), which implies \(J[j,:] \le J[i,:]\); again the dominance is strict because \(J[i,i]=1\) while \(J[j,i]=0\). \(\square \)

Example 2

The properties proved in Theorem 1 can be verified for the matrix M in Example 1, for which matrix J defined by (19) is

$$\begin{aligned} J=\overline{\overline{M}^\mathrm{T} \circ M}=\left( \begin{array}{llllll} 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ \end{array} \right) \end{aligned}$$

We now prove another useful property of matrix J defined by (19).

Lemma 5

Given an arbitrary matrix \(M\in \{0,1\}^{m\times n}\), let J be defined by (19). If any 0-element in J is changed to a 1, then \(M=M\circ J^\mathrm{T}\) no longer holds.

Proof

Assume, to the contrary, that some element \(J[j,i]=0\), \(1\le i,j \le n\), can be changed from 0 to 1 without violating Lemma 4(b) with \(U=M\), i.e., without violating \(M=M\circ J^\mathrm{T}\). Let \({\varvec{w}}_j = J[j,:]\), so that \(({\varvec{w}}_j)^\mathrm{T}\) is the jth column of \(J^\mathrm{T}\), and hence \(M\circ ({\varvec{w}}_j)^\mathrm{T}\) is the jth column of \(M\circ J^\mathrm{T}\). Since \(J[j,i]=0\), we have \(M[:,j] \not \ge M[:,i]\) by Theorem 1(a). Let \({\varvec{w'}}_j\) be obtained from \({\varvec{w}}_j\) by changing its ith element from 0 to 1. Then \(M\circ ({\varvec{w}}'_j)^\mathrm{T} \ge M[:,i]\), hence \(M[:,j]\not \ge M\circ ({\varvec{w}}'_j)^\mathrm{T}\), i.e., the jth column of the modified product no longer equals M[ : , j], a contradiction. We conclude that if any element in J is changed from a 0 to a 1, then \(M=M\circ J^\mathrm{T}\) is violated. \(\square \)

If we change \(J[3,1]=0 \rightarrow 1\) in Example 2, for example, then the [3, 3] element of \(M\circ J^\mathrm{T}\) becomes a 1, while \(M[3,3]=0\), and \(M=M\circ J^\mathrm{T}\) no longer holds.
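The following NumPy sketch (our own check, using the matrix M of Example 1) computes J by (19) and verifies Lemma 4(b) with \(U=M\), Theorem 1(a), and the maximality property of Lemma 5.

```python
import numpy as np

def bool_prod(A, B):
    return (A @ B > 0).astype(int)       # Boolean matrix product (3)

M = np.array([[1, 1, 1, 1, 1],           # the matrix M of Example 1
              [1, 1, 1, 0, 1],
              [1, 1, 0, 1, 1],
              [1, 1, 0, 0, 1],
              [1, 1, 1, 1, 1]])

J = 1 - bool_prod((1 - M).T, M)          # J = complement(complement(M)^T o M), eq. (19)
assert np.array_equal(bool_prod(M, J.T), M)          # Lemma 4(b) with U = M

# Theorem 1(a): J[i,j] = 1 iff column i of M dominates column j
for i in range(M.shape[1]):
    for j in range(M.shape[1]):
        assert J[i, j] == int(np.all(M[:, i] >= M[:, j]))

# Lemma 5: flipping any single 0 of J to 1 destroys M = M o J^T
for i, j in zip(*np.where(J == 0)):
    J1 = J.copy()
    J1[i, j] = 1
    assert not np.array_equal(bool_prod(M, J1.T), M)
```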

Theorem 2

Let \(M=U\circ V\) be an optimal decomposition of M satisfying the column-use condition, where \(U\in \{0,1\}^{m\times k}\), \(V\in \{0,1\}^{k\times n}\) and k is the minimum possible. Then for each \(i=1,2,\ldots , k\), we may assume that \(U[:,i]\circ V[i,:] \in \{M[:,t]\circ J^\mathrm{T}[t,:] \mid t=1,\ldots , n\}\).

Proof

Let

$$\begin{aligned} U\circ V =\vee _{i=1}^k \{U[:,i]\circ V[i,:]\}, \end{aligned}$$

and consider a particular term \(U[:,i]\circ V[i,:]\) in it. Since U consists of columns of M, there is an h such that \(U[:,i] =M[:,h]\). By Corollary 1, \(J^\mathrm{T}[h,:]\) is the maximal row vector \({\varvec{v}}\) such that \(M[:,h]\circ {\varvec{v}} \le M\), hence \(V[i,:] \le J^\mathrm{T}[h,:]\). We thus have \(U[:,i]\circ V[i,:] \le U[:,i]\circ J^\mathrm{T}[h,:] = M[:,h] \circ J^\mathrm{T}[h,:]\). Since \(M[:,h] \circ J^\mathrm{T}[h,:]\le M\) by Lemma 4(b) with \(U=M\), replacing every term of the decomposition by the corresponding tile \(M[:,h] \circ J^\mathrm{T}[h,:]\) yields another exact decomposition with the same k, which proves the claim. \(\square \)

Intuitively, Theorem 2 implies that the search space for an optimal decomposition of M under the column-use condition can be limited to \(\{M[:,t]\circ J^\mathrm{T}[t,:] \mid t=1,\ldots , n\}\). When the column-use condition is not imposed, the counterpart of Theorem 2 is proved in [4], using FCA.

In the next section, we design heuristic algorithms for exact BMD, based on Theorem 2.

4 Heuristic BMD algorithms

4.1 Algorithm description

In this section, we propose new algorithms for finding factor matrices \(U\in \{0,1\}^{m\times k}\) and \(V\in \{0,1\}^{k\times n}\) from matrix \(M\in \{0,1\}^{m\times n}\). By Theorem 2, we want to find a subset of \(\{M[:,t]\circ J^\mathrm{T}[t,:] \mid t=1,\ldots , n\}\) that provides an optimal tiling. Since an exhaustive search is obviously impractical, we want to devise a heuristic algorithm that yields a good suboptimal tiling.

Suppose there exists an l satisfying

$$\begin{aligned} U\circ V =\vee _{i=1,i\ne l}^k \{U[:,i]\circ V[i,:]\}, \end{aligned}$$

in other words,

$$\begin{aligned} \vee _{i=1,i\ne l}^k \{U[:,i]\circ V[i,:]\} \ge U[:,l]\circ V[l,:]. \end{aligned}$$
(20)

Then, we can safely eliminate the lth column U[ : , l] and the lth row V[l,  : ] from U and V, respectively, which helps reduce the dimension k. The condition (20) is equivalent to \(\Vert {\mathcal {T}}\Vert = \Vert {\mathcal {T}} - T_l\Vert \), where \({\mathcal {T}}=U V\) (the arithmetic matrix product defined by (2); in our algorithms \(V=J^\mathrm{T}\) with J given by (19)) and \(T_l = U[:,l]\circ V[l,:]\). There may be several indices l that satisfy (20). Therefore, we need to decide in which order the eliminations should be carried out. We thus define the selection index \(\sigma _i\) as follows:

$$\begin{aligned} \sigma _i=\Vert U[:,i]\Vert \times \Vert V[i,:]\Vert , \end{aligned}$$

where, as the reader recalls, \(\Vert \cdot \Vert \) represents the number of 1’s (the \(l_0\) norm) of a vector. Clearly, \(\sigma _i\) is the number of 1 entries in M that are covered by the tile \(T_i=U[:,i]\circ V[i,:]\). Given the initial matrices U and V satisfying \(M=U\circ V\), we generate the new matrix J by (19). There are at least two straightforward strategies that appear reasonable, regarding which attribute we should process first.

  1. (a)

    Remove-Smallest: Remove attribute j such that \(\sigma _j\) is the smallest, provided the removal does not affect M.

  2. (b)

    Pick-Largest: Retain attribute j such that \(\sigma _j\) is the largest.

Strategy (b) has been used before by other researchers, including Geerts et al. [13] and Vaidya et al. [46]. Our first heuristic algorithm adopts strategy (a). After deleting one attribute, we update U and V, and repeat the elimination process until there is no more attribute that can be deleted.

Algorithm 1

Remove-Smallest Input: Response matrix \(M\in \{0,1\}^{m\times n}\).

  1. 1.

    Initialize \(U=M\) and \(k=n\).

  2. 2.

    Compute

    $$\begin{aligned} V^\mathrm{T}=J=\overline{\overline{M}^\mathrm{T}\circ M}. \end{aligned}$$
    (21)
  3. 3.

    Compute

    $$\begin{aligned} {\mathcal {T}}=U V. \end{aligned}$$
  4. 4.

    For \(i=1,2,\ldots , k\) compute the size of the maximal tile for the ith attribute (\(\alpha _i\)) by

    $$\begin{aligned} \sigma _i=\Vert U[:, i]\Vert \times \Vert V[i,:]\Vert , \end{aligned}$$

    and rename the attributes so that \(\sigma _1\le \sigma _2\le \cdots \le \sigma _k\) holds.

  5. 5.

    For \(j=1,2,\ldots ,k\), do

    1. (a)

      Compute

      $$\begin{aligned} T_j =U[:,j] \circ V[j,:]; \end{aligned}$$
    2. (b)

      If \(\Vert {\mathcal {T}}\Vert = \Vert {\mathcal {T}} - T_j\Vert \) then (i) remove column U[ : , j] from U and row V[j,  : ] from V; and (ii) set \({\mathcal {T}}={\mathcal {T}}-T_j\); \(k=k-1\).

  6. 6.

    Output U and V.
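The listing above translates almost directly into matrix operations. The following NumPy sketch is our own transcription of Algorithm 1 (the helper names and the tie-breaking among equal \(\sigma _i\) are ours); for the matrix of Example 3 below it returns \(k=3\) factors whose Boolean product reproduces M exactly, matching (24) up to which of two identical columns of M is retained.

```python
import numpy as np

def bool_prod(A, B):
    return (A @ B > 0).astype(int)

def remove_smallest(M):
    """Sketch of Algorithm 1 (Remove-Smallest)."""
    U = M.copy()
    V = (1 - bool_prod((1 - M).T, M)).T            # V = J^T, Step 2, eq. (21)
    T = U @ V                                      # arithmetic product, Step 3
    sigma = U.sum(axis=0) * V.sum(axis=1)          # sigma_i = ||U[:,i]|| * ||V[i,:]||, Step 4
    keep = list(range(M.shape[1]))
    for j in sorted(range(M.shape[1]), key=lambda t: sigma[t]):   # smallest sigma first
        Tj = np.outer(U[:, j], V[j, :])            # tile T_j = U[:,j] o V[j,:], Step 5(a)
        if np.count_nonzero(T - Tj) == np.count_nonzero(T):       # Step 5(b): T_j is redundant
            T = T - Tj
            keep.remove(j)
    return U[:, keep], V[keep, :]                  # Step 6

M = np.array([[0, 0, 0, 0, 0, 0, 0],               # the matrix of Example 3
              [1, 0, 1, 1, 0, 1, 1],
              [0, 1, 1, 0, 1, 0, 1],
              [0, 0, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 1, 0, 1],
              [1, 0, 1, 1, 0, 1, 1],
              [1, 1, 1, 1, 1, 1, 1]])
U, V = remove_smallest(M)
assert np.array_equal(bool_prod(U, V), M)          # exact decomposition
print("k =", U.shape[1])                           # prints k = 3
```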

Our second algorithm adopts strategy (b).

Algorithm 2

Pick-Largest Input: Response matrix \(M\in \{0,1\}^{m\times n}\).

  1. 1.

    Initialize \(U=M\) and \(k=n\).

  2. 2.

    Compute

    $$\begin{aligned} V^\mathrm{T}=J=\overline{\overline{M}^\mathrm{T}\circ M}. \end{aligned}$$
    (22)
  3. 3.

    Initialize \(C=[0]_{m\times n}\in \{0,1\}^{m\times n}\).

  4. 4.

    For \(i=1,2,\ldots , k\) compute the size of the maximal tile for the ith attribute (\(\alpha _i\)) by

    $$\begin{aligned} \sigma _i=\Vert U[:,i]\Vert \times \Vert V[i,:]\Vert . \end{aligned}$$
  5. 5.

    For each i such that \(\alpha _i\) has not been picked or discarded, compute (see (8))

    $$\begin{aligned} \delta _i=\sigma _i - U[:,i]^\mathrm{T}C V[i,:]^\mathrm{T}. \end{aligned}$$

    If \(\delta _i =0\) then remove \(\alpha _i\) by deleting U[ : , i] (resp. V[i,  : ]) from U (resp. V).

  6. 6.

    Let \(\delta _j =\max _i \{\delta _i\}\) and compute

    $$\begin{aligned} T_j= U[:,j] \circ V[j,:]. \end{aligned}$$

    Replace matrix C by \(C \vee T_j\), and delete U[ : , j] (resp. V[j,  : ]) from U (resp. V). If there are still attributes remaining, then go to Step 5.

  7. 7.

    Output U and V.
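Analogously, the following NumPy sketch is our own transcription of Algorithm 2 (ties in Step 6, and the bookkeeping of picked and discarded attributes, are handled in our own way); on the matrix of Example 3 it likewise returns three tiles.

```python
import numpy as np

def bool_prod(A, B):
    return (A @ B > 0).astype(int)

def pick_largest(M):
    """Sketch of Algorithm 2 (Pick-Largest)."""
    U = M.copy()
    V = (1 - bool_prod((1 - M).T, M)).T            # V = J^T, Step 2, eq. (22)
    C = np.zeros_like(M)                           # Step 3: nothing covered yet
    sigma = U.sum(axis=0) * V.sum(axis=1)          # Step 4: maximal tile sizes
    remaining, picked = set(range(M.shape[1])), []
    while remaining:
        # Step 5: delta_i = sigma_i - (1's of tile i already covered by C), via Lemma 1
        delta = {i: sigma[i] - U[:, i] @ C @ V[i, :] for i in remaining}
        remaining = {i for i in remaining if delta[i] > 0}   # discard fully covered tiles
        if not remaining:
            break
        j = max(remaining, key=delta.get)          # Step 6: pick the largest uncovered gain
        C = C | np.outer(U[:, j], V[j, :])         # mark the 1's of T_j as covered
        picked.append(j)
        remaining.discard(j)
    return U[:, picked], V[picked, :]

M = np.array([[0, 0, 0, 0, 0, 0, 0],               # the matrix of Example 3
              [1, 0, 1, 1, 0, 1, 1],
              [0, 1, 1, 0, 1, 0, 1],
              [0, 0, 0, 1, 0, 0, 0],
              [0, 1, 1, 0, 1, 0, 1],
              [0, 1, 1, 0, 1, 0, 1],
              [1, 0, 1, 1, 0, 1, 1],
              [1, 1, 1, 1, 1, 1, 1]])
U, V = pick_largest(M)
assert np.array_equal(bool_prod(U, V), M)          # exact decomposition
print("k =", U.shape[1])                           # prints k = 3
```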

Note that we compute \(\sigma _i\) just once in Step 4, but it is effectively updated in Step 5. The correctness of the above algorithms is implied by Theorems 1 and 2.

4.2 Simple example

Example 3

Let us consider the following matrix M, and carry out Steps 2 and 4, which are common to both algorithms.

$$\begin{aligned}&M=\left( \begin{array}{lllllll} 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 \\ 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1\\ 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1 \\ 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1 &{}\quad 1\\ \end{array}\right) \\&V^\mathrm{T}=\overline{\overline{M}^\mathrm{T}\circ M} =\left( \begin{array}{lllllll} 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 \\ 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ 1 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 \\ 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 \\ 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 \\ 1 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 &{}\quad 1 \\ \end{array} \right) \end{aligned}$$

Step 3 of Remove-Smallest computes

$$\begin{aligned} {\mathcal {T}}= U V =\left( \begin{array}{lllllll} 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 2 &{}\quad 0 &{}\quad 4 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 4 \\ 0 &{}\quad 2 &{}\quad 4 &{}\quad 0 &{}\quad 2 &{}\quad 0 &{}\quad 4 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 2 &{}\quad 4 &{}\quad 0 &{}\quad 2 &{}\quad 0 &{}\quad 4\\ 0 &{}\quad 2 &{}\quad 4 &{}\quad 0 &{}\quad 2 &{}\quad 0 &{}\quad 4 \\ 2 &{}\quad 0 &{}\quad 4 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 4 \\ 2 &{}\quad 2 &{}\quad 6 &{}\quad 3 &{}\quad 2 &{}\quad 2 &{}\quad 6\\ \end{array} \right) \end{aligned}$$
(23)

If we order the columns of U from the smallest to the largest according to the value of \(\sigma _i\), we get 4,3,7,1,6,2,5. Thus, Remove-Smallest processes the columns of U in this order.

Step 5(a): Compute \(T_4\).

$$\begin{aligned} T_4=U[:,4] \circ {V[4,:]} =\left( \begin{array}{lllllll} 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ \end{array} \right) \end{aligned}$$

Step 5(b): \(\Vert {\mathcal {T}}\Vert > \Vert {\mathcal {T}} - T_4\Vert \Rightarrow \) Cannot remove attribute 4.

Table 1 Computing \(\sigma _i\)

Step 5(a): Now try the next smallest attribute 3, and compute \(T_3\).

$$\begin{aligned} T_3=U[:,3] \circ {V[3,:]} =\left( \begin{array}{lllllll} 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 \\ \end{array} \right) \end{aligned}$$

Step 5(b): \(\Vert {\mathcal {T}}\Vert = \Vert {\mathcal {T}} - T_3\Vert \Rightarrow \) Remove attribute 3, and update \({\mathcal {T}}\).

$$\begin{aligned} {\mathcal {T}}={\mathcal {T}}-T_3 =\left( \begin{array}{lllllll} 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 2 &{}\quad 0 &{}\quad 3 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 3 \\ 0 &{}\quad 2 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 0 &{}\quad 3 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 2 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 0 &{}\quad 3\\ 0 &{}\quad 2 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 0 &{}\quad 3 \\ 2 &{}\quad 0 &{}\quad 3 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 3 \\ 2 &{}\quad 2 &{}\quad 5 &{}\quad 3 &{}\quad 2 &{}\quad 2 &{}\quad 5\\ \end{array} \right) \end{aligned}$$

Similarly, attributes (columns of M) 7, 1, and 5 are removed.

Step 6: generates

$$\begin{aligned} U=\left( \begin{array}{lll} 0 &{}\quad 0 &{}\quad 0 \\ 1&{}\quad 1 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 1 \\ 1 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 0&{}\quad 1 \\ 1 &{}\quad 1 &{}\quad 0 \\ 1 &{}\quad 1&{}\quad 1 \\ \end{array} \right) ;~~~ V =\left( \begin{array}{lllllll} 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 1 \\ 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1 \\ \end{array} \right) \end{aligned}$$
(24)

The columns of U are columns 4, 6, and 2 of M, and \(M=U \circ V\). Let us now apply Algorithm Pick-Largest to matrix M. We already illustrated the first four steps above. From Table 1, we see that \(\delta _5=\sigma _5=16\) is the largest (tied with \(\delta _2=\sigma _2=16\)). Since \(\delta _i=0\) holds for no i, we proceed to Step 6.

$$\begin{aligned} T_5=U[:,5] \circ {V[5,:]} =\left( \begin{array}{lllllll} 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1\\ 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 1 &{}\quad 1 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 1\\ \end{array} \right) \end{aligned}$$

We set \(C= C\vee T_5\) to remember the 1’s that are now covered by the picked product term. Although this algorithm does not use \({\mathcal {T}}\) in (23), it is instructive to interpret Steps 5 and 6 of Pick-Largest in terms of \({\mathcal {T}}\). We have

$$\begin{aligned} {\mathcal {T}}= {\mathcal {T}}-T_5 =\left( \begin{array}{lllllll} 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 2 &{}\quad 0 &{}\quad 4 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 4 \\ 0 &{}\quad 1 &{}\quad 3 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 3 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 1 &{}\quad 3 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 3\\ 0 &{}\quad 1 &{}\quad 3 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 3 \\ 2 &{}\quad 0 &{}\quad 4 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 4 \\ 2 &{}\quad 1 &{}\quad 5 &{}\quad 3 &{}\quad 1 &{}\quad 2 &{}\quad 5\\ \end{array} \right) \end{aligned}$$

In Step 5, we update \(\{\delta _i\}\). For example, let us compute \(CV[i,:]^\mathrm{T}\) for \(i=2\). We get

$$\begin{aligned} CV[2,:]^\mathrm{T} = \left[ 0~0~4~0~4~4~0~4\right] ^\mathrm{T}\quad \hbox {and}\quad U[:,2]^\mathrm{T}CV[2,:]^\mathrm{T} =16. \end{aligned}$$

Therefore, \(\delta _2= \sigma _2 -16 =0\). This implies that \(T_2 \le C\). We can simply remove attribute 2 (i.e., U[ : , 2] and V[2,  : ]). Updating C by \(C=C\vee T_2\) does not change C.

$$\begin{aligned} {\mathcal {T}}= {\mathcal {T}}-T_2 =\left( \begin{array}{lllllll} 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 0\\ 2 &{}\quad 0 &{}\quad 4 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 4 \\ 0 &{}\quad 0 &{}\quad 2 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 2 \\ 0 &{}\quad 0 &{}\quad 0 &{}\quad 1 &{}\quad 0 &{}\quad 0 &{}\quad 0 \\ 0 &{}\quad 0 &{}\quad 2 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 2\\ 0 &{}\quad 0 &{}\quad 2 &{}\quad 0 &{}\quad 0 &{}\quad 0 &{}\quad 2 \\ 2 &{}\quad 0 &{}\quad 4 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 4 \\ 2 &{}\quad 0 &{}\quad 4 &{}\quad 3 &{}\quad 0 &{}\quad 2 &{}\quad 4\\ \end{array} \right) \end{aligned}$$

This computation can be done by matrix operations, although it is not the most efficient way, since it computes elements that are of no use to us. Construct a column vector \({ Us}\) whose ith element is \(\Vert U[:,i]\Vert \), and a row vector \({ Vs}\) whose ith element is \(\Vert V[i,:]\Vert \). Compute the (ordinary arithmetic) product \(P= { Us}\,{ Vs}\).

$$\begin{aligned} P =\left( \begin{array}{lllllll} 15 &{}\quad 12 &{}\quad 6 &{}\quad ~3 &{}\quad 12 &{}\quad 15 &{}\quad 6\\ 20 &{}\quad 16 &{}\quad 8 &{}\quad ~4 &{}\quad 16 &{}\quad 20 &{}\quad 8\\ 30 &{}\quad 24 &{}\quad 12 &{}\quad ~6 &{}\quad 24 &{}\quad 30 &{}\quad 12\\ 20 &{}\quad 16 &{}\quad 8 &{}\quad ~4 &{}\quad 16 &{}\quad 20 &{}\quad 8\\ 20 &{}\quad 16 &{}\quad 8 &{}\quad ~4 &{}\quad 16 &{}\quad 20 &{}\quad 8\\ 15 &{}\quad 12 &{}\quad 6 &{}\quad ~3 &{}\quad 12 &{}\quad 15 &{}\quad 6\\ 30 &{}\quad 24 &{}\quad 12 &{}\quad ~6 &{}\quad 24 &{}\quad 30 &{}\quad 12\\ \end{array} \right) \end{aligned}$$

Thus, the diagonal elements of P are \(\Vert U[:,i]\Vert \times \Vert V[i,:]\Vert \), which are listed in Table 1. Note that the ith diagonal element of the arithmetic product \(U^\mathrm{T} C V^\mathrm{T}\) is the number of 1’s in \(U[:,i]\circ V[i,:]\) that are already covered by C.

$$\begin{aligned} U^\mathrm{T} C V^\mathrm{T} =\left( \begin{array}{rrrrrrr} 2 &{}\quad 4 &{}\quad ~2 &{}\quad ~0 &{}\quad 4 &{}\quad ~2 &{}\quad ~2\\ 8 &{}\quad 16 &{}\quad ~8 &{}\quad ~0 &{}\quad 16 &{}\quad ~8 &{}\quad ~8\\ 8 &{}\quad 16 &{}\quad ~8 &{}\quad ~0 &{}\quad 16 &{}\quad ~8 &{}\quad ~8\\ 2 &{}\quad 4 &{}\quad ~2 &{}\quad ~0 &{}\quad 4 &{}\quad ~2 &{}\quad ~2\\ 8 &{}\quad 16 &{}\quad ~8 &{}\quad ~0 &{}\quad 16 &{}\quad ~8 &{}\quad ~8\\ 2 &{}\quad 4 &{}\quad ~2 &{}\quad ~0 &{}\quad 4 &{}\quad ~2 &{}\quad ~2\\ 8 &{}\quad 16 &{}\quad ~8 &{}\quad ~0 &{}\quad 16 &{}\quad ~8 &{}\quad ~8\\ \end{array} \right) \end{aligned}$$

Thus, the amounts \(\{\delta _i\}\) can be found on the diagonal of \(P-U^\mathrm{T} C V^\mathrm{T}\), and they are 13, 0, 4, 4, 0, 13, 4. So \(\delta _1=13\) and \(\delta _6=13\) are the largest. Let us pick attribute 6, update \(C=C\vee T_6\), and recompute \(P-U^\mathrm{T} C V^\mathrm{T}\). The updated \(\delta _1\), \(\delta _3\), and \(\delta _7\) are all 0, so we discard attributes 1, 3, and 7 (Step 5). Finally, we pick the only remaining attribute 4, whose updated \(\delta _4=1>0\). For this particular example, Pick-Largest generates the same decomposition as Remove-Smallest given in (24) (the picked column 5 of M is identical to column 2).

Comment 3

Although computing \(U^\mathrm{T} C V^\mathrm{T}\) is a conceptually neat way of finding \(\{\delta _i\}\), the time to compute the off-diagonal elements is wasted. Thus, we do not use it in Pick-Largest.

5 Performance

5.1 Complexity analysis

The time complexity of both algorithms is dominated by the time to compute matrix V in their Step 2, i.e., (21) and (22), respectively. By Proposition 3, the product \(\overline{M}^\mathrm{T} \circ M\) can be expanded into m (column vector, row vector) pairs; the ith pair consists of the ith column of \(\overline{M}^\mathrm{T}\) and the ith row of M, both of length n. Then, evaluating \(\overline{M}^\mathrm{T} \circ M\) takes time proportional to

$$\begin{aligned} \sum _{i=1}^m \Vert \overline{M}^\mathrm{T}[:,i]\Vert \times \Vert M[i,:]\Vert \le n \sum _{i=1}^m \Vert M[i,:]\Vert = n\Vert M\Vert . \end{aligned}$$

This implies that (21) and (22) can be evaluated in \(O(n\Vert M\Vert )\) time. Note that in terms of \({\mathcal {T}}\) defined in Step 3 of Algorithm Remove-Smallest, we have

$$\begin{aligned} \Vert {\mathcal {T}}\Vert _{l_1} = \sum _{i=1}^n\Vert U[:,i]\Vert \times \Vert V[i,:]\Vert \le m \sum _{i=1}^n \Vert U[:,i]\Vert = m\Vert M\Vert , \end{aligned}$$

where \(\Vert {\mathcal {T}}\Vert _{l_1}\) (\(l_1\) norm) represents the sum of the elements of \({\mathcal {T}}\).

Theorem 4

Both Algorithms Remove-Smallest and Pick-Largest run in \(O(m\Vert M\Vert )\) time.

Proof

We can consider that every operation in Algorithm Remove-Smallest essentially accesses/modifies some element of \({\mathcal {T}}\) and the (ij) element is accessed \({\mathcal {T}}[i,j]\) times. Therefore, the total time is given by \(O(\Vert {\mathcal {T}}\Vert _{l_1})\) \(= O(m\Vert M\Vert )\).

As for Algorithm Pick-Largest, although \({\mathcal {T}}\) is not used in it, imagine that it were defined. We use \(U[:,i]^\mathrm{T}CV[i,:]^\mathrm{T}\) to describe Step 5, but it is used only for the purpose of a concise description, and this step can be implemented more efficiently without matrix multiplication. All we need is a way to keep track of which 1 elements of M have already been covered. Therefore, the total time is still given by \(O(\Vert {\mathcal {T}}\Vert _{l_1}) = O(m\Vert M\Vert )\), as reasoned above. \(\square \)

The above theorem implies that our algorithms run faster if the given matrix M is sparse. If we use a sophisticated algorithm, matrix multiplication can be done in \(O(m^{2.373})\) time, assuming \(m\ge n\) [24, 48]. We should mention that another important performance measure for heuristic algorithms is the approximation ratio relative to the optimum. We have not looked into this performance measure yet.

5.2 Experiments on real datasets

To evaluate the performance of our heuristic algorithms, Pick-Largest and Remove-Smallest, we have tested them on several real datasets, which have been used by other authors as benchmarks. They are Mushroom [27], DBLP, DNA [36], Chess [27], and Paleo. Table 2 lists the results of our experiments and compares them with Tiling [13], Asso [34], Hyper [49], GreConD [5], and GreEss [5]. All but the last two columns of Table 2 are from [4]. The common dimension k of the factor matrices generated by each of the exact BMD heuristics mentioned above is listed. The numbers in bold face indicate the best value in each row. The rows labeled 100 % show the data for exact decomposition. Asso is not meant for exact BMD, as commented earlier.

Table 2 Coverage comparison of BMD algorithms for five datasets

Among the datasets we used, Mushroom consists of 8,124 objects and 23 “nominal” attributes. If a “nominal” attribute y takes \(h>2\) values, \(\{v_1, \ldots , v_h\}\), we expanded y, replacing it by h Boolean attributes \(\{y_{v_1}, \ldots , y_{v_h}\}\). In row i, the column corresponding to \(y_{v_j}\) has a 1 if the value of attribute y in row i of the original dataset is equal to \(v_j\).

Note that only our algorithms impose the column-use condition. In spite of this restriction, Pick-Largest achieves the smallest tiling size (or dimension k) for exact (i.e., 100 %) coverage for four out of the five datasets in Table 2, which was somewhat unexpected. Incidentally, we have found a decomposition without the column-use condition with \(k=101\) by some other means, so none of the algorithms in Table 2 can find the optimal decomposition for the Mushroom dataset. As can be seen from Table 2, Pick-Largest and Remove-Smallest performed equally well in finding the exact decomposition. Another observation on Table 2 is that for 100 % coverage some results are peculiar in that \(k>n\), i.e., the common dimension of the factor matrices is larger than n. Our algorithms are the only ones that never produce such results.

Although our original intention was to design algorithms for exact BMD, our algorithms can also be used for “from-below” approximation [5]. In the from-below approximation, an important performance criterion is the coverage, defined as the number of 1’s covered by the product \(U\circ V\) over the total number of 1’s in the given matrix M [13]. The coverage is given in the second column of Table 2. Each entry in the table is the number of tiles used, which is the same as the common dimension k of U and V. Figure 1 plots the coverage of Algorithm Pick-Largest as a function of the number of attributes contained in U and V. The attributes are arranged in the order they were picked.
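In code, coverage is a one-liner given the factors; a NumPy sketch (our own helper, relying on the fact that the arithmetic and the Boolean products have the same support):

```python
import numpy as np

def coverage(M, U, V):
    """Fraction of the 1's of M covered by the from-below product U o V."""
    return np.count_nonzero(U @ V) / np.count_nonzero(M)
```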

Fig. 1 Coverage of Algorithm Pick-Largest for Mushroom

In most applications, high coverage, say, more than 90 %, would be of interest, and we have collected coverage data in Table 3 in this range for Pick-Largest and Remove-Smallest, but unfortunately not for the others, since we haven’t had the time to program the other algorithms. For Mushroom, however, it is stated in  [13] that Tiling needs 45 tiles to attain 90 % coverage vs. 47 for Pick-Largest. Table 3 shows that to attain 100 % coverage, Tiling needs 119 tiles vs. 109 for Pick-Largest. We have some evidence to suggest that our algorithms perform better than others especially at higher coverages.

Table 3 Comparison of Remove-Smallest and Pick-Largest at high coverage ratios

Another important aspect of performance is the efficiency of the algorithm in terms of speed and memory use. Table 4 shows the time it took for them to decompose M (of Mushroom) into U and V and the amount of memory used.

Table 4 Performance comparison

Bělohlávek et al. [5] carried out extensive tests of their algorithms GreConD and GreEss, which can be used for exact BMD, on Mushroom, and measured the time and memory requirements. Their data for exact BMD are given in Table 4. We should mention that the platforms we used to produce our results are different from theirs, as shown in Table 5. Probably, it is safe to say that there is not a huge difference between the two. Unfortunately, we do not have similar data for other algorithms, since they are not published.

Table 5 Running platforms

6 Application to educational data mining

Educational data mining has been attracting increasing interest in recent years. It aims to discover students’ mastery of knowledge, or skills which are itemized as attributes. In the widely studied Rule Space Model [43] in cognitive assessment in education, a Boolean matrix, named the Q-matrix, is used to represent hypothetical sets of attributes which would be needed to answer the test items correctly. To explain the relevance of exact BMD to the educational Q-matrix theory developed by Tatsuoka [42], let us introduce new symbols for matrices.

Attribute or skill set: We assume that the students’ knowledge can be represented by the knowledge state matrix \(A= [a_{ij}] \in \{0,1\}^{m\times k}\), where \(a_{ij}=1\) (resp. \(a_{ij}=0\)) indicates that the ith student possesses (resp. does not possess) knowledge represented by the jth attribute. For \(i=1,2,\ldots ,m\), the knowledge state of student i is represented by a row vector

$$\begin{aligned} {\varvec{a}}_i=[a_{i1},a_{i2},\ldots ,a_{ik}]. \end{aligned}$$

Q-matrix: It is denoted by \(Q= [q_{ij}]\in \{0,1\}^{n\times k}\), where \(q_{ij}=1\) (resp. \(q_{ij}=0\)) indicates that answering test item i correctly requires (resp. does not require) knowing or understanding attribute (=concept) j. Define a row vector by

$$\begin{aligned} {\varvec{q}}_i=[q_{i1},q_{i2},\ldots ,q_{ik}]. \end{aligned}$$

Response matrix: Given m students and n test items, the test results can be represented by a matrix \(R \in \{0,1\}^{m \times n}\), where \(R[i,j]=1\) (resp. \(R[i,j]=0\)) indicates that the ith student solved the jth test item correctly (resp. incorrectly). Theoretically, student i should be able to answer question j if \({\varvec{a}}_i \ge {\varvec{q}}_j\), or equivalently \(\overline{{\varvec{a}}}_i\circ {\varvec{q}}_j^\mathrm{T} =0\). We thus define the ideal item response \(R[i,j]\) by

$$\begin{aligned} R[i,j]=\left\{ \begin{array}{ll} 1&{}\quad {\varvec{a}}_i\ge {\varvec{q}}_j\\ 0&{}\quad otherwise\\ \end{array}\right. \end{aligned}$$
(25)
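As a small illustration of Eq. (25), the following Python sketch (the array representation, function name, and toy data are our own assumptions) computes the ideal item response matrix entrywise from a given A and Q.

```python
import numpy as np

def ideal_response(A, Q):
    """Ideal item response per Eq. (25): R[i, j] = 1 iff a_i >= q_j elementwise,
    i.e., student i possesses every attribute that item j requires."""
    m, _ = A.shape
    n, _ = Q.shape
    R = np.zeros((m, n), dtype=int)
    for i in range(m):
        for j in range(n):
            R[i, j] = int(np.all(A[i, :] >= Q[j, :]))
    return R

# Toy data: 2 students, 3 items, 2 attributes (purely illustrative).
A = np.array([[1, 1],
              [1, 0]])
Q = np.array([[1, 0],
              [0, 1],
              [1, 1]])
print(ideal_response(A, Q))   # [[1 1 1], [1 0 0]]
```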

If both Q and A were known, then the students’ test performance, called the ideal item response pattern [43], could be theoretically predicted. The following result was announced in [41] without proof. Here we provide a simple but formal proof.

Theorem 5

The ideal item response matrix R, the knowledge state matrix A and the Q-matrix Q are related as follows:

$$\begin{aligned} R=\overline{\overline{A}\circ Q^\mathrm{T}}. \end{aligned}$$
(26)

Proof

The fact that student i has enough knowledge to answer question j can be represented by \({\varvec{a}}_i \ge {\varvec{q}}_j\), which is equivalent to \(\overline{{\varvec{a}}}_i \circ {\varvec{q}}_j^\mathrm{T} =0\), and hence to \(\overline{\overline{{\varvec{a}}}_i\circ {\varvec{q}}_j^\mathrm{T}} =1\), by Proposition 5. If, on the other hand, student i does not, i.e., \({\varvec{a}}_i \not \ge {\varvec{q}}_j\), then \(\overline{{\varvec{a}}}_i \circ {\varvec{q}}_j^\mathrm{T}=1\) and \(\overline{\overline{{\varvec{a}}}_i\circ {\varvec{q}}_j^\mathrm{T}} =0\). In both cases, \(R[i,j]=\overline{\overline{{\varvec{a}}}_i\circ {\varvec{q}}_j^\mathrm{T}}\) by Eq. (25), which is Eq. (26) written entrywise. \(\square \)

If R is given but the underlying matrices Q and A are unknown, we want to mine Q and A out of R. Thanks to Theorem 5, by finding the decomposition \(\overline{R}=\overline{A}\circ Q^\mathrm{T}\), we can learn the students’ knowledge state matrix A and the Q-matrix Q from the test responses in R. We simply set \(M=\overline{R}\), \(U=\overline{A}\), and \(V=Q^\mathrm{T}\), and decompose M. Thus, the Q-matrix learning problem can be transformed into an exact (i.e., not approximate) Boolean matrix decomposition problem. Here we assume that R has no “noise,” namely that it correctly represents the students’ knowledge, and mine Q and A from it. In practice, the collected test responses, \({\mathcal {R}}\), are likely to be “noisy,” because students may guess correct answers by luck, or may make careless mistakes (called “slips” [43]). Therefore, the discovered factors \(A'\) and \(Q'\) of \({\mathcal {R}}\) are only approximations to the true A and Q. This problem is a main issue in the Rule Space Model [28, 41–43, 50], but is beyond the scope of this paper.
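The following sketch (again with hypothetical helper names and toy data of our own) states Eq. (26) in matrix form and illustrates the transformation \(M=\overline{R}\), \(U=\overline{A}\), \(V=Q^\mathrm{T}\) described above.

```python
import numpy as np

def boolean_product(X, Y):
    """Boolean matrix product over {0, 1}."""
    return ((X.astype(int) @ Y.astype(int)) > 0).astype(int)

def ideal_response_matrix(A, Q):
    """Eq. (26): R = complement( complement(A) o Q^T )."""
    return 1 - boolean_product(1 - A, Q.T)

# Toy data (purely illustrative).
A = np.array([[1, 1],
              [1, 0]])
Q = np.array([[1, 0],
              [0, 1],
              [1, 1]])
R = ideal_response_matrix(A, Q)

# Q-matrix learning viewed as exact BMD: set M = complement(R), U = complement(A),
# V = Q^T; then M = U o V, so an exact decomposition of M recovers A and Q.
M, U, V = 1 - R, 1 - A, Q.T
assert np.array_equal(M, boolean_product(U, V))
```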

Example 4

Here we use the dataset of Example 3.9 in [43]. Table 6 shows the ideal item response pattern matrix R for \(m=12\) students and \(n=11\) test items.

Table 6 The ideal item response matrix \(R^{12\times 11}\) [43]

The knowledge state matrix \(A^{12\times 4}\) and Q-matrix \(Q^{11\times 4}\) (each with \(k=4\) attributes) are given by

$$\begin{aligned} A=\left( \begin{array}{llll} 1&{}\quad 1&{}\quad 1&{}\quad 1\\ 1&{}\quad 1&{}\quad 1&{}\quad 0\\ 1&{}\quad 1&{}\quad 0&{}\quad 1\\ 1&{}\quad 1&{}\quad 0&{}\quad 0\\ 1&{}\quad 0&{}\quad 1&{}\quad 1\\ 1&{}\quad 0&{}\quad 1&{}\quad 0\\ 1&{}\quad 0&{}\quad 0&{}\quad 1\\ 1&{}\quad 0&{}\quad 0&{}\quad 0\\ 0&{}\quad 0^* &{}\quad 1&{}\quad 1\\ 0&{}\quad 0^* &{}\quad 1&{}\quad 0\\ 0&{}\quad 0^* &{}\quad 1&{}\quad 0\\ 0&{}\quad 0^* &{}\quad 0&{}\quad 0 \end{array} \right) ;\quad Q=\left( \begin{array}{llll} 1&{}\quad 0&{}\quad 0&{}\quad 0\\ 1&{}\quad 1&{}\quad 0&{}\quad 0\\ 0&{}\quad 0&{}\quad 1&{}\quad 0\\ 0&{}\quad 0&{}\quad 0&{}\quad 1\\ 1&{}\quad 0&{}\quad 1&{}\quad 0\\ 1&{}\quad 0&{}\quad 0&{}\quad 1\\ 1&{}\quad 1&{}\quad 1&{}\quad 0\\ 1&{}\quad 1&{}\quad 0&{}\quad 1\\ 0&{}\quad 0&{}\quad 1&{}\quad 1\\ 1&{}\quad 0&{}\quad 1&{}\quad 1\\ 1&{}\quad 1&{}\quad 1&{}\quad 1 \end{array} \right) \end{aligned}$$

In [43], R was constructed from the given A and Q. Here, taking R as the input, Algorithms Remove-Smallest and Pick-Largest were able to recover A and Q.

Comment 6

In the above example, note that column Q[ : , 1] dominates column Q[ : , 2]. This means that any test item that tests concept 2 automatically tests concept 1; in other words, attribute 1 is a prerequisite for concept 2 [43]. Students 9 to 12 have not mastered concept 1, which is tested in items 1, 2, 5–8, 10, and 11, so \(R[s,j]=0\) for \(s=9\)–12 and for every such item j (these students cannot answer any item testing concept 1); in particular, every item testing concept 2 is among them. The only remaining items, namely those with a 0 in both columns 1 and 2 of Q, i.e., Q[3,  : ], Q[4,  : ], and Q[9,  : ], do not test concept 2 at all. Hence R provides no evidence that students 9–12 know concept 2, and our algorithms set \(A[s,2]=0\) for \(s=9\)–12. However, mathematically, setting \(A[s,2]=1\) for \(s=9\)–12 still satisfies \(\overline{R}=\overline{A}\circ Q^\mathrm{T}\). See the entries \(0^*\) in matrix A in Example 4.

In general, we can prove the following.

Lemma 6

Suppose that column Q[ : , i] dominates column Q[ : , j]. Then \([A[s,i]=0] \Rightarrow [A[s,j]=0]\).
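The dominance condition and the conclusion of Lemma 6 can be checked mechanically. The following sketch is one possible way to do so (the helper names are ours and 0-based indices are assumed); it is not part of our algorithms.

```python
import numpy as np

def dominates(Q, i, j):
    """True iff column Q[:, i] dominates column Q[:, j]: every item requiring
    attribute j also requires attribute i (0-based column indices)."""
    return bool(np.all(Q[:, i] >= Q[:, j]))

def lemma6_violations(A, Q, i, j):
    """Rows s with A[s, i] = 0 but A[s, j] = 1; per Lemma 6, such rows should
    not occur in the factors our algorithms produce when Q[:, i] dominates Q[:, j]."""
    assert dominates(Q, i, j)
    return np.where((A[:, i] == 0) & (A[:, j] == 1))[0]
```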

The input to our algorithms is just \(\overline{R}\); the outputs are the complemented knowledge state matrix \(\overline{A}\) and \(Q^\mathrm{T}\). Algorithm Pick-Largest computes the values of \(\delta _i\) in each round; their maxima are shown in Table 7.

Table 7 The attribute picked in each round of Pick-Largest

Algorithm Remove-Smallest removes attributes in increasing order of \(\sigma _i\), provided the product remains equal to \(\overline{R}\). For this example, both algorithms decompose \(\overline{R}\) into factor matrices with common dimension \(k=4\), which equals the dimension of the original factors [43].

7 Conclusions and discussions

We have presented two heuristic algorithms to find an exact decomposition \(M=U\circ V\) such that U consists of a subset of the columns of M. Exact BMD is aesthetically pleasing and intellectually satisfying, and we believe that it will find useful applications in the future. In present-day data mining applications, however, exact decomposition may not be necessary or even very important. We therefore also showed that our algorithms can be used for approximate BMD, namely to find a product \(U\circ V\) that covers most of the 1’s of M and none of its 0’s; this is sometimes called “from-below” approximation [4].

We ran our algorithms on several real datasets that are often used as benchmarks. On these particular datasets, our algorithms perform rather well compared with the known algorithms proposed in [4, 5, 13, 49]. These results hold despite the column-use condition, which we impose but the other algorithms do not, and we find this rather noteworthy. If it is generally true for large databases, it has great potential for practical data mining; clearly, more extensive tests are needed before any definite conclusions can be drawn. Ene et al. [9] also report some unexpected, favorable properties of real datasets that help role mining. It would be interesting to explore and understand how these properties arise. Incidentally, when the column-use condition is imposed, it seems that the idea of the concept lattice [12] is not particularly useful.

We have also made an interesting observation: the sizes of the tiles picked by Pick-Largest (the largest, the second largest, and so on) follow Zipf’s distribution rather well.

Although we have concentrated on the elimination of column dominance, it is possible that a given matrix M exhibits more row dominance than column dominance. In any case, it would be worthwhile to apply our algorithms to both M and \(M^\mathrm{T}\) and pick the result with the smaller factor-matrix dimension. There may also be situations where a decomposition \(M=A\circ B\) is already known, but it is desired to reduce the number of attributes (columns) of A. In such a case, we can apply our algorithms to decompose A as \(A=U\circ V\); we then have \(M=U\circ (V\circ B)\), where U consists of a subset of the columns of A.
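As a minimal sketch of this chaining (the toy matrices and helper name are our own assumptions), associativity of the Boolean product gives \(M=U\circ (V\circ B)\) once a factorization \(A=U\circ V\) is found:

```python
import numpy as np

def boolean_product(X, Y):
    """Boolean matrix product over {0, 1}."""
    return ((X.astype(int) @ Y.astype(int)) > 0).astype(int)

# A known decomposition M = A o B, where we further decompose A = U o V.
A = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])
B = np.array([[1, 0], [0, 1], [1, 1]])
U = np.array([[1, 0], [1, 1], [0, 1]])        # hypothetical factors of A
V = np.array([[1, 1, 0], [0, 1, 1]])
assert np.array_equal(A, boolean_product(U, V))

M = boolean_product(A, B)
# By associativity of the Boolean product, M = U o (V o B),
# and U has fewer columns (attributes) than A.
assert np.array_equal(M, boolean_product(U, boolean_product(V, B)))
```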

By Proposition 3, there is a great deal of parallelism in the Boolean matrix product computation. This implies that, if the given matrix M is very large, our algorithms are amenable to the map-reduce technique [39].

Finally, as mentioned before, we have not examined the approximation ratio of our heuristic algorithms relative to the optimum. We leave it as future work.