Hierarchical heuristics for Boolean-reasoning-based binary bicluster induction

Biclustering is a two-dimensional data analysis technique that, applied to a matrix, searches for a subset of rows and columns that intersect to produce a submatrix with given, expected features. Such an approach requires different methods to those of typical classification or regression tasks. In recent years it has become possible to express biclustering goals in the form of Boolean reasoning. This paper presents a new, heuristic approach to bicluster induction in binary data.


Introduction
Biclustering is a data analysis technique that searches for interesting submatrices of a given matrix. The resultant submatrix, referred to as a bicluster, can be defined as an ordered pair, consisting of a subset of rows and a subset of columns of the given matrix. This approach was first used in the 1970s by Hartigan [1].
Boolean reasoning [2] is a paradigm for solving computational tasks. Typically, the original problem is encoded as a Boolean formula, and the results of transforming that formula are decoded into solutions of the original problem. Such an approach is widely applied in Rough Set Theory [2,3]; however, it is also used for decision tree induction [4].
Among many others [2,5-7], a new approach to biclustering based on Boolean reasoning was presented in 2018 [8]. Promising results with discrete and binary data also led to the development of biclustering methods for continuous data [9].
The primary disadvantage of methods based on Boolean reasoning is their high computational complexity, which results from Boolean function satisfiability checking. This problem has given rise to the use of heuristics to accelerate the computations. A simple, sequential covering strategy was proposed in [10]. That approach searches for a set of biclusters that together contain all ones in the binary data, and it requires the modified Johnson's strategy of prime implicant approximation [11]. Some scenarios, however, may require wider biclusters, potentially including those which overlap one another, that are more general and do not necessarily contain all ones in the binary data.
The need to search for wide biclusters has motivated the development of new bicluster induction heuristics. The above-mentioned modified version of Johnson's strategy uses single prime implicant approximation in order to produce an iterative, sequential coverage of ones in a binary matrix. In general, searching only among the ones left uncovered by the biclusters generated so far yields smaller biclusters (i.e., biclusters with fewer rows or columns) as the number of iterations increases. This paper presents a new hierarchical, heuristic strategy for binary data biclustering. The underlying heuristic is the modified version of Johnson's strategy of prime implicant approximation [10]. The term hierarchical refers to the "tabu search" [12] paradigm: following the discovery of a solution (node), solutions that are similar but not equivalent are found (subnodes). This process then iterates. The similarity condition is satisfied by random data modification: several subnodes are invoked, and for each of them a different element of the input data is altered from one to zero (only ones in a bicluster discovered by the given node are altered in this manner).
This paper is organized as follows: it begins with a brief review of existing approaches to biclustering; following this, the essential notions of Boolean-reasoning-based biclustering are defined and presented; the subsequent section develops the central concept of the hierarchical, heuristic strategy for binary biclustering induction, and provides abstract examples and pseudo-code; the penultimate section reports the results of experimental tasks using artificial data; and the final section offers conclusions and a perspective on possible further work in this area.

Related works
Biclustering was first used in the 1970s [1]. Since then, the technique has been applied in many disciplines, including biomedical data analysis [13] and text mining [14], leading to the development of multiple approaches to biclustering.
Beyond methods dedicated strictly to biclustering, other data analysis paradigms share similar characteristics. For example, the search for an inclusion-maximal bicluster of ones in a binary matrix is comparable to the extraction of the concept lattice for a given context [34]. In the domain of basket analysis, the generation of a frequent itemset corresponds to the generation of an exact bicluster [35].

Boolean-reasoning-based biclustering
This section defines the objects and concepts that are used throughout the paper and provides the background of Boolean-reasoning-based biclustering.

Definitions
Definition 1 (Bicluster) Let $M$ be a matrix with the set of rows $\mathcal{R}$ and the set of columns $\mathcal{C}$. The bicluster $RC \equiv (R, C)$ is an ordered pair of a subset of rows $R \subseteq \mathcal{R}$ and a subset of columns $C \subseteq \mathcal{C}$.
Definition 2 (Exact bicluster) Let $RC$ be a bicluster. $RC$ is exact iff $\forall r, r' \in R\ \forall c, c' \in C:\ M(r, c) = M(r', c')$, where $M(r, c)$ is the element (cell) of matrix $M$ in row $r$ and column $c$.
Definition 3 (Inclusion-maximality of exact bicluster) Let $M$ be a binary matrix and let $RC$ be an exact bicluster derived from $M$. The bicluster is inclusion-maximal if and only if there exists no row $r \in \mathcal{R} \setminus R$ or column $c \in \mathcal{C} \setminus C$ such that either of the extended biclusters $(R \cup \{r\}, C)$ or $(R, C \cup \{c\})$ is also an exact bicluster.
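The exactness and inclusion-maximality conditions can be checked directly on a small matrix. The following Python sketch is illustrative only: the function names `is_exact` and `is_inclusion_maximal`, the list-of-lists matrix representation, and the example matrix are assumptions made here, and the check is restricted to biclusters of a single value (ones by default).

```python
from typing import Iterable, List

Matrix = List[List[int]]


def is_exact(m: Matrix, rows: Iterable[int], cols: Iterable[int], value: int = 1) -> bool:
    """True if every cell of the bicluster (rows x cols) equals `value`."""
    cols = list(cols)
    return all(m[r][c] == value for r in rows for c in cols)


def is_inclusion_maximal(m: Matrix, rows: Iterable[int], cols: Iterable[int],
                         value: int = 1) -> bool:
    """True if the bicluster is exact and no single extra row or column keeps it exact."""
    rows, cols = set(rows), set(cols)
    if not is_exact(m, rows, cols, value):
        return False
    for r in range(len(m)):
        if r not in rows and is_exact(m, rows | {r}, cols, value):
            return False  # the bicluster can be extended by row r
    for c in range(len(m[0])):
        if c not in cols and is_exact(m, rows, cols | {c}, value):
            return False  # the bicluster can be extended by column c
    return True


# Hypothetical example matrix (not taken from the paper).
M = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
print(is_exact(M, {0, 1}, {0, 1}))              # rows 0-1 and columns 0-1 hold only ones
print(is_inclusion_maximal(M, {0, 1}, {0, 1}))  # row 2 or column 2 would add a zero
print(is_inclusion_maximal(M, {1}, {0, 1}))     # row 0 could still be added
```

The maximality test only tries single-row and single-column extensions, which mirrors the wording of Definition 3.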
That is to say, the implicant and bicluster correspond if and only if the implicant contains Boolean variables that correspond to rows and columns that are not elements of the bicluster. Such a corresponding implicant is denoted as

Boolean reasoning in binary biclustering
Michalak and Ślęzak [8] provide the mathematical background for bicluster induction with discrete and binary data in the context of Boolean reasoning. There exist two theorems that bind biclusters of binary matrices and implicants of precisely defined (and data dependent) Boolean formulas. As written here, the definition and theorems are used to find exact biclusters of ones among a background of zeros in a matrix. Replacing zero with one in the text, and vice versa, generates a definition and theorems for finding exact biclusters of zeros in binary data.
Definition 8 (Zero-encoding Boolean function) Let M be a binary matrix with rows R and columns C. The zero-encoding Boolean function is the conjunction of disjunctions of the corresponding variables of row r ∈ R and column c ∈ C, such that M(r, c) = 0:
$$f(M) = \bigwedge_{(r, c)\,:\,M(r, c) = 0} (r \vee c)$$
Following the above function definition, Michalak and Ślęzak [8] prove two theorems. Here, the theorems are stated. The first theorem details the correspondence between implicants of f (M) and exact biclusters of M.

Theorem 1 (Exact bicluster and implicant correspondence theorem) Let M be a binary matrix with rows R and columns C. Bicluster RC is an exact bicluster of ones in M if and only if R C is an implicant of f (M).
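The correspondence in Theorem 1 can be illustrated with a short Python sketch. It builds the two-literal clauses of the zero-encoding function and tests whether the term made of the variables of all rows and columns outside a bicluster satisfies every clause. The variable names (`r0`, `c0`, ...), the helper names, and the example matrix are assumptions introduced here for illustration, not notation from [8].

```python
from typing import List, Set, Tuple

Matrix = List[List[int]]
Clause = Tuple[str, str]  # (row variable, column variable)


def zero_encoding_clauses(m: Matrix) -> List[Clause]:
    """One two-literal clause (row OR column) per zero cell of the matrix."""
    return [(f"r{r}", f"c{c}")
            for r, row in enumerate(m)
            for c, cell in enumerate(row) if cell == 0]


def corresponding_term(m: Matrix, rows: Set[int], cols: Set[int]) -> Set[str]:
    """Variables of the rows and columns that are NOT elements of the bicluster."""
    return ({f"r{r}" for r in range(len(m)) if r not in rows}
            | {f"c{c}" for c in range(len(m[0])) if c not in cols})


def is_implicant(term: Set[str], clauses: List[Clause]) -> bool:
    """A conjunction of positive variables implies f(M) iff it satisfies every clause."""
    return all(r in term or c in term for r, c in clauses)


# Hypothetical example matrix (not taken from the paper).
M = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
clauses = zero_encoding_clauses(M)
exact = corresponding_term(M, {0, 1}, {0, 1})         # bicluster without zeros
not_exact = corresponding_term(M, {0, 1, 2}, {0, 1})  # bicluster containing M(2, 0) = 0
print(is_implicant(exact, clauses), is_implicant(not_exact, clauses))
```

In this example the term of the exact bicluster is an implicant, while the term of the bicluster that contains a zero is not, in line with the theorem.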
The second theorem demonstrates the correspondence between exact, inclusion-maximal biclusters of ones in M and prime implicants of f (M).

Theorem 2 (Exact, inclusion-maximal bicluster and prime implicant correspondence theorem) Let M be a binary matrix with rows R and columns C. Bicluster RC is an exact, inclusion-maximal bicluster of ones in M if and only if R C is a prime implicant of f (M).
Consider the binary matrix M presented in Table 1. The goal is to find all exact, inclusion-maximal biclusters of ones. The formula f (M) can be expressed as the logical multiplication of two-literal clauses. A given two-literal clause consists of the Boolean variables that correspond to the row and column indices of a zero value element of M (Table 1). Consider the matrix element M(1, b) = 0. The corresponding two-literal clause has the form 1 ∨ b. Note that the same notation is used for both rows and columns and the Boolean variables that correspond to those rows and columns. However, the meaning is context-dependent. For example, b can represent either an index, as in M(1, b), or a Boolean variable, as in 1 ∨ b.
The formula that encodes all zeros in the matrix M has the following form: Transforming this into a function that consists only of prime implicants gives: The result is a function of three prime implicants, each of which corresponds (via Theorem 2) to an exact, inclusion-maximal bicluster of ones. A visualization of the biclusters corresponding to the f (M) prime implicants is presented in Table 2. Note that none of the biclusters contain a zero, and neither may they be extended by row or column without subsequently containing a zero.
Table 2 Prime implicants of the f (M) function and their corresponding biclusters
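The exhaustive procedure can be sketched in Python for a very small matrix. The sketch below uses a hypothetical 3 × 3 matrix (not the matrix of Table 1), enumerates terms by increasing size to obtain the prime implicants of the zero-encoding function by brute force, and decodes each of them into a bicluster. The names `prime_implicants` and `decode` and the variable naming scheme are assumptions made for this illustration; the enumeration is exponential in the number of rows and columns.

```python
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple

Matrix = List[List[int]]


def prime_implicants(m: Matrix) -> List[Set[str]]:
    """Brute-force prime implicants of the zero-encoding function of m."""
    clauses = [(f"r{r}", f"c{c}") for r, row in enumerate(m)
               for c, cell in enumerate(row) if cell == 0]
    variables = ([f"r{r}" for r in range(len(m))]
                 + [f"c{c}" for c in range(len(m[0]))])
    primes: List[Set[str]] = []
    # Terms are visited by increasing size, so an implicant that contains
    # no previously stored prime implicant is itself minimal (prime).
    for size in range(len(variables) + 1):
        for combo in combinations(variables, size):
            term = set(combo)
            if any(p <= term for p in primes):
                continue
            if all(r in term or c in term for r, c in clauses):
                primes.append(term)
    return primes


def decode(m: Matrix, term: Set[str]) -> Tuple[FrozenSet[int], FrozenSet[int]]:
    """Bicluster made of the rows and columns whose variables are absent from the term."""
    rows = frozenset(r for r in range(len(m)) if f"r{r}" not in term)
    cols = frozenset(c for c in range(len(m[0])) if f"c{c}" not in term)
    return rows, cols


# Hypothetical example matrix (not the matrix of Table 1).
M = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
for term in prime_implicants(M):
    rows, cols = decode(M, term)
    print(sorted(term), "->", sorted(rows), sorted(cols))
```

For this matrix the function has four prime implicants, and each decoded bicluster contains only ones and cannot be extended by a row or a column.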

Heuristic and hierarchical search of wide biclusters in binary data
The above approach to binary data biclustering has a high degree of computational complexity due to the satisfiability problem of Boolean formulas. From Theorem 1, each implicant of a Boolean formula f (M) encodes an exact bicluster of the matrix M. By exploiting this relationship, heuristic strategies can be applied to find implicants of Boolean formulas. A popular approach to approximating prime implicants, based on the frequency with which literals occur, is Johnson's strategy [11]. However, Michalak et al. [10] prove that this strategy may induce implicants for which the corresponding biclusters are empty (i.e., biclusters with rows but no columns, or vice versa). To avoid such situations, Michalak et al. [10] propose a new heuristic for implicant induction. In addition, they present a sequential covering approach to bicluster induction that covers all ones in the data.
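The frequency-based idea behind Johnson's strategy can be sketched as a greedy loop: repeatedly select the variable that occurs in the largest number of still unsatisfied clauses. The Python sketch below is a plain version of that idea, with an alphabetical tie-breaking rule that is an assumption made here to keep the run deterministic; it does not reproduce the modified heuristic of [10]. On the hypothetical 2 × 2 matrix used in the example, the greedy choice selects both column variables, so the decoded bicluster has rows but no columns.

```python
from collections import Counter
from typing import List, Set, Tuple

Matrix = List[List[int]]
Clause = Tuple[str, str]


def zero_clauses(m: Matrix) -> List[Clause]:
    """Two-literal clauses of the zero-encoding function, one per zero cell."""
    return [(f"r{r}", f"c{c}") for r, row in enumerate(m)
            for c, cell in enumerate(row) if cell == 0]


def johnson_implicant(clauses: List[Clause]) -> List[str]:
    """Greedy implicant: pick the most frequent variable until all clauses are satisfied."""
    remaining = list(clauses)
    term: List[str] = []
    while remaining:
        counts = Counter(v for clause in remaining for v in clause)
        # Most frequent variable first; ties are broken by name (an assumption).
        best = min(counts, key=lambda v: (-counts[v], v))
        term.append(best)
        remaining = [clause for clause in remaining if best not in clause]
    return term


def decode(m: Matrix, term: List[str]) -> Tuple[Set[int], Set[int]]:
    """Rows and columns whose variables were not selected form the bicluster."""
    rows = {r for r in range(len(m)) if f"r{r}" not in term}
    cols = {c for c in range(len(m[0])) if f"c{c}" not in term}
    return rows, cols


# Hypothetical example matrix (not taken from the paper).
M = [
    [0, 1],
    [1, 0],
]
term = johnson_implicant(zero_clauses(M))
rows, cols = decode(M, term)
print(term, rows, cols)  # both column variables are selected, so no column is left
```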
The sequential coverage strategy ensures that all ones in the data are eventually covered by a bicluster. However, as the process progresses, the size of newly found biclusters decreases. As a result, only biclusters found in the initial phase of the process may be generalizable.
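The shrinking of biclusters during sequential covering can be reproduced with a small Python sketch. The single-bicluster search used here, `find_single_bicluster`, is a simple stand-in chosen for brevity and is not the heuristic of [10]; the example matrix and all names are likewise assumptions. After each bicluster is found, its ones are replaced by zeros and the search is repeated on the remaining data.

```python
from typing import FrozenSet, List, Optional, Tuple

Matrix = List[List[int]]
Bicluster = Tuple[FrozenSet[int], FrozenSet[int]]


def find_single_bicluster(m: Matrix) -> Optional[Bicluster]:
    """Stand-in heuristic (NOT the heuristic of [10]): use every row as a seed,
    take the columns in which the seed row holds ones, collect all rows with
    ones in every such column, and keep the candidate with the largest area."""
    best: Optional[Bicluster] = None
    for seed in range(len(m)):
        cols = frozenset(c for c, v in enumerate(m[seed]) if v == 1)
        if not cols:
            continue
        rows = frozenset(r for r in range(len(m))
                         if all(m[r][c] == 1 for c in cols))
        if best is None or len(rows) * len(cols) > len(best[0]) * len(best[1]):
            best = (rows, cols)
    return best


def sequential_covering(m: Matrix) -> List[Bicluster]:
    """Find a bicluster, erase the ones it covers, and repeat until no ones remain."""
    work = [row[:] for row in m]
    found: List[Bicluster] = []
    while any(1 in row for row in work):
        rows, cols = find_single_bicluster(work)
        found.append((rows, cols))
        for r in rows:
            for c in cols:
                work[r][c] = 0
    return found


# Hypothetical example matrix (not taken from the paper).
M = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
biclusters = sequential_covering(M)
print([len(rows) * len(cols) for rows, cols in biclusters])  # areas, in order of discovery
```

In this example the areas of the successive biclusters decrease, and the last bicluster consists of a single cell.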
The new heuristic presented in this work adopts a different approach to finding biclusters.
Consider searching a binary matrix for biclusters of ones that are as wide as possible in both directions. The heuristics of Michalak et al. [10] provide a Boolean function implicant that encodes one exact bicluster of the data. This is the widest possible bicluster that the heuristics can find. Now consider the effect of replacing a one in the bicluster with a zero, and invoking the heuristics again. The result would not be the same bicluster as was originally found; the zero inside the original bicluster would violate the exactness condition. The visualization of two iterations of such an approach is presented in Table 3. It assumes a given heuristic to search for an exact bicluster of ones (not necessarily the heuristic above).
The example shown in Table 3 proceeds as follows. From the binary data (a) a bicluster is found using a given heuristic (b). An arbitrarily chosen one (third row, fourth column) is replaced by a zero (c). From the modified data the same heuristic is used to find another bicluster (d). The newly found bicluster cannot be the same as the one previously found.
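The single modification step can be sketched in Python as follows. The matrix, the altered cell, and the stand-in search `find_single_bicluster` are assumptions made for illustration; in particular, the stand-in is not the heuristic of [10], and the matrix is not the one shown in Table 3.

```python
from typing import FrozenSet, List, Optional, Tuple

Matrix = List[List[int]]
Bicluster = Tuple[FrozenSet[int], FrozenSet[int]]


def find_single_bicluster(m: Matrix) -> Optional[Bicluster]:
    """Stand-in heuristic (NOT the heuristic of [10]): best-area bicluster
    among those seeded by the one-columns of each row."""
    best: Optional[Bicluster] = None
    for seed in range(len(m)):
        cols = frozenset(c for c, v in enumerate(m[seed]) if v == 1)
        if not cols:
            continue
        rows = frozenset(r for r in range(len(m))
                         if all(m[r][c] == 1 for c in cols))
        if best is None or len(rows) * len(cols) > len(best[0]) * len(best[1]):
            best = (rows, cols)
    return best


def search_after_modification(m: Matrix, cell: Tuple[int, int]) -> Optional[Bicluster]:
    """Replace a single one by a zero in a copy of the data and search again."""
    r, c = cell
    modified = [row[:] for row in m]
    modified[r][c] = 0
    return find_single_bicluster(modified)


# Hypothetical example matrix (not the matrix of Table 3).
M = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
first = find_single_bicluster(M)
second = search_after_modification(M, (1, 0))  # (1, 0) is a cell of the first bicluster
print(sorted(first[0]), sorted(first[1]))
print(sorted(second[0]), sorted(second[1]))
```

Here the second bicluster overlaps the first one but reaches one row further, which is the kind of wide, overlapping bicluster the hierarchical strategy is meant to expose.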
For a bicluster consisting of r rows and c columns, up to r · c different modifications to the original data can be made, and up to r · c new biclusters can be found heuristically. Moreover, each of the biclusters found in the modified data can be used as an input for further processing. This forms the general hierarchical strategy of bicluster induction.
Table 3 Modifying data to invoke another iteration of bicluster searching: a original binary data; b bicluster found by a given heuristic; c a single one in the bicluster is replaced by a zero; d from the altered data a new bicluster is found
Intuitively, such a recursive strategy can be executed until a given stop criterion is fulfilled. We can consider no fewer than four stop criteria:
• a maximum number of iterations (recursive invocations),
• a maximum number of found biclusters,
• a maximum total coverage of the data,
• a minimum assumed coverage of the data.
The pseudocode for this heuristic is presented in Algorithm 1.
      ...                                                  ▷ add data as the first task
 6: while run do
 7:     whileData := queue.GetFirst()                      ▷ take first data from the queue
 8:     queue.RemoveFirst()                                ▷ remove first task from the queue
 9:     bicluster := FindSingleBicluster(whileData)        ▷ heuristic search of one bicluster in the data
10:     if not AlreadyCovered(biclusters, bicluster) then  ▷ checking whether bicluster is a subset of any previously found bicluster
11:         for all cell in bicluster do
12:             ...
The queue (Algorithm 1, line 4) is used as part of the breadth-first search strategy: each iteration of the while loop adds new data based on the found biclusters. The stop condition continue can take the form of one of the above criteria. The FindSingleBicluster method is an implementation of the heuristics from [10]. To avoid the need for postprocessing of the biclustering results (i.e., the removal of biclusters that are fully covered by others), the AlreadyCovered method (line 10) tests each newly found bicluster to determine whether it is a subset (by rows and columns) of any bicluster already in the list of biclusters. If this is not the case, the newly found bicluster is appended to the list of biclusters, and it is inserted into the queue as new data.
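A runnable reading of this queue-based strategy is sketched below in Python. It is an interpretation under stated assumptions rather than the authors' implementation: `find_single_bicluster` is a simple stand-in for the heuristics from [10], the stop criterion is a maximum number of iterations, every cell of an accepted bicluster spawns one modified copy of the data, and the example matrix is hypothetical.

```python
from collections import deque
from typing import FrozenSet, List, Optional, Tuple

Matrix = List[List[int]]
Bicluster = Tuple[FrozenSet[int], FrozenSet[int]]


def find_single_bicluster(m: Matrix) -> Optional[Bicluster]:
    """Stand-in heuristic (NOT the heuristic of [10]): best-area bicluster
    among those seeded by the one-columns of each row."""
    best: Optional[Bicluster] = None
    for seed in range(len(m)):
        cols = frozenset(c for c, v in enumerate(m[seed]) if v == 1)
        if not cols:
            continue
        rows = frozenset(r for r in range(len(m))
                         if all(m[r][c] == 1 for c in cols))
        if best is None or len(rows) * len(cols) > len(best[0]) * len(best[1]):
            best = (rows, cols)
    return best


def already_covered(biclusters: List[Bicluster], candidate: Bicluster) -> bool:
    """True if the candidate is a subset (by rows and columns) of a stored bicluster."""
    rows, cols = candidate
    return any(rows <= r and cols <= c for r, c in biclusters)


def hierarchical_search(m: Matrix, max_iterations: int = 50) -> List[Bicluster]:
    """Breadth-first hierarchical search with an iteration limit as the stop criterion."""
    queue = deque([[row[:] for row in m]])  # the original data is the first task
    biclusters: List[Bicluster] = []
    iterations = 0
    while queue and iterations < max_iterations:
        data = queue.popleft()
        iterations += 1
        candidate = find_single_bicluster(data)
        if candidate is None or already_covered(biclusters, candidate):
            continue
        biclusters.append(candidate)
        rows, cols = candidate
        for r in sorted(rows):          # one subtask per cell of the bicluster
            for c in sorted(cols):
                child = [row[:] for row in data]
                child[r][c] = 0
                queue.append(child)
    return biclusters


# Hypothetical example matrix (not taken from the paper).
M = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
for rows, cols in hierarchical_search(M):
    print(sorted(rows), sorted(cols))
```

Because ones are only ever replaced by zeros, every bicluster found in modified data is also an exact bicluster of ones in the original matrix.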
In practice, in addition to stop conditions, limitations on tree generation are required. The first limitation deals with non-exhaustive subtree induction: this limits the percentage of bicluster ones that are replaced by zeros and processed further. The second limitation handles maximal tree depth. The pseudocode for this limited heuristic is presented in Algorithm 2.
Algorithm 2 Heuristics pseudocode with random queue generation and limited tree depth
Following the preparation of a new task in the main while loop, if the queue is not empty the depth of the task is checked. Tasks with a depth greater than a given threshold are omitted (line 11). If the queue is empty, a new root with zero depth is built from the remaining, uncovered original data (line 13) and appended to the queue.
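The two limitations can be sketched in Python as follows. This is an interpretation under stated assumptions, not a transcription of Algorithm 2: `find_single_bicluster` is a stand-in for the heuristics from [10], the parameter names (`max_depth`, `fraction`, `min_children`, `target_coverage`) and their default values are hypothetical, and the acceptance test reuses the subset check described for AlreadyCovered.

```python
import random
from collections import deque
from typing import FrozenSet, List, Optional, Tuple

Matrix = List[List[int]]
Bicluster = Tuple[FrozenSet[int], FrozenSet[int]]


def find_single_bicluster(m: Matrix) -> Optional[Bicluster]:
    """Stand-in heuristic (NOT the heuristic of [10])."""
    best: Optional[Bicluster] = None
    for seed in range(len(m)):
        cols = frozenset(c for c, v in enumerate(m[seed]) if v == 1)
        if not cols:
            continue
        rows = frozenset(r for r in range(len(m))
                         if all(m[r][c] == 1 for c in cols))
        if best is None or len(rows) * len(cols) > len(best[0]) * len(best[1]):
            best = (rows, cols)
    return best


def limited_search(m: Matrix, max_depth: int = 3, fraction: float = 0.01,
                   min_children: int = 3, target_coverage: float = 0.9,
                   max_iterations: int = 1000, seed: int = 0):
    """Hierarchical search with random subtask selection and limited tree depth."""
    rng = random.Random(seed)
    total_ones = sum(map(sum, m))
    covered = set()
    biclusters: List[Bicluster] = []
    queue = deque([([row[:] for row in m], 0)])  # (data, depth)
    for _ in range(max_iterations):
        if total_ones == 0 or len(covered) >= target_coverage * total_ones:
            break
        if not queue:
            # New root of depth zero, built from the still uncovered ones.
            root = [[0 if (r, c) in covered else v for c, v in enumerate(row)]
                    for r, row in enumerate(m)]
            queue.append((root, 0))
        data, depth = queue.popleft()
        if depth > max_depth:
            continue  # tasks that are too deep are omitted
        found = find_single_bicluster(data)
        if found is None or any(found[0] <= r and found[1] <= c for r, c in biclusters):
            continue
        biclusters.append(found)
        cells = [(r, c) for r in sorted(found[0]) for c in sorted(found[1])]
        covered.update(cells)
        # Only a random sample of the bicluster cells spawns subtasks.
        k = min(len(cells), max(min_children, int(fraction * len(cells))))
        for r, c in rng.sample(cells, k):
            child = [row[:] for row in data]
            child[r][c] = 0
            queue.append((child, depth + 1))
    coverage = len(covered) / total_ones if total_ones else 1.0
    return biclusters, coverage


# Hypothetical example matrix (not taken from the paper).
M = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 1, 1],
]
found_biclusters, reached = limited_search(M)
print(len(found_biclusters), round(reached, 3))
```

A root built from uncovered ones always yields a bicluster that covers at least one new cell, so coverage grows whenever the queue runs empty.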

Experiments
Experiments were performed on an example data set. The data set took the form of three binary matrices, presented in Fig. 1. The matrices are those used by Michalak and Ślęzak [8]. The three binary matrices were derived from a single data matrix (Fig. 1, left) which contained data of three discrete values. Each binary matrix was assigned one of the discrete values. If the value of a given element in the original data matrix was equal to the value assigned to a binary matrix, the corresponding element in that binary matrix was set to one. All other elements of the binary matrices were set to zero.
Fig. 1 The discrete data matrix (left) and the three binary matrices derived from it. The binary matrices were created using three discrete values: #0 (left center), #77 (right center) and #237 (right)
The set of all exact, inclusion-maximal biclusters in each of the three binary matrices can be found by using the BiMax algorithm [2] or an exhaustive Boolean reasoning strategy [8].
The results of applying these methods are presented in Table 4.
The total coverage is the ratio of the number of ones that are covered by at least one bicluster to the total number of ones. This value must be equal to unity as both methods used are exhaustive: they find all inclusion-maximal biclusters, which necessitates that every one in the data is covered by at least one bicluster. The overlap is the ratio of the summed size of all biclusters (the number of rows multiplied by the number of columns) to the total number of ones in the data. It represents the average number of biclusters that cover a single one in the data.
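Both measures are straightforward to compute. The Python sketch below evaluates them for a hypothetical 3 × 3 matrix and its four inclusion-maximal biclusters of ones; the function names and the data are assumptions made for illustration and are unrelated to the matrices of Fig. 1.

```python
from typing import FrozenSet, List, Tuple

Matrix = List[List[int]]
Bicluster = Tuple[FrozenSet[int], FrozenSet[int]]


def total_coverage(m: Matrix, biclusters: List[Bicluster]) -> float:
    """Share of the ones in m that belong to at least one bicluster."""
    ones = {(r, c) for r, row in enumerate(m) for c, v in enumerate(row) if v == 1}
    covered = {(r, c) for rows, cols in biclusters for r in rows for c in cols}
    return len(covered & ones) / len(ones)


def overlap(m: Matrix, biclusters: List[Bicluster]) -> float:
    """Summed bicluster area divided by the number of ones in m."""
    total_area = sum(len(rows) * len(cols) for rows, cols in biclusters)
    return total_area / sum(map(sum, m))


# Hypothetical example: a 3 x 3 matrix and its inclusion-maximal biclusters of ones.
M = [
    [1, 1, 0],
    [1, 1, 1],
    [0, 1, 1],
]
BICLUSTERS = [
    (frozenset({1}), frozenset({0, 1, 2})),
    (frozenset({1, 2}), frozenset({1, 2})),
    (frozenset({0, 1}), frozenset({0, 1})),
    (frozenset({0, 1, 2}), frozenset({1})),
]
print(total_coverage(M, BICLUSTERS))  # every one is covered
print(overlap(M, BICLUSTERS))         # summed area 14 over 7 ones
```

In this example each one is covered by two biclusters on average, which is exactly what the overlap value expresses.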
To provide a comparison to the exhaustive search and hierarchical heuristics strategies, Table 5 presents the results of sequential covering using a modified version of Johnson's strategy.
The application of the hierarchical heuristics to the data was carried out with the following assumptions and parameter settings:
• the total coverage of ones in the data should be approximately 90%,
• the maximal depth of the search tree is set to three, to enforce search outside of the root bicluster, and
• up to 1% of matrix elements are selected from a newly found bicluster (Algorithm 2, line 20), but no fewer than three subiterations are invoked.
Originally, 10 experiments were performed on each binary matrix, with the number of generated biclusters ranging from 100 to 1000 in steps of 100. However, because the experiments with 1000 biclusters provided unsatisfactory results, an additional three experiments per matrix were performed, with the number of generated biclusters ranging from 1100 to 1300 in steps of 100. The results of using the exhaustive strategy (Table 4) provide a reference for subsequent results. The sequential coverage strategy (Table 5) generated a set of biclusters covering all ones; however, the size of the newly found biclusters decreased as the coverage increased. The results of using the hierarchical heuristics strategy, presented in Table 6, show a compromise between high generalization (biclusters that are wide in both directions) and computation time. Figures 2, 3 and 4 present histograms of bicluster area for each of the three binary matrices, when using each of the three bicluster induction strategies. The histograms show the compromise between high generalization and computation time more clearly.
For each set of data the same observations can be made:
• the exhaustive strategy finds biclusters with a wider range of areas,
• the sequential coverage strategy finds biclusters with smaller areas,
• the hierarchical heuristics strategy finds biclusters with a wider range of areas than those found by the sequential coverage strategy.
These observations demonstrate that the hierarchical heuristics strategy meets expectations: the strategy can find more general biclusters in less time, compared to the modified version of Johnson's strategy. Figure 5 presents the relationship between total coverage and number of biclusters generated, when using the hierarchical heuristics strategy.
As the hierarchical heuristics strategy implements sequential coverage (only newly found biclusters that cover previously uncovered data are added to the final set), coverage increases as the number of generated biclusters increases. Figure 6 presents the relationship between bicluster area and iteration number, when using the hierarchical heuristics strategy. The observed relationship validates the central concept of the strategy. When using the modified version of Johnson's strategy, all covered ones are replaced with zeros, and the updated data is used as an input for further analysis. When using the hierarchical heuristics strategy, only a small number of ones are replaced with zeros. Invoking the process recurrently allows for the induction of biclusters that can cover both previously covered and previously uncovered ones, increasing bicluster generalization.
The results of applying the hierarchical heuristics strategy to the #237 data (Fig. 6, bottom) provide further insights. The strategy generates biclusters with small areas between iterations 200 and approximately 400. At approximately iteration 400 a steep increase in the area of the generated biclusters can be observed. A similar situation occurs at approximately iteration 800. This is caused by the strategy "jumping" to unexploited regions of the matrix (as a new root bicluster is generated) and inducing biclusters from those uncovered areas. The hierarchical strategy is capable of a total coverage value of unity. Table 7 presents the number of iterations required for this, in addition to the overlap.
Fig. 6 The area of the bicluster as a function of the iteration number

Conclusions and further works
Exhaustive search generates the widest possible exact biclusters; however, such an approach has high computational complexity (with regard to both processing time and memory). In [10], an attempt was made to decrease the computation time, while retaining the theoretical background of the approach, by modifying Johnson's strategy of prime implicant approximation. Although this was successful, the sequential coverage approach limited the area of the generated biclusters. Based on the modified version of Johnson's strategy, the hierarchical heuristics strategy, introduced in this work, is capable of more general bicluster induction. Experimental results when using the hierarchical heuristics strategy are promising, demonstrating an ability to find widespread biclusters covering a substantial section of the binary data.
Further modifications to the strategy could be considered. The use of a task queue provides a straightforward method to further decrease computation time. As all tasks are independent (with regard to single task analysis), they can be executed concurrently on different processor cores or computing nodes. Moreover, the initial results when using the hierarchical heuristics strategy could be postprocessed in order to remove small and insignificant biclusters. This confirms that the application of the Boolean reasoning paradigm to binary data biclustering continues to provide challenges to be solved.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.