Maximum and top-k diversified biclique search at scale

Maximum biclique search, which finds the biclique with the maximum number of edges in a bipartite graph, is a fundamental problem with a wide spectrum of applications in different domains, such as e-commerce, social analysis, web services, and bioinformatics. Unfortunately, due to the difficulty of the problem in graph theory, no practical solution has been proposed for large-scale real-world datasets. Existing techniques for maximum clique search on a general graph cannot be applied because the search objective of maximum biclique search is two-dimensional, i.e., we have to consider the size of both parts of the biclique simultaneously. In this paper, we divide the problem into several subproblems, each of which is specified using two parameters. These subproblems are derived in a progressive manner, and in each subproblem, we can restrict the search to a very small part of the original bipartite graph. We prove that a logarithmic number of subproblems is enough to guarantee the correctness of the algorithm. To minimize the computational cost, we show how to significantly reduce the bipartite graph size for each subproblem while preserving the maximum biclique satisfying certain constraints by exploring the properties of the one-hop and two-hop neighbors of each vertex. Furthermore, we study the diversified top-k biclique search problem, which aims to find k maximal bicliques that cover the most edges in total. The basic idea is to repeatedly find the maximum biclique in the bipartite graph and remove it from the bipartite graph k times. We design an efficient algorithm that shares the computation cost among the k results, based on the idea of deriving the same subproblems for different results. We further propose two optimizations to accelerate the computation by pruning the search space with size constraints and refining the candidates in a lazy manner.
We use several real datasets from various application domains, one of which contains over 300 million vertices and 1.3 billion edges, to demonstrate the high efficiency and scalability of our proposed solution. It is reported that a 50% improvement in recall can be achieved after applying our method in Alibaba Group to identify the fraudulent transactions in their e-commerce networks. This further demonstrates the usefulness of our techniques in practice.


Introduction
A bipartite graph is denoted by G = (U , V , E) where U (G) and V (G) denote the two disjoint vertex sets and E(G) ⊆ U × V denotes the edge set. The bipartite graph is a popular data structure, which has been widely used for modelling the relationship between two sets of entities in many real-world applications. For example, in e-commerce, a bipartite graph can be used to model the purchasing relationship between customers and products; in web applications, a bipartite graph can be used to model the visiting relationship between users and websites; in bioinformatics, a bipartite graph can be used to model the acting relationship between genes and roles in biological processes.
A subgraph C is a biclique if it is a complete bipartite subgraph of G that for every pair u ∈ U (C) and v ∈ V (C), we have (u, v) ∈ E(C). Like a clique in general graph, biclique is a fundamental structure in a bipartite graph, and has been widely used to capture cohesive bipartite subgraphs in a wide spectrum of bipartite graph applications. Below are several representative examples.
(1) Anomaly detection [4,7] In e-commerce platforms such as eBay and Alibaba, the behavior of a large group of customers purchasing a set of products together is considered an anomaly because there is a high probability that the group of people is making fraudulent transactions to increase the rankings of the businesses selling the corresponding products. This can be modeled as bicliques in a bipartite graph. Similarly, in web services, bicliques can be used to detect a group of web spammers who click a set of webpages together to promote their rankings.
(2) Gene expression analysis [16,18,25,45,59] In gene expression data analysis, different genes will respond in different conditions. The group of genes that have a number of common responses over multiple conditions is considered as a significant gene group.
(3) Social recommendation [23] In social analysis, there may exist a group of users who have the same set of interests, such as swimming, hiking, and fishing. Such groups and interests can be naturally captured by a biclique, which is helpful in social recommendation and advertising.
In practice, we cannot directly enumerate the bicliques of the bipartite graphs as the number of bicliques is prohibitively large in the above applications. In this paper, we investigate the problem of maximum biclique search, i.e., finding the biclique with the largest number of edges, for the following two reasons: (1) Given the biclique model, it is a very natural problem to find the maximum biclique, which is not only theoretically interesting but also useful in many real-life scenarios. For instance, the maximum biclique may represent the largest suspicious click farm in the e-commerce networks, the most significant gene group in a gene-condition bipartite graph, and the user group with the largest potential market value in the social network. (2) In some scenarios, one may need to enumerate a set of bicliques. For instance, the fraudulent transactions cannot be fully covered by the maximum biclique in the e-commerce network. To reduce the number of output bicliques, we may consider only maximal bicliques, i.e., bicliques none of whose supersets is also a biclique. Unfortunately, as shown in our initial empirical study, the number of maximal bicliques is still large (e.g., over 10^9 maximal bicliques were output after running a maximal biclique enumeration algorithm for 24 h on an e-commerce bipartite graph obtained from Alibaba). Thus, we have to consider the diversified top-k bicliques. Inspired by the well-studied diversified top-k clique search problem (e.g., [57]), we can follow the same procedure by repeatedly removing the current maximum biclique from the bipartite graph k times. Clearly, the efficient computation of the maximum biclique is the key to this problem.
Challenges and motivations Despite its wide range of applications, finding the maximum biclique is an NP-hard problem [38]. In the literature, there are many solutions to another related NP-hard problem: the maximum clique search problem in a general graph [17,19,20,26,31,46,47,48,49]. The main idea is to use graph coloring and core decomposition to obtain an upper bound for the maximum clique size and use this upper bound to prune vertices that cannot be contained in the maximum clique.
A natural question is: can we use the above graph coloring and core decomposition techniques to search for the maximum biclique in a bipartite graph? Unfortunately, the answer is negative. First, only two colors are needed to color a whole bipartite graph. Obviously, we cannot obtain a useful upper bound for the maximum biclique size using graph coloring. Second, in a large biclique, it is possible for a vertex to have a very small degree/core number. For example, suppose the maximum biclique C is a star where |U (C)| = 1 and |V (C)| is large. Then we only require the degree/core number of each vertex in V (C) to be ≥ 1. Consequently, even if a vertex has a small degree/core number, it still cannot be pruned. Therefore, the core decomposition technique also fails in maximum biclique search.
The main reason for the challenges in maximum biclique search is that the size of a biclique C depends on two factors: |U (C)| and |V (C)|; so, it is difficult to find a one-dimensional indicator, such as color number, degree, or core number, to prune vertices that cannot participate in the maximum biclique. Due to this challenge, existing solutions [38,59] can only handle small bipartite graphs and will face serious efficiency issues when the bipartite graph scales up in size. Motivated by this, in this paper, we tackle the above challenges and aim to solve the maximum biclique search problem on bipartite graphs at billion scale.
Furthermore, based on the maximum biclique search, we can find the diversified top-k bicliques, which are desired in some applications such as fraudulent transaction detection. Instead of computing the top-k bicliques based on the maximal biclique enumeration algorithm, which may output an exponential number of bicliques and is not practical on large-scale bipartite graphs, we adopt a simple but effective method that removes the maximum biclique from the bipartite graph k times to obtain the diversified top-k results. However, in this way, we still need to compute the maximum biclique k times independently, which is costly. One may wonder if we can share the computation costs among the diversified top-k bicliques. This is quite challenging because there is no overlap among the k diversified results.
Our solution Based on the above discussion, existing coloring and core decomposition-based approaches cannot yield effective pruning in maximum biclique search. Our paper aims for a new way to solve the problem. Our main idea is as follows: instead of finding the upper bounds for pruning, we try to guess a lower bound of |U (C * )| as well as a lower bound of |V (C * )| for the maximum biclique C * . If the guess is correct and tight, we can search on a much smaller bipartite graph by eliminating a large number of vertices based on the two lower bounds. However, we cannot guarantee that our guess is always correct. Therefore, instead of guessing only once, we guess multiple times which results in a list of lower-bound pairs (τ 0 U , τ 0 V ), (τ 1 U , τ 1 V ), . . .. To gain high pruning power, the list of pairs should satisfy four conditions: (1) τ 0 U × τ 0 V should be as large as possible but not larger than the number of edges in the optimal biclique C * ; (2) The pairs are derived in a progressive manner so that τ i for any i > 0; (3) There exists at least one pair τ k U and τ k V that are the true lower bounds of |U (C * )| and |V (C * )|; and (4) The number of pairs should be well-bounded.
To make this idea practically applicable, two issues need to be addressed: (1) How to guess the list of lower-bound pairs so that they satisfy the above four conditions; and (2) Given a lower-bound pair, how to eliminate as many vertices as possible while preserving the corresponding maximum biclique to optimize the computational cost.
Following the idea of the maximum biclique search problem, in the diversified top-k biclique search, we try to share the computation cost among the k results by taking advantage of the derived subspaces with lower-bound pairs. Our main idea is as follows: instead of guessing tight lower bounds only for the maximum biclique, we try to preserve more results within one list of lower-bound pairs by slightly relaxing the constraints in each lower bound pair. By doing this, we can share the computation cost among the preserved results, without computing lower-bound lists and eliminating vertices w.r.t. each lower-bound pair independently for every single result.
Contributions In this paper, we answer the above questions and make the following contributions:
-The first work to practically study maximum biclique search on big real datasets Although the maximum biclique search problem is NP-hard, we aim to design practical solutions to solve the problem in real-world large bipartite graphs with billions of edges. To the best of our knowledge, this is the first work to solve this important problem on real datasets at billion scale.
-A novel progressive-bounding framework We propose a progressive bounding framework to obtain the lower-bound pairs (τ i U , τ i V ). We analyze the framework by projecting the problem into a two-dimensional space, and we show that the set of lower-bound pairs forms a skyline in the two-dimensional space, and only logarithmic lower-bound pairs are enough to guarantee the correctness.
-Maximum-biclique preserved graph reduction Given a certain pair of lower bounds, we study how to eliminate vertices while preserving the maximum biclique. We investigate the vertex properties and derive pruning rules by exploring the one-hop and two-hop neighbors for each vertex. Based on the pruning rules, we can significantly reduce the size of the bipartite graph.
-Diversified top-k biclique search with computation sharing We formalize the diversified top-k biclique search as a problem to maximize the total number of edges covered by the top-k bicliques, which takes both size and diversity into consideration. Instead of computing the k results independently, we propose an efficient algorithm by considering the computation sharing among them. Based on the progressive bounding framework, we generate the subspaces by slightly relaxing the lower-bound constraints to preserve more results within one subspace set, such that we can share the computation among the preserved results. Two optimizations are proposed to further accelerate the computation by pruning search spaces and lazy refining candidates.
-Extensive performance studies on billion-scale bipartite graphs We conduct extensive performance studies using 18 real datasets from different application domains. The experimental results demonstrate the efficiency and scalability of our proposed approaches. Remarkably, in a user-product bipartite graph from Alibaba with over 300 million vertices and over 1.3 billion edges, our approach can find the maximum biclique within 15 min. It is also reported that 50% improvement on recall can be achieved after applying our proposed method in Alibaba Group to identify the fraudulent transactions.
Outline The remainder of this paper is organized as follows. Section 2 provides the preliminaries, which formally define the maximum biclique search problem and show its hardness. Section 3 introduces the baseline solution based on the branch-and-bound framework. In Sect. 4, we analyze the reason for the inefficiency of the baseline solution, and propose the progressive bounding framework. Section 5 presents the maximum-biclique preserved graph reduction techniques and their optimizations. In Sect. 6, we study the problem of diversified top-k biclique search and propose an efficient algorithm by sharing the computation cost among the k results. In Sect. 7, we evaluate our proposed algorithms using extensive experiments. We review the related work in Sect. 8 and conclude the paper in Sect. 9. This paper is extended from our previous work [28] to give a more comprehensive study. First, we add the introduction and motivation of the diversified top-k biclique search problem. Then, we add algorithms TopKBasic and TopK to find the diversified top-k bicliques, with two optimizations to further accelerate the computation. Finally, we add more experiments on maximum and diversified top-k biclique search to show the efficiency of the proposed algorithms.
Fig. 1 An example of a bipartite graph and its maximum biclique (panels a, b, c)

Preliminaries
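We consider an unweighted and undirected bipartite graph G = (U , V , E). For each vertex u ∈ U (G), we use N (u, G) to denote the set of neighbors of u in G, and d(u, G) = |N (u, G)| to denote the degree of u in G. We have symmetrical definitions for each vertex v ∈ V (G). The size of a bipartite graph G, denoted as |G|, is defined as the number of edges in G, i.e., |G| = |E(G)|.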
Definition 1 (Biclique) Given a bipartite graph G = (U , V , E), a biclique C is a complete bipartite subgraph of G, i.e., for each pair of u ∈ U (C) and v ∈ V (C), we have (u, v) ∈ E(C).
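As an illustration of Definition 1, the following minimal sketch checks whether two vertex sets induce a biclique. The edge-set representation, the function name `is_biclique`, and the toy graph are assumptions made for this example only; they do not come from the paper.

```python
from itertools import product


def is_biclique(edges, u_part, v_part):
    """Return True if every pair (u, v) with u in u_part and v in v_part is an edge."""
    return all((u, v) in edges for u, v in product(u_part, v_part))


def biclique_size(u_part, v_part):
    """Size of a biclique, measured in edges: |U(C)| * |V(C)|."""
    return len(u_part) * len(v_part)


# Hypothetical toy graph: each edge is a (u, v) pair with u in U and v in V.
edges = {("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2"), ("u3", "v1")}

complete = is_biclique(edges, {"u1", "u2"}, {"v1", "v2"})    # all four pairs exist
incomplete = is_biclique(edges, {"u1", "u3"}, {"v1", "v2"})  # (u3, v2) is missing
print(complete, incomplete, biclique_size({"u1", "u2"}, {"v1", "v2"}))
```

Since a biclique is complete, its number of edges is simply the product of the sizes of its two sides, which is the quantity maximized in the problem statement.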
In this paper, given a bipartite graph G, we aim to find a biclique C * in G with the maximum size. Considering that many real applications (e.g., fraudulent transaction detection) require that the number of vertices in each part of the biclique C * is not below a certain threshold, we add size constraints τ U and τ V on |U (C * )| and |V (C * )|, respectively. Such a size constraint can also provide the users with more flexibility to control the size of each side of the biclique or to avoid generating an overly skewed biclique (e.g., a biclique with a single vertex of the highest degree at one side and all its neighbors at the other side). As a special case, when τ U = 1 and τ V = 1, the problem will find the maximum biclique without any constraint. The maximum biclique problem studied in this paper is defined as follows: Problem statement Given a bipartite graph G = (U , V , E), and a pair of positive integers τ U and τ V , the problem of maximum biclique search aims to find a biclique C * in G, s.t. |U (C * )| ≥ τ U and |V (C * )| ≥ τ V , and |C * | is maximized. We use C * τ U ,τ V (G) to denote such a biclique. Figure 1a shows a bipartite graph G with

Example 1
NP-hardness and inapproximability As shown in [38], the maximum biclique problem is NP-hard, and as proved in [5] and [30], it is difficult to find a polynomial time algorithm to solve the maximum biclique problem with a promising approximation ratio. Due to the inapproximability, in this paper, we aim to find the exact maximum biclique and will propose several techniques to make our algorithm practical in handling large real-world bipartite graphs.

The baseline solution
In the literature, the state-of-the-art algorithm proposed in [59] resorts to the branch-and-bound framework, aiming to list all maximal bicliques by pruning non-maximal candidates from the search space. To obtain a reasonable baseline, in this section, we extend the algorithm proposed in [59], and design an algorithm to compute the maximum biclique by adding some pruning rules in the branch-and-bound process.
The branch-and-bound algorithm We briefly introduce the branch-and-bound algorithm. The algorithm maintains a partial biclique (U , V , U ×V ) and recursively adds vertices into V . When V is fixed, U can be simply computed as the set of common neighbors of all vertices in V , i.e.,
$$U = \bigcap_{v \in V} N(v, G) \qquad (1)$$
Therefore, we only need to consider V to determine the biclique. Based on this idea, the key to reducing the cost is to prune the useless vertices to be added into V . According to Eq. 1, when V is expanded, U will be contracted. The pseudocode of the algorithm is shown in Algorithm 1. The input of the algorithm includes the bipartite graph G, the thresholds τ U and τ V , and an initial biclique C. Here, C is used when a biclique is obtained before invoking the algorithm, or can be set as ∅ otherwise. The algorithm initializes C * as C (line 1), invokes the BranchBound procedure to update C * (line 2), and returns C * as the answer (line 3). The recursive procedure BranchBound has four parameters U , V , C V , and X V , initialized as U (G), ∅, V (G) and ∅, respectively. Here, (U , V , U × V ) defines a partial biclique. C V is the set of candidate vertices that can possibly be added to V , and X V is the set of vertices that have been used and should be excluded from V . The procedure BranchBound updates C * using (U , V , U ×V ) if it is larger than the current C * and satisfies the threshold constraints (lines 5-6). Then, it iteratively adds a vertex v * from C V to expand V (lines 7-8).
Then, the sets are updated for the new branch: U keeps only its vertices that are neighbors of v * ; V is extended with v * and with the vertices in C V that are neighbors of all vertices in the updated U ; C V keeps its vertices except those now in V and those whose number of neighbors in the updated U is no larger than τ U ; X V keeps its vertices except those whose number of neighbors in the updated U is no larger than τ U (lines 9-12).
(1) τ U pruning The size of U should be ≥ τ U since U will only be contracted in the branch.
(3) Size pruning The value of |U | × (|V | + |C V |) should be ≥ |C * |. Otherwise, exploring the current branch will not result in a larger biclique.
(4) Non-maximality pruning The non-maximality pruning is based on the fact that a maximum biclique should be a maximal biclique. If there is a vertex v in the exclusion set X V that is a neighbor of all vertices in U (i.e., U ⊆ N (v, G)), the resulting biclique cannot be maximal and thus the branch can be pruned. After searching bicliques with v * , we add v * into X V (line 15).
Example 2 Given the bipartite graph G in Fig. 1a and thresholds τ U = 1 and τ V = 1, we show the search tree of MBC in Fig. 2a. The vertices in V are processed in non-descending order of degree [59], and each tree node represents the v * selected in the branch. We illustrate the details of the search branch from v 5 in Fig. 2b. At first, we have X V = {v 6 , v 1 }, 5 , v 6 }. In step (1), we select v * = v 5 and refine U = {u 2 , u 3 , u 4 , u 5 , u 6 }. V consists of v * and the vertices in C V that connect to all vertices in U , i.e., V = {v 2 , v 3 , v 5 }. Then, we refine C V = {v 4 } and X V = {v 1 , v 6 }. By now, we update U (C * ) = U , V (C * ) = V and |C * | = 15. In step (2), we further select v * = v 4 , refine the corresponding sets in a similar way as shown in Fig. 2, and update |C * | = 16.
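The branch-and-bound idea above can be sketched as follows. This is a simplified illustration rather than the paper's Algorithm 1: it keeps U as the common neighbors of the chosen V (Eq. 1) and applies only the τ U pruning and the size pruning, omitting the exclusion set X V and non-maximality pruning. The function name `max_biclique`, the adjacency-dictionary input, and the toy graph are assumptions for this example.

```python
def max_biclique(adj_v, tau_u=1, tau_v=1):
    """Exhaustive branch-and-bound over subsets of V (exponential in the worst case).

    adj_v maps each vertex of V to the set of its neighbors in U.
    Returns (U*, V*) for an edge-maximum biclique meeting both thresholds,
    or two empty sets if no such biclique exists.
    """
    all_u = set().union(*adj_v.values())
    # Process V in non-descending order of degree, as in the baseline.
    order = sorted(adj_v, key=lambda v: len(adj_v[v]))
    best = {"size": 0, "U": set(), "V": set()}

    def branch(u_set, v_set, cand):
        size = len(u_set) * len(v_set)
        if len(u_set) >= tau_u and len(v_set) >= tau_v and size > best["size"]:
            best.update(size=size, U=set(u_set), V=set(v_set))
        for i, v in enumerate(cand):
            new_u = u_set & adj_v[v]          # U only shrinks when V grows
            if len(new_u) < tau_u:            # tau_U pruning
                continue
            # Keep only later candidates that still have enough neighbors in new_u.
            rest = [w for w in cand[i + 1:] if len(adj_v[w] & new_u) >= tau_u]
            # Size pruning: upper bound on any biclique reachable in this branch.
            if len(new_u) * (len(v_set) + 1 + len(rest)) <= best["size"]:
                continue
            branch(new_u, v_set | {v}, rest)

    branch(all_u, set(), order)
    return best["U"], best["V"]


# Hypothetical toy graph (not the graph of Fig. 1a).
adj_v = {
    "v1": {"u1", "u2", "u3"},
    "v2": {"u1", "u2"},
    "v3": {"u1", "u2"},
    "v4": {"u3"},
}
u_star, v_star = max_biclique(adj_v)
print(sorted(u_star), sorted(v_star))
```

On this toy graph the unconstrained answer is {u1, u2} × {v1, v2, v3} with 6 edges, while requiring τ U = 3 forces the smaller biclique {u1, u2, u3} × {v1}.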

A progressive bounding method
In this section, we first analyze the reason for the large search space of the baseline solution, and then introduce our approach using search space partitioning based on a progressive bounding framework to significantly reduce the computational cost.

Problem analysis
Why costly? Although four pruning conditions are used to reduce the search space for maximum biclique search in Algorithm 1, it will still result in a huge search space in real large bipartite graphs due to the following two drawbacks:
-Drawback 1: loose pruning bounds Most pruning conditions in Algorithm 1 rely on τ U and τ V . However, τ U and τ V are user-given parameters which can be small. In this way, the pruning power of τ U and τ V can be rather limited. For size pruning, the constraint of |U | × (|V | + |C V |) > |C * | can be very loose because C V is filtered using τ U and thus |C V | can be large when τ U is small.
-Drawback 2: large candidate size The size of a biclique C, calculated as |U (C)| × |V (C)|, depends on two factors: |U (C)| and |V (C)|. It is possible that the optimal solution C * is unbalanced, i.e., either with a large |U (C * )| and a small |V (C * )| or with a small |U (C * )| and a large |V (C * )|. Therefore, during the branch-and-bound process, even if the degrees of all candidates in C V are small (where |U | is small), we cannot stop branching when V ∪ C V is large, because we may still generate a large biclique in this situation. Similarly, we cannot remove a vertex from U when its degree is small. This can result in a huge search space on a large bipartite graph.
Example 3 Figure 3 shows a bipartite graph G with U = {u 1 , u 2 , ..., u 100 } and V = {v 1 , v 2 , ..., v 100 }. Specifically, u 1 connects to all vertices in V and v 1 connects to all vertices in U . Given τ U = 1 and τ V = 1, the size of the maximum biclique C * is 100. By adopting MBC, we first select v 1 into V . As v 1 connects to all vertices in U , U = {u 1 , u 2 , ..., u 100 }. Furthermore, as u 1 connects to all vertices in V , C V = {v 2 , v 3 , ..., v 100 }. However, we cannot prune any vertices with τ U = 1 and τ V = 1, and neither can we prune search branches with the size constraint since |U | × (|V | + |C V |) is larger than |C * |. Moreover, we cannot prune the candidate vertices in C V , even though their degrees are all 1, which leads to a large candidate size and a huge search space.
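The looseness of the size bound in the graph of Fig. 3 can be checked numerically. The short sketch below rebuilds that graph as described in the text and evaluates |U | × (|V | + |C V |) for the first branch; the variable names are chosen for this illustration only.

```python
n = 100
U = [f"u{i}" for i in range(1, n + 1)]
V = [f"v{j}" for j in range(1, n + 1)]

# u1 is adjacent to every vertex of V, and v1 is adjacent to every vertex of U.
edges = {("u1", v) for v in V} | {(u, "v1") for u in U}

tau_u = 1
best_size = n  # |C*|: the star {u1} x V (or U x {v1}) has 100 edges

# First branch of the baseline: v1 is added to the V side of the partial biclique.
part_v = {"v1"}
part_u = {u for u in U if (u, "v1") in edges}
cand_v = [
    v for v in V
    if v not in part_v and sum((u, v) in edges for u in part_u) >= tau_u
]

bound = len(part_u) * (len(part_v) + len(cand_v))
print(len(part_u), len(cand_v), bound, best_size)
```

Here the bound evaluates to 100 × (1 + 99) = 10000, two orders of magnitude above the true optimum of 100, so the size pruning cannot cut this branch.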
Our idea Based on the above analysis and to significantly improve the algorithm, we consider two aspects: -To resolve drawback 1, we need to improve the pruning bounds to achieve the stop conditions in early stages of the branch-and-bound process; -To resolve drawback 2, we need to remove as many vertices as possible from the graph to reduce the number of candidates that may participate in the optimal solution.
Our idea is as follows: instead of using the thresholds τ U and τ V for pruning, we enforce two new thresholds τ * U and τ * V for U (C * ) and V (C * ), respectively, with τ * U ≥ τ U and τ * V ≥ τ V . To tighten the bounds, we try to make τ * U × τ * V as large as possible but ensure that τ * U × τ * V is no larger than the size of the optimal solution. With τ * U and τ * V , we are able to obtain a smaller bipartite graph G * by removing as many vertices as possible that will not participate in the maximum biclique. On the smaller graph G * with tighter bounds τ * U and τ * V , the algorithm will be much more efficient. Suppose C * is the optimal solution, if we can guarantee that τ * U ≤ |U (C * )| and τ * V ≤ |V (C * )|, the algorithm on graph G * with thresholds τ * U and τ * V will output the optimal solution. However, to make our idea practically applicable, the following two issues need to be addressed: -First, we do not know the size of the maximum biclique C * before the search. -Second, it is difficult to find a single pair τ * U and τ * V to guarantee that τ * U ≤ |U (C * )| and τ * V ≤ |V (C * )|.
In the following, we will introduce a progressive bounding framework to resolve the two issues.

The progressive bounding framework
We propose a progressive bounding framework to address the two issues raised as follows: -To address the first issue, instead of using the size of the optimal solution |C * |, we use a lower bound lb(C * ) of |C * |, i.e., lb(C * ) ≤ |C * |. The lower bound can be quickly initialized and will be updated progressively to make the thresholds τ * U and τ * V tighter. -To address the second issue, instead of using a single pair τ * U and τ * V , we use multiple pairs (τ 1 . We will guarantee that for any possible Among the computed bicliques, the biclique with the maximum size is the answer for the original problem.
The algorithm framework The progressive bounding framework is shown in Algorithm 2. For any valid biclique C, i.e., one with |U (C)| ≥ τ U and |V (C)| ≥ τ V , |C| is a lower bound of the size of the optimal solution C * . Based on this, we first use InitMBC to obtain an initial biclique, denoted as C * 0 , s.t. |C * 0 | ≤ |C * | (line 1). Then, we set τ 0 V to be an upper bound of |V (C)| for any possible biclique C. Here, a natural upper bound is the maximum degree of any vertex in U (G), i.e., d U max (G) (line 2). k is used to denote the number of iterations and is initialized as 0 (line 3). The progressive bounding framework will finish in a logarithmic number of iterations. Each iteration will generate a pair τ k+1 U and τ k+1 V based on the values of τ k V and the lower bound of the optimal solution |C * k |. When τ k+1 V (τ k+1 U resp.) is smaller than τ V (τ U resp.), it will be set to be τ V (τ U resp.) (lines 5-6). We will analyze the rationale later. With τ k+1 U and τ k+1 V , we aim to obtain a graph G k+1 that is much smaller than G using procedure Reduce(G, τ k+1 U , τ k+1 V ), such that the maximum biclique w.r.t. thresholds τ k+1 U and τ k+1 V is preserved in G k+1 (line 7). After this, we find the maximum biclique w.r.t. τ k+1 U and τ k+1 V on G k+1 with C * k as the initial biclique in MBC (line 8).
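To make the threshold schedule concrete, the sketch below generates a list of lower-bound pairs. It rests on two assumptions that are not spelled out in the text above: that τ k+1 V is obtained by halving τ k V (rounded down), and that τ k+1 U is the current lower bound divided by τ k V (rounded down), each clamped from below by the user thresholds. It also keeps the lower bound fixed, whereas Algorithm 2 would refresh it after every search. The function name `threshold_pairs` is hypothetical.

```python
def threshold_pairs(lower_bound, max_deg_u, tau_u=1, tau_v=1):
    """List the (tau_U, tau_V) pairs tried by a progressive-bounding loop.

    lower_bound : size of a known valid biclique (held constant in this sketch)
    max_deg_u   : maximum degree of a vertex in U, used as the initial tau_V
    tau_u, tau_v: user thresholds (positive integers)
    """
    pairs = []
    cur_v = max_deg_u
    while cur_v > tau_v:
        # Assumed update rule: a biclique with at most cur_v vertices on the V side
        # and more than lower_bound edges needs more than lower_bound / cur_v
        # vertices on the U side.
        next_u = max(lower_bound // cur_v, tau_u)
        next_v = max(cur_v // 2, tau_v)
        pairs.append((next_u, next_v))
        cur_v = next_v
    return pairs


# Values taken from the running example: |C*_0| = 12 and tau^0_V = 6.
schedule = threshold_pairs(12, 6)
print(schedule)
```

Because the V-side threshold is halved each round, the number of pairs is logarithmic in the maximum degree, which matches the claim that the framework finishes in a logarithmic number of iterations.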
The rationale Next, we address the rationale of the progressive bounding framework. Note that the size of a biclique C is determined by |U (C)| and |V (C)|. Therefore, to analyze the problem, we define a two-dimensional space as follows: ) Given a bipartite graph G, a two-dimensional space S(G) has two axes |U | and |V |. Given any biclique C in G, we can represent it as a two- Given the search space S(G), the i-th search in line 7-8 of Algorithm 2 can be considered as to cover a certain sub- To show the search preserves the optimal solution, we define the optimal curve in S(G): Definition 3 (Optimal Curve) Given a bipartite graph G and parameters τ U and τ V , suppose C * is the maximum biclique w.r.t. τ U and τ V , we call the curve |U | × |V | = |C * | the optimal curve in the two-dimensional space S(G).
Note that the optimal curve is unknown before the search. However, it can be used to analyze the correctness of the progressive bounding framework as follows.

Theorem 1 (Algorithm Correctness) Given a bipartite graph G and parameters τ U and τ V , for any point
, and when k increases, τ k V will be iteratively divided by 2 until it is smaller than τ V . Therefore, we can always find a certain We consider two cases: In this case, we have: is a lower bound of the optimal value |C * | i.e., is a point on the optimal curve, we have Consequently, we can derive the following inequalities: According to the analysis above, Theorem 1 holds. Theorem 1 shows that all the points in the optimal curve within the range ) are covered by the search spaces in Algorithm 2. Note that for any . Therefore, Algorithm 2 obtains the optimal solution.
The rationale of the progressive bounding framework is shown in Fig. 4. Here, we draw the two-dimensional space S(G), and show the search spaces of the first three iterations of Algorithm 2 on S(G). We generate three search spaces using (τ 1 , which obtains the bicliques C * 1 , C * 2 , and C * 3 , respectively. We use red, green, and blue colors to differentiate the three spaces respectively. As shown in Fig. 4, when i increases, the curve |U |×|V | = |C * i | progressively approaches the optimal curve |U | × |V | = |C * |, and the optimal curve |U | × |V | = |C * | in S(G) for |V | ≥ τ 3 V is totally covered by the three search spaces. This illustrates the correctness of the progressive bounding framework.
Example 4 Given the bipartite graph G in Fig. 1a and thresholds τ U = 1 and τ V = 1, we adopt Algorithm 2 to find the maximum biclique. Suppose we initialize the biclique C * 0 as shown in Fig. 1c, so that |C * 0 | = 12 and τ 0 V = 6. Then, we search for the optimal solution progressively:
We adopt Reduce to filter vertices in G, e.g., we filter u 7 as d(u 7 , G) = 2 and it cannot be involved in a biclique with τ 1 V = 3. We will explain Reduce in detail later. We search for C * 1 on G 1 , and get Since we cannot find any larger biclique on reduced graph G 2 , |C * 2 | = 16. As shown above, we progressively use multiple strict τ k U and τ k V threshold pairs to approach the optimal solution.
The effectiveness of the progressive bounding framework is further verified in our experiments. For example, Table 2 shows that the graph compression ratio in the bounding iterations varies from 0% (omitted in the table) to 2.05%. This reduces significantly the search space and computation cost in the maximum biclique search procedure.
To realize the algorithm framework MBC * in Algorithm 2, we still need to solve the following two components: -The initial biclique computation algorithm InitMBC. We use a greedy strategy to obtain the initial biclique. Specifically, we initialize an empty biclique and iteratively add the vertex that can maximize the size of the current biclique until no vertex can be added. The biclique with the maximum size among the process is returned. -The graph reduction algorithm Reduce. We will discuss the details of Reduce in the next section.
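The greedy strategy described for InitMBC can be sketched as below. This is one possible reading of the description, not the paper's implementation: it grows only the V side, recomputes U as the common neighbors, picks at each step the vertex that keeps U largest (which maximizes the size of the extended biclique), and remembers the largest biclique seen. The thresholds τ U and τ V are ignored here, and the name `init_biclique`, the tie-breaking rule, and the toy graph are assumptions.

```python
def init_biclique(adj_v):
    """Greedy initial biclique: returns (U, V) of the largest biclique seen.

    adj_v maps each vertex of V to the set of its neighbors in U.
    """
    cur_u = set().union(*adj_v.values())
    cur_v = []
    best_u, best_v = set(), []
    remaining = set(adj_v)
    while remaining:
        # Adding w yields a biclique of size |cur_u & N(w)| * (|cur_v| + 1),
        # so maximizing the size means maximizing the surviving U side.
        # Ties are broken by vertex name to keep the run deterministic.
        v = max(remaining, key=lambda w: (len(cur_u & adj_v[w]), w))
        new_u = cur_u & adj_v[v]
        if not new_u:
            break  # no vertex can be added without emptying U
        cur_u, cur_v = new_u, cur_v + [v]
        remaining.discard(v)
        if len(cur_u) * len(cur_v) > len(best_u) * len(best_v):
            best_u, best_v = set(cur_u), list(cur_v)
    return best_u, set(best_v)


# Hypothetical toy graph.
adj_v = {
    "v1": {"u1", "u2", "u3"},
    "v2": {"u1", "u2"},
    "v3": {"u1", "u2"},
    "v4": {"u3"},
}
init_u, init_v = init_biclique(adj_v)
print(sorted(init_u), sorted(init_v))
```

The result is only a lower bound on |C * | in general; the greedy choice happens to reach the optimum on this small input.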

MBC-preserved graph reduction
As shown in Algorithm 2, one of the most important procedures is to reduce the size of the bipartite graph given certain τ i U and τ i V while preserving the maximum biclique. In this section, we show how to reduce the bipartite graph size by exploring some properties of the one-hop and twohop neighbors for a certain vertex. We first introduce the MBC-preserved graph below.

Definition 4 (MBC-Preserved Graph) Given a bipartite graph G, and thresholds τ
In other words, the maximum biclique for G is We can easily derive the following lemma:

One-hop graph reduction
To reduce the size of the bipartite graph, we first consider a simple case by exploring the one-hop neighbors for each vertex. Specifically, we use the number of neighbors to reduce the bipartite graph. Besides, we eliminate a vertex u by removing u and all its adjacent edges from G, denoted as G u. We derive the following lemma: Lemma 2 Given a bipartite graph G, thresholds τ i U and τ i V , we have: Proof Sketch: We only prove (1), and (2) can be proved similarly. Given a certain vertex Therefore, the lemma holds. Lemma 2 provides a sufficient condition for a vertex to be eliminated s.t. the maximum biclique is preserved. Based on the Lemma 1, Lemma 2 can be iteratively applied to reduce the graph size until no vertices can be eliminated.
The one-hop graph reduction is shown in Algorithm 3. Given a bipartite graph G and thresholds τ i U and τ i V , the algorithm computes a bipartite graph G i that is an MBC-preserved graph of G w.r.t. τ i U and τ i V by applying the one-hop reduction rule in Lemma 2. We first initialize G i to be G (line 1), and then we iteratively remove vertices from G i that satisfy either case (1) (lines 4-5) or case (2) (lines 6-7) in Lemma 2. The algorithm terminates when no such vertices can be found in G i . The following lemma shows the time complexity of Algorithm 3.

Lemma 3 Algorithm 3 requires O(|G|) time.
Proof Sketch: To implement Algorithm 3 efficiently, we can use a queue Q to maintain the set of vertices satisfying Lemma 2. Each vertex is pushed into and popped from the queue Q at most once. For each vertex v, after removing it from G i , we need to update the degrees of its neighbors and push into Q those neighbors that can now be eliminated by Lemma 2 because their degree has decreased. This requires O(d(v, G)) time. Therefore, the overall time complexity of Algorithm 3 is O(|G|).
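The queue-based implementation outlined in the proof sketch can be written down as follows. This is a sketch under stated assumptions rather than the authors' code: the elimination rule used here (remove u ∈ U when its degree is below τ V, remove v ∈ V when its degree is below τ U) is inferred from Example 5, and the adjacency dictionaries `adj_u`, `adj_v` and the function name `reduce_one_hop` are hypothetical.

```python
from collections import deque


def reduce_one_hop(adj_u, adj_v, tau_u, tau_v):
    """One-hop reduction sketch: repeatedly drop low-degree vertices.

    adj_u / adj_v map each U- / V-vertex to its neighbour set (assumed format).
    Assumed rule: u in U is dropped if d(u) < tau_v, v in V if d(v) < tau_u.
    """
    adj_u = {u: set(n) for u, n in adj_u.items()}   # work on copies
    adj_v = {v: set(n) for v, n in adj_v.items()}
    queue, queued = deque(), set()
    for u, nbrs in adj_u.items():
        if len(nbrs) < tau_v:
            queue.append(("U", u))
            queued.add(("U", u))
    for v, nbrs in adj_v.items():
        if len(nbrs) < tau_u:
            queue.append(("V", v))
            queued.add(("V", v))
    while queue:
        side, x = queue.popleft()
        if side == "U":
            own, other, other_side, limit = adj_u, adj_v, "V", tau_u
        else:
            own, other, other_side, limit = adj_v, adj_u, "U", tau_v
        for y in own.pop(x):                 # remove x and its incident edges
            other[y].discard(x)
            key = (other_side, y)
            # A neighbour whose degree fell below its threshold is queued once.
            if len(other[y]) < limit and key not in queued:
                queued.add(key)
                queue.append(key)
    return adj_u, adj_v


adj_u = {"a": {"1", "2"}, "b": {"1", "2"}, "c": {"2", "3"}}
adj_v = {"1": {"a", "b"}, "2": {"a", "b", "c"}, "3": {"c"}}
reduced_u, reduced_v = reduce_one_hop(adj_u, adj_v, tau_u=2, tau_v=2)
```

Every vertex enters the queue at most once and its removal touches only its incident edges, which matches the linear bound of Lemma 3. In the toy input, removing vertex 3 lowers the degree of c below the threshold, so c is removed in a cascading step.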

Two-hop graph reduction
Next, we explore the two-hop neighbors to further reduce the size of the bipartite graph. For each vertex u, suppose u′ is a two-hop neighbor of u, i.e., N(u′, G) ∩ N(u, G) ≠ ∅. To eliminate u by fully using the information within its two-hop neighbors, instead of only considering the degree of u′, i.e., |N(u′, G)|, we consider the number of common neighbors of u′ and u, i.e., |N(u′, G) ∩ N(u, G)|. To do so, we define the τ-neighbor and τ-degree as follows:
Definition 5 (τ-Neighbor and τ-degree) Given a bipartite graph G and a parameter τ, for any u ∈ U(G) and u′ ∈
Obviously, the τ-neighbor of any vertex u is a subset of the union of u itself and the two-hop neighbors of u. For example, in Fig. 5b
The following lemma shows how to use the τ-neighbor of a vertex to eliminate the vertex with the given thresholds.

Lemma 4
Given a bipartite graph G, thresholds τ i U and τ i V , we have: Proof Sketch: We only prove (1), and (2) can be proved similarly. Given a certain vertex Consequently, we can derive: As a result, the lemma holds.
Based on Lemma 4 and the transitive property shown in Lemma 1, we are ready to design the two-hop graph reduction algorithm. The pseudocode of the algorithm is shown in Algorithm 4. Since Lemma 4 can be applied to vertices in both U(G) and V(G), the algorithm reduces the bipartite graph G twice, and each time the vertices on one side are reduced using the procedure Reduce2H (lines 1-4).
In the Reduce2H procedure (lines 5-18), we visit each vertex u ∈ U to check whether u can be eliminated using Lemma 4 (line 6). We use S to maintain the set of two-hop neighbors of u along with the number of common neighbors with each two-hop neighbor. Specifically, for each two-hop neighbor u′ of u, we create a unique entry o = (u′, cnt) in S where o.cnt denotes the number of common neighbors of u and u′. In the algorithm, we first search the neighbors v ∈ N(u, G i ) (line 8) and then search the neighbors u′ ∈ N(v, G i ) to obtain each two-hop neighbor u′ (line 9). If the entry for u′ does not exist in S, we add u′ to S with cnt = 1 (lines 10-11); otherwise, we obtain the entry o for u′ and increase o.cnt by 1 (lines 13-14). After processing all two-hop neighbors of u, we maintain a counter c to count the number
Therefore, if c < τ i U , we can eliminate u from G i according to Lemma 4 (lines 16-17).
Proof Sketch: When processing U(G) (line 2), for each u ∈ U(G) (line 6) and v ∈ N(u, G) (line 8), we need to process all neighbors u′ of v using O(d(v, G)) time. Therefore, the total time complexity of the procedure in line 2 is Similarly, the total time complexity of the procedure in
Optimizations However, Reduce2Hop is more costly than Reduce1Hop, so we introduce two heuristics, early pruning and early skipping, to further optimize the two-hop reduction algorithm as follows.
(1) Early pruning In Algorithm 4, there is no specific order in which vertices are processed. However, if we first process the vertices that are more likely to be pruned, their removal may lead to more vertices being eliminated in later iterations. Based on this, we design a score function so that vertices with small scores are more likely to be pruned. A straightforward score is the vertex degree. However, it only considers the vertices on one side and ignores those on the other side. Therefore, for each vertex u, we sum up the degrees of all of u's neighbors, and design the score function as follows:
$score(u) = \sum_{v \in N(u, G)} d(v, G)$ (2)
The score function considers both the number of neighbors of u and the degrees of u's neighbors, and is cheap to compute. Given the score function, we can simply modify the algorithm to process vertices in non-decreasing order of their scores to improve the algorithm performance.
(2) Early skipping Next, we identify vertices that cannot be pruned by Reduce2Hop before exploring their two-hop neighbors. These vertices can be skipped directly. The following lemma provides a way to do this:

Lemma 6
For any vertices u, u′ and threshold τ, we have: and therefore u′ can be skipped by Lemma 4 without exploring the two-hop neighbors of u′. To realize this idea, for each vertex u′ ∈ U(G), we use u′.c to maintain the number of processed vertices u s.t. u′ ∈ N τ i V (u, G). When processing u, for each two-hop neighbor u′, if u′ ∈ N τ i V (u, G), we increase u′.c by 1. Later on, when processing u′, we check whether u′.c + 1 ≥ τ i U before exploring the two-hop neighbors of u′. If so, we know that u′ cannot be pruned and directly skip u′. Here, we use u′.c + 1 to take u′ itself into consideration.
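The two-hop check and the early-pruning order can be sketched together for one side of the graph. Several points are assumptions rather than statements from the text: the τ-neighbor of u is taken to be the set of vertices (u included) sharing at least τ V common neighbors with u, u is eliminated when fewer than τ U such vertices exist, and scores are computed once on the input graph. The names `score` and `reduce_two_hop_u_side` are hypothetical, and early skipping is left out to keep the sketch short.

```python
from collections import Counter


def score(u, adj_u, adj_v):
    """Sum of the degrees of u's neighbours (the early-pruning score)."""
    return sum(len(adj_v[v]) for v in adj_u[u])


def reduce_two_hop_u_side(adj_u, adj_v, tau_u, tau_v):
    """Two-hop reduction sketch for the U side (illustrative only)."""
    adj_u = {u: set(n) for u, n in adj_u.items()}   # work on copies
    adj_v = {v: set(n) for v, n in adj_v.items()}
    # Early pruning: visit vertices in non-decreasing order of their score.
    order = sorted(adj_u, key=lambda u: (score(u, adj_u, adj_v), u))
    for u in order:
        common = Counter()                   # plays the role of S: u' -> cnt
        for v in adj_u[u]:
            for w in adj_v[v]:               # w ranges over u and its two-hop neighbours
                common[w] += 1
        # Number of vertices sharing at least tau_v neighbours with u.
        tau_degree = sum(1 for cnt in common.values() if cnt >= tau_v)
        if tau_degree < tau_u:               # u cannot lie in a qualifying biclique
            for v in adj_u.pop(u):
                adj_v[v].discard(u)
    return adj_u, adj_v


adj_u = {"a": {"1", "2"}, "b": {"1", "2"}, "c": {"2", "3", "4"}}
adj_v = {"1": {"a", "b"}, "2": {"a", "b", "c"}, "3": {"c"}, "4": {"c"}}
kept_u, kept_v = reduce_two_hop_u_side(adj_u, adj_v, tau_u=2, tau_v=2)
```

In the toy input, vertex c has degree 3 and therefore passes a degree test against τ V = 2, but it shares only one neighbor with a and with b, so the two-hop check removes it. Applying the same procedure with the roles of U and V swapped would cover the other side, as Algorithm 4 does.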

The overall reduction strategy
Based on the above analysis, we can use either one-hop or two-hop reduction to reduce the size of the bipartite graph G. The following lemma shows that the two-hop reduction rule in Lemma 4 has stronger pruning power than the one-hop reduction rule in Lemma 2.

Lemma 7
Given a bipartite graph G, thresholds τ i U and τ i V , we have: Proof Sketch: We first prove (1).
(2) can be proved similarly.
Nevertheless, based on Lemmas 3 and 5, applying one-hop reduction is much more efficient than applying two-hop reduction. Therefore, we design the overall graph reduction strategy as follows:
Reduce Given a bipartite graph G and thresholds τ i U and τ i V , Reduce iteratively applies the one-hop and two-hop reduction strategies on G for MAX_ITER rounds, where MAX_ITER is a small constant, and returns the reduced graph G i . Specifically, in each round, Reduce first applies Reduce1Hop and then applies Reduce2Hop on the reduced graph.

Example 5
We show the example of the complete graph reduction process in Fig. 5. Given the bipartite graph G in Fig. 1a and thresholds τ U = 4, τ V = 4 and MAX_ITER = 2, we first apply Reduce1Hop in Fig. 5a. Since d(u 7 , G) = 2 < τ V and d(v 6 , G) = 2 < τ U , we prune u 7 and v 6 . Then, we apply Reduce2Hop in Fig. 5b with the details shown in Fig. 5d. We traverse the one-hop and two-hop neighbors of v 1 , and update the entries in S as shown in step (1) to step (4). For example, in step (1), we traverse v 1 's neighbor u 1 and two-hop neighbors v 1 , v 2 , v 3 and v 4 , and set cnt = 1 for each two-hop neighbor. After visiting all neighbors in step (4), we have three vertices with cnt = 4, i.e., After that, we further apply Reduce1Hop in Fig. 5c, and prune vertices u 1 and u 2 . By applying Reduce, we save huge search space in biclique search.

Diversified top-k biclique search
In some applications, one may need to enumerate a set of bicliques. For example, in click farm detection in E-commerce platforms such as Alibaba Group, the fraudulent transactions cannot be fully covered by the maximum biclique. Instead, we may need to consider maximal bicliques, i.e., bicliques none of whose supersets is also a biclique. However, as the number of maximal bicliques may be exponential in the graph size [11], a possible solution is to compute the top-k results ranked by size, since maximal bicliques with larger size are always more important [23]. However, the top-k results ranked by size usually overlap heavily, which significantly reduces the effective information carried by the k results.
Motivated by this, we study the problem of the Diversified Top-k Biclique Search in this section, aiming to find top-k results that are distinctive and informationally rich. Firstly, we formally define the diversified top-k biclique search problem.

Example 6
We show an example of top-2 bicliques in bipartite graph G with τ U = 1 and τ V = 1 in Fig. 6. There are three maximal bicliques in G: The result of top-2 maximal bicliques ranked by size is it is obvious that D 2 is more favorable since R 2 is highly overlapping with R 1 . In other words, cov(D 2 ) > cov(D 1 ).

NP-hardness
We show the hardness of the problem by considering the simple case: k = 1, τ U = 1, and τ V = 1. In this case, the problem becomes the maximum biclique search problem, which is NP-hard [38]. Therefore, the diversified top-k biclique search problem is NP-hard.
Algorithm 5: TopKBasic(G, k, τ U , τ V )
Input: Bipartite graph G, integer k, thresholds τ U and τ V
Output: The set of diversified top-k results D
D ← ∅;

Baseline solution
In the literature, the problem of maximal biclique enumeration is widely studied [15,29,35-37,41,59]. This leads to a straightforward solution of diversified top-k biclique search: first, we enumerate all the maximal bicliques satisfying the thresholds τ U and τ V , and then we formulate the problem of diversified top-k biclique search as a max k-cover problem. However, in a large-scale bipartite graph, the enumeration is costly and may not terminate in reasonable time. Besides, it is infeasible to keep all the maximal bicliques in memory due to the exponential number of maximal bicliques in large bipartite graphs.
Fortunately, by taking advantage of our efficient maximum biclique search method, we can find the diversified top-k results by repeatedly removing the current maximum biclique from the bipartite graph k times, which follows the framework in a well-studied diversified top-k clique problem [57].
The baseline solution is shown in Algorithm 5. It first initializes the result set D as empty (line 1), then greedily computes k bicliques to insert into D (lines 2-7), and returns D as the top-k results (line 8). Each time, it invokes MBC* to compute the maximum biclique R i in G satisfying the thresholds τ U and τ V (line 3). If R i is empty, no more bicliques satisfying τ U and τ V can be found and we stop searching (lines 4-5). Otherwise, we update D by inserting R i into it (line 6), and then remove the edges in R i from G (line 7).
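The loop of Algorithm 5 can be illustrated on a toy graph. In this sketch, the exhaustive function `max_biclique` is only a stand-in for MBC* that is feasible on tiny inputs, the edge-list input format is an assumption, and both function names are hypothetical.

```python
from itertools import combinations


def max_biclique(edges, tau_u, tau_v):
    """Exhaustive stand-in for MBC* (exponential, tiny graphs only).

    Returns (U-side set, V-side set) with the most edges subject to the
    thresholds, or None if no biclique satisfies them.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
    best, best_size = None, 0
    vertices = sorted(adj)
    for r in range(max(1, tau_u), len(vertices) + 1):
        for group in combinations(vertices, r):
            # The best V side for a fixed U side is its common neighbourhood.
            common = set.intersection(*(adj[u] for u in group))
            if len(common) >= tau_v and r * len(common) > best_size:
                best, best_size = (set(group), common), r * len(common)
    return best


def top_k_basic(edges, k, tau_u, tau_v):
    """Baseline: find the maximum biclique, remove its edges, repeat k times."""
    remaining = set(edges)
    results = []
    for _ in range(k):
        found = max_biclique(remaining, tau_u, tau_v)
        if found is None:                    # nothing satisfies the thresholds
            break
        results.append(found)
        left, right = found
        remaining -= {(u, v) for u in left for v in right}
    return results


edges = [(u, v) for u in ("u1", "u2") for v in ("v1", "v2", "v3")]
edges += [("u2", "v4"), ("u3", "v3"), ("u3", "v4")]
top2 = top_k_basic(edges, k=2, tau_u=1, tau_v=1)
covered = sum(len(a) * len(b) for a, b in top2)
```

Because each result is removed from the graph before the next search, the returned bicliques are edge-disjoint and the covered edge count is simply the sum of their sizes.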
Time complexity We analyze the time cost of Algorithm 5. The time is mainly spent on the k invocations of MBC*, each of which consists of the graph reduction time and the maximum biclique search time. In MBC*, we denote the number of subspaces generated for searching result R j as l j , where l j is bounded by log(d U max (G)). For result R j , we use T reduce (G) to denote the graph reduction time (including one-hop and two-hop graph reduction), and T search (G i, j ) to denote the maximum biclique search time, where G i, j represents the reduced graph in the i-th subspace for R j . The total time cost of Algorithm 5 is then $O(\sum_{j=1}^{k} \sum_{i=1}^{l_j} (T_{reduce}(G) + T_{search}(G_{i,j})))$.

Example 7 Given a bipartite graph G in Fig. 7a with thresholds τ U = 1 and τ V = 1, we adopt Algorithm 5 to find the diversified top-2 bicliques in G:
(1) To find R 1 in G with MBC * , suppose |C 0 | = 10 and τ 0 V = 5, we generate two subspaces as follows: We find the maximum biclique C * 1 (marked as gray in Fig. 7a), and we update |C * We cannot find larger bicliques.
Thus, we obtain R 1 = C * 1 as shown in Fig. 7a. Then, we remove all edges in R 1 from G, and get G′ as shown in Fig. 7b. (Here, we omit the vertices with no edges.) (2) To find R 2 in G′ with MBC*, suppose |C 0 | = 6, and τ 0 V = 4, we generate two subspaces as follows: We find C * 1 (marked as gray in Fig. 7b), and update |C * We cannot find larger bicliques.
Thus, we obtain R 2 = C * 1 , as shown in Fig. 7b. Note that the subspaces generated in G and G′ are different. Consequently, for R 1 and R 2 , we compute the reduced subgraph by Reduce and the maximum biclique by MBC in each subspace independently.
Finally, we obtain the result set D = {R 1 , R 2 }.

Advanced diversified top-k search
In this subsection, we first analyze the drawbacks of the baseline solution, and then introduce our new diversified top-k biclique search approach, based on the idea of deriving the same subspaces for different results to share computation cost among them.

Problem analysis
Drawbacks of TopKBasic The major limitation of TopKBasic is the isolated computation of R i by MBC * . Recall that in MBC * , we progressively generate subspaces based on the value of the maximum biclique size found so far (line 5-6 in MBC * ). We call such subspaces generated for R i as a subspace set. Obviously, for the top-k results, the generated subspace sets are different (e.g., the generated subspace sets of R 1 and R 2 in Example 7). Consequently, both graph reduction by Reduce and maximum biclique search by MBC in MBC * will be computed independently in each subspace among all the k results, which is costly.
Our idea Intuitively, since the different generated subspace sets lead to the isolated computations of R i in TopKBasic, we propose to generate the same subspace set for all the k results so as to share the computation among them. Specifically, we fix the subspace set in MBC* as follows: (1) Instead of using the largest biclique size found so far as the lower bound of the optimal solution for generating subspaces, we use a constant c. According to Theorem 1, it is not hard to prove that with c as the lower bound, any maximum biclique whose size is larger than c is preserved in the derived subspaces. Thus, to find the top-k results, we set c to a constant that is smaller than the size of the k-th result R k . (2) We fix τ 0 V = d U max (G ori ), where G ori denotes the original bipartite graph G, and d U max (G ori ) is guaranteed to be an upper bound of |V(R)| for all the k results. Consequently, with the fixed c and τ 0 V , we can generate the same subspace set for all the k results. We denote such a fixed subspace set as FS(G, c), or FS in short if the context is clear.
With the idea of the fixed subspace set FS(G, c), the following two issues need to be further addressed:
- First, we do not know the size of the k-th result R k . Although we can set c to a small constant, e.g., c = τ U × τ V , the thresholds τ i U and τ i V computed based on c in each subspace may be very loose for graph reduction and search space pruning.
- Second, even if we can generate the same subspace set for the k results, we still need to remove R i from the bipartite graph G when searching for R i+1 , which means that the reduced graph and the maximum biclique in each subspace need to be recomputed.

Advanced top-k biclique search
To solve the above problems, we first preserve the following three pieces of information for each subspace in FS(G, c): (1) the thresholds τ i U and τ i V computed based on c; (2) the reduced subgraph G i w.r.t. τ i U and τ i V ; (3) the maximum biclique C * i . Based on FS(G, c), we address the two issues as follows:
- To address the first issue, instead of initializing c as a very small constant to cover all the results, which leads to loose thresholds, we search for the top-k results by progressively relaxing c. Specifically, we use a lower bound of the size of the top-1 biclique to initialize c. Then, we generate FS(G, c) and search for results in it. Once we cannot find enough results within FS(G, c), we relax c to c′ by multiplying it by a factor α, where 0 < α < 1, and regenerate FS(G, c′) to cover more results.
- To address the second issue, instead of recomputing the subgraph by Reduce and the maximum biclique by MBC in each subspace when searching for the next result, we apply lightweight subgraph updating and on-demand maximum biclique searching in FS(G, c). Specifically, as we have maintained the reduced subgraph G i and the maximum biclique C * i in each subspace in FS(G, c), when we need to remove R j from G, we can update G i by simply eliminating the edges in R j from G i . Moreover, we only need to recompute those C * i that overlap with R j . Otherwise, C * i remains unchanged even when G i is updated.
The advanced algorithm The proposed algorithm is shown in Algorithm 6. It first initializes D as an empty set and R 0 as empty (line 1). We set the value of c to the size of the initial biclique in G found by InitMBC, which is a lower bound of the size of the top-1 result R 1 (line 2). We use flag to indicate whether we need to generate FS with constant c, initialized as true, and i to denote the index of the top-k results, initialized as 0 (line 3). Then, we search for the top-k results (lines 4-16). We first invoke GenSubSpaces to generate FS when flag is true, and after generation, we set flag to false (lines 5-6). With FS, we invoke FixedMBC* to search for the maximum biclique in each subspace, and return the one with the largest size as the result R i+1 (line 7). If R i+1 is empty, no bicliques larger than c can be found in FS. In this case, we terminate the computation if c < τ U × τ V , since no more bicliques satisfying τ U and τ V can be found (lines 9-10). Otherwise, we relax c by multiplying it by a factor α, where 0 < α < 1, and set flag to true to indicate that we need to regenerate FS (lines 11-12). If R i+1 is not empty, we add R i+1 into D and update G by deleting the edges in R i+1 from G (lines 13-16). Finally, we return D as the diversified top-k results (line 17).
Procedure GenSubSpaces generates the fixed subspace set based on c. It follows a procedure similar to MBC*, except that in GenSubSpaces, each iteration generates a pair of thresholds τ i U and τ i V based on the fixed constant c (lines 22-23) and searches for the maximum biclique whose size is larger than c (line 25). Here, we slightly modify MBC so that it uses a constant c rather than an initial biclique for size pruning, and if no biclique larger than c can be found, we directly return an empty biclique. Besides, we preserve the reduced subgraph G i , thresholds τ i U and τ i V , and maximum biclique C * i as the subspace information in FS (line 26), in order to share the computation cost among all results preserved in FS.
Procedure FixedMBC* searches for the maximum biclique in the fixed subspace set FS. First, we initialize C * as empty (line 30), and then progressively update it with larger bicliques found in each subspace (lines 31-36). For the i-th subspace in FS, we first update the subgraph G i by eliminating all edges in R from G i , where R is the last diversified biclique result we found (line 32). Then, if C * i overlaps with R, which indicates that the maximum biclique in the current subspace has changed, we recompute C * i by MBC (lines 33-34). Otherwise, C * i remains unchanged, and there is no need to update it. We update C * if we find a larger biclique (lines 35-36), and finally return C * as the result (line 37).
Time complexity We analyze the time cost of Algorithm 6. The time cost mainly consists of the graph reduction time (in GenSubSpaces) and the maximum biclique search time (in GenSubSpaces and FixedMBC*). First, for graph reduction, suppose we generate k′ subspace sets to cover all the top-k results by invoking GenSubSpaces. Here, k′ is bounded by $\log_{\alpha}(1/c_0)$, where c 0 is a lower bound of the size of the top-1 biclique R 1 (0 < α < 1). In each subspace set FS m (1 ≤ m ≤ k′), we denote the number of subspaces as l m , where l m is bounded by log(d U max (G)). Then, the total graph reduction time is $O(\sum_{m=1}^{k'} \sum_{i=1}^{l_m} T_{reduce}(G))$. Second, for maximum biclique search, in GenSubSpaces we need to compute the maximum biclique in all subspaces in FS m , while in FixedMBC* we only compute the maximum biclique when needed. Suppose for result R j , we need to compute denotes the i-th maximum biclique that needs to be recomputed on subgraph G i, j to obtain R j . Here, l j ≤ l m for R j preserved in FS m . Then, the total maximum biclique Note that in practice, we observe that k′ is much smaller than k, and for result R j preserved in FS m , l j is also much smaller than l m in most cases.

Example 8 Given a bipartite graph G in Fig. 7a with thresholds τ U = 1 and τ V = 1, we adopt Algorithm 6 to find the diversified top-2 bicliques in G as shown in Fig. 8.
Suppose we have c = 10 and τ 0 V = 5. We generate FS(G, c) consisting of two fixed subspaces: The reduced subgraph G 1 is shown in Fig. 8a with the maximum biclique C * 1 marked as gray; the reduced subgraph G 2 is shown in Fig. 8b with the maximum biclique C * 2 marked as gray. Based on FS(G, c), we search for the top-2 diversified bicliques as follows: (1) With the preserved maximum bicliques in the subspace set, we obtain R 1 = C * 1 whose size is maximized. Then, we remove all edges in R 1 from G and get G′.
(2) After finding R 1 , we update G 1 and G 2 by directly removing the edges in R 1 from them, as shown in Fig. 8c, d, respectively (here we omit vertices with no edges). Furthermore, we only need to recompute C * 1 as it overlaps with R 1 , and skip C * 2 as it remains unchanged. Then, we obtain R 2 = C * 2 . Finally, we obtain the result D = {R 1 , R 2 }. Compared with TopKBasic, TopK benefits from the fixed subspace set: it saves cost by sharing the computation of graph reduction and maximum biclique search in subspaces among the results preserved in FS(G, c).
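The logarithmic bound on the number of generated subspace sets in the time-complexity analysis of Algorithm 6 follows from the relaxation rule. The short derivation below is a sketch that ignores additive constants, and the concrete values of c_0 and α in the last line are illustrative assumptions, not settings reported in the paper.

```latex
% After m relaxations the size constraint is c_m = \alpha^m c_0 with 0 < \alpha < 1.
% Relaxation continues only while c_m \ge \tau_U \tau_V \ge 1, hence
\alpha^{m} c_0 \ge 1
\;\Longleftrightarrow\;
m \le \log_{\alpha}\!\left(\frac{1}{c_0}\right) = \frac{\ln c_0}{\ln (1/\alpha)} .
% Illustrative values (assumed): c_0 = 1000, \alpha = 0.5
% \log_{0.5}(1/1000) = \log_2 1000 \approx 9.97, i.e. roughly 10 relaxations at most.
```

A larger α relaxes c more slowly, which keeps the thresholds tight but may require regenerating FS more often.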

Optimization strategies
In Algorithm 6, the computation cost mainly consists of two parts: (1) the graph reduction when generating FS in GenSubSpaces, and (2) the updating of maximum biclique C * i in FixedMBC * . To further save the computation cost, we propose the following two optimizations.
Global size pruning In GenSubSpaces, we apply Reduce on G in each subspace to reduce the graph size, which is costly since G is large. However, in FS(G, c), we only search for bicliques whose size is larger than c. Based on this, before we apply Reduce on G in each subspace w.r.t. τ i U and τ i V , we can first prune all vertices that cannot be involved in a biclique with size larger than c, so as to share the computation among all subspaces in FS(G, c). Although we do not know the biclique size before searching, for a vertex u, we can use the sum of the degrees of all of u's neighbors as an upper bound on the size of any biclique that involves u. Following Definition 4, we also consider MBC-preserved graphs G′ of G w.r.t. the size constraint c. We derive the following lemma:

Lemma 8 Given a bipartite graph G, and size constraint c, we have:
(1) ∀u ∈ U(G): $\sum_{v \in N(u,G)} d(v, G)$
We omit the proof here. Lemma 8 provides a sufficient condition for a vertex to be eliminated s.t. the maximum biclique whose size is larger than c is preserved. Based on Lemma 1, Lemma 8 can be applied iteratively to reduce the graph size until no more vertices can be eliminated. We can simply modify GenSubSpaces in Algorithm 6 by applying the global size pruning rule on G to get G′ first, and then iteratively generating subspaces by applying Reduce on G′.
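The global size pruning rule can be sketched as a fixed-point loop. Since the exact inequality of Lemma 8 is not available here, the sketch assumes that a vertex is pruned when the sum of its neighbors' degrees is at most c, which is safe because that sum upper-bounds the edge count of any biclique containing the vertex. The single adjacency dictionary covering both sides and the names `build_adjacency` and `global_size_prune` are hypothetical.

```python
def build_adjacency(edges):
    """Undirected adjacency over both sides; vertex names are assumed distinct."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def global_size_prune(adj, c):
    """Drop every vertex that cannot lie in a biclique with more than c edges.

    Assumed rule: prune x when sum(d(y) for y in N(x)) <= c.
    Recomputing all bounds in each pass is simple but not the fastest option.
    """
    adj = {x: set(n) for x, n in adj.items()}       # work on a copy
    changed = True
    while changed:
        changed = False
        for x in list(adj):
            bound = sum(len(adj[y]) for y in adj[x])
            if bound <= c:
                for y in adj.pop(x):                # remove x and its edges
                    adj[y].discard(x)
                changed = True
    return adj


edges = [(u, v) for u in ("u1", "u2") for v in ("v1", "v2", "v3")]
edges += [("u3", "v3"), ("u3", "v4")]
pruned = global_size_prune(build_adjacency(edges), c=5)
```

With c = 5 the 2 × 3 biclique (6 edges) survives while the pendant vertices u3 and v4 are removed. With c = 6 no biclique larger than c exists and the whole graph is pruned away.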
Lazy candidate refining In FixedMBC*, when searching for R j+1 after finding R j , the case of C * i overlapping with R j indicates that C * i is not up to date, so we recompute C * i by MBC. However, it is not necessary to update C * i immediately if it cannot be R j+1 , so we refine the candidates in a lazy manner. Specifically, in FixedMBC*, when a C * i that overlaps with R j is no larger than the optimal biclique C * found so far, instead of directly updating C * i by MBC, which is costly, we label it with lazy = true, and do not recompute it until it could be the maximum one. To apply the lazy refining strategy, we modify GenSubSpaces by initializing lazy = false for C * i in all subspaces. Then, we modify FixedMBC* as follows:
(1) Before updating subspaces in FS and searching for R j+1 , we first traverse all subspaces to get the up-to-date maximum bicliques, i.e., those that do not overlap with R j and have lazy = false. Among all these bicliques, we use the size of the largest one as the lower bound of |R j+1 |, denoted as lb(R j+1 ).
(2) Then in all subspaces: (i) For C * i with lazy = true, if |C * i | > lb(R j+1 ), we update C * i by MBC and set lazy = false; otherwise, we skip C * i . (ii) For C * i with lazy = false that overlaps with R j , if |C * i | > lb(R j+1 ), we update C * i by MBC; otherwise, we set lazy = true for C * i and skip it. (iii) For C * i with lazy = false that does not overlap with R j , there is no need to refine it as it is already up to date.
(3) Finally, we return C * as the biclique with the largest size among all up-to-date candidates with lazy = false.

Performance studies
In this section, we show the performance studies. We first present the experimental results by comparing the proposed maximum biclique search algorithm MBC * , with the following two baseline algorithms: (1) MBC: MBC is developed based on the algorithm in [59], where the code is obtained from the authors, with the pruning rules in Algorithm 1 added.
(2) MAPEB: MAPEB is developed based on the parameterized algorithm APEB in [14]. Given a bipartite graph G and an integer p, APEB aims to find a biclique C in G with at least p edges (where (G, p) is called a yes-instance), or report that no such biclique exists. Naturally, we extend APEB with the binary search technique to find the maximum biclique C * . We denote the lower bound and the upper bound of |C * | as lb and ub, respectively. The basic idea is that we iteratively set p = (lb + ub)/2, and adopt APEB to decide whether (G, p) is a yes-instance: if it is, we update C * as the found biclique C and set lb = |C| + 1; otherwise, we update ub = p − 1. We stop the computation when lb > ub, and return C * . We initialize C * as ∅, lb as τ U × τ V , and ub as the maximum score(u) (defined in Eq. 2) among all vertices u in G. We also add pruning rules for the size constraints τ U and τ V . We call the extended algorithm MAPEB.
We evaluate our algorithms in two aspects: (1) the effectiveness of the graph reduction techniques and optimization strategies used in MBC*, and (2) the efficiency and scalability of maximum biclique search by comparing MBC* with MBC and MAPEB. Then, we show the performance of the diversified top-k biclique search by comparing the proposed algorithm TopK with the baseline algorithm TopKBasic. A case study of anomaly detection on real datasets obtained from Alibaba Group is further described to demonstrate the quality of the results obtained with our method. Unless otherwise specified, experiments are conducted with τ U = 3, τ V = 3 by default. All of our experiments are performed on a machine with an Intel Xeon E5-2650 (32 Cores) 2.6GHz CPU and 128GB main memory running Linux.
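The binary-search wrapper that turns APEB into MAPEB can be sketched as follows. The oracle below is an exhaustive stand-in for APEB that only works on tiny graphs, rounding p down is an assumption (the text does not state the rounding), and the names `make_oracle` and `binary_search_max_biclique` are hypothetical.

```python
from itertools import combinations


def make_oracle(adj_u):
    """Exhaustive stand-in for APEB: a biclique with at least p edges, or None."""
    def oracle(p):
        vertices = sorted(adj_u)
        for r in range(1, len(vertices) + 1):
            for group in combinations(vertices, r):
                common = set.intersection(*(adj_u[u] for u in group))
                if r * len(common) >= p:
                    return set(group), common
        return None
    return oracle


def binary_search_max_biclique(oracle, lb, ub):
    """Binary search on the edge count p, as in the MAPEB description."""
    best = None
    while lb <= ub:
        p = (lb + ub) // 2                   # rounding down is an assumption
        found = oracle(p)
        if found is not None:                # yes-instance: raise the lower bound
            best = found
            lb = len(found[0]) * len(found[1]) + 1
        else:                                # no-instance: lower the upper bound
            ub = p - 1
    return best


adj_u = {
    "u1": {"v1", "v2", "v3"},
    "u2": {"v1", "v2", "v3", "v4"},
    "u3": {"v3", "v4"},
}
deg_v = {}
for nbrs in adj_u.values():
    for v in nbrs:
        deg_v[v] = deg_v.get(v, 0) + 1
upper = max(sum(deg_v[v] for v in nbrs) for nbrs in adj_u.values())  # max score(u)
result = binary_search_max_biclique(make_oracle(adj_u), lb=1, ub=upper)
```

On this toy graph the search issues one yes-instance query followed by two no-instance queries. The no-instance queries are the ones that force the oracle to exhaust its search, which is consistent with the cost issue reported for MAPEB in the experiments.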
Datasets We use 18 real datasets selected from different domains with various data properties, including the ones used in existing works. The detailed statistics of the datasets are shown in Table 1. The first 13 datasets are obtained from KONECT. The last five datasets are real datasets obtained from the E-Commerce company Alibaba Group. Here, the AddCart20 and AddCart18 datasets include data of customers adding products to their carts in 1 day (sampled from data in 2020) and 10 days (sampled from data in 2018), respectively. The Transaction20 and Transaction18 datasets include data of customers purchasing products in 3 days (sampled from data in 2020) and 15 days (sampled from data in 2018), respectively. Additionally, the LabeledAddCart dataset includes fraudulent transaction labels that we use as the ground truth in the case study.

Graph reduction and optimizations
In this subsection, we test the effectiveness and efficiency of the graph reduction techniques and optimization strategies used in the algorithm MBC * .

Effectiveness of graph reduction
We test the effectiveness of the proposed one-hop and two-hop graph reduction techniques on the TVTropes and BookCrossing datasets, and show the results in Tables 2 and 3, respectively. We set MAX_ITER in Reduce to 2. Experiments on other datasets have similar outcomes. In Tables 2 and 3, we list τ k U , τ k V and the number of vertices and edges of the reduced graph in each iteration k in MBC*. We also list the size of C * k found in each iteration. We compute the compression ratio r k by dividing the size of the reduced graph by the size of the original graph. In iteration 0, we show the results of graph G 0 reduced by τ U = 3 and τ V = 3, as a comparison. We omit the results of iterations where the reduced graphs are empty. From the results, we can see that in each iteration, we adopt much stricter constraints τ k U and τ k V than τ U and τ V . Therefore, by utilizing the graph reduction techniques, we get much smaller reduced graphs, e.g., a compression ratio of 0% (omitted in the table) to 2.05% when using τ k U and τ k V in our progressive bounding framework vs. 97.53% when using τ U and τ V , as shown in Table 2. This prunes most of the search space and greatly accelerates the biclique computation.

Efficiency of graph reduction We conduct experiments on LiveJournal and WebTrackers to compare the performance of the basic algorithms with the optimized versions. We denote the basic version of Algorithm 2 as BASIC, the algorithm with the early pruning strategy introduced in Sect. 5.2 as OPT 1 , and the algorithm that adds the early skipping strategy introduced in Sect. 5.2 on top of OPT 1 as OPT 2 . The results are shown in Fig. 9, with the two-hop graph reduction time cost denoted as TwoHopTime, and the total time cost denoted as AllTime.

MBC * versus baseline algorithms
In this subsection, we compare the performance of MBC * , MBC and MAPEB on maximum biclique search by: (1) conducting experiments on all datasets; (2) varying τ U and τ V thresholds on both small-sized and large-sized graphs; (3) varying graph density; (4) varying graph scale.
In all experiments, we set the maximum processing time to 24 h, and if a method cannot finish computing, we denote the time cost as NaN. For those experiments that cannot finish within 24 h, we also report the quality ratio above the corresponding bars, which is calculated as:
quality ratio = (size of the current best biclique) / (size of the maximum biclique)
Note that it is possible that the quality ratio is 100% while the algorithm cannot finish, because the size of the maximum biclique is unknown before the algorithm finishes.
We compare MBC* with MBC and MAPEB on all datasets, and report the processing time in Fig. 10. From Fig. 10, we can see that when the dataset is relatively small, e.g., around 0.1 million edges in Writers, MBC* and MBC can both find C * 3,3 efficiently. As the graph size scales up, e.g., for graphs with millions of edges such as BookCrossing and StackOverflow, MBC takes hours to compute the results, while MBC* only takes seconds. Furthermore, when the graph size grows to around 1 billion edges, as in AddCart and Transaction, MBC cannot finish computing within 24 h, while MBC* only takes minutes to compute the results. MAPEB, however, fails to finish computing in most cases. Moreover, in most time-out cases, the bicliques found by MBC and MAPEB are far smaller than the maximum bicliques. From the results shown in Fig. 10, we can see that MBC* is much more efficient and scalable than both MBC and MAPEB on all datasets.
Varying τ U and τ V thresholds We vary τ U and τ V thresholds to compute C * and illustrate the performance of MBC * , MBC and MAPEB in Fig. 11. Figure 11 shows that MBC can process small graphs (YouTube and StackOverflow) but fails in processing large graphs (LiveJournal and WebTrackers). For small graphs, when τ U and τ V get larger, the time cost of MBC decreases. This is because as τ U and τ V get larger, MBC can filter more search branches. For large graphs, MBC cannot finish computing within 24 h, since the search space is huge and MBC is stuck in local search. MAPEB cannot finish computing for all cases. The main reason is that MAPEB is developed based on APEB [14], which mainly benefits from the early termination as soon as the yes-instance is found. However, to find the maximum biclique, we will encounter no-instances in the binary search process in MAPEB. For the no-instance case (G, p), for each vertex u ∈ U (G) (v ∈ V (G) resp.), APEB has to enumerate all the combinations (with size constraints of ≥ τ V (≥ τ U resp.) and ≤ √ p ) of u's (v's resp.) neighbors and induce biclique for each combination correspondingly, which is very costly. In comparison, MBC * is orders of magnitude faster than both MBC and MAPEB on all settings. For most cases, when τ U and τ V get larger, the time cost of MBC * slightly increases. This is because in most real cases, as τ U and τ V get larger, |C * | becomes smaller. Thus, MBC * generates relatively looser τ k U and τ k V constraints, which results in larger reduced graph. Specifically, in WebTrackers, the processing time is steady. This is because for all τ U and τ V settings in this experiment, |C * | in WebTrackers is relatively large, and consequently τ k U and τ k V are quite strict. In general, the high efficiency of MBC * mainly benefits from the effective progressive bounding framework with graph reduction techniques, which saves enormous search space in biclique search.
Varying graph density
In this experiment, we test the effect of graph density on the performance, and show the results in Fig. 12. We prepare graphs with different densities by sampling edges in the original graph. For example, we sample 20%, 40%, 60%, 80% and 100% of the edges in TVTropes, and denote these (sub)graphs as TV 1 , TV 2 , TV 3 , TV 4 and TV 5 in ascending order of density. Figure 12 shows that as the graphs grow denser, MBC takes longer to find the maximum bicliques, or cannot finish computing within 24 h. MAPEB sometimes outputs larger bicliques than MBC (e.g., on WebTrackers), since it may find yes-instances efficiently with some appropriate p during the binary search. However, it cannot finish computing in most cases because the no-instances in the binary search are handled inefficiently. In contrast, MBC * is orders of magnitude faster than both MBC and MAPEB on all settings. It is worth noting that for dense graphs, MBC * also finds maximum bicliques efficiently. For example, in Fig. 12c, as the graphs grow denser from LJ 3 to LJ 5 , the processing time of MBC * decreases. The reason is that MBC * can find larger C * k in denser graphs. This helps tighten the τ k U and τ k V thresholds and leads to small (or even empty) reduced graphs in the progressive bounding framework. Therefore, MBC * finds maximum bicliques efficiently on both sparse and dense graphs.

Varying graph scale
We evaluate scalability by testing the effect of graph size on the performance. We prepare datasets by obtaining 1, 3, 6 and 10 days data of AddCart18, and 1, 3, 6, 10 and 15 days data of Transaction18. We list the statistics in Table 4, and report the results in Fig. 13. In Fig. 13, we can see that neither MBC nor MAPEB can finish computing within 24 h on any dataset, and the reported bicliques are much smaller than the maximum bicliques. In contrast, the processing time of MBC * increases steadily as the graph scales up. For the graphs of AddCart10d and Transaction15d, which both consist of about 1.3 billion edges, MBC * takes 18 min and 15 min to compute the results, respectively. To the best of our knowledge, no existing solutions can find maximum bicliques in bipartite graphs at this scale.

TopK versus TopKBasic
In this subsection, we test the efficiency of TopK and TopKBasic on diversified top-k biclique search. We first test the efficiency of the proposed optimizations of global size pruning and lazy candidate refining in TopK. Then, we compare TopK with TopKBasic by: (1) varying the result number k; (2) varying τ U and τ V thresholds; and (3) varying graph density. In this subsection, we set k = 80, τ U = 3 and τ V = 3 by default unless otherwise specified. Besides, in TopK, we relax the lower bound c by multiplying it by a factor α to include more results: FS with a smaller c can preserve more results, while FS with a larger c can generate tighter bounds in subspaces. As a compromise, we set α = 0.7 in this subsection. Moreover, we set the maximum processing time as 24 h, and if the computation is not finished, we denote the time cost as NaN.

Efficiency of optimizations
We conduct experiments on Transaction20 and AddCart20 to compare the basic TopK algorithm with the optimized versions. We denote the basic version of Algorithm 6 as BASIC, the algorithm with global

Varying results number k
In this experiment, we test the efficiency by varying the results number k and report the processing time of TopK and TopKBasic in Fig. 15. We set k as 10, 20, 40, 80 and 160, respectively. For small graphs of Wikipedia and DBLP, we can see that TopK achieves several times better performance than TopKBasic on all k settings. For large graphs of Transaction20 and AddCart20, TopK also outperforms TopKBasic by several times, and by an order of magnitude for top-20 biclique searching on AddCart20. Besides, from the figure, we can see that as k becomes larger, both TopK and TopKBasic take longer to find the results. This is because as k increases, the sizes of the result bicliques tend to be smaller. Consequently, the subspaces of later results in top-k are generated with relatively looser τ i U and τ i V constraints, which leads to larger reduced subgraphs and longer biclique searching time. In general, TopK outperforms TopKBasic by several times to an order of magnitude for all k settings.
Varying τ U and τ V
In this experiment, we test the efficiency by varying the thresholds of τ U and τ V . The results are reported in Fig. 16. On Wikipedia and AddCart20, when τ U and τ V get larger, the time cost of both TopK and TopKBasic increases. The reason is that the average size of the top-k results becomes much smaller on Wikipedia and AddCart20 as τ U and τ V get larger. This leads to relatively looser τ i U and τ i V constraints and thus larger reduced subgraphs in subspaces, which take longer to search for bicliques. On DBLP and Transaction20, the sizes of the top-k results are relatively large on all τ U and τ V settings, and thus the performance is not sensitive to the τ U and τ V settings but to the specific generated τ i U and τ i V in subspaces. TopKBasic needs to compute the k results one by one, while TopK can preserve more results in FS by slightly relaxing the τ i U and τ i V constraints in subspaces. Accordingly, the experimental results show that TopK benefits a lot from the computation sharing, and outperforms TopKBasic by several times to an order of magnitude on all τ U and τ V settings.
Varying graph density
In this experiment, we show the effect of graph density on the performance and report the results in Fig. 17. We prepare graphs with different densities by sampling edges in the original graphs, including small graphs of DBLP and Wikipedia, and large graphs of Transaction20 and AddCart20. For example, we sample 20%, 40%, 60%, 80% and 100% of the edges in Transaction20, denoted as TRA 1 , TRA 2 , TRA 3 , TRA 4 and TRA 5 , respectively. Note that we omit the results on WIKI 1 and DBLP 1 , since these graphs do not contain enough bicliques to form top-80 results. Figure 17 shows that as the graphs grow denser, TopKBasic takes longer to find the top-k results in most cases, except that in TRA 5 , the time cost decreases. The main reason is that the sizes of result bicliques in TRA 5 are relatively large, and consequently the subspaces are generated with stricter τ i U and τ i V constraints. The time cost of TopK follows a similar trend to that of TopKBasic but increases more slowly as graphs grow denser (except TRA 5 where time cost decreases for the same reason), and TopK is several times to an order of magnitude faster than TopKBasic on all graphs. Therefore, TopK finds the diversified top-k bicliques efficiently on both sparse and dense graphs.
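The density-varying datasets can be pictured with a minimal Python sketch. It assumes uniform sampling without replacement and a fixed seed; the text does not state how the edges are sampled or whether the subgraphs are nested, so `sample_edges`, `density_series` and their parameters are hypothetical names introduced only for illustration.

```python
import random


def sample_edges(edges, fraction, seed=0):
    """Return a uniform random subset holding round(fraction * |E|) edges.

    Assumption: uniform sampling without replacement; the sampling scheme
    actually used in the experiments is not specified in the text.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be in [0, 1]")
    edges = list(edges)
    rng = random.Random(seed)
    keep = round(fraction * len(edges))
    return rng.sample(edges, keep)


def density_series(edges, fractions=(0.2, 0.4, 0.6, 0.8, 1.0), seed=0):
    """Build one sampled edge list per fraction, in ascending order of density."""
    return [sample_edges(edges, f, seed) for f in fractions]


# Toy bipartite graph: 10 x 10 complete bipartite graph (100 edges).
toy_edges = [(f"u{i}", f"v{j}") for i in range(10) for j in range(10)]
series = density_series(toy_edges)
```

With the toy graph above, the five edge lists play the role of the 20% to 100% samples; each sample keeps the vertex identifiers of the original graph, so vertices may become isolated in the sparser samples.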

Case study
Our proposed algorithm has been deployed in Alibaba Group to detect fraudulent transactions. E-business owners at Taobao and Tmall (two E-commerce platforms of Alibaba Group) may pay some agents in black market to promote the rankings of their online shops. Considering the costs of fake transactions and maintenance of a large amount of user accounts, these agents usually need to organize a group of users to "purchase" a set of products at the same time for cost effectiveness. This will lead to some bicliques (i.e., click farms) in the bipartite graph consisting of users, products and purchase transactions. As the maximum biclique alone cannot cover all fraudulent transactions, we apply the diversified top-k biclique search method as follows.
TopK. We adopt Algorithm 6 to compute the diversified top-k bicliques (i.e., suspicious click farms) in the bipartite graph. Note that TopK improves the recall rate of fraudulent transactions by 50% according to the feedback of the risk management team from Alibaba Group. To further demonstrate the effectiveness and efficiency of TopK, we also evaluate the following two baseline approaches on a real dataset LabeledAddCart obtained from Alibaba Group, which includes the labels of ground-truth fraudulent transactions.
(1) EnumK. We adopt EnumK, whose logic is the same as that of MBC but without the size pruning rule (in lines 5 and 13 of Algorithm 1), to enumerate all maximal bicliques satisfying the thresholds τ U and τ V , where each maximal biclique represents a click farm. However, because the number of maximal bicliques is huge, it is not possible to find all of them and then select the top-k among them. We therefore evaluate the result of the first-k output maximal bicliques.
(2) Reduce. Given appropriate values of thresholds τ U and τ V , Reduce outputs the reduced bipartite graph, where the edges represent suspicious fraudulent transactions. Although Reduce cannot output bicliques, it can reduce the candidate size.
We define the precision and recall rate as follows:

$$\text{precision} = \frac{\text{number of found fraudulent transactions}}{\text{number of output edges of the method}}$$

$$\text{recall} = \frac{\text{number of found fraudulent transactions}}{\text{number of ground-truth fraudulent transactions}}$$

TopK result evaluation
In this experiment, we vary τ V from 2 to 5 (with τ U = 1) to test the precision of top-k diversified bicliques found by TopK on LabeledAddCart, and show the results in Fig. 18. The figure shows that the precision is over 95% in most cases except top-1000 when τ V = 2. This is because coincidences are more likely to happen when τ V is small. When τ V > 2, the precision is even larger than 99%. In general, TopK outputs fraudulent transactions with high precision, and the found bicliques can serve as evidence when taking disciplinary measures. In the real application in Alibaba Group, TopK not only returns fraudulent transactions with high precision, but also improves the recall rate by 50% w.r.t. existing solutions.
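As a small illustration of these two definitions, the following Python sketch computes both ratios over sets of (user, product) edges. The function name `precision_recall`, the edge representation and the toy data are assumptions made for this example only; they are not taken from the evaluation code used on LabeledAddCart.

```python
def precision_recall(output_edges, fraud_edges):
    """Compute (precision, recall) for a set of output edges.

    output_edges: edges reported by a method (e.g., edges of found bicliques).
    fraud_edges:  ground-truth fraudulent transactions.
    Both are iterables of hashable (user, product) pairs (an assumption).
    """
    output_edges = set(output_edges)
    fraud_edges = set(fraud_edges)
    found = output_edges & fraud_edges  # found fraudulent transactions
    precision = len(found) / len(output_edges) if output_edges else 0.0
    recall = len(found) / len(fraud_edges) if fraud_edges else 0.0
    return precision, recall


# Toy data: a method outputs a 2 x 2 biclique; three of its four edges are fraudulent.
output = {("u1", "p1"), ("u1", "p2"), ("u2", "p1"), ("u2", "p2")}
fraud = {("u1", "p1"), ("u1", "p2"), ("u2", "p1"),
         ("u3", "p3"), ("u3", "p4"), ("u4", "p4")}
precision, recall = precision_recall(output, fraud)
```

In this toy case three of the four output edges are fraudulent and three of the six ground-truth edges are found, so the precision is 0.75 and the recall is 0.5.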

EnumK result evaluation
We conduct experiments of EnumK on LabeledAddCart and show the results in Fig. 19. We set τ U = 1 and τ V = 2, and the results with other settings are similar. Given the fact that EnumK cannot finish maximal biclique enumeration within 24 h, we record two statistics of the first-k output maximal bicliques: (1) the total number of output edges, denoted as All, and (2) the number of unique output edges, denoted as Uni. Besides, the enumeration process easily becomes stuck in local search, so the search order has great influence on the result of the first-k bicliques. Thus, we adopt two search orders in EnumK, i.e., we iteratively add v ∈ V into the biclique in descending order (denoted as Desc) or ascending order (denoted as Asc) of the number of v's neighbors in U . Intuitively, these two orders enumerate the maximal bicliques in the dense region and the sparse region of the bipartite graph, respectively. From Fig. 19, we can see that for Desc order, when the output biclique number increases, the total number of output edges increases as well. However, the number of unique edges barely grows, which indicates that EnumK enumerates many redundant maximal bicliques with very limited effective information when searching in the dense region of the graph. In comparison, for Asc order, both the total output edges and the unique edges increase. However, the average size of the first-16000 maximal bicliques is only 12, which is too small to be used in the anomaly detection application, with a precision of only 33.23% compared with the ground-truth. The computation cost of EnumK is also high, and the algorithm outputs huge amounts of maximal bicliques (over 10^9 bicliques in 24 h). In conclusion, maximal biclique enumeration is not suitable for this case study of anomaly detection on large-scale graphs.
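The difference between the two statistics All and Uni can be made concrete with a short Python sketch. It assumes each output biclique is given as a pair of vertex sets (U side, V side); the function name `edge_statistics` and the toy bicliques are hypothetical and serve only to illustrate how overlapping bicliques inflate All while leaving Uni almost unchanged.

```python
def edge_statistics(bicliques):
    """Return (All, Uni) for a list of bicliques.

    All: total number of output edges, counting an edge once per biclique
         that contains it.
    Uni: number of distinct edges covered by at least one biclique.
    Each biclique is a pair (U_side, V_side) of vertex sets (an assumption).
    """
    total = 0
    unique = set()
    for u_side, v_side in bicliques:
        total += len(u_side) * len(v_side)
        unique.update((u, v) for u in u_side for v in v_side)
    return total, len(unique)


# Two heavily overlapping bicliques plus one small disjoint biclique.
first_k = [
    ({"u1", "u2"}, {"v1", "v2"}),
    ({"u1", "u2"}, {"v2", "v3"}),
    ({"u3"}, {"v4"}),
]
all_edges, uni_edges = edge_statistics(first_k)
```

Here the first two bicliques share the edges incident to v2, so All counts 9 edges while Uni counts only 7.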
Reduce result evaluation
Given specific τ U and τ V values, we can detect fraudulent transactions with Reduce. In this experiment, we vary τ V from 2 to 5, and for each τ k V , we set two corresponding τ k U values, i.e., the small value τ s k U for loose condition, and the large value τ l k U for strict condition. All τ U values are suggested by the experts of anomaly detection. Due to the confidential nature, we omit the exact values. For simplicity, we use τ s U and τ l U to represent the loose and strict constraints for all τ k V . We evaluate the performance in terms of precision and recall rate, and present the results in Fig. 20. In Fig. 20a, the precision of Reduce improves when τ V grows larger, since the more common products a group bought together, the more suspicious the transactions are. Similarly, a larger τ U also leads to higher precision with fixed τ V . However, the precision does not meet the requirement of at least 95% (from Alibaba). In Fig. 20b, the recall rate is relatively high, especially for the loose constraints τ s U , because Reduce relies only on the topological structure of the graph. However, this high recall rate comes at the cost of low precision and a large number of output edges (over 10^7 edges for all settings). Besides, the result quality depends heavily on the given τ U and τ V thresholds, which cannot be easily adapted to different datasets manually. Therefore, Reduce is not suitable for anomaly detection in this case study.
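To give a feel for threshold-based graph reduction, the following Python sketch applies only a degree (one-hop) rule: a vertex u ∈ U can belong to a biclique with at least τ V vertices on the V side only if it has at least τ V neighbors, and symmetrically for v ∈ V with τ U. This is a simplified stand-in and not the Reduce procedure evaluated above, which according to the paper also exploits two-hop neighbor properties; the name `degree_reduce` and its parameters are hypothetical.

```python
from collections import defaultdict


def degree_reduce(edges, tau_u, tau_v):
    """Iteratively drop vertices that cannot be in any biclique C with
    |U(C)| >= tau_u and |V(C)| >= tau_v, using degrees only.

    Assumption: edges are (u, v) pairs with u on the U side and v on the V side.
    Returns the set of surviving edges.
    """
    adj_u = defaultdict(set)  # u -> neighbors in V
    adj_v = defaultdict(set)  # v -> neighbors in U
    for u, v in edges:
        adj_u[u].add(v)
        adj_v[v].add(u)

    # A vertex in U needs degree >= tau_v; a vertex in V needs degree >= tau_u.
    pending = [("U", u) for u, nbrs in adj_u.items() if len(nbrs) < tau_v]
    pending += [("V", v) for v, nbrs in adj_v.items() if len(nbrs) < tau_u]
    removed = set(pending)

    while pending:
        side, x = pending.pop()
        if side == "U":
            for v in adj_u.pop(x):
                adj_v[v].discard(x)
                if len(adj_v[v]) < tau_u and ("V", v) not in removed:
                    removed.add(("V", v))
                    pending.append(("V", v))
        else:
            for u in adj_v.pop(x):
                adj_u[u].discard(x)
                if len(adj_u[u]) < tau_v and ("U", u) not in removed:
                    removed.add(("U", u))
                    pending.append(("U", u))

    return {(u, v) for u, nbrs in adj_u.items() for v in nbrs}


# A 2 x 2 biclique with a short tail (u3 - v1, u3 - v3) attached to it.
toy = [("u1", "v1"), ("u1", "v2"), ("u2", "v1"), ("u2", "v2"),
       ("u3", "v1"), ("u3", "v3")]
reduced = degree_reduce(toy, tau_u=2, tau_v=2)
```

On the toy graph, v3 is removed first because it has a single neighbor, which lowers the degree of u3 below the threshold and removes it as well, leaving only the 2 x 2 biclique. The cascade illustrates why even a purely topological filter can shrink the candidate set, while still saying nothing about which surviving edges form bicliques.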

Related work
In this section, we review the related work, including maximum biclique search and its variants, maximal biclique enumeration and diversified top-k search.

Maximum biclique search and its variants
The maximum biclique problem has become increasingly popular in recent years [14,42,43]. Reference [43] proposes an integer programming methodology to find the maximum biclique in general graphs. However, it is not applicable for large-scale graphs. Reference [42] develops a Monte Carlo algorithm for extracting a list of maximal bicliques, which contains a maximum biclique with fixed probability. Reference [14] studies the parameterized maximum biclique problem in bipartite graphs that reports if there exists a biclique with at least p edges, where p is a given integer parameter. Besides, there are two variants of the maximum biclique problem, i.e., the maximum vertex biclique and the maximum balanced biclique. The former one aims to find the biclique C * that |U (C * )| + |V (C * )| is maximized. This problem can be solved in polynomial time by a minimum cut algorithm [33]. The latter one aims to find the biclique C * with maximum cardinality that |U (C * )| = |V (C * )|. The most popular approaches are heuristic algorithms, including [2,44,55] that solve the problem by converting it into a maximum balanced independent set problem on the complement bipartite graph with node deletion strategies, and [60] that combines tabu search and graph reduction to find the maximum balanced biclique on the original bipartite graph. References [53,56] propose local search frameworks to find good solutions within reasonable time. References [32,61] introduce exact algorithms to find the maximum balanced biclique by following the branch-and-bound framework.

Maximal biclique enumeration
The maximal biclique enumeration problem is widely studied. A biclique is said to be maximal if it is not contained in any larger biclique. Reference [3] proposes a consensus approach, which starts with a collection of simple bicliques, and then expands the bicliques as a sequence of transformations on the biclique collections. References [36,41] find maximal bicliques C = (U , V , U × V ) by exhaustively enumerating U as subsets of one vertex partition, obtaining V as their common neighbors in the other vertex partition, and then checking the maximality of C. In [59], the authors propose algorithm iMBEA, which combines backtracking with a branch-and-bound framework to filter out the branches that cannot lead to maximal bicliques. References [15,29] reduce the problem to the maximal clique enumeration problem by transforming the bipartite graph into a general graph. Reference [21] proves that maximal bicliques are in correspondence with frequent closed itemsets. The maximal biclique enumeration can then be reduced to the well-studied frequent closed itemsets mining problem [13,27,50,52]. References [35,37] propose parallel methods to enumerate maximal bicliques in large graphs.
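The exhaustive scheme attributed to [36,41] above (enumerate a subset U of one side, take V as its common neighbors, then check maximality) can be sketched in a few lines of Python. This brute-force version is exponential in the size of one side and is meant only for toy graphs; it is an illustration of the idea, not a reimplementation of any cited algorithm, and the name `maximal_bicliques` and the adjacency-map input are assumptions.

```python
from itertools import combinations


def maximal_bicliques(adj):
    """Enumerate maximal bicliques of a tiny bipartite graph by brute force.

    adj maps each vertex of one side to the set of its neighbors on the
    other side (an assumed input format). Returns a set of
    (frozenset(U), frozenset(V)) pairs with non-empty sides.
    """
    left = sorted(adj)
    found = set()
    for size in range(1, len(left) + 1):
        for subset in combinations(left, size):
            # V: common neighbors of the chosen subset U.
            common = set.intersection(*(adj[u] for u in subset))
            if not common:
                continue
            # Maximality check: U must equal all vertices adjacent to every v in V.
            closure = {u for u in left if common <= adj[u]}
            if closure == set(subset):
                found.add((frozenset(subset), frozenset(common)))
    return found


toy_adj = {
    "u1": {"v1", "v2"},
    "u2": {"v1", "v2", "v3"},
    "u3": {"v3"},
}
result = maximal_bicliques(toy_adj)
# The maximum biclique (by edge count) is the largest of the maximal ones.
best = max(result, key=lambda c: len(c[0]) * len(c[1]))
```

On the toy graph there are three maximal bicliques, and the one with the most edges is ({u1, u2}, {v1, v2}) with four edges. The example also shows why enumeration is wasteful when only the maximum biclique is needed: every maximal biclique is generated even though a single one is kept.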

Diversified top-k search
The diversified top-k search problem, which aims to find top-k results that are not only most relevant to a query but also diversified, has been extensively studied. In the literature, most existing solutions focus on finding diversified top-k results for a specific query. For example, Lin et al. study the k most representative skyline problem [22]. References [1,6] focus on the diversified top-k document retrieval. Reference [62] studies the diversified keyword query recommendation. Reference [12] focuses on the diversified top-k graph pattern matching. Reference [24] studies the problem of top-k shortest paths with diversity. Zhang et al. study the diversified top-l (k, r )-core [58]. Yuan et al. and Wu et al. study the diversified top-k clique search problem [54,57]. Nevertheless, the techniques developed for diversified top-k clique search are not suitable for our diversified top-k biclique search problem. Some other works study the general framework for diversified top-k search. For example, [10,39,40,51] study the general diversified top-k results problem. References [8,34] study top-k result diversification in a dynamic environment. The complexity of query result diversification is analyzed in [9]. Nevertheless, the diversity in the above frameworks is considered based on the pair-wise dissimilarity of the query results, which cannot be applied directly to the diversified top-k biclique search problem studied in this paper.

Conclusion
Maximum biclique search in a bipartite graph is a fundamental problem with a wide spectrum of applications. Existing solutions are not scalable for handling large bipartite graphs because the search has to consider the size of both sides of the biclique. In this paper, instead of solving the problem directly on the original bipartite graph, we propose a progressive bounding framework which aims to solve the problem on several much smaller bipartite graphs. We prove that only a logarithmic number of rounds is needed to guarantee the algorithm correctness, and in each round, we show how to significantly reduce the bipartite graph size by considering the properties of the one-hop and two-hop neighbors for each vertex. Based on the maximum biclique search method, we further propose an efficient algorithm to find the diversified top-k bicliques, which is also desirable in many applications. By taking advantage of the progressive bounding framework, we derive the same subspaces for different results by slightly relaxing the constraints in each subspace, so as to share the computation cost among these results. We further propose two optimizations to accelerate the computation by pruning the search space and lazily refining candidates. We conducted experiments on real datasets from different application domains, and two of the datasets contain billions of edges. The experimental results demonstrate that our approach is efficient and scalable to handle large bipartite graphs. It is reported that 50% improvement on recall can be achieved after applying our method in Alibaba Group to identify the fraudulent transactions.

Funding Open Access funding enabled and organized by CAUL and its Member Institutions.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.