1 Introduction

A bipartite graph is denoted by \(G=(U,V,E)\), where U(G) and V(G) denote the two disjoint vertex sets and \(E(G)\subseteq U\times V\) denotes the edge set. The bipartite graph is a popular data structure that has been widely used to model the relationship between two sets of entities in many real-world applications. For example, in e-commerce, a bipartite graph can model the purchasing relationship between customers and products; in web applications, it can model the visiting relationship between users and websites; and in bioinformatics, it can model the acting relationship between genes and roles in biological processes.

A subgraph C of G is a biclique if it is a complete bipartite subgraph, i.e., for every pair \(u\in U(C)\) and \(v\in V(C)\), we have \((u,v)\in E(C)\). Like a clique in a general graph, a biclique is a fundamental structure in a bipartite graph and has been widely used to capture cohesive bipartite subgraphs in a wide spectrum of bipartite graph applications. Below are several representative examples.

(1) Anomaly detection [4, 7] In e-commerce platforms such as eBay and Alibaba, the behavior of a large group of customers purchasing a set of products together is considered an anomaly, because there is a high probability that the group is making fraudulent transactions to boost the rankings of the businesses selling those products. Such behavior can be modeled as bicliques in a bipartite graph. Similarly, in web services, bicliques can be used to detect a group of web spammers who click on a set of webpages together to promote their rankings.

(2) Gene expression analysis [16, 18, 25, 45, 59] In gene expression data analysis, different genes respond under different conditions. A group of genes that share a number of common responses over multiple conditions is considered a significant gene group.

(3) Social recommendation [23] In social analysis, there may exist a group of users who have the same set of interests, such as swimming, hiking, and fishing. Such groups and interests can be naturally captured by a biclique, which is helpful in social recommendation and advertising.

In practice, we cannot directly enumerate the bicliques of a bipartite graph, as the number of bicliques is prohibitively large in the above applications. In this paper, we investigate the problem of maximum biclique search, i.e., finding the biclique with the largest number of edges, for the following two reasons:

(1) Given the biclique model, it is a very natural problem to find the maximum biclique, which is not only theoretically interesting but also useful in many real-life scenarios. For instance, the maximum biclique may represent the largest suspicious click farm in an e-commerce network, the most significant gene group in a gene-condition bipartite graph, or the user group with the largest potential market value in a social network.

(2) In some scenarios, one may need to enumerate a set of bicliques. For instance, the fraudulent transactions cannot be fully covered by the maximum biclique in an e-commerce network. To reduce the number of output bicliques, we may consider maximal bicliques, i.e., bicliques none of whose supersets is also a biclique. Unfortunately, as shown in our initial empirical study, the number of maximal bicliques is still large (e.g., over \(10^9\) maximal bicliques had been output after 24 hours of running a maximal biclique enumeration algorithm on an e-commerce bipartite graph obtained from Alibaba). Thus, we have to consider the diversified top-k bicliques. Inspired by the well-studied diversified top-k clique search problem (e.g., [57]), we can follow the same procedure by repeatedly removing the current maximum biclique from the bipartite graph k times. Clearly, the efficient computation of the maximum biclique is the key to this problem.

Challenges and motivations Despite its wide range of applications, finding the maximum biclique is an NP-hard problem [38]. In the literature, there are many solutions to a related NP-hard problem: maximum clique search in a general graph [17, 19, 20, 26, 31, 46,47,48,49]. The main idea is to use graph coloring and core decomposition to obtain an upper bound on the maximum clique size, and to use this upper bound to prune vertices that cannot be contained in the maximum clique.

A natural question is: can we use the above graph coloring and core decomposition techniques to search for the maximum biclique in a bipartite graph? Unfortunately, the answer is negative. First, only two colors are needed to color a whole bipartite graph; obviously, we cannot obtain an upper bound on the maximum biclique size using graph coloring. Second, in a large biclique, a vertex may still have a very small degree/core number. For example, suppose the maximum biclique C is a star with \(|U(C)|=1\) and a large |V(C)|; then we only require the degree/core number of each vertex in V(C) to be \(\ge 1\). Consequently, even if a vertex has a small degree/core number, it still cannot be pruned. Therefore, the core decomposition technique also fails in maximum biclique search.

The main reason maximum biclique search is challenging is that the size of a biclique C depends on two factors, |U(C)| and |V(C)|; it is therefore difficult to find a one-dimensional indicator, such as color number, degree, or core number, to prune vertices that cannot participate in the maximum biclique. Due to this challenge, existing solutions [38, 59] can only handle small bipartite graphs and face serious efficiency issues as the bipartite graph scales up. Motivated by this, in this paper, we tackle the above challenges and aim to solve the maximum biclique search problem on bipartite graphs at billion scale.

Furthermore, based on maximum biclique search, we can find the diversified top-k bicliques, which are desired in applications such as fraudulent transaction detection. Instead of computing the top-k bicliques via a maximal biclique enumeration algorithm, which may output an exponential number of bicliques and is impractical on large-scale bipartite graphs, we adopt a simple but effective method that removes the maximum biclique from the bipartite graph k times to obtain the diversified top-k results. However, in this way, we still need to compute the maximum biclique k times independently, which is costly. One may wonder whether we can share the computation cost among the diversified top-k bicliques. This is quite challenging because there is no overlap among the k diversified results.

Our solution Based on the above discussion, existing coloring and core decomposition-based approaches cannot yield effective pruning in maximum biclique search. This paper aims for a new way to solve the problem. Our main idea is as follows: instead of finding upper bounds for pruning, we try to guess a lower bound of \(|U(C^*)|\) as well as a lower bound of \(|V(C^*)|\) for the maximum biclique \(C^*\). If the guess is correct and tight, we can search on a much smaller bipartite graph by eliminating a large number of vertices based on the two lower bounds. However, we cannot guarantee that our guess is always correct. Therefore, instead of guessing only once, we guess multiple times, which results in a list of lower-bound pairs \((\tau ^0_U,\tau ^0_V)\), \((\tau ^1_U, \tau ^1_V)\), \(\ldots \). To gain high pruning power, the list of pairs should satisfy four conditions: (1) \(\tau ^0_U\times \tau ^0_V\) should be as large as possible but not larger than the number of edges in the optimal biclique \(C^*\); (2) the pairs are derived in a progressive manner so that \(\tau ^i_U\times \tau ^i_V \ge \tau ^{i-1}_U\times \tau ^{i-1}_V\) for any \(i>0\); (3) there exists at least one pair \((\tau ^k_U, \tau ^k_V)\) such that \(\tau ^k_U\) and \(\tau ^k_V\) are true lower bounds of \(|U(C^*)|\) and \(|V(C^*)|\), respectively; and (4) the number of pairs should be well bounded.

To make this idea practically applicable, two issues need to be addressed: (1) How to guess the list of lower-bound pairs so that they satisfy the above four conditions; and (2) Given a lower-bound pair, how to eliminate as many vertices as possible while preserving the corresponding maximum biclique to optimize the computational cost.

Following the idea of maximum biclique search, in diversified top-k biclique search we try to share the computation cost among the k results by taking advantage of the derived subspaces with lower-bound pairs. Our main idea is as follows: instead of guessing tight lower bounds only for the maximum biclique, we preserve more results within one list of lower-bound pairs by slightly relaxing the constraints in each pair. By doing this, we can share the computation cost among the preserved results, without computing lower-bound lists and eliminating vertices w.r.t. each lower-bound pair independently for every single result.

Fig. 1 An example of a bipartite graph and its maximum biclique

Contributions In this paper, we answer the above questions and make the following contributions:

  • The first work to practically study maximum biclique search on big real datasets Although the maximum biclique search problem is NP-hard, we aim to design practical solutions for real-world large bipartite graphs with billions of edges. To the best of our knowledge, this is the first work to solve this important problem on real datasets at billion scale.

  • A novel progressive bounding framework We propose a progressive bounding framework to obtain the lower-bound pairs \((\tau _U^i, \tau _V^i)\). We analyze the framework by projecting the problem into a two-dimensional space, and we show that the set of lower-bound pairs forms a skyline in this space and that a logarithmic number of lower-bound pairs suffices to guarantee correctness.

  • Maximum-biclique preserved graph reduction Given a certain pair of lower bounds, we study how to eliminate vertices while preserving the maximum biclique. We investigate the vertex properties and derive pruning rules by exploring the one-hop and two-hop neighbors for each vertex. Based on the pruning rules, we can significantly reduce the size of the bipartite graph.

  • Diversified top-k biclique search with computation sharing We formalize the diversified top-k biclique search as a problem of maximizing the total number of edges covered by the top-k bicliques, which takes both size and diversity into consideration. Instead of computing the k results independently, we propose an efficient algorithm that shares computation among them. Based on the progressive bounding framework, we generate the subspaces by slightly relaxing the lower-bound constraints to preserve more results within one subspace set, such that we can share the computation among the preserved results. Two optimizations are proposed to further accelerate the computation by pruning search spaces and lazily refining candidates.

  • Extensive performance studies on billion-scale bipartite graphs We conduct extensive performance studies using 18 real datasets from different application domains. The experimental results demonstrate the efficiency and scalability of our proposed approaches. Remarkably, on a user-product bipartite graph from Alibaba with over 300 million vertices and over 1.3 billion edges, our approach finds the maximum biclique within 15 min. It is also reported that a 50% improvement in recall is achieved after applying our proposed method at Alibaba Group to identify fraudulent transactions.

Outline The remainder of this paper is organized as follows. Section 2 provides the preliminaries, formally defines the maximum biclique search problem, and shows its hardness. Section 3 introduces the baseline solution based on the branch-and-bound framework. In Sect. 4, we analyze the reasons for the inefficiency of the baseline solution and propose the progressive bounding framework. Section 5 presents the maximum-biclique-preserved graph reduction techniques and their optimizations. In Sect. 6, we study the problem of diversified top-k biclique search and propose an efficient algorithm that shares the computation cost among the k results. In Sect. 7, we evaluate our proposed algorithms using extensive experiments. We review related work in Sect. 8 and conclude the paper in Sect. 9.

This paper extends our previous work [28] with a more comprehensive study. First, we add the introduction and motivation of the diversified top-k biclique search problem. Then, we add the algorithms \({\mathsf {TopKBasic}} \) and \({\mathsf {TopK}} \) to find the diversified top-k bicliques, with two optimizations to further accelerate the computation. Finally, we add more experiments on maximum and diversified top-k biclique search to show the efficiency of the proposed algorithms.

2 Preliminaries

We consider an unweighted and undirected bipartite graph \(G=(U,V,E)\), where U(G) and V(G) denote the two disjoint vertex sets and \(E(G)\subseteq U\times V\) denotes the edge set of G. For each vertex \(u\in U(G)\), we use \(N(u,G)\) to denote the set of neighbors of u in G, i.e., \(N(u,G)=\{v|(u,v)\in E(G)\}\). The degree of a vertex \(u\in U(G)\), denoted as \(d(u,G)\), is the number of neighbors of u in G, i.e., \(d(u,G)=|N(u,G)|\). We use \(d_{\max }^U(G)\) to denote the maximum degree over all vertices in U(G), i.e., \(d_{\max }^U(G)=\max _{u\in U(G)}d(u,G)\). Symmetric definitions apply for each vertex \(v\in V(G)\). The size of a bipartite graph G, denoted as |G|, is defined as the number of edges in G, i.e., \(|G|=|E(G)|\).
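To make the notation concrete, the following minimal Python encoding represents a bipartite graph as two adjacency dictionaries, one per side; the dict-of-sets representation and all identifiers are our own illustration, not prescribed by the paper.

```python
# A toy bipartite graph: adj_u[u] = N(u, G), adj_v[v] = N(v, G).
adj_u = {"u1": {"v1", "v2"}, "u2": {"v2"}}
adj_v = {"v1": {"u1"}, "v2": {"u1", "u2"}}

def degree(x, adj):
    """d(x, G) = |N(x, G)|."""
    return len(adj[x])

# d_max^U(G): maximum degree over all vertices in U(G).
d_max_u = max(len(nbrs) for nbrs in adj_u.values())

# |G| = |E(G)|: each edge is stored once per side, so counting the U side suffices.
num_edges = sum(len(nbrs) for nbrs in adj_u.values())
```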

Definition 1

(Biclique) Given a bipartite graph \(G=(U,V,E)\), a biclique C is a complete bipartite subgraph of G, i.e., for each pair of \(u\in U(C)\) and \(v\in V(C)\), we have \((u,v)\in E(C)\).

In this paper, given a bipartite graph G, we aim to find a biclique \(C^*\) in G with the maximum size. Considering that many real applications (e.g., fraudulent transaction detection) require the number of vertices in each part of the biclique \(C^*\) to be no smaller than a certain threshold, we add size constraints \(\tau _U\) and \(\tau _V\) on \(|U(C^*)|\) and \(|V(C^*)|\) s.t. \(|U(C^*)|\ge \tau _U\) and \(|V(C^*)|\ge \tau _V\). Such size constraints also give users more flexibility to control the size of each side of the biclique or to avoid generating a biclique that is too skewed (e.g., a biclique with a single vertex of the highest degree on one side and all its neighbors on the other side). As a special case, when \(\tau _U=1\) and \(\tau _V=1\), the problem finds the maximum biclique without any constraint. The maximum biclique problem studied in this paper is defined as follows:

Problem statement Given a bipartite graph \(G=(U,V,E)\) and a pair of positive integers \(\tau _U\) and \(\tau _V\), the problem of maximum biclique search aims to find a biclique \(C^*\) in G s.t. \(|U(C^*)|\ge \tau _U\), \(|V(C^*)|\ge \tau _V\), and \(|C^*|\) is maximized. We use \(C^*_{\tau _U, \tau _V}(G)\) to denote such a biclique.

Example 1

Figure 1a shows a bipartite graph G with \(U(G)=\{u_1, u_2, \ldots , u_7\}\) and \(V(G)=\{v_1, v_2, \ldots , v_6\}\). Given thresholds \(\tau _U=1\) and \(\tau _V=1\), the maximum biclique \(C^*_{1,1}(G)=C_1\) is shown in Fig. 1b, where \(U(C_1)=\{u_3, u_4, u_5, u_6\}\) and \(V(C_1)=\{v_2, v_3, v_4, v_5\}\). Given thresholds \(\tau _U=1\) and \(\tau _V=5\), the maximum biclique \(C^*_{1,5}(G)=C_2\) is shown in Fig. 1c, where \(U(C_2)=\{u_3, u_4\}\) and \(V(C_2)=\{v_1, v_2, \ldots , v_6\}\).

NP-hardness and inapproximability As shown in [38], the maximum biclique problem is NP-hard, and as proved in [5] and [30], it is difficult to find a polynomial-time algorithm that solves the maximum biclique problem with a promising approximation ratio. Due to this inapproximability, in this paper, we aim to find the exact maximum biclique and propose several techniques to make our algorithm practical on large real-world bipartite graphs.

3 The baseline solution

In the literature, the state-of-the-art algorithm, proposed in [59], resorts to the branch-and-bound framework, aiming to list all maximal bicliques by pruning non-maximal candidates from the search space. To obtain a reasonable baseline, in this section, we extend the algorithm proposed in [59] and design an algorithm that computes the maximum biclique by adding pruning rules to the branch-and-bound process.

Algorithm 1

Fig. 2 An example of \(\mathsf {MBC}\) searching

The branch-and-bound algorithm We briefly introduce the branch-and-bound algorithm. The algorithm maintains a partial biclique \((U,V,U\times V)\) and recursively adds vertices into V. When V is fixed, U can be simply computed as the set of common neighbors of all vertices in V, i.e.,

$$\begin{aligned} U=\{u| (u,v)\in E(G)\ \forall v\in V\} \end{aligned}$$
(1)

Therefore, we only need to consider V to determine the biclique. Based on this idea, the key to reducing the cost is to prune useless vertices before they are added to V. According to Eq. 1, when V is expanded, U is contracted.

The pseudocode of the algorithm is shown in Algorithm 1. The input of the algorithm includes the bipartite graph G, the thresholds \(\tau _U\) and \(\tau _V\), and an initial biclique C. Here, C is used when a biclique is obtained before invoking the algorithm, or can be set as \(\emptyset \) otherwise. The algorithm initializes \(C^*\) as C (line 1), invokes the \({\mathsf {BranchBound}} \) procedure to update \(C^*\) (line 2), and returns \(C^*\) as the answer (line 3).

The recursive procedure \({\mathsf {BranchBound}} \) has four parameters U, V, \(C_V\), and \(X_V\), initialized as U(G), \(\emptyset \), V(G), and \(\emptyset \), respectively. Here, \((U,V,U\times V)\) defines a partial biclique, \(C_V\) is the set of candidate vertices that can possibly be added to V, and \(X_V\) is the set of vertices that have been used and should be excluded from V. The procedure \({\mathsf {BranchBound}} \) updates \(C^*\) to \((U,V,U\times V)\) if it is larger than the current \(C^*\) and satisfies the threshold constraints (lines 5–6). Then, it iteratively selects a vertex \(v^*\) from \(C_V\) to expand V (lines 7–8).

Then, \(U'\) is updated by selecting the vertices in U that are neighbors of \(v^*\); \(V'\) includes the vertices in V, \(v^*\), and the vertices in \(C_V\) that are neighbors of all vertices in \(U'\); \(C'_V\) includes the vertices in \(C_V\), excluding those in \(V'\) and those with fewer than \(\tau _U\) neighbors in \(U'\); \(X'_V\) includes the vertices in \(X_V\), excluding those with fewer than \(\tau _U\) neighbors in \(U'\) (lines 9–12). The new search branch including \(v^*\) is created only after checking the following pruning conditions (lines 13–14):

(1) \(\tau _U\) pruning The size of \(U'\) should be \(\ge \tau _U\) since U will only be contracted in the branch.

(2) \(\tau _V\) pruning The size of \(V'\) \(\cup \) \(\ C'_V\) should be \(\ge \tau _V\).

(3) Size pruning The value of \(|U'|\) \(\times \) \((|V'|\) \(+\) \(|C'_V|)\) should be \(\ge |C^*|\); otherwise, exploring the current branch cannot result in a larger biclique.

Fig. 3 Drawbacks of \(\mathsf {MBC}\)

(4) Non-maximality pruning This pruning is based on the fact that a maximum biclique must be a maximal biclique. If there is a vertex v in the exclusion set \(X_V\) that is a neighbor of all vertices in \(U'\) (i.e., \(U'\subseteq N(v,G)\)), the resulting biclique cannot be maximal, and thus the branch can be pruned.

After searching the bicliques containing \(v^*\), we add \(v^*\) into \(X_V\) (line 15).
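For concreteness, below is a compact Python sketch of the branch-and-bound procedure reconstructed from the description above. The adjacency dictionaries adj_u and adj_v (mapping each vertex to its neighbor set) and all identifiers are our own, not the paper's code, and the worst-case running time remains exponential.

```python
def mbc(adj_u, adj_v, tau_u, tau_v, init=(set(), set())):
    """Branch-and-bound maximum biclique search (a sketch of Algorithm 1)."""
    best = [init]  # best[0] = (U(C*), V(C*)), seeded with the initial biclique C

    def size(c):
        return len(c[0]) * len(c[1])

    def branch_bound(U, V, C_V, X_V):
        # Lines 5-6: update C* if the current partial biclique is larger and valid.
        if len(U) >= tau_u and len(V) >= tau_v and len(U) * len(V) > size(best[0]):
            best[0] = (set(U), set(V))
        for v_star in list(C_V):                     # lines 7-8: expand V with v*
            C_V.discard(v_star)
            # Lines 9-12: refine U', V', C'_V, and X'_V as described in the text.
            U2 = U & adj_v[v_star]
            V2 = V | {v_star} | {v for v in C_V if U2 <= adj_v[v]}
            C_V2 = {v for v in C_V - V2 if len(adj_v[v] & U2) >= tau_u}
            X_V2 = {v for v in X_V if len(adj_v[v] & U2) >= tau_u}
            # Lines 13-14: branch only if pruning rules (1)-(4) all pass.
            if (len(U2) >= tau_u                                        # (1) tau_U
                    and len(V2) + len(C_V2) >= tau_v                    # (2) tau_V
                    and len(U2) * (len(V2) + len(C_V2)) >= size(best[0])  # (3) size
                    and not any(U2 <= adj_v[x] for x in X_V2)):   # (4) non-maximality
                branch_bound(U2, V2, C_V2, X_V2)
            X_V.add(v_star)                          # line 15: exclude v* afterwards

    branch_bound(set(adj_u), set(), set(adj_v), set())
    return best[0]
```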

Example 2

Given the bipartite graph G in Fig. 1a and thresholds \(\tau _U=1\) and \(\tau _V=1\), we show the search tree of \(\mathsf {MBC}\) in Fig. 2a. The vertices in V are processed in non-descending order of degree [59], and each tree node represents the \(v^*\) selected in the branch. We illustrate the details of the search branch from \(v_5\) in Fig. 2b. At first, we have \(X_V=\{v_6,v_1\}\), \(C_V=\{v_5,v_2,v_4,v_3\}\), \(U(C^*)=\{u_3,u_4\}\), and \(V(C^*)=\{v_1,v_2,v_3,v_4,v_5,v_6\}\). In step (1), we select \(v^*=v_5\) and refine \(U'=\{u_2,u_3,u_4,u_5,u_6\}\). \(V'\) consists of the vertices in \(C_V\) that connect to all vertices in \(U'\), i.e., \(V'=\{v_2,v_3,v_5\}\). Then, we refine \(C'_V=\{v_4\}\) and \(X'_V=\{v_1,v_6\}\). By now, we update \(U(C^*)=U'\), \(V(C^*)=V'\), and \(|C^*|=15\). In step (2), we further select \(v^*=v_4\), refine the corresponding sets in a similar way as shown in Fig. 2, and update \(|C^*|=16\).

4 A progressive bounding method

In this section, we first analyze the reason for the large search space of the baseline solution, and then introduce our approach using search space partitioning based on a progressive bounding framework to significantly reduce the computational cost.

4.1 Problem analysis

Why costly? Although four pruning conditions are used to reduce the search space for maximum biclique search in Algorithm 1, it still results in a huge search space on real large bipartite graphs due to the following two drawbacks:

  • Drawback 1: loose pruning bounds Most pruning conditions in Algorithm 1 rely on \(\tau _U\) and \(\tau _V\). However, \(\tau _U\) and \(\tau _V\) are user-given parameters, which can be small; the pruning power of \(\tau _U\) and \(\tau _V\) can thus be rather limited. For size pruning, the constraint \(|U'|\times (|V'|+|C'_V|)\ge |C^*|\) can be very loose, because \(C'_V\) is filtered using \(\tau _U\) and thus \(|C'_V|\) can be large when \(\tau _U\) is small.

  • Drawback 2: large candidate size The size of a biclique C, calculated as \(|U(C)|\times |V(C)|\), depends on two factors: |U(C)| and |V(C)|. It is possible that the optimal solution \(C^*\) is unbalanced, i.e., it has either a large \(|U(C^*)|\) and a small \(|V(C^*)|\) or a small \(|U(C^*)|\) and a large \(|V(C^*)|\). Therefore, during the branch-and-bound process, even if the degrees of all candidates in \(C_V\) are small (so that |U| is small), we cannot stop branching when \(V\cup C_V\) is large, because a large biclique may still be generated in this situation. Similarly, we cannot remove a vertex from U merely because its degree is small. This can result in a huge search space on a large bipartite graph.

Example 3

Figure 3 shows a bipartite graph G with \(U=\{u_1,u_2,...,u_{100}\}\) and \(V=\{v_1,v_2,...,v_{100}\}\). Specifically, \(u_1\) connects to all vertices in V and \(v_1\) connects to all vertices in U. Given \(\tau _U=1\) and \(\tau _V=1\), the size of the maximum biclique \(C^*\) is 100. By adopting \(\mathsf {MBC}\), we first select \(v_1\) into \(V'\). As \(v_1\) connects to all vertices in U, \(U'=\{u_1,u_2,...,u_{100}\}\). Furthermore, as \(u_1\) connects to all vertices in V, \(C'_V=\{v_2,v_3,...,v_{100}\}\). However, with \(\tau _U=1\) and \(\tau _V=1\), we cannot prune any vertices, nor can we prune search branches with the size constraint, since \(|U'|\times (|V'|+|C'_V|)\) is larger than \(|C^*|\). Moreover, we cannot prune candidate vertices in \(C'_V\), even though their degrees are 1, which leads to a large candidate size and a huge search space.

Our idea Based on the above analysis and to significantly improve the algorithm, we consider two aspects:

  • To resolve drawback 1, we need to improve the pruning bounds to achieve the stop conditions in early stages of the branch-and-bound process;

  • To resolve drawback 2, we need to remove as many vertices as possible from the graph to reduce the number of candidates that may participate in the optimal solution.

Our idea is as follows: instead of using the thresholds \(\tau _U\) and \(\tau _V\) for pruning, we enforce two new thresholds \(\tau ^*_U\) and \(\tau ^*_V\) for \(U(C^*)\) and \(V(C^*)\), respectively, with \(\tau ^*_U\ge \tau _U\) and \(\tau ^*_V\ge \tau _V\). To tighten the bounds, we try to make \(\tau ^*_U\times \tau ^*_V\) as large as possible while ensuring that \(\tau ^*_U\times \tau ^*_V\) is no larger than the size of the optimal solution. With \(\tau ^*_U\) and \(\tau ^*_V\), we are able to obtain a smaller bipartite graph \(G^*\) by removing as many vertices as possible that cannot participate in the maximum biclique. On the smaller graph \(G^*\) with tighter bounds \(\tau ^*_U\) and \(\tau ^*_V\), the algorithm will be much more efficient. Suppose \(C^*\) is the optimal solution; if we can guarantee that \(\tau ^*_U\le |U(C^*)|\) and \(\tau ^*_V\le |V(C^*)|\), then the algorithm on graph \(G^*\) with thresholds \(\tau ^*_U\) and \(\tau ^*_V\) will output the optimal solution.

However, to make our idea practically applicable, the following two issues need to be addressed:

  • First, we do not know the size of the maximum biclique \(C^*\) before the search.

  • Second, it is difficult to find a single pair \(\tau ^*_U\) and \(\tau ^*_V\) to guarantee that \(\tau ^*_U\le |U(C^*)|\) and \(\tau ^*_V\le |V(C^*)|\).

In the following, we will introduce a progressive bounding framework to resolve the two issues.

4.2 The progressive bounding framework

We propose a progressive bounding framework to address the two issues raised as follows:

  • To address the first issue, instead of using the size of the optimal solution \(|C^*|\), we use a lower bound \(lb(C^*)\) of \(|C^*|\), i.e., \(lb(C^*)\le |C^*|\). The lower bound can be quickly initialized and will be updated progressively to make the thresholds \(\tau ^*_U\) and \(\tau ^*_V\) tighter.

  • To address the second issue, instead of using a single pair \(\tau ^*_U\) and \(\tau ^*_V\), we use multiple pairs \((\tau ^1_U,\tau ^1_V)\), \((\tau ^2_U,\tau ^2_V)\), \(\ldots \), \((\tau ^k_U,\tau ^k_V)\). We will guarantee that for any possible biclique C with \(|U(C)|\times |V(C)| \ge lb(C^*)\), there exists a pair \((\tau ^i_U,\tau ^i_V)\) for \(1\le i \le k\) s.t. \(\tau ^i_U\le |U(C)|\) and \(\tau ^i_V\le |V(C)|\). Then, for each \((\tau ^i_U,\tau ^i_V)\) for \(1\le i \le k\), we compute a biclique \(C^*_i\) with maximum size s.t. \(|U(C^*_i)|\ge \tau ^i_U\) and \(|V(C^*_i)|\ge \tau ^i_V\). Among the computed bicliques, the biclique with the maximum size is the answer for the original problem.

Algorithm 2

The algorithm framework The progressive bounding framework is shown in Algorithm 2. For any valid biclique C with \(|U(C)|\ge \tau _U\) and \(|V(C)|\ge \tau _V\), |C| is a lower bound of the size of the optimal solution \(C^*\). Based on this, we first use \({\mathsf {InitMBC}} \) to obtain an initial biclique, denoted as \(C^*_0\), s.t. \(|C^*_0|\le |C^*|\) (line 1). Then, we set \(\tau _V^0\) to an upper bound of |V(C)| for any possible biclique C; a natural upper bound is the maximum degree over all vertices in U(G), i.e., \(d_{\max }^U(G)\) (line 2). k denotes the number of iterations and is initialized as 0 (line 3). The progressive bounding framework finishes in a logarithmic number of iterations. Each iteration generates a pair \(\tau _U^{k+1}\) and \(\tau _V^{k+1}\) based on the value of \(\tau _V^k\) and the lower bound \(|C^*_k|\) of the optimal solution. When \(\tau _V^{k+1}\) (\(\tau _U^{k+1}\) resp.) is smaller than \(\tau _V\) (\(\tau _U\) resp.), it is set to \(\tau _V\) (\(\tau _U\) resp.) (lines 5–6). We will analyze the rationale later. With \(\tau _U^{k+1}\) and \(\tau _V^{k+1}\), we obtain a graph \(G_{k+1}\) that is much smaller than G using the procedure \({\mathsf {Reduce}} (G, \tau _U^{k+1}, \tau _V^{k+1})\), such that the maximum biclique w.r.t. thresholds \(\tau _U^{k+1}\) and \(\tau _V^{k+1}\) is preserved in \(G_{k+1}\) (line 7). After this, we find the maximum biclique w.r.t. \(\tau _U^{k+1}\) and \(\tau _V^{k+1}\) on \(G_{k+1}\), with \(C^*_k\) as the initial biclique in \(\mathsf {MBC}\) (line 8).
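A minimal sketch of this framework is given below, with the threshold recurrence taken from the proof of Theorem 1 and Example 4. Here init_mbc stands for the greedy \({\mathsf {InitMBC}} \) sketched at the end of this section, reduce_graph for the \({\mathsf {Reduce}} \) procedure of Sect. 5, and mbc for the branch-and-bound search of Sect. 3; all names are ours.

```python
def mbc_star(adj_u, adj_v, tau_u, tau_v):
    """Progressive bounding framework (a sketch of Algorithm 2)."""
    best = init_mbc(adj_u, adj_v, tau_u, tau_v)        # C*_0: greedy lower bound
    tv = max(len(nbrs) for nbrs in adj_u.values())     # tau_V^0 = d_max^U(G)
    while tv >= tau_v:
        lb = len(best[0]) * len(best[1])               # |C*_k|, a lower bound
        tu_next = max(lb // tv, tau_u)                 # tau_U^{k+1}
        tv_next = max(tv // 2, tau_v)                  # tau_V^{k+1}
        # Reduce works on copies so the original graph survives the iteration.
        gu = {u: set(n) for u, n in adj_u.items()}
        gv = {v: set(n) for v, n in adj_v.items()}
        reduce_graph(gu, gv, tu_next, tv_next)         # MBC-preserved G_{k+1}
        best = mbc(gu, gv, tu_next, tv_next, init=best)
        if tv_next == tau_v:       # the pair clamped to tau_V has been searched
            break
        tv = tv_next
    return best
```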

The rationale Next, we address the rationale of the progressive bounding framework. Note that the size of a biclique C is determined by |U(C)| and |V(C)|. Therefore, to analyze the problem, we define a two-dimensional space as follows:

Definition 2

(Search Space \({\mathcal {S}}(G)\)) Given a bipartite graph G, a two-dimensional space \({\mathcal {S}}(G)\) has two axes |U| and |V|. Given any biclique C in G, we can represent it as a two-dimensional point (|U(C)|, |V(C)|) in the space \({\mathcal {S}}(G)\).

Given the search space \({\mathcal {S}}(G)\), the i-th search in lines 7–8 of Algorithm 2 can be regarded as covering the subspace \(([\tau ^i_U,+\infty ),[\tau ^i_V,+\infty ))\) in \({\mathcal {S}}(G)\). To show that the search preserves the optimal solution, we define the optimal curve in \({\mathcal {S}}(G)\):

Definition 3

(Optimal Curve) Given a bipartite graph G and parameters \(\tau _U\) and \(\tau _V\), suppose \(C^*\) is the maximum biclique w.r.t. \(\tau _U\) and \(\tau _V\), we call the curve \(|U|\times |V|=|C^*|\) the optimal curve in the two-dimensional space \({\mathcal {S}}(G)\).

Note that the optimal curve is unknown before the search. However, it can be used to analyze the correctness of the progressive bounding framework as follows.

Theorem 1

(Algorithm Correctness) Given a bipartite graph G and parameters \(\tau _U\) and \(\tau _V\), for any point \((s_U, s_V)\) on the optimal curve with \(s_U\in [\tau _U,d_{\max }^V(G)]\) and \(s_V\in [\tau _V, d_{\max }^U(G)]\), there exists a certain \((\tau ^i_U, \tau ^i_V)\) generated by Algorithm 2 s.t. \((s_U,s_V)\in ([\tau ^i_U,+\infty ),[\tau ^i_V,+\infty ))\).

Proof Sketch: In Algorithm 2, \(\tau _V^0\) is set to be \(d_{\max }^U(G)\), and when k increases, \(\tau _V^k\) will be iteratively divided by 2 until it is smaller than \(\tau _V\). Therefore, we can always find a certain \(i>0\) s.t.

$$\begin{aligned} \tau _V^{i} \le s_V \le \tau _V^{i-1} \end{aligned}$$

Based on Algorithm 2, we have \(\tau _U^{i}=\max (\biggl \lfloor \frac{|C^*_{i-1}|}{\tau _V^{i-1}} \biggr \rfloor , \tau _U)\). We consider two cases:

  • Case 1: \(\tau _U^{i}=\tau _U\). In this case, we have:

    $$\begin{aligned} s_U\ge \tau _U = \tau _U^{i} \end{aligned}$$

    Therefore, \((s_U,s_V)\in ([\tau ^i_U,+\infty ),[\tau ^i_V,+\infty ))\) holds.

  • Case 2: \(\tau _U^{i}=\biggl \lfloor \frac{|C^*_{i-1}|}{\tau _V^{i-1}}\biggr \rfloor \). Note that \(|C^*_{i-1}|\) is a lower bound of the optimal value \(|C^*|\) i.e.,

    $$\begin{aligned} |C^*_{i-1}|\le |C^*| \end{aligned}$$

    Since \((s_U,s_V)\) is a point on the optimal curve, we have

    $$\begin{aligned} s_U\times s_V = |C^*| \end{aligned}$$

    Consequently, we can derive the following inequalities:

    $$\begin{aligned} \begin{aligned} \tau _U^{i}= ~&\biggl \lfloor \frac{|C^*_{i-1}|}{\tau _V^{i-1}} \biggr \rfloor \le \biggl \lfloor \frac{|C^*|}{\tau _V^{i-1}} \biggr \rfloor \\ \le ~&\biggl \lfloor \frac{|C^*|}{s_V} \biggr \rfloor = \lfloor s_U \rfloor \le s_U \end{aligned} \end{aligned}$$

    Therefore, \((s_U,s_V)\in ([\tau ^i_U,+\infty ),[\tau ^i_V,+\infty ))\) holds.

According to the analysis above, Theorem 1 holds. \(\square \)

Theorem 1 shows that all the points in the optimal curve within the range \(([\tau _U,\) \(d_{\max }^V(G)],\) \([\tau _V,\) \(d_{\max }^U(G)])\) are covered by the search spaces in Algorithm 2. Note that for any biclique C in G, we can guarantee that \(|U(C)|\le d_{\max }^V(G)\) and \(|V(C)|\le d_{\max }^U(G)\). Therefore, Algorithm 2 obtains the optimal solution.

Fig. 4 Illustration of algorithm rationale

The rationale of the progressive bounding framework is shown in Fig. 4. Here, we draw the two-dimensional space \({\mathcal {S}}(G)\) and show the search spaces of the first three iterations of Algorithm 2 on \({{\mathcal {S}}}(G)\). We generate three search spaces using \((\tau ^1_U,\tau ^1_V)\), \((\tau ^2_U,\tau ^2_V)\), and \((\tau ^3_U,\tau ^3_V)\), which yield the bicliques \(C^*_1\), \(C^*_2\), and \(C^*_3\), respectively; the three spaces are differentiated using red, green, and blue. As shown in Fig. 4, as i increases, the curve \(|U|\times |V|=|C^*_i|\) progressively approaches the optimal curve \(|U|\times |V|=|C^*|\), and the optimal curve in \({{\mathcal {S}}}(G)\) for \(|V|\ge \tau _V^3\) is totally covered by the three search spaces. This illustrates the correctness of the progressive bounding framework.

Example 4

Given the bipartite graph G in Fig. 1a and thresholds \(\tau _U=1\) and \(\tau _V=1\), we adopt Algorithm 2 to find the maximum biclique. Suppose we initialize the biclique \(C^*_0\) as shown in Fig. 1c, so that \(|C^*_0|=12\) and \(\tau _V^0=6\). Then, we search for the optimal solution progressively:

  1. (1)

    \(\tau _U^1=2\), \(\tau _V^1=3\). We adopt \({\mathsf {Reduce}} \) to filter vertices in G, e.g., we filter \(u_7\) as \(d(u_7,G)=2\) and it cannot be involved in a biclique with \(\tau _V^1=3\). We will explain \({\mathsf {Reduce}} \) in detail later. We search for \(C^*_1\) on \(G_1\), and get \(U(C^*_1)=\{u_3,u_4,u_5,u_6\}\), \(V(C^*_1)=\{v_2,v_3,v_4,v_5\}\). Thus \(|C^*_1|=16\).

  2. (2)

    \(\tau _U^2=5\), \(\tau _V^2=1\). Since we cannot find any larger biclique on the reduced graph \(G_2\), \(|C^*_2|=16\). As shown above, we progressively use multiple stricter threshold pairs \((\tau ^k_U, \tau ^k_V)\) to approach the optimal solution.

The effectiveness of the progressive bounding framework is further verified in our experiments. For example, Table 2 shows that the graph compression ratio in the bounding iterations varies from 0% (omitted in the table) to 2.05%. This significantly reduces the search space and computation cost of the maximum biclique search procedure.

To realize the algorithm framework \({\mathsf {MBC}} ^*\) in Algorithm 2, we still need to solve the following two components:

  • The initial biclique computation algorithm \({\mathsf {InitMBC}} \). We use a greedy strategy to obtain the initial biclique. Specifically, we initialize an empty biclique and iteratively add the vertex that maximizes the size of the current biclique until no vertex can be added. The largest biclique seen during this process is returned (see the sketch after this list).

  • The graph reduction algorithm \({\mathsf {Reduce}} \). We will discuss the details of \({\mathsf {Reduce}} \) in the next section.
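The greedy \({\mathsf {InitMBC}} \) above admits a simple realization; below is a plausible Python sketch under the reading that vertices are greedily added to V while U is kept as the common neighborhood of V (Eq. 1). The function name, representation, and tie-breaking are our own, not the paper's exact implementation.

```python
def init_mbc(adj_u, adj_v, tau_u, tau_v):
    """Greedy initial biclique (a sketch of InitMBC)."""
    U, V = set(adj_u), set()
    best = (set(), set())
    cand = set(adj_v)
    while cand:
        # Adding v shrinks U to U & N(v); with |V|+1 fixed across candidates,
        # maximizing the new |U| maximizes the size of the resulting biclique.
        v = max(cand, key=lambda x: len(U & adj_v[x]))
        cand.discard(v)
        if not U & adj_v[v]:               # no vertex can be added any more
            break
        U &= adj_v[v]
        V.add(v)
        if (len(U) >= tau_u and len(V) >= tau_v
                and len(U) * len(V) > len(best[0]) * len(best[1])):
            best = (set(U), set(V))        # keep the largest biclique seen
    return best
```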

5 MBC-preserved graph reduction

As shown in Algorithm 2, one of the most important procedures is reducing the size of the bipartite graph for given \(\tau ^i_U\) and \(\tau ^i_V\) while preserving the maximum biclique. In this section, we show how to reduce the bipartite graph size by exploring properties of the one-hop and two-hop neighbors of each vertex. We first introduce the MBC-preserved graph below.

Definition 4

(MBC-Preserved Graph) Given a bipartite graph G and thresholds \(\tau ^i_U\) and \(\tau ^i_V\), a bipartite graph \(G'\) is called an MBC-preserved graph w.r.t. \(\tau ^i_U\) and \(\tau ^i_V\) if \(U(G')\subseteq U(G)\), \(V(G')\subseteq V(G)\), \(E(G')\subseteq E(G)\), and \(|C^*_{\tau ^i_U,\tau ^i_V}(G')|=|C^*_{\tau ^i_U,\tau ^i_V}(G)|\). In other words, the maximum biclique of G is preserved in \(G'\). We use \(G'\sqsubseteq _{\tau ^i_U, \tau ^i_V} G\) to denote that \(G'\) is an MBC-preserved graph of G.

We can easily derive the following lemma:

Lemma 1

(Transitive Property) If \(G_1\sqsubseteq _{\tau ^i_U,\tau ^i_V} G_2\) and \(G_2\sqsubseteq _{\tau ^i_U,\tau ^i_V} G_3\), we have \(G_1\sqsubseteq _{\tau ^i_U,\tau ^i_V} G_3\).

5.1 One-hop graph reduction

To reduce the size of the bipartite graph, we first consider a simple case by exploring the one-hop neighbors of each vertex; specifically, we use the number of neighbors to reduce the bipartite graph. We eliminate a vertex u by removing u and all its adjacent edges from G, denoted as \(G\ominus u\). We derive the following lemma:

Lemma 2

Given a bipartite graph G, thresholds \(\tau ^i_U\) and \(\tau ^i_V\), we have:

  1. (1)

    \(\forall u\in U(G)\): \(d(u,G)<\tau ^i_V \implies G\ominus u \sqsubseteq _{\tau ^i_U,\tau ^i_V}G\);

  2. (2)

    \(\forall v\in V(G)\): \(d(v,G)<\tau ^i_U \implies G\ominus v \sqsubseteq _{\tau ^i_U,\tau ^i_V}G\).

Proof Sketch: We only prove (1), and (2) can be proved similarly. Given a certain vertex \(u\in U(G)\) with \(d(u,G)<\tau ^i_V\), we need to prove that for any biclique C in G with \(|U(C)|\ge \tau ^i_U\) and \(|V(C)|\ge \tau ^i_V\), C is also a biclique in \(G\ominus u\). That is, we only need to prove \(u\notin U(C)\). We prove \(u\notin U(C)\) by contradiction. Suppose \(u\in U(C)\); since C is a biclique with \(|V(C)|\ge \tau ^i_V\), u has at least \(\tau ^i_V\) neighbors in G, i.e., \(d(u,G)\ge \tau ^i_V\). This contradicts the fact that \(d(u,G)< \tau ^i_V\). Therefore, the lemma holds. \(\square \)

Lemma 2 provides a sufficient condition for a vertex to be eliminated s.t. the maximum biclique is preserved. Based on Lemma 1, Lemma 2 can be applied iteratively to reduce the graph size until no vertices can be eliminated.

Algorithm 3

The one-hop graph reduction is shown in Algorithm 3. Given a bipartite graph G and thresholds \(\tau ^i_U\) and \(\tau ^i_V\), the algorithm aims to compute a bipartite graph \(G_i\) s.t. \(G_i\sqsubseteq _{\tau ^i_U,\tau ^i_V}G\) by applying the one-hop reduction rule in Lemma 2. We first initialize \(G_i\) to be G (line 1), and then we iteratively remove vertices from \(G_i\) that satisfy either case (1) (lines 4–5) or case (2) (lines 6–7) of Lemma 2. The algorithm terminates when no such vertices can be found in \(G_i\). The following lemma shows the time complexity of Algorithm 3.

Lemma 3

Algorithm 3 requires O(|G|) time.

Proof Sketch: To implement Algorithm 3 efficiently, we can use a queue Q to maintain the set of vertices satisfying Lemma 2. Each vertex is pushed into and popped from the queue Q at most once. For each vertex v, after removing it from \(G_i\), we need to maintain the degrees of its neighbors and push into Q those neighbors that can now be eliminated by Lemma 2 due to their decreased degrees. This requires \(O(d(v,G))\) time. Therefore, the overall time complexity of Algorithm 3 is \(O(\sum _{u\in U(G)} d(u,G) + \sum _{v\in V(G)} d(v,G))=O(|G|)\). \(\square \)
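The queue-based implementation described in the proof can be sketched as follows, again over our adjacency-dictionary representation (modified in place):

```python
from collections import deque

def reduce_1hop(adj_u, adj_v, tau_u, tau_v):
    """One-hop reduction (a sketch of Algorithm 3); runs in O(|G|)."""
    # Seed the queue with all vertices that already violate Lemma 2.
    Q = deque((u, True) for u in adj_u if len(adj_u[u]) < tau_v)
    Q.extend((v, False) for v in adj_v if len(adj_v[v]) < tau_u)
    queued = set(Q)                                # each vertex enters Q at most once
    while Q:
        x, on_u_side = Q.popleft()
        mine, theirs = (adj_u, adj_v) if on_u_side else (adj_v, adj_u)
        nb_limit = tau_u if on_u_side else tau_v   # threshold for x's neighbors
        for y in mine.pop(x):                      # eliminate x and its edges
            theirs[y].discard(x)
            key = (y, not on_u_side)
            if len(theirs[y]) < nb_limit and key not in queued:
                queued.add(key)
                Q.append(key)
```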

5.2 Two-hop graph reduction

Next, we explore the two-hop neighbors to further reduce the size of the bipartite graph. For each vertex u, suppose \(u'\) is a two-hop neighbor of u, i.e., \(N(u',G)\cap N(u,G) \ne \emptyset \). To eliminate u by fully using the information within the two-hop neighborhood, instead of only considering the degree of \(u'\), i.e., \(|N(u',G)|\), we consider the number of common neighbors of u and \(u'\), i.e., \(|N(u',G)\cap N(u,G)|\). To this end, we define the \(\tau \)-neighbor and \(\tau \)-degree as follows:

Definition 5

(\(\tau \)-Neighbor and \(\tau \)-degree) Given a bipartite graph G and a parameter \(\tau \), for any \(u\in U(G)\) and \(u'\in U(G)\), \(u'\) is a \(\tau \)-neighbor of u iff

$$\begin{aligned} |N(u', G)\cap N(u, G)|\ge \tau \end{aligned}$$

For any \(u\in U(G)\), the set of \(\tau \)-neighbors of u is defined as \(N_{\tau }(u,G)\), i.e.,

$$\begin{aligned} N_{\tau }(u,G) = \{u' \ | \ |N(u', G)\cap N(u, G)|\ge \tau \} \end{aligned}$$

and the \(\tau \)-degree of u is defined as the number of vertices in \(N_{\tau }(u,G)\), i.e.,

$$\begin{aligned} d_{\tau }(u,G) = |N_{\tau }(u,G)| \end{aligned}$$

Similarly, we can define the \(\tau \)-neighbor set \(N_{\tau }(v,G)\) and the \(\tau \)-degree \(d_{\tau }(v,G)\) for any \(v\in V(G)\).

Obviously, the \(\tau \)-neighbors of any vertex u form a subset of the union of u itself and the two-hop neighbors of u. For example, in Fig. 5b, when \(\tau =4\), \(N_{\tau }(v_1,G')=\{v_1,v_2,v_3\}\), because both \(v_2\) and \(v_3\) share \(\ge 4\) common neighbors with \(v_1\).
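Computed directly from Definition 5, the \(\tau \)-degree of a vertex can be obtained by one scan of its two-hop neighborhood; a small sketch (the helper name is ours):

```python
def tau_degree(x, adj_a, adj_b, tau):
    """d_tau(x, G) per Definition 5: the number of vertices x' (including x
    itself) with |N(x', G) & N(x, G)| >= tau."""
    common = {}
    for y in adj_a[x]:              # one-hop neighbors of x
        for x2 in adj_b[y]:         # two-hop neighbors of x (and x itself)
            common[x2] = common.get(x2, 0) + 1
    return sum(1 for cnt in common.values() if cnt >= tau)
```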

The following lemma shows how to use the \(\tau \)-neighbor of a vertex to eliminate the vertex with the given thresholds.

Lemma 4

Given a bipartite graph G, thresholds \(\tau ^i_U\) and \(\tau ^i_V\), we have:

  1. (1)

    \(\forall u\in U(G): d_{\tau ^i_V}(u,G)<\tau ^i_U\implies G\ominus u \sqsubseteq _{\tau ^i_U,\tau ^i_V}G\);

  2. (2)

    \(\forall v\in V(G): d_{\tau ^i_U}(v,G)<\tau ^i_V\implies G\ominus v \sqsubseteq _{\tau ^i_U,\tau ^i_V}G\).

Proof Sketch: We only prove (1), and (2) can be proved similarly. Given a certain vertex \(u\in U(G)\) with \(d_{\tau ^i_V}(u,G)<\tau ^i_U\), we need to prove that for any biclique C in G with \(|U(C)|\ge \tau ^i_U\) and \(|V(C)|\ge \tau ^i_V\), C is also a biclique in \(G\ominus u\). That is, we only need to prove \(u\notin U(C)\). Next, we prove \(u\notin U(C)\) by contradiction. Suppose \(u\in U(C)\), since C is a biclique with \(|U(C)|\ge \tau ^i_U\) and \(|V(C)|\ge \tau ^i_V\), for each \(u'\in U(C)\), we have:

$$\begin{aligned} |N(u,C)\cap N(u',C)| = |V(C)| \ge \tau ^i_V \end{aligned}$$

In other words, \(u'\) is a \(\tau ^i_V\)-neighbor of u in C, i.e., \(u'\in N_{\tau ^i_V}(u,C)\). Therefore,

$$\begin{aligned} |N_{\tau ^i_V}(u,C)|=|U(C)|\ge \tau ^i_U \end{aligned}$$

Consequently, we can derive:

$$\begin{aligned} \begin{aligned} d_{\tau ^i_V}(u,G)&=|N_{\tau ^i_V}(u,G)| \ge |N_{\tau ^i_V}(u,C)| \ge \tau ^i_U \end{aligned} \end{aligned}$$
Fig. 5 An example of graph reduction with \(\tau _U=4\), \(\tau _V=4\)

This contradicts the assumption that \(d_{\tau ^i_V}(u,G)<\tau ^i_U\). As a result, the lemma holds. \(\square \)

Algorithm 4

Based on Lemma 4 and the transitive property shown in Lemma 1, we are ready to design the two-hop graph reduction algorithm. The pseudocode is shown in Algorithm 4. Since Lemma 4 can be applied to vertices in both U(G) and V(G), the algorithm reduces the bipartite graph G twice; each time, the vertices on one side are reduced using the procedure \({\mathsf {Reduce2H}} \) (lines 1–4).

In the \({\mathsf {Reduce2H}} \) procedure (lines 5–18), we visit each vertex \(u\in U\) to check whether u can be eliminated using Lemma 4 (line 6). We use S to maintain the set of two-hop neighbors of u along with the number of common neighbors shared with each of them. Specifically, for each two-hop neighbor \(u'\) of u, we create a unique entry \(o=(u',cnt)\) in S, where o.cnt denotes the number of common neighbors of u and \(u'\). In the algorithm, we first visit each neighbor \(v\in N(u,G_i)\) (line 8) and then each neighbor \(u'\in N(v,G_i)\) to obtain every two-hop neighbor \(u'\) (line 9). If the entry for \(u'\) does not exist in S, we add \(u'\) to S with \(cnt=1\) (lines 10–11); otherwise, we obtain the entry o for \(u'\) and increase o.cnt by 1 (lines 13–14). After processing all two-hop neighbors of u, we maintain a counter c to count the number of \(\tau _V^i\)-neighbors of u (line 15). Obviously, \(c=d_{\tau _V^i}(u,G)\). Therefore, if \(c<\tau _U^i\), we can eliminate u from \(G_i\) according to Lemma 4 (lines 16–17).
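Using the tau_degree helper sketched after Definition 5, the per-side pass of the two-hop reduction can be written compactly; a sketch of our own, called with (adj_u, adj_v, \(\tau ^i_U\), \(\tau ^i_V\)) for side U(G) and with the arguments swapped for side V(G):

```python
def reduce_2hop_side(adj_a, adj_b, tau_count, tau_common):
    """One call of Reduce2H (a sketch): eliminate every vertex a with
    d_{tau_common}(a, G) < tau_count, per Lemma 4."""
    for a in list(adj_a):
        if tau_degree(a, adj_a, adj_b, tau_common) < tau_count:
            for b in adj_a.pop(a):      # remove a and its incident edges
                adj_b[b].discard(a)
```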

Lemma 5

Algorithm 4 requires \(O(\sum _{u\in U(G)} d(u,G)^2 + \sum _{v\in V(G)} d(v,G)^2)\) time.

Proof Sketch: When processing U(G) (line 2), for each \(u\in U(G)\) (line 6) and \(v\in N(u,G)\) (line 8), we need to process all neighbors \(u'\) of v in \(O(d(v,G))\) time. Therefore, the total time complexity of the procedure in line 2 is

$$\begin{aligned} \begin{aligned}&O\biggl (\sum _{u\in U(G)}\sum _{v\in N(u,G)}d(v,G)\biggr )\\&\quad =O\biggl (\sum _{(u,v)\in E(G)}d(v,G)\biggr )\\&\quad =O\biggl (\sum _{v\in V(G)}\sum _{u\in N(v,G)}d(v,G)\biggr )\\&\quad =O\biggl (\sum _{v\in V(G)} d(v,G)^2\biggr ) \end{aligned} \end{aligned}$$

Similarly, the total time complexity of the procedure in line 3 is \(O(\sum _{u\in U(G)} d(u,G)^2)\). Consequently, the overall time complexity of Algorithm 4 is \(O(\sum _{u\in U(G)}\) \(d(u,G)^2\) \(+\) \(\sum _{v\in V(G)}\) \(d(v,G)^2)\). \(\square \)

Optimizations \(\mathsf {Reduce2Hop}\) is more costly than \(\mathsf {Reduce1Hop}\). Therefore, we introduce two heuristics, early pruning and early skipping, to further optimize the two-hop reduction algorithm as follows.

(1) Early pruning In Algorithm 4, there is no specific order in which vertices are processed. However, if we first process the vertices that are more likely to be pruned, their removal may enable more vertex eliminations in later iterations. Based on this, we design a score function such that vertices with small scores are more likely to be pruned. A straightforward score is the vertex degree; however, it only considers the vertices on one side and ignores those on the other side. Therefore, for each vertex u, we sum up the degrees of all of u's neighbors and design the score function as follows:

$$\begin{aligned} {\mathsf {score}} (u)=\sum _{v\in N(u,G)}d(v,G) \end{aligned}$$
(2)

The score function considers both the number of u's neighbors and the degrees of those neighbors, and it is cheap to compute. Given the score function, we simply modify the algorithm to process vertices in non-decreasing order of their scores to improve performance.

(2) Early skipping Next, we identify vertices that cannot be pruned by \(\mathsf {Reduce2Hop}\), without exploring their two-hop neighbors; such vertices can be skipped directly. The following lemma provides a way to do this:

Lemma 6

For any vertices u, \(u'\) and threshold \(\tau \), we have: \(u'\in N_{\tau }(u,G) \iff u\in N_{\tau }(u',G)\)

Proof Sketch: According to Definition 5, \(u'\in N_{\tau }(u,G)\) is equivalent to \(|N(u',G) \cap N(u,G)| \ge \tau \), which is equivalent to \(u\in N_{\tau }(u',G)\). \(\square \)

Based on Lemma 6, for any vertex \(u'\in U(G)\), if there are more than \(\tau ^i_U\) vertices u with \(u'\in N_{\tau ^i_V}(u,G)\), we can guarantee that \(d_{\tau ^i_V}(u', G) \ge \tau ^i_U\), and therefore \(u'\) can be skipped by Lemma 4 without exploring the two-hop neighbors of \(u'\). To realize this idea, for each vertex \(u'\in U(G)\), we use \(u'.c\) to maintain the number of processed vertices u s.t. \(u'\in N_{\tau ^i_V}(u,G)\). When processing u, for each two-hop neighbor \(u'\), if \(u'\in N_{\tau ^i_V}(u,G)\), we increase \(u'.c\) by 1. Later on, when processing \(u'\), we check whether \(u'.c+1 \ge \tau _U^i\) before exploring the two-hop neighbors of \(u'\). If so, we know that \(u'\) cannot be pruned and directly skip \(u'\). Here, we use \(u'.c+1\) to take \(u'\) itself into consideration.
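Both heuristics can be folded into the side-U pass of the two-hop reduction. The sketch below is our own reconstruction: the sort implements the score order of Eq. 2, and credit plays the role of the counter \(u'.c\).

```python
from collections import defaultdict

def reduce_2hop_u_optimized(adj_u, adj_v, tau_u, tau_v):
    """Side-U Reduce2H with early pruning and early skipping (a sketch)."""
    # Early pruning: visit vertices in non-decreasing score order (Eq. 2).
    order = sorted(adj_u, key=lambda u: sum(len(adj_v[v]) for v in adj_u[u]))
    credit = defaultdict(int)          # credit[u'] plays the role of u'.c
    for u in order:
        if credit[u] + 1 >= tau_u:     # early skipping via Lemma 6 (+1 for u itself)
            continue
        common = defaultdict(int)      # the entry map S of Algorithm 4
        for v in adj_u[u]:
            for u2 in adj_v[v]:
                common[u2] += 1
        c = 0
        for u2, cnt in common.items():
            if cnt >= tau_v:           # u2 is a tau_v-neighbor of u ...
                c += 1                 # ... so c accumulates d_{tau_v}(u, G)
                credit[u2] += 1        # ... and u is a tau_v-neighbor of u2
        if c < tau_u:                  # Lemma 4: u cannot join a valid biclique
            for v in adj_u.pop(u):
                adj_v[v].discard(u)
```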

5.3 The overall reduction strategy

Based on the above analysis, we can use either one-hop or two-hop reduction to reduce the size of the bipartite graph G. The following lemma shows that the two-hop reduction rule in Lemma 4 has stronger pruning power than the one-hop reduction rule in Lemma 2.

Lemma 7

Given a bipartite graph G, thresholds \(\tau ^i_U\) and \(\tau ^i_V\), we have:

  1. (1)

    \(\forall u\in U(G): d(u,G)<\tau ^i_V \implies d_{\tau ^i_V}(u,G)<\tau ^i_U\);

  2. (2)

    \(\forall v\in V(G): d(v,G)<\tau ^i_U\implies d_{\tau ^i_U}(v,G)<\tau ^i_V\).

Proof Sketch: We first prove (1). For any \(u\in U(G)\), if \(d(u,G)<\tau ^i_V\), we know that there does not exist a two-hop neighbor \(u'\) of u s.t. \(|N(u',G)\cap N(u,G)|\ge \tau ^i_V\). Therefore, \(d_{\tau ^i_V}(u,G)=0 <\tau ^i_U\). (2) can be proved similarly. \(\square \)

Nevertheless, based on Lemmas 3 and 5, applying one-hop reduction is much more efficient than applying two-hop reduction. Therefore, we design the overall graph reduction strategy as follows:

Reduce Given a bipartite graph G and thresholds \(\tau ^i_U\) and \(\tau ^i_V\), \(\mathsf {Reduce}\) iteratively applies one-hop and two-hop reduction strategies on G for \({\mathsf {MAX\_ITER}} \) rounds where \({\mathsf {MAX\_ITER}} \) is a small constant, and returns the reduced graph \(G_i\). Specifically, in each round, \(\mathsf {Reduce}\) first applies \(\mathsf {Reduce1Hop}\) and then further applies \(\mathsf {Reduce2Hop}\) on the reduced graph.
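Putting the pieces together, \(\mathsf {Reduce}\) is a short loop over the two reductions; a minimal sketch composing the helpers above (names are ours):

```python
def reduce_graph(adj_u, adj_v, tau_u, tau_v, max_iter=2):
    """Overall Reduce (a sketch): alternate one-hop and two-hop reduction."""
    for _ in range(max_iter):                          # MAX_ITER rounds
        reduce_1hop(adj_u, adj_v, tau_u, tau_v)
        reduce_2hop_side(adj_u, adj_v, tau_u, tau_v)   # reduce side U(G)
        reduce_2hop_side(adj_v, adj_u, tau_v, tau_u)   # reduce side V(G)
    return adj_u, adj_v
```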

Example 5

We show the complete graph reduction process in Fig. 5. Given the bipartite graph G in Fig. 1a, thresholds \(\tau _U=4\) and \(\tau _V=4\), and \({\mathsf {MAX\_ITER}} =2\), we first apply \(\mathsf {Reduce1Hop}\) in Fig. 5a. Since \(d(u_7,G)=2<\tau _V\) and \(d(v_6,G)=2<\tau _U\), we prune \(u_7\) and \(v_6\). Then, we apply \(\mathsf {Reduce2Hop}\) in Fig. 5b, with the details shown in Fig. 5d. We traverse the one-hop and two-hop neighbors of \(v_1\) and update the entries in S as shown in steps (1) to (4). For example, in step (1), we traverse \(v_1\)'s neighbor \(u_1\) and two-hop neighbors \(v_1\), \(v_2\), \(v_3\), and \(v_4\), and set \(cnt=1\) for each two-hop neighbor. After visiting all neighbors in step (4), three vertices have \(cnt=4\), i.e., \(c=d_{\tau _U}(v_1,G')=3\). According to Lemma 4, since \(d_{\tau _U}(v_1,G')<\tau _V\), we prune \(v_1\). After that, we further apply \(\mathsf {Reduce1Hop}\) in Fig. 5c and prune vertices \(u_1\) and \(u_2\). By applying \(\mathsf {Reduce}\), we save a huge amount of search space in the biclique search.

6 Diversified top-k biclique search

In some applications, one may need to enumerate a set of bicliques. For example, in click farm detection in e-commerce platforms such as Alibaba Group, the fraudulent transactions cannot be fully covered by the maximum biclique. Instead, we may need to consider maximal bicliques, i.e., bicliques none of whose supersets is also a biclique. However, as the number of maximal bicliques may be exponential in the graph size [11], a possible solution is to compute the top-k results ranked by size, since maximal bicliques of larger size are always more important [23]. However, the top-k results ranked by size are usually highly overlapping, which significantly reduces the effective information of the k results. Motivated by this, we study the problem of diversified top-k biclique search in this section, aiming to find top-k results that are distinctive and informationally rich.

Firstly, we formally define the diversified top-k biclique search problem.

Definition 6

(Coverage \({\mathsf {cov}} ({{\mathcal {D}}})\)) Given a set of bicliques \({{\mathcal {D}}}=\{R_1, R_2, ...\}\) in a bipartite graph G, the coverage of \({{\mathcal {D}}}\), denoted by \({\mathsf {cov}} (\mathcal{D})\), is the set of edges in G covered by the bicliques in \(\mathcal{D}\), i.e., \({\mathsf {cov}} ({{\mathcal {D}}})=\bigcup _{R\in {{\mathcal {D}}}}{E(R)}\).

Problem statement Given a bipartite graph \(G=(U, V, E)\), an integer k, and thresholds of \(\tau _U\) and \(\tau _V\), the problem of diversified top-k biclique search aims to find a set \({{\mathcal {D}}}\), such that (1) each biclique \(R\in {{\mathcal {D}}}\) is a maximal biclique with \(|U(R)|\ge \tau _U\) and \(|V(R)|\ge \tau _V\), (2) \(|{{\mathcal {D}}}| \le k\) and (3) \({\mathsf {cov}} ({{\mathcal {D}}})\) is maximized.
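For concreteness, if each biclique is represented as a pair (U, V) of vertex sets, the coverage of Definition 6 is simply a union of edge sets; a one-function sketch:

```python
def cov(bicliques):
    """cov(D): the set of edges covered by the bicliques in D (Definition 6)."""
    return {(u, v) for (U, V) in bicliques for u in U for v in V}
```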

Fig. 6 Top-2 bicliques in G with \(\tau _U=1\) and \(\tau _V=1\)

Example 6

We show an example of top-2 bicliques in bipartite graph G with \(\tau _U=1\) and \(\tau _V=1\) in Fig. 6. There are three maximal bicliques in G: \(R_1\) with \(U(R_1)=\{u_3, u_4, u_5, u_6\}\) and \(V(R_1)=\{v_2, v_3, v_4, v_5\}\); \(R_2\) with \(U(R_2)=\{u_3, u_4, u_5, u_6, u_7\}\) and \(V(R_2)=\{v_3, v_4, v_5\}\); and \(R_3\) with \(U(R_3)=\{u_1, u_2, u_3\}\) and \(V(R_3)=\{v_1, v_2\}\). The result of top-2 maximal bicliques ranked by size is \(\mathcal{D}_1=\{R_1, R_2\}\), and the result of diversified top-2 bicliques is \({{\mathcal {D}}}_2=\{R_1, R_3\}\). Although \(|R_2|>|R_3|\), it is obvious that \({{\mathcal {D}}}_2\) is more favorable since \(R_2\) is highly overlapping with \(R_1\). In other words, \({\mathsf {cov}} ({{\mathcal {D}}}_2)>{\mathsf {cov}} (\mathcal{D}_1)\).

NP-hardness We show the hardness of the problem by considering the simple case: \(k=1\), \(\tau _U=1\), and \(\tau _V=1\). In this case, the problem becomes the maximum biclique search problem which is NP-hard [38]. Therefore, the diversified top-k biclique search problem is an NP-hard problem.

6.1 Baseline solution

In the literature, the problem of maximal biclique enumeration has been widely studied [15, 29, 35,36,37, 41, 59]. This leads to a straightforward solution for diversified top-k biclique search: first enumerate all the maximal bicliques satisfying the thresholds \(\tau _U\) and \(\tau _V\), and then formulate diversified top-k biclique search as a max k-cover problem. However, on a large-scale bipartite graph, the enumeration is costly and may not terminate. Besides, it is infeasible to keep all the maximal bicliques in memory, due to their exponential number in large bipartite graphs.

Fortunately, by taking advantage of our efficient maximum biclique search method, we can find the diversified top-k results by repeatedly removing the current maximum biclique from the bipartite graph k times, following the framework of the well-studied diversified top-k clique problem [57].

The baseline solution is shown in Algorithm 5. It first initializes the result set \({{\mathcal {D}}}\) as empty (line 1), then greedily computes k bicliques to insert into \({{\mathcal {D}}}\) (lines 2–7), and returns \({{\mathcal {D}}}\) as the top-k results (line 8). Each time, it invokes \({\mathsf {MBC}} ^*\) to compute the maximum biclique \(R_i\) in G satisfying the thresholds \(\tau _U\) and \(\tau _V\) (line 3). If \(R_i\) is empty, no more bicliques satisfying \(\tau _U\) and \(\tau _V\) can be found, and we stop searching (lines 4–5). Otherwise, we update \({{\mathcal {D}}}\) by inserting \(R_i\) (line 6) and then remove the edges of \(R_i\) from G (line 7).

Algorithm 5
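A minimal sketch of this baseline in terms of the mbc_star routine sketched in Sect. 4 (the graph representation is ours; the adjacency dictionaries are modified in place as the edges of each result are removed):

```python
def topk_basic(adj_u, adj_v, tau_u, tau_v, k):
    """TopKBasic (a sketch of Algorithm 5)."""
    D = []
    for _ in range(k):                      # lines 2-7: greedy extraction
        U, V = mbc_star(adj_u, adj_v, tau_u, tau_v)
        if not U or not V:                  # lines 4-5: no valid biclique remains
            break
        D.append((U, V))
        for u in U:                         # line 7: remove E(R_i) from G
            adj_u[u] -= V
        for v in V:
            adj_v[v] -= U
    return D
```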

Time complexity We analyze the time cost of Algorithm 5. The time is mainly spent on the k invocations of \({\mathsf {MBC}} ^*\), each consisting of graph reduction time and maximum biclique search time. In \({\mathsf {MBC}} ^*\), we denote the number of subspaces generated when searching for result \(R_j\) as \(l_j\), where \(l_j\) is bounded by \(\log (d^U_{\max }(G))\). For result \(R_j\), we use \(T_{\mathrm{reduce}}(G)\) to denote the graph reduction time (including one-hop and two-hop graph reduction) and \(T_{\mathrm{search}}(G_{i,j})\) to denote the maximum biclique search time, where \(G_{i,j}\) represents the reduced graph in the i-th subspace for \(R_j\). Here, \(T_{\mathrm{search}}(G)=O(|V(G)|d^V_{\max }(G)\beta )\), where \(\beta \) denotes the number of maximal bicliques in G, as introduced in [59]. Thus, the time cost of Algorithm 5 is \(O(\sum _{j=1}^{k}\sum _{i=1}^{l_j}(T_{\mathrm{reduce}}(G)+T_{\mathrm{search}}(G_{i,j})))\).

Fig. 7 Search top-2 bicliques in G by \({\mathsf {TopKBasic}} \)

Example 7

Given a bipartite graph G in Fig. 7a with thresholds \(\tau _U=1\) and \(\tau _V=1\), we adopt Algorithm 5 to find the diversified top-2 bicliques in G:

  1. (1)

    To find \(R_1\) in G with \({\mathsf {MBC}} ^*\), suppose \(|C_0|=10\) and \(\tau ^0_V=5\); we generate two subspaces as follows:

    1. (i)

      \(\tau ^1_U=\biggl \lfloor \frac{|C_0|}{\tau ^0_V} \biggr \rfloor =2\), \(\tau ^1_V=\biggl \lfloor \frac{\tau ^0_V}{2} \biggr \rfloor =2\). We find the maximum biclique \(C^*_1\) (marked as gray in Fig. 7a), and we update \(|C^*_1|=16\).

    2. (ii)

      \(\tau ^2_U=\biggl \lfloor \frac{|C^*_1|}{\tau ^1_V} \biggr \rfloor =8\), \(\tau ^2_V=\biggl \lfloor \frac{\tau ^1_V}{2} \biggr \rfloor =1\). We cannot find larger bicliques.

    Thus, we obtain \(R_1=C^*_1\) as shown in Fig. 7a. Then, we remove all edges in \(R_1\) from G and get \(G'\) as shown in Fig. 7b. (Here, we omit the vertices with no edges.)

  2. (2)

    To find \(R_2\) in \(G'\) with \({\mathsf {MBC}} ^*\), suppose \(|C'_0|=6\) and \({\tau ^0_V}'=4\); we generate two subspaces as follows:

    1. (i)

      \({\tau ^1_U}'=\biggl \lfloor \frac{|{C_0}'|}{{\tau ^0_V}'} \biggr \rfloor =1\), \({\tau ^1_V}'=\biggl \lfloor \frac{{\tau ^0_V}'}{2} \biggr \rfloor =2\). We find \({C^*_1}'\) (marked as gray in Fig. 7b), and update \(|{C^*_1}'|=15\).

    2. (ii)

      \({\tau ^2_U}'=\biggl \lfloor \frac{|{C^*_1}'|}{{\tau ^1_V}'} \biggr \rfloor =7\), \({\tau ^2_V}'=\biggl \lfloor \frac{{\tau ^1_V}'}{2} \biggr \rfloor =1\). We cannot find larger bicliques.

Thus, we obtain \(R_2={C^*_1}'\), as shown in Fig. 7b. Note that the subspaces generated in G and \(G'\) are different. Consequently, for \(R_1\) and \(R_2\), we compute the reduced subgraph by \({\mathsf {Reduce}} \) and the maximum biclique by \(\mathsf {MBC}\) in each subspace independently.

Finally, we obtain the result set \({{\mathcal {D}}}=\{R_1,R_2\}\).

6.2 Advanced diversified top-k search

In this subsection, we first analyze the drawbacks of the baseline solution, and then introduce our new diversified top-k biclique search approach, which derives the same subspaces for different results so as to share the computation cost among them.

6.2.1 Problem analysis

Drawbacks of \({\mathsf {TopKBasic}} \) The major limitation of \({\mathsf {TopKBasic}} \) is the isolated computation of each \(R_i\) by \({\mathsf {MBC}} ^*\). Recall that in \({\mathsf {MBC}} ^*\), we progressively generate subspaces based on the size of the maximum biclique found so far (line 5–6 in \({\mathsf {MBC}} ^*\)). We call the set of subspaces generated for \(R_i\) its subspace set. Obviously, the subspace sets generated for the top-k results are different (e.g., the subspace sets of \(R_1\) and \(R_2\) in Example 7). Consequently, both graph reduction by \(\mathsf {Reduce}\) and maximum biclique search by \(\mathsf {MBC}\) in \({\mathsf {MBC}} ^*\) are computed independently in each subspace for each of the k results, which is costly.

Our idea Intuitively, since the different subspace sets lead to the isolated computation of each \(R_i\) in \({\mathsf {TopKBasic}} \), we consider generating the same subspace set for all the k results so as to share the computation among them. Specifically, we fix the subspace set in \({\mathsf {MBC}} ^*\) as follows: (1) Instead of using the largest biclique size found so far as the lower bound of the optimal solution for generating subspaces, we use a constant c. According to Theorem 1, it is not hard to prove that with c as the lower bound, we preserve the maximum biclique whose size is larger than c in the derived subspaces. Thus, to find the top-k results, we set c to a constant smaller than the size of the k-th result \(R_k\). (2) We fix \(\tau ^0_V=d^U_{\max }(G_{\mathrm{ori}})\), where \(G_{\mathrm{ori}}\) denotes the original bipartite graph G, and \(d^U_{\max }(G_{\mathrm{ori}})\) is guaranteed to be an upper bound of |V(R)| for all the k results. Consequently, with the fixed c and \(\tau ^0_V\), we can generate the same subspace set for all the k results. We denote such a fixed subspace set as \(\mathcal{FS}(G,c)\), or \(\mathcal{FS}\) in short if the context is clear.

With the idea of the fixed subspace set \(\mathcal{FS}(G,c)\), the following two issues need to be further addressed:

  • First, we do not know the size of the k-th result \(R_k\). Although we can set c as a small constant, e.g., \(c=\tau _U\times \tau _V\), the \(\tau _U^i\) and \(\tau _V^i\) computed based on c in each subspace may be very loose for graph reduction and search space pruning.

  • Second, even if we can generate the same subspace set for the k results, we still need to remove \(R_i\) from the bipartite graph G when searching for \(R_{i+1}\), which means that the reduced graph and the maximum biclique in each subspace need to be recomputed.

6.2.2 Advanced top-k biclique search

To solve the above problems, we first preserve the following three pieces of information for each subspace in \(\mathcal{FS}(G,c)\):

  (1) the thresholds \(\tau _U^i\) and \(\tau _V^i\) computed based on c;

  (2) the reduced subgraph \(G_i\) w.r.t. \(\tau _U^i\) and \(\tau _V^i\);

  (3) the maximum biclique \(C^*_i\) in \(G_i\) such that \(|U(C^*_i)|\ge \tau _U^i\), \(|V(C^*_i)|\ge \tau _V^i\) and \(|C^*_i|>c\).

Based on \(\mathcal{FS}(G,c)\), we address the two issues as follows:

  • To address the first issue, instead of initializing c as a very small constant to cover all the results, which leads to loose thresholds, we search for the top-k results by progressively relaxing c. Specifically, we use a lower bound of the size of the top-1 biclique to initialize c. Then, we generate \(\mathcal{FS}(G,c)\) and search for results in it. Once we cannot find enough results within \(\mathcal{FS}(G,c)\), we relax c to \(c'\) by multiplying it by a factor \(\alpha \), where \(0<\alpha <1\), and regenerate \(\mathcal{FS}(G,c')\) to cover more results.

  • To address the second issue, instead of recomputing the subgraph by \({\mathsf {Reduce}} \) and the maximum biclique by \({\mathsf {MBC}} \) in each subspace when searching for the next result, we apply lightweight subgraph updating and on-demand maximum biclique searching in \(\mathcal{FS}(G,c)\). Specifically, as we have maintained the reduced subgraph \(G_i\) and the maximum biclique \(C^*_i\) in each subspace of \(\mathcal{FS}(G,c)\), when we need to remove \(R_j\) from G, we can update \(G_i\) by simply eliminating the edges of \(R_j\) from \(G_i\). Moreover, we only need to recompute \(C^*_i\) if it overlaps with \(R_j\); otherwise, \(C^*_i\) remains unchanged even when \(G_i\) is updated.

Algorithm 6 \({\mathsf {TopK}} \)

The advanced algorithm The proposed algorithm is shown in Algorithm 6. It first initializes \({{\mathcal {D}}}\) as an empty set and \(R_0\) as empty (line 1). We set the value of c as the size of an initial biclique in G found by \({\mathsf {InitMBC}} \), which is a lower bound of the size of the top-1 result \(R_1\) (line 2). We use flag to indicate whether we need to generate \(\mathcal{FS}\) with constant c, initialized as true, and i to denote the index of the top-k results, initialized as 0 (line 3). Then, we search for the top-k results (line 4–16). We first invoke \({\mathsf {GenSubSpaces}} \) to generate \(\mathcal{FS}\) when flag is true, and after generation, we set flag to false (line 5–6). With \(\mathcal{FS}\), we invoke \({\mathsf {FixedMBC}} ^*\) to search the maximum biclique in each subspace and return the one with the largest size as the result \(R_{i+1}\) (line 7). If \(R_{i+1}\) is empty, no biclique larger than c can be found in \(\mathcal{FS}\). In this case, we terminate the computation if \(c<\tau _U\times \tau _V\), since no more bicliques satisfying \(\tau _U\) and \(\tau _V\) can be found (line 9–10). Otherwise, we relax c by multiplying it by a factor \(\alpha \), where \(0<\alpha <1\), and set flag to true to indicate that we need to regenerate \(\mathcal{FS}\) (line 11–12). If \(R_{i+1}\) is not empty, we add \(R_{i+1}\) into \({{\mathcal {D}}}\) and update G by deleting the edges of \(R_{i+1}\) from G (line 13–16). Finally, we return \({{\mathcal {D}}}\) as the diversified top-k results (line 17).
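The control flow of Algorithm 6 can be sketched in Python as follows; `init_mbc`, `gen_subspaces`, and `fixed_mbc_star` are hypothetical stand-ins for \({\mathsf {InitMBC}} \), \({\mathsf {GenSubSpaces}} \), and \({\mathsf {FixedMBC}} ^*\), and this is a sketch of the control flow rather than the deployed implementation.

```python
def topk(adj, tau_u, tau_v, k, alpha, init_mbc, gen_subspaces, fixed_mbc_star):
    """A sketch of TopK (Algorithm 6); alpha in (0, 1) is the relaxation factor."""
    results, last = [], None              # line 1: D = {}, R_0 = empty
    c = init_mbc(adj, tau_u, tau_v)       # line 2: size of an initial biclique
    need_fs = True                        # line 3: flag for (re)generating FS
    while len(results) < k:               # line 4-16
        if need_fs:                       # line 5-6: (re)generate FS(G, c)
            fs = gen_subspaces(adj, tau_u, tau_v, c)
            need_fs, last = False, None
        r = fixed_mbc_star(fs, last)      # line 7: best biclique over FS
        if r is None:                     # line 8: nothing larger than c in FS
            if c < tau_u * tau_v:         # line 9-10: no qualifying biclique left
                break
            c, need_fs = int(c * alpha), True   # line 11-12: relax c, rebuild FS
        else:                             # line 13-16: record R_{i+1} and
            results.append(r)             #             remove its edges from G
            us, vs = r
            for u in us:
                adj[u] -= vs
            last = r
    return results                        # line 17: return D
```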

Procedure \({\mathsf {GenSubSpaces}} \) generates the fixed subspace set based on c. It follows a procedure similar to \({\mathsf {MBC}} ^*\), except that in \({\mathsf {GenSubSpaces}} \), each iteration generates a pair of \(\tau ^{i+1}_U\) and \(\tau ^{i+1}_V\) based on the fixed constant c (line 22–23) and searches for the maximum biclique whose size is larger than c (line 25). Here, we slightly modify \({\mathsf {MBC}} \) so that it uses the constant c rather than an initial biclique for size pruning, and if no biclique larger than c can be found, it directly returns an empty biclique. Besides, we preserve the reduced subgraph \(G_i\), the thresholds \(\tau ^i_U\) and \(\tau ^i_V\), and the maximum biclique \(C^*_i\) as the subspace information in \(\mathcal{FS}\) (line 26), in order to share the computation cost among all results preserved in \(\mathcal{FS}\).
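Under the same assumptions as above, \({\mathsf {GenSubSpaces}} \) can be sketched as below; `reduce_graph` and `mbc_with_bound` are hypothetical stand-ins for \({\mathsf {Reduce}} \) and the modified \({\mathsf {MBC}} \) that prunes with the constant c (the lazy field anticipates the optimization in Sect. 6.3).

```python
def gen_subspaces(adj, tau_v0, c, reduce_graph, mbc_with_bound):
    # Generate the fixed subspace set FS(G, c). tau_v0 is tau_V^0, e.g.,
    # d^U_max of the original graph. Each subspace preserves the thresholds,
    # the reduced subgraph G_i, and the maximum biclique C*_i (lines 22-26).
    fs, tau_v = [], tau_v0
    while tau_v >= 2:
        tau_u_i = c // tau_v                 # tau_U^{i+1} = floor(c / tau_V^i)
        tau_v_i = tau_v // 2                 # tau_V^{i+1} = floor(tau_V^i / 2)
        g_i = reduce_graph(adj, tau_u_i, tau_v_i)          # one/two-hop reduction
        c_star = mbc_with_bound(g_i, tau_u_i, tau_v_i, c)  # None if nothing > c
        fs.append({"tau_u": tau_u_i, "tau_v": tau_v_i,
                   "graph": g_i, "best": c_star, "lazy": False})
        tau_v = tau_v_i
    return fs
```

With c = 10 and \(\tau ^0_V=5\), this loop generates the two subspaces (2, 2) and (5, 1) of Example 8.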

Procedure \({\mathsf {FixedMBC}} ^*\) searches for the maximum biclique in the fixed subspace set \(\mathcal{FS}\). First, we initialize \(C^*\) as empty (line 30), and then progressively update it with larger bicliques found in each subspace (line 31–36). For the i-th subspace in \(\mathcal{FS}\), we first update the subgraph \(G_i\) by eliminating all edges of R from \(G_i\), where R is the last diversified biclique result we found (line 32). Then, if \(C^*_i\) overlaps with R, which indicates that the maximum biclique in the current subspace has changed, we recompute \(C^*_i\) by \({\mathsf {MBC}} \) (line 33–34). Otherwise, \(C^*_i\) remains unchanged, and there is no need to update it. We update \(C^*\) whenever we find a larger biclique (line 35–36), and finally return \(C^*\) as the result (line 37).
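A sketch of \({\mathsf {FixedMBC}} ^*\) under the same assumptions follows; `overlaps` (checking whether a preserved \(C^*_i\) shares an edge with the last result) and `mbc` are hypothetical helpers.

```python
def biclique_size(b):
    us, vs = b
    return len(us) * len(vs)          # |C| = |U(C)| * |V(C)|

def fixed_mbc_star(fs, last, mbc, overlaps):
    best = None                                        # line 30: C* = empty
    for sub in fs:                                     # line 31-36
        if last is not None:
            us, vs = last
            for u in us:                               # line 32: remove the last
                if u in sub["graph"]:                  # result's edges from G_i
                    sub["graph"][u] -= vs
            if sub["best"] is not None and overlaps(sub["best"], last):
                # line 33-34: the stored C*_i is stale, recompute it on G_i
                sub["best"] = mbc(sub["graph"], sub["tau_u"], sub["tau_v"])
        if sub["best"] is not None and (
                best is None or biclique_size(sub["best"]) > biclique_size(best)):
            best = sub["best"]                         # line 35-36: keep the largest
    return best                                        # line 37
```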

Time complexity We analyze the time cost of Algorithm 6. The time cost mainly consists of the graph reduction time (in \({\mathsf {GenSubSpaces}} \)) and maximum biclique searching time (in \({\mathsf {GenSubSpaces}} \) and \({\mathsf {FixedMBC}} ^*\)).

Firstly, for graph reduction, suppose we generate \(k'\) subspace sets to cover all the top-k results by invoking \({\mathsf {GenSubSpaces}} \). Here, \(k'\) is bounded by \(\log _\alpha (\frac{1}{c_0})\), where \(c_0\) is a lower bound of the size of the top-1 biclique \(R_1\) (\(0<\alpha <1\)). In each subspace set \(\mathcal{FS}_m\) (\(1\le m\le k'\)), we denote the number of subspaces as \(l_m\), where \(l_m\) is bounded by \(\log (d^U_{\max }(G))\). Then, the total graph reduction time is \(O(\sum _{m=1}^{k'}\sum _{i=1}^{l_m}T_{\mathrm{reduce}}(G))\).

Secondly, for maximum biclique search, in \({\mathsf {GenSubSpaces}} \), we need to compute the maximum biclique in all subspaces in \(\mathcal{FS}_m\), while in \({\mathsf {FixedMBC}} ^*\), we only compute the maximum biclique when needed. Suppose that for result \(R_j\), we need to compute \(C^*_{1,j}, C^*_{2,j},\ldots ,C^*_{l'_j,j}\), where \(C^*_{i,j}\) denotes the i-th maximum biclique that needs to be recomputed on subgraph \(G_{i,j}\) to obtain \(R_j\). Here, \(l'_j\le l_m\) for \(R_j\) preserved in \(\mathcal{FS}_m\). Then, the total maximum biclique search time is \(O(\sum _{j=1}^{k}\sum _{i=1}^{l'_j}T_{\mathrm{search}}(G_{i,j}))\). Consequently, the time cost of Algorithm 6 is \(O(\sum _{m=1}^{k'}\sum _{i=1}^{l_m}T_{\mathrm{reduce}}(G)+\sum _{j=1}^{k}\sum _{i=1}^{l'_j}T_{\mathrm{search}}(G_{i,j}))\), where \(k'\le \log _\alpha (\frac{1}{|R_1|})\). Note that in practice, we observe that \(k'\) is much smaller than k, and for result \(R_j\) preserved in \(\mathcal{FS}_m\), \(l'_j\) is also much smaller than \(l_m\) in most cases.

Fig. 8 Search top-2 bicliques in G by \({\mathsf {TopK}} \)

Example 8

Given a bipartite graph G in Fig. 7a with thresholds \(\tau _U=1\) and \(\tau _V=1\), we adopt Algorithm 6 to find the diversified top-2 bicliques in G as shown in Fig. 8.

Suppose we have \(c=10\) and \(\tau ^0_V=5\). We generate \(\mathcal{FS}(G,c)\) consisting of two fixed subspaces:

  (i) \(\tau ^1_U=\biggl \lfloor \frac{c}{\tau ^0_V} \biggr \rfloor =2\), \(\tau ^1_V=\biggl \lfloor \frac{\tau ^0_V}{2} \biggr \rfloor =2\). The reduced subgraph \(G_1\) is shown in Fig. 8a with the maximum biclique \(C^*_1\) marked as gray;

  (ii) \(\tau ^2_U=\biggl \lfloor \frac{c}{\tau ^1_V} \biggr \rfloor =5\), \(\tau ^2_V=\biggl \lfloor \frac{\tau ^1_V}{2} \biggr \rfloor =1\). The reduced subgraph \(G_2\) is shown in Fig. 8b with the maximum biclique \(C^*_2\) marked as gray.

Based on \(\mathcal{FS}(G,c)\), we search for the top-2 diversified bicliques as follows:

  (1) With the preserved maximum bicliques in the subspace set, we obtain \(R_1=C^*_1\), whose size is maximized. Then, we remove all edges in \(R_1\) from G and get \(G'\).

  (2) After finding \(R_1\), we can update \(G_1\) and \(G_2\) by directly removing the edges of \(R_1\) from them, as shown in Fig. 8c, d, respectively (here we omit vertices with no edges). Furthermore, we only need to recompute \(C^*_1\), as it overlaps with \(R_1\), and skip \(C^*_2\), as it remains unchanged. Then, we obtain \(R_2=C^*_2\).

Finally, we obtain the result set \({{\mathcal {D}}}=\{R_1,R_2\}\). Compared with \({\mathsf {TopKBasic}} \), benefiting from the fixed subspace set, \({\mathsf {TopK}} \) saves cost by sharing the computation of graph reduction and maximum biclique search in subspaces among the results preserved in \(\mathcal{FS}(G,c)\).

6.3 Optimization strategies

In Algorithm 6, the computation cost mainly consists of two parts: (1) the graph reduction when generating \(\mathcal{FS}\) in \({\mathsf {GenSubSpaces}} \), and (2) the updating of maximum biclique \(C^*_i\) in \({\mathsf {FixedMBC}} ^*\). To further save the computation cost, we propose the following two optimizations.

Global size pruning In \({\mathsf {GenSubSpaces}} \), we apply \(\mathsf {Reduce}\) on G in each subspace to reduce the graph size, which is costly since G is large. However, in \(\mathcal{FS}(G,c)\), we only search for bicliques whose size is larger than c. Based on this, before we apply \(\mathsf {Reduce}\) on G in each subspace w.r.t. \(\tau ^i_U\) and \(\tau ^i_V\), we can first prune all vertices that cannot be involved in a biclique of size larger than c, so as to share the computation among all subspaces in \(\mathcal{FS}(G,c)\). Although we do not know the biclique size before searching, for a vertex u, we can use the sum of the degrees of all u's neighbors as an upper bound on the size of any biclique that involves u. Following Definition 4, with the size constraint c, we use \(G'\sqsubseteq _{c} G\) to denote that \(G'\) is an MBC-preserved graph of G w.r.t. c. We derive the following lemma:

Lemma 8

Given a bipartite graph G and a size constraint c, we have:

  (1) \(\forall u\in U(G)\): \(\sum _{v\in N(u,G)}d(v,G)\le c \implies G\ominus u \sqsubseteq _{c}G\);

  (2) \(\forall v\in V(G)\): \(\sum _{u\in N(v,G)}d(u,G)\le c\implies G\ominus v \sqsubseteq _{c}G\).

We omit the proof here. Lemma 8 provides a sufficient condition for a vertex to be eliminated such that the maximum biclique whose size is larger than c is preserved. Based on Lemma 1, Lemma 8 can be applied iteratively to reduce the graph size until no vertices can be eliminated.

We can simply modify \({\mathsf {GenSubSpaces}} \) in Algorithm 6 by first applying the global size pruning rule on G to get \(G'\), and then iteratively generating subspaces by applying \(\mathsf {Reduce}\) on \(G'\), as in the sketch below.
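Lemma 8 translates into a simple iterative peeling procedure. Below is a minimal sketch, assuming both directions of the adjacency are kept as dictionaries of sets; the helper name `global_size_prune` is ours, not the paper's.

```python
def global_size_prune(adj_u, adj_v, c):
    """Iteratively remove vertices whose neighbor-degree sum is <= c (Lemma 8);
    such vertices cannot appear in any biclique of size > c.
    adj_u: u -> set of v's; adj_v: v -> set of u's (the two views stay in sync)."""
    changed = True
    while changed:                       # repeat until a fixpoint is reached
        changed = False
        for u in list(adj_u):
            # sum of d(v, G) over v in N(u, G): an upper bound on |C| for any
            # biclique C containing u
            if sum(len(adj_v[v]) for v in adj_u[u]) <= c:
                for v in adj_u.pop(u):   # G = G - u
                    adj_v[v].discard(u)
                changed = True
        for v in list(adj_v):
            if sum(len(adj_u[u]) for u in adj_v[v]) <= c:
                for u in adj_v.pop(v):   # G = G - v
                    adj_u[u].discard(v)
                changed = True
```

Each removal can only decrease the degree sums of the remaining vertices, so the rule is applied iteratively until no vertex can be eliminated, as stated after Lemma 8.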

Lazy candidate refining In \({\mathsf {FixedMBC}} ^*\), when searching for \(R_{j+1}\) after finding \(R_j\), a \(C^*_i\) that overlaps with \(R_j\) is out of date, so we recompute \(C^*_i\) by \({\mathsf {MBC}} \). However, it is not necessary to update \(C^*_i\) immediately if it cannot be \(R_{j+1}\), so we refine the candidates in a lazy manner. Specifically, in \({\mathsf {FixedMBC}} ^*\), when a \(C^*_i\) that overlaps with \(R_j\) is no larger than the optimal biclique \(C^*\) found so far, instead of directly updating \(C^*_i\) by \({\mathsf {MBC}} \), which is costly, we label it with \(lazy=\mathbf{true }\) and do not recompute it until it could be the maximum one. To apply the lazy refining strategy, we modify \({\mathsf {GenSubSpaces}} \) by initializing \(lazy=\mathbf{false }\) for \(C^*_i\) in all subspaces. Then, we modify \({\mathsf {FixedMBC}} ^*\) as follows (a code sketch follows the list):

  (1) Before updating subspaces in \(\mathcal{FS}\) and searching for \(R_{j+1}\), we first traverse all subspaces to collect the up-to-date maximum bicliques, i.e., those that do not overlap with \(R_j\) and have \(lazy=\mathbf{false }\). Among these bicliques, we use the size of the largest one as a lower bound of \(|R_{j+1}|\), denoted as \(lb(R_{j+1})\).

  (2) Then, in all subspaces: (i) for \(C^*_i\) with \(lazy=\mathbf{true }\), if \(|C^*_i|>lb(R_{j+1})\), we update \(C^*_i\) by \({\mathsf {MBC}} \) and set \(lazy=\mathbf{false }\); otherwise, we skip \(C^*_i\). (ii) for \(C^*_i\) with \(lazy=\mathbf{false }\) that overlaps with \(R_j\), if \(|C^*_i|>lb(R_{j+1})\), we update \(C^*_i\) by \({\mathsf {MBC}} \); otherwise, we set \(lazy=\mathbf{true }\) for \(C^*_i\) and skip it. (iii) for \(C^*_i\) with \(lazy=\mathbf{false }\) that does not overlap with \(R_j\), there is no need to refine it, as it is already up to date.

  (3) Finally, we return \(C^*\) as the biclique with the largest size among all up-to-date candidates with \(lazy=\mathbf{false }\).
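Putting the three cases together, the lazy refinement step might look as follows; this is a sketch with the same hypothetical `mbc` and `overlaps` helpers as before, and it omits the subgraph update of line 32, which is unchanged.

```python
def lazy_refine(fs, last, mbc, overlaps):
    def size(b):
        us, vs = b
        return len(us) * len(vs)

    # Step (1): lower-bound |R_{j+1}| by the largest up-to-date candidate,
    # i.e., one that is not lazy and does not overlap with the last result R_j.
    lb = 0
    for sub in fs:
        if sub["best"] and not sub["lazy"] and not overlaps(sub["best"], last):
            lb = max(lb, size(sub["best"]))

    # Step (2): refine each subspace according to cases (i)-(iii).
    for sub in fs:
        cand = sub["best"]
        if cand is None:
            continue
        if sub["lazy"]:                               # case (i): lazy candidate
            if size(cand) > lb:
                sub["best"] = mbc(sub["graph"], sub["tau_u"], sub["tau_v"])
                sub["lazy"] = False
        elif overlaps(cand, last):                    # case (ii): stale candidate
            if size(cand) > lb:
                sub["best"] = mbc(sub["graph"], sub["tau_u"], sub["tau_v"])
            else:
                sub["lazy"] = True                    # defer the recomputation
        # case (iii): not lazy and no overlap -> already up to date, skip

    # Step (3): return the largest up-to-date candidate.
    best = None
    for sub in fs:
        if sub["best"] and not sub["lazy"]:
            if best is None or size(sub["best"]) > size(best):
                best = sub["best"]
    return best
```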

Table 1 Dataset statistics

7 Performance studies

In this section, we present our performance studies. We first compare the proposed maximum biclique search algorithm \({\mathsf {MBC}} ^*\) with the following two baseline algorithms:

(1) \(\mathsf {MBC}\): \(\mathsf {MBC}\) is developed based on the algorithm in [59], where the code is obtained from the authors, with the pruning rules in Algorithm 1 added.

(2) \(\mathsf {MAPEB}\): \(\mathsf {MAPEB}\) is developed based on the parameterized algorithm \(\mathsf {APEB}\) in [14]. Given a bipartite graph G and an integer p, \(\mathsf {APEB}\) aims to find a biclique C in G with at least p edges (where (G, p) is called a yes-instance), or report that no such biclique exists. Naturally, we extend \(\mathsf {APEB}\) with the binary search technique to find the maximum biclique \(C^*\). We denote the lower bound and the upper bound of \(|C^*|\) as lb and ub, respectively. The basic idea is that we iteratively set \(p=\lfloor \frac{lb+ub}{2}\rfloor \) and adopt \(\mathsf {APEB}\) to check whether (G, p) is a yes-instance: if it is, we update \(C^*\) as the found biclique C and set \(lb=|C|+1\); otherwise, we update \(ub=p-1\). We stop the computation when \(lb>ub\) and return \(C^*\). We initialize \(C^*\) as \(\emptyset \), lb as \(\tau _U\times \tau _V\), and ub as the maximum \({\mathsf {score}} (u)\) (defined in Eq. 2) among all vertices u in G. We also add pruning rules for the size constraints \(\tau _U\) and \(\tau _V\). We call the extended algorithm \(\mathsf {MAPEB}\).
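The binary search wrapper described above can be sketched as follows; `apeb` is a hypothetical stand-in for \(\mathsf {APEB}\) that returns a biclique with at least p edges, or None for a no-instance.

```python
def mapeb(adj, tau_u, tau_v, lb, ub, apeb):
    # Binary search for the maximum biclique C* using APEB as an oracle.
    # lb starts at tau_U * tau_V; ub starts at the maximum score(u) over u in G.
    def edges(c):
        us, vs = c
        return len(us) * len(vs)

    best = None                        # C* = empty set
    while lb <= ub:
        p = (lb + ub) // 2             # p = floor((lb + ub) / 2)
        c = apeb(adj, p, tau_u, tau_v) # yes-instance: biclique with >= p edges
        if c is not None:
            best = c                   # update C* with the found biclique
            lb = edges(c) + 1          # search for strictly larger bicliques
        else:
            ub = p - 1                 # no biclique with >= p edges exists
    return best
```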

We evaluate our algorithms in two aspects: (1) the effectiveness of the graph reduction techniques and optimization strategies used in \({\mathsf {MBC}} ^*\), and (2) the efficiency and scalability of maximum biclique search by comparing \({\mathsf {MBC}} ^*\) with \({\mathsf {MBC}} \) and \(\mathsf {MAPEB}\). Then, we show the performance of the diversified top-k biclique search by comparing the proposed algorithm \({\mathsf {TopK}} \) with the baseline algorithm \({\mathsf {TopKBasic}} \). A case study of anomaly detection on real datasets obtained from Alibaba Group is further described to demonstrate the result quality of our method. Unless otherwise specified, experiments are conducted with \(\tau _U=3\) and \(\tau _V=3\) by default. All of our experiments are performed on a machine with an Intel Xeon E5-2650 (32 cores) 2.6GHz CPU and 128GB main memory running Linux.

Datasets We use 18 real datasets selected from different domains with various data properties, including the ones used in existing works. The detailed statistics of the datasets are shown in Table 1. The first 13 datasets are obtained from KONECT. The last five datasets are real datasets obtained from the E-Commerce company Alibaba Group. Here, the AddCart20 and AddCart18 datasets include data of customers adding products to carts in 1 day (sampled from data in 2020) and 10 days (sampled from data in 2018), respectively. The Transaction20 and Transaction18 datasets include data of customers purchasing products in 3 days (sampled from data in 2020) and 15 days (sampled from data in 2018), respectively. Additionally, the LabeledAddCart dataset includes fraudulent transaction labels that we utilize as the ground truth in the case study.

7.1 Graph reduction and optimizations

In this subsection, we test the effectiveness and efficiency of the graph reduction techniques and optimization strategies used in the algorithm \({\mathsf {MBC}} ^*\).

Table 2 Graph reduction on TVTropes

Effectiveness of graph reduction We test the effectiveness of the proposed one-hop and two-hop graph reduction techniques on the TVTropes and BookCrossing datasets, and show the results in Tables 2 and 3, respectively. We set \({\mathsf {MAX\_ITER}} \) in \(\mathsf {Reduce}\) as 2. Experiments on other datasets have similar outcomes. In Tables 2 and 3, we list \(\tau _U^k\), \(\tau _V^k\) and the number of vertices and edges of the reduced graph in each iteration k of \({\mathsf {MBC}} ^*\). We also list the size of \(C^*_k\) found in each iteration. We compute the compression ratio \(r_k\) as the ratio of the reduced graph size to the original graph size. In iteration 0, we show the results of graph \(G_0\) reduced by \(\tau _U=3\) and \(\tau _V=3\) as a comparison. We omit the results of the iterations where the reduced graphs are empty. From the results, we can see that in each iteration, we adopt much stricter \(\tau _U^k\) and \(\tau _V^k\) constraints than \(\tau _U\) and \(\tau _V\). Therefore, by utilizing the graph reduction techniques, we get much smaller reduced graphs, e.g., compression ratios of 0% (omitted in the table) to 2.05% by using \(\tau _U^k\) and \(\tau _V^k\) in our progressive bounding framework vs. 97.53% by using \(\tau _U\) and \(\tau _V\), as shown in Table 2. This prunes a huge amount of search space and greatly accelerates the biclique computation.

Table 3 Graph reduction on BookCrossing
Fig. 9 Optimization strategies

Efficiency of graph reduction We conduct experiments on LiveJournal and WebTrackers to compare the performance of the basic algorithm with the optimized versions. We denote the basic version of Algorithm 2 as BASIC, the algorithm with the early pruning strategy introduced in Sect. 5.2 as OPT\(_1\), and the algorithm with the early skipping strategy introduced in Sect. 5.2, built on OPT\(_1\), as OPT\(_2\). The results are shown in Fig. 9, with the two-hop graph reduction time denoted as TwoHopTime and the total time denoted as AllTime. We can see that for TwoHopTime on LiveJournal, OPT\(_2\) is about 21.41% faster than OPT\(_1\) and 41.74% faster than BASIC. Consequently, OPT\(_2\) accelerates AllTime by around 17.56% w.r.t. the BASIC version. For TwoHopTime on WebTrackers, OPT\(_2\) is about 30.9% faster than OPT\(_1\) and 45.7% faster than BASIC. Consequently, OPT\(_2\) accelerates AllTime by around 23.2% w.r.t. the BASIC version. When comparing with the baseline algorithms in the following experiments, we apply all the optimization techniques.

7.2 \({\mathsf {MBC}} ^*\) versus baseline algorithms

In this subsection, we compare the performance of \({\mathsf {MBC}} ^*\), \(\mathsf {MBC}\) and \(\mathsf {MAPEB}\) on maximum biclique search by: (1) conducting experiments on all datasets; (2) varying \(\tau _U\) and \(\tau _V\) thresholds on both small-sized and large-sized graphs; (3) varying graph density; (4) varying graph scale.

In all experiments, we set the maximum processing time as 24 h, and if a method cannot finish computing, we denote its time cost as NaN. For the experiments that cannot finish within 24 h, we also report the quality ratio above the corresponding bars, which is calculated as:

$$\begin{aligned} \text {quality ratio} = \frac{\text {the size of current best biclique}}{\text {the size of the maximum biclique}} \end{aligned}$$

Note that it is possible that the quality ratio is \(100\%\) while the algorithm cannot finish, because the size of the maximum biclique is unknown before the algorithm finishes.

Fig. 10 \(C^*_{3,3}\) search on all datasets

All datasets In this experiment, we test the efficiency of \(C^*_{3,3}\) search on all datasets by comparing \({\mathsf {MBC}} ^*\) with \(\mathsf {MBC}\) and \(\mathsf {MAPEB}\), and report the processing time in Fig. 10. From Fig. 10, we can see that when the dataset is relatively small, e.g., around 0.1 million edges in Writers, both \({\mathsf {MBC}} ^*\) and \(\mathsf {MBC}\) can find \(C^*_{3,3}\) efficiently. As the graph size scales up, e.g., for graphs with millions of edges such as BookCrossing and StackOverflow, \(\mathsf {MBC}\) takes hours to compute the results, while \({\mathsf {MBC}} ^*\) takes only seconds. Furthermore, when the graph size grows to around 1 billion edges, as in AddCart and Transaction, \(\mathsf {MBC}\) cannot finish computing within 24 h, while \({\mathsf {MBC}} ^*\) takes only minutes to compute the results. \(\mathsf {MAPEB}\), however, fails to finish computing in most cases. Moreover, for most time-out cases, the bicliques found by \(\mathsf {MBC}\) and \(\mathsf {MAPEB}\) are far smaller than the maximum bicliques. From the results shown in Fig. 10, we can see that \({\mathsf {MBC}} ^*\) is much more efficient and scalable than both \(\mathsf {MBC}\) and \(\mathsf {MAPEB}\) on all datasets.

Fig. 11 \(C^*\) search by varying \(\tau _U\) and \(\tau _V\)

Varying \(\tau _U\) and \(\tau _V\) thresholds We vary the \(\tau _U\) and \(\tau _V\) thresholds to compute \(C^*\) and illustrate the performance of \({\mathsf {MBC}} ^*\), \(\mathsf {MBC}\) and \(\mathsf {MAPEB}\) in Fig. 11. Figure 11 shows that \(\mathsf {MBC}\) can process small graphs (YouTube and StackOverflow) but fails on large graphs (LiveJournal and WebTrackers). For small graphs, when \(\tau _U\) and \(\tau _V\) get larger, the time cost of \(\mathsf {MBC}\) decreases. This is because as \(\tau _U\) and \(\tau _V\) get larger, \(\mathsf {MBC}\) can filter more search branches. For large graphs, \(\mathsf {MBC}\) cannot finish computing within 24 h, since the search space is huge and \(\mathsf {MBC}\) gets stuck in local search. \(\mathsf {MAPEB}\) cannot finish computing in any case. The main reason is that \(\mathsf {MAPEB}\) is developed based on \(\mathsf {APEB}\) [14], which mainly benefits from early termination as soon as a yes-instance is found. However, to find the maximum biclique, we will encounter no-instances in the binary search process of \(\mathsf {MAPEB}\). For a no-instance (G, p), for each vertex \(u\in U(G)\) (\(v\in V(G)\) resp.), \(\mathsf {APEB}\) has to enumerate all the combinations (with size constraints of \(\ge \tau _V\) (\(\ge \tau _U\) resp.) and \(\le \left\lceil \sqrt{p}\right\rceil \)) of u's (v's resp.) neighbors and induce a biclique for each combination, which is very costly. In comparison, \({\mathsf {MBC}} ^*\) is orders of magnitude faster than both \(\mathsf {MBC}\) and \(\mathsf {MAPEB}\) in all settings. In most cases, when \(\tau _U\) and \(\tau _V\) get larger, the time cost of \({\mathsf {MBC}} ^*\) slightly increases. This is because in most real cases, as \(\tau _U\) and \(\tau _V\) get larger, \(|C^*|\) becomes smaller. Thus, \({\mathsf {MBC}} ^*\) generates relatively looser \(\tau ^k_U\) and \(\tau ^k_V\) constraints, which results in larger reduced graphs. Specifically, in WebTrackers, the processing time is steady. This is because for all \(\tau _U\) and \(\tau _V\) settings in this experiment, \(|C^*|\) in WebTrackers is relatively large, and consequently \(\tau ^k_U\) and \(\tau ^k_V\) are quite strict. In general, the high efficiency of \({\mathsf {MBC}} ^*\) mainly benefits from the effective progressive bounding framework with graph reduction techniques, which saves enormous search space in biclique search.

Fig. 12 \(C^*_{3,3}\) search by varying graph density

Varying graph density In this experiment, we test the effect of graph density on the performance and present the results in Fig. 12. We prepare graphs with different densities by sampling edges from the original graph. For example, we sample \(20\%\), \(40\%\), \(60\%\), \(80\%\) and \(100\%\) of the edges in TVTropes, and denote these (sub)graphs as TV\(_1\), TV\(_2\), TV\(_3\), TV\(_4\) and TV\(_5\) in ascending order of density. Figure 12 shows that as the graphs grow denser, \(\mathsf {MBC}\) takes longer to find the maximum bicliques, or cannot finish computing within 24 h. Although \(\mathsf {MAPEB}\) may sometimes output larger bicliques than \(\mathsf {MBC}\) (e.g., on WebTrackers), since it may find yes-instances efficiently with some appropriate p during the binary search, it cannot finish computing in most cases due to the cost of the no-instances in the binary search. In contrast, \({\mathsf {MBC}} ^*\) is orders of magnitude faster than both \(\mathsf {MBC}\) and \(\mathsf {MAPEB}\) in all settings. It is worth noting that \({\mathsf {MBC}} ^*\) also finds maximum bicliques efficiently on dense graphs. For example, in Fig. 12c, as the graphs grow denser from LJ\(_3\) to LJ\(_5\), the processing time of \({\mathsf {MBC}} ^*\) decreases. The reason is that \({\mathsf {MBC}} ^*\) can find larger \(C^*_k\) in denser graphs. This helps improve the \(\tau ^k_U\) and \(\tau ^k_V\) thresholds and leads to small (or even empty) reduced graphs in the progressive bounding framework. Therefore, \({\mathsf {MBC}} ^*\) finds the maximum biclique efficiently on both sparse and dense graphs.

Table 4 Statistics of AddCart and Transaction
Fig. 13 \(C^*_{3,3}\) search by varying graph scale

Varying graph scale We test scalability by studying the effect of graph size on the performance. We prepare datasets by obtaining 1, 3, 6 and 10 days of data from AddCart18, and 1, 3, 6, 10 and 15 days of data from Transaction18. We list the statistics in Table 4 and report the results in Fig. 13. In Fig. 13, we can see that both \(\mathsf {MBC}\) and \(\mathsf {MAPEB}\) cannot finish computing within 24 h on all datasets, and the reported bicliques are much smaller than the maximum bicliques. In contrast, the processing time of \({\mathsf {MBC}} ^*\) increases steadily as the graph scales up. For AddCart10d and Transaction15d, which both consist of about 1.3 billion edges, \({\mathsf {MBC}} ^*\) takes 18 min and 15 min to compute the results, respectively, which is quite efficient. To the best of our knowledge, no existing solution can find maximum bicliques in bipartite graphs at this scale.

7.3 \(\mathsf {TopK}\) versus \(\mathsf {TopKBasic}\)

In this subsection, we test the efficiency of \(\mathsf {TopK}\) and \(\mathsf {TopKBasic}\) on diversified top-k biclique search. We first test the efficiency of the proposed optimizations of global size pruning and lazy candidate refining in \(\mathsf {TopK}\). Then, we compare \(\mathsf {TopK}\) with \(\mathsf {TopKBasic}\) by: (1) varying the result number k; (2) varying the \(\tau _U\) and \(\tau _V\) thresholds; and (3) varying graph density. In this subsection, we set \(k=80\), \(\tau _U=3\) and \(\tau _V=3\) by default unless otherwise specified. Besides, in \(\mathsf {TopK}\), we relax the lower bound c by multiplying it by a factor \(\alpha \) to include more results: \(\mathcal{FS}\) with smaller c can preserve more results, while \(\mathcal{FS}\) with larger c can generate tighter bounds in subspaces. As a compromise, we set \(\alpha =0.7\) in this subsection. Moreover, we set the maximum processing time as 24 h, and if the computation is not finished, we denote the time cost as NaN.

Fig. 14 Optimization strategies in \({\mathsf {TopK}} \)

Efficiency of optimizations We conduct experiments on Transaction20 and AddCart20 to compare the basic \({\mathsf {TopK}} \) algorithm with the optimized versions. We denote the basic version of Algorithm 6 as BASIC, the algorithm with the global size pruning strategy introduced in Sect. 6.3 as TopKOPT\(_1\), and the algorithm with the lazy candidate refining strategy introduced in Sect. 6.3, built on TopKOPT\(_1\), as TopKOPT\(_2\). We show the time cost to compute the top-10, 20, 40, and 80 bicliques on Transaction20 and AddCart20 in Fig. 14. From the results, we can see that on Transaction20, TopKOPT\(_1\) is 21.27% faster than BASIC on average, which mainly benefits from sharing computation among subspaces in \(\mathcal{FS}\) by pruning vertices with the size constraint first. Furthermore, TopKOPT\(_2\) is 18.37% faster than TopKOPT\(_1\) on average, which mainly comes from avoiding unnecessary maximum biclique updates in subspaces. Consequently, TopKOPT\(_2\) accelerates the computation by 35.81% on average w.r.t. the BASIC version on Transaction20. Similarly, on AddCart20, TopKOPT\(_1\) is \(27.73\%\) faster than BASIC, and TopKOPT\(_2\) further improves on TopKOPT\(_1\) by \(19.92\%\) on average. In total, TopKOPT\(_2\) accelerates the computation by 42.03% on average on AddCart20.

In the following experiments, when comparing \(\mathsf {TopK}\) with the baseline algorithm \(\mathsf {TopKBasic}\), we apply all the optimization techniques.

Fig. 15 Top-k search by varying k

Varying result number k In this experiment, we test the efficiency by varying the result number k and report the processing time of \({\mathsf {TopK}} \) and \({\mathsf {TopKBasic}} \) in Fig. 15. We set k as 10, 20, 40, 80 and 160, respectively. For both \({\mathsf {TopK}} \) and \({\mathsf {TopKBasic}} \), the time cost increases with k. For the small graphs Wikipedia and DBLP, we can see that \(\mathsf {TopK}\) achieves several times better performance than \(\mathsf {TopKBasic}\) in all k settings. For the large graphs Transaction20 and AddCart20, \(\mathsf {TopK}\) also outperforms \(\mathsf {TopKBasic}\) by several times, and by an order of magnitude for top-20 biclique search on AddCart20. Besides, from the figure, we can see that as k becomes larger, both \(\mathsf {TopK}\) and \(\mathsf {TopKBasic}\) take longer to find the results. This is because as k increases, the sizes of the result bicliques tend to be smaller. Consequently, the subspaces of later results in the top-k are generated with relatively looser \(\tau ^i_U\) and \(\tau ^i_V\) constraints, which leads to larger reduced subgraphs and longer biclique search time. In general, \({\mathsf {TopK}} \) outperforms \({\mathsf {TopKBasic}} \) by several times to an order of magnitude in all k settings.

Fig. 16 Top-k search by varying \(\tau _U\) and \(\tau _V\)

Varying \(\tau _U\) and \(\tau _V\) In this experiment, we test the efficiency by varying the thresholds \(\tau _U\) and \(\tau _V\). The results are reported in Fig. 16. On Wikipedia and AddCart20, when \(\tau _U\) and \(\tau _V\) get larger, the time cost of both \(\mathsf {TopK}\) and \(\mathsf {TopKBasic}\) increases. The reason is that the average size of the top-k results becomes much smaller on Wikipedia and AddCart20 as \(\tau _U\) and \(\tau _V\) get larger. This leads to relatively looser \(\tau ^i_U\) and \(\tau ^i_V\) constraints and thus larger reduced subgraphs in subspaces, which take longer to search for bicliques. On DBLP and Transaction20, the sizes of the top-k results are relatively large in all \(\tau _U\) and \(\tau _V\) settings, and thus the performance is not sensitive to the \(\tau _U\) and \(\tau _V\) settings but to the specific \(\tau ^i_U\) and \(\tau ^i_V\) generated in subspaces. As \(\mathsf {TopKBasic}\) needs to compute the k results one by one, while \(\mathsf {TopK}\) can preserve more results in \(\mathcal{FS}\) by slightly relaxing the \(\tau ^i_U\) and \(\tau ^i_V\) constraints in subspaces, the experimental results show that \(\mathsf {TopK}\) benefits greatly from the computation sharing, and outperforms \(\mathsf {TopKBasic}\) by several times to an order of magnitude in all \(\tau _U\) and \(\tau _V\) settings.

Fig. 17 Top-k search by varying graph density

Varying graph density In this experiment, we show the effect of graph density on the performance and report the results in Fig. 17. We prepare graphs with different densities by sampling edges from the original graphs, including the small graphs DBLP and Wikipedia, and the large graphs Transaction20 and AddCart20. For example, we sample \(20\%\), \(40\%\), \(60\%\), \(80\%\) and \(100\%\) of the edges in Transaction20, denoted as TRA\(_1\), TRA\(_2\), TRA\(_3\), TRA\(_4\) and TRA\(_5\), respectively. Note that we omit the results on WIKI\(_1\) and DBLP\(_1\), since we cannot obtain enough top-80 results on them. Figure 17 shows that as the graphs grow denser, \(\mathsf {TopKBasic}\) takes longer to find the top-k results in most cases, except on TRA\(_5\), where the time cost decreases. The main reason is that the sizes of the result bicliques in TRA\(_5\) are relatively large, and consequently the subspaces are generated with stricter \(\tau ^i_U\) and \(\tau ^i_V\) constraints. The time cost of \(\mathsf {TopK}\) follows a similar trend to that of \(\mathsf {TopKBasic}\) but increases more slowly as the graphs grow denser (except on TRA\(_5\), where the time cost decreases for the same reason), and \(\mathsf {TopK}\) is several times to an order of magnitude faster than \(\mathsf {TopKBasic}\) on all graphs. Therefore, \(\mathsf {TopK}\) finds the diversified top-k bicliques efficiently on both sparse and dense graphs.

7.4 Case study

Our proposed algorithm has been deployed at Alibaba Group to detect fraudulent transactions. E-business owners at Taobao and Tmall (two E-commerce platforms of Alibaba Group) may pay agents in the black market to promote the rankings of their online shops. Considering the costs of fake transactions and of maintaining a large number of user accounts, these agents usually need to organize a group of users to “purchase” a set of products at the same time for cost effectiveness. This leads to bicliques (i.e., click farms) in the bipartite graph consisting of users, products and purchase transactions. As the maximum biclique alone cannot cover all fraudulent transactions, we apply the diversified top-k biclique search method as follows.

TopK We adopt Algorithm 6 to compute the diversified top-k bicliques (i.e., suspicious click farms) in the bipartite graph. Note that \(\mathsf {TopK}\) improves the recall rate of fraudulent transaction detection by 50% according to the feedback of the risk management team at Alibaba Group.

To further demonstrate the effectiveness and efficiency of \(\mathsf {TopK}\), we also evaluate the following two baseline approaches on a real dataset LabeledAddCart obtained from Alibaba Group, which includes the labels of ground-truth fraudulent transactions.

(1) EnumK We adopt \(\mathsf {EnumK}\), whose logic is the same as \(\mathsf {MBC}\) but without the size pruning rule (line 5 and 13 in Algorithm 1), to enumerate all maximal bicliques satisfying the thresholds \(\tau _U\) and \(\tau _V\), where each maximal biclique represents a click farm. However, it is not possible to find all maximal bicliques and then select the top-k among them due to the huge number of maximal bicliques, so we evaluate the result of the first-k output maximal bicliques.

(2) Reduce Given appropriate values of thresholds \(\tau _U\) and \(\tau _V\), \(\mathsf {Reduce}\) outputs the reduced bipartite graph, where the edges represent suspicious fraudulent transactions. Although \(\mathsf {Reduce}\) cannot output bicliques, it can reduce the candidate size.

We define the precision and recall rate as follows:

$$\begin{aligned} \text {precision}&= \frac{\text {number of found fraudulent transactions}}{\text {number of output edges of the method}}\\ \text {recall}&= \frac{\text {number of found fraudulent transactions}}{\text {number of ground-truth fraudulent transactions}} \end{aligned}$$
Fig. 18 Precision of \(\mathsf {TopK}\)

Fig. 19 Quality of \(\mathsf {EnumK}\)

TopK result evaluation In this experiment, we vary \(\tau _V\) from 2 to 5 (with \(\tau _U=1\)) to test the precision of the top-k diversified bicliques found by \(\mathsf {TopK}\) on LabeledAddCart, and show the results in Fig. 18. The figure shows that the precision is over 95% in most cases, except for top-1000 when \(\tau _V=2\). This is because coincidences are more likely to happen when \(\tau _V\) is small. When \(\tau _V>2\), the precision even exceeds 99%. In general, \(\mathsf {TopK}\) outputs fraudulent transactions with high precision, and the found bicliques can serve as evidence when taking disciplinary measures. In the real application at Alibaba Group, \(\mathsf {TopK}\) not only returns fraudulent transactions with high precision, but also improves the recall rate by 50% w.r.t. existing solutions.

EnumK result evaluation We conduct experiments with \(\mathsf {EnumK}\) on LabeledAddCart and show the results in Fig. 19. We set \(\tau _U=1\) and \(\tau _V=2\); the results with other settings are similar. Given that \(\mathsf {EnumK}\) cannot finish maximal biclique enumeration within 24 h, we record two statistics of the first-k output maximal bicliques: (1) the total number of output edges, denoted as All, and (2) the number of unique output edges, denoted as Uni. Besides, the enumeration process easily gets stuck in local search, so the search order has a great influence on the result of the first-k bicliques. Thus, we adopt two search orders in \(\mathsf {EnumK}\): we iteratively add \(v\in V\) into the biclique in descending order (denoted as Desc) or ascending order (denoted as Asc) of the number of v's neighbors in U. Intuitively, these orders enumerate the maximal bicliques in the dense region or the sparse region of the bipartite graph, respectively. From Fig. 19, we can see that for the Desc order, when the output biclique number increases, the total number of output edges increases as well. However, the number of unique edges barely grows, which indicates that \(\mathsf {EnumK}\) enumerates many redundant maximal bicliques with very limited effective information when searching in the dense region of the graph. In comparison, for the Asc order, both the total output edges and the unique edges increase. However, the average size of the first-16000 maximal bicliques is only 12, which is too small to be useful in the anomaly detection application, with a precision of only 33.23% compared with the ground truth. The computation cost of \(\mathsf {EnumK}\) is also high, and the algorithm outputs huge numbers of maximal bicliques (over \(10^9\) bicliques in 24 h). In conclusion, maximal biclique enumeration is not suitable for this case study of anomaly detection on large-scale graphs.

Fig. 20 Precision and recall rate of \(\mathsf {Reduce}\)

Reduce result evaluation Given specific \(\tau _U\) and \(\tau _V\) values, we can detect fraudulent transactions with \(\mathsf {Reduce}\). In this experiment, we vary \(\tau _V\) from 2 to 5, and for each \(\tau ^k_V\), we set two corresponding \(\tau ^k_U\) values, i.e., a small value \(\tau _U^{s_k}\) for a loose condition and a large value \(\tau _U^{l_k}\) for a strict condition. All \(\tau _U\) values are suggested by the anomaly detection experts at Alibaba. Due to confidentiality, we omit the exact values. For simplicity, we use \(\tau _U^{s}\) and \(\tau _U^{l}\) to represent the loose and strict constraints for all \(\tau _V^k\).

We evaluate the performance in terms of precision and recall rate, and present the results in Fig. 20. In Fig. 20a, the precision of \(\mathsf {Reduce}\) improves as \(\tau _V\) grows larger, since the more common products a group bought together, the more suspicious the transactions are. Similarly, a larger \(\tau _U\) also leads to higher precision with fixed \(\tau _V\). However, the precision does not meet the requirement of at least 95% (from Alibaba). In Fig. 20b, the recall rate is relatively high, especially for the loose constraint \(\tau ^s_U\), due to the fact that we only take advantage of the topological structure of the graph. However, we gain the high recall rate at the cost of low precision and a large number of output edges (over \(10^7\) edges in all settings). Besides, the result quality depends heavily on the given \(\tau _U\) and \(\tau _V\) thresholds, which cannot be easily adapted to different datasets manually. Therefore, \(\mathsf {Reduce}\) is not suitable for anomaly detection in this case study.

8 Related work

In this section, we review the related work, including maximum biclique search and its variants, maximal biclique enumeration and diversified top-k search.

Maximum biclique search and its variants The maximum biclique problem has become increasingly popular in recent years [14, 42, 43]. Reference [43] proposes an integer programming methodology to find the maximum biclique in general graphs. However, it is not applicable to large-scale graphs. Reference [42] develops a Monte Carlo algorithm for extracting a list of maximal bicliques, which contains a maximum biclique with fixed probability. Reference [14] studies the parameterized maximum biclique problem in bipartite graphs, which reports whether there exists a biclique with at least p edges, where p is a given integer parameter. Besides, there are two variants of the maximum biclique problem, i.e., the maximum vertex biclique and the maximum balanced biclique. The former aims to find the biclique \(C^*\) such that \(|U(C^*)|+|V(C^*)|\) is maximized. This problem can be solved in polynomial time by a minimum cut algorithm [33]. The latter aims to find the biclique \(C^*\) with maximum cardinality such that \(|U(C^*)|=|V(C^*)|\). The most popular approaches are heuristic algorithms, including [2, 44, 55], which solve the problem by converting it into a maximum balanced independent set problem on the complement bipartite graph with node deletion strategies, and [60], which combines tabu search and graph reduction to find the maximum balanced biclique on the original bipartite graph. References [53, 56] propose local search frameworks to find good solutions within reasonable time. References [32, 61] introduce exact algorithms to find the maximum balanced biclique by following the branch-and-bound framework.

Maximal biclique enumeration The maximal biclique enumeration problem is widely studied. A biclique is said to be maximal if it is not contained in any larger biclique. Reference [3] proposes a consensus approach, which starts with a collection of simple bicliques and then expands them through a sequence of transformations on the biclique collections. References [36, 41] find maximal bicliques \(C=(U,V,U\times V)\) by exhaustively enumerating U as subsets of one vertex partition, obtaining V as their common neighbors in the other vertex partition, and then checking the maximality of C. In [59], the authors propose the algorithm \(\mathsf {iMBEA}\), which combines backtracking with a branch-and-bound framework to filter out the branches that cannot lead to maximal bicliques. References [15, 29] reduce the problem to the maximal clique enumeration problem by transforming the bipartite graph into a general graph. Reference [21] proves that maximal bicliques are in one-to-one correspondence with frequent closed itemsets. Maximal biclique enumeration can then be reduced to the well-studied frequent closed itemset mining problem [13, 27, 50, 52]. References [35, 37] propose parallel methods to enumerate maximal bicliques in large graphs.

Diversified top-k search The diversified top-k search problem has been extensively studied; it aims to find top-k results that are not only most relevant to a query but also diversified. In the literature, most existing solutions focus on finding diversified top-k results for a specific query. For example, Lin et al. study the k most representative skyline problem [22]. References [1, 6] focus on diversified top-k document retrieval. Reference [62] studies diversified keyword query recommendation. Reference [12] focuses on diversified top-k graph pattern matching. Reference [24] studies the problem of top-k shortest paths with diversity. Zhang et al. study the diversified top-l (k, r)-core [58]. Yuan et al. and Wu et al. study the diversified top-k clique search problem [54, 57]. Nevertheless, the techniques developed for diversified top-k clique search are not suitable for our diversified top-k biclique search problem. Some other works study general frameworks for diversified top-k search. For example, [10, 39, 40, 51] study the general diversified top-k results problem. References [8, 34] study top-k result diversification in a dynamic environment. The complexity of query result diversification is analyzed in [9]. Nevertheless, the diversity in the above frameworks is based on the pair-wise dissimilarity of the query results, which cannot be applied directly to the diversified top-k biclique search problem studied in this paper.

9 Conclusion

Maximum biclique search in a bipartite graph is a fundamental problem with a wide spectrum of applications. Existing solutions are not scalable to large bipartite graphs because the search has to consider the sizes of both sides of the biclique. In this paper, instead of solving the problem directly on the original bipartite graph, we propose a progressive bounding framework which solves the problem on several much smaller bipartite graphs. We prove that only logarithmically many rounds are needed to guarantee the algorithm's correctness, and in each round, we show how to significantly reduce the bipartite graph size by considering the properties of the one-hop and two-hop neighbors of each vertex. Based on the maximum biclique search method, we further propose an efficient algorithm to find the diversified top-k bicliques, which is also desirable in many applications. By taking advantage of the progressive bounding framework, we derive the same subspaces for different results by slightly relaxing the constraints in each subspace, so as to share the computation cost among these results. We further propose two optimizations that accelerate the computation by pruning the search space and lazily refining candidates. We conducted experiments on real datasets from different application domains, two of which contain billions of edges. The experimental results demonstrate that our approach is efficient and scalable in handling large bipartite graphs. It is reported that a 50% improvement in recall can be achieved after applying our method at Alibaba Group to identify fraudulent transactions.