1 Introduction

In recent years, human society has become increasingly information-driven, and new technologies that produce massive amounts of data, such as e-commerce platforms and social networks, have influenced every aspect of our daily lives. As a result, the graph, an important representation structure for big data, has attracted considerable attention. Due to the increasing popularity of graph data, numerous research works have been devoted to mining the valuable information in graph-structured data. Existing research mainly focuses on unipartite graphs, in which an edge can link any pair of entities. However, compared to unipartite graphs, multipartite graphs, such as bipartite graphs, have received less attention from the academic community despite their popularity and ubiquity in real-life applications. A bipartite graph has two mutually independent vertex classes, and edges only exist between vertices of different classes. For example, the user-page relationships between users and pages of Wiktionary can be represented by a bipartite graph, where edges indicate editing actions by users on pages. In such a graph, users (resp. pages) can be related to multiple pages (resp. users), but there is no user-user or page-page connection. Figure 1 presents an example of a bipartite graph of user-page relationships. Bob (\(s_1\)) and Lisa (\(s_3\)) collaborate to edit page B (\(t_2\)), while Jack (\(s_2\)), Lisa (\(s_3\)) and Sam (\(s_4\)) collaborate to edit pages A (\(t_1\)) and C (\(t_3\)).

Fig. 1 Example of a bipartite graph

Due to the ubiquity of bipartite graphs, the classification of bipartite graphs has become a fundamental tool in various fields [5]. For example, in a user-page bipartite graph such as Wiktionary, users and the corresponding pages in different languages form different bipartite graphs. Similarly, there can be different user-page bipartite graphs for edit relationships under different topics. In these cases, graph classification on bipartite graphs can be used to determine the language and topic preferences of users, and hence improve the user experience. Another application of bipartite graph classification is money-laundering detection. Considering the directed edges within a known money-laundering cycle on an e-commerce platform such as Amazon, we can learn the feature representations of these bipartite graphs and further use them to detect other potential money-laundering cycles. In addition to anti-money laundering, bipartite graph classification can also solve many other problems in e-commerce. For example, some unscrupulous merchants look for buyers to initiate bogus transactions: the merchants only need to mail empty packages to the buyers and pay small commissions, and the buyers then post many positive comments about the merchants' goods to increase their exposure on the e-commerce platform and boost sales. Bipartite graph classification can help e-commerce platforms find these unscrupulous merchants. In recent years, merchants exploiting the recommendation mechanism of e-commerce platforms to launch "Ride Item's Coattail" attacks have also become a matter of concern, and bipartite graph classification can likewise distinguish these cheating merchants. In addition, it is feasible to represent the interactions between secondary structures of proteins as bipartite graphs, and bipartite graph classification can serve as a basis for finding common substructures in proteins [26]. Thus, this task can also play an important role in protein discovery.

A large body of work has investigated the graph classification problem. Graph classification is usually more complex than vertex classification, since more and higher-order information must be considered. Although traditional kernel methods and GNN methods have achieved great success in vertex classification [7, 21], they cannot be directly adapted to the graph classification task.

Based on graph neural networks and capsule networks, some prominent methods [2, 16, 30, 42] have been proposed for the graph classification task. For example, [10] not only utilizes a powerful neural network, but also separates numerous important features while keeping them independent during training. This allows the model to capture hidden factors more clearly and achieve higher accuracy on the graph classification task. Further, HCGNN [39] takes the hierarchical information of the graph into account based on the capsule network, continuously synthesizing fine-grained information into more concentrated information, so that the final result better retains the details of the graph structure; it shows outstanding graph classification performance. However, these methods only specialize in the classification of unipartite graphs. If they are applied directly to bipartite graph classification, the relationships between vertices of the same type cannot be fully retained, because there are no connections between vertices of the same type in a bipartite graph and most methods propagate information along edges to capture the relationships between vertices.

Compared to traditional scalar-based neural networks, the capsule network, a vector-based neural network, represents features using mutually independent sets of vectors [38]. As a result, a capsule network can better characterize the information of a vertex or a graph. This ability to capture bipartite graph structure information makes the capsule network the basis of our model and crucial for the task of bipartite graph classification.

Contributions In this paper, we propose a novel method, named Bipartite Capsule Graph Neural Network (BCGNN), to achieve better classification performance on bipartite graphs. To preserve the structure, nature, and label information of the bipartite graph, BCGNN creates connections between vertices of the same type to build its one-mode projection, and then captures features with a hierarchical capsule network. Specifically, we first decide whether to establish a connection between each pair of vertices of the same type depending on the number of their common neighbors. Then, to represent the overall structural information of the bipartite graph, the structural information in the one-mode projection is extracted layer by layer using the hierarchical capsule network. Finally, the class capsules at the last layer are used to perform the bipartite graph classification task. The main contributions of the paper are summarized as follows:

  • To the best of our knowledge, we are the first to design graph neural networks on bipartite graphs for the graph classification task based on capsule networks.

  • Our model combines a hierarchical capsule network with one-mode projection, which allows us to better capture the relationships between vertices of the same type in a bipartite graph and preserve its structural information.

  • Extensive experiments on real-world graphs show that BCGNN outperforms state-of-the-art baseline methods on the bipartite graph classification task.

Organization The rest of the paper is organized as follows. We present the related concepts in Section 2. In Section 3, we introduce the proposed model. We report the experimental results on real-world datasets in Section 4. Finally, we review related work in Section 5 and conclude the paper in Section 6.

Table 1 Notation table

2 Preliminaries

In this section, we introduce some key definitions and important notations used in this paper. Table 1 summarizes the notations frequently used throughout the paper.

Definition 1

(Bipartite Graph) A bipartite graph can be denoted as \(G = (\mathcal {V}_S, \mathcal {V}_T ,\mathcal {E})\), where \(\mathcal {V}_S = \{s_1, s_2,...,s_m\}\) and \(\mathcal {V}_T = \{t_1, t_2,...,t_n\}\) are mutually exclusive vertex sets, and \(\mathcal {E} \subset \mathcal {V}_S \times \mathcal {V}_T\) is a set of edges that connect vertices across the two partitions.

It is important to note that the bipartite graphs of the same type that we use are all subgraphs of a certain dynamic bipartite graph. Accordingly, a dynamic bipartite graph can be defined as \(G_t = (\mathcal {V}_S, \mathcal {V}_T ,\mathcal {E}, \mathcal {T})\), where \(\mathcal {T}\) is the set of timestamps corresponding to all connection moments. For convenience, we denote the total vertex set as \(\mathcal {V} = \mathcal {V}_S \cup \mathcal {V}_T\) and the total number of vertices in the bipartite graph by \(|\mathcal {V}| = |\mathcal {V}_S| + |\mathcal {V}_T|\).

Definition 2

(One-Mode Projection) One-mode projection of a bipartite graph aims to construct projection graphs in which links exist between vertices of the same type, i.e., to build graphs \(G_S = (\mathcal {V}_S ,\mathcal {E}_S)\) and \(G_T = (\mathcal {V}_T ,\mathcal {E}_T)\), where \(\mathcal {E}_S \subset \mathcal {V}_S \times \mathcal {V}_S\) and \(\mathcal {E}_T \subset \mathcal {V}_T \times \mathcal {V}_T\). Their adjacency matrices are \(\varvec{A}_S \in \mathbb {R}^{|\mathcal {V}_S| \times |\mathcal {V}_S|}\) and \(\varvec{A}_{T} \in \mathbb {R}^{|\mathcal {V}_T| \times |\mathcal {V}_T|}\).

Definition 3

(Bipartite Graph Classification) A bipartite graph classification problem can be defined as follows. A learning machine receives a set of N training examples \(\{(G_1, y_1), (G_2, y_2), (G_3, y_3),...,(G_N, y_N)\}\), where each example \((G_i, y_i)\) is given as a pair of a bipartite graph \(G_i\) and its class \(y_i\), which is the label of the graph [13]. The bipartite graph classification problem is the problem of inferring the class label \(y_i\) corresponding to the graph \(G_i\).

Graph Neural Networks Existing graph neural networks usually adopt an aggregate and combine scheme as follows:

$$\begin{aligned} \varvec{z}^{(k)}_u = \mathcal {COM}^{(k)}(\varvec{z}^{(k-1)}_u, \mathcal {AGG}^{(k)}\{\varvec{z}^{(k-1)}_{u^{\prime }}; u^{\prime } \in N(u)\}), \end{aligned}$$
(1)

where \(\varvec{z}^{(k)}_u\) is the representation of vertex u at the \(k^{th}\) layer of the graph neural network, \(\mathcal {AGG}\) is the aggregation operation that iteratively updates the representation of vertex u by aggregating the representations of its neighbors, and \(\mathcal {COM}\) is the combine operation that updates the representation of vertex u from the aggregated representations and its own representation \(\varvec{z}^{(k-1)}_u\) from the previous layer. The main difference between graph neural networks lies in the design of the aggregate and combine mechanisms.
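To make this scheme concrete, the following minimal sketch implements one aggregate-and-combine step, assuming sum aggregation over the adjacency matrix and a linear combine followed by a ReLU activation; the function and parameter names are ours, not from any specific library.

```python
import numpy as np

def gnn_layer(Z, adj, W_self, W_agg):
    """One aggregate-and-combine step in the spirit of (1), assuming
    sum aggregation and a linear combine followed by ReLU."""
    aggregated = adj @ Z  # AGG: sum the neighbors' previous-layer representations
    # COM: merge each vertex's own representation with the aggregate
    return np.maximum(0.0, Z @ W_self + aggregated @ W_agg)
```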

3 Model

In this section, we introduce the details of BCGNN. Section 3.1 introduces the framework of our model; Section 3.2 describes how to create edges between vertices of the same type using the one-mode projection; Sections 3.3 and 3.4 introduce the capsule network in detail; and Section 3.5 illustrates the learning objective with the auxiliary graph reconstruction loss.

3.1 Framework

Different from traditional graph neural networks, capsule networks use activity vectors or pose matrices to represent entities. As a result, capsule networks are able to isolate numerous hidden factors and discern the relationships among them. Therefore, capsule networks can be very advantageous when applied to graphs with complex structures. However, due to the nature of bipartite graphs, vertices that share the same type lack connections with each other. Therefore, a capsule network applied directly to bipartite graphs cannot reach satisfactory performance on the graph classification task, since it cannot perform message passing properly when there are no edges between vertices in the same set. To solve this problem, we propose an effective model, BCGNN, to optimize the performance of the conventional capsule network on the bipartite graph classification task. BCGNN first generates edges between vertices of the same type based on the number of common neighbors between them, thereby converting the bipartite graph into its one-mode projection, which enables the GNN part of the capsule network to better extract information between vertices of the same type. With the built one-mode projection, we design a graph capsule network on the bipartite graph to preserve the interaction relationships between vertices both within the same set and across the two sets, as illustrated in Figure 2.

Fig. 2 The framework of the proposed BCGNN

3.2 One-mode projection

Since the features of vertices are aggregated along edges in capsule networks, the direct use of capsule networks on bipartite graphs usually results in unsatisfactory performance. Therefore, to enable our model to capture relationships not only between the two vertex sets but also between vertices of the same type, we first generate the one-mode projection of the bipartite graph as the input of the capsule network. The basic idea of generating the one-mode projection of a bipartite graph is to determine the number of common neighbors of every possible pair of vertices of the same type, and then add a connection between each pair of vertices whose number of common neighbors is greater than or equal to a certain threshold. Figure 3 shows the one-mode projection of the bipartite graph illustrated in Figure 1 with a threshold value of 2. Since there are two common neighbors for each of the pairs \(s_2\) and \(s_3\), \(s_2\) and \(s_4\), and \(s_3\) and \(s_4\), and three common neighbors between \(t_1\) and \(t_3\), all equal to or greater than the threshold, connections are established between these pairs of vertices. Although there is one common neighbor between \(s_1\) and \(s_3\), this does not reach the threshold, so no connection is established between them.

The implementation can be done by first enumerating all possible vertex pairs consisting of two vertices of the same type, then counting the number of common neighbors of the two vertices in each pair, and finally establishing connections between all pairs whose number of common neighbors is greater than or equal to a threshold value. However, the time complexity of this method is \(O(|\mathcal {V}| \times |\mathcal {V}| \times |\mathcal {E}|)\), which is cost-prohibitive. Therefore, instead of obtaining the one-mode projection in that way, we use the more efficient method given in Algorithm 1. The details are presented as follows.

Fig. 3 One-mode projection of the bipartite graph

Since the way to generate connections between vertices within one part (\(\mathcal {V}_S\) or \(\mathcal {V}_T\)) is the same for both parts, we only describe how to build connections among \(s \in \mathcal {V}_S\); the procedure for \(t \in \mathcal {V}_T\) is identical. First, for each vertex \(s_i\) in the vertex set \(\mathcal {V}_S\), its set of neighboring vertices, denoted by \(\mathcal {N}({s_i})\), is obtained by utilizing the edge set \(\mathcal {E}\). The respective neighbor sets \(\mathcal {N}({t})\) of all vertices \(t \in \mathcal {N}({s_i})\) can also be obtained. Due to the nature of the bipartite graph, there is no connection between vertices of the same type, so the vertices \(s_i\) and \(s_j \in \mathcal {N}({t}), s_j \ne s_i\), which are 2 hops away from each other, must be a pair of vertices of the same type with common neighbors in \(\mathcal {V}_T\). After that, it is easy to count the number of common neighbors of each pair of vertices in a container of size \(|\mathcal {V}_S| \times |\mathcal {V}_S|\) and judge whether that number reaches the threshold, i.e., whether a connection needs to be added between the pair of vertices. The time complexity of this algorithm is \(O(|\mathcal {V}| \times |\mathcal {N}({s_i})| \times |\mathcal {E}|)\), where \({|\mathcal {N}(s_i)|} \ll |\mathcal {V}|\). Further analysis shows that Algorithm 1 essentially performs a depth-first search of depth 2, so each vertex only needs to traverse at most \(|\mathcal {E}|\) edges, and the time complexity is equivalent to \(O(|\mathcal {V}| \times |\mathcal {E}|)\). The process of building the one-mode projection adjacency matrix on the vertices in \(\mathcal {V}_S\) is summarized in Algorithm 1. Using the above approach, we can obtain the graphs \(G_S\) and \(G_T\) and their adjacency matrices \(\varvec{A}_{S} \in \mathbb {R}^{|\mathcal {V}_S| \times |\mathcal {V}_S|}\) and \(\varvec{A}_{T} \in \mathbb {R}^{|\mathcal {V}_T| \times |\mathcal {V}_T|}\), which only involve vertices of the same type. Finally, we obtain the one-mode projection of the original bipartite graph, whose adjacency matrix can be represented as \(\varvec{A}_{OM} = \varvec{A}_O + \left[ \begin{array}{cc}\varvec{A}_{S} & \varvec{0}\\ \varvec{0} & \varvec{A}_{T}\end{array}\right]\), where \(\varvec{A}_{OM}\) is the adjacency matrix of the graph after one-mode projection and \(\varvec{A}_O \in \mathbb {R}^{|\mathcal {V}| \times |\mathcal {V}|}\) is the adjacency matrix of the original bipartite graph.

Algorithm 1 Building the one-mode projection adjacency matrix on \(\mathcal {V}_S\)
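As a concrete illustration, the following Python sketch implements the 2-hop counting procedure described above; the function and argument names are ours, and bipartite adjacency lists are assumed as input.

```python
from collections import defaultdict
import numpy as np

def one_mode_projection(num_s, neighbors_of_s, neighbors_of_t, tau):
    """Sketch of Algorithm 1: build A_S by counting 2-hop co-occurrences.
    neighbors_of_s[i] and neighbors_of_t[j] are the bipartite adjacency
    lists; tau is the common-neighbor threshold."""
    A_S = np.zeros((num_s, num_s), dtype=int)
    for s_i in range(num_s):
        counts = defaultdict(int)
        for t in neighbors_of_s[s_i]:        # 1st hop: neighbors of s_i
            for s_j in neighbors_of_t[t]:    # 2nd hop: co-neighbors via t
                if s_j != s_i:
                    counts[s_j] += 1         # one more common neighbor found
        for s_j, c in counts.items():
            if c >= tau:                     # threshold reached: add edge
                A_S[s_i, s_j] = 1
    return A_S
```

Applied to the graph in Figure 1 with tau = 2, this yields exactly the connections among \(s_2\), \(s_3\) and \(s_4\) shown in Figure 3.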

3.3 Graph capsule framework

The problem of graph classification is based on classifying the structures of individual graphs. A conventional GNN model can also extract features from graph structure and attribute information for downstream learning objectives, such as vertex classification and link prediction. However, conventional GNNs can handle neither the heterogeneous information of graphs nor their hierarchical structure. Consequently, a conventional GNN lacks the capability to perform well on the graph classification problem, especially for graphs with complex structures and information such as bipartite graphs. Different from conventional GNNs, the feature vectors in capsule networks are disentangled into multiple vectors representing different classes of features, and the parameters of the multi-layer perceptrons used for each disentangled feature vector are independent of each other during training, i.e., the parameters for each feature are not shared in the neural network. Therefore, capsule networks have significant advantages over conventional GNNs for graph classification problems.

In order to handle the different feature information embedded in the graph, we pass the feature vector of each vertex through multiple mutually independent fully connected layers and activation functions to obtain multiple mutually independent features representing different hidden factors, and then use these factors to build the most primitive capsules used afterwards. Specifically, given a vertex i in a bipartite graph G with feature vector \(x_i \in \mathbb {R}^d\), we pass the feature vector through \(\varvec{K}\) fully connected layers with different parameters and a nonlinear activation function to obtain the most primitive capsule, which is formulated as:

$$\begin{aligned} \varvec{Z}_{i,k} = \sigma (\varvec{W}_{k}^{T}x_i + b_k), \end{aligned}$$
(2)

where \(\varvec{Z}_{i, k} \in \mathbb {R}^{\frac{d}{\varvec{K}}}\) is the \(k^{th}\) hidden feature of vertex i, \(\varvec{W}_{k} \in \mathbb {R}^{d \times \frac{d}{\varvec{K}}}\) and \(b_k \in \mathbb {R}^{\frac{d}{\varvec{K}}}\) are the \(k^{th}\) learnable weight matrix and bias, each vertex has \(\varvec{K}\) hidden features, and \(\sigma\) denotes the activation function. Through (2), the feature vector of vertex i can be considered to have been converted into a set of \(\varvec{K}\) hidden feature vectors. As a result, the capsule of vertex i is \(\varvec{Z}_{i} \in \mathbb {R}^{\varvec{K} \times \frac{d}{\varvec{K}}}\). For simplicity, \(\varvec{Z}_{i}\) can be reshaped into the vector format \(z_i \in \mathbb {R}^d\). As mentioned in [39], the length of the capsule of a disentangled entity represents the probability that the corresponding hidden feature exists: the longer the capsule, the higher the probability of the entity's existence. Therefore, we need to normalize the length of the vector while preserving its direction, and the squash function is implemented as follows:

$$\begin{aligned} \varvec{\varTheta }_{i} = squash(z_i) = \frac{\Vert z_i\Vert ^2}{1+\Vert z_i\Vert ^2} \times \frac{z_i}{\Vert z_i\Vert } \end{aligned}$$
(3)

Thus, we can transform the feature vector of each vertex into the lowest-level, most preliminary capsule \(\varvec{\varTheta }_i^{(1)} \in \mathbb {R}^{d_1}\), and the vertices can be converted into the preliminary capsule set \(\varvec{\varTheta }^{(1)} \in \mathbb {R}^{|\mathcal {V}| \times d_1}\), where \(d_1\) is the overall length of any capsule in the preliminary capsule layer. Eventually, with a decreasing number of capsules, we can obtain the final graph classification result while preserving the hierarchical graph structural information.
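As a minimal sketch of (2) and (3), assuming ReLU as the activation and treating all names as illustrative:

```python
import numpy as np

def squash(z, eps=1e-8):
    """Squash function (3): preserves direction, maps length into [0, 1)."""
    norm = np.linalg.norm(z)
    return (norm ** 2 / (1.0 + norm ** 2)) * (z / (norm + eps))

def primary_capsule(x, Ws, bs):
    """Primary capsule of one vertex following (2): K independent fully
    connected maps produce K disentangled hidden features, which are
    concatenated (the reshape of Z_i into z_i) and squashed.
    Ws: list of K (d, d/K) weight matrices; bs: list of K (d/K,) biases."""
    parts = [np.maximum(0.0, W.T @ x + b) for W, b in zip(Ws, bs)]
    return squash(np.concatenate(parts))
```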

3.4 Graph capsule layers

In this section, we introduce the layers of our capsule network in detail. Hierarchical information clearly plays an important role in graph classification. For example, accurately inferring some important substructures (functional groups) within a protein (chemical compound) can greatly help us predict the properties of the protein or compound. In this paper, we utilize the capsule network to preserve the hierarchical information in the graph and hence improve the graph classification performance of our proposed model. In order to obtain the hierarchical information, we map the bottom capsules to the top capsules layer by layer, continuously extracting and integrating the structural information hidden at different levels. Finally, we obtain the last layer of capsules, in which the number of capsules equals the number of graph classes. Based on the lengths of these capsules, the class of the graph can be predicted. We refer to this last layer as the class capsule layer. More specifically, each capsule in the previous layer generates a corresponding vote for each capsule in the later layer. To pass the features with attentional tendency, a weighting parameter is acquired for each vote, and the next layer of capsules is obtained by weighting and summing these votes. The weighting parameter is computed based on the similarity between a vote from the previous layer and the capsule in the next layer: the higher the similarity, the larger the weight. In this way, features at lower levels can be informatively and hierarchically transmitted to features at higher levels.

First, we use a GNN to aggregate the \(N_l\) capsules of the \(l^{th}\) layer \(N_{l+1}\) times to obtain the votes from each capsule at the \(l^{th}\) layer to all the capsules at the \((l+1)^{th}\) layer, where \(N_l\) is the number of capsules in the \(l^{th}\) layer. As introduced in Section 2, different GNNs utilize different aggregate and combine mechanisms. Specifically, in this work, we choose the graph convolutional network [14] (GCN) as the GNN method to compute the votes. GCN aggregates the neighbor representations by summation over a normalized adjacency matrix \(\widetilde{\varvec{D}}^{-\frac{1}{2}}\widetilde{\varvec{A}}\widetilde{\varvec{D}}^{-\frac{1}{2}}\), where \(\widetilde{\varvec{A}}\) is the adjacency matrix \(\varvec{A}\) with self-loops, i.e., \(\widetilde{\varvec{A}} = \varvec{A} + \varvec{I}_N\), and \(\widetilde{\varvec{D}}\) is the diagonal degree matrix of \(\widetilde{\varvec{A}}\) with \(\widetilde{\varvec{D}}_{ii} = \sum _j\widetilde{\varvec{A}}_{ij}\). Consequently, GCN can be formulated as follows:

$$\begin{aligned} \varvec{H}^{(l+1)} = \sigma (\widetilde{\varvec{D}}^{-\frac{1}{2}}\widetilde{\varvec{A}}\widetilde{\varvec{D}}^{-\frac{1}{2}}{\varvec{H}}^{(l)}{\varvec{W}}^{(l)}), \end{aligned}$$
(4)

where \(\varvec{H}^{(l)}\) is the hidden feature matrix at the \(l^{th}\) layer and \(\varvec{H}^{(0)}\) is the input representation of the vertices, \(\varvec{W}^{(l)}\) is a trainable weight matrix for the \(l^{th}\) layer, and \(\sigma\) is the nonlinear activation function. With the help of GCN, BCGNN captures the neighborhood information of the graph via message passing along the edges, and eventually obtains the vector representations of the vertices. In addition, to generate the feature vector of a vertex in the latter layer without losing its feature in the current layer, it is necessary to add self-loops to all vertices, so the adjacency matrix used in aggregation is \(\widetilde{\varvec{A}}\). The degree matrix \(\widetilde{\varvec{D}}\) is also applied to the adjacency matrix for normalization.
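For concreteness, a minimal sketch of one GCN propagation step (4), assuming ReLU as the activation; names are illustrative:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One GCN step as in (4): add self-loops, symmetrically normalize,
    propagate, and apply a ReLU nonlinearity."""
    A_tilde = A + np.eye(A.shape[0])                          # self-loops
    D_inv_sqrt = np.diag(1.0 / np.sqrt(A_tilde.sum(axis=1)))  # D^{-1/2}
    return np.maximum(0.0, D_inv_sqrt @ A_tilde @ D_inv_sqrt @ H @ W)
```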

In our capsule network, the capsules in the first layer are directly obtained from the vertices of the graph; therefore, there are \(|\mathcal {V}|\) capsules initially, where \(|\mathcal {V}|\) is the number of vertices in the graph G. The GCN is applied directly to the graph built by the one-mode projection described in Section 3.2 to obtain the votes for the next layer of capsules. It is worth noting that the graph fed into the capsule network is the one-mode projection of the bipartite graph, whose adjacency matrix is \(\varvec{A}^{(1)} \in \mathbb {R}^{|\mathcal {V}| \times |\mathcal {V}|}\). Practically, we perform one layer of aggregation on the input graph, so the GCN used in BCGNN is formulated as follows:

$$\begin{aligned} \mu _j^{(l)} = \sigma [({\widetilde{\varvec{D}}_{OM}^{(l)}})^{-\frac{1}{2}}\widetilde{\varvec{A}}_{OM}^{(l)}({\widetilde{\varvec{D}}_{OM}^{(l)}})^{-\frac{1}{2}}{\varvec{\varTheta }}^{(l)}{\varvec{W}_{j}}^{(l)}], \end{aligned}$$
(5)

where \(\mu _j^{(l)}\) is the vote of the \(l^{th}\) layer's capsules for capsule j of the \((l+1)^{th}\) layer, \(\widetilde{\varvec{A}}_{OM}^{(l)} = {\varvec{A}}_{OM}^{(l)} + {\varvec{I}}_N\), \(\widetilde{\varvec{D}}_{OM}^{(l)}\) is the diagonal degree matrix of \(\widetilde{\varvec{A}}_{OM}^{(l)}\), and \(\varvec{W}_j^{(l)}\) is a trainable weight matrix for the \(l^{th}\) layer's capsules, used to generate the votes for capsule j in the \((l+1)^{th}\) layer. The operation of computing a vote is referred to as voting.

Then, a weight parameter c must be learned for every vote. To ensure that the weights corresponding to the votes from the same layer are normalized, we require that c sums to 1 over all capsules in the next layer, i.e., \(\sum _{j=1}^{N_{l+1}} c_{i, j} = 1\), where \(c_{i, j}\) denotes the weight of the vote from capsule i in the \(l^{th}\) layer to capsule j in the \((l+1)^{th}\) layer. For this purpose, we use an auxiliary parameter b to learn the appropriate parameter c. Specifically, after initializing \(b \leftarrow 0\), we iteratively perform the following steps for every capsule in the subsequent layer:

  • 1. Apply the softmax function to transform b into c as follows:

    $$\begin{aligned} c_{i, j}^{(l)} = \frac{exp(b_{i, j}^{(l)})}{\sum _kexp(b_{i, k}^{(l)})}, \end{aligned}$$
    (6)

    where \(c_{i,j}^{(l)}\) is the weight parameter of capsule i in the \(l^{th}\) layer for its vote to capsule j in the \((l+1)^{th}\) layer, and \(b_{i, j}^{(l)}\) corresponds to \(c_{i, j}^{(l)}\). Using (6), we obtain a set of weights \(c^{(l)}\) for all votes from the \(l^{th}\) layer to capsule j in the \((l+1)^{th}\) layer.

  • 2. All the weighted votes from the \(l^{th}\) layer for capsule j in the \((l+1)^{th}\) layer are summed and squashed to obtain the feature vector of capsule j as follows:

    $$\begin{aligned} \varvec{\varTheta }_j^{(l+1)} = squash(\sum _ic_{i, j}^{(l)}\mu _{i, j}^{(l)}), \end{aligned}$$
    (7)

    where \(\mu _{i, j}^{(l)}\) is the vote of capsule i to capsule j.

  • 3. Judge the similarity between the capsule j obtained from (7) and each vote from layer l for capsule j, then update the parameter b according to the similarity as follows:

    $$\begin{aligned} b_{i, j}^{(l)} \leftarrow b_{i, j}^{(l)} + \mu _{i, j}^{(l)} \cdot \varvec{{\varTheta }}_j^{(l+1)} \end{aligned}$$
    (8)

Repeating the above three operations is referred to as routing.
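The following Python sketch summarizes the routing loop over steps 1-3, reusing the squash helper from the earlier sketch; the tensor layout and names are our own illustration.

```python
import numpy as np

def routing(votes, R=3):
    """Dynamic routing over (6)-(8). votes[i, j] is the d-dimensional
    vote mu_{i,j} of lower capsule i for upper capsule j; returns the
    upper-layer capsules and the final weight matrix C."""
    N_l, N_next, _ = votes.shape
    b = np.zeros((N_l, N_next))                                # logits, b <- 0
    for _ in range(R):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)   # (6): softmax
        theta = np.stack([squash((c[:, j, None] * votes[:, j]).sum(axis=0))
                          for j in range(N_next)])             # (7): weighted sum
        b = b + np.einsum('ijd,jd->ij', votes, theta)          # (8): agreement
    return theta, c
```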

After R iterations of the above three operations, we obtain a refined capsule j in the \((l+1)^{th}\) layer and the set of weight parameters \(C_j^{(l)} \in \mathbb {R}^{N_{l}}\) for all the votes corresponding to capsule j. When all the capsules of the \((l+1)^{th}\) layer have been obtained, we also have the full parameter matrix \(C^{(l)} \in \mathbb {R}^{N_{l} \times N_{l+1}}\). Using \(C^{(l)}\), we obtain the adjacency matrix for the capsules of the \((l+1)^{th}\) layer as follows:

$$\begin{aligned} \varvec{A}^{(l+1)} = \varvec{C}^{{(l)}^T} {\varvec{A}}^{(l)} {\varvec{C}^{(l)}} \end{aligned}$$
(9)

Please note that since \(N_{l+1} < N_{l}\), the number of capsules involved in the computation decreases after each layer. Therefore, BCGNN learns the representation of the graph while preserving its structural and attribute information hierarchically. By repeating the above operations with the capsules and adjacency matrix of the \((l+1)^{th}\) layer, we obtain the capsules of the next layer, until we reach the class capsule layer used for output, which has the same number of capsules as the number of graph classes.

In order to better retain and transmit features from the previous layer to the next, drawing on the approach of [23], we add a residual connection between each pair of consecutive capsule layers as follows:

$$\begin{aligned} \varvec{{\varTheta }}^{(l+1)} \leftarrow \widetilde{\varvec{{\varTheta }}}^{(l+1)} + {\varvec{M}}({\varvec{\varTheta }}^{(l)}), \end{aligned}$$
(10)

where \({\varvec{M}}(\cdot )\) denotes the global average function, \({\widetilde{\varvec{\varTheta }}}^{(l+1)}\) denotes the \((l+1)^{th}\) capsule layer before the information of the previous capsule layer has been incorporated, and \({\varvec{\varTheta }}^{(l+1)}\) is the final \((l+1)^{th}\) capsule layer.
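A one-line sketch of (10), where we read \({\varvec{M}}(\cdot)\) as the mean over the previous layer's capsules broadcast to every capsule of the new layer; this reading is our assumption:

```python
import numpy as np

def residual_connection(theta_next, theta_prev):
    """Residual connection (10); M(.) is assumed to be the global
    average of the previous layer's capsules, broadcast over the rows."""
    return theta_next + theta_prev.mean(axis=0, keepdims=True)
```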

3.5 Learning objectives

Once the class capsules in the output layer \({\varvec{\varTheta }}^{(L)} \in \mathbb {R}^{|\Gamma | \times d_L}\) are obtained, where \(\Gamma\) is the set of graph class labels, the probability of a certain class can be judged by the length of the corresponding capsule's feature vector [39]. Thus, the classification loss can be measured by the following margin loss function:

$$\begin{aligned} \mathcal {L}_m({\varvec{\varTheta }}^{(L)}) = \sum \limits _{\gamma \in \Gamma }[{\varvec{T}}_{\gamma }max(0, m^+ - ||{\varvec{\varTheta }}_{\gamma }^{L}||)^2 + \lambda (1-{\varvec{T}}_{\gamma })max(0, ||{\varvec{\varTheta }}_{\gamma }^{L}|| - m^-)^2], \end{aligned}$$
(11)

where \(m^+\) and \(m^-\) are marginal coefficients, set to 0.9 and 0.1 respectively in this work, and \(T_{\gamma }\) is the class label indicator, which equals 1 iff the graph has label \(\gamma\), and \(T_{\gamma } = 0\) otherwise.
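A short numpy sketch of (11), with the hyperparameters used in this work; names are illustrative:

```python
import numpy as np

def margin_loss(theta_L, T, m_pos=0.9, m_neg=0.1, lam=0.5):
    """Margin loss (11). theta_L: (|Gamma|, d_L) class capsules;
    T: one-hot indicator vector T_gamma over the classes."""
    lengths = np.linalg.norm(theta_L, axis=1)              # ||Theta_gamma||
    pos = T * np.maximum(0.0, m_pos - lengths) ** 2        # true-class term
    neg = lam * (1 - T) * np.maximum(0.0, lengths - m_neg) ** 2
    return (pos + neg).sum()
```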

To preserve the original graph structural information during training and to improve training stability, we use a reconstruction loss to constrain the training. The core idea is to decode the class capsules into a matrix that is close to the adjacency matrix of the initial capsule layer.

Specifically, we take the output class capsules \({\varvec{\varTheta }}^{(L)}\) of BCGNN as the input and use a fully connected network to map them back to a matrix with the same dimensions as the primary capsules, i.e., \(\mathbb {R}^{|\mathcal {V}| \times d_1}\), with the following equation:

$$\begin{aligned} {\varvec{Z}}_r = {\varvec{\varTheta }}^{(1)} +({\varvec{W}}_r^{T}\varPhi ({\varvec{\varTheta }}^{(L)}) + b_r), \end{aligned}$$
(12)

where \(\varPhi\) is the mask operation, \(\varvec{W}_r \in \mathbb {R}^{(|\Gamma | \times d_L) \times d_1}\) is a learnable parameter matrix, \(b_r \in \mathbb {R}^{d_1}\) is a learnable bias vector, and \(\varvec{Z}_r \in \mathbb {R}^{|\mathcal {V}| \times d_1}\). Then, based on \({\varvec{Z}}_r\), a matrix with the same dimensions as the adjacency matrix of the preliminary capsules can be obtained by \({\varvec{A}}_r = {\varvec{Z}}_r{\varvec{Z}}_r^T\), which is the preliminary adjacency matrix re-decoded from the class capsules. The reconstruction loss is then implemented as follows:

$$\begin{aligned} \mathcal {L}_r({\varvec{A}}^{(1)},{\varvec{A}}_r) = -\frac{1}{N_1^2}\sum \limits _{a=1}^{N_1} \sum \limits _{b=1}^{N_1} [{\varvec{A}}_{a, b}^{(1)}log({\varvec{A}_r}_{a, b}) + (1 - {\varvec{A}}_{a, b}^{(1)})log(1 - {\varvec{A}_r}_{a, b})] \end{aligned}$$
(13)

It is worth noting that since \({\varvec{A}}^{(1)}\) is the adjacency matrix of the one-mode projection of the original bipartite graph, \({\varvec{A}}^{(1)} \in \{0, 1\}^{N_1\times N_1}\). We clamp the values greater than 1 in \({\varvec{A}}_r\) to 1.

Finally, the loss function for optimization is shown below:

$$\begin{aligned} \mathcal {L} = \mathcal {L}_m({\varvec{\varTheta }}^{(L)}) + \beta \mathcal {L}_r({\varvec{A}}^{(1)},{\varvec{A}}_r), \end{aligned}$$
(14)

where \(\beta\) adjusts the relative importance of \(\mathcal {L}_r\) in the overall loss function \(\mathcal {L}\).
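A compact sketch of (13) and (14), assuming the clamping described above and reusing the margin_loss helper from the earlier sketch; all names are illustrative:

```python
import numpy as np

def reconstruction_loss(A1, Z_r, eps=1e-8):
    """Reconstruction loss (13): binary cross-entropy between the
    one-mode-projection adjacency A^{(1)} and A_r = Z_r Z_r^T."""
    A_r = np.clip(Z_r @ Z_r.T, 0.0, 1.0)      # clamp values above 1 to 1
    N1 = A1.shape[0]
    bce = A1 * np.log(A_r + eps) + (1 - A1) * np.log(1 - A_r + eps)
    return -bce.sum() / (N1 * N1)

def total_loss(theta_L, T, A1, Z_r, beta=0.1):
    """Overall objective (14): margin loss plus weighted reconstruction."""
    return margin_loss(theta_L, T) + beta * reconstruction_loss(A1, Z_r)
```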

4 Experiment

In this section, we experimentally demonstrate the ability of BCGNN to classify bipartite graphs. We attempt to answer the following two research questions:

  • Q1. Has the utilization of one-mode projection and hierarchical capsule network led to improved classification results?

  • Q2. How much does the proposed method improve over the baselines?

To answer these questions and further validate the superiority of our proposed method, experiments are conducted on seven sets of bipartite graphs generated from seven real-world temporal bipartite graphs.

4.1 Datasets and baselines

The datasets used in the experiments are generated from seven temporal bipartite graphs:

  • edit-nawiki, edit-dvwiktionary, edit-ltwikisource, edit-mswikibooks, edit-sswiktionary, edit-bgwikisource and edit-tawikiquote contain users and pages from the Nauru Wikipedia, the Divehi Wiktionary, the Lithuanian Wikisource, the Malay Wikibooks, the Swati Wiktionary, the Bulgarian Wikisource and the Tamil Wikiquote, connected by edit events. Each edge represents an edit, and each dataset includes the timestamp of each edit. The statistics of these datasets and the groups of graphs generated from each of them are summarized in Table 2.

Table 2 Statistics of Datasets

To examine the effectiveness of our proposed framework, we compare BCGNN with the following baseline methods:

  • AWE [11] and WWL [27] are kernel-based graph classification methods.

  • DGCNN [42] and HaarPool [33] are state-of-the-art deep neural network methods for graph classification.

  • CapsGNN [38] is the first work to adapt the capsule network to graph neural networks, achieving significant improvement on the graph classification task compared to conventional graph classification models.

  • HCGNN [39] utilizes the capsule network to preserve the hierarchical information in graphs, and hence achieves state-of-the-art performance for the graph classification problem.

4.2 Experiment settings

We generate new bipartite graphs from vertices and edges that appear in the same time slot, based on separating all edges of the temporal bipartite graph by timestamp. Since bipartite graphs generated from the same original bipartite graph have similar structures and attribute information, we group them into the same category, i.e., they share the same label. The specific steps for graph generation are shown in Algorithm 2. In the algorithm, line 1 sorts the edges of the input temporal bipartite graph in ascending order of timestamp, obtaining the sequence of ordered edges \(\mathcal {E}\) and their corresponding timestamps \(\mathcal {T}\). Line 2 decides a time slot length, which is used to split the time interval for each subgraph. Line 9 ensures that there are no duplicate edges in each subgraph. Line 16 controls the range of the number of edges in each subgraph, so that all generated graphs have similar sizes, and line 24 ensures that the generated graphs are connected. If non-connected graphs exist, they are divided into multiple connected graphs, and graphs that do not have the required number of edges are removed. More specifically, for each graph of \(\Psi _L\), we perform the following operations. We deposit each edge into a preparatory graph: if a vertex of the edge appears in one of the preparatory graphs, the edge is added to that preparatory graph; otherwise, a new preparatory graph is created and the edge is added to it. After this operation has been performed on all edges, the preparatory graphs with duplicate vertices are merged to obtain the final set of connected graphs.

Algorithm 2 Generating bipartite graphs from a temporal bipartite graph
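The following Python sketch captures our reading of Algorithm 2's flow (sorting, time slotting, de-duplication, size control, and connectivity splitting); the slot logic and all names are assumptions, not the authors' code.

```python
import networkx as nx

def generate_graphs(edges_with_time, slot_len, min_edges, max_edges):
    """Rough sketch of the graph-generation procedure described above.
    edges_with_time: iterable of (s, t, timestamp) triples; vertex IDs
    are assumed to be globally unique across the two partitions."""
    edges = sorted(edges_with_time, key=lambda e: e[2])   # sort by timestamp
    t0 = edges[0][2]
    slots = {}
    for s, t, ts in edges:
        # bucket edges by time slot; using a set drops duplicate edges
        slots.setdefault(int((ts - t0) // slot_len), set()).add((s, t))
    graphs = []
    for edge_set in slots.values():
        g = nx.Graph()
        g.add_edges_from(edge_set)
        # split non-connected subgraphs into connected components and keep
        # only those whose edge count falls within the required range
        for comp in nx.connected_components(g):
            sub = g.subgraph(comp)
            if min_edges <= sub.number_of_edges() <= max_edges:
                graphs.append(sub.copy())
    return graphs
```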

Consequently, we obtained a graph set with a total of 1080 graphs in seven classes, in which the largest class has 216 graphs and the smallest class has 75 graphs. In addition, we set the disentangled feature number \(\varvec{K}=4\), the routing iteration number \(\varvec{R}=3\), \(\lambda =0.5\), \(\beta =0.1\) and \(\varvec{L}=2\) in our experiments, chose Adam as the optimizer with learning rate \(lr= 0.001\), and used 10-fold cross-validation to train the model. The capsule dimension was set to 128, while the input feature dimensions of the vertices were generated based on the size of the bipartite graph and the degree of each vertex; in the experiments, the feature vector dimension of each vertex is 106. We take the average of the 10 predictions as the final accuracy and report their standard deviation as the floating range of the accuracy.

4.3 Bipartite graph classification results

Fig. 4 Experimental results for bipartite graph classification

The experimental results are presented in Figure 4. Our proposed BCGNN achieves higher accuracy than all baselines on the bipartite graph classification task.

Among the baselines, AWE, WWL and CapsGNN are less accurate on the graph classification task because they do not consider the hierarchical information of the bipartite graph, which shows that hierarchical information plays an important role in the graph classification problem. The WWL model is worth noting. Inspired by WL, the WWL algorithm computes the ground distance between all pairs of vertices in two graphs, and then obtains the Wasserstein distance between the two graphs to predict their structural similarity. This full utilization of vertex features makes the model very robust. However, when applied to bipartite graphs, it neither extracts the hierarchical structure well nor allows information to be exchanged well between vertices of the same type. Therefore, WWL cannot reach the accuracy of BCGNN. Although DGCNN and HaarPool also consider hierarchical information, HCGNN, with the help of the capsule network, is able to integrate this information better; as a result, the accuracy of HCGNN is better than that of DGCNN and HaarPool. Although CapsGNN also uses the capsule network, its results are not satisfactory, which shows that simply using a capsule network for the bipartite graph classification task does not yield the desired results.

Compared with HCGNN, BCGNN is tailored to the characteristics of bipartite graphs by establishing the one-mode projection of the original bipartite graph, so the preliminary capsules generated from vertices of the same type can be aggregated better and exchange feature information among themselves. Hence BCGNN achieves higher accuracy on bipartite graph classification than HCGNN. This shows that taking the one-mode projection of a bipartite graph before processing it with a GNN-based algorithm is an effective way to enhance performance.

4.4 Parameter analysis

Fig. 5 Parameter analysis results of BCGNN

We conduct parameter analysis experiments on the following parameters: the disentangled feature number \(\varvec{K}\), the number of routing iterations \(\varvec{R}\), the number of hidden-layer capsules, and the learning rate lr. The analysis results are shown in Figure 5.

For BCGNN, the most important parameter is the disentangled feature number \(\varvec{K}\). We tested five values \(\{2, 4, 8, 16, 32\}\) for \(\varvec{K}\); the results are shown in Figure 5(a). We can conclude from the experimental results that the classification accuracy is fairly robust to \(\varvec{K}\). The best accuracy is achieved when \(\varvec{K}\) is 4, and the performance of BCGNN worsens as \(\varvec{K}\) increases further. Therefore, \(\varvec{K}\) was set to 4.

In addition, \(\varvec{R}\), the number of routing iterations, is also an important parameter. We tested all values from 2 to 6 for \(\varvec{R}\); the experimental results are shown in Figure 5(b). They show that the accuracy is highest at \(\varvec{R}=3\) and decreases as the value increases further, so we choose 3 as the value of \(\varvec{R}\).

In order to test the effect of the number of hidden-layer capsules, we tested six settings with 5, 10, 15, 20, 25 and 30 capsules in the hidden layer. The experimental results are shown in Figure 5(c). Since the accuracy of bipartite graph classification decreases as the number of capsules grows beyond 10, we choose 10 as the number of capsules in our hidden layer, which gives better results.

Finally, we tested five learning rates {\(1 \times 10^{-5}, 1 \times 10^{-4}, 5 \times 10^{-3}, 1 \times 10^{-3}, 0.01\)}. The experimental results, shown in Figure 5(d), indicate that the learning rate has a considerable impact on the accuracy, so we choose the best-performing value, \(1 \times 10^{-3}\), as our learning rate.

5 Related work

In this section, we present the related works from the following four perspectives.

Bipartite graph related neural networks Numerous research works [8, 9, 17, 20, 34, 37, 41] have focused on the analysis of bipartite graphs with neural networks. Among them, [17, 20, 37, 41] use neural networks on bipartite graphs to implement efficient recommender systems, while [8, 34] focus on cancer survival prediction and drug-disease association prediction, and [9] delves into the vertex representation learning problem. However, these methods focus on microscopic vertex and edge information and cannot be directly used for the macroscopic graph classification task.

Graph classification Aiming to solve the graph classification problem, a variety of methods [12, 13, 15, 16, 18, 25, 31, 35, 36, 42] have been proposed. These works implement graph classification using techniques such as mathematical programming [12, 35], multiview learning [36], reinforcement learning [16], feature selection [15], graph kernels [13, 18], and graph neural networks [42]. However, these methods are designed for unipartite graphs and cannot be directly generalized to the bipartite graph classification problem.

Bipartite graph analytics Nowadays, with the increasing popularity of bipartite graphs, several methods have been proposed for bipartite graph analytics, such as [1, 3, 4, 22], which find meaningful community structures in bipartite graphs. [26] presents a bipartite graph matching method for protein structures, which is consequently used for a protein graph classification application.

Capsule network Recently, the capsule network [10] was proposed and achieved state-of-the-art performance on the image classification problem. Due to its excellent performance, many methods [6, 19, 29, 32, 40] achieve excellent results on graph-related problems by applying capsule networks to graphs, and [24, 28, 38] also accomplish outstanding results on the graph classification task. However, since these methods are mainly designed for unipartite graphs, there are still few capsule network approaches that perform well on graph classification problems involving bipartite graphs. In this paper, we utilize one-mode projection and a hierarchical capsule network to improve the performance of GNN-based methods on the bipartite graph classification task, and demonstrate that the resulting model possesses excellent accuracy.

6 Conclusion

Bipartite graphs are becoming more and more common in practice, but little work has been done on them due to the complexity caused by the bipartite setting. In this paper, we propose the first capsule network on bipartite graphs for the graph classification task. The proposed BCGNN first applies one-mode projection to bipartite graphs, allowing the capsule network to better capture information between vertices of the same type. By combining one-mode projection with hierarchical capsule networks, BCGNN significantly improves the accuracy of bipartite graph classification. Extensive experiments on real-life bipartite graphs from seven classes demonstrate a significant improvement of BCGNN over the state-of-the-art methods.