Representing Graphs as Bag of Vertices and Partitions for Graph Classification
Abstract
Graph classification is a difficult task because finding a good feature representation for graphs is challenging. Existing methods use topological metrics or local subgraphs as features, but the time complexity of finding discriminatory subgraphs or computing some of the crucial topological metrics (such as diameter and shortest path) is high, so existing methods do not scale well when the graphs to be classified are large. Another issue in graph classification is that the number of distinct graphs available per class for training a classification model is generally limited. Such scarcity of graph data yields models that have far fewer training instances than model parameters, which leads to poor classification performance. In this work, we propose a novel approach to graph classification that uses two alternative graph representations: the bag of vertices and the bag of partitions. For the first representation, we use representation learning-based node features, and for the second, we use traditional metric-based features. Our experiments with 43 real-life graphs from seven different domains show that the bag representation of a graph improves the performance of graph classification significantly. We show 4–75% improvement with the vertex-based approach and 4–36% improvement with the partition-based approach over the existing best methods. Besides, our vertex and partition multi-instance methods are on average 75 and 11 times faster in feature construction time than the current best, respectively.
Keywords
Graph classification, Graph embedding, Frequent subgraph patterns
1 Introduction
Graph classification is an important research task that is used for solving various real-life classification problems. For example, in the domain of software engineering, software programs are represented as program flow graphs, and graph classification is used for discriminating between correct and faulty software [6]. In the domain of mobile security, classification of function call graphs is used to distinguish between malicious (malware) and benign Android applications [12]. The most well-known example of graph classification probably comes from the cheminformatics domain, where graphs are used for representing chemical compounds, and molecular property descriptors are developed to classify these graphs for performing structure–activity relationship (SAR) analysis [51]. Formally, the graph classification problem is to develop a mapping function \(f({\mathbf {x}}):X \rightarrow \{1,\ldots ,c\}\), given a set of training instances \(X={\left\langle {\mathbf {x}}_i,y_i\right\rangle }_{i=1}^N\), where an instance \({\mathbf {x}}_i \in X\) is the representation of a graph in a chosen feature space and \(y_i \in \{1,\ldots ,c\}\) is the class label associated with \({\mathbf {x}}_i\).
A graph does not have a natural embedding in a metric space, but the majority of supervised classification methods require that the graph be represented as a point in a metric space. So, a critical prerequisite of a graph classification task is selecting features for embedding a graph in a metric space. Existing works on graph classification use two kinds of feature representation for a graph: topological features [1, 26] and local substructure-based features [11, 50]. Methods belonging to the former group compute different local or global topology-based graph metrics, such as centralities, eccentricity, egonet degree, egonet size and diameter, and use these as features of the graphs. Methods belonging to the latter group extract local topologies such as frequent subgraphs [20, 25], discriminative subgraphs [37, 52] or graphlets [36] from the input graphs and use binary features representing their presence or absence in a given graph. Besides these, some graph kernels have also been developed [4, 41] that encode the similarity between a pair of graphs in a Gram matrix. A specialized graph classification platform, gBoost [39], has also been proposed, which unifies subgraph extraction and graph classification in an integrated process.
Much progress has been made in graph classification, yet feature selection for graph data remains a challenge, specifically when the graph to be classified is large. For such graphs, the task of feature value computation does not scale with the size of the graphs. For instance, for the topological feature-based methods, the computation of a number of metrics, such as the shortest path and diameter, has at least quadratic complexity, which is not feasible for many real-life graphs. We experiment with a few coauthorship networks (the average number of vertices in these graphs is around 96 thousand) and find that for these graphs, the execution time for constructing the metrics used in two of the most recent graph classification methods ([1, 26]) is around 40 hours. (Experimental results are available in Sect. 4.5.) Likewise, extracting frequent subgraph features is also not a scalable task. For example, we ran Gaston [32] (the current state-of-the-art frequent subgraph mining algorithm) on several animal and human contact graphs (the list of datasets is given in Table 1) with 30% support, but the mining task did not finish in 2 whole days of running. As we can see in Table 1, most of the graphs that we use in our experiments are of moderate size, yet existing methods for graph feature extraction are already infeasible for many of these graphs. Feature extraction for even larger graphs, such as Facebook or Wikipedia, would be nearly impossible. Besides the high computational cost of feature extraction, selecting a small number of good features is another challenge for the graph classification task. Often, domain knowledge from the analyst is critical for finding good features for graph classification, as different kinds of features work well for graphs from different domains.
In recent years, unsupervised feature learning using neural networks has become popular. These methods help an analyst discover features automatically, thus obviating the need for feature engineering using domain knowledge. Researchers achieve excellent performance with these learning techniques for extraction of features from text [2, 29], speech [43] and images [45]. Perozzi et al. [27, 35] have shown the potential of neural network models for learning feature representations of the vertices of an input graph for solving vertex classification [35] and link prediction [27] tasks. However, none of the existing methods find a feature representation for a graph in a graph database for graph classification. In addition to these neural network-based techniques, there exist a few works [38, 40] that find an optimal embedding of a graph in Euclidean space while preserving the shortest path distance. Nevertheless, due to their high computation cost, these embedding-based methods work well for small graphs only. Besides, the metric representation obtained using these methods performs poorly in the graph classification setting, as we will show in the experiment section.
Existing graph classification methods also suffer from an issue that arises from the application of graph mining in real-life settings. The issue is that the number of available graphs for training a classification model is generally limited, even though each of the graphs can be very large. This hurts the performance of methods that extract features using a supervised approach. The poor performance also propagates to the classification model building phase, as the number of features is generally much larger than the number of training instances. This leads to the fat matrix phenomenon, where the number of model parameters is considerably larger than the number of instances, and it is well known that such models are prone to overfitting. In many existing works on machine learning, specifically in deep learning, artificial random noise is inserted to create distorted samples [7, 42], which greatly increases the training set size and thus can alleviate the overfitting problem. In [42], the authors proposed a technique called “elastic distortion” to increase the number of image training samples by applying simple distortions such as translations, rotations and skewing, which yields superior classification performance. However, in the existing works, no such mechanism is available for training data inflation for a graph classification task.
In this work, we propose a novel graph data representation for facilitating graph classification. Our feature representation is different from the existing works, and it addresses the limitations of the existing graph classification methodologies that we have discussed in the earlier paragraphs. The first novelty of our feature representation is that we consider a graph as a bag of a uniform random subset of vertices of that graph, such that each vertex in this set becomes an independent instance (a row) of a graph classification dataset. Conceptually, each vertex in this feature representation is a distorted sample of the original graph from which the vertex is taken. This inflates the size of the training data and substantially improves the performance of graph classification. One can also view such a data representation as multi-instance learning (MIL) [53]. However, unlike MIL, in our data representation each of the instances (in this case the vertices) in a bag assumes the same label during the training phase, which is the label of the graph, whereas in MIL the instances in a bag can have different labels. Another difference from MIL is that in our case, the classification of a graph is determined by majority voting over the vertex instances in the corresponding bag, whereas in MIL the bag is labeled as positive as long as one of the instances is of the positive class. The second novelty of our method is an unsupervised feature representation of each vertex using a deep neural network, similar to the one that is used for language modeling [29, 34]. Computing such features for a graph is faster than all the existing metric-based or subgraph-based feature representations, and they perform substantially better than the existing methodologies.
The idea of training data inflation using many distorted samples, in isolation, provides a significant performance boost for graph classification. To demonstrate this, we also propose another graph classification framework, which considers a graph as a bag of subgraphs. Given a graph, we partition the vertices of the graph; each of the partition-induced subgraphs then becomes an instance in a bag corresponding to that graph. To find features of these partition-induced subgraphs, we do not use language model-based feature embedding. Rather, we use existing metric-based approaches, which compute local/global topological features. In this work, we use seven topology-based (egonet, degree and clustering coefficient) features presented in [1] to represent each of the partition-induced subgraphs. Computing such topological metrics is costly on the entire graph, but it is cheap when it is run over the partition-induced subgraphs, yielding a significant reduction in execution time. Empirical evaluation over a large number of real-life graphs shows that training data inflation using graph partitioning is fast and robust, and it is substantially more accurate than the existing state-of-the-art graph classification methods.

We propose two novel approaches for graph classification by training data inflation. In one approach, each sample in the inflated data is a randomly chosen vertex whose feature representation is obtained using a neural network-based language model. In the other approach, each sample is a partition subgraph, whose feature representation is obtained through traditional graph topology metrics. Both proposed methods are substantially better in terms of feature computation time and classification performance, specifically for large graphs.

We empirically evaluate the performance of our proposed classification algorithms on multiple real-world datasets. To be precise, we use 43 real-life graphs from 7 different domains and classify these graphs into their respective domains.
2 Related Works
We discuss the related works in two different categories.
2.1 Graph Classification
In the area of data mining, Gonzalez et al. [13] are probably the first to address the problem of supervised graph classification. They propose an algorithm called SubdueCL, which finds discriminatory subgraphs from a set of graphs and uses these subgraphs as features for graph classification. Deshpande et al. [9] also use a similar approach for subgraph feature extraction for classifying chemical compounds using SVM. The authors of [31] propose DT-CLGBI, which uses a custom-made decision tree for graph classification such that each node of the tree represents a mined subgraph. In all these works, feature extraction is isolated from the classification task. A collection of follow-up works integrates subgraph mining and graph classification in a unified framework. gBoost [39] is one of the earliest among these and uses mathematical programming. The authors of [11] use boosting with decision stumps, where a decision stump is associated with a subgraph. Other recent methods that use a similar approach are gActive [23], RgMiner [22], cogboost [33], GAIA [21] and Cork [49].
Besides discriminating subgraphs, topological metrics are also used as features for graph classification. Li et al. [26] use 20 topological and label features, which include the degree, clustering coefficient, eccentricity, giant connected ratio, eigenvalues, label entropy and trace. Rahman et al. [36] use graphlet frequency distribution (GFD) to cluster graphs from various domains. In [28], the authors compare graphs using three metrics: Leadership (which measures the extent to which the edge connectivity of a graph is dominated by a single vertex), Bonding (a.k.a. clustering coefficient) and Diversity (which is based on the number of edges that share no common end points and hence are disjoint). More recently, Berlingerio et al. [1] propose an algorithm called NetSimile, which computes features of a graph motivated by different social theories.
Kernel-based approaches are also popular for graph classification. Graph kernels are designed to exploit the shortest path [3], cyclic patterns [19], random walks [4], subgraphs [41, 50] and topological and vertex attributes [26]. Graph kernels compute the similarity between a pair of graphs, but this computation generally has high computational complexity; for example, the complexity of the random walk kernel, a popular graph kernel, is \(O(n^3)\), where n is the number of vertices. In summary, graph kernel-based methods do not scale to the classification of large graphs.
2.2 Vertex Representation
There also exist a handful of solutions for the problem of “node classification” or “within-network classification.” Some of these works use effective feature representations of the vertices for solving this task. Neville et al. [30] propose ICA, which is an iterative method based on constructing feature vectors for vertices from the information about them and their neighborhood. Henderson et al.’s method, called ReFeX [17], captures the behavioral features of a vertex in the graph by recursively combining each node’s local features with its neighborhood (egonet-based) features. Koutra et al. [24] compare and contrast several guilt-by-association approaches for vertex representation. Tang et al. [47] propose to extract latent social dimensions based on network information and then use them as features for discriminative learning. In a follow-up work [48], they propose a framework called SocioDim, which extracts social dimensions based on the network structure to capture prominent interaction patterns between nodes in order to learn a discriminative classifier. Han et al. [15] suggest that frequent neighborhood patterns can be used for constructing strong structure-aware features, which are effective for the within-network classification task. Recently, Perozzi et al. [35] propose an algorithm for finding a neighborhood-based feature representation of a vertex in a graph.
Another line of work in the literature finds feature representations of the vertices of a graph [38, 40] by computing an optimal embedding of the graph in Euclidean space that preserves topological properties, e.g., k-nearest neighbors. In [40], the authors proposed an algorithm called “structure-preserving embedding” (SPE) that creates a low-dimensional set of coordinates for each vertex while perfectly encoding the graph’s connectivity. One of the crucial drawbacks of these embedding algorithms is that they are extremely memory intensive, and hence, they work well for small graphs only. We ran SPE [40] on several moderate size graphs (number of vertices in 3 digits) on a machine with 16 GB of memory, but the processes were terminated due to an insufficient memory error.
3 Method
Consider a graph database \({\mathcal {G}}= \{G_i\}_{1 \le i \le n}\). Each graph \(G_i\) is associated with a category label \(L(G_i)\). For a graph \(G_i\), we use \(G_i.V\) and \(G_i.P_k\) to denote the set of vertices and the set of k partitions of that graph, respectively. The task of supervised graph classification is to learn a classification model from a set of training graph instances. The main challenge of a graph classification task is to obtain a good feature representation for the graphs in \({\mathcal {G}}\).
Our solution to the feature representation problem for graph classification is to map a graph to a bag of multiple vertices or a bag of multiple subgraph instances, such that each of the instances in a bag becomes a distinct row in the classification training data. The instances in a bag inherit the label of the parent graph that they represent. Thus, if \(v \in G_i.V\) is used as a bag instance for the graph \(G_i\), the label of v is \(L(G_i)\). If all the vertices of a graph \(G_i\) are used in the bag, one row of a traditional graph classification dataset becomes \(|G_i.V|\) rows in our data representation, each sharing the same label \(L(G_i)\). Instead of vertices, we can also use partition-induced subgraphs as the bag instances. In this case, one row in a traditional graph classification dataset becomes \(|G_i.P_k|\) rows in our data representation, each sharing the same label \(L(G_i)\). For a large input graph, we do not need to fill the bag with all the vertices of that graph; rather, we can include only a random subset of vertices in the bag. In the experiment section, we will show that the number of vertex instances that we take in a bag does not affect the classification performance significantly. For the partition-induced subgraph representation, we usually take all the partitions after choosing a reasonable number of partitions based on the size of the graph.
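The bag construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `vertex_bag`, the `sample_size` parameter and the fixed seed are assumptions introduced here.

```python
import random

def vertex_bag(vertices, label, sample_size=None, seed=0):
    """Return one (vertex, label) row per bag instance of a single graph.

    Every row inherits the label of the parent graph. If sample_size is
    given and smaller than the vertex count, a uniform random subset of
    the vertices is used instead of all of them (hypothetical parameter).
    """
    rng = random.Random(seed)
    vertices = list(vertices)
    if sample_size is not None and sample_size < len(vertices):
        vertices = rng.sample(vertices, sample_size)
    return [(v, label) for v in vertices]
```

For example, `vertex_bag(range(62), "Animal", sample_size=20)` would turn one 62-vertex graph into 20 training rows that all carry the label "Animal".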
The immediate benefit of the above multi-instance feature representation is that it increases the number of rows in a classification dataset by providing multiple instances for each input graph. Given that most of the graphs are large with many vertices, the multi-instance representation provides a manyfold increase in the number of instances. This addresses the fat matrix problem and thus yields a robust graph classification model with higher accuracy. Furthermore, making many instances out of one graph enables a learning algorithm to learn topological variances among different parts of the network, which also contributes to the model’s accuracy. Many recent research works [7, 42] in the deep learning community show the importance of training data inflation using distorted copies of the input data samples. Our approach of multi-instance feature representation is a demonstration of such an endeavor for the task of graph classification.
Below we describe the feature representation of each instance in the bag. The next subsection describes the vertex-based feature representation. In the subsection after that, we discuss the feature representation of a partition-induced subgraph.
3.1 Vertex Feature Representation Using Random Walk
For our graph classification task, we assume that the nodes do not have a label or any other satellite data associated with them. So, the feature representation of a node v must capture the local topology around v. Following the DeepWalk method [35], we use a fixed length (say, l) random walk starting from the given node v (which we call the root node) to build a sequence of nodes capturing the local topology around v. The method chooses an outgoing edge of the currently visited vertex uniformly at random until it makes l steps, and builds a sequence of the vertices that it visits through this walk. For each root node, the method performs t random walks of length l. One can view each sequence of vertices as a sentence in a language, where the vertices are the words in that sentence. Given a set of sentences that are derived from a given root node, DeepWalk uses the Word2Vec [29] embedding method to find a metric embedding of the given vertex in an appropriately chosen vector space. It finds the d-dimensional (d is user defined) feature representation of all vertices in the document. Once we obtain the feature representation of each of the vertices in \({\mathbb {R}}^{d}\) using the above method, each embedding vector becomes an instance in the bag of the given graph, \(G_i\). The label of each of the vectors in this bag is \(L(G_i)\).
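The walk-generation step above can be sketched as follows. This is a minimal sketch under stated assumptions: the graph is given as an adjacency dictionary mapping each vertex to a list of neighbors, and the function name `random_walks` and the seed argument are introduced here for illustration. The resulting token lists play the role of sentences and would then be passed to a Word2Vec implementation; that call is not shown.

```python
import random

def random_walks(adj, root, walk_length, num_walks, seed=0):
    """Generate num_walks uniform random walks of walk_length steps from root.

    adj maps each vertex to a list of its neighbors (assumed input format).
    Each walk is returned as a list of string tokens so that it can be
    treated as a sentence whose words are vertex identifiers.
    """
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        current = root
        walk = [current]
        for _ in range(walk_length):
            neighbors = adj[current]
            if not neighbors:  # dead end: stop this walk early
                break
            current = rng.choice(neighbors)  # uniform choice of next vertex
            walk.append(current)
        walks.append([str(v) for v in walk])
    return walks
```

With l = 5 and t = 2, as in the toy setting of Fig. 1, each root vertex contributes two sentences of at most six tokens (the root plus five steps).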
In Fig. 1, we illustrate how we compute the feature representation of the vertices of an input graph G(V, E). The toy graph that we use in this example has 12 vertices and 12 edges. First, we uniformly select a set of target vertices; for each of these vertices, we perform t random walks of length l. Suppose we pick vertices 1, 8 and 12, l is 5 and t is 2. The figure shows two random walks for vertex 1 only. We use different colors (red, blue, violet and orange) to illustrate different random walks. Once we have the random walks, we treat them as sentences in a document (step 2 in Fig. 1). In the third step, we pass the document to the text modeler (shown as a black box in Fig. 1). As output (step 4 in Fig. 1), the text modeler produces a feature representation of a given length (d dimensions) for all words, i.e., the vertices (\(1,2,\ldots ,12\)) in this case.
To give some perspective on adapting a deep language modeler to find feature representations of the vertices of a graph, assume a sequence of words \(W = \{w_0,w_1,\ldots ,w_n\}\), where \(w_i \in V\) (V is the vocabulary). A language model maximizes \(\text {Pr}[w_n \mid w_0,w_1,\ldots ,w_{n-1}]\) over the entire training corpus. Similarly, when we treat random walks as sentences, the estimated likelihood can be written as \(\text {Pr}[v_i \mid v_0,v_1,\ldots ,v_{i-1}]\), which is the likelihood of observing a vertex \(v_i\) in a walk given all the previously visited vertices. This helps us learn a mapping function \({\mathcal {K}}: v \in V \rightarrow {\mathbb {R}}^{d}\). Such a mapping \({\mathcal {K}}\) embodies the latent topological representation associated with each vertex v in the graph. So the likelihood function becomes \(\text {Pr}[v_i \mid {\mathcal {K}}(v_0),{\mathcal {K}}(v_1),\ldots ,{\mathcal {K}}(v_{i-1})]\). The deep language model (Word2Vec [29]) used in this work adopts some recent relaxations of this likelihood function. In our case, \(v_i\) need not be at the end of the context (\(v_0,v_1,\ldots ,v_{i-1}\)); rather, the context of a vertex consists of the vertices appearing to the right of the given vertex in the random walk.
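As a hedged illustration of this relaxation, one common skip-gram-style way to write the objective is shown below. The window size \(w\) and the assumption that context vertices are conditionally independent given \({\mathcal {K}}(v_i)\) are introduced here for illustration only; the exact form depends on the Word2Vec variant and settings used.
\[ \max _{{\mathcal {K}}} \sum _{i} \sum _{j=1}^{w} \log \text {Pr}[v_{i+j} \mid {\mathcal {K}}(v_i)] \]
Here the outer sum ranges over the positions in all random walks, and the inner sum ranges over the (at most \(w\)) vertices that follow \(v_i\) in its walk, which corresponds to taking the context of a vertex from its right-hand side.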
3.2 Partition-Induced Subgraph Feature Representation
In another feature representation, we build the bag instances of a graph by partitioning the graph into different parts and then considering each part as a bag instance corresponding to that graph. As in the vertex multi-instance representation, each partition has the same label as the parent graph. Since each partition is still a graph (but of a smaller size), we can use the existing metric-based approaches for its feature representation. Our main intention in using a partition-induced feature representation is to measure the effectiveness of training dataset inflation, irrespective of the feature representation.

\(d_u\), degree of vertex u of G

\(d_{{\mathrm{nei}}(u)}= \frac{1}{d_u} \sum _{v\in {\mathrm{nei}}(u)} d_v\), average neighbor’s degree of vertex u

\(E_{{\mathrm{ego}}(u)}\), number of edges in node u’s egonet.^{1} \({\mathrm{ego}}(u)\) returns node u’s egonet.

\(CC_u\), clustering coefficient of node u which is defined as the number of triangles connected to vertex u over the number of connected triples centered at vertex u.

\(CC_{{\mathrm{nei}}(u)}= \frac{1}{d_u} \sum _{v\in {\mathrm{nei}}(u)} CC_v\), average clustering coefficient of vertex u’s neighbors.

\(E^{in}_{{\mathrm{ego}}(u)}\), number of edges incident to \({\mathrm{ego}}(u)\).

\(d_{{\mathrm{ego}}(u)}\), degree of \({\mathrm{ego}}(u)\), i.e., its number of neighbors.
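The seven per-vertex quantities listed above can be sketched in plain Python as follows. This is a minimal sketch, not the NetSimile implementation: the adjacency-dictionary input (vertex to set of neighbors, undirected), the function names, and the reading of \(E^{in}_{{\mathrm{ego}}(u)}\) as the number of edges with exactly one endpoint inside the egonet are assumptions made here. How the per-vertex values are aggregated into one vector per subgraph is not shown.

```python
from itertools import combinations

def clustering(adj, u):
    """Triangles at u divided by connected triples centered at u."""
    nbrs = adj[u]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
    return links / (k * (k - 1) / 2)

def vertex_features(adj, u):
    """Return the seven features of vertex u, in the order listed above."""
    nbrs = adj[u]
    d_u = len(nbrs)
    d_nei = sum(len(adj[v]) for v in nbrs) / d_u if d_u else 0.0
    cc_u = clustering(adj, u)
    cc_nei = sum(clustering(adj, v) for v in nbrs) / d_u if d_u else 0.0
    ego = set(nbrs) | {u}  # egonet: u plus its neighbors
    # Edges with both endpoints inside the egonet.
    e_ego = sum(1 for a, b in combinations(ego, 2) if b in adj[a])
    # Assumed reading: edges with exactly one endpoint inside the egonet.
    e_in = sum(1 for a in ego for b in adj[a] if b not in ego)
    # Distinct vertices outside the egonet that touch it.
    d_ego = len({b for a in ego for b in adj[a] if b not in ego})
    return [d_u, d_nei, e_ego, cc_u, cc_nei, e_in, d_ego]
```

Because every quantity only touches the egonet of u and its immediate boundary, the cost per vertex depends on local degree rather than on the size of the whole graph, which is consistent with the argument that these features are cheap on small partition-induced subgraphs.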
3.3 Classification Model
Real-life graphs
Domain  Dataset  (Vertices; edges)  Description

Animal  Bison  (26; 314)  Dominance between American bison
Animal  Hen  (32; 496)  Dominance between White Leghorn hens
Animal  Dolphin  (62; 159)  Social network of bottlenose dolphins
Animal  Kangaroo  (17; 91)  Interactions between free-ranging gray kangaroos
Animal  Cattle  (28; 217)  Dominance behaviors observed between dairy cattle
Animal  Zebra  (27; 111)  Interactions between Grevy’s zebras
Animal  Sheep  (28; 250)  Dominance behavior between bighorn sheep
Animal  Macaques  (62; 1,187)  Dominance behavior between female Japanese macaques
Communication  UC Irvine messages  (1,899; 20,296)  Messages between students of UC Irvine
Communication  Enron  (87,273; 321,918)  Email communication between employees of Enron
Communication  Digg  (30,398; 86,404)  Reply network of the social news website Digg
Communication  FB Wall Post  (46,952; 274,086)  Subset of posts to other users’ walls on Facebook
Communication  LKML  (63,399; 242,976)  Communication network of the Linux kernel mailing list
Communication  EU institution  (265,214; 420,045)  Email communication of an undisclosed European institution
Communication  U. Rovira email  (1,133; 5,451)  Email communication at the University Rovira i Virgili
Human Contact  Train bombing  (64; 143)  Contacts between terrorists involved in the train bombing
Human Contact  Windsurfers  (43; 336)  Interpersonal contacts between windsurfers
Human Contact  Infectious  (410; 2,765)  Face-to-face behavior of people during the exhibition
Human Contact  Conference  (113; 2,196)  Face-to-face contacts of the attendees in a conference
Coauthorship  arXiv hepth  (22,908; 2,673,133)  Collaboration graph of arXiv’s High Energy Physics - Theory
Coauthorship  arXiv astroph  (18,771; 198,050)  Collaboration graph of arXiv’s Astrophysics section
Coauthorship  DBLP coauthorship  (317,080; 1,049,866)  Collaboration graph from DBLP computer science bibliography
Coauthorship  arXiv hepph  (28,093; 3,148,447)  Collaboration graph of arXiv’s High Energy Physics - Phenomenology
Citation  arXiv hepph Cit.  (34,546; 421,578)  Citation graph of arXiv’s High Energy Physics - Phenomenology
Citation  arXiv hepth Cit.  (27,770; 352,807)  Citation graph of arXiv’s High Energy Physics - Theory
Citation  Cora citation  (23,166; 91,500)  Cora citation network
Citation  DBLP  (12,591; 49,743)  Citation graph of DBLP
Human Social  Jazz  (198; 2,742)  Collaboration network between jazz musicians
Human Social  HighSchool  (70; 366)  Network contains friendships between boys in a high school
Human Social  Residence hall  (217; 2,672)  Friendship between residents living at a residence hall
Human Social  Taro exchange  (22; 78)  Gift-givings (taro) between households in a Papuan village
Human Social  Dutch college  (32; 3,062)  Network contains friendships between university freshmen
Human Social  Sampson  (18; 188)  Network contains ratings between monks related to a crisis
Human Social  Zachary karate  (34; 78)  Network contains interactions between members of a karate club
Human Social  Seventh graders  (29; 376)  Network contains ratings between students from seventh grade
Human Social  Adolescent health  (2,539; 12,969)  Network was created from an adolescent health survey
Human Social  Tribes  (16; 58)  Social network of tribes of the Gahuku-Gama
Infrastructure  USAirports  (1,574; 28,236)  Network of flights between US airports in 2010
Infrastructure  Air traffic control  (1,226; 2,615)  Network of preferred route recommendations
Infrastructure  OpenFlights  (2,939; 30,501)  Network contains flights between airports of the world
Infrastructure  US power grid  (4,941; 6,594)  Network of power supply lines between US power grids
Infrastructure  EuroRoad  (1,174; 1,417)  Road network in Europe
3.4 Pseudo-Code
In Fig. 2, we present the pseudo-code of the vertex multi-instance-based approach to graph classification. The input of the algorithm is a graph database \({\mathcal {G}}\) where each graph \(G_i\) is associated with a category label \(L(G_i)\). The algorithm starts by iterating over each graph \(G_i\). For each graph, after generating random walks, it executes Word2Vec over the collection of walks to find the feature representation of the vertices of the graph and stores it in Bag(\({G_i}.V\)) (lines 2–4). Then, the algorithm labels each of the instances in Bag(\({G_i}.V\)) with the graph \(G_i\)’s category label \(L(G_i)\). At the end of each iteration, Bag(\({G_i}.V\)) of labeled data instances is stored in a list called Data. When the iteration finishes, the algorithm applies k-fold cross-validation to split Data at the bag level to generate the train (\(Data_{\mathrm{train}}\)) and test (\(Data_{\mathrm{test}}\)) folds. The algorithm then executes the training phase using \(Data_{\mathrm{train}}\) to train the model \(H_{\varvec{\theta }}\) (line 6). Finally, the algorithm predicts the labels of the data points in the test fold (\(Data_{\mathrm{test}}\)) using \(H_{\varvec{\theta }}\) and outputs the labels of the test graphs represented by the bags of vertices in the test fold using majority voting (lines 7–8).
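Two steps of this procedure, the bag-level fold split and the majority vote, can be sketched as below. This is a minimal sketch, not the pseudo-code of Fig. 2: the function names, the round-robin fold assignment and the `predict_instance` callback (standing in for the trained model) are assumptions introduced here.

```python
from collections import Counter

def bag_level_folds(graph_ids, k):
    """Assign whole graphs (bags) to k folds in round-robin order.

    Splitting at the bag level keeps all instances of one graph in the
    same fold, so no graph contributes rows to both train and test.
    """
    folds = [[] for _ in range(k)]
    for index, gid in enumerate(graph_ids):
        folds[index % k].append(gid)
    return folds

def majority_vote(instance_predictions):
    """Return the most frequent predicted label among a bag's instances.

    Ties are resolved in favor of the label encountered first.
    """
    return Counter(instance_predictions).most_common(1)[0][0]

def predict_graphs(bags, predict_instance):
    """bags: {graph_id: [feature_vector, ...]}; returns {graph_id: label}."""
    return {
        gid: majority_vote([predict_instance(x) for x in rows])
        for gid, rows in bags.items()
    }
```

In practice the graph identifiers would typically be shuffled, or stratified by class, before being assigned to folds; that choice is left open here.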
Figure 3 shows the pseudo-code of the partition multi-instance-based graph classification algorithm. The steps of the partition multi-instance technique are similar to those of the vertex multi-instance technique, except that we use a graph partition algorithm to partition a graph \(G_i\) (line 3), say into k partitions. Then, for each partition-induced subgraph in \({G_i}.P_k\), the partition multi-instance algorithm computes seven topology-based features and stores them in Bag(\({G_i}.P_k\)) (lines 4–5). Once the algorithm labels each instance in Bag(\({G_i}.P_k\)) with the corresponding graph category label and stores it in a global list Data, it performs a k-fold train-test scheme for Softmax classification (lines 6–9) and outputs the labels of the test graphs using majority voting (line 10).
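The step that turns a partition of the vertices into partition-induced subgraphs can be sketched as follows. This is a minimal sketch: it assumes the partition assignment (vertex to partition id) has already been produced by some partitioner, and the function name and data layout are introduced here for illustration.

```python
def induced_subgraphs(adj, assignment):
    """Build one induced subgraph per partition.

    adj: {vertex: set(neighbors)} for an undirected graph.
    assignment: {vertex: partition id} (assumed to be given).
    Returns {partition id: {vertex: set(neighbors within the partition)}}.
    Edges whose endpoints fall in different partitions are dropped.
    """
    parts = {}
    for node, pid in assignment.items():
        parts.setdefault(pid, {})[node] = set()
    for node, pid in assignment.items():
        for nbr in adj[node]:
            if assignment.get(nbr) == pid:
                parts[pid][node].add(nbr)
    return parts
```

Each returned subgraph is itself an adjacency dictionary, so any per-graph feature routine that accepts this layout can be applied to it to produce one bag instance.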
4 Experiments and Results
To validate our proposed vertex and partition multi-instance graph classification algorithms, we perform several experiments. We use real-world graph data [46] from different domains in all our experiments. We have collected 43 graphs from 7 domains. In Table 1, we present basic statistics and a short description of the graphs. As we can see, the Animal, Human Social and Human Contact domains have smaller graphs, whereas Citation, Coauthorship, Communication and Infrastructure contain moderate and large size graphs.
4.1 Experimental Setup
To find the feature representation of the vertices of a graph, we use the gensim library (https://radimrehurek.com/gensim/), which contains an open-source Python implementation of the “Word2Vec” algorithm. We write our own random walk generator in Python. We set the length of the random walk (l) and the feature vector size (d) to 40 and 30, respectively, for small and moderate size graphs (Animal, Human Contact, Human Social). For large size graphs (Citation, Coauthorship, Communication and Infrastructure), we set these numbers to 60 and 70, respectively. In both cases, we set the number of random walks (t) to 10. In Sect. 4.8, we discuss in detail the effects of parameter values on the performance of the classification task.
In the partition multi-instance-based approach, we use NetSimile by Berlingerio et al. [1] to compute the features of the partition-induced subgraph constructed from each partition of a graph. We implement our own version of NetSimile in Python, where all topological features are computed using the NetworkX [14] package. To partition the graphs in the dataset, we use Graclus by Dhillon et al. [10]. Graclus takes the number of partitions k as a user-defined parameter. We set k to a small number for smaller graphs and to a reasonably high number for larger graphs. In this work, we choose k to be 60, 20 and 5 for large, moderate and small size graphs, respectively. We implement our own softmax classifier in Python. We set the regularization parameter \(\lambda\) to 1e−4 for all executions. We perform four-fold cross-validation over the data, using three folds for training and one fold for testing. To measure the classifier’s performance, we use percentage accuracy and micro-F1 metrics. We run all experiments on a 3 GHz Intel machine with 16 GB of memory.
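For readers unfamiliar with softmax classification, the following is a minimal pure-Python sketch of a softmax classifier trained by batch gradient descent with an L2 penalty weighted by \(\lambda\). Only the value \(\lambda\) = 1e−4 comes from the text; the learning rate, the number of epochs, the unregularized bias and all function names are assumptions for illustration, and the authors' implementation is not shown in the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_softmax(X, y, num_classes, lam=1e-4, lr=0.5, epochs=200):
    """Fit one weight row per class; the last column of each row is the bias."""
    dim = len(X[0])
    n = len(X)
    W = [[0.0] * (dim + 1) for _ in range(num_classes)]
    for _ in range(epochs):
        grad = [[0.0] * (dim + 1) for _ in range(num_classes)]
        for x, label in zip(X, y):
            xb = list(x) + [1.0]  # append a constant input for the bias
            probs = softmax([sum(w * v for w, v in zip(row, xb)) for row in W])
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == label else 0.0)
                for j, v in enumerate(xb):
                    grad[c][j] += err * v / n
        for c in range(num_classes):
            for j in range(dim + 1):
                penalty = lam * W[c][j] if j < dim else 0.0  # bias not regularized
                W[c][j] -= lr * (grad[c][j] + penalty)
    return W


def predict(W, x):
    """Return the index of the class with the highest score."""
    xb = list(x) + [1.0]
    scores = [sum(w * v for w, v in zip(row, xb)) for row in W]
    return max(range(len(W)), key=scores.__getitem__)


X = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y = [0, 0, 1, 1]
W = train_softmax(X, y, num_classes=2)
predicted = [predict(W, x) for x in X]
```

In the multi-instance setting, each row of `X` would be one instance (a vertex or a partition-induced subgraph) carrying the label of its graph.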
Table 2 Percentage accuracy of graph classifier (VML: vertex multi-instance-based, PML: partition multi-instance-based; improvement is w.r.t. the current best method)
Domains  No. of classes  VML accuracy (%)  VML improvement (%)  PML accuracy (%)  PML improvement (%)  Current best accuracy (%)
AC  2  85.1  13.4  81.6  8.8  75
BC  2  97.0  16.4  87.5  5.0  83.3
AB  2  81.8  13.2  83.3  15.3  72.2
AD  2  89.0  11.2  80.0  0  80.0
CD  2  83.6  4.5  91.6  14.5  80.0
BD  2  88.4  17.8  97.5  30.0  75.0
ABC  3  85.2  11.8  80.0  4.9  76.2
ABD  3  85.1  6.3  83.3  4.1  80.0
ACD  3  81.2  4.5  81.3  4.6  77.7
BCD  3  88.1  22.1  87.5  21.5  72.0
ABCD  4  80.0  12.0  77.5  8.5  71.4
EF  2  83.4  25.2  70.0  5.1  66.6
GF  2  75.0  75.2  65.0  51.8  42.8
EGF  3  65.0  32.7  61.5  28.4  47.9
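Most of the improvement figures in the accuracy table are consistent with a relative gain over the current best accuracy, although the paper does not state the formula. Under that assumption, the computation can be sketched as follows; the function name is hypothetical.

```python
def relative_improvement(accuracy, baseline):
    """Percentage gain of `accuracy` over `baseline` (assumed formula)."""
    return 100.0 * (accuracy - baseline) / baseline


# Values taken from the BC and GF rows of the accuracy table.
bc_vertex = round(relative_improvement(97.0, 83.3), 1)
bc_partition = round(relative_improvement(87.5, 83.3), 1)
gf_vertex = round(relative_improvement(75.0, 42.8), 1)
```

For the BC task this gives 16.4 for the vertex-based method and 5.0 for the partition-based method, matching the reported values.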
Table 3 Micro-F1 (%) score of graph classifier
Domains  No. of classes  Vertex multi-instance-based micro-F1 (%)  Partition multi-instance-based micro-F1 (%)
AC  2  86.7  84.7
BC  2  93.2  82.5
AB  2  80.0  81.2
AD  2  91.2  80.5
CD  2  78.2  92.1
BD  2  80.0  92.2
ABC  3  85.4  75.2
ABD  3  84.4  85.1
ACD  3  82.1  79.1
BCD  3  85.9  88.1
ABCD  4  77.6  77.9
EF  2  78.8  69.3
GF  2  71.4  62.6
EGF  3  61.4  59.1
4.2 Experiment on Classification Performance
4.3 Comparison with the Existing Algorithms
In this experiment, we compare the accuracy of our methods with that of the existing graph classification algorithms (Li and NetSimile) for all the graphs mentioned in Table 1. To compare with RgMiner [22], a recent state-of-the-art frequent subgraph-based graph classification algorithm, we pick the animal, human social and human contact domains (Table 1) because these domains have smaller graphs.
To illustrate the comparisons with Li and NetSimile, we create a scatter plot (Fig. 4) by placing the accuracy of our method on the x-axis and that of the competitor’s method on the y-axis for each classification task shown in Column 1 of Table 2. We also draw a diagonal line in the plotting area. The number of points that lie in the lower triangle is the number of tasks on which our methods are better than the existing ones. Figure 4a, b shows the performance comparison of the vertex multi-instance approach with Li and NetSimile. As we can see, the vertex multi-instance approach performs better than both Li and NetSimile in all tasks. In Fig. 4b, d, we compare the accuracy of the partition multi-instance approach with Li and NetSimile. In this case, among 14 classification tasks, our method performs better in 13 and ties in 1. The superior performance of our proposed methods is essentially driven by the training data inflation technique. This technique helps to alleviate the fat-matrix phenomenon (discussed in Sect. 1) and improves graph classification accuracy.
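Counting the points below the diagonal amounts to tallying wins, ties and losses per task. The sketch below does this with the accuracies from Table 2, using the "current best" column as a stand-in for the two competitors, since their individual accuracies are not tabulated here; the function name `tally` is hypothetical.

```python
def tally(ours, theirs):
    """Return (wins, ties, losses) of `ours` against `theirs`, task by task."""
    wins = sum(a > b for a, b in zip(ours, theirs))
    ties = sum(a == b for a, b in zip(ours, theirs))
    return wins, ties, len(ours) - wins - ties


# Accuracies (%) for the 14 tasks, in the row order of Table 2.
vertex = [85.1, 97.0, 81.8, 89.0, 83.6, 88.4, 85.2,
          85.1, 81.2, 88.1, 80.0, 83.4, 75.0, 65.0]
partition = [81.6, 87.5, 83.3, 80.0, 91.6, 97.5, 80.0,
             83.3, 81.3, 87.5, 77.5, 70.0, 65.0, 61.5]
current_best = [75.0, 83.3, 72.2, 80.0, 80.0, 75.0, 76.2,
                80.0, 77.7, 72.0, 71.4, 66.6, 42.8, 47.9]

vertex_tally = tally(vertex, current_best)
partition_tally = tally(partition, current_best)
```

With these values the vertex-based method wins all 14 tasks, and the partition-based method wins 13 and ties one (task AD), in line with the counts stated above.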
As discussed in Sect. 1, the execution of Gaston [32] over different combinations of the animal, human-social and human-contact settings for 30% support took more than 2 days. The same also holds true for the other subgraph feature-based methods, including GAIA and Cork. For subgraph mining using the above methods, we assume that the edges are unlabeled and all the vertices have the same label. This is a valid assumption because the edges of the graphs used in this work carry no label other than indicating a relationship; for example, the bison graphs from the animal domain portray dominance relations between different bison. We execute Gaston for 30% support over different combinations of the animal, human-social and human-contact settings. Once we have the frequent subgraphs, we apply RgMiner [22] to perform frequent subgraph-based graph classification. The average accuracies we obtain for the “animal–human contact,” “human contact–human social” and “animal–human contact–human social” settings are 52.3, 50.0 and 43.7%, respectively, which are lower than the accuracies achieved by both of our proposed methods (see the last 3 rows of Table 2) in the corresponding setups.
Table 4 Average running time in seconds (VML: vertex multi-instance, PML: partition multi-instance)
Domain  Avg. no. of vertices  Li [26] time (s)  NetSimile [1] time (s)  VML time (s)  PML time (s)
Animal  35  0.05  0.135  0.09  0.08
Human contact  157  1.71  0.69  0.45  0.17
Human social  317  2.73  0.38  1.1  0.32
Infrastructure  2K  27.7  18.2  8.2  1.5
Citation  24K  8634.7  1142.8  121.6  89.1
Communication  62K  41217.8  2136.4  289.1  254.6
Coauthorship  96K  137106.4  117811.5  562.4  8367.1
4.4 Experiment with Vertically Scaled Dataset
In earlier experiments, we have shown that our proposed methods are particularly suitable for a horizontally scaled dataset, in which the number of graphs is small but each graph is large. For example, the average number of vertices in the Citation, Communication and Coauthorship networks is 24K, 62K and 96K, respectively; for these graphs, our methods outperform all the existing methods. Note that for these datasets the frequent/discriminative subgraph-based methods are not able to run due to their excessive computation cost. On the other hand, if the graph dataset is vertically scaled, i.e., if there are many graphs in the dataset but each graph is small, then the existing subgraph-based methods generally work well. To show this, we consider a well-known discriminative subgraph-based approach, namely gboost^{2}, and run it on the breast cancer (MCF7) dataset in the National Cancer Institute (NCI) graph data repository.^{3} On this dataset, gboost has an F-score of 0.75, whereas the best F-score among our proposed methods is 0.65. This inferior performance of our proposed methods on the MCF7 dataset is expected. In this dataset, the graphs have, on average, 26 nodes and 28 edges. On such small graphs, neither the random walk-based vertex embedding nor the partition-based multi-instance learning is able to capture the topological properties of the graphs that are useful for classification.
4.5 Timing Analysis
In this section, we analyze the running time of our proposed algorithms and compare it with that of the existing ones. To report the runtime performance of an algorithm, we group its execution times over the graphs by graph category (e.g., citation or coauthorship) and report the average running time. Table 4 shows the average running time in seconds. For the sake of comparison, we report only the running time of finding the feature representation for all the algorithms, except for the partition multi-instance method, where we include the running time of partitioning as well. As we can see, for smaller graphs, all the methods finish within a reasonable time frame. However, for large graphs, especially in the Citation, Coauthorship and Communication categories, the running time is very high for Li and NetSimile. For the Communication and Coauthorship categories, the vertex multi-instance method achieves 142- and 243-fold improvement over Li and 7- and 209-fold improvement over NetSimile, respectively. The partition multi-instance approach achieves 162- and 16-fold improvement over Li and NetSimile for the Communication domain, but only 16- and 14-fold improvement for the Coauthorship domain. Note that we could execute RgMiner only on the smaller graphs from the animal, human contact and human social categories, and those executions alone are not enough to give a complete picture of the running time comparison. So, we do not report RgMiner’s running time in Table 4. Nevertheless, RgMiner takes 17.2, 15.8 and 21.3 s to mine frequent subgraphs using Gaston for 30% support and to find the feature representation for the “animal–human contact,” “human contact–human social” and “animal–human contact–human social” settings, respectively.
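The fold improvements quoted for the vertex multi-instance method appear to be ratios of the average feature-construction times in Table 4, truncated to whole numbers. A small sketch under that assumption follows; the dictionary layout and the function name are hypothetical.

```python
# Average running times in seconds, copied from Table 4.
times = {
    "Communication": {"Li": 41217.8, "NetSimile": 2136.4, "VML": 289.1},
    "Coauthorship": {"Li": 137106.4, "NetSimile": 117811.5, "VML": 562.4},
}


def fold_improvement(baseline_seconds, ours_seconds):
    """Speed-up factor of our method over a baseline, truncated to an integer."""
    return int(baseline_seconds / ours_seconds)


vml_speedups = {
    domain: {
        rival: fold_improvement(row[rival], row["VML"])
        for rival in ("Li", "NetSimile")
    }
    for domain, row in times.items()
}
```

For the Communication domain this gives 142 (over Li) and 7 (over NetSimile), and for Coauthorship 243 and 209.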
4.6 Effectiveness of Training Data Inflation
In earlier experiments, we saw that both the vertex and partition multi-instance methods perform better than the existing algorithms. The vertex multi-instance method combines training data inflation with a deep learning-based feature representation, and the partition multi-instance method combines training data inflation with the existing metric-based feature representation. In this experiment, we investigate whether the superior performance of our proposed methods can be attributed to training data inflation or to the deep learning-based representation of vertices. To do this, we create a scatter plot (Fig. 5) similar to Fig. 4, placing the vertex multi-instance method’s accuracy on the x-axis and the partition multi-instance method’s accuracy on the y-axis. As shown in Fig. 5, all the points in the plotting area are very close to the diagonal line, which indicates that the two methods perform comparably. Training data inflation is the component these two approaches share. Moreover, the partition multi-instance method shows that improved graph classification performance is achievable without deep learning-based techniques. So the better performance of our graph classification methods can be attributed mainly to the training data inflation paradigm. However, for large graphs, vertex multi-instance-based classification may be more attractive because it constructs the feature representation of a graph in less time.
To investigate further, we perform a graph classification experiment without training data inflation. After obtaining the deep learning-based feature representation of the vertices of a graph, we apply five aggregator functions (mean, median, kurtosis, standard deviation and variance) over each feature and derive a single feature vector for the bag. Then, we use these single feature vectors, one per graph (i.e., per bag), to train the classification model. We perform this experiment for all classification settings mentioned in Table 2. In Fig. 5, we compare the classification performance of the Multi-Instance (x-axis) and No Multi-Instance (y-axis) approaches using a scatter plot. As we can see, all the points reside in the lower triangle of the plotting area, which establishes the superior performance of the Multi-Instance approach over No Multi-Instance.
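The aggregation used in this baseline can be illustrated with the following sketch, which collapses a bag of vertex vectors into one vector of length 5d. The text does not say which variants of the statistics were used, so population standard deviation and variance and population excess kurtosis are assumptions here, as are the function names.

```python
import statistics


def kurtosis(values):
    """Population excess kurtosis: fourth standardized moment minus 3."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    if m2 == 0:  # constant feature: define kurtosis as 0 to avoid dividing by zero
        return 0.0
    m4 = sum((v - mean) ** 4 for v in values) / n
    return m4 / (m2 ** 2) - 3.0


def aggregate_bag(bag):
    """Collapse a bag (list of equal-length vectors) into one feature vector.

    For each feature column, emit mean, median, kurtosis, standard deviation
    and variance, in that order.
    """
    vector = []
    for column in zip(*bag):
        vector.extend([
            statistics.mean(column),
            statistics.median(column),
            kurtosis(column),
            statistics.pstdev(column),
            statistics.pvariance(column),
        ])
    return vector


bag = [[1.0, 2.0], [3.0, 2.0], [5.0, 2.0]]  # three vertices, d = 2
graph_vector = aggregate_bag(bag)
```

Each graph then contributes a single training row, so the number of training instances equals the number of graphs rather than the number of vertices.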
4.7 Experiment on Bag size
4.8 Parameter Selection for Vertex Multiinstance Approach
To see the effect of the length of the random walk (l), we perform the same experiment as above but with different random walk lengths ranging from 20 to 60, while keeping the dimension size at \(d=30\) for small/moderate graphs. For large graphs, we vary l from 20 to 90 while fixing d to 70. In both cases, the number of random walks (t) is set to 10. In Fig. 7b, d, we plot the percentage accuracy across different random walk lengths. The walk length parameter l has a similar effect to d. For small l, less of the local topological information of a node is captured, whereas a large l captures topological information about the entire graph rather than about the node. We get the best performance for \(l=40\) and \(l=60\) for small and large graphs, respectively.
Finally, we perform the same experiment as above for different values of the number of random walks (t), ranging from 1 to 30, for both small and large graphs. In Fig. 7e, f, we plot the percentage accuracy across different numbers of random walks. As we can see, beyond \(t=10\), the classifier’s performance does not vary for either small or large graphs. Moreover, when we set the number of walks to a higher value, the overall training time of the model increases. So, we set the number of random walks t to 10 for all experiments in this work.
5 Conclusions
In this work, we propose two novel solutions to the graph classification problem. In the vertex multi-instance solution, we map a graph into a bag of vertices and leverage a neural network-based representation learning technique to find the feature representation of the vertices. In the partition multi-instance solution, we map a graph into a bag of subgraphs and use a traditional metric-based feature representation technique to construct the features of the partition-induced subgraphs. We perform extensive empirical evaluations of our proposed methods on real-world graph data from several domains. We compare our algorithms with the existing methods of graph classification and show that our methods perform significantly better in both classification accuracy and running time.
Footnotes
1. A vertex’s egonet is the induced subgraph of its neighboring nodes.
2. Matlab implementation of gboost is publicly available from http://www.nowozin.net/sebastian/gboost/.
3. MCF7 dataset is available to download from https://www.cs.ucsb.edu/~xyan/dataset.htm.
Notes
Acknowledgements
Funding for this work is provided by the United States National Science Foundation (Grant No. IIS-1149851).
References
1. Berlingerio M, Koutra D, Eliassi-Rad T, Faloutsos C (2013) Network similarity via multiple social theories. In: Proceedings of the 2013 IEEE/ACM international conference on advances in social networks analysis and mining (ASONAM’13), pp 1439–1440
2. Bordes A, Glorot X, Weston J, Bengio Y (2012) Joint learning of words and meaning representations for open-text semantic parsing. In: International conference on artificial intelligence and statistics, pp 127–135
3. Borgwardt KM, Kriegel HP (2005) Shortest-path kernels on graphs. In: Proceedings of the fifth IEEE international conference on data mining (ICDM’05), pp 74–81
4. Borgwardt KM, Schraudolph NN, Vishwanathan S (2007) Fast computation of graph kernels. In: Schölkopf B, Platt J, Hoffman T (eds) Advances in neural information processing systems, vol 19, pp 1449–1456
5. Burt RS (2009) Structural holes: the social structure of competition. Harvard University Press, Harvard
6. Cheng H, Lo D, Zhou Y, Wang X, Yan X (2009) Identifying bug signatures using discriminative graph mining. In: Proceedings of the eighteenth international symposium on software testing and analysis, pp 141–152
7. Ciresan D, Meier U, Schmidhuber J (2012) Multi-column deep neural networks for image classification. In: 2012 IEEE conference on computer vision and pattern recognition (CVPR), pp 3642–3649
8. Coleman JS (1986) Individual interests and collective action: selected essays. Cambridge University Press, Cambridge
9. Deshpande M, Kuramochi M, Wale N, Karypis G (2005) Frequent substructure-based approaches for classifying chemical compounds. IEEE Trans Knowl Data Eng 17(8):1036–1050
10. Dhillon I, Guan Y, Kulis B (2005) A fast kernel-based multilevel algorithm for graph clustering. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining, pp 629–634
11. Fei H, Huan J (2014) Structured sparse boosting for graph classification. ACM Trans Knowl Discov Data 9(1):4:1–4:22
12. Gascon H, Yamaguchi F, Arp D, Rieck K (2013) Structural detection of android malware using embedded call graphs. In: Proceedings of the 2013 ACM workshop on artificial intelligence and security, pp 45–54
13. Gonzalez JA, Holder LB, Cook DJ (2002) Graph-based relational concept learning. In: Proceedings of the nineteenth international conference on machine learning (ICML’02), pp 219–226
14. Hagberg AA, Schult DA, Swart PJ (2008) Exploring network structure, dynamics, and function using NetworkX. In: Proceedings of the 7th python in science conference (SciPy2008). Pasadena, CA, USA, pp 11–15
15. Han J, Wen JR, Pei J (2014) Within-network classification using radius-constrained neighborhood patterns. In: Proceedings of the 23rd ACM CIKM, pp 1539–1548
16. Heider F (2013) The psychology of interpersonal relations. Wiley, London
17. Henderson K, Gallagher B, Li L, Akoglu L, Eliassi-Rad T, Tong H, Faloutsos C (2011) It’s who you know: graph mining using recursive structural features. In: Proceedings of the 17th ACM SIGKDD, KDD’11
18. Homans GC (1958) Social behavior as exchange. Am J Sociol 63(6):597–606
19. Horváth T, Gärtner T, Wrobel S (2004) Cyclic pattern kernels for predictive graph mining. In: Proceedings of the tenth ACM SIGKDD international conference on knowledge discovery and data mining (KDD’04), pp 158–167
20. Jiang C, Coenen F, Zito M (2013) A survey of frequent subgraph mining algorithms. Knowl Eng Rev 28(01):75–105
21. Jin N, Young C, Wang W (2010) Gaia: graph classification using evolutionary computation. In: Proceedings of the 2010 ACM SIGMOD international conference on management of data. ACM, pp 879–890
22. Keneshloo Y, Yazdani S (2013) A relative feature selection algorithm for graph classification. In: Advances in databases and information systems, advances in intelligent systems and computing, vol 186, pp 137–148
23. Kong X, Fan W, Yu PS (2011) Dual active feature and sample selection for graph classification. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’11), pp 654–662
24. Koutra D, Ke TY, Kang U, Chau DH, Pao HKK, Faloutsos C (2011) Unifying guilt-by-association approaches: theorems and fast algorithms. In: Proceedings of the 2011 European conference on machine learning and knowledge discovery in databases, Volume Part II, pp 245–260
25. Kuramochi M, Karypis G (2001) Frequent subgraph discovery. In: Proceedings of the IEEE international conference on data mining, 2001 (ICDM 2001). IEEE, pp 313–320
26. Li G, Semerci M, Yener B, Zaki MJ (2012) Effective graph classification based on topological and label attributes. Stat Anal Data Min 5(4):265–283
27. Liu F, Liu B, Sun C, Liu M, Wang X (2013) Deep learning approaches for link prediction in social network services. In: Lee M, Hirose A, Hou ZH, Kil RM (eds) Neural information processing, vol 8227. Springer, Berlin, pp 425–432
28. Macindoe O, Richards W (2010) Graph comparison using fine structure analysis. In: Proceedings of the 2010 IEEE second international conference on social computing, pp 193–200
29. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. Adv Neural Inf Process Syst 26:3111–3119
30. Neville J, Jensen D (2000) Iterative classification in relational data. In: Proceedings of the AAAI, pp 13–20
31. Nguyen PC, Ohara K, Mogi A, Motoda H, Washio T (2006) Constructing decision trees for graph-structured data by chunkingless graph-based induction. In: Proceedings of the 10th Pacific-Asia conference on advances in knowledge discovery and data mining (PAKDD’06), pp 390–399
32. Nijssen S, Kok J (2004) A quickstart in frequent structure mining can make a difference. In: Proceedings of the ACM SIGKDD
33. Pan S, Wu J, Zhu X (2015) Cogboost: boosting for fast cost-sensitive graph classification. IEEE Trans Knowl Data Eng 27(11):2933–2946
34. Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Proceedings of the empirical methods in natural language processing (EMNLP 2014), vol 12, pp 1532–1543
35. Perozzi B, Al-Rfou R, Skiena S (2014) Deepwalk: Online learning of social representations. In: Proceedings of the 20th ACM SIGKDD, pp 701–710
36. Rahman M, Bhuiyan MA, Al Hasan M (2014) Graft: an efficient graphlet counting method for large graph analysis. IEEE Trans Knowl Data Eng 26(10):2466–2478
37. Ranu S, Hoang M, Singh A (2013) Mining discriminative subgraphs from global-state networks. In: Proceedings of the 19th ACM SIGKDD international conference on knowledge discovery and data mining, pp 509–517
38. Reiterman J, Rödl V, Šiňajová E (1992) On embedding of graphs into Euclidean spaces of small dimension. J Comb Theory Ser B 56(1):1–8
39. Saigo H, Nowozin S, Kadowaki T, Kudo T, Tsuda K (2009) gboost: a mathematical programming approach to graph classification and regression. Mach Learn 75(1):69–89
40. Shaw B, Jebara T (2009) Structure preserving embedding. In: Proceedings of the 26th annual international conference on machine learning. ACM, pp 937–944
41. Shervashidze N, Petri T, Mehlhorn K, Borgwardt KM, Vishwanathan S (2009) Efficient graphlet kernels for large graph comparison. In: Proceedings of the twelfth international conference on artificial intelligence and statistics (AISTATS09), vol 5, pp 488–495
42. Simard P, Steinkraus D, Platt JC (2003) Best practices for convolutional neural networks applied to visual document analysis. In: Proceedings of the seventh international conference on document analysis and recognition, 2003, pp 958–963
43. Socher R, Huang EH, Pennin J, Manning CD, Ng AY (2011) Dynamic pooling and unfolding recursive autoencoders for paraphrase detection. In: Advances in neural information processing systems, pp 801–809
44. Spielman DA, Teng SH (2004) Nearly-linear time algorithms for graph partitioning, graph sparsification, and solving linear systems. In: Proceedings of the thirty-sixth annual ACM symposium on theory of computing, STOC’04, pp 81–90
45. Szegedy C, Liu W, Jia Y, Sermanet P, Reed S, Anguelov D, Erhan D, Vanhoucke V, Rabinovich A (2014) Going deeper with convolutions. CoRR arXiv:1409.4842
46. The Koblenz Network Collection (konect) (2015) http://konect.uni-koblenz.de/networks/
47. Tang L, Liu H (2009) Scalable learning of collective behavior based on sparse social dimensions. In: Proceedings of the 18th ACM conference on information and knowledge management (CIKM’09), pp 1107–1116
48. Tang L, Liu H (2011) Leveraging social media networks for classification. Data Min Knowl Discov 23(3):447–478
49. Thoma M, Cheng H, Gretton A, Han J, Kriegel HP, Smola A, Song L, Yu PS, Yan X, Borgwardt K (2009) Near-optimal supervised feature selection among frequent subgraphs. In: Proceedings of the 2009 SIAM international conference on data mining. SIAM, pp 1076–1087
50. Thoma M, Cheng H, Gretton A, Han J, Kriegel HP, Smola A, Song L, Yu PS, Yan X, Borgwardt KM (2010) Discriminative frequent subgraph mining with optimality guarantees. Stat Anal Data Min 3(5):302–318
51. Wawer M, Peltason L, Weskamp N, Teckentrup A, Bajorath J (2008) Structure–activity relationship anatomy by network-like similarity graphs and local structure–activity relationship indices. J Med Chem 51(19):6075–6084
52. Yan X, Cheng H, Han J, Yu PS (2008) Mining significant graph patterns by leap search. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data. ACM, pp 433–444
53. Zhou ZH, Zhang ML, Huang SJ, Li YF (2012) Multi-instance multi-label learning. Artif Intell 176(1):2291–2320
Copyright information
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.