## 1 Introduction

Graph-based methods have been very successful in pattern recognition, computer vision and machine learning tasks [16, 25, 77]. However, due to their symbolic and relational nature, graphs have some limitations compared with traditional statistical (vector-based) representations. Some basic mathematical operations have no equivalent in the graph domain. For example, computing pairwise sums or products (elementary operations in many classification and clustering algorithms) is not defined in a standard way for graphs. In the literature, one way of addressing this problem is by means of embedding functions. Given a graph space $${\mathbb {G}}$$, an explicit embedding function is defined as $$\varphi :{\mathbb {G}}\rightarrow {\mathbb {R}}^n$$, which maps a given graph to a vector representation [12, 29, 47, 65, 68], whereas an implicit embedding function is defined as $$\varphi :{\mathbb {G}}\rightarrow {\mathcal {H}}$$, which maps a given graph to a high-dimensional Hilbert space $${\mathcal {H}}$$ where a dot product defines the similarity between two graphs, $$K(G,G')=\langle \varphi (G),\varphi (G') \rangle$$, $$G,G'\in {\mathbb {G}}$$ [18, 27, 32, 35]. In the graph domain, implicit embedding is termed a graph kernel, which essentially defines a way to compute the similarity between two graphs. However, defining such embedding functions is extremely challenging when both time efficiency and the preservation of the underlying structural information are required. The problem becomes even more difficult as graphs grow in size, because the structural complexity increases the possibility of noise and distortion in the structure and raises the risk of losing information. Hierarchical representations, which provide a stable delineation of an underlying object, are often used to deal with noise and distortion [50, 76]. Hierarchical representations allow the graph to be contracted incrementally, in a scale-space representation, so that the salient features (relevant subgraphs) remain in the hierarchy. Thus, the top levels become a compact and stable summarization.

Processing information using a multiscale representation has been successfully employed in computer vision and image processing algorithms, mostly inspired by its resemblance to human visual perception [1]. It has been observed that a naturalistic visual interpretation demands a data structure able to represent scattered local information as well as summarized global facts [33]. Hierarchical representation is often used as a paradigm to efficiently extract the global information from the local features. In addition, hierarchical models are believed to provide time- and space-efficient solutions [76]. Motivated by this intuition and by existing works in related fields, many authors have proposed different hierarchical graph structures for solving various problems [22, 23, 48, 76]. In this sense, it is worth mentioning the work of Mousavi et al. [50], who presented a hierarchical framework for graph embedding, although they did not explore the complex encoding of the hierarchy.

In this paper, motivated by the successes of hierarchical models and the efficiency of graph embedding theory, we propose a general hierarchical graph embedding formulation that first creates a hierarchical structure from a given graph and then utilizes the multiscale structure to explicitly embed the graph in a real vector space by means of local graphlets. First, we make use of the graph clustering algorithm proposed in [31] to obtain a hierarchical graph representation of a given input graph. Here, each cluster of nodes at level i is depicted as a single node at the upper hierarchical level $$i+1$$, the edges within a level are created depending on the original topology of the base graph, and the hierarchical edges are created by joining a node representing a cluster to all the nodes of that cluster in the lower level. Thus, we propose a richer encoding than Mousavi et al. [50], because our hierarchy not only contains different graph abstractions but also encodes useful hierarchical contractions through the hierarchical edges.

Once the hierarchical structure of a graph is created, we propose a novel use of the Stochastic Graphlet Embedding (SGE) [21] to exploit this hierarchical information. On the one hand, we can exploit the local configuration in the form of graphlets thanks to the SGE design, because graphlets provide information at different neighborhood sizes. On the other hand, the hierarchical connections allow more abstract information to be encoded and hence help to deal with noise present in the data. As a result, the Hierarchical Stochastic Graphlet Embedding (HSGE) encodes a global and compact representation of the graph that is embedded in a vector space. Considering the entire graph hierarchy for the embedding, instead of only the base graph, strengthens the representation ability and mitigates the loss of information that usually occurs in graph embedding methods. Moreover, the statistics obtained from the uniformly sampled graphlets of increasing size model the complex interactions among different object parts represented as graph nodes. Together, the hierarchical graph structure and the statistics of graphlets of increasing size capture important structural information from varied contexts.

As a result, our approach produces robust representations that benefit from the advantages of the two above-mentioned strategies: first, the ability of embeddings to map symbolic relational representations to n-dimensional spaces, so that machine learning approaches can be used; and second, the ability of hierarchical structures to reduce the noise and distortion inherently involved in graph representations of real data, keeping the more stable and relevant substructures in a compact way.

In conclusion, the main contribution of our work is the exploitation of the hierarchical structure of a given graph, rather than studying only the base graph, for graph embedding purposes. Assessing the hierarchical information of a graph pyramid extends the representation power of the embedded graph and helps tolerate the instability caused by noise and distortion. Our proposal is robust because, on the one hand, it organizes the structural information in the hierarchical abstraction, and on the other hand, it considers the relation between object parts and their complex interactions with the help of uniformly sampled graphlets of unbounded size. Additionally, the proposed method is generic and can accommodate any other graph embedding algorithm in the framework. We extensively validated our proposed algorithm on many different benchmark graph datasets coming from different application domains.

The rest of this paper is organized as follows: Sect. 2 describes the related works in the literature. In Sect. 3, we introduce some definitions and notations related to the work. Our generic hierarchical graph representation is presented in Sect. 4. Section 5 introduces the Stochastic Graphlet Embedding as the base embedding we will use. Afterward, Sect. 7 reports our experimental validation and compares the proposed method with available state-of-the-art algorithms. Finally, in Sect. 8 we draw the conclusions and describe the future direction of the present work.

## 2 Related work

In what follows, we review the related works on explicit and implicit graph embedding techniques, different hierarchical models and graph summarization methods, which we believe to be relevant to the main focus of the present paper.

### 2.1 Graph embedding

Graph embedding methods are mainly divided into two different categories: (1) explicit graph embedding, (2) implicit graph embedding or graph kernel.

#### 2.1.1 Explicit graph embedding

Explicit graph embedding refers to those techniques that aim to explicitly map graphs to vector spaces. The methods belonging to this category can be further divided into four different classes. The first one, known as graph probing [47], measures the frequency of specific substructures (that capture content and topology) in graphs. Depending on the graph substructures considered (e.g., nodes, edges, subgraphs), different embedding techniques have been proposed. For example, Shervashidze et al. [68] studied non-isomorphic graphlets, whereas node label and edge relation statistics were considered by Gibert et al. [29]. Saund [65] introduced a bottom-up graph lattice in order to efficiently extract subgraph features in preprocessed administrative documents, while Dutta and Sahbi [21] proposed a distribution of stochastic graphlets for embedding graphs into a vector space. The second class of graph embedding techniques is based on spectral graph theory [13, 34, 37, 39, 64, 82], which aims to analyze the structural properties of graphs in terms of the eigenvectors/eigenvalues of the adjacency or Laplacian matrices of a graph [82]. Recently, Verma and Zhang [78] proposed a family of graph spectral distances for robust graph feature representation. Despite their relative successes, spectral methods are quite prone to structural noise and distortions. The third class of methods is inspired by the dissimilarity measurements proposed in [56]; in this context, Bunke and Riesen have presented several works on the vectorial description of a given graph by its distances to a number of pre-selected prototype graphs [9, 12, 62, 63]. Finally, motivated by the recent advancements of deep learning and neural networks, many researchers have proposed to utilize neural networks for obtaining a vectorial representation of graphs [4, 17, 30, 36, 55], which results in the fourth category of methods, called geometric deep learning.

#### 2.1.2 Implicit graph embedding

Implicit graph embedding (graph kernel) methods are another way to embed graphs into a vector space. They are also popular for their ability to efficiently extend existing machine learning algorithms to nonlinear data such as graphs and strings. Graph kernel methods can be roughly divided into three different categories. The first one, known as the diffusion kernel, is based on similarity measures among the subparts of two graphs, which are propagated over the entire structure to obtain a global similarity measure for the two graphs [43, 72]. The second class of methods, called the convolution kernel, aims to measure the similarity of composite objects (modeled with graphs) from the similarity of their parts (i.e., nodes) [80]. This type of graph kernel derives the similarity between two graphs G, $$G'$$ from the sum, over all decompositions, of the similarity products of the subparts of G and $$G'$$ [52]. Recently, Kondor and Pan [38] proposed the multiscale Laplacian graph kernel, which has the property of lifting a base kernel defined on the vertices of two graphs to a kernel between graphs. The third class of methods is based on the analysis of the common substructures that belong to both graphs and is termed the substructure kernel. This family includes the graph kernel methods that consider random walks [27, 79], backtrackless walks [5], shortest paths [8], subtrees [68] or graphlets [70] as the substructure. Different from the above three categories, Shervashidze et al. [69] proposed a family of efficient graph kernels based on the Weisfeiler-Lehman test of graph isomorphism, which maps the original graph to a sequence of graphs. More recently, inspired by the successes of deep learning, Yanardag and Vishwanathan [83] presented a unified framework to learn latent representations of substructures for graphs. They claimed that, given a pre-computed graph kernel, their technique produces an improved representation that leverages hidden representations of substructures.

### 2.2 Hierarchical graph representation

In general, hierarchical models have been successfully employed in many different domains within the computer vision and image processing field, such as, image segmentation [22, 48], scene categorization [23], action recognition [54], shape classification [18], graphic recognition [10], 3D object recognition [76] etc. These approaches usually exploit some kind of pyramidal structure containing information at various resolutions. Usually, at the finest level of the pyramid, the captured information is related to local features, whereas, at coarser levels, global aspects of the underlying data are represented. This way of representation helps to interpret knowledge in a naturalistic way [33].

Inspired by the above intuition, hierarchical structures are often employed to extract coarse-to-fine information from a graph representation. Pelillo et al. [57] proposed to match two hierarchical structures as a clique detection problem on their association graph, which was solved with a dynamic programming approach. In [71], Shokoufandeh et al. presented a framework based on spectral characterization for indexing hierarchical structures that embed the topological information of a directed acyclic graph. A hierarchical representation of objects and an elastic matching procedure were also proposed for deformable shape matching in [24]. In [46], Liu et al. utilized a hierarchical graph representation and a stochastic sampling strategy for the layered shape matching and registration problem. A graph kernel based on a hierarchical bag-of-paths, where each path is associated with a hierarchy encoding successive simplifications, is presented in [18]. Ahuja and Todorovic [2] used a hierarchical graph of segmented regions for object recognition. Motivated by them, Broelemann et al. [10, 11] proposed two closely related approaches based on hierarchical graphs for error-tolerant matching of graphical symbols. Mousavi et al. [50] proposed a graph embedding strategy based on a hierarchical graph representation, which considers different levels of a graph pyramid. They claimed that the proposed framework is generic enough to incorporate any kind of graph embedding technique. However, the authors did not take advantage of the complex and rich encoding of the hierarchy.

From the literature review, we can conclude that although some works in the graph domain exploit the hierarchical graph structure, most of them focus on some kind of error tolerance or elastic matching. The use of this type of multiscale graph representation for vector space embedding is quite rare and has not been properly explored yet. This fact motivated us to work on a hierarchical graph structure for the explicit graph embedding task.

## 3 Definitions and notations

In this section, we introduce some definitions and notations, which are relevant to the proposed work.

### Definition 1

(Attributed Graph) An attributed graph is a 4-tuple $$G=(V,E,L_V,L_E)$$ comprising a set V of vertices together with a set $$E\subseteq V\times V$$ of edges and two mappings $$L_V:V\rightarrow {\mathbb {R}}^m$$ and $$L_E:E\rightarrow {\mathbb {R}}^n$$ which, respectively, assign attributes to the nodes and edges.

Attributed graphs have been widely used for all sorts of real-world problems. The most common methodologies are error-tolerant graph matching [51, 67], graph kernels and embedding techniques [41].

### Definition 2

(Subgraph) Given an attributed graph $$G=(V,E,L_V,L_E)$$, another attributed graph $$G'=(V',E',L_V',L_E')$$ is said to be a subgraph of G and is denoted by $$G'\subseteq G$$ iff,

• $$V'\subseteq V$$

• $$E'=E\cap (V'\times V')$$

• $$L_V'(u)=L_V(u)$$, $$\forall u \in V'$$

• $$L_E'(e)=L_E(e)$$, $$\forall e \in E'$$

A graphlet g of G is simply a subgraph which inherits the topology and the attributes of G. In the literature, subgraphs are often used for error-tolerant matching [7, 19, 66, 73, 75] and frequent pattern discovery problems [2, 6, 42].
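Definitions 1 and 2 can be illustrated with a minimal Python sketch. The class name `AttributedGraph`, its method names and the choice of storing undirected edges as `frozenset` pairs are assumptions made for illustration only; they are not part of the original formulation, which defines $$E\subseteq V\times V$$ without fixing a data structure.

```python
from dataclasses import dataclass, field


@dataclass
class AttributedGraph:
    """Illustrative container for G = (V, E, L_V, L_E).

    Assumption of this sketch: edges are undirected and stored as
    frozensets of their two end points.
    """
    node_attrs: dict = field(default_factory=dict)  # L_V: node -> attribute tuple
    edge_attrs: dict = field(default_factory=dict)  # L_E: edge -> attribute tuple

    def add_node(self, u, attr=()):
        self.node_attrs[u] = tuple(attr)

    def add_edge(self, u, v, attr=()):
        self.edge_attrs[frozenset((u, v))] = tuple(attr)

    def subgraph(self, nodes):
        """Subgraph on `nodes` in the sense of Definition 2."""
        keep = set(nodes) & set(self.node_attrs)
        sub = AttributedGraph()
        # V' is a subset of V and the labels are inherited unchanged.
        sub.node_attrs = {u: self.node_attrs[u] for u in keep}
        # E' keeps exactly the edges of E with both end points in V'.
        sub.edge_attrs = {e: a for e, a in self.edge_attrs.items() if e <= keep}
        return sub


# Small usage example: a path a - b - c restricted to {a, b}.
g = AttributedGraph()
g.add_node("a", (0.0,))
g.add_node("b", (1.0,))
g.add_node("c", (2.0,))
g.add_edge("a", "b", (0.5,))
g.add_edge("b", "c", (0.7,))
s = g.subgraph({"a", "b"})
```

In this example `s` keeps the nodes a and b, the single edge between them and their original attributes, which is what a graphlet inherits from G.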

### Definition 3

(Hierarchical graph) A hierarchical graph H is defined as a 6-tuple $$H=(V,E_N,E_H,L_V,L_{E_N},L_{E_H})$$ where V is the set of nodes; $$E_N \subseteq V\times V$$ are the neighborhood edges; $$E_H \subseteq V\times V$$ are the hierarchical edges; $$L_V$$, $$L_{E_N}$$ and $$L_{E_H}$$ are three labeling functions defined as $$L_V:V \rightarrow \Sigma _V \times A^k_V$$, $$L_{E_N}: E_N \rightarrow \Sigma _{E_N} \times A^l_{E_N}$$ and $$L_{E_H}: E_H \rightarrow \Sigma _{E_H} \times A^m_{E_H}$$, where $$\Sigma _V$$, $$\Sigma _{E_N}$$ and $$\Sigma _{E_H}$$ are three sets of symbolic labels for vertices and edges, $$A_V$$, $$A_{E_N}$$ and $$A_{E_H}$$ are three sets of attributes for vertices and edges, respectively, and $$k,l,m\in {\mathbb {N}}$$.

Prior works used hierarchical structures for allowing a reasonable tolerance in the representation paradigm [11, 18, 24] and also for bringing robustness in the feature representation [46].

## 4 Hierarchical embedding

In the literature, only a few embedding approaches exploit multiscale or abstraction information [38]. This section provides a framework able to include this information in a given graph embedding. Some works that have been proposed in the literature to exploit multiscale information [20, 50, 59] discard the hierarchical information provided by the hierarchical edges and focus on abstractions of the original graph.

### 4.1 Graph clustering

Graph clustering has been widely used in several fields such as social and biological networks [31] and recommendation systems [28, 44]. It can be roughly described as the task of grouping graph nodes into clusters depending on the graph structure. Ideally, the grouping should be performed in such a way that intra-cluster nodes are densely connected whereas the connections among inter-cluster nodes are sparse. For example, Girvan and Newman [31] proposed a graph clustering algorithm to detect community structures for studying social and biological networks. Li et al. [28, 40, 44, 45] have proposed several graph clustering techniques for recommendation systems based on different strategies: context awareness [28], inclusion of frequency property [44], distributed clustering confidence [40], etc. We do not review graph clustering algorithms further, since this is not within the main scope of this paper. However, we would like to remark that one of the most important aspects of graph clustering is the evaluation of cluster quality, which is crucial not only to measure the effectiveness of clustering algorithms, but also to give insights into the dynamics of relationships in a given graph. For a detailed overview of effective graph clustering metrics, interested readers are referred to [3].

Even though any graph clustering algorithm can be used, we use the standard divisive Girvan–Newman algorithm [31] for our purpose, because it provides structurally meaningful clusters of a given graph. The Girvan–Newman algorithm is an intuitive and well-known algorithm used for community detection in complex systems. It is a global divisive algorithm which iteratively removes an appropriate edge until all the edges are deleted. At each iteration, new clusters can emerge as connected components. The idea is that the edges with higher centrality are the candidates for connecting two clusters. Therefore, the betweenness centrality of the edges [26] is used to decide which edge is removed. The betweenness centrality of an edge $$e \in E$$ is defined as the number of shortest paths between any pair of nodes that cross e. The output of this algorithm is a dendrogram codifying a hierarchical clustering of nodes. This algorithm consists of 4 steps:

1. Calculate the betweenness centrality for all edges in the network.

2. Remove the edge with highest betweenness and generate a cluster for each connected component.

3. Recalculate betweennesses for all edges affected by the removal.

4. Repeat from step 2 until no edges remain.

In this work, the Girvan–Newman algorithm is stopped early given a reduction ratio $$r \in {\mathbb {R}}$$, so that the number of clusters is forced to be $$\lfloor r \cdot |V| \rfloor$$.
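The four steps and the early stopping rule can be sketched in plain Python as follows. This is only an illustrative sketch under stated assumptions: the function names are hypothetical, the graph is assumed to be unweighted, undirected and free of self-loops, and betweenness is recomputed for all edges at every iteration rather than only for the edges affected by the removal.

```python
from collections import deque
from math import floor


def connected_components(nodes, adj):
    """Return the connected components as a list of node sets (BFS)."""
    seen, comps = set(), []
    for s in nodes:
        if s in seen:
            continue
        comp, queue = {s}, deque([s])
        seen.add(s)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps


def edge_betweenness(nodes, adj):
    """Brandes-style edge betweenness for an unweighted graph.

    Every unordered node pair is visited from both end points, so the
    scores are twice the usual value; the ranking is unaffected.
    """
    score = {}
    for s in nodes:
        dist, sigma, preds = {s: 0}, {s: 1.0}, {s: []}
        order, queue = [], deque([s])
        while queue:
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:  # u lies on a shortest path to v
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {u: 0.0 for u in order}
        for v in reversed(order):
            for u in preds[v]:
                c = sigma[u] / sigma[v] * (1.0 + delta[v])
                e = frozenset((u, v))
                score[e] = score.get(e, 0.0) + c
                delta[u] += c
    return score


def girvan_newman(nodes, edges, r):
    """Remove highest-betweenness edges until floor(r * |V|) clusters exist."""
    nodes = list(nodes)
    target = max(1, floor(r * len(nodes)))
    adj = {u: set() for u in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    comps = connected_components(nodes, adj)
    while len(comps) < target:
        score = edge_betweenness(nodes, adj)  # steps 1 and 3 (full recomputation)
        if not score:                         # no edges left to remove
            break
        u, v = tuple(max(score, key=score.get))  # step 2
        adj[u].discard(v)
        adj[v].discard(u)
        comps = connected_components(nodes, adj)
    return comps


# Two triangles joined by the bridge (2, 3); r = 0.4 gives floor(2.4) = 2 clusters.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
clusters = girvan_newman(range(6), edges, r=0.4)
```

In the example the bridge carries every shortest path between the two triangles, so it is removed first and the two triangles are returned as clusters. Since one removal adds at most one connected component, the loop stops exactly at the target whenever the input graph has no more components than the target.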

### 4.2 Hierarchical construction

Given a graph G and a clustering $$C = \{C_1,\ldots ,C_k\}$$, each cluster is summarized into a new node with a representative label (see line 5). This label can be defined as the result of an embedding function applied to the subgraph defined by the clustered nodes and their edges. Moreover, edges between the new nodes are created depending on a connection ratio between clusters; that is, an edge is only created if there are enough connections between the sets of nodes defined by the two clusters (see line 7). Finally, hierarchical edges are created connecting the new node $$v_{C_i}$$ with all the nodes belonging to the summarized cluster $$C_i$$ (see line 12). The proposed hierarchical construction is similar to the one proposed by Mousavi et al. [50], but it explicitly includes the summarization generated by the clustering algorithm by means of the hierarchical edges. Thus, the proposed hierarchical construction obtains a representation which encodes abstract information by means of the clusters while keeping the relation with the original graph.

Let us introduce some notations that will be used in the following sections. Given a graph G and a number of levels L, $$H_G$$ denotes the corresponding hierarchical graph computed from G with L levels. $$H_G^l$$, where $$l \in \{0,\ldots ,L\}$$, is a graph without hierarchical edges corresponding to the l-th level of summarization; therefore, $$H_G^0 = G$$. Moreover, $$H_G^{l_1,l_2}$$, where $$l_1,l_2 \in \{0,\ldots ,L\}$$ and $$l_1\le l_2$$, corresponds to the hierarchical graph comprised between levels $$l_1$$ and $$l_2$$. Hence, $$H_G = H_G^{0,L}$$ and $$H_G^l = H_G^{l,l}$$. Finally, $$H_G^{l_1} \cup H_G^{l_2}$$ corresponds to the union of two graphs without hierarchical edges.

Figure 1a shows the construction of the hierarchy given a graph G. Each level shows an abstraction of the input graph where the nodes have been reduced.
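One step of the construction described in this section can be sketched as below. Several details are assumptions of this sketch and not taken from the text: the function name `build_upper_level`, the naming of new nodes as `(level, index)` pairs, the use of the cluster size as a placeholder for the representative label (the text suggests an embedding of the clustered subgraph instead), and the concrete connection ratio, taken here as the number of edges between two clusters divided by the product of their sizes and compared against a hypothetical threshold `min_ratio`.

```python
def build_upper_level(edges, clusters, level, min_ratio=0.0):
    """Summarize each cluster of the current level as one node of the next level.

    Returns (new_nodes, neighborhood_edges, hierarchical_edges).
    """
    clusters = [frozenset(c) for c in clusters]

    # One new node per cluster; the label is a placeholder (cluster size).
    new_nodes = {(level, i): {"size": len(c)} for i, c in enumerate(clusters)}

    # Count the edges of the lower level that connect two different clusters.
    owner = {u: i for i, c in enumerate(clusters) for u in c}
    cross = {}
    for u, v in edges:
        i, j = owner[u], owner[v]
        if i != j:
            key = (min(i, j), max(i, j))
            cross[key] = cross.get(key, 0) + 1

    # Create an edge between two new nodes only if the clusters are
    # sufficiently connected (assumed ratio, see the lead-in).
    neighborhood_edges = []
    for (i, j), n in sorted(cross.items()):
        ratio = n / (len(clusters[i]) * len(clusters[j]))
        if ratio > min_ratio:
            neighborhood_edges.append(((level, i), (level, j)))

    # Hierarchical edges join each new node to every member of its cluster.
    hierarchical_edges = [
        ((level, i), u) for i, c in enumerate(clusters) for u in sorted(c)
    ]
    return new_nodes, neighborhood_edges, hierarchical_edges


# Two triangles joined by one bridge, clustered into the two triangles.
edges = [(0, 1), (1, 2), (0, 2), (2, 3), (3, 4), (4, 5), (3, 5)]
clusters = [{0, 1, 2}, {3, 4, 5}]
nodes1, nbr1, hier1 = build_upper_level(edges, clusters, level=1)
```

Applying the function repeatedly, each time clustering the newly created level, would produce the successive levels $$H_G^1,\ldots ,H_G^L$$, while the accumulated hierarchical edges link consecutive levels.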

### 4.3 Hierarchical embedding

This section introduces a novel way to encode the hierarchical information of a graph into an embedding. Moreover, the proposed technique is generic in the sense that it can be used with any graph embedding function.

Given a graph G which should be mapped into a vector space and an embedding function $$\varphi :{\mathbb {G}}\rightarrow {\mathbb {R}}^n$$, we first obtain the hierarchical representation $$H_G$$ following the methodology proposed in Sect. 4.2. Thus, $$H_G$$ enriches the original graph with abstract information considering L levels. We then propose to make use of the hierarchical information to construct a hierarchical embedding. The general form of the proposed embedding takes into account graphs at multiple scales and hierarchical relations. Thus, the embedding function not only compactly encodes the contextual information of nodes at different abstraction levels, but also encodes the hierarchy contraction. The embedding function is defined as follows:

$$\begin{aligned} \Phi (H_G) = [&\varphi (H_G^0),\ldots ,\varphi (H_G^K), \\ &\phi _1^1(H_G),\ldots ,\phi _1^{k_1}(H_G), \\ &\phi _2^1(H_G), \ldots , \phi _2^{k_2}(H_G) ], \end{aligned} \tag{1}$$

where

$$\phi _1^k(H_G) = [ \varphi (H_G^{0,k}),\ldots ,\varphi (H_G^{K-k,K})], \tag{2}$$

$$\phi _2^k(H_G) = [ \varphi (H_G^0 \cup \cdots \cup H_G^{k}),\ldots ,\varphi (H_G^{K-k} \cup \cdots \cup H_G^K)], \tag{3}$$

where $$K \le L$$ is the highest hierarchical level taken into account and $$k_1,k_2 \le K$$ bound how many consecutive levels are combined at the same time. Note that $$K=L$$, $$k_1=K$$ and $$k_2=K$$ take into account the whole hierarchy and all possible combinations. From this general representation of the proposed embedding, we have evaluated some particular cases (the reader is referred to Sect. 7 for more details on the experimental evaluation).

Baseline embedding. This embedding is used as a baseline. In this scenario $$K=0$$, $$k_1=0$$ and $$k_2=0$$, therefore $$\Phi (H_G) = \varphi (H_G^0)$$. No abstract information is taken into consideration, hence $$\Phi (H_G) = \varphi (G)$$.

Pyramidal embedding. This embedding has been previously proposed in the literature [20, 50]. It combines the information of the abstract levels of the graph, i.e., $$H_G^i$$, without taking hierarchical information into account. Therefore, the hierarchical edges are discarded and no relation between levels is considered: $$K\ge 1$$, $$k_1=0$$ and $$k_2=0$$. We define $$\Phi _{\text {pyr}}(H_G) = [\varphi (H_G^0),\ldots ,\varphi (H_G^K)]$$. Note that each element corresponds to independent levels of the hierarchy without hierarchical edges.

Generalized pyramidal embedding. Following the previous idea, the information of the abstract levels of the graph, i.e., $$H_G^i$$, is combined. Now, hierarchical information is taken into account by embedding unions of levels, i.e., $$H_G^{i_1} \cup H_G^{i_2}$$, but discarding hierarchical edges (no clustering information is taken into account). In this scenario $$K\ge 1$$, $$k_1=0$$ and $$k_2\ge 1$$, therefore, we define $$\Phi _{\text {gen}\_\text {pyr}}(H_G) = [\varphi (H_G^0),\ldots ,\varphi (H_G^K),\varphi (H_G^0 \cup H_G^1),\ldots ,\varphi (H_G^{K-1} \cup H_G^K), \ldots , \varphi (H_G^0 \cup \cdots \cup H_G^{k_2}),\ldots ,\varphi (H_G^{K-k_2} \cup \cdots \cup H_G^K)]$$.

Hierarchical embedding. This embedding is computed by mixing different levels, considering them as a single graph through the hierarchical edges: $$K \ge 1$$, $$k_1 \ge 1$$ and $$k_2=0$$. The idea is to create an embedding able to codify both graph and clustering information. Depending on the embedding, hierarchical edges can be given a special label so that they are treated differently. The hierarchical embedding is defined as $$\Phi _{\text {hier}}(H_G) = [\varphi (H_G^0),\ldots ,\varphi (H_G^K),\varphi (H_G^{0,1}),\ldots ,\varphi (H_G^{K-1,K}) ,\ldots , \varphi (H_G^{0,k_1}), \ldots ,\varphi (H_G^{K-k_1,K})]$$. Note that each element corresponds to the subhierarchy comprised between the specified levels.

Exhaustive embedding. Finally, in order to take into consideration the whole hierarchy, we can make use of the whole embedding $$\Phi$$ as defined in Eq. (1), where $$K \ge 1$$ and $$k_1, k_2 \ge 1$$.

Figure 1b shows the graphs taken into consideration when the hierarchical embeddings are computed.
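The assembly of $$\Phi$$ from Eqs. (1)–(3) and its particular cases can be made concrete with a short Python sketch. Everything named here is an assumption for illustration: levels are stored as `(nodes, edges)` pairs of sets, `hier_edges[l]` holds the hierarchical edges between levels l and l+1, and `toy_phi` is a deliberately trivial stand-in for the base embedding $$\varphi$$ (it only counts nodes and edges), not the SGE. Special labels for hierarchical edges are not modeled.

```python
def union_levels(levels, lo, hi, hier_edges=None):
    """Union of levels lo..hi; hierarchical edges are added only if given."""
    nodes, edges = set(), set()
    for l in range(lo, hi + 1):
        nodes |= levels[l][0]
        edges |= levels[l][1]
    if hier_edges is not None:
        for l in range(lo, hi):
            edges |= hier_edges[l]
    return nodes, edges


def hierarchical_embedding(levels, hier_edges, phi, K, k1=0, k2=0):
    """Concatenate the blocks of Eq. (1) for a base embedding phi."""
    vec = []
    for l in range(K + 1):                    # phi(H^0), ..., phi(H^K)
        vec += phi(*levels[l])
    for k in range(1, k1 + 1):                # Eq. (2): with hierarchical edges
        for i in range(K - k + 1):
            vec += phi(*union_levels(levels, i, i + k, hier_edges))
    for k in range(1, k2 + 1):                # Eq. (3): plain unions of levels
        for i in range(K - k + 1):
            vec += phi(*union_levels(levels, i, i + k))
    return vec


def toy_phi(nodes, edges):
    """Hypothetical base embedding: [number of nodes, number of edges]."""
    return [len(nodes), len(edges)]


# A path 0-1-2-3 summarized into two cluster nodes "A" and "B".
levels = [
    ({0, 1, 2, 3}, {(0, 1), (1, 2), (2, 3)}),
    ({"A", "B"}, {("A", "B")}),
]
hier_edges = [{("A", 0), ("A", 1), ("B", 2), ("B", 3)}]

baseline = hierarchical_embedding(levels, hier_edges, toy_phi, K=0)
pyramidal = hierarchical_embedding(levels, hier_edges, toy_phi, K=1)
gen_pyramidal = hierarchical_embedding(levels, hier_edges, toy_phi, K=1, k2=1)
hierarchical = hierarchical_embedding(levels, hier_edges, toy_phi, K=1, k1=1)
exhaustive = hierarchical_embedding(levels, hier_edges, toy_phi, K=1, k1=1, k2=1)
```

With this toy $$\varphi$$, the hierarchical and generalized pyramidal variants differ only in the last block: the former sees the four hierarchical edges in $$H_G^{0,1}$$, while the latter embeds $$H_G^0 \cup H_G^1$$ without them.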

## 5 Stochastic graphlet embedding

The Stochastic Graphlet Embedding (SGE) can be defined as a function $$\varphi :{\mathbb {G}} \rightarrow {\mathbb {R}}^n$$ that explicitly embeds a graph $$G\in {\mathbb {G}}$$ into a high-dimensional vector space $${\mathbb {R}}^n$$ [21]. The entire procedure of SGE can be described in two stages (see Fig. 2): in the first stage, the method samples graphlets from G in a stochastic manner, and in the second stage, it counts the frequency of each isomorphic graphlet among the extracted ones in an approximate but nearly accurate manner. The entire procedure yields a precise distribution of connected graphlets with an increasing number of edges in G at a controlled complexity, which captures the relations among the information represented as nodes and their complex interactions.

### 5.1 Stochastic graphlets sampling

Considering a graph $$G=(V,E,L_V,L_E)$$, the goal of the graphlet extraction procedure is to obtain statistics of stochastic graphlets with an increasing number of edges in G. The extraction is stochastic and uniformly samples graphlets with a boundlessly increasing number of edges without constraining their topology or structural properties such as maximum degree, maximum number of nodes, etc. Our graphlet sampling procedure, outlined in Algorithm 2, is recurrent and the number of recurrences is controlled by a parameter M that indicates the number of distinct graphlets to be sampled (see line 2 of Algorithm 2). Also, each of these M recurrent processes is regulated by another parameter T that denotes the maximum number of iterations a single recurrent process should have (see line 5). Since each of these iterations adds an edge to the graphlet under construction, T indirectly specifies the maximum number of distinct edges each graphlet should contain. Considering $$U_t$$ and $$A_t$$, respectively, as the aggregated sets of visited nodes and edges up to iteration t, they are initialized at the beginning of each recurrent step as $$A_0=\emptyset$$ and $$U_0=\lbrace u \rbrace$$ with a node u uniformly sampled from V (see line 4). Thereafter, at the t-th iteration (with $$t\ge 1$$), the sampling procedure randomly selects an edge $$(u,v)\in E \backslash A_{t-1}$$ that is incident to some node $$u\in U_{t-1}$$ (see line 7). Accordingly, the process updates $$U_t \leftarrow U_{t-1} \cup \lbrace v \rbrace$$ and $$A_{t} \leftarrow A_{t-1} \cup \lbrace (u,v) \rbrace$$ (see line 8). All these processes within a recurrent step are repeated T times to sample a graphlet with at most T edges. M is set to relatively large values in order to make the graphlet generation statistically meaningful. Theoretically, the values of M are guided by the theorem of sample complexity [81], which is widely studied and used in the Bioinformatics domain [58, 70]. However, its discussion and proof are out of the scope of the current paper. Intuitively, the graphlet sampling procedure explained in this section follows a random walk process with restart that efficiently parses G and extracts the desired number of connected graphlets with an increasing number of edges. This algorithm samples connected graphlets from a given graph while avoiding the expense of extracting them in an exact manner. The hypothesis is that if a sufficient number of graphlets are sampled, then the empirical distribution will be close to the actual distribution of graphlets in the graph. Furthermore, it is important to note that the above process extracts, in total, $$M \times T$$ graphlets, each with a number of edges varying from 1 to T.
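The sampling loop described above can be sketched in Python as follows. This is not a transcription of Algorithm 2: the function name, the `seed` parameter, the `frozenset` edge encoding and the early exit when no unvisited incident edge remains are all assumptions of this sketch, and node and edge attributes are ignored.

```python
import random


def sample_graphlets(nodes, edges, M, T, seed=None):
    """Sample up to M * T connected graphlets, given as frozensets of edges."""
    rng = random.Random(seed)
    nodes = list(nodes)
    incident = {u: [] for u in nodes}
    for u, v in edges:
        e = frozenset((u, v))
        incident[u].append(e)
        incident[v].append(e)

    graphlets = []
    for _ in range(M):                       # M recurrent processes
        visited_nodes = {rng.choice(nodes)}  # U_0 = {u}, u uniform in V
        visited_edges = set()                # A_0 = empty set
        for _t in range(T):                  # at most T iterations
            # Edges of E \ A_{t-1} incident to some node of U_{t-1}.
            frontier = sorted(
                {e for u in visited_nodes for e in incident[u]} - visited_edges,
                key=sorted,
            )
            if not frontier:                 # nothing left to grow (assumption)
                break
            e = rng.choice(frontier)
            visited_edges.add(e)             # A_t = A_{t-1} + {(u, v)}
            visited_nodes |= e               # U_t = U_{t-1} + {v}
            # Each iteration yields one graphlet with t edges.
            graphlets.append(frozenset(visited_edges))
    return graphlets


# Path graph 0-1-2-3: every walk can always be extended to two edges.
path_edges = [(0, 1), (1, 2), (2, 3)]
samples = sample_graphlets(range(4), path_edges, M=5, T=2, seed=0)
```

Because the partial graphlet is recorded after every added edge, a run with parameters M and T yields $$M \times T$$ graphlets whenever the walk never gets stuck, which matches the count stated above. The frontier is recomputed from scratch at each step for clarity; an incremental update would be preferable for large graphs.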

### 5.2 Hashed graphlets distribution

To obtain a distribution of the graphlets extracted from G, one needs to identify the sets of isomorphic graphlets among the sampled ones and then count the cardinality of each set. A trivial way of doing so is to check graph isomorphism for all possible pairs of graphlets in order to detect the partitions that exist among them. However, graph isomorphism is a GI-complete problem [49] for general graphs, so this scheme is extremely costly, as the method samples a huge number of graphlets with many edges. An efficient, approximate alternative for partitioning isomorphic graphlets is graph hashing. A graph hash function can be defined as a mapping $$h:{\mathbb {G}} \rightarrow {\mathbb {R}}^m$$ that maps a graph to a hash code (a sequence of real numbers) based on the local as well as the holistic topological characteristics of the graph. An ideal graph hash function should map two isomorphic graphs to the same hash code and two non-isomorphic graphs to two different hash codes. While it is easy to design hash functions satisfying the first condition, it is extremely difficult to find a hash function that ensures different hash codes for every pair of non-isomorphic graphs. An alternative is to design graph hash functions with low collision probability, i.e., functions that map two non-isomorphic graphs to the same hash code only with a very low probability. For obtaining a distribution of graphlets, the main aim of graph hashing is to assign the graphlets extracted from G to their corresponding subsets of isomorphic graphlets (a.k.a. partition indices or histogram bins) in order to count and quantify their distributions.
The proposed mechanism for obtaining the distribution of uniformly sampled graphlets, outlined in Algorithm 3, maintains a global hash table $${\mathbf {H}}$$, each entry of which corresponds to the hash code of a graphlet g produced by the graph hash function. $${\mathbf {H}}$$ grows incrementally as the algorithm encounters new graph hash codes, and it stores all the unique hash codes seen so far. Note that the position of each unique hash code is kept fixed, because each position corresponds to a partition index or histogram bin. To allocate a given graphlet g to its histogram bin, its hash code h(g) is mapped to the index of the entry of $${\mathbf {H}}$$ whose hash code matches h(g) (see line 8). If h(g) does not yet exist in $${\mathbf {H}}$$, it is considered a new hash code (and hence g a new graphlet), and h(g) is appended at the end of $${\mathbf {H}}$$ (see line 6).
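The bookkeeping described above can be sketched in a few lines of Python. This is an illustrative sketch rather than the code of Algorithm 3: the class name `HashTable`, the method `bin_of` and the function `graphlet_histogram` are hypothetical, and hash codes are assumed to be hashable tuples.

```python
class HashTable:
    """Global table H: unique hash codes stored at fixed positions."""

    def __init__(self):
        self.codes = []   # H itself; the position of a code is its bin index
        self.index = {}   # hash code -> position in self.codes

    def bin_of(self, code):
        # A code not seen before is appended, so existing positions never move.
        if code not in self.index:
            self.index[code] = len(self.codes)
            self.codes.append(code)
        return self.index[code]


def graphlet_histogram(table, hash_codes):
    """Count how many sampled graphlets fall into each bin of the table."""
    counts = {}
    for code in hash_codes:
        b = table.bin_of(code)
        counts[b] = counts.get(b, 0) + 1
    return [counts.get(b, 0) for b in range(len(table.codes))]
```

Because the table is shared across graphs, a histogram computed early may be shorter than one computed later. In this sketch the shorter vectors would need to be zero-padded to the final table length before they are compared.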

Designing hash functions that yield identical hash codes for two isomorphic graphlets is quite simple, whereas designing those that provide distinct hash codes for two non-isomorphic graphs is very challenging. The chance of mapping two non-isomorphic graphs to the same hash code is termed the probability of collision. Denoting by $$H_0$$ the set of all pairs of non-isomorphic graphs, the probability of collision can be expressed as the following energy function:

\begin{aligned} E(f) = P((g,g') \in H_0 \quad | \quad h(g) = h(g')) \end{aligned}
(4)

So, in terms of collision probability, hash functions that produce comparatively lower E(f) values in Eq. (4) are considered more reliable for checking graph isomorphism. It has been shown that the sorted degree of nodes has zero collision probability for all graphs with at most 4 edges [21]. Moreover, it is known that two graphs with the same (sorted) betweenness centrality are isomorphic with high probability [15, 53]. For example, sorted betweenness centrality has collision probabilities of $$3.2\times 10^{-4}$$, $$1.9\times 10^{-4}$$ and $$1.1\times 10^{-4}$$, respectively, for graphlets with 7, 8 and 9 edges. Interested readers are referred to [21] for further discussion and analysis of various graph hash functions and their collision probabilities. Considering the above facts, in this work we use the sorted degree of nodes for graphlets with $$t\le 4$$ and the sorted betweenness centrality for graphlets with $$t\ge 5$$.

\begin{aligned} \text {Hash function}= {\left\{ \begin{array}{ll} \text { degree of nodes},&{} \quad \text {if}\, t\le 4\\ \text { betweenness centrality},&{} \quad \text {otherwise} \end{array}\right. } \end{aligned}
(5)
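A minimal Python sketch of the rule in Eq. (5) is given below, assuming a graphlet is passed as a list of undirected edges. The helper names (`adjacency`, `sorted_degree`, `sorted_betweenness`, `graphlet_hash`) are hypothetical. Betweenness is computed with a plain Brandes-style accumulation for unweighted graphs, and rounding to six decimals so that floating-point codes compare reliably is an assumption of this sketch.

```python
from collections import deque

def adjacency(edges):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def sorted_degree(edges):
    return tuple(sorted(len(nb) for nb in adjacency(edges).values()))

def sorted_betweenness(edges):
    adj = adjacency(edges)
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # Breadth-first search from s, counting shortest paths (sigma).
        sigma = {v: 0 for v in adj}
        dist = {v: -1 for v in adj}
        preds = {v: [] for v in adj}
        sigma[s], dist[s] = 1, 0
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Back-propagate pair dependencies from the farthest nodes.
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair was counted from both endpoints, hence the division.
    return tuple(sorted(round(b / 2, 6) for b in bc.values()))

def graphlet_hash(edges):
    """Eq. (5): sorted degrees if t <= 4, sorted betweenness otherwise."""
    return sorted_degree(edges) if len(edges) <= 4 else sorted_betweenness(edges)
```

As a sanity check of why the switch happens at five edges, consider two non-isomorphic trees with five edges: a centre with legs of length 1, 1 and 3, and a centre with legs of length 1, 2 and 2. They share the sorted degree sequence (1, 1, 1, 2, 2, 3), so a degree-based code collides, whereas their sorted betweenness values differ.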

It should be observed that the distribution of sampled graphlets obtained as described so far only considers the topological structure of a graph and ignores the node and edge attributes. However, the stochastic graphlet embedding can take into account a small set of node and edge attributes by creating their respective signatures and appending them to the hash code that encodes the topology of the graphlet. In this work, if needed, we first discretize the existing continuous attributes using a combination of a clustering algorithm, such as k-means, and a pooling technique. Then, the sorted discrete node and edge labels are used as the attribute signatures and combined with the hash code.

### 5.3 Hierarchical stochastic graphlet embedding

In this work, we propose to combine the properties of the proposed Stochastic Graphlet Embedding with the Hierarchical Embedding introduced in the previous section.

On the one hand, SGE provides statistical information about local structures while varying the number of edges involved. It therefore provides fine-grained insight into the graph, but it cannot cope well with very noisy data. On the other hand, the abstractions provided by the graph hierarchy increase the receptive field of each graphlet, moving toward coarser information that gives insight into the global structure of the graph. Moreover, the use of hierarchical edges during the computation makes it possible to combine information across levels, i.e., to combine different levels of detail (see Eq. (1)). From now on, we will denote this embedding as Hierarchical Stochastic Graphlet Embedding (HSGE).

## 6 Computational complexity

This section is devoted to studying the computational complexity of the proposed approach given a graph $$G=(V,E,L_V,L_E)$$ where $$|V|=n$$ and $$|E|=m$$.

### 6.1 Hierarchical embedding complexity

Graph clustering algorithms are usually high computational complexity techniques. As it has been stated in Sect. 4.3, the Girvan–Newman algorithm has been chosen as a graph clustering technique. The Girvan–Newman algorithm is based on the betweenness centrality of the edges which has a time complexity of $${\mathcal {O}}(n \cdot m)$$ for unweighted graphs and $${\mathcal {O}}(n \cdot m + n\cdot (n+m) \log (n))$$ for weighted graphs. Hence, the Girvan–Newman algorithm, which has to remove all the edges, can be computed in $${\mathcal {O}}(n \cdot m^2)$$ for unweighted graphs and $${\mathcal {O}}(n \cdot m^2 + n\cdot m \cdot (n+m) \log (n))$$ for weighted graphs.
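The step from the betweenness cost to the overall cost can be made explicit. Assuming that edge betweenness is recomputed after each of the m edge removals, the unweighted and weighted bounds follow by multiplying the per-iteration cost by m:

\begin{aligned} m \cdot {\mathcal {O}}(n \cdot m)&= {\mathcal {O}}(n \cdot m^2),\\ m \cdot {\mathcal {O}}(n \cdot m + n\cdot (n+m) \log (n))&= {\mathcal {O}}(n \cdot m^2 + n\cdot m \cdot (n+m) \log (n)). \end{aligned}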

Assuming an embedding function $$\varphi$$ with a complexity of $${\mathcal {O}}(N)$$, a hierarchical graph construction with a complexity of $$C_1$$, and L levels, the proposed configurations have a complexity of $${\mathcal {O}}(C_1 + L\cdot N)$$ in the case of the pyramid and $${\mathcal {O}}(C_1 + L^2\cdot N)$$ for the hierarchy and the exhaustive embeddings.

### 6.2 Stochastic graphlet embedding complexity

The computational complexity of Algorithm 2 is $${\mathcal {O}}(M \cdot T)$$, where M is the number of graphlets to be sampled and T is the maximum size of graphlets in terms of the number of edges. Assuming a hash function with a complexity of $${\mathcal {O}}(C_2)$$, Algorithm 3 has a time complexity of $${\mathcal {O}}(M \cdot T \cdot C_2)$$ for computing the stochastic graphlet embedding. Here it is worth mentioning that “degree of nodes” and “betweenness centrality” have time complexities of $${\mathcal {O}}(n)$$ and $${\mathcal {O}}(n \cdot m)$$, respectively, where n and m in this case refer to the number of nodes and edges of the graphlet being hashed, which is bounded by T. From the above explanation, it is clear that the complexity of these two algorithms does not depend on the size of the input graph G, but only on the parameters M, T and the hash functions used.

## 7 Experimental validation

This section presents the experimental results obtained by our proposed Hierarchical Stochastic Graphlet Embedding method. The main aim of this experimental study is to validate the proposed graph embedding technique on the graph classification task, which demands a robust embedding technique for mapping a graph into a vector space. For experimentation, we have considered many widely used graph datasets with varied characteristics. All these graphs come from real data generated in the fields of Biology, Chemistry, Graphics and Handwriting recognition. The MATLAB code of our experiments is available at https://github.com/priba/hierarchicalSGE.

### 7.1 Experiments on molecular graph datasets

The first set of experiments is conducted on various benchmarks of molecular graphs. Below, we provide a brief description of them followed by the experimental setup, results and discussions.

#### 7.1.1 Dataset description

Several bioinformatics datasets have been used: MUTAG, PTC, PROTEINS, NCI1, NCI109, D&D and MAO. These datasets have been widely used as benchmarks in the literature. The MUTAG dataset contains graph representations of 188 chemical compounds which are either mutagenic aromatic or heteroaromatic nitro compounds, where nodes can have 7 discrete labels. The PTC (Predictive Toxicology Challenge) dataset consists of 344 chemical compounds known to cause or not cause cancer in rats and mice. It has 19 discrete node labels. The PROTEINS dataset contains relations between secondary structure elements (SSEs), which are represented by nodes, while edges represent neighborhood in the amino-acid sequence or in 3D space. It has 3 discrete labels, viz. helix, sheet or turn. NCI1 and NCI109 come from the National Cancer Institute (NCI) and are two balanced subsets of chemical compounds screened for their ability to suppress or inhibit the growth of a panel of human tumor cell lines; they have 37 and 38 discrete node labels, respectively. The D&D dataset consists of enzyme and non-enzyme protein structures, whose nodes are amino acids. The MAO database, taken from the GREYC Chemistry graph dataset collection, is composed of 68 graphs representing molecules that either inhibit or do not inhibit monoamine oxidase, an enzyme whose inhibitors are used as antidepressant drugs. More details on these bioinformatics datasets are provided in Table 1.

#### 7.1.2 Experimental setup

We have performed two different experiments: the first one does not use the attribute information encoded in the nodes and edges of the graphs, whereas the second experiment does use the available node and edge features. For evaluating the performance of the proposed embedding technique, we have used a C-SVM solver [14] as a classifier. Since the datasets considered in this set of experiments do not contain predefined train and test sets, we have used a 10-fold cross-validation scheme to obtain accuracies and have reported the mean accuracies, respectively, in Tables 2 and 3 for unlabeled and labeled datasets. We follow a classical graph classification pipeline, where, in the first stage, graph embedding is computed by our proposed scheme, whereas in the second step, embedded graphs are classified using a previously trained classifier.
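The evaluation protocol can be sketched as follows. This is a generic illustration of 10-fold cross-validation and not the released MATLAB code: the function names `k_fold_indices` and `cross_validate` are hypothetical, and the classifier is abstracted as a user-supplied callback `train_and_score(train_idx, test_idx)` that is assumed to train on the embedded graphs of the training folds (e.g., with a C-SVM) and return the accuracy on the held-out fold.

```python
import random

def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(n_samples, train_and_score, k=10, seed=0):
    """Mean accuracy over k folds; each fold is held out exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        scores.append(train_and_score(train_idx, test_idx))
    return sum(scores) / k
```

For a dataset such as MUTAG with 188 graphs, this split yields folds of 18 or 19 graphs each.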

#### 7.1.3 Results and discussion

In Table 2, we present the experimental results obtained by our proposed hierarchical embedding techniques together with other existing works on the unlabeled datasets. The previously mentioned three configurations of our hierarchical embedding are, respectively, denoted as: pyramidal, hierarchical and exhaustive. For unlabeled datasets, we have considered 10 different state-of-the-art methods: (1) random walk kernel (RW) [27], (2) shortest path kernel (SP) [8], (3) graphlet kernel (GK) [70], (4) Weisfeiler-Lehman kernel (WL) [69], (5) deep graph kernel (DGK) [83], (6) multiscale Laplacian graph kernel (MLK) [38], (7) diffusion CNNs (DCNN) [4], (8) strong graph spectrums (SGS) [37], (9) family of graph spectral distances (F_GSD) [78], and (10) stochastic graphlet embedding (SGE) [21].

From the quantitative results shown in Table 2, it can be observed that for most datasets the highest accuracy is achieved by one of our proposed hierarchical configurations, which sets new state-of-the-art results on those datasets. In particular, the best accuracies are obtained either by the pyramidal or by the exhaustive configuration, which indicates the importance of considering hierarchical information for the graph embedding problem. As expected, the proposed hierarchical embeddings achieve better performance than SGE, which is regarded as the baseline of our proposal. It should also be observed that, in this experimental setting, the hierarchical configuration performs quite poorly compared to the other two configurations. This might suggest that the hierarchical edges together with the connecting levels alone do not contain sufficient information for a robust graph representation. The information captured in the multiscale graphs appears to play a vital role in graph embedding, which is supported by the excellent performance obtained with the pyramidal and exhaustive configurations.

In Table 3, we demonstrate the results acquired by three different configurations of our proposed hierarchical embedding on the labeled graph datasets. For comparing with other state-of-the-art methods, we have considered two additional techniques: (1) PATCHY-SAN (PSCN) [55] and (2) graphlet spectrum (GS) [39]. Some of the previously considered state-of-the-art techniques do not work with labeled graphs, so they have not been evaluated in this experimentation.

The results presented in Table 3 show that, except on the MUTAG dataset, our proposed hierarchical embedding techniques achieve the best performance on all the datasets. This demonstrates the usefulness of considering hierarchical information when embedding graphs into a vector space. Contrary to the previous experiments on unlabeled datasets, in this case the hierarchical configuration performs considerably better. This suggests that, on labeled graphs, the hierarchical edges together with the connecting levels might provide important structural information. It is also important to note that the level information performs consistently on all the datasets.

### 7.2 Experiments on AIDS, GREC, COIL-DEL and histograph datasets

While the datasets considered in the previous set of experiments were mostly molecular in nature, the set of experiments to be discussed in this section consider graphs from various fields, such as, Biology, Computer Vision, Graphics Recognition and Handwriting Recognition. Underneath, we give a brief description of the datasets considered followed by the experimental setup, results and discussions.

#### 7.2.1 Dataset description

In this experiment, we consider four different datasets; three of them, viz. AIDS, GREC and COIL-DEL, are taken from the IAM graph database repository [60]. The first one, the AIDS database, consists of 2000 graphs representing molecular compounds which are constructed from the AIDS Antiviral Screen Database of Active Compounds. This dataset consists of two classes, viz. active (400 elements) and inactive (1600 elements), which represent molecules with and without possible activity against HIV, respectively. The GREC dataset consists of 1100 graphs representing 22 different classes (characterizing architectural and electronic symbols) with 50 instances per class; these instances have different noise levels. The COIL-DEL database includes 3900 graphs belonging to 100 different classes with 39 instances per class; each instance has a different rotation angle. The HistoGraph dataset [74] consists of graphs representing words from the letters written by the first US president, George Washington. It consists of 293 graphs generated from 30 distinct words. Therefore, given a word, the task of the classifier is to predict its class among the 30 words. Nodes are only labeled with their position in the image. Furthermore, this dataset uses six different graph representation paradigms for delineating a single word into a graph, which results in six different subsets of graphs. The entire dataset is divided into 90, 60 and 143 graphs for training, validation and test purposes, respectively. See Table 4 for the relevant statistics on these four datasets.

#### 7.2.2 Experimental setup

In this case as well, we have employed a C-SVM solver [14] as a classifier. Since the datasets used in this set of experiments contain well defined train and test sets, we have reported the obtained accuracies on the test set of the respective datasets in Table 5.

#### 7.2.3 Results and discussion

As in the previous section, in this set of experiments our proposed hierarchical embeddings achieve the best results on most datasets. The leading scores are mostly obtained by the exhaustive configuration, which shows the effectiveness of combining multiscale structural information with the hierarchical connections. For some datasets, our hierarchical embedding does not achieve the best results, but it remains very competitive. This further supports the robustness of the hierarchical graph representation.

### 7.3 Discussion on the parameters involved in the algorithm

Our algorithm is mainly controlled by three different parameters: (1) the number of levels L of the graph pyramid, (2) the reduction ratio R and (3) the maximum number of edges T of a graphlet. To illustrate how these three parameters control the performance of the system, we plot the classification accuracy while varying the number of levels of the graph pyramid (see Fig. 3), the reduction ratio (see Fig. 4) and T (see Fig. 5). For the sake of simplicity, for each level we only consider the maximum accuracy obtained by any configuration mentioned in Sect. 4.3. From Fig. 3, we can observe that for all the datasets, considering a second level together with the base graph increases the classification accuracy. However, the successive inclusion of hierarchical levels does not always increase the performance. It has been observed that for smaller graphs (with fewer nodes and edges, e.g., the graphs from MUTAG), the further inclusion of hierarchical abstraction decreases the performance of the system; this means that for smaller graphs a higher-level abstraction can introduce noise or distortion. The reduction ratio R directly determines the number of clusters in a given level, and hence the number of nodes in the next higher level of the hierarchy. For example, $$R=1$$ indicates that the number of clusters should be equal to the number of nodes, while $$R=2$$ indicates that the number of clusters should be half the number of nodes in that level. Figure 4 shows the behavior of our method with different values of R for a fixed $$L=2$$. From these plots, one can observe that the best value of R depends entirely on the dataset, irrespective of the size of the graphs it contains. For the PTC, PROTEINS and MAO datasets, the performance mostly increases with R, while for MUTAG it improves until $$R=2$$ and then decreases for all hierarchical configurations.
For the MAO dataset, all the hierarchical configurations behave in exactly the same way as R increases, which might be because its graphs are small, so that the contributions of the different hierarchical configurations are indistinguishable.
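The effect of R and L on the size of the hierarchy can be illustrated with a short Python sketch. The function name `level_sizes` is hypothetical, and two conventions are assumptions of this sketch rather than statements from the paper: L counts the base graph as the first level, and the number of clusters is rounded up and never drops below one.

```python
import math

def level_sizes(n_nodes, R, L):
    """Number of nodes at each of L levels, starting from the base graph.

    Each level clusters the previous one into about 1/R as many nodes.
    Rounding up and the lower bound of one node are illustrative choices.
    """
    sizes = [n_nodes]
    for _ in range(L - 1):
        sizes.append(max(1, math.ceil(sizes[-1] / R)))
    return sizes
```

With $$R=1$$ every level keeps the same number of nodes, while with $$R=2$$ a 20-node base graph shrinks to 10 and then 5 nodes over three levels, which hints at why additional levels may carry little structure for small graphs.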

In Fig. 5, we show the performance trend on six datasets (i.e., MUTAG, PTC, PROTEINS, NCI1, and NCI109) only with the SGE algorithm, which is the baseline graph embedding technique that we considered. The hierarchical configurations are not considered in this case because they use different graphlet sizes at different hierarchical levels, which would make their behavior harder to interpret. From Fig. 5, it is clear that increasing T mostly improves the performance of the system on all the datasets. There are, however, some exceptions (e.g., $$T=6$$ for the PTC dataset), which suggests that graphlets with T edges are less informative for that particular graph dataset.

### 7.4 Discussion on the stochasticity of the algorithm

It is important to note that our proposed algorithm is stochastic in nature because of the stochastic graphlet sampling and the subsequent graph embedding procedure. The graphlet sampling used here uniformly samples graphlets from a given population of graphs, and by the law of large numbers, this sampling guarantees that the empirical distribution of graphlets is asymptotically close to the actual distribution [58]. To demonstrate that the stochastic behavior of our algorithm does not heavily affect the experimental results, we repeated the last experiment on all the datasets considered for 10 iterations, randomly seeding the sampling algorithm in each iteration. The mean and standard deviation of the classification accuracy obtained for each dataset are reported in Table 6. The mean accuracies reported in the table are quite close to the ones reported in Table 5, and the standard deviations are comparatively low (all of them are less than 1.0). This suggests that the proposed graph embedding technique, although it employs a stochastic process, is consistent in terms of performance.
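The law-of-large-numbers argument and the repeated-seed protocol can be illustrated with a toy Python sketch. It does not reproduce the experiment of Table 6: a fixed categorical distribution (`true_probs`) stands in for the unknown graphlet distribution, and the function names are hypothetical. For each seed, M samples are drawn and the L1 distance between the empirical and the true distribution is measured; the mean and standard deviation over the seeds are then reported.

```python
import random
import statistics
from collections import Counter

def empirical_l1_error(true_probs, M, seed):
    """L1 distance between the true distribution and M-sample frequencies."""
    rng = random.Random(seed)
    labels = list(range(len(true_probs)))
    draws = rng.choices(labels, weights=true_probs, k=M)
    counts = Counter(draws)
    return sum(abs(counts[i] / M - p) for i, p in enumerate(true_probs))

def repeat_over_seeds(true_probs, M, seeds):
    """Mean and standard deviation of the sampling error over several seeds."""
    errors = [empirical_l1_error(true_probs, M, s) for s in seeds]
    return statistics.mean(errors), statistics.stdev(errors)
```

Calling `repeat_over_seeds([0.5, 0.3, 0.2], M, range(10))` with a small and a large M shows the error shrinking as M grows, which is the behavior the sampling hypothesis relies on.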

## 8 Conclusions

In this paper, we have proposed to enhance the information encoded in graph embeddings by means of hierarchical representations. We have experimentally validated that the abstract information is able to improve the graph classification performance. The embedding function is based on a stochastic sampling of graphlets to obtain the graphlet distribution within the graph. Graphlets of different sizes are considered to allow a change in the node context. Moreover, hashing functions are used to identify graphlets in an efficient way. Even though considering graphlets of different sizes provides robustness to graph distortions, they still capture only local information when we consider larger graphs. Therefore, building a graph hierarchy makes it possible to increase the graphlet context without increasing the time needed for identifying the graphlet. In this work, we have carefully validated the performance of our approach in different application scenarios, showing that we outperform the state-of-the-art approaches in the graph classification task using an SVM as a classifier.

Further research will focus on improving the hierarchical graph construction. Even though the Girvan–Newman algorithm is able to exploit the desired properties of the graph, producing clusterings that yield good abstractions, its time complexity is a drawback that should be addressed when considering large graphs.