Abstract
Given a node-attributed graph, how can we efficiently represent it with few numerical features that expressively reflect its topology and attribute information? We propose ADOGE, for attributed DOS-based graph embedding, based on density of states (DOS, a.k.a. spectral density) to tackle this problem. ADOGE is designed to fulfill a long list of desirable characteristics. Most notably, it capitalizes on efficient approximation algorithms for DOS, which we extend to blend in node labels and attributes for the first time, making it fast and scalable for large attributed graphs and graph databases. Being based on the entire eigenspectrum of a graph, ADOGE can capture structural and attribute properties at multiple (“glocal”) scales. Moreover, it is unsupervised (i.e., agnostic to any specific objective) and lends itself to various interpretations, which makes it suitable for exploratory graph mining tasks. Finally, it processes each graph independently of others, making it amenable to streaming settings as well as parallelization. Through extensive experiments, we show the efficacy and efficiency of ADOGE on exploratory graph analysis and graph classification tasks, where it significantly outperforms unsupervised baselines and achieves competitive performance with modern supervised GNNs, while achieving the best tradeoff between accuracy and runtime.
1 Introduction
Graphs are widely used to model structured data from different domains such as chemistry [1], biology [2], cybersecurity [3], and finance [4]. The effectiveness and popularity of data-driven machine learning algorithms have necessitated expressive vector representations of different kinds of complex data, and graphs are no exception. Unlike images or text, graphs pose novel challenges in finding effective representations, as graph databases may contain graphs that vary in size and structure and do not necessarily exhibit alignment (i.e., correspondence) between the nodes of different graphs.
Formally, we want to design a function \(R: {G} \mapsto \textbf{z}_G \in \mathbb {R}^D\), where D is a fixed embedding size that does not depend on the input graph size. Ideally, given a graph database with N graphs (with n nodes and m edges per graph on average), we want R to be (i) permutation and size invariant, so that graphs with similar structure and label/attribute distribution have similar embeddings irrespective of node ordering and number of nodes; (ii) flexible, leveraging information from node labels and/or multi-attributes as well as edge weights; (iii) multi-scale/glocal, capturing local/microscopic, mesoscopic, as well as global/macroscopic properties of a graph; and (iv) task-agnostic/unsupervised, producing embeddings independent of any downstream task or related class labels, since not being tied to a specific task makes the embeddings general-purpose, e.g., for graph mining and exploratory data analysis. In addition, as with any algorithm, we want R to be (v) efficient and scalable to large graphs (large n, m) as well as large databases (large N). Finally, it may be desirable for R to (vi) produce one embedding at a time, independently per graph (as opposed to “collective processing”), which allows on-the-fly embedding of each incoming graph in streaming settings, as well as embarrassingly parallel execution for speed.
Spectrally designed embeddings are a popular class of techniques based on the graph eigenspectrum [5], as it captures key structural graph properties, such as cuts [6], random walk stationarity [7], dynamical processes and epidemic thresholds [8], diameter, connectedness, and clustering [9]. However, the high complexity of computing the eigenspectrum exactly has proven to be a barrier to creating spectrum-based graph embeddings. Moreover, while the eigenspectrum can capture important topological properties, blending node attributes/labels into spectrally designed embeddings is nontrivial.
In this paper, we leverage fast algorithms for approximating the spectral density of a graph [10] and use it to independently construct unsupervised graph embeddings that are permutation and size invariant, flexible and multiscale. Here, the focus is on representing the entire spectrum of the graph, which helps capture any arbitrary “band” of eigenvalues (bandpass), rather than only the extremal eigenpairs (low/high pass).
1.1 Prior work
Table 1 gives a comparison with three categories of relevant prior work in the context of desired properties for a graph embedding. Each of these existing works fails to satisfy one or more of the aforementioned properties (ii)–(vi), as we discuss next.
1.1.1 Unsupervised explicit graph embedding (UEGE)
Several unsupervised methods construct an explicit vector representation for each graph. Among those, spectrum-based methods have gained popularity in recent years. FGSD [11] treats a graph as a collection of spectral distances between its vertices. NetLSD [12] represents a graph as a collection of heat traces of the graph at several time points. Both methods are effective at capturing local and global structural properties of a graph; however, they ignore node labels and attributes. graph2vec [13] creates Weisfeiler–Lehman (WL) subtree-based features and learns an embedding of the graph trained to predict the existence of subtrees in the graph. It admits node labels, but ignores node attributes as well as edge weights.
1.1.2 Graph kernels (GK)
Due to the existence of many effective distance measures between graphs, graph kernels are a more widely studied method of graph representations [24]. While most popular kernels are effective at capturing characteristics of the graph structure, only a few, including the Propagation Kernel (PK) [17] are able to factor in edge weights, node labels and continuous node attributes (see Table 1 in [24]).
Several graph kernels that use spectral properties have been developed in recent years. RetGK [18] represents each graph as a collection of node embeddings, where the node features are the return-probabilities of random walks of varying lengths. SAGE [16] extends this idea to graphs with labeled and attributed nodes by appending to each node embedding its one-hot encoded label and/or attributes. However, neither of these methods scales well to large graphs. Moreover, computing return probabilities of random walks tends to over-represent local features near a node, and often fails to capture global properties of the graph [19]. These issues are addressed by the density of states (DOS) GK and its pointwise (i.e., node-level) extension (PDOS), which use Chebyshev polynomials to efficiently capture global properties of random walks and rely on fast approximation techniques [10]. However, despite their efficiency, they are limited to plain graphs and do not admit node labels or attributes.
Moreover, although graph kernels have proven effective at modeling graph structure, and in some cases node labels and attributes, many kernel methods require computing an \(N \times N\) kernel matrix, which can be restrictive in both time and space and does not scale to large databases with many graphs.
1.1.3 Graph neural networks (GNNs)
While most existing unsupervised embedding and kernel methods are ill-equipped to handle continuous node attributes, GNNs are able to leverage such data to a great extent. However, deep parameterized models come with their own drawbacks. They are resource-hungry, not task-agnostic, and can be slow to train. Moreover, when viewed through a spectral lens [25], most neighborhood-aggregation based GNNs such as GCN [20] and GIN [26] can only act as low-pass or high-pass filters on a graph spectrum. Only spectrally designed GNNs such as ChebNet [22] and CayleyNet [23] can act as band-pass filters.
A perhaps subtle characteristic of graph embedding methods is independent versus dependent/collective processing of the graphs. By design, all GNN-based methods, as well as graph2vec, require collective processing due to end-to-end training. WL and PK obtain the compressed labels and histogram bins, respectively, based on all graphs, which makes them dependent. RetGK, DOSGK, and SAGE obtain graph-level embeddings by kernelizing the set of node-level embeddings, whose size differs across graphs, and hence they are inherently bound to create \(N \times N\) pairwise kernel values rather than an explicit/independent embedding for each graph.
1.2 Our contributions
We propose ADOGE (Attributed DOS-based Graph Embedding) for extremely fast unsupervised embedding of attributed graphs. The embedding is permutation and size invariant, flexible, and multi-scale, and it is produced independently per graph.
Our main technical contributions are as follows:

New graph-level embedding algorithm: We introduce a new spectrally designed graph embedding approach, called ADOGE, that leverages the whole (eigen)spectrum of a graph. ADOGE capitalizes on recent algorithms that can efficiently approximate the (local) density of states, (L)DOS [10], extending them to attributed graphs for the first time.

Desired characteristics: Thanks to efficient approximations, ADOGE is extremely fast. It can handle node labels, continuous multi-attributes, and edge weights. Leveraging the whole spectrum, it enables variable band-pass filtering as well as features that capture multi-scale properties. Further, it processes each graph independently of others, which makes it amenable to streaming scenarios as well as parallelization.

Exploratory graph analysis: ADOGE is not tied to any specific objective, which makes it suitable for both unsupervised and supervised tasks. In fact, our embedding features lend themselves to various interpretations, related to graph signal convolution, random walks, and band-filters, which prove useful in data mining and exploratory analysis of real-world graph datasets, as we show through experiments.

Efficacy and Efficiency: Extensive experiments show that ADOGE is on par with or superior to all unsupervised baselines, and competitive against modern supervised GNNs on graph classification tasks. Notably, it achieves the best runtime–accuracy tradeoff. (See Fig. 1.)
Reproducibility and Resources: We share all datasets and source code at https://github.com/sawlani/ADOGE.
2 Problem statement and preliminaries
Notation. We denote scalars, vectors, matrices and sets by lowercase (x), lowercase boldface (\(\textbf{x}\)), uppercase boldface (\(\textbf{X}\)), and calligraphic (\(\mathcal {X}\)) letters, respectively. \(\textbf{X}_{:j}\) and \(\textbf{X}_{ij}\) refer to the jth column and the (i, j)th entry of a matrix.
We consider undirected, weighted node-attributed graphs \(G=(\mathcal {V}, \mathcal {E}, \textbf{X}, \mathcal {A})\) where \(\mathcal {V}=\{v_1,\ldots ,v_n\}\) denotes the set of n nodes, and \(\mathcal {E}\subseteq \mathcal {V}\times \mathcal {V}\) denotes the set of m edges. \(\textbf{W}\) denotes the weighted adjacency matrix, where \(\textbf{W}_{ij} > 0\) if \((v_i,v_j) \in \mathcal {E}\), and 0 otherwise. \(\textbf{X}\) is the \(n \times d\) node-attribute matrix, where \(\mathcal {A}=\{a_1,\ldots ,a_d\}\) denotes the set of d attributes, with \(dom(a_j)\) denoting the domain of attribute \(a_j\). In graph signal processing (GSP) terminology, any \(\textbf{x}=\textbf{X}_{:j}\) can be thought of as a graph signal on the nodes, with one scalar per node.
Problem 1
(Unsupervised Graph-level Embedding) Given a set of undirected, weighted and node-attributed/labeled graphs \(\mathcal {G} = \{G_1, \ldots , G_N\}\), for \(G_i=(\mathcal {V}_i, \mathcal {E}_i, \textbf{X}_i, \mathcal {A})\), where

(i) graphs in \(\mathcal {G}\) can be of varying sizes, (ii) there exists no particular correspondence between the nodes of different graphs, and (iii) the (categorical and/or continuous) attributes and their domain are shared among all graphs,
Find a D-dimensional graph-level embedding \(\textbf{z}_G \in \mathbb {R}^D\) for each \(G\in \mathcal {G}\) that captures both structural and attribute information.
Let \({{\widetilde{\textbf{W}}}}= \textbf{D}^{-1/2} \textbf{W}\textbf{D}^{-1/2}\) denote the symmetrically normalized adjacency matrix, where \(\textbf{D}\) is the diagonal degree matrix with \(\textbf{D}_{i,i} = \sum _j \textbf{W}_{i,j}\). Let \({{\widetilde{\textbf{L}}}}= \textbf{I} - {{\widetilde{\textbf{W}}}}\) denote the normalized Laplacian matrix, and \(\textbf{P}=\textbf{D}^{-1} \textbf{W}\) the random walk matrix. For a connected graph, \({{\widetilde{\textbf{W}}}}\) has eigenvalues \(-1 \le \lambda _0 \le \lambda _1 \le \ldots \le \lambda _{n-1}=1\) with corresponding eigenvectors \(\{\textbf{u}_k\}_{k=0}^{n-1}\). \({{\widetilde{\textbf{W}}}}\) has the same set of eigenvectors as \({{\widetilde{\textbf{L}}}}\), whose eigenvalues are the shifted set \(\{\mu _k = 1 - \lambda _k\}_{k=0}^{n-1} \subset [0,2]\). \({{\widetilde{\textbf{W}}}}\) also shares the same eigenvalues as \(\textbf{P}\). As such, the spectral density function has bounded support for these graph matrices. Following GSP convention, we refer to the eigenvalues as the graph frequencies.
In this work, we use \({{\widetilde{\textbf{W}}}}\) as the so-called graph shift operator \(\textbf{S}\), a notion that generalizes to any symmetric matrix of a graph. Let \(\textbf{S}= \textbf{U}\varvec{\Lambda }\textbf{U}^T\) denote the eigendecomposition, where \(\varvec{\Lambda }:= \text {diag}([\lambda _1 \ldots \lambda _n])\) and \(\textbf{U}= [\textbf{u}_1 \ldots \textbf{u}_n]\).
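The relationships among these matrices can be checked numerically. The following is a minimal NumPy sketch, assuming a small hypothetical weighted graph; the matrix `W` and the helper name `normalized_adjacency` are illustrative and not taken from the paper's code.

```python
import numpy as np

def normalized_adjacency(W):
    """Return D^{-1/2} W D^{-1/2} for a symmetric weighted adjacency matrix W."""
    deg = W.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(deg)              # assumes no isolated nodes
    return W * inv_sqrt[:, None] * inv_sqrt[None, :]

# Hypothetical toy graph: 4 nodes, symmetric positive edge weights, connected.
W = np.array([[0, 1, 0, 2],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [2, 1, 1, 0]], dtype=float)

S = normalized_adjacency(W)
lam = np.linalg.eigvalsh(S)                        # eigenvalues of W~, ascending
mu = np.linalg.eigvalsh(np.eye(len(W)) - S)        # eigenvalues of L~ = I - W~
P = W / W.sum(axis=1, keepdims=True)               # random walk matrix D^{-1} W
lam_P = np.sort(np.linalg.eigvals(P).real)         # P is similar to W~, so its spectrum is real

print("spectrum of W~:", np.round(lam, 4))
```

The sketch uses a dense eigensolver purely for illustration; it is the exact computation that the approximation algorithms discussed later are meant to avoid on large graphs.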
Definition 1
(Graph spectrum) The spectrum of a graph is composed of the set of the graph eigenvalues, together with their multiplicities, of the (normalized) adjacency matrix.
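Graph Fourier transform. The graph Fourier transform (GFT) of a graph signal \(\textbf{x}\in \mathbb {R}^n\) is defined as the projection
\[ \widehat{\textbf{x}} = \textbf{U}^T \textbf{x}, \]
and the inverse GFT of \({\widehat{\textbf{y}}}\in \mathbb {R}^n\) is given as
\[ \textbf{y} = \textbf{U}\, \widehat{\textbf{y}}. \]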
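Graph filtering. A graph filter is an operation on a graph signal with output in the graph frequency domain, that is,
\[ \widehat{\textbf{y}} = \phi (\varvec{\Lambda })\, \widehat{\textbf{x}} = \phi (\varvec{\Lambda })\, \textbf{U}^T \textbf{x}, \tag{1} \]
where \(\phi (\varvec{\Lambda })\) is a diagonal matrix with filter frequency response values as its diagonal elements.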
Definition 2
(Frequency Response Function (FRF)) The frequency response function of a graph filter is written as
\[ \phi (\varvec{\Lambda }) = \text {diag}([\phi (\lambda _1) \ldots \phi (\lambda _n)]), \tag{2} \]
which, simply put, assigns a scalar value \(\phi (\lambda _i)\) to each graph frequency (i.e., eigenvalue) \(\lambda _i\).
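The transform and filter described above can be sketched in a few lines of NumPy. This is an illustrative example under stated assumptions: the random symmetric matrix `S` stands in for a graph shift operator, and the quadratic response `phi` is a hypothetical choice, not one of the filters used by ADOGE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical symmetric shift operator on 6 nodes (any symmetric matrix works here).
A = rng.random((6, 6))
S = (A + A.T) / 2.0

lam, U = np.linalg.eigh(S)           # S = U diag(lam) U^T
x = rng.standard_normal(6)           # a graph signal: one scalar per node

def phi(l):
    """Example frequency response: a quadratic polynomial of the graph frequency."""
    return 1.0 + 0.5 * l + 0.25 * l ** 2

x_hat = U.T @ x                      # GFT: project the signal onto the eigenvectors
y_hat = phi(lam) * x_hat             # filter: scale each frequency component by phi(lambda_i)
y = U @ y_hat                        # inverse GFT: filter output in the node domain

# For a polynomial response the same output follows from matrix-vector products alone.
y_direct = x + 0.5 * (S @ x) + 0.25 * (S @ (S @ x))
```

The last line illustrates why polynomial responses are attractive in practice: they can be applied without computing any eigenvectors.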
By applying the inverse GFT on both sides of Eq. (1), we can get the filter output in the node domain as
\[ \textbf{y} = \textbf{U}\, \phi (\varvec{\Lambda })\, \textbf{U}^T \textbf{x}. \]
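Signal convolution. Graph convolution of two signals, say \(\textbf{x}\) and \(\textbf{x}'\), each in \(\mathbb {R}^n\), yields another signal \(\textbf{c}\in \mathbb {R}^n\) as
\[ \textbf{c} = \textbf{U} \big ( (\textbf{U}^T \textbf{x}) \odot (\textbf{U}^T \textbf{x}') \big ), \]
where \(\odot \) denotes the Hadamard product. We can write the Fourier transform of the convolution as
\[ \widehat{\textbf{c}} = \textbf{U}^T \textbf{c} = (\textbf{U}^T \textbf{x}) \odot (\textbf{U}^T \textbf{x}'). \tag{3} \]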
Density of States. Spectral density is the overall distribution of the eigenvalues as induced by any symmetric \(n \times n\) graph matrix \(\textbf{S}=\textbf{U}\varvec{\Lambda }\textbf{U}^T\). It is also referred to as the density of states (DOS) in the physics literature, reflecting the number of states at different energy levels [27]. Formally,
Definition 3
(Density of States (DOS)) DOS or the spectral density induced by \(\textbf{S}\) is the density function
\[ f(\lambda ) = \frac{1}{n} \sum _{i=1}^{n} \delta (\lambda - \lambda _i), \tag{4} \]
where \(\delta (\cdot )\) is the Dirac delta function.
Definition 4
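(Local Density of States (LDOS)) Likewise, for any input vector \(\textbf{v}\in \mathbb {R}^n\), LDOS is given as
\[ f(\lambda ; \textbf{v}) = \sum _{i=1}^{n} (\textbf{v}^T \textbf{u}_i)^2\, \delta (\lambda - \lambda _i). \tag{5} \]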
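The following related equalities can be derived easily, respectively, for DOS and LDOS. Writing \(f(\lambda )\) for the DOS and \(f(\lambda ; \textbf{v})\) for the LDOS, for any function \(\phi (\cdot )\) on the spectrum,
\[ \int \phi (\lambda )\, f(\lambda )\, \textrm{d}\lambda = \frac{1}{n}\, \text {trace}(\phi (\textbf{S})), \qquad \int \phi (\lambda )\, f(\lambda ; \textbf{v})\, \textrm{d}\lambda = \textbf{v}^T \phi (\textbf{S})\, \textbf{v}. \]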
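Because both densities are sums of Dirac deltas, integrating a function against them reduces to finite sums over the eigenvalues. The sketch below checks this numerically for a cubic test function; it assumes the DOS is normalized by n (so that it is a probability density), and the random matrix `S` is a hypothetical stand-in for a graph matrix.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
A = rng.random((n, n))
S = (A + A.T) / 2.0                       # hypothetical symmetric graph matrix
v = rng.standard_normal(n)                # arbitrary input vector

lam, U = np.linalg.eigh(S)
ldos_mass = (U.T @ v) ** 2                # (v^T u_i)^2: LDOS mass placed at lam[i]

def phi(l):
    """Test function on the spectrum (a cubic, so phi(S) = S^3)."""
    return l ** 3

phi_S = np.linalg.matrix_power(S, 3)      # phi(S), computed without the eigenvectors

dos_integral = phi(lam).mean()            # integral of phi against the n-normalized DOS
ldos_integral = phi(lam) @ ldos_mass      # integral of phi against the LDOS for v
```

Here `dos_integral` matches `np.trace(phi_S) / n` and `ldos_integral` matches `v @ phi_S @ v`, up to floating-point error.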
Scaling (L)DOS. The extremal (i.e., the few largest or smallest) eigenpairs of various graph matrices have been associated with important graph characteristics, such as small-cut partitions [6], convergence rate of random walks to stationarity [7], unfolding of dynamical processes and epidemic thresholds [8], etc. Obtaining those few eigenpairs is also computationally easy. On the other hand, (L)DOS provides the distribution of the entire spectrum, which opens the door to the analysis of graph properties that are not evident from the extremal eigenpairs alone. However, computing all n eigenvalues and eigenvectors of a graph with n nodes is considerably more demanding. Therefore, analyzing large graphs through their density of states has been obstructed by the lack of scalable algorithms, until recently.
In their award-winning work, Dong et al. [10] introduced highly efficient approximation algorithms to compute spectral densities, scalable to graphs with as many as tens of millions of nodes and billions of edges. Their main focus has been scaling the computation of these functions, with approximation-error analysis on plain graphs. In this paper, we capitalize on their work for speed and extend it to leverage node attributes for the first time toward fast, attributed graph-level embedding.
Fast Approximation of (L)DOS. As introduced in [10], there are two methods for estimating spectral densities: the kernel polynomial method (KPM) and Gauss quadrature via Lanczos iteration (GQL). KPM expands the (L)DOS in an orthogonal polynomial basis, typically the Chebyshev polynomials. Chebyshev approximation requires the eigenvalues of the input matrix to be supported on \([-1,1]\), which is satisfied by the graph shift operator \(\textbf{S}\). In practice, only a finite number of moments are needed to approximate the (L)DOS well, especially for smooth (L)DOS. Dong et al. [10] also proposed a preconditioning step to accelerate the error decay with respect to the number of moments. In this paper we use GQL to estimate (L)DOS, which we introduce in detail as follows.
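Gauss quadrature (GQ) is a numerical method to estimate the definite integral of a function by a weighted sum of function values at specified points, and it has been applied to computing \(\textbf{u}^Tg(\textbf{A})\textbf{u}\) for an arbitrary vector \(\textbf{u}\), a symmetric positive-definite \(n \times n\) matrix \(\textbf{A}\) and a matrix function \(g(\cdot )\). Note that \(\textbf{u}^Tg(\textbf{A})\textbf{u}\) can be rewritten as a Riemann–Stieltjes integral [28]; letting \(\textbf{A}= Q^T\Lambda Q\) upon eigendecomposition, with eigenvalues ordered as \(\lambda _{\min } = \lambda _1 \le \ldots \le \lambda _n = \lambda _{\max }\), and \(\tilde{\textbf{u}} = Q\textbf{u}\):
\[ \textbf{u}^T g(\textbf{A}) \textbf{u} = \tilde{\textbf{u}}^T g(\Lambda )\, \tilde{\textbf{u}} = \sum _{i=1}^{n} g(\lambda _i)\, \tilde{u}_i^2 = \int _{\lambda _{\min }}^{\lambda _{\max }} g(\lambda )\, \textrm{d}\alpha (\lambda ), \]
where \(\alpha (\lambda )\) is a piecewise constant function defined as
\[ \alpha (\lambda ) = \begin{cases} 0, & \lambda < \lambda _1, \\ \sum _{j=1}^{i} \tilde{u}_j^2, & \lambda _i \le \lambda < \lambda _{i+1}, \\ \sum _{j=1}^{n} \tilde{u}_j^2, & \lambda _n \le \lambda . \end{cases} \]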
Using GQ, we can approximate the integral as \(\textbf{u}^T g(A)\textbf{u}= \int _{\lambda _{\min }}^{\lambda _{\max }} g(\lambda ) \textrm{d}\alpha (\lambda ) \approx \sum _{i=1}^{p} w_i g(\theta _i)\) with some weights \(\{ w_i\}_{i=1}^p\) and points \(\{ \theta _i\}_{i=1}^p\). Different ways of computing \(\{ w_i\}_{i=1}^p\) and \(\{ \theta _i\}_{i=1}^p\) have been summarized in [28], and one way is called Gauss quadrature via Lanczos iteration (GQL).
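Before introducing GQL, let us build the connection between (L)DOS and computing \(\textbf{u}^T g(\textbf{A})\textbf{u}\). Expanding the definition of LDOS in Eq. (5), we can write the LDOS for \(\textbf{v}\) as
\[ \sum _{i=1}^{n} (\textbf{v}^T \textbf{u}_i)^2\, \delta (\lambda - \lambda _i) = \textbf{v}^T \textbf{U}\, \delta (\lambda \textbf{I} - \varvec{\Lambda })\, \textbf{U}^T \textbf{v} = \textbf{v}^T \delta (\lambda - \textbf{S})\, \textbf{v}. \]
The above formulation indicates that LDOS can be represented in the form of \(\textbf{u}^T g(\textbf{A})\textbf{u}\) by substituting \(\textbf{u}\leftarrow \textbf{v}\), \(\textbf{A}\leftarrow \textbf{S}\), and \(g(x) \leftarrow \delta (\lambda - x)\), and hence can be approximated via GQL.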
To generate \(\{ w_i\}_{i=1}^p\) and \(\{ \theta _i\}_{i=1}^p\) for the GQ approximation of \(\textbf{v}^T \delta (\lambda - \textbf{S}) \textbf{v}\), the Lanczos algorithm is first run to reduce \(\textbf{S}\) to a tridiagonal matrix \(\textbf{T}_k \in \mathbb {R}^{k \times k}\), with k being the number of Lanczos iterations. Given a matrix \(\textbf{S}\) and an initial vector \(\textbf{z}_1\) with \(\Vert \textbf{z}_1 \Vert ^2=1\), the Lanczos algorithm iteratively generates k orthogonal unit vectors \(\textbf{z}_1,\ldots ,\textbf{z}_k\) and outputs a tridiagonal matrix \(\textbf{T}_k\). Let \(\textbf{Z}=[\textbf{z}_1,\ldots ,\textbf{z}_k]\in \mathbb {R}^{n \times k}\), \(\textbf{Z}^T \textbf{Z}=\textbf{I}_{k}\), then
\[ \textbf{S} \textbf{Z} = \textbf{Z} \textbf{T}_k + \textbf{R}, \]
where \(\textbf{R}\in \mathbb {R}^{n\times k}\) is the residual term and each column vector of \(\textbf{R}\) is orthogonal to \(\textbf{z}_i, \forall i \). Hence \(\textbf{Z}^T \textbf{R}=\textbf{0}_k\), and consequently \(\textbf{Z}^T \textbf{S} \textbf{Z} = \textbf{T}_k\).
A well-known theorem characterizes the relationship between the Lanczos algorithm and the GQ approximation, as stated below.
Theorem 1
([29, 30]) The eigenvalues of \(T_k\) form the points \(\{\theta _i \}_{i=1}^k\) of Gauss quadrature, and the weights \(\{w_i \}_{i=1}^k\) are given by the squares of the first elements of the eigenvectors of \(T_k\).
The eigendecomposition of the tridiagonal matrix \(T_k\) is fast, with complexity \(O(k^2)\). Let \(\{(\tau _1, \textbf{c}_1), \ldots , (\tau _k, \textbf{c}_k)\}\) be the eigenvalues and eigenvectors of \(T_k\). Theorem 1 implies that \(\textbf{z}_1^T g(\textbf{S}) \textbf{z}_1 \approx \sum _{i=1}^k (\textbf{e}_{1}^T\textbf{c}_i)^2\,g(\tau _i) = \sum _{i=1}^k \textbf{c}_{i1}^2\,g(\tau _i)\). Replacing the starting vector \(\textbf{z}_1\) of Lanczos with \(\frac{\textbf{v}}{\Vert \textbf{v}\Vert }\) and g(x) with \(\delta (\lambda - x)\), we get the approximation of LDOS as follows
\[ \textbf{v}^T \delta (\lambda - \textbf{S})\, \textbf{v} \approx \Vert \textbf{v}\Vert ^2 \sum _{i=1}^{k} \textbf{c}_{i1}^2\, \delta (\lambda - \tau _i). \]
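Following Eq. (7), DOS can be computed as
\[ \frac{1}{n} \sum _{i=1}^{n} \delta (\lambda - \lambda _i) \approx \frac{1}{n p} \sum _{j=1}^{p} \textbf{z}_j^T\, \delta (\lambda - \textbf{S})\, \textbf{z}_j, \]
where \(\textbf{z}_1,\ldots ,\textbf{z}_p\) are p random vectors with i.i.d. standard normal entries, and each term on the right-hand side is an LDOS that can be approximated as above.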
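The GQL procedure for LDOS can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes a dense matrix, uses full reorthogonalization for numerical stability, and the function names `lanczos_tridiag` and `gql_ldos` are hypothetical. The quadrature nodes are the eigenvalues of the tridiagonal matrix, and the weights are the squared first entries of its eigenvectors scaled by the squared norm of the input vector, as Theorem 1 states.

```python
import numpy as np

def lanczos_tridiag(S, v, k):
    """Run up to k Lanczos steps started at v / ||v|| and return the tridiagonal matrix."""
    n = len(v)
    Z = np.zeros((n, k))
    alpha, beta = [], []
    Z[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        w = S @ Z[:, j]
        alpha.append(Z[:, j] @ w)
        # Full reorthogonalization against all Lanczos vectors built so far.
        w = w - Z[:, : j + 1] @ (Z[:, : j + 1].T @ w)
        b = np.linalg.norm(w)
        if j == k - 1 or b < 1e-12:        # finished, or an invariant subspace was found
            break
        beta.append(b)
        Z[:, j + 1] = w / b
    return np.diag(alpha) + np.diag(beta, 1) + np.diag(beta, -1)

def gql_ldos(S, v, k, B=40):
    """Approximate the LDOS of v by k quadrature nodes, then bin it into B bins on [-1, 1]."""
    T = lanczos_tridiag(S, v, k)
    tau, C = np.linalg.eigh(T)             # nodes: eigenvalues of T
    weights = (v @ v) * C[0, :] ** 2       # ||v||^2 times squared first eigenvector entries
    hist, _ = np.histogram(np.clip(tau, -1.0, 1.0), bins=B,
                           range=(-1.0, 1.0), weights=weights)
    return tau, weights, hist

# Hypothetical dense weighted graph and a binary node attribute.
rng = np.random.default_rng(0)
n = 30
W = np.triu(rng.random((n, n)), 1)
W = W + W.T
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))            # normalized adjacency
v = np.zeros(n)
v[::2] = 1.0

tau, weights, hist = gql_ldos(S, v, k=5)
approx = weights @ tau ** 3                       # quadrature estimate of v^T S^3 v
exact = v @ np.linalg.matrix_power(S, 3) @ v
```

With k nodes, Gauss quadrature is exact for polynomials up to degree 2k - 1, so `approx` and `exact` agree to floating-point precision here even though only five Lanczos steps were taken. For large sparse graphs, one would replace the dense product `S @ Z[:, j]` with a sparse matrix-vector product.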
3 Graph-level embedding with ADOGE
3.1 Motivation
Our spectrally designed ADOGE derives graph-level features based on the node attributes and the entire spectrum of \({{\widetilde{\textbf{W}}}}\) (or of another symmetric graph matrix; w.l.o.g. referred to as \(\textbf{S}\), see Sect. 2), where the spectrum is composed of all of the eigenvalues. Before delving into details, we discuss the motivation for using the full spectrum and present an illustrative example.
Why the entire spectrum? We design graph-level features based on all of the eigenpairs of a graph matrix for two primary reasons. First, a large number of studies have found that the full eigenvalue spectra of different classes of real-world networks differ considerably [9, 31,32,33]. This suggests that the spectra can play a key discriminative role. Second, real-world networks are observed to exhibit localization on low-order eigenvectors, which are those eigenvectors associated with the non-extremal eigenvalues (i.e., neither the largest nor the smallest), but that are “buried” further down in the eigenvalue spectrum [34]. Notably, they capture mesoscopic inhomogeneity in networks, which is defined as topologically distinct groupings of nodes, from few nodes to large modules, communities, or different interconnected sub-networks [35].
Illustrative example: To illustrate the valuable information that non-extremal eigenpairs carry, we present a visual analysis of low-order eigenvector localization using the MIG graph (see Sect. 4.1). It consists of the counties across 49 mainland US states as nodes, and an edge depicts the total number of people that migrated between two counties during 1995–2000 [34].
Eigenvector localization arises when most of the entries of an eigenvector are zero or near-zero and implies that the nonzero components of the eigenvector coincide with a particular set of geometrically distinguished nodes in the graph. Extremal eigenvectors typically exhibit low localization; as shown in Fig. 2(i), the 2nd eigenvector has many nonzeros and mainly captures macroscopic properties, in this case, the graph cut depicting relatively fewer migrations between the west and east coasts. Lower-order eigenvectors, as shown in (ii)–(iv), reflect mesoscopic structure in terms of migration patterns. For example, the 41st eigenvector depicts migration in and around South Dakota. On the other hand, even lower eigenvectors localize increasingly, narrowing in on a few counties, as shown in (v) and (vi). For example, the 128th eigenvector has localized to a few counties within Texas near Austin, reflecting microscopic patterns. It is remarkable that the low-order eigenvectors align with geographical and political boundaries, carrying useful information at multiple scales.
3.2 Spectrum as histogram: DOS, LDOS, cLDOS features
3.2.1 Density of states (DOS)
DOS or spectral density as given in Eq. (4) is a continuous probability density function \(f(\lambda )\) of the eigenvalues. We represent it with a histogram density estimator, denoted \(h^{DOS}(\lambda )\), that partitions the eigenvalue range \([-1,1]\) for \({{\widetilde{\textbf{W}}}}\) into \(B=2/\Delta \) disjoint bins of equal width \(\Delta \). Let us denote their centers by \(\widetilde{\lambda }_b\) for \(b \in \{1, \ldots , B\}\). For any \(i \in \{1,\ldots , n\}\), let \(\text {Bin}(\lambda _i)\) denote the bin that \(\lambda _i\) belongs to. We define our DOS histogram features for a graph as follows.
Definition 5
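(DOS histogram features) DOS histogram is a B-dimensional vector, denoted \(\textbf{h}^{\text {DOS}}\in \mathbb {R}^B\), where
\[ \textbf{h}^{\text {DOS}}_b = \frac{1}{n}\, \big | \{ i : \text {Bin}(\lambda _i) = b \} \big |, \quad b \in \{1, \ldots , B\}. \]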
3.2.2 Local density of states (LDOS)
We also represent the local density of states (LDOS) in Eq. (5) similarly and define LDOS histogram features.
Definition 6
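(LDOS histogram features) For a given vector \(\textbf{v}\in \mathbb {R}^n\), the LDOS histogram is a B-dimensional vector, denoted \(\textbf{h}^{\text {LDOS}}_{\textbf{v}} \in \mathbb {R}^B\), where
\[ \big ( \textbf{h}^{\text {LDOS}}_{\textbf{v}} \big )_b = \sum _{i:\, \text {Bin}(\lambda _i) = b} (\textbf{v}^T \textbf{u}_i)^2, \quad b \in \{1, \ldots , B\}. \tag{17} \]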
Note that, abusing convention slightly, we use the word histogram to refer to Eq. (17) although it is not a normalized probability mass function. Figure 3 shows examples of DOS (top) and LDOS (middle and bottom) histograms with \(B=40\) each.
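For a small graph, both histograms can be computed exactly from a full eigendecomposition, which is useful as a reference when checking an approximation. The sketch below is illustrative only: it assumes the DOS histogram stores the fraction of eigenvalues per bin and the LDOS histogram stores the summed squared projections per bin; the function name `dos_ldos_histograms` and the toy graph are hypothetical.

```python
import numpy as np

def dos_ldos_histograms(S, v, B=40):
    """Exact DOS and LDOS histograms over B equal-width bins on [-1, 1]."""
    lam, U = np.linalg.eigh(S)
    edges = np.linspace(-1.0, 1.0, B + 1)
    # Bin index of every eigenvalue; clipping guards against round-off at the end points.
    bins = np.clip(np.digitize(lam, edges) - 1, 0, B - 1)
    h_dos = np.bincount(bins, minlength=B) / len(lam)                  # fraction of eigenvalues per bin
    h_ldos = np.bincount(bins, weights=(U.T @ v) ** 2, minlength=B)    # sum of (v^T u_i)^2 per bin
    return h_dos, h_ldos

# Hypothetical dense weighted graph and a binary node attribute.
rng = np.random.default_rng(0)
n = 30
W = np.triu(rng.random((n, n)), 1)
W = W + W.T
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))
v = np.zeros(n)
v[::2] = 1.0

h_dos, h_ldos = dos_ldos_histograms(S, v)
```

Under these assumptions the DOS histogram sums to one, while the LDOS histogram sums to the squared norm of `v`, which is why the latter is not a normalized mass function.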
Computing both DOS and LDOS histograms exactly requires all of the eigenvalues \(\lambda _i, i \in \{1, \ldots , n\}\) for a graph with n nodes. Further, LDOS requires all the corresponding eigenvectors \(\textbf{u}_i\). For even moderate-size graphs, computing the complete set of eigenpairs is prohibitive. Most recently, Dong et al. [10] introduced fast and scalable approximation algorithms to estimate these spectral densities. Our work is inspired by and builds on their work to efficiently obtain both \(\textbf{h}^{\text {DOS}}\) and \(\textbf{h}^{\text {LDOS}}\) based on the Gauss Quadrature and Lanczos (GQL) algorithm [36].
On the other hand, both in [10] and their follow-up work [19], \(\textbf{v}= \textbf{e}_i\) is used in Eq. (5) to capture the spectral information about each particular node \(i \in \{1,\ldots ,n\}\), called pointwise density of states (PDOS), where \(\textbf{e}_i\) is the ith standard basis vector with ith entry equal to 1 and 0 elsewhere. As such, both works are limited to plain graphs without node labels/attributes. We extend the use of LDOS to attributed graphs for the first time, by setting \(\textbf{v}\in \mathbb {R}^n\) in Eq. (17) to capture a graph signal on the nodes associated with an attribute.
Specifically, given a categorical or binary attribute \(a_j\), we create a separate \(\textbf{v}\) for each unique value \(val \in dom(a_j)\) where \(\textbf{v}_i:= 1\) if \(\textbf{X}_{ij} = val\) and 0 otherwise. For numerical attributes, we set \(\textbf{v}:= \overline{\textbf{X}}_{:j}\) where \(\overline{\textbf{X}}\) denotes the column-wise standardized attribute matrix. Notably, LDOS can be extended to structural node-level attributes, such as degree or other node centrality measures and eccentricity.
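The construction of the input vectors from attributes can be sketched as below. The helper names and the example attributes (`party`, `age`) are hypothetical and chosen only for illustration; standardization here means subtracting the column mean and dividing by the column standard deviation.

```python
import numpy as np

def categorical_signals(column):
    """One 0/1 indicator vector per unique value of a categorical attribute."""
    column = np.asarray(column)
    return {val: (column == val).astype(float) for val in np.unique(column).tolist()}

def numerical_signal(column):
    """Standardized (zero-mean, unit-variance) numerical attribute.

    Assumes the column is not constant, otherwise the standard deviation is zero.
    """
    column = np.asarray(column, dtype=float)
    return (column - column.mean()) / column.std()

# Hypothetical attributes for a 5-node graph.
party = ["D", "R", "D", "O", "R"]
age = [52.0, 61.0, 47.0, 70.0, 58.0]

signals = categorical_signals(party)      # {"D": v_D, "O": v_O, "R": v_R}
v_age = numerical_signal(age)
```

Each resulting vector can then be used as the input vector of an LDOS histogram, giving one histogram per categorical value and one per numerical attribute.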
Interpreting LDOS. There is an intuitive interpretation of an LDOS feature in Eq. (17). The term \(\textbf{v}^T \textbf{u}_i\), that is, the dot product between an attribute vector and a graph eigenvector, reflects the alignment between attribute values and the structurally distinct group of nodes that the eigenvector captures. The better the alignment, the larger the LDOS feature value for the bin that the corresponding eigenvalue falls into.
Why the attribute-based LDOS? We provide an illustrative example motivating the use of LDOS in addition to DOS-based histogram features. To this end, we use our Congress graph, as described in the experiments (Sect. 4.1). It consists of US senators as nodes across 41 US Senates from the 70th to 110th Congress, where weighted edges capture voting agreement. Each node is labeled with party affiliation: Democrat, Republican, or Other. Figure 4a gives the spy plot for the adjacency matrix, where the dense blocks on the diagonal correspond to each one of the 41 Senates. Cross-senate edges connect the same senator who appears across multiple Senates.
From the Congress graph, we create two variants. We first select only one Senate at random. Next, we add noise edges between the same-party nodes in the first variant, called Congress-within, and among random nodes in the second variant, called Congress-rand. As such, the structural difference between the variants is associated with node labels. The edge weights are chosen uniformly at random from [0.25, 1]. The total weight of edges added to each graph is exactly the same. As we only perturb one of 41 Senates in this way, the two variants share mostly the same topology. As a result, their DOS histograms are hard to distinguish, as shown in Fig. 4b, where the rightmost bump depicts the top 41 eigenvalues (each close to 1) capturing the 41 dense subgraphs, one per Senate. In contrast, LDOS histograms (using party affiliation Democrat; Republican is similar) reflect clear differences, as shown in Fig. 4c.
3.2.3 Coupled local density of states (cLDOS)
In addition to the original LDOS, we also create interaction features between pairs of node attributes. Accordingly, the coupled-LDOS histogram features are defined as follows.
Definition 7
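(cLDOS histogram features) For two input vectors \(\textbf{v}, \textbf{v}^\prime \in \mathbb {R}^n\), the coupled-LDOS histogram is a B-dimensional vector, denoted \(\textbf{h}^{\text {cLDOS}}_{\textbf{v},\textbf{v}^\prime } \in \mathbb {R}^B\), where
\[ \big ( \textbf{h}^{\text {cLDOS}}_{\textbf{v},\textbf{v}^\prime } \big )_b = \sum _{i:\, \text {Bin}(\lambda _i) = b} (\textbf{v}^T \textbf{u}_i)\, (\textbf{u}_i^T \textbf{v}^\prime ), \quad b \in \{1, \ldots , B\}. \tag{18} \]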
Note that \((\textbf{v}^T \textbf{u}_i) (\textbf{u}_i^T\textbf{v}^\prime )\) is the ith entry of \(\widehat{\textbf{c}}_{\textbf{v},\textbf{v}^\prime }\) (from Eq. 3). Hence \(\textbf{h}^{\text {cLDOS}}_{\textbf{v},\textbf{v}^\prime }\) can simply be viewed as \(\widehat{\textbf{c}}_{\textbf{v},\textbf{v}^\prime }\), binned according to the corresponding eigenvalues of each entry.
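Moreover, recall that we use the GQL algorithm to approximate the LDOS features, where the terms \(\textbf{v}^T \textbf{u}_i\) or \(\textbf{u}_i^T\textbf{v}^\prime \) are not computed using the individual eigenvectors explicitly. Nevertheless, it is easy to acquire cLDOS features in Eq. (18) using the LDOS features in Eq. (17) and simple algebra. Given the separate LDOS features for \(\textbf{v}\) and \(\textbf{v}^\prime \), we also create those for \((\textbf{v}+\textbf{v}^\prime )\). Then, since \(((\textbf{v}+\textbf{v}^\prime )^T \textbf{u}_i)^2 = (\textbf{v}^T \textbf{u}_i)^2 + 2 (\textbf{v}^T \textbf{u}_i)(\textbf{u}_i^T \textbf{v}^\prime ) + (\textbf{u}_i^T \textbf{v}^\prime )^2\),
\[ \textbf{h}^{\text {cLDOS}}_{\textbf{v},\textbf{v}^\prime } = \frac{1}{2} \Big ( \textbf{h}^{\text {LDOS}}_{\textbf{v}+\textbf{v}^\prime } - \textbf{h}^{\text {LDOS}}_{\textbf{v}} - \textbf{h}^{\text {LDOS}}_{\textbf{v}^\prime } \Big ). \]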
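The algebra behind this step is the polarization identity: the coupled histogram equals half of the LDOS histogram of the summed vector minus the two individual LDOS histograms. The sketch below verifies this with exact eigenpairs on a hypothetical toy graph; in the actual pipeline the three LDOS histograms would each come from the GQL approximation, so the identity would be applied to approximate quantities.

```python
import numpy as np

def binned(lam, weights, B=40):
    """Sum the given per-eigenvalue weights into B equal-width bins on [-1, 1]."""
    edges = np.linspace(-1.0, 1.0, B + 1)
    bins = np.clip(np.digitize(lam, edges) - 1, 0, B - 1)
    return np.bincount(bins, weights=weights, minlength=B)

# Hypothetical dense weighted graph.
rng = np.random.default_rng(0)
n = 30
W = np.triu(rng.random((n, n)), 1)
W = W + W.T
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))
lam, U = np.linalg.eigh(S)

v1 = np.zeros(n)
v1[::2] = 1.0                              # binary attribute indicator
v2 = rng.standard_normal(n)                # continuous attribute

def ldos(v):
    """Exact LDOS histogram of v (reference computation with explicit eigenvectors)."""
    return binned(lam, (U.T @ v) ** 2)

# cLDOS from three LDOS histograms only, with no access to individual projections.
c_from_ldos = 0.5 * (ldos(v1 + v2) - ldos(v1) - ldos(v2))

# Direct definition, for comparison: bin the products (v1^T u_i)(u_i^T v2).
c_direct = binned(lam, (U.T @ v1) * (U.T @ v2))
```

Note that, unlike LDOS entries, cLDOS entries can be negative, since the two projections may have opposite signs.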
3.3 Functions over the spectrum: aggregate features
DOS, LDOS and cLDOS histograms provide “raw” information about the graph spectrum and the attributes. In addition, we define aggregate scalar features by specifying various frequency response functions (FRF) [25] (Eq. (2)) over these histograms.
Definition 8
((cL)DOS aggregate features) Given a DOS, LDOS or cLDOS histogram \(\textbf{h}\in \mathbb {R}^B\), and a frequency response function \(\phi (\cdot )\), a (cL)DOS aggregate feature \(g_{\phi } \in \mathbb {R}\) is written as
\[ g_{\phi } = \sum _{b=1}^{B} \phi (\widetilde{\lambda }_b)\, \textbf{h}_b. \tag{20} \]
Each FRF \(\phi (\cdot )\) focuses on a different part of the spectrum, inducing a variety of graph filters. In Fig. 3 (bottom), we show three example FRFs: a low-pass one (blue) that has high values for smaller eigenvalues, a mid-pass one (red), as well as one that is both low-and-high pass (orange). To extract graph connectivity and attribute information broadly, we construct a portfolio of these graph filters, i.e., associated FRF’s \(\{\phi (\cdot )\}\), called a filterbank.
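An aggregate feature is a weighted sum of histogram entries, with the FRF evaluated at the bin centers as weights. The following sketch, on a hypothetical toy graph and with an illustrative quadratic FRF, compares such a feature computed from an exact LDOS histogram against the quantity it approximates; the only error introduced here is the binning of eigenvalues to bin centers.

```python
import numpy as np

# Hypothetical dense weighted graph and a continuous node attribute.
rng = np.random.default_rng(0)
n, B = 30, 40
W = np.triu(rng.random((n, n)), 1)
W = W + W.T
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))
v = rng.standard_normal(n)

lam, U = np.linalg.eigh(S)
edges = np.linspace(-1.0, 1.0, B + 1)
centers = (edges[:-1] + edges[1:]) / 2.0                   # bin centers
bins = np.clip(np.digitize(lam, edges) - 1, 0, B - 1)
h_ldos = np.bincount(bins, weights=(U.T @ v) ** 2, minlength=B)

def aggregate(h, phi):
    """Aggregate feature: FRF values at the bin centers, weighted by the histogram."""
    return float(phi(centers) @ h)

def phi_square(l):
    """Illustrative FRF: the squared frequency."""
    return l ** 2

g = aggregate(h_ldos, phi_square)
exact = v @ S @ S @ v                                      # v^T phi(S) v for phi(l) = l^2
```

For this FRF the discrepancy between `g` and `exact` is bounded by the bin width times the squared norm of `v`, so finer bins give a tighter approximation. Once the histogram is available, each additional FRF costs only one dot product of length B.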
Before delving into the details of our filterbank, we make a few remarks. First, note that the sum in Eq. (20) is an approximation of the integral in Eq. (8) for \(\textbf{h}^{\text {DOS}}\), that of Eq. (9) for \(\textbf{h}^{\text {LDOS}}\), and accordingly an approximation of \(\textbf{v}^T \phi (\textbf{S}) \textbf{v}^\prime \) for \(\textbf{h}^{\text {cLDOS}}\). Second, given the histograms computed efficiently by the GQL algorithm, computing the aggregate features by Eq. (20) is extremely fast and simply involves a weighted sum. This allows us to employ a large filterbank containing many different FRF’s almost for “free”. Finally, our cLDOS aggregate features relate to graph signal convolution. Denoting the vector of frequency responses by \( \varvec{\phi }:= \{ \phi (\lambda _i) \}_{i=1}^n \), based on Eqs. (9) and (3),
\[ \textbf{v}^T \phi (\textbf{S})\, \textbf{v}^\prime = \sum _{i=1}^{n} \phi (\lambda _i)\, (\textbf{v}^T \textbf{u}_i)\, (\textbf{u}_i^T \textbf{v}^\prime ) = \varvec{\phi }^T\, \widehat{\textbf{c}}_{\textbf{v},\textbf{v}^\prime }. \]
In the following, we present two classes of FRF’s that ADOGE employs to extract (cL)DOS aggregate features.
3.3.1 Chebyshev polynomials
We use the series of Chebyshev polynomials as a set of FRF’s defined by the recurrence \(\phi _1(\lambda ) = 1, \; \phi _2(\lambda ) = 2({\lambda }/{\lambda _{\max }}) - 1\), and \(\phi _k(\lambda ) = 2 \phi _2(\lambda ) \phi _{k-1}(\lambda ) - \phi _{k-2}(\lambda )\), where \(\lambda _{\max }\) is the maximum eigenvalue.
Interpretation. As shown in Fig. 5a, Chebyshev polynomials provide frequency profiles that cover various parts of the spectrum. For example, the 2nd one is mostly a low- and high-pass filter and stops the middle band, while the 3rd one passes the middle bands as well as very high and very low bands of the spectrum, and so on. Given a number of these FRF’s, emphasis can be put on passing/stopping specific bands by a weighted combination of them.
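The recurrence can be evaluated directly at any eigenvalue location. The sketch below assumes eigenvalues in \([0, \lambda _{\max }]\), so that \(\phi _2\) rescales them to \([-1, 1]\); the function name `chebyshev_frfs` is hypothetical.

```python
def chebyshev_frfs(lam, lam_max, num_filters):
    """Return [phi_1(lam), ..., phi_K(lam)] via the three-term recurrence."""
    x = 2.0 * lam / lam_max - 1.0          # phi_2: rescaled eigenvalue
    values = [1.0, x]                      # phi_1 and phi_2
    for _ in range(2, num_filters):
        # phi_k = 2 * phi_2 * phi_{k-1} - phi_{k-2}
        values.append(2.0 * x * values[-1] - values[-2])
    return values[:num_filters]

lam_max = 2.0
low_end = chebyshev_frfs(0.0, lam_max, 5)      # responses at the lowest eigenvalue
middle = chebyshev_frfs(1.0, lam_max, 5)       # responses at the middle of the spectrum
high_end = chebyshev_frfs(2.0, lam_max, 5)     # responses at the highest eigenvalue
```

At the middle of the spectrum the 2nd polynomial evaluates to zero while the 3rd has magnitude one, which matches the band behavior described in the interpretation above.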
The flexibility of any-band filtering by ADOGE is favorable over several modern graph neural networks (GNNs). GCN [20], for instance, works as a low-pass-only filter and hence does not cover the whole spectrum. GIN [21] has a learnable scalar parameter \(\epsilon \) that determines which band to stop; however, its FRF is a linearly decreasing function, which is not a strong low-pass or high-pass filter. (See Fig. 2 in [25].) In contrast, the spectrally designed ChebNet [22] is more expressive and also employs the Chebyshev polynomials. We compare to these modern GNNs in the experiments on graph classification tasks.
3.3.2 Power functions
The second class of FRF’s in our filterbank uses (both positive and negative) powers of the spectrum, that is, \(\phi _k(\lambda ) = \lambda ^{k}\) for positive as well as negative exponents k.
Interpretation. Our aggregate features using the power functions relate to random walks on the graph. Consider positive values of k and \(\textbf{S}= {{\widetilde{\textbf{W}}}}\). Recall that for \(\textbf{h}^{\text {DOS}}\), Eq. (20) is an approximation of \(\text {trace}(\phi (\textbf{S}))\), which is equal to the total return probability of a k-step random walk to a node. For \(\textbf{h}^{\text {LDOS}}\), aggregate features approximate \(\textbf{v}^T \phi (\textbf{S}) \textbf{v}\). For a binary/categorical attribute where \(\textbf{v}\) depicts a certain value, e.g., val \(:=\) (party_affiliation:democrat), it corresponds to the probability that a k-step random walk starting at any node with value val “hits”/reaches another node with the same value. For \(\textbf{h}^{\text {cLDOS}}\), similarly, it is the probability that such a walk starting at any node with a certain val will reach another node with a different \(val^\prime \). Moreover, for two continuous attributes \(\textbf{v}\) and \(\textbf{v}^\prime \), approximating \(\textbf{v}^T \phi (\textbf{S}) \textbf{v}^\prime \) via \(\textbf{h}^{\text {cLDOS}}\) would capture the covariance of the attributes over “k-hop connected” pairs of nodes that can reach each other within k steps.
The interpretations extend to the negative powers as well, which correspond to many walks of different lengths in the limit. In that respect, aggregate features using power functions depict multiscale properties, where increasingly positive values of k capture microscopic to mesoscopic properties related to short/local random walks, whereas negative powers relate to the long-range walks and thereby macroscopic structure.
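The random-walk reading of positive powers can be verified on a small graph. The sketch below assumes that \({\widetilde{\textbf{W}}}\) is the symmetrically normalized adjacency matrix \(\textbf{D}^{-1/2}\textbf{W}\textbf{D}^{-1/2}\) (an assumption made for this illustration) and uses an exact eigendecomposition rather than histograms; the toy graph and attribute vector are made up.

```python
import numpy as np

# Small undirected toy graph: a 4-cycle with one chord.
W = np.array([[0, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 0, 1],
              [1, 1, 1, 0]], dtype=float)
deg = W.sum(axis=1)
W_tilde = W / np.sqrt(np.outer(deg, deg))    # assumed form: D^{-1/2} W D^{-1/2}
P = W / deg[:, None]                         # random-walk transition matrix D^{-1} W

k = 3
lam, U = np.linalg.eigh(W_tilde)

# DOS side: sum_i lambda_i^k equals trace(W_tilde^k).
dos_feature = np.sum(lam ** k)
# Random-walk side: total probability of returning to the start node after k steps.
return_prob = np.trace(np.linalg.matrix_power(P, k))

# LDOS side: v^T W_tilde^k v = sum_i lambda_i^k (v^T u_i)^2 for an attribute vector v.
v = np.array([1.0, 0.0, 1.0, 0.0])           # hypothetical binary attribute
ldos_feature = np.sum(lam ** k * (U.T @ v) ** 2)
direct = v @ np.linalg.matrix_power(W_tilde, k) @ v
```

The two traces agree because the transition matrix and the normalized adjacency are similar matrices and therefore share the same spectrum.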
3.4 ADOGE: overall summary
We conclude with an overview of all the graph-level features described in this section. Table 2 gives the number of features by category, where B is the number of histogram bins, K is the number of Chebyshev or power frequency response functions (FRF’s), and \(D \ge d\) is the total number of attributes upon one-hot encoding the categorical and binary attributes. ADOGE yields \((B+2K)\left( 1 + D + \binom{D}{2}\right) \) features in total for an attributed graph, which are permutation- and size-invariant, task-agnostic, variable band-pass, multiscale, and extremely efficient to compute. We outline the steps in Fig. 6 and give detailed complexity analysis next.
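To get a feel for how the embedding size grows with the number of attributes, the count can be tabulated with a small helper. The function name `adoge_feature_count` is hypothetical; the formula is the one stated above.

```python
from math import comb

def adoge_feature_count(B, K, D):
    """Total number of graph-level features for B bins, K FRFs per class, D attributes."""
    per_histogram = B + 2 * K                 # B bins + K Chebyshev + K power aggregates
    num_histograms = 1 + D + comb(D, 2)       # 1 DOS, D LDOS, one cLDOS per attribute pair
    return per_histogram * num_histograms

plain_graph = adoge_feature_count(B=200, K=100, D=0)
three_attributes = adoge_feature_count(B=200, K=100, D=3)
ten_attributes = adoge_feature_count(B=200, K=100, D=10)
```

The quadratic growth in D comes entirely from the attribute pairs, which is the part that can be pruned when D is large.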
Extension to more features. We note that one may expand the set of ADOGE features in two possible ways. First, the input vector(s) to (c)LDOS, denoted \(\textbf{v}\) (and \(\textbf{v}^\prime \)) in Definition 6 (and in Definition 7), are flexible. Besides a vector depicting one node attribute at a time, one can design features of features or even include other topological properties of the nodes. Second, one can define FRF’s of other forms, aiming to capture different functions of the spectrum. Those could include very flexible yet perhaps less interpretable functions, especially if the downstream task is more performance-oriented and less human-centered.
Extension to directed graphs. The original method is based on the eigendecomposition of a symmetric matrix, and the fast computation of DOS and LDOS also relies on symmetry. For a directed graph, directly using the eigenvalues and eigenvectors of the directed adjacency matrix introduces complex numbers, for which distributions are hard to define, and no fast algorithm for computing the eigenvectors is available yet. To avoid this problem and still extend the method to directed graphs, one can apply it to an undirected transformation of the directed graph instead of working with the directed adjacency matrix. Specifically, one can transform a directed graph into an undirected graph reversibly, by replacing each directed edge \(e=(x,y)\) with 5 new vertices \(v_1^e, \ldots , v_5^e\) and new edges \((x,v_1^e), (v_3^e,y), (v_1^e, v_3^e), (v_3^e, v_4^e), (v_4^e, v_5^e)\). Notice that the generated undirected graph can be transformed back to the original directed graph by identifying all added nodes using degree information.^{Footnote 2} With the help of this injective transformation from directed to undirected graphs, our method can be used to encode directed graphs, at some additional computational cost.
3.5 Computational complexity
Since ADOGE computes an embedding for each graph independently, it scales linearly with the number of graphs in the dataset, i.e., N.
We analyze the asymptotic runtime of ADOGE on a single graph G with n nodes, m edges, and D total node attributes (including one-hot encoded labels and categorical attributes). We use the Gauss quadrature and Lanczos algorithm described by Dong et al. [10] to compute a (cL)DOS histogram. This involves (i) running \(\eta _L\) Lanczos iterations, each requiring \(O(n+m)\) operations for a total of \(O(\eta _L (n+m))\), followed by (ii) the eigendecomposition of a tridiagonal \(\eta _L \times \eta _L\) matrix, with \(O(\eta _L^2)\) operations. Note that although a tridiagonal matrix eigendecomposition has a quadratic worst-case complexity theoretically, this operation is extremely fast in practice, especially for real-world matrices. Each aggregate feature requires a dot product of two vectors of size B for O(B), where we use 2K different frequency response functions (i.e., \(\phi (\cdot )\)’s) in total (K each for Chebyshev and powers). Then, the total complexity of computing one histogram and its related aggregate features is \(O(\eta _L^2 + \eta _L m + \eta _L n + KB)\). This gives a total runtime of \(O\left( (\eta _L^2 + \eta _L m + \eta _L n + KB) \cdot \alpha \right) \), where \(\alpha \) denotes the number of desired graph-level features (i.e., embedding size) in ADOGE.
Notably, ADOGE is modular and can include any subset of the features in Table 2. For datasets with a large number of node attributes, one can skip cLDOS features, or only choose important attribute pairs to ensure \(\alpha = O(D)\). Also note that each aggregate feature for a given \(\phi (\cdot )\) can be computed independently, and hence can be easily parallelized.
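The two steps of the analysis (Lanczos iterations, then a small tridiagonal eigendecomposition) can be sketched as follows. This is a textbook Gauss quadrature via Lanczos, not the implementation of Dong et al. [10]: it uses a dense toy matrix, adds full reorthogonalization for numerical robustness (which is not part of the complexity count above), and the function name `lanczos_quadrature` is hypothetical.

```python
import numpy as np

def lanczos_quadrature(S, v, num_steps):
    """Nodes and weights such that sum_j w_j f(theta_j) approximates v^T f(S) v."""
    n = v.shape[0]
    norm_v = np.linalg.norm(v)
    Q = np.zeros((n, num_steps))
    alpha = np.zeros(num_steps)
    beta = np.zeros(num_steps)
    q = v / norm_v
    steps = num_steps
    for j in range(num_steps):
        Q[:, j] = q
        z = S @ q                                   # one mat-vec: O(n + m) for sparse S
        alpha[j] = q @ z
        z = z - Q[:, :j + 1] @ (Q[:, :j + 1].T @ z)  # reorthogonalize against all previous q
        beta[j] = np.linalg.norm(z)
        if beta[j] < 1e-12:                         # invariant subspace reached: stop early
            steps = j + 1
            break
        q = z / beta[j]
    off = beta[:steps - 1]
    T = np.diag(alpha[:steps]) + np.diag(off, 1) + np.diag(off, -1)
    nodes, vecs = np.linalg.eigh(T)                 # small eta_L x eta_L eigenproblem
    weights = norm_v ** 2 * vecs[0, :] ** 2         # squared first components of eigenvectors
    return nodes, weights

# Toy example: normalized adjacency of a random graph and a binary attribute vector.
rng = np.random.default_rng(0)
n = 50
A = np.triu(rng.random((n, n)) < 0.15, k=1).astype(float)
A = A + A.T
deg = np.maximum(A.sum(axis=1), 1.0)
S = A / np.sqrt(np.outer(deg, deg))
v = (rng.random(n) < 0.5).astype(float)

nodes, weights = lanczos_quadrature(S, v, num_steps=10)
hist, edges = np.histogram(nodes, bins=20, range=(-1.001, 1.001), weights=weights)
```

With \(\eta _L\) steps the quadrature reproduces the moments \(\textbf{v}^T \textbf{S}^p \textbf{v}\) exactly up to degree \(2\eta _L - 1\) in exact arithmetic, which is why a modest number of iterations already gives a usable histogram.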
4 Experiments
To evaluate ADOGE, we design both quantitative and qualitative experiments to answer the following questions.
Q1. Graph Classification: How does ADOGE (unsupervised) compare to the modern GNNs and graph kernels (un/supervised) on benchmark graph classification tasks?
Q2. Exploratory Graph Analysis: Can ADOGE provide insights for mining real-world attributed graphs?
Q3. Efficiency: How fast and scalable is ADOGE?
Q4. Boosting GNNs: Can the unsupervised features generated by ADOGE help improve the expressiveness of modern GNNs further?
4.1 Experiment setup
Datasets. The list of all datasets used in the experiments and summary statistics are given in Table 3.
\(\underline{\textit{Graph classification benchmark datasets}}\) For graph classification, we use eight benchmark datasets from TUDataset repository.^{Footnote 3} Five are commonly used social network datasets, REDDITB, REDDIT5K, COLLAB, IMDBBIN and IMDBMUL. These contain only plain graphs—a setting with which all the baselines are compatible.

REDDITB is a balanced dataset, used in [37], where each graph is an online thread with nodes being users and edges representing the existence of a direct response between two users. The task is to classify the thread (graph) as a Q&A-based community or a discussion-based community. REDDIT5K is similar to REDDITB but has 5 different community types.

COLLAB [37] is a scientific collaboration dataset coming from three research fields: high energy physics, condensed matter physics, and astrophysics. Each graph represents a collaboration network with nodes being researchers and edges representing the collaborating relations.

IMDBBIN [37] is a movie collaboration dataset where each graph is a movie collaboration network with nodes being actors/actresses and edges representing that two actors/actresses have appeared in the same movie. The task is to classify a graph into two genres: action and romance. IMDBMUL is a 3-class version of IMDBBIN.
The other three are biochemistry datasets, PROTEINS, DD and AIDS, which have node labels and/or attributes.

PROTEINS [38] and DD [39] are two biochemistry datasets where each graph is a molecular structure of a protein. The task is to classify a graph into two categories: enzyme and non-enzyme.

AIDS [40] contains many molecular graphs where nodes represent atoms and edges indicate the valence between two atoms. The task is to predict whether the molecule is useful for treating AIDS or not.
\(\underline{\textit{Graph classification bandpass datasets}}\) We also use four other graph datasets to specifically showcase the strengths of ADOGE in leveraging the full graph spectrum.

BandPass is a synthetic dataset consisting of images generated via sinusoidal patterns from two frequency ranges, as used in [25].

Congress is based on the voting patterns in 41 US Senates (1927–2008), as used in [34], where nodes represent senators (labeled by party affiliation) and edge weights represent voting agreement. Nodes depicting the same senator who appear across multiple years are also connected with an edge. To create separate classes of graphs, we add noise to edge weights between same-party senators (class 1), and randomly picked senators (class 2) in one randomly picked Senate. We repeat this process to obtain 100 graphs for each class.

CongressL is generated by picking one Senate at random and shuffling the labels of senators via random swaps; 50 swaps in class 1, and 300 in class 2. We repeat this process to obtain 100 graphs for each class.

MIG is based on the county-to-county migration in the USA, also used in [34]. Each node is a county (labeled using its state), and edge weights represent the amount of migration.^{Footnote 4} Of these, we pick two bordering states at random and add a small amount of noise to edge weights. For class 1, we add noise between same-state counties, and for class 2, we add noise between randomly picked counties. We repeat this process to obtain 100 graphs for each class.
\(\underline{\textit{Exploratory graph mining datasets}}\) In addition to the above datasets, we perform graph exploratory analysis using ADOGE on two more datasets:

Facebook100 consists of Facebook college social networks from 100 American institutions [41], with student demographic information (major, dorm, status, class year, etc.) as node attributes.

BorderStates is built from the MIG dataset, by inducing 49 separate graphs—one for each mainland state and its bordering states. We label counties of the selected state as 0 and the counties of the neighbors as 1.
\(\underline{\textit{GNN expressiveness datasets}}\) Furthermore, to test whether the unsupervised features can help improve the expressiveness, or separation power, of graph neural networks (GNNs), we use two additional datasets with tasks related to their expressiveness.

CountingSub [42] is a simulated dataset with random graphs; for each graph, its numbers of triangles, tailed triangles, stars, and 4-cycles are precomputed as target values for regression. The task is to count the number of substructures for any input graph.

GraphProp [43] is also a simulated dataset with random graphs. The task is to regress graph-level properties, including the connectedness (binary), the diameter, and the radius of the graph.
These substructure counting and graph property regression tasks are closely related to measuring the expressiveness of GNNs.
Baselines. We compare ADOGE quantitatively to various unsupervised and supervised graph embedding, graph kernel, and graph neural network methods on graph classification tasks.
\(\underline{\textit{Unsupervised explicit graph embeddings}}\) are in the same category as ADOGE and hence are the most comparable. As baselines from this category, we compare to

\(\bullet \) FGSD [11], \(\bullet \) NetLSD [12] and \(\bullet \) g2vec [13], which we described briefly in Sect. 1.1.
\(\underline{\textit{Graph kernels}}\) are also unsupervised; here, we use three of the best-performing kernels on classification benchmarks, as well as a recent DOS-based graph kernel.

WL [14]: the Weisfeiler–Lehman graph kernel,

WLOA [15]: the Weisfeiler–Lehman Optimal Assignment kernel,

PK [17]: the Propagation Kernel, and

DOSGK [19], the Density of States Graph Kernel.
\(\underline{\textit{GNN baselines}}\) include state-of-the-art supervised models, such as GCN [20], GIN [21] and ChebNet [22] (see Sect. 3.3.1).
Note that FGSD, NetLSD, and DOSGK are for plain graphs only. g2vec, WL, and WLOA admit node labels but not (continuous) attributes. Therefore, they input only the admissible parts of a graph dataset for classification.
Model configuration. In our experiments with ADOGE, we set \(\eta _L=100\), \(B=200\) and \(K=100\) (see Table 2). For plain datasets, we use node degree as a continuous attribute. For FGSD, we use \(L^{-1}\) as the distance function and 0.001 as the bin width. For NetLSD, we use heat trace signatures at 250 different values of t logarithmically spaced in \([10^{-2}, 10^2]\). For g2vec, we set the WL iteration count to 5 and output dimension to 1024. For the kernels WL, WLOA and PK, we use the implementation from the GraKel package,^{Footnote 5} and the default parameters suggested. For DOSGK, same as with ADOGE, we use 200 bins and 100 Chebyshev moments. For all the GNNs, we use mean pooling as the readout function. Notice that ADOGE uses all designed features presented in Sect. 3 as input. These features can be directly input to an SVM for graph classification, or transformed to a fixed-dimension embedding by a 2-layer MLP, which is then used for augmenting a GNN’s embedding space.
System configuration. We run all non-GNN experiments on one core of an Intel(R) Xeon(R) E5-2667 v3 CPU @ 3.20GHz. GNN experiments are run on a server with an NVIDIA Tesla V100 GPU and one core of an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz.
4.2 Graph classification
Classifier configurations. For classification with the embeddings produced by unsupervised methods, we use the kernel SVM^{Footnote 6} classifier with the regularization parameter C chosen from \(\{10^{-3}, 10^{-2}, \ldots , 10^3\}\) via 10-fold cross-validation. We perform this experiment 10 times using random splits. For explicit embeddings, we normalize each feature, and set \(\gamma \) to be the inverse of the median of pairwise \(\ell _2\) distances between all embeddings. For ADOGE, we also treat the option of using LDOS and cLDOS features, and the option of using aggregate FRFs, as hyperparameters. We normalize all kernels symmetrically. For GNNs, we train them end-to-end using cross-entropy loss, with hyperparameters (learning rate at 0.005, layers in {2,3,5,7}, hidden sizes from {32,64,128} and epochs up to 200) selected via 10-fold cross-validation. For each of the above methods, we report the mean test accuracy for the best choice of hyperparameters, and the corresponding standard deviation on every dataset except BandPass, for which we use the single train-validation-test split as specified in [25].
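The kernel settings described above can be sketched in plain Python. Two assumptions are made for illustration: the explicit embeddings are compared with an RBF kernel \(\exp (-\gamma \Vert \textbf{x}-\textbf{y}\Vert ^2)\), and symmetric normalization means dividing \(K_{ij}\) by \(\sqrt{K_{ii} K_{jj}}\). The function names and toy embeddings are hypothetical.

```python
import math
from statistics import median

def median_heuristic_gamma(X):
    """Inverse of the median pairwise Euclidean distance between embeddings."""
    dists = [math.dist(X[i], X[j])
             for i in range(len(X)) for j in range(i + 1, len(X))]
    return 1.0 / median(dists)

def rbf_kernel(X, gamma):
    """Assumed kernel form: exp(-gamma * ||x - y||^2)."""
    return [[math.exp(-gamma * math.dist(a, b) ** 2) for b in X] for a in X]

def normalize_kernel(K):
    """Symmetric normalization: K_ij / sqrt(K_ii * K_jj)."""
    d = [math.sqrt(K[i][i]) for i in range(len(K))]
    return [[K[i][j] / (d[i] * d[j]) for j in range(len(K))] for i in range(len(K))]

# Grid for the SVM regularization parameter C: 10^-3, 10^-2, ..., 10^3.
C_grid = [10.0 ** e for e in range(-3, 4)]

# Toy embeddings for three graphs (already feature-normalized in a real run).
X = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
gamma = median_heuristic_gamma(X)          # pairwise distances are 5, 10, 5
K = rbf_kernel(X, gamma)

# Normalizing an implicit (precomputed) kernel, e.g. from a graph kernel baseline.
K_implicit = normalize_kernel([[4.0, 2.0], [2.0, 1.0]])
```

Tying \(\gamma \) to the median distance keeps the kernel values on a comparable scale across datasets, so only C needs to be searched.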
Results. Table 4 contains all the performance results of our classification experiments. Among the benchmark datasets, ADOGE performs on par with the most competitive unsupervised baselines and is often comparable to (supervised) GNNs, while being considerably more resource-frugal.
On the other four datasets, ADOGE significantly outperforms all baseline methods due to its ability to capture the alignment of labels and attributes with graph structure at a multiscale level, even in databases with as few as 200 graphs. Given that ADOGE uses considerably fewer resources than kernels and GNNs, and considering that the latter are trained end-to-end, we do not expect ADOGE to exhibit state-of-the-art performance on every dataset. Still, ADOGE outperforms/equals all baselines on 7 of the datasets. Moreover, ADOGE stands out as the top choice based on average performance across all datasets.
On the BandPass dataset, only the spectrally designed ChebNet is able to outperform ADOGE. This can be attributed to the way that BandPass is created, wherein graph classes are formed based on the frequency band used to generate the underlying image. Figure 7 depicts the LDOS histograms of the graphs in the BandPass dataset. We can clearly see that capturing specific bands of the eigenspectrum suffices to characterize the disparity between the two graph classes.
Feature ablation. Table 4 also shows the DOS-only version of ADOGE without using node labels and attributes, called DOGE. We observe that in the benchmark datasets, graph structure seems to hold most of the useful information needed for classification, and hence, there is only a small improvement in performance from using node attributes. In the rest of the datasets, node attributes play an important role, causing significant improvements in results for ADOGE by using LDOS and cLDOS features.
4.3 Graph data mining
To demonstrate the interpretability of the ADOGE features, we perform exploratory graph analysis on three real-world datasets, Facebook100, Congress and BorderStates.
\({\underline{\texttt {Facebook100}}.~}\) In Facebook100, we denote each categorical feature (e.g., major) with its one-hot encoding, and hence, each particular value (e.g., Computer Science) has its own (binary) attribute vector. We first visualize the Facebook100 graphs via LDOS aggregate features using these attribute vectors, with small positive power functions as FRF to capture the assortativity (homophily) of different attributes across different college networks. In each graph, we compute the aggregate feature that estimates \(\textbf{v}_m^T \textbf{S}\textbf{v}_m\) for every major captured by \(\textbf{v}_m\), and similarly \(\textbf{v}_d^T \textbf{S}\textbf{v}_d\) for every dormitory captured by \(\textbf{v}_d\). Figure 8 plots the mean homophily with respect to major and dorm for each of the 100 colleges.
While Carnegie stands out as having the highest correlation between edges and students with the same major, comparing the ranges of both axes suggests that dorm is a much stronger indicator of students within a college being friends. Moreover, this tendency seems to be more pronounced in Rice, Caltech and UCSC. This is also backed up by findings in [41] and the real-world knowledge that Rice and Caltech are organized predominantly by dorms and other on-campus housing.
We also analyze similar aggregate functions over the continuous attributes. Figure 9(left) plots the assortativity with respect to class_year for \(k=1\) and \(k=2\) for the power functions, which capture 1- and 2-length paths. As we expect, these features are highly correlated in most colleges, with the striking exception of Harvard, where it appears that 2-length paths are common between individuals of similar class_year, but this is not the case with 1-length paths. To investigate further, we plot homophilies for student and non-student populations for all colleges in Fig. 9(right) and we learn that the Harvard network consists of a comparatively higher number of edges amongst non-student members, most of whom have empty or very disparate class_year. Even if edges between students are fewer, this is corrected when we look at 2-length paths instead.
\({\underline{\texttt {Congress}}.~}\) Next, we want to explore scenarios where interactions between attributes prove important to understanding properties of a graph. To this end, we look at the Congress graph, where the two attribute vectors are binary vectors \(\textbf{v}_d\) and \(\textbf{v}_r\) corresponding to Democrat and Republican senators, respectively (ignoring the small minority of independents). We plot within-party agreement \((\textbf{v}_d^T \textbf{S}\textbf{v}_d + \textbf{v}_r^T \textbf{S}\textbf{v}_r)/2\) and cross-party agreement (\(\textbf{v}_d^T \textbf{S}\textbf{v}_r\)) over the years in Fig. 10.
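The two plotted quantities are plain quadratic and bilinear forms in the party indicator vectors. The sketch below evaluates them exactly on a made-up 4-senator agreement matrix; ADOGE would instead estimate them from (c)LDOS aggregate features, and the matrix values and the helper `bilinear_form` are hypothetical.

```python
def bilinear_form(u, S, v):
    """Compute u^T S v for a dense matrix given as a list of rows."""
    return sum(u[i] * S[i][j] * v[j]
               for i in range(len(u)) for j in range(len(v)))

# Hypothetical symmetric voting-agreement weights among four senators.
S = [[0.0, 0.9, 0.2, 0.1],
     [0.9, 0.0, 0.3, 0.2],
     [0.2, 0.3, 0.0, 0.8],
     [0.1, 0.2, 0.8, 0.0]]
party = ["D", "D", "R", "R"]

# Binary attribute vectors, one per party (one-hot encoding of the label).
v_d = [1.0 if p == "D" else 0.0 for p in party]
v_r = [1.0 if p == "R" else 0.0 for p in party]

within_party = (bilinear_form(v_d, S, v_d) + bilinear_form(v_r, S, v_r)) / 2
cross_party = bilinear_form(v_d, S, v_r)
```

In this toy case within-party agreement exceeds cross-party agreement; tracking both values per Senate over time is what produces the two curves being compared.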
We can observe that beginning from the 1990s, senators tend to agree among their parties, and disagree with the opposite party to a higher extent, hinting at a growing polarization in politics. We note that agreement across parties is also low in 1937 (see the “dip”); however, this is better explained by the fact that this Congress had overwhelmingly more Democrats. There is no hint of polarization for that instance, since there is no corresponding rise in the dashed (within-party) curve. Figure 10 shows that aggregate functions from ADOGE not only help us observe such phenomena but also help quantify them to a relative extent.
\({\underline{\texttt {BorderStates}}.~}\) Lastly, we analyze BorderStates, comparing within-state migration against cross-border migration for each of the 49 mainland states in the USA. We focus on LDOS aggregate features, this time using both positive and negative power functions in order to analyze both short- and long-range migration patterns. In other words, while small positive powers (e.g., \(k=1\)) capture local migration patterns, negative powers (e.g., \(k=-1\)) capture paths of all lengths and thereby reflect long-range migration behavior on a relatively global scale.
From Fig. 11(left), we observe that at the local scale, most states have greater within-state migration than cross-border migration. Comparatively, NH and DE, being the states with the fewest counties (10 and 3, respectively), exhibit lower within-state migration. Moreover, due to NH’s geographical and political similarity with its bordering states, it shows the highest cross-border migration. On the other hand, larger states such as CA and MI exhibit mostly within-state migrations on the local scale. However, on the global scale (Fig. 11(right)), the difference between these is more pronounced, since CA is a more popular long-range migration destination than MI. The ratio between the average within-state and average cross-border migration is 78.68 for MI, much larger than for DE, CA, and NH with values 43.43, 34.40, and 14.17, respectively.
Figure 12 helps explain this observed behavior visually, where we show via heatmaps the total migration volume among the counties of each state as well as the counties in their immediate neighbor/border states. DE and NH are small states with only a few counties, which explains the small within-state migration volume in Fig. 11. On the other hand, NH exhibits the highest cross-border traffic as compared to the others. In stark contrast to NH, MI has much less border traffic relative to its within-state migration. CA, on the other hand, exhibits large-volume migration both within the state and across states.
4.4 Efficiency
ADOGE is not directly comparable to all the baselines in terms of resource requirements. GNNs and g2vec need GPU processing, which makes them incomparable to CPU-based ADOGE and the rest. Other differences, such as supervised training and collective processing of the graphs via multiple passes over the dataset (in contrast to one-by-one/independent processing by ADOGE) put them in a different “league”.
On the other hand, kernel baselines need considerably more memory. WL, WLOA and PK compute intermediate data (e.g., compressed labels) based on all the graphs in memory. These and DOSGK produce an \(N \times N\) kernel matrix that is also memory-resident.
FGSD and NetLSD are comparable in the sense that, similar to ADOGE, they process the graphs independently one by one. Likewise, they are also unsupervised. However, they cannot handle node labels/attributes. Nevertheless, we provide a running time and scalability comparison in Fig. 13, which plots the runtime vs. size in number of nodes for individual graphs in the REDDIT5K dataset. For any graph from the dataset (up to 9500 edges), ADOGE does not take more than 1 s to compute. Figure 1 compares this runtime for 3 of our largest datasets. We can see that ADOGE achieves the best time-accuracy tradeoff among competing baselines. For methods with comparable or better accuracy scores (e.g., GIN), ADOGE is almost twice as fast on average. For baselines with similar runtime (e.g., WL), ADOGE achieves significantly higher accuracy.
4.5 Boosting GNNs
Modern GNNs have achieved significant success in both node-level and graph-level tasks, with applications in many domains such as drug discovery, social network analysis, image analysis and bioinformatics [44,45,46]. However, the widely used message-passing-based graph neural networks are known to have limited expressiveness and are specifically upper bounded by the first-order Weisfeiler–Leman test [21]. Many works aim to improve the expressiveness of GNNs, and one direction is to use additional structural features [47, 48] or even random features [49, 50] to augment the input of GNNs, which achieves noticeable improvement over many tasks.
In this section, we investigate whether the unsupervised features from ADOGE can help improve the expressiveness of GNNs. We focus on two categories of tasks related to expressiveness: (1) counting the number of substructures and (2) regressing various graph-level properties. To this end, we use two datasets, CountingSub and GraphProp, as introduced in Sect. 4.1.
Model configurations. As the two datasets contain only plain graphs without node and edge attributes,^{Footnote 7} we only compute DOS and PDOS histograms from ADOGE and use them to augment a GNN model. The GNN architecture stacks multiple message-passing layers that update node features, followed by a pooling layer that aggregates all nodes into a graph-level feature. We augment the GNN with DOS and PDOS as follows:

DOS involves graph-level features. We pass DOS into an MLP and add the transformed DOS to the final layer of a GNN (right after the pooling layer).

PDOS involves node-level features. We concatenate the original node features (i.e., degree) of the graph with the PDOS before input to a GNN.
In the experiments, we focus on two types of GNN: GCN [20] and GIN [21]. For hyperparameters, we use 6 layers and hidden size 128. Batch normalization is applied after each layer. We use Adam as the optimizer with learning rate 0.001. For both DOS and PDOS, the number of histogram bins is set to 100.
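The two augmentation points can be made concrete with a shape-level sketch. Everything below is a stand-in for illustration: the "GNN" is a single random linear map with mean pooling (the pooling choice is an assumption), the weights are untrained, and the histograms are random numbers rather than real DOS/PDOS values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, hidden, bins = 6, 8, 100                  # toy sizes; the experiments use hidden size 128

def mlp(x, W1, W2):
    """Two-layer perceptron with ReLU, used to transform the DOS histogram."""
    return np.maximum(x @ W1, 0.0) @ W2

# Node level: concatenate the degree feature with each node's PDOS histogram.
degree = rng.integers(1, 4, size=(n, 1)).astype(float)
pdos = rng.random((n, bins))                 # placeholder for per-node PDOS histograms
node_input = np.concatenate([degree, pdos], axis=1)

# Stand-in for the message-passing stack, followed by pooling over nodes.
W_gnn = rng.standard_normal((1 + bins, hidden))
node_emb = np.tanh(node_input @ W_gnn)
pooled = node_emb.mean(axis=0)

# Graph level: transform the DOS histogram and add it right after pooling.
dos = rng.random(bins)                       # placeholder for the graph's DOS histogram
W1 = rng.standard_normal((bins, hidden))
W2 = rng.standard_normal((hidden, hidden))
graph_emb = pooled + mlp(dos, W1, W2)
```

Adding the transformed DOS after pooling requires the MLP output to match the hidden size, whereas the PDOS concatenation only widens the input layer.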
The results are shown in Table 5. For both GIN and GCN, adding DOS and PDOS dramatically boosts performance in counting substructures as well as regressing graph properties in most cases. The results support that the DOS and PDOS contain additional information that cannot be extracted by GNNs; hence, adding them enhances the expressive power of GNNs. The graph eigenspectrum captures important structural graph properties like the diameter, connectedness, clustering, etc. [9]. DOS and LDOS contain rich spectral information that is thus helpful for characterizing a graph.
Characterizing the expressiveness of DOS and LDOS theoretically is an interesting direction which we aim to investigate in the future. In fact, Huang et al. [19] proved that LDOS contains all information regarding the return probabilities for each node. A broader research question is to discover the relationship between the graph spectrum and the graph isomorphism test, which historically has only been studied under certain conditions [51].
5 Conclusion
We propose ADOGE, an unsupervised graph embedding technique designed to efficiently capture structural properties as well as node labels and attributes of a graph. To this end, ADOGE uses spectral density, or density of states (DOS), derived from the eigenspectrum of the graph, as a tool to capture both global and local properties of a graph. Further, we extend local density of states to leverage node labels and attributes, and capitalize on fast approximation algorithms making ADOGE efficient and scalable to large graphs both in terms of time and space. Being unsupervised, it is not only suitable for downstream supervised graph classification tasks, but also applies well to exploratory graph analysis. Through both quantitative and qualitative experiments, we show the efficacy and efficiency of ADOGE, where it outperforms unsupervised baselines and performs comparably to the supervised GNNs on graph classification tasks, and provides various insights into the analysis of real-world attributed graphs.
Notes
This is called LDOS in their paper, but we use PDOS to avoid confusion with our definition of LDOS in this paper.
Note that the migration graph is originally directed. We create an undirected graph with edge weights \(w_{ij}\) set to \(\frac{(m_{ij}+m_{ji})^2}{p_i p_j}\), where \(m_{ij}\) is the total migration from county i to county j and \(p_i\) depicts county i’s total population.
SVM facilitates comparable results between implicit and explicit kernels.
Node degree is used as a structural node feature as input.
References
Wale N, Watson IA, Karypis G (2008) Comparison of descriptor spaces for chemical compound retrieval and classification. Knowl Inf Syst 14(3):347–375
Przulj N (2010) Biological network comparison using graphlet degree distribution. Bioinform 26(6):853–854
Duen H, Carey N, Jeffrey W, Adam W, Christos F (2011) Polonium: terascale graph mining for malware detection. In: SIAM SDM
Ribeiro B, Chen N, Kovacec A (2019) Shaping graph pattern mining for financial risk. Neurocomputing 326:123–131
Mieghem PV (2011) Graph spectra for complex networks. Cambridge University Press, Cambridge
Pothen A, Simon HD, Liou KP (1990) Partitioning sparse matrices with eigenvectors of graphs. SIAM J Matrix Anal Appl 11(3):430–452
Fill JA (1991) Eigenvalue bounds on convergence to stationarity for nonreversible markov chains, with an application to the exclusion process. Ann Appl Probab 62–87
Chakrabarti D, Wang Y, Wang C, Leskovec J, Faloutsos C (2008) Epidemic thresholds in real networks. ACM TISSEC 10(4):1–26
Jin S, Zafarani R (2020) The spectral zoo of networks: embedding and visualizing networks with spectral moments. In: KDD, pp 1426–1434
Dong K, Benson AR, Bindel D (2019) Network density of states. In: KDD, pp 1152–1161
Verma S, Zhang ZL (2017) Hunt for the unique, stable, sparse and fast feature learning on graphs. In: NIPS, pp 88–98
Tsitsulin A, Mottin D, Karras P, Bronstein A, Müller E (2018) Netlsd: hearing the shape of a graph. In: KDD, pp 2347–2356
Narayanan A, Chandramohan M, Venkatesan R, Chen L, Liu Y, Jaiswal S (2017) graph2vec: learning distributed representations of graphs. arXiv:1707.05005
Shervashidze N, Schweitzer P, van Leeuwen EJ, Mehlhorn K, Borgwardt KM (2011) Weisfeiler–Lehman graph kernels. J Mach Learn Res 12:2539–2561
Kriege NM, Giscard PL, Wilson RC (2016) On valid optimal assignment kernels and applications to graph classification. In: NIPS, pp 1615–1623
Wu L, Zhang Z, Nehorai A, Zhao L, Xu F (2019) SAGE: scalable attributed graph embeddings for graph classification. In: ICLR workshop on representation learning on graphs and manifolds
Neumann M, Garnett R, Bauckhage C, Kersting K (2016) Propagation kernels: efficient graph kernels from propagated information. Mach Learn 102(2):209–245
Zhang Z, Wang M, Xiang Y, Huang Y, Nehorai A (2018) RetGK: graph kernels based on return probabilities of random walks. In: NeurIPS, pp 3968–3978
Huang L, Graven AJ, Bindel D (2021) Density of states graph kernels. In: SDM, pp 289–297. SIAM
Kipf TN, Welling M (2017) Semi-supervised classification with graph convolutional networks. In: ICLR
Xu K, Hu W, Leskovec J, Jegelka S (2019) How powerful are graph neural networks? In: ICLR, pp 1–17
Defferrard M, Bresson X, Vandergheynst P (2016) Convolutional neural networks on graphs with fast localized spectral filtering. In: NIPS, pp 3837–3845
Levie R, Monti F, Bresson X, Bronstein MM (2019) CayleyNets: graph convolutional neural networks with complex rational spectral filters. IEEE Trans Signal Process 67(1):97–109
Kriege NM, Johansson FD, Morris C (2020) A survey on graph kernels. Appl Netw Sci 5(1):6
Balcilar M, Guillaume R, Héroux P, Gaüzère B, Adam S, Honeine P (2021) Analyzing the expressive power of graph neural networks in a spectral perspective. In: ICLR
Xu K, Hu W, Leskovec J, Jegelka S (2019) How powerful are graph neural networks? In: ICLR
Wang F, Landau DP (2001) Efficient, multiple-range random walk algorithm to calculate the density of states. Phys Rev Lett 86(10):2050
Li C, Sra S, Jegelka S (2016) Gaussian quadrature for matrix inverse forms with applications. In: International conference on machine learning. PMLR, pp 1766–1775
Golub GH, Welsch JH (1969) Calculation of Gauss quadrature rules. Math Comput 23(106):221–230
Golub GH, Meurant G (1997) Matrices, moments and quadrature II; how to compute the norm of the error in iterative methods. BIT Numer Math 37(3):687–705
Farkas IJ, Derényi I, Barabási AL, Vicsek T (2001) Spectra of real-world graphs: beyond the semicircle law. Phys Rev E 64(2):026704
Banerjee A, Jost J (2008) Spectral plot properties: towards a qualitative classification of networks. Netw Heterog Media 3(2):395
McGraw PN, Menzinger M (2008) Laplacian spectra as a diagnostic tool for network structure and dynamics. Phys Rev E 77(3):031102
Cucuringu M, Mahoney MW (2011) Localization on low-order eigenvectors of data matrices. arXiv:1109.1355
Mitrović M, Tadić B (2009) Spectral and dynamical properties in classes of sparse networks with mesoscopic inhomogeneities. Phys Rev E 80(2):026123
Meurant G (2007) Matrices, moments, and quadrature. In: Milestones in matrix computation: the selected works of Gene H. Golub with commentaries, p 380
Yanardag P, Vishwanathan S (2015) Deep graph kernels. In: Proceedings of the 21th ACM SIGKDD international conference on knowledge discovery and data mining, pp 1365–1374
Borgwardt KM, Ong CS, Schönauer S, Vishwanathan S, Smola AJ, Kriegel HP (2005) Protein function prediction via graph kernels. Bioinformatics 21(suppl1):47–56
Dobson PD, Doig AJ (2003) Distinguishing enzyme structures from nonenzymes without alignments. J Mol Biol 330(4):771–783
Riesen K, Bunke H (2008) IAM graph database repository for graph based pattern recognition and machine learning. In: Joint IAPR international workshops on statistical techniques in pattern recognition (SPR) and structural and syntactic pattern recognition (SSPR), pp 287–297. Springer
Traud AL, Mucha PJ, Porter MA (2012) Social structure of Facebook networks. Physica A 391(16):4165–4180
Chen Z, Chen L, Villar S, Bruna J (2020) Can graph neural networks count substructures? Adv Neural Inf Process Syst 33:10383–10395
Corso G, Cavalleri L, Beaini D, Liò P, Veličković P (2020) Principal neighbourhood aggregation for graph nets. Adv Neural Inf Process Syst 33:13260–13271
Duvenaud DK, Maclaurin D, Aguilera-Iparraguirre J, Gómez-Bombarelli R, Hirzel T, Aspuru-Guzik A, Adams RP (2015) Convolutional networks on graphs for learning molecular fingerprints. Adv Neural Inf Process Syst 28
Fan W, Ma Y, Li Q, He Y, Zhao E, Tang J, Yin D (2019) Graph neural networks for social recommendation. In: The world wide web conference, pp 417–426
Shi L, Zhang Y, Cheng J, Lu H (2019) Skeleton-based action recognition with directed graph neural networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 7912–7921
Bouritsas G, Frasca F, Zafeiriou S, Bronstein MM (2020) Improving graph neural network expressivity via subgraph isomorphism counting. arXiv:2006.09252
Barceló P, Geerts F, Reutter JL, Ryschkov M (2021) Graph neural networks with local graph parameters. Adv Neural Inf Process Syst 34:25280–25293
Sato R, Yamada M, Kashima H (2021) Random features strengthen graph neural networks. In: Proceedings of the 2021 SIAM international conference on data mining (SDM), pp 333–341. SIAM
Abboud R, Ceylan İİ, Grohe M, Lukasiewicz T (2021) The surprising power of graph neural networks with random node initialization. In: Proceedings of the thirtieth international joint conference on artificial intelligence (IJCAI)
Babai L, Grigoryev DY, Mount DM (1982) Isomorphism of graphs with bounded eigenvalue multiplicity. In: Proceedings of the fourteenth annual ACM symposium on theory of computing, pp 310–324
Acknowledgements
This work is sponsored by NSF CAREER 1452425. We also thank PwC Risk and Regulatory Services Innovation Center at CMU. Any conclusions expressed in this material are those of the authors and do not necessarily reflect the views of the funding parties.
Funding
Open Access funding provided by Carnegie Mellon University.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Zhao, L., Sawlani, S. & Akoglu, L. Density of states for fast embedding node-attributed graphs. Knowl Inf Syst 65, 2455–2483 (2023). https://doi.org/10.1007/s10115-023-01836-3
DOI: https://doi.org/10.1007/s10115-023-01836-3