Natural Graph Wavelet Packet Dictionaries

We introduce a set of novel multiscale basis transforms for signals on graphs that utilize their "dual" domains by incorporating the "natural" distances between graph Laplacian eigenvectors, rather than simply using the eigenvalue ordering. These basis dictionaries can be seen as generalizations of the classical Shannon wavelet packet dictionary and of the classical time-frequency adapted wavelet packets to arbitrary graphs, and do not rely on the frequency interpretation of Laplacian eigenvalues. We describe the algorithms (involving either vector rotations or orthogonalizations) to efficiently approximate and compress signals through the best-basis algorithm, and demonstrate the strengths of these basis dictionaries for graph signals on sunflower graphs and road traffic networks.


Introduction and Motivation
There has been an explosion of interest in and demand for analyzing data sampled on graphs and networks. This has motivated the development of more flexible yet mathematically sound dictionaries (i.e., overcomplete collections of atoms or basis vectors) for data analysis and signal processing on graphs. Our main goal here is to build smooth multiscale localized basis dictionaries on an input graph, with beneficial reconstruction and sparsity properties, and to fill the "gap" left by our previous graph basis dictionary constructions [15,14,16,17,36], as we explain below. Our approach differs from the standard literature in that we fully utilize both the similarities between the nodes (through the graph adjacency matrix) and the similarities between the eigenvectors of the graph Laplacian matrix (through new non-trivial eigenvector distances).
Previous approaches to construct such graph basis dictionaries break down into two main categories. The first category partitions the nodes through recursive graph cuts to generate multiscale basis dictionaries. This includes: the Hierarchical Graph Laplacian Eigen Transform (HGLET) [15]; the Generalized Haar-Walsh Transform (GHWT) [14]; its extension, the eGHWT [36]; and other Haar-like graph wavelets (see, e.g., [29,21,10,3,38]). But their basis vectors either are nonsmooth piecewise constants or have non-overlapping supports. The second category uses spectral filters on the Laplacian (or diffusion kernel) eigenvalues to generate multiscale smooth dictionaries. This includes: the Spectral Graph Wavelet Transform [12]; Diffusion Wavelets [4]; extensions to spectral graph convolutional networks [22]. However, these dictionaries ignore the fact that eigenvectors have a more complex relationship than simply the difference in size of the corresponding eigenvalues [34,2,24]. These relationships can result from eigenvector localization in different clusters, differing scales in multi-dimensional data, etc. These notions of similarity and difference between eigenvectors, while studied in the eigenfunction literature [34,2,24], have yet to be incorporated into building localized dictionaries on graphs.
We combine the benefits of both approaches to construct the graph equivalent of spatial-spectral filtering. We take two approaches: one utilizes the dual geometry of an input graph without partitioning the input graph itself, and the other fully utilizes clustering and partition information in both the input graph and its dual domain.
Our first approach, detailed in Section 3, fills the "gap" in the cycle of our development of the graph basis dictionaries, i.e., HGLET, GHWT, and eGHWT. This approach is a direct generalization of the classical wavelet packet dictionary [26, Chap. 8] to the graph setting: we hierarchically partition the dual domain to generate tree-structured "subbands," each of which is an appropriate subset of the graph Laplacian eigenvectors. We also want to note the following correspondence: the HGLET [15] is a graph version of the Hierarchical Block DCT dictionary [26, Sec. 8.3] (i.e., the non-smooth non-overlapping version of the local cosine dictionary [7], [26, Sec. 8.5]), and the former exactly reduces to the latter if the input graph is P_N, a path graph with N nodes. The former hierarchically partitions the input graph while the latter does the same (in a non-adaptive manner) on the unit interval [0, 1] in the time domain. On the other hand, the GHWT [14] is a graph version of the Haar-Walsh wavelet packet dictionary [6], [26, Sec. 8.1], and the former exactly reduces to the latter if the input graph is P_N. The latter hierarchically partitions the interval [0, N) in the sequency domain while the former does the same by graph domain partitioning plus reordering; see [14,16,17] for the details. Our graph basis dictionary using this first approach is a graph version of the Shannon wavelet packet dictionary [26, Sec. 8.1.2], which hierarchically partitions the interval [0, 1/2) (or [0, π], depending on how one defines the Fourier transform) in the frequency domain. Again, the former essentially reduces to the latter if the input graph is P_N.
Our second approach, detailed in Section 4, partitions both the input graph and its dual domain; more precisely, we first hierarchically partition the dual domain, and then partition the input graph with constraints imposed by the dual domain partition. This approach parallels and generalizes classical time-frequency analysis, where the time domain is replaced by a general node-domain geometry and the frequency domain is replaced by a general eigenvector-domain organization that changes as a function of location in the node domain. A version of this approach of node-eigenvector organization that maps the eigenvectors to a 1D projection of the graph has also been considered as a visualization technique for low-frequency eigenfunctions on clustered graphs [11].
We aim for the significance and impact of this research to be twofold. First, these results will provide the first set of graph wavelet packet bases that adaptively scale to the local structure of the graph. This is especially important for graphs with complicated multiscale structure, whose graph Laplacians may have localized eigenfunctions, for example. This is an impactful direction, as most graph wavelet bases previously proposed only tile the node-eigenfunction "plane" along the node "axis," while Laplacian eigenfunctions only tile that plane along the eigenfunction "axis." Our approach in Section 4 constructs filters in both the node domain and the eigenfunction domain, similar to the classical time-frequency adapted wavelet packets that tile both the time and frequency domains [39,13].
Second, in the long term, this is a first method of systematically using the novel concept of eigenvector dual geometry [34,2,24]. This direction can set a path for future modification of spectral graph theory applications to incorporate dual geometry.
This article is organized as follows. Section 2 reviews fundamentals: the basics of graphs, i.e., graph Laplacians and the graph Fourier transform, as well as previously proposed graph wavelet transforms and frames. It also reviews non-trivial metrics between graph Laplacian eigenvectors, which are used to construct the dual domain of an input graph. Section 3 presents a natural graph wavelet packet dictionary constructed through hierarchical partitioning of the dual graph of Laplacian eigenvectors. Section 4 presents a second version of a natural graph wavelet packet dictionary constructed through a pair of hierarchical partitions, one on the input graph and one on its dual domain. In Section 5, we demonstrate the usefulness of our proposed graph wavelet packet dictionaries in graph signal approximation using numerical experiments. Code scripts can be found at [23]. Finally, we discuss our findings gained through these numerical experiments and near-future projects for further improvements of our dictionaries.

Graph Laplacians and Graph Fourier Transform
Let G = G(V, E) be an undirected connected graph. Let V = V(G) = {v_1, v_2, ..., v_N} denote the set of nodes (or vertices) of the graph, where N := |V(G)|. For simplicity, we typically associate each vertex with its index and write i in place of v_i. E = E(G) = {e_1, e_2, ..., e_M} is the set of edges, where each e_k connects two vertices, say, i and j, and M := |E(G)|. In this article we consider only finite graphs (i.e., M, N < ∞). Moreover, we restrict to the case of simple graphs; that is, graphs without loops (an edge connecting a vertex to itself) and multiple edges (more than one edge connecting a pair of vertices). We use f ∈ R^N to denote a graph signal on G, i.e., f := (f(1), ..., f(N))^T, where f(i) is the value of the signal at node i.

We now discuss several matrices associated with graphs. The information in both V and E is captured by the edge weight matrix W(G) ∈ R^{N×N}, where W_ij ≥ 0 is the edge weight between nodes i and j. In an unweighted graph, this is restricted to be either 0 or 1, depending on whether nodes i and j are adjacent, and we may refer to W(G) as an adjacency matrix. In a weighted graph, W_ij indicates the affinity between nodes i and j. In either case, since G is undirected, W(G) is a symmetric matrix. We then define the degree matrix D(G) as the diagonal matrix with entries D_ii = Σ_j W_ij. With this in place, we are now able to define the (unnormalized) Laplacian matrix, random-walk normalized Laplacian matrix, and symmetric normalized Laplacian matrix, respectively, as

L(G) := D(G) − W(G),   L_rw(G) := D(G)^{−1} L(G),   L_sym(G) := D(G)^{−1/2} L(G) D(G)^{−1/2}.   (1)

We use 0 = λ_0 ≤ λ_1 ≤ ... ≤ λ_{N−1} to denote the sorted Laplacian eigenvalues and φ_0, φ_1, ..., φ_{N−1} to denote their corresponding eigenvectors, where the specific Laplacian matrix to which they refer will be clear from either context or subscripts. Denoting Φ := [φ_0, ..., φ_{N−1}] and Λ := diag([λ_0, ..., λ_{N−1}]), the eigendecomposition of L(G) can be written as L(G) = ΦΛΦ^T. Since we only consider connected graphs here, we have 0 = λ_0 < λ_1.
This second smallest eigenvalue λ 1 is called the algebraic connectivity of G and the corresponding eigenvector φ 1 is called the Fiedler vector of G. The Fiedler vector plays an important role in graph partitioning and spectral clustering; see, e.g., [41], which suggests the use of the Fiedler vector of L rw (G) for spectral clustering over that of the other Laplacian matrices.
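For concreteness, the three Laplacian matrices and the Fiedler vector can be computed in a few lines. The following is an illustrative Python/numpy sketch (the code accompanying this article [23] is written in Julia), shown here on the path graph P_4:

```python
import numpy as np

def laplacians(W):
    """Return the unnormalized, random-walk, and symmetric normalized Laplacians."""
    d = W.sum(axis=1)
    L = np.diag(d) - W
    L_rw = np.diag(1.0 / d) @ L
    L_sym = np.diag(d ** -0.5) @ L @ np.diag(d ** -0.5)
    return L, L_rw, L_sym

# path graph P_4 (nodes 0-1-2-3), unit edge weights
W = np.zeros((4, 4))
for i, j in [(0, 1), (1, 2), (2, 3)]:
    W[i, j] = W[j, i] = 1.0

L, L_rw, L_sym = laplacians(W)
lam, Phi = np.linalg.eigh(L)   # eigenvalues in ascending order, orthonormal eigenvectors
fiedler = Phi[:, 1]            # Fiedler vector: eigenvector of lambda_1
```

Since the path graph is connected, lam[0] = 0 with a constant eigenvector, and the Fiedler vector changes sign exactly once along the path, which is what makes it useful for bipartitioning.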

Remark 2.1.
In this article, we use the Fiedler vectors of L rw of an input graph and its subgraphs as a tool to hierarchically bipartition the graph although any other graph partition methods can be used in our proposed algorithms. However, note that we use the unnormalized graph Laplacian eigenvectors of L(G) for simplicity to construct the dual domain of G and consequently our graph wavelet packet dictionaries. In other words, L rw is only used to compute its Fiedler vector for our graph partitioning purposes.
The graph Laplacian eigenvectors are often viewed as generalized Fourier modes on graphs. Therefore, for any graph signal f ∈ R^N and coefficient vector g ∈ R^N, the graph Fourier transform and inverse graph Fourier transform [37] are defined by

F_G f := Φ^T f,   F_G^{−1} g := Φ g.

Since Φ is an orthogonal matrix, it is not hard to see that F_G^{−1} ∘ F_G = I_N. Thus, we can use F_G as an analysis operator and F_G^{−1} as a synthesis operator for graph harmonic analysis.
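The analysis/synthesis pair is a single matrix-vector product each way. A minimal Python sketch (illustrative only; the article's own code [23] is in Julia), using the complete graph K_3 as a toy example:

```python
import numpy as np

# complete graph K_3 as a tiny example
W = np.ones((3, 3)) - np.eye(3)
L = np.diag(W.sum(axis=1)) - W
lam, Phi = np.linalg.eigh(L)     # orthonormal eigenvectors of L

f = np.array([1.0, -2.0, 3.0])   # a graph signal
f_hat = Phi.T @ f                # analysis: graph Fourier transform
f_rec = Phi @ f_hat              # synthesis: inverse graph Fourier transform
```

Because Φ is orthogonal, synthesis after analysis recovers f exactly, and the coefficient vector has the same ℓ² norm as the signal (a Parseval relation).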

Graph Wavelet Transforms and Frames
We now briefly review graph wavelet transforms and frames; see, e.g., [37,31] for more information. Translation and dilation are two important operators for classical wavelet construction. However, unlike in R^d (d ∈ N) or its finite and discretized lattice graph P_{N_1} × ... × P_{N_d}, we cannot assume in general that the underlying graph has self-symmetric structure (i.e., the interior nodes do not always have the same neighborhood structure, unless the graph is, for instance, an unweighted semiregular tree [1]). Therefore, it is difficult to construct graph wavelet bases or frames by translating and dilating a single mother wavelet function with a fixed shape, e.g., the Mexican hat mother wavelet in R, because the graph structure varies at different locations. Instead, some researchers, e.g., Hammond et al. [12], constructed wavelet frames by shifting smooth graph spectral filters to be centered at different nodes. A general framework for building wavelet frames can then be summarized as follows:

ψ_{j,n} := Φ F_j Φ^T δ_n,   for j = 0, 1, ..., J and n = 1, 2, ..., N,

where the index j stands for the scale of spectral filtering (the greater j, the finer the scale), the index n represents the center location of the wavelet, and the diagonal matrices F_j ∈ R^{N×N} satisfy F_0(l,l) = h(λ_{l−1}) and F_j(l,l) = g(s_j λ_{l−1}) for 1 ≤ l ≤ N and 1 ≤ j ≤ J. Here, h is a scaling function (which mainly deals with the small eigenvalues), while g is a graph wavelet generating kernel. For example, the kernel proposed in [12] can be approximated by Chebyshev polynomials, which leads to a fast algorithm. Note that {s_j}_{1≤j≤J} are dilation parameters and δ_n is the standard basis vector centered at node n. Furthermore, one can show that as long as a generalized partition of unity holds, {ψ_{j,n}}_{0≤j≤J; 1≤n≤N} forms a graph wavelet frame, which can be used to decompose and recover any given graph signal [12].
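The framework above can be sketched directly in Python (an illustrative sketch, not the construction of [12]: the kernels h and g below are ad hoc choices, normalized so that Σ_j F_j² = I, which makes the frame tight and reconstruction exact):

```python
import numpy as np

def spectral_wavelet_frame(W, J=3):
    """Build operators T_j = Phi F_j Phi^T; the n-th column of T_j is psi_{j,n}.
    The raw kernels h, g are ad hoc; normalizing so sum_j F_j^2 = I gives a tight frame."""
    L = np.diag(W.sum(axis=1)) - W
    lam, Phi = np.linalg.eigh(L)
    lmax = max(lam[-1], 1e-12)
    h = np.exp(-2.0 * lam / lmax)                      # scaling kernel (small eigenvalues)
    scales = 2.0 ** np.arange(J)                       # dilation parameters s_j
    g = [(s * lam / lmax) * np.exp(-s * lam / lmax) for s in scales]
    F = np.vstack([h] + g)                             # (J+1) x N filter values
    F = F / np.sqrt((F ** 2).sum(axis=0))              # enforce partition of unity
    return [Phi @ np.diag(Fj) @ Phi.T for Fj in F]

# 6-cycle C_6
N = 6
W = np.zeros((N, N))
for i in range(N):
    W[i, (i + 1) % N] = W[(i + 1) % N, i] = 1.0

Ts = spectral_wavelet_frame(W)
f = np.arange(N, dtype=float)
f_rec = sum(T @ (T @ f) for T in Ts)   # analysis then synthesis; each T_j is symmetric
```

Since each T_j is symmetric and Σ_j T_j² = Φ diag(Σ_j F_j²) Φ^T = I, the analysis/synthesis round trip recovers f exactly.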

Non-Trivial Eigenvector Distances
However, one important drawback of the above method is that the construction of the spectral filters F_j solely depends on the eigenvalue distribution (except for some flexibility in choosing the filter pair (h, g) and the dilation parameters {s_j}) and does not reflect how the eigenvectors behave. For simple graphs such as P_N and C_N (a cycle graph with N nodes), the graph Laplacian eigenvectors are global sinusoids whose frequencies can simply be read off from the corresponding eigenvalues, as discussed in [16,34]. Hence, the usual Littlewood-Paley wavelet theory (see, e.g., [9, Sec. 4.2], [18, Sec. 2.4]) applies for those simple graphs. Unfortunately, the graph Laplacian eigenvectors of a general graph can behave in a much more complicated manner than those of P_N or C_N (as discussed in [35,30,34,2,24]): they cannot be described and ordered simply by the corresponding eigenvalues. Therefore, it is problematic to design graph wavelets using spectral filters built solely upon eigenvalues, and we need a way to distinguish eigenvector behaviors. To quantify the difference in behavior between the eigenvectors, however, we cannot use the usual ℓ²-distances among them, since orthonormality forces every pairwise distance to equal √2. As a remedy, we measure the "behavioral" difference between the eigenvectors as discussed in [2,34,24], one of which is used in our numerical experiments in Section 5. We briefly summarize this particular behavioral distance below. See [2,34,24] for more information on the other eigenvector metrics.
Instead of the usual ℓ²-distance, we use the absolute gradient of each eigenvector as its feature vector describing its behavior. More precisely, let Q(G) ∈ R^{N×M} be the incidence matrix of an input graph G(V, E, W) whose kth column indicates the head and tail of the kth edge e_k ∈ E. We note that we need to orient each edge of G in an arbitrary manner to form a directed graph temporarily in order to construct its incidence matrix. For example, if e_k joins nodes i and j, then we can set either Q_ik = 1, Q_jk = −1 or Q_ik = −1, Q_jk = 1; the choice of orientation does not affect the quantities below. We now define the DAG pseudometric between φ_i and φ_j by

d_DAG(φ_i, φ_j) := ‖ abs.(Q^T φ_i) − abs.(Q^T φ_j) ‖_2,   (5)

where abs.(·) applies the absolute value in the entrywise manner to its argument. We note that this quantity is not a metric but a pseudometric because the identity of indiscernibles among the axioms of a metric is not satisfied (e.g., d_DAG(φ, −φ) = 0). In order to see the meaning of this quantity, let us analyze its square as follows:

d_DAG(φ_i, φ_j)² = ‖Q^T φ_i‖² + ‖Q^T φ_j‖² − 2 〈abs.(Q^T φ_i), abs.(Q^T φ_j)〉_E,

where 〈·, ·〉_E is the inner product over edges. The last term of the formula can be viewed as a global average of absolute local correlation between eigenvectors. In this sense, this quantity is related to the Hadamard-product affinity between eigenvectors proposed by Cloninger and Steinerberger [2]. Note that the computational cost is O(M) for each d_DAG(·, ·) evaluation provided that the eigenvectors have already been computed.
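The pseudometric of Eq. (5) is a one-liner once the incidence matrix is in hand. An illustrative Python sketch on the path graph P_3 (the article's own code [23] is in Julia); note how the sign-flip invariance exhibits the failure of the identity of indiscernibles:

```python
import numpy as np

def d_dag(Q, phi_i, phi_j):
    """DAG pseudometric: l2 distance between entrywise absolute edge gradients."""
    return np.linalg.norm(np.abs(Q.T @ phi_i) - np.abs(Q.T @ phi_j))

# incidence matrix of P_3 with edges arbitrarily oriented 0->1 and 1->2
# rows = nodes, columns = edges
Q = np.array([[ 1.0,  0.0],
              [-1.0,  1.0],
              [ 0.0, -1.0]])
L = Q @ Q.T                     # Q Q^T recovers the unweighted Laplacian
lam, Phi = np.linalg.eigh(L)
```

Each evaluation costs one sparse multiply by Q^T (i.e., O(M) as stated), and d_dag(phi, -phi) = 0 even though phi != -phi, so this is genuinely a pseudometric.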

Graph Wavelet Packets
Instead of building the graph wavelet packet dictionary from graph wavelet frames as summarized in Section 2.2, one could also accomplish it by generalizing the classical wavelet packets to graphs. The classical wavelet packet decomposition (or dictionary construction) of a 1D discrete signal is obtained by passing it through a full binary tree of filters (each node of the tree represents either the low-pass or high-pass filtered version of the coefficients entering that node, followed by the subsampling operation) to get a set of binary-tree-structured coefficients [8], [26, Sec. 8.1]. This basis dictionary for an input signal of length N has up to N(1 + log_2 N) basis vectors (hence clearly redundant), yet contains more than 1.5^N searchable orthonormal bases (ONBs) [8,39]. For the purpose of efficient signal approximation, the best-basis algorithm originally proposed by Coifman and Wickerhauser [8] can find the most desirable ONB (and the expansion coefficients of the input signal) for such a task among this immense number of ONBs. The best-basis algorithm requires a user-specified cost function, e.g., the ℓ^p-norm (0 < p ≤ 1) of the expansion coefficients for sparse signal approximation, and the basis search starts at the bottom level of the dictionary and proceeds upwards, comparing the cost of the coefficients at the children nodes to the cost of the coefficients at their parent nodes. This best-basis search procedure costs only O(N) operations provided that the expansion coefficients of the input signal have already been computed. In order to generalize the classical wavelet packets to the graph setting, however, there are two main difficulties: 1) the concept of the frequency domain of a given graph is not well-defined; and 2) the relation between the Laplacian eigenvectors and sample locations is much more subtle on general graphs.
For 1), we propose to construct a dual graph G⋆ of the input graph G and view it as the natural spectral domain of G, and to use any graph partition method to hierarchically bipartition G⋆ instead of building low-pass and high-pass filters as in the classical case. This can be viewed as a generalized Littlewood-Paley theory. For 2), we propose a node-eigenvector organization algorithm called the pair-clustering algorithm, which implicitly provides a downsampling process on graphs; see Section 4 for the details.
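The bottom-up best-basis search of Coifman-Wickerhauser described above can be sketched in a few lines of Python (a hypothetical sketch with an ℓ¹ cost; the coefficient tree here is a toy dictionary, not one produced by an actual transform):

```python
import numpy as np

def l1_cost(coefs):
    """l^1 cost for sparse approximation (p = 1)."""
    return np.abs(coefs).sum()

def best_basis(tree, node=(0, 0)):
    """Bottom-up best-basis search on a binary coefficient tree.
    `tree` maps (j, k) -> coefficient array; returns (cost, list of chosen nodes)."""
    j, k = node
    if (j + 1, 2 * k) not in tree:                # leaf node
        return l1_cost(tree[node]), [node]
    cl, nl = best_basis(tree, (j + 1, 2 * k))     # best basis of left child
    cr, nr = best_basis(tree, (j + 1, 2 * k + 1)) # best basis of right child
    own = l1_cost(tree[node])
    if own <= cl + cr:
        return own, [node]                        # parent beats its children
    return cl + cr, nl + nr                       # children win: keep their union

# toy coefficient tree: the children's coefficients are sparser than the parent's
tree = {(0, 0): np.array([1.0, 1.0, 1.0, 1.0]),
        (1, 0): np.array([2.0, 0.0]),
        (1, 1): np.array([0.0, 0.1])}
cost, basis = best_basis(tree)
```

Here the children cost 2.0 + 0.1 = 2.1 < 4.0, so the search keeps the two children, visiting each tree node once, which is the O(N) behavior noted in the text.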

Natural Graph Wavelet Packets using Varimax Rotations
Given a graph G = G(V, E, W) with |V| = N and a non-trivial distance d between its eigenvectors (e.g., d_DAG of Eq. (5)), we build a complete dual graph G⋆ = G⋆(V⋆, E⋆, W⋆) by viewing the eigenvectors as its nodes, V⋆ = {φ_0, ..., φ_{N−1}}, and the non-trivial affinity between eigenvector pairs as its edge weights, i.e., W⋆_ij := 1/d(φ_i, φ_j). Using G⋆ for representing the graph spectral domain and studying relations between the eigenvectors is clearly more natural and effective than simply using the eigenvalue magnitudes, as [2,34,24] hinted. In this section, we propose one of our graph wavelet packet dictionary constructions based solely on hierarchical bipartitioning of G⋆. The basic steps to generate such a graph wavelet packet dictionary for G are quite straightforward: Step 1: Bipartition the dual graph G⋆ recursively via any method, e.g., spectral graph bipartition using the Fiedler vectors; Step 2: Generate wavelet packet vectors using the eigenvectors belonging to each subgraph of G⋆ so that they are well localized on G.
Note that Step 1 corresponds to bipartitioning the frequency band of an input signal using the characteristic functions in the classical setting. Hence, our graph wavelet packet dictionary constructed as above can be viewed as a graph version of the Shannon wavelet packet dictionary [26, Sec. 8.1.2]. We now describe the details of each step below.

Hierarchical Bipartitioning of G⋆
Let V_0^(0) := {φ_0, φ_1, ..., φ_{N−1}} be the node set of the dual graph G⋆, which is simply the set of the eigenvectors of the unnormalized graph Laplacian matrix L(G). Suppose we get the hierarchical bipartition tree of V_0^(0) as shown in Figure 1. Hence, each V_k^(j) contains an appropriate subset of the eigenvectors, whose cardinality we denote by N_k^j := |V_k^(j)|. With a slight abuse of notation, let V_k^(j) also denote a matrix of size N × N_k^j whose columns are those eigenvectors. As we mentioned earlier, any graph bipartitioning method can be used to generate this hierarchical bipartition tree of G⋆. Typically, we use the Fiedler vector of the random-walk normalized graph Laplacian matrix L_rw (see Eq. (1)) of each subgraph of G⋆, whose use is preferred over that of L or L_sym as von Luxburg discussed [41].
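The recursive Fiedler-vector bipartition can be sketched as follows (a hypothetical dense-numpy sketch, not the authors' implementation; it applies to any weight matrix, so it works equally for the dual graph's weights). For illustration it is run on a 6-node graph made of two triangles joined by one weak edge:

```python
import numpy as np

def fiedler_bipartition(W, idx):
    """Split an index set by the sign of the Fiedler vector of L_rw
    restricted to the subgraph induced by idx."""
    sub = W[np.ix_(idx, idx)]
    d = sub.sum(axis=1)
    d[d == 0] = 1.0                       # guard against isolated nodes
    L_rw = np.eye(len(idx)) - sub / d[:, None]
    lam, U = np.linalg.eig(L_rw)          # L_rw is not symmetric in general
    f = U[:, np.argsort(lam.real)[1]].real  # eigenvector of 2nd smallest eigenvalue
    left = [i for i, v in zip(idx, f) if v >= 0]
    right = [i for i, v in zip(idx, f) if v < 0]
    return left, right

def hierarchical_bipartition(W, idx=None):
    """Recursively bipartition down to singletons; returns a nested list."""
    if idx is None:
        idx = list(range(W.shape[0]))
    if len(idx) <= 1:
        return idx
    left, right = fiedler_bipartition(W, idx)
    if not left or not right:             # degenerate split: stop recursing
        return idx
    return [hierarchical_bipartition(W, left), hierarchical_bipartition(W, right)]

# two triangles {0,1,2} and {3,4,5} joined by a weak edge (2,3)
W = np.zeros((6, 6))
for i, j, w in [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
                (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 0.1)]:
    W[i, j] = W[j, i] = w
left, right = fiedler_bipartition(W, list(range(6)))
tree = hierarchical_bipartition(W)
```

The first split separates the two weakly coupled triangles, as expected from spectral clustering.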

Localization on G via Varimax Rotation
For realizing Step 2 of the above basic algorithm, we propose to use the varimax rotation on V_k^(j) for each j and k. A varimax rotation is an orthogonal rotation, originally proposed by Kaiser [20] and often used in factor analysis (see, e.g., [28, Chap. 11]), to maximize the variances of the energy distribution (or a scaled version of the kurtosis) of the input column vectors, which can also be interpreted as the approximate entropy minimization of the distribution of the eigenvector components [33, Sec. 3.2]. For the implementation of the varimax rotation algorithm, see Appendix A, which is based on the Basic Singular Value (BSV) Varimax Algorithm of [19]. Thanks to the orthonormality of the columns of V_k^(j), this is equivalent to finding an orthogonal rotation that maximizes the overall 4th order moments, i.e.,

Ψ_k^(j) := V_k^(j) R⋆,   where R⋆ := argmax_{R orthogonal} Σ_p Σ_q ([V_k^(j) R]_{pq})^4.   (6)

The column vectors of Ψ_k^(j) are localized on G, in contrast to the global nature of the graph Laplacian eigenvectors in Φ. The collection {Ψ_k^(j)} forms a wavelet packet dictionary for an input graph signal, where one can run the best-basis algorithm of Coifman-Wickerhauser [8] to extract the ONB most suitable for a task at hand (e.g., an efficient graph signal approximation) once an appropriate cost function is specified (e.g., the ℓ^p-norm minimization, 0 < p ≤ 1). Note also that it is easy to extract a graph Shannon wavelet basis from this dictionary by specifying the appropriate nodes in the binary-tree-structured dual graph explicitly, i.e., Ψ_1^(1), Ψ_1^(2), . . . . Let us now demonstrate that our algorithm actually generates the classical Shannon wavelet packet dictionary [26, Sec. 8.1.2] when the input graph is the simple path P_N. Note that the varimax rotation algorithm does not necessarily sort the vectors as shown in Figure 2, because the objective in Eq. (6) is invariant under any permutation of the columns and any sign flip of each column.
In other words, to produce Figure 2, we carefully applied sign flips to some of the columns and sorted the columns so that each subfigure simply shows translations of the corresponding wavelet packet vectors.
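A minimal iterative sketch of such a varimax rotation in Python (a hypothetical BSV-style sketch maximizing the quartic objective of Eq. (6) via repeated SVDs of its gradient; it is not the exact Appendix A implementation), applied to the first two Laplacian eigenvectors of P_4:

```python
import numpy as np

def varimax(X, max_iter=1000, tol=1e-12):
    """Find an orthogonal R maximizing sum((X @ R)**4) by iterating
    SVDs of the criterion's gradient (BSV-style fixed-point iteration)."""
    R = np.eye(X.shape[1])
    prev = -np.inf
    for _ in range(max_iter):
        Y = X @ R
        U, s, Vt = np.linalg.svd(X.T @ Y ** 3)   # gradient of the quartic criterion
        R = U @ Vt                               # project back onto the orthogonal group
        if s.sum() - prev <= tol * max(abs(prev), 1.0):
            break                                # nuclear-norm based convergence check
        prev = s.sum()
    return X @ R

# first two Laplacian eigenvectors of the path graph P_4
W = np.diag(np.ones(3), 1)
W = W + W.T
L = np.diag(W.sum(axis=1)) - W
_, Phi = np.linalg.eigh(L)
X = Phi[:, :2]
Psi = varimax(X)
```

Because each step maximizes a linear lower bound of the convex quartic objective, the criterion is nondecreasing, and Psi keeps the orthonormality of X while concentrating the energy of each column.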

Computational Complexity
The varimax rotation algorithm of Appendix A is iterative in nature and is an example of the BSV algorithms [19]: each iteration at the dual node set V_k^(j) requires the SVD of an N_k^j × N_k^j matrix. The convergence is checked with respect to the relative error between the current and previous gradient estimates measured in the nuclear norm (i.e., the sum of the singular values). For our numerical experiments in Section 5, we set the maximum number of iterations to 1000 and the error tolerance to 10^{−12}. Therefore, to generate Ψ_k^(j) for each (j,k), the computational cost in the worst case scenario is O(c · (N_k^j)³ + N · (N_k^j)²), where c = 1000; the first term accounts for the SVD computations and the second for the matrix multiplication. For a perfectly balanced bipartition tree with N = 2^{j_max}, we have N_k^j = 2^{j_max − j}, j = 0, ..., j_max, k = 0, ..., 2^j − 1. Hence we have:

Σ_{k=0}^{2^j − 1} (N_k^j)³ = 2^j · 2^{3(j_max − j)} = N³ · 4^{−j}   (7)

and

Σ_{k=0}^{2^j − 1} N · (N_k^j)² = N · 2^j · 2^{2(j_max − j)} = N³ · 2^{−j}.   (8)

Note that at the bottom level j = j_max, each node is a leaf containing only one eigenvector, and there is no need to do any rotation estimation and computation. Hence, the total worst case computational cost is obtained by summing the cost O(c · (N_k^j)³ + N · (N_k^j)²) from j = 0 to j_max − 1, i.e., O((2 + 4c/3)N³ − 2N² − (4c/3)N); after all, it is an O(N³) algorithm. In practice, the convergence is often achieved in fewer than 1000 iterations at each node, except possibly for the nodes with small j whose N_k^j is large. For example, when computing the VM-NGWP dictionary for the path graph P_512 shown in Figure 2, the average number of iterations over all the dual graph nodes was well below this maximum.

Natural Graph Wavelet Packets using Pair-Clustering
Another way to construct a natural graph wavelet packet dictionary is to mimic the convolution and subsampling strategy of the classical wavelet packet dictionary construction: form a full binary tree of spectral filters (i.e., on the dual graph G⋆) and then perform the filtering/downsampling process based on the relations between the nodes and the eigenvectors of G. In order to fully utilize such relations, we look for a coordinated pair of partitions on G and G⋆, which is realized by our pair-clustering algorithm described below. We first describe the one-level pair-clustering algorithm and then proceed to the hierarchical version.

One-Level Pair-Clustering
Suppose we partition the dual graph G⋆ into K ≥ 2 clusters using any method, including the spectral clustering [41] we used in the previous section. Let V⋆_1, ..., V⋆_K be those mutually disjoint K clusters of the nodes V⋆ representing the graph Laplacian eigenvectors, i.e., V⋆ = ∪_{k=1}^K V⋆_k, which is also often written as ⊔_{k=1}^K V⋆_k. Denote the cardinality of each cluster as N_k := |V⋆_k|, k = 1, ..., K; clearly Σ_{k=1}^K N_k = N. Then, we also partition the original nodes V into mutually disjoint K clusters, V_1, ..., V_K, with the constraint that |V_k| = |V⋆_k| = N_k, k = 1, ..., K, and the members of V_k and V⋆_k are as "closely related" as possible. The purpose of partitioning V is to select appropriate graph nodes as sampling points around which the graph wavelet packet vectors built from the information in V⋆_k are localized. With a slight abuse of notation, let V also represent the collection of the standard basis vectors in R^N, i.e., V := {δ_1, ..., δ_N}, where δ_k(k) = 1 and 0 otherwise. In order to formalize this constrained clustering of V, we define the affinity measure α between V_k and V⋆_k as follows:

α(V_k, V⋆_k) := Σ_{δ ∈ V_k} Σ_{φ ∈ V⋆_k} 〈δ, φ〉²,

where 〈·, ·〉 is the standard inner product in R^N. Note that α(V, V⋆) = Σ_{δ ∈ V, φ ∈ V⋆} 〈δ, φ〉² = N. Now, over all feasible partitions V = ⊔_{k=1}^K V_k with |V_k| = N_k, we need to solve the following optimization problem for a given partition of V⋆ = ⊔_{k=1}^K V⋆_k:

max Σ_{k=1}^K α(V_k, V⋆_k).   (9)

This is a discrete optimization problem. In general, it is not easy to find the global optimal solution except for the case K = 2. For K = 2, we can find the desired partition of V by the following greedy algorithm: 1) compute α({δ}, V⋆_1) − α({δ}, V⋆_2) for each δ ∈ V; 2) select the largest N_1 such δ's in V, set them as V_1, and set V_2 = V \ V_1.
When K > 2, we can find a local optimum by a similar strategy: 1) compute the values α({δ}, V⋆_1) for each δ ∈ V; 2) select the largest N_1 such δ's and set them as V_1; 3) compute the values α({δ}, V⋆_2) for each δ ∈ V \ V_1, select the largest N_2 such δ's, and set them as V_2; 4) repeat the above process to produce V_3, ..., V_K. While this greedy strategy does not necessarily reach the global optimum of Eq. (9), we find that empirically the algorithm attains a reasonably large value of the objective function.
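The K = 2 greedy step can be sketched compactly, since α({δ_l}, V⋆_k) is just the squared entries of Φ summed over the eigenvectors in V⋆_k. An illustrative Python sketch (hypothetical function name; the degenerate choice Phi = I below is only to make the mechanics easy to check):

```python
import numpy as np

def pair_cluster_k2(Phi, dual_cluster_1):
    """Greedy one-level pair-clustering for K = 2: assign to V_1 the |V*_1|
    graph nodes l with the largest alpha({delta_l}, V*_1) - alpha({delta_l}, V*_2)."""
    N = Phi.shape[0]
    P = Phi ** 2                          # P[l, k] = <delta_l, phi_k>^2
    in1 = np.zeros(N, dtype=bool)
    in1[dual_cluster_1] = True
    score = P[:, in1].sum(axis=1) - P[:, ~in1].sum(axis=1)
    order = np.argsort(-score, kind="stable")   # nodes by decreasing score
    n1 = len(dual_cluster_1)
    V1 = sorted(order[:n1].tolist())
    V2 = sorted(order[n1:].tolist())
    return V1, V2

# sanity check with Phi = I: each "eigenvector" is a delta, so V_k must
# recover exactly the nodes indexing the chosen dual cluster
V1, V2 = pair_cluster_k2(np.eye(4), [1, 3])
```

Since the rows of P sum to 1 (orthonormal rows of Φ), the score is a proper trade-off between affinity to V⋆_1 and to V⋆_2.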

Hierarchical Pair-Clustering
In order to build a multiscale graph wavelet packet dictionary, we develop a hierarchical (i.e., multilevel) version of the pair-clustering algorithm. First, let us assume that the hierarchical bipartition tree of V⋆ is already computed using the same algorithm discussed in Section 3.1. We begin at level j = 0 with V_0^(0) = V and V⋆_0^(0) = V⋆. Then, we perform the one-level pair-clustering algorithm (K = 2) to get the pairs (V_0^(1), V⋆_0^(1)) and (V_1^(1), V⋆_1^(1)), and we repeat this process recursively on each pair down the bipartition tree.

Generating the NGWP Dictionary
Once we generate the two hierarchical bipartition trees, {V_k^(j)} and {V⋆_k^(j)}, we can proceed to generate the NGWP vectors Ψ_k^(j) that form an NGWP dictionary. For each δ_l ∈ V_k^(j), we first compute the orthogonal projection of δ_l onto the span of V⋆_k^(j). We then use the modified Gram-Schmidt with ℓ^p (0 < p < 2) pivoting orthogonalization (MGSLp) [5] to generate the orthonormal graph wavelet packet vectors associated with V_k^(j).
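A pivoted modified Gram-Schmidt pass can be sketched as follows (a hypothetical Python sketch, not the MGSLp implementation of [5]; in particular, the pivoting rule used here, smallest ℓ^p quasi-norm first, is our assumption):

```python
import numpy as np

def mgs_lp(A, p=1.0, tol=1e-10):
    """Modified Gram-Schmidt with l^p pivoting (a sketch): at each step,
    orthonormalize the remaining column with the smallest l^p quasi-norm."""
    A = A.astype(float).copy()
    remaining = list(range(A.shape[1]))
    Q = []
    while remaining:
        lp = [np.sum(np.abs(A[:, c]) ** p) for c in remaining]
        c = remaining.pop(int(np.argmin(lp)))   # pivot on the l^p quasi-norm
        v = A[:, c]
        nrm = np.linalg.norm(v)
        if nrm < tol:
            continue                            # numerically dependent column: drop
        q = v / nrm
        Q.append(q)
        for c2 in remaining:                    # modified Gram-Schmidt update
            A[:, c2] -= (q @ A[:, c2]) * q
    return np.column_stack(Q)

# columns to orthonormalize (e.g., projections of deltas onto a subspace)
A0 = np.array([[1., 0., 1.],
               [0., 1., 1.],
               [1., 1., 0.],
               [0., 0., 1.]])
Q = mgs_lp(A0)
```

The output columns are orthonormal and span the same subspace as the input columns, which is all the dictionary construction needs from this step.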

Computational Complexity
At each node V⋆_k^(j), computing the orthogonal projections and performing the MGSLp orthogonalization costs O(3N · (N_k^j)²) operations for each (j,k), and we need to sum up this cost over all the tree nodes. Let us analyze the special case of the perfectly balanced bipartition tree with N = 2^{j_max}, as we did for the VM-NGWP in Section 3.3. In this case, the bipartition tree has 1 + j_max levels, and N_k^j = 2^{j_max − j}, k = 0, ..., 2^j − 1. So, for the jth level, summing over k gives O(3N³ · 2^{−j}). Finally, by summing this from j = 0 to j_max − 1 (again, no computation is needed at the bottom level leaves), the total cost for the PC-NGWP dictionary construction in this ideal case is O(3N³ · (2 − 2/N)) ≈ O(6N³). So, it still requires O(N³) operations; the difference from the VM-NGWP lies in the constants, i.e., 6 (PC-NGWP) vs. 1000 · 4/3 (the worst case VM-NGWP).

Applications in Graph Signal Approximation
In this section, we demonstrate the usefulness of our proposed NGWP dictionaries in efficient approximation of graph signals on two graphs, and compare their performance with previously proposed methods, i.e., the global graph Laplacian eigenbasis; the graph Haar basis; the graph Walsh basis; the GHWT coarse-to-fine (c2f) best basis [14]; the GHWT fine-to-coarse (f2c) best basis [14]; and the eGHWT best basis [36]. Note that the graph Haar basis is a particular basis choosable from the GHWT f2c dictionary and the eGHWT dictionary, while the graph Walsh basis is choosable from both versions of the GHWT dictionaries as well as the eGHWT; see [17,36] for the details. We use the ℓ¹-norm minimization as the best-basis selection criterion for all the best bases in our experiments. The edge weights of the dual graph G⋆ are the reciprocals of the DAG pseudometric between the corresponding eigenvectors as defined in Eq. (5). For a given graph G and a graph signal f defined on it, we first decompose f into those dictionaries. Then, to measure the approximation performance, we sort the expansion coefficients in non-increasing order of magnitude and use the top k most significant terms to approximate f, where k ranges from 0 up to about 0.5 × N, i.e., 50% of the total number of terms. Approximation performance is measured by the relative ℓ² approximation error with respect to the fraction of coefficients retained, which ranges from 0 to 0.5.

Figure 3: Sunflower graph (N = 400); node radii vary for visualization purposes only.
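For an orthonormal basis, the whole error-versus-k curve follows from the sorted coefficient magnitudes alone. A small Python sketch of the evaluation described above (illustrative; the experiments themselves use the Julia scripts [23]):

```python
import numpy as np

def relative_l2_errors(coefs, f_norm):
    """errs[k]: relative l2 error when keeping the k largest-magnitude
    coefficients of an orthonormal expansion (k = 0, ..., len(coefs))."""
    c2 = np.sort(np.abs(coefs)) ** 2                 # squared magnitudes, ascending
    tail = np.concatenate([[0.0], np.cumsum(c2)])    # tail[m] = energy of m smallest
    return np.sqrt(tail[::-1]) / f_norm              # discard all but the top k

# coefficients of some f in an ONB, with ||f|| = 5
errs = relative_l2_errors(np.array([3.0, 0.0, 4.0, 0.0]), 5.0)
```

Here errs[0] = 1 (nothing kept), errs[1] = 3/5 after keeping the largest coefficient, and errs[2] = 0 since only two coefficients are nonzero; plotting errs against k/N for each dictionary reproduces the comparison curves described in the text.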

Images Sampled on the Sunflower Graph
We consider the so-called "sunflower" graph shown in Fig. 3. This particular graph has 400 nodes, and each edge weight is set as the reciprocal of the Euclidean distance between the endpoints of that edge. Consistently counting the number of spirals in such a sunflower graph gives rise to the Fibonacci numbers, i.e., 0, 1, 1, 2, 3, 5, 8, 13, 21, 34, 55, . . . . One may observe that 8 clockwise and 13 counterclockwise spirals emanate from the center in Fig. 3. See, e.g., [27,40] and our code SunFlowerGraph.jl in [23], for algorithms to construct such sunflower grids and graphs. We can also view such a distribution of nodes as a simple model of the distribution of photoreceptors in mammalian visual systems due to cell generation and growth; see, e.g., [32, Chap. 9]. Such a viewpoint motivates the following sampling scheme: we overlay the sunflower graph on several parts of the standard Barbara image and sample the grayscale value at each node using bilinear interpolation. In particular, we sample two different regions, her left eye and her pants, where quite different image features are represented, i.e., a piecewise-smooth image containing oriented edges and a textured image with directional oscillatory patterns, respectively.

First, let us discuss our approximation experiments on the Barbara's eye graph signal, which are shown in Fig. 4. From Fig. 4(c), we observe the following: 1) the VM-NGWP best basis performed best, closely followed by the PC-NGWP best basis; 2) the global graph Laplacian eigenbasis worked relatively well; and 3) the bases chosen from the Haar/Walsh wavelet packet dictionaries did not perform well. Note that the GHWT c2f best basis turned out to be the graph Walsh basis for this graph signal.
These observations can be attributed to the fact that this Barbara's eye graph signal is not of piecewise-constant nature: it is a piecewise-smooth graph signal. Hence, those methods using smoother dictionary vectors worked better. Yet, smooth localized basis vectors in the NGWP dictionaries made a difference in performance compared to the global graph Laplacian eigenbasis. In other words, the localized basis vectors were important for efficiently capturing the features of this graph signal.
In order to examine what kind of basis vectors were chosen as the best basis to approximate this Barbara's eye signal, we display the nine most important VM-NGWP best basis vectors in Fig. 5. The corresponding PC-NGWP best basis vectors are quite similar; hence they are not shown here. We note that most of these top basis vectors essentially work as oriented edge detectors for the eye, including the iris part, while the basis vectors #2 and #7 take care of shading and peripheral features of this region.

Now, let us discuss our second approximation experiment: Barbara's pants region as an input graph signal, shown in Fig. 6. The nature of this graph signal is completely different from the eye region: it is dominated by the directional oscillatory patterns of her pants. From Fig. 6(c), we observe the following: 1) the NGWP best bases and the eGHWT best basis performed very well, and their performance is quite close until about 30% of the coefficients are retained, after which the eGHWT best basis outperformed all the others; 2) the GHWT f2c best basis performed relatively well, behind the three bases in 1); 3) there is a substantial gap in performance between those four bases and the rest, i.e., the graph Haar basis, the graph Walsh basis (= the GHWT c2f best basis in this case too), and the global graph Laplacian eigenbasis.
The eGHWT is known to be quite efficient at capturing oscillating patterns, as shown by Shao and Saito for the graph setting [36] and by Lindberg and Villemoes for the classical non-graph setting [25]. Hence, it is encouraging that our NGWPs are competitive with the eGHWT for this type of textured signal. On the other hand, the graph Haar basis, the graph Walsh basis, and the global graph Laplacian eigenbasis do not have oriented anisotropic basis vectors, and consequently they performed poorly.
As in the case of Barbara's eye signal, we display the nine most important NGWP best basis vectors for approximating Barbara's pants signal in Fig. 7, but this time we display the PC-NGWP best basis vectors, since that best basis performed slightly better than the VM-NGWP best basis. The corresponding VM-NGWP best basis vectors, however, are quite similar. We note that the majority of these basis vectors are of a more high-frequency nature than those for the eye signal shown in Fig. 5, which reflects the oscillating anisotropic patterns of her pants. The basis vector #1 takes care of shading in this region, while the basis vector #5 is localized around the lower right peripheral region. The latter captures the strong spotty pattern (a dark point surrounded by light gray points) around that region in the input graph signal shown in Fig. 6(b).
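For reference, the node layout of a sunflower graph like the one in Fig. 3 can be generated with Vogel's golden-angle model. This is a sketch of one standard construction; the paper's SunFlowerGraph.jl may differ in details such as scaling and how the edges are formed.

```python
import numpy as np

def sunflower_layout(n=400):
    """Place n nodes by Vogel's model: node i sits at radius sqrt(i)
    and angle i times the golden angle (about 137.508 degrees)."""
    golden_angle = np.pi * (3.0 - np.sqrt(5.0))
    i = np.arange(1, n + 1)
    r = np.sqrt(i)
    theta = i * golden_angle
    return np.column_stack((r * np.cos(theta), r * np.sin(theta)))
```

Edge weights would then be set as the reciprocals of the Euclidean distances between connected nodes, as described in the text.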

Toronto Traffic Network
We obtained the road network data of the City of Toronto from its open data portal. Using the street names and intersection coordinates included in the dataset, we constructed the graph representing this road network with N = 2275 nodes and M = 3381 edges. Fig. 8(a) displays this graph. As before, each edge weight was set as the reciprocal of the Euclidean distance between the endpoints of that edge.
We analyze two graph signals on this street network: 1) the spatial distribution of the street intersections; and 2) the pedestrian volume measured at such intersections. The graph signal 1) was constructed by counting the number of nodes within a disk of radius of about 5 km centered at each node. In other words, it is a smoothed histogram of the distribution of street intersections, computed with overlapping circular bins of equal size. The radius of this disk was chosen to be the longest edge length, measured in Euclidean distance, among all 3381 edges; that longest edge is located at the northeast corner of this graph, as one can easily see in Fig. 8(a). The graph signal 2) consists of the most recent eight peak-hour volume counts collected at intersections (i.e., nodes in this graph) where there are traffic signals. The data was typically collected between the hours of 7:30 am and 6:00 pm over the period 03/22/2004-02/28/2018.

From Fig. 8(b), we observe that the qualitative behaviors of these error curves are similar to those for Barbara's eye signal shown in Fig. 4(c). That is, 1) the NGWP best bases outperformed all the others, and the difference between the VM-NGWP and the PC-NGWP is negligible; 2) the global graph Laplacian eigenbasis worked quite well; 3) the bases from the Haar-Walsh wavelet packet dictionaries did not perform well.

In order to examine what kind of basis vectors were chosen to approximate this smooth histogram of street intersections, we display the nine most important VM-NGWP best basis vectors in Fig. 9. We note that these top basis vectors exhibit different spatial scales: the basis vectors #4, #6, and #9 are rather localized, while the basis vectors #2, #3, and #7 are of medium scale, trying to characterize the sharp gradient of the density function toward the most crowded lower middle region. Then, there are coarse scale basis vectors such as #1 and #5.
The basis vector #8 is interesting: it is more oscillatory than the others and tries to characterize the differences within the dense lower middle region.

Now, let us analyze the pedestrian volume data measured at the street intersections shown in Fig. 10(a), which is quite localized around the densest region in the lower middle section of the street graph. Fig. 10(b) shows the approximation errors of the various methods. From Fig. 10(b), we observe the following: 1) the eGHWT best basis clearly outperformed all the other methods; 2) the graph Haar basis and both versions of the GHWT best bases performed relatively well, closely followed by the VM-NGWP best basis; 3) the PC-NGWP best basis was outperformed by the VM-NGWP best basis and the other three bases; 4) the global bases, i.e., the graph Laplacian eigenbasis and the graph Walsh basis, did not perform well.
In order to examine the performance difference between the VM-NGWP best basis and the PC-NGWP best basis, we display their nine most important basis vectors in Figures 11 and 12. We note that the top VM-NGWP best basis vectors exhibit different spatial scales: the basis vectors #2 and #9 are rather localized, while the basis vectors #1, #3, #4, #7, and #8 are of medium scale, trying to characterize the pedestrian data around the most crowded lower middle region. Then, there are coarse scale basis vectors such as #1 and #5. The basis vectors #5 and #6 try to delineate [...]; in other words, basis vectors of various spatial scales were selected as its best basis. On the other hand, the PC-NGWP best basis turned out to be a graph wavelet basis with j = 1, i.e., the subspaces $V_0^{(1)}$ and $V_1^{(1)}$. Hence, the latter has only fine scale basis vectors. This is because the PC-NGWP procedure promotes localization of its basis vectors too strictly at scale j = 1, by using the orthogonal projections of the standard basis vectors of $G$ onto the subspaces spanned by the eigenvector clusters $V_k^{(1)}$ of $G^{\star}$. Since the pedestrian volume data is quite non-smooth and localized, δ-like basis vectors with scale j = 1 in the PC-NGWP tend to generate sparser coefficients, i.e., a small number of large magnitude coefficients with many negligible ones. Therefore, the best-basis algorithm with the ℓ1-norm ends up favoring those δ-like basis vectors in the PC-NGWP dictionary.
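The construction of graph signal 1) analyzed above, i.e., counting for each node the number of nodes lying within a fixed-radius disk, can be sketched as follows; this is a simplified illustration, where `radius` stands in for the longest-edge length used in the text.

```python
import numpy as np

def disk_count_signal(coords, radius):
    """For each node, count the nodes (including itself) within the given
    Euclidean radius: a smoothed histogram of node density with
    overlapping circular bins of equal size."""
    diff = coords[:, None, :] - coords[None, :, :]   # pairwise displacements
    dist2 = (diff ** 2).sum(axis=-1)                 # squared distances
    return (dist2 <= radius ** 2).sum(axis=1)
```

The O(N²) pairwise-distance computation is affordable at N = 2275; for much larger graphs a spatial index (e.g., a k-d tree) would be preferable.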

Discussion
In this article, we proposed two ways to construct graph wavelet packet dictionaries that fully utilize the natural dual domain of an input graph: the VM-NGWP and the PC-NGWP dictionaries. Then, using two different graph signals on each of two different graphs, we compared their performance in approximating a given graph signal against our previous multiscale graph basis dictionaries, such as the GHWT and eGHWT dictionaries, which include the graph Haar and graph Walsh bases. Our proposed dictionaries outperformed the others on piecewise-smooth graph signals, and performed reasonably well for graph signals sampled from an image containing oriented anisotropic texture patterns. On the other hand, our new dictionaries were outperformed by the eGHWT on the non-smooth and localized graph signal. One potential reason for this behavior is that our dictionaries are a direct generalization of the "Shannon" wavelet packet dictionary, i.e., their "frequency" domain support is localized and well controlled while their "time" domain support is not compact. In order to improve the performance of our dictionaries for such non-smooth localized graph signals, we need to bipartition $G^{\star}$ recursively but smoothly with overlaps, which may lead to a graph version of the Meyer wavelet packet dictionary [26, Secs. 7.2.2 and 8.4.2], whose basis vectors are more localized in the "time" domain than those of the Shannon wavelet packet dictionary. In fact, it is interesting to investigate such smooth partitioning with overlaps not only on $G^{\star}$ but also on $G$ itself, since it may lead to a graph version of the local cosine basis dictionary [7], [26, Sec. 8.4.3].
We also note the differences between the VM-NGWP and the PC-NGWP dictionaries: the PC-NGWP may allow one to "pinpoint" a particular node where the basis vectors should concentrate in a more explicit manner than the VM-NGWP does. The latter requires one to examine the computed basis vectors if one wants to know where on the input graph they concentrate. On the other hand, the difference in their computational costs is just a constant factor within $O(N^3)$ operations. Hence, it is important to investigate how to reduce the computational complexity in both cases. One such possibility is to use only the first $N_0$ graph Laplacian eigenvectors with $N_0 \ll N$. Clearly, one cannot represent a given graph signal exactly with $N_0$ eigenvectors, but this scenario may be acceptable for certain applications, including graph signal clustering, classification, and regression. Of course, it is also of interest to investigate whether we can develop faster versions of the varimax rotation and MGSLp algorithms, which forms one of our future research projects.
Jennrich also showed that, under certain general conditions, this algorithm converges to a stationary point from any initial estimate. The above algorithm appears to be the standard varimax rotation algorithm available in many software packages, e.g., MATLAB®, R, etc.
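For completeness, the standard SVD-based varimax iteration, in the generic form found in common statistical packages, can be sketched as follows; this is our illustration of the usual formulation, not the exact code used in the paper.

```python
import numpy as np

def varimax(A, maxit=1000, tol=1e-12):
    """Rotate the columns of A (N x m, full column rank) by an orthogonal
    matrix R that maximizes the varimax criterion; returns A @ R."""
    N, m = A.shape
    R = np.eye(m)
    obj_old = 0.0
    for _ in range(maxit):
        L = A @ R
        # Gradient-related matrix of the varimax criterion at the current R
        G = A.T @ (L ** 3 - L @ np.diag((L ** 2).sum(axis=0)) / N)
        U, s, Vt = np.linalg.svd(G)
        R = U @ Vt                      # projection onto the orthogonal group
        obj = s.sum()
        if obj - obj_old < tol * max(obj, 1.0):
            break
        obj_old = obj
    return A @ R
```

Each iteration increases the varimax objective, and the SVD step keeps R exactly orthogonal throughout.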
[Algorithm 1 input: a full column rank matrix $A \in \mathbb{R}^{N \times m}$ ($m \le N$) whose columns are to be rotated; maximum number of iteration steps maxit (default value: 1000); relative tolerance tol (default value: 1e-12).]

We implemented a simplified version (i.e., Algorithm 2) of the modified Gram-Schmidt with mixed ℓ2-ℓp (0 < p < 2) pivoting algorithm in [5]. Our version skips the step of computing the largest ℓ2 norm and picking the parameter λ (a notation used in [5]) in order to increase numerical stability. Instead, we directly set up a tolerance parameter, i.e., tol in Algorithm 2, for robustness. On the other hand, we keep the ℓp (0 < p < 2) pivoting portion of MGS (i.e., we always orthogonalize the vector with the minimum ℓp-norm in the candidate pool), which nicely preserves the sparsity of the obtained wavelet-like vectors after the orthogonalization process. The MGSLp algorithm is summarized as follows.
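A minimal sketch of the simplified MGS with ℓp pivoting described above (our illustration of the idea, not the authors' Algorithm 2): at each step the remaining candidate with the smallest ℓp quasi-norm is orthogonalized against the vectors already accepted, and numerically dependent candidates are dropped via the tolerance.

```python
import numpy as np

def mgslp(A, p=1.0, tol=1e-12):
    """Orthonormalize the columns of A by modified Gram-Schmidt, always
    pivoting on the remaining column with the smallest lp quasi-norm
    (0 < p < 2), which tends to preserve sparsity."""
    cols = [A[:, j].astype(float).copy() for j in range(A.shape[1])]
    Q = []
    while cols:
        lp = [np.sum(np.abs(v) ** p) for v in cols]   # lp^p: same argmin as lp
        v = cols.pop(int(np.argmin(lp)))
        norm = np.linalg.norm(v)
        if norm < tol:                                # drop dependent vectors
            continue
        q = v / norm
        Q.append(q)
        cols = [w - (q @ w) * q for w in cols]        # MGS update of the pool
    return np.column_stack(Q)
```

With p = 1 this realizes the ℓ1 pivoting used in our experiments; the tol test replaces the λ-based safeguard of [5].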