Pseudoinverse Graph Convolutional Networks: Fast Filters Tailored for Large Eigengaps of Dense Graphs and Hypergraphs

Graph Convolutional Networks (GCNs) have proven to be successful tools for semi-supervised classification on graph-based datasets. We propose a new GCN variant whose three-part filter space is targeted at dense graphs. Examples include Gaussian graphs for 3D point clouds with an increased focus on non-local information, as well as hypergraphs based on categorical data. These graphs differ from the common sparse benchmark graphs in terms of the spectral properties of their graph Laplacian. Most notably we observe large eigengaps, which are unfavorable for popular existing GCN architectures. Our method overcomes these issues by utilizing the pseudoinverse of the Laplacian. Another key ingredient is a low-rank approximation of the convolutional matrix, ensuring computational efficiency and increasing accuracy at the same time. We outline how the necessary eigeninformation can be computed efficiently in each applications and discuss the appropriate choice of the only metaparameter, the approximation rank. We finally showcase our method's performance regarding runtime and accuracy in various experiments with real-world datasets.


Introduction
One of the central tasks in data science applications is extracting information from relational data encoded in graphs. The umbrella term Graph Neural Networks (GNNs) comprises neural models that seek to combine the theoretical understanding of structured data with the flexibility of machine learning. An outstandingly successful class of GNNs relies on spectral convolution of features along the graph edges (Bruna et al., 2014;Defferrard et al., 2016). These Graph Convolutional Networks (GCNs) have become particularly popular through their success in semi-supervised node classification tasks (Kipf and Welling, 2017), where methods aim to benefit from both a small set of training data and clustering information extracted from a large amount of unlabeled data.
In this work, we consider a special type of graph that is particularly dense, i.e., any node is connected to most other nodes. This structure occurs in selected applications where each node of the graph describes a real-world entity but the edges are constructed artificially based on specific node features. We here cover two examples of such constructed graphs. First, a Gaussian kernel function can be used to generate edge weights for fully connected graphs based on spatial node features, e.g., for three-dimensional point clouds as created by LiDAR scans (Nguyen and Le, 2013). A localization parameter determines how fast the weights decay with the spatial distance, which can be understood to control the density of the graph. Up to now, only approaches with sparse k-nearest neighbor graphs have been proposed, but these are always associated with the loss of non-local information encoded within the large number of edges with lesser weights. We will show that indeed increasing the density improves our prediction performance. This phenomenon is also well-known in other related fields like computer vision (Coll and Morel, 2005;Tao et al., 2018) and image processing (Gilboa and Osher, 2008). For a second example, hypergraphs provide a natural extension of graph learning that can be used easily for categorical data (Bretto, 2013). For the purpose of many successful methods, these hypergraphs are equivalent to specific dense graphs with exploitable structure. Both these types of constructed graphs consist of edges that do not directly represent real-world connections but still encode valuable information about the dataset.
The density of constructed graphs has far-reaching consequences on the performance of GNNs. Graph learning has traditionally been targeted at purely data-inherent graphs where not only the nodes describe real-world entities, but also every single edge represents a real-world connection. Since the number of edges per node is often limited by real-world factors independent of the network size, the average node degree in these graphs is asymptotically constant, the total number of edges grows linearly with the network size, and the adjacency matrix is sparse. When moving to dense graphs, the increased computational cost and storage requirements might be expected to be a decisive issue. That is however often not the case, as the special structure of the constructed adjacency matrices can be exploited to speed up the relevant algorithms. The true problem is posed by the intrinsically different spectral properties of the dense graph Laplacian. It is well known in graph theory that the smallest eigenvalue λ 0 is always zero and the second smallest eigenvalue λ 1 gives a measure of how close the graph is to not being connected (Bauer and Jost, 2009). We call λ 1 the graph's eigengap as the difference between the first and second eigenvalue. The eigenvectors corresponding to the first nonzero eigenvalues are known to contain clustering information. Most common GNNs reinforce this clustering information through their feature maps. In spectral approaches, this is achieved explicitly via convolution with a spectral filter designed specifically for that purpose. However, denser graphs empirically have much larger eigengaps and almost all eigenvalues are clustered close to 1 (Bauer and Jost, 2009). For that reason, the informative eigenvectors are considerably harder to extract by existing filters as well as spatial feature maps. Common GNN architectures hence tend to underperform on dense graphs while our method embraces density and its spectral properties.
In this setting, the present work offers the following main contributions: -We give motivation for spectral filters that combine a zero-impulse part and an inverse part to overcome the issues related with large eigengaps. In order to make our approach computationally efficient, we add a highpass part and employ low-rank approximations to the pseudoinverse by computing a small number of eigenvalues of the graph Laplacian. -We propose a Graph Convolutional Network architecture with a threedimensional filter function space that represents learning our filter parameters in training. We discuss computational aspects such as asymptotic cost and parameter influence. -We consider two examples for applications where beneficial dense graphs can be constructed. We discuss how the intrinsic structure of these graphs can be exploited to speed up our setup. -We showcase the performance of our method in various experiments, comparing it to recent popular GNNs.

Related work
Graph Neural Networks. Neural networks have been used for learning on graphs for many years and the vast variety of recent methods and extensions can best be studied from the dedicated review articles (Bronstein et al., 2017;Wu et al., 2019;Zhang et al., 2019). As for methods with a particular connection to our work, we would like to single out Graph Diffusion Convolution (GDC; Klicpera et al., 2019) and ARMA filters (Bianchi et al., 2019). These methods also perform convolution with non-linear spectral filters, in GDC even nonlocally. However, their filters are not specifically designed towards boosting clustering information in dense graphs.
GNN techniques for dense graphs. To the best of our knowledge, the challenges posed by dense graphs have rarely been addressed in the context of GNNs. In GDC (Klicpera et al., 2019), however, a graph is transformed into a dense graph whose adjacency matrix incorporates diffusion information. This dense graph is then sparsified by a k-Nearest Neighbor approach, which is argued to both address runtime issues and improve prediction accuracy empirically.
This sparsification process can equally be applied to general dense graphs, though it has to be noted that non-local information will be lost. Other techniques for increased efficiency like, e.g., node sampling (Hamilton et al., 2017;Chen et al., 2018) are explicitly targeted at large, sparse graphs.
Semi-supervised classification on hypergraphs. Hypergraph generalizations of the graph Laplacian operator have been introduced in competing ways. A definition based on the clique expansion graph (Zhou et al., 2006) has been used for various approaches based on energy minimization (Hein et al., 2013;Bosch et al., 2018) as well as Hypergraph Neural Networks (Feng et al., 2019), a natural generalization of GCNs. A similar work targets hypergraphs constructed from graphs (Bai et al., 2019). Yadati et al. (2019) quote density issues as motivation to avoid the clique expansion graph and instead propose HyperGCN using the Laplacian definition as a nonlinear diffusion operator introduced by Chan et al. (2018). None of these works address the Laplacian structure resulting from categorical data.
Inverse Laplacians in Data Science. Herbster et al. (2005) use the pseudoinverse of the graph Laplacian for online learning on graphs. Otherwise, mainly inverses of shifted Laplacians can be found in the literature. On multi-layer graphs, higher negative powers of shifted Laplacians can be used to form a Power Mean Laplacian (Mercado et al., 2018). In the context of GNNs, the approximated inverse of a shifted Laplacian is the heart of the diffusion operator for Personalized Page-Rank in GDC (Klicpera et al., 2019). In a broader sense, it is also an example of rational filtering, which is the basis of ARMA networks (Bianchi et al., 2019).

Problem setting and terminology
Throughout this paper, we assume that we are given an undirected weighted graph whose n nodes form the samples of the dataset, and each sample is associated with a d-dimensional feature vector. We assume that the graph is connected and non-bipartite, which is trivially fulfilled in the setting of dense graphs. On this data we aim to perform semi-supervised node classification. The goal is to assign one of m classes to each sample in such a way that nodes with a strong connection in the graph are likely to belong to the same class. For a small subset of samples called the training set, the true class is known a priori. The graph can be described mathematically by its weighted adjacency matrix A ∈ R n×n , where A ij holds the weight of the edge between nodes i and j. This is set to 0 if the nodes are not connected. The degree matrix D ∈ R n×n is defined as the diagonal matrix holding the node degrees D ii = n j=1 A ij . Together they can be used to create the symmetrically normalized graph Laplacian matrix where I is the identity matrix. L sym is known to encode many useful clustering properties of the graph (von Luxburg, 2007).
Spectral graph theory now utilizes the eigenvalue decomposition of the graph Laplacian. Let (λ i , u i ) be the eigenpairs of L sym for i = 0, . . . , n − 1 such that L sym u i = λ i u i and u i 2 = 1. Since L sym is symmetric, we have It is known that the smallest eigenvalue is always 0 with multiplicity one since there is only one connected component, and the largest eigenvalue is less than 2 since the graph is non-bipartite (von Luxburg, 2007). We assume the eigenvalues to be sorted increasingly, 0 = λ 0 < λ 1 ≤ . . . ≤ λ n−1 < 2. The corresponding eigenvector to λ 0 = 0 is u 0 with entries equal to the square roots of the node degrees, Convolution on graphs. The generalization of Fourier transforms and convolution from regular grids to irregular graphs is a central result of spectral graph theory (Shuman et al., 2013). Let x ∈ R n be a spatial graph signal, i.e., x i describes the magnitude of some quantity on node i. Then the function x(λ j ) = u T j x, whose domain is the spectrum of the graph Laplacian, is the result of the graph Fourier transform. Its inverse is x = n−1 j=0x (λ j )u j . In the spectral space, convolution with another spectral element ϕ is simply pointwise multiplication,ŷ(λ j ) = ϕ(λ j )x(λ j ). Note that ϕ must be a real-valued function defined on the eigenvalues λ j , but for simplicity, it is commonly defined on the whole interval [0,2]. Put together, the spatial representation of the convolution of a spatial signal x with a spectral filter ϕ turns out to be y = U ϕ(Λ)U T x, where ϕ(Λ) denotes the diagonal matrix holding the function values of ϕ in the diagonal elements of Λ. The operator K = U ϕ(Λ)U T is sometimes called the convolutional matrix associated with ϕ.
Absence of loops. Some popular GNN methods preprocess the graph by adding loops (edges with the same start and end node) with a uniform weight. For GCN, this has been interpreted as a re-normalization trick (Kipf and Welling, 2017). Besides that interpretation, the self loops empirically lead to slightly reduced eigengaps and hence better performance of the traditional methods that benefit from small eigengaps. This is because the normalized weight of non-loop edges is decreased, making the graph slightly sparser from a spectral point of view. Since this behaviour is not required in our method, we generally do not use loops.
2 Proposed architecture

Pseudoinverse spectral filter functions
Many popular machine learning methods involve repeated multiplication of a feature matrix with some kind of adjacency matrix. The most common form of GCN uses the normalized adjacency matrix,Â = D −1/2 AD −1/2 , as the convolutional matrix with the spectral filter ϕ(λ) = 1 − λ (Kipf and Welling, 2017). If all non-zero eigenvalues are close to 1, then ϕ(λ) is close to zero for all eigenvalues except the first one. HenceÂx = U ϕ(Λ)U T x will be dominated by the first eigenvector u 0 for most x, i.e.,Âx will almost be a multiple of u 0 and close to orthogonal to all other u i (i > 0). This can be seen via the inner product Since all entries of u 0 have the same sign, this makes it hard for the network output to contain meaningful clustering information.
In order to overcome the issues of multiplying with the adjacency matrix, we would like to introduce the pseudoinverse of the graph Laplacian, denoted by L † sym . This has been used, e.g., for online learning (Herbster et al., 2005). At the same time, L † sym is also the convolutional matrix of the spectral filter function (cf. Golub and Van Loan, 1996) This function is decreasing on the nonzero eigenvalues, so low-frequency signals are reinforced and high-frequency signals are damped. Because of ϕ(0) = 0, the first eigenvector u 0 is completely removed from the output of the convolution L † sym x, i.e., that output will always be orthogonal to u 0 . The problematic behaviour described above is hence completely avoided by the pseudoinverse filter.
However, this is a double-edged sword, since often some limited presence of u 0 may be beneficial. Hence we propose custom filters that allow for the combination of the pseudoinverse approach with added u 0 . This can be achieved by filter functions of the form where the parameters α, β can either be assigned manually or learned. The eigengap λ 1 appears in the pseudoinverse part as a normalization factor. The corresponding convolutional matrix is

Low-rank approach
In most scenarios we cannot compute the full eigenvalue decomposition of L sym . Instead, we compute only the r + 1 smallest eigenvalues and replace the term β/λ in (2) with a third parameter γ for all λ > λ r . This means switching to a high-pass filter for higher frequencies, leading to the filter function and the corresponding convolutional matrix denote the matrices holding the second through (r + 1)-st eigenvalues and corresponding eigenvectors in an ascending order. Note that U r Λ −1 r U T r is the best rank-r approximation to L † sym , i.e., which can be proven following the argumentation by, e.g., Horn and Johnson (1985, Section 7.4.2). On the one hand, this low-rank approach is necessary to avoid computing the full eigenvalue decomposition. On the other hand, it also has spectral benefits. In spectral graph theory for clustering applications, the smallest, but nonzero eigenvalues λ 1 , λ 2 , . . . are called the informative eigenvalues. Their quantity depends on the graph. The corresponding eigenvectors contain clustering information in the sense that if two nodes have a strong connection in the graph, the corresponding entries in the eigenvector are likely to be similar, especially in terms of their sign. If r is chosen appropriately, this low-rank filter can be argued to perform pseudoinverse convolution on the informative part of the spectrum, while the non-informative eigenvectors -especially noiseare damped uniformly. Hence, the low-rank approximation may very well have positive effects on the accuracy. Since there is no hard boundary between informative and non-informative eigenvalues, there is a wide range of ranks that can yield these benefits.

Pseudoinverse filters in Graph Convolutional Networks
The general idea of convolutional feature maps in GNNs is that each column y j of an output matrix Y ∈ R n×N1 is obtained by convolving each column x i of an input matrix X ∈ R n×N0 with its own learned filter ϕ ij and then summing up the convolution results. The learned filters are restricted to a given function space, which is arguably the decisive property of each convolutional GNN variant. Let that filter space be K-dimensional and spanned by the basis functions ϕ (1) , . . . , ϕ (K) . Then the coefficients of ϕ ij in the given basis are the trained layer parameters, denoted by W ij . The individual filter functions are given as By organising the parameters in matrices W (k) ∈ R N0×N1 , this leads to the formula (Bruna et al., 2014;Defferrard et al., 2016) and by putting together the full input feature matrix X with columns x i and the output feature matrix Y with columns y j , we obtain the feature map We now aim to use a small filter space that contains the proposed pseudoinverse filters. One possibility is to choose the parameters in (3) manually and set the resulting ϕ α,β,γ,r as the only basis function for a one-dimensional filter space. That requires a-priori identification of the parameter impact on desirable behaviour, which differs from dataset to dataset. For that reason, this approach is infeasible in practice. Instead, we only fix r in (3) and have the other parameters learned in training, which means using the filter basis functions This means that the individual filter functions (4) are equal to (3) ij . The corresponding convolutional matrices can be set up as This feature map can now be embedded in a classical GNN architecture. We follow the traditional GCN setup for semi-supervised classification in that we use two layers each consisting of the convolutional feature map with an added bias and ReLU activation between the layers. The numbers of channels before and after each layer are given by the feature dimension d, the hidden width hyperparameter h, and the number of classes m. Let X (0) ∈ R n×d be the input feature matrix. Extending the notation by an index for the layer, we get the propagation scheme (10) Here σ is the ReLU function applied element-wise and W (1,k) ∈ R d×h , b (1) ∈ R h , W (2,k) ∈ R h×m , and b (2) ∈ R m are the trainable network parameters (k = 1, . . . , K). The addition of the biases b (l) to the matrices K k=1 K (k) X (l−1) W (l,k) is understood row-wise. i.e., b (l) is added as a row vector to each row of the former matrix (l = 1, 2).

Computational aspects
Setup. In order to assemble the convolutional matrices, we only need to compute the r + 1 smallest eigenvalues of L sym . This can be done by efficiently computing the largest eigenvalues of the signless Laplacian via the state-ofthe-art Krylov-Schur method (Stewart, 2002). Since the eigenvector to λ 0 = 0 is known a priori via (1), we can use Wielandt deflation to remove the eigengap and significantly accelerate the method (Saad, 2011, Chapter 4.2). Put together, the system matrix of the eigenvalue computation is If µ 0 ≥ . . . ≥ µ r−1 are the r largest eigenvalue of that matrix, then the nonzero eigenvalues of L sym can be recovered via λ i = 2 − µ i−1 for i = 1, . . . , r.
The corresponding eigenvectors are the same. This way, the asymptotic setup cost amounts to the number of Krylov iterations times the cost of one matrixvector product, which is O(n 2 ) in the worst case but may be significantly less if we are able to exploit any special problem-dependent structure. The number of required iterations, on the other hand, depends on the desired rank r.
Asymptotic cost of layer operations. Similar to L † sym , the convolutional matrices K (k) from (9) will in general not be sparse. However, we do not have to store and apply the full matrices, but rather exploit their low rank by keeping them in their factorized form and computing the feature map (5) via This way, the asymptotic cost of a single feature map is O(N 0 nr), where N 0 is the number of input features. Note that in general, multiplications with the dense adjacency matrix are already in O(n 2 ).
Choice of rank. As stated in Section 2.2, the target rank r should be chosen roughly as the number of informative eigenvalues. However, higher ranks may have additional benefits. Typically, the rank choice for low-rank approximations comes down to a trade-off between accuracy and runtime. An alternative is to view r as a meta-parameter to be determined via cross validation. We investigate its influence in Section 4.6. Note that it is very common for GNNs to depend on a few parameters that control some level of approximation.

Fast setup in special cases
Certain application settings allow us to exploit intrinsic structure of the adjacency matrix to speed up the required eigenvalue computations, as we discuss now.

Three-dimensional point clouds
One source of dense graphs are collections of points in three-dimensional space, called point clouds, which may be produced by LiDAR scans or other applications. We will again consider the task of semi-supervised point classification, i.e., taking a single partly-labeled point cloud and predicting the missing labels.
Other important tasks associated with this type of data are supervised point cloud segmentation and classification, which both revolve around transferring knowledge from fully labeled point clouds to new unlabeled point clouds.
One popular approach to working with this data, especially in robotics, is to turn the point cloud into a graph, e.g., by a k-Nearest Neighbor (KNN) setup (Nguyen and Le, 2013;Golovinskiy and Funkhouser, 2009). A promising alternative is to form a fully connected graph with Gaussian edge weights, leading to an adjacency matrix with entries for all i = j, where the x i ∈ R 3 denote the point coordinates and σ > 0 is a localization parameter. This graph setup is often used for Spectral Clustering (Ng et al., 2002). A smaller value for σ leads to a sparser graph. However, it may be beneficial for the graph to incorporate more non-local information by means of a larger value for σ, which leads to a dense graph. In this setting, the pseudoinverse assembly can be accelerated considerably. The smallest Laplacian eigenvalues and corresponding eigenvectors can be approximated accurately and efficiently by exploiting a fast summation scheme based on the Non-Equispaced Fast Fourier Transform (NFFT; see Alfke et al., 2018). This method has the remarkable property that it avoids assembling the n 2 adjacency entries altogether and that its computational effort for computing a small number of eigenvalues only scales linearly in n instead of the usual quadratic behaviour. By combining the NFFT algorithm with our low-rank approximation scheme, the computational effort is significantly reduced, especially for large n. Even though the amount of similarity information we look at is in O(n 2 ) and we do not discard any structure, the total complexity of our method with constant r scales only like O(n). For all details on the NFFT method, we refer to Alfke et al. (2018).

Hypergraphs for categorical data
While the edges of traditional graphs connect exactly two nodes with each other, the hyperedges of a hypergraph may connect any number of nodes (Bretto, 2013). Hypergraphs are most commonly described by their incidence matrix H ∈ {0, 1} n×|E| , where |E| denotes the number of hyperedges and H ie = 1 if and only if node i is a member of hyperedge e. Optional hyperedge weights can be given in a diagonal matrix W E = diag(w 1 , . . . , w |E| ). In addition to the node degree matrix D ∈ R n×n with entries D ii = |E| e=1 H ie w e , we also set up the hyperedge degree matrix B ∈ R |E|×|E| with entries B ee = n i=1 H ie .
The hypergraph Laplacian operator. The hypergraph Laplacian can be defined in multiple ways. We will use the linear definition introduced by Zhou et al. (2006), given as This is identical to the graph Laplacian of a classical graph with weighted adjacency matrix HW E B −1 H T , which is referred to as the clique expansion of the hypergraph. This graph contains a specific set of loops and is in most applications dense or even fully connected. As a consequence, naive computations with L sym may become quite expensive. This problem also affects Hypergraph Neural Networks (Feng et al., 2019), which essentially apply the standard GCN architecture (including self loops) to the clique expansion graph.
Efficient techniques for the special case. One application of hypergraphs is categorical data where each sample is described by a few categorical attributes. We can simply construct one hyperedge for each possible value of an attribute, connecting all the samples which share that particular attribute value. This leads to the number of hyperedges being significantly smaller than the number of samples, |E| ≪ n. For other automatically generated hypergraphs, there is precedence for the benefits of generating fewer, larger hyperedges as well (Purkait et al., 2017). In this special case, the Laplacian definition directly exhibits a useful structure. The matrix subtracted from the identity in (13) has rank |E|, so L sym is a linear combination of the identity and a low-rank matrix and can be written as Assume thatH has full rank |E| and that its thin singular value decomposition is given byH whereŨ ∈ R n×|E| andṼ ∈ R |E|×|E| have orthogonal columns andΣ ∈ R |E|×|E| holds the singular values on its diagonal. Then n − |E| eigenvalues of L sym are 1 and the remaining eigenvalues are given byΛ = I −Σ 2 . Consequently, the exact pseudoinverse of L sym is the identity plus a matrix of rank |E| and has the structure For our low-rank approach, this means that we cannot choose r > |E| − 1 because that would require singling out a few eigenvectors to the eigenvalue 1, which are indistinguishable. However, computing the first r + 1 ≤ |E| eigenvalues of the hypergraph Laplacian becomes much cheaper, since we only need the singular value decomposition ofH or equivalently the eigenvalue decomposition of the |E| × |E| matrixH TH . Thus the setup cost is significantly reduced. Furthermore, we can recreate the full-rank filter (2) within the lowrank setting of (3) by fixing r = |E| − 1 and γ = β.

Network architecture and training setup
Our code is available online, 1 implemented in Python using PyTorch and Py-Torch Geometric (Fey and Lenssen, 2019). The input features of each dataset are propagated through the network architecture (10) with the hidden width set to h = 32 in all experiments. In the first layer, the products K (k) X (0) are precomputed as proposed by Chen et al. (2018), and in the second layer we employ the efficient scheme from (11). Afterwards, the rows of the output X (2) are transformed via the softmax function to gain the predicted class probabilities for each sample. The parameters are trained using the average cross entropy loss of the predicted probabilities for the true class of the training samples. We use the Adam optimizer (Kingma and Ba, 2015) with learning rate 0.01. During training, we use a dropout rate of 0.5 between the layers. For the weight matrices W (l,k) , we use Glorot initialization (Glorot and Bengio, 2010) and a weight decay factor of 0.0005, while for the bias vectors b (l) , we use zero initialization and no weight decay. For each run, we set a fixed seed for random number generation, build the model, run 500 training epochs, and finally evaluate the classification accuracy on the non-training samples. We generally perform 100 of these runs for each experimental setting and report averages and standard deviations. Only in a few slow baseline experiments did we reduce the number of runs to 10. All experiments are run on a laptop with an Intel Core i7 processor and an NVIDIA GeForce GTX 950M.

Baselines
We refer to our own method as PinvGCN in plots and tables, where we compare it against the following methods on graphs: -GCN (Kipf and Welling, 2017), using the implementation from PyTorch Geometric. -GraphSAGE (Hamilton et al., 2017), using the implementation from Py-Torch Geometric with mean aggregation. We do not use neighbor sampling since that is designed for node batches in large datasets and it did not improve the results in any of our tests. -GDC (Klicpera et al., 2019), using the implementation from PyTorch Geometric with α = 0.05 and top-64 sparsification. Since the datasets where too large for exact matrix inversion, we needed to use the "inexact" version, which is not supported in the original code published with the paper. 2 -ARMA (Bianchi et al., 2019), using the implementation from PyTorch Geometric with parameters K = 3 and T = 2.
On hypergraphs we additionally tested the following two methods: -HGNN* (Feng et al., 2019), which we mark with an asterisk because we use our own implementation that exploits the structure of (14) in a similar way to our own method, significantly speeding up the runtime without changing the output.

Semi-supervised point classification in 3D point clouds
In order to illustrate our method's superior performance on very dense graphs, we employ two 3D point clouds as described in Section 3.1. We use a part of a subset of the popular Oakland dataset as used by Munoz et al. (2009). We obtained the subset of the Oakland dataset from the project website. 4 The dataset consists of multiple point clouds, two of which are used for training and validation in the original setting. Since their original usage is different from our own training splits, we only refer to the clouds as Oakland 1 (original training cloud) and Oakland 2 (original validation cloud). The remaining test clouds are unused. This data is usually used for 3D point segmentation, where the task is to transfer knowledge from one cloud to other clouds (Nguyen and Le, 2013). Instead, we look at the two clouds independently and perform 5-class semi-supervised point classification on each of them by splitting each cloud into its own sets of training points and test points. We use 100 training points from each of the five classes. For the input feature matrix X (0) , we use the original 3D coordinates. Dataset information is given in Table 1, where the diameter denotes the maximum Euclidean distance between two points in the cloud.
For eigenvalue computation, we use the fastadj Python implementation 5 of the NFFT-based fast summation scheme (Alfke et al., 2018) with default parameters, combined with the Krylov-Schur algorithm with tolerance 10 −3 . These settings are chosen to give fast, rough approximations of the eigenvalues because we found that higher quality did not have any influence on the PinvGCN results.
We conduct experiments for σ ∈ {10, 100} in the Gaussian function (12), where the larger value amounts to a stronger inclusion of non-local information. Our experimental results are shown in Table 2. Since the baselines are not designed for such dense graphs and would have exploding time and memory requirements on the fully connected graph, we only employ them on a k-NN subgraph of the k nearest neighbors, where k is either 10 or 100. For k = 100, ARMA ran out of memory on both datasets and we stopped GDC when the first run was not finished after 10 hours, which is why these baselines are not included. Note that it would be possible to employ the GCN baseline on the  fully connected graph by utilizing the same NFFT-based fast summation, but providing such an implementation was beyond the scope of our experiments.
To summarize the results, our method produces accurate predictions in reasonable time, while no baseline produces satisfactory results. The fact that our accuracy is substantially better for σ = 100 shows that our method greatly profits from non-local information in non-sparse Laplacians. As we see in Table 1, adding non-local information via a larger value of σ results in increasing λ 1 , confirming the connection between spatial and spectral properties.

Hypergraph datasets from categorical attributes
We finally employ our method on three hypergraphs based on the Mushroom and Covertype datasets from the UCI machine learning repository (Dua and Graff, 2017), which are common benchmarks for semi-supervised classification on hypergraphs (Hein et al., 2013;Yadati et al., 2019). The Mushroom dataset 6 contains 8124 samples from two classes, described by 21 categorical attributes (ignoring one with missing values). For each attribute, we create as many hyperedges as there are attribute values present, where each hyperedge connects all samples with a specific value.
The Covertype dataset 7 contains 581012 samples from 7 classes, described by 10 continuous and 44 binary attributes. We follow the setup process from (Hein et al., 2013), dividing the value range of each continuous attribute into 10 equally sized bins and creating hyperedges that connect all samples with values in the same bin. For each binary attribute, we only create one hyperedge for those samples with a "true" value. All hyperedges have weight w e = 1. Afterwards, we create two subhypergraphs by using only samples from classes 4 and 5, or 6 and 7. Because we remove all hyperedges with less than two nodes, the resulting hypergraphs have less than the original 144 hyperedges.
As in (Yadati et al., 2019), we use the hypergraph incidence matrix as input features for all three data sets, X (0) = H. Table 3 gives the full hypergraph specifications as well as their smallest nonzero Laplacian eigenvalue.
Classification results on all three datasets are listed in Table 4. We only employ our Pseudoinverse GCN with the maximum rank r = |E| − 1, as will be supported in Section 4.6. For the graph-based methods, we use KNN sparsification of the clique expansion graph, so each node is connected to the k = 10 other nodes with highest shared hyperedge membership. On the Mushroom hypergraph, our results are in the same order of magnitude as the graph baselines, which is better than the other hypergraph methods, but the ARMA network gives superior results. The graph methods perform remarkably well considering the fact that this process of clique expansion sparsification has, to our knowledge, never been discussed in the hypergraph literature. On both Covertype datasets, however, our Pseudoinverse GCN yields the best accuracies.
In addition, we also emulated the full-rank pseudoinverse filter (2) via fixing β = γ in (3) with r = |E| − 1. This method is named PinvGCN without highpass in Table 4, but it produces significantly worse results. This shows that keeping the high-pass filter (8) as a separate basis function is still beneficial even if the approximation is not required for efficiency.
For further investigation, Figure 1 presents the dependency of the misclassification rate (i.e., the complement of the accuracy, here plotted logarithmically) on the number of random training samples. The plots confirm the results of Table 4 in principle. A remarkable finding is that our method has a significantly better convergence rate on the Covertype45 hypergraph.

Limitations: Sparse graph datasets
For comparison and transparency, we also apply our method to the standard citation networks Cora, Citeseer, and Pubmed. These graph datasets are typical benchmarks for GNNs. However, they all are examples of sparse graphs that are exactly the opposite of what our method is designed to handle. One drastic drawback of our method is that it is only applicable to the largest connected component (LCC) of each graph because it relies on the multiplicity of the Laplacian eigenvalue 0 being one. Unsurprisingly, our method produces poor results on these datasets and it does not come close to the performance of a simple GCN (Kipf and Welling, 2017). Network statistics and full results are given in Table 5.

Rank dependency
Since the target rank r is the only metaparameter of our method, it is important to discuss its impact. As is typical for low-rank approximations, the choice of rank is subject to a trade-off between runtime and accuracy. Figure 2 depicts the development of the misclassification rate (plotted logarithmically) and runtime. We note that the accuracy depends nontrivially on the rank. For almost all datasets, the best performance is achieved with the highest tested rank, but the dependency is not monotonous. Especially the hypergraphs show behavior where increasing the rank over a short range slightly deteriorates the accuracy. The point clouds are closer to monotony in this regard. However, the method appears to be very robust with respect to the choice of r, as long as it is not too small (r < 20).
The runtime development, on the other hand, depends on the exploited structure of the adjacency matrix. The total effort is governed by the setup cost, while the cost of layer operations does not seem to have a big impact. For point clouds, the number of computed eigenvalues influences the number of Krylov-Schur iterations, which leads to an almost linear dependency on r. For hypergraphs, the SVD of the normalized adjacency matrix is cheap enough to avoid any increase in runtime, leading to almost constant times.
For the best performance, we recommend choosing the maximum r = |E|− 1 on hypergraphs without worrying about this parameter. For point clouds, we encourage choosing r as large as possible while keeping a practically feasible computational cost. . By forming the averages of the absolute weight entries, we can quantify the importance of each filter part for the trained network. To account for the different weight matrix sizes in each layer, we use the formula 1, 2, 3). (17) These numbers are furthermore averaged over all 100 runs. Table 6 lists these values for multiple PinvGCN instances. We clearly see that the pseudoinverse part consistently is the most important filter basis function, which supports the notion that these eigenvectors carry the most clustering information. The weights of the other two parts are smaller, but not by a large margin, which supports the intuition that the other eigenvectors are still beneficial for the classification result. At the same time, we observe that the entry ratios differ quite a bit between datasets, which implies that is indeed hard to manually choose parameters for one suitable filter function (3) a priori.

Conclusion
We here presented Pseudoinverse GCN, a new type of Graph Convolutional Network designed for dense graphs and hypergraphs. The feature maps are based on a novel three-part filter space motivated by a low-rank approximation of the Laplacian pseudoinverse. The method yielded strong experimental results in a setting where popular GNNs struggle. A further advantage of our method is the robustness with respect to its only parameter. Future work might include extensions towards supervised 3D point cloud segmentation.