1 Introduction

Clustering unlabeled data is essential for identifying distinct groups in exploratory data analysis across diverse fields. Selecting the right clustering approach and determining the optimal number of clusters are critical challenges, since applying clustering techniques can be complicated by dataset intricacies and the need to choose suitable algorithms and parameters. Clustering methods have been applied in many contexts (Aref et al. 2023), such as image segmentation (Jalagam et al. 2020; Nameirakpam et al. 2015; Tung et al. 2010), object and character recognition (Leibe et al. 2006; Nguyen et al. 2020), document retrieval (Kurland 2014; Djenouri et al. 2018), pattern classification (Diday et al. 1981; Ezhilmaran and Vinoth Indira 2020), data mining (Rajan 2015), and others.

According to Filippone et al. (2008), clustering techniques can be divided into two categories: hierarchical and partitional. The former finds structures that can be recursively divided into substructures, yielding a hierarchy of groups known as a dendrogram. In this case, an important aspect is the establishment of a stopping condition, which determines when further division of subgroups is unnecessary; this condition helps prevent over-segmentation of the data, ensuring that clusters remain meaningful and interpretable. Partitional clustering methods, in contrast, seek a single partition of the data, without the nested sub-partitions produced by hierarchical algorithms, and are often based on the optimization of an appropriate objective function.

An interesting approach to clustering tasks, such as community detection in networks, is to use algorithms that rely on concepts from Spectral Graph Theory (SGT). These methods usually employ eigenvectors associated with eigenvalues of some matrix that represents a graph, which are then used to cluster a set of points. The first paper to suggest constructing graph partitions based on the eigenvectors of the adjacency matrix was authored by Donath and Hoffman (1973). As defined in that work, graph partitioning is the problem of dividing the nodes of a graph into a given number of disjoint subsets such that the number of nodes in each subset is less than a given bound, while the number of cut edges, i.e., edges connecting nodes in different subsets, is minimized.

Fiedler (1973, 1975) found that the bipartitions of a graph are closely linked to the eigenvector associated with the second smallest eigenvalue of the graph Laplacian, called the Fiedler vector, and suggested using this eigenvector to partition the graph's node set. This method is called Spectral Bisection (SB) and partitions the graph into two sets according to the magnitudes (or signs) of the Fiedler vector components. Moreover, Fiedler (1975) laid the groundwork for spectral clustering by discussing the connectedness of subgraphs obtained through SB, emphasizing the importance of the Fiedler vector in the case where this eigenvector does not contain any zero components. Subsequent research by Qiu and Hancock (2006), Spielman and Teng (2007), and Urschel and Zikatanov (2014) has further refined spectral clustering algorithms, extending their applicability to a wider range of graph structures and ensuring the preservation of connectivity within subgraphs. In the last decades, there has been a growing interest in spectral clustering algorithms, mainly because of their efficiency and mathematical elegance. The success of spectral clustering is especially based on the fact that it does not make strong assumptions about the form of the clusters.
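As a small numerical illustration (ours, not drawn from the cited works), the algebraic connectivity and Fiedler vector of a graph can be obtained from its Laplacian; for the path on three nodes, the Laplacian eigenvalues are 0, 1, and 3, so the Fiedler value is 1.

```python
import numpy as np

# Path graph P3: nodes 0 - 1 - 2
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
D = np.diag(A.sum(axis=1))   # degree matrix
L = D - A                    # combinatorial Laplacian

# eigh returns eigenvalues of a symmetric matrix in ascending order
eigenvalues, eigenvectors = np.linalg.eigh(L)

fiedler_value = eigenvalues[1]        # algebraic connectivity
fiedler_vector = eigenvectors[:, 1]   # Fiedler vector
print(round(fiedler_value, 6))        # 1.0 for the path P3
```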

The motivation for this work was to develop a spectral method comparable to existing spectral and/or hierarchical clustering methods. This research proposes five variations of a novel clustering approach, which is based on SGT and whose output is hierarchically structured. These were compared to six benchmark clustering algorithms from the literature. It is worth highlighting that some of these algorithms determine the number of clusters by themselves, while others require this quantity to be specified beforehand. There is also some diversity in that some, but not all, of these rival alternatives rely on a hierarchical clustering approach, and the same goes for employing SGT. To assess the clustering methods examined in this work, computational experiments were conducted, and the comparison was based on two evaluation metrics: (i) the adjusted Rand index and (ii) modularity. The results provide evidence that the proposed methods are competitive with those in the literature and can be a sound alternative to them.

Beyond this introduction, the paper is organized as follows. In Sect. 2, useful concepts and notation needed to understand this work are introduced. A literature review with an overview of existing spectral clustering methods is given in Sect. 3. Sect. 4 describes the proposed methodology. The datasets, computational experiments, and analysis of the results are presented in Sect. 5, and conclusions are provided in Sect. 6.

2 Fundamental concepts

In this section, we introduce the basic concepts of graph theory and the evaluation metrics for measuring the quality of the partition of a network that will be used in this work. Table 1 provides a summary of symbols along with their respective meanings to improve readability in this section.

Table 1 List of symbols used in the paper

2.1 Graph theory

Let \({\mathscr {G}}\) be the set of simple undirected graphs and \(G=(V_G,E_G) \in {\mathscr {G}}\), where \(V_G\) is the set of n nodes and \(E_G\) is the set of m edges. When the graph G under consideration is clear, we simply write V and E. A subgraph of G is any graph \(S=(V_{{S}},E_{{S}})\) such that \(V_{{S}} \subset V\) and \(E_{{S}} \subset E\). This subgraph is called induced if the set \(E_{{S}}\) consists of all of the edges in E that have both endpoints in \(V_{{S}}\). We write \(v_i \sim v_j\) if \(v_i \in V\) is adjacent to \(v_j \in V\), and \(v_i \not \sim v_j\) otherwise. For \(v \in V\), the set of its neighbors is represented by \(N_v\), and \(\vert N_v \vert \) denotes the cardinality of \(N_v\). For each node \(v \in V\), the degree of v is \(d(v)=\vert N_v \vert \). The minimum degree of G is denoted by \(\delta (G) := \min \{ d(v); v \in V \}\) and the maximum degree of G by \(\Delta (G) := \max \{d(v); v \in V\}\). Occasionally, we will use \(d_i\) to represent \(d(v_i)\), \(\Delta \) to represent \(\Delta (G)\), and \(\delta \) for \(\delta (G)\). The density of a graph, den(G), is defined by \(den(G) = \dfrac{2m}{n(n-1)}\), which measures how many edges exist between nodes compared to how many are possible. The distance between two nodes u, w of a connected graph G, \(d(u,w)\), is the shortest path length from u to w. The diameter of G, denoted by diam(G), is the greatest distance between two nodes, i.e. \(diam(G) = \displaystyle \max _{u,w \in V}d(u,w)\).

The adjacency matrix of G, denoted by \(A = A(G) = (a_{ij})\), is a square and symmetric matrix of order n, such that \(a_{ij} = 1\) if \(v_i \sim v_j\), and \(a_{ij} = 0\) otherwise. The degree matrix of G, denoted by \(D = D(G) = (d_{ij})\), is the diagonal matrix such that \(d_{ii} = d_i = d(v_i)\). The Laplacian matrix of G is defined by \(L = L(G) = D - A\) and the normalized Laplacian matrix by \(\hat{L} =\hat{L}(G) = D^{-\frac{1}{2}} L D^{-\frac{1}{2}},\) where \(D^{-\frac{1}{2}}=(d^{-\frac{1}{2}}_{ij})\) is the diagonal matrix such that \(d^{-\frac{1}{2}}_{ii} = d^{-\frac{1}{2}}_i\). It is important to observe that \(\hat{L}\) is symmetric and, therefore, has a real spectrum. Also, it is well known that the smallest eigenvalue of this matrix is zero and, moreover, the multiplicity of this eigenvalue is equal to the number of connected components in the graph. We refer to Chung (1997) for more details on the normalized Laplacian of a graph.
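These definitions translate directly into code; the following minimal sketch (ours, using a hypothetical 4-cycle graph) assembles \(A\), \(D\), \(L\), and \(\hat{L}\) and checks the spectral properties just stated:

```python
import numpy as np

# Adjacency matrix of a small undirected graph: the 4-cycle
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)

degrees = A.sum(axis=1)
D = np.diag(degrees)                     # degree matrix
L = D - A                                # Laplacian matrix
D_inv_sqrt = np.diag(degrees ** -0.5)
L_hat = D_inv_sqrt @ L @ D_inv_sqrt      # normalized Laplacian

eigenvalues = np.linalg.eigvalsh(L_hat)  # real spectrum, ascending order
# The smallest eigenvalue is 0, and its multiplicity equals the
# number of connected components (here, one).
print(np.round(eigenvalues, 6))
```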

Even though clustering algorithms involving both Laplacians are easily found in the literature, the algorithm proposed in this work considers only the normalized Laplacian matrix. According to Von Luxburg (2007a), from a statistical point of view, spectral clustering with normalized Laplacians is preferable.

2.2 Evaluation metrics

According to Newman and Girvan (2004), modularity is a widely recognized metric for community detection and serves as a quality measure for a specific partition of a network, particularly when the communities are not predetermined. This measure is defined by Eq. (1).

$$\begin{aligned} Q = \frac{1}{2m}\displaystyle \sum _{v_iv_j}\left[ a_{ij} - \frac{d(v_i)d(v_j)}{2m}\right] \delta (c_{v_i},c_{v_j}), \end{aligned}$$
(1)

where \(a_{ij}\) represents the entry of the adjacency matrix corresponding to the nodes \(v_i\) and \(v_j\) of the graph that models the network, m is the number of edges in the graph, \(d(v_i)\) and \(d (v_j)\) denote the degree of node \(v_i\) and node \(v_j,\) respectively and, \(\delta (c_{v_i},c_{v_j}) = 1 \) when \(v_i\) and \(v_j\) belong to the same community, otherwise \(\delta (c_{v_i},c_{v_j})= 0\). When \(Q = 0\), the number of within-community edges is comparable to random chance; however, values approaching \(Q = 1\) suggest a strong community structure.
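A direct, illustrative implementation of Eq. (1) (our sketch, evaluated on a hypothetical network made of two triangles joined by a single edge) is:

```python
import numpy as np

def modularity(A, communities):
    """Modularity Q of a partition, following Eq. (1).

    A is the adjacency matrix; communities maps each node index
    to its community label.
    """
    m = A.sum() / 2.0            # number of edges
    degrees = A.sum(axis=1)
    n = A.shape[0]
    Q = 0.0
    for i in range(n):
        for j in range(n):
            if communities[i] == communities[j]:   # the delta term
                Q += A[i, j] - degrees[i] * degrees[j] / (2 * m)
    return Q / (2 * m)

# Two triangles {0,1,2} and {3,4,5} linked by the edge (2, 3)
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1

print(modularity(A, [0, 0, 0, 1, 1, 1]))   # 5/14 ≈ 0.357
```

Placing every node in a single community yields \(Q = 0\), consistent with the interpretation given above.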

Despite the numerous virtues of modularity, other measures can provide complementary perspectives on node clustering results. A quite interesting example is the Adjusted Rand Index (ARI) (Hubert and Arabie 1985). Regardless of the number or size of clusters, the ARI is designed to be close to 0.0 for random labelings, to reach exactly 1.0 when the clusterings are identical, and has a lower bound of \(-0.5\), indicating strong discordance between clusterings. For its computation, consider that \(C = \{C_1, C_2, \cdots , C_{|C|}\}\) and \(P = \{P_1, P_2, \cdots , P_{|P|}\}\) are two hypothetical partitions of the set of nodes, so that C could be the inferred communities and P could be the “ground truth” labeling of the nodes. The ARI regarding C and P can then be calculated as shown next:

$$\begin{aligned} ARI = \dfrac{ \sum _{ij} \left( {\begin{array}{c}|C_i \cap P_j|\\ 2\end{array}}\right) - \dfrac{ \sum _i \left( {\begin{array}{c}|C_i|\\ 2\end{array}}\right) \sum _j \left( {\begin{array}{c}|P_j|\\ 2\end{array}}\right) }{ \left( {\begin{array}{c}n\\ 2\end{array}}\right) } }{ \dfrac{ \sum _i \left( {\begin{array}{c}|C_i|\\ 2\end{array}}\right) + \sum _j \left( {\begin{array}{c}|P_j|\\ 2\end{array}}\right) }{ 2 } - \dfrac{ \sum _i \left( {\begin{array}{c}|C_i|\\ 2\end{array}}\right) \sum _j \left( {\begin{array}{c}|P_j|\\ 2\end{array}}\right) }{ \left( {\begin{array}{c}n\\ 2\end{array}}\right) } }\ . \end{aligned}$$
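The formula above can be computed from the contingency counts \(|C_i \cap P_j|\) and the cluster-size marginals; the following is our illustrative implementation:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_c, labels_p):
    """ARI between two labelings, following the formula above."""
    n = len(labels_c)
    # Contingency counts |C_i ∩ P_j| and marginals |C_i|, |P_j|
    pairs = Counter(zip(labels_c, labels_p))
    rows = Counter(labels_c)
    cols = Counter(labels_p)

    index = sum(comb(v, 2) for v in pairs.values())
    sum_rows = sum(comb(v, 2) for v in rows.values())
    sum_cols = sum(comb(v, 2) for v in cols.values())
    expected = sum_rows * sum_cols / comb(n, 2)
    max_index = (sum_rows + sum_cols) / 2
    return (index - expected) / (max_index - expected)

print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0 (same partition up to relabeling)
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # -0.5 (maximal discordance for this n)
```

Note that the second call illustrates the \(-0.5\) lower bound mentioned above.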

3 Related works

Numerous works on node clustering exist in the literature, covering a great variety of nuances on this subject, including some approaches that can be addressed as spectral methods. Specifically, these fall into a category of algorithms utilizing eigenvalue decomposition for clustering. Regardless, these works originate from different contexts and have distinct problem dimensions or principles. In this section, we offer a concise overview of existing node clustering methodologies.

3.1 Systematic review protocol

The material for the present study was collected by reviewing journal articles using a systematic mapping approach focused on clustering algorithms. To facilitate the search, specific keywords were identified to compile an initial set of articles. The snowballing technique entails leveraging the reference list and citations of a paper to discover additional relevant papers. This methodology was applied in the current study, and its overall concept is illustrated in Fig. 1. In this research, both backward and forward snowballing were conducted, each involving a single step. The Research Rabbit application was employed to streamline this process.

Using keywords such as (i) Random clustering, (ii) Spectral clustering, (iii) Graph Laplacian, (iv) Community detection, and (v) Laplacian Eigenvalues, a starting set of nine papers was created. From this set, we selected the papers from the last ten years and then conducted the backward and forward snowballing. Inclusion and exclusion criteria were established and applied, as outlined in Tables 2 and 3. After scrutinizing the lists of citations and references for all articles in the initial set, we obtained the final set of articles, briefly discussed in the following.

3.2 Literary panorama

In this subsection, we present a brief summary of the nine papers selected through the proposed systematic review protocol. We point out that almost all of them deal with spectral clustering, considering some appropriate similarity matrix. Besides, the concept of modularity, incorporated in our proposed method, is also present in some of the listed papers.

In Hagen and Kahng (1992), the authors demonstrate that the second smallest eigenvalue from the Laplacian matrix, derived from the netlist, provides a provably accurate approximation of the optimal ratio cut partition cost. They also showcase the robustness of fast Lanczos-type methods for solving the sparse symmetric eigenvalue problem, which forms a reliable foundation for computing heuristic ratio cuts based on the eigenvector of this second eigenvalue. The computation of this eigenvector leads to effective clustering methods, particularly successful on the “challenging” input classes proposed in the Computer-Aided Design (CAD) literature. Additionally, the authors explore the natural intersection graph representation of the circuit netlist as a foundation for partitioning and propose a heuristic based on spectral ratio cut partitioning of the netlist intersection graph. To validate their partitioning heuristics, extensive testing was conducted on industry benchmark suites, with results that compare favorably with those reported in the existing literature.

Fig. 1 Snowballing main idea

Table 2 Definition of inclusion criteria
Table 3 Definition of exclusion criteria

A novel cost function for spectral clustering, based on the error between a given partition and a solution of the spectral relaxation of a minimum normalized cut problem, is introduced in Bach and Jordan (2003). The cost function is minimized with respect to the partition to obtain a new spectral clustering algorithm. Additionally, the paper outlines an algorithm for learning the similarity matrix by minimizing the cost function with respect to the similarity matrix. The authors show that minimizing the cost function is equivalent to a weighted k-means algorithm, providing a mathematical foundation for the use of k-means in spectral clustering. For more details on k-means, we suggest Hastie et al. (2009), for instance.

A comprehensive approach to network community detection, interpreting it as finding the ground state of an infinite-range spin glass, is presented in Jörg and Bornholdt (2006). Encompassing both weighted and directed networks, the approach incorporates the quality function proposed by the authors, as well as the modularity Q defined by Newman and Girvan (2004). The community structure is conceptualized as the spin configuration minimizing the energy of the spin glass, where the spin states denote the community indices. Additionally, the paper explores the properties of the ground state configuration, the detection of hierarchies and overlaps within the community structure, and computationally efficient local update rules for optimization. The conclusion offers expected values for the modularity of random graphs, facilitating the assessment of the statistical significance of community structure.

A fast spectral algorithm for community detection in complex networks, which accurately estimates the effect size of modularity, is presented in Treviño et al. (2015). The algorithm, based on the bisection of the network obtained from the eigenvector associated with the largest eigenvalue of the modularity matrix, allows the estimation of the effect size of modularity, providing a z-score measure that enables a more informative comparison of networks with different sizes and structures. The proposed approach performs well on real-world benchmark networks, allows the study of the modularity distribution in ensembles of Erdős–Rényi networks, and outperforms other known polynomial schemes in terms of accuracy and speed.

A new cluster-based information retrieval approach called Intelligent Cluster-based Information Retrieval (ICIR) that combines k-means clustering with frequent closed item set mining to extract clusters and find frequent terms within each cluster is proposed in Djenouri et al. (2018). The paper focuses on improving the quality and runtime of cluster-based information retrieval. Different heuristics are suggested to select relevant clusters and documents within them. Computational experiments on well-known document collections show that the designed approach outperforms traditional and cluster-based information retrieval approaches in terms of execution time and quality of the returned documents.

The study in Hofmeyr et al. (2019) focuses on finding the optimal low-dimensional projection that maximizes the separability of a binary partition of an unlabelled dataset using the eigenvector associated with the second smallest Laplacian eigenvalue. The authors suggest an approach that combines multiple binary partitions within a divisive hierarchical model to generate clustering solutions spanning various scales and residing in distinct subspaces. The study assesses the performance of this method using benchmark datasets.

The research in Akhanli and Hennig (2020) tackles the issue of selecting a suitable clustering method and deciding on the optimal number of clusters. It introduces a collection of internal clustering validity indices designed to assess various facets of clustering quality. The paper recommends the calibration of these indices for aggregation, enabling the comparison of clusterings across different methods and cluster numbers.

A SB-based method for partitioning urban traffic networks into spatially compact areas, based on a quantitative indicator of congestion levels between adjacent intersections, is proposed by Guo et al. (2021). This method provides a foundation for achieving regional control and coordination of large-scale urban traffic networks. The proposed approach demonstrates its effectiveness in obtaining the desired number of subnetworks with less computational complexity compared to other classical community detection methods. This method is based on the bisection obtained from the set of eigenvectors associated with the transition matrix, utilizing the modularity measure to evaluate the partitioning results and guide the next partitioning direction.

A class of models called Latent Structure Block Models (LSBM) that can handle scenarios where hidden substructure and community-specific submanifold structures exist in network data was introduced by Sanna Passino and Heard (2022). The LSBMs assign a latent submanifold to the latent positions of each community, allowing for graph clustering when a community-specific one-dimensional manifold structure is present. The Bayesian model for the embeddings arising from LSBMs shows good performance on simulated and real-world network data, accurately recovering the underlying communities living in a one-dimensional manifold. The model utilizes flexible Gaussian process priors and can be used when the number of communities is unknown. The authors have evaluated it on complex clustering tasks, and it has shown excellent results on datasets with substantial overlap between communities.

3.3 Rivaling methods

In this subsection we present six methods that will be compared, through computational experiments, with the Abrantes method proposed in this paper.

Ng et al. (2001) proposed a spectral algorithm to directly compute a k-way partitioning of a graph, which contrasts with other approaches that recursively bisect the graph to identify k clusters (see Spielman and Teng 2007, for example). In a general sense, considering the k eigenvectors associated with the k smallest eigenvalues of the normalized Laplacian matrix enables a transformation of the data point representation from an n-dimensional space into a k-dimensional space. Such a Spectral Embedding (SE) enhances the detectability of data clusters (Von Luxburg 2007b). At last, the k clusters are obtained using the k-means clustering algorithm. Henceforth this method is referred to as SE + k-means.
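A compact sketch of this pipeline (our illustration, not the authors' code), run on an extreme toy case of two disconnected triangles, where the embedding separates the clusters perfectly:

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_embedding_kmeans(A, k, seed=0):
    """Embed nodes with the k bottom eigenvectors of the normalized
    Laplacian, then cluster the embedded rows with k-means."""
    degrees = A.sum(axis=1)
    D_inv_sqrt = np.diag(degrees ** -0.5)
    L_hat = np.eye(A.shape[0]) - D_inv_sqrt @ A @ D_inv_sqrt
    _, eigenvectors = np.linalg.eigh(L_hat)   # ascending eigenvalues
    embedding = eigenvectors[:, :k]           # n x k spectral embedding
    _, labels = kmeans2(embedding, k, minit='++', seed=seed)
    return labels

# Two disconnected triangles: {0,1,2} and {3,4,5}
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5)]:
    A[u, v] = A[v, u] = 1

labels = spectral_embedding_kmeans(A, k=2)
print(labels)  # nodes 0-2 share one label, nodes 3-5 the other
```

In practice, row normalization of the embedding (as in Ng et al. 2001) and a more robust k-means initialization are commonly added.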

Yu and Shi (2003) introduce a new method for multiclass spectral clustering, combining continuous optimization and optimal discretization techniques. Leveraging eigen-decomposition, it elucidates the role of eigenvectors and achieves efficient computation using singular value decomposition and non-maximum suppression. This method, here called SE + Discretization, consists of two steps: (1) solving a relaxed continuous optimization problem whose global optimum involves specific eigenvectors subjected to arbitrary orthonormal transforms, and (2) iteratively determining a discrete solution closest to the continuous optimum through an alternating optimization process. This entails identifying the continuous optimum nearest to a discrete solution by computing the optimal orthonormal transform and locating the discrete solution closest to a continuous one via non-maximum suppression. These iterations steadily diminish the gap between a discrete solution and the continuous optimum. Upon convergence, the method achieves a nearly global-optimal partitioning, successfully demonstrated by the authors in the paper.

More recently, Damle et al. (2018) proposed a clustering algorithm that combines SE and column-pivoted QR factorization, identified in this text as SE + CPQR factorization. The algorithm prioritizes simplicity and directness, eliminating the need for an initial guess. Furthermore, it exhibits linear scalability concerning the number of nodes in the graph, and a randomized version of the algorithm offers substantial computational efficiency. This holds under the condition that the subspace defined by the eigenvectors used for clustering encompasses a basis resembling the set of indicator vectors associated with the clusters.

The Hierarchical Community Detection (HCD) method, by Tianxi et al. (2022), is a top-down recursive partitioning algorithm for hierarchical community detection. This algorithm starts with a single community and separates nodes into two communities using spectral clustering repeatedly until a stopping rule suggests there are no further communities. In particular, the authors consider a sign-based bisection, using an eigenvector from the adjacency matrix, and also the spectral clustering using two eigenvectors from the regularized Laplacian to conduct the experiments based on the proposed HCD method. The algorithm is model-free, computationally efficient, and requires no tuning other than selecting a stopping rule.

The Paris algorithm (Bonald et al. 2018) is an agglomerative hierarchical graph clustering method based on modularity-maximization techniques. This agglomerative algorithm employs a distance measure determined by the probability of sampling node pairs to cluster the graph. The nearest-neighbor chain technique is applied to accelerate the clustering process. The output of the algorithm is a regular dendrogram, revealing the multi-scale structure inherent in the graph.

Among the widely adopted algorithms for revealing community structure, the Louvain algorithm (Blondel et al. 2008) stands out. However, a critical flaw in this algorithm was elucidated by Traag et al. (2019): the Louvain algorithm has the potential to produce communities with arbitrary and poor connectivity. To address this problem, the Leiden algorithm was introduced as an improvement of the Louvain algorithm, and it guarantees that the communities identified are well connected. Additionally, the Leiden algorithm is faster than the Louvain algorithm.

Finally, considering the characteristics of each method in terms of being hierarchical, being spectral, and self-determining the number of clusters, we observe that every method has at least one of these properties, while only HCD and Abrantes (our proposal) have all three. For an easier perception of the coincidences and differences among the aforementioned algorithms, Table 4 summarizes their characteristics and anticipates those of the original approach described in the following section.

Table 4 Basic characteristics of the covered methods

4 The Abrantes approach

This section is reserved for presenting the proposed strategy for partitioning a graph into plausible communities. As an initial step towards its complete dissection, a “rough sketch” of its operation is depicted in Algorithm 1, exhibiting the use of SB in its core at Line 5. We emphasize that the method proposed in this work considers SB using the eigenvector associated with the second smallest eigenvalue of the normalized Laplacian matrix, bipartitioning the set of nodes according to the signs of the entries of this eigenvector, as detailed in Algorithm 2. Despite its patent conciseness and simplicity, a closer analysis reveals intriguing insights that merit exploration. Thereupon we delve into its nuances through careful examination and discussion, shedding light on its effectiveness, efficiency, and other properties.

Algorithm 1 Abrantes Community Detection Method “Rough Sketch”

Algorithm 2 Spectral Bisection, SB(G)
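One possible rendering of Algorithm 2 in code (our sketch; the paper's actual implementation may differ) bipartitions the node set by the signs of the Fiedler vector of the normalized Laplacian:

```python
import numpy as np

def spectral_bisection(A):
    """SB(G): split nodes by the signs of the eigenvector associated with
    the second smallest eigenvalue of the normalized Laplacian."""
    n = A.shape[0]
    degrees = A.sum(axis=1)
    D_inv_sqrt = np.diag(degrees ** -0.5)
    L_hat = np.eye(n) - D_inv_sqrt @ A @ D_inv_sqrt
    _, eigenvectors = np.linalg.eigh(L_hat)   # ascending eigenvalues
    fiedler = eigenvectors[:, 1]              # Fiedler vector
    part_a = [i for i in range(n) if fiedler[i] >= 0]
    part_b = [i for i in range(n) if fiedler[i] < 0]
    return part_a, part_b

# Two triangles joined by the bridge edge (2, 3)
A = np.zeros((6, 6))
for u, v in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    A[u, v] = A[v, u] = 1

part_a, part_b = spectral_bisection(A)
print(sorted(part_a), sorted(part_b))  # the two triangles (in either order)
```

For large sparse graphs, the dense `eigh` call would typically be replaced by an iterative solver (e.g., `scipy.sparse.linalg.eigsh`) that computes only the required extreme eigenpair.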

4.1 Hyperparameters setup

Some of the elements in the description just provided are intentionally vague because of their flexibility. One of these is the validation criterion for graph collections, represented by the function val (Line 3 of Algorithm 1). Its arguably most trivial yet useful realization would be \(val: C \mapsto [|C| \ge t]\), for any \(C \in 2^{{\mathscr {G}}}\) and a parameter \(t \in {\mathbb {N}}\): this criterion ensures the partition of the input graph into t communities, which comes in handy in common scenarios where such a quantity is known a priori or was somehow stipulated and needs to be enforced.

Another interesting realization of the function val is the following: consider that in each assessment of val(C) (Line 3 of Algorithm 1) the modularity of the partition C is computed and stored; let \(Q'\) and \(Q''\) be the last two modularity values kept, regarding C in the current and previous iterations of the algorithm, respectively; then, \(val(C) = [Q' - Q'' \le \theta ]\), for a parameter \(\theta \in {\mathbb {R}}\). Such a version of the function val allows the network subdivision process to continue until modularity decreases (i.e., \(\theta = 0\), which was used by default) or is considered stable, so that continuing the subdivision could be expected to lead to less desirable communities; similar ideas are implemented by other clustering methods in the literature (Blondel et al. 2008; Ketchen and Shook 1996). Moreover, its use not only avoids the need to establish the number of communities beforehand but also entails its spontaneous determination.
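The two realizations of val just described can be sketched as plain functions (the names and interfaces below are ours for illustration; the paper does not prescribe an implementation):

```python
def val_fixed_count(t):
    """Stop once the partition has at least t parts."""
    def val(partition):
        return len(partition) >= t
    return val

def val_modularity_delta(theta=0.0):
    """Stop once the modularity gain between consecutive
    iterations drops to theta or below."""
    history = []
    def val(partition, modularity_of):
        history.append(modularity_of(partition))
        if len(history) < 2:
            return False            # nothing to compare against yet
        return history[-1] - history[-2] <= theta
    return val

# Toy usage with a fake modularity function whose values
# first improve and then stall.
scores = iter([0.1, 0.3, 0.3])
fake_Q = lambda partition: next(scores)
val = val_modularity_delta()
print(val([{0}], fake_Q), val([{0}], fake_Q), val([{0}], fake_Q))
# False False True
```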

The function pri (Lines 2 and 6 of Algorithm 1) is another adjustable component of the Abrantes method. It is responsible for prioritizing possible subdivisions of the input graph and guiding the partition process. In other words, it should point towards parts of the input graph whose further division would unveil otherwise entangled communities. Hypothetically these parts would lack properties generally associated with good communities, so assessing indicators of these properties could be used for this purpose. With this in mind, from the multitude of alternatives for the function pri that could be used, the following were considered in this work:

  1. Maximum diameter (Ghoshal and Das 2017)

     $$\begin{aligned} pri({S}, G) = diam(S) \end{aligned}$$
  2. Maximum modularity (Dinh et al. 2015)

     $$\begin{aligned} pri(S, G) = Q(G, \{V_G\setminus V_S\} \cup {SB}(S)) \end{aligned}$$
  3. Maximum modularity \(\Delta \) (Seth et al. 2022)

     $$\begin{aligned} pri(S, G)&= Q(G, \{V_G\setminus V_S\} \cup {SB}(S)) \\&- Q(G, \{V_G \setminus V_S, V_S\}) \end{aligned}$$
  4. Maximum size (Juliana Maria de Sousa 2014)

     $$\begin{aligned} pri(S, G) = |V_S| \end{aligned}$$
  5. Minimum density (Lackner et al. 2018)

     $$\begin{aligned} pri(S, G) = -den(S) \end{aligned}$$
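Two of the simpler policies above (maximum size and minimum density) translate directly into code; the following is our illustrative sketch over node/edge sets, not the authors' implementation:

```python
def pri_max_size(nodes, edges):
    """Maximum size: prioritize the subgraph with the most nodes."""
    return len(nodes)

def pri_min_density(nodes, edges):
    """Minimum density: prioritize the sparsest subgraph.
    den(S) is negated so a larger priority means a sparser part."""
    n, m = len(nodes), len(edges)
    if n < 2:
        return 0.0
    return -(2.0 * m) / (n * (n - 1))

triangle = ({0, 1, 2}, {(0, 1), (1, 2), (0, 2)})
path = ({0, 1, 2}, {(0, 1), (1, 2)})

print(pri_max_size(*triangle))     # 3
print(pri_min_density(*triangle))  # -1.0 (complete graph)
print(pri_min_density(*path))      # ≈ -0.667, so the path is split first
```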

Still, regarding function pri, the greedy design of the procedure in question is utterly manifested through this function. After all, the priority of each part of the input graph is assessed only once when it is defined (Lines 2 and 6 of Algorithm 1), independently of the present or future definition of other parts. On top of that, among all candidates to be subdivided, one is chosen solely according to the already computed priorities (Line 4 of Algorithm 1), and there is no backtracking from this action.

The explanation of the functions val and pri just finished helps to better understand their roles as hyperparameters of Abrantes. Such a setting is clearly distinct from the possibly more familiar scenario in which the behavior of an algorithm or model is tuned according to one or more hyperparameters that are numerical variables. Despite this fact, hyperparameters that are functions are far from unorthodox (Arnold et al. 2024): some examples are the kernel functions of Support Vector Machines, the activation functions of Artificial Neural Networks, and the distance metrics of Nearest Neighbors methods. The rationale for defining val and pri can vary from strategies like grid search to modeling decisions by human experts considering domain knowledge otherwise impossible or harder to attain from input data (e.g., the ground truth number of communities). As defaults, we suggest the modularity-decrease thresholding described at the beginning of Sect. 4.1 for val and the minimum density policy for pri, a choice supported by the experiments reported later in Sect. 5.

4.2 Algorithmic complexity

A useful feature of the proposed approach is that it can be characterized as an anytime algorithm (Jesus et al. 2020): it incrementally defines communities due to its divisive essence, refining its output on each iteration until a final state according to the validation criterion; however, at any instant during each of those iterations, the current input graph partition can be used as a preliminary, partial result. This can be quite convenient to comply with runtime constraints which can take place on some occasions. As a final remark, it is interesting to state that none of the rival methods described in Sect. 3.3 and summarized in Table 4 is claimed to possess this property.

The just mentioned constraints can also be contemplated from the perspective of algorithmic complexity. |C|, the number of communities eventually inferred, is also the number of iterations of the loop starting at Line 3 of Algorithm 1. The cost of each of these iterations is dominated by either the SB at Line 5, the computation of the priorities at Line 6, or the validation of the current partition at Line 3, which greatly exceed the other operations related to the priority queue. It is interesting to observe that as more iterations happen, the number of parts increases while the size of these parts decreases as they are subdivided, and this can impact runtime duration according to the validation and prioritization functions used, as well as the numerical solver employed to determine the SB.

In the general case, it is reasonable to assume that the priority and validation functions cost less than SB (e.g., when priority is determined by graph density and validation is based on a desired number of communities), so that a complexity analysis similar to that of HCD (Tianxi et al. 2022) is possible. For a sparse graph (a common property of many real networks), SB can take just \({\mathcal {O}}(m)\) steps, exploiting the network’s sparsity and requiring not all eigenvectors but only one of the most extreme ones, so that Abrantes would have a complexity of \({\mathcal {O}}(m |C|)\). However, across the iterations of Abrantes, the input to SB becomes smaller, so the aggregated cost of the splits is dominated by the first SB, which is performed on the entire input graph. Therefore, a more precise statement of its computational cost is \({\mathcal {O}}(m \log |C|)\).
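The SB step itself can be illustrated with the classic Fiedler-vector split: the graph is partitioned by the sign pattern of the eigenvector associated with the second-smallest Laplacian eigenvalue. The sketch below uses a dense eigensolver for brevity, whereas the \({\mathcal {O}}(m)\) behavior discussed above presumes a sparse iterative solver that computes only the needed extreme eigenvector:

```python
import numpy as np

def spectral_bisect(nodes, edges):
    """Split a graph into two parts by the sign of the Fiedler vector
    (eigenvector of the second-smallest Laplacian eigenvalue).
    Dense solver for illustration only; sparse solvers scale better."""
    idx = {v: k for k, v in enumerate(nodes)}
    n = len(nodes)
    L = np.zeros((n, n))                     # combinatorial Laplacian L = D - A
    for u, v in edges:
        i, j = idx[u], idx[v]
        L[i, j] -= 1; L[j, i] -= 1
        L[i, i] += 1; L[j, j] += 1
    _, vecs = np.linalg.eigh(L)              # eigenvalues in ascending order
    fiedler = vecs[:, 1]
    side_a = {v for v in nodes if fiedler[idx[v]] >= 0}
    return side_a, set(nodes) - side_a
```

On two triangles joined by a single bridge edge, the sign split recovers the two triangles, which is the intuitive minimum-cut bisection.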

Figure 2 shows a flowchart that provides a higher-level view of the proposed method, helping to elucidate the steps of the Abrantes method.

Fig. 2 Abrantes method search flow diagram

5 Experimental evaluation

This section is dedicated to the empirical assessment of 11 different community detection approaches in total: the variations of the Abrantes method and their rivals, i.e., the algorithms described in Sect. 3.3 and summarized in Table 4, which collectively represent the current state of the art. The intricate nature of real-world networks requires a rigorous examination to quantify the effectiveness of the proposed approach. The evaluation process is structured in four parts: (i) test data (Sect. 5.1), (ii) experimental protocol (Sect. 5.2), (iii) results and discussion (Sect. 5.3), and (iv) key points (Sect. 5.4). Through the systematic exploration of these facets, we aim to provide a comprehensive and insightful performance analysis, shedding light on the capabilities of the methods and their potential implications for understanding the underlying structures within diverse networked systems.

Table 5 Basic characteristics of the tested networks

5.1 Test data

Within the broader framework of experimental evaluation, the choice of appropriate network datasets is a critical determinant of the reliability and applicability of any community detection method. In this subsection, we outline the characteristics of the test networks employed in assessing the proposed algorithm. These datasets were selected to capture the diverse and complex structures inherent in real-world networks, and several key factors guided their selection. First, diversity among datasets allows us to assess the robustness and generalizability of the proposed method across varied scenarios and a wide range of practical challenges, reflecting real-world complexities. Moreover, datasets commonly used in previous research serve as benchmarks, facilitating comparisons and establishing a baseline for performance evaluation.

By transparently detailing the properties of these test networks, we aim to provide a clear understanding of the challenges and intricacies that any method may confront, thereby ensuring a comprehensive evaluation that reflects real-world scenarios and fosters a deeper appreciation of their performances. Numerous network datasets are readily accessible through the Deep Graph Library (Wang et al. 2020), and some of these were originally intended for classification tasks, so that “ground truth” class labels for all nodes were available. From these, the eight networks listed below were used in the computational experiments, considering only the largest connected component of the undirected underlying graph of each original network.

  1. CiteseerGraph: A citation network whose nodes represent scientific publications.

  2. CoraGraph: Another citation network whose nodes also describe scientific publications.

  3. AmazonPhoto: A network regarding the purchase of photography products from Amazon.

  4. WikiCS: A Wikipedia-based network.

  5. AmazonComputer: A network regarding the purchase of computer products from Amazon.

  6. CoauthorCS: A co-authorship network of Computer Science papers.

  7. CoraFull: A larger version of the aforementioned CoraGraph.

  8. PubmedGraph: A citation network of papers about diabetes.

The main features of the networks used are indicated in Table 5, where |V| and |E| are the number of nodes and edges of the network; diam is its diameter; #C is the number of classes over which the nodes are distributed; \({\bar{H}}_1\) is the normalized Shannon’s entropy (Kumar et al. 1986) of such a distribution; \(Q_{GT}\) is the modularity of the “ground truth” partition of the nodes according to their classes.
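Both derived quantities in Table 5 follow from standard definitions; the pure-Python sketch below (hypothetical helper names, not tied to the paper's tooling) illustrates the normalized Shannon entropy of a class distribution and the Newman modularity of a ground-truth partition:

```python
from math import log

def normalized_entropy(labels):
    """Shannon entropy of the class distribution, normalized by log(#C)."""
    counts = {}
    for c in labels:
        counts[c] = counts.get(c, 0) + 1
    n, k = len(labels), len(counts)
    if k < 2:
        return 0.0                      # a single class carries no entropy
    h = -sum((c / n) * log(c / n) for c in counts.values())
    return h / log(k)

def modularity(edges, label_of):
    """Newman modularity Q = sum_c [ e_c/m - (d_c/(2m))^2 ] of the
    partition induced by node labels, for an undirected edge list."""
    m = len(edges)
    intra = {}                          # edges fully inside each class
    deg = {}                            # total degree per class
    for u, v in edges:
        deg[label_of[u]] = deg.get(label_of[u], 0) + 1
        deg[label_of[v]] = deg.get(label_of[v], 0) + 1
        if label_of[u] == label_of[v]:
            intra[label_of[u]] = intra.get(label_of[u], 0) + 1
    return sum(intra.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in deg.items())
```

For two triangles joined by a bridge and labeled as two classes, this yields \(Q_{GT} = 6/7 - 1/2 \approx 0.357\), and a perfectly balanced two-class distribution yields \({\bar{H}}_1 = 1\).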

5.2 Experimental protocol

Three types of experiments were performed, each casting light on the same target from a different perspective. In the first scenario, the number of communities to be determined was assumed to be known a priori, matching the number of node classes. Thus, five instances of Abrantes with varied parameters were compared to three previously established rivals (SE + CPQR factorization, SE + Discretization, and SE + k-means), which are based on Spectral Embedding and take the number of clusters as a parameter. The second scenario required the algorithms to determine not only the composition of the communities but also their quantity. Consequently, only algorithms capable of this assignment were considered: Paris, Leiden, and HCD, besides the instances of Abrantes with varied parameters.

The first two scenarios compared the proposed algorithm’s effectiveness to that of well-established alternatives from the literature. The last scenario, in contrast, was oriented towards “introspection”, taking a closer look at the inner dynamics of Abrantes. It can be summarized by the following question: how sensitive is the algorithm to the choice of the priority assignment function, pri, not only regarding its final output but also its transitions through intermediate states? This is directly related to the anytime property of Abrantes, and thus led to the assessment of all partial clusterings produced until the algorithm terminated.

For all experiments, the performance of each node clustering algorithm was assessed through both ARI and modularity, which complement each other by targeting distinct aspects of the problem at hand. Through numerous runs of the proposed experiments, it was observed that the variability of results due to random or approximative components of the algorithms was negligible. The algorithm implementations were provided by the libraries scikit-learn (Pedregosa et al. 2011) and cdlib (Rossetti et al. 2019), except for HCD, whose original implementation from its authors was used (Footnote 3). All implementations were used with their default parameters. Aiming to support research reproducibility and transparency, the source code of the original implementation of Abrantes is publicly available upon request to the corresponding author or at its repository (Footnote 4).
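For reference, the ARI can be computed from the contingency table of two labelings; in practice a tested implementation such as `sklearn.metrics.adjusted_rand_score` from the already-mentioned scikit-learn would be used, so the following pure-Python version is only an illustrative sketch:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(labels_true, labels_pred):
    """Hubert-Arabie adjusted Rand index from the contingency table.
    Illustrative only; sklearn.metrics.adjusted_rand_score is the
    production-grade equivalent."""
    n = len(labels_true)
    pairs = Counter(zip(labels_true, labels_pred))   # contingency counts
    sum_ij = sum(comb(c, 2) for c in pairs.values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = sum_a * sum_b / comb(n, 2)            # chance-agreement term
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:
        return 1.0                                   # degenerate trivial partitions
    return (sum_ij - expected) / (max_index - expected)
```

Note that the ARI is invariant to label permutations (relabeled but identical partitions score 1.0) and can be negative for agreement below chance.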

5.3 Results and discussion

The results of the proposed tests, as well as their analyses, are reported next. The first batch of results regards the comparison of a variety of algorithms on a collection of networks, so they are presented as colored tables: each cell features a raw value of the measure under consideration, such as ARI or modularity, while the color of a cell indicates the rank of the algorithm of the corresponding column relative to the others on the network of the corresponding row, from light (first, best) to dark (last, worst). For a formally sound evaluation of the superiority of the methods, critical difference diagrams (CDDs) are also featured: after a Friedman \(\chi ^2\) test with \(\alpha =0.05\) was used to reject the null hypothesis that the performances of all methods came from the same distribution, a Conover post hoc test was used for pairwise comparisons. Finally, the second batch of results is reported as a collection of regular line graphs, detailing the dynamics of Abrantes through its iterations.
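As an illustration of the first step of this procedure, the Friedman chi-squared statistic can be computed from per-network ranks as sketched below (a hypothetical pure-Python version; `scipy.stats.friedmanchisquare` provides a tested implementation, and the Conover post hoc test is available in third-party packages):

```python
def ranks(row):
    """Average 1-based ranks of one block (e.g., one network); tied
    values share their average rank. The rank orientation does not
    affect the statistic."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    r = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1                       # extend the tie group
        avg = (i + j) / 2 + 1            # average rank of the tie group
        for t in range(i, j + 1):
            r[order[t]] = avg
        i = j + 1
    return r

def friedman_statistic(table):
    """Friedman chi-squared for table[b][m] = score of method m on block b:
    12/(n*k*(k+1)) * sum_j R_j^2 - 3*n*(k+1), with R_j the rank sums."""
    n, k = len(table), len(table[0])
    rank_sums = [0.0] * k
    for row in table:
        for m, r in enumerate(ranks(row)):
            rank_sums[m] += r
    return 12 / (n * k * (k + 1)) * sum(s * s for s in rank_sums) - 3 * n * (k + 1)
```

When three methods are ranked identically on three networks, the statistic reaches its maximum of \(n(k-1) = 6\), which is what drives the rejection of the null hypothesis of equal performance.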

The results obtained in the first experimental task, detecting communities when their quantity is known beforehand, are presented below. Table 6, regarding the ARI, shows a great variety of results: from the CiteseerGraph and CoraFull networks, on which the performances of all algorithms were similar, to the CoraGraph and WikiCS networks, which produced more varied outcomes. Regarding the columns, the results of Abrantes [max. diameter] were roughly even, while those of SE + Discretization were more divergent. As indicated by Fig. 3, despite the statistical tie among the top methods, the best overall performance can be attributed to the Abrantes [min. density] algorithm: its average ARI is similar to that of Abrantes [max. modularity] and Abrantes [max. size], but its average rank is superior, although such precedence does not occur on all networks (e.g., CoraGraph and AmazonPhoto).

Table 6 Number of communities known: ARI

Fig. 3 CDD of the average rank of the ARI of each method over all considered networks with the number of communities known

Table 7, which reports modularity measurements, depicts a more leveled outlook than its ARI counterpart. Nevertheless, several insights derived from the previous table are confirmed by the current one: for example, once again the max. modularity, max. size, and min. density variations of the Abrantes algorithm overcame their rivals. However, there are some interesting differences, such as the fact that Abrantes [max. modularity] had the best overall performance: this can be interpreted as evidence of the distinct goals and occasional disagreement between internal and external validation of communities (Li 2016); moreover, this time its superiority over all SE-based rivals was statistically significant, as confirmed by Fig. 4. A similar statement can be made about the CoraFull network, whose modularity results were notably better than its ARI results, which can also be related to its large number of classes, entailing relatively small and, accordingly, harder-to-detect communities.

Table 7 Number of communities known: modularity

Fig. 4 CDD of the average rank of the modularity of each method over all considered networks with the number of communities known

The results regarding community detection without a pre-established number of communities are shown next. Concerning the ARI, whose assessment is displayed in Table 8, the subpar overall performance of the algorithms Paris and HCD is quite evident. Leiden, on the other hand, provided solid results: it achieved the highest ARI on 3 of the 8 tested networks, the largest count among all algorithms. On average, however, Leiden’s performance was bested by the same instances of the Abrantes algorithm that excelled in the previous scenario: max. modularity, max. size, and min. density. Nonetheless, these top-4 methods can be considered statistically tied according to Fig. 5, although only Leiden is not significantly superior to all bottom-4 alternatives. Moreover, it is quite interesting to observe that Abrantes [max. diameter] still had the best outcome on 2 networks, supporting its usefulness in some cases despite its mediocre overall performance.

Table 8 Number of communities unknown: ARI

Fig. 5 CDD of the average rank of the ARI of each method over all considered networks with the number of communities unknown

Table 9 shows that the results regarding modularity are similar to their ARI counterparts, but even simpler to analyze. There is a clear match between the suboptimal performance of the algorithms HCD and Paris under both evaluation measures. Meanwhile, the Leiden algorithm now provided, according to Fig. 6, the statistically undisputed best results on each of the considered networks, overcoming the 3 Abrantes instances that had the prime ARI results and that once again overcame the same bottom-4 rivals: another indication of the occasional misalignment between internal and external clustering validation.

Table 9 Number of communities unknown: modularity

Fig. 6 CDD of the average rank of the modularity of each method over all considered networks with the number of communities unknown

One final but still essential point of view from which this scenario of a flexible number of clusters can be observed is precisely the number of clusters inferred by each of the tested algorithms. Table 10 provides this perspective. All instances of the Abrantes algorithm show similar results: they produced numbers of clusters relatively close to the ground truth, except on the CoraFull network, whose original number of clusters sets it substantially apart from the other networks. The previously observed subpar results of the algorithms HCD and Paris regarding modularity and ARI may be related to the fact that the numbers of clusters they provide are quite distant from the expected values: while Paris generally underestimated them, HCD showed a tendency to exaggerate. A similar but positive claim can be made about the good modularity results of the Leiden algorithm, which also consistently overestimated the number of clusters, but on a smaller scale than HCD.

Table 10 Number of inferred communities

Next, the focus is shifted towards a more insightful understanding of the differences between the instances of the Abrantes algorithm considered in the previous experiments. Figure 7 shows how modularity varies as the input network is repeatedly subdivided, increasing the number of clusters. For each instance, execution was carried out until a decrease in modularity occurred after the number of clusters found had become at least as large as the number of ground-truth classes.

Fig. 7 Abrantes algorithm dynamics: modularity variation as the input graph is subdivided so that the number of communities increases

As a general pattern, modularity grows monotonically until the minimum number of clusters is obtained, even though up to that point a decrease would not halt the clustering process. All instances of Abrantes start from the same point, as evidenced by the guaranteed tie among all of them for up to 2 clusters, after which their behaviors diverge. In this regard, the maximum-diameter instance is the most distinct of all, owing to its overall unconvincing performance. It is also interesting to observe how the alternatives differ not only in their performance assessments but also in the number of clusters they eventually define.

Figure 8 depicts the same experimental scenario regarding Abrantes dynamics, but from the point of view of the ARI. These results resemble those regarding modularity: they generally feature an initial upward trend eventually followed by a plateau encompassing the best outcomes the method can provide. Nevertheless, the ARI plots are noticeably less “smooth” than the modularity ones. Moreover, the number of communities at maximum modularity was usually greater than that at maximum ARI. These facts highlight once again the differences between internal and external clustering validation, but they also allow some reliance on modularity analysis alone as a guide to good overall results.

Fig. 8 Abrantes algorithm dynamics: adjusted Rand index variation as the input graph is subdivided so that the number of communities increases

Both of these perspectives on the dynamics of Abrantes share one interesting feature: they show that most of the workload of the clustering optimization process is carried out by the first iterations of the algorithm. This is particularly interesting because it combines with the anytime property in a quite convenient fashion. In other words, it can be interpreted as evidence that an early interruption of the algorithm would still yield a relatively good output compared to the hypothetical output of a method whose clustering quality increases more linearly.

Observing Abrantes from the perspective of scalability, such concentration of processing in the initial steps once again becomes evident, as shown in Fig. 9. This panorama is even easier to analyze than the previous ones because the pattern is similar for all networks and variants of Abrantes: elapsed time increased very quickly during the first 5 bisections performed by the algorithm, after which almost no time was spent in the remaining iterations. This is reasonable, considering that at each iteration the algorithm breaks a portion of the input graph into smaller parts, so that processing such portions may become computationally cheaper as the execution advances. This is also naturally related to the complexity analysis presented in Sect. 4.2: at each iteration, a reduction in the cost of the most expensive operation of the set \(\{SB, pri, val\}\) is plausible.

Fig. 9 Abrantes algorithm dynamics: runtime duration (in seconds) as the input graph is subdivided so that the number of communities increases

5.4 Key points

The experimental assessment of the Abrantes algorithm enabled several interesting perspectives from which it was analyzed, which can be summarized in the following key points:

  • Most instances of Abrantes performed well in each test scenario. However, this does not contradict the fact that the best and worst results were significantly different. Therefore, although the algorithm is not extremely sensitive to the definition of its hyperparameters, fine-tuning should not be neglected.

  • Moreover, certain instances of Abrantes performed better than others in different scenarios. This can be explained by network idiosyncrasies that were better captured by different setups of the algorithm: as a hypothetical example, a network whose communities have very similar sizes could be handled more appropriately by Abrantes [max. size] than by other variants of the algorithm.

  • Still in the same regard, such a coincidence between community characteristics and the Abrantes hyperparameter setup would naturally allow it to perform excellently; conversely, a misalignment in this regard would severely harm its usefulness.

6 Conclusion

This paper introduced Abrantes, a greedy recursive spectral algorithm for community detection. This work aimed to present a node clustering approach that stands on par with state-of-the-art rivals previously established in the literature. The primary objective was to provide a robust and effective alternative in the domain of methods that rely solely on graph topology analysis. The present contribution includes the assessment of five options for the (sub)graph partition prioritization policy that serves as a hyperparameter of Abrantes: (i) Maximum diameter, (ii) Maximum modularity, (iii) Maximum modularity \(\Delta \), (iv) Maximum size, and (v) Minimum density.

We subjected the proposed clustering method to a comparative analysis against six well-established clustering algorithms from the literature: (i) Leiden, (ii) Paris, (iii) HCD, (iv) SE + k-means, (v) SE + Discretization, and (vi) SE + CPQR factorization.

The objective was to assess the performance and efficacy of our method against established benchmarks.

The study employed three variations of computational experiments, each offering a unique perspective on the same objective. The first scenario assumed a number of communities known a priori, while the second required the algorithms to determine both the composition and the quantity of communities. The third scenario focused on introspection, providing a closer examination of the intrinsic dynamics of Abrantes. The clustering methods under investigation were compared using two evaluation metrics: (i) the adjusted Rand index and (ii) modularity. These experiments led to the following finding: the Abrantes algorithm can consistently perceive communities in networks of various kinds, determining node partitions whose quality is on par with or above that of partitions provided by highly regarded rival methods, while exhibiting interesting properties that set it apart from them.

One possible continuation of this work could aim at the development and evaluation of variations of Abrantes employing techniques such as sampling and approximate computing to reduce algorithmic cost, making larger networks more manageable for this method. Another line of work could analyze in more detail the use of the normalized Laplacian matrix for SB, especially considering that discussions of its use for spectral projection are much more abundant in the literature, an imbalance this work helps to alleviate.