1 Introduction

Minimum spanning tree (MST) algorithms have numerous applications in the domain of big data due to their ability to extract essential structures from large data sets. Notable applications include clustering, network design and optimization, and anomaly detection. An MST is a tree that spans all the nodes of a graph while minimizing the total weight of its edges. The goal of clustering is to group similar data points into clusters while keeping dissimilar points separate, and the MST is a graph-theoretic tool that can identify groups of closely related data points to be clustered together [1,2,3,4]. In the context of clustering, the nodes represent data points, and the weights of the edges represent the similarity or dissimilarity between the points. A low-weight edge between two nodes indicates high similarity, while a high-weight edge indicates low similarity.

An MST can be used to identify clusters of data points by partitioning the tree into subtrees, where each subtree represents a cluster [5, 6]. The subtrees can be identified by cutting the edges of the tree at a certain threshold weight. The resulting subtrees consist of groups of closely related points that can be clustered together. The MST can also serve as a pre-processing step for clustering algorithms [7,8,9]: it reduces the complexity of clustering by focusing on the most relevant pairwise relationships and filtering out noisy or less informative connections. The resulting MST serves as a compact representation of the data, facilitating faster and more efficient clustering.

Although numerous algorithms construct the MST accurately and efficiently, these algorithms typically require a graph that reflects the relationships among the data points [4, 10, 11]. However, building such a graph can be computationally expensive. In recent years, many algorithms have utilized the k Nearest Neighbors (kNN) graph to construct the MST. The kNN graph approximates the underlying data relationships by connecting each data point to its k nearest neighbors. By leveraging the kNN graph, MST construction algorithms can effectively capture the data structure and generate an MST that represents the relationships among the data points. This approach offers a computationally feasible and scalable solution for constructing MSTs in large data sets.

There are many algorithms that use kNN for classification and regression tasks in machine learning and data mining. These algorithms are nonparametric (they do not make any assumptions about the underlying data distribution) and instance-based (they rely on the entire data set during the prediction phase) learning algorithms [12, 13]. However, finding the exact k nearest neighbors in high-dimensional data sets is computationally expensive, especially for large data sets [14], since the brute force approach, which involves computing the distance between all point pairs in the data set, has a time complexity of \(O(n^2)\). Therefore, approximate kNN search techniques have gained significant attention, since they are able to find near neighbors in large data sets much more efficiently [15].

While approximate kNN search is intended to approximate the true kNN closely, the two can differ as a consequence of the reduced computational effort. The accuracy of approximate kNN search techniques over the entire data set can be measured by the difference between the approximate kNN distances and the true kNN distances:

$$\begin{aligned} \text {Accuracy} = \sum _{i=1}^{n} \sum _{j=1}^{k} \tilde{D}_{ij} - D_{ij}, \end{aligned}$$
(1)

where \(n\) is the number of data points in the data set, \(k\) is the considered number of nearest neighbors, \(D_{ij}\) is the true distance to the \(j\)th nearest neighbor of the \(i\)th data point, and \(\tilde{D}_{ij}\) refers to the distance of the \(i\)th data point to its \(j\)th nearest neighbor in the approximate kNN.
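As an illustration, the measure of Eq. (1) can be computed directly once the true and approximate neighbor distances are available as \((n \times k)\) arrays. The following minimal NumPy sketch assumes such arrays; the variable and function names are ours, not part of the paper:

```python
import numpy as np

def knn_accuracy(D_approx, D_true):
    """Measure of Eq. (1): total excess of the approximate kNN distances over the
    true kNN distances, summed over all n points and all k neighbors.
    Both arguments are (n, k) arrays; the result is 0 for an exact kNN graph."""
    return float(np.sum(D_approx - D_true))
```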

These techniques aim to strike a balance between accuracy and computational complexity, allowing for faster search times while still providing reasonably accurate results. Several methods have been developed to address the approximate kNN search problem. The three main categories of such methods are spatial-tree-based, hashing-based, and graph-based methods.

Spatial tree-based methods partition the space into smaller regions or nodes, creating a hierarchical structure. The goal is to efficiently organize the data so that spatial queries like nearest-neighbor searches can be performed quickly. Examples of spatial trees include KD-tree [16], Ball-tree [17, 18] and K-Means tree [19]. The construction of a KD-tree involves recursively partitioning a set of points in a K-dimensional space. A splitting axis is chosen, often alternating between dimensions, and data points are separated based on their values along that axis. The median point along the chosen axis becomes the current node, and the process continues for the left and right subtrees until all points are organized into the tree structure.
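A minimal example of a KD-tree-based neighbor query is sketched below; the data and the choice of k are illustrative, and scipy.spatial.cKDTree is used merely as one readily available KD-tree implementation:

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
X = rng.random((10_000, 8))            # 10,000 points in an 8-dimensional space

tree = cKDTree(X)                      # recursive median splits along the coordinate axes
# query k + 1 neighbors because each point is returned as its own nearest neighbor
dist, idx = tree.query(X, k=21)
knn_dist, knn_idx = dist[:, 1:], idx[:, 1:]   # drop the self-match to obtain a 20-NN graph
```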

Spatial-tree-based methods offer fast searching by traversing the tree structure. The advantages of tree-based methods include logarithmic search complexity, good performance for low-dimensional data, and support for range searches. However, they can struggle with high-dimensional data due to the curse of dimensionality and may suffer from unbalanced trees or inefficiency when the data distribution is skewed [20].

Hashing-based methods [21] map high-dimensional data points into lower-dimensional binary codes, enabling efficient indexing and searching. They offer fast query processing and are effective for large-scale data sets. Using the Hamming distance, these methods allow for efficient retrieval of approximate kNN. They have many advantages, including sublinear search complexity, scalability, and the ability to handle high-dimensional data [20]. However, there is a trade-off between accuracy and speed, as hashing-based methods may not always guarantee that all nearby vectors will fall into the same or nearby buckets, which can affect the accuracy of the search [22].
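To make the idea concrete, the sketch below implements one simple hashing scheme, sign random projections, where each point is mapped to a short binary code and points sharing a code become mutual neighbor candidates. This is only one instance of the family of hashing-based methods discussed above, and the function names are ours:

```python
import numpy as np
from collections import defaultdict

def build_lsh_index(X, n_bits=16, seed=0):
    """Hash each row of X to an n_bits-bit code using random hyperplanes."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))
    codes = X @ planes > 0                         # boolean (n, n_bits) matrix of sign bits
    buckets = defaultdict(list)
    for i, code in enumerate(codes):
        buckets[code.tobytes()].append(i)          # points with equal codes share a bucket
    return planes, buckets

def lsh_query(x, X, planes, buckets, k=10):
    """Candidates are the points in the query's bucket, ranked by true distance.
    Fewer than k candidates may be returned; a full method would probe nearby buckets."""
    code = (x @ planes > 0).tobytes()
    cand = buckets.get(code, [])
    cand = sorted(cand, key=lambda i: np.linalg.norm(X[i] - x))
    return cand[:k]
```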

In approximate kNN search, graph-based methods represent data points as nodes in a graph and establish edges based on their distances [15]. These methods utilize graph traversal techniques to find approximate kNN. The advantages of graph-based methods include handling both sparse and dense data, flexibility in defining proximity relationships, and good performance for high-dimensional data [23, 24]. They can also support incremental updates to the graph. However, constructing and maintaining the graph can be computationally expensive, and the search complexity can be higher than with tree-based methods.

This paper provides a method for the construction of the approximate kNN graph, which is customized to generate a high-quality approximate MST. The proposed method is able to derive an approximate MST that is close to the exact MST with a lower computational cost, thereby enhancing the overall performance and efficiency of clustering algorithms, such as Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) [25], for which MST generation is a crucial step. The algorithm scales well to high-dimensional and large data sets, where not only exact methods, but also some approximate methods become computationally infeasible. Two versions of our algorithm are presented. The first version stands out for its memory efficiency. The second version maintains a balance between memory efficiency and execution time. It stores a reduced number of the computed distances during the optimization process, thereby ensuring a shorter execution time than the memory-efficient version and a lower memory consumption than the brute force method.

The rest of the paper is structured as follows. In Sect. 2, the relevant solutions from the literature are presented. The proposed method is detailed in Sect. 3. Results and discussions are introduced in Sect. 4. Finally, a conclusion of the paper is given in Sect. 5.

2 Existing MST Generation Methods

In graph theory, an MST is a subgraph of a given undirected graph that connects all the data points in a data set while minimizing the total sum of edge weights. The construction of an MST must follow the cut and cycle properties. The cut property of MSTs states that the lightest edge crossing any cut must be part of the MST, while the cycle property dictates that the heaviest edge in any cycle cannot be in the MST.

MSTs are employed in clustering and data analysis algorithms to identify patterns and relationships within large data sets. Constructing an MST on data points makes it possible to identify clusters and hierarchies, enabling efficient data exploration and visualization. MSTs can indeed serve as the primary and exclusive technique for unsupervised data clustering. For instance, MST plays a crucial role as a fundamental step within the HDBSCAN clustering framework. An MST can be generated using exact or approximate methods.

2.1 Exact Methods

Constructing an MST from a data set of n points in a d-dimensional space consists of two steps. First, a weighted undirected full graph is constructed from the data set, where each node represents a data point, and edges between nodes convey the similarity between the points. Second, the MST is constructed from this full graph. The process of constructing an MST can be achieved using various algorithms, such as Prim’s algorithm [26], Boruvka’s algorithm [27], or Kruskal’s algorithm. Here is a general outline of the procedure for building an MST from a full graph using Kruskal’s algorithm [26, 28] (a code sketch follows the outline):

  • Create a set for each node, initially containing only that node.

  • Create a list of all edges in the graph.

  • Sort the list of edges in the non-decreasing order of their weights.

  • Select the first edge from the sorted list that has not been selected yet. On the condition of not forming a cycle, add this edge to the current MST and merge the sets of the two endpoints of the edge.

  • Repeat the last step until the MST contains (\(n - 1\)) edges, where n is the number of nodes in the full graph (identical to the size of the data set).
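The sketch below follows the outline above with a simple union-find structure. It enumerates all \(O(n^2)\) pairwise edges explicitly, so it is intended only for small data sets and illustrative purposes:

```python
import numpy as np
from itertools import combinations

def kruskal_mst(points):
    """Exact MST of the full graph on `points` (an n x d array) via Kruskal's algorithm."""
    n = len(points)
    parent = list(range(n))                       # one set per node

    def find(i):                                  # set representative with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # list of all edges, sorted by non-decreasing weight
    edges = sorted(combinations(range(n), 2),
                   key=lambda e: np.linalg.norm(points[e[0]] - points[e[1]]))
    mst = []
    for i, j in edges:                            # lightest remaining edge first
        ri, rj = find(i), find(j)
        if ri != rj:                              # the edge does not form a cycle
            parent[ri] = rj                       # merge the sets of the two endpoints
            mst.append((i, j))
            if len(mst) == n - 1:
                break
    return mst
```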

The authors in [29] introduce the Dual-Tree Boruvka algorithm, which efficiently computes the MST using a spanning forest approach. The algorithm employs a dual-tree strategy, utilizing either a KD-tree or a cover tree to identify nearest neighbor pairs for components within the forest simultaneously. Distance-based pruning mechanisms ensure accurate identification of nearest neighbors by eliminating points that cannot possibly form MST edges, minimizing unnecessary distance computations while efficiently exploring the tree structure during MST construction. Although the authors report a time complexity of approximately \(O(N \log N)\), the performance of the algorithm depends on space partitioning trees. Due to this feature, its efficiency is reduced for high-dimensional data. Experimental results (Sect. 4) show that its scalability needs to be improved with respect to the dimensionality of the data.

The process of building the exact MST from the full graph has a time complexity of \(O(n^2)\). This quadratic complexity poses challenges for large data sets due to the significant time and memory requirements. Therefore, more efficient algorithms or specialized techniques are pursued to address this limitation and enable cluster analysis on very large data sets, at the price of computing an approximate MST with a higher weight than the MST, which we refer to as the exact MST in the sequel for distinction.

Table 1 Data sets of different sizes and dimensions

2.2 Approximate Methods

Many algorithms initiate the construction of the approximate MST by generating a graph with a reduced number of edges, such as a kNN graph. When the \(O(n^2)\) time complexity of exact kNN graph generation is infeasible due to the size of the data set, efficiently generated approximate kNN graphs can be used. Many of the efficient approximate kNN graph generation procedures are based on spatial trees, since the cost of their construction is low. For example, constructing a KD-tree from a d-dimensional data set takes \(O(d \cdot n \log n)\) time, and generating an approximate kNN graph from a balanced KD-tree takes \(O(n \cdot k \log n)\) time.

The price of this reduced computational cost is the reduced accuracy. The sum of the weights of the generated approximate kNN graphs is higher than that of the exact kNN graph, but the structure of the approximate kNN graphs is usually quite similar to the exact kNN graph, thus the classification based on approximate kNN graphs is similar to the one based on an exact kNN graph.

One of the most efficient approximate MST generation methods using a KD-tree is the Well-Separated Pair Decomposition (WSPD) [30,31,32] based method proposed in [28]. In computational geometry, a WSPD of a data set X in d-dimensional space is a sequence of pairs of subsets (\(A_i\),\(B_i\)). Each pair maintains a significant distance between its subsets, and for any two distinct points \(p, q \in X\), there is exactly one pair that separates them, placing p in \(A_i\) and q in \(B_i\). WSPD helps organize points for efficient geometric algorithms by ensuring separation and clear pairwise relationships. \(A_i\) and \(B_i\) are considered \(\epsilon \)-well-separated if

$$\begin{aligned} d(A_i, B_i) \ge \epsilon \cdot \max \bigl ( \mathrm {diam}(A_i), \mathrm {diam}(B_i) \bigr ), \end{aligned}$$

where the distance \(d(A_i, B_i)\) is defined as the minimum distance between the bounding boxes of \(A_i\) and \(B_i\), \(\mathrm {diam}(A_i)\) is the length of the diagonal of the bounding box of the \(A_i\) set, and \(\epsilon \) is a hyperparameter.

In [28], Wang et al. propose a fast approximate MST construction approach utilizing the WSPD technique. Their proposed method is composed of three steps:

  1. building a spatial tree (they use a KD-tree),

  2. building a WSPD based on the spatial tree,

  3. connecting the pairs of the WSPD using the Bichromatic Closest Pair (BCCP) [30], which involves finding the pair of points (\(a_i\), \(b_i\)) that has the smallest distance between set A and set B, where \( a_i \in A\) and \(b_i \in B\).

In [28], the authors present a parallel version of this approach as a step in the parallel HDBSCAN clustering process.

The graph obtained through WSPD has only \(O(n \log n)\) edges and is computed in \(O(dn \log n + \epsilon ^d d^{d/2} n)\) time [33]. However, steps (1)–(3) are considerably slower than the method proposed in this paper because WSPD relies on spatial trees, which scale poorly with the dimensionality of the data [33]. Additionally, adapting this algorithm for an incremental extension of the data set is non-trivial.

In this work, the WSPD-based method proposed in [28] is considered to be the most efficient among the existing approximate MST generation methods [34], and is compared with the proposed method in the numerical section.

In [35], an enhanced and efficient implementation of HDBSCAN called FISHDBC is introduced. The authors employ a technique referred to as Hierarchical Navigable Small World (HNSW) to achieve this improvement. While HNSW is efficient in approximate kNN search, the construction of HNSW is less efficient [15]. The approach also demands considerable memory resources [36]. Furthermore, the method may initially approximate the MST by a spanning forest, and the creation of a single approximate MST from this forest is not properly documented in [35].

In [37], Wang et al. present a fast approximate MST-based clustering algorithm that employs a divide-and-conquer strategy. This method efficiently identifies long edges in the initial stages to decrease the required distance computations by utilizing the cut and cycle properties of MSTs. However, the worst-case complexity of the algorithm in [37] remains \(O(n^2)\), suggesting that it could face performance challenges when dealing with large data sets.

A K-means-based fast approximate MST computation is proposed in [38]. The solution consists of the following stages:

  • \(\lfloor \sqrt{n}\rfloor \) partition

    • dividing the data set into \(\lfloor \sqrt{n}\rfloor \) clusters (using the K-means method)

    • computing the exact MST for each cluster

    • connecting these MSTs into an approximate MST of the entire data set

  • \(\lfloor \sqrt{n}\rfloor -1\) partition

    • repeating the same step with \(\lfloor \sqrt{n}\rfloor -1\) clusters

  • the two approximate MSTs are combined into a single approximate MST.

The expected time complexity of the procedure is \(O(n^{1.5})\). However, the actual execution time might vary considerably. This variation is due to the strong dependence on the distribution of the partitions generated by the K-means algorithm. The computational complexity of computing the exact MST for each cluster depends on the size of the largest cluster, so a strongly unbalanced cluster structure might result in a significantly longer running time.
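A simplified sketch of the per-partition part of this scheme is given below: the data set is split into clusters with K-means, an exact MST is computed inside each cluster, and neighboring clusters are linked by their closest point pairs. The merging of the two partitionings (\(\lfloor \sqrt{n}\rfloor \) and \(\lfloor \sqrt{n}\rfloor -1\) clusters) performed in [38] is not reproduced here, and the function name and the naive cluster-linking rule are our own simplifications:

```python
import numpy as np
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from scipy.sparse.csgraph import minimum_spanning_tree

def partition_mst_edges(X, n_clusters, seed=0):
    """Exact MST inside every K-means cluster plus one connecting edge between
    consecutive clusters; returns (i, j, weight) triples in global indices."""
    labels = KMeans(n_clusters=n_clusters, n_init=4, random_state=seed).fit_predict(X)
    members = [np.flatnonzero(labels == c) for c in range(n_clusters)]
    edges = []
    for idx in members:
        if len(idx) < 2:
            continue
        T = minimum_spanning_tree(cdist(X[idx], X[idx])).tocoo()  # exact intra-cluster MST
        edges += [(int(idx[i]), int(idx[j]), float(w))
                  for i, j, w in zip(T.row, T.col, T.data)]
    for a, b in zip(members[:-1], members[1:]):                   # naive inter-cluster links
        D = cdist(X[a], X[b])
        i, j = np.unravel_index(D.argmin(), D.shape)
        edges.append((int(a[i]), int(b[j]), float(D[i, j])))
    return edges
```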

Jothi et al. propose a recursive approach for clustering data in [19]. The data set is recursively divided into a tree of clusters by repeatedly splitting the data set into two sets until the partition size meets a cluster size limit. This process creates a tree where the leaf nodes represent the final partitions of the data set. A full subgraph is then generated for each partition. The following steps involve identifying neighbor partitions and boundary points. Finally, the subgraphs are connected, and an approximate MST is constructed from them, achieving an expected time complexity of \(O(n^{1.5} \log n)\) [39]. However, the accuracy of the final approximate MST heavily depends on the quality of the inter-set edge calculations used to identify neighboring partitions. This dependency may lead to less accurate approximate MSTs, especially when dealing with complex, high-dimensional data sets [40].

3 The Proposed Method

The goal of this paper is to efficiently construct an approximate MST that is close to the exact MST from a data set denoted as \(X = \{x_1,x_2, \ldots , x_n \} \), comprising \(n\) data points in a d-dimensional space.

The proposed method is composed of the following steps:

  • Random k-neighbor graph generation.

  • Optimization of the random k-neighbor graph to obtain an approximate kNN graph.

  • Generation of an approximate MST from the approximate kNN graph.

  • Optimization of the approximate MST.

3.1 Random k-Neighbor Graph Generation

The initial step in constructing an approximate kNN graph is creating a random directed graph with \((n \cdot k)\) edges, such that each node has k neighbors. The weight associated with the edge between nodes \(x_i\) and \(x_j\) is the distance between them. This way, the weights of edge (\(x_i,x_j\)) and edge (\(x_j,x_i\)) are identical. Algorithm 1 generates a random directed graph by assigning k random data points to each node as neighbors. Each node is represented by a unique key in the graph dictionary, and the corresponding value is a list of k indices representing its neighbors. A random graph of this nature is constructed with a time complexity of \(O(n\cdot k)\).
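A minimal Python sketch of this step is given below. It follows the dictionary representation described above, while details such as the random number generator and the exclusion of self-loops are our own assumptions rather than a verbatim rendering of Algorithm 1:

```python
import numpy as np

def random_k_neighbor_graph(n, k, seed=0):
    """Random directed k-neighbor graph: node index -> list of k random neighbor indices."""
    rng = np.random.default_rng(seed)
    graph = {}
    for i in range(n):
        neighbors = set()
        while len(neighbors) < k:          # expected O(k) draws per node, O(n*k) overall
            j = int(rng.integers(n))
            if j != i:                     # avoid self-loops (our assumption)
                neighbors.add(j)
        graph[i] = list(neighbors)
    return graph
```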

figure a
Table 2 Comparison of algorithms 3 and 4 using different data sets with \(k=20\), \(\delta =8\), \(\Delta =\infty \) (optimization until convergence)

3.2 Optimizing the Random k-Neighbor Graph

The optimization of the random graph generated by Algorithm 1 aims to eliminate long edges based on the procedure proposed in [41]. The elementary step of the procedure is as follows. If \(x_j\) is a neighbor of \(x_i\) and \(x_z\) is a neighbor of \(x_j\) and \( distance (x_i,x_j)\) is greater than \( distance (x_i,x_z)\), then \(x_j\) is replaced by \(x_z\) in the neighbors list of \(x_i\). This elementary step is performed for all nodes and all neighbors of neighbors according to the pseudo-code in Algorithm 2. Since the replaced list of neighbors changes the structure of the graph, further iterations of the same steps might further refine the list of neighbors.
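The sketch below performs one such refinement pass in Python. For every node it keeps the k closest points among its current neighbors and their neighbors, which realizes the replacement rule described above in a batched form; the exact edge-by-edge bookkeeping of Algorithm 2 may differ:

```python
import numpy as np

def optimize_neighbors_once(X, graph, k):
    """One neighbor-of-neighbor refinement pass over the k-neighbor graph.
    X is an (n, d) array, graph maps node index -> list of neighbor indices.
    Returns True if at least one neighbor list was improved."""
    improved = False
    for i, neigh in graph.items():
        candidates = set(neigh)
        for j in neigh:
            candidates.update(graph[j])            # neighbors of neighbors
        candidates.discard(i)
        ranked = sorted(candidates, key=lambda z: np.linalg.norm(X[i] - X[z]))[:k]
        if set(ranked) != set(neigh):
            graph[i] = ranked                      # keep only the k closest candidates
            improved = True
    return improved
```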

figure b

The random k-neighbor graph optimization process, according to Algorithm 2, can be continued until either convergence is achieved (where no further neighbors of neighbors are found to be closer to a given node) or a predetermined number of iterations is reached. Two versions of the random k-neighbor graph optimization process are introduced with different iteration policies based on hyperparameters \(\Delta \) and \(\delta \) (\(\Delta \gg \delta \)) as follows:

  • Long k-neighbor graph optimization (Algorithm 3): In this procedure, the random k-neighbor graph is optimized according to Algorithm 2 until either convergence or \(\Delta \) iterations are reached. At this point, an undirected graph is generated from the directed k-neighbor graph such that, when a node pair is connected by two oppositely directed edges, one of them is dropped. This way, the number of undirected edges is between \(n \cdot k/2\) and \(n \cdot k\), and the weights of the edges are the distances between the corresponding nodes. The approximate MST is obtained as the exact MST of this undirected graph (a code sketch of this undirected-graph-to-MST step follows this list). (The obtained MST is suboptimal because only a limited number of edges of the full graph is used to compute it.)

  • Short k-neighbor graph optimization (Algorithm 4): In this procedure, the random k-neighbor graph is optimized by Algorithm 2 only \(\delta \) times (unless it converges earlier), and after that, an undirected graph and its MST of \(n-1\) edges are generated as in the previous case. In contrast to the long k-neighbor graph optimization method, the obtained approximate MST is further optimized in this procedure. First, the undirected approximate MST is transformed into a directed graph of \(2n-2\) edges, such that the undirected edges are substituted with directed edges in both directions, and Algorithm 2 is applied again until convergence or until \(\Delta \) iterations are reached. The final step of the procedure is to transform the obtained directed graph back to an undirected one and compute its MST.
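The step shared by both variants, collapsing the directed k-neighbor graph into an undirected weighted graph and extracting its MST, can be sketched as follows; scipy.sparse.csgraph.minimum_spanning_tree is used here only as a convenient exact MST routine for sparse graphs, and the function name is ours:

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import minimum_spanning_tree

def knn_graph_mst(X, graph):
    """Undirected weighted graph from the directed k-neighbor graph, then its MST.
    Returns the MST edges as (i, j, weight) triples."""
    undirected = {}
    for i, neigh in graph.items():
        for j in neigh:
            a, b = (i, j) if i < j else (j, i)       # keep one edge per node pair
            if (a, b) not in undirected:
                undirected[(a, b)] = float(np.linalg.norm(X[i] - X[j]))
    rows, cols = zip(*undirected)
    n = len(X)
    W = coo_matrix((list(undirected.values()), (rows, cols)), shape=(n, n)).tocsr()
    T = minimum_spanning_tree(W).tocoo()             # a forest if the graph is disconnected
    return list(zip(T.row.tolist(), T.col.tolist(), T.data.tolist()))
```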

The memory consumption of both methods can be adjusted by the number of edge distances that are stored during the computations. If the computed edge distances are stored in memory, the time of the consecutive iterations decreases because fewer distances have to be computed in one iteration. On the other hand, the memory consumption continuously increases during the procedure. In the long k-neighbor graph optimization method, at most \(\Delta \cdot n \cdot k^2\) distances need to be stored, while in the short k-neighbor graph optimization method, the amount depends on the structure of the data set. In our experiments, it was always less than the memory consumption of the long k-neighbor graph optimization method. It is observed that the second phase of the short k-neighbor graph optimization method, where the number of edges is \(2n-2\), converges rather fast in practice, thus its memory consumption and running time are much lower than those of the long k-neighbor graph optimization method.

If the edge distances are not stored during the computation, then the memory consumption is negligible (\(O(n \cdot k)\)) as only the indexes of k neighbors of each node are stored. However, the computation time increases with the repeated computation of the edge distances.
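The store-versus-recompute trade-off described above can be captured by a small distance helper with an optional cache; the helper below is our own illustration, not part of the algorithms' pseudo-code:

```python
import numpy as np

def make_distance(X, cache=None):
    """Return a pairwise distance function over the rows of X.
    With cache=None every call recomputes the distance (negligible memory);
    with a dictionary cache, each distance is computed once and then reused."""
    def dist(i, j):
        if cache is None:
            return float(np.linalg.norm(X[i] - X[j]))
        key = (i, j) if i < j else (j, i)
        if key not in cache:
            cache[key] = float(np.linalg.norm(X[i] - X[j]))
        return cache[key]
    return dist

# dist = make_distance(X)          -> memory-efficient variant, recomputes distances
# dist = make_distance(X, {})      -> faster iterations, growing memory footprint
```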

Essentially, the approach of the long k-neighbor graph optimization method resembles the approach used in [41], while the approach of the short k-neighbor graph optimization method was not considered before.

figure c

4 Numerical Examples

4.1 Applied Data Sets

Fig. 1 Process of generating the approximate MST based on Algorithm 4 and the associated exact MST. The circles indicate the differences between the obtained approximate MST and the exact MST

Fig. 2 Unbalanced 2d synthetic data set [44] with \(n=6500\) data points and 8 Gaussian clusters. Some differences between the approximate and the exact MSTs are marked with circles

To demonstrate the effectiveness of the proposed method, a series of experiments was conducted on various data sets. In total, 13 different data sets were utilized, as shown in Table 1. Speech, MNIST, and Shuttle are from the multi-dimensional point data sets of the Outlier Detection Data Sets (Footnote 1). CelebA was introduced in Ref. [42] and can be accessed online (Footnote 2). Miss America, House, Europe, Unbalanced, and Birch can be found on the web (Footnote 3); the first three of this group are real-world data, and the last two are synthetic data sets. Corel, Shape, and Audio were used by [41, 43] and are available online (Footnote 4). The make_moons function from the Scikit-learn data sets module (Footnote 5) was used to generate the Make_moons data set. Finally, the make_blobs function from the Scikit-learn data sets module (Footnote 6) was utilized to generate data sets of different sizes and dimensions.

4.2 Comparison of Short and Long k-Neighbor Graph Optimization Methods

The run time and accuracy of the short and long k-neighbor graph optimization methods were compared using the data sets in Table 1, with hyperparameters \(k=20\), \(\delta =8\), and \(\Delta =\infty \) in each case. Independently sampled initial random k-neighbor graphs were generated for each comparison. To demonstrate the number of iteration cycles required for convergence, \(\Delta \) was set to \(\infty \). With this setup, Algorithm 3 and phase 2 of Algorithm 4 run until the graph can no longer be improved, as determined by the iteration cycle of Algorithm 2.

The obtained results in Table 2 demonstrate that the short k-neighbor graph optimization method provides a more accurate approximate MST and a better execution time. Its lower execution time is due to the fact that it converges in fewer iterations: Algorithm 4 performs at most \(\delta \) iterations of optimizing the random k-neighbor graph and then generates a directed graph with fewer edges whose optimization converges quickly. The intuitive explanation for the improvement in the total weight of the obtained approximate MST is that, during the optimization of the approximate MST, the reverse edge of every existing edge of the approximate MST is added before the second optimization phase, which helps to avoid convergence to local optima and enhances the accuracy of the final approximate MST.

The only data set for which the optimization of the random k-neighbor graph converges in fewer than \(\delta \) iterations is the Speech data set. In that case, the independent sampling of the initial random k-neighbor graphs explains the differences in the number of required iterations between Algorithms 3 and 4.

Table 3 Comparison against existing methods
Table 4 Execution time and memory consumption of obtaining \(\epsilon \)-separated pairs using WSPD for different data sets from Table 1
Fig. 3 Execution time (in seconds) using synthetic and real-world data sets, where the dimension of the data is shown in Table 1

Fig. 4 Impact of data dimension on the computational time with synthetic and real-world data sets (\(k=20\))

4.3 Demonstration of the Behavior of Algorithm 4

Figure 1 presents an overview of the approximate MST generation process performed by Algorithm 4 for a data set generated by the make_moons function from Scikit-learn, comprising only 40 data points to enhance visual clarity. In the initial step, Algorithm 4 generates a random k-neighbor graph with \(n=40\) vertices and \(n\cdot k=160\) edges, assuming \(k=4\), which is depicted in Fig. 1a. (A higher value of k would result in a denser graph and a potentially better approximate MST, but with poorer graphical visibility.) In this example, \(\delta \) was set to 2 to execute only two iterations of random k-neighbor graph optimization due to the relatively small number of edges. The result of the first iteration is depicted in Fig. 1b. It is noticeable that the first iteration significantly reduced the number of long edges in the graph, leading to a visually less dense impression of the graph in Fig. 1b.

The primary approximate MST, the result of the first phase of Algorithm 4, is obtained from the second iteration of the k-neighbor graph optimization and is depicted in Fig. 1c. Figure 1d depicts the result of the first optimization cycle of the second phase of Algorithm 4. This phase converges in 2 cycles; no further optimization is possible for this example. The MST generated from the optimized graph is provided in Fig. 1e. For comparison purposes, Fig. 1f presents the exact MST obtained using the brute force method. The approximate and the exact MST are mostly identical in Fig. 1e, f, except for some minor differences indicated by circles.

As another example, Fig. 2 presents the results of Algorithm 4 using a 2D synthetic unbalanced data set generated by the Make blobs procedure containing 6500 data points, which are distributed among eight distinct clusters. It is worth noting that the size of this data set is relatively modest, and the primary objective is to demonstrate the efficacy of the proposed algorithm in handling unbalanced data sets. Similar to the previous example, Fig. 2 suggests that most of the approximate and the exact MST are identical with some minor exceptions, and the weights of the approximate and the exact MST are close.

4.4 Comparison of Algorithm 4 Against Existing Methods

The performance of Algorithm 4 was compared to two other algorithms: MST_mlpack [29], utilizing the code referenced in [45], and FISHDBC [35], utilizing the code repository specified by the author on GitHub (Footnote 7). While MST_mlpack is an exact algorithm, FISHDBC and Algorithm 4 are approximate algorithms. Table 3 presents the results of numerical experiments using some of the benchmark data sets from Table 1. In most cases, Algorithm 4 provided accurate results in terms of relative error, where the relative error is calculated as follows:

$$\begin{aligned} \text {rel.~error} = \frac{\text {approx.~MST~weight}-\text {MST~weight}}{\text {MST~weight}} \end{aligned}$$

and the weight of an undirected graph is the sum of the weights of its edges. Algorithm 4 achieves superior performance for large data sets of high dimensions. As previously mentioned, algorithms that involve partitioning the space, such as those using a KD-tree, do not scale well with the dimension of the data. This is evidenced in Table 3. The empirical running time of the MST_mlpack algorithm is significantly impacted by the number of dimensions of the data. While the MST_mlpack algorithm outperforms Algorithm 4 in two cases, namely when the number of dimensions is low (as in the CelebA data set) or when the size of the data set is small (as in the MNIST data set), it is clear that this algorithm has limitations in terms of scalability. FISHDBC and Algorithm 4 demonstrate scalability with respect to the dimensionality of the data. However, Algorithm 4 exhibits superior performance in terms of both run time and relative error when compared to FISHDBC.

To the best of the authors’ knowledge, there is currently no available implementation for obtaining an MST using sequential WSPD. We tried to run the implementation of [28] in a sequential manner; however, the current implementations (Footnote 8) only support low-dimensional data (2–20 dimensions). While an implementation of sequential WSPD exists (Footnote 9), its output consists of a set of \(\epsilon \)-separated pairs, as detailed in [33]. These pairs must be connected by an edge between the closest points of each pair using BCCP to obtain a connected graph from which an MST can be generated. The implementation of [33] was utilized to illustrate that WSPD has a strong dependence on the dimension of the data set: WSPD suffers for higher-dimensional data sets, as the number of pairwise comparisons that need to be stored increases exponentially, leading to higher memory requirements [33]. Therefore, the dimension of the data can significantly impact memory usage. Table 4 shows the execution time and memory usage of obtaining \(\epsilon \)-separated pairs using the implementation of WSPD presented in [33]. It can be observed that WSPD is memory-hungry when the dimension of the data set and \(\epsilon \) are high. The execution time detailed in Table 4 indicates the time required to obtain the \(\epsilon \)-separated pairs; the time required to connect the \(\epsilon \)-separated pairs and to generate the approximate MST is not included.

4.5 Empirical Time Complexity

To investigate the impact of the size of the data set on the running time of Algorithm 4, evaluations were conducted on both real-world and synthetic data sets of varying sizes. Figure 3 shows the results of the experiments conducted. A least squares fitting of the obtained data points suggests that the empirical computational time of Algorithm 4 is around \(O(n^{1.07})\).
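The exponent quoted above can be obtained from the measured (data set size, running time) pairs with an ordinary least-squares fit in log-log space; the helper below illustrates the computation and takes the measurements as input rather than reproducing the paper's values:

```python
import numpy as np

def empirical_exponent(sizes, seconds):
    """Fit log(time) = a*log(n) + b and return the slope a,
    i.e., the empirical exponent in the O(n^a) running-time estimate."""
    a, _b = np.polyfit(np.log(np.asarray(sizes, dtype=float)),
                       np.log(np.asarray(seconds, dtype=float)), 1)
    return float(a)
```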

The performance of the proposed method on real-world data sets of different sizes and dimensions (all the real-world data sets in Table 1) follows the same trend as for the synthetic data sets. Additionally, subsets of the Corel data set were used to ensure the same structure but different sizes.

4.6 The Impact of Data Dimension

In the proposed approach, the dimension of the data set plays a role only in the computation of the edge weights of the graph, i.e., the distance between data points. As a result, our approach is only slightly affected by the dimension of the data set compared to the algorithms that rely on spatial trees or partitioning the space into subsets.

To demonstrate this behavior, multiple experiments were performed using data sets of varying dimensions. The experiments were carried out in two distinct settings: One involved storing the calculated distances in the memory throughout the optimization process, while in the other, these distances were not stored.

Figure 4 plots the result of the experiments with and without storing the calculated distances using the synthetic and real-world data sets from Table 1. As expected, the execution time is higher when the calculated distances are not stored in the memory but are recalculated each time they are needed. The trend of the execution time as a function of the data set dimension suggests that the function that computes the distance has a significant overhead in our Python implementation, which dominates the running time for dimensions less than \(10^3\). That is why the running time is seemingly independent of the dimension for dimensions less than \(10^3\) and starts increasing only after that limit.

5 Conclusion

Large data sets are often represented as graphs with as many nodes as the number of data points, and efficient graph procedures are needed to process such large graphs. One of the essential graph processing steps is the computation of the MST, which plays a role, e.g., in clustering, anomaly detection, etc.

Since exact methods are infeasible for large data sets, the paper presents an algorithm for constructing an approximate MST accurately and computationally efficiently. The algorithm starts by creating a random k-neighbor graph and then optimizes it by crawling toward the optimal neighbors of each node to obtain an approximate kNN graph. Using this graph, the algorithm calculates an initial MST and optimizes it using the same crawling technique. Numerical experiments indicate the properties and the effectiveness of the proposed method.

Although the current paper primarily focuses on the static, sequential implementation of the proposed algorithm, it can be enhanced with a dynamic, parallel implementation, which makes the proposed algorithm compelling for a wide range of MST-based clustering tasks.