1 Introduction

The similarity join is a fundamental operation in data analysis that consists of finding pairs of close tuples according to a given distance metric. It is used in various applications, including data cleaning [1], entity resolution [2], and collaborative filtering [3, 4].

Naively, a similarity join can be computed by comparing all data pairs, which requires evaluating the entire Cartesian product. This severely degrades performance and limits scalability to large datasets. For huge datasets, a cluster of machines and scalable distributed algorithms are required, which makes it challenging for scientists unfamiliar with parallel and distributed computing to employ similarity join algorithms in their applications.

Parallelization patterns, such as MapReduce [5], have gained popularity by easing the life of scientists in many application domains and by accelerating code prototyping and time-to-solution. They achieve this by hiding complex low-level mechanisms and sparing the user the tuning of a considerable number of parameters needed to reach the maximum performance of the targeted parallel system.

In [6], we proposed MRS-join (MapReduce Similarity Join), a scalable similarity join computation using MapReduce and Locality Sensitive Hashing (LSH). Besides being easy to use for scientists, the proposed approach significantly reduces the number of comparisons needed and the communication costs while guaranteeing perfect computation balancing. The MRS-join algorithm was implemented in Hadoop. Apache Hadoop, with the Hadoop Distributed File System (HDFS), is the reference framework of the MapReduce paradigm and the de facto standard in industry and academia due to its ease of use, horizontal scalability, and failover properties. However, Hadoop can be overkill, in terms of performance, for small-to-medium datasets, i.e., when the dataset fits in the aggregated memory of the cluster nodes used, or when the MapReduce computation spans multiple chained jobs, as in the similarity join algorithm (i.e., histogram calculation followed by similarity join computation).

In this work, we propose the SimilarityJoin high-level pattern on top of the FastFlow [7, 8] library, a parallel programming library providing the programmer with both high-level parallel patterns and a lower-level software layer of nestable and composable data-flow components called Building Blocks (BBs) [9]. As a result, FastFlow's BBs can be used to implement efficient and scalable data processing with a single source for both high-end multi-core servers and clusters of multi-core nodes. The SimilarityJoin pattern interface is back-end agnostic, offering all the benefits of the FastFlow library while hiding all the complex parameter tuning of its run-time system (RTS). Furthermore, the input/output data of the pattern are based on standard POSIX files, whereas all intermediate results can be stored either in main memory, for small/medium datasets producing in-core executions, or on files, for large/huge datasets producing out-of-core executions, through a custom memory allocator that uses the standard C++ allocator in the former case and Metall's persistent allocator [10] in the latter.

We validated the SimilarityJoin parallel pattern through a set of experiments based on the trajectory similarity use case, showing its scalability on a cluster of 16 server nodes while varying the number of nodes employed and the input dataset sizes to cover both in-core and out-of-core computations. We compared the performance of the pattern against an existing hand-tuned implementation of the same use case in Hadoop [6] and against a new version written from scratch in Apache Spark. The results highlight the strengths of the proposed FastFlow-based SimilarityJoin parallel pattern.

The outline of the paper is as follows. Section 2 presents an overview of the LSH-based similarity join algorithm and the FastFlow library. Section 3 introduces the SimilarityJoin pattern and its FastFlow implementation. Section 4 presents the experimental evaluation conducted considering both in-core and out-of-core executions. Section 5 provides a discussion of related work, and Sect. 6 draws the conclusions of this paper.

2 Background

2.1 Similarity Join Algorithm

Formally, a similarity join for two collections of data R and S is \(R \bowtie _{\lambda } S= \{ (u,v) \in R \times S \ | \ Dist(u,v) \le \lambda \}\) where \(\text {Dist}(u,v)\) is a distance between u and v, and \(\lambda\) is the threshold parameter. The MRS-join algorithm [6] uses MapReduce patterns to compute the similarity join. To avoid comparing all data pairs, which requires a Cartesian product computation, the search space is reduced using a random hashing framework called Locality Sensitive Hashing (LSH) [11, 12]. It is based on a hashing scheme that ensures that nearby data points are more likely to collide than distant ones. The random hashing function depends on the type of data and the chosen distance. To produce a good fraction of all pairs of similar records, several independent iterations are required. The hash value obtained at each iteration is used as a join attribute. We refer the reader to [13] for more details on LSH.
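The need for several iterations follows from the standard LSH amplification argument: if a single LSH function makes a pair of similar records collide with probability at least p, then after L independent iterations the pair is emitted as a candidate at least once with probability at least \(1-(1-p)^{L}\), which rapidly approaches 1 as L grows.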

In large skewed datasets, the similarity join processing may be inefficiently distributed, i.e., a few computation nodes end up performing a large number of distance computations. To avoid these effects, MRS-join uses distributed histograms and randomised communication patterns to ensure perfect balancing properties during all the steps of the similarity join computation while reducing communication costs to only relevant data [6, 14, 15]. The histogram of a join is defined as the mapping between a join attribute value and its frequencies. Only join attribute values that might appear in the join result are retained, reducing communication costs to relevant data. For large datasets, we expect that the corresponding histogram does not fit in memory. Therefore, the histogram is distributed according to the occurrence of join attribute values in different portions of the input data. It is then used to generate communication templates, allowing only relevant data to be transmitted fairly during the join processing step. We refer the reader to [6, 14, 15] for further details about distributed histograms and randomised communication templates.

Fig. 1 MapReduce similarity join computation steps in Hadoop

MRS-join has been implemented in the Hadoop framework. It proceeds in two steps as illustrated in Fig. 1 where:

➊: The histogram of the join is computed and distributed to reduce computation to only relevant data while guaranteeing balanced communication patterns.

➋: Using distributed histograms, efficient and scalable communication templates are generated to balance the load of distance computations between pairs identified as similar.

At the beginning of each step, the join attribute values are computed using LSH for each record in the input. Step ➊ consists of two MapReduce jobs: the first computes the histogram of the join and the second distributes it. Using the distributed histograms, step ➋ computes the similarity join output with an additional MapReduce job.

2.2 The FastFlow Parallel Library

The C++ header-only FastFlow library [7, 8] is the result of a research effort started in 2010 aiming to provide application designers with essential features for parallel programming via suitable abstractions and a carefully designed run-time system. At the lower software layer of the library are the so-called Building Blocks (BBs), i.e., recurrent data-flow compositions of concurrent activities working in a streaming fashion, which are used as the primary components for building FastFlow parallel patterns (e.g., Pipeline, ordered Task-Farm, Divide&Conquer, Parallel-For-Reduce, Macro Data-Flow) and, more generally, FastFlow streaming topologies [7, 16].

Fig. 2 FastFlow shared-memory Building Blocks (BBs)

Following the principles of the structured parallel programming methodology, a parallel application (or one of its components) is conceived by selecting and adequately assembling a small set of well-defined BBs modeling data and control flows. The set of FastFlow BBs is sketched in Fig. 2. It comprises both parallel BBs, i.e., the pipeline composition (ff_pipeline), task-farm (ff_farm), and all-to-all (ff_a2a), and sequential BBs, i.e., the standard node (ff_node), the multi-input/output node (ff_minode/monode), and the node combiner (ff_comb) implementing the sequential composition of FastFlow nodes.

FastFlow’s BBs can be combined and nested in different ways forming either acyclic or cyclic concurrency graphs, where nodes are FastFlow concurrent entities and edges are communication channels carrying heap-allocated pointers. They have either bounded or unbounded capacity (feedback channels always have unbounded capacity). The concurrency control can be either blocking or non-blocking (default). BBs mainly target system programmers who want to build new frameworks, patterns or RTSs. All high-level parallel patterns offered by the FastFlow library have been implemented using BBs.

Initially, FastFlow was designed to target multi/many-cores. Recently, its run-time system has been extended to deploy FastFlow programs in distributed-memory environments [9]. The distributed RTS has been implemented by leveraging BBs and extending them with the objective of preserving the original data-flow streaming programming model. By introducing a small number of edits to programs already written using FastFlow's BBs, the programmer may port their shared-memory parallel application to a hybrid implementation (shared-memory plus distributed-memory) in which parts of the concurrency graph are executed in parallel on different machines according to the well-known SPMD model [17]. Such minimal refactoring involves introducing the concept of Distributed Groups (dgroups), i.e., identifying logical partitions of the BBs composing the application streaming graph according to a small set of graph-splitting rules [9]. A simple example of a FastFlow shared-memory streaming application (left-hand side) partitioned into k distributed groups (right-hand side) is given in Fig. 3. The dgroups have been created by \(k-1\) horizontal graph cuts. In a nutshell, a graph cut is valid if the resulting sub-graph can be expressed as a composition of BBs (i.e., it is a valid FastFlow shared-memory application). Currently, inter-dgroup (i.e., inter-process) communications leverage raw TCP/IP or MPI, whereas intra-dgroup communications use highly efficient lock-free shared-memory communication channels [18].

Fig. 3 FastFlow streaming graph example. The communication topology is defined as a composition of BBs in a data-flow fashion. The concurrency graph can be partitioned into distributed groups (dgroups), each implemented by a dedicated multi-threaded process. In this example, the k dgroups have been obtained by cutting the a2a BB horizontally

In the distributed FastFlow RTS, data serialization can be carried out in two different ways. The programmer may select the best approach, between the two, for each data type flowing into the inter-group channels (i.e., the data types produced/received by the edge nodes of a dgroup). The first approach employs the Cereal serialization library [19]. It can automatically serialize base C++ types and compositions of C++ standard-library types; it just requires the implementation of simple mapping functions for custom or user-defined types. The second approach lets the user specify their own serialization and deserialization function pair. This might be useful, when feasible, to avoid the extra copies needed by the serialization process itself. In this work, we always use Cereal-based serialization, which guarantees portable representations across different platforms.
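As an illustration of the first approach, the following is a minimal sketch (the types are hypothetical, not those used by the pattern) of the Cereal mapping functions needed to make a user-defined type serializable so that it can flow over inter-dgroup channels:

```cpp
// Minimal sketch: Cereal mapping functions for hypothetical custom types.
#include <cereal/archives/binary.hpp>
#include <cereal/types/vector.hpp>
#include <sstream>
#include <vector>

struct Point2D {                      // custom (user-defined) type
    double x, y;
    template <class Archive>
    void serialize(Archive& ar) { ar(x, y); }        // Cereal mapping function
};

struct Trajectory {                   // composition of custom and std types
    long id;
    std::vector<Point2D> points;      // handled via cereal/types/vector.hpp
    template <class Archive>
    void serialize(Archive& ar) { ar(id, points); }
};

int main() {
    Trajectory t{42, {{0.0, 0.0}, {1.0, 2.0}}};
    std::ostringstream os;
    {   // serialize t into a binary archive
        cereal::BinaryOutputArchive oa(os);
        oa(t);
    }
    return 0;
}
```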

3 The SimilarityJoin Pattern

3.1 Description

The similarity join pattern is exposed to users through a C++ templated callable object, SimilarityJoin, whose inputs and outputs are based on files. Once the template parameter T is fixed to the concrete datatype of the application, the constructor requires the pattern configuration file's path, the input file's path, a few functions, and, optionally, the size of the application batching feature. The first function defines the parsing of a dataset line. It returns a template-based struct (i.e., sj_item<T>) holding the parsed data and some related metadata. Specifically, it contains three fields: content, representing the actual data; dataset, indicating which dataset the item comes from; and a unique identifier, id. The input dataset is expected to provide a unique identifier and a tag identifying the side (R or S). The next set of functions implements the locality-sensitive hashing (LSH) for a given item. Since the LSH functions are usually numerous (e.g., 8 or 16), they can be passed as a list of functions (through an iterable std container or using the curly-brackets notation). Finally, the last function implements the similarity algorithm. It returns true if the two passed items are similar, false otherwise. In the proposed interface, all the functions may be provided using either lambdas (i.e., anonymous functions), std::functions, or C++ functors. The object is invoked through the C++ call operator (operator()), without any parameter, to start the execution. The interface and its data types are sketched in Fig. 4.

Fig. 4 Similarity Join pattern's interface for a generic data type T. It requires: a pattern configuration file, an input dataset file path, a line parsing function, a set of LSH functions, a similarity predicate, and the batching size (optional)

The configuration file is required to set the number of processes, the process-host mapping, and the number of Mappers and Reducers for each host. The format of each line of the configuration file is: hostname M R. By properly setting the number of hosts (one per line), M (number of Mappers) and R (number of Reducers), it is possible to run and tune the execution on a cluster of heterogeneous machines (i.e., with different numbers of cores and main memory capacity).
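For example, a configuration file for two (hypothetical) hosts, each running 24 Mappers and 24 Reducers, would contain the following two lines:

```
node01 24 24
node02 24 24
```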

Concerning the input dataset, the SimilarityJoin pattern manages it as follows: if the execution runs on a single server node (i.e., shared-memory execution), the dataset is expected to be provided as a single monolithic file. Conversely, if the execution spans multiple server nodes, the dataset must be split to assign a partition to each server node (for instance, using three server nodes, the input file must be split/organized into three partitions). In addition, each part must be tagged with a suffix in the form "000", "001", etc. The final output, similarly to what Hadoop does, is always written to disk using a separate file for each Reducer.

3.2 Use Cases

With the interface described in the previous section, it is possible to express several scientific applications leveraging the LSH-based similarity join algorithm and using different data types. This section presents two prominent case studies: trajectories and sets.

Trajectories are seen as polygonal lines where each point belongs to \(\mathbb {R}^d\), with d the space dimension. Thus, the datatype T of the pattern can be something like std::vector<std::tuple<double, double>> for a two-dimensional space. The similarity algorithm employs the discrete Fréchet distance [20] to compare trajectories. The Fréchet distance is often explained by the following metaphor: a man holds his dog on a leash, and both walk along finite trajectories. Man and dog can vary their speed but cannot turn back. The continuous Fréchet distance is the minimum leash length that allows the man to stay connected to his dog during the entire journey. To reduce the time for computing the distance, the algorithm uses several heuristics to test in near-linear time whether two trajectories have a distance less than a given threshold \(\lambda\) [21, 22]. The Fréchet LSH function family [23, 24] uses a random grid of dimension d defined by a resolution \(\sigma\) and an origin randomly chosen in the half-open hypercube \([ 0, \sigma [^{d}\). Each trajectory is transformed into a sequence of grid nodes. The resulting sequence is universally hashed to be used as a join attribute. The same procedure is iterated several times to produce a good fraction of all pairs of similar trajectories. Therefore, the Fréchet distance is only computed for trajectories with the same sequence of grid nodes. The resolution parameter of the grid is set to \(\sigma =4\times d\times \lambda\), as done in the experiments of [6, 24]. The trajectories application is used for the experimental evaluation in Sect. 4. The SimilarityJoin pattern for the trajectories use-case is sketched in Fig. 5.
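To make the grid-based hashing concrete, the following is a minimal sketch of one such LSH function for 2D trajectories, written from the description above; the snapping step and the hash-combining scheme are assumptions and not the paper's actual implementation.

```cpp
// Sketch of one grid-based LSH function for 2D trajectories: random grid of
// resolution sigma, origin drawn uniformly in [0, sigma)^2, points snapped to
// grid nodes, consecutive duplicates removed, resulting sequence hashed.
#include <climits>
#include <cmath>
#include <cstdint>
#include <functional>
#include <random>
#include <utility>
#include <vector>

using Point      = std::pair<double, double>;
using Trajectory = std::vector<Point>;

std::function<std::uint64_t(const Trajectory&)>
make_grid_lsh(double lambda, std::mt19937_64& rng) {
    const double sigma = 4.0 * 2 * lambda;                 // sigma = 4 * d * lambda, d = 2
    std::uniform_real_distribution<double> U(0.0, sigma);  // random origin in [0, sigma)^2
    const double ox = U(rng), oy = U(rng);

    return [sigma, ox, oy](const Trajectory& t) {
        std::uint64_t h = 1469598103934665603ULL;          // FNV-1a offset basis
        long long px = LLONG_MIN, py = LLONG_MIN;
        for (const auto& [x, y] : t) {
            long long gx = std::llround((x - ox) / sigma); // snap to the nearest grid node
            long long gy = std::llround((y - oy) / sigma);
            if (gx == px && gy == py) continue;            // drop consecutive duplicates
            px = gx; py = gy;
            for (long long v : {gx, gy}) {                 // hash the node sequence
                h ^= static_cast<std::uint64_t>(v);
                h *= 1099511628211ULL;                     // FNV-1a prime
            }
        }
        return h;                                          // join-attribute value
    };
}
```

Several such functions, each built with an independently drawn origin, would then be passed to the pattern as its list of LSH functions.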

Fig. 5 Example of usage of the SimilarityJoin pattern for the trajectories use-case. The dataset parsing and LSH implementations are omitted for brevity

A second well-known similarity join application in the literature concerns sets. It can be applied to different data structures representing a group of objects, simply by varying the template parameter and adapting the LSH and similarity functions. Similarity join on sets has a large number of applications, including similar text detection [25, 26], collaborative filtering [3], and clustering of large malware datasets [27]. The pruning power of the set similarity join is also used to reduce the number of candidate pairs for edit-based string similarity joins [28]. A popular distance in the literature is the Jaccard distance, defined as \({\text {Jaccard}(u, v) = 1 - {\Vert }u \, \cap \, v{\Vert }/{\Vert }u \, \cup \, v{\Vert }}\) where u and v are sets and \({\Vert }\cdot {\Vert }\) is the cardinality of a set. MinHash is a family of LSH functions that estimates the Jaccard distance. It takes advantage of a random permutation to retrieve similar sets: each set is hashed to its element with the smallest position in the random permutation. The procedure is repeated several times to produce almost all pairs of similar sets. We refer the reader to [29, 30] for more detailed information on MinHash and to [15] for details on the implementation of MRS-join for set similarity join. In the evaluation section, we evaluate the Fréchet use case.
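As a rough illustration (the constants and the hashing scheme are assumptions, not the paper's implementation), one MinHash iteration for sets of 64-bit element ids could look like this, with a random hash playing the role of the random permutation:

```cpp
// Sketch of one MinHash-style LSH function for sets of 64-bit element ids.
#include <algorithm>
#include <cstdint>
#include <functional>
#include <limits>
#include <random>
#include <vector>

using Set = std::vector<std::uint64_t>;   // a set encoded as a list of element ids

std::function<std::uint64_t(const Set&)> make_minhash(std::mt19937_64& rng) {
    const std::uint64_t a = rng() | 1ULL;  // random odd multiplier
    const std::uint64_t b = rng();
    return [a, b](const Set& s) {
        std::uint64_t best = std::numeric_limits<std::uint64_t>::max();
        for (std::uint64_t e : s)
            best = std::min(best, a * e + b);  // element with the smallest hashed position
        return best;                           // join-attribute value for this iteration
    };
}
```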

3.3 FastFlow-Based Implementation

This section describes the implementation of the SimilarityJoin pattern using FastFlow as a back-end, allowing us to execute the application both in shared-memory and distributed-memory environments using a single source.

The algorithm can be implemented as a sequence of two map-reduce phases: the computation of histograms, and the effective similarity join.

In FastFlow, a sequence of steps can be implemented by using a pipeline BB. In this algorithm, the pipeline has two stages, each consisting of a map-reduce computation. The map-reduce paradigm can be easily implemented using the a2a (i.e., all-to-all) BB as sketched in Fig. 6. The Mappers are the FastFlow sequential BBs in the left-hand set of Workers of the all-to-all, whereas the Reducers are the sequential BBs in the right-hand set of Workers. However, since the two map-reduce phases must be computed in batch according to the algorithm, i.e., the computational phases do not overlap, the implementation can be realized using a single all-to-all BB iterated two times by leveraging the feedback modifier of the a2a BB. In this way, Mappers and Reducers must contain the business logic of both phases. The resulting graph topology is shown in Fig. 7. Such a reduced configuration optimizes resource usage by halving the number of FastFlow sequential nodes and increases flexibility when moving from shared to distributed memory. In fact, exploiting a single all-to-all as the root BB allows the resulting graph to be cut both vertically and horizontally (and also in a mixed fashion) to create sub-graphs composing the distributed groups that are still valid all-to-all BBs. Unlike Hadoop MapReduce, the pattern implementation does not force intermediate results to be stored on disk, preferring to keep all data in main memory when possible. As we will discuss in Sect. 4.3, for very large input datasets producing out-of-core executions, the user can select at compile time to employ a file-based memory allocator implemented leveraging the Metall [10] persistent allocator to run the pattern on memory-constrained systems.
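The basic shape of one map-reduce phase built from BBs can be sketched as follows; this is a self-contained toy example and not the pattern's actual source: in the real implementation the Mapper and Reducer nodes embed the business logic of both phases and the all-to-all is either pipelined twice or iterated via the feedback modifier.

```cpp
// Toy sketch: an all-to-all whose left-hand nodes act as Mappers (multi-output)
// and whose right-hand nodes act as Reducers (multi-input).
#include <ff/ff.hpp>
#include <iostream>
#include <vector>
using namespace ff;

struct Mapper : ff_monode_t<long> {
    int nReducers;
    Mapper(int r) : nReducers(r) {}
    long* svc(long*) {
        for (long k = 0; k < 16; ++k)                          // emit <key,value> pairs,
            ff_send_out_to(new long(k), (int)(k % nReducers)); // shuffled by key
        return EOS;                                            // end of the phase
    }
};
struct Reducer : ff_minode_t<long> {
    long sum = 0;
    long* svc(long* k) { sum += *k; delete k; return GO_ON; }
    void svc_end() { std::cout << "reducer sum: " << sum << "\n"; }
};

int main() {
    const int M = 2, R = 2;
    std::vector<ff_node*> mappers, reducers;
    for (int i = 0; i < M; ++i) mappers.push_back(new Mapper(R));
    for (int i = 0; i < R; ++i) reducers.push_back(new Reducer);
    ff_a2a a2a;
    a2a.add_firstset(mappers, 0, true);   // left-hand set (cleanup enabled)
    a2a.add_secondset(reducers, true);    // right-hand set (cleanup enabled)
    return a2a.run_and_wait_end();
}
```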

Fig. 6 BBs-based implementation of the SimilarityJoin pattern: a pipeline of two all-to-alls. The left-hand Workers are the Mappers. The right-hand side ones are the Reducers

Fig. 7 Resource-optimized implementation of the SimilarityJoin pattern with FastFlow BBs: a single all-to-all with feedback channels

In the FastFlow-based implementation, Mappers send the key-value pairs directly to Reducers in a streaming fashion using the shuffle communication pattern of the all-to-all BB. The burden of sorting data is left to the Reducers, which perform this step while receiving data, partially amortizing its cost by overlapping computation and communication. The same kind of streaming-like computation is also exploited in the next phase through the feedback channel communications that distribute the histograms back to the Mappers, which merge each received histogram partition into a local, private data structure.

We now briefly go through the main steps of the algorithm, giving some details of how they are implemented in FastFlow. The first phase is the computation of histograms. Mappers, in parallel, seek to their partition and start parsing the memory-mapped dataset line by line, calling the provided parsing function. For each item, they invoke all the given LSH functions and emit the results to the corresponding Reducers along with the dataset tag (R/S). Reducers count the frequency of hash values based on the dataset tag and store the sender id of the Mapper. The latter is used to send back, through the feedback channels, only the relevant frequencies to each Mapper. The Reducers trigger the distribution of histograms once they have received all data from all Mappers, indicating that no more key-value pairs will be emitted in this phase. This kind of end-of-phase signal is implemented at the pattern run-time with a custom <key, value> pair detected by the FastFlow BB implementing the a2a Reducers. Afterward, as soon as all histograms produced by the Reducers have been received and merged by the Mappers, the next phase (i.e., similarity join) can start. Now, for each hashed value of all elements of the dataset partition, the Mapper knows the frequency of that value for both sides (R/S) and thus may decide to discard or multicast the computed content to a proper subset of Reducers. A hashed value is discarded if it comes from one side and the frequency of the other side is equal to zero (line 7 of Fig. 8). This pre-filter means that there is no item of the other set that could possibly be similar to the one being processed, according to the LSH family of functions provided. The number of Reducers required to balance the computation for that hash value is obtained by dividing the maximum frequency between sides R and S by the parameter \(f_{max}\) (line 8).

Fig. 8 Algorithm followed by the Mappers during the similarity join phase to implement the randomised communication schema in FastFlow, ensuring load balancing among Reducers for skewed datasets. The process function is the one called by the Mapper for each LSH value of each object. Random Unicast and Broadcast are commodity procedures wrapping FastFlow's ff_send_out_to method

This value (\(f_{max}\)) denotes the number of per-key records that a Reducer should store and process during the similarity join step. The computed number of required Reducers is always bounded by the number of instantiated Reducers. Once the number of Reducers is defined, we need to determine which ones will be used among all those available. To do so, the first Reducer is chosen by computing the LSH value modulo the number of instantiated Reducers (line 9), and the subsequent ones are selected until the required amount is reached. Then, if a hash comes from an element of side X (either R or S) and the maximum frequency is from the same side X, the element is sent to one of the selected Reducers; otherwise, it is sent to all selected Reducers (lines 11,12 and 14,15). This way, we minimize the quantity of replicated data, broadcasting only the elements from the smaller side (R or S) for a given key. This procedure is summarized in Fig. 8 and is implemented exploiting the ff_send_out_to method of the multi-output sequential BB.
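Putting the two paragraphs above together, the Mapper-side routing decision (cf. Fig. 8) can be sketched as a plain function; names and the random-number handling are assumptions, and in the pattern the unicast/broadcast primitives wrap FastFlow's ff_send_out_to towards the selected Reducers.

```cpp
// Sketch of the Mapper-side routing decision of the similarity join phase.
#include <algorithm>
#include <cstdint>
#include <random>
#include <vector>

enum class Side { R, S };

struct Frequencies { std::uint64_t fR, fS; };   // per-key histogram entry

// Returns the Reducer indices the record must be sent to
// (an empty result means the record is discarded by the pre-filter).
std::vector<int> route(std::uint64_t lsh, Side side, const Frequencies& f,
                       int numReducers, std::uint64_t f_max, std::mt19937& rng) {
    // Pre-filter: no possibly-similar item on the other side for this key.
    if ((side == Side::R && f.fS == 0) || (side == Side::S && f.fR == 0)) return {};

    // Number of Reducers needed to balance this key, bounded by those instantiated.
    const std::uint64_t maxFreq = std::max(f.fR, f.fS);
    const int needed = static_cast<int>(std::min<std::uint64_t>(
        static_cast<std::uint64_t>(numReducers), (maxFreq + f_max - 1) / f_max));

    const int first =
        static_cast<int>(lsh % static_cast<std::uint64_t>(numReducers)); // first selected Reducer
    const Side heavy = (f.fR >= f.fS) ? Side::R : Side::S;               // side with max frequency

    std::vector<int> targets;
    if (side == heavy) {                                     // random unicast
        std::uniform_int_distribution<int> U(0, needed - 1);
        targets.push_back((first + U(rng)) % numReducers);
    } else {                                                 // broadcast to the selected set
        for (int i = 0; i < needed; ++i) targets.push_back((first + i) % numReducers);
    }
    return targets;
}
```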

Once all the <key, value> pairs have been generated during the second phase, Mappers send the FastFlow end-of-stream (EOS) message and terminate. Reducers collect all the items and sort them as they are received. As soon as all EOS messages have been received, the similarity join procedure is triggered. It consists of testing the provided predicate over all the possible combinations of the elements with the same locality-sensitive hash key. For similar items, the Reducer prints to its output file the pair of related IDs.

In both map-reduce phases, data is sent from Mappers to Reducers in batches. This feature has been added to reduce the number of exchanged messages and, most of all, to optimize memory allocation for shared-memory executions. The batching size can be set in the pattern constructor; its default value is 256.

Concerning the distributed-memory execution of the SimilarityJoin pattern, the user needs to properly define the configuration file provided as the first parameter of the pattern instance. Based on the number of non-empty lines contained in the configuration file, the corresponding number of horizontal cuts is automatically applied to the a2a BB, which defines the implementation skeleton of the pattern. Each resulting dgroup has a unique identifier (e.g., G0, G1, ..., one per line of the configuration file). These identifiers are used to define the dgroup-to-host mapping and to tag inter-group communications in the FastFlow RTS. The group Gk is created with the number of Mappers and Reducers specified by the user in line k of the configuration file. Communications between Mappers and Reducers of the same group happen in the shared-memory domain (i.e., communication channels are implemented with lock-free shared-memory queues), whereas communications between a Mapper in one group and a Reducer in a different group happen in the distributed-memory domain (i.e., inter-node communications). Data is serialized only if the communication is inter-dgroup. Data serialization is entirely entrusted to the Cereal library, which requires the user to provide a serialization function for the type T specified in the pattern template if and only if T is a custom data type (i.e., a user-defined class or struct).

4 Experimental Evaluation

In this section, we assess both the quality of the new C++ SimilarityJoin pattern based on the FastFlow library and its parallel performance on two different clusters with different interconnections, main memory availability, and number of cores per node. We consider the hand-tuned Hadoop-based implementation of the MRS-join computation presented in [6] as the baseline. Additionally, we compare the performance against the same application implemented from scratch in Apache Spark [31] (simply Spark from now on).

Platforms Used. The experiments were carried out on two different clusters. The first one, called Mirev, is hosted by the University of Orléans and is composed of two servers connected by a switched 1 Gbit/s Ethernet network. Each server has 256 GB of memory, two Intel Xeon Gold 6248R CPUs running at 3.0 GHz for a total of 48 physical cores (96 HW threads), and a fast dedicated NVMe disk for local storage. The second cluster, called Openhpc4, is hosted by the University of Pisa's Green Datacenter. It includes 16 nodes connected by a 100 Gbit/s Infiniband network. Each server is a diskless node with two Intel Xeon Silver 4114 CPUs running at 3.0 GHz for a total of 20 physical cores (40 HW threads). About 128 GB of memory is reserved for the applications on each Openhpc4 cluster node. All distributed tests presented in this section were executed using the TCP/IP transport protocol for the FastFlow-based implementation.

Datasets. To use different-sized datasets for the experiments, we employed a synthetic data generator that allows us to specify the number of similarity join outputs. To this end, the generator takes as parameters the number of clusters and their sizes in the datasets R and S, and a threshold value. For each cluster, the algorithm generates a random trajectory used as a template. The cluster trajectories take this template and alter it according to the given threshold. The dataset is supplemented with noise, i.e., several random trajectories that will not produce any similarity join output. For example, in the smallest dataset (i.e., 5 GB), there are 5000 clusters of size 200 equally distributed in R and S, with 2,000,000 additional random trajectories. For larger datasets, we increase the number of clusters and the number of random trajectories by the same factor so that the number of results produced follows the same scaling. All generated trajectories are 2-dimensional and comprise an average of 50 points spaced according to the given threshold. In our tests, we used three distinct datasets, namely: a “small” dataset of 5 GB with 50 million similar trajectories, a “medium” dataset of 50 GB with 500 million similar trajectories, and a “large” dataset of 100 GB with 1 billion similar trajectories. For the out-of-core experiments, we tested a huge dataset of 200 GB.

4.1 SimilarityJoin Pattern Validation

Table 1 Percentage of similarities discovered by the two implementations for the three distinct datasets considered

To validate the FastFlow-based SimilarityJoin pattern implementation, we compared the number of similar trajectories found for the three datasets considered with those found by the Hadoop-based implementation proposed in [6]. The results obtained running the two versions on the Mirev cluster are reported in Table 1. The number of similar trajectories found (reported as a percentage in the table) is almost the same for all datasets. The FastFlow-based version finds a small extra fraction of similar trajectories. We did not investigate this small difference deeply; it is reasonably due to different random seeds and random number generators. For the validation tests and for all performance tests presented in the following, we used 8 LSH functions and a threshold value of 10 (i.e., \(n=8\) and \(\lambda = 10\) in the code in Fig. 5). It is worth noting that with 8 LSH functions the memory requirement for the largest dataset is significant (about 2.4\(\times\) the input size). Specifically, the amount of memory needed to run the SimilarityJoin pattern on a single node with the 100 GB dataset is a bit more than 240 GB. Therefore, the FastFlow-based version cannot be executed either on a single Openhpc4 node or on a single Mirev node. On the nodes of both clusters, the virtual memory size is equal to the available physical memory (i.e., the nominal memory minus the space used by the OS and the running services). For this reason, the initial validation of the FastFlow-based version with the 100 GB dataset has been conducted on two Mirev nodes. We will discuss how we handle out-of-core executions in the FastFlow-based pattern in Sect. 4.3. Finally, the number of similarities found in the dataset strongly depends on the number of LSH functions used. For example, for the 5 GB dataset, with 2 LSH functions the similarity score falls to 73.7%, whereas with 16 LSH functions it reaches 99.99%.

4.2 Performance evaluation

The first test aims to evaluate the batchSize parameter of the SimilarityJoin pattern (cf. Fig. 4). This internal pattern parameter aims both to reduce the number of messages exchanged between Mappers and Reducers in each iteration phase and to favor contiguous main-memory allocation for messages in the FastFlow run-time. The results of the test conducted on one node of both clusters are shown in Fig. 9 (top left-hand side). The dataset is the 5 GB one. The optimal value of the batchSize for both platforms falls in the range 32–512. With smaller values, the overhead of memory management in the FastFlow run-time is higher. Higher values of batchSize might reduce the pipeline parallelism between Mappers and Reducers.

Fig. 9 Dataset size: 5 GB. Top left: impact of the batchSize parameter on the execution time. Top right: execution time varying the number of Mappers and Reducers on a single Mirev node (batchSize = 256). Bottom left: execution time varying the number of Mappers and Reducers on a single Openhpc4 node (batchSize = 256). Bottom right: impact of the FastFlow distributed batching on the execution time on two cluster nodes (batchSize = 256)

The second test aims to estimate a good value for the number of Mappers and Reducers on a single node. Again, we used the smallest dataset (i.e., 5 GB) and a single cluster node. The batchSize was fixed to 256. Figure 9 shows the results obtained on one Mirev node (top right-hand side) and on one Openhpc4 node (bottom left-hand side). In these tests, we did not verify all possible configurations for the number of Mappers and Reducers. Our aim was to estimate a “close-to-optimal” value for the per-node number of Mappers and Reducers of the SimilarityJoin pattern and to verify that using either the total number of physical cores (assigning half of them to the Mappers and the other half to the Reducers) or the total number of logical cores (again split half and half) is a reasonable choice for those parameters. Specifically, on the Mirev node the best values among those tested are 24–24, whereas on the Openhpc4 node the best values are 20–20 (thus filling all logical cores of the node).

The next test was to estimate a good value of the distributed batching size (called dFF_batch) of the FastFlow run-time system, which is used to transparently optimize the network bandwidth in distributed communications. For this test, we employed two nodes of each cluster. Unlike the batchSize parameter, the distributed batching configuration parameter is used only for distributed communications between non-local-node Mappers and Reducers. This value depends on the amount of data transmitted (and thus also on the value of batchSize) and on the kind of network and protocol used. However, as can be seen in Fig. 9 (bottom right-hand side), with large enough application batching (i.e., batchSize=256), there is no significant difference between the case of dFF_batch=1 (i.e., no distributed batching) and the case of dFF_batch=8, in particular for the fastest network (i.e., Infiniband in the Openhpc4 cluster). Given the results of this test, we decided to set dFF_batch to 4 for the SimilarityJoin pattern implementation. All subsequent tests were executed with this fixed value regardless of the cluster used.

Table 2 Execution times (in seconds) varying the number of machines up to 16 nodes on the Openhpc4 cluster and up to 2 nodes on the Mirev cluster for the three datasets considered

Once all pattern parameters had been studied (i.e., batchSize, the number of per-node Mappers and Reducers, and dFF_batch), we tested the SimilarityJoin pattern scalability by increasing both the dataset size and the number of cluster nodes. Table 2 summarizes the results obtained on the Openhpc4 cluster for the three datasets considered, varying the number of nodes from 1 to 16. Given the relatively small amount of main memory available on each node of the Openhpc4 cluster, the 50 GB and 100 GB datasets cannot be executed on 1 and 2 cluster nodes, respectively. However, the execution time scales reasonably well with the number of nodes for the largest dataset (100 GB). On the Mirev cluster, given the larger number of cores per processor and the slower network, there is no performance improvement when moving from 1 node to 2 nodes for the 5 GB dataset. The single-node shared-memory version of the pattern exploiting all physical cores of the machine (i.e., \(24+24\)) is already relatively fast (about 17 s). For the 50 GB dataset, the execution time improvement is marginal, from 283 to 228 s. Instead, the SimilarityJoin pattern cannot be executed for the largest dataset on a single node because of insufficient main memory (we will study out-of-core SimilarityJoin pattern executions in Sect. 4.3). However, it can be executed on two Mirev nodes, giving an execution time of about 450 s.

Table 3 Execution times (in seconds) of the Hadoop version when using 1 and 2 nodes on the Mirev cluster

We executed the Hadoop version on the Mirev cluster for all datasets to have a performance comparison. The Hadoop framework is configured to use the HDFS filesystem with 1 name node and 2 data nodes. The results obtained are shown in Table 3. As expected, the overhead introduced by the Hadoop framework is not negligible for small datasets. On the other hand, for big datasets, Hadoop can execute on a single node with constrained memory (192 GB) and obtains a relevant reduction of the execution time (about 300 s) on two cluster nodes. The execution time differences between the Hadoop version and the SimilarityJoin pattern implemented in FastFlow are primarily due to: (1) the extensive use of the HDFS filesystem in Hadoop, which introduces overheads but enables running with “larger-than-memory” data (out-of-core processing); (2) the storing of intermediate phase results into files; (3) the serialization of all messages between Mappers and Reducers. Finally, Hadoop provides the user with fault-tolerance capabilities, although not enabled in our tests, a feature currently missing in the SimilarityJoin pattern implementation.

As a further performance comparison, we implemented the similarity join application with the Spark framework (version 3.5.0) using Java. We executed the tests on the Mirev cluster with the same HDFS configuration used for the Hadoop tests. Spark's configuration is reported on the left-hand side of Table 4. The Spark implementation comprises two jobs: the first computes the histograms, and the second computes the actual similarity join. The whole transformation graph of the application is sketched in Fig. 10. Firstly, the dataset is read from HDFS and parsed by a map operator. The resulting RDD is then used in both jobs. The histograms map-reduce computation is based on flatMapToPair and reduceByKey operators. The results are collected into a hash-map by the driver and broadcast to all executors to be used as the read-only state in the second job. The similarity join part comprises three computational stages: map, combine, and reduce (flatMapToPair, combineByKey, and reduceByKey operators, respectively). The final results are written back to HDFS. On the right-hand side of Table 4, we report the time obtained by the Spark-based version for the 5 GB dataset. For the reader's convenience, we also report in the table the times obtained by the FastFlow-based SimilarityJoin pattern (FF), the Hadoop-based version (HD), and the Spark-based version (SP). As expected, the Spark version obtains a lower execution time than the Hadoop version, being more aggressive in using the main memory. However, our solution is confirmed to be the most performant.

Fig. 10 Data-flow graph of the similarity join application implemented in Spark

Table 4 Spark’s configuration parameters used in the tests and execution times comparison

4.3 Dealing with out-of-core executions

To allow the user to run the SimilarityJoin pattern on machines with constrained memory, thus overcoming the usage limitation in out-of-core computations (cf. the label “no mem” in Table 2), we designed a version of the pattern that uses a persistent memory allocator. Specifically, we used the Metall [10] library, which implements a persistent allocator built on top of the memory-mapped file OS mechanism (mmap(2)) to allow applications to allocate and access data in persistent memory (i.e., disks and NVRAM) transparently. Since the overall performance of a file-based allocator is generally worse than that of the standard heap allocator, we decided that the user must explicitly enable this new pattern configuration at compile time, if needed. In this case, the FastFlow RTS is forced to use only queues with bounded capacity (except for feedback channels) to further reduce memory usage. To minimize the modifications to the pattern's run-time system, we defined a custom C++ memory allocator leveraging the Metall persistent allocator and used it in all internal C++ data containers. This way, the new pattern version can be enabled by passing the custom allocator instead of the C++ standard allocator to all internal containers. Additionally, to be sure that all data will be allocated with the persistent allocator, the data type used (i.e., the type T in Fig. 4) should be a POD (Plain Old Data) data type. This is the only aspect, in the current version of the pattern, that might have an impact on the programmer when using the persistent allocator for out-of-core executions.
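For reference, the following is a minimal sketch (directory and object names are illustrative) of the underlying mechanism: a standard C++ container whose allocations go to Metall's file-based persistent allocator instead of the default heap allocator.

```cpp
// Sketch: backing a std::vector with Metall's persistent allocator.
#include <metall/metall.hpp>
#include <vector>

using alloc_t = metall::manager::allocator_type<int>;
using vec_t   = std::vector<int, alloc_t>;

int main() {
    // All data allocated through this manager lives in memory-mapped files
    // created under the given directory.
    metall::manager manager(metall::create_only, "/tmp/sj_datastore");

    // Construct the container inside the Metall-managed space and fill it.
    vec_t* v = manager.construct<vec_t>("intermediate")(manager.get_allocator<int>());
    for (int i = 0; i < 1000; ++i) v->push_back(i);
    return 0;
}
```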

Table 5 Execution time (in seconds) on one Mirev node (i.e., the shared-memory version with 24 Mappers and 24 Reducers) obtained using the 100 GB dataset and varying the global cache capacity (HEAP_CACHE_SIZE) and the cache-entry bucket length threshold (HEAP_CACHE_BUCKET_LENGTH)

On the one hand, this simple solution allowed us to run the largest dataset (i.e., 100 GB) with the SimilarityJoin pattern on a single Mirev node. On the other hand, the total execution time obtained is 2137 s, i.e., about 1000 s higher than the Hadoop version on one node. Such high overhead is produced by the Metall-based custom allocator, since Metall is very aggressive in using the storage to keep all data persistent, a feature that the SimilarityJoin pattern does not currently exploit. Moreover, we noticed that Metall, with the default Linux kernel configuration (Linux 5.14.0-1046-oem, Ubuntu 20.04.4 LTS), uses only a small fraction of the available main memory, about 30 GB out of the 256 GB available on one Mirev node. To mitigate the overhead of the persistent allocator, we introduced an intermediate software data cache, implemented with a set of C++ containers all using the default C++ heap allocator, to temporarily store results before moving them to the Metall-allocated container. The cache reduces the pressure on the permanent storage and potentially increases the performance. It has a configurable global capacity (HEAP_CACHE_SIZE). Once the global capacity threshold is exceeded, some cache entries (a cache entry corresponds to a key) are flushed into the Metall-allocated container in random order until the current size falls back to about 80% of the global capacity. In addition, to increase data locality when inserting data into the Metall-allocated container, each time a cache-entry bucket length exceeds a fixed static threshold (HEAP_CACHE_BUCKET_LENGTH), the entire bucket is flushed into the Metall-allocated container regardless of the current cache size. We experimentally found that such a simple flushing policy allows us to increase the amount of main memory used and reduce the overhead without impairing the functional execution. The experimental results obtained varying the global cache capacity and the bucket length for the 100 GB dataset are reported in Table 5. The lowest execution time (1140 s) was obtained with a global cache capacity of 96 GB and a cache-entry bucket length threshold of 12. We used these “optimal values” for executing the SimilarityJoin pattern with the 100 GB dataset on 2 Mirev nodes, obtaining an execution time of 540 s, with a resulting overhead of 20% compared to the non-Metall-based version (i.e., 450 s). Interestingly, even considering the overhead introduced by the persistent allocator, the overall execution time on 2 Mirev nodes is still lower than that of the Hadoop-based version (cf. Table 3).
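A possible shape of this policy, as a self-contained sketch (class and parameter names are assumptions; the flush callback stands in for the move into the Metall-allocated container), is the following:

```cpp
// Sketch of the heap-based software cache: records are staged in ordinary heap
// memory and flushed either when one bucket grows past the bucket-length
// threshold or, in random order, when the global capacity is exceeded.
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <iterator>
#include <random>
#include <unordered_map>
#include <vector>

template <typename K, typename V>
class HeapCache {
    std::unordered_map<K, std::vector<V>> buckets_;           // heap-allocated staging area
    std::size_t size_ = 0;                                    // total number of cached records
    std::size_t capacity_, bucket_limit_;
    std::function<void(const K&, std::vector<V>&&)> flush_;   // moves data to the Metall container
    std::mt19937 rng_{42};
public:
    HeapCache(std::size_t capacity, std::size_t bucket_limit,
              std::function<void(const K&, std::vector<V>&&)> flush)
        : capacity_(capacity), bucket_limit_(bucket_limit), flush_(std::move(flush)) {}

    void insert(const K& key, V value) {
        auto& bucket = buckets_[key];
        bucket.push_back(std::move(value));
        ++size_;
        if (bucket.size() > bucket_limit_) {          // long bucket: flush it entirely
            size_ -= bucket.size();
            flush_(key, std::move(bucket));
            buckets_.erase(key);
        }
        if (size_ > capacity_) {                      // over capacity: flush random entries
            while (size_ > (capacity_ * 4) / 5 && !buckets_.empty()) {  // down to ~80%
                auto it = std::next(buckets_.begin(),
                    std::uniform_int_distribution<std::size_t>(0, buckets_.size() - 1)(rng_));
                size_ -= it->second.size();
                flush_(it->first, std::move(it->second));
                buckets_.erase(it);
            }
        }
    }
};

int main() {
    std::size_t flushed = 0;
    HeapCache<std::uint64_t, int> cache(1000, 12,
        [&](const std::uint64_t&, std::vector<int>&& bucket) { flushed += bucket.size(); });
    for (std::uint64_t k = 0; k < 500; ++k)
        for (int i = 0; i < 5; ++i) cache.insert(k % 100, i);  // skewed keys
    std::cout << "records flushed to (simulated) persistent storage: " << flushed << "\n";
}
```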

As a final test, we also tested a huge 200 GB dataset on 2 Mirev nodes, producing an out-of-core execution on both nodes. Our pattern-based solution took 1260 s to complete, while the Hadoop-based version took 1520 s (a speedup of about 1.2\(\times\)). These tests confirm the effectiveness of the approach taken to handle out-of-core computations.

5 Related work

Exact similarity joins have received considerable attention. Filtering and verification techniques use filters to eliminate comparisons that cannot satisfy the distance threshold. However, such techniques are metric-space-dependent and cannot be applied to the general case. On the other hand, the metric-space-partitioning technique permits handling the similarity join for any metric space. Nevertheless, the experimental survey [32] on exact set similarity joins shows that these methods often fail to compute the join on small datasets. Recently, in [33], “Bloom filters” and “Fuzzy filters” were introduced for fuzzy join operations to eliminate most non-joining elements in the input dataset before sending the data to the join processing step, thus reducing intermediate data and unnecessary comparisons. However, these approaches may face scalability problems when the probabilistic data structures used to store the filters cannot fit in the main memory of the processing nodes, especially for massive datasets. In an approximate context, i.e., for algorithms that do not produce the complete result, similarity join algorithms are usually based on Locality Sensitive Hashing (LSH) to generate candidate pairs by hashing similar input records into the same “buckets” with high probability. This drastically reduces the number of pair comparisons and generates almost all similarity join results. In the context of massively parallel computations, some recent works [34, 35] present an algorithm relying on LSH that achieves guarantees on result completeness and good utilization of the processing nodes. However, in the case of large and skewed datasets, the workload balance among computing nodes is not ensured. Recently, [36] extended the analysis and improved the algorithm using sketching and deduplication.

Concerning programmability, offering easy-to-use yet sophisticated parallel patterns for specific application domains to end-users, who are not necessarily familiar with parallel programming, is a relevant research topic in the field of high-performance distributed computing. Notably, several parallel programming libraries and domain-specific languages (DSLs) have been proposed in the context of structured parallel programming [37], e.g., Muesli [38], SkePU [39], SkeTo [40], GrPPI [41], SkelCL [42], Musket [43] and SPar [44]. FastFlow [7, 8] belongs to this category but, in addition to some high-level parallel patterns, it also provides a lower-level software layer to parallel programmers (i.e., the Building Blocks) to enable the easy development of new patterns and run-time systems while still following the structured parallel programming methodology [7]. Several examples of domain-specific parallel patterns can be found in the literature, ranging from exact combinatorial search [45] in the distributed-memory domain, to image filtering for visual data restoration [46] targeting heterogeneous many-cores equipped with GPU accelerators, to window-based stateful data-streaming operators for multi-core systems [47]. In the field of similarity joins, some works targeted different methodologies for different architectures: HySet [48] and fgssjoin [49] offer set similarity join exploiting both CPU and GPU accelerators but not clusters, whereas in [50] the authors proposed an online streaming approach using distributed systems. However, all previous works focus mainly on set similarity, while the SimilarityJoin pattern we propose in this work aims to target all similarity join operations.

6 Conclusions and Future Work

We proposed SimilarityJoin, a C++ high-level parallel pattern for computing similarity joins that relieves the user from many hidden pitfalls related to parallel programming. The implementation, based on FastFlow's BBs, follows the MapReduce computation paradigm, enabling efficient execution on a single multi-core server and on a cluster of multi-core nodes. The proposed solution has been validated using different-sized datasets to consider both in-core and out-of-core executions, and its scalability has been studied on a 16-node cluster. Additionally, we compared the performance of the proposed pattern with that obtained by a hand-tuned Hadoop implementation of the same use case and against a new implementation written in Apache Spark. The experiments yielded interesting qualitative and quantitative results confirming the validity of the SimilarityJoin pattern.

In future work, we intend to optimize the proposed high-level pattern, mainly regarding the persistent allocator, by employing different main-memory caching policies and malloc/free implementations. Moreover, we intend to improve and tune the Spark-based version to run horizontal scalability tests with large datasets.