Accurate and efficient privacy-preserving string matching

The task of calculating similarities between strings held by different organisations without revealing these strings is an increasingly important problem in areas such as health informatics, national censuses, genomics, and fraud detection. Most existing privacy-preserving string matching approaches are either based on comparing sets of encoded characters allowing only exact matching of encoded strings, or they are aimed at long genomics sequences that have a small alphabet. The set-based privacy-preserving similarity functions that are commonly used to compare name and address strings in the context of privacy-preserving record linkage do not take the positions of sub-strings into account. As a result, two very different strings can potentially be considered a match, leading to wrongly linked records. Furthermore, existing set-based techniques cannot identify the length of the longest common sub-string across two strings. In this paper, we propose two new approaches for accurate and efficient privacy-preserving string matching that provide privacy against various attacks. In the first approach we apply hashing-based encoding on sub-strings (q-grams) to compare sensitive strings, while in the second approach we generate one bit array from the sub-strings of a string to identify the longest common bit sequences. We evaluate our approaches on several data sets with different types of strings, and validate their privacy, accuracy, and complexity compared to three baseline techniques, showing that they outperform all baselines.


Introduction
In application domains such as banking, health, bioinformatics, and national security, integrating information from multiple databases has become an increasingly important part of decision making [11,18]. Integrating databases can help to identify and link similar records that correspond to the same entity, a task known as record linkage [9]. This in turn can facilitate efficient and effective data analysis that is not possible on an individual database.
Increasingly, record linkage needs to be conducted across databases held by different organisations [57]. Complementary information held by these organisations can, for example, help to identify patient groups that are susceptible to certain adverse drug reactions (linking doctor, hospital, and pharmacy databases), or detect welfare cheats (linking taxation with employment and social security databases). However, in many of these applications the databases to be linked contain sensitive information about people which cannot be shared between the organisations that are involved in a linkage protocol [11,57]. Similarly, in the bioinformatics domain the comparison of genomics data often raises confidentiality concerns as genomics sequences might contain proprietary information and such data are often highly sensitive in nature [50].
Research in the area of privacy-preserving record linkage (PPRL) [56] aims to develop techniques for linking databases without the need to share the original (unencoded) sensitive values between the organisations that participate in the linkage protocol. In PPRL, the attribute values of records are usually encoded or encrypted in some form before they are compared. Any encoding or encryption technique used must ensure that approximate similarities can still be calculated between encoded values without the need for sharing the corresponding sensitive plaintext values [56]. PPRL is conducted in such a way that only limited information about the record pairs classified as matches is revealed to the participating organisations in the linkage process. The techniques used in PPRL must guarantee that no participating party, nor any external party, can compromise the privacy of the entities that are represented by records in the databases being linked [11].
One popular technique to allow privacy-preserving string comparison is based on converting strings into sets of q-grams (sub-strings of length q characters) and encoding these sets into Bloom filters (BFs) [44]. BFs are bit arrays where multiple independent hash functions are used to encode the elements of a set by setting those bit positions to 1 that are hit by a hash function. BFs can be compared using set-based similarity functions such as the Dice coefficient [9]. It has been shown that BF-based PPRL is both efficient and can achieve accurate linkage results comparable to non-PPRL approaches [41,42]. A related approach based on tabulation hash (TMH) encoding was recently proposed by Smith [51]. This approach applies Min-hash locality sensitive hashing [5] and uses the Jaccard similarity function for comparing bit arrays.
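The BF encoding described above can be illustrated with a minimal sketch. The filter length b = 64, the number of hash functions h = 3, the use of SHA-256 with an index prefix to simulate independent hash functions, and all function names are assumptions made for illustration only; they are not the configuration used in [44].

```python
import hashlib

def qgrams(s, q=2):
    """Return the set of q-grams (sub-strings of length q) of s."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def bloom_encode(grams, b=64, h=3):
    """Encode a q-gram set into a bit array of length b using h hash functions.

    The h hash functions are simulated by prefixing the q-gram with the
    function index before hashing (an illustrative assumption).
    """
    bits = [0] * b
    for g in grams:
        for k in range(h):
            digest = hashlib.sha256(f"{k}:{g}".encode("utf-8")).digest()
            bits[int.from_bytes(digest, "big") % b] = 1
    return bits

def dice(bf1, bf2):
    """Dice coefficient of two bit arrays of equal length."""
    common = sum(x & y for x, y in zip(bf1, bf2))
    total = sum(bf1) + sum(bf2)
    return 2 * common / total if total else 0.0
```

Because only the set of q-grams is hashed, the resulting bit pattern carries no information about where in the string a q-gram occurred.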
Table 1 Example string pairs from a real US voter database [10] that have the same set of q-grams with q = 2 (bigrams), and therefore Jaccard or Dice coefficient similarities of 1. On the other hand, their edit distance similarities [9] are (correctly) much lower.
One drawback of set-based comparisons as used with BFs or TMH is that the sequence of characters in a string is lost when the string value is converted into a q-gram set. As shown in Table 1, in certain cases [15] two different strings can result in the same q-gram set which would be encoded into the same bit pattern. This can lead to falsely matched record pairs because of too high similarities between rather different string values [9]. The likelihood of two different strings sharing the same or a highly similar q-gram set increases if the size of the alphabet Σ (the set of unique characters used to generate the strings to be encoded) becomes smaller, because fewer unique q-grams can be generated. Therefore, strings generated using only digits (alphabet of size |Σ| = 10), such as zip codes or telephone numbers, will more likely result in increased q-gram set similarities compared to strings that contain letters (|Σ| = 26), such as first and last names. Another drawback of set-based string comparison functions is that they only allow the calculation of an overall similarity between two strings. However, identifying the longest common sub-string between two strings can be crucial in certain applications. For example, financial intelligence units around the world, including FinCEN (US), the National Crime Agency (UK), and AUSTRAC (Australia), collect financial information to help identify tax evasion, money laundering, and terrorism financing.
This involves linking records from different reporting entities such as banks, casinos, and money remitters such as Western Union, and requires finding matches in a privacy-preserving way where bank identifiers such as SWIFT or BIC codes need to be paired with bank account numbers. Sub-string matching is crucial because leading zeros are often omitted, such that the identifier "DK54000074491162" would be the same account as "DK5474491162".
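Both drawbacks can be reproduced on unencoded strings with a few lines of code. The digit strings "1212" and "2121" are hypothetical examples chosen for illustration (they are not taken from Table 1), and the function names are our own; the account identifiers are the ones quoted above.

```python
def bigram_set(s):
    """Set of bigrams (q = 2) of s, without padding."""
    return {s[i:i + 2] for i in range(len(s) - 1)}

def jaccard(a, b):
    """Jaccard similarity of two non-empty sets."""
    return len(a & b) / len(a | b)

def longest_common_substring(x, y):
    """Length of the longest common sub-string via dynamic programming."""
    best = 0
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0] * (len(y) + 1)
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best

# Two different strings with identical bigram sets: similarity 1.
print(jaccard(bigram_set("1212"), bigram_set("2121")))
# Leading zeros omitted: the shared account number is still found.
print(longest_common_substring("DK54000074491162", "DK5474491162"))
```

In the second case the longest common sub-string is "74491162", which a set-based similarity alone would not reveal.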
Contributions: In this paper, we propose two novel approaches to privacy-preserving string matching, where we encode each string based on its generated q-gram list. In the first approach, we encode the q-grams in each list into hash values, while in the second approach we encode each q-gram into a bit array of fixed length to improve privacy of q-grams. However, the second approach requires more runtime for encoding and comparison than the first, resulting in a trade-off between privacy and scalability of our two approaches. In both approaches, we randomly shift the encoded q-grams in order to hide position information that could be exploited by an adversary. The encoded strings are then sent to a third party for identifying the longest common encoded sub-string for each pair of encoded strings. We analyse our proposed approaches in terms of complexity, accuracy, and privacy, and evaluate them using several real and synthetic data sets that contain different types of strings (only letters, only digits, and mixed).

Related work
The privacy-preserving comparison of values (such as strings or numbers) is a common problem for many application domains. Therefore, various techniques and algorithms have been proposed, as shown in Table 2.
String matching is often used in the PPRL context where encoded values of quasi-identifying attributes of individuals (such as their names and addresses) need to be compared across two or more databases to link records [57]. Bloom filter (BF) encoding is widely used in PPRL because it allows efficient encoding of values and supports approximate matching of strings [44,57], numerical values [33,55], hierarchical codes (such as of occupation and diseases) [45,46], geographical locations [47], and Chinese characters [53].
Table 2 notation: l is the string length, |Σ| is the size of the alphabet Σ, h is the number of hash functions used, b is the length of a Bloom filter or bit array, t is the number of hash tables, q is the length of a sub-string (q-gram), and |D| is the size of a string database D.
Although BF encoding is considered a standard for PPRL, BFs cannot be used to identify the longest common sub-strings, because they require values to be converted into q-gram sets whereby the positional information of q-grams in their corresponding string values is lost. Furthermore, the hash functions used in BF encoding likely lead to collisions (where several q-grams are hashed to the same bit position) and therefore the similarities between BFs can be higher than the actual Dice coefficient similarity between their corresponding q-gram sets. Other set-based techniques, such as tabulation-based hashing (TMH) [51], have a similar drawback because any set-based encoding of q-grams into bit arrays does not preserve their positional order.
Privacy-preserving matching of genome sequences is increasingly required in bioinformatics applications where the aim is to find the longest matching sub-sequences for a query sequence in large genome databases [50,58]. The algorithms used in such applications often have high computational complexities [50,58].
Shimizu et al. [50] proposed an approach for searching similar string patterns in a genome database using a recursive oblivious transfer protocol based on an additive homomorphic encryption [25] to query sequences in the genome database while ensuring each query does not lead to the identification of other similar strings in the database. However, the approach does not scale to queries of longer sequences because these incur high computational and communication costs due to the complex cryptographic functions used [50].
Later, Nakagawa et al. [40] proposed an approach to improve the time complexity and communication costs of genome sequence matching. They used a recursive oblivious transfer technique and a compressed indexing data structure [24] to find the longest prefix and longest exact match of a query sequence in a genome database. In this approach, the time complexity depends upon the length of the genome sequence that is being queried rather than the size of the genome database; thus, it takes less time to query a sequence from a large genome database compared to the approach proposed by Shimizu et al. [50] and another secure genome sequence matching technique [52].
Suffix trees [37] are often used in bioinformatics applications to search for patterns in genome or protein sequences [59]. A suffix tree allows searching for a given pattern with a linear complexity in terms of the length of the query string being searched [37]. Ukkonen [54] showed how suffix trees can be used for string matching efficiently; however, his approach requires more space to hold a suffix tree than the original string collection.
The use of suffix trees in privacy-preserving sub-string matching has been investigated by Chase and Shen [6]. Their approach constructs a queryable encryption scheme for finding all occurrences of a query string in the encrypted suffix tree stored on an untrusted server. However, this approach reveals information of client queries to the server which compromises the privacy of a client's data and can potentially lead to the identification of the encrypted string values.
Bezawada et al. [3] proposed a protocol based on a pattern aware secure search tree where each tree node contains a BF that encodes a set of encrypted strings. The approach is aimed for two parties to compare strings securely over a cloud infrastructure, where the parties only learn if their strings are matched but not the actual matching sub-strings. This approach therefore does not allow the privacy-preserving identification of the longest common sub-strings.
Chen et al. [7] proposed a secure pattern matching approach based on suffix arrays and order hashing, where each hashed character in a string is concatenated with the hashed value of the next character in the string. In this approach a database owner (DO) sends the encrypted data to an untrusted server and transmits the key to the clients. This key is then used by the clients for verifying if the encoded query sub-string results that the clients received from the server are correct. However, in this approach the server can learn the actual string length from the encrypted suffix array received from the DO.
Hahn et al. [29] proposed a privacy-preserving secure sub-string or q-gram query approach where the frequency distribution of sensitive data is hidden by applying a frequency-hiding order preserving encryption [35]. This approach involves three parties for processing a secure q-gram query, which are (1) the DO which owns the sensitive information, encodes the q-grams, and generates the encoded data tables (indexes and q-grams); (2) the untrusted party which holds the index tables that are used for searching encoded q-grams; and (3) the clients who want to query their encoded q-grams. In this approach, the complexity of querying depends upon the q-gram length because it determines the number of iterations required for querying encoded q-grams from the untrusted party.
Bonomi et al. [4] proposed a PPRL approach to compare string values using bit arrays based on the embedding of the frequent q-grams. The DOs that participate in a PPRL protocol individually apply differential privacy [20] to generate a table of frequent q-grams that occur in their databases. The DOs then send their frequent q-gram tables to one of the DOs to find the common frequent q-grams (shared frequent q-grams) and send them to all DOs that participate in the protocol. Each DO then uses the shared frequent q-grams to embed their strings into bit arrays, and sends these bit arrays to a third party to compare pairs of bit arrays. However, in this approach the DO that identifies the shared frequent q-grams is able to learn the frequent q-grams of the other databases which can compromise the privacy of the entities in those databases.
Zarazadeh et al. [60] proposed a protocol for secure pattern matching on a client and server architecture using ElGamal encryption [22] and bit arrays. This approach is able to match either exact or approximate patterns with or without wildcard characters, or patterns with random bit vectors added (for hiding the length of strings). The server cannot learn anything about the pattern matching results. However, the clients can learn the positions of a matched pattern in a string in the database stored on the server.
Essex [23] proposed a secure two-party approximate string matching protocol using DGK homomorphic encryption [16] and private set intersection cardinality. Each DO first generates a list of all possible q-grams (based on letters a to z), where the DO replaces a q-gram in the list of all possible q-grams with the encryption of 1 if a q-gram is found in its string. The first DO sends its list to the other DO to conduct a set intersection cardinality and the Dice coefficient calculation based on the lengths of q-gram lists of the two strings in a pair, then returns the results to the first DO for decrypting the results, where the decryption of 1 means a pair of strings is classified as a match. However, using the lengths of the lists of q-grams to calculate the Dice coefficient can lead to false matches because some q-grams in the two lists are not common. Furthermore, this approach consumes a lot of memory as the list of all possible q-grams for each string needs to be kept in memory for the comparison process.
Recently, Mullaymeri and Karakasidis [39] proposed a two-party private approximate string matching protocol based on polynomial coefficients generated using a reference database and a Fuzzy Vault scheme [31]. The idea behind this approach is that if the set of keys (generated from reference strings) of the two strings in a pair are similar, the polynomial coefficients generated from the keys of these two strings must be the same, and therefore, the two strings in a pair are classified as a match. However, the main drawback of this approach is that the reference strings that are used to generate keys must be very similar to the strings in a pair to ensure that the polynomial coefficients of the two strings are the same. Therefore, the number of reference strings must be large enough to allow the two parties to generate the same polynomial coefficients.
The approaches discussed above mostly allow a user to query a database of strings or sequences for similar patterns, while the problem we aim to address involves the identification of similar sub-strings in two databases owned by different parties without each party having to reveal their strings. In contrast to most existing techniques, our approaches allow the efficient and accurate privacy-preserving comparison of pairs of strings that share a sub-string with a certain minimum length.

Privacy-preserving string matching
We assume our approaches follow an honest-but-curious (HBC) adversary model [27,30]. As illustrated in Fig. 1, we assume two database owners (DOs) want to find the length of the longest common sub-string (LCS) between pairs of sensitive strings in their databases. The DOs do not communicate with each other, except to agree on the parameters to be used. We assume a linkage unit (LU), which is a semi-trusted party [26], is involved in the protocol to compare the strings sent to it by the DOs. Because the DOs do not want to reveal the sensitive string values in their databases to any other party that participates in the protocol, these strings need to be encoded before being sent to the LU such that the LU cannot learn anything about them.
In some cases, the first characters in a string value can reveal some information. For example, the distribution of the first digits in numerical values can follow Benford's law [2], while the first letters in first and last names can follow Zipf's law [61]. This potentially allows an adversary to learn some of the encoded q-grams at the beginning of encodings by identifying the q-grams that occur frequently in a public database [11]. Hence, we propose two novel encoding approaches to prevent any q-grams from being re-identified. In our encoding approaches (as illustrated in Fig. 1), the DOs first generate sub-strings of length q, called q-grams, from their unique sensitive string values. The DOs then individually encode all q-grams of each string and send these encoded q-grams to the LU. The LU then compares the encodings of a pair of strings and returns the length of the longest common sequence of hash encoded q-grams (elements), called the longest common elements (LCE), or the length of the longest common bit array, called the longest common bit array (LCB), back to the DOs. The DOs can then calculate the actual length of the LCS based on the information received from the LU, as we describe in Sect. 6.
As we discuss in Sect. 4, the first encoding approach improves privacy of encoded q-grams by randomly shifting the position of the encoded q-grams in the generated encoded q-gram lists. The shifting of encoded q-grams hides their actual positions in a string, which makes a position-based frequency analysis of q-grams more difficult and thereby prevents an adversary from identifying the string values that were encoded. This approach is useful for linking databases that require fast and accurate linkage results, such as linking the phone number of a criminal between databases to facilitate a fast response for police to take action.
In the second approach, described in Sect. 5, we improve the privacy of q-grams further by encoding q-grams into bit arrays. We hide the actual sub-string positions and the length of the encoded strings by adding random bit arrays at the beginning and end of the bit array that encodes a list of q-grams. This ensures that the bit arrays of all string values in a database have the same length, further increasing the difficulty for an adversary to identify the original string values that have been encoded into bit arrays. However, this approach uses more runtime than our first approach. Therefore, it can be useful for linking databases in application domains that require high privacy and accurate linkage results, but are less concerned about runtime, for example, linking credit card numbers between databases for financial crime investigations.
Both our approaches provide accurate calculations of the length of LCS while hiding the actual sensitive string values from any parties. The DOs and the LU cannot learn the original string values nor the positions of the LCS from the compared encodings. A DO cannot learn anything about the sensitive values of the other DO because the DOs do not communicate with each other, and they only receive the length of the LCE or LCB, respectively, from the LU.
As notation, we use italic letters for numbers and strings, bold lowercase letters for lists and sets, and bold uppercase letters for lists and sets of lists or sets. We use || to denote the concatenation of strings and bit arrays and + when concatenating lists. Lists are shown with square brackets and sets with curly brackets, where lists have an order while sets do not. We show the elements of a list l as l

Parameter agreement
Before the protocol starts, the DOs agree on the parameters to be used, which are:
- The length of q-grams, q, to be used for generating the q-grams, as we describe in Sect. 4.2.
- The padding characters, α and β, to be added to string values to avoid incorrect calculations of the length of the LCS, as we describe in Sects. 4.2 and 8.2.
- The secret salting value, s, to be concatenated with the generated q-grams. This is to avoid a dictionary attack by the LU [11], as we describe in Sect. 4.3.
- The one-way hash function [11], H (such as SHA [43]), to be used for hashing q-grams before sending them from the DOs to the LU, as we describe in Sect. 4.3.
- The minimum length of the LCS, m, where m ≥ q. This is used for selecting those string pairs that have an LCS of at least m, as we describe in Sect. 6.

Generating Q-grams
Before the generation of q-grams from a string value, the DOs first add the agreed padding characters α and β to the beginning and end of their unique strings. Let us assume the first DO has a database D_A and uses the padding character α, while the second DO has a database D_B and uses the padding character β, where α ≠ β. The padding characters are used to ensure the beginning and end of the compared strings are different. Due to the shifting process of encoded q-grams (as we describe in Sect. 4.3), without the padding characters the length of the LCS could be calculated incorrectly, as we discuss in Sect. 8.2.
Let Σ be the alphabet of all characters in the databases, where a is a character in a string value v in the two databases. It needs to hold that α ∉ Σ and β ∉ Σ. Let us assume two padded strings x' = α||x||α and y' = β||y||β, where x and y are strings with x ∈ D_A and y ∈ D_B, respectively.
Once the DOs have added the padding characters to their strings, they independently generate the q-gram lists of each padded string in their databases. Each padded string in a database (let us use D_A) consists of characters and can be written as x' = [a_0, ..., a_i, ..., a_{n−1}], where a_0 = a_{n−1} = α, and a_i ∈ Σ for 0 < i < n − 1, with n = |x'|. We define a q-gram as q_i = a_i ... a_{i+q−1}, and a q-gram list as q = [q_0, ..., q_{n−q}], where the index bound uses the q-gram length q as agreed by the DOs.
For example, assume the agreed q-gram length is q = 2 and the string is x = "mary". The DO adds the padding character α = "$" to x, resulting in x' = "$mary$". The DO then generates a q-gram list of the padded string x', resulting in q = [$m, ma, ar, ry, y$], as also shown in the third column in Table 3.
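The padding and q-gram generation step can be sketched as follows. The function name and signature are our own and serve only as an illustration of the definitions above.

```python
def qgram_list(value, pad, q=2):
    """Pad a string at both ends and return its ordered list of q-grams.

    For a padded string of length n the list holds n - q + 1 q-grams,
    q_0 to q_{n-q}, in their original order.
    """
    padded = pad + value + pad
    return [padded[i:i + q] for i in range(len(padded) - q + 1)]

print(qgram_list("mary", "$"))   # q-gram list of the first DO
print(qgram_list("marry", "#"))  # second DO uses a different padding character
```

Because the two DOs use different padding characters, the first and last q-grams of their lists can never be equal.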

Hashing of Q-grams and shifting Q-gram lists
Before the DOs send their databases to the LU, they individually hash encode the q-grams in each of the q-gram lists using the agreed hash function H. To prevent a dictionary attack on the encoded q-grams, we use a salted hash encoding approach [11]. Given that s is the secret salt value and H is the hash function agreed by the DOs, and assuming q_i ∈ q, where q is the q-gram list and 0 ≤ i < n, with n = |q|, we hash encode each q-gram q_i as h_i = H(q_i||s), and define the hash encoded q-gram list as h = [h_0, ..., h_{n−1}].
Once the q-grams in each list q are hashed into a list h, each DO generates a random number, r, for each of its hash encoded q-gram lists, h, where 0 ≤ r < |h|. The DO then shifts (rotates) the list h by r, resulting in the shifted hash encoded q-gram list, h'. A given hash encoded value at a position i, with 0 ≤ i < n, is shifted to a new position i' = (i + r) mod n, also with 0 ≤ i' < n. This shifting process aims to hide the original positions of the hash encoded q-grams and therefore the corresponding positions of characters in each string. Hence, the frequency distribution of shifted hash encoded q-grams will no longer follow Benford's law [2] because the first encoded q-grams are distributed to different positions in the shifted hash encoded q-gram lists, thereby preventing a position-based frequency attack on these encoded q-gram lists.
For example, assume the string "mary" has been padded and hashed into the hash encoded q-gram list h_x = [H($m||s), H(ma||s), H(ar||s), H(ry||s), H(y$||s)], and r_x = 2. Therefore, the hash encoded q-grams are shifted by two positions, resulting in h'_x = [H(ry||s), H(y$||s), H($m||s), H(ma||s), H(ar||s)], as shown in Table 3.
Table 3 Example strings, and their corresponding q-grams and randomly shifted q-gram lists, where the patterns of where common characters occur are shown in the first column (where b, m, and e represent the LCS occurring at the beginning, middle, or end of a string, respectively). The string pairs, with minimum length of the LCS m = 3, are shown in the second column. Q-grams are generated using q = 2. LCE refers to the longest common list of elements. The random numbers used to shift each q-gram list are shown in the fourth column. The common sub-strings and the common elements are shown in bold.
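A minimal sketch of the salted hashing and shifting steps is given below. The choice of SHA-256 as H, the hexadecimal representation of hash values, the use of the secrets module to draw r, and all function names are assumptions for illustration.

```python
import hashlib
import secrets

def hash_qgrams(qgram_list, salt):
    """Hash encode each q-gram as H(q_i || s), here with SHA-256 (assumed)."""
    return [hashlib.sha256((g + salt).encode("utf-8")).hexdigest()
            for g in qgram_list]

def shift(items, r):
    """Rotate a list so that the element at position i moves to (i + r) mod n."""
    n = len(items)
    shifted = [None] * n
    for i, item in enumerate(items):
        shifted[(i + r) % n] = item
    return shifted

def encode_string(qgram_list, salt):
    """Hash a non-empty q-gram list and shift it by a random r, 0 <= r < |h|."""
    h = hash_qgrams(qgram_list, salt)
    r = secrets.randbelow(len(h))
    return shift(h, r)

# Shifting the unhashed q-gram list of "mary" by r = 2, for readability.
print(shift(["$m", "ma", "ar", "ry", "y$"], 2))
```

The rotation keeps neighbouring q-grams adjacent (cyclically), which is what later allows the LU to find common runs without knowing r.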

Comparison of hash encoded Q-gram lists
To simplify notation, we now use h_x and h_y to represent the shifted hash encoded q-gram lists h'_x and h'_y.
For a pair of strings, the LU needs to find the length of the longest common (same) sub-list elements (LCE), lce, between the two lists h_x and h_y. We propose two algorithms for this comparison process, a basic and a fast algorithm. The first is a naive method to conduct the comparison by the LU which, however, requires longer runtimes for comparing pairs of hashed q-gram lists. We then propose an alternative, more efficient, algorithm which allows for a faster comparison process, as we describe in Sect. 4.4.2.

Input:
h_x: Hashed and shifted q-gram list of string x
h_y: Hashed and shifted q-gram list of string y
Output:
lce: LCE of the hashed q-gram list pair

Basic encoded Q-grams comparison algorithm
Because of the random shifting process performed by the DOs, the LU does not know the start and end positions of the hash encoded q-grams in the lists h_x and h_y to be compared. To compare two shifted hash encoded q-gram lists, the LU needs to rotate the two lists to find the length of their LCE, lce, and then returns lce to the DOs.
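The idea of comparing over all rotations can be sketched naively as below. This is an illustrative brute-force formulation under our own naming, not a reproduction of the basic algorithm's listing; for readability the example uses unhashed q-grams in place of hash values.

```python
def lce_basic(hx, hy):
    """Length of the longest common run of consecutive elements between two
    shifted lists, trying every start position in both lists and reading
    each list cyclically (which is equivalent to trying every rotation)."""
    n, m = len(hx), len(hy)
    limit = min(n, m)  # a run cannot be longer than the shorter list
    best = 0
    for i in range(n):
        for j in range(m):
            k = 0
            while k < limit and hx[(i + k) % n] == hy[(j + k) % m]:
                k += 1
            best = max(best, k)
    return best

# "$mary$" shifted by 2 and "#marry#" shifted by 3 (q = 2).
hx = ["ry", "y$", "$m", "ma", "ar"]
hy = ["rr", "ry", "y#", "#m", "ma", "ar"]
print(lce_basic(hx, hy))  # the common run is [ma, ar]
```

The three nested loops make the cost of this formulation grow with the product of both list lengths and the run length, which motivates a faster alternative.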

Fast encoded Q-grams comparison algorithm
As outlined in Algorithm 2, the LU uses concatenated hashed q-gram lists, hh_s and hh_l, for the comparison. We use such a concatenation technique because (1) the concatenated list contains the actual sequence of consecutive elements in the hash encoded q-gram list before it has been shifted, and (2) even after concatenation the actual positions of the hash encoded q-grams are not being revealed to the LU, as we discuss in Sect. 8.3.
Let us use the q-gram list q = [$m, ma, ar, ry, y$] as an example. With the random number r = 2, the shifted q-gram list becomes q' = [ry, y$, $m, ma, ar]. We now concatenate q' with itself to generate a concatenated list qq = [ry, y$, $m, ma, ar, ry, y$, $m, ma, ar]. As can be seen from this example, the concatenated list qq does contain the actual sequence of consecutive elements of the q-gram list q, namely [$m, ma, ar, ry, y$] at positions 2 to 6.
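The effect of the concatenation can be checked with a short sketch. The helper names are hypothetical, and lce_doubled is a simple dynamic-programming illustration of why doubling suffices; it is not the listing of Algorithm 2.

```python
def contains_sublist(big, small):
    """True if small occurs as a run of consecutive elements in big."""
    k = len(small)
    return any(big[i:i + k] == small for i in range(len(big) - k + 1))

def lce_doubled(hx, hy):
    """Longest common run between two shifted lists, computed on the lists
    concatenated with themselves and capped at the shorter list length."""
    xx, yy = hx + hx, hy + hy
    cap = min(len(hx), len(hy))
    best = 0
    prev = [0] * (len(yy) + 1)
    for a in xx:
        cur = [0] * (len(yy) + 1)
        for j, b in enumerate(yy, start=1):
            if a == b:
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return min(best, cap)

q = ["$m", "ma", "ar", "ry", "y$"]
q_shifted = ["ry", "y$", "$m", "ma", "ar"]      # r = 2
print(contains_sublist(q_shifted + q_shifted, q))
```

Any run that wraps around the end of a shifted list appears unbroken in the doubled list, so no explicit rotation is needed.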
In line 1 of Algorithm 2, the LU first initialises the length of the LCE, lce. It then finds the common elements between the two lists, h_x and h_y, it received from the DOs, and adds these common elements into the set c in line 2. If common elements occur between h_x and h_y, then the LU orders the two lists by their lengths, and assigns the shorter list to h_s and the longer list to h_l in lines 3 and 4. The LU then concatenates in lines 5 and 6 each of h_s and h_l with a copy of itself, resulting in the two concatenated lists hh_s and hh_l, respectively.
In line 7, the LU then finds the list p_s of consecutive positions in hh_s of the common elements in c by using the function getConsecCommon(). For example, for the first string pair in Table 3, the string x = "mary" has the list p_s = [1,2], which are the positions of q-grams ma and ar, respectively. The list p_s is empty if there is no sequence of consecutive common elements occurring in both h_x and h_y. In this case, the length of the LCE is returned as lce = 1 in line 9 (the lce is 1 because the set c is not empty, as tested in line 3).
If p_s does contain consecutive common elements, then the LU loops over each position in the list p_s in lines 10 and 11. In

String matching based on shifted random bit arrays
While our approach based on shifted hash encoded q-grams prevents frequency attacks that exploit Benford's law [2], an adversary might still be able to identify the most frequent q-grams because these are encoded into hash values that will become the most frequent in the lists of hash encoded q-grams, h. To prevent such attacks, we improve the privacy of the shifted hash encoded q-gram-based approach by encoding each unique q-gram list q into a bit array. Each such bit array is padded at the beginning and end with random bits to ensure the bit arrays of all encoded strings have the same length even if the lengths of their q-gram lists differ. This approach prevents the LU from identifying which subsequence of bits in a bit array corresponds to which q-gram, as we discuss further in Sect. 8.3.

Generating bit arrays for strings
To generate bit arrays for the strings in a database D, each DO builds two tables of unique bit arrays. The first is the table T Σ which contains one unique bit array for each possible q-gram that can be generated from the alphabet Σ, where Σ contains all characters that occur in the string values of the databases D A and D B being compared, plus the padding characters, α and β. The second table, T R , contains random bit arrays which will be used as padding to make the bit arrays of all q-gram lists the same length, where each DO needs to generate a unique table of such random bit arrays in order to prevent false matches, as we discuss further below. Before building the tables T Σ and T R , each DO first generates all unique q-grams that can be obtained from the alphabet Σ based on the q-gram length, The total number of possible q-grams we obtain is |Σ| q .
Based on the size of Σ, the DOs now need to calculate the q-gram bit array length, l_q, to be used for generating the unique bit array for each possible q-gram. Because each DO needs to generate two tables of bit arrays (T_Σ and T_R), where the random bit arrays in T_R must be different between the two databases D_A and D_B, the bit array length l_q must be large enough to allow at least 3|Σ|^q unique bit arrays to be generated. We can calculate a minimum length for l_q as l_q^min = log_2(3|Σ|^q). This would however require every possible combination of bits to be generated, including [0] × l_q^min and [1] × l_q^min. Such patterns could reveal information, as their frequencies of occurrence could be analysed by an adversary. Therefore, to provide maximum entropy, which makes it more difficult to perform a frequency attack, each DO randomly generates bit arrays where bits are set to 0 or 1 with equal probability [11,48]. For a given bit array length l_q, the number of unique bit arrays that can be generated with half their bits set to 1 is the binomial coefficient $\binom{l_q}{l_q/2}$, where this number needs to be at least 3|Σ|^q in our case. We can calculate an estimation of this number based on Stirling's formula [14,28] as:

$$\binom{l_q}{l_q/2} \approx \frac{2^{l_q}}{\sqrt{\pi l_q / 2}} \qquad (1)$$

Table 4 shows values for both l_q^min and the resulting estimate l_q^est for alphabets of different sizes and different q-gram lengths. In the following, and in our implementation, we assume that the value of l_q has been calculated based on Eq. (1).
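The two length calculations described above can be sketched in Python. This is an illustrative sketch rather than the implementation used in the paper: the function names are hypothetical, the minimum length is rounded up to a whole number of bits, odd lengths are assumed to use floor(l/2) one-bits, and the Stirling estimate uses the standard approximation C(l, l/2) ≈ 2^l / sqrt(πl/2).

```python
import math

def min_bit_length(alphabet_size, q):
    """Smallest l with 2**l >= 3 * alphabet_size**q (every bit pattern is used)."""
    target = 3 * alphabet_size ** q
    return (target - 1).bit_length()

def balanced_bit_length(alphabet_size, q):
    """Smallest l with C(l, l // 2) >= 3 * alphabet_size**q (exact count)."""
    target = 3 * alphabet_size ** q
    l = 1
    while math.comb(l, l // 2) < target:
        l += 1
    return l

def stirling_bit_length(alphabet_size, q):
    """Same search as above, but with the Stirling approximation of C(l, l/2)."""
    target = 3 * alphabet_size ** q
    l = 1
    while 2 ** l / math.sqrt(math.pi * l / 2) < target:
        l += 1
    return l

# Assumed example: digits only (|alphabet| = 10) and q = 2.
print(min_bit_length(10, 2), balanced_bit_length(10, 2), stirling_bit_length(10, 2))
```

Restricting bit arrays to roughly half one-bits costs a few extra bits per q-gram compared with the minimum length, which is the price of avoiding highly regular patterns such as all zeros.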
Once the DOs have calculated the q-gram bit array length, l q , to be used, they engage in a secure protocol [49] to find the maximum length l s that corresponds to the longest q-gram list in their respective two databases, D A and D B . To ensure all q-gram lists can be padded with random bits both at the beginning and end, the DOs add 2 to the length l s and then calculate the final bit array length as l t = (l s + 2) × l q .
For example, as illustrated in Fig. 2, assume two padded strings are x = "$mary$" and y = "#marry#", and their corresponding q-gram lists are q_x = [$m, ma, ar, ry, y$] and q_y = [#m, ma, ar, rr, ry, y#], respectively. As shown in this figure, the longest q-gram list is q_y with length l_s = 6, where each q-gram bit array is of length l_q = 6 (to simplify visualisation). Thus, the final bit array length is l_t = (6 + 2) × 6 = 48 bits.
Algorithm 3 outlines the bit array generation by the DOs. In line 1, each DO initialises the bit array inverted index, B, to be used for the generated bit arrays that correspond to its unique strings. Each DO then initialises the tables T_Σ and T_R in lines 2 and 3, respectively. In line 4, each DO generates the set of all possible q-grams, Q_Σ, based on the agreed alphabet, Σ, and q-gram length, q. Then, in lines 5 to 7, the DO loops over each q-gram q_Σ ∈ Q_Σ to generate a bit array for this q-gram using the function genBitArr(). The details of this function are provided in lines 21 to 28.
In line 21, the function genBitArr() first initialises a bit array of length l_q with only 0 bits. Then the q-gram q_Σ is concatenated with the secret salt value, s, that was agreed by the two DOs. This concatenated value is used as the seed to a pseudo random number generator (PRNG) [11]. With the same seed, a PRNG will generate the same sequence of random outputs, and therefore all DOs generate the same random bit arrays for the q-gram q_Σ. The loop in lines 23 and 24 generates l_q such random bits, where the function random.select(0, 1) returns a 0 or a 1 bit with equal probability. As a result, the bit array b_q should be filled with roughly 50% 1 bits. To ensure that each q-gram has a unique bit array, we check this condition in line 25 and, if required, generate a new bit array based on a changed secret salt value s. Because all DOs employ the same PRNG, they will generate the same bit arrays for the q-gram set Q_Σ (which is the same for all DOs). The function returns b_q in line 28, where b_q ∉ T_Σ. Back in the main program, in line 7 each DO inserts the generated bit array b_q into the inverted index T_Σ, where the corresponding q-gram, q_Σ, is used as a key. Each DO repeats this process until one bit array, b_q, has been generated for every q-gram q_Σ ∈ Q_Σ.
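A minimal Python sketch of this seeded generation step is given below. It is not the paper's Algorithm 3: the names gen_bit_arr and build_qgram_table are hypothetical, bit arrays are represented as strings of "0" and "1", and the "changed salt" on a duplicate is assumed to be a simple retry counter appended to the seed. Python's random module is used only to show the reproducibility argument; it is not a cryptographically secure generator.

```python
import random

def gen_bit_arr(qgram, salt, l_q, used):
    """Return a reproducible pseudo-random bit string of length l_q for qgram.

    The seed is the q-gram concatenated with the salt; if the result is already
    in `used`, the seed is changed (assumption: via a retry counter) and redrawn.
    """
    attempt = 0
    while True:
        rng = random.Random(f"{qgram}|{salt}|{attempt}")
        bits = "".join(rng.choice("01") for _ in range(l_q))
        if bits not in used:
            return bits
        attempt += 1

def build_qgram_table(qgrams, salt, l_q):
    """Build the q-gram -> bit array table, keeping all bit arrays unique."""
    table, used = {}, set()
    for g in qgrams:
        bits = gen_bit_arr(g, salt, l_q, used)
        used.add(bits)
        table[g] = bits
    return table

# Two parties using the same salt and q-gram order obtain the same table.
table_a = build_qgram_table(["$m", "ma", "ar", "ry", "y$"], "45", 16)
table_b = build_qgram_table(["$m", "ma", "ar", "ry", "y$"], "45", 16)
print(table_a == table_b)
```

Because the uniqueness check depends on the order in which q-grams are processed, both parties must iterate over the q-gram set in the same (for example sorted) order to end up with identical tables.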
Each DO then generates in lines 8 to 12 its random bit array table, T_R, where |T_R| = |T_Σ|, using the function genRandBitArr(). A temporary table of random bit arrays, T_T, is generated first (in line 9), where the function genRandBitArr() calls the function genBitArr() (as described above) to generate each random bit array. Each DO uses its individual secret salt value, s_d, as its random seed. These individual secret salt values should result in different random bit arrays being generated by the DOs. However, to ensure no random bit array is generated by more than one DO, a secure set intersection protocol [17] is employed in line 10 between the DOs. Any random bit array that has been generated by more than one DO will be returned by the setIntersectDOs() function in the set T_C in line 10, and only those random bit arrays unique to a DO are then added to its list T_R. The DO repeats the steps in lines 9 to 12 until |T_R| = |T_Σ|.
In the last phase of the bit array generation process, each DO generates the final bit array, b_f, of length l_t, for each string (assumed to be available as a q-gram list) in its database. Each DO first loops over the q-gram lists q in its database, D, in line 13. For each q, the DO initialises an empty bit array for the whole q-gram list in line 14, and in line 15 loops over each q-gram, q_Σ ∈ q. In line 16, the DO then selects the q-gram bit array, b_q, that corresponds to q_Σ from T_Σ and appends it to the bit array of the q-gram list.
Finally, to ensure the generated bit arrays for all q-gram lists q ∈ D are of the same length l_t, in line 17 we calculate the number of random bits, l_r, that are required for padding based on the length of the bit array generated for the q-gram list. Using the function padRandBitArr(), in line 18 the DO then generates the final bit array, b_f, where a random number of bits are padded both at the beginning and end of this bit array such that a total of l_r bits are padded, and where these random bits are sourced from the list of random bit arrays, T_R. We illustrate this random padding process in Fig. 2. In line 19, the DO inserts the final bit array b_f into the bit array inverted index B, which will be sent to the LU for comparison, as we describe next.
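The padding step can be sketched as follows. This is a hedged illustration under stated assumptions, not the paper's padRandBitArr(): the function name pad_rand_bit_arr is hypothetical, bit arrays are strings, the random bits are drawn by concatenating entries chosen from the random table, and the split between left and right padding is drawn uniformly so that at least one bit lands on each side.

```python
import random

def pad_rand_bit_arr(b, l_t, rand_table, rng):
    """Pad bit string b to length l_t with random bits on both sides."""
    l_r = l_t - len(b)                 # total number of padding bits required
    if l_r < 2:
        raise ValueError("need at least one padding bit on each side")
    pool = ""
    while len(pool) < l_r:             # source padding bits from the random table
        pool += rng.choice(rand_table)
    pool = pool[:l_r]
    left = rng.randint(1, l_r - 1)     # random split between beginning and end
    return pool[:left] + b + pool[left:]

# Assumed toy values matching the Fig. 2 sizes: l_q = 6, l_s = 6, l_t = 48.
rng = random.Random(7)
rand_table = ["010011", "110100", "001101"]
encoded = "101100" * 5                 # placeholder for five concatenated q-gram bit arrays
final = pad_rand_bit_arr(encoded, 48, rand_table, rng)
print(len(final))
```

Since the left offset is random per string, the encoded q-gram bits start at a different position in each final bit array, which is what shifts them relative to one another.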

Comparison of bit arrays
For a pair of bit arrays, b x and b y , as received from the DOs, the LU needs to find the longest consecutive sequence of bits that are the same across the two bit arrays. In the following, we denote such a sequence as common bits, and the length of the longest such sequence as the length of longest common bits (LCB), lcb. We propose two algorithms for this comparison process. Similar to the comparison of hash encoded q-grams described in Sect. 4.4, the first algorithm is a basic algorithm that follows a naive comparison method, while the second algorithm is a fast algorithm which substantially improves the runtime of the comparison process.  Figure 3 shows an example of the basic comparison algorithm between two bit arrays, b x and b y , where b y is moved over b x by one bit position per iteration. In each iteration, the LU compares the bit segment in the overlapping positions between b x and b y to find the longest sequence of common bits between the two segments.

Basic bit arrays comparison algorithm
Algorithm 4 outlines the steps in the basic bit array comparison process. The LU first initialises the LCB to lcb = 0 and then obtains the length of the bit arrays b x and b y as l t = |b x |, where we assume |b x | = |b y |.
To compare the bit arrays b_x and b_y, from line 3 onwards the LU loops over index position i, where −(l_t − 1) ≤ i ≤ l_t − 1. It calculates the start (x_s and y_s) and end (x_e and y_e) positions for the two bit segments in b_x and b_y to be compared, based on the value of i. In line 10, the LU uses the function findCommon() to find the length of the LCB by applying the XOR operation on the two segments of b_x and b_y and identifying the length of the longest consecutive sequence of 0 bits, which represents the LCB between the two segments. In line 11, the LU keeps the longest length identified so far as the length of the LCB, and it repeats the steps in lines 3 to 11 for all positions i. Finally, the LU returns the found lcb to the DOs in line 12.
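The sliding comparison can be sketched in Python as below. This is an illustrative sketch rather than Algorithm 4 itself: the function names are hypothetical and bit arrays are represented as strings, so "XOR equals 0" is expressed as character equality.

```python
def longest_zero_run(seg_x, seg_y):
    """Length of the longest run of positions where seg_x XOR seg_y is 0."""
    best = run = 0
    for a, b in zip(seg_x, seg_y):
        run = run + 1 if a == b else 0   # equal bits correspond to XOR = 0
        best = max(best, run)
    return best

def basic_lcb(b_x, b_y):
    """Slide b_y over b_x one bit at a time and keep the longest common run."""
    l_t = len(b_x)                        # assumes len(b_x) == len(b_y)
    lcb = 0
    for i in range(-(l_t - 1), l_t):
        if i < 0:                         # b_y shifted to the left of b_x
            seg_x, seg_y = b_x[:l_t + i], b_y[-i:]
        else:                             # b_y shifted to the right of b_x
            seg_x, seg_y = b_x[i:], b_y[:l_t - i]
        lcb = max(lcb, longest_zero_run(seg_x, seg_y))
    return lcb

print(basic_lcb("0011010", "1101000"))
```

There are 2·l_t − 1 alignments and each costs up to l_t bit comparisons, which is where the quadratic cost of the basic algorithm comes from.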

Fast bit arrays comparison algorithm
In the fast bit array comparison process, the DOs first agree on a segment length, l γ , where 0 < l γ ≤ l lcb , and l lcb is the minimum required length of LCB. The DOs can calculate l lcb = l q × (m − q + 1), where l q is the q-gram bit array length, m is the minimum length of LCS required, and q is the agreed q-gram length. If two encoded strings share m consecutive characters, then they need to have a common bit sequence of at least length of l lcb .
We use the segment length l γ because it allows the LU to compare bit arrays one segment after another, which reduces the runtime required by the LU. Furthermore, the LU will not be able to learn any information about the original bit arrays that represent individual q-grams, even if the segment length l γ = l q , because it does not know l q .
Algorithm 5 outlines the fast comparison process by the LU, and Fig. 4 shows an example of this process on two bit arrays, b_x and b_y. As input, the LU receives b_x and b_y, and the segment length l_γ, from the DOs. In line 1, it initialises the length of LCB, lcb, and in line 2, it generates a list of segments, s_y, from b_y using the function genSegment() (as illustrated in Fig. 4a). Each segment in s_y has a length of l_γ bits, except the last segment, which can be shorter. The LU then loops over the segments in the list s_y in line 3, and for each segment, in line 4 the LU finds the list of common positions, p_x, in the bit array b_x where the segment s_y[i] occurs by using the function getPosMatch(), as shown in Fig. 4b.
Because each bit array contains random and q-gram bit arrays, a given segment can contain bits from both. For each position, p_x, in the position list p_x, the LU therefore needs to check if there are sequences of common bits between b_x and b_y both to the left and right of the common segment, because in either direction there can be further common bits, as is illustrated in Fig. 4c. The LU uses the functions getCommonLeft() and getCommonRight() in lines 6 and 7 to find the number of common bits on the left and right, respectively, between the current segment in b_y, s_y[i], and bits in b_x. The LU calculates the current length of the common bit sequence in line 8, and checks if it is a new LCB, lcb, in line 9. The LU repeats the steps in lines 3 to 9 for all segments in s_y. Finally, the LU returns the found lcb to the DOs in line 10.
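The segment-based search can be sketched as follows. This is a simplified illustration, not Algorithm 5: the function name fast_lcb is hypothetical, bit arrays are strings, and str.find stands in for the position matching. Every value it reports is a genuine common run; it is guaranteed to find the true LCB whenever that run is at least 2·l_γ − 1 bits long, because such a run must fully contain one segment of b_y.

```python
def fast_lcb(b_x, b_y, l_gamma):
    """Find the longest common bit run by matching fixed segments of b_y in b_x."""
    lcb = 0
    for start in range(0, len(b_y), l_gamma):
        seg = b_y[start:start + l_gamma]          # last segment may be shorter
        pos = b_x.find(seg)
        while pos != -1:
            # Extend the match to the left of the segment.
            left = 0
            while (pos - left - 1 >= 0 and start - left - 1 >= 0
                   and b_x[pos - left - 1] == b_y[start - left - 1]):
                left += 1
            # Extend the match to the right of the segment.
            right = 0
            end_x, end_y = pos + len(seg), start + len(seg)
            while (end_x + right < len(b_x) and end_y + right < len(b_y)
                   and b_x[end_x + right] == b_y[end_y + right]):
                right += 1
            lcb = max(lcb, left + len(seg) + right)
            pos = b_x.find(seg, pos + 1)          # next occurrence in b_x
    return lcb

print(fast_lcb("0011010", "1101000", 2))
```

Longer segments occur at fewer positions in b_x, so fewer extensions are attempted; this is the intuition behind choosing l_γ as large as the minimum required LCB allows.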

LCS length calculation
As shown in Fig. 1, in the last step of our string matching approaches, using Eq. (2) the DOs calculate the length of the LCS, lcs, based on the matching results they received from the LU. For the approach based on shifted hash encoded q-grams we discussed in Sect. 4, the DOs calculate the lcs based on the length of the longest sequence of common elements, lce, while if they use the approach based on bit arrays described in Sect. 5, they calculate the lcs based on the length of the longest sequence of common bits, lcb.

$$lcs = \begin{cases} lce + q - 1 & \text{for shifted hash encoded q-grams} \\ lcb / l_q + q - 1 & \text{for bit array encoding} \end{cases} \qquad (2)$$

The DOs then only keep the string pairs that have lcs ≥ m. The last column in Table 3 shows examples of the calculated LCS length based on the lce for different pairs of strings.
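As a small worked illustration of Eq. (2), the conversion and the threshold test can be written in Python. The function names are hypothetical, and two details are assumptions not stated in the text: lcb / l_q is taken as integer division (only whole q-gram bit arrays count), and a result of zero common elements is mapped to lcs = 0 rather than q − 1.

```python
def lcs_from_lce(lce, q):
    """LCS length from the number of consecutive common hash encoded q-grams."""
    return lce + q - 1 if lce > 0 else 0

def lcs_from_lcb(lcb, l_q, q):
    """LCS length from the number of consecutive common bits."""
    n_qgrams = lcb // l_q              # assumption: whole q-gram bit arrays only
    return n_qgrams + q - 1 if n_qgrams > 0 else 0

def is_match(lcs, m):
    """A string pair is kept only if its LCS reaches the agreed minimum m."""
    return lcs >= m

# Two consecutive common bigrams (for example ma, ar) give an LCS of length 3.
print(lcs_from_lce(2, 2), lcs_from_lcb(12, 6, 2), is_match(3, 3))
```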

Scalability aspects
In this section, we describe how we can scale our proposed string matching approaches to large databases. The number of string pairs increases quadratically with the number of strings in the two databases being compared. We can improve the complexity of the comparison process by applying a privacy-preserving blocking technique [11,19] to reduce the number of encoded string pairs that need to be compared by the LU.
We apply a q-gram-based blocking approach [11] to generate blocks for each database. In this approach, each block is generated based on a permutation of q-grams in the q-gram list of each string. The DOs first agree on a secret salt value, s, a hash function, H, and the length of q-gram set permutations, t q . In our approach, we calculate t q based on the agreed minimum length of the LCS, m, and the q-gram length, q (as described in Sect. 4.1) as t q = m − q + 1.
The DOs then independently generate the q-gram permutation lists of length t q for each of their q-gram lists. Each DO concatenates the q-grams in each such list into one string, q s , which is used to generate a blocking key value, bkv, by concatenating it with the agreed secret salt value, s. This is followed by a hash encoding of this concatenated string, resulting in bkv = H(q s s). Finally, all q-gram lists in a database that have the same bkv are inserted into the same block. Once the DOs have generated their blocks, they then send these blocks to the LU for conducting comparisons. The LU finds the common bkv between the received databases and only compares the encoded string pairs in blocks that have the same bkv.
For example, let us consider the DOs have agreed on m = 3 and q = 2, and therefore they calculate t_q = 3 − 2 + 1 = 2. We assume the two strings in the two databases, x ∈ D_A and y ∈ D_B, are x = "mary" and y = "marie", with the q-gram lists q_x = [ma, ar, ry] and q_y = [ma, ar, ri, ie], respectively. They then individually generate the bkv of their strings as bkv_x = {H(maar s), H(mary s), H(arry s)} and bkv_y = {H(maar s), H(mari s), H(maie s), H(arri s), H(arie s), H(riie s)}. The encodings of strings x and y are inserted into the block of every blocking key value in bkv_x and bkv_y, respectively. Once the DOs have sent their blocks to the LU, the LU can find the common bkv = H(maar s). Therefore, the two encoded strings x and y are compared.
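The blocking key generation in this example can be sketched in Python. This is an illustration under assumptions, not the authors' code: the function name blocking_keys is hypothetical, SHA-256 and the salt "45" are taken from the parameter settings reported later, and the q-gram "permutations" are generated as order-preserving combinations because that is what the example above enumerates.

```python
import hashlib
from itertools import combinations

def blocking_keys(qgrams, t_q, salt):
    """Hash every order-preserving selection of t_q q-grams together with the salt."""
    keys = set()
    for combo in combinations(qgrams, t_q):
        value = "".join(combo) + salt
        keys.add(hashlib.sha256(value.encode()).hexdigest())
    return keys

bkv_x = blocking_keys(["ma", "ar", "ry"], 2, "45")        # q-grams of "mary"
bkv_y = blocking_keys(["ma", "ar", "ri", "ie"], 2, "45")  # q-grams of "marie"
common = bkv_x & bkv_y                                    # blocks the LU compares
print(len(bkv_x), len(bkv_y), len(common))
```

Only the pair of strings that share the key built from "ma" and "ar" ends up in a common block, so the LU compares this pair and skips pairs without any shared key.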
In the random bit array-based approach, we generate blocks by applying Hamming locality sensitivity hashing (HLSH) [19,34]. In this approach, the LU receives two sets of bit arrays from the two DOs. The LU uses a set of hash functions to select certain bits, and it concatenates these bits into a bit array of fixed length, l b , to be used as a bkv. The bit arrays that have the same bkv are then inserted into the same block.
In our approach, the DOs individually generate blocks of bit arrays before sending them to the LU. The DOs first agree on the secret salt value, s, a hash function, H, and a bit percentage, p_b. They use p_b to calculate the length of a bit segment to be used for HLSH blocking as l_b = (p_b × l_lcb)/100, where we described l_lcb in Sect. 5.2.2. They then generate segments of the selected q-gram bit arrays, b_q, each of length l_b. Each of these segments is then used to generate a bkv by concatenating it with the agreed secret salt value, s, followed by a hash encoding using the function H. The bit segments that have the same hash encoded bkv are inserted into the same block. However, the length of b_q is possibly not divisible by l_b, and therefore the last segment might be shorter than l_b. To ensure every generated segment has the same length, we extend any segment that is too short by prepending bits from the end of the segment to its left. For example, assume b_q = 11001100 (with |b_q| = 8) and l_b = 3. The generated segments of this b_q are 110, 011, and 00. The last segment, 00, is therefore extended with the last bit from the second segment, resulting in the last segment becoming 100. The bkv of this b_q are bkv = {H(110 s), H(011 s), H(100 s)}.
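The segment extension in this example can be sketched as below. The function names are hypothetical, bit arrays are represented as strings, and SHA-256 with the salt "45" is assumed from the reported parameter settings.

```python
import hashlib

def bit_segments(b_q, l_b):
    """Split b_q into segments of length l_b, extending a short last segment
    with the trailing bits of the segment to its left."""
    segs = [b_q[i:i + l_b] for i in range(0, len(b_q), l_b)]
    if len(segs) > 1 and len(segs[-1]) < l_b:
        missing = l_b - len(segs[-1])
        segs[-1] = segs[-2][-missing:] + segs[-1]
    return segs

def segment_keys(b_q, l_b, salt):
    """Hash each segment together with the salt to obtain blocking key values."""
    return {hashlib.sha256((seg + salt).encode()).hexdigest()
            for seg in bit_segments(b_q, l_b)}

print(bit_segments("11001100", 3))
```

Extending the last segment this way is equivalent to taking the final l_b bits of b_q, so every blocking key is derived from the same number of bits.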

Analysis of our protocol
We now analyse our approaches in terms of complexity, accuracy, and privacy.

Complexity analysis
In the shifted hash encoded q-gram-based approach, each DO requires O(l h ) for each step of the encoding process, where l h is the length of hash encoded q-grams list corresponding to each string in its database.
In the comparison process, let us assume the two shifted hash encoded q-gram lists, h  + 2l h ). However, this is the worst case which only occurs when h_x and h_y contain exactly the same encodings. Otherwise, the LU requires less than O(2 l_h^2) to find the lce between h_s and h_l. Therefore, the fast comparison algorithm is experimentally faster than the basic comparison algorithm, as we will show in Sect. 9.5.
In the random bit array-based approach, each DO requires O(|Σ|^q) to generate the bit array table of all possible q-grams, T_Σ. To generate each random bit array, b_r, a DO checks if b_r ∉ T_Σ ∪ T_R^A ∪ T_R^B, where T_R^A and T_R^B are the random bit array tables of the two DOs. Each DO therefore requires a maximum of O(3|Σ|^q) to generate the random bit array table of size |T_Σ|. To generate the final bit array, b_f, of each string, a DO requires O(l_h) to concatenate the q-gram bit arrays, b_q, into one bit array for the string, and O(n_r) to pad this bit array with random bit arrays, where n_r is the number of random bit arrays to be selected from T_R.
We assume the LU receives two bit arrays, b_x and b_y, from the DOs. In the basic bit array comparison algorithm, the LU requires O(l_t^2) to find the length of the LCB between b_x and b_y, where l_t = |b_f|. In the fast bit array comparison, the LU requires O(l_t) to generate a list of segments, s_y, from b_y. For each segment in s_y, it then requires O(l_t) to find the positions in b_x where the segment occurs. For each position p_x in b_x, the LU requires O(l_t) to find the sequence of common bits that occur to the left and right of the bit at position p_x in b_x. Therefore, the LU requires a total of O(l_t + |s_y|(l_t + l_t^2)). When the DOs apply blocking to their databases, each DO requires O(l_h × |D|) to generate blocks based on q-gram-based blocking, while with HLSH-based blocking the DOs require O(b_B × |D|), where b_B = l_t / l_b and l_b is the length of bit segments used to generate a blocking key. In the comparison process, the LU requires O(n^2 / n_b) block comparisons, where n is the number of hash encoded q-gram lists or bit arrays in each database (we assume |D_A| = |D_B|) and n_b is the number of blocks.

Accuracy analysis
As mentioned in Sect. 4, we use different padding characters between databases to ensure that the calculated length of LCS, lcs, is correct. Let us describe why this approach is required by using a q-gram list without padding characters as an example. We assume the strings to be compared are x = "mary" and y = "marary". The correct LCS between these two strings is "mar" with lcs = 3. We assume the DOs have agreed on q = 2 and they use the random numbers r_x = 2 and r_y = 4, respectively, for shifting their q-gram lists. Therefore, their shifted q-gram lists are q_x = [ar, ry, ma] and q_y = [ar, ra, ar, ry, ma], where the consecutive common q-grams are ar, ry, and ma. When the LU compares these lists, it returns lce = 3 to the DOs. The DOs then use Eq. (2) to calculate lcs = 3 + 2 − 1 = 4, which is an incorrect result. As this example shows, our approach does not work when strings are not padded using different characters. Examples of correct LCS calculations are shown in Table 3.
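The effect of padding in this example can be reproduced with a short Python sketch. It is a simplified model under assumptions, not the comparison algorithm of Sect. 4.4: hashing is omitted because equal q-grams hash to equal values, the function names are hypothetical, and shifting is modelled by treating both lists as circular, so a common run may wrap around the end of a list.

```python
def qgrams(s, q):
    """List of overlapping substrings of length q."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

def circular_lce(a, b):
    """Longest run of consecutive equal elements, treating both lists as circular."""
    best = 0
    limit = min(len(a), len(b))
    for i in range(len(a)):
        for j in range(len(b)):
            k = 0
            while k < limit and a[(i + k) % len(a)] == b[(j + k) % len(b)]:
                k += 1
            best = max(best, k)
    return best

q = 2
unpadded = circular_lce(qgrams("mary", q), qgrams("marary", q))
padded = circular_lce(qgrams("$mary$", q), qgrams("#marary#", q))
print(unpadded + q - 1, padded + q - 1)   # LCS estimates without and with padding
```

Without padding, the run ar, ry, ma wraps around the end of both lists and inflates the result; with the distinct padding characters, the q-grams y$ and y# differ, the wrap-around run is broken, and the estimate equals the true LCS length of 3.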
Apart from the padding characters, to calculate an accurate lcs, the minimum length of the LCS, m, must be at least q, that is, m ≥ q. This is because when m < q, in the shifted hash encoded q-gram-based approach, the LU cannot find the length of LCE, lce, between the two hash encoded q-gram lists. Let us use the two q-gram lists, q_x and q_y, as an example. We assume m = 3, q = 4, and the two padded strings are x = "$mary$" and y = "#marary#". The corresponding q-gram lists of x and y are q_x = [$mar, mary, ary$] and q_y = [#mar, mara, arar, rary, ary#], respectively. There is no common q-gram between these lists and therefore the DOs obtain the length of LCS, lcs = 0, for this pair of strings, whereas the actual LCS between x and y is "mar" with lcs = 3. The same issue also occurs in the random bit array-based approach because each bit array is generated based on a list of q-grams.
In the random bit array-based approach, hash collisions [8], where two or more q-grams are encoded into the same q-gram bit array, b_q, can affect the accuracy of string matching. The probability of a hash collision, P_b, that the bit can be set to 1 in this approach can be calculated by applying the dependent probability calculation [1] as shown in Eq. (3):

$$P_b = \frac{l_q/2}{l_q} \times \frac{l_q/2 - 1}{l_q - 1} \times \cdots \times \frac{1}{l_q/2 + 1} \qquad (3)$$

where l_q is the length of the bit array of each q-gram, and l_q/2 means 50% of l_q is set to 1. When selecting the first bit position, there is a l_q/2 out of l_q chance that the position is being selected to be set to 1 by two or more q-grams. The number of chances decreases by 1 once each position is selected. Finally, when selecting the last position, there remains a 1 out of l_q/2 + 1 chance that a position can be selected. For example, if we use l_q = 6, the probability that a hash collision can occur is P_b = 3/6 × 2/5 × 1/4 = 0.05, or 5%.
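The product described above can be checked numerically with a short Python sketch. The function name is hypothetical, and exact fractions are used to avoid rounding; the product equals the reciprocal of the number of bit arrays of length l_q with exactly half their bits set to 1.

```python
from fractions import Fraction
from math import comb

def collision_probability(l_q):
    """Product (l_q/2)/l_q * (l_q/2 - 1)/(l_q - 1) * ... * 1/(l_q/2 + 1)."""
    half = l_q // 2
    p = Fraction(1)
    for i in range(half):
        p *= Fraction(half - i, l_q - i)
    return p

p6 = collision_probability(6)
print(p6, float(p6), p6 == Fraction(1, comb(6, 3)))
```

Because the probability shrinks as the reciprocal of a binomial coefficient, even modest increases in l_q make such collisions very unlikely.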

Privacy analysis
We assume the DOs and the LU follow the honest-but-curious (HBC) adversary model where no DO colludes with the LU [36]. The HBC model is commonly used in PPRL and private string comparison protocols [57] because of its applicability to real scenarios. In this model, each party tries to learn as much as possible about the other parties' data based on what it receives from the other parties, while following the protocol steps. In our approaches, the DOs first communicate with each other to agree on parameter settings. This allows them to learn the parameters that are being used in the encoding processes, but they cannot learn any sensitive information about the strings in each other's databases.
In both approaches, the DOs then individually encode the unique strings in their databases using the agreed parameters without learning any information from the other database. However, to generate the random bit arrays in the random bit array-based approach, the DOs employ a secure set intersection protocol [17] to find and exclude the common random bit arrays from their random bit arrays tables. These random bit arrays however do not represent any actual q-grams, and therefore, the DOs do not learn any sensitive information from each other.
When blocking is used, the DOs apply a privacy-preserving blocking algorithm [11,19] on their encoded strings before these are being sent to the LU. We assume such a blocking algorithm to be secure such that it does not allow the DOs to learn any sensitive information about each other's databases. The LU then receives the two encoded databases from the DOs. It first finds common blocks between them and then compares only encoded strings that are in the same blocks. In this step, the LU does learn which encodings occur in both databases, but not their actual content.
In our shifted hash encoded q-gram-based approach, the LU can learn the string length by guessing the length of q-grams, q (commonly used values are 2 and 3), and checking the length of the hash encoded q-gram lists. The LU can then generate q-grams from a public database using the guessed q and compare the frequency of the generated q-grams and the hash encodings in a received database. However, in order to identify encoded q-grams, this public database must contain a very similar set of values with the same frequency distribution as the encoded database, as otherwise the LU cannot employ a frequency analysis. Furthermore, an injection of faked values can be used to prevent such a frequency-based attack [32].
In the basic encoded q-gram comparison algorithm (described in Sect. 4.4.1), for each pair of shifted hash encoded q-gram lists, the LU finds the length of LCE by iteratively comparing and rotating the two lists. While the LU can keep the positions where the common hash encoded q-grams occur, it cannot learn the actual positions of these common hash encoded q-grams nor the positions of the original q-grams because (1) the common q-grams between two lists can occur at any position in the lists, and (2) the hash encoded q-grams in the two lists have been shifted by our random shifting technique. This results in the common patterns of the original string pairs to be distributed to different patterns of the encoded string pairs. In other words, the original positions where common q-grams occur in the q-gram lists have been shifted to other positions in the encoded and shifted lists.
Similarly, in the fast encoded q-gram comparison algorithm, although the actual sequence of hash encoded q-grams is contained in the concatenation of the shifted lists (as described in Sect. 4.4.2), the LU still cannot learn the actual positions where either the original q-grams or the LCS occur in the two strings.
In the random bit array-based approach, the LU receives bit arrays which are padded with a random number of random bits. The LU cannot learn the length of the original strings because every bit array has the same length. It also cannot learn the frequency distribution of the bits that encode each q-gram because the q-gram bit arrays are shifted by a random number of bits, and therefore, it cannot re-identify the original q-grams. The LU can only learn that the common bits occur in the middle of two bit arrays (common pattern m-m), but it cannot learn the actual positions where the LCS occurs in the strings that correspond to a bit array pair.
Once the LU has compared all encoded string pairs, it returns the LCE or LCB to the DOs. Each DO then calculates the length of the LCS, lcs. This allows each DO to learn the LCS between a string in its database and a string in the other DO's database, but the DO cannot learn the positions where the LCS occurs in their string. Therefore, the DOs only learn that there is a sub-string match.

Experimental evaluation
We evaluated the accuracy, privacy, and scalability of our privacy-preserving string matching approaches compared to Bloom filter (BF) encoding [44], tabulation-based hashing (TMH) [51], and DGK approximate string matching (DGK) [23]. We compared our approaches with these three baselines because BF encoding [44] is considered as a standard technique for PPRL, TMH [51] is a more secure technique compared to BF encoding, and DGK [23] is a recently proposed approach for secure string matching that encrypts strings based on their q-grams. We implemented all approaches using Python 2.7 and ran experiments on a server with 2.4 GHz CPUs running Ubuntu 18.04.

Data sets
In our evaluation, we require pairs of data sets where each pair contains different common patterns as illustrated in Table 3. We used both synthetic and real data of types numbers, letters, and mixed.
To generate the synthetic data, we used the Python package Faker 1 to create data sets of credit card, barcode, and IBAN (International Bank Account Number) numbers, where each such data set contains 1,000 unique strings. We used each of these data sets as the first data set in a pair. We then created the second data set of a pair by replacing characters at different positions in each string in the first data set by random characters of the same alphabet. We ensured each data set pair does contain different common patterns; however, the common pattern b-e (see Table 3) cannot occur for IBAN numbers because these numbers begin with letters and end with digits. Therefore, the beginning of the first IBAN number cannot be in common with the end of a second IBAN number in a string pair.
For real data, we extracted 1,000 strings of first names, cities, zip codes, and telephone numbers from the North Carolina Voter Registration (NCVR) 2 database, with snapshots from 2011 (first data set) and 2020 (second data set). We also extracted 1,000 strings of US security numbers from the Social Security Death Master File 3 . Similar to generating the synthetic data sets, we used the extracted US security numbers to generate the second data set by randomly replacing characters at different positions in each string.

Table 5 Lengths of the longest q-gram list, l_s, q-gram bit array, l_q, calculated using l_q^est (Eq. (1)), and final bit array, l_t, of different data set pairs and alphabet sizes, |Σ|
In total, we evaluated our approaches and baselines on eight data set pairs including three sets of synthetic data, four sets of real NCVR data, and one set of real social security death index data.

Parameter settings
We padded strings in the two databases, D A and D B , using the padding characters α = "$" and β = "#", respectively. We generated q-grams using q = 3 for first names, cities, zip codes, and the US security numbers data sets, while we used q = 4 for telephone, credit card, barcode, and IBAN numbers. For each data set, we used the minimum length of LCS, m = q.
In the shifted hash encoded q-gram-based approach, to generate hashed q-grams, we used the hash function H = SHA256 [43] and the agreed secret salt value s = 45. This salt value was also concatenated with q-grams for generating each q-gram bit array, b q , in the bit array-based approach. To generate the random bit arrays for databases D A and D B , we used the individual secret salt values, s A = 65 and s B = 56, respectively. We calculated the length of q-gram bit arrays, l q , using Eq. (1). Table 5 shows the alphabet sizes and bit array length for each data set.
We compared our approaches with three baselines, which are BF [44], TMH [51], and DGK approximate string matching (DGK) [23]. We used the same parameter settings as we used in our approaches, such as padding character α, q-gram length q, minimum length m, secret salt value s, and the hash function H.
To generate the BF for a string, we encoded each q-gram set into a BF of 1,000 bits as this is a commonly used BF length for PPRL [44]. We used the optimal number of hash functions [38] for each data set, which is 139 for first names, 87 for cities, 139 for zip codes, 87 for telephone numbers, 116 for US security numbers, 46 for credit card numbers, 58 for barcode numbers, and 33 for IBAN numbers. For the TMH approach, we followed the original publication [51] and used 8 tabulation hash keys each of 64 bits length to generate a bit array of length 1,000 bits to encode a string. For the DGK approach, we used keys of size 1,024 bits, and rather than using a two-party protocol as proposed in the original publication [23], we implemented a three-party protocol to be comparable with our approaches and the other two baselines by using a LU for conducting the comparison process.
To improve scalability, we applied a q-gram-based blocking technique for all approaches and applied HLSH-based blocking on the random bit array, BF, and TMH approaches, as we described in Sect. 7. However, we only show results based on q-gram-based blocking for BF [44] and TMH [51] as both blocking approaches (q-gram- and HLSH-based blocking) provide highly similar results. For q-gram-based blocking, we calculated the length of q-gram set permutations for generating blocks based on m and q, resulting in t_q = 1.
For HLSH-based blocking, in our random bit array-based approach, we generated blocking key values, bkv, using the length of bit segments l_b calculated based on the bit percentages p_b = 30, 50, and 80 (the l_b calculation is described in Sect. 7). For the BF [44] and TMH [51] baselines, we used the same length of bit segments l_b calculated based on p_b = 80 in our approach, because when using p_b < 80 the resulting bit segments are too short and generate too many blocks for BFs or bit arrays of length 1,000 bits. This would lead to a large number of comparisons. For example, as shown for the first names data set in Table 6, the length of bit segments is l_b = 6 and the number of blocks is 166 when calculated based on p_b = 30, while l_b = 16 and the number of blocks is 62 when calculated based on p_b = 80. To generate each blocking key, bkv, we used the agreed secret salt value, s = 45, and the hash function H = SHA256.

Accuracy results
We evaluated the accuracy of all approaches based on the correctness of the similarity calculations. We compared the length of the LCS of unencoded string pairs with the calculated LCS length, lcs, of the corresponding encoded string pairs based on Eq. (2). To be comparable with the BF [44], TMH [51], and DGK [23] baselines, we normalised lcs into the range [0...1] of similarity values using the string lengths |x| and |y|, where x and y are the strings in a pair. For the BF approach, we calculated the similarity of q-gram sets and of BFs using the Dice coefficient [44], while we used the Jaccard similarity for q-gram sets and for bit arrays generated by the TMH encoding technique [51]. For the DGK approach, we calculated the similarity of q-gram lists and of encryption (ciphertext) lists using the Dice coefficient [23]. Figure 5 shows scatter plots where the horizontal axis shows unencoded similarities and the vertical axis shows the corresponding encoded similarities. Points on the diagonal show pairs of strings where the unencoded and the encoded similarities are the same, while any point off the diagonal shows a difference between the similarities calculated on unencoded and encoded string pairs. As can be seen, both our approaches provide accurate string similarity results, while BF [44] and TMH [51] encodings can result in inaccurate similarities. This is because of the high number of hash collisions that occur with both encoding approaches, where different q-grams are hashed into the same bit positions.
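To make the difference between the compared measures concrete, the following sketch computes the LCS length of an unencoded string pair together with the Dice and Jaccard similarities of their q-gram sets. It operates on plain strings only and does not implement any of the encodings. The choice q = 2, the example strings, and the function names are assumptions made for this illustration.

```python
def lcs_length(x: str, y: str) -> int:
    """Length of the longest common sub-string (contiguous) of x and y."""
    best = 0
    prev = [0] * (len(y) + 1)
    for cx in x:
        cur = [0] * (len(y) + 1)
        for j, cy in enumerate(y, start=1):
            if cx == cy:
                # Extend the common sub-string ending at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def qgrams(s: str, q: int = 2) -> set:
    """Set of q-grams of s (q = 2 is an assumption for this example)."""
    return {s[i:i + q] for i in range(len(s) - q + 1)}


def dice(a: set, b: set) -> float:
    return 2 * len(a & b) / (len(a) + len(b))


def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)


x, y = "abcab", "cabca"  # hypothetical example strings
print(lcs_length(x, y))            # 4, the common sub-string is "abca"
print(dice(qgrams(x), qgrams(y)))  # 1.0, both strings have the same bigram set
```

The two strings differ, yet their bigram sets are identical, so the set-based similarities report a perfect match while the LCS length of 4 reflects that only part of each string is shared.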
The DGK [23] approach also results in inaccurate similarities. This is because the Dice coefficient of the ciphertexts is calculated based on the cardinality, where some ciphertexts that represent encrypted q-grams of strings in a pair are not common, although these ciphertexts are common between the two lists of all possible q-grams that were used to generate the intersection set of cardinality.

Fig. 5 Similarity plots of the shifted hash encoded q-gram-based approach (first column), the shifted random bit array-based approach (second column), Bloom filter (BF) encoding [44] (third column), tabulation-based hashing (TMH) [51] (fourth column), and the DGK approximate threshold [23]-based approach (last column). As can be seen, both of our approaches provide accurate similarity calculations (our LCS equals the actual LCS that is calculated on unencoded string pairs), while the BF and TMH approaches both can lead to substantially changed similarities even between very similar strings. The DGK approach results in the similarity of a pair of encoded strings to be higher than the similarity of its corresponding unencoded strings

... ciphertext (encryption of 1) is located at the same position in the list of all possible q-grams. However, this approach still provides high privacy because of the use of the DGK homomorphic encryption [16], which results in the same value being encrypted into different ciphertexts. Therefore, although the common patterns of unencrypted and encrypted string pairs are the same, it would be difficult to re-identify the original string values because an adversary cannot learn if a ciphertext represents a 0 or 1.
For the BF [44] and TMH [51] encoding approaches, the common patterns of the same string value in different data sets are not distributed to other common patterns when they are encoded, and therefore the common pattern of an encoded string pair is the same as that of the unencoded pair. Encoded string pairs can also have the common pattern none, because with these approaches the bits that encode q-grams are not located in sequential order. The bits of a common q-gram between two strings can be located next to bits that encode non-common q-grams, and the sequence of bits in a BF or TMH bit array is then a mix of common and non-common q-grams.

Fig. 7 Heatmap [21] plots of the NCVR and US social security number data sets that are compared using different approaches. The columns show shifted hash encoded q-gram, random bit arrays, BF encoding [44], TMH [51], and DGK [23], ordered from left to right. Each row shows common patterns of different real data sets. In each plot, the vertical axis shows the common pattern of unencoded (or unencrypted) string pairs and the horizontal axis shows the common pattern of encoded (or encrypted) string pairs. Higher percentages of unencoded and encoded string pairs are shown in dark blue, while lower percentages are shown in light blue colour

Fig. 8 Heatmap [21] plots of the synthetic data sets that are compared using different approaches. The columns show shifted hash encoded q-gram, random bit arrays, BF encoding [44], TMH [51], and DGK [23], ordered from left to right. Each row shows common patterns of different synthetic data sets. The vertical axis shows the common pattern of unencoded (or unencrypted) string pairs and the horizontal axis shows the common pattern of encoded (or encrypted) string pairs. Higher percentages of unencoded and encoded string pairs are shown in dark blue, while lower percentages are shown in light blue colour
For example, assume the two BFs b_x = 1011000100 and b_y = 0001010110 share 1-bits that encode common q-grams at positions 3 and 7 (counting from 0). Both of these bits are located next to bits that encode non-common q-grams, so the encoding is a mix of bits of common and non-common q-grams. As a result, this BF pair cannot be assigned to any of the common patterns illustrated in Table 3, and therefore the common pattern of this BF pair is none.
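The example can be checked with a few lines of code. The sketch below finds the positions at which both BFs have a 1-bit, computes the Dice coefficient on the bit level, and measures the longest run of consecutive common 1-bits. It does not classify the pair into the common patterns of Table 3, and the function names are illustrative assumptions.

```python
def common_one_bits(b_x: str, b_y: str) -> list:
    """0-based positions where both bit strings have a 1-bit."""
    return [i for i, (a, b) in enumerate(zip(b_x, b_y)) if a == "1" and b == "1"]


def dice_bits(b_x: str, b_y: str) -> float:
    """Dice coefficient based on the 1-bits of two bit strings."""
    common = len(common_one_bits(b_x, b_y))
    return 2 * common / (b_x.count("1") + b_y.count("1"))


def longest_common_run(b_x: str, b_y: str) -> int:
    """Length of the longest run of consecutive positions with common 1-bits."""
    best = run = 0
    for a, b in zip(b_x, b_y):
        run = run + 1 if a == "1" and b == "1" else 0
        best = max(best, run)
    return best


b_x, b_y = "1011000100", "0001010110"
print(common_one_bits(b_x, b_y))     # [3, 7]
print(dice_bits(b_x, b_y))           # 0.5
print(longest_common_run(b_x, b_y))  # 1
```

The Dice coefficient of 0.5 comes from two isolated common bits, so the BF pair carries no contiguous sequence of common bits from which the length of a common sub-string could be read.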
We also evaluated the privacy of our random bit array-based approach and the BF [44] and TMH [51] encoding baselines using two cryptanalysis attacks developed for BFs for PPRL [12,13]. A frequency-based attack [12] cannot reveal any information from our random bit array-based approach or from the two baselines, because the frequency of bit arrays or BFs equals the frequency of strings (all have a frequency of 1). Therefore, the attack cannot identify any pairs of unencoded and encoded values. A pattern mining-based attack [13] cannot re-identify any information in our random bit array-based approach either, because the random bit arrays result in encodings of the same q-gram in different strings being located at different positions. It also cannot attack the two baselines, because too many hash collisions occur in their encodings, which means the attack cannot re-identify any information about individual q-grams.

Scalability results
To make our approaches and the baselines comparable, we used a three-party protocol for all approaches [11]. We evaluated the runtime of the encoding process by a DO and of the string comparison process by the LU, as shown in Fig. 9. We report the average times for one string or string pair in milliseconds.
As shown in Fig. 9, our shifted hash encoded q-gram-based approach is the fastest encoding technique, while the DGK approach [23] is the slowest. In our random bit array-based approach, the encoding of letters is faster than the encoding of numbers. This is because the size of the alphabet, |Σ|, affects the runtime when generating the unique random bit arrays. A small |Σ| leads to a shorter q-gram bit array length, l_q, and results in longer runtimes to generate unique random bit arrays for the two DOs. Furthermore, encoding takes more time for longer strings, such as IBAN numbers (as shown in Table 5). However, the size of the alphabet and the length of strings do not affect the other encoding approaches.
For the comparison process, we applied q-gram-based blocking to our shifted hash encoded q-gram-based approach and the three baselines, while we applied HLSH-based blocking to our random bit array-based approach and the BF [44] and TMH [51] baselines. Table 7 shows the number of string pair comparisons for the different data set pairs and approaches, where we show only the number of comparisons based on q-gram-based blocking for the three baselines. As shown in Fig. 9, in the comparison process our approaches have similar runtimes to the DGK approach [23] and longer runtimes than BF [44] and TMH [51] encoding, where these two baselines have similar runtimes. This is because the comparison process of our approaches is more complicated: we find all sequences of common encodings that occur in the encoded string pair and then find the LCS between them, while BF [44] and TMH [51] both only calculate approximate similarities based on the set intersection of 1-bits that occur in a pair of encoded strings.
Overall, as expected, our fast comparison algorithms have shorter runtimes than the basic comparison algorithms. However, in the random bit array-based approach, the fast algorithm is slower than the basic algorithm when we use l_γ = 10% of l_lcb. This is because of the overhead of the fast algorithm, which needs to generate segments and find the common sequences of bits to the left and right of segments.

Discussion
Our approaches provide accurate string comparisons and outperform the Bloom filter (BF) encoding [44], tabulation-based hashing (TMH) [51], and DGK approximate string matching (DGK) [23] approaches, all of which calculate only approximate similarities between string pairs. Our approaches use more time for the comparison step than the BF [44] and TMH [51] baselines, while our random bit array-based approach has similar runtimes to the DGK [23] approach. For the encoding step, our approaches are faster than the TMH [51] and DGK [23] baselines.
In terms of privacy, the common patterns of the original string pairs are distributed to different patterns when strings are encoded using our approaches, while with the BF, TMH, and DGK baselines the common patterns of string pairs are not distributed to other common patterns. This implies that our approaches make it more difficult for an adversary to re-identify the original string pairs based on a frequency analysis than the three baselines do, because fewer common patterns are available for an attack. Overall, our approaches provide high accuracy and privacy, at the cost of increased comparison times compared to the BF and TMH baselines.

Conclusions and future work
We have presented two new privacy-preserving string matching techniques that allow the accurate and efficient calculation of the longest common sub-string between strings. Our approaches encode sensitive input strings such that no re-identification is possible, while also preventing frequency attacks on individual character encodings. Our experimental evaluation has shown that both our approaches result in the same string similarities as on the original unencoded strings, while the commonly used Bloom filter encoding [44], tabulation-based hashing [51], and DGK approximate string matching [23] approaches can lead to potentially much higher or lower similarities between encoded strings.
As future work we aim to improve the runtime of the comparison step of our random bit array-based approach by generating blocks based on the consecutive order of bit segments, and conduct more extensive scalability experiments on larger databases.