Unsupervised record matching with noisy and incomplete data

We consider the problem of duplicate detection: given a large data set in which each entry has multiple attributes, detect which distinct entries refer to the same real-world entity. Our method consists of three main steps: creating a similarity score between entries, grouping entries together into 'unique entities', and refining the groups. We compare various methods for creating similarity scores, considering different combinations of string matching, term frequency-inverse document frequency methods, and n-gram techniques. In particular, we introduce a vectorized soft term frequency-inverse document frequency method, with an optional refinement step. We test our method on the Los Angeles Police Department Field Interview Card data set, the Cora Citation Matching data set, and two sets of restaurant review data. The results show that in certain parameter ranges soft term frequency-inverse document frequency methods can outperform the standard term frequency-inverse document frequency method; they also confirm that our method for automatically determining the number of groups works well in many cases and allows for accurate results in the absence of a priori knowledge of the number of unique entities in the data set.


Introduction
Fast methods for matching records in databases that are similar or identical have growing importance as database sizes increase [37,39,11,21,1]. Slight errors in observing, processing, or entering data may cause multiple unlinked, nearly duplicate records to be created for a single real-world entity. Furthermore, records are often made up of multiple attributes, or fields; a small error or missing entry in any one of these fields can cause duplication.
For example, one of the data sets we consider in this paper is a database of personal information generated by the Los Angeles Police Department (LAPD). Each record contains information such as first name, last name, and address. Misspellings, different ways of writing names, and even address changes over time, can all lead to duplicate entries in the database for the same person.
Duplicate detection problems do not scale well. The number of required comparisons grows quadratically with the number of records, and the number of possible subsets grows exponentially. Unlinked duplicate records bloat the storage size of the database and make compression into other formats difficult. Duplicates also make analyses of the data more complicated and less accurate, and may render many forms of analysis impossible, as the data is no longer a true representation of the real world. After a detailed description of the problem in Section 2 and a review of previous methods in Section 3, we present in Section 4 a vectorized soft term frequency-inverse document frequency (soft TF-IDF) solution for string and record comparison. In addition to creating a vectorized version of the soft TF-IDF scheme, we also present an automated thresholding and refinement method, which uses the computed soft TF-IDF similarity scores to cluster together likely duplicates. In Section 5 we explore the performance of different variations of our method on four text databases containing duplicates.

Terminology and problem statement
We define a data set $D$ to be an $n \times a$ array in which each element contains a string (possibly the empty string). We refer to a column as a field and denote the $k$-th field by $c^k$. A row is referred to as a record, with $r_i$ denoting the $i$-th record of the data set. An element of the array is referred to as an entry, denoted $e_{i,j}$ (the entry in the $i$-th record and $j$-th field). Each entry can contain multiple features, where a feature is a string of characters. There is significant freedom in choosing how to divide the string in entry $e_{i,j}$ into multiple features. In our implementations of soft TF-IDF (Section 3.3), we compare two different methods: (1) cutting the string at white spaces and (2) dividing the string into $N$-grams. For example, consider an entry $e_{i,j}$ containing the string "Albert Einstein". Following method (1), this entry has two features: "Albert" and "Einstein". Method (2), the $N$-gram representation, creates features $f^k_1, \ldots, f^k_L$, corresponding to all possible substrings of $e_{i,j}$ containing $N$ consecutive characters (if an entry contains $N$ characters or fewer, the full entry is considered to be a single feature). Hence $L$ is equal to the length of the string minus $N - 1$. In our example, if we use $N = 3$, $e_{i,j}$ has 13 features. Ordered alphabetically (with white space " " preceding "A"), the features are $f^k_1 = $ " Ei", $f^k_2 = $ "Alb", $f^k_3 = $ "Ein", $f^k_4 = $ "ber", $f^k_5 = $ "ein", $f^k_6 = $ "ert", $f^k_7 = $ "ins", $f^k_8 = $ "lbe", $f^k_9 = $ "nst", $f^k_{10} = $ "rt ", $f^k_{11} = $ "ste", $f^k_{12} = $ "t E", $f^k_{13} = $ "tei".
In our applications we remove any $N$-grams that consist purely of white space. When discussing our results in Figure 3 and Sections 5 and 6, we will specify where we have used method (1) and where we have used method (2), by indicating whether we have used word features or $N$-gram features, respectively.
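To make the two feature constructions concrete, here is a minimal Python sketch (our own illustration, not code from the paper; the function names are hypothetical):

```python
def word_features(entry):
    """Method (1): cut the entry at white spaces."""
    return entry.split()

def ngram_features(entry, n=3):
    """Method (2): all substrings of n consecutive characters.
    Entries of n characters or fewer yield the full entry as one feature."""
    if len(entry) <= n:
        return [entry]
    grams = [entry[i:i + n] for i in range(len(entry) - n + 1)]
    # drop N-grams consisting purely of white space (cf. Section 2)
    return [g for g in grams if not g.isspace()]

print(word_features("Albert Einstein"))           # ['Albert', 'Einstein']
print(sorted(ngram_features("Albert Einstein")))  # 13 features, ' Ei' first
```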
For each field we create a dictionary of all features in that field and then remove stop words and irrelevant words, such as "and", "the", "or", "None", "NA", or " " (the empty string). We refer to such words collectively as "stop words" and to this reduced dictionary as the "set of features" $f^k = \{f^k_1, \ldots, f^k_m\}$ for $m$ features. The dictionary represents an ordered set of unique features found in field $c^k$. Note that $m$, the number of features in the dictionary, depends on $k$, since a separate dictionary is constructed for each field. To keep the notation as simple as possible, we will not make this dependence explicit. Since, in this paper, $m$ is always used in the context of a given, fixed $k$, this should not lead to confusion.
We will write $f^k_j \in e_{i,k}$ if the entry $e_{i,k}$ contains the feature $f^k_j$. Multiple copies of the same feature can be contained in any given entry; this will be explored further in Section 3.2. Note that an entry can be "empty" if it contains only stop words, since no features in the dictionary represent it.
We refer to a subset of records as a cluster and denote it $R = \{r_{t_1}, \ldots, r_{t_p}\}$, where each $t_i \in \{1, 2, \ldots, n\}$ is the index of a record in the data set.
The duplicate detection problem can then be stated as follows: given a data set containing duplicate records, find clusters of records that represent a single entity, i.e., the sets of records that are duplicates of each other. Duplicate records, in this sense, are not identical records but 'near identical' records. They are allowed to vary due to spelling errors or missing entries.

Existing methods
Numerous algorithms for duplicate detection already exist, including various probabilistic methods [18], string comparison metrics [17,36], feature frequency methods [29], and hybrid methods [9]. Here we present a brief overview of those methods related to the new method we introduce in Section 4. We review the Jaro and Jaro-Winkler string metrics, the feature frequency based term frequency-inverse document frequency (TF-IDF) method, and the hybrid soft TF-IDF method.
3.1 Character-based similarity: Jaro and Jaro-Winkler

Typographical variations are a common cause of duplication among string data, and the prevalence of this type of error motivates string comparison as a method for duplicate detection. The Jaro distance [17] was originally devised for duplicate detection in government census data and modified by Winkler [36] to give more favorable similarities to strings with matching prefixes. This latter variant is now known as the Jaro-Winkler string metric and has been found empirically to be comparable with much more complex measures [9]. Despite their names, neither the Jaro distance nor the Jaro-Winkler metric is in fact a distance or a metric in the mathematical sense, since they do not satisfy the triangle inequality, and exact matches have a score of 1, not 0. Rather, they are similarity scores.
To define the Jaro-Winkler metric, we must first define the Jaro distance. For two features $f^k_i$ and $f^k_j$, we define the character window size
$$W_{i,j} = \left\lfloor \frac{\max(|f^k_i|, |f^k_j|)}{2} \right\rfloor,$$
where $|f^k_i|$ is the length of the string $f^k_i$, i.e., the number of characters in $f^k_i$ counted according to multiplicity. The $l$-th character of the string $f^k_i$ is said to match the $l'$-th character of $f^k_j$ if both characters are identical and $l - W_{i,j} \le l' \le l + W_{i,j}$. Let $M$ be the number of characters in string $f^k_i$ that match with characters in string $f^k_j$ (or, equivalently, the number of characters in $f^k_j$ that match with characters in $f^k_i$), let $(a_1, \ldots, a_M)$ be the matched characters from $f^k_i$ in the order they appear in the string $f^k_i$, and let $(b_1, \ldots, b_M)$ be the matched characters from $f^k_j$ in order. Then $t$ is defined to be half the number of transpositions between $f^k_i$ and $f^k_j$, i.e., half the number of indices $l \in \{1, \ldots, M\}$ such that $a_l \neq b_l$. Each such pair $(a_l, b_l)$ is called a transposition pair. The Jaro distance is then
$$J(f^k_i, f^k_j) = \frac{1}{3}\left(\frac{M}{|f^k_i|} + \frac{M}{|f^k_j|} + \frac{M - t}{M}\right),$$
with $J(f^k_i, f^k_j) = 0$ if $M = 0$. Figure 1 gives an example for the strings "NITHOWLG" and "NIGHTOWL". Notice that "G" is not considered a matching character, as "G" in "NITHOWLG" is the 8th character while "G" in "NIGHTOWL" is the 3rd character, which is out of the $W = 4$ window for this example. Here, $J = \frac{1}{3}\left(\frac{7}{8} + \frac{7}{8} + \frac{7-1}{7}\right) \approx 0.869$.
The Jaro-Winkler metric, $JW(f^k_i, f^k_j)$, modifies the original Jaro distance by giving extra weight to matching prefixes. It uses a fixed prefix factor $p$ to give a higher similarity score to features that match from the beginning for a prefix length $\ell_{i,j}$. Given two features $f^k_i$ and $f^k_j$, the Jaro-Winkler metric is
$$JW(f^k_i, f^k_j) = J(f^k_i, f^k_j) + p\,\ell_{i,j}\left(1 - J(f^k_i, f^k_j)\right), \tag{1}$$
where $J(f^k_i, f^k_j)$ is the Jaro distance between the two features, $p$ is a given prefix factor, and $\ell_{i,j}$ is the number of prefix characters in $f^k_i$ that match prefix characters in $f^k_j$. In Winkler's original work he set $p = 0.1$ and restricted $\ell_{i,j} \le 4$ (even when prefixes of five or more characters matched) [36]. We follow the same parameter choice and restriction in our applications in this paper. So long as $p\,\ell_{i,j} \le 1$ for all $i, j$, the Jaro-Winkler metric ranges from 0 to 1, where 1 indicates exact similarity between two features and 0 indicates no similarity.
In Figure 1 we have $\ell_{i,j} = 2$, as both features have identical first and second characters, but not a matching third character. This leads to $JW = 0.869 + 0.1 \cdot 2 \cdot (1 - 0.869) \approx 0.895$.
Because we remove stop words and irrelevant words from our set of features, it is possible for an entry to contain a feature that does not appear in the reduced set of features $f^k$. If a feature $\tilde f \in e_{i,k}$ does not appear in the dictionary $f^k$, we set $JW(f^k_q, \tilde f) := 0$ for all $f^k_q \in f^k$. We call such features $\tilde f$ null features.
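The following Python sketch implements these definitions under our reading of them (our own illustration; the window convention is chosen to match the $W = 4$ example above, whereas standard Jaro implementations subtract 1 from it):

```python
def jaro(s1, s2):
    """Jaro similarity with the character window W of Section 3.1."""
    if s1 == s2:
        return 1.0
    if not s1 or not s2:         # null features get similarity 0
        return 0.0
    W = max(len(s1), len(s2)) // 2            # W = 4 in the Figure 1 example
    m1, m2 = [False] * len(s1), [False] * len(s2)
    matches = 0
    for i, ch in enumerate(s1):               # greedy left-to-right matching
        for j in range(max(0, i - W), min(len(s2), i + W + 1)):
            if not m2[j] and s2[j] == ch:
                m1[i] = m2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    a = [c for c, f in zip(s1, m1) if f]      # matched characters, in order
    b = [c for c, f in zip(s2, m2) if f]
    t = sum(x != y for x, y in zip(a, b)) / 2 # half the transposition count
    M = matches
    return (M / len(s1) + M / len(s2) + (M - t) / M) / 3

def jaro_winkler(s1, s2, p=0.1):
    """Jaro-Winkler: boost by the common prefix length, capped at 4."""
    ell = 0
    for c1, c2 in zip(s1[:4], s2[:4]):
        if c1 != c2:
            break
        ell += 1
    j = jaro(s1, s2)
    return j + p * ell * (1 - j)

print(round(jaro("NITHOWLG", "NIGHTOWL"), 3))          # 0.869
print(round(jaro_winkler("NITHOWLG", "NIGHTOWL"), 3))  # 0.895
```

The two printed values reproduce the worked example of Figure 1.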

Algorithm 1: Jaro-Winkler Algorithm
Data: $c^k$, an $n \times 1$ array of text
Result: $JW(f^k_p, f^k_q)$ for each pair of features $f^k_p$, $f^k_q$ in the dictionary of field $c^k$

3.2 Feature-based similarity: TF-IDF

Another approach to duplicate detection, generally used in big-data record matching, looks at similar distributions of features across records. This feature-based method considers entries to be similar if they share many of the same features, regardless of order; this compensates for errors such as changes in article usage and varying word order (e.g., "The Bistro", "Bistro, The", or "Bistro"), as well as the addition of information (e.g., "The Bistro" and "The Bistro Restaurant").
This form of duplicate detection is closely related to vector space models of text corpora [30], in which a body of text is represented as a vector in some word vector space. The dimension of the space is the number of relevant words (other words are assumed to be meaningless), and, for a given record, each element of the vector representation is the frequency with which a word appears in the entry. (It should be noted that these models also disregard word order.) A more powerful extension of these models is the term frequency-inverse document frequency (TF-IDF) scheme [29]. This scheme reweights features based on their frequency in a single field as well as in an entry.
Using the reduced set of features $f^k$, we create the term frequency and inverse document frequency matrices. We define the term frequency matrix for the $k$-th field, $TF^k \in \mathbb{R}^{n \times m}$, such that $TF^k_{i,j}$ is the number of times the feature $f^k_j$ appears in the entry $e_{i,k}$ (possibly zero). A row of $TF^k$ represents the frequency of every feature in an entry.
Next we define the diagonal inverse document frequency matrix $IDF^k \in \mathbb{R}^{m \times m}$ with diagonal elements$^1$
$$IDF^k_{i,i} = \log\left(\frac{n}{|\{e \in c^k : f^k_i \in e\}|}\right),$$
where $|\{e \in c^k : f^k_i \in e\}|$ is the number of entries$^2$ in field $c^k$ containing feature $f^k_i$, and $n$ is the number of records in the data set. The matrix $IDF^k$ uses the number of entries in the field containing a feature to give features a more informative weight. The issue with using term frequency only is that it gives frequently appearing features a higher weight than rare features, even though the latter are often empirically more informative than common features. The basic intuition is that a feature that occurs frequently in many entries is not a good discriminator.
The resulting weight matrix for field $k$ is then defined with a logarithmic scaling of the term frequency as$^3$
$$TFIDF^k = \log\left(TF^k + \mathbf{1}\right) IDF^k, \tag{2}$$
where $\mathbf{1}$ is an $n \times m$ matrix of ones and the log operation acts on each element of $TF^k + \mathbf{1}$ individually. The resulting matrix has dimension $n \times m$. Finally we normalize each row of $TFIDF^k$ by its $\ell^1$ norm$^4$.

$^1$ We use log to denote the natural logarithm in this paper.
$^2$ By the construction of our set of features in Section 2, this number of entries is always positive.
$^3$ Note that, following [9], we use a slightly different logarithmic scaling than the more commonly used $TFIDF^k_{i,j} = (\log(TF^k_{i,j}) + 1)\, IDF^k_{i,i}$ if $TF^k_{i,j} \neq 0$, and $TFIDF^k_{i,j} = 0$ if $TF^k_{i,j} = 0$. This avoids having to deal with the case $TF^k_{i,j} = 0$ separately. The difference between $\log(TF^k_{i,j}) + 1$ and $\log(TF^k_{i,j} + 1)$ is bounded by 1 for $TF^k_{i,j} \ge 1$.
$^4$ Here we deviate from [9], in which the authors normalize by the $\ell^2$ norm. We do this so that later, in equation (3), we can guarantee that the soft TF-IDF values are bounded above by 1.
Each entry $TFIDF^k_{i,j}$ represents the weight assigned to feature $j$ in field $k$ for record $i$. Note that each entry is nonnegative.

Algorithm 2: TF-IDF Algorithm
Data: $c^k$, an $n \times 1$ array of text
Result: $TFIDF^k \in \mathbb{R}^{n \times m}$, the row-normalized TF-IDF weight matrix of field $c^k$
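A possible NumPy realization of this construction (a sketch under our own conventions, not the authors' implementation; `entries` and `dictionary` are assumed to come from the preprocessing of Section 2):

```python
import numpy as np

def tfidf_matrix(entries, dictionary):
    """entries: list of n feature lists (one per record, for a single field);
    dictionary: ordered list of the m features of that field.
    Returns the row-normalized TF-IDF weight matrix of equation (2)."""
    n, m = len(entries), len(dictionary)
    index = {f: j for j, f in enumerate(dictionary)}
    TF = np.zeros((n, m))
    for i, feats in enumerate(entries):
        for f in feats:
            if f in index:                 # null features are ignored
                TF[i, index[f]] += 1
    df = np.count_nonzero(TF, axis=0)      # entries containing each feature
    idf = np.log(n / np.maximum(df, 1))    # diagonal of IDF^k (df > 0 by construction)
    W = np.log(TF + 1) * idf               # column scaling = right-multiplication by diag(idf)
    norms = W.sum(axis=1, keepdims=True)   # l1 norm of each (nonnegative) row
    return np.divide(W, norms, out=np.zeros_like(W), where=norms > 0)
```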

Hybrid similarity: soft TF-IDF
The previous two methods concentrate on two different causes of record duplication, namely typographical error and varying word order. It is easy to imagine, however, a case in which both types of error occur; this leads us to a third class of methods, which combine the previous two. These hybrid methods measure the similarity between entries using character similarity between their features as well as weights of those features based on importance. Examples of such hybrid measures include the extended Jaccard similarity and the Monge-Elkan measure [25]. In this section we discuss another such method, soft TF-IDF [9], which combines TF-IDF with a character similarity measure. In our method, we use the Jaro-Winkler metric, discussed above in Section 3.1, as the character similarity measure in soft TF-IDF.
For $\theta \in [0, 1)$, let $S(\theta, e_{i,k}, e_{j,k})$ be the set of feature pairs whose Jaro-Winkler similarity exceeds $\theta$,
$$S(\theta, e_{i,k}, e_{j,k}) := \left\{ (p, q) : f^k_p \in e_{i,k},\; f^k_q \in e_{j,k},\; JW(f^k_p, f^k_q) > \theta \right\},$$
where $JW$ is the Jaro-Winkler similarity metric from (1). The soft TF-IDF similarity between two entries $e_{i,k}$ and $e_{j,k}$ in field $c^k$ is defined as
$$sTFIDF^k_{i,j} := \sum_{(p,q) \in S(\theta, e_{i,k}, e_{j,k})} TFIDF^k_{i,p}\; TFIDF^k_{j,q}\; JW(f^k_p, f^k_q). \tag{3}$$
The parameter $\theta$ allows for stronger control over the similarity of features, removing entirely pairs that do not have Jaro-Winkler similarity above a certain threshold. For the results presented in this paper, we set $\theta = 0.90$. Note from (3) that for all $i$, $j$, and $k$, we have $sTFIDF^k_{i,j} \in [0, 1]$. However, we do not necessarily have $sTFIDF^k_{i,i} = 1$, even though this might be expected to hold for a similarity measure. Luckily, these diagonal elements of $sTFIDF^k$ will not be relevant in our method. For definiteness and computational ease$^5$, we redefine, for all $k$ and all $i$, $sTFIDF^k_{i,i} := 1$.

In practice, this method's computational cost is greatly reduced by vectorization. Let $M^{k,\theta} \in \mathbb{R}^{m \times m}$ be the thresholded Jaro-Winkler similarity matrix defined by
$$M^{k,\theta}_{p,q} := \begin{cases} JW(f^k_p, f^k_q), & \text{if } JW(f^k_p, f^k_q) > \theta,\\ 0, & \text{otherwise.} \end{cases}$$
The soft TF-IDF similarity for each $(i,j)$ pairing can then be computed as
$$sTFIDF^k_{i,j} = \sum_{p,q=1}^{m} \left( (TFIDF^k_i)^T\, TFIDF^k_j \ast M^{k,\theta} \right)_{p,q},$$
where $TFIDF^k_i$ denotes the $i$-th row of the TF-IDF matrix of field $c^k$ and $\ast$ denotes the Hadamard, or element-wise, product. We can further simplify this using tensor products. Let $\overline{M}^{k,\theta} \in \mathbb{R}^{m^2}$ denote the vertical concatenation of the rows of $M^{k,\theta}$. We then have
$$\left( sTFIDF^k_{i,1}, \ldots, sTFIDF^k_{i,n} \right)^T = \left( TFIDF^k_i \otimes TFIDF^k \right) \overline{M}^{k,\theta},$$
where $\otimes$ is the Kronecker product. Finally we set (redefine) the diagonal elements $sTFIDF^k_{i,i} = 1$.
The similarity matrices produced by the TF-IDF and Jaro-Winkler methods are typically sparse. This sparsity can be leveraged to further reduce the computational cost of the soft TF-IDF method.
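A sketch of the vectorized computation (our own illustration; `jaro_winkler` is the sketch from Section 3.1, and for large dictionaries the dense loops and products below would be replaced by sparse operations, e.g. with scipy.sparse):

```python
import numpy as np

def soft_tfidf(W, dictionary, theta=0.90):
    """W: n×m row-normalized TF-IDF matrix of one field.
    Returns the n×n matrix sTFIDF^k of equation (3)."""
    m = len(dictionary)
    M = np.zeros((m, m))                   # thresholded Jaro-Winkler matrix M^{k,theta}
    for p in range(m):
        for q in range(m):
            jw = jaro_winkler(dictionary[p], dictionary[q])
            if jw > theta:
                M[p, q] = jw
    S = W @ M @ W.T                        # sTFIDF^k_{i,j} = W_i M W_j^T
    np.fill_diagonal(S, 1.0)               # redefine the diagonal to 1
    return S
```

The single matrix product `W @ M @ W.T` evaluates all $(i,j)$ pairings at once, which is the point of the vectorization.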

Algorithm 3: soft TF-IDF Algorithm
for each pair of entries $(e_{i,k}, e_{j,k})$ in field $c^k$ do: compute the soft TF-IDF score $sTFIDF^k_{i,j}$ as in (3)

The soft TF-IDF scores above are defined between entries of a single field. For each pair of records we produce a composite similarity score $ST_{i,j}$ by adding their soft TF-IDF scores over all fields:
$$ST_{i,j} := \sum_{k=1}^{a} sTFIDF^k_{i,j}. \tag{4}$$
Hence $ST \in \mathbb{R}^{n \times n}$, with $ST_{i,j}$ the score between the $i$-th and $j$-th records. Remember that $a$ is the number of fields in the data set; each composite similarity score $ST_{i,j}$ is a number between 0 and $a$. For some applications it may be desirable to let some fields have a greater influence on the composite similarity score than others. In the above formulation this can easily be achieved by replacing the sum in (4) by a weighted sum,
$$ST^w_{i,j} := \sum_{k=1}^{a} w_k\, sTFIDF^k_{i,j},$$
for positive weights $w_k \in \mathbb{R}$, $k \in \{1, \ldots, a\}$. If the weights are chosen such that $\sum_{k=1}^{a} w_k \le a$, then the weighted composite similarity score $ST^w$ takes values in $[0, a]$, like $ST$. In this paper we use the unweighted composite similarity score $ST$.

Using TF-IDF instead of soft TF-IDF
In our experiments in Section 5 we will also show results in which we use TF-IDF, not soft TF-IDF, to compute similarity scores. This can be achieved in a completely analogous way to the one described in Section 3.3, if we replace (3) by
$$simTFIDF^k_{i,j} := \left( TFIDF^k\, (TFIDF^k)^T \right)_{i,j},$$
where $TFIDF^k$ is the TF-IDF matrix from (2) and the superscript $T$ denotes the matrix transpose. Note that this is equivalent to replacing $M^{k,\theta}$ by the identity matrix, i.e., setting $JW(f^k_p, f^k_q)$ equal to the Kronecker delta $\delta_{p,q}$.
All the other computations from Section 3.3, in particular the computation of the composite similarity score in (4), then continue as before.

New method
We extend the soft TF-IDF method to address two common situations in duplicate detection: missing entries and large numbers of duplicates. For data sets with only one field, handling a missing field is a non-issue; a missing field is irreconcilable, as no other information is gathered. In a multi-field setting, however, we are faced with the problem of comparing partially complete records. Another issue is that a record may have more than one duplicate. If all entries are pairwise similar we can easily justify linking them all, but in cases where one record is similar to two different records which are dissimilar to each other the solution is not so clear cut. Figure 2 shows an outline of our method. First we use TF-IDF to assign weights to features that indicate the importance of that feature in an entry. Next, we use soft TF-IDF with the Jaro-Winkler metric to address spelling inconsistencies in our data sets. After this, we adjust for sparsity by taking into consideration whether or not a record has missing entries. Using the similarity matrix produced from the previous steps, we threshold and group records into clusters. Lastly, we refine these groups by evaluating how clusters break up under different conditions.

Adjusting for sparsity
A missing entry is an entry that is either entirely empty or contains only null features. Here, we assume that missing entries do not provide any information about the record and therefore cannot aid us in determining whether two records should be clustered together (i.e., labeled as probable duplicates). In [35], [36], and [2], records with missing entries are discarded, filled in by human fieldwork, and filled in by an expectation-maximization (EM) imputation algorithm, respectively. For cases in which a large number of entries are missing, or in data sets with a large number of fields such that records have a high probability of missing at least one entry, the first two approaches are impractical. Furthermore, the estimation of missing fields is equivalent to unordered categorical estimation, and in fields where a large number of features are present (i.e., the set of features is large), this type of EM estimation becomes computationally intractable [26,38,16]. Thus, a better method is required.
Leaving the records with missing entries in our data set, both TF-IDF and Jaro-Winkler remain well defined, allowing the soft TF-IDF schemes to proceed. However, because the Jaro-Winkler metric between a null feature and any other feature is 0, the soft TF-IDF score between a missing entry and any other entry is 0. This punishes sparse records in the composite soft TF-IDF matrix $ST$: even if two records have exactly the same entries in the fields where neither record has missing entries, their missing entries deflate their composite soft TF-IDF similarity. Consider the following example using two records and three fields: ["Joe Bruin", " ", "male"] and ["Joe Bruin", "CA", " "]. The two records are likely to represent a unique entity "Joe Bruin", but the composite soft TF-IDF score between the two records is only 1 out of a possible 3 (0.33 of the similarity score range), due to the missing entry in the second field of the first record and the missing entry in the third field of the second record. To correct for this, we take into consideration the number of mutually present (not missing) entries in the same field for two records.
This can be done in a vectorized manner to accelerate computation. Let $B$ be an $n \times a$ binary matrix, where $a$ is the number of fields in the data set, such that
$$B_{i,k} := \begin{cases} 1, & \text{if } e_{i,k} \text{ is not a missing entry},\\ 0, & \text{if } e_{i,k} \text{ is a missing entry.} \end{cases}$$
This is a binary mask of the data set, where 1 denotes a non-missing entry (with or without error) and 0 denotes a missing entry. In the product $BB^T \in \mathbb{R}^{n \times n}$, each $(BB^T)_{i,j}$ is the number of "shared fields" between records $r_i$ and $r_j$, i.e., the number of fields $c^k$ such that both $e_{i,k}$ and $e_{j,k}$ are non-missing entries. Our "sparsity adjusted soft TF-IDF similarity" is given by
$$adjST := ST \oslash BB^T, \tag{5}$$
where $\oslash$ denotes element-wise division.
Remembering that $JW(f^k_p, f^k_q) = 0$ if $f^k_p$ or $f^k_q$ is a null feature, we see that, if $e_{i,k}$ or $e_{j,k}$ is a missing entry, then the set $S(\theta, e_{i,k}, e_{j,k})$ used in (3) is empty (independent of the choice of $\theta$) and thus $sTFIDF^k_{i,j} = 0$. Hence, for all $i \neq j$, we have $ST_{i,j} \in [0, (BB^T)_{i,j}]$ (which refines our earlier result that $ST_{i,j} \in [0, a]$) and thus $adjST_{i,j} \in [0, 1]$$^6$.
In particular, in the event that there are records $r_i$ and $r_j$ such that $(BB^T)_{i,j} = 0$, we have $ST_{i,j} = 0$. Hence, if $(BB^T)_{i,j} = 0$, so that the expression in (5) is not defined, we set $adjST_{i,j} := 0$ instead. In the data sets we discuss in Section 5, no pair of records was without shared fields, and so (5) suffices for our purposes in this paper.

Algorithm 4: Adjusting for Sparsity
Data: $sTFIDF^k \in \mathbb{R}^{n \times n}$ for $k \in \{1, \ldots, a\}$; $D$, an $n \times a$ array of text
Result: $adjST \in \mathbb{R}^{n \times n}$
for each entry $e_{i,k}$ in each field $c^k$ of $D$ do: compute $B_{i,k}$
Initialize $ST = \sum_k sTFIDF^k$
Adjust $ST$ for sparsity: $adjST = ST \oslash BB^T$
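In code, the composite score (4), the adjustment (5), and the convention for pairs without shared fields might look as follows (a sketch; `field_scores` is an assumed list of the per-field soft TF-IDF matrices):

```python
import numpy as np

def adjusted_similarity(field_scores, B):
    """field_scores: list of a matrices sTFIDF^k (each n×n); B: n×a binary
    mask with B[i, k] = 1 iff entry e_{i,k} is not missing."""
    ST = sum(field_scores)                 # composite score of equation (4)
    shared = B @ B.T                       # (BB^T)_{ij} = number of shared fields
    # element-wise division of (5); pairs with no shared fields get adjST = 0
    return np.divide(ST, shared, out=np.zeros_like(ST), where=shared > 0)
```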

Thresholding and grouping
The similarity score $adjST_{i,j}$ gives us an indication of how similar the records $r_i$ and $r_j$ are: if $adjST_{i,j}$ is close to 1, then the records are more likely to represent the same entity. Now we present our method of determining whether a set of records are duplicates of each other, based on $adjST$. There exist many clustering methods that could be used to accomplish this goal; for example, [24] considers this question in the context of duplicate detection. For simplicity, in this paper we restrict ourselves to a relatively straightforward thresholding procedure, but other methods could be substituted in future implementations. We call this the thresholding and grouping step (TGS).
The method we will present below is also applicable to clustering based on other similarity scores, so it is useful to present it in a more general format. Let $SIM \in \mathbb{R}^{n \times n}$ be a matrix of similarity scores, i.e., for all $i, j$, the entry $SIM_{i,j}$ is a similarity score between the records $r_i$ and $r_j$. We assume that, for all $i \neq j$, $SIM_{i,j} = SIM_{j,i} \in [0, a]$$^7$. If we use our adjusted soft TF-IDF method, $SIM$ is given by $adjST$ from (5). In Section 4.1 we saw that in that case we even have $SIM_{i,j} \in [0, 1]$.
Let $\tau \in [0, a]$ be a threshold and let $S$ be the thresholded similarity score matrix defined, for $i \neq j$, as
$$S_{i,j} := \begin{cases} 1, & \text{if } SIM_{i,j} \ge \tau,\\ 0, & \text{if } SIM_{i,j} < \tau, \end{cases}$$
with $S_{i,i} := 1$ for all $i$.

$^7$ We will not be concerned with the diagonal values of $SIM$, because trivially any record is a 'duplicate' of itself, but for definiteness we may assume that, for all $i$, $SIM_{i,i} = a$.
Records $r_i$ and $r_j$ ($i \neq j$) are clustered together (as probable duplicates) if at least one of the following two conditions is satisfied: 1. $S_{i,j} = 1$; 2. there exists a record $r_k$ ($i \neq k \neq j$) such that $S_{i,k} = 1$ and $S_{j,k} = 1$.
Note that (since we have set $S_{i,i} = 1$ as above) we can combine both conditions into one condition: $(S^2)_{i,j} \ge 1$. The output of the TGS is a clustering of all the records in the data set, i.e., a collection of clusters, each containing one or more records, such that each record belongs to exactly one cluster. The choice of $\tau$ is crucial in the formation of clusters. Choosing a threshold that is too low leads to large clusters of records that represent more than one unique entity. Choosing a threshold that is too high breaks the data set into a large number of clusters, where a single entity may be represented by more than one cluster. Here, we propose a method of choosing $\tau$.
Let $H \in \mathbb{R}^n$ be the $n \times 1$ vector defined by
$$H_i := \max_{j \neq i} SIM_{i,j}.$$
In other words, each element of $H$ is the maximum similarity score between a fixed record and every other record. Now define
$$\tau_H := \mu(H) + \sigma(H),$$
where $\mu(H)$ is the mean value of $H$ and $\sigma(H)$ is its standard deviation. In many of our runs (Figure 3a is a representative example), there is a large peak of $H$ values around the mean value $\mu(H)$. Choosing $\tau_H = \mu(H) + \sigma(H)$ will typically place the threshold far enough to the right of this peak so that the records corresponding to this peak do not get clustered together, yet also far enough removed from the maximum value so that more than only the top matches get identified as duplicates. In some cases, however, where the distribution of $H$ values has a peak near the maximum value (as, for example, in Figure 3b), the value $\mu(H) + \sigma(H)$ will be larger than the maximum, and we choose $\tau_H = \mu(H)$ instead.

It may not always be possible to choose a threshold in such a way that all the clusters generated by our TGS correspond well to sets of duplicates, as the following example, illustrated in Figure 4, shows. We consider an artificial toy data set for which we computed the adjusted soft TF-IDF similarity, based on seven fields. We represent the result of the TGS as a graph in which each node represents a record in the data set. We connect nodes $i$ and $j$ ($i \neq j$) by an edge if and only if their similarity score $SIM_{i,j}$ equals or exceeds the chosen threshold value $\tau$. The connected components of the resulting graph then correspond to the clusters the TGS outputs.
For simplicity, Figure 4 shows only the features of each entry from the first two fields (first name and last name). Based on manual inspection, we declare the ground truth for this example to contain two unique entities: "Joey Bruin" and "Joan Lurin". The goal of our TGS is to detect two clusters, one for each unique entity. Using $\tau = 5.5$, we find one cluster (Figure 4a). Using $\tau = 5.6$, we do obtain two clusters (Figure 4b), but it is not true that one cluster represents "Joey Bruin" and the other "Joan Lurin", as desired. Instead, one cluster consists of only the "Joey B" record, while the other cluster contains all other records. Increasing $\tau$ further until the clusters change would only result in more clusters, so we cannot obtain the desired result this way. This happens because the adjusted soft TF-IDF similarity between "Joey B" and "Joey Bruin" (respectively "Joe Bruin") is less than the adjusted soft TF-IDF similarity between "Joey Bruin" (respectively "Joe Bruin") and "Joan Lurin". To address this issue, we apply a refinement to each set of clustered records created by the TGS, as explained in the next section.
The graph representation of the TGS output turns out to be a very useful tool, and in what follows we will use its language interchangeably with the 'cluster' language.

Algorithm 5: Thresholding and Grouping Step (TGS)
Data: $SIM \in \mathbb{R}^{n \times n}$, a threshold $\tau \in [0, a]$
Result: a clustering of the records
for each pair of distinct records $r_i$ and $r_j$ do: compute $S_{i,j}$
for each pair of distinct records $r_i$ and $r_j$ do: if $(S^2)_{i,j} \ge 1$, assign $r_i$ and $r_j$ to the same cluster
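A compact sketch of the TGS and the automatic threshold choice (our own illustration; finding the connected components of the thresholded graph yields the same clusters as iterating the $(S^2)_{i,j} \ge 1$ test):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def choose_threshold(SIM):
    """tau_H = mu(H) + sigma(H), falling back to mu(H) if that exceeds max(H)."""
    H = np.where(np.eye(len(SIM), dtype=bool), -np.inf, SIM).max(axis=1)
    tau = H.mean() + H.std()
    return tau if tau <= H.max() else H.mean()

def threshold_and_group(SIM, tau):
    """Returns a cluster label per record: the connected components of the
    graph whose edges are record pairs with SIM >= tau."""
    S = (SIM >= tau).astype(np.int8)
    np.fill_diagonal(S, 1)
    _, labels = connected_components(csr_matrix(S), directed=False)
    return labels
```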

Refinement
As the example in Figure 4 has shown, the clusters created by the TGS are not necessarily complete subgraphs: it is possible for a cluster to contain records $r_i$, $r_j$ for which $S_{i,j} = 0$, while also containing a record $r_k$ such that $S_{i,k} = 1$ and $S_{j,k} = 1$ ($i \neq j \neq k \neq i$). In such cases it is a priori unclear whether the best clustering is indeed achieved by grouping $r_i$ and $r_j$ together. We introduce a way to refine clusters created in the TGS to deal with situations like these. We take the following steps to refine a cluster $R$: 1. determine whether $R$ needs to be refined by determining the cluster's stability with respect to single record removal; 2. if $R$ needs to be refined, remove one record at a time from $R$ to determine the "optimal record" $r^*$ to remove; 3. if $r^*$ is removed from $R$, find the subcluster to which $r^*$ does belong.
Before we describe these steps in more detail, we introduce more notation. Given a cluster (as determined by the TGS) $R = \{r_{t_1}, \ldots, r_{t_p}\}$ containing $p$ records, the thresholded similarity score matrix of the cluster $R$ is given by the restricted matrix $S|_R \in \mathbb{R}^{p \times p}$ with elements $(S|_R)_{i,j} := S_{t_i, t_j}$. Remember that we represent $R$ by a graph, where each node corresponds to a record $r_{t_i}$ and two distinct nodes are connected by an edge if and only if their corresponding thresholded similarity score $(S|_R)_{i,j}$ is 1. If a record $r_{t_i}$ is removed from $R$, the remaining set of records is $R(r_{t_i}) := \{r_{t_1}, \ldots, r_{t_{i-1}}, r_{t_{i+1}}, \ldots, r_{t_p}\}$. We define the subclusters $R_1, \ldots, R_q$ of $R(r_{t_i})$ as the subsets of nodes corresponding to the connected components of the subgraph induced by $R(r_{t_i})$.
Step 1. Starting with a cluster $R$ from the TGS, we first determine whether $R$ needs to be refined by investigating, for each $r_{t_i} \in R$, the subclusters of $R(r_{t_i})$. If, for every $r_{t_i} \in R$, $R(r_{t_i})$ has a single subcluster, then $R$ need not be refined; an example of this is shown in Figure 5. If there is an $r_{t_i} \in R$ such that $R(r_{t_i})$ has two or more subclusters, then we refine $R$.
Step 2. For any set $\tilde R$ consisting of $p$ records, we define its strength as the average similarity between the records in $\tilde R$:
$$s(\tilde R) := \frac{1}{p(p-1)} \sum_{i \neq j} (S|_{\tilde R})_{i,j}, \tag{6}$$
with $s(\tilde R) := 0$ if $p = 1$. Note that $s(\tilde R) = 1$ if $S|_{\tilde R} = 1_{p \times p}$$^8$. In other words, a cluster has a strength of 1 if every pair of records in that cluster satisfies condition 1 of the TGS. If in Step 1 we have determined that the cluster $R$ requires refinement, we find the optimal record $r^* = r_{t_{k^*}}$ such that the average strength of the subclusters of $R(r^*)$ is maximized:
$$k^* := \underset{i}{\arg\max}\; \frac{1}{q(i)} \sum_{j} s(R_j).$$
Here the sum is over all $j$ such that $R_j$ is a subcluster of $R(r_{t_i})$, and $q(i)$ is the ($i$-dependent) number of subclusters of $R(r_{t_i})$. In the unlikely event that the maximizer is not unique, we arbitrarily choose one of the maximizers as $k^*$.
Step 3. After finding the optimal $r^*$ to remove, we must determine the subcluster to which to add it. To do so, we evaluate the strength of the set $R_j \cup \{r^*\} \subset R$ for each subcluster $R_j \subset R(r^*)$. We then add $r^*$ to the subcluster $R^* = R_{l^*}$, where
$$l^* := \underset{j:\, R_j \text{ is a subcluster of } R(r^*)}{\arg\max}\; s(R_j \cup \{r^*\}).$$

Fig. 5: An example of a cluster $R$ that does not require refinement. Each node represents a record. In each test we remove one and only one node from the cluster and apply the TGS again. The red node represents the removed record $r_{t_i}$; the remaining black nodes make up the set $R(r_{t_i})$. Notice that every time we remove a record, all other records are still connected to each other by solid lines; hence $R$ does not need to be refined.
In the rare event that the maximizer is not unique, we arbitrarily choose one of the maximizers as $l^*$. We always add $r^*$ to one of the other subclusters and do not consider the possibility of letting $\{r^*\}$ be its own cluster. Note that this is justified, since from our definition of strength in (6), $s(\{r^*\}) = 0 < s(R_{l^*} \cup \{r^*\})$, because $r^*$ was connected to at least two other records in the original cluster $R$.
Finally, the original cluster $R$ is removed from the output clustering, and the new clusters $R_1, \ldots, R_{l^*-1}, R^*, R_{l^*+1}, \ldots, R_{q(k^*)}$ are added to the clustering. Figure 6 shows an example of how the refinement helps us find the desired clusters.
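The three refinement steps can be sketched as follows (our own illustration; `S` is the 0/1 thresholded similarity matrix for the whole data set and `cluster` is a list of record indices forming one TGS cluster):

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def strength(S, nodes):
    """Average similarity s(R) of (6); single records have strength 0."""
    p = len(nodes)
    if p < 2:
        return 0.0
    sub = S[np.ix_(nodes, nodes)]
    return (sub.sum() - np.trace(sub)) / (p * (p - 1))

def subclusters(S, nodes):
    """Connected components of the subgraph induced by `nodes`."""
    sub = S[np.ix_(nodes, nodes)]
    _, lab = connected_components(csr_matrix(sub), directed=False)
    return [[v for v, l in zip(nodes, lab) if l == c] for c in set(lab)]

def refine(S, cluster):
    # Step 1: check stability under single-record removal
    splits = {r: subclusters(S, [x for x in cluster if x != r]) for r in cluster}
    if all(len(parts) == 1 for parts in splits.values()):
        return [cluster]                          # no refinement needed
    # Step 2: remove the record r* maximizing the average subcluster strength
    avg = lambda parts: np.mean([strength(S, part) for part in parts])
    r_star = max(cluster, key=lambda r: avg(splits[r]))
    parts = splits[r_star]
    # Step 3: re-insert r* into the subcluster it strengthens most
    best = max(range(len(parts)), key=lambda j: strength(S, parts[j] + [r_star]))
    parts[best] = parts[best] + [r_star]
    return parts
```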
In our implementation, we compute the optimal values $k^*$ and $l^*$ via an exhaustive search over all candidates. This can be computationally expensive when the initial threshold $\tau$ is small, leading to large initial clusters.

The data sets
The results presented in this section are based on four data sets: the Field Interview Card data set (FI), the Restaurant data set (RST), the Restaurant data set with entries removed to induce sparsity (RST30), and the Cora Citation Matching data set (Cora). FI is not publicly available at the moment.

Algorithm 6: Refinement
Data: $R = \{r_{t_1}, \ldots, r_{t_p}\}$, a cluster resulting from the TGS
Result: $\mathcal{R}$, a set of refined clusters
for each $r_{t_i} \in R$ do: find the subclusters $R_1, \ldots, R_q$ of $R(r_{t_i})$
if there exists an $r_{t_i}$ such that $R(r_{t_i})$ has more than one subcluster then: refine $R$ (Steps 2 and 3 above)
else: do not refine $R$: $\mathcal{R} = \{R\}$

The other data sets can currently be found at [27]. Cora can also be accessed at [4]. RST and Cora are also used in [6] to compare several approaches to evaluate duplicate detection.

Fig. 6: An example of how refinement is used to improve our clusters. The left figure shows that by removing the record "Joan Lurin", we obtain the two desired subsets. The right figure shows that "Joan Lurin" is inserted back into the appropriate cluster. Note that we have not changed the threshold value $\tau$ during this process.

FI Each record in the FI data set contains the following fields: first name, last name, suffix, date of event, location of event, social security number, residential address, gang affiliation, and gang moniker. The latter two are based on expert knowledge. A subset of this data set is used and described in more detail in [13]. The FI data set has 8,834 records, collected during the years 2001-2011. A ground truth of unique individuals is available, based on expert opinion; there are 2,920 unique people represented in the FI data set. The FI data set contains many misspellings as well as different names that correspond to the same individual. Another issue is variation over time: a given person is not guaranteed to have the same real-world home address in two separate observations, and thus we would not necessarily expect matching address fields in our data, regardless of human error. Approximately 30% of the entries are missing, but the "last name" field has no missing entries.
RST This data set is a collection of restaurant information based on reviews from Fodor and Zagat, collected by Dr. Sheila Tejada [32], who also manually generated the ground truth. It contains five fields: restaurant name, address, location, phone number, and type of food. There are 864 records containing 752 unique entities/restaurants. There are no missing entries in this data set. The types of errors that are present include word and letter transpositions, varying standards for word abbreviation (e.g. "deli" and "delicatessen"), typographical errors, and conflicting information (such as different phone numbers for the same restaurant).

RST30
To be able to study the influence of sparsity of the data set on our results, we remove approximately 30% of the entries from the address, city, phone number, and type of cuisine fields; we call the resulting data set RST30. We choose the percentage of removed entries to correspond to the percentage of missing entries in the FI data set. Because the FI data set has a field without missing entries, we likewise do not remove entries from the restaurant "name" field.
Cora The records in the Cora Citation Matching data set$^9$ are citations to research papers [22]. Each of Cora's 1,295 records is a distinct citation to any one of the 122 unique papers to which the data set contains references. We use three fields: author(s), name of publication, and venue (the name of the journal of publication). This data set contains misspellings and a small number of missing entries (approximately 3%).

Evaluation metrics
We compare the performances of the methods summarized in Table 1. Each of these methods outputs a similarity matrix, which we then use in the TGS to create clusters.
To evaluate the methods, we use purity [15], inverse purity, their harmonic mean [14], the relative error in the number of clusters, precision, recall [10,7], the F-measure (or $F_1$ score) [28,5], the z-Rand score [23,33], and normalized mutual information (NMI) [31], all of which are metrics that compare the output clusterings of the methods with the ground truth.

Table 1: Summary of methods used. The second, third, and fourth columns list for each method which similarity score matrix is used in the TGS, whether words or 3-grams are used as features, and whether the refinement step is applied after the TGS, respectively. The similarity score matrix refers either to the matrix from equation (4) or to the alternative explained in Section 3.4.

Method            Similarity matrix   Features   Refinement
TF-IDF            Section 3.4         words      no
TF-IDF 3g         Section 3.4         3-grams    no
sTF-IDF           (4)                 words      no
sTF-IDF 3g        (4)                 3-grams    no
sTF-IDF ref       (4)                 words      yes
sTF-IDF 3g ref    (4)                 3-grams    yes

Purity and inverse purity compare the clusters of records which the algorithm at hand gives with the ground truth clusters. Let $C := \{R_1, \ldots, R_c\}$ be the collection of $c$ clusters obtained from a clustering algorithm and let $C' := \{R'_1, \ldots, R'_{c'}\}$ be the collection of $c'$ clusters in the ground truth. Remember that $n$ is the number of records in the data set. Then we define purity as
$$\mathrm{Pur} := \frac{1}{n} \sum_{i=1}^{c} \max_{j} |R_i \cap R'_j|,$$
where we use the notation $|A|$ to denote the cardinality of a set $A$. In other words, we identify each cluster $R_i$ with (one of) the ground truth cluster(s) $R'_j$ which shares the most records with it, and compute purity as the total fraction of records that is correctly classified in this way. Note that this measure is biased to favor many small clusters over a few large ones; in particular, if each record forms its own cluster, $\mathrm{Pur} = 1$. To counteract this bias, we also consider inverse purity,
$$\mathrm{Inv} := \frac{1}{n} \sum_{j=1}^{c'} \max_{i} |R_i \cap R'_j|.$$
Note that inverse purity has a bias opposite to purity's bias: if the algorithm outputs only one cluster containing all the records, then $\mathrm{Inv} = 1$.
We combine purity and inverse purity in their harmonic mean$^{10}$,
$$\mathrm{HM} := \frac{2}{\mathrm{Pur}^{-1} + \mathrm{Inv}^{-1}}.$$
The relative error in the number of clusters in $C$ is defined as
$$\frac{|c - c'|}{c'}.$$
We define precision, recall, and the F-measure (or $F_1$ score) by considering pairs of records that have correctly been identified as duplicates. This differs from purity and inverse purity as defined above, which consider individual records. To define these metrics the following notation is useful. Let $\mathcal{G}$ be the set of (unordered) pairs of records that are duplicates according to the ground truth of the particular data set under consideration,
$$\mathcal{G} := \big\{ \{r, s\} : r \neq s \text{ and } \exists R' \in C' \text{ s.t. } r, s \in R' \big\},$$
and let $\mathcal{C}$ be the set of (unordered) record pairs that have been clustered together by the duplicate detection method of choice,
$$\mathcal{C} := \big\{ \{r, s\} : r \neq s \text{ and } \exists R \in C \text{ s.t. } r, s \in R \big\}.$$
Precision is the fraction of the record pairs that have been clustered together that are indeed duplicates in the ground truth, and recall is the fraction of record pairs that are duplicates in the ground truth that have been correctly identified as such by the method:
$$\mathrm{Prec} := \frac{|\mathcal{C} \cap \mathcal{G}|}{|\mathcal{C}|}, \qquad \mathrm{Rec} := \frac{|\mathcal{C} \cap \mathcal{G}|}{|\mathcal{G}|}.$$
The F-measure or $F_1$ score is the harmonic mean of precision and recall,
$$F_1 := \frac{2\,\mathrm{Prec}\,\mathrm{Rec}}{\mathrm{Prec} + \mathrm{Rec}}.$$
Note that in the extreme case in which $c = n$, i.e., the case in which each cluster contains only one record, $\mathcal{C}$ is empty, so precision, and thus also the F-measure, is undefined.
Another evaluation metric based on pair counting is the z-Rand score. The z-Rand score $z_R$ is the number of standard deviations by which $|\mathcal{C} \cap \mathcal{G}|$ is removed from its mean value under a hypergeometric distribution of equally likely assignments with the same number and sizes of clusters. For further details about the z-Rand score, see [23,33,13]. The relative z-Rand score of $C$ is the z-Rand score of that clustering divided by the z-Rand score of $C'$, so that the ground truth $C'$ has a relative z-Rand score of 1$^{11}$.
A final evaluation metric we consider is normalized mutual information (NMI). To define it, we first need to introduce mutual information and entropy. We define the entropy of the collection of clusters $C$ as
$$\mathrm{Ent}(C) := -\sum_{i=1}^{c} \frac{|R_i|}{n} \log\left(\frac{|R_i|}{n}\right),$$
and similarly for $\mathrm{Ent}(C')$. The joint entropy of $C$ and $C'$ is
$$\mathrm{Ent}(C, C') := -\sum_{i=1}^{c} \sum_{j=1}^{c'} \frac{|R_i \cap R'_j|}{n} \log\left(\frac{|R_i \cap R'_j|}{n}\right).$$
The mutual information of $C$ and $C'$ is then defined as
$$I(C, C') := \sum_{i=1}^{c} \sum_{j=1}^{c'} \frac{|R_i \cap R'_j|}{n} \log\left(\frac{n\,|R_i \cap R'_j|}{|R_i|\,|R'_j|}\right) = \mathrm{Ent}(C) + \mathrm{Ent}(C') - \mathrm{Ent}(C, C').$$
There are various ways in which mutual information can be normalized. We choose to normalize by the geometric mean of $\mathrm{Ent}(C)$ and $\mathrm{Ent}(C')$ to give the normalized mutual information
$$\mathrm{NMI}(C, C') := \frac{I(C, C')}{\sqrt{\mathrm{Ent}(C)\,\mathrm{Ent}(C')}}.$$
Note that the entropy of $C$ is zero, and hence the normalized mutual information is undefined, when $c = 1$, i.e., when one cluster contains all the records.
In practice this is avoided by adding a small number (e.g., the floating-point relative accuracy eps in MATLAB).
For more information on many of these evaluation metrics, see also [3].
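For concreteness, minimal sketches of two of these metrics on integer label vectors (our own illustration; NMI as defined above is also available in scikit-learn as `normalized_mutual_info_score`, using `average_method='geometric'` for the geometric-mean normalization):

```python
from collections import Counter
from itertools import combinations

def purity(pred, true):
    """Fraction of records assigned to the majority ground-truth cluster of
    their predicted cluster; inverse purity is purity(true, pred)."""
    correct = 0
    for c in set(pred):
        members = [t for p, t in zip(pred, true) if p == c]
        correct += Counter(members).most_common(1)[0][1]
    return correct / len(pred)

def pairwise_f1(pred, true):
    """F1 score over record pairs clustered together (undefined if either
    pair set is empty, matching the discussion above)."""
    C = {pair for pair in combinations(range(len(pred)), 2)
         if pred[pair[0]] == pred[pair[1]]}
    G = {pair for pair in combinations(range(len(true)), 2)
         if true[pair[0]] == true[pair[1]]}
    if not C or not G:
        return float('nan')
    prec, rec = len(C & G) / len(C), len(C & G) / len(G)
    return 2 * prec * rec / (prec + rec)
```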

Results
In this section we consider six methods: TF-IDF, soft TF-IDF without the refinement step, and soft TF-IDF with the refinement step, with each of these three methods applied to both word features and 3-gram features. We also consider five evaluation metrics: the harmonic mean of purity and inverse purity, the relative error in the number of clusters, the $F_1$ score, the relative z-Rand score, and the NMI. We investigate the results in two different ways: (a) by plotting the scores for a particular evaluation metric versus the threshold values, for the six different methods in one plot, and (b) by plotting the evaluation scores obtained with a particular method versus the threshold values, for all five evaluation metrics in one plot.

The methods
When we compare the different methods by plotting the scores for a particular evaluation metric versus the threshold value $\tau$ for all the methods in one plot (as can be seen, for example, in Figure 7a), one notable attribute is that the methods that use word features typically all show similar behavior, as do the methods using 3-gram features. There are some useful distinctions to make, however, between the methods that do and do not include the refinement step; a further discussion of this follows in Section 6. This difference, though, is smaller than the difference between the word feature and 3-gram feature based methods. Unsurprisingly, between those two groups the behavior of the evaluation metrics is quite distinct, since the similarity scores produced by those methods, and hence their responses to different threshold values, are significantly different.
It is also interesting to note which methods give better evaluation metric outcomes on which data sets. On the FI data set the word feature based methods outperform the 3-gram based methods (judged on the basis of best case performance, i.e. the optimal score attained over the full threshold range) for every evaluation metric, except the NMI for which they perform similarly.
On both the RST and RST30 data sets, the word feature based methods outperform the 3-gram feature based methods on the $F_1$ score and relative z-Rand score (Figure 7b), but both groups of methods perform equally well for the other metrics.

Fig. 7: Two evaluation metrics as a function of the threshold value $\tau$, computed on two different data sets: (a) the $F_1$ score for the Cora data set; (b) the relative z-Rand score for the RST data set. Each of the six graphs in a plot corresponds to one of the six methods used. The filled markers indicate the metric's value at the automatically chosen threshold value for each method. In the legend, "(s)TF-IDF" stands for (soft) TF-IDF, "3g" indicates the use of 3-gram based features instead of word based ones, and "ref" indicates the presence of the refinement step.

It is noteworthy that all methods also do significantly worse on RST30 than on RST, when measured according
to the $F_1$ and relative z-Rand scores, while there is no great difference, if any, measured according to the other metrics.
On the Cora data set all the methods perform equally well according to all the evaluation metrics we considered. An interesting characteristic of the results on this data set, which is not observably present in the results for the other data sets, is that the methods that include the refinement step clearly outperform those that do not, according to the harmonic mean, $F_1$ score, relative z-Rand score, and NMI. Only the relative error in the number of clusters does not show a noticeable difference.

The metrics
When plotting the different evaluation metrics per method, we notice that the $F_1$ score and relative z-Rand score behave similarly, as do the harmonic mean of purity and inverse purity and the NMI. The relative error in the number of clusters is correlated to those other metrics in an interesting way. For the word feature based methods, the lowest relative error in the number of clusters is typically attained at or near the threshold values at which the $F_1$ and relative z-Rand scores are highest. Those are also usually the lowest threshold values for which the harmonic mean and NMI attain their high(est) values. The harmonic mean and NMI, however, usually remain quite high when the threshold values are increased, whereas the $F_1$ and relative z-Rand scores typically drop (sometimes rapidly) at increased threshold values, as the relative error in the number of clusters rises. Figure 8a shows an example of this behavior.
The relationship between the harmonic mean of purity and inverse purity and the NMI has some interesting subtleties. As mentioned before, they mostly show similar behavior, but the picture is slightly more subtle in certain situations. On the Cora data set, the harmonic mean drops noticeably for higher threshold values, before eventually settling at a near constant value; this drop is not present in the NMI. This behavior also appears in the plots for the 3-gram feature based methods on the FI data set, and very slightly in the word feature based methods on the RST data set (but not the RST30 data set). For word feature based methods on the FI data set the behavior is even more pronounced, with little to no 'settling down at a constant value' happening for high threshold values (e.g., Figure 8b).
Interestingly, both the harmonic mean and NMI show very slight (but consistent over both data sets) improvements at the highest threshold values for the 3-gram based methods applied to the RST and RST30 data sets.

The choice of threshold
On the RST and RST30 data sets our automatically chosen threshold performs well (see, e.g., Figures 7b, 8a, and 9a). It is usually close to (or sometimes even equal to) the threshold value at which some or all of the evaluation metrics attain their optimal value (remember that this threshold value is not the same for all the metrics). The performance on RST is slightly better than on RST30, as can be expected, but in both cases the results are good.
On the FI and Cora data sets our automatically chosen threshold is consistently larger than the optimal value, as can be seen in, e.g., Figures 7a, 8b, and 9b. This can be explained by the left-skewedness of the $H$-value distribution, as illustrated in Figure 3a.

Fig. 9: Different evaluation metrics as a function of the threshold value $\tau$, computed on two different data sets: (a) soft TF-IDF (on 3-gram based features) without the refinement step, applied to the RST data set; (b) soft TF-IDF (on 3-gram based features) with the refinement step, applied to the FI data set. Each of the graphs in a plot corresponds to one of the five evaluation metrics. The vertical dotted line indicates the automatically chosen threshold value for the method used.

A good proxy for the volume of the tail is the ratio of the number of records referring to unique entities to the total number of entries in the data set. For RST and RST30 this ratio is a high 0.87, whereas
for FI it is only 0.33 and for Cora only 0.09. This means that the relative error in the number of clusters grows rapidly with increasing threshold value and the values of the other evaluation metrics will deteriorate correspondingly.

Conclusions
In this paper we have investigated six methods which are based on term frequency-inverse document frequency counts for duplicate detection in a record data set. We have tested them on four different data sets and evaluated the outcomes using five different metrics.
One clear conclusion from our tests is that there is no benefit to constructing the features the methods work on using 3-grams as opposed to white space separated 'words'. The latter choice leads to methods that either outperform the former or perform equally well at worst (in terms of the optimal values they achieve for the evaluation metrics). Compare, for example, Figures 8b and 9b.
Somewhat surprisingly, our tests lead to a less clear picture regarding the choice between TF-IDF and soft TF-IDF (with word based features). For low threshold values TF-IDF performs better, for higher threshold values either soft TF-IDF performs slightly better, or the difference between the two methods is so small as to be negligible. Interesting exceptions are the F 1 score and relative z-Rand score for the RST30 data set. Here TF-IDF outperforms soft TF-IDF for almost every threshold value. At the highest threshold values both methods perform the same, as expected.
When it comes to the benefits of including the refinement step, the situation is again somewhat different depending on the data set. For the RST and RST30 data sets, for small threshold values including the refinement step is beneficial, which is to be expected, since the refinement will either increase the number of clusters formed or keep it the same, so its effect is similar to (but not the same as) raising the threshold value. On these data sets, for large threshold values there is little difference between including and excluding the refinement step. For the Cora data set an intermediate region is present between the lower and very high threshold values, in which the algorithms perform somewhat better without the refinement step (this is most noticeable in the word feature based algorithms; in the 3-gram feature based algorithms this effect is either absent or minor). This effect is least pronounced (to the point of becoming unnoticeable or absent) for the NMI evaluation metric. For the FI data set this 'intermediate' region extends all the way to the highest threshold values (at least for the word feature based algorithms; the qualitative behavior for the 3-gram feature based methods is similar to the Cora case), and we are left with a low threshold region in which the refinement step is an improvement and a high threshold region in which excluding that step gives better results. The effect is again least pronounced for the NMI metric.
Our tests with our automatically chosen threshold show that $\tau_H = \mu(H) + \sigma(H)$ is a good choice on data sets whose $H$-distributions are approximately normal or right-skewed. If, however, the $H$-distribution is left-skewed, this choice is consistently larger than the optimal threshold. It should be noted, though, that for most of the evaluation metrics and most of the data sets, the behavior of the metrics with respect to variations in the threshold value is not symmetric around the optimal value: typically the decline from optimality is less steep and/or smaller for higher threshold values than for lower ones. This effect is even stronger for methods without the refinement step. Combined with the fact that at low threshold values the refinement step requires much more computational time than at high threshold values, especially for larger data sets, we conclude that, in the absence of a priori knowledge of the optimal threshold value, it is better to overestimate than to underestimate it. Hence our suggestion to choose $\tau_H = \mu(H) + \sigma(H)$ is at worst a good rule of thumb and for certain data sets a very good choice.