Knowledge and Information Systems, Volume 37, Issue 3, pp 639–663

The address connector: noninvasive synchronization of hierarchical data sources

Regular Paper

DOI: 10.1007/s10115-012-0582-x

Cite this article as:
Augsten, N., Böhlen, M. & Gamper, J. Knowl Inf Syst (2013) 37: 639. doi:10.1007/s10115-012-0582-x

Abstract

Different databases often store information about the same or related objects in the real world. To enable collaboration between these databases, data items that refer to the same object must be identified. Residential addresses are data of particular interest as they often provide the only link between related pieces of information in different databases. Unfortunately, residential addresses that describe the same location might vary considerably and hence need to be synchronized. Non-matching street names and addresses stored at different levels of granularity make address synchronization a challenging task. Common approaches assume an authoritative reference set and correct residential addresses according to the reference set. Often, however, no reference set is available, and correcting addresses with different granularity is not possible. We present the address connector, which links residential addresses that refer to the same location. Instead of correcting addresses according to an authoritative reference set, the connector defines a lookup function for residential addresses. Given a query address and a target database, the lookup returns all residential addresses in the target database that refer to the same location. The lookup supports addresses that are stored with different granularity. To align the addresses of two matching streets, we use a global greedy address-matching algorithm that guarantees a stable matching. We define the concept of address containment that allows us to correctly link addresses with different granularity. The evaluation of our solution on real-world data from a municipality shows that our solution is both effective and efficient.

Keywords

Data quality, Record linkage, Entity resolution, Hierarchical data, Trees, Approximate matching, Similarity query, Residential addresses

1 Introduction

Large amounts of information about related objects in the real world are stored in databases. If different databases store data about the same real-world object, the data must be synchronized to enable collaboration. The synchronization is non-trivial: databases are often maintained by different departments that use different coding conventions, and data items that represent the same real-world object are identified through different key values.

Residential addresses are data of particular interest. They appear in many databases and are often the only link between relevant information in different databases. Unfortunately, addresses that describe the same location vary considerably as they are maintained and updated independently. A synchronization step is necessary to reconcile the addresses.

Synchronizing residential addresses is a challenging task. As an example, consider Fig. 1 with the databases from the Electricity Company and the Registration Office from the Municipality of Bolzano. We want to establish a link between residents and electricity bills using the addresses as the linking element. Both databases cover the same geographic area, but an exact match on the address attributes obviously fails.
Fig. 1

Two databases with residential addresses that cover the same geographic area

Common solutions for address synchronization assume an authoritative set of reference addresses, also termed address register, that is used to correct the residential addresses in the databases. This approach suffers from several limitations. Often an authoritative reference is not available, and it is not clear which database should be used to correct the other databases. Moreover, correcting addresses fails if the databases store addresses with different granularity levels. In Fig. 1, “Gilmstrasse 3” in the RO database refers to a house, while “Hermann-von-Gilm-Str. 3/A” and “Hermann-von-Gilm-Str. 3/B” in the EC database are more detailed and refer to different entrances in the same house. It is not possible to change the less detailed address to a more detailed one since it is not clear how to assign the residents Hans, Renate, and Max to the more detailed addresses. Vice versa, correcting a detailed address to a less detailed one is not acceptable since information gets lost.

Contributions: We present a new data structure, called the address connector, which links residential addresses from different databases that refer to the same location. The address connector can be represented as a relation, where each tuple defines a residential address and establishes a link between two other residential addresses. A key feature of our solution is that an authoritative reference is not needed. Instead, we establish links that equally respect all participating addresses. At the core of the address connector is the synchronization operator, which establishes the links between different addresses that refer to the same location. The synchronization operator faces two key challenges. First, streets must be matched even if different databases use different names for the same street (e.g., due to misspellings, different coding conventions, or renamed streets). Second, addresses that are stored with different granularity must be linked correctly, although there is no one-to-one correspondence between them.

We introduce a structure-aware distance measure between two streets, called street distance, which relies on both the name of the two streets and the addresses of the two streets. Toward this end, the residential addresses of a street are organized in an ordered, labeled tree, called address tree. The root of the address tree is the set of all known names of the street, while the rest of the tree represents house numbers, entrance numbers, and apartment numbers (see Fig. 2). The street distance relies on both the structural similarity of the address trees and the similarity of the street names. Such a structure-aware approach allows us to match streets even if the names are completely unrelated (e.g., in the case of renamed streets) or if the structure of the address trees is ambiguous (e.g., the address trees of the streets “Mariengasse” and “Untervigil” in Fig. 1 have identical structure).
Fig. 2

Address tree of “Friedensplatz” (registration office database)

Given the distances between all pairs of streets, the streets need to be matched. A constraint of the matching is that a street can have at most one matching partner. A threshold-based approach will not work since the same threshold may be too high for some streets (they are matched multiple times), but too low for other streets (they remain unmatched). We use a global greedy matching algorithm that matches each street to at most one other street, and we show that the resulting matching is stable. A matching is stable if no new street pair can be found such that the streets in the new pair are closer to each other than to their current partner in the matching.

Finally, the addresses of two matching streets must be linked. In general, there is no one-to-one correspondence between all addresses of two streets since they might be stored with different granularity. We introduce the concept of address containment. Intuitively, an address, \(a\), contains another address, \(b\), if the location referred to by \(a\) contains the location referred to by \(b\). For example, Friedensplatz 2 is a house that contains the apartments Friedensplatz 2/A/1 and Friedensplatz 2/A/2. We propose an efficient merge algorithm that correctly links addresses with different granularity by checking for address containment in addition to equality.

To summarize, we introduce the address connector that offers lookups of residential addresses in different databases. At the core of the address connector is the synchronization operator, which establishes links between residential addresses that refer to the same location. The main features of the synchronization operator are as follows: a new street distance based on address trees; a global greedy matching algorithm that computes stable street pairs; and the concept of address containment that allows us to correctly link addresses with different granularity levels. We implemented the connector and evaluated it with real-world data from the Municipality of Bolzano. The experiments show the effectiveness and efficiency of the connector.

Beyond Residential Addresses: Our solution, the address connector, is useful far beyond residential addresses and is easily adapted to other domains that require the synchronization of hierarchical structures. Examples include the synchronization of directory trees and taxonomies.

Directory trees must be synchronized by file synchronization and backup tools. As a first step, directory trees are loaded to the connector, where they are synchronized. The connector maintains links between corresponding directories in multiple directory trees and deals with the different granularities (e.g., empty directories vs. directories with files and subdirectories). Given a path to a file or a directory in one tree, the connector returns the respective paths in the other directory trees.

For taxonomies the connector is used in a similar way. Multiple taxonomies are loaded to the connector. Given a concept in one taxonomy, the respective concepts in the other taxonomies are returned. The taxonomies are allowed to have different levels of detail, for example, one taxonomy stores only the concept “animal,” while another taxonomy further subdivides animals into fishes, amphibians, reptiles, birds, and mammals. If the dictionaries of the taxonomies do not match, the synchronization algorithm in the connector is extended to deal with synonyms.

Outline: In the next section we define and motivate the problem. We outline the solution in Sect. 3. In Sect. 4 we give a solution for computing the distances between pairs of streets, we match streets with a global greedy matching algorithm, and we show how to link addresses with different granularity. In Sect. 5 we provide algorithms for linking residential addresses. The algorithms are experimentally evaluated in Sect. 6. Section 7 discusses related work, and in Sect. 8 we draw conclusions and point to future work.

2 Problem definition and motivation

2.1 Problem definition

We assume different databases that store residential addresses about the same geographic area. The residential addresses reference houses, house entrances, or apartments. The street names may be spelled in different languages, or a changed street name may not be reflected in some databases. The addresses may be stored with different granularity, for example, one database may store only the address of a house without specifying entrance or apartment number while another database may store also entrance and apartment numbers for the same house.

Our goal is an effective and efficient lookup of residential addresses in different databases. The input for the lookup is a residential address defined in one of the databases (query address) and a target database. The lookup returns the set of all addresses in the target database that refer to the same location as the query address.

Example 2.1

Consider the two address databases in Fig. 1, and let the Registration Office be the target database. The lookup of “Hermann-von-Gilm-Str. 1” should return \(\{\)“Gilmstrasse 1”\(\}\), that is, the query and the result address are equivalent. The lookup of “Siegesplatz 3/-/1” should return \(\{\)“Friedensplatz 3”\(\}\), that is, the query address is more detailed and is contained in the result address. Finally, the lookup of “Friedhofplatz 6” should return \(\{\)“Cimitero 6/A”, “Cimitero 6/B”\(\}\), that is, the result addresses are more detailed and are contained in the query address.
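The three cases of Example 2.1 can be illustrated with a small sketch. It assumes that the streets have already been matched and compares only the relative parts (house number, entrance, apartment). It also assumes that containment can be tested as a prefix relation on the non-null components; this prefix rule, the placeholder label "-" for a missing entrance, and the function names are illustrative assumptions, not the formal definition of address containment.

```python
from typing import Optional, Tuple

# Relative part of an address: (house number, entrance, apartment).
# None plays the role of a null value.
Rel = Tuple[int, Optional[str], Optional[str]]


def nonnull_prefix(rel: Rel) -> tuple:
    """Return the components of rel up to the first null value."""
    out = []
    for component in rel:
        if component is None:
            break
        out.append(component)
    return tuple(out)


def contains(a: Rel, b: Rel) -> bool:
    """Assumed rule: a contains b if a is a proper prefix of b."""
    pa, pb = nonnull_prefix(a), nonnull_prefix(b)
    return len(pa) < len(pb) and pb[:len(pa)] == pa


def classify(query: Rel, target: Rel) -> Optional[str]:
    """Classify how a target address relates to a query address."""
    if nonnull_prefix(query) == nonnull_prefix(target):
        return "equivalent"
    if contains(target, query):
        return "query contained in target"
    if contains(query, target):
        return "query contains target"
    return None  # the two addresses refer to different locations


# Hermann-von-Gilm-Str. 1 vs. Gilmstrasse 1
print(classify((1, None, None), (1, None, None)))
# Siegesplatz 3/-/1 vs. Friedensplatz 3
print(classify((3, "-", "1"), (3, None, None)))
# Friedhofplatz 6 vs. Cimitero 6/A
print(classify((6, None, None), (6, "A", None)))
```
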

2.2 Motivation

Our work is motivated by an application scenario from the Municipality of Bolzano. Many administrative tasks performed by the civil servants require combining information from different databases. The databases are maintained by internal (e.g., the Registration Office, the GIS Office, the Local Tax Office) or external departments of the municipality (e.g., the Electricity Company, the Land Registration Office, the Catastre). As residential addresses are often the only link between tuples in different databases, they must be used to access and connect related pieces of information.

Consider the two databases in Fig. 1. The Registration Office (RO) stores residents of apartments, and the Electricity Company (EC) stores the amount of the electricity bill of each apartment. For tax-fraud detection the municipality wants to compute a list of all apartments for which no electricity is paid although they have residents. To answer this query, the two databases have to be joined over corresponding residential addresses.

Unfortunately, exact matches between addresses mostly fail since the addresses differ substantially for a number of reasons: “Untervigil” is misspelled in one database; street names are coded using different conventions (e.g., “Hermann-von-Gilm-Str.” vs. “Gilmstrasse”); “Friedensplatz” was renamed to “Siegesplatz,” but the change was not reflected in all databases; and in the bilingual region of Bolzano, two names for each street exist, and they are used interchangeably, for example, “Friedhofplatz” and “Cimitero” are the German and Italian names of the same street. In addition to non-matching street names, the residential addresses are stored with different granularity in the different databases (e.g., with or without entrance/apartment numbers), and there is no one-to-one correspondence between them. For example, “Friedhofplatz 6” is a house that is divided into two parts with different entrances, “Cimitero 6/A” and “Cimitero 6/B.”

There is no authoritative reference database available to solve conflicts or to correct addresses. All input addresses have the same priority, and no input address can be deleted during the synchronization process. The synchronization must be extensible to additional databases. A mapping between only two address databases is of limited use since multiple departments need to interact and new services provided by the public administration require new departments to join the synchronization.

3 The connector

In this section we define the connector: a new data structure that supports the synchronization of residential addresses from different databases. The connector is represented as a relation. A tuple in the connector defines a residential address and establishes a link between two other residential addresses (see Fig. 3). A residential address is a reference to a physical object that is either a house, a part of a house that has its own entrance, or an apartment in a house. Residential addresses are grouped into partitions (which correspond to the different databases that are synchronized in the connector). The addresses of a partition are grouped into streets.
Fig. 3

Connector \(\mathfrak X \) after the synchronization \(\mathrm{synch }_{\mathcal A ,\mathcal B \rightarrow \mathcal C }(\mathfrak X )\)

Definition 3.1

(Connector, Residential Address, Partition, Street) A connector, \(\mathfrak X \), is a relation. A tuple \((id,a,c _1,c _2)\in \mathfrak X \) is identified by \(id\), defines the residential address \(a\), and establishes a link between the two residential addresses \(c _1\) and \(c _2\), where \(c _1\) and/or \(c _2\) may be empty (\(\epsilon \)). A residential address is a tuple \(( strName ,num,entr,apt)\) that consists of a non-empty set of street names, a house number, an entrance, and an apartment number (entrance and apartment number may be null values, \(@\)). The addresses of \(\mathfrak X \) are grouped into partitions, and each address is in exactly one partition. A street is a set of addresses which all are in the same partition and have identical street names.

The semantics of a tuple \((id,a,c _1,c _2)\in \mathfrak X \) in the connector is that \(a,\,c _1\), and \(c _2\) all refer to the same location. The identifier, \(id\), of the tuple is a triple of partition identifier, street identifier, and address identifier (local to the partition). Whenever possible we refer to an address only by the local address identifier. With \(\mathrm{str }(\mathcal A )\) we denote the set of all streets of a partition \(\mathcal A \). \(\mathrm{names }(\alpha )\) denotes the set of street names of a street \(\alpha \). The relative part of an address \(c\), \(\mathrm{rel }(c)=(num,entr,apt)\), is the triple of house number, entrance, and apartment number defined by address \(c\).

Example 3.1

The last tuple of the connector in Fig. 3 defines the address \(c _{13}\!\!=(\{{\text{ Cimitero,} \text{ Friedhofplatz}}\},6,B,@)\) of partition \(\mathcal C \), and it links the two addresses \(a _{10}\) of partition \(\mathcal A \) and \(b _{9}\) of partition \(\mathcal B \). Address \(c _{13}\) is in street \(\gamma _{3}\), which has two street names: \(\mathrm{names }(\gamma _{3})=\{{\text{ Cimitero,} \text{ Friedhofplatz}}\}\). The relative part of \(c _{13},\,\mathrm{rel }(c _{13})=(6,B,@)\), consists of house number \(6\), entrance \(B\), and a null value for the apartment number. Partition \(\mathcal A \) consists of the streets \(\mathrm{str }(\mathcal A )=\{\alpha _1,\alpha _2,\alpha _3,\alpha _4,\alpha _5\}\) (only two of them are shown in Fig. 3).
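As a minimal sketch, the connector tuple of Definition 3.1 could be modeled as follows. The class and field names are assumptions made for illustration, links are stored as tuple identifiers rather than as full addresses, and `None` stands for both the null value \(@\) and the empty address \(\epsilon \). The street identifier of \(b _9\) is not stated in Example 3.1 and is assumed here.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple

# Tuple identifier: (partition, street, local address identifier).
TupleId = Tuple[str, str, int]


@dataclass(frozen=True)
class Address:
    str_names: FrozenSet[str]    # non-empty set of street names
    num: int                     # house number
    entr: Optional[str] = None   # entrance; None stands for @
    apt: Optional[str] = None    # apartment number; None stands for @

    def __post_init__(self) -> None:
        if not self.str_names:
            raise ValueError("an address needs at least one street name")

    def rel(self) -> Tuple[int, Optional[str], Optional[str]]:
        """Relative part (num, entr, apt) of the address."""
        return (self.num, self.entr, self.apt)


@dataclass(frozen=True)
class ConnectorTuple:
    id: TupleId
    a: Address                    # address defined by this tuple
    c1: Optional[TupleId] = None  # first linked address; None = epsilon
    c2: Optional[TupleId] = None  # second linked address; None = epsilon


# The last tuple of Example 3.1: c13 links a10 and b9.
# The street identifier "beta3" of b9 is an assumption.
c13 = ConnectorTuple(
    id=("C", "gamma3", 13),
    a=Address(frozenset({"Cimitero", "Friedhofplatz"}), 6, "B"),
    c1=("A", "alpha1", 10),
    c2=("B", "beta3", 9),
)
print(c13.a.rel())  # (6, 'B', None)
```
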

In order to support the synchronization of residential addresses, the connector provides the following main functionalities:
  • \(load(\mathfrak X ,\text{ DB},\mathcal A )\): Load an address database into the connector. The residential addresses in DB are stored as a new partition, \(\mathcal A \), in the connector \(\mathfrak X \). The tuples in the partition define the residential addresses of DB, and they include dummy links to empty addresses.

  • \(synch(\mathfrak X ,\mathcal A ,\mathcal B ,\mathcal C )\): Synchronize the two partitions \(\mathcal A \) and \(\mathcal B \) and store the result in a new partition \(\mathcal C \). The tuples in the new partition, \(\mathcal C \), align addresses from \(\mathcal A \) and \(\mathcal B \) that refer to the same location. Each tuple defines a new address.

  • \(lookup(\mathfrak X ,(\mathcal A ,\alpha ,a),\mathcal B )\): Retrieve from partition \(\mathcal B \) those addresses that are aligned with address \(a \) from partition \(\mathcal A \).

Example 3.2

Consider the databases in Fig. 1 and the tax-fraud query, which requires a join of the two databases over corresponding residential addresses. Using the connector, the residential addresses of the two databases are first loaded, that is, \(load(\mathfrak X ,\text{ EC},\mathcal A )\) and \(load(\mathfrak X ,\text{ RO},\mathcal B )\). This operation creates two new partitions, \(\mathcal A \) and \(\mathcal B \), in the connector \(\mathfrak X \). Next, the partitions \(\mathcal A \) and \(\mathcal B \) are synchronized by calling \(synch(\mathfrak X ,\mathcal A ,\mathcal B ,\mathcal C )\). The tuples in the new partition, \(\mathcal C \), establish links between addresses from \(\mathcal A \) and \(\mathcal B \) that refer to the same location, and each tuple defines a new address. For example, \(a _3=({\text{ Hermann-von-Gilm-Str.}},3,B,@)\) and \(b _2=({\text{ Gilmstrasse}},3,@,@)\) are linked (\(a _3\) is an entrance of house \(b _2\)) and define the new address \(c _3\). Finally, we take the addresses that have an electricity bill with amount zero and do a lookup in the RO database to find residents who do not pay for their electricity. For example, \(lookup(\mathfrak X ,(\mathcal A ,\alpha _1,a _{10}),\mathcal B )\) retrieves the set \(\{b _8,b _9\}\) representing two entrances of house \(a _{10}\). Thus, apartments “Cimitero 6/A” and “Cimitero 6/B” have residents but do not pay for the electricity.
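The lookup in Example 3.2 can be sketched for the simplest case, in which the query partition and the target partition were synchronized directly. The sketch reduces each tuple of partition \(\mathcal C \) to its pair of linked addresses; the function name, the string identifiers, and this reduced representation are assumptions made for illustration, and lookups across several synchronization steps are not covered.

```python
from typing import Iterable, Optional, Set, Tuple

# One link per connector tuple of the synchronized partition:
# (address of A, address of B); None stands for the empty address.
Link = Tuple[Optional[str], Optional[str]]


def lookup(links: Iterable[Link], query: str) -> Set[str]:
    """Return all addresses of B that are linked to the query address of A."""
    return {b for a, b in links if a == query and b is not None}


# Links mentioned in Example 3.2: a3 is linked to b2, a10 to b8 and b9.
links_c = [("a3", "b2"), ("a10", "b8"), ("a10", "b9")]

print(sorted(lookup(links_c, "a10")))  # ['b8', 'b9']
```
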

The synchronization operator is the most important one and will be described in more detail below.

4 The synchronization operator

The synchronization operator, \(\mathrm{synch }_{\mathcal A ,\mathcal B \rightarrow \mathcal C }(\mathfrak X )\), aligns the addresses of partitions \(\mathcal A \) and \(\mathcal B \) and stores the result in a new partition \(\mathcal C \). A tuple in the new partition establishes a link between two addresses of \(\mathcal A \) and \(\mathcal B \) that refer to the same location, and the address defined by the tuple represents the linked addresses.

4.1 Overview

The synchronization of two partitions, \(\mathcal A \) and \(\mathcal B \), is a three-step process:
  1. Computing Street Distances: Given two streets, \(\alpha \in \mathrm{str }(\mathcal A )\) and \(\beta \in \mathrm{str }(\mathcal B )\), the distance, \(\mathrm{dist }(\alpha ,\beta )\), between the two streets is computed.
    $$\begin{aligned} \begin{array}{ll} Input:&\quad \alpha \in \mathrm{str }(\mathcal A ),\,\beta \in \mathrm{str }(\mathcal B )\\ Output:&\quad \,\mathrm{dist }(\alpha ,\beta )\in [0..1]\\ \end{array} \end{aligned}$$

  2. Matching Streets: Assume the streets of two partitions, \(\mathrm{str }(\mathcal A )=\{\alpha _1,\ldots ,\alpha _M\}\) and \(\mathrm{str }(\mathcal B )=\{\beta _1,\ldots ,\beta _N\},\,M \le N\), and a distance matrix \(D_{M\times N}\) with the distance between streets \(\alpha _i\) and \(\beta _j,\,\mathrm{dist }(\alpha _i,\beta _j)\), in row \(i\) and column \(j\). A matching, \(\mathsf M \), between the streets is computed, such that each street of \(\mathcal A \) matches at most one street of \(\mathcal B \) and vice versa.
    $$\begin{aligned} \begin{array}{ll} Input:&\quad \mathrm{str }(\mathcal A ),\,\mathrm{str }(\mathcal B ),\,D_{M\times N}\\ Output:&\quad \mathsf M \subseteq \mathrm{str }(\mathcal A )\times \mathrm{str }(\mathcal B ) \text{ such} \text{ that}\\ \,&\quad \forall (\alpha ,\beta )\in \mathsf M \; \forall (\alpha ^{\prime },\beta ^{\prime })\in \mathsf M : \alpha =\alpha ^{\prime } \Leftrightarrow \beta =\beta ^{\prime }\\ \end{array} \end{aligned}$$

  3. Linking Addresses: Links between the addresses of two streets, \((\alpha ,\beta )\in \mathsf M \), are established. Two addresses are linked if they refer to the same location. An address that has no counterpart in the other street is linked to the empty address (\(\epsilon \)). Each link produces a new connector tuple, and the address defined by the new tuple represents the linked addresses. The set \(\bar{\gamma }\) of new tuples defines a new street \(\gamma \) in a new partition \(\mathcal{C }\notin \{\mathcal{A },\mathcal{B }\}\). With \(I=\{(\mathcal{C },{\gamma },c_i)\mid c_i \in \mathbb{N }\}\) we denote the set of tuple identifiers for street \(\gamma \) of partition \(\mathcal C \), and \(n_{\gamma }=\mathrm{names }(\alpha )\cup \mathrm{names }(\beta )\).
    $$\begin{aligned} \begin{array}{ll} Input:&\quad (\alpha , \beta ) \in \mathsf M \\ Output:&\quad \,\bar{\gamma }\subseteq I\times \big [\{n_{\gamma }\} \times \{\mathrm{rel }(c)\mid c\in \alpha \cup \beta \}\big ] \times \big [\alpha \cup \{\epsilon \}\big ] \times \big [\beta \cup \{\epsilon \}\big ] \end{array} \end{aligned}$$

In the following we discuss each of these steps in detail.

4.2 Step 1: Computing street distances

To compute the similarity of two streets, we introduce and define a new street distance which is based on two independent characteristics: the name of the two streets and the structure of the addresses of the two streets. To that end, we represent each street by its address tree.

Address Trees: The addresses of a street, \(\alpha \), define a hierarchy and are represented as a so-called address tree, \(\mathbf T (\alpha )\) [3]. Figure 4 shows the address trees of the streets in partitions \(\mathcal A \) and \(\mathcal B \) of connector \(\mathfrak X \) (see Fig. 3). The root of an address tree is the set of names of the corresponding street, the children of the root are the house numbers, the children of house numbers are the entrance numbers, and the children of entrance numbers are the apartment numbers. An address is a path from the root to a leaf node. For example, the shaded path in Fig. 4b is the address “Friedensplatz 2/A/1.” The identifiers of addresses that are defined by a root–leaf path are shown below the respective leaf. We omit null values in the address trees.
Fig. 4

Example address trees. a Address trees of the electricity company (partition \(\mathcal A \)), b address trees of the registration office (partition \(\mathcal B \))
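The construction of an address tree from the relative parts of the addresses of one street can be sketched as follows. The nested-dictionary representation, the function name, and the sample addresses are assumptions chosen for illustration; the sketch further assumes that null values occur only in trailing positions, and it leaves out the root, which would hold the set of street names.

```python
from typing import Dict, Iterable, Optional, Tuple

# Relative part of an address: (house number, entrance, apartment).
Rel = Tuple[int, Optional[str], Optional[str]]


def build_address_tree(rel_parts: Iterable[Rel]) -> Dict[str, dict]:
    """Build the subtree below the root: house -> entrance -> apartment.

    Each address becomes one path; null values (None) are omitted, so a
    house without entrance and apartment is a leaf at the first level.
    """
    tree: Dict[str, dict] = {}
    # Sort so that siblings are inserted in a fixed order (ordered tree).
    ordered = sorted(rel_parts, key=lambda r: (r[0], r[1] or "", r[2] or ""))
    for rel in ordered:
        node = tree
        for label in rel:
            if label is None:  # trailing null values are not represented
                break
            node = node.setdefault(str(label), {})
    return tree


# Illustrative addresses of one street (not the full content of Fig. 4).
addresses = [(3, None, None), (2, "A", "2"), (2, "A", "1")]
print(build_address_tree(addresses))
# {'2': {'A': {'1': {}, '2': {}}}, '3': {}}
```
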

4.2.1 The name distance

The root node of an address tree represents the set of all known names of the corresponding street. We define the name distance between two address trees, \(\mathbf T (\alpha )\) and \(\mathbf T (\beta )\), as the minimum distance between two of their names, \(n_{\alpha }\in \mathrm{names }(\alpha )\) and \(n_{\beta }\in \mathrm{names }(\beta )\). We use the \(q\)-gram distance to determine the distance between a single pair of street names. The \(q\)-grams of a street name are all its substrings of length \(q\). Intuitively, two street names are similar if they have many \(q\)-grams in common.

Definition 4.1

(\(q\)-Gram Distance) Let \(s\) be a string of characters from a finite alphabet \(\Sigma \), and let \(s^{\prime }\) be the extended string that is formed by prefixing and suffixing \(s\) with \(q-1\) characters that are not in \(\Sigma \). A \(q\)-gram of \(s\) is a substring of length \(q\) of the extended string \(s^{\prime }\), and \(\mathcal{I }(s)\) is the bag of all \(q\)-grams of \(s\). The \(q\)-gram distance between two street names, \(s_1\) and \(s_2\), is defined as follows:
$$\begin{aligned} \mathrm{dist }_{q}(s_1, s_2)=1-\frac{|\mathcal{I }(s_1) \bigcap \!\!\!\!\!\!+ \mathcal{I }(s_2)|}{|\mathcal{I }(s_1) \uplus \mathcal{I }(s_2)|-|\mathcal{I }(s_1)\bigcap \!\!\!\!\!\!+\mathcal{I }(s_2)|}. \end{aligned}$$

The distance is normalized and can take values between \(0\) and \(1\). The \(q\)-gram distance is a pseudo-metric [6], that is, it is \(0\) if \(s_1=s_2\), it is symmetric, and the triangle inequality holds.

Example 4.1

We compute the name distance between the address trees \(\mathbf T (\alpha _5)\) and \(\mathbf T (\beta _4)\) in Fig. 4. Both root nodes store only one name, so the name distance is equal to the \(q\)-gram distance between these names. With \(q=3\), \(n_{\alpha }=\,\)“Untervigli”, and \(n_{\beta }=\,\)“Untervigil”, the respective \(q\)-gram bags are \({\mathcal{I }}(n_{\alpha })= \{ {\#\#{\text{ U}}}, {\#{\text{ Un}}},\,{{\text{ Unt}}},\,{{\text{ nte}}},\,{{\text{ ter}}},\,{{\text{ erv}}},\, {{\text{ rvi}}}, {{\text{ vig}}},\,{{\text{ igl}}},\,{{\text{ gli}}}, {{\text{ li}}\#}, \,{{\text{ i}}\#\#} \}\) and \({\mathcal{I }}(n_{\beta })= \{ {\#\#{\text{ U}}}, \,{\#{\text{ Un}}},\,{{\text{ Unt}}},\,{{\text{ nte}}}, {{\text{ ter}}},\,{{\text{ erv}}}, \,{{\text{ rvi}}},\,{{\text{ vig}}},\,{{\text{ igi}}},\,{{\text{ gil}}}, {{\text{ il}}\#}, \,{{\text{ l}}\#\#} \}\), and the \(q\)-gram distance is \(\mathrm{dist }_q(n_{\alpha }, n_{\beta })=1-\frac{|\mathcal{I }(n_{\alpha })\bigcap \!\!\!\!\!+ \mathcal{I }(n_{\beta })|}{|\mathcal{I }(n_{\alpha })\uplus \mathcal{I }(n_{\beta })|-|\mathcal{I }(n_{\alpha })\bigcap \!\!\!\!\!+ \mathcal{I }(n_{\beta })|} =1-\frac{8}{24-8}=\frac{1}{2}.\)
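The computation in Example 4.1 can be reproduced with a short sketch of Definition 4.1 and of the name distance as the minimum over all name pairs. The function names and the use of `#` as the padding character are assumptions; bags are modeled with `collections.Counter`, whose `&` and `+` operators give bag intersection and bag union.

```python
from collections import Counter
from typing import Iterable


def qgrams(s: str, q: int = 3, pad: str = "#") -> Counter:
    """Bag of q-grams of s, extended with q-1 padding characters per side."""
    ext = pad * (q - 1) + s + pad * (q - 1)
    return Counter(ext[i:i + q] for i in range(len(ext) - q + 1))


def qgram_distance(s1: str, s2: str, q: int = 3) -> float:
    """q-gram distance of Definition 4.1, a value between 0 and 1."""
    b1, b2 = qgrams(s1, q), qgrams(s2, q)
    inter = sum((b1 & b2).values())  # size of the bag intersection
    union = sum((b1 + b2).values())  # size of the bag union
    if union == inter:               # only possible for two empty bags
        return 0.0
    return 1.0 - inter / (union - inter)


def name_distance(names_a: Iterable[str], names_b: Iterable[str],
                  q: int = 3) -> float:
    """Minimum q-gram distance over all pairs of street names."""
    names_b = list(names_b)
    return min(qgram_distance(a, b, q) for a in names_a for b in names_b)


print(qgram_distance("Untervigli", "Untervigil"))  # 0.5
```
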

4.2.2 The structure distance

Intuitively, the structure distance of two streets considers how the (recorded) addresses of the two streets differ. If the addresses of a street are represented in an address tree, this measure can be defined as the structural distance between two address trees, and we will use \(pq\)-grams to measure the distance of two trees.

A \(pq\)-gram is a small, besom-shaped subtree consisting of an anchor node, \(p-1\) ancestors, and \(q\) consecutive children. Intuitively, the \(pq\)-grams are formed by shifting a \(pq\)-gram-shaped pattern over the tree (see Fig. 5). The nodes covered by the pattern form a \(pq\)-gram. The pattern is shifted such that each node appears in the anchor node position and each non-root node also in each leaf position of the pattern. We fill in dummy nodes for the parts of the pattern that extend beyond the tree border. For the following definitions we assume an ordered, labeled, rooted tree \(\mathbf T \). Each node \(\mathsf n \) of \(\mathbf T \) has a label \(\lambda (\mathsf n )\). A node with the special label “\({\text{*}}\)” is a dummy node.
Fig. 5

Computing the \(pq\)-grams of a tree. a pq-Gram pattern, b example tree \(\mathbf{T}_{0}\) and two 2, 3-grams of \(\mathbf{T}_{0}\)

Definition 4.2

(\(pq\)-Gram) Let \(\mathbf T \) be a tree, \(\mathsf {a} \) be a node of \(\mathbf T ,\,p>0,\,q>0\), and let \(\mathbf T ^{p,q}\) be \(\mathbf T \) extended with dummy nodes as follows: \(p-1\) ancestors to the root node, \(q-1\) children before the first and after the last child of each non-leaf node, and \(q\) children to each leaf. A \(pq\)-gram of \(\mathbf T \) with anchor node \(\mathsf {a} \) is a subtree of \(\mathbf T ^{p,q}\) that is composed of the following nodes: \(p\) nodes \(\mathsf {a} _{p-1},\dots ,\mathsf {a} _1,\mathsf {a} \), where \(\mathsf {a} _i\) is the ancestor of \(\mathsf {a} \) at distance \(i\), and \(q\) contiguous children \(\mathsf {c} _k,\dots ,\mathsf {c} _{k+q-1}\) of \(\mathsf {a} \).

We use a linear encoding and represent a \(pq\)-gram \(\mathbf G \) with anchor node \(\mathsf {a} \) as a tuple of its node labels, the label-tuple \(\lambda (\mathbf G )=(\lambda (\mathsf {a} _{p-1}),\dots ,\lambda (\mathsf {a} _1),\lambda (\mathsf {a}),\lambda (\mathsf {c} _k), \dots ,\lambda (\mathsf {c} _{k+q-1}))\). As the labels of a tree are not necessarily unique, two \(pq\)-grams of the same tree may yield identical label-tuples. The \(pq\)-gram distance is based on the number of label-tuples that two trees have in common.

We ignore the street names when we compute the structure distance between address trees and denote with \(\mathbf T ^{{\text{*}}}(\gamma )\) the address tree of street \(\gamma \) with a dummy root node. The structure of two address trees is similar if the trees are within a small \(pq\)-gram distance. The \(pq\)-gram distance is computed by splitting the trees into \(pq\)-grams; trees that share a high percentage of \(pq\)-grams are more similar than trees that share a low percentage.

Definition 4.3

(\(pq\)-Gram Distance) Let \(\mathcal{I }(\mathbf T )\) denote the bag of all label-tuples (labels of serialized \(pq\)-grams) of a tree \(\mathbf T \). The \(pq\)-gram distance between two trees, \(\mathbf T _1\) and \(\mathbf T _2\), is defined as follows:
$$\begin{aligned} \mathrm{dist }_{pq}(\mathbf T _1,\mathbf T _2) = 1 - \frac{|\mathcal{I }(\mathbf T _1) \bigcap \!\!\!\!\!+ \mathcal{I }(\mathbf T _2)|}{|\mathcal{I }(\mathbf T _1) \uplus \mathcal{I }(\mathbf T _2)|-|\mathcal{I }(\mathbf T _1) \bigcap \!\!\!\!\!+ \mathcal{I }(\mathbf T _2)|}. \end{aligned}$$

The \(pq\)-gram distance is normalized to values between \(0\) and \(1\) and was shown to be a pseudo-metric [6]. If \(\mathbf T _1\) and \(\mathbf T _2\) have identical structure and labels, the \(pq\)-gram distance is \(0\).

Example 4.2

We compute the structure distance between the address trees \(\mathbf T (\alpha _1)\) and \(\mathbf T (\beta _3)\) in Fig. 4 using the \(pq\)-gram distance (\(p=2,\,q=3\)). The root nodes are substituted by dummy nodes, the label-tuples are computed, and the \(pq\)-gram distance is computed by intersecting the bags of label-tuples:
$$\begin{aligned} \mathcal{I }(\mathbf T ^{{\text{*}}}(\alpha _1))&= \{({\text{*}},{\text{*}},{\text{*}},{\text{*}},4), ({\text{*}},{\text{*}},{\text{*}},4,6), ({\text{*}},{\text{*}},4,6,{\text{*}}), ({\text{*}},{\text{*}},6,{\text{*}},{\text{*}}),\\&({\text{*}},6,{\text{*}},{\text{*}},{\text{*}}), ({\text{*}},4,{\text{*}},{\text{*}},{\text{*}})\}, \\ \mathcal{I }(\mathbf T ^{{\text{*}}}(\beta _3))&= \{({\text{*}},{\text{*}},{\text{*}},{\text{*}},4), ({\text{*}},{\text{*}},{\text{*}},4,6), ({\text{*}},{\text{*}},4,6,{\text{*}}), ({\text{*}},{\text{*}},6,{\text{*}},{\text{*}}),\\&({\text{*}},6,{\text{*}},{\text{*}},A), ({\text{*}},6,{\text{*}},A,B), ({\text{*}},6,A,B,{\text{*}}), ({\text{*}},6,B,{\text{*}},{\text{*}}),\\&({\text{*}},4,{\text{*}},{\text{*}},{\text{*}}), (6,A,{\text{*}},{\text{*}},{\text{*}}), (6,B,{\text{*}},{\text{*}},{\text{*}})\},\\ \mathrm{dist }_{pq}(\mathbf T ^{{\text{*}}}(\alpha _1),\mathbf T ^{{\text{*}}}(\beta _3))&= 1 - \frac{|\mathcal{I }(\mathbf T ^{{\text{*}}}(\alpha _1)) \bigcap \!\!\!\!\!\!+ \mathcal{I }(\mathbf T ^{{\text{*}}}(\beta _3))|}{|\mathcal{I }(\mathbf T ^{{\text{*}}}(\alpha _1)) \uplus \mathcal{I }(\mathbf T ^{{\text{*}}}(\beta _3))|-|\mathcal{I }(\mathbf T ^{{\text{*}}}(\alpha _1)) \bigcap \!\!\!\!\!\!+ \mathcal{I }(\mathbf T ^{{\text{*}}}(\beta _3))|}\\&= 1-\frac{5}{17-5}=\frac{7}{12}. \end{aligned}$$
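The label-tuples and the distance of Example 4.2 can be reproduced with the following sketch of Definitions 4.2 and 4.3. The tree representation as nested `(label, children)` pairs and the function names are assumptions made for illustration; this is a straightforward recursive enumeration and not necessarily the algorithm used in the implementation evaluated in this paper.

```python
from collections import Counter
from typing import List, Tuple

# A tree node is a pair (label, list of child nodes), children in order.
Node = Tuple[str, List["Node"]]
DUMMY = "*"


def pq_gram_bag(tree: Node, p: int = 2, q: int = 3) -> Counter:
    """Bag of label-tuples of all pq-grams of the tree."""
    bag: Counter = Counter()

    def visit(node: Node, ancestors: Tuple[str, ...]) -> None:
        label, children = node
        stem = ancestors + (label,)  # p labels: ancestors and anchor
        if not children:
            # A leaf gets q dummy children.
            bag[stem + (DUMMY,) * q] += 1
            return
        # q-1 dummy children before the first and after the last child.
        ext = [DUMMY] * (q - 1) + [c[0] for c in children] + [DUMMY] * (q - 1)
        for i in range(len(ext) - q + 1):
            bag[stem + tuple(ext[i:i + q])] += 1
        for child in children:
            visit(child, stem[1:])   # keep the last p-1 labels as ancestors

    visit(tree, (DUMMY,) * (p - 1))  # p-1 dummy ancestors above the root
    return bag


def pq_gram_distance(t1: Node, t2: Node, p: int = 2, q: int = 3) -> float:
    """pq-gram distance of Definition 4.3."""
    b1, b2 = pq_gram_bag(t1, p, q), pq_gram_bag(t2, p, q)
    inter = sum((b1 & b2).values())
    union = sum((b1 + b2).values())
    return 1.0 - inter / (union - inter)


# Address trees of Example 4.2 with the root replaced by a dummy node.
t_alpha1: Node = (DUMMY, [("4", []), ("6", [])])
t_beta3: Node = (DUMMY, [("4", []), ("6", [("A", []), ("B", [])])])

print(pq_gram_distance(t_alpha1, t_beta3))  # 7/12, approximately 0.5833
```
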

4.2.3 The street distance

Depending on the input data, more reliable matches can be expected from either the name distance (e.g., both address sets use the same language and similar coding conventions) or the structure distance between the address trees (e.g., the input sets use different languages). We weight the name distance with \(\omega \) and structure distance with \(1-\omega \), and we combine the two distances into a single distance between address trees. Let \(d_n\) be the name distance and \(d_s\) the structure distance; the address tree distance is then defined as follows:
$$\begin{aligned} d=\sqrt{\omega d_n^2+(1-\omega ) d_s^2}. \end{aligned}$$

Example 4.3

The name distance \(d_n=0.7272\) between the renamed streets \(\beta _2\) (Friedensplatz) and \(\alpha _3\) (Siegesplatz) is larger than the name distance \(d_n=0.5\) between \(\beta _2\) and \(\alpha _1\) (Friedhofplatz). As the renamed streets are structurally more similar (\(d_s=0.7308\) vs. \(d_s=1.0\) between \(\beta _2\) and \(\alpha _1\)), the street distance (\(w=0.5\)) between these streets is smaller than the street distance between \(\beta _2\) and \(\alpha _1\) and they are matched correctly (see Fig. 6).
Fig. 6 Distance matrix for the address trees in Fig. 4
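The comparison in Example 4.3 can be checked numerically with a minimal sketch of the weighted combination. The function name `street_distance` and the argument check are assumptions made for illustration; the input values are the ones quoted in the example.

```python
import math

def street_distance(d_name: float, d_struct: float, w: float = 0.5) -> float:
    """Combine name and structure distance: sqrt(w*d_n^2 + (1-w)*d_s^2)."""
    if not 0.0 <= w <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return math.sqrt(w * d_name ** 2 + (1.0 - w) * d_struct ** 2)

# beta_2 (Friedensplatz) vs. alpha_3 (Siegesplatz): renamed street
renamed = street_distance(0.7272, 0.7308)      # roughly 0.729
# beta_2 (Friedensplatz) vs. alpha_1 (Friedhofplatz): similar name only
similar_name = street_distance(0.5, 1.0)       # roughly 0.791

print(renamed < similar_name)  # prints True
```

With equal weights the renamed pair ends up closer than the pair that merely shares a similar name, which is why the structure distance rescues the correct match.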

4.3 Step 2: Matching streets

Given the distances between all street pairs of two partitions, the streets need to be matched. We define the matching as a set of street pairs, where each street appears in at most one pair. Our goal is to compute a stable matching. Intuitively, a matching is stable if it is not possible to break up existing matches and form a new match such that the new match is better than the old matches for both matching partners.

Definition 4.4

(Matching and Stable Matching) A matching, \(\mathsf M \subseteq \mathrm{str }(\mathcal A )\times \mathrm{str }(\mathcal B )\), of the streets of two partitions, \(\mathrm{str }(\mathcal A )\) and \(\mathrm{str }(\mathcal B )\), is a set of street pairs (matches), where each street \(\alpha \in \mathrm{str }(\mathcal A )\) is paired with at most one street \(\beta \in \mathrm{str }(\mathcal B )\), and each street \(\beta \in \mathrm{str }(\mathcal B )\) is paired with at most one street \(\alpha \in \mathrm{str }(\mathcal A )\).

\(\mathsf M \) is stable if there is no pair \((\alpha ,\beta )\notin \mathsf M \), such that \(\alpha \) is closer to \(\beta \) than to its current partner in \(\mathsf M \), and \(\beta \) is closer to \(\alpha \) than to its current partner in \(\mathsf M \):
$$\begin{aligned} \begin{array}{l} \forall (\alpha ,\beta )\in (\mathrm{str }(\mathcal A )\times \mathrm{str }(\mathcal B ))\setminus \mathsf M : \\ \qquad \exists x,y: (\alpha ,y)\in \mathsf M \wedge (x,\beta )\in \mathsf M \Rightarrow \\ \qquad \qquad \mathrm{dist }(\alpha ,y)\le \mathrm{dist }(\alpha ,\beta ) \vee \mathrm{dist }(x,\beta )\le \mathrm{dist }(\alpha ,\beta )\\ \end{array} \end{aligned} \tag{1}$$
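Equation (1) can be read as a check over all non-matched pairs whose two streets both have a partner. The following sketch implements that reading; the function name `is_stable` and the representation of the matching as a list of (row, column) index pairs are assumptions for illustration.

```python
def is_stable(D, matching):
    """Check Equation (1) for a matching given as (row, col) index pairs.

    D[i][j] is the distance between street alpha_i and street beta_j.
    Pairs in which one street has no partner are not constrained by (1).
    """
    partner_of_row = {i: j for i, j in matching}
    partner_of_col = {j: i for i, j in matching}
    for i, y in partner_of_row.items():
        for j, x in partner_of_col.items():
            if j == y:
                continue  # (i, j) is itself a match
            # blocking pair: both streets strictly prefer each other
            if D[i][y] > D[i][j] and D[x][j] > D[i][j]:
                return False
    return True

D = [[0.1, 0.2],
     [0.3, 0.9]]
print(is_stable(D, [(0, 0), (1, 1)]))  # prints True
print(is_stable(D, [(0, 1), (1, 0)]))  # prints False
```

In the second matching, the pair (0, 0) has distance 0.1, which is smaller than the distances of both streets to their assigned partners, so it breaks stability.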

Let \(D\) be the \(M\times N\) distance matrix that stores the distances between the streets of the two partitions, \(\mathrm{str }(\mathcal A )=\{\alpha _1,\ldots ,\alpha _{M}\}\) and \(\mathrm{str }(\mathcal B )=\{\beta _1,\ldots ,\beta _{N}\}\). The distance between the streets \(\alpha _i\) and \(\beta _j\) is stored in row \(i\) and column \(j\) of \(D\). Figure 6 shows the distance matrix for the address trees in Fig. 4. Name distance and structure distance are equally weighted (\(w=0.5\)), and the correct matches are shaded.

We choose a global greedy algorithm to solve the street matching problem. Such an approach matches close street pairs first and avoids missing good matches due to earlier mismatches. Matched streets are marked, and no street is matched twice. The matching produced by the algorithm is stable.

Example 4.4

Consider the distance matrix in Fig. 6. The global greedy matching computes the matches in the following order: \((\alpha _5,\beta _4),\,(\alpha _4,\beta _5),\,(\alpha _3,\beta _2),\,(\alpha _1,\beta _3),\,(\alpha _2,\beta _1)\).

The matching of the global greedy algorithm is maximum, that is, each street of the smaller set is matched to a street of the larger set. If the smaller set contains streets that should not have a matching partner in the larger set, they will still be matched. Therefore, the global greedy algorithm will perform best if the two sets have a large overlap. This is often the case in scenarios where one-to-one matches are meaningful. For example, the residential address databases in our application scenario (registration office, electricity company, and the census database of Bolzano) have an overlap of more than 95 %.

Note that matching two streets if they are within a fixed distance threshold is not good enough. The threshold may be too low for some streets (they remain unmatched), but too high for others (they are matched to multiple streets in the other partition). Often it is impossible to set a good threshold. A local greedy approach traverses the streets of one partition in random order and matches each street to its nearest neighbor in the other partition. If the nearest neighbor of a street is already matched, the next-nearest neighbors are visited until an unmatched street is found. Each street is matched only once, but the quality of the matching depends on the random matching order. Neither approach guarantees stable matches.

Example 4.5

Consider, for example, the distance matrix in Fig. 6. A threshold larger than or equal to \(0.8091\) matches \(\beta _1,\,\beta _4\), and \(\beta _5\) to \(\alpha _4\). For smaller thresholds, \(\beta _1\) and \(\beta _3\) remain unmatched. The local greedy algorithm matches each row to the unmatched column with the smallest distance value in the respective row. We match the rows in the order given by the distance matrix (first row first) and get the matching \(\mathsf M =\,\{(\alpha _{1},\beta _{2}),\,(\alpha _{2},\beta _{1}),\, (\alpha _{3},\beta _{3}),\,(\alpha _{4},\beta _{5}),\,(\alpha _{5},\beta _{4})\}.\) As \(\alpha _1\) is mismatched to \(\beta _{2}\) in the beginning, \(\alpha _{3}\) cannot be matched to its nearest neighbor \(\beta _{2}\), but it is matched to \(\beta _{3}\), which is very distant from \(\alpha _{3}\).
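The two alternative strategies discussed above can be sketched as follows. The toy matrix is an invented example (it is not the matrix of Fig. 6), and the function names and the explicit `row_order` parameter, which stands in for the random traversal order, are assumptions.

```python
def threshold_matching(D, tau):
    """Match every pair whose distance is within the fixed threshold tau."""
    return [(i, j) for i, row in enumerate(D)
            for j, d in enumerate(row) if d <= tau]

def local_greedy(D, row_order=None):
    """Match each row, in the given order, to its nearest unmatched column."""
    order = list(range(len(D))) if row_order is None else list(row_order)
    used_cols = set()
    matching = []
    for i in order:
        free = [j for j in range(len(D[i])) if j not in used_cols]
        if not free:
            break  # no unmatched column is left
        j = min(free, key=lambda c: D[i][c])
        used_cols.add(j)
        matching.append((i, j))
    return matching

D = [[0.4, 0.3],
     [0.9, 0.2]]
print(threshold_matching(D, 0.35))  # prints [(0, 1), (1, 1)]
print(local_greedy(D))              # prints [(0, 1), (1, 0)]
print(local_greedy(D, [1, 0]))      # prints [(1, 1), (0, 0)]
```

The threshold assigns column 1 twice and leaves column 0 unmatched, while the local greedy result changes with the traversal order; only the second order avoids the distant pair (1, 0).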

4.4 Step 3: Linking addresses

In this section we establish links between the addresses of two streets. Each link is represented by a new tuple in the connector. The new tuples define a new street, and each of the linked addresses is represented by one or more addresses in the new street.

Two addresses should be linked if they refer to the same location. It is not enough to check whether the relative parts of the addresses are equivalent as the addresses may be stored with different granularity. For example, “Gilmstrasse 3” should match both “Hermann-von-Gilm-Str. 3/A” and “Hermann-von-Gilm-Str. 3/B,” but the entrance is not specified in “Gilmstrasse 3.” We define the concept of address equivalence and address containment.

Definition 4.5

(Address Equivalence and Containment) Let \(a \) and \(b \) be two residential addresses. Address \(a \) is equivalent to address \(b \) (\(a \equiv b \)) if both addresses refer to the same physical object. \(a \) contains \(b \) (\(a \sqsupseteq b \)) if and only if \(b \) refers to an object that is part of the object referred to by \(a \) or \(a \equiv b \).

The input for the address linking are the street pairs provided by the street-matching algorithm. The addresses of two matched streets refer to locations in the same real-world street; thus, addresses with identical relative parts are equivalent. Further, if all non-null values of \(\mathrm{rel }(a)=(num_{a},entr_{a},apt_{a})\) are the same as the respective attribute values of \(\mathrm{rel }(b)=(num_{b},entr_{b},apt_{b})\), then \(a \) contains \(b \).
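For two addresses of matched streets, the containment test on relative parts reduces to a comparison of the non-null attributes. A minimal sketch, assuming that the relative part is a triple (number, entrance, apartment) with `None` for null values; the function names are illustrative.

```python
from typing import Optional, Tuple

Rel = Tuple[Optional[str], Optional[str], Optional[str]]  # (num, entr, apt)

def contains(rel_a: Rel, rel_b: Rel) -> bool:
    """a contains b if every non-null value of rel(a) equals that of rel(b)."""
    return all(x is None or x == y for x, y in zip(rel_a, rel_b))

def equivalent(rel_a: Rel, rel_b: Rel) -> bool:
    """Within matched streets, identical relative parts are equivalent."""
    return rel_a == rel_b

house = ("3", None, None)      # e.g. "Gilmstrasse 3"
entrance_a = ("3", "A", None)  # e.g. "Hermann-von-Gilm-Str. 3/A"
entrance_b = ("3", "B", None)  # e.g. "Hermann-von-Gilm-Str. 3/B"

print(contains(house, entrance_a))       # prints True
print(contains(entrance_a, house))       # prints False
print(contains(entrance_a, entrance_b))  # prints False
```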

We establish a link between two addresses if they are equivalent or if one address is contained in the other. The addresses that cannot be linked to an address in the other street are linked to the empty address (\(\epsilon \)). Each link is represented by a new tuple in the connector. The address defined by the new tuple represents the two linked addresses, and its relative part is identical to the relative part of the linked address that is more detailed. The set of new tuples, \(\bar{\gamma }\), defines a new street, \(\gamma \), with \(\mathrm{names }(\gamma )=\mathrm{names }(\alpha )\cup \mathrm{names }(\beta )\).

Definition 4.6

(Address Linking) The address linking between two streets, \(\alpha \) and \(\beta \), is the following set of new connector tuples:
$$\begin{aligned} \begin{array}{lll}&\,&\bar{\gamma } = \{(id(), (n_{\gamma })\circ \mathrm{rel }(a),a,b)\mid a\in \alpha , b\in \beta ,\mathrm{rel }(a)\sqsubseteq \mathrm{rel }(b)\}\cup \\ [2mm]&\,&\;\,\quad \quad \{(id(), (n_{\gamma })\circ \mathrm{rel }(b),a,b)\mid a\in \alpha , b\in \beta ,\mathrm{rel }(b)\sqsubseteq \mathrm{rel }(a)\}\cup \\ [2mm]&\,&\;\,\quad \quad \{(id(), (n_{\gamma })\circ \mathrm{rel }(a),a,\epsilon )\mid \not \exists b\in \beta :\mathrm{rel }(b)\sqsubseteq \mathrm{rel }(a)\vee \mathrm{rel }(a)\sqsubseteq \mathrm{rel }(b)\}\cup \\ [2mm]&\,&\;\,\quad \quad \{(id(), (n_{\gamma })\circ \mathrm{rel }(b),\epsilon ,b)\mid \not \exists a\in \alpha :\mathrm{rel }(a)\sqsubseteq \mathrm{rel }(b)\vee \mathrm{rel }(b)\sqsubseteq \mathrm{rel }(a)\}, \\ [2mm] \end{array} \end{aligned}$$
where \(n_{\gamma }=\mathrm{names }(\alpha )\cup \mathrm{names }(\beta )\), and \(id()\) creates an identifier for each connector tuple in \(\bar{\gamma }\).
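Definition 4.6 can be evaluated directly by testing all address pairs. The sketch below is this naive quadratic variant, not the merge-based Algorithm 4. It assumes that a street is a dictionary from address identifiers to relative parts, omits the identifier \(id()\) and the names \(n_{\gamma }\), uses `None` for the empty address \(\epsilon \), and emits a single tuple when two addresses are equivalent.

```python
def contains(rel_a, rel_b):
    """a contains b if every non-null value of rel(a) equals that of rel(b)."""
    return all(x is None or x == y for x, y in zip(rel_a, rel_b))

def link_addresses(alpha, beta):
    """Return connector tuples (rel, a_id, b_id); None stands for epsilon."""
    links, linked_a, linked_b = [], set(), set()
    for a_id, ra in alpha.items():
        for b_id, rb in beta.items():
            if contains(rb, ra):    # rel(a) is contained in rel(b): keep rel(a)
                links.append((ra, a_id, b_id))
            elif contains(ra, rb):  # rel(b) is contained in rel(a): keep rel(b)
                links.append((rb, a_id, b_id))
            else:
                continue
            linked_a.add(a_id)
            linked_b.add(b_id)
    # addresses without a partner are linked to the empty address
    links += [(ra, a_id, None) for a_id, ra in alpha.items() if a_id not in linked_a]
    links += [(rb, None, b_id) for b_id, rb in beta.items() if b_id not in linked_b]
    return links

alpha = {"a1": ("3", None, None), "a2": ("5", None, None)}
beta = {"b1": ("3", "A", None), "b2": ("3", "B", None), "b3": ("7", None, None)}
for link in link_addresses(alpha, beta):
    print(link)
```

In this invented input, the coarse address `a1` is linked to both entrances `b1` and `b2`, and the new tuples carry the more detailed relative parts; `a2` and `b3` are linked to the empty address.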

Example 4.6

Linking the addresses of the two streets \(\alpha _2\) and \(\beta _1\) (see Fig. 3) results in a new set of connector tuples that define the street \(\gamma _{1}=\{c _1,\ldots ,c _5\}\). Figure 7 shows the address trees of the input streets (\(\alpha _2\) and \(\beta _1\)) and the new street \(\gamma _1\). A root–leaf path is an address, and the identifier of the address is shown below the leaf. The dashed lines represent connector tuples that link addresses of \(\alpha _2\) and \(\beta _1\) and define addresses in the new street \(\gamma _1\). \(a _4\) and \(b _3\) are linked to the empty address.
Fig. 7 Links between the addresses of \(\alpha _2\) and \(\beta _1\)

5 Algorithms

In this section we provide algorithms for the synchronization operator, including the street distance computation, the global greedy matching, and the address linking. We prove that global greedy produces a stable matching, and we discuss the complexity of our algorithms.

Synch: Algorithm 1 synchronizes two partitions, \(\mathcal A \) and \(\mathcal B \), of connector \(\mathfrak X \). The synchronization algorithm synch is the top-level algorithm that calls all the other algorithms presented in this section. The synch operator is closed, that is, it can be nested to synchronize multiple databases.

First, the synchronization operator computes the distance matrix, \(D\), with the distances between each pair of streets. Second, the stable street matching \(\mathsf M \) is computed. Third, for each pair of matched streets, \((\alpha ,\beta )\in \mathsf M \), a new set of tuples is added to the connector. The new tuples define a new street. The algorithm returns the connector \(\mathfrak X \) with the new partition \(\mathcal C \). The algorithms for the street distance, the global greedy matching, and the address linking as well as the overall complexity of the synchronization operator are discussed below.

Street Distance: Algorithm 2 computes the distance between two streets based on their address trees. The nested loop computes the minimum \(q\)-gram distance between the two street names. For the computation of the structure distance, the root nodes of the address trees are substituted by dummy nodes. The name distance is weighted with \(\omega \) and the structure distance with \(1-\omega \).

Global Greedy Matching: Algorithm 3 implements the global greedy matching. The algorithm sorts the street pairs by their distance and stores them in array \(S\). The closest street pair is matched. The respective row and column are marked in the distance matrix to prevent a street from being matched twice. The remaining street pairs in \(S\) are matched in ascending order of their distances if both streets in the pair are still available. This yields a stable matching.
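The greedy procedure described above can be sketched in a few lines. This is an illustration of the description, not a transcription of Algorithm 3: the function name, the use of Python sets for the row and column marks, and the tie-breaking by index are assumptions.

```python
def global_greedy(D):
    """Match street pairs in ascending order of distance.

    D is an M x N distance matrix (list of rows). Ties are broken by
    row and column index, which is an assumption of this sketch.
    """
    pairs = sorted((D[i][j], i, j)
                   for i in range(len(D)) for j in range(len(D[i])))
    seen_row, seen_col = set(), set()
    matching = []
    for _, i, j in pairs:
        if i not in seen_row and j not in seen_col:
            matching.append((i, j))  # both streets are still available
            seen_row.add(i)
            seen_col.add(j)
    return matching

D = [[0.60, 0.20, 0.90],
     [0.30, 0.25, 0.40]]
print(global_greedy(D))  # prints [(0, 1), (1, 0)]
```

Sorting the \(M\cdot N\) pairs dominates the runtime; the pair (1, 1) with distance 0.25 is skipped because column 1 is already taken by the closer pair (0, 1).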

Theorem 5.1

The global greedy matching (Algorithm 3) is stable.

Proof

Let \(\mathsf M^{\prime } _k\) be the matching produced by Algorithm 3 after the \(k\)-th execution of line 14, thus \(\mathsf M^{\prime } _0=\emptyset \) and \(\mathsf M^{\prime } _n=\mathsf M \) (\(n=|\mathsf M |\)). We substitute \(\mathsf M \) by \(\mathsf M^{\prime } _k\) in Equation (1) and prove it by induction. “Equation (1) holds for \(k=1\)”: The algorithm chooses the closest street pair among all possible pairs. “If (1) holds for \(\mathsf M^{\prime } _k\), then it also holds for \(\mathsf M^{\prime } _{k+1},\,k<n\)”: No pair \((\alpha ,\beta )\notin \mathsf M^{\prime } _{k+1}\) satisfies the right-hand condition (denoted as \(C\)). Let \((u,v)\) be the new pair in \(\mathsf M^{\prime } _{k+1}\), that is, \(\mathsf M^{\prime } _{k+1}\setminus \mathsf M^{\prime } _{k}=\{(u,v)\}\). We distinguish:
  1. \(u\ne \alpha \) and \(v\ne \beta \): \(C\) is false as (1) holds for \(\mathsf M^{\prime } _k\) and neither \(u\) nor \(v\) appears in \(C\).

  2. \(u=\alpha \) and \(v\ne \beta \): The algorithm matches the closest pair of unmatched streets first. Thus, if \(\beta \) is unmatched in \(\mathsf M^{\prime } _k\), \(\forall (u,y)\in \mathsf M^{\prime } _{k+1}:\mathrm{dist }(u,y)\le \mathrm{dist }(u,\beta )\); if \(\beta \) is already matched, \(\forall (x,\beta )\in \mathsf M^{\prime } _{k+1}:\mathrm{dist }(x,\beta )\le \mathrm{dist }(u,\beta )\). In both cases \(C\) does not hold.

  3. \(u\ne \alpha \) and \(v=\beta \): Analogous to the previous case. \(\square \)

Address Linking: Algorithm 4 links the addresses of the two streets \(\alpha \) and \(\beta \) and produces the new set of connector tuples \(\bar{\gamma }\). Checking equivalence and containment for all pairs of addresses leads to a quadratic runtime in the size of the input streets. We define an order on residential addresses, and we present an efficient merge-based algorithm to link the addresses of two streets. The algorithm sorts the addresses of the input streets, and \(i\) and \(j\) point to the current addresses (initially the first address) of the sorted arrays \(a []\) and \(b []\), respectively. If one of the current addresses is contained in the other (equivalence is a special case of containment), a link is established and a new tuple for \(\bar{\gamma }\) is produced. The pointer of the more detailed address is moved on. If none of the current addresses is contained in the other address, or if one of the pointers reaches the end of the array, links to the empty address are produced.

Definition 5.1

(Order of Residential Addresses) Let \(a \) and \(b \) be two residential addresses with the relative parts \(\mathrm{rel }(a)=(num_{a},entr_{a},apt_{a})\) and \(\mathrm{rel }(b)=(num_{b},entr_{b},apt_{b})\), respectively. We define
$$\begin{aligned} \begin{array}{lll} a > b&\Leftrightarrow&num_{a}>num_{b}\\&\,&\text{ or} \,(num_{a}=num_{b} \wedge entr_{a}>entr_{b})\\&\,&\text{ or} \,(num_{a}=num_{b} \wedge entr_{a}=entr_{b} \wedge apt_{a}>apt_{b}), \end{array} \end{aligned}$$
where \(num_{a/b},\,entr_{a/b}\), and \(apt_{a/b}\) are ordered lexicographically.
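The order of Definition 5.1 can be expressed as a sort key for the merge-based linking. The definition does not state how null values are ordered; the sketch below assumes that a null value sorts before any non-null value, so that a coarse address directly precedes its refinements. The name `address_key` is illustrative.

```python
def address_key(rel):
    """Sort key for a relative part (num, entr, apt).

    Each component is compared lexicographically as a string; None is
    placed before every non-null value (an assumption of this sketch).
    """
    return tuple((v is not None, v if v is not None else "") for v in rel)

addresses = [
    ("3", "B", None),
    ("3", None, None),
    ("10", None, None),
    ("3", "A", None),
]
for rel in sorted(addresses, key=address_key):
    print(rel)
```

Because the components are compared as strings, "10" sorts before "3" here; if numeric order of house numbers is wanted, the numbers would have to be normalized first (for instance by zero-padding), which is outside the definition.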

Complexity: For our complexity analysis we assume the synchronization of two partitions with \(N\) streets, and each street has \(n\) addresses and \(c\) names of constant length. Thus, the number of tuples in a partition is \(Nn\). The name distance is computed in \(O(c^2)\) time and constant space, and the \(pq\)-gram distance between two address trees has runtime \(O(n\log n)\) and needs \(O(n)\) space [6]. The global greedy matching algorithm requires \(O(N^2)\) space (the size of the distance matrix) and runs in \(O(N^2\log N)\) time (sorting the distances). The address linking sorts the addresses of the streets in \(O(n\log n)\) time and runs in \(O(n)\) space. The overall time complexity of the synchronization (Algorithm 1) is \(O(N^2(c^2+n\log n+\log N))\), and the space complexity is \(O(N^2+n)\).

6 Experiments

We experimentally evaluate the accuracy of our approach on real-world residential addresses from the registration office (\(reg\), Italian street names, 314 streets, 43K addresses), the electricity company (\(elec\), German, 327 streets, 45K addresses), and the census database (\(cens\), German, 323 streets, 11K addresses) of the Municipality of Bolzano. All datasets contain streets that do not have a matching partner in any other dataset. Figure 8 shows the Venn diagram of common streets between the different datasets, for example, the census database \(cens\) shares 12 streets exclusively with \(elec\), 7 streets exclusively with \(reg\), and an additional 298 streets with both of them; six streets exist exclusively in \(cens\).
Fig. 8 Overlap between streets of address datasets

The street names between the datasets are very different. An exact join between the street names of \(elec\) and \(cens\) gives 118 out of 310 possible pairs. The exact join between the street names of all other dataset pairs is empty.

In all experiments we use the parameters \(q=3\) for the name distance (\(q\)-gram distance between street names) and the parameters \(p=2\) and \(q=3\) for the structure distance (\(pq\)-gram distance between address trees, cf. Sect. 4.2).

We load the data into the connector such that each database corresponds to a partition. We synchronize the partitions pairwise and verify the street matches. The runtimes for synchronizing two partitions are shown in Table 1 (AMD 2.6 GHz Processor, 16 GB RAM).
Table 1 Runtimes for synchronizing two partitions

| \(\mathcal A \) | \(\mathcal B \) | Name dist (s) | Structure dist (s) | Overall (s) |
|---|---|---|---|---|
| \(elec\) | \(cens\) | 1.7 | 6.1 | 9.2 |
| \(elec\) | \(reg\) | 1.1 | 11.3 | 13.6 |
| \(reg\) | \(cens\) | 1.0 | 6.5 | 8.8 |

In the following subsections we show that the combination of name and structure increases the quality of street distance. We also evaluate the matching accuracy of the global greedy matching algorithm and find that it consistently outperforms both the fixed threshold and the local greedy approach.

We further compare the global greedy algorithm with two alternative approaches, the Hungarian algorithm [27] and the stable marriage algorithm [23]. Both approaches produce one-to-one matches and can easily be plugged into our connector as an alternative for the global greedy algorithm.

6.1 Name and structure distance

We compute precision (correctly found matches to total number of computed matches) and recall (correctly found matches to total number of correct matches) for different weights \(w\) for name and structure distance. If \(w=0\), only the structure of the address trees is considered; if \(w=1\), only the street names are considered. The results are shown in Fig. 9. Pure street name matching (\(w=1\)) gives good results if both databases have German street names (Fig. 9a), but it fails if the street names have different languages (Fig. 9b, c). The combination of name and structure improves the results for all datasets. For \(w=0.5\) (equal weight for name and structure) we find more than \(95\,\%\) of the matches in all datasets, and more than 90 % of our matches are correct.
Fig. 9 Matching accuracy for different weights and databases. a \(elec (Ge) \leftrightarrow cens (Ge)\), b \(elec (Ge) \leftrightarrow reg (It)\), c \(reg (It) \leftrightarrow cens (Ge)\)

6.2 Global greedy versus local greedy

This section compares the global greedy matching algorithm with the local greedy algorithm (see Sect. 4.3). As in the previous experiment we match streets with different weights. In order to compare local greedy with global greedy, we compute the \(F\)-measure. The \(F\)-measure, \(F=\frac{2pr}{p+r}\), is the harmonic mean of precision, \(p\), and recall, \(r\), and it is a well-known performance measure in information retrieval literature [38]. Figure 10a shows the results for matching the streets of \(elec\) and \(cens\); in Fig. 10b we match \(elec\) and \(reg\). Matching \(reg\) and \(cens\) yields similar results. Global greedy yields better matches for all settings that we have tested.
Fig. 10 Global greedy versus local greedy. a \(elec (Ge) \leftrightarrow cens (Ge)\), b \(elec (Ge) \leftrightarrow reg (It)\)
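The accuracy measures used in Sects. 6.1 and 6.2 can be computed from the set of computed matches and the set of correct matches. A minimal sketch with invented match sets; the function name `evaluate` and the convention of returning zero for empty inputs are assumptions.

```python
def evaluate(computed, correct):
    """Return (precision, recall, F-measure) for two collections of matches."""
    computed, correct = set(computed), set(correct)
    hits = len(computed & correct)                   # correctly found matches
    p = hits / len(computed) if computed else 0.0    # share of computed matches
    r = hits / len(correct) if correct else 0.0      # share of correct matches
    f = 2 * p * r / (p + r) if p + r > 0 else 0.0    # harmonic mean
    return p, r, f

computed = [(0, 0), (1, 2), (2, 1), (3, 3)]
correct = [(0, 0), (1, 1), (2, 2), (3, 3), (4, 4)]
p, r, f = evaluate(computed, correct)
print(p, r, round(f, 4))  # prints 0.5 0.4 0.4444
```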

6.3 Global greedy versus fixed threshold

This section compares the global greedy matching algorithm with the fixed-threshold approach. The fixed-threshold approach matches all streets that are within a given distance (see Sect. 4.3). Figure 11a shows precision and recall for increasing threshold values (\(elec\leftrightarrow cens\), string weight \(w=0.5\)). The precision is high for small thresholds (all matches are correct), but the recall is very low (only few matches were found). As the threshold increases, more matches are found, but also the number of incorrect matches increases. Very high thresholds compute the cross product between all streets, the recall is 100 %, and the precision decreases to almost zero.
Fig. 11 Global greedy versus fixed threshold. a Precision and recall (\(elec \leftrightarrow cens\)), b F-measure (\(elec \leftrightarrow cens, w = 0.5\)), c F-measure (\(elec \leftrightarrow reg, w = 0.5\)), d F-measure (\(reg \leftrightarrow cens, w = 0.5\))

In Fig. 11b–d we compare the matching accuracy of the global greedy algorithm with the fixed-threshold approach. The global greedy algorithm outperforms the fixed-threshold approach for all thresholds. The results for the global greedy algorithm are independent of the threshold. The missing values in Fig. 11c, d indicate thresholds for which both precision and recall are zero, that is, no streets could be matched within the given threshold.

6.4 Hungarian and stable marriage algorithm

Similar to the global greedy algorithm, the Hungarian algorithm and the stable marriage algorithm consider all distances in the distance matrix to produce one-to-one matches. We experimentally evaluate the effectiveness of these approaches together with the global greedy algorithm. Figure 12 shows the results for different weights and databases. All tested algorithms show similar performance and are valid candidates for street matching.
Fig. 12 Effectiveness of global greedy, Hungarian, and stable marriage. a \(elec (Ge) \leftrightarrow cens (Ge)\), b \(elec (Ge) \leftrightarrow reg (It)\), c \(reg (It) \leftrightarrow cens (Ge)\)

Substituting the global greedy algorithm with another matching algorithm in the connector is straightforward (see Line 5 of Algorithm 1). As with global greedy, the input to the Hungarian algorithm is a distance matrix; the runtime, however, is \(O(N^3)\) as compared to \(O(N^2\log N)\) of the global greedy algorithm. The stable marriage algorithm needs ranked lists of matching partners. These lists can be produced from the distance matrix by sorting the streets of the second set according to their distances to the streets in the first set, and vice versa.
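Deriving the ranked lists for the stable marriage algorithm from the distance matrix could look as follows. The function name is hypothetical, and ties are broken by index because the stable marriage setting expects strict rankings; the paper does not state how ties are resolved.

```python
def preference_lists(D):
    """Rank the columns for each row and the rows for each column by distance."""
    n_rows = len(D)
    n_cols = len(D[0]) if D else 0
    # sorted() is stable, so equal distances keep ascending index order
    row_prefs = [sorted(range(n_cols), key=lambda j: D[i][j])
                 for i in range(n_rows)]
    col_prefs = [sorted(range(n_rows), key=lambda i: D[i][j])
                 for j in range(n_cols)]
    return row_prefs, col_prefs

D = [[0.4, 0.3],
     [0.9, 0.2]]
row_prefs, col_prefs = preference_lists(D)
print(row_prefs)  # prints [[1, 0], [1, 0]]
print(col_prefs)  # prints [[0, 1], [1, 0]]
```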

From the implementation point of view, the global greedy algorithm has two advantages. (a) The distance matrix can be stored in a relational table (each tuple being a street pair with its distance). The table is sorted by the distance values, and a single tuple at a time is loaded and processed in main memory. Only the flags that indicate whether a street is already matched need to be maintained in main memory (\( seen \text{-} row \) and \( seen \text{-} col \) in Algorithm 3). This reduces the main memory complexity from quadratic to linear. (b) The global greedy algorithm works well with sparse matrices. Such matrices result from join-based approaches to compute the \(q\)-gram and the \(pq\)-gram distance [1, 2, 33, 39], in which the distances between items that have no \(q\)-grams and \(pq\)-grams in common are never computed.
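Both advantages can be illustrated with a relational table that holds only the computed street pairs. The sketch uses an in-memory SQLite database; the table name `street_dist`, its columns, and the sample rows are invented for illustration and do not come from the paper.

```python
import sqlite3

def greedy_from_table(conn):
    """Global greedy over a (possibly sparse) table of street pairs.

    Only the two sets of matched streets are kept in main memory; the
    pairs are read one at a time in ascending order of distance.
    """
    seen_a, seen_b, matching = set(), set(), []
    cursor = conn.execute("SELECT a, b FROM street_dist ORDER BY dist, a, b")
    for a, b in cursor:
        if a not in seen_a and b not in seen_b:
            matching.append((a, b))
            seen_a.add(a)
            seen_b.add(b)
    return matching

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE street_dist (a TEXT, b TEXT, dist REAL)")
conn.executemany(
    "INSERT INTO street_dist VALUES (?, ?, ?)",
    [("a1", "b1", 0.20), ("a1", "b2", 0.50),
     ("a2", "b2", 0.30), ("a3", "b1", 0.25)],  # pairs never computed are absent
)
print(greedy_from_table(conn))  # prints [('a1', 'b1'), ('a2', 'b2')]
```

Street `a3` stays unmatched because its only computed pair involves `b1`, which is already taken; with a sparse matrix the matching is therefore no longer guaranteed to cover the smaller set.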

7 Related work

Residential addresses appear in many applications, and commercial tools that deal with the synchronization of residential addresses have been developed. Customer Data Integration tools often include residential address integration. Many tools, for example, DQaddress (caatoosee ag)\(^{1}\), rely on string matching techniques and find typos and small spelling variations. They often include rules for common abbreviations. They cannot deal with renamed streets or streets in different languages. Typical address applications in the United States use a standardized set of residential addresses and correct the input address according to the reference set. AbiliTec\(^{{\circledR}}\) (by ACXIOM\(^{{\circledR}}\))\(^{2}\), for example, relies on an extensive repository of historical name and address information. The repository stores associations between current and previous addresses, real names and nicknames, maiden names, married names, and multiple variations of business names. In our case no such database is available, and we cannot rely on a standardized set of addresses. Instead, we build links that equally respect all participating addresses. Extending our street-matching algorithm with additional features (e.g., aliases from a historical database) is straightforward and does not affect the other components of the address connector.

The concept of an address tree was introduced by [3]. We compute a distance between address trees to find matching pairs. Algorithms for computing tree distances have received much attention from the research community [10, 18, 36, 40, 41]. The standard distance between trees is the edit distance [11, 26, 41]. The fastest algorithm for the tree edit distance is RTED [31], a robust tree edit distance algorithm that runs in \(O(n^3)\) time and \(O(n^2)\) space. To avoid the cubic complexity of the tree edit distance algorithm, in our work we use \(pq\)-grams [4] to compute the distance between address trees. The \(pq\)-gram distance runs efficiently in \(O(n\log n)\) time and \(O(n)\) space and provides a lower bound for the tree edit distance [6]. For materialized \(pq\)-gram tables, called \(pq\)-gram indexes, efficient incremental update algorithms have been proposed [5].

In previous work, we have successfully used \(pq\)-grams to match address trees [6]. These works focus on computing a structure distance between two trees and do not solve the problem of synchronizing multiple address databases. In particular, no data structure for maintaining links between addresses and no lookup operator are defined. Further, two streets are matched if they are mutual and strict nearest neighbors. The nearest-neighbor function is not symmetric, and some streets may not have a mutual nearest neighbor. The global greedy matching presented in this paper is symmetric and assigns a matching partner to each street of the smaller set, and we show that the resulting matching is stable. The \(pq\)-gram distance does not consider the similarity of street names. We show in our experiments that we can significantly improve the matching accuracy by considering the name distance in addition to the \(pq\)-gram structure distance between address trees.

\(q\)-Grams were introduced by [37] as a lower bound for the more expensive string edit distance [29, 30]. [21] show that \(q\)-grams can be implemented efficiently in a relational database. Our approach is independent of the choice of a specific string distance.

Matching data items based on the distance between them is a well-known problem in data integration. [21] define the approximate string join; approximate XML joins are introduced by [22]. Both approaches use a fixed distance threshold and match all pairs of items that are within the threshold. [9] point out that fixed thresholds lead to poor matching accuracy as items that should match may be more distant than items that should not match. They introduce a variable threshold for a duplicate detection scenario and define two criteria for duplicates, the compact set criterion (duplicates are closer to each other than to non-duplicate items) and the sparse neighborhood criterion (the local neighborhood of duplicate items is sparse). Both fixed and variable thresholds possibly match a single item to multiple other items, which is undesirable in our setting.

The street matching can be modeled as a bipartite weighted graph matching problem (also known as assignment problem). The streets form the disjoint node sets of the bipartite graph, and the distances between the streets are the weighted edges. The goal is to compute the minimum-weighted matching between the \(N\) nodes of the graph. The Hungarian algorithm by [27] runs in \(O(N^2|V|)\) time, which is \(O(N^4)\) in our case of a dense graph with \(|V|=N^2\) edges. [14] present an algorithm based on Dijkstra’s shortest paths [12] that runs in \(O(N^3)\) time if implemented with a Fibonacci heap [16]. For the more general maximum-flow problem, [20] propose an \(O(N^3)\) time algorithm. All these algorithms globally minimize the sum of the distances in a matching, but they cannot guarantee a stable matching. Computing stable matchings is known as the stable marriage problem [23]: Given a population of \(N\) men and \(N\) women, each man strictly ranks each woman according to his preferences for a marriage partner, and vice versa. [17] propose an \(O(N^2)\) time algorithm that computes a stable matching between men and women. The Gale-Shapley algorithm is not commutative, and the solution is optimized either for men or for women, depending on the order of the parameters. For the respective other part, the worst case solution is produced. Egalitarian stable marriage algorithms that fix this problem have been proposed [15, 24]; the most efficient one runs in \(O(N^3)\) time. We can take advantage of the distances that globally rank the matches and produce a commutative stable matching in \(O(N^2\log N)\) time and \(O(N^2)\) space. The global greedy and the local greedy algorithms have been introduced by [28] as heuristics for the assignment problem. [7] surveys heuristics for the assignment problem in Euclidean and non-Euclidean space.

There is a rich body of research in the area of schema and ontology matching, which is a critical issue in many application domains. Good surveys of the state of the art are found in [8, 25, 32, 34, 35], where different classifications of existing systems are provided. The problem of combining basic ontology matching functions and techniques to improve the overall matching performance has been investigated by [19], whereas [13] provide a survey of matching data at the instance level. Similar to various ontology matching approaches, the address connector combines element-level mapping techniques (\(n\)-grams for street names) and structure-level mapping techniques (\(pq\)-grams for address trees). The mapping of residential addresses is performed at the instance level; schema information is not considered. Also no other auxiliary information, such as ontologies or thesauri, is assumed, but could be integrated into the street-matching algorithm if available. While most existing systems provide \(1\):\(1\) alignments between the nodes of the input ontologies, we provide \(n\):\(m\) alignments between the paths in the input trees, where each path represents a residential address. Moreover, our address connector provides an alignment between \(n\) different databases without the need of storing \(O(n^2)\) alignments between all pairs of databases.

8 Conclusion

We have presented the address connector which links residential addresses of different databases that refer to the same location. The address connector does not need an authoritative reference, but it builds a new reference that equally respects all participating addresses. The core of the connector is the synchronization operator which can deal with non-matching (even completely unrelated) street names and correctly links residential addresses with different granularity. The address connector has been successfully tested in the context of the Municipality of Bolzano. In our experiments with real-world data from the public administration, we show the effectiveness and the efficiency of our approach.

The synchronization operator implements a new, context-aware street distance that considers not only the street name but also the hierarchical structure that is defined by the addresses of a street. The distances between all street pairs are stored in a distance matrix. We use a global greedy algorithm to match streets based on the distance matrix. The global greedy algorithm matches each street to at most one other street, its result is independent of the matching order, and we prove that the matches are stable. We define the concept of address containment, which allows us to link the addresses of two streets correctly, even if they are stored with different granularity.

Future work will extend our solution to other applications. The combination of string and tree distances defined by the address tree distance is useful when hierarchical data include string-valued nodes that (almost) identify objects. As an example, consider XML data that store publications. Two publications are similar if both their title and the XML structure (defined by authors, year of publication, etc.) are similar. We use a global greedy matching algorithm to solve the problem of matching items based on a distance matrix, a typical problem in data integration scenarios. The key properties of the global greedy matching (independence of the matching order, one-to-one matches, and the stability of the matching) set it apart from other approaches in data integration. The concept of containment extends to other kinds of hierarchical data that are stored with different granularity. Our approach of linking data in the connector instead of correcting the databases is useful in applications where correcting data is not possible (e.g., due to read-only access or different granularity).

Acknowledgments

This work was partially funded by the SyRA (Synchronizing Residential Addresses) project of the Free University of Bozen-Bolzano, Italy.

Copyright information

© Springer-Verlag London 2012

Authors and Affiliations

  • Nikolaus Augsten (1)
  • Michael Böhlen (2)
  • Johann Gamper (1)

  1. Faculty of Computer Science, Free University of Bozen-Bolzano, Bolzano, Italy
  2. Department of Informatics, University of Zurich, Zurich, Switzerland
