Knowledge and Information Systems, Volume 37, Issue 3, pp 639–663

The address connector: noninvasive synchronization of hierarchical data sources

Regular Paper

Abstract

Different databases often store information about the same or related objects in the real world. To enable collaboration between these databases, data items that refer to the same object must be identified. Residential addresses are data of particular interest as they often provide the only link between related pieces of information in different databases. Unfortunately, residential addresses that describe the same location might vary considerably and hence need to be synchronized. Non-matching street names and addresses stored at different levels of granularity make address synchronization a challenging task. Common approaches assume an authoritative reference set and correct residential addresses according to the reference set. Often, however, no reference set is available, and correcting addresses with different granularity is not possible. We present the address connector, which links residential addresses that refer to the same location. Instead of correcting addresses according to an authoritative reference set, the connector defines a lookup function for residential addresses. Given a query address and a target database, the lookup returns all residential addresses in the target database that refer to the same location. The lookup supports addresses that are stored with different granularity. To align the addresses of two matching streets, we use a global greedy address-matching algorithm that guarantees a stable matching. We define the concept of address containment that allows us to correctly link addresses with different granularity. The evaluation of our solution on real-world data from a municipality shows that our solution is both effective and efficient.
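The abstract names a global greedy matching that yields a stable matching but does not spell out the procedure. The following is a minimal sketch of the general technique, not the authors' implementation: all candidate pairs are ranked by similarity, and the best remaining pair whose two members are both still unmatched is accepted. The function names, the toy similarity on house numbers, and the sample data are assumptions made for illustration only.

```python
from itertools import product


def global_greedy_matching(left, right, sim):
    """One-to-one matching between `left` and `right` (hypothetical helper).

    Every candidate pair is ranked by similarity, highest first. A pair is
    accepted only if neither of its members has been matched already.
    """
    pairs = sorted(product(left, right),
                   key=lambda p: sim(p[0], p[1]),
                   reverse=True)
    used_left, used_right, matching = set(), set(), []
    for a, b in pairs:
        if a not in used_left and b not in used_right:
            matching.append((a, b))
            used_left.add(a)
            used_right.add(b)
    return matching


def house_number_similarity(a, b):
    # Toy similarity (an assumption): closer house numbers are more similar.
    return -abs(a - b)


# Hypothetical house numbers of the same street in two databases.
matching = global_greedy_matching([1, 3, 5], [2, 3, 9], house_number_similarity)
print(matching)
```

Because a pair is accepted only when no more similar pair involving either member is still available, no two unmatched-to-each-other items can both prefer each other over their assigned partners, which is the stability property in the stable-marriage sense. With tied similarities the result depends on the sort order, so this sketch guarantees stability only in the weak sense.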

Keywords

Data quality · Record linkage · Entity resolution · Hierarchical data · Trees · Approximate matching · Similarity query · Residential addresses

Copyright information

© Springer-Verlag London 2012

Authors and Affiliations

  • Nikolaus Augsten (1)
  • Michael Böhlen (2)
  • Johann Gamper (1)

  1. Faculty of Computer Science, Free University of Bozen-Bolzano, Bolzano, Italy
  2. Department of Informatics, University of Zurich, Zurich, Switzerland
