ORCHID – Reduction-Ratio-Optimal Computation of Geo-spatial Distances for Link Discovery

  • Axel-Cyrille Ngonga Ngomo
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8218)

Abstract

The discovery of links between resources within knowledge bases is of crucial importance to realize the vision of the Semantic Web. Addressing this task is especially challenging when dealing with geo-spatial datasets due to their sheer size and the potential complexity of single geo-spatial objects. Yet, so far, little attention has been paid to the characteristics of geo-spatial data within the context of link discovery. In this paper, we address this gap by presenting Orchid, a reduction-ratio-optimal link discovery approach designed especially for geo-spatial data. Orchid relies on a combination of the Hausdorff and orthodromic metrics to compute the distance between geo-spatial objects. We first present two novel approaches for the efficient computation of Hausdorff distances. Then, we present the space tiling approach implemented by Orchid and prove that it is optimal with respect to the reduction ratio that it can achieve. The evaluation of our approaches is carried out on three real datasets of different sizes and complexity. Our results suggest that our approaches to the computation of Hausdorff distances require two orders of magnitude fewer orthodromic distance computations than a naive approach to compare geographical data. Moreover, they require two orders of magnitude less time than the naive approach to achieve this goal. Finally, our results indicate that Orchid scales to large datasets while significantly outperforming the state of the art.
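To make the combination of the two metrics concrete, the following is a minimal sketch of what a naive baseline might look like. It is not Orchid's implementation: it assumes that geo-spatial objects are given as lists of (latitude, longitude) points in degrees, that the orthodromic distance is computed with the haversine formula on a sphere with an assumed mean radius, and that the Hausdorff distance is the directed max-min variant. The names `orthodromic`, `naive_hausdorff` and `EARTH_RADIUS_KM` are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius, for illustration only


def orthodromic(p, q):
    """Great-circle distance in km between two (lat, lon) points in degrees.

    Uses the haversine formula on a sphere (an assumption of this sketch).
    """
    lat1, lon1 = map(math.radians, p)
    lat2, lon2 = map(math.radians, q)
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def naive_hausdorff(s, t):
    """Directed Hausdorff distance from point list s to point list t.

    For every point of s, find its closest point in t, then take the
    largest of these minima. Costs len(s) * len(t) orthodromic computations.
    """
    return max(min(orthodromic(a, b) for b in t) for a in s)


# Two toy objects on the equator (hypothetical data).
s = [(0.0, 0.0), (0.0, 1.0)]
t = [(0.0, 0.0)]
forward = naive_hausdorff(s, t)   # driven by the point (0, 1)
backward = naive_hausdorff(t, s)  # (0, 0) also occurs in s
```

The quadratic number of orthodromic distance computations in `naive_hausdorff` is the cost that the abstract's two Hausdorff approaches aim to reduce. The directed variant is not symmetric, which the two calls at the end illustrate.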

Keywords

Link discovery; Record Linkage; Deduplication; Geo-Spatial Data; Hausdorff Distances

Copyright information

© Springer-Verlag Berlin Heidelberg 2013

Authors and Affiliations

  • Axel-Cyrille Ngonga Ngomo (1)
  1. Department of Computer Science, University of Leipzig, Leipzig, Germany
