Evaluating Performance and Quality of XML-Based Similarity Joins

  • Conference paper
Advances in Databases and Information Systems (ADBIS 2008)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5207)

Abstract

A similarity join correlating fragments in XML documents, which are similar in structure and content, can be used as the core algorithm to support data cleaning and data integration tasks. For this reason, built-in support for such an operator in an XML database management system (XDBMS) is very attractive. However, similarity assessment is especially difficult on XML datasets, because structure, besides textual information, may embody variations in XML documents representing the same real-world entity. Moreover, the similarity computation is considerably more expensive for tree-structured objects and should, therefore, be a prime optimization candidate. In this paper, we explore and optimize tree-based similarity joins and analyze their performance and accuracy when embedded in native XDBMSs.

Editor information

Paolo Atzeni, Albertas Caplinskas, Hannu Jaakkola

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Ribeiro, L., Härder, T. (2008). Evaluating Performance and Quality of XML-Based Similarity Joins. In: Atzeni, P., Caplinskas, A., Jaakkola, H. (eds) Advances in Databases and Information Systems. ADBIS 2008. Lecture Notes in Computer Science, vol 5207. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85713-6_18

  • DOI: https://doi.org/10.1007/978-3-540-85713-6_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85712-9

  • Online ISBN: 978-3-540-85713-6

  • eBook Packages: Computer Science, Computer Science (R0)
