
Supporting Similarity Operations Based on Approximate String Matching on the Web

  • Conference paper
On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE (OTM 2004)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 3290))

Abstract

Querying and integrating structured data sources on the Web usually requires similarity-based concepts to deal with data-level conflicts, because the data are often erroneous and imprecise and the conventions for representing them diverge. Web databases, however, offer only limited query interfaces and almost no support for similarity queries. The approach presented in this paper maps string similarity predicates to standard predicates, such as substring and keyword search, that many of these systems offer. To minimize local processing costs and network traffic, the mapping uses materialized information on the selectivity of string samples such as q-samples, substrings, and keywords. Based on this predicate mapping, similarity selections and joins are described, and the quality and required effort of the operations are evaluated experimentally.
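To make the idea of mapping a similarity predicate to substring queries more concrete, the following is a minimal sketch of one classical filtering technique, not necessarily the mapping used in the paper. It relies on a well-known property of edit distance: if a query string is cut into k + 1 non-overlapping pieces, any string within edit distance k of the query must contain at least one piece unchanged. The function names, the toy `SOURCE` list, and the `substring_search` callback (a stand-in for a Web source's substring interface) are all assumptions made for illustration.

```python
from typing import Callable, Iterable, List


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with the standard dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def split_into_pieces(s: str, n: int) -> List[str]:
    """Cut s into n contiguous, non-overlapping, non-empty pieces."""
    if n > len(s):
        raise ValueError("query too short for the requested number of pieces")
    base, extra = divmod(len(s), n)
    pieces, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        pieces.append(s[start:end])
        start = end
    return pieces


def similarity_select(query: str, k: int,
                      substring_search: Callable[[str], Iterable[str]]) -> List[str]:
    """Return all source strings within edit distance k of query.

    Step 1 (remote): one substring query per piece collects candidates.
    Step 2 (local): candidates are verified with the exact edit distance.
    """
    candidates = set()
    for piece in split_into_pieces(query, k + 1):
        candidates.update(substring_search(piece))
    return sorted(c for c in candidates if edit_distance(query, c) <= k)


# Hypothetical stand-in for a Web database that only supports substring search.
SOURCE = ["Schallehn", "Schallen", "Shallehn", "Sattler", "Saake"]


def toy_substring_search(fragment: str) -> List[str]:
    return [s for s in SOURCE if fragment in s]


result = similarity_select("Schallehn", 1, toy_substring_search)
print(result)
```

This sketch sends every piece to the source. The abstract indicates that the paper additionally uses materialized selectivity information to decide which string samples to send, so that fewer candidates are transferred and verified locally; that selection step is not modeled here.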




© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Schallehn, E., Geist, I., Sattler, KU. (2004). Supporting Similarity Operations Based on Approximate String Matching on the Web. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE. OTM 2004. Lecture Notes in Computer Science, vol 3290. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30468-5_16


  • DOI: https://doi.org/10.1007/978-3-540-30468-5_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-23663-4

  • Online ISBN: 978-3-540-30468-5

  • eBook Packages: Springer Book Archive
