Supporting Similarity Operations Based on Approximate String Matching on the Web

Schallehn, Eike; Geist, Ingolf; Sattler, Kai-Uwe

doi:10.1007/978-3-540-30468-5_16

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3290)

Included in the following conference series:

OTM Confederated International Conferences "On the Move to Meaningful Internet Systems"

Abstract

Querying and integrating structured data sources on the Web usually requires similarity-based concepts to deal with data-level conflicts, because the data are often erroneous and imprecise and follow diverging representation conventions. Web databases, however, offer only limited interfaces and almost no support for similarity queries. The approach presented in this paper maps string similarity predicates to standard predicates, such as substring and keyword search, that many of these systems offer. To minimize local processing costs and the required network traffic, the mapping uses materialized information on the selectivity of string samples such as q-samples, substrings, and keywords. Based on this predicate mapping, similarity selections and joins are described, and the quality and required effort of the operations are evaluated experimentally.
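
The general idea of mapping a similarity predicate to substring predicates can be illustrated with a minimal sketch. It is not the paper's algorithm: it relies on the well-known pigeonhole argument that any string within edit distance k of a query must contain at least one of k + 1 non-overlapping pieces of that query unchanged. The function names, the equal-length splitting, and the in-memory `substring_search` stand-in for a remote Web source are all assumptions made for illustration, and the selectivity-based choice of samples mentioned in the abstract is left out.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic dynamic program, one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def split_pieces(query: str, k: int) -> list[str]:
    """Split the query into k + 1 non-overlapping pieces of near-equal length."""
    n = k + 1
    if len(query) < n:
        raise ValueError("query too short for k + 1 non-empty pieces")
    base, extra = divmod(len(query), n)
    pieces, start = [], 0
    for i in range(n):
        end = start + base + (1 if i < extra else 0)
        pieces.append(query[start:end])
        start = end
    return pieces


def similarity_select(query: str, k: int, substring_search) -> list[str]:
    """Return all source strings within edit distance k of the query.

    substring_search(piece) is a hypothetical callable standing in for the
    substring predicate of a remote source; it returns matching strings.
    """
    candidates = set()
    for piece in split_pieces(query, k):
        # One substring query per piece; the union is a superset of the answer.
        candidates.update(substring_search(piece))
    # Local post-filter removes the false positives.
    return sorted(s for s in candidates if edit_distance(query, s) <= k)


# Usage with a toy in-memory "source" (made-up data).
source = ["Magdeburg", "Magdeburk", "Madgeburg", "Marburg", "Ilmenau"]
search = lambda piece: [s for s in source if piece in s]
matches = similarity_select("Magdeburg", 1, search)
```

Each piece costs one remote query, so in this sketch the network effort grows with k, while the local effort depends on how many candidates the pieces retrieve. That trade-off is where knowledge about the selectivity of candidate substrings would come in.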

Author information

Authors and Affiliations

Dpt. of Computer Science, University of Magdeburg, P.O. Box 4120, 39106, Magdeburg, Germany
Eike Schallehn & Ingolf Geist
Dpt. of Computer Science and Automation, Technical University of Ilmenau, P.O. Box 100565, 98684, Ilmenau, Germany
Kai-Uwe Sattler

Editor information

Editors and Affiliations

Vrije Universiteit Brussel (VUB), STARLab, Bldg G/10, Pleinlaan 2, 1050, Brussels, Belgium
Robert Meersman
School of Computer Science and Information Technology, RMIT University, Bld 10.10, 376-392 Swanston Street, VIC 3001, Melbourne, Australia
Zahir Tari

Cite this paper

Schallehn, E., Geist, I., Sattler, KU. (2004). Supporting Similarity Operations Based on Approximate String Matching on the Web. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE. OTM 2004. Lecture Notes in Computer Science, vol 3290. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30468-5_16

DOI: https://doi.org/10.1007/978-3-540-30468-5_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23663-4
Online ISBN: 978-3-540-30468-5
eBook Packages: Springer Book Archive