Abstract
Querying and integrating structured data sources on the Web in most cases requires similarity-based concepts to deal with data-level conflicts, because the data are often erroneous and imprecise and follow diverging representation conventions. Web databases, however, offer only limited interfaces and almost no support for similarity queries. The approach presented in this paper maps string similarity predicates to standard predicates, such as substring and keyword search, that many of these systems do offer. To minimize local processing costs and the required network traffic, the mapping uses materialized information on the selectivity of string samples such as q-samples, substrings, and keywords. Based on this predicate mapping, similarity selections and joins are described, and the quality and required effort of the operations are evaluated experimentally.
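The general idea of answering a similarity selection through a substring-only interface can be illustrated with a minimal sketch. It is not the implementation from the paper: it relies on the well-known property that if two strings are within edit distance k, and one of them is cut into k+1 non-overlapping pieces, at least one piece must occur unchanged in the other. The names `similarity_select`, `split_into_pieces`, `toy_substring_search`, and the `TITLES` list are hypothetical, and the selectivity-based choice of samples described in the abstract is not modeled here.

```python
from typing import Callable, Iterable, List, Set


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def split_into_pieces(s: str, n: int) -> List[str]:
    """Split s into n contiguous, non-overlapping, non-empty pieces."""
    if n > len(s):
        raise ValueError("string too short for the requested number of pieces")
    base, extra = divmod(len(s), n)
    pieces, pos = [], 0
    for i in range(n):
        size = base + (1 if i < extra else 0)
        pieces.append(s[pos:pos + size])
        pos += size
    return pieces


def similarity_select(
    query: str,
    k: int,
    substring_search: Callable[[str], Iterable[str]],
) -> List[str]:
    """Return all source strings within edit distance k of query.

    Pre-selection: one substring query per piece (k + 1 in total).
    Verification: exact edit distance computed locally on the candidates.
    """
    candidates: Set[str] = set()
    for piece in split_into_pieces(query, k + 1):
        candidates.update(substring_search(piece))
    return sorted(c for c in candidates if edit_distance(query, c) <= k)


# Hypothetical stand-in for a Web source that supports only substring search.
TITLES = ["similarity", "similarty", "simularity", "singularity", "dissimilar"]


def toy_substring_search(fragment: str) -> List[str]:
    return [t for t in TITLES if fragment in t]


if __name__ == "__main__":
    print(similarity_select("similarity", 1, toy_substring_search))
```

In this sketch every piece is sent to the source, so the network cost grows with the result sizes of the k+1 substring queries. Choosing samples with low estimated selectivity, as the abstract describes, is one way such a cost could be reduced.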
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Schallehn, E., Geist, I., Sattler, K.-U. (2004). Supporting Similarity Operations Based on Approximate String Matching on the Web. In: Meersman, R., Tari, Z. (eds) On the Move to Meaningful Internet Systems 2004: CoopIS, DOA, and ODBASE. OTM 2004. Lecture Notes in Computer Science, vol 3290. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30468-5_16
DOI: https://doi.org/10.1007/978-3-540-30468-5_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23663-4
Online ISBN: 978-3-540-30468-5
eBook Packages: Springer Book Archive