Comparing String Similarity Measures for Reducing Inconsistency in Integrating Data from Different Sources

Luján-Mora, Sergio; Palomar, Manuel

doi:10.1007/3-540-47714-4_18

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2118)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

The Web has dramatically increased the need for efficient and flexible mechanisms that provide integrated views over multiple heterogeneous information sources. When multiple sources are integrated, each source may represent data differently. A common problem is inconsistency in the data: the same term may appear with different values because of misspellings, permuted word order, spelling variants, and so on. In this paper, we present an improvement on our previous work on reducing the inconsistency found in existing databases. The objective of our method is to integrate and standardize the different values that refer to the same term. All values that refer to the same term are clustered by measuring their degree of similarity. Each cluster can then be assigned a common value that is substituted for the original values. The paper describes and compares five similarity measures for clustering and evaluates their performance on real-world data. The method we present may work well in practice, but it is time-consuming.
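
To make the clustering idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes a normalized Levenshtein edit distance as the similarity measure (the abstract does not name the five measures compared) and a simple greedy, threshold-based clustering. The function names `levenshtein`, `similarity`, and `cluster`, and the threshold of 0.8, are hypothetical choices made only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a.lower(), b.lower()) / longest


def cluster(values, threshold=0.8):
    """Greedy clustering (an assumption, not the paper's algorithm):
    each value joins the first cluster whose first member is similar
    enough, otherwise it starts a new cluster."""
    clusters = []
    for value in values:
        for group in clusters:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters


values = [
    "Universidad de Alicante",
    "Universidad de Alicant",
    "Univresidad de Alicante",
    "George Mason University",
]
groups = cluster(values)
```

In this sketch the three spellings of "Universidad de Alicante" end up in one cluster, and a common value (for example, the first member) could be substituted for all of them. Plain edit distance penalizes a permuted word order heavily, so the sketch covers only the misspelling case mentioned in the abstract. Comparing every value against every cluster is also quadratic in the worst case, which is consistent with the abstract's remark that the approach is time-consuming.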

Author information

Authors and Affiliations

Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Campus de San Vicente del Raspeig Ap. Correos 99, E-03080, Alicante, Spain
Sergio Luján-Mora & Manuel Palomar

Editor information

Editors and Affiliations

Department of Information and Software Engineering, George Mason University, Fairfax, VA, 22030-4444, USA
X. Sean Wang
Department of Computer Science and Engineering, Northeastern University, Shenyang, 110004, China
Ge Yu
Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong, China
Hongjun Lu

About this paper

Cite this paper

Luján-Mora, S., Palomar, M. (2001). Comparing String Similarity Measures for Reducing Inconsistency in Integrating Data from Different Sources. In: Wang, X.S., Yu, G., Lu, H. (eds) Advances in Web-Age Information Management. WAIM 2001. Lecture Notes in Computer Science, vol 2118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47714-4_18

DOI: https://doi.org/10.1007/3-540-47714-4_18
Published: 28 June 2001
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-42298-3
Online ISBN: 978-3-540-47714-3
eBook Packages: Springer Book Archive
