Comparing String Similarity Measures for Reducing Inconsistency in Integrating Data from Different Sources

  • Conference paper
Advances in Web-Age Information Management (WAIM 2001)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2118)

Abstract

The Web has dramatically increased the need for efficient and flexible mechanisms to provide integrated views over multiple heterogeneous information sources. When multiple sources need to be integrated, each source may represent data differently. A common problem is the possible inconsistency of the data: the very same term may have different values, due to misspelling, a permuted word order, spelling variants and so on. In this paper, we present an improvement on our previous work for reducing inconsistency found in existing databases. The objective of our method is the integration and standardization of different values that refer to the same term. All the values that refer to the same term are clustered by measuring their degree of similarity. The clustered values can be assigned to a common value that could be substituted for the original values. The paper describes and compares five different similarity measures for clustering and evaluates their performance on real-world data. The method we present may work well in practice, but it is time-consuming.
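
The clustering idea summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' method: the abstract does not name the five similarity measures, so a normalized edit (Levenshtein) distance is used here as a stand-in, and the function names, the word-sorting step for permuted word order, the greedy clustering strategy, and the 0.8 threshold are all assumptions made for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance by dynamic programming (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Normalized similarity in [0, 1]; 1.0 means identical strings."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def normalize(value: str) -> str:
    """Lowercase and sort the words so a permuted word order compares equal
    (an illustrative assumption, not taken from the paper)."""
    return " ".join(sorted(value.lower().split()))


def cluster_values(values, threshold=0.8):
    """Greedy single-pass clustering: a value joins the first cluster whose
    first member is similar enough, otherwise it starts a new cluster."""
    clusters = []
    for value in values:
        for cluster in clusters:
            if similarity(normalize(value), normalize(cluster[0])) >= threshold:
                cluster.append(value)
                break
        else:
            clusters.append([value])
    return clusters


def canonical(cluster):
    """Choose the most frequent value as the common value (first seen on ties)."""
    return max(cluster, key=cluster.count)


if __name__ == "__main__":
    # Hypothetical sample values showing a misspelling and a permuted word order.
    values = ["University of Alicante", "Universty of Alicante",
              "Alicante University of", "MIT"]
    for cluster in cluster_values(values):
        print(canonical(cluster), "<-", cluster)
```

Comparing every value against existing clusters is quadratic in the worst case, which is in line with the abstract's remark that this kind of method is time-consuming.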


Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Luján-Mora, S., Palomar, M. (2001). Comparing String Similarity Measures for Reducing Inconsistency in Integrating Data from Different Sources. In: Wang, X.S., Yu, G., Lu, H. (eds) Advances in Web-Age Information Management. WAIM 2001. Lecture Notes in Computer Science, vol 2118. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-47714-4_18

  • DOI: https://doi.org/10.1007/3-540-47714-4_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-42298-3

  • Online ISBN: 978-3-540-47714-3

  • eBook Packages: Springer Book Archive
