Comparing String Similarity Measures for Reducing Inconsistency in Integrating Data from Different Sources

  • Sergio Luján-Mora
  • Manuel Palomar
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 2118)

Abstract

The Web has dramatically increased the need for efficient and flexible mechanisms to provide integrated views over multiple heterogeneous information sources. When multiple sources need to be integrated, each source may represent data differently. A common problem is the possible inconsistency of the data: the very same term may have different values due to misspellings, a permuted word order, spelling variants and so on. In this paper, we present an improvement on our previous work for reducing the inconsistency found in existing databases. The objective of our method is the integration and standardization of different values that refer to the same term. All the values that refer to the same term are clustered by measuring their degree of similarity. The clustered values can then be assigned a common value that can be substituted for the original values. The paper describes and compares five different similarity measures for clustering and evaluates their performance on real-world data. The method we present may work well in practice, but it is time-consuming.
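The clustering idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: the abstract does not specify the five similarity measures or the clustering procedure, so the sketch assumes a length-normalised Levenshtein distance, handles permuted word order by sorting tokens before comparison, and uses a greedy single-pass clustering. The function names, the 0.8 threshold and the sample values are all hypothetical and chosen only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]. Assumption: tokens are lower-cased and sorted
    so that a permuted word order does not count as a difference."""
    a_norm = " ".join(sorted(a.lower().split()))
    b_norm = " ".join(sorted(b.lower().split()))
    longest = max(len(a_norm), len(b_norm))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a_norm, b_norm) / longest


def cluster(values, threshold=0.8):
    """Greedy single-pass clustering (an assumed strategy): each value joins
    the first cluster whose first member is similar enough, else starts a new one."""
    clusters = []
    for value in values:
        for group in clusters:
            if similarity(value, group[0]) >= threshold:
                group.append(value)
                break
        else:
            clusters.append([value])
    return clusters


# Hypothetical sample values: a permuted variant, a misspelling, an unrelated term.
sample = [
    "Universidad de Alicante",
    "Alicante Universidad de",
    "Univesidad de Alicante",
    "Springer-Verlag",
]
clusters = cluster(sample)
```

Each resulting cluster could then be mapped to a single common value, for example its first member. Comparing every value against existing clusters is quadratic in the worst case, which is consistent with the abstract's remark that the approach is time-consuming.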

Keywords

Cluster Algorithm, Similarity Measure, Consistency Index, Levenshtein Distance, Word Position
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.


References

  1. D. Bitton and D. J. DeWitt. Duplicate record elimination in large data files. ACM Transactions on Database Systems, 8(2):255–265, June 1983.
  2. J. C. French, A. L. Powell, and E. Schulman. Applications of Approximate Word Matching in Information Retrieval. In F. Golshani and K. Makki, editors, Proceedings of the Sixth International Conference on Information and Knowledge Management (CIKM 1997), pages 9–15, Las Vegas (USA), November 10–14 1997. ACM Press.
  3. J. A. Hartigan. Clustering Algorithms. A Wiley Publication in Applied Statistics. John Wiley & Sons, New York (USA), 1975.
  4. M. A. Hernandez and S. J. Stolfo. Real-world data is dirty: Data cleansing and the merge/purge problem. Journal of Data Mining and Knowledge Discovery, 2(1):9–37, 1998.
  5. V. I. Levenshtein. Binary codes capable of correcting deletions, insertions, and reversals. Cybernetics and Control Theory, 10:707–710, 1966.
  6. S. Luján-Mora. An Algorithm for Computing the Invariant Distance from Word Position. Internet: http://www.dlsi.ua.es/~slujan/files/idwp.ps, June 2000.
  7. S. Luján-Mora and M. Palomar. Clustering of Similar Values, in Spanish, for the Improvement of Search Systems. In M. C. Monard and J. S. Sichman, editors, International Joint Conference IBERAMIA-SBIA 2000 Open Discussion Track Proceedings, pages 217–226, Atibaia, São Paulo (Brazil), November 19–22 2000. ICMC/USP.
  8. A. E. Monge and C. P. Elkan. An efficient domain-independent algorithm for detecting approximately duplicate database records. In SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD’97), pages 23–29, Tucson (USA), May 11 1997.
  9. A. Motro and I. Rakov. Estimating the Quality of Databases. In T. Andreasen, H. Christiansen, and H. Larsen, editors, Proceedings of FQAS 98: Third International Conference on Flexible Query Answering Systems, volume 1495 of Lecture Notes in Artificial Intelligence, pages 298–307, Roskilde (Denmark), May 1998. Springer-Verlag.
  10. E. T. O’Neill and D. Vizine-Goetz. The Impact of Spelling Errors on Databases and Indexes. In C. Nixon and L. Padgett, editors, 10th National Online Meeting Proceedings, pages 313–320, New York (USA), May 9–11 1989. Learned Information Inc.
  11. C. J. V. Rijsbergen. Information Retrieval. Butterworths, London (UK), 2nd edition, 1979.

Copyright information

© Springer-Verlag Berlin Heidelberg 2001

Authors and Affiliations

  • Sergio Luján-Mora (1)
  • Manuel Palomar (1)
  1. Departamento de Lenguajes y Sistemas Informáticos, Universidad de Alicante, Alicante, Spain
