Comparing String Similarity Measures for Reducing Inconsistency in Integrating Data from Different Sources
The Web has dramatically increased the need for efficient and flexible mechanisms to provide integrated views over multiple heterogeneous information sources. When multiple sources are integrated, each source may represent data differently. A common problem is inconsistency in the data: the very same term may appear with different values because of misspellings, permuted word order, spelling variants and so on. In this paper, we present an improvement on our previous work for reducing the inconsistency found in existing databases. The objective of our method is to integrate and standardize the different values that refer to the same term. All the values that refer to the same term are clustered by measuring their degree of similarity. Each cluster can then be assigned a common value that is substituted for the original values. The paper describes and compares five different similarity measures for clustering and evaluates their performance on real-world data. The method we present may work well in practice, but it is time-consuming.
Keywords: Cluster Algorithm, Similarity Measure, Consistency Index, Levenshtein Distance, Word Position
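To make the idea in the abstract concrete, the following is a minimal sketch of clustering values by string similarity and mapping each cluster to a common value. It is not the paper's algorithm or any of its five measures: the normalized Levenshtein similarity, the word-sorting step in `normalize`, the greedy single-pass `cluster` function, the `0.8` threshold and the sample values are all assumptions chosen for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions and substitutions each cost 1."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def normalize(value: str) -> str:
    """Lowercase and sort the words so a permuted word order is ignored (assumption)."""
    return " ".join(sorted(value.lower().split()))


def cluster(values, threshold=0.8):
    """Greedy clustering: a value joins the first cluster whose first member is similar enough."""
    clusters = []
    for value in values:
        for members in clusters:
            if similarity(normalize(value), normalize(members[0])) >= threshold:
                members.append(value)
                break
        else:
            clusters.append([value])
    return clusters


# Hypothetical sample data: a misspelling, a permuted word order, and an unrelated value.
values = ["University of Alicante", "Univresity of Alicante",
          "Alicante University of", "MIT"]
clusters = cluster(values)

# Map every original value to a common value (here simply the first member of its cluster).
common_value = {v: members[0] for members in clusters for v in members}
```

Comparing every value against every cluster costs one edit-distance computation per pair, each quadratic in string length, which gives some intuition for why an approach of this kind can be time-consuming on large databases.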