A major bottleneck in data matching is the rapid deterioration of both running time and accuracy as the amount of data to be processed increases. One reason for this deterioration is the cost that data matching systems incur when comparing records to determine their similarity (or dissimilarity). Approaches such as blocking and concatenation of data attributes have been used to reduce this comparison cost. In this paper, we present and analyse Keyword and Digram clustering as alternatives for enhancing the performance of data matching systems. We compare these clustering techniques in terms of the potential savings in the number of comparisons performed and their accuracy in correctly clustering similar data. Our results on a sample of a database of London Stock Exchange listed companies show that the clustering techniques can improve accuracy as well as save time in data matching systems.
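
To make the idea of digram clustering concrete, the following is a minimal sketch and not the authors' implementation. It assumes that a digram is a pair of adjacent characters, that similarity is measured with the Dice coefficient over digram sets, and that records are grouped greedily against the first member of each cluster. The function names, the 0.6 threshold, and the sample company names are all hypothetical choices made for illustration.

```python
def digrams(text):
    """Return the set of adjacent character pairs of a normalised string."""
    # Assumption: case and punctuation are ignored before digrams are formed.
    s = "".join(ch for ch in text.lower() if ch.isalnum())
    return {s[i:i + 2] for i in range(len(s) - 1)}


def dice(a, b):
    """Dice coefficient of two digram sets (1.0 means identical sets)."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


def cluster(names, threshold=0.6):
    """Greedy single-pass clustering (an illustrative assumption).

    Each name joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new one.
    """
    clusters = []
    for name in names:
        grams = digrams(name)
        for c in clusters:
            if dice(grams, c["grams"]) >= threshold:
                c["members"].append(name)
                break
        else:
            clusters.append({"grams": grams, "members": [name]})
    return [c["members"] for c in clusters]


def pairs_within(clusters):
    """Number of detailed record comparisons left inside the clusters."""
    return sum(len(c) * (len(c) - 1) // 2 for c in clusters)


# Hypothetical sample data, not taken from the paper's database.
names = ["Barclays PLC", "Barclays P.L.C.", "Vodafone Group",
         "Vodafone Grp", "Tesco"]
groups = cluster(names)
all_pairs = len(names) * (len(names) - 1) // 2
print(groups)
print(pairs_within(groups), "of", all_pairs, "pairwise comparisons remain")
```

On this toy input, detailed comparison is confined to records that share a cluster, so only 2 of the 10 possible record pairs remain. The clustering pass itself still compares each record against the cluster representatives, so any real saving depends on how cheap that pass is relative to a full record comparison.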


Information Retrieval; Minimal Span Tree; Record Linkage; Data Match; London Stock Exchange
These keywords were added by machine and not by the authors.





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Edward Tersoo Apeh (1, 2)
  • Bogdan Gabrys (1)
  1. School of Design, Engineering and Computing, Computational Intelligence Research Group, Bournemouth University, Poole, UK
  2. QGate Software Limited, Fareham, Hampshire, UK
