The Normalized Compression Distance as a Distance Measure in Entity Identification

  • Sebastian Klenk
  • Dennis Thom
  • Gunther Heidemann
Part of the Lecture Notes in Computer Science book series (LNCS, volume 5633)

Abstract

The identification of identical entities across heterogeneous data sources still involves a large amount of manual processing, mainly because different sources use different data representations in varying semantic contexts. Until now, entity identification has required either the (often manual) unification of the different representations or the effort of programming tools with specialized interfaces for each representation type. For large and sparse databases, which are common for medical data, for example, the manual approach becomes infeasible.

We have developed a widely applicable compression-based approach that does not rely on structural or semantic unity. The results we have obtained are promising in both recognition precision and performance.
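
As a rough illustration of the compression-based idea, the normalized compression distance named in the title is commonly defined as NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) denotes the compressed length of its argument. The Python sketch below is a minimal example under stated assumptions: it uses zlib as the compressor and made-up records, and the helper names are hypothetical. The abstract does not say which compressor or data the authors used.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (zlib is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Normalized compression distance between two byte strings."""
    cx = compressed_size(x)
    cy = compressed_size(y)
    cxy = compressed_size(x + y)  # compress the concatenation of both inputs
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical records: the first two describe the same person in
# different representations, the third describes someone else.
record_a = b"name=John Smith;dob=1970-01-01;city=Stuttgart"
record_b = b"<person><n>Smith, John</n><born>01.01.1970</born><town>Stuttgart</town></person>"
record_c = b"name=Maria Gonzalez;dob=1985-07-23;city=Madrid"

print("a vs b:", ncd(record_a, record_b))
print("a vs c:", ncd(record_a, record_c))
```

Lower values indicate that the compressor found more shared structure between the two inputs. On records as short as these, compressor overhead can dominate the score, so the numbers should be read as illustrative only.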

Copyright information

© Springer-Verlag Berlin Heidelberg 2009

Authors and Affiliations

  • Sebastian Klenk (1)
  • Dennis Thom (1)
  • Gunther Heidemann (1)
  1. Intelligent Systems Group, Stuttgart University, Stuttgart, Germany. Email: ais@vis.uni-stuttgart.de
