The VLDB Journal, Volume 21, Issue 5, pp 677–702

Measuring structural similarity of semistructured data based on information-theoretic approaches

Regular Paper

Abstract

We propose and experimentally evaluate different approaches for measuring the structural similarity of semistructured documents based on information-theoretic concepts. Common to all approaches is a two-step procedure: first, we extract and linearize the structural information from documents, and then, we use similarity measures that are based on, respectively, Kolmogorov complexity and Shannon entropy to determine the distance between the documents. Compared to other approaches, we are able to achieve a linear run-time complexity and demonstrate in an experimental evaluation that the results of our technique in terms of clustering quality are on a par with or even better than those of other, slower approaches.
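
The two-step procedure described above can be illustrated with a minimal sketch. It is not the authors' implementation, and every name in it is hypothetical. It makes two assumptions. First, the structure is linearized as the sequence of start and end tags in document order, with text content and attributes discarded. Second, Kolmogorov complexity is approximated by the output size of a general-purpose compressor (zlib here) and plugged into the normalized compression distance, a standard compression-based stand-in for information distance. The paper's own linearizations and measures may differ.

```python
import zlib
import xml.etree.ElementTree as ET


def linearize(xml_text: str) -> bytes:
    """Step 1 (assumed form): keep only the tag structure, in document order."""
    root = ET.fromstring(xml_text)
    parts = []

    def visit(node):
        parts.append(f"<{node.tag}>")      # opening tag, attributes dropped
        for child in node:
            visit(child)
        parts.append(f"</{node.tag}>")     # closing tag, text content dropped

    visit(root)
    return "".join(parts).encode("utf-8")


def compressed_size(data: bytes) -> int:
    """Compressed length in bytes, used as a computable proxy for complexity."""
    return len(zlib.compress(data, 9))


def ncd(x: bytes, y: bytes) -> float:
    """Step 2 (assumed form): normalized compression distance of two strings."""
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


# Hypothetical sample documents: two share a structure, one does not.
doc_a = "<lib>" + "<book><title>A</title><author>B</author></book>" * 20 + "</lib>"
doc_b = "<lib>" + "<book><title>C</title><author>D</author></book>" * 25 + "</lib>"
doc_c = "<site>" + "<item><name>n</name><price>1</price><tag>x</tag></item>" * 20 + "</site>"

same_structure = ncd(linearize(doc_a), linearize(doc_b))
different_structure = ncd(linearize(doc_a), linearize(doc_c))
print(same_structure, different_structure)
```

In this sketch the pair with matching structure should receive the smaller distance, since compressing the concatenation of two structurally similar sequences adds little beyond compressing one of them. Both steps make a single pass over the input, in keeping with the linear run time emphasized in the abstract, although the exact cost depends on the compressor chosen.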

Keywords

Similarity measures, Semistructured documents, XML, Clustering

Copyright information

© Springer-Verlag 2012

Authors and Affiliations

  • Sven Helmer (1)
  • Nikolaus Augsten (2)
  • Michael Böhlen (3)

  1. Birkbeck, University of London, London, UK
  2. Free University of Bozen-Bolzano, Bozen-Bolzano, Italy
  3. University of Zurich, Zurich, Switzerland
