Measuring Tree Similarity for Natural Language Processing Based Information Retrieval

  • Zhiwei Lin
  • Hui Wang
  • Sally McClean
Part of the Lecture Notes in Computer Science book series (LNCS, volume 6177)

Abstract

Natural language processing based information retrieval (NIR) aims to go beyond conventional bag-of-words based information retrieval (KIR) by considering syntactic and even semantic information in documents. NIR is conceptually appealing, but it is difficult because it requires measuring distance or similarity between structures. We aim to move beyond the state of the art in measuring structure similarity for NIR.

In this paper, we propose dtwAcs, a novel tree similarity measure based on a new interpretation of trees as multidimensional sequences. The distance between two trees is computed as the distance between their multidimensional sequences, which is obtained by integrating all common subsequences into the dynamic time warping method. Experimental results show that dtwAcs outperforms the state of the art.
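The abstract does not give the formulation of dtwAcs, so the following is only a rough sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes that a tree is represented as a list of label sequences (for instance its root-to-leaf paths), that the similarity of two label sequences is a normalized count of matching subsequence embeddings (a simplified stand-in for the all common subsequences measure), and that one minus this similarity serves as the local cost in a standard dynamic time warping recurrence. All function names are hypothetical.

```python
from math import inf, sqrt


def count_common_matchings(x, y):
    """Count pairs of matching subsequence embeddings of x and y.

    The empty subsequence counts once. This is a simplified stand-in
    (an assumption) for the all common subsequences measure.
    """
    c = [[1] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(1, len(x) + 1):
        for j in range(1, len(y) + 1):
            # Inclusion-exclusion over matchings that skip x[i-1] or y[j-1].
            c[i][j] = c[i - 1][j] + c[i][j - 1] - c[i - 1][j - 1]
            if x[i - 1] == y[j - 1]:
                # Every matching of the prefixes can be extended by this pair.
                c[i][j] += c[i - 1][j - 1]
    return c[len(x)][len(y)]


def acs_similarity(x, y):
    """Normalize the count to [0, 1]; identical sequences score 1."""
    self_x = count_common_matchings(x, x)
    self_y = count_common_matchings(y, y)
    return count_common_matchings(x, y) / sqrt(self_x * self_y)


def dtw_distance(a, b):
    """Dynamic time warping over two lists of label sequences.

    The local cost 1 - acs_similarity is an illustrative choice.
    """
    d = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    d[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 1.0 - acs_similarity(a[i - 1], b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[len(a)][len(b)]


# Hypothetical parse trees written as root-to-leaf label paths.
tree_1 = [["S", "NP", "DT"], ["S", "NP", "NN"], ["S", "VP", "VBD"]]
tree_2 = [["S", "NP", "NN"], ["S", "VP", "VBD"]]
distance = dtw_distance(tree_1, tree_2)
```

In this sketch the warping step lets trees with different numbers of paths be aligned, while the subsequence count rewards paths that share labels in the same order even when they are not identical.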

Copyright information

© Springer-Verlag Berlin Heidelberg 2010

Authors and Affiliations

  • Zhiwei Lin (1)
  • Hui Wang (1)
  • Sally McClean (1)
  1. Faculty of Computing and Engineering, University of Ulster, Northern Ireland, United Kingdom
