Using Proportional Transportation Distances for Measuring Document Similarity

  • Xiaojun Wan
  • Jianwu Yang
Part of the Lecture Notes in Computer Science book series (LNCS, volume 3936)


This paper proposes a novel document similarity measure based on the Proportional Transportation Distance (PTD). The measure improves on the earlier similarity measure based on optimal matching by allowing many-to-many matching between the subtopics of documents. Each document is first decomposed into a set of subtopics; the PTD between the two sets of subtopics is then computed by solving a transportation problem and used to evaluate the similarity of the two documents. Experiments on TDT-3 data show that the measure is effective for measuring document similarity and highly robust, i.e. it depends far less on the underlying document decomposition algorithm than the optimal-matching-based measure does.
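The transportation step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each subtopic carries a positive weight and that a ground distance between every pair of subtopics is already available; the function name `ptd`, the weight lists, and the distance matrix are hypothetical. Both weight sets are scaled to total mass 1, which is equivalent to the proportional scaling used in the PTD, and the resulting balanced transportation problem is solved with a generic successive-shortest-path min-cost-flow routine chosen here only to keep the example dependency-free.

```python
from fractions import Fraction
import math


def ptd(weights_a, weights_b, dist):
    """Proportional Transportation Distance between two weighted sets.

    weights_a[i], weights_b[j]: positive subtopic weights (non-empty lists).
    dist[i][j]: ground distance between subtopic i of A and subtopic j of B.
    """
    m, n = len(weights_a), len(weights_b)
    total_a = sum(Fraction(w) for w in weights_a)
    total_b = sum(Fraction(w) for w in weights_b)
    # Scale both sets to total mass 1 (exact rationals avoid rounding drift).
    supply = [Fraction(w) / total_a for w in weights_a]
    demand = [Fraction(w) / total_b for w in weights_b]

    # Nodes: 0 = source, 1..m = subtopics of A, m+1..m+n = subtopics of B, last = sink.
    src, snk = 0, m + n + 1
    graph = [[] for _ in range(m + n + 2)]
    edges = []  # edge k is [target, residual capacity, cost]; k ^ 1 is its reverse

    def add_edge(u, v, cap, cost):
        graph[u].append(len(edges))
        edges.append([v, cap, cost])
        graph[v].append(len(edges))
        edges.append([u, Fraction(0), -cost])

    for i in range(m):
        add_edge(src, 1 + i, supply[i], 0.0)
    for j in range(n):
        add_edge(1 + m + j, snk, demand[j], 0.0)
    for i in range(m):
        for j in range(n):
            # Any subtopic of A may ship mass to any subtopic of B (many-to-many).
            add_edge(1 + i, 1 + m + j, Fraction(1), float(dist[i][j]))

    total_cost = 0.0
    while True:
        # Bellman-Ford: cheapest augmenting path in the residual graph.
        best = [math.inf] * len(graph)
        best[src] = 0.0
        prev_edge = [None] * len(graph)
        for _ in range(len(graph) - 1):
            changed = False
            for u in range(len(graph)):
                if best[u] == math.inf:
                    continue
                for k in graph[u]:
                    v, cap, cost = edges[k]
                    if cap > 0 and best[u] + cost < best[v] - 1e-12:
                        best[v] = best[u] + cost
                        prev_edge[v] = k
                        changed = True
            if not changed:
                break
        if prev_edge[snk] is None:
            break  # all mass has been shipped

        # Find the bottleneck capacity along the path, then push that much flow.
        push, v = None, snk
        while v != src:
            k = prev_edge[v]
            push = edges[k][1] if push is None else min(push, edges[k][1])
            v = edges[k ^ 1][0]
        v = snk
        while v != src:
            k = prev_edge[v]
            edges[k][1] -= push
            edges[k ^ 1][1] += push
            v = edges[k ^ 1][0]
        total_cost += float(push) * best[snk]

    # Total shipped mass is 1, so the minimum cost is already the distance.
    return total_cost
```

For instance, `ptd([3, 1], [1, 1], [[0.0, 1.0], [1.0, 0.0]])` evaluates to 0.25: after scaling, the first subtopic of A holds mass 3/4 but its zero-distance counterpart in B accepts only 1/2, so the remaining 1/4 has to travel at distance 1. A one-to-one matching cannot express this kind of split.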


Keywords: Transportation Problem, Mean Average Precision, Vector Space Model, Optimal Match, Document Similarity





Copyright information

© Springer-Verlag Berlin Heidelberg 2006

Authors and Affiliations

  • Xiaojun Wan (1)
  • Jianwu Yang (1)
  1. Institute of Computer Science and Technology, Peking University, Beijing, China
