A Measure Based on Optimal Matching in Graph Theory for Document Similarity

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3411)

Abstract

Measuring pairwise document similarity is critical to many text retrieval and mining tasks. The most popular measure of document similarity is the Cosine measure in the Vector Space Model. In this paper, we propose a new similarity measure based on optimal matching in graph theory. The proposed measure takes a document's structural information into account by considering the word distributions over its different text segments. It first calculates the similarities between pairs of text segments drawn from the two documents and then obtains the overall document similarity through optimal matching. We set up document similarity search experiments to test the effectiveness of the proposed measure. The experimental results and a user study demonstrate that the proposed measure outperforms the Cosine measure.
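
The two-step procedure in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the whitespace tokenizer, the use of cosine similarity as the segment-pair weight, the normalization by the smaller segment count, and the function names `cosine` and `optimal_matching_similarity` are all assumptions made for illustration. The optimal matching is found here by brute force over one-to-one assignments, which only suits documents with a handful of segments; a polynomial-time assignment algorithm would replace it in practice.

```python
import math
from collections import Counter
from itertools import permutations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def optimal_matching_similarity(doc_a: list[str], doc_b: list[str]) -> float:
    """Score two documents, each given as a list of text segments.

    Hypothetical helper: segments are the nodes of a weighted bipartite
    graph, and the score is the weight of the best one-to-one matching.
    """
    segs_a = [Counter(s.lower().split()) for s in doc_a]
    segs_b = [Counter(s.lower().split()) for s in doc_b]
    if not segs_a or not segs_b:
        return 0.0
    # Put the document with fewer segments on the left so that each of
    # its segments can be paired with a distinct segment on the right.
    if len(segs_a) > len(segs_b):
        segs_a, segs_b = segs_b, segs_a
    # Step 1: similarity for every pair of segments (edge weights).
    weights = [[cosine(x, y) for y in segs_b] for x in segs_a]
    # Step 2: choose the one-to-one pairing with maximum total weight.
    # Brute force over assignments; assumed acceptable for small inputs.
    best = max(
        sum(weights[i][j] for i, j in enumerate(perm))
        for perm in permutations(range(len(segs_b)), len(segs_a))
    )
    # Assumed normalization: average over the matched pairs.
    return best / len(segs_a)
```

Under these assumptions, two documents that contain the same segments in a different order still score close to 1, because the matching pairs each segment with its best available counterpart rather than comparing segments by position.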

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wan, X., Peng, Y. (2005). A Measure Based on Optimal Matching in Graph Theory for Document Similarity. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_20

  • DOI: https://doi.org/10.1007/978-3-540-31871-2_20

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25065-4

  • Online ISBN: 978-3-540-31871-2

  • eBook Packages: Computer Science, Computer Science (R0)
