A Measure Based on Optimal Matching in Graph Theory for Document Similarity

Wan, Xiaojun; Peng, Yuxin

doi:10.1007/978-3-540-31871-2_20


Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3411)

Abstract

Measuring pairwise document similarity is critical to many text retrieval and mining tasks. The most popular measure of document similarity is the cosine measure in the Vector Space Model. In this paper, we propose a new similarity measure based on optimal matching in graph theory. The proposed measure takes the structural information of a document into account by considering the word distributions over its text segments. It first calculates the similarity of each pair of text segments drawn from the two documents and then derives the overall similarity between the documents by computing an optimal matching over those segment pairs. We conduct document similarity search experiments to test the effectiveness of the proposed measure. The experimental results and a user study show that the proposed measure outperforms the cosine measure.
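
The two-step procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation, and several details are assumptions made only for illustration: documents arrive already split into text segments, tokens are obtained by lowercasing and whitespace splitting, segment pairs are scored with the cosine of raw term counts, and the matched total is normalised by the smaller number of segments. The matching is found by brute-force enumeration, which is only practical for a handful of segments; a polynomial-time assignment algorithm such as Kuhn-Munkres would normally be used instead. All function names are hypothetical.

```python
import math
from collections import Counter
from itertools import permutations


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def segment_vectors(segments):
    """Turn each text segment into a bag-of-words count vector (assumed tokenisation)."""
    return [Counter(segment.lower().split()) for segment in segments]


def optimal_matching_similarity(doc_x, doc_y):
    """Score two pre-segmented documents by their best one-to-one segment matching."""
    xs, ys = segment_vectors(doc_x), segment_vectors(doc_y)
    if not xs or not ys:
        return 0.0
    # Keep the shorter document on the left so every left segment gets a partner.
    if len(xs) > len(ys):
        xs, ys = ys, xs
    # Step 1: similarity of every pair of segments (edge weights of a bipartite graph).
    weights = [[cosine(x, y) for y in ys] for x in xs]
    # Step 2: maximum total weight over all one-to-one assignments (brute force).
    best = max(
        sum(weights[i][j] for i, j in enumerate(assignment))
        for assignment in permutations(range(len(ys)), len(xs))
    )
    # Assumed normalisation: divide by the number of matched segment pairs.
    return best / len(xs)


if __name__ == "__main__":
    doc_a = ["graph matching theory", "document similarity search"]
    doc_b = ["document similarity search", "graph matching theory"]
    print(optimal_matching_similarity(doc_a, doc_b))
```

In this example the two documents contain the same segments in a different order, so the matching pairs each segment with its identical counterpart and the score is close to 1. A measure built this way rewards documents whose segments line up one to one, which is the structural information that a single whole-document cosine score does not capture.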



Author information

Authors and Affiliations

Institute of Computer Science and Technology, Peking University, Beijing, 100871, China
Xiaojun Wan & Yuxin Peng


Editor information

Editors and Affiliations

School of Engineering, Information and Communications University, 119, Munjiro, Yuseong-gu, 305-732, Daejeon, Korea
Sung Hyon Myaeng
The Key Laboratory of Power System Protection and Dynamic Security Monitoring and Control under Ministry of Education, North China Electric Power University, Zhuxinzhuang Dewai, 102206, Beijing, China
Ming Zhou
Department of Systems Engineering and Engineering Management, Shatin, The Chinese University of Hong Kong, Hong Kong, N.T.
Kam-Fai Wong
5F, Beijing Sigma Center, Microsoft Research Asia, No. 49 Zhichun Road Haidian District, 100080, Beijing, China
Hong-Jiang Zhang

About this paper

Cite this paper

Wan, X., Peng, Y. (2005). A Measure Based on Optimal Matching in Graph Theory for Document Similarity. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_20

DOI: https://doi.org/10.1007/978-3-540-31871-2_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25065-4
Online ISBN: 978-3-540-31871-2
eBook Packages: Computer Science, Computer Science (R0)
