Abstract
Measuring pairwise document similarity is critical to various text retrieval and mining tasks. The most popular measure of document similarity is the Cosine measure in the Vector Space Model. In this paper, we propose a new similarity measure based on optimal matching in graph theory. The proposed measure takes into account the structural information of a document by considering the word distributions over different text segments. It first calculates the similarities between pairs of text segments drawn from the two documents and then obtains the overall document similarity through optimal matching over those segment pairs. We set up document similarity search experiments to test the effectiveness of the proposed measure. The experimental results and a user study demonstrate that the proposed measure outperforms the Cosine measure.
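The two steps described in the abstract (segment-level similarities, then an optimal matching over segment pairs) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes the documents are already split into text segments, uses raw term-frequency cosine as the segment similarity, finds the maximum-weight matching by brute-force enumeration (practical only for a handful of segments; a Kuhn-Munkres solver would be the usual choice), and normalizes by the smaller segment count. The function names and the normalization are hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import permutations


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def segment_vectors(segments):
    """Turn each text segment into a bag-of-words Counter (whitespace tokens)."""
    return [Counter(s.lower().split()) for s in segments]


def optimal_matching_similarity(doc_x, doc_y):
    """Similarity of two documents, each given as a list of text segments.

    Builds a weighted bipartite graph whose edge weights are segment-pair
    cosine similarities, then takes the one-to-one matching with the
    largest total weight. Assumption: the total is divided by the smaller
    segment count so the score lies in [0, 1].
    """
    xs = segment_vectors(doc_x)
    ys = segment_vectors(doc_y)
    if not xs or not ys:
        return 0.0
    if len(xs) > len(ys):
        xs, ys = ys, xs  # make xs the side with fewer segments
    weights = [[cosine(x, y) for y in ys] for x in xs]
    # Brute force: try every injective assignment of xs segments to ys segments.
    best = max(
        sum(weights[i][j] for i, j in enumerate(assignment))
        for assignment in permutations(range(len(ys)), len(xs))
    )
    return best / len(xs)
```

For example, calling `optimal_matching_similarity` on two documents that contain the same segments in a different order scores them as (numerically) identical, because each segment is matched to its counterpart wherever it appears.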
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Wan, X., Peng, Y. (2005). A Measure Based on Optimal Matching in Graph Theory for Document Similarity. In: Myaeng, S.H., Zhou, M., Wong, KF., Zhang, HJ. (eds) Information Retrieval Technology. AIRS 2004. Lecture Notes in Computer Science, vol 3411. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-31871-2_20
DOI: https://doi.org/10.1007/978-3-540-31871-2_20
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-25065-4
Online ISBN: 978-3-540-31871-2
eBook Packages: Computer Science, Computer Science (R0)