Finding “Similar but Different” Documents Based on Coordinate Relationship

Zhao, Meng; Ohshima, Hiroaki; Tanaka, Katsumi

doi:10.1007/978-3-319-49304-6_15

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10075)

Included in the following conference series:

International Conference on Asian Digital Libraries

Abstract

Traditional search technologies are built on the similarity relationship: given a document, they return documents with similar content. However, similarity-based search does not always give good results. Documents that are merely similar add little new information, which makes it difficult to increase information gain. In this paper, we propose a method for finding similar but different documents for a user-given document by distinguishing the coordinate relationship from the similarity relationship between documents. Put simply, a similar but different document shares the topic of the given document but describes different events or concepts. For example, given as input a news article reporting the Oregon school shooting, articles reporting other school shootings, such as the Virginia Tech shooting, are detected and returned to the user. Experiments on the New York Times Annotated Corpus verify the effectiveness of our method and illustrate the importance of incorporating the coordinate relationship when finding similar but different documents.
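
The distinction drawn in the abstract can be illustrated with a toy ranking function. The sketch below is not the authors' method, which this abstract does not describe. It assumes, purely for illustration, that each document has already been split into hypothetical `topic_terms` (what kind of story it is) and `event_terms` (which specific event it reports), and it rewards topic similarity while penalizing event overlap.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words term counts."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a: set, b: set) -> float:
    """Jaccard overlap between two sets of event-specific terms."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def score(query: dict, candidate: dict) -> float:
    """Illustrative score: high for same topic, low for same event.

    The field names and the multiplicative combination are assumptions
    made for this sketch, not taken from the paper.
    """
    topic_sim = cosine(Counter(query["topic_terms"]),
                       Counter(candidate["topic_terms"]))
    event_overlap = jaccard(set(query["event_terms"]),
                            set(candidate["event_terms"]))
    return topic_sim * (1.0 - event_overlap)


# Hand-made toy documents (not drawn from any corpus).
query = {"id": "oregon",
         "topic_terms": ["school", "shooting", "gunman", "students"],
         "event_terms": ["Oregon"]}
candidates = [
    {"id": "oregon_followup",
     "topic_terms": ["school", "shooting", "gunman", "students"],
     "event_terms": ["Oregon"]},
    {"id": "virginia_tech",
     "topic_terms": ["school", "shooting", "gunman", "campus"],
     "event_terms": ["Virginia Tech"]},
    {"id": "election",
     "topic_terms": ["election", "vote", "senate"],
     "event_terms": ["Washington"]},
]

ranked = sorted(candidates, key=lambda d: score(query, d), reverse=True)
for doc in ranked:
    print(doc["id"], round(score(query, doc), 2))
```

In this toy setting, a pure similarity ranking would put the follow-up on the same event first, whereas the combined score puts the article about a different school shooting first and gives the follow-up a score of zero.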

Notes

1. A follow-up is an article that gives further information on a previously reported news event.
2. Note that verbs are compared in their base form.
3. http://nlp.stanford.edu/software/tagger.shtml

Acknowledgment

This work was supported in part by the following projects: Grants-in-Aid for Scientific Research (Nos. 16H02906, 15H01718 and 24680008) from MEXT of Japan.

Author information

Authors and Affiliations

Graduate School of Informatics, Kyoto University, Yoshida Honmachi, Kyoto, 606-8501, Japan
Meng Zhao, Hiroaki Ohshima & Katsumi Tanaka

Corresponding author

Correspondence to Meng Zhao.

Editor information

Editors and Affiliations

University of Tsukuba, Tsukuba, Japan
Atsuyuki Morishima
Vienna University of Technology, Vienna, Austria
Andreas Rauber
Victoria University of Wellington, Wellington, New Zealand
Chern Li Liew

About this paper

Cite this paper

Zhao, M., Ohshima, H., Tanaka, K. (2016). Finding “Similar but Different” Documents Based on Coordinate Relationship. In: Morishima, A., Rauber, A., Liew, C. (eds) Digital Libraries: Knowledge, Information, and Data in an Open Access Society. ICADL 2016. Lecture Notes in Computer Science, vol 10075. Springer, Cham. https://doi.org/10.1007/978-3-319-49304-6_15

DOI: https://doi.org/10.1007/978-3-319-49304-6_15
Published: 15 November 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-49303-9
Online ISBN: 978-3-319-49304-6
eBook Packages: Computer Science, Computer Science (R0)
