Finding “Similar but Different” Documents Based on Coordinate Relationship

  • Meng Zhao
  • Hiroaki Ohshima
  • Katsumi Tanaka
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10075)


Traditional search technologies are based on the similarity relationship: given a document, they return documents with similar content. However, similarity-based search does not always produce good results. For example, highly similar documents add little new information, so it is difficult to increase information gain. In this paper, we propose a method that finds documents that are similar to but different from a user-given one by distinguishing the coordinate relationship from the similarity relationship between documents. Put simply, a similar but different document shares its topic with the given document but describes different events or concepts. For example, given as input a news article reporting the Oregon school shooting, articles reporting other school shootings, such as the Virginia Tech shooting, are detected and returned to the user. Experiments conducted on the New York Times Annotated Corpus verify the effectiveness of our method and illustrate the importance of incorporating the coordinate relationship when finding similar but different documents.


Keywords: Coordinate relationship, Similar but different, Web mining



This work was supported in part by the following projects: Grants-in-Aid for Scientific Research (Nos. 16H02906, 15H01718 and 24680008) from MEXT of Japan.



Copyright information

© Springer International Publishing AG 2016

Authors and Affiliations

  1. Graduate School of Informatics, Kyoto University, Kyoto, Japan
