Efficient Graph-Based Document Similarity

  • Christian Paul
  • Achim Rettinger
  • Aditya Mogadala
  • Craig A. Knoblock
  • Pedro Szekely
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 9678)


Assessing the relatedness of documents is at the core of many applications such as document retrieval and recommendation. Most similarity approaches operate on word-distribution-based document representations, which are fast to compute but problematic when documents differ in language, vocabulary or type, and which neglect the rich relational knowledge available in knowledge graphs. In contrast, graph-based document models can leverage valuable knowledge about relations between entities; however, because graph operations are expensive, similarity assessments tend to become infeasible in many applications. This paper presents an efficient semantic similarity approach that exploits explicit hierarchical and transversal relations. Our experiments show that (i) our similarity measure correlates significantly better with human notions of document similarity than comparable measures, (ii) this also holds for short documents with few annotations, and (iii) document similarity can be calculated efficiently compared to other graph-traversal-based approaches.


Semantic document similarity · Knowledge graph based document models · Efficient similarity calculation



This material is based on research supported by the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 611346, and in part by the National Science Foundation under Grant No. 1117913.



Copyright information

© Springer International Publishing Switzerland 2016

Authors and Affiliations

  1. Institute of Applied Informatics and Formal Description Methods (AIFB), Karlsruhe Institute for Technology, Karlsruhe, Germany
  2. Information Sciences Institute, University of Southern California, Marina Del Rey, USA
