Context Semantic Analysis: A Knowledge-Based Technique for Computing Inter-document Similarity

  • Conference paper
Similarity Search and Applications (SISAP 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9939)

Abstract

We propose a novel knowledge-based technique for inter-document similarity, called Context Semantic Analysis (CSA). Several specialized approaches built on top of a specific knowledge base (e.g. Wikipedia) exist in the literature, but CSA differs from them because it is designed to be portable to any RDF knowledge base. Our technique relies on a generic RDF knowledge base (e.g. DBpedia or Wikidata), from which it extracts a vector that represents the context of a document. We show how such a Semantic Context Vector can be effectively exploited to compute inter-document similarity. Experimental results show that our general technique outperforms baselines built on top of traditional methods and achieves performance similar to that of specialized methods.
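
The abstract does not specify how two Semantic Context Vectors are compared. As a minimal sketch, assuming each vector is a sparse map from knowledge-base entity identifiers to non-negative weights and assuming cosine similarity as the comparison (both are assumptions, not confirmed by the text above), inter-document similarity could be computed as follows. The function name and the example entity identifiers are hypothetical.

```python
import math
from typing import Dict


def cosine_similarity(u: Dict[str, float], v: Dict[str, float]) -> float:
    """Cosine similarity between two sparse vectors stored as dicts.

    Keys stand for knowledge-base entities (hypothetical identifiers),
    values for their weights in the document's context vector.
    """
    # Iterate over the smaller vector when computing the dot product.
    if len(u) > len(v):
        u, v = v, u
    dot = sum(weight * v.get(entity, 0.0) for entity, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    # An empty or all-zero vector has no direction; treat similarity as 0.
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical context vectors for two documents.
doc_a = {"dbr:Rome": 0.6, "dbr:Italy": 0.3, "dbr:Europe": 0.1}
doc_b = {"dbr:Italy": 0.5, "dbr:Europe": 0.4, "dbr:Paris": 0.1}
similarity = cosine_similarity(doc_a, doc_b)
```

Storing the vectors as dictionaries keeps the sketch independent of the size of the knowledge base, since only entities with non-zero weight are materialized.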

Notes

  1. https://www.w3.org/TR/rdf-primer/.

  2. We abbreviate URI namespaces with common prefixes; see http://prefix.cc for details.

  3. https://www.textrazor.com/.

  4. When an entity is an instance of more than one class, we use the class with the smallest number of instances, because it characterizes the entity better. However, if we filter the knowledge bases by excluding classes defined in external sources such as YAGO, GeoNames, etc., only 6.4 % of entities in DBpedia and 2.22 % in Wikidata are instances of more than one class.

  5. https://webfiles.uci.edu/mdlee/LeePincombeWelsh.zip.

  6. Implemented as in [15] (only removing the stopwords).

  7. Unless explicitly stated otherwise, all differences in performance are statistically significant at \(p\text{-value} < 0.05\) using Fisher’s Z-value transformation.

  8. The sets of starting entities are obtained by using NER APIs.

  9. With tf-idf as the weighting function.

  10. The Reuters collection is available at http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.

  11. We executed this experiment on an Ubuntu machine with 16 cores (Intel Xeon E312xx) and 98 GB of RAM.

  12. http://aims.fao.org/standards/agrovoc/linked-open-data.
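
Note 7 refers to a significance test based on Fisher’s Z transformation but does not spell out the procedure. The sketch below shows one common form of that test, comparing two correlation coefficients under the assumption that they come from independent samples; the note does not say which variant was used, so the independence assumption, the function name, and the example values are all hypothetical.

```python
import math
from typing import Tuple


def compare_correlations(r1: float, n1: int, r2: float, n2: int) -> Tuple[float, float]:
    """Two-sided test for the difference between two correlations.

    Assumes the two correlations come from independent samples of
    sizes n1 and n2 (an assumption of this sketch).
    Returns the z statistic and the two-sided p-value.
    """
    if n1 <= 3 or n2 <= 3:
        raise ValueError("each sample needs more than 3 observations")
    # Fisher's transformation: z = atanh(r) is approximately normal
    # with variance 1 / (n - 3).
    z1 = math.atanh(r1)
    z2 = math.atanh(r2)
    standard_error = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    z = (z1 - z2) / standard_error
    # Two-sided p-value under the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value


# Hypothetical correlations of two methods, each measured on 50 items.
z_stat, p = compare_correlations(0.70, 50, 0.55, 50)
is_significant = p < 0.05
```

When both correlations are measured against the same human judgements, the samples are not independent, and a test for dependent correlations would be the more appropriate choice.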

References

  1. Anyanwu, K., Maduko, A., Sheth, A.: SemRank: ranking complex relationship search results on the semantic web. In: Proceedings of the 14th International Conference on World Wide Web, pp. 117–127. ACM (2005)

  2. Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., Ives, Z.: DBpedia: a nucleus for a web of open data. In: Aberer, K., et al. (eds.) ASWC/ISWC 2007. LNCS, vol. 4825, pp. 722–735. Springer, Heidelberg (2007). doi:10.1007/978-3-540-76298-0_52

  3. Bär, D., Zesch, T., Gurevych, I.: A reflective view on text similarity. In: RANLP, pp. 515–520 (2011)

  4. Beneventano, D., Bergamaschi, S., Sorrentino, S., Vincini, M., Benedetti, F.: Semantic annotation of the CEREALAB database by the AGROVOC linked dataset. Ecol. Inform. 26, 119–126 (2015)

  5. Bizer, C., Heath, T., Berners-Lee, T.: Linked data: the story so far. In: Sheth, A.P. (ed.) Semantic Services, Interoperability, Web Applications: Emerging Concepts, pp. 205–227. IGI Global, Hershey (2009)

  6. Bos, L., Donnelly, K.: SNOMED-CT: the advanced terminology and coding system for eHealth. Stud. Health Technol. Inform. 121, 279–290 (2006)

  7. Caracciolo, C., Stellato, A., Morshed, A., Johannsen, G., Rajbhandari, S., Jaques, Y., Keizer, J.: The AGROVOC linked dataset. Semant. Web 4(3), 341–348 (2013)

  8. Cyganiak, R., Wood, D., Lanthaler, M.: RDF 1.1 concepts and abstract syntax. W3C Recomm. 25, 1–8 (2014)

  9. Dumais, S.T.: Latent semantic analysis. Annu. Rev. Inf. Sci. Technol. 38(1), 188–230 (2004)

  10. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: IJCAI, vol. 7, pp. 1606–1611 (2007)

  11. Gomaa, W.H., Fahmy, A.A.: A survey of text similarity approaches. Int. J. Comput. Appl. 68(13), 13–18 (2013)

  12. Hassan, S., Mihalcea, R.: Semantic relatedness using salient semantic analysis. In: AAAI (2011)

  13. Haveliwala, T.H.: Topic-sensitive pagerank. In: Proceedings of the 11th International Conference on World Wide Web, pp. 517–526. ACM (2002)

  14. Lin, L.I.-K.: A concordance correlation coefficient to evaluate reproducibility. Biometrics 45, 255–268 (1989)

  15. Lee, M., Pincombe, B., Welsh, M.: An empirical evaluation of models of text document similarity. In: Cognitive Science (2005)

  16. Manning, C.D., Raghavan, P., Schütze, H., et al.: Introduction to Information Retrieval, vol. 1. Cambridge University Press, Cambridge (2008)

  17. Mendes, P., Jakob, M., García-Silva, A., Bizer, C.: DBpedia spotlight: shedding light on the web of documents. In: I-Semantics (2011)

  18. Nadeau, D., Sekine, S.: A survey of named entity recognition and classification. Lingvisticae Investigationes 30(1), 3–26 (2007)

  19. Nakov, P., Popova, A., Mateev, P.: Weight functions impact on LSA performance. In: EuroConference RANLP, pp. 187–193 (2001)

  20. Page, L., Brin, S., Motwani, R., Winograd, T.: The pagerank citation ranking: bringing order to the web (1999)

  21. Schuhmacher, M., Ponzetto, S.P.: Knowledge-based graph document modeling. In: Proceedings of the 7th ACM International Conference on Web Search and Data Mining, pp. 543–552. ACM (2014)

  22. Suchanek, F.M., Kasneci, G., Weikum, G.: YAGO: a core of semantic knowledge. In: Proceedings of the 16th International Conference on World Wide Web, pp. 697–706. ACM (2007)

  23. Turney, P.D., Pantel, P., et al.: From frequency to meaning: vector space models of semantics. J. Artif. Intell. Res. 37(1), 141–188 (2010)

  24. Van de Cruys, T.: Two multivariate generalizations of pointwise mutual information. In: Proceedings of the Workshop on Distributional Semantics and Compositionality, pp. 16–20. Association for Computational Linguistics (2011)

  25. Vrandečić, D., Krötzsch, M.: Wikidata: a free collaborative knowledgebase. Commun. ACM 57(10), 78–85 (2014)

  26. Xing, W., Ghorbani, A.: Weighted pagerank algorithm. In: Second Annual Conference on Communication Networks and Services Research, 2004. Proceedings, pp. 305–314. IEEE (2004)

  27. Yeh, E., Ramage, D., Manning, C.D., Agirre, E., Soroa, A.: WikiWalk: random walks on Wikipedia for semantic relatedness. In: Proceedings of the 2009 Workshop on Graph-Based Methods for Natural Language Processing, pp. 41–49. Association for Computational Linguistics (2009)

  28. Zhao, Y., Karypis, G.: Evaluation of hierarchical clustering algorithms for document datasets. In: Proceedings of the Eleventh International Conference on Information and Knowledge Management, pp. 515–524. ACM (2002)

Corresponding author

Correspondence to Fabio Benedetti.

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Benedetti, F., Beneventano, D., Bergamaschi, S. (2016). Context Semantic Analysis: A Knowledge-Based Technique for Computing Inter-document Similarity. In: Amsaleg, L., Houle, M., Schubert, E. (eds) Similarity Search and Applications. SISAP 2016. Lecture Notes in Computer Science, vol 9939. Springer, Cham. https://doi.org/10.1007/978-3-319-46759-7_13

  • DOI: https://doi.org/10.1007/978-3-319-46759-7_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-46758-0

  • Online ISBN: 978-3-319-46759-7

  • eBook Packages: Computer Science, Computer Science (R0)
