Sentence Similarity by Combining Explicit Semantic Analysis and Overlapping N-Grams

  • Hai Hieu Vu
  • Jeanne Villaneau
  • Farida Saïd
  • Pierre-François Marteau
Part of the Lecture Notes in Computer Science book series (LNCS, volume 8655)

Abstract

We propose a sentence similarity measure that combines a knowledge-based measure, a lighter version of ESA (Explicit Semantic Analysis), with a distributional measure, Rouge. We applied this hybrid measure to two French domain-oriented corpora collected from the Web and compared its similarity scores to those of human judges. In both domains, ESA and Rouge perform better combined than they do individually. Moreover, using the whole Wikipedia base in ESA did not prove necessary, since the best results were obtained with a small number of well-selected concepts.
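As a rough illustration of the idea, the sketch below mixes an ESA-style concept-vector cosine with a Rouge-N overlap score. The abstract does not state how the two scores are combined, so the linear interpolation with weight `alpha`, the symmetrised Rouge score, and the `concept_weights` table (a hypothetical word-to-concept weight mapping standing in for a Wikipedia-derived index) are all assumptions made for this example, not the authors' method.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Rouge-N recall: clipped n-gram overlap divided by reference n-gram count."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


def concept_vector(tokens, concept_weights):
    """Sum the per-word concept weights into one sentence-level vector.

    concept_weights is a hypothetical mapping word -> {concept: weight};
    in ESA the concepts would be Wikipedia articles.
    """
    vec = Counter()
    for tok in tokens:
        for concept, weight in concept_weights.get(tok, {}).items():
            vec[concept] += weight
    return vec


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * v.get(k, 0.0) for k, w in u.items())
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def hybrid_similarity(s1, s2, concept_weights, alpha=0.5, n=1):
    """Assumed combination: alpha * ESA cosine + (1 - alpha) * symmetric Rouge-N."""
    esa = cosine(concept_vector(s1, concept_weights),
                 concept_vector(s2, concept_weights))
    rouge = 0.5 * (rouge_n(s1, s2, n) + rouge_n(s2, s1, n))
    return alpha * esa + (1 - alpha) * rouge


# Toy, made-up concept weights for illustration only.
toy_weights = {
    "chat": {"Animal": 1.0},
    "chien": {"Animal": 0.8, "Compagnie": 0.2},
}
score = hybrid_similarity(["le", "chat", "dort"], ["le", "chien", "dort"], toy_weights)
print(round(score, 3))
```

In this toy pair the two sentences share only two of three unigrams, yet the concept vectors of "chat" and "chien" overlap strongly, which is the kind of complementary signal the hybrid measure is meant to exploit. Restricting `concept_weights` to a small, well-chosen concept set mirrors the abstract's observation that the full Wikipedia base was not needed.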

Keywords

Similarity Score; Semantic Relatedness; Computational Linguistics; Annotation Task; Sentence Similarity
These keywords were added by machine and not by the authors. This process is experimental and the keywords may be updated as the learning algorithm improves.

Copyright information

© Springer International Publishing Switzerland 2014

Authors and Affiliations

  • Hai Hieu Vu (1)
  • Jeanne Villaneau (1)
  • Farida Saïd (2)
  • Pierre-François Marteau (1)

  1. IRISA, Université de Bretagne Sud (UBS), France
  2. LMBA, Université de Bretagne Sud, France
