
Automated Detection of Non-Relevant Posts on the Russian Imageboard “2ch”: Importance of the Choice of Word Representations

  • Amir Bakarov
  • Olga Gureenkova
Conference paper
Part of the Lecture Notes in Computer Science book series (LNCS, volume 10716)

Abstract

This study considers the problem of automated detection of non-relevant posts on Web forums and discusses an approach that approximates this problem with the task of detecting semantic relatedness between a given post and the opening post of the forum discussion thread. The approximated task can be solved by training a supervised classifier on the composed word embeddings of the two posts. Since success in this task can be quite sensitive to the choice of word representations, we compare the performance of different word embedding models. We train seven models (Word2Vec, GloVe, Word2Vec-f, Wang2Vec, AdaGram, FastText, Swivel), evaluate the embeddings they produce on a dataset of human judgements, and compare their performance on the task of non-relevant post detection. For this comparison, we propose a semantic relatedness dataset of posts from one of the most popular Russian Web forums, the imageboard “2ch”, which has challenging lexical and grammatical features.
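The pipeline described above (post pairs turned into composed word embeddings, then fed to a supervised classifier) can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the averaging of word vectors, concatenation as the composition of the two posts, logistic regression as the classifier, and names such as EMBED_DIM and train_relevance_classifier are all assumptions introduced for the example.

    # Minimal sketch of the abstract's setup (illustrative assumptions:
    # averaged word vectors per post, concatenation as composition,
    # logistic regression as the supervised classifier).
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    EMBED_DIM = 100  # assumed embedding dimensionality

    def post_vector(tokens, embeddings):
        """Average the available word vectors of a post; zero vector if none are found."""
        vecs = [embeddings[t] for t in tokens if t in embeddings]
        return np.mean(vecs, axis=0) if vecs else np.zeros(EMBED_DIM)

    def pair_features(opening_post, reply_post, embeddings):
        """Compose the opening post and the candidate post into one feature vector."""
        return np.concatenate([post_vector(opening_post, embeddings),
                               post_vector(reply_post, embeddings)])

    def train_relevance_classifier(pairs, labels, embeddings):
        """pairs: list of (opening_post_tokens, reply_tokens); labels: 1 = relevant, 0 = not.
        `embeddings` is a word -> vector lookup produced by any of the compared models
        (Word2Vec, GloVe, FastText, ...)."""
        X = np.vstack([pair_features(op, rp, embeddings) for op, rp in pairs])
        clf = LogisticRegression(max_iter=1000)
        clf.fit(X, np.asarray(labels))
        return clf

Swapping the embedding lookup while keeping the classifier fixed is what makes the comparison of word representation models possible in this setup.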

Keywords

Distributional semantics · Compositional semantics · 2ch · Imageboard · Semantic relatedness · Word similarity · Word embeddings


Copyright information

© Springer International Publishing AG 2018

Authors and Affiliations

  1. Chatme AI LLC, Novosibirsk, Russia
  2. Novosibirsk State University, Novosibirsk, Russia
