Morphosyntactic Preprocessing Impact on Document Embedding: An Empirical Study on Semantic Similarity

  • Nourelhouda Yahi
  • Hacene Belhadef
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1073)


Word embedding is among the most widely known and used representations of the vocabulary of text documents. It captures the context of a word in a document, but many applications need to understand the content of text units longer than a single word; representing such units is what we call “document embedding”. This paper presents an empirical study of the impact of morphosyntactic preprocessing on document embedding techniques, evaluated on a textual semantic similarity task. It compares the most widely known text preprocessing techniques: (1) cleaning, comprising stop-word removal, lowercase conversion, and punctuation and number elimination; (2) stemming, using the best-known algorithms in the literature, namely the Porter, Snowball and Lancaster stemmers; and (3) lemmatization, using the WordNet Lemmatizer. Experimental analysis on the MSRP (Microsoft Research Paraphrase) dataset reveals that preprocessing improves classifier accuracy, with stemming methods outperforming the other techniques.
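
As an illustration of the cleaning step named in the abstract (lowercase conversion, punctuation and number elimination, stop-word removal), here is a minimal sketch using only the Python standard library. It is not the authors' implementation: the function name `clean` and the tiny `STOP_WORDS` set are assumptions made for this example, and a real experiment would use a full stop-word list.

```python
import string

# Hypothetical, deliberately tiny stop-word list (an assumption for this
# sketch); a real study would use a complete list from an NLP toolkit.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "on", "to",
              "is", "are", "was", "were"}

# Map every ASCII punctuation character and digit to a space, so that
# words separated only by punctuation do not get glued together.
_STRIP_TABLE = str.maketrans(
    {ch: " " for ch in string.punctuation + string.digits}
)


def clean(text):
    """Return the cleaned tokens of `text`.

    Steps: lowercase, replace punctuation and digits with spaces,
    split on whitespace, then drop stop words.
    """
    lowered = text.lower()
    stripped = lowered.translate(_STRIP_TABLE)
    return [tok for tok in stripped.split() if tok not in STOP_WORDS]


if __name__ == "__main__":
    print(clean("The 2 cats were sitting on the mat, in 2019!"))
    # prints ['cats', 'sitting', 'mat']
```

In a pipeline like the one the abstract describes, stemming or lemmatization would then be applied to the returned tokens before a document embedding model is trained on them; those steps are left out here because they rely on external libraries.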


Keywords: Word embedding, Document embedding, Doc2vec, Text preprocessing, Semantic similarity



Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  1. NTIC Faculty, MISC Laboratory, University of Constantine 2-Abdelhamid Mehri, Constantine, Algeria
