
The impact of corpus domain on word representation: a study on Persian word embeddings

  • Project Notes
  • Published in Language Resources and Evaluation

Abstract

Word embedding has been a great success story for natural language processing in recent years. The main purpose of this approach is to provide a vector representation of words based on neural network language modeling. Using a large training corpus, a model such as Skip-gram learns from the co-occurrences of words and captures their semantic features. Moreover, by adding the recently introduced character embedding model to the objective function, the model can also focus on the morphological features of words. In this paper, we study the impact of the training corpus on the results of word embedding and show how the genre of the training data affects the type of information captured by word embedding models. We perform our experiments on the Persian language. As part of the contribution of this paper, we also provide two well-known evaluation datasets for Persian, namely the Google semantic/syntactic analogy dataset and Wordsim353. The experiments cover word embeddings computed from various public Persian corpora of different genres and sizes, together with a comprehensive lexical and semantic comparison between them. We identify words whose usage differs between these datasets, yielding entirely different vector representations; this has a significant impact across domains, with results varying by up to 9% on the Google analogy task and up to 6% on Wordsim353. The resulting word embeddings for each individual corpus, as well as for their combinations, will be made publicly available for further word embedding research on Persian.
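The Google analogy evaluation mentioned above is commonly run as a 3CosAdd nearest-neighbour search: for a pair relation a : b :: c : ?, the vector b − a + c is compared against every vocabulary vector by cosine similarity. The sketch below illustrates this procedure with tiny hand-made toy vectors; it is not the paper's Persian embeddings or its actual evaluation script.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    return dot / (nu * nv)

def analogy(emb, a, b, c):
    """Answer 'a is to b as c is to ?' via 3CosAdd.

    Builds the offset vector b - a + c and returns the vocabulary
    word closest to it, excluding the three query words themselves.
    """
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    best, best_sim = None, -2.0
    for word, vec in emb.items():
        if word in (a, b, c):
            continue
        sim = cosine(target, vec)
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy 2-d embeddings, constructed by hand for illustration only.
TOY = {
    "king":  [0.9, 0.8],
    "queen": [0.9, 0.1],
    "man":   [0.1, 0.8],
    "woman": [0.1, 0.1],
    "apple": [0.5, 0.5],
}

print(analogy(TOY, "man", "king", "woman"))  # → queen
```

An analogy item counts as correct when the top-ranked word matches the gold answer; accuracy over all items is the score that varies by up to 9% between corpora in the paper's experiments.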


Notes

  1. We will make our scripts and models available upon publication.


Acknowledgements

The Twitter dataset was provided by Ali Shariat Bahadori from the University of Tehran. Any usage and statements made herein are solely the responsibility of the authors.

Author information

Correspondence to Saeedeh Momtazi.

About this article

Cite this article

Hadifar, A., Momtazi, S. The impact of corpus domain on word representation: a study on Persian word embeddings. Lang Resources & Evaluation 52, 997–1019 (2018). https://doi.org/10.1007/s10579-018-9419-x
