POS, ANA and LEM: Word Embeddings Built from Annotated Corpora Perform Better (Best Paper Award, Second Place)

  • Conference paper
  • In: Computational Linguistics and Intelligent Text Processing (CICLing 2018)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13396)

Abstract

Word embedding models have become popular and efficient tools for representing lexical semantics in different languages. Nevertheless, there is no standard for the direct evaluation of such models. Moreover, the applicability of word embedding models is still an open research question for less-resourced and morphologically complex languages. In this paper, we present and evaluate different corpus preprocessing methods that make it possible to create high-quality word embedding models for Hungarian (and other morphologically complex languages). We use a crowdsourcing-based intrinsic evaluation scenario and present a detailed comparison of our models. The results show that models built from analyzed corpora are of better quality than those built from raw text.
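
As a rough illustration of the preprocessing variants the paper compares (POS, ANA and LEM versus a raw corpus), the sketch below feeds hypothetical token streams for one Hungarian sentence to an off-the-shelf skip-gram trainer. The tag notation and the gensim settings are illustrative assumptions based on the notes further down, not the paper's exact setup.

```python
# Hedged sketch: hypothetical RAW / LEM / POS / ANA token streams for one Hungarian
# sentence, fed to an off-the-shelf skip-gram trainer (gensim >= 4.0).
# The tag notation is invented for illustration.
from gensim.models import Word2Vec

# "A macskák egerekre vadásznak" -- 'The cats hunt for mice'
raw = [["a", "macskák", "egerekre", "vadásznak"]]               # surface forms only
lem = [["a", "macska", "egér", "vadászik"]]                     # lemmas only
pos = [["a", "macskák", "[N.Pl]", "egerekre", "[N.Subl]",       # surface form + tag token
        "vadásznak", "[V.Prs.3Pl]"]]                            #   for inflected words
ana = [["a", "[Det]", "macska", "[N.Pl]", "egér", "[N.Subl]",   # lemma + tag token
        "vadászik", "[V.Prs.3Pl]"]]                             #   for every word

for name, corpus in [("RAW", raw), ("LEM", lem), ("POS", pos), ("ANA", ana)]:
    model = Word2Vec(corpus, vector_size=300, window=5, sg=1, min_count=1, workers=1)
    print(name, "vocabulary size:", len(model.wv))
```

In a real setting each variant would of course be produced from the full annotated corpus; the point of the sketch is only how the same training procedure is applied to differently preprocessed token streams.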

Notes

  1. The resource is available online at http://www.cs.cornell.edu/schnabts/eval/.

  2. Although testing at nearest-neighbour rank 30 may seem odd at first, word2vec output often makes perfect sense even at around rank 2000: most entries at around that rank for macska ‘cat’ or nyúl ‘rabbit’ are animals (see the nearest-neighbour sketch following these notes).

  3. nyúl is also a verb in Hungarian, meaning ‘reach for’. The verbal sense is four times as frequent in the training corpus and dominates the vector representation in most models.

  4. Each word is represented by exactly two tokens in the ANA model, so the same context window covers only half as many words. The same applies to inflected word forms in the POS model, while uninflected words are represented by a single token in that model. Note that this effect is mitigated by the fact that word2vec downsamples frequent word forms (among them frequent tags) when creating the model, which corresponds to an effective expansion of the window size (a small worked example follows these notes).

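And a minimal sketch of the rank-based inspection mentioned in note 2, assuming a word embedding model stored as gensim keyed vectors; the file name and the use of gensim here are assumptions for illustration, not the paper's setup:

```python
# Hedged sketch: inspecting nearest neighbours far down the similarity ranking.
# "hu_raw.kv" is a hypothetical file name for a pre-trained Hungarian model.
from gensim.models import KeyedVectors

wv = KeyedVectors.load("hu_raw.kv")

neighbours = wv.most_similar("macska", topn=2000)   # 'cat'
print(neighbours[:30])     # ranks 1-30: the set used in the evaluation
print(neighbours[-10:])    # around rank 2000: often still animal terms
```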

Acknowledgments

This research was supported by grants FK125217 and PD125216 of the National Research, Development and Innovation Office of Hungary, financed under the FK17 and PD17 funding schemes.

Author information

Correspondence to Attila Novák.

Copyright information

© 2023 Springer Nature Switzerland AG

About this paper

Cite this paper

Novák, A., Novák, B. (2023). POS, ANA and LEM: Word Embeddings Built from Annotated Corpora Perform Better (Best Paper Award, Second Place). In: Gelbukh, A. (ed.) Computational Linguistics and Intelligent Text Processing. CICLing 2018. Lecture Notes in Computer Science, vol 13396. Springer, Cham. https://doi.org/10.1007/978-3-031-23793-5_29

  • DOI: https://doi.org/10.1007/978-3-031-23793-5_29

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23792-8

  • Online ISBN: 978-3-031-23793-5
