
Xword: A Multi-lingual Framework for Expanding Words

  • Faisal Alshargi
  • Saeedeh Shekarpour
  • Waseem Alromema
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 1073)

Abstract

Word expansion is useful in information retrieval and question answering systems: it alleviates the vocabulary mismatch problem and thereby leads to higher recall. Recent word embedding models have demonstrated merit for the word expansion task in comparison to traditional n-gram models. However, acquiring quality embeddings for each language requires corpus compilation, normalization and parameter tuning, which are time-consuming and challenging, especially for low-resource languages such as Arabic. In this paper, we introduce Xword, an online multi-lingual framework for automatic word expansion. Xword relies on both pre-trained ad hoc word embedding models and n-gram models for the expansion task. It currently covers two languages, Arabic and German. Xword presents the results of each model both individually and collectively. Additionally, Xword can filter the result set by the sentiment and part-of-speech (POS) tag of each word. Xword is available as a Web API, along with downloadable models and documentation, on our public GitHub.
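
The pipeline described in the abstract (expand a word with an embedding model and an n-gram model, report the results per model and combined, then filter by POS tag and sentiment) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the toy vectors, counts and annotations are invented, the function names are hypothetical, and combining the models by ordered union is only one plausible reading of "collectively". It is not the Xword implementation or its Web API.

```python
import math

# Toy embedding table (invented values, not taken from the Xword models).
EMBEDDINGS = {
    "gut": [0.9, 0.1, 0.0],
    "toll": [0.8, 0.2, 0.1],
    "schlecht": [-0.7, 0.3, 0.1],
    "Haus": [0.0, 0.1, 0.9],
}

# Toy n-gram co-occurrence counts for the query word (invented values).
NGRAM_NEIGHBOURS = {"gut": {"sehr": 12, "toll": 5, "Haus": 2}}

# Toy annotations: word -> (POS tag, sentiment polarity). Hypothetical lexicon.
ANNOTATIONS = {
    "toll": ("ADJ", "positive"),
    "schlecht": ("ADJ", "negative"),
    "Haus": ("NOUN", "neutral"),
    "sehr": ("ADV", "neutral"),
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def expand_embedding(word, k=3):
    """Return the k words whose vectors are closest to the query word."""
    if word not in EMBEDDINGS:
        return []
    query = EMBEDDINGS[word]
    scored = [(other, cosine(query, vec))
              for other, vec in EMBEDDINGS.items() if other != word]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [other for other, _ in scored[:k]]


def expand_ngram(word, k=3):
    """Return the k most frequent n-gram neighbours of the query word."""
    counts = NGRAM_NEIGHBOURS.get(word, {})
    return sorted(counts, key=counts.get, reverse=True)[:k]


def expand(word, k=3, pos=None, sentiment=None):
    """Return (per-model results, combined and filtered results)."""
    per_model = {
        "embedding": expand_embedding(word, k),
        "ngram": expand_ngram(word, k),
    }
    # Assumption: "collectively" is modelled as an order-preserving union.
    combined = []
    for candidates in per_model.values():
        for candidate in candidates:
            if candidate not in combined:
                combined.append(candidate)

    def keep(candidate):
        tag, polarity = ANNOTATIONS.get(candidate, (None, None))
        return ((pos is None or tag == pos)
                and (sentiment is None or polarity == sentiment))

    return per_model, [c for c in combined if keep(c)]
```

Calling `expand("gut", pos="ADJ", sentiment="positive")` on this toy data returns the per-model lists together with a combined list restricted to positive adjectives. A real deployment would replace the dictionaries with trained models and curated lexicons.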

Keywords

X-word, Word expansion, Embedding, German language, Arabic language, Quality

Copyright information

© Springer Nature Switzerland AG 2020

Authors and Affiliations

  • Faisal Alshargi (1, corresponding author)
  • Saeedeh Shekarpour (2)
  • Waseem Alromema (3)
  1. Universität Leipzig, Leipzig, Germany
  2. University of Dayton, Dayton, USA
  3. Taibah University, Madinah, Kingdom of Saudi Arabia
