Intrinsic Evaluation for English–Tamil Bilingual Word Embeddings

  • J. P. Sanjanasri
  • Vijay Krishna Menon
  • S. Rajendran
  • K. P. Soman
  • M. Anand Kumar
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 910)


Despite the growth of bilingual word embeddings, no work so far directly evaluates them for the English–Tamil language pair. In this paper, we present a dataset and an evaluation paradigm for English–Tamil bilingual word vector models. The dataset contains words that cover a range of concepts occurring in natural language. Word pairs are scored on similarity rather than on association or relatedness; hence, pairs that are associated but not literally similar receive a low rating. The measures are quantified further to ensure consistency in the dataset, mimicking the cognitive phenomena. As a result, the dataset can be used by non-native speakers with minimal effort. We also present some inferences and insights into the semantics captured by word vectors and by human cognition.
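As an illustration of how a similarity-rated word-pair dataset of this kind is typically used, the sketch below scores each English–Tamil pair by the cosine similarity of its two vectors and compares the resulting ranking with the human ratings using Spearman's rank correlation. This is a minimal sketch under stated assumptions, not the authors' procedure: the abstract does not name the correlation measure, and the function names, the toy vectors, the transliterated Tamil words, and the ratings are all hypothetical placeholders.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def evaluate(pairs, en_vectors, ta_vectors):
    """Correlate model similarities with human ratings.

    Pairs with an out-of-vocabulary word are skipped; the number of
    pairs actually scored is returned alongside the correlation.
    """
    model_scores, human_scores = [], []
    for en_word, ta_word, rating in pairs:
        if en_word in en_vectors and ta_word in ta_vectors:
            model_scores.append(cosine(en_vectors[en_word], ta_vectors[ta_word]))
            human_scores.append(rating)
    return spearman(model_scores, human_scores), len(model_scores)


# Toy data (hypothetical): 2-d vectors in a shared bilingual space and
# invented ratings on a 0-10 scale. Real embeddings have hundreds of dimensions.
en_vectors = {"dog": [1.0, 0.0], "book": [0.0, 1.0], "car": [1.0, 1.0]}
ta_vectors = {"naay": [0.9, 0.1], "puththagam": [0.2, 0.8], "maram": [-1.0, 0.5]}
pairs = [
    ("dog", "naay", 9.0),
    ("book", "puththagam", 8.0),
    ("car", "maram", 1.5),
    ("sky", "vaanam", 9.5),  # out of vocabulary here, so it is skipped
]

rho, covered = evaluate(pairs, en_vectors, ta_vectors)
print(f"Spearman correlation over {covered} pairs: {rho:.3f}")
```

A rank correlation is used here because human ratings and cosine similarities live on different scales; only the ordering of the pairs is compared. Reporting the number of covered pairs alongside the correlation makes vocabulary gaps visible, which matters for a morphologically rich language such as Tamil.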


Intrinsic evaluation; Bilingual word embeddings; English–Tamil; Lexical gaps; Semantic similarity; Semantic relatedness



Copyright information

© Springer Nature Singapore Pte Ltd. 2020

Authors and Affiliations

  • J. P. Sanjanasri (1)
  • Vijay Krishna Menon (1)
  • S. Rajendran (1)
  • K. P. Soman (1)
  • M. Anand Kumar (2)
  1. Center for Computational Engineering & Networking (CEN), Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Coimbatore, India
  2. Department of Information Technology, National Institute of Technology, Surathkal, India
