Natural alpha embeddings

A Publisher Correction to this article was published on 15 May 2021

This article has been updated


Learning an embedding for a large collection of items is a popular approach to overcome the computational limitations associated with one-hot encodings. The aim of item embeddings is to learn a low-dimensional space for the representations, whose geometry captures relevant features or relationships in the data at hand. This can be achieved, for example, by exploiting adjacencies among items in large sets of unlabelled data. In this paper we interpret the item embeddings obtained from conditional models within an Information Geometric framework. By exploiting the \(\alpha\)-geometry of the exponential family, first introduced by Amari, we introduce a family of natural \(\alpha\)-embeddings represented by vectors in the tangent space of the probability simplex, which includes standard approaches available in the literature as a special case. A typical example is given by word embeddings, commonly used in natural language processing, such as Word2Vec and GloVe. In our analysis, we show how the \(\alpha\)-deformation parameter can affect performance on standard evaluation tasks.
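As a hedged illustration of the \(\alpha\)-geometry mentioned above, the sketch below computes Amari's \(\alpha\)-representation of a probability vector, \(\ell^{(\alpha)}(p) = \frac{2}{1-\alpha}\, p^{(1-\alpha)/2}\) for \(\alpha \ne 1\) and \(\log p\) for \(\alpha = 1\). This is only the standard \(\alpha\)-representation from the information geometry literature, not the construction of natural \(\alpha\)-embeddings developed in the article. The function name `alpha_representation` and the toy distribution are assumptions made for illustration.

```python
import numpy as np


def alpha_representation(p, alpha):
    """Amari's alpha-representation of a strictly positive probability vector.

    Hypothetical helper for illustration only; it is not the article's
    definition of natural alpha-embeddings.
    """
    p = np.asarray(p, dtype=float)
    if np.any(p <= 0):
        raise ValueError("p must be strictly positive")
    if np.isclose(alpha, 1.0):
        # Limit case alpha = 1: the logarithmic (exponential) representation.
        return np.log(p)
    return (2.0 / (1.0 - alpha)) * p ** ((1.0 - alpha) / 2.0)


# Toy distribution over three items (made-up numbers).
p = np.array([0.5, 0.3, 0.2])
for a in (-1.0, 0.0, 1.0):
    print(a, alpha_representation(p, a))
```

For \(\alpha = -1\) the representation is the probability vector itself, for \(\alpha = 0\) it is \(2\sqrt{p}\), and for \(\alpha = 1\) it is \(\log p\), which is why a single parameter can interpolate between these familiar coordinate systems.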



Notes

  1. In the following, for each word \(w\) we suppose the \(\arg\max\) to be unique. When this is not the case, the formula can be easily generalized.


References

  1. Amari, S.I.: Theory of information spaces: a differential geometrical foundation of statistics. Post RAAG Reports (1980)
  2. Amari, S.I.: Differential geometry of curved exponential families-curvatures and information loss. Ann. Stat. 10, 357–385 (1982)
  3. Amari, S.I.: Geometrical theory of asymptotic ancillarity and conditional inference. Biometrika 69(1), 1–17 (1982)
  4. Amari, S.I.: Differential-Geometrical Methods in Statistics. Lecture Notes in Statistics, vol. 28. Springer, New York (1985)
  5. Amari, S.I.: Dual connections on the Hilbert bundles of statistical models. In: Geometrization of Statistical Theory, pp. 123–151. ULDM Publ., Lancaster (1987)
  6. Amari, S.I.: Information Geometry and Its Applications, Applied Mathematical Sciences, vol. 194. Springer, Tokyo (2016)
  7. Amari, S.I., Cichocki, A.: Information geometry of divergence functions. Bull. Polish Acad. Sci. Tech. Sci. 58(1), 183–195 (2010)
  8. Amari, S.I., Nagaoka, H.: Methods of Information Geometry. American Mathematical Society, Providence (2000)
  9. Arora, S., Li, Y., Liang, Y., Ma, T., Risteski, A.: Linear Algebraic Structure of Word Senses, with Applications to Polysemy. arXiv:1601.03764 (2016)
  10. Arora, S., Li, Y., Liang, Y., Ma, T., Risteski, A.: RAND-WALK: a latent variable model approach to word embeddings. arXiv:1502.03520 (2016)
  11. Bakarov, A.: A Survey of Word Embeddings Evaluation Methods. arXiv:1801.09536 (2018)
  12. Barkan, O., Koenigstein, N.: Item2Vec: neural item embedding for collaborative filtering. In: IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP), pp. 1–6 (2016)
  13. Baroni, M., Dinu, G., Kruszewski, G.: Don’t count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors. In: 52nd Annual Meeting of the Association for Computational Linguistics, pp. 238–247 (2014)
  14. Baroni, M., Lenci, A.: How we blessed distributional semantic evaluation. In: Workshop on Geometrical Models of Natural Language Semantics, pp. 1–10 (2011)
  15. Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. J. Mach. Learn. Res. 3, 1137–1155 (2003)
  16. Bengio, Y., Simard, P., Frasconi, P., et al.: Learning long-term dependencies with gradient descent is difficult. IEEE Trans. Neural Netw. 5(2), 157–166 (1994)
  17. Bullinaria, J.A., Levy, J.P.: Extracting semantic representations from word co-occurrence statistics: a computational study. Behav. Res. Methods 39(3), 510–526 (2007)
  18. Bullinaria, J.A., Levy, J.P.: Extracting semantic representations from word co-occurrence statistics: stop-lists, stemming, and SVD. Behav. Res. Methods 44(3), 890–907 (2012)
  19. Casella, G., Berger, R.L.: Statistical Inference, 2nd edn. Duxbury Press, California (2001)
  20. Coenen, A., Reif, E., Yuan, A., Kim, B., Pearce, A., Viégas, F., Wattenberg, M.: Visualizing and Measuring the Geometry of BERT. NeurIPS (2019)
  21. Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. Human Language Technologies, North American Chapter of the Association for Computational Linguistics (2019)
  22. Firth, J.R.: A Synopsis of Linguistic Theory (1957)
  23. Fonarev, A., Grinchuk, O., Gusev, G., Serdyukov, P., Oseledets, I.: Riemannian optimization for skip-gram negative sampling. In: Proceedings of the Association for Computational Linguistics, pp. 2028–2036 (2017)
  24. Lebanon, G.: Riemannian geometry and statistical machine learning. Ph.D. Thesis, Carnegie Mellon University (2005)
  25. Hewitt, J., Manning, C.: A structural probe for finding syntax in word representations. In: North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 4129–4138 (2019)
  26. Ichimori, T.: On rounding off quotas to the nearest integers in the problem of apportionment. JSIAM Lett. 3, 21–24 (2011)
  27. Jawanpuria, P., Balgovind, A., Kunchukuttan, A., Mishra, B.: Learning multilingual word embeddings in latent metric space: a geometric approach. Trans. Assoc. Comput. Linguist. 7, 107–120 (2019)
  28. Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Comput. IEEE 42(8), 30–37 (2009)
  29. Krishnamurthy, B., Puri, N., Goel, R.: Learning vector-space representations of items for recommendations using word embedding models. Procedia Comput. Sci. 80, 2205–2210 (2016)
  30. Lauritzen, S.L.: Statistical manifolds. Differential geometry in statistical inference, pp. 163–216 (1987)
  31. Lebanon, G.: Metric learning for text documents. IEEE Trans. Pattern Anal. Mach. Intell. 28(4), 497–508 (2006)
  32. Lee, L.S.Y.: On the linear algebraic structure of distributed word representations. arXiv:1511.06961 (2015)
  33. Levy, O., Goldberg, Y.: Neural word embedding as implicit matrix factorization. NIPS p. 9 (2014)
  34. Levy, O., Goldberg, Y., Dagan, I.: Improving distributional similarity with lessons learned from word embeddings. Trans. Assoc. Comput. Linguist. 3, 211–225 (2015)
  35. Meng, Y., Huang, J., Wang, G., Zhang, C., Zhuang, H., Kaplan, L., Han, J.: Spherical text embedding. Advances in Neural Information Processing Systems (2019)
  36. Michel, P., Ravichander, A., Rijhwani, S.: Does the geometry of word embeddings help document classification? A case study on persistent homology-based representations. In: Proceedings of the 2nd Workshop on Representation Learning for NLP (2017)
  37. Mikolov, T., Chen, K., Corrado, G., Dean, J.: Efficient estimation of word representations in vector space. In: International Conference on Learning Representations (2013)
  38. Mikolov, T., Karafiát, M., Burget, L., Černocký, J., Khudanpur, S.: Recurrent neural network based language model. In: Annual Conference of the International Speech Communication Association (2010)
  39. Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (2013)
  40. Mikolov, T., Yih, W.T., Zweig, G.: Linguistic regularities in continuous space word representations. NAACL-HLT (2013)
  41. Mu, J., Bhat, S., Viswanath, P.: All-but-the-top: simple and effective postprocessing for word representations. ICLR (2018)
  42. Nagaoka, H., Amari, S.I.: Differential geometry of smooth families of probability distributions. Tech. rep., Technical Report METR 82-7, Univ. Tokyo (1982)
  43. Nickel, M., Kiela, D.: Poincaré embeddings for learning hierarchical representations. Advances in Neural Information Processing Systems (2017)
  44. Pennington, J., Socher, R., Manning, C.: GloVe: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (2014)
  45. Radford, A., Narasimhan, K., Salimans, T., Sutskever, I.: Improving Language Understanding by Generative Pre-Training. Computer Science (2018)
  46. Raunak, V.: Simple and Effective Dimensionality Reduction for Word Embeddings. LLD Workshop NIPS (2017)
  47. Rudolph, M., Ruiz, F., Mandt, S., Blei, D.: Exponential family embeddings. Advances in Neural Information Processing Systems (2016)
  48. Rumelhart, D.E., Hinton, G.E., Williams, R.J.: Learning representations by back-propagating errors. Nature 323(6088), 533–536 (1986)
  49. Sugawara, K., Kobayashi, H., Iwasaki, M.: On approximately searching for similar word embeddings. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (2016)
  50. Tifrea, A., Bécigneul, G., Ganea, O.E.: Poincaré GloVe: hyperbolic word embeddings. In: International Conference on Learning Representations (2019)
  51. Volpi, R., Malagò, L.: Evaluating natural alpha embeddings on intrinsic and extrinsic tasks. In: Proceedings of the 5th Workshop on Representation Learning for NLP (2020)
  52. Volpi, R., Thakur, U., Malagò, L.: Changing the geometry of representations: \(\alpha\)-embeddings for NLP tasks (submitted) (2020)
  53. Wada, J.: A divisor apportionment method based on the Kolm–Atkinson social welfare function and generalized entropy. Math. Soc. Sci. 63(3), 243–247 (2012)
  54. Wikiextractor. Accessed 2017-10
  55. Wu, L., Fisch, A., Chopra, S., Adams, K., Bordes, A., Weston, J.: StarSpace: embed all the things! arXiv:1709.03856 (2018)
  56. Yang, Z., Dai, Z., Yang, Y., Carbonell, J., Salakhutdinov, R., Le, Q.V.: XLNet: generalized autoregressive pretraining for language understanding. In: Conference on Neural Information Processing Systems (2019)
  57. Zhao, X., Louca, R., Hu, D., Hong, L.: Learning item-interaction embeddings for user recommendations. arXiv:1812.04407 (2018)


The authors are supported by the DeepRiemann project, co-funded by the European Regional Development Fund and the Romanian Government through the Competitiveness Operational Programme 2014–2020, Action 1.1.4, project ID P_37_714, Contract No. 136/27.09.2016.

Author information



Corresponding author

Correspondence to Riccardo Volpi.



Appendix A: GloVe training

During the training of GloVe, we monitor performance in terms of accuracy on the word analogy task and compare it with the literature; see Table 5.

Table 5 Accuracy on the word analogy tasks of [37, 39, 44] for different embedding sizes and at different iterations during training, compared with the literature [44]
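To make the monitored quantity concrete, here is a minimal sketch of how word-analogy accuracy is commonly computed, assuming the usual vector-offset rule with cosine similarity (often called 3CosAdd): for a question a : b :: c : ?, the predicted word is the one closest to b − a + c, excluding the three question words. The text here does not confirm that the article's evaluation matches this in every detail. The function `analogy_accuracy`, the toy vocabulary, and the two-dimensional vectors are all hypothetical.

```python
import numpy as np


def analogy_accuracy(vectors, vocab, questions):
    """Fraction of analogies a:b :: c:d answered correctly by the offset rule.

    Hypothetical helper: `vectors` is an (n_words, dim) array aligned with
    `vocab`, and `questions` is a list of (a, b, c, d) word tuples.
    """
    index = {w: i for i, w in enumerate(vocab)}
    # Unit-normalise rows so that dot products rank words by cosine similarity.
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    correct = 0
    for a, b, c, d in questions:
        query = unit[index[b]] - unit[index[a]] + unit[index[c]]
        scores = unit @ query
        # The three question words are not allowed as answers.
        scores[[index[a], index[b], index[c]]] = -np.inf
        correct += int(vocab[int(np.argmax(scores))] == d)
    return correct / len(questions)


# Toy vectors, hand-made so that the single analogy below holds.
vocab = ["man", "woman", "king", "queen", "apple"]
vectors = np.array([[1.0, 0.0], [1.0, 1.0], [3.0, 0.0], [3.0, 1.0], [-1.0, 2.0]])
questions = [("man", "woman", "king", "queen")]
print(analogy_accuracy(vectors, vocab, questions))
```

Calling such a function at regular intervals during training, on the analogy questions of a benchmark, would give one accuracy value per embedding size and iteration, which is the kind of entry a table like Table 5 reports.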


About this article


Cite this article

Volpi, R., Malagò, L. Natural alpha embeddings. Info. Geo. (2021).
