Revisiting Skip-Gram Negative Sampling Model with Rectification

  • Cun (Matthew) Mu
  • Guang Yang
  • Yan (John) Zheng
Conference paper
Part of the Advances in Intelligent Systems and Computing book series (AISC, volume 997)


We revisit skip-gram negative sampling (SGNS), one of the most popular neural-network-based approaches to learning distributed word representations. We first point out an ambiguity issue undermining the SGNS model: the word vectors can be entirely distorted without changing the objective value. To resolve this issue, we investigate the intrinsic structure that a good word embedding model should deliver in its solution. Motivated by this, we rectify the SGNS model with quadratic regularization, and show that this simple modification suffices to structure the solution in the desired manner. A theoretical justification is presented, which provides novel insights into quadratic regularization. Preliminary experiments on Google's analogical reasoning task are also conducted to support the modified SGNS model.


Keywords: Word embedding · SGNS model · Quadratic regularization
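The ambiguity described in the abstract, and why a quadratic penalty resolves it, can be illustrated with a minimal NumPy sketch. This is an illustration under the standard matrix-factorization view of SGNS (not code from the paper, and with toy dimensions): the objective depends on the word matrix W and context matrix C only through the product WCᵀ, so any invertible S maps (W, C) to (WS, CS⁻ᵀ) without changing the objective, while a Frobenius-norm penalty does change and thereby rules out such distorted solutions.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 8, 3                    # toy vocabulary size and embedding dimension
W = rng.normal(size=(n, k))    # word vectors
C = rng.normal(size=(n, k))    # context vectors

# The SGNS objective depends on (W, C) only through the product W @ C.T,
# so any invertible S distorts the factors without changing the objective:
S = np.diag([10.0, 0.1, 5.0])
W2, C2 = W @ S, C @ np.linalg.inv(S).T
assert np.allclose(W2 @ C2.T, W @ C.T)   # same objective value

# A quadratic (Frobenius-norm) penalty is NOT invariant under S, so adding
# it to the objective discriminates between these otherwise-equivalent pairs:
def frob_penalty(A, B):
    return np.sum(A ** 2) + np.sum(B ** 2)

assert frob_penalty(W2, C2) > frob_penalty(W, C)  # distortion is penalized
```

Among all factor pairs realizing the same product, the quadratic penalty favors the balanced one, which is the kind of structured solution the rectified model is designed to deliver.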



Copyright information

© Springer Nature Switzerland AG 2019

