Journal of Scientific Computing, Volume 77, Issue 2, pp 1133–1146

Stochastic Backward Euler: An Implicit Gradient Descent Algorithm for k-Means Clustering

  • Penghang Yin
  • Minh Pham
  • Adam Oberman
  • Stanley Osher


Abstract

In this paper, we propose an implicit gradient descent algorithm for the classic k-means problem. The implicit gradient step, or backward Euler step, is solved via stochastic fixed-point iteration, in which we randomly sample a mini-batch gradient in every iteration. It is the average of the fixed-point trajectory that is carried over to the next gradient step. We draw connections between the proposed stochastic backward Euler and the recent entropy stochastic gradient descent (Entropy-SGD) for improving the training of deep neural networks. Numerical experiments on various synthetic and real datasets show that the proposed algorithm provides better clustering results than the standard k-means algorithm, in the sense that it attains a lower objective value (the cluster energy) and is much more robust to initialization.
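The update described in the abstract — an implicit (backward Euler) gradient step solved by a stochastic fixed-point iteration with mini-batch gradients, where the average of the fixed-point trajectory is carried to the next step — can be sketched as follows. This is a minimal illustration, not the authors' reference implementation; the function names, step size, and batch/iteration parameters are assumptions chosen for the sketch.

```python
import numpy as np

def kmeans_grad(X, centers):
    # Gradient of the k-means energy f(x) = (1/2N) sum_i min_j ||p_i - x_j||^2
    # with respect to the centers: each center is pulled toward the mean of
    # the points currently assigned to it, weighted by its cluster fraction.
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    assign = d.argmin(axis=1)
    g = np.zeros_like(centers)
    for j in range(len(centers)):
        mask = assign == j
        if mask.any():
            g[j] = mask.mean() * (centers[j] - X[mask].mean(axis=0))
    return g

def stochastic_backward_euler(X, centers, step=1.0, outer=50, inner=30,
                              batch=64, seed=None):
    # Implicit step x_{k+1} = x_k - step * grad f(x_{k+1}), solved by the
    # fixed-point iteration y <- x_k - step * grad f(y), where grad f is
    # evaluated on a random mini-batch at every inner iteration. The average
    # of the fixed-point trajectory becomes the next iterate x_{k+1}.
    rng = np.random.default_rng(seed)
    x = centers.astype(float).copy()
    for _ in range(outer):
        y = x.copy()
        y_sum = np.zeros_like(x)
        for _ in range(inner):
            idx = rng.choice(len(X), size=min(batch, len(X)), replace=False)
            y = x - step * kmeans_grad(X[idx], y)
            y_sum += y
        x = y_sum / inner
    return x
```

Because the inner iterate is averaged rather than taken at its last value, the mini-batch noise in the fixed-point trajectory is smoothed out before the next implicit step, which is what gives the method its robustness to initialization.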


Keywords: k-means · Backward Euler · Implicit gradient descent · Fixed-point iteration · Mini-batch gradient



Acknowledgments

This work was partially supported by AFOSR Grant FA9550-15-1-0073 and ONR Grant N00014-16-1-2157. We would like to thank Dr. Bao Wang for helpful discussions. We also thank the anonymous reviewers for their constructive comments.



Copyright information

© Springer Science+Business Media, LLC, part of Springer Nature 2018

Authors and Affiliations

  • Penghang Yin (1)
  • Minh Pham (1)
  • Adam Oberman (2)
  • Stanley Osher (1)
  1. Department of Mathematics, University of California, Los Angeles, USA
  2. Department of Mathematics and Statistics, McGill University, Montreal, Canada
