An improved column generation algorithm for minimum sum-of-squares clustering

  • Full Length Paper
  • Series A
  • Mathematical Programming

Abstract

Given a set of entities associated with points in Euclidean space, minimum sum-of-squares clustering (MSSC) consists in partitioning this set into clusters such that the sum of squared distances from each point to the centroid of its cluster is minimized. A column generation algorithm for MSSC was given by du Merle et al. (SIAM Journal on Scientific Computing 21:1485–1505). The bottleneck of that algorithm is the solution of the auxiliary problem of finding a column with negative reduced cost. We propose a new way to solve this auxiliary problem based on geometric arguments. This greatly improves the efficiency of the whole algorithm and leads to the exact solution of instances with over 2,300 entities, i.e., more than 10 times as many as previously solved.
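
To make the objective concrete, the following minimal sketch evaluates the MSSC cost of a given partition: for each cluster it computes the centroid and sums the squared Euclidean distances of the cluster's points to it. It only illustrates the definition above, not the column generation algorithm; the function names (`centroid`, `cluster_cost`, `mssc_objective`) and the toy data are assumptions introduced for this example.

```python
from typing import Dict, Hashable, List, Sequence

Point = Sequence[float]


def centroid(points: List[Point]) -> List[float]:
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    dim = len(points[0])
    return [sum(p[d] for p in points) / n for d in range(dim)]


def cluster_cost(points: List[Point]) -> float:
    """Sum of squared Euclidean distances from each point to the centroid."""
    c = centroid(points)
    return sum(
        sum((p[d] - c[d]) ** 2 for d in range(len(c)))
        for p in points
    )


def mssc_objective(points: List[Point], labels: Sequence[Hashable]) -> float:
    """MSSC objective of the partition in which labels[i] is the cluster of points[i]."""
    clusters: Dict[Hashable, List[Point]] = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return sum(cluster_cost(members) for members in clusters.values())


# Toy data (hypothetical): two well-separated pairs of points in the plane.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = mssc_objective(points, [0, 0, 1, 1])  # pairs grouped by x-coordinate
bad = mssc_objective(points, [0, 1, 0, 1])   # pairs grouped by y-coordinate
print(good, bad)  # 4.0 100.0
```

In this toy instance the partition that groups nearby points has cost 4.0, while the alternative has cost 100.0; MSSC asks for the partition of minimum cost over all partitions into the prescribed number of clusters.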

References

  1. Aloise D., Deshpande A., Hansen P., Popat P.: NP-hardness of Euclidean sum-of-squares clustering. Mach. Learn. 75, 245–249 (2009)

  2. Aloise D., Hansen P.: A branch-and-cut SDP-based algorithm for minimum sum-of-squares clustering. Pesquisa Operacional 29, 503–516 (2009)

  3. Aloise, D., Hansen, P.: Evaluating a branch-and-bound RLT-based algorithm for minimum sum-of-squares clustering. To appear in J. Glob. Optim. (2010)

  4. An L.T., Belghiti M.T., Tao P.D.: A new efficient algorithm based on DC programming and DCA for clustering. J. Glob. Optim. 37, 593–608 (2007)

  5. Asuncion, A., Newman, D.J.: UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html (2007)

  6. Bagirov A.M.: Modified global k-means algorithm for minimum sum-of-squares clustering problems. Pattern Recognit. 41, 3192–3199 (2008)

  7. Bagirov A.M., Yearwood J.: Hierarchical grouping to optimize an objective function. Eur. J. Oper. Res. 170, 578–596 (2006)

  8. Bonami, P., Lee, J.: BONMIN user’s manual. Technical report, IBM Corporation, June (2007)

  9. Brusco M.J.: A repetitive branch-and-bound procedure for minimum within-cluster sum of squares partitioning. Psychometrika 71, 347–363 (2006)

  10. Brusco M.J., Steinley D.: A comparison of heuristic procedures for minimum within-cluster sums of squares partitioning. Psychometrika 72, 583–600 (2007)

  11. Christou, I.T.: Exact method-based coordination of cluster ensembles. To appear in IEEE Trans. Pattern Anal. Mach. Intell. (2010)

  12. Diehr G.: Evaluation of a branch and bound algorithm for clustering. SIAM J. Sci. Stat. Comput. 6, 268–284 (1985)

  13. Dinkelbach W.: On nonlinear fractional programming. Manag. Sci. 13, 492–498 (1967)

  14. Drezner Z., Mehrez A., Wesolowsky G.O.: The facility location problem with limited distances. Transp. Sci. 25, 183–187 (1991)

  15. du Merle O., Hansen P., Jaumard B., Mladenović N.: An interior point algorithm for minimum sum-of-squares clustering. SIAM J. Sci. Comput. 21, 1485–1505 (2000)

  16. du Merle O., Villeneuve D., Desrosiers J., Hansen P.: Stabilized column generation. Discrete Math. 194, 229–237 (1999)

  17. Edwards A.W., Cavalli-Sforza L.L.: A method for cluster analysis. Biometrics 21, 362–375 (1965)

  18. Elhedhli S., Goffin J.-L.: The integration of an interior-point cutting plane method within a branch-and-price algorithm. Math. Program. 100, 267–294 (2004)

  19. Fisher R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. VII, 179–188 (1936)

  20. Forgy E.W.: Cluster analysis of multivariate data: efficiency vs. interpretability of classifications. Biometrics 21, 768 (1965)

  21. Goffin J.-L., Haurie A., Vial J.-P.: Decomposition and nondifferentiable optimization with the projective algorithm. Manag. Sci. 38, 284–302 (1992)

  22. Grötschel, M., Holland, O.: Solution of large-scale symmetric traveling salesman problems. Math. Program. 51, 141–202 (1991). Data sets available at http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/tsp

  23. Hansen P., Jaumard B.: Cluster analysis and mathematical programming. Math. Program. 79, 191–215 (1997)

  24. Hansen, P., Jaumard, B., Meyer, C.: A simple enumerative algorithm for unconstrained 0–1 quadratic programming. Cahier du GERAD G-2000-59, GERAD, November (2000)

  25. Hansen P., Mladenović N.: J-means: a new local search heuristic for minimum sum of squares clustering. Pattern Recognit. 34, 405–413 (2001)

  26. Hansen P., Mladenović N.: Variable neighborhood search: principles and applications. Eur. J. Oper. Res. 130, 449–467 (2001)

  27. Hansen P., Mladenović N., Pérez J.A.M.: Variable neighborhood search: methods and applications. 4OR 6, 319–360 (2008)

  28. Hansen P., Ngai E., Cheung B.K., Mladenović N.: Analysis of global k-means, an incremental heuristic for minimum sum-of-squares clustering. J. Classif. 22, 287–310 (2005)

  29. Hartigan J.A.: Clustering Algorithms. Wiley, New York (1975)

  30. Heinz, G., Peterson, L.J., Johnson, R.W., Kerk, C.J.: Exploring relationships in body dimensions. J. Stat. Educ. 11 (2003). Data set available at http://www.amstat.org/publications/jse/v11n2/datasets.heinz.html

  31. Inaba, M., Katoh, N., Imai, H.: Applications of weighted Voronoi diagrams and randomization to variance-based k-clustering. In: Proceedings of the 10th ACM Symposium on Computational Geometry, pp. 332–339 (1994)

  32. Jain A.K., Murty M.N., Flynn P.J.: Data clustering: a review. ACM Comput. Surv. 31, 264–323 (1999)

  33. Jensen R.E.: A dynamic programming algorithm for cluster analysis. Oper. Res. 17, 1034–1057 (1969)

  34. Kelley J.E.: The cutting plane method for solving convex programs. J. SIAM 8, 703–712 (1960)

  35. Kogan J.: Introduction to Clustering Large and High-Dimensional Data. Cambridge University Press, New York (2006)

  36. Koontz W.L.G., Narendra P.M., Fukunaga K.: A branch and bound clustering algorithm. IEEE Trans. Comput. C-24, 908–915 (1975)

  37. Laszlo M., Mukherjee S.: A genetic algorithm using hyper-quadtrees for low-dimensional k-means clustering. IEEE Trans. Pattern Anal. Mach. Intell. 28, 533–543 (2006)

  38. Laszlo M., Mukherjee S.: A genetic algorithm that exchanges neighboring centers for k-means clustering. Pattern Recognit. Lett. 36, 451–461 (2007)

  39. Leyffer, S.: User manual for MINLP_BB. Technical report, University of Dundee, UK, March (1999)

  40. Liberti L.: Reformulations in mathematical programming: definitions and systematics. RAIRO-RO 43(1), 55–86 (2009)

  41. Likas A., Vlassis N., Verbeek J.J.: The global k-means clustering algorithm. Pattern Recognit. 36, 451–461 (2003)

  42. MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability, vol. 2, pp. 281–297. Berkeley, CA (1967)

  43. Mahajan M., Nimbhorkar P., Varadarajan K.: The planar k-means problem is NP-hard. Lect. Notes Comput. Sci. 5431, 274–285 (2009)

  44. Merz P.: An iterated local search for minimum sum-of-squares clustering. Lect. Notes Comput. Sci. 2810, 286–296 (2003)

  45. Mirkin B.: Mathematical Classification and Clustering. Kluwer, Dordrecht, The Netherlands (1996)

  46. Mirkin B.: Clustering for Data Mining: A Data Recovery Approach. Chapman and Hall/CRC, Boca Raton (2005)

  47. Mladenović N., Hansen P.: Variable neighborhood search. Comput. Oper. Res. 24, 1097–1100 (1997)

  48. Pacheco J.A.: A scatter search approach for the minimum sum-of-squares clustering problem. Comput. Oper. Res. 32, 1325–1335 (2005)

  49. Pacheco J.A., Valencia O.: Design of hybrids for the minimum sum-of-squares clustering problem. Comput. Stat. Data Anal. 43, 235–248 (2003)

  50. Padberg, M., Rinaldi, G.: A branch-and-cut algorithm for the resolution of large-scale symmetric traveling salesman problems. SIAM Rev. 33, 60–100 (1991). Data set available at http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95/tsp

  51. Pal, S.K., Majumder, D.D.: Fuzzy sets and decision making approaches in vowel and speaker recognition. IEEE Trans. Syst. Man. Cybern. 7, 625–629 (1977). Data set available at http://www.isical.ac.in/sushmita/patterns/vowel.dat

  52. Peng J., Xia Y.: A new theoretical framework for k-means-type clustering. Stud. Fuzziness Soft Comput. 180, 79–96 (2005)

  53. Reinelt, G.: TSPLIB: a traveling salesman library. ORSA J. Comput. 3, 376–384 (1991). http://www.iwr.uni-heidelberg.de/groups/comopt/software/TSPLIB95

  54. Ruspini E.H.: Numerical method for fuzzy clustering. Inf. Sci. 2, 319–350 (1970)

  55. Ryan D.M., Foster B.A.: An integer programming approach to scheduling. In: Wren, A. (ed.) Computer Scheduling of Public Transport: Urban Passenger Vehicle and Crew Scheduling, pp. 269–280. North-Holland, Amsterdam (1981)

  56. Sherali H.D., Adams W.P.: Reformulation-linearization techniques for discrete optimization problems. In: Du, D.Z., Pardalos, P.M. (eds) Handbook of Combinatorial Optimization 1, pp. 479–532. Kluwer, Dordrecht (1999)

  57. Sherali H.D., Desai J.: A global optimization RLT-based approach for solving the hard clustering problem. J. Glob. Optim. 32, 281–306 (2005)

  58. Späth H.: Cluster Analysis Algorithms for Data Reduction and Classification of Objects. Wiley, New York (1980)

  59. Steinhaus H.: Sur la division des corps matériels en parties. Bulletin De L’Académie Polonaise Des Sciences Classe III. IV, 801–804 (1956)

  60. Steinley D.: K-means clustering: a half-century synthesis. Br. J. Math. Stat. Psychol. 59, 1–34 (2006)

  61. Taillard É.D.: Heuristic methods for large centroid clustering problems. J. Heuristics 9, 51–73 (2003)

  62. Teboulle M.: A unified continuous optimization framework for center-based clustering methods. J. Mach. Learn. Res. 8, 65–102 (2007)

  63. Tuy H.: Concave programming under linear constraints. Soviet Math. 5, 1437–1440 (1964)

  64. van Os B.J., Meulman J.J.: Improving dynamic programming strategies for partitioning. J. Classif. 21, 207–230 (2004)

  65. Vavasis S.A.: Nonlinear Optimization: Complexity Issues. Oxford University Press, Oxford (1991)

  66. Xavier, A.E., Negreiros, M.J., Maculan, N., Michelon, P.: The use of the hyperbolic smoothing clustering method for planning the tasks of sanitary agents in combating dengue. In: Proceedings of IFORS 2005 (2005)

  67. Xia, Y., Peng, J.: A cutting algorithm for the minimum sum-of-squared error clustering. In: Proceedings of the SIAM International Data Mining Conference (2005)

  68. Yeh, I.-C.: Modeling of strength of high performance concrete using artificial neural networks. Cement and Concrete Res. 28, 1797–1808 (1998). Data set available at http://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

Author information

Corresponding author

Correspondence to Daniel Aloise.

About this article

Cite this article

Aloise, D., Hansen, P. & Liberti, L. An improved column generation algorithm for minimum sum-of-squares clustering. Math. Program. 131, 195–220 (2012). https://doi.org/10.1007/s10107-010-0349-7

  • DOI: https://doi.org/10.1007/s10107-010-0349-7
