Skip to main content
Log in

A Semi Brute-Force Search Approach for (Balanced) Clustering

  • Published:
Algorithmica Aims and scope Submit manuscript

Abstract

Clustering is one of the most long-standing fundamental problems in the fields of computational geometry and algorithm design. In this paper, we focus on the variance-based clustering problems, included in which is the widely known k-means clustering. As the main contribution, a so-called semi brute-force search approach is proposed and analyzed from both theoretical and experimental aspects. The proposed approach samples a small percentage from the input dataset and search in a brute-force way for a k sized seed whose resulting Voronoi Diagram gives a good clustering of the original dataset. With high probability, the clustering is provably good to estimate the optimum under certain assumptions. Extensive experiments on both synthetic datasets and real-world datasets show that to obtain competitive results compared with k-means method (Llyod in IEEE Trans Inf Theory 28(2):129–137, 1982) and k-means++ method (Arthur and Vassilvitskii, (in: 18th ACM-SIAM symposium on discrete algorithms (SODA), 2007)), we only need a subset of 7% size comparing with the input dataset. If we are allowed to sample 15% from the dataset, our algorithm outperforms both the k-means method and k-means++ method in at least 80% of the clustering tasks. Also, an extended algorithm based on the same idea guarantees a balanced k-clustering result.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Algorithm 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Algorithm 2

Similar content being viewed by others

Notes

  1. Note that \(V_1\), \(V_2\) are not necessary to be independent for Fact 1 to hold, but it does not hurt to say so.

References

  1. Lloyd, S.: Least squares quantization in pcm. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)

    Article  MathSciNet  Google Scholar 

  2. Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Philip, S.Y., et al.: Top 10 algorithms in data mining. Knowl. Inf. Syst. 14(1), 1–37 (2008)

    Article  Google Scholar 

  3. Arthur, D., Vassilvitskii, S.: \(k\)-means++: The advantages of careful seeding. In: Proceedings of the 18th ACM-SIAM Symposium on Discrete Algorithms (SODA), pp. 1027–1035 (2007)

  4. Bahmani, B., Moseley, B., Vattani, A., Kumar, R., Vassilvitskii, S.: Scalable \(k\)-means++. In: Proceedings of the 38th Very Large Data Bases (VLDB), pp. 622–633 (2012)

  5. Bachem, O., Lucic, M., Hassani, S.H., Krause, A.: Approximate \(k\)-means++ in sublinear time. In: Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), pp. 1459–1467 (2016)

  6. Lattanzi, S., Sohler, C.: A better \(k\)-means++ algorithm via local search. In: Proceedings of the 36th International Conference on Machine Learning (ICML), pp. 3662–3671 (2019)

  7. Choo, D., Grunau, C., Portmann, J., Rozhoň, V.: \(k\)-means++: few more steps yield constant approximation. In: Proceedings of the 37th International Conference on Machine Learning (ICML), pp. 1909–1917 (2020)

  8. Awasthi, P., Charikar, M., Krishnaswamy, R., Sinop, A.K.: The hardness of approximation of euclidean \(k\)-means. In: Proceedings of the 31st ACM Symposium on International Symposium on Computational Geometry (SoCG), pp. 754–767 (2015)

  9. Liu, H., Han, J., Nie, F., Li, X.: Balanced clustering with least square regression. In: Proceedings of the 31st AAAI Conference on Artificial Intelligence (AAAI), pp. 2231–2237 (2017)

  10. Li, Z., Nie, F., Chang, X., Ma, Z., Yang, Y.: Balanced clustering via exclusive lasso: A pragmatic approach. In: Proceedings of the 32nd AAAI Conference on Artificial Intelligence (AAAI), pp. 3596–3603 (2018)

  11. Lin, W., He, Z., Xiao, M.: Balanced clustering: A uniform model and fast algorithm. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2987–2993 (2019)

  12. Xu, Y., Möhring, R.H., Xu, D., Zhang, Y., Zou, Y.: A constant fpt approximation algorithm for hard-capacitated \(k\)-means. Optim. Eng. 21(3), 709–722 (2020)

    Article  MathSciNet  Google Scholar 

  13. Cohen-Addad, V., Li, J.: On the fixed-parameter tractability of capacitated clustering. In: Proceedings of the 46th International Colloquium on Automata, Languages, and Programming (ICALP), pp. 1–14 (2019)

  14. Bandyapadhyay, S., Lochet, W., Saurabh, S.: FPT constant-approximations for capacitated clustering to minimize the sum of cluster radii. To appear in the proceedings of the 39th International Symposium on Computational Geometry (SoCG), (2023)

  15. Ostrovsky, R., Rabani, Y., Schulman, L.J., Swamy, C.: The effectiveness of Lloyd-type methods for the \(k\)-means problem. J. ACM 59(6), 1–22 (2013)

    Article  MathSciNet  Google Scholar 

  16. Braverman, V., Cohen-Addad, V., Jiang, H.C.S., Krauthgamer, R., Schwiegelshohn, C., Toftrup, M.B., Wu, X.: The power of uniform sampling for coresets. In: Proceedings of the 63rd Annual Symposium on Foundations of Computer Science (FOCS), pp. 462–473 (2022)

  17. Jain, K., Vazirani, V.V.: Approximation algorithms for metric facility location and \(k\)-median problems using the primal-dual schema and Lagrangian relaxation. J. ACM 48(2), 274–296 (2001)

    Article  MathSciNet  Google Scholar 

  18. Inaba, M., Katoh, N., Imai, H.: Applications of weighted voronoi diagrams and randomization to variance-based \(k\)-clustering. In: Proceedings of the 10th ACM Symposium International Symposium on Computational Geometry (SoCG), pp. 332–339 (1994)

  19. Ahuja, R.K., Magnanti, T.L., Orlin, J.B.: Network Flows - Theory, Algorithms and Applications. Prentice Hall, USA (1993)

    Google Scholar 

Download references

Funding

Yicheng Xu is supported by Fundamental Research Project of Shenzhen City (No. JCYJ20210324102012033) and Shenzhen Science and Technology Program (No. ZDSYS20220422103800001 and CJGJZD20210408092806017). Vincent Chau is supported by NSFC (No. 62202100), and by the Fundamental Research Funds for the Central Universities (No. 2242022R10024). Chenchen Wu is supported by National Natural Science Foundation of China (No.11971349). Yong Zhang is supported by National Key R &D Program of China (No. 2022YFE0196100) and National Natural Science Foundation of China (No. 12071460). Vassilis Zissimopoulos is supported by the CAS President’s International Fellowship Initiative (No. 2019VTA0005). Yifei Zou is supported by Shandong Science Fund for Excellent Young Scholars (No. 2023HWYQ-007).

Author information

Authors and Affiliations

Authors

Contributions

YX and VC wrote the main manuscript text. All other authors participated on the problem discussion and helped to review the manuscript.

Corresponding author

Correspondence to Vincent Chau.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Xu, Y., Chau, V., Wu, C. et al. A Semi Brute-Force Search Approach for (Balanced) Clustering. Algorithmica 86, 130–146 (2024). https://doi.org/10.1007/s00453-023-01158-4

Download citation

  • Received:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s00453-023-01158-4

Keywords

Navigation