CPI-model-based analysis of sparse k-means clustering algorithms

Abstract

Standard k-means clustering algorithms are widely used to partition a given data set into k disjoint subsets. When a data set is large-scale, high-dimensional, and sparse, such as text data with a bag-of-words representation, it is not obvious which representations should be adopted for the data and mean sets. Moreover, algorithms that differ only in their representations require different elapsed times to converge, even though they start from an identical initial state and execute an identical number of similarity calculations, the conventional indicator of speed performance. We design sparse k-means clustering algorithms that use distinct representations, each of which is a pair of a data structure and an expression. Our purpose is to clarify the cause of their performance differences and to identify the best algorithm when they are executed on a modern computer system. We analyze the algorithms with a simple yet practical clock-cycles-per-instruction (CPI) model that is expressed as a linear combination of four performance degradation factors in a modern computer system: the completed instructions, the level-1 and last-level cache misses, and the branch mispredictions. We also optimize the model parameters with a newly introduced procedure and demonstrate that the CPIs calculated with our model agree well with experimental results when the algorithms are applied to large-scale and high-dimensional real document data sets. Furthermore, our model clarifies that the best algorithm among them suppresses the performance degradation factors: the numbers of cache misses, branch mispredictions, and completed instructions.
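
As a rough illustration of what a linear CPI model of this kind might look like, the following Python sketch estimates clock cycles as a weighted sum of four counter values and divides by the number of completed instructions. The exact functional form, the names `Counters`, `Weights`, `estimated_cycles`, and `estimated_cpi`, and every numeric value are assumptions made for illustration; they are not the authors' model parameters, fitting procedure, or measured data.

```python
from dataclasses import dataclass


@dataclass
class Counters:
    """Hardware event counts for one run (hypothetical container)."""
    instructions: int            # completed instructions
    l1_misses: int               # level-1 cache misses
    llc_misses: int              # last-level cache misses
    branch_mispredictions: int   # branch mispredictions


@dataclass
class Weights:
    """Assumed cost in clock cycles per event (illustrative values only)."""
    per_instruction: float
    per_l1_miss: float
    per_llc_miss: float
    per_branch_miss: float


def estimated_cycles(c: Counters, w: Weights) -> float:
    """Linear combination of the four degradation factors."""
    return (w.per_instruction * c.instructions
            + w.per_l1_miss * c.l1_misses
            + w.per_llc_miss * c.llc_misses
            + w.per_branch_miss * c.branch_mispredictions)


def estimated_cpi(c: Counters, w: Weights) -> float:
    """Estimated clock cycles per completed instruction."""
    if c.instructions <= 0:
        raise ValueError("instructions must be positive")
    return estimated_cycles(c, w) / c.instructions


# Made-up numbers, chosen only so the arithmetic is easy to follow.
example_counters = Counters(instructions=1_000_000, l1_misses=20_000,
                            llc_misses=1_000, branch_mispredictions=5_000)
example_weights = Weights(per_instruction=0.5, per_l1_miss=10.0,
                          per_llc_miss=200.0, per_branch_miss=20.0)
example_cpi = estimated_cpi(example_counters, example_weights)
```

In this form, an algorithm can lower its estimated CPI either by triggering fewer costly events per instruction or by having smaller per-event penalties; the total elapsed time additionally depends on the instruction count itself.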

Notes

  1. If mean feature vectors are not normalized by their \(L_2\) norms, i.e., they are not points on the unit hypersphere, a solution by the spherical k-means algorithm does not always coincide with that by the standard k-means algorithm.

  2. Even if the algorithms start at an identical initial state, they might reach different solutions when the similarities between an object and multiple centroids are identical. To avoid this problem, our algorithms adopt a tie-breaking rule in which an object is assigned to the cluster whose centroid has the smallest ID.

  3. In our preliminary experiments, a mean-update step using object feature vectors with the inverted-file data structure required much more CPU time than one using the standard data structure.

  4. IVFD differs from IVF in the position in the source code at which the final assignment of each object to a cluster is executed. IVFD executes the assignment outside the triple loop; IVF does it inside.

  5. We assumed that \(w_0\) depended on the algorithm, written as \(w_{0[algo]}\), so that \(w_0\) contained the number of clock cycles caused by delay factors other than the foregoing DFs.

  6. Regarding memory consumption, the algorithms (except IVF) required a large memory size proportional to k due to the mean full expression. The required memory size for NYT reached 79.2 GB at \(k=20000\), while IVF used only 3.5 GB.

  7. For both algorithms, the instructions executed in the triple loop were identical in the corresponding assembly code.

  8. Actually, since the term order sorted by the number of centroids does not always match that sorted by the number of objects, both the numbers of centroids and objects do not decrease monotonically (Fig. 10b).

  9. Analysis of the IVFD and IVF assembly code showed that both algorithms used an identical number of instructions for each multiplication and addition operation.
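
The ideas in Notes 1 and 2 can be sketched in a few lines of Python: vectors are normalized to unit \(L_2\) length, similarities are accumulated through an inverted file, and ties are resolved in favor of the smallest centroid ID. This is a minimal sketch under stated assumptions, not the authors' IVF or IVFD implementation: the dictionary-based sparse representation, the choice to build the inverted file over the mean set, and the function names `l2_normalize`, `build_mean_inverted_file`, and `assign` are all hypothetical.

```python
import math
from collections import defaultdict


def l2_normalize(vec):
    """Return a unit-length copy of a sparse vector {term_id: weight}."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    if norm == 0.0:
        return dict(vec)  # leave an all-zero vector unchanged
    return {term: v / norm for term, v in vec.items()}


def build_mean_inverted_file(means):
    """Map each term_id to the list of (centroid_id, weight) that contain it."""
    inverted = defaultdict(list)
    for centroid_id, mean in enumerate(means):
        for term, weight in mean.items():
            inverted[term].append((centroid_id, weight))
    return inverted


def assign(obj, inverted, k):
    """Return the ID of the most similar centroid for one sparse object.

    Similarity is the inner product, accumulated term by term through the
    inverted file. The strict '>' comparison keeps the smallest centroid ID
    when several centroids have the same similarity.
    """
    sims = [0.0] * k
    for term, value in obj.items():
        for centroid_id, weight in inverted.get(term, ()):
            sims[centroid_id] += value * weight
    best = 0
    for centroid_id in range(1, k):
        if sims[centroid_id] > sims[best]:
            best = centroid_id
    return best


# Toy data: centroids 0 and 1 are identical, so they tie for the first object.
toy_means = [l2_normalize({0: 1.0}), l2_normalize({0: 1.0}), l2_normalize({1: 2.0})]
toy_inverted = build_mean_inverted_file(toy_means)
tied_assignment = assign(l2_normalize({0: 3.0}), toy_inverted, k=3)
other_assignment = assign(l2_normalize({1: 1.0}), toy_inverted, k=3)
```

With the inverted file, only centroids that share at least one term with the object are touched in the inner loop, which is the usual motivation for this data structure on sparse data.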

Acknowledgements

This work was partly supported by JSPS KAKENHI Grant Number JP17K00159.

Author information

Corresponding author

Correspondence to Kazuo Aoyama.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

About this article

Cite this article

Aoyama, K., Saito, K. & Ikeda, T. CPI-model-based analysis of sparse k-means clustering algorithms. Int J Data Sci Anal (2021). https://doi.org/10.1007/s41060-021-00270-4

Keywords

  • Clustering
  • Algorithms
  • Performance analysis
  • Data structure
  • Sparse data
  • k-means