Approximate Partitional Clustering Through Systematic Sampling in Big Data Mining

  • Conference paper
Artificial Intelligence and Sustainable Computing

Part of the book series: Algorithms for Intelligent Systems (AIS)

Abstract

Big data mining is the process of extracting hidden knowledge from high-volume, high-variety, and high-velocity data environments to support decision-making systems. Classical data mining algorithms face challenges with memory utilization, speed-up, scale-up, computing cost, efficiency, and effectiveness when applied to big data. Data volume is the prime attribute of big data mining and is responsible for variety- and velocity-related challenges. An intelligent big data mining process combines classical data mining and statistics under single- and multiple-machine execution environments. Sampling is a data reduction technique that addresses volume-related challenges: it improves the speed, scalability, flexibility, accuracy, quality, efficiency, and memory utilization of a data mining algorithm regardless of that algorithm's characteristics. This paper proposes a systematic sampling-based big data mining model built on K-means clustering, called SYK-means (systematic sampling-based K-means). The experimental results of the SYK-means algorithm are compared with those of the RSK-means (random sampling-based K-means) and classical K-means algorithms with respect to sample size selection and entire data selection. In the experimental evaluation, SYK-means achieved better effectiveness and efficiency as measured by the R-squared, root-mean-square standard deviation, Davies-Bouldin, Calinski-Harabasz, Silhouette coefficient, CPU time, and convergence validation indices.
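The general idea of clustering a systematic sample instead of the full data set can be sketched as follows. This is a minimal illustration only, not the authors' SYK-means implementation: the function names, the sampling interval `N // sample_size`, the random-start rule, and the plain Lloyd-style K-means loop are all assumptions made for the example.

```python
import random


def systematic_sample(data, sample_size, seed=None):
    """Take every step-th record after a random start (assumed interval: N // sample_size)."""
    n = len(data)
    if not 0 < sample_size <= n:
        raise ValueError("sample_size must be between 1 and len(data)")
    step = n // sample_size
    start = random.Random(seed).randrange(step)
    return [data[start + i * step] for i in range(sample_size)]


def squared_distance(p, q):
    """Squared Euclidean distance between two equal-length points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, max_iter=100, seed=None):
    """Plain Lloyd-style K-means; initial centroids are drawn at random from the points."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its members; keep it if the cluster is empty.
        new_centroids = [
            [sum(col) / len(members) for col in zip(*members)] if members else centroids[j]
            for j, members in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids


def sample_then_cluster(data, k, sample_size, seed=None):
    """Cluster a systematic sample, then label every record by its nearest sample centroid."""
    sample = systematic_sample(data, sample_size, seed)
    centroids = kmeans(sample, k, seed=seed)
    labels = [nearest(p, centroids) for p in data]
    return centroids, labels


if __name__ == "__main__":
    # Two well-separated synthetic groups of 100 one-dimensional-like points each.
    data = [(i * 0.01, 0.0) for i in range(100)] + [(10 + i * 0.01, 0.0) for i in range(100)]
    centroids, labels = sample_then_cluster(data, k=2, sample_size=20, seed=1)
    print(len(centroids), len(labels))
```

Only the sample is passed to the iterative K-means loop, so the per-iteration cost depends on the sample size rather than on the full data volume; the full data set is touched once, in the final assignment pass. One caveat with systematic sampling in general is that a periodic ordering in the data that lines up with the sampling interval can bias the sample.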

© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

Cite this paper

Pandey, K.K., Shukla, D. (2022). Approximate Partitional Clustering Through Systematic Sampling in Big Data Mining. In: Dubey, H.M., Pandit, M., Srivastava, L., Panigrahi, B.K. (eds) Artificial Intelligence and Sustainable Computing. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-16-1220-6_19
