Approximate Partitional Clustering Through Systematic Sampling in Big Data Mining

Pandey, Kamlesh Kumar; Shukla, Diwakar

doi:10.1007/978-981-16-1220-6_19

Part of the book series: Algorithms for Intelligent Systems (AIS)

Abstract

Big data mining is an intelligent process of extracting hidden knowledge from high-volume, high-variety, and high-velocity data environments for decision-making systems. Classical data mining algorithms face challenges related to memory utilization, speed-up, scale-up, computing cost, efficiency, and effectiveness when applied to big data. Data volume is a prime attribute of big data mining and is responsible for variety- and velocity-related challenges. An intelligent big data mining process incorporates classical data mining and statistics under single-machine and multiple-machine execution environments. Sampling is a data reduction technique that handles volume-related challenges and improves the speed, scalability, flexibility, accuracy, quality, efficiency, and memory utilization of any data mining algorithm, independently of that algorithm's characteristics. This paper proposes a systematic sampling-based big data mining model built on K-means clustering, known as SYK-means (systematic sampling-based K-means). The experimental results of the SYK-means algorithm are compared with those of the RSK-means (random sampling-based K-means) and classical K-means algorithms with respect to sample size selection and entire data selection. In the experimental evaluation, the SYK-means algorithm achieved better effectiveness and efficiency as measured by the R-squared, root-mean-square standard deviation, Davies-Bouldin, Calinski-Harabasz, and Silhouette coefficient validation indices, together with CPU time and convergence.
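The abstract does not spell out the steps of SYK-means, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes standard linear systematic sampling (a random start followed by every k-th record, with interval k = N // n), plain Lloyd-style K-means on the sample, and a final pass that assigns every record to its nearest sample centroid. The function names (`systematic_sample`, `kmeans`, `sy_kmeans`) and the synthetic data are hypothetical.

```python
import random


def systematic_sample(data, sample_size, seed=0):
    """Linear systematic sampling: random start, then every interval-th record."""
    n_total = len(data)
    if not 0 < sample_size <= n_total:
        raise ValueError("sample_size must be between 1 and len(data)")
    interval = n_total // sample_size            # sampling interval k = N // n
    start = random.Random(seed).randrange(interval)
    return [data[start + i * interval] for i in range(sample_size)]


def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)),
               key=lambda j: squared_distance(point, centroids[j]))


def kmeans(points, k, max_iter=100, seed=0):
    """Plain Lloyd's K-means with random initial centroids; returns centroids."""
    centroids = [tuple(p) for p in random.Random(seed).sample(points, k)]
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep the old
        # centroid if a group happens to be empty.
        updated = [
            tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
            for j, g in enumerate(groups)
        ]
        if updated == centroids:                 # no centroid moved: converged
            break
        centroids = updated
    return centroids


def sy_kmeans(data, k, sample_size, seed=0):
    """Assumed pipeline: cluster a systematic sample, then label all records."""
    sample = systematic_sample(data, sample_size, seed)
    centroids = kmeans(sample, k, seed=seed)
    labels = [nearest(p, centroids) for p in data]
    return centroids, labels


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical data: two well-separated 2-D Gaussian blobs, shuffled.
    data = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(5000)]
    data += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(5000)]
    rng.shuffle(data)
    centroids, labels = sy_kmeans(data, k=2, sample_size=500)
    print(centroids)
```

One general caveat with systematic sampling, independent of this paper: if the record order has a periodic pattern whose period divides the sampling interval, the sample can miss whole groups, which is why the sketch shuffles the synthetic data first.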

Author information

Authors and Affiliations

Department of Computer Science and Applications, Dr. Harisingh Gour Vishwavidyalaya, Sagar, India
Kamlesh Kumar Pandey & Diwakar Shukla

Editor information

Editors and Affiliations

Department of Electrical Engineering, Madhav Institute of Technology and Science, Gwalior, India
Hari Mohan Dubey
Department of Electrical Engineering, Madhav Institute of Technology and Science, Gwalior, India
Manjaree Pandit
Department of Electrical Engineering, Madhav Institute of Technology and Science, Gwalior, India
Laxmi Srivastava
Department of Electrical Engineering, Indian Institute of Technology Delhi, Delhi, India
Bijaya Ketan Panigrahi

About this paper

Cite this paper

Pandey, K.K., Shukla, D. (2022). Approximate Partitional Clustering Through Systematic Sampling in Big Data Mining. In: Dubey, H.M., Pandit, M., Srivastava, L., Panigrahi, B.K. (eds) Artificial Intelligence and Sustainable Computing. Algorithms for Intelligent Systems. Springer, Singapore. https://doi.org/10.1007/978-981-16-1220-6_19

DOI: https://doi.org/10.1007/978-981-16-1220-6_19
Published: 20 July 2021
Publisher Name: Springer, Singapore
Print ISBN: 978-981-16-1219-0
Online ISBN: 978-981-16-1220-6
eBook Packages: Intelligent Technologies and Robotics (R0)
