Stratified linear systematic sampling based clustering approach for detection of financial risk group by mining of big data

Abstract

Risk analysis is an essential business activity because it uncovers unknown risks such as financial, recovery, investment, operational, credit, and debit risk. Clustering is a data mining technique that uses the behavior and nature of the data to discover unexpected risks in business data. In a big data setting, clustering algorithms face challenges in execution time and cluster quality that stem from the primary attributes of big data. This study proposes a Stratified Systematic Sampling Extension (SSE) approach for clustering-based risk analysis in big data mining on a single machine. Sampling is a data reduction technique that saves computation time and can improve the cluster quality, scalability, and speed of a clustering algorithm. The proposed sampling plan first forms the strata on the dimension with the minimum variance and then selects samples from each stratum using random linear systematic sampling. The clustering algorithm builds robust risk and non-risk clusters from the sample data and then extends the sample-based clustering result to the full dataset using Euclidean distance. The SSE-based clustering algorithms are compared with the existing K-means and K-means++ algorithms on financial risk datasets using the Davies Bouldin score, the Silhouette coefficient, the Scattering Density between clusters Validity, the Scattering Distance Validity, and CPU time as validation metrics. The experimental results show that the SSE-based clustering algorithms achieve better clustering objectives in terms of cluster compaction, separation, density, and variance while reducing the number of iterations, distance computations, data comparisons, and the computational time. A Friedman test indicates that the improvement obtained with the proposed sampling plan is statistically significant.
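
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how stratum boundaries, the sampling interval, or the K-means seeding are chosen, so the equal-size strata, the `fraction` parameter, the random seeding, and all function names (`stratified_systematic_sample`, `kmeans`, `sse_cluster`) are hypothetical choices made for this example.

```python
import math
import random
from statistics import pvariance


def stratified_systematic_sample(data, n_strata, fraction, rng):
    """Return the indices of the sampled rows (hypothetical helper)."""
    n_dims = len(data[0])
    # Stratification dimension: the one with the smallest variance.
    dim = min(range(n_dims), key=lambda d: pvariance([row[d] for row in data]))
    # Assumption: strata are equal-size runs of the rows sorted on that dimension.
    order = sorted(range(len(data)), key=lambda i: data[i][dim])
    size = math.ceil(len(order) / n_strata)
    step = max(1, round(1 / fraction))  # sampling interval
    sample = []
    for start in range(0, len(order), size):
        stratum = order[start:start + size]
        # Linear systematic sampling: random start, then every step-th row.
        first = rng.randrange(min(step, len(stratum)))
        sample.extend(stratum[first::step])
    return sample


def nearest(point, centroids):
    """Index of the centroid closest to point in Euclidean distance."""
    return min(range(len(centroids)), key=lambda j: math.dist(point, centroids[j]))


def kmeans(points, k, rng, iters=50):
    """Plain Lloyd-style K-means with random seeding (assumed for the sketch)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            groups[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its group; keep it if the group is empty.
        updated = [
            [sum(col) / len(g) for col in zip(*g)] if g else c
            for g, c in zip(groups, centroids)
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


def sse_cluster(data, k, n_strata=4, fraction=0.1, seed=0):
    """Cluster a sample, then extend the result to every row of the data."""
    rng = random.Random(seed)
    idx = stratified_systematic_sample(data, n_strata, fraction, rng)
    centroids = kmeans([data[i] for i in idx], k, rng)
    # Extension step: each row joins the cluster of its nearest sample centroid.
    labels = [nearest(row, centroids) for row in data]
    return labels, centroids, idx


# Synthetic two-group data standing in for risk and non-risk records.
gen = random.Random(1)
low = [[gen.gauss(0, 1), gen.gauss(0, 1)] for _ in range(500)]
high = [[gen.gauss(10, 1), gen.gauss(10, 1)] for _ in range(500)]
labels, centroids, idx = sse_cluster(low + high, k=2)
```

In this sketch only the sampled rows take part in the iterative centroid updates; the remaining rows are visited once, in the extension step, which is where the saving in distance computations would come from.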

Funding

This study received no external funding.

Author information

Corresponding author

Correspondence to Kamlesh Kumar Pandey.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Pandey, K.K., Shukla, D. Stratified linear systematic sampling based clustering approach for detection of financial risk group by mining of big data. Int J Syst Assur Eng Manag (2021). https://doi.org/10.1007/s13198-021-01424-0

Keywords

  • Risk Clustering
  • Sampling
  • Big data clustering
  • Stratified sampling
  • Systematic sampling
  • Sample extension
  • SSE-K means
  • SSE-K means ++ 
  • Robust risk clusters