Abstract
Almost all modern industries use data analytics to support business functions such as demand forecasting, targeted marketing, and supply chain planning. In addition to historical data, social media data has become a prominent input for analytics. The key challenges with social media data are its huge volume and its high dimensionality. Clustering is an established strategy for segregating the relevant data for processing, thereby reducing the impact of volume. Dimensionality corresponds to the number of features used to represent each data subject. Applying dimensionality reduction techniques can reduce the computational cost caused by the curse of dimensionality. This paper presents an experimental analysis of four popular dimensionality reduction techniques, two linear (PCA and ICA) and two nonlinear (LLE and t-SNE), to verify the impact of dimensionality reduction on cluster quality as measured by internal clustering validation indices.
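To make the notion of an internal clustering validation index concrete, the following is a minimal sketch of two indices that appear in the appendices, the Silhouette index (higher is better) and the Davies–Bouldin index (lower is better). The paper's own evaluation used the R package clusterCrit; this Python version, its function names, and the six toy points are assumptions made purely for illustration.

```python
import math

def dist(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def group(points, labels):
    """Map each cluster label to the list of its points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    return clusters

def silhouette(points, labels):
    """Mean silhouette width over all points; higher is better."""
    clusters = group(points, labels)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        total += (b - a) / max(a, b)
    return total / len(points)

def davies_bouldin(points, labels):
    """Mean worst-case scatter-to-separation ratio; lower is better."""
    clusters = group(points, labels)
    cents = {k: tuple(sum(c) / len(v) for c in zip(*v))
             for k, v in clusters.items()}
    scatter = {k: sum(dist(p, cents[k]) for p in v) / len(v)
               for k, v in clusters.items()}
    worst = [max((scatter[i] + scatter[j]) / dist(cents[i], cents[j])
                 for j in clusters if j != i)
             for i in clusters]
    return sum(worst) / len(worst)

# Hypothetical toy data: two well-separated groups of three points each.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
good = [0, 0, 0, 1, 1, 1]   # labels that follow the visible groups
bad = [0, 1, 0, 1, 0, 1]    # labels that mix the two groups

sil_good, sil_bad = silhouette(points, good), silhouette(points, bad)
db_good, db_bad = davies_bouldin(points, good), davies_bouldin(points, bad)
print(f"silhouette:     good={sil_good:.3f}  bad={sil_bad:.3f}")
print(f"davies-bouldin: good={db_good:.3f}  bad={db_bad:.3f}")
```

Both indices need only the data and the cluster labels, with no ground-truth classes, which is what makes them "internal". The sketch assumes at least two clusters and no duplicate points.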
Appendices
Appendix 1
Complete observations from internal evaluation conducted on the k-means clustering results using the R package clusterCrit
The last four columns give the index value after dimensionality reduction using PCA, ICA, LLE, and t-SNE.
Index | clusterCrit variable | Goodness indicator | PCA | ICA | LLE | t-SNE |
---|---|---|---|---|---|---|
Ball–Hall index | $ball_hall | max diff | 17.20457 | 1.391324 | 1.232667 | 232.7926 |
Banfield–Raftery index | $banfeld_raftery | min | 11,029.27 | 572.3903 | 626.2364 | 27,286.89 |
C index | $c_index | min | 0.09553103 | 0.2327327 | 0.2312412 | 0.08800492 |
Calinski–Harabasz index | $calinski_harabasz | max | 3417.288 | 1782.438 | 1847.939 | 6416.44 |
Davies–Bouldin index | $davies_bouldin | min | 0.9058952 | 1.034618 | 1.039139 | 0.8715404 |
Det ratio index | $det_ratio | min diff | 3.871588 | 3.047042 | 3.026433 | 7.844465 |
Dunn index | $dunn | max | 0.00173042 | 0.00110608 | 0.00136502 | 0.00453987 |
Baker–Hubert Gamma index | $gamma | max | 0.7785525 | 0.5188169 | 0.4677813 | 0.7813951 |
G plus index | $g_plus | min | 0.05533569 | 0.1182925 | 0.1279114 | 0.04907176 |
GDI index | $gdi11 | max | 0.00173042 | 0.00110608 | 0.00136502 | 0.00453987 |
GDI index | $gdi12 | max | 0.01493041 | 0.01049106 | 0.02023069 | 0.02933953 |
GDI index | $gdi13 | max | 0.00511206 | 0.00372532 | 0.00717291 | 0.01019923 |
GDI index | $gdi21 | max | 1.21189 | 1.324565 | 1.406772 | 1.227989 |
GDI index | $gdi22 | max | 10.45642 | 12.56334 | 20.84945 | 7.936054 |
GDI index | $gdi23 | max | 3.580196 | 4.461175 | 7.392295 | 2.75879 |
GDI index | $gdi31 | max | 0.2752019 | 0.2248528 | 0.1981029 | 0.5592138 |
GDI index | $gdi32 | max | 2.374494 | 2.132702 | 2.936039 | 3.614 |
GDI index | $gdi33 | max | 0.8130083 | 0.7573111 | 1.04099 | 1.256326 |
GDI index | $gdi41 | max | 0.2410528 | 0.1885064 | 0.1703023 | 0.4925548 |
GDI index | $gdi42 | max | 2.079849 | 1.78796 | 2.524013 | 3.183207 |
GDI index | $gdi43 | max | 0.7121241 | 0.6348952 | 0.8949034 | 1.10657 |
GDI index | $gdi51 | max | 0.09224953 | 0.09543682 | 0.08576157 | 0.2129281 |
GDI index | $gdi52 | max | 0.7959463 | 0.9052065 | 1.271053 | 1.376078 |
GDI index | $gdi53 | max | 0.2725258 | 0.3214341 | 0.4506593 | 0.4783627 |
Ksq DetW index | $ksq_detw | max diff | 7,186,235,240 | 73,842,109 | 74,344,937 | 3.02E+12 |
Log Det ratio index | $log_det_ratio | min diff | 6768.324 | 5570.856 | 5536.924 | 10,299.04 |
Log SS ratio index | $log_ss_ratio | min diff | 0.3131567 | −0.3377083 | −0.3016196 | 0.9431728 |
McClain–Rao index | $mcclain_rao | min | 0.3721302 | 0.5643422 | 0.595994 | 0.4384923 |
PBM index | $pbm | max | 42.68164 | 0.9640561 | 0.6983515 | 1497.468 |
Point Biserial index | $point_biserial | max | −2.693745 | −0.4353606 | −0.3859055 | −11.7986 |
Ray–Turi index | $ray_turi | min | 0.2860079 | 0.4883412 | 0.4338252 | 0.2302405 |
Ratkowsky–Lance index | $ratkowsky_lance | max | 0.3568949 | 0.3725436 | 0.3764579 | 0.4178183 |
Scott–Symons index | $scott_symons | min | 14,296.68 | −6276.843 | −5815.897 | 47,340.83 |
SD index | $sd_scat | min | 0.5612758 | 0.7125735 | 0.6275758 | 0.2424001 |
SD index | $sd_dis | min | 0.3689349 | 1.142325 | 1.029613 | 0.07119309 |
S_Dbw index | $s_dbw | min | 1.177152 | 2.50059 | 4.480659 | 1.999409 |
Silhouette index | $silhouette | max | 0.3342391 | 0.3077699 | 0.2943706 | 0.4099526 |
Tau index | $tau | max | 0.5503895 | 0.3637915 | 0.3243153 | 0.5235662 |
Trace W index | $trace_w | max diff | 56,644.87 | 5836.337 | 5748.384 | 1,175,143 |
Trace WiB index | $trace_wib | max diff | 2.784339 | 1.556713 | 1.47942 | 5.971501 |
Wemmert–Gancarski index | $wemmert_gancarski | max | 0.5552955 | 0.5115594 | 0.4629613 | 0.5244954 |
Xie–Beni index | $xie_beni | min | 5550.067 | 14,184.04 | 6752.686 | 2710.222 |
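One way to read the table above is to apply each row's goodness indicator and note which technique it favours. The sketch below does this for six rows copied from Appendix 1; the choice of rows, the helper name `best_technique`, and the simple tally are assumptions for illustration and are not presented as the authors' evaluation procedure. Rows marked "max diff" or "min diff" are left out because they appear to depend on comparing successive values rather than ranking a single value.

```python
from collections import Counter

TECHNIQUES = ("PCA", "ICA", "LLE", "t-SNE")

# Selected rows from Appendix 1 (k-means): variable -> (indicator, values).
ROWS = {
    "calinski_harabasz": ("max", (3417.288, 1782.438, 1847.939, 6416.44)),
    "davies_bouldin": ("min", (0.9058952, 1.034618, 1.039139, 0.8715404)),
    "dunn": ("max", (0.00173042, 0.00110608, 0.00136502, 0.00453987)),
    "silhouette": ("max", (0.3342391, 0.3077699, 0.2943706, 0.4099526)),
    "s_dbw": ("min", (1.177152, 2.50059, 4.480659, 1.999409)),
    "xie_beni": ("min", (5550.067, 14184.04, 6752.686, 2710.222)),
}

def best_technique(indicator, values):
    """Return the technique whose value is best under the given indicator."""
    pick = max if indicator == "max" else min
    return TECHNIQUES[values.index(pick(values))]

winners = {name: best_technique(ind, vals) for name, (ind, vals) in ROWS.items()}
tally = Counter(winners.values())

for name, tech in winners.items():
    print(f"{name:18s} -> {tech}")
print(dict(tally))
```

A plain count treats every index as equally informative, which is a simplification; several of these indices measure closely related properties.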
Appendix 2
Complete observations from internal evaluation conducted on the AGNES clustering results using the R package clusterCrit
The last four columns give the index value after dimensionality reduction using PCA, ICA, LLE, and t-SNE.
Index | clusterCrit variable | Goodness indicator | PCA | ICA | LLE | t-SNE |
---|---|---|---|---|---|---|
Ball–Hall index | $ball_hall | max diff | 19.53912 | 1.331913 | 1.623174 | 363.1799 |
Banfield–Raftery index | $banfeld_raftery | min | 15,293.4 | 1734.322 | 1796.481 | 29,752.68 |
C index | $c_index | min | 0.3628751 | 0.3158354 | 0.3416631 | 0.2416017 |
Calinski–Harabasz index | $calinski_harabasz | max | 615.9316 | 1015.008 | 944.1628 | 2396.612 |
Davies–Bouldin index | $davies_bouldin | min | 1.79305 | 1.329694 | 1.286655 | 1.515427 |
Det ratio index | $det_ratio | min diff | 2.089563 | 2.089563 | 2.005657 | 4.858279 |
Dunn index | $dunn | max | 0.00114581 | 0.00209901 | 0.00118646 | 0.00444984 |
Baker–Hubert Gamma index | $gamma | max | 0.3062376 | 0.3735996 | 0.2849121 | 0.4667844 |
G plus index | $g_plus | min | 0.1733258 | 0.1564965 | 0.1784953 | 0.1258491 |
GDI index | $gdi11 | max | 0.00114581 | 0.00209901 | 0.00118646 | 0.00444984 |
GDI index | $gdi12 | max | 0.01376865 | 0.02173099 | 0.0153745 | 0.02884337 |
GDI index | $gdi13 | max | 0.00473243 | 0.00763023 | 0.00537622 | 0.00997892 |
GDI index | $gdi21 | max | 1.15714 | 1.296565 | 1.365893 | 0.9185128 |
GDI index | $gdi22 | max | 13.90482 | 13.42333 | 17.6997 | 5.953694 |
GDI index | $gdi23 | max | 4.779239 | 4.713223 | 6.189309 | 2.059795 |
GDI index | $gdi31 | max | 0.1850741 | 0.2396983 | 0.1761109 | 0.3479464 |
GDI index | $gdi32 | max | 2.223952 | 2.481595 | 2.282103 | 2.255348 |
GDI index | $gdi33 | max | 0.7643965 | 0.871342 | 0.7980158 | 0.7802811 |
GDI index | $gdi41 | max | 0.09938497 | 0.16374 | 0.1200127 | 0.2061969 |
GDI index | $gdi42 | max | 1.194264 | 1.695199 | 1.555164 | 1.336545 |
GDI index | $gdi43 | max | 0.4104816 | 0.5952212 | 0.5438165 | 0.4624033 |
GDI index | $gdi51 | max | 0.1013306 | 0.1180803 | 0.08362988 | 0.1522455 |
GDI index | $gdi52 | max | 1.217644 | 1.222485 | 1.083704 | 0.9868379 |
GDI index | $gdi53 | max | 0.4185175 | 0.4292411 | 0.3789543 | 0.3414155 |
Ksq DetW index | $ksq_detw | max diff | 1.3315E+10 | 107,678,003 | 112,182,708 | 4.88E+12 |
Log Det ratio index | $log_det_ratio | min diff | 3684.775 | 3684.775 | 3479.858 | 7903.421 |
Log SS ratio index | $log_ss_ratio | min diff | −1.40031 | −0.9007937 | −0.9731472 | −0.0416343 |
McClain–Rao index | $mcclain_rao | min | 0.7376162 | 0.6732493 | 0.7332948 | 0.6335125 |
PBM index | $pbm | max | 7.771969 | 1.091036 | 1.552822 | 518.4593 |
Point Biserial index | $point_biserial | max | −0.8765914 | −0.3172137 | −0.2494482 | −7.426781 |
Ray–Turi index | $ray_turi | min | 1.920044 | 0.928044 | 1.039746 | 1.106356 |
Ratkowsky–Lance index | $ratkowsky_lance | max | 0.3103157 | 0.3103157 | 0.3023539 | 0.4276503 |
Scott–Symons index | $scott_symons | min | 20,174.66 | −3912.774 | −3847.927 | 48,827.03 |
SD index | $sd_scat | min | 0.7649894 | 0.6738317 | 0.8583248 | 0.4549016 |
SD index | $sd_dis | min | 0.5789202 | 1.703494 | 1.819146 | 0.1006037 |
S_Dbw index | $s_dbw | min | 3.363968 | 2.699607 | 2.499008 | 2.769472 |
Silhouette index | $silhouette | max | 0.2485357 | 0.3003016 | 0.2853209 | 0.2976705 |
Tau index | $tau | max | 0.2164711 | 0.2640874 | 0.2013074 | 0.3207045 |
Trace W index | $trace_w | max diff | 107,595.6 | 7111.126 | 7257.464 | 2,140,160 |
Trace WiB index | $trace_wib | max diff | 0.9718295 | 0.9718295 | 0.9111961 | 2.86745 |
Wemmert–Gancarski index | $wemmert_gancarski | max | 0.1370079 | 0.3178442 | 0.3221082 | 0.3178927 |
Xie–Beni index | $xie_beni | min | 14,445.39 | 5647.428 | 10,638.43 | 2375.582 |
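The two appendices can also be read side by side. The sketch below takes four rows that appear in both tables and reports, for each dimensionality reduction technique, whether the k-means or the AGNES value is better under the row's goodness indicator. It assumes that both tables were computed on the same data with the same number of clusters, which this excerpt does not state; the helper name `better_algorithm` is hypothetical.

```python
TECHNIQUES = ("PCA", "ICA", "LLE", "t-SNE")

# variable -> (indicator, k-means row from Appendix 1, AGNES row from Appendix 2)
ROWS = {
    "calinski_harabasz": ("max",
                          (3417.288, 1782.438, 1847.939, 6416.44),
                          (615.9316, 1015.008, 944.1628, 2396.612)),
    "davies_bouldin": ("min",
                       (0.9058952, 1.034618, 1.039139, 0.8715404),
                       (1.79305, 1.329694, 1.286655, 1.515427)),
    "dunn": ("max",
             (0.00173042, 0.00110608, 0.00136502, 0.00453987),
             (0.00114581, 0.00209901, 0.00118646, 0.00444984)),
    "silhouette": ("max",
                   (0.3342391, 0.3077699, 0.2943706, 0.4099526),
                   (0.2485357, 0.3003016, 0.2853209, 0.2976705)),
}

def better_algorithm(indicator, kmeans_value, agnes_value):
    """Name the algorithm with the better value (the selected rows have no ties)."""
    if indicator == "max":
        return "k-means" if kmeans_value > agnes_value else "AGNES"
    return "k-means" if kmeans_value < agnes_value else "AGNES"

comparison = {
    name: {tech: better_algorithm(ind, k, a)
           for tech, k, a in zip(TECHNIQUES, kmeans, agnes)}
    for name, (ind, kmeans, agnes) in ROWS.items()
}

for name, per_technique in comparison.items():
    print(name, per_technique)
```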
Copyright information
© 2021 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Renjith, S., Sreekumar, A., Jathavedan, M. (2021). A Comparative Analysis of Clustering Quality Based on Internal Validation Indices for Dimensionally Reduced Social Media Data. In: Chiplunkar, N.N., Fukao, T. (eds) Advances in Artificial Intelligence and Data Engineering. AIDE 2019. Advances in Intelligent Systems and Computing, vol 1133. Springer, Singapore. https://doi.org/10.1007/978-981-15-3514-7_78
DOI: https://doi.org/10.1007/978-981-15-3514-7_78
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3513-0
Online ISBN: 978-981-15-3514-7
eBook Packages: Intelligent Technologies and Robotics (R0)