A Comparative Analysis of Clustering Quality Based on Internal Validation Indices for Dimensionally Reduced Social Media Data

Renjith, Shini; Sreekumar, A.; Jathavedan, M.

doi:10.1007/978-981-15-3514-7_78

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 1133)

Included in the following conference series:

International Conference on Artificial Intelligence and Data Engineering

Abstract

Almost all modern industries use data analytics to address business concerns such as demand forecasting, targeted marketing, and supply chain planning. In addition to historical data, social media data has become a prominent input for data analytics. The key challenges with social media data are its huge volume and high dimensionality. Clustering is a proven strategy in data analytics for segregating the relevant data for processing, thereby reducing the impact of huge volume. Dimensionality corresponds to the diverse features of the data subject being represented. Dimensionality reduction techniques can help reduce the computational cost caused by the curse of dimensionality. This paper presents an experimental analysis using four popular dimensionality reduction techniques, two linear and two nonlinear, to verify the impact of dimensionality reduction on cluster quality using internal clustering validation indices.
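
As an illustration of what an internal clustering validation index measures, the following is a minimal sketch of the silhouette index in plain Python. It is not the authors' implementation (the appendices report values computed with the R package clusterCrit). The function name `silhouette`, the choice of Euclidean distance, and the toy points are assumptions made for this example.

```python
from math import dist
from statistics import mean


def silhouette(points, labels):
    """Mean silhouette width over all points, using Euclidean distance."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Convention: a singleton cluster contributes a score of 0.
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        # (dist(p, p) is 0, so summing over the whole cluster is safe).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            mean(dist(p, q) for q in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return mean(scores)


# Hypothetical toy data: two tight pairs separated by a wide gap.
points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette(points, [0, 0, 1, 1])  # labels follow the natural grouping
bad = silhouette(points, [0, 1, 0, 1])   # labels cut across the gap
print(f"good={good:.3f} bad={bad:.3f}")
```

A value near 1 indicates compact, well-separated clusters, while a negative value suggests that points sit closer to another cluster than to their own. This is why the silhouette index carries a "max" goodness indicator.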

Author information

Authors and Affiliations

Department of Computer Applications, Cochin University of Science and Technology, Kochi, Kerala, 682022, India
Shini Renjith, A. Sreekumar & M. Jathavedan
Department of Computer Science and Engineering, Mar Baselios College of Engineering and Technology, Thiruvananthapuram, Kerala, 695015, India
Shini Renjith

Corresponding author

Correspondence to Shini Renjith.

Editor information

Editors and Affiliations

NMAM Institute of Technology, Udupi, India
Niranjan N. Chiplunkar
Ritsumeikan University, Shiga, Japan
Takanori Fukao

Appendices

Appendix 1

Complete observations from internal evaluation conducted on the k-means clustering results using the R package clusterCrit

The PCA, ICA, LLE, and t-SNE columns give the index value obtained after dimensionality reduction using that technique.

Index	clusterCrit variable	Goodness indicator	PCA	ICA	LLE	t-SNE
Ball–Hall index	$ball_hall	max diff	17.20457	1.391324	1.232667	232.7926
Banfield–Raftery index	$banfeld_raftery	min	11,029.27	572.3903	626.2364	27,286.89
C index	$c_index	min	0.09553103	0.2327327	0.2312412	0.08800492
Calinski–Harabasz index	$calinski_harabasz	max	3417.288	1782.438	1847.939	6416.44
Davies–Bouldin index	$davies_bouldin	min	0.9058952	1.034618	1.039139	0.8715404
Det ratio index	$det_ratio	min diff	3.871588	3.047042	3.026433	7.844465
Dunn index	$dunn	max	0.00173042	0.00110608	0.00136502	0.00453987
Baker–Hubert Gamma index	$gamma	max	0.7785525	0.5188169	0.4677813	0.7813951
G plus index	$g_plus	min	0.05533569	0.1182925	0.1279114	0.04907176
GDI index	$gdi11	max	0.00173042	0.00110608	0.00136502	0.00453987
GDI index	$gdi12	max	0.01493041	0.01049106	0.02023069	0.02933953
GDI index	$gdi13	max	0.00511206	0.00372532	0.00717291	0.01019923
GDI index	$gdi21	max	1.21189	1.324565	1.406772	1.227989
GDI index	$gdi22	max	10.45642	12.56334	20.84945	7.936054
GDI index	$gdi23	max	3.580196	4.461175	7.392295	2.75879
GDI index	$gdi31	max	0.2752019	0.2248528	0.1981029	0.5592138
GDI index	$gdi32	max	2.374494	2.132702	2.936039	3.614
GDI index	$gdi33	max	0.8130083	0.7573111	1.04099	1.256326
GDI index	$gdi41	max	0.2410528	0.1885064	0.1703023	0.4925548
GDI index	$gdi42	max	2.079849	1.78796	2.524013	3.183207
GDI index	$gdi43	max	0.7121241	0.6348952	0.8949034	1.10657
GDI index	$gdi51	max	0.09224953	0.09543682	0.08576157	0.2129281
GDI index	$gdi52	max	0.7959463	0.9052065	1.271053	1.376078
GDI index	$gdi53	max	0.2725258	0.3214341	0.4506593	0.4783627
Ksq DetW index	$ksq_detw	max diff	7,186,235,240	73,842,109	74,344,937	3.02E+12
Log Det ratio index	$log_det_ratio	min diff	6768.324	5570.856	5536.924	10,299.04
Log SS ratio index	$log_ss_ratio	min diff	0.3131567	−0.3377083	−0.3016196	0.9431728
McClain–Rao index	$mcclain_rao	min	0.3721302	0.5643422	0.595994	0.4384923
PBM index	$pbm	max	42.68164	0.9640561	0.6983515	1497.468
Point Biserial index	$point_biserial	max	−2.693745	−0.4353606	−0.3859055	−11.7986
Ray–Turi index	$ray_turi	min	0.2860079	0.4883412	0.4338252	0.2302405
Ratkowsky–Lance index	$ratkowsky_lance	max	0.3568949	0.3725436	0.3764579	0.4178183
Scott–Symons index	$scott_symons	min	14,296.68	−6276.843	−5815.897	47,340.83
SD index	$sd_scat	min	0.5612758	0.7125735	0.6275758	0.2424001
SD index	$sd_dis	min	0.3689349	1.142325	1.029613	0.07119309
S Dbw index	$s_dbw	min	1.177152	2.50059	4.480659	1.999409
Silhouette index	$silhouette	max	0.3342391	0.3077699	0.2943706	0.4099526
Tau index	$tau	max	0.5503895	0.3637915	0.3243153	0.5235662
Trace W index	$trace_w	max diff	56,644.87	5836.337	5748.384	1,175,143
Trace WiB index	$trace_wib	max diff	2.784339	1.556713	1.47942	5.971501
Wemmert–Gancarski index	$wemmert_gancarski	max	0.5552955	0.5115594	0.4629613	0.5244954
Xie–Beni index	$xie_beni	min	5550.067	14,184.04	6752.686	2710.222
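
As a reading aid for the table above, the sketch below counts, for a handful of rows, which dimensionality reduction technique has the best value according to the row's goodness indicator. The values are copied from the table. The choice of rows, the names `KMEANS_ROWS`, `best_technique`, and `tally_wins`, and the tallying approach itself are assumptions made for illustration, not the authors' analysis. Rows marked "max diff" or "min diff" are left out because their values are not compared directly.

```python
# Column order of the appendix table.
TECHNIQUES = ("PCA", "ICA", "LLE", "t-SNE")

# clusterCrit variable -> (goodness indicator, values in column order).
# Values copied from the k-means table; the subset of rows is arbitrary.
KMEANS_ROWS = {
    "calinski_harabasz": ("max", (3417.288, 1782.438, 1847.939, 6416.44)),
    "davies_bouldin": ("min", (0.9058952, 1.034618, 1.039139, 0.8715404)),
    "dunn": ("max", (0.00173042, 0.00110608, 0.00136502, 0.00453987)),
    "silhouette": ("max", (0.3342391, 0.3077699, 0.2943706, 0.4099526)),
    "s_dbw": ("min", (1.177152, 2.50059, 4.480659, 1.999409)),
    "tau": ("max", (0.5503895, 0.3637915, 0.3243153, 0.5235662)),
}


def best_technique(indicator, values):
    """Return the technique whose value is best under a max/min indicator."""
    if indicator not in ("max", "min"):
        raise ValueError(f"unsupported goodness indicator: {indicator!r}")
    pick = max if indicator == "max" else min
    return TECHNIQUES[values.index(pick(values))]


def tally_wins(rows):
    """Count how many rows each technique wins."""
    wins = {name: 0 for name in TECHNIQUES}
    for indicator, values in rows.values():
        wins[best_technique(indicator, values)] += 1
    return wins


wins = tally_wins(KMEANS_ROWS)
print(wins)
```

For these six rows, t-SNE has the best value on four indices and PCA on two. A different selection of rows could shift the tally, so the count is only a rough summary.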

Appendix 2

Complete observations from internal evaluation conducted on the AGNES clustering results using the R package clusterCrit

The PCA, ICA, LLE, and t-SNE columns give the index value obtained after dimensionality reduction using that technique.

Index	clusterCrit variable	Goodness indicator	PCA	ICA	LLE	t-SNE
Ball–Hall index	$ball_hall	max diff	19.53912	1.331913	1.623174	363.1799
Banfield–Raftery index	$banfeld_raftery	min	15,293.4	1734.322	1796.481	29,752.68
C index	$c_index	min	0.3628751	0.3158354	0.3416631	0.2416017
Calinski–Harabasz index	$calinski_harabasz	max	615.9316	1015.008	944.1628	2396.612
Davies–Bouldin index	$davies_bouldin	min	1.79305	1.329694	1.286655	1.515427
Det ratio index	$det_ratio	min diff	2.089563	2.089563	2.005657	4.858279
Dunn index	$dunn	max	0.00114581	0.00209901	0.00118646	0.00444984
Baker–Hubert Gamma index	$gamma	max	0.3062376	0.3735996	0.2849121	0.4667844
G plus index	$g_plus	min	0.1733258	0.1564965	0.1784953	0.1258491
GDI index	$gdi11	max	0.00114581	0.00209901	0.00118646	0.00444984
GDI index	$gdi12	max	0.01376865	0.02173099	0.0153745	0.02884337
GDI index	$gdi13	max	0.00473243	0.00763023	0.00537622	0.00997892
GDI index	$gdi21	max	1.15714	1.296565	1.365893	0.9185128
GDI index	$gdi22	max	13.90482	13.42333	17.6997	5.953694
GDI index	$gdi23	max	4.779239	4.713223	6.189309	2.059795
GDI index	$gdi31	max	0.1850741	0.2396983	0.1761109	0.3479464
GDI index	$gdi32	max	2.223952	2.481595	2.282103	2.255348
GDI index	$gdi33	max	0.7643965	0.871342	0.7980158	0.7802811
GDI index	$gdi41	max	0.09938497	0.16374	0.1200127	0.2061969
GDI index	$gdi42	max	1.194264	1.695199	1.555164	1.336545
GDI index	$gdi43	max	0.4104816	0.5952212	0.5438165	0.4624033
GDI index	$gdi51	max	0.1013306	0.1180803	0.08362988	0.1522455
GDI index	$gdi52	max	1.217644	1.222485	1.083704	0.9868379
GDI index	$gdi53	max	0.4185175	0.4292411	0.3789543	0.3414155
Ksq DetW index	$ksq_detw	max diff	1.3315E+10	107,678,003	112,182,708	4.88E+12
Log Det ratio index	$log_det_ratio	min diff	3684.775	3684.775	3479.858	7903.421
Log SS ratio index	$log_ss_ratio	min diff	−1.40031	−0.9007937	−0.9731472	−0.0416343
McClain–Rao index	$mcclain_rao	min	0.7376162	0.6732493	0.7332948	0.6335125
PBM index	$pbm	max	7.771969	1.091036	1.552822	518.4593
Point Biserial index	$point_biserial	max	−0.8765914	−0.3172137	−0.2494482	−7.426781
Ray–Turi index	$ray_turi	min	1.920044	0.928044	1.039746	1.106356
Ratkowsky–Lance index	$ratkowsky_lance	max	0.3103157	0.3103157	0.3023539	0.4276503
Scott–Symons index	$scott_symons	min	20,174.66	−3912.774	−3847.927	48,827.03
SD index	$sd_scat	min	0.7649894	0.6738317	0.8583248	0.4549016
SD index	$sd_dis	min	0.5789202	1.703494	1.819146	0.1006037
S Dbw index	$s_dbw	min	3.363968	2.699607	2.499008	2.769472
Silhouette index	$silhouette	max	0.2485357	0.3003016	0.2853209	0.2976705
Tau index	$tau	max	0.2164711	0.2640874	0.2013074	0.3207045
Trace W index	$trace_w	max diff	107,595.6	7111.126	7257.464	2,140,160
Trace WiB index	$trace_wib	max diff	0.9718295	0.9718295	0.9111961	2.86745
Wemmert–Gancarski index	$wemmert_gancarski	max	0.1370079	0.3178442	0.3221082	0.3178927
Xie–Beni index	$xie_beni	min	14,445.39	5647.428	10,638.43	2375.582
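
Because the indices in the table above live on very different scales, one hedged way to summarise several of them at once is to rank the techniques within each row (1 = best) and average the ranks. The sketch below does this for four rows copied from the AGNES table. The row selection, the names `AGNES_ROWS`, `rank_row`, and `mean_ranks`, and the rank-averaging idea are assumptions for illustration and are not taken from the paper.

```python
# Column order of the appendix table.
TECHNIQUES = ("PCA", "ICA", "LLE", "t-SNE")

# clusterCrit variable -> (goodness indicator, values in column order).
# Values copied from the AGNES table; the subset of rows is arbitrary.
AGNES_ROWS = {
    "calinski_harabasz": ("max", (615.9316, 1015.008, 944.1628, 2396.612)),
    "davies_bouldin": ("min", (1.79305, 1.329694, 1.286655, 1.515427)),
    "dunn": ("max", (0.00114581, 0.00209901, 0.00118646, 0.00444984)),
    "silhouette": ("max", (0.2485357, 0.3003016, 0.2853209, 0.2976705)),
}


def rank_row(indicator, values):
    """Return ranks (1 = best) in the same column order as `values`.

    Ties are not given special treatment; equal values are ranked in
    column order, which is acceptable here because the rows have no ties.
    """
    if indicator not in ("max", "min"):
        raise ValueError(f"unsupported goodness indicator: {indicator!r}")
    order = sorted(
        range(len(values)),
        key=lambda i: values[i],
        reverse=(indicator == "max"),
    )
    ranks = [0] * len(values)
    for position, column in enumerate(order, start=1):
        ranks[column] = position
    return ranks


def mean_ranks(rows):
    """Average each technique's rank over all rows."""
    totals = [0] * len(TECHNIQUES)
    for indicator, values in rows.values():
        for column, rank in enumerate(rank_row(indicator, values)):
            totals[column] += rank
    return {name: total / len(rows) for name, total in zip(TECHNIQUES, totals)}


agnes_mean_ranks = mean_ranks(AGNES_ROWS)
print(agnes_mean_ranks)
```

On these four rows ICA and t-SNE tie with a mean rank of 1.75, LLE follows at 2.5, and PCA ranks last on every row. Rank averaging discards the size of the gaps between values, so it complements the raw table rather than replacing it.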

About this paper

Cite this paper

Renjith, S., Sreekumar, A., Jathavedan, M. (2021). A Comparative Analysis of Clustering Quality Based on Internal Validation Indices for Dimensionally Reduced Social Media Data. In: Chiplunkar, N.N., Fukao, T. (eds) Advances in Artificial Intelligence and Data Engineering. AIDE 2019. Advances in Intelligent Systems and Computing, vol 1133. Springer, Singapore. https://doi.org/10.1007/978-981-15-3514-7_78

DOI: https://doi.org/10.1007/978-981-15-3514-7_78
Published: 14 August 2020
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-3513-0
Online ISBN: 978-981-15-3514-7
eBook Packages: Intelligent Technologies and Robotics, Intelligent Technologies and Robotics (R0)
