Abstract
Linear discriminant analysis (LDA) is a popular dimensionality reduction method in machine learning. However, LDA cannot be applied directly to unlabelled data because it requires class labels for training, so a clustering algorithm must first be used to predict the class labels. Different clustering algorithms, in turn, require different parameters to be specified. The objective of this paper is to investigate how these parameters behave with respect to a measurement criterion for feature selection, namely the total error reduction ratio (TERR). The k-means and Gaussian mixture distribution algorithms were adopted for clustering, and each was tested on four datasets with four distinct clustering evaluation criteria: Calinski-Harabasz, Davies-Bouldin, Gap, and Silhouette. Overall, k-means outperforms the Gaussian mixture distribution in selecting smaller feature subsets. It was found that when a certain TERR threshold is set and k-means is applied, the Calinski-Harabasz, Davies-Bouldin, and Silhouette criteria yield the same number of selected features, fewer than the subset size given by the Gap criterion. When the Gaussian mixture distribution algorithm is adopted, no single criterion consistently selects the fewest features. The higher the TERR threshold, the larger the selected feature subset, regardless of which clustering algorithm and evaluation criterion are used. These results provide an essential direction for future work on designing robust unsupervised feature selection based on LDA.
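The two-stage pipeline the abstract describes — cluster the unlabelled data, then train LDA on the resulting pseudo-labels — can be sketched as follows. This is a minimal illustration using scikit-learn, not the authors' implementation: the TERR criterion itself is omitted, only one clustering algorithm (k-means) and one evaluation criterion (Silhouette) are shown, and the Iris data stands in for the paper's datasets.

```python
# Sketch (assumed scikit-learn API): cluster unlabelled data to obtain
# pseudo-labels, then fit LDA on those labels. The paper's TERR-based
# feature selection step is NOT implemented here.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X = load_iris().data  # treated as unsupervised: the true labels are discarded

# Step 1: choose the number of clusters with a clustering evaluation
# criterion (Silhouette shown; Calinski-Harabasz and Davies-Bouldin work
# analogously via calinski_harabasz_score / davies_bouldin_score).
best_k, best_score = None, -np.inf
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)
    if score > best_score:
        best_k, best_score = k, score

# Step 2: re-cluster with the selected k and use the pseudo-labels as the
# class labels that LDA needs for training.
pseudo = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(X)
lda = LinearDiscriminantAnalysis().fit(X, pseudo)
X_reduced = lda.transform(X)  # at most (best_k - 1) discriminant components
```

The parameters that must be specified at the clustering stage (the candidate range of k, initialization, the evaluation criterion) are exactly the ones whose interaction with the TERR threshold the paper investigates.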
Acknowledgements
The authors would like to thank the International Islamic University Malaysia (IIUM), Universiti Malaysia Pahang (UMP) and the Universiti Teknologi MARA (UiTM) for providing financial support under the IIUM-UMP-UiTM Sustainable Research Collaboration Grant 2020 (Vote Number: RDU200722).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Tie, K.H., Senawi, A., Chuan, Z.L. (2022). An Observation of Different Clustering Algorithms and Clustering Evaluation Criteria for a Feature Selection Based on Linear Discriminant Analysis. In: Khairuddin, I.M., et al. Enabling Industry 4.0 through Advances in Mechatronics. Lecture Notes in Electrical Engineering, vol 900. Springer, Singapore. https://doi.org/10.1007/978-981-19-2095-0_42
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-2094-3
Online ISBN: 978-981-19-2095-0
eBook Packages: Intelligent Technologies and Robotics (R0)