
Filter-based unsupervised feature selection using Hilbert–Schmidt independence criterion

  • Original Article
International Journal of Machine Learning and Cybernetics

Abstract

Feature selection is a fundamental preprocessing step before actual learning, especially in the unsupervised setting where the data are unlabeled. When a problem involves too many features, reducing the dimensionality by discarding weak features is highly desirable. In this paper, we present a framework for unsupervised feature selection based on maximizing the dependency between the samples similarity matrices before and after deleting a feature. To this end, a novel estimation of the Hilbert–Schmidt independence criterion (HSIC), more appropriate for high-dimensional data with small sample size, is introduced. Its key idea is that eliminating redundant features and/or those with high inter-relevancy should not seriously affect the pairwise samples similarity. Also, to handle diagonally dominant matrices, a heuristic trick is used to reduce the dynamic range of the matrix values. To speed up the proposed scheme, the gap statistic and k-means clustering methods are also employed. To assess the performance of our method, experiments on benchmark datasets are conducted. The obtained results confirm the efficiency of our unsupervised feature selection scheme.


References

  1. Yu L, Liu H (2004) Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 5:1205–1224


  2. Sharma A, Imoto S, Miyano S (2012) A top-r feature selection algorithm for microarray gene expression data. IEEE/ACM Trans Comput Biol Bioinform 9(3):754–64


  3. Sharma A, Imoto S, Miyano S, Sharma V (2012) Null space based feature selection method for gene expression data. Int J Mach Learn Cybern 3(4):269–276


  4. Sharma A, Imoto S, Miyano S (2012) A between-class overlapping filter-based method for transcriptome data analysis. J Bioinform Comput Biol 10(5):1250010


  5. Dy J, Brodley C (2004) Feature selection for unsupervised learning. J Mach Learn Res 5:845–889


  6. Shang R, Chang J, Jiao L, Xue Y (2017) Unsupervised feature selection based on self-representation sparse regression and local similarity preserving. Int J Mach Learn Cybern 1–14

  7. Chandrashekar G, Sahin F (2014) A survey on feature selection methods. Comput Electr Eng 40:16–28


  8. Guyon I, Gunn S, Nikravesh M, Zadeh LA (2006) Feature extraction: foundations and applications, vol 207. Springer, Berlin, pp 89–117


  9. Peng H, Long F, Ding C (2005) Feature selection based on mutual information: criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 27(8):1226–1238

  10. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182


  11. Brown G, Pocock A, Zhao M, Lujan M (2012) Conditional likelihood maximization: a unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66


  12. Xu J, Cao L, Zhong J, Feng Y (2010) Adapt the mRMR criterion for unsupervised feature selection. In: Advanced data mining and applications. Springer, Berlin, pp 111–121

  13. Gretton A, Bousquet O, Smola AJ, Scholkopf B (2005) Measuring statistical dependence with Hilbert–Schmidt norms. In: Jain S, Simon HU, Tomita E (eds) Proceedings of the international conference on algorithmic learning theory, Springer, pp 63–77

  14. Zarkoob H (2010) Feature selection for gene expression data based on Hilbert–Schmidt independence criterion. University of Waterloo, Electronic theses and dissertations

  15. Bedo J, Chetty M, Ngom A, Ahmad S (2008) Microarray design using the Hilbert–Schmidt independence criterion. Springer, Berlin, pp 288–298

  16. Song L, Smola A, Gretton A, Bedo J, Borgwardt K (2012) Feature selection via dependence maximization. J Mach Learn Res 13:1393–1434

  17. Farahat AK, Ghodsi A, Kamel MS (2013) Efficient greedy feature selection for unsupervised learning. Knowl Inf Syst 35(2):285–310


  18. Sharma A, Paliwal KK, Imoto S, Miyano S (2014) A feature selection method using improved regularized linear discriminant analysis. Mach Vis Appl 25(3):775–786


  19. Eskandari S, Akbas E (2017) Supervised infinite feature selection. arXiv Prepr. http://arxiv.org/abs/1704.02665

  20. Luo M, Nie F, Chang X, Yang Y, Hauptmann AG, Zheng Q (2018) Adaptive unsupervised feature selection with structure regularization. IEEE Trans Neural Netw Learn Syst 29(4):944–956

  21. Weston J, Scholkopf B, Eskin E, Leslie C, Noble W (2003) Dealing with large diagonals in kernel matrices. Ann Inst Stat Math 55(2):391–408

  22. Fischer A, Roth V, Buhmann JM (2003) Clustering with the connectivity Kernel. Adv Neural Inf Process Syst 16:89–96


  23. Tibshirani R, Walther G, Hastie T (2001) Estimating the number of clusters in a data set via the gap statistic. J R Stat Soc Ser B 63(2):411–423


  24. MacQueen J (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, pp 281–297

  25. Somol P, Pudil P, Novovicova J, Paclik P (1999) Adaptive floating search methods in feature selection. Pattern Recognit Lett 20:1157–1163


  26. UCI Machine Learning Repository. https://archive.ics.uci.edu/ml/datasets. Accessed Feb 2017

  27. Mramor M, Leban G, Demsar J, Zupan B (2007) Visualization-based cancer microarray data classification analysis. Bioinformatics 23(16):2147–2154

  28. Scholkopf B, Smola AJ (2001) Learning with kernels: support vector machines, regularization, optimization, and beyond. MIT Press, Cambridge

  29. Lin S, Liu Z (2007) Parameter selection of support vector machines based on RBF kernel function. Zhejiang Univ Technol 35:163–167


  30. Ester M, Kriegel H, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, vol 96, pp 226–231

  31. Kreyszig E (1970) Introductory mathematical statistics. Wiley, New York

  32. Hubert L, Arabie P (1985) Comparing partitions. J Classif 2:193–218


Author information


Corresponding author

Correspondence to Eghbal G. Mansoori.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix 1


1.1 Applying SP-BAHSIC, SPC-BAHSIC, SP-FOHSIC and SPC-FOHSIC on synthetic data

Herein, we illustrate the steps of our proposed methods on two synthetic datasets. First, we run SP-BAHSIC on the dataset \({X_G}\) with \(m=4\) samples and \(n=4\) features, denoted \(G=\left\{ {A,~B,~C,D} \right\}\).

$${X_G}=\left[ {\begin{array}{*{20}{l}} 1&3&4&2 \\ 3&6&5&4 \\ 5&{10}&4&3 \\ 2&5&4&4 \end{array}} \right]$$

Our aim is to find the \({n^\prime }=2\) most informative features. Using the RBF kernel function [29] as \(\phi (.)\), the samples similarity matrix \(K_{{\{ A,B,C,D\} }}^{\phi }\) is computed as:

$$K_{{\{ A,B,C,D\} }}^{\phi }=\left[ {\begin{array}{*{20}{l}} 1& \quad {0.70}& \quad {0.27}& \quad {0.83} \\ {0.70}& \quad 1& \quad {0.64}& \quad {0.94} \\ {0.27}& \quad {0.64}& \quad 1& \quad {0.50} \\ {0.83}& \quad {0.94}& \quad {0.50}& \quad 1 \end{array}} \right]$$
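For concreteness, the sketch below shows how such an RBF similarity matrix can be computed with NumPy. The bandwidth is an assumption: the paper selects the RBF parameter following [29], which is not reproduced in this excerpt; a value of \(\sigma \approx 5\) gives values close to those above (individual entries may differ in the last decimal).

```python
import numpy as np

def rbf_kernel(X, sigma):
    """RBF similarity matrix: K[i, j] = exp(-||x_i - x_j||^2 / (2 * sigma^2))."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared Euclidean distances
    return np.exp(-d2 / (2.0 * sigma ** 2))

X_G = np.array([[1, 3, 4, 2],
                [3, 6, 5, 4],
                [5, 10, 4, 3],
                [2, 5, 4, 4]], dtype=float)

# sigma = 5 is an assumption; it approximately reproduces the matrix shown above.
print(np.round(rbf_kernel(X_G, sigma=5.0), 2))
```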

Similarly, using the RBF kernel for \(\psi (.)\), the samples similarity matrices \(K_{{\{ B,C,D\} }}^{\psi }\), \(K_{{\{ A,C,D\} }}^{\psi }\), \(K_{{\{ A,B,D\} }}^{\psi }\) and \(K_{{\{ A,B,C\} }}^{\psi }\) are obtained:

$$K_{{\{ B,C,D\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.93}&{0.78}&{0.96} \\ {0.93}&1&{0.91}&{0.99} \\ {0.78}&{0.91}&1&{0.88} \\ {0.96}&{0.99}&{0.88}&1 \end{array}} \right],$$
$$K_{{\{ A,C,D\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.96}&{0.92}&{0.97} \\ {0.96}&1&{0.97}&{0.99} \\ {0.92}&{0.97}&1&{0.95} \\ {0.97}&{0.99}&{0.95}&1 \end{array}} \right],$$
$$K_{{\{ A,B,D\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.92}&{0.72}&{0.96} \\ {0.92}&1&{0.90}&{0.99} \\ {0.72}&{0.90}&1&{0.84} \\ {0.96}&{0.99}&{0.84}&1 \end{array}} \right],$$
$$K_{{\{ A,B,C\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1& \quad {0.93}& \quad {0.72}& \quad {0.97} \\ {0.93}& \quad 1& \quad {0.90}& \quad {0.98} \\ {0.72}& \quad {0.90}& \quad 1& \quad {0.84} \\ {0.97}& \quad {0.98}& \quad {0.84}& \quad 1 \end{array}} \right]$$

Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of \(K_{{\{ B,C,D\} }}^{\psi }\), \(K_{{\{ A,C,D\} }}^{\psi }\), \(K_{{\{ A,B,D\} }}^{\psi }\) and \(K_{{\{ A,B,C\} }}^{\psi }\), yielding \({H_1}\):

$${H_1}=\left[ {\begin{array}{*{20}{c}} {0.0073}&{0.0027}&{0.0094}&{0.0093} \end{array}} \right]$$
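The HSIC2 estimator of Eq. (14) is defined in the full text and is not reproduced in this excerpt, so the sketch below uses the standard biased HSIC estimate \(\operatorname{tr}(KHLH)/(m-1)^{2}\), with centering matrix \(H=I-\tfrac{1}{m}\mathbf{1}\mathbf{1}^{\top}\), as a stand-in. It shows how a score vector like \({H_1}\) is assembled over the leave-one-feature-out kernels; because the estimator (and possibly the bandwidth used for \(\psi\)) differs, its numerical values will not match those above. It assumes rbf_kernel and X_G from the previous sketch.

```python
def hsic_biased(K, L):
    """Biased empirical HSIC between two kernel matrices (a stand-in for Eq. (14))."""
    m = K.shape[0]
    H = np.eye(m) - np.ones((m, m)) / m            # centering matrix
    return np.trace(K @ H @ L @ H) / (m - 1) ** 2

features = list(range(X_G.shape[1]))               # G = {A, B, C, D}
K_full = rbf_kernel(X_G, sigma=5.0)                # phi(.) over all features

# one leave-one-feature-out kernel per candidate feature (psi(.))
H1 = [hsic_biased(K_full,
                  rbf_kernel(X_G[:, [g for g in features if g != f]], sigma=5.0))
      for f in features]
to_drop = features[int(np.argmax(H1))]             # removing this feature disturbs the similarity least
```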

According to \({H_1}\), the HSIC2 measure between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and \(K_{{\{ A,B,D\} }}^{\psi }\) is the largest. So, among the 4 features, feature \(C\) is the most suitable for elimination. In the second phase, the samples similarity matrices \(K_{{\{ B,D\} }}^{\psi }\), \(K_{{\{ A,D\} }}^{\psi }\) and \(K_{{\{ A,B\} }}^{\psi }\) are computed.

$$\begin{aligned} K_{{\{ B,D\} }}^{\psi } & =\left[ {\begin{array}{*{20}{l}} 1& \quad {0.94}& \quad {0.78}& \quad {0.96} \\ {0.94}& \quad 1& \quad {0.92}& \quad {0.99} \\ {0.78}& \quad {0.92}& \quad 1& \quad {0.88} \\ {0.96}& \quad {0.99}& \quad {0.88}& \quad 1 \end{array}} \right], \\ K_{{\{ A,D\} }}^{\psi } & =\left[ {\begin{array}{*{20}{l}} 1& \quad {0.96}& \quad {0.92}& \quad {0.97} \\ {0.96}& \quad 1& \quad {0.97}& \quad {0.99} \\ {0.92}& \quad {0.97}& \quad 1& \quad {0.95} \\ {0.97}& \quad {0.99}& \quad {0.95}& \quad 1 \end{array}} \right], \\ K_{{\{ A,B\} }}^{\psi } & =\left[ {\begin{array}{*{20}{l}} 1& \quad {0.94}& \quad {0.72}& \quad {0.97} \\ {0.94}& \quad 1& \quad {0.90}& \quad {0.99} \\ {0.72}& \quad {0.90}& \quad 1& \quad {0.84} \\ {0.97}& \quad {0.99}& \quad {0.84}& \quad 1 \end{array}} \right] \\ \end{aligned}$$

Also, the HSIC2 value between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of them is stored in \({H_2}\):

$${H_2}=\left[ {\begin{array}{*{20}{c}} {0.0072}&{0.0026}&{0.0093} \end{array}} \right]$$

From the HSIC2 values in \({H_2}\), the next candidate for elimination is feature \(D\). Thus, features \(A\) and \(B\) are returned by SP-BAHSIC as the most informative features.
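Wrapping the previous step in a loop gives a backward-elimination sketch in the spirit of SP-BAHSIC. It reuses rbf_kernel and hsic_biased from the sketches above; the bandwidth is again an assumption rather than the paper's parameter choice.

```python
def backward_select(X, n_keep, sigma=5.0):
    """Greedy SP-BAHSIC-style backward elimination (sketch, with a stand-in HSIC estimate)."""
    remaining = list(range(X.shape[1]))
    K_full = rbf_kernel(X, sigma)                  # similarity over the full feature set
    while len(remaining) > n_keep:
        scores = []
        for f in remaining:
            subset = [g for g in remaining if g != f]
            scores.append(hsic_biased(K_full, rbf_kernel(X[:, subset], sigma)))
        # drop the feature whose removal keeps the samples similarity most intact
        remaining.pop(int(np.argmax(scores)))
    return remaining

print(backward_select(X_G, n_keep=2))              # indices of the two retained features
```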

As another example, SPC-BAHSIC is now run on a dataset \({X_G}\) with \(m=4\) samples and \(n=6\) features, named \(G=\left\{ {A,~B,~C,D,E,F} \right\}\), in order to find the \(n'=2\) most informative features.

$${X_G}=\left[ {\begin{array}{*{20}{l}} 1&3&4&2&4&8 \\ 3&6&5&4&2&1 \\ 5&9&4&3&7&6 \\ 2&5&4&4&3&2 \end{array}} \right]$$

First, we estimate the number of clusters for these \(n=6\) features using the gap statistic method, which yields \(l=4\) clusters. Next, the feature clusters are found using the k-means method. From each cluster, one feature is selected, giving \({G_c}=\left\{ {A,~B,~C,F} \right\}\). So, the dataset \({X_G}\) is represented by \({X_{{G_c}}}\) in the new feature space.

$${X_{{G_c}}}=\left[ {\begin{array}{*{20}{l}} 1&3&4&8 \\ 3&6&5&1 \\ 5&9&4&6 \\ 2&5&4&2 \end{array}} \right]$$
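A sketch of this clustering pre-step is given below: the features (columns of \({X_G}\)) are clustered with k-means and one representative per cluster is kept. Choosing the feature closest to its cluster centroid is an assumption, since the exact representative rule is not spelled out in this excerpt, and the number of clusters \(l\) is taken as given here (its estimation is sketched later). It assumes NumPy from the earlier sketches.

```python
from sklearn.cluster import KMeans

def cluster_representatives(X, n_clusters, random_state=0):
    """Cluster the columns (features) of X with k-means; keep the feature nearest each centroid."""
    F = X.T                                                  # each feature is a point in R^m
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=random_state).fit(F)
    reps = []
    for c in range(n_clusters):
        members = np.where(km.labels_ == c)[0]
        dist = np.linalg.norm(F[members] - km.cluster_centers_[c], axis=1)
        reps.append(int(members[np.argmin(dist)]))
    return sorted(reps)

X_G6 = np.array([[1, 3, 4, 2, 4, 8],
                 [3, 6, 5, 4, 2, 1],
                 [5, 9, 4, 3, 7, 6],
                 [2, 5, 4, 4, 3, 2]], dtype=float)

G_c = cluster_representatives(X_G6, n_clusters=4)            # one feature per cluster (l = 4 assumed)
X_Gc = X_G6[:, G_c]                                          # reduced representation of X_G
```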

Employing \({X_{{G_c}}}\), the remaining steps of SPC-BAHSIC are the same as those of SP-BAHSIC. Again, using the RBF kernel function as \(\phi (.)\), the samples similarity matrix \(K_{{\{ A,B,C,F\} }}^{\phi }\) is computed.

$$K_{{\{ A,B,C,F\} }}^{\phi }=\left[ {\begin{array}{*{20}{l}} 1&{0.28}&{0.33}&{0.44} \\ {0.28}&1&{0.46}&{0.92} \\ {0.33}&{0.46}&1&{0.44} \\ {0.44}&{0.92}&{0.44}&1 \end{array}} \right]$$

Also, \(\psi (.)\) is used to compute the following similarity matrices:

$$K_{{\{ B,C,F\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.74}&{0.82}&{0.82} \\ {0.74}&1&{0.84}&{0.98} \\ {0.82}&{0.84}&1&{0.85} \\ {0.82}&{0.98}&{0.85}&1 \end{array}} \right],$$
$$K_{{\{ A,C,F\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.76}&{0.90}&{0.83} \\ {0.76}&1&{0.86}&{0.98} \\ {0.90}&{0.86}&1&{0.88} \\ {0.83}&{0.98}&{0.88}&1 \end{array}} \right],$$
$$K_{{\{ A,B,F\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.73}&{0.75}&{0.81} \\ {0.73}&1&{0.83}&{0.98} \\ {0.75}&{0.83}&1&{0.81} \\ {0.81}&{0.98}&{0.81}&1 \end{array}} \right],$$
$$K_{{\{ A,B,C\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.93}&{0.77}&{0.97} \\ {0.93}&1&{0.93}&{0.98} \\ {0.77}&{0.93}&1&{0.88} \\ {0.97}&{0.98}&{0.88}&1 \end{array}} \right]$$

As in the first example, HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of \(K_{{\{ B,C,F\} }}^{\psi }\), \(K_{{\{ A,C,F\} }}^{\psi }\), \(K_{{\{ A,B,F\} }}^{\psi }\) and \(K_{{\{ A,B,C\} }}^{\psi }\), and the values are stored in \({H_1}\):

$${H_1}=\left[ {\begin{array}{*{20}{c}} {0.0133}&{0.0111}&{0.0151}&{0.0064} \end{array}} \right]$$

From \({H_1}\), it is clear that the similarity measure between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and \(K_{{\{ A,B,F\} }}^{\psi }\) is the highest, so feature \(C\) should be removed from \({G_c}\). In the second phase, the following similarity matrices are obtained:

$$K_{{\{ B,F\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.75}&{0.82}&{0.82} \\ {0.75}&1&{0.84}&{0.99} \\ {0.82}&{0.84}&1&{0.85} \\ {0.82}&{0.99}&{0.85}&1 \end{array}} \right],$$
$$K_{{\{ A,F\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.77}&{0.90}&{0.83} \\ {0.77}&1&{0.86}&{0.99} \\ {0.90}&{0.86}&1&{0.88} \\ {0.83}&{0.99}&{0.88}&1 \end{array}} \right],$$
$$K_{{\{ A,B\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.94}&{0.77}&{0.97} \\ {0.94}&1&{0.94}&{0.99} \\ {0.77}&{0.94}&1&{0.88} \\ {0.97}&{0.99}&{0.88}&1 \end{array}} \right]$$

Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of \(K_{{\{ B,F\} }}^{\psi }\), \(K_{{\{ A,F\} }}^{\psi }\) and \(K_{{\{ A,B\} }}^{\psi }\), giving \({H_2}\):

$${H_2}=\left[ {\begin{array}{*{20}{c}} {0.0132}&{0.0110}&{0.0063} \end{array}} \right]$$

According to \({H_2}\), feature \(A\) is the next feature to be eliminated. So, features \(B\) and \(F\) are the most informative features, returned by the SPC-BAHSIC algorithm.

Next, we run SP-FOHSIC on the dataset \({X_G}\) with \(m=4\) samples and \(n=4\) features, named \(G=\left\{ {A,~B,~C,D} \right\}\), where \(n'=2\) informative features are needed.

$${X_G}=\left[ {\begin{array}{*{20}{l}} 1&3&4&2 \\ 3&6&5&4 \\ 5&{10}&4&3 \\ 2&5&4&4 \end{array}} \right]$$

Using the RBF kernel function as \(\phi (.)\), the samples similarity matrix \(K_{{\{ A,B,C,D\} }}^{\phi }\) is computed as:

$$K_{{\{ A,B,C,D\} }}^{\phi }=\left[ {\begin{array}{*{20}{l}} 1&{0.70}&{0.27}&{0.83} \\ {0.70}&1&{0.64}&{0.94} \\ {0.27}&{0.64}&1&{0.50} \\ {0.83}&{0.94}&{0.50}&1 \end{array}} \right]$$

Similarly, using the RBF kernel for \(\psi (.)\), the samples similarity matrices \(K_{{\{ A\} }}^{\psi }\), \(K_{{\{ B\} }}^{\psi }\), \(K_{{\{ C\} }}^{\psi }\) and \(K_{{\{ D\} }}^{\psi }\) are obtained, where each matrix is computed from the samples restricted to the single indicated feature:

$$K_{{\{ A\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.98}&{0.92}&{0.99} \\ {0.98}&1&{0.98}&{0.99} \\ {0.92}&{0.98}&1&{0.96} \\ {0.99}&{0.99}&{0.96}&1 \end{array}} \right],$$
$$K_{{\{ B\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.96}&{0.78}&{0.98} \\ {0.96}&1&{0.92}&{0.99} \\ {0.78}&{0.92}&1&{0.88} \\ {0.98}&{0.99}&{0.88}&1 \end{array}} \right],$$
$$K_{{\{ C\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.99}&1&1 \\ {0.99}&1&{0.99}&{0.99} \\ 1&{0.99}&1&1 \\ 1&{0.99}&1&1 \end{array}} \right],$$
$$K_{{\{ D\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.98}&{0.99}&{0.98} \\ {0.98}&1&{0.99}&1 \\ {0.99}&{0.99}&1&{0.99} \\ {0.98}&1&{0.99}&1 \end{array}} \right]$$

Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of \(K_{{\{ A\} }}^{\psi }\), \(K_{{\{ B\} }}^{\psi }\), \(K_{{\{ C\} }}^{\psi }\) and \(K_{{\{ D\} }}^{\psi }\), yielding \({H_1}\):

$${H_1}=\left[ {\begin{array}{*{20}{c}} {0.0025}&{0.0071}&{2.1554 \times {{10}^{ - 5}}}&{1.7748 \times {{10}^{ - 4}}} \end{array}} \right]$$

According to \({H_1}\), the HSIC2 measure between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and \(K_{{\{ B\} }}^{\psi }\) is the largest. So, among the 4 features, feature \(B\) is the most suitable for selection. In the second phase, the samples similarity matrices \(K_{{\{ B,A\} }}^{\psi }\), \(K_{{\{ B,C\} }}^{\psi }\) and \(K_{{\{ B,D\} }}^{\psi }\) are computed.

$$K_{{\{ B,A\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.94}&{0.72}&{0.97} \\ {0.94}&1&{0.90}&{0.99} \\ {0.72}&{0.90}&1&{0.84} \\ {0.97}&{0.99}&{0.84}&1 \end{array}} \right],$$
$$K_{{\{ B,C\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.95}&{0.78}&{0.98} \\ {0.95}&1&{0.92}&{0.99} \\ {0.78}&{0.92}&1&{0.88} \\ {0.98}&{0.99}&{0.88}&1 \end{array}} \right],$$
$$K_{{\{ B,D\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.94}&{0.78}&{0.96} \\ {0.94}&1&{0.92}&{0.99} \\ {0.78}&{0.92}&1&{0.88} \\ {0.96}&{0.99}&{0.88}&1 \end{array}} \right]$$

Also, the HSIC2 value between \(K_{{\{ A,B,C,D\} }}^{\phi }\) and each of them is stored in \({H_2}\) as:

$${H_2}=\left[ {\begin{array}{*{20}{c}} {0.0093}&{0.0071}&{0.0072} \end{array}} \right]$$

From the HSIC2 values in \({H_2}\), the next candidate for selection is feature \(A\). Thus, features \(A\) and \(B\) are returned by SP-FOHSIC as the most informative features.
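A forward-selection counterpart (SP-FOHSIC-style) can be sketched in the same way: starting from the empty set, at each step the feature whose inclusion maximizes the dependence with the full-feature similarity matrix is added. As before, hsic_biased is only a stand-in for the paper's HSIC2, the bandwidth is assumed, and rbf_kernel and X_G come from the earlier sketches.

```python
def forward_select(X, n_keep, sigma=5.0):
    """Greedy SP-FOHSIC-style forward selection (sketch, with a stand-in HSIC estimate)."""
    K_full = rbf_kernel(X, sigma)
    selected, candidates = [], list(range(X.shape[1]))
    while len(selected) < n_keep:
        scores = [hsic_biased(K_full, rbf_kernel(X[:, selected + [f]], sigma))
                  for f in candidates]
        selected.append(candidates.pop(int(np.argmax(scores))))   # keep the most dependence-preserving feature
    return sorted(selected)

print(forward_select(X_G, n_keep=2))               # indices of the two selected features
```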

As the last example, SPC-FOHSIC is run on the dataset \({X_G}\) with \(m=4\) samples and \(n=6\) features, named \(G=\left\{ {A,~B,~C,D,E,F} \right\}\). As in the previous experiments, our aim is to find the \(n'=2\) most informative features.

$${X_G}=\left[ {\begin{array}{*{20}{l}} 1&3&4&2&4&8 \\ 3&6&5&4&2&1 \\ 5&9&4&3&7&6 \\ 2&5&4&4&3&2 \end{array}} \right]$$

Again, using the gap statistic method, these 6 features are grouped into \(l=4\) clusters, which are determined via the k-means method. Using their centroids, the features \({G_c}=\left\{ {A,~B,~C,F} \right\}\) establish the new feature space, in which the dataset \({X_G}\) is represented as \({X_{{G_c}}}\):

$${X_{{G_c}}}=\left[ {\begin{array}{*{20}{l}} 1&3&4&8 \\ 3&6&5&1 \\ 5&9&4&6 \\ 2&5&4&2 \end{array}} \right]$$
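The number of feature clusters can be estimated with the gap statistic of Tibshirani et al. [23]; a generic sketch is given below, comparing the within-cluster dispersion of the feature vectors with that of uniform reference draws over their bounding box. The defaults (number of reference draws, range of k) are assumptions, not the paper's configuration; NumPy, KMeans, and X_G6 are reused from the earlier sketches.

```python
def gap_statistic(points, k_max, n_refs=20, random_state=0):
    """Estimate the number of clusters via the gap statistic [23] (sketch)."""
    rng = np.random.default_rng(random_state)
    lo, hi = points.min(axis=0), points.max(axis=0)

    def log_dispersion(data, k):
        km = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit(data)
        return np.log(km.inertia_)                 # log within-cluster sum of squares

    gaps, s = [], []
    for k in range(1, k_max + 1):
        ref = np.array([log_dispersion(rng.uniform(lo, hi, size=points.shape), k)
                        for _ in range(n_refs)])
        gaps.append(ref.mean() - log_dispersion(points, k))
        s.append(ref.std() * np.sqrt(1.0 + 1.0 / n_refs))

    for k in range(1, k_max):                      # smallest k with gap(k) >= gap(k+1) - s(k+1)
        if gaps[k - 1] >= gaps[k] - s[k]:
            return k
    return k_max

l = gap_statistic(X_G6.T, k_max=5)                 # cluster the 6 feature vectors (columns of X_G)
```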

The remaining steps of SPC-FOHSIC are the same as those of SP-FOHSIC. Again, using the RBF kernel function as \(\phi (.)\), the samples similarity matrix \(K_{{\{ A,B,C,F\} }}^{\phi }\) is computed as:

$$K_{{\{ A,B,C,F\} }}^{\phi }=\left[ {\begin{array}{*{20}{l}} 1&{0.28}&{0.33}&{0.44} \\ {0.28}&1&{0.46}&{0.92} \\ {0.33}&{0.46}&1&{0.44} \\ {0.44}&{0.92}&{0.44}&1 \end{array}} \right]$$

Also, \(\psi (.)\) is used to compute the following similarity matrices:

$$K_{{\{ A\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.98}&{0.92}&{0.99} \\ {0.98}&1&{0.98}&{0.99} \\ {0.92}&{0.98}&1&{0.96} \\ {0.99}&{0.99}&{0.96}&1 \end{array}} \right],$$
$$K_{{\{ B\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.96}&{0.83}&{0.98} \\ {0.96}&1&{0.96}&{0.99} \\ {0.83}&{0.96}&1&{0.92} \\ {0.98}&{0.99}&{0.92}&1 \end{array}} \right],$$
$$K_{{\{ C\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.99}&1&1 \\ {0.99}&1&{0.99}&{0.99} \\ 1&{0.99}&1&1 \\ 1&{0.99}&1&1 \end{array}} \right],$$
$$K_{{\{ F\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.78}&{0.98}&{0.83} \\ {0.78}&1&{0.88}&{0.99} \\ {0.98}&{0.88}&1&{0.92} \\ {0.83}&{0.99}&{0.92}&1 \end{array}} \right]$$

Similarly, HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of \(K_{{\{ A\} }}^{\psi }\), \(K_{{\{ B\} }}^{\psi }\), \(K_{{\{ C\} }}^{\psi }\) and \(K_{{\{ F\} }}^{\psi }\) and their values are stored in \({H_1}\) as:

$${H_1}=\left[ {\begin{array}{*{20}{c}} {0.0020}&{0.0044}&{9.6213 \times {{10}^{ - 5}}}&{0.0091} \end{array}} \right]$$

From \({H_1}\), it is clear that the similarity measure between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and \(K_{{\{ F\} }}^{\psi }\) is the highest, so feature \(F\) should be selected. In the second phase, the following similarity matrices are obtained:

$$K_{{\{ F,A\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.77}&{0.90}&{0.83} \\ {0.77}&1&{0.86}&{0.99} \\ {0.90}&{0.86}&1&{0.88} \\ {0.83}&{0.99}&{0.88}&1 \end{array}} \right],$$
$$K_{{\{ F,B\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.75}&{0.82}&{0.82} \\ {0.75}&1&{0.84}&{0.99} \\ {0.82}&{0.84}&1&{0.85} \\ {0.82}&{0.99}&{0.85}&1 \end{array}} \right]$$
$$K_{{\{ F,C\} }}^{\psi }=\left[ {\begin{array}{*{20}{l}} 1&{0.78}&{0.98}&{0.83} \\ {0.78}&1&{0.88}&{0.99} \\ {0.98}&{0.88}&1&{0.92} \\ {0.83}&{0.99}&{0.92}&1 \end{array}} \right]$$

Using (14), HSIC2 is computed between \(K_{{\{ A,B,C,F\} }}^{\phi }\) and each of \(K_{{\{ F,A\} }}^{\psi }\), \(K_{{\{ F,B\} }}^{\psi }\) and \(K_{{\{ F,C\} }}^{\psi }\), giving \({H_2}\):

$${H_2}=\left[ {\begin{array}{*{20}{c}} {0.0110}&{0.0132}&{0.0092} \end{array}} \right]$$

According to \({H_2}\), feature \(B\) is the next feature to be selected. So, features \(B\) and \(F\) are the most informative features, returned by the SPC-FOHSIC algorithm.
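Putting the sketches together, a hypothetical end-to-end run in the spirit of SPC-FOHSIC on the six-feature toy data would look as follows. Because the HSIC estimate, the bandwidths, and the representative rule are all assumptions here, its output need not coincide with the features reported above.

```python
l = gap_statistic(X_G6.T, k_max=5)                 # estimated number of feature clusters
reps = cluster_representatives(X_G6, l)            # one representative feature per cluster
picked = forward_select(X_G6[:, reps], n_keep=2)   # forward-select n' = 2 among the representatives
print([reps[i] for i in picked])                   # map back to indices in the original feature set G
```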


Cite this article

Liaghat, S., Mansoori, E.G. Filter-based unsupervised feature selection using Hilbert–Schmidt independence criterion. Int. J. Mach. Learn. & Cyber. 10, 2313–2328 (2019). https://doi.org/10.1007/s13042-018-0869-7
