Introduction

As data such as images, meteorological records, and particularly documents continue to grow, so do the dimensions of these data [1]. As has been studied extensively, the accuracy of current machine learning methods generally decreases on high-dimensional data, a phenomenon referred to as the curse of dimensionality. An essential issue for machine learning techniques is therefore the high dimensionality of datasets in which the number of features is much greater than the number of patterns. For example, in medical applications with very high-dimensional datasets, the number of classification parameters also increases, and the performance of the classifier declines significantly [2,3,4].

To prevent the curse of dimensionality, dimension (feature) reduction techniques are used [5,6,7]. Traditional dimensionality reduction techniques are divided into two main categories: feature extraction and feature selection [8]. In the first approach, secondary low-dimensional features are extracted instead of the original features; that is, the high-dimensional space is transformed into a low-dimensional one. The second approach includes four sub-categories: filter, wrapper, hybrid, and embedded methods [9, 10]. Filter methods select the subset of features in a pre-processing step, independent of any learning method [11]. In contrast, wrapper methods apply a learning method to evaluate subsets of features based on their predictive power. When dealing with large-scale data and side information, each of these methods has advantages and disadvantages in terms of running time, consistency with the data, efficiency, and accuracy.

Feature selection approaches are divided into three main groups: supervised, unsupervised, and semi-supervised [7]. In the supervised setting, the labels of the dataset are available, and the evaluation and selection of suitable features are based on them. In the unsupervised setting, class labels are not available, and evaluation and selection are based on the ability to satisfy certain properties of the dataset, such as locality preservation and/or variance. Since in most datasets labels or side information are available only in small quantities and obtaining them is costly, semi-supervised or constrained methods are used. Semi-supervised feature selection methods use both labeled and unlabeled data; an alternative form of semi-supervision is the pairwise constraint, where not all samples have labels but side information in the form of pairwise constraints is available [12, 13].

A pairwise constraint is a pair of samples that belong to different clusters (cannot-link) or to the same cluster (must-link) [14]. In the real world, when labels are lacking, pairwise constraints are often the best available information for feature selection. Obtaining labels is generally costly, whereas in many cases such constraints exist inherently. When labels do exist, they can be converted into pairwise constraints by transitive closure (and vice versa), which is one of the advantages of working with pairwise constraints [15]. Because of the importance, inherent availability, and low cost of pairwise constraints, many studies have been conducted on them, such as the development of constrained algorithms that incorporate pairwise constraints into machine learning tasks, active learning algorithms that query the most valuable pairs to increase accuracy, and modifications of the objective functions of machine learning tasks. One topic that has rarely been addressed is feature selection based on pairwise constraints. The purpose of such a method is to reduce the dimensionality while taking the pairwise constraints into account, so that the constrained algorithm achieves the best accuracy and efficiency. Most of the methods available in this field are improvements over previous similar (usually unsupervised) feature selection methods.

In the present paper, a novel pairwise-constraint-based method is proposed for feature selection and dimensionality reduction. Our method is complementary to previous methods. In addition to the constraints themselves, the quality of the constraints is also used. The quality of a pairwise constraint is the strength of the relationship between the two samples, or conversely its uncertainty. In the proposed method, the similarity between the constrained pairs is first calculated, and an uncertainty region is then created based on it. The uncertainty region and its coefficients indicate the strength and quality of each pairwise constraint. These coefficients are combined with a previous basic method, and the most informative pairs are then selected in an iterative process. Comparing the proposed method with previous methods shows a considerable improvement. It can be argued that the proposed method reduces the computational complexity of the machine learning algorithm while increasing the classification accuracy. On the other hand, the number of final selected features imposes another challenge on feature selection methods: the number of relevant and non-redundant features is unknown, so the optimal number of selected features is not known either. In the proposed method, unlike many previous works, the optimal number of selected features is determined automatically based on the overall structure of the original features and their inner similarities.

The rest of this paper is organized as follows. “Related work” section summarizes related work on feature selection. “Proposed methodology” section introduces some preliminaries of this work and describes our proposed method (PCFS) in detail. The results of the simulations and the experimental analysis are presented in “Experimental analysis” section. The conclusion is given in “Conclusion” section.

Related work

Dimensionality reduction techniques are mostly divided into two categories: feature extraction and feature selection [16,17,18]. In feature extraction methods, the data is transformed from the original space into a new space with fewer dimensions. In contrast, feature selection methods directly reduce the size of the dataset by picking a subset of relevant and non-redundant features that retains adequate information for the learning task [19]. The objective of feature selection methods is to seek the related features with the most predictive information from the original feature set [20]. Feature selection has proved to be an essential technique in many practical applications, including text processing [21,22,23], face recognition [24,25,26], image retrieval [27, 28], medical diagnosis [29], case-based reasoning [30], and bioinformatics [31]. Feature selection is one of the basic research subjects in pattern recognition, with a long history starting in the 1970s. Many attempts have also been made to review feature selection approaches [2,3,4].

Depending on the availability of the class labels of the training data, feature selection methods can be roughly divided into three categories: supervised, unsupervised, and semi-supervised feature selection [2, 29, 32]. In supervised approaches, training samples are characterized by a vector of feature values together with class labels, which are used to direct the search process towards relevant information; in unsupervised feature selection, the feature vectors are described without class labels [33]. Since label information is used, supervised feature selection methods often show better performance than unsupervised and semi-supervised techniques [34]. In a large number of real-world applications, collecting labeled patterns is hard, and there are abundant unlabeled data but only a few labeled patterns. In order to handle this ‘incomplete supervision,’ semi-supervised (pairwise constraint) feature selection methods were developed, which use both unlabeled and labeled data for the machine learning task. Semi-supervised feature selection methods use the local structure of both labeled and unlabeled data, or the label information of the labeled data together with the data distribution, to select the final relevant and non-redundant features. In semi-supervised learning, part of the data is labeled and part of it is unlabeled. Consequently, semi-supervised feature selection is a more complex problem, and research in this area has recently attracted growing interest in many communities. Sheikhpour et al. [35] provide a survey of semi-supervised feature selection methods and introduce taxonomies of these methods based on two different aspects. In [36], a novel graph-based semi-supervised sparse feature selection method is developed based on mixed convex and non-convex minimization; the reported results show that this method selects a non-redundant and optimal subset of features and improves the performance of the machine learning task. In [37], a semi-supervised feature selection method is presented that integrates the neighborhood discriminant index and the Laplacian score to work efficiently with both unlabeled and labeled data. The aim of this method is to find a set of relevant features that preserves the local geometrical structure well and identifies samples belonging to different classes. Moreover, in [38], a semi-supervised feature selection method is developed for bipolar disorder, in which a novel semi-supervised technique is utilized to reduce the dimension of high-dimensional data. Also, Liu et al. [39] proposed a rough-set-based semi-supervised feature selection method, in which the unlabeled data are predicted via various semi-supervised learning methods and the local neighborhood decision error rate is used to create multiple fitness functions that evaluate the relevance of the generated feature sets.

Feature selection methods can also be divided into four categories: filter, wrapper, embedded, and hybrid approaches [40, 41]. In filter-based methods, every single feature is ranked on the basis of its discriminating power among the classes, without considering any learning algorithm; the filter approach requires only a statistical analysis of the feature set to select the final features [42, 43]. On the contrary, a learning algorithm is applied in wrapper-based feature selection methods to iteratively assess the quality of feature subsets in the search space [44, 45]. The wrapper approach incurs a high computational cost on high-dimensional datasets, since every single subset is evaluated by a specified learning model. In the embedded model, feature selection is part of the model-building process, so both redundant and irrelevant features can be handled; however, training learning algorithms with a considerable number of features still takes a great deal of time. The purpose of hybrid approaches, on the other hand, is to combine the good performance of the wrapper model with the computational efficiency of the filter model. Accuracy may nevertheless be a challenge in the hybrid model, since the filter and wrapper models are treated as two separate steps [46].

Term Variance (TV) [47], the Laplacian Score for feature selection (LS) [48], Relevance-Redundancy Feature Selection (RRFS) [49], and Unsupervised Feature Selection based on Ant Colony Optimization (UFSACO) [50] are existing filter-based unsupervised feature selection methods. Unsupervised wrapper feature selection methods, in contrast, use a clustering algorithm to evaluate the quality of the picked features. The major disadvantage of these approaches is their higher computational complexity, which results from the use of specified learning algorithms, and they have been shown to be inefficient on datasets with many features. Unsupervised filter methods, by contrast, require only a statistical analysis of the feature set to solve the feature selection task, without employing any learning model. A feature selection method may be evaluated in terms of effectiveness and efficiency: the time needed to discover a subset of features determines the efficiency, while the effectiveness is associated with the quality of the selected subset. These goals conflict with each other; in general, improving one degrades the other. In other words, filter-based feature selection methods are advantageous in computational time and are typically faster, whereas unsupervised wrapper methods focus on the quality of the selected features.

Recently, graph-based methods, including graph theory [51,52,53], spectral embedding [54], spectral clustering [55], and semi-supervised learning [56], have contributed significantly to feature selection because of their capability to encode similarity relationships among the features. Many graph-based unsupervised and semi-supervised feature selection methods have been presented to extract the relationships among the features. For example, a spectral semi-supervised feature selection criterion called the s-Laplacian score was presented by Cheng et al. [57]. Based on this criterion, a Graph-based Semi-Supervised Feature Selection method called GSFS was proposed, in which conditional mutual information and spectral graph theory are employed to select relevant features and remove redundant ones. Moreover, in [58], the authors designed a graph-theoretic method for non-redundant unsupervised feature selection, which formulates the feature selection task as finding the densest subgraph of a weighted graph. In [59], a dense-subgraph-finding method is adopted for the unsupervised feature selection problem, and a novel normalized mutual information is used to calculate the similarity between two features.

Proposed methodology

The details of the proposed method are explained in this section. First, the general concepts related to the proposed method are presented, and then the details of the proposed semi-supervised feature selection method are introduced.

Background and notation

Before presenting the algorithm, let us review some definitions and concepts that form its foundations.

Neighborhoods and pairwise constraint

Laplacian ranking is the basis of unsupervised feature selection methods, including feature selection with pairwise constraints; in this approach, the features with the strongest locality preserving ability are selected. The key assumption in Laplacian feature selection is that data belonging to the same class lie close together and are more similar. The Laplacian ranking Lr of the r-th feature, which should be minimized, is expressed by Eq. (1):

$$ L_{r} = \frac{\sum\nolimits_{i,j} \left( f_{ri} - f_{rj} \right)^{2} S_{ij}}{\sum\nolimits_{i} \left( f_{ri} - \mu_{r} \right)^{2} D_{ii}}, \qquad D_{ii} = \sum\nolimits_{j} S_{ij}, \qquad S_{ij} = \begin{cases} e^{- \frac{\left\| x_{i} - x_{j} \right\|^{2}}{t}}, & \text{if } x_{i} \text{ and } x_{j} \text{ are neighbors} \\ 0, & \text{otherwise} \end{cases} $$
(1)

Here, \( \mu_{r} \) denotes the mean of the r-th feature, Sij expresses the neighborhood relationship between each pair of samples, and t is a fixed value set at initialization. Being neighbors means that xi can be reached from xj through the K nearest neighbors, and the neighborhood relation can be based on various notions, such as the similarity of the samples to each other. The ranking above is unsupervised and uses no information other than the dataset itself. This article builds on concepts such as Laplacian ranking and neighborhoods and, under the assumption that pairwise constraints are given as ML (must-link) and CL (cannot-link) sets, attempts to select and rank appropriate features. First, the ML and CL sets are prepared together with the dataset; then, using Eq. (2), the features are ranked.

$$ C_{r}^{1} = \frac{\sum\nolimits_{\left( x_{i}, x_{j} \right) \in CL} \left( f_{ri} - f_{rj} \right)^{2}}{\sum\nolimits_{\left( x_{i}, x_{j} \right) \in ML} \left( f_{ri} - f_{rj} \right)^{2}}, \qquad C_{r}^{2} = \sum\nolimits_{\left( x_{i}, x_{j} \right) \in CL} \left( f_{ri} - f_{rj} \right)^{2} - \lambda \sum\nolimits_{\left( x_{i}, x_{j} \right) \in ML} \left( f_{ri} - f_{rj} \right)^{2} $$
(2)

Here, \( C_{\text{r}}^{1} \) and \( C_{\text{r}}^{2} \) represent two types of rankings based on the pairwise constraints. In effect, the features selected are those with the best ability to preserve the constraints. If two samples are in the ML set, a relevant feature is one whose values for the two samples are close together; if the two samples are in the CL set, a relevant feature is one whose values are far apart. In the following, both rankings are calculated for each feature, and feature selection is done using the maximum of the two ranking values.
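To make Eq. (2) concrete, the following minimal NumPy sketch computes both constraint scores for every feature. The function name, the default value of λ, and the small ε added to the denominator are illustrative assumptions rather than part of the original formulation.

```python
import numpy as np

def constraint_scores(X, ML, CL, lam=1.0):
    """Constraint-preserving feature scores in the spirit of Eq. (2).

    X  : (n_samples, n_features) data matrix
    ML : list of (i, j) index pairs that must be in the same cluster
    CL : list of (i, j) index pairs that cannot be in the same cluster
    lam: trade-off coefficient (lambda in Eq. (2)); the default is an assumption
    """
    ml_gap = np.array([(X[i] - X[j]) ** 2 for i, j in ML]).sum(axis=0)
    cl_gap = np.array([(X[i] - X[j]) ** 2 for i, j in CL]).sum(axis=0)
    eps = 1e-12                       # guard against division by zero on constant features
    c1 = cl_gap / (ml_gap + eps)      # ratio form: large CL spread, small ML spread
    c2 = cl_gap - lam * ml_gap        # difference form with penalty lambda
    return c1, c2
```

Features with the largest scores would then be selected, consistent with the maximization described above.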

In general, if \( \left\{ {x_{i} ,x_{j} ,x_{k} } \right\} \) are three samples of the dataset, each pair’s relationship is expressed as \( \left\{ {{\text{ML}}, {\text{CL}}} \right\} \), and the clustering label is denoted by lab, then the relations in Eq. (3) must hold. Neighborhoods can be formed by the closure of the pairwise constraints.


$$ \begin{aligned} & \left( x_{i}, x_{j}, ML \right) \wedge \left( x_{i}, x_{k}, ML \right) \Rightarrow \left( x_{j}, x_{k}, ML \right) \\ & \left( x_{i}, x_{j}, ML \right) \wedge \left( x_{i}, x_{k}, CL \right) \Rightarrow \left( x_{j}, x_{k}, CL \right) \\ & \left( x_{i}, x_{j}, ML \right) \Leftrightarrow lab_{i} = lab_{j}, \; \text{in the same cluster} \\ & \left( x_{i}, x_{j}, CL \right) \Leftrightarrow lab_{i} \ne lab_{j}, \; \text{not in the same cluster} \end{aligned} $$
(3)

Neighborhoods are a set of groups whose number is usually smaller than or equal to the number of clusters defined in the algorithm. Each neighborhood includes several samples that must all be in the same cluster. The basic premise is that samples belonging to different clusters should be placed in different neighborhoods, and no two neighborhoods should contain samples of the same cluster.
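As an illustration of how neighborhoods could be built from the transitive closure of the must-link constraints in Eq. (3), the following union-find sketch groups samples into connected components. Checking cannot-link consistency between the resulting groups is omitted here, and the function name is an assumption.

```python
def build_neighborhoods(n_samples, ML):
    """Group samples into neighborhoods via the must-link closure (a sketch)."""
    parent = list(range(n_samples))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    for i, j in ML:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj                 # merge the two components

    groups = {}
    for i in range(n_samples):
        groups.setdefault(find(i), []).append(i)
    # keep only components actually tied together by at least one constraint
    return [members for members in groups.values() if len(members) > 1]
```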

Measuring the uncertainty of constraints

In the real world, constraints arise from domain or expert knowledge. Pairwise constraints can express weak relationships, and the strength (or uncertainty) of these relations varies. Hence, an uncertainty region needs to be created; by finding this region, it is easy to influence the ranking and obtain better results in the reduced dimensions. To do this, the authors use the thresholding histogram method, which is essentially a two-class classification of the similarity values whose purpose is to reduce ambiguity in their range. First, the similarity values of each pair in the Sen matrix are collected; these values are then divided into intervals, and the average of each interval is taken as \( D_{i} \). In the next step, the number of pairs falling in each interval is counted as g(\( D_{i} \)). From these values, a weighted moving average with a window of five, f(\( D_{i} \)), is calculated by Eq. (4). Starting from the beginning of the intervals, the first valley point of the modified histogram, \( f\left( {D_{v} } \right) \), is found. Finally, the uncertainty region is calculated. A short code sketch of these four steps is given after Step 4.

Step 1:

$$ f\left( D_{i} \right) = \frac{g\left( D_{i} \right)}{\sum\nolimits_{e = 1}^{z - 1} g\left( D_{e} \right)} \times \frac{g\left( D_{i - 2} \right) + g\left( D_{i - 1} \right) + g\left( D_{i} \right) + g\left( D_{i + 1} \right) + g\left( D_{i + 2} \right)}{5}, \quad \forall i = 2, 3, \ldots, z - 3 $$
(4)

Step 2: find the first valley points subject to:

$$ f\left( {D_{v - 1} } \right) > f\left( {D_{v} } \right)\; {\text{and}}\; f\left( {D_{v} } \right) < f\left( {D_{v + 1} } \right) $$
(5)

Step 3: find the boundary of the uncertainty region:

$$ m_{d} = D_{v} \quad \text{and} \quad m_{c} = \max_{i} \left( D_{i} \right) - m_{d} $$
(6)

Step 4: find the pairs in the similarity matrix that have an uncertain relationship:

$$ \text{Similarity matrix } S_{en_{ij}}: \begin{cases} m_{d} \le S_{en_{ij}} \le m_{c}, & \text{uncertainty region} \\ \text{otherwise}, & \text{strong region} \end{cases} \quad \forall i, j $$
(7)
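A minimal NumPy sketch of Steps 1–4 is given below. The number of bins z, the use of the upper-triangular entries of the similarity matrix, and the fallback valley index are assumptions made to obtain a runnable example.

```python
import numpy as np

def uncertainty_region(S_en, z=32):
    """Thresholding-histogram sketch of Steps 1-4 (Eqs. (4)-(7))."""
    # pairwise similarity values from the upper triangle of the Sen matrix
    sims = S_en[np.triu_indices_from(S_en, k=1)]
    counts, edges = np.histogram(sims, bins=z)           # g(D_i) per interval
    D = 0.5 * (edges[:-1] + edges[1:])                    # interval averages D_i

    # Step 1 (Eq. (4)): frequency-weighted five-bin moving average
    f = np.zeros(z)
    for i in range(2, z - 2):
        f[i] = counts[i] / counts.sum() * counts[i - 2:i + 3].mean()

    # Step 2 (Eq. (5)): first valley of the smoothed histogram (fallback: middle bin)
    v = next((i for i in range(3, z - 3) if f[i - 1] > f[i] < f[i + 1]), z // 2)

    # Step 3 (Eq. (6)): boundaries of the uncertainty region
    m_d = D[v]
    m_c = D.max() - m_d

    # Step 4 (Eq. (7)): pairs whose similarity falls inside [m_d, m_c] are uncertain
    uncertain = (S_en >= m_d) & (S_en <= m_c)
    return m_d, m_c, uncertain
```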

Weights of the terms obtained

Given that each feature has a certain weight and importance, and not all features may be required for the machine learning task, the first step is to determine the weight of each feature. For this purpose, the Laplacian Score (LS) is used. LS is an unsupervised univariate filter method based on the observation that if two data points are close to each other, they are likely to belong to the same class. The basic idea of LS is to evaluate the relevance of a feature according to its locality preserving power. The LS of a feature \( A \) is determined using Eq. (8):

$$ LS\left( S, A \right) = \frac{\sum\nolimits_{i,j} \left( A(i) - A(j) \right)^{2} S_{ij}}{\sum\nolimits_{i} \left( A(i) - \bar{A} \right)^{2} D_{ii}} $$
(8)

where A(i) represents the value of the feature A for the \( i \)-th pattern, \( \bar{A} \) denotes the average of the feature A, D is a diagonal matrix with \( D_{ii} = \sum\nolimits_{j} {S_{ij} } \), and \( S_{ij} \) represents the neighborhood relation between patterns, calculated as in Eq. (9):

$$ S_{ij} = \begin{cases} e^{- \frac{\left\| x_{i} - x_{j} \right\|^{2}}{t}}, & \text{if } x_{i} \text{ and } x_{j} \text{ are neighbors} \\ 0, & \text{otherwise} \end{cases} $$
(9)

where \( t \) is a suitable constant, \( x_{i} \) represents the i-th pattern, and \( x_{i} \) and \( x_{j} \) are neighbors if \( x_{i} \) is among the \( k \) nearest neighbors of \( x_{j} \) or \( x_{j} \) is among the \( k \) nearest neighbors of \( x_{i} \).
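For reference, the following small NumPy/scikit-learn sketch computes the Laplacian Score as defined in Eqs. (8) and (9). The neighbourhood size k, the kernel width t, and the ε guard in the denominator are assumed defaults, not values prescribed by the paper.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

def laplacian_score(X, k=5, t=1.0):
    """Laplacian Score per Eqs. (8)-(9); smaller scores mean better locality preservation."""
    # symmetric k-nearest-neighbour graph with heat-kernel weights (Eq. (9))
    dist = kneighbors_graph(X, n_neighbors=k, mode='distance').toarray()
    neighbors = (dist > 0) | (dist.T > 0)
    sq_dist = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    S = np.where(neighbors, np.exp(-sq_dist / t), 0.0)

    D = S.sum(axis=1)                                        # diagonal entries D_ii
    scores = np.empty(X.shape[1])
    for r in range(X.shape[1]):
        A = X[:, r]
        num = (((A[:, None] - A[None, :]) ** 2) * S).sum()   # numerator of Eq. (8)
        den = (((A - A.mean()) ** 2) * D).sum()              # denominator of Eq. (8)
        scores[r] = num / (den + 1e-12)
    return scores
```

Features with the smallest scores would then be kept, in line with the discussion above.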

The proposed PCFS algorithm

In this section, the novel Pairwise Constraint Feature Selection method (PCFS) is proposed. This method builds on PCK-means, one of the soft-constraint clustering algorithms, with small but effective changes. By changing the objective function, the proposed method is able to use both the standard objective function and a penalty for the violation of constraints; these two parts together constitute the objective function and are minimized locally. The proposed Dim-reduce() function is affected by the current clustering and vice versa.

figure a (Algorithm 1: the proposed PCFS method)

Briefly, the dataset is represented as a data matrix, and the other variables are initialized. The whole procedure is repeated in a loop until the clusters no longer change (or for a predefined number of iterations). In each iteration, given the current clustering and the sets of ML and CL constraints, Dim-reduce() is performed to produce a reduced feature set (line 2). After this, neighborhoods are formed from the closure of the pairwise constraints, and the center of the pairwise constraints of each neighborhood is calculated. If a neighborhood does not contain any data, a random sample that is not a member of any other neighborhood is taken as the center of that cluster. Finally, the cluster centers are initialized with the centers of the neighborhoods (lines 3–6). To assign samples to clusters and to estimate (update) the cluster centers, steps A and B are performed (lines 8–9). These two steps are repeated until convergence, as in PCK-means. After convergence, the procedure is repeated until the stop conditions are met. The Dim-reduce() function is the core of PCFS and is summarized in Algorithm 2. In this method, in addition to the usual inputs of feature selection, the pairwise constraints are given as input.
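To make the flow of Algorithm 1 easier to follow, the self-contained toy sketch below mimics the outer loop described above: a constraint-based ranking step stands in for Dim-reduce(), cluster centres are re-estimated from the current assignment, and a PCK-means-style assignment adds a fixed penalty w for every violated constraint. All function and parameter names, the ratio score used for ranking, the empty-cluster handling, and the penalty weight are simplifying assumptions, not the authors' exact implementation.

```python
import numpy as np

def pcfs_sketch(X, ML, CL, n_clusters, n_keep, w=1.0, max_iter=20, seed=0):
    """Toy version of the PCFS outer loop (a sketch, not the original algorithm)."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(n_clusters, size=len(X))          # arbitrary initial clustering

    for _ in range(max_iter):
        # stand-in for Dim-reduce(): keep the n_keep features that best preserve the constraints
        ml_gap = sum((X[i] - X[j]) ** 2 for i, j in ML)
        cl_gap = sum((X[i] - X[j]) ** 2 for i, j in CL)
        selected = np.argsort(-(cl_gap / (ml_gap + 1e-12)))[:n_keep]
        Xr = X[:, selected]

        # cluster centres from the current assignment (random sample for empty clusters)
        centers = np.array([Xr[labels == c].mean(axis=0) if np.any(labels == c)
                            else Xr[rng.integers(len(Xr))] for c in range(n_clusters)])

        # PCK-means-style assignment: squared distance plus penalties for violated constraints
        new_labels = labels.copy()
        for i in range(len(Xr)):
            cost = ((centers - Xr[i]) ** 2).sum(axis=1)
            for a, b in ML:
                j = b if a == i else a if b == i else None
                if j is not None:                            # penalise clusters other than the partner's
                    cost += w * (np.arange(n_clusters) != new_labels[j])
            for a, b in CL:
                j = b if a == i else a if b == i else None
                if j is not None:                            # penalise the partner's cluster
                    cost += w * (np.arange(n_clusters) == new_labels[j])
            new_labels[i] = int(np.argmin(cost))

        if np.array_equal(new_labels, labels):               # stop when the clustering stabilises
            break
        labels = new_labels

    return selected, labels
```

In the actual method, the feature ranking and the constraint handling are those of Algorithms 1–4 and Eq. (10); the sketch only illustrates how the clustering and the feature-reduction step feed into each other.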

figure b (Algorithm 2: the Dim-reduce() function)

This algorithm contains two main functions, Sen-func() and Str-unc(), which are presented in Algorithm 3 and Algorithm 4, respectively. The first function extracts the matrix of similarities between data pairs; the second then calculates the uncertainty region and the strength of the relationship for each pair. After these two functions have been computed, the features are ranked within an iterative process using Eq. (10). The iterations continue until the selected features no longer change.

$$ C_{b} = \frac{\sum\nolimits_{\left( x_{i}, x_{j} \right) \in ML} \left( f_{bi} - f_{bj} \right)^{2} \times S_{tr_{ij}} + \frac{\left( 1 - S_{en_{ij}} \right)}{\sum\nolimits_{\left( x_{k}, x_{z} \right) \in ML} \left( 1 - S_{en_{kz}} \right)} \times \left( 1 - S_{tr_{ij}} \right)}{\sum\nolimits_{\left( x_{i}, x_{j} \right) \in CL} \left( f_{bi} - f_{bj} \right)^{2} \times S_{tr_{ij}} + \frac{\left( 1 - S_{en_{ij}} \right)}{\sum\nolimits_{\left( x_{k}, x_{z} \right) \in CL} \left( 1 - S_{en_{kz}} \right)} \times \left( 1 - S_{tr_{ij}} \right)} $$
(10)

Here, Strij indicates the quality (strength) of the relationship between each data pair, and each element of the matrix is calculated through the uncertainty region. For ranking the features, this formula assumes that if the strength of a pair (in the set of pairwise constraints) is low, the similarity matrix is mostly used; otherwise (when the relationship of the pair is reliable and strong), the Minkowski distance is used. In this way, strength and quality are incorporated into the formula, and better results can thereby be obtained. The calculation of the similarity matrix is summarized in Algorithm 3. First, the current cluster assignments are used as the labels of the dataset (lines 3–6). Then a classification model is trained on the dataset with the labels produced by the clustering (line 8). In an iterative process, a similarity matrix based on the labels predicted by the classification model is created; over the iterations this similarity matrix is updated and normalized.

figure c (Algorithm 3: the Sen-func() similarity computation)
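The idea behind Sen-func() can be sketched as follows: the current cluster assignments are treated as pseudo-labels, a classifier is repeatedly fitted, and the fraction of rounds in which two samples receive the same predicted label is taken as their similarity. The choice of classifier, the bootstrap resampling, and the number of rounds are assumptions introduced to make the example runnable; they are not the authors' exact procedure.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def sen_similarity(X, cluster_labels, n_rounds=10, seed=0):
    """Co-association style sketch of the Sen similarity matrix."""
    rng = np.random.default_rng(seed)
    n = len(X)
    S_en = np.zeros((n, n))
    for _ in range(n_rounds):
        idx = rng.choice(n, size=n, replace=True)            # bootstrap sample
        clf = DecisionTreeClassifier(random_state=0).fit(X[idx], cluster_labels[idx])
        pred = clf.predict(X)                                 # anticipated labels
        S_en += (pred[:, None] == pred[None, :])              # same-label co-occurrence
    return S_en / n_rounds                                    # normalised to [0, 1]
```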

Finally, the calculation of the strength matrix Str and the uncertainty region is summarized in Algorithm 4. After finding the uncertainty region (line 3), the Str matrix is calculated. For data pairs that lie in the uncertainty region, their relative strength is set to β, and outside this region it is 1 − β. The β parameter was chosen after several preliminary runs, and its value is empirically set to 0.3.

figure d (Algorithm 4: the Str-unc() strength computation)
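A minimal sketch of this step, assuming the boundaries m_d and m_c computed earlier, could look as follows:

```python
import numpy as np

def strength_matrix(S_en, m_d, m_c, beta=0.3):
    # pairs inside the uncertainty region get strength beta, all others 1 - beta
    uncertain = (S_en >= m_d) & (S_en <= m_c)
    return np.where(uncertain, beta, 1.0 - beta)
```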

Experimental analysis

To investigate the performance of the proposed method (i.e., PCFS), several extensive experiments are performed. The obtained results are compared with seven state-of-the-art and well-known methods, namely LS [48], GCNC [60], FGUFS [61], FS [62], FAST [63], FJMI [64], and PCA [65], which are briefly described below.

LS (Laplacian Score): a graph-based feature selection method that works in unsupervised mode. This method models the data space as a graph and is based on the idea that if two data points are near each other, they probably belong to the same class.

GCNC (Graph Clustering with the Node Centrality): GCNC is a feature selection method, in which the concept of graph clustering is integrated with the node centrality. This approach can handle both redundant and irrelevant features.

FGUFS (Factor Graph Model for Unsupervised Feature Selection): The similarities between features are explicitly measured in this method. These similarities are passed to each other as messages in the graph model. The message-passing algorithm is applied to calculate the importance score of each feature, and then the selection of features is performed on the basis of the final importance scores.

FS (Fisher Score): a univariate filter method that scores features so that, based on the feature, the distance between samples from the same class is small and the distance between samples from different classes is large. This criterion therefore gives higher ratings to features with such a separation property.

FAST (Fast clustering-based feature selection method): In this method, the graph-theoretic clustering methods are used to divide the features into clusters. Then the most representative feature that is significantly associated with target classes is picked from each cluster to develop a subset of features.

FJMI (Five-way Joint Mutual Information): a feature selection method in which two-way through five-way interactions between features and the class label are considered.

PCA (principal component analysis): PCA is a linear transformation-based multivariate analytical dimensionality reduction algorithm. PCA is often utilized to extract significant information from the high dimensional dataset.

The results are reported in terms of two measures: the classification accuracy (ACC) and the number of selected features. ACC is defined as follows:

$$ ACC = \frac{TP + TN}{TP + TN + FP + FN} $$
(11)

where TP, TN, FP, and FN stand for the number of true positives, true negatives, false positives, and false negatives, respectively.

Datasets

In the present study, several datasets with different properties are used in the experiments to demonstrate the robustness and effectiveness of the proposed approach. These datasets, SPECTF, SpamBase, Sonar, Arrhythmia, Madelon, Isolet, Multiple Features, and Colon, are taken from the UCI repository [66] and have been used extensively in the literature. Table 1 presents their basic characteristics. The datasets have been chosen so as to cover a range of characteristics, including the number of classes, the number of features, and the number of samples. For instance, Colon is a very high-dimensional dataset with a small sample size, whereas SpamBase is an example of a low-dimensional dataset with a large sample size, and Isolet is a multi-class dataset with 26 different classes. In these experiments, the generation of pairwise constraints is simulated as follows: pairs of samples are randomly selected from the training data, and a must-link or cannot-link constraint is created depending on whether the underlying classes of the two samples are the same or different.
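A minimal sketch of this constraint-generation procedure, with the random generator and its seed as assumptions, could look as follows:

```python
import numpy as np

def generate_constraints(y_train, n_constraints, seed=0):
    """Randomly pick training pairs; must-link if classes agree, cannot-link otherwise."""
    rng = np.random.default_rng(seed)
    ML, CL = [], []
    for _ in range(n_constraints):
        i, j = rng.choice(len(y_train), size=2, replace=False)
        (ML if y_train[i] == y_train[j] else CL).append((int(i), int(j)))
    return ML, CL
```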

Table 1 Characteristics of the used datasets

Some of these datasets contain features that take a wide range of values, and features with small values would be dominated by those with large values. The datasets are therefore normalized to tackle this issue. The primary reason for selecting max–min normalization is that, while other methods partially preserve information related to the standard deviation, max–min normalization retains the topological structure of the datasets in many cases. For each dataset, the results are averaged over ten independent runs to obtain relatively more stable and accurate estimates. In every single run, each dataset is first normalized and then randomly split into a test set (1/3 of the dataset) and a training set (2/3 of the dataset). The test set is used to evaluate the selected features, while the training set is used to pick the final feature subset. Some of these datasets include features with missing values; every missing value was therefore replaced with the mean of the available data for the respective feature.
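The preprocessing described above (mean imputation, max–min normalization, and a 2/3–1/3 split) could be sketched as follows; the seed and the exact order of the steps are assumptions:

```python
import numpy as np
from sklearn.model_selection import train_test_split

def preprocess(X, y, seed=0):
    # mean imputation per feature for missing values
    X = np.where(np.isnan(X), np.nanmean(X, axis=0), X)
    # max-min normalisation, guarding against constant features
    span = X.max(axis=0) - X.min(axis=0)
    X = (X - X.min(axis=0)) / np.where(span == 0, 1.0, span)
    # 2/3 training set, 1/3 test set
    return train_test_split(X, y, test_size=1/3, random_state=seed)
```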

Classifiers used in the experiments

In order to demonstrate the generality of the proposed method, several well-known classical classifiers, namely Support Vector Machine (SVM), Decision Tree (DT), and Naïve Bayes (NB), were employed to test the classification capability of the selected features. SVM is a learning machine generally used for classification problems; it was presented by Vapnik, became very popular over the past 10 years, and aims to maximize a margin between data samples. NB is a family of simple probabilistic classifiers based on Bayes' theorem with strong (naive) independence assumptions between the features; in simple terms, a Naïve Bayes classifier assumes that, given the target class, the features are conditionally independent of each other. Decision Tree (DT) is considered one of the most successful methods for classification. The tree is created from training samples, and each path from the root to a leaf represents a rule that gives a classification of the pattern; the normalized information gain is examined in this classifier to make decisions.

Moreover, Weka (Waikato Environment for Knowledge Analysis) [67] is used as the experimental workbench; it is a collection of machine learning algorithms mainly for data mining tasks. In this work, SMO, AdaBoostM1, and Naïve Bayes, the WEKA implementations of SVM, AB, and NB, have been applied. WEKA can be considered an advanced tool for machine learning and data mining; it is free software distributed under the GNU General Public License and includes a set of visualization tools, data analysis methods, and forecasting models brought together in a graphical interface so that the user can execute commands conveniently. For this purpose, the selected feature subset is first determined by each feature selection method, and each selected subset is then sent to the Weka tool for evaluation. The parameters of the mentioned classifiers have been set to the default values of the WEKA software. The proposed method involves several parameters that must be set before starting the method. The values of some of these parameters were chosen by trial and error after a number of preliminary runs, so they are not necessarily the best possible values. Moreover, in all of these experiments, the parameter settings reported for each of the compared methods were used.

Experimental results and discussion

In the experiments, the number of selected features and the classification accuracy are used as the performance measures. First, the performance of the proposed method is investigated with different classifiers. Table 2 summarizes the average classification accuracy (in %) over ten independent runs of the different feature selection methods using the SVM, NB, and DT classifiers. Each entry of the table denotes the mean value and the standard deviation (in parentheses) of the 10 independent runs. The best mean values of average percentage accuracy are marked in italics. Table 2 reveals that in most cases the proposed method performs better than the other feature selection methods.

Table 2 Performance comparison of different feature selection methods on eight datasets

Moreover, Figs. 1, 2, 3 show the average classification accuracy over all datasets for the SVM, Naive Bayes, and Decision Tree classifiers, respectively. As can be seen in these figures, the proposed method achieved the highest average classification accuracy with the SVM and Naive Bayes classifiers, while the FGUFS method obtained the highest rank with the Decision Tree classifier. The results of Fig. 1 show that the proposed method obtained an average classification accuracy of 82.87% and achieved the first rank with a margin of 1.95 percentage points over the FJMI method, which obtained the second-best average classification accuracy. Moreover, from the results of Fig. 2, it can be seen that the differences between the classification accuracy obtained by the proposed method and the second-best (FJMI) and third-best (FGUFS) methods with the Naive Bayes classifier were 1.17 (i.e., 80.38–79.21) and 3.07 (i.e., 80.38–77.31) percentage points, respectively. Furthermore, with the Decision Tree classifier, the FGUFS feature selection method gained the first rank with an average classification accuracy of 79.66%, and the proposed PCFS method was ranked second with an average classification accuracy of 79.02%.

Fig. 1
figure 1

Average classification accuracy over all datasets on the SVM classifier

Fig. 2
figure 2

Average classification accuracy over all datasets on the Naive Bayes classifier

Fig. 3
figure 3

Average classification accuracy over all datasets on the Decision Tree classifier

Also, Tables 3, 4, 5 show the number of times the best results are achieved by the different feature selection methods in ten independent runs with the SVM, NB, and DT classifiers, respectively. It can be seen from the results in Tables 3, 4, 5 that in most cases the proposed method obtained the highest count compared to the other methods in the ten independent runs with the different classifiers.

Table 3 Number of times different methods achieve the best results with SVM classifier
Table 4 Number of times different methods achieve the best results with NB classifier
Table 5 Number of times different methods achieve the best results with DT classifier

Table 6 records the average number of features selected by the proposed method and the seven compared feature selection methods over the ten independent runs for each dataset. It can be observed that, in general, a significant reduction of dimensionality is achieved by all methods, which pick only a small portion of the original features. Overall, the proposed method selected the minimum number of features, 40.3 on average, while this value for LS, GCNC, FGUFS, FS, FAST, FJMI, and PCA was 40.7, 41.2, 46.5, 47.0, 46.2, 46.6, and 44.4, respectively.

Table 6 Average number of selected features in ten independent run

Also, the accuracy of the proposed method is compared with that of the other feature selection methods for various numbers of selected features through several experiments. The classification accuracy curves (averaged over ten independent runs) of the SVM and DT classifiers on the Multiple Features and Colon datasets are plotted in Figs. 4 and 5, respectively. These results indicate that the proposed method, in most cases, is superior to the other methods and has the highest classification accuracy.

Fig. 4
figure 4

Classification accuracy (average over 10 runs), on multiple features dataset with respect to the number of selected features with a SVM classifier, and b DT classifier

Fig. 5
figure 5

Classification accuracy (average over 10 runs), on Colon dataset with respect to the number of selected features with a SVM classifier, and b DT classifier

Furthermore, a large number of experiments were performed to compare the execution time of the proposed method with that of the other supervised and unsupervised feature selection methods. The execution times (in ms) of the different methods are reported in Table 7. It can be concluded from the results in this table that, in most cases, the proposed PCFS method has lower running times than the other methods.

Table 7 Average execution time (in ms) of different feature selection methods over ten independent runs

Complexity analysis

In this subsection, the computational complexity of the proposed method is analyzed. The first phase of the method uses the PCFS clustering to determine the clusters. The time complexity of this phase is \( O\left( {In^{2} s} \right) \), where \( I \) denotes the number of iterations needed for the algorithm to converge, \( n \) denotes the total number of initial features, and \( s \) is the number of samples. In the next phase, the Dim-reduce function is used to produce the reduced feature set; its complexity is \( O\left( {n^{2} } \right) \). Consequently, the overall computational complexity of the PCFS method is \( O\left( {In^{2} s + n^{2} } \right) \). When the number of samples \( s \) and the number of iterations \( I \) are much smaller than the total number of features, the time complexity of the proposed method can be treated as \( O\left( {n^{2} } \right) \).

Conclusion

Over the last 10 years, the fast growth of computer and database technologies has led to the rapid growth of large-scale datasets. At the same time, applications with high-dimensional datasets that require high speed and accuracy are rapidly increasing. An important issue in data mining applications, including pattern recognition, classification, and clustering, is the curse of dimensionality, where the number of features is much higher than the number of patterns. From a general perspective, feature selection approaches are categorized into three groups: supervised, unsupervised, and semi-supervised. Supervised feature selection methods have a set of training patterns available, each described by its feature values together with a label, while unsupervised feature selection methods deal with samples without labels. Semi-supervised feature selection employs both unlabeled and labeled data simultaneously to improve feature selection accuracy.

In the present paper, a novel pairwise-constraint-based method is proposed for feature selection. In the proposed method, the similarity between the constrained pairs is first calculated, and an uncertainty region is then created based on it. The most informative pairs are then selected in an iterative process. The proposed method was compared with different supervised and unsupervised feature selection approaches, including LS, GCNC, FGUFS, FS, FAST, FJMI, and PCA. The reported findings indicate that, in most cases, the proposed approach is more accurate and selects fewer features. For example, the numerical results showed that the proposed technique improved the classification accuracy by about 3% and reduced the number of picked features by about 1%. Consequently, it can be said that the proposed method reduces the computational complexity of the machine learning algorithm while increasing the classification accuracy.