1 Introduction

Although a wide variety of expert systems have been built, knowledge acquisition remains a development bottleneck [2, 21]. Building a large-scale expert system involves creating and extending a large knowledge base over many months or years. Shortening the development time is thus a critical factor in the success of an expert system. Machine-learning techniques have been successfully developed to ease the knowledge-acquisition bottleneck, and among the proposed approaches, deriving rules from training examples is the most common [9, 14, 15]. Given a set of examples, a learning program tries to induce rules that describe each class.

In some application domains, the number of attributes (or features) in the training data is very large (e.g. tens to hundreds). In this case, much computational time is needed to derive classification rules from the data. Moreover, the derived rules may contain too many features, and more rules than desired may be obtained due to over-specialization. In fact, not all the attributes are indispensable: the training data may contain redundant, similar or dependent attributes. This phenomenon mainly results from attribute dependency; redundant and similar attributes can be regarded as two special cases of dependent attributes. If dependency relationships exist among the attributes, the dimensionality of the training data can be reduced.

The concept of reduced attribute sets has been used in many places. For example, in rough set theory [16–19], a reduced set of attributes is called a “reduct”. Many reducts may exist at the same time. Even if only a reduced set of attributes is used for classification, the indiscernibility relation is still preserved [10]. A minimal reduct, as its name suggests, is a reduct that cannot be reduced any further; it is not necessarily unique either. Classification of a high-dimensional dataset can be done faster if a minimal reduct, instead of the entire original set of attributes, is used. However, finding a minimal reduct is an NP-hard problem [20, 22]. Besides, an exact minimal reduct may not exist due to noise in the training examples.

Several studies on finding approximate reducts have thus been conducted. An approximate reduct is a minimal reduct within an acceptable tolerance. It can usually be found in much less time than an exact minimal reduct, and it usually consists of fewer attributes than an exact one. It is thus a good trade-off among accuracy, practicability and execution time. Many approaches for finding approximate reducts have been proposed [3, 7, 26, 27]. For example, Wróblewski [25] used a genetic algorithm to find approximate minimal reducts. Sun and Xiong [23] proposed an approach compatible with incomplete information systems. Al-Radaideh et al. [1] used the discernibility matrix and a weighting strategy to find the minimal reduct in a greedy manner. Gao et al. [4] proposed a feature-ranking strategy (similar to attribute weighting) that includes a sampling process. Recently, approaches based on soft sets have also been proposed for attribute selection to reduce the execution time [13, 20].

All the approaches mentioned above focus on finding a minimal reduct as quickly as possible. However, if some training examples have missing or unknown values, these approaches may not work correctly. Besides, if only the chosen reduct is used in the learning process, the derived rules cannot contain other attributes and become hard to apply when some attribute values in the reduct cannot be obtained in the current environment.

In this paper, we address the above problems from another viewpoint: attribute clustering. Note that we cluster attributes rather than objects. As in conventional object clustering, attributes within the same cluster are expected to have high similarity, while attributes in different clusters have low similarity. Here, the dependency degrees between attributes are used to represent the similarities. Since the attributes are grouped into several clusters according to their similarity degrees, an attribute selected from a cluster can represent the other attributes in the same cluster. An approximate reduct can then be formed by gathering the chosen attributes. Note that the result obtained in this way is usually an approximate reduct. The proposed approach has the following three advantages.

1. Guessing a missing value of an attribute from the other attributes within the same cluster should be more accurate and faster than guessing it from all the attributes.

2. If an object has missing values, its class can still be decided by the other attributes within the same cluster.

3. The proposed approach is flexible for representing rules, since each attribute in a rule can be replaced with other attributes in the same cluster.

Besides, the proposed attribute clustering algorithm has been implemented to verify its effectiveness. Experimental results show that the average similarity within each cluster is related to the number of clusters: as the cluster number increases, the average similarity within a cluster also increases.

The remainder of this paper is organized as follows: Some related concepts including reduct, relative dependency and clustering are reviewed in Sect. 2. The proposed dissimilarity between a pair of attributes is explained in Sect. 3. An attribute clustering algorithm is proposed in Sect. 4. An example is given in Sect. 5 to illustrate the proposed algorithm. The experimental results and some discussions are described in Sect. 6. Conclusions and future work are finally stated in Sect. 7.

2 Related work

In this section, some important concepts related to this paper are briefly reviewed. The concept of reducts is introduced first, followed by the concept of relative dependency. Next, two well-known clustering approaches, \(k\)-means and \(k\)-medoids, are described and compared, and the reasons why they are not suitable for clustering attributes are explained. An attribute clustering approach is thus proposed to overcome these problems and limitations.

2.1 Reducts

Let \(I = (U, A)\) be an information system, where \(U=\{x_{1}, x_{2}, {\dots }, x_{N}\}\) is a finite non-empty set of objects and \(A\) is a finite non-empty set of attributes called condition attributes [10]. A decision system is an information system of the form \(I=({U,A\cup \{d\}})\), where \(d\) is a special attribute called the decision attribute and \(d\notin A\) [10]. For any object \(x_{i} \in U\), its value for a condition attribute \(a \in A\) is denoted by \(f_{a}(x_{i})\). The indiscernibility relation for a subset of attributes \(B\) is defined as:

$$\begin{aligned} \mathrm{IND}(B)=\{(x,y)\in U\times U \mid \forall a\in B,\; f_a (x)=f_a (y)\}, \end{aligned}$$

where \(B\) is any subset of the condition attribute set \(A\) (i.e. \(B\subseteq A\)) [11]. If the indiscernibility relations from both \(A\) and \(B\) are the same [i.e. \(\mathrm{IND}(B)= \mathrm{IND}(A)\)], then \(B\) is called a reduct of \(A\). That means the attributes used in the information system can be reduced to \(B\), with the original indiscernibility information still kept. Furthermore, if an attribute subset \(B\) satisfies the following condition, then \(B\) is called a minimal reduct of \(A\):

$$\begin{aligned} \mathrm{IND}(B)=\mathrm{IND}(A)\quad \hbox {and}\quad \forall B'\subset B,\;\mathrm{IND}(B')\ne \mathrm{IND}(A). \end{aligned}$$

Take the simple information system in Table 1 as an example. In Table 1, the attribute set \(A\) consists of three attributes {Age, Income, Children} and the object set \(U\) consists of five objects \(\{x_{1},x_{2},x_{3}, x_{4}, x_{5}\}\). Since IND({Age, Children}) = IND\((A) = \{(x_{1}, x_{1}), (x_{2}, x_{2}), (x_{3}, x_{3}), (x_{4}, x_{4}), (x_{5}, x_{5})\}\), the attribute subset {Age, Children} is a reduct of the information system. Besides, since neither IND({Age}) nor IND({Children}) equals IND(\(A\)), the attribute subset {Age, Children} is a minimal reduct.
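To make the definitions concrete, the following minimal Python sketch (illustrative only, not the authors' implementation; the attribute values are a hypothetical encoding chosen to be consistent with the indiscernibility structure reported for Table 1) computes \(\mathrm{IND}(B)\) as a set of object pairs and checks the reduct and minimal-reduct conditions:

```python
from itertools import combinations

# Hypothetical encoding of Table 1 (the concrete values are invented; only
# the resulting indiscernibility structure matters for the example).
U = ["x1", "x2", "x3", "x4", "x5"]
table = {
    "x1": {"Age": "young",  "Income": "high", "Children": 0},
    "x2": {"Age": "middle", "Income": "low",  "Children": 1},
    "x3": {"Age": "old",    "Income": "low",  "Children": 0},
    "x4": {"Age": "young",  "Income": "high", "Children": 1},
    "x5": {"Age": "old",    "Income": "high", "Children": 1},
}

def ind(B):
    """IND(B): the pairs of objects agreeing on every attribute in B."""
    return {(x, y) for x in U for y in U
            if all(table[x][a] == table[y][a] for a in B)}

A = ["Age", "Income", "Children"]
B = ["Age", "Children"]
print(ind(B) == ind(A))              # True: B is a reduct of A
print(any(ind(list(S)) == ind(A)     # False: no proper subset preserves
          for r in range(1, len(B))  # IND(A), so B is a minimal reduct
          for S in combinations(B, r)))
```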

Table 1 A simple information system

When a decision system, instead of an information system, is considered, the definition of a reduct \(B\) (\(B\subseteq A)\) can be modified as follows [25]:

$$\begin{aligned} \forall x_i ,x_j \in U, \ \hbox {if }f_B (x_i )=f_B (x_j ),\hbox { then }d(x_i )=d(x_j ), \end{aligned}$$

where \(d(x_{i})\) denotes the value of the decision attribute of the object \(x_{i}\) and \(f_{B}(x_{i})\) denotes the attribute values of \(x_{i}\) on the attribute set \(B\). Similarly, if no proper subset of \(B\) satisfies the above condition, \(B\) is called a minimal reduct of the decision system. Take the simple decision system shown in Table 2 as an example. It is modified from Table 1.

Table 2 A simple decision system

In Table 2, a decision attribute, Buying computers, is added to the original information system (Table 1) to form a decision system. In this example, the attribute subset {Age, Income} is not a reduct, since the two objects \(x_{1}\) and \(x_{4}\) have the same values for these two attributes but belong to different classes. On the contrary, the attribute subset {Age, Children} is a reduct of the decision system. Furthermore, it is a minimal reduct, since neither {Age} nor {Children} alone is a reduct. Finding minimal reducts has been proven to be NP-hard. Li et al. [11] proposed the concept of “approximate” reducts to speed up the search process. An approximate reduct allows some reasonable tolerance degree but can greatly reduce the computational complexity. Next, the concept of relative dependency is introduced.
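The decision-system version of the reduct test can be sketched in the same way. The decision values below are hypothetical (the decision attribute is abbreviated as Buying), chosen only to reproduce the behaviour described in the text, e.g. that \(x_{1}\) and \(x_{4}\) agree on Age and Income but belong to different classes:

```python
# Hypothetical encoding of Table 2 (condition values as before, plus an
# invented decision column consistent with the text's description).
table = {
    "x1": {"Age": "young",  "Income": "high", "Children": 0, "Buying": "no"},
    "x2": {"Age": "middle", "Income": "low",  "Children": 1, "Buying": "no"},
    "x3": {"Age": "old",    "Income": "low",  "Children": 0, "Buying": "yes"},
    "x4": {"Age": "young",  "Income": "high", "Children": 1, "Buying": "yes"},
    "x5": {"Age": "old",    "Income": "high", "Children": 1, "Buying": "no"},
}

def is_reduct(B, d="Buying"):
    """B is a reduct iff objects agreeing on B never disagree on the decision."""
    rows = list(table.values())
    return all(r[d] == s[d]
               for r in rows for s in rows
               if all(r[a] == s[a] for a in B))

print(is_reduct(["Age", "Income"]))    # False: x1 and x4 collide
print(is_reduct(["Age", "Children"]))  # True: a reduct of the decision system
```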

2.2 Relative dependency

Han [5] and Li et al. [11] developed an approach based on relative dependency to find approximate reducts. The relative dependency is motivated by the projection operation, which is very important in relational algebra and can easily be executed in SQL or other query languages. Given an attribute subset \(B\subseteq A\) and a decision attribute \(d\), the projection of the object set \(U\) on \(B\) is denoted by \(\Pi _{B}(U)\) and can be computed in two steps: removing the attributes in the difference set (\(A-B\)) and merging all the remaining rows that are indiscernible [11]. Thus, among the tuples with the same attribute values on \(B\), only one is kept and the others are removed. For example, the projection of the data in Table 2 on the attribute {Age} is shown below:

$$\begin{aligned} \Pi _{\{\mathrm{Age}\}}(\hbox {U})=\left\{ {x_1 ,x_2 ,x_3 } \right\} . \end{aligned}$$

In this example, \(x_{4}\) and \(x_{5}\) are removed since they have the same Age values as \(x_{1}\) and \(x_{3}\), respectively. Similarly, the projections on the attribute Children and on the attribute subset {Age, Children} are shown below:

$$\begin{aligned} \Pi _{\{\mathrm{Children}\}} ( U)=\left\{ {x_1 ,x_2 } \right\} \!,\quad \hbox {and} \\ \Pi _{\{\mathrm{Age,\,Children}\}} ( U)=\left\{ {x_1 ,x_2 ,x_3 ,x_4 ,x_5 } \right\} . \end{aligned}$$

Han [5] and Li et al. [11] thus defined the relative dependency degree \(\delta _B^D\) of the attribute subset \(B\) with regard to the set of decision attributes \(D\) as follows:

$$\begin{aligned} \delta _B^D =\frac{\left| {\Pi _B ( U)} \right| }{\left| {\Pi _{B\cup D} ( U)} \right| }, \end{aligned}$$

where \(\vert \Pi _{B}(U)\vert \) and \(\left| {\Pi _{B\cup D} ( U)} \right| \) are the numbers of tuples after the projection operations are performed on \(U\) according to \(B\) and \(B\cup D\), respectively. Take the decision system shown in Table 2 as an example. \(\vert \Pi _{\{\mathrm{Age}\}}(U)\vert =\vert \{x_{1}, x_{2}, x_{3}\}\vert = 3\) and \(\vert \Pi _{\{\mathrm{Age,\, Buying\, computers}\}} (U)\vert =\vert \{x_{1},x_{2},x_{3}, x_{4}, x_{5}\}\vert = 5\). The relative dependency degree of {Age} with regard to {Buying computers} is thus 3/5, which is 0.6.
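The projection count and the relative dependency degree are straightforward to compute. The sketch below (using the same hypothetical encoding of Table 2 as before, which reproduces the reported projection sizes) returns 3/5 = 0.6 for the relative dependency degree of {Age} with regard to {Buying computers}:

```python
# Hypothetical encoding of Table 2 as a list of rows (decision column "Buying").
table = [
    {"Age": "young",  "Income": "high", "Children": 0, "Buying": "no"},
    {"Age": "middle", "Income": "low",  "Children": 1, "Buying": "no"},
    {"Age": "old",    "Income": "low",  "Children": 0, "Buying": "yes"},
    {"Age": "young",  "Income": "high", "Children": 1, "Buying": "yes"},
    {"Age": "old",    "Income": "high", "Children": 1, "Buying": "no"},
]

def proj_size(B):
    """|Pi_B(U)|: the number of distinct value combinations on B."""
    return len({tuple(row[a] for a in B) for row in table})

def rel_dep(B, D):
    """delta_B^D = |Pi_B(U)| / |Pi_{B u D}(U)|."""
    return proj_size(B) / proj_size(B + D)

print(rel_dep(["Age"], ["Buying"]))   # 3 / 5 = 0.6
```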

The goal of this paper is to cluster attributes so that the process of finding approximate reducts can be improved. To achieve this goal, it is important to develop an evaluation method that measures the similarity of attributes. This paper extends the concept of relative dependency to compute the similarity between any two attributes and proposes an attribute clustering method based on it. The proposed approach is described in Sect. 3.

2.3 The k-means and the k-medoids clustering approaches

The \(k\)-means and \(k\)-medoids approaches are two well-known partitioning (clustering) strategies. They are widely used to cluster data when the number of clusters is given in advance. The \(k\)-means approach [12] consists of two major steps: (1) reassigning objects to clusters and (2) updating the cluster centers. The first step calculates the distance between each object and the \(k\) centers and reassigns the object to the group with the nearest center. The second step then calculates the new means of the \(k\) updated groups and uses them as the new centers. These two steps are executed iteratively until the clusters no longer change.

The \(k\)-medoids approach [8] adopts a quite different way of finding the centers of clusters. Assume \(k\) centers have been found. The \(k\)-medoids approach selects another object at random and replaces one of the original centers with the new object if better clustering results can be obtained. The absolute-error criterion [6] shown below is used to decide whether the replacement is better or not:

$$\begin{aligned} E=\sum \limits _{j=1}^k {\sum \limits _{p\in C_j } {\left| {p-o_j } \right| } }, \end{aligned}$$

where \(E\) is the sum of the absolute errors for all the objects in the data set, \(p\) is an object in cluster \(C_{j}\), \(o_{j}\) is the current center of \(C_{j}\), and \(\vert p-o_{j}\vert \) is the distance between the two objects \(p\) and \(o_{j}\). For a randomly selected object \(o_{j'}\), one of the original \(k\) centers, say \(o_{j}\), is tentatively replaced with it and the new sum \(E'\) of absolute errors is calculated. \(E'\) is then compared with the previous \(E\). If \(E'\) is less than \(E\), then \(o_{j'}\) is more suitable as a center than \(o_{j}\), and \(o_{j'}\) actually replaces \(o_{j}\) as a new cluster center; otherwise, the replacement is abandoned. The same procedure is repeated until the cluster centers no longer change.
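As a small illustration of this swap test, the sketch below (with made-up one-dimensional points and the absolute difference as the distance) keeps a replacement only when it lowers the absolute-error criterion \(E\):

```python
def absolute_error(points, medoids, dist):
    """E: the sum over all points of the distance to the nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def try_swap(points, medoids, out_m, in_p, dist):
    """Replace medoid out_m by candidate in_p only if E decreases."""
    candidate = [in_p if m == out_m else m for m in medoids]
    if absolute_error(points, candidate, dist) < absolute_error(points, medoids, dist):
        return candidate
    return medoids

points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
dist = lambda a, b: abs(a - b)
print(try_swap(points, [1.0, 3.0], 3.0, 11.0, dist))   # [1.0, 11.0]
```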

The complexity of the \(k\)-medoids approach is in general higher than that of the \(k\)-means approach, but the former guarantees that all the obtained cluster centers are objects themselves. This feature is important to the attribute clustering proposed here, since not only must the attributes be clustered, but a representative attribute of each cluster must also be found. In contrast, the \(k\)-means approach may use non-object points as cluster centers. Note that both the \(k\)-means and the \(k\)-medoids approaches are mainly designed to cluster objects, not attributes. As mentioned above, the goal of this paper is to cluster attributes. An attribute clustering method based on \(k\)-medoids is thus proposed for this purpose. It also uses a better search strategy that finds centers in dense regions, instead of the random selection used in \(k\)-medoids. Besides, a method to measure the distances (dissimilarities) among attributes is also needed.

3 Attribute dissimilarity

In this paper, we partition the attributes into \(k\) clusters according to the dependency between each pair of attributes. Each cluster can then be represented by its representative attribute, so the whole feature space can be greatly reduced.

For most clustering approaches, the distance between two objects is usually adopted as a measure of their dissimilarity, which is then used to decide whether the objects belong to the same cluster. In this paper, the attributes, instead of the objects, are to be clustered. Conventional distance measures such as the Euclidean or Manhattan distance are thus not suitable, since the attributes may have different data formats that are hard to compare. For example, assume there are two attributes, one being age and the other gender; it is hard to compare them with a traditional distance measure. Below, a measure based on the concept of relative dependency, originally proposed by Han et al. [5], is adopted; it can be thought of as a kind of similarity degree.

Given two attributes \(A_{i}\) and \(A_{j}\), the relative dependency degree of \(A_{i}\) with regard to \(A_{j}\) is denoted by \(\mathrm{Dep}(A_{i}, A_{j})\) and is defined as:

$$\begin{aligned} \mathrm{Dep}(A_i ,\;A_j )=\frac{\left| {\Pi _{A_i } ( U)} \right| }{\left| {\Pi _{A_i ,\;A_j } ( U)} \right| }, \end{aligned}$$

where \(\left| {\Pi _{A_i } ( U)} \right| \) is the number of tuples in the projection of \(U\) on the attribute \(A_{i}\). Note that the original relative dependency degree only considers the dependency between a condition attribute set and a decision attribute set; here, the formula is extended to estimate the relative dependency between any pair of attributes. The dependency degree is not symmetric, that is, \(\mathrm{Dep}(A_{i}, A_{j}) = \mathrm{Dep}(A_{j}, A_{i})\) does not always hold. We thus use the average of \(\mathrm{Dep}(A_{i}, A_{j})\) and \(\mathrm{Dep}(A_{j},A_{i})\) to represent the similarity of the two attributes \(A_{i}\) and \(A_{j}\). This extended relative dependency is regarded as the similarity of the two attributes. The distance (dissimilarity) measure for the pair of attributes \(A_{i}\) and \(A_{j}\) is thus defined as follows:

$$\begin{aligned} d(A_i ,\;A_j )=\frac{1}{\mathrm{Avg}(\mathrm{Dep}(A_i ,\;A_j ),\;\mathrm{Dep}(A_j ,\;A_i ))}. \end{aligned}$$

Take the distance between the two attributes Age and Children in Table 2 as an example. Since \(\vert \Pi _{\{\mathrm{Age}\}} (U)\vert =~3, \vert \Pi _{\{\mathrm{Children}\}} (U)\vert = 2\) and \(\vert \Pi _{\{\mathrm{Age, Children}\}}(U)\vert = 5\), the relative dependency degrees Dep(Age,  Children) and Dep (Children,  Age) are 0.6 and 0.4, respectively. The distance \(d\)(Age,  Children) is thus 1/Avg(0.6, 0.4), which is 2.
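The whole dissimilarity computation can be sketched as follows. The table is again the hypothetical encoding of Table 2 restricted to the attributes Age and Children, and the printed values reproduce the example above (dependency degrees 0.6 and 0.4, distance 2):

```python
# Hypothetical Age/Children columns consistent with the projection sizes above.
table = [
    {"Age": "young",  "Children": 0},
    {"Age": "middle", "Children": 1},
    {"Age": "old",    "Children": 0},
    {"Age": "young",  "Children": 1},
    {"Age": "old",    "Children": 1},
]

def proj_size(B):
    return len({tuple(row[a] for a in B) for row in table})

def dep(ai, aj):
    """Dep(Ai, Aj) = |Pi_{Ai}(U)| / |Pi_{Ai, Aj}(U)|."""
    return proj_size([ai]) / proj_size([ai, aj])

def attr_distance(ai, aj):
    """d(Ai, Aj): reciprocal of the averaged directed dependency degrees."""
    return 1.0 / ((dep(ai, aj) + dep(aj, ai)) / 2.0)

print(dep("Age", "Children"), dep("Children", "Age"))   # 0.6 0.4
print(attr_distance("Age", "Children"))                 # 2.0
```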

4 The proposed algorithm

In this section, an attribute clustering algorithm called Most Neighbors First (MNF) is proposed to cluster the attributes into a fixed number of groups. Assume the number \(k\) of desired clusters is known. Some preprocessing steps such as removal of inconsistent or incomplete tuples and discretization of numerical data are first done. After that, the proposed MNF attribute clustering algorithm is used to partition the feature space into \(k\) clusters and output the \(k\) representative attributes of the clusters.

The proposed MNF clustering algorithm is based on the \(k\)-medoids approach. Unlike the \(k\)-means approach, it always uses existing attributes as cluster centers. Besides, it uses a better search strategy that finds centers in dense regions, instead of the random selection used in \(k\)-medoids.

The proposed MNF algorithm consists of two major phases: (1) reassigning the attributes to the clusters and (2) updating the cluster centers. In the first phase, the proposed distance measure is used to find the nearest center of each attribute; the attribute is then assigned to the cluster with that center. In the second phase, each cluster \(C_{t}\) uses a searching radius \(r_{t}\) to decide the neighbors of each attribute in \(C_{t}\). The attribute with the most neighbors in a cluster is then chosen as the new center. The proposed algorithm is described in detail below.

The MNF attribute clustering algorithm:

Input: A decision system \(I=( {U,A\cup \{d\}})\) and the number \(k\) of desired clusters.

Output: \(k\) appropriate attribute clusters with their representative attributes.

Step 1 Randomly select \(k\) attributes \(\{A_{1}^{c}, A_{2}^{c}, {\ldots }, A_{k}^{c}\}\) as the initial representative attributes (centers) in the \(k\) clusters, where \(A_{t}^{c}\) stands for the representative attribute (center) of the t-th cluster \(C_{t}\), \(A_{t}^{c} \in A\). Denote \(A_{c} = \{A_{1}^{c}, A_{2}^{c}, {\ldots }, A_{k}^{c}\}\subseteq A\) as the initial representative attribute set.

Step 2 For each non-representative attribute \(A_{i}\in A-A_{c}\), compute the dissimilarity (distance) \(d(A_{i}, A_{t}^{c})\) between the attribute \(A_{i}\) and each representative attribute \(A_{t}^{c}\) as:

$$\begin{aligned} d(A_i ,\;A_t^c )=\frac{1}{\mathrm{Avg}(\mathrm{Dep}(A_i, A_t^c ),\;\mathrm{Dep}(A_t^c, A_i ))}, \end{aligned}$$

where \(\mathrm{Dep}(A_{i}, A_{t}^{c})\) represents the relative dependency degree of \(A_{i}\) with regard to \(A_{t}^{c}\) and \(\mathrm{Dep}(A_{t}^{c}, A_{i})\) represents the relative dependency degree of \(A_{t}^{c}\) with regard to \(A_{i}\), \(t \in \{1, 2, {\ldots }, k\}\).

Step 3 Allocate all non-center attributes to their nearest centers according to the distances found in Step 2. Collect a center attribute with its allocated attributes as a cluster.

Step 4 For each cluster \(C_{t}\), calculate the distances between any two different attributes within \(C_{t}\).

Step 5 Calculate the radius \(r_{t}\) of each cluster \(C_{t}\) as:

$$\begin{aligned} r_t =\frac{\sum \nolimits _{i<j} {d(A_{t,i} ,\;A_{t,j} )} }{C_2^{n_t } }, \end{aligned}$$

where \(d(A_{t,i}, A_{t,j})\) is the distance between any two attributes \(A_{t,i}\) and \(A_{t,j}\) within the cluster \(C_{t}\), \(n_{t}\) is the number of attributes within \(C_{t}\), and \(C_2^{n_t } \) is the number of attribute pairs in the cluster, which is \(\frac{n_t ( {n_t -1})}{2}\).

Step 6 For each attribute \(A_{t,i}\) (including the center \(A_{t}^{c}\)) within a cluster \(C_{t}\), find the set of attributes, called Near\((A_{t,i})\), whose distances from \(A_{t,i}\) are within \(r_{t}\). That is:

$$\begin{aligned} \mathrm{Near}(A_{t,\;i} )=\{A_{t,j} \mid A_{t,j} \in C_t ,\;A_{t,j} \ne A_{t,i} ,\;\hbox {and}\;d(A_{t,\;i} ,\;A_{t,j} )\le r_t \}. \end{aligned}$$

Step 7 For each cluster \(C_{t}\), find the attribute \(A_{t,l}\) with the most attributes in its Near set. Set \(A_{t,l}\) as the new center \(A_{t}^{c}\) of \(C_{t}\).

Step 8 Repeat Steps 2–7 until the clusters have converged.

Step 9 Output the final clusters and their centers as the representative attributes.

Table 3 An example for attribute clustering

After Step 9, \(k\) clusters of attributes are formed and \(k\) representative attributes for the feature space are found.
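The steps above can be summarized in the following sketch. It is one possible reading of the MNF procedure using the projection-based distance of Sect. 3, not the authors' C++ implementation; the function names (mnf, proj_size, attr_distance) and the max_iter bound are illustrative assumptions:

```python
import random

def proj_size(table, B):
    """|Pi_B(U)|: the number of distinct value combinations on B."""
    return len({tuple(row[a] for a in B) for row in table})

def attr_distance(table, ai, aj):
    """Distance of Sect. 3: reciprocal of the averaged dependency degrees."""
    both = proj_size(table, [ai, aj])
    dep_ij = proj_size(table, [ai]) / both
    dep_ji = proj_size(table, [aj]) / both
    return 1.0 / ((dep_ij + dep_ji) / 2.0)

def mnf(table, attrs, k, max_iter=100, seed=0):
    """Most Neighbors First attribute clustering (Steps 1-9)."""
    centers = random.Random(seed).sample(attrs, k)              # Step 1
    clusters = {}
    for _ in range(max_iter):
        # Steps 2-3: assign every non-center attribute to its nearest center.
        clusters = {c: [c] for c in centers}
        for a in attrs:
            if a in centers:
                continue
            nearest = min(centers, key=lambda c: attr_distance(table, a, c))
            clusters[nearest].append(a)
        # Steps 4-7: compute each cluster's searching radius and choose the
        # attribute with the most neighbours inside it as the new center.
        new_centers = []
        for center, members in clusters.items():
            pairs = [(x, y) for i, x in enumerate(members) for y in members[i + 1:]]
            if not pairs:                                       # singleton cluster
                new_centers.append(center)
                continue
            radius = sum(attr_distance(table, x, y) for x, y in pairs) / len(pairs)
            new_centers.append(max(
                members,
                key=lambda a: sum(1 for b in members
                                  if b != a and attr_distance(table, a, b) <= radius)))
        if set(new_centers) == set(centers):                    # Step 8: converged
            break
        centers = new_centers
    return clusters, centers                                    # Step 9
```

The convergence test in Step 8 simply compares the new and old center sets; the max_iter bound is only a safeguard added in the sketch.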

5 An example

In this section, a simple example is given to show how the proposed algorithm clusters the attributes. Table 3 shows the scores of eight students. There are eight condition attributes \(A\) = {PR, CA, DM, C++, JAVA, DB, DS, AL}, which respectively stand for the eight subjects Probability, Calculus, Discrete Mathematics, C++, JAVA, Database, Data Structure and Algorithms. The values of the condition attributes are {A, B, C, D}, which stand for the grade levels of a subject. There is one decision attribute, ST, which stands for Studying for a Master's Degree and has two possible classes, {Yes, No}. In this example, the number of clusters is set to 2 (i.e. \(k = 2\)). For this set of data, the proposed algorithm proceeds as follows.

Step 1 \(k\) attributes are randomly selected as the initial cluster centers. In this example, \(k\) is set to 2. Assume that the two attributes DM and DS are selected as the initial centers of the two clusters \(C_{1}\) and \(C_{2}\), respectively.

Step 2 The distances (dissimilarities) between each non-center attribute and each center are calculated. Take the distance between PR and DM as an example. Since \(\vert \Pi _\mathrm{PR}\vert = 3\), \(\vert \Pi _\mathrm{DM}\vert = 3\) and \(\vert \Pi _\mathrm{PR, DM}\vert = 5\), the relative dependency degree \(\mathrm{Dep}(\mathrm{PR, DM})\) is calculated as 0.6, and \(\mathrm{Dep}(\mathrm{DM, PR})\) is 0.6 as well. The distance between the two attributes is thus calculated as:

$$\begin{aligned} d(\mathrm{PR, DM})=\frac{1}{\mathrm{Avg}(0.6, 0.6)}=1.67. \end{aligned}$$

All the distances between non-center attributes and representative centers are shown in Table 4.

Table 4 The distances between non-center attributes and representative centers

Step 3 All non-center attributes are allocated to their nearest centers. Thus, cluster \(C_{1}\) contains {PR, CA, AL, DM} and cluster \(C_{2}\) contains {C++, JAVA, DB, DS}.

Step 4 The distances between any two different attributes in the same clusters are calculated. The results are shown in Table 5.

Table 5 The distances between any two attributes within the same clusters

Step 5 The searching radius of each cluster is calculated. Take the cluster \( C_{1}\) as an example. It includes four attributes {PR, CA, AL, DM}. The distances between each pair of attributes in \(C_{1}\) are \(\{1.67, 1.67, 1.33, 1.67, 1.67, 1.33\}\). The radius \(r_{1}\) is then calculated as:

$$\begin{aligned} r_1 =\frac{1.67+1.67+1.33+1.67+1.67+1.33}{6}=1.56. \end{aligned}$$

Step 6 The Near set of each attribute in a cluster is calculated. Take the attribute PR in cluster \(C_{1}\) as an example. Its distances from the other three attributes CA, AL and DM in the same cluster are 1.67, 1.33 and 1.67, respectively. Near(PR) thus includes only the attribute AL, since only AL is within the radius \(r_{1}\) (=1.56) found in Step 5. Similarly, the Near sets of the other three attributes in the cluster \(C_{1}\) are found as follows:

$$\begin{aligned} \begin{array}{l} \mathrm{Near}( {\mathrm{CA}})=\emptyset , \\ \mathrm{Near}( {\mathrm{AL}})=\left\{ {\mathrm{PR,\,DM}} \right\} ,\hbox { and} \\ \mathrm{Near}( {\mathrm{DM}})=\left\{ \mathrm{AL} \right\} . \\ \end{array} \end{aligned}$$

Step 7 Since the attribute AL has the most attributes in its Near set for the cluster \(C_{1}\), AL replaces the attribute DM as the new center of \(C_{1}\). Similarly, the original center DS of \(C_{2}\) has the most attributes in its Near set; DS thus remains the center of \(C_{2}\).

Step 8 Steps 2–7 are repeated until the two clusters no longer change. The final clusters can thus be found as follows:

\(C_{1} = \{\mathrm{PR, CA, AL, DM}\}\), with the center AL.

\(C_{2} = \){C++, JAVA, DB, DS}, with the center DS.

Step 9 The final clusters and their centers (the representative attributes) are then output. The attributes in the same cluster can be considered to possess similar characteristics for classification and can be used as alternatives to the representative attribute.

6 Experimental results

In this section, the implementation of the proposed attribute clustering algorithm is described. The experiments were implemented in C++ on an AMD Athlon 64 X2 Dual Core 3800+ personal computer with a 2.01 GHz CPU and 1 GB of RAM. The real-world Wisconsin Diagnostic Breast Cancer (WDBC) dataset [24] was used to verify the approach. The characteristics of the dataset are shown in Table 6.

Table 6 The characteristics of the dataset of WDBC

Each attribute in the dataset is numerical, so discretization must first be performed. In this paper, discretization is performed by two methods: equal width, which divides the range of an attribute into intervals of equal size, and equal frequency, which places approximately the same number of values into each interval (a small sketch of both schemes is given after the discussion of Fig. 1). The average intra-distance (dissimilarity) within a cluster is used to evaluate the quality of the results. It is defined as the average distance between an attribute and the representative attribute of its cluster. Formally, it can be represented by the following formula:

$$\begin{aligned} \mathrm{AvgIntraD}=\frac{1}{k}\sum \limits _{i=1}^k {\frac{1}{\vert C_i \vert -1}} \sum \limits _{A_j \in C_i -A_i^c } {d(A_j ,\;A_i^c )} , \end{aligned}$$

where \(C_{i}\) is the \(i\)-th attribute cluster and \(A_i^c\) is its representative attribute. The values of AvgIntraD for different cluster numbers under the two discretization methods are shown in Fig. 1.

Fig. 1 The average intra-distances for different cluster numbers under the two discretization methods

As Fig. 1 shows, the average intra-distance decreases as the cluster number increases for both discretization methods. Besides, equal-width discretization yields lower average intra-distances than equal-frequency discretization.
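For reference, the two discretization schemes compared above can be sketched as follows; the bin count and the sample values are illustrative and are not the settings used in the experiments:

```python
def equal_width(values, bins):
    """Split the value range into 'bins' intervals of equal size."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0          # guard against a constant attribute
    return [min(int((v - lo) / width), bins - 1) for v in values]

def equal_frequency(values, bins):
    """Rank-based binning: roughly the same number of values per interval."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    for rank, i in enumerate(order):
        labels[i] = min(rank * bins // len(values), bins - 1)
    return labels

vals = [0.5, 1.0, 1.2, 1.3, 9.8, 10.0]
print(equal_width(vals, 2))       # [0, 0, 0, 0, 1, 1]
print(equal_frequency(vals, 2))   # [0, 0, 0, 1, 1, 1]
```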

Another measure used to evaluate the clustering results is the average intra-similarity within clusters, defined as follows:

$$\begin{aligned} \mathrm{AvgIntraS}=\frac{1}{k}\sum \limits _{i=1}^k {\frac{1}{\vert C_i \vert -1}} \sum \limits _{A_j \in C_i -A_i^c } {\mathrm{Sim}(A_j ,\;A_i^c )} , \end{aligned}$$

where \(\mathrm{Sim}(A_{j}, A_{i}^{c})\) denotes the average dependency degree between a non-representative attribute \(A_{j}\) and the representative attribute of its cluster, computed as follows:

$$\begin{aligned} \mathrm{Sim}(A_j ,\,A_i^c )=\frac{\mathrm{Dep}(A_j ,\;A_i^c )+\mathrm{Dep}(A_i^c ,\;A_j )}{2}. \end{aligned}$$
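Both cluster-quality measures share the same structure and differ only in the pairwise measure being averaged, so they can be sketched with one helper. Here clusters is assumed to map each representative attribute (center) to the list of attributes in its cluster, and dist and sim stand for the distance and the averaged dependency degree of Sect. 3:

```python
def avg_intra(clusters, measure):
    """Average, over clusters, of the mean measure(member, center) value."""
    total = 0.0
    for center, members in clusters.items():
        others = [a for a in members if a != center]
        if others:                       # the formula assumes |C_i| > 1
            total += sum(measure(a, center) for a in others) / len(others)
    return total / len(clusters)

# AvgIntraD = avg_intra(clusters, dist);  AvgIntraS = avg_intra(clusters, sim)
```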

The values of AvgIntraS for different cluster numbers under the two discretization methods are shown in Fig. 2.

Fig. 2 The average intra-similarities for different cluster numbers under the two discretization methods

As Fig. 2 shows, the average intra-similarity increases as the cluster number increases for both discretization methods. As before, equal-width discretization performs better (gives higher intra-similarities) than equal-frequency discretization.

The main purpose of attribute clustering is to select representative attributes to replace the whole attribute set. Since the representative attributes selected by the algorithm are not always the same across runs, the algorithm was run 100 times and the frequency with which each attribute was selected as a representative attribute was counted. For example, the selection frequencies of all the attributes when the cluster number \(k\) is 3 are shown in Fig. 3.

Fig. 3 The frequencies of attributes being selected as centers for \(k = 3\)

As Fig. 3 shows, attributes 4 and 7 were the two most frequently chosen attributes in the experiments. These two attributes could therefore be chosen directly, and a third one could be selected from attributes 5, 6, 9, 17, 19 and 24, which were selected more than 10 times and formed the third cluster. The results for \(k = 6\) and \(k = 9\) are also shown in Figs. 4 and 5 for comparison.

Fig. 4 The frequencies of attributes being selected as centers for \(k = 6\)

Fig. 5 The frequencies of attributes being selected as centers for \(k = 9\)

As Figs. 4 and 5 show, the differences among the frequencies with which attributes were selected as centers become smaller as the cluster number \(k\) increases. This phenomenon results from the fact that attributes in the same cluster become more similar to each other as the cluster number increases; attributes are thus chosen as centers with a more uniform probability. In such cases, other criteria, such as attribute cost or the ratio of missing values, may be used to aid the selection of representative attributes.

7 Conclusions and future work

In this paper, we have used attribute clustering for feature selection. A measure of attribute dissimilarity based on relative dependency has been proposed to calculate the distance between two attributes. An attribute clustering algorithm, called Most Neighbors First, has also been proposed; it finds centers in dense regions instead of using the random selection of \(k\)-medoids. The proposed attribute clustering approach consists of two major phases: reassigning attributes to clusters and updating the cluster centers. After the attributes are organized into clusters by their similarity degrees, the representative attributes of the clusters can be used for classification, so that the whole feature space is greatly reduced. Besides, if the values of some representative attributes cannot be obtained in the current environment for inference, other attributes in the same clusters can be used to achieve approximate inference results.

Experimental results show that the average similarity within a cluster increases as the number of clusters increases. Besides, the discretization method is an important factor in the final results: equal-width discretization performs better than equal-frequency discretization.

Finally, the proposed attribute clustering approach needs to know the number of clusters in advance, which limits its applicability. In the future, we will try to develop new attribute clustering approaches for which the number of clusters is unknown. We will also attempt to apply the proposed approach to real application domains.