Ranking of characteristic features in combined wrapper approaches to selection

Neural Computing and Applications, Volume 26, Issue 2, pp 329–344
Advances in Intelligent Data Processing and Analysis

Abstract

The performance of a classification system of any type can suffer from irrelevant or redundant data contained in the characteristic features that describe objects of the universe. To estimate the relevance of attributes and select their subset for a constructed classifier, typically a filter, a wrapper, or an embedded approach is implemented. The paper presents a combined wrapper framework in which, in a pre-processing step, a ranking of variables is established by a simple wrapper model employing a sequential backward search procedure. Next, another predictor exploits this resulting ordering of features in their reduction. The proposed methodology is illustrated firstly for a binary classification task of authorship attribution from the stylometric domain, and then, for additional verification, for the waveform dataset from the UCI machine learning repository.

Keywords

Feature ranking · Feature selection and reduction · Wrapper · Filter · Sequential backward search · Stylometry

1 Introduction

In supervised learning, in order to distinguish objects from one another and successfully classify them into decision classes, we first need to characterise these objects by some descriptive features. Their nature and number determine the possible types of classification systems to be constructed and their performance. When there are too many features, when some repeat others, or when the information they convey overlaps too much, the classifier can suffer [17]. Knowledge about the relevance or redundancy of individual attributes or their groups can be useful not only at a classifier’s design stage, when it is typically exploited for their selection, but also for already working solutions, to optimise them, to reduce some of the features, and to enhance understanding of the performed classification [29].

In selection and reduction of attributes, to establish their relevance or redundancy, either a filter, a wrapper, or an embedded approach can be employed [27]. Filters work separately and independently of classifiers and of their parameters or performance. They can use expert domain knowledge, if available, or some other indicators, defined functions, or measures of importance or relevance. Wrappers adapt a set of features to the specifics of the exploited classification system, based on some feedback from its work, typically the predictive accuracy [66]. In embedded approaches, selection is an inherent mechanism of the inductive learning algorithm, incorporated in it, such as pruning in artificial neural networks [30], activated relative reducts in rough sets [43, 68], or choosing a variable for a branching node in decision tree construction.

The paper presents a two-step methodology in which, in the pre-processing stage, a simple wrapper is used to establish a ranking of characteristic features through a greedy sequential backward elimination procedure [24]. The resulting ordering of variables is next imposed on another predictor to reduce its features. When both classifiers in the proposed framework share the same general characteristics, a combined wrapper is constructed; when they differ significantly, the structure can be seen as treating a wrapper as a filter, resulting in a combined wrapper-filter solution. The performance of the classifiers is observed in the perspective of gradually decreasing numbers of characteristic features involved in pattern recognition.

In the research described, two different types of inducers were employed, rule-based and connectionist, namely decision algorithms inferred with the dominance-based rough set approach (DRSA) [21, 22] and artificial neural networks with the Multilayer Perceptron (MLP) topology [19]. These classification systems were exploited separately and in combination, within the same type or in hybrid solutions [61, 62].

The procedures are firstly illustrated for binary authorship attribution, which belongs to the computational stylistics, or stylometric, area: a study of writing styles based on quantitative rather than qualitative textual descriptors, aiming at author characterisation, comparison, and recognition [4, 5]. Next, for additional verification and to provide a kind of benchmark study, the methodology is applied to the waveform dataset from the UCI machine learning repository [8].

The paper is organised as follows. Section 2 addresses the issue of the relevance of characteristic features and their ranking. The problem of variable selection and reduction is presented briefly in Sect. 3, and the proposed research methodology in Sect. 4. Section 5 provides short descriptions of the learning systems exploited in the research, the stylometric domain of application with details of the input datasets and used features, and the waveform dataset. The obtained research results are illustrated and discussed in Sect. 6, and concluding remarks are given in Sect. 7.

2 Relevance of characteristic features and their ranking

Algorithms dedicated to feature selection and reduction often refer to a concept of relevance, which can be defined in a variety of ways, as there can be many reasons for formulating such a definition [12].

Intuitively speaking, when a feature is irrelevant, it can be disregarded as useless for the induction process, which is a definition by negation. On the other hand, not all relevant attributes are in fact needed for classification to work; they can be relevant in varying degrees, and this relevance can depend on the presence or absence of other features in the considered set. Hence it should always be examined in some clearly stated context [40].

Probably the most natural notion of relevance from the perspective of feature selection problems is that of incremental usefulness: the presence of some feature results in increased performance of a classification system compared to its absence.

Definition 1

(Incremental usefulness) [41] For a given data sample DS, a learning algorithm LA, and a set of features \(A\), feature \(x_i\) is incrementally useful to LA with respect to \(A\) if the accuracy of the hypothesis produced by LA for the set of features \(A\cup \{x_i\}\) is higher than the one achieved for \(A\).

The definition is formulated for the case when adding a feature to some considered subset increases the performance. It can be extended to also cover elimination of variables, as follows.

Definition 2

(Usefulness) For a given data sample DS, a learning algorithm LA, and a set of features \(A\), feature \(x_i\) is useful to LA with respect to \(A\) if the accuracy of the hypothesis produced by LA for the set of features \(A\cup \{x_i\}\) is higher than the one achieved for \(A\). Feature \(a_i\in A\) is useful to LA with respect to \(A\) when the accuracy of the hypothesis produced by LA for \(A\) is lower than the one achieved for \(A\setminus\{a_i\}\).

Both definitions require the performance to increase or decrease after adding or, respectively, removing some feature. In practice, instead of seeking such a change in performance, it can be easier to detect the variables that are irrelevant or redundant and whose removal keeps the predictive accuracy at the same level, which leads to the concept of weak usefulness.

Definition 3

(Weak usefulness) For a given data sample DS, a learning algorithm LA, and a set of features \(A\), feature \(x_i\) is weakly useful to LA with respect to \(A\) if the accuracy of the hypothesis produced by LA for the set of features \(A\cup \{x_i\}\) is not lower than the one achieved for \(A\). Feature \(a_i\in A\) is weakly useful to LA with respect to \(A\) when the accuracy of the hypothesis produced by LA for \(A\) is not higher than the one achieved for \(A\setminus\{a_i\}\).
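For illustration, these definitions can be captured in a short Python sketch; the `evaluate` callable is a hypothetical stand-in for any train-and-test routine returning the accuracy of the hypothesis produced by a learning algorithm for a given feature subset, not a routine from the reported experiments:

```python
from typing import Callable, FrozenSet

# hypothetical train-and-test routine: feature subset -> accuracy
Evaluate = Callable[[FrozenSet[str]], float]

def is_useful(evaluate: Evaluate, A: frozenset, x: str) -> bool:
    # Definitions 1 and 2 (addition form): adding x must strictly
    # increase the accuracy achieved for A alone
    return evaluate(A | {x}) > evaluate(A)

def is_weakly_useful_on_removal(evaluate: Evaluate, A: frozenset, a: str) -> bool:
    # Definition 3 (removal form): discarding a keeps the predictive
    # accuracy at the same level or improves it, so a is a candidate
    # for elimination in the context of A
    return evaluate(A) <= evaluate(A - {a})
```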

While establishing the usefulness of individual features or their groups can be a goal in itself (since it increases understanding of the features), it can also be employed for a ranking of attributes, essentially in the same manner as retrieved documents are ranked according to their relevance to some search query [6].

Definition 4

(Ranking) [24] Given a data sample DS and a set of features \(A\), for each attribute \(a_i\in A\) a scoring function \(S\) assigns a score \(S(a_i)\), which reflects how valuable the feature is with respect to the output variable.

By convention, a high score of the ranking function indicates that a feature is valuable, and after application of the scoring procedure all variables are sorted in decreasing order of \(S(a_i)\). When attribute ranking is used to construct some classification systems, more and more variables of decreasing relevance are included in nested subsets (with progressively increasing cardinalities) that are taken into consideration [38]. When ranking is exploited in the process of feature reduction, the attributes with the lowest scores are discarded first, as we want to reject those elements which are least relevant.
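A scoring-based ranking and the nested subsets it induces can be sketched as follows (plain Python; the scores themselves are assumed to come from whatever scoring procedure is in use, and the example values are made up):

```python
def rank_features(scores: dict) -> list:
    # Definition 4: sort attributes in decreasing order of their scores,
    # so the most valuable feature comes first
    return sorted(scores, key=scores.get, reverse=True)

def nested_subsets(ranking: list) -> list:
    # subsets used in reduction: each subsequent subset drops the
    # lowest-scored (least relevant) feature still present
    return [ranking[:k] for k in range(len(ranking), 0, -1)]

# nested_subsets(rank_features({"and": 0.2, "not": 0.9, ";": 0.5}))
# -> [['not', ';', 'and'], ['not', ';'], ['not']]
```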

3 Feature selection approaches

The most natural goal of feature selection algorithms is to find the variables that are relevant and at the same time to detect those that are irrelevant or redundant. In plenty of applications, the concepts under study can be described by very high numbers of attributes, while they could also be defined by significantly fewer or simpler characteristic features, which helps in understanding the data [26]. Dimensionality reduction lowers the requirements with respect to storage and computational power, and smaller input variable sets can result in shortened processing time or improved performance.

Before execution of any feature selection procedure, several decisions must be made that bear heavily on the final outcome. A starting point in the feature space needs to be selected, and this point determines possible directions for search algorithms. Furthermore, organisation of the search, feature subset evaluation strategies, and some stopping criteria must be chosen [15].

The procedure that generates a set of attributes can start with the empty set and then add a single element (or perhaps a group of them) at a time, in forward selection [50]. Or, it can begin with some initial set from which features are subsequently eliminated, in backward reduction. It may also commence execution with a non-empty set that is in turn expanded and reduced.

Forward selection may seem an obvious choice, since it should involve lower computational costs of learning, as the majority of candidate subsets of attributes have low cardinalities. We start with many small sets which gradually increase in size, while at the same time the number of candidate sets decreases. In the case of rule classifiers, with just a few conditional attributes the process of induction of decision rules does not take much time, and storage requirements are certainly not prohibitive [47]. Yet within such a limited context, the interaction of some feature with others and its influence on classification can be more difficult to observe, and conclusions drawn with respect to its relevance could be misleading. What is more, unless the case is trivial, training a connectionist classification system with just a few inputs is much more difficult. Fewer network inputs mean fewer neurons, which work as small and simple processing units. When their number is insufficient, the network can run into trouble and have noticeable difficulty converging and then generalising to unknown data [19].

In sequential backward reduction, the features and their relevance are observed in the presence of others, and this wider context can be more advantageous; however, the initial dimensionality can be so high as to make the whole process unfeasible [1], as in this case only a minority of the candidate sets have low cardinalities. Many attributes cause a much higher number of decision rules to be inferred, and we start with correspondingly many such systems to be evaluated before the number of features decreases. On the other hand, it is far easier to have even more than the necessary inputs to an artificial neural network, as it learns quickly and the training rule is responsible for assigning the best weights to interconnections, and thereby degrees of relevance of the inputs to the produced answer.

Search for some set of relevant attributes can be executed as a separate process, completely independently of a classification system, in the filtering approach, which can then be treated as a kind of pre-processing [25]. Features can be selected, for example, randomly, or by referring to concepts of consistency, entropy, or information gain [16]. Being general in nature, filters can be employed within any domain and for any learning system, yet most often at the cost of lower predictive accuracy than available alternative solutions, which are not universal but adapted to the specifics of the task under study.

If the selection strategy is conditioned by a learning process, the wrapper approach is used [33]. Wrappers exploit the properties of the learning systems themselves, especially their classification ratio, to estimate the relevance of features and, by that, the suitability of the considered set for the particular task. Their close ties with classifiers typically result in improved performance, but with the trade-off of some loss in generality, which can cause bias.

Embedded feature selection algorithms are intertwined with the learning processes, being their part, whether explicit or implicit [36]. When a wrapper has its own mechanism dedicated to variable selection and this mechanism is actively used, it becomes in fact an embedded solution. Examples from this category include decision trees, where a feature is chosen at each branching node; artificial neural networks using pruning of input neurons [32]; and rough set theory with activated relative reducts [46, 52].

A stopping point for a search procedure is to some extent determined by the former choices with respect to the starting point, directions, and organisation of the search. Employing the concept of usefulness of features, we can stop the search process when the system shows some significant and irreparable decrease in performance, if this is the primary goal of the selection process.

Alternatively, in forward selection we can continue adding features, one after another, until the set of all available candidates is completely exhausted and we end with the full set of attributes, while in backward elimination we can discard variables until only one is left. These two extreme and opposite situations are useful mostly for observing the overall inducer's performance: when we want to find the smallest subset of variables for which the performance is best (only when all subsets are tested can we confirm that some maximum is global and not local), or when the detected characteristics of the feature set result in a ranking of variables, which can be employed for other inducers.

Feature evaluation, estimation of individual or group relevance, ranking, and selection and reduction procedures gain significantly in importance when expert domain knowledge is missing or insufficient to establish relevance, and this task is transferred to the data mining area [27]. Even when such expert knowledge is available, a search for important features governed by the principles of techniques and algorithms used to detect patterns in data can result in better understanding, knowledge discovery, and uncovering new information and relationships [10, 18].

4 Proposed research framework

The paper proposes a methodology that is a combination of feature selection approaches, exploiting two types of learning systems (rule-based and connectionist), with the objectives of: (1) observing feature relevance and usefulness through the process of sequential backward elimination, which leads to a feature ranking, and next (2) using the obtained ranking in the construction of other predictors.

The procedure consists of two subsequent phases (a minimal sketch of both is given after the next paragraph):

  1. Pre-processing ranking stage: for the initial, arbitrarily selected set of characteristic features, scoring is executed in backward reduction, based on the performance of an inducer. At each step, the single attribute is discarded whose elimination resulted in the best classification accuracy among all candidate systems at this step. The stage ends when the set of variables is exhausted, and the order in which they were eliminated gives the base for establishing a ranking of all considered features.

  2. Combined wrapper stage: following the ordering of attributes from the pre-processing stage, which defines their ranking, nested subsets of features are taken out from the initial set, and for the remaining ones new predictors are constructed. The processing stops when no variable is left to reduce.
Since, by definition and execution, ranking is a process separate from the learning algorithms induced in the second stage, following the general classification of approaches [33] we can treat it as filtering of features, which leads to wrapper-filter solutions. However, when the classifiers from both steps share characteristics, it is rather a combination of two wrappers.
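Both phases can be outlined in a few lines of Python; `score` is a hypothetical placeholder for training a classifier on the given feature subset and returning its test accuracy, not a routine from the reported experiments:

```python
from typing import Callable, List

Score = Callable[[List[str]], float]  # hypothetical train-and-test routine

def sbe_ranking(features: List[str], score: Score) -> List[str]:
    # Stage 1: sequential backward elimination; returns the features in
    # the order in which they were discarded (least relevant first)
    remaining, discarded = list(features), []
    while len(remaining) > 1:
        # discard the attribute whose elimination gives the best accuracy
        best = max(remaining,
                   key=lambda f: score([g for g in remaining if g != f]))
        remaining.remove(best)
        discarded.append(best)
    return discarded + remaining  # the survivor is ranked most relevant

def combined_wrapper(features: List[str], ranking: List[str],
                     score: Score) -> List[tuple]:
    # Stage 2: impose the ranking on another predictor, dropping the
    # lowest-ranked features one at a time
    kept, results = list(features), []
    for f in ranking:  # ranking lists the least relevant features first
        results.append((len(kept), score(kept)))
        kept.remove(f)
    return results
```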
Within the pre-processing stage at the \(i\)th step, \((N-i)\) new systems are built, \(N\) being the initial number of variables. This means that overall the number of induced classifiers equals

$$\sum_{i=0}^{N-1}(N-i) = N+(N-1)+(N-2)+\cdots+2+1 = \frac{(N+1)N}{2}. \qquad (1)$$

Depending on \(N\) and the complexity of the induction process, this number can become prohibitive and the procedures too time consuming. The execution can be sped up by observing that, although the reduction stages need to be performed in sequence (we need the results of one stage to attempt the next), within a stage all candidate systems are independent of each other. This means that they can be induced and tested in parallel, with only their results compared to make the final choice of the attribute to be eliminated.
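Under this independence observation, one stage can be evaluated concurrently, for instance with Python's standard process pool (`score` is again a hypothetical stand-in and must be a picklable, module-level function for a process pool):

```python
from concurrent.futures import ProcessPoolExecutor

def best_elimination(remaining, score):
    # evaluate all (N - i) candidate systems of one stage in parallel;
    # the stages themselves must still be performed in sequence
    candidates = [[g for g in remaining if g != f] for f in remaining]
    with ProcessPoolExecutor() as pool:
        accuracies = list(pool.map(score, candidates))
    # return the attribute whose elimination yielded the best accuracy
    return max(zip(remaining, accuracies), key=lambda pair: pair[1])[0]
```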

In the second phase, \(N\) inducers are built: the first with the complete set of \(N\) attributes, the next ones with gradually decreasing numbers of attributes, until only a single variable remains in the input set.

5 Experimental evaluation

In the research described in this paper, two distinctly different approaches to data mining were used, namely DRSA, which infers rules that form decision algorithms, and a connectionist solution of artificial neural networks (ANNs) with the MLP topology [70].

The usefulness of the proposed methodology was evaluated by application in the field of stylometry, a branch of science that involves analysis of writing styles and claims that they can be uniquely and unambiguously expressed by quantitative measures [49]. Authorship attribution is considered the most important of stylometric tasks [69]. It combines author characterisation with comparison [14] and can be regarded as classification, binary or multi-class, depending on the number of compared authors [2].

For additional verification, the same procedures were next applied to the waveform dataset from the popular UCI machine learning repository [8], to provide a benchmark study for comparisons.

5.1 DRSA processing

DRSA was invented to support multi-criteria decision-making [57]. It is a modification of the classical rough set approach (CRSA) defined by Pawlak [45].

DRSA observes monotonicity in the values of both conditional and decision attributes and, instead of just discerning (or not) classified objects as CRSA does, it assumes that all values are more or less preferred and applies weak preference and dominance relations. Preferences in datasets are defined with the help of expert domain knowledge, assigned arbitrarily, or adjusted through some additional algorithm [67]. Dominance allows not only for nominal but also for ordinal classification.

DRSA procedures induce decision rules through the process of reduction of excessive and redundant information in data sets with the help of rough approximations [37]. The sets to be approximated are dominance cones, corresponding to upward and downward unions of decision classes, and a rule classifies to either at most or at least some decision class.

The inferred rules consist of two parts: the premise, containing single or multiple conditions on individual attributes, which specify values either lower or equal, or higher or equal, to the thresholds induced from all learning samples contained in the decision table; and the decision part:

$$\text{IF}\quad \text{cond}_1\ \&\ \text{cond}_2\ \&\ \cdots\ \&\ \text{cond}_i\quad \text{THEN}\quad \text{at most decision}_s \qquad (2)$$

$$\text{IF}\quad \text{cond}_1\ \&\ \text{cond}_2\ \&\ \cdots\ \&\ \text{cond}_i\quad \text{THEN}\quad \text{at least decision}_s \qquad (3)$$

Many algorithms for induction of decision rules exist [9, 53]; probably the fastest (but not the simplest) generates only as many rules as are needed to provide a minimal cover of the learning samples [42]. The opposite approach is to construct all rules on examples and then choose only some subset of them by imposing hard constraints [64], for example a minimal required support, which indicates for how many learning samples a rule is true, or a maximal rule length, giving the number of conditions included in the premise [65]. Alternatively, some group of rules is induced, neither minimal nor complete, and then a process of pruning or adjusting them is executed [54, 55]. All these approaches offer higher chances of a good recognition ratio, yet the computational costs involved can be significant and should be weighed against the possible gains [23].
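The post-pruning by hard constraints mentioned above can be expressed directly; the rule representation below is a hypothetical simplification for illustration, not the format of any particular DRSA implementation:

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class Rule:
    conditions: List[Tuple[str, str, float]]  # (attribute, relation, threshold)
    decision: str                             # "at most"/"at least" some class
    support: int                              # learning samples the rule covers

def constrain(rules: List[Rule], min_support: int,
              max_length: Optional[int] = None) -> List[Rule]:
    # keep only rules supported by at least min_support learning samples
    kept = [r for r in rules if r.support >= min_support]
    # optionally also bound the number of conditions in the premise
    if max_length is not None:
        kept = [r for r in kept if len(r.conditions) <= max_length]
    return kept
```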

5.2 ANN classifier

Multilayer Perceptron (MLP) is a unidirectional, feed-forward artificial neural network, with neurons grouped into a number of layers. It accumulates knowledge from the training samples using some learning rule, which determines how to adjust the weights of interconnections in order to bring the value on the network output as close as possible to the expected one. Popularly, some version of the backpropagation algorithm is used, which minimises the error on the output, calculated as the difference between the desired and the actually received value, for all outputs and all training samples [19].

In the research, California Scientific Brainmaker software was used for simulation of neural networks. To lower the influence of weight initialisation on the learning phase, a multi-start approach was employed: each network was trained several times, with weights randomised before each training, and the worst, average, and best performance noted. In each case, the structure (established through tests) contained two hidden layers, with the total number of neurons in them equal to the cardinality of the currently considered set of characteristic features. The network outputs corresponded to the recognised decision classes.
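The multi-start procedure can be emulated as below; scikit-learn's MLPClassifier serves here only as a freely available stand-in for the Brainmaker simulator, and the train/test split fractions are illustrative assumptions:

```python
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def multistart_mlp(X, y, n_features, n_starts=10):
    # two hidden layers whose neuron counts sum to the cardinality of the
    # currently considered feature set, as described above
    h1 = max(1, n_features - n_features // 2)
    h2 = max(1, n_features // 2)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                              random_state=0)
    scores = []
    for seed in range(n_starts):
        # re-randomise the initial weights before each training run
        net = MLPClassifier(hidden_layer_sizes=(h1, h2), random_state=seed,
                            max_iter=2000)
        scores.append(net.fit(X_tr, y_tr).score(X_te, y_te))
    return min(scores), sum(scores) / len(scores), max(scores)
```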

5.3 Stylometric features

Categorisation of a text with respect to its subject content requires searching for some matching key words or phrases [11]. Authorship attribution means categorisation by authors, which is more challenging because we need to recognise and discern specific styles of writing, and a writing style is conveyed not in what we write about, but in how we do it [35].

Features describing styles need to refer to elements that are not easily imitated or common to many authors, and that reflect individual linguistic preferences, whether conscious or subconscious, observable in many samples [3, 34]. Popularly, either lexical or syntactic descriptors are exploited. The first provide statistical characteristics such as average word length, average sentence length, frequencies of usage for characters, words, or phrases, and distributions of all these averages and frequencies [44], while syntactic markers refer to punctuation marks and the way in which they organise the structure of the text into units of sentences and paragraphs [7]. These descriptors need to be calculated over many examples, using a sufficiently wide corpus; otherwise they would be unreliable [39].

By their very definition, all writing styles are unique to their authors; hence, even though stylometry suggests some types of candidate characteristic feature sets, there is no single universal rule for constructing them that would be applicable in all possible cases, for all writers, and regardless of the techniques employed [13, 48]. Instead, many sets of descriptors are studied and adapted to the specifics of the particular task under consideration. In the same way, several processing techniques are employed, typically either statistic-oriented computations [31, 51] or methodologies belonging to the artificial intelligence domain [28, 61].

In the research presented in this paper, the texts studied were literary works of Jane Austen and Edith Wharton, available in several electronic formats for download and online reading thanks to Project Gutenberg (http://www.gutenberg.org). The novels were divided into smaller parts of comparable size. For both the learning and testing sets, one half could be attributed to one author and the other half to the second author, giving perfectly balanced datasets.

For all these text samples, the frequencies of usage of 25 linguistic elements were calculated:
  • 17 function words: but, and, in, with, what, for, from, by, not, that, to, of, this, if, at, on, as,
  • 8 punctuation marks: a full stop, a comma, a colon, a semicolon, a question mark, an exclamation mark, a bracket, a hyphen,

employed in the earlier research on authorship attribution [60, 63]. The attributes obtained in this way have real values, which needs to be taken into account when choosing a data mining technique, though of course some discretisation strategy can also be employed [18, 58]. A sketch of such a frequency calculation is given below.
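In the sketch, normalisation by the number of words in the sample is one plausible choice and an assumption here, as the paper does not fix the normalisation at this point:

```python
import re

FUNCTION_WORDS = ["but", "and", "in", "with", "what", "for", "from", "by",
                  "not", "that", "to", "of", "this", "if", "at", "on", "as"]
PUNCTUATION_MARKS = [".", ",", ":", ";", "?", "!", "(", "-"]

def marker_frequencies(text: str) -> dict:
    # frequencies of usage for the 25 lexical and syntactic markers
    words = re.findall(r"[a-z']+", text.lower())
    total = len(words) or 1  # guard against empty samples
    freq = {w: words.count(w) / total for w in FUNCTION_WORDS}
    freq.update({p: text.count(p) / total for p in PUNCTUATION_MARKS})
    return freq
```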

5.4 Waveform dataset

The UCI machine learning repository is a popular source of datasets used as a kind of benchmark for comparisons. To make the classification task comparable to the previously described binary authorship attribution, from all available datasets the one named Waveform Database Generator (Version 1) was selected. As it is many times larger than the stylometric dataset, only a part of it was involved in the executed tests.

The number of attributes considered is 21, and there are three decision classes corresponding to three types of recognised waves. Once the complete set of 5,000 samples was put in increasing order with respect to these classes, for type 0 and type 1 the first one hundred samples were taken to be included in the learning dataset and the next fifty in the test set. In this way, also for this dataset, the classification becomes binary and the samples for the decision classes balanced.
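The construction of these balanced subsets can be sketched as follows, under the assumption that each data row holds the 21 attribute values followed by the class label:

```python
import numpy as np

def waveform_subsets(data: np.ndarray):
    # select classes 0 and 1; for each, the first hundred samples form the
    # learning set and the next fifty the test set (samples ordered by class)
    learn, test = [], []
    for cls in (0, 1):
        rows = data[data[:, -1] == cls]
        learn.append(rows[:100])
        test.append(rows[100:150])
    return np.vstack(learn), np.vstack(test)
```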

6 Research results

The experiments conducted within the described research were executed in two stages. In the first stage, the sequential backward elimination (SBE) algorithm, applied in the wrapper model, was used to establish rankings of characteristic features, revealing their relevance. The wrapper was constructed for two types of classifiers: minimal cover decision algorithms (MCDA) inferred in DRSA, and artificial neural networks.

The two obtained rankings were next employed in the second stage, where reduction of attributes was performed, again for rule and connectionist inducers, while their performance was observed. The elimination of variables for the DRSA classifier at this stage was executed in two ways: by discarding attributes and inducing new rules and algorithms, and by rejecting rules from the previously generated full decision algorithm (FDA), with all rules on examples, inferred for all considered features.

The procedures were applied to two pairs of datasets. The primary classification task was binary authorship attribution with stylometric features. For comparison's sake, the tests were also executed for the waveform dataset with similar characteristics (the same number of classes, comparable numbers of samples and attributes). The results for this second dataset are given at the end of this section.

6.1 Establishing ranking of features by SBE

Since the DRSA classifier was to be used as a wrapper with sequential backward reduction of features, this meant starting with the complete set of attributes and eliminating one element at a time. Hence, induction of all rules on examples at each step would be impractical, as for 25 features the FDA algorithm contained 62,383 constituent decision rules. Instead, minimal cover decision algorithms (MCDA) were inferred, and their performance was used to select the attribute whose reduction gave the best results compared to the others at the same level. The details of all steps are listed in Table 1, where the right-most column (i) shows the established DRSA Ranking of characteristic features.
Table 1 Backward elimination of attributes based on the performance of DRSA classifiers

| (a) | (b) | (c) | (d) | (e) | (f) | (g) | (h) | (i) |
|---|---|---|---|---|---|---|---|---|
| 0 | 25 | but and not in with on at of as this that by for to if what from . , ; : ! ? ( - | 30 | | 6 | 6 | 76.67 | and |
| 1 | 24 | but not in with on at of as this that by for to if what from . , ; : ! ? ( - | 30 | | 2 | 17 | 77.78 | ! |
| 2 | 23 | but not in with on at of as this that by for to if what from . , ; : ? ( - | 29 | | 3 | 14 | 81.11 | , |
| 3 | 22 | but not in with on at of as this that by for to if what from . ; : ? ( - | 31 | | 3 | 17 | 82.22 | ? |
| 4 | 21 | but not in with on at of as this that by for to if what from . ; : ( - | 30 | | 3 | 15 | 83.33 | what |
| 5 | 20 | but not in with on at of as this that by for to if from . ; : ( - | 30 | | 3 | 13 | 85.56 | : |
| 6 | 19 | but not in with on at of as this that by for to if from . ; ( - | 30 | | 3 | 13 | 85.56 | . |
| 7 | 18 | but not in with on at of as this that by for to if from ; ( - | 29 | | 4 | 11 | 85.56 | of |
| 8 | 17 | but not in with on at as this that by for to if from ; ( - | 28 | | 3 | 15 | 85.56 | that |
| 9 | 16 | but not in with on at as this by for to if from ; ( - | 26 | | 3 | 16 | 85.56 | ( |
| 10 | 15 | but not in with on at as this by for to if from ; - | 26 | | 3 | 16 | 85.56 | this |
| 11 | 14 | but not in with on at as by for to if from ; - | 27 | | 3 | 16 | 85.56 | but |
| 12 | 13 | not in with on at as by for to if from ; - | 24 | | 2 | 17 | 86.67 | if |
| 13 | 12 | not in with on at as by for to from ; - | 23 | | 2 | 16 | 86.67 | at |
| 14 | 11 | not in with on as by for to from ; - | 23 | | 2 | 17 | 86.67 | to |
| 15 | 10 | not in with on as by for from ; - | 23 | | 2 | 17 | 86.67 | - |
| 16 | 9 | not in with on as by for from ; | 23 | | 2 | 18 | 86.67 | with |
| 17 | 8 | not in on as by for from ; | 22 | | 2 | 20 | 86.67 | on |
| 18 | 7 | not in as by for from ; | 25 | | 2 | 22 | 88.89 | from |
| 19 | 6 | not in as by for ; | 22 | | 4 | 16 | 87.78 | ; |
| 20 | 5 | not in as by for | 21 | | 2 | 18 | 91.11 | for |
| 21 | 4 | not in as by | 17 | 15 | 7 | 11 | 91.11 | in |
| 22 | 3 | not as by | 18 | 10 | 10 | 10 | 91.11 | as |
| 23 | 2 | not by | 26 | 8 | 10 | 8 | 84.44 | by |
| 24 | 1 | not | 3 | 2 | 55 | 2 | 61.11 | not |

Columns present parameters: (a) elimination stage, (b) number of characteristic features left, (c) set of currently considered variables, (d) number of rules in the DRSA minimal cover decision algorithm without any constraints, (e) number of exact rules when they are fewer than the total number, (f) minimal support required of DRSA rules resulting in maximal classification accuracy, (g) number of exact DRSA rules meeting the constraints on support, (h) maximal predictive accuracy of the classifier (%), and (i) attribute selected to be eliminated; column (i) constitutes the DRSA Ranking

The top row of the table corresponds to the 0th reduction stage, that is, the rule classifier induced for all 25 studied conditional attributes, listed in column (c). The minimal cover decision algorithm generated consisted of 30 constituent rules, which were limited to just 6 by demanding their minimal support to be at least 6. The maximal classification accuracy gained under this constraint was 76.67 % of correctly recognised testing samples. The classification accuracy specified in the table (and in all other cases of data mining with DRSA presented in this paper) refers only to cases where all matching rules classified correctly. The ambiguous cases of contradicting decisions or no matching rules were always treated as incorrect, which is rather strict but limits the additional processing otherwise needed.
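This strict measure can be stated precisely in a few lines; here `decisions` holds, for each testing sample, the set of classes indicated by its matching rules (an assumed representation):

```python
def strict_accuracy(decisions, truth):
    # a sample counts as correct only when at least one rule matches and
    # all matching rules point to the true class; no match or contradictory
    # decisions count as incorrect
    correct = sum(1 for d, t in zip(decisions, truth) if d == {t})
    return 100.0 * correct / len(truth)
```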

Next, 25 new MCDA classifiers were constructed, each with 24 input features (one attribute eliminated), and their performance was tested and compared. Of these systems, the one with the reduced feature corresponding to the frequency of usage of "and" gave the best result, so this attribute was selected as the least relevant of all candidates and the first to be eliminated, as given in column (i) of the table.

The set of 24 remaining variables gives the base for the next reduction stage, with index 1, shown in the second row of Table 1. Again the best MCDA decision algorithm consisted of 30 rules, but with support equal to or higher than 2 there were 17 rules, with maximal classification accuracy reaching 77.78 %.

It can be observed in column (h) of the table that the classification accuracy gradually increases from 76.67 % up to the maximum of 91.11 % of correctly recognised samples when there are only 5, 4, or 3 features left in the input set, then decreases to 84.44 % for two conditional attributes and 61.11 % for a single attribute.

The process of attribute elimination can be interpreted as the system discarding those elements that are irrelevant or redundant and keeping those that are essential for classification; as a result, the classification accuracy either increases or stays at least at the same level, but for fewer features. The order in which the attributes are eliminated reflects their importance. When this order is reversed, the performance of the DRSA classifiers decreases immediately and irrecoverably, which is illustrated in Fig. 1.
Fig. 1 DRSA classification accuracy in relation to the number of features within sequential backward elimination with MCDA, compared with reduction of attributes using the reversed ranking

The same sequential backward reduction procedure was next applied to ANN classifiers (Table 2), starting with a network constructed for all 25 features. For this set, the average classification accuracy was just above 91 %. This value is obviously higher than for the base DRSA classifier, for which it was only 76.67 %. However, it should be noted that ambiguous classification by the rule-based system (contradicting decisions or no rules matching) was treated as incorrect in all considered cases, which lowered its predictive accuracy. What is more, generation of minimal cover decision algorithms does not guarantee induction of the best rules, with the highest potential for correct classification, and it is quite common that decision algorithms constructed with other approaches test significantly better, yet at the cost of more complex procedures, more computational costs, and more processing time [56].
Table 2 Backward elimination of attributes based on the performance of ANN classifiers

| (a) | (b) | (c) | (d) | (e) |
|---|---|---|---|---|
| 0 | 25 | but and not in with on at of as this that by for to if what from . , ; : ! ? ( - | 91.11 | , |
| 1 | 24 | but and not in with on at of as this that by for to if what from . ; : ! ? ( - | 93.89 | ( |
| 2 | 23 | but and not in with on at of as this that by for to if what from . ; : ! ? - | 94.44 | - |
| 3 | 22 | but and not in with on at of as this that by for to if what from . ; : ! ? | 95.56 | at |
| 4 | 21 | but and not in with on of as this that by for to if what from . ; : ! ? | 96.67 | with |
| 5 | 20 | but and not in on of as this that by for to if what from . ; : ! ? | 97.78 | what |
| 6 | 19 | but and not in on of as this that by for to if from . ; : ! ? | 97.78 | from |
| 7 | 18 | but and not in on of as this that by for to if . ; : ! ? | 97.78 | to |
| 8 | 17 | but and not in on of as this that by for if . ; : ! ? | 97.78 | for |
| 9 | 16 | but and not in on of as this that by if . ; : ! ? | 97.78 | of |
| 10 | 15 | but and not in on as this that by if . ; : ! ? | 97.78 | . |
| 11 | 14 | but and not in on as this that by if ; : ! ? | 98.89 | in |
| 12 | 13 | but and not on as this that by if ; : ! ? | 98.33 | ! |
| 13 | 12 | but and not on as this that by if ; : ? | 98.89 | this |
| 14 | 11 | but and not on as that by if ; : ? | 98.89 | but |
| 15 | 10 | and not on as that by if ; : ? | 98.89 | that |
| 16 | 9 | and not on as by if ; : ? | 98.89 | if |
| 17 | 8 | and not on as by ; : ? | 97.78 | ? |
| 18 | 7 | and not on as by ; : | 97.78 | and |
| 19 | 6 | not on as by ; : | 95.56 | by |
| 20 | 5 | not on as ; : | 94.44 | : |
| 21 | 4 | not on as ; | 95.56 | as |
| 22 | 3 | not on ; | 90.00 | on |
| 23 | 2 | not ; | 82.22 | ; |
| 24 | 1 | not | 62.22 | not |

Columns present parameters: (a) elimination stage, (b) number of characteristic features left, (c) set of currently considered variables, (d) average predictive accuracy of the classifier (%), and (e) attribute selected to be eliminated; column (e) constitutes the ANN Ranking

A positive change in the classification ratio, or the same performance for fewer inputs, is not the only indicator of attribute relevance or redundancy. When some feature is reduced, the internal structure of the classifier is accordingly modified: for DRSA processing it means fewer constituent rules in a decision algorithm, while for an artificial neural network its layers get smaller by the removal of neurons.

If such a smaller network classifies no worse than before the reduction, the relevance of the recently discarded input is negligible and it can be treated as redundant. The performance is illustrated in Fig. 2, while Fig. 3 shows what happens to the classification accuracy of the system when the input features are reduced following the reversed ANN Ranking. The two graphs in Figs. 2 and 3 show the same trends as are visible in the previously plotted performance of DRSA classifiers in Fig. 1.
Fig. 2 ANN classification accuracy observed in the sequential backward elimination process, in relation to the number of considered features; for each average, the maximal and minimal performance is indicated

Fig. 3 ANN classification accuracy in relation to the number of features, observed in backward reduction of inputs following the reversed ANN Ranking; for each average, the maximal and minimal performance is indicated

When we compare the DRSA and ANN Rankings against each other and analyse the scores assigned to all attributes, we can see that even though both types of classifiers operate on the same datasets, the resulting orderings of reduced features are different; only the last remaining feature is the same in both rankings: the frequency of usage of "not". This is a direct result of the inherent characteristics of the inducers, which are transferred to the rankings calculated with their help.

As wrappers are often accused of such bias, the obtained rankings need to be observed in the process of reduction of characteristic features for other classification systems, by combining wrappers of the same and different types, to evaluate their usefulness through tests, as illustrated in the next section.

6.2 Employing ranking of features in their reduction

Following the general categorisation of feature selection approaches [33], ranking belongs with filters. In the research presented, two rankings were obtained using DRSA- and ANN-based wrappers, given in the right-most columns of Tables 1 and 2. These orderings were next used to filter out the conditional attributes from the original set of 25, in backward elimination of input variables for new classifiers.

The details of the application of the ANN Ranking to backward reduction of attributes in DRSA processing, which results in a hybrid solution, are shown in Table 3. Firstly, subsets of features with increasing cardinalities were rejected; then, for the remaining subsets, new decision algorithms were induced, both by providing just a minimal cover (MCDA) and by inferring all rules on examples (FDA).
Table 3 Backward elimination of conditional attributes using the ANN Ranking, with induction of new decision algorithms

Columns (c)–(g) are given first for the minimal cover DA and then for the all rules on examples DA, both induced after attribute elimination.

| (a) | (b) | (c) | (d) | (e) | (f) | (g) | (c) | (d) | (e) | (f) | (g) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 30 | | 6 | 6 | 76.67 | 62,383 | | 65–66 | 17 | 86.67 |
| 1 | 24 | 44 | | 2 | 22 | 71.11 | 55,418 | | 61–62 | 21 | 86.67 |
| 2 | 23 | 44 | | 2 | 24 | 71.11 | 44,836 | | 61–62 | 21 | 86.67 |
| 3 | 22 | 44 | | 2 | 24 | 71.11 | 37,881 | | 61–62 | 21 | 86.67 |
| 4 | 21 | 44 | | 2 | 24 | 71.11 | 29,401 | | 61–62 | 21 | 86.67 |
| 5 | 20 | 40 | | 2 | 29 | 67.78 | 23,146 | | 61–62 | 21 | 86.67 |
| 6 | 19 | 42 | | 2–3 | 20 | 67.78 | 18,325 | | 61–62 | 21 | 86.67 |
| 7 | 18 | 40 | | 3 | 20 | 71.11 | 13,693 | | 61–62 | 20 | 86.67 |
| 8 | 17 | 39 | | 2–3 | 19 | 71.11 | 10,495 | | 61–62 | 20 | 86.67 |
| 9 | 16 | 32 | | 6–8 | 5 | 77.78 | 7,214 | | 61–62 | 16 | 85.56 |
| 10 | 15 | 30 | | 6–33 | 4 | 75.56 | 5,066 | | 61–62 | 16 | 85.56 |
| 11 | 14 | 31 | | 2 | 25 | 77.78 | 3,535 | | 61–62 | 16 | 85.56 |
| 12 | 13 | 35 | | 2, 4–11 | 6 | 67.78 | 2,534 | | 61–62 | 16 | 85.56 |
| 13 | 12 | 28 | | | | 75.56 | 1,822 | | 61–62 | 15 | 85.56 |
| 14 | 11 | 31 | | 1–7 | 8 | 71.11 | 1,197 | | 55–62 | 11 | 84.44 |
| 15 | 10 | 29 | | 4–7 | 11 | 78.89 | 636 | | 55–62 | 11 | 84.44 |
| 16 | 9 | 21 | | 4–11 | 11 | 78.89 | 433 | | 55–62 | 11 | 84.44 |
| 17 | 8 | 18 | | 1–10 | 7 | 78.89 | 311 | | 55–62 | 11 | 84.44 |
| 18 | 7 | 20 | | 1–10 | 7 | 76.67 | 199 | | 55–62 | 11 | 84.44 |
| 19 | 6 | 20 | | 1–3 | 18 | 84.44 | 109 | | 55–62 | 11 | 84.44 |
| 20 | 5 | 13 | | 25 | 6 | 83.33 | 40 | | 55–62 | 10 | 84.44 |
| 21 | 4 | 26 | 10 | 6–7 | 7 | 81.11 | 72 | 22 | 12–20 | 14 | 78.89 |
| 22 | 3 | 25 | 4 | 5–34 | 3 | 65.56 | 18 | 11 | 3–31 | 8 | 65.56 |
| 23 | 2 | 22 | 3 | 5–54 | 7 | 61.11 | 7 | 5 | 1–55 | 3 | 61.11 |
| 24 | 1 | 3 | 2 | 1–54 | 2 | 61.11 | 3 | 2 | 1–55 | 2 | 61.11 |

Columns present parameters: (a) elimination stage, (b) number of characteristic features left, (c) number of all rules in a decision algorithm, (d) number of exact rules in a decision algorithm when they are fewer than the total number of rules, (e) value or range of values for minimal support required of rules resulting in maximal classification accuracy, (f) minimal number of rules meeting the constraints, and (g) maximal classification accuracy (%)

Since classification accuracy is usually treated as the most important factor indicating the quality of the obtained solution, we can focus our attention on the two (g) columns in Table 3, or on the graph in Fig. 4. For both MCDA and FDA classifiers there are several cases of improved or equal performance when features are reduced, yet the gain, whether considered in terms of the number of rejected features, the increase in predictive accuracy, or the lower number of decision rules remaining in the algorithm, is not as high as was observed previously for the simple ANN or MCDA wrappers.
Fig. 4 Classification accuracy for MCDA and FDA decision algorithms induced after backward attribute elimination based on the ANN Ranking, in relation to the number of features

Instead of reducing conditional attributes and then inferring new decision algorithms, which can be very time consuming, we can also eliminate these attributes by discarding the rules with conditions on them, limiting the all rules on examples decision algorithm induced previously for all features [59, 63]. Such an approach can be considered as execution of a ranking for decision rules.

Firstly, a score is assigned to each rule in the induced algorithm, based on the individual scores of all attributes included in the premise part of the rule. From all these elementary scores, corresponding to the constituent conditions, the highest one is chosen, indicating the attribute perceived as the least important (and thus the first to be eliminated), and this score is given to the decision rule. Then all rules are ordered by their scores, and in each step of reduction all rules with a certain score are rejected, which results in reduced decision algorithms.
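Read literally, the scores boil down to the elimination order: a rule inherits the stage at which its earliest-eliminated (least important) premise attribute was discarded, and step k rejects every rule still referring to the attribute discarded at stage k. The sketch below follows this reading and reuses the simplified rule representation from Sect. 5.1; both are assumptions for illustration:

```python
def reduce_rules_by_ranking(rules, elimination_stage):
    # elimination_stage maps attribute -> stage at which it was discarded
    # in the feature ranking (1 = discarded first = least important)
    def rule_stage(rule):
        return min(elimination_stage[a] for a, _, _ in rule.conditions)
    reduced = []
    for k in sorted(set(elimination_stage.values())):
        # decision algorithm remaining after reduction step k
        reduced.append([r for r in rules if rule_stage(r) > k])
    return reduced
```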

The details of this decision rule ranking procedure are given in Table 4. For comparison, the results of FDA algorithm reduction following the reversed ANN Ranking are also listed; both are plotted in Fig. 5.
Table 4 Reduction of the all rules on examples algorithm (FDA) using the ANN feature ranking and its reverse

Columns (c)–(g) are given first for the ANN Ranking and then for its reverse.

| (a) | (b) | (c) | (d) | (e) | (f) | (g) | (c) | (d) | (e) | (f) | (g) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 24 | , | 55,418 | 61–62 | 21 | 86.67 | not | 61,382 | 48 | 49 | 80.00 |
| 2 | 23 | ( | 44,836 | 61–62 | 21 | 86.67 | ; | 56,666 | 38 | 35 | 78.89 |
| 3 | 22 | - | 37,881 | 61–62 | 21 | 86.67 | on | 51,063 | 38 | 35 | 78.89 |
| 4 | 21 | at | 29,401 | 61–62 | 21 | 86.67 | as | 40,112 | 38 | 28 | 78.89 |
| 5 | 20 | with | 23,146 | 61–62 | 21 | 86.67 | : | 36,947 | 38 | 28 | 78.89 |
| 6 | 19 | what | 18,325 | 61–62 | 21 | 86.67 | by | 28,085 | 38 | 20 | 77.78 |
| 7 | 18 | from | 13,693 | 61–62 | 20 | 86.67 | and | 20,140 | 38 | 20 | 77.78 |
| 8 | 17 | to | 10,495 | 61–62 | 20 | 86.67 | ? | 17,000 | 38 | 20 | 77.78 |
| 9 | 16 | for | 7,214 | 61–62 | 16 | 85.56 | if | 13,272 | 38 | 19 | 77.78 |
| 10 | 15 | of | 5,066 | 61–62 | 16 | 85.56 | that | 10,711 | 38 | 18 | 77.78 |
| 11 | 14 | . | 3,564 | 61–62 | 16 | 85.56 | but | 7,666 | 38 | 13 | 77.78 |
| 12 | 13 | in | 2,580 | 61–62 | 16 | 85.56 | this | 5,265 | 21 | 71 | 76.67 |
| 13 | 12 | ! | 1,880 | 61–62 | 15 | 85.56 | ! | 3,678 | 21 | 58 | 76.67 |
| 14 | 11 | this | 1,239 | 55–62 | 11 | 84.44 | in | 2,572 | 21 | 56 | 76.67 |
| 15 | 10 | but | 741 | 55–62 | 11 | 84.44 | . | 1,776 | 21 | 51 | 73.33 |
| 16 | 9 | that | 533 | 55–62 | 11 | 84.44 | of | 1,070 | 21 | 41 | 73.33 |
| 17 | 8 | if | 377 | 55–62 | 11 | 84.44 | for | 578 | 12 | 63 | 66.67 |
| 18 | 7 | ? | 255 | 55–62 | 11 | 84.44 | to | 282 | 5 | 63 | 60.00 |
| 19 | 6 | and | 171 | 55–62 | 11 | 84.44 | from | 141 | 5 | 42 | 57.78 |
| 20 | 5 | by | 93 | 55–62 | 10 | 84.44 | what | 48 | | 48 | 41.11 |
| 21 | 4 | : | 41 | 1–20 | 21 | 78.89 | with | 22 | 19 | 2 | 35.56 |
| 22 | 3 | as | 24 | 1–31 | 9 | 65.56 | at | 14 | 19 | 2 | 35.56 |
| 23 | 2 | on | 12 | 1–55 | 4 | 61.11 | - | 3 | 12 | 2 | 35.56 |
| 24 | 1 | ; | 10 | 1–55 | 4 | 61.11 | ( | | | | |
| 25 | 0 | not | | | | | , | | | | |

Columns present parameters: (a) elimination stage, (b) number of characteristic features left, (c) attribute eliminated at this stage, (d) number of all rules in a decision algorithm, (e) minimal support required of rules resulting in maximal classification accuracy, (f) minimal number of rules meeting the constraints, and (g) maximal classification accuracy (%)

Application of the ANN Ranking in reduction of the FDA results in a rather steep decrease in the number of remaining decision rules, while the classifiers predict with the same or only slightly reduced accuracy. The reversed ANN Ranking brings much slower algorithm reduction, but the performance worsens instantly and irreparably.
Fig. 5 Reduction of the FDA algorithm following the ANN Ranking and its reverse; the predictive accuracy is plotted in relation to the number of features

As establishing the DRSA Ranking through sequential backward elimination with generation of minimal cover decision algorithms is treated as a separate process, this ranking can also be used in the procedure of decision rule ranking and reduction, limiting the all rules on examples algorithm; the results are given in Table 5 and the performance shown in Fig. 6.
Table 5 Backward elimination of decision rules from the all rules on examples (FDA) algorithm induced for all features, following the DRSA Ranking of attributes and its reverse

Columns (c)–(g) are given first for the DRSA Ranking (SBE for MCDA) and then for its reverse.

| (a) | (b) | (c) | (d) | (e) | (f) | (g) | (c) | (d) | (e) | (f) | (g) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 24 | and | 47,064 | 66 | 17 | 86.67 | not | 61,382 | 48 | 49 | 80.00 |
| 2 | 23 | ! | 37,662 | 66 | 16 | 86.67 | by | 47,968 | 48 | 43 | 80.00 |
| 3 | 22 | , | 32,655 | 62 | 20 | 86.67 | as | 37,258 | 45 | 43 | 77.78 |
| 4 | 21 | ? | 27,671 | 62 | 20 | 86.67 | in | 27,552 | 44 | 54 | 78.89 |
| 5 | 20 | what | 21,473 | 62 | 20 | 86.67 | for | 20,377 | 44 | 46 | 78.89 |
| 6 | 19 | : | 19,736 | 62 | 20 | 86.67 | ; | 18,047 | 33 | 27 | 77.78 |
| 7 | 18 | . | 14,716 | 62 | 20 | 86.67 | from | 13,423 | 33 | 24 | 77.78 |
| 8 | 17 | of | 10,964 | 62 | 20 | 86.67 | on | 11,763 | 33 | 24 | 77.78 |
| 9 | 16 | that | 8,575 | 62 | 20 | 86.67 | with | 8,661 | 33 | 24 | 77.78 |
| 10 | 15 | ( | 6,751 | 62 | 20 | 86.67 | - | 7,603 | 33 | 24 | 77.78 |
| 11 | 14 | this | 4,907 | 59 | 23 | 86.67 | to | 5,324 | 24 | 49 | 75.55 |
| 12 | 13 | but | 3,440 | 59 | 23 | 86.67 | at | 3,924 | 24 | 49 | 75.56 |
| 13 | 12 | if | 2,462 | 59 | 23 | 86.67 | if | 2,880 | 25 | 41 | 73.33 |
| 14 | 11 | at | 1,795 | 59 | 23 | 86.67 | but | 1,950 | 18 | 75 | 74.44 |
| 15 | 10 | to | 1,208 | 59 | 23 | 86.67 | this | 1,195 | 11 | 113 | 67.78 |
| 16 | 9 | - | 854 | 59 | 23 | 86.67 | ( | 858 | 11 | 109 | 67.78 |
| 17 | 8 | with | 624 | 59 | 23 | 86.67 | that | 594 | 11 | 84 | 66.67 |
| 18 | 7 | on | 533 | 59 | 23 | 86.67 | of | 312 | 13 | 44 | 57.78 |
| 19 | 6 | from | 335 | 59 | 20 | 86.67 | . | 205 | 9 | 40 | 57.78 |
| 20 | 5 | ; | 209 | 13 | 79 | 88.89 | : | 162 | 9 | 40 | 57.78 |
| 21 | 4 | for | 107 | 10 | 63 | 88.89 | what | 85 | 8 | 25 | 47.78 |
| 22 | 3 | in | 65 | 10 | 46 | 90.00 | ? | 58 | 8 | 25 | 47.78 |
| 23 | 2 | as | 35 | 10 | 30 | 82.22 | , | 26 | 3 | 16 | 13.33 |
| 24 | 1 | by | 10 | 55 | 4 | 61.11 | ! | 4 | 4 | 1 | 4.44 |
| 25 | 0 | not | | | | | and | | | | |

Columns list parameters: (a) elimination stage, (b) number of characteristic features left, (c) attribute eliminated at this stage, (d) number of remaining rules without constraints, (e) minimal support required of rules to arrive at the highest classification accuracy, (f) number of rules meeting the constraints on support, and (g) maximal classification accuracy (%)

The tendencies visible in the predictive accuracy of the reduced decision algorithms, when following the DRSA Ranking and its reverse, directly resemble those observed previously in the wrapper mode when the ranking was established. The procedures enable filtering out from the FDA algorithm those rules which contain conditions on irrelevant attributes, and return algorithms with significantly decreased numbers of decision rules while maintaining or even increasing the classification accuracy.
Fig. 6 Reduction of the all rules on examples decision algorithm following the DRSA feature ranking and its reverse

When the DRSA Ranking was employed in reduction of the input characteristic features of the artificial neural network, it resulted in yet another hybrid solution. At each elimination stage, a single feature was disregarded and its influence on the network performance studied, as plotted in Fig. 7; Fig. 8 shows the case when the reversed ranking is exploited. Comparison of these two graphs reveals a very close resemblance to Fig. 1, which illustrates the performance of the DRSA wrapper employing SBE.
Fig. 7 Reduction of characteristic features for the ANN classifier following the DRSA Ranking; the predictive accuracy is plotted in relation to the number of features, and for each average the maximal and minimal performance is indicated

Fig. 8 Reduction of characteristic features for the ANN classifier following the reversed DRSA Ranking; the predictive accuracy is plotted in relation to the number of features, and for each average the maximal and minimal performance is indicated

Of all the tested combinations of wrappers, the best performance was displayed by ANN classifiers employing the DRSA Ranking in backward elimination of features (Fig. 7). Good results were also obtained in reduction of the all rules on examples algorithm generated for all features, following the DRSA Ranking (Table 5; Fig. 6). In this case, however, the result can be explained by the wrapper bias, as two systems of the same type, sharing the same characteristics, are combined. The same cannot be said of the former case, as the differences between DRSA and ANN classifiers are clearly shown in the observed process of sequential backward elimination of features, which results in two distinctively different rankings.

Using the ANN Ranking in backward attribute reduction and then inducing new rules and algorithms enables discarding eight variables (32 %) before the performance starts decreasing (Table 3; Fig. 4). The ANN Ranking in FDA reduction also brings rejection of eight variables, and of as many as 51,888 decision rules (83 %). Application of the reversed rankings, both DRSA- and ANN-based, always resulted in worsened performance.

6.3 Results for waveform dataset

The attributes of the waveform dataset are not described in detail in the UCI ML repository; therefore, for convenience, they were simply labelled from a1 to a21, and the two decision classes corresponded to the selected wave types, type 0 and type 1. The two rankings obtained by sequential backward elimination for DRSA and ANN classifiers are given in Table 6, with details of the induced algorithms and the performance of both systems, which is also plotted for both types of classification systems in Fig. 9.
Table 6 Backward elimination of attributes based on the performance of DRSA and ANN classifiers for the waveform dataset

Columns (c)–(h) refer to the DRSA classifier; the final pair (g)–(h) refers to the ANN classifier.

| (a) | (b) | (c) | (d) | (e) | (f) | (g) | (h) | (g) | (h) |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 21 | 55 | | 3 | 20 | 65 | a2 | 89 | a2 |
| 1 | 20 | 56 | | | | 67 | a18 | 92 | a19 |
| 2 | 19 | 50 | | | | 70 | a20 | 92 | a10 |
| 3 | 18 | 49 | | 2 | 30 | 72 | a6 | 93 | a12 |
| 4 | 17 | 50 | | 2 | 35 | 73 | a16 | 94 | a17 |
| 5 | 16 | 52 | | 2 | 31 | 74 | a15 | 93 | a13 |
| 6 | 15 | 54 | | 4 | 17 | 73 | a19 | 92 | a8 |
| 7 | 14 | 54 | | 2 | 33 | 74 | a12 | 91 | a5 |
| 8 | 13 | 52 | | 3 | 32 | 74 | a13 | 91 | a21 |
| 9 | 12 | 49 | | 2 | 20 | 77 | a8 | 90 | a7 |
| 10 | 11 | 53 | | 2 | 35 | 77 | a3 | 87 | a14 |
| 11 | 10 | 48 | | 2 | 35 | 74 | a21 | 87 | a4 |
| 12 | 9 | 48 | | 3 | 24 | 74 | a17 | 87 | a1 |
| 13 | 8 | 45 | | 3 | 27 | 78 | a14 | 88 | a3 |
| 14 | 7 | 31 | 28 | 5 | 17 | 79 | a9 | 84 | a20 |
| 15 | 6 | 30 | 21 | 8 | 10 | 81 | a5 | 82 | a18 |
| 16 | 5 | 36 | 24 | 6 | 13 | 80 | a1 | 82 | a16 |
| 17 | 4 | 33 | 20 | 9 | 11 | 79 | a4 | 82 | a6 |
| 18 | 3 | 34 | 13 | 4 | 12 | 80 | a7 | 68 | a15 |
| 19 | 2 | 40 | 6 | | | 68 | a11 | 51 | a11 |
| 20 | 1 | 3 | 2 | | | 39 | a10 | 50 | a9 |

Columns present parameters: (a) elimination stage, (b) number of characteristic features left, (c) number of rules in the DRSA minimal cover decision algorithm without any constraints, (d) number of exact rules when they are fewer than the total number, (e) minimal support required of DRSA rules resulting in maximal classification accuracy, (f) number of exact DRSA rules meeting the constraints on support, (g) predictive accuracy of the classifier (%), and (h) attribute selected to be eliminated

Fig. 9 Performance of DRSA and ANN classifiers observed in the sequential backward elimination of characteristic features

The performance of the classifiers is compared against each other and against the reference point constituted by the predictive accuracies obtained for the complete set of 21 attributes. The induced minimal cover decision algorithm classified correctly only 65 %, with its 55 rules limited to 20 by the constraint on support of at least 3. The all rules on examples algorithm achieved a 74 % recognition ratio (31,718 rules, constrained to 58 for support equal to or higher than 48). An ANN with 21 input features recognised correctly 89 % of the testing samples.

When the DRSA Ranking of features is applied for systematic reduction of the inputs to connectionist classifiers, some increase in performance can be observed in the initial phase (see Fig. 10), yet the visible trend is not strictly monotonic. The same ranking was also employed for reduction of rules from the all rules on examples algorithm, in the procedures described before, and in this process significant gains can be observed: we can reduce 17 out of 21 attributes (close to 81 %) and still obtain increased performance. This, however, comes as no surprise, as both inducers share the same general characteristics, hence the resulting bias.
Fig. 10 Pruning of inputs for the ANN classifier compared to pruning of rules from the all rules on examples decision algorithm induced for the complete set of attributes, with dimensionality reduction executed following the DRSA Ranking

Imposing the ANN Ranking on DRSA processing is again performed in two ways: either new decision rules are induced for the gradually decreasing subsets of conditional attributes (both MCDA and FDA), or the set of rules from the previously inferred complete decision algorithm is analysed and rules are rejected when they refer to discarded features. The details of all resulting solutions are given in Table 7. From the observed performance we can see that, for all rules on examples, it is possible to reject 13 out of 21 conditional variables (almost 62 %) while the recognition not only does not worsen, but increases.
Table 7

Backward elimination of conditional attributes using ANN Ranking with induction of new decision algorithms and with reduction of full decision algorithm previously inferred

  

                Induction of DA after attribute elimination           Reduction of
                Minimal cover DA       All rules on examples DA       rules from FDA
(a)  (b)   (c)  (d)  (e)  (f)  (g)    (c)     (d)   (e)  (f)  (g)    (c)     (e)  (f)  (g)
  1   20   56    –    –    –   67     25,176    –   48    58  75     25,176  48    58  74
  2   19   56    –    2   33   64     20,041    –   48    58  74     20,041  48    58  74
  3   18   61    –    2   36   64     15,909    –   37    97  71     15,909  37    97  73
  4   17   59    –    2   39   58     12,177    –   33    95  76     12,177  33    95  76
  5   16   54    –    2   30   61      9,872    –   33    95  76      9,872  33    95  76
  6   15   53    –    –    –   64      6,835    –   33    90  76      6,835  33    90  76
  7   14   55    –    2   32   63      4,925    –   18   211  75      4,925  18   211  75
  8   13   53    –    3   18   64      3,408    –   18   185  75      3,489  18   186  75
  9   12   58    –    –    –   67      2,235    –   10   269  75      2,298  10   272  75
 10   11   64    –    2   40   68      1,388    –   11   212  78      1,444  11   215  78
 11   10   60    –    –    –   68        976    –   11   184  78      1,028  11   186  78
 12    9   62   60    3   26   66        796  635   11   137  78        672  11   139  78
 13    8   56   46    3   23   64      1,090  340   11    97  75        368  11   101  75
 14    7   51   37    –    –   66        942  187   25    45  71        230  26    46  71
 15    6   51   37    –    –   67        473  135    5    78  73        166   5    92  73
 16    5   46   33    –    –   68        271  101    5    65  73        130   5    81  73
 17    4   49   31    –    –   68        145   68    5    50  69         90   5    65  70
 18    3   37   16    –    –   70         47   24   14    21  67         45  17    29  68
 19    2   32   11    8   10   68         33   18   17    18  67         33  17    27  67
 20    1    3    2    –    –   30          3    2    –     –  30          7  26     4  30

Columns present parameters: (a) elimination stage, (b) number of characteristic features left, (c) number of all rules in a decision algorithm, (d) number of exact rules in a decision algorithm when they are fewer than the total number of rules, (e) value for minimal support required of rules resulting in maximal classification accuracy, (f) minimal number of rules meeting constraints, and (g) maximal classification accuracy (%)
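
The values in columns (e) and (f) follow from a plain threshold scan: candidate minimal supports are tried in turn, and the one maximising accuracy on the verification samples (with the smaller rule set breaking ties) is kept. In the sketch below, the rule triples and the classify routine, e.g. a vote of the matching rules, are illustrative assumptions rather than the software actually used.

def best_support_threshold(rules, X_val, y_val, classify):
    # rules: iterable of (conditions, decision, support) triples
    best_t, best_acc, best_n = 0, -1.0, 0
    for t in sorted({s for _, _, s in rules}):
        kept = [r for r in rules if r[2] >= t]
        if not kept:
            break
        hits = sum(classify(kept, x) == y for x, y in zip(X_val, y_val))
        acc = hits / len(y_val)
        # Prefer higher accuracy; on ties prefer the smaller rule set
        if acc > best_acc or (acc == best_acc and len(kept) < best_n):
            best_t, best_acc, best_n = t, acc, len(kept)
    return best_t, best_acc, best_n  # column (e), accuracy, column (f)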

When the all rules on examples decision algorithms (a newly induced one and the reduced FDA) are compared at each stage, it becomes apparent that they are in fact very close. Even though the numbers of rules involved are not always exactly the same, the resulting classification accuracy is almost identical, which suggests choosing the second way, that is, reduction of the FDA generated for the complete set of features instead of induction of new algorithms. It requires significantly less effort, as the hard part of the computations has already been executed; once some method for pruning rules is established, its execution can be less demanding than the induction process.

For comparison, some tests with reversed rankings were also performed, in which the highest-ranked attributes were discarded first. The results were worse than for the corresponding solutions that kept the most highly ranked variables, with differences depending on the number of elements reduced, often increasing along with it.

All conducted experiments, for both the stylometric and the waveform dataset, confirm the usefulness of the proposed methodology of combining wrappers for estimation of feature relevance, used next in their backward reduction.

7 Conclusions

Filter and wrapper are two approaches to selection and reduction of characteristic features, which can be used as a way to observe their relevance or redundancy for the considered classification task. Filters work independently of the particular learning system employed for pattern recognition, while wrappers condition the choice of attributes on the performance of the classifier. When a wrapper is used to establish a ranking of characteristic features in a separate process, it can be treated as a filter for another classification system. The paper presents a methodology that combines wrapper approaches, applied to observe the relevance of characteristic features for two binary classification tasks with balanced data.

In the pre-processing stage of the wrapper mode, minimal cover decision algorithms inferred in DRSA and artificial neural networks with MLP topology are used to establish two rankings of the studied features through their sequential backward elimination. The resulting orderings are next employed as filters for inputs to new inducers, of the same and of different type. Only the application of reversed rankings resulted in worsened performance; in all other cases there were several alternative smaller subsets of variables for which the classification accuracy stayed at the same or an increased level.

The primary classification task was authorship attribution, which belongs to computational stylistics, a study of writing styles that requires observations of linguistic habits and preferences and employs stylometric characteristic features. For verification, the same reduction procedures were applied to another dataset, taken from the UCI Machine Learning Repository. The results of the conducted experiments for both datasets show similar performance trends in the perspective of dimensionality reduction, which validates the proposed research framework.

Acknowledgments

4eMka Software used for induction of decision rules within DRSA processing [20, 57] was developed at the Laboratory of Intelligent Decision Support Systems (http://www-idss.cs.put.poznan.pl/), Poznan University of Technology, Poland.

References

1. Abraham A, Falcón R, Bello R (eds) (2009) Rough set theory: a true landmark in data analysis, studies in computational intelligence, vol 174. Springer, Berlin
2. Ahonen H, Heinonen O, Klemettinen M, Verkamo A (1997) Applying data mining techniques in text analysis. Technical report C-1997-23, Department of Computer Science, University of Helsinki, Finland
3. Alzahrani S, Salim N, Abraham A (2012) Understanding plagiarism linguistic patterns, textual features, and detection methods. IEEE Trans Syst Man Cybern Part C Appl Rev 42(2):133–149
4. Argamon S, Burns K, Dubnov S (eds) (2010) The structure of style: algorithmic approaches to understanding manner and meaning. Springer, Berlin
5. Argamon S, Karlgren J, Shanahan J (2005) Stylistic analysis of text for information access. In: Proceedings of the 28th international ACM conference on research and development in information retrieval, Brazil
6. Asadi N, Lin J (2013) Document vector representations for feature extraction in multi-stage document ranking. Inf Retr 16:747–768
7. Baayen H, van Haltern H, Tweedie F (1996) Outside the cave of shadows: using syntactic annotation to enhance authorship attribution. Lit Linguist Comput 11(3):121–132
8. Bache K, Lichman M (2014) UCI machine learning repository. http://archive.ics.uci.edu/ml/
9. Bayardo R Jr, Agrawal R (1999) Mining the most interesting rules. In: Proceedings of the 5th ACM SIGKDD international conference on knowledge discovery and data mining, pp 145–154
10. Bellal F, Elghazel H, Aussem A (2012) A semi-supervised feature ranking method with ensemble learning. Pattern Recogn Lett 33:1426–1433
11. Berber Sardinha T (1999) Using key words in text analysis: practical aspects. ftp://ftp.liv.ac.uk/pub/linguistics
12. Blum A, Langley P (1997) Selection of relevant features and examples in machine learning. Artif Intell 97:245–271
13. Burrows J (2004) Textual analysis. In: Schreibman S, Siemens R, Unsworth J (eds) A companion to digital humanities. Blackwell, Oxford
14. Craig H (2004) Stylistic analysis and authorship studies. In: Schreibman S, Siemens R, Unsworth J (eds) A companion to digital humanities. Blackwell, Oxford
15. Dash M, Liu H (1997) Feature selection for classification. Intell Data Anal 1:131–156
16. Dash M, Liu H (2003) Consistency-based search in feature selection. Artif Intell 151:155–176
17. Ferreira A, Figueiredo M (2012) Efficient feature selection filters for high-dimensional data. Pattern Recogn Lett 33:1794–1804
18. Ferreira A, Figueiredo M (2014) Incremental filter and wrapper approaches for feature discretization. Neurocomputing 123:60–74
19. Fiesler E, Beale R (1997) Handbook of neural computation. Oxford University Press, Oxford
20. Greco S, Matarazzo B, Słowiński R (1999) The use of rough sets and fuzzy sets in multi criteria decision making. In: Gal T, Hanne T, Stewart T (eds) Advances in multiple criteria decision making, chap 14. Kluwer Academic, Dordrecht, pp 14.1–14.59
21. Greco S, Matarazzo B, Słowiński R (2001) Rough set theory for multicriteria decision analysis. Eur J Oper Res 129(1):1–47
22. Greco S, Matarazzo B, Słowiński R (2007) Dominance-based rough set approach as a proper way of handling graduality in rough set theory. Trans Rough Sets 7:36–52
23. Greco S, Słowiński R, Stefanowski J (2007) Evaluating importance of conditions in the set of discovered rules. Lect Notes Artif Intell 4482:314–321
24. Guyon I, Elisseeff A (2003) An introduction to variable and feature selection. J Mach Learn Res 3:1157–1182
25. Guyon I, Gunn S, Nikravesh M, Zadeh L (eds) (2006) Feature extraction. Foundations and applications. Springer, Berlin
26. Jelonek J, Krawiec K, Stefanowski J (1998) Comparative study of feature subset selection techniques for machine learning tasks. In: Proceedings of the 7th workshop on intelligent information systems
27. Jensen R, Shen Q (2008) Computational intelligence and feature selection. Wiley, Hoboken
28. Jockers M, Witten D (2010) A comparative study of machine learning methods for authorship attribution. Lit Linguist Comput 25(2):215–223
29. John G, Kohavi R, Pfleger K (1994) Irrelevant features and the subset selection problem. In: Cohen W, Hirsh H (eds) Machine learning: proceedings of the 11th international conference. Morgan Kaufmann, Los Altos, pp 121–129
30. Kavzoglu T, Mather P (2011) Assessing artificial neural network pruning algorithms. In: Proceedings of the 24th annual conference and exhibition of the remote sensing society. Greenwich, UK, pp 603–609
31. Khmelev D, Tweedie F (2001) Using Markov chains for identification of writers. Lit Linguist Comput 16(4):299–307
32. Kingston G, Maier H, Lambert M (2004) A statistical input pruning method for artificial neural networks used in environmental modelling. In: Transactions of the 2nd biennial meeting of the international environmental modelling and software society. Osnabrueck, Germany, pp 87–92
33. Kohavi R, John G (1997) Wrappers for feature subset selection. Artif Intell 97:273–324
34. Koppel M, Argamon S, Shimoni A (2002) Automatically categorizing written texts by author gender. Lit Linguist Comput 17(4):401–412
35. Koppel M, Schler J, Argamon S (2009) Computational methods in authorship attribution. J Am Soc Inf Sci Technol 60(1):9–26
36. Lal T, Chapelle O, Weston J, Elisseeff A (2006) Embedded methods. In: Guyon I, Gunn S, Nikravesh M, Zadeh L (eds) Feature extraction. Foundations and applications, studies in fuzziness and soft computing. Springer, Berlin, pp 137–165
37. Li S, Li T, Liu D (2013) Incremental updating approximations in dominance-based rough sets approach under the variation of the attribute set. Knowl Based Syst 40:17–26
38. Liu H, Motoda H (2008) Computational methods of feature selection. Chapman and Hall/CRC, Boca Raton
39. Lynam T, Clarke C, Cormack G (2001) Information extraction with term frequencies. In: Proceedings of the human language technology conference. San Diego, USA, pp 1–4
40. Mansoori E (2013) Using statistical measures for feature ranking. Int J Pattern Recogn Artif Intell 27(1):1350003–1350014
41. Molina L, Belanche L, Nebot A (2002) Feature selection algorithms: a survey and experimental evaluation. In: Kumar V, Tsumoto S, Zhong N et al (eds) Proceedings of IEEE international conference on data mining, pp 306–313
42. Moshkov M, Piliszczuk M, Zielosko B (2006) On partial covers, reducts and decision rules with weights. Trans Rough Sets 6:211–246
43. Moshkow M, Skowron A, Suraj Z (2007) On covering attribute sets by reducts. In: Kryszkiewicz M, Peters J, Rybinski H, Skowron A (eds) Rough sets and emerging intelligent systems paradigms, LNCS (LNAI), vol 4585. Springer, Berlin, pp 175–180
44. Munro R (2003) A queing-theory model of word frequency distributions. In: Proceedings of the 1st Australasian language technology workshop. Melbourne, Australia, pp 1–8
45. Pawlak Z (1982) Rough sets. Int J Comput Inf Sci 11(5):341–356
46. Pawlak Z (2002) Computing, artificial intelligence and information technology: rough sets, decision algorithms and Bayes' theorem. Eur J Oper Res 136:181–189
47. Pawlak Z (2002) Rough sets and intelligent data analysis. Inf Sci 147:1–12
48. Pearl L, Steyvers M (2012) Detecting authorship deception: a supervised machine learning approach using author writeprints. Lit Linguist Comput 27(2):183–196
49. Peng R, Hengartner H (2002) Quantitative analysis of literary styles. Am Stat 56(3):15–38
50. Reif M, Shafait F (2014) Efficient feature size reduction via predictive forward selection. Pattern Recogn 47:1664–1673
51. Schaalje G, Blades N, Funai T (2013) An open-set size-adjusted Bayesian classifier for authorship attribution. J Am Soc Inf Sci Technol 64(9):1815–1825
52. Shen Q (2006) Rough feature selection for intelligent classifiers. Trans Rough Sets 7:244–255
53. Sikora M (2006) Rule quality measures in creation and reduction of data rule models. In: Greco S, Hata Y, Hirano S, Inuiguchi M, Miyamoto S, Nguyen H, Słowiński R (eds) Rough sets and current trends in computing, lecture notes in computer science, vol 4259. Springer, Berlin, pp 716–725
54. Sikora M (2013) Induction and pruning of classification rules for prediction of microseismic hazards in coal mines. Exp Syst Appl 38(2):6748–6758
55. Sikora M (2013) Redefinition of classification rules by evaluation of elementary conditions occurring in the rule premises. Fundam Inf 123(2):171–197
56. Sikora M, Wróbel L (2011) Data-driven adaptive selection of rule quality measures for improving the rule induction algorithm. LNCS 6743:279–287
57. Słowiński R, Greco S, Matarazzo B (2007) Dominance-based rough set approach to reasoning about ordinal data. LNCS (LNAI) 4585:5–11
58. Stańczyk U (2009) Relative reduct-based selection of features for ANN classifier. In: Cyran K, Kozielski S, Peters J, Stańczyk U, Wakulicz-Deja A (eds) Man–machine interactions, AISC, vol 59. Springer, Berlin, pp 335–344
59. Stańczyk U (2010) DRSA decision algorithm analysis in stylometric processing of literary texts. In: Szczuka M, Kryszkiewicz M, Ramanna S, Jensen R, Hu Q (eds) Rough sets and current trends in computing, LNCS (LNAI), vol 6086. Springer, Berlin, pp 600–609
60. Stańczyk U (2010) Rough set-based analysis of characteristic features for ANN classifier. In: Grana Romay M, Corchado E, Garcia-Sebastian M (eds) Hybrid artificial intelligence systems. Part 1, LNCS (LNAI), vol 6076. Springer, Berlin, pp 565–572
61. Stańczyk U (2011) Application of DRSA-ANN classifier in computational stylistics. In: Kryszkiewicz M, Rybiński H, Skowron A, Raś Z (eds) Foundations of intelligent systems, ISMIS'11 proceedings, LNCS (LNAI), vol 6804. Springer, Berlin, pp 695–704
62. Stańczyk U (2011) On performance of DRSA-ANN classifier. In: Corchado E, Kurzyński M, Woźniak M (eds) Hybrid artificial intelligence systems. Part 2, LNCS (LNAI), vol 6679. Springer, Berlin, pp 172–179
63. Stańczyk U (2011) Reduct-based analysis of decision algorithms: application in computational stylistics. In: Corchado E, Kurzyński M, Woźniak M (eds) Hybrid artificial intelligence systems. Part 2, LNCS (LNAI), vol 6679. Springer, Berlin, pp 295–302
64. Stańczyk U (2012) Rule-based approach to computational stylistics. In: Bouvry P, Kłopotek M, Marciniak M, Mykowiecka A, Rybiński H (eds) Security and intelligent information systems, LNCS (LNAI), vol 7053. Springer, Berlin, pp 168–179
65. Stańczyk U (2013) Decision rule length as a basis for evaluation of attribute relevance. J Intell Fuzzy Syst 24(3):429–445
66. Stańczyk U (2013) Establishing relevance of characteristic features for authorship attribution with ANN. In: Decker H, Lhotska L, Link S, Tjoa BJA (eds) Database and expert systems applications, LNCS, vol 8056. Springer, Berlin, pp 1–8
67. Stańczyk U (2013) On preference order of DRSA conditional attributes for computational stylistics. In: Decker H, Lhotska L, Link S, Tjoa BJA (eds) Database and expert systems applications, LNCS, vol 8056. Springer, Berlin, pp 26–33
68. Stańczyk U (2013) Relative reduct-based estimation of relevance for stylometric features. In: Catania B, Guerrini G, Pokorny J (eds) Advances in databases and information systems, LNCS, vol 8133. Springer, Berlin, pp 135–147
69. Stamatatos E (2009) A survey of modern authorship attribution methods. J Am Soc Inf Sci Technol 60(3):538–556
70. Waugh S, Adams A, Tweedie F (2000) Computational stylistics using artificial neural networks. Lit Linguist Comput 15(2):187–198

Copyright information

© The Author(s) 2014

Open Access. This article is distributed under the terms of the Creative Commons Attribution License, which permits any use, distribution, and reproduction in any medium, provided the original author(s) and the source are credited.

Authors and Affiliations

Institute of Informatics, Silesian University of Technology, Gliwice, Poland
