
1 Introduction

Classification is one of the most popular Data Mining topics. Its aim is to learn, from labeled patterns, a model able to predict the decision class of future, previously unseen data samples [1]. The best way to solve a classification problem is usually to have as much information as possible. In practice, however, this is not always helpful: the performance of learning algorithms may decrease when information abounds, because many examples may be irrelevant to the problem at hand or may convey redundant information [14, 17].

On the other hand, an abundance of information also increases the computational complexity of the method, particularly in the case of instance-based learners such as the kNN (k Nearest Neighbors) algorithm [10]. However, it is possible to reduce or modify the dataset without harming the learning process, thus improving performance by reducing the computational cost.

One approach to doing this is Nearest Prototype (NP) classification [6, 12], in which the decision class of a new object is computed by analyzing its proximity to a set of prototypes selected or generated from the initial set of objects. Strategies are therefore needed to reduce the input data to a set of representative prototypes. Instance-level data reduction methods fall into two categories: prototype selection [11] and prototype generation [25]. Prototype selection algorithms select a set of representative objects according to a well-defined criterion, while prototype generation algorithms build a set of new objects in the application domain from the initial ones.

On the other hand, Multi-Label Classification (MLC) is a type of classification where each object in the data is associated with a vector of outputs instead of a single value [26, 31]. ML-kNN was the first learning method to use the kNN rule for multi-label prediction [30]. This method finds the k nearest neighbours of a test object in the training set and uses the maximum a posteriori principle to determine its label set, based on the prior and posterior probabilities of the frequency of each label among those k nearest neighbours. Consequently, this method has the same drawbacks as kNN: as the dataset grows, so does the computational cost of the algorithm, since the distance from each test object to every object in the training set must be computed.
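For illustration, the following Python sketch shows how the ML-kNN decision rule can be implemented. It is a simplified reconstruction from the description above, not the exact formulation of [30]; the function names (k_nearest, mlknn_fit, mlknn_predict) and the Laplace smoothing parameter s are our own assumptions.

```python
import numpy as np

def k_nearest(X_train, x, k):
    """Indices of the k training objects closest to x (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    return np.argsort(d)[:k]

def mlknn_fit(X_train, Y_train, k=10, s=1.0):
    """Estimate label priors and posteriors with Laplace smoothing s (leave-one-out)."""
    n, q = Y_train.shape
    prior = (s + Y_train.sum(axis=0)) / (2 * s + n)            # P(H_l = 1)
    c = np.zeros((k + 1, q))      # c[j, l]: objects WITH label l having j neighbours with l
    c_not = np.zeros((k + 1, q))  # same count for objects WITHOUT label l
    for i in range(n):
        idx = k_nearest(np.delete(X_train, i, axis=0), X_train[i], k)
        counts = np.delete(Y_train, i, axis=0)[idx].sum(axis=0).astype(int)
        for l in range(q):
            (c if Y_train[i, l] else c_not)[counts[l], l] += 1
    post1 = (s + c) / (s * (k + 1) + c.sum(axis=0))            # P(C_l = j | H_l = 1)
    post0 = (s + c_not) / (s * (k + 1) + c_not.sum(axis=0))    # P(C_l = j | H_l = 0)
    return prior, post1, post0

def mlknn_predict(x, X_train, Y_train, prior, post1, post0, k=10):
    """Maximum a posteriori decision for each label of a test object x."""
    counts = Y_train[k_nearest(X_train, x, k)].sum(axis=0).astype(int)
    q = Y_train.shape[1]
    return np.array([int(prior[l] * post1[counts[l], l] >
                         (1 - prior[l]) * post0[counts[l], l]) for l in range(q)])
```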

Despite the extensive work on multi-label learning, as far as we know, only [7] proposes a prototype selection method for this setting. In this paper, we develop three methods for generating prototypes from MLC datasets. Unlike the method proposed in [7], the three proposed methods are independent of the learning algorithm to be used. To design them, we rely on Granular Computing [3, 4, 21] and on two classical ways of granulating information: condition granulation and decision granulation, i.e., granulating the universe according to the conditional attributes or to the decision class, respectively.

In the first two proposed methods, the universe is granulated using a similarity relation that builds similarity classes (or granules) of objects from the conditional attributes. Because they rely on similarity relations, these methods can be used in the presence of mixed data, i.e., when there are both numerical and nominal attributes. The third method, in contrast, granulates the universe from an equivalence relation that takes into account the different labels existing in the universe of discourse: an equivalence class (or granule) is built for each label, and a prototype is generated for each granule.

The paper is organized as follows. Section 2 motivates our research, while Sect. 3 presents the theoretical background on Granular Computing. Section 4 introduces the three prototype generation methods from MLC datasets, and Sect. 5 is dedicated to evaluating the performance of the ML-kNN algorithm on the set of prototypes generated with our methods. Finally, in Sect. 6 we provide some concluding remarks and research directions.

2 Motivation

Some classification algorithms, such as those based on example-based learning, use the training set directly to estimate the class label, which causes scalability problems as the training set grows. In this case, the number of training objects determines the computational cost of the method [13, 15]. The nearest neighbour rule is an example of a method whose computational cost is high when the number of examples is large [2].

The most popular algorithm in this category is kNN. The computational complexity of kNN is O(nm), where n is the number of objects in the dataset and m is the dimensionality of the feature space. These methods are therefore computationally very expensive on large-scale datasets. The purpose of the NP approach is to reduce the storage and processing costs of example-based learning techniques. Several papers [2, 15, 25] have addressed this issue in the context of single-label learning.

Nevertheless, to the best of our knowledge, the only relevant work relating to NP in the field of multi-label classification is the kNNc method described in [7]. It works in two stages, combining prototype selection techniques with example-based classification. First, a reduced set of objects is obtained with prototype selection techniques used in classical classification [18]; the goal of this stage is to determine the set of labels closest to those of the object to be classified. Then, the full set of samples is used, but the prediction is restricted to the labels inferred in the previous step.

Unlike that study, this research introduces three new prototype generation methods that rely on different alternatives for granulating the dataset.

3 Granular Computing

Basic issues of Granular Computing may be studied from two related aspects, the construction of granules and computation with granules. The former deals with the formation, representation, and interpretation of granules, while the latter deals with the utilization of granules in problem solving [22, 29].

In the construction of granules, it is necessary to establish criteria for deciding whether two elements should be put into the same granule, based on the available information. Typically, elements in a granule are drawn together by indistinguishability, similarity, proximity, or functionality [29].

With the granulation of the universe, one considers elements within a granule as a whole rather than individually. The loss of information caused by granulation implies that some subsets of the universe can only be approximately described. Rough Set Theory (RST) [19, 23] is one of the most representative theories within Granular Computing; it deals mainly with the approximation aspect of information granulation. It uses two main components: an information system and an indiscernibility relation. The former is defined as \(IS=(U, A)\), where U is a non-empty finite set of objects and A is a non-empty finite set of attributes that describe each object. A particular case is the decision system \(DS=(U, A \cup \{d\})\), where \(d \notin A\) is the decision attribute.

In classical RST, the indiscernibility relation (R) is defined as an equivalence relation [20]. From this point on, \([x]_R\) denotes the equivalence class of an element \(x \in U\) under R, where \([x]_R = \{y \in U: yRx \}\), i.e., the equivalence class of an element includes all objects in the universe indiscernible from x. Each equivalence class may be viewed as a granule consisting of indistinguishable elements. Two objects are equivalent only if they have exactly the same values with respect to a set of attributes, which means that two objects that are inseparable in practice could incorrectly be labeled as separable, making the relation excessively strict [28].

This problem can be alleviated to some extent by extending the concept of the inseparability relation [24] and replacing the equivalence relation with a weaker binary relation. Equation (1) shows such an indiscernibility relation,

$$\begin{aligned} R: xRy \Longleftrightarrow \delta (x,y) \ge \xi \end{aligned}$$
(1)

where \(0 \le \delta (x,y) \le 1\) is a similarity function. This weak binary relation states that objects x and y are inseparable as long as their similarity degree \(\delta (x,y)\) is not smaller than a similarity threshold \(0 \le \xi \le 1\). This relation defines a similarity class \(\overline{R}(x) = \{y \in U: yRx \}\) that replaces the equivalence class.

The similarity function could be formulated in a variety of ways, for example, \(\delta (x,y) = 1-\varphi (x,y)\) with \(\varphi (x,y)\) being the distance between objects x and y. In reference [27] the authors studied the properties of several distance functions which allow comparing heterogeneous instances, i.e., objects comprising both numerical and nominal attributes.
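As an illustration, the sketch below implements the similarity relation of Eq. (1) on top of a HEOM-like distance for mixed data. The attribute descriptors (is_numeric flags and attribute ranges) and the per-attribute normalization that keeps the distance in [0, 1] are assumptions made for this example, not part of the original formulation.

```python
def heom_distance(x, y, is_numeric, ranges):
    """HEOM-like distance in [0, 1] between two mixed-attribute objects."""
    total = 0.0
    for a, (xa, ya) in enumerate(zip(x, y)):
        if is_numeric[a]:
            total += (abs(xa - ya) / ranges[a]) ** 2   # range-normalized difference
        else:
            total += 0.0 if xa == ya else 1.0           # overlap metric for nominal values
    # dividing by the number of attributes keeps the distance (and hence the
    # similarity below) in [0, 1]; this normalization is an assumption here
    return (total / len(x)) ** 0.5

def similarity(x, y, is_numeric, ranges):
    """delta(x, y) = 1 - distance(x, y)."""
    return 1.0 - heom_distance(x, y, is_numeric, ranges)

def similarity_class(x, universe, is_numeric, ranges, xi=0.9):
    """All objects whose similarity to x reaches the threshold xi (Eq. 1)."""
    return [y for y in universe if similarity(x, y, is_numeric, ranges) >= xi]
```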

4 Methods for the Generation of Prototypes from MLC Datasets Based on Granular Computing

As mentioned, in MLC scenarios an object may be associated with multiple labels. Let \(mlDS = (U,A \cup L)\) be a multi-label decision system, where U is a non-empty finite set of objects, A is a non-empty finite set of attributes that describe each observation, and \(L = \{L_1,L_2,\ldots ,L_k\}\) is a non-empty finite set of labels, each with domain \(\{0,1\}\).

Prototype-based classification determines the decision class of a new object by analyzing its similarity to a set of prototypes generated from the initial set of objects. To apply it here, we must define what is considered a decision class in the MLC context. For example,

  • Each combination \(C_i\) of labels represents a decision value. For example, let \(L = \{L_1,L_2,L_3\}\) denote the set of labels; the combination “101” indicates that an object is associated with labels \(L_1\) and \(L_3\), so “101” defines a decision class, and all objects associated exactly with labels \(L_1\) and \(L_3\) belong to it.

  • Each label \(L_i\) is considered a decision value, so that all the objects associated with that label belong to the corresponding decision class. According to this definition, in the example above there are three decision classes.

The basic idea of the first two algorithms proposed below is similar. Both are iterative algorithms in which a similarity class is built using the similarity relation defined in Eq. (1). An object may belong to several similarity classes at the same time; however, once an object has been included in a similarity class, it is no longer used as the seed of a new similarity class. Each similarity class constitutes a granule that is used to build a prototype.

From each granule, a prototype or centroid is built for the set of similar objects it contains. Each prototype is composed of both conditional and label attributes. To aggregate the information of the granule by condition (attribute values) and by decision (label values), an aggregation operator is used. In the case of conditional attributes, the average can be used as the aggregation operator if the attribute is numerical, or the mode if the attribute is nominal.

The way in which the label part of the prototype is built differs between the two algorithms, precisely because of what each considers a decision class. In Algorithm 1 each combination of labels represents a decision value, whereas Algorithm 2 considers each label independently as a decision value. Thus, the first algorithm derives the decision class of the prototype from the most common combination of labels in the granule, while the second algorithm handles the labels independently, so the resulting prototype takes as decision values the most common labels of the objects in the granule.

Algorithm 1 (GP1mlTS): prototype generation using label combinations as decision classes (pseudocode not reproduced here).
Algorithm 2 (GP2mlTS): prototype generation treating each label as an independent decision value (pseudocode not reproduced here).
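To make the construction above concrete, the following Python sketch shows one possible implementation of the granulation and prototype-building steps shared by Algorithms 1 and 2. It follows only the textual description, so the helper names, the tie-breaking in the majority vote, and the seeding loop are our assumptions rather than the authors' exact pseudocode; objects are assumed to be (attributes, labels) pairs.

```python
from collections import Counter
from statistics import mean, mode

def aggregate_conditions(granule, is_numeric):
    """Average numerical attributes and take the mode of nominal ones."""
    attrs = [obj[0] for obj in granule]
    return [mean(col) if is_numeric[a] else mode(col)
            for a, col in enumerate(zip(*attrs))]

def labels_by_combination(granule):
    """Algorithm 1: most common label combination in the granule."""
    combos = [tuple(obj[1]) for obj in granule]
    return list(Counter(combos).most_common(1)[0][0])

def labels_by_majority(granule):
    """Algorithm 2: each label decided independently by majority vote (ties -> 1)."""
    labels = [obj[1] for obj in granule]
    n = len(labels)
    return [1 if sum(col) * 2 >= n else 0 for col in zip(*labels)]

def generate_prototypes(universe, is_numeric, sim, xi, label_rule):
    """Seed granules with objects not yet granulated and build one prototype each."""
    used, prototypes = set(), []
    for i, x in enumerate(universe):
        if i in used:
            continue  # already-granulated objects do not seed new similarity classes
        members = [j for j, y in enumerate(universe) if sim(x[0], y[0]) >= xi]
        granule = [universe[j] for j in members]   # objects may appear in several granules
        used.update(members)
        prototypes.append((aggregate_conditions(granule, is_numeric),
                           label_rule(granule)))   # labels_by_combination or labels_by_majority
    return prototypes
```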

In contrast to the first two algorithms, the third algorithm performs a different granulation of the universe. In this case, the decision class of the objects is taken into account to build the granulation of the data. The basic idea is to granulate the universe according to the labels instead of the condition attributes: a granule is built for each label, and it includes all objects annotated with that label.

The partition and the covering of an information space are two common types of granulation of the universe [9]. The granulation obtained by this algorithm is a covering, since an object may belong to two or more information granules. A prototype is then generated for each granule, in a similar way to Algorithm 2. This procedure is formalized in Algorithm 3.

Algorithm 3 (GP3mlTS): prototype generation with one granule per label (pseudocode not reproduced here).
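A possible reading of Algorithm 3 is sketched below. It reuses the helpers from the previous sketch and assumes objects are (attributes, labels) pairs, so it should be taken as an illustration of the label-driven granulation rather than the exact procedure.

```python
def generate_prototypes_per_label(universe, is_numeric, num_labels):
    """One granule per label: every object annotated with that label (a covering)."""
    prototypes = []
    for l in range(num_labels):
        granule = [obj for obj in universe if obj[1][l] == 1]
        if not granule:
            continue  # skip labels with no annotated objects
        prototypes.append((aggregate_conditions(granule, is_numeric),
                           labels_by_majority(granule)))  # label part as in Algorithm 2
    return prototypes
```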

5 Results and Discussion

In this section, we explore the performance of our prototype generation methods when coupled with the ML-kNN classification algorithm. To do so, we use the Hamming Loss (HL) metric, a well-known performance measure in MLC scenarios [16]. This metric is defined as follows,

$$\begin{aligned} HL = \frac{1}{n}\frac{1}{k} \sum _{i=1}^n{\left| Y_i \varDelta Z_i\right| } \end{aligned}$$
(2)

where n is the number of instances, k is the number of labels, and the \(\varDelta \) operator returns the symmetric difference between \(Y_i\) (the true label set of the ith instance) and \(Z_i\) (the predicted one).
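For clarity, Eq. (2) can be transcribed directly into code; the sketch below assumes that the true and predicted labels of each instance are given as sets of label indices.

```python
def hamming_loss(Y_true, Y_pred, k):
    """Eq. (2): average size of the symmetric difference, normalized by n and k."""
    n = len(Y_true)
    return sum(len(y ^ z) for y, z in zip(Y_true, Y_pred)) / (n * k)

# Example with two instances and k = 3 labels:
# hamming_loss([{0, 2}, {1}], [{0}, {1, 2}], k=3) -> (1 + 1) / (2 * 3) = 0.333...
```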

To perform the simulations, we rely on 12 multi-label datasets taken from the well-known RUMDR [8] repository. Table 1 summarizes the number of instances, attributes, and labels for each dataset. In the adopted datasets, the number of instances ranges from 1,675 to 10,491, the number of attributes from 294 to 1,836, and the number of labels from 6 to 400.

Table 1. Characterization of the MLC datasets used in our study.

We also studied the reduction coefficient, Red(.) [5]. This measure, given in Eq. (3), indicates by how much the number of objects is reduced, that is, the relation between the size of the set of prototypes (P) and that of the universe (U),

$$\begin{aligned} Red (.) = \frac{\left| U\right| -\left| P\right| }{\left| U\right| } \times 100 \end{aligned}$$
(3)
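A direct transcription of Eq. (3) is shown below for reference; it only assumes that the sizes of U and P are passed as integers.

```python
def reduction_coefficient(size_U, size_P):
    """Percentage of objects removed when the universe U is replaced by the prototype set P."""
    return (size_U - size_P) / size_U * 100

# e.g. 10,000 objects reduced to 800 prototypes -> 92.0 %
```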

Figure 1 displays the reduction coefficient achieved once the proposed prototype generation methods are used on each dataset. In this experiment, we have adopted the Heterogeneous Euclidean-Overlap Metric (HEOM), which computes the normalized Euclidean distance between numerical attributes and an overlap metric for nominal attributes [27]. The similarity threshold \(\xi \) used in Eq. (1) ranges from 0.85 to 0.95.

It is worth mentioning that, for each dataset, we estimated the HL value using a 10-fold cross-validation scheme. For each fold, this procedure splits the whole dataset into two pieces, namely, a training set and a test set. It should be highlighted that, while the training set is used to generate the set of prototypes, the test set is never modified, so it only serves to compute the HL associated with the current fold.
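A minimal sketch of this protocol is shown below, assuming scikit-learn's KFold for the splits and reusing the ML-kNN functions sketched earlier; the build_prototypes argument stands for any of the three generation methods and is an illustrative assumption.

```python
from sklearn.model_selection import KFold
import numpy as np

def evaluate(X, Y, build_prototypes, k=10):
    """10-fold CV: prototypes come from the training fold, HL is computed on the test fold."""
    scores = []
    for train_idx, test_idx in KFold(n_splits=10, shuffle=True).split(X):
        # Prototypes are generated from the training fold only ...
        P_X, P_Y = map(np.asarray, build_prototypes(X[train_idx], Y[train_idx]))
        prior, post1, post0 = mlknn_fit(P_X, P_Y, k)
        # ... while the test fold is left untouched and only used for scoring.
        Z = np.array([mlknn_predict(x, P_X, P_Y, prior, post1, post0, k)
                      for x in X[test_idx]])
        scores.append(np.not_equal(Z, Y[test_idx]).mean())  # HL on binary label matrices
    return float(np.mean(scores))
```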

From the results in Fig. 1 we can conclude that our methods achieve a reduction rate higher than \(20\%\) in most problems. The GP1mlTS and GP2mlTS methods have a similar behaviour. However, the GP3mlTS method reports reduction coefficients even higher than \(90\%\). On the other hand, Fig. 2 displays the HL values achieved by the ML-kNN method with the original multi-label datasets, and the results obtained after using the set of prototypes generated by each of the methods proposed in this paper.

Fig. 1. Reduction percentage achieved by each prototype generation method.

Fig. 2. HL values achieved by the ML-kNN method.

The results show that the prototypes generated for each dataset lead to HL values similar to those obtained with the original dataset. Only in the case of the scene dataset is there a significant difference in the HL value, especially when the set of prototypes generated by the GP3mlTS method is compared against the original dataset. This is because this dataset has few labels (exactly 6), so only a few prototypes are generated. In short, the results show that our proposal provides a suitable trade-off between the algorithm's performance and the number of training examples in the dataset.

6 Concluding Remarks

Prototype generation algorithms have proved their usefulness by alleviating some kNN issues such as computational time, noise and memory use. Despite the extensive work on multi-label classification, as far as we know, the topic of prototype generation has not received any attention so far. This paper proposes three methods based on Granular Computing for the generation of prototypes.

After analyzing the reduction coefficient, it can be concluded that the proposed methods achieve a significant reduction of the datasets through the resulting prototypes, while preserving the efficacy of the ML-kNN method in most case studies.

The set of prototypes generated by these methods could be used as a learning set for other learning algorithms, even those not intended for example-based learning.