
1 Background

Since its conception in 1990, fuzzy rough set theory [2] has been applied as part of a growing number of machine learning algorithms [17]. Simultaneously, the distribution and communication of machine learning algorithms have spread beyond academic literature to a multitude of publicly available software implementations [7, 10, 19]. During the same period, Python has grown from its first release in 1991 [13] to become one of the world’s most popular high-level programming languages.

Python has become especially popular in the field of data science, in part due to the self-reinforcing growth of its package ecosystem. This includes scikit-learn [11], which is currently the go-to general purpose Python machine learning library, and which contains a large collection of algorithms.

Only a limited number of fuzzy rough set machine learning algorithms have received publicly available software implementations. Variants of Fuzzy Rough Nearest Neighbours (FRNN) [5], Fuzzy Rough Rule Induction [6], Fuzzy Rough Feature Selection (FRFS) [1] and Fuzzy Rough Prototype Selection (FRPS) [14, 15] are included in the R package RoughSets [12], and have also been released for use with the Java machine learning software suite WEKA [3, 4].

So far, none of these algorithms seem to have been made available for Python in a systematic way. In this paper, we present an initial version of fuzzy-rough-learn, a Python library that fills this gap. At present, it includes FRNN, FRFS, FRPS, as well as FROVOCO [18] and FRONEC [16], two more recent algorithms designed for imbalanced and multilabel classification. These implementations all make use of a significant modification of classical fuzzy rough set theory: the incorporation of Ordered Weighted Averaging (OWA) operators in the calculation of upper and lower approximations for increased robustness [1].

We discuss the use cases and design philosophy of fuzzy-rough-learn in Sect. 2, and provide an overview of the included algorithms in Sect. 3.

2 Use Cases and Design Philosophy

The primary goal of fuzzy-rough-learn is to provide implementations of fuzzy rough set algorithms. The target audience is researchers with some programming skills, in particular those who are familiar with scikit-learn. We envision two principal use cases:

  • The application of fuzzy rough set algorithms to solve concrete machine learning problems.

  • The creation of new or modified fuzzy rough set algorithms to handle new types of data or to achieve better performance.

A third use case falls somewhat in between these two: reproducing or benchmarking against results from existing fuzzy rough set algorithms.

To facilitate the first use case, fuzzy-rough-learn is available from the two main Python package repositories, PyPI and conda-forge, making it easy to install with both pip and conda. fuzzy-rough-learn has an integrated test suite to limit the opportunities for bugs to be introduced. API documentation is integrated in the code and automatically updated online whenever a new version is released, and includes references to the literature.

We believe that it is important to make fuzzy rough set algorithms available not just for use, but also for adaptation, since it is impossible to predict or accommodate all requirements of future researchers. Therefore, the source code for fuzzy-rough-learn is hosted on GitHub and freely available under the MIT license. We have attempted to write accessible code by striving for consistency and modularity. The coding style of fuzzy-rough-learn is a compromise between object-oriented and functional programming. It makes use of classes to model the different components of the classification algorithms, but as a rule, functions and methods have no side-effects. Finally, subject to these design principles, fuzzy-rough-learn generally follows the conventions of scikit-learn and the terminology of the cited literature.

3 Contents

fuzzy-rough-learn implements three of the fuzzy rough set algorithms mentioned in Sect. 1: FRFS, FRPS and FRNN, making them available in Python for the first time. In addition, we have included two recent, more specialised classifiers: the ensemble classifier FROVOCO, designed to handle imbalanced data, and the multi-label classifier FRONEC.

Together, these five algorithms form a representative cross-section of fuzzy rough set algorithms in the literature. In the future, we intend to build upon this basis by adding more algorithms (Table 1).

3.1 Fuzzy Rough Feature Selection (FRFS)

Fuzzy Rough Feature Selection (FRFS) [1] greedily selects features that induce the greatest increase in the size of the positive region, until it matches the size of the positive region with all features, or until the required number of features is selected.

Table 1. Parameters of FRFS in fuzzy-rough-learn

The positive region is defined as the union of the lower approximations of the decision classes in X. Its size is the sum of its membership values.

The similarity relation \(R_B\) for a given subset of attributes B is obtained by aggregating with a t-norm the per-attribute similarities \(R_a\) associated with the attributes a in B. These are in turn defined, for any \(x, y \in X\), as the complement of the difference between the attribute values \(x_a\) and \(y_a\) after rescaling by the sample standard deviation \(\sigma _a\) (1).

$$\begin{aligned} R_a(x, y) = \max (1 - \frac{\left|x_a - y_a\right|}{\sigma _a}, 0) \end{aligned}$$
(1)
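As an illustration (not the fuzzy-rough-learn API), the following sketch computes the per-attribute similarities of (1) and aggregates them into \(R_B\). The Łukasiewicz t-norm is used here purely as an example choice of t-norm; all names are hypothetical.

```python
import numpy as np

def attribute_similarity(X, a):
    """R_a(x, y) for all pairs of instances in X, for attribute a, as in (1)."""
    values = X[:, a]
    sigma = values.std(ddof=1)                            # sample standard deviation
    diff = np.abs(values[:, None] - values[None, :]) / sigma
    return np.maximum(1 - diff, 0)

def similarity_relation(X, B):
    """R_B: per-attribute similarities over the attributes in B, aggregated
    with the Lukasiewicz t-norm T(p, q) = max(p + q - 1, 0) (example choice)."""
    R = np.ones((len(X), len(X)))
    for a in B:
        R = np.maximum(R + attribute_similarity(X, a) - 1, 0)
    return R
```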
Table 2. Parameters of FRPS in fuzzy-rough-learn

3.2 Fuzzy Rough Prototype Selection (FRPS)

Fuzzy Rough Prototype Selection (FRPS) [14, 15] uses upper and/or lower approximation membership as a quality measure to select instances. It proceeds in three steps (a simplified sketch follows the list):

  1. Calculate the quality of each training instance. The resulting values are the potential thresholds for selecting instances (Table 2).

  2. For each potential threshold and corresponding candidate instance set, count the number of instances in the overall dataset that have the same decision class as their nearest neighbour within the candidate instance set (excluding the instance itself).

  3. Return the candidate instance set with the highest number of matches. In case of a tie, return the largest such set.
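The following simplified sketch illustrates these three steps. The quality measure and the distance metric are passed in as assumptions (the choice of metric is discussed further below), and all names are illustrative rather than part of the fuzzy-rough-learn API.

```python
import numpy as np

def frps_select(X, y, quality, metric):
    """Return the indices of the selected prototypes.
    quality(X, y) returns one quality value per training instance;
    metric(A, b) returns the distances from each row of A to the point b."""
    q = quality(X, y)                              # step 1: instance qualities
    best_score, best_size, best_set = -1, -1, None
    for tau in np.unique(q):                       # potential thresholds
        candidate = np.flatnonzero(q >= tau)       # candidate instance set
        if len(candidate) < 2:
            continue                               # singleton sets cannot be evaluated
        score = 0
        for i in range(len(X)):
            others = candidate[candidate != i]     # exclude the instance itself
            nn = others[np.argmin(metric(X[others], X[i]))]
            score += y[nn] == y[i]                 # step 2: count class matches
        # step 3: keep the best-scoring set; ties go to the largest set
        if score > best_score or (score == best_score and len(candidate) > best_size):
            best_score, best_size, best_set = score, len(candidate), candidate
    return best_set
```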

There are a number of differences between the implementations in [15] and [14]. In each case, the present implementation follows [14]:

  • While [15] uses instances of all decision classes to calculate upper and lower approximations, [14] calculates the upper approximation membership of an instance using only instances of the same decision class, and its lower approximation membership using only instances of the other decision classes. This choice affects the length over which the weight vector is ‘stretched’.

  • In addition, [14] excludes each instance from the calculation of its own upper approximation membership, while [15] does not.

  • [15] uses additive weights, while [14] uses inverse additive weights.

  • [15] defines the similarity relation R by aggregating the per-attribute similarities \(R_a\) using the Łukasiewicz t-norm, whereas [14] recommends using the mean.

  • In case of a tie between several best-scoring candidate prototype sets, [15] returns the set corresponding to the median of the corresponding thresholds, while [14] returns the largest set (corresponding to the smallest threshold).

In addition, there are two implementation issues not addressed in [15] or [14]:

  • It is unclear what metric the nearest neighbour search should use. It seems reasonable that it should either correspond to the similarity relation R (and therefore incorporate the same aggregation strategy from per-attribute similarities), or that it should match whatever metric is used by nearest neighbour classification subsequent to FRPS. By default, the present implementation uses Manhattan distance on the scaled attribute values.

  • When the largest quality measure value corresponds to a singleton candidate instance set, it cannot be evaluated (because the single instance in that set has no nearest neighbour). Since this is an edge case that would not score highly anyway, it is simply excluded from consideration.

3.3 Fuzzy Rough Nearest Neighbour (FRNN) Multiclass Classification

Fuzzy Rough Nearest Neighbours (FRNN) [5] provides a straightforward way to apply fuzzy rough sets for classification. Given a new instance y, we obtain class scores by calculating the membership degree of y in the upper and lower approximations of each decision class and taking the mean. This implementation uses OWA weights, but limits their application to the k nearest neighbours of each class, as suggested by [8] (Table 3).
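As a rough illustration of this calculation (not the library API), the sketch below derives class scores from precomputed similarities between the new instance and the training instances. It assumes a single decreasing OWA weight vector of length k, at least k training instances per class, and, as a simplification, pools all other classes in the lower approximation.

```python
import numpy as np

def frnn_class_scores(sims, classes, weights, k):
    """sims[i] = R(x_i, y): similarity of the new instance y to training instance x_i;
    classes[i]: decision class of x_i; weights: decreasing OWA weight vector of length k."""
    scores = {}
    for c in np.unique(classes):
        same = np.sort(sims[classes == c])[::-1][:k]    # k most similar members of c
        other = np.sort(sims[classes != c])[::-1][:k]   # k most similar members of other classes
        upper = same @ weights                          # soft maximum of similarities
        lower = (1 - other) @ weights                   # soft minimum of dissimilarities
        scores[c] = (upper + lower) / 2                 # mean of upper and lower membership
    return scores
```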

Table 3. Parameters of FRNN in fuzzy-rough-learn

3.4 Fuzzy Rough OVO Combination (FROVOCO) Multiclass Classification

Fuzzy Rough OVO COmbination (FROVOCO) [18] is an ensemble classifier specifically designed for, but not restricted to, imbalanced data, which adapts itself to the Imbalance Ratio (IR) between classes. It balances one-versus-one decomposition with two global class affinity measures (Table 4).

Table 4. Parameters of FROVOCO in fuzzy-rough-learn

In a binary classification setting, the lower approximation of one class corresponds to the upper approximation of the other class, so when using OWA weights, the effective number of weight vectors to be chosen is 2. FROVOCO uses the IR-weighting scheme, which depends on the IR between the classes. If the IR is less than 9, both classes are approximated with exponential weights. If the IR is 9 or more, the smaller class is approximated with exponential weights, while the larger class is approximated with a reduced additive weight vector of effective length k equal to 10% of the number of instances.
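The sketch below spells out one possible reading of this weighting scheme. The exponential and additive weight definitions are common OWA choices and are given here as assumptions, not necessarily the exact vectors used by fuzzy-rough-learn.

```python
import numpy as np

def exponential_weights(n):
    w = 2.0 ** -np.arange(1, n + 1)        # 1/2, 1/4, 1/8, ...
    return w / w.sum()

def additive_weights(n):
    w = np.arange(n, 0, -1, dtype=float)   # n, n-1, ..., 1
    return w / w.sum()

def ir_weights(n_small, n_large):
    """OWA weight vectors for the smaller and larger class under the IR-weighting scheme."""
    if n_large / n_small < 9:
        return exponential_weights(n_small), exponential_weights(n_large)
    # larger class: additive weights reduced to an effective length k of 10% of the instances
    k = max(1, round(0.1 * n_large))
    w_large = np.zeros(n_large)
    w_large[:k] = additive_weights(k)
    return exponential_weights(n_small), w_large
```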

Provided with a training set X, and a new instance y, FROVOCO calculates the class score of y for a class C from the following components:

  • \(V(C, y)\) weighted vote: For each other class \(C' \ne C\), calculate the upper approximation memberships of y in C and \(C'\), using the IR-weighting scheme. Rescale each pair of values so they sum to 1, then sum the resulting scores.

  • \(mem(C, y)\) positive affinity: Calculate the average of the membership degrees of y in the upper and lower approximations of C, using the IR-weighting scheme.

  • \(mse_n(C, y)\) negative affinity: For each class \(C'\), calculate the average positive affinity of the members of C in \(C'\). Combining these averages over all classes yields the signature vector \(S_C\). Calculate the mean squared error between the vector of positive affinities of y (one per class) and \(S_C\), and divide it by the sum of the corresponding mean squared errors over all classes.

The final class score is calculated from these components in (2), where m is the number of classes.

$$\begin{aligned} AV(C, y) = \frac{V(C, y) + mem(C, y)}{2} - \frac{1}{m}mse_n(C, y). \end{aligned}$$
(2)
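For illustration, (2) translates directly into code (hypothetical names; following the text, m is taken to be the number of classes):

```python
def frovoco_score(vote, pos_affinity, neg_affinity, n_classes):
    """AV(C, y) from V(C, y), mem(C, y) and mse_n(C, y) as in (2)."""
    return (vote + pos_affinity) / 2 - neg_affinity / n_classes
```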
Table 5. Parameters of FRONEC in fuzzy-rough-learn

3.5 Fuzzy Rough Neighbourhood Consensus (FRONEC) Multilabel Classification

Fuzzy Rough Neighbourhood Consensus (FRONEC) [16] is a multilabel classifier. It combines the instance similarity R, based on the instance attributes, with label similarity \(R_d\), based on the label sets of instances. It offers two possible definitions for \(R_d\). The first, \(R_d^{(1)}\), is simply Hamming similarity scaled to [0, 1]. The second label similarity, \(R_d^{(2)}\), takes into account the prior probability \(p_l\) of a label l in the training set. Let L be the set of possible labels, and \(L_1, L_2\) two particular label sets. Then \(R_d^{(2)}\) is defined as follows (Table 5):

$$\begin{aligned} \begin{aligned} a&= \sum _{l \in L_1 \cap L_2} (1 - p_l)\\ b&= \sum _{l \in L \setminus (L_1 \cup L_2)} p_l\\ R_d^{(2)}&= \frac{a + b}{a + b + \frac{1}{2}\left|L_1 \varDelta L_2 \right|} \end{aligned} \end{aligned}$$
(3)
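As a small illustration, (3) can be computed from Python sets of labels as follows (hypothetical names; p maps each label to its prior probability in the training set):

```python
def label_similarity_2(L1, L2, labels, p):
    """R_d^(2)(L1, L2) as in (3); `labels` is the set L of all possible labels."""
    a = sum(1 - p[l] for l in L1 & L2)            # shared labels, weighted by rarity
    b = sum(p[l] for l in labels - (L1 | L2))     # shared absences, weighted by frequency
    sym_diff = len(L1 ^ L2)                       # |L1 Δ L2|
    return (a + b) / (a + b + sym_diff / 2)
```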

Provided with a training set X, and a new instance y, FRONEC predicts the label set of y by identifying the training instance with the highest ‘quality’ in relation to y. There are three possible quality measures, based on the upper and lower approximations.

$$\begin{aligned} \begin{aligned} Q_1(y, x)&= OWA_{w_l}(\{I(R(z, y), R_d(x, z)) \vert z \in N(y)\})\\ Q_2(y, x)&= OWA_{w_u}(\{T(R(z, y), R_d(x, z)) \vert z \in N(y)\})\\ Q_3(y, x)&= \frac{Q_1(y, x) + Q_2(y, x)}{2} \end{aligned} \end{aligned}$$
(4)

Here, \(R_d\) is a choice of label similarity, T is the Łukasiewicz t-norm, I is the Łukasiewicz implication, and N(y) is the set of the k nearest neighbours of y in X, for a choice of k.
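A minimal sketch of these quality measures, assuming that the similarities R(z, y) and the label similarities R_d(x, z) over the neighbourhood N(y) have already been computed as arrays (names and layout are illustrative, not the library API):

```python
import numpy as np

def owa(values, weights):
    """Weighted average of the values sorted in descending order."""
    return np.sort(values)[::-1] @ weights

def q1(sim_to_y, label_sim_to_x, w_l):
    # lower-approximation-style quality: Lukasiewicz implication I(a, b) = min(1 - a + b, 1)
    return owa(np.minimum(1 - sim_to_y + label_sim_to_x, 1), w_l)

def q2(sim_to_y, label_sim_to_x, w_u):
    # upper-approximation-style quality: Lukasiewicz t-norm T(a, b) = max(a + b - 1, 0)
    return owa(np.maximum(sim_to_y + label_sim_to_x - 1, 0), w_u)

def q3(sim_to_y, label_sim_to_x, w_l, w_u):
    return (q1(sim_to_y, label_sim_to_x, w_l) + q2(sim_to_y, label_sim_to_x, w_u)) / 2
```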

For a choice of quality measure Q, FRONEC predicts the label set of the training instance with the highest quality. If there are several such training instances, it predicts all labels that appear in at least half of them.

3.6 OWA Operators and Nearest Neighbour Searches

Each of the algorithms in fuzzy-rough-learn uses OWA operators [20] to calculate upper and lower approximations. OWA operators take the weighted average of an ordered collection of real values. By choosing suitably skewed weight vectors, OWA operators can thus act as soft maxima and minima. The advantage of defining upper and lower approximations with soft rather than strict maxima and minima is that the result is more robust, since it no longer depends completely on a single value.
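A minimal numerical illustration of this soft maximum/minimum behaviour (independent of the library's OWAOperator class):

```python
import numpy as np

def owa(values, weights):
    return np.sort(values)[::-1] @ weights    # weights applied in descending order of value

x = np.array([0.2, 0.9, 0.5, 0.4])
w = np.array([0.6, 0.25, 0.1, 0.05])          # skewed towards the largest values

soft_max = owa(x, w)                          # 0.715: close to, but below, max(x) = 0.9
soft_min = owa(x, w[::-1])                    # 0.315: close to, but above, min(x) = 0.2
```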

To allow experimentation with other weights, we have included a range of pre-defined weight types, as well as a general OWAOperator class that can be extended and instantiated by users and passed as a parameter to the various classes.

Similarly, users may customise the nearest neighbour search algorithm that is used in all classes except FRFS by defining their own subclass of NNSearch. For example, by choosing an approximate nearest neighbour search such as Hierarchical Navigable Small World [9], we obtain Approximate FRNN [8].