1 Introduction

Three-way decision (TWD) is a recent paradigm that emerged from rough set theory (RST) and is acquiring its own status and visibility [46]. This paradigm is based on the simple idea of thinking in three “dimensions” (rather than in binary terms) when considering how to represent computational objects. This idea leads to the so-called trisecting-acting-outcome (TAO) model [82]: Trisecting addresses the question of how to divide the universe under investigation into three parts; Acting explains how to deal with the three parts identified; and Outcome gives methodological indications on how to evaluate the adopted strategy.

Based on the TAO model, we propose a framework to handle uncertainty in Machine Learning: this model can be applied both to the input and the output of the Learning algorithm. Obviously, these two latter aspects are strictly related and they mutually affect each other in real applications. Schematically, the framework looks as illustrated in Table 1.

Table 1. TAO model applied to Machine Learning

With reference to the table, we distinguish between applications that handle uncertainty in the input and those that handle uncertainty with respect to the output. By uncertainty in the input we mean different forms of uncertainty that are already explicitly present in the training datasets used by ML algorithms. By uncertainty in the output we mean mechanisms adopted by the ML algorithm in order to create more robust models or to make the (inherent and partly insuppressible) predictive uncertainty more explicit.

In the following Sections, we will explain in more detail the different parts of the framework outlined in Table 1, and discuss the recent advances and current research in the framework areas by means of a narrative review of the literature indexed by the Google Scholar database. In particular, in Sect. 2, we describe the different steps of the proposed model with respect to the handling of uncertainty in the input, while in Sect. 3 we do the same for the handling of the uncertainty in the output. In Sect. 4, we will then discuss the advantages of incorporating TWD and the TAO model for uncertainty handling into Machine Learning, and some relevant future directions.

2 Handling Uncertainty in the Input

Real-world datasets are far from perfect: typically they are affected by different forms of uncertainty (often missingness) that can mainly be related to the data acquisition process, to the complexity (e.g., in terms of volatility) of the phenomena under consideration, or to both factors.

These forms of uncertainty are usually distinguished in three common variants:

  1. Missing data: this is usually the most common type of uncertainty in the input [6]. The dataset could contain missing values in its predictive features either because the original value was not recorded (e.g. the data was collected at two separate times, and the instrumentation to measure the feature was available only at one of them), or because it was subsequently lost or considered irrelevant (e.g. a doctor decided not to measure the BMI of a seemingly healthy person). This type of uncertainty has been the most studied, typically under the data imputation perspective, that is, the task in which missing values are filled in before any subsequent ML process. This can be done in various ways, with techniques based on clustering [34, 65], statistical or regression approaches [7], or rough set and fuzzy rough set methods [4, 51, 67];

  2. Weak supervision: in the case of supervised problems, the supervision (i.e. the target or decision variable) is only given in an imprecise form or only partially specified. This type of uncertainty has seen some increase in interest in recent years [105], with a growing literature focusing specifically on superset learning [17, 29]; this is a specific type of weak supervision in which instances are associated with sets of possible but mutually exclusive labels that are guaranteed to contain the true value of the decision label;

  3. Multi-rater annotation: this form of uncertainty is becoming more and more relevant due to the increasing use of crowdsourcing [5, 23, 69] for data annotation purposes, but it is also inherent in many domains where it is common (and in fact recommended) practice to involve multiple experts to increase the reliability of the Ground Truth, which is a crucial requirement in many situations where ML models are applied to sensitive or critical tasks (such as diagnostic tasks in medicine). Involving multiple raters who annotate the dataset independently of each other often results in multiple and conflicting decision labels for a given instance [9], a common phenomenon that has been denoted with many expressions, like observer variability or inter-rater reliability.

While superficially similar (e.g. weak supervision could be seen as a form of missing data), these types of uncertainty differ enough in their inherent problems, and in the methods used to handle them, that they should be distinguished. In the case of missing data, the main problem is to build reliable models of knowledge despite the incomplete information, and the completion of the dataset is but a means to an end, often under assumptions that are difficult to attain (or verify). In the case of weak supervision, on the other hand, the task of completion (which is usually called disambiguation) is of fundamental importance and the goal is, usually, to simultaneously build ML models and disambiguate the uncertain instances. Finally, in the case of multi-rater annotations, while the task of disambiguation is obviously present, there is also the problem of inferring the extent to which each rater can be trusted (i.e., how accurate they are) and how to meaningfully aggregate the information they provide in order to build a consensus, which is then used as the ground truth on which to train the ML model.

2.1 Trisecting and Acting Steps

In all three uncertainty forms, the trisecting act is at the basis of the process of uncertainty handling, as the uncertain instances (e.g., the instances missing some feature values, or those for which the provided annotations are only weak) must necessarily be recognised before any action can be considered: this also means that the trisecting act usually amounts to simply separating the certain instances from the uncertain ones, and the bulk of the work is usually performed in the acting step, in order to decide how to handle the two kinds of instances differently. According to the three kinds of problems described at the beginning of the section, we present the following solutions.
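As a minimal sketch of the trisecting act described above, the following Python fragment separates complete instances from those with missing feature values. The encoding of a missing value as `None` and the function name `split_by_certainty` are assumptions made for illustration only; they are not taken from any of the surveyed works.

```python
from typing import List, Optional, Sequence, Tuple

def split_by_certainty(
    rows: Sequence[Sequence[Optional[float]]],
) -> Tuple[List[int], List[int]]:
    """Return the indices of certain (complete) and uncertain (incomplete) rows.

    Assumption: a missing feature value is encoded as None.
    """
    certain, uncertain = [], []
    for i, row in enumerate(rows):
        if any(value is None for value in row):
            uncertain.append(i)
        else:
            certain.append(i)
    return certain, uncertain

# Toy dataset: rows 1 and 2 each miss one feature value.
rows = [[1.0, 2.0], [None, 3.0], [4.0, None], [5.0, 6.0]]
certain_idx, uncertain_idx = split_by_certainty(rows)
```

The acting step would then treat the two index sets differently, for example by fitting an initial model on the certain rows only.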

Missing Data. Missing data is the type of uncertainty for which the TWD methodology is most mature, possibly because the problem has been well studied in RST and other theories for the management of uncertainty that are associated with TWD [21, 22]. Most approaches in this direction have been based on the notion of incomplete information table, which is typically found in RST: Liu et al. [42] introduced a TWD model based on an incomplete information table augmented with interval-valued loss functions; Luo et al. [45] proposed a multi-step approach by which to distinguish different types of missing data (e.g. “don’t know”, “don’t care”) and similarity relations; Luo et al. [44] focused on how to update TWD in incomplete and multi-scale information systems using decision-theoretic rough sets; Sakai et al. [57,58,59] described an approach based on TWD to construct certain and possible rules using an algorithm which combines the classical Apriori algorithm [3] and possible world semantics [30]. Other approaches (not directly based on the incomplete information table notion) have also been considered: Nowicki et al. [52] proposed a TWD algorithm for classification with missing or interval-valued data based on rough sets and SVM; Yang et al. [75] proposed a method for TWD based on intuitionistic fuzzy sets that are constructed from a similarity relation between instances with missing values.
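The idea of distinguishing certain from possible conclusions under possible world semantics can be illustrated with a small sketch. It is not the rule-generation algorithm of Sakai et al.; it only shows, under the assumption of small finite attribute domains, how a prediction can be labelled certain when it is the same in every completion of an incomplete instance, and merely possible otherwise. The names `possible_worlds`, `three_way_predict` and `toy_rule` are hypothetical.

```python
from itertools import product

def possible_worlds(instance, domains):
    """Yield every completion of `instance`.

    Assumptions: None marks a missing value, and `domains[j]` lists the
    admissible values of attribute j.
    """
    missing = [j for j, value in enumerate(instance) if value is None]
    for values in product(*(domains[j] for j in missing)):
        world = list(instance)
        for j, value in zip(missing, values):
            world[j] = value
        yield tuple(world)

def three_way_predict(classify, instance, domains):
    """Return ('certain' | 'possible', set of labels reached over all worlds)."""
    labels = {classify(world) for world in possible_worlds(instance, domains)}
    status = "certain" if len(labels) == 1 else "possible"
    return status, labels

def toy_rule(x):
    # Hypothetical classifier: "high" if either attribute is "yes".
    return "high" if x[0] == "yes" or x[1] == "yes" else "low"

domains = {0: ["yes", "no"], 1: ["yes", "no"]}
robust = three_way_predict(toy_rule, ("yes", None), domains)
fragile = three_way_predict(toy_rule, ("no", None), domains)
```

Here `robust` is certain because the missing attribute cannot change the outcome, while `fragile` is only possible. Enumerating all worlds is exponential in the number of missing values, so this is meant as an illustration rather than a scalable procedure.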

While all the above approaches propose techniques based on TWD with missing data for classification problems, there have also been proposals to deal with this type of uncertainty in clustering, starting from the original approach proposed by Yu [85, 87]: Afridi et al. [2] described an approach which is based, as in the classification case, on a simple trisecting step in which complete instances are used to produce an initial clustering, after which game-theoretic rough sets are used to cluster the instances with missing values; Yang et al. [74] proposed a method for three-way clustering with missing data based on clustering density.

Weak Supervision. In the case of weak supervision, the application of three-way based strategies is more recent, and different techniques have been proposed in recent years. Most of the work in this sense has focused on the specific cases of semi-supervised learning, in which the uncertain instances have no supervision, and active learning, in which the missing labels can be requested from an external oracle (usually a human user) at some cost: Miao et al. [48] proposed a method for semi-supervised learning based on TWD; Yu et al. [88] proposed a three-way clustering approach for semi-supervised learning that uses an active learning approach to obtain labels for instances that are considered as uncertain after the initial clustering; Triff et al. [66] proposed an evolutionary semi-supervised algorithm based on rough sets and TWD and compared it with other algorithms, obtaining interesting results when only the certainly classified objects are considered; Dai et al. [18] introduced a co-training technique for cost-sensitive semi-supervised learning based on sequential TWD and applied it to different standard ML algorithms (k-NN, PCA, LDA) in order to obtain a multi-view dataset; Campagner et al. [10, 13] introduced a three-way Decision Tree model for semi-supervised learning and showed that this model achieves good performance with respect to standard ML algorithms for semi-supervised learning; Wang et al. [70, 71] proposed a cost-sensitive three-way active learning algorithm based on the computation of label error statistics; Min et al. [49] proposed a cost-sensitive active learning strategy based on k-nearest neighbours and a tripartition of the instances into certain and uncertain ones.

In the case of more general weakly supervised learning, Campagner et al. [12] proposed a collection of approaches based on TWD and standard ML algorithms in order to take into account this type of uncertainty in the setting of classification. In particular, the authors considered an algorithm for Decision Tree (and ensemble-based extensions, such as Random Forest) learning, in which the trisecting and acting steps are dynamically and iteratively performed during the Decision Tree induction process on the basis of TWD and generalized information theory [33], and a generalized stochastic gradient descent algorithm based on interval analysis and TWD, in order to take into account the fact that the uncertain instances naturally determine interval-valued information with respect to the loss function to be optimized. In both cases, promising results were reported, showing that they outperform standard superset learning and semi-supervised techniques. A different approach, proposed by Sakai et al. [58] and based on treating weak supervision as a type of missing data, employs a three-way rule extraction algorithm that could also be applied in the case of weakly supervised data: this approach is of particular interest in that it suggests an integrated end-to-end approach to simultaneously handle missing data and weakly supervised data.
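To make concrete the remark that uncertain instances determine interval-valued information about the loss, the sketch below computes the range of the log loss over a superset label. It is an illustration under our own assumptions (a dictionary of predicted class probabilities and the hypothetical function `interval_log_loss`), not the generalized gradient descent procedure of [12].

```python
import math

def interval_log_loss(probs, candidates):
    """Return (lowest, highest) log loss over the candidate labels.

    probs: dict mapping each class to its predicted probability (assumed > 0).
    candidates: set of labels guaranteed to contain the true one.
    """
    losses = [-math.log(probs[label]) for label in candidates]
    return min(losses), max(losses)

probs = {"a": 0.5, "b": 0.25, "c": 0.25}

# Weakly supervised instance: the true label is either "a" or "b".
low, high = interval_log_loss(probs, {"a", "b"})

# Precisely labelled instance: the interval collapses to a single value.
exact = interval_log_loss(probs, {"c"})
```

A learning algorithm then has to choose how to act on such intervals, for instance by optimizing the lower bound (an optimistic disambiguation) or the upper bound (a pessimistic one).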

Multi-rater Annotation. With respect to the third type of uncertainty, that is multi-rater annotation, in [12] we noted that the issue has largely been ignored in the ML community. With respect to the application of TWD methodologies to handle this type of uncertainty, there have been some recent works on aggregation methods and information fusion using TWD, mainly under the perspective of group decision making [25, 39, 53, 96] and the modelling of multi-agent systems [76]. However, studies concerning the application of these TWD-based techniques to ML problems have so far been lacking. Some related approaches have been explored under the perspective of multi-source information tables in RST, in which the multi-rater, and possibly conflicting, information is available not only for the decision variable but also for the predictor ones: Huang et al. [28] proposed a three-way concept learning method for multi-source data; Sang et al. [60] studied the application of decision-theoretic rough sets for TWD in multi-source information systems; Sang et al. [61] proposed an alternative approach which is not directly based on merging different information systems but instead employs multi-granulation double-quantitative decision-theoretic rough sets, which the authors show to be more fault tolerant than traditional approaches. Campagner et al. [8, 15] proposed a novel aggregation strategy, based on TWD, which can be applied to implement the trisecting step to handle the multi-rater annotation uncertainty type. In this case, the instances are categorized as certain or uncertain depending on the distribution of labels given by the raters and a set of parameters that have a cost-theoretic interpretation.
After the aggregation step, the problem is converted into a weakly supervised one and a learning algorithm is proposed that is shown to be significantly more effective than the traditional approach of simply assigning the most frequent labels (among the multi-rater annotations) to the uncertain instances.
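The aggregation idea just described can be sketched as follows. The specific rule (two frequency thresholds, here named `alpha` and `beta`) is an assumption chosen for illustration and is not claimed to be the parametrization used in [8, 15]; it only shows how a distribution of rater labels can yield either a certain label or a set of candidate labels, which turns the problem into a weakly supervised one.

```python
from collections import Counter

def aggregate_votes(votes, alpha=0.75, beta=0.25):
    """Three-way aggregation of the labels given by several raters.

    Assumed rule: if the most frequent label reaches relative frequency
    `alpha`, the instance is certain; otherwise it is uncertain and keeps
    every label with relative frequency of at least `beta`.
    """
    counts = Counter(votes)
    n = len(votes)
    top_label, top_count = counts.most_common(1)[0]
    if top_count / n >= alpha:
        return "certain", {top_label}
    candidates = {label for label, c in counts.items() if c / n >= beta}
    candidates.add(top_label)  # the candidate set is never empty
    return "uncertain", candidates

agreed = aggregate_votes(["malignant", "malignant", "malignant", "benign"])
split = aggregate_votes(["malignant", "malignant", "benign", "benign", "unsure"])
```

In the second call the two leading labels are kept as a superset label, while the isolated vote falls below `beta` and is discarded. Majority voting would instead force a single label on this instance.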

2.2 Outcome Step: Evaluating the Results

All of the articles considered for this review mainly deal with the trisecting and acting steps in the TAO model that we propose. The outcome step has rarely been considered and is usually addressed as it would be for traditional ML models: that is, by simply considering the accuracy of the trained models, sometimes even in naive ways [14]. According to the framework that we propose, the main goal of employing TWD for ML is the handling of uncertainty. In this light, attention should also be placed on how much the TWD approach makes it possible to reduce the initial uncertainty in the input data, or at least on the degree to which the TWD-based algorithm is able to obtain good performance despite the uncertainty. For example, with respect to the missing data problem, the outcome step should also consider the amount of missing values that have been correctly imputed (for imputation-based approaches), or the robustness of the induced ML algorithm with respect to different values that could be present in the missing features, for instance using interval-valued accuracy or information-theoretic metrics [11, 14], or by distinguishing which predictions made by the algorithm are certain (i.e., robust with respect to the missing values or the weakly supervised instance) or only possible. Similarly, with respect to the multi-rater annotation uncertainty type, besides the accuracy of the proposed approaches with respect to a known ground truth (when available), the outcome step should also consider the robustness of the proposed approach when varying the degree of conflicting information and the level of noise of the raters who annotate the datasets, as we considered in [15]. In this sense, we believe that more attention should be put on the outcome step of the proposed framework, and that further research in this sense should be performed.
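One simple way to read the notion of interval-valued accuracy mentioned above is sketched below. The definition used here is our own assumption and may differ from the metrics defined in [11, 14]: each test instance comes with the set of labels the model could output over all admissible completions of its uncertain values, and accuracy is bounded from below by the certainly correct predictions and from above by the possibly correct ones.

```python
def interval_accuracy(prediction_sets, truths):
    """Return (lower, upper) accuracy bounds.

    prediction_sets: one set of possible predicted labels per instance.
    truths: the corresponding true labels.
    lower counts instances whose only possible prediction is the true label;
    upper counts instances for which the true label is among the possible ones.
    """
    n = len(truths)
    pairs = list(zip(prediction_sets, truths))
    lower = sum(1 for labels, y in pairs if labels == {y}) / n
    upper = sum(1 for labels, y in pairs if y in labels) / n
    return lower, upper

prediction_sets = [{"a"}, {"a", "b"}, {"b"}, {"a"}]
truths = ["a", "b", "a", "a"]
bounds = interval_accuracy(prediction_sets, truths)  # (0.5, 0.75)
```

The width of the interval is itself informative: a narrow interval indicates that the model is robust to the uncertainty in the input, whatever the hidden values turn out to be.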

3 Handling Uncertainty in the Output

The application of TWD to handle uncertainty in the output of ML models is a mature research area, and has possibly been considered since the original proposal of TWD, both for classification [80, 103] and for clustering [40]. In both cases, the uncertainty in the output of the ML model refers to the inability of the ML model to properly discriminate the instances and assign them a certain, precisely known, label. This could be due to a variety of issues: the chosen data representation (i.e., the selected features and/or their level of granularity) is not informative enough; it is impossible to distinguish different instances that are either identical or “too near” in the sample space, but are associated with different decision labels; the selected model class is not powerful enough to properly represent the concept to be learned. All these issues have been widely studied, both under the perspective of RST with the notion of indiscernibility [54, 55], and of more traditional ML approaches, with the notion of decision boundary. The approach suggested by TWD in this setting consists in allowing the classifier to abstain [81], even partially, that is, by excluding only some of the possible alternative classifications. In so doing, the focus is on the trisecting step, which involves deciding on which instances the ML model (whether for classification or clustering) should be considered uncertain, and hence on which instances it should abstain.

3.1 Trisecting and Acting Steps for Classification

With respect to classification, the traditional model of TWD applies only to binary classification, to which a third “uncertain” category is added; extensions of the most traditional ML methods are available for this setting. In all of the cases, the trisecting step is performed in a similar manner, on the basis of the original decision-theoretic rules proposed by Yao [81]; these rules are often embedded in different models, and the main variation relates to how the acting step is implemented. This step has usually been based on Bayesian decision analysis under the decision-theoretic rough set paradigm [31, 36, 79, 103, 104]. However, other approaches to implement the acting step have also been proposed, such as structured approximations in RST [27], or the combination of TWD with more traditional ML techniques, for instance, Deep Learning [37, 100, 101], optimization-based learning [41, 43, 95] or frequent pattern mining [38, 50]: all of these implementations of the TWD model for the handling of uncertainty have been successfully applied to different fields, such as face recognition, spam filtering or recommender systems.
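For readers unfamiliar with the decision-theoretic rules mentioned above, the following sketch shows the usual way in which the two thresholds of a binary three-way classifier are derived from six losses in the decision-theoretic rough set formulation, as we understand it. The variable names (`l_pp` for the loss of accepting a positive instance, `l_bn` for the loss of deferring on a negative one, and so on) and the numeric losses are illustrative assumptions.

```python
def dtrs_thresholds(l_pp, l_bp, l_np, l_pn, l_bn, l_nn):
    """Thresholds (alpha, beta) obtained by minimizing the expected loss.

    l_xy is the loss of action x (p = accept, b = defer, n = reject) when the
    instance is positive (y = p) or negative (y = n). Accepting is preferred
    to deferring when P(positive) >= alpha, and rejecting is preferred to
    deferring when P(positive) <= beta.
    """
    alpha = (l_pn - l_bn) / ((l_pn - l_bn) + (l_bp - l_pp))
    beta = (l_bn - l_nn) / ((l_bn - l_nn) + (l_np - l_bp))
    if not beta < alpha:
        raise ValueError("losses do not yield a non-empty deferment region")
    return alpha, beta

def three_way_decision(p_positive, alpha, beta):
    """Trisect on the estimated probability of the positive class."""
    if p_positive >= alpha:
        return "accept"
    if p_positive <= beta:
        return "reject"
    return "defer"

# Illustrative losses: errors cost more than deferring, correct decisions cost nothing.
alpha, beta = dtrs_thresholds(l_pp=0, l_bp=2, l_np=6, l_pn=8, l_bn=2, l_nn=0)
decisions = [three_way_decision(p, alpha, beta) for p in (0.9, 0.5, 0.2)]
```

With these losses the thresholds are 0.75 and 1/3, so the three probabilities are mapped to accept, defer and reject respectively. Any probabilistic classifier can supply `p_positive`; the acting step then determines what happens to the deferred instances.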

A particularly interesting use, with respect to the acting step, consists of integrating TWD in active learning methodologies: Chen et al. [16] proposed a three-way rule-based decision algorithm that employs active learning to re-classify the uncertain instances; Zhang et al. [94] proposed a random forest-based recommender system with the capability to ask for user supervision on uncertain objects; Yao et al. [78] proposed a TWD model based on game-theoretic rough sets for medical decision systems that distinguishes certain rules (for acceptance and rejection) from deferment rules, which require intervention from the user.

In recent years, different proposals have also been considered for the extension to the multi-class case, mainly under two major approaches. The first one is based on sequential TWD [83], which essentially implements a hierarchical one-vs-all learning scheme: Yang et al. [77] considered a Bayesian extension of multi-class decision-theoretic rough sets [102]; Savchenko [62, 63] proposed sequential TWD and granular computing to speed up image classification when the number of classes is large; Zhang et al. [98] proposed a sequential TWD model based on the use of autoencoders for granular feature extraction. The second approach, which can be defined as natively multi-class, has been proposed by some authors (e.g., in [11, 12]): it employs a decision-theoretic procedure to convert every standard probabilistic classifier into a multi-class TWD classifier. A similar approach, but based on decision-theoretic rough sets, has also been developed by Jia et al. [32].
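The natively multi-class approach can be illustrated with a small sketch that turns class probabilities into a set-valued prediction. The cost model is an assumption introduced here for illustration and is not claimed to be the one used in [11, 12]: an error (true label outside the predicted set) costs 1, and each label beyond the first costs `extra_label_cost`. Under this model the expected cost is minimized by keeping the most probable label plus every label whose probability exceeds `extra_label_cost`.

```python
def expected_cost(probs, chosen, extra_label_cost):
    """Expected cost of predicting the label set `chosen` (assumed cost model)."""
    miss = 1.0 - sum(probs[label] for label in chosen)
    return miss + extra_label_cost * (len(chosen) - 1)

def three_way_multiclass(probs, extra_label_cost=0.2):
    """Return the label set minimizing the expected cost defined above."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    chosen = {ranked[0]}
    for label in ranked[1:]:
        # Adding a label lowers the miss probability by probs[label]
        # and raises the cost by extra_label_cost.
        if probs[label] > extra_label_cost:
            chosen.add(label)
    return chosen

confident = three_way_multiclass({"a": 0.9, "b": 0.05, "c": 0.05})
ambiguous = three_way_multiclass({"a": 0.45, "b": 0.4, "c": 0.15})
```

The first call returns a single label, while the second abstains partially by excluding only the least plausible class. Lowering `extra_label_cost` makes the classifier more cautious, at the price of less specific predictions.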

While all the approaches mentioned above combine TWD and ML models according to an a posteriori strategy, in which the trisecting step is performed after, or as a consequence of, the standard ML training procedure, in [11, 12] we also considered how to directly embed TWD in the training algorithm of a wide class of standard ML models, either by a direct modification of the learning algorithm (for decision trees and related methods), or by adopting ad-hoc regularized loss functions (for optimization-based procedures such as SVM or logistic regression).

3.2 Trisecting and Acting Steps for Clustering

With regard to clustering, various approaches have been proposed to implement the TWD-based handling of the uncertainty in the output, hence to construct clusterings in which the assignment of some instances to clusters is uncertain, mainly under the frameworks of rough clustering [40], interval-set clustering [84] and three-way clustering [90]. In all of the above approaches, the trisecting step is implemented as a modification of standard clustering assignment criteria, and it allows instances to be considered as uncertain with respect to their assignment to one or more clusters: Yu [90] proposed a three-way clustering algorithm that also works with incomplete data; Wang et al. [73] proposed a three-way clustering method based on mathematical morphology; Yu et al. [91] considered a flexible tree-based incremental three-way clustering algorithm; Yu et al. [86] proposed an optimized ensemble-based three-way clustering algorithm for large-scale datasets; Afridi et al. [1] proposed a variance-based three-way clustering algorithm; Zhang et al. [99] proposed a novel improvement on the original rough k-means based on a weighted Gaussian distance function; Li et al. [35] extended standard rough k-means with an approach based on decision-theoretic rough sets; Yu et al. [89] proposed a hybrid clustering/active learning approach based on TWD for multi-view data; Zhang [97] proposed a three-way c-means algorithm; Wang et al. [72] proposed a refinement three-way clustering algorithm based on the re-clustering of an ensemble of traditional hard clustering algorithms; Yu et al. [93] proposed a density-based three-way clustering algorithm based on DBSCAN; Yu et al. [92] proposed a three-way clustering algorithm optimized for high-dimensional datasets, based on a modification of the k-medoids algorithm and the random projection method; Hu et al. [26] proposed a sequential TWD model for consensus clustering based on the notion of co-association matrix.
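The modified assignment criterion common to these methods can be sketched in the style of rough k-means. The ratio test and its threshold are assumptions chosen for illustration, and the function `three_way_assign` is hypothetical; individual algorithms in the surveyed literature use different criteria.

```python
import math

def three_way_assign(point, centroids, ratio=1.5):
    """Assign a point to the core of one cluster or the fringe of several.

    Assumed criterion: if some other centroid is at most `ratio` times as far
    as the nearest one, the assignment is uncertain and the point is placed
    in the fringe of all such clusters; otherwise it belongs to the core of
    the nearest cluster.
    """
    dists = [math.dist(point, c) for c in centroids]
    nearest = min(range(len(centroids)), key=dists.__getitem__)
    close = [
        k for k, d in enumerate(dists)
        if k != nearest and d <= ratio * dists[nearest]
    ]
    if close:
        return {"core": None, "fringe": sorted([nearest] + close)}
    return {"core": nearest, "fringe": []}

centroids = [(0.0, 0.0), (10.0, 0.0)]
clear = three_way_assign((1.0, 0.0), centroids)
borderline = three_way_assign((4.5, 0.0), centroids)
```

The first point lies clearly within the first cluster and is assigned to its core, whereas the second lies almost midway between the centroids and is placed in the fringe of both clusters. In a full algorithm, centroids would then be updated with different weights for core and fringe members.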

3.3 Outcome Step: Evaluating the Results

With respect to the outcome step, both clustering and classification techniques based on TWD have been shown to significantly improve the performance in comparison to traditional ML algorithms (see the referenced literature). Despite this promising assessment, one should also consider that the evaluation of ML algorithms using TWD to handle the uncertainty in the output, at least in principle, cannot be made on the same grounds as for traditional ML models (i.e., only on the basis of accuracy metrics). Indeed, since these models are allowed to abstain on uncertain instances, metrics for their evaluation should take into account the trade-off between the accuracy on the classified/clustered instances and the coverage of the algorithm, that is, the share of instances on which the model does not defer its decision. As an example of this issue, suffice it to consider that a three-way classifier that abstains on all the instances but one, which is correctly classified/clustered, has perfect accuracy but is hardly a useful predictive model. However, attention towards this trade-off has emerged only recently, and the majority of the surveyed papers only focus on the accuracy of the models on the classified/clustered instances: Peters [56] proposed a modified Davies-Bouldin index for the evaluation of three-way clustering; Depaolini et al. [19] proposed generalizations of the Rand, Jaccard and Fowlkes-Mallows indices; similarly, we proposed a generalization of information-theoretic measures of clustering quality [14] and a generalization of accuracy metrics for classification [11]. Promisingly, the superior performance of TWD techniques for the handling of output uncertainty can be observed also under these more robust, and conservative, metrics.
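The degenerate case described above is easy to reproduce. The sketch below reports accuracy on the decided instances together with coverage, assuming that an abstention is encoded as `None`; the function name `accuracy_and_coverage` is hypothetical, and the generalized metrics of [11, 14, 19, 56] are more refined than this pair of numbers.

```python
def accuracy_and_coverage(predictions, truths):
    """Return (accuracy on decided instances, coverage).

    Assumption: a prediction of None means the model abstained.
    """
    decided = [(p, y) for p, y in zip(predictions, truths) if p is not None]
    coverage = len(decided) / len(truths)
    if not decided:
        return float("nan"), coverage
    accuracy = sum(p == y for p, y in decided) / len(decided)
    return accuracy, coverage

truths = ["a", "b", "a", "b", "a"]

# Abstains on every instance but one, which is classified correctly.
overcautious = accuracy_and_coverage(["a", None, None, None, None], truths)

# Decides on four instances out of five and gets three of them right.
balanced = accuracy_and_coverage(["a", "b", "b", None, "a"], truths)
```

The first model has accuracy 1.0 but coverage 0.2, while the second has accuracy 0.75 and coverage 0.8. Reporting both numbers, or a metric that combines them, makes the comparison between abstaining and non-abstaining models meaningful.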

4 Discussion

In this article, we proposed a TAO model for the management of uncertainty in Machine Learning that is based on TWD. After describing the proposed framework, we reviewed the current state of the art for the different areas of concern identified by our framework, and discussed the strengths, limitations and areas requiring further investigation of the main works considered.

In what follows, we emphasise both what we believe are the main advantages of adopting this methodology in ML and also delineate some topics that in our opinion are particularly in need of further study.

4.1 Advantages of Three-Way ML

In recent years, the application of TWD and the TAO model to ML has been growing and has shown promising results. In this Section, we emphasise the advantages of TWD under the perspective of uncertainty handling for ML. In this perspective, TWD and the TAO model look promising as a means to provide a principled way to handle uncertainty in the ML process in an end-to-end fashion, by directly using the information obtained in the trisecting act (i.e., the splitting of instances into certain/uncertain ones) in the subsequent acting and outcome steps, without the need to address and “correct” the uncertainty in a separate pre-processing step. This is particularly clear in our discussion about the handling of the uncertainty in the input: in this case, the TAO model enables one to directly deal with different forms of data uncertainty in a theoretically sound, robust and non-invasive manner [20], while also obtaining higher predictive accuracy than with traditional ML methodologies. The same holds true with respect to the handling of uncertainty in the output. In this case, the TAO model makes it possible to obtain classifiers that are both more accurate and more robust, thanks to the possibility of abstention, which allows more conservative decision boundaries. Abstention is a more informative strategy also from a decision-support perspective, in that a model enhanced with TWD can expose its predictive uncertainty by abstaining, as a sign that the situation requires more information or the careful consideration of the human decision maker.

4.2 Future Directions

Despite the increasing popularity of TWD to handle the uncertainty in ML pipelines, and the relative maturity of the application of this methodology with respect to the trisecting and acting steps of our framework (see Table 1), we believe that some specific aspects merit further investigation. First, as already discussed in Sects. 2 and 3, the outcome step has not been sufficiently explored, especially with respect to the handling of uncertainty in the input. As discussed in Sect. 2.2, we believe that conceiving appropriate metrics to assess the robustness of TWD methods represents a particularly promising strand of research, which would also enable counterfactual-like reasoning [47] for ML models, a topic that has recently been considered important in the light of eXplainable AI [68]. For instance, this can be done by analyzing the robustness and performance of the ML models with respect to specific counterfactual instantiations of the instances affected by uncertainty that would most likely alter the learnt decision boundary. Likewise, while there have been more proposals for the outcome step for the output part of our framework, we believe that further work should be done towards the general adoption of these measures in the application of TWD-based ML. A second promising direction of research regards the acting step for the management of the uncertainty in the output: as we previously discussed, active learning and human-in-the-loop [24] techniques to handle the instances recognized as uncertain by the ML algorithms are of particular interest. In this respect, it would also be interesting to study the connection between the TWD model to handle uncertainty in the output and the conformal prediction paradigm [64], as both are based on the idea of providing set-valued predictions on uncertain instances.
A third research direction regards the fact that the different steps have so far been studied mostly in isolation: most studies applying TWD in ML have focused either on the input or on the output part of our framework. While some initial works towards a unified treatment of both types of uncertainty have recently appeared [12], we believe that further work toward such a uniform methodology would be particularly promising. Finally, missing data is usually understood as a problem of completeness: this is missing data at feature level, for instances that are at least partly observed. But there is also a “missingness” at row level, that is, a source of uncertainty (which makes the data we have uncertain and less reliable) that regards instances that we have not observed or whose characteristics are not well represented in the data collected: more research is needed on how TWD can tackle this important source of bias, which is usually called sampling bias.