Discovering a taste for the unusual: exceptional models for preference mining
Abstract
Exceptional preferences mining (EPM) is a crossover between two subfields of data mining: local pattern mining and preference learning. EPM can be seen as a local pattern mining task that finds subsets of observations where some preference relations between labels significantly deviate from the norm. It is a variant of subgroup discovery, with rankings of labels as the target concept. We employ several quality measures that highlight subgroups featuring exceptional preferences, where the focus of what constitutes ‘exceptional’ varies with the quality measure: two measures look for exceptional overall ranking behavior, one measure indicates whether a particular label stands out from the rest, and a fourth measure highlights subgroups with unusual pairwise label ranking behavior. We explore a few datasets and compare with existing techniques. The results confirm that the new task EPM can deliver interesting knowledge.
Keywords
Subgroup discovery, Exceptional model mining, Label ranking, Preference learning, Distribution rules
1 Introduction
Consider a survey where detailed preferences of sushi types have been collected, along with information about the respondents. For each example in the dataset, we have personal details (age, gender, income, etc.) as well as a set of sushi types, ordered by preference (Kamishima 2003). By mapping the demographic attributes and unusual preferences, marketeers would be able to target key demographics where specific sushi types have greater potential.
The study of preference data has been approached from a number of perspectives, grouped under the name Preference Learning (PL) (e.g., as Label Ranking; de Sá et al. 2016; Cheng et al. 2013; Vembu and Gärtner 2010). Typically, the aim is to build a global predictive model, supported by preference mining methods (Fürnkranz and Hüllermeier 2010), such that the preferences can be predicted for new cases. However, in several areas, such as marketing, there is also great value in identifying subpopulations whose preferences deviate from the norm. If the preference for some sushi type in a certain age group or in a certain region is markedly different from that of the average population, then the vendor can develop specific strategies for those groups. Finding coherent groups of customers to focus on is an invaluable part of promotion strategies.
In this work, the term preference is not strictly interpreted as a literal preference, but instead as an order relation \(object_1 \succ object_2\). An order relation can represent several phenomena: a person likes \(sushi_1\) more than \(sushi_2\) (Kamishima 2003); \(\lambda _1\) is more likely to occur than \(\lambda _2\) (Hüllermeier et al. 2008); \(algorithm_1\) is better than \(algorithm_2\) (Brazdil et al. 2003). In this context, unusualness is the extent to which some groups show different preferences from average behavior.
Arguably the most generic setting for discovering local, supervised deviations is that of subgroup discovery (SD) (Lavrac et al. 2004). The aim of SD is to discover subgroups in the data for which the target shows an unusual distribution, as compared to the overall population (Klösgen and Zytkow 2002). SD is a generic task in the sense that the actual nature of the target variable can be quite diverse. For example, SD approaches have been developed for binary, nominal (Abudawood et al. 2009) and numeric target variables (Jorge et al. 2006; Jin et al. 2014), as well as multiple targets (Duivesteijn et al. 2012; Umek and Zupan 2011).
We extend the work on exceptional preferences mining (EPM) (de Sá et al. 2016), which focuses on the discovery of meaningful subgroups with exceptional preference patterns. When applying SD to a new context, the main task is to determine what constitutes an interesting subgroup. In EPM, different quality measures determine the interestingness based on how the preferences in the subgroup differ from the preferences in the whole data. Each EPM quality measure reflects a different facet of interestingness one might have about the unusualness of a set of preferences.
In this work, we include a more comprehensive experimental setup and propose a new quality measure. We employ EPM on several real-world datasets, using four distinct quality measures. These measures define the type of exception that is identified to either encompass the entire label space or focus on more local peculiarities. In particular, two of them look for overall exceptional preferences; a third measure assesses if one particular label behaves exceptionally; the remaining measure quantifies the exceptional behavior of a single pair of labels.
Finally, to consolidate the previous work on EPM, we compare EPM with a subgroup discovery approach known as Distribution Rules (DR) (Jorge et al. 2006).
We start by introducing Label Ranking in Sect. 2 and subgroup discovery in Sect. 3. Then, in Sect. 4 we introduce exceptional preferences mining and analyze the results obtained in Sect. 5. Finally, we conclude this paper in Sect. 6.
2 Label ranking
In Label Ranking, given an instance x from the instance space \(\mathbb {X}\), the goal is to predict the ranking of the labels \(\mathcal {L} = \{\lambda _1,\ldots ,\lambda _k\}\) associated with x (Hüllermeier et al. 2008). A ranking can be represented as a strict total order over \(\mathcal {L}\), defined on the permutation space \(\varOmega \).
The Label Ranking task is similar to the classification task, except that instead of a class we want to predict a ranking of the labels. As in classification, we do not assume the existence of a deterministic \(\mathbb {X} \rightarrow \varOmega \) mapping. Instead, every instance is associated with a probability distribution over \(\varOmega \) (Cheng et al. 2009). This means that, for each \(x \in \mathbb {X}\), there exists a probability distribution \(\mathcal {P}( \cdot \mid x)\) such that, for every \(\pi \in \varOmega \), \(\mathcal {P}(\pi \mid x)\) is the probability that \(\pi \) is the ranking associated with x. The goal in Label Ranking is to learn the mapping \(\mathbb {X} \rightarrow \varOmega \). The training data is defined as D, which is a bag of \(n\) records of the form \(x=(a_1,\ldots ,a_m,\pi )\), where \(\{a_1,\ldots ,a_m\}\) is a set of values from \(m\) independent variables \(\mathcal {A}_1,\ldots ,\mathcal {A}_m\) describing instance x and \(\pi \) is the corresponding target ranking.
Rankings can be represented with total or partial orders and vice versa. A strict total order \(\succ \) over \(\mathcal {L}\) has the following properties:
 1. Irreflexive: \(\lambda _a\nsucc \lambda _a\)
 2. Transitive: \(\lambda _a\succ \lambda _b\) and \(\lambda _b\succ \lambda _c\) implies \(\lambda _a\succ \lambda _c\)
 3. Asymmetric: if \(\lambda _a \succ \lambda _b\) then \(\lambda _b\nsucc \lambda _a\)^{1}
 4. Connected: For any distinct \(\lambda _a,\lambda _b\) in \(\mathcal {L}\), either \(\lambda _a\succ \lambda _b\) or \(\lambda _b\succ \lambda _a\)
However, in real-world ranking data, we do not always have clear and unambiguous preferences, i.e. strict total orders (Brandenburg et al. 2013). Hence, sometimes we have to deal with indifference (Brinker and Hüllermeier 2007) and incomparability (Cheng et al. 2010). For illustration purposes, let us consider a survey where a set of \(n\) consumers rate \(k\) sushi types. If a consumer feels that two sushi types have identical taste, then these can be expressed as indifferent, so they are assigned the same rank (i.e. a tie). Such rankings can be represented with the non-strict relation \(\succeq \), which has the following properties:
 1. Reflexive: \(\lambda _a\succeq \lambda _a\)
 2. Transitive: \(\lambda _a\succeq \lambda _b\) and \(\lambda _b\succeq \lambda _c\) implies \(\lambda _a\succeq \lambda _c\)
 3. Antisymmetric: \(\lambda _a \succeq \lambda _b\) and \(\lambda _b\succeq \lambda _a\) implies \(\lambda _a=\lambda _b\)
 4. Connected: For any \(\lambda _a,\lambda _b\) in \(\mathcal {L}\), either \(\lambda _a\succeq \lambda _b\), \(\lambda _b\succeq \lambda _a\) or \(\lambda _b=\lambda _a\)
Additionally, real-world data may lack preference data regarding two or more labels, which can be defined as incomparability (Chiclana et al. 2009). Continuing with the sushi survey, if a consumer never tried one or both of two sushi types, \(\lambda _a\) and \(\lambda _b\), it leads to incomparability, \(\lambda _a \perp \lambda _b\). In other words, the consumer cannot decide whether the sushi types are equivalent or select one as the preferred, because they never tasted at least one of them. In these cases, we can use partial orders, which have the following properties:
 1. Reflexive: \(\lambda _a\succeq \lambda _a\)
 2. Transitive: \(\lambda _a\succeq \lambda _b\) and \(\lambda _b\succeq \lambda _c\) implies \(\lambda _a\succeq \lambda _c\)
 3. Antisymmetric: \(\lambda _a \succeq \lambda _b\) and \(\lambda _b\succeq \lambda _a\) implies \(\lambda _a=\lambda _b\)
Several learning algorithms proposed for modeling Label Ranking data can be grouped as decomposition-based or direct (de Sá et al. 2018). Decomposition methods divide the problem into several simpler problems (e.g., multiple binary problems). An example is Ranking by Pairwise Comparisons (RPC) (Fürnkranz and Hüllermeier 2003), which decomposes the LR problem into a set of binary classification problems. A learning method is trained with all examples for which either a pairwise comparison (or pairwise preference) \(\lambda _i \succ \lambda _j\) or \(\lambda _j \succ \lambda _i\) is known (Fürnkranz and Hüllermeier 2003). The resulting predictions are then combined to predict a total or partial ranking (Cheng et al. 2013). Direct methods, on the other hand, treat the rankings as target objects without any decomposition. Examples include decision trees (Todorovski et al. 2002; Cheng et al. 2009), k-Nearest Neighbors (Brazdil et al. 2003; Cheng et al. 2009) and the linear utility transformation (Har-Peled et al. 2002; Dekel et al. 2003).
Consensus ranking. When dealing with sets of rankings, as permutations or total/partial orders, it is often useful to define a consensus ranking. A consensus ranking can be seen as an overall ranking that has the highest agreement with a given set of rankings (Cook et al. 2007). Different methods to derive the consensus ranking can be found in the literature (Sculley 2007; Svendová and Schimek 2017). For example, in Cook et al. (1996) a consensus ranking for players is proposed as the ranking which deviates the least from the outcomes in the tournament.
In the context of Label Ranking it is common to use the average ranking as the consensus ranking (Brazdil et al. 2000). The average ranking is obtained by computing, for each label, the average of its ranks over all rankings; the label with the lowest average rank is ranked in first place, and so on.
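The average ranking can be sketched in a few lines. The example below is only illustrative: the function name `average_ranking`, the four rankings over four labels, and the tie-breaking rule (lower label index first) are assumptions and not part of the original method description.

```python
from statistics import mean

def average_ranking(rank_vectors):
    """Return (mean ranks, consensus ranks) for a list of rank vectors.

    rank_vectors[r][i] is the rank of label i in ranking r (1 = most preferred).
    """
    k = len(rank_vectors[0])
    mean_ranks = [mean(r[i] for r in rank_vectors) for i in range(k)]
    # Order labels by increasing mean rank; ties go to the lower label index
    # (an arbitrary choice made for this sketch).
    order = sorted(range(k), key=lambda i: (mean_ranks[i], i))
    consensus = [0] * k
    for position, label in enumerate(order, start=1):
        consensus[label] = position
    return mean_ranks, consensus

# Four hypothetical rankings of the labels lambda_1..lambda_4.
rankings = [
    [4, 3, 1, 2],
    [3, 2, 1, 4],
    [1, 4, 2, 3],
    [1, 3, 2, 4],
]
mean_ranks, consensus = average_ranking(rankings)
print(mean_ranks)   # mean rank per label
print(consensus)    # position of each label in the average ranking
```

Here the mean ranks are 2.25, 3, 1.5 and 3.25, so the average ranking is \(\lambda _3\succ \lambda _1\succ \lambda _2\succ \lambda _4\).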
3 Subgroup discovery and exceptional model mining
Subgroup discovery (SD) (Klösgen and Zytkow 2002) is a data mining framework that seeks subsets of the dataset (satisfying certain user-specified constraints) where something exceptional is going on. In SD, we assume a flat-table dataset D, which is a bag of \(n\) records of the form \(x=(a_1,\ldots ,a_m,t_1,\ldots ,t_\ell )\). We call \(\{a_1,\ldots ,a_m\}\) the descriptors and \(\{t_1,\ldots ,t_\ell \}\) the targets, and we denote the collective domain of the descriptors by \(\mathcal {A}\). We are interested in finding interesting subsets, called subgroups, that can be formulated in a description language \(\mathcal {D}\). In order to formally define subgroups, we first need to define the following auxiliary concepts.
Definition 1
(Pattern and coverage) Given a description language \(\mathcal {D}\), a pattern \(p\in \mathcal {D}\) is a function \(p:\mathcal {A}\rightarrow \{0,1\}\). A pattern p covers a record x iff \(p(a_1,\ldots ,a_m)=1\).
Patterns induce subgroups, and subgroups are associated with patterns, in the following manner.
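Definition 2
(Subgroup) The subgroup corresponding to a pattern \(p\) is the bag of records \(S_p \subseteq D\) that \(p\) covers, i.e. \(S_p = \{x \in D \mid p(a_1,\ldots ,a_m)=1\}\).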
Formally, the interestingness of a subgroup can be measured using any characteristics available from its associated pattern. In practice, it depends on the task we are trying to solve. Therefore, we should define one or more quality measures to assess the interestingness we want to explore.
Definition 3
(Quality Measure) A quality measure is a function \(\varphi :\mathcal {D}\rightarrow \mathbb {R}\).
For very small subgroups, one easily finds an unusual distribution of the target. Hence, to favor larger subgroups, one defines the quality measure such that it balances the exceptionality of the target distribution with the size of the subgroup.
3.1 Search strategy
In the exceptional model mining (EMM) process, we explore a large search space, guided by a user-defined quality measure that expresses the type of exceptionality we seek. Typically, subgroups are found by a level-wise search through attribute space (Duivesteijn 2013). However, we consider the exact search strategy to be a parameter of the algorithm.
EMM strives to find descriptions that satisfy certain user-specified constraints. Usually these constraints include lower bounds on the quality of the description and size of the induced subgroup. More constraints may be imposed as the question at hand requires; domain experts may for instance request an upper bound on the complexity of the description.
Most SD algorithms traverse the search space of candidate descriptions in a general-to-specific way: they treat the space as a lattice whose structure is defined by a refinement operator \(\eta :\mathcal {D}\rightarrow 2^{\mathcal {D}}\). This operator determines how descriptions can be extended into more complex descriptions by atomic additions. Most applications (including ours) assume \(\eta \) to be a specialization operator: every description \(q\in \mathcal {D}\) that is an element of the set \(\eta (p)\) is more specialized than the description p itself. The algorithm results in a ranked list of descriptions (or the corresponding subgroups) that satisfy the user-defined constraints.
In this EMM setting, a greedy best-first search strategy is chosen. At each level, the descriptions are sorted according to our quality measure \(\varphi \) and refined to create the candidate descriptions for the next level. We define constraints on single attributes and define the corresponding subgroups as those records satisfying each one of those constraints. The search is constrained by an upper bound on the complexity of the description (also known as the search depth, \(d\)) and a lower bound on the support of the corresponding subgroup. Due to its greediness, this search strategy provides no guarantee of optimality (Heusner et al. 2017).
3.1.1 Best-first search algorithm in EMM
In Algorithm 1, we outline the pseudocode of the best-first search algorithm for EMM. In this code, we assume that there is a subroutine called satisfiesAll that tests whether a candidate description satisfies all conditions in a given set (to allow, for instance, the domain expert to express constraints on the resulting descriptions, such as a bounded complexity). The PriorityQueue() is a queue of unbounded length in which the elements are stored and sorted by their corresponding quality. One elementary operation, insert_with_priority, adds an element to the PriorityQueue.
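The search described above can be sketched as follows. This is not a reproduction of Algorithm 1, only a minimal illustration under stated assumptions: descriptions are conjunctions of named conditions, the only constraints are the search depth and a minimum support, every description of a level is refined (no beam width), and the function names, the toy records and the quality measure `toy_quality` are all hypothetical. `heapq.heappush` plays the role of insert_with_priority.

```python
import heapq
from itertools import count
from statistics import mean

def best_first_search(records, conditions, quality, depth, min_support):
    """Return (quality, description) pairs, best first.

    conditions maps a condition name to a predicate over one record;
    a description is a sorted tuple of condition names (a conjunction).
    """
    tie = count()                    # tie-breaker: the heap never compares descriptions
    result_heap = []                 # unbounded priority queue (negated quality)
    level, seen = [()], set()        # start from the empty description
    for _ in range(depth):           # upper bound on description complexity
        scored = []
        for seed in level:
            for name in conditions:  # refinement: add one atomic condition
                if name in seed:
                    continue
                desc = tuple(sorted(seed + (name,)))
                if desc in seen:
                    continue
                seen.add(desc)
                subgroup = [r for r in records
                            if all(conditions[c](r) for c in desc)]
                if len(subgroup) < min_support:   # lower bound on support
                    continue
                q = quality(subgroup, records)
                heapq.heappush(result_heap, (-q, next(tie), desc))
                scored.append((q, desc))
        # Sort this level by quality; the best descriptions are refined first.
        level = [d for _, d in sorted(scored, key=lambda t: -t[0])]
    return [(-neg_q, desc) for neg_q, _, desc in sorted(result_heap)]

def toy_quality(subgroup, population):
    # Hypothetical measure: sqrt(relative size) times the absolute difference
    # between the subgroup mean and the population mean of a numeric target "t".
    size = len(subgroup) / len(population)
    gap = abs(mean(r["t"] for r in subgroup) - mean(r["t"] for r in population))
    return size ** 0.5 * gap

records = [
    {"age": 20, "city": "A", "t": 1.0}, {"age": 25, "city": "A", "t": 0.9},
    {"age": 40, "city": "B", "t": 0.2}, {"age": 45, "city": "A", "t": 0.1},
    {"age": 50, "city": "B", "t": 0.0}, {"age": 22, "city": "B", "t": 0.8},
]
conditions = {
    "age<=30": lambda r: r["age"] <= 30,
    "age>=45": lambda r: r["age"] >= 45,
    "city=A": lambda r: r["city"] == "A",
}
results = best_first_search(records, conditions, toy_quality, depth=2, min_support=2)
for q, desc in results:
    print(round(q, 4), " AND ".join(desc))
```

Because minimum support can only decrease under specialization, descriptions that fail it are neither reported nor refined further.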
3.2 Distribution rules
4 Exceptional preferences mining
Exactly what constitutes an interesting deviation in preferences is governed by the employed quality measure and the target concept (binary, numeric, preferences, ...). Thus, different measures are required to evaluate different types of targets. SD approaches have been developed for binary, nominal (Abudawood et al. 2009) and numeric target variables (Jin et al. 2014; Jorge et al. 2006), for targets encompassing multiple attributes (Umek and Zupan 2011) and also distributions (Jorge et al. 2006) (Sect. 3.2). However, none of these approaches is able to capture all the sets of preferences that can be derived from rankings within an SD framework. For that, we use exceptional preferences mining (EPM) (de Sá et al. 2016), which is the search for subgroups with deviating preferences.
In EPM, the target concept at hand consists of a single target t, which would make sense in SD. However, that target object is a ranking of labels, \(\pi \in \varOmega \) (as defined in Sect. 2) which can be represented as a set of pairwise comparisons. Hence it represents interactions between multiple individual labels, which is more consistent with the EMM scenario.
Some other approaches to mine preferences and ranks can be found in the literature (Henzgen and Hüllermeier 2014; Van et al. 2014). However, these approaches tackle different problems from the one we address in this paper. In Henzgen and Hüllermeier (2014), the authors suggest an approach to mine the rankings with association rules that search for subranking patterns. Our approach goes beyond this, as it relates the ranking patterns with descriptors (otherwise referred to as independent variables). From a different perspective, Van et al. (2014) suggest a ranked tiling approach to search for rank patterns, whereas we are interested in the preference relations derived from the ranks.
In the Label Ranking context (Sect. 2), when the number of labels is large, preference patterns can be hard to analyze and visualize. A real-world example is the Sushi dataset (Kamishima 2003), which represents the preferences of 5000 persons over 10 types of sushi. Even this relatively modest number of sushi types can be ranked in a large number of ways (\(10! = 3{,}628{,}800\) permutations). This has a significant effect on the data: more than 98% of the 5000 rankings in this dataset are unique. This illustrates why it can be difficult to directly learn a ranker that associates a reliable complete ranking with any subset of the instance space, \(\mathbb {X}\), when the number of labels is non-trivial.
4.1 Preference matrix
4.1.1 Preference matrix of one ranking
If needed, one can also derive a ranking from a PM. How to do so is a non-trivial question, which has received some attention in research fields with similar types of matrices (Hüllermeier et al. 2008). The straightforward way is to sum the rows of the PM and then assign a score to each corresponding label. Higher values correspond to a relatively more preferred label.
In terms of the complexity of the generation of PMs, it is basically a pairwise decomposition problem. Therefore, the complexity is \(\mathcal {O}\left( k^2 \right) \) per matrix, where \(k\) is the number of labels in the ranking. Even though any number of labels is theoretically permitted in label ranking, in practice the number of labels is usually smaller than 20. Hence, the computational cost of generating PMs should not be a problem.
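A minimal sketch of both directions, building the PM of one ranking and recovering label scores by row sums, is given below. It assumes a pairwise encoding in which entry \((i,j)\) is 1 when \(\lambda _i\) is ranked ahead of \(\lambda _j\), \(-1\) when it is ranked behind, and 0 for ties and on the diagonal; the function names are hypothetical.

```python
def preference_matrix(ranks):
    """PM of a single ranking; ranks[i] is the rank of label i (1 = most preferred).

    Assumed encoding: +1 if label i is ranked ahead of label j, -1 if behind,
    0 if tied (and on the diagonal). Filling the matrix takes O(k^2) time.
    """
    k = len(ranks)
    return [[(ranks[j] > ranks[i]) - (ranks[j] < ranks[i]) for j in range(k)]
            for i in range(k)]

def row_sum_scores(pm):
    # Higher score = relatively more preferred label.
    return [sum(row) for row in pm]

ranks = [4, 3, 1, 2]                      # lambda_3 > lambda_4 > lambda_2 > lambda_1
pm = preference_matrix(ranks)
scores = row_sum_scores(pm)
# Labels ordered from most to least preferred (0-based indices).
recovered = sorted(range(len(scores)), key=lambda i: -scores[i])
print(pm)
print(scores, recovered)
```

For a strict total order the row sums are all distinct, so the original ranking is recovered exactly; with ties or aggregated matrices, several labels may receive the same score.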
4.1.2 Preference matrix of a set of rankings
Alternatively, one can also aggregate \(M_{D}\) or \(M_{S}\) using the mode.^{2} That is, the entry-wise modes are used to represent the preferences of a population D or a subgroup S. In this case, each entry of \(M_{S}\) holds the most frequently occurring value among the corresponding entries of the set of \(M_{\pi }, \pi \in S\). In cases where two or more modes per entry are obtained, the median is used.
Table 1 Example dataset \(\hat{D}\); columns \(\lambda _1\) to \(\lambda _4\) give the rank that \(\pi \) assigns to each label
\(\mathcal {A}_1\)  \(\lambda _1\)  \(\lambda _2\)  \(\lambda _3\)  \(\lambda _4\)  Alternative \(\pi \)
0.1  4  3  1  2  \(\lambda _3\succ \lambda _4\succ \lambda _2\succ \lambda _1\)
0.2  3  2  1  4  \(\lambda _3\succ \lambda _2\succ \lambda _1\succ \lambda _4\)
0.3  1  4  2  3  \(\lambda _1\succ \lambda _3\succ \lambda _4\succ \lambda _2\)
0.4  1  3  2  4  \(\lambda _1\succ \lambda _3\succ \lambda _2\succ \lambda _4\)
The records in the illustrative dataset \(\hat{D}\) contain distinct total orders (Table 1). But its PM clearly shows that \(\lambda _3\) is always preferred to \(\lambda _2\) (\(M_{\hat{D}}\left( 3,2\right) =1\)). This information can be easily obtained from the PM, but is hard to read directly from Table 1: although \(\lambda _3\) is always preferred to \(\lambda _2\), the pattern is based on different pairs of ranks, namely \(3 > 1\), \(2 > 1\), \(4 > 2\) and \(3>2\). Thus, unless one is looking specifically for this pattern, it would be quite hard to find. In real datasets, with more examples and labels, the task would be even harder. Conversely, \(\lambda _4\) is never preferred to \(\lambda _3\), which is represented by \(M_{\hat{D}}\left( 4,3\right) =-1\). In some cases, the overall trend is not as clear (e.g., \(\lambda _1\) is preferred to \(\lambda _4\) but not always) and in other cases, there is no trend at all (e.g., \(\lambda _1\) and \(\lambda _2\)).
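These observations can be checked numerically. The sketch below assumes that the PM of a set of rankings is the entry-wise mean of the individual PMs, and that each individual PM uses the \(+1\)/\(-1\)/\(0\) pairwise encoding; the function names are hypothetical and the data are the four rank vectors of \(\hat{D}\).

```python
def preference_matrix(ranks):
    # Assumed encoding: +1 if label i is ranked ahead of label j, -1 if behind, 0 if tied.
    k = len(ranks)
    return [[(ranks[j] > ranks[i]) - (ranks[j] < ranks[i]) for j in range(k)]
            for i in range(k)]

def mean_preference_matrix(rank_vectors):
    """Entry-wise mean of the PMs of all rankings in the set."""
    pms = [preference_matrix(r) for r in rank_vectors]
    k, n = len(rank_vectors[0]), len(rank_vectors)
    return [[sum(pm[i][j] for pm in pms) / n for j in range(k)] for i in range(k)]

d_hat = [
    [4, 3, 1, 2],
    [3, 2, 1, 4],
    [1, 4, 2, 3],
    [1, 3, 2, 4],
]
m = mean_preference_matrix(d_hat)
# Indices are 0-based: m[2][1] is the entry for (lambda_3, lambda_2).
print(m[2][1])   # lambda_3 always preferred to lambda_2
print(m[3][2])   # lambda_4 never preferred to lambda_3
print(m[0][3])   # lambda_1 usually, but not always, preferred to lambda_4
print(m[0][1])   # no trend between lambda_1 and lambda_2
```

Under these assumptions the four printed entries are 1, \(-1\), 0.5 and 0, matching the four cases discussed above.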
Representing a set of rankings as a PM has another advantage over the traditional permutation representation. On a PM, we can naturally derive a varied set of metrics to search for preference patterns in a set of rankings by characterizing parts of the matrix. For example, it enables simple labelwise (by rows/columns of the PM) and pairwise (by single entries of the PM) analysis of preferences (see Sect. 4.3).
On the other hand, PMs can also have limitations in comparison to the traditional representations, like permutations. In particular, the choice of the aggregation metrics can hide relevant information in the PMs. For example, when using the mean, if half of the rankings have the opposite order of the other half (e.g., \(\lambda _1\succ \lambda _2\succ \lambda _3\succ \lambda _4\) and \(\lambda _4\succ \lambda _3\succ \lambda _2\succ \lambda _1\)) this results in a PM with all entries equal to zero. Because the same happens when all rankings are complete ties, there is no way for the method to detect this difference in the preferences. Therefore, in an attempt to mitigate this, subgroups with a PM containing only zeros are ignored. That is, only subgroups for which we can infer at least one pairwise preference can be considered interesting in this exceptional preferences mining approach.
4.2 Characterizing ranking exceptionality
In EPM, we want to search for exceptional preference (or ranking) behavior. Because preferences are represented with rankings, we can distinguish three categories of exceptionality concerning rankings: rankingwise, labelwise and pairwise.
Measures that fall into the first category, rankingwise, use all the entries of the PM and therefore favor subgroups with exceptional complete rankings. That is, if the average ranking of the population is \(\lambda _1\succ \lambda _2\succ \lambda _3\succ \lambda _4\), subgroups with an average ranking of \(\lambda _4\succ \lambda _3\succ \lambda _2\succ \lambda _1\) will be deemed the most interesting. However, finding a reasonable set of rankingwise exceptional preferences can be challenging in some cases. Considering the example of the Sushi dataset mentioned before, with more than 98% of unique rankings, it will be difficult to observe unusual complete rankings that occur very frequently, due to the low number of ranking repetitions.
Labelwise measures are less restrictive and focus on rows/columns of the PMs. They look for subgroups where at least one label is ranked unusually high (or low) in comparison to the whole population. The preferences of these subgroups can be represented as incomplete rankings. In a population where we observe that \(\lambda _1,\lambda _2,\lambda _3\succ \lambda _4\), subgroups where \(\lambda _4\succ \lambda _1,\lambda _2,\lambda _3\) will be interesting. Note that the following complete rankings all agree with \(\lambda _4\succ \lambda _1,\lambda _2,\lambda _3\): \(\lambda _4\succ \lambda _3\succ \lambda _2\succ \lambda _1\), \(\lambda _4\succ \lambda _2\succ \lambda _3\succ \lambda _1\), \(\lambda _4\succ \lambda _3\succ \lambda _1\succ \lambda _2\), \(\lambda _4\succ \lambda _2\succ \lambda _1\succ \lambda _3\), \(\lambda _4\succ \lambda _1\succ \lambda _2\succ \lambda _3\) and \(\lambda _4\succ \lambda _1\succ \lambda _3\succ \lambda _2\). As an example, if a subgroup ranks \(tekkamaki\) consistently in the top 3 while the majority in the dataset ranks it in the last 3, this type of measure will find it to be very interesting.
Finally, pairwise measures pick single entries of the PM, which makes them look for unusual pairwise preferences. Considering a population where the majority agrees that \(\lambda _1\succ \lambda _4\), any subgroup where most of the subjects agree that \(\lambda _4\succ \lambda _1\) will be considered very interesting. This means that, if a population displays the preference \(tamago \succ kappamaki\), a subgroup where most people prefer \(kappamaki \succ tamago\) will be deemed interesting by this type of measure. Our assumption is that, even though over 98% of the total rankings in the Sushi dataset are unique, there is plenty of information present in these rankings: the partial orders and pairwise comparisons can reveal interesting subgroups.
4.3 Characterizing exceptional subgroups
4.4 Quality measures
In this section we introduce the quality measures used in this work. We propose 4 quality measures: 2 rankingwise, 1 labelwise and 1 pairwise (Sect. 4.2). We describe 3 previously proposed measures (de Sá et al. 2016) and introduce a new one.
As we are interested in subgroups with exceptional preferences, we should be able to measure a preference distance. For that we can use the distance matrix \(L_S\). The distance measures we employ typically consider a particular subset of the entries of the distance matrix \(L_S\). Because rankings have inter-label relations that can be explored (Henzgen and Hüllermeier 2014), there are many ways to tackle this, for example, using less restrictive measures to look for unusual behaviors of partial rankings.
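To make the idea of "a particular subset of the entries" concrete, the sketch below scores a distance matrix at three scopes: all entries, one row at a time, and single entries. It is an illustration only and not the definition of the quality measures in this work: it assumes that \(L_S\) is half the entry-wise difference between \(M_D\) and \(M_S\), it ignores any weighting by subgroup size, and all function names are hypothetical.

```python
from math import sqrt

def distance_matrix(m_d, m_s):
    """Assumed form: half the entry-wise difference between the two PMs."""
    k = len(m_d)
    return [[(m_d[i][j] - m_s[i][j]) / 2 for j in range(k)] for i in range(k)]

def rankingwise_score(l_s):
    # Uses every entry: Frobenius norm of the whole matrix.
    return sqrt(sum(v * v for row in l_s for v in row))

def labelwise_score(l_s):
    # Uses one row at a time: largest row norm, i.e. the most deviating label.
    return max(sqrt(sum(v * v for v in row)) for row in l_s)

def pairwise_score(l_s):
    # Uses single entries: largest absolute deviation of any pair of labels.
    return max(abs(v) for row in l_s for v in row)

# Population PM for lambda_1 > lambda_2 > lambda_3, subgroup PM for the inverse ranking.
m_d = [[0, 1, 1], [-1, 0, 1], [-1, -1, 0]]
m_s = [[0, -1, -1], [1, 0, -1], [1, 1, 0]]
l_s = distance_matrix(m_d, m_s)
rw, lw, pw = rankingwise_score(l_s), labelwise_score(l_s), pairwise_score(l_s)
print(rw, lw, pw)
```

With a fully inverted subgroup every off-diagonal entry of the assumed \(L_S\) has absolute value 1, so all three scopes reach their maximum for three labels.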
To the best of our knowledge, as in most EMM approaches (Leeuwen and Knobbe 2012), none of the following quality measures is guaranteed to have anti-monotonicity properties.
4.4.1 Rankingwise measures
Rankingwise quality measures should prefer subgroups whose average rankings are very different to the average ranking of the complete dataset, i.e. maximizing the distance between complete rankings.
Rankingwise norm. If one is searching for subgroups whose average ranking is as close as possible to the inverse ranking of the population, one should use the Rankingwise Norm quality measure, RWNorm. Given a set of subgroups of the same size, this measure gives the highest score to subgroups whose rankings are the inverse of the population.
Rankingwise covariance. Covariance is used in statistics to measure the extent to which two variables vary together. In simple terms, a positive value indicates that when one increases, the other also increases. If they behave in opposite directions, the covariance is negative.
As in RWNorm, we are interested in subgroups with complete rankings that contradict the preferences in the general population. Hence, we can use covariance to measure the deviations of preferences. The entries of a row in the PM \(M_S\) represent how a label relates to the remaining labels in the subgroup S. By abuse of notation, the rows of \(M_S\) and \(M_D\) can be seen as independent variables, which allows us to measure the covariance between labels. That is, we can compare the PM values of a label in a subgroup S with the corresponding values of the same label in D using their covariance.
In comparison to RWNorm, we expect this measure to be more conservative because it requires that most of the entries behave in opposite directions. On the other hand, this measure is better at distinguishing one subgroup whose overall deviation is due to one label deviating strongly and the others not so much, from one where all labels have small deviations.
4.4.2 Labelwise measures
The fact that only one label behaves differently, disregarding the interaction between the other labels, can also be interesting (Cheng et al. 2013). Therefore, it is useful to define labelwise measures that look for subgroups where a label shows unusual behavior. Depending on the application at hand, a subgroup can be considered interesting when at least one label is under- or over-appreciated in comparison to the population. For example, a data analyst might be interested in finding subgroups where the preference for a particular type of sushi is substantially different when compared to the population.
4.4.3 Pairwise measures
In PL, Pairwise Preferences (Hüllermeier et al. 2008) are often the focus of the analysis, decomposing the preferences into label-vs-label pairs. In EPM, if we are interested in subgroups with at least one pair of labels with distinctive preference behavior, we can use pairwise measures.
One alternative pairwise measure could be the pairwise minimum, which would provide the lower bound of PWMax for each subgroup.
4.5 Tackling false discoveries
In SD, one aims to find subsets of the dataset that are interesting in some sense. As such, the space of candidates to be considered for what essentially amounts to a statistical test is vast. Hence, SD suffers from the multiple comparisons problem (Hochberg and Tamhane 1987): when testing a large number of null hypotheses, some will, by chance, be incorrectly rejected. Namely, with a significance level of \(\alpha \), a fraction \(\alpha \) of the true null hypotheses tested is expected to be incorrectly rejected (e.g., 5 out of every 100 for \(\alpha = 5\)%).
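The scale of the problem is easy to quantify. The small sketch below computes the expected number of false rejections and, under the additional assumption that the tests are independent (which rarely holds exactly for overlapping subgroups), the probability of at least one false rejection; the function names are hypothetical.

```python
def expected_false_rejections(alpha, n_tests):
    """Expected number of true null hypotheses rejected at level alpha."""
    return alpha * n_tests

def prob_at_least_one(alpha, n_tests):
    """Probability of one or more false rejections, assuming independent tests."""
    return 1.0 - (1.0 - alpha) ** n_tests

expected = expected_false_rejections(0.05, 100)
p_any = prob_at_least_one(0.05, 100)
print(expected, round(p_any, 3))
```

With \(\alpha = 5\)% and 100 candidate subgroups, about five false discoveries are expected and at least one is almost certain, which motivates the validation procedure described next.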
For supervised local pattern mining, to which SD belongs, a swap-randomization-based statistical test procedure has been developed (Duivesteijn and Knobbe 2011). First, a number of copies of the original dataset are generated, and in each of the copies the target attributes are swap randomized. All other attributes are kept intact. This means that the search space of the mining algorithm and the distribution of the targets remain intact, but the connections between the search space and the target space are broken. The procedure then involves running the algorithm to be tested on each copy of the dataset, and reporting the best subgroup found, according to the selected quality measure. Any subgroup that is found on such a copy of the dataset is interesting only because of random effects. Hence, these are artificially generated false discoveries. The procedure then builds a global model over the artificial false discoveries, the so-called Distribution of False Discoveries (DFD). Then, the subgroups found on the original dataset can be assigned a p value, corresponding to the null hypothesis that a subgroup with this quality is generated by the same process that generated the DFD. Refuting the null hypothesis essentially refutes the hypothesis that the subgroup found is a false discovery.
The DFD validation procedure has only one parameter: the number of dataset copies. This number must be large enough to satisfy certain conditions arising in the global modeling involved in creating the DFD. As noted in Duivesteijn and Knobbe (2011), typically, 100 copies are enough.
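The procedure can be sketched as follows. This is a simplified illustration, not the implementation used in this work: swap randomization is realized as a random permutation of the target column, the global model over the false discoveries is assumed to be a normal distribution, and the function names, the toy miner `toy_best_quality` and the toy data are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev

def swap_randomize(targets, rng):
    """Permute the targets across records; descriptors stay untouched."""
    shuffled = list(targets)
    rng.shuffle(shuffled)
    return shuffled

def dfd_p_value(descriptors, targets, mine_best_quality, observed_quality,
                n_copies=100, seed=0):
    rng = random.Random(seed)
    # Best quality found on each randomized copy = one artificial false discovery.
    false_qualities = [
        mine_best_quality(descriptors, swap_randomize(targets, rng))
        for _ in range(n_copies)
    ]
    # Assumed global model of the DFD: a normal distribution.
    dfd = NormalDist(mean(false_qualities), stdev(false_qualities))
    return 1.0 - dfd.cdf(observed_quality)

def toy_best_quality(descriptors, targets):
    # Hypothetical miner: best single-value subgroup on one nominal descriptor,
    # scored by sqrt(relative size) * |subgroup mean - overall mean|.
    overall = mean(targets)
    best = 0.0
    for value in set(descriptors):
        sub = [t for d, t in zip(descriptors, targets) if d == value]
        q = (len(sub) / len(targets)) ** 0.5 * abs(mean(sub) - overall)
        best = max(best, q)
    return best

descriptors = [1] * 20 + [0] * 20
targets = [1.0] * 20 + [0.0] * 20          # target perfectly tied to the descriptor
observed = toy_best_quality(descriptors, targets)
p_value = dfd_p_value(descriptors, targets, toy_best_quality, observed)
print(observed, p_value)
```

In this toy setting the observed quality lies far in the upper tail of the DFD, so the null hypothesis that the subgroup is a false discovery is rejected at any usual significance level.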
5 Experiments
In this section we start with a description of the experimental setup (Sect. 5.1), then we present some statistics of the datasets used (Sect. 5.2). Then we present the results obtained (Sect. 5.3) and finally we compare our findings with the results of an alternative approach (Sect. 5.4).
5.1 Implementation and experimental setup
We incorporate exceptional preferences mining in the Cortana^{3} software package (Meeng and Knobbe 2011). This package delivers a generic framework for SD, implements several SD instances, and offers many generic features allowing for different SD approaches. The description language consists of logical conjunctions of conditions on single attributes.
Our experiments use a greedy best-first search approach (Algorithm 1). The numeric strategy used for these experiments is an on-the-fly discretization approach with 8 equal-width bins. For every bin boundary we use a set of numeric operators such as \(\ge \) and \(\le \).
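A minimal sketch of such a discretization is shown below. It is not Cortana's implementation: it assumes that only the interior bin boundaries are used as cut points and that each cut point yields one \(\le \) and one \(\ge \) condition; the function name and the tuple representation of a condition are hypothetical.

```python
def equal_width_conditions(values, attribute, n_bins=8):
    """Candidate conditions on one numeric attribute from equal-width bins.

    Assumes max(values) > min(values); otherwise all cut points coincide.
    """
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    # Interior boundaries only: n_bins - 1 cut points.
    cut_points = [lo + i * width for i in range(1, n_bins)]
    conditions = []
    for c in cut_points:
        conditions.append((attribute, "<=", c))
        conditions.append((attribute, ">=", c))
    return conditions

# Values of the attribute among the records covered by the description being refined.
conds = equal_width_conditions([0, 8, 3, 5], "x")
for cond in conds:
    print(cond)
```

Because the bins are computed on the fly, the cut points depend on the records covered by the current description rather than on the whole dataset.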
All the findings we present in this paper have gone through the DFD validation procedure (Sect. 4.5) with 100 copies, and all have been found significant at a significance level of \(\alpha =1\)%.
All the subgroups presented in this manuscript were found in less than 3 minutes of execution time, on an Intel Core i7 5500U CPU @ 2.40GHz with 16GB of RAM. The DFD validation procedure, for depths greater than 4, can take more than 30 minutes, depending on the dataset.
5.2 Datasets
To illustrate domain-specific interpretation of the results, we experiment with some real-world datasets (Table 2). The Algae dataset^{4} is based on the COIL 1999 Competition Data from UCI (Lichman 2013). This dataset concerns the frequencies of algae populations in different environments. It consists of 340 examples, each representing measurements of a sample of water from different European rivers in different periods. The measurements include concentrations of chemical substances such as nitrogen (in the form of nitrates, nitrites and ammonia), oxygen and chlorine. The pH, season, river size and flow velocity are also registered. For each sample, we have the preference relations of 7 types of algae, which represent the concentrations ordered from larger to smaller. Those with 0 frequency are placed in last position and equal frequencies are represented with ties. Missing values are set to 0.
The Sushi preference dataset (Kamishima 2003) is composed of demographic data about 5000 people and their sushi preferences. Each person sorted a set of 10 different sushi types by preference. The 10 types of sushi are (a) shrimp, (b) sea eel, (c) tuna, (d) squid, (e) sea urchin, (f) salmon roe, (g) egg, (h) fatty tuna, (i) tuna roll and (j) cucumber roll.

a) American Beauty (1999)

b) Star Wars: Episode IV—A New Hope (1977)

c) Star Wars: Episode V—The Empire Strikes Back (1980)

d) Star Wars: Episode VI—Return of the Jedi (1983)

e) Jurassic Park (1993)

f) Saving Private Ryan (1998)

g) Terminator 2: Judgment Day (1991)
We also study data with socioeconomic information from regions of Germany and their electoral results, the datasets GermanElections2005 and GermanElections2009. The 413 records correspond to the administrative districts of Germany, which are described by 39 attributes. Both datasets are part of the data extracted from a publicly available database of the German Federal Office of Statistics (Boley et al. 2013). A similar study has been presented in Grosskreutz et al. (2010), but restricted to the city of Cologne.

a) CDU (conservative)

b) SPD (centerleft)

c) FDP (liberal)

d) Green (centerleft)

e) Left (leftwing)
Table 2 Dataset details
Datasets             #examples  #labels  #attributes  \(U_{\pi }\) (%)  \(E\left( U_\pi \right) \) (%)
GermanElections2005  412        5        31           5                 28
GermanElections2009  412        5        33           7                 28
Top7movies           602        7        7            52                94
Algae                316        7        11           72                96
Sushi                5000       10       10           98                99
Cpusmall             8192       5        6            1                 1
Considering the case of the Sushi dataset (Table 2), with \(U_\pi = 98\)%, if we randomly pick 100 instances (i.e. 100 users and their rankings), we will probably see 98 distinct rankings. This means that it is extremely unlikely to find more than 3 users with the very same preferences. On the other hand, because \(U_\pi =98\)% is close to \(E\left( U_\pi \right) =99\)%, we should also not expect very strong biases in the ranking behaviors. For these reasons, we expect that it will be harder to find complete ranking patterns in this dataset.
For the two German elections datasets, \(U_\pi \) is considerably lower than its expected value \(E\left( U_\pi \right) \). This indicates that not all rankings are equally probable in this election scenario. This makes sense, because in elections it is very unusual for all parties to have equal chances of occupying every position across different regions.
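The two statistics can be illustrated with a short sketch. It rests on two assumptions of ours: \(U_\pi \) is read as the percentage of distinct rankings among all examples, and \(E\left( U_\pi \right) \) is computed as if every example drew one of the \(k!\) complete rankings uniformly at random. Under these assumptions, the sketch gives roughly 28% for 412 examples with 5 labels and roughly 94% for 602 examples with 7 labels, in line with Table 2.

```python
from math import factorial


def unique_ranking_pct(rankings):
    """U_pi: percentage of distinct rankings among all examples."""
    return 100.0 * len({tuple(r) for r in rankings}) / len(rankings)


def expected_unique_pct(n_examples, n_labels):
    """E(U_pi), assuming each example draws one of the n_labels! complete
    rankings uniformly at random (an assumption of this sketch)."""
    n_rankings = factorial(n_labels)
    expected_distinct = n_rankings * (1.0 - (1.0 - 1.0 / n_rankings) ** n_examples)
    return 100.0 * expected_distinct / n_examples


toy_rankings = [(1, 2, 3), (1, 2, 3), (3, 2, 1), (2, 1, 3)]
print(unique_ranking_pct(toy_rankings))          # 3 distinct out of 4 examples
print(round(expected_unique_pct(412, 5), 1))     # election-sized dataset
print(round(expected_unique_pct(602, 7), 1))     # Top7movies-sized dataset
```

A large gap between the observed and the expected percentage, as in the election data, then signals that the rankings are far from uniformly distributed.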
5.3 Results
In this section we show some of the most interesting results obtained with the different quality measures.
5.3.1 Study on the behavior and biases of the quality measures
With each of the introduced quality measures, one can find subgroups featuring exceptional ranking behavior. The exceptionality is measured in (sometimes subtly) different ways for the different quality measures; which quality measure one uses depends on what type of exceptional ranking one is looking for. The quality measures we have outlined in Sect. 4.4 all live at a different level of granularity: a subgroup is flagged up as interesting by the one measure if only a single pair of labels has an exceptional relative ranking, by the other measure if a single label has an exceptional ranking relative to all others, and by the last measure if overall label behavior is exceptional. This difference in scope implies that the measures are correlated, but not perfectly so. In this section, we explore the resulting differences in focus between the quality measures, to allow the user to make an informed choice.
The first row shows the subgroups of RWNorm and the vertical axis represents its score. The horizontal axis represents the scores of each quality measure, in the following order: RWNorm, RWNormMode, RWCov, LWNorm and PWMax. The second row shows the subgroups of RWNormMode, and so on.
As expected, some quality measures have a different but congruent bias. We can observe that 3 measures have a very similar bias: RWNorm, LWNorm and PWMax. This is somewhat expected, since they are essentially the same measure, applied to different parts of the distance matrix \(L_S\).
Table 3 Total number of significant subgroups found per dataset, with depth 1, using the different quality measures
Datasets             RWNorm  RWNormMode  RWCov  LWNorm  PWMax
GermanElections2005  59      19          0      59      62
GermanElections2009  55      18          1      53      59
Top7movies           2       0           0      2       2
Algae                22      5           1      22      21
Sushi                25      5           0      18      20
Cpusmall             12      10          6      12      12
Finally, RWCov seems to have the most different bias. That is because it is not based on the distance matrix \(L_S\); instead, it directly measures the negative correlation between the population \(M_D\) and the subgroups \(M_S\). Therefore, with this quality measure, we will find subgroups that do not necessarily maximize preference distance, but instead feature unusual preference behavior in an abstract sense.
Now, let us focus on the number of subgroups obtained per measure for the datasets in Table 3. Using a best-first search to find subgroups, we compare the number of subgroups obtained, per quality measure per dataset. For simplicity, we use a search depth of 1. RWCov is, by far, the measure that identifies the fewest subgroups across datasets. This seems to indicate that this measure is very restrictive, as expected (Sect. 4.4).
5.3.2 German elections
With the GermanElections2005 dataset, using PWMax with a search depth of 1, we found 62 significant subgroups. The best subgroup, \(\text {Region} = \text {East}\), indicates that the preference between the party with label e and the party with label c behaves very differently from the majority. In fact, while in 75% of the districts in Germany the FDP party (label c) received more votes than the Left party (label e) in the 2005 elections, all 87 districts of East Germany gave more votes to the Left party than to the FDP party. This is a clear example of an extreme inversion of preferences.
The second best subgroup compares the center-left Green party (label d) with the left-wing Left party (label e). The Green party had more votes than the Left party in 72% of the districts in Germany. In contrast, in 88% of the districts where the average income is less than or equal to 16,979, the Left party received more votes than the Green party.
To compare with the German elections of 2009, we used the GermanElections2009 dataset with the same settings and found 57 significant subgroups. As in the 2005 elections, the best subgroup shows that 100% of the districts in East Germany gave more votes to the Left party than to the Green party, compared to only 27% in the whole of Germany. The second best subgroup, as in the 2005 case, compares the center-left Green party (label d) with the left-wing Left party (label e). However, in this case, in 94% of the districts where the average income is less than or equal to 16,979, the Left party had the advantage over the Green party. Compared to the 88% of 2005, this is an increase of 6 p.p. in the share of districts with average income \(\le \)16,979 that favored the Left party over the Green party.

\(\text {Children Population} \le 14.8\% \wedge \text {Income} \le 16,634\)

\(\text {Children Population} \le 14.8\% \wedge \text {Unemployment} \ge 8.4\)%

\(\text {Income} \ge 18{,}442\)

\(\text {Income} \ge 17{,}791 \wedge \text {Youth unemployment} \le 8.5\)%

\(CDU \succ \mathbf {Left} \succ SPD \succ FDP \succ Green\) (Thuringia)

\(\mathbf {Left} \succ \mathbf {SPD} \succ CDU \succ FDP \succ Green\) (Brandenburg)

\(\mathbf {Left} \succ CDU \succ SPD \succ FDP \succ Green\) (SaxonyAnhalt)

\(CDU \succ \mathbf {Left} \succ SPD \succ FDP \succ Green\) (Saxony)

\(CDU \succ SPD \succ FDP \succ \mathbf {Green} \succ Left\) (Bavaria)

\(CDU \succ SPD \succ FDP \succ Left \succ Green\) (All states)
This analysis also shows the potential of EPM as a tool to study election data. By looking at different levels of granularity of the preferences, EPM does not necessarily focus on the winners, but rather on major preference shifts. Also, for the elections application, different ranking aggregation metrics can be used to comply with the Condorcet method (de Condorcet 1785).
5.3.3 Top7Movies
On the other hand they seem to dislike American Beauty and Jurassic Park. In fact, the average ranking of this subgroup is \(b\succ \mathbf {f} \succ c \succ d \succ g \succ \mathbf {a} \succ e\) and the average ranking of the whole population is \(b\succ c \succ \mathbf {a} \succ \mathbf {f} \succ d \succ g \succ e\).
5.3.4 Algae
The visual representations of the PM clearly reveal the effect of the LWNorm quality measure in this dataset. We can also observe from the description of the subgroups obtained, that the variables V10 and V6 are highly correlated with the presence of the algae a.
5.3.5 Sushi
Considering the high percentage of unique rankings in the Sushi dataset (Table 2), we do not expect to find strong patterns in the whole PM. Therefore, we focus on label-wise ranking patterns.
5.3.6 Cpusmall
In terms of the rankings, the average ranking of the whole dataset is \(\left( 2,4,3,1,5\right) \), and the average ranking in this subgroup is \(\left( 3,1,5,4,2\right) \). The Kendall \(\tau \) correlation of these two rankings is \(-0.4\), which confirms the unusualness of the subgroup.
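The correlation between the two average rankings can be checked by counting concordant and discordant label pairs. The sketch below assumes complete rankings without ties, given as rank vectors; it finds 3 concordant and 7 discordant pairs out of 10, that is, \(\tau = (3-7)/10 = -0.4\).

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau between two complete rankings without ties (rank vectors)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return (concordant - discordant) / n_pairs


population = (2, 4, 3, 1, 5)  # average ranking of the whole dataset
subgroup = (3, 1, 5, 4, 2)    # average ranking of the subgroup
print(kendall_tau(population, subgroup))  # -0.4
```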
We could also observe that, despite having obtained 275 significant subgroups, there were many subgroups whose PM was very similar and showing the same unusual behavior. This could also be observed in terms of the ranking derived from their PM.
5.3.7 Comparison of different aggregation metrics
As mentioned in Sect. 4.1, different metrics can be used in the aggregation of PM. To test how this choice can affect the model, we analyzed some results where PMs are aggregated with the mode (instead of the mean). For the sake of space, we only present one dataset and one quality measure, RWNormMode.
Using the mode as the aggregation (the RWNormMode quality measure), we found 131 significant subgroups of depth 2 on the Cpusmall dataset. As a point of comparison, we obtained 155 significant subgroups, with the same settings, using the RWNorm quality measure (aggregation with the mean). Despite the similar number of subgroups found, the two sets of subgroups are quite distinct. This is somewhat expected from the previous analysis of the quality measures in Sect. 5.3.1.
A striking difference is that the rankings of the subgroups from RWNormMode are consistently different from the ones obtained with RWNorm. However, despite being different, the average rankings of the subgroups have a similar correlation (in terms of the Kendall \(\tau \)) to the average ranking of the population.^{7} In other words, the subgroups are at a similar “preference distance” from the population. This seems to indicate that RWNormMode can be a complementary measure with RWNorm.
From a different perspective, in Fig. 13 we compare the distributions of the correlation between the average ranking of the dataset and each one of the rankings that are part of the best subgroup. We measure this correlation in terms of the Kendall \(\tau \) correlation coefficient. As seen in Fig. 13, the distributions are similar. This behavior was also observed in other subgroups and other datasets. Therefore, this confirms what we observed above: RWNormMode and RWNorm find different subgroups, but with similar “preference distances”.
Aggregating a PM with the mode can yield either 1, 0 or \(-1\), in contrast to the mean, where any value in the interval \([-1,1]\) is possible. Therefore, the mean can measure exceptionality on subgroups with the same mode as the dataset (e.g., label a in Fig. 8). On the other hand, the mode can detect subgroups where the majority of the pairs behave differently. Therefore, depending on the task, the best choice of the aggregation metric for the quality measures can change. However, we believe that the best way is to complement the use of RWNormMode with RWNorm and vice versa.
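The difference between the two aggregations can be seen on a toy example. The sketch below is an illustration under assumptions: each ranking is encoded as a pairwise matrix with entries 1, 0 and \(-1\) (the sign convention is ours), and a tie between modes is resolved to 0, which is a choice made only for this sketch.

```python
from statistics import mean, multimode


def preference_matrix(ranking):
    """Pairwise matrix: 1 if label i is ranked ahead of j, -1 if behind, 0 if tied."""
    k = len(ranking)
    return [[(ranking[j] > ranking[i]) - (ranking[j] < ranking[i])
             for j in range(k)] for i in range(k)]


def aggregate(rankings, how="mean"):
    """Entry-wise aggregation of the pairwise matrices of a set of rankings."""
    matrices = [preference_matrix(r) for r in rankings]
    k = len(rankings[0])
    result = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(k):
            entries = [m[i][j] for m in matrices]
            if how == "mean":
                result[i][j] = mean(entries)
            else:
                modes = multimode(entries)
                # Assumption of this sketch: ambiguous modes count as 0.
                result[i][j] = modes[0] if len(modes) == 1 else 0
    return result


toy_rankings = [(1, 2, 3), (1, 3, 2), (2, 1, 3)]
pm_mean = aggregate(toy_rankings, "mean")
pm_mode = aggregate(toy_rankings, "mode")
print(pm_mean[0][1], pm_mode[0][1])
```

For the first pair of labels, two of the three toy rankings agree, so the mean entry is 1/3 while the mode entry is 1: the mean retains how strong the majority is, whereas the mode only retains its direction.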
5.4 Comparison with distribution rules
In this section, we compare subgroups found with our algorithm (using Cortana) with subgroups from a different approach, Distribution Rules (DR), using the CAREN software^{8} (Azevedo and Jorge 2010). As mentioned before (Sect. 3.2), Distribution Rules are an SD method that looks for unusual target distributions (Jorge et al. 2006; Lucas et al. 2007). Cortana and CAREN can be used for mining other structures of data. For simplicity, in this work we refer to Cortana and CAREN as the tools with our preference learning approaches.
DR use a numeric target to construct the distributions. Since we have rankings as targets, we propose a simple way to represent individual rankings as numeric values. For each example, we compute the similarity score between its ranking and the average ranking (consensus ranking; Brazdil et al. 2003) of the dataset. Given that the similarity measure we use is the Kendall \(\tau \), the new target takes values in the range \(\left[ -1,1\right] \).
Example dataset \(\hat{D}\) with the proposed alternative representation in the rightmost column of the table
\(\mathcal {A}_1\)  \(\pi \): \(\lambda _1\)  \(\lambda _2\)  \(\lambda _3\)  \(\lambda _4\)  Similarity to average ranking
0.1                 4  3  1  2                                                                     0
0.2                 3  2  1  4                                                                     0.66
0.3                 1  4  2  3                                                                     0.33
0.4                 1  3  2  4                                                                     0.66
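The rightmost column of the example can be reproduced with a few lines. This sketch assumes that the consensus is obtained by ordering the labels by their mean rank (with ties in mean rank broken by label order, a choice of this sketch) and that the rankings are complete and without ties. For the four rankings above, the consensus is \(\left( 2,3,1,4\right) \) and the similarities are 0, 2/3, 1/3 and 2/3; the values 0.66 in the table appear to be 2/3 truncated.

```python
from itertools import combinations


def kendall_tau(rank_a, rank_b):
    """Kendall tau between two complete rankings without ties (rank vectors)."""
    score = 0
    for i, j in combinations(range(len(rank_a)), 2):
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        score += (product > 0) - (product < 0)
    return score / (len(rank_a) * (len(rank_a) - 1) // 2)


def average_ranking(rankings):
    """Consensus by average rank: order the labels by their mean rank position."""
    k = len(rankings[0])
    mean_ranks = [sum(r[i] for r in rankings) / len(rankings) for i in range(k)]
    order = sorted(range(k), key=lambda i: mean_ranks[i])
    consensus = [0] * k
    for position, label in enumerate(order, start=1):
        consensus[label] = position
    return tuple(consensus)


rankings = [(4, 3, 1, 2), (3, 2, 1, 4), (1, 4, 2, 3), (1, 3, 2, 4)]
consensus = average_ranking(rankings)
targets = [round(kendall_tau(r, consensus), 2) for r in rankings]
print(consensus, targets)
```

The resulting numeric column can then be handed to any method that expects a single numeric target, which is what makes the comparison with DR possible.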
For a fair comparison between the two methods, we discretized the numeric attributes beforehand with an equal-width discretization of 8 bins. We treat the discretized numeric attributes as nominal rather than ordinal. The property of interest (target), although numeric, does not have to be discretized beforehand, because the method works with raw distributions (Lucas et al. 2007).
In terms of the experimental setup, we use the same maximum search depth for both methods. In Cortana, we take the RWNorm quality measure. For each subgroup, we perform a Kolmogorov–Smirnov statistical test to compare the target distribution of the subgroup with the target distribution of the whole population. The subgroups deemed interesting are the ones whose distributions differ significantly from the distribution of the whole population.
We use the terms subgroup and distribution rule interchangeably. However, when we need to distinguish between the results of Cortana and CAREN, we use the terms subgroups and distribution rules, respectively.
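The statistic behind the Kolmogorov–Smirnov test is the largest gap between the empirical distribution functions of the two samples. The sketch below computes only this statistic, on toy values that are not taken from any of the datasets; turning it into a p value is left to a statistics library, and nothing here is meant to describe how CAREN implements the test.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum absolute
    difference between the two empirical distribution functions."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both samples past x, so ties are handled consistently.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


# Toy similarity targets in [-1, 1] for a subgroup and for the population.
subgroup_targets = [-0.4, -0.2, 0.0]
population_targets = [-0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8, 1.0]
print(ks_statistic(subgroup_targets, population_targets))  # 0.625
```

A subgroup whose rankings are concentrated at low similarity to the consensus, as in this toy case, yields a large statistic, which is the kind of deviation the comparison is looking for.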
5.4.1 German elections
Comparison of subgroups found by CAREN and Cortana
CAREN  Cortana 

\(\text {Region} = \text {East}\)  \(\text {Region} = \text {East}\) 
\(\text {Region} = \text {East} \wedge \text {Type} = \text {Rural}\)  \(\text {Region} = \text {East} \wedge \text {Reg.Web.Dom.} = \text {a}\) 
\(\text {Region} = \text {East} \wedge \text {Reg.Web.Dom.} = \text {a}\)  \(\text {Income} = \text {a} \wedge \text {Region} = \text {East}\) 
\(\text {Income} = \text {a} \wedge \text {Region} = \text {East}\)  \(\text {Region} = \text {East} \wedge \text {Type} = \text {Rural}\) 
\(\text {Income} = \text {a}\)  \(\text {Income} = \text {a}\) 
5.4.2 Top7Movies
Comparison of subgroups found by CAREN and Cortana
CAREN  Cortana 

\(\text {Age} =\) 35–44 \(\wedge \,\text {Latitude} = \text {h}\)  \(\text {Age} =\) 35–44 \(\wedge \,\text {Gender} = \text {Male}\) 
\(\text {Age} = \) 35–44 \(\wedge \,\text {longitude} = \text {g}\)  \(\text {Age} =\) 35–44 
\(\text {Age} = \) 35–44  \(\text {Age} =\) 35–44 \(\wedge \, \text {Latitude} = \text {h}\) 
\(\text {Occupations} = \text {Other} \wedge \text {Age} =\) 25–34  \(\text {Age} =\) 35–44 \(\wedge \,\text {Longitude} = \text {g}\) 
\(\text {Occupations} = \text {Other} \wedge \text {Gender} = \text {Male}\)  \(\text {Age} = \) 18–24 
\(\text {Age} =\) 50+ \(\wedge \,\text {Gender} = \text {Male}\)  \(\text {Age} =\) 18–24 \(\wedge \,\text {Latitude} = \text {h}\) 
\(\text {Occupations} = \text {Other}\)  \(\text {Age} =\) 18–24 \(\, \wedge \,\text {Longitude} = \text {g}\) 
We note that, in the Label Ranking context, despite the similarities between the subgroups found by CAREN and Cortana, the interpretation of the rankings is richer with a PM than with a distribution. PM are better for spotting slight nuances in the preference patterns, for example, when a particular label is under- or over-appreciated. Moreover, if we want to search for partial ranking patterns, such as single labels or simply label-vs-label comparisons, it is simpler to visualize and handle them with a PM. This means that EPM, due to its representation of rankings, leaves more room for the creation of new quality measures.
6 Conclusions
In this work, we empirically show how exceptional preferences mining (EPM) can be used in problems where the target concept can be represented as a preference over a set of labels, such as rankings or pairwise comparisons. The results are a set of subgroups, each described by a conjunction of a few conditions on some attributes, where the label preferences are exceptional in some sense. The presented subgroups form clear, coherent parts of the search space, which means that EPM finds deviating preferences that are actionable for domain experts, since their description in terms of attributes should be familiar to them.
All subgroups whose Preference Matrix (PM) deviates significantly from the PM of the whole dataset are considered to be interesting. We used four quality measures for EPM that instantiate this concept of ‘interesting’ at different levels: Ranking-wise, Label-wise and Pairwise. The RWNorm, RWNormMode and RWCov quality measures consider a subgroup interesting if the full set of preference relations is substantially displaced. The LWNorm quality measure highlights subgroups where any one label interacts exceptionally with the other labels, agnostic of how those other labels interact with each other. The PWMax quality measure finds a subgroup interesting if any one pair of labels displays exceptional preference relations. Hence, by choosing the appropriate quality measure, EPM delivers subgroups featuring preference relations that are exceptional at your preferred scope.
To show the potential of the approach, we provided experiments on several datasets. The experiments with the RWNorm quality measure on the Algae dataset revealed several interesting conditions that can affect the populations of the different species of algae from rivers. The experiments with the LWNorm quality measure on the Sushi dataset illustrate the relative merit of this quality measure: it focuses on subgroups where one particular label is exceptionally under- or over-appreciated. The subgroup presented has a penchant for Sea Urchin (cf. Fig. 10). The PWMax measure shows its potential on the GermanElections2005 dataset by identifying several subgroups with strong exceptional preferences with respect to the different parties. The experiments with the RWCov quality measure on the Cpusmall dataset (e.g., Fig. 11) reveal a subgroup with quite unusual preference behavior. Finally, RWNormMode was compared to the RWNorm measure in different experiments, and we could observe that it revealed some interesting subgroups too. Moreover, we concluded that RWNormMode and RWNorm can be complementary measures to study exceptional preference patterns.
As we argued in Sect. 3, one of the main benefits of a local pattern mining method such as EPM is that it delivers interpretable results. That means that the resulting subgroups are ideally suited to instigate real-world policies and actions. For this reason, we studied several real-world datasets.
We also compared the results found with EPM with an alternative approach, Distribution Rules (DR). Despite their very different setting, the subgroups found by this method were very similar to the ones found with Cortana. In our opinion, this simple comparison empirically shows that our suggested quality measures for EPM are finding relevant patterns. In terms of interpretation, PM are better than distribution rules at detecting slight nuances in the preference patterns, for example, when a particular label is under- or over-appreciated. In some cases, information which is not easy to obtain with the usual representations of rankings is clearly revealed through the PM visualization (see Sect. 5.3.2).
From this study, we also understand some limitations of our approach. We observed that, in some cases, despite having obtained many significant subgroups, most of them are specializations of simpler subgroups with very similar, if not equal, average rankings. This means that many different subgroups capture the same ranking behavior.
EPM also has the disadvantage of being time consuming. A large number of labels combined with a reasonably high search depth makes the statistical tests very time consuming.
As future work, we would like to study alternative ways to represent and look for patterns in rankings, for example for rankings with a large number of labels as well as for partial orders. Finally, we would also like to study how pruning techniques such as minimum improvement can be used to filter out subgroups that are specializations of simpler subgroups but have very similar PMs.
Acknowledgements
This research has received funding from the ECSEL Joint Undertaking, the framework programme for research and innovation Horizon 2020 (2014–2020) under Grant Agreement Number 662189-MANTIS-2014-1.
References
 Abudawood, T., & Flach, P. A. (2009). Evaluation measures for multi-class subgroup discovery. In Machine learning and knowledge discovery in databases, European conference, ECML PKDD 2009, Bled, Slovenia, September 7–11, 2009, proceedings, Part I, pp. 35–50.
 Agrawal, R., Mannila, H., Srikant, R., Toivonen, H., & Verkamo, A. I. (1996). Fast discovery of association rules. In Advances in knowledge discovery and data mining, pp. 307–328. AAAI/MIT Press.
 Azevedo, P. J., & Jorge, A. M. (2010). Ensembles of jittered association rule classifiers. Data Min. Knowl. Discov., 21(1), 91–129.
 Boley, M., Mampaey, M., Kang, B., Tokmakov, P., & Wrobel, S. (2013). One click mining: Interactive local pattern discovery through implicit preference and performance learning. In Proceedings of the ACM SIGKDD workshop on interactive data exploration and analytics, IDEA@KDD 2013, Chicago, Illinois, USA, August 11, 2013, pp. 27–35.
 Brandenburg, F., Gleißner, A., & Hofmeier, A. (2013). Comparing and aggregating partial orders with Kendall tau distances. Discrete Mathematics, Algorithms and Applications, 5(2).
 Brazdil, P., & Soares, C. (2000). A comparison of ranking methods for classification algorithm selection. In Machine learning: ECML 2000, 11th European conference on machine learning, Barcelona, Catalonia, Spain, May 31–June 2, 2000, Proceedings, pp. 63–74.
 Brazdil, P., Soares, C., & da Costa, J. P. (2003). Ranking learning algorithms: Using IBL and meta-learning on accuracy and time results. Machine Learning, 50(3), 251–277.
 Breen, J. (2012). Zipcode: US ZIP code database for geocoding, 2012. R package version 1.0.
 Brinker, K., & Hüllermeier, E. (2007). Label ranking in case-based reasoning. In Case-based reasoning research and development, 7th international conference on case-based reasoning, ICCBR 2007, Belfast, Northern Ireland, UK, August 13–16, 2007, proceedings, pp. 77–91.
 Chankong, V., & Haimes, Y. (2008). Multiobjective decision making: Theory and methodology. Dover Books on Engineering. Dover Publications.
 Cheng, W., Dembczynski, K., & Hüllermeier, E. (2010). Label ranking methods based on the Plackett-Luce model. In Proceedings of the 27th international conference on machine learning (ICML-10), June 21–24, 2010, Haifa, Israel, pp. 215–222.
 Cheng, W., Henzgen, S., & Hüllermeier, E. (2013). Labelwise versus pairwise decomposition in label ranking. In LWA 2013. Lernen, Wissen and Adaptivität, workshop proceedings Bamberg, 7–9 Oct 2013, pp. 129–136.
 Cheng, W., Huhn, J. C., & Hüllermeier, E. (2009). Decision tree and instance-based learning for label ranking. In Proceedings of the 26th annual international conference on machine learning, ICML 2009, Montreal, Quebec, Canada, June 14–18, 2009, pp. 161–168.
 Cheng, W., Rademaker, M., Baets, B. D., & Hüllermeier, E. (2010). Predicting partial orders: Ranking with abstention. In Machine learning and knowledge discovery in databases, European conference, ECML PKDD 2010, Barcelona, Spain, Sept. 20–24, 2010, proceedings, Part I, pp. 215–230.
 Chiclana, F., Herrera-Viedma, E., & Alonso, S. (2009). A note on two methods for estimating missing pairwise preference values. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 39(6), 1628–1633.
 Chomicki, J. (2003). Preference formulas in relational queries. ACM Transactions on Database Systems, 28(4), 427–466.
 Cook, W. D., Doyle, J., Green, R. H., & Kress, M. (1996). Ranking players in multiple tournaments. Computers & OR, 23(9), 869–880.
 Cook, W. D., Golany, B., Penn, M., & Raviv, T. (2007). Creating a consensus ranking of proposals from reviewers’ partial ordinal rankings. Computers & OR, 34(4), 954–965.
 de Condorcet, M. (1785). Essai sur l’application de l’analyse à la probabilité des décisions rendues à la pluralité des voix (trans. essay on the application of mathematics to the theory of decision-making).
 de Sá, C. R., Azevedo, P. J., Soares, C., Jorge, A. M., & Knobbe, A. J. (2018). Preference rules for label ranking: Mining patterns in multi-target relations. Information Fusion, 40, 112–125.
 de Sá, C. R., Duivesteijn, W., Soares, C., & Knobbe, A. (2016). Exceptional preferences mining. In Discovery science, pp. 1–16.
 de Sá, C. R., Soares, C., & Knobbe, A. J. (2016). Entropy-based discretization methods for ranking data. Inf. Sci., 329, 921–936.
 Dekel, O., Manning, C. D., & Singer, Y. (2003). Log-linear models for label ranking. In Advances in neural information processing systems 16 [Neural information processing systems, NIPS 2003, Dec. 8–13, 2003, Vancouver and Whistler, British Columbia, Canada], pp. 497–504.
 Dembczynski, K., Kotlowski, W., Slowinski, R., & Szelag, M. (2010). Learning of rule ensembles for multiple attribute ranking problems. In Preference learning, pp. 217–247. Berlin: Springer.
 Duivesteijn, W. (2013). Exceptional model mining. Ph.D. thesis, Leiden University.
 Duivesteijn, W., Feelders, A., & Knobbe, A. J. (2012). Different slopes for different folks: Mining for exceptional regression models with Cook’s distance. In The 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’12, Beijing, China, Aug. 12–16, 2012, pp. 868–876.
 Duivesteijn, W., Feelders, A., & Knobbe, A. J. (2016). Exceptional model mining: Supervised descriptive local pattern mining with complex target concepts. Data Min. Knowl. Discov., 30(1), 47–98.
 Duivesteijn, W., & Knobbe, A. J. (2011). Exploiting false discoveries: Statistical validation of patterns and quality measures in subgroup discovery. In 11th IEEE International conference on data mining, ICDM 2011, Vancouver, BC, Canada, Dec. 11–14, 2011, pp. 151–160.
 Dzyuba, V., & van Leeuwen, M. (2013). Interactive discovery of interesting subgroup sets. In Advances in intelligent data analysis XII—12th international symposium, IDA 2013, London, UK, Oct. 17–19, 2013. Proceedings, pp. 150–161.
 Fürnkranz, J., & Hüllermeier, E. (2003). Pairwise preference learning and ranking. In Machine learning: ECML 2003, 14th European conference on machine learning, Cavtat-Dubrovnik, Croatia, Sept. 22–26, 2003, proceedings, pp. 145–156.
 Fürnkranz, J., & Hüllermeier, E. (Eds.). (2010). Preference learning. Berlin: Springer.
 Grosskreutz, H., Boley, M., & Krause-Traudes, M. (2010). Subgroup discovery for election analysis: A case study in descriptive data mining. In Discovery science—13th international conference, DS 2010, Canberra, Australia, Oct. 6–8, 2010. Proceedings, pp. 57–71.
 Har-Peled, S., Roth, D., & Zimak, D. (2002). Constraint classification: A new approach to multiclass classification. In Algorithmic learning theory, 13th international conference, ALT 2002, Lübeck, Germany, Nov. 24–26, 2002, proceedings, pp. 365–379.
 Harper, F. M., & Konstan, J. A. (2016). The MovieLens datasets: History and context. TiiS, 5(4), 19:1–19:19.
 Henzgen, S., & Hüllermeier, E. (2014). Mining rank data. In Discovery science—17th international conference, DS 2014, Bled, Slovenia, Oct. 8–10, 2014. Proceedings, pp. 123–134.
 Heusner, M., Keller, T., & Helmert, M. (2017). Understanding the search behaviour of greedy best-first search. In Proceedings of the tenth international symposium on combinatorial search, Edited by Alex Fukunaga and Akihiro Kishimoto, 16–17 June 2017, Pittsburgh, Pennsylvania, USA, pp. 47–55.
 Hochberg, Y., & Tamhane, A. (1987). Multiple comparison procedures. Wiley series in probability and mathematical statistics: Appliedprobability and statistics. WileyGoogle Scholar
 Hüllermeier, E., Fürnkranz, J., Cheng, W., & Brinker, K. (2008). Label ranking by learning pairwise preferences. Artificial Intelligence, 172(16–17), 1897–1916.
 Jin, N., Flach, P. A., Wilcox, T., Sellman, R., Thumim, J., & Knobbe, A. J. (2014). Subgroup discovery in smart electricity meter data. IEEE Transactions on Industrial Informatics, 10(2), 1327–1336.
 Jorge, A. M., Azevedo, P. J., & Pereira, F. (2006). Distribution rules with numeric attributes of interest. In Knowledge discovery in databases: PKDD 2006, 10th European conference on principles and practice of knowledge discovery in databases, Berlin, Germany, Sept. 18–22, 2006, proceedings, pp. 247–258.
 Jorge, A. M., Pereira, F., & Azevedo, P. J. (2006). Visual interactive subgroup discovery with numerical properties of interest. In Discovery science, 9th international conference, DS 2006, Barcelona, Spain, Oct. 7–10, 2006, proceedings, pp. 301–305.
 Kamishima, T. (2003). Nantonac collaborative filtering: recommendation based on order responses. In Proceedings of the ninth ACM SIGKDD international conference on knowledge discovery and data mining, Washington, DC, USA, Aug. 24–27, 2003, pp. 583–588.
 Klösgen, W. (1996). Explora: A multipattern and multistrategy discovery assistant. In Advances in knowledge discovery and data mining, pp. 249–271. American Association for Artificial Intelligence.
 Klösgen, W., & Zytkow, J. M. (Eds.) (2002). Handbook of data mining and knowledge discovery. New York, NY: Oxford University Press.
 Lavrac, N., Kavsek, B., Flach, P. A., & Todorovski, L. (2004). Subgroup discovery with CN2-SD. Journal of Machine Learning Research, 5, 153–188.
 Leman, D., Feelders, A., & Knobbe, A. J. (2008). Exceptional model mining. In Machine learning and knowledge discovery in databases, European conference, ECML/PKDD 2008, Antwerp, Belgium, Sept. 15–19, 2008, proceedings, Part II, pp. 1–16.
 Lichman, M. (2013). UCI machine learning repository.
 Lucas, J. P., Jorge, A. M., Pereira, F., Pernas, A. M., & Machado, A. A. (2007). A tool for interactive subgroup discovery using distribution rules. In Progress in artificial intelligence, 13th Portuguese conference on artificial intelligence, EPIA 2007, workshops: GAIW, AIASTS, ALEA, AMITA, BAOSW, BI, CMBSB, IROBOT, MASTA, STCS, and TEMA, Guimarães, Portugal, Dec. 3–7, 2007, proceedings, pp. 426–436.
 Meeng, M., & Knobbe, A. (2011). Flexible enrichment with Cortana: software demo. In Proceedings of BeneLearn, pp. 117–119.
 Sculley, D. (2007). Rank aggregation for similar items. In Proceedings of the seventh SIAM international conference on data mining, April 26–28, 2007, Minneapolis, Minnesota, USA, pp. 587–592.
 Svendová, V., & Schimek, M. G. (2017). A novel method for estimating the common signals for consensus across multiple ranked lists. Computational Statistics & Data Analysis, 115, 122–135.
 Todorovski, L., Blockeel, H., & Dzeroski, S. (2002). Ranking with predictive clustering trees. In Machine learning: ECML 2002, 13th European conference on machine learning, Helsinki, Finland, Aug. 19–23, 2002, proceedings, pp. 444–455.
 Umek, L., & Zupan, B. (2011). Subgroup discovery in data sets with multidimensional responses. Intelligent Data Analysis, 15(4), 533–549.
 Van, T. L., van Leeuwen, M., Nijssen, S., Fierro, A. C., Marchal, K., & Raedt, L. D. (2014). Ranked tiling. In Machine learning and knowledge discovery in databases: European conference, ECML PKDD 2014, Nancy, France, Sept. 15–19, 2014, proceedings, Part II, pp. 98–113.
 van Leeuwen, M., & Knobbe, A. J. (2012). Diverse subgroup set discovery. Data Mining and Knowledge Discovery, 25(2), 208–242.
 Vembu, S., & Gärtner, T. (2010). Label ranking algorithms: A survey. In Preference learning, pp. 45–64. Berlin: Springer.
Copyright information
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.