Abstract
This paper is a plea for revisiting various existing approaches to the handling of data, for classification purposes, based on a set-theoretic view, such as version space learning, formal concept analysis, or analogical proportion-based inference, which rely on different paradigms and motivations and have been developed separately. The paper also exploits the notion of conditional object as a proper tool for modeling if-then rules. It also advocates possibility theory for handling uncertainty in such settings. It is a first, and preliminary, step towards a unified view of what these approaches contribute to machine learning.
Keywords
 Data
 Classification
 Version space
 Conditional object
 If-then rule
 Analogical proportion
 Formal concept analysis
 Possibility theory
 Possibilistic logic
 Bipolarity
 Uncertainty
1 Introduction
It is an understatement to say that the current dominant paradigms in machine learning rely on neural nets and statistics; see, e.g., [1, 8]. Yet, there have been quite a number of set-theoretic or logic-based views that have considered data sets from different perspectives: we can thus (at least) mention concept learning [24, 25], formal concept analysis [19], rough sets [28], logical analysis of data [4], test theory [7], and the GUHA method [22]. Still, some other works, mentioned later, may also be relevant. These various paradigms can be related to logic, but have been developed independently. Strangely enough, little has been done to move towards a unified view of them.
This research note aims to be a first step in this direction. However, the result will remain modest, since we shall only outline connections between some settings, while others will be left aside for the moment. Moreover, we shall mainly focus on Boolean data, even if some of what is said could be extended to nominal, or even numerical, data. Still, we believe that it is of scientific interest to better understand the relationships between these different theoretical settings, developed with various motivations and distinct paradigms, while all start from the same object: a set of data. In the long run, such a better understanding may contribute to some cooperation between these set-theory-based views and currently popular ones, such as neural nets or statistical approaches, perhaps providing tools for explanation capabilities; see, e.g., [6] for references and a tentative survey.
The paper is organized as follows. Section 2 states and discusses the problem of assigning an item to a class, given examples and counterexamples. Section 3 presents a simple propositional logic reading of the problem. Section 4 puts the discussion in a more appropriate setting using the notion of conditional object [12], which captures the idea of a rule better than material implication does. Moreover, a rule-based reading of analogical proportion-based classification [26] is also discussed in Sect. 5. Section 6 briefly recalls the version space characterization of the set of possible descriptions of a class from examples, emphasizing its bipolar nature. Section 7 advocates the interest of possibilistic logic [16] for handling uncertainty and coping with noisy data. Indeed, sensitivity to noise is a known drawback of pure set-theoretic approaches to data handling. Section 8 briefly surveys formal concept analysis and suggests its connection and potential relevance to classification. Section 9 mentions some other related matters and issues, pointing out lines for further research.
2 Classification Problem – A General View
Let us consider m pieces of data that describe items in terms of n attributes \(A_j\). Namely, an item i is represented by a vector \(\varvec{a^i} = (a^{i}_1, a^{i}_2, \cdots , a^{i}_n)\), with \(i=1,\dots ,m\), together with its class \(cl(\varvec{a^i})\), where \(a^{i}_j\) denotes the value of the jth attribute \(A_j\) for item \(\varvec{a^i}\), namely \(A_j(\varvec{a^i}) = a_j^i \in dom(A_j)\) (\(dom(A_j)\) denotes the domain of attribute \(A_j\)). Each domain \(dom(A_j)\) can be described using a set of propositional variables \(\mathcal {V}_j\) specific to \(A_j\), by means of logical formulas. If \(|dom(A_j)| = 2\), one propositional variable \(v_j\) is enough and \(dom(A_j) = \{v_j, \lnot v_j\}\).
Let \(\mathcal {C} = \{cl(\varvec{a^i}) \mid i=1,\dots, m\}\) be a set of classes, where each object is supposed to belong to one and only one class. The classification problem amounts to predicting the class \(cl(\varvec{a^*})\in \mathcal {C} \) of a new item \(\varvec{a^*}\) described in terms of the same attributes, on the basis of the m examples \((\varvec{a^i}, cl(\varvec{a^i}))\) consisting of classified objects.
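For concreteness, the partition of labelled items into examples and counterexamples of a given class can be sketched in Python (data and names purely illustrative, not from the paper):

```python
# Items as tuples of Boolean attribute values, paired with a class label.
# For a chosen class, split the data into its examples E and the
# counterexamples E' (the examples of the other classes).

def partition(data, cls):
    """Return (examples of cls, counterexamples of cls)."""
    E = [a for a, c in data if c == cls]
    E_prime = [a for a, c in data if c != cls]
    return E, E_prime

data = [((1, 0), "C"), ((0, 1), "C"), ((0, 0), "D")]
E, E_prime = partition(data, "C")
print(E, E_prime)  # [(1, 0), (0, 1)] [(0, 0)]
```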
There are other problems that are akin to classification, with different terminologies. Let us at least mention case-based decision, and diagnosis. In the first situation, we face a multiple criteria decision problem where one wants to predict the value of a new item on the basis of a collection of valued items (assuming that possible values belong to a finite scale), while in the second situation attribute values play the role of symptoms (present or not) and classes are replaced by diseases [13]. In both situations, the m examples constitute a repertory of reference cases already experienced. This is also true in case-based reasoning, where a solution is to be found for a newly encountered problem on the basis of a collection of previously solved ones, for which the solution is known; however, case-based reasoning usually includes an adaptation step of the selected past solution, for a better adequacy with the new problem. Thus, ideas and methods developed in these different fields may also be of interest in a classification perspective.
Two further comments are in order here. First, for each class C, one may partition the whole set of m data in two parts: the set \(\mathcal {E}\) of examples associated with this class, and the set \(\mathcal {E'}\) of examples of other classes, which can be viewed as counterexamples for this class. The situation is pictured in Table 1 below. It highlights the fact that the whole set of items in class C is bracketed between \(\mathcal {E}\) and \(\overline{\mathcal {E'}}\) (where the overbar means complementation). If the table is contradiction-free, there is no item that is both in \(\mathcal {E}\) and in \(\mathcal {E'}\).
Second, the classification problem can be envisaged in two different manners:

1. as an induction problem, where one wants to build a plausible description of each class; it can be done in terms of if-then rules associating sets of attribute values with a class, these rules being used for prediction purposes;

2. as a transduction problem, where the prediction is made without the help of such descriptions, but by means of direct comparisons of the new item with the set of the m examples.
3 A Simple Logical Reading
An elementary idea for characterizing a class C is to look for an attribute such that the subset of values taken for this attribute by the available examples of class C is disjoint from the subset of values taken by the examples of the other classes. If there exists at least one such attribute \(A_{j^*}\), then one may inductively assume that belonging or not to class C, for any new item, can be predicted on the basis of its value for \(A_{j^*}\). More generally, if a particular combination of attribute values can be encountered only for items of a class C, then a new item with this particular combination should also be put plausibly in class C. Let us now have a more systematic logical analysis of the data.
Let us consider a particular class \(C \in \mathcal {C}\). Then the m items \(\varvec{a^i}\) can be partitioned into two subsets: the items \(\varvec{a^i}\) such that \(cl(\varvec{a^i})= C\), and those such that \(cl(\varvec{a^i})\ne C\) (we assume that \(|\mathcal {C}| \ge 2\)). Thus we have a set \(\mathcal {E}\) of examples for C, namely \(\varvec{e^i} = (a^{i}_1, a^{i}_2, \cdots , a^{i}_n, 1)=(\varvec{a^i}, 1)\), where ‘1’ means that \(cl(\varvec{a^i})= C\), and a set \(\mathcal {E'}\) of counterexamples \(\varvec{e'^j} = (a'^{j}_1, a'^{j}_2, \cdots , a'^{j}_n, 0)\), where ‘0’ means that \(cl(\varvec{a'^{j}})\ne C\).
Let us assume that the domains \(dom(A_j)\) for \(j=1,\dots ,n\) are finite and denote by \(v_C\) the propositional variable associated to class C (\(v_C\) has truth-value 1 for elements of C and 0 otherwise). Using the attribute values as propositional logic symbols, an example \(\varvec{e^i}\) expresses the truth of the logical statement
$$a^{i}_1 \wedge a^{i}_2 \wedge \cdots \wedge a^{i}_n \rightarrow v_C,$$
meaning that if it is an example, then it belongs to the class, while counterexamples \(\varvec{e'^j}\) are encoded by stating that the formula \( a'^{j}_1 \wedge a'^{j}_2 \wedge \cdots \wedge a'^{j}_n \rightarrow \lnot v_C\) is true, or equivalently
$$v_C \rightarrow \lnot a'^{j}_1 \vee \lnot a'^{j}_2 \vee \cdots \vee \lnot a'^{j}_n.$$
Then any class (or concept) C that agrees with the m pieces of data is such that
$$\bigvee _{i} (a^{i}_1 \wedge \cdots \wedge a^{i}_n) \;\models \; v_C \;\models \; \bigwedge _{j} (\lnot a'^{j}_1 \vee \cdots \vee \lnot a'^{j}_n). \qquad (1)$$
Letting \(\mathcal {E}\) be the set of models of \(\bigvee _{i}\varvec{a^i}\) (the examples) and \(\mathcal {E'}\) be the set of models of \(\bigvee _{j}\varvec{a'^j}\) (the counterexamples), (1) simply reads \(\mathcal {E} \subseteq C \subseteq \overline{\mathcal {E'}}\), where the overbar denotes complementation. Note that the larger the number of counterexamples, the more specific the upper bound of C; the larger the number of examples, the more general the lower bound of C.
This logical expression states that if an item is identical to an example on all attributes then it is in the class, and that if an item is in the class then it should be different from all counterexamples on at least one attribute.
Let us assume Boolean attributes for simplicity, and let us suppose that \(a_1^i = v_{1}\) is true for all the examples of class C and false for all the examples of other classes. Then it can be seen that (1) can be put in the form \(v_1 \wedge L \models v_C \models v_1 \vee L'\) where L and \(L'\) are logical expressions that do not involve any propositional variable pertaining to attribute \(A_{1}\). This provides a reasonable support for inducing that an item belongs to C as soon as \(v_1\) is true for it. Such a remark can be generalized to a combination of attribute values and to nominal attributes.
Let us consider a toy example, small yet sufficient for an illustration of (1) and starting the discussion.
Example 1
It is an example with two Boolean attributes, two classes (C and \( \overline{C}\)), two examples and a counterexample. Namely, we have \(\varvec{e^1} = (a^{1}_1, a^{1}_2, 1)= (1, 0, 1) = (v_1, \lnot v_2, v_C)\); \(\varvec{e^2} = (a^{2}_1, a^{2}_2, 1)= (0, 1, 1)= (\lnot v_1, v_2, v_C)\); \(\varvec{e'^1} = (a'^{1}_1, a'^{1}_2, 0)= (0, 0, 0)= (\lnot v_1,\lnot v_2, \lnot v_C)\).
We can easily see that \((v_1\wedge \lnot v_2) \vee ( \lnot v_1\wedge v_2) \,\models \, v_C\,\models \,v_1 \vee v_2\), i.e., we have \(v_1 \dot{\vee }v_2\,\models \,v_C \,\models \,v_1 \vee v_2\), where \(\dot{\vee }\) stands for exclusive or. Indeed depending on whether (1, 1) is an example or a counterexample, the class C will be described by \(v_1 \vee v_2\), or by \(v_1\,\dot{\vee }\,v_2\) respectively.
Note that in the absence of any further information or principle, the two options for assigning (1, 1) to a class on the basis of \(\varvec{e^1}, \varvec{e^2}\) and \(\varvec{e'^1}\), are equally possible here. \(\Box \)
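The two admissible descriptions of C in Example 1 can be recovered by brute-force enumeration of the candidate extensions bracketed as in (1); a minimal sketch, assuming items are Boolean pairs (data as in Example 1):

```python
from itertools import combinations, product

# Enumerate all subsets C of the universe {0,1}^2 and keep those with
# E ⊆ C ⊆ complement(E'), i.e., the extensions compatible with (1).
E = {(1, 0), (0, 1)}        # examples of class C
E_prime = {(0, 0)}          # counterexamples
universe = set(product([0, 1], repeat=2))
upper = universe - E_prime  # complement of E'

candidates = [frozenset(C)
              for r in range(len(universe) + 1)
              for C in combinations(universe, r)
              if E <= set(C) <= upper]

# Exactly two candidates remain: the examples alone (v1 xor v2) and the
# examples plus the undecided item (1,1) (v1 or v2).
print(len(candidates))  # 2
```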
Observe that if the bracketing of C in (1) is consistent, the conjunction of the lower bound expression and the upper bound expression yields the lower bound. But in the case of an item which would appear both as an example and as a counterexample for C (noisy data), this conjunction would not be a contradiction, as one might expect in general, as shown by the example below.
Example 2
Assume we have \(\varvec{e^1} = (1, 0, 1)\); \(\varvec{e^2} = (1, 1, 1)\); \(\varvec{e'^1} = (1, 1, 0)\). The sets \(\mathcal {E}\) and \(\mathcal {E'}\) overlap since \(\varvec{e^2}\) and \(\varvec{e'^1}\) are the same item, classified differently. As a consequence we do not have \(\mathcal {E} \subseteq \overline{\mathcal {E'}}\). So the bracketing (1) does not hold: \(v_1= (v_1\wedge \lnot v_2) \vee (v_1\wedge v_2) \,\models \,v_C \,\models \,\lnot v_1 \vee \lnot v_2\), i.e., \(v_1\,\models \,v_C \,\models \,\lnot v_1 \vee \lnot v_2\), is wrong, even though \(v_1 \wedge (\lnot v_1 \vee \lnot v_2) = v_1 \wedge \lnot v_2 \ne \bot \). \(\Box \)
A more appropriate treatment of inconsistency will be proposed in the next section.
The two expressions bracketing C in (1) have a graded counterpart, proposed in [17], for assessing how satisfactory an item is, given a set of examples and a set of counterexamples supposed to describe what we are looking for. Then an item is ranked all the better as it is similar to at least one example on all important attributes, and dissimilar to all counterexamples on at least one important attribute (where similarity, dissimilarity, and importance are matters of degrees). However, this ranking problem is somewhat different from the classification problem where each item should be assigned to a class. Here if an item is close both to an example and to a counterexample, it has a poor evaluation, just as it would if it were close to a counterexample only.
Note that if one considers examples only, the graded counterpart amounts to searching for items that are similar to examples. In terms of classification, it means looking for the pieces of data that are sufficiently similar (on all attributes) to the item, the class of which one wants to predict, and assigning this item to the class shared by the majority of these similar data. This is the k-nearest neighbor method. This is also very close to fuzzy case-based reasoning and instance-based learning [11, 23].
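A minimal sketch of this k-nearest-neighbor idea on Boolean data, using Hamming distance as the dissimilarity (data and names purely illustrative):

```python
from collections import Counter

def hamming(a, b):
    """Number of attributes on which two Boolean items differ."""
    return sum(x != y for x, y in zip(a, b))

def knn_predict(data, item, k=3):
    """Majority class among the k items closest to `item`."""
    nearest = sorted(data, key=lambda ac: hamming(ac[0], item))[:k]
    votes = Counter(c for _, c in nearest)
    return votes.most_common(1)[0][0]

data = [((1, 1, 0), "C"), ((1, 0, 0), "C"), ((0, 1, 1), "D"), ((0, 0, 1), "D")]
print(knn_predict(data, (1, 1, 1), k=3))  # 'C' (two of the 3 nearest are in C)
```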
4 Conditional Objects and Rules
A conditional object \(b|a\), where a, b are propositions, is a three-valued entity, which is true if \(a\wedge b\) is true; false if \(a\wedge \lnot b\) is true; inapplicable if a is false; see, e.g., [12]. It can be thought of as the rule ‘if a then b’. Indeed, the rule can be fired only if a is true; the examples of this rule are such that \(a\wedge b\) is true, while its counterexamples are such that \(a \wedge \lnot b\) is true. This view of conditionals dates back to Bruno de Finetti’s works in the 1930s.
An (associative) quasi-conjunction & can be defined for conditional objects:
$$(b|a)\,\&\,(d|c) = ((a\rightarrow b) \wedge (c\rightarrow d))\,|\,(a \vee c),$$
where \(\rightarrow \) denotes the material implication. It fits with the intuition that a set of rules can be fired as soon as at least one rule can be fired, and when a rule is fired, the rule behaves like material implication. Moreover, entailment between conditional objects is defined by \(b|a \vDash d|c\) iff \(a\wedge b \vDash c\wedge d\) and \(c\wedge \lnot d \vDash a\wedge \lnot b\), which expresses that the examples of rule ‘if a then b’ are examples of rule ‘if c then d’, and the counterexamples of rule ‘if c then d’ are counterexamples of rule ‘if a then b’. It can be checked that \(b|a= (a\wedge b)|a = (a\rightarrow b)|a\) since these three conditional objects have the same examples and the same counterexamples. It can also be shown that \(a\wedge b|\top \vDash b|a \vDash a\rightarrow b|\top \) (where \(\top \) denotes tautology), thus highlighting the fact that \(b|a\) is bracketed by the conjunction \(a\wedge b\) and the material implication \(a\rightarrow b\).
Let us revisit expression (1) in this setting. For an example \(\varvec{e} =(\varvec{a}, 1)\), and a counterexample \(\varvec{e'}=(\varvec{a'}, 0)\) with respect to a class C, it leads to considering the conditional objects \(v_C|\varvec{a} \) and \(\lnot v_C|\varvec{a'}\) respectively (if it is an example we are in the class, otherwise not).
For a collection of examples we have
$$\mathop {\&}_{i=1}^{r}\, (v_C|\varvec{a^i}) = v_C \,\Big |\, \bigvee _{i=1}^{r}\varvec{a^i}.$$
Similarly, we have
$$\mathop {\&}_{j=1}^{s}\, (\lnot v_C|\varvec{a'^j}) = \lnot v_C \,\Big |\, \bigvee _{j=1}^{s}\varvec{a'^j}.$$
Letting \(\phi _E= \bigvee _{i=1}^r\varvec{a^i}\) and \(\phi _{E'}= \bigvee _{j=1}^s\varvec{a'^j}\), we can join the two conditional expressions:
$$(v_C|\phi _E)\,\&\,(\lnot v_C|\phi _{E'}) = ((\phi _E \rightarrow v_C) \wedge (\phi _{E'} \rightarrow \lnot v_C)) \,\big |\, (\phi _E \vee \phi _{E'}),$$
where \(\rightarrow \) again denotes material implication.
A set of conditional objects K is said to be consistent if and only if for no subset \(S \subseteq K\) does the quasi-conjunction Q(S) of the conditional objects in S entail a conditional contradiction of the form \(\bot |\phi \) [12]. Contrary to material implication, the use of three-valued conditionals reveals the presence of contradictions in the data.
Example 3
(Example 2 continued) The data are \(\varvec{e^1} = (1, 0, 1)\); \(\varvec{e^2} = (1, 1, 1)\); \(\varvec{e'^1} = (1, 1, 0)\). In terms of conditional objects, considering the subset \(\{\varvec{e^2},\varvec{e'^1} \}\), we have
$$(v_C\,|\,v_1\wedge v_2)\,\&\,(\lnot v_C\,|\,v_1\wedge v_2) = (v_C \wedge \lnot v_C)\,|\,(v_1\wedge v_2) = \bot \,|\,(v_1\wedge v_2),$$
which is a conditional contradiction. \(\Box \)
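The consistency test can be sketched in Python, representing a conditional object \(b|a\) by a pair (set of worlds where a holds, subset of those where the conditional is true); this finite-world encoding is our own simplification for illustration, not the paper's formulation:

```python
from itertools import product

def quasi_conjunction(c1, c2):
    """(b|a) & (d|c): context is a or c; true where each applicable
    conditional behaves like material implication."""
    (a1, t1), (a2, t2) = c1, c2
    ctx = a1 | a2
    true = {w for w in ctx
            if (w not in a1 or w in t1) and (w not in a2 or w in t2)}
    return ctx, true

def is_contradiction(cond):
    """Detect a conditional contradiction bot|phi: nonempty context, no
    world where the conditional is true."""
    ctx, true = cond
    return bool(ctx) and not true

worlds = set(product([0, 1], repeat=2))        # valuations of (v1, v2)
ctx = {w for w in worlds if w == (1, 1)}       # models of v1 and v2
e2 = (ctx, ctx)       # example:        asserts v_C  on v1 ∧ v2
ce1 = (ctx, set())    # counterexample: asserts ¬v_C on v1 ∧ v2
print(is_contradiction(quasi_conjunction(e2, ce1)))  # True
```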
5 Analogical ProportionBased Transduction
Apart from the k-nearest neighbor method, there is another transduction approach to the classification problem which applies to Boolean, nominal and numerical attribute values [5]. For simplicity here, we only consider Boolean attributes. It relies on the notion of analogical proportion [26]. Analogical proportions are statements of the form “a is to b as c is to d”, often denoted by \(a:b\,{::}\,c:d\), which express that “a differs from b as c differs from d and b differs from a as d differs from c”. This statement can be encoded into a Boolean logical expression which is true only for the 6 following assignments of (a, b, c, d): (0, 0, 0, 0), (1, 1, 1, 1), (1, 0, 1, 0), (0, 1, 0, 1), (1, 1, 0, 0), and (0, 0, 1, 1). Boolean analogical proportions straightforwardly extend to vectors of attribute values such as \(\varvec{a}=(a_1, ..., a_n)\), by stating \(\varvec{a}:\varvec{b}\,{::}\,\varvec{c}:\varvec{d} \text{ iff } \forall i \!\in \! [1,n], ~ a_i:b_i\,{::}\,c_i:d_i\). The basic analogical inference pattern is then
$$\frac{\forall i \in [1,n],\ a_i:b_i\,{::}\,c_i:d_i \text{ holds}}{cl(\varvec{a}):cl(\varvec{b})\,{::}\,cl(\varvec{c}):cl(\varvec{d}) \text{ holds}}$$
Thus analogical reasoning amounts to finding completely informed triples \((\varvec{a}, \varvec{b}, \varvec{c})\) appropriate for inferring the missing value(s) in \(\varvec{d}\). When there exist several suitable triples, possibly leading to distinct conclusions, one may use a majority vote for concluding. This inference method is an extrapolation, which applies to classification (then the class \(cl(\varvec{x})\) is the unique solution, when it exists, such that \(cl(\varvec{a}):cl(\varvec{b})\,{::}\,cl(\varvec{c}):cl(\varvec{x})\) holds).
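The Boolean proportion and its equation-solving step can be sketched as follows; encoding \(a:b\,{::}\,c:d\) as "the differences a-b and c-d coincide" is an implementation choice that reproduces exactly the six assignments above:

```python
def ap(a, b, c, d):
    """Boolean analogical proportion a:b::c:d."""
    return (a - b) == (c - d)  # a differs from b as c differs from d

def ap_vec(a, b, c, d):
    """Componentwise extension to vectors of Boolean attribute values."""
    return all(ap(*t) for t in zip(a, b, c, d))

def solve(a, b, c):
    """Unique x (if any) such that a:b::c:x holds."""
    for x in (0, 1):
        if ap(a, b, c, x):
            return x
    return None  # e.g. 0:1::1:? has no solution

print(solve(1, 1, 0))  # 0
```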
Let us examine more carefully how it works. The inference in fact takes items pair by pair, and then puts two pairs in parallel. Let us first consider the case where three items belong to the same class; the fourth item is the one whose class one wants to predict (this class being denoted by 1 in the following). Consider a pair of items \(\mathbf {a^i}\) and \(\mathbf {a^j}\). There are attributes for which the two items are equal and attributes for which they differ. For simplicity, we assume that they differ only on the first attribute (the method easily extends to more attributes). So we have \(\mathbf {a^i}=(a^i_1, a^i_2, \cdots ,a^i_n, 1)\) and \(\mathbf {a^j}=(a^j_1,a^j_2, \cdots , a^j_n, 1)\) with \(a^j_1 = \lnot a^i_1\) and \(a^j_t = a^i_t=v_t\) for \(t=2,\dots , n\). This means that the change from \(a^i_1\) to \(a^j_1\) in context \((v_2,\cdots ,v_n)\) does not change the class. Assume we now have another pair \(\mathbf {a^k}=(v_1, a^k_2, \cdots ,a^k_n, 1)\) and \(\mathbf {a^{\star }}=(\lnot v_1,a^\star _2,\cdots ,a^\star _n, ?)\) involving the item whose class we have to predict, exhibiting the same change on attribute \(A_1\) and being equal elsewhere, i.e., we have \( a^k_t=a^\star _t =v^\sharp _t \) for \(t=2,\dots , n\). Putting the two pairs in parallel, we obtain the following pattern
\((v_1, v_2, \cdots ,v_n, 1)\)
\((\lnot v_1, v_2, \cdots ,v_n, 1)\)
\((v_1, v^\sharp _2, \cdots ,v^\sharp _n, 1)\)
\((\lnot v_1,v^\sharp _2,\cdots ,v^\sharp _n, ?)\).
It is not difficult to check that \(\mathbf {a^i}\), \(\mathbf {a^j}\), \(\mathbf {a^k}\) and \(\mathbf {a^\star }\) are in analogical proportion for each attribute. So \(\mathbf {a^i}:\mathbf {a^j}\,{::}\,\mathbf {a^k}:\mathbf {a^\star }\) holds. The solution of \(1: 1\,{::}\,1 : ?\) is obviously \(?=1\), so the prediction is \(cl(\mathbf {a^{\star }})=1\). This conclusion is thus based on the idea that since the change from \(a^i_1\) to \(a^j_1\) in context \((v_2,\cdots ,v_n)\) does not change the class, it is the same in the other context \((v^\sharp _2,\cdots ,v^\sharp _n)\).
The case where \(\mathbf {a^i}\) and \(\mathbf {a^k}\) belong to class C while \(\mathbf {a^j}\) is in \(\lnot C\) leads to another analogical pattern, where the change from \(a^i_1\) to \(a^j_1\) now changes the class in context \((v_2,\cdots ,v_n)\). The pattern is
\((v_1, v_2, \cdots ,v_n, 1)\)
\((\lnot v_1, v_2, \cdots ,v_n, 0)\)
\((v_1, v^\sharp _2, \cdots ,v^\sharp _n, 1)\)
\((\lnot v_1,v^\sharp _2,\cdots ,v^\sharp _n, ?)\).
The conclusion is now \(?=0\), i.e., \(\mathbf {a^{\star }}\) is not in C. This approach thus implements the idea that the switch from \(a^i_1\) to \(a^j_1\) that changes the class in context \((v_2,\cdots ,v_n)\), also leads to the same change in context \((v^\sharp _2,\cdots ,v^\sharp _n)\).
It has been theoretically established that analogical classifiers always yield exact prediction for Boolean affine functions describing the class (which includes xor functions), and only for them [9]. Still, a majority vote among the predicting triples often yields the right prediction in other situations [5].
Let us see how it works on Example 1 and variants.
Example 4
In Example 1 we have: \(\varvec{e^1} = (1, 0, 1)\); \(\varvec{e^2} = (0, 1, 1)\); \(\varvec{e'^1} = (0, 0, 0)\). We can check that there is no analogical prediction in this case for (1, 1, ?). Indeed, whatever the way we order the three vectors, either we get the 4-tuple (1, 0, 0, 1) on one component, which is not a pattern in agreement with an analogical proportion, or the equation \(0 : 1\,{::}\, 1 : ?\), which has no solution. So analogy remains neutral in this case.
However, suppose instead that we have \(\varvec{e^1} = (1, 0, 1)\); \(\varvec{e^2} = (1, 1, 1)\); \(\varvec{e'^1} = (0, 1, 0)\). Taking the triple \((\varvec{e^2}, \varvec{e^1}, \varvec{e'^1})\), we can check that \((1, 1) : (1, 0) \,{::}\, (0, 1) : (0, 0)\) holds on each of the two vector components. The solution of the equation \(1 : 1 \,{::}\, 0 : ?\) is \(?=0\), which is the analogical prediction for (0, 0, ?).
Similarly, in the case \(\varvec{e^1} = (1, 0, 1)\), \(\varvec{e^2} = (1, 1, 1)\) and \(\varvec{e^3} = (0, 1, 1)\), we would obtain \(?=1\) for (0, 0, ?) as expected, using triple \((\varvec{e^2}, \varvec{e^1}, \varvec{e^3})\). \(\Box \)
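The transduction procedure of this section, with a majority vote over all predicting triples, can be sketched as follows (an illustrative implementation that counts each ordered triple once; data as in Example 4):

```python
from collections import Counter
from itertools import permutations

def ap(a, b, c, d):
    """Boolean analogical proportion a:b::c:d."""
    return (a - b) == (c - d)

def analogical_predict(data, x):
    """Majority vote over all ordered triples (a, b, c) of classified items
    such that a:b::c:x holds componentwise and cl(a):cl(b)::cl(c):? is
    solvable in {0, 1}."""
    votes = Counter()
    for (a, ca), (b, cb), (c, cc) in permutations(data, 3):
        if all(ap(ai, bi, ci, xi) for ai, bi, ci, xi in zip(a, b, c, x)):
            for v in (0, 1):
                if ap(ca, cb, cc, v):
                    votes[v] += 1
    return votes.most_common(1)[0][0] if votes else None  # None: neutral

# Variant of Example 4: e1=(1,0) in C, e2=(1,1) in C, e'1=(0,1) not in C.
data = [((1, 0), 1), ((1, 1), 1), ((0, 1), 0)]
print(analogical_predict(data, (0, 0)))  # 0
```

On the first data set of Example 4 (e1=(1,0), e2=(0,1) in C, e'1=(0,0) not in C), the same function returns None for (1,1), matching the neutrality observed there.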
It is clear that the role of analogical reasoning here is to complete the data set with new examples or counterexamples obtained by transduction, assuming analogical inference patterns are valid in the case under study. It may be a first step prior to the induction of a classification model.
6 Concept Learning, Version Space and Logic
The version space setting, as proposed by Mitchell [24, 25], offers an elegant elimination procedure, exploiting examples and counterexamples of a class, then called “concept”, for restricting the hypotheses space and providing an approach to rule learning.
Let us recall the approach using a simple example, with 3 attributes:

- \(A_1= \ \)Sky (with possible values Sunny, Cloudy, and Rainy),
- \(A_2=\ \)Air Temp (with values Warm and Cold),
- \(A_3=\ \)Humidity (with values Normal and High).
The problem is to learn the concept of \(C= \ \)Nice Day on the basis of examples and counterexamples. This means finding all hypotheses h, such that the implication \(h \rightarrow v_C\) is compatible with the examples and the counterexamples.
Each hypothesis is described by a conjunction of constraints on the attributes, here Sky, Air Temp, and Humidity. Constraints may be ? (any value is acceptable), \(\emptyset \) (no value is acceptable), a specific value, or a disjunction thereof. The target concept C, here Nice Day, is supposed to be represented by a disjunction of hypotheses (there may exist different h and \(h'\) such that \(h \rightarrow v_C\) and \(h' \rightarrow v_C\)).
Descriptions of examples or counterexamples can be ordered according to their generality/specificity. Thus for instance, the following descriptions are ordered according to decreasing generality: \({<}?, ?, ?{{>}}\), \({<}\)Sunny \(\vee \) Cloudy\(,?, ?{{>}}\), \({<}\)Sunny\(, ?, ?{{>}}\), \({<}\)Sunny, ?, Normal\({{>}}\), \({<}\emptyset , \emptyset , \emptyset {{>}}\).
The version space is represented by its most general and least general members. The socalled general boundary G is the set of maximally general members of the hypothesis space that are consistent with the data. The specific boundary S is the set of maximally specific members of the hypothesis space that are consistent with the data. G and S are initialized as \(G\,=\,{<}?, ?, ?{{>}}\) and \(S\,=\,{<}\emptyset , \emptyset , \emptyset {{>}}\) (for 3 attributes as in the example).
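The generality ordering on such descriptions can be sketched by representing each constraint as a set of allowed values, where ? is the full domain and \(\emptyset \) the empty set (names and domains follow the running example; the encoding itself is illustrative):

```python
# h1 is at least as general as h2 iff, attribute by attribute, h1 allows
# every value h2 allows.
DOMAINS = ({"Sunny", "Cloudy", "Rainy"}, {"Warm", "Cold"}, {"Normal", "High"})

def more_general_or_equal(h1, h2):
    return all(c2 <= c1 for c1, c2 in zip(h1, h2))

any_day = tuple(set(d) for d in DOMAINS)            # <?, ?, ?>
sunny_or_cloudy = ({"Sunny", "Cloudy"}, *any_day[1:])
sunny = ({"Sunny"}, *any_day[1:])
sunny_normal = ({"Sunny"}, any_day[1], {"Normal"})
bottom = (set(), set(), set())                      # <∅, ∅, ∅>

# The chain from the text, in decreasing generality.
chain = [any_day, sunny_or_cloudy, sunny, sunny_normal, bottom]
print(all(more_general_or_equal(a, b) for a, b in zip(chain, chain[1:])))  # True
```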
The procedure amounts to finding a maximally specific hypothesis which covers the positive examples. Suppose we have two examples of Nice Day:
Ex1. <Sunny, Warm, Normal>
Ex2. <Sunny, Warm, High>
Then, taking into account Ex1, S is updated to \(S_1\,=\,\) \({<}{} \texttt {Sunny}\),\( \texttt {Warm}\),\(\texttt {Normal}{{>}}\).
Adding Ex2, S is generalized into \(S_2\,=\,{<}\)Sunny, Warm\(, ?{{>}}\), which corresponds to the disjunction of Ex1 and Ex2. The positive training examples force the S boundary of the version space to become increasingly general (\(S_2\) is more general than \(S_1\)).
Although the version space approach was not cast in a logical setting, it is perfectly compatible with the logical encoding (1). Indeed here we have two examples of the form \((v_1, v_2, v_3)\) and \((v_1, v_2, \lnot v_3)\) (with \(v_1=\texttt {Sunny}; v_2=\texttt {Warm}; v_3=\texttt {Normal}, \lnot v_3= \texttt {High}\)). A tuple of values such as \({<}\!v,v',v''{{>}}\) is to be understood as the conjunction \(v\wedge v' \wedge v''\). So we obtain \((v_1 \wedge v_2 \wedge v_3) \vee (v_1\wedge v_2 \wedge \lnot v_3) \rightarrow v_C\). It corresponds to the left part of Eq. (1) for \(n=3\) and \(|\mathcal {E}|=2\), which yields \((v_1 \wedge v_2) \wedge (v_3\vee \lnot v_3) \rightarrow v_C\), i.e., \((v_1 \wedge v_2) \rightarrow v_C\). So the more positive examples we have, the more general the lower bound of C in (1) (the set of models of a disjunction is larger than the set of models of each of its components). This lower bound, here \(v_1 \wedge v_2\), is a maximally specific hypothesis h.
Negative examples play a complementary role. They force the G boundary to become increasingly specific. Suppose we have the following counterexample for Nice Day: cEx3. <Rainy, Cold, High>.
The hypotheses in the G boundary must be specialized until they correctly classify the new negative example. There are several alternative minimally more specific hypotheses. Indeed, the 3 attributes can be specialized so as not to cover cEx3, by requiring \(\lnot \)Rainy, or \(\lnot \)Cold, or \(\lnot \)High. This exactly corresponds to Eq. (1), which here gives \(v_C \rightarrow \lnot \texttt {Rainy}\vee \lnot \texttt {Cold}\vee \lnot \texttt {High}\), i.e., \(v_C \rightarrow \texttt {Sunny} \vee \texttt {Cloudy}\vee \texttt {Warm}\vee \texttt {Normal}\).
The elements of this disjunction correspond to maximally general potential hypotheses. But in fact we have only two new hypotheses in G: \({<}\)Sunny\(, ?, ?{{>}}\) and \({<}?,\) Warm\(, ?{{>}}\), as we now explain. Indeed, the hypothesis \(h = {<}?, ?,\) Normal\({{>}}\) is not included in G, although it is a minimal specialization of G that correctly labels cEx3 as a negative example. This is because example Ex2, whose attribute value for \(A_3\) is \(\texttt {High}\), disagrees with the implication \(\texttt {Normal} \rightarrow v_C\). So, hypothesis \({<}?,?,\) Normal\({{>}}\) is excluded. Similarly, examples Ex1 and Ex2 (for which the attribute value for \(A_1\) is \(\texttt {Sunny}\)) disagree with implication \(\texttt {Cloudy}\rightarrow v_C\). This kind of elimination applies in Eq. (1) as well. Indeed the expression \(v \wedge L \vDash \lnot v \vee L'\) can be simplified into \(v \wedge L \vDash L'\).
We thus obtain upper and lower bounds from Ex1, Ex2, and cEx3:
$$\{{<}\texttt {Sunny}, \texttt {Warm}, ?{{>}}\} \;\models \; v_C \;\models \; \{{<}\texttt {Sunny}, ?, ?{{>}}, {<}?, \texttt {Warm}, ?{{>}}\},$$
where \(\{{<}v_1, v'_1, v''_1{{>}}, {<}v_2, v'_2, v''_2{{>}}\}\) logically reads \((v_1\wedge v'_1 \wedge v''_1) \vee (v_2\wedge v'_2 \wedge v''_2)\) (? stands for \(\top \)).
The S boundary of the version space thus summarizes the previously encountered positive examples. Any hypothesis more general than S will, by definition, cover any example that S covers and thus will cover any past positive example. In a dual fashion, the G boundary summarizes the information from previously encountered negative examples. Any hypothesis more specific than G is assured to be consistent with past negative examples. The set of all the hypotheses between S and G has a lattice structure. This is in full agreement with Eq. (1). The approach provides an iterative procedure that takes advantage of the examples and counterexamples progressively.
Thus, the general procedure for obtaining the bounds of the version space is as follows.

- If \(\mathbf {e}\) is a positive example,

  1. remove from G any hypothesis inconsistent with \(\mathbf {e}\);

  2. substitute in S any minimal generalization h consistent with \(\mathbf {e}\).

- If \(\mathbf {e}\) is a negative example,

  1. remove from S any hypothesis inconsistent with \(\mathbf {e}\);

  2. substitute in G any minimal specialization h consistent with \(\mathbf {e}\).
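This procedure can be sketched in Python for conjunctive hypotheses whose constraints are a single value or ? (a simplified, Mitchell-style illustration with a single S hypothesis and no disjunctive constraints; it omits the pruning of S by negative examples, which the running example does not need):

```python
DOMAINS = (("Sunny", "Cloudy", "Rainy"), ("Warm", "Cold"), ("Normal", "High"))

def covers(h, x):
    return all(c == "?" or c == v for c, v in zip(h, x))

def at_least_as_general(h, s):
    return all(hc == "?" or hc == sc for hc, sc in zip(h, s))

def candidate_elimination(data):
    S, G = None, [("?",) * len(DOMAINS)]
    for x, positive in data:
        if positive:
            G = [g for g in G if covers(g, x)]      # prune G
            S = x if S is None else tuple(          # minimally generalize S
                c if c == v else "?" for c, v in zip(S, x))
        else:
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)                 # g already rejects x
                    continue
                for i, dom in enumerate(DOMAINS):   # minimal specializations
                    if g[i] != "?":
                        continue
                    for v in dom:
                        if v == x[i]:
                            continue
                        h = g[:i] + (v,) + g[i + 1:]
                        if S is None or at_least_as_general(h, S):
                            new_G.append(h)
            G = new_G
    return S, G

S, G = candidate_elimination([
    (("Sunny", "Warm", "Normal"), True),   # Ex1
    (("Sunny", "Warm", "High"), True),     # Ex2
    (("Rainy", "Cold", "High"), False),    # cEx3
])
print(S)  # ('Sunny', 'Warm', '?')
print(G)  # [('Sunny', '?', '?'), ('?', 'Warm', '?')]
```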
7 Towards a Possibilistic Variant of the Version Space
The main drawback of the version space approach is its sensitivity to noise. Indeed each example and each counterexample influence the result. In [18], the authors use rough set approximations to cope with this problem.
Here we make another suggestion using possibility theory. The idea is to associate each example and each counterexample with a certainty level, as in possibilistic logic (see, e.g., [16]) in order to express to what extent we consider it is certain that the corresponding piece of information is true (rather than false). This certainty level expresses our confidence in the piece of data as being exact. It can reflect the confidence we have in the source that provided it, or be the result of an analysis or filtering of the data that disqualifies outliers. In that respect we should remember that one semantics of possibility theory is in terms of (dis)similarity [29].
In other words, we have a multi-tiered set of examples and a multi-tiered set of counterexamples. So, considering all examples and all counterexamples whose certainty is above or equal to some given certainty level \(\alpha \) yields a regular version space with classical bounds. Thus, for each \(\alpha \), it gives birth to a finite set of hypotheses to which \(\alpha \) can be associated. We have thus a natural basis for rank-ordering hypotheses. The smaller \(\alpha \), the larger the numbers of examples and counterexamples taken into account, and the tighter the bounds.
This can be illustrated on the example of the previous section.
Example 5
Examples and counterexamples now come with certainty weights. Assume we have Ex1: (<Sunny, Warm, Normal\(>, 1)\); cEx3: (<Rainy, Cold, High\(>, \alpha \)); Ex2: (<Sunny, Warm, High\(>, \beta )\), with \(1> \alpha >\beta \).
So, we obtain a layered version of the upper and lower bounds of the version space:

- at level 1, we have \(G_1\,=\,{<}?, ?, ?{{>}}\) and \(S_1\,=\,{<}\)Sunny, Warm, Normal\({{>}}\);
- at level \(\alpha \), we have \(G_\alpha \,=\,\{{<}\)Sunny\(, ?, ?{{>}}, {<}\)Cloudy\(, ?, ?{{>}}, {<}?,\) Warm\(, ?{{>}}\}\) and \(S_\alpha = {<}\)Sunny, Warm, Normal\({{>}}\);
- at level \(\beta \), we have \(G_\beta = \{{<}\)Sunny\(, ?, ?{{>}}\), \({<}?,\) Warm\(, ?{{>}}\}\) and \(S_\beta \,=\,{<}\)Sunny\(, \ \)Warm\(, ?{{>}}\).
\(\Box \)
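The layered specific bounds \(S_1\), \(S_\alpha \), \(S_\beta \) of Example 5 can be reproduced by filtering the weighted examples at each level; a minimal sketch (the numeric weights are illustrative stand-ins for the symbolic levels 1, \(\alpha \), \(\beta \)):

```python
# Alpha-cut idea: keep only the examples whose certainty is at least the
# given level, then compute S as the least general conjunctive
# generalization of the retained positive examples.
ALPHA, BETA = 0.8, 0.5   # hypothetical levels with 1 > ALPHA > BETA

weighted = [
    (("Sunny", "Warm", "Normal"), True, 1.0),   # Ex1
    (("Rainy", "Cold", "High"), False, ALPHA),  # cEx3 (not used for S)
    (("Sunny", "Warm", "High"), True, BETA),    # Ex2
]

def s_bound(data, level):
    S = None
    for x, positive, w in data:
        if positive and w >= level:
            S = x if S is None else tuple(
                c if c == v else "?" for c, v in zip(S, x))
    return S

print(s_bound(weighted, 1.0))   # ('Sunny', 'Warm', 'Normal')  -- level 1
print(s_bound(weighted, BETA))  # ('Sunny', 'Warm', '?')       -- level beta
```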
The above syntactic view is simpler than the semantic one presented in [27] where the paper starts with a pair of possibility distributions over hypotheses, respectively induced by the examples and by the counterexamples.
8 Formal Concept Analysis
Formal concept analysis [19] is another setting where association rules between attributes can be extracted from a formal context \(R\subseteq X \times Y\), which is nothing but a relation linking items in X with properties in Y. It provides a theoretical basis for data mining. Table 1 can be viewed as a context, restricting to rows \(\mathcal {E}\cup \mathcal {E}'\) and considering the class of examples as just another attribute.
Let Rx and \(R^{-1}y\) respectively denote the set of properties possessed by item x and the set of items having property y. Let \(E \subseteq X\) and \(A \subseteq Y\). The set of items having all properties in A is given by \(A^\downarrow = \{x \in X \mid A \subseteq Rx\}\) and the set of properties possessed by all items in E is given by \(E^\uparrow = \{y \in Y \mid E \subseteq R^{-1}y\}\). A formal concept is then defined as a pair (E, A) such that \(A^\downarrow =E\) and \(E^\uparrow = A\), where E and A provide the extent and the intent of the formal concept respectively. It can then be shown that \(E\times A\subseteq R\) and is maximal with respect to set inclusion, i.e., (E, A) defines a maximal rectangle in the formal context.
Let A and B be two subsets of Y. Then R satisfies the attribute implication \(A \Rightarrow B\) if every \(x \in X\) such that \(x \in A^\downarrow \) also satisfies \(x \in B^\downarrow \). Formal concept analysis is not primarily oriented towards concept learning, but towards mining attribute implications (i.e., association rules). However, it might be interesting to consider formal contexts where Y also contains the names of classes, i.e., \(\mathcal {C}\subseteq Y\). Then being able to find attribute implications of the form \(A \Rightarrow C\), where \(A \cap \mathcal {C}= \emptyset \) and \(C \subseteq \mathcal {C}\), would be of particular interest, especially if C is a singleton.
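These definitions are easy to experiment with. Below is a minimal Python sketch on a small invented context (the sets X, Y and the relation R are purely illustrative), computing \(A^\downarrow \) and \(E^\uparrow \), checking a formal concept, and testing an attribute implication:

```python
# Toy formal context: items X, properties Y, relation R given as x -> Rx
X = {1, 2, 3}
Y = {"a", "b", "c"}
R = {1: {"a", "b"}, 2: {"a", "b"}, 3: {"b", "c"}}

def down(A):
    """A^down: the items possessing every property in A."""
    return {x for x in X if A <= R[x]}

def up(E):
    """E^up: the properties shared by every item in E."""
    return set.intersection(*(R[x] for x in E)) if E else set(Y)

def is_concept(E, A):
    """(E, A) is a formal concept iff A^down = E and E^up = A."""
    return down(A) == E and up(E) == A

def implies(A, B):
    """R satisfies A => B iff A^down is included in B^down."""
    return down(A) <= down(B)
```

For instance, in this context ({1, 2}, {"a", "b"}) is a formal concept (a maximal rectangle), and the implication {"a"} ⇒ {"b"} holds since every item possessing "a" also possesses "b".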
A construction dual to the theory of attribute implications has been proposed in [2] to extract disjunctive attribute implications \(A \longrightarrow B\), which R satisfies if every object x possessing at least one property in A also possesses at least one property in B. This approach interprets a zero in matrix R for object x and property a as the statement that x does not possess property a.
Disjunctive attribute implications can be extracted in two ways. One can consider the complementary context \(\overline{R}\), viewed as a standard context with negated attributes, extract attribute implications from it, and obtain the disjunctive attribute implications by contraposition. Alternatively, one can derive them directly, replacing the operator \(A^\downarrow \) by the possibilistic operator \(A^{\downarrow \Pi } = \{x \in X \mid A \cap Rx\ne \emptyset \}\) introduced independently by several authors relating FCA to modal logic [21], rough sets [30], and possibility theory [10]. Then R satisfies the attribute implication \(A \longrightarrow B\) if \(A^{\downarrow \Pi }\subseteq B^{\downarrow \Pi } \).
It is interesting to notice that if Y also contains the names of classes, i.e., \(\mathcal {C}\subseteq Y\), disjunctive attribute implications of the form \(C \longrightarrow B\), where \(B \cap \mathcal {C}= \emptyset \) and \(C \subseteq \mathcal {C}\), correspond to the logic rule \(v_C\rightarrow \vee _{b\in B} b\), which is in agreement with the handling of exceptions in the logical reading of a classification task presented in Sect. 3. Indeed, the rule \(v_C\rightarrow \vee _{b\in B} b\) can be read by contraposition: if an object violates all properties in B, then it is a counterexample. So there is a natural way of relating logical approaches to the classification problem and formal concept analysis, provided that a formal context is viewed as a set of examples and counterexamples (e.g., objects that satisfy a set of properties vs. objects that do not).
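In the same sketchy style (again on an invented context), the possibilistic operator and the test of a disjunctive attribute implication can be written as:

```python
# Hypothetical context: object 3 possesses none of "a", "b", "c"
X = {1, 2, 3}
R = {1: {"a", "b"}, 2: {"a", "c"}, 3: {"d"}}

def down_pi(A):
    """A^{down,Pi}: the objects possessing at least one property in A."""
    return {x for x in X if A & R[x]}

def disj_implication(A, B):
    """R satisfies A --> B iff A^{down,Pi} is included in B^{down,Pi}."""
    return down_pi(A) <= down_pi(B)
```

Here the disjunctive implication {"a"} ⟶ {"b", "c"} holds: every object possessing "a" possesses "b" or "c"; by contraposition, object 3, which violates both "b" and "c", does not possess "a".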
Note finally that the rectangular nature of formal concepts expresses a form of convexity, which fits well with the ideas of Gärdenfors about conceptual spaces [20]. Moreover, also using operators other than \(^\downarrow \) and \(^\uparrow \) (see [14]) helps characterize independent subcontexts and other noticeable structures. Formal concept analysis can also be related to the idea of clustering [15], where clusters are unions of overlapping concepts in independent subcontexts. The idea of approximate concepts, i.e., rectangles with “holes”, suggests a convexity-based completion principle, which might be useful in a classification perspective.
9 Concluding Remarks
This paper is clearly a preliminary step toward a unified, logical study of set-theory-based approaches in data management. It is preliminary in at least two respects: several of these approaches have only been cited in the introduction, while the others have only been briefly discussed. All these theoretical settings start with a Boolean table in the simplest case, and many of them extend to nominal, and possibly to numerical, data. Still, they have been motivated by different concerns, such as describing a concept, predicting a class, or mining rules. Due to their set-theory-based nature, they can be considered from a logical point of view, and a number of issues are common, such as the handling of incomplete information, missing values, inconsistent information, or non-applicable attributes.
In a logical setting, the handling of uncertainty can be conveniently achieved using possibility theory and possibilistic logic [16]. We have suggested above how it can be applied to concept learning and how it may take uncertain pieces of data into account. Possibilistic logic can also handle default rules that can be obtained from Boolean data by looking for suitable probability distributions [3]; such rules provide useful summaries of data. The possible uses of possibilistic logic in data management are a general topic for further investigation.
References
Abu-Mostafa, Y.S., Magdon-Ismail, M., Lin, H.T.: Learning from Data. A Short Course. AMLbook.com (2012)
Ait-Yakoub, Z., Djouadi, Y., Dubois, D., Prade, H.: Asymmetric composition of possibilistic operators in formal concept analysis: application to the extraction of attribute implications from incomplete contexts. Int. J. Intell. Syst. 32(12), 1285–1311 (2017)
Benferhat, S., Dubois, D., Lagrue, S., Prade, H.: A big-stepped probability approach for discovering default rules. Int. J. Uncertainty Fuzziness Knowl.-Based Syst. 11(Suppl. 1), 1–14 (2003)
Boros, E., Crama, Y., Hammer, P.L., Ibaraki, T., Kogan, A., Makino, K.: Logical analysis of data: classification with justification. Ann. OR 188(1), 33–61 (2011)
Bounhas, M., Prade, H., Richard, G.: Analogy-based classifiers for nominal or numerical data. Int. J. Approx. Reason. 91, 36–55 (2017)
Bouraoui, Z., et al.: From shallow to deep interactions between knowledge representation, reasoning and machine learning (Kay R. Amel group). CoRR abs/1912.06612 (2019)
Chikalov, I., et al.: Three Approaches to Data Analysis – Test Theory, Rough Sets and Logical Analysis of Data. Intelligent Systems Reference Library, vol. 41. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-28667-4
Cornuéjols, A., Koriche, F., Nock, R.: Statistical computational learning. In: Marquis, P., Papini, O., Prade, H. (eds.) A Guided Tour of Artificial Intelligence Research, pp. 341–388. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-06164-7_11
Couceiro, M., Hug, N., Prade, H., Richard, G.: Analogy-preserving functions: a way to extend Boolean samples. In: Proceedings of IJCAI 2017, Melbourne, pp. 1575–1581 (2017)
Dubois, D., Dupin de Saint-Cyr, F., Prade, H.: A possibility-theoretic view of formal concept analysis. Fundamenta Informaticae 75(1–4), 195–213 (2007)
Dubois, D., Hüllermeier, E., Prade, H.: Fuzzy methods for case-based recommendation and decision support. J. Intell. Inf. Syst. 27(2), 95–115 (2006). https://doi.org/10.1007/s10844-006-0976-x
Dubois, D., Prade, H.: Conditional objects as nonmonotonic consequence relationships. IEEE Trans. Syst. Man Cybern. 24(12), 1724–1740 (1994)
Dubois, D., Prade, H.: Fuzzy relation equations and causal reasoning. Fuzzy Sets Syst. 75(2), 119–134 (1995)
Dubois, D., Prade, H.: Possibility theory and formal concept analysis: characterizing independent subcontexts. Fuzzy Sets Syst. 196, 4–16 (2012)
Dubois, D., Prade, H.: Bridging gaps between several forms of granular computing. Granul. Comput. 1(2), 115–126 (2015). https://doi.org/10.1007/s41066-015-0008-8
Dubois, D., Prade, H.: Possibilistic logic: from certainty-qualified statements to two-tiered logics – a prospective survey. In: Calimeri, F., Leone, N., Manna, M. (eds.) JELIA 2019. LNCS (LNAI), vol. 11468, pp. 3–20. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-19570-0_1
Dubois, D., Prade, H., Sédes, F.: Fuzzy logic techniques in multimedia database querying: a preliminary investigation of the potentials. IEEE Trans. Knowl. Data Eng. 13(3), 383–392 (2001)
Dubois, V., Quafafou, M.: Concept learning with approximation: rough version spaces. In: Alpigini, J.J., Peters, J.F., Skowron, A., Zhong, N. (eds.) RSCTC 2002. LNCS (LNAI), vol. 2475, pp. 239–246. Springer, Heidelberg (2002). https://doi.org/10.1007/3-540-45813-1_31
Ganter, B., Wille, R.: Formal Concept Analysis. Mathematical Foundations. Springer, Heidelberg (1998). https://doi.org/10.1007/978-3-642-59830-2
Gärdenfors, P.: Conceptual Spaces. The Geometry of Thought. MIT Press, Cambridge (2000)
Gediga, G., Düntsch, I.: Modalstyle operators in qualitative data analysis. In: Proceedings of the IEEE International Conference on Data Mining, pp. 155–162 (2002)
Hájek, P., Havránek, T.: Mechanizing Hypothesis Formation – Mathematical Foundations for a General Theory. Springer, Heidelberg (1978). https://doi.org/10.1007/978-3-642-66943-9
Hüllermeier, E., Dubois, D., Prade, H.: Model adaptation in possibilistic instance-based reasoning. IEEE Trans. Fuzzy Syst. 10(3), 333–339 (2002)
Mitchell, T.M.: Version spaces: a candidate elimination approach to rule learning. In: IJCAI, pp. 305–310 (1977)
Mitchell, T.M.: Version spaces: an approach to concept learning. Ph.D. thesis, Stanford University (1979)
Prade, H., Richard, G.: Analogical proportions and analogical reasoning – an introduction. In: Aha, D.W., Lieber, J. (eds.) ICCBR 2017. LNCS (LNAI), vol. 10339, pp. 16–32. Springer, Cham (2017). https://doi.org/10.1007/978-3-319-61030-6_2
Prade, H., Serrurier, M.: Bipolar version space learning. Int. J. Intell. Syst. 23(10), 1135–1152 (2008)
Pawlak, Z.: Rough Sets. Theoretical Aspects of Reasoning about Data. Springer, Dordrecht (1991). https://doi.org/10.1007/978-94-011-3534-4
Sudkamp, T.: Similarity and the measurement of possibility. In: Actes Rencontres Francophones sur la Logique Floue et ses Applications (Montpellier, France), pp. 13–26. Cépadués Editions, Toulouse (2002)
Yao, Y.Y.: Concept lattices in rough set theory. In: Proceedings of Annual Meeting of the North American Fuzzy Information Processing Society, NAFIPS 2004, pp. 796–801 (2004)
Acknowledgements
The authors acknowledge a partial support of ANR-11-LABX-0040-CIMI (Centre International de Mathématiques et d’Informatique) within the program ANR-11-IDEX-0002-02, project ISIPA (“Intégrales de Sugeno, Interpolation, Proportions Analogiques”).
© 2020 Springer Nature Switzerland AG
Dubois, D., Prade, H. (2020). Towards a Logic-Based View of Some Approaches to Classification Tasks. In: Lesot, M.-J., et al. (eds.) Information Processing and Management of Uncertainty in Knowledge-Based Systems. IPMU 2020. Communications in Computer and Information Science, vol. 1239. Springer, Cham. https://doi.org/10.1007/978-3-030-50153-2_51