
1 Introduction

It is an understatement to say that the current dominant paradigms in machine learning rely on neural nets and statistics; see, e.g., [1, 8]. Yet, there have been quite a number of set theory- or logic-based views that have considered data sets from different perspectives: we can thus (at least) mention concept learning [24, 25], formal concept analysis [19], rough sets [28], logical analysis of data [4], test theory [7], and the GUHA method [22]. Still other works, mentioned later, may also be relevant. These various paradigms can be related to logic, but have been developed independently. Strangely enough, little has been done to move towards a unified view of them.

This research note aims to be a first step in this direction. However, the result will remain modest, since we shall only outline connections between some settings, while others will be left aside for the moment. Moreover we shall mainly focus on Boolean data, even if some of what is said could be extended to nominal, or even numerical data. Still, we believe that it is of scientific interest to better understand the relationships between these different theoretical settings, developed with various motivations and distinct paradigms, while all start from the same object: a set of data. In the long run, such a better understanding may contribute to some cooperation between these set theory-based views and currently popular ones, such as neural nets or statistical approaches, perhaps providing tools for explanation capabilities; see, e.g., [6] for references and a tentative survey.

The paper is organized as follows. Section 2 states and discusses the problem of assigning an item to a class, given examples and counter-examples. Section 3 presents a simple propositional logic reading of the problem. Section 4 puts the discussion in a more appropriate setting using the notion of conditional object [12], which captures the idea of a rule better than material implication does. Moreover, a rule-based reading of analogical proportion-based classification [26] is also discussed in Sect. 5. Section 6 briefly recalls the version space characterization of the set of possible descriptions of a class, illustrated on an example, and emphasizes its bipolar nature. Section 7 advocates the interest of possibilistic logic [16] for handling uncertainty and coping with noisy data. Indeed, sensitivity to noise is a known drawback of pure set-theoretic approaches to data handling. Section 8 briefly surveys formal concept analysis and suggests its connection and potential relevance to classification. Section 9 mentions some other related matters and issues, pointing out lines for further research.

2 Classification Problem - A General View

Let us consider m pieces of data that describe items in terms of n attributes \(A_j\). Namely, an item i is represented by a vector \(\varvec{a^i} = (a^{i}_1, a^{i}_2, \cdots , a^{i}_n)\), with \(i=1,\dots ,m\), together with its class \(cl(\varvec{a^i})\), where \(a^{i}_j\) denotes the value of the j-th attribute \(A_j\) for item \(\varvec{a^i}\), namely \(A_j(\varvec{a^i}) = a_j^i \in dom(A_j)\) (\(dom(A_j)\) denotes the domain of attribute \(A_j\)). Each domain \(dom(A_j)\) can be described using a set of propositional variables \(\mathcal {V}_j\) specific to \(A_j\), by means of logical formulas. If \(|dom(A_j)|= 2\), a single propositional variable \(v_j\) is enough and \(dom(A_j) = \{v_j, \lnot v_j\}\).

Let \(\mathcal {C} = \{cl(\varvec{a^i}) | i=1,..., m\}\) be a set of classes, where each object is supposed to belong to one and only one class. The classification problem amounts to predicting the class \(cl(\varvec{a^*})\in \mathcal {C} \) of a new item \(\varvec{a^*}\) described in terms of the same attributes, on the basis of the m examples \((\varvec{a^i}, cl(\varvec{a^i}))\) consisting of classified objects.

There are other problems that are akin to classification, with different terminologies. Let us at least mention case-based decision and diagnosis. In the first situation, we face a multiple criteria decision problem where one wants to predict the value of a new item on the basis of a collection of valued items (assuming that possible values belong to a finite scale), while in the second situation attribute values play the role of symptoms (present or not) and classes are replaced by diseases [13]. In both situations, the m examples constitute a repertory of reference cases already experienced. This is also true in case-based reasoning, where a solution is to be found for a newly encountered problem on the basis of a collection of previously solved ones, for which the solution is known; however, case-based reasoning usually includes an adaptation step of the selected past solution, so that it better fits the new problem. Thus, ideas and methods developed in these different fields may also be of interest in a classification perspective.

Two further comments are in order here. First, for each class C, one may partition the whole set of m data into two parts: the set \(\mathcal {E}\) of examples associated with this class, and the set \(\mathcal {E'}\) of examples of the other classes, which can be viewed as counter-examples for this class. The situation is pictured in Table 1 below. It highlights the fact that the whole set of items in class C is bracketed between \(\mathcal {E}\) and \(\overline{\mathcal {E'}}\) (where the overbar means complementation). If the table is contradiction-free, there is no item that is both in \(\mathcal {E}\) and in \(\mathcal {E'}\).
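
As a small illustration of this partition (a Python sketch of our own; the tiny Boolean data set is arbitrary and not the one of Table 1):

```python
# A toy Boolean data set: each item is a tuple of attribute values plus its class.
data = [((1, 0), 'C'), ((0, 1), 'C'), ((0, 0), 'D')]

def split(data, target_class):
    """Partition the data into the examples E of target_class and the counter-examples E'."""
    E  = {a for a, c in data if c == target_class}
    E_ = {a for a, c in data if c != target_class}
    return E, E_

def contradiction_free(E, E_):
    """True iff no item appears both as an example and as a counter-example."""
    return E.isdisjoint(E_)

E, E_ = split(data, 'C')
print(contradiction_free(E, E_))   # True for this data set
```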

Second, the classification problem can be envisaged in two different manners:

  1. as an induction problem, where one wants to build a plausible description of each class; it can be done in terms of if-then rules associating sets of attribute values with a class, these rules being used for prediction purposes;

  2. as a transduction problem, where the prediction is made without the help of such descriptions, but by means of direct comparisons of the new item with the set of the m examples.

Table 1. Contradiction-free data table

3 A Simple Logical Reading

An elementary idea for characterizing a class C is to look for an attribute such that the subset of values taken for this attribute by the available examples of class C is disjoint from the subset of values taken by the examples of the other classes. If there exists at least one such attribute \(A_{j^*}\), then one may inductively assume that belonging or not to class C, for any new item, can be predicted on the basis of its value for \(A_{j^*}\). More generally, if a particular combination of attribute values can be encountered only for items of a class C, then a new item with this particular combination should also be plausibly put in class C. Let us now turn to a more systematic logical analysis of the data.

Let us consider a particular class \(C \in \mathcal {C}\). Then the m items \(\varvec{a^i}\) can be partitioned into two subsets, the items \(\varvec{a^i}\) such that \(cl(\varvec{a^i})= C\), and those such that \(cl(\varvec{a^i})\ne C\) (we assume that \(|\mathcal { C}| \ge 2\)). Thus we have a set \(\mathcal {E}\) of examples for C, namely \(\varvec{e^i} = (a^{i}_1, a^{i}_2, \cdots , a^{i}_n, 1)=(\varvec{a^i}, 1)\), where ‘1’ means that \(cl(\varvec{a^i})= C\), and a set \(\mathcal {E'}\) of counter-examples \(\varvec{e'^j} = (a'^{j}_1, a'^{j}_2, \cdots , a'^{j}_n, 0)\) where ‘0’ means that \(cl(\varvec{a'^{j}})\ne C\).

Let us assume that the domains \(dom(A_j)\) for \(j=1,\dots ,n\) are finite and denote by \(v_C\) the propositional variable associated with class C (\(v_C\) has truth-value 1 for elements of C and 0 otherwise). Using the attribute values as propositional logic symbols, an example \(\varvec{e^i}\) expresses the truth of the logical statement

$$a^{i}_1 \wedge a^{i}_2 \wedge \cdots \wedge a^{i}_n \rightarrow v_C$$

meaning that if it is an example, then it belongs to the class, while counter-examples \(\varvec{e'^j}\) are encoded by stating that the formula \( a'^{j}_1 \wedge a'^{j}_2 \wedge \cdots \wedge a'^{j}_n \rightarrow \lnot v_C\) is true, or equivalently

$$\models v_C \rightarrow \lnot a'^{j}_1 \vee \lnot a'^{j}_2 \vee \cdots \vee \lnot a'^{j}_n.$$

Then any class (or concept) C that agrees with the m pieces of data is such that

$$\begin{aligned} \bigvee _{i : \varvec{e^{i}} \in \mathcal {E}} (a^{i}_1 \wedge a^{i}_2 \wedge \cdots \wedge a^{i}_n) \models v_C\models \bigwedge _{j : \varvec{e'^{j}} \in \mathcal {E'}} (\lnot a'^{j}_1 \vee \lnot a'^{j}_2 \vee \cdots \vee \lnot a'^{j}_n). \end{aligned}$$
(1)

Letting \(\mathcal {E}\) be the set of models of \(\bigvee _{i}\varvec{a^i}\) (the examples) and \(\mathcal {E'}\) be the set of models of \(\bigvee _{j}\varvec{a'^j}\) (the counter-examples), (1) simply reads \(\mathcal {E} \subseteq C \subseteq \overline{\mathcal {E'}}\) where the overbar denotes complementation. Note that the larger the number of counter-examples, the more specific the upper bound of C; the larger the number of examples, the more general the lower bound of C.

This logical expression states that if an item is identical to an example on all attributes then it is in the class, and that if an item is in the class then it should be different from all counter-examples on at least one attribute.
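
The bracketing can thus be turned into a deliberately cautious decision procedure: an item identical to some example is put in C, an item identical to some counter-example is kept out of C, and any other item is left undetermined. A minimal Python sketch (our own illustration, with hypothetical names) is:

```python
def bracket_decision(item, E, E_):
    """Decide membership in class C according to the bracketing E ⊆ C ⊆ complement(E').

    Returns True (forced into C), False (forced out of C) or None (undetermined)."""
    if item in E and item in E_:
        raise ValueError("noisy data: item is both an example and a counter-example")
    if item in E:
        return True    # lower bound of (1): identical to an example on all attributes
    if item in E_:
        return False   # upper bound of (1): identical to a counter-example on all attributes
    return None        # the data alone do not settle the case (cf. Example 1 below)

# Example 1: e1 = (1,0) and e2 = (0,1) are examples of C, e'1 = (0,0) is a counter-example.
E, E_ = {(1, 0), (0, 1)}, {(0, 0)}
print(bracket_decision((1, 1), E, E_))   # None: (1, 1) may go either way
```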

Let us assume Boolean attributes for simplicity, and let us suppose that \(a_1^i = v_{1}\) is true for all the examples of class C and false for all the examples of other classes. Then it can be seen that (1) can be put under the form \(v_1 \wedge L \models v_C \models v_1 \vee L'\) where L and \(L'\) are logical expressions that do not involve any propositional variable pertaining to attribute \(A_{1}\). This provides a reasonable support for inducing that an item belongs to C as soon as \(v_1\) is true for it. Such a remark can be generalized to a combination of attribute values and to nominal attributes.

Let us consider a toy example, small yet sufficient for an illustration of (1) and starting the discussion.

Example 1

It is an example with two Boolean attributes, two classes (C and \( \overline{C}\)), two examples and a counter-example. Namely, we have \(\varvec{e^1} = (a^{1}_1, a^{1}_2, 1)= (1, 0, 1) = (v_1, \lnot v_2, v_C)\); \(\varvec{e^2} = (a^{2}_1, a^{2}_2, 1)= (0, 1, 1)= (\lnot v_1, v_2, v_C)\); \(\varvec{e'^1} = (a'^{1}_1, a'^{1}_2, 0)= (0, 0, 0)= (\lnot v_1,\lnot v_2, \lnot v_C)\).

We can easily see that \((v_1\wedge \lnot v_2) \vee ( \lnot v_1\wedge v_2) \,\models \, v_C\,\models \,v_1 \vee v_2\), i.e., we have \(v_1 \dot{\vee }v_2\,\models \,v_C \,\models \,v_1 \vee v_2\), where \(\dot{\vee }\) stands for exclusive or. Indeed depending on whether (1, 1) is an example or a counter-example, the class C will be described by \(v_1 \vee v_2\), or by \(v_1\,\dot{\vee }\,v_2\) respectively.

Note that in the absence of any further information or principle, the two options for assigning (1, 1) to a class on the basis of \(\varvec{e^1}, \varvec{e^2}\) and \(\varvec{e'^1}\), are equally possible here. \(\Box \)

Observe that if the bracketing of C in (1) is consistent, the conjunction of the lower bound expression and the upper bound expression yields the lower bound. But in the case of an item that appears both as an example and as a counter-example for C (noisy data), this conjunction is not necessarily a contradiction, contrary to what one might expect, as shown by the example below.

Example 2

Assume we have \(\varvec{e^1} = (1, 0, 1)\); \(\varvec{e^2} = (1, 1, 1)\); \(\varvec{e'^1} = (1, 1, 0)\). The sets \(\mathcal {E}\) and \(\mathcal {E'}\) overlap since \(\varvec{e^2}\) and \(\varvec{e'^1}\) are the same item, classified differently. As a consequence we do not have \(\mathcal {E} \subseteq \overline{\mathcal {E'}}\). So (1) does not hold: we do not have \(v_1= (v_1\wedge \lnot v_2) \vee (v_1\wedge v_2) \,\models \,v_C \,\models \,\lnot v_1 \vee \lnot v_2\), i.e., \(v_1\,\models \,v_C \,\models \,\lnot v_1 \vee \lnot v_2\) is wrong, even though \(v_1 \wedge (\lnot v_1 \vee \lnot v_2) = v_1 \wedge \lnot v_2 \ne \bot \). \(\Box \)

A more appropriate treatment of inconsistency will be proposed in the next section.

The two expressions bracketing C in (1) have a graded counterpart, proposed in [17], for assessing how satisfactory an item is, given a set of examples and a set of counter-examples supposed to describe what we are looking for. Then an item is ranked all the better as it is similar to at least one example on all important attributes and dissimilar to all counter-examples on at least one important attribute (where similarity, dissimilarity, and importance are matters of degree). However, this ranking problem is somewhat different from the classification problem, where each item should be assigned to a class. Here, if an item is close both to an example and to a counter-example, it receives a poor evaluation, just as it would if it were close to a counter-example only.

Note that if one considers examples only, the graded counterpart amounts to searching for items that are similar to examples. In terms of classification, it means looking for the pieces of data that are sufficiently similar (on all attributes) to the item whose class one wants to predict, and assigning this item to the class shared by the majority of these similar pieces of data. This is the k-nearest neighbor method. This is also very close to fuzzy case-based reasoning and instance-based learning [11, 23].
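
A concrete sketch of this method is given below (illustrative Python; the data set, the Hamming-based similarity and the value of k are arbitrary choices of ours):

```python
from collections import Counter

def hamming_sim(x, y):
    """Number of attributes on which two Boolean items agree."""
    return sum(a == b for a, b in zip(x, y))

def knn_predict(item, data, k=3):
    """Assign `item` to the class shared by the majority of its k most similar items."""
    neighbours = sorted(data, key=lambda ex: hamming_sim(item, ex[0]), reverse=True)[:k]
    votes = Counter(cl for _, cl in neighbours)
    return votes.most_common(1)[0][0]

data = [((1, 0, 1), 'C'), ((1, 1, 1), 'C'), ((0, 0, 0), 'D'), ((0, 1, 0), 'D')]
print(knn_predict((1, 0, 0), data, k=3))   # 'C' with this data: majority among the 3 closest items
```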

4 Conditional Objects and Rules

A conditional object b|a, where a, b are propositions, is a three-valued entity, which is true if \(a\wedge b\) is true; false if \(a\wedge \lnot b\) is true; inapplicable if a is false; see, e.g., [12]. It can be thought of as the rule ‘if a then b’. Indeed, the rule can be fired only if a is true; the examples of this rule are such that \(a\wedge b\) is true, while its counter-examples are such that \(a \wedge \lnot b\) is true. This view of conditionals dates back to Bruno de Finetti's works in the 1930s.

An (associative) quasi-conjunction & can be defined for conditional objects:

$$ b|a\ \& \ d|c = (a \rightarrow b)\wedge (c \rightarrow d)|(a\vee c)$$

where \(\rightarrow \) denotes the material implication. It fits with the intuition that a set of rules can be fired as soon as at least one rule can be fired, and when a rule is fired, the rule behaves like material implication. Moreover, entailment between conditional objects is defined by \(b|a \vDash d|c\) iff \(a\wedge b \vDash c\wedge d\) and \(c\wedge \lnot d \vDash a\wedge \lnot b\), which expresses that the examples of rule ‘if a then b’ are examples of rule ‘if c then d’, and the counter-examples of rule ‘if c then d’ are counter-examples of rule ‘if a then b’. It can be checked that \(b|a= (a\wedge b)|a = (a\rightarrow b)|a\) since these three conditional objects have the same examples and the same counter-examples. It can be also shown that \(a\wedge b|\top \vDash b|a \vDash a\rightarrow b|\top \) (where \(\top \) denotes tautology), thus highlighting the fact that b|a is bracketed by the conjunction \(a\wedge b\) and the material implication \(a\rightarrow b\).

Let us revisit expression (1) in this setting. For an example \(\varvec{e} =(\varvec{a}, 1)\), and a counter-example \(\varvec{e'}=(\varvec{a'}, 0)\) with respect to a class C, it leads to considering the conditional objects \(v_C|\varvec{a} \) and \(\lnot v_C|\varvec{a'}\) respectively (if it is an example we are in the class, otherwise not).

For a collection of examples we have

$$ \begin{aligned} (v_C|\varvec{a^1}) \ \& \ \cdots \ \& \ (v_C|\varvec{a^r})&= ((\varvec{a^1} \vee \cdots \vee \varvec{a^r})\rightarrow v_C)|(\varvec{a^1} \vee \cdots \vee \varvec{a^r}) \\&= v_C|(\varvec{a^1} \vee \cdots \vee \varvec{a^r}) \end{aligned}$$

Similarly, we have

$$ \begin{aligned} (\lnot v_C|\varvec{a'^1} )\ \& \ \cdots \ \& \ (\lnot v_C|\varvec{a'^s} )&= ((\varvec{a'^1} \vee \cdots \vee \varvec{a'^s})\rightarrow \lnot v_C)|(\varvec{a'^1} \vee \cdots \vee \varvec{a'^s}) \\&=\lnot v_C|(\varvec{a'^1} \vee \cdots \vee \varvec{a'^s}) \end{aligned}$$

Letting \(\phi _E= \bigvee _{i=1}^r\varvec{a^i}\) and \(\phi _{E'}= \bigvee _{j=1}^s\varvec{a'^j}\), we can join the two conditional expressions:

$$ (v_C|\phi _E) \ \& \ (\lnot v_C|\phi _{E'}) = (\phi _E\rightarrow v_C) \wedge (\phi _{E'}\rightarrow \lnot v_C) |(\phi _E\vee \phi _{E'}) $$

and we have the bracketing

$$ (\phi _E\wedge v_C) \vee (\phi _{E'}\wedge \lnot v_C) |\top \vDash (v_C|\phi _E) \ \& \ ( \lnot v_C|\phi _{E'}) \vDash (\phi _E\rightarrow v_C) \wedge (\phi _{E'}\rightarrow \lnot v_C)|\top $$

A set of conditional objects K is said to be consistent if and only if for no subset \(S \subseteq K\) does the quasi-conjunction Q(S) of the conditional objects in S entail a conditional contradiction of the form \(\bot |\phi \) [12]. Contrary to material implication, the use of three-valued conditionals reveals the presence of contradictions in the data.

Example 3

(Example 2 continued) The data are \(\varvec{e^1} = (1, 0, 1)\); \(\varvec{e^2} = (1, 1, 1)\); \(\varvec{e'^1} = (1, 1, 0)\). In terms of conditional objects, considering the subset \(\{\varvec{e^2},\varvec{e'^1} \}\), we have

$$ \begin{aligned} v_C|(v_1\wedge v_2)\ \& \ \lnot v_C|(v_1\wedge v_2)&= (v_1\wedge v_2)\rightarrow (v_C\wedge \lnot v_C)|(v_1\wedge v_2) \\&= (v_C\wedge \lnot v_C)|(v_1\wedge v_2)=\bot |v_1\wedge v_2, \end{aligned}$$

which is a conditional contradiction. \(\Box \)
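
To make the detection of such conditional contradictions concrete, here is a small three-valued sketch (our own Python illustration): a conditional object b|a is represented extensionally by the sets of models of a and of b over the worlds built on \(v_1, v_2, v_C\).

```python
from itertools import product

# Worlds are truth assignments to (v1, v2, vC); a proposition is the set of its models.
WORLDS = set(product([0, 1], repeat=3))

def models(pred):
    return {w for w in WORLDS if pred(*w)}

class Conditional:
    """Three-valued conditional object b|a, given by the models of a and of b."""
    def __init__(self, b, a):
        self.a, self.b = a, b
    def quasi_and(self, other):
        """Quasi-conjunction: b|a & d|c = ((a -> b) and (c -> d)) | (a or c)."""
        cond = self.a | other.a
        body = {w for w in WORLDS
                if (w not in self.a or w in self.b) and (w not in other.a or w in other.b)}
        return Conditional(body, cond)
    def is_contradiction(self):
        """True iff the object has the form bot|phi with phi consistent (no example, only counter-examples)."""
        return bool(self.a) and not (self.a & self.b)

# Example 3: e2 = (1,1,1) and e'1 = (1,1,0) give vC|(v1 and v2) and (not vC)|(v1 and v2).
c1 = Conditional(models(lambda v1, v2, vC: vC), models(lambda v1, v2, vC: v1 and v2))
c2 = Conditional(models(lambda v1, v2, vC: not vC), models(lambda v1, v2, vC: v1 and v2))
print(c1.quasi_and(c2).is_contradiction())   # True: the noisy data are detected as contradictory
```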

5 Analogical Proportion-Based Transduction

Apart from the k-nearest neighbor method, there is another transduction approach to the classification problem which applies to Boolean, nominal and numerical attribute values [5]. For simplicity here, we only consider Boolean attributes. It relies on the notion of analogical proportion [26]. Analogical proportions are statements of the form “a is to b as c is to d”, often denoted by \(a:b\,{::}\,c:d\), which express that “a differs from b as c differs from d and b differs from a as d differs from c”. This statement can be encoded into a Boolean logical expression which is true only for the 6 following assignments of (a, b, c, d): (0, 0, 0, 0), (1, 1, 1, 1), (1, 0, 1, 0), (0, 1, 0, 1), (1, 1, 0, 0), and (0, 0, 1, 1). Boolean analogical proportions straightforwardly extend to vectors of attribute values such as \(\varvec{a}=(a_1, ..., a_n)\), by stating \(\varvec{a}:\varvec{b}\,{::}\,\varvec{c}:\varvec{d} \text{ iff } \forall i \!\in \! [1,n], ~ a_i:b_i\,{::}\,c_i:d_i\). The basic analogical inference pattern is then

$$\frac{\forall i \in \{1,..., p\}, ~~a_i : b_i \,{::}\, c_i : d_i \text{ holds }}{\forall j \in \{p+1,..., n\}, ~~a_j : b_j \,{::}\, c_j : d_j \text{ holds }}$$

Thus analogical reasoning amounts to finding completely informed triples \((\varvec{a}, \varvec{b}, \varvec{c})\) appropriate for inferring the missing value(s) in \(\varvec{d}\). When there exist several suitable triples, possibly leading to distinct conclusions, one may use a majority vote for concluding. This inference method is an extrapolation, which applies to classification (then the class \(cl(\varvec{d})\) is the unique solution, when it exists, such that \(cl(\varvec{a}):cl(\varvec{b})\,{::}\,cl(\varvec{c}):cl(\varvec{d})\) holds).
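
The whole transduction procedure can be sketched as follows (illustrative Python; the function names and the majority-vote handling of competing triples are our own choices). It uses the fact that the Boolean proportion \(a:b\,{::}\,c:d\) holds exactly when \(a-b=c-d\):

```python
from collections import Counter
from itertools import permutations

def ap(a, b, c, d):
    """Boolean analogical proportion a:b::c:d (true exactly for the 6 valid patterns)."""
    return a - b == c - d

def solve(a, b, c):
    """Solution x of a:b::c:x, or None if the equation has no Boolean solution."""
    x = c - a + b
    return x if x in (0, 1) else None

def analogical_predict(data, new):
    """Predict the class of `new` by a majority vote over all predicting triples."""
    votes = Counter()
    for (a, ca), (b, cb), (c, cc) in permutations(data, 3):
        if all(ap(ai, bi, ci, di) for ai, bi, ci, di in zip(a, b, c, new)):
            cl = solve(ca, cb, cc)
            if cl is not None:
                votes[cl] += 1
    return votes.most_common(1)[0][0] if votes else None   # None: analogy stays neutral

# Example 1: no triple predicts the class of (1, 1); analogy remains neutral (cf. Example 4).
data = [((1, 0), 1), ((0, 1), 1), ((0, 0), 0)]
print(analogical_predict(data, (1, 1)))   # None
```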

Let us examine more carefully how it works. The inference in fact takes items pair by pair, and then puts two pairs in parallel. Let us first consider the case where three items belong to the same class; the fourth item is the one whose class one wants to predict (this common class being denoted by 1 in the following). Consider a pair of items \(\mathbf {a^i}\) and \(\mathbf {a^j}\). There are attributes for which the two items are equal and attributes for which they differ. For simplicity, we assume that they differ only on the first attribute (the method easily extends to more attributes). So we have \(\mathbf {a^i}=(a^i_1, a^i_2, \cdots ,a^i_n, 1)\) and \(\mathbf {a^j}=(a^j_1,a^j_2, \cdots , a^j_n, 1)\) with \(a^j_1 = \lnot a^i_1\) and \(a^j_t = a^i_t=v_t\) for \(t=2,\dots , n\). This means that the change from \(a^i_1\) to \(a^j_1\) in context \((v_2,\cdots ,v_n)\) does not change the class. Assume now that we have another pair \(\mathbf {a^k}=(v_1, a^k_2, \cdots ,a^k_n, 1)\) and \(\mathbf {a^{\star }}=(\lnot v_1,a^\star _2,\cdots ,a^\star _n, ?)\) involving the item whose class has to be predicted, exhibiting the same change on attribute \(A_1\) and being equal elsewhere, i.e., we have \( a^k_t=a^\star _t =v^\sharp _t \) for \(t=2,\dots , n\). Putting the two pairs in parallel, we obtain the following pattern

\((v_1, v_2, \cdots ,v_n, 1)\)

\((\lnot v_1, v_2, \cdots ,v_n, 1)\)

\((v_1, v^\sharp _2, \cdots ,v^\sharp _n, 1)\)

\((\lnot v_1,v^\sharp _2,\cdots ,v^\sharp _n, ?)\).

It is not difficult to check that \(\mathbf {a^i}\), \(\mathbf {a^j}\), \(\mathbf {a^k}\) and \(\mathbf {a^\star }\) are in analogical proportion for each attribute. So \(\mathbf {a^i}:\mathbf {a^j}\,{::}\,\mathbf {a^k}:\mathbf {a^\star }\) holds. The solution of \(1: 1\,{::}\,1 : ?\) is obviously \(?=1\), so the prediction is \(cl(\mathbf {a^{\star }})=1\). This conclusion is thus based on the idea that since the change from \(a^i_1\) to \(a^j_1\) in context \((v_2,\cdots ,v_n)\) does not change the class, it is the same in the other context \((v^\sharp _2,\cdots ,v^\sharp _n)\).

The case where \(\mathbf {a^i}\) and \(\mathbf {a^k}\) belong to class C while \(\mathbf {a^j}\) is in \(\lnot C\) leads to another analogical pattern, where the change from \(a^i_1\) to \(a^j_1\) now changes the class in context \((v_2,\cdots ,v_n)\). The pattern is

\((v_1, v_2, \cdots ,v_n, 1)\)

\((\lnot v_1, v_2, \cdots ,v_n, 0)\)

\((v_1, v^\sharp _2, \cdots ,v^\sharp _n, 1)\)

\((\lnot v_1,v^\sharp _2,\cdots ,v^\sharp _n, ?)\)

The conclusion is now \(?=0\), i.e., \(\mathbf {a^{\star }}\) is not in C. This approach thus implements the idea that the switch from \(a^i_1\) to \(a^j_1\) that changes the class in context \((v_2,\cdots ,v_n)\), also leads to the same change in context \((v^\sharp _2,\cdots ,v^\sharp _n)\).

It has been theoretically established that analogical classifiers always yield exact predictions for Boolean affine functions describing the class (which include exclusive-or functions), and only for them [9]. Still, a majority vote among the predicting triples often yields the right prediction in other situations [5].
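
This behaviour can be checked by brute force on a small affine instance (a hedged sketch of ours; the exclusive-or class function over two Boolean attributes is just one convenient affine example):

```python
from itertools import product

def ap(a, b, c, d):          # Boolean analogical proportion a:b::c:d
    return a - b == c - d

def solve(a, b, c):          # solution of a:b::c:x, or None if there is none
    x = c - a + b
    return x if x in (0, 1) else None

cl = lambda x: x[0] ^ x[1]   # an affine (exclusive-or) class function on 2 Boolean attributes

# Whenever a:b::c:d holds componentwise and the class equation is solvable,
# the analogical prediction should coincide with the true class of d.
ok = True
for a, b, c, d in product(product([0, 1], repeat=2), repeat=4):
    if all(ap(ai, bi, ci, di) for ai, bi, ci, di in zip(a, b, c, d)):
        pred = solve(cl(a), cl(b), cl(c))
        if pred is not None and pred != cl(d):
            ok = False
print(ok)   # True: no wrong prediction for this affine class function
```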

Let us see how it works on Example 1 and variants.

Example 4

In Example 1 we have: \(\varvec{e^1} = (1, 0, 1)\); \(\varvec{e^2} = (0, 1, 1)\); \(\varvec{e'^1} = (0, 0, 0)\). We can check that there is no analogical prediction in this case for (1, 1, ?). Indeed, whichever way we order the three vectors, either we get the 4-tuple (1, 0, 0, 1) on one component, which is not a pattern in agreement with an analogical proportion, or the equation \(0 : 1\,{::}\, 1 : ?\), which has no solution. So analogy remains neutral in this case.

However, consider the situation where we would have \(\varvec{e^1} = (1, 0, 1)\); \(\varvec{e^2} = (1, 1, 1)\); \(\varvec{e'^1} = (0, 1, 0)\). Taking the triple \((\varvec{e^2}, \varvec{e^1}, \varvec{e'^1})\), we can check that \((1, 1) : (1, 0) \,{::}\, (0, 1) : (0, 0)\) holds on each of the two vector components. The solution of the equation \(1 : 1 \,{::}\, 0 : ?\) is \(?=0\), which is the analogical prediction for (0, 0, ?).

Similarly, in the case \(\varvec{e^1} = (1, 0, 1)\), \(\varvec{e^2} = (1, 1, 1)\) and \(\varvec{e^3} = (0, 1, 1)\), we would obtain \(?=1\) for (0, 0, ?) as expected, using triple \((\varvec{e^2}, \varvec{e^1}, \varvec{e^3})\). \(\Box \)

It is clear that the role of analogical reasoning here is to complete the data set with new examples or counter-examples obtained by transduction, assuming analogical inference patterns are valid in the case under study. It may be a first step prior to the induction of a classification model.

6 Concept Learning, Version Space and Logic

The version space setting, as proposed by Mitchell [24, 25], offers an elegant elimination procedure, exploiting examples and counter-examples of a class, then called “concept”, for restricting the hypothesis space and providing an approach to rule learning.

Let us recall the approach using a simple example, with 3 attributes:

  • \(A_1= \ \)Sky (with possible values Sunny, Cloudy, and Rainy),

  • \(A_2=\ \)Air Temp (with values Warm and Cold),

  • \(A_3=\ \)Humidity (with values Normal and High).

The problem is to learn the concept of \(C= \ \)Nice Day on the basis of examples and counter-examples. This means finding all hypotheses h, such that the implication \(h \rightarrow v_C\) is compatible with the examples and the counter-examples.

Each hypothesis is described by a conjunction of constraints on the attributes, here Sky, Air Temp, and Humidity. Constraints may be ? (any value is acceptable), \(\emptyset \) (no value is acceptable), a specific value, or a disjunction thereof. The target concept C, here Nice Day, is supposed to be represented by a disjunction of hypotheses (there may exist different h and \(h'\) such that \(h \rightarrow v_C\) and \(h' \rightarrow v_C\)).

Descriptions of examples or counter-examples can be ordered according to their generality/specificity. Thus for instance, the following descriptions are ordered according to decreasing generality: \({<}?, ?, ?{{>}}\), \({<}\)Sunny \(\vee \) Cloudy\(,?, ?{{>}}\), \({<}\)Sunny\(, ?, ?{{>}}\), \({<}\)Sunny, ?,  Normal\({{>}}\), \({<}\emptyset , \emptyset , \emptyset {{>}}\).

The version space is represented by its most general and least general members. The so-called general boundary G is the set of maximally general members of the hypothesis space that are consistent with the data. The specific boundary S is the set of maximally specific members of the hypothesis space that are consistent with the data. G and S are initialized as \(G\,=\,{<}?, ?, ?{{>}}\) and \(S\,=\,{<}\emptyset , \emptyset , \emptyset {{>}}\) (for 3 attributes as in the example).

The procedure amounts to finding a maximally specific hypothesis which covers the positive examples. Suppose we have two examples of Nice Day:

Ex1: <Sunny, Warm, Normal>;  Ex2: <Sunny, Warm, High>.

Then, taking into account Ex1, S is updated to \(S_1\,=\,\) \({<}{} \texttt {Sunny}\),\( \texttt {Warm}\),\(\texttt {Normal}{{>}}\).

Adding Ex2, S is updated to \(S_2\,=\,{<}\)Sunny, Warm\(, ?{{>}}\), which corresponds to the disjunction of Ex1 and Ex2. The positive training examples force the S boundary of the version space to become increasingly general (\(S_2\) is more general than \(S_1\)).

Although the version space approach was not cast in a logical setting, it is perfectly compatible with the logical encoding (1). Indeed here we have two examples of the form \((v_1, v_2, v_3)\) and \((v_1, v_2, \lnot v_3)\) (with \(v_1=\texttt {Sunny}; v_2=\texttt {Warm}; v_3=\texttt {Normal}, \lnot v_3= \texttt {High}\)). A tuple of values such as \({<}v,v',v''{{>}}\) is to be understood as the conjunction \(v\wedge v' \wedge v''\). So we obtain \((v_1 \wedge v_2 \wedge v_3) \vee (v_1\wedge v_2 \wedge \lnot v_3) \rightarrow v_C\). It corresponds to the left part of Eq. (1) for \(n=3\) and \(|\mathcal {E}|=2\), which yields \((v_1 \wedge v_2) \wedge (v_3\vee \lnot v_3) \rightarrow v_C\), i.e., \((v_1 \wedge v_2) \rightarrow v_C\). So the more positive examples we have, the more general the lower bound of C in (1) (the set of models of a disjunction is larger than the set of models of each of its components). This lower bound, here \(v_1 \wedge v_2\), is a maximally specific hypothesis h.

Negative examples play a complementary role. They force the G boundary to become increasingly specific. Suppose we have the following counter-example for Nice Day: cEx3: <Rainy, Cold, High>.

The hypothesis in the G boundary must be specialized until it correctly classifies the new negative example. There are several alternative minimally more specific hypotheses. Indeed, the 3 attributes can be specialized to avoid covering cEx3, by requiring \(\lnot \)Rainy, or \(\lnot \)Cold, or \(\lnot \)High. This exactly corresponds to Eq. (1), which here gives \(v_C \rightarrow \lnot \texttt {Rainy}\vee \lnot \texttt {Cold}\vee \lnot \texttt {High}\), i.e., \(v_C \rightarrow \texttt {Sunny} \vee \texttt {Cloudy}\vee \texttt {Warm}\vee \texttt {Normal}\).

The elements of this disjunction correspond to maximally general potential hypotheses. But in fact we have only two new hypotheses in G: \({<}\)Sunny\(, ?, ?{{>}}\) and \({<}?,\) Warm\(, ?{{>}}\), as explained now. Indeed, the hypothesis \(h = {<}?, ?,\) Normal\({{>}}\) is not included in G, although it is a minimal specialization of G that correctly labels cEx3 as a negative example. This is because example Ex2, whose attribute value for \(A_3\) is \(\texttt {High}\), disagrees with the implication \(\texttt {Normal} \rightarrow v_C\). So, hypothesis \({<}?,?,\) Normal\({{>}}\) is excluded. Similarly, examples Ex1 and Ex2 (for which the attribute value for \(A_1\) is \(\texttt {Sunny}\)) disagree with the implication \(\texttt {Cloudy}\rightarrow v_C\). This kind of elimination applies in Equation (1) as well. Indeed the expression \(v \wedge L \vDash \lnot v \vee L'\) can be simplified into \(v \wedge L \vDash L'\).

We thus obtain upper and lower bounds from Ex1, Ex2, and cEx3

S = {<Sunny, Warm, ?>}    G = {<Sunny, ?, ?>, <?, Warm, ?>}

where \(\{{<}v_1, v'_1, v''_1{{>}}, {<}v_2, v'_2, v''_2{{>}}\}\) logically reads \((v_1\wedge v'_1 \wedge v''_1) \vee (v_2\wedge v'_2 \wedge v''_2)\) (? stands for \(\top \)).

The S boundary of the version space thus summarizes the previously encountered positive examples. Any hypothesis more general than S will, by definition, cover any example that S covers and thus will cover any past positive example. In a dual fashion, the G boundary summarizes the information from previously encountered negative examples. Any hypothesis more specific than G is assured to be consistent with past negative examples. The set of all the hypotheses between S and G has a lattice structure. This is in full agreement with Equation (1). The approach provides an iterative procedure that takes advantage of the examples and counter-examples progressively.

Thus, the general procedure for obtaining the bounds of the version space is as follows (a small code sketch is given after the list).

  • If \(\mathbf {e}\) is a positive example,

    1. remove from G any hypothesis inconsistent with \(\mathbf {e}\);

    2. replace in S any hypothesis inconsistent with \(\mathbf {e}\) by a minimal generalization h consistent with \(\mathbf {e}\).

  • If \(\mathbf {e}\) is a negative example,

    1. remove from S any hypothesis inconsistent with \(\mathbf {e}\);

    2. replace in G any hypothesis inconsistent with \(\mathbf {e}\) by its minimal specializations h consistent with \(\mathbf {e}\).
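
A compact sketch of this procedure is given below (our own simplification: S is maintained as a single maximally specific conjunctive hypothesis, the attribute domains are those of the running example, and disjunctive constraints are not generated):

```python
DOMAINS = {0: ('Sunny', 'Cloudy', 'Rainy'), 1: ('Warm', 'Cold'), 2: ('Normal', 'High')}

def covers(h, x):
    """A conjunctive hypothesis covers an item iff every constraint is '?' or matches."""
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general(h1, h2):
    """h1 is at least as general as h2 (constraint-wise)."""
    return all(c1 == '?' or c1 == c2 for c1, c2 in zip(h1, h2))

def candidate_elimination(examples):
    G = [tuple('?' for _ in DOMAINS)]
    S = None                                   # plays the role of the empty hypothesis <0, 0, 0>
    for x, positive in examples:
        if positive:
            G = [g for g in G if covers(g, x)]             # drop members of G inconsistent with x
            S = x if S is None else tuple(                 # minimal generalization covering x
                s if s == v else '?' for s, v in zip(S, x))
        else:
            new_G = []
            for g in G:
                if not covers(g, x):
                    new_G.append(g)
                    continue
                for i, c in enumerate(g):                  # minimal specializations of g
                    if c != '?':
                        continue
                    for v in DOMAINS[i]:
                        if v != x[i]:
                            h = g[:i] + (v,) + g[i + 1:]
                            if S is None or more_general(h, S):   # keep only h covering past positives
                                new_G.append(h)
            G = [h for h in new_G                          # keep only maximally general members
                 if not any(h2 != h and more_general(h2, h) for h2 in new_G)]
    return S, G

examples = [(('Sunny', 'Warm', 'Normal'), True),   # Ex1
            (('Sunny', 'Warm', 'High'), True),     # Ex2
            (('Rainy', 'Cold', 'High'), False)]    # cEx3
print(candidate_elimination(examples))
# (('Sunny', 'Warm', '?'), [('Sunny', '?', '?'), ('?', 'Warm', '?')]), as obtained above
```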

7 Towards a Possibilistic Variant of the Version Space

The main drawback of the version space approach is its sensitivity to noise. Indeed, each example and each counter-example influences the result. In [18], the authors use rough set approximations to cope with this problem.

Here we make another suggestion using possibility theory. The idea is to associate each example and each counter-example with a certainty level, as in possibilistic logic (see, e.g., [16]) in order to express to what extent we consider it is certain that the corresponding piece of information is true (rather than false). This certainty level expresses our confidence in the piece of data as being exact. It can reflect the confidence we have in the source that provided it, or be the result of an analysis or filtering of the data that disqualifies outliers. In that respect we should remember that one semantics of possibility theory is in terms of (dis)similarity [29].

In other words, we have a multi-tiered set of examples and a multi-tiered set of counter-examples. So, considering all examples and all counter-examples whose certainty is at least equal to some given certainty level \(\alpha \) yields a regular version space with classical bounds. Thus, each \(\alpha \) gives rise to a finite set of hypotheses with which \(\alpha \) can be associated. We thus have a natural basis for rank-ordering hypotheses. The smaller \(\alpha \), the larger the numbers of examples and counter-examples taken into account, and the tighter the bounds.
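
As a minimal illustration of this layering (a Python sketch of our own, where only the S bound is computed as the least general generalization of the retained examples, and numerical weights stand in for the symbolic levels of Example 5 below):

```python
def lgg(items):
    """Least general conjunctive generalization (the S bound) of a list of positive items."""
    s = list(items[0])
    for x in items[1:]:
        s = [a if a == b else '?' for a, b in zip(s, x)]
    return tuple(s)

# Weighted examples of Nice Day; 1.0 and 0.6 are illustrative stand-ins for the levels 1 and beta,
# and the counter-example layer is left aside in this sketch.
weighted_examples = [(('Sunny', 'Warm', 'Normal'), 1.0),   # Ex1, fully certain
                     (('Sunny', 'Warm', 'High'), 0.6)]     # Ex2, certainty beta

for alpha in (1.0, 0.6):
    kept = [x for x, w in weighted_examples if w >= alpha]
    print(alpha, lgg(kept))
# 1.0 ('Sunny', 'Warm', 'Normal')
# 0.6 ('Sunny', 'Warm', '?')   -- more examples retained, more general S bound
```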

This can be illustrated on the example of the previous section.

Example 5

Examples and counter-examples now come with certainty weights. Assume we have Ex1: (<Sunny, Warm, Normal\(>, 1)\); cEx3: (<Rainy, Cold, High\(>, \alpha \)); Ex2: (<Sunny, Warm, High\(>, \beta )\), with \(1> \alpha >\beta \).

So, we obtain a layered version of the upper and lower bounds of the version space:

  • at level 1, we have \(G_1\,=\,{<}?, ?, ?{{>}}\) and \(S_1\,=\,{<}\)Sunny, Warm, Normal\({{>}}\).

  • at level \(\alpha \), we have \(G_\alpha \,=\,\{{<}\)Sunny\(, ?, ?{{>}}, {<}\)Cloudy\(, ?, ?{{>}}, {<}?,\) Warm\(, ?{{>}}\}\) and \(S_\alpha = {<}\)Sunny, Warm, Normal\({{>}}\).

  • at level \(\beta \), we have \(G_\beta =\{{<}\)Sunny\(, ?, ?{{>}}\), \({<}?,\) Warm\(, ?{{>}}\}\) and \(S_\beta \,=\,{<}\)Sunny, Warm\(, ?{{>}}\).

\(\Box \)

The above syntactic view is simpler than the semantic one presented in [27], which starts with a pair of possibility distributions over hypotheses, respectively induced by the examples and by the counter-examples.

8 Formal Concept Analysis

Formal concept analysis [19] is another setting where association rules between attributes can be extracted from a formal context \(R\subseteq X \times Y\), which is nothing but a relation linking items in X with properties in Y. It provides a theoretical basis for data mining. Table 1 can be viewed as a context, restricting to rows \(\mathcal {E}\cup \mathcal {E}'\) and considering the class of examples as just another attribute.

Let Rx and \(R^{-1}y\) respectively denote the set of properties possessed by item x and the set of items having property y. Let \(E \subseteq X\) and \(A \subseteq Y\). The set of items having all properties in A is given by \(A^\downarrow = \{x \ | \ A \subseteq Rx\}\) and the set of properties possessed by all items in E is given by \(E^\uparrow = \{y \ | \ E \subseteq R^{-1}y\}\). A formal concept is then defined as a pair (E, A) such that \(A^\downarrow =E\) and \(E^\uparrow = A\), where E and A provide the extent and the intent of the formal concept respectively. Then, it can be shown that \(E\times A\subseteq R\) and is maximal with respect to set inclusion, i.e., (E, A) defines a maximal rectangle in the formal context.

Let A and B be two subsets of Y. Then R satisfies the attribute implication \(A \Rightarrow B\) if every \(x \in X\) such that \(x \in A^\downarrow \) also satisfies \(x \in B^\downarrow \), i.e., if \(A^\downarrow \subseteq B^\downarrow \). Formal concept analysis is not primarily oriented towards concept learning, but towards mining attribute implications (i.e., association rules). However, it might be interesting to consider formal contexts where Y also contains the names of classes, i.e., \(\mathcal {C}\subseteq Y\). Then being able to find attribute implications of the form \(A \Rightarrow C\) where \(A \cap \mathcal {C}= \emptyset \) and \(C \subseteq \mathcal {C}\) would be of particular interest, especially if C is a singleton.
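
The two derivation operators, the formal-concept test and the attribute-implication test can be written down directly (a small self-contained Python sketch over a toy context of our own):

```python
# A toy formal context: each object is mapped to the set of properties it possesses.
Y = {'a', 'b', 'c'}
R = {'x1': {'a', 'b'}, 'x2': {'a', 'b', 'c'}, 'x3': {'c'}}

def down(A):
    """A↓: the objects possessing all properties in A."""
    return {x for x, props in R.items() if A <= props}

def up(E):
    """E↑: the properties possessed by all objects in E."""
    return {y for y in Y if all(y in R[x] for x in E)}

def is_formal_concept(E, A):
    """(E, A) is a formal concept iff A↓ = E and E↑ = A (a maximal rectangle in R)."""
    return down(A) == E and up(E) == A

def implies(A, B):
    """R satisfies the attribute implication A ⇒ B iff A↓ ⊆ B↓."""
    return down(A) <= down(B)

print(is_formal_concept({'x1', 'x2'}, {'a', 'b'}))   # True
print(implies({'b'}, {'a'}))                         # True: every object having b also has a
```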

A construction dual to the theory of attribute implications has been proposed in [2], to extract disjunctive attribute implications \(A \longrightarrow B\): such an implication is satisfied by R if every object x that possesses at least one property in A also possesses at least one property in B. This approach interprets a zero in matrix R for object x and property a as the statement that x does not possess property a.

Disjunctive attribute implications can be extracted in two ways. One way is to consider the complementary context \(\overline{R}\), viewed as a standard context with negated attributes, to extract attribute implications from it, and to obtain the disjunctive attribute implications by contraposition. The other way is to derive them directly, replacing the operator \(A^\downarrow \) by a possibilistic operator \(A^{\downarrow \Pi } = \{x \ | \ A \cap Rx\ne \emptyset \}\) introduced independently by several authors relating FCA and modal logic [21], rough sets [30] and possibility theory [10]. Then R satisfies the attribute implication \(A \longrightarrow B\) if \(A^{\downarrow \Pi }\subseteq B^{\downarrow \Pi } \).
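
With the same toy-context conventions, the possibilistic operator and the disjunctive-implication test become (again an illustrative sketch):

```python
R = {'x1': {'a', 'b'}, 'x2': {'a', 'b', 'c'}, 'x3': {'c'}}

def down_pi(A):
    """A↓Π: the objects possessing at least one property in A."""
    return {x for x, props in R.items() if A & props}

def disj_implies(A, B):
    """R satisfies the disjunctive attribute implication A ⟶ B iff A↓Π ⊆ B↓Π."""
    return down_pi(A) <= down_pi(B)

print(down_pi({'c'}))                    # {'x2', 'x3'}
print(disj_implies({'a'}, {'b', 'c'}))   # True: any object having a also has b or c
```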

It is interesting to notice that if Y also contains the names of classes, i.e., \(\mathcal {C}\subseteq Y\), disjunctive attribute implications of the form \(C \longrightarrow B\) where \(B \cap \mathcal {C}= \emptyset \) and \(C \subseteq \mathcal {C}\) correspond to the logical rule \(v_C\rightarrow \vee _{b\in B} b\), which is in agreement with the handling of counter-examples in the logical reading of a classification task presented in Sect. 3. Indeed, the rule \(v_C\rightarrow \vee _{b\in B} b\) can be read by contraposition: if an object violates all properties in B, then it is a counter-example. So there is a natural way of relating logical approaches to the classification problem and formal concept analysis, provided that a formal context is viewed as a set of examples and counter-examples (e.g., objects that satisfy a set of properties vs. objects that do not).

Note finally that the rectangular nature of formal concepts expresses a form of convexity, which fits well with the ideas of Gärdenfors about conceptual spaces [20]. Moreover, using operators other than \(^\downarrow \) and \(^\uparrow \) as well (see [14]) helps characterize independent sub-contexts and other noticeable structures. Formal concept analysis can also be related to the idea of clustering [15], where clusters are unions of overlapping concepts in independent sub-contexts. The idea of approximate concepts, i.e., rectangles with “holes”, suggests a convexity-based completion principle, which might be useful in a classification perspective.

9 Concluding Remarks

This paper is clearly a preliminary step toward a unified logical study of set theory-based approaches in data management. It is preliminary in at least two respects: several of these approaches have only been cited in the introduction, while the others have only been briefly discussed. All these theoretical settings start with a Boolean table in the simplest case, and many of them extend to nominal, and possibly to numerical data. Still, they have been motivated by different concerns such as describing a concept, predicting a class, or mining rules. Due to their set theory-based nature, they can be considered from a logical point of view, and a number of issues are common to them, such as handling incomplete information, missing values, inconsistent information, or non-applicable attributes.

In a logical setting, the handling of uncertainty can be conveniently achieved using possibility theory and possibilistic logic [16]. We have suggested above how it can be applied to concept learning and how it may take into account uncertain pieces of data. Possibilistic logic can also handle default rules that can be obtained from Boolean data by looking for suitable probability distributions [3]; such rules provide useful summaries of data. The possible uses of possibilistic logic in data management are a general topic for further investigation.