Concept learning consistency under three-way decision paradigm

Concept Mining is one of the main challenges in both Cognitive Computing and Machine Learning. The ongoing improvement of solutions to this problem raises the need to analyze whether the consistency of the learning process is preserved. This paper addresses a particular question, namely, how the concept mining capability changes when the hypothesis class is reconsidered. The issue is raised from the point of view of the so-called Three-Way Decision (3WD) paradigm, which provides a sound framework for reconsidering decision-making processes, including those assisted by Machine Learning. Thus, the paper aims to analyze the influence of 3WD techniques on the Concept Learning Process itself. For this purpose, we introduce new versions of the Vapnik-Chervonenkis dimension. Likewise, to illustrate how the formal approach can be instantiated in a particular model, the case of concept learning in (Fuzzy) Formal Concept Analysis is considered.


Introduction
Concept mining deals with the extraction of concepts from artifacts such as data, traces of behavior, or collections of unstructured data, among others. In a broad sense, the task should be understood as a problem of recognizing patterns in source data. Its resolution requires techniques from Artificial Intelligence and Statistics (such as data mining, text mining, and variants of statistical learning). However, other scientific fields, such as Cognitive Computing or Psychology, are also needed. When faced with this task, the researcher must bear in mind that the notion of concept is itself subject to various interpretations, some of them tailored to the nature of a particular problem. The notion is not limited to concept extraction based on linguistic patterns (e.g., using WordNet [44]). One can also address the problem of extracting concepts with more formalized semantics, as for instance in Formal Concept Analysis (see a survey of related issues in [39]), or even at the level of Semantic Web technologies [17].
There are several issues related to Concept Mining, such as the granularity of the conceptual structure achieved, the richness of the concept repertoire, and the treatment of uncertainty. The latter affects the extensionality of the extracted concepts or, in equivalent terms, the problem of deciding concept membership. Decision-making in Concept Mining can be essentially different from other decision-making tasks, since resolving the uncertainty would solve, in practice, the problem itself, namely the concept specification. For example, the treatment of concept mining by means of fuzzy methods (e.g., [42]) allows posing the problem in such a way that general-purpose solutions for uncertainty processing can be applied. It is in this respect that Concept Mining relates to approaches for shaping decision regions within the data space, such as the so-called Three-Way Decision (3WD) research paradigm.
The 3WD paradigm has emerged as a framework to address the challenges related to decision-making processes [22,32,61,63,66]. The basic idea underlying 3WD is that the domain of a decision-making problem is intrinsically partitioned into two regions. The first one -the decided data- comprises those inputs for which the decision problem has been solved, split in turn into the set for which the answer is positive and the set containing the inputs with negative output. The second region comprises all data for which we do not know, for the moment, the decision to be made (the boundary region). The analysis of this partition precedes the second problem to address: the design of strategies for the three regions (see Fig. 3).
The 3WD paradigm aims to bridge ideas between different approaches, ranging from well-established fields such as Granular Computing [65] to human problem-solving skills [62,64]. Solutions based on 3WD techniques capture different ways of understanding both the decision process and the (ontological) nature of the positive, negative, and boundary (undecided, uncertain) sets. Among others, the following approaches are considered: techniques from Rough Set Theory [66], assessment of thresholds and determination of decisions [49], working with interval-based evaluations [36], and extensions of Formal Concept Analysis [47,66,68,71]. 3WD is inspiring new ways of managing decision-making that broaden the horizon of its applicability [5].
Focusing on a topic closer to that of this paper -consistency of Concept Mining from datasets- there are several 3WD applications in Data Science and related fields (e.g. [40,58]). These cover topics such as foundations [5,30,62], the enrichment of Machine Learning (ML) processes [54], applications in the presence of uncertainty or the absence of data/information [1], semantic tagging and sentiment analysis [24,69,70], and incremental concept learning [68].

Related work on 3WD foundations of Concept Learning in FCA
Roughly speaking, efforts focused on the formalization of processes for concept learning within the 3WD paradigm can be classified into two categories: those that design the formalism from 3WD principles, as in [23,29], and those that exploit the similarities with 3WD within other frameworks that allow formalizing them, such as Formal Concept Analysis or Concept Graphs [19,27]. The former type of approach is rooted in 3WD principles. An example is the notion of Three-Way Cognitive Concept Learning, introduced through the so-called three-way cognitive operator [29]. This approach is framed within a multi-granularity context. It starts with an attribute partition according to the different ways of deciding data. The cited paper details the basic requirements for such an operator (an axiomatic approach). This is specified by means of a pair of mappings relating, in both directions, the three-way decisions with the sets of the attribute partition. From the axioms, the notion of concept under such mappings is defined as a pair (⟨X, Y⟩, B) formed by a 3WD decision (X and Y are the positive and negative regions, respectively) and a subset B of the attribute partition, satisfying two closure conditions. The first is that ⟨X, Y⟩ is the most effective decision for the multi-decision set represented by B. The second is that B contains all decision problems for which ⟨X, Y⟩, or another less effective 3WD decision, is a solution. Two particularities of the theory developed in that work are that the decision thresholds for each function are set as initial parameters (thus prefixing the positive and negative sets for 3WD reasoning) and that only non-contradictory three-way decisions are considered (something natural for learning in a multi-decision context).
The framework is also useful for formalizing the dynamics and evolution of 3WD decisions (depending on variations of the information), as well as for establishing, from a fusion viewpoint, how 3WD cognitive concept learning would proceed [23]. The idea guiding the learning in the latter is to find the best approach to the decision problem from the multi-granularity structure [23,29].
Concerning the second approach mentioned above -FCA and related formalisms as the source of Concept Learning formalization- the two cited works [19,27] develop concept learning approaches in FCA. These can be seen as extensions of Kuznetsov's foundational work [26]. The idea can be read in 3WD terms, since the attribute to learn classifies the data from G (the object set in FCA) into positive G+, negative G−, and undetermined objects (the so-called boundary region in 3WD). The partition of the object set induces three formal contexts that, combined, allow one to define the version space in which to select the hypothesis (the consistent classifier), as well as to characterize different types of sound classifiers. The approach of [27] further generalizes the idea to work with conceptual graphs, which are not necessarily endowed with a lattice structure, although they are partially ordered by means of a specialization relationship, which gives rise to a concept lattice on sets of graphs.
The former two approaches share the aim of establishing an appropriate framework in which to specify and solve the concept learning problem. Such a framework can and should be complemented with two other types of study. On the one hand, a study of the computational complexity of the problem. On the other hand, a study of how to decide the consistency of the learning processes designed on the constructed hypothesis spaces (a Statistical Learning issue): in the present case, how to study the consistency of methods based on Empirical Risk Minimization (ERM) using the new hypothesis classes.
Concerning the first issue, the complexity was characterized by Kuznetsov in [28], where the main complexity results in (FCA-based) Concept Learning are presented. As regards counting the (minimal) hypotheses in FCA-based learning, the problem is proved to be #P-complete; concerning the decision problem itself (the existence of a positive hypothesis of bounded size), NP-completeness is proved.
The second study would focus on the convergence of the learning procedures; for example, whether the empirical risk of an ERM method converges to the true risk as the sample set grows. In particular, we are concerned with the preservation of ERM consistency when the hypothesis class itself is enhanced (e.g. by providing more effective 3WD decisions or refining classifiers). To frame this problem better, let us focus for a moment on such an enrichment.

Enhancing the hypothesis class by means of 3WD
This paper concerns some of the foundational issues arising when Machine Learning (ML) models are reconsidered under 3WD premises; specifically, how the Learning Process is influenced by the transformation of F , the class of functions on the data space U that the ML model can use to learn (the hypothesis class). To properly state the problem, the elements involved are sketched here. Let F 3WD be some extension of F obtained by a particular (previously designed) 3WD-based enhancing method. Such an extension will be called a 3WD closure. The selected 3WD closure should present some features, for instance, being easy to implement (from F ) and (partially) solving the noncommitment problem. Replacing F with F 3WD will have an impact on the Learning Process. Consequently, the decision-making procedure based on such a process may change. This fact suggests addressing two issues.
The first one is foundational in nature: what consequences would the extension of F to F 3WD have? It should describe how such an extension affects the performance of the model. To study the issue, we will focus on the Vapnik-Chervonenkis (VC) dimension, denoted in this paper by dim VC (·). The VC dimension is useful for studying PAC learning as well as Empirical Risk Minimization (ERM)-based learning processes.
The second issue deals with the usefulness of the study of the first one, that is, how the theoretical framework developed in the first part of the paper applies to a particular ML model. We have selected as a case study (Fuzzy) Formal Concept Analysis, (F)FCA, considered as a Knowledge Discovery tool for concept mining. The application of (F)FCA to Concept Learning is an active research topic in both Cognitive Computing and Machine Learning [15,30,37,39,68]. Its soundness is based on its natural relationship with the traditional notion of concept. The classical view on concepts, the classical theory, holds that concepts possess a definitional structure; in other words, a concept can be defined by specifying (a set of) its properties. In fact, according to the International Standard ISO 704, a concept is a unit of thought constituted of two parts: its extent and its intent. That is, besides the definitional structure formed by properties (the intent), there is the set of elements satisfying it (the extent). This definition matches the notion of formal concept in FCA.
In order to familiarize the reader with the basics of FCA, let us consider a simple but illustrative example of using FFCA for Concept Learning (see Fig. 1, right). In FFCA, it is usual to take a common threshold for all attributes in order to induce crisp concepts. For example, one can select 1 as threshold to obtain a (crisp) formal context. However, for characterizing some sets, such concepts are not convenient. If different thresholds for the same attribute are selected, one can obtain the second formal context shown in Fig. 1, bottom. The associated concept lattices are shown in Fig. 2. The example exhibits the maximum size of a subset with maximum semantic differentiation (technically, a set shattered using the concepts of the second context). To obtain a similar differentiation for all the individuals, 2⁴ concepts would be needed, so this is not possible with that context. However, the number of fuzzy attributes can be increased by taking other modifications of the original attributes. A question that arises is whether the full object set could be shattered by enlarging the attribute set with some new (crisp) modifiers of the fuzzy attributes. This issue is interesting when working with potentially infinite datasets and hypothesis classes.
As can be seen in the example, the step from the dataset to FFCA is natural, and the refinement of the class of membership functions would be necessary for proper concept mining. Keep in mind that increasing the number of attributes amounts to increasing the size (the variable dimension) of the dataset, which can bring more complexity, and new issues can arise (similar to the curse of dimensionality [7]). To address the issue, the VC dimension of a formal context will be introduced in a natural way (Sect. 6); namely, it is the VC dimension of the class formed by (the extensions of) the concepts of the context.
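In code, the threshold-induction step used in the example can be sketched as follows (a minimal sketch in Python; the fuzzy degrees, the attribute name, and the per-attribute thresholds are hypothetical):

```python
def crisp_context(fuzzy_I, thresholds):
    """Induce a crisp incidence from a fuzzy one: object g gets attribute m
    when its membership degree reaches the threshold chosen for m."""
    return {(g, m) for (g, m), degree in fuzzy_I.items()
            if degree >= thresholds[m]}

# Hypothetical fuzzy context: degrees to which each object is "tall".
fuzzy_I = {("a", "tall"): 1.0, ("b", "tall"): 0.8, ("c", "tall"): 0.3}

strict = crisp_context(fuzzy_I, {"tall": 1.0})   # only fully tall objects
relaxed = crisp_context(fuzzy_I, {"tall": 0.5})  # a coarser crisp context
assert strict == {("a", "tall")}
assert relaxed == {("a", "tall"), ("b", "tall")}
```

Selecting different thresholds for the same attribute amounts to calling `crisp_context` with different `thresholds` maps, each call yielding a different crisp formal context.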
Generalizing the framework sketched in the previous example, we can see that any hypothesis class F on a data space G naturally induces a formal context [F] = (G, F, I), where I is defined by gIf ⟺ f(g) = 1. In general, dim VC (F) < ∞ does not imply that dim VC ( [F]) < ∞ (although the converse is true). The aim in the case at hand is to extend the hypothesis class to F 3WD (built by some method) before constructing the formal context. Thus, the same issue arises: whether dim VC ( [F 3WD ]) < ∞ . Furthermore, if it holds, a new question arises: whether it is possible to restrict ourselves to finitely generated (f.g.) concepts, that is, concepts that are characterizable by a finite set of attributes. The requirement of only using f.g. concepts is a natural condition, since it facilitates user acceptance of the discovered concepts by presenting simpler characterizations (see e.g. [4]).

Summing up, Table 1 shows, with a brief description, both the hypothesis classes and the different versions of the VC dimension that will be used throughout the paper, both in the abstract setting and in the case of FFCA. The first block contains the general notation; the term F 3WD denotes a generic 3WD closure obtained using some approach. The second block lists the different types of hypothesis classes considered in the paper, including those induced by the attributes of the formal context (the first two) and those built from concept subsets of the concept lattice of the formal context. The third block enumerates the four VC dimensions studied in the paper.

Throughout the paper, several results relating the different VC dimensions are presented. Additionally, many examples are introduced to show how the extension of the hypothesis class modifies the VC dimension, which can even lead to an infinite dimension in some cases (and hence to losing ERM consistency). Among other results, it is verified that the VC and DVC dimensions agree for any hypothesis class (Prop. 3). Within FFCA, the DVC and SVC dimensions agree on formal contexts, even in some cases in which a finite hypothesis class is extended to an infinite one. Concerning the extension of the attribute set, it is shown that the extension obtained by the so-called contractions (roughly speaking, those obtained by varying the thresholds for decision-making) preserves the finiteness of the SVC dimension (Corollary 8). Moreover, a study of the different dimensions on finitely generated hypothesis classes is carried out, showing that the SVC dimension is preserved in interesting cases (Th. 2).

Structure of the paper
The paper aims to address the above issues from a 3WD-inspired theoretical framework. The next section recalls the main elements of the 3WD paradigm, Machine Learning, and (fuzzy) Formal Concept Analysis needed throughout the paper. Section 3 is devoted to framing learning within the 3WD trisecting-acting framework. In Sect. 4, the extension of the hypothesis class by means of some 3WD technique is formalized in functional terms. In Sect. 5, some variants of the VC dimension are introduced; these facilitate the analysis of the new ML models. The second part of the paper starts with Sect. 6, which is devoted to instantiating the above ideas for (F)FCA, considered as a model for Concept Mining. The main results on the VC dimension are shown for this case. The analysis continues in Sect. 7, where the so-called DVC dimension for FFCA is studied. Several variants, related to finitely generated concepts, are also analyzed. In Sect. 8 it is proven that a particular type of 3WD closure -focused on refining the indecision regions of the functions of the hypothesis class- preserves PAC learnability. The paper ends with some final considerations, as well as future work (Sect. 9).

Background
The cardinality of a set A will be denoted by |A|, its power set by P(A), and its characteristic function by C A . Throughout the paper, U will denote a data space. The class of evaluations on U is defined as the function class U [0,1] = {f ∣ f ∶ U → [0, 1]}. Likewise, the class of binary functions U {0,1} is analogously defined.

Learning and VC dimension
Any ML model considered here has an associated hypothesis class F , in which the ML procedure searches for the solution to the learning problem for a set A (i.e., the decision problem of belonging to that set), under an unknown probability measure P(·).
There is a risk function Q ∶ ℝ² × F → ℝ to estimate the discrepancy between the decision value y for the input x and the value f(x) given by the function chosen as solution.
Definition 1 Under the above conditions, the (functional) risk of f is R(A, f) = ∫ Q((x, C A (x)), f) dP(x). When A is fixed, any reference to the set will be omitted in the notation, e.g. by writing R(f). The purpose of the ML-based method is then to minimize the risk, by finding f 0 ∈ F such that R(f 0 ) = inf {R(f) ∣ f ∈ F}. The learning process will use sample data: independent, identically distributed random sets S. The solution proposed from S will also be a function of F . Learning will be driven by the goal of minimizing the empirical risk associated with the sample data.

Definition 2
The empirical risk of f for the finite sample S is defined as R emp (f, S) = (1/|S|) Σ (x,y)∈S Q((x, y), f). A process is said to be an Empirical Risk Minimizer (ERM) (relative to Q) if it uses the empirical risk as an estimate of the soundness of the solution, in the following sense: for a sample S n of size n, the ERM-based process returns a function f n that minimizes the empirical risk for S n .
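A minimal sketch of empirical risk minimization over a finite hypothesis class, using the 0-1 loss; the sample, the threshold classifiers, and the loss choice are illustrative, not the paper's setting:

```python
def empirical_risk(f, sample, loss):
    """Average loss of f over the finite sample S of (x, y) pairs."""
    return sum(loss(y, f(x)) for x, y in sample) / len(sample)

def erm(hypotheses, sample, loss):
    """Empirical Risk Minimizer: return a hypothesis with least empirical risk."""
    return min(hypotheses, key=lambda f: empirical_risk(f, sample, loss))

zero_one = lambda y, yhat: 0 if y == yhat else 1

# Toy sample labelled by the rule "x >= 5"; candidate threshold classifiers.
sample = [(x, int(x >= 5)) for x in range(10)]
hypotheses = [lambda x, t=t: int(x >= t) for t in range(10)]
best = erm(hypotheses, sample, zero_one)
# The minimizer recovers the labelling rule, so its empirical risk is 0.
```

Consistency, discussed next, asks whether this sample-based minimum converges to the true risk minimum as the sample grows.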

Definition 3
The ERM is consistent if the two following limits converge in probability to the value sought: R(f n ) → inf {R(f) ∣ f ∈ F} and R emp (f n , S n ) → inf {R(f) ∣ f ∈ F}, as n → ∞.

Vapnik-Chervonenkis dimension
A key measure for studying the consistency of ERM is the so-called Vapnik-Chervonenkis dimension (VC dimension) [52] (see also [10,51]). Under certain conditions, a hypothesis class with a finite VC dimension guarantees the consistency of the (ERM-based) learning process. The VC dimension is also useful for other Data Science challenges such as Differential Privacy ( [72], p. 64).

Definition 4 Let U be a set and B ⊆ P(U) a set class. A set A ⊆ U is shattered by B if {A ∩ B ∣ B ∈ B} = P(A). The VC dimension of B, dim VC (B), is the largest cardinal of a set shattered by B (it can be infinite).
The notation dim VC (U, B) will be used when the aim is to make U explicit. The reader can find several illustrative examples of computing VC dimension in classical literature, as in [10,35].
In this paper we work with pairs of the form (U, F), where F is a hypothesis class on U. The VC dimension can also be defined for a real-valued hypothesis class F . Given A ⊆ U , it is said that A is learned by means of F if there exist f ∈ F and ε ∈ ℝ such that A = {x ∈ U ∣ f(x) ≥ ε}. That is, one works with the set class {{x ∈ U ∣ f(x) ≥ ε} ∣ f ∈ F, ε ∈ ℝ}. Without loss of generality, one can work only with set classes B F defined as B F = {{x ∈ U ∣ f(x) = 1} ∣ f ∈ F}.

A key result in Statistical Learning states that, for any hypothesis class with bounded VC dimension, a consistent learner induces a PAC learning algorithm, provided a large enough training set [10,31,35] (see Thm. 1 below). In the case of convergence, the VC dimension is useful for bounding the error of the ML-based algorithm [10,51]. The VC dimension is used to find a bound -independent of the underlying distribution P- for the sample size needed to select a hypothesis with arbitrarily small error and arbitrarily high probability, no matter which set we are trying to learn. We refer the reader to [9,10] or [51] for technical details. Therefore, the VC dimension turns out to be a useful tool for analyzing ERM-based ML models. To refer to this result throughout the paper, it is stated here in the following general terms.

Theorem 1 For a hypothesis class F , the following conditions are equivalent: (1) dim VC (F) < ∞ ; (2) F is PAC learnable; (3) the ERM-based learning processes on F are consistent.

The researcher can compare different ML models based on their VC dimensions, taking into account that the larger the VC dimension, the larger the data sample to be used. Additionally, hypothesis classes with excessive VC dimension should be avoided, since they might overfit [51]; that is, they could focus on irrelevant features of the input dataset. Therefore, the aim is to work with a hypothesis class with a low VC dimension.
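Definition 4 can be checked by brute force on small finite universes; the following sketch (with an illustrative class of initial segments) is only feasible for toy examples:

```python
from itertools import combinations

def shatters(set_class, A):
    """B shatters A iff the traces {A ∩ B : B in B} realize all of P(A)."""
    A = frozenset(A)
    traces = {A & frozenset(B) for B in set_class}
    return len(traces) == 2 ** len(A)

def vc_dimension(universe, set_class):
    """Largest |A| with A ⊆ universe shattered by set_class (brute force)."""
    dim = 0
    for k in range(1, len(universe) + 1):
        if any(shatters(set_class, A) for A in combinations(universe, k)):
            dim = k
    return dim

# Initial segments {0..t}: no 2-element set is shattered, so the dimension is 1.
U = range(5)
segments = [set(range(t + 1)) for t in range(5)]
assert vc_dimension(U, segments) == 1
```

For the segments, a pair {a, b} with a < b cannot be shattered because every segment containing b also contains a, so the trace {b} never occurs.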
The following function is useful for estimating the growth of the VC dimension with respect to the size of the sample set. Given a finite X ⊆ U, let |X| F = |{X ∩ B ∣ B ∈ B F }|, and define the growth function s (U,F) (n) = max {|X| F ∣ X ⊆ U, |X| = n}. Therefore 0 ≤ |X| F ≤ 2^|X| , the upper bound being reached when X is shattered. It is interesting to note that, although s (U,F) (n) ≤ 2^n , its growth is polynomially bounded when the VC dimension is finite (Sauer's lemma).
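The growth function and the polynomial bound it satisfies under finite VC dimension can be illustrated on a toy class (brute force, illustrative only):

```python
from itertools import combinations
from math import comb

def traces(set_class, X):
    """|X|_F: number of distinct intersections X ∩ B for B in the class."""
    X = frozenset(X)
    return len({X & frozenset(B) for B in set_class})

def growth(universe, set_class, n):
    """s(n) = max |X|_F over subsets X of size n (brute force)."""
    return max(traces(set_class, X) for X in combinations(universe, n))

def sauer_bound(d, n):
    """Sauer-Shelah: s(n) <= sum_{i<=d} C(n, i) when the VC dimension is d."""
    return sum(comb(n, i) for i in range(d + 1))

# Initial segments on {0,...,4} have VC dimension 1, so s(n) <= n + 1.
U = range(5)
segments = [set(range(t + 1)) for t in range(5)]
assert all(growth(U, segments, n) <= sauer_bound(1, n) for n in range(1, 5))
```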
The question arising here is how the VC dimension changes if some processing is applied: either (1) some data processing before applying the ML process, or (2) a transformation of the values returned by the classifier function. The first one is, roughly speaking, a data pre-processing.

Three-way decision modeling through evaluation
Among the various 3WD frameworks put forward by the research community [63], only those based on evaluations will be considered here. Given an evaluation f ∈ U [0,1] , the elements of U are classified as accepted, rejected, or unknown (identifying 0 as false and 1 as true) by means of the decision regions associated to f: POS f = f −1 (1), NEG f = f −1 (0), and BND f = f −1 ((0, 1)), respectively (see [63] for more details). Since the three sets form a partition of U, the 3WD decision can be specified by the pair ⟨POS f , NEG f ⟩. The class of all 3WD decisions on U is partially ordered in the following way: ⟨X 1 , Y 1 ⟩ ≤ ⟨X 2 , Y 2 ⟩ if and only if X 1 ⊆ X 2 and Y 1 ⊆ Y 2 (in this case, the latter decision is said to be more effective). The 3WD general framework considers two main tasks (see Fig. 3): trisecting and acting [64]. The first one splits the data space into the three regions, whilst the second one is devoted to applying specific strategies to each. Thinking of the Learning Problem, the second task has to solve the decision on the uncertainty region.
Other decision regions can be obtained by taking thresholds on the evaluations. The idea is to consider as decided those data whose uncertain value is close to a decision value. There are several ways to formalize this idea (including the use of fuzzy logic). For the examples in this paper, the following regions will be considered, given α and β with 0 ≤ α + β ≤ 1: POS f α,β = {x ∈ U ∣ f(x) ≥ 1 − α}, NEG f α,β = {x ∈ U ∣ f(x) ≤ β}, and BND f α,β = {x ∈ U ∣ β < f(x) < 1 − α}. Please note that this notation extends the previous one (for α = β = 0), and produces more effective 3WD decisions, that is, ⟨POS f , NEG f ⟩ ≤ ⟨POS f α,β , NEG f α,β ⟩.
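A minimal sketch of the threshold-based regions, assuming the convention that x is accepted when f(x) ≥ 1 − α and rejected when f(x) ≤ β (the evaluation values below are hypothetical):

```python
def regions(universe, f, alpha=0.0, beta=0.0):
    """(α, β)-decision regions of an evaluation f : U → [0, 1].
    alpha = beta = 0 recovers the strict regions f⁻¹(1) and f⁻¹(0)."""
    assert 0 <= alpha + beta <= 1
    pos = {x for x in universe if f(x) >= 1 - alpha}
    neg = {x for x in universe if f(x) <= beta}
    bnd = set(universe) - pos - neg
    return pos, neg, bnd

# Hypothetical evaluation on five data points.
f = {0: 0.0, 1: 0.1, 2: 0.5, 3: 0.9, 4: 1.0}.get
U = range(5)
p0, n0, b0 = regions(U, f)            # strict: POS = {4}, NEG = {0}
p1, n1, b1 = regions(U, f, 0.2, 0.2)  # relaxed thresholds enlarge POS and NEG
# ⟨p0, n0⟩ ≤ ⟨p1, n1⟩: the thresholded decision is more effective.
```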

(Fuzzy) formal concept analysis
The information format used in Formal Concept Analysis (FCA) is organized in a so-called formal context, a triple (G, M, I), where G is a set of objects, M is a non-empty set of attributes, and I ⊆ G × M is an incidence relation. For example, Fig. 4 (left) shows a formal context describing fishes. Given A ⊆ G and B ⊆ M, the derivation operators are defined as A′ = {m ∈ M ∣ gIm for all g ∈ A} and B′ = {g ∈ G ∣ gIm for all m ∈ B}; a formal concept is a pair (A, B) such that A′ = B and B′ = A (A is its extent and B its intent). The set of concepts is endowed with the structure of a lattice by means of the subconcept relationship ≤, and this lattice is complete [20]. The Hasse diagram of the concept lattice associated with the formal context of Fig. 4 (left) is shown in Fig. 4 (right). In this representation, each node is a concept, and its intent (resp. extent) is formed by the set of attributes (resp. objects) found along the path to the top (resp. bottom) concept. For example, the bottom concept could be interpreted as euryhaline fish. Note that for this concept there is no proper term of the language within the attribute set to denote it; thus, it is something new. This is an example of how FCA can be used as a concept mining tool.
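The derivation operators and the enumeration of formal concepts can be sketched by closing every attribute subset (adequate only for small contexts; the fish/habitat data below are hypothetical, loosely inspired by the example of Fig. 4):

```python
from itertools import chain, combinations

def extent(ctx, B):
    """B′: objects having every attribute in B (ctx maps object → attribute set)."""
    return {g for g, ms in ctx.items() if B <= ms}

def intent(ctx, A):
    """A′: attributes common to every object in A."""
    attrs = set.union(*ctx.values())
    return {m for m in attrs if all(m in ctx[g] for g in A)}

def concepts(ctx):
    """All formal concepts (A, B) with A′ = B and B′ = A, obtained by closing
    every attribute subset (brute force, fine for small contexts)."""
    attrs = sorted(set.union(*ctx.values()))
    found = set()
    for B in chain.from_iterable(combinations(attrs, k)
                                 for k in range(len(attrs) + 1)):
        A = extent(ctx, set(B))
        found.add((frozenset(A), frozenset(intent(ctx, A))))
    return found

# Toy context (hypothetical): fishes and habitat attributes.
ctx = {"eel": {"river", "sea"}, "trout": {"river"}, "shark": {"sea"}}
cs = concepts(ctx)
# The concept of objects living in both habitats plays the "euryhaline" role.
assert (frozenset({"eel"}), frozenset({"river", "sea"})) in cs
```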

Fuzzy formal concept analysis
There is an extensive bibliography on extending FCA to deal with vagueness by means of Fuzzy Logic [38,48], which has become a subfield of its own, the so-called Fuzzy FCA (FFCA). There are general proposals of what a fuzzy formal context/concept should be [11,34], as well as others for specific applications (e.g., [41,50,57,67]). Although more general approaches exist, the one selected here relies on inducing crisp sets by selecting thresholds for the fuzzy relation; more specifically, thresholds for the attributes, considered as fuzzy predicates.
Example 1 already shows a fuzzy formal context. The fuzzy relation I induces a fuzzy membership function for each attribute m ∈ M , defined by μ m (g) = I(g, m), which turns m into a fuzzy predicate.
There exist several ways of defining concepts in FFCA [50,67]. The formalization selected here is similar to, for example, that of [8], but making the attribute sets fuzzy instead of the object sets. The fuzzy derivation operator ′ on a (crisp) set of objects X ⊆ G yields the fuzzy attribute set X′ given by X′(m) = min {I(g, m) ∣ g ∈ X}; on a fuzzy attribute set B, it yields the (crisp) object set B′ = {g ∈ G ∣ B(m) ≤ I(g, m) for all m ∈ M}. In this way, a structure analogous to the concept lattice for classical formal contexts can be obtained.
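A sketch of this one-sided formalization, under the assumption that object sets stay crisp and derivations on them yield min-based fuzzy attribute sets (the toy degrees are hypothetical, and the operators are one plausible reading of the construction, not necessarily those of [8]):

```python
def obj_derive(I, objects, X):
    """X′ as a fuzzy attribute set: membership of m is the min over g in X
    of the fuzzy incidence I(g, m)."""
    attrs = {m for (_, m) in I}
    return {m: min(I[(g, m)] for g in X) for m in attrs}

def attr_derive(I, objects, B):
    """B′ as a crisp object set: g qualifies if I(g, m) >= B(m) for every m."""
    return {g for g in objects if all(I[(g, m)] >= v for m, v in B.items())}

# Toy fuzzy context (hypothetical degrees).
objects = {"eel", "trout"}
I = {("eel", "sea"): 0.9, ("eel", "river"): 0.6,
     ("trout", "sea"): 0.1, ("trout", "river"): 1.0}

B = obj_derive(I, objects, {"eel"})   # fuzzy intent of {"eel"}
closure = attr_derive(I, objects, B)  # objects at least as "eel"-like
assert closure == {"eel"}
```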

On 3WD and learning
As already discussed in the introduction, extensions of F obtained by some 3WD method are considered; then F 3WD ⊆ U [0,1] . The trisecting-and-acting model can thus be reformulated as follows (Fig. 5): the new class F 3WD refines the boundary region in the trisecting task. Accordingly, the decision-making task has to be refined as well.
In functional terms, the overall result of the refinement of the Trisecting-and-Acting model sketched in Fig. 3 should be an answer function; a decision function (values in {0, 1} ) that extends the decision regions of the initial function.

Definition 9
• The class of answer functions, ANS , is the class of binary functions U {0,1} (equivalently, those evaluations f with BND f = ∅ ). Given f ∈ U [0,1] , an answer for f is a function g ∈ ANS such that POS f ⊆ POS g and NEG f ⊆ NEG g .
In general, the step from F to F 3WD may not preserve decision regions.

Definition 10
Let F be a hypothesis class and F 3WD an extension of F . It is said that F 3WD preserves the decisions of F if every f ∈ F has an extension g ∈ F 3WD such that POS f ⊆ POS g and NEG f ⊆ NEG g .
Please note that 3WD-based decision techniques would produce an answer function from f, which can be considered a post-processing of the output of f. An example of an answer that comes from a 3WD technique is the functional version of the well-known Closed World Assumption (CWA) from Nonmonotonic Reasoning in AI (cf. [21]).
Let f ∈ U [0,1] . The closed answer induced by f is the function f cwa = cwa ◦ f, where cwa ∶ [0, 1] → {0, 1} is defined by cwa(y) = 1 if y = 1, and cwa(y) = 0 otherwise. Since any f ∈ F induces a default answer f cwa , a VC dimension can be assigned by default to any evaluation class.
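The closed answer can be sketched as a simple post-processing, assuming (as with the CWA) that anything not fully certain is rejected; the evaluation values are hypothetical:

```python
def cwa(y):
    """Closed World Assumption on [0, 1] evaluations: only full certainty
    counts as acceptance; everything else is rejected."""
    return 1 if y == 1 else 0

def closed_answer(f):
    """f_cwa = cwa ∘ f: an answer function (empty boundary) induced by f."""
    return lambda x: cwa(f(x))

# Hypothetical evaluation: one accepted, one boundary, one rejected element.
f = {0: 1.0, 1: 0.7, 2: 0.0}.get
g = closed_answer(f)
assert [g(x) for x in range(3)] == [1, 0, 0]  # boundary element 1 is rejected
```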
In functional terms, the description of a 3WD-based decision-making improvement process (sketched in Fig. 5) consists in the design of two operators: the first one embeds F into F 3WD in order to provide more learning power (possibly the simple inclusion), while the second one is an operator on F 3WD that produces answers. The second operator aims to solve the indecision problem for any f ∈ F 3WD , and reflects the modification of strategy II (shown in Fig. 5). The composition of the second operator with the first could be answer preserving.

Extending the hypothesis class
At this point, it is necessary to consider how the extension of the hypothesis class affects the learning consistency. By Thm. 1, it suffices to study how the VC dimension changes. In particular, we are interested in the following question: whether dim VC (F) < ∞ implies dim VC (F 3WD ) < ∞ (equivalently, whether the 3WD technique preserves PAC learning). Of course, the answer to this question depends on the particular selection of the 3WD closure.
Example 2 (taken from [55]) Suppose F is a finite set of linearly independent real-valued functions on U, and consider as F 3WD the ℝ-vector space generated by F . Then dim VC (F 3WD ) is finite (bounded in terms of the dimension of the vector space). Thus, in this case, F 3WD preserves PAC learnability.
To illustrate the results, throughout the paper, we will use a particular example of 3WD-closure (that we will call contraction). Nevertheless, the study can be carried out for any F 3WD .

Extending the hypothesis class by contraction
The 3WD closure introduced in this section focuses on refining BND f .

Definition 13
Given α and β with 0 ≤ α + β < 1 , the (α, β)-contraction function c α,β ∶ [0, 1] → [0, 1] is defined by c α,β (y) = 1 if y ≥ 1 − α; c α,β (y) = 0 if y ≤ β; and c α,β (y) = y otherwise. The class of contraction functions is denoted by CON . This class is amenable to performing a modification of the trisecting task.

Proposition 1 CON uniformly preserves the decisions for any hypothesis class.
A particular case of F 3WD using CON is the class F c = {c α,β ◦ f ∣ f ∈ F, 0 ≤ α + β < 1}. The transformation of a function f ∈ F by means of a contraction is a relatively simple (computable) operation that extends the decision regions (see Fig. 6). In the Fuzzy Logic realm, the definition of the contraction itself, f α,β = c α,β ◦ f , shows how contraction functions can be interpreted as (restrictive) external modifiers. Its implementation in the ML model is expected to be feasible; moreover, one can easily re-estimate the empirical risk for f α,β .
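A sketch of the contraction as a post-processing of evaluations, under the threshold convention assumed in this rewrite (accept when y ≥ 1 − α, reject when y ≤ β); the evaluation values are hypothetical:

```python
def contraction(alpha, beta):
    """c_{α,β}: [0,1] → [0,1]. Push near-certain values to a full decision
    and leave the rest untouched, shrinking the boundary region."""
    assert 0 <= alpha + beta < 1
    def c(y):
        if y >= 1 - alpha:
            return 1.0
        if y <= beta:
            return 0.0
        return y
    return c

def contract(f, alpha, beta):
    """f_{α,β} = c_{α,β} ∘ f: a 3WD post-processing of the evaluation f."""
    c = contraction(alpha, beta)
    return lambda x: c(f(x))

# Hypothetical evaluation on three data points.
f = {0: 0.05, 1: 0.5, 2: 0.95}.get
g = contract(f, 0.1, 0.1)
assert [g(x) for x in range(3)] == [0.0, 0.5, 1.0]  # boundary retracts to {1}
```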
By using 3WD contractions, the acceptance/rejection regions are expanded whilst the boundary region is retracted. This produces a more effective 3WD decision: ⟨POS f , NEG f ⟩ ≤ ⟨POS f α,β , NEG f α,β ⟩. The learning problem is thus transformed through 3WD into the problem of finding f α 0 ,β 0 such that R(A, f α 0 ,β 0 ) = min {R(A, f α,β ) ∣ f ∈ F, 0 ≤ α + β < 1}, where R(A, f) is the risk associated with the function f. This is equivalent to addressing the classical learning problem under ERM using functions from F c instead of F , with the possibility of obtaining a smaller risk.
In some cases, the learning problems for the two classes F and F 3WD are equivalent, since both classes allow learning the same sets. However, this might not be true in general (as will be shown when analyzing the case of FCA).
The issue raised by extending F is whether the consistency of the ERM process is preserved; equivalently, whether dim VC (F) < ∞ implies dim VC (F 3WD ) < ∞ .

Introducing 3WD VC dimension
To address the problem of PAC learnability preservation, a reformulation of the VC dimension is introduced in this section. Recall that, although the study is carried out for F c , any 3WD closure F 3WD would support a similar one. To simplify the notation, β = 0 will be considered (the definitions and results are analogous for the general case). Thus, it is necessary to work with the extension of the hypothesis class F to F c α = {c α,0 ◦ f ∣ f ∈ F} (note that F c 0 = F ).
In terms of classes of sets associated to functions, the definition would be as follows.

Definition 16
Let (U, F) , with F being a hypothesis class on U, A ⊆ U , and α ∈ [0, 1] .
• The α-cut of F is the class F c α = {c α,0 ◦ f ∣ f ∈ F}.
• A is 3WD-shattered by F c α if A is shattered by the associated set class B F c α .
• The 3WD VC dimension, dim 3VC (F, α) , is the maximum size of a set 3WD-shattered by F c α (it may be infinite).
As already mentioned, it may occur that dim 3VC (F, α) = dim VC (F) , or even that B F c = B F for some 3WD closures. Let us show an example.

Example 3 (extension by contraction preserving VC dimension).
Consider the class FuzzyCirc of membership degree functions associated to circles in the plane with center (0, 0).
It is not difficult to check that the VC dimension of this class is 1. Given 0 ≤ α ≤ 1, the contraction f_α induces the same class of crisp sets; therefore B_FuzzyCirc = B_{FuzzyCirc^c}, and hence dim_3VC(FuzzyCirc, α) = dim_VC(FuzzyCirc).

Bearing in mind that dim_3VC(F, α) ≥ dim_VC(F), and that both are natural numbers, the following cases are possible:

1. dim_3VC(F, α) = dim_VC(F) for all α. In this case, if dim_VC(F) < ∞ then F^c preserves PAC learnability.
2. dim_3VC(F, α) = ∞ and dim_VC(F) < ∞. The new class F^c has more shattering capacity than the original one. By Thm. 1, a convergent learning method based on minimizing the empirical risk is not available for such α.
3. Both dimensions are finite but distinct. In this case, PAC learnability is also preserved.
Two sub-cases can be distinguished:

• lim_{α→0} dim_3VC(F, α) = dim_VC(F). Since the dimension is a natural number, from a certain α on both dimensions are equal (we may reduce ourselves to case (1) by taking α small enough).
• lim_{α→0} dim_3VC(F, α) ≠ dim_VC(F). In this case, for some small enough α, the dimension remains constant and greater than dim_VC(F). Therefore it is possible to use contractions of F to act on the regions BND_f in order to shatter sets of greater size. However, the bounds on the empirical error could be greater than those of the original hypothesis class.
Despite increasing the VC dimension (which may pose a problem), case (3) could be interesting when working with discrete data. It could even be interesting to study the maximum value of dim_3VC(F, α) (e.g., for tasks such as categorization). This would imply a better ability to characterize datasets (learning) from their available attributes/features. However, on the downside, the new concepts may not be easily interpretable. This issue will be revisited for a particular case in the second part of the paper.
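Returning to Example 3, the behaviour behind case (1) can be checked numerically: the crisp sets induced by FuzzyCirc are concentric disks, a nested family that shatters any single point but no pair of points. The following sketch uses a finite sample of radii as stand-ins for the cuts of the class; this sampling, and the helper name, are our own assumptions:

```python
def shatters(predicates, points):
    """A family shatters `points` iff its traces realise all 2^n subsets."""
    traces = {tuple(p(x) for x in points) for p in predicates}
    return len(traces) == 2 ** len(points)

# Crisp sets induced by FuzzyCirc: concentric disks centred at (0, 0).
disks = [lambda p, r=r: p[0] ** 2 + p[1] ** 2 <= r ** 2
         for r in (0.5, 1.0, 1.5, 2.0, 3.0)]

one = shatters(disks, [(1.0, 0.0)])               # single point: shattered
two = shatters(disks, [(1.0, 0.0), (2.0, 0.0)])   # nested family cannot separate
```

Since the disks are totally ordered by inclusion, the trace (∅, {inner point}) is realizable but ({outer point} without the inner one) is not, which is exactly why the dimension stays at 1.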

Differential VC dimension
Replacing F with F^3WD can be costly for the efficiency of the ML model, since it can increase both the VC dimension (which may even become infinite) and the dimensionality of the dataset itself. However, the ML procedure usually works on finite subclasses of F^3WD. This suggests studying a version of the VC dimension for finite extensions.
The following result guarantees that it is possible to work with finite extensions for any hypothesis class.
The relationship between dim_VC(F) and dim_VC(F^3WD) for a selected 3WD closure remains to be studied.
So far, a general analysis of the enhancement of F to F^3WD has been developed. The following sections are devoted to instantiating the general framework outlined above for the case of FFCA.

The semantic VC dimension
This section aims to study how, in FCA, the extension/transformation of the attribute set influences the ability of a formal context to approximate a set by using concepts. FCA provides a learning model; a formal context K = (G, M, I) induces a hypothesis class ext(K) composed of (the extents of) its concepts, i.e., drawn from 𝔅(K) [26,28].
The concepts (actually, their characteristic functions) can be extended (or transformed) into another class of functions through some 3WD method, obtaining new contexts of the type (G, F, I_F), as defined in the introduction. First, let us examine the crisp case of FCA. The so-called semantic VC dimension is the direct translation of the VC dimension to FCA.

Definition 18 Let K = (G, M, I) be a formal context and O ⊆ G.

K shatters O if every subset of O is of the form O ∩ ext(C) for some concept C ∈ 𝔅(K). The semantic VC dimension of K (also called SVC dimension, denoted by dim_sVC(K)) is the maximum size of an object set shattered by K.
Since |𝔅(K)| ≤ 2^|M|, and 2^|O| concepts are required to shatter an object set O, it holds that dim_sVC(K) ≤ |M|. In functional terms, the hypothesis class would be the class of characteristic functions of concept extents.

Example 4 Considering K from Fig. 4, it holds that dim_sVC(K) ≥ 2; and it cannot be 3, since in that case it would be necessary that |𝔅(K)| ≥ 8.
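For finite contexts, the SVC dimension can be explored directly by enumerating concept extents and traces. The sketch below (function names are our own) computes the extents as intersections of attribute extents and checks shattering; it is exponential in |M| and only meant for small examples such as Example 4:

```python
from itertools import combinations

def extents(objects, attributes, incidence):
    """All concept extents of a finite context (G, M, I), obtained as
    intersections of attribute extents (plus G, the extent of the top)."""
    attr_ext = {m: frozenset(g for g in objects if (g, m) in incidence)
                for m in attributes}
    exts = {frozenset(objects)}
    for r in range(1, len(attributes) + 1):
        for ms in combinations(attributes, r):
            e = frozenset(objects)
            for m in ms:
                e &= attr_ext[m]
            exts.add(e)
    return exts

def is_shattered(exts, O):
    """O is shattered iff every subset of O is a trace O ∩ ext(C)."""
    traces = {frozenset(O) & e for e in exts}
    return all(frozenset(s) in traces
               for r in range(len(O) + 1)
               for s in combinations(sorted(O), r))

# A two-element context with incidence "different from" (a contranominal
# scale, cf. the next subsection) shatters its whole object set:
G = M = {1, 2}
I = {(g, m) for g in G for m in M if g != m}
exts = extents(G, M, I)
```

On this context the four extents realise every subset of {1, 2}, matching the bound dim_sVC(K) ≤ |M| with equality.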

Related notions
The dimension defined above is global and exogenous in nature. That is, all the elements of the lattice 𝔅(K) can be used to shatter any set of the data space (global), and shattering is not restricted to concept extents only (exogenous). This feature contrasts with another VC dimension for lattices introduced in [12].
In [12], Cambie et al. introduce a dimension of a partial VC type, in the sense that it is computed for subsets F of a ranked lattice ⟨L, ≤⟩ instead of the full lattice. Additionally, it is endogenous in nature; that is, it concerns those elements of the lattice itself that F shatters using the meet operation. Formally, F ⊂ L is said to shatter an element c ∈ L if and only if every y ≤ c can be written as y = f ∧ c for some f ∈ F. From this notion of shattering, the definition of the corresponding VC dimension follows naturally: dim*_VC(F) is the maximum rank of the elements shattered by F. A lattice is called an SSP lattice [12] if the corresponding version of the Sauer-Shelah-Perles lemma is satisfied for any F. Whether the SSP lattices are exactly the relatively complemented ones (those that do not have any 3-element interval) is an open problem.
In the case of finite concept lattices, one can endow a concept lattice with the height as rank, defined as the supremum of the lengths of all chains joining the smallest element of the lattice with the considered element. In this case, any concept shattered in the sense of our definition induces the lattice of all its subsets. Therefore the height (used here as rank) of a concept C is |ext(C)|, and consequently both shattering notions coincide on concept extents.
The so-called contranominal scale represents a bridge between the semantic dimension and other studies on concept lattices. Given a set A, the context N_c(A) = (A, A, ≠) is its contranominal scale. If A is shattered, then the formal context K_A = (A, M, I↾_{A×M}) is isomorphic to N_c(A) ([3], Lemma 29). In the particular case of A being the extent of a concept, 𝔅(K_A) is also a sublattice of 𝔅(K). The semantic dimension would thus be the size of the largest subset of G inducing a subcontext isomorphic to a contranominal scale. This way, the following result, stated by Albano (Thm. 3 of [2]), follows from the bound on the size of the concept lattice. In a later work [3], Albano and Chornomaz complement the results of [2] by studying B(k)-free contexts. These are contexts into which B(k), the Boolean lattice with k atoms, cannot be (order-)embedded; in our terms, those with dim_sVC(K) < k. In this case the bound on the size of the concept lattice improves accordingly; moreover, this bound is sharp (Sect. 4 of [3]).
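The defining property of the contranominal scale is easy to verify computationally: under the incidence ≠, the derivation of X ⊆ A is A ∖ X, so X'' = X and every subset of A is a concept extent. A small sketch, for a three-element A of our own choosing:

```python
from itertools import combinations

def derive(X, A):
    """Derivation in N_c(A) = (A, A, ≠): X' = {m : g ≠ m for all g in X}."""
    return frozenset(m for m in A if all(g != m for g in X))

A = frozenset({1, 2, 3})
subsets = [frozenset(s) for r in range(len(A) + 1)
           for s in combinations(sorted(A), r)]

# Every subset is closed (X'' = X), so every subset of A is an extent:
closed = all(derive(derive(X, A), A) == X for X in subsets)
num_extents = len(subsets)  # 2^|A| extents, hence dim_sVC(N_c(A)) = |A|
```

Thus the concept lattice of N_c(A) is the Boolean lattice with |A| atoms, which is exactly why embedded contranominal scales witness the semantic dimension.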

Semantic dimension and Learning consistency in FCA
Via the semantic VC dimension, it is possible to analyze features of FCA as a model for ML. Note that the use of FCA may involve working with complex ML models. According to what has been shown, finite class approximations can be used for the semantic VC dimension. Roughly speaking, the following result states that, for formal contexts with finite SVC dimension, one should not expect many subsets of big concepts to be semantically characterizable.
Since C is a concept, any subset of ext(C) of the form ext(C) ∩ ext(D) is already the extent of a concept: if A = ext(C) ∩ ext(D) for some D ∈ 𝔅(K), then A = ext(C ∧ D), and the bound follows. Note that, by Thm. 1, the following consequence holds, which is interesting for infinite formal contexts.

Corollary 3 Any formal context with finite SVC dimension is PAC learnable.
Moreover, due to Lemma 2, the SVC dimension would not be increased after applying some type of pre-processing.

The new notion of dimension introduced next comes from applying the 3WD paradigm, introduced in Sect. 5, to fuzzy formal contexts.
A fuzzy formal context K = (G, M, I) has a default semantic dimension, associated with the operator cwa, where g I_cwa m ⟺ I(g, m) = 1 (i.e., cwa(m(g)) = 1). The following example will be taken up later.

Semantic differential dimension
The Differential VC dimension (DVC dimension, Subsect. 5.1) can be instantiated for fuzzy formal contexts once the extension F^c has been built. Two issues should now be addressed. On the one hand, it has already been noted that, if the attributes are regarded as data dimensions, the step from F to F^c involves an increase in dimensionality. On the other hand, recall that the contractions c_{α,β} apply to the outputs of attributes; that is, it is necessary to work with attributes of the form c_{α,β} ∘ m (in terms of fuzzy logic, these could be, for example, external modifiers of the fuzzy predicates, as illustrated in Ex. 1). Thus, the bound on the VC dimension presented in Lemma 2 does not apply. As a consequence, the consideration of the new class F^3WD may cause an increase in the VC dimension.
To transform K into a classical formal context, it is necessary to resolve the non-decision on the attribute m for each object g with 0 < m(g) < 1, that is, for the objects belonging to the region BND_m, by selecting crisp attributes m_α. It is possible to choose more than one crisp predicate for the same fuzzy attribute (e.g., as a consequence of using several external modifiers). Thus, the SVC dimension can be increased beyond |M|. For example, if only one defuzzification of each attribute is taken (as occurs with K_1), it would not be possible to shatter the three-object set, since dim_sVC(K_1) ≤ 2. However, by making multiple defuzzifications (that is, multiple decisions about the region BND_m), such a bound can be surpassed (see formal context K_2).
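A toy sketch of multiple defuzzifications: one fuzzy attribute m produces one crisp attribute per threshold, so the crisp context may end up with more than |M| columns. The threshold convention m_α(g) = 1 iff m(g) ≥ 1 − α is our own reading of the contraction, and the membership degrees below are invented for illustration:

```python
def defuzzify(m, objects, alphas):
    """Several defuzzifications of a single fuzzy attribute m: each alpha
    yields a crisp attribute whose extent is {g : m(g) >= 1 - alpha}."""
    return {a: frozenset(g for g in objects if m[g] >= 1.0 - a)
            for a in alphas}

# One fuzzy attribute over three objects...
m = {"g1": 0.95, "g2": 0.6, "g3": 0.2}
# ...becomes three crisp attributes, one per decision on BND_m:
cuts = defuzzify(m, m.keys(), (0.1, 0.5, 0.9))
```

The three extents form a chain {g1} ⊂ {g1, g2} ⊂ {g1, g2, g3}; combining such chains coming from several fuzzy attributes is what may push the SVC dimension beyond |M|.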
In general, it may occur that dim_sVC(K[F]) = ∞. This phenomenon suggests considering the FFCA version of the DVC dimension (based on finite classes) in order to avoid it. The incidence relation I of the formal context K[F] can be expressed as shown in Fig. 8, and from it membership (X, Y) ∈ 𝔅(K[F]) can be characterized. Reasoning as with intervals in ℝ, it is easy to see that dim_sVC(K[F]) = 2.
The above example suggests the following issue: for a formal context with an infinite object set, the VC dimension of some defuzzification could be infinite even if M is finite. In Sect. 8 it is shown that this is not possible using M^c. Also, finite approximations can be used:

Proof Apply Thm. 3. ◻
A straightforward consequence of the proposition is that PAC-learnable classes are achieved with concepts from finite subcontexts induced by finite subclasses of M^c. Otherwise, the new contexts are not useful for PAC learning, and it would be necessary to refine the set of new predicates to be used. We will see that this situation cannot arise if M is finite.
The result corresponding to Lemma 1, which shows that any finite subclass of F can be used, would be the following:

Learning with finitely generated concepts
The use of FFCA as a conceptual learning model can also have drawbacks, as occurs with FCA. For example, it may involve the use of concepts that cannot be specified by a finite set of attributes. This section examines this issue.
In the case of a formal context with finite VC dimension, Thm. 1 ensures convergence. That is, if for each sample S_n a function f_{D_n} minimizing the empirical risk is selected, then the empirical risk converges to the infimum of the risk, which, in the case of FCA with Q(x, y, f) = |y − f(x)|, can be rewritten accordingly. Hence, on the sample, the difference between A and the chosen concept is close in probability to 0.
For FFCA, the defuzzification of the class F^c brings the learning problem back to FCA, although with the peculiarity that it is necessary to work with a potentially infinite attribute set. This could represent a difficulty (bearing in mind that natural conceptualization processes often involve characterizing concepts by finite attribute sets). Therefore, it is necessary to study ERM consistency using only finitely generated concepts, in the following sense:

Definition 23 Let K = (G, M, I) be a formal context.

• A concept of K is finitely generated (f.g.) if it is generated by a finite set of attributes.
• K is finitely generated if every concept of K is finitely generated.
• 𝔅_f(K) is the sub-lattice of 𝔅(K) whose elements are the finitely generated concepts.

Some examples
Example 12 (fuzzy formal context with infinite VC dimension, generated by a class of finite VC dimension but with concepts that are not f.g.) Consider the class whose functions are the fuzzy membership functions for circles in the plane, and the fuzzy formal context K_O defined from it. To see the latter claim, note, for example, that a segment AB is the extent of a concept; the same holds for any convex polygon. Thus these concepts are not f.g. in K_O, although they are f.g. in other formal contexts such as K_C (Example 5).

Example 13
The extents of 𝔅_f(K_C), for the formal context K_C from Example 5, are the convex polygons. Therefore, it also holds that dim_VC(𝔅_f(K_C)) = ∞.

Example 14 (formal context with finite semantic VC dimension but not finitely generated) Let K_r be defined as follows.

It is verified that dim_sVC(K_r) = 2, as any concept of K_r is an angular region, and the VC dimension of this class is 2. Consider the concept C of 𝔅(K_r); it is easy to check that C ∈ 𝔅(K_r) ⧵ 𝔅_f(K_r).
Example 15 (formal context which is not f.g. but dim_VC(𝔅_f(K)) = dim_sVC(K)) Let K_≥ = (ℝ, ℝ, ≥). Note that any m ∈ ℝ, considered as an attribute, has the half-line [m, +∞) as its extent. With this in mind, it is straightforward to see that the extents form a chain, and therefore dim_VC(𝔅_f(K_≥)) = dim_sVC(K_≥). However, the concept C = (∅, ℝ) is not f.g.

Preserving PAC learnability working with f.g. concepts
Since in conceptualization processes one usually works with concepts characterized by a finite number of attributes (that is, f.g. concepts), it has to be studied whether it is possible to achieve ERM consistency by only using f.g. concepts as the hypothesis class. Note that any concept is approximable by f.g. concepts in the following sense: any D ∈ 𝔅(K) is the infimum of the f.g. concepts above it. Recall that this fact does not imply a finite VC dimension: for the formal context K_C from Ex. 5, dim_sVC(K_C) = dim_VC(𝔅_f(K_C)) = ∞. However, it is not necessarily true that any concept is approximable by an enumerable sequence of f.g. concepts; that is, it is not true in general that any D ∈ 𝔅(K) can be characterized as the infimum of some sequence {C_n}_n of f.g. concepts. This property would be useful to replace the concepts involved in ERM by f.g. concepts, preserving the convergence required in Def. 3.
Example 16 (formal context with a concept not approximable by an enumerable sequence of f.g. concepts) Consider the contranominal scale on ℝ, K_≠ = (ℝ, ℝ, ≠). Then dim_sVC(K_≠) = ∞, since the concept set is {(X, ℝ ⧵ X) : X ⊆ ℝ}. Moreover, it is easy to see that the extent of any f.g. concept is co-finite. Therefore, for any sequence {C_n}_n ⊆ 𝔅_f(K_≠), the extent of its infimum is co-countable, hence non-empty. Thus, the concept (∅, ℝ) cannot be approximated by an enumerable sequence of f.g. concepts.
The previous example seems to suggest that there may be no convergence to the infimum of the empirical risk using f.g. concepts, for example when the infimum is reached by concepts that are not approximable by such an enumerable sequence. To prove that this circumstance does not occur, the strategy has to be reformulated: the idea is to work with the lattices of f.g. concepts instead of working with the limit of the empirical risk. The following theorem guarantees ERM consistency using f.g. concepts. The proof is carried out by checking that the (finite) VC dimension is preserved.

Theorem 2 Let K be a formal context. Then dim_sVC(K) = dim_VC(𝔅_f(K)).
Proof It is only necessary to prove that dim_sVC(K) ≤ dim_VC(𝔅_f(K)).
Let A be a finite set shattered by 𝔅(K). To prove that it is also shattered by 𝔅_f(K), it suffices to show that for every D ∈ 𝔅(K) there exists C ∈ 𝔅_f(K) such that A ∩ ext(C) = A ∩ ext(D). Since D is finitely approximable, let C_0 ∈ 𝔅_f(K) be such that D ≤ C_0.
For the same reason, for each z ∈ (A ∩ ext(C_0)) ⧵ (A ∩ ext(D)) there exists C_z ∈ 𝔅_f(K) such that D ≤ C_z and z ∉ ext(C_z). Then consider C = C_0 ∧ ⋀_z C_z, whose trace on A coincides with that of D. We need to check that this concept is f.g.
Since |(A ∩ ext(C_0)) ⧵ (A ∩ ext(D))| < ∞, C is a finite meet of f.g. concepts, and therefore it is f.g. Hence, any set shattered by 𝔅(K) is also shattered by 𝔅_f(K). ◻

Finally, combining the above results, we can restrict ourselves to f.g. concepts of the 3WD extension K[M^c]. In formal terms:

Definition 24
Let k ∈ ℕ. The hypothesis class 𝔅_≤k(K) is the class of all concepts of K that are finitely generated by an attribute set of size at most k.
Let us see an example in which the finiteness of the VC dimension is preserved by these classes.
Example 18 (formal context with finite VC dimension for bounded f.g. concepts) Ex. 5 started with the hypothesis class H, the set of half-planes in ℝ². Then dim_VC(H) = 3, whilst dim_sVC(K_C) = +∞. It is straightforward to compute dim_VC(𝔅_≤3(K_C)), since the extents of the concepts generated by at most three half-planes are half-planes, angular regions, bands and triangles. In general, the concepts of 𝔅_≤k(K_C) are the convex polygons with at most k sides.
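That dim_VC(H) ≥ 3 for half-planes can be verified with an explicit witness. The eight (a, b, c) triples below, describing half-planes a·x + b·y ≤ c, are hand-picked by us for illustration; each realises a different trace on the three chosen points:

```python
# Half-planes {(x, y) : a*x + b*y <= c}, one per subset of three points.
# The coefficients are our own choices, not taken from the paper.
half_planes = [
    (1, 1, -1.0),    # trace {}
    (1, 1, 0.5),     # trace {P1}
    (-1, 0, -0.5),   # trace {P2}
    (0, -1, -0.5),   # trace {P3}
    (0, 1, 0.5),     # trace {P1, P2}
    (1, 0, 0.5),     # trace {P1, P3}
    (-1, -1, -0.5),  # trace {P2, P3}
    (1, 1, 2.0),     # trace {P1, P2, P3}
]
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # P1, P2, P3

traces = {tuple(a * x + b * y <= c for (x, y) in points)
          for (a, b, c) in half_planes}
shattered = len(traces) == 2 ** len(points)
```

All eight traces appear, so the three points are shattered; by Radon's theorem no four points in the plane can be shattered by half-planes, which gives dim_VC(H) = 3.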
The following result is the translation, to formal contexts, of the fact that a finite Boolean combination of functions from a hypothesis class with finite VC dimension also has finite VC dimension [14]. It shows that, when the formal context is PAC learnable, there exists a k_0 such that 𝔅_≤k_0(K) is not only PAC learnable, but can also be used granting the same error bounds as the original class.

On 3WD closures preserving PAC learnability
Different examples have previously shown how the VC dimension changes when F is extended to F^3WD. This section presents a sufficient condition for the preservation of the finiteness of the VC dimension, which can be applied to the particular case of F^c. The following result will be used.
Let C = C_1 ∪ ⋯ ∪ C_n, where each C_i is a collection of subsets of U that is linearly ordered by inclusion. Then dim_VC(C) ≤ n + 1.
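A concrete instance of the result, with n = 2: left rays {x ≤ a} and right rays {x ≥ b} are two collections linearly ordered by inclusion. A quick check (the sampled parameters and helper name are our own choices) shows they shatter a pair of points but not a triple, consistently with the bound n + 1 = 3:

```python
def shatters(predicates, points):
    """A family shatters `points` iff its traces realise all 2^n subsets."""
    traces = {tuple(p(x) for x in points) for p in predicates}
    return len(traces) == 2 ** len(points)

# Two inclusion-chains on the reals: left rays and right rays.
rays = ([lambda x, a=a: x <= a for a in range(5)] +
        [lambda x, b=b: x >= b for b in range(5)])

two = shatters(rays, [1, 3])       # a pair of points is shattered
three = shatters(rays, [1, 2, 3])  # the trace {2} alone is never realised
```

No ray can contain the middle point without one of the endpoints, so the union of the two chains cannot shatter three collinear points.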
The following theorem shows that, for formal contexts with a finite attribute set, any extension by contraction has a finite VC dimension. On the other hand, it is verified that the contractions of a single attribute induce a class of crisp sets that is linearly ordered by inclusion.

linearly ordered. Then
Proof To adapt the above proof, let

Conclusions, related and future work
This paper formalizes and analyzes the impact of the 3WD paradigm on ML models through the study of (variants of) the VC dimension. The study has been carried out at two levels. The first and more general one concerns the enhancement of the hypothesis class by means of some 3WD method. The idea lies in the fact that any 3WD technique that reduces the boundary regions of the hypotheses impacts the VC dimension, whose finiteness is essential to preserve ERM consistency. The second level concerns a case study instantiating the general analysis to (F)FCA, understood as a model for concept mining (or categorization) from data. The starting idea has already been considered in other works (e.g., [19,26,30]). In the present approach, the hypothesis class is the class of definable sets (the extents of concepts). The option of using only definable sets is not a new idea (e.g., o-minimality and VC dimension in [13]). In this work, we show how to extend the description language (i.e., the attribute set) using a particular 3WD closure, M^c. However, the analysis can be performed for other options. Other multi-source research also aims to improve the quality of information (which in turn enriches learning) in order to enhance decision procedures, for example for agents [59] and multi-source information systems [45], and even for combining both in multi-agency [36,49]. We believe that the approach developed here can enhance such works and others such as [30]. In the latter, Li et al. analyze how to learn one exact or two approximate cognitive concepts from a given object set or attribute set. Our proposal can be seen as complementary to that of [67], where the authors seek to minimize the use of attributes by associating with them an estimation of their significance.
Our formalization can also help to enrich other approaches, for example those addressing the granularity of selection/updating [30,33,53], as well as those addressing the management of data dynamics [33]. The analysis of both the SVC and DVC dimensions for different hypothesis classes might be useful to analyze other (incremental) concept learning approaches [23,29,60,68]. The consideration of more effective 3WD decisions for shattering complements, to some extent, the cited works, for example by allowing 3WD closures that imply managing contradictory 3WD decisions (that is, there can exist (X₁, Y₁), (X₂, Y₂) with X₁ ∩ Y₂ ≠ ∅ or X₂ ∩ Y₁ ≠ ∅). This extension could be useful for decision reconsideration. Moreover, with the goal of estimating the VC dimension, we use different (variable) thresholds for building 3WD decisions. This differentiates our approach from the above (multi-decision) 3WD approaches, in which the thresholds are fixed in advance to build the 3WD cognitive operators [23,29].
Other approaches that consider the idea of classifier specialization in the context of FCA were already mentioned in the introduction [19,27]. In the first one, a specialization relation on the classifiers (viewed as conceptual graphs) endows the hypothesis class with a semilattice structure. The second is focused on the version space, starting from three formal contexts induced by positive, negative, and undecided examples. Its characterization using the corresponding Galois connections allows, for example, isolating the elements that can be classified positively by at least one classifier of the version space. Furthermore, working with the semilattice structure of the classifiers, the existence of a minimal positive hypothesis is granted. Both papers work on the specification of the version space and the existence of classifiers satisfying different requirements. We think that our approach could be adapted to estimate the ERM consistency of learning procedures working on different hypothesis classes (subclasses of classifiers of the version space) in the case of infinite data spaces.
The use of the DVC dimension suggests a balance problem: on the one side, the need to manage a relatively small number of attributes (data dimensions); on the other, the fact that using a greater number of them yields richer categorizations. Reducing the number of attributes allows a better specification of the knowledge, for which factors such as cost sensitivity [16,25] could be considered. Furthermore, refining the specification facilitates the interpretation of the concepts obtained by data processing [56]. The design of criteria to select (a minimal set of) adequate properties (features) is an open issue in several 3WD models [18,43]. Any attribute selection may modify the VC dimension and will therefore have an impact on error estimation. In the case of the DVC dimension, a selection exists. However, we do not deal with minimal attribute sets; that is, we are not concerned with selecting a minimal hypothesis class F ⊆ M^c such that dim_sVC(K[F], α) = dim_DVC(K, α). Moreover, no attempt has been made to satisfy additional constraints, for example that a class contains only attributes that actually say something relevant about the objects [56]. Such a refinement would represent a type of attribute reduction.
The attribute reduction problem was first considered by Ganter and Wille in seminal works [20]. The present case is slightly different and raises some algorithmic questions (in line with [67]), which will be the aim of future work. Another future research line is the analysis of the impact of techniques to optimize the SVC dimension, by studying the Attribute Topology [68] associated with defuzzifications. Algorithms working on this structure may offer more efficient solutions to the computation of the VC dimension. In addition, the impact of 3WD closures on implications will be considered in the future, following ideas from [3].