Introduction

The development and use of explainable models in the field of artificial intelligence (AI) are intended to promote confidence and transparency regarding the reliability of AI procedures among end users and thus to increase their acceptance. Regarding the creation of explainable models, learning classifier systems (LCS) feature a fundamental advantage as they rely on local rules to describe the state space of an environment. These rules allow a simple mapping between states and the system’s decisions and improve its interpretability. The most widely studied LCS in terms of formal theoretical analysis [22] and empirical evaluation is currently the XCS classifier system (XCS) [39]. It represents an evolutionary rule-based online machine learning technique characterized by an inherent generalization pressure, which is hypothesized to result in accurate and maximally-general rules. In the context of XCS, these rules are also called classifiers. A classifier advocates an action for a specific region of the state space of the environment under consideration, specified by the condition of the classifier.

Yet, an exclusive pursuit of generalization should be treated with caution, as an excessive pressure to generalize or the lack of a pressure to specialize is likely to impede or prevent the creation of accurate and maximally-general rules. As shown in several papers [2, 15, 18], under certain circumstances an excessive generalization pressure occurs in XCS, leading to the formation of so-called over-general classifiers that reduce the performance and accuracy of XCS. Over-general classifiers are classifiers that are only capable of advocating the correct action for a subset of the states covered by their condition, as they match incompatible regions or niches of the considered state space [15]. A more detailed description of over-generalization in XCS is provided in section “Generalization and Over-Generalization in XCS”.

By using intervals in the continuous real-valued input space to define a classifier condition, the XCS variants XCS for real-valued input spaces (XCSR) [36] and XCS for function approximation (XCSF) [37] theoretically permit infinite variations in each dimension of the condition. In contrast, the traditional XCS with binary inputs only allows three different states per dimension. Thus, the concepts of generalization and over-generalization differ significantly between traditional XCS and the variants XCSR and XCSF, with XCSR and XCSF being characterized by a considerably increased complexity due to the large number of variations. A more detailed description of XCS, XCSR and XCSF is provided in section “Real-Valued XCS in a Nutshell”.

Until now, no solution has been proposed for the problem of identifying and handling over-general classifiers in XCSR and XCSF that guarantees an accurate population during the learning phase. In contrast, for traditional XCS, the Absumption mechanism [21] and the Specify operator [17] have been proposed, which offer a basis for adaptations to XCSR and XCSF. Since both methods aim at solving the same problem, we refer to them as Over-Generality Handling (OGH) in this paper.

This work is part of our overarching aspiration to deal with the still unsolved issue of over-generalization in XCS when applied to real-valued problem spaces, cf. e.g., [28]. We focus on XCS(R) as it forms the basis for many descended systems such as UCS [3] and ExSTraCS [33], while additionally investigating the effects of OGH on the variant XCSF. We also deem the capability of XCS to learn complete state-action mappings, more formally \(X \times A \rightarrow P\), a crucial feature in our pursuit of developing robust and reliable online learning systems that allow for interpretability as well as self-reflection of the evolved knowledge [27, 29].

The contributions of this work are: (1) The evaluation of the application of Absumption and of Specify to gain insights into their impact on the learning performance of XCSR and XCSF. (2) The introduction of identification and specialization strategies for OGH in real-valued problem spaces. (3) A comparative study of the use of Absumption and of Specify in XCSR, applied to several well-known toy problems (single- and multi-step), and agricultural real-world classification tasks. (4) A comparative study of the use of Absumption and of Specify in XCSF, applied to three well-known regression tasks from the field of global optimization, while assessing different over-generalization tendency levels.

We proceed with a brief overview of approaches to remove inappropriate classifiers in LCS in the following section, followed by the necessary background information on XCS and OGH. The subsequent section introduces the adapted versions of the OGH approaches for XCS in real-valued problem spaces, i.e., XCSR and XCSF, including new specialization and identification strategies. In “Evaluation”, the results of empirical studies on various classification tasks, a multi-step problem and three regression tasks are reported and thoroughly discussed. We conclude with a short summary and a description of future work in “Conclusion”.

This article is an extension of an original conference paper published at the Evostar 2021 conference [34]. It additionally includes: new experiments on the application of OGH to XCSF for regression tasks; a more thorough discussion of both the new results and the results of the original conference paper; more elaborate descriptions of the algorithms of the proposed techniques using pseudocode.

Related Work

Early on, over-generalization was part of the research in the field of XCS [14], but so far this has only resulted in procedures for XCS with binary inputs. The Absumption method [21] is the most recent procedure for XCS with binary inputs, especially designed for single-step problems. Absumption is based on the inconsistency of an over-general classifier’s condition, since it contains incompatible niches, i.e., niches that cannot be correctly handled by the same action. After identification, a more specific version of an over-general classifier is generated by changing a Don’t-Care symbol # to a specific value. The Specify operator [17], a previously introduced procedure for XCS with binary inputs, was developed to handle the problem of over-general classifiers in multi-step environments. This approach employs the prediction error to identify over-general classifiers, since the inevitably arising oscillation in the prediction of an over-general classifier is recognizable this way. The specialization of an over-general classifier is performed similarly to Absumption. In addition to the previously mentioned procedures developed especially for XCS with binary inputs, there are further specialization mechanisms in other LCSs. In the field of evolutionary learning systems, rule specialization methods have also been introduced, such as the so-called memetic operators in BioHEL [8]. The Mutespec operator [10], introduced for the ALECSYS classifier system, bases the identification of over-general classifiers on the variance of the rewards received by a classifier. It has to be noted that over-general classifiers are termed oscillating classifiers in the publications on Specify and Mutespec. Furthermore, ACS2 [6] applies a specialization mechanism, which, however, was not specifically designed to identify over-general classifiers. Finally, an approach for the specialization of rules beyond LCSs, introduced in [11], is the Specialize operator for a genetic learning system operating on high-level symbolic rules.

In addition to approaches that focus on the specialization of rules, further approaches are available to remove unnecessary or inappropriate rules from the population in order to achieve an optimal solution. The idea of condensation was discussed in [39] and further enhanced in [16]; it employs the GA with the genetic operators turned off in order to remove classifiers with a low level of fitness from the population. The compaction algorithm [9, 40] forms a compact rule set after the learning phase of XCS is completed by finding the smallest subset of the population that achieves maximum performance. In [5], the compaction algorithm has been extended by closest classifier matching, which is applied at the beginning of the compaction process to prevent the resulting population from no longer covering the entire problem space.

Background

In this section, we start by briefly providing the necessary general information about XCS and its learning scheme to facilitate comprehension. We then give a brief introduction to over-generalization in XCS, which includes a definition of over-general classifiers along with a description of their negative impact on XCS. However, a certain familiarity with temporal difference learning in general is assumed due to space constraints. For a more detailed introduction to temporal difference learning and XCS, the reader is referred to [31] and [7, 36], respectively.

Real-Valued XCS in a Nutshell

The evolutionary rule-based machine learning system XCS was introduced by Wilson in [35, 39] and belongs to the family of Michigan-style LCSs, which are attributed to Holland [12]. As XCS heavily relies on a niche-relative variant of Holland’s genetic algorithm (GA), which employs a measure of the relative accuracy of a rule’s prediction as fitness, XCS represents the origin of the family of accuracy-based LCSs [32]. In contrast, the GA of its so-called strength-based relatives, such as ZCS [38], directly employs the rule’s prediction value, e.g., an estimate of the expected cumulated reward in reinforcement learning settings, as fitness. In the context of LCS, the prediction value is also called strength. As the GA enables continuous optimization of the structure of the rules, XCS is able to adapt to changes in its environment while in use, which is an important factor for reliable deployment in real-world applications, such as field robotics. Thus, the evolutionary learning approach of XCS constitutes a particular strength in adapting to an environment.

[Algorithm 1: single pass of the main learning loop of XCS]

As proposed by Wilson’s generalization hypothesis [35], XCS aims to construct a complete, accurate and maximally-general state-action map, more formally defined as \(X \times A \rightarrow P\). The set of all possible environmental states \(\varvec{\sigma }_t\) is called the input space X. The input space is also known as the state space. The action space A represents all possible actions and P denotes the payoff space. In XCS, the state-action map is represented by a population \([P] := \{cl_i\}_{i \in \mathbb {N}}\) of rules or classifiers, usually just called cl and defined as \(cl := (C, a, p, \epsilon , F)\). The classifier condition \(cl.C \subseteq X\) defines a subset of X, which is matched by the classifier. \(cl.a \in A\) represents the action advocated by cl to be performed by the system when the current state falls into the classifier’s condition. cl.p denotes the prediction of cl, which represents an estimate of the payoff received from the environment in the case that cl.a is performed within the classifier condition. The use of local learning in XCS by means of classifiers facilitates both the interpretability and the reflection of the acquired knowledge, e.g., whether knowledge gaps or contradictions exist. In cl.\(\epsilon\), an estimate of the absolute prediction error is stored. cl.F represents the fitness of cl, estimating the niche-relative accuracy of the payoff prediction of the classifier. Furthermore, additional book-keeping parameters are maintained in XCS: the number of performed reinforcement updates of a cl is stored in cl.exp, and its numerosity, i.e., the number of micro-classifiers it represents, which is increased by successful subsumption operations, is stored in cl.num. To control the GA invocation, a timestamp parameter cl.ts is updated each time a cl was a candidate for the GA. For more details about the classifiers and their attributes and parameters, the interested reader is referred to [7].
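To make the notation concrete, the following minimal sketch (in Python, not the authors’ implementation) collects the classifier attributes described above in a single record; the initial values are placeholders rather than XCS defaults, and the interval-predicate form of the condition anticipates the real-valued representation introduced below.

```python
# Minimal sketch (not the authors' implementation) of a classifier record with the
# attributes described above; the initial values are placeholders, not XCS defaults.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Classifier:
    condition: List[Tuple[float, float]]  # cl.C: interval predicates (l_i, u_i), one per input dimension
    action: int                           # cl.a: advocated action
    prediction: float = 10.0              # cl.p: payoff prediction
    error: float = 0.0                    # cl.epsilon: estimate of the absolute prediction error
    fitness: float = 0.01                 # cl.F: niche-relative, accuracy-based fitness
    experience: int = 0                   # cl.exp: number of reinforcement updates received
    numerosity: int = 1                   # cl.num: number of micro-classifiers represented
    timestamp: int = 0                    # cl.ts: last time the classifier was a GA candidate
```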

Algorithm 1 presents a detailed description of a single pass of the main learning loop of XCS. The loop is repeated as long as the defined termination criteria, e.g., a maximum number of learning steps, are not satisfied. When applied to single-step problems, i.e., classification, the actions are usually selected by alternating explore and exploit steps instead of an \(\epsilon\)-greedy approach. \(U_{[0,1]}\) draws a value uniformly at random from the given interval. For a more in-depth explanation of the learning loop of XCS, the reader is referred to [7].
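For orientation, the following high-level sketch outlines one such pass; every helper (build_match_set, best_predicted_action, update_classifiers, run_ga_if_due) is a hypothetical placeholder for the corresponding XCS component and not part of the original algorithm listing.

```python
# High-level sketch of one pass of the main loop (cf. Algorithm 1), under the assumption
# of alternating explore/exploit steps as used for single-step problems.
import random

def xcs_step(population, env, explore: bool, t: int, params):
    state = env.observe()                                      # current state sigma_t
    match_set = build_match_set(population, state, t, params)  # includes covering if needed
    if explore:
        action = random.choice(sorted({cl.action for cl in match_set}))  # random action (explore)
    else:
        action = best_predicted_action(match_set)              # highest system prediction (exploit)
    action_set = [cl for cl in match_set if cl.action == action]
    reward = env.execute(action)
    update_classifiers(action_set, reward, params)             # reinforcement component
    # an enabled OGH approach would be invoked here, right after the reinforcement
    run_ga_if_due(action_set, population, state, t, params)
    return reward
```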

The basic structure of XCS is highly flexible regarding possible tasks, which is also reflected in the good extensibility of XCS. Thus, many extensions to XCS have already been introduced to expand its application range, cf. e.g., [33, 36, 37]. Two extensions of particular interest in the context of this work are: XCS for real-valued inputs (XCSR) [36] and XCS for function approximation (XCSF) [37]. XCSR introduces so-called interval predicates \((l_i,u_i)\) to define a classifier condition, combined with minor modifications to deal with real-valued inputs. \(l_i\) and \(u_i\) denote a lower and upper bound, respectively, for dimension i of the input space X. As a result of using interval predicates to define a classifier condition, the conditions have the geometric form of axis-parallel hyperrectangles.
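As an illustration, a state matches such a hyperrectangular condition if it lies within the interval of every dimension; the sketch below assumes inclusive bounds, which may differ from a concrete implementation.

```python
# Sketch of condition matching under the interval-predicate representation: a state
# matches a hyperrectangular condition iff it lies inside the interval of every
# dimension (inclusive bounds are an assumption here).
from typing import List, Tuple

def matches(condition: List[Tuple[float, float]], state: List[float]) -> bool:
    return all(l <= x <= u for (l, u), x in zip(condition, state))

cond = [(0.2, 0.6), (0.0, 0.5)]   # 2-dimensional condition covering [0.2, 0.6] x [0.0, 0.5]
assert matches(cond, [0.3, 0.1])
assert not matches(cond, [0.7, 0.1])
```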

XCSF builds on the extension of XCSR and additionally employs a single dummy action and classifier predictions based on a predictive model instead of a simple scalar. The predictive model usually employs a linear function \(h(\varvec{\sigma }_t)\) that computes the classifier prediction as a function of the environmental state \(\varvec{\sigma }_t\), which is an n-dimensional vector of the input space. The function is defined as \(h(\varvec{\sigma }_t) = (\varvec{\sigma }_t^* - l^*) \times w\). \(\varvec{\sigma }_t^*\) refers to \(\varvec{\sigma }_t\) extended by a leading 1. \(l^*\) refers to the vector \(l = (l_1,...,l_n)\), which comprises the lower bounds of the interval predicates of the classifier, extended by a leading 0. \(w = (w_0,\dots , w_n)\) denotes an (n+1)-dimensional weight vector. As proposed in [20], the weights are now commonly updated using Recursive Least Squares (RLS) rather than the modified delta rule originally proposed. For a more detailed description of XCSR and XCSF, the reader is referred to [36] and [37], respectively.
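The following worked sketch evaluates this linear prediction for a small example; the concrete numbers are purely illustrative.

```python
# Worked sketch of the linear XCSF prediction h(sigma_t) = (sigma_t* - l*) . w.
import numpy as np

def xcsf_linear_prediction(sigma: np.ndarray, lower_bounds: np.ndarray, w: np.ndarray) -> float:
    sigma_star = np.concatenate(([1.0], sigma))      # sigma_t extended by a leading 1
    l_star = np.concatenate(([0.0], lower_bounds))   # lower bounds l extended by a leading 0
    return float(np.dot(sigma_star - l_star, w))     # w = (w_0, ..., w_n)

# 2-dimensional example: 1.0*w_0 + (0.4-0.2)*w_1 + (0.7-0.5)*w_2 = 1.0 + 0.4 - 0.6 = 0.8
print(xcsf_linear_prediction(np.array([0.4, 0.7]), np.array([0.2, 0.5]), np.array([1.0, 2.0, -3.0])))
```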

Generalization and Over-Generalization in XCS

Based on Wilson’s description in [35], a system uses generalization whenever it treats seemingly different situations in the same way if these situations lead to equivalent after-effects for the system when performing the same action. A major problem for LCS is the unintended formation of over-general classifiers, i.e., classifiers that are only capable of advocating the correct action for a subset of their condition [15], as already noted in section “Introduction”. These classifiers can degrade the performance of a system since they may result in the selection of an incorrect action and propagate incorrect information. Since over-general classifiers match incompatible niches, they possess an oscillating prediction, as they are only locally accurate, which usually causes them to be inaccurate.

Wilson’s Generalization Hypothesis [35], which explains the process of generating accurate and maximally-general classifiers in XCS, conjectures that over-general classifiers are removed from the population due to their inaccuracy. However, under certain circumstances the counteracting fitness signal is not sufficient, so that the generalization pressure present in XCS leads to over-general classifiers. Since XCS tolerates small oscillations of the classifier prediction due to the error threshold hyperparameter \(\epsilon _0\), it is possible that an inaccurate classifier is considered accurate if the oscillation is small enough [15]. Thus, as described by Lanzi in [18], an over-general classifier must be observed often enough (or at all) by XCS to cause sufficient oscillation of the prediction for the classifier to be recognized as inaccurate.

We refer to this problem as unequally observed environmental niches from the point of view of the classifiers, which can be caused by non-uniform sampling or unequally sized niches. The problem of unequally sized niches can also occur, from a classifier’s perspective, in environments with globally equally sized niches if the different types of niches are not (approximately) equally contained in the classifier’s condition. Another problem that causes XCS to consider over-general classifiers accurate is the issue of indistinguishable payoff levels of different environmental niches, which is associated with multi-step problems [2]. This issue can arise due to the application of a discounted reward or the presence of dominant classifiers, which are over-general classifiers that result in a misleading fitness signal. In comparison to binary inputs, these problems are generally more serious for real-valued inputs: on the one hand, the number of possible observations in the possibly continuous input range is substantially increased; on the other hand, an infinite number of variations of the intervals is theoretically possible in the individual dimensions of a condition. For a more detailed description of generalization and over-generalization in the context of XCS, we refer the reader to [2, 15, 18] and [4].

Over-Generality Handling for Real-Valued XCS

To address the challenge of over-generality in real-valued problem spaces by identifying and suitably handling over-general classifiers, we design two Over-Generality Handling (OGH) approaches, one based on the Absumption mechanism in [21] and the other based on the Specify operator in [17]. If OGH is enabled, it is invoked after the reinforcement component in each iteration of XCS’ main loop. We first introduce the basic structure and the identification strategies of both approaches; the specialization strategies, which create more specific versions of an identified over-general classifier, are shared between both variants of OGH and are described afterwards.

[Algorithm 2: Absumption for real-valued problem spaces]

As shown in Algorithm 2, the adapted version for real-valued problem spaces uses a more generic basic structure compared to the original Absumption [21], i.e., both the identification and the specialization strategy employed are exchangeable. Based on the REA ratio of an over-general classifier \(cl_{og}\), Absumption decides whether a \(cl_{og}\) in [overG] is removed from the population or decomposed into more specific versions. REA is the ratio of \(cl_{og}\)’s experience to the dimensionality of its condition, i.e., an indicator of the useful information contained in \(cl_{og}\). The threshold value is set to 1 as described in [21]. Thus, if the REA of \(cl_{og}\) is at least 1, the applied specialization strategy decomposes \(cl_{og}\) into a number of more specific versions defined by \(cl_{og}.num\); otherwise, \(cl_{og}\) is simply removed.
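A minimal sketch of this decision logic, assuming the classifier record from above and a specialization strategy passed in as a function; the bookkeeping around removal and re-insertion is simplified compared to Algorithm 2.

```python
# Sketch of the Absumption decision (cf. Algorithm 2), not the authors' exact code:
# an identified over-general classifier is removed and, if its REA ratio is at least 1,
# replaced by cl_og.num more specific versions created by the specialization strategy.
REA_THRESHOLD = 1.0  # threshold from [21]

def absumption(over_general_set, population, specialize):
    for cl_og in over_general_set:
        rea = cl_og.experience / len(cl_og.condition)  # experience relative to condition dimensionality
        population.remove(cl_og)
        if rea >= REA_THRESHOLD:
            for _ in range(cl_og.numerosity):
                population.append(specialize(cl_og))
```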

The first strategy we propose for detecting over-general classifiers, called Inconsistency, is based on the mechanism proposed in [21]. The algorithmic description of Inconsistency is shown in Algorithm 3. It utilizes the idea of over-general classifiers matching incompatible niches by tracking the number of positive rewards (NPR) and the number of negative rewards (NNR) a classifier has received. If \(\text {NPR} * \text {NNR} > 0\) is satisfied for a classifier, i.e., it has received both positive and negative rewards, it is considered to be over-general and is added to [overG].
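A minimal sketch of this check, assuming each classifier additionally maintains the counters npr and nnr:

```python
# Sketch of the Inconsistency check (cf. Algorithm 3); cl.npr and cl.nnr are assumed
# counters of positive and negative rewards received by the classifier.
def identify_inconsistent(population):
    over_general_set = []
    for cl in population:
        if cl.npr * cl.nnr > 0:   # the classifier has received both positive and negative rewards
            over_general_set.append(cl)
    return over_general_set
```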

[Algorithm 3: Inconsistency identification strategy]
[Algorithm 4: Prediction Deviation Variance identification strategy]

As Inconsistency can only be applied to single-step classification problems, we introduce a second identification strategy, Prediction Deviation Variance, which enables identification in multi-step problems as well as in regression tasks. This strategy performs the identification based on the increased prediction oscillation of an over-general classifier. For each classifier cl, the absolute deviation \(|cl.p - r_t|\), where \(r_t\) denotes the current reward at time t, is tracked in a FIFO buffer within cl, denoted as cl.dev. As described in Algorithm 4, a classifier cl is only checked for over-generality by this strategy if \(cl.exp > \theta _{OGCIdent}\) is satisfied. If the variance of the absolute deviations in cl.dev is greater than twice the average of the deviation variances of the entire population, denoted as \(dev^*\), cl is regarded as over-general and is added to [overG]. This criterion was determined in preliminary experiments; it does not claim to be optimal, but has been found to yield improved results.
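The following sketch restates this check; how \(dev^*\) is averaged over the population (e.g., whether weighted by numerosity) is an assumption here.

```python
# Sketch of the Prediction Deviation Variance check (cf. Algorithm 4). cl.dev is assumed
# to be a FIFO buffer of absolute deviations |cl.p - r_t| maintained elsewhere.
import statistics

def identify_by_deviation_variance(population, theta_ogc_ident: int):
    buffered = [cl for cl in population if len(cl.dev) > 1]
    if not buffered:
        return []
    dev_star = statistics.mean(statistics.pvariance(cl.dev) for cl in buffered)  # population average
    return [cl for cl in buffered
            if cl.experience > theta_ogc_ident
            and statistics.pvariance(cl.dev) > 2.0 * dev_star]
```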

Specify for real-valued XCS is based on the version for binary inputs in [17], which already employs an identification strategy applicable to multi-step problems as well as to single-step classification and regression tasks. The process of Specify for real-valued XCS is shown in Algorithm 5. This OGH approach uses an indirect mechanism to detect over-general classifiers based on the condition \(\epsilon _{[A]}\) \(\ge\) 2 \(* \ \epsilon _{[P]}\) proposed in [17], where \(\epsilon _{[A]}\) and \(\epsilon _{[P]}\) denote the average prediction error of the current action set and of the population, respectively.

[Algorithm 5: Specify for real-valued XCS]
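A minimal sketch of this trigger condition, assuming \(\epsilon _{[A]}\) and \(\epsilon _{[P]}\) are plain (unweighted) averages of the classifiers’ error estimates:

```python
# Sketch of the indirect Specify trigger: the current action set is suspected to contain
# an over-general classifier if its mean prediction error is at least twice that of the
# population (plain unweighted averages are an assumption here).
def specify_triggered(action_set, population) -> bool:
    eps_a = sum(cl.error for cl in action_set) / len(action_set)
    eps_p = sum(cl.error for cl in population) / len(population)
    return eps_a >= 2.0 * eps_p
```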

We propose two specialization strategies that return the required number of more specific versions of an identified classifier \(cl_{id}\), utilizing the center point \(\varvec{c}\) of \(cl_{id}\)’s hyperrectangular condition \(cl_{id}.C = (\varvec{l}^{cl_{id}}, \varvec{u}^{cl_{id}})\), i.e., \({c_i = l_i^{cl_{id}} + \frac{1}{2}(u_i^{cl_{id}} -l_i^{cl_{id}})}\) for \(i= 1...n\). The first method is the New Condition Specialization (NCS) strategy. Its mode of operation is shown in Algorithm 6. A new classifier \(cl_{new}\) is generated by computing the so-called interval predicate \((l_i,u_i)\) of condition \(cl_{new}.C\) for each dimension \(i= 1...n\) as follows: \(l_i = \max \{l_i^*, c_i - U_{[0, r_0)}\}\) and \(u_i = \min \{u_i^*, c_i +U_{[0, r_0)}\}\). \(l_i^*\) and \(u_i^*\) denote the minimum and maximum bounds of the problem space for dimension i, respectively. In \(U_{[0, r_0)}\), the given standard spread parameter \(r_0\) is excluded. The other attributes of \(cl_{new}\) are initialized with the attribute values of \(cl_{id}\), except \(cl_{new}.exp = 0\) and \(cl_{new}.num = 1\). However, as a simple and fast strategy, NCS does not guarantee that \(cl_{new}.C\) is fully contained within \(cl_{id}.C\).

[Algorithm 6: New Condition Specialization (NCS)]
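A minimal sketch of NCS under these definitions; copy_classifier is a hypothetical helper that copies all attributes of the identified classifier, and random.uniform may include the upper bound \(r_0\), whereas the strategy excludes it.

```python
# Sketch of NCS (cf. Algorithm 6): the new condition is spread around the center point of
# cl_id's condition and clipped to the problem-space bounds.
import random

def ncs(cl_id, space_bounds, r0: float):
    new_condition = []
    for (l_id, u_id), (l_min, u_max) in zip(cl_id.condition, space_bounds):
        c = l_id + 0.5 * (u_id - l_id)                   # center point c_i
        l_new = max(l_min, c - random.uniform(0.0, r0))  # l_i = max{l_i*, c_i - U_[0, r0)}
        u_new = min(u_max, c + random.uniform(0.0, r0))  # u_i = min{u_i*, c_i + U_[0, r0)}
        new_condition.append((l_new, u_new))
    cl_new = copy_classifier(cl_id)                      # hypothetical helper copying all attributes
    cl_new.condition = new_condition
    cl_new.experience, cl_new.numerosity = 0, 1
    return cl_new
```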

This drawback is solved by the Inside Condition Specialization (ICS) strategy, which is shown in Algorithm 7. In case of ICS, \((\varvec{l}, \varvec{u})\) of \(cl_{new}.C\) are determined differently: \(l_i = \max \{l_i^{cl_{id}}, l_i^*, c_i -U_{[0, r_0)}\}\) and \(u_i = \min \{u_i^{cl_{id}}, u_i^*, c_i +U_{[0, r_0)}\}\). Thus, \(cl_{new}.C\) is fully contained inside \(cl_{id}.C\), as the constraints \(l_i^{cl_{id}} \le l_i\) and \(u_i \le u_i^{cl_{id}}\) are satisfied. The other attributes of \(cl_{new}\) are initialized analogously to NCS.

[Algorithm 7: Inside Condition Specialization (ICS)]
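The corresponding sketch for ICS only changes the bound computation of NCS, clipping additionally against the identified condition itself:

```python
# Sketch of ICS (cf. Algorithm 7): identical to NCS except that the new bounds are
# additionally clipped against cl_id's own interval, so cl_new.C lies inside cl_id.C.
import random

def ics_bounds(l_id: float, u_id: float, l_min: float, u_max: float, r0: float):
    c = l_id + 0.5 * (u_id - l_id)
    l_new = max(l_id, l_min, c - random.uniform(0.0, r0))
    u_new = min(u_id, u_max, c + random.uniform(0.0, r0))
    return l_new, u_new
```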

Evaluation

In this section, we summarize the results of a variety of experiments we carried out to evaluate the impact of Absumption-based and Specify-based OGH on the performance of XCS variants for real-valued problem spaces. We performed experiments with XCSR on two multi-step problems and on six single-step problems, i.e., classification tasks. These tasks comprise three well-known toy problems and three real-world data sets from the agricultural domain, as the authors’ research is concerned with the usage of AI in agricultural applications. In addition, we conducted experiments with XCSF on three well-known regression tasks from the field of global optimization. Typically, other types of modern LCSs, e.g., ExSTraCS [33] or BioHEL [1], are preferred for classification tasks and for mining of data sets. However, we have chosen XCSR as it enables reliable learning at runtime, i.e., online, in systems facing real-world conditions, since it creates a complete \(X \times A \rightarrow P\) mapping. We intend to show that OGH improves the ability of XCSR to generate accurate complete mappings and, thus, to reliably perform classification tasks and stream mining of online real-world data. Furthermore, OGH is designed to be easily applicable to any system descended from XCSR, a fact highlighted by our experiments in XCSF.

All experiments have been repeated for 30 i.i.d. runs with individual random seeds. The repetition means and the observed standard deviations of the conducted experiments are given in Tables 1, 2, 3, 4, 5, 6 below. In all conducted experiments, the performance of XCSR and XCSF was evaluated using three metrics: (1) the system error, i.e., the average error of the system prediction in relation to either the actual reward received in classification and multi-step tasks or the actual function value in regression tasks; (2) the average number of macro classifiers, or the average size of the population, indicated by |[P]|; (3) the average volume of the classifier conditions in [P], i.e., the average classifier generality. In addition, the performance of XCSR was also evaluated using a fourth metric, the average reward achieved at the end of the learning period.

In the evaluation experiments, configurations with the Specify-based and with the Absumption-based OGH approach have been evaluated. They are compared with XCSR and XCSF instances without our proposed OGH techniques, which are configured according to the settings yielding the latest reported improvements on the considered problem types and serve as baselines in our experiments. Since the configurations depend on the problem types as well as on the individual problem instances, the exact configuration details are provided with the descriptions of the employed problem instances in the following sections: the benchmark problems in section “Results in Benchmark Problems”, the real-world data sets in section “Results in Real-World Problems”, the multi-step problems in section “Results in Multi-step Problems” and the regression problems in section “Results in Regression Tasks”.

For statistical evaluation, we analyzed the experimental results by means of different statistical tests for significance: a standard ANOVA paired with a Tukey-HSD post-hoc test was conducted to test for statistically significant differences if a test for homoscedasticity was positive. Otherwise, we employed the robust Welch-ANOVA in combination with a Games-Howell post-hoc test. Figures 1 and 2 show plots of the learning curves of the compared configurations over the entire learning period along with the standard deviation in the form of error bars. As commonly done in the LCS literature, the learning progress for the single-step problems is shown in the form of the reward achieved and the system prediction error or, in the case of the regression tasks for XCSF, only in the form of the system prediction error. For the multi-step problems, the learning progress is depicted by means of the steps required to reach the goal.

Results in Benchmark Problems

First, we performed an evaluation on three well-known benchmark problems, as their known problem structure facilitates an analysis with respect to specific characteristics and provides the basis for comparison with results from the literature: (1) the Real k-multiplexer problem (RMP) [36], (2) the Checkerboard problem (CBP) [30], and (3) the Mario classification problem [27]. RMP poses a challenging problem of real-valued binary classification with properties like feature interaction (epistasis) and multiple niches with the same actions (heterogeneity) [33]. For each dimension \(x_i\), with \(i = 1, ... , k\), the input space of RMP is defined in the interval \(0.0 \le x_i < 1.0\). A threshold \(\theta\) specifies the partition of each dimension into the two bit values. In this paper we applied RMP with \(\theta = 0.5\) and \(k = 6\), denoted as 6-RMP. CBP is a well-known benchmark for LCS, designed to provide an increased complexity compared to RMP [30]. The underlying problem is to predict a multidimensional checkerboard pattern consisting of black and white areas. In this paper, we used a variant of CBP with 3 dimensions and 3 divisions per dimension, denoted as CBP(3,3). Mario is a two-dimensional real-valued multi-class classification problem in the form of a 16x16 pixel art of Super Mario. Unlike CBP, it features seven different actions, comprising the various colors, and allows for different levels of generalization in the niches, e.g., the blue trousers compared to the yellow knobs. Analogously to RMP, the input space of CBP and Mario is also defined in the interval \(0.0 \le x_i < 1.0\) for each dimension \(x_i\).
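To illustrate the task, the following sketch of a 6-RMP oracle binarizes the inputs at \(\theta = 0.5\) (mapping \(x_i \ge \theta\) to 1 is an assumption) and returns the class given by the data bit addressed by the first two bits:

```python
# Sketch of a 6-RMP oracle: two address bits select one of the four remaining data bits.
def real_multiplexer_6(x, theta: float = 0.5) -> int:
    bits = [1 if v >= theta else 0 for v in x]   # binarize the six real-valued inputs
    address = 2 * bits[0] + bits[1]              # 2 address bits for k = 6
    return bits[2 + address]                     # correct action/class

print(real_multiplexer_6([0.7, 0.2, 0.1, 0.9, 0.3, 0.6]))  # bits 100101 -> address 2 -> data bit 0
```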

A single repetition of an experiment was run over 100,000 alternating explore/exploit steps. In each step, the input space of the benchmark problems was sampled using a uniform distribution. We applied a binary reward scheme for each of the benchmark problems: each correctly predicted action of a sampled situation led to a reward of 1000, otherwise 0. The intervals of a classifier condition were encoded by the unordered bound hyperrectangular representation [30]. The configuration and hyperparameters of OGH were determined in preliminary experiments. In all benchmark problems, we applied the specialization strategy NCS for both OGH approaches and the Inconsistency identification strategy for Absumption. For Specify, \(\theta _{OGCIdent}\) was set to 10 for 6-RMP and to 50 for CBP(3,3) and Mario. We adopted further hyperparameters from the literature, namely from [36] for 6-RMP (Footnote 1), from [30] for CBP(3,3) (Footnote 2) and from [27] for Mario (Footnote 3). For CBP(3,3), a value of 0.5 was selected for \(r_0\) in deviation from [30] to artificially create a tendency of XCSR to over-generalize in this problem.

Table 1 Overall results for classification tasks, i.e., reward, system error, population size and generality
Fig. 1 Learning curve plots of the conducted experiments using XCSR

According to Table 1, the application of Absumption or Specify in all considered problems is accompanied by a significant increase of the population size and by a significant reduction of the generality, except for the generality in CBP(3,3). This can be easily explained by the underlying concept of OGH, as the decomposition of over-general classifiers can lead to many redundant over-specific classifiers. The significantly improved generality in CBP(3,3) can possibly be attributed to OGH enabling XCSR to solve the problem from the over-specific side, facilitating the formation of accurate classifiers with increased generality. Regarding the reward and system error metrics, in case of 6-RMP and CBP(3,3), XCSR benefits neither from Specify nor from Absumption. Compared to standard XCS, the application of Specify results in a reward value that is not significantly different and in an increased system error. In contrast, the application of Absumption reduces the reward value and leads to the highest system error of all configurations. One possible explanation of these results is that the basic structure of these problems poses a challenge to OGH. The equally sized niches with strictly defined boundaries require the intervals of a classifier condition to lie accurately inside a niche in order for a classifier not to be regarded as over-general. Due to the random decomposition of identified over-general classifiers, it cannot be guaranteed that newly created conditions will be contained inside a niche. Especially for Absumption, this effect seems to have a major impact, as the strict evaluation of inaccuracies and their subsequent treatment seem to reduce the learning speed, as shown in the learning curve diagrams in Fig. 1a, b. For Specify, this problem seems to have less effect, which may be due to the indirect identification of at most one over-general classifier per iteration. It would therefore be of interest to realize a more focused specialization pressure by OGH, which will be the subject of future research. In Mario, the results are quite different, as this problem enables different generalizations and is known to lead to over-general classifiers due to its more inconsistent underlying structure consisting of unequally sized niches. Both OGH approaches increase the performance of XCSR and lead to a significant improvement of the reward and the system error metrics compared to the standard configuration. Special attention is drawn to Absumption, which significantly outperforms all other configurations in the aforementioned metrics. This is also reflected in the learning curves of Mario (cf. Fig. 1c) showing a significantly increased learning speed due to OGH, especially due to Absumption. Thus, OGH seems to enable a significant refinement of the learned model of a problem affected by over-general classifiers, resulting in increased accuracy and performance.

To summarize, in our experiments, the application of OGH in XCSR mainly provides significant improvements regarding the system error and reward metrics for problems characterized by the formation of over-general classifiers and no significant benefits for other problems. It is also associated with an increase in population size and, in most cases, a decrease in generality in our experiments.

Results in Real-World Problems

In addition, an evaluation with real-world data was performed to assess the effects of OGH when applied to XCSR in real-world settings. Since the authors are conducting research on the use of AI in agricultural applications, three available data sets covering different agricultural domains were selected for evaluation: (1) Paddy Leaf, (2) Horse Colic and (3) Soybean Disease. The Paddy Leaf data set (Footnote 4) from Kaggle represents a multi-class classification task based on the average RGB values of 6000 paddy leaf pictures, used to perform nitrogen fertilizer recommendation. The data set is balanced and consists of 3 real-valued attributes, indicating the average color channel values of red, green and blue present in a picture, and 4 different labels. The task of the Horse Colic data set (Footnote 5) from the UCI repository is a binary classification of whether a lesion of a horse suffering from a colic was surgical or not, based on different health attributes. The data set is unbalanced and consists of 368 instances with 21 different health attributes, either real-valued, integer-valued or boolean, and a total of 30% missing attribute entries. In case of missing entries, the attribute was assigned a default value outside the value range of the respective attribute. The last data set is Soybean Disease (Footnote 6) from the UCI repository. The task is to classify 19 different soybean diseases from 35 integer-valued marker attributes, which also contain missing entries, handled analogously to the Horse Colic data set. The data set consists of 683 instances and is highly unbalanced. W.l.o.g., we normalized the attributes to the range [0, 1] for all data sets.

Analogously to the benchmark problems before, a single repetition of an experiment was run over 50,000 alternating explore/exploit steps. In each step, an instance is drawn uniformly at random from the used data set with replacement. We also applied a binary reward scheme, i.e., a reward of 1000 for each correctly predicted action, i.e., class, and 0 otherwise. For the evaluation experiments with real-world data sets, a standard parameterization of XCSR was used (Footnote 7). The conditions were encoded by the unordered bound hyperrectangular representation. \(\theta _{mna}\) was set depending on the number of available classes: in Paddy Leaf it was set to 4, in Horse Colic to 2 and in Soybean Disease to 19. Due to the high number of attributes and available classes in Soybean Disease, N was set to 25,000 to increase the population size. The experiments for the three real-world data sets shared the same configuration for OGH: the specialization strategy ICS was employed by both approaches, the Inconsistency identification strategy was applied in Absumption and, for Specify, \(\theta _{OGCIdent}\) was set to 50.

Considering the results in Table 1, the application of Absumption results in significant performance improvements regarding the reward and system error metrics in all conducted experiments. It leads to a significant increase in learning performance and superior results compared to the other configurations. As shown in the learning curve plots in Fig. 1d–f, Absumption also causes a faster and more accurate learning of the problems, which is most evident in Paddy Leaf. Specify also leads to advantages in Paddy Leaf in terms of faster learning and a significant improvement of the reward and system error metrics; however, these are less pronounced compared to Absumption. In the other data sets, Specify does not yield any advantage, possibly due to its merely indirect identification of over-general classifiers. In contrast to the benchmark problems, OGH shows a significant increase in population size and a significant reduction in generality in only one real-world data set, i.e., Paddy Leaf. Similarly to the benchmark problems, this can be attributed to the underlying concept of OGH and is more pronounced in case of Absumption. For the other two data sets, Specify causes a significantly increased population size, unlike Absumption. This can be attributed to the direct identification of over-general classifiers by Absumption, which only performs necessary decompositions, something that is not guaranteed in case of Specify. Regarding generality, there are no significant differences between the configurations in Horse Colic and Soybean Disease.

In summary, for real-world data sets, the application of OGH, especially Absumption, in XCSR causes significant improvements of the reward and system error metrics in our experiments. The increase in population size and decrease in generality due to OGH were less pronounced in our experiments. Based on the results, we expect OGH to also provide advantages beyond the agricultural domain for problems featuring characteristics comparable to the applied data sets, such as unbalanced data or missing entries. However, we leave this for future work.

Results in Multi-step Problems

To preliminarily evaluate the potential of OGH for multi-step problems, we conducted experiments in two differently configured variants of the GridWorld environment, which is based on the Puddles environment introduced in [19]. The problem space of GridWorld is two-dimensional and each dimension \(x_i\) is defined in the interval \(0.0 \le x_i < 1.0\). In GridWorld, the task to be learned is to reach the goal at position (1, 1) in as few steps as possible. In each episode, the agent starts at a random position within the environment, excluding the goal. The agent is allowed to move within the environment with a given step size in four directions, i.e., left, right, up and down. For each step taken, the agent receives a negative reward or punishment of -0.5, except for the step leading to the goal, which results in a reward of 0. The environment also contains so-called puddles leading to an additional punishment of -2 for each puddle the agent is in. An episode ends either if the goal has been reached or if 200 steps have been taken.

In our experiments, we applied step sizes of 0.07 and 0.05, denoted as GridWorld(0.07) and GridWorld(0.05), respectively. Each repetition of the experiments was run over 10,000 episodes. As a fixed number of episodes is defined for the repetitions, the configurations and repetitions differ in the number of steps. Thus, the results of the metrics already used in the single-step experiments are determined over the first 205,000 steps and 285,000 steps for GridWorld(0.07) and GridWorld(0.05), respectively. In addition to these metrics, the performance of XCSR is evaluated over the entire repetition run using the steps to goal metric, which is calculated as the mean number of steps required to reach the goal over 100 episodes. XCSR only used explore steps in combination with an \(\epsilon\)-decay action selection regime, parameterized as follows: \(\epsilon\) = 1.0, \(\epsilon _{fin}\) = 0.02 and a decay fraction of 10% of the episodes, i.e., in the reported experiments \(\epsilon\) was decayed from 1.0 to 0.02 over the first 1000 episodes. GridWorld(0.07) and GridWorld(0.05) shared the same parameterization (Footnote 8), based on the settings for Puddles (0.1) in [19]. The conditions of the classifiers were encoded by the unordered bound hyperrectangular representation. The classifiers employed a computed prediction based on a linear prediction model using recursive least squares (RLS) [20] with a parameterization of \(\lambda _{RLS}\) = 1.0 and \(\delta _{RLS}\) = 1.0. The settings of OGH were determined in preliminary experiments: for both experiments, Absumption employed the Prediction Deviation Variance identification strategy with a FIFO-buffer size of 200, and both OGH approaches used the specialization strategy ICS as well as a \(\theta _{OGCIdent}\) of 100.
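As an illustration of this action selection regime, the following sketch computes \(\epsilon\) per episode; the linear form of the decay over the first 10% of episodes is an assumption, as the exact schedule is not specified here.

```python
# Sketch of the epsilon-decay schedule described above (linear decay is an assumption).
def epsilon_for_episode(episode: int, total_episodes: int = 10_000,
                        eps_start: float = 1.0, eps_final: float = 0.02,
                        decay_fraction: float = 0.1) -> float:
    decay_episodes = int(decay_fraction * total_episodes)   # 1000 episodes in the reported setup
    if episode >= decay_episodes:
        return eps_final
    return eps_start - (eps_start - eps_final) * episode / decay_episodes
```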

Table 2 Overall results of conducted multi-step experiments, i.e., reward, system error, population size, generality and steps to goal

As can be seen from Table 2, XCSR using OGH significantly outperforms the standard configuration in terms of the steps to goal, reward and system error metrics in both configurations of GridWorld. Analogously to Mario and the real-world problems, Absumption significantly surpasses Specify in terms of the aforementioned metrics, except for the reward and system error metrics in GridWorld(0.07). The learning curve of GridWorld(0.05) in Fig. 1h shows a faster reduction of the required steps by the OGH approaches compared to the standard configuration, with Absumption being clearly superior to Specify. In GridWorld(0.07) (cf. Fig. 1g), the difference between OGH and the standard configuration is less pronounced, since GridWorld(0.07) already seems to be well solvable by the standard configuration. The significant improvement of the steps to goal, the reward and the system error metrics indicates that the application of both Specify and Absumption in GridWorld enables XCSR to evolve a more accurate model of the underlying problem faster. Absumption provides additional benefits since it is superior to Specify. In terms of population size and generality, both OGH approaches cause a significant increase in population size and a significant decrease in generality. Once again, this can be attributed to the underlying concept of OGH, the decomposition of over-general classifiers. The intensified occurrence of this effect in Absumption can be explained by the direct identification of over-general classifiers, resulting in more classifiers being decomposed per iteration of XCSR compared to Specify.

In conclusion, the use of OGH and especially Absumption results in significantly improved steps to goal, reward and system error metrics in our experiments in multi-step environments, indicating faster learning of an accurate model. However, due to the mode of operation of OGH, these advantages are accompanied by an increase in population size and a decrease in generality in our experiments.

Results in Regression Tasks

This part of the evaluation focuses on applying OGH in XCSF when solving regression tasks from the field of global optimization. The reported results regarding regression problems were obtained by evaluations on three test functions (Footnote 9) already applied in the literature for XCSF assessment:

$$\begin{aligned} f_1(x_1,x_2)&= -(x_2 + 47) \sin \!{\left( \sqrt{\left| \frac{x_1}{2}+(x_2 +47)\right| } \right) } - \\&\quad x_1 \sin \!{\left( \sqrt{|x_1 - (x_2 + 47)|}\right) }, -512 \le x_1, x_2 \le 512\\ f_2(x_1,x_2)&= \sin \!{(4\pi (x_1 + \sin \!{(\pi x_2)}))}, 0 \le x_1,x_2 \le 1\\ f_3(\varvec{x})&= \max \!\left\{ \exp \!{\left( \!-10a^2\right) },\exp \!{\left( \!-50b^2\right) },1.25\exp \!{\left( \!-5\!\left( a^2 \! + \! b^2\right) \right) }\right\} ,\\ a&= \frac{1}{\lfloor \frac {n}{2} \rfloor } \sum ^{\lfloor \frac {n}{2} \rfloor }_{i=1}x_i, b = \frac{1}{\lceil \frac {n}{2} \rceil } \sum ^{n}_{i=\lfloor \frac {n}{2}\rfloor + 1}x_i, -1 \le x_i \le 1 \end{aligned}$$

The first function \(f_1\), a common 2-dimensional benchmark in global optimization called the Eggholder function [13], proved to be quite complex to approximate with XCSF in [26] due to its strong and repetitive curvature and its multi-modality in both dimensions. These properties are also featured by the 2-dimensional Sine-in-Sine function \(f_2\) [24], which exhibits a higher degree of variation of the curvature and was applied for XCSF assessment in [24, 25]. The third function \(f_3\), referred to as the Cross function [25], was chosen for the analysis concerning the approximation capacity of XCSF in [25] and is characterized by its mixture of linear and nonlinear subspaces. In this work, we apply \(f_3\) in its 3-dimensional form based on the generalized version proposed in [23].
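For reference, the following Python implementations follow the formulas above directly; they are provided only to make the benchmark definitions concrete.

```python
# Reference implementations of the three test functions, following the formulas above.
import math

def eggholder(x1: float, x2: float) -> float:          # f1, -512 <= x1, x2 <= 512
    return (-(x2 + 47.0) * math.sin(math.sqrt(abs(x1 / 2.0 + (x2 + 47.0))))
            - x1 * math.sin(math.sqrt(abs(x1 - (x2 + 47.0)))))

def sine_in_sine(x1: float, x2: float) -> float:       # f2, 0 <= x1, x2 <= 1
    return math.sin(4.0 * math.pi * (x1 + math.sin(math.pi * x2)))

def cross(x) -> float:                                  # f3 (generalized form), -1 <= x_i <= 1
    n = len(x)
    a = sum(x[: n // 2]) / (n // 2)                     # mean of the first floor(n/2) inputs
    b = sum(x[n // 2:]) / (n - n // 2)                  # mean of the remaining ceil(n/2) inputs
    return max(math.exp(-10.0 * a * a), math.exp(-50.0 * b * b),
               1.25 * math.exp(-5.0 * (a * a + b * b)))
```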

Each experiment repetition features 200,000 explore steps, or learning steps in the context of XCSF. In each step, the test functions’ input space was randomly sampled following a uniform distribution. The settings of OGH were determined in preliminary experiments: for all of these experiments, a \(\theta _{OGCIdent}\) of 100 as well as the specialization strategy ICS were applied for both OGH approaches, and Absumption employed the Prediction Deviation Variance identification strategy with a FIFO-buffer size of 200. The intervals of a classifier condition were encoded by the unordered bound hyperrectangular representation. In XCSF, classifiers learn a model for predicting the payoff as a function of the input instead of simply holding a scalar estimate. In this paper, we applied both a linear prediction model using RLS to update the prediction values and a radial basis function (RBF) based interpolation approach as introduced in [29]. RLS was applied using a parameterization of \(\lambda _{RLS}\) = 1.0 and \(\delta _{RLS}\) = 1.0, and RBF using a Thin-Plate-Spline basis function, defined as \(\theta (r) = r^2 \log (r)\). To interpolate cl.p, in each experiment, the classifiers contain a set cl.Sp of up to 50 samples \(s_i\) representing situation-reward pairs, more formally defined as \(s_i = (\varvec{x}_t, r_t)\). \(\varvec{x}_t\) denotes an input or feature vector at time t and \(r_t\) denotes the associated reward or function value. To ensure robust interpolation, during the interpolation of cl.p for a situation \(\varvec{x}\), it is checked whether the interpolated function value \(f(\varvec{x})\) satisfies \(r_{min}< f(\varvec{x}) < r_{max}\), where \(r_{max}\) and \(r_{min}\) denote the largest and smallest reward observed so far among all \(s_i \in cl.Sp\). If this condition is met, \(f(\varvec{x})\) is chosen as cl.p, otherwise Nearest Neighbor Interpolation is applied as a fallback. The mixing strategy used in our experiments to compute the system prediction based on the considered classifier predictions is analogous to the strategy used in [29]. Further hyperparameters of XCSF were taken from the literature, in particular [29] (Footnote 10). To assess the presence of a specialization pressure by OGH in XCSF, a tendency to over-generalize was artificially induced in XCSF. For this purpose, different levels for the initial spread value \((r_0)\) for newly generated conditions were selected, namely 0.1, 0.5, and 1.0, which reflect a stronger tendency to over-generalization with increasing values. Usually, smaller \(r_0\) values are chosen only in case of existing prior knowledge about the environment and a corresponding necessity, since a value chosen too small unnecessarily generates many transient, redundant classifiers. Thus, especially for environments with little prior knowledge, there is a more severe over-generalization problem due to the default choice of larger \(r_0\) values. In Tables 3, 4, 5 and 6, the \(r_0\) values used in the different experiments are provided with the respective results. In addition, a lower bound of 0.005 for \(r_0\) was applied, i.e., a minimal spread was enforced for each newly created classifier.
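A minimal sketch of the interpolation-based prediction under simplifying assumptions (a plain thin-plate-spline interpolant over the stored samples without polynomial augmentation or regularization; the exact scheme of [29] may differ), including the bound check and the nearest-neighbor fallback described above:

```python
# Sketch of an interpolation-based classifier prediction; NOT the exact method of [29].
import numpy as np

def tps(r: np.ndarray) -> np.ndarray:
    out = np.zeros_like(r)
    mask = r > 0.0
    out[mask] = r[mask] ** 2 * np.log(r[mask])   # theta(r) = r^2 log(r), with theta(0) = 0
    return out

def interpolated_prediction(samples, x: np.ndarray) -> float:
    xs = np.array([s for s, _ in samples])       # stored situations x_t from cl.Sp
    rs = np.array([r for _, r in samples])       # associated rewards / function values r_t
    dists_to_x = np.linalg.norm(xs - x, axis=-1)
    try:
        pairwise = np.linalg.norm(xs[:, None, :] - xs[None, :, :], axis=-1)
        lam = np.linalg.solve(tps(pairwise), rs)         # interpolation weights
        f_x = float(tps(dists_to_x) @ lam)
    except np.linalg.LinAlgError:
        f_x = None
    if f_x is not None and rs.min() < f_x < rs.max():    # robustness check from the text
        return f_x
    return float(rs[np.argmin(dists_to_x)])              # Nearest Neighbor fallback
```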

Table 3 Overall results of conducted experiments on Eggholder function and Sine-in-Sine function using RLS-based predictions, i.e., system error, population size and generality
Table 4 Overall results of conducted experiments on Cross function using RLS-based predictions, i.e., system error, population size and generality

XCSF Using RLS Prediction

As shown in Tables 3 and 4, representing the results for the experiments using RLS predictions, in the case of Eggholder, Absumption is significantly superior to the other configurations in the first 50,000 steps in each tested experiment setting. Specify performs worse compared to Absumption, but also yields significantly improved system error levels at \(r_0\) values of 0.1 and 0.5. Over the entire run, this situation changes slightly: ultimately, Absumption leads to a significant reduction in system error in case of the higher \(r_0\) values, 0.5 and 1.0. For \(r_0 = 0.1\), Absumption causes a slight increase of the system error. Therefore, it can be seen that Absumption can realize a specialization pressure in Eggholder in case of higher \(r_0\) values and, thus, can capture the existing multi-modality and the strong curvature of this test function. Nevertheless, Absumption seems to act too strictly in case of low \(r_0\) values, possibly generating more inaccurate classifiers due to the random decomposition of over-general classifiers in OGH. Specify, in contrast, leads to a significant reduction in system error at all \(r_0\) values tested, though it does not quite achieve the level of Absumption in case of \(r_0 = 0.5\) and 1.0. This is probably due to the random decomposition of only one classifier in the action set that is considered to be over-general and the subsequent creation of only one new specialized classifier. Thus, Specify is less strict than Absumption. Looking at the learning curve plots in Fig. 2a, b, the experiments with \(r_0\) values of 0.5 and 1.0 show that Absumption leads to a strong reduction of the system error, especially at the beginning. The configuration with Specify shows an increased reduction of the error only at a later stage of the experiment, but can significantly undercut the default configuration. In terms of population size, both OGH approaches lead to a significant increase in this metric in all cases over the entire course. This can be attributed to the decomposition of identified over-general classifiers, resulting in a larger number of transient classifiers. This decomposition effect can also be seen as the major cause for the reduced generality, as both OGH approaches cause a decrease in this metric value.

Fig. 2 Learning curve plots of the conducted experiments using XCSF

For the Sine-in-Sine function, the experiments show that the OGH approaches exert a positive effect on the system error only in case of the higher \(r_0\) values tested, i.e., 0.5 and 1.0, and, thus, in case of a stronger over-generalization tendency. Already at the beginning of the experiment run, but also over the entire course of the experiment, Absumption leads to the largest reductions in the system error in these experiment configurations, followed by Specify, which also shows significant reductions in this metric. The learning curve plot of the experiment with an \(r_0\) value of 1.0 in Fig. 2c also illustrates the significant reduction in system error at the beginning of the experiment due to Absumption. Specify also shows a smaller but still significant reduction of the system error as the experiment progresses. In addition, Specify can even slightly undercut Absumption at the end of the experiment. Thus, also in case of Sine-in-Sine it becomes apparent that the OGH approaches, and especially Absumption in case of the higher tested \(r_0\) values, enable XCSF to learn the underlying problem environment or test function more accurately. This equally suggests the presence of a specialization pressure due to OGH. In case of the experiment configuration with \(r_0 = 0.1\), Absumption leads to a slight deterioration of the system error and Specify leads to no significant improvement. Thus, there seems to be only a slight over-generalization tendency in this experiment configuration, and Absumption performs worse due to its probably too strict mode of operation in combination with the random decomposition. As with Eggholder, both OGH approaches cause a significant increase in population size for Sine-in-Sine, which, as before, is due to the increased generation of transient classifiers by the decomposition of over-general classifiers. Likewise, both OGH approaches lead to a reduction in generality due to the inherent formation of specialized classifiers in OGH.

Similar to Sine-in-Sine, OGH does not lead to any improvement in the Cross function configuration with \(r_0 = 0.1\), with Absumption again leading to a slight increase in system error. Even with \(r_0 = 0.5\), Absumption causes a small but significant increase in system error over the entire run, although XCSF with Absumption has the lowest error value in the first 50,000 steps. Presumably, the too strict mode of operation of Absumption as well as the random-based decomposition in OGH are responsible for this result. Compared to Absumption, Specify can lead to significant reductions of the system error at \(r_0 = 0.5\) as well as at \(r_0 = 1.0\). Thus, the less strict mode of operation of Specify also shows advantages in this context. Only at \(r_0 = 1.0\) does XCSF with Absumption achieve a significant improvement in system error. These results are also reflected in the learning curve plot in Fig. 2d. Absumption causes the strongest reduction in system error at the beginning, but is undercut by Specify at about 50,000 steps. As before, Absumption causes a significant increase in population size as well as a significant reduction in generality for all \(r_0\) values examined. In case of Specify, a significant increase in population size or reduction in generality occurs only at \(r_0 = 1.0\); in the other cases, Specify does not cause significant changes. The increase or reduction can probably again be attributed to an increased formation of redundant, specialized classifiers.

To conclude, the results in the investigated test functions show that OGH can realize a specialization pressure in case of RLS-based predictions and leads to improvements especially in configurations with increased over-generalization tendency. In this context, it becomes apparent that Specify leads to improvements in most cases or does not negatively affect the performance in the other cases. Absumption, on the other hand, has a greater potential for improvement in most cases due to its stricter mode of operation, but can also lead to a deterioration in performance.

Table 5 Results of conducted experiments on Eggholder function and Sine-in-Sine function using interpolation-based predictions, i.e., system error, population size and generality
Table 6 Results of conducted experiments on Cross function using interpolation-based predictions, i.e., system error, population size and generality

XCSF Using Interpolated Prediction

In Tables 5 and 6, the results for the experiments with interpolated predictions show that OGH leads to significant improvements in the system error over the entire run for the experiment configurations with \(r_0 = 0.5\) and 1.0 in case of Eggholder. This is also evident in the learning curve plots for these configurations in Fig. 2e, f. Absumption features a slightly faster reduction in system error at the beginning, but is quickly matched by Specify. At \(r_0 = 0.1\), OGH leads to no significant change in system error. Thus, in case of higher over-generalization tendencies, OGH enables more accurate learning of the test function. In terms of population size, OGH is also shown to lead to a significant increase in the experiments with interpolated prediction, with Absumption showing the highest values. This is again probably due to the formation of transient, redundant classifiers. For generality, it is found that OGH leads to a significant reduction in this metric, with Absumption having the lowest values. Thus, a specialization pressure by OGH is realized here as well, leading to the formation of more specialized classifiers. It is also shown that the performance of the applied interpolated predictions can be increased by the formation of specialized classifiers: the smaller conditions increase the density of the stored sampling points, which improves the interpolation performance.

For the Sine-in-Sine function, an almost identical pattern to that of the Eggholder function is obtained. OGH causes a significant reduction of the system error in case of \(r_0 = 0.5\) and 1.0 and no significant improvement in case of \(r_0 = 0.1\). In Fig. 2g, the learning curve plot of the Sine-in-Sine experiment configuration with \(r_0 = 1.0\) also shows a significant reduction of the system error due to both OGH approaches. The use of OGH also results in a significant increase in population size and a significant reduction in generality in Sine-in-Sine. Thus, this test function also confirms that both OGH approaches can realize a specialization pressure, which leads to increases in performance especially in the presence of a stronger over-generalization tendency.

In case of the Cross function, both OGH approaches yield a significant reduction of the system error at an \(r_0\) value of 1.0, which is also underlined by the learning curve plot in Fig. 2h. Both OGH variants lead to a faster reduction of the system error at the beginning of the experiment. Considering the results for \(r_0 = 0.1\) and 0.5, Specify leads to no significant change in the system error, whereas Absumption causes a small but still significant increase in this metric. This can again probably be attributed to the stricter mode of operation of Absumption. Therefore, for the Cross function, the less strict mode of operation of Specify once again proves beneficial. Regarding population size and generality, Absumption leads to a significant increase and decrease, respectively, in all cases. In contrast, Specify causes a significant increase in population size and decrease in generality only in case of \(r_0 = 1.0\); otherwise, it does not lead to any significant changes. Thus, for the Cross function with interpolated predictions, OGH appears to provide benefits from its specialization pressure only in case of strong over-generalization tendencies.

In summary, the experiments with interpolation-based predictions show that a specialization pressure can be realized by OGH in this case as well. Similar to the RLS-based predictions, the less strict mode of operation of Specify shows advantages. When OGH yields improvements, Specify achieves values similar to those of Absumption, but in the other cases, unlike Absumption, it does not cause any deterioration of the system error.

Key Findings and Discussion

The results of the evaluation experiments support the hypothesis that OGH is a viable approach to the problem of over-generalization in XCS-based systems for continuous-valued problem domains. As evidenced by the significantly lowered values of the generality metric, OGH is able to exert a specialization pressure in XCSR and XCSF.

In case of the classification tasks, the advantage of OGH becomes apparent when XCSR has to cope with the studied real-world data sets or with the examined toy problem environment featuring unbalanced classes due to unequally sized environmental niches, i.e., the Mario environment. The results in the Mario environment indicate that OGH is able to prevent the formation of over-general classifiers caused by the unequally sized environmental niches. It also seems to enable XCS to represent small niches of an environment more accurately, such as the yellow buttons in the Mario environment, indicating a more accurate state-action map.

This indicates that OGH can improve the online learning process of XCSR. XCSR performs local learning: the classifiers act as local submodels that are trained instance by instance on the underlying classification task. Using OGH, these local models become more fine-grained, allowing a more accurate adaptation to the classification task. In addition, updates to the corresponding submodels affect smaller regions of the learned model, which reduces unnecessary updates and increases the accuracy of the overall model. Thus, OGH enhances the adaptive modeling capability of XCSR that results from its online and local learning. Furthermore, the accelerated learning of the classification tasks in the real-world data sets, especially in Paddy Leaf, reveals that OGH can increase data efficiency. Early in the learning process, OGH can prevent the formation of over-general classifiers that would misrepresent the underlying problem, adversely affect classifiers that are actually correct, and ultimately slow down and hinder the learning process. Thus, using OGH, XCS requires fewer data samples to achieve a high system performance.
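To make the notion of local submodels concrete, the following minimal sketch (with hypothetical class and method names) shows an interval-based condition as used in XCSR: a classifier's submodel is only updated for states falling into its hyperrectangle, so narrower intervals directly translate into more fine-grained and more locally confined updates.

import numpy as np

class IntervalCondition:
    # Hypothetical minimal sketch of a hyperrectangular XCSR condition:
    # one (lower, upper) interval per input dimension.
    def __init__(self, lowers, uppers):
        self.lowers = np.asarray(lowers, dtype=float)
        self.uppers = np.asarray(uppers, dtype=float)

    def matches(self, state):
        # The classifier's local submodel is only updated for states
        # that fall inside its hyperrectangle.
        state = np.asarray(state, dtype=float)
        return bool(np.all((self.lowers <= state) & (state <= self.uppers)))

    def volume(self):
        # Narrower intervals mean a smaller covered volume, i.e., a more
        # fine-grained local submodel that is touched by fewer updates.
        return float(np.prod(self.uppers - self.lowers))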

In the toy problems featuring balanced classes, i.e., 6-RMP and CBP(3,3), the OGH approaches do not improve the overall learning performance. Unexpectedly, OGH even leads to the highest generality in the Checkerboard environment. Since the Checkerboard environment itself lacks sufficient specialization pressure, this seems to be attributable to the specialization pressure introduced by OGH: the problem is then approached from the over-specific side, which facilitates the generation of accurate and maximally-general classifiers by XCS. However, OGH seems to degrade the accuracy of the learned model in these environments, as it causes an increase of the system error. This may be due to the random creation of specific versions of the conditions by NCS, which may result in an unfocused specialization pressure. The cause of this deterioration in accuracy should be investigated more closely in future work, as should alternative specialization strategies that enable a more focused specialization pressure. Nonetheless, OGH appears to offer a distinct advantage in classification problems with more complex structures, such as the Mario environment and presumably Checkerboard environments with more divisions per dimension.

Similarly, for XCSF, the specialization pressure of OGH leads to performance improvements in solving the studied regression tasks. Especially for configurations with a stronger tendency to form over-general classifiers due to a large value of \(r_0\), the specialized classifiers generated by the specialization pressure of OGH led to a more accurate modeling of the respective test function and, thus, to a significant reduction of the system error. Particularly when using linear approximators as classifier predictions, the more fine-grained local models allow a significantly better representation of the high multi-modality as well as the strong and repetitive curvature of the Eggholder function and the Sine-in-Sine function. The obliqueness of the Cross function was also captured better when using OGH. For the interpolated classifier predictions, the increase in performance was mainly accompanied by significant reductions in generality. Thus, the specialization pressure of OGH increases the density of sampling points within the interpolated predictions, enabling the interpolation function to better approximate the studied function. Nevertheless, it also turned out that, in case of a presumably already sufficient density, e.g., in the configurations with \(r_0=0.1\), no further increases of the learning performance are obtained, since the interpolated classifier predictions in the standard configuration apparently already achieve a sufficiently accurate approximation. Still, especially in the absence of knowledge about the function under investigation and a consequently larger choice of the \(r_0\) value, OGH provides a good approach to obtain a more accurate model of that function.
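For reference, the linear classifier predictions referred to above can be trained with a recursive least squares (RLS) update in its standard form (shown here as a generic sketch; the forgetting factor \(\lambda\) and the initialization of the matrix \(P\) are implementation details not fixed by this discussion):

\[
p(\vec{x}) = \vec{w}^{\top}\vec{x}, \qquad
\vec{k}_t = \frac{P_{t-1}\vec{x}_t}{\lambda + \vec{x}_t^{\top} P_{t-1}\vec{x}_t}, \qquad
\vec{w}_t = \vec{w}_{t-1} + \vec{k}_t\left(y_t - \vec{w}_{t-1}^{\top}\vec{x}_t\right), \qquad
P_t = \frac{1}{\lambda}\left(P_{t-1} - \vec{k}_t\vec{x}_t^{\top}P_{t-1}\right),
\]

where \(\vec{x}_t\) denotes the (bias-augmented) input, \(y_t\) the target payoff, \(\vec{w}\) the classifier's weight vector and \(P\) its inverse correlation matrix. The specialization pressure of OGH restricts each such linear model to a smaller region, where a linear fit approximates the strongly curved test functions more closely.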

The statements for the regression tasks also apply to the results of XCSR in the GridWorld environment, since in this case, analogous to XCSF, a function is approximated, namely the Q-function underlying this multi-step environment, i.e., the reinforcement learning task. Due to the more fine-grained local models, the Q-function, which can be highly non-linear and complex in structure, is approximated more accurately, resulting in a more appropriate choice of actions in the environment. Thus, the increased specialization pressure due to OGH significantly improves the performance of XCSR: the further reduction of the number of required steps can be attributed to the further refinement of the created state-action map, which leads to a more accurate selection of suitable actions. This is also reflected by the rapidly increasing reward and the rapidly decreasing error, indicating that OGH enables XCS to quickly predict short and puddle-avoiding paths.
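For clarity, the Q-function approximated here corresponds to the usual XCS(R) multi-step payoff target: the classifiers of the previous action set are updated towards

\[
P = r_{t-1} + \gamma \max_{a \in A} \frac{\sum_{cl \in [M]_t,\; cl.a = a} cl.p \cdot cl.F}{\sum_{cl \in [M]_t,\; cl.a = a} cl.F},
\]

where \(r_{t-1}\) denotes the reward received for the previous action, \(\gamma\) the discount factor, \([M]_t\) the current match set, and \(cl.p\) and \(cl.F\) a classifier's prediction and fitness (given here in its standard formulation; the details of our configuration are not restated). More specific conditions confine the classifiers contributing to this fitness-weighted prediction to smaller regions of the state space, which underlies the more accurate action selection described above.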

In summary, OGH is able to contribute a significant advantage for XCSR and XCSF in the studied classification tasks, i.e., the toy problems and the real-world data sets, as well as in the examined multi-step tasks and the studied regression tasks.

Conclusion

We presented two Over-Generality Handling (OGH) approaches adapted for XCS with real-valued inputs (XCSR) and XCS for function approximation (XCSF), one based on the Absumption mechanism and one based on the Specify operator. The presented approaches provide promising means to deal with the challenge of over-general classifiers arising in XCS-based systems designed for real-valued inputs. This is particularly evident in reinforcement learning settings that demand long action chains and only provide sparse reward signals.

To enable the use of OGH in XCSR and XCSF, two specialization strategies were introduced that decompose over-general classifiers into more specific ones in real-valued problem spaces. A new identification strategy for over-general classifiers was proposed to enable the application of Absumption in multi-step problems. To fathom the potential of OGH, the Absumption- and Specify-based OGH variants were evaluated in multi-step problems as well as in single-step problems, i.e., classification tasks based on common benchmark problems and on real-world data sets.
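As a purely illustrative example, a random-based decomposition step of the kind referred to above could take the following form; the function name, the shrinking rule and the parameter max_shrink are hypothetical and do not reproduce the exact implementation:

import random

def randomly_specialize(lowers, uppers, state, max_shrink=0.5):
    # Hypothetical sketch: shrink the over-general classifier's interval
    # condition by a random amount per dimension while keeping the
    # triggering state covered.
    new_lowers, new_uppers = [], []
    for lo, hi, s in zip(lowers, uppers, state):
        width = hi - lo
        lo_new = min(s, lo + random.uniform(0.0, max_shrink) * width)
        hi_new = max(s, hi - random.uniform(0.0, max_shrink) * width)
        new_lowers.append(lo_new)
        new_uppers.append(hi_new)
    return new_lowers, new_uppers

In practice, such a step would be applied to each over-general classifier identified by the respective OGH variant, yielding more specific conditions.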

The presented results of the conducted empirical studies showed that, in our experiments, the application of OGH results in considerable improvements in benchmark problems tending towards the formation of over-general classifiers. Regarding the considered real-world data sets from the agricultural domain, especially the Absumption-based OGH caused significant improvements. In the evaluated multi-step problems, the application of OGH in general and of Absumption in particular led to a significant performance increase of XCSR. In case of XCSF, OGH led to a reduction of the system error in most cases for both RLS-based and interpolation-based predictions. In the conducted experiments, it became apparent that Absumption resulted in stronger reductions, but in some cases also caused a deterioration of the system error. Specify, on the other hand, led to smaller reductions compared to Absumption, but in no case caused a deterioration of the system error. Moreover, it turned out that the underlying concept of OGH can result in an increased number of specific classifiers during the learning phase, leading to an increase of the population size and a reduction of the generality metric.

The overall goal we strive for is to enhance XCS's ability to learn in environments that mislead the system into over-generalization, i.e., long-chain sequential decision problems. Therefore, in our future work, we aim to realize a more precise localization of incompatible niches within over-general classifiers. In addition, based on these new identification methods, a more systematic decomposition of over-general classifiers will be implemented, using a more sophisticated heuristic instead of the currently used random-based decomposition. The results in 6-RMP and CBP motivate an investigation of whether OGH has difficulties with such tasks in general, or whether the results are due to a possibly too unfocused specialization pressure. Thus, in our future work, we also plan to study OGH in more complex variants of RMP and CBP, such as RMP with eleven or more dimensions and CBP with either a systematically increased number of divisions per dimension or additional irrelevant attributes initialized with a Gaussian distribution. Furthermore, we will evaluate OGH on additional and also larger real-world data sets.