1 Introduction

Abduction is reasoning toward the “best” explanations for a set of encountered observations. While the most prevalent application area is diagnosis, where abduction is an intuitive methodology for deriving root causes, its usage ranges from test case generation [41] and text interpretation [49] to human behavior interpretation [14]. In regard to fault localization, model-based approaches have been developed for decades. These techniques aim at identifying failures from a description of the system under diagnosis and the detected abnormal behavior [8, 53]. The traditional consistency-based method exploits a formalization of the correct behavior that, in conjunction with a symptom, leads to inconsistencies. These inconsistencies are then used to derive explanations. In contrast, the abductive variant of model-based diagnosis depends on logical entailment, i.e., given the system description, a diagnosis has to entail the observations. Additional restrictions, such as minimality, characterize the admissible solutions. Originally, abductive model-based approaches considered solely representations of the faulty system behavior. Such failure knowledge can be expressed as Horn theories [5].

All model-based techniques depend on the quality of the underlying representation to adequately capture the system’s behavior and structure. Besides being expressive enough to formalize the artifact’s characteristics, the model should allow for efficient diagnostic reasoning to be suitable for practical application. Ideally, the system description additionally supports the decision making process by (1) including probabilities to allow a prioritization, (2) enabling probe selection recommendations to refine the diagnoses, and/or (3) taking into account costs and other factors for suggesting repair or replacement activities [4]. Other approaches extend the model-based diagnosis framework with capabilities to handle uncertainties often encountered in practice [27, 38]. From a practical point of view, it is essential to construct an appropriate system description, as the quality of the entire diagnostic process relies on the model’s suitability in regard to the domain; yet, the generation of appropriate diagnostic models is often associated with a large initial effort. To facilitate the integration of diagnosis tools in existing work processes, research has focused on automatically extracting Horn diagnosis models from failure assessments used in practice, such as Failure Mode and Effect Analysis (FMEA) [31, 62]. While this method eliminates or reduces the manual model creation effort, the generated system description is only as good as its source.

Let us consider a simple example from the wind turbine domain, where maintenance costs account for a large portion of the lifetime expenditure of the installations. The gearbox is a vital component converting the rotor’s low-speed power to high-speed power for the generator. Hence, failures in the gearbox contribute to a great extent to system downtimes. Gearbox lubrication plays an essential role in retaining a healthy system state by protecting the gears and bearings from excessive wear and overheating. A failing oil filter promotes contamination, which in conjunction with excessive oil temperature affects the lubricant film of the subcomponents. Overheating of the oil is caused by damages to the oil cooler. Insufficient lubrication of the gearbox may also be rooted in a broken oil pump, which leads to a decrease in oil pressure and thus a reduced oil flow through the gearbox. Obviously, this type of information can easily be formalized as a set of propositional Horn clauses. Suppose an inadequate lubrication of the gearbox has been detected. The goal of diagnosis is to determine the root causes of the observed symptom. For this diagnosis problem there are two possible explanations: either the oil filter is broken and there are cracks and leaks in the cooling component, or the oil pump has been damaged.

Abduction for propositional Horn theories is an NP-complete problem [10], and in the worst case the effort to compute explanations grows exponentially with the size of the model. Even though there are practical examples of NP-complete problems that can be solved quite efficiently [2], usually there is no universally “best” algorithm operating well on all problem instances. Choosing the computation method in regard to the particular example at hand can provide better performance results [34]. Algorithm selection aims at solving exactly this issue by identifying the “best performing” procedure for a specific instance [54]. The approach requires (1) a portfolio of algorithms to choose from, (2) empirical performance data of the algorithms on representative problems, and (3) a set of problem features that are used to get a notion of the difficulty of a problem instance [20]. To determine the most suitable routine for a distinct example, a predictor is constructed taking into account the empirical data and the feature vectors of the representative problem cases [34, 57].

Given the complexity results of abduction and the successful application of the portfolio approach in the domains of SAT [64], graph coloring [46], and tree decomposition [44], algorithm selection represents an interesting strategy to compute abductive diagnoses. Hence, we investigate algorithm selection as a means to efficiently compute explanations in the context of propositional Horn abduction. Horn models possess certain structural properties, which we utilize as features in order to classify diagnosis problems in regard to the most runtime-efficient algorithm. We embedded this selection process within a meta-algorithm that generates the structural metrics for a given diagnosis problem, categorizes it on a previously trained predictor, and computes the diagnoses using the selected abduction approach. The foundations for this work have been laid previously: first, we have applied a portfolio approach to abduction based on bijunctive Horn models, i.e., theories where each clause contains solely a single negative literal representing a cause and a single positive literal representing a symptom [30, 32]. In this work, we extend the underlying representation to general Horn clauses. Second, in Koitz and Wotawa [33], we have compared the performance of consequence finding and proof-tree completion methods in the context of propositional Horn clause abduction. We rely on the empirical performance data we have obtained from this previous research as an input to construct the predictor for our algorithm selection approach, test the resulting selection accuracy, and determine the overall performance of the meta-approach in comparison to the other abduction procedures.

This paper is structured as follows. In the upcoming sections, we give an overview of related literature and describe the basic definitions of abductive model-based diagnosis for propositional Horn theories. Section 4 focuses on the attributes we can extract from the Horn models and details our meta-approach. Before concluding the paper, we provide an empirical analysis comparing the meta-approach to consistently choosing a single abduction method.

2 Related research

There are various methods for extracting abductive diagnoses, such as the parsimonious set-covering theory [52] or probabilistic approaches [50]. In logic-based abduction there are two formulations of the problem directly connected to the relation between diagnoses, observations, and the background knowledge: consequence finding [40] and proof-tree completion [42]. (1) In consequence finding, explanations are derived deductively by re-formulating abduction as the search for logical consequences. Two well known consequence finding techniques are the Assumption-based Truth Maintenance System (ATMS) [6], deriving consequences from propositional Horn theories, and Skipping Ordered Linear (SOL) resolution, which allows extracting logical consequences of interest from full clausal first-order theories [23]. (2) Proof-tree completion rewrites the diagnosis problem in such a way that refutations correspond to solutions [42]. Hence, any conflict extracting method can be utilized, e.g., the computation of Minimal Unsatisfiable Subsets (MUSes) of logical formulae [37]. Many MUS enumeration algorithms refrain from computing the unsatisfiable cores directly, but exploit their hitting set duals, Minimal Correction Subsets (MCSes). For instance, Liffiton and Sakallah [37] present the CAMUS algorithm utilizing this hitting set duality to produce MUSes by first computing all MCSes.

Abduction has also been introduced to logic programming, where additional integrity constraints and distinguished predicates restrict the admissible solution space [25]. Besides specific tools such as the \(\mathcal {A}\)-System [26], abductive logic programming (ALP) can be realized using Answer Set Programming (ASP) [56]. Recently, Saikko et al. [55] have proposed the derivation of a single minimal-cost abductive explanation through implicit hitting set computation based on a combination of an Integer Programming solver and a SAT solver. An implementation of their iterative approach outperforms a state-of-the-art ASP solver with an encoding of propositional abduction. Ignatiev, Morgado, and Marques-Silva [22] report on an improvement of this method by deriving the hitting sets via a MaxSAT solver.

Even though there exist certain subsets of logics where abductive inference is tractable, abduction for the general case is hard for the second level of the polynomial hierarchy [10]. Algorithm selection has resulted in significant performance improvements on various AI problems [34]. First formalized by Rice [54], algorithm selection aims at identifying the “best performing” approach for a specific problem instance. To determine the most suitable routine for a distinct example, a predictor is constructed taking into account the empirical data and the feature vector of representative problem cases [57]. Instead of creating a single classifier to choose a method, Leyton-Brown et al. [36] train an empirical hardness model for each algorithm within the portfolio to forecast each technique’s performance on the instance and execute the one predicted superior. While there are approaches that manually create models for the predictor [59], most of the time a machine learning classifier is exploited that categorizes a new problem instance at hand as one of the methods in the portfolio. Static portfolio approaches have a finite and fixed set of algorithms to choose from (though there exist approaches where the number of algorithms in the portfolio is based on the training data [63]), while a dynamic portfolio is built online and may be constructed of algorithmic blocks.

Generally, there are various strategies in the field of algorithm selection, for instance, in regard to the selection (a single algorithm used to solve the entire problem or interleaved/parallel execution of various approaches), the number of models built (a single one to predict the “best” method or one per approach in the portfolio), and the time of choosing the algorithm (at the beginning or several times during search). The interested reader is referred to Kotthoff [34], who provides a comprehensive overview of different algorithm selection techniques.

At the intersection of abduction and algorithm selection, there is the work by Guo and Hsu [16]. The authors propose algorithm selection in the context of deriving the most probable explanation (MPE) in probabilistic inference. Differing from other work on algorithm selection, the authors first use classification to determine whether the problem is solvable and then either use clique-tree propagation to derive the exact solution or apply a second classifier to identify the “best” approximation procedure. While not directly usable for abductive reasoning, Malitsky et al. [39] apply the portfolio approach to enumerate MCSes, which can be exploited to derive explanations based on their hitting set duals [28]. Instead of choosing a single solver for the current problem instance, the authors use a technique that switches between various enumeration procedures multiple times. After a fixed timeout the next solver is used for computing the remaining solutions. The decision for the succeeding solver is computed on demand, and to ensure already computed solutions are not derived again, blocking clauses are added whenever a new enumeration procedure is chosen.

3 Preliminaries

In essence, abductive inference allows the derivation of plausible explanations for a given set of observations. Logic-based abduction provides an intuitive notion of this type of reasoning: a set of abducible propositions is an explanation or diagnosis for an observed symptom in case the observation is a consistent logical consequence of the diagnosis and background theory. A conclusion \(\phi \) is said to be a logical consequence of a set of premises \(\psi \), if and only if for any interpretation in which \(\psi \) holds, \(\phi \) is also true. We write this relation as \(\psi \models \phi \) and say \(\psi \) entails \(\phi \). An abductive diagnosis or explanation is composed of a set of abducible propositions taken from a sub-vocabulary of the representation language. In the context of fault identification, these propositions are usually abnormality assumptions about components, while in medical diagnosis they correspond to diseases.

Similar to Friedrich, Gottlob, and Nejdl [11] we focus on Horn abduction and define a propositional knowledge base (KB) that formalizes a propositional Horn theory \(\textit {Th}\) over a finite set of propositional variables \(\textit {A}\). A Horn clause is defined as a disjunction of literals featuring at most one positive literal and can be described by a rule, i.e., \(\{\neg a_{1}, \ldots , \neg a_{n},a_{n + 1}\}\) can be written as \( a_{1} \wedge {\ldots } \wedge a_{n} \rightarrow a_{n + 1}\). A clause without any negated propositions, i.e., \(\{p\}\), characterizes a fact. Many diagnostic reasoning engines restrict the model to Horn clause sentences, which are usually expressive enough in the context of fault identification to convey the necessary relations [5]. As the complexity of logic-based abduction is connected to the characteristics of the underlying model [3, 48], restricting the input language to Horn clauses is beneficial from a computational point of view. In particular, while abduction for general propositional theories is located in the second level of the polynomial hierarchy, for Horn clauses the complexity is mitigated to the first level [10]. A model or knowledge base is formally defined as a tuple \(\textit {(A,Hyp,Th)}\), where the set Hyp contains all hypotheses, i.e., abducible variables.
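To make the tuple concrete, such a knowledge base maps directly onto a small data structure. The following is a minimal Java sketch (the type and field names are our own illustration, not taken from any of the implementations discussed later); later snippets in this article reuse these types:

```java
import java.util.List;
import java.util.Set;

/** A propositional Horn clause a_1 ∧ … ∧ a_n → a_{n+1}; an empty body encodes a fact. */
record HornClause(Set<String> body, String head) { }

/** A knowledge base (A, Hyp, Th) in the sense of Definition 1. */
record KnowledgeBase(Set<String> variables,     // A
                     Set<String> hypotheses,    // Hyp ⊆ A, the abducibles
                     List<HornClause> theory) { // Th, Horn sentences over A
}
```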

Definition 1 (Knowledge Base (KB))

A knowledge base (KB) is a tuple (A,Hyp,Th) where A denotes the set of propositional variables, Hyp \(\subseteq \) A the set of hypotheses, and Th the set of Horn clause sentences over A.

A Propositional Horn Clause Abduction Problem (PHCAP) is characterized by a KB and a set of observations for which root causes are to be derived. In our definition, observations may only be a conjunction of propositions and not an arbitrary logical sentence.

Abduction within the logic-based framework is defined via derivability and consistency, i.e., the observed manifestation must be derivable from Th augmented with the diagnosis \({\Delta }\), while \({\Delta }\cup \textit {Th}\) is consistent. In addition, a solution to a PHCAP must consist of propositions from the set of hypotheses.

Definition 2 (Propositional Horn Clause Abduction Problem (PHCAP))

Given a KB(A,Hyp,Th) and a set of observations Obs \(\subseteq \) A, the tuple (A,Hyp,Th,Obs) forms a Propositional Horn Clause Abduction Problem (PHCAP).

Definition 3 (Diagnosis; Solution of a PHCAP)

Given a PHCAP (A,Hyp,Th,Obs). A set \({\Delta }\subseteq \textit{Hyp}\) is a solution iff \({\Delta }\cup \textit{Th}\models \textit{Obs}\) and \({\Delta }\cup \textit{Th}\not \models \bot \).

Since the goal of abduction is generally to derive the “best” explanations, some preference criterion must be present to characterize the notion of optimality. A common principle is to only consider subset-minimal diagnoses, as explanation supersets provide no additional information useful in a practical context. Thus, we define \({\Delta }\text {-Set}\) as the set containing all parsimonious explanations.

Definition 4 (Parsimonious Diagnosis)

A solution \({\Delta }\) is parsimonious or minimal iff no set \({\Delta }^{\prime }\) \(\subset \) \({\Delta }\) is a solution.

Example 1

In the introduction we have discussed the example of gearbox lubrication in the industrial wind turbine domain, where damages to the cooling component lead to oil overheating. This overheating in conjunction with a broken oil filter alters the lubricant’s condition and hence negatively affects gearbox lubrication. Moreover, a malfunctioning pump triggers a reduced pressure and subsequently influences the distribution of the lubricant throughout the system. These circumstances can easily be described by the following KB:

$$\textit{A} = \left\{ \begin{array}{c} \textit{Broken\_filter},\textit{Cooler\_leaks},\textit{Cooler\_cracks},\textit{Damaged\_pump},\\ \textit{reduced\_pressure},\textit{poor\_lubrication},\textit{overheating} \end{array} \right\} $$
$$\textit{Hyp} = \left\{ \textit{Broken\_filter},\textit{Cooler\_leaks},\textit{Cooler\_cracks},\textit{Damaged\_pump} \right\} $$
$$\textit{Th} = \left\{ \begin{array}{l} \textit{Broken\_filter}\wedge\textit{overheating}\rightarrow \textit{poor\_lubrication},\\ \textit{Cooler\_leaks}\wedge\textit{Cooler\_cracks}\rightarrow \textit{overheating},\\ \textit{Damaged\_pump}\rightarrow\textit{reduced\_pressure},\\ \textit{reduced\_pressure}\rightarrow\textit{poor\_lubrication} \end{array} \right\} $$

Presume we determine that there is insufficient greasing of the gearbox, i.e., \(\textit {Obs}=\{\textit {poor\_lubrication}\}\). Given the theory Th and Obs, we can determine two parsimonious explanations: either (1) leaks and cracks in the cooler cause the oil to overheat while, in addition, the filter is broken, or (2) the oil pump is damaged, i.e., \({\Delta }\text {-Set}=\{\{\textit {Broken\_filter},\textit {Cooler\_leaks},\textit {Cooler\_cracks}\}, \{\textit {Damaged\_pump}\}\}\).
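For illustration only, this Δ-Set can be reproduced by a naive generate-and-test procedure: enumerate the subsets of Hyp, test entailment of Obs by forward chaining, and keep the subset-minimal solutions (the consistency condition is trivially satisfied here, as the theory offers no way to derive \(\bot \)). The sketch below is our own toy implementation and exponential by design; the portfolio algorithms of Section 5.1 are the methods actually evaluated:

```java
import java.util.*;

public class GearboxExample {
    // Th: each rule is body..., head (last element)
    static final List<String[]> TH = List.of(
            new String[]{"Broken_filter", "overheating", "poor_lubrication"},
            new String[]{"Cooler_leaks", "Cooler_cracks", "overheating"},
            new String[]{"Damaged_pump", "reduced_pressure"},
            new String[]{"reduced_pressure", "poor_lubrication"});
    static final List<String> HYP =
            List.of("Broken_filter", "Cooler_leaks", "Cooler_cracks", "Damaged_pump");

    /** Forward-chaining closure of a set of facts under TH. */
    static Set<String> closure(Set<String> facts) {
        Set<String> cl = new HashSet<>(facts);
        boolean changed = true;
        while (changed) {
            changed = false;
            for (String[] rule : TH) {
                String head = rule[rule.length - 1];
                List<String> body = Arrays.asList(rule).subList(0, rule.length - 1);
                if (!cl.contains(head) && cl.containsAll(body)) changed = cl.add(head);
            }
        }
        return cl;
    }

    public static void main(String[] args) {
        Set<String> obs = Set.of("poor_lubrication");
        List<Set<String>> solutions = new ArrayList<>();
        for (int mask = 0; mask < (1 << HYP.size()); mask++) { // all subsets of Hyp
            Set<String> delta = new HashSet<>();
            for (int i = 0; i < HYP.size(); i++)
                if ((mask & (1 << i)) != 0) delta.add(HYP.get(i));
            if (closure(delta).containsAll(obs)) solutions.add(delta); // Definition 3
        }
        // Definition 4: drop every solution that has a strict subset solution
        List<Set<String>> deltaSet = solutions.stream()
                .filter(d -> solutions.stream()
                        .noneMatch(d2 -> d2.size() < d.size() && d.containsAll(d2)))
                .toList();
        System.out.println(deltaSet); // the two parsimonious diagnoses of Example 1
    }
}
```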

By rewriting the relations between the theory, a diagnosis, and the observations, we can reformulate the abduction problem in two different fashions [33]: proof-tree completion [42] and consequence finding [40]. In the former, abduction is described as the search for a refutation proof consisting of abducible propositions, i.e., \({\Delta }\cup \textit{Th}\cup \{\neg \textit{Obs}\}\models \bot \), where \(\{\neg \textit {Obs}\}\) is the disjunction containing a negation of each observation, i.e., \(\bigvee _{o_{i}\in Obs}\neg o_{i}\). To extract explanations, the derived conflicts have to be propositions from \(\textit {Hyp}\) and again no inconsistency may arise from the solutions. In the latter, abductive explanations are obtained in a deductive manner. By further recasting the conditions stated in Definition 3, we can construct \(\textit {Th}\cup \{\neg \textit {Obs}\} \models \{\neg {\Delta }\}\), where \(\{\neg {\Delta }\}=\bigvee _{\delta _{j}\in {\Delta }}\neg \delta _{j}\). In this scenario, we seek the consistent logical consequences of the theory and the negated observations comprising negations of hypotheses, which constitute the diagnoses.
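Both reformulations are equivalent to the entailment condition of Definition 3 by standard propositional reasoning: the first step is refutation (\(\psi \models \phi \) iff \(\psi \cup \{\neg \phi \}\) is unsatisfiable), the second moves \({\Delta }\) to the right-hand side via the deduction theorem and contraposition; the consistency condition \({\Delta }\cup \textit{Th}\not \models \bot \) has to be checked separately in both settings:

$$\begin{array}{lcl} {\Delta}\cup\textit{Th}\models\textit{Obs} & \Longleftrightarrow & {\Delta}\cup\textit{Th}\cup\{\neg\textit{Obs}\}\models\bot \quad \text{(proof-tree completion)}\\ & \Longleftrightarrow & \textit{Th}\cup\{\neg\textit{Obs}\}\models\{\neg{\Delta}\} \quad \text{(consequence finding)} \end{array}$$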

4 Algorithm selection for abductive model-based diagnosis

In practice, we observe that different approaches perform noticeably better on certain problem instances than others. Yet, often no single solver computes solutions optimally on every example [36, 39, 64]. The objective of algorithm selection is to identify the most appropriate method out of a portfolio of techniques for a given problem instance in regard to a certain performance metric [54]. A common target is to select the method minimizing the overall computation time. However, other measures such as accuracy or simplicity of solutions may be applied. To choose the appropriate algorithm, a mapping between features characterizing a problem instance and the preferable method to solve it is necessary. This mapping between the algorithms and the problem attributes requires empirical performance data on representative samples of all approaches within the portfolio. On the basis of the features extracted and the execution records, a predictor is built which forecasts the “best” algorithm for the instance at hand given the problem’s attributes. To achieve this, the predictor has to be able to uncover aspects of the problem which influence the performance of the methods. Rice [54] advocates the use of features inherent to the problems within the problem space to ensure an accurate selection. To be beneficial, running the predictor must not be more expensive than solving the problem itself [34].

4.1 Structural features

Some intuitive measures of the complexity of a diagnosis problem are the number of abducible variables, i.e., hypotheses, the number of manifestations, i.e., propositions from \(\textit {A}\setminus \textit {Hyp}\), and the connections within the theory. As the models we are studying consist of Horn clauses, we can easily define their structural properties based on various graph representations. In the simplest case, the theory is characterized as a directed graph \(G=(V,E)\), where V is the set of vertices and E constitutes the set of directed edges. The nodes of the graph represent the propositional variables, while the edges are determined by the theory. For each clause \( a_{1} \wedge {\ldots } \wedge a_{n} \rightarrow a_{n + 1}\) there exists a set of edges, such that \(\forall a_{i} \in \{a_{1},\ldots ,a_{n}\}: (a_{i},a_{n + 1}) \in E\), i.e., for each proposition in a clause’s body, there is a directed edge from the proposition’s node to the node representing the clause’s head. Considering the graph representations of the model, we can extract certain characteristics of their structure, which we subsequently use within the algorithm selection process. Note that we do not account for the logical connectives of the Horn clauses’ bodies in G.
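A sketch of the construction of G, reusing the hypothetical HornClause record from Section 3 and representing the edge set E as an adjacency map:

```java
import java.util.*;

final class FeatureGraphs {
    /** Builds G = (V, E): for each clause a_1 ∧ … ∧ a_n → a_{n+1}, a directed
     *  edge (a_i, a_{n+1}) is added for every proposition a_i in the body. */
    static Map<String, Set<String>> directedGraph(List<HornClause> theory) {
        Map<String, Set<String>> successors = new HashMap<>();
        for (HornClause c : theory)
            for (String a : c.body())
                successors.computeIfAbsent(a, k -> new HashSet<>()).add(c.head());
        return successors;
    }
}
```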

Similarly to Peng and Reggia’s [52] set-covering theory, we define two sets in regard to knowledge about the relations in the Horn theory: \(\textit{effects}(p_{i})\) and \(\textit{causes}(p_{i})\). The first contains the variables inferred by \(p_{i}\). That is, assuming a cause-effect relation between propositions, the set comprises all effects or manifestations of \(p_{i}\). The latter holds the variables which directly trigger the proposition \(p_{i}\), i.e., the causes of \(p_{i}\).

4.1.1 Outdegree and indegree

Based on the directed graph, we can compute for each vertex representing a hypothesis its outdegree. This specifies the number of variables affected by said proposition. Similarly, we record the indegree of each node, i.e., the number of propositional variables directly inferring the variable. Collected over the entire model, these measures provide an intuitive metric of the basic magnitude of the theory and the connectedness of G.

4.1.2 Covering and overlap

Several propositions may be covered by the same cause. On the basis of this, we can define a direct covering metric for each pair of propositions as the ratio between the number of common effects and the total number of symptoms induced by the two propositions:

$$\text{\textit{covering}}(p_{i},p_{j})=\frac{|\text{\textit{effects}}(p_{i})\cap \text{\textit{effects}}(p_{j})|}{|\text{\textit{effects}}(p_{i})\cup \text{\textit{effects}}(p_{j})|} $$

In a similar manner, we define the overlap of two effects as their common sources in relation to all their causes:

$$\text{\textit{overlap}}(p_{i},p_{j})=\frac{|\text{\textit{causes}}(p_{i})\cap \text{\textit{causes}}(p_{j})|} {|\text{\textit{causes}}(p_{i})\cup \text{\textit{causes}}(p_{j})|} $$
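Both ratios are simply Jaccard coefficients, covering over the effects sets and overlap over the causes sets. A sketch (our own helper; by convention it returns 0 when both sets are empty):

```java
import java.util.*;

final class PairMetrics {
    /** Jaccard coefficient |x ∩ y| / |x ∪ y|; covering(p_i, p_j) uses the effects
     *  sets, overlap(p_i, p_j) the causes sets. Returns 0 for two empty sets. */
    static double jaccard(Set<String> x, Set<String> y) {
        Set<String> intersection = new HashSet<>(x);
        intersection.retainAll(y);
        Set<String> union = new HashSet<>(x);
        union.addAll(y);
        return union.isEmpty() ? 0.0 : (double) intersection.size() / union.size();
    }
}
```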

Additionally, based on the directed graph we compute a label for each vertex v, which contains all hypotheses said node is (directly and indirectly) caused by:

$$\textit{label}(v)= \left\{\begin{array}{ll} \hspace{0.24cm}\{v\} & \text{if } v\in \textit{Hyp}\\ \bigcup\limits_{(x,v) \in E} \textit{label}(x) & \text{otherwise} \end{array}\right. $$

Note that we do not consider the connectives of the variables; hence, a label in our case is a simple set. Given the definition of a label, we define an overlap based on labels:

$$\textit{overlap}_{\textit{label}}(p_{i},p_{j})=\frac{|\text{\textit{label}}(p_{i})\cap \text{\textit{label}}(p_{j})|} {|\text{\textit{label}}(p_{i})\cup \text{\textit{label}}(p_{j})|} $$

While \(\textit {overlap}\) only takes into account the direct causes of a proposition, \(\textit {overlap}_{\textit {label}}\) uses the information of all predecessors of a node. By collecting these measures for any pair of hypotheses or effects, we can compute a value over the entire model.
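The label function translates into a memoized recursion over the predecessors in G. The sketch below assumes an acyclic theory and a predecessor map pred mirroring the edge set E (both are our illustrative assumptions); \(\textit{overlap}_{\textit{label}}\) is then the Jaccard coefficient of two labels, e.g., via the jaccard helper above:

```java
import java.util.*;

final class Labels {
    /** label(v) = {v} if v ∈ Hyp, otherwise the union of the labels of all
     *  predecessors of v in G; memoized, and assuming G is acyclic. */
    static Set<String> label(String v, Set<String> hyp,
                             Map<String, Set<String>> pred,
                             Map<String, Set<String>> memo) {
        Set<String> cached = memo.get(v);
        if (cached != null) return cached;
        Set<String> result = new HashSet<>();
        if (hyp.contains(v)) {
            result.add(v); // base case: a hypothesis labels itself
        } else {
            for (String x : pred.getOrDefault(v, Set.of()))
                result.addAll(label(x, hyp, pred, memo));
        }
        memo.put(v, result);
        return result;
    }
}
```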

4.1.3 Independent diagnosis subproblem

If G is disconnected, the theory decomposes into several subproblems, which we refer to as independent diagnosis subproblems. Each independent diagnosis subproblem corresponds to a connected subgraph of G.

4.1.4 Path length

Another measure of connectedness within the model is the minimal path length between two nodes in G. Yet, we only consider paths between nodes that cause some other proposition. Therefore, we construct an undirected graph whose nodes are exactly those propositions causing other propositions; an edge is created between two propositions in case they directly cause the same effect. On this graph, we measure the length of the minimal path between nodes.

4.1.5 Kolmogorov complexity

A simple encoding-based measure of a graph is its Kolmogorov complexity, i.e., the length of the shortest word necessary to encode the graph. A straightforward approach in this context is to compute the complexity based on the adjacency matrix of the undirected graph [45]. Therefore, we simply replace the directed edges of G with undirected ones, create a string representation of the adjacency matrix, and based on this derive the Kolmogorov complexity.
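Since Kolmogorov complexity is uncomputable in the exact sense, any implementation works with an approximation. The sketch below uses the DEFLATE-compressed length of the adjacency-matrix string as a stand-in estimator; the concrete encoding of [45] may differ, so this is merely one plausible realization:

```java
import java.util.zip.Deflater;

final class GraphComplexity {
    /** Approximates the Kolmogorov complexity of an undirected graph by the
     *  compressed length of the 0/1 string of its adjacency matrix. */
    static int compressedLength(boolean[][] adjacency) {
        StringBuilder sb = new StringBuilder();
        for (boolean[] row : adjacency)
            for (boolean edge : row)
                sb.append(edge ? '1' : '0');
        Deflater deflater = new Deflater(Deflater.BEST_COMPRESSION);
        deflater.setInput(sb.toString().getBytes());
        deflater.finish();
        byte[] buffer = new byte[sb.length() + 64]; // generous upper bound
        int length = deflater.deflate(buffer);
        deflater.end();
        return length;
    }
}
```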

4.1.6 Observation dependent metrics

Since not only the topology of the model is of interest, but also the structure of the current diagnosis problem, we measure indegree, overlap and \(\textit {overlap}_{\textit {label}}\) among the elements of \(Obs\) as well as the number of diagnosis subproblems involving variables of \(Obs\), in case several exist.

Example 2

For describing the features, we will use the Horn clause model with \(\textit {Th}=\{ e_{1} \wedge H_{1} \rightarrow e_{2},e_{1} \wedge H_{1} \rightarrow e_{3}, H_{2} \rightarrow e_{3}, H_{2} \rightarrow e_{4}, e_{3}\rightarrow e_{4}, H_{3} \rightarrow e_{5}\} \) as an example. Figure 1a shows G created on the basis of the model. The labels are as follows:

\(\textit{label}(H_{1})=\{H_{1}\}\), \(\textit{label}(H_{2})=\{H_{2}\}\), \(\textit{label}(H_{3})=\{H_{3}\}\), \(\textit{label}(e_{1})=\emptyset \), \(\textit{label}(e_{2})=\{H_{1}\}\), \(\textit{label}(e_{3})=\{H_{1},H_{2}\}\), \(\textit{label}(e_{4})=\{H_{1},H_{2}\}\), and \(\textit{label}(e_{5})=\{H_{3}\}\).

[Fig. 1: Feature graphs for Example 2]

For instance, we can extract \( \textit {outdegree}(H_{1})= 2\), \(\textit{indegree}(e_{2})= 2\), \(\textit {covering}(H_{1},H_{2})=\frac {1}{3}\), \(\textit {overlap}(e_{3},e_{4})=\frac {1}{4}\), and \(\textit {overlap}_{\textit {label}}(e_{3},e_{4})= 1\). We also have two independent diagnosis subproblems, namely one including \(H_{1}, H_{2},e_{1},e_{2},e_{3}\), and \(e_{4}\), and the other one consisting of \(H_{3}\) and \(e_{5}\). To compute the path length between nodes functioning as causes, we construct the graph in Fig. 1b. Based on this undirected graph, we can see, for instance, that \(\textit{path}(H_{1},e_{3})= 2\).

4.2 Meta-approach

There are two possible variations of algorithm selection; either one algorithm of the portfolio is selected based on a single predictor, or there is a predictor for each approach within the portfolio estimating each algorithm’s performance, and the method with the best predicted result is chosen [20, 34]. For our meta-technique, we consider the first variant, where we train a single classifier for all abductive reasoning methods to select a single approach for execution. We use a 1-of-n portfolio [64], i.e., we choose one of the n algorithms in our portfolio and invoke it in order to solve the current diagnosis problem. The portfolio consists of the seven abductive reasoning methods described in Section 5.1. Our performance objective is to choose the approach with the least runtime.

Within the context of abductive diagnosis, our meta-approach can be split into two stages: the online and the offline phase. In the offline step, the empirical data on computation times of the various abductive reasoning approaches is collected. On the basis of the metrics and the runtime information, a machine learning classifier is trained. Furthermore, in the model-based diagnosis scenario the system description has to be available before the actual diagnosis computation; thus, the majority of features can be computed offline on the diagnosis model. Only a portion of attributes, namely those associated with the set of observations, has to be derived online. Figure 2 lists all features extracted from the models, where the instance-specific/observation-dependent ones constitute the attributes computed online.

[Fig. 2: Features used for algorithm selection]

Algorithm MetAB describes the online portion of the meta-approach, which is executed whenever a new diagnosis problem arrives. The procedure takes the PHCAP, the offline-trained machine learning classifier, and the already computed metrics of the diagnosis model as input. Online, we have to collect the current PHCAP’s instance-based features, such as \(|Obs|\) or the number of independent diagnosis subproblems comprising the current observations. Based on the online and offline generated attributes, we populate the feature vector \(\phi \) with the measurements of the current diagnosis problem. While Hutter et al. [21] state that the feature extraction method should be highly efficient, in our framework only a subset of these attributes, namely the instance-specific metrics, has to be computed online. By providing all features to the machine learning algorithm, we in turn retrieve a predicted best abduction method \(\alpha \) out of our portfolio for this specific scenario based on the trained classifier and the instance’s features. Subsequently, we instantiate the diagnosis engine with the corresponding abduction method as well as the abduction problem and compute the set of abductive diagnoses, i.e., the \({\Delta }\text{-Set}\).

[Algorithm: MetAB, the online phase of the meta-approach]
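Since the listing is best read as code, the following Java sketch restates the online phase as described above; Classifier, AbductionEngine, and onlineFeatures are illustrative stand-ins for our actual implementation, and the Phcap record reuses the hypothetical KnowledgeBase type from Section 3:

```java
import java.util.*;

/** Online phase of MetAB (sketch): assemble the feature vector, predict the
 *  most runtime-efficient abduction method, and run it on the PHCAP. */
final class MetAB {
    interface Classifier { String predict(double[] featureVector); }
    interface AbductionEngine { Set<Set<String>> diagnose(Phcap problem); }
    record Phcap(KnowledgeBase kb, Set<String> obs) { }

    static Set<Set<String>> solve(Phcap problem, Classifier trained,
                                  double[] offlineFeatures,
                                  Map<String, AbductionEngine> portfolio) {
        double[] online = onlineFeatures(problem); // instance-specific metrics
        double[] phi = new double[offlineFeatures.length + online.length];
        System.arraycopy(offlineFeatures, 0, phi, 0, offlineFeatures.length);
        System.arraycopy(online, 0, phi, offlineFeatures.length, online.length);
        String alpha = trained.predict(phi);            // predicted best method
        return portfolio.get(alpha).diagnose(problem);  // Δ-Set
    }

    static double[] onlineFeatures(Phcap p) {
        // e.g., |Obs|; the remaining observation-dependent metrics are omitted here
        return new double[]{ p.obs().size() };
    }
}
```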

5 Evaluation

In this section, we evaluate the feasibility of our meta-approach based on an empirical study we have conducted previously [33]. This foregoing evaluation comparatively analyzed conflict-driven and consequence finding techniques for abductive reasoning on two benchmarks; one set of experiments utilizes examples stemming from practice, while the other encompasses artificially generated diagnosis problems. The objectives of the evaluation in this paper are two-fold; on the one hand, we aim at assessing the quality of the structural attributes for training a machine learning classifier to forecast the most runtime-efficient algorithm for a specific PHCAP instance. On the other hand, we want to determine the overall efficiency of the meta-approach in comparison to consistently using a single abductive diagnosis technique from the portfolio.

In the upcoming subsections, we first report on the abductive reasoning methods utilized in the previous study, which function as the algorithms for our portfolio in this paper. Second, in Section 5.2, we describe the two types of sample sets we have utilized and then report on the empirical performance data extracted from this previous study, which we subsequently use to determine the quality of our meta-approach. A more detailed description of the algorithms, their implementation, the benchmarks, and the runtime data obtained can be found in Koitz and Wotawa [33]. In Section 5.4, we give an overview of the evaluation set-up for the meta-approach and subsequently present the results of this evaluation.

5.1 Portfolio

In this subsection, we give a short description of each abductive reasoning method as well as details on the implementations used in the empirical evaluation. For a more in-depth discussion we refer to our previous work [33].

5.1.1 Abduction with the ATMS

As discussed in the preliminaries, explanations can be derived via consequence finding; this type of reasoning infers logical consequences from a background theory while restricting solutions to a target language referred to as a production field [40]. The production field specifies the conditions applying to the clauses generated, e.g., their cardinality or the literals admissible. The ATMS is a reasoner for propositional Horn clauses that derives the logical consequences of a query w.r.t. the underlying background theory [35, 42]. The constructed consequences only contain assumptions, i.e., hypotheses. Coupled with a problem solver, the ATMS determines the belief of data, while the problem solver performs the inference steps. Internally, the ATMS constructs a directed graph in which propositions are represented as nodes and the edges are based on the relations between the variables. Each node stores a label that contains all hypotheses allowing the node to be inferred. de Kleer [7] provides an algorithm that ensures the labels are sound, complete, minimal, and consistent. Given a PHCAP and an ATMS that has already processed the background theory, one simply has to add a clause \(o_{1}\wedge o_{2}\wedge \ldots \wedge o_{n}\rightarrow \textit {explain}\), where \(o_{1},o_{2},\ldots ,o_{n}\in \textit {Obs}\) and \(\textit {explain}\notin \textit {A}\). The label of the node explain then contains all subset-minimal sets of hypotheses which are consistent with the theory and allow the observations to be inferred; hence, the label of explain holds all parsimonious diagnoses. We realized the ATMS in Java and refer to the approach as ATMS within our experiments.

5.1.2 Abduction as consequence finding via SOL-resolution

Inoue [23] proposes SOL-resolution as a sound and complete strategy to derive subset-minimal logical consequences belonging to the production field. By defining an additional skip rule, SOL-resolution bypasses literals belonging to the production field instead of resolving them. To obtain abductive explanations as logical consequences using SOL-resolution, the background theory has to be extended by the negation of the observations, i.e., a clause \(\bigvee _{o_{i}\in Obs}\neg o_{i}\) is added to the knowledge base. The production field contains the negations of the hypotheses, i.e., \(\forall h\in \textit{Hyp}: \neg h\), to ensure only abducibles are considered in the derivation. After the computation, hypotheses contained within the consequences have to be converted back to their positive counterparts to obtain abductive diagnoses. In our evaluation, we exploit the consequence finding tool SOLAR [47], which is a Java implementation of SOL-resolution within the framework of tableau calculus for first-order full clausal theories. Consequence finding with SOLAR is denoted CF within the evaluation.

5.1.3 Conflict-driven search via HS-DAG

In proof-tree completion-style abduction, contradictions appear given the background theory and the negated observations. These conflicts constitute the explanations. Reiter [53] developed an approach to derive diagnoses within the consistency-based framework via conflicts arising from the manifestations contradicting the normal system behavior. The method exploits the hitting set relation between conflicts and consistency-based diagnoses. Reiter’s technique operates on a tree whose nodes are either labeled by a conflict, which is returned by an external theorem prover, or constitute a minimal hitting set, i.e., a consistency-based diagnosis. The tree structure implicitly removes already computed refutations and their supersets from further consideration. Greiner et al. [15] later corrected shortcomings of the algorithm description in regard to non-minimal contradictions and developed the Hitting Set Directed Acyclic Graph (HS-DAG), utilizing a directed acyclic graph instead of a tree.

In the case of abductive diagnoses, we are interested in the conflicts generated by the theorem prover, which should only encompass elements of Hyp. Hence, in our evaluation we use the publicly available Java diagnosis engine jdiagengine [51] that implements a conflict-driven search via HS-DAG coupled with an incremental assumption-based linear time Horn clause theorem prover (LTUR) [43]. Since the refutations returned by LTUR are not guaranteed to be minimal, the returned conflicts may not correspond to the parsimonious explanations we seek to obtain. We propose two solutions to this issue: (1) Simply recording all contradictions while constructing the entire HS-DAG and afterwards removing all conflict supersets ensures that all remaining refutations correspond to the parsimonious diagnoses. We refer to this implementation as HS-DAG within our empirical study. (2) Another option is to minimize conflicts right after they have been returned by the theorem prover by employing a minimization procedure such as QuickXplain [24]. In the set-up \(\textit {HS-DAG}_{\textit {QX}}\), we minimize the refutations computed by LTUR right away before continuing to construct the tree. For this purpose, we implemented an assumption-based QuickXplain in Java. QuickXplain itself depends on a theorem prover, which in our case is another instance of jdiagengine’s assumption-based LTUR.

5.1.4 Conflict-driven search via power set exploration

Conflicts are known as MUSes or unsatisfiable cores within the area of infeasibility analysis. A Minimal Unsatisfiable Subset (MUS) is a set of clauses that cannot be satisfied simultaneously, while every proper subset of an MUS is satisfiable [37]. Liffiton et al. [37] suggest computing MUSes by traversing the power set of clauses, encoded as a Boolean formula. This Boolean formula characterizes the already explored parts of the lattice.

The computation starts with a seed, i.e., a set of clauses from the unexplored region of the power set; in case the seed is unsatisfiable, it is reduced to an MUS, i.e., the lattice is descended until a minimal set of unsatisfiable clauses is reached. Afterwards, the encoding of the lattice is augmented in such a way that the MUS and all of its supersets are marked as already processed. In case the seed is satisfiable, it is expanded until a maximal set of satisfiable clauses, a Maximal Satisfiable Subset (MSS), is found. Again, the power set encoding is adapted to update the already investigated portions of the lattice. Once there are no more unexplored regions, all MUSes, i.e., conflicts, have been uncovered. Arif et al. [1] present a version favoring unsatisfiable seeds by computing maximal models of the map, i.e., assignments in which the maximum number of literals is true without violating a clause. This extension renders ascending the lattice unnecessary in case the seed is satisfiable.
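A condensed sketch of this exploration loop is given below. The SAT calls are hidden behind a hypothetical Oracle interface of our own, and the shrink step shown is a plain deletion-based reduction; the actual implementations in [37] and [1] differ in many details:

```java
import java.util.*;

final class PowerSetExplorer {
    /** Hypothetical oracle: satisfiability of a clause subset, seeds from the
     *  unexplored region of the map, and blocking of explored regions. */
    interface Oracle {
        boolean satisfiable(Set<Integer> clauseIndices);
        Optional<Set<Integer>> unexploredSeed();  // maximal model of the map [1]
        void blockSupersets(Set<Integer> mus);    // mark MUS and supersets explored
        void blockSubsets(Set<Integer> mss);      // mark MSS and subsets explored
    }

    static List<Set<Integer>> allMuses(Oracle oracle) {
        List<Set<Integer>> muses = new ArrayList<>();
        Optional<Set<Integer>> seed;
        while ((seed = oracle.unexploredSeed()).isPresent()) {
            Set<Integer> s = new HashSet<>(seed.get());
            if (!oracle.satisfiable(s)) {
                // shrink: descend the lattice to a minimal unsatisfiable subset
                for (Integer c : new ArrayList<>(s)) {
                    s.remove(c);
                    if (oracle.satisfiable(s)) s.add(c); // c is necessary, keep it
                }
                muses.add(s);
                oracle.blockSupersets(s);
            } else {
                // with maximal-model seeds the satisfiable seed is already an MSS,
                // so no ascent of the lattice is required
                oracle.blockSubsets(s);
            }
        }
        return muses;
    }
}
```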

We have implemented the general approach proposed by Liffiton et al. [37] in Java. To favor unsatisfiable cores early on in the computation, we implemented Arif et al.’s [1] method to find maximal model seeds. In addition, we use the SAT solver SAT4J [9] in order to determine the satisfiability of the Boolean formula representing the map. For our first set-up XPLorer, we implemented the insertion-based MUS extraction algorithm as suggested by Arif et al. [1] using jdiagengine’s incremental assumption-based LTUR. In the second set-up XPLorerQX, we take advantage of the fact that each unsatisfiable seed already constitutes a conflict and the MUS extraction merely reduces it to a minimal contradiction; hence, we can apply our assumption-based QuickXplain Java implementation to minimize the unsatisfiable seed to an MUS, i.e., an abductive explanation.

5.1.5 Abduction under stable model semantics

In ASP, abduction is framed as the search for stable models of a logic program representation of the problem [13]. The resulting stable models, i.e., answer sets, then constitute the abductive explanations. We use an encoding of propositional abduction by Saikko et al. [55] in conjunction with the C++ ASP solver clingo 4.5.4 [12]. To enumerate all subset-minimal answer sets, we adapt the encoding by removing the optimization criteria and call the solver with --heuristic=Domain, --enum-mod=domRec, and --dom-mod=5,16. These parameters ensure the computation of subset-minimal answer sets. We denote this approach ASP in the evaluation.

5.2 Benchmarks

We exploit two sample sets for our evaluation: (1) an artificially created set of PHCAPs (Artificial Samples) and (2) diagnosis problems constructed based on real system failure assessments characterizing component faults and their manifestations (FMEA Samples). In Table 1 we report on the characteristics of the generated abduction problems.

Figure 3 shows the principal structure of the artificial examples as an And-Or graph, where \(H_{i}\in \textit {Hyp}\) and \(e_{j}\in \textit{A}\setminus \textit{Hyp}\). Each rule’s body encompasses elements of A, while the head may only contain literals from \(\textit {A}\setminus \textit {Hyp}\). The samples were built based on several parameters: n sets the minimal number of negative literals in a clause, r determines how many clauses are generated with the same set of negative literals, o fixes the maximum overlap between positive literals of different clauses, and k determines the maximum number of observations, which are randomly selected from \(\textit {A}\setminus \textit {Hyp}\) and for simplicity only comprise positive observations. We obtained 166 PHCAPs generated with \(n = 1\), \(r = o = 15\), and \(k = 5\).

Table 1 Sample statistics [33]

FMEA is a reliability analysis tool that records possible component faults and their effects on the system behavior and function [18]. Thus, these assessments provide information suitable for constructing an abductive diagnosis model [62]. For the evaluation, we use twelve FMEAs encompassing failure knowledge of different technical systems (e.g., electrical circuits, a connector system by Ford, the Focal Plane Unit of the Heterodyne Instrument for the Far Infrared built for the Herschel Space Observatory, the Anticoincidence Detector mounted on the Large Area Telescope of the Fermi Gamma-ray Space Telescope, printed circuit boards, the Maritim IT Standard, as well as components, such as rectifier, inverter, transformer, main bearing, and backup components of an industrial wind turbine) and compile them into abductive knowledge bases. The compilation procedure, as described by Wotawa [62], matches each row within the FMEA table to a set of bijunctive Horn clauses. Each clause represents the cause-effect relation between a fault mode and one of its manifestations. These FMEA-based models thus only contain implications from a single fault to a single effect, in contrast to the artificial examples, which can have several levels of depth.

[Fig. 3: Artificial example structure [33]]

To construct the 213 diagnosis problems, we randomly chose observations from the set of failure effects described in the assessment. In contrast to the artificial examples, the FMEA models only comprise bijunctive definite Horn clauses, i.e., each clause features a single hypothesis in the body and a proposition from \(\textit {A}\setminus \textit {Hyp}\) as head.

5.3 Empirical performance data

To generate the performance data necessary for constructing a classifier for our meta-approach, we obtained execution time data for all algorithms contained in our portfolio on both benchmarks. Each algorithm was invoked ten times on each PHCAP with the aim to compute all parsimonious diagnoses. In case the execution on a PHCAP was not finished after twenty minutes, the computation was interrupted. Timed-out executions are penalized with \(\theta = 40\) minutes. Figure 4a and b depict the number of samples solved for growing cumulative log runtime.

[Fig. 4: Numbers of diagnosis samples solved over time [33]]

Given the data, we can determine that the two most promising techniques on the benchmarks are ATMS and HS-DAGQX. Both methods solve a reasonable number of samples within the given time frame, and our data shows that ATMS is on average the most efficient approach, while HS-DAGQX computes diagnoses faster for more instances. Especially on FMEA Samples, we can observe that the ATMS is the superior approach. Even though the methods using general solvers, i.e., CF and ASP, are around two orders of magnitude slower than the best Horn reasoner, they show a consistent performance by being able to compute explanations for most samples within the given runtime allowance.

5.4 Meta-experiment set-up

MetAB itself is implemented in Java. To create the predictor based on the features, we exploit the Waikato Environment for Knowledge Analysis (WEKA) library [17], which provides a vast variety of machine learning algorithms. For each PHCAP execution, we collect the metrics listed in Fig. 2 to gather our attributes for prediction. The feature vector itself holds one more nominal value: the class attribute. It corresponds to the name of the algorithm that solved the diagnosis problem the fastest in the experiments conducted previously [33]. Hence, this last attribute is the one to be predicted. To evaluate the classification accuracy based on the features, we randomly split each benchmark into a training set comprising 80 % of the data and a test set holding the remaining 20 % of examples. Since the samples are divided arbitrarily, it is not warranted that each type of PHCAP is represented in equal measure within the test set. Hence, it may also occur that the test set comprises samples particularly suitable for a certain approach.

To determine an appropriate machine learning method, we performed model selection via 10-fold cross-validation on the training data for several classification algorithms available in WEKA. Based on the accuracy results obtained we chose to use the random subspace method [19] in combination with WEKA’s decision tree with reduced-error pruning (REP Tree) for Artificial Samples. In the random subspace method, subsets of components of the feature vector are selected pseudorandomly and a decision tree is generated based solely on the chosen attributes. The resulting classifier then depends on several decision trees constructed in this manner, i.e., a decision forest. For FMEA Samples we decided on WEKA’s decision table majority classifier, which also relies on a suitable subset of features to build the decision table.
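For reference, wiring up such a predictor in WEKA takes only a few lines. The snippet below sketches the configuration for Artificial Samples; the ARFF file name is a placeholder, and we show default parameters although the actual settings were tuned via cross-validation as described:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.RandomSubSpace;
import weka.classifiers.trees.REPTree;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class TrainSelector {
    public static void main(String[] args) throws Exception {
        // one row per PHCAP: feature values plus a nominal class attribute
        // naming the fastest algorithm ("features.arff" is a placeholder)
        Instances train = DataSource.read("features.arff");
        train.setClassIndex(train.numAttributes() - 1);

        RandomSubSpace forest = new RandomSubSpace(); // random subspace method [19]
        forest.setClassifier(new REPTree());          // base learner: REP tree

        Evaluation eval = new Evaluation(train);      // 10-fold CV as in Section 5.4
        eval.crossValidateModel(forest, train, 10, new Random(1));
        System.out.println(eval.pctCorrect() + " % correctly classified");

        forest.buildClassifier(train);                // final predictor used by MetAB
    }
}
```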

5.5 Results

Table 2 depicts all classification statistics based on (1) the 10-fold cross-validation on the training data and (2) the evaluation on the test set. Utilizing the random subspace method we reach between 66.39 % and 67.17 % correctly labeled samples. A PHCAP is labeled correctly in case the predicted algorithm was the most efficient on the problem instance in our empirical performance data. For the FMEA Samples we receive slightly better results with more diagnosis problems (between 76.76 % and 77.79 %) classified correctly.

Table 2 Classification Statistics
Table 3 Attribute selected classification statistics

Ideally, we could improve upon the accuracy results. A common strategy to enhance classification precision is to adapt the feature vector. In this respect, we explore WEKA’s attribute selection in order to determine whether we could remove certain features while achieving the same or better prediction quality. Since the number and composition of the reduced feature set depends highly on the performed selection process, we conducted some informal evaluations to decide on the leading method. In the end, we chose to rank attributes in the artificial case according to their information gain with respect to the class feature and, in case of the FMEA Samples, based on a subset of features that are statistically relevant to the class attribute (Relief feature selection). Using these methods, we could reduce the set of features to fifteen and twenty-one in case of the artificial and FMEA examples, respectively. As reported in Table 3, the results improved from 67.17 % to 69.58 % and from 76.76 % to 78.41 % on the test sets. Table 4 lists the selected attributes according to their rank. While the suggested feature sets diverge, certain attributes are seemingly important in both benchmarks, namely the metrics which characterize the basic composition of the theory, e.g., overlap and indegree.

Table 4 Selected attributes

The confusion matrices in Table 5a and b give a deeper insight into the number of correctly and wrongly labeled instances of the test set based on the attribute-selected classifiers. Each column of the matrix reports the number of instances labeled as a certain algorithm, while the rows represent the number of PHCAPs on which the algorithm was superior to the other approaches. For example, Table 5a shows that HS-DAG was predicted as the most efficient method in 191 of 332 cases, while it actually performed best on 169 samples. Hence, on 22 samples the classifier incorrectly selected HS-DAG. In 33 cases, the instance was mislabeled as HS-DAGQX even though HS-DAG would have been the better-suited candidate. For 35 examples, the situation was the other way around. This is due to both approaches exhibiting a similar runtime behavior on most samples (as is also apparent from Fig. 4).

Table 5 Confusion matrices. The rows represent the actual number of instances labeled as the class, while the columns show the number of predicted instances
Table 6 Classifier performance measures

Table 6 reports on the binary classification measures for each approach on the test set. The best results in each column are emphasized. Precision denotes the proportion of cases correctly labeled as an approach to all the samples for which the algorithm was selected as the abduction method, whereas Recall is the ratio of samples correctly predicted to instances where the approach is actually the fastest. The F1-Score is a combined metric based on the weighted average of Precision and Recall [58]. Considering, for example, the ATMS in Artificial Samples, we can see that the Precision value is close to the best value of 1, since it was only once selected incorrectly. However, the Recall value is rather disappointing due to the various samples where ATMS was in fact the fastest method but was not identified as such by the classifier. This trade-off is also apparent in the F1-Score, which hence is only 0.49. In contrast, in the FMEA Samples, Precision and Recall values are good, and hence the F1-Score for ATMS is also promising. Other common measures in multi-class classification are Accuracy, i.e., the overall effectiveness of a classifier, Specificity, i.e., how well negative labels are classified, and the area under the receiver operating characteristic curve (AUROC), i.e., the ability of the predictor to avoid false labeling [58, 61].

Table 7 Runtime results of the meta-approach in comparison to ATMS and HS-DAGQX

From the contingency tables it is apparent that HS-DAG, HS-DAGQX, and ATMS are the best performing approaches, with some variations between the sample sets. A premature analysis of these results would suggest that applying, for instance, HS-DAG to every artificial instance would yield the optimal runtime for most problems. However, based on the entire set of problems, i.e., test and training, HS-DAG is not the most efficient approach, as its computation time is notably larger than that of other algorithms on several instances. Thus, we propose to use our meta-approach MetAB, which we subsequently compare to always using a single diagnosis method.

MetAB’s overall runtime is determined by (1) the computation of the online metrics, (2) the time it takes to create the feature vector, supply it to the classifier, and predict the “best” algorithm, and (3) the diagnosis time of the suggested abduction procedure. In regard to the feasibility of the meta-approach, we refer back to a particular characteristic of model-based diagnosis, namely the availability of the system description offline. As the model has to be present before the computation of the diagnoses, we can extract most of the metrics utilized in the algorithm selection offline. On average, we have observed an offline feature computation time of around 52,868.39 ms for the artificial instances and 165.85 ms for the FMEA benchmark. The online computation of the features, which are inherent to the specific instance of the PHCAP, is negligible (on average < 0.1 ms). Based on the total classification time on the test set, we approximate that the prediction for a single instance takes on average below 0.5 ms. The third factor, the diagnosis time, depends greatly on the predictive capabilities of the classifier. A known drawback of this type of algorithm selection approach, where a single method is chosen and executed on the instance, is that in case the prediction is incorrect, the meta-approach might be rather inefficient [34].

Figure 5a and b depict the distributions of the log runtimes of the abductive reasoning algorithms on the test sets. In addition to the seven abduction methods, we have included on the one hand our meta-approach MetAB and on the other hand an optimal algorithm selection approach MetABOpt. For MetABOpt we assume a classifier that labels each diagnosis problem correctly and thus computes the explanations with the approach requiring the minimal runtime. MetABOpt’s computation time thus consists of the collection of the online metrics, the classification, and the diagnosis using the fastest approach. For the artificial examples, MetAB and MetABOpt show the best average and median runtime results, with an average percentage difference between them of 15.30 %. However, in case of the FMEA examples, ATMS provides the best median results on the test set, followed by MetABOpt and MetAB. This is unsurprising given the dominance of ATMS on FMEA Samples. On average, however, ATMS is slower than MetABOpt and MetAB. The average percentage difference between MetABOpt and MetAB is 10.93 %.

[Fig. 5: Statistical distribution of the runtimes (\(10^{y}\) ms) on the test set]

Table 7 lists the runtime results of MetAB in comparison to ATMS and HS-DAGQX. From the table we can observe that, for the FMEA examples, the classifier at least once chose an algorithm which could not compute the diagnoses within the given time frame; hence, the maximum runtime corresponds to the penalty of \(\theta = 40~ \textit {minutes}\) plus the classification time.

For a more in-depth comparison of MetAB to the other diagnosis methods, we provide various runtime scatter plots in Figs. 6 and 7. The x and y values characterize the penalized log runtimes of the corresponding algorithm pair, while each data point represents one sample run. For instance, in Fig. 6b, we compare MetAB to ATMS on the artificial samples. Points above the diagonal represent executions where ATMS was more efficient than MetAB, while points below the line indicate samples where MetAB was superior. The dashed lines mark the penalized runtime, i.e., every execution exceeding the runtime limit is located on the dashed lines. As apparent from Figs. 6a and 7a, MetAB does not perform as well as the ideal approach MetABOpt. Nevertheless, given the classification accuracy, most data points are on or close to the diagonal, confirming that the selected algorithm is either the fastest or among the best solvers for the PHCAP. For the FMEA examples, we can see that on one instance MetAB chose an algorithm exceeding the time limit, hence featuring a penalized runtime. From the remaining plots, we can conclude that while on some PHCAPs the meta-approach cannot compete with always choosing a single algorithm, the bulk of data points is usually located below the diagonal, suggesting the benefit of MetAB even when an ideal prediction is not possible. Figures 6a and 7a further show that while MetAB achieves convincing runtime results on the artificial samples, the same cannot be stated for the FMEA examples. As these models only feature bijunctive Horn clauses, the computation of the online features requires around the same time as the diagnosis itself; hence, MetAB’s prediction only causes computational overhead. Further, the scatter plots indicate that while consequence finding using SOLAR does not provide efficient runtimes, it is a dependable approach computing diagnoses for all PHCAPs constructed on top of the FMEAs. Both general reasoners, SOLAR and clingo, are not specialized Horn abduction solvers, but designed for first-order clausal theories and for normal as well as disjunctive logic programs, respectively. Therefore, CF and ASP are around two orders of magnitude slower than MetAB. However, as both approaches perform consistently and compute explanations for most samples within the given runtime allowance, we conclude that off-the-shelf tools can very well function as suitable abduction engines despite not being competitive in regard to runtime.

[Fig. 6: Scatter plots of runtime (\(10^{y}\) ms) comparisons on the test set of Artificial Samples]

[Fig. 7: Scatter plots of runtime (\(10^{y}\) ms) comparisons on the test set of FMEA Samples]

To evaluate the efficiency of our meta-approach in comparison to always choosing a single approach, we determine whether the median runtime difference between MetAB and the best performing approaches on the corresponding sample set is significant. Since we are dealing with non-normal runtime distributions, and under the assumption that the observations are independent and identically distributed, we apply the one-tailed Wilcoxon Signed-Rank Test [60]. Suppose paired runtime data \((x_{1},y_{1}),(x_{2},y_{2}),\ldots , (x_{n},y_{n})\) from MetAB and the compared abduction approach, respectively, with medians \(m_{X}\) and \(m_{Y}\). For FMEA Samples, we compared MetAB and ATMS under the null hypothesis \(H_{0}: m_{X} \leq m_{Y}\) and the alternative hypothesis \(H_{1}: m_{X} > m_{Y}\). Given the one-tailed test for \(\alpha = 0.05\), we reject \(H_{0}\), i.e., there is a significant difference between MetAB and ATMS. This indicates that on simple examples, such as the ones generated based on FMEAs, choosing the ATMS as abduction procedure is advantageous on the test data. In regard to the artificial examples, we compared the meta-approach to HS-DAG and HS-DAGQX with \(H_{1}: m_{X} < m_{Y}\) and the corresponding null hypothesis. In both cases, we retain the null hypothesis, since the Wilcoxon signed-rank test determined an insignificant improvement of MetAB in comparison to the hitting set approaches.
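The test itself is available off the shelf, e.g., in Apache Commons Math. A sketch with made-up paired runtimes follows; note that the library reports a two-sided p-value, so for our one-tailed hypotheses it has to be halved and the direction of the median difference checked:

```java
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

public class RuntimeSignificance {
    public static void main(String[] args) {
        // paired runtimes in ms on the same test PHCAPs (illustrative values only)
        double[] metAb = {12.1, 430.0, 8.4, 2400000.0, 55.2};
        double[] atms  = {10.9, 455.0, 7.9,      60.1, 54.0};

        WilcoxonSignedRankTest test = new WilcoxonSignedRankTest();
        // exact two-sided p-value (feasible for small n)
        double pTwoSided = test.wilcoxonSignedRankTest(metAb, atms, true);
        System.out.println("two-sided p = " + pTwoSided);
        // one-tailed decision at alpha = 0.05: halve p and check the sign
        // of the median difference before rejecting the null hypothesis
    }
}
```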

6 Conclusion

Algorithm selection has been proposed as a way to deal with the issue that, given different problem instances, there is no universally superior approach computing solutions efficiently for all samples; rather, the performance of a certain method highly depends on the underlying characteristics of the problem instance. Abductive diagnosis represents a novel application area of the portfolio approach. Utilizing a machine learning classifier and structural attributes of the Horn theories, we have developed a meta-approach to abductive model-based diagnosis that aims at predicting the most suitable method for a diagnosis problem on a case-by-case basis. Algorithm selection can be particularly useful in the context of model-based diagnosis, since the structural features of the underlying system description can be computed mostly offline and only the dynamic portion depending on the observations has to be derived during computation.

We evaluated the algorithm selection method within an empirical set-up based on seven abductive reasoning methods and two benchmarks. Subsequently, we compared abduction via the meta-approach to always choosing a single abductive reasoning mechanism. The accuracy of our algorithm selection technique is satisfactory, and MetAB is in fact on average more efficient than choosing a single abductive reasoning approach on both sample sets. Hence, based on our benchmarks, we can advocate the benefit of using a portfolio method for abduction. While the fastest on average, our meta-approach cannot outperform all other abduction techniques in terms of the median on the simple diagnosis problems stemming from FMEAs. These models are characterized by bijunctive Horn clauses, where the meta-approach’s set-up effort might not justify the runtime improvements. Possibly on larger models, we could observe a better performance of MetAB in comparison to ATMS, which is known to have difficulty propagating label values given larger, more interconnected theories.

Thus, adapting and extending the benchmarks used for evaluation might yield improved runtime results for our portfolio approach. Even though we have applied a simple attribute selection, a deeper analysis of negligible or combinable features would certainly improve the technique, since choosing the incorrect method can be an expensive mistake within our set-up. In addition, removing some of the attributes would decrease the computational effort, especially in regard to the offline features, which may take a long time to derive depending on the problem structure.

There may be situations where certain instances are inherently difficult and thus require an extensive amount of time on any technique (including the meta-approach). In these situations, switching between computation methods may provide better results than a simple algorithm selection. This would be particularly interesting if approaches favor certain kinds of explanations and produce those efficiently early on in the calculation. Then, by switching between the abduction methods and adding constraints that ensure previously derived diagnoses are not considered again, we may be able to improve the computation even further. However, it would remain to determine to what extent the selection of the next approach and the switching would increase the computation time, in order to decide if this strategy is beneficial.