# Provenance for Explaining Taxonomy Alignments

## Abstract

Derivations and proofs are a form of provenance in automated deduction that can assist users in understanding how reasoners derive logical consequences from premises. However, system-generated proofs are often overly complex or detailed, and making sense of them is non-trivial. Conversely, without any form of provenance, it is just as hard to know why a certain fact was derived.

We study provenance in the application of Euler/X [1], a logic-based toolkit for aligning multiple biological taxonomies. We propose a combination of approaches to explain both logical inconsistencies in the input alignment and the derivation of new facts in the output taxonomies.

**Taxonomy Alignment.** Given taxonomies \(T_1,T_2\) and a set of *articulations* \(A\), all modeled as monadic, first-order constraints, the *taxonomy alignment problem* is to find “merged” taxonomies that satisfy \(\varPhi = T_1\cup T_2\cup A\). An alignment can be *inconsistent* (\(\varPhi \) is unsatisfiable), *unique* (\(\varPhi \) has exactly one minimal model), or *ambiguous* (\(\varPhi \) has more than one minimal model). For example, let \(T_1\) be given by *isa* (subset) constraints \(\mathsf {b \subseteq a}\), \(\mathsf {c \subseteq a}\), *coverage* constraint \(\mathsf {a = b\cup c}\), and *sibling disjointness* \(\mathsf {b\cap c=\emptyset }\). Similarly, \(T_2\) is given by *isa* constraints \(\mathsf {e \subseteq d}\), \(\mathsf {f \subseteq d}\), coverage \(\mathsf {d=e\cup f}\), and sibling disjointness \(\mathsf {e\cap f=\emptyset }\).

An expert aligns \(T_1\) and \(T_2\) using *articulations* \(\mathsf {a=d}\), \(\mathsf {b\subsetneq e}\), \(\mathsf {c\subsetneq f}\), and \(\mathsf {b\subsetneq d}\); see Fig. 1. We would like to “apply” all of these relations between the two taxonomies, and output a merged taxonomy.
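The running example is small enough to check by brute force. The following is a minimal sketch, not the reasoning procedure used by Euler/X. It assumes that the articulations are numbered \(\mathsf {A}_1,\dots ,\mathsf {A}_4\) in the order listed above and that every concept is nonempty; all helper names (`subset`, `covers`, `satisfiable`, and so on) are hypothetical. Each individual is described by the set of concepts it belongs to. The constraints are satisfiable exactly when the membership patterns permitted by the universal conditions can witness every existential condition.

```python
from itertools import product

CONCEPTS = "abcdef"

# Universal conditions: must hold for the membership pattern m of every individual.
def subset(x, y):            # x ⊆ y
    return lambda m: x not in m or y in m

def disjoint(x, y):          # x ∩ y = ∅
    return lambda m: not (x in m and y in m)

def equal(x, y):             # x = y
    return lambda m: (x in m) == (y in m)

def covers(parent, x, y):    # parent = x ∪ y
    return lambda m: (parent in m) == (x in m or y in m)

# Existential condition: some individual lies in y but not in x.
def witness_in_not(y, x):
    return lambda m: y in m and x not in m

# T1 and T2 from the example.
TAXONOMIES = [
    subset("b", "a"), subset("c", "a"), covers("a", "b", "c"), disjoint("b", "c"),
    subset("e", "d"), subset("f", "d"), covers("d", "e", "f"), disjoint("e", "f"),
]

# Each articulation as (universal conditions, existential conditions).
# Assumption: A1..A4 follow the order in which the text lists them.
ARTICULATIONS = {
    "A1": ([equal("a", "d")], []),                             # a = d
    "A2": ([subset("b", "e")], [witness_in_not("e", "b")]),    # b ⊊ e
    "A3": ([subset("c", "f")], [witness_in_not("f", "c")]),    # c ⊊ f
    "A4": ([subset("b", "d")], [witness_in_not("d", "b")]),    # b ⊊ d
}

def satisfiable(names):
    """Check T1 ∪ T2 ∪ {chosen articulations} for satisfiability."""
    universals = list(TAXONOMIES)
    # Assumption: every concept is nonempty.
    existentials = [lambda m, c=c: c in m for c in CONCEPTS]
    for name in names:
        univ, exist = ARTICULATIONS[name]
        universals += univ
        existentials += exist
    # All 2^6 membership patterns, filtered by the universal conditions.
    patterns = [frozenset(c for c, bit in zip(CONCEPTS, bits) if bit)
                for bits in product((0, 1), repeat=len(CONCEPTS))]
    allowed = [m for m in patterns if all(u(m) for u in universals)]
    # Every existential condition needs at least one allowed witness.
    return all(any(e(m) for m in allowed) for e in existentials)

print(satisfiable(["A1", "A2", "A3", "A4"]))  # False: the alignment is inconsistent
print(satisfiable(["A1", "A2", "A4"]))        # True
```

Enumerating membership patterns is exponential in the number of concepts, so this only illustrates the definition of an inconsistent alignment on a toy input.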

**Inconsistency Explanation.** Usually \(T_1\) and \(T_2\) are considered immutable or correct by definition, whereas \(A\) might contain modeling errors. Euler/X applied to Fig. 1 finds that the constraints are unsatisfiable, and performs a model-based diagnosis. The result lattice (Fig. 2) highlights *minimal inconsistent subsets* (MIS) and *maximal consistent subsets* (MCS). The MIS \(\{\mathsf {A}_1, \mathsf {A}_2, \mathsf {A}_3\}\) indicates which articulations are inconsistent with \(T_1,T_2\). To further explore the inconsistency, the system-derived MCS can be employed: Fig. 3 shows the merged taxonomies (a.k.a. “possible worlds”) obtained from the MCS. Here, each MCS corresponds to one possible world.
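The notions of MIS and MCS can be illustrated by enumerating all subsets of the four articulations. The sketch below is a naive enumeration, not an HST-based diagnosis procedure. It makes the same assumptions as before: articulations are numbered in the order listed, all concepts are nonempty, and the function names are hypothetical.

```python
from itertools import combinations, product

CONCEPTS = "abcdef"
PATTERNS = [frozenset(c for c, bit in zip(CONCEPTS, bits) if bit)
            for bits in product((0, 1), repeat=len(CONCEPTS))]

def subset(x, y):            # x ⊆ y
    return lambda m: x not in m or y in m

def disjoint(x, y):          # x ∩ y = ∅
    return lambda m: not (x in m and y in m)

def covers(parent, x, y):    # parent = x ∪ y
    return lambda m: (parent in m) == (x in m or y in m)

def strictly_inside(x, y):   # x ⊊ y as (universal part, existential part)
    return [subset(x, y)], [lambda m: y in m and x not in m]

TAXONOMIES = [
    subset("b", "a"), subset("c", "a"), covers("a", "b", "c"), disjoint("b", "c"),
    subset("e", "d"), subset("f", "d"), covers("d", "e", "f"), disjoint("e", "f"),
]

# Assumption: A1..A4 follow the order in which the text lists them.
ARTICULATIONS = {
    "A1": ([lambda m: ("a" in m) == ("d" in m)], []),   # a = d
    "A2": strictly_inside("b", "e"),
    "A3": strictly_inside("c", "f"),
    "A4": strictly_inside("b", "d"),
}

def consistent(names):
    """True if T1 ∪ T2 together with the named articulations is satisfiable."""
    universals = list(TAXONOMIES)
    existentials = [lambda m, c=c: c in m for c in CONCEPTS]  # nonempty concepts
    for name in names:
        univ, exist = ARTICULATIONS[name]
        universals += univ
        existentials += exist
    allowed = [m for m in PATTERNS if all(u(m) for u in universals)]
    return all(any(e(m) for m in allowed) for e in existentials)

def diagnose(names):
    """Return (MIS, MCS) by checking every subset of the articulations."""
    subsets = [frozenset(s) for r in range(len(names) + 1)
               for s in combinations(names, r)]
    good = [s for s in subsets if consistent(s)]
    bad = [s for s in subsets if not consistent(s)]
    # MIS: inconsistent, with no inconsistent proper subset.
    mis = [sorted(s) for s in bad if not any(t < s for t in bad)]
    # MCS: consistent, with no consistent proper superset.
    mcs = [sorted(s) for s in good if not any(s < t for t in good)]
    return mis, mcs

mis, mcs = diagnose(["A1", "A2", "A3", "A4"])
print(mis)  # [['A1', 'A2', 'A3']]
print(mcs)  # [['A1', 'A2', 'A4'], ['A1', 'A3', 'A4'], ['A2', 'A3', 'A4']]
```

Under these assumptions the enumeration reports the single MIS named in the text and three MCS, each obtained by dropping one articulation of the MIS.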

Using expert knowledge or further constraints, a preferred merge result can be selected to further analyze and then repair the inconsistency. Here, suppose the user chose the first maximal consistent subset \(\{\mathsf {A}_1, \mathsf {A}_2, \mathsf {A}_4\}\). It follows from \(\mathsf {A}_1, \mathsf {A}_2\) and the input taxonomies \(T_1,T_2\) that \(\mathsf {f}\subsetneq \mathsf {c}\). However, \(\mathsf {A}_3\) is \(\mathsf {c}\subsetneq \mathsf {f}\), yielding a contradiction. Now the problem is to explain why \(\mathsf {f} \subsetneq \mathsf {c}\) is inferred.

**Derivation Explanation.** To understand how \(\mathsf {f}\subsetneq \mathsf {c}\) is inferred, we may need to inspect its logical derivation or an abstraction of it. We obtain this provenance in Euler/X by keeping track of the rules \(r_1,\dots , r_8\) and input alignments \(\mathsf {A}_1,\dots , \mathsf {A}_4\) used by the reasoner. Figure 4 depicts the resulting provenance overview.
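For intuition, the same inference can be read set-theoretically. The steps below are a hand-written argument, not the rule-based derivation over \(r_1,\dots , r_8\) recorded by the reasoner. They assume that \(\mathsf {A}_1\) is \(\mathsf {a=d}\) and \(\mathsf {A}_2\) is \(\mathsf {b\subsetneq e}\), following the order in which the articulations are listed.

\[
\begin{aligned}
&(1)\quad \mathsf{f} \subseteq \mathsf{d} = \mathsf{a} = \mathsf{b}\cup \mathsf{c} && \text{(\(T_2\) isa, \(\mathsf{A}_1\), \(T_1\) coverage)}\\
&(2)\quad \mathsf{f}\cap \mathsf{b} \subseteq \mathsf{f}\cap \mathsf{e} = \emptyset && \text{(\(\mathsf{A}_2\), \(T_2\) sibling disjointness)}\\
&(3)\quad \mathsf{f} \subseteq \mathsf{c} && \text{(from (1) and (2))}\\
&(4)\quad \text{there is some } x \in \mathsf{e}\setminus \mathsf{b} && \text{(\(\mathsf{A}_2\): the inclusion is proper)}\\
&(5)\quad x \in \mathsf{c} && \text{(\(x\in \mathsf{e}\subseteq \mathsf{d}=\mathsf{a}=\mathsf{b}\cup \mathsf{c}\) and \(x\notin \mathsf{b}\))}\\
&(6)\quad x \notin \mathsf{f} && \text{(\(x\in \mathsf{e}\) and \(\mathsf{e}\cap \mathsf{f}=\emptyset\))}\\
&(7)\quad \mathsf{f} \subsetneq \mathsf{c} && \text{(from (3), (5), and (6))}
\end{aligned}
\]

Only \(\mathsf {A}_1\), \(\mathsf {A}_2\), and constraints of \(T_1,T_2\) appear in these steps, which matches the claim that \(\mathsf {f}\subsetneq \mathsf {c}\) follows without \(\mathsf {A}_3\) or \(\mathsf {A}_4\).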

**Related Work.** Data provenance is an actively researched area and is closely related to proofs and derivations in logical reasoning. Our inconsistency explanation is based on Reiter’s model-based diagnosis [6], which has been studied extensively and applied to many areas, e.g., type error debugging, circuit diagnosis, and OWL debugging. We have adapted the HST algorithm in [4] to compute all MIS and MCS for inconsistency explanation. The problem was shown to be Trans-Enum-complete by Eiter and Gottlob [2]. Inspired by the ideas of provenance semirings [3] and Datalog debugging [5], our approach explains the derivation of the inferred relations.

## Notes

### Acknowledgments

Supported in part by NSF IIS-1118088 and DBI-1147273.

## References

- 1. Chen, M., Yu, S., Franz, N., Bowers, S., Ludäscher, B.: Euler/X: A toolkit for logic-based taxonomy integration. In: 22nd International Workshop on Functional and (Constraint) Logic Programming (WFLP), Kiel, Germany (2013)
- 2. Eiter, T., Gottlob, G.: Hypergraph transversal computation and related problems in logic and AI. In: Flesca, S., Greco, S., Leone, N., Ianni, G. (eds.) JELIA 2002. LNCS (LNAI), vol. 2424, pp. 549–564. Springer, Heidelberg (2002)
- 3. Green, T., Karvounarakis, G., Tannen, V.: Provenance semirings. In: ACM Symposium on Principles of Database Systems (PODS), pp. 31–40 (2007)
- 4. Horridge, M., Parsia, B., Sattler, U.: Explaining inconsistencies in OWL ontologies. In: Godo, L., Pugliese, A. (eds.) SUM 2009. LNCS, vol. 5785, pp. 124–137. Springer, Heidelberg (2009)
- 5. Köhler, S., Ludäscher, B., Smaragdakis, Y.: Declarative datalog debugging for mere mortals. In: Barceló, P., Pichler, R. (eds.) Datalog 2.0 2012. LNCS, vol. 7494, pp. 111–122. Springer, Heidelberg (2012)
- 6. Reiter, R.: A theory of diagnosis from first principles. Artif. Intell. **32**(1), 57–95 (1987)