1 Introduction and Motivation

A challenging issue in data quality is to automatically check the quality of a source dataset and then to identify cleaning activities, namely a sequence of actions able to cleanse a dirty dataset. Data quality is a domain-dependent concept, usually defined as “fitness for use”, thus reaching a satisfying level of data quality strongly depends on the analysis purposes. Focusing on consistency, which can be seen as “the violation of semantic rules defined over a set of data items” [1], the state-of-the-art solutions mainly rely on functional dependencies (FDs) and their variants, which are powerful in specifying integrity constraints. Consistency requirements are usually defined on either a single tuple, two tuples, or a set of tuples [4]. Though the first two kinds of constraints can be modelled through FDs, the latter requires reasoning with a (finite but variable in length) set of data items (e.g., time-related data), and this makes the use of FD-based approaches ineffective (see, e.g., [4, 10]). This is the case of longitudinal data (aka historical or time-series data), which provide knowledge about a given subject, object, or phenomenon observed at multiple sampled time points. In addition, it is well known that FDs are expressive enough to model static constraints, which evaluate the current state of the database, but they do not take into account how the database state has evolved over time [3]. Furthermore, though FDs enable the detection of errors, they cannot be used as guidance to fix them [9].

In such a context, graph or tree formalisms are also deemed appropriate to model the expected data behaviour, which formalises how the data should evolve over time to be considered consistent, and this makes exploration-based techniques (such as AI Planning) good candidates for the data quality task. The idea that underlies our work is to cast the problem of checking the consistency of a set of data items as a planning problem. This, in turn, allows using off-the-shelf AI planning tools to perform two separate tasks: (i) to catch inconsistencies and (ii) to synthesise a sequence of actions able to cleanse any (modelled) inconsistency found in the data. In this paper we summarise results from our recent works on data consistency checking [15] and cleaning [2, 14].

AI Planning at a Glance. Planning in Artificial Intelligence is about the decision making performed by computer programs when trying to achieve some goal. It requires synthesising a sequence of actions that will transform a system configuration, step by step, into the desired one (i.e., the goal state). Roughly, planning requires two main elements: (i) the domain, i.e., a set of states of the environment S together with a set of actions A specifying the transitions between these states; (ii) the problem, which consists of the set of facts whose composition determines an initial state \(s_0 \in S\) of the environment, and a set of facts \(G \subseteq S\) that models the goals of the planning task. A solution (aka plan) is a bounded sequence of actions \(a_1,\ldots ,a_n\) that can be applied to reach a goal configuration. Planning formalisms are expressive enough to model complex temporal constraints; hence, a cleaning approach based on AI planning might allow domain experts to concentrate on what quality constraints have to be modelled rather than on how to check them. Recently, AI Planning contributed to the trace alignment problem in the context of Business Process Modelling [5].
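To make these elements concrete, the following is a minimal sketch (not the encoding used in our tool) of a planning domain and problem, with states as sets of facts, actions as precondition/effect pairs, and a breadth-first search standing in for the planner; all names and the representation are illustrative assumptions.

```python
# Illustrative sketch: states as frozensets of facts, actions with
# preconditions and add/delete effects, and a breadth-first search that
# returns a bounded plan reaching a state satisfying the goal facts.
from collections import deque
from typing import FrozenSet, List, NamedTuple, Optional

class Action(NamedTuple):
    name: str
    preconditions: FrozenSet[str]   # facts that must hold to apply the action
    add_effects: FrozenSet[str]     # facts made true by the action
    del_effects: FrozenSet[str]     # facts made false by the action

def applicable(state: FrozenSet[str], a: Action) -> bool:
    return a.preconditions <= state

def successor(state: FrozenSet[str], a: Action) -> FrozenSet[str]:
    return (state - a.del_effects) | a.add_effects

def bfs_plan(s0: FrozenSet[str], goal: FrozenSet[str],
             actions: List[Action]) -> Optional[List[str]]:
    """Return a sequence of action names reaching a goal state, or None."""
    frontier = deque([(s0, [])])
    visited = {s0}
    while frontier:
        state, plan = frontier.popleft()
        if goal <= state:                      # all goal facts hold
            return plan
        for a in actions:
            if applicable(state, a):
                nxt = successor(state, a)
                if nxt not in visited:
                    visited.add(nxt)
                    frontier.append((nxt, plan + [a.name]))
    return None
```

With the domain encoded this way, the initial state \(s_0\) and the goal facts \(G\) are the only problem-specific inputs, which is what allows the same domain model to be reused across different checking and cleaning problems.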

2 A Data Cleaning Approach Framed Within KDD

Our approach requires mapping a sequence of events to actions of a planning domain, so that AI planning algorithms can be exploited to find inconsistencies and to fix them. Intuitively, let us consider an event sequence \(\epsilon = e_0, e_1, \ldots, e_{n-1}\). Each event \(e_i\) contains a number of observation variables whose evaluation determines a snapshot of the subject’s state at time point i, namely \(s_i\). Then, the evaluation of any further event \(e_{i+1}\) might change the value of one or more state variables of \(s_i\), generating a new state \(s_{i+1}\).

We encode the expected subjects’ behaviour (the so-called consistency model) as a transition system. A consistent trajectory represents a sequence of events that does not violate any consistency constraint. Given an event sequence \(\epsilon\) as input, the planner deterministically determines a trajectory \(\pi = s_0 e_0 s_1 \ldots s_{n-1} e_{n-1} s_n\) on the explored finite state system (i.e., a plan), where each state \(s_{i+1}\) results from applying event \(e_i\) to \(s_i\). Once a model describing the evolution of an event sequence has been defined, we detect quality issues by solving a planning problem where a consistency violation is the goal condition. If a plan is found by a planning system, the event sequence is marked as inconsistent in the original data quality problem. Our system works in three steps (Fig. 1), sketched in code below.
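As a simplified sketch of the consistency check (the actual model is encoded for the planner), the replay below assumes a hypothetical `transition(state, event)` function that returns the successor state \(s_{i+1}\), or None when applying the event would violate a constraint of the consistency model.

```python
# Illustrative sketch: replay an event sequence over the consistency model
# and report the index of the first inconsistent event, if any.
from typing import Callable, Iterable, List, Optional, Tuple

State = dict   # a snapshot of the subject's state variables
Event = dict   # the observation variables of one event

def replay(events: Iterable[Event], s0: State,
           transition: Callable[[State, Event], Optional[State]]
           ) -> Tuple[List[State], Optional[int]]:
    """Return the trajectory s_0, s_1, ... and the index of the first
    inconsistent event (None if the whole sequence is consistent)."""
    trajectory = [s0]
    state = s0
    for i, event in enumerate(events):
        nxt = transition(state, event)
        if nxt is None:              # the event violates the consistency model
            return trajectory, i
        trajectory.append(nxt)
        state = nxt
    return trajectory, None
```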

Step 1 [Universal Checker]. We simulate the execution of all the event sequences - within a finite horizon - summarising all the inconsistencies found during the exploration into an object we call the Universal Checker (UCK), which represents a taxonomy of the inconsistencies that may affect a data source. The computed UCK can be seen as a list of tuples \((id, s_i, a_i)\), each specifying that the inconsistency with identifier id might arise in a state \(s_i\) as a consequence of applying \(a_i\).
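The sketch below illustrates one way such a finite-horizon exploration could be organised; the `possible_events` and `transition` callbacks are hypothetical stand-ins for the consistency model, and the identifiers are simply assigned in discovery order.

```python
# Illustrative sketch: enumerate all event sequences up to a finite horizon
# and record each distinct (state, event) pair that triggers a violation,
# yielding the Universal Checker as a list of (id, state, event) tuples.
from typing import Callable, Dict, Hashable, List, Optional, Tuple

def build_uck(s0: Hashable, horizon: int,
              possible_events: Callable[[Hashable], List[Hashable]],
              transition: Callable[[Hashable, Hashable], Optional[Hashable]]
              ) -> List[Tuple[int, Hashable, Hashable]]:
    uck: Dict[Tuple[Hashable, Hashable], int] = {}
    frontier = [(s0, 0)]
    seen = {(s0, 0)}
    while frontier:
        state, depth = frontier.pop()
        if depth == horizon:
            continue
        for event in possible_events(state):
            nxt = transition(state, event)
            if nxt is None:                        # inconsistency found
                uck.setdefault((state, event), len(uck))
            elif (nxt, depth + 1) not in seen:
                seen.add((nxt, depth + 1))
                frontier.append((nxt, depth + 1))
    return [(ident, s, e) for (s, e), ident in uck.items()]
```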

Step 2 [Universal Cleanser]. For any given tuple \((id, s_i, a_i)\) of the Universal Checker, we construct a new planning problem which differs from the previous one in terms of both initial and goal states: (i) the new initial state is \(s_i\), i.e., a consistent state in which applying the action \(a_i\) would lead to an inconsistent state \(s_{i+1}\); (ii) the new goal is to be able to “execute action \(a_i\)”. Intuitively, a cleaning action sequence applied to state \(s_i\) transforms it into a state \(s_{j}\) where action \(a_i\) can be applied without violating any consistency rule. To this end, the planner explores the state space and collects all the optimal corrections according to a given criterion. The output of this phase is a Universal Cleanser. Informally, it can be seen as a set of policies, computed off-line, able to bring the system to the goal from any state reachable from the initial ones (see, e.g., [8, 12]). In our context, the Universal Cleanser is a lookup table that returns a sequence of actions able to fix an event \(e_i\) occurring in a state \(s_j\).
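A minimal sketch of this synthesis step is given below; `plan_to_enable` is a hypothetical wrapper around the planner that returns a cost-optimal action sequence making the offending event applicable from the given state, and the lookup table is keyed by (state, event) pairs.

```python
# Illustrative sketch: for each Universal Checker tuple, solve a planning
# problem whose goal is "the event can be applied without violating a
# constraint" and store the repair in the Universal Cleanser lookup table.
from typing import Callable, Dict, Hashable, List, Optional, Tuple

def build_cleanser(uck: List[Tuple[int, Hashable, Hashable]],
                   plan_to_enable: Callable[[Hashable, Hashable],
                                            Optional[List[str]]]
                   ) -> Dict[Tuple[Hashable, Hashable], List[str]]:
    cleanser: Dict[Tuple[Hashable, Hashable], List[str]] = {}
    for ident, state, event in uck:
        fix = plan_to_enable(state, event)   # cost-optimal cleaning sequence
        if fix is not None:
            cleanser[(state, event)] = fix
    return cleanser
```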

Fig. 1. A graphical representation of the Consistency Verification and Cleaning Process.

Step 3 [Cleanse the Data]. Given a set of event sequences \(D = \{\epsilon_1, \ldots, \epsilon_n\}\), the system uses the planner to verify the consistency of each \(\epsilon_i\). If an inconsistency is found, the system retrieves its identifier from the Universal Checker and then selects the cleaning action sequence through a look-up on the Universal Cleanser.
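The loop below sketches this last step under the same illustrative assumptions as the previous snippets: states and events are hashable, `transition` encodes the consistency model, and the pre-computed cleanser maps a (state, event) pair to the cleaning actions to splice in before the offending event; it is not the actual API of our tool.

```python
# Illustrative sketch of Step 3: verify each sequence and, when an
# inconsistency is found at position i, splice the pre-computed cleaning
# sequence for (s_i, e_i) before the offending event and re-check.
from typing import Callable, Dict, Hashable, List, Optional, Tuple

def clean_dataset(dataset: List[List[Hashable]], s0: Hashable,
                  transition: Callable[[Hashable, Hashable], Optional[Hashable]],
                  cleanser: Dict[Tuple[Hashable, Hashable], List[Hashable]]
                  ) -> List[List[Hashable]]:
    cleaned = []
    for events in dataset:
        events = list(events)
        i, state = 0, s0
        while i < len(events):
            nxt = transition(state, events[i])
            if nxt is None:                          # inconsistency at e_i
                fix = cleanser.get((state, events[i]))
                if fix is None:                      # no pre-computed repair
                    break
                events[i:i] = list(fix)              # splice cleaning actions
                continue                             # re-check from the same i
            state, i = nxt, i + 1
        cleaned.append(events)
    return cleaned
```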

The Universal Cleanser presents two important features that make it effective in dealing with real data: first, it is synthesised off-line and only summarises cost-optimal action sequences. Clearly, the cost function is domain-dependent and usually driven by the purposes of the analysis (we discussed how to select among different cleaning alternatives in [13, 14]). Second, the UC is data-independent, as it has been synthesised by considering all the (bounded) event sequences; thus, any data source conforming to the model can be handled. Our approach has been implemented on top of the UPMurphi planner [6, 7].

Real-life Application. Our approach has been applied to the mandatory communication domain, which models labour market data of Italian citizens at the regional level. Here, inconsistencies represent career transitions not permitted by Italian Labour Law. Thanks to our approach, we synthesised both the Universal Checker and the Universal Cleanser for the domain (i.e., 342 distinct inconsistencies were found and up to 3 cleaning action sequences were synthesised for each). The system has been employed within the KDD process that analysed the real career sequences of 214,432 citizens, composed of 1,248,751 mandatory notifications. For details about the quality assessment see [15], whilst for cleaning details see [14].

3 Concluding Remarks

We presented a general approach that expresses data quality and cleaning tasks in terms of AI Planning problems, connecting two distinct research areas. Our approach has been formalised and fully implemented on top of the UPMurphi planner, and applied to a real-life example, analysing and cleaning over a million records concerning labour market movements of Italian citizens.

We are working on (i) including machine-learning algorithms to identify the most suitable cleaning action, and (ii) applying our approach to build training sets for machine-learning-based data cleaning tools (e.g., [11]).