Discovering Evolving Temporal Information: Theory and Application to Clinical Databases

Functional dependencies (FDs) allow us to represent database constraints, corresponding to requirements such as “patients having the same symptoms undergo the same medical tests.” Some research efforts have focused on extending such dependencies to also capture temporal constraints, such as “patients having the same symptoms undergo the same medical tests in the next period.” Temporal functional dependencies are able to represent this kind of temporal constraint in relational databases. Another extension of FDs allows one to represent approximate functional dependencies (AFDs), as in “patients with the same symptoms generally undergo the same medical tests.” It enables data to deviate from the defined constraints according to a user-defined percentage. Approximate temporal functional dependencies (ATFDs) merge the concepts of temporal functional dependency and approximate functional dependency. Among the different kinds of ATFD, the Approximate Pure Temporally Evolving Functional Dependencies (APE-FDs for short) allow one to detect patterns in the evolution of data in the database and to discover dependencies such as “For most patients with the same initial diagnosis, the same medical test is prescribed after the occurrence of the same symptom.” Mining ATFDs from large databases may be computationally expensive.
In this paper, we focus on APE-FDs and prove that, unfortunately, verifying a single APE-FD over a given database instance is in general NP-complete. In order to cope with this problem, we propose a framework for mining complex APE-FDs in real-world data collections. In the framework, we designed and applied sound and advanced model-checking techniques. To prove the feasibility of our proposal, we used real-world databases from two medical domains (namely, psychiatry and pharmacovigilance) and tested the running prototype we developed on such databases.


Introduction
For several decades, most real-world domains have faced the need to store and analyze huge and often overwhelming quantities of data, which are required both for decision making and, more broadly, for the management of complex organizations [23,29]. In this scenario, and without loss of generality, we focus in this paper on the healthcare/medical domain, where such a need arises to support clinical decision-making and healthcare policies [3]. Advanced techniques such as data mining and analysis allow medical stakeholders to extract useful knowledge from these data. In particular, it is often the case that such knowledge is inherently temporal, as it is discovered when analyzing data evolution, time series, and changes of information over time. Temporal data mining is the research area focusing on the analysis of, and discovery from, data having some specific temporal characterization [8,9,25].
Considering data stored according to the well-known relational model, functional dependencies (FDs) are usually specified to express constraints on data and to improve the quality of database schemata by deriving normal forms [2]. However, functional dependencies can also be used to derive knowledge about a given database. As an example, let us consider a simple relation describing the adverse drug reactions patients may have during a hospitalization. Such a relation stores patient demographic data, together with drugs taken and the adverse reactions possibly occurring. Moreover, a temporal attribute time-stamps the adverse reactions. Typically, patients taking the same drug may have the same adverse reaction. Thus, we can derive a functional dependency between the patient drug and the adverse reaction. It may also be that such a functional dependency holds on "most tuples," but not on all of them. Such dependencies have been named approximate functional dependencies (AFDs) [13,16]. As an example, if we consider patients affected by allergies that make their reaction to a drug unpredictable, the corresponding tuples will likely differ from all the other ones as for the dependency between drugs and adverse reactions.
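The idea of an AFD that tolerates a bounded fraction of deviating tuples can be sketched as follows; this is a minimal illustration (not the paper's algorithm), using a removal-based error measure and made-up attribute names:

```python
from collections import defaultdict

def afd_holds(rows, lhs, rhs, epsilon):
    """Check whether FD lhs -> rhs holds on `rows` once at most a
    fraction `epsilon` of the tuples is ignored (removal-based measure)."""
    groups = defaultdict(lambda: defaultdict(int))
    for row in rows:
        groups[row[lhs]][row[rhs]] += 1
    # Within each lhs-group, every tuple not carrying the majority
    # rhs value must be ignored for the FD to hold.
    removals = sum(sum(c.values()) - max(c.values()) for c in groups.values())
    return removals <= epsilon * len(rows)

# Toy instance: one allergic patient deviates from drug -> reaction.
reactions = [
    {"drug": "aspirin", "reaction": "nausea"},
    {"drug": "aspirin", "reaction": "nausea"},
    {"drug": "aspirin", "reaction": "rash"},      # allergic patient
    {"drug": "warfarin", "reaction": "bleeding"},
]
```

Here `afd_holds(reactions, "drug", "reaction", 0.25)` succeeds, while a zero tolerance fails, mirroring the allergy example above.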
Finer constraints may also be discovered. For example, when an adverse drug reaction occurs, further drugs may be prescribed just to mitigate some well-known adverse effect. Thus, the drug prescribed just after a given adverse reaction is related both to the adverse reaction and to the drug previously taken by the patient. In such a case, prescribed drugs and related adverse reactions determine the drugs administered next. We call this dependency a temporal functional dependency (TFD).
Approximate functional dependencies have been extensively considered, and some tools have been proposed for deriving such dependencies [13,14,16,19]. On the other hand, temporal functional dependencies have been proposed according to different perspectives and considering different kinds of temporal features [7,15,30–32]. To the best of our knowledge, only a few recent studies have focused on approximate temporal functional dependencies [4,5,11,26].
In this paper, we continue such studies by considering a different kind of approximate TFD and its application to data from clinical domains. More specifically, we adopt the framework for temporal functional dependencies proposed by Combi et al. in [7], which allows the specification of multiple kinds of temporal functional dependencies. According to this framework, we consider here the issue of mining (approximate) temporal functional dependencies based on tuple temporal evolution. Temporal evolution of tuples was originally proposed by Vianu [30] for the characterization of dynamic functional dependencies (DFDs), which allow one to specify constraints on the evolution of tuples in consecutive snapshots of a temporal database. Here, we consider the characterization of DFDs introduced in [7], called Pure Temporally Evolving TFDs (PE-FDs). In particular, we consider the problem of extracting all Approximate PE-FDs, called APE-FDs, from a given temporal medical database.
Before moving to the more experimental side of our work, we provide a "negative," yet interesting, result about the complexity of checking APE-FDs. First, we prove that checking a single APE-FD against a database instance is NP-complete in the size of the instance (i.e., in data complexity). Moreover, we noticed that the NP-completeness of this problem heavily relies on instances that are fictitious and imply properties of data that are unreasonable in many contexts, such as the clinical one. We thus devised a series of optimizations and heuristics that improve performance with respect to the more general problem of checking an APE-FD against a database instance.
As we pointed out, mining APE-FDs introduces many computational challenges that require techniques inherited from different fields of Computer Science (e.g., model checking and combinatorial optimization). We embedded such techniques in a framework that has been implemented as a running prototype and applied to data from the pharmacovigilance and psychiatry domains. With respect to the preliminary results presented in [10,27], we focus here only on APE-FDs and do not consider the related temporal association rules [10,27]. We propose here a new, stronger and more focused definition of PE-FD and of the related APE-FD, also introducing a bounded version of temporal evolution of data. Moreover, we provide a detailed discussion and proof of our theoretical results, with a significantly improved and extended presentation and with new and more complete examples. Furthermore, we introduce a completely new section, in which we propose a couple of novel optimization techniques for solving the problem of checking APE-FDs.
In the following, "Background and Related Work" section describes the background and the related work. "Discovering Pure Temporally Evolving Functional Dependencies" section formally introduces the concepts of PE-FD and APE-FD. "Some Motivating Clinical Scenarios" section introduces some motivating clinical scenarios using PE-FDs and APE-FDs. "The Computational Complexity of Checking APE-FD" section proves the NP-hardness of checking an APE-FD against a given temporal database. "Algorithms for Checking APE-FDs" section provides a description of the algorithm that checks a single APE-FD against a given database, plus a series of optimizations and heuristics that may be implemented in order to speed up such a verification process. "Mining APE-FDs" section provides a high-level description of the main features of our prototype for mining such dependencies and of the main ideas underlying its implementation; then, it presents interesting mined APE-FDs from the psychiatry and pharmacovigilance domains; in the last part of this section, we analyze the performance of the implemented prototype. "Conclusions" section draws some conclusions and sketches possible directions for future research.

Background and Related Work
In this section, we introduce and discuss the main definitions and concepts we will use through this paper. We first recall the definition of functional dependency (FD). Then, we introduce some extensions of FDs, i.e., temporal functional dependencies (TFDs) and approximate functional dependencies (AFDs). The definition of approximate temporal functional dependency (ATFD) is grounded on these concepts. Figure 1 depicts the relationships among such kinds of functional dependencies.

Functional Dependencies and Their Temporal Extensions
The concept of functional dependency (FD) comes from database theory [2].

Definition 1 Let r be a relation over schema R. Let X, Y be sets of attributes of R. A functional dependency between X and Y represents the constraint that, for all pairs of tuples t and t′ in r having the same value(s) on attribute(s) X, the corresponding value(s) on Y for those tuples are identical.
More formally, relation r satisfies functional dependency X → Y if the following condition holds:

∀t, t′ ∈ r (t[X] = t′[X] → t[Y] = t′[Y])

Temporal functional dependencies (TFDs) have been proposed as extensions of (atemporal) functional dependencies [33]. As an example, we may represent the constraint that a pathology functionally determines a corresponding drug, but only considering tuples month by month. In other words, patients with the same pathology are treated with a common drug during some month, while in another month the same patients affected by the same pathology may take another (common) drug. Combi et al. proposed a framework for TFDs that subsumes and extends previous proposals [7]. They use a temporal relational data model supporting the notion of temporal relation. Each relation is equipped with a time-stamping temporal attribute VT, which represents the valid time, i.e., the time when the fact is true in the represented real world [6]. VT has values in a domain T isomorphic to ℕ.
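The month-by-month reading of a TFD can be sketched as follows; a minimal illustration with hypothetical attribute names, assuming calendar months as temporal granules:

```python
from collections import defaultdict
from datetime import date

def fd_holds(rows, lhs, rhs):
    """Plain FD check: equal lhs values imply equal rhs values."""
    seen = {}
    for row in rows:
        if row[lhs] in seen and seen[row[lhs]] != row[rhs]:
            return False
        seen[row[lhs]] = row[rhs]
    return True

def tfd_holds_monthly(rows, lhs, rhs):
    """TFD with monthly grouping: the FD must hold within each month,
    while it may be violated across different months."""
    by_month = defaultdict(list)
    for row in rows:
        by_month[(row["VT"].year, row["VT"].month)].append(row)
    return all(fd_holds(g, lhs, rhs) for g in by_month.values())

# Toy instance: the drug for a pathology changes from one month to the next.
therapy = [
    {"pathology": "flu", "drug": "A", "VT": date(2024, 1, 3)},
    {"pathology": "flu", "drug": "A", "VT": date(2024, 1, 20)},
    {"pathology": "flu", "drug": "B", "VT": date(2024, 2, 5)},
]
```

On this instance the plain FD pathology → drug fails, while the monthly TFD holds.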
Two temporal views allow joining tuples that satisfy specific temporal conditions, which represent relevant cases of (temporal) evolution. On the basis of the introduced data model, and leveraging such temporal views, we may provide a definition for TFDs.
In other words, FD X → Y must be satisfied by each relation obtained from the evolution expression by selecting those tuples whose valid times belong to the same temporal group. Temporal grouping enables us to group tuples together over a set of temporal granules, based on VT. Four different classes of TFD have been proposed in [7]:

• Pure temporally grouping TFD: E-Exp(R) returns the original temporal relation. These TFDs force FD X → Y, where X, Y ⊆ U, to hold over each set of tuples temporally grouped according to their VT;
• Pure temporally evolving TFD: E-Exp(R) specifies how to derive the tuples modeling the evolution of objects. No temporal grouping exists, i.e., all the tuples are considered together;
• Temporally mixed TFD: in this case, after evaluating the expression E-Exp(R), a temporal grouping t-Group is performed;
• Temporally hybrid TFD: first, the evolution expression E-Exp(R) allows the selection of those tuples that are needed to represent the evolution of real-world objects/concepts; then, temporal grouping is applied to the selected tuples.

(Fig. 1 gives a graphical account of the IS_A relationships between functional dependency (FD), approximate functional dependency (AFD), temporal functional dependency (TFD) and approximate temporal functional dependency (ATFD).)
In the remainder of the paper, we shall focus on Pure Temporally Evolving TFDs only.

Approximate Functional Dependencies and Their Temporal Extensions
The concept of approximate functional dependency (AFD) is defined moving from the concept of plain FD. In fact, given a relation r where an FD holds for most tuples, we may identify some tuples for which that FD does not hold. Consequently, we can define some measurements of the error we make in considering the FD to hold on r. One measurement [16] we can apply is known as G1, which considers the number of violating pairs of tuples. Formally:

G1(X → Y, r) = |{(t, t′) : t, t′ ∈ r ∧ t[X] = t′[X] ∧ t[Y] ≠ t′[Y]}|

The related scaled measurement g1 is defined as follows:

g1(X → Y, r) = G1(X → Y, r) / |r|²

where |r| is the cardinality of relation r, i.e., the number of tuples belonging to r.
Another measurement [16] we can apply is known as G2, which considers the number of tuples that violate the functional dependency. Formally:

G2(X → Y, r) = |{t ∈ r : ∃t′ ∈ r (t[X] = t′[X] ∧ t[Y] ≠ t′[Y])}|

The related scaled measurement g2 is defined as follows:

g2(X → Y, r) = G2(X → Y, r) / |r|

Topics related to approximate functional dependencies have been studied for several years [13,14,16,19]. By contrast, to the best of our knowledge, very few studies have focused on approximate temporal functional dependencies [4,5,11,26]. In [4], Combi et al. consider the problem of mining approximate TFDs with different kinds of temporal grouping on clinical data. In [11,26], Sala and Combi extend the concept of approximate TFD to deal with interval-based TFDs. In this paper, we continue such studies by considering a different kind of approximate TFD and its application to data from clinical domains.
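The G1/g1 and G2/g2 measures can be computed naively as follows; a quadratic-time sketch on a toy relation (attribute names are made up):

```python
def g1(rows, lhs, rhs):
    """Scaled G1: fraction of (ordered) tuple pairs violating lhs -> rhs."""
    viol = sum(1 for t in rows for u in rows
               if t[lhs] == u[lhs] and t[rhs] != u[rhs])
    return viol / len(rows) ** 2

def g2(rows, lhs, rhs):
    """Scaled G2: fraction of tuples involved in at least one violation."""
    viol = sum(1 for t in rows
               if any(t[lhs] == u[lhs] and t[rhs] != u[rhs] for u in rows))
    return viol / len(rows)

# Toy relation with a single deviating tuple for Drug -> React.
rows = [
    {"Drug": "D1", "React": "nausea"},
    {"Drug": "D1", "React": "nausea"},
    {"Drug": "D1", "React": "rash"},
    {"Drug": "D2", "React": "headache"},
]
```

On this relation g1 yields 4/16 = 0.25 (four ordered violating pairs) and g2 yields 3/4 = 0.75 (three tuples touched by a violation).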

Discovering Pure Temporally Evolving Functional Dependencies
In the following, we focus on Pure Temporally Evolving Functional Dependencies (PE-FDs for short), as specified in the framework proposed in [7]. Our temporal functional dependencies will be given on a temporal schema R = U ∪ {VT}, where U is a set of atemporal attributes and VT is a special attribute denoting the valid time of each tuple. Hereinafter, we assume tuples time-stamped with natural numbers (i.e., Dom(VT) = ℕ). Let J ⊆ U be a nonempty subset of U. We define the set W as W = U∖J and the set W̄, which is basically a renaming of the attributes in W. Formally, for each attribute A ∈ W, we have Ā ∈ W̄ (i.e., W̄ = {Ā : A ∈ W}).

Definition 3 (Views Evolution and Bounded Evolution)
Given an instance r of R, an instance r^J of schema R_ev = J W W̄ {VT, VT̄} is defined as follows. Schema R_ev is called the evolution schema of R. We will denote by r^J the view Evolution on R, built by the corresponding evolution expression for every instance r of R. View r^J joins two tuples t1 and t2 of r that agree on the values of the attributes in J (i.e., t1[J] = t2[J]) and such that t2 is the immediate successor of t1, with respect to VT, among the tuples agreeing on J. For application purposes, it is important to consider in an evolution schema only those pairs of consecutive tuples whose difference between VT̄ and VT is within some given bound. Given a parameter k ∈ ℕ ∪ {+∞}, tuples of r^J are filtered by means of the selection τ_k(r^J) = σ_{VT̄−VT≤k}(r^J) (notice that τ_{+∞}(r^J) = r^J).
We will denote by τ_k(r^J) the view Bounded Evolution. It forces us to consider only those tuples belonging to r^J having a temporal distance within the given threshold k. In the following, given a tuple t ∈ r^J, we denote its temporal distance t[VT̄] − t[VT] by Δ(t). Let us now define, by using the introduced temporal view Evolution, a slightly restricted version of Pure Temporally Evolving Functional Dependency with respect to the one defined in [7]. Without loss of generality, such a definition allows us to simplify the notation and to focus on a general kind of temporal evolution of the considered data.
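The Evolution and Bounded Evolution views can be sketched as follows; this is an illustrative reading (not the paper's relational-algebra definition), assuming that a tuple's immediate successors are all the tuples of its J-partition carrying the next distinct VT value:

```python
from collections import defaultdict

def evolution_view(rows, J, k=None):
    """Build the (bounded) Evolution view: partition tuples by their
    J-values, sort each partition by VT, and pair every tuple with its
    immediate successor(s); with a finite k, keep only pairs whose VT
    distance is at most k."""
    parts = defaultdict(list)
    for row in rows:
        parts[tuple(row[a] for a in J)].append(row)
    pairs = []
    for part in parts.values():
        part.sort(key=lambda r: r["VT"])
        for i, t in enumerate(part):
            later = [u for u in part[i + 1:] if u["VT"] > t["VT"]]
            if not later:
                continue
            next_vt = later[0]["VT"]  # the next distinct timestamp
            for u in later:
                if u["VT"] == next_vt and (k is None or u["VT"] - t["VT"] <= k):
                    pairs.append((t, u))
    return pairs

# Toy instance: three calls by Ann and one by Bob (made-up data).
calls = [
    {"Name": "Ann", "Phys": "P1", "VT": 1},
    {"Name": "Ann", "Phys": "P2", "VT": 2},
    {"Name": "Ann", "Phys": "P1", "VT": 4},
    {"Name": "Bob", "Phys": "P3", "VT": 3},
]
```

Here `evolution_view(calls, ["Name"])` yields the pairs (VT 1, VT 2) and (VT 2, VT 4) for Ann, while the bound k = 1 discards the second pair.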

Some Motivating Clinical Scenarios
In this section, we describe and discuss two scenarios, borrowed from the clinical domain, in order to provide examples of how PE-FDs and APE-FDs work. The first scenario is taken from a psychiatric case register. Let us consider the temporal schema Contact = {Name, Phys, CT, Dur} ∪ {VT}. Such a schema stores values about a phone-call service provided to psychiatric patients. This service is intended for monitoring and helping psychiatric patients who are not hospitalized. Whenever a patient feels the need to talk to a physician, she can call the service. Data about calls are collected according to schema Contact. For the sake of simplicity, temporal attribute VT identifies the day when the call was received. In addition, the service may also be used by people somehow related to patients, for instance, relatives worried about the current condition of a patient.
More precisely, attribute Name identifies patients, Phys identifies physicians, CT (Contact Type) specifies the person who is making the call (e.g., value "self" stands for the patient himself, "family" for a relative), and Dur stores information about the total duration of calls (value ∼n means approximately n minutes). An instance r of Contact is provided in Fig. 3. Instance r^Name, and r^J in general, may be seen as the output of a two-phase procedure. First, table Contact is partitioned into subsets of tuples, one for each value of Name. Then, each tuple is joined with its immediate successor in its partition, w.r.t. VT values. The whole relation r^Name is provided in Fig. 4. In the following, we will use t for referencing tuples of r and u for referencing tuples of r^J. Moreover, each tuple u in r^J will be identified by the pair of indexes of the tuples in r that generate it. For instance, the first tuple of r^J in Fig. 4 will be denoted by u1,2, since it is generated by the join of tuples t1 and t2 in r. Going back to our example, it is worth noting that tuples t2 and t7 are not joined in r^Name, even if t7[VT] = t2[VT] + 2 and there is no tuple t with t[VT] = t7[VT] + 1. This is due to the fact that t7[Name] ≠ t2[Name] forbids the join in r^Name. Moreover, t1 and t3 are not joined in r^Name: indeed, the presence of tuple t2 forbids the join in r^Name. Figure 5 graphically depicts how the pairs of tuples (t1, t2), (t2, t3), (t3, t4), (t4, t5), (t5, t6) and (t7, t8), (t8, t9), (t9, t10), (t10, t11), (t11, t12) are joined in r^Name for the two patients, respectively. Basically, each tuple u ∈ r^Name corresponds to an edge in Fig. 5, while we have a node for each tuple in r.
Let us now discuss some temporal dependencies we can derive from such data. We could be interested in verifying whether there is some relationship between some previous features of a patient's call and the fact that the considered call was made either by him or by a relative. In our example, we have that r ⊧ [τ_5(Contact^Name)]Phys, Phys̄ → CT̄. In other words, given consecutive calls related to the same patient within 5 days, the pair composed of the physician of the first call and the physician of the next one determines the type of contact of the next call. And it holds for all patients. However, if we consider a wider time window of 6 days, we have that r ⊭ [τ_6(Contact^Name)]Phys, Phys̄ → CT̄, because of pairs (t2, t3) and (t10, t11).

Fig. 3 An instance of schema Contact that stores the phone contacts of two psychiatric cases. Attribute # represents the tuple number, and it is used only for referencing tuples in the text (i.e., # does not belong to schema Contact)
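Checking a PE-FD over the pairs produced by the evolution view can be sketched as follows; a minimal illustration in which X and Xbar are the attribute lists taken from the earlier and later tuple of each pair, respectively (names and data are made up):

```python
def pe_fd_violations(pairs, X, Xbar, Z):
    """FD check over an evolution view: the values of X on the earlier
    tuple together with Xbar on the later one must determine Z on the
    later tuple. Returns the conflicting pairs of joined tuples."""
    seen = {}
    conflicts = []
    for t, u in pairs:
        key = tuple(t[a] for a in X) + tuple(u[a] for a in Xbar)
        val = tuple(u[a] for a in Z)
        if key not in seen:
            seen[key] = (val, (t, u))
        elif seen[key][0] != val:
            conflicts.append((seen[key][1], (t, u)))
    return conflicts

# Two joined pairs with the same (Phys, Phys-bar) but different CT-bar.
p1 = ({"Phys": "P1"}, {"Phys": "P2", "CT": "self"})
p2 = ({"Phys": "P1"}, {"Phys": "P2", "CT": "family"})
```

The pair (p1, p2) above is conflict-generating, so the corresponding PE-FD does not hold on these two pairs.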

SN Computer Science
The two proposed PE-FDs differ only in the maximum temporal distance allowed. In particular, tuple u10,11 is one of those responsible for r ⊭ [τ_6(Contact^Name)]Phys, Phys̄ → CT̄, but it does not belong to τ_5(Contact^Name) because Δ(u10,11) > 5. This allows us to point out a general property of PE-FDs, and of APE-FDs too: given a PE-FD that is satisfied with bound k, the same dependency is satisfied with every bound k′ ≤ k, since τ_k′(r^J) ⊆ τ_k(r^J). Moving to the problem of mining approximate dependencies, if we consider the APE-FD [τ_6(Contact^Name)]Phys, Phys̄ →_ε CT̄ with ε = 1/12, we have that r ⊧ [τ_6(Contact^Name)]Phys, Phys̄ →_ε CT̄. Indeed, considering relation r′ = r∖{t3}, the dependency holds without any need for approximation (i.e., it suffices to delete tuple t3 from relation r). More precisely, we have r′^Name = r^Name∖{u2,3, u3,4} ∪ {u2,4}. Tuple u2,4 was not originally in r^Name because of the presence of tuple t3.

The second example we propose is borrowed from the internal medicine domain. As another simple example of how view r^J works, let us consider the temporal schema ThCy = {PatId, Phys, Dos} ∪ {VT}. Such a schema allows one to store values about cycles of therapy in which a specific, fixed drug is administered to a patient by a given physician. Figure 6a depicts an instance r of ThCy. Figure 6b shows the result of view r^PatId. In Fig. 7a, we have a graphical account of how the pairs of tuples (t1, t2), (t2, t3), (t3, t4), (t5, t6), (t6, t7) and (t7, t8) are joined in r^PatId. In both graphs depicted in Fig. 7, nodes are labeled with the tuple number and the value of the Dos attribute is reported above each node. Moreover, we recall from the previous example that each edge (ti, tj) is labeled with the value tj[VT] − ti[VT] (i.e., the temporal distance between the two tuples). In the graph for r^PatId the value of attribute Phys is reported below each node, while for r^Phys the value of PatId is reported below each node.
We would like to point out that a tuple t ∈ r may have more than one immediate successor in r^J (i.e., it may be joined with more than one tuple). This is the case for view r^Phys shown in Fig. 7b, where tuples are joined with respect to the values of attribute Phys. Since Dr. Shepherd makes two drug administrations at VT = 20, tuple t1 has both tuples t3 and t7 as its immediate successors. We will see that the number of immediate successors of a tuple in r^J plays a major role in some of the following complexity results.
In this domain, we could be interested in understanding whether there are dependencies among previous and current drug dosages for a given patient, possibly considering the physicians administering the drug.
In the example depicted in Figs. 6 and 7, however, we have that the conflicting pairs are more than one. As a matter of fact, all pairs (u1,2, u5,6), (u2,3, u7,8) and (u3,4, u4,6) are conflict-generating. If we want to rule all the conflicts out by playing on the maximum allowed distance, we have to set it to 6, and then we have r ⊧ [τ_6(ThCy^PatId)]Dos → Dos̄.

The Computational Complexity of Checking APE-FD
In this section, we address the complexity of checking an APE-FD against an instance r. We call this problem Check-APE-FD.

Let us consider, for example, the PE-FD [τ_{+∞}(ThCy^PatId)]Phys, Phys̄ → Dos̄. We have shown above that r does not satisfy it. Figure 8 graphically reports all the possible r′^PatId where r′ is obtained from r by deleting exactly one tuple. For example, if r′ = r∖{t1}, the dotted edge (t1, t2) has been removed: t1 and t2 are not joined in r′^PatId. Moreover, if we take r′ = r∖{t2}, both edges (t1, t2) and (t2, t3) are removed and the dashed edge (t1, t3) turns out to be "active." This means that t1 and t2 are not joined in r′^PatId, and neither are t2 and t3, but t1 and t3 turn out to be joined in r′^PatId due to the absence of t2. Let us observe that, in this case, the join involving t1 and t3 belongs to r′^PatId and not to r^PatId. This specific behavior, in which the deletion of a tuple introduces additional, possibly different, constraints as a side effect, instead of just removing existing ones, gives a hint of why problem Check-APE-FD is not so easy to solve. Notice that r∖{t1} does not satisfy [τ_{+∞}(ThCy^PatId)]Phys, Phys̄ → Dos̄ either, because of the pairs (t2, t3) and (t6, t7).

Problem Check-APE-FD belongs to the complexity class NP. In order to prove that, it suffices to apply a guess-and-check algorithm. First, the algorithm guesses a subset of tuples to delete, whose size is within the bound allowed by ε; then, it checks whether the remaining instance satisfies the plain PE-FD XY → Z: if so, the algorithm returns YES, otherwise NO. In the procedure above, we implicitly make use of a function that verifies, given an instance of R and a PE-FD, whether XY → Z holds or not. We call this problem Check-PE-FD. Since there is no approximation, checking XY → Z may be performed in polynomial time [7]. For this reason, we can conclude that Check-APE-FD belongs to the complexity class NP.
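The guess-and-check argument can be turned into a deterministic (exponential) brute-force procedure, sketched below; `build_view` and `pe_fd_holds` are hypothetical callables standing for the evolution-view construction and the polynomial PE-FD check:

```python
from itertools import combinations

def check_ape_fd(rows, build_view, pe_fd_holds, epsilon):
    """Deterministic counterpart of the guess-and-check argument: the
    APE-FD holds iff deleting at most epsilon*|rows| tuples yields an
    instance whose *rebuilt* evolution view satisfies the plain PE-FD.
    Exponential in |rows| in the worst case."""
    budget = int(epsilon * len(rows))
    for size in range(budget + 1):
        for removed in combinations(range(len(rows)), size):
            kept = [r for i, r in enumerate(rows) if i not in removed]
            if pe_fd_holds(build_view(kept)):
                return True
    return False
```

Rebuilding the view for every candidate deletion is what captures the side effect discussed above: removing a tuple may create new joins, not only remove existing ones.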
In the following, we will prove that Check-APE-FD is NP-hard even in the case of the most constrained kind of APE-FDs, which is represented by the class of simple update APE-FDs. From now on, we will consider Problem 1 only for simple update APE-FDs. Considering the inclusions shown in Fig. 2, we can immediately conclude that our hardness result directly propagates to the other classes of APE-FDs.
In this section, we will make use of finite words w over a finite nonempty alphabet Σ (i.e., w ∈ Σ*). We will use the standard notation w[i] for denoting the i-th symbol of word w. Given a word w, we denote by first(w) and last(w) its first and last symbols, respectively (i.e., first(w) = w[1] and last(w) = w[|w|]). An ℕ>-sequence s is a strictly increasing sequence of positive natural numbers; we denote by first(s) and last(s) its first and last elements, respectively (i.e., first(s) = s[1] and last(s) = s[|s|]). Given a word w and an ℕ>-sequence s with last(s) ≤ |w|, we denote by w‖s the word obtained by selecting from w exactly the positions listed in s (for a graphical account of how a word is filtered by a sequence, please refer to Fig. 9). Given a word w and a pair (b, e), with 1 ≤ b ≤ e ≤ |w|, we call the word w‖[b,e] a slice of w. Given two words w1, w2 ∈ Σ*, we say that w1 ⊑ w2 if w1 is a subsequence of w2.

The proof that Check-APE-FD is NP-hard is structured as follows. First, we recall a known NP-complete problem, called Common Permutation Problem (CP-P for short). Then, we introduce a problem called Periodic Repair Problem (PR-P) and we prove that CP-P may be reduced to it using logarithmic space. Finally, we reduce PR-P to Check-APE-FD using logarithmic space.
Let us begin with the Common Permutation Problem, which has been proved to be NP-complete in [12].

Problem 2 (CP-P) Given a finite alphabet Σ and two words w1, w2 over it, is there a permutation wp of Σ for which wp ⊑ w1 and wp ⊑ w2?

Consider Σ = {a, b, c}: the pair w1 = bcbab and w2 = accaacb is a positive instance of Problem 2 because cab ⊑ w1 and cab ⊑ w2. On the other hand, the pair w1 = bcbac and w2 = acab is a negative instance of Problem 2, since the two words do not share any permutation of Σ as a subsequence. More precisely, w1 contains the permutations bca, bac and cba, while w2 contains the permutations acb and cab.
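A brute-force decision procedure for CP-P, exponential in |Σ|, can be sketched as follows (it reproduces the positive and negative instances above):

```python
from itertools import permutations

def is_subsequence(p, w):
    """True iff p occurs in w as a (not necessarily contiguous) subsequence."""
    it = iter(w)
    return all(ch in it for ch in p)  # `in` consumes the iterator

def common_permutation(alphabet, w1, w2):
    """Brute-force CP-P: look for a permutation of the whole alphabet
    occurring as a subsequence of both words."""
    for p in permutations(alphabet):
        if is_subsequence(p, w1) and is_subsequence(p, w2):
            return "".join(p)
    return None
```

For the instances above, `common_permutation("abc", "bcbab", "accaacb")` returns "cab", while the negative instance yields None.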
A word w is periodic if and only if, for every pair of positions i < j with w[i] = w[j], we have w[i+1] = w[j+1] whenever both i+1 and j+1 are valid positions of w. Let us observe that if w is repetition-free (i.e., no symbol occurs twice in w), then it is trivially periodic. Moreover, if w is periodic, then for every pair (b, e), with 1 ≤ b ≤ e ≤ |w|, we have that w‖[b,e] is periodic (i.e., every slice of a periodic word is itself periodic).

Fig. 9 An example of a word w‖s obtained by applying a sequence s to a word w. Fig. 10 A graphical account of how s1n, sw1, s2n, sw2, and s3n filter blocks of w.
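Under the reading of periodicity used here (equal symbols at two positions force equal successor symbols), the two predicates can be sketched as follows; note that this formalization is our reconstruction from the proof of Lemma 1:

```python
def is_periodic(w):
    """Reconstructed periodicity: whenever the same symbol occurs at
    positions i < j, the symbols at i+1 and j+1 (when both exist)
    must coincide as well."""
    n = len(w)
    return all(w[i + 1] == w[j + 1]
               for i in range(n - 1)
               for j in range(i + 1, n - 1)
               if w[i] == w[j])

def is_repetition_free(w):
    """No symbol occurs twice in w."""
    return len(set(w)) == len(w)
```

For instance, "abcabc" is periodic while "aab" is not; any repetition-free word is vacuously periodic.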
The following lemma turns out to be useful for our reduction.

Lemma 1 Given a periodic word w, if w is not repetition-free, then there exists an index i < |w| such that w[i] = last(w).
Proof Since w is not repetition-free, there exist two indexes i < i′ with w[i] = w[i′]. We prove the claim by induction on Δ = |w| − i′. For the base of the induction, we have Δ = 0, and thus the claim trivially holds, since i is the index we were looking for. Let us consider Δ = n + 1. Since w is periodic and w[i] = w[i′], we have w[i+1] = w[i′+1]. Thus, positions i+1 and i′+1 witness a repetition, and since |w| − (i′+1) < |w| − i′ = Δ, we can apply the inductive hypothesis and prove our claim. ◻

Problem 3 (PR-P) Given a word w = a1 … an over a finite alphabet Σ and a natural number k, determine whether a periodic word w′ ⊑ w exists such that |w′| ≥ k.
Problem 3 belongs to the complexity class NP. A simple nondeterministic algorithm for PR-P guesses an ℕ>-sequence s such that |s| ≥ k and last(s) ≤ |w| (i.e., s "chooses" only positions in 1 … |w|). Then, it suffices to check whether or not w‖s is periodic (periodicity checking may be performed in logarithmic space).
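The guess-and-check algorithm for PR-P can be turned into an exhaustive search, sketched below under the same reconstructed notion of periodicity (exponential in |w|):

```python
from itertools import combinations

def is_periodic(w):
    # Periodicity as defined above (our reconstruction): equal symbols
    # at two positions force equal symbols at the following positions.
    n = len(w)
    return all(w[i + 1] == w[j + 1]
               for i in range(n - 1)
               for j in range(i + 1, n - 1)
               if w[i] == w[j])

def periodic_repair(w, k):
    """Exhaustive PR-P: search a periodic subsequence of w of length at
    least k, trying position sets from the longest downwards."""
    n = len(w)
    for size in range(n, k - 1, -1):
        for positions in combinations(range(n), size):
            cand = "".join(w[i] for i in positions)
            if is_periodic(cand):
                return cand
    return None
```

For example, "aab" itself is not periodic, but its subsequence "aa" is, so the instance ("aab", 2) is positive.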
In the following, we describe how to reduce CP-P to PR-P. Let us consider two words w1 and w2 over an alphabet Σ, with lengths n1 and n2, respectively. We assume without loss of generality that Σ is a finite subset of the negative integers (i.e., Σ ⊆ ℤ−). Let n = max(n1, n2) and σ = |Σ|. Let us consider the following word w over the alphabet Σ ∪ {1, … , n} (⋅ is the classical word concatenation operator):

w = 1 ⋯ n ⋅ w1 ⋅ 1 ⋯ n ⋅ w2 ⋅ 1 ⋯ n

Finally, we put k = 3n + 2σ. Such a reduction operates in logarithmic space. The following two lemmas prove the soundness and completeness of the above reduction.

Lemma 2
If there exists a permutation wp of Σ which is a common subsequence of w1 and w2, then there exists an ℕ>-sequence s, with |s| ≥ 3n + 2σ and last(s) ≤ 3n + |w1| + |w2|, such that w‖s is periodic.
Proof First, let us recall that wp is a repetition-free sequence of symbols in ℤ−. By hypothesis, we have wp ⊑ w1 and wp ⊑ w2, and thus there exists a pair of ℕ>-sequences s1 and s2 with |s1| = σ, |s2| = σ, w1‖s1 = wp and w2‖s2 = wp. Let s̄j, with j ∈ {1, 2}, be the ℕ>-sequence such that |s̄j| = σ and, for every 1 ≤ i ≤ σ, s̄j[i] = sj[i] + n·j + (j − 1)·n1. Let us observe that s̄j is a simple shift of the indexes in the sequence sj, accounting for the blocks that precede wj in w. Then, we may define s as the sequence selecting the three blocks 1 … n entirely, together with the positions in s̄1 and s̄2. Since Σ ∩ {1, … , n} = ∅, there are no "conflicts" between the blocks 1 … n and the occurrences of wp, and we can conclude that w‖s is periodic. ◻

Lemma 3 If there exists an ℕ>-sequence s, with |s| ≥ 3n + 2σ and last(s) ≤ |w|, such that w‖s is periodic, then there exists a permutation wp of Σ which is a common subsequence of w1 and w2.
Proof Let us decompose s into s1n, sw1, s2n, sw2, s3n. Informally, s1n, sw1, s2n, sw2, s3n are the indexes in s that concern the subwords 1 … n (first block), w1, 1 … n (second block), w2, and 1 … n (third block), respectively. A graphical account of this decomposition of w‖s is given in Fig. 10. This means that we may retrieve the subsequence selected by s on w restricted to the first block by means of the operation w‖s1n. If we want to retrieve the subsequence selected by s on w restricted to the w1 block, we write w‖sw1, and so on for the second block 1 … n (i.e., w‖s2n), the block w2 (i.e., w‖sw2), and the third block 1 … n (i.e., w‖s3n). Let us notice that w‖s is equal to w‖s1n ⋅ w‖sw1 ⋅ w‖s2n ⋅ w‖sw2 ⋅ w‖s3n, and that each of these five words is a slice of w‖s and thus periodic.
Similarly, w′′ = w‖sw2 ⋅ w‖s3n is a slice of w‖s and thus w′′ is periodic. Two cases may arise: either |s3n| = 0 or not.
If |s3n| = 0, we have that w‖s ⊑ 1 … n ⋅ w′ and, by a counting argument, |w‖s| ≤ 3n, which is a contradiction, since 3n + 2σ ≤ |w‖s| and σ > 0 by definition.
We have now that |s2n| > 0. Consider the slice w′ = w‖sw1 ⋅ w‖s2n: we have just proved that |w‖s2n| > 0, and w′ is periodic, being a slice of w‖s. By applying Lemma 1 as we did above, we can claim that w‖sw1 is repetition-free and thus |w‖sw1| ≤ σ. Suppose now, by contradiction, that w‖sw2 is not repetition-free. If |s3n| > 0, we immediately reach a contradiction by applying Lemma 1 to the word w‖sw2 ⋅ w‖s3n. Then, we have |s3n| = 0, and we reach a contradiction by the counting argument given above. At this point, we have that both w‖sw1 and w‖sw2 are repetition-free and thus |w‖sw1| ≤ σ and |w‖sw2| ≤ σ. Since 3n + 2σ ≤ |w‖s|, we have that |w‖sw1| = σ and |w‖sw2| = σ, and thus both w‖sw1 and w‖sw2 are permutations of Σ. It remains to prove that they are the same permutation. Let us observe that, since |w‖sw1| = |w‖sw2| = σ and |w‖s| ≥ 3n + 2σ, by a counting argument we have that w‖s1n = w‖s2n = w‖s3n = 1 … n.

Now we reduce Problem PR-P to Check-APE-FD in logarithmic space. Suppose that we have an instance of PR-P consisting of a word w ∈ Σ* and a natural number k. We define an instance r_w on the temporal schema R = {J, X} ∪ {VT}, together with an APE-FD of the form X → X̄; both r_w and the error threshold ε_{w,k} may be built using logarithmic space from the input w, k. Finally, we can conclude this section by explicitly providing the desired result.

Algorithms for Checking APE-FDs
As we proved in "The Computational Complexity of Checking APE-FD" section, given an APE-FD [εk(R_J)] XY → Z and an instance r of R, the problem Check-APE-FD is NP-complete in |r|. Then, in principle, there is no asymptotically better algorithm than exploring the whole set of possible solutions. In the following, we provide two algorithms that make use of heuristics for pruning the search space, in order to achieve tractability in many cases.
The first algorithm is the more general one, and it may be applied without assumptions on the input instance r. Such an algorithm makes use of two optimization techniques. The first one consists of trying, whenever possible, to split the current subset of r into two subsets on which the problem may be solved independently (i.e., choices in one subset do not affect those in the other one and vice versa). The second optimization technique consists of checking whether the current partial solution may not lead to an optimal solution (i.e., a solution r′ where |r′| is the maximum possible number of tuples that may be kept). If this happens, the subtree is pruned immediately (i.e., we are looking only for optimal solutions).
The second algorithm is applicable under the assumption that we have a bounded and relatively small number of tuples sharing the same values for both VT and J, which is often the case in clinical domains, as we will discuss later on. In this setting, we show how to compute an upper bound on the value of all the candidate solutions containing the current partial solution; thus, we can apply a pure branch-and-bound approach to speed up the algorithm even more.
Before discussing in detail the algorithms and their properties, we need to introduce some basic concepts and features for the representation of tuples through graph-based structures.

Graph-Based Structures for Tuple Representation
In this regard, we use a suitable graph representation of tuples. A directed graph is a pair G = (V, E), where V is a finite set of nodes and E ⊆ V² is its edge set. Our graphs are simple, i.e., there are no loops and no parallel edges. Let U ⊆ V: we denote by G|_U = (U, E|_U) the subgraph of G induced by U, that is, the graph on node set U whose edge set is E|_U = E ∩ U². We also define the jump value of an edge. Obviously, the notion of induced subgraph naturally extends to L-DAGs.
Let us notice that, in our notion of weighted L-DAG, weights are associated with nodes. Let us now introduce a general problem on L-DAGs, called k-Thick Path (k-TP for short).
Problem 4 (k-TP). Given an L-DAG L_G = (V, E, l) and a natural number k, determine whether or not there exists a node subset V′ of the required kind. For instance, consider the L-DAG in Fig. 11. In a solution, we may choose to take more than one node per layer, as well as completely ignore all the nodes in a layer. Then, we may see a candidate solution V′ as the result of a two-step nondeterministic guess: first, we guess a set of layers {l_1, …, l_p′} ⊆ {1, …, p} (let us assume l_i < l_j for every 1 ≤ i < j ≤ p′), which will be all and only the layers containing at least one node in our solution; then, we guess the nodes within such layers. Going back to the example in Fig. 11, in V″ condition 2 is violated because, by choosing v_4, we choose layer 3 as the nonempty layer following layer 2, but (v_3, v_4) ∉ E. As a matter of fact, V″ \ {v_3} (i.e., we choose only v_2 in layer 2) turns out to be a candidate solution. In V′, we ignore layers 3 and 4 by not choosing any node in them; instead, we choose layer 5 as the nonempty layer following layer 2, and everything works just fine.
The k-TP problem may naturally be extended to wL-DAGs by imposing that the set V′ satisfies ∑_{v∈V′} W(v) ≥ k. In [10], we prove that the k-TP problem on wL-DAGs is NP-hard. Our proof can be naturally extended to prove that the nonweighted version of the problem is NP-hard too.
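To make the decision problem concrete, the following Python sketch checks the weighted k-TP property by brute force on a tiny wL-DAG. The connectivity requirement between consecutive nonempty layers is our simplified reading of the condition violated in the example above, and all names are illustrative assumptions, not the paper's formal definition.

```python
from itertools import combinations

def k_thick_path(nodes, edges, layer, weight, k):
    """Brute-force decision procedure for weighted k-TP (a sketch).

    Assumed interpretation: a candidate solution V' is consistent if,
    for every pair of consecutive nonempty layers, every chosen node of
    the earlier layer is connected to every chosen node of the later
    one; it is a k-thick path if the chosen weights sum to at least k.
    """
    E = set(edges)
    node_list = list(nodes)
    for r in range(1, len(node_list) + 1):
        for subset in combinations(node_list, r):
            layers = sorted({layer[v] for v in subset})
            ok = True
            # Check full connectivity between consecutive nonempty layers.
            for la, lb in zip(layers, layers[1:]):
                for u in subset:
                    if layer[u] != la:
                        continue
                    for v in subset:
                        if layer[v] == lb and (u, v) not in E:
                            ok = False
            if ok and sum(weight[v] for v in subset) >= k:
                return True
    return False
```

The exhaustive enumeration mirrors the nondeterministic two-step guess; it is exponential, consistently with the NP-hardness of the problem.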

The First Algorithm
Both algorithms rely on the concept of color, which we will explain through an example in the following. Given an APE-FD [εk(R_J)] XY → Z and an instance r of R, let us suppose that we are solving the problem Check-APE-FD on such an instance with a simple guess-and-check procedure, which makes use of two, initially empty, subsets r⁺ (the tuples to be kept in the solution) and r⁻ (the tuples to be deleted in the solution) of r. At each step, the procedure guesses a tuple t in r \ (r⁺ ∪ r⁻) and decides nondeterministically (guessing phase) either to update r⁺ to r⁺ ∪ {t} (i.e., t is kept in the current partial solution) or to update r⁻ to r⁻ ∪ {t} (i.e., t is deleted in the current partial solution). When two tuples t, t′ share the same value for attribute J (i.e., t[J] = t′[J]), we say that they are in the same J-group and t[J] is the value of the J-group containing t and t′. For the sake of brevity, for a given j ∈ Dom(J), we will speak of the j-group (if r⁻ exceeds the allowed number of deletions, the solution is inconsistent).
The above theorem guarantees that, from a partial solution (r, r⁺, r⁻) that features at least two conflicting edges, we cannot reach a consistent solution (r, r⁺_f, r⁻_f). In such a case, we may return immediately NO without considering any further extension of (r, r⁺, r⁻). The colors of a partial solution (r, r⁺, r⁻) are represented by the set colors(r, r⁺, r⁻) = {c(t, t′) : (t, t′) is an active edge in (r, r⁺, r⁻)}. Clearly, the hypothesis of Theorem 2 applies if and only if the set colors(r, r⁺, r⁻) contains at least two conflicting colors.
Then, by means of colors, our above guess-and-check procedure may be improved by adding a control on the size of r⁻ and by keeping the current set of colors colors(r, r⁺, r⁻) updated. As soon as the insertion of a tuple in either r⁺ or r⁻ introduces a color c that is conflicting with at least one color in colors(r, r⁺, r⁻), the procedure answers NO immediately. An example of how the procedure works is given in Fig. 12, where we have an instance of five tuples with ε = 0.2 (i.e., we may delete at most one tuple) and k = 6 (all the tuples are in the same window). The execution depicted in Fig. 12 guesses the values of tuples from the oldest (t_1) to the newest one (t_5) according to the value of VT. First, it tries to put the current tuple t in r⁺; if no violation arises, it continues; if some violation arises, it tries to insert tuple t in r⁻; if no violation arises, it continues; otherwise, it goes back to the previous choice (i.e., backtracking). Every internal node is labeled with the current tuple, which will be guessed next; every leaf is labeled either with YES (i.e., the current branch is a solution) or NO (i.e., a violation has arisen); the current set of colors is reported within the node. Nodes are numbered according to their order of appearance. We have that the root is n_1, followed by the introduction of nodes n_2 … n_4 in this precise order. If we introduce t_4 in the partial solution associated with n_4, we violate the first constraint. Since in n_4 adding t_4 to r⁻ does not generate any violation, node n_5 is created as a child of n_4. However, node n_5 cannot be extended without introducing a violation of the above constraints. Indeed, if we put t_5 in r⁺, we introduce a conflicting color; if we put t_5 in r⁻, we exceed the maximum number of allowed deletions. We backtrack to n_4. As all the possible choices have been explored, we backtrack to n_3, where the choice of adding t_3 to r⁻ is attempted, generating node n_6.
From n_6, we put t_5 in r⁺ without violating any constraint, and thus we have that the procedure answers YES. Let us now consider in some more detail the first algorithm. Basically, the algorithm works similarly to the previous procedure, except for some trivial technicalities. Two more heuristics have been introduced, to possibly stop earlier during the exploration of a branch in the tree of computation. The main procedure of the algorithm is reported in Fig. 15, while auxiliary procedures are reported in Figs. 13 and 14. The algorithm is implemented by function TupleWiseMin, which takes four arguments. The first argument is G, which is derived from r by considering the APE-FD [εk(R_J)] XY → Z that has to be checked. More precisely, G is an instance of schema J, X, Y, Z, VT, count, with Dom(count) = ℕ. We have that t ∈ G if and only if there exists t′ ∈ r agreeing with t on J, X, Y, and Z, and t[count] = |{t′ ∈ r : t′ agrees with t on J, X, Y, and Z}|, that is, we count how many tuples in r share the same values for attributes J, X, Y and Z, respectively. The input parameter k is the length of the grouping sliding window. Sets G⁺ and G⁻, initialized to ∅, represent the tuples of G that are either kept or deleted in the current solution, respectively. On instances s of schema J, X, Y, Z, VT, count, we denote by ‖s‖ the sum of the count attribute over the tuples in s (i.e., ‖s‖ = ∑_{t∈s} t[count]). Finally, C is a set of colors, initially set to ∅. A color c is a tuple on the schema X, Y, Z. As we will see, C keeps track, via colors, of the constraints introduced so far in the construction of the solution.
Procedure TupleWiseMin returns the minimum number of tuples that have to be deleted from r in order to obtain an instance r′ such that r′ ⊧ [εk(R_J)] XY → Z. Then, if such a minimum is less than or equal to ε ⋅ |r|, we can conclude that r ⊧ [εk(R_J)] XY → Z.

SN Computer Science
Given G, G⁺, G⁻ and a set of colors C, we say that an edge (t, t′) ∈ G × G is pending if and only if the following conditions hold:

for every t″ with t″[J] = t[J] and t[VT] < t″[VT] < t′[VT], we have t″ ∉ G⁺.

Informally speaking, a pending edge is an edge that is not active in the current partial solution but may become active during the computation and, if that happens, it introduces a new color in C. In our algorithm, pending edges for the current partial solution are retrieved by procedure E?, while active edges are retrieved by procedure E!. Procedure TupleWiseMin (Fig. 15) works as follows. If G⁺ ∪ G⁻ = G, it means that we have obtained a solution without violating any constraint, and thus we can return ‖G⁻‖ (i.e., the number of deleted tuples). If G⁺ ∪ G⁻ ≠ G, the algorithm guesses a tuple t ∈ G \ (G⁺ ∪ G⁻) and proceeds as follows. First, it checks whether inserting t into G⁺ causes any violation of the constraints. If not, it stores in m_t the value of the recursive call to TupleWiseMin where t belongs to G⁺ and C has been updated accordingly. By inserting a tuple t in G⁺, the algorithm asserts that t belongs to the current partial solution, while by inserting t in G⁻ the algorithm asserts that t does not belong to the current partial solution. If a constraint is violated, the algorithm stores in m_t the value +∞, which means that t may not be kept in the current solution.
Then, it checks whether inserting t into G⁻ causes any violation of the constraints. If not, it stores in m_{⧵t} the value of the recursive call to TupleWiseMin where G⁻ and C are updated accordingly. If a constraint is violated, the algorithm stores in m_{⧵t} the value +∞, which means that t must be kept in the current partial solution. In procedure TupleWiseMin, the only way in which a constraint may be violated is that, after the insertion of a tuple t in G⁺ (resp. G⁻), an edge (t′, t″) turns out to be active and its color (t′[X], t″[Y], t″[Z]) turns out to be conflicting with at least one color in C.
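The recursive keep/delete exploration described above can be sketched as follows. The tuple layout, the window rule (an edge links consecutive kept tuples of a J-group whose VT distance is at most k) and the simple `deleted >= best` bound are simplifying assumptions, not the paper's exact TupleWiseMin.

```python
def min_deletions(tuples, k):
    """Sketch of a TupleWiseMin-style search for the minimum number of
    deletions making the instance consistent (simplified assumptions).

    Each tuple is a dict with keys J, VT, X, Y, Z. An edge links
    consecutive kept tuples of the same J-group within VT distance k;
    its color is (t[X], t'[Y], t'[Z]). Two colors conflict when they
    agree on the first two components but differ on the third.
    """
    tuples = sorted(tuples, key=lambda t: t["VT"])

    def colors_of(kept):
        cols, by_j = set(), {}
        for t in kept:
            by_j.setdefault(t["J"], []).append(t)
        for group in by_j.values():
            for a, b in zip(group, group[1:]):
                if b["VT"] - a["VT"] <= k:
                    cols.add((a["X"], b["Y"], b["Z"]))
        return cols

    def consistent(kept):
        cols = colors_of(kept)
        return not any(x == x2 and y == y2 and z != z2
                       for (x, y, z) in cols for (x2, y2, z2) in cols)

    best = len(tuples)

    def rec(i, kept, deleted):
        nonlocal best
        if deleted >= best:          # crude bound: prune dominated branches
            return
        if i == len(tuples):
            if consistent(kept):
                best = min(best, deleted)
            return
        rec(i + 1, kept + [tuples[i]], deleted)  # guess: keep tuple i
        rec(i + 1, kept, deleted + 1)            # guess: delete tuple i

    rec(0, [], 0)
    return best
```

Comparing the returned value with ε ⋅ |r| then decides Check-APE-FD, as in the text.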
As pointed out by the example in Fig. 12, checking each step for consistency is itself an optimization, even if a trivial one, since it allows us to prune entire subtrees in the tree of computations without exploring them. The second optimization allows us to prune a subtree of the computation even before a contradiction arises. It verifies, in many cases, whether every possible solution that may be built starting from the current partial one turns out not to be minimal. Suppose that there exists an active o-pair (t, t′) in a partial solution (G, G⁺, G⁻) such that there exists a tuple in G in the same J-group of t with c(t, t′) = (x, y, z). Let us define the set of colors pending(G, G⁺, G⁻) = {(x, y, z) : there exists a pending edge (t, t′) with c(t, t′) = (x, y, z)}, which collects all and only the colors that may be introduced later on in the current computation. A color (x, y, z) is safe in (vt, vt′, j) if and only if one of the following three conditions holds: 1. the color belongs to colors(G, G⁺, G⁻); 2. the color is a pending color and there is no pending color that is conflicting with (x, y, z); 3. the color is not conflicting with any color in colors(G, G⁺, G⁻) ∪ pending(G, G⁺, G⁻) and there do not exist two tuples t, t′ such that (t, t′) is an edge and its color is conflicting with (x, y, z).
The three conditions above imply that if a color is safe in (vt, vt′, j), then it is neither in conflict with a color already introduced nor with one that may be introduced later. However, this is just a necessary, not a sufficient, condition. Let us consider the example shown in Fig. 16 and assume that k ≥ 7 (i.e., every o-pair in the example is also an edge). We have that the active edges are (t_1, t_2), (t_2, t_3), (t_3, t_4), and (t_4, t_5) for the j_1-group and (t_7, t_12), (t_8, t_12) for the j_2-group, since we have t_9, t_10, t_11 ∈ G⁻. Thus, we have colors(G, G⁺, G⁻) = {(x_1, y_4, z_4), (x_2, y_5, z_6), (x_5, y_6, z_6), (x_4, y_6, z_5), (x_1, y_6, z_6), (x_2, y_6, z_6)} and, since we have to decide the status of tuple t_6, we have that pending(G, G⁺, G⁻) collects the colors of the pending edges involving t_6, which share no pair of first two components with conflicting colors (thus condition 3 applies to these colors). Colors c(t_7, t_10) = (x_1, y_4, z_4) and c(t_11, t_12) = (x_5, y_6, z_6) (i.e., the continuous edges in the j_2-group in Fig. 16) are safe in (1, 5, j_2), because they belong to colors(G, G⁺, G⁻) and thus both satisfy condition 1. Finally, colors c(t_8, t_11) = (x_2, y_5, z_5) and c(t_8, t_10) = (x_4, y_6, z_6) are not safe in (1, 5, j_2) (i.e., the X-labeled edges in Fig. 16), because they are in conflict with two colors in colors(G, G⁺, G⁻); more precisely, (x_2, y_5, z_5) is in conflict with (x_2, y_5, z_6) and (x_4, y_6, z_6) is in conflict with (x_4, y_6, z_5). Given a partial solution (G, G⁺, G⁻) and a triple (vt, vt′, j), a (vt, vt′, j)-replace DAG is a DAG (V, E) defined as follows.

Fig. 15 The main procedure for a tuple-wise check of APE-FDs. Notice that we use a compact notation for the recursive procedure, which is initially called TupleWiseMin(G, k). Here, when G⁺, G⁻ and C are omitted in the procedure call, they get their respective default values specified in the procedure declaration (i.e., ∅ for each of them in this case)

A node t ∈ V is a starting node (resp. ending node) if and only if vt < t[VT] < vt′ and t has no incoming (resp. outgoing) edge in (V, E). A replace path in the (vt, vt′, j)-replace DAG (V, E) is a path t_1 … t_m for which t_1 is a starting node and t_m is an ending node. We say that vt and vt′ in j can be safely replaced if and only if there exists a replace path in the (vt, vt′, j)-replace DAG (V, E). Figure 16 depicts the (1, 5, j_2)-replace DAG, where t_10 is the only starting node that is not an ending one, and t_11 is the only ending node that is not a starting one. Since t_10 is connected to t_11, we have that t_10 t_11 is a replace path in the (1, 5, j_2)-replace DAG and thus 1 and 5 can be safely replaced in j_2. Using the above definitions of replace DAGs/paths, we can provide the following result.

Theorem 3 Given a partial solution (G, G⁺, G⁻), if there exists a group j with two consecutive valid times vt and vt′ such that vt and vt′ can be safely replaced in j, then every consistent solution that follows (G, G⁺, G⁻) is not optimal.
The proof of the theorem is straightforward. Let us suppose that t_1 … t_m is a replace path in the (vt, vt′, j)-replace DAG. By definition, we have t_1, …, t_m ∈ G⁻. It suffices to take any consistent solution that follows (G, G⁺, G⁻) and observe that moving t_1, …, t_m back among the kept tuples still yields a consistent solution. Nonoptimality immediately follows. We take advantage of Theorem 3 by pruning every computation rooted in a partial solution (G, G⁺, G⁻) that features a J-group j and two consecutive valid times vt and vt′ in it, such that vt and vt′ can be safely replaced.

The Second Algorithm
Let us now propose another algorithm, with some auxiliary procedures reported in Figs. 17 and 18, for solving problem Check-APE-FD. Such an algorithm, whose main procedure is called EdgeWiseMin, strongly differs from TupleWiseMin in its approach to the problem. In principle, it works better, but it applies only under a quite reasonable assumption on the input, which we will discuss in detail later on.
At every step, procedure EdgeWiseMin, instead of guessing whether a tuple belongs to the current partial solution, guesses whether a color is forbidden or allowed in the current partial solution. Informally, forbidding a color (x, y, z) means avoiding all the active edges (t, t′) ∈ G × G for which c(t, t′) = (x, y, z). On the other hand, allowing a color (x, y, z) means forbidding all the active edges (t, t′) ∈ G × G whose colors are conflicting with (x, y, z). To this end, we introduce the concept of color-partial solution. A color-partial solution is a triple (G, C⁺, C⁻), where C⁺ and C⁻ are disjoint subsets of colors (i.e., C⁺ ∩ C⁻ = ∅) and, for every pair of colors (x, y, z), (x′, y′, z′) ∈ C⁺, (x, y, z) is not conflicting with (x′, y′, z′) (i.e., if x′ = x and y′ = y, then z′ = z). The following conditions must hold: 1. for every color (x, y, z) in C⁺ and for each edge (t, t′) with t[X] = x and t′[Y] = y, if (t, t′) is active then t′[Z] = z; 2. for every color (x, y, z) in C⁻ and for each edge (t, t′) with c(t, t′) = (x, y, z), (t, t′) is not active in (G, G⁺, G⁻).

Fig. 18 Procedure BuildDag builds a single-source, single-sink DAG whose nodes are nonempty subsets of G. Each subset is formed by tuples sharing the same value for VT, and thus function Time is well defined. Procedure SourceSinkShortestPath returns the shortest path from source to sink on the DAG provided by BuildDag. The solution is given as a set of nodes (i.e., subsets of G), and it omits source and sink nodes

In contrast to TupleWiseMin, in this algorithm color-partial solutions induce complete minimal solutions. However, such solutions may be inconsistent. The algorithm tries to obtain consistency by either forcing or forbidding one color at a time. This is done by means of sets C⁺ and C⁻, which are both initialized to ∅ at the beginning of the procedure. As we informally said above, if a color (x, y, z) belongs to C⁺, it means that the current partial solution must avoid all the active edges (t, t′) such that t[X] = x, t′[Y] = y, and t′[Z] ≠ z; if a color (x, y, z) belongs to C⁻, it means that the current partial solution must avoid all the active edges (t, t′) such that c(t, t′) = (x, y, z). As a general overview of the algorithm, let us consider the following simplified procedure (let CPS = (G, C⁺, C⁻) be the current color-partial solution). Two observations are omitted in the above procedure with respect to function EdgeWiseMin. The first one is that the procedure does not take into account the fact that the value ‖G⁻‖ of the color-partial solution computed at step 1 is a lower bound for the optimal solution that may be achieved in the current branch of the computation. Procedure EdgeWiseMin uses it in a classical branch-and-bound fashion by propagating the value of the current optimal solution (if any) in the tree of recursive calls (in Fig. 18 this is done by means of parameter optimal). In step 1, if the computed value ‖G⁻‖ is greater than the optimal one, we immediately return from the recursive call, because no better solution may be found. The second omitted observation regards how the value val(CPS) of the color-partial solution is computed, where CPS = (G, C⁺, C⁻). For every J-group in G, i.e., any set of tuples having value j for attribute J, we build a wL-DAG L_j^CPS. The procedure returns the minimum number of tuples to delete in r in order to obtain an instance r′ ⊆ r such that r′ ⊧ [εk(R_J)] XY → Z.
Like procedure TupleWiseMin of Fig. 15, the initial call to the recursive procedure is EdgeWiseMin(G, k), with C⁺, C⁻, and optimal initialized to their respective default values. Given a wL-DAG L_G, we call MAX-ThickPath (Max-TP for short) the problem of finding the maximum M for which L_G admits an M-thick path. Max-TP may be solved by a simple dichotomic search relying on a decision procedure that solves the problem M-TP (Problem 4), which is NP-complete [10]. Here, our assumption comes into play and allows us to find M_j^CPS for every j ∈ J(G) in a "reasonable" time. Indeed, in the instances that are used to prove NP-completeness, the number of nodes in any layer, roughly corresponding to the number of tuples of the given relation at a corresponding time point, is supposed to increase as the number of time points/layers increases. This is not the case in many daily applications, especially in the clinical domain, where we may have a great number of tuples, but scattered along the timeline. In the following, we provide a formal definition of our assumption.
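The dichotomic search for Max-TP can be sketched independently of the underlying decision procedure; `is_k_tp` stands for any oracle solving M-TP, and monotonicity in k (if an M-thick path exists, so does an M′-thick path for every M′ ≤ M) is assumed.

```python
def max_thick_path(total_weight, is_k_tp):
    """Dichotomic (binary) search for the largest M such that the
    wL-DAG admits an M-thick path, given a decision oracle is_k_tp(k).

    total_weight bounds M from above (the sum of all node weights);
    the oracle is assumed monotone in k.
    """
    lo, hi = 0, total_weight
    while lo < hi:
        mid = (lo + hi + 1) // 2   # bias upward so the loop terminates
        if is_k_tp(mid):
            lo = mid               # an mid-thick path exists: go higher
        else:
            hi = mid - 1           # no mid-thick path: go lower
    return lo
```

Only O(log total_weight) oracle calls are needed, so the cost is dominated by the NP-complete decision step.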

W_j(t) = t[count] and L_j(t) = t[VT], and we define the value space(j, G) as 2^{MaxLevel(G, j)} ⋅ |VT(G, j)| ⋅ log₂(MaxCount(G, j)). Let MaxSpace(G) = max_{j∈J(G)} space(j, G). We will see that EdgeWiseMin is applicable to our instance if we have O(MaxSpace(G)²) bits for computing it. The problem with MaxSpace(G) is that it is exponential in MaxLevel(G, j), but this value depends on the maximum number of tuples that share the same values for attributes VT and J and differ on at least one of the attributes X, Y and Z in the original instance r. As we said above, we assume this value to be manageable, as it happens in many real-world applications. Hereinafter, we will suppose to have O(MaxSpace(G)²) bits for performing our computation.
Let us suppose to have a wL-DAG L_G = (V, E, L, W), and that we want to solve Max-TP on it. By definition of level subset, the function L′ : 𝒱 → ℕ, such that for every V′ ∈ 𝒱 we have L′(V′) = L(v) for some v ∈ V′, turns out to be well defined. We define the unfolding of the wL-DAG L_G = (V, E, L, W) as a weighted DAG built on such level subsets. For instance, the DAG in Fig. 19 is the result of unfolding the wL-DAG in Fig. 11. The unfolding of a wL-DAG, in the worst-case scenario, is exponential in the size of L_G. Given a wL-DAG L_G = (V, E, L, W), we define W_all(L_G) = ∑_{v∈V} W(v) as the sum of all the weights associated with nodes in V. It is straightforward to prove that the union of all the internal nodes in a source-to-sink path in the unfolding of a wL-DAG L_G is a thick path in L_G and, on the other hand, every thick path in L_G may be associated with a source-to-sink path in its unfolding. Moreover, for every source-to-sink path p in the unfolding of L_G, let w_p be its weight. The weight of the thick path associated with p is exactly W_all(L_G) − w_p (i.e., L_G admits a thick path with value W_all(L_G) − w_p). With these premises, we can prove the following result. Given a color-partial solution CPS = (G, C⁺, C⁻), procedure EdgeWiseMin computes the value val(CPS) (performed

by procedure PartialSolution in Fig. 18) by summing up all the values M_j^CPS for every j ∈ J(G). Each M_j^CPS is computed as the value of a source-to-sink shortest path (performed by procedure SourceSinkShortestPath in Fig. 17) on the unfolding of L_j^CPS (built by procedure BuildDag in Fig. 17). For building U(L_j^CPS), on which we compute the value of a source-to-sink shortest path, we may need O(MaxSpace(G)²) bits.
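The shortest-path computation on the (already built) unfolding can be sketched as a simple dynamic program over a topologically ordered DAG; the construction of U(L_j^CPS) itself is omitted here, and all names are assumptions.

```python
def shortest_source_sink(nodes, successors, edge_weight):
    """Shortest source-to-sink path on a DAG (a sketch).

    `nodes` is assumed to be in topological order, with nodes[0] the
    source and nodes[-1] the sink; `successors` maps a node to its
    successor list; `edge_weight` maps (u, v) to a nonnegative weight.
    """
    INF = float("inf")
    dist = {v: INF for v in nodes}
    dist[nodes[0]] = 0
    # One pass in topological order suffices on a DAG.
    for u in nodes:
        if dist[u] == INF:
            continue
        for v in successors.get(u, []):
            w = edge_weight[(u, v)]
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return dist[nodes[-1]]
```

Per the relation stated above, M_j^CPS would then be obtained as W_all(L_G) minus the returned shortest-path value.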
Finally, let us observe that procedure PartialSolution does not return only the value val(CPS) of the current color-partial solution CPS = (G, C⁺, C⁻). Since it effectively computes a minimal solution (G, G⁺, G⁻) induced by CPS in order to provide val(CPS), it also returns the set colors(G, G \ G⁻, G⁻), that is, the set of all and only the colors associated with active edges in (G, G \ G⁻, G⁻). If such a set does not contain two colors (x, y, z) and (x, y, z′) such that z ≠ z′, then we have that (G, C⁺, C⁻) is a consistent solution and we may return val(CPS). Otherwise, procedure EdgeWiseMin takes a color (x, y, z) in colors(G, G \ G⁻, G⁻) such that there exists (x, y, z′) in colors(G, G \ G⁻, G⁻) with z ≠ z′, and performs two recursive calls, one in which C⁺ is updated to C⁺ ∪ {(x, y, z)} and the other in which C⁻ is updated to C⁻ ∪ {(x, y, z)}.

Mining APE-FDs
In this section, we consider, from a practical point of view, the problem of mining APE-FDs on a given instance r of a temporal schema R. We describe a prototype that performs such a task. In particular, we point out two big computational challenges we addressed in the implementation of our prototype.
Let us start with the formal definition of our problem. Given a temporal schema R = U ∪ {VT}, an instance r of R, a nonempty set J ⊆ U, a threshold 0 ≤ ε ≤ 1, and a value k ∈ ℕ, we denote by PE(ε, k, J) the set of all the APE-FDs of the form [εk(R_J)] XY → Z that hold on r. Given a temporal schema R = U ∪ {VT}, an instance r of R, a nonempty set J ⊆ U, a threshold 0 ≤ ε ≤ 1 and a value k ∈ ℕ, we are interested in finding a minimal complete set for PE(ε, k, J). However, in order to do that, we have to deal with two computational problems:

• the smallest minimal complete set S may be exponential in the size of U with respect to some given temporal schemata R = U ∪ {VT}, instances r of R, nonempty sets J ⊆ U, thresholds 0 ≤ ε ≤ 1, and values k ∈ ℕ;
• given a single APE-FD [εk(R_J)] XY → Z on a schema R = U ∪ {VT} and an instance r of R, deciding whether or not r ⊧ [εk(R_J)] XY → Z is an NP-complete problem.

Fig. 19 The unfolding of the wL-DAG of Fig. 11 into a weighted DAG for solving the Max-TP problem. The table below the graph provides the weights for source-to-node edges and node-to-sink edges, which are both represented by dashed lines. Continuous edges without labels have weight 0. P = source{v_1}{v_2, v_3}{v_7, v_9}{v_8}{v_11}sink is a source-to-sink shortest path with value 4
The first result may be derived by leveraging a result of Kivinen et al. [16] on approximate functional dependencies. The second result is proved in "The Computational Complexity of Checking APE-FD" section. However, such theoretical bounds are both hard to attain in real-world domains. For instance, the size of PE(ε, k, J) could be exponential in |U|, but in real-case scenarios |U| is often below 50–60 elements. Moreover, the instance built in [16] for achieving the exponential lower bound does not occur in real-world instances. The complexity of checking a single APE-FD is even worse. Indeed, checking a single APE-FD is NP-complete in the number of tuples, which may be very high and keeps growing over time. This problem is known as the curse of cardinality, and its relevance has been recently considered for temporal inference of sequential patterns in [18]. Even in this case, the instance specified in "The Computational Complexity of Checking APE-FD" section for proving the NP-hardness result has been built in a very complex and constrained way. Such an instance does not even remotely resemble a real-world scenario. Thus, it makes sense to design and implement a prototype for the practical mining of such dependencies on real-world datasets and to evaluate its performance.

Prototype Overview
Even though the results reported in "The Computational Complexity of Checking APE-FD" section are not completely encouraging, according to the last comments in the previous section we developed a prototype. Given a temporal schema R = U ∪ {VT}, an instance r of R, a nonempty set J ⊆ U, a threshold 0 ≤ ε ≤ 1 and a value k ∈ ℕ, it returns a minimal complete set for PE(ε, k, J). The prototype is named Attila (Approximate Temporal Tailored Inference Lean Application). In the following, after providing a high-level description of Attila's modules and their interaction, we focus in detail on the novel ideas underlying the design of the prototype. Attila was implemented according to the principles of distributed programming. Different tasks are executed by different processes (possibly executed on different machines). Attila is composed of three main process types:

• Worker is responsible for maintaining a representation of the minimal partial set PE(ε, k, J);
• Contributor is responsible for checking a single APE-FD at a time;
• Sub-Contributor carries out a portion of the check of a single APE-FD.

Processes composing Attila are hierarchically organized. A Worker manages the minimal partial PE(ε, k, J), but more Contributors are needed for checking multiple dependencies at a time. Each Contributor may have several Sub-Contributors in order to speed up the checking procedure. Now, let us consider in some detail each process type, in order to give a general idea of how computations are handled. Worker has to manage the minimal partial PE(ε, k, J). At the end of the computation, PE(ε, k, J) turns out to be a minimal complete set, according to the goal of our distributed procedure. Worker interacts only with its pool of Contributors. Figure 20 depicts how such interaction happens, by means of a BPMN choreography [22]. A Contributor can register to the Worker at any time, incrementing by one the number of APE-FDs that may be checked simultaneously. We use Ordered Binary Decision Diagrams (OBDDs) [1] as data structures allowing an efficient execution of such operations.
An OBDD is a single-rooted directed acyclic graph that represents a propositional formula φ. A propositional variable is associated with every node as a label, except for the only terminal node 1.² Any nonterminal node v may have at most two outgoing edges, low(v) (dotted lines in Fig. 21) and high(v) (solid lines in Fig. 21): low(v) (resp. high(v)) means that variable v is taken with value 0 (resp. 1). A variable truth assignment is represented by a path from the root to terminal node 1. Thus, an OBDD represents the set of all truth assignments satisfying a given formula φ. Worker uses three different OBDDs, corresponding to formulas φ_PE, φ_P and φ_A, to keep track of sets PE(ε, k, J), Pending and Assigned, respectively. APE-FDs in PE(ε, k, J) correspond to all and only the solutions of formula φ_PE (and the same for APE-FDs in Pending with respect to formula φ_P and for APE-FDs in Assigned with respect to φ_A). Hereinafter, we will use φ for denoting both a formula and the OBDD corresponding to all its possible solutions. Informally, updates of these three sets are implemented by adding conjuncts/disjuncts to their respective formulas.
For representing set PE(ε, k, J) as all and only the solutions of a formula, it suffices to assign a propositional variable to each attribute. Clearly, if a formula φ_PE represents all the possible APE-FDs, an OBDD for φ_PE represents them as well. The same approach may be used for sets Pending and Assigned. Hereinafter, we refer to φ as the OBDD representing all and only the assignments σ such that σ ⊧ φ.
A Worker begins the APE-FD mining task by initializing two OBDDs representing φ_PE = φ_A = ⊥. Initially, φ_PE = ⊥ (i.e., PE(ε, k, J) = ∅), since the distributed procedure has not discovered any valid APE-FD yet. φ_P is true only for those assignments that represent well-formed APE-FDs. In order to extract from φ_P an APE-FD to be tested, we take the solution associated with any root-to-terminal path in the OBDD.

Fig. 21 The update of the set PE(ε, k, J)

² We use OBDDs without the 0 terminal node.
For inserting a new APE-FD in PE(ε, k, J), it suffices to update φ_PE to φ_PE ∨ ψ, where ψ encodes the dependency. Moreover, for deleting a solution from Pending, it suffices to update φ_P accordingly (informally, if the deleted assignment holds, then at least one of the variables not contained in the formula for the given APE-FD must hold).
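A naive stand-in for this OBDD-based bookkeeping, assuming one propositional variable per attribute, might look as follows; the actual prototype relies on BuDDy OBDDs, so this is only an illustration of the insert/delete/pick operations, with all names being assumptions.

```python
class DependencySet:
    """Naive stand-in for the OBDD-based bookkeeping of APE-FD sets.

    Each APE-FD is encoded as a frozenset of attribute names, i.e. a
    truth assignment over one propositional variable per attribute.
    The set of stored assignments plays the role of the formula phi.
    """

    def __init__(self):
        self.solutions = set()

    def add(self, attrs):
        # Corresponds to phi := phi OR encoding(attrs).
        self.solutions.add(frozenset(attrs))

    def remove(self, attrs):
        # Corresponds to phi := phi AND NOT encoding(attrs).
        self.solutions.discard(frozenset(attrs))

    def pick(self):
        # Corresponds to extracting any root-to-terminal path.
        return next(iter(self.solutions), None)
```

Unlike a real OBDD, this representation is not compact: it stores each solution explicitly, whereas the OBDD shares common subpaths and supports Boolean operations symbolically.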
It is worth noting that the resulting search operation for APE-FDs is linear in |U|. Moreover, Boolean operations on OBDDs are implemented in a very efficient way by many packages on the market (in our prototype, we used BuDDy [17]). This solution allows us to have a compact representation of sets of APE-FDs that can be manipulated efficiently. Figure 21 shows an example of how PE(ε, k, J) is updated when we represent it through formula φ_PE. In this example, we borrow two dependencies from the psychiatric case register introduced in a simplified way in "Discovering Pure Temporally Evolving Functional Dependencies" section and discussed in detail in "Mining APE-FDs on Clinical Domains" section. The real-world schema differs from the example, since patients are identified by the PatId attribute in place of their names. Furthermore, several attributes are used for storing information regarding registered calls. The most significant attribute is Global Assessment of Functioning (GAF): it is a numeric value provided by the physician at the end of the call, and it scores the patient's mental health status. As one may notice in Fig. 21, each node v has at most two outgoing edges, one solid and one dashed, representing high(v) and low(v), respectively. Taking a solid edge high(v) in a root-to-terminal path denotes that the attribute corresponding to v belongs to the dependency; taking a dashed edge low(v) denotes that the attribute corresponding to v does not belong to the dependency.

A Contributor, in turn, behaves according to the choreography of Fig. 20. Thus, Contributor is responsible for checking a single APE-FD at a time. Since the problem is intractable (recall that it is NP-complete), Contributor does not deal directly with the computation. Indeed, as mentioned before, it splits a problem among several computational units called Sub-Contributors.
The way in which a Contributor deals with its pool of Sub-Contributors closely resembles the interaction between Worker and its Contributors, and it is described by the BPMN choreography diagram provided in Fig. 22. The status of the problem is managed by Contributor, and it is represented by a binary tree, where each node is labeled with a tuple and its two children represent the cases in which the tuple is kept in the current solution or removed from it. Subproblems are generated by asserting that a tuple does or does not belong to the final solution. This procedure generates a tree. Initially, the whole tree is given to a single Sub-Contributor to visit. Suppose that a new Sub-Contributor registers itself with the same Contributor during a computation. Such Contributor selects the subproblem of an active Sub-Contributor and a tuple t in it. Contributor then splits the subproblem into two parts, one in which t must belong to the solution and the other in which t does not belong to the solution. One portion is given to the new Sub-Contributor, and the "old" Sub-Contributor is notified to reduce its problem to the other portion. Usually, multiple Sub-Contributors work in a subtree rooted at the node where the reduce operation happens, and thus, as reported in Fig. 22, we have to notify all of them about the reduction. Figure 23 shows an example of how Contributor works. Suppose that there is exactly one Sub-Contributor sc1 that is exploring the tree (a) on the top of Fig. 23. At a certain point, a new Sub-Contributor sc2 registers itself with the Contributor and requests a subproblem. Contributor looks at the active subproblems and chooses the one of sc1. At the root of such problem, there is tuple t1. Therefore, Contributor splits the subproblem into two more subproblems: one where t1 is forced to be deleted (tree (b) in Fig. 23), and the other where t1 is kept (tree (c) in Fig. 23).
Finally, the exploration of subtree (c) is given to sc2, and sc1 is notified that its exploration of tree (a) is reduced to the exploration of subtree (b).
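The splitting step can be sketched as a plain branch-and-bound move: a subproblem records which tuples are already decided (kept or deleted) and which are still open, and splitting on an open tuple produces the two subtrees handed to the two Sub-Contributors. A minimal sketch with hypothetical names:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Subproblem:
    kept: frozenset       # tuples forced into the solution
    deleted: frozenset    # tuples forced out of the solution
    undecided: frozenset  # tuples still to branch on

def split(p, t):
    """Split subproblem p on undecided tuple t into 'keep t' / 'drop t'."""
    assert t in p.undecided
    rest = p.undecided - {t}
    keep_t = Subproblem(p.kept | {t}, p.deleted, rest)
    drop_t = Subproblem(p.kept, p.deleted | {t}, rest)
    return keep_t, drop_t

# Example: one Sub-Contributor owns the whole tree; a new one arrives,
# so the Contributor splits on tuple "t1" and distributes the halves.
root = Subproblem(frozenset(), frozenset(), frozenset({"t1", "t2", "t3"}))
for_new, for_old = split(root, "t1")
```

The union of the two children's search spaces is exactly the parent's, so no assignment is lost or duplicated when the work is redistributed.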
Sub-Contributor is the minimal computation unit: it simply performs tasks assigned by its master Contributor. Sub-Contributor constantly listens to its Contributor in order to receive reductions of its current subproblem, which speed up the process; meanwhile, it explores its current subproblem searching for a solution. Sub-Contributor operates in two symmetric ways that may be seen as two concurrent threads. The first thread assumes that its subproblem contains the solution and performs a depth-first search of the tree in order to find it. The other thread assumes that the subproblem does not contain the solution and tries to find a counterexample.

SN Computer Science
In order to deal with the latter task, the Sub-Contributor translates the subproblem into a linear programming problem and verifies its feasibility. This symmetric approach turns out to be very efficient. Attila makes use of the open-source linear programming library GNU Linear Programming Kit (GLPK) [24] to perform such linear programming tasks.
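In a simplified, non-temporal setting, the counterexample check amounts to asking whether the minimal number of tuple deletions needed to make an FD hold exceeds the allowed budget ⌊ε·N⌋. A minimal sketch using standard per-group counting (the g3-style error measure), not the prototype's actual GLPK encoding:

```python
from collections import Counter, defaultdict

def min_deletions(rows, lhs, rhs):
    """Minimal number of tuples to delete so that FD lhs -> rhs holds:
    within each lhs-group, keep only the most frequent rhs value."""
    groups = defaultdict(Counter)
    for r in rows:
        groups[tuple(r[a] for a in lhs)][tuple(r[a] for a in rhs)] += 1
    return sum(sum(c.values()) - max(c.values()) for c in groups.values())

def afd_holds(rows, lhs, rhs, eps):
    return min_deletions(rows, lhs, rhs) <= int(eps * len(rows))

# Toy instance with hypothetical values: one tuple violates GAF -> CT.
rows = [
    {"GAF": 40, "CT": "family"},
    {"GAF": 40, "CT": "family"},
    {"GAF": 40, "CT": "police"},
    {"GAF": 70, "CT": "patient"},
]
```

For plain AFDs this counting gives the exact minimum; for APE-FDs, tuples interact through the evolution join, which is precisely why the verification problem becomes NP-complete and why the prototype resorts to LP-based feasibility checks for pruning.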

Mining APE-FDs on Clinical Domains
In this section, we discuss APE-FDs obtained through Attila. These results may also be considered as an early validation of our prototype. As we already mentioned, we focused on two different clinical domains. The first one is that of psychiatry. In this domain, one of the main sources of information consists of data acquired during the (mainly telephonic) contacts between patients and psychiatrists. "Discovering Pure Temporally Evolving Functional Dependencies" section provides a detailed description of this domain. Attila allowed us to extract the following APE-FDs from relation Contact:
• The first dependency represents the fact that, considering two consecutive calls of the same patient, the number of psychologists involved in the second call uniquely depends on the GAF score of the patient during the first call. It may highlight that some (maybe implicit) policy determines the number of psychologists required for a contact, according to the conditions the given patient showed in the previous call.
• [+∞(Contact PatId)] Service, GAF → CT with ε = 0.1. Informally, it means that, for each pair of consecutive calls for the same patient, the previous patient's GAF score and Service (clinical psychiatry, medical psychology, psychotherapy, ...) uniquely determine the next contact type (family member, a neighbor, the police, ...).
• [+∞(Contact PatId)] GAF, Physician → Request with ε = 0.1. This dependency says that the next actions on patients are mainly based on the previous GAF score and physician. For each pair of consecutive calls, the request (it could be group psychotherapy, family psychotherapy, legal medical evaluation, ...) decided during the second call depends on the physician and on the GAF score of the given patient during the first call.
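The evolution expressions above can be simulated, in a simplified form, by joining each call with the next call of the same patient and then checking the (approximate) FD on the resulting pairs. A minimal sketch with hypothetical column names:

```python
from itertools import groupby
from operator import itemgetter

def evolution_pairs(rows, key, time):
    """Pair each tuple with the next tuple of the same entity (e.g., PatId)."""
    rows = sorted(rows, key=lambda r: (r[key], r[time]))
    pairs = []
    for _, grp in groupby(rows, key=itemgetter(key)):
        grp = list(grp)
        pairs.extend(zip(grp, grp[1:]))  # consecutive pairs within each entity
    return pairs

calls = [
    {"PatId": 1, "Date": 1, "GAF": 40, "CT": "family"},
    {"PatId": 1, "Date": 2, "GAF": 55, "CT": "patient"},
    {"PatId": 2, "Date": 1, "GAF": 70, "CT": "patient"},
]
pairs = evolution_pairs(calls, "PatId", "Date")
# Each pair (prev, nxt) feeds the FD check, e.g. whether the previous
# GAF (prev["GAF"]) determines the next contact type (nxt["CT"]).
```

Patient 2 has a single call and therefore contributes no pair, which matches the intuition that the dependency only constrains consecutive contacts of the same patient.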
The second clinical domain is that of pharmacovigilance, which is the science related to the management and prevention of suspected adverse reactions induced by drugs [34]. Premarketing trials are not able to discover all adverse reactions induced by the investigated drug. This is due to trial limitations, e.g., the short timespan of the study, highly selected test groups, and so on. Adverse drug reactions (ADRs) may, indeed, go undetected and become evident when the drug is already on the market [28]. Therefore, marketed drugs require continuous monitoring of their possible effects. The spontaneous reporting of ADRs allows healthcare stakeholders to identify unexpected reactions and to inform regulating authorities about them. This practice is extremely important, provides early warnings, and requires limited economic and organizational efforts and resources [21]. Among its multiple advantages, spontaneous reporting allows one to consider every drug on the market and any category of patients. It investigates possible relationships between one or more adverse reactions and one or more drugs. Physicians, chemists, or citizens are allowed to submit reports. The analysis focuses on unknown or completely undocumented relationships and may suggest a potential cause-effect link between ADRs and drugs, classified as "suspected" or "concomitant." Any report contains both demographic and specific pharmacovigilance data, such as patient information (age, nationality, gender, weight, outcome of reactions, and so on), drug(s) involved in the suspected reaction(s) (identified by their Anatomical Therapeutic Chemical (ATC) classification, brand name, and dosage), the description of the adverse reaction(s) that occurred, encoded by means of the MedDRA classification [20], the entry date, the period of the adverse reaction, and the periods of drug administrations.
These temporal data are then processed and analyzed to possibly discover cause-effect relationships between drugs and reactions in different time periods, or according to the exposure timespan. In this case, we consider the evolution of reports for the same drug (by using PhProd (Pharmaceutical Product, i.e., active principle) for performing the join in the evolution expression). This way, we may observe whether the therapy decided by physicians is influenced by previous adverse reactions. As an example, the fact that physicians are aware of past cases of adverse events could determine changes in drug dosages. Changing the prescribed drug quantities for patients because of previously suspected drugs could be considered as an attempt at avoiding such adverse reactions. Among the APE-FDs extracted from a recent instance of the Reports schema of the Italian Network of Pharmacovigilance, we introduce here the following APE-FD.
• [+∞(Reports PhProd)] PhProd, Dos → Dos with ε = 0.2. Such a dependency may highlight that, when an ADR is reported, the dose is usually adjusted in the same way for most patients, depending on the previous administrations. Such an APE-FD may suggest that Italian physicians methodically consider the Italian Network of Pharmacovigilance in managing drug therapies.

Performance Analysis
In this section, we present a short and preliminary performance analysis of Attila and of its components. We executed two kinds of tests. The first test was done using a single machine. This way, we obtained a first evaluation of the time required for mining multiple APE-FDs on a large real-world database. The second test focused on a single APE-FD, but considering a distributed architecture, using a server and at most two distinct remote machines. This test allowed us to observe whether the time required for checking a single APE-FD decreases when the problem is distributed among different computational units.
We started by analyzing the performance of the whole system when the computation is entirely done by a single machine. We tested Attila on an instance of schema Contact, consisting of approximately 1.5·10^6 rows. APE-FDs of the form [+∞(Contact PatId)] XY → Z were extracted, with a threshold ε = 0.1. Figure 24 depicts the result of this first experiment. Attila verified almost 4500 APE-FDs in about 10 days. Figure 24 shows through a pie chart that checked APE-FDs (i.e., holding and not holding) are less than half of the possible APE-FDs. Indeed, many APE-FDs, denoted as superset, are subsumed by the checked and holding APE-FDs. Thus, they have not been tested (i.e., only minimal APE-FDs have been tested). Moreover, Worker is often idling, as Contributors perform most of the computation. As described in "Mining APE-FDs" section, a Contributor has to visit a tree, which is exponentially large w.r.t. the instance size. Even though this operation is theoretically infeasible for large instances, by employing simple pruning conditions (e.g., too many tuples deleted, existence of violated constraints, and so on), the tree size may be reduced.
Finally, we analyzed the interactions between Contributor and Sub-Contributors by checking the APE-FD [+∞(Reports PhProd)] PhProd, Dos → Dos with ε = 0.2, discussed in "Mining APE-FDs on Clinical Domains" section and related to the pharmacovigilance domain. Figure 25a depicts some comparisons between various configurations of Sub-Contributors when checking the given dependency. We considered five possible situations: (1) a single local Sub-Contributor (Server), running on the same machine where Contributor is running; (2) a single local Sub-Contributor and a remote Sub-Contributor (Remote); (3) two local Sub-Contributors and a single remote Sub-Contributor; (4) two remote Sub-Contributors, i.e., two separate physical machines with identical hardware running a Sub-Contributor each; (5) a single local Sub-Contributor and two remote Sub-Contributors.

As expected, we observed that performance improved when distributing the task among different machines. Figure 25b depicts the number of closed branches, which increases according to the size of the instance. These results confirm that our database instance is easier to evaluate than the artificial instances we built to prove the NP-hardness results.

Conclusions
In this paper, we proposed a framework for discovering Approximate Pure Temporally Evolving Functional Dependencies (APE-FDs for short) from a temporal database. We have addressed in depth the data complexity of such a problem. Unfortunately, this complexity turns out to be NP-complete even for a single dependency. Moreover, moving to mining the set of APE-FDs holding on an instance, the size of the result set depends also on the number of attributes of the schema of the instance. For some instances, the lower bound of such size is exponential. We faced these problems in a real-world context, by proposing the use of model checking techniques, distributed computations, and linear programming techniques. The implemented prototype Attila was tested on two real-world clinical scenarios and proved to be efficient. Moreover, we discussed the meaning of some interesting APE-FDs mined from the previously introduced databases in the psychiatry and pharmacovigilance domains. These results may provide (clinical) stakeholders with some new, previously unknown understanding of the underlying data. We plan to further improve and extend our prototype, by integrating it into a platform allowing the discovery of different types of (temporal) approximate functional dependencies. Finally, we plan to perform an extended validation of mined APE-FDs with clinical experts.

Compliance with Ethical Standards
Conflict of interest On behalf of all authors, the corresponding author states that there are no conflict of interest.
Ethical Approval This article does not contain any studies with human participants or animals performed by any of the authors.
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.