1 Introduction

When data analysis is used for decision-making, it is critical to guarantee the traceability of the results. Data provenance in data analysis refers to the metadata describing how the results were processed and is essential for traceability. It has been discussed in many research domains such as debugging [30, 37], security [12], scientific workflows [24], and visualization [10]. In particular, data lineage refers to tracing the source data from which the analysis results are derived. This topic has been widely researched in the database domain [16, 17, 53].

Even with lineage, traceability remains a challenge. Conventional lineage assumes that users can understand why an output was derived merely by observing its source data. This assumption holds when the analysis consists of a sequence of simple operators (e.g., filter, join, and aggregation) whose operational semantics are simple. However, modern data analysis has become more complex. Not only are the targets composed of diverse content data such as images, videos, and texts, but the processing often involves sophisticated technologies such as AI and machine learning (ML). In such complex analysis, users cannot understand the reason for an analysis result from the lineage (source data) alone. In other words, we need more information about the reasoning basis of the AI/ML processing. Consider the following example.

Example 1

Figure 1 shows an analysis in which a financial institution summarizes the results of examinations of loan applications submitted in December, based on customer information, using an ML model. The ML model takes three attribute values (Income, Debt, LoanAmount) as inputs and returns the examination result (Accept, Marginal, Reject). The reason that an application was rejected is not obvious from the input alone. In this case, the lineage of \(\langle \hbox {Reject},1 \rangle \) is the set of tuples {\(\langle 0002, \hbox {Joa}\), \(\hbox {US}, 2200 \rangle \), \(\langle 0002, 1350, 12/17 \rangle \), \(\langle 0002,20000, 12/19 \rangle \)}. This is insufficient to understand why \(\langle \textrm{Reject},1 \rangle \) was derived from these source tuples. In other words, the source tuples do not fully explain “why this application was rejected.” However, the reason for the ML model’s decision, together with the source tuples, may explain why an application was rejected. For example, “the model rejected this application because the debt and loan amount of the applicant are both large.” This reason describes the basis of the result, and it can help people understand the model behavior, leading to traceability for complex data analysis.

Fig. 1

Analysis summarizing the results of examinations of loan applications submitted in December using an ML model. Tuples surrounded by dashed lines represent the source data of the result tuple \(\langle \hbox {Reject}, 1 \rangle \)

In this paper, we propose the concept of augmented lineage, an extension of lineage that incorporates the basis for computation (reason) in complex data analysis. A reason provides information that helps users understand the computation, for example, (1) which region of the content data (e.g., images or videos) the analytical model emphasizes and (2) which attributes most strongly affect the AI/ML model’s decision. Since reason is essential in our augmentation of lineage, we use the term “augmented lineage” in the sense of “reason-augmented lineage” in this paper. The contributions of this paper are as follows: (i) proposal and formulation of augmented lineage, (ii) proposal of a basic algorithm to derive augmented lineage, (iii) proposal of an enhanced method for its efficient derivation, and (iv) experimental evaluation of the proposed derivation method.

A preliminary version of this paper was published in [55]. We have substantially extended our previous work in the following three points: (i) This paper includes detailed descriptions of our model, augmented lineage, and the algorithms to derive augmented lineage. (ii) We fully revised our segmentation scheme for augmented lineage derivation. The revised scheme preserves the order of operators in a given operator tree as it is. This feature is beneficial if the given operator tree has been optimized by the query optimizer. (iii) This paper includes extensive experiments based on the derived TPC-H benchmark databases and queries at three scale factors [50], as well as the original experiments based on the LFW image dataset [28].

The rest of this paper is organized as follows: Sect. 2 overviews related work. Section 3 presents the data model used to represent data analysis in this paper. Section 4 proposes augmented lineage, while Sect. 5 formulates a basic algorithm to derive it. Section 6 details an enhanced derivation method and the system architecture. Section 7 experimentally evaluates the derivation methods. Finally, Sect. 8 concludes this paper.

2 Related work

Provenance and lineage have been researched widely in many domains [8, 13, 27].

[4,5,6, 14, 15, 17, 23, 43, 52] proposed methods that derive source data for database queries. [52] derived the lineage of output values using user-defined weak inversion functions. In that framework, the derived lineage depends on the inversion functions provided by the user and the framework, and it may be incomplete and/or redundant. [9] discussed two granularities for provenance: tuple-level (Why-Provenance) and value-level (Where-Provenance). Value-level lineage was derived in [52], and tuple-level lineage was derived in [15, 17]. [15, 17] proposed a method to derive accurate tuple-level lineage for relational queries by performing tracing queries. These frameworks focused on deriving the lineage of specified output tuples after query execution. Such a framework is called the lazy approach. The lazy approach incurs little or no overhead during ordinary query execution but requires extra work to derive lineage after the query.

There is an alternative approach called the eager approach. Frameworks based on the eager approach generally derive provenance/lineage for all output tuples while executing the analysis. Therefore, they incur some overhead during ordinary analysis execution, but the provenance/lineage is available together with the query results. [4,5,6, 14, 23, 43] are frameworks based on the eager approach. Trio [4, 5] attached identifiers to all source tuples and inherited them as annotations on query results during the analysis to derive tuple-level lineage. It also considered applications of lineage to the uncertain and probabilistic data models. DBNotes [6, 14] attached value IDs to all attribute values in source tables, inherited the IDs during query execution, and finally captured value-level provenance. Perm [23] derived tuple-level lineage by rewriting original relational queries so that they also keep information about attributes of the source tuples. Smoke [43] instrumented relational operators to construct lineage indexes, which maintain the relationship between input and output tuples while executing the query. Using the indexes, it can identify the input tuples from which output tuples were derived (backward lineage) and the output tuples to which input tuples contributed (forward lineage).

[3, 25, 47] discussed more sophisticated annotation-based provenance models known as the semiring provenance model. [25] first proposed the model. Annotations are attached to the source tuples and inherited during the analysis, and polynomials over a semiring are derived as provenance. Following this, [3] extended the framework to handle aggregate operations. When the database is large or a complex query is executed, the provenance polynomial becomes quite large, making the provenance information difficult to understand. Some frameworks addressed this problem. [20] represented semiring-based provenance using circuits, which reduces the provenance size, and [1] focused on approximating the provenance with practical information loss. In addition to the above frameworks, some studies addressed provenance in other types of database queries (e.g., Datalog [19, 25], XQuery [22], and SPARQL [49]). Traceability has also been researched in domains other than databases.

In scientific workflows, [7, 24] captured provenance for reproducibility of the computation. Recently, some studies targeted distributed big data processing systems and proposed how to capture provenance in them (e.g., MapReduce [2, 36], Spark [30], and Flink [41, 42]). In addition, provenance has also been researched in the ML domain; [39, 48, 54] focused on provenance for the training phase of ML models.

Modern data analysis includes content data analysis and AI/ML analysis (e.g., recommendation [56], anomaly detection [11, 33], and medical diagnosis [32, 35]). Some studies address traceability in such complex data analysis. [16, 53] derived lineage for analyses that include user-defined functions (UDFs) and relational operators. [16] proposed a method to derive lineage for analyses that include more generalized operators. Their method can deal with non-relational operators by focusing on the relationship between the input and output data of each operator (e.g., one-to-one, N-to-one, and N-to-M relationships) and schema information. [53] presented a method to find lineage in SciDB, which handles multi-dimensional array data, for analyses involving user-defined operators. Their study can identify backward and forward lineage at the cell level. The lineage is obtained by recording the input/output relationships of each operator in the workflow. They proposed a method to efficiently manage lineages and later derive them by rerunning. In these studies, lineage means the correspondence between input and output tuples (cells). However, they do not show the important parts of the content data (images, texts, etc.). Moreover, they do not cover the reasoning basis for AI/ML decisions. [57] proposed a method to track which parts of the content data were extracted by data extraction processing represented by UDFs, as well as traditional tuple-level provenance based on the semiring model. Even this framework cannot provide the basis for decisions made by AI/ML processing.

Recently, many studies have investigated interpretability and explainability of AI/ML processing [21, 26]. Using frameworks such as [38, 45, 46], the basis for the decisions in AI/ML processing can be shown. The augmented lineage presented in this paper is a general framework that integrates such reasoning functions into lineage and contributes to the traceability suitable for complex data analysis, including AI/ML processing.

Our framework lazily derives augmented lineage, and it uses [17] as a base reference model. [17] is a state-of-the-art approach to lazily derive tuple-level lineage for relational queries. However, our framework extends the base model in the following points: (1) Inclusion of an additional operator (Function operator) to model complex data analysis. (2) Augmentation of lineage by reason to show the basis of complex data analysis. (3) Proposal of augmented lineage derivation methods. (4) Intensive experimental evaluation of augmented lineage derivation performance.

3 Data model

This section describes the data model to represent data analysis. Complex data analysis is modeled as a task consisting of relation-like operators.

3.1 Data

We model data as a collection of tables. Table \( T(A_1, \ldots , A_n) \) has a set of tuples \( \{t_1,\ldots ,t_p\} \). \({\varvec{A}}_T\) denotes all attributes of a table T. \(t.{\varvec{A}}\) projects tuple t onto the tuple with attribute set \({\varvec{A}} ~(\subseteq {\varvec{A}}_T)\). For a set of tuples \( \{t_1,\ldots ,t_q\} \) (where, in special cases, a single value may appear instead of a tuple \(t_i\)), \(\langle t_1,\ldots ,t_q \rangle \) denotes the tuple that concatenates them. For example, for \(t_1 = \langle 1, 2 \rangle , t_2 = \langle 3 \rangle , \langle t_1, t_2 \rangle = \langle 1, 2, 3 \rangle \). The source data set D consists of tables \((T_1,\ldots ,T_m)\).Footnote 1 \({\mathscr {T}}\) denotes a task, which consists of operators defined in Sect. 3.2, on the source data set D, and O denotes its output. That is, \( O = {\mathscr {T}}(D) = {\mathscr {T}}(T_1,\ldots ,T_m)\). As a special case, a task can consist of no operator; in this case, the output is a source table itself, \(O = T_i\).

3.2 Operators

We describe the operators that compose a task. To model data analysis, we assume seven set-relational operators: the basic operators (Selection \(\sigma \), Projection \(\pi \), Join \(\bowtie \), Aggregation \(\alpha \), Union \(\cup \), Difference −) and the Function operator \(\phi \), which models complex data analysis. A task is represented as a tree consisting of these operators, whose leaf nodes are source tables. First, we define each basic operator as follows; a small Python sketch of these operators is given after Definition 1:

Definition 1

(Basic operators)

  • Selection: \(\sigma _C(T) = \{t ~ | ~ t \in T \wedge t\) satisfies condition \(C\}\).

  • Projection: \( \pi _{{\varvec{A}}}(T) = \{t.{\varvec{A}}~ | ~ t \in T\}\). Note that its results are duplicate-eliminated.

  • Join: \( T_1 \bowtie _{\theta _1} \cdots \bowtie _{\theta _{m-1}} T_m = \{\langle t_1, \ldots , t_m \rangle ~ | ~ t_i \in T_i ~(i=1,\ldots ,m) \wedge \langle t_1, \ldots , t_m \rangle \text { satisfies condition } \theta _i ~(i=1, \ldots , m-1) \}. \) A join operator without \(\theta _i\) represents a natural join.

  • Aggregation: \( \alpha _{{\varvec{G}},g({\varvec{B}})}(T) = \{\langle T'.{\varvec{G}}, g(T'.{\varvec{B}}) \rangle ~ | ~ T' \subseteq T, \forall t, t' \in T', \forall t'' \in T \setminus T':t'.{\varvec{G}} = t.{\varvec{G}} \wedge t''.{\varvec{G}} \ne t.{\varvec{G}} \}\). Note that \({\varvec{G}}\) denotes the grouping key of T and \(g({\varvec{B}})\) denotes an aggregate function over attributes \({\varvec{B}}\) in the grouped table.

  • Union: \( T_1 \cup \cdots \cup T_m = \{t~ | ~ \exists i \in \{1,\ldots ,m\}: t \in T_i\}\).

  • Difference: \( T_1 - T_2 = \{t ~ | ~ t \in T_1 \wedge t \notin T_2\}\).
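
To make the set semantics above concrete, the following is a minimal Python sketch (not part of our formal model) that implements the basic operators over tables represented as lists of dictionaries; the dedup helper, the binary-join simplification, and all names are our own illustrative choices.

```python
# Minimal sketch of the basic set-relational operators above.
# Tables are lists of dicts (tuples); dedup() models set semantics.
def dedup(rows):
    seen, out = set(), []
    for r in rows:
        key = tuple(sorted(r.items()))
        if key not in seen:
            seen.add(key)
            out.append(r)
    return out

def selection(table, condition):                 # sigma_C(T)
    return [t for t in table if condition(t)]

def projection(table, attrs):                    # pi_A(T), duplicate-eliminated
    return dedup([{a: t[a] for a in attrs} for t in table])

def join(t1, t2, theta):                         # binary theta-join; assumes disjoint attribute names
    return [{**a, **b} for a in t1 for b in t2 if theta(a, b)]

def aggregation(table, group_keys, agg_name, agg_fn, attr):   # alpha_{G, g(B)}(T)
    groups = {}
    for t in table:
        groups.setdefault(tuple(t[g] for g in group_keys), []).append(t[attr])
    return [dict(zip(group_keys, k), **{agg_name: agg_fn(vals)})
            for k, vals in groups.items()]

def union(*tables):                              # T1 U ... U Tm
    return dedup([t for table in tables for t in table])

def difference(t1, t2):                          # T1 - T2
    keys2 = {tuple(sorted(t.items())) for t in t2}
    return [t for t in t1 if tuple(sorted(t.items())) not in keys2]
```

For instance, assuming `judged_rows` holds the examination results of Fig. 1, `aggregation(judged_rows, ["Judge"], "count", len, "Judge")` would reproduce the per-result counts.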

To model complex data analysis, we employ user-defined functions (UDFs) and the function operator, which invokes a UDF. Before explaining them, we define reason.

Definition 2

(Reason) Reason is a data object returned by each UDF invocation to help users understand the UDF computation. UDF developers are responsible for making each UDF invocation output both a computation result and a reason.

Our definition of reason here is very general, and any data object which may help users understand the behavior of the computation and model can be defined as a reason. In the domain of explainable AI models, various types of data are used as “explanations” or “reasons” depending on models and problems [21, 31]. To make our model applicable to different domains, we employ this rather general definition here. Recently, model-agnostic explanation methods such as LIME [45] and SHAP [38] have been attracting attention. In those approaches, interpretable approximations of the original model are used as explanation models, and the weights associated with simplified features of the explanation model are often used as “explanations”. Our definition can accommodate such “explanations” when the UDF developer thinks they are useful to help understand the UDF computation.

Example 2

(Reason) Let us consider a classifier that takes customer and loan application information as input to examine the application, as in Example 1. Suppose a developer wants the classifier to show which input attributes contribute strongly to the decision, as well as the decision results, to help users understand the basis for the decision. In that case, the developer can define a data object specifying the significant input attributes as the reason.

Now, we define the UDF, which models complex data analysis and is invoked by a function operator in our model. A UDF has two execution modes (normal mode and reasoning mode). Intuitively, given input attribute values, the normal mode returns only the result of the complex data analysis, whereas the reasoning mode returns reasons as well as the results.

Definition 3

(UDF) A UDF is a function with the following input and output, depending on its execution mode.

  1.

    Normal mode

    $$\begin{aligned} f_{n}: \hbox {Domain}({{\varvec{E}}}) \rightarrow { Value} \end{aligned}$$
  2.

    Reasoning mode

    $$\begin{aligned} f_{r}: \hbox {Domain}({{\varvec{E}}}) \rightarrow { Value} \times { Reason} \end{aligned}$$

Here, \(\hbox {Domain}({{\varvec{E}}})\) is the domain of attributes \({{\varvec{E}}}\), \({ Value}\) is the domain of the output values of the complex data analysis, \({ Reason}\) is the domain of reasons, and \(\times \) denotes the Cartesian product of the domains. In the normal mode, a UDF invocation outputs just the calculation result of the UDF, while in the reasoning mode it also produces the reason for the calculation result. The UDF developers have to design and determine what information is produced as reasons. In more detail, our framework assumes that the developer implements two functions for a UDF: \({ main}()\), which is called in a normal mode invocation and returns only the UDF computation result, and \({ main}\_{ with}\_{ reason}()\), which is called in a reasoning mode invocation and returns a reason as well. We assume that UDFs invoke pre-trained AI/ML models and that their training is out of the scope of this paper.

Depending on the property and implementation of a UDF, providing reason information may require substantial extra computation and be time-consuming. In addition, as we discuss later, a UDF does not always need to generate reasons even when deriving augmented lineage. The two execution modes of a UDF provide such choices.

Example 3

(UDF) In Fig. 1, we employ the credit_exam function as a UDF, which takes the Income, Debt, and LoanAmount attribute values of a client and returns the examination result (Accept, Marginal, Reject). In addition, when invoked in the reasoning mode, it outputs the set of attributes that contribute strongly to the result as a reason. For example, the credit_exam function returns the following output(s) for the client with ClientID “0001” (a hypothetical Python sketch of such a UDF follows this example):

  1.

    Normal mode

    $$\begin{aligned} { credit}\_{ exam}_{n}(\{3400, 810, 4000\}) = \hbox {``Accept''} \end{aligned}$$
  2.

    Reasoning mode

    $$\begin{aligned} \begin{aligned}&{ credit}\_{ exam}_{r}(\{3400, 810,4000\})\\&\quad = [\hbox {``Accept''},\{\hbox {Income}, \hbox {LoanAmount}\}] \end{aligned} \end{aligned}$$
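
To illustrate Definition 3 and this example, the following Python sketch shows how a UDF such as credit_exam could expose the two execution modes via \({ main}()\) and \({ main}\_{ with}\_{ reason}()\); the thresholds and the attribution rule are hypothetical stand-ins for a pre-trained model and an explanation method.

```python
# Hypothetical sketch of the credit_exam UDF with its two execution modes.
# The thresholds and the attribution rule are illustrative stand-ins for a
# pre-trained model and an explanation method.
def main(income, debt, loan_amount):
    """Normal mode: return only the examination result (Value)."""
    if debt > 1000 and loan_amount > 10000:
        return "Reject"
    if loan_amount > 3 * income:
        return "Marginal"
    return "Accept"

def main_with_reason(income, debt, loan_amount):
    """Reasoning mode: return the result (Value) and a reason."""
    value = main(income, debt, loan_amount)
    if value == "Reject":
        reason = {"Debt", "LoanAmount"}      # attributes judged most influential (illustrative)
    else:
        reason = {"Income", "LoanAmount"}
    return value, reason

# main(3400, 810, 4000)             -> "Accept"
# main_with_reason(3400, 810, 4000) -> ("Accept", {"Income", "LoanAmount"})
```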

Next, we define the function operator. Intuitively, the function operator adds the result (Value) of the UDF as a new attribute value to each input tuple when invoked in the normal mode, and adds both the Value and the Reason of the UDF as new attribute values when invoked in the reasoning mode.

Definition 4

(Function operator) Function operator \(\phi _{f({{\varvec{E}}})}\) applies UDF f to all tuples (more precisely, to their values for attributes \({\varvec{E}}\)) in the input table. The output of function operator \(\phi _{f({\varvec{E}})}\) is as follows depending on the UDF’s execution mode (a small Python sketch follows the definition):

  1.

    Normal mode

    $$\begin{aligned} \begin{aligned} \phi ^n_{f({\varvec{E}})}(T)&= \phi _{f_n({\varvec{E}})}(T) \\&= \{ \langle t, f_n(t.{\varvec{E}}).\hbox {Value} \rangle | t \in T \} \end{aligned} \end{aligned}$$
  2.

    Reasoning mode

    $$\begin{aligned} \begin{aligned}&\phi ^r_{f({\varvec{E}})}(T) = \phi _{f_r({\varvec{E}})}(T) \\&\quad =\{ \langle t, f_r(t.{\varvec{E}}).\hbox {Value}, f_r(t.{\varvec{E}}).\hbox {Reason} \rangle | t \in T \} \end{aligned} \end{aligned}$$
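
A minimal sketch of the function operator on top of the table representation used earlier; `udf_normal` and `udf_reason` stand for a UDF’s \({ main}()\) and \({ main}\_{ with}\_{ reason}()\) entry points, and the column names Value and Reason are placeholders for the attributes added by \(\phi \).

```python
# Sketch of the function operator phi_{f(E)}: apply a UDF to attributes E of
# every input tuple and append Value (and, in the reasoning mode, Reason).
def function_operator(table, attrs, udf_normal, udf_reason=None, reasoning=False):
    out = []
    for t in table:
        args = [t[a] for a in attrs]             # t.E
        if reasoning:
            value, reason = udf_reason(*args)    # f_r(t.E)
            out.append({**t, "Value": value, "Reason": reason})
        else:
            value = udf_normal(*args)            # f_n(t.E)
            out.append({**t, "Value": value})
    return out

# e.g. function_operator(applicants, ["Income", "Debt", "LoanAmount"],
#                        main, main_with_reason, reasoning=True)
```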

Example 4

(Analysis modeling and function operator) The analysis in Fig. 1 can be modeled as follows:

$$\begin{aligned} \begin{aligned} O&= {\mathscr {T}}(D) = \alpha _{\hbox {Judge}, \hbox {COUNT}(*)} \\&(\phi _{f({\varvec{E}})}(\pi _{{\varvec{A}}}(\hbox {Client} \bowtie \hbox {Debt} \bowtie \sigma _{C}(\hbox {Loan}))))\\&\hbox {s.t.}~{\varvec{A}} = {\varvec{E}} = \{\hbox {Income}, \hbox {Debt}, \hbox {LoanAmount}\}\\&C: \hbox {Application date is in December} \end{aligned} \end{aligned}$$

Figure 2a shows its operator tree, and Fig. 3 shows its intermediate results.

Fig. 2

Segmentation of the operator tree

The function operator invokes the credit_exam UDF in Example 3. Here, “Judge” corresponds to the UDF’s Value, and * denotes all the attributes of the input table. In the reasoning mode, the function operator returns the table in Fig. 3e. “Reason” corresponds to the UDF’s Reason.

For ease of exposition, this paper restricts the operators to relational operators. However, our framework can deal with more generalized processing and external programs as long as they have the same input/output relationships as relational operators. For example, when an external program inputs each tuple and (1) outputs it or (2) discards it, then we can model this processing as selection \(\sigma \) because it has the same input/output relationship. Similarly, other data processing can be modeled as one of the above operators as long as their input/output relationship is the same.

Fig. 3

Base tables, intermediate results, and final result of the task

4 Augmented lineage

This section defines augmented lineage. Augmented lineage is an extension of lineage in [15]. The set of the source tuples from which tuple \(o (\in {\mathscr {T}}(D))\) of task \({\mathscr {T}}\) is derived is called the source lineage of tuple o in task \({\mathscr {T}}\).

We first define the source lineage of a tuple o in an operator Op, and then define the source lineage of tuple o in task \({\mathscr {T}}\). The following definition specifies three requirements: (1) The tuples in the source lineage produce output tuple o. (2) Every tuple in the source lineage contributes to deriving tuple o. (3) The source lineage consists of the maximal subsets of the source tables that meet conditions (1) and (2).

Definition 5

(Source Lineage in an Operator) Let \( O (= { Op}(T_1,\ldots ,T_m))\) be the result of performing an operator \({ Op}\) on the set of tables \( \{T_1,\ldots ,T_m\} \). The source lineage of tuple \(o (\in O)\) in operator \({ Op}\), denoted by \({ SL}_{{ Op}}(o)\), is the set of \({\hat{T}}_i \subseteq T_i ~(i = 1,\ldots ,m)\) which meets the following conditions.Footnote 2

  1.

    \({ Op}({\hat{T}}_1, \ldots ,{\hat{T}}_m) = \{o\} \)

  2.

    \(\forall {\hat{T}}_i: \forall {\hat{t}} \in {\hat{T}}_i: Op({\hat{T}}_1, \ldots , \{{\hat{t}}\}, \ldots , {\hat{T}}_m) \not = \emptyset \)

  3.

    \({\hat{T}}_i\) is the maximum subset of \(T_i\) that meets 1 and 2.

The source lineage of a tuple set \({\bar{O}} (\subseteq O)\) in operator \({ Op}\) is \({ SL}_{{ Op}}({\bar{O}}) = {\hat{\bigcup }}_{o \in {\bar{O}}} { SL}_{{ Op}}(o)\). Note that \({\hat{\bigcup }}\) takes the union per table: \(\{{T}^{1}_1, \ldots , {T}^{1}_m \} ~\hat{\bigcup }~ \{{T}^{2}_1, \ldots , {T}^{2}_m\} = \{{T}^{1}_1 \cup {T}^{2}_1, \ldots , {T}^{1}_m \cup {T}^{2}_m\}\) s.t. \({T}^{j}_i \subseteq T_i\). The component \(\hat{T_i} \in { SL}_{{ Op}}({\bar{O}})\) corresponding to table \(T_i\) is denoted by \(P_{T_i}({ SL}_{{ Op}}({\bar{O}}))\).
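
As a small illustration, the per-table union \({\hat{\bigcup }}\) can be sketched in Python over source lineages represented as dictionaries mapping table names to tuple lists (the `dedup` helper is from the earlier sketch):

```python
# Per-table union (the hat-union) over source lineages represented as
# {table name: list of contributing tuples}; dedup() is from the earlier sketch.
def hat_union(sl1, sl2):
    out = {}
    for name in set(sl1) | set(sl2):
        out[name] = dedup(sl1.get(name, []) + sl2.get(name, []))
    return out
```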

Example 5

(Source lineage in an operator) Let us consider the join operator \(\bowtie \) in Fig. 2a, which produces the table shown in Fig. 3c. The source lineage of tuple \(\langle 0001, \ldots , 12/16 \rangle \) in the join is \(SL_{\bowtie }(\langle 0001, \ldots , 12/16 \rangle ) = \{\hbox {Client}: \{\langle 0001, \ldots , 3400 \rangle \}, \hbox {Debt}: \{\langle 0001, 810, 12/7 \rangle \}, \hbox {Loan}: \{\langle 0001, 4000, 12/16 \rangle \}\}\). Similarly, \(SL_{\bowtie }(\langle 0002, \ldots , 12/19 \rangle ) = \{\hbox {Client}: \{\langle 0002, \ldots , 2200 \rangle \}, \hbox {Debt}: \{\langle 0002, 1350, 12/17 \rangle \}, \hbox {Loan}: \{\langle 0002, 20000, 12/19 \rangle \}\}\). Then, the source lineage of both tuples in the join is \(SL_{\bowtie }(\{\langle 0001, \ldots , 12/16 \rangle , \langle 0002, \ldots , 12/19 \rangle \}) = \{\hbox {Client}: \{\langle 0001, \ldots , 3400 \rangle , \langle 0002, \ldots , 2200 \rangle \}, \hbox {Debt}: \{\langle 0001, 810, 12/7 \rangle , \langle 0002, 1350, 12/17 \rangle \}, \hbox {Loan}: \{\langle 0001, 4000, 12/16 \rangle , \langle 0002, 20000, 12/19 \rangle \}\}\).

Fig. 4

Augmented lineage of tuple set \(\{\langle \hbox {Reject}, 1 \rangle \}\) in Example 4

Next, we define source lineage in a task. Since a task is an operator tree, the source lineage in the task is obtained by recursively deriving the source lineage in each operator in the operator tree.

Definition 6

(Source Lineage in Task) Let \( O (= {\mathscr {T}}(T_1,\ldots ,T_m)) \) be the result of performing a task \({\mathscr {T}}\) on the set of tables \( \{T_1,\ldots ,T_m\} \). Source lineage of tuple \(o (\in O)\) in task \({\mathscr {T}}\) (denoted by \({ SL}_{{\mathscr {T}}}(o)\)) is the set of \({\hat{T}}_i \subseteq T_i\) which meets the following conditions.

  1.

    If \({\mathscr {T}}\) contains no operator, namely \(O = T_i\), then \({ SL}_{{\mathscr {T}}}(o) = \{o\} ~(\subseteq T_i)\).

  2.

    Otherwise, \({\mathscr {T}}(D) = { Op}(O_1, \ldots , O_n)\) s.t. \(O_i = {\mathscr {T}}_i(D_i), D_i \subseteq D\). Let \(\hat{O_i} = P_{O_i}({ SL}_{{ Op}}(o))\). Then, \({ SL}_{{\mathscr {T}}}(o) = \bigcup _{1 \le i \le n}\bigl ({\hat{\bigcup }}_{\hat{o_i} \in \hat{O_i}} { SL}_{{\mathscr {T}}_i}(\hat{o_i})\bigr )\).

The source lineage of tuple set \({\bar{O}} (\subseteq O)\) in task \({\mathscr {T}}\) is \({ SL}_{{\mathscr {T}}}({\bar{O}}) = {\hat{\bigcup }}_{o \in {\bar{O}}} { SL}_{{\mathscr {T}}}(o)\). Note that the source lineage in a task depends on the operator tree and may change even if a different operator tree logically equivalent to the original tree is given, as suggested in [17].

Example 6

(Source lineage in task) Let us consider the partial task \({\mathscr {T}}\) (\(\pi \)-\(\bowtie \)-\(\sigma \)) in Fig. 2a, which takes the base tables shown in Fig. 3a and produces the table shown in Fig. 3d. Then, the source lineage of tuple \(\langle 3400, 810, 4000 \rangle \) in the task is \({ SL}_{{\mathscr {T}}}(\langle 3400, 810, 4000 \rangle ) = \{\hbox {Client}: \{\langle 0001, \ldots , 3400 \rangle \}, \hbox {Debt}: \{\langle 0001, 810, 12/7 \rangle \}, \hbox {Loan}: \{\langle 0001, 4000, 12/16 \rangle \}\}\).

Before defining the augmented lineage, we define intermediate lineage of tuple set \({\bar{O}} (\subseteq {\mathscr {T}}(D))\) in an intermediate result. It represents the set of tuples in the intermediate result which contribute to \({\bar{O}}\).

Definition 7

(Intermediate Lineage) Let O be the output tuples of task \({\mathscr {T}}\). That is, \( O = {\mathscr {T}}(D) = {\mathscr {T}}(T_1, \ldots ,T_m) \). Furthermore, let task \({\mathscr {T}}\) be divided into two tasks, \({\mathscr {T}}'\) and \({\mathscr {T}}''\), s.t. \( {\mathscr {T}}(D) = {\mathscr {T}}'(O', T_{l_{i+1}}, \ldots , T_{l_m})\), \( O' = {\mathscr {T}}''(T_{l_1},\ldots ,T_{l_i})\). Namely, \(O'\) is the intermediate result of the task \({\mathscr {T}}''\). The intermediate lineage \(IL({\bar{O}}, O')\) of tuple set \({\bar{O}} \) (\(\subseteq {\mathscr {T}}(D)\)) in intermediate result \(O'\) is the source lineage \(\hat{O'} \subseteq O'\) of tuple set \({\bar{O}}\) in task \({\mathscr {T}}'\), namely \({ IL}({\bar{O}}, O') = P_{O'}({ SL}_{{\mathscr {T}}'}({\bar{O}}))\).

Example 7

(Intermediate lineage) Let \({\mathscr {T}}\) denote the whole task shown in Fig. 2a, and let \(\hbox {INT}_{\sigma }\) denote the intermediate result of \(\sigma \) in the task, shown in Fig. 3b. Then, the intermediate lineage of tuple set \(\{\langle \hbox {Reject}, 1 \rangle \}\) in \(\hbox {INT}_{\sigma }\) is \({ IL}(\{\langle \hbox {Reject}, 1 \rangle \}, \hbox {INT}_{\sigma }) = \{\langle 0002, 20000, 12/19 \rangle \}\).

Finally, we define augmented lineage. Augmented lineage consists of source lineage (the source tuples that contributed to the output tuples) and reasoning lineage (the reasons for UDF invocations).

Definition 8

(Augmented Lineage) Augmented lineage \({ AL}({\bar{O}})\) of tuple set \({\bar{O}} (\subseteq {\mathscr {T}}(D))\) in task \({\mathscr {T}}\) consists of the following pair, where \(O'_i\) denotes the intermediate result generated by each function operator \(\phi ^r_{f_i({\varvec{E_i}})}\) invoked in the reasoning mode.

  • Source Lineage (SL): \({ SL}_{{\mathscr {T}}}({\bar{O}}) \)

  • Reasoning Lineage (RL): \({ RL}_{{\mathscr {T}}}({\bar{O}})= \{ \langle o.{{\varvec{E}}_i}, o.\hbox {Value},o.\hbox {Reason} \rangle | \forall i: o \in IL({\bar{O}}, O'_i) \} \)

Example 8

(Augmented lineage) Referring to the outputs of the function operator shown in Fig. 3e, the augmented lineage of tuple set \(\{\langle \hbox {Reject}, 1 \rangle \}\) in the task of Example 4 is shown in Fig. 4.

5 Augmented lineage derivation

As mentioned in Sect. 2, there are two approaches to obtaining lineage: the eager approach and the lazy approach [13, 27]. The former produces lineage for all analysis results during ordinary analysis execution. The latter derives lineage only for the chosen analysis results after the analysis execution. This paper focuses on the lazy approach. In recent data analysis, analysts often develop analysis flows by trial and error, configuring parameters, changing source data, etc. In this context, the lazy approach is more desirable than the eager approach, because the lazy approach derives lineage only when needed, while the eager approach incurs a larger overhead for every analysis execution even when lineage is not required. Our proposed method to derive augmented lineage is based on [17], which proposed a method to obtain lineage of database queries via the lazy approach. Our extension enables the use of UDFs for complex data analysis, a more practical treatment of operators and operator trees, and the inclusion of reasons in the augmented lineage.

When a tuple in the analysis output is specified as a target for augmented lineage derivation, our method first tries to look for tuples in source tables whose corresponding attribute values match the attribute values in the specified target tuple as in the lineage derivation in [17]. In cases of simple tasks, this approach directly leads to finding source tuples in the source tables. However, when the task is complex, some attribute values in the target tuple may have been newly generated by operators such as aggregations and functions and do not have directly corresponding attribute values in source tables. In such cases, we need to decompose a task into smaller sub-tasks, and derive augmented lineage step by step using intermediate results of the sub-tasks, rather than deriving augmented lineage from the source tables directly. For this purpose, given a task, we divide its operator tree into one or more operator sub-trees (segments). Then, we derive augmented lineage recursively in a top-down manner, as illustrated in the following example.

Example 9

Consider the augmented lineage of the tuple \(\langle \hbox {Reject}, 1 \rangle \) in Fig. 1 for the task modeled in Example 4. To derive its augmented lineage, we divide the analysis into segments because the attribute values in the tuple \(\langle \hbox {Reject}, 1 \rangle \) are generated by the aggregation and function operators and have no corresponding attribute values in the source tables. In this example, we have two segments: (1) \(\alpha \) and (2) \(\phi \)-\(\pi \)-\(\bowtie \)-\(\sigma \). \(\phi \)’s intermediate result (namely, the input to \(\alpha \)) has five attributes (Income, Debt, LoanAmount, Judge, Reason). We first find the tuple \(\langle 2200, 1350, 20000, \hbox {Reject}, \{\hbox {Debt}, \hbox {LoanAmount}\} \rangle \) as the intermediate lineage of the tuple \(\langle \hbox {Reject}, 1 \rangle \) in the intermediate result shown in Fig. 3e, which is the output of \(\phi \). We also record \(\{ \langle \langle 2200, 1350, 20000 \rangle , \hbox {Reject}, \{\hbox {Debt}, \hbox {LoanAmount}\} \rangle \}\) as the reasoning lineage. Next, we derive the source lineage of the tuple \(\langle 2200, 1350, 20000, \hbox {Reject}, \{\hbox {Debt}, \hbox {LoanAmount}\} \rangle \) in the source tables. Eventually, we obtain the augmented lineage as the pair of the source lineage \(\{\{ \langle 0002, \hbox {Joa}, \hbox {US}, 2200 \rangle \}, \{ \langle 0002, 1350, 12/17 \rangle \}, \{ \langle 0002, 20000, 12/19 \rangle \}\}\) and the reasoning lineage \(\{ \langle \langle 2200, 1350, 20000 \rangle , \hbox {Reject}, \{\hbox {Debt}, \hbox {LoanAmount}\} \rangle \}\).

In Sects. 5.1 and 5.2, we explain the segment in more detail and how to divide a task into segments, respectively. The segmentation scheme shown here is quite different from those in [17, 55] in that it preserves the order of operators in a given operator tree as it is. This feature is beneficial if the given operator tree is optimized by the query optimizer based on the cost model and/or heuristics such as selection push-down. In Sect. 5.3, we introduce a tracing query, which finds the source lineage for a single segment. Finally, we propose a basic algorithm to derive the augmented lineage of a task in Sect. 5.4.

5.1 Segment

There are two types of segments: Non-D-segments and D-segments. The original idea of Non-D-segments and D-segments comes from the AUSPJ- and D-segments in our reference work [17]. The main reason for this separation is that, due to the non-monotonic property of the difference operator, formulating the tracing query becomes complicated if difference is mixed with the other operators. We follow their approach here.

  • Non-D-segment: A segment consisting of operators other than difference, in the pattern “\(\phi ^*\)-\(\alpha ^*\)-\((\cup | \pi | \sigma )^*\)-\((\pi | \sigma |\bowtie )^*\)”. \(\phi ^*\) and \(\alpha ^*\) denote sequences of function and aggregation operators, respectively. \((\cup | \pi | \sigma )^*\) and \((\pi | \sigma | \bowtie )^*\) denote operator sub-trees that consist of any combinations of the specified three operators. The leftmost operator is located at the top of the operator tree; namely, the operators are executed from right to left in a bottom-up manner. Note that not all the operators need to actually appear in a Non-D-segment. If \(\alpha ^*\) consists of multiple aggregations \(\alpha _1- \cdots -\alpha _m\), the grouping keys \({\varvec{G_i}}\) of \(\alpha _i\) must satisfy \({\varvec{G_1}} \subseteq \cdots \subseteq {\varvec{G_m}}\). This segment is based on the AUSPJ-segment \(\alpha \)-\(\cup \)-\(\pi \)-\(\sigma \)-\(\bowtie \) proposed in [17]. Placing the function operator at the top of the segment naturally extends the AUSPJ-segment to accommodate the function operator, since the attributes corresponding to the Value and Reason of UDFs are guaranteed to appear in its intermediate result, which is convenient for deriving reasoning lineage. Another change is the generalization of operator sequences to reduce the number of segments.

  • D-segment: A segment consisting of a single difference operator. This segment is represented as the following pattern: “−”.

The leftmost operator in a segment is called the top of the segment, and the top segment of the whole operator tree is called the root segment.

5.2 Segmentation

Given a task \({\mathscr {T}}\), which consists of one or more operators, the segmentation splits the task into one or more segments by applying longest pattern matching to its operator tree in a top-down manner from the root operator node. The patterns are those described in the segment definitions above.
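
The following Python sketch illustrates the longest-pattern-matching segmentation for the common case of a linear operator tree, given as a top-down list of operator symbols; it omits the grouping-key nesting check for consecutive aggregations and a full tree-walking version, so it is only a simplified illustration.

```python
# Sketch: segment a linear operator tree, given as a top-down list of operator
# symbols, by longest pattern matching.  The grouping-key nesting condition for
# consecutive aggregations and general (non-linear) trees are omitted.
STAGES = [            # Non-D-segment pattern: phi* - alpha* - (U|pi|sigma)* - (pi|sigma|join)*
    {"phi"},
    {"alpha"},
    {"union", "pi", "sigma"},
    {"pi", "sigma", "join"},
]

def segmentation(ops):
    segments, i = [], 0
    while i < len(ops):
        if ops[i] == "diff":               # D-segment: a single difference operator
            segments.append(["diff"])
            i += 1
            continue
        seg, stage = [], 0
        while i < len(ops) and ops[i] != "diff":
            while stage < len(STAGES) and ops[i] not in STAGES[stage]:
                stage += 1                 # advance to a later stage of the pattern
            if stage == len(STAGES):
                break                      # operator no longer fits: start a new segment
            seg.append(ops[i])
            i += 1
        if not seg:
            raise ValueError(f"unknown operator: {ops[i]}")
        segments.append(seg)
    return segments

# segmentation(["alpha", "phi", "pi", "join", "sigma"])
#   -> [["alpha"], ["phi", "pi", "join", "sigma"]]   (cf. Example 10 and Fig. 2b)
```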

Proposition 1

The segmentation uniquely decomposes a task into one or more (Non-D/D) segments.

Proof

According to the definitions of segments, any operator can be the top of a segment. Therefore, the segmentation decomposes the operator tree into one or more disjoint segments which cover the whole operator tree. In addition, since we apply longest pattern matching, the decomposition is unique. \(\square \)

Example 10

The task in Example 4 is divided into two Non-D-segments by the segmentation: (1) Non-D-segment consisting of \(\alpha \) alone and (2) Non-D-segment \(\phi \)-\(\pi \)-\(\bowtie \)-\(\sigma \) (Fig. 2b).

5.3 Tracing query

We define a tracing query for each segment. Given a task \({\mathscr {T}}\) consisting of a single segment, the tracing query finds the source lineage \({ SL}_{{\mathscr {T}}}({\bar{O}})\) of a tuple set \({\bar{O}} (\subseteq {\mathscr {T}}(T_1, \ldots , T_m))\) in task \({\mathscr {T}}\).

First, we introduce the split operator. Given a table and a list of attribute sets, the split operator projects the table onto each set of attributes.

Definition 9

(Split operator) Given a table T with its attribute set \({\varvec{A}}_T\), the split operator produces a set of tables, each with attribute set \({\varvec{A}}_{T_i} \subseteq {\varvec{A}}_T\), using the projection as follows:

$$\begin{aligned} { Split}_{{\varvec{A}}_{T_1},\ldots ,{\varvec{A}}_{T_n}}(T) = \{\pi _{{\varvec{A}}_{T_1}}(T),\ldots ,\pi _{{\varvec{A}}_{T_n}}(T)\} \end{aligned}$$

Example 11

(Split operator) Let \(\hbox {INT}_{\bowtie }\) denote the output table of the join in Fig. 2a shown in Fig. 3c. The split operator \({ Split}_{{\varvec{A}}_{\textrm{Client}}, {\varvec{A}}_{\textrm{Debt}}, {\varvec{A}}_{\textrm{Loan}}}(\hbox {INT}_{\bowtie })\) produces \(\{\hbox {Client}: \{\langle 0001, \hbox {Havey}, \hbox {UK}, 3400 \rangle , \langle 0002, \hbox {Joa}, \hbox {US}, 2200 \rangle , \ldots \}, \hbox {Debt}: \{\langle 0001, 810, 12/7 \rangle , \langle 0002, 1350, 12/17 \rangle , \ldots \}, \hbox {Loan}: \{\langle 0001, 4000, 12/16 \rangle , \langle 0002, 20000, 12/19 \rangle , \ldots \}\}\).

Fig. 5

Transformed non-D-segment

Proposition 2

An operator tree for a Non-D-segment can be rewritten into the following form (Fig. 5), and the lineage derived for the rewritten tree is the same as that derived for the original operator tree.

$$\begin{aligned} \begin{aligned} {\mathscr {T}}(D)&= \phi _{f_1({\varvec{E_1}})}(\ldots ( \phi _{f_n({\varvec{E_n}})}(\alpha _{{\varvec{G_1}},g_1({\varvec{B_1}})}\\&\quad (\ldots (\alpha _{{\varvec{G_k}},g_k({\varvec{B_k}})}(\cup _{i}(\pi _{{\varvec{A_i}}}(\sigma _{C_i}(T^i_1 \bowtie \cdots \bowtie T^i_{m_i}))))))))) \end{aligned} \end{aligned}$$

Proof

We can prove that this rewriting preserves logical equivalence by the commutativity of operators. As discussed in Sect. 4, rewriting an operator tree may affect the derived lineage. However, it is proved in [17] that this rewriting has no effect on the derived lineage. \(\square \)

The tracing queries for Non-D-segments and D-segments are shown as follows:

  • Tracing Query for a Non-D-Segment: Based on the rewriting in Proposition 2, the source lineage of tuple set \({\bar{O}} (\subseteq {\mathscr {T}}(D))\) in task \({\mathscr {T}}\) can be obtained by executing the following tracing query (a conceptual Python sketch of this query is given after this list):

    $$\begin{aligned} \begin{aligned} { TQ}_{{\bar{O}},{\mathscr {T}}}(D)&= \bigcup _{i} { Split}_{{\varvec{A}}_{T^i_1},\ldots ,{\varvec{A}}_{T^i_{m_i}}}\\&\quad (\sigma _{C_i}(T^i_1 \bowtie \cdots \bowtie T^i_{m_i}) \ltimes {\bar{O}}) \end{aligned} \end{aligned}$$

    Note that \({\varvec{A}}_{T^i_j}\) denotes the set of attributes of \(T^i_j\) and \(\ltimes \) denotes the semi-join. If some operators are missing in the Non-D-segment, their counterparts are omitted in the tracing query. If tuple set \({\bar{O}}\) is the whole output \({\mathscr {T}}(D)\), the tracing query is denoted as follows using the notation ALL:

    $$\begin{aligned} \begin{aligned} { TQ}_{\textbf{ALL},{\mathscr {T}}}(D)&= \bigcup _{i} { Split}_{{\varvec{A}}_{T^i_1},\ldots ,{\varvec{A}}_{T^i_{m_i}}} \\&(\sigma _{C_i}(T^i_1 \bowtie \cdots \bowtie T^i_{m_i})) \end{aligned} \end{aligned}$$
  • Tracing Query for a D-Segment: Given a D-segment \({\mathscr {T}}(D) = T_1 - T_2\), the source lineage of tuple set \({\bar{O}} (\subseteq {\mathscr {T}}(D))\) in task \({\mathscr {T}}\) is represented as follows:

    $$\begin{aligned} { TQ}_{{\bar{O}},{\mathscr {T}}}(T_1, T_2) = \{ {\bar{O}}, T_2 \} \end{aligned}$$

    Please note that the tracing query for a D-Segment identifies \(T_2\) as well as \({\bar{O}}\) as the source lineage.
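
As an illustration, the tracing query for a Non-D-segment with a single union branch can be sketched in Python over the table representation used earlier; in our prototype (Sect. 6.2) the corresponding query is issued as SQL against PostgreSQL, so this is only a conceptual rendering, and `join_attrs` and `attr_sets` are illustrative parameters.

```python
# Sketch: tracing query for a Non-D-segment with a single union branch.
# Rejoin the (source or intermediate) tables, apply the segment's selection,
# semi-join with the target tuple set o_bar, and split the matching joined
# tuples back into per-table lineage (Split operator, Definition 9).
def semi_join(table, o_bar, attrs):
    keys = {tuple(o[a] for a in attrs) for o in o_bar}
    return [t for t in table if tuple(t[a] for a in attrs) in keys]

def split(table, attr_sets):
    # attr_sets: {table name: attribute set}; projection() is from the earlier sketch
    return {name: projection(table, attrs) for name, attrs in attr_sets.items()}

def tracing_query_non_d(tables, theta, condition, o_bar, join_attrs, attr_sets):
    joined = tables[0]
    for t in tables[1:]:
        joined = join(joined, t, theta)      # T1 join ... join Tm (one theta reused, a simplification)
    filtered = selection(joined, condition)   # sigma_C
    matched = semi_join(filtered, o_bar, join_attrs)
    return split(matched, attr_sets)           # per-table source lineage
```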

5.4 Augmented lineage derivation procedure

We explain an algorithm to derive the augmented lineage of tuple set \({\bar{O}} (\subseteq {\mathscr {T}}(D))\) in task \({\mathscr {T}}\). To obtain the augmented lineage, we (1) split the task into segments as described in Sect. 5.2 and (2) apply Algorithm 1 to the task. Algorithm 1 iteratively applies the tracing query to each segment from the root of the operator tree, derives the source lineage of the segment, and records the reasoning lineage if the tuples in the intermediate result contain reasons.
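
Below is a simplified Python sketch of this derivation loop; Algorithm 1 gives the full pseudocode, and the line numbers cited in Example 12 below refer to that algorithm. The callables `tracing_query`, `is_function_segment`, `extract_reasons`, and `child_segment_index` are injected placeholders for the machinery of Sects. 5.3 and 6.

```python
from collections import deque

# Simplified sketch of the derivation loop (cf. Algorithm 1).  The callables
# tracing_query, is_function_segment, extract_reasons and child_segment_index
# are injected placeholders for the machinery of Sects. 5.3 and 6.
def derive_augmented_lineage(segments, o_bar, source_table_names,
                             tracing_query, is_function_segment,
                             extract_reasons, child_segment_index):
    SL, RL = {}, []
    queue = deque([(o_bar, 0)])              # (target tuple set, index of its top segment)
    while queue:
        targets, seg_idx = queue.popleft()
        segment = segments[seg_idx]
        if is_function_segment(segment):     # top operator is a reasoning-mode function operator
            RL.extend(extract_reasons(targets))        # record (E, Value, Reason) triples
        lineage = tracing_query(targets, segment)      # {input name: contributing tuples}
        for name, tuples in lineage.items():
            if name in source_table_names:
                SL.setdefault(name, []).extend(tuples)  # reached a source table
            else:
                queue.append((tuples, child_segment_index(segment, name)))
    return SL, RL
```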

Example 12

(Augmented lineage derivation procedure) Let us consider the augmented lineage of the tuple \(\langle \hbox {Reject}, 1 \rangle \) in the task shown in Example 4 using Algorithm 1. Source data set D consists of three tables (Client, Debt, Loan), the task \({\mathscr {T}}\) is composed of two segments (\({\mathscr {T}}_1 = \alpha \) and \({\mathscr {T}}_2 =\phi \)-\(\pi \)-\(\bowtie \)-\(\sigma \)) as shown in Fig. 2b, and tuple set \({\bar{O}}\) is \(\{\langle \hbox {Reject}, 1 \rangle \}\). Therefore, TQqueue is \([\{\langle \hbox {Reject}, 1 \rangle \}, {\mathscr {T}}]\) at line 5.

First, \([\{\langle \hbox {Reject}, 1 \rangle \}, {\mathscr {T}}]\) is dequeued from TQqueue at line 7. Then, since the top segment of the task is Non-D-segment \(\alpha \), we execute the following tracing query at line 28:

$$\begin{aligned} \begin{aligned} {\hat{\hbox {INT}}}&= { TQ}_{\{\langle \hbox {Reject}, 1 \rangle \},{\mathscr {T}}_1}(D)\\&=\hbox {INT} \ltimes _{\textrm{Judge}} \{\langle \hbox {Reject}, 1 \rangle \} \end{aligned} \end{aligned}$$

Note that \(\hbox {INT}\) denotes the intermediate result of the function operator shown in Fig. 3e, and \({\hat{\hbox {INT}}}\) denotes the intermediate lineage of tuple set \(\{\langle \hbox {Reject}, 1 \rangle \}\) in \(\hbox {INT}\). Then, since \({\hat{\hbox {INT}}}\) does not correspond to any source table, we enqueue \([{\hat{\hbox {INT}}}, {\mathscr {T}}_2]\) into TQqueue at line 38. Next, \([{\hat{\hbox {INT}}}, {\mathscr {T}}_2]\) is dequeued at line 7. Then, we store the reasoning lineage from \({\hat{\hbox {INT}}}\) in \({ RL}\) at line 24 since \({\mathscr {T}}_2\) is a Non-D-segment, and its top operator is a function operator. After that, we execute the following tracing query for \({\mathscr {T}}_2\) at line 28:

$$\begin{aligned} \begin{aligned}&{ TQ}_{{\hat{\hbox {INT}}},{\mathscr {T}}_2}(D) = { Split}_{{\varvec{A}}_{\textrm{Client}},{\varvec{A}}_{\textrm{Debt}},{\varvec{A}}_{\textrm{Loan}}}\\&\quad (\sigma _C(\hbox {Client} \bowtie \hbox {Debt} \bowtie \hbox {Loan}) \ltimes _{{\varvec{E}}} {\hat{\hbox {INT}}})\\&\hbox {s.t.} ~{\varvec{E}} = \{\hbox {Income}, \hbox {Debt}, \hbox {LoanAmount}\}\\&\quad C: \hbox {Application date is in December} \end{aligned} \end{aligned}$$

Then, since \({\hat{\hbox {Client}}}\), \({\hat{\hbox {Debt}}}\), and \({\hat{\hbox {Loan}}}\) correspond to source tables, they are stored in \({ SL}\). Finally, the algorithm returns the pair of \({ SL}\) and \({ RL}\) at line 42. Note that \({\hat{\hbox {INT}}}\), \({ SL}\), and \({ RL}\) are as explained in Example 9.

This algorithm basically assumes that when a task consists of multiple segments, the intermediate result of the lower segment is available for the tracing query to derive the source lineage of the upper segment, and reason information is included in the intermediate result of function operators.Footnote 3 We discuss a number of alternative approaches on how to prepare the intermediate results of non-root segments in Sect. 6.

We discuss the computational complexity of the algorithms in terms of two points: (1) the number of executed tracing queries for a task and (2) the complexity of a tracing query itself.

Proposition 3

(The number of executed tracing queries) The number of executed tracing queries for a task is the same as the number of segments composing the task.

Proof

It is obvious from Algorithm 1. \(\square \)

Proposition 4

(Complexity of a tracing query for a Non-D-segment) The tracing query for a Non-D-segment can be represented as follows as discussed in Sect. 5.3.

$$\begin{aligned} \begin{aligned} TQ_{{\bar{O}},{\mathscr {T}}}(D)&= \bigcup _{i} Split_{{\varvec{A}}_{T^i_1},\ldots ,{\varvec{A}}_{T^i_{m_i}}}\\&\quad (\sigma _{C_i} (T^i_1 \bowtie \cdots \bowtie T^i_{m_i}) \ltimes {\bar{O}}) \end{aligned} \end{aligned}$$

Note that when the tracing query is executed on an intermediate result, \(T^i_1, \ldots , T^i_{m_i}\) represent the intermediate result tables. Let \({ card}(T^i_j)\) denote the cardinality of \(T^i_j\). Then, the complexity of the tracing query is \(O\bigl (\sum _{i}\bigl (\prod _{j} { card}(T^i_j)\bigr ) \cdot { card}({\bar{O}})\bigr )\).

Proof

In the tracing query, joins are dominant operators. For each i, the join/semi-join operation over \(m_i + 1\) tables is performed. Thus, we get the above equation. \(\square \)

Proposition 5

(Complexity of a tracing query for a D-segment) The complexity of the tracing query for D-segment is O(1).

Proof

It is obvious from the definition of the tracing query for D-segment. \(\square \)

Algorithm 1

6 Implementation of augmented lineage derivation

6.1 Deployment of intermediate results

The procedure to derive the augmented lineage shown in Sect. 5.4 assumes that intermediate results of non-root segments are available before execution of tracing queries. In this section, we introduce how to deploy intermediate results.

There are the following two naive approaches to do so.

  • Rerun: This approach runs the original analysis task as it is, usually in the normal mode. When augmented lineage is requested, it restores all the needed intermediate results of the non-root segments by rerunning the original task in the reasoning mode. Although it incurs no runtime overhead or storage cost for the original analysis task, rerunning the task to restore the intermediate results takes much time.

  • Full Materialization (Full): This approach executes the original analysis task in the reasoning mode and materializes (generates and stores) the needed intermediate results of non-root segments during its execution. This approach affects the performance of the original task and incurs storage cost for managing the intermediate results.

In preliminary experiments, we evaluated the performance of the above two approaches on complex data analyses. We found that the execution cost of a function operator with an expensive UDF tends to dominate the overall processing time compared with the other relational operators. Hence, based on this observation, we propose the following alternative approach to providing the intermediate results of non-root segments.

  • Function Materialization (FM): This approach executes the original analysis task in the reasoning mode as in Full Materialization but materializes only intermediate results of non-root segments which contain function operators during its execution. Intermediate results of the other segments are restored by rerunning the segments.

To reduce the time of rerunning tasks and segments in Rerun and FM, we utilize the semi-join pushdown optimization used in [57]. When we get the output tuples O of a task and \({\bar{O}} \subseteq O\) are specified as targets for augmented lineage derivation, we apply the semi-join \(O \ltimes {\bar{O}}\) and then push it down along the operator tree. Although [57] does not consider the difference operator, the pushdown transformation \((T_1 - T_2) \ltimes {\bar{O}} \rightarrow (T_1 \ltimes {\bar{O}}) - T_2\) is applicable. This optimization contributes to reducing the size of the tables accessed in rerunning the analysis. We experimentally evaluate the cost of the above approaches.
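
For example, in the Rerun case for the task of Example 4 with \({\bar{O}} = \{\langle \hbox {Reject}, 1 \rangle \}\), one possible instantiation pushes the semi-join down to just above the function operator (the Judge attribute is produced by \(\phi \), so the semi-join cannot be pushed below it); this is only a worked illustration of the pushdown under the rules above, not an additional rule:

$$\begin{aligned} \begin{aligned}&\bigl (\alpha _{\hbox {Judge}, \hbox {COUNT}(*)}(\phi _{f({\varvec{E}})}(\pi _{{\varvec{A}}}(\hbox {Client} \bowtie \hbox {Debt} \bowtie \sigma _{C}(\hbox {Loan}))))\bigr ) \ltimes {\bar{O}} \\&\quad \rightarrow \alpha _{\hbox {Judge}, \hbox {COUNT}(*)}\bigl (\phi _{f({\varvec{E}})}(\pi _{{\varvec{A}}}(\hbox {Client} \bowtie \hbox {Debt} \bowtie \sigma _{C}(\hbox {Loan}))) \ltimes _{\hbox {Judge}} {\bar{O}}\bigr ) \end{aligned} \end{aligned}$$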

6.2 System architecture

Fig. 6

System architecture

This section describes the system architecture which enables complex data analysis with augmented lineage. The overall organization is shown in Fig. 6. The system consists of five components: AL Manager, Parser, Segmenter, TQ Generator, and Executor.

AL Manager: It receives tasks from the user, manages the translation to operator trees, their segmentation and execution, and returns the task results to the user. When the user requests the augmented lineage for some result tuples, it manages the tracing query generation and execution, and returns the augmented lineage to the user.

Parser: It takes a task from AL Manager and translates it into an optimized operator tree.

Segmenter: It takes an operator tree from AL Manager and decomposes it into segments.

TQ Generator: It takes a segment of a task and generates its tracing query in the form of an operator tree.

Executor: Given operator tree(s), it executes the tree(s) over Database and Results Store, and returns the results. Results Store preserves the task output and intermediate results needed for augmented lineage derivation.

We explain the procedure of the system to derive augmented lineage assuming Function Materialization. Rerun and Full Materialization can be accommodated with minor adjustments. In the following procedure, Steps \(\textcircled {1}\)–\(\textcircled {5}\) correspond to the task execution, and Steps \(\textcircled {6}\)–\(\textcircled {10}\) correspond to the augmented lineage derivation.

\(\textcircled {1}\):

The user submits the task to AL Manager.

\(\textcircled {2}\):

AL Manager passes the given task to Parser and obtains its operator tree.

\(\textcircled {3}\):

AL Manager sends the tree to Segmenter and receives its segments.

\(\textcircled {4}\):

The segments are passed to Executor. Executor executes the trees while storing necessary intermediate results in Results Store. AL Manager obtains the task result.

\(\textcircled {5}\):

AL Manager returns the task result to the user.

\(\textcircled {6}\):

The user specifies the target tuples for augmented lineage derivation (AL target tuples) to AL Manager.

\(\textcircled {7}\):

AL Manager identifies the segments that need to be rerun for intermediate result deployment and passes them to Executor. Executor executes them and stores their results in Results Store.

\(\textcircled {8}\):

AL Manager sends all the segments of the task to TQ Generator, and TQ Generator returns an operator tree of the tracing query for each segment.

\(\textcircled {9}\):

AL Manager performs Algorithm 1 with the help of Executor and obtains augmented lineage.

\(\textcircled {10}\):

AL Manager returns the augmented lineage to the user.

7 Experiment

This section experimentally evaluates the performance of three methods to derive augmented lineage. We use two datasets: the LFW dataset [28] and the Complex Data Analysis workload (CDA workload). In the CDA workload, we used extended TPC-H benchmark queries [50] including complex content processing involving ML-based image and text analysis. For each dataset, we first evaluate the processing time of executing analysis tasks. In the case of Full Materialization and Function Materialization, the system needs to store intermediate results while executing analysis tasks. Storing these intermediate results incurs an extra cost over the ordinary analysis execution. We evaluate this additional cost along with the cost of executing analysis tasks. Then, we evaluate the three methods regarding processing time and storage cost for deriving augmented lineage. As described in Sect. 6.1, we utilize the semi-join pushdown optimization proposed in [57] to rerun tasks efficiently. Therefore, the Rerun approach here is compatible with their framework.

For this experiment, we developed a prototype which implements a simplified version of the architecture discussed in Sect. 6.2. The prototype implements AL Manager as Python scripts which also integrate the functionality of Segmenter and TQ Generator. PostgreSQL is deployed to accommodate Database and Results Store and to provide the functionality of Parser and Executor as well. Thus, we employ SQL to specify the segments and tracing queries to be evaluated in Executor. We implemented this prototype using PostgreSQL 9.6 and Python 3.7.8, and we ran it on a machine with an Intel(R) Core(TM) i7-8700 CPU @ 3.20 GHz, a GeForce GTX 1060 3 GB, and two 16 GB DDR4 DIMMs.

We rebooted the machine before each measurement, and all measurements were independently performed 5–20 times. We report the number of measurements in each experiment subsection and present the average value as the result.

The bar graphs below indicate statistically significant differences with ** (significance level 1%) and * (significance level 5%) under Welch’s t-test [51]. We omitted error bars representing confidence intervals, since the intervals are too narrow to see.

7.1 LFW dataset evaluation

In this experiment, we evaluate the performance of the three methods using the LFW dataset and investigate the impact of the processing cost of the function operator on the whole task. We prepared two UDFs which have different processing costs. All the results here are averages over ten measurements.

7.1.1 Task

We use the LFW dataset [28] for this experiment. The database consists of two tables, \(\hbox {Image}\) and \(\hbox {Event}\), whose schemas are shown in the Appendix. The attribute i_img stores one photograph image from the LFW dataset; it actually contains the path (URI) of the external photograph image file as a string. The path includes the name of the celebrity in the photograph. The attribute e_visitors stores the number of participants in the event. The other attributes store identifiers for images, events, and places as integer values. Values for attributes other than i_img were synthesized so that they maintain consistency.

The SQL for the analysis task in this experiment is also shown in the Appendix. It uses a UDF for person recognition on images and determines the number of times that celebrities appeared on stage at large event places. We prepared two implementations of the recognition UDF with different processing costs. Each implementation is explained in detail in Sect. 7.1.2.

We prepared three different source table sizes (Small, Medium, Large). Table 1 shows the number of tuples in each case.

7.1.2 UDFs

The recognition UDF takes the path to the photograph image file as an argument and returns the person’s name in the photograph. We prepared two different implementations as follows:

  • Face Recognition A person in the given image is identified by an ML-based face recognition model. It outputs the person’s name as Value and the position of the bounding box around the face as Reason. Its processing cost is more expensive than the following alternative.

  • String Processing A person is recognized by extracting the substring corresponding to the celebrity’s name from the URI string. It outputs the person’s name as Value and the position of the name substring in the URI as Reason. Its processing cost is cheaper than the face recognition implementation.

Table 1 Number of tuples in each source table

Note that the output values of the two recognition UDF implementations are the same.
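
As an illustration of the cheap variant, the following Python sketch recovers the person’s name from the LFW image path; the only assumption is the LFW naming convention (directory and file names contain the person’s name with underscores), and the exact reason encoding as substring positions is illustrative.

```python
import os

# Sketch of the cheap recognition UDF: recover the person's name from the LFW
# image path (e.g. ".../lfw/George_W_Bush/George_W_Bush_0001.jpg").
# The path layout is the LFW naming convention; the reason encoding is illustrative.
def recognize_normal(image_path):
    """Normal mode: return only the person's name (Value)."""
    return os.path.basename(os.path.dirname(image_path)).replace("_", " ")

def recognize_with_reason(image_path):
    """Reasoning mode: also return the position of the name substring (Reason)."""
    name_part = os.path.basename(os.path.dirname(image_path))
    start = image_path.rindex(name_part, 0, image_path.rfind("/"))
    return name_part.replace("_", " "), (start, start + len(name_part))
```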

7.1.3 Task processing

Fig. 7

Task processing time

This subsection compares the processing time of the task (analysis query) corresponding to the three methods (Rerun, FM, and Full). In the case of Rerun, we execute the analysis query as it is, since Rerun needs no stored intermediate results. In the cases of FM and Full, we need to store some intermediate results while executing the analysis query. Full needs more storage than FM.

We used the medium-sized tables for this experiment. Figure 7 shows the result. The left-hand side shows the processing time of the analysis query involving the expensive UDF, and the right-hand side shows the processing time for the cheap UDF. In the figure, Baseline corresponds to executing the analysis query without storing any intermediate results for Rerun, and SptFM and SptFull correspond to the processing time including the storage of intermediate results for FM and Full, respectively.

As shown in the figure, we confirmed that SptFM and SptFull incur some processing overhead compared with Baseline, with SptFull incurring more. The reason is that SptFull stores more intermediate results. In addition, the relative increase in processing overhead is larger in the case of the cheap UDF, since the storage cost of the intermediate results is almost the same no matter whether the UDF is expensive or cheap.

7.1.4 Derivation of augmented lineage

In this subsection, we measure the processing time to derive augmented lineage for a single tuple in the analysis query results.

Fig. 8 Processing time for deriving the augmented lineage in the analysis with the expensive UDF

Table 2 Number of tuples in the stored intermediate results
Fig. 9 Processing time for deriving the augmented lineage in the analysis with the cheap UDF

Face recognition UDF (expensive UDF) case: Figure 8 summarizes the processing time. Table 2 shows the number of tuples in the stored intermediate results. Both Function Materialization (FM) and Full Materialization (Full) outperform Rerun with respect to the processing time. FM is up to 371.9 times faster than Rerun for the large-sized tables, and Full is up to 497.1 times faster than Rerun for the medium-sized tables. FM drastically reduces the number of stored intermediate tuples (by 90.9%) compared to Full (Table 2).

String processing UDF (cheap UDF) case: Figure 9 summarizes the processing time. In this case, FM and Full also outperform Rerun with respect to the processing time: FM is 1.6 times faster and Full is 2.6 times faster than Rerun for the medium-sized tables. Since the number of tuples in the stored intermediate results is the same as in the expensive UDF case, FM again incurs a smaller storage cost than Full. However, the benefit of storing intermediate results is smaller here because the UDF is not expensive compared with the other relational operators, so the gain of FM and Full over Rerun shrinks.

The above experiments demonstrate that FM can control the tradeoff between the processing time for augmented lineage derivation and the storage cost of the intermediate results.

7.2 CDA workload evaluation

Table 3 CDA workload queries

This section measures the performance of the three methods using the CDA workload. Since the original TPC-H benchmark [18, 50] is composed of 22 SQL queries focusing on evaluating the performance of a relational DBMS, it does not involve any complex data analysis. We therefore extended the source tables by adding and replacing some attributes, developed new scenarios for complex data analysis, and extended the original queries to include UDFs whose processing costs are comparable to that of the expensive UDF in the LFW dataset experiment.

We explain database extension and analysis queries in Sect. 7.2.1, and UDFs in Sect. 7.2.2. Then, we show the processing time for executing each analysis query in Sect. 7.2.3. Finally, we report the processing time for deriving augmented lineage in Sect. 7.2.4.

7.2.1 Tasks

We extended the original TPC-H database. The main extension is as follows:

  • Part table We added a new attribute to store the photograph image of the part (p_image). We used images in the COCO dataset [34] for this extension, and we stored path strings of the image files as its values.

  • Lineitem table We replaced the comment attribute (l_comment) values with Amazon review text data [40].

These extensions allow us to develop new complex data analysis scenarios on top of the original queries. For the Part table, we added a filter that selects parts whose photograph images contain a person, based on the human detection UDF. For the Lineitem table, we added another filter that selects line-item tuples whose review comment is positive, based on the text sentiment classifier UDF.
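For illustration, the fragments below show the shape such added predicates might take, using the UDF names introduced with Table 3 (person_detection on p_image and text_sentiment on l_comment). The composite (value, reason) access syntax and the surrounding query text are assumptions, not the workload's actual SQL.

```python
# Hypothetical filters added to the extended tables; only the UDF and attribute
# names come from the workload description, the rest is illustrative.
part_with_person = """
    SELECT p_partkey
    FROM   part
    WHERE  (person_detection(p_image)).value = TRUE        -- photo contains a person
"""
positive_lineitems = """
    SELECT l_orderkey, l_linenumber
    FROM   lineitem
    WHERE  (text_sentiment(l_comment)).value = 'Positive'  -- positive review comment
"""
```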

We selected 10 queries (Q2, Q3, Q9, Q15, Q16, Q17, Q18, Q19, Q20, and Q21) from the 22 TPC-H benchmark queries for the CDA workload. These queries were chosen by the following procedure: we first picked the 19 TPC-H queries that refer to the Part and/or Lineitem tables, and then filtered out 3 queries containing the CASE clause because our model does not cover it. We applied the following two changes to the remaining 16 queries:

  1. Rewrite queries containing nested subqueries (e.g., IN and EXISTS clauses) using join and difference operators (a sketch of such a rewrite is given after this list).

  2. Add the UDFs that perform the content processing described above.
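For change (1), the following hypothetical pair of fragments sketches how a nested subquery can be replaced by join and difference operators; the query itself is illustrative and is not one of the CDA workload queries.

```python
# Illustrative rewrite: suppliers with no matching line items, expressed first
# with NOT EXISTS and then with a join plus a set difference (EXCEPT).
nested_form = """
    SELECT s_suppkey
    FROM   supplier
    WHERE  NOT EXISTS (SELECT 1 FROM lineitem WHERE l_suppkey = s_suppkey)
"""
rewritten_form = """
    SELECT s_suppkey FROM supplier
    EXCEPT
    SELECT s_suppkey FROM supplier JOIN lineitem ON l_suppkey = s_suppkey
"""
```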

Among the 16 queries, Full Materialization and Function Materialization happen to perform exactly the same processing to derive augmented lineage for 10 of them; we kept 4 representative queries out of these 10. Together with the 6 queries in which Full Materialization and Function Materialization perform different processing, we eventually obtained the 10 queries of the CDA workload.

We summarize the derivation of the 10 queries from the original queries and the types of the involved segments in Table 3. Note that person_detection is the UDF applied to the attribute p_image, and text_sentiment is the UDF applied to the attribute l_comment. Finally, we input these SQL queries to PostgreSQL and obtained the processing trees output by the EXPLAIN command. Then, by applying our segmentation algorithm, we determined the segments to be used for the tracing queries. Hereafter, Qxx denotes query xx of the CDA workload.
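A minimal way to obtain such a processing tree programmatically, for instance in JSON form, is sketched below; psycopg2, the connection string, and the placeholder query are assumptions and not part of the paper.

```python
import json
import psycopg2  # assumed Python access to PostgreSQL; connection details are placeholders

analysis_query = "SELECT 1"   # stand-in for one of the CDA queries Q2..Q21
conn = psycopg2.connect("dbname=cda_workload")
with conn.cursor() as cur:
    cur.execute("EXPLAIN (FORMAT JSON) " + analysis_query)
    raw = cur.fetchone()[0]
    # psycopg2 may return the plan already parsed or as text, depending on the column type
    plan = (raw if isinstance(raw, list) else json.loads(raw))[0]["Plan"]
print(json.dumps(plan, indent=2))     # nested "Plans" entries form the processing tree
```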

Fig. 10 Task processing time

In this experiment, we generated databases of three sizes using DBGEN [18, 50], with scale factors (SF) of 1, 10, and 100.

7.2.2 UDFs

We implemented the above two UDFs. Both UDFs utilize a neural network for their prediction; a hedged sketch of possible implementations is given after the list.

  • Human detection This UDF is applied to the attribute p_image in the extended Part table. It takes image data as input and decides whether any person appears in the image. It returns a Boolean value as Value and, when it returns True, the position of the bounding box around the detected person as Reason. We implemented this UDF using YOLO [44], which can detect objects quickly.

  • Text sentiment classifier This UDF is applied to the attribute l_comment in the extended Lineitem table. It takes review comment text as input and returns Positive or Negative, representing the sentiment of the review comment, as Value, and the likelihood of the sentiment as Reason. We implemented this UDF using the BERT-Base text sentiment classifier provided by IBM [29].
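The sketch below gives one possible shape for these two UDFs, again returning (Value, Reason) pairs. The paper's implementations use YOLO [44] and IBM's BERT-Base classifier [29]; the ultralytics and transformers calls below are stand-ins chosen for illustration, not the actual code.

```python
from ultralytics import YOLO          # stand-in YOLO implementation
from transformers import pipeline     # stand-in BERT-style sentiment classifier

_detector = YOLO("yolov8n.pt")
_sentiment = pipeline("sentiment-analysis")

def human_detection(image_path: str):
    """Returns (Value, Reason): True plus the person's bounding box, or (False, None)."""
    result = _detector(image_path)[0]
    for box in result.boxes:
        if result.names[int(box.cls)] == "person":
            return True, {"bounding_box": box.xyxy[0].tolist()}
    return False, None

def text_sentiment(comment: str):
    """Returns (Value, Reason): 'Positive'/'Negative' plus the classifier's likelihood."""
    pred = _sentiment(comment)[0]
    value = "Positive" if pred["label"] == "POSITIVE" else "Negative"
    return value, {"likelihood": pred["score"]}
```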

7.2.3 Task processing

Figure 10 shows the processing time for the analysis queries; the results are based on ten measurements. Note that SptFM and SptFull perform the same processing for Q2, Q3, Q5, and Q8. The figure shows that the overhead of storing intermediate results while executing the analysis is negligible: the average overhead of SptFM and SptFull over all queries is 0.9% and 0.5%, respectively. This experiment suggests that the overhead of storing intermediate results during analysis query execution is rather small when the UDF processing cost is not negligible.

7.2.4 Derivation of augmented lineage

Fig. 11 Processing time for deriving augmented lineage in the case of \(\hbox {AL}\_\hbox {size} = 1\)

Fig. 12 Processing time for deriving augmented lineage in the case of \(\hbox {AL}\_\hbox {size} = 100\)

In this experiment, we evaluate the performance of deriving augmented lineage for a varying number of target output tuples (AL_size). First, for all queries, we measure the processing time for deriving augmented lineage for a single randomly chosen tuple in the analysis query results. Q1, Q3, Q5, Q9, and Q10 have more than 100 output tuples, so for these queries we also measure the processing time for deriving augmented lineage for 100 randomly chosen tuples. The results for \(\hbox {SF} = 1, 10, 100\) are based on 20, 10, and 5 measurements, respectively.

The processing time to derive the augmented lineage for a single output tuple (\(\hbox {AL}\_\hbox {size} = 1\)) for tables of the three scale factors is shown in Fig. 11a–c, and that for 100 output tuples is shown in Fig. 12a–c. We set a timeout of 24 h; that is, we stopped deriving augmented lineage when it would take more than 24 h. The processing time of Rerun for Q3 and Q4 is not shown in Figs. 11c and 12c due to this timeout. Table 4 shows the number of tuples in the stored intermediate results. Note that FM and Full perform the same processing for Q2, Q3, Q5, and Q8.

In the case of AL_size = 1: We can see the general trend that (1) Rerun is the most time-consuming, (2) Function Materialization (FM) is faster than Rerun, and (3) Full Materialization (Full) is the fastest for deriving augmented lineage, namely Rerun > FM > Full. However, FM often achieves processing times comparable to Full. Regarding the storage cost, FM needs fewer stored tuples than Full, as shown in Table 4.

However, we can also see some different cases. The first one is Full > Rerun > FM, which applies to Q7 in Fig. 11b and c. In this query, Rerun and FM could rerun the analysis query quickly thanks to the semi-join and the predefined indexes on the source tables. Moreover, since the intermediate results obtained by the rerun were far smaller than the intermediate results stored by Full and were cached in the DBMS, the tracing queries could be executed quickly. On the other hand, Full needed to scan the huge stored intermediate results without using an index. This is likely why Rerun and FM outperformed Full in this query. For Q5 in Fig. 11c, FM and Full perform the same processing, and Rerun outperformed both. In addition to the reason above, we found that the query plan of a tracing query was poor in FM and Full.

In some queries, Rerun > FM/Full; Q1 in Fig. 11a and b, Q4 in Fig. 11a–c, and Q9 in Fig. 11a are such cases. For Q7 in Fig. 11a, Rerun > Full > FM, and for Q1 in Fig. 11c, Full > Rerun/FM. In these cases, FM was accelerated for reasons similar to those mentioned above.

In Q10 in Fig. 11a–c, Rerun/FM > Full because the cost of rerunning the UDFs was negligible.

We can see that the processing time generally increases as the scale factor increases. In Q1 and Q7, however, Rerun and FM took almost the same time even when the scale factor increased. In these queries, the processing time of the analysis queries was not affected much thanks to the efficient semi-join. Besides, since the size of the intermediate result restored by rerunning was almost the same for the three scale factors, the processing time of the tracing query in Rerun and FM was hardly affected.

Table 4 Number of tuples in the stored intermediate results

In Q4, the processing time of FM and Full decreased even though the scale factor increased from 1 to 10. The reason is that the query plans of the tracing query differed between \(\hbox {SF}=1\) and \(\hbox {SF}=10\), and the plan for \(\hbox {SF}=1\) was inefficient. In Q5, the processing time of FM and Full suddenly increased beyond Rerun for a similar reason: the query plan of the tracing query for \(\hbox {SF}=100\) was poor in FM and Full.

In the case of AL_size = 100: When AL_size increases, the overall processing time generally increases. The increase is 1x–13x for most queries, with a maximum of 26x for Rerun on Q5 at \(\hbox {SF}=100\). Rerunning the analysis query takes more time since the semi-join produces more source tuples, and the tracing query has to process larger input tables. The relative comparison of the three approaches is almost the same as for \(\hbox {AL}\_\hbox {size} = 1\).

In Q10, Rerun and FM outperformed Full at \(\hbox {SF} = 1\). Q10 needs four tracing queries (TQ1, TQ2, TQ3, and TQ4) to obtain the augmented lineage. At \(\hbox {SF} = 10\) and \(\hbox {SF} = 100\), the execution of TQ4 was dominant in Full, while at \(\hbox {SF} = 1\), the execution of TQ3 accounted for almost half of the whole processing time. Rerun and FM could run TQ3 very quickly using the cache, which is likely the reason for the behavior of Q10 at \(\hbox {SF}=1\).

These experiments demonstrate that FM is an appropriate approach to realize a tradeoff between processing time and storage cost. For example, in Q9, FM can derive augmented lineage 12.8 times faster than Rerun and reduce the number of stored intermediate tuples by 99.6% in the case of \(\hbox {SF}=10\) and AL_size = 1.

8 Conclusions and future work

In this paper, we have proposed augmented lineage, an extension of lineage that combines reasons for complex data analysis with the source data. Augmented lineage ensures the traceability of complex data analysis including UDFs for AI/ML processing. We formulated an algorithm to derive the augmented lineage using the lazy approach, and proposed a new execution scheme named Function Materialization (FM), which allows a tradeoff between runtime cost and storage cost in deriving the augmented lineage. Experiments showed that FM is effective, especially when the execution time of UDFs involving sophisticated AI/ML processing constitutes a large portion of the whole analysis task.

Interesting future research topics include the generalization of our reasoning framework. This paper assumes that UDFs performing complex data analysis, such as AI/ML processing, provide reasons. Since a reason is a kind of annotation, it could also be used for other purposes; for example, it may record an operator’s processing time or resource utilization to monitor system behavior. In addition, extending the proposed framework to more general data models and more general analysis contexts, such as big data processing systems and stream processing environments, is an interesting issue. Finally, integrating model-agnostic explanation systems (e.g., LIME [45] and SHAP [38]) into our framework is also a challenging topic.