Evaluation of Heuristics for Product Data Models
Abstract
Product Data Model (PDM) is an example of a data-centric approach to modelling information-intensive business processes, which offers flexibility and facilitates process optimization. It is declarative, and as such, there may be multiple workflow designs that can produce the end product. To this end, several heuristics have been proposed. The contributions of this work are twofold: (i) we propose new heuristics that capitalize on established techniques for optimizing data-intensive workflows; and (ii) we extensively evaluate the existing solutions. Our results shed light on the merits of each heuristic and show that our proposal can yield significant benefits in certain cases. We provide our implementation as an open-source product.
Keywords
Data-centric processes · Process optimization · PDM
1 Introduction
Data-centric approaches have been emerging in the last two decades as an alternative to the more mainstream activity-oriented modelling approaches for business processes [7, 10, 14]. We quote from [14] that “the central idea behind data-centric approaches is that data objects/elements/artifacts can be used to enhance a process-oriented design or even to serve as the fundament for such a design. This has certain advantages, varying from increasing flexibility in process execution and improving reusability to actually being able to capture processes where data play a relevant role.”
In this work, we focus on a particular data-centric modelling approach, namely a Product Data Model (PDM)-oriented one, for which the main driver, as also reported in [14], is process optimization apart from flexibility; this approach is tailored to information-intensive processes and it is declarative. As such, it focuses on describing what is needed in order to deliver an information product rather than the exact way to achieve this goal. To fulfill the latter aspect, the declarative model is accompanied by a method to generate workflow designs, which is referred to as Product Based Workflow Support (PBWS) [18]. PBWS presents a set of heuristics for PDMs with a view to enhancing the performance in a case-by-case manner. PBWS improves upon a previous method, called Product Based Workflow Design (PBWD) [13], in which the burden of defining the sequence of actions rests with the workflow designer; PBWD merely assists this task by presenting the alternatives. The contributions of this work are twofold:
1. We propose a new heuristic for choosing the next operation to be performed in a PDM for a specific case so as to optimize time duration and/or cost. Our proposal comes in three flavors, is based on established query processing and data-centric workflow technology, and is of low computational complexity.
2. We perform an extensive experimental evaluation of the available heuristics and show that, in the average case, our proposal yields benefits in terms of time, cost, or a combination of both compared to previous heuristics. However, there is no globally dominant solution, in the sense that in specific cases existing heuristics may behave better. We provide the source code of all the heuristics and the experiments, so that interested third parties can repeat our work and extend the set of heuristics and/or test cases^{1}.
2 Background: The Product Data Model
A PDM is used to represent the structure of a workflow product in a rooted, graph-like manner, similar to a Bill of Material [11, 18]. PDMs describe the elements required to yield the desired product in the root, where example (informational) products include the decision on whether to grant an approval to a specific admission request, the approval of a mortgage application, and so on. More specifically, the vertices (or, equivalently, nodes) in this structure correspond to data elements, that is, the information that is processed in the workflow. Each node has a value assigned to it, which typically differs between process instances (cases). In Fig. 1, we present the PDM for a classical mortgage example, which will also be used in the comparison section. The final product of the process is to determine the value of the root (or top or end product) node. Values are determined from the bottom towards the root as specified by the arcs (graph edges), which are called operations. These arcs represent actions that are applied to the valued data elements to produce values for the nodes downstream. Each operation can have zero or more input data elements, while producing the value for exactly one output data element. An operation is represented by a tuple, which consists of the output element and a set of input elements, e.g. \((A, \{B, C, D\})\) for Op01 in the figure, which means that Op01 can be applied only if B, C and D have been produced and may lead to the generation of the value of A. A data element may be determined through multiple operations, e.g., in the figure, A can be determined in three manners and, for a specific case, the process terminates as soon as one of them manages to complete. An element that has zero input elements is called a leaf element and, commonly, it is provided as input to the process; in the figure, elements such as B, F and E are leaf elements.

Each operation is further characterized by the following attributes:
1. Cost, which represents the cost associated with executing the operation.
2. Time, which represents the time required for the complete execution of the operation.
3. Probability, which represents the probability that an operation is executed unsuccessfully, therefore not producing its output element.
4. Conditions, which represent requirements regarding the values of the input elements. These requirements must be met for the operation to be executed, meaning that the mere existence of all input elements of an operation is not sufficient.
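To make the structure concrete, an operation together with these attributes can be sketched as a small record type. This is our own illustration under stated assumptions; the class and field names (`Operation`, `inputs`, `fail_prob`, etc.) are not taken from the paper's implementation:

```python
from dataclasses import dataclass

# Hypothetical sketch of a PDM operation record; field names are our own.
@dataclass(frozen=True)
class Operation:
    op_id: str
    output: str              # the single output data element
    inputs: frozenset        # zero or more input data elements
    cost: float = 0.0        # cost of executing the operation
    time: float = 0.0        # duration of a complete execution
    fail_prob: float = 0.0   # probability that execution fails

# Op01 from Fig. 1: (A, {B, C, D})
op01 = Operation("Op01", "A", frozenset({"B", "C", "D"}))

def executable(op, produced):
    """An operation is executable once all its input elements have values."""
    return op.inputs <= produced

print(executable(op01, {"B", "C"}))        # D is still missing
print(executable(op01, {"B", "C", "D"}))
```

An operation with an empty `inputs` set would produce a leaf element, matching the description above.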
3 Deriving Workflow Designs
A PDM does not specify per se how the end information product is created; rather, it allows multiple workflow designs to produce the desired information product. As mentioned above, there are cases where multiple workflow designs lead to the production of an output element. Usually, in these cases, the alternative paths have different execution costs and time durations. This gives rise to the following optimization problem: which paths of operations should be chosen for a specific case in order to optimize given quantitative objectives of cost and time?
There are two high-level strategies for the calculation of an optimal execution path of a workflow, namely a global and a local one. A global strategy considers the effect of each decision on future steps. It takes into account the complete set of alternative paths that produce the end product to optimize the execution performance of each case. Instead, a local strategy adopts a step-by-step approach, meaning that, at each step, it examines the set of operations available for execution and chooses the best one according to a particular metric, e.g. cost of execution. As explained in [18], a global strategy does not scale. For this reason, in this work, we exclusively deal with low-polynomial local strategy heuristics. The local heuristics proposed in [18] are the following:
1. Random: the operation is randomly selected from the set of executable operations.
2. Lowest Cost: the operation with the lowest cost is selected.
3. Shortest Time: the operation with the shortest time is selected.
4. Lowest Failure Probability: the operation with the lowest probability of not being executed successfully is selected.
5. Shortest Distance to Root Element: the operation with the shortest distance to the root element (measured in the total number of operations) is selected.
6. Shortest Remaining Process Time: the operation with the shortest remaining processing time (measured as the sum of the processing times of the operations on the path to the root element) is selected.
7. Shortest Remaining Cost: the operation with the shortest remaining cost (measured as the sum of the costs of the operations on the path to the root element) is selected.
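As a minimal sketch, the first four heuristics can be realized as interchangeable key functions over the set of executable operations; the attribute values and the names below are illustrative assumptions, not taken from the paper:

```python
import random

# Illustrative operation attributes (not the paper's actual values).
ops = {
    "Op02": {"cost": 5.0, "time": 2.0, "fail_prob": 0.05},
    "Op03": {"cost": 9.0, "time": 1.0, "fail_prob": 0.05},
}

# Each local heuristic maps an operation id to a score; lower is better.
HEURISTICS = {
    "random":        lambda op: random.random(),
    "lowest_cost":   lambda op: ops[op]["cost"],
    "shortest_time": lambda op: ops[op]["time"],
    "lowest_fail":   lambda op: ops[op]["fail_prob"],
}

def next_op(executable, heuristic):
    """Pick the best executable operation under the chosen local strategy."""
    return min(executable, key=HEURISTICS[heuristic])

print(next_op(["Op02", "Op03"], "lowest_cost"))    # Op02
print(next_op(["Op02", "Op03"], "shortest_time"))  # Op03
```

Heuristics 5-7 need path information towards the root and therefore cannot be expressed as a lookup of a single operation's attributes; this distinction matters for the complexity analysis later in the section.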
We discuss implementation details at the end of this section.
3.1 Rank-Based Heuristics
Our approach relies on treating production rules in a manner that resembles knockout activities and their optimal ordering, which also bears similarities to the way data analytics operators and database joins are ordered [1, 8, 9]. A knockout activity is an activity whose execution leads directly to the completion of the process. For example, the execution of Op03 in Fig. 1, which produces the root element A, is a knockout activity/production rule. The optimal ordering then needs to take into account the probability that an operation produces the end element, either directly or indirectly through a sequence of operations starting with that operation, along with the corresponding cost or execution duration.
Example: in the example in Fig. 1, assume that in the current state all leaf elements have already been produced except element E, for which Op07 was not executed successfully. Thus, in the next step, there are two production rules available for execution, namely Op02 and Op03. Based on their attributes, they have the following ranking values: \(rank(Op02) = 0.9025/10 = 0.09025\). While Op02 may not lead to process termination directly, we consider Op02 as part of a path that leads indirectly to the root, that is, the path \(Op02 \rightarrow Op01 \rightarrow A~(end~state)\). Therefore, we use as the probability of this knockout path the probability of success of all the operations in the whole path, which is \((1-Probability(Op02)) \times (1-Probability(Op01)) = 0.95 \times 0.95 = 0.9025\), and as cost, the aggregate cost of the whole path, which is \(Cost(Op01) + Cost(Op02) = 5+5 = 10\). On the other hand, \(rank(Op03) = 0.95/9 = 0.105556\). The probability 0.95 in the numerator is the probability of the successful execution of Op03, because it is the successful execution of Op03 that produces the root element A and thereby completes the workflow execution. Based on these values, Op03 is selected for execution. \(\square \)
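The worked example can be reproduced with a short sketch. The shape of the rank function, the success probability of a path to the root divided by its aggregate cost, follows the text above; the helper name and path representation are our own:

```python
def rank(path):
    """path: list of (success_probability, cost) tuples along the way to
    the root. The rank is the product of the success probabilities divided
    by the aggregate cost of the path."""
    prob = 1.0
    cost = 0.0
    for p, c in path:
        prob *= p
        cost += c
    return prob / cost

rank_op02 = rank([(0.95, 5.0), (0.95, 5.0)])  # Op02 -> Op01 -> A
rank_op03 = rank([(0.95, 9.0)])               # Op03 -> A directly

print(round(rank_op02, 5))   # 0.09025
print(round(rank_op03, 6))   # 0.105556
# Op03 wins: a higher rank indicates a cheaper expected route to the root.
```

Intuitively, a high rank rewards paths that are both likely to succeed and cheap, which mirrors the knockout-ordering rules cited in [1, 8, 9].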
3.2 Implementation Issues
All the heuristics conform to a generic template, shown in Algorithm 1, which produces, for each case, the sequence of steps (operations) chosen; this sequence is captured in the variable WF. The operation metadata are mapped to a HashMap variable, where the key is the operation id and the value is nested and consists of all the attributes in the table of the example in Fig. 1. Based on such a structure, the executableList can be found through a simple traversal of the hashmap, taking into account the contents of the availableList. This occurs once at the beginning and, from then on, the executableList keeps being updated. Then the executableList is scanned to choose the nextOp operation according to the chosen local strategy (line 4). The time complexity of this algorithm, for the first four heuristics, is \(O(n(n+V))\), where n is the number of operations and V the number of nodes in the graph. This is because, in each step, at most n operations are examined, and there are at most n steps. Also, for an operation to be inserted in the executableList, up to V element checks need to be performed. A faster implementation would employ a priority queue to support the choice of nextOp in each iteration, but discussing such details is beyond the scope of this work.
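The generic template can be sketched as follows. This is a simplified rendering under stated assumptions: every chosen operation is assumed to succeed, and the names (`run_case`, `choose`, the tuple layout) are ours, not Algorithm 1's literal pseudocode:

```python
def run_case(operations, leaves, root, choose):
    """operations: {op_id: (output_element, set_of_input_elements)}.
    Repeatedly pick the next executable operation under the pluggable
    local strategy `choose` until the root is produced."""
    produced = set(leaves)     # availableList: elements that have values
    wf = []                    # WF: the chosen sequence of operations
    remaining = dict(operations)
    while root not in produced:
        executable = [op_id for op_id, (out, ins) in remaining.items()
                      if ins <= produced]
        if not executable:
            return None        # dead end: the case cannot complete
        next_op = choose(executable)   # line 4: the local strategy
        wf.append(next_op)
        out, _ = remaining.pop(next_op)
        produced.add(out)      # simplification: assume success
    return wf

# Tiny illustrative PDM fragment: Op04 produces C from F, Op01 produces A.
ops = {"Op01": ("A", {"B", "C"}), "Op04": ("C", {"F"})}
print(run_case(ops, {"B", "F"}, "A", choose=min))  # ['Op04', 'Op01']
```

Each loop iteration scans up to n operations and checks up to V elements per operation, which is where the \(O(n(n+V))\) bound comes from.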
However, a more important point is that the last three existing heuristics, along with our proposal, need to process the path from a given operation to the root. For this, we employ another auxiliary structure, in which the PDM model is seen as a typical graph with as many vertices as the data elements and, for each data element in the input of another data element, a directed edge pointing to that element. Since finding the shortest path from the root element takes at most \(O(n \log V)\) using an algorithm such as Dijkstra's, the complexity of the relevant techniques is the previous complexity multiplied by this factor, i.e., \(O(n^2(n+V)\log V)\). In addition to the PDM's data elements, this graph also contains an artificial starting vertex. This vertex represents the initial state of execution, where no elements have been produced. It covers operations such as Op08, Op09 and Op10 in the example, which otherwise could not be represented as edges connecting vertices.
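The auxiliary graph and the shortest-path computation can be sketched as below. We run Dijkstra's algorithm from the root over reversed edges, use operation costs as edge weights, and, mirroring the simplification acknowledged in the conclusions, treat each input edge independently even when an operation needs all of its inputs; names and weights are our own illustration:

```python
import heapq

START = "_start_"  # artificial starting vertex for input-less operations

def remaining_cost_to_root(operations, root):
    """operations: list of (output_element, set_of_inputs, cost) tuples.
    Returns the cheapest remaining cost from each element to the root,
    computed by Dijkstra from the root over reversed edges."""
    rev = {}  # element -> list of (predecessor element, edge cost)
    for out, inputs, cost in operations:
        for src in (inputs or {START}):
            rev.setdefault(out, []).append((src, cost))
    dist = {root: 0.0}
    heap = [(0.0, root)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue  # stale queue entry
        for u, w in rev.get(v, []):
            if d + w < dist.get(u, float("inf")):
                dist[u] = d + w
                heapq.heappush(heap, (d + w, u))
    return dist

# Toy fragment: B is produced from nothing (via START) for 2, A from B for 5.
ops = [("A", {"B"}, 5.0), ("B", set(), 2.0)]
print(remaining_cost_to_root(ops, "A"))
```

The Shortest Remaining Cost heuristic would then prefer the executable operation whose output element has the smallest distance in `dist`; swapping costs for processing times yields Shortest Remaining Process Time.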
4 Evaluation
The main findings of our evaluation are summarized as follows:
1. The rank-based heuristics proposed in this work are the best performing ones, both when cost and when time is the optimization objective. Their relative difference is small and does not exceed 1.1%, which means that the rank function effectively covers both objectives in all three flavors.
2. Choosing the next operation randomly incurs approximately 25% higher cost and 25% higher time compared to our solutions. The best performing heuristics among the existing ones are Shortest Remaining Cost for the cost objective and Shortest Remaining Process Time for the time objective. These heuristics are on average only 4.2% and 4.6% worse than the rank-based ones, respectively.
3. There is no globally dominant solution. As the two right barcharts show, even the random heuristic yields the best performance in some cases. On average, for each case there are 1.86 heuristics that yield the best performance regardless of the exact objective (it can be observed in the figures that the barcharts do not sum up to 100K). The most common winners are the rank-based heuristics, but each individual flavor is the best in no more than 28.5% of the cases.
Next, we turn our attention to Figs. 5 and 6, which refer to the largest PDM in Fig. 2. In summary, for this PDM, the rank-based heuristics are still the best ones in the average case, with even smaller relative differences between the three flavors (less than 0.4%). The random heuristic is only 14% worse than the best heuristic in the average case. However, the best performing heuristic from the ones in [18] has now become the one that chooses the operation with the lowest failure probability; this heuristic is 6.3% worse than the rank-based solutions. Finally, in each case, 1.52 heuristics achieve the top performance on average.
Similar and even better results are observed when the optimization objective is the product of cost and time (no details are provided due to space limitations). Two additional significant points are: (i) our proposals are superior to the best performing existing heuristic by more than a factor of 20 in some cases; (ii) the overhead time to run the heuristics is extremely low: on a Ryzen 5 3600X CPU with 16 GB RAM, our rank-based heuristics take less than 1.34 milliseconds per case of the social insurance PDM; the other heuristics are even faster, and the times are negligible for the small PDM.
5 Related Work
As stated in the introduction, an increasing number of data-centric approaches has been developed as part of a general trend in the area of Business Process Management (BPM). Despite this recent interest, Business Process Improvement or Redesign, one of the key areas of BPM, remains relatively undeveloped in terms of automated algorithmic solutions. A recent survey that evaluates several data-centric process approaches highlights this lack of focus on process optimization or redesign [14]. Out of the 14 methods examined, only 2 identify the objective of business process optimization as a motive for their development. These two methods are Product Based Workflow Design (PBWD) [13] and its extension, Product Based Workflow Support (PBWS) [18], upon which we build our work.
A significant part of recent research in business processes targets variability between process models aiming at the same high-level objectives [15]. For example, the work in [15] is motivated by the fact that the same goal in different municipalities is achieved using different, equivalent processes and, to manage such variability, it introduces configurable process trees. This methodology allows a specific set of process models to be selected according to several criteria. This bears some similarity to the way PBWS exploits the existence of alternative paths in order to optimize each case's performance. The main difference lies in the fact that these alternatives are different paths of the same, already existing PDM model, whereas [15] attempts to create a model that contains alternative paths to cover all rationales. In such a context, proposals like [4] deal with the problem of extracting alternative models, whereas the issue of assessing the quality of different process model configurations [3] has also been explored.
Additionally, there are proposals considering variant optimization objectives, such as the techniques in [1], where a set of heuristics is introduced for optimizing the metrics of resource utilization, maximal throughput and execution (cycle) time. These heuristics consider changing the relative ordering of activities, enforcing parallel execution and merging activities, but they cannot be applied to PDMs (at least not in a straightforward manner). Finally, our work relates to declarative process models [5, 12]; e.g., our workflow design solution can be seen as a promising means to derive executable model structures out of such declarative models, although providing a complete methodology to achieve this remains an open issue.
Regarding data-centric workflows, a lot of effort has been put into finding the best sequential order of flow tasks for objectives such as minimizing the sum of the costs of these tasks or the bottleneck cost, or maximizing the utilization of each execution processor, and so on [2, 6, 8]. All these proposals aim to optimize a single criterion, but there are also proposals that target multi-objective data flow optimization, such as the algorithms in [16, 17]. Despite some initial efforts in [9], transferring the results of data analytics workflow optimization to business process workflows is still a topic in its infancy.
6 Conclusions and Future Work
This work focuses on processes modelled according to the declarative PDM paradigm and evaluates both existing and novel heuristics for yielding workflow designs on a case-by-case basis. Inspired by data analytics, we use the notion of rank, which combines, in a single metric, the probability of producing the root element and the cost of achieving this. In our experiments, we show that rank-based heuristics exhibit the best performance on average, but in specific cases, each of the 10 heuristics examined in this work may be the dominant one.
Our work suffers from the same limitations as the heuristics in [18]: we optimize on a case-by-case basis without viewing the process as a whole, e.g., in terms of resource utilization, and without considering parallel task execution. Apart from addressing these limitations, we aim to pursue three directions as future work: (i) to better handle, when computing path costs, the fact that a data element may need input from multiple elements (which is currently implicitly ignored); (ii) to devise hybrid methodologies that switch between heuristics within a specific case, motivated by our key observation that there is no globally dominant solution; and (iii) to transfer similar techniques to other declarative modelling approaches, such as [12].
Acknowledgment
The research work was supported by the Hellenic Foundation for Research and Innovation (H.F.R.I.) under the “First Call for H.F.R.I. Research Projects to support Faculty members and Researchers and the procurement of high-cost research equipment grant” (Project Number: 1052, Project Name: DataflowOpt). We would also like to thank Dr. Georgia Kougka for her comments and help.
References
1. van der Aalst, W.M.P.: Reengineering knockout processes. Decis. Support Syst. 30(4), 451–468 (2001). https://doi.org/10.1016/S0167-9236(00)00136-6
2. Agrawal, K., Benoit, A., Dufossé, F., Robert, Y.: Mapping filtering streaming applications. Algorithmica 62(1–2), 258–308 (2012). https://doi.org/10.1007/s00453-010-9453-6
3. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: Discovering and navigating a collection of process models using multiple quality dimensions. In: Lohmann, N., Song, M., Wohed, P. (eds.) BPM 2013. LNBIP, vol. 171, pp. 3–14. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-06257-0_1
4. Buijs, J.C.A.M., van Dongen, B.F., van der Aalst, W.M.P.: Mining configurable process models from collections of event logs. In: Daniel, F., Wang, J., Weber, B. (eds.) BPM 2013. LNCS, vol. 8094, pp. 33–48. Springer, Heidelberg (2013). https://doi.org/10.1007/978-3-642-40176-3_5
5. Chawla, N., King, I., Sperduti, A.: User-guided discovery of declarative process models (2011)
6. Deshpande, A., Hellerstein, L.: Parallel pipelined filter ordering with precedence constraints. ACM Trans. Algorithms 8(4), 1–38 (2012)
7. Henriques, R., Rito Silva, A.: Object-centered process modeling: principles to model data-intensive systems. In: zur Muehlen, M., Su, J. (eds.) BPM 2010. LNBIP, vol. 66, pp. 683–694. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-20511-8_62
8. Kougka, G., Gounaris, A., Simitsis, A.: The many faces of data-centric workflow optimization: a survey. Int. J. Data Sci. Anal. 6(2), 81–107 (2018). https://doi.org/10.1007/s41060-018-0107-0
9. Kougka, G., Varvoutas, K., Gounaris, A., Tsakalidis, G., Vergidis, K.: On knowledge transfer from cost-based optimization of data-centric workflows to business process redesign. In: Hameurlain, A., Tjoa, A.M. (eds.) Transactions on Large-Scale Data- and Knowledge-Centered Systems XLIII. LNCS, vol. 12130, pp. 62–85. Springer, Heidelberg (2020). https://doi.org/10.1007/978-3-662-62199-8_3
10. Künzle, V., Reichert, M.: PHILharmonicFlows: towards a framework for object-aware process management. J. Softw. Maintain. 23(4), 205–244 (2011)
11. Orlicky, J.A., Plossl, G.W., Wight, O.W.: Structuring the bill of material for MRP. In: Lewis, M., Slack, N. (eds.) Operations Management: Critical Perspectives on Business and Management, vol. 58. Taylor & Francis, New York (2003)
12. Pesic, M., Schonenberg, H., van der Aalst, W.M.P.: DECLARE: full support for loosely-structured processes. In: 11th IEEE International Enterprise Distributed Object Computing Conference (EDOC 2007), pp. 287–300 (2007)
13. Reijers, H.A., Limam, S., van der Aalst, W.M.P.: Product-based workflow design. J. Manag. Inf. Syst. 20(1), 229–262 (2003)
14. Reijers, H.A., et al.: Evaluating data-centric process approaches: does the human factor factor in? Softw. Syst. Model. 16(3), 649–662 (2016). https://doi.org/10.1007/s10270-015-0491-z
15. Schunselaar, D.: Configurable process trees: elicitation, analysis, and enactment. Ph.D. thesis, Department of Mathematics and Computer Science, October 2016
16. Simitsis, A., Wilkinson, K., Dayal, U., Castellanos, M.: Optimizing ETL workflows for fault-tolerance. In: 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), pp. 385–396 (2010)
17. Simitsis, A., Wilkinson, K., Castellanos, M., Dayal, U.: Optimizing analytic data flows for multiple execution engines. In: Proceedings of the 2012 ACM SIGMOD International Conference on Management of Data, pp. 829–840 (2012)
18. Vanderfeesten, I.T.P., Reijers, H.A., van der Aalst, W.M.P.: Product-based workflow support. Inf. Syst. 36(2), 517–535 (2011)