For each workflow pattern and each WfMS, we show the duration (milliseconds, Fig. 2(a)), the CPU utilization (%, Fig. 2(b)), and the mean amount of RAM allocated by the engine (MB, Fig. 3(a)). Table 1 shows the statistics of the duration computed for each workflow pattern and for every WfMS. The data provided in Table 1 correspond to the means of measurements obtained under the maximum load each WfMS could sustain, expressed in terms of the number of concurrent instance producers. In some cases a WfMS could not handle the maximum load (1,500 concurrent instance producers), and we had to reduce the number of concurrent instance producers. These cases and the resulting data are discussed in detail in the following subsections. For every experiment, the total number of completed workflow instance requests of each WfMS is listed in Table 1 in the column \(\#Workflow (wi)\). We also include the total duration of each experiment (in seconds) and the average throughput in terms of workflow instances per second.
Similar statistics have been calculated for CPU and RAM usage, but they are omitted for space reasons and provided with the supplementary material. The behaviour of all the WfMSs is discussed thoroughly in Sect. 4.3.
4.2 Results Analysis
Sequence Flow Pattern [SEQ]. The [SEQ] workflow pattern lasted on average 0.39 ms for WfMS A, 6.39 ms for WfMS B, and 0.74 ms for Camunda. The short duration of this workflow pattern explains the low mean CPU usage, which is \(43.21\,\%\) for WfMS A, \(5.83\,\%\) for WfMS B, and \(36.75\,\%\) for Camunda. WfMS B also has a very low average throughput of 63.31 wi/s, while for the other two WfMSs the average throughput is similar. Concerning memory utilization under the maximum load, WfMS A needed on average 12,074 MB, WfMS B 2,936 MB, and Camunda 807.81 MB of RAM. As observed from Table 1, [SEQ] is the workflow pattern with the highest throughput for all the WfMSs under test.
Exclusive Choice & Simple Merge Patterns [EXC]. Before proceeding to the analysis of the [EXC] results, we should consider that the first script task of the workflow pattern generates a random integer, which is given as input to the very simple evaluation condition of the exclusive choice gateway. This was expected to have some impact on the performance. However, Fig. 2(a) shows that the duration times are not notably affected, as the values are close to those of [SEQ]. More specifically, we have a mean of 0.48 ms for WfMS A, 9.30 ms for WfMS B, and 0.85 ms for Camunda. Concerning the CPU and RAM utilization, we see a slight increase with respect to [SEQ]. WfMS A uses an average of \(57.42\,\%\) CPU and 12,215 MB RAM for executing 775,455 workflow instances in 540 s. WfMS B takes approximately the same amount of time (562 s) to execute 27,805 workflow instances; for this, it utilizes a mean of \(5.73\,\%\) CPU and 2,976.37 MB of RAM. Camunda uses \(43.21\,\%\) of CPU and 824.96 MB of RAM for executing 765,274 workflow instances in 540 s.
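The control flow of this pattern can be sketched as follows (a minimal illustration in Python; the integer range and the gateway condition are assumptions, since the exact expression used in the benchmark is not stated here):

```python
import random

def exc_instance(rng: random.Random) -> str:
    """One [EXC] instance: a script task draws a random integer that
    feeds the exclusive choice gateway; exactly one branch executes
    before the simple merge joins the paths again."""
    value = rng.randint(0, 99)   # first script task: random integer
    if value < 50:               # exclusive choice gateway condition
        branch = "branch-a"      # task on the first outgoing path
    else:
        branch = "branch-b"      # task on the second outgoing path
    return branch                # simple merge: paths converge here
```

Even such a trivial condition adds one random draw and one comparison per instance, which is consistent with the slight increase in CPU usage observed relative to [SEQ].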
Explicit Termination Pattern [EXT]. As discussed in Sect. 2, the [EXT] concurrently executes an empty script and a script that implements a five-second wait. According to the BPMN 2.0 execution semantics, the branch of the [EXT] that finishes first terminates the rest of the workflow’s running branches. We have therefore designed the model expecting that the fastest branch (empty script) will complete first and stop the slow script on the other branch when the terminate end event following the empty script is activated. This was the case for WfMS A and Camunda, which executed the workflow pattern in an average of 14.11 ms and 0.4 ms respectively. The resource utilization of these two WfMSs also increases for this workflow pattern, i.e., we have \(60.20\,\%\) mean CPU usage and 12,025 MB mean RAM usage for WfMS A, and \(33.34\,\%\) mean CPU usage and 794.92 MB mean RAM usage for Camunda. We can already see an interesting difference in the performance of the two WfMSs, as [EXT] constitutes the slowest workflow pattern for WfMS A and the fastest for Camunda.
As seen in Fig. 2(a), WfMS B has very high duration results for this workflow pattern. We have investigated this matter in more detail and observed that, over the executions, WfMS B chooses a sequential execution of the two paths, following the waiting script first in \(52.23\,\%\) of the cases and the empty script first in \(47.77\,\%\). Since the waiting script takes five seconds to complete, every time it is chosen first it adds a five-second overhead, which is why the average duration time is so high. This alternating execution of the two branches also explains the rest of the statistics. For example, we observe a very high standard deviation of 2,500.44, indicating a very large spread of the values around the mean duration. Concerning the resource utilization, we observe a very low average CPU usage of \(0.24\,\%\) and a mean RAM usage similar to the rest of the workflow patterns at 2,747.34 MB. In Sect. 4.3 we attempt to explain this behavior of WfMS B.
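The reported mean and standard deviation can be reproduced with a minimal simulation of this serialized behavior (a sketch, not WfMS B’s actual scheduler; the 50/50 branch choice approximates the observed \(52.23\,\%\)/\(47.77\,\%\) split):

```python
import random
import statistics

WAIT_MS = 5_000  # duration of the waiting-script branch

def ext_duration_ms(rng: random.Random) -> float:
    """Duration of one [EXT] instance when the two 'parallel' branches
    are serialized on a single thread in a random order."""
    if rng.random() < 0.5:
        # Waiting branch scheduled first: its 5 s script completes
        # before the empty branch's terminate end event can fire.
        return WAIT_MS
    # Empty branch scheduled first: its terminate end event cancels
    # the waiting branch, so the instance finishes almost immediately.
    return 0.0

rng = random.Random(42)
samples = [ext_duration_ms(rng) for _ in range(10_000)]
mean_ms = statistics.mean(samples)    # close to 2,500 ms
stdev_ms = statistics.stdev(samples)  # also close to 2,500 ms
```

The mean lands near half the 5 s wait, and the standard deviation is of the same order, matching the large spread around the mean reported for [EXT] on WfMS B.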
Parallel Split and Synchronization Patterns [PAR]. The [PAR] executes two empty scripts concurrently. For WfMS A and WfMS B we observe an increase in the duration times, to 13.30 ms for WfMS A and 10.07 ms for WfMS B. Camunda handles parallelism very fast, with a mean duration of 0.71 ms. Although WfMS B seems faster by looking at the duration results, we should take into consideration that it executed a total of 27,718 workflow instances in 567 s, while WfMS A executed 772,013 workflow instances in 540 s. Moreover, it is noteworthy that WfMS A has a standard deviation of 11.99, which indicates that there were executions for which the parallelism introduced more duration overhead than the average value. WfMS B has a \(5.64\,\%\) mean CPU and 2,935.81 MB mean RAM usage, and Camunda has a \(41.67\,\%\) mean CPU and 828.37 MB mean RAM usage. For both WfMSs these values are in the same range as those resulting from the execution of the other workflow patterns. WfMS A utilizes on average \(66.10\,\%\) of CPU and 12,201.16 MB of RAM; its resource utilization is relatively higher than that obtained for the other workflow patterns.
Arbitrary Cycle Pattern [CYC]. The performance of the [CYC] workflow pattern cannot be directly compared to the other workflow patterns, because it contains a higher number of language elements and demonstrates a more complex structure. The [CYC] is also expected to have some extra overhead because of the number generation and the script that increases the value of the variable. Finally, the duration of this workflow pattern depends on the generated number, as in one case it executes 10 cycles while in the other it executes 5 cycles. During the execution of [CYC], Camunda showed connection timeout errors for a load greater than 600 instance producers. For this reason, we reduced the load to 600 instance producers when testing the other two WfMSs as well. The load for the results shown in Figs. 2(a), (b) and 3(a) for this workflow pattern is thus 600 instance producers. Table 1 shows the results for the maximum load each WfMS could sustain: 800 instance producers for WfMS A, 1,500 instance producers for WfMS B, and 600 for Camunda. As expected, the mean [CYC] execution duration is higher than for the other workflow patterns. Under its maximum load, WfMS A has a mean duration of 6.23 ms, while Camunda has a mean duration of 3.06 ms, marginally higher than the 2.92 ms WfMS A achieves at the same load of 600 instance producers. WfMS B has a mean duration of 39.36 ms for approximately 600 instance producers.
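The structure of one [CYC] instance can be sketched as follows (an illustrative Python rendering only; how the generated number selects between the 5- and 10-cycle case is an assumption, as the exact condition is not given here):

```python
import random

def cyc_instance(rng: random.Random) -> int:
    """One [CYC] instance: a script task generates a number that
    selects the loop bound (5 or 10 cycles), then a script task
    increments a counter on every pass until the exit condition
    of the arbitrary cycle holds."""
    bound = 10 if rng.random() < 0.5 else 5  # generated number picks the case
    counter = 0
    while counter < bound:       # back-edge of the arbitrary cycle
        counter += 1             # script task increasing the variable
    return counter
```

Each instance thus pays for one number generation plus 5 or 10 evaluations of the loop condition and increments, which is why [CYC] carries more per-instance work than the other patterns.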
Concerning the resource utilization, WfMS B and Camunda remain within the same range of mean CPU usage (\(4.67\,\%\) for WfMS B and \(41.67\,\%\) for Camunda) as with the other workflow patterns. WfMS B also remains in the same range of mean RAM usage (2,851.9 MB), while for Camunda we observe an increase to an average of 933.31 MB. Concerning WfMS A’s resource utilization, we observe a tendency to increase in comparison with the rest of the workflow patterns. For approximately 600 instance producers, WfMS A uses on average \(70.09\,\%\) of CPU and 12,201.16 MB of RAM. We also consider it interesting to report how the results evolved for WfMS A and WfMS B when we increased the load to the maximum (800 and 1,500 instance producers, respectively). Then we observe WfMS A doubling its mean duration time from 2.92 ms to 6.23 ms. The CPU is also more stressed, reaching \(83.93\,\%\), while the mean memory usage increases only slightly to 12,429.67 MB. WfMS B remains in the same range as the previous values, with scarcely any change in its performance; it uses on average \(4.59\,\%\) of CPU and 2,897.72 MB of RAM. This is because its response time increases while adding instance producers.
Mix [MIX]. From a quick overview of the [MIX] statistics, one could conclude that they reflect the mean duration times of the individual workflow patterns shown in Fig. 3(b). The throughput of the mix is also somewhat smaller for all the WfMSs, although WfMS A keeps it in the same range as the previous values, at 1,402.33 wi/s. In Fig. 3(b) we can observe the separate duration times of the workflow patterns when they are executed in the uniformly distributed mix. As seen in Fig. 3(b), all workflow patterns show a slight increase in their duration times with respect to their execution as a single workflow pattern.
4.3 Discussion

As reported in Sect. 4.2, the BPMN version of WfMS B presents a peculiarity in its behaviour. This was also noticed by Bianculli et al. in their performance measurements on the WS-BPEL version of WfMS B. According to the WfMS B documentation, the REST API calls to the execution server block until the process instance has completed its execution. We observed the effects of this synchronous API in our experiments. All instance producers send requests to the WfMS using the REST API with a think time of 1 s. The instance producers need to wait for the completion of their previous request before sending a new one; in the case of WfMS B, the clients that wait for the entire execution of the workflow instance to finish introduce a high overhead. This overhead causes a delay that burdens the WfMS’s performance. To investigate this further, we executed a scalability test to analyze the WfMS’s behavior under different load intensity levels. The goal of this experiment was to examine whether, by significantly increasing the number of instance producers, we could achieve a number of executed workflow instances more comparable to those of WfMS A and Camunda. We executed the experiment with 500, 1,000, 1,500, and 2,000 instance producers and observed mean response times of 7.15, 15.19, 22.58, and 30.89 s respectively, while the throughput remained stable at an average of 62.23 workflow instances per second. These data show that (i) it is pointless to increase the number of instance producers in order to execute more workflow instances; and that (ii) the fact that WfMS B is the only WfMS of the three under test using a synchronous REST API does not impact the comparability of the measurements.
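The stable throughput is consistent with the interactive response time law for a closed system, \(X \approx N / (R + Z)\), with \(N\) instance producers, mean response time \(R\), and think time \(Z = 1\) s. A quick back-of-the-envelope check (not part of the original analysis):

```python
# Interactive response time law for a closed system: X = N / (R + Z).
THINK_TIME_S = 1.0  # think time Z of each instance producer

measured_response_s = {  # instance producers N -> mean response time R (s)
    500: 7.15,
    1000: 15.19,
    1500: 22.58,
    2000: 30.89,
}

throughput = {
    n: n / (r + THINK_TIME_S) for n, r in measured_response_s.items()
}
# Every configuration yields roughly 61-64 wi/s, in line with the
# stable average of 62.23 wi/s reported for WfMS B.
```

Because the response time grows almost linearly with \(N\), the predicted throughput stays flat, confirming point (i): adding producers only lengthens each producer’s wait.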
Another issue discussed concerning WfMS B was the inconsistent execution behaviour of the [EXT]. Although the expected execution of [EXT] is that when the path with the empty script ends, the execution of the path with the 5 s script is also terminated, we observed many executions with the opposite behavior: the path with the wait script executed “first” and then, after 5 s, the execution of the empty script followed. In this case, the end event corresponding to the empty task was never executed. This behavior of WfMS B is also explained in its documentation. WfMS B dedicates a single thread to the parallel execution of scripts, leading to a non-deterministic serialization of the parallel paths. Indeed, the data showed that in about \(50\,\%\) of the cases the fast path is chosen to be executed first (cf., Sect. 2). When the branch with the 5 s waiting script is chosen first, the execution has to wait 5 s until this branch is completed, as expected. This explains the very high mean duration of the [EXT], as about half of the executions last 5 s.
At this point we can draw some conclusions. Regarding the behavior of WfMS B, we observe much higher workflow execution durations for all the workflow patterns. The CPU and memory utilization of WfMS B is always much lower than that of the other two WfMSs because of the lower throughput. This is reasonable, since every workflow instance is executed sequentially and the actually executed load is lower than for the other two WfMSs, because of the higher response time. WfMS A and Camunda share many architectural similarities, because Camunda was originally a fork of WfMS A. Still, their behaviour is not identical and leads to some interesting points. Camunda kept the duration values low for all the workflow patterns, but for [SEQ] and [EXC] WfMS A performed slightly better. However, we note large differences in the duration values for [EXT], [PAR], and [MIX], which indicate an impact of parallelism on the performance of WfMS A, along with increased resource utilization. Parallelism does not seem to have much impact on Camunda, which remained relatively stable in all tests. Concerning resource utilization in general, we observe that WfMS B and Camunda have a more stable behaviour, while WfMS A shows a direct increase when it is more stressed. Overall, we may conclude that Camunda performed better and more stably on all metrics when compared with WfMS A and WfMS B.
Finally, concerning our hypotheses (cf., Sect. 2), [SEQ] proved to be the workflow pattern with the shortest and most stable durations for all WfMSs [HYP1]. It was also the workflow pattern with the highest throughput for all tested WfMSs. Concerning the [EXC], our hypothesis was affirmed (cf., [HYP2]), as there is a slight impact on the performance, which we connect with the evaluation of the condition. The hypotheses [HYP3] and [HYP4], that the [PAR] and [EXT] will have similar impacts on performance, hold basically for WfMS A and Camunda. Our [HYP4] and [HYP5], that parallelism and complex structures have an impact on performance, seem to hold for WfMS A, while for Camunda no conclusions can be drawn in this respect. These results indicate that sequential workflows (i.e., [SEQ]) may help towards discovering the maximum throughput of the WfMSs. Parallelism (i.e., [PAR], [EXT]) may affect the WfMSs in terms of throughput and resource utilization, while more complex structures (i.e., [CYC]) are better candidates for stressing the WfMSs in terms of resource utilization. These conclusions should be considered when designing the workload for more complex, realistic cases and macro-benchmarks.
4.4 Threats to Validity
A threat to external validity is that we evaluate three WfMSs, which is the minimum number of WfMSs for drawing initial conclusions. To generalize our conclusions more WfMSs are needed, and to this end we are designing the BenchFlow environment to allow the easy addition of more WfMSs. Moreover, our simple workload models threaten the construct validity. Although the micro-benchmark does not correspond to real-life situations, we consider it fundamental for the purposes of our work. Using this knowledge as a basis, we plan to test more complex, realistic structures such as nested parallelism and conditions, as well as combinations of them in diverse probabilistic workload mixes.