
1 Introduction

The increasing pace at which customer priorities change drives many companies to adopt continuous deployment [1, 8, 15, 23]. A continuous deployment model is crucial for the service delivery business, as it ensures that software services are always in a releasable state and that changes are incremental. To ensure high-quality releases in continuous deployment, the upcoming release is staged in the production environment using strategies such as blue-green deployment [13], dark launches [9], canary releases [13, 26] and shadow testing [23], and its performance is monitored [3, 13] to quickly identify whether it is misbehaving [23, 26].

A vast amount of data is logged during the execution of the new and previously deployed software versions. Existing monitoring systems keep track of suspicious events in logs (e.g., errors, warning messages, stack traces) and raise alerts. However, such systems do not leverage the unstructured data captured in the execution logs to efficiently derive and compare the dynamic behavior of the new and the previously deployed versions in a holistic manner.

In this work, we present a novel approach to automatically detect discrepancies in fast-evolving applications that adopt continuous deployment. This is achieved by identifying the differences between the behavior models of the previously deployed and the new version, derived by mining the execution logs.

Fig. 1. Differences between the execution behavior models of two versions of Nutch. Vertices added in the new version are encircled twice, added edges are bold, and deleted vertices and edges are dashed. The analysis of the differences allowed us to discover a bug that we reported as NUTCH-2345. The bug was fixed by the Nutch developers.

2 Motivating Example: A Bug in Nutch

As part of an issue [NUTCH-1934]Footnote 1, the class Fetcher, comprising ca. 1600 lines of code, is refactored to improve modularity. We took the versions before and after the refactoring to identify the differences between the two. We used Nutch to crawl a set of URLs, thus generating execution logs for both versions. We map the generated execution logs to templates derived from the Nutch source code using string matching. A subset of log lines was not mapped to any source code template (that is, they originate from third-party libraries) and was clustered using a combination of approximate and weighted edit distance clustering. An execution behavior model is discovered automatically for each version from the respective templatized execution logs. Each vertex in the model corresponds to a unique template. Using our automated approach, many diff regions are detected between the two discovered models.

Figure 1 presents one of the diff regions, i.e., the deletion of a set of vertices T1103–T1109 (represented as dashed) from class Fetcher.java and the addition of new vertices EXT0–EXT5 (double circled) apparently originating from a third-party library (prefixed with EXT). We manually investigated this diff region and found that the code fragment corresponding to templates T1103–T1109 has been moved from Fetcher.java to FetchItemQueue.javaFootnote 2. Inspecting FetchItemQueue.java, we found that FetchItemQueues is used as the logger class instead of FetchItemQueue. Consequently, the log messages from FetchItemQueue carried a wrong class name, were not mapped to the corresponding source code logging statement, and were treated as log statements from a third-party library (EXT0–EXT5).

This issue was introduced in Nutch 1.11 and, after we reported itFootnote 3, fixed in Nutch 1.13. Using our approach, the issue would have been detected in version 1.11 itself. This highlights the potential of our approach for discovering anomalies by analyzing automatically identified diff regions.

3 Proposed Approach

The proposed approach takes the execution logs and source code of the deployed and the new version as starting points. Since our approach targets continuous deployment, access to both artifacts can be assumed. The approach leverages execution logs without instrumenting the code, because the overhead of instrumentation is not acceptable in fast-evolving production software [31]. Nevertheless, execution paths can be successfully captured from the existing logs because, in practice, sufficient logging is done to facilitate runtime monitoring [6, 16].

Our approach consists of three broad phases: template mining, which maps each line in the execution logs to a unique template (Sect. 3.1); execution behavior model mining, which derives execution behavior models from the templatized logs and refines them using a multimodal approach (Sect. 3.2); and analysis of model differences, which identifies the differences between the execution behavior models and groups them into cohesive diff regions (Sect. 3.3).

3.1 Template Mining

A template is an abstraction of a logging statement in the source code, consisting of a fixed part and a variable part (denoting parameters) [2, 18]. Due to the presence of parameters, a template often manifests itself as different log messages. Thus, identifying the templates from the execution log messages has inherent challenges [20]. If no source code is available, templates can be inferred by clustering log messages [20, 27]. However, log messages from different logging statements are often clustered together, resulting in inaccurate templates. Since we have access to the source code, we extract templates using regular expressions (cf. Fig. 2).

Fig. 2. Given the source code (1), templates are extracted (2), and log lines are mapped to them (3). Log lines from external libraries are clustered to create new templates (4).

Derive Templates from the Source Code: In this step, the print statements are identified in the source code along with their class name and severity level (e.g., INFO, WARN and DEBUG) [6]. We search for logging statements in the source code using regular expressions, with some enhancements to identify ternary print statements and to ignore commented-out logging statements. As shown in Fig. 2, each logging statement is parsed and represented as a regular expression, which is then assigned a unique template ID. The class name and severity level are also stored as additional information to disambiguate templates that have an identical invariant pattern but appear in different classes of the code.
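The sketch below illustrates in Python how such templates could be derived. The Log4j-style call pattern, the regular expression, and the name derive_templates are assumptions made for this example, not the exact implementation.

```python
import re

# A minimal sketch of template derivation from Java-style logging statements.
# Assumes Log4j-like calls ("LOG.info(...)") with string literals, "{}"
# placeholders, or string concatenation; real code needs the enhancements
# (ternary statements, commented-out statements) described above.
LOG_CALL = re.compile(r'\b(?:LOG|log|logger)\.(trace|debug|info|warn|error)\s*\((.*)\)\s*;')

def derive_templates(java_source, class_name, start_id=0):
    """Return {template_id: (compiled_regex, class_name, severity)}."""
    templates = {}
    for m in LOG_CALL.finditer(java_source):
        severity, arg = m.group(1).upper(), m.group(2)
        # Fixed parts come from the string literals; "{}" placeholders and
        # concatenated expressions become wildcards (the variable parts).
        literals = re.findall(r'"([^"]*)"', arg)
        fixed = [re.escape(piece) for lit in literals
                 for piece in lit.split('{}') if piece]
        pattern = '.*'.join(fixed) if fixed else '.*'
        tid = f'T{start_id + len(templates)}'
        templates[tid] = (re.compile(pattern), class_name, severity)
    return templates

# Usage on a single hypothetical logging statement:
src = 'LOG.info("fetching " + fit.url + " (queueing delay=" + delay + "ms)");'
print(derive_templates(src, 'Fetcher'))  # one template with class name and severity
```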

While the complete source code is used to extract templates for the deployed source code version, to extract the templates for the new version we only analyze the diff between the two source code revisions, as continuous deployment encourages incremental changes. Not only is the extraction more efficient, this also ensures that the templates unchanged between the two versions are represented by the same template ID. The main shortcoming of using the diff is that a modified logging statement appears in the diff as a combination of an addition and a deletion, which will be interpreted as the addition of a new template and the deletion of the old one. Thus, the two execution behavior models will appear different for templates which are actually the same. Since modification of logging statements is frequent [6, 16], we address this shortcoming using a novel multimodal approach for template merging and model refinement (Sect. 3.2).

Templatize Log Messages: In this step, a template ID is assigned to each log line appearing in the execution logs by matching it against the templates obtained in the previous step. To reduce the search space for the match, the class name and severity level (if included in the log messages) are used as additional matching parameters (cf. Fig. 2). While regular expression matching can find the matching template, log lines matching multiple templates, templates with no fixed part, and log lines generated by third-party libraries require special treatment. If a log line matches more than one template, it is mapped to the most specific template, i.e., the template with the largest fixed part. If there is exactly one logging statement without a constant part in a class, then all unmapped log lines from that class with the same logging level are mapped to itFootnote 4. Finally, log lines from external sources such as third-party libraries, for which we do not have access to the source code, cannot be templatized as explained above. These log lines are clustered using a combination of approximate clustering [20] and weighted edit distance similarity [10]. Each cluster generated after the refinement is represented as a template and is assigned a unique template ID. Thereafter, the non-templatized log lines are matched with the templates derived in the clustering step, so that every log line is assigned a unique template ID.
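A minimal sketch of the matching step is shown below. It reuses the template dictionary shape from the previous sketch; the function name and the simple substring checks for class name and severity are assumptions for illustration.

```python
def templatize(log_lines, templates):
    """Assign each log line the most specific matching template ID.

    `templates` maps template_id -> (compiled_regex, class_name, severity),
    as in the extraction sketch above; unmatched lines (e.g. from third-party
    libraries) are returned separately for clustering.
    """
    mapped, unmatched = [], []
    for line in log_lines:
        candidates = []
        for tid, (regex, class_name, severity) in templates.items():
            # Class name and severity level, when present in the line, narrow the search.
            if class_name in line and severity in line and regex.search(line):
                # "Most specific" = largest fixed (non-wildcard) part of the pattern.
                fixed_len = len(regex.pattern.replace('.*', ''))
                candidates.append((fixed_len, tid))
        if candidates:
            mapped.append((line, max(candidates)[1]))
        else:
            unmatched.append(line)   # clustered later to create new templates
    return mapped, unmatched
```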

3.2 Mining Execution Behavior Model Using Multimodal Approach

The Execution Behavior Model (\(\mathsf{EBM}\)) is a graphical representation of the templatized execution logs capturing the relationships between templates. Each vertex in the model corresponds to a unique template, and the edges represent the flow relationship between templates. Since a template represents a logging statement in the code, the \(\mathsf{EBM}\) captures a subset of the possible code flows.
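The following sketch shows one plausible way to construct such a model from templatized logs, assuming the log lines have already been grouped into per-execution traces (e.g., by thread ID); the data layout and function name are assumptions for illustration.

```python
from collections import defaultdict

def mine_ebm(templatized_runs):
    """Build an execution behavior model from templatized logs.

    `templatized_runs` is a list of template-ID sequences, one per execution
    trace (e.g., per thread ID). Returns the vertex set and a map from edge
    (src, dst) to its observed frequency.
    """
    edges = defaultdict(int)
    vertices = set()
    for run in templatized_runs:
        vertices.update(run)
        for src, dst in zip(run, run[1:]):
            edges[(src, dst)] += 1   # how often src is immediately followed by dst
    return vertices, dict(edges)

# Example: two traces over templates T1..T3
ebm_vertices, ebm_edges = mine_ebm([['T1', 'T2', 'T3'], ['T1', 'T3']])
```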

The accuracy of the identified diff regions directly depends on the accuracy of \(\mathsf{EBM}\) mining, which in turn depends on the accuracy of template mining. As discussed in Sect. 3.1, the execution logs are templatized with high precision using the source code. However, for log lines generated by third-party libraries we had to resort to the clustering-based technique, which has inherent limitations. This limits the template mining accuracy and, consequently, the accuracy of \(\mathsf{EBM}\) mining. The problem is even more apparent in the new version, because only a limited amount of logs is available, which hinders accurate mining [20]. Further, modified logging statements in the source code are recorded as new templates; this inconsistency leads to many spurious differences between the compared models, making the diff analysis practically less effective. To overcome this problem, we propose an iterative \(\mathsf{EBM}\) refinement strategy using multimodal signals, namely the text and the vicinity (i.e., predecessors and successors in the \(\mathsf{EBM}\)) of a template.

Iterative Execution Behavior Model Refinement: We derive the execution behavior models for the deployed version, \(\mathsf{EBM}_{d}\), and the new version, \(\mathsf{EBM}_{n}\), using the corresponding templatized execution logs. We compare \(\mathsf{EBM}_{d}\) and \(\mathsf{EBM}_{n}\) to identify the vertices which are present in \(\mathsf{EBM}_{n}\) but not in \(\mathsf{EBM}_{d}\) (that is, \(\bigtriangleup T_{add}\)) and vice versa, i.e., \(\bigtriangleup T_{del}\). It is possible that a vertex from \(\bigtriangleup T_{add}\) is actually the same as a vertex from \(\bigtriangleup T_{del}\) but has been captured as a different template, as discussed above. We identify and resolve such cases using the proposed multimodal approach, thus reducing spurious differences and making the comparison more effective.

One of the multimodal signals that we use is the textual similarity between the templates from \(\bigtriangleup T_{del}\) and \(\bigtriangleup T_{add}\). If there are m templates in \(\bigtriangleup T_{del}\) and n templates in \(\bigtriangleup T_{add}\), then the similarity is calculated for \(m \times n\) pairs. The pairs with textual similarity above a threshold are captured as potential merge candidates. We do not merge templates based on text similarity alone, because two textually similar templates may correspond to different logging statements in the code. Hence, to improve precision, we evaluate the similarity for one more modality, i.e., vicinity similarity, where the vicinity is the set of predecessors and successors. If the vicinity similarity is above a threshold, the templates are marked as identical. The thresholds for textual and vicinity similarity can be selected via grid search and fine-tuned to project requirements [20].

We continue the process iteratively, with each step leading to a more refined \(\mathsf{EBM}_{n}\). With every iteration, some of the vertices are marked as identical, which in turn can change the vicinity similarity of other candidate pairs. We stop when no more candidate pairs can be merged and the \(\mathsf{EBM}_{n}\) produced by subsequent steps no longer changes.
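A hedged sketch of the refinement loop is given below. The edge-set representation of the models, the text_sim callback, the Jaccard-style vicinity similarity, and the default thresholds of 0.8 and 0.5 are assumptions made for illustration (the threshold values mirror those used later in Sect. 4.1).

```python
def refine(ebm_d, ebm_n, t_del, t_add, text_sim, tau_text=0.8, tau_vic=0.5):
    """Iteratively merge added/deleted templates that are textually and
    structurally similar.

    `ebm_d` and `ebm_n` are edge sets of the two models, `t_del`/`t_add` the
    deleted/added vertex sets, and `text_sim` a pairwise textual-similarity
    function; thresholds are illustrative and should be tuned per project.
    Returns a mapping from new template IDs to the old IDs they are merged with.
    """
    def vicinity(ebm, v):
        return ({s for (s, d) in ebm if d == v},   # predecessors
                {d for (s, d) in ebm if s == v})   # successors

    def vicinity_sim(v_old, v_new, renamed):
        p_old, s_old = vicinity(ebm_d, v_old)
        p_new, s_new = vicinity(ebm_n, v_new)
        # Map already-merged new vertices back to their old IDs before comparing.
        p_new = {renamed.get(x, x) for x in p_new}
        s_new = {renamed.get(x, x) for x in s_new}
        union = p_old | p_new | s_old | s_new
        return len((p_old & p_new) | (s_old & s_new)) / len(union) if union else 0.0

    candidates = [(o, n) for o in t_del for n in t_add if text_sim(o, n) >= tau_text]
    renamed, changed = {}, True
    while changed:                        # stop when no more pairs can be merged
        changed = False
        for old, new in list(candidates):
            if vicinity_sim(old, new, renamed) >= tau_vic:
                renamed[new] = old        # mark the pair as identical
                candidates.remove((old, new))
                changed = True
    return renamed
```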

Fig. 3. Iterative multimodal execution behavior model refinement.

Example 1

Consider the EBMs shown in Fig. 3. By comparing \(\mathsf{EBM}_{d}\) and \(\mathsf{EBM}_{n}\), we observe that \(\bigtriangleup T_{del} = \{T_{2}, T_{3}, T_{4}\}\) and \(\bigtriangleup T_{add} = \{T'_{2}, T'_{3}, T'_{4}\}\). We calculate the text similarity for all nine pairs and find the set of potentially similar candidates, \(C = \{(T_{2},T'_{2}), (T_{3},T'_{3}), (T_{4},T'_{4})\}\). The vicinity similarity is checked for all candidates; in the first iteration it is above the threshold only for one pair, \((T_{2},T'_{2})\), which is marked as identical and removed from C. In the next iteration, the remaining pairs from C are analyzed, and the vicinity similarity is found to be greater than the threshold for \((T_{4},T'_{4})\), which is again marked as identical and removed from C. Only one pair, \((T_{3},T'_{3})\), is not marked as identical because its vicinity similarity is below the threshold even though its textual similarity is high. Consequently, the diff set after \(\mathsf{EBM}_{n}\) refinement is reduced to \(\bigtriangleup T_{add} = \{T'_{3}\}\) and \(\bigtriangleup T_{del} = \{T_{3}\}\).

3.3 Analyzing Differences Between Execution Behavior Models

Since EBMs are graphs, identifying the differences between them can be seen as an instance of the graph isomorphism problem [14], known to be in NP. However, since we ensure consistency of the template IDs across the two models, the comparison of the two models is simplified. The refined models are compared to identify the following differences: the sets of vertices, \(\bigtriangleup \textit{diff}_{v}\), and edges, \(\bigtriangleup \textit{diff}_{e}\), which are added or deleted, as well as the set of vertices for which the relative frequency of outgoing transitions has changed, \(\bigtriangleup \textit{diff}_{dist}\), in \(\mathsf{EBM}_{n}\) when compared to \(\mathsf{EBM}_{d}\). For efficient follow-up analysis, we group the identified differences into cohesive regions such that related differences are investigated as a single unit.
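Given consistent template IDs, the structural part of the comparison reduces to set operations, as the following sketch illustrates; the function and parameter names are placeholders.

```python
def structural_diff(v_d, e_d, v_n, e_n):
    """Identify added/deleted vertices and edges between the refined models.

    `v_d`/`v_n` are vertex (template-ID) sets and `e_d`/`e_n` are edge sets of
    EBM_d and EBM_n. Distribution changes (diff_dist) are handled separately;
    see the sketch under "Distribution Anchored Region" below.
    """
    diff_v = (v_n - v_d) | (v_d - v_n)   # vertices added to or deleted from EBM_n
    diff_e = (e_n - e_d) | (e_d - e_n)   # edges added to or deleted from EBM_n
    return diff_v, diff_e
```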

Example 2

Deletion of T1103–T1109 and the corresponding edges, and addition of EXT0–EXT5 and the corresponding edges in Fig. 1 are grouped together.

Vertex Anchored Region: Intuitively, we would like to find the maximum point from which a difference in execution behavior is observed and the minimum point up to which the execution behavior differs. Differences with the same maximum point are highly likely to be caused by a modification of the same code and should therefore be investigated as a single unit.

A vertex \(v_{i}\) is randomly selected from \(\bigtriangleup \textit{diff}_{v}\) as a seed to detect a region. We traverse the graph backwards until an unchanged ancestor (that is, a vertex common to the two models) is detected along every path to \(v_{i}\). All vertices and edges along these paths (including the unchanged ancestor) are marked as part of the region. For all marked vertices, all outgoing branches are traversed and marked until an unchanged child vertex (that is, a vertex common to the two models) is detected. The unchanged child vertex is not included in the region, because the boundary of the region is defined by the last difference along the included path. Effectively, a region covering a set of vertices and edges is identified. The process repeats as long as unmarked vertices remain in \(\bigtriangleup \textit{diff}_{v}\). At the end of this step, all vertices from \(\bigtriangleup \textit{diff}_{v}\) and some edges from \(\bigtriangleup \textit{diff}_{e}\) are marked as part of some region. We call these regions vertex anchored regions.
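The sketch below outlines this traversal. The predecessor/successor maps and the set-based bookkeeping are assumptions about the data layout, and the code approximates the described procedure rather than reproducing our exact implementation.

```python
def vertex_anchored_regions(diff_v, diff_e, preds, succs):
    """Group the identified differences into vertex-anchored regions.

    `preds`/`succs` map a vertex to its predecessors/successors in the union
    of the two models; `diff_v`/`diff_e` are the vertex and edge differences.
    Returns a list of (region_vertices, region_edges) pairs.
    """
    changed, unmarked, regions = set(diff_v), set(diff_v), []
    while unmarked:
        seed = unmarked.pop()                     # random uncovered difference as seed
        marked, edges, stack = {seed}, set(), [seed]
        while stack:                              # backward: stop at unchanged ancestors
            v = stack.pop()
            for p in preds.get(v, ()):
                edges.add((p, v))
                if p not in marked:
                    marked.add(p)
                    if p in changed:              # keep climbing only through changed vertices
                        stack.append(p)
        stack = list(marked)
        while stack:                              # forward: stop at unchanged children
            v = stack.pop()
            for s in succs.get(v, ()):
                edges.add((v, s))
                if s in changed and s not in marked:
                    marked.add(s)
                    stack.append(s)
        regions.append((marked & changed, edges & set(diff_e)))
        unmarked -= marked
    return regions
```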

Fig. 4. There are two vertex anchored regions (different shades of yellow) and one edge between unchanged vertices (blue). Blue pointers correspond to backward traversal, and green pointers depict forward traversal. (Color figure online)

Example 3

Consider Fig. 4 where \(\bigtriangleup \textit{diff}_{v}=\{T_{0}, T_{2}, T_{3}, T_{4}, T_{6}, T_{7}, T_{8}, T_{10}\}\) and \(\bigtriangleup \textit{diff}_{e}=\{(T_{0}, T_{11}), (T_{1}, T_{0}), (T_{1}, T_{4}), (T_{1}, T_{3}), (T_{1}, T_{2}), (T_{1}, T_{6}), (T_{4}, T_{5}),(T_{3}, T_{5}),(T_{2}, T_{5}), (T_{11}, T_{6}), (T_{6}, T_{7}), (T_{7}, T_{11}), (T_{10}, T_{11}), (T_{10}, T_{9}), (T_{9}, T_{10}), (T_{9}, T_{8}), (T_{8}, T_{9}), (T_{8}, T_{11}), (T_{5}, T_{1})\}\). We choose \(T_{7}\) as the first seed and traverse its incoming paths backwards (blue pointers) up to the nearest unchanged ancestors, i.e., \(\{T_{1}, T_{11}\}\), marking vertices \(\{T_{7}, T_{6}, T_{1}, T_{11}\}\) and edges \(\{(T_{6}, T_{7}), (T_{1}, T_{6}), (T_{11}, T_{6})\}\). Next, the outgoing branches are traversed (green pointers) until an unchanged vertex is detected, and the corresponding vertices and edges are marked. As a result, the light yellow region is created, consisting of \(V_{r1} = \{T_{0}, T_{2}, T_{3}, T_{4}, T_{6}, T_{7}\}\) and \(E_{r1} = \{(T_{0}, T_{11}), (T_{1}, T_{0}), (T_{1}, T_{4}), (T_{1}, T_{3}), (T_{1}, T_{2}), (T_{1}, T_{6}), (T_{4}, T_{5}),(T_{3}, T_{5}),(T_{2}, T_{5}), (T_{11}, T_{6}), (T_{6}, T_{7}), (T_{7}, T_{11})\}\) from \(\textit{diff}_{v}\) and \(\textit{diff}_{e}\), respectively. For the next iteration we choose \(T_{8}\) as a seed from the set of uncovered vertices in \(\textit{diff}_{v}\) and repeat the process to identify another region. The second region becomes \(V_{r2} = \{T_{8}, T_{9}, T_{10}\}\) and \(E_{r2} = \{(T_{10}, T_{11}), (T_{10}, T_{9}), (T_{9}, T_{10}), (T_{9}, T_{8}), (T_{8}, T_{9}), (T_{8}, T_{11})\}\). Hence, all vertices from \(\textit{diff}_{v}\) and a subset of \(\textit{diff}_{e}\) are grouped into cohesive regions.

Edge Anchored Region: Not all edges from \(\bigtriangleup \textit{diff}_{e}\) belong to one of the vertex anchored regions. These are mainly edges added or deleted between unchanged vertices, and they should be analyzed separately. We refer to each such edge, together with its vertices, as an edge anchored region.

Example 4

After detecting the vertex anchored regions in Fig. 4 only one edge in \(\bigtriangleup \textit{diff}_{e}\) is unmarked. The only edge anchored region is hence \(V_{r3} = \{T_{1}, T_{5}\}\) and \(E_{r3} = \{(T_{5}, T_{1})\}\).

Distribution Anchored Region: Apart from the above two cases of structural changes (addition or deletion of vertices or edges) in the execution behavior model, we investigate the vertices common to both versions of the model to detect changes in the relative frequency of outgoing transitions. To capture a distribution change, for a given vertex v and its outgoing transitions common to the two models, we compute \(\frac{|f_d(i) - f_n(i)|}{f_d(i)}\), where \(f_d(i)\) (\(f_n(i)\)) is the relative frequency of transition i in \(\mathsf{EBM}_{d}\) (\(\mathsf{EBM}_{n}\)) among the outgoing transitions of v common to the two models. If the metric value is above a threshold for at least one transition of the vertex v, the vertex is marked as a distribution anchored region. The threshold needs to be chosen manually based on project requirements, such that minor changes are discounted (that is, not reported as differences) while major changes are reported.
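A small sketch of this check for a single vertex is shown below; the counts in the usage line are illustrative, and the default threshold of 0.5 is only an example value.

```python
def distribution_changed(out_freq_d, out_freq_n, threshold=0.5):
    """Flag a vertex whose outgoing-transition distribution changed.

    `out_freq_d`/`out_freq_n` map each outgoing transition of the vertex that
    is common to both models to its absolute count in EBM_d / EBM_n. The
    threshold is an illustrative value that has to be tuned per project.
    """
    common = set(out_freq_d) & set(out_freq_n)
    if not common:
        return False
    total_d = sum(out_freq_d[t] for t in common)
    total_n = sum(out_freq_n[t] for t in common)
    for t in common:
        f_d = out_freq_d[t] / total_d          # relative frequency in EBM_d
        f_n = out_freq_n[t] / total_n          # relative frequency in EBM_n
        if abs(f_d - f_n) / f_d > threshold:   # |f_d(i) - f_n(i)| / f_d(i)
            return True
    return False

# Illustrative counts: the second transition dominates in the new version.
print(distribution_changed({'T_a': 95, 'T_b': 5}, {'T_a': 40, 'T_b': 60}))  # True
```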

4 Evaluation: Open Source and Proprietary Applications

We evaluated our approach on two different applications: (i) NutchFootnote 5, an open source web crawler, and (ii) an industrial log analytics application. We have already shown some initial results on the Nutch project in Sect. 2 and discuss the other findings here. All artifacts, such as execution logs, templatized logs, execution behavior models and diff files, are made publicly available for reproducibility of the resultsFootnote 6. Details of the industrial application cannot be divulged for confidentiality reasons. We selected these applications primarily because of the availability of the source code and historical data on bugs and the corresponding fixes, as well as the frequently occurring incremental changes in these applications. Execution logs for these projects were not available, so we used a custom load generator to generate logs for the different source code versions.

Table 1. Properties for the two versions of Nutch application

4.1 Experimental Results for Nutch

Two Nutch versions were used: (i) the version before the commit for [NUTCH-1934], henceforth called version 1 (the deployed/prod version), and (ii) the version after the commit for [NUTCH-1934], henceforth called version 2 (the new version). This commit is considered a major change, as a big class, Fetcher.java (ca. 1600 lines of code), is refactored into six classes. Table 1 presents the details of the two Nutch versions. We derive the templates from the source code of version 1, henceforth called \({\text{templates}}_{v1}\). To derive the templates for version 2, \({\text{templates}}_{v2}\), 46 templates are deleted from and 47 templates are added to \({\text{templates}}_{v1}\), in accordance with the code diff (here, git-diff) between the two code versions. We generate the execution logs for both versions by crawling the same URLs (that is, mimicking prod) and observe that the number of log lines generated for version 1 is smaller than that for version 2 (cf. Table 1). The execution logs for version 1 and version 2 are templatized using \({\text{templates}}_{v1}\) and \({\text{templates}}_{v2}\), respectively. Around \(12\%\) of the log lines are not templatized and hence are clustered, yielding 80 clusters. The clusters are further refined and grouped using weighted edit distance, reducing their number to 26. The non-templatized log lines are matched with the templates generated after clustering, so that every line in the execution logs is templatized for both versions. We discover the execution behavior models for both versions, \(\mathsf{EBM}_{1}\) and \(\mathsf{EBM}_{2}\), and refine them using the multimodal approach.

The behavior models show 53 added and 47 deleted vertices in \({\mathsf{EBM}}_{2}\) as compared to \({\mathsf{EBM}}_{1}\). For every pair of added and deleted vertices, the text similarity is calculated from the source code. The text similarity is found to be above the threshold (here, 0.8) for 36 out of 2491 pairs, and for these pairs the vicinity is compared in \({\mathsf{EBM}}_{1}\) and \({\mathsf{EBM}}_{2}\). The vicinity similarity is also found to be above the threshold (that is, 0.5) for all 36 candidate pairs. Thus, these vertices are marked as the same templates across the two EBMs. For better understanding, the diff refinements file is made publicly available at the link (see Footnote 6). As a result, all the templates which were captured as new templates because of the refactoring got mapped to the corresponding old templates, reducing the number of differences significantly. The refined \({\mathsf{EBM}}_{2}\) is compared with \({\mathsf{EBM}}_{1}\) to identify and analyze the differences. The final refined model with diffs is made publicly available (see Footnote 6).

We observe several differences, which are grouped into cohesive regions using the approach discussed in Sect. 3.3.

We identified one region which is explained in Sect. 2. In an additional region we observe (i) the deletion of the vertex corresponding to the template “Using queue mode : byHost” (though present in the source code of both versions), and (ii) a significant change in the distribution of vertex T1135, such that the edge \(T1135 \rightarrow T1131\), traversed only twice in \({\mathsf{EBM}}_{1}\), has been traversed 601 times in \({\mathsf{EBM}}_{2}\). Both observations relate to FetcherThread.java, which we investigated manually, identifying a bug in the way URLs are redirected. Instead of following the correct redirect link, the code was following the same link over and over again. After the maximum number of retries is exceeded, further processing of the URL stops with the message T1131 (“- redirect count exceeded *”), thus increasing the traversal frequency of this edge. This bug had already been reported as NUTCH-2124Footnote 7 and attributed to the patch commit we are analyzing. This validates the findings of our approach and highlights its usefulness. Therefore, using our approach we not only detect differences but also provide the context needed to derive actionable insights.

Fig. 5. Annotated execution behavior model highlighting the regions of diff for the internal analytics application. The grey part is common to \({\mathsf{EBM}}_{d}\) and \({\mathsf{EBM}}_{n}\), the dashed part is present only in \({\mathsf{EBM}}_{d}\) but not in \({\mathsf{EBM}}_{n}\), and bold edges/double-encircled vertices are present in \({\mathsf{EBM}}_{n}\) but not in \({\mathsf{EBM}}_{d}\).

4.2 Experimental Results for the Industrial Application

The \(\mathsf{EBM}\) generated automatically by our code is shown in Fig. 5 with annotations. Grey denotes the part which is common to \({\mathsf{EBM}}_{d}\) and \({\mathsf{EBM}}_{n}\), dashed denotes the part which is present only in \({\mathsf{EBM}}_{d}\) but not in \({\mathsf{EBM}}_{n}\), and bold edges/double-encircled vertices correspond to the part which is present in \({\mathsf{EBM}}_{n}\) but not in \({\mathsf{EBM}}_{d}\). We selected two code revisions (referred to as v1 and v2) of the project such that they capture the different kinds of code changes possible in a software development cycle. As shown in Fig. 5, our approach detected six different regions of change between the two revisions, which we explain below.

Region 1: The T58 template is present in both source code versions but is not observed in \({\mathsf{EBM}}_{n}\). Manual inspection of the code and commit history reveals that this is actually a bug caused by a faulty regular expression match, due to which one conditional statement is skipped completely.

Region 2: A shift in distribution between the edges (EXT0, T1) and (EXT0, T2), that is, an increase in transitions to T1 by a factor of 8. Manual inspection reveals that the cause of this anomaly is a wrong Boolean condition check, which caused the distribution to be flipped between two conditional statements.

Region 3: Many new nodes appear in \({\mathsf{EBM}}_{n}\) because a new Java class is added (identified during manual investigation) which gets invoked in the new version, i.e., this is an evolutionary design change. The addition of T0, however, is not strictly related to this change. It belongs to the class that invokes the new feature but was placed in Region 3 alongside the new class because of its close proximity.

Region 4: It has only one change, viz. the addition of the edge \(T50 \rightarrow T78\) and an accompanying decrease in the frequency of \(T76 \rightarrow T78\). Manual investigation highlights that T78 corresponds to a new exception check added in the class containing T50; thus whatever is not caught at the T50 level is caught at T76.

Region 5: The main change is the addition of node T40 and the disappearance of nodes T44 and T45. Both T44 and T45 are exception nodes which exist in both code revisions, while T40 is a new node. Manual inspection reveals that this change is actually the result of a bug fix, that is, for an ArrayOutOfBounds exception. This validates that the bug fix is working as intended.

Region 6: Two new nodes appear in \({\mathsf{EBM}}_{n}\); investigation of the revision history reveals that a new function containing two print statements was added and is invoked just after T51 in the code, thus an evolutionary change.

To summarize, our approach successfully detected all seven regions of code change between the two code revisions. It coalesced two of the regions (into Region 3), but this does not affect the usability of our approach, as these regions are in close proximity. Manual investigation of the diff regions in the \(\mathsf{EBM}\) highlights regression bugs as well as validates the evolutionary changes.

5 Discussion

Based on project requirements, the information from the proposed approach can be leveraged in different ways to help improve the continuous deployment process. It provides additional insights (not intended to replace existing practices) for some of the use cases discussed below:

  • Go/No-go during Release Decision Making: Most software companies have the concept of a go/no-go meeting before a production release, where product, development and operations managers get together to decide whether to go ahead with the release of the newer version or not. Our approach provides a way to visualize the differences between code versions graphically, making them easily consumable for decision-making.

  • Update Test Suite to Cover Modified Execution Flows: The proposed approach identifies regions of differences between two execution behavior models, which therefore merit comprehensive review. Regression testing can be performed for the diff regions instead of testing the whole application, making regression testing leaner and at the same time more reliable and effective.

  • Optimal Test Suite Coverage: By looking at the discovered execution flow graphs, it is possible to identify the code paths which are frequently taken at runtime, and tests can be designed accordingly. This ensures that the more frequent paths are tested rigorously using sophisticated techniques.

The approach assumes the presence of an identifier (i.e., a thread ID) to capture the trace of an execution. Since a thread ID is often present in execution logs [32], it is fair to make this assumption. When an identifier for an execution is not present, the execution behavior model can be mined using other techniques [20].

To keep our approach language independent and lightweight, we do not use static analysis techniques [19]. Moreover, static analysis would not capture the complete reality of the execution behavior, which is influenced by the production configuration.

We mine the differences between the execution behavior models of two versions. However, we do not associate the differences with a change type, such as bug fixing or feature addition. Such a classification would not only help in the quick resolution of bugs but would also act as an additional check of whether all release items have been properly taken care of before signing off on the deployment.

6 Threats to Validity

Threats to Construct Validity: Construct validity focuses on the relation between the theory behind the experiment and the observation [29]. The performance of the approach depends on the pervasiveness of logging; hence, if there are few logging statements, it may not be possible to derive useful inferences. However, given that logs are a primary source for problem diagnosis, sufficient logging statements are typically written in software [6].

Threats to External Validity: External validity is concerned with the generalizability of the results to other settings [29]. We conducted experiments on one open source and one proprietary project to illustrate the effectiveness of the approach. However, both are Java-based projects using the Log4j library for logging and are thus very similar in terms of logging practice. While the approach does not make any project-specific assumptions, it is possible that its performance varies with different project characteristics. The accuracy of the multimodal approach depends on the thresholds and thus can vary across projects.

7 Related Work

Execution logs have been extensively studied in such contexts as anomaly detection [4, 20], identification of software components [24], component behavior discovery [17], process mining [28], behavioral differencing [12], failure diagnosis [25], fault localization [30], invariant inference [5], and performance diagnosis [10, 26]. In this section, we focus on automatic analysis of execution logs.

Goldstein et al. [12] analyze system logs, automatically infer Finite State Automata, and compare the inferred behavior to the expected behavior. However, they work on system logs with predefined states while we identify these states (templates) first. Moreover, they present the differences as independent units whereas we group them together rendering the representation more usable.

Cheng et al. [7] propose to extract the most discriminative subgraphs which contrast the program flow of correct and faulty executions. Fu et al. [10] derive a Finite State Automaton to model the execution path of the system and learn it in order to detect anomalies in new log sequences. However, these are supervised approaches assuming the presence of ground truth for correct and faulty executions to learn a model. Nandi et al. [20] detect anomalies by mining the execution logs in a distributed environment; however, anomalies are detected within the same version, with no differencing between the flow graphs of two versions.

Tarvo et al. [26] automatically compare the quality of the canary version with the deployed version using a set of performance metrics such as CPU utilization and logged errors; however, they do not detect the differences in execution flow which are crucial for finding discrepancies. A set of techniques compares multiple versions of an application. Ramanathan et al. [22] consider program execution in terms of memory reads and writes and detect the tests whose execution behavior is influenced by these changes. Ghanavati et al. [11] compare the behavior of two software versions under the same unit and integration tests. If a test fails in the new version, a set of suspicious code changes is reported. This approach works best when comprehensive test suites are available, which is not the case in the considered Agile environment.

Beyond the specifics of the execution log analysis, our work can be positioned in the context of continuous deployment. Continuous deployment can be seen as an extension of continuous integration [33] and the following step on “the stairway to heaven”, the typical evolution path for companies [21].

8 Conclusion and Future Work

We have presented an approach to efficiently highlight the differences in execution behavior caused by incremental changes in fast-evolving applications. We automatically discover the execution behavior models of the deployed and the new version by mining the execution logs and refining the models using a multimodal approach. The models are compared to automatically identify the differences, which are presented as cohesive diff regions. Since we use a graphical representation, we not only identify diff regions but also provide the context to facilitate in-depth analysis.

Our preliminary evaluation on the open source project Nutch and an internal log analytics application illustrates the effectiveness of the approach. Using our approach, we were able to detect multiple bugs introduced in the new versions of both applications. Following the analysis, we found that some of the detected bugs had already been reported in the respective issue tracking systems; we therefore reported the remaining ones, which were later fixed by the developers.

As part of future work, we plan to evaluate the approach on several other applications. We also plan to automatically classify identified diff regions as anomalies and to drill down to the root-cause commit(s) using the revision history.