Despite interest from industry, at the time of writing there is no research that suggests or summarizes how to automatically measure the FKM. To arrive at a broadly accepted definition for the automatic measurement, we investigated what other researchers and practitioners have done in this area. Usually, systematic literature review (SLR) studies are conducted to capture the state of a research topic. However, SLRs focus mainly on research contributions and do not include gray literature (GL) from practice. As a large majority of software practitioners do not publish in academic forums, we also included GL to ensure that the result reflects the current state of practice in this field. Furthermore, the topicality of the subject, its relevance for practitioners, and the low volume of published research indicate that formal literature alone does not suffice to cover the topic.
The multivocal literature review was conducted according to the guidelines of Garousi et al., which also build on the popular SLR guidelines by Kitchenham and Charters. Literature was included if any of the inclusion criteria and none of the exclusion criteria were met (see Table 1).
3.1 Systematic Literature Review
The publications of Forsgren et al. (i.e., the book “Accelerate” and the “State of DevOps Reports”) are listed in Google Scholar and ResearchGate. For the SLR, related research was identified using snowballing, starting from these publications. All 93 unique articles citing the literature about the FKM published by Forsgren et al. were retrieved by following the citation links. Citations from books were not included. 21 articles were not written in English or German and were hence excluded. Only 7 of the 72 remaining articles treated the topic “metrics”, and none of them contained more information about the FKM than already presented by Forsgren et al. As no articles from the SLR were included, no data could be extracted and used in the synthesis.
3.2 Gray Literature Review
For the gray literature review (GLR), Google was used as the search engine because pilot searches showed that no narrower source of information (e.g., only StackOverflow or Medium) yielded results. A pilot search was conducted to find which keywords are used when people talk about the FKM. This was done by retrieving articles that discuss one of the four metrics (searching for “deployment frequency”, “lead time for change”, “time to restore service”, and “change failure rate”) and screening the articles to see how the authors relate them to the FKM. As a result, the following search terms were defined for the GLR.
In contrast to searches within the formal literature, a gray literature search returns a vast number of results. Thus, stopping criteria need to be defined. Google's ranking algorithm aims to return relevant articles ordered by relevance, meaning the most relevant articles appear at the top. Accordingly, the following stopping criteria were applied.
Theoretical saturation: As soon as five articles in a row did not match the “Is about this topic & contains information” inclusion criterion, the next five articles were screened by title only. If they were not relevant, the search was ended.
Effort bounded: After reviewing 100 results for a search term, the search was ended.
Initially, 115 articles/search results were retrieved and screened. Of those 115, 43 were not about the topic and 5 were not in text form. 16 unique articles remained, each containing either a definition or an experience report.
This section presents the results of the multivocal literature review. The full list of retrieved literature is provided online.
Deployment Frequency: 7/16 articles contain a definition for deployment frequency. As this metric is already well defined by Forsgren et al. as the deployment of software to production, the definitions do not diverge widely. They have in common that the “number of deployments/releases in a certain period” is counted. Some state that they only count successful deployments (but “successful” is not defined), and some explicitly mention that they count deployments to production. For the purposes of automated measurement, a deployment is defined as a new release. As this is a speed metric, every deployment attempt is counted as a deployment, even if it was not successful.
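To make this definition operational, the following minimal sketch shows how deployment frequency could be derived from deployment timestamps exported by a delivery pipeline; the function name and the per-week aggregation are our own illustrative choices, not part of any cited definition:

```python
from collections import Counter
from datetime import datetime

def deployment_frequency(deploy_times: list[datetime]) -> float:
    """Average number of deployments per ISO calendar week.

    Every deployment attempt counts, successful or not,
    following the definition adopted above.
    """
    if not deploy_times:
        return 0.0
    # Group deployments by (ISO year, ISO week) and average the counts.
    # Note: weeks without any deployment do not enter the denominator.
    per_week = Counter(t.isocalendar()[:2] for t in deploy_times)
    return sum(per_week.values()) / len(per_week)
```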
Lead Time for Change: 9/16 articles contain a definition for lead time for change. Like the deployment frequency, the definition of Forsgren et al. does not leave much room for interpretation although some deliberately took approaches diverging from that of Forsgren et al. All suggestions based on the original FKM definition measure the time a commit takes until it reaches production, the only difference is how they aggregate (i.e., mean, median, p90 etc.). Today it is default practice to use a version control system for source code. To make an adjustment to the software system a developer has to alter source code and to put it under version control. Hence, the commitFootnote 3 is defined as the “change”. Thus, the lead time is given by the time span between the timestamp of the commit and the timestamp of the deployment, as defined in Sect. 3.3.
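Under this definition, the measurement reduces to pairing each commit with the deployment that shipped it and aggregating the resulting time spans. A sketch using the median, one of the aggregations seen in the articles (all names here are hypothetical):

```python
from datetime import datetime
from statistics import median

def lead_time_for_change(changes: list[tuple[datetime, datetime]]) -> float:
    """Median lead time in hours.

    Each pair is (commit timestamp, timestamp of the deployment
    that first shipped the commit to production).
    """
    spans_h = [(deployed - committed).total_seconds() / 3600
               for committed, deployed in changes]
    return median(spans_h)
```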
Time to Restore Service: 8/16 articles contain a definition for time to restore service. Five of them define it as the mean time for closing an incident in a certain period. One suggests using chaos engineering (i.e., introducing a failure and measuring how long it takes until it is discovered and resolved). Another suggests periodically polling a “status” and recording how long it takes from the moment the status indicates degradation until the degradation is resolved (but does not mention where the status is taken from). The last suggestion, made by two articles, assumes that the time to restore service should be calculated for failed releases and thus suggests identifying “fix releases” and measuring how long it takes from one release to the following “fix release”. The reasons for a failure are manifold and frequently rely on human interpretation of what constitutes a “failure” and a “fix”, which makes it difficult to fully automate this metric. Provided that a team has incident management in place, the calculation via incidents is an interesting approach. Since incident creation could also be automated, this approach allows a mixture of manual and automated failure recognition. For this work, we define the time to restore as the time between the creation and the closing of an incident, as stated by the majority of the articles found. This choice was made because an incident management system is already in place that can be used to gather the data, and it currently appears to be the most reliable data source.
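Given incident records with creation and closing timestamps, the computation itself is straightforward. A minimal sketch under that assumption (the record layout and the mean aggregation are illustrative, not prescribed by the cited articles):

```python
from datetime import datetime
from statistics import mean

def time_to_restore_service(incidents: list[tuple[datetime, datetime]]) -> float:
    """Mean time to restore in hours.

    Each pair is (incident created, incident closed); incidents
    that are still open must be filtered out beforehand.
    """
    if not incidents:
        return 0.0
    return mean((closed - created).total_seconds() / 3600
                for created, closed in incidents)
```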
Change Failure Rate: 9/16 articles contain a definition for change failure rate. The different suggestions are listed below.
Percentage of releases that were followed by a “fix release”.
Count of hot fixes identified in commit messages.
Number of failures detected via monitoring metrics, divided by the number of deployments.
Manual marking of a deployment as successful or failed.
Number of rollbacks divided by the number of deployments.
To measure the change failure rate, it first has to be defined what a change is. In all identified articles, a change is indicated by a deployment. Accordingly, the change failure rate is the ratio of change failures to deployments (see Sect. 3.3). The next challenge is to identify a failure and attribute it to a change. Unlike for the time to restore service, incident management cannot be used for failure detection because, according to Forsgren et al., a change failure includes all cases where subsequent remediation was required. Especially for development teams with good software delivery performance, the team itself will be responsible for the deployment, and any resulting failures will be fixed immediately without an incident ever being created. As we assume a low change failure rate in the context of our case study at Swiss Post, we decided to base our measurements on the manual classification of a deployment as a failure by the development team.
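With deployments as the unit of change and a manual failure flag per deployment, the rate reduces to a simple ratio. A sketch under these assumptions (the record layout and the "failed" flag name are hypothetical):

```python
def change_failure_rate(deployments: list[dict]) -> float:
    """Ratio of failed deployments to all deployments.

    Each deployment record carries a 'failed' flag set manually
    by the development team, as described above.
    """
    if not deployments:
        return 0.0
    failed = sum(1 for d in deployments if d.get("failed", False))
    return failed / len(deployments)
```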
Summary: The velocity metrics are more precisely defined, and thus their automatic measurement is easier and more straightforward to derive. This is also reflected in the articles found. With the toolchain used by the development team, the measurement of the velocity metrics can be completely automated. The stability metrics are less well defined, and unlike the velocity metrics, their boundaries can be drawn less precisely. The literature provided various approaches, but those that would have allowed a fully automated measurement do not capture all relevant aspects of the metrics. For this reason, we have chosen to use only partial automation for measuring the stability metrics. We assume that change failures are less varied than failures in general and thus suggest the creation of a taxonomy of change failures, which will be the enabler for tools and concepts to detect them automatically.