Assessing whether the selected agility measurement tools are valid requires applying them to real software development teams. According to Runeson and Höst, a case study is “a suitable research methodology for software engineering research since it studies contemporary phenomena in their natural context”. A case study was therefore selected as the most suitable approach.
2.1 Subject Selection
Company A is a United States company that operates in the point-of-sale (POS) domain. It has four teams, each with a mix of developers and testers. The teams do not follow a specific agile methodology; instead, each uses a tailored mix of the best-known ones that suits its needs. Methodology A, as we call it, embraces practices from the various agile methodologies, some to a larger and some to a smaller extent. The analysis process created by Koch was used to identify these methodologies. The practices were identified by observing and understanding how the teams work.
2.2 Data Collection
An online survey was considered the best option for collecting the data, since each subject could answer it easily.
For each of the tools, four surveys were created (one for each team). Data collection lasted about one month, with the surveys for each tool conducted at ten-day intervals. None of the subjects was familiar with any of the tools.
Two subjects were asked to answer the surveys first, in order to detect questions that could cause confusion and to see how much time was needed to complete a survey. Once the issues they pointed out were fixed, the surveys were sent to the rest of the company’s employees.
The links to the surveys were sent to the subjects via email, and they were asked to spend 15–20 min replying to each survey. Employees who belonged to more than one team were asked to take the other team’s survey a couple of days later, in order to verify that their answers matched in both surveys.
OPS agility measurements are based on three aspects: Adequacy, Capability and Effectiveness. The Effectiveness measurement focuses on how well a team implements agile methodologies. Since the other tools focus on the same thing, only the Effectiveness survey was used, and the Adequacy and Capability aspects were not taken into account.
The surveys for PAM, TAA and OPS were answered on a 1–7 Likert scale, ranging from never doing what is asked in the question (1) to always doing it (7).
The employees asked to answer the surveys were all members of the software development teams, which consisted of software and QA engineers. All participating employees had been with the company for over a year, and most had more than five years of work experience in an agile environment. Employees who had been working at the company for less than six months were not asked to participate, since they were considered not yet fully aware of, or familiar enough with, the company’s procedures. Each participant answered 176 questions in total. Initially, 34 surveys were expected to be completed, but only 30 were, since some employees chose not to participate.
2.3 Data Preparation
The three tools have different numbers of questions and cover different practices. For this reason, we grouped the questions based on the practices/areas to which they belong.
Team Agility Assessment – Areas. Team Agility Assessment (TAA) does not claim to cover specific agile practices, but rather areas that are important for a team. It focuses on product ownership for Scrum teams, but also on release and iteration planning and tracking. The team factor plays a major role, as do the development practices and the work environment. Automated testing and release planning are important here as well.
Perceptive Agile Measurement – Practices. The Perceptive Agile Measurement (PAM) tool focuses on the iterations during software development, but also on the team members’ stand-up meetings, their co-location and their retrospectives. Access to customers and their acceptance criteria are highly important as well. Finally, continuous integration and automated unit testing are considered crucial for being agile.
Objectives, Principles, Strategies (OPS) – Practices. The Objectives, Principles, Strategies (OPS) Framework is the successor of the Objectives, Principles, Practices (OPP) Framework. OPP identified 27 practices as implementations of the principles, which were later transformed into 17 strategies.
Practices Covered Among The Tools. We have abstracted some of the OPP practices to OPS strategies in order to avoid repeating the mapping of the questions. The connection between the practices and the strategies is made on the basis of the questions of each tool.
Mapping of questions among tools. PAM divides its questions by agile practice, whereas TAA divides them by the areas it considers important. Although all practices/areas from PAM and TAA are mapped onto OPP and OPS, not all of their questions fall under OPP practices or OPS strategies. This can be explained by the different perspectives of the tools’ creators and by what each considers important for an organization/team to be agile.
2.4 Data Analysis
The data gathered from the surveys were grouped on the basis of the practices covered by the OPP, and as a consequence, the OPS.
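The grouping step can be illustrated with a minimal Python sketch. It is not the procedure used in the study: the question identifiers, the question-to-practice mapping and the choice of averaging a respondent's answers within a practice are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical mapping from (tool, question id) to a practice.
# The question ids are invented; only the practice name appears in the text.
QUESTION_TO_PRACTICE = {
    ("PAM", "q1"): "Smaller And Frequent Product Releases",
    ("PAM", "q2"): "Smaller And Frequent Product Releases",
    ("TAA", "q7"): "Smaller And Frequent Product Releases",
    ("OPS", "q3"): "Smaller And Frequent Product Releases",
}


def group_by_practice(answers):
    """Average each respondent's Likert answers per (tool, practice).

    `answers` is an iterable of (respondent, tool, question, score) tuples.
    Questions that are not mapped to any practice are skipped.
    """
    buckets = defaultdict(list)
    for respondent, tool, question, score in answers:
        practice = QUESTION_TO_PRACTICE.get((tool, question))
        if practice is None:
            continue  # question falls outside the mapped practices
        buckets[(tool, practice, respondent)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


# Invented answers on the 1-7 Likert scale (not study data).
answers = [
    ("r1", "PAM", "q1", 6),
    ("r1", "PAM", "q2", 4),
    ("r1", "TAA", "q7", 5),
    ("r1", "OPS", "q3", 7),
    ("r1", "TAA", "q99", 2),  # unmapped question, ignored
]
grouped = group_by_practice(answers)
```

Keying the result by tool, practice and respondent keeps one value per respondent for every tool, which is the shape needed to compare the same practice across tools.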
Convergent Validity Analysis. Since all the tools claim to measure agility, they should, by definition, yield similar results if convergent validity exists among them.
In similar studies [22, 23], correlation analysis was selected as the best way to compare similar tools, and the same approach was followed here. We took the practices covered by each tool and checked whether they correlate with the same practices from the other two tools. The idea is based on the multitrait-multimethod matrix presented by Campbell and Fiske. The matrix is the most commonly used way of establishing construct validity.
To choose a correlation analysis method, the data were checked for normality using the Shapiro-Wilk test, which is the most powerful normality test according to a paper by Razali and Wah. The chosen alpha level was 0.05, as it is the most common one.
Out of the 42 normality checks (three for each of the 14 practices), only 17 concluded that the data are normally distributed. Because most of the data were not normally distributed, Spearman’s rank correlation coefficient, which is better suited to non-parametric data, was considered more appropriate than Pearson’s product-moment correlation.
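A normality check of this kind can be sketched in Python with SciPy's `scipy.stats.shapiro`. The sample values and practice labels below are invented for illustration and are not the study data. Only the alpha level of 0.05 comes from the text.

```python
from scipy.stats import shapiro

ALPHA = 0.05  # significance level stated in the text


def is_normal(sample, alpha=ALPHA):
    """Return True when Shapiro-Wilk does not reject normality at `alpha`."""
    _, p_value = shapiro(sample)
    return bool(p_value >= alpha)


# Hypothetical per-respondent practice scores (not study data).
samples = {
    ("PAM", "practice A"): [3.5, 4.0, 4.3, 4.6, 4.9, 5.1, 5.4, 5.7, 6.0, 6.5],
    ("TAA", "practice A"): [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 7.0, 7.0],
}

# Count how many (tool, practice) samples pass the normality check.
normal_count = sum(is_normal(values) for values in samples.values())
```

Counting the passing samples mirrors the tally reported above: when most samples fail the check, a rank-based method is the safer choice.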
Spearman’s rank correlation coefficient requires a monotonic relationship between the two variables. To check for monotonicity, the results of each pair of tools were plotted against each other for all 14 practices. Surprisingly, only eight of the 42 plots were monotonic, which indicates no correlation whatsoever.
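For readers unfamiliar with the coefficient, the following dependency-free Python sketch shows how Spearman's rho is obtained: both samples are converted to ranks (tied values share the mean of their ranks) and Pearson's correlation is computed on the ranks. The function names and the example scores are illustrative assumptions and not part of the study.

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of equal values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Invented scores for one practice as measured by two tools (not study data).
tool_a = [2, 3, 4, 5, 6]
tool_b = [1, 2, 4, 6, 7]      # rises whenever tool_a rises
tool_c = [7, 6, 4, 2, 1]      # falls whenever tool_a rises
rho_increasing = spearman_rho(tool_a, tool_b)
rho_decreasing = spearman_rho(tool_a, tool_c)
```

A strictly increasing relationship gives a rho of 1 and a strictly decreasing one gives -1, regardless of the spacing between the scores. When the relationship is not monotonic, the coefficient no longer summarizes it well, which is why the plots were inspected first.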
Direct Match Questions Analysis. We wanted to find which questions are the same among the tools. To achieve this, the mapping described in Subsect. 2.3 was used. Afterward, the questions were checked one by one to identify those with the same meaning. Once the groups of equivalent questions were finalized, we asked the same employees who had taken the pilot surveys to verify that the groups were correctly formed. They confirmed this, so we continued by checking whether the subjects’ answers were the same. Surprisingly, OPS–TAA have 20 questions with the same meaning, while OPS–PAM and TAA–PAM have only four and three, respectively.
Out of the 35 normality checks (two for each group and three for one group), only two concluded that the data are normally distributed. Since the samples are also independent (they do not affect one another), there is a strong indication that the Mann–Whitney U test is appropriate. For the group Smaller And Frequent Product Releases, we used the Kruskal–Wallis one-way analysis of variance, which is the corresponding method for more than two groups.
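The choice between the two tests can be sketched in Python with SciPy's `scipy.stats.mannwhitneyu` and `scipy.stats.kruskal`. The helper name `compare_groups`, the two-sided alternative, the alpha level of 0.05 and the sample answers are assumptions made for illustration. The sketch relies on the standard null hypothesis of both tests, namely that the groups come from the same distribution.

```python
from scipy.stats import kruskal, mannwhitneyu

ALPHA = 0.05  # assumed significance level for this sketch


def compare_groups(*groups, alpha=ALPHA):
    """Return (p_value, fail_to_reject) for two or more independent samples.

    Two groups use the Mann-Whitney U test; three or more use Kruskal-Wallis.
    """
    if len(groups) < 2:
        raise ValueError("at least two groups are required")
    if len(groups) == 2:
        result = mannwhitneyu(groups[0], groups[1], alternative="two-sided")
    else:
        result = kruskal(*groups)
    p_value = float(result.pvalue)
    return p_value, p_value >= alpha


# Invented Likert answers to questions judged equivalent (not study data).
ops_answers = [1, 1, 2, 2, 1, 2, 1, 1]
taa_answers = [6, 7, 7, 6, 7, 6, 7, 7]
p_two, same_two = compare_groups(ops_answers, taa_answers)

# Three tools answering an equivalent question with identical responses.
p_three, same_three = compare_groups(
    [1, 2, 3, 4, 5], [1, 2, 3, 4, 5], [1, 2, 3, 4, 5]
)
```

In the first comparison the answers are clearly separated, so the null hypothesis is rejected. In the second, the three samples are identical and there is no evidence of a difference.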
The hypothesis in both cases was: