This section presents our results together with the discussion. The section is structured according to the four main research questions (see section 3.1). The results are presented for each research question, followed by a discussion of the results for each main research question under the reflective subsections.
The analysis is based on the raw statements that were directly recorded in the retrospective meetings by the participants using a software tool. The statements, in the raw data, were classified into negative experiences, positive experiences and corrective actions. In addition, the data includes interrelationships between statements and the number of votes each statement received in the retrospective meeting. The quantitative analysis is based on our categorisation of the retrospective statements into process areas and topic types. The combination of process area and topic type is called the discussion topic in this article. See section 3.5.1 for a detailed description of the concepts and classifications used in the analysis.
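The structure of the analysed data can be sketched as a small record type. This is only an illustrative model under our reading of the description above: the class name `Statement`, its field names and the example values (retrospective number, votes, process area) are assumptions for illustration and do not come from the software tool used in the case organisation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Outcome(Enum):
    """Classes already present in the raw data."""
    POSITIVE = "positive experience"
    NEGATIVE = "negative experience"
    CORRECTIVE_ACTION = "corrective action"


@dataclass
class Statement:
    retrospective: int   # index of the retrospective meeting (hypothetical field)
    text: str            # raw statement as recorded by a participant
    outcome: Outcome     # class given in the raw data
    process_area: str    # assigned during the analysis
    topic_type: str      # assigned during the analysis
    votes: int = 0       # votes received in the meeting
    related_to: list = field(default_factory=list)  # linked statements

    @property
    def discussion_topic(self):
        # A discussion topic is the combination of process area and topic type.
        return (self.process_area, self.topic_type)


# Illustrative values only; the process area and vote count are made up.
example = Statement(
    retrospective=12,
    text="too much to do",
    outcome=Outcome.NEGATIVE,
    process_area="implementation work",
    topic_type="task progress",
    votes=3,
)
print(example.discussion_topic)
```

Keeping the discussion topic as a derived property, rather than a stored field, mirrors the definition in the text: it is never assigned independently of the process area and topic type.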
What Discussion Topics are Covered in the Team-Level Retrospectives? (RQ 1)
This section presents our analysis of the topics that the team-level retrospectives in the case organisation discussed during the study period. We divided the analysis into three sub-questions regarding the process areas that the retrospective statements concern, the topic types and the degree of positive and negative discussions. Table 4 presents a summary of the most common positive discussion topics, negative discussion topics, discussion topics that the participants voted for and corrective actions. In the table, common discussion topics (the combination of process area and topic type, see Fig. 3) are described and illustrated together with concrete examples from the retrospective statements. The progress of implementation work was one of the most common discussion topics. The tools and resources & schedules for the implementation work were also commonly discussed both positively and negatively, but they did not result in many related corrective actions. Learning about the implementation work was mostly discussed based on positive experiences and rarely mentioned in the negative, and it resulted in few corrective actions. Software testing tools were commonly discussed based on negative experiences and rarely mentioned in the positive, and they did not result in many related corrective actions. Furthermore, the (lack of) experience with implementation work was a common discussion topic, one that the participants voted as important, but still it was rarely discussed and rarely formed the basis for the presented corrective actions. Most corrective actions were related to work practices.
What Development Process Areas do the Discussions Concern? (RQ 1.1)
The total numbers of retrospective statements recorded during the whole research period are summarised in Table 5. The table also shows the distribution of the statements across process areas and with respect to the positive, negative and corrective action classes. The data reveal that the discussions included statements regarding almost all process areas. The bulk of the statements concern the areas of sprint planning, implementation work and software testing, together comprising 75 % of all statements.
Figure 4 shows how often statements regarding each process area were mentioned in the retrospective meetings. The diagram indicates the percentage of all retrospectives where at least one statement was made regarding a certain process area. All retrospectives included statements related to implementation work. Statements concerning sprint planning and software testing were detected in more than 60 % of the retrospective meetings, meaning that the frequency distribution follows a similar profile over the process areas as the total numbers of statements.
Figure 5 shows how much each process area was discussed in the retrospective meetings. The diagram indicates the average number of statements in the retrospective meetings in which at least one statement was made regarding a specific process area. Participants allotted a similar amount of attention to all process areas. Implementation work also stands out in these data as a topic that received more emphasis. In addition to implementation work, participants discussed and voiced more positive experiences with the product owner and the sales and requirements areas than they did with other areas.
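The two measures behind Figs. 4 and 5 can be illustrated with a short sketch. It assumes that each statement is reduced to a (retrospective, process area) pair and that every retrospective contains at least one statement; the function name `area_metrics` and the toy data are hypothetical and do not reproduce the study data.

```python
from collections import Counter, defaultdict

# Toy data: one (retrospective id, process area) pair per statement.
# The values are illustrative only.
statements = [
    (1, "implementation work"), (1, "implementation work"), (1, "sprint planning"),
    (2, "implementation work"), (2, "software testing"),
    (3, "implementation work"), (3, "implementation work"), (3, "implementation work"),
    (4, "implementation work"), (4, "sprint planning"),
]


def area_metrics(statements):
    """Return {area: (occurrence %, mean statements when mentioned)}."""
    retrospectives = {retro for retro, _ in statements}
    per_area = defaultdict(Counter)
    for retro, area in statements:
        per_area[area][retro] += 1

    metrics = {}
    for area, counts in per_area.items():
        # Share of retrospectives with at least one statement on the area.
        occurrence = 100 * len(counts) / len(retrospectives)
        # Average taken only over retrospectives that mention the area.
        mean_when_present = sum(counts.values()) / len(counts)
        metrics[area] = (occurrence, mean_when_present)
    return metrics


metrics = area_metrics(statements)
for area, (occurrence, mean_count) in metrics.items():
    print(f"{area}: {occurrence:.0f} % of retrospectives, {mean_count:.2f} statements on average")
```

Because the second measure is conditioned on the area being mentioned, an area that is rarely raised can still show an average comparable to a frequently raised one.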
What Topic Types do the Discussions Concern? (RQ 1.2)
The distribution of the retrospective statements across each of the topic types is presented in Table 6. The data show that the topic types most often discussed were work practices, task outcome and progress, and resources & scheduling, as well as cooperation, instructions and learning. The least discussed topics were task risks, task monitoring, policies and customers & users. The task types (denoted with the letter T in Table 6) were most common in both positive (45 %) and negative (39 %) statements, whereas the types of methods (denoted with the letter M) represented almost half (49 %) of the corrective actions.
Figure 6 presents a synthesis of the retrospective outcomes. The three most common process areas in the discussions were closely interconnected to one another with respect to perceived cause-effect relationships, which indicates that the retrospectives also included discussions on the workflow related to planning, implementation and testing. Interestingly, discussions about the general management, the product owner and deployment were disconnected from the other areas.
To What Degree are the Discussion Topics Related to Negative and Positive Experiences? (RQ 1.3)
Overall, negative statements were more common than positive statements, as the total number of positive statements in retrospective discussions comprised only 57 % of the total number of negative statements (see Table 5). All retrospectives included negative experiences related to the implementation work. The detection of negative experiences was frequent in other process areas, too. In contrast, the detection of positive experiences was more frequent for implementation work than for the other process areas (see Fig. 4). The distribution of the positive and negative statements is quite similar throughout the process areas, with the exception of the sales & requirements, general management and product owner areas, each of which received only a few positive statements in the retrospective meetings (see Table 5 and Fig. 4). Fig. 5 reveals, however, that those rare positive statements led to similar amounts of discussion as did other areas brought up in the meeting.
Regarding the topic types, the most positive topics were task outcome, task progress, cooperation and resources & scheduling. These topics were also commonly focused upon in the negative statements, possibly reflecting a certain level of satisfaction when a team succeeds in activities that the team members feel they frequently struggle with. The topics of task progress, learning, cooperation and motivation all received more positive than negative statements; learning especially received only a few negative statements, but it received the second highest number of positive statements. Many purely negative topic types received only a few or no positive statements at all. It seems that the positive statements cover topics that are closely related to a team’s progress and achievements in its implementation work, such as task progress and outcomes. Topics that participants discussed frequently, but mostly in the negative, were instructions, task difficulty, process and experience.
Our results show that team-level retrospectives greatly emphasise topics in the process areas close to the interests of the team members. For each outcome class (positive experiences, negative experiences and corrective actions), the discussions were commonly related to sprint planning, implementation work and software testing. This emphasis seems natural for team-level retrospectives since these process areas are closely related to the everyday work of the development teams.
The existing literature has presented a wide array of factors affecting the software development outcomes (McLeod and MacDonell 2011). Prior studies show that the factors are related to the entire software development lifecycle; for example, the factors of better understanding the sales, user and customers (Keil et al. 1998; Moløkken-Østvold and Jørgensen 2005; Drew Procaccino et al. 2002; Cerpa et al. 2010), of determining the requirements (McLeod and MacDonell 2011), and of focusing on quality control and software testing (Jones 2008; Kaur and Sengupta 2011; Egorova et al. 2010). Based on our results, however, it seems that team-level retrospectives are weak at recognising issues and successes related to the process areas that are external to the concerns of the teams (eg sales & requirements, general management and the product owner). In our results, the three process areas most closely related to the interests of the teams accounted for 75 % of the retrospective statements. Many of the earlier works are review studies or surveys covering multiple cases and organisations, with the aim of collecting the full variety of factors, which explains their stronger emphasis on factors beyond the scope of the development teams. Such studies have collected data from different organisational levels, thus providing a wider, more comprehensive picture of all the factors that affect project outcomes or failures. Our results, in contrast, provide a descriptive account of the factors that are emphasised when the analysis is conducted at a team level by the development team.
Prior studies have divided software engineering issues into internal and external factors based on their controllability for the team members (Xiangnan et al. 2010). We have chosen to use the terms local causes and bridge causes to express the extent to which the factors are dependent upon organisational issues across process areas (Lehtinen et al. 2014a). The team members recognised internal factors based on the planning work, implementation work and software testing. Their findings represent the local causes and bridge causes that have effects across these process areas. Such a rich discussion was not conducted for the process areas external to the interests of the team members. In terms of the external process areas, the team members mainly recognised the existing problems and could not identify any positive experiences. Such problems could be better considered in an organisation-level retrospective because a team-level retrospective might not be an effective forum for discussing problems that are not under the team’s control. In our earlier study, we presented an analysis of the organisation-level retrospectives and identified the areas of product owner, sales and requirements as well as a lack of communication and bridging relationships to development practices as being important (Lehtinen et al. 2015b).
In terms of the outcome classes, the number of positive experiences was less than the number of negative experiences, suggesting that most of the discussion was related to the negative experiences. Participants emphasised people factors most often when recounting positive experiences, whereas they most commonly mentioned methodology factors with respect to negative experiences and the need for corrective actions. The results presented in section 4.1.3 highlight the fact that the positive retrospective statements emphasise learning, cooperation, resources & schedules, task outcome and task progress, ie topic types that characterise people factors and the accomplishments of the team. The negative statements, in contrast, emphasise work practices, resources & schedules, task outcome, tools and task progress, ie topic types that characterise methodology factors and task failures. Our prior study on software project failures recognised that software engineering problems are equally related to people and methodology factors (Lehtinen et al. 2014a). It also found that when selecting improvement targets, management prioritises people factors, whereas the employees emphasise the methodology factors. The results of this study corroborate these earlier findings, showing that the development teams tend to regard people factors as more positive and avoid naming themselves as a target for improvement.
For Which Discussion Topics are Corrective Actions Developed? (RQ 2)
In this section, we present the results regarding the developed corrective actions. First, we present our analysis of the discussion topics that were most often the subject of corrective actions in the retrospective meetings. Second, we provide a description of the process areas and topic types with respect to the corrective actions themselves.
What Discussion Topics Most Often Result in the Development of Corrective Actions? (RQ 2.1)
Tables 7 and 8 show the number of corrective actions that were developed based on the negative statements for each topic type and process area, respectively.
The corrective actions were mostly developed for topics that received numerous negative statements. Negative statements regarding such topic types as task outcome, work practices, task difficulty and instructions received a high number of corrective actions. All topic types that received numerous corrective actions also had a large number of negative statements. However, some topic types received a relatively high amount of negative discussion, but only a few, or no, suggestions for corrective actions. These topic types included cooperation, task progress, task priority, experience, existing product and tools.
Overall, the voting practice that was applied in the retrospective meetings guided the development of corrective actions. The majority (66 %) of corrective actions were developed based on the negative statements that participants voted for.
We also analysed the distribution of the topic types with respect to the negative target statements, ie the negative statements that the corrective actions were developed for. This analysis is relevant since it shows how the corrective actions often represent other types than the actual negative statement the corrective action is developed for. Table 9 illustrates the differences between the targets of the corrective actions and the actual developed corrective actions. Two groups of topic types are highlighted in the table. First, common types of corrective actions that are not commonly listed as the target statements are shown. Second, common target statements that are not commonly listed as corrective actions are shown.
The developed corrective actions were most commonly of the types work practices, cooperation, task outcome and process. Of these topic types, task outcome and work practices were also commonly mentioned in the target statements. The topic types of cooperation and process were virtually non-existent as target statements, even though both represented approximately 10 % of the corrective actions. Similarly, tools and values & responsibility were clearly less frequently discussed as targets than their share of the corrective actions would suggest.
On the other hand, task outcome, task estimation, policies, and task difficulty appeared more frequently as the target statements for corrective actions than as corrective actions themselves.
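The comparison between target statements and corrective actions can be sketched as a difference between two share distributions. The sketch assumes that each corrective action is paired with the topic type of its target statement; the function names and the four example pairs are hypothetical and serve only to show the calculation, not the values reported in Table 9.

```python
from collections import Counter


def shares(labels):
    """Percentage share of each topic type in a list of labels."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: 100 * n / total for label, n in counts.items()}


def share_gap(target_types, action_types):
    """Share among corrective actions minus share among target statements."""
    target, action = shares(target_types), shares(action_types)
    topic_types = sorted(set(target) | set(action))
    return {t: action.get(t, 0.0) - target.get(t, 0.0) for t in topic_types}


# Illustrative (target topic type, corrective action topic type) pairs.
pairs = [
    ("task outcome", "work practices"),
    ("task outcome", "cooperation"),
    ("task estimation", "cooperation"),
    ("work practices", "work practices"),
]
gap = share_gap([t for t, _ in pairs], [a for _, a in pairs])
for topic_type, difference in gap.items():
    print(f"{topic_type}: {difference:+.0f} percentage points")
```

A positive gap marks a topic type that is more common as a corrective action than as a target, and a negative gap marks the opposite case.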
What Types of Corrective Actions are Developed? (RQ 2.2)
In terms of the process areas, the number of corrective actions followed the overall profile for the number of negative statements. The product owner and sprint planning areas represent somewhat higher shares, and implementation and deployment represent lower shares of corrective actions in comparison to the number of negative statements (see Figs. 4 and 5 and Table 5). Table 6 lists the number of corrective actions in each topic type, while Table 7 lists the number of negative statements and corrective actions developed for them in each topic type. The most common corrective actions were work practices. Other common corrective actions were related to cooperation, task outcome, process, resources & schedules, instructions and tools. Some topic types received many negative statements, but did not receive many corrective actions. These topic types include experience, existing product and the topics related to development tasks, ie task progress, task estimation, task difficulty and task priority.
Our results show that in terms of the process areas, the distribution of corrective actions follows the distribution of negative experiences (see Fig. 4 and Table 5). The most common process areas for corrective actions were planning work, implementation work and software testing.
The corrective actions were commonly developed for certain topic types (see Table 7). These included task outcome, work practices, task difficulty, instructions, resources & schedules, and task estimation. One explanation for why these topics attracted many corrective actions is that the team members perceived these topic types as being controllable. The team was able to change its work practices, improve its information exchange, change its estimation methods and consider the available resources and schedules. On the other hand, it is remarkable that there were many discussion topics that only received a few, or no, suggestions for corrective actions at all, even though there were many related negative statements. These included cooperation, task priorities, task progress, existing product, experience and process. It seems that solving these problems would have required external support (eg collaboration with external stakeholders) and business critical decision making (eg refactoring the existing product and reprioritising the development of new features). This supports the hypothesis that in team-level retrospectives, the participants tend to focus on solving problems that they feel are under the team’s control (see section 4.1.4). Some of these problems are also inherently difficult to solve, such as estimations, or those related to task progress, difficulty and priority, which are symptoms of underlying causes rather than controllable root causes.
When we compared the target statements that the corrective actions were developed for and the actual corrective actions (see Table 9), we recognised that the distribution of corrective actions did not follow the distribution of target statements. For example, participants often selected negative task outcome statements as a target for corrective actions, but the corrective actions did not focus on the task outcome quite as often. Instead, the related corrective actions suggested the need to make changes in the work practices, cooperation and process, topics that ultimately caused problems with respect to the task outcome, but that are separate from the task outcome itself. Another example is cooperation, which was never a set target. However, 10 % of the corrective actions were cooperation actions. This indicates that in team-level retrospectives, the root cause analysis does not uncover the actual causes of the negative experiences, but deals more with the visible symptoms. However, as the participants possessed knowledge about the causal mechanisms behind the symptoms, they still felt that they were able to develop the appropriate corrective actions without explicating the exact ‘root causes’ (Lehtinen et al. 2011). As an example, when they were talking about a negative experience regarding poor task estimates, they directly proposed corrective actions for improving the cooperation between product owners and developers, without considering this lack of cooperation as a cause of the poor estimates. This behaviour might be a problem, since we do not know if the implicit inferences regarding the causes are comprehensive or correct.
How do the Discussion Topics Evolve Over Time? (RQ 3)
We studied how the retrospective discussions evolved over time using two approaches. First, we compared the discussion topics throughout the three stages covered in the study timeline, each of which represents a distinct phase in the case organisation’s history (see section 3.2). Second, we identified the recurring topics of discussion persistent throughout the study timeline. The first approach demonstrated at a higher level how the discussion topics changed over time in terms of the process areas and topic types. The second approach yielded a more detailed analysis of the types of discussions that kept recurring over time, how frequently they occurred and the detailed contents of the discussions. Finally, we also identified the corrective actions developed by the participants with respect to those recurring discussions.
How do the Discussion Topics Change Over Time? (RQ 3.1)
Figure 7 shows the most salient changes in the distribution of the retrospective statements over time (see the description of the stages in section 3.2). The figure includes changes where the percentage of statements in a certain topic type or process area changed by nine or more percentage points from one stage to the next during the evolution of the case organisation. Regarding the process areas, see Fig. 7a and b. The figure shows that the share of negative statements and corrective actions in the software testing process area increased greatly in stage 2, while the share of corrective actions remained high in stage 3. During stage 3, the share of negative statements and corrective actions in the implementation area dropped, while the share of statements in the product owner area increased. There were no changes of greater than nine percentage points in the shares of positive statements regarding the process areas.
Regarding the topic types, see Fig. 7c and d. The figure shows that the share of negative statements in the task progress and task outcome types increased during stage 2 and decreased again during stage 3. The share of positive statements regarding task progress followed a similar pattern, whereas the statements on resources and scheduling followed the opposite pattern and were more common as negative topic types in stages 1 and 3 and decreased during stage 2 at the same time that positive discussions on task priority and task progress increased. Other changes in the positive topic types included a decreasing trend with respect to learning and a high share of cooperation discussions in stages 1 and 3, with almost no positive cooperation discussions occurring during stage 2. There were no changes of greater than nine percentage points in the share of corrective actions regarding the topic types.
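The selection rule for the changes shown in Fig. 7 can be sketched as a threshold filter over stage-to-stage differences. The threshold of nine percentage points is taken from the text; the function name `salient_changes` and the share values are hypothetical and are not the measured shares.

```python
THRESHOLD = 9  # percentage points, as stated in the text


def salient_changes(shares_by_stage, threshold=THRESHOLD):
    """Find stage-to-stage changes of at least `threshold` percentage points.

    shares_by_stage maps a category to its shares (in percent) in stages 1, 2, 3.
    Returns tuples of (category, from stage, to stage, change).
    """
    changes = []
    for category, stage_shares in shares_by_stage.items():
        for i in range(1, len(stage_shares)):
            delta = stage_shares[i] - stage_shares[i - 1]
            if abs(delta) >= threshold:
                changes.append((category, i, i + 1, delta))
    return changes


# Hypothetical shares of negative statements per process area.
negative_shares = {
    "software testing": [10, 24, 20],
    "implementation work": [45, 44, 30],
    "sprint planning": [20, 18, 22],
}
changes = salient_changes(negative_shares)
for category, start, end, delta in changes:
    print(f"{category}: {delta:+d} points from stage {start} to stage {end}")
```

Note that the rule compares consecutive stages only, so a gradual drift that stays below the threshold at each step is not reported even if the total change between stages 1 and 3 is large.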
In summary, stages 1 and 2 were characterised by a clear focus on the implementation work in retrospective discussions. Stage 2 further emphasised software testing as an improvement area and the discussions focused more on task progress and outcomes and priorities. The task-related topics were common in both negative and positive discussions. Finally, during stage 3 the emphasis on implementation work decreased and testing improvements continued. In addition, task-related discussions decreased, while negative discussions on resources and scheduling and positive discussions on cooperation took place.
What Discussions Keep Recurring Over Time? (RQ 3.2)
We studied how often discussions kept recurring by identifying similar recurring retrospective statements in the team-level retrospectives. A total of 43 individual statements occurred more than once during the study period. The recurring statements were more common when participants discussed positive experiences (23 % were recurring).
The share of recurring statements when participants discussed negative experiences was 9 %. The recurrence of corrective actions was rare: a total of seven actions recurred twice, while two actions recurred up to three times.
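The identification of recurring statements can be sketched as follows. Two simplifications are assumed here: similar statements are taken to be already normalised to identical text (in the study, similarity was judged in the analysis), and the share of recurring statements is read as the share of recorded statements that belong to a statement occurring in more than one retrospective. The function names and the example records are hypothetical.

```python
from collections import defaultdict


def recurring_statements(records):
    """Map each statement seen in more than one retrospective to those retrospectives.

    records is a list of (retrospective id, normalised statement text) pairs.
    """
    seen_in = defaultdict(set)
    for retro, text in records:
        seen_in[text].add(retro)
    return {text: retros for text, retros in seen_in.items() if len(retros) > 1}


def recurrence_share(records):
    """Percentage of recorded statements that are instances of a recurring statement."""
    recurring = recurring_statements(records)
    hits = sum(1 for _, text in records if text in recurring)
    return 100 * hits / len(records)


# Illustrative records of positive statements; retrospective ids are made up.
positive = [
    (1, "fixed a lot of bugs"),
    (2, "fixed a lot of bugs"),
    (2, "good estimates"),
    (3, "fixed a lot of bugs"),
]
print(recurring_statements(positive))
print(recurrence_share(positive))
```

Running the same functions separately on positive statements, negative statements and corrective actions yields one share per outcome class.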
We analysed in detail the ten statements that recurred most often to identify how frequently specific discussions kept recurring in the retrospectives. Each of the ten statements occurred in more than 10 % of the retrospectives. Table 10 summarises the recurring statements that form three distinct recurring discussions. These discussions consisted of the explicit retrospective statements and represent more specific instances than the discussion topics. Figure 8 presents the occurrence distribution for the ten statements that recurred most often throughout the retrospective timeline.
All three discussions included both positive and negative statements. The discussions on the state of bug fixing were related to the number of open defects and how successfully they were fixed. The recurring retrospective statements were as follows: fixed a lot of bugs (repeated in 38 % of the retrospectives), a lot of bugs (18 %) and low bug count (15 %). The discussions on the accuracy of estimations were related to the challenges and achievements in completing the tasks within the sprint schedules. The retrospective statements included completed all tasks (repeated in 45 % of the retrospectives), too much to do (15 %), efforts were not accurate (35 %) and good estimates (30 %).
Regarding the need for clarifying instructions, the retrospective statements were as follows: I got help when I needed it (repeated in 15 % of the retrospectives), incomplete specifications (13 %) and lack of information on how the system should work (13 %). These discussions obviously concerned missing instructions for the development tasks and a lack of information on the requirements.
What Corrective Actions are Developed for the Recurring Discussions? (RQ 3.3)
A total of 19 % of all corrective actions were developed for the recurring discussions. Table 11 presents the developed corrective actions and the related recurring discussions (corrective actions were only developed for the negative experiences). The greatest number of corrective actions were developed for the discussions on estimation accuracy (8 % of all corrective actions) and the state of bug fixing (6 %). Despite the developed corrective actions, the statements on the recurring topics kept repeating themselves.
Based on the results presented in this section, we conclude that the discussion topics varied over time and they also reflected the evolution of the development organisation. In a longitudinal analysis, stage 1 represents a small software organisation with only one development team. The organisation adopted Scrum at the beginning of the observation period. The retrospective discussions focused on instructions, learning, cooperation and tools, which are logical topics for an incipient software organisation. During stage 2, the organisation consisted of two software development teams and faced the challenge of rapid growth during the second year. In comparison to stage 1, the retrospective discussions emphasised task outcome and task progress. In terms of the task outcome, the team members started to discuss the increasing number of software defects; they often mentioned the problem of ‘a lot of bugs’. They also repeated the negative statement on task progress: ‘too much to do’. Furthermore, stage 2 discussions rarely referred to positive experiences regarding task completion or progress. These problems eventually drove the organisation to lengthen the sprints to 3 weeks. During stage 3, the organisation had six teams and the duration of the sprints was further lengthened to 4 weeks by adding one additional week for testing. During that time, the recurring discussions on the high amount of bug fixing and open defects ended. During stage 3, the retrospective topics included somewhat similar topic types as in stage 2, but cooperation became a common topic type once again. Positive experiences with software testing also became more common. Additionally, the team members kept repeating the following points: ‘I got help when I needed it’ and ‘completed all tasks’. The statements reflect positive software engineering experiences important for successful development work.
Our analysis revealed that certain discussions recurred over a long period of time. In the context of lean development, the problem of repeatedly regenerating the same list of retrospective findings is also recognised (Poppendieck and Poppendieck 2007). The recurring discussions in this case had to do with estimation accuracy, the state of bug fixing and the need for clarifying instructions. We hypothesise that the problems behind these three repeating discussions are different in nature.
We call the first recurring problem type the ‘trivial scapegoat’ problem. For example, the striving for better specifications and information about the system was such a problem. Here, a rather simple remedy is desired for a complex problem. This discussion kept recurring because a simple problem was used to conceal the more complex phenomena behind it, and the root causes could not be solved without tackling the complex phenomena. In this case, the actual problem was related to poor communication and understanding of the system features. This problem arose from multiple levels, starting with the need to gather the required information from customers and understanding it correctly at the product-owner level. Furthermore, the communication problems between product owners and development teams increased the overall lack of required knowledge among the development teams. For this problem, a trivial scapegoat was presented in the retrospectives by stating that the specifications are poor or lacking. This problem kept recurring because the underlying challenges were much more complex than simply improving the specifications and documentation; they involved, for example, working practices and resourcing.
The second type of recurring problem is the ‘unsolvable problem’. The estimation accuracy discussion was a combination of the previously described trivial scapegoat problem and the unsolvable problem. Poor estimation accuracy is a complex problem that includes many factors that are uncontrollable at the team level. The whole concept of estimations was somewhat unclear for the team members. They thought that development would be much easier if the estimates were more accurate. However, this seems to be just an easy way of concealing a complex problem, such as a lack of communication, resourcing challenges and scheduling problems; thus, participants used estimation accuracy as a trivial scapegoat for a larger problem. In addition, estimation accuracy also seemed to be an unsolvable problem. The inherent uncertainty of task estimations and constant changes in the estimated tasks cause inaccuracy. Considering all the uncertainties of software development and the organisational communication challenges, it is unrealistic to assume that the developers or product owners would be able to achieve highly accurate task estimates. The high number of developed corrective actions without any significant effect during the course of 3 years corroborates the fact that the estimation problems were extremely difficult for the team members to solve.
The third type of problem leading to recurring retrospective discussions was a ‘naturally recurring problem’. An example of this type of problem is the high number of bugs. In software development, bugs occur naturally and constantly, and the development team is well in control of temporarily improving this problem and decreasing the number of open bugs by investing in bug fixing and early testing activities. However, the problem kept recurring since the teams were incapable of solving the real causes due to external factors, including overly tight schedule pressures. The related discussions lasted for approximately 2 years. After the organisation invested enough effort in software testing and bug fixing, the problem stopped recurring in the discussions.
A fifth of all corrective actions were developed as a result of the recurring discussions (see Table 11). The team members developed the highest number of corrective actions for the estimation problems, but despite the repeated discussions and numerous corrective actions, the estimation accuracy did not improve. The team members also developed a high number of corrective actions for the software quality problems and, in contrast to the estimation problems, these actions had an effect and the problems decreased during the observation period. To conclude our analysis of how the retrospective discussions evolved over time, we argue that it is useful to identify the types of recurring discussions in the retrospective meetings. In the cases of trivial scapegoat problems or unsolvable problems, the recurrence could be a sign of unproductive discussions and ineffective corrective actions. In the case of naturally recurring problems, identifying such a phenomenon could help target the analysis towards the issues that cause the recurrences, not just the causes and corrective actions for the problem itself.
How Well do the Retrospective Discussions Correspond to the Development Repository Data? (RQ 4)
We studied how well the retrospective discussions related to the actual development status in the organisation. We analysed the recurring retrospective discussions and compared the changes in discussion topics to the task backlog system and the bug repository (see section 3.4). We focused our analysis on two recurring discussions, estimation accuracy and the state of bug fixing, each of which was described in section 4.3.2. These discussions were selected based on the availability of repository data that could be compared with the discussions.
Figure 9 presents the level of task estimation accuracy in the development sprints and maps it together with the outcomes from the team-level retrospectives. The data cover the timeline from the 20th retrospective (stage 2) to the 37th retrospective (stage 3). The task estimate data were only available for stages 2 and 3, since the company did not record the estimates and actual efforts during stage 1. The results show that the retrospective statements are somewhat contradictory. During stage 2, statements made in the 20th and 21st sprint retrospectives focused on ‘good estimates’, whereas in the 22nd and 23rd sprint retrospectives the statements reflected the opinion that ‘efforts were not accurate’. The estimations were relatively accurate in the 21st sprint and inaccurate in the 23rd sprint, which is in line with the statements. However, the estimation accuracy was similar for the 20th and 22nd sprints, whereas the retrospectives reported opposite perceptions. The 24th sprint retrospective did not reflect the high degree of estimation accuracy measured for that sprint. Furthermore, in the 27th sprint retrospective, participants perceived ‘good estimates’, and in the 28th and 31st sprint retrospectives that the ‘efforts were not accurate’. These results are also contradictory because the estimation accuracy in the 27th sprint was not clearly different from that in the 28th and 31st sprints. Additionally, the retrospective for the 26th sprint found that the estimations were both inaccurate and accurate. Evidently, the retrospective statements do not reflect the estimations for all tasks in a sprint, since the deviation in the estimation accuracy is high, as is visible in the numerous outliers in Fig. 9.
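To illustrate why individual outliers can colour the perception of a whole sprint, the following minimal sketch summarises the per-task estimation accuracy of a sprint. It rests on assumptions that are not taken from the study: accuracy is expressed as the ratio of actual to estimated effort (1.0 being a perfect estimate), the function names `accuracy_ratios` and `summarise_sprint` are hypothetical, and the task data are invented purely for illustration.

```python
from statistics import median


def accuracy_ratios(tasks):
    """Return the actual/estimated effort ratio for each task.

    Assumption: a ratio of 1.0 means a perfect estimate; tasks without
    a positive estimate are skipped.
    """
    return [actual / estimated for estimated, actual in tasks if estimated > 0]


def summarise_sprint(tasks):
    """Summarise the spread of estimation accuracy within one sprint."""
    ratios = accuracy_ratios(tasks)
    return {"median": median(ratios), "min": min(ratios), "max": max(ratios)}


# Invented (estimated_hours, actual_hours) pairs, NOT data from the case study.
sprints = {
    "A": [(4, 4), (8, 10), (2, 2)],
    "B": [(4, 4), (8, 9), (2, 8)],
}

summaries = {name: summarise_sprint(tasks) for name, tasks in sprints.items()}
for name, summary in summaries.items():
    print(name, summary)
```

In this invented example, both sprints have a similar median ratio, but sprint B contains one task that took four times its estimate. A single such outlier could plausibly prompt an ‘efforts were not accurate’ statement even when the typical task was estimated about as well as in sprint A.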
State of Bug Fixing
Figure 10 maps the retrospective statements on the state of bug fixing onto the timeline of open bugs (data from the bug repository). It seems that these statements are in line with the real-world status of open bugs. When the number of open bugs increased (in comparison to the past), the related retrospective found that ‘a lot of bugs’ were present. In contrast, when the number of open bugs decreased, the related retrospective found that the company had ‘fixed a lot of bugs’. Furthermore, the statement ‘low bug count’ occurred when the number of open bugs was approximately 90 or fewer. The actual bug counts were not presented to the teams before the retrospectives. The figure also divides the timeline into stages 1–3. It seems that in stage 3, the number of open bugs stabilised at fewer than 90, which could explain why the number of statements on the high number of bugs also decreased.
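The mapping between statements and open bug counts described above can be sketched as a simple consistency check. This is only an illustration under stated assumptions: ‘in comparison to the past’ is interpreted as the count at the previous retrospective, the threshold of 90 is treated as a hard limit although the text describes it as approximate, the function name `is_consistent` is hypothetical, and the observations are invented rather than taken from the bug repository.

```python
def is_consistent(statement, previous_open, current_open, low_threshold=90):
    """Check whether a retrospective statement matches the open bug counts.

    Assumptions: the comparison point is the previous retrospective, and
    low_threshold approximates the 'about 90 or fewer' level.
    """
    if statement == "a lot of bugs":
        return current_open > previous_open
    if statement == "fixed a lot of bugs":
        return current_open < previous_open
    if statement == "low bug count":
        return current_open <= low_threshold
    raise ValueError(f"unknown statement: {statement!r}")


# Invented observations: (statement, open bugs before, open bugs now).
observations = [
    ("a lot of bugs", 120, 150),
    ("fixed a lot of bugs", 150, 110),
    ("low bug count", 110, 85),
    ("low bug count", 85, 130),
]

results = [is_consistent(*observation) for observation in observations]
print(results)
```

A check of this kind only tests the direction of change or the level of open bugs; it says nothing about how large a change participants needed to observe before they commented on it.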
We analysed the correspondence between the retrospective discussions and the data from the task backlog system and the bug repository. This analysis revealed that certain discussions, such as remarks on the high or low number of bugs, reflected the state of development depicted by the repository data rather accurately. Other discussions, such as the comments on poor estimation accuracy, did not match the situation reflected by the task repository data on task estimates and actual efforts. In our analysis, we identified several phenomena related to these two recurring discussions.
Our first finding was that these recurring discussions were related to different types of problems and that the reasons for the recurrences were different. In the case of the discussions on the high number of open bugs, the problem was easy for the team members to recognise and they were able, at a team level, to react to the problem. The problem kept recurring and the related discussions lasted for approximately 2 years, matching the measured open bug trends. The team members also developed a high number of corrective actions for the software quality problems. In contrast to the estimation discussions, these discussions did reflect reality, and the problem did decrease during the observation period, when the number of open bugs levelled off at below 90 (see Fig. 10). However, we cannot explain why participants felt that fewer than 90 open bugs was a low bug count.
The other recurring discussion, regarding estimation accuracy, concerned a problem that was different in nature. The problem was extremely difficult, if not impossible, for the team to solve. These difficulties with the estimation concept caused team members to use the estimation accuracy as an excuse, a trivial scapegoat. This was visible in the way that participants hoped that better estimates would solve such problems as a lack of communication between the team and product owners. The retrospective discussions on estimation accuracy contradicted the estimation accuracy data from the task repository. In addition, despite the high number of repeated discussions and corrective actions for the estimation problems, the estimation accuracy did not improve (see Fig. 9).
Furthermore, the repeated discussions on the need for clarifying instructions also concerned a problem with complex, uncontrollable factors that were easy to blame in the case of failure. Even though we did not have triangulating data to further explain this finding, our earlier study revealed that the case organisation struggled with collaboration between stakeholders (Lehtinen et al. 2015b). This resulted in a lack of information exchange between sales & requirements, the product owners and developers. The underlying problem was related to the complexity of the business domain, including the high number of customers and the fact that third parties were difficult to control and collaborate with.
The outcome of an individual retrospective provided only a narrow sample of the problems and corrective actions regarding the software development practice. When the retrospective outcomes were combined over a longer period of time, they collectively provided both a more detailed and broader view of the perceived issues and relationships interconnecting the planning work, implementation work and software testing process areas (see Fig. 6). Such a collective software engineering knowledge base has been presented as a valuable asset (Anbari et al. 2008). However, due to the specific temporal context and human factors affecting the outcome of each individual retrospective, the combined overview did not accurately illustrate the current situation. The past was not equal to the present. For example, the combined overview did show that there had been a problem with the high number of bugs, but it did not show that the problem decreased over time. Additionally, the combined overview showed that the teams had problems with estimations. However, the deviations in the overall estimation accuracy (see Fig. 9) reveal that significant differences between the sprints stated to be ‘accurate’ and those stated to be ‘inaccurate’ in fact barely existed.