1 Introduction

Schein and Bennis, in their foundational work from the 1960s, introduced the concept of psychological safety (PS) (Schein and Bennis 1965). They posited that PS is crucial for fostering an environment where individuals feel secure and adaptable to evolving organizational demands. Rekindling interest a quarter of a century later, Kahn defined PS as the ability to display and engage oneself without fear of negative consequences to one's self-image, position, or career (Kahn 1990). Edmondson subsequently extended this definition to the team level, conceptualizing PS as the shared belief among team members that it is safe to take interpersonal risks in the workplace (Edmondson 1999b, a). She argued that members of a team with high psychological safety feel confident that they will not be rejected or blamed by other team members for speaking up and making their views and opinions known, feel that it is safe to experiment and take risks, and trust that their peers will engage in constructive dialogue when conflicts or confrontations arise (Edmondson 1999b). Drawing on these arguments, a significant stream of organizational research has shown that PS positively affects various outcomes, including learning and performance, at the individual and team levels (Edmondson and Lei 2014; Newman et al. 2017; Frazier et al. 2017).

Recently, PS has attracted increasing interest among software practitioners and researchers, most notably in the information systems community. Researchers have examined the impact of PS on knowledge sharing (Zhang et al. 2010; Kakar 2018; Safdar et al. 2017), agile practices (Hennel and Rosenkranz 2021; Buvik and Tkalich 2021; Diegmann and Rosenkranz 2017; Thorgren and Caiman 2019), efficiency (Buvik and Tkalich 2021), and overall team performance (Faraj and Yan 2009; Lenberg and Feldt 2018). However, relatively little work has examined how psychological safety can contribute to enhancing software quality. Software quality is of key relevance to software engineering practice and research. In 2018, the Consortium for Information and Software Quality (CISQ) reported that the accumulated cost of poor-quality software in the US was approximately $2.84 trillion (Krasner 2018). The report states that about 60% of software engineering effort is dedicated to finding and fixing defects (Krasner 2018). It also claims that business disruptions caused by software defects can cost between $32.22 and $53.21 a minute (Krasner 2018). A similar study (Stripe 2018) suggests that "bad code" costs $85 billion a year globally. The same study found that software developers spend more than 17 hours a week on maintenance issues related to code quality and approximately four hours a week fixing "bad code" (Stripe 2018).

Although software defects will continue to be part of production code (Krasner 2018), efforts invested in the prevention and resolution of defects should continue. This includes research efforts steered towards understanding how software teams can achieve higher software quality, examining not only technical means (e.g., tools, best practices) but also social enablers, such as collaboration and knowledge sharing, that may play a role in the pursuit of high software quality. Given the well-established evidence on the effect of PS on team efficiency (Edmondson 1999b; Edmondson and Lei 2014; Buvik and Tkalich 2021) and, more broadly, performance (Kim et al. 2020; Baer and Frese 2003; Lenberg and Feldt 2018), it is sound and relevant to explore the interplay between PS and teams' behaviors aimed at enhancing software quality. It is sound because an affinity for high standards is characteristic of high-performing teams (Bush and Glover 2012; Margerison and McCann 1984). Hackman and Hackman (2002) emphasize that one of the key attributes of effective teams is their ability to meet or surpass quality standards. They argue that the core features of an effective team include the ability to "serve their customers well" and "meet[...] or exceed[...] the standards of quality" (Hackman and Hackman 2002). These claims hint at potential parallels: if PS influences efficiency and performance, then PS may also play an important role in teams' efforts to produce software of high quality (Alami and Krancher 2022).

In this paper, we focus on how PS supports teams using agile methods in their efforts to pursue software quality. Agile methods seek to enhance a software development team's agility, defined as its capacity to both initiate and adapt to change through iterative development cycles, self-organizing teams, craftsmanship, and streamlined yet sufficiently coordinated processes (Beck 2000; Cobb 2015; Boehm 2002). Agile software development has not only become mainstream (Dybå and Dingsøyr 2008; Lee and Xia 2010), but it is also widely adopted with the aim of improving software quality. This tendency has been consistently confirmed by surveys and empirical data since 2008 (digital.ai 2021; Ambler 2008; Vijayasarathy and Turk 2008; Rodríguez et al. 2012). While the published data suggest that some teams use agile approaches to increase software quality (e.g., 77% of respondents (Ambler 2008) and 61% (Rodríguez et al. 2012)), the numbers also reveal that utilizing agile methods such as Scrum does not always yield the same outcomes.

PS is especially relevant to agile teams. According to some practitioners, "agile doesn't work without psychological safety" (Clark 2022). Collaboration and creativity are essential for agile teams (West 2002), and agile team members should be able to express their ideas, thoughts, and concerns freely (West 2002). PS provides an atmosphere where team members feel free to share their views and opinions without fear of being judged or retaliated against (Edmondson and Lei 2014). It fosters free and honest communication among team members, which increases trust (Edmondson and Lei 2014). Agile teams must also be versatile and flexible in a fast-paced, constantly changing environment (Mundra 2018). PS allows team members to communicate their worries or doubts, seek support when required, and acknowledge errors without fear of being blamed or punished (Edmondson and Lei 2014). This may foster a culture of continual learning and progress by allowing team members to voice their opinions and learn from mistakes (Alami et al. 2022). Some agile teams are also given the authority to make decisions and accept responsibility for their work (Stray et al. 2018). PS allows team members to openly express their thoughts and ideas, participate in decision-making processes, and accept responsibility for their actions (Edmondson and Lei 2014). This feeling of empowerment and ownership may lead to improved responsibility, motivation, and engagement among team members, resulting in better outcomes (West 2002).

In sum, PS is relevant to agile software development, potentially playing a key role in enabling behaviors aimed at enhancing software quality. However, whether and how PS enables these behaviors remains little understood. Therefore, we ask:

RQ: How does psychological safety influence agile teams' behaviors aimed at enhancing software quality?

Utilizing a mixed-methods (\(\text{Qualitative} \rightarrow \text{Quantitative}\)) research approach (Creswell and Clark 2017), we began our research project with an inductive qualitative phase, conducting 20 interviews to explore behaviors that are promoted by PS and that aim at enhancing software quality. Subsequently, we transformed these findings into hypotheses for a confirmation phase using survey instruments, aiming to test the findings of the qualitative phase in a sample of 423 survey respondents. This mixed-methods approach enabled both the flexibility of uncovering potential themes that had escaped the attention of prior research and the rigor of testing the relevance of these themes in a larger sample.

We contribute to the long-standing effort to understand how software engineering teams can deliver better software by showing a new facet of this endeavor. PS is not only an impetus for higher performance and learning; it also has far-reaching implications for software teams. It advances behaviors (e.g., admitting mistakes and taking initiatives) conducive to achieving software quality. This contribution is part of a broader study on psychological safety in software engineering teams. The first phase of the study focused on the antecedents of psychological safety in agile teams (Alami et al. 2023), while this manuscript reports how PS enhances SE teams' ability to pursue software quality. We opted for this division to keep each examination concise and focused on its own topic. This separation also allowed for a deeper exploration of each topic, contributing to a clearer understanding and more targeted discussions.

In the remainder of this paper, we review related work (Section 2), describe our approach for measuring PS (Section 3) and the methods of our study (Section 4), report our findings (Section 5), discuss implications (Section 6) and threats to validity (Section 7), and conclude the paper (Section 8).

2 Related Work

In our search for related work, we observed a modest increase in studies interested in psychological safety (PS) in software development teams. The available literature has shown interest in the synergy between agile values and PS (Buvik and Tkalich 2022; Thorgren and Caiman 2019; Hennel and Rosenkranz 2021).

It has been suggested that increasing members' sense of PS in agile development teams may improve the effectiveness of these methods (Buvik and Tkalich 2022; Thorgren and Caiman 2019; Hennel and Rosenkranz 2021). Some regard PS as essential for encouraging actions consistent with agile concepts and principles (Buvik and Tkalich 2022; Thorgren and Caiman 2019). Buvik and Tkalich examined how factors such as team autonomy, task interdependence, and role clarity affect PS and, in turn, outcomes like team reflexivity and performance in agile software development teams. They found that autonomy substantially influenced PS, but task interdependence and role clarity did not (Buvik and Tkalich 2022). The hypothesis that team reflexivity mediates the connection between PS and performance was not supported by the data; instead, they discovered a robust and direct link between PS and team performance (Buvik and Tkalich 2022). Another study, by Lenberg and Feldt (2018), suggests that team members' PS and the clarity of team norms affect both team performance and job satisfaction in agile software development teams.

Findings from research by Thorgren and Caiman imply that PS might help reduce the cultural mismatch between agile practices and traditional work settings. In a single case study, they examined how different cultures perceive openness, responsibility, and inclusion, and how these perceptions can affect a Scrum implementation. They discovered that fostering a sense of PS within a team may help reduce conflicts and tensions between agile techniques and principles and the culture of the workplace. They find, for instance, that openness promotes feedback within the team, transparency, and bravery. They also noted that the team became less dependent on management and the Scrum Master in making its own decisions, suggesting that inclusivity breaks typical hierarchical divides and consequently promotes collective accountability (Thorgren and Caiman 2019).

Hennel and Rosenkranz, using three case studies, investigated the effects of social agile practices (such as daily stand-ups, retrospectives, and sprint planning) and PS on the actions of software development teams. They argue that team members' perceptions of PS are crucial in encouraging the adoption of agile approaches. They discovered that when people feel psychologically safe, they are more likely to take part in ceremonies, share their opinions, and contribute to improvement initiatives. Individuals who feel safer in their environment are also more likely to assist one another and share their knowledge. They further observed that PS and the use of agile principles complement one another well (Hennel and Rosenkranz 2021).

Alami and Krancher investigated how Scrum values, principles, and prescribed events promote software quality. Using a two-phase study design, they investigated positive cases (phase 1) and negative cases (phase 2). They suggest that psychologically safe Scrum teams are more inclined to speak out about errors related to software quality. Speaking out, in turn, results in more defects and errors being found and resolved. They further explain that when software developers feel safe, they show more care for the quality of their deliverables, which translates into adequate effort being invested in assuring the quality of deliverables. The analysis of the negative cases (phase 2) echoes the relevance of PS for these traits (i.e., speaking out and caring about quality) to materialize. They found that some cultural constraints and internal team tensions, such as "fear to speak up during the ceremonies" and "uncompromising" leadership, hinder the effects of PS (Alami and Krancher 2022).

Recently, Santana et al. investigated the impact of interpersonal conflicts on the establishment of PS using data from the Q&A community on Stack Exchange (Santana et al. 2023). They found 11 distinct situations involving interpersonal disputes, which represent challenges to fostering PS: divergence of views, communication challenges, time-estimating concerns, lack of good programming methods, difficulties asking for assistance, differing levels of expertise, code review inconsistencies, software development issues, out-of-scope items, the desire for assistance, and ineffective meetings. They also found indicators of a lack of PS: "communicating opinions on work-related issues," "recommending ideas for new projects or changes in procedures," "valuing expressed opinions," "discussing other people's mistakes," "asking questions or uncertainties related to work," and "encouragement and support to take on new tasks or learn something new" (Santana et al. 2023). Their indicators of PS have parallels with our previous work on the antecedents of PS (Alami et al. 2023). In the present work, we establish the relationship between some of these PS traits and teams' ability to achieve software quality.

In sum, despite the enthusiastic interest in PS in teams and organizations shown in the social sciences, relatively little empirical work has translated this concept into the context of software development (Buvik and Tkalich 2022; Frazier et al. 2017). The outcomes evidenced thus far are appealing to explore in agile software teams, particularly the interplay with advancing software quality, given that the latter is among the dominant drivers for agile method adoption (digital.ai 2021; Ambler 2008; Vijayasarathy and Turk 2008; Rodríguez et al. 2012). In addition, reports from Google (e.g., "Project Aristotle") credit PS as a strong contributor to the success of some of Google's software development teams (Duhigg 2016).

3 Measuring Psychological Safety

Edmondson conceptualized psychological safety as a property shared by the members of a team (Edmondson 1999b). She proposed and validated a scale comprising seven items, shown in Table 1. We used Edmondson's scale (Edmondson 1999b) to assess the level of psychological safety of our participants' teams. These measurements have been used and adapted by social science researchers and shown to be a valid method of gauging the psychological safety of a whole team (Newman et al. 2017).

Although other scales to measure PS have been proposed in the social sciences, Edmondson's (1999b) remains widely acknowledged and well-validated (Newman et al. 2017). Psychological safety has been measured at three different levels: the individual, team, and organizational levels (Newman et al. 2017). While individual-level measurements of PS seek to capture individually held perceptions of psychological safety within organizations, most studies measured psychological safety at the team level (Newman et al. 2017). Both individual-level and team-level studies used Edmondson's 7-item scale (Edmondson 1999b); individual-level measurement replaced the referent "team" with "organization" in the items, e.g., (Carmeli et al. 2010; De Clercq and Rius 2007; Madjar and Ortiz-Walters 2009). Brown and Leigh developed a 22-item scale to measure the "psychological climate" related to job involvement, effort, and performance (Brown and Leigh 1996); however, this scale is extensive and covers a broad scope, and hence its adoption is scarce (Newman et al. 2017). Liang and colleagues also proposed their own 5-item scale (Liang et al. 2012). This scale is biased towards promotive and prohibitive "voice," the expression of constructive opinions, concerns, or ideas about work-related issues (Dyne et al. 2003), and diminishes the role of helping each other (e.g., PS5) and punishment avoidance (e.g., PS1). As with Brown and Leigh's (1996) scale, adoption of Liang and colleagues' (Liang et al. 2012) scale is also low (Newman et al. 2017).

Newman et al. reported only two studies using organizational-level measurement (Newman et al. 2017), e.g., (Baer and Frese 2003; Carmeli 2007). Both studies used Edmondson’s (1999b) 7-item team-level scale, replacing the referent “team” with “organization.”

In the context of SE, team-level measures are more pertinent, given that the efforts are collective. In addition, Edmondson’s (1999b) scale has gone through extensive validation in the last two decades and has been acknowledged across many research communities (Newman et al. 2017), offering a reliable tool for gauging the team’s PS. The scale also captures nuances relevant in an engineering context, such as voicing errors and risk-taking. Furthermore, the use of proxy measures might be problematic since they may not align with Edmondson’s exact constitutive definition of psychological safety (Edmondson 1999a). Inconsistency between the theoretical definition and the method used to measure the concept may cause reduced construct validity (Bagozzi et al. 1991).

Table 1 Edmondson's scale to measure psychological safety, adapted from (Edmondson 1999b, 2018)

3.1 Adjustments to Edmondson’s Scale

Although Edmondson's scale is a reliable tool for measuring PS, during Phase I of the study and upon completion of the Phase II pilot, we deemed some adjustments necessary to align the scale with modern SE contexts. The items utilized in both phases of our investigation are detailed in Table 1. In Phase I, we utilized these questions to help guide and structure the interviews. In Phase II, we used them as survey instruments to assess the degree of psychological safety in the respondent's team. As described in greater detail in Section 4, we conducted a pilot test in Phase II and revised PS3 and PS5 based on the results of Phase I and the results of the pilot test. The revised items are provided in the right-hand column of Table 1.

We revised PS3 because the use of "being different" in the original item was interpreted by interviewees to imply ethnic, gender, and other personal differences. Although this could have been the original intent, it did not fully capture the realities of modern SE teams. Our interviewees were surprised by the question, as they fundamentally embrace inclusivity. However, when prompted that "being different" could also imply having different perspectives, ideas, and approaches, they provided more insightful responses directly relevant to SE as a knowledge-based discipline. We concluded that in SE, diverse perspectives, ideas, and approaches are more relevant to innovation, problem-solving, and enhancing software quality. This observation led us to rephrase PS3 to focus on the acceptance of different ideas rather than personal characteristics, to better reflect SE environments.

We also revised PS5, which had a very low factor loading in the factor analysis (see Section 4) of the pilot test data of Phase II. We reasoned that the factor loading may be low because several factors beyond psychological safety, such as remote work settings, make it difficult for team members to ask for help. Therefore, we changed it to “I feel comfortable asking my team members for help,” which puts greater emphasis on the team members’ beliefs about the risks associated with asking for help.

4 Methods

Our pragmatic epistemological stance informs our methodological decisions. According to the pragmatist epistemological tenet, knowledge should be able to affect practice, and research should center on "practical understandings" of tangible, real-world concerns (Patton 2005). Pragmatism, as an epistemology, encourages making decisions based on their relevance to the situation at hand rather than on the researcher's overall philosophical viewpoint (Patton 2005).

To better understand how PS shapes quality-related behaviors, we combined different research approaches. Because there is little research on psychological safety (PS) and software quality in particular, we were unable to use existing theories and literature to inform our assumptions about the role of PS in attaining software quality in agile software development. Therefore, a qualitative exploratory phase followed by a quantitative investigation was appropriate. The first stage allowed us to collect sufficient data to draw tentative conclusions; the second stage expanded the scope of the empirical investigation and validated the results of the first stage.

We opted for a sequential mixed-methods design, in which the results of the qualitative phase inform the subsequent quantitative phase (Creswell and Clark 2017). In the first part of our research, we used a qualitative interview approach, which allowed us to provide in-depth descriptions of how PS affects agile software development teams' pursuit of software quality. We then used these data to inform the second, quantitative phase, which expanded and validated our understanding of the consequences of PS for software quality. Although numerous studies connect PS to team performance and efficiency, directly equating these outcomes with software quality would not only be methodologically incorrect but would also bypass crucial nuances relevant to software engineering. Indeed, prior research has shown that factors contributing to efficiency in software development projects need not contribute to software quality, and vice versa (Gopal and Gosain 2010; Barki et al. 2001). While this speaks for an inductive qualitative study of the mechanisms by which PS facilitates teams' pursuit of software quality, a qualitative study alone might suffer from limited empirical generalizability, especially given our convenience sampling approach for interviews. By synthesizing qualitative and quantitative data and analyses, our mixed-methods approach strengthens empirical rigor in investigating the uncharted relationship between PS and software quality.

4.1 Phase I: Qualitative Study

The goal of Phase I was to gather data based on practitioner experiences. Practitioner experiences are a valuable source of information, and interviewing practitioners is widely employed because they hold an emic viewpoint that is difficult to gain otherwise.

4.1.1 Participants’ Characteristics & Sampling

We recruited participants using two methods: convenience sampling and snowballing (Patton 2014). We leveraged our industry relationships to find software engineers and quality assurance specialists. We received 20 recommendations and successfully interviewed 12 individuals following the screening process. We then asked previously interviewed participants to recommend new participants among their professional contacts. This snowballing strategy resulted in an additional eight participants, for a total of 20 interviewees. The Phase I sample is documented in Table 2. The column "Sampling" refers to the sampling strategy that resulted in the participant's recruitment, and "Exp." refers to the number of years the participant has worked in software development. "Method" refers to the agile methods used by the participants' teams, while "Project type" refers to the type of software developed by those teams. Our sample was confined to software engineering and quality assurance (QA) positions. Software developers design and write software code, whereas QA engineers test the finished product. Both roles have a thorough understanding not just of how to ensure quality but also of the numerous non-technical dynamics that allow them to do so.

Table 2 Phase I interviewees

4.1.2 Data Collection

The data for the first phase were gathered via semi-structured interviews. This approach allows for variation and stimulates lively discussion. Nonetheless, the thematic format of each interview was consistent, since all interviews were based on a standard guide: although the interviews were consistent in their thematic focus and structure, we encouraged dynamic conversation. For example, themes such as the sources of PS in the team and taking risks and initiatives were covered in all interviews. We divided the interview guide into an introduction, a core section, and a section for further probing. The purpose of the introductory questions was to learn more about the respondent, the team, and the level of PS within the group. These initial questions set the stage for the interview's primary focus. We used the core questions to collect information pertinent to our RQ. To ensure that the dialogue went as far and as deep as possible, we prepared probing questions. Even though we prepared in-depth questions for the semi-structured interviews, the conversation naturally flowed in different directions in each one, meaning the researcher asked follow-up questions based on the responses participants gave. Examples of our interview questions can be found in Table 3. The full interview guide can be found in the replication package (see Sect. Data Availability).

Due to the geographical dispersion of our interviewees, we used the online video and audio conferencing service Zoom to conduct all interviews. The interviews ranged in length from 40 to 90 minutes, and their transcripts averaged 17 pages. All interviews were conducted by the first author between January and March 2022. Otter.ai, an online transcription tool, was used to transcribe the interviews. Although the tool is generally accurate, it occasionally makes mistakes; we therefore compared the transcripts to the original recordings and made any necessary adjustments by hand. We asked Phase I participants whether we could share anonymized copies of their interview transcripts (see Sect. Data Availability).

Table 3 Examples of interview questions

4.1.3 Informed Consent

All interviewees were sent a formal consent request via email prior to the interview. They were asked to read the consent form and reply "agree to participate" or "disagree to participate." The consent form covered anonymized data sharing in public repositories and manuscripts, in addition to the voluntary nature of participation, the procedures for ensuring confidentiality, the right to withdraw from the study at any point without adverse consequences, and other ethical considerations as per the first author's university policies, including but not limited to respect for participant privacy beyond the anonymization of data and the secure storage and handling of sensitive information. Consent was obtained from all participants before conducting the interviews.

4.1.4 Data Analysis

For the analysis of the interviews, we followed the recommendations of Miles et al. (2014). They suggest a preliminary stage of analysis (First Cycle), followed by a second stage (Second Cycle) that builds upon the first. In the First Cycle, we selected "chunks" of data pertinent to our RQ and labeled them with codes (Miles et al. 2014). The goal of this coding exercise is to "condense" the data into usable preliminary codes (Miles et al. 2014). In this first round of coding, we used an inductive strategy: instead of starting with assumptions about what the codes should be, we let the initial codes emerge naturally from the data. This is consistent with the inquisitive mindset required at this stage. In the Second Cycle, we combined the separate First Cycle codes into one streamlined system using pattern coding (Miles et al. 2014). The resulting "Pattern codes" group the original codes by code type or by shared themes and ideas, or because they logically represent the same explanations, causes, or relationships among constructs (Miles et al. 2014).

We used a "causal network" analysis (Miles et al. 2014) to map connections between the "Pattern codes." In this analytical activity, we considered how "Pattern codes" affect one another (Miles et al. 2014). Specifically, we discovered that some "Pattern codes" either cause or significantly contribute to the emergence of other "Pattern codes," therefore forming unidirectional linkages. For instance, one of our "Pattern codes" is admitting mistakes, which in turn encourages the team to learn from the mistake (learning from mistakes), which in turn makes previous mistakes points of reference for the team, helping the team avoid future mistakes. Newer versions of the same software, or entirely new products built by the team, benefit from learning from and avoiding the mistakes that have led to defects or other quality concerns in the past. The relationships between the constructs are shown in Fig. 1. The data analysis documentation from Phase I is included in the replication package. Code samples and their corresponding "Pattern codes" are included in Table 4. The first column lists the final Pattern codes, the second column lists the corresponding First Cycle codes, and the column labeled "N" reports the total number of unique instances in which each First Cycle code appeared across all interviews. Each First Cycle code was counted once per interview, regardless of the number of times it was mentioned within that interview. The final column provides examples from the data.
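To illustrate the counting rule behind the "N" column, the following minimal sketch (in Python, over a hypothetical export of our coding; the actual tallying was done manually) counts each First Cycle code at most once per interview:

```python
from collections import defaultdict

# Hypothetical export of the coding exercise: one (interview, first_cycle_code)
# pair per coded chunk; the same code may appear several times per interview.
coded_chunks = [
    ("P1", "admitting mistakes"), ("P1", "admitting mistakes"),
    ("P1", "no blame"), ("P2", "admitting mistakes"),
    ("P3", "learning from mistakes"), ("P3", "no blame"),
]

# Count each First Cycle code once per interview, as done for the "N" column.
interviews_per_code = defaultdict(set)
for interview, code in coded_chunks:
    interviews_per_code[code].add(interview)

for code, interviews in sorted(interviews_per_code.items()):
    print(f"{code}: N = {len(interviews)}")
```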

The first coding cycle was led by the first author. The second and third authors reviewed the codes, offered comments, and proposed new codes. The first author then revised and delivered a complete set of codes. The Second Cycle of coding was carried out by the first author, followed by a review of the suggested "Pattern codes" by the second and third authors. Through a "reliability check" (Patton 2014; Miles et al. 2014), we pooled our coding judgments and settled our discrepancies, resulting in more trustworthy inferences.

Fig. 1 Causal Network (Phase I)

4.2 Phase II: Survey Study

The second part of the study involved testing hypotheses using survey data and expanding the empirical scope. Based on the data gathered and analyzed during Phase I, we formulated hypotheses (see Table 5) about the nature of the relationships between the constructs we had identified as relevant to our RQ. Once all the "Pattern codes" and their connections were established, we generated testable hypotheses. Creswell and Clark (2017) recommend that second-phase hypotheses expand upon the first-phase results. As an example, H2 represents the relationships shown in the aqua-colored box in Fig. 1.

Table 4 Phase I Pattern Codes

4.2.1 Hypotheses formulation

The hypotheses for our RQ are documented in Table 5. To summarize the substance and meaning that emerged from the qualitative research, each hypothesis translates a "Pattern code" and its relations into a proposition. For example, the "Pattern code" speaking up is translated into H1. In Phase I, we learned that speaking up about software quality issues and generating engagement from the team is key to fixing those issues. The qualitative data suggested that when team members speak up, they bring attention to potential quality issues and galvanize team engagement towards addressing them. To further validate this claim in Phase II, we translated this observation into Hypothesis 1 (H1): "High psychological safety in agile teams is positively associated with speaking up more about software quality problems." In this transition from a qualitative observation to the formulation of H1, we ensured authenticity and alignment in our survey questions.

To illustrate another example, consider the Pattern code "admitting mistakes." This pattern was identified from the interview data, where our interviewees discussed how team members openly admit mistakes, fostering an environment where learning from these mistakes is valued. These individual and collective behaviors were reported as integral to enhancing the team's approach to identifying and addressing quality-related issues. For further validation, this insight was translated into Hypothesis 2 (H2): "High psychological safety in agile teams is positively associated with admitting more software quality mistakes." In the survey design, we continued to prioritize fidelity to our interviewees' experiences. The questions we used to measure this construct directly represent the codes associated with the Pattern code. For example, in the case of "admitting mistakes," we articulated items aligned with every code. We used:

  • “Members of my team talk about the mistakes they make related to software quality.”

  • “Members of my team do not get blamed by their team members for mistakes related to software quality.”

  • “I admit my mistakes related to software quality to my team because there are no repercussions; instead, we deal with the situation constructively.”

  • “When mistakes related to software quality are admitted by a team member, we deal with the situation constructively.”

Each hypothesis presented in Table 5 follows a similar rationale, translating observed Pattern Codes into testable propositions that capture the essence of the qualitative findings. This approach ensures that our quantitative analysis in Phase II is deeply rooted in the substantive insights gained from our qualitative inquiry, providing a robust framework for exploring the dynamics of psychological safety within software engineering teams.

Table 5 Phase II Hypotheses
Table 6 Survey instrument for latent constructs

4.2.2 Instrumentation

Following the formulation of the hypotheses, we created a survey comprising 43 Likert-scale questions (see our replication package for the complete survey, Sect. Data Availability). We relied on scale development practices established in social science research, such as developing at least three questions to measure each construct (i.e., each Pattern code from Phase I) and ensuring the fit of the items with the constructs' definitions (DeVellis and Thorpe 2021; Straub et al. 2004). In line with Edmondson's foundational research on PS (Edmondson 1999b), we conceptualized our constructs as team-level constructs. After developing the survey instrument, we ran a pilot test (N = 20) to evaluate its validity, asking the pilot respondents to offer comments on each page of the survey. We evaluated the free-text comments and conducted an exploratory factor analysis of the Likert-scale responses in SPSS v. 28, examining Cronbach's alpha values and factor loadings. Although the majority of the qualitative responses indicated that the survey questions were understandable, we made modest changes to the survey items with low standard deviations, Cronbach's alpha values, or factor loadings.
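As an illustration of the reliability check, the sketch below computes Cronbach's alpha for one construct from hypothetical pilot responses; the actual analysis was performed in SPSS, so the item names and values here are purely illustrative:

```python
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha for a set of Likert items (rows = respondents)."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical pilot responses (1-5 Likert) for a four-item construct.
pilot = pd.DataFrame({
    "am1": [4, 5, 3, 4, 5, 2, 4],
    "am2": [4, 4, 3, 5, 5, 2, 4],
    "am3": [5, 4, 2, 4, 5, 3, 4],
    "am4": [4, 5, 3, 4, 4, 2, 5],
})
print(f"Cronbach's alpha = {cronbach_alpha(pilot):.2f}")
```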

Table 7 Discriminant validity results. Diagonal shows square root of Average Variance Extracted (AVE), all other cells show correlations

After the survey was completed, we used SmartPLS v. 3.3.9 to conduct a confirmatory factor analysis to check the survey's convergent and discriminant validity and its reliability. "Convergent validity" refers to a high degree of correlation between survey items measuring the same construct (such as the four survey questions measuring collective decision-making), while "discriminant validity" describes a lower degree of correlation between survey items measuring different constructs (DeVellis and Thorpe 2021). Fornell and Larcker (1981) state that convergent validity may be shown if the average variance extracted (AVE) for all latent constructs is larger than .50, which we interpreted to mean that the average factor loading for a particular construct must be greater than .708 (Fornell and Larcker 1981). Our first investigation revealed that the second item on the PS scale and the fourth item on the autonomy scale both had problems with convergent validity. Following standard practices for survey research, we omitted these two items from the analysis (Russo and Stol 2021; Straub et al. 2004).

After excluding these items, all AVE values remained above .50, with PS as the only exception (AVE of .41) (see Table 6). Although this finding may imply that PS consists of several dimensions, we continued to conceive of PS as a single construct in order to maintain consistency with previous research on the topic (see Sect. 3), also given that PS had satisfactory Cronbach's alpha (.72) and composite reliability (.81) values.

Discriminant validity was established by checking that all pairwise construct correlations were below the square root of the constructs' AVE (Fornell and Larcker 1981). As Table 7 shows, the outcomes are acceptable. The results of the Heterotrait-Monotrait Ratio test (Hair et al. 2021) also demonstrated discriminant validity (see the detailed results in the replication package described in Sect. Data Availability).
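To make the two criteria concrete: AVE is the mean of the squared standardized loadings, so average loadings of \(\sqrt{.50} \approx .708\) or higher guarantee an AVE above .50, and the Fornell-Larcker criterion requires the square root of each construct's AVE to exceed that construct's correlations with all other constructs. The sketch below illustrates both checks on hypothetical loadings and correlations (the actual checks were run in SmartPLS):

```python
import numpy as np

# Hypothetical standardized loadings per construct (not our actual estimates).
loadings = {
    "PS":         [0.62, 0.68, 0.71, 0.65, 0.60, 0.58],
    "SpeakingUp": [0.78, 0.81, 0.74, 0.79],
}

# Convergent validity: AVE = mean of squared standardized loadings, ideally > .50.
ave = {name: np.mean(np.square(vals)) for name, vals in loadings.items()}

# Discriminant validity (Fornell-Larcker): sqrt(AVE) of each construct must
# exceed its correlations with every other construct.
corr_ps_speakup = 0.43  # hypothetical latent correlation between the two constructs
for name, value in ave.items():
    print(f"{name}: AVE = {value:.2f}, sqrt(AVE) = {np.sqrt(value):.2f}")
print("Fornell-Larcker satisfied:",
      all(np.sqrt(value) > corr_ps_speakup for value in ave.values()))
```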

4.2.3 Sampling

To gather Phase II data, we employed Prolific, a research market platform. This platform offers access to a pre-vetted and diverse participant pool that may be difficult to access otherwise. We chose the Phase II population using purposive sampling, an approach that employs specific features in the selection of participants pertinent to the study's purpose (Patton 2014). As in Phase I, we targeted software development and quality assurance roles. We used a prescreening procedure after the pilot to select the population for the final survey, because the Prolific prescreening data did not fulfill our criteria; furthermore, prescreening gave us confidence in the quality and dependability of our sample. The prescreening survey contained just five questions (Table 8). Prescreening was conducted iteratively: we launched daily prescreening surveys with a response limit of 50 people, limiting the number of respondents so that we could thoroughly examine the prescreening data. The prescreening period lasted from May 2nd to May 26th, 2022. We applied a six-step exclusion procedure to each round of the prescreening.

  • Step 1: We eliminated all entries with a “No” response to Q1.

  • Step 2: We eliminated all entries with an "Other" response to Q2 where the free-text response was not a software development or QA-related role.

  • Step 3: We eliminated all entries with an "Other" response to Q3 where the free-text response was not an agile method.

  • Step 4: We eliminated all entries with “False” responses to Q4.

  • Step 5: We filtered all entries with Q5 answers of "strongly disagree" or "somewhat disagree" and examined the free-text comments (i.e., Q6) to assess the rationale behind the disagreement. None of these cases were genuine responses; they were either ambiguous or incomprehensible, indicating that the respondent was not a genuine software developer or QA, so we eliminated all "strongly disagree" and "somewhat disagree" entries. We then filtered the "neither agree nor disagree" entries and examined their Q6 comments. We first looked for categorical objections to the definition, which would have led to elimination (no such cases were found in any iteration). If the disagreement with the definition was instead based on reservations, the entry was not excluded. For example, a respondent commented: "Software quality cannot be solely described by satisfying the various stakeholders' needs." We deemed this objection rational, as the comment hints that the definition disregards the internal features of software quality, such as code quality and the actual design of the software. In another example, a respondent commented, "regarding performance, it might be incompatible with the remaining non-functional requirements. If we need high-performance software, say for a shuttle launch, we might need to abandon compatibility, security, or maintainability in favor of more performance and less easy-to-read code since it will hinder performance." This is a genuine expression of doubt, qualifying the definition; we concluded that the respondent implied the definition should be fluid enough to accommodate an extended range of cases. To sum up, "strongly disagree" and "somewhat disagree" entries were not genuine responses based on the qualitative comments, whereas respondents who selected "neither agree nor disagree" had genuine reservations about the definition and were not excluded.

  • Step 6: For the remaining entries (i.e., "somewhat agree" and "strongly agree"), we read every comment in the Q6 text box to evaluate its quality and determine whether the respondent met the criteria for Phase II participation. We excluded submissions that were unintelligible, poorly written, factually inaccurate, or otherwise not genuine.

Table 8 Prescreening survey questions
Table 9 Data quality control techniques

The prescreening phase attracted 1,000 respondents. The qualifying process (the steps discussed above) yielded a reliable sample of 480 potential participants. Still, we included additional quality control measures in the main survey to ensure the quality of our data.
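A minimal sketch of how Steps 1-4 of this procedure could be automated over the prescreening export is shown below; the column names, role lists, and method lists are hypothetical, and Steps 5-6 relied on manual reading of the free-text comments and are only approximated here:

```python
import pandas as pd

DEV_QA_ROLES = {"software developer", "software engineer", "qa engineer", "test engineer"}
AGILE_METHODS = {"scrum", "kanban", "xp", "scrumban"}

def apply_exclusion_steps(df: pd.DataFrame) -> pd.DataFrame:
    """Approximate Steps 1-4; Steps 5-6 required manual review of the
    free-text comments (Q6) and are therefore not automated here."""
    df = df[df["q1_develops_software"] == "Yes"]                          # Step 1
    df = df[(df["q2_role"] != "Other")
            | df["q2_role_text"].str.lower().isin(DEV_QA_ROLES)]         # Step 2
    df = df[(df["q3_method"] != "Other")
            | df["q3_method_text"].str.lower().isin(AGILE_METHODS)]      # Step 3
    df = df[df["q4_attention_check"] != "False"]                         # Step 4
    # Steps 5-6: "strongly/somewhat disagree" entries were dropped after manual
    # review; the remaining answers were read one by one before inviting.
    keep = {"neither agree nor disagree", "somewhat agree", "strongly agree"}
    return df[df["q5_quality_definition"].isin(keep)]

# Usage (hypothetical file): candidates = apply_exclusion_steps(pd.read_csv("prescreening.csv"))
```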

4.2.4 Data Quality Control

We developed several measures to guarantee the quality of our data, addressing the potential risks of using a research marketplace. Respondent recruitment using a market research platform should take into account four data quality issues: bots, liars (malingerers), cheaters, and slackers (Lovett et al. 2018; Oppenheimer et al. 2009; Meade and Craig 2012; Chandler and Paolacci 2017; Palan and Schitter 2018). The potential quality problems and the measures we chose and executed to address them are summarized in Table 9. Responses that did not pass our quality checks were removed from the final dataset.
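For illustration, the sketch below shows two generic automated checks often used against slackers and bots, flagging straight-lined answers and implausibly fast completions; the concrete measures we applied are those listed in Table 9, and the column names and thresholds here are hypothetical:

```python
import pandas as pd

# Hypothetical names for the 43 Likert items and the completion-time column.
LIKERT_COLS = [f"item_{i}" for i in range(1, 44)]

def flag_low_quality(responses: pd.DataFrame,
                     min_seconds: int = 180,
                     min_sd: float = 0.25) -> pd.Series:
    """Flag respondents who finished implausibly fast ("speeders") or gave
    near-identical answers to every item ("straight-liners")."""
    too_fast = responses["duration_seconds"] < min_seconds
    straight_lined = responses[LIKERT_COLS].std(axis=1) < min_sd
    return too_fast | straight_lined

# Usage: keep only submissions that pass the automated checks.
# clean = responses[~flag_low_quality(responses)]
```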

Table 10 The characteristics of the sample
Table 11 Descriptive statistics

4.2.5 Data Collection

We invited the 480 prospective candidates after prescreening. The survey opened on May 27, 2022, and ran through June 8, 2022. We received 466 replies; after quality control, we retained N = 423 genuine submissions. Asking individuals to report on the properties of their teams (e.g., PS) constitutes a key-informant approach and is widely used in the social sciences (Pinsonneault and Kraemer 1993). The survey design may be found in the shared material (Sect. Data Availability). The characteristics of our sample are summarized in Table 10, and the descriptive statistics of our latent variables are shown in Table 11. Figure 2 shows the distribution of PS scores in our survey. The mean was 4.1, with a standard deviation of 0.59, which compares favorably to prior work on PS in software development reporting lower standard deviations (Faraj and Yan 2009; Buvik and Tkalich 2022). At the same time, the mean of 4.1 indicates a central tendency towards higher PS levels. Notably, 33.33% of respondents reported psychological safety scores below 4.0, and only 4.96% reported the highest score of 5.0. This indicates a presence of lower psychological safety experiences within the dataset, though this group does not represent the majority. The prescreening cost 0.5 GBP per participant, while the pilot and main survey cost 6.5 GBP per participant.

Fig. 2 Histogram of psychological safety (Phase II)

4.2.6 Data Analysis

After ensuring the survey's validity and reliability, we tested our hypotheses through ordinary least squares (OLS) regression in SPSS v. 28 and partial least squares structural equation modeling (PLS-SEM) in SmartPLS. Both methods are appropriate for analyses in which dependent and independent variables are measured using Likert scales, as in our study. While we used OLS regression as our main analysis method, we used PLS-SEM to corroborate our findings and thus increase the robustness of our analysis. In both techniques, a hypothesis was confirmed when the corresponding regression coefficient had a p-value below 0.05. Because both techniques yielded identical results concerning the significance of the hypotheses, we report only the OLS regression results in the findings section; the PLS-SEM results are provided in the replication package (see Sect. Data Availability). To mitigate unobserved-variable bias (Antonakis et al. 2010), we added a number of control variables to our models. By including control variables, we obtain regression coefficients that indicate to what extent PS predicts variance in the dependent variables (e.g., speaking up, admitting mistakes) above and beyond the control variables.
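As an illustration of this analysis strategy, the following sketch estimates one such model with statsmodels instead of SPSS, using synthetic data; the variable and control names are hypothetical, and Cohen's \(f^2\) for PS is obtained here by comparing the \(R^2\) of models with and without PS, one common way of computing the incremental effect size of a single predictor:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 423

# Synthetic stand-in for the cleaned survey data (one row per respondent,
# averaged Likert scores per construct plus hypothetical controls).
df = pd.DataFrame({
    "ps": rng.normal(4.1, 0.6, n),
    "team_size": rng.integers(3, 12, n),
    "experience_years": rng.integers(1, 20, n),
})
df["speaking_up"] = 0.4 * df["ps"] + rng.normal(0, 0.5, n)

controls = "team_size + experience_years"
reduced = smf.ols(f"speaking_up ~ {controls}", data=df).fit()
full = smf.ols(f"speaking_up ~ ps + {controls}", data=df).fit()

# The hypothesis is supported if the PS coefficient is significant (p < .05).
print(f"beta(ps) = {full.params['ps']:.2f}, p = {full.pvalues['ps']:.4f}")

# Cohen's f^2 for the incremental contribution of PS over the controls.
f2 = (full.rsquared - reduced.rsquared) / (1 - full.rsquared)
print(f"f^2 = {f2:.2f}")
```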

4.3 Integration of Phases I & II

We merged the findings during the drafting of this publication, after Phase II (including its data analysis) was completed. Creswell and Clark (2017) refer to this part of the investigation as the "interpretation of the related outcomes." Section 5 presents the interpretation of the results and discusses the extent to which the Phase II quantitative findings corroborate and generalize the Phase I findings. We also explore how the outcomes of both phases align and complement one another.

5 Findings

Recall that our RQ seeks to understand the effects of psychological safety (PS) on agile teams' behaviors aimed at enhancing software quality. Both Phase I and Phase II data indicate that PS may promote behaviors that contribute to the potential enhancement of software quality. Psychologically safe agile software development teams seem to capitalize on five main quality-related behaviors: speaking up about software quality problems, admitting software quality mistakes, helping each other to enhance quality, collective problem-solving to resolve quality issues, and taking initiatives to promote better software quality. Once these behaviors are adopted by the team and its members, subsequent effects that potentially improve software quality take place. For example, when mistakes affecting quality are admitted, the team collectively learns from them, and they become points of reference for the team to avoid in the future. When P4 was asked to sum up his discussion of how learning from mistakes helps software quality, he replied, "absolutely, it does [learning from mistakes helps software quality]. And the team basically becomes, you know, becomes capable enough to not only achieve the quality but also improve the quality. So I would say it's not just about achieving something and then saying that's enough for the team. It's always about setting new goals and improving the quality further, and employing more techniques and more ways to better mitigate the risks, better identify the potential problems, and better solve them. So I would say risk analysis and mitigation also become stronger as part of improving quality" (P4).

5.1 Speaking up About Software Quality Problems

Phase I data suggest that speaking up is the willingness to share one's thoughts, needs, and concerns about the team's software quality, including the practices and processes to ensure it. Speaking up is an important form of honesty in psychologically safe agile teams. Honesty potentially contributes to building trust; by speaking up, team members demonstrate that they will be truthful with their team and that they care about it. P4 sums up this finding: "So, if you see there are blockages or if you see the communication is a problem or quality is a problem, or this thing could be improved, then your suggestions should be heard and I speak up" (P4). This statement may also imply an interplay between openness and speaking up. They are interdependent; for example, P4 may have felt safe to speak up because, first, it is safe to do so and, second, his peers will reciprocate with openness. Feeling safe encourages individuals to speak up about problems. For example, after joining his team, P11 seems to have felt safe enough to become more confident because others were. He stated: "what I saw from the beginning was even the senior members; the junior ones were telling every issue where I at the beginning was kind of keeping to myself, or at least saying it only my mentor, not and I was not coming today, but slowly as I saw them doing it, they were being open and, there was no problem, also like they're supporting the material that the company presents itself with" (P11).

How, then, might speaking up about software quality issues contribute to behaviors aimed at enhancing software quality? P18 asserts: "it is certainly what I encountered that talking about problems helps better quality." Once quality problems become known, the team collectively and swiftly addresses them. "Yes, we do that every time we think of something that might be a problem for quality. Anyone can be reached at any time to discuss problems, and we frequently voice chat outside of scheduled meetings. If we ever identify something widespread, we schedule a meeting with all affected parties and let the person who found the problem present it. We value a keen and observant eye for quality," P11 explained. In a safe working climate, honesty and openness could become norms; the team then expects its members to bring problems forward. Once a team member brings a problem forward, it is owned and resolved collectively. "So yeah, if you are communicating and honest with the team, then you definitely have the solution. Hiding your problems, then you're going to suffer with your problem alone," P15 said. H1 hypothesizes this finding. Model 2 (Table 12) regresses speaking up on the control variables and on PS, allowing us to test H1. In support of H1, we found a significant positive effect with a medium effect size (beta = 0.43, effect size f\(^{2} = 0.21\), p<0.001) (Cohen 2013).

Teams with high PS seem to actively adopt actionable behaviors that encourage speaking up. Phase I data show several behaviors that may contribute to an atmosphere encouraging team members to voice their concerns and suggestions, including recognizing and sharing issues as a valued behavior within the team, direct and open communication with the leadership, and the legitimacy of the act of speaking up. For the act of speaking up to be effective, the concern in question must be legitimate and actionable. This shows the value of constructive feedback intended to lead to tangible improvements. P4 explains: "... the only way they can be heard is if the team is stable and easygoing, feeling safe, and they are open to communications. So if you can go ahead and talk to your senior members or to your CEO directly, and discuss your concerns, and if those concerns are legitimate, and they are being acted upon, then this is a very satisfactory and very motivating ..." (P4).

Table 12 Regression Results

5.2 Admitting Mistakes

A software quality problem is something that needs to be solved; a mistake is something that was not supposed to occur. Team members in psychologically safe agile teams seem to admit mistakes because they likely feel empowered to do so, knowing that the focus will be on collective learning and problem-solving rather than blame. Such an environment seems to foster a proactive stance towards identifying and addressing quality issues, cultivating behaviors that support software quality. P3 explained that allowing mistakes is healthy; team members talk about more mistakes, then resolve them, and this climate keeps "the safety up." P8 explained that when he did not follow the team's coding standards, he admitted it to his team lead. He was not blamed; instead, he was coached on how to use the coding standards. He commented: "I felt great [not being blamed and coached on how to follow coding standards]. But I also felt why didn't I ask this before? Like, I know, right. I mean, what I observed was like, until unless you do any mistakes, no one will teach you. But that's what I first learned it, because everyone will think that, okay, this is easy. And so he can do that. So but sometimes the easiest task is the one where we make mistakes" (P8). Mistakes become visible when admitted, and they are then potentially turned into opportunities for collective learning, which may foster continuous improvement of software quality practices.

To test this finding, we proposed H2, which predicts that high PS is associated with teams admitting more software quality-related mistakes. Model 3 (Table 12) regresses admitting mistakes on the control variables and on PS, allowing us to test H2. We found a significant positive effect with a large effect size (beta = 0.55, f\(^{2} = 0.39\), p<0.001). Thus, H2 holds.

Admitting mistakes has the potential to trigger a rectification process, followed by a learning process to avoid similar mistakes in the future. Table 13 documents examples from both phases of the study. When software quality mistakes are admitted, a schema of effects and behaviors may take place. First, it appears that the team takes measures to rectify the mistake, and a learning process then follows. At the core of the rectification process is the no-blame approach. In psychologically safe agile teams, when a team member admits a mistake, the team seems to reciprocate with support and proposes measures to rectify the issue. P17 explains his team's attitude towards mistakes: first, no blame, "we won't hold any issues against a single person; this is just how we work" (P17). This encourages individuals to come forward about their mistakes and then learn from them: "It [admitting mistakes] does help quality. First, the person won't be repeating the same mistake again. So that is one way of improving quality" (P17). P11 explained that his whole team may have learned from admitting a process failure with consequences for quality: "well, when you don't have a way of making it [a process failure induced a defect to production] better, sadly there's no way around it like manual testing, but we did try to promise each other that we would be more rigorous. We made an example like in an Excel where we have steps and we have to sign at each step with our names so yes we learned and implemented a process gate to vet and avoid similar problems in the future" (P11).

Table 13 Example of admitted software quality mistakes

5.3 Learning from Mistakes

It appears that rectifying the mistake is not necessarily the end. Psychologically safe teams may not simply move on; they seem to seize the opportunity to learn from the mistake and avoid it in the future. Mistakes can become points of reference either to avoid or to use in understanding and resolving similar situations. "We use it [past mistake] as examples in our meetings, and we tried to discuss in the team and we try and strive to avoid similar mistakes. We learned from this. Because it was a very, very good lesson for us. We remember it all the time," P5 said. This process is mirrored at the personal level; P18 states: "I continue to learn from my mistakes, and I am encouraged to do this in order to evolve into a better person ... This [feeling safe] changed me and I continue to change. This is partly because of the safety we have, I feel confident enough to open up." We proposed H3 to test this finding. Model 4 (Table 12) regresses learning from mistakes on the control variables and on PS to test H3. We found a significant positive effect with a medium effect size (beta = 0.34, f\(^{2} = 0.12\), p<0.001). We conclude that H3 holds.

Then, how do admitting and learning from mistakes contribute to behaviors aimed at enhancing software quality? Following the schema in the second column of Table 13, it appears that when more mistakes are admitted, more opportunities arise to rectify them and to contain their potential negative impact on software quality. Hence, more admitted mistakes may lead to more rectifications and a potential reduction of defects or errors in the code. Admitting mistakes, at a minimum, seems to lead to quality-related behaviors, such as rectification, but most of the time it also initiates a collective learning process (“moving forward” in the Table 13 schema). While these behaviors appear to cultivate a quality-enhancing culture, the actual improvement in software quality remains partly contingent on the effectiveness of the learning and rectification actions, in addition to other factors.

When mistakes are acknowledged, they become potential points of reference, and the team may subsequently become better equipped to avoid them in the future. Hence, more admitted mistakes may result in a greater accumulation of reference points, which could help the team avoid repeating similar errors. This process may make the team more resilient to similar errors, thereby fostering a quality-oriented behavior that supports the continuous pursuit of software quality improvements.

To sustain this practice (both admitting and learning from mistakes), psychologically safe teams cultivate a culture of non-punitive responses to admitted mistakes, leveraging them as teachable opportunities and showing support when they are disclosed (see Section 6 for actionable recommendations). Knowing that they would not face retribution for the errors they make and admit, team members may come forward feeling safe and supported. This is encapsulated in P15’s and P17’s testimonies: “... so pointing out mistakes is welcomed in my team and blame is not acceptable ...” (P15) and “we won’t hold any issues against a single person, this is just how we work” (P17). Non-punitive environments seem to view mistakes as opportunities for learning and improvement rather than occasions for reprimand. Such environments cultivate the habit of sharing experiences, including errors. Admitting mistakes may contribute to collective knowledge if used constructively.


5.4 Helping Each Other

While collaboration is a default tenet of agile methods, psychologically safe teams not only collaborate but also seem willing to “lend a hand” by sharing their knowledge to potentially improve software quality. P8 likened learning from his colleagues how to write better code to learning about quality assurance “tools or technology.” He stated: “of course [learning from each other influences team member code quality]. If someone shows me his code that might be better than mine, then obviously I learn and I try to match his quality. The learning can also be tool or technology” (P8). P18 (a QA) echoed this claim: “yes [helping each other helps software quality]. I became a better QA and the more you know, the better you become at assuring quality” (P18).

We proposed H4 to test this finding. We hypothesized that high PS is positively associated with helping each other aimed at enhancing software quality. Model 5 (Table 12) regresses helping each other on the control variables and on PS to test H4. We found a significant positive correlation with a medium effect size (beta = 0.46, f\(^{2}\) = 0.26, p<0.001). Thus, H4 is supported. Based on the Phase I and II findings, we suggest that in psychologically safe agile teams, software developers and QAs alike share their knowledge for the betterment of the quality of the software they develop.

Helping each other seems to imply practices to maintain this sense of togetherness in psychologically safe teams. Phase I data show teams actively organizing quality-focused coaching and knowledge-sharing sessions and dedicating buffers for support in task estimation (see Sect. 6). Our interviewees reported their teams organizing periodic “Lunch and Learn” sessions and dedicated pair programming sessions to share knowledge (e.g., P8, P9 & P13). They also emphasized the importance of coaching team members on the team’s expectations for software quality, e.g., “... so we try hand holding ... So if somebody does something bad, we’ll do pair programming will not fix his mistake for him but will guide them” (P11). P9 echoed a similar practice: “... there was also declining in code quality [of a team member], his quality wasn’t up to the point. So quality isn’t something that improved in just a few days ... So, what we do is we coach and give constructive feedback a lot of feedback” (P9). Some teams even actively allocate time in their estimates as a buffer for potential support requests from their colleagues, e.g., “the tasks and works are organized with some buffer time for every resource. Based on the availability we can seek help and support” (P17). Such practices show active engagement in nurturing a culture of support, integrated into the team’s workflow.


5.5 Collective Problem-solving

As shown in the Phase I data, collective problem-solving is when the team or a subset of its members join their intellectual efforts to resolve a software quality issue. P15 explained: “so the toughest issue can be the easiest issue for someone else. But if you’re bringing it to the team level, then we definitely have the solution where people like people can do the pair programming, and they can like to teach and think together and improve together the process and solve the problem” (P15). P2 explained how his team chipped in to find “better solutions” for “complex” problems: “for example, in our system, we are using Azure DevOps, by the way. We create a bug and then let the person who actually worked on that, okay, here’s a problem, and please correct it, and then that person takes over and fixes it. And even if it’s a complex problem, like more difficult to fix, then we sit down together and discuss why maybe it’s an architectural issue or it’s coming from, not from the architecture part of the whole system, then we sit down and discuss how can we resolve it? When we collectively contribute, then we solve difficult problems with better solutions” (P2).

Based on this finding, we hypothesized that in agile teams, high PS is associated with increased collective problem-solving aimed at enhancing software quality (H5). Our regression test (Model 6 in Table 12) shows that this hypothesis holds, with a medium effect size (beta = 0.48, f\(^{2}\) = 0.30, p<0.001). Based on these results, we suggest that psychologically safe agile teams engage more in collective problem-solving to improve the quality of their software.

To encourage this practice, Phase I interviewees reported having dedicated time slots and organizational structures in place to facilitate collective problem-solving efforts. In addition, team members are actively encouraged to report and share problems. For example, P4 explains: “... people feel safe to put effort on quality. And if one of them is out of the balance, then we have problems. But it is very important to proactively identify those problems and to productively discuss them instead of waiting around at the end of the sprint to discuss them” (P4). P7 explained that his team organizes knowledge sessions dedicated to tackling complex problem-solving tasks: “... we did the knowledge sessions ... this type of session helps a lot” (P7).

Table 14 Example of software quality initiatives

5.6 Software Quality Initiatives

Agile teams and their members may be inclined to propose and take initiatives to potentially improve either the quality of their software or the processes that assure it when they feel it is safe to do so. They are incentivized to take initiative because they likely become invested in what they do. P18 passionately sums up this finding: “different ideas make us stronger, and we learn from each other, so the quality keeps improving. Our test coverage improves. We know the business better because we learn from the seniors, and ideas also improve our tools and processes” (P18). P15’s statement further supports this conclusion: “initiatives do some great things we have observed in our team ... When people bring initiatives, eventually everything improves over time, processes, relationship in the team, and the quality of our software” (P15). Table 14 documents two examples from each phase of the study. The examples show that agile teams, in a PS work climate, propose unsolicited initiatives to improve software quality.

To test this finding, we proposed H6, which predicts that high PS in agile teams is positively linked to taking more initiatives aimed at enhancing software quality. We found a significant positive relationship with a small effect size (Table 12, Model 7, beta = 0.19, f\(^{2}\) = 0.04, p<0.05). Thus, H6 holds.
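The regression tests for H2 through H6 follow the same pattern: each behavior is regressed on the control variables with and without PS, and the incremental effect size is derived from the change in explained variance. The sketch below illustrates how such a model comparison could be reproduced; it is a minimal, hypothetical example (the data file, column names, and control variables are placeholders, not the study’s actual instrument), not our analysis code.

```python
# Minimal sketch of a hierarchical OLS comparison for one hypothesis (illustrative only).
# The CSV file and all column names are hypothetical placeholders.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("survey_responses.csv")  # hypothetical Phase II survey export

controls = "team_size + experience + gender"  # placeholder control variables
reduced = smf.ols(f"quality_initiatives ~ {controls}", data=df).fit()
full = smf.ols(f"quality_initiatives ~ {controls} + ps", data=df).fit()

# Cohen's f^2 for the incremental contribution of PS
f2 = (full.rsquared - reduced.rsquared) / (1 - full.rsquared)

print(f"coefficient for PS = {full.params['ps']:.2f}, "
      f"p = {full.pvalues['ps']:.4f}, f2 = {f2:.2f}")
# Note: standardized betas, as reported in Table 12, would require z-scoring
# the variables before fitting.
```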

In high PS teams, initiative-taking is supported by risk acceptance on the part of management and stakeholders and by dedicated innovation platforms (see Sect. 6 for additional recommendations). Our interviewees emphasized the importance of management commitment and acceptance of failure when experimenting with new initiatives. Understanding that failure is part of the learning process fosters a sense of safety in SE teams. P5 explains: “... we also discussed with the stakeholders ... we also felt it is safe to experiment ... now we try to encourage initiatives and before rejecting them, we do some research about it. Because good initiatives can change the outcome of project in healthy ways” (P5). Some teams have structured, dedicated platforms for innovation and experimentation. For example, P8 discussed using hackathons as a platform to experiment: “We have hackathons where we can present our own ideas, like any new in innovations. So, we decided to use the hackathons to experiment and propose innovative ideas. It is up to us how to decide, it can be new features, trying new technology, new tools or improve our coding quality. We show each other’s code, and we learn from each other’s coding.” (P8)


To recap, then, how do we capitalize on PS to enable software quality improvements? First, SE teams and organizations should endeavor to cultivate behavioral norms and foster organizational practices that contribute to and promote psychological safety (e.g., leadership ownership of PS, openness, and no-blame norms (Alami et al. 2023)). Then, organically, these enablers can materialize, promoting individual and collective behaviors conducive to software quality. We discuss these findings further in light of previous work in the social and organizational sciences, and we propose concrete implications for practice in Section 6.

6 Discussion

The effects on software quality we identified in our study could emerge organically in agile teams once they become psychologically safe. Our study shows that once teams become psychologically safe, team members are more likely to speak up, admit mistakes and learn from them, help each other, collectively solve problems, and possibly launch software quality initiatives, all behaviors aimed at enhancing software quality. Our study not only establishes qualitative and quantitative evidence supporting these effects but also provides insights into their strength. We find a large effect size for admitting mistakes, highlighting that PS is a key issue to work on in teams lacking a culture of admitting mistakes. We find medium effect sizes for speaking up, learning from mistakes, helping, and collective problem-solving, implying that teams struggling in these areas are likely to benefit from initiatives to enhance PS. Last but not least, we find a small effect size for software quality initiatives, implying that such initiatives may benefit from PS, though PS alone is unlikely to lead to a substantial increase in them. Taken together, these findings highlight that teams can promote a variety of behaviors for enhancing software quality by working on their PS. Prior research provides important insights into how organizations and agile teams should focus on cementing and sustaining PS antecedents (Alami et al. 2023). Still, we were able to draw out practices that agile teams and their leaders could adopt to maintain and encourage the behaviors underpinning these effects. Table 15 documents practices to maintain PS effects as we inferred them from Phase I data.

Speaking up is a form of participation in promoting and sustaining PS; it also seems to be a demonstration by individuals that they feel safe. Previous work has also found that speaking up is associated with PS. Bienefeld and Grote found that an individual’s perceived status within the team encourages them to speak up (Bienefeld and Grote 2014). Nembhard and Edmondson, on the other hand, point to professionally derived status: the higher the status of the individual or their team, the safer they will feel to speak up (Nembhard and Edmondson 2006). Our work suggests that, in psychologically safe agile software teams, status is irrelevant to whether members feel safe speaking up. Table 15 lists some techniques we gathered from Phase I data to maintain and encourage speaking up in psychologically safe agile teams.

PS has been linked to greater error reporting in previous studies (Leroy et al. 2012; Peltokorpi 2004). It has also been shown to lead to more voice behavior (reduction in silence behavior) within the team (Siemsen et al. 2009; Xu and Yang 2010; Brinsfield 2013). For example, Tynan suggests that individuals in teams with high PS are more inclined to raise disagreements, give candid feedback, and point out errors to their peers and supervisors (Tynan 2005).

Psychologically safe agile software development teams appear not only to admit more mistakes but also to learn from them. A psychologically safe work climate fosters team and individual learning (Newman et al. 2017). Prior work has shown that PS promotes learning from failures (Sanner and Bunderson 2013; Carmeli 2007). In this study, we identified the processes in psychologically safe agile teams that could facilitate learning from admitted mistakes. Table 15 lists some techniques, which we gathered from Phase I data, to sustain this behavior in psychologically safe agile teams.

Table 15 Implementation techniques for PS potential effects on software quality

Researchers have linked PS to a positive attitude towards teamwork (Ulloa and Adams 2004) and to organizational commitment (De Clercq and Rius 2007); in addition, employees reciprocate supportive practices such as helping in psychologically safe workplaces (Chen et al. 2014). Our work shows that agile teams with high PS are more inclined to help each other by sharing their knowledge on how to improve the quality of their work and by solving complex quality problems together. Prior work has also linked greater knowledge sharing among team members with PS (Mu and Gnyawali 2003; Siemsen et al. 2009; Xu and Yang 2010). For example, Mu and Gnyawali found that high PS among group members is positively associated with greater development of synergistic knowledge, the process used by groups to “constructively [integrate] diverse perspectives of individual group members” (Mu and Gnyawali 2003); similar results were found by Xu and Yang (2010). Our work extends these claims to support the idea that a similar process applies when agile teams are psychologically safe; they tend to help each other by sharing knowledge and collectively solving problems aimed at improving the quality of their artifacts. Table 15 lists some techniques we gathered from Phase I data to foster helping each other and collective problem-solving.

Few researchers have investigated the relationship between taking initiative and PS. Tucker studied frontline employees’ initiatives (efforts outside the scope of everyday job responsibilities) to prevent operational failures (Tucker 2007). She found that PS positively correlates with frontline system improvement (Tucker 2007). Our work shows potential effects of PS on agile software team members’ willingness to take initiatives aimed at improving software quality, as well as the techniques team members use to achieve this end. Agile team members seem to become invested in their teams’ efforts to achieve quality. Proposing initiatives may be an individual’s contribution to the collective team effort to achieve quality. Table 15 documents some practices we gathered from Phase I data to support initiatives in agile software teams.

Even though some of our findings have parallels with findings reported in the social sciences and organizational studies literature, the novelty of our work lies in providing evidence that some of the previously reported potential effects of PS apply to software quality. For example, while Tucker reported that psychologically safe teams take initiatives to improve their frontline processes (Tucker 2007), we provided evidence that psychologically safe software teams are also keen on taking initiatives aimed at improving the quality of their products and of the processes and tools they use to assure it. This implies that PS is probably equally important for software and non-software teams.

It is relevant and sound to pursue this study in the context of software development, even though similar findings have been reported in social and organizational sciences studies. First, previous conclusions cannot be literally imported or translated into software development without proper empirical evidence. Second, we need to carry out studies in the context of software engineering to identify nuances relevant to our practice. For example, we identified how agile teams transform quality mistakes into points of reference and become resilient to similar mistakes in the future.

In addition, we drew out several practices and recommendations for practitioners, organizations, and agile teams to adopt in support of the PS-induced enablers aimed at pursuing software quality. Table 15 mentions practices that are not costly to implement. For example, acknowledging and recognizing individuals’ effort and courage when they speak up does not require changing existing processes and relationships but simply a change in behavior. Leadership and team members should show authenticity in adhering to PS values. Edmondson suggests that a change strategy emphasizing behavioral change to build and reinforce PS requires low economic investment but strong mobilization of leadership (Edmondson 2018).

It is essential to acknowledge organizational dependencies when leveraging the benefits of psychological safety and when interpreting our recommendations. While our findings suggest techniques to maintain and encourage PS-induced behaviors conducive to enhancing software quality, deploying some of our recommendations may face organizational complexities and obstacles. A parallel can be drawn with another quality of agile teams, “autonomy” (Guzzo and Dickson 1996). Moe et al. found that autonomy often faces challenges in implementation (Moe et al. 2019). They found that, in the context of large agile projects, it is common for team members to be marginalized in the goal-setting for the project, resulting in a decline in their motivation and autonomy (Moe et al. 2019). Similar dynamics may influence the deployment of some of our findings. For instance, the ability to speak up may not universally foster PS and, in turn, contribute to behavior aimed at enhancing software quality unless the work environment is genuinely non-punitive and supportive at various levels of the hierarchy.

Although some of our findings can coexist with a control-oriented environment, it is worth noting some potential tensions. Control in the information systems (IS) literature is defined as the endeavor to align individuals’ behavior with the overarching goals of the organization (Choudhury and Sabherwal 2003; Kirsch 2004; Wiener et al. 2016). A commonly cited objective of control in IS projects is to regulate or modify actions and behaviors so that team members’ skills and capabilities are fully utilized to steer the project towards its intended goals (Kirsch 2004). For example, control mechanisms often punish errors, while our findings show that a psychologically safe environment sees them as opportunities for learning and improvement. Systems employ control mechanisms to cultivate structured environments and social order, in which individuals are expected to be held accountable for their participation (Hall et al. 2017). In such an environment, individuals are expected to engage in an account-giving process (Frink et al. 2008), which may result in rewards or sanctions, and whose legitimacy is affirmed by the presence of an audience (Hall et al. 2017). Such an environment, oriented towards accountability, may be at odds with some of our findings, which suggest that team members should be encouraged to take initiative without fear of repercussions. This implies that implementing PS in a control-oriented setting requires careful balancing between the organization’s desire to exercise control through accountability mechanisms and employee well-being.

In addition, implementing our recommendations may yield diverse outcomes, depending on the specific team and organizational context. For instance, collective problem-solving implies fostering an environment of open dialogue, which may also undermine PS, depending on delivery and context. For example, the tone of the language used in the dialogue can signal either a genuine contribution to the problem-solving endeavor or an intended or unintended attempt to undermine PS. The source of the contribution to the problem-solving exercise may also matter: the hierarchical position of the individual contributing can influence how the contribution is received, and a contribution from a peer might be received differently than one from a superior. Practitioners should take their organizational culture into consideration when implementing these practices. Implementation should be a reflective and evolving process, proceeding through cycles of planning, action, and assessment to iteratively refine effectiveness. Such a reflective and iterative process should enable practitioners to identify particularities inherent to the context of their teams and organizations and to continuously fine-tune the implementation of these practices.

7 Threats to Validity

7.1 Internal and Construct Validity

Our sampling strategy in Phase I might be seen as a limitation of this research. Participants for the interviews were sourced via word of mouth and referrals. Accordingly, our sample may bias the results of Phase I. However, this shortcoming was somewhat addressed by our mixed-methods approach; in Phase II, we polled a large group of people from a variety of occupational, experiential, and national backgrounds.

The data quality of Phase II is at risk due to the use of an external market platform, Prolific. However, we made every effort to screen for and recruit participants who would provide reliable data. Each answer was carefully reviewed before it was accepted for payment on the Prolific platform, in addition to the quality control steps we employed to verify the validity of our survey data (see Sect. 4). For authenticity purposes, we used open-ended questions and comments.

Furthermore, we chose to restrict survey participation to certain areas (native English-speaking countries or countries where English is a common second language, e.g., South Africa and Europe). This choice may have left out perspectives from other areas. However, this shortcoming is once again offset by our mixed-methods approach: thirteen participants from countries where English is not the native language were included in the Phase I sample. To provide one concrete example, we heard Indian participants discuss psychological safety on par with their British colleagues.

There were just two women in our Phase I sample (P18 and P20), reflecting a significant male preponderance. Phase II of the research included more non-male individuals (22.0% female and 1.0% non-binary), allowing us to control for gender and account for the possibility that different genders might experience and interpret psychological safety in different ways.

Memory bias and social desirability bias (i.e., giving researchers the answers they want to hear) (Furnham 1986) are two potential threats to validity that might have prevented participants from offering accurate accounts. Throughout the interviews, we encouraged participants to provide instances from their team experiences to help reduce the impact of memory bias. High-quality examples not only provide such evidence but also serve as trustworthy descriptions of the real world. As a further precaution against social desirability bias, we assured our participants that their replies would be kept completely confidential.

The data from Phase I is slanted toward high PS. However, the essential nature of mixed methods, in which the weaknesses of one phase are balanced by the strengths of the other, contributes to the alleviation of this issue (Creswell and Clark 2017). Indeed, the variance in psychological safety is higher in the Phase II data.

The concept of software quality is debatable (Kitchenham and Pfleeger 1996) because it is subjective (Yılmaz and Kolukısa Tarhan 2022). High-quality software may not be perceived in the same manner by everyone. Based on their own requirements, interests, and expectations, various stakeholders may have different criteria for assessing software quality. Contextual variables such as the type of application, the complexity of the program, the intended use, the target audience, and the environment in which it functions may all influence how people perceive software quality. To ensure that our adopted definition (i.e., ISO/IEC 25010:2011) and our participants’ definitions were aligned, we implemented checkpoints during the data collection process (see Section 4) to compare our participants’ perceived definition to ours. Although we did not encounter any serious challenges to our adopted definition, it is possible that some participants disregarded or underestimated the significance of this validation and simply agreed with our definition.

Our analysis (in both phases of the research) does not directly evaluate software quality; that is, we did not analyze the quality of our participants’ software using pre-defined quality criteria to confirm their assertions. This restriction, however, is inherent in the methods and research design we adopted. Due to the large number of participants, it is not practical to examine software quality directly in large survey research. Future research might focus on a particular case in which the team feels psychologically safe and analyze software quality using well-established software metrics. We argue that this approach aligns well with the theoretical underpinnings of reasoned action and planned behavior (Madden et al. 1992), where the relationship between intention and behavior is not always direct. We also argue that establishing a direct link between any independent variable like PS and a complex dependent variable like software quality would oversimplify the multivariate reality of achieving software quality. For this reason, our research design deliberately avoids positing a “direct” relationship and instead aims to elucidate the nuanced influences of psychological safety on software quality.

7.2 External Validity

Our focus on agile teams does not imply that PS is irrelevant or inconsequential in teams using plan-driven methods. The decision was driven by the increasing adoption of agile methods and their inherent emphasis on human aspects of teamwork, which are key elements influencing psychological safety. However, this focus introduces a limitation to the generalizability of our findings to software development approaches other than agile.

The limited capacity to draw causal conclusions is a threat to the validity of our Phase II survey, as it is for other correlational studies. Because correlation may also arise from the impacts of unobserved variables, from self-selection, or from other forms of endogeneity (Antonakis et al. 2010), we cannot draw conclusions about cause and effect based solely on the existence of a correlation. However, two features of our research help to alleviate this issue. First, to lessen the impact of unobserved-variable bias, we implemented a robust set of control variables. Second, the claims made in this research are supported not just by correlations but also by the insights into causal processes that we gleaned from our qualitative data, which, combined with the quantitative data, gives a stronger foundation for causal assertions (Flyvbjerg 2006).

Software engineering teams can take various forms and work in complex settings. Our sample may not be representative of every possible software engineering team type. However, we used several variables in our Phase II sampling to ensure a diverse and more representative sample (see Table 10 in Sect. 4).

Both our qualitative and our quantitative samples show a tendency towards higher PS levels. Although our quantitative sample also included respondents reporting low PS, the high proportion of respondents reporting high PS levels may limit the generalizability of our findings to environments with very low PS levels. Future research can address this threat to validity with a greater focus on low-PS environments to enrich our understanding of PS’s role across diverse software engineering contexts.

Finally, although we employed items published in past research, and although these items satisfied the standards for discriminant validity and reliability, we found little evidence for the convergent validity of the items measuring psychological safety, which may be a limitation of our Phase II survey. Consistency with previous research led us to regard psychological safety as a monolithic concept; nevertheless, future work might investigate whether different aspects of psychological safety carry different weights when it comes to outcomes like software quality.

7.3 Construct Validity

While we took measures to ensure construct validity, the construct of PS might manifest differently in teams operating under different software development methods or organizational cultures. Additionally, the iterative and reflective nature of agile practices may inherently foster higher levels of PS, suggesting a potential selection bias in our sample. Thus, we propose future research directions to examine PS across a broader range of team types and working contexts to provide a more comprehensive validation of the construct.

In Phase I, our interview data informed our understanding of the PS construct as it manifests in software engineering teams. Interviewee accounts and examples served as validation for our initial PS measurements. Subsequently, we amended the item (i.e., PS3) to reflect this reality (see Section 3.1 for further details).

To further validate our measurement instrument, we initially conducted a pilot study, which served as a preliminary test of our survey items. The factor analysis results supported a unidimensional structure of PS, consistent with Edmondson’s conceptualization (Edmondson 1999b). One item (i.e., PS5) did not load significantly onto the primary factor. We therefore revised its wording to sharpen the focus on the PS construct (see Section 3.1 for further details).
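For illustration, a minimal sketch of such a unidimensionality check is shown below; the data file and item column names are hypothetical placeholders (assuming seven PS items, as in Edmondson’s scale), and the generic one-factor model illustrates the idea rather than reproducing our exact analysis.

```python
# Illustrative one-factor loading check on pilot survey items (not the exact procedure used).
# The CSV file and item column names are hypothetical placeholders.
import pandas as pd
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

pilot = pd.read_csv("pilot_responses.csv")
items = [f"PS{i}" for i in range(1, 8)]           # assumes seven PS items

X = StandardScaler().fit_transform(pilot[items])  # standardize Likert responses
fa = FactorAnalysis(n_components=1).fit(X)

# Loadings on the single PS factor; an item with a weak loading (e.g., |loading| < 0.4)
# would be a candidate for rewording, as happened with PS5 in our pilot.
for item, loading in zip(items, fa.components_[0]):
    print(f"{item}: {loading:+.2f}")
```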

8 Conclusion

Our study sought to understand the impact of psychological safety on agile teams’ behaviors aimed at enhancing software quality. The findings from both Phases I and II of our study show evidence that PS fosters a range of quality-related behaviors among the team and its members. These behaviors include speaking up about software quality issues, admitting mistakes, helping each other in pursuing quality, engaging in collective problem-solving, and taking initiatives for software quality.

The adoption of these behaviors contributes to a constructive and supportive environment in which learning from mistakes is normalized and more quality-related errors are reported. This not only aids in the frequent correction of errors but also serves as a preventive mechanism by turning mistakes into opportunities for continuous learning and future reference. PS also facilitates the sharing of knowledge and skills aimed at enhancing software quality, as team members feel more comfortable contributing to the collective knowledge pool. This goes beyond mere collaboration: team members actively improve each other’s work, thereby cultivating behavior aimed at enhancing quality. Teams characterized by high psychological safety also engage in collective problem-solving, pooling their intellectual efforts and experience to tackle quality-related challenges. In this collective effort, diverse solutions are considered, potentially leading to more robust and innovative responses to quality challenges.

Psychological safety has been shown to be a catalyst for many desirable traits, both in early studies on the topic and over the past decades. Our results demonstrate that when PS is ingrained in agile teams, its effects are far-reaching. Because of the social enablers it promotes, teams are more inclined to adopt behaviors conducive to enhancing software quality. Profiting from this social asset requires companies, their management, agile teams, and their individual members to uphold and propagate norms that foster a sense of safety.

Our conclusions also indicate that technical tools and processes are not the only reliable means of pursuing software quality; combining them with social strategies may be superior. Our study shows that the social context matters and that human needs in the software development environment should be met. Our analysis shows that psychological safety is a human need; once it is met, agile teams respond with an array of behaviors conducive to improving software quality.

To continue our efforts to make SE a socially aware practice, we propose avenues for future work. Our study primarily focused on the behavioral changes fostered by PS and their influences on software quality. Recognizing this limitation opens up an avenue for future work, for example through case studies of teams with high and low psychological safety that empirically evaluate software quality using established metrics. Such empirical investigation would provide insights into a direct relationship between PS, or the lack thereof, and software quality. Longitudinal studies could also examine the evolution of PS and its subsequent effect on the team’s sustained impact on software quality. Such studies would focus on the evolution of openness, support within the team, knowledge sharing, and collective problem-solving related to quality. Longitudinal studies could offer insights into the mechanisms through which PS leads to continuous and sustained quality improvement, the behaviors conducive to these improvements, and how teams manage to sustain high levels of psychological safety over the lifecycle of a project or their tenure as a team.