Our experiments investigate whether design communication between software engineers becomes more effective when using a GSD instead of a TSD to exchange design information. To this end, we investigate whether using a GSD affects the six considered communication aspects (Understanding, Explaining, Recall, Active Discussion, Creative Conflict, and Conversation Management) differently than using a TSD (R.Q.1). Moreover, we study whether a cohesive and motivated TSD (i.e., Altered-TSD) improves design communication (R.Q.2).
Considering Subgroup A, the global effect size of the perceived explaining quality is positive, which means that using a GSD has a positive effect on the perceived explaining quality. Similarly, the global effect size of the understanding (i.e., maintenance task) score is positive, which means that the GSD users score better than the TSD users. Nevertheless, when considering the distributions of the scores, we find a statistically significant difference between the two groups (G and T) neither in the quality of explaining (Observation 1) nor in the quality of understanding (Observation 2).
While analyzing the recorded and transcribed discussions between the Explainers and Receivers, we observed an interesting difference in the explaining approach between the Explainers of the two groups. Figure 7 illustrates the observed explaining approaches in the two groups. On the one hand, the Explainers of a TSD tended to explain the three modules of the MVC sequentially: first the model entities, then the controllers, and finally the views, following the order in which these modules are presented in the textual document. We think that this trend is intrinsically imposed by the nature of textual descriptions, where the knowledge is presented sequentially on a number of consecutive, ordered pages. On the other hand, the Explainers of the GSD had more freedom in explaining the design: Following their explaining preferences, they tended to jump back and forth between the three MVC modules when explaining the design. Based on this, we suggest that a GSD has an advantage over a TSD in unleashing the Explainers’ expressiveness when explaining the design, as well as in supporting navigation and providing a better overview of the design. However, developers might not have this advantage when explaining many GSDs (e.g., many UML diagrams) spread across different pages of a software design documentation.
We found that using a GSD is better than a TSD for recalling the details of the discussed design (Observation 3). This is in line with Meade et al. (2018), who suggest that drawing graphical notations brings more recall benefits than writing textual words, both in younger and older adults.
Graphical representations are considered better than textual ones at representing information that deals with relationships between entities (Völter 2011). One of the recall questions that we used to measure the recall ability of the participants is concerned with the relationships between the entities of the software architecture design. We compared the score (interval variable; min is 0 and max is 1 point) of the two groups on this question. On average, the users of the graphical representation were slightly better at recalling the relationships between the entities (G: Mean= 0,506; Std. Dev.= 0,331) vs. (T: Mean= 0,423; Std. Dev.= 0,347). However, this difference is not statistically significant (Sig.= 0,128 > 0,05; Hedges’ g = 0,244; Power= 0,338).
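The reported effect sizes can be reproduced from raw scores with a few lines of code. The sketch below computes Hedges’ g, i.e., Cohen’s d corrected by the small-sample factor J; the sample data is hypothetical and does not stem from our experiments.

```python
from statistics import mean, variance

def hedges_g(sample_a, sample_b):
    """Unbiased standardized mean difference (Hedges' g) of two
    independent samples. Illustrative sketch, not the exact routine
    used in our analysis."""
    n1, n2 = len(sample_a), len(sample_b)
    # Pooled standard deviation (statistics.variance uses n-1 denominators)
    pooled_sd = (((n1 - 1) * variance(sample_a)
                  + (n2 - 1) * variance(sample_b)) / (n1 + n2 - 2)) ** 0.5
    d = (mean(sample_a) - mean(sample_b)) / pooled_sd  # Cohen's d
    j = 1 - 3 / (4 * (n1 + n2) - 9)                    # small-sample correction
    return j * d

# Hypothetical per-participant scores for groups G and T (not our data)
g_scores = [0.9, 0.6, 0.4, 0.7, 0.5, 0.8]
t_scores = [0.5, 0.3, 0.6, 0.4, 0.2, 0.5]
print(round(hedges_g(g_scores, t_scores), 3))
```

As in our analysis, a positive g indicates that group G scored higher than group T; the correction factor J matters mostly for small samples such as this one.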
The Chinese Whispers game is often invoked as a metaphor for miscommunication. In this game, the first player often fails to recall all the information of the initial message that she/he receives. Likewise, the second player often fails to recall all the information of the message that she/he receives from the first player, and so on for the rest of the players. In the same manner, the Explainers in our experiments failed to recall all the design details that we asked for in the post-task questionnaire (Mean Score= 3,319; Std. Dev.= 0,855). The Receivers were, as expected, worse than the Explainers at recalling the design details (Mean Score= 2,492; Std. Dev.= 0,885). Moreover, we found that the difference in recall ability between Explainers and Receivers is statistically significant (Sig.= 0,000 < 0,05; Hedges’ g = 0,946; Power= 0,999).
Based on our empirical results, we find that a GSD fosters more Active Discussion (AD) than a TSD (Observation 4), while reducing Conversation Management (CM) at the same time (Observation 6). In the skill taxonomy of McManus and Aiken (1995), the communication activities in the AD category generally aim at supporting an active exploration of the discussed argument by encouraging requests for information, clarification, or elaboration. In contrast, the CM branch comprises communication activities that contribute less to active information requesting or clarification, such as acknowledging or coordinating group tasks. Consequently, we suggest that using a GSD as a basis for software design communication promotes an active exploration of the communicated designs, which in turn helps to improve the effectiveness of software design communication.
There is no significant difference in Creative Conflict (CC) discussions between group G using the GSD and group T using the TSD (Observation 5). We thus suggest that the type of design description does not influence design argumentation and reasoning. Instead, we think that the context, the complexity of the design, the available knowledge, or the application of reasoning techniques might affect the quality of design argumentation and reasoning discussions, as suggested by Tang et al. (2018).
It is widely assumed that model-based techniques support communicating software designs (Hutchinson et al. 2014). Our findings support this assumption and show that using a GSD improves the recall of the discussed design details, fosters Active Discussion, and at the same time reduces less useful conversation on activity management.
We conducted REP3 to better calibrate our findings on the differences between GSD and TSD. We found that a motivated (i.e., augmented with rationale) and cohesive TSD helps to enhance the recall of the design details and increases the amount of active discussion at the cost of reducing the perceived quality of explaining (Observation 7). This finding is in line with Tang et al. (2010), who stated that discussing the reasons for making software design choices (i.e., design rationale) positively contributes to the effectiveness of software design discussions by facilitating communication and design knowledge transfer. However, we found that adding more details (e.g., rationale) to the TSD adversely influences the perceived quality of explaining. One explanation for this effect is that the Explainers did not have enough time to explain the details of the Altered-TSD. For the same reason, the Receivers might have perceived that the Explainers did not go through the entire textual description when explaining the software design.
Threats to Validity
Our family of experiments is subject to threats to construct validity, internal validity (causality), external validity (generalizability), and conclusion validity. We highlight these issues and discuss related study design decisions.
Construct validity refers to how well operational measures represent what researchers intended them to represent in the study in question. In this study, we used a single method for measuring the impact of the different design representations per communication aspect. To mitigate this issue, we did not rely only on questionnaires, but also recorded, transcribed, and later evaluated the communication observed during the experiments. Nonetheless, leveraging additional methods to probe the explaining, understanding, recall, and interpersonal communication skills of the participants might help to better investigate the effects of different design representations. Such methods might, for instance, comprise conducting actual software design or software engineering tasks after receiving the explanation. However, this would introduce a multitude of other variables (e.g., the programming language or IDE used) that either can hardly be controlled or demand drastic simplification, thus reducing our experiments’ generalizability.
Another threat to construct validity could arise from discretizing the measurement of continuous properties, such as the participants’ familiarity with software design or their expertise with UML. This challenge has been investigated for balanced Likert scales and identified as not compromising generalizability (Ray 1982).
The questionnaires to evaluate the participants’ performance raise threats to internal validity themselves: For instance, the participants might interpret the Likert scales we used differently, might have avoided extreme responses (central tendency bias), and – as the participants evaluated their communication skills themselves – might be biased towards overestimating or underestimating their skills, depending on the quality of their introspection. To support comprehension and reproduction of results, we use established surveys where possible and provide all materials on the experiments’ companion website. Nonetheless, completely mitigating these general deficiencies of surveys requires the development of novel methods to test familiarity and understanding of UML designs and textual designs, as well as communication skills. While for the latter, specifically tailored exercises might be feasible to evaluate the skill level, conducting these (a) requires unbiased instruments as well and (b) might affect our experiments. A specific challenge of our family of experiments regarding the questionnaires arises from conducting the REP2 survey in French, whereas the other experiments used English documents. While this generally could affect the results, the experimenters of REP2 had the task documents and questionnaires professionally translated and reviewed to maintain the consistency of the communicated information.
To mitigate the effect of limited preparation and explanation time – the Explainers had 20 minutes to understand the design and 12 minutes to discuss it with the Receivers – we conducted multiple pilot studies at all sites prior to the actual experiments to understand how much time is required. Based on the feedback of the pilot participants, we increased the initially considered 10 minutes of discussion to 12. Afterwards, we conducted another pilot study that confirmed that both durations are suitable for the tasks.
Other challenges to internal validity stem from the selection of our experiments’ participants. A potential confounding factor is that, due to randomly assigning the participants to the G or T group, certain personality types could be prevalent in one of the groups – which could affect the results. By measuring the Big Five factors of personality (Donnellan et al. 2006), we checked that this is not the case: the distribution of the five personality factors (Extraversion, Agreeableness, Conscientiousness, Neuroticism, and Openness) is the same across the two groups.
Similarly, our findings could have been affected if the members of one of the two groups had significantly more experience with software design than the members of the other group. The pre-task questionnaire establishes that this is not a problem in our study. Other issues could have arisen from our participants being unfamiliar with UML designs or textual designs, but the pre-task questionnaire shows that this is not the case. We assume that this is due to the participants’ educational backgrounds (in which processing textual designs for exercises or exams is common).
The textual representations used in this research are structured by indentation, indexing, and grouping of information, which are helpful for information retrieval (Conversy 2014). However, this might have positively affected the quality of TSD communication. Similarly, the MVC entities in the graphical representation were highlighted by colors, which is also helpful for information retrieval (Conversy 2014) and might have positively affected the quality of GSD communication. If the descriptions of the entities in the TSD had been tangled and the entities of the GSD had not been colored, the quality of communication of these two representations might have been different and less efficient. As we used different enhancement techniques for the GSD and the TSD, it is possible that this affected the results of the comparison. Indeed, the augmentations to the textual representation might yield other (stronger or weaker) effects than the class diagram coloring. As both coloring in graphical models and structuring of textual design documents are common in industrial practice, we do not consider this a significant threat compared to using unstructured text and uncolored diagrams.
Some Receivers of the text group were drawing (informal) class diagrams while being explained to. Hence, there might be an interaction of both treatments, but with only six (2.5%) of the Receivers being affected, we consider the effect of this combination of both representations negligible.
Another threat might arise from using textual survey questions as the method to investigate the benefits of textual and graphical designs: textual design representations might have yielded better answers because they are syntactically closer to the textual answers than graphical designs are. This threat could be mitigated by leveraging graphical questions and answers in the surveys. While this would be feasible for the answers, formulating the questions as graphical class diagrams would entail a new syntax, which might yield further threats.
Threats to external validity indicate to which extent the results of our study can be generalized. As we work at software engineering research and education institutes, we selected students with strong software engineering backgrounds from our universities. While this prevents generalizing the results to software developers with different backgrounds (e.g., developers in computer vision, artificial intelligence, or robotics), software design targets software architecture, for which we expect strong software engineering backgrounds.
Also, we conducted our studies with students instead of software design practitioners. Hence, the participants involved in our experiments may not represent the general population of professional software engineering practitioners. While this prevents us from generalizing our findings to other subjects (i.e., domain experts, professional software architects, and industrial practitioners in the field), the differences between students and professional software developers in performing small tasks are generally very small (Höst et al. 2000). We therefore consider our findings a basis for extending our study to a larger community of software engineering practitioners.
Another threat is that the population of professional software developers spans a larger age range than students. With recall abilities changing over time (Craik 2019), this limits the generalization of our results to professional software developers of the same age range – between 20 and 30 years – as software engineering students and PhD students (as proposed in Falessi et al. (2018)).
Moreover, the studies were conducted in educational contexts, i.e., contexts in which the students usually are evaluated and graded. This generally might have improved their performance (Hawthorne effect). However, as it applies to both groups, it does not affect our results.
As our experiments were outlined as single one-hour sessions and set in an easily relatable, popular sports context, we can exclude threats regarding history or maturation: The participants could not have been affected by previous events of the experiment, as there were none.
Moreover, as we used the same two textual/graphical notations in all experiments, the generalizability of our results to other textual or graphical representations, i.e., differently structured text or differently highlighted class diagrams, is limited. This, however, is a threat independent of the specific choice of representation and calls for studies deploying multiple (popular) representations – which requires correctly identifying industrially relevant forms of representation and yields further threats to generalizability.
The use of a single case specification is a threat to the generalizability of the results. The size, topic, and complexity of the design case specification might affect the communication quality and the results of the comparison. This threat can be addressed by conducting replication studies with different design case specifications.
Another challenge to generalizability might arise from the constructs investigated, i.e., whether structured textual design documents and colored UML class diagrams actually are relevant to communicating design decisions in industry. While the use of UML in software design and engineering remains widespread in various domains (cf. Liebel et al. 2014; Wortmann et al. 2019), so is the use of textual documents to describe software designs (Casamayor et al. 2012; Wagner and Fernández 2015; Palomba et al. 2016). However, using a specific form of structured text for communicating design decisions limits generalizability to this form of text. For instance, in requirements engineering, different tools support capturing textual requirements and design decisions using different textual representations (Cant et al. 2006), and using these might entail different effects.
Generalizability might also be challenged by the size of the documents used for investigation. There are no studies on the number of classes per class diagram in industrial software engineering projects. However, a study of 101 class diagrams from 5 different courses reports a maximum of 40 classes per diagram, a minimum of 3, and an average of 10.75 (Wolf et al. 2013). This might indicate that our design class diagram of 28 classes is a bit more complex than would be usual in education (and hence more realistic regarding industrial challenges). Another study investigated 100 Android applications from open-source repositories (Shatnawi et al. 2015); here, only the average application size of 90 classes is reported. While this does not indicate how these classes would be distributed across class diagrams, assuming that they cover at least three different concerns (e.g., model, view, and controller) appears reasonable, which would entail 30 classes per class diagram on average and would be in line with the 28 classes presented in our experiment. Therefore, we consider the size of the experiments’ class diagrams relevant. For the textual design documents, we are unaware of any studies on their average size, but as they contain the same information as the class diagrams, which are of relevant size, we conclude that they should be as well. However, this needs further investigation and might challenge the generalizability of our results. Also, the number of classes conveyed in both representations might affect understanding and recall, which also demands further investigation.
Similar to the threat of using specifically indented and colored documents, the optimality of the representations might challenge the generalizability of our results, as it is conceivable that better-suited textual or graphical representations lead to different results. To the best of our knowledge, the best representations of textual design documents and graphical class diagrams still have to be identified, and whether these are optimal for any domain needs to be investigated. Nonetheless, differently presented textual or graphical designs might have yielded different effects. This, however, is a threat to generalizability that holds for any study investigating a finite number of alternative treatments where infinitely many are possible, and it needs to be considered when applying our results.
Also, the experimental conditions (scope, team size, duration, etc.) might differ from real-world conditions and limit the generalizability of the results. Nonetheless, especially in the use case of onboarding job newcomers by experienced developers and designers, this challenge is of practical interest, as indicated by Ericsson’s “Experience Engine” initiative (cf. Section 1).
Threats to conclusion validity challenge how reasonable a research or experimental conclusion is. In our study, these threats might arise mainly through concluding that a non-existing difference exists (type I error) or that an existing difference does not exist (type II error).
We conducted hypothesis testing to determine whether two independent samples have the same distribution. We might have committed a type I error by incorrectly rejecting the null hypothesis (false positive), or a type II error by incorrectly accepting the null hypothesis (false negative). However, we considered significance and minimized the risk of detecting a non-real effect by setting the α value to 0,05. Also, we analyzed the sensitivity by discussing the effect size and statistical power of our tests.
We underline that a small sample size yields low statistical power, which, in turn, increases the likelihood of making a type II error (accepting the null hypothesis when it is false). To mitigate this threat, we conducted a family of experiments that aims at maximizing the sample size with repeated measures, thus increasing the statistical power and precision of the results (Santos et al. 2018).
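The relation between effect size, sample size, and power underlying this argument can be sketched as follows. The snippet approximates the power of a two-sided two-sample test of a mean difference via the normal approximation; the group sizes are hypothetical, and an exact computation would use the noncentral t distribution instead.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(effect_size, n1, n2, alpha=0.05):
    """Approximate power of a two-sided two-sample test, using the
    normal approximation (illustrative sketch only)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)                # critical value, ~1.96 for alpha=0.05
    delta = effect_size * sqrt(n1 * n2 / (n1 + n2))  # noncentrality parameter
    # Probability of rejecting H0 in either tail under the alternative
    return z.cdf(delta - z_crit) + z.cdf(-delta - z_crit)

# With a small effect (g = 0,244, as reported above) and hypothetical
# groups of 80 participants each, power stays well below the usual 0.8:
print(round(approx_power(0.244, 80, 80), 3))
```

Increasing the sample size raises the power for a fixed effect size, which illustrates why pooling a family of experiments reduces the risk of a type II error.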
Using a GSD to communicate software designs produces more active discussion, less conversation management, and better recall. These effects contribute to deepening the active exploration of the discussed design (Guastello 1998), which is why we consider using a GSD beneficial to communicating software designs. For identifying design errors, textual descriptions seem to be more efficient than GSDs (Meliá et al. 2016). Our findings suggest the use of a GSD as a basis for communicating designs with the objective of transferring design knowledge, which is in line with the observed benefits of graphical documents on recall (Meade et al. 2018).
Our findings, however, assume that the textual design document accurately represents the GSD. Often, such natural language documents contain ambiguities or omit details; such omissions are less easily overlooked in graphical descriptions. We assume that this can be due to graphical descriptions, such as UML class diagrams, being amenable to model checking to identify, e.g., missing associations or missing types. Future work should investigate whether textual artifacts used in practice indeed represent the underlying design accurately.
With REP3, we investigated the effects of a cohesive and motivated TSD on design understanding, explaining, recall, and interpersonal communication. As we found a difference in explaining, recall, and active discussion between the two groups (TSD and Altered-TSD; Observation 7), future research on improving software design communication should also investigate the benefits of augmenting a GSD with textual motivation and rationale.
Generally, we found that communicating a design with a GSD yields better discussions and better recall. We believe that these effects are not limited to software design documents but transfer to graphical software descriptions in general: UML class diagrams meant for implementation, for instance, might differ in the level of detail, but not in the general representation. Applying our findings regarding the benefits of (i) GSD over TSD and (ii) cohesive TSD with rationale to other kinds of software artifacts can yield benefits for their communication and consumption as well. For instance, as requirements documents become more complex (Gralha et al. 2018), augmenting these with graphical representations or rationale could, ultimately, improve requirements engineering. Model-based systems engineering (Ramos et al. 2012) traditionally considers graphical representations. Nonetheless, similar improvements could be achieved there, as collaborating stakeholders from various domains could benefit from being provided the rationale of design decisions made in other domains.
There also is research on textual modeling (Hölldobler et al. 2018), which leverages textual models with well-defined semantics for software design and development. As such, these textual models lie in between GSD and TSD, and whether our results translate to textual software models, such as UML/P class diagrams (Rumpe 2017), needs further investigation.
Similarly, the observed benefits of GSD are subject to the viewpoint we selected, which allows presenting the complete design description (i.e., the model) on a single sheet of paper. For more complex diagrams, this might not scale up. However, we assume that the textual design document (currently three sheets of paper) scales up even worse. Consequently, we believe that the effects of the software design representation on large designs with hundreds or thousands of elements will be even more prominent.