Exploring Gender Bias in Remote Pair Programming among Software Engineering Students: The twincode Original Study and First External Replication

Context. Software Engineering (SE) has low female representation due to gender bias that men are better at programming. Pair programming (PP) is common in industry and can increase student interest in SE, especially women; but if gender bias affects PP, it may discourage women from joining the field. Objective. We explore gender bias in PP. In a remote setting where students cannot see their peers' gender, we study how perceived productivity, technical competency and collaboration/interaction behaviors of SE students vary by perceived gender of their remote partner. Method. We developed an online PP platform (twincode) with a collaborative editing window and a chat pane. Control group had no gender information about their partner, while treatment group saw a gendered avatar as a man or woman. Avatar gender was swapped between tasks to analyze 45 variables on collaborative coding behavior, chat utterances and questionnaire responses of 46 pairs in original study at the University of Seville and 23 pairs in the replication at the University of California, Berkeley. Results. No significant effect of gender bias treatment or interaction between perceived partner's gender and subject's gender in any variable in original study. In replication, significant effects with moderate to large sizes in four variables within experimental group comparing subjects' actions when partner was male vs female.


Introduction
Besides being widely used in industry, pair programming is becoming increasingly common in Software Engineering education because of its demonstrated positive influence on grades, class performance, confidence, productivity, and motivation to stay in Software Engineering and Computer Science academic majors [12], especially for women, as reported by [60].
In pair programming, two partners work closely together to solve a programming task, in which their ability to engage collaboratively with each other is essential. However, these collaborative interactions can be influenced by implicit gender bias [28], which is a widely observed phenomenon even in highly-structured and professional settings, such as those reported by [30] and [12], and which is based on the assumption that women are less technically competent than men [38]. Since research in the social sciences indicates that an individual's behavior is clearly affected by the behavior of their peers [17], we aim to explore how and whether gender bias affects the pair programming experience among Software Engineering students.
Our study is based on the hypothesis that gender bias will lead to observable differences based on subjects' perceptions of the gender of their pair programming partners, i.e. they will score men and women differently on similar tasks, and they will also behave and communicate differently depending on whether they perceive their partner as a man or as a woman, even though their partner remains the same on all tasks. Specifically, in a non-colocated, i.e. remote, pair programming setting in which peer gender cannot be directly observed, our goal is to identify the potential effects of gender bias by observing student pairs when the perceived gender of one of the peers changes.
To study our hypothesis, we have applied methodological triangulation [13], using several methods to collect data and approaching a complex phenomenon like human behavior from more than one standpoint [9]. In our case, three different data sources have been used: (1) questionnaires to measure changes in subjects' perceptions, (2) data collected automatically during the pair programming tasks to measure behavioral changes, and (3) data produced by several experimenters analyzing the message interchange during the pair programming tasks to measure changes in communication.
Assuming a remote pair programming setting, which has been shown to yield results similar to those of co-located pair programming, as reported by [53] and [3], our research questions with respect to subjects' perceptions are the following:
RQ 1 Does gender bias affect perceived productivity compared to solo programming? That is, do perceived differences between in-pair and solo productivity depend on the perceived partner's gender?
RQ 2 Does gender bias affect the partner's perceived technical competency compared to one's own technical competency? That is, do perceived differences between one's own and partners' technical competency depend on the perceived partner's gender?
RQ 3 Does gender bias affect the partner's perceived positive and negative aspects? That is, do perceived positive and negative aspects of their partners depend on the perceived partner's gender?
RQ 4 Does gender bias affect how partners' skills are compared? That is, do perceived partners' skills depend on the perceived partner's gender when they are compared?
With respect to the subjects' behavior during remote pair programming, assuming that gender bias could cause a subject to be more or less proactive on the programming task, or more or less verbose during chatting, our research question, based on what we can automatically measure, is the following:
RQ 5 Does gender bias affect the frequencies or relative frequencies with which each partner produces source code additions, source code deletions, successful validations, failed validations, and chat utterances? That is, do these frequencies depend on the perceived partner's gender?
Regarding subjects' communication during remote pair programming, we are interested in knowing whether gender bias affects how subjects communicate with their partners, i.e., whether they use a more formal or informal style, and whether they use some types of chat utterances more than others. Our related research questions are the following:
RQ 6 Does gender bias affect the relative frequency of formal and informal chat utterances? That is, does the formality of the messages depend on the perceived partner's gender?
RQ 7 Does gender bias affect the frequency or relative frequency of the different types of chat utterances? That is, do the frequencies of the different types of messages depend on the perceived partner's gender?

The twincode platform
To support our study, we have developed the twincode remote pair programming platform [18], which manages (i) the registration of students collecting demographic data; (ii) the random allocation into experimental and control groups balancing gender proportions, i.e. trying to have the same number of persons of the same gender in both groups; (iii) the random allocation into experimental-control pairs; (iv) the random assignment of programming exercises to individual subjects and pairs; (v) the swapping of gendered avatars between pair programming exercises for those subjects in the experimental group; and (vi) the automatic collection of interaction metrics and chat utterances.
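Steps (ii) and (iii) above can be illustrated with a minimal Python sketch. It is not the twincode implementation: the function name `allocate`, the tuple-based subject representation, and the rule of always feeding the currently smaller group are assumptions made only for illustration.

```python
import random
from collections import defaultdict


def allocate(subjects, seed=None):
    """Split (subject_id, gender) tuples into gender-balanced groups and pairs.

    Hypothetical helper: the real platform may balance groups differently.
    """
    rng = random.Random(seed)

    # Group subject ids by self-reported gender (one stratum per gender).
    by_gender = defaultdict(list)
    for subject_id, gender in subjects:
        by_gender[gender].append(subject_id)

    control, experimental = [], []
    for gender in sorted(by_gender):
        members = by_gender[gender]
        rng.shuffle(members)  # random order within each stratum
        for subject_id in members:
            # Feed the currently smaller group, so that group sizes and
            # per-gender counts differ by at most one.
            if len(control) <= len(experimental):
                control.append(subject_id)
            else:
                experimental.append(subject_id)

    # Random control-experimental pairs: shuffle both groups and zip them.
    rng.shuffle(control)
    rng.shuffle(experimental)
    pairs = list(zip(control, experimental))
    return control, experimental, pairs
```

With an odd number of subjects, `zip` leaves one subject unpaired; how the platform handles that case is not described here.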
As shown in Figure 1, twincode offers a source code editor where the students concurrently develop the solution to a proposed programming exercise in JavaScript and can validate it against several test cases. Note that, to foster communication, only one partner at a time can validate the source code and see the validation results, which should be communicated to the other partner using the chat window, where they are instructed to collaborate to solve the proposed exercises. Note also that a gendered avatar is displayed only for the student in the experimental group (see Figure 1(a)) but not for the one in the control group (see Figure 1(b)).
Experimenters can use twincode to create new experimental sessions where they can configure, among other aspects, the type, number, and duration of the programming exercises, and the instructional messages shown to the students. If needed, they can also develop new programming exercises and their corresponding test cases.
The twincode platform is in continuous evolution, and several improvements were incorporated to satisfy requirements that emerged during our study, such as allowing the use of Python as an alternative programming language to JavaScript for the programming exercises, changing the images used as gendered avatars (see Figure 9), and improving the user interface with instructions and a gendered message in the chat window (see Figure 16).
As a companion tool to twincode, we have also developed tag-a-chat, a tool that helps experimenters code chat utterances using different sets of tags, as shown in Figure 17 in Appendix B. To assist experimenters during the training stage of the coding, tag-a-chat automatically computes metrics such as Cohen's kappa (for two coders) and Fleiss's kappa (for three or more coders) in those dialogs that are being coded by several experimenters to achieve inter-coder reliability assessment [42,55].

Pilot Studies
After presenting a very initial approach to our study [2], and to get early feedback on (i) the comprehensibility and internal consistency of the scales used in the questionnaires; (ii) the usability and performance of the twincode platform; and (iii) the applicability of the chat utterance coding based on the one proposed by [44] and shown in Table 1, two pilot studies with a limited number of students were carried out at the University of Seville and University of California, Berkeley (UC Berkeley) during the 2020-21 academic year.
As a result, the questionnaires were reorganized into three scales that were assessed for internal consistency (see Appendix A), the initial set of chat utterance codes was augmented with formality codes, and the performance and reliability of the twincode platform were improved.

Other Gender Identities
While we recognize that many Software Engineering students may not identify as either men or women, our initial exploration focuses primarily on interactions between students who identify as one of these. The potential biases in interactions involving gender-fluid, gender-nonconforming, and nonbinary students are a complex topic deserving its own subsequent study.

Structure of the Paper
The rest of the paper is organized as follows. Section 2 reviews related work, although to our knowledge, this is the first study specifically focusing on the impact of gender bias within pairs in pair programming. Sections 3 and 4 describe the original study carried out at the University of Seville (December 2021) and its first external replication performed at UC Berkeley (May 2022) respectively. Section 5 discusses the two studies and the threats to their experimental validity. Finally, Section 6 draws conclusions and proposes future work.
Table 1 Chat utterance tags by [44] augmented with orthogonal informal/formal tags

Related Work
Several systematic literature reviews (SLRs), which are summarized in Table 2, have compiled the empirical research on pair programming in higher education, including [12], which is focused on distributed pair programming from a teaching perspective. The SLR by [46] reveals that the most important factor under study is solo versus pair programming in terms of effectiveness, quality of code, and satisfaction while students are programming, concluding that pair programming is more effective and satisfactory than solo programming. However, with respect to quality, findings are inconclusive.
Other SLRs, such as the ones by [25], [32], and [27], show that the focus of the studies has broadened to include factors such as personality, motivation, problem solving, troubleshooting, efficiency, confidence, self-esteem, skill level, gender, or enjoyment, but not gender bias. In general, students rate pair programming positively compared to solo programming. Nevertheless, pair programming is effective but not always efficient, as it may take longer. By means of controlled experiments, remote and co-located pair programming are compared by [53] and [3], showing similar results. In most cases, the analyzed variables are related to performance in terms of time, quality, or code tests passed. Students' perceptions have also been analyzed in terms of confidence, satisfaction, motivation, or personality by [48].
Regarding primary studies, Table 3 summarizes the empirical studies on the influence of gender in pair programming, including findings such as (i) same-gender pairs are more "democratic"; (ii) women working in pairs were more confident than those working solo; and (iii) in mixed-gender pairings, women are less confident compared to same-gender pairings, and report no increase in enjoyment for pair programming compared to solo programming, an effect that is significantly observed in men [32]. Although such studies reveal that gender seems to be a key factor, none of them study gender bias in pair programming.
Many factors other than gender may affect the outcomes of remote programming sessions [6,56]. Previous research on productive pairing looked at factors such as skill levels, autonomy in choosing one's partner [62], and different personalities [26]. Nevertheless, the work on gender composition of pairs found conflicting results about whether same-gender or mixed-gender pairings are more effective [7,8,28,33]. One possible explanation is that gender correlates with other dimensions that may affect the pairs' collaboration, but these correlations may vary between different environments. For example, women in a class may, on average, have a higher skill level than men because they had to face more societal barriers to enter the class. On the other hand, they may, on average, have a lower skill level if women with no background are more actively recruited.

Original Study (Seville Dec, 2021)
This section reports the original study carried out at the University of Seville in December 2021, including the experimental settings it shares with the external replication performed at UC Berkeley in May 2022, which is reported in Section 4.

Participants
In the original study carried out at the University of Seville in December 2021, the participants were third-year students of the Degree in Software Engineering enrolled in any of the three groups of the Requirements Engineering course taught in Spanish. The final number of valid subjects was 92, arranged in 46 pairs. Only 9 students could not finish the study because of technical problems during the tasks. Considering the 92 valid subjects, 15 identified as women (16.30%), 1 as non-binary (1.09%), and the rest as men (82.61%) during the registration process.

Table 3 Summary of primary studies on gender and pair programming in chronological order
[31]
Object of study: Compatibility of student pair programmers
Metrics: Web-based peer evaluation survey that required the students to evaluate the contributions of their partner and the perceived pair compatibility
Findings: Students are compatible with partners whom they perceive to be of similar skill. Mixed-gender pairs are less likely to report compatibility.
[50]
Object of study: Effect of personality heterogeneity on PP effectiveness
Metrics: The Keirsey Temperament Sorter personality test; PP effectiveness is measured by output/performance, communication, velocity, design correctness, passed acceptance tests; pair collaboration-viability is measured by satisfaction, knowledge acquisition and participation.
Findings: Heterogeneous personality pairs show better communication, pair performance and pair collaboration-viability than homogeneous pairs. For heterogeneous pairs, design and code correctness is positively correlated with communication transactions (more communication leads to higher correctness), and satisfaction regarding collaboration, knowledge acquisition and participation was significantly higher.
[48]
Object of study: Personality traits on PP effectiveness
Metrics: Five Factor Model (FFM2): Conscientiousness, Neuroticism, and Openness to experience
Findings: Only openness has a significant role in differentiating paired students' academic performance.
[8]
Object of study: PP gender combinations
Metrics: Productivity, quality of source code, compatibility and communication between pairs
Findings: Pair compatibility and communication levels significantly vary between the same gender pair type, woman-woman and man-man.
[21]
Object of study: PP gender combinations
Metrics: Productivity
Findings: Similar productivity rates for the three gender pair combinations. Greater variability of productivity rates with mixed gender pairs (man-woman) was observed.
[30]
Object of study: PP gender combinations
Metrics: Weekly attendance, work accomplished during lab and perceived productivity
Findings: Students who were randomly assigned a woman partner (rather than a man) attended classes more often, were more confident that the solution was correct, and more confident in the finished product that they developed. However, being assigned a woman partner was also associated with completing a smaller percentage of the assignment.
[64]
Object of study: Effect of structured roles in PP, motivation and stress for men and women
Metrics: Lexical features (number of messages, message length, sentiment), Intrinsic Motivation Inventory score (IMI) measuring Interest/Enjoyment, Perceived Competence, Effort/Importance, Pressure/Tension, Perceived Choice, Value/Usefulness, and Relatedness; Self-reported stress, perceived competence, perceived choice, learning gain
Findings: No significant differences found between structured vs unstructured PP roles. Women reported significantly higher levels of stress, lower levels of perceived competence in their computing abilities and less perceived choice compared to men during a remote PP activity. Dialogue features significantly correlated with women's reports of stress, perceived competence, or perceived choice. Women tended to feel more relaxed if their partner sent longer messages on average or used more positive language.
[63]
Object of study: Analyzing the differences between women and men's awareness of CS gender gap
Metrics: Survey (which included six questions related to the gender gap in CS), and some follow-up interviews discussing the experiences and perceptions of CS gender gap.
Findings: Men were less aware, had milder beliefs and shallow understanding of the gender disparities in computer science. Women were significantly more aware of the gender gap and felt significantly stronger that efforts should be made to reduce the gender gap. Some participants also expressed discomfort at the idea of opportunities for women within CS because they did not think that those were fair; these students would benefit from understanding the idea of equity over equality.
[20]
Object of study: Young learners in remote compared to co-located PP
Metrics: Perception and experiences by means of remote collaboration logs, interviews and self evaluation survey
Findings: Students felt successful in the remote approach, had positive experiences with collaboration, and reported that remote PP made them more autonomous and efficient in navigation compared to co-located PP. Furthermore, students recognized that those who partner with friends become more confident throughout the learning process (vs non-friend partners).
Note that, although the percentage of women is low, it is above the average percentage in the Degree in Software Engineering at the University of Seville, which unfortunately is close to 11% according to the last academic year official statistics [59]. Note also that, because of the 9 students dropped for technical reasons, the percentage of women in the control (6 women, 14.29%) and experimental (9 women, 19.57%) groups could not be kept the same as in the sample (16.30%), which was our initial intention.

Experiment Execution
Some weeks before experiment execution, in order to recruit participants, the students enrolled in the three groups of the Requirements Engineering course taught in Spanish were encouraged to participate voluntarily in the study as an interesting experience in remote pair programming, without mentioning either that the main goal was to study the potential effect of gender bias or that they were going to be paired with the same classmate throughout the study. We also stressed that, for the purpose of the study, they had to remain anonymous to their partners, so they were to neither mention nor ask for any personal information; this also kept them from discovering that their partner was always the same person. After providing all that information, including that the participation in the study counted for a 5% bonus on their grades to prevent dropout, the interested students registered in the twincode platform providing some demographic data and accepting the participation conditions.
The experiment execution, which is graphically represented in Figures 2 and 3, took place the same day for the three groups of students of the course during their laboratory sessions, as shown in Figure 4.
All registered students logged into the twincode platform, which automatically allocated them into the control and experimental groups balancing the proportion of women in each group as much as possible.Once all the students were allocated to groups, they were randomly allocated into control-experimental pairs by the platform (see Figure 2).
After subject allocation, the pairs were presented with a programming exercise that they had to solve collaboratively using twincode (labeled as Task#1 in Figure 3). They were given 10 minutes to solve a first exercise and another 10 minutes to solve a second exercise, for a total of 20 minutes. After the first 10-minute period, the second exercise was presented independently of whether the first one was finished successfully or not. Both exercises were randomly selected from a pool of exercises of similar complexity. During this programming exercise in pairs, subjects in the control group received no information about the gender of their partners, whereas subjects in the experimental group could see their partners as having a clearly gendered avatar randomly selected by the platform (see Figure 1). At the end of the 20-minute period, they were asked to individually fill in a questionnaire (labeled as Quest.#1 in Figure 3) about the perceived productivity compared to solo programming, the perceived partner's technical competency compared to their own, and about the positive and negative aspects of their partner.
After filling in the first questionnaire, the students were presented with another programming exercise to be solved individually in 10 minutes (labeled as Task#2 in Figure 3). If they finished earlier, another exercise of similar complexity was randomly presented. The main purpose of this individual task was to make students forget about their first partners, i.e. their style of writing chat utterances or source code, so they did not recognize them in the second in-pair task.
After the individual task, pairs were again presented with a new collaborative programming exercise that they had to solve under conditions similar to those of the exercise in Task#1. In this second in-pair exercise, the gendered avatar was swapped with respect to the first exercise for the subjects in the experimental group. Those in the control group continued to receive no information on their partners' genders. Note that pairs were kept the same in order to reduce the variability due to the subjects themselves, which could possibly have had a confounding effect in case of a new pair allocation for Task#3 (see Section 3.5.1 for details).
Once Task#3 was finished, students were asked to fill in a questionnaire (labeled as Quest.#2 in Figure 3) with the same questions as the one they filled in after Task#1 but referring to the second partner, and another questionnaire (labeled as Quest.#3 in Figure 3) comparing the skills of the first and second partners and asking whether they remembered their partners' avatars. Finally, they were informed about the actual purpose of the study. At that point, they were allowed to withdraw their data if they wished, although none of them opted to do so.

Factors (Independent Variables)
The four factors, i.e., independent variables, in both the original experiment and the replication are the following.
group nominal factor representing the group (experimental or control) subjects were randomly allocated to.
time nominal factor representing the moment (t1 and t2) in which the first and second in-pair tasks were performed by the subjects.
ipgender nominal factor representing the induced partner's binary gender (man or woman for the experimental group, and none for the control group) during the in-pair tasks.
gender nominal factor representing subject's gender, which may be man, woman, or any other option as freely expressed in the demographic form during registration.

Response Variables (Dependent Variables)
The response variables, i.e., dependent variables, in both studies are described below, organized according to the corresponding three data sources: questionnaires, twincode platform, and chat utterance coding.

Perceived Variables (Questionnaires)
The response variables measuring subjects' perception are mainly scales composed of four or more 0-10 linear numerical response items and they are computed as the average of their corresponding items. Following the recommendations by [29], the 0-10 items are labeled not only in the first and last points, but also in the midpoint (see Figure 5). They are described below.
pp interval variable composed of four 0-10 numerical response items (pp1...pp4) measuring the subject's own perceived productivity during each pair programming task compared to solo programming (see RQ 1). Low values correspond to better solo programming productivity whereas high values correspond to better pair programming productivity (see Figure 5 for an example of a response item and Section A.1 in the Appendix for all the response items in the scale).
pptc interval variable composed of four 0-10 numerical response items (pptc1...pptc4) measuring the subject's partner's perceived technical competency compared to their own after each in-pair task (see RQ 2). Low values correspond to higher technical competency of the subject, whereas high values correspond to higher technical competency of the partner (see Section A.2 in the Appendix for all the response items).
ppa ratio variable counting the number of partner's positive aspects identified by the subject after each in-pair task (see RQ 3). This variable is automatically computed from an open question item in which subjects are asked to write the most positive and negative aspects of their partners in the previously performed pair programming exercise (see Section A.3 in the Appendix). They are instructed to prefix positive aspects with a plus sign (+) and negative ones with a minus sign (-). This variable is the result of automatically counting the number of plus signs in the text of the open question.
pna ratio variable counting the number of partner's negative aspects identified by the subject after each in-pair task (see RQ 3 ).In a similar way to the ppa variable, this variable is the result of automatically counting the number of minus signs in the text of the aforementioned open question (see also Section A.3 in the Appendix).
ppgender nominal variable measuring the perceived partner's gender during the in-pair tasks. To measure this variable, subjects are asked in questionnaire #3 whether they remember if their partners showed some avatars in chat windows or not. If the answer is no or I don't remember (idr), this variable is assigned the none or idr levels at t1 and t2. If the answer is yes, then the subjects are asked for the avatars of the first and second partner, having man, woman, or idr as options, as shown in Figure 6.
cps interval variable composed of five 0-10 numerical response items (cps1...cps5) measuring whether the subject perceived better skills in their first or second partner in the in-pair tasks, i.e., compared partners' skills (see RQ 4). Low values correspond to the first partner, whereas high values correspond to the second partner (see Section A.4 in the Appendix for all the response items).
In the case of the experimental group only, this variable is transformed after collection in such a way that low values correspond to the partner for whom the induced gender was man, and high values to the partner for whom the induced gender was woman, in order to analyze whether there is a gender bias in the scoring.
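Two of the computations described above, the counting behind ppa and pna and the re-orientation of cps for the experimental group, can be sketched in Python as follows. Both functions are hypothetical illustrations rather than the study's scripts. Two assumptions go beyond the text: a sign is counted only when it starts the answer or follows whitespace (so that hyphens inside words are not taken as minus signs), and the cps transformation is a reflection of the 0-10 scale (10 minus the score) whenever the first partner's induced gender was woman.

```python
import re


def count_aspects(answer: str) -> tuple[int, int]:
    """Return (ppa, pna): counts of '+' and '-' prefixed aspects in an answer.

    Assumption: a sign counts only at the start of the text or after
    whitespace, which is stricter than counting every sign character.
    """
    ppa = len(re.findall(r"(?:^|\s)\+", answer))
    pna = len(re.findall(r"(?:^|\s)-", answer))
    return ppa, pna


def orient_cps(score: float, first_partner_induced_gender: str) -> float:
    """Re-orient a 0-10 cps score so that low = man and high = woman.

    Assumption: raw scores are low for the first partner and high for the
    second, and the re-orientation is a reflection of the scale (10 - score).
    """
    if first_partner_induced_gender == "man":
        return score  # first partner was shown as a man: already oriented
    if first_partner_induced_gender == "woman":
        return 10.0 - score  # first partner was shown as a woman: reflect
    raise ValueError("induced gender must be 'man' or 'woman'")
```

For example, a raw cps of 3 (leaning towards the first partner) from a subject whose first partner was shown as a woman becomes 7 after the transformation, i.e., leaning towards the partner perceived as a woman.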

Behavior-Related Variables (twincode Platform)
The response variables automatically collected by the twincode platform and related to the behavior during the in-pair programming exercises (see RQ 5) are listed below. Every variable v represents a frequency, i.e., a count, and its associated relative frequency is computed with respect to the sum of the frequencies of the two subjects in a pair. For example, let us suppose that subjects i and j are the two members of a pair, and $v_i$ and $v_j$ are the corresponding values of the v variable. In this case, the relative frequencies for each subject would be $v_i / (v_i + v_j)$ and $v_j / (v_i + v_j)$, respectively.
sca / sca_rf Ratio scale variables representing the count and relative frequency of characters added by a subject to the source code window during an in-pair task (source code additions).
scd / scd_rf Ratio scale variables representing the count and relative frequency of characters deleted by a subject from the source code window during an in-pair task (source code deletions).
okv / okv_rf Ratio scale variables representing the count and relative frequency of successful (ok) validations of the source code performed by a subject during an in-pair task.
dm / dm_rf Ratio scale variables representing the count and relative frequency of dialog messages (chat utterances) sent by a subject during an in-pair task.
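The relative frequencies defined above can be computed as in the following minimal Python sketch. The function name is hypothetical, and returning NaN when neither partner produced any event (a zero denominator) is an assumption, since the text does not say how that case is treated.

```python
import math


def relative_frequencies(v_i: int, v_j: int) -> tuple[float, float]:
    """Return (v_i / (v_i + v_j), v_j / (v_i + v_j)) for the two pair members."""
    total = v_i + v_j
    if total == 0:
        # Assumption: the relative frequency is undefined when the pair
        # produced no events of this kind.
        return math.nan, math.nan
    return v_i / total, v_j / total
```

For instance, if one subject adds 30 characters and their partner adds 10, their sca_rf values are 0.75 and 0.25, which always sum to 1 within a pair.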

Communication-Related Variables (Utterance Tagging)
The chat utterances registered in the twincode platform during the in-pair tasks were manually tagged according to two orthogonal dimensions. The first dimension uses the 13 tags (from S to O in Table 1) proposed by [44]. The second dimension classifies each message as formal or informal, considering as formal the usual way in which a university student would communicate textually to a professor and informal otherwise.
For the tagging, we followed a process inspired by the work of [42], in which two researchers each tagged 60% of the data, covering all dialogue messages. The overlapping subset of 20%, which was used for the initial training, established the inter-coder reliability using Cohen's kappa, which was κ = 0.796 for the formal/informal tags, and κ = 0.754 for the Rodríguez et al. tags, both indicating substantial agreement and sufficient reliability for further coding according to [55].
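For reference, Cohen's kappa for two coders is $\kappa = (p_o - p_e) / (1 - p_e)$, where $p_o$ is the observed proportion of agreement and $p_e$ is the agreement expected by chance from each coder's tag distribution. The following Python sketch computes it from two equally long tag sequences; it is a generic illustration, not the tag-a-chat code, and the function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Cohen's kappa for two coders who tagged the same utterances."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("both coders must tag the same non-empty set of items")
    n = len(coder_a)

    # Observed agreement: proportion of items given the same tag.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n

    # Chance agreement: product of the coders' marginal tag proportions.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)

    if p_e == 1.0:
        return 1.0  # both coders used one identical tag for every item
    return (p_o - p_e) / (1 - p_e)
```

As a small check, with coder A tagging four messages as f, f, i, i and coder B as f, i, i, i, the observed agreement is 0.75, the chance agreement is 0.5, and kappa is 0.5.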
The response variables related to the manual tagging of the chat utterances (see RQ 6 and RQ 7 ) correspond to the tags in Table 1 and are listed below.Every variable represents a frequency, i.e., a count, and its associated relative frequency is computed with respect to the number of chat utterances generated by the subject during an in-pair task, which is defined by the dm variable specified in previous section.
i / i rf Ratio scale variables representing the absolute and relative frequency of informal messages generated by a subject during an in-pair task.
f / f rf Ratio scale variables representing the absolute and relative frequency of formal messages generated by a subject during an in-pair task.
s / s rf Ratio scale variables representing the absolute and relative frequency of statement of information or explanation messages generated by a subject during an in-pair task.
m / m rf Ratio scale variables representing the absolute and relative frequency of meta-comment or reflection messages generated by a subject during an in-pair task.
qyn / qyn rf Ratio scale variables representing the absolute and relative frequency of yes/no question messages generated by a subject during an in-pair task.
qwh / qwh rf Ratio scale variables representing the absolute and relative frequency of wh-question (who, what, where, when, why, and how) messages generated by a subject during an in-pair task.
ayn / ayn rf Ratio scale variables representing the absolute and relative frequency of answer to yes/no question messages generated by a subject during an in-pair task.
awh / awh rf Ratio scale variables representing the absolute and relative frequency of answer to wh-question messages generated by a subject during an in-pair task.
fp / fp rf Ratio scale variables representing the absolute and relative frequency of positive task feedback messages generated by a subject during an in-pair task.
fnon / fnon rf Ratio scale variables representing the absolute and relative frequency of non-positive task feedback messages generated by a subject during an in-pair task.
o / o rf Ratio scale variables representing the absolute and relative frequency of off-task messages generated by a subject during an in-pair task.
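The relationship between each absolute frequency and its relative counterpart can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `tag_frequencies`, the tag list, and the convention of returning 0.0 when a subject wrote no messages are all hypothetical, and only one tag set is handled at a time.

```python
from collections import Counter

def tag_frequencies(tags, all_tags):
    """Return {tag: (count, relative frequency)} for one subject in one in-pair task."""
    dm = len(tags)  # number of chat utterances by the subject (the dm variable)
    counts = Counter(tags)
    # Assumption: the relative frequency is set to 0.0 when dm is zero.
    return {t: (counts[t], counts[t] / dm if dm else 0.0) for t in all_tags}

# Hypothetical tags assigned to the five utterances of one subject.
utterance_tags = ["s", "qyn", "ayn", "s", "o"]
frequencies = tag_frequencies(utterance_tags, ["s", "m", "qyn", "ayn", "o"])

s_count, s_rf = frequencies["s"]
print(s_count, s_rf)
```

Here `s_count` plays the role of the s variable and `s_rf` the role of s rf for that subject and task.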

Confounding Variables
The confounding variables that were controlled during both studies are described below.

Subject's technical skills
To control the variability caused by each subject on their partner, pairs were kept the same during the entire experiment, although the subjects were not informed about this fact. Ideally, this would make the conditions of the two in-pair tasks the same except for the programming exercises (see below) and for the induced gender in the case of the experimental group.

Programming exercises
In order to avoid potential differences among the programming exercises used during in-pair tasks, they were all of similar complexity and were randomly assigned.

Data Analysis
The data analysis was performed only for those subjects considered as valid according to the following criteria: (i) to have filled in both questionnaires; (ii) to have their metrics correctly collected by the twincode platform; (iii) to have been paired with another valid subject; and (iv) not to have disclosed their gender or their partner's during the in-pair exercises. This resulted in 46 pairs, i.e. 92 valid subjects, with only 9 subjects dropped because of technical problems with their connections to the twincode platform, as previously mentioned in Section 3.1.

Correlation between Induced and Perceived Gender
Before analyzing between-groups and within-groups relationships, the correlation of the induced and perceived gender in both groups was analyzed in order to know whether the treatment had been effectively administered to the subjects 7. For that purpose, the results of the contingency table in Table 4 were analyzed, observing that the percentage of subjects who were induced to think that their partner was a man and who actually remembered seeing a man avatar was close to 61%, whereas in the case of woman avatars the percentage was close to 59%. Although Cramer's V for Table 4 showed a large effect (0.709) according to [23], we decided to exclude from the remaining analyses those subjects in the experimental group for whom the induced and perceived gender did not match, because we considered that the treatment had not been sufficiently effective in their cases 8. On the other hand, we kept those subjects in the control group who did not perceive any gendered avatar or did not remember it, discarding the rest. As a result, we kept all the subjects in the control group (39 men, 6 women, 1 non-binary) but only 27 (21 men, 6 women) in the experimental group.
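For readers unfamiliar with the association measure used above, the following minimal sketch computes Cramer's V from a contingency table of induced versus perceived gender. The function name `cramers_v` and the counts are hypothetical (they are not the values of Table 4), and the sketch assumes that no row or column total is zero.

```python
import math

def cramers_v(table):
    """Cramer's V for a contingency table given as a list of rows of counts."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    # Pearson chi-square statistic: observed vs. expected counts under independence.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(table[0]))
    return math.sqrt(chi2 / (n * (k - 1)))

# Hypothetical counts: rows = induced gender (man, woman),
# columns = perceived gender (man, woman, none or not remembered).
induced_vs_perceived = [
    [14, 2, 7],
    [3, 13, 7],
]
v = cramers_v(induced_vs_perceived)
print(round(v, 3))
```

A value of 0 indicates no association between induced and perceived gender, and a value of 1 indicates a perfect match.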

Between-groups Analysis
In the analysis between the control and experimental groups, for every response variable v except for cps 9, we computed the distance between the two in-pair tasks as the absolute value of the difference, i.e. $|v(t_2) - v(t_1)|$, since the sign of that difference was not relevant in our case. In our research hypothesis, this distance should be smaller for the students in the control group, who received no information about their partners' genders, i.e. no treatment, than for those in the experimental group who effectively perceived two different partners' genders at $t_1$ and $t_2$. Therefore, for every response variable except for cps, we performed a one-tailed unpaired mean difference test between groups, applying a t-test or a Mann-Whitney U test (also known as the Wilcoxon rank-sum test), depending on the results of the normality assumption tests.
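The distance computation and the rank-based comparison described above can be sketched as follows. This is a simplified illustration, not the analysis script of the study: the data are hypothetical, the helper names are our own, and only the Mann-Whitney U statistic is computed; the normality check and the p-value would normally be obtained with a statistics library.

```python
def task_distance(v_t1, v_t2):
    """Absolute difference of a response variable between the two in-pair tasks."""
    return abs(v_t2 - v_t1)

def mann_whitney_u(x, y):
    """U statistic of sample x against sample y (ties count as 0.5)."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

# Hypothetical (t1, t2) measurements of one response variable per subject.
control = [(3, 4), (5, 5), (2, 3)]
experimental = [(3, 6), (1, 5), (4, 4)]

control_distances = [task_distance(t1, t2) for t1, t2 in control]
experimental_distances = [task_distance(t1, t2) for t1, t2 in experimental]

# Under the research hypothesis, distances in the experimental group tend to be
# larger, so its U statistic should exceed half the number of pairs.
u_experimental = mann_whitney_u(experimental_distances, control_distances)
u_control = mann_whitney_u(control_distances, experimental_distances)
print(u_experimental, u_control)
```

The two U statistics always add up to the product of the group sizes, which gives a quick sanity check of the computation.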
In the case of the cps variable, for the control group we expected the mean to be closer to the middle point (5) between the first and second partner, as they were unconsciously comparing the skills of the same person. For the experimental group, we expected the mean to be skewed towards 0 (partner perceived as a man) or 10 (partner perceived as a woman) due to the effect of the treatment. Therefore, to detect differences between groups for the cps response variable, we performed an unpaired two-tailed t-test because the data distribution did not differ significantly from a normal distribution.
Contrary to our research hypothesis, no significant differences were observed at α=0.05 between the control and experimental groups for any of the 45 response variables described in Section 3.4, including cps. The corresponding boxplots are depicted in Figure 7, where it can be seen that the differences between means (the circles in the boxes) in both groups were very small.

Within-groups Analysis
Within the experimental group, we wanted to analyze whether there were differences between the response variables when the same subjects perceived their partners as men or women according to our research hypothesis. We also wanted to study the possible interaction between the perceived partner's gender and the subject's gender.
For those purposes, we performed a two-sided paired mean difference test for every response variable except for cps, using the perceived gender (ppgender) as a within-subjects variable, and applying a t-test or a Wilcoxon test depending on the results of the normality assumption tests.For studying the interaction, we performed the corresponding mixed-model two-way ANOVAs with the perceived gender (ppgender) as a within-subjects variable and the subject's gender (gender) as a between-subjects variable.
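As a hedged illustration of the paired comparison just described, the sketch below computes the paired t statistic for one response variable measured on the same subjects under both levels of ppgender. It also computes one common effect-size variant for paired designs (the mean difference divided by the standard deviation of the differences); whether this is the variant of Cohen's d used in the study is an assumption. The data and the function name `paired_t_and_d` are hypothetical.

```python
import math
import statistics

def paired_t_and_d(values_man, values_woman):
    """Paired t statistic, effect size d_z and degrees of freedom."""
    diffs = [m - w for m, w in zip(values_man, values_woman)]
    n = len(diffs)
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation of the differences
    t_stat = mean_diff / (sd_diff / math.sqrt(n))
    d_z = mean_diff / sd_diff  # assumption: paired-samples variant of Cohen's d
    return t_stat, d_z, n - 1

# Hypothetical counts of one response variable for four subjects, measured when
# the partner was perceived as a man and as a woman, respectively.
perceived_man = [5, 4, 6, 7]
perceived_woman = [3, 3, 4, 4]

t_stat, d_z, dof = paired_t_and_d(perceived_man, perceived_woman)
print(round(t_stat, 3), round(d_z, 3), dof)
```

The two-sided p-value would then be read from a t distribution with `dof` degrees of freedom, which is omitted here to keep the sketch free of external dependencies.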
For the cps variable, which passed the Shapiro-Wilk normality tests, we analyzed whether the subject's gender had any effect when comparing partners perceived as a man or a woman by means of a two-tailed unpaired t-test between groups, using gender as a between-subjects variable.
Contrary to our research hypothesis, no significant differences were observed at α=0.05 between the two levels of the ppgender variable for any of the 44 response variables described in Section 3.4. None of the 44 ANOVA tests detected any significant interaction either, and no effect of the subject's gender on the cps variable was detected.
As depicted in Figure 8, the corresponding boxplots show very small differences between means when partners are perceived as men or women in the experimental group.

First Replication (Berkeley May, 2022)
In this section, the first replication, carried out at the University of California, Berkeley in May 2022, is reported, focusing mainly on the changes in the participants and the experiment execution with respect to the original experiment, since the research questions and variables were the same in both studies. For each change, an estimation of its impact on the four types of experimental validity described by [61] is included, following the recommendations by [11] about reporting the impact of changes in replications using a 7-point discrete scale from −3 to +3. A summary of the impact of those changes is presented in Table 5, including the labels of the aforementioned scale in its legend.

Table 5 Estimated effects on experimental validity of the changes introduced in the replication

Participants
In the replication carried out at the University of California, Berkeley, the participants were mainly first-year students enrolled in the CS61A (The Structure and Interpretation of Computer Programs) and CS88 (Computational Structures in Data Science) courses. Applying the same criteria as for the original experiment, the final number of valid subjects was 46, arranged in 23 pairs. Only 6 students, i.e. 3 pairs, were excluded from the initial 52 participants. One pair was dropped due to the disclosure of their identities during the pair programming tasks; another pair was dropped because one of its partners did not actively participate in the experimental tasks; and the third pair was excluded because they lost their connection to the twincode platform repeatedly and their metrics could not be properly collected. Among the remaining 46 valid subjects, 26 identified as woman (56.52%) and the rest as man (43.48%) during the registration process 10. Note that, contrary to the original experiment, the percentage of women is above that of men because the CS61A and CS88 introductory courses are also taken by students from other majors, usually with a higher presence of women than in Computer Science majors, where it is around 25% [58]. Note also that, despite the 6 dropped subjects, the percentages of women in the control (12 women, 52.17%) and experimental (14 women, 60.87%) groups were close to each other.
From our point of view, this change in the sampled population from third-year Spanish students to first-year U.S. students, and the higher percentage of women, increased external validity, but the 50% reduction in the number of subjects (from 46 pairs to 23 pairs) reduced conclusion validity.

Experiment Execution
The experiment execution at the University of California, Berkeley followed the same process as that performed at the Universidad de Sevilla with some changes, which are described in the following sections.

Bonus for participating in the study
As commented in Section 3.2, in the original experiment the participation in the study counted for a 5% bonus on students' grades in the Requirements Engineering course they were enrolled in to prevent dropout.In the replication, considering that the students were enrolled in two different courses with different professors, they were offered a $15 Amazon gift card for participating actively in the study instead of a grade bonus which would have been difficult to manage.In our opinion, this change did not affect any type of experimental validity.

Location of students and number of sessions
In the original experiment, the experimental execution took place during one of the laboratory sessions of the Requirements Engineering course, as shown in Figure 4. The three groups of the course had the laboratory sessions on the same day at different hours, with 30 students per session on average. In the replication, the students performed the experimental tasks remotely, coordinated by one of the experimenters using Zoom. There were four sessions that took place during a week, with 10 students per session on average.
We think that this change increased construct validity with respect to the original study, since the setting was strictly remote rather than being co-located in a laboratory room, but it also decreased internal validity because of the lack of control of the subject's environment, in which interactions with a third person, interruptions, or distraction could occur.On the other hand, having multiple sessions over a week rather than having three consecutive sessions on the same day also decreased internal validity due to the possibility of some students disclosing the purpose of the study to their peers despite being instructed not to do so.

Timing of the tasks
In the original experiment, the students were given 20 minutes for the pair programming tasks, 10 minutes for the solo task, 10 minutes for the first questionnaire, and 15 minutes for the second and third questionnaires. In the replication, the students were given 15 minutes for the in-pair tasks, 10 minutes for the solo task, 10 minutes for the first questionnaire, and 10 minutes for the second and third questionnaires, due to the constraints imposed by their busy schedule.
We think that the shortened duration of the in-pair tasks and the second and third questionnaires may have compromised construct validity by reducing the time span for measuring the response variables, the interaction time for assessing the partners' skills, and the reflection time before answering each response item. Moreover, it may have weakened the effect of the treatment over confounding variables, thus also decreasing internal validity.

Gendered avatars
In the original experiment, the gendered avatars used in the chat windows of the subjects in the experimental group were the silhouettes shown in Figure 9(a), whereas in the replication the avatars were those shown in Figure 9(b), which were generated at https://getavataaars.com/. The subjects in the replication were also shown a gendered message at the top of the chat window indicating that their partner was connected, e.g. "Your partner (she/her) is connected" (see Figure 16(a) and 16(b) in Appendix B).
In principle, replacing the gendered silhouette avatars with more explicit ones and adding a gendered message in the chat window should have increased construct validity, but the correlation between induced gender and perceived gender in the replication worsened with respect to the original experiment (see Section 4.3.1). As a result, we consider that this change decreased construct validity.

Exercise assignment
In the original experiment, the programming exercises, which had to be solved using JavaScript as the programming language, were randomly assigned to the subjects from a pool of exercises of similar complexity. In the replication, the programming exercises, which had to be solved in Python due to the background of the participants, were organized into two blocks (A and B) that were randomly assigned to the subjects during the experiment.
In our opinion, adapting the programming language to the background of the participants should not have any impact on experimental validity, but using two blocks of exercises instead of a pool of exercises definitely improves the blocking of the related confounding variable (see Section 3.5.2), thus increasing internal validity.

Data Analysis
The data analysis was performed only for those subjects considered as valid according to the same criteria as in the original experiment. This resulted in 23 pairs, i.e. 46 valid subjects, as previously mentioned in Section 4.1.

Correlation between Induced and Perceived Gender
As in the original experiment, the correlation of the induced and perceived gender in both groups was analyzed to check treatment effectiveness, especially after having changed the gendered avatars and included a gendered message at the top of the chat window, as described in Section 4.2.4.
As shown in Table 6, the man/man and woman/woman effectiveness was close to 40% in the replication, whereas it was close to 60% in the original experiment (see Table 4 in Section 3.6.1). Although Cramer's V for Table 6 also showed a large effect (0.530), we applied the same strict criteria as in the original experiment and decided to discard those subjects in the experimental group for whom the induced and perceived gender did not match. For the subjects in the control group, we kept those who did not perceive any gendered avatar or did not remember it. As a result, we kept 22 subjects in the control group (10 men, 12 women) but only 9 (3 men, 6 women) in the experimental group.

Between-groups Analysis
As in the original experiment, and contrary to our research hypothesis, no significant differences were observed at α=0.05 between the control and experimental groups in the replication for any of the 45 response variables11 described in Section 3.4, including cps.The corresponding boxplots are depicted in Figure 10.

Within-groups Analysis
Within the experimental group (see Figure 11 for the corresponding boxplots), we performed the same analysis as in the original experiment, finding statistically significant differences at α=0.05 in the following four response variables when using the perceived partner's gender (ppgender) as a within-subjects variable. The four variables passed the Shapiro-Wilk normality test and were therefore analyzed using a two-sided paired t-test. Their effect sizes were computed using Cohen's d.
• scd (source code deletions): the test detected (p = 0.0485) that subjects deleted more source characters when they perceived their partner as a woman, with a
• i rf (relative frequency of informal messages): the test detected (p = 0.0138) that subjects increased the relative frequency of informal messages when they perceived their partner as a man, with a large effect size (d = 1.050).
• m rf (relative frequency of meta-comments or reflections): the test detected (p = 0.0377) that subjects increased the relative frequency of meta-comments or reflections when they perceived their partner as a man, with a large effect size (d = 0.829).
• qyn rf (relative frequency of yes/no questions): the test detected (p = 0.0297) that subjects increased the relative frequency of yes/no questions when they perceived their partner as a man, with a large effect size (d = 0.880).
Note that these results must be considered carefully because of the small number of selected subjects (n=9), and because when false discovery rate (FDR) adjustments are applied [5], only the hypothesis test corresponding to the i rf variable remains significant.
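To illustrate why several raw p-values below 0.05 may stop being significant after an FDR adjustment, the following minimal sketch implements the Benjamini-Hochberg procedure. Two assumptions apply: the source does not state explicitly that this is the procedure behind [5], and the p-values below are hypothetical, since the adjustment in the study was applied over the complete family of tests rather than over the four significant results only.

```python
def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for a family of five tests.
raw = [0.001, 0.03, 0.04, 0.045, 0.5]
adjusted = benjamini_hochberg(raw)
significant = [p <= 0.05 for p in adjusted]
print(adjusted, significant)
```

In this toy family, four raw p-values are below 0.05, but only the smallest one remains significant after the adjustment.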
As in the original study, no significant interactions were detected between the perceived partner's gender and the subject's gender for the same response variables.

Discussion and Threats to Validity
In this section, the original study and its external replication are discussed. Since the main concerns are about their threats to the experimental validity regarding operationalization and sampling, the discussion is organized around these types of threats, especially those that were not previously discussed in the description of the replication changes in Sections 4.1 and 4.2.

Operationalization of the Cause Construct -Treatment
The operationalization of gender bias into a treatment is not a trivial task and, according to the obtained results, we may not have designed our treatment as adequately as we intended, thus threatening construct validity.
Considering our experimental design, telling the subjects more explicitly that they were going to collaborate with a man or a woman could have made many of them suspect that they were being observed in that respect, behave unnaturally and, probably, mention it unintentionally in the chat, thus discovering that they were being deceived about their partner's gender and invalidating the study.
However, although the silhouetted avatars in the original experiment (see Figure 9(a)) had an effectiveness close to 60% (see Table 4), when they were changed in the replication into what we thought were more explicitly gendered avatars (see Figure 9(b)), their effectiveness dropped below 40% (see Table 6). Apart from the change of the avatars, this decrease in treatment effectiveness was probably also affected by other factors, such as the remote setting, which increased the likelihood of distractions compared to a controlled environment such as a laboratory session, as commented in Section 4.2.2. Other factors could have been the reduced duration of the in-pair tasks and the second and third questionnaires, as previously discussed in Section 4.2.3, and the so-called Zoom burnout [49], i.e., the fatigue and exhaustion caused by prolonged use of video conferencing platforms during the COVID-19 pandemic, which may have influenced the motivation and performance of students at UC Berkeley, who are also exposed to very high levels of stress [41,54].
As commented in Section 6.2, we are evaluating the use of chatbots together with a within-subjects design in future replications to improve the treatment and thus mitigate this threat to construct validity.

Operationalization of the Effect Construct -Metrics
The main goal of our work is exploring the effects of gender bias in remote pair programming. Due to this exploratory nature, we have applied methodological triangulation [13], observing the phenomenon from as many points of view as possible, with an operationalization based on 45 response variables of different types which were measured during a reasonable interaction time.
Having said that, during the coding of the chat utterances, some of the authors who are in their fifties at the moment of writing this article perceived strong differences in how the subjects, who are Generation Z youngsters [15], communicate compared to the way we did when we were their age.With all due caution, and taking into account the strong socio-political environment in Spain and the U.S. against any type of gender discrimination, we think it is possible that the presence of gender bias in people of our generation (Generation X) may have decreased two generations later, although we do not have enough evidence to affirm it.In addition, if gender bias persists, it is possible that most subjects self-censor, thus hindering the detection of its effects.To improve this situation, we are currently evolving the twincode platform to include more metrics, and we are also considering the inclusion of qualitative research that might lead to new findings in future replications by widening the spectrum of collected information.

Low Percentage of Women in the Original Study
Unfortunately, the small proportion of women in STEM studies is a common issue in most higher education institutions [1,51]. The low number of women participants in the original study was an obstacle to studying whether gender bias was mainly a masculine trait or whether it was also present in women in any way. Nevertheless, the percentage of women increased substantially in the first replication without significant findings on the interaction of the subject's gender with other factors.

Small Size of the Sample in the Replication
The small size of the sample in the replication and the low effectiveness of the treatment posed a clear threat to conclusion validity that can only be mitigated by taking the outcomes as provisional and performing more replications with bigger samples and alternative experimental designs in the future.

Using Students as Subjects
In other empirical studies in which the subjects are Software Engineering students, findings can be reasonably generalized to a wider community because the experimental tasks do not usually require high levels of industrial experience [43], and the students, who are the next generation of professionals, are close to the population under study [19,34,45]. In our case, however, the intergenerational differences commented in Section 5.2 and the lack of conclusive results make such a generalization very difficult.

Conclusions and Future Work
After performing the original study and an external replication, we can conclude that we did not observe any effect of the gender bias treatment, nor any interaction between the perceived partner's gender and subject's gender, in any of the 45 response variables in the original study.
With respect to the external replication, we only observed statistically significant effects within the experimental group, i.e. comparing how subjects acted when they thought their partner was a man or a woman, in four of the 45 dependent variables. One variable was related to changes in the behavior (source code deletions), and the other three were related to the relative frequency of different types of chat utterances (informal messages, reflections, and yes/no questions). In the case of the source code deletions, subjects deleted more characters when they perceived their partner as a woman, but the relative frequency of informal messages, reflections, and yes/no questions was higher when they perceived their partner as a man. We also observed a lower effectiveness of the treatment in the replication, which could have been caused by the changes in the gendered avatars but also by the use of a remote setting instead of a controlled environment like a laboratory session, free of distractions and interruptions. That lower effectiveness of the treatment led to a small number of selected subjects in the experimental group. The replication results must therefore be considered carefully, both because of the small sample they are based on and because, when FDR adjustments are applied, only the result for the relative frequency of informal messages remains significant.
These outcomes have raised a number of potential research questions that we plan to address in the future and that are briefly described below.

Replication in Different Cultural Background
The cultural differences between Spanish and U.S. students could have also influenced the outcomes of both studies, so we would like to replicate the study in other countries and analyze the potential differences caused by cultural backgrounds.

Using Chatbots as Partners and AI-based Utterance Coding
Another two research lines we would like to explore in the future are the use of chatbots as pair programming partners and the use of deep learning to automatically code chat utterances, thus reducing the manual effort of carrying out a replication.
Inspired by current trends in Psychology [4,24] and taking into account not only the absence of significant differences between groups in the original study and the replication, but also the difficulties in recruiting a relevant number of subjects, we are considering the possibility of changing from a between-groups design to a within-subjects design in which each subject performs the pair programming tasks with a chatbot simulating being a man or a woman instead of with another human subject. Obviously, developing such a chatbot is not a trivial task, but current advances in the area, such as LaMDA [10], BERT [14], or GPT-3 [37], make this approach a technical challenge worth exploring. A very relevant aspect in the development of such a chatbot is avoiding gender bias in the training data, as recently studied by [39].
On the other hand, now that we have a relevant number of coded chat utterances in Spanish and English, we could use that labeled dataset to fine-tune a large language model similar to those used in chatbots to classify user intents and apply it to the automatic coding of chat utterances, which is one of the most time-consuming tasks we have had to perform as experimenters in our exploratory study. If the results of such a fine-tuned system were accurate, future replications would require much less effort than the two presented in this article, and experimenter bias would be considerably mitigated.
(undergraduate students at UCB) for their support in the evolutive changes to the twincode platform and in the experiment execution at UCB. We particularly acknowledge Vron Vance (UCB alumnus, Data Analyst at Google) for their assistance regarding inclusive language around gender identity. Last but not least, we would like to thank the anonymous reviewers for their valuable comments and suggestions that helped us improve the quality and clarity of this article.

As shown in Figure 13, in the initial version of the scale used in the pilot studies, the pptc 5 item, which asked whether the assigned partner had been condescending, presented low correlations with the rest of the items in the scale and the scree plot indicated two factors. After removing that uncorrelated item, the Cronbach's α increased from 0.73 to 0.85, and the scree plot indicated only one factor, as shown in Figure 14.

The only item in this questionnaire section, entitled "Describe your partner", is a free text field in which subjects are instructed to describe the most positive and most negative aspects of the partner assigned to them in the programming exercises they just did, indicating the positive ones with a "+" sign and the negative ones with a "-" sign in front of each aspect.

A.4 Response items for compared partners' skills (cps)
All the items in this questionnaire section, entitled "First or second partner?", are 0-10 numerical response items in which 0 means "first partner", 5 means "both equally", and 10 means "second partner". As shown in Figure 15, all the items presented high Pearson correlations with Cronbach's α = 0.88, and the scree plot confirmed they were unidimensional according to the Kaiser criterion. As a result, all of them were kept after the reliability analysis on the data from the pilot studies.

Figure 1 twincode user interface for subjects in the experimental and control groups (original study version)

Figure 2 Experimental process (subject allocation to groups)

Figure 4 Experiment execution at University of Seville, Dec 2021

Figure 5 First response item for pp variable in questionnaires #1 & #2 as presented to the subjects

u / u rf Ratio scale variables representing the absolute and relative frequency of opinion or indication of uncertainty messages generated by a subject during an in-pair task.
d / d rf Ratio scale variables representing the absolute and relative frequency of explicit or direct instruction messages generated by a subject during an in-pair task.
su / su rf Ratio scale variables representing the absolute and relative frequency of polite or indirect instruction or suggestion messages generated by a subject during an in-pair task.
ack / ack rf Ratio scale variables representing the absolute and relative frequency of acknowledgment messages generated by a subject during an in-pair task.

Figure 7 Boxplots of the 45 response variables for between-groups analysis in the original study

Figure 9 Gendered avatars used in the original experiment and the replication

Figure 11 Boxplots of the 45 response variables for within-groups analysis in the replication

Figure 12 Pearson correlations and scree plot of pp scale items

Figure 13 Pearson correlations and scree plot of the initial version of pptc scale items

Figure 14 Pearson correlations and scree plot after dropping pptc 5 from pptc scale

Figure 15 Pearson correlations and scree plot of cps scale items

Figure 16 twincode user interface for subjects in the experimental and control groups (replication version)

Table 2 Summary of secondary studies (SMS or SLR) in pair programming in chronological order
cps 1 Comparing your assigned partners in sessions 1 and 3, who do you think provided more clear and constructive feedback, your first partner or your second partner?
cps 2 Comparing your assigned partners in sessions 1 and 3, who do you think was easier to communicate with, your first partner or your second partner?
cps 3 Comparing your assigned partners in sessions 1 and 3, who do you think was more knowledgeable about the subject material, your first partner or your second partner?
cps 4 Comparing your assigned partners in sessions 1 and 3, who do you think would be a better project partner, your first partner or your second partner?
cps 5 Comparing your assigned partners in sessions 1 and 3, who do you think would be a better teaching assistant, your first partner or your second partner?