1 Introduction

1.1 Replicability crisis

The possibility of experimentally reproducing a result is at the heart of many sciences that rely on experimental methodology, and it guarantees our confidence in how these sciences operate. In the words of a well-known Popper quote: “unique events that cannot be reproduced are of no importance to science” (Popper 1959). It is because the same cause produces the same consequences (in the same circumstances) that an individual, whether a scientist or not, can have confidence in the existence of the observed causal effect, whatever the context of its measurement. However, some experimental sciences, and psychology in particular, are currently undermined by a notorious inability to reproduce experimental results. For example, Baker (2015) reported that, out of 100 articles published in leading psychology journals, only a third to half of the experimental results could be replicated, a finding later echoed by Owens (2018). Moreover, 70% of the 1500 researchers surveyed admitted having failed to replicate a colleague’s experiment, and more than half having failed to replicate their own (Baker 2016). This replicability crisis extends far beyond psychology and can be observed in many other fields, such as pharmacology (Prinz et al. 2011), medicine (Begley & Ellis 2012), or even economics (Bergh et al. 2017) and the social sciences (Camerer et al. 2018). It is puzzling that experimental studies that sometimes founded sub-fields of research, such as Bargh’s (1996) social priming, have failed to replicate (Doyen et al. 2012), as have the experimental variations that followed (Harris et al. 2013; Klein et al. 2014; Pashler et al. 2012). This inability to reproduce empirical results leads, beyond the expenditure of time, energy and money, to a decline in confidence in science on the part of the public and scientists themselves (Anvari and Lakens 2018). The causes of this crisis are many, and a full review is beyond the scope of this article.
Briefly, frequently cited causes include publication bias, the publish-or-perish culture and the dubious practices it encourages, and the misuse of statistics, in particular of the p-value (Romero 2019; Simmons et al. 2011). Beyond the researchers themselves, the social context of science seems to be at issue too. For example, until recently, only a small proportion of neuroscience journals accepted replication studies (Martin and Clarke 2017; Yeung 2017). Another cause mentioned by Romero (2019) in his review lies in the principle of priority, which rewards the first to make an innovative discovery (Merton 1957; Romero 2017). Finally, another possible cause of replication difficulties lies in the prevalence of biases inherent in the experimental procedure, such as the effect of the experimenter, and in particular of his or her gender/sex (Chapman et al. 2018).

1.2 Distinguishing gender/sex

It is generally accepted that there is a difference between gender and sex. However, these concepts are very often confused and conflated, whether in a scientific context or not. Sex refers to the biological attributes (genitalia, chromosomes, hormones…) that distinguish male and female organisms, but also intersex and hermaphrodite organisms (Hyde et al. 2019). Gender is considered a social construct and refers to behaviors “associated with membership of a sexual category”, although they are not necessarily determined by sex (Hyde et al. 2019; West and Zimmerman 1987; Westbrook and Saperstein 2015, p. 537). Gender encompasses many different psychological and social traits that define individuals as man, woman, non-binary or trans, and to which the individual conforms to a greater or lesser extent (Garcia-Sifuentes and Maney 2021; Hyde et al. 2019; Swim et al. 2020; Tannenbaum et al. 2019). It is important to note that during social interactions, the gender that others attribute to a person is not always aligned with how that person perceives themselves, and both can change over time (Westbrook and Saperstein 2015). When studying the scientific literature, it seems impossible to determine which studies actually measure gender and which actually measure sex and, more importantly, which effects can be attributed to gender and which to sex. Indeed, as most research on the experimenter’s gender/sex effect is relatively old, the concepts were not as differentiated or common as they are today, even if they are still largely confused (Cameron and Stinson 2019). For example, Chapman et al. (2018), in a recent review on the subject, systematically refer to “gender” to describe this effect (the term “sex” does not appear), whereas many empirical studies cited in their review only refer to “sex”, with the term “gender” used interchangeably with “sex”, when it appears at all (Carter et al. 2002; Fisher 2007; Rumenik et al. 1977; Stevenson and Allen 1964).
This is a very common phenomenon (Westbrook and Saperstein 2015). Knowing that the vast majority of these studies give little or no indication of how gender/sex is measured, we can imagine boxes to be ticked: “sex: male/female”, or “man/woman”, or even a choice left to the researcher’s discretion, with experimental instructions such as “only ask when in doubt” (Westbrook and Saperstein 2015). Of course, this leaves room for a high degree of ambiguity that makes it impossible to distinguish what is actually being assessed, particularly because sex and gender are assigned at the same time, at birth (Hyde et al. 2019; Westbrook and Saperstein 2015). This is not only problematic from an epistemic point of view, but also from an ethical one (Cameron and Stinson 2019). Consequently, and as recommended by Hyde et al. (2019), in this study we use the term “gender/sex” not to consider them as undifferentiated but rather as indistinguishable (at least in this experiment) and to “reflect the inseparable nature of gender and sex in practical contexts and their confounded treatment and measurement in research” (Cameron and Stinson 2019; Hyde et al. 2019).

1.3 Effect of experimenter’s gender/sex

The experimental sciences are largely imbued with a quest for positivism and objectivism that focuses almost exclusively on the standardization of experimental conditions and materials. Within this framework, the experimenter is often seen as floating above the experiment and having no impact on what he or she observes, as evidenced by the vast majority of experimental studies that make no mention of the experimenter at all. Yet it is undeniable that during an experiment, at least two humans interact, and just because one of them is a scientist does not mean that this interaction is without social effect (Carter et al. 2002; Rumenik et al. 1977). Since scientists seem to have less difficulty replicating their own studies than those of their colleagues, it’s possible that the role of the investigator, and in particular his or her gender/sex, plays a part in the replicability crisis (Baker 2016). Chapman et al. (2018) note numerous areas in which such an effect can be observed: measures of intelligence (Back and Dana 1977), creativity (Gall and Mendelsohn 1967), learning and memory (Stevenson and Allen 1964), physical performance (Rikli 1976), risk-taking and testosterone (Ronay and Hippel 2010), aggressive and prosocial behavior (Borden 1975; Hyde 2014), and of course sexuality (Clark 1952). One of the areas where this effect is most documented is pain assessment, where men seem to report less pain when the experimenter is a woman, and women more pain when assessed by men (Aslaksen et al. 2007; Carter et al. 2002; Kállai et al. 2004; Levine and De Simone 1991; Mogil 2012; Vigil et al. 2014). A possible theoretical explanation is that the stakes of an opposite gender/sex interaction are often higher than those of similar gender/sex interactions, as they may (for heterosexual individuals) lead to a sexual or marital relationship, or at least to the self-consolidation of a seductive potential (Chapman et al. 2018; Leary et al. 1994). 
The result would be an increase in psychosocial stress which, depending on its level, would positively or negatively influence the experience. In this sense, it has been shown that the stress of men tested by women is greater than that of men tested by men, and vice versa for women (Kirschbaum et al. 1993). Other explanations for this phenomenon could involve gender conformity mechanisms through self-categorization to social roles (Berke et al. 2012; Carli 1989; Pool et al. 2007; Terry et al. 2000; Terry and Hogg 1996). According to these points of view, it can be argued that the dyadic interaction draws attention to the participant's and experimenter’s gender/sex. This would lead to more gendered behaviours, for example women reporting more pain to appear feminine and men reporting less pain to appear masculine (Pool et al. 2007). In addition, it is possible that during a brief, one-off meeting with a stranger, individuals adopt more gendered characteristics and attitudes in order to create a common basis for interaction, as these attitudes are broadly known and shared (even by those who do not conform to them), and seem to be socially valued (Swim et al. 2020). This is somewhat reminiscent of stereotype threat theory (Spencer et al. 2016), according to which individuals conform to a stereotype about themselves when it is made salient. For example, Spencer et al. (1999) reported that women under-performed on a math test when the stereotype that women have poor mathematical abilities was made salient. In this sense, it is arguable that the perceived gender/sex of the experimenter could trigger stereotype salience (Drążkowski et al. 2017; McGlone and Aronson 2006). However, it is important to note that many studies involving stereotype threat effects have proved difficult or even impossible to replicate, notably the famous one cited above (Flore et al. 2018; Flore and Wicherts 2015).

1.4 VR experimentation

The use of VR in research is developing rapidly, for example in the study of spatial cognition, neuropsychological disorders, memories or therapies, as well as perception and philosophies of mind (Banakou et al. 2016, 2018; Cogné et al. 2017, 2018; Muratore et al. 2019; Riva et al. 2019). Indeed, VR has the advantage of combining “the best of both worlds” (Minderer et al. 2016), i.e. methodologically controlled ecological studies (Dawson and Marcotte 2017; Parsons 2015). However, VR is not a simple transposition of reality: immersion consists in superimposing the artificial information of the immersive system onto the natural information of the physical environment (Maneuvrier and Westermann 2022). This superimposition gives rise to the sense of presence, a feeling of “being there” (Heeter 1992; Sheridan 2016), which enables the participant to act as if they were actually present in the virtual environment presented to them. But immersion also triggers the emergence of negative effects, notably cybersickness (Rebenitsch and Owen 2016; Stanney et al. 2020a, b; Tian et al. 2022). Cybersickness is a set of symptoms similar to motion sickness. It is often assumed to be caused by inconsistencies between different perceptual systems, particularly the visual system (Bos et al. 2008). An important methodological aspect is that the sense of presence is negatively associated with cybersickness (Maneuvrier et al. 2021; Weech et al. 2019), and that the former seems to have a positive impact on cognitive performance, whereas the latter would tend to decrease it (Maneuvrier et al. 2020). In addition, both cybersickness and sense of presence appear to be associated with video game experience, perhaps through a reduction in the use of visual cues or, at least, by making them more flexible (Howarth and Hodder 2008; Maneuvrier et al. 2021), or through an enhanced familiarity with ergonomics and affordances (Gibson 1977; Maneuvrier et al. 2022).
The video game experience is therefore often measured in VR, although less commonly than the sense of presence or the state of flow. Flow is the famous “optimal” state of consciousness characterized by cognitive absorption, enjoyment, time distortion, forgetfulness of self and others, and generally brought about by concentration on a task related to perceived competence (Csikszentmihalyi 1990). Flow is considered to be (i) triggered by video games, and (ii) associated with the sense of presence (Cheng et al. 2014; Kang et al. 2022; Kim and Ko 2019; Rutrecht et al. 2021; Spreij et al. 2022; Wehden et al. 2021; Yang and Zhang 2022). All these measures are usually assessed in the form of self-reported questionnaires after immersion (Schwind et al. 2019; Souza et al. 2021), especially as they are suspected to have an impact on performance during VR experimentation (Maneuvrier et al. 2020). However, there is one final factor that needs to be discussed because it is repeatedly described as having an impact on the VR experience: the participant's gender/sex.

1.5 Effect of participant’s gender/sex in VR

Some authors have found that men report a greater sense of presence than women in VR (Felnhofer et al. 2012; Lachlan and Krcmar 2011; Nicovich et al. 2005), even if this effect is not systematic (Maneuvrier et al. 2020; Weech et al. 2020). Furthermore, women have long been considered more susceptible to cybersickness than men (Maneuvrier et al. 2020; Shafer et al. 2017; Stanney et al. 2003), although again this effect is not systematic (Gamito et al. 2008; Ling et al. 2013). In this context, sex- and gender-related effects that cannot be told apart have been suggested. For example, the sense of presence and cybersickness symptoms in VR are influenced by the experience of 3D and video games, probably due to shared spatial and perceptual processes known as field (in)dependence (Boccia et al. 2016; de Castell et al. 2019; Kennedy 1975; Maneuvrier et al. 2021, 2022, 2023; Wirth et al. 2012; Witkin 1949). Now, it can be argued that women are less inclined to play video games or 3D construction games, particularly during their youth (Entertainment Software Association 2022; Evans et al. 2013; Levine et al. 2016). But it is impossible to tell whether women play fewer (sensory-intensive) video games because of social factors (Bègue et al. 2017), or because they are less comfortable with sensory mismatch, for biological reasons. Indeed, some authors have highlighted possible sex differences in cybersickness sensitivity, for example because of differences in hormone secretion rates or field-of-view size (Clemes and Howarth 2005; LaViola 2000). On the other hand, Stanney et al. (2020a, b) found results supporting the existence of a gender effect, stating that the size of head-mounted displays is generally not adapted to women's interpupillary distance. Once this variable was controlled for, they found no difference in the level of cybersickness between men and women (Stanney et al. 2020a, b).
In addition, women could face stereotype threat when faced with computers and games, deteriorating their experience by generating stress (Koch et al. 2008). For example, Drążkowski et al. (2017) found a stereotype threat effect on measures of field (in)dependence: men and women are often expected to show different field (in)dependence strategies (Onyekuru 2015), with men being considered less field-dependent than women. However, Drążkowski et al. (2017) suggested that these differences are caused by a stereotype threat triggered by the experimenter's gender/sex: when the experimenter was a woman they found similar levels of field (in)dependence between genders/sexes. Finally, another gender-related possibility has been suggested: men may under-report their symptoms of cybersickness with the social aim of appearing stronger (Harm et al. 2007; Rebenitsch and Owen 2016; Stanney et al. 1999), which is also interesting in our context, as very similar results of an experimenter effect have been found in the field of pain assessment (Carter et al. 2002; Kállai et al. 2004; Levine and De Simone 1991; Mogil 2012; Vigil et al. 2014). However, an interaction between the experimenter’s and the participant’s gender/sex has never, to our knowledge, been empirically explored in VR.

1.6 Hypotheses and objectives

Given the current replicability crisis, the importance of VR measures for methodologically correct virtual evaluations, the discussed effects of gender/sex on the experience of VR and the possible interaction with the experimenter’s gender/sex, we believe that an empirical exploration of these dynamics is necessary. The aim of this study is therefore to explore whether the experimenter's gender/sex may have an effect on the self-reported variables typically administered during a VR experiment, and to assess whether this effect depends on and interacts with the participant’s gender/sex. We conducted a randomized controlled VR experiment using a virtual first-person shooter, which is usually considered a masculine-gendered type of game (Juul 2012; Yee 2017). At the end of the game, participants self-administered the most common VR measures (sense of presence, cybersickness, video game experience, flow). Although this study is largely exploratory, we use the null hypothesis testing method as well as induction based on statistical inferences. We formulated two a priori non-directional hypotheses:

H0

The experimenter's gender/sex does not influence self-reported measures of VR (sense of presence, cybersickness, video game experience, flow state).

H1

The experimenter's gender/sex influences self-reported measures of VR (sense of presence, cybersickness, video game experience, flow state). Moreover, these effects vary depending on the participant’s gender/sex.

2 Materials and methods

3 Participants

This experiment was approved by a local ethics committee (CICPPR–IMT Atlantique) and strictly followed the Declaration of Helsinki (World Medical Association 2013). An a priori power analysis (f² = 0.2, α = 0.05) revealed that 66 participants were needed to obtain 80% power (Pillai V = 0.16, critical F(4,61) = 2.52, non-centrality parameter λ = 13.2). Seventy-five young adults were then recruited locally (Author’s university, Western France) through public posters and gave their written consent. After screening for missing data (1) and outliers (4) using the interquartile range (IQR) on each self-reported variable, i.e. values below [25th percentile] − 1.5 × IQR or above [75th percentile] + 1.5 × IQR, 70 participants were considered for analyses. Average age was 23.9 years (± 4.2), minimum age was 18 and maximum 35. Exclusion criteria were: (i) being younger than 18 years old or older than 35 years old, (ii) having a known uncorrected psychological or physiological condition that could alter perception or the use of a visuo-manual controller. Participants were not medically screened for these criteria; the experimenters relied on their declarations, and individuals who could use corrective lenses or glasses in the head-mounted display were included. The age exclusion criterion was chosen because it is suggested that age has an effect on cybersickness (Arns and Cerney 2005; Petri et al. 2020; Stanney et al. 2020a, b), video game experience (Entertainment Software Association 2022; Greenberg et al. 2010), and, to some extent, sense of presence (Coxon et al. 2016; Ochs et al. 2018).
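The IQR screening rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' script: the function names `tukey_fences` and `flag_outliers` and the sample scores are hypothetical, and the quartile convention (`method="inclusive"`) is an assumption, since the quantile definition actually used is not reported and different conventions can move the fences slightly.

```python
import statistics


def tukey_fences(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR.

    Assumption: quartiles follow the 'inclusive' convention of
    statistics.quantiles; the paper does not state which one was used.
    """
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


def flag_outliers(values, k=1.5):
    """Return one boolean per value, True when it lies outside the fences."""
    low, high = tukey_fences(values, k)
    return [v < low or v > high for v in values]


# Hypothetical scores on one self-reported variable (not study data).
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
print(tukey_fences(scores))
print(flag_outliers(scores))
```

In the procedure described above, the rule is applied separately to each self-reported variable, so a participant would be screened out if flagged on any of them.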

3.1 Gender/sex

Due to the “inseparable nature of gender and sex in practical contexts and their confounded treatment and measurement in research” (Cameron and Stinson 2019), the gender/sex variable was measured as a single, unitary variable using a non-mandatory open-ended question: “What sex were you assigned at birth?”. The choice of this formulation (“sex assigned at birth” rather than, for example, “current gender identification”) was intended to simplify the evaluation of the effects of gender and sex in an indistinguishable way. Indeed, it appeared that taking into account “current gender identification” rather than gender/sex assigned at birth would require much more in-depth analyses (use of a continuum and/or different categories, evolution over time, etc.), would make the issue more complex and would not allow sex to be taken into account correctly. The current measure (the sex, and therefore the gender socially assigned at birth) allows us to take into account (i) the sex factor and (ii) the associated gender, at least in part through the social relationships imposed during childhood. Indeed, a 23-year-old responding “non-binary” to a gender question at the time of the study is likely to have been socially considered as a man or a woman for much of their life, depending on the sex assigned at birth (and probably still partly assigned by some social agents). Although the measurement question explicitly refers to sex, we do not defend the idea that any effect described using this variable can be attributed specifically to sex or to gender, which is why we use the unitary variable “gender/sex”. Even if we participate in their confounded treatment, we consider that the differentiation between the two is well beyond the scope and capabilities of this study. The question specified that participants were not obliged to answer if they did not wish to. One participant did not answer the question, and was therefore excluded from analysis for missing data.
Indeed, in view of the quantitative statistical method, it was established beforehand that responses that could not be grouped into the two most widespread gender categories (man/woman) based on the open answers would not be considered for quantitative analyses (participants were not aware of this). With the exception of this participant, all other participants answered the gender/sex question in a dichotomous way. Some responded either “homme” (French for “man”) or “masculin” (“masculine”), which were grouped as “men”, because none responded “mâle”, the French near-equivalent of “male” (N = 27). The other participants responded either “femme” (French for “woman”) or “féminin” (“feminine”), which were grouped as “women”, because none responded “femelle”, the French near-equivalent of “female” (N = 42). For the sake of clarity and consistency in reporting, gender/sex modalities are indicated as “men/women”.

3.2 Experimenters

In order to counterbalance a possible confounding effect of other individual variables (height, attractiveness, sociability…) rather than the experimenter’s gender/sex variable per se, several experimenters (10) participated in this study. The experimental protocol was highly automated, since it took place mainly in VR and/or on computer software, and the experimenters had little direct evaluation work apart from human interaction and software management. With the exception of the principal investigator, all experimenters were graduate students, and all agreed to have their data used for this experiment. Using the same gender/sex variable, half of the experimenters were grouped as “men” (5) and the other half as “women” (5). Participants were randomly assigned to experimenters. Women experimenters collected data from 37 participants (24 women, 13 men), while men collected data from 32 participants (18 women, 14 men). For the first two experiments of each experimenter, the principal investigator (man) was present in the room to (i) reassure the student experimenters about the technical aspects of the procedure and (ii) check that all experimenters conducted the experiment in a similar way. The principal investigator worked silently on a computer and did not interact directly with the participant. However, given that a passive presence can also potentially affect an experiment, and although this presence was equally distributed between the experimenters according to the gender/sex variable, a potential effect of this passive presence was tested statistically.

3.3 VR immersion

The VR system was the HTC-Vive Pro (1440 × 1600 resolution per eye, 98° horizontal field of view, 90 Hz refresh rate). The computer ran Windows 10 (64-bit), the processor (CPU) was an Intel Core i9-9900K (3.6 GHz) and it was equipped with 32 gigabytes of RAM. The graphics processing unit (GPU) was a GeForce RTX 2080. Due to a mechanical issue with the head-mounted display, we were unable to adapt the interpupillary distance to each participant. However, we did set the interpupillary distance at 62.5 mm, which corresponds to the average distance of Northern European men and women aged 16–40 according to Pointer (1999), and is therefore equally suitable (or not) for men and women, in contrast to the default distance of the head-mounted display (67.5 mm). After a brief tutorial explaining how to use the joysticks, participants were immersed in a Western-style cartoon world (Fig. 1, part A, part B). They stood on the roof of a moving train in order to obtain a smooth, linear visual flow likely to trigger very slight cybersickness (preventing a floor effect) thanks to a slight sensory mismatch (Clifton and Palmisano 2019). The train trajectory was entirely straight to avoid too much cybersickness, and the train speed was rather slow, 0.5 units of distance per second (slow walking speed). The tracked virtual controller was transformed into a gun capable of firing one bullet per second (Fig. 1, part C). Instructions were given visually and orally by voice recording (human voice) in the virtual environment. Participants had to shoot aliens (Fig. 1, part D) who appeared to attack the train en route. The aliens appeared regularly (every 10 s) in the virtual environment and fired projectiles at the train. Participants could either shoot at the projectiles to destroy them, or shoot at the aliens to neutralize them, their only explicit objective being to protect the train.
Total immersion lasted 13 min and 30 s, after which the virtual environment closed. An action shooter was chosen because it is one of the game genres for which the greatest gender differences exist (Bosser and Nakatsu 2006; Feng et al. 2016; Kapalo et al. 2015), which could potentially make it easier to detect differences between the genders/sexes. This environment was custom-built by the author using Unity3D and the object-oriented programming language C#. The decision not to use a full avatar, but only to represent the tracked hand and virtual gun, was taken to avoid (i) uncanny valley effects (Mori et al. 2012), (ii) a mismatch between virtual and real poses (Palmisano et al. 2020), given the absence of trackers on the rest of the body, and (iii) embodiment effects (Banakou et al. 2018; Tassinari et al. 2022). A video of the first-person game is available in the supplementary material.

Fig. 1

First person view of the virtual environment. a and b show the far-west background, c shows the tracked controller turned into a pistol and d shows the enemies shooting projectiles toward the train

3.4 VR measures

All the variables commonly measured in VR experiments were assessed using a computerized self-administered questionnaire, as is generally done (Grassini and Laumann 2020; Schwind et al. 2019). Sense of presence was measured using the most common questionnaire validated in French, without the haptics items (Robillard et al. 2002; Witmer and Singer 1998), which contains 22 items (7 points). Because cybersickness in VR is considered to be visually induced (Bos et al. 2008; Rebenitsch and Owen 2016), cybersickness was measured using the oculomotor scale of the simulator sickness questionnaire validated in French (Bouchard et al. 2007; Kennedy et al. 1993), which contains 7 items (4 points). Video game experience was measured using a single-item question, “How often do you play video games?”, on 10 points, where 10 was labeled “Every day” and 1 was labeled “Never”. A 10-point scale was used instead of the standard 7-point Likert scale to avoid possible confusion with the number of days played per week. State of flow was assessed using items drawn from various flow questionnaires used in VR and translated into French, as no validated French version could be found. The items used for the flow questionnaire are given in the supplementary material.
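As an illustration of how such item responses can be turned into one score per measure, the sketch below assumes simple sum scoring. This is an assumption: the scoring rule is not stated here (the totals reported in the Results are compatible with raw sums, but weighted scoring exists for the simulator sickness questionnaire). The helper name `sum_score`, the response codings (1 to 7 for presence, 0 to 3 for oculomotor symptoms) and the example responses are all hypothetical.

```python
def sum_score(responses, n_items, low, high):
    """Sum item responses after checking item count and response range.

    Assumption: unweighted sum scoring; not confirmed by the source.
    """
    if len(responses) != n_items:
        raise ValueError(f"expected {n_items} items, got {len(responses)}")
    if any(not (low <= r <= high) for r in responses):
        raise ValueError(f"responses must lie between {low} and {high}")
    return sum(responses)


# Hypothetical participant (not study data).
presence_items = [5] * 22                 # 22 items, assumed coded 1..7
oculomotor_items = [0, 1, 0, 2, 0, 0, 1]  # 7 items, assumed coded 0..3

presence = sum_score(presence_items, n_items=22, low=1, high=7)
cybersickness = sum_score(oculomotor_items, n_items=7, low=0, high=3)
print(presence, cybersickness)
```

Under these assumptions, presence totals range from 22 to 154 and oculomotor totals from 0 to 21.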

3.5 Procedure

Participants were clearly informed of the experimental protocol, with the exception of the evaluation of the experimenter's effect. However, the experimenters all knew that the experimenter's effect would be evaluated (for pedagogical reasons). Participants were informed that they could stop the experiment at any time and without explanation, but none chose to do so. Once written consent was obtained, participants were fitted with a hygienic face pad, the head-mounted display and a tracked controller, and told that recorded instructions would be given in the virtual environment. To collect exploratory educational data, a virtual rod-and-frame test (Witkin 1949; Witkin et al. 1962) was performed in the head-mounted display (fully automated, 16 trials, approx. 3 min). Next, participants completed the 13 min 30 s immersion in VR. Once the virtual environment had closed, the equipment was removed and participants were asked by the experimenter to complete the VR measures (sense of presence, cybersickness, video game experience, state of flow) on a computer. Next, participants were invited to perform the virtual rod-and-frame test again (same configuration). Finally, participants were thanked, the experimenter explained the nature of the research and they were invited to ask questions. They then left the laboratory room, accompanied by the experimenter.

3.6 Analyses

  • Preliminary analyses: McDonald’s omega (confirmatory factor analysis estimate, analytical interval) was used to ensure that the VR measures were, where appropriate, reliable.

  • Null hypothesis testing: one MANOVA with Pillai’s test was performed to test H0. The VR measures (sense of presence, cybersickness, video game experience and flow) were used as dependent variables, and the experimenter’s gender/sex and the participant’s gender/sex were used as independent variables. In order to assess a potential effect of the passive presence of the principal investigator, this passive presence (yes/no) was added as an independent variable. All interactions were considered with the exception of three-way interactions, because of the sample size. ANOVA tables (2 × 2 between participants) were reported, along with Tukey-adjusted post-hoc tests when a potential difference was outlined, and simple main effects were explored.

  • Global statistical method: JASP software was used for statistical analysis and G*Power software was used for the a priori power analysis. 95% confidence intervals (95% CI) were systematically reported, along with p-values. Because of experimental results previously found in the literature against the null hypothesis and the a priori power analysis (α = 0.05), only p-values around p = 0.05 and below were discussed. However, because of the exploratory nature of the study, effects were not discussed as significant or non-significant based on p-values alone, as suggested by recent epistemological debates (Amrhein et al. 2019; Wasserstein et al. 2019). Box’s M-test for homogeneity of covariance matrices and the Shapiro–Wilk test for multivariate normality were used to test the statistical assumptions of the MANOVA, along with a Q-Q plot exploration of the normality of residuals. Cohen’s d was used to report the effect size of post-hoc comparisons, along with the Tukey-corrected p-value. Pillai’s trace (V) was used to report the effect size of the MANOVA, whereas η² was used to report the effect size of analyses of variance.
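The parameters of the a priori power analysis reported in the Participants section (f² = 0.2, N = 66, Pillai V = 0.16, F(4,61), λ = 13.2) can be cross-checked with simple arithmetic. The sketch below uses the standard relations for a MANOVA global-effect test based on Pillai's trace, as we understand them from the G*Power documentation; treating the design as two groups and four dependent variables is an assumption, chosen because it reproduces the reported degrees of freedom. The function name `manova_power_inputs` is hypothetical. The sketch does not compute power itself, which requires the noncentral F distribution.

```python
def manova_power_inputs(f2, n_total, n_groups, n_dvs):
    """Derive Pillai's V, the noncentrality parameter and F degrees of freedom.

    Assumed relations (Pillai's trace, global effect):
      s   = min(n_groups - 1, n_dvs)
      V   = s * f2 / (1 + f2)        # inverse of f2 = V / (s - V)
      lam = f2 * n_total * s
      df1 = n_dvs * (n_groups - 1)
      df2 = s * (n_total - n_groups - n_dvs + s)
    """
    s = min(n_groups - 1, n_dvs)
    return {
        "s": s,
        "pillai_v": s * f2 / (1 + f2),
        "noncentrality": f2 * n_total * s,
        "df1": n_dvs * (n_groups - 1),
        "df2": s * (n_total - n_groups - n_dvs + s),
    }


# Assumed configuration: 2 groups, 4 dependent variables, N = 66.
print(manova_power_inputs(f2=0.2, n_total=66, n_groups=2, n_dvs=4))
```

Under these assumptions the sketch gives λ = 13.2 and degrees of freedom (4, 61), matching the reported values, and V ≈ 0.167, which is close to the reported 0.16.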

4 Results

4.1 Preliminary results

The Presence Questionnaire items showed acceptable reliability and were therefore considered a construct representing the sense of presence variable: mean = 108.5, SD = 11.7, ω = 0.82, 95% CI [0.75; 0.89]. Oculomotor symptom items from the Simulator Sickness Questionnaire showed acceptable reliability and were therefore considered a construct representing the cybersickness variable: mean = 3.04, SD = 2.7, ω = 0.75, 95% CI [0.66; 0.84]. The state of flow questionnaire items showed acceptable reliability and were therefore considered a construct representing the flow variable: mean = 66.6, SD = 7.99, ω = 0.74, 95% CI [0.65; 0.82]. Since it is a single-item question, scores on the video game experience question were taken directly as the video game experience variable: mean = 4.81, SD = 3.04. In addition, all the statistical assumptions tested for the MANOVA were satisfied.

4.2 Null hypothesis testing

The MANOVA revealed a debatable but unlikely global effect of the gender/sex of the participant, along with a likely interaction effect between the gender/sex of the participant and the gender/sex of the experimenter (Table 1). Further explorations revealed four possible univariate effects on different VR variables (Table 2). In addition, it seems very unlikely that the passive presence of the main investigator had an effect on the outcome variables.

  • Sense of presence: a potential interaction between the experimenter’s gender/sex and the participant’s gender/sex was outlined (Table 2, part A), but no main simple effects or post-hoc comparisons (Fig. 2, part A).

  • Cybersickness: a potential interaction effect between the experimenter’s gender/sex and the participant’s gender/sex was outlined (Table 2, part B). Only one post-hoc adjusted comparison was outlined, as women evaluated by men seemed to report more cybersickness than men evaluated by men (t = 2.6, Cohen’s d = 0.944, 95% CI[0.062, 1.95], pTukey = 0.052) while women and men evaluated by women reported similar levels. In addition, a simple main effect of participant’s gender/sex was outlined with a moderating effect of experimenter’s gender/sex: F = 6.85, p = 0.01 (Fig. 2, Part B).

  • Video game experience: a potential main effect of the participant’s gender/sex was outlined (Table 2, part C). Post-hoc one-to-one adjusted comparisons outlined that men reported more video game experience than women (t = − 2.97, Cohen’s d = − 0.73, 95% CI[− 1.24, − 0.22], pTukey = 0.004). In addition, an interaction between the experimenter’s gender/sex and the participant’s gender/sex was outlined: F(1,66) = 4.62, p = 0.035, η2 = 0.057. Post-hoc comparisons outlined that men evaluated by men reported more video game experience than women evaluated by men (t = − 3.51, Cohen’s d = − 1.26, 95% CI[− 2.29, − 0.24], pTukey = 0.004), whereas men and women evaluated by women reported similar levels of video game experience. In addition, men evaluated by women reported more video game experience than women evaluated by men (t = 2.66, Cohen’s d = 0.98, 95% CI[− 0.048, 2], pTukey = 0.047). A last effect could be discussed, as women evaluated by women seemed to report more video game experience than women evaluated by men (t = 2.48, Cohen’s d = 0.77, 95% CI[− 0.09, 1.64], pTukey = 0.07). Finally, a simple main effect of participant’s gender/sex was outlined: F = 12.37, p < 0.001 (Fig. 2, Part C).

  • Flow: no differences were outlined on the state of flow measure (Table 2, part D). Men and women reported similar levels of flow, regardless of whether the experimenter was a man or a woman and the interaction between the gender/sex dyad (Fig. 2, Part D).
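
The relation between the t statistics and the Cohen’s d values reported above can be illustrated with the standard conversion for two independent groups, d = t·√(1/n1 + 1/n2). The sketch below is a consistency check only: it assumes the group sizes reported in the limitations section (27 men, 42 women), and the helper name `d_from_t` is hypothetical. How the statistical software actually derived d is not stated in the text.

```python
import math


def d_from_t(t, n1, n2):
    """Convert an independent-samples t statistic to Cohen's d (pooled-SD form)."""
    return t * math.sqrt(1 / n1 + 1 / n2)


# Main effect on video game experience: t = -2.97, with 27 men and 42 women
# (group sizes taken from the limitations section; an assumption for this check).
d = d_from_t(-2.97, 27, 42)
print(round(d, 2))  # about -0.73
```

Under these assumptions the conversion gives about − 0.73, in line with the reported effect size for the main effect of participant’s gender/sex on video game experience.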

Table 1 Results of the MANOVA (Pillai Test, Trace = V) on the self-reported post-immersion variables commonly measured in VR experiments: sense of presence, cybersickness, video game experience, flow
Table 2 Results of the ANOVA (Type II Sum of Squares) on the self-reported post-immersion variables commonly measured in VR experiments: sense of presence, cybersickness, video game experience, flow
Fig. 2 Graphical representations of the experimental groups on the ANOVA on self-reported post-immersion VR measures. Error bars represent 95% CI. a shows the ANOVA on sense of presence, b on cybersickness, c on video game experience and d on flow. Women appear in yellow/light and men in green/dark

5 Discussion

The aim of this exploratory study was to assess whether experimenter and participant gender/sex may have an effect on self-reported variables commonly measured in VR (sense of presence, cybersickness, video game experience, flow), and to evaluate a potential interaction between the two. In view of previous findings in the literature and the empirical results of this study, we chose to reject H0. Indeed, although these results suggest (i) an effect of the participant’s gender/sex, which we consider solely due to differences in video game experience, and (ii) an absence of effect of the experimenter’s gender/sex, we argue that these two variables interact to impact self-reported measures. In addition, this interaction effect seems to differ according to the variables measured: it seems to be prevalent for video game experience and cybersickness, but very weak or non-existent for the sense of presence, and very probably non-existent for the state of flow. In line with Goodman’s (2018) suggestion, and based on previous results, a priori power analysis, sample size and sample distribution, confidence intervals, p-values and effect sizes, we estimate an 80–85% chance of verisimilitude for these exploratory results. We now turn to discuss these results and the alternative hypothesis.

5.1 Main effect of the participant gender/sex

Although an overall main effect of the participant’s gender/sex was highlighted, further comparisons revealed that this was only true for video game experience, with men reporting more video game experience than women. This fact has already been documented (Behm-Morawitz 2013; Breuer et al. 2015; de Castell et al. 2019; Entertainment Software Association 2022; Maneuvrier et al. 2022), although the differences are much greater when assessed according to the type of games played (Yee 2017). Nevertheless, gender/sex differences in video game experience, and particularly in personal use of virtual reality, should not be overlooked. Indeed, these differences could be predictors of both the sense of presence and susceptibility to cybersickness, which in turn could influence performance in VR (Clemes and Howarth 2005; De Leo et al. 2014; Gamito et al. 2010; Knight and Arns 2006; Lachlan and Krcmar 2011; Maneuvrier et al. 2020; Weech et al. 2020). In contrast, there seems to be little or no evidence of differences between men and women in the experience of flow (Boyd et al. 2018; Plummer et al. 2017). The present study likewise finds none, as asserted by Csikszentmihalyi (1990), although not detecting an effect doesn’t mean it doesn’t exist. With the same precautions, and while a few studies have suggested a gender/sex effect in the way VR users experience the sense of presence (Felnhofer et al. 2012), the absence of gender/sex differences here is consistent with recent empirical work and is rather reassuring given its potential impact on performance (Maneuvrier et al. 2020). In contrast, debates about whether women experience more cybersickness than men in VR have been going on for decades (Stanney et al. 2003, 1999; Stanney et al. 2020a, b) and recent empirical work has also highlighted differences between genders/sexes when it comes to cybersickness (Maneuvrier et al. 2020). 
In view of our experimental results, we can note that in the present study men and women reported similar levels of cybersickness… only because not all women were evaluated by men. Indeed, if some gender/sex differences in cybersickness could be partially explained by modulating human factors (Maneuvrier and Westermann 2022; Stanney et al. 2020a, b), results of the present study allow us to outline a moderating effect of the experimenter's gender/sex. However, given that the experimenter's gender/sex is rarely indicated in experiment reports, it is impossible to know whether previous studies showing a gender/sex effect were conducted by men or women (or whomever). Still, we can note that in Maneuvrier et al. (2020), an empirical study where differences were found, the only experimenter was a man, which could have triggered an interaction with the participant's gender/sex (as is the case in half the cases of the present study).

5.2 Interaction between the experimenter’s and participant’s gender/sex

Indeed, men and women appear to differ in their symptoms of cybersickness when rated by men, but not when rated by women. An important question is whether these differences stem from a bias in self-reporting of negative symptoms, or whether the differences really exist on a psychophysiological basis. Of course, the two possibilities are not mutually exclusive, and this study cannot answer that question, as other physiological or behavioral measures of negative symptoms are needed (Dennison et al. 2016; Kim et al. 2005; Nesbitt et al. 2017). However, the fact that a very similar interaction effect was found on the video game variable, which in itself has no reason to change because of the experimenter’s gender/sex, might prompt us to hypothesize and discuss an effect of self-report measures. Opposite gender/sex dynamics seem to be at play here, and we think that studies on pain analysis, which is also a negative psychophysiological phenomenon, could be of interest. According to these views, women report more pain when assessed by men in order to appear more feminine and/or to trigger protective behaviors (Levine and De Simone 1991), while men report fewer negative symptoms when assessed by women in order to appear stronger and more masculine (Aslaksen et al. 2007; Carter et al. 2002; Kállai et al. 2004; Mogil 2012; Vigil et al. 2014). However, in our study, men seemed to report more cybersickness when assessed by women. Using the psychosocial stress interpretation, it could be argued that both women and men confused their psychosocial stress caused by the opposite gender/sex interaction with the symptoms of cybersickness, reporting more negative symptoms when evaluated by the opposite gender/sex. This interpretation based on psychosocial stress theory seems sound, but it cannot explain why a similar effect was evidenced for video game experience, which is not a physiological measure. 
However, it seems possible to explain these interaction effects by taking into account the dynamics of gender conformity and/or stereotype threat. Indeed, VR and video games are activities socially constructed by and for men (Kuittinen et al. 2007; Paaßen et al. 2017). Firstly, because men and women play very different types of games, with men playing more “intensive games” and “action games” like the one used in the study, and women playing “casual games” (Baniqued et al. 2013; Bosser and Nakatsu 2006; Juul 2012; Kapalo et al. 2015; Kuittinen et al. 2007; Rehbein et al. 2016; Saputra et al. 2017; Yee 2017). Secondly, because video games more often than not target (young) men by representing gendered behaviors and stereotypes (Beasley and Collins Standley 2002; Behm-Morawitz 2013; Dietz 1998; Dill and Thill 2007) and by under-representing female characters or, when they do appear, treating them as sexual objects (Behm-Morawitz 2013; Dill and Thill 2007; Dunlop 2007; Gestos et al. 2018). Furthermore, women playing online video games are often victims of bullying or harassment (Breuer et al. 2015; Kowert et al. 2017; Kowert and Quandt 2017), which has led researchers to consider an association between video games and sexism (Bègue et al. 2017; Fox and Potocki 2016; Kowert et al. 2017). From this, it is possible to consider that women participants, when interacting with a man-investigator, are confronted with the stereotype/social norm that women are not gamers (Paaßen et al. 2017), which is amplified by the “first person shooter” style of the virtual environment (Yee 2017). Women would therefore report less video game experience when evaluated by a man than they would with a woman, in order to appear more feminine and/or less deviant from gender norms. On the contrary, men would report slightly more video game experience in front of another man in order to show conformity to masculine-gender norms, which is generally socially appreciated (Swim et al. 
2020), but not in the presence of women in order not to appear sexist and/or show pseudo-virtual addictive behaviors that are much more widespread among men (Fisher 1994; King et al. 2020; Park and Hwang 2009; Van Rooij et al. 2011). It should be noted here that our empirical results seem to show that men are not, or at least much less, sensitive to the experimenter's effect than women. It is therefore important to bear in mind that the differences statistically observed may well be mostly due to an induced effect in women. Similar arguments could explain the differences reported in cybersickness, with women reporting more negative effects than men in order to show their maladjustment to a video game and technology reserved for men and/or to appear more sensitive to pain, which is socially considered feminine (Keogh and Denford 2009; Lloyd et al. 2020). Indeed, in Robinson et al. (2001) study, both men and women rated men as less willing to report pain than women, and women as more sensitive to pain and less able to bear it than men. Finally, similar interpretations could be used to explain the possible (weak) interaction effect on the sense of presence: women would report a slightly lower sense of presence when evaluated by a man in order to remain consistent with a “non-video game” gender norm, especially since they were faced with a typically masculine western-style shooting game. However, the fact that the interaction effect is less important for sense of presence than for cybersickness or video game experience is interesting and could confirm our interpretation. Indeed, and although the sense of presence is not a commonly gendered concept (as the concept is mostly socially unknown, although continuously experienced), the questionnaire that we used places a strong emphasis on technical processes, with questions on “transmission delay” or “graphical display” (Robillard et al. 2002; Schwind et al. 
2019; Witmer and Singer 1998), aspects that contain gender stereotypes (Clayton et al. 2009; Koch et al. 2008; López-Sáez et al. 2011; Meadows and Sekaquaptewa 2013; Spencer et al. 1999). In comparison, measures of the state of flow are entirely psychological in nature and are not linked to VR or video games and/or computers and/or gender norms (“the activity totally absorbed my attention”). Thus, no technical or video game stereotypes interfere, which could explain why we found no effect of the gender/sex of the experimenter on the flow measures reported by the participants.

5.3 Further explorations and theory evaluation

First of all, the practice of video games should probably not be considered as a single-factor variable based on frequency (Baniqued et al. 2013; Bosser and Nakatsu 2006; Juul 2012; Kapalo et al. 2015; Kuittinen et al. 2007; Maneuvrier and Westermann 2022; Rehbein et al. 2016; Yee 2017), although the aim here was to play on the associated global masculine connotation. Still, future studies assessing video games impact in VR should try to take into account the different types and media of video games, and in particular the use of recreational VR, which could prove decisive for the methodology of the tool. In addition, future studies investigating reports of cybersickness could learn by assessing susceptibility to motion sickness prior to immersion (Brown et al. 2022), even though this may have a suggestive effect. Furthermore, to find out whether the gender/sex experimenter effect lies in a bias in subjective self-reports or an alteration in phenomenology, particularly for cybersickness and sense of presence, other psycho-physiological and/or dynamic measures could be considered (Arcioni et al. 2019; Dennison et al. 2016; Grassini and Laumann 2020; Haydu et al. 2016; Kim et al. 2005; Maneuvrier et al. 2023; Nesbitt et al. 2017; Tian et al. 2022; Yang et al. 2022). In addition, and in order to test the two interpretive theories of the experimenter's gender/sex effect, namely psychosocial stress (Chapman et al. 2018) and gender conformity dynamics (Swim et al. 2020), two explorations are mandatory. First, participants’ gendered attitudes toward video games and VR and/or computers could be measured (e.g., to what extent they consider men to be better at video games than women) as well as their gender conformity attitudes (e.g., to what extent they personally conform to and value gender norms in others). 
In addition, stereotype threat theory could be used (either inducing stereotype activation or not) to test the interaction between experimenter and participant gender/sex, but the non-replicable results of studies supporting this theory cast doubt on its use (Flore et al. 2018; Flore and Wicherts 2015). Secondly, other variables modulating the stakes of social interaction could be manipulated (e.g. attractiveness, age, social status of the experimenter…) to test the psychosocial stress theory. In addition, researchers could play on the gendered identification of immersive schemes: does the environment resemble a male-oriented game (for example, a first-person action game) or a more neutral or feminine game? Finally, it could be very interesting to further investigate gender dynamics, for example by evaluating “actual identified gender” and “experimenter-perceived gender” on a continuum, or by evaluating similar effects with other gender/sex categories (intersex/non-binary, transgender…) and/or other sexual orientations. Indeed, the use of the term “gender/sex”, while it allows a form of humility and scientific precaution in the current framework which seems fundamental to us, will need to be pushed further at some point to allow a less globalizing and more detailed analysis of the phenomena at stake. The study of the experimenter’s effect and its interaction with the participant’s gender/sex is of course not sufficient to solve the reproducibility crisis, but it could contribute to solving it. Other (non-statistical) recommendations already discussed would be the pre-registration of empirical studies in scientific journals, as well as methodological review boards (Anvari and Lakens 2018; Lakens 2023; Munafò et al. 2017; Nosek et al. 2018), independent institutions dedicated to empirical replications and the global use of post-publication public peer-review (Ortega 2022; Townsend 2013).

5.4 Practical implications

The main recommendation to be drawn from these results is obviously to try as far as possible to alternate the gender/sex of experimenters, whether in VR studies or not. Indeed, if the data in the present study had been collected exclusively by a man, we might have concluded that there was a gender effect on participants’ cybersickness, with women reporting more negative symptoms than men. Instead, we argue here that this is an experimenter effect and that men and women do not differ intrinsically in their susceptibility to cybersickness. In the same vein, we also invite experimental researchers collecting human data to systematically describe, at least succinctly, the experimenters, including their gender/sex. Indeed, given the results of our study, it would be highly relevant to be able to know the gender/sex of experimenters who observed differences and those who did not, in order to carry out meta-analyses. In addition, we invite researchers to carry out control comparisons between men and women, but also, if possible, between experimenters. Furthermore, we defend the use of Hyde et al.’s (2019) “gender/sex” expression when their distinction in an empirical or theoretical study seems impossible, but also the use of open questions and, more generally, Cameron and Stinson’s (2019) guidelines for measuring gender/sex. Finally, we defend the “ditching” of statistical significance and the absolute p-value rule for exploratory studies, as we did in this study (Amrhein et al. 2019; Colquhoun 2017; Munafò et al. 2017; Simmons et al. 2011; Wasserstein et al. 2019).

5.5 Limitations

One of the main problems with this exploratory empirical study is the discrepancy between the number of men (27) and women (42) participants, which could bias the analysis in favor of the men's extreme values (even though the outliers have been objectively removed). It also alters the statistical power pre-calculated at 80%. This gender discrepancy may be explained by the difficulties encountered during the recruitment process due to the COVID-19 epidemic, as well as by the overall dominant representation of women in today’s French social science universities and, of course, by the financial and time resources available. In addition, the video game experience variable, which does not distinguish between types of video game or medium (particularly VR), may be a limitation, even though this variable is a purely dependent variable used to measure the experimenter’s effect and not an explanatory variable. Similarly, the homemade translation of the flow state scale could necessitate psychometric consolidation before being re-used for other purposes. Another limitation is that, although we tested the potential effect of the passive presence of the principal investigator (equally and randomly distributed between experimenters and participants), it is still possible that this presence had a statistically undetected effect on the results. Similarly, although we did not find an effect of experimenter bias on flow measures, this does not prove that it does not exist at all. In addition, most of the participants and experimenters were young Western Europeans, and most of the participants were students. It should be noted that ethnicity was not taken into account for the simple reason that the use of ethnic and racial statistics is prohibited in France (Möschel 2009). Consequently, it would be risky to make inductive inferences from this sample about other populations. Finally, this exploratory study, like any exploratory study, requires confirmatory research (Munafò et al. 
2017), which is often difficult given that virtual environments are generally not shared. For this reason, the author would like to make it clear that the virtual environment as well as the data used in this study will be shared freely upon request to the corresponding author for a reproducible and/or replicative and/or confirmatory study, with all the human support that will be possible. Without speculating on the replicability of the results of the present study, we consider that it is at least easily reproducible, particularly with the help of the automated tools used.

5.6 Conclusion

The main finding of this study is that an experimenter’s gender/sex interacts with a participant's gender/sex to influence the latter's responses on several self-reported measures of VR (cybersickness and video game experience, and to a lesser extent sense of presence): men and women differ when rated by men, but not by women. This result is discussed as an effect of psychosocial stress induced by interaction with the opposite gender/sex, as well as conformity to gender norms, particularly in women with regard to action VR video games strongly influenced by a masculine social construction. This interpretation also explains why this effect was absent from self-reported measures of flow, which were the only questions unrelated to VR and/or video games. This study invites researchers in the experimental sciences to better control for peri-experimental effects such as the effect of the experimenter, and in particular their gender/sex, in order to consolidate research in times of replicability and, to a greater extent, epistemological crisis.