Conducting interactive experiments online
Online labor markets provide new opportunities for behavioral research, but conducting economic experiments online raises important methodological challenges. This particularly holds for interactive designs. In this paper, we provide a methodological discussion of the similarities and differences between interactive experiments conducted in the laboratory and online. To this end, we conduct a repeated public goods experiment with and without punishment using samples from the laboratory and the online platform Amazon Mechanical Turk. We chose to replicate this experiment because it is long and logistically complex. It therefore provides a good case study for discussing the methodological and practical challenges of online interactive experimentation. We find that basic behavioral patterns of cooperation and punishment in the laboratory are replicable online. The most important challenge of online interactive experiments is participant dropout. We discuss measures for reducing dropout and show that, for our case study, dropouts are exogenous to the experiment. We conclude that data quality for interactive experiments via the Internet is adequate and reliable, making online interactive experimentation a potentially valuable complement to laboratory studies.
Keywords: Experimental methodology; Behavioral research; Internet experiments; Amazon Mechanical Turk; Public goods game; Punishment
JEL Classification: C71, C88, C90, D71
Online labor markets such as Amazon Mechanical Turk (MTurk) are increasingly popular tools for behavioral scientists. With their large and diverse pools of people ready to promptly perform tasks for pay, these markets present researchers with new opportunities to recruit participants for experiments.1 Studies from across the social sciences have systematically compared data collected online with data from the physical laboratory. Their conclusions are promising: classic results from psychology and economics have been replicated using online samples, and the data obtained online is deemed as reliable as that obtained via traditional methods.2
Despite its great potential, behavioral research online has so far remained largely limited to non-interactive decision-making tasks or one-shot games with simultaneous decisions. Current online studies of social behavior often use survey software such as Qualtrics or SurveyMonkey to document decision making in tasks that participants complete individually, and emulate interactions through post hoc matching. Although this approach can be powerful, it does not permit the study of repeated, ‘hot’ interactions where live feedback between participants is essential. Experimental designs with live interaction are rarely implemented online, partly because there is not yet a widely-used web-based equivalent of z-Tree (Fischbacher 2007).3
In this paper, we assess the potential for interactive online experiments, where a set of participants interacts for more than one repetition. Interactive experiments raise novel challenges throughout the whole life cycle of an experiment. Our approach is to discuss these challenges, that is, methodological differences and similarities between interactive experiments in physical and online laboratories. We discuss these step-by-step, from recruitment to dismissal of participants after the experiment.
A particularly important challenge of interactive online experiments relates to participant dropout. While in the physical laboratory participants rarely leave a session, online experiments are more prone to dropouts which affect both the participant who is dropping out and their interaction partners (who still have to be paid for their participation). If dropouts happen for reasons exogenous to the experiment—e.g. due to network problems, frozen screens, or random distractions—they are just a (costly) nuisance to the experimentalist. Much more problematic are dropouts that happen endogenously, that is, people quitting because of what has happened in the experiment. Such dropouts could jeopardize the internal validity of experiments (Zhou and Fishbach 2016).
As a case study we replicate a repeated public goods game with and without peer punishment used in cross-cultural research (Herrmann et al. 2008), employing a sample of US participants recruited via MTurk.4 We chose to replicate this experiment because it is fairly long and logistically complex. It is a within-subjects design with two experimental conditions of ten periods each, where, after the first set of ten periods, participants receive new instructions. Moreover, this experiment has often been replicated, and its design allows us to evaluate whether dropouts depend on the experimental conditions (that is, the presence or absence of punishment). We report data from participants recruited via MTurk (62 groups) and participants from the physical laboratory (18 groups). We used our own software LIONESS (Sect. 2.5), developed for conducting interactive online experiments.
We observe that basic patterns of behavior online are similar to those in the laboratory. In the absence of punishment, aggregate levels of cooperation are higher on MTurk than in the laboratory, but show similar rates of decay over time. Moreover, our econometric analysis reveals that in both of our samples the group contributions strongly determine the level of cooperation. The introduction of punishment promotes the emergence and maintenance of cooperation in both samples. Punishment is mainly prosocial in nature in both samples (cooperators punish non-cooperators) but occurs less frequently online.
Our most important result is that, in our implementation, dropouts are most likely due to reasons that are exogenous to the experiment. Together with the replication of findings from the laboratory, our results suggest that online interactive experiments can be a reliable tool for collecting internally-valid data and hence are a potentially valuable complement to the physical laboratory.
Our paper contributes to a recently-emerged literature on the reliability of data gathered on online labor markets such as MTurk (see references in footnotes 1–3). The most important predecessor of our paper is Anderhub et al. (2001), who compared online and laboratory experiments in the very early days of experimentation on the Internet. Their methodological discussion, however, could not consider the specific properties of modern online labor markets, where the bulk of present-day online experimentation is happening.5
The remainder of this paper is structured as follows. In Sect. 2, we introduce the experimental design. In Sect. 3, we discuss the conceptual and logistical differences between conducting interactive experiments in the laboratory and online and lay out our approach for dealing with them, highlighting important aspects of the data-collection process relating to attention and attrition. Section 4 shows the results of our experiment, systematically comparing cooperation and punishment behavior in our two samples. In Sect. 5 we present a detailed analysis of attrition in our online experiment. Finally, in Sect. 6 we make concluding remarks.
2 A case study to compare online and laboratory experiments
We base our discussion of online and laboratory experiments on a well-established paradigm: a public goods game with and without punishment (Fehr and Gächter 2000, 2002). In this section, we present the design of our experiments conducted in the laboratory and replicated online with a sample of participants recruited via MTurk. For the laboratory and the online samples, instructions and experimental screens were identical (screenshots are presented in the Online Appendix A).
2.1 General setup
Our experiment follows the within-subject design of Herrmann et al. (2008) and implements a repeated four-person public goods game with two conditions: one without punishment followed by one with punishment. Groups were constant throughout the experiment (‘partner matching’), and each condition ran for ten periods. Participants were aware that there were two ‘parts’ to the session (which corresponded to the conditions without and with punishment) but learned about the details of the second part only after the first one had finished.
At the beginning of a session, participants read on-screen instructions for the first experimental condition: the public goods game without punishment. Experimental instructions were shorter than those in Herrmann et al. (2008) (see Sect. 3 for rationale). Participants could start the interaction phase only once they had completed a set of comprehension questions.
2.2 Condition 1: a public goods game without peer punishment
In each period of the 10-period game, all four group members received an endowment of 20 points and simultaneously decided how many of those points to keep for themselves, and how many to contribute to a ‘group project’ (i.e. the public good). After all members had made their decision, the sum of all contributions was multiplied by 1.6 and distributed equally among all group members irrespective of their contributions. This setup reflects a social dilemma: in each period overall earnings are highest when each of the group members contributes all 20 of their points to the public good, while individuals maximize their earnings by contributing 0 regardless of the contributions of the others. Once all contributions had been made participants learned the result of that period. Apart from their own contribution and earnings, they were informed of the average contribution in their group. Subsequently, a separate screen showed the contributions of each of their fellow group members.
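The payoff rule of Condition 1 can be written out as a short sketch. The parameters (endowment of 20 points, multiplier of 1.6, groups of four) are taken from the text; the function and constant names are illustrative assumptions and are not taken from the experimental software.

```python
ENDOWMENT = 20      # points each member receives per period (from the text)
MULTIPLIER = 1.6    # factor applied to the sum of contributions (from the text)


def stage_one_earnings(contributions):
    """Return each member's earnings for one period without punishment.

    `contributions` holds one number between 0 and ENDOWMENT per group member.
    """
    for c in contributions:
        if not 0 <= c <= ENDOWMENT:
            raise ValueError("contribution must lie between 0 and the endowment")
    # The multiplied pot is split equally, irrespective of who contributed.
    share = MULTIPLIER * sum(contributions) / len(contributions)
    return [ENDOWMENT - c + share for c in contributions]


full = stage_one_earnings([20, 20, 20, 20])   # everyone contributes fully
none = stage_one_earnings([0, 0, 0, 0])       # nobody contributes
mixed = stage_one_earnings([0, 20, 20, 20])   # one free rider, three cooperators
```

With four members, each contributed point returns 1.6/4 = 0.4 points to the contributor. Contributing nothing therefore maximizes individual earnings, while group earnings are highest under full contribution.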
2.3 Condition 2: a public goods game with peer punishment
Once the 10 periods of Condition 1 were over, participants received new on-screen instructions about Condition 2. This condition also consisted of ten periods and was completed in the same groups as Condition 1. Again, the periods started once all group members had completed the comprehension questions. The decision situation was like Condition 1, but we introduced one change: once participants learned the contributions of each of their group members, they could assign up to 10 deduction points to each of their peers. Each assigned deduction point resulted in a loss of 1 point for the participant assigning it, and a loss of 3 points for its target. At the end of each period a separate screen informed participants of the total number of points they assigned and received. In cases where a participant made a loss during a period, only the costs of assigning deduction points would count towards the final earnings (cf. Herrmann et al. 2008). Each session concluded with a questionnaire including demographic items.
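The punishment stage can be sketched in the same way. The costs (1 point per assigned deduction point, 3 points per received deduction point, at most 10 points per peer) come from the text. The treatment of losses is an assumed reading of the rule above: deductions received cannot push period earnings below zero, but the cost of assigning points is always charged. All names are illustrative.

```python
MAX_DEDUCTION = 10    # deduction points one member may assign to each peer
COST_TO_PUNISHER = 1  # points lost per assigned deduction point
COST_TO_TARGET = 3    # points lost per received deduction point


def earnings_after_punishment(stage_one, assigned):
    """Return each member's period earnings after the punishment stage.

    stage_one: earnings before punishment, one number per member.
    assigned:  assigned[i][j] is the number of deduction points member i
               assigns to member j (members cannot punish themselves).
    """
    n = len(stage_one)
    for i in range(n):
        for j in range(n):
            limit = 0 if i == j else MAX_DEDUCTION
            if not 0 <= assigned[i][j] <= limit:
                raise ValueError("invalid number of deduction points")
    earnings = []
    for i in range(n):
        received = sum(assigned[j][i] for j in range(n))
        given = sum(assigned[i])
        # Assumed loss rule: received deductions are capped at zero earnings,
        # whereas the cost of assigning points always counts.
        after_received = max(0, stage_one[i] - COST_TO_TARGET * received)
        earnings.append(after_received - COST_TO_PUNISHER * given)
    return earnings


# Hypothetical period: member 0 free rides and is punished by all three peers.
stage_one = [44, 24, 24, 24]
assigned = [
    [0, 0, 0, 0],
    [5, 0, 0, 0],
    [5, 0, 0, 0],
    [5, 0, 0, 0],
]
result = earnings_after_punishment(stage_one, assigned)
```

In this hypothetical period the free rider receives 15 deduction points, which would cost 45 points and exceed the 44 points earned, so earnings are capped at zero under the assumed rule. Each punisher pays 5 points.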
2.4 Online and laboratory sample
In all sessions, participants received instructions and made their decisions via web browsers. The program was implemented in the experimental software LIONESS (Sect. 2.5). Both online and in the laboratory, sessions took 28 min on average. This is considerably shorter than the original study by Herrmann et al. (2008), but longer than typical tasks on MTurk.
For our online sample, we recruited participants via MTurk, restricting their geographical location to the USA (for comparability with our laboratory sample, see below). Results are based on 24 sessions, with 248 participants in total (62 groups of four). The average age of participants in this sample was 31.5 years (s.d. 9.06), and 38.4% were female. Average earnings in our online sample were $6.69 (s.d. $1.03), which were paid via MTurk.
The data from our laboratory sample were collected at universities in two different cities in the USA (Harvard University, Boston MA; and Yale University, New Haven CT) over 8 sessions, with 72 participants in total (18 groups). Laboratory participants were invited through e-mails using the online recruitment software SONA. The average age for participants in this sample was 25.2 years (s.d. 7.45) and 42.3% were female. Average earnings in our laboratory sample were $20.02 (s.d. $1.65), paid in cash upon session completion. To conform to standards of the respective laboratories and average expected wages on MTurk, we used an exchange rate of $0.02 per point in the laboratory and $0.01 per point in our online sessions; show-up fees were $10 and $1 in the laboratory and online, respectively.6
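The conversion from points to payments can be illustrated with a small sketch. The exchange rates and show-up fees are those stated above, expressed in cents and assumed to apply per point; the point totals in the example are hypothetical, and the names are illustrative.

```python
# Exchange rates and show-up fees from the text, expressed in US cents.
SETTINGS = {
    "laboratory": {"cents_per_point": 2, "show_up_fee_cents": 1000},
    "online": {"cents_per_point": 1, "show_up_fee_cents": 100},
}


def payment_in_cents(points, sample):
    """Convert a participant's total points into a payment in cents."""
    setting = SETTINGS[sample]
    # Point totals can be fractional, so round to the nearest cent.
    return setting["show_up_fee_cents"] + round(points * setting["cents_per_point"])


# Hypothetical participant who accumulated 500 points over the session.
lab_payment = payment_in_cents(500, "laboratory")   # 2000 cents, i.e. $20.00
online_payment = payment_in_cents(500, "online")    # 600 cents, i.e. $6.00
```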
2.5 The software used to conduct interactive experiments: LIONESS
We conducted both the laboratory and online experiments with LIONESS (Live Interactive Online Experimental Server Software). LIONESS provides a basic architecture for conducting interactive experiments online. Its key features reflect the solutions to the logistical challenges discussed in Sect. 3: dynamically grouping participants to minimize waiting times, regulating interactions in groups, promoting participants’ attention to the experiment, and dealing with participants who drop out of an experiment. The software developed for the experimental conditions reported here can be downloaded at: http://lioness.nottingham.ac.uk.
3 Methodological differences in conducting interactive experiments in the laboratory and online
3.1 The online laboratory MTurk
While our discussion of online experiments is based on an MTurk sample, many issues also hold for other online platforms (cf. footnote 1). MTurk is a large online labor market, which offers an active pool of over 500,000 workers. The MTurk workforce completes over 40,000 Human Intelligence Tasks (HITs) every day (www.mturk-tracker.com; Difallah et al. 2015; Ipeirotis 2010). MTurk ‘workers’ browse HITs that are published by ‘requesters’ who provide a brief description of the task, its expected duration and the minimum payment workers will receive upon completion (see Online Appendix A for screenshots of the HIT as published on MTurk). HITs typically involve short individual assignments which computers are currently unable to perform (Berinsky et al. 2012), such as the processing of images or data cleaning. Due to the sheer size of the pool of workers ready to perform tasks for pay, MTurk enables researchers to conduct large-scale experiments and to implement an effective random assignment of participants to different conditions beyond the capacity of a typical physical laboratory.7 While not primarily designed for academic research, MTurk has the potential for efficient data collection. As mentioned in the introduction, questionnaire studies and experiments without repetitions (e.g. one-shot Prisoner’s dilemma) conducted with MTurk participants have produced results comparable to those obtained from laboratory samples (e.g. Paolacci et al. 2010; Horton et al. 2011; Goodman et al. 2013).
Despite their promise, online behavioral experiments have conceptual and logistical challenges that are usually not present in the laboratory. Here we focus on differences between laboratory and online experimentation that are specific to interactive designs. See Buhrmester et al. (2011) and Paolacci and Chandler (2014) for extensive discussions of differences regarding non-interactive (survey-style) designs.
3.2 A step-by-step comparison of laboratory and online experiments
Methodological differences in conducting interactive experiments in the laboratory and on MTurk

| Aspect | Laboratory | MTurk |
| --- | --- | --- |
| Show-up fee | Typically a small part of total payoffs. Guaranteed when participant shows up to the session | Relatively large show-up fees promote recruitment rates, thereby facilitating prompt group formation. Experimenter can approve or reject the task submitted; if rejected no fee is paid |
| Recruitment | Invitations sent well in advance, participants commit to a session. Recruitment often from a pre-existing database | Sessions advertised online as HITs and can be completed immediately |
| Selection into the experiment | At sign-up, participants know very little about the experiment. Details of the task are communicated once participants are in the laboratory | Experiments are typically advertised as HITs with a brief task description. ‘Workers’ browse available HITs and accept those of their preference |
| Experience | Invitation conditioned on well-defined criteria of the laboratory’s records | HITs targeted at subsets of MTurk workers; experimenter can specify exclusion criteria. Many MTurk workers will have participated in many prior studies |
| Duplicate participation | Registration protocols usually prevent duplicate participation | Amazon acts against multiple worker accounts, but they exist |
| Comprehension | Participants can ask questions; comprehension questions ensure understanding | Experimenter is physically absent and cannot answer questions directly. Compulsory comprehension questions can be added but may make experiment (too) long for some participants |
| Group formation | Easy to guess how many participants will attend; group settings can be pre-defined | Hard to guess how many participants will attend; groups can be constructed ‘on the fly’ |
| Deception | In experimental economics deception is prohibited and laboratories foster reputations for non-deception | Because all requesters use the same subject pool, some participants may have experienced deception because requesters from other disciplines may use it |
| Communication between participants | Hardly an issue; experimenter can restrict communication between subjects | Participants may in principle collude through external channels though this is difficult in practice |
| Session progress | Closed form software like z-Tree specifies session progress | Scripted browser navigation specifies progress |
| Attrition | Hardly an issue; participants that start a session usually finish it | Major challenge to internal validity; if dropout rates vary with treatment, selection bias may arise |
| Payment | Cash usually paid upon completion | Automatic transfer through Amazon |
| Cost per participant | Relatively high but predictable | Relatively low but varies with attrition |
In a typical laboratory experiment, participants receive a show-up fee for attending. Still, the main part of the participant’s payment is usually determined by the decisions made over the course of the session. In a typical task on MTurk, participants are paid a flat reward per HIT, and the part of the earnings determined by their decisions can be added to their payments as a ‘bonus’. Consequently, a HIT that pays a relatively large flat fee usually draws more attention than one that promises a large bonus. This is particularly relevant for interactive experiments where participants need to wait for others to form a group at the start of a session (see below).
Sessions in the laboratory are pre-scheduled. A database contains the contact details of a pool of aspiring participants, who can register (and cancel) within a determined time window. Pre-scheduling ensures that the number of participants can be anticipated quite accurately before a session takes place, and including a small number of backup participants can prevent problems associated with unannounced non-attendance. Online platforms such as MTurk allow for instant recruitment of participants, facilitating a time-efficient method of data collection.8
To take advantage of these opportunities, LIONESS was ready to accommodate new participants during a time window specified by us, while capping the maximum number of entrants. In our experiment, we invited participants to sign up within 20 min of the HIT being posted and allotted them 45 min to complete the task. In addition, we asked them to start immediately. Recruitment rates were high (in a typical session with 100 slots, the first 50 participants normally entered within the first 5 min after the publication of the HIT), facilitating prompt group formation once participants had read and understood the instructions.9
Invitations to laboratory experiments rarely reveal any information on the contents of the experiment. On MTurk, however, participants browse various tasks that are currently available to complete for pay. This requires a HIT description giving the workers some idea of what the task will involve. To avoid self-selection (based on the topic of the experiment) into interactive experiments as much as possible, an experimenter can leave out any detailed information in the HIT description.10 For example, we did not announce that these are public goods experiments. On the other hand, it is essential that workers know that the HIT will involve live interactions with other people, and therefore they are expected to complete the interactive HIT without delay and without interruptions. The HIT description within MTurk is an appropriate place to make participants aware of this (see Online Appendix A for screenshots).
Typical laboratory subject pools are replenished annually when a new cohort of first-year students arrives on campus, and recruitment software allows invitation of only those participants with no (or little) experience with the experimental paradigm of a study. By contrast, the pool of MTurk workers (MTurkers) is replenished more regularly, but oftentimes MTurkers quickly acquire a sizable experience participating in hundreds of academic studies of all kinds.11 While researchers may have reason to believe that (frequent) prior experience could be an issue for their experiment (for a discussion of “lab rats” see Guillen and Veszteg 2012), MTurk facilitates inviting participants based on various criteria (e.g. number of HITs completed, their success rate or their geographic location). In addition, post-experimental questionnaires can include self-reported measures of participants’ familiarity with decision-making experiments and specific experimental paradigms.
In our case, we used MTurk’s options to restrict the geographical location of the participants to the United States for comparability with our laboratory sample. In addition, to increase the likelihood that participants completed our HIT with care, we only allowed workers with at least 90% of their previous HITs approved by requesters to participate (see Peer et al. 2014 for a detailed discussion of approval rates).
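The text does not say whether these restrictions were set through the requester website or through the API. As a sketch, assuming the HIT is created programmatically, the two restrictions could be expressed as entries for the `QualificationRequirements` parameter of the MTurk `CreateHIT` operation. The qualification type IDs below are Amazon's system qualifications for worker locale and approval percentage as we understand them; they should be checked against the current MTurk documentation before use.

```python
# Restrict participation to workers located in the United States.
US_ONLY = {
    "QualificationTypeId": "00000000000000000071",  # system qualification: worker locale
    "Comparator": "EqualTo",
    "LocaleValues": [{"Country": "US"}],
}

# Require that at least 90% of a worker's previous assignments were approved.
MIN_APPROVAL = {
    "QualificationTypeId": "000000000000000000L0",  # system qualification: percent approved
    "Comparator": "GreaterThanOrEqualTo",
    "IntegerValues": [90],
}

# Assumed usage: pass this list as QualificationRequirements when creating the HIT.
QUALIFICATION_REQUIREMENTS = [US_ONLY, MIN_APPROVAL]
```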
3.2.2 Session start-up
For many studies, it is essential that participants take part only once. In the laboratory it is relatively straightforward to implement this, particularly if the experimenter uses recruitment software such as ORSEE (Greiner 2015), SONA or hroot (Bock et al. 2014) and is physically present during laboratory sessions to confirm identities. For online sessions, however, re-takers may seriously compromise the data (e.g. by operating two browsers within the same experiment, potentially even controlling two players within the same group). Accordingly, detecting them requires specific measures. Within a session, we prevented duplicate participation by logging the user’s IP address and blocking users that had already been connected to the experimental server. Between sessions, we used third-party software to prevent workers who had already participated in a specific HIT from being invited for future sessions.12
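The two checks can be sketched as follows. This is a minimal illustration and not the implementation used in LIONESS or in the third-party software: the class name, the in-memory sets and the example identifiers are all assumptions.

```python
class DuplicateGuard:
    """Blocks repeat entries by IP address (within a session) and by
    worker ID (across sessions). Hypothetical sketch, in-memory only."""

    def __init__(self, past_worker_ids=()):
        self._seen_ips = set()
        # Workers who took part in earlier sessions of the same study.
        self._known_workers = set(past_worker_ids)

    def admit(self, ip_address, worker_id):
        """Return True if the entrant may proceed, False if blocked."""
        if worker_id in self._known_workers:
            return False          # already participated
        if ip_address in self._seen_ips:
            return False          # this address is already connected
        self._seen_ips.add(ip_address)
        self._known_workers.add(worker_id)
        return True


guard = DuplicateGuard(past_worker_ids={"WORKER_OLD"})
first = guard.admit("192.0.2.1", "WORKER_A")       # new address, new worker
second = guard.admit("192.0.2.1", "WORKER_B")      # same address again
returning = guard.admit("192.0.2.2", "WORKER_OLD") # took part in a past session
```

Blocking by IP address is a coarse filter: it can exclude distinct participants who share a network connection, and it does not catch one person using two connections.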
In a typical laboratory session participants can ask questions which the experimenter can answer in private, before the interactive phase of the experiment begins. For online sessions this is not feasible. To ensure that participants had a thorough understanding of the experimental decision situation and did not rush through the instructions, we introduced compulsory comprehension questions which participants had to solve before entering the decision-making phase of the experiment.13 A fraction of participants who entered the experimental pages did not proceed past the instructions and never reached the comprehension questions. In our online sample, 83.2% of the individuals who did reach the comprehension questions solved them successfully.14
3.2.3 Experimental interaction
In the laboratory, all participants typically arrive at a session at the same time and will simultaneously complete comprehension questions. By contrast, participants in online sessions may arrive during a time window set by the experimenter (20 min in our case), and the timing of completing comprehension questions may therefore vary substantially. Accordingly, we formed groups ‘on the fly’: participants who successfully completed the comprehension questions waited in a ‘lobby’. As soon as this lobby contained sufficient participants, a group was formed and its members were sent to the interaction phase of the experiment. An alternative (yet considerably less time-efficient) approach is to run pre-tests with participants and to build a ‘standing panel’ from which candidates for experimental sessions are recruited (see Suri and Watts 2011; Gallo and Yan 2015).
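The 'on the fly' matching described above amounts to a first-come, first-served queue. The following sketch illustrates the idea; it is not the LIONESS implementation, and the class and method names are assumptions.

```python
from collections import deque

GROUP_SIZE = 4  # group size used in the experiment


class Lobby:
    """Holds participants who passed the comprehension questions until
    enough of them are waiting to form a group."""

    def __init__(self, group_size=GROUP_SIZE):
        self.group_size = group_size
        self._waiting = deque()

    def enter(self, participant_id):
        """Add a participant. Return the new group if one can be formed,
        otherwise None."""
        self._waiting.append(participant_id)
        if len(self._waiting) >= self.group_size:
            return [self._waiting.popleft() for _ in range(self.group_size)]
        return None

    def leave(self, participant_id):
        """Remove a participant who stops waiting (no effect if absent)."""
        try:
            self._waiting.remove(participant_id)
        except ValueError:
            pass

    def still_needed(self):
        """Number of additional participants required to form a group."""
        return self.group_size - len(self._waiting)


lobby = Lobby()
outcomes = [lobby.enter(pid) for pid in ["p1", "p2", "p3"]]
needed = lobby.still_needed()
group = lobby.enter("p4")
```

The `still_needed` value is the kind of information that can be shown to waiting participants so that they know how many others are still missing.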
Although deception is uncommon in experimental economics, participants on MTurk are likely to encounter studies using deception (e.g. Pfattheicher and Schindler 2015). Participants may therefore be skeptical about the truthfulness of experimental instructions and doubt if their interaction partners are real people and not robots pre-programmed by the experimenter publishing the HIT. To promote trust between us (as experimenters) and the participants, we continuously strive to maintain a good reputation on our MTurk requester account (our records and those of other requesters can be found at https://turkopticon.ucsd.edu). In addition, our HIT description stated explicitly that groups were formed of real people recruited from MTurk.
To keep the attention of the participants focused on the experiment (and not have them dropping out in the very first period of the game), we clearly communicated the number of other participants they were waiting for at any given moment, and we added an on-screen countdown indicating the maximum amount of time left before participants could choose to leave the experiment if no group could be formed. When this timer reached zero in our experiment, participants could choose to either return to the lobby and wait for two additional minutes or to leave the session and collect their participation fee (of $1). This procedure led to a total of 89% of participants who correctly completed the comprehension questions being successfully matched into a group and starting the interaction phase.15 The remaining 11% could not be matched in a group of four, and were paid their participation fee.
While in the laboratory the experimenter can monitor and enforce any restriction of communication between participants, it is in principle harder to categorically exclude the possibility that online participants communicate with their interaction partners through external channels. To prevent participants from colluding via online forums such as Reddit or MTurk Crowd, we ran relatively large sessions in which participants could not be identified. In fact, forum discussions, which are usually moderated and prohibit the dissemination of the content of HITs and the discussion of strategies, typically center upon the attractiveness of a HIT in terms of earnings and length rather than its content (Chandler et al. 2014). Communication between participants is therefore a potential problem for online experiments in general, but it is not any more severe for interactive designs. A similar argument could be made for communication with other people who are not participating in the task at hand (e.g. someone else in the room while completing the task). Although such communication is harder to control in online experiments than in the physical laboratory, this issue is not specific to interactive designs either.
The most severe problem for online interactive studies, and the largest discrepancy with laboratory experiments, is attrition (participant dropout). In laboratory sessions participants very rarely leave or turn out to be unable to complete a session. In online experiments, attrition is a major issue: there is no straightforward way to prevent participants from leaving a session by closing their browser window or failing to submit responses to experimental decision situations due to technical problems. Moreover, in contrast to the laboratory, interaction partners are geographically scattered and the progress of an experiment depends on their joint attention to it. Typically, groups proceed at the pace of the slowest participant and long waiting times increase the risk of reduced attention, which may ‘cascade’ through the group. Thus, we took measures to retain attention and promote successful completion.16 In our case, we used on-screen timers and told participants that failure to reach a decision in due time would result in their removal from the experiment without payment. Furthermore, when a group member dropped out, we notified the remaining participants and let them continue in their reduced group.17 Data from incomplete groups is not included in the results reported in Sect. 4. An alternative approach to dealing with dropouts is to terminate the whole group once a member drops out. This may, however, damage the reputation of the experimenter as participants will be unable to earn as much as they had anticipated.
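The dropout procedure can be summarized in a short sketch: members who miss the deadline are removed, the remaining members continue, and the group is flagged so that it can be excluded from the analysis. The data structure and function names are hypothetical and do not describe the actual software.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    members: list
    complete: bool = True                       # False once any member has dropped out
    dropped: list = field(default_factory=list)


def close_period(group, responded):
    """Remove members whose decision did not arrive before the timer expired.

    `responded` is the set of member ids that submitted in time.
    Returns the list of members removed in this period.
    """
    removed = [m for m in group.members if m not in responded]
    if removed:
        group.members = [m for m in group.members if m in responded]
        group.dropped.extend(removed)
        group.complete = False   # the reduced group continues, but is flagged
    return removed


def groups_for_analysis(groups):
    """Keep only groups in which nobody dropped out."""
    return [g for g in groups if g.complete]


g1 = Group(["a", "b", "c", "d"])
g2 = Group(["w", "x", "y", "z"])
removed_first = close_period(g1, {"a", "b", "c", "d"})   # everyone responds
removed_second = close_period(g1, {"a", "b", "d"})       # "c" misses the deadline
usable = groups_for_analysis([g1, g2])
```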
Our procedure of letting smaller groups continue ensures that real people generate all the information that participants respond to. Alternative solutions to non-responding participants, such as introducing random decisions or repeating previous decisions (e.g. Suri and Watts 2011; Wang et al. 2012), may affect the behavior of those who are still in the experiment (now responding to partially computer-generated information) which potentially compromises the internal validity of the data from groups affected by a dropout. It also raises issues of deception if such computer-generated information is not disclosed. Moreover, this procedure may also compromise the validity of data from groups unaffected by attrition, as participants cannot know whether their interaction partners’ behavior shown to them is real or generated by a computer.
Due to the nature of conducting research via the Internet, some level of attrition seems unavoidable. Attrition rates are likely to vary with factors such as group size, complexity of the decision situation, and the pace of the experiment.18 Despite our measures to prevent attrition, 84 participants (18%) who started the interaction phase dropped out at some point in our experiment.19 As these participants were distributed across experimental groups, the fraction of the data set affected by these dropouts was considerably larger.
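A small calculation shows why the share of affected data exceeds the share of participants who drop out. The numbers below are hypothetical and chosen only for illustration; they are not the figures from our sessions.

```python
def attrition_summary(dropouts_per_group, group_size=4):
    """Return (share of participants who dropped out,
               share of groups with at least one dropout)."""
    n_groups = len(dropouts_per_group)
    participants = n_groups * group_size
    participant_rate = sum(dropouts_per_group) / participants
    affected_rate = sum(1 for d in dropouts_per_group if d > 0) / n_groups
    return participant_rate, affected_rate


# Hypothetical session: 10 groups of four, 5 dropouts spread over 4 groups.
participant_rate, affected_rate = attrition_summary([1, 0, 2, 0, 0, 1, 0, 0, 1, 0])
```

In this hypothetical session 12.5% of participants drop out, yet 40% of the groups are incomplete, because a single dropout affects all members of a group.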
3.2.4 Payment and costs of experiments
After a session is over, participants are typically paid according to their performance. MTurk, like other crowdsourcing platforms (see footnote 1), facilitates secure payments. The experimental software can generate a random code for each participant which can be matched with their MTurk ID, allowing for payments according to performance in the experiment. It is important to process payments immediately to maintain a good reputation as an MTurk requester.
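One way to implement the code-matching step is sketched below, using Python's standard `secrets` module. The ledger structure, the function names and the code length are assumptions made for illustration.

```python
import secrets


def issue_completion_code(ledger, bonus_cents):
    """Create a random code for a participant and store the bonus it is worth."""
    code = secrets.token_hex(4).upper()        # 8 hexadecimal characters
    while code in ledger:                      # regenerate on a (rare) collision
        code = secrets.token_hex(4).upper()
    ledger[code] = {"bonus_cents": bonus_cents, "worker_id": None}
    return code


def match_submission(ledger, submitted_code, worker_id):
    """Link a code entered on MTurk to a worker ID.

    Returns the bonus in cents, or None if the code is unknown or was
    already claimed.
    """
    entry = ledger.get(submitted_code.strip().upper())
    if entry is None or entry["worker_id"] is not None:
        return None
    entry["worker_id"] = worker_id
    return entry["bonus_cents"]


ledger = {}
code = issue_completion_code(ledger, bonus_cents=569)    # hypothetical bonus
bonus = match_submission(ledger, code, "WORKER_A")
repeat = match_submission(ledger, code, "WORKER_B")      # code already claimed
```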
All in all, typical costs per useable data point in an experiment with participants recruited via MTurk are likely to be lower than in a laboratory experiment. Nevertheless, these costs may vary with attrition rates, which can be affected by the specific features of the experiment such as its length and group size (as one dropout may compromise the data of the whole group). In our case, laboratory participants earned $20.02 on average. Therefore, with a group of four as the unit of observation, a useable data point cost us $80.08. Corresponding costs online were $47.32.20
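The cost per useable data point is total payments divided by the number of useable groups. The sketch below reproduces the laboratory figure from the numbers in the text (72 participants, 18 groups, average earnings of $20.02); amounts are in cents, and the function name is illustrative.

```python
def cost_per_usable_group(total_payments_cents, usable_groups):
    """Average cost, in cents, of one group that yields useable data."""
    if usable_groups <= 0:
        raise ValueError("at least one usable group is required")
    return total_payments_cents / usable_groups


# Laboratory: every participant belongs to a useable group.
lab_total_cents = 72 * 2002                                   # 72 participants at $20.02 on average
lab_cost_cents = cost_per_usable_group(lab_total_cents, 18)   # 8008.0 cents, i.e. $80.08
```

Online, the reported $47.32 exceeds four times the average earnings (4 × $6.69 = $26.76), presumably because payments to participants who could not be matched or whose group became incomplete also enter the numerator.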
4.1 Contribution behavior
The introduction of punishment opportunities strongly increases average contributions in both samples (average contributions in periods 10 vs 11: laboratory: 4.63 vs 12.94, p = 0.001; online: 7.85 vs 16.15, p = 0.001). Moreover, average cooperation levels over the course of the game are higher than in the absence of punishment (average group contributions in periods 11–20 vs 1–10: laboratory: 16.26 vs 8.30, p = 0.001; online: 17.15 vs 12.52, p = 0.001). As in the condition without punishment, overall contributions within groups are slightly yet significantly higher in our online sample than in the laboratory (laboratory: 16.26 vs online: 17.15, p = 0.008).23
[Figure: Contributions to the public good]
[Table: Cooperation dynamics (no punishment). Dependent variable: contribution to the public good (no punishment). Regressors shown: mean peer contribution in t−1.]
4.2 Punishment behavior
[Table 4: Determinants of punishment. Dependent variable: decision to punish (0 = no; 1 = yes). Regressors shown: mean contribution of others; received punishment in t−1.]
In sum, our results show that basic patterns of cooperation and punishment behavior in the laboratory are largely replicable online, and thus are robust to changes in the experimental method. Participants in our online sample initially contribute more but, in the absence of punishment opportunities, cooperation decays at similar rates. In both samples, peer punishment is mainly pro-social in nature, and its introduction increases and stabilizes cooperation.
5 Attrition: endogenous or exogenous?
Our observation that, across conditions, experimental results are quite robust already suggests that endogenous attrition due to what has happened in the experiment (and hence selection) is not a big issue in our data. Here, we investigate this issue more rigorously.
[Table 5: Determinants of attrition. Dependent variable: participant drops out in period t (0 = no; 1 = yes). Regressors shown: group member(s) dropped out in previous period; relative average contribution; relative average punishment received; relative average punishment given.]
Models 3–5 show that attrition is much more likely when a group member has dropped out in the previous period. This seems indicative of ‘cascading inattention’: when a participant drops out of the session (e.g. due to inactivity, a closed connection, or waning attention), their group members will have to wait for some time before they can proceed.27 Reduced attention may lead to additional attrition.
The results in columns 4 and 5 of Table 5 indicate that attrition did not depend on cooperation and punishment behavior in each of the experimental conditions. Specifically, dropouts did not depend on the behavior of the dropped-out participants relative to their fellow group mates, or on their earnings. The model in column 4 (fitted to the data from the experiment without punishment) shows that dropouts did not depend on relative average contributions. In addition to that, the model in column 5 (fitted to the data from the experiment with punishment) reveals that individuals who dropped out had neither received more punishment relative to their group members who did not drop out, nor differed from them in terms of the punishment towards others.
In light of the results presented here, one might feel tempted to embrace interactive online experimentation as a valuable complement to laboratory studies—and others might even see it as a cost-efficient substitute. The measures presented here address the most important methodological issues for conducting interactive experiments online, and our case study illustrates that established results from the laboratory can be replicated online. However, future research needs to establish how generalizable our results are to other research questions as, despite these measures, methodological differences between laboratory and online experimentation inevitably remain.
For instance, depending on the nature of the experiment, online participants can conceivably communicate with each other to share their knowledge, strategies and even experimental materials more quickly than their laboratory counterparts. As mentioned earlier, most of the forums that monitor the online community have mechanisms in place that prohibit the dissemination of materials, and participants themselves might find this practice prohibitively costly. Yet, one cannot completely rule out this possibility, as both laboratory and online participants can simply discuss an experiment through other channels. To some extent, the nature of interactive designs prevents participants from crafting intricate strategies beforehand. This might not be the case for experimental designs in which participants can figure out “correct” answers; such designs might be at risk of being ineffective (Haigh 2016) or of exhibiting reduced effect sizes (Chandler et al. 2015).
Comparisons between online and laboratory experiments can also be affected by differences in selection bias. Participants in online and laboratory experiments may self-select based on their opportunity cost of working time and their reservation wage. Indeed, opportunity costs and reservation wages might well differ between sessions conducted in the laboratory and online: laboratory participants might decide on whether to participate in an experiment by looking at the show-up fee paid and the travel costs they would incur (e.g. walk a long distance, experience bad weather, or even get dressed!), whereas for online participants such costs would typically be negligible. Interestingly though, results from Anderson et al. (2013) show that a comparable type of selection is unlikely to bias inference about the prevalence of other-regarding preferences. Thus, we have reasons to believe that our design is not particularly affected by the relatively low opportunity costs, but that others might be.
In this study, we systematically controlled for what we think are the most daunting logistical issues for running an interactive experiment online. However, one could argue that some important methodological differences between laboratory and online experiments remain, and that such discrepancies may potentially affect findings and treatment comparisons, regardless of experimental designs being interactive or not. For instance, we replicate classic patterns of behavior in an environment with less control but also find an important disparity between initial contributions. As we pointed out earlier though, this divergence is consistent with the one found in related studies comparing different adult populations with college students using various recruitment methods (Belot et al. 2010; Carpenter and Seki 2011; Gächter et al. 2004; Stoop et al. 2012; Gächter and Herrmann 2011). Yet we acknowledge that unobserved methodological differences might account for some of the variability observed in our results. Assessing the extent and severity of such differences in other designs is certainly a topic for future research.
The similarities between our online and laboratory results suggest that interactive designs conducted over the internet can be robust to changes in the experimental method. Yet, our results are based on a comparison between sessions that differed in terms of both the experimental method (online versus in the physical laboratory) and the subject pool (MTurk workers versus university participants). Further assessment of online experimentation could include a systematic study of the isolated effects of the method for collecting data (online or laboratory) on the one hand, and the subject pool (‘workers’ from an online labor market or university students) on the other hand, e.g. by running online experiments with university students and inviting MTurkers into the physical laboratory.
To summarize our discussion, we see our paper as a guide for researchers to think about relevant issues before deciding whether the online or the physical laboratory is most appropriate for their research question. Some might conclude that the loss of control is too big a problem for their designs, whereas others are willing to bear that loss of control. In the end, the extent of any loss of control is an empirical question and we encourage researchers to add to our first piece of experimental evidence.
7 Summary and conclusion
In this paper, we presented a detailed conceptual and methodological discussion of conducting interactive experiments in the physical laboratory and online. We illustrated similarities and differences using a repeated public goods experiment without and with punishment. Our comparative results suggest that online data quality is adequate and reliable, making online interactive experimentation a potentially valuable complement to laboratory studies.
Most importantly, attrition, though a significant nuisance in online experiments, did not compromise the internal validity of our data because attrition was unrelated to what happened in our experiment. Future research will need to establish how generalizable this result is to other interactive decision problems, in particular when attrition might be treatment-specific, which poses the biggest problem to internal validity (Zhou and Fishbach 2016). Future research should also investigate how individual characteristics of participants (e.g. social preferences) and aspects of the experimental design (e.g. group size, number of periods, complexity of the task and its instructions) affect dropouts.
We observed that cooperation levels in our online sample are substantially higher than in the laboratory, and are on the high end of the range of cooperation levels observed in the cross-cultural samples of Herrmann et al. (2008). These differences can be partly (but not completely) explained by the age of MTurkers relative to students in typical laboratory samples. Still, it is unclear whether some other differences in terms of the participants’ demographics, the perceived degree of anonymity, or the degree of familiarity with the experimental paradigm influence our results. We believe that future research should explore such avenues.
MTurk is by no means the only online labor market used for behavioral research. Peer et al. (2017) identified six other platforms for crowdsourcing behavioral research, in addition to platforms such as Prolific Academic. See Buhrmester et al. (2011), Horton et al. (2011), Rand (2012), and Paolacci and Chandler (2014) for evaluations of the potential of MTurk for conducting behavioral experiments in psychology and other social sciences.
For example, Chesney et al. (2009) conduct classic experiments in a virtual world platform and observe behaviour similar to the laboratory. Hergueux and Jacquemet (2015) find that social preferences of student participants elicited online and offline are qualitatively very similar; see their Table 1 for further references. Schmelz and Ziegelmeyer (2015) compare the behaviour of student participants in a principal-agent experiment conducted in the laboratory and on the Internet and conclude that data can be of similar quality. On MTurk, Horton et al. (2011) replicate classic framing effects and report similar levels of cooperation in a one-shot prisoner’s dilemma played online and in the laboratory. Berinsky et al. (2012) replicate classic findings on framing and risk tasks. For overviews of replication studies, see Paolacci et al. (2010), Behrend et al. (2011), Klein et al. (2014), and Mullinix et al. (2015).
Thus far, researchers running interactive experiments have developed their own bespoke software (see Egas and Riedl 2008; Suri and Watts 2011; Wang et al. 2012; Gallo and Yan 2015; Nishi et al. 2015; Schmelz and Ziegelmeyer 2015; Stagnaro et al. 2017), but recently released software is likely to change this trend (see BreadBoard, MobLab, oTree, SoPHIE, and UbiquityLab).
Another early paper on experimentation on the internet is Reips (2000). The author discusses the advantages and disadvantages of conducting online experiments and also provides a methodological discussion of how internet experiments can be used to validate laboratory data.
Also, the size of the potential pool of participants on MTurk allows researchers to use designs requiring large groups, and to obtain sufficient statistical power to detect effects of relatively small size (e.g. Hauser et al. 2016).
We chose a 20-min time slot based on pilot sessions. Over time, MTurk HITs move down the list of most recently-published jobs and entrance rates tend to taper off. When participants trickle in at low rates, the chances that they have to wait for a long time before they can be matched into a group increase substantially.
Self-selection into experiments may not be a specific problem for online studies. Selection effects may also occur in physical laboratories which regularly run interactive tasks. Participants may anticipate their tasks involving interactions with other participants in their session due to previous experience or hearing about it from other members of the subject pool; see Krawczyk (2011), Anderson et al. (2013), Cleave et al. (2013) and Abeler and Nosenzo (2015) for detailed discussions.
We used UniqueTurker (http://uniqueturker.myleott.com) to prevent re-takes between sessions. This method relies on the unique MTurk identification number that is linked to each worker and that Amazon constantly monitors to avoid duplicate participation. Alternatives to this method include Turkprime (http://turkprime.com).
We used considerably shorter experimental instructions than Herrmann et al. (2008) to minimize the variation in the time that participants take to read them; with dynamic group formation, minimal variation in preparation time is highly desirable (e.g. to reduce attrition, see below).
When entering an incorrect answer, subjects were allowed to try again. Incidentally, the total number of incorrect attempts in the six comprehension questions prior to Condition 1 did not differ between our laboratory and online samples (4.68 vs 4.46, respectively, Mann–Whitney test: p = 0.245).
As subjects completed the experiment in their browsers, we strictly regulated navigation between experimental pages. Participants navigating the pages at will may lead to serious disruptions in the experimental flow and even lead to crashing sessions. We used LIONESS to regulate the sequence of an experiment in a browser by using a pop-up window without navigation bars and overwriting the browser history within that window when moving on to another page. This procedure also prevented participants from revising their decisions during a given interaction.
On waiting pages (to which LIONESS directed participants once they made their decisions and waited for their fellow group members to do so as well), we added an animation (a commonly used circular icon spinning around) to assure the participants that the experiment was still active. Additional measures to promote attention may include playing sounds (e.g. Hauser et al. 2016) or having the browser window with the experimental pages placed in the foreground of the participants’ computer screen (‘focus stealing’) when a decision is due. We did not use these more intrusive measures in this study.
The experimental instructions mentioned that the total number of points contributed to the public good was multiplied by 1.6 before distributing the resulting amount equally among the group members. A group member dropping out therefore increased the marginal per capita return of contributions.
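The effect of a dropout on the marginal per capita return (MPCR) can be illustrated with a minimal sketch. The multiplier of 1.6 and the equal split are taken from the instructions described above; the payoff function is the standard linear public goods form and is shown as an assumption, and the endowment is a free parameter whose value is not taken from the text.

```python
MULTIPLIER = 1.6  # from the experimental instructions


def mpcr(group_size: int, multiplier: float = MULTIPLIER) -> float:
    """Marginal per capita return: each point contributed returns
    multiplier / group_size points to every active group member."""
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    return multiplier / group_size


def payoff(endowment, own_contribution, all_contributions, multiplier=MULTIPLIER):
    """Standard linear public goods payoff (ASSUMED form, for illustration).

    `all_contributions` lists the contributions of all active members,
    including the participant's own.
    """
    share = mpcr(len(all_contributions), multiplier)
    return endowment - own_contribution + share * sum(all_contributions)


print(f"MPCR with four active members:  {mpcr(4):.3f}")
print(f"MPCR with three active members: {mpcr(3):.3f}")
```

With four active members each contributed point returns 0.4 points to every member; with three it returns about 0.53, so contributing becomes individually less costly after a dropout, although it still yields less than a point back to the contributor.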
Note that both the stability of the experimental software and the stability of the server on which the software is run might strongly affect attrition too. Pilot sessions led us to deploy LIONESS on high-CPU servers (available through Google Cloud) to facilitate many simultaneous server–client interactions.
Participants that did not respond in time or dropped out for any other reason did not receive any payment.
We incurred additional costs in our online sessions through: (1) 60 participants who did complete the comprehension questions but were not allocated to a group and who received $1; (2) participants from groups with dropouts (a total of $725.54); (3) Amazon MTurk fees ($489.02). The online cost per usable data point is computed as: (payments to the participants in the 62 groups finishing without dropouts + additional costs from 1–3)/62. This amount does not include payments to subjects who took part in pilots we ran to determine our procedures.
Unless otherwise stated, all statistical tests are two-sided Mann–Whitney tests with group averages over all periods as independent observations.
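The testing approach described here, with one average per group as the independent observation, can be sketched as follows. This is an illustrative pure-Python version that uses midranks and the normal approximation with a tie correction; it is not the software used for the reported tests, exact p values for small samples may differ, and all function names and the example data are hypothetical.

```python
import math
from statistics import mean


def group_averages(contributions_by_group):
    """Collapse each group's per-period contributions into one average."""
    return [mean(periods) for periods in contributions_by_group.values()]


def mann_whitney_u(x, y):
    """Two-sided Mann-Whitney U test (normal approximation, tie-corrected,
    no continuity correction). Returns (U statistic of x, p value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y], key=lambda p: p[0])
    rank_sum_x, tie_term, i = 0.0, 0.0, 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2.0  # average of ranks i+1, ..., j
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        t = j - i
        tie_term += t ** 3 - t
        i = j
    u_x = rank_sum_x - n1 * (n1 + 1) / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:  # all observations identical
        return u_x, 1.0
    z = (u_x - n1 * n2 / 2.0) / math.sqrt(var_u)
    return u_x, math.erfc(abs(z) / math.sqrt(2.0))


# Made-up example data: per-period average contributions for a few groups.
lab = {"g1": [8, 7, 5], "g2": [10, 9, 6], "g3": [6, 4, 3]}
online = {"g4": [14, 12, 11], "g5": [13, 13, 10], "g6": [15, 12, 9]}
u_stat, p_value = mann_whitney_u(group_averages(lab), group_averages(online))
print(f"U = {u_stat}, p = {p_value:.3f}")
```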
See Online Appendix B for a comparison of our samples with the cross-cultural ones from Herrmann et al. (2008).
An additional regression model including an interaction between ‘MTurk’ and ‘period’ does not detect a significant effect of this interaction in either condition (p > 0.560). Our results are also robust to the inclusion of demographic controls and the use of a different (multilevel mixed effects) model specification (see Table 6 of the Appendix).
In Table 8 of the Appendix we present a more detailed analysis of pro-social and anti-social punishment, indicating that, by and large, punishment is driven by the same factors in the laboratory and online. These models also reveal that both in the laboratory and online the positive effects of ‘received punishment in t−1’ on punishment reported in Table 4 are due to instances of anti-social punishment.
In our software a subject was considered to have dropped out after 20 s of inactivity. This delay was introduced in order to allow people with brief connection irregularities to still complete the experiment.
We thank Thorsten Chmura, David Cooper, Dennie van Dolder, Urs Fischbacher, Marcus Giamattei, Roberto Hernan Gonzalez, Oliver Hauser, Susie Jackson, Daniele Nosenzo, Charles Noussair, David G. Rand, Ruslan Shichman, Katrin Schmelz, Jonathan Schulz, Richard Upward, Ori Weisel, Irenaeus Wolff, participants at the 2014 ESA Meeting in Ft. Lauderdale, and two anonymous referees for their helpful comments. This work was supported by the ESRC Network for Integrated Behavioural Science (NIBS, ES/K002201/1) and the European Research Council (ERC) Advanced Investigator Grant 295707.
- Anderson, J., Burks, S. V., Carpenter, J., Gotte, L., Maurer, K., Nosenzo, D., et al. (2013). Self-selection and variations in the laboratory measurement of other-regarding preferences across subject pools: evidence from one college student and two adult samples. Experimental Economics, 16(2), 170–189. doi:10.1007/s10683-012-9327-7.
- Belot, M., Duch, R., & Miller, L. (2010). Who should be called to the lab? A comprehensive comparison of students and non-students in classic experimental games. University of Oxford, Nuffield College Discussion Papers (2010-001).
- Difallah, D. E., Catasta, M., Demartini, G., Ipeirotis, P. G., & Cudre-Mauroux, P. (2015). The dynamics of micro-task crowdsourcing: The case of Amazon MTurk. WWW’15 Companion: Proceedings of the 24th International Conference on World Wide Web, 617–617. doi:10.1145/2740908.2744109.
- Hauser, O. P., Hendriks, A., Rand, D. G., & Nowak, M. A. (2016). Think global, act local: Preserving the global commons. Scientific Reports, 6, 36079. doi:10.1038/srep36079.
- Paolacci, G., Chandler, J., & Ipeirotis, P. G. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5(5), 411–419.
- Schmelz, K., & Ziegelmeyer, A. (2015). Social distance and control aversion: Evidence from the Internet and the laboratory. Research Paper Series Thurgau Institute of Economics and Department of Economics at the University of Konstanz, TWI-RPS 100.
- Stewart, N., Ungemach, C., Harris, A. J. L., Bartels, D. M., Newell, B. R., Paolacci, G., et al. (2015). The average laboratory samples a population of 7,300 Amazon Mechanical Turk workers. Judgment and Decision Making, 10(5), 479–491.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made.