Although some questions psychologists care about involve comparing only two conditions to each other, most require teasing apart the contributions of many intertwined variables. In the past, this has required hundreds, if not thousands, of studies across numerous laboratories, each targeting a specific variable, population, or stimulus set.

In principle, we can now do this work many orders of magnitude more quickly. Given that half the world’s population has internet access (ITU Telecommunication Development Sector, 2017), any study that can be run on a computer or mobile device can be run with nearly any demographic anywhere in the world, and in large numbers. This includes not just surveys, but studies involving grammatical judgments, reaction times, decision-making, economics games, eyetracking, priming, sentence completion, skill acquisition, and others—which is to say, most human behavioral experiments (Birnbaum, 2004; Buchanan & Smith, 1999; Germine et al., 2012; Gosling & Mason, 2015; Gosling, Sandy, John, & Potter, 2010; Haworth et al., 2007; Honing & Ladinig, 2008; Krantz, 2001; Meyerson & Tryon, 2003; Papoutsaki et al., 2016; Reips, 2002; Skitka & Sargis, 2006). Extensive research has shown that data from online studies is, if anything, of higher quality than what is typically achieved in the lab (Appendix C).

The feasibility and utility of internet experiments is amply demonstrated by the widespread adoption of Amazon Mechanical Turk (Buhrmester, Kwang, & Gosling, 2011; Mason & Suri, 2012; Paolacci, Chandler, & Ipeirotis, 2010; Stewart, Chandler, & Paolacci, 2017). For example, around one-quarter of recent cognitive psychology articles feature at least one online experiment (Stewart et al., 2017). Fully capitalizing on the promise of the internet, however, requires finding a way to go beyond Amazon’s subject pool of fewer than 20,000, mostly American and Indian, adults to the full population of three billion internet users (Buhrmester et al., 2011; ITU Telecommunication Development Sector, 2017; Paolacci et al., 2010; Stewart et al., 2015).Footnote 1

In fact, a number of researchers have successfully leveraged the internet to conduct what are effectively hundreds of studies at once: massive online experiments that cover a wider range of demographics, a wide range of stimuli, or both (Blanchard & Lippa, 2007; Bleidorn et al., 2013; Bleidorn et al., 2016; Brysbaert, Stevens, Mandera, & Keuleers, 2016; Condon, Roney, & Revelle, 2017; Fortenbaugh et al., 2015; Gebauer et al., 2014; Germine, Duchaine, & Nakayama, 2011; Halberda, Ly, Wilmer, Naiman, & Germine, 2012; Hartshorne & Germine, 2015; Hartshorne, O’Donnell, & Tenenbaum, 2015; Hartshorne & Snedeker, 2013; Hartshorne, Tenenbaum, & Pinker, 2018a; Hauser, Young, & Cushman, 2008; Johnson, Logie, & Brockmole, 2010; Kajonius & Johnson, 2018; Keuleers, Stevens, Mandera, & Brysbaert, 2015; Killingsworth & Gilbert, 2010; Kumar, Killingsworth, & Gilovich, 2014; Lippa, 2008; Logie & Maylor, 2009; Manning & Fink, 2008; Maylor & Logie, 2010; Nosek, Banaji, & Greenwald, 2002; Peters, Reimers, & Manning, 2006; Reinecke & Gajos, 2014; Riley et al., 2016; Salganik, Dodds, & Watts, 2006; Soto, John, Gosling, & Potter, 2011; Susilo, Germine, & Duchaine, 2013).Footnote 2 Many of these studies have prompted significant revision of theory, including overturning long-standing theoretical accounts of cognitive aging, critical periods, and aesthetic preferences (Fortenbaugh et al., 2015; Germine et al., 2011; Halberda et al., 2012; Hartshorne & Germine, 2015; Hartshorne, Tenenbaum, & Pinker, 2018b; Reinecke & Gajos, 2014).

Researchers have also used a related paradigm to process enormous amounts of data: citizen science (Dickinson, Zuckerberg, & Bonter, 2010; Greene, Kim, Seung, & the EyeWirers, 2016; Hartshorne, Bonial, & Palmer, 2013a, 2014; Kim et al., 2014; Poesio, Chamberlain, Kruschwitz, Robaldo, & Ducceschi, 2013; Simpson, Page, & De Roure, 2014). Citizen science projects recruit large numbers of volunteers to assist in scientific research (Box 1). While citizen science has been much more widely used in other fields (e.g., for categorizing galaxies or tracking bird migrations; Sullivan et al., 2014; Willett et al., 2017), some early successes have shown its potential power for psychology and neuroscience. For instance, mapping the synapses of even a single axon is an extremely time-intensive task. By recruiting over 100,000 volunteers, Kim et al. were able to map 274 retinal axons, finding that different types of bipolar cells vary in how close they synapse to starburst amacrine cell somas. This physical asymmetry in synapse location, coupled with several other properties of these neurons, provides a plausible mechanism for explaining how the mammalian brain detects motion.

Obstacles to broader adoption

Any researcher wishing to engage in massive online experiments or citizen science immediately runs into a significant obstacle: There is no ready-to-use software for implementing them. Indeed, the major online research websites—gameswithwords.org, testmybrain.org, labinthewild.org, projectimplicit.org, and eyewire.org—all use their own custom, in-house software.

The reason for this may not be immediately obvious, given that there are a number of solutions for online studies, including commercial platforms (Qualtrics, SurveyMonkey, LabVanced) and open-source software (jsPsych, PsychoJS, Ibex Farm) (Table 1). However, these systems were designed for relatively small experiments, with a few thousand subjects at most. Internet-scale studies present additional challenges: recruitment, reliability, support for a wide range of paradigms, and contingent design.

Table 1 Comparison of prominent software for behavioral experiments on the internet with the proposed software, based on software documentation and discussion with the developers

Recruitment is perhaps the most obvious problem. There may be over three billion people online, but they need some motivation to do your experiment. Many of the existing solutions involve paying subjects through Amazon Mechanical Turk or Qualtrics. However, the number of subjects that can be recruited through these platforms is insufficient for the kinds of studies under discussion here (Buhrmester et al., 2011; Levay, Freese, & Druckman, 2016; Paolacci et al., 2010; Stewart et al., 2015), and paying that many subjects would in any case be prohibitively expensive. Instead, internet-scale research typically relies on gamification, personalized feedback, and other strategies to make participation intrinsically rewarding, strategies that are not generally available through existing platforms (Table 1).

The difficulties associated with reliability may not be immediately apparent to individuals who have limited experience running popular websites. In essence, internet-scale research has one inherent vulnerability: Websites are most likely to crash precisely when you need them most. That is, subjects tend to come in large waves (Fig. 1), and this heavy traffic can overwhelm a website and cause it to crash. Because it is during those waves of traffic that we collect almost all of our data, that is the worst possible time for the website to crash. Handling such surges is one of the most difficult problems in web development; indeed, a common method of cyberattack is simply to overwhelm a website with heavy traffic. Common strategies for mitigating this risk involve auto-scaling (described below) and comprehensive backups. Although large commercial websites such as Google Forms or SurveyMonkey will generally handle these issues, they are fairly limited as platforms for research.

Fig. 1

Daily traffic at GamesWithWords.org, a successful website for massive online experiments and citizen science. Large spikes in traffic are common after the launch of a new experiment, coverage in popular media, or both.

Indeed, our formal and informal surveys of colleagues indicate that access to a range of experimental paradigms is a major limiting factor in the adoption of internet-scale research. Most platforms were designed to support a particular paradigm, such as surveys (SurveyMonkey, Qualtrics, Google Forms), psycholinguistics experiments (Ibex Farm), or iterated cultural evolution experiments (Dallinger). Researchers who want to run multiple paradigms may need to learn multiple platforms (Table 1). This provides yet another barrier.

Finally, one of the most powerful uses of massive online experiments and citizen science projects is to gather data on very large numbers of specially chosen experimental stimuli, often with each participant seeing only a small fraction of the test items. This raises a number of difficult design questions: Which stimuli should be tested, how many times should each be presented, and to which participants? Traditionally, researchers have answered these questions informally, using a combination of intuition, prior experience with experimental paradigms, power analyses, and estimates of the number of participants and the time involved in studies. This is possible in the laboratory, where the experimenter has time between subjects to adjust protocols, design new experiments, and so forth. It is not possible for internet-scale studies, for which data collection is continuous and happens at the discretion of the subjects, not the experimenter. A further complication is that with internet-scale studies, we rarely know how many subjects we will have, even to within an order of magnitude. Thus, to get full use out of internet-scale research, it is often necessary to have the software respond contingently as the data come in.

This can be addressed with optimal experimental design, active learning, or other machine-assisted experimental design techniques (see the Contingent Experiments section). The basic idea behind machine-assisted experimental design is to choose, on the basis of all participants’ previous behavior, the stimuli that will provide the most relevant information. This may, for instance, take the form of adapting to one participant’s individual characteristics, selecting the next item of interest given responses so far, or choosing an entire study design from a large space provided by the user. Unfortunately, fully supporting machine-assisted experimental design requires a different software architecture from that of existing software (see the How Pushkin Works section). Thus, existing software either does not support machine-assisted experimental design or, in the case of Dallinger, supports only a subset of methods (Table 1).

Pushkin: A platform for internet-scale research

Philosophy and approach

The lab-based experimental paradigm developed over a century ago. A robust approach to internet-scale studies will not appear immediately, nor without a great deal of work. Thus, we have approached the problem in an incremental, scalable way. Although our long-term goal is a robust new paradigm that vastly increases the rate of progress in our science, we do not attempt to do all of this (or even most of this) ourselves. Instead, our approach is to lay a foundation upon which our community can build.

Part of the inspiration comes from Alexander Pushkin (1799–1837), who developed the literary language of Russian. No work of literature exists in a vacuum. Authors lean on established genres (romance, coming of age, fantasy), standard archetypes (vampires, ninjas, jaded private eyes), idioms (“heart on my sleeve,” “the center cannot hold,” “the sound and the fury”), and direct references (he is a Scrooge/Eeyore/Romeo) in order to quickly evoke characters, scenes, and emotions, rather than building everything from scratch. Prior to Pushkin, little was written in Russian, and this shared cultural vocabulary did not exist. Pushkin promoted what did exist, invented vocabulary, established genres, and coined idioms, building the framework that Dostoevsky, Tolstoy, Chekhov, and others depended on for their masterworks.

Thus, our goal is to establish a shared cultural vocabulary on which an internet-scale research paradigm can be built. This includes producing not just reusable software but also experimental paradigms, analysis methods, and best practices. Just as the Russian literary language continued to develop after Pushkin, our goal is to establish a core set of tools and paradigms upon which others will build.

As such, we stress interoperability (our tools should be modular and interface with existing projects), extensibility (our tools should be easy to extend), and broad applicability (our tools should be useful not just for economics games or psycholinguistics tasks, but for the widest possible range of internet studies). The result, we hope, will be a developer ecosystem that supports rapid building and deployment of not just new experiments, but new tools. This common vocabulary can then be used by developers to build custom applications for specific laboratories or to develop easy-to-use plug-and-play software for specific types of studies (cf. E-Prime). Such ecosystems play vital roles in the technology industry. We believe that they can play a similar role in social science.

At the heart of this is the Pushkin experiment framework itself: a platform for internet-scale research. In particular, Pushkin version 1.0 provides a range of functionality that is needed for massive online experiments and citizen science but that is not addressed by existing software. Importantly, in keeping with our philosophy, Pushkin is not a stand-alone piece of software, but rather a highly modular framework that binds together different reusable tools. Although some of these tools are original to Pushkin, many are third-party tools, such as jsPsych, WebGazer.js, RabbitMQ, Auth0, Bookshelf.js, WebPPL, and many of the components of Amazon Web Services (see the How Pushkin Works section and Fig. 6). If no existing product meets our needs, we attempt to extend one rather than build something from scratch. For instance, we modified jsPsych to make it easier to integrate with Pushkin. However, the services and libraries used are merely default choices; other researchers could swap them out for others as needed. Similarly, our own original tools can be reused for unrelated projects.

Box 1: Definitions There is no well-established terminology for internet-scale studies. Below, we define some of the terms used in this article.

Broadly multidemographic: a study comparing subjects from a large number of demographic groups. For instance, Hartshorne and Germine (2015) quantified cognitive abilities for every age from 10 to 70 (> 75% of the typical lifespan), and Reinecke and Gajos (2014) quantified visual preferences for subjects from 175 countries (90% of the countries in the world).

Extensively sampled stimuli: a large number of stimuli covering a wide range of the space of potential stimuli. For instance, Brysbaert et al. (2016) collected judgments about 61,800 words; Ferrand et al. (2010) collected lexical decision times for 38,400 words and 38,400 nonwords; and Brady, Konkle, Alvarez, and Oliva (2008) tested memory for 2,500 pictures of objects.

Massive online experiment (MOE): an experiment conducted online that is broadly multidemographic, involves extensively sampled stimuli, or both. Typically involves tens or hundreds of thousands of subjects.

Citizen science: a study in which large numbers of volunteer research assistants help collect data, perform analyses, or otherwise carry out research activities (Bonney et al., 2014; Dickinson et al., 2010; Silvertown, 2009; Simpson et al., 2014). Citizen science projects differ from MOEs in that the volunteers are not research subjects.

Crowdsourcing: a large task is broken down into many small components, each of which is carried out by a different person (Doan, Ramakrishnan, & Halevy, 2011; Howe, 2006). Common examples include spam-filtering, labeling images for search, or checking websites for broken links. Most citizen science projects are examples of crowdsourcing. “Crowdsourcing” is sometimes confusingly used to refer to internet experiments. We avoid that usage here.

Features

In the Obstacles to Broader Adoption section above, we laid out a number of desiderata for internet-scale research software. In this section, we describe how Pushkin version 1.0 addresses them with its current functionality. In the next section, we provide technical details.

Recruitment and engagement

Pushkin version 1.0 provides a number of mechanisms for recruiting, engaging, and retaining both research subjects and citizen scientists. Note that all these mechanisms are optional, and researchers can use different ones for different studies. Indeed, different kinds of studies will benefit more from different recruitment and engagement mechanisms.

Mailing lists

Participants can sign up to receive alerts about new experiments. Using a similar system, gameswithwords.org has built a mailing list of several thousand individuals. Pushkin also allows individuals to sign up to receive information about the results of specific studies and any related publications, which facilitates compliance with common IRB requirements while also providing a mechanism for engagement. Importantly, the mailing list is siloed from the data, so there is no way to connect an email address to the subject data.

Sharing

To facilitate word-of-mouth recruitment, Pushkin makes it easy for participants to share experiments and other Pushkin webpages via email and social media. (Note that this does not give researchers access to subjects’ social media profiles.) Sharing on social media can be quite effective: Since 2014, 30% of gameswithwords.org’s traffic has come through social media referrals.

Personalized feedback

Many research participants appreciate immediate information about the outcome of a study (Huber, Reinecke, & Gajos, 2017; Jun, Hsieh, & Reinecke, 2017; Reinecke & Gajos, 2015). This can consist of a percentile score (you scored in the 75th percentile on vocabulary/face recognition/working memory) or a guess about some subject characteristic (based on our quiz, you are a native speaker of Spanish/elementary school teacher/43 years old). This can be quite effective. In a study of 5,000 visitors to testmybrain.org, a quarter cited “learning about myself” as the primary motivation for participation (Germine, personal communication, May 18, 2018). Pushkin provides a growing number of templates for such feedback in the form of jsPsych plugins, and developers can easily create their own.
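As a concrete illustration, here is a minimal sketch of what such a feedback trial might look like in jsPsych (v6-style syntax). The reference distribution and trial parameters are hypothetical placeholders, not part of Pushkin’s built-in feedback templates.

```javascript
// Hypothetical feedback trial: reports the participant's percentile
// relative to an assumed (placeholder) reference distribution.
var referenceScores = [12, 15, 18, 22, 25, 27, 30, 33, 36, 40]; // placeholder norms

var feedbackTrial = {
  type: 'html-keyboard-response',
  stimulus: function() {
    // Count the correct responses logged by earlier trials in this session.
    var score = jsPsych.data.get().filter({ correct: true }).count();
    var below = referenceScores.filter(function(s) { return s < score; }).length;
    var percentile = Math.round(100 * below / referenceScores.length);
    return '<p>You scored in the ' + percentile + 'th percentile. ' +
           'Thanks for participating!</p>';
  },
  choices: jsPsych.NO_KEYS,   // display only; no response collected
  trial_duration: 8000        // show the feedback for 8 seconds
};
```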

The impact of this feedback can be magnified by allowing subjects to share their results on social media (see the previous section). Although it is somewhat counterintuitive to researchers who have been trained by IRBs to be mindful of subject confidentiality, many subjects are extremely enthusiastic about sharing their results with their friends (cf. the popularity of Facebook quizzes). Thus, the ability to share makes research participation more interesting and therefore more valuable to many subjects (Huber et al., 2017; Jun et al., 2017). Again, this is implemented in such a way that researchers do not have access to subjects’ social media profiles, and subject data cannot be connected to social media profiles.

Leaderboards and badges

Citizen scientists are motivated by a desire to contribute to science (Reed, Raddick, Lardner, & Carney, 2013). Pushkin provides the option to use leaderboards, badges, and project status bars as a means of visualizing an individual’s contribution (Fig. 2). Although primarily intended for citizen science projects, these elements could in principle be used for massive online experiments. For instance, testmybrain.org informs potential subjects of how many subjects have already participated, providing a visualization of how much has been accomplished so far.

Fig. 2

Example of a citizen science project built with Pushkin, employing such gamification elements as a project progress bar, leaderboard, and shareable badges (see the right-hand side of the screen).

Forums

Pushkin provides support for an interactive forum in which participants can discuss the research. The forum has optional functionality that is particularly valuable for citizen science projects: the ability to post an item from the study to the forum for discussion and feedback (Fig. 3). Having discussed the item on the forum, anyone can then help code it. These features can be counterintuitive to many researchers, who are used to maintaining research subject naiveté. However, citizen scientists are performing the function of a researcher, and—except for projects that require the researcher to be blind to condition—it is often counterproductive for the researcher to be ignorant of the purpose of the project. Moreover, citizen scientists occasionally make important discoveries in their own right, so allowing them to pass along their observations can be very valuable (Becker, 2018).

Fig. 3

The four panels of this figure depict the relationship between citizen science projects and their associated forums in Pushkin. (Top left) A citizen science project in which a volunteer is analyzing a music clip. At the bottom left of the window, there is a button labeled “Ask a question.” (Top right) Clicking “Ask a question” brings up a pop-up window, allowing the volunteer to post the item they were working on to the forum, along with a question. (Bottom left) This question is sent to the forum, tagged with the name of the project. (Bottom right) In the forum, users can listen to the clip and discuss the question. Anyone also has the option to respond to the original query (i.e., to code the item in question).

Participant dashboards

For Pushkin projects that allow participants to create persistent identities (see the Range of Experimental Paradigms section), Pushkin provides dashboards: homepages for registered users that allow them to see information about their participation, such as forum posts they are tagged in, badges they have earned, or their personalized results from massive online experiments they have participated in (Fig. 4). From here, they can also manage their account. In Pushkin version 1.0 the out-of-the-box dashboard functionality is limited, but this is an active area of development, and users can customize the dashboard as needed.

Fig. 4

For Pushkin projects that involve persistent identities, participants who are logged in have access to a dashboard. In the dashboard, they can manage their account and also see information about their participation, such as forum posts they are tagged in. In Pushkin version 1.0, access to the dashboard’s full functionality requires customization.

Reliability and stability

Pushkin uses a variety of methods to decrease the probability that the website will crash, and to aid recovery if it does.

The most common reason for a website to crash is for it to be overwhelmed by massive influxes of traffic. This can be addressed by purchasing a very powerful web server. Unfortunately, this is prohibitively expensive. It is also overkill, since most of the time that computing power will go unused (cf. Fig. 1). By default, Pushkin makes use of several powerful methods provided by Amazon Web Services for auto-scaling: that is, for flexibly adjusting the amount of computing power available in response to demand (see the Auto-scaling section). This is augmented by specific design features of Pushkin’s internal architecture that allow it to “fail gracefully” during periods of heavy traffic (again, see the Auto-scaling section).

Nonetheless, no computer system is immune to crashes. Serious crashes can lead to data corruption or loss. For that reason, Pushkin by default makes use of several redundant mechanisms for backing up data, including real-time backups (see the Backups section).

Range of experimental paradigms

By default, Pushkin uses jsPsych to display stimuli and record responses. In principle, researchers could use any compatible experiment engine, but jsPsych is a particularly robust and flexible option (see Appendix A). It currently allows for the presentation of text, images, video, audio, and any other HTML-formatted content, including animations or interactive displays. Measurements can be made using keyboard responses, mouse clicks, touch input, text input, multiple-choice questions, Likert scales, drag-and-drop sorting, visual analog scales, and more. A unique strength of jsPsych is its plugin-based architecture, which allows developers to add new stimulus types and response measures. For example, we created a plugin that allows eyetracking using the WebGazer.js package (Papoutsaki et al., 2016).

Moreover, jsPsych plugins allow for the development of standardized protocols that can be adapted through the adjustment of a set of parameters. For instance, although the implicit association test (IAT; Greenwald, McGhee, & Schwartz, 1998; Nosek et al., 2002) could be implemented as a series of generic stimulus-with-keyboard-response trials, jsPsych provides an IAT plugin that produces the standard layout and feedback of the IAT. Thus, the plugin architecture allows researchers to rapidly develop and disseminate interoperable code for new (and old) experimental paradigms. The growing library of jsPsych plugins means that not only is a wide range of experimental paradigms possible, but a growing number of them are also quick and easy to implement.
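For readers unfamiliar with jsPsych, the sketch below shows a minimal timeline built from the generic html-keyboard-response plugin (v6-style syntax), framed here as a lexical decision task; the word list and response keys are illustrative only.

```javascript
// Minimal jsPsych (v6-style) timeline: lexical decision trials built from
// the generic html-keyboard-response plugin. Words and keys are placeholders.
var words = ['cake', 'sleng', 'interrupt', 'beigity'];

var trials = words.map(function(word) {
  return {
    type: 'html-keyboard-response',
    stimulus: '<p style="font-size:32px;">' + word + '</p>',
    choices: ['f', 'j'],   // f = word, j = nonword
    data: { item: word }   // tag each trial for later analysis
  };
});

jsPsych.init({
  timeline: trials,
  on_finish: function() {
    jsPsych.data.displayData(); // in a Pushkin study, data would instead
                                // flow to the experiment worker
  }
});
```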

Pushkin augments jsPsych’s range of experiments in two important ways. First, it provides a secure subject login system, which enables multi-session and longitudinal designs. It also supports emailing the subjects (with their permission) to remind them about follow-up sessions. (For information on data security, see the Authentication and Logins and Security sections.)

Second, Pushkin provides the infrastructure for a broad range of contingent experiments. We describe this in the next subsection.

Contingent experiments

Pushkin is designed from the ground up to allow dynamic stimulus selection (Fig. 5), and thus is uniquely suited to implementing machine-assisted experimental design algorithms. Most approaches to machine-assisted experimental design rely on mathematically rigorous specifications of (i) the space of the scientific hypotheses of interest, (ii) the space of possible test stimuli, (iii) the space of possible participant responses to the test stimuli, (iv) a measure of the informativity of each response relative to the hypotheses, and (v) algorithms for efficiently searching for good experiments, given these specifications. For instance, in active learning (Settles, 2012), individual experimental stimuli are chosen so as to adaptively minimize uncertainty about the hypotheses in a given hypothesis space using easy-to-calculate local statistical heuristics. In optimal experiment design (Fedorov, 2010; Ouyang, Tessler, Ly, & Goodman, 2018), whole experiments are constructed in order to globally optimize an information gain criterion. We are developing a growing library of templates for specific types of machine-assisted experimental design algorithms in Pushkin.

Fig. 5

(Left) Information flow in a standard computerized experiment (e.g., written in jsPsych or PsychoPy). Once the experiment begins, the software loops through each trial, recording the data before going on to the next trial. (Some software packages wait until the end to write the data.) (Right) Information flow in a Pushkin experiment. Pushkin separates input/output procedures (presenting stimuli and collecting the data) from determining what stimulus to display. After each trial, information is sent to a worker, which in addition to recording the results in the database, also decides what to do next. This allows Pushkin applications to dynamically update, choosing which stimuli to display on the basis of both that subject’s response and what other subjects have been doing. Two other important features of Pushkin applications are the data log, which records a complete history of all writes to the database, enabling version control, and the Chron worker, which carries out particular operations at specific times of day. See the main text for a discussion of how these are used.

Because machine-assisted experimental design is not yet common, we conclude this section with a detailed example. Readers who are not interested in the details should skip to the next section.

We illustrate using optimal experiment design as formulated by Ouyang et al. (2018). Let’s imagine that we are interested in theories explaining reaction times in lexical decision experiments. Lexical decision experiments are a workhorse method in psycholinguistics for studying the processing of words. Subjects must discriminate real words (e.g., “cake,” “interrupt,” “beige”) from nonsense words (e.g., “sleng,” “exterrupt,” “beigity”). The typical response measures are accuracy and reaction time. It is well known that word frequency affects lexical decision reaction times, with faster responses to more frequent words, though many of the details remain under debate (Adelman, Brown, & Quesada, 2006; Berent, Vaknin, & Marcus, 2007; Ellis, 2002; Morton, 1969; Ratcliff, Gomez, & McKoon, 2004).

Imagine that we wish to compare a set of hypotheses linking words to reaction times. For instance, imagine we wished to compare the hypothesis that reaction time is linearly related to word frequency to the hypothesis that it is logarithmically related. We would formulate each hypothesis as a linear model with frequency or log-frequency as a fixed effect and perhaps a variety of random effects. Formally, a hypothesis $m \in M$ is defined by a conditional distribution $P_m(y_x \mid x)$ linking a set of items, $x$, to measured lexical decision times $y_x$. Ahead of the experiment, we have some prior beliefs about how likely the hypotheses are, $P(M)$, informed by prior results. Our task is to determine what data to collect next, $x$, with the aim of collecting the data that would be most informative. We can formalize this as maximizing the distance between prior and posterior beliefs:

$$ x^{\ast} = \arg\max_{x} \, D_{\mathrm{KL}}\!\left[ P(M \mid x, y_x) \,\big\|\, P(M) \right] $$

where the Kullback–Leibler divergence $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is used as a measure of distance between distributions. A priori, we do not know what the result of any particular experiment will be, so we must marginalize over the possible results $y_x$:

$$ x^{\ast} = \arg\max_{x} \, \mathbb{E}_{\hat{p}(y_x;\, x)} \, D_{\mathrm{KL}}\!\left[ P(M \mid x, y_x) \,\big\|\, P(M) \right] $$

Given the specific formulations for $P_m$, this defines an objective function that can be used to optimally choose what data to collect next. For instance, we could search over possible stimuli in order to choose those stimuli that would best help us distinguish between the hypotheses.
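To make this concrete, the following toy sketch (plain JavaScript, purely illustrative) computes the expected information gain for a small discrete case: two hypothetical frequency hypotheses, three candidate words, and binarized (fast vs. slow) responses. Pushkin’s actual machine-assisted design tools are built around WebPPL models rather than hand-rolled enumeration like this.

```javascript
// Toy expected-information-gain calculation over a discrete hypothesis space.
// The two hypotheses give made-up probabilities of a "fast" response per word.
var hypotheses = [
  { name: 'linear-frequency', prior: 0.5,
    pFast: { cake: 0.9, beige: 0.6, azimuth: 0.3 } },
  { name: 'log-frequency', prior: 0.5,
    pFast: { cake: 0.8, beige: 0.7, azimuth: 0.5 } }
];
var candidates = ['cake', 'beige', 'azimuth'];

function klDivergence(p, q) {
  return p.reduce(function(sum, pi, i) {
    return pi > 0 ? sum + pi * Math.log(pi / q[i]) : sum;
  }, 0);
}

function expectedInformationGain(word) {
  var prior = hypotheses.map(function(h) { return h.prior; });
  return [true, false].reduce(function(eig, fast) {
    // Likelihood of this outcome under each hypothesis.
    var like = hypotheses.map(function(h) {
      return fast ? h.pFast[word] : 1 - h.pFast[word];
    });
    // Marginal probability of the outcome and the resulting posterior over hypotheses.
    var pOutcome = like.reduce(function(s, l, i) { return s + l * prior[i]; }, 0);
    var posterior = like.map(function(l, i) { return l * prior[i] / pOutcome; });
    // Weight the prior-to-posterior KL divergence by how likely the outcome is.
    return eig + pOutcome * klDivergence(posterior, prior);
  }, 0);
}

// Choose the word whose response is expected to be most informative.
var bestStimulus = candidates.reduce(function(a, b) {
  return expectedInformationGain(a) >= expectedInformationGain(b) ? a : b;
});
```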

To be clear, machine-assisted experimental design is not different in kind from what researchers normally do: try to design maximally informative experiments. In the same way that inferential statistics help us analyze data, machine-assisted experimental design helps us design experiments. Just as inferential statistics are most useful when the dataset is large and our questions about it are complex, machine-assisted experimental design shines when the hypotheses are many and complex, and when the experimenter has many design choices to make. Machine-assisted experimental design is also particularly helpful when the pace of data collection is too fast for the experimenter to make real-time decisions about what data to collect next—exactly the situation we face in internet-scale studies.

Although mathematical formulations of optimal experiment design, such as the one above, have been available for some time (e.g., Lindley, 1956), the method has not been widely used for two reasons. One is that the design of most experiment software platforms does not permit its use, in particular in the active stimulus selection setting in which machine-assisted experimental design and data collection must be tightly integrated. (The one counterexample being Dallinger, which supports some types of optimal experimental design; Suchow, 2018.) Pushkin’s unique architecture allows for a straightforward implementation of a wide range of machine-assisted experimental design protocols. Pushkin users can equip the experiment worker (Fig. 6) with computationally specified competing hypotheses and possible experiments. With these specifications in place, the optimal experiment can often be computed with no further input from the user. Pushkin will continue to make optimal choices about what data to collect next for as long as the experiment runs, thus making efficient use of however many subjects the experimenter manages to recruit.

Fig. 6

Schematic of a Pushkin website, which consists of some number of quizzes and some ancillary webpages, such as a forum and a user dashboard. Note that the API, queue, experiment worker, and experiment database (DB) worker are all subsumed under “worker” in Fig. 5. See the main text for a detailed description. Although this is not depicted, each study has its own experiment worker and database worker.

The second reason is that specifying hypotheses formally can be complex and optimizing experiment design objectives is computationally difficult. Recently, approaches based on probabilistic programming languages (PPLs) have emerged as a viable alternative (Ouyang et al., 2018). PPLs are high-level languages designed for expressing models from artificial intelligence, machine learning, statistics, and computational cognitive science. In PPLs, diverse models are expressed as programs in a common language, and inference algorithms are developed for the language as a whole, rather than for specific models (Goodman, Mansinghka, Roy, Bonawitz, & Tenenbaum, 2008). Probabilistic programming is thus ideal for rapidly specifying and deploying models of each of the components of a machine-assisted experimental design system described above. Although Pushkin users can implement machine-assisted experimental design using any programming language, we are using the probabilistic programming language WebPPL (see Appendix B) to implement a library of reusable tools for machine-assisted experimental design within Pushkin.

Thus, we believe that one of the major contributions of Pushkin will be making machine-assisted experimental design more accessible and easy to implement—and therefore more commonly used.

Other features

Webserver setup and management

By default, Pushkin employs a number of popular technologies for auto-scaling, version control, and data backups. Detailed instructions on how to set up the webserver and related technology are provided.

Support for reproducibility

Because the entire experiment is run via code, reproducing the study merely requires rerunning the code. In addition, Pushkin’s data log contains a reasonably comprehensive chronological record of everything that happened during testing (see the Backups section).

Stub website

Designing a website requires at least basic knowledge of web development. Designing a website that is easy to update, is compatible with different browsers, is optimized for both desktops and mobile devices, and so forth, is troublesome and time-consuming. Pushkin provides a basic website layout that is a sufficient starting point for customization. Researchers who are not familiar with web development can create a website with basic functionality by making minor changes (adding custom images, changing the color scheme and fonts, etc.). More advanced programmers can make major changes to the website layout or create a new website altogether.

How Pushkin works

In this section, we describe many of the technical details of how Pushkin version 1.0 supports the functionality described in the previous section. This will primarily be of interest to skilled web developers and/or individuals interested in contributing to the project. Others may wish to skip to the next section.

Figure 6 outlines the structure of a Pushkin website. Pushkin websites consist of three primary parts. At one end are the webpages and associated content, including all jsPsych code and stimulus files. At the other end are the database, which stores lists of stimuli and subject responses, and the data log, which contains a real-time log of every database query, thus providing real-time backup and version control of the database. In the middle is a collection of workers that process participant responses and determine what to do next. The load balancer, which sits between the webpages and the workers, helps facilitate auto-scaling.

Below, we provide additional detail on how this architecture supports the functionality described in the Features section above. In keeping with our philosophy and approach (see the Philosophy and Approach section), we have made extensive use of existing technology and services wherever feasible, including Node.js, React, Redux, PostgreSQL, Rancher, Docker, Auth0, RabbitMQ, and Amazon Web Services. However—again in keeping with the philosophy and approach—the highly modular architecture permits other developers to mix and match. Moreover, given the quickly changing world of web development, it is highly likely that the Pushkin development team will periodically replace some of these technologies as better alternatives emerge.

Auto-scaling

Pushkin makes use of several services for auto-scaling. Webpages, images, and videos are hosted in Amazon Web Service’s S3 and CloudFront services, which provide rapid, scalable delivery of static content worldwide (Amazon Web Services, 2018).

Processes that require dynamic, server-side computation (such as processing database queries or running machine-assisted experimental design models) are hosted on the Amazon Web Services (AWS) Elastic Compute Cloud (EC2) platform (Amazon Web Services, 2018). An EC2 computer is called an “instance.” The same software can be replicated across multiple EC2 instances, with a load balancer distributing web traffic among them. Thus, if there are three instances, different instances will handle the computations for different subjects. Auto-scaling is accomplished by monitoring usage and dynamically creating or destroying instances as needed. We use Datadog for monitoring usage (Datadog, 2016). For capacity to respond rapidly to demand, creating a new EC2 instance must happen quickly. Although installing Pushkin and its dependencies on a new EC2 instance is automated, it is slow. Thus, ready-to-use copies of the Pushkin software are kept in Docker images that can be rapidly deployed to new EC2 instances by Rancher, a service for managing Docker images.

For database services, we use AWS’s AuroraDB, which allows for both horizontal and vertical scaling.

Another feature that helps make Pushkin websites robust to large traffic spikes is the use of a message queue for passing messages between services (Fig. 6). At its heart, a queue is a text file. Services that have a message to pass write the message in the next line of the queue. Other services “listen” for messages addressed to them, immediately deleting them and acting on the instructions. This allows Pushkin applications to fail gracefully: If messages are written faster than they can be read, the queue grows longer and the site slows down proportionally, until auto-scaling provides more capacity and the listeners catch up. Our message queue manager of choice, RabbitMQ, provides a number of other useful features, including the ability to directly influence auto-scaling (Videla & Williams, 2012).
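The sketch below illustrates this producer/consumer pattern using the amqplib client for RabbitMQ; the queue name and message contents are illustrative and do not reflect Pushkin’s internal message format.

```javascript
// Sketch of queue-based message passing with RabbitMQ via amqplib.
// Queue name and message contents are illustrative only.
const amqp = require('amqplib');

async function publishResult(result) {
  const conn = await amqp.connect('amqp://localhost');
  const channel = await conn.createChannel();
  await channel.assertQueue('trial_results', { durable: true });
  // Persistent messages survive a broker restart, so queued data are not lost.
  channel.sendToQueue('trial_results', Buffer.from(JSON.stringify(result)),
                      { persistent: true });
  await channel.close();
  await conn.close();
}

async function startWorker() {
  const conn = await amqp.connect('amqp://localhost');
  const channel = await conn.createChannel();
  await channel.assertQueue('trial_results', { durable: true });
  channel.consume('trial_results', (msg) => {
    const result = JSON.parse(msg.content.toString());
    // ... write to the database, choose the next stimulus, etc. ...
    channel.ack(msg); // acknowledge only after the work is done
  });
}
```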

We chose these services for their robustness and cost-effectiveness: gameswithwords.org currently costs $300–$400 per month to support nearly 40,000 visitors per month. However, the modularity of Pushkin allows developers, with greater or lesser degrees of effort, to employ other services instead. Similarly, this modularity will make it easier to upgrade Pushkin in the future as new (versions of) services become available.

Authentication and logins

Experiments requiring multiple sessions (longitudinal studies, sleep studies, certain memory paradigms, etc.) necessitate tracking the same individual across multiple visits to the website. For experiments requiring this tracking, researchers can allow subjects to log in. Pushkin uses Auth0, a highly secure and widely trusted service for website logins (Auth0, 2017). Subjects can either create a username and password or—if researchers wish and their institutional review board (IRB) allows—log in using an email account or social media profile. The latter option has the advantage of not requiring the subject to remember a username. Note that this does not give the researcher access to the subject’s private social media and other data (see also the Security section).

Backups

Amazon Web Services’ RDS Multi-AZ deployments provide a real-time backup of the PostgreSQL database used by Pushkin. The primary Multi-AZ database instance is backed up by an identical copy hosted in a different geographical location. If the primary instance or its backup loses data—or if, for instance, there is a power outage—the data is recovered and duplicated from the unaffected copy.

In addition, the data log (another PostgreSQL database identical to the main Pushkin database) maintains a record of all queries performed on the primary Pushkin database. The data log serves as the history of a Pushkin project and can be used to restore the primary database in case of failure. The data log has its own real-time backup maintained by AWS. Therefore, instead of having just one main database, Pushkin maintains four databases (a primary database with a copy in a different availability zone, and a data log with a copy in a different availability zone).

In addition to backing up databases by creating identical copies, AWS provides database backup features designed to recover a particular state of a database—database snapshots and automated backups. Automated backups can be turned on for any AWS database, and Amazon RDS automatically takes a snapshot of the data in every database once a day. In addition, the database owner can choose to create additional database snapshots at any point in time (after, e.g., a major spike in website traffic). All of these backup strategies ensure that it is virtually impossible to lose data with a Pushkin project.

Security

Data security is a concern for any networked device, whether a webserver or a laptop. Pushkin employs a number of security mechanisms, described below. The overarching approach can be summed up as:

  • Default to anonymity rather than confidentiality whenever possible. If even the researchers do not know who the subjects are, security breaches are less problematic.

  • Where possible, silo identifiers from data. For example, the recruitment email list—which contains email addresses—is not connected to data.

  • Anonymize anything that is not anonymous automatically and as early in the data pipeline as possible.

If login/authentication is not enabled, data collection is anonymous. Identifiers such as IP addresses are not collected.

If login/authentication is enabled, data cannot be made fully anonymous. However, there are a number of layers of protection. First, we use the highly secure Auth0 authentication service to handle user IDs. Logins are handled by the Auth0 webservers, not by Pushkin itself. Participant identifiers (e.g., email and username) are stored in the secure Auth0 database. If the participant authenticates using a social media service (Facebook, Twitter, etc.), their social media username is likewise stored in the secure Auth0 database. Note that all that is accessed is the user’s publicly available social media username; private social media data are not accessed or stored.Footnote 3

Crucially, Pushkin applications do not access the participant’s external identifiers (e.g., email address), but rather an alphanumeric “token” representing the participant. Thus, the identifiers are stored in Auth0’s secure database, and participant data are stored in Pushkin’s secure database. For additional protection, Pushkin encrypts the Auth0 tokens as well, providing an additional password-protected firewall. Moreover, the (encrypted) token is stored separately from the data itself; instead, a different alphanumeric identifier is used to identify subjects for purposes of analysis. Finally, data are encrypted when traveling between the subject and the website, between the website and Auth0, and between the website and the researcher.
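As an illustration of what encrypting such a token might look like (a generic sketch, not Pushkin’s actual implementation), the following uses Node’s built-in crypto module with AES-256-GCM; in practice, the key would be loaded from a secure secrets store rather than generated inline.

```javascript
// Illustrative sketch of encrypting an authentication token at rest with
// Node's built-in crypto module (AES-256-GCM). In a real deployment the key
// would come from a secrets manager, not be generated here.
const crypto = require('crypto');

const key = crypto.randomBytes(32); // placeholder 256-bit key

function encryptToken(token) {
  const iv = crypto.randomBytes(12);
  const cipher = crypto.createCipheriv('aes-256-gcm', key, iv);
  const encrypted = Buffer.concat([cipher.update(token, 'utf8'), cipher.final()]);
  return {
    iv: iv.toString('hex'),
    data: encrypted.toString('hex'),
    tag: cipher.getAuthTag().toString('hex')
  };
}

function decryptToken(payload) {
  const decipher = crypto.createDecipheriv('aes-256-gcm', key,
                                           Buffer.from(payload.iv, 'hex'));
  decipher.setAuthTag(Buffer.from(payload.tag, 'hex'));
  return Buffer.concat([decipher.update(Buffer.from(payload.data, 'hex')),
                        decipher.final()]).toString('utf8');
}
```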

Thus, although it is possible to deanonymize data, this requires considerable effort and several passwords. Note that these robust security procedures do not mean that Pushkin is unhackable. No security system is unbreakable. Even if the software itself cannot be hacked, humans present a point of weakness (e.g., researchers or participants using easily guessed passwords). Moreover, it is sometimes possible to identify a subject from their data alone, if the questions asked are sufficiently specific (e.g., there may be only one female rabbi in a specific small town; Narayanan & Shmatikov, 2008). However, these considerations apply equally to data collected in the lab, and we encourage researchers to use common sense and robust security procedures for all data. For extremely sensitive data, researchers may be advised to take even more robust security measures than what Pushkin provides out of the box.

Chron

The Chron worker is Pushkin’s bonus feature. Although it is not an essential component of a Pushkin study, it makes it possible to periodically analyze data and send reports. Since the Chron worker is language-agnostic, it can run scripts written in the researcher’s language of choice (Python, JavaScript, WebPPL, R, etc.). The Chron worker can also be used to periodically remove data from subjects who did not complete a study or did not pass screening questions set up by the experimenters. These are only some of its potential uses. In large-scale citizen science projects, it can be used to perform tasks such as alerting the researchers when sufficient data have been collected for a set of stimuli, or when other milestones in data collection have been reached. The Chron worker thus eliminates the need to monitor data collection by hand, freeing up the researcher’s time and resources.
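As a hypothetical example of the kind of scheduled job a Chron worker might run, the sketch below uses the node-cron package to launch a nightly analysis script; the schedule, script path, and notification step are illustrative rather than part of Pushkin’s distribution.

```javascript
// Hypothetical scheduled job of the kind a Chron worker runs:
// check progress nightly and report back to the researchers.
const cron = require('node-cron');
const { exec } = require('child_process');

// Every night at 02:00, run an analysis script written in the researcher's
// language of choice (here an R script; the path is a placeholder).
cron.schedule('0 2 * * *', () => {
  exec('Rscript analysis/daily_report.R', (err, stdout) => {
    if (err) {
      console.error('Nightly report failed:', err);
      return;
    }
    console.log('Nightly report complete:\n' + stdout);
    // ... email the report or flag a data-collection milestone ...
  });
});
```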

Using Pushkin

The Pushkin source code is available through GitHub (github.com/pushkin-consortium/pushkin). The source code provides a stub of what is needed for a website similar to gameswithwords.org that hosts multiple massive online experiments and citizen science projects. Users familiar with ReactJS can edit the structure of the website (which pages are available, etc.) as desired.

Currently, to use Pushkin, users download the source code for Pushkin and the source code for jsPsych and (if desired) WebPPL. Users will also need to configure a web server from Amazon Web Services (or, if desired, an appropriate alternative). Users are urged to consult the documentation for the most up-to-date instructions (pushkin-only.readthedocs.io).

Modern website design involves a fairly unintuitive file structure. For instance, the code for a single experiment must be distributed across a variety of folders, with parts of different experiments ending up in the same folder. Likewise, many of the best practices that result in efficient, fast websites also result in code that is very hard to read. Thus, for purposes of development, we use a more intuitive file structure. When the user is ready to test or deploy the website, the code is reorganized into a web-appropriate format using webpack (Hlushko et al., 2018) (again, consult the documentation). Importantly, users can work exclusively with the “user-friendly” files and never have to inspect or modify the web-ready version.

Individual experiments can be written in jsPsych (advanced users may choose to use an alternative, but this may require significant extra work). For the most part, development is the same as it would be for any other jsPsych experiment (see documentation at www.jspsych.org). For most projects, the primary difference is in how the results are saved, since Pushkin handles interaction with the database (where data are stored). Setting this up is largely automated (see documentation), and it only requires significant customization if the user needs to do a lot of preprocessing of the data before it is stored (for instance, for the purposes of machine-assisted experimental design).

Similarly, there is little extra that the user must do for a standard experiment in which every subject sees all items; in such simple cases, jsPsych handles the experiment flow on its own. Users who wish to create highly contingent experiments will need to edit the worker for that experiment (see Fig. 6).

As this summary should make clear, although Pushkin provides a powerful template for creating internet-scale projects, using Pushkin currently requires a fair amount of technical expertise, particularly for more complex projects. Our current development priority is making Pushkin more accessible to a wider audience (see the next section).

Future development

Tools to improve ease of use

We are in the process of rolling out command line tools that will greatly simplify this process. (This is one of the reasons to consult the documentation for the latest instructions.) In particular, we are using the popular package manager npm to download, install, update, and manage Pushkin, jsPsych, WebPPL, and their dependencies. Package managers greatly simplify the use of Unix programs and are the gold standard for open-source projects. Using the package manager for Pushkin is currently done through the command line, but in the future it will be available through a graphical user interface (GUI). When complete, the Unix command “npm install pushkin” will download the latest version of Pushkin along with its dependencies. Similar Unix commands will download various add-ons, such as additional jsPsych plugins or experiment templates. Importantly, anyone can create add-ons and distribute them through the package manager. Similarly, we are working on command line tools that will simplify webserver configuration. Finally, these tools will also support updating to newer versions of Pushkin. As of this writing, we expect to complete these upgrades by the end of 2018.

More ambitiously, we intend to integrate Pushkin development into the jsPsych GUI, which is currently in beta. The jsPsych GUI is a web application that allows users to build experiments by creating and organizing a series of trials with a point-and-click interface. The researcher can customize the parameters of each trial through simple menus that require no programming experience. Images, audio, and video can be uploaded for inclusion in the experiment. As users build an experiment, they are shown an immediate live preview, providing critical feedback for novice developers. When the experiment is finished, the GUI automatically assembles jsPsych code, which can then be exported and used.

We are currently extending the GUI to help with deployment of entire Pushkin websites, including facilitating customized subject feedback, social media integration, and machine-assisted experimental design. Note that while the GUI would make Pushkin accessible to researchers who lack programming experience, we intend it to offer advantages even to proficient programmers. To minimize the loss of flexibility that comes from using the GUI, we will extend the GUI to allow editing of the underlying code for each component at all steps of the process and link these changes to the visual state of the GUI. For example, if a researcher has code to parametrically generate stimuli, they will be able to insert this code via a script editor in the GUI and use it when declaring parameters for trials using the visual interface. The GUI will be able to run the code immediately and incorporate it into the live preview.

Complementary to the GUI, we are building a library of experiment templates for common paradigms. These greatly simplify experiment development. For instance, running a self-paced reading experiment may require no more than providing a list of sentences to be included and setting a few parameters in a config file (e.g., how many trials per subject).
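To give a sense of what such a configuration might look like, here is a hypothetical sketch; the field names are illustrative and are not the actual Pushkin template API.

```javascript
// Hypothetical configuration for a self-paced reading template.
// Field names are illustrative, not Pushkin's actual template API.
module.exports = {
  experimentName: 'self-paced-reading-demo',
  trialsPerSubject: 20,              // how many sentences each subject reads
  presentation: 'moving-window',     // word-by-word moving-window display
  comprehensionQuestionRate: 0.25,   // fraction of trials with a question
  stimuli: [
    { sentence: 'The horse raced past the barn fell.', condition: 'garden-path' },
    { sentence: 'The horse that was raced past the barn fell.', condition: 'control' }
    // ... more sentences ...
  ]
};
```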

We intend to make the experiment template library available through the GUI. A researcher will be able to browse through these examples and select one that closely resembles their desired experiment. This will create a copy of the experiment for the researcher to edit as they see fit. The availability of these prototypes would aid both novices and experienced developers. Researchers who create experiments will have the option to publish them to the package manager, thus not only supporting other researchers but also improving the reproducibility of their own work.

Finally, we are working on providing greater support for using Pushkin. Lack of institutional knowledge appears to be a major impediment to the wider use of internet-scale studies: researchers are simply not familiar with the design constraints and opportunities. One of the missions of the broader Pushkin project is to address that gap through workshops and publications (cf. Hartshorne & Jennings, 2017; Hartshorne, Leeuw, Germine, Reinecke, & Jennings, 2018a). The present article is an example of these activities. We intend to conduct a number of workshops over the next few years and publish a free electronic textbook (for similar examples, see Goodman & Stuhlmüller, 2014; Goodman & Tenenbaum, 2014).

Conclusion

Pushkin provides a suite of tools for conducting massive online experiments and citizen science projects for psychology and the cognitive sciences. It addresses both the design challenges of internet-scale research (recruiting subjects, running longitudinal studies, machine-assisted experimental design, etc.) and the technical challenges (webserver setup and configuration, data security, real-time backups and version control, auto-scaling, etc.). To achieve these ends, Pushkin draws on a wide range of software and hardware technologies. Thus, in addition to being a software framework, Pushkin can be thought of as a collection of best practices.

Other frameworks can provide aspects of this functionality. Most obviously, many of the same experiment designs can be implemented in jsPsych (though it does not by itself support machine-assisted experimental design or longitudinal studies), and some of our recruitment mechanisms are implemented as jsPsych plugins. However, jsPsych is purely experiment software that is meant to be embedded in a larger website. It does not handle data storage and security, backups and version control, auto-scaling, or any of the other parts of running a highly trafficked website. Other platforms, such as Google Forms or SurveyMonkey, provide the website but are very limited in their experiment functionality. LabVanced supports a wide range of experiments, but it is not open-source or customizable, provides little support for subject recruitment, and does not permit machine-assisted experimental design. Zooniverse (Simpson et al., 2014) provides a powerful platform for citizen science projects that involve classification and annotation of images, but does not support linguistic annotation or the collection of psychological data. Thus, although existing platforms provide excellent support for certain paradigms (and indeed, we use many of them), only Pushkin supports a wide range of internet-scale studies.

However, we must acknowledge that Pushkin supports only a subset of the internet-scale studies that are currently possible or will be in the near future. For instance, it does not currently support the sophisticated use of mobile devices (Miller, 2012; Stieger, Lewetz, & Reips, 2017), wearable sensors, or virtual reality. Pushkin focuses on web applications, which are popular among older children and adults, rather than mobile apps, which are more appropriate for testing young children. Importantly, the fact that Pushkin is free and open-source, as well as modular and extensible, means that it should provide an important foundation as the community explores these exciting possibilities.

Author note

Work on Pushkin is supported by NSF Grant 1551834 to J.K.H. The authors thank Katharina Reinecke, Laura Germine, Roman Feiman, and two anonymous reviewers for comments and suggestions; Miguel Mejia for graphic design; the attendees at the First Annual Pushkin Developers Conference; and Oddball.io.