Objectively measuring inter- and intra-individual differences in human behavior is a fundamental core mission in psychology as it provides the solid fundament on which the entirety of research endeavors in psychology and related fields depend upon (Jenkins and Lykken 1957). The availability of accurate, reliable, and comprehensive phenotypic measures is not only essential for psychological hypothesis testing per se, but is also crucial for the successful elucidation of biological underpinnings of neurocognitive traits that are amenable to for instance imaging or genetic studies (Congdon et al. 2010). While computers have already been used to assist in test evaluations for more than half a century (Kleinmuntz 1963), advances in computer technology now allow for the development of completely digitalized assessment strategies with automated scoring and evaluation procedures (Luciana 2003). Automatization of psychometric assessment is a highly valuable approach for meeting for example the demands that are put forward by the recent revolutions in biotechnology: while high-throughput cost- and time-efficient individual whole genome scans in large cohorts have become a matter of course, phenotypic assessments typically still rely on laborious testing batteries, often requiring trained administrators and stationary attendance time of study participants.

We argue that bringing down the effort for both researchers and testees involved in collecting repeated phenotypic measurements of healthy large cohorts is feasible through online-based test administration. Yet, this requires a substantial redesign and redevelopment of psychometric assessment procedures and instruments. Conceptualizing the novel strategies for large-scale assessments should be led by the idea that participant compensation is essential and constituted by existing ethical guidelines yet does not necessarily need to be monetary. Entertainment that can be achieved through gamification and task design is not only a highly valued benefit itself, it is also key to nurse the participant’s motivation required for repeated measurements (Lumsden et al. 2016). Designing the data collection process as a rewarding experience itself is a valuable strategy, as previous studies have found that the demanding nature of data entry is one of the primary reasons respondents stopped using health apps (Krebs and Duncan 2015). Additionally, automation of data collection and evaluation can be used to provide test persons with graphically illustrated feedback on their own performance as this also serves as an incentive for repeated and continuous participation. Finally, computer game-based tests and experiments provide scientists with a novel technique to test ecological validity of laboratory-based procedures, which is always assumed, but rarely tested (Krakauer et al. 2017).

A large online-based research platform that collects sensitive personal data requires continuous attention and efforts to ensure the best possible standard of security for safeguarding participants’ personal data from security gaps and potential misuse. It is a question of respect towards the study participants to view and treat the gathered data as a good that the scientist is only entrusted with for conducting research, but that ultimately still belongs to the testees. The fact that the data might be used in currently undefined future research projects or may yield to potential monetization of research outcomes calls for more control options through participants during the data life cycle than a single “open ended” consent form (Lipworth et al. 2017). Yet, despite these concerns, using a single platform framework to simultaneously obtain a wide variety of different psychometric data comes with a set of very appealing options: based on the concepts of “open-science,” “open-data,” and collaboration, we outline our prototype for automatized and smart phenotypic data acquisition, which holds the potential for reshaping standard procedures in psychological research practice and for facilitating productivity and study outcome reliability. Specifically, we plan to implement a pre-registration system that grants publicly funded scientists’ script-based access to the collected data through the COSMOS platform. Scientists can develop their scripts on a dummy database system that mimics the database system of the COSMOS backend. Relying on script-based analyses, which will be run in a secure environment and only return the result of the analysis, is a safety precaution which eliminates the need to grant access to raw data. Only revealing combined and summarized data still allows making highly flexible and efficient use of the existing data pool, while maximizing the security of the dataset against identifying individual test participants.

Conveniently, the ongoing automatic data acquisition continuously generates novel samples that can be used for effortlessly replicating the obtained findings as soon as a large enough additional batch of data has been collected. Additionally, the comparably low maintenance and personnel costs of data gathering can contribute to alleviate the time-consuming competition over limited funding resources. At the same time, the centralization of longitudinal data gathering enables a higher phenotypic resolution per individual than single studies could achieve. The large N high-resolution data allows building models of higher complexity that are better suited to account for confounding factors, which typically would be out of scope for small N single-hypothesis testing study designs. Depending on the respective research question and the hypothesis tested, the available detailed assessment of a large number of individuals allows the application of sampling strategies that either are currently not taken into consideration at all or are only feasible at large expenses of cost and time: preselecting subgroups as homogenous as possible, closely matching experimental groups on potentially confounding factors, evaluating whether a detected correlation can be found in a set of different subgroups, whether it is largely stable or may be even reversed along the continuum of the normal distribution of a given trait.

Finally, platforms like COSMOS can facilitate settling the question, whether so-called brain training (i.e., repeatedly engaging in cognitively demanding tasks) can actually have generalizing beneficial effects: based on a large N, without taking money from the participants and thus without the inherent conflict of interest the brain training industry-affiliated scientists are faced with.

Game Tests

The COSMOS platform ( is now in its pilot phase, hosting five prototypes of games that currently undergo refinement and calibration as psychometric testing instruments, which are described in more detail below. Table 1 gives an overview of all developed instruments together with the phenotypic constructs they have been designed to measure.

Table 1 Overview of all games that have been developed for the COSMOS platform to date. The column “Features” shortly describes game setup and content. Also listed are approximate durations to complete one testing unit (depending on the game type, the test unit refers to a single level or to a complete game). “Technological specifications” lists the IT technologies/programming languages that were used to create the game and its backend


The HoNk-Back task is a gamified redesign of one of the most widely used working memory tasks in neuroscience, the N-Back paradigm (Owen et al. 2005). This gamification of the task goes beyond simply adding game-like reinforcement mechanics such as a score or a progress bar. We put special attention on developing a setting that lets the actual task of monitoring a sequence of stimuli appear as natural and plausible as possible, aiming at increasing ecological validity. The task setting makes the test subject to assume the role of a truck driver who gets overtaken by a constant stream of cars. Cars appearing in the review mirror trigger the required response signal by the truck driver which consists of either flashing the headlights at cars that also gave a light signal or waving at the cars that overtook the truck without emitting a headlight signal. Tilting of the rearview mirror controls the N condition as this allows regulating the number of cars disappearing into the blind spot.

Drag Race

This test in form of a drag race game is designed to measure reaction times to unpredictable and predictable cues and variation in response time accuracy. A light signal sequence of two yellow lights indicates that the driver has to get ready. The green light that indicates the take-off signal then is given after a variable random time interval allowing the measurement of spontaneous reaction time (SRT). The process of shifting gears requires a defined motor response pattern: releasing the accelerator button (spacebar), hitting the gear-shifting button (return) and releasing it again, and pushing the accelerator button again. The gear-shifting procedure is used to record the response times to predictable signals: the revmeter continuously moves towards the optimal switching moment, when the response pattern has to be executed. This allows measuring several reaction times of simple motor responses in the form of foreseeable reaction times (FRT). Evaluating repeated runs allows assessing variation in response time accuracy. We are aware of software and/or hardware-related issues concerning reaction time measurements such as monitor response time, operating system design, and input device-related delay such as key debouncing time (Garaizar et al. 2014; Salmon et al. 2017) that impact the accuracy of the response time measurements. Nevertheless, the provided test should yield rough estimates of individual response times and allow group comparisons under the assumption of equally distributed noise. Also the argument has been brought forward that the error introduced by response devices is bound to be small relative to human variability and will only exert potential effects in experiments that lack statistical power in the first place (Damian 2010). Given that the game will be made freely available as a standalone application, it can serve as test instrument in a controlled lab-based environment with identical hard- and software setups allowing unbiased inter-individual comparisons.

Frog Life

Frog Life is a combinatorial task with increasing difficulty levels consisting of a go/no-go paradigm to measure sustained attention and additionally assesses visual vigilance. The task setting lets participants control a virtual frog in a pond that feeds on dragonflies (go-condition) while avoiding devouring hornets (no-go condition). Simultaneously, the testee needs to escape predators, which are announced through changes in coloring of three different display details, namely the color of the water in the pond, the clouds, or a depicted bush (Fig. 1). Insects only become catchable after they entered the proximity range outlined by a spherical contour around the frog. Snatching of the insects is achieved by pressing the corresponding left or right cursor buttons of the keyboard depending on which side of the screen the insects emerged from. Correct responses of the go-task (eating dragonflies) are rewarded with increasing of the score, while incorrect responses to the no-go condition (eating hornets) decreases the score. Color changes of one of the three display details announce an upcoming predator and require the player to trigger an escape jump by pressing the spacebar. Faster reaction times to the color changes are rewarded with more points Yet, pressing the spacebar while no actual color change is taking place causes the player to lose one of three health points indicated by hearts. If all health points are lost, the player character is granted “game-over.” After every successful escape, the game mechanics difficulty level is increased. In case the player fails to detect a color change of the display details, appearance of a predator terminates the game. The color change thus constitutes an additional go/no-go task based on signal detection.

Fig. 1
figure 1

Scenery examples from the game “Froglife.” Dragonflies and hornets constitute a go/no-go paradigm. Eating dragonflies increases the “Munch Score,” eating hornets decreases it: upon entering the white circle that is surrounding the frog, the player can make the frog eat the insects using the left and right cursor button, depending on the side from which the insects are entering the white circle. a The dragonfly entered from the right side is captured by pressing the right arrow cursor button. b Color changes of the pond, the bush on the right side or the clouds indicate an approaching predator requiring the player to escape the current scenery by jumping to a different pond using the spacebar. In the depicted scene, the hue of the pond is changed


This game is designed as a two-tier short-term memory performance assessment consisting of an episodic picture recognition task and a sequence-learning test. At the beginning of every game round, testees need to memorize a set of picture stimuli (3 to 10 items). The test participants can choose between different categories, such as “food,” “animals,” and “sport,” and will be presented a set of pictures to memorize for the current game round. Additionally (apart from the “easy” condition featuring only three pictorial items to be remembered), a sequence of differently colored and shaped symbols is presented at the beginning of the game round. The accurate encoding of sequential information is a key cognitive element in human cognition, setting humans apart from other species, but also shows large variation in performance within species (Ghirlanda et al. 2017). During the actual game, the player controls a panda bear climbing rock walls that gets rewarded for correctly solving recognition tasks: at given intervals, the player is presented with a selection of pictures and required to identify the picture shown in the beginning. If she clicks on the correct picture, a bird lifts the panda bear to a higher position in the climbing wall thus rewarding the player with “shortcuts” in the climbing route. Also, at predefined intervals, buttons appear on the screen that require the player to reproduce the symbol sequence that was shown at the beginning of the level. It must be entered correctly in order for the climbing to continue. If the sequence is not entered correctly by the player, the entire sequence will be displayed, so that the player can continue. The player is awarded with points for correct recognition of the pictures, reproducing the symbol sequence correctly and for speed.


This task is primarily designed to measure a subtype of theory of mind (ToM) by employing entertaining stimulus material. At the beginning of the game, the player is asked to rate the jocularity of 10 items consisting of cartoons, memes, and written jokes on a scale from 0 to 10. Additionally, the participants also rate how strongly they agree with 18 statements touching topics such as politics, religion, society, sports, education, and personality. This initial phase has to be completed only once. The actual game then consists in guessing as how amusing a given item has been rated by another person whose ratings the player is randomly assigned to. Whenever an item appears from the joke or the statement pool that has not yet been answered by the player himself, he is asked to make his own rating prior to estimating/learning the estimation of his/her counterpart. This way, the pool of rated jokes and statements for all individuals is constantly increased. The goal of the game is to estimate as accurately as possible how entertaining a given stimuli was perceived by the other person. Apart from the demographic info on the other player that is always provided (gender, age, education), the participant can unlock further information on how the statements were rated using an in-game-generated currency (JokeCoins). The accuracy of the estimation process is rewarded with points and JokeCoins.

COSMOS Environment

Individual Data Visualization

All gamified testing instruments developed in the scope of the COSMOS project feature an application-specific relational SQL database that records the user’s input. This makes it easy to set up any application as a standalone implementation and to integrate the applications into a specific laboratory test setting, for example, as a subtest in a given brain mapping experiment. In the scope of the COSMOS web platform, all user data is assigned to unique identifier codes (UIC) and thus to a certain person, by means of a separate central authentication system implemented in the secure software framework used to host the website. All single SQL databases are linked via experience APIs to a Learning Record System employing a mongoDB that serves the purpose of graphically representing the obtained data. We have developed a data visualization application that allows platform administrators fast and easy creation and configuration of interactive plots. COSMOS participants can choose from a variety of preconfigured plots to learn about their individual performance over time, compare their scores to all participants or to specific subgroups only, e.g., a given age range or gender (examples of plots are depicted in Fig. 2). Visualization of achieved high scores and selected performance measures like for instance average reaction times also allows COSMOS participants to monitor their performance over the course of the day to identify peak performance time periods when they usually achieve the best concentration and attention levels.

Fig. 2
figure 2

Example of a typical visualization of test results generated by the mongoDB-based visualization feature of the COSMOS web platform. The generated graphical representations are partially configurable and allow the user to customize which data is displayed. Any data fed into the mongoDB can be visualized in either bar charts, pie charts, or progress charts. Database schema of the game Frog Life

Automated Data Processing Pipeline

The independent SQL databases that all games running on the COSMOS platform are equipped with facilitate a streamlined and automated data analysis pipeline. While there may be specific deviations for single games, the general rule is that data will be marked as an unfinished run or simply not stored in the database, if the level was aborted due to player inactivity, closing of the browser, or loss of internet connection. All user responses and summary statistics generated by the games are recorded and stored along with a timestamp and linked to a specific UIC in the games’ databases. The UIC is generated when an account is registered and thus pertains to specific login credentials. This procedure allows data to be uniquely assigned to a specific person and therefore enables data collection over multiple trials, time-points, levels, and different tasks. Since the exact timestamp of each reaction is always stored in the database, it is easy to calculate for example the average reaction time per game round: large intervals between stimulus presentation and the reaction of the player or a large variance in task performance indicators can be used to detect a lack of concentration or distraction and thus can be used to create QC filters. Of course, those statistical filters themselves can be evaluated if participants are asked to rate for example the attention or level of concentration they were exhibiting during gameplay after a level is completed.

The use of standard SQL databases allows accessing the data with all common statistical analysis tools/languages like R, python, matlab, octave (Eaton et al. 2014; MATLAB Optimization toolbox 2017; R Core Team 2018), etc. This allows the creation of standard query scripts that are customizable to retrieve the data best suited to answer a given research question: e.g., retrieve all data for game x, y, and z for all individuals meeting a given age range, gender, or educational level that finished at least 10 trials per game within a specified time period. The exact procedure of reading, processing, summarizing, and blending data may of course depend on the specifics of the research question to be answered. Figure 3 depicts a description of the SQL database schema for the game Frog Life. This description together with the information on the different response types (as shown in Table 2) helps understanding how simple database queries can be used to sum up different correct answers and/or errors depending on the difficulty level of the task in order to serve as a data basis for modeling.

Fig. 3
figure 3

The SQL database schema for Frog Life. The games-table records all the games (numbered incrementally starting from 1) a given user (user_id defined by the UIC) has played, along with the scores s/he achieved and the timestamp the game was started (creation_time) and finished (modification_time). The finished field contains the info whether the game was normally finished or prematurely terminated. All lines between the tables are dotted, since the UIC serves as foreign key for all other tables. The rounds-table contains information about every single round played as indicated by the “one to one or many” relationship (since many rounds per game are possible). A round starts either directly at the beginning of the game or after the player escaped an upcoming predator and jumped to a new scenery. After every round, the difficulty level is increased (and the current difficulty level gets stored in the difficulty field), i.e., the speed of the insects accelerates, the color change time decreases, and the hue intensity change gets less pronounced. The action table stores all the actions that are exhibited by the player during a given round. The action_types-table comprises all the possible response types a player can display (see Table 1 for action type definitions). The levels-table holds the information about the background sceneries, which is recorded in the rounds-table (level_id). Currently, three different sceneries are available

Table 2 Description of possible action types a user can display during playing Frog Life. Only if a given response as defined by an action type is exhibited, one of the described database entries in the Description column is triggered. Thus, e.g., if no entry for action type HORNET_EATEN exists in a given round table, the player did not make this type of mistake during that round. Type describes the psychometric characteristics attached to the potential user responses

Modeling Phenotypes

In order to understand and analyze complex behaviors, a promising approach has been computational models (Corrado and Doya 2007; Luksys and Sandi 2011; Mars et al. 2012; Nassar and Frank 2016). Most widely popularized in the field of reinforcement learning (Tanaka et al. 2004; Daw et al. 2006; Behrens et al. 2007; Frank et al. 2007; Luksys et al. 2009), they have also been applied to study working (Collins and Frank 2012; Collins et al. 2014) and episodic (Luksys et al. 2014, 2015) memory as well as decision-making (Forstmann et al. 2008), including strategic reasoning (Zhu et al. 2012; Seo et al. 2014). The main principle is that a computational model is fitted to experimental data (based on how well model-produced behaviors match experimentally observed ones), and then the best-fitting model parameters and/or variables are used as correlates for neurobiological data such as neuron recordings (Samejima et al. 2005), fMRI activations (Tanaka et al. 2016; Daw et al. 2006, Behrens et al. 2007), genetic differences (Frank et al. 2007; Set et al. 2014; Luksys et al. 2014, 2015), levels of stress (Luksys et al. 2009), and neuropsychiatric disorders (Collins et al. 2014). The main advantage of model-based analysis is that it can test neurocomputational mechanisms of behavior, which different candidate models aim to represent, and reduce a variety of behavioral measures, which can strongly depend on the specific task, to fewer model parameters that are directly comparable between the tasks. For example, reinforcement learning can model behavior in a number of tasks where rewards or punishments of some kind (sometimes implicit) are involved, and despite different formalizations, most of these models have common parameters such as the learning rate, exploration-exploitation tradeoff, and future discounting (Tanaka et al. 2004; Frank et al. 2007; Schweighofer et al. 2008; Luksys et al. 2009). Due to unusual richness of the acquired data, gamification provides a special opportunity to convincingly show usefulness of computational models compared to traditional analyses of behavior, and most importantly link platform-derived behaviors to laboratory-based tasks, which can be analyzed using more simple models that share parameters with more complex models of games. Where explicit modeling of games is not practical (e.g., due to their complexity), the recorded patterns of game-derived data could be linked to laboratory-based behaviors (or their model parameters) using machine learning tools. Finally, a game-based psychometric assessment platform such as COSMOS provides a unique chance to test and compare different candidate models using a much wider variety of tasks and populations than used in most model-based analysis studies, where usually a narrow range of models are tested against each other based on one or few tasks selected by authors (which may benefit their favorite models compared to alternatives).


The pervasive problem of low-powered studies in the behavioral and social sciences leading to non-replicable and spurious findings has already been identified more than 60 years ago. Yet today, it does not only still persist, but is even being exacerbated by system design faults (Ioannidis 2015; Smaldino and McElreath 2016; Szucs and Ioannidis 2017): using the amount of published original research as a quality criterion for awarding funding or tenured positions incentivizes increasing the number of publications. This creates a conflict of interest with the researcher’s intrinsic goal to maximize study outcome reliability. In addition, the novelty of findings based on a small number of observations is often valued higher than replication in large cohorts (Higginson and Munafo 2016; Nosek et al. 2015; Vinkers et al. 2015). In combination with short-term contracts for the junior scientific staff (Kreeger 2004; Langenberg 2001) that render planning and implementing of larger-scale projects almost impossible as they require substantially more time than single-hypothesis small N studies, the scientific community has formed an optimal hotbed for keeping the well-known problems alive and prospering.

The ongoing replication crisis of psychological research presses us to figure out how the above-mentioned systemic shortcomings can be overcome. Luckily, computerization and automatization in combination with interdisciplinary cooperation and an open-data philosophy offer a solution to a very basic but crucial problem: in our eyes, sample size is the elephant in the room for improvement of psychological research that needs to be addressed promptly. Towards this end, we have initiated the COSMOS platform: we are striving to facilitate recruitment of study participants through automatization, i.e., creating experimental setups that no longer require staff to implement them and to observe and record behavior. Although our platform is still in its pilot phase, we argue that digitally oriented research endeavors like COSMOS will eventually serve the scientific community in several ways. Online screening platforms can be used to either carefully preselect individuals or to simply increase sample size without skyrocketing costs. Being able to substantially increase the number of study participants is arguably a compelling strategy to counteract the overestimation of effect sizes and the non-replicability of study findings.

For the scaling of psychometric assessment, especially for the online-based test setting, our overall philosophy is that the testing instruments need to be as fun and absorbing for the participants as possible to increase the intrinsic motivation to engage. At the same time, tests should require minimal effort with regard to manual data entry in order to prevent significant issues with subject adherence. Finally, the novel assessment tools should provide investigators in the psychological and biomedical sciences with research-grade cognitive and psychological metrics. Technological advances along with a strongly grown computer literacy in the general population and widespread familiarity with computer games (Granello and Wheaton 2004; Palaus et al. 2017) open up a plethora of possibilities for the operationalization of psychological research questions. Leveraging gamification to repeatedly obtain behavioral samples paves the way for a next-generation high-throughput psychometric toolset. Hence, the COSMOS platform is conceptualized to collect a vast array of psychometric and cognitive data from a large pool of study participants in a highly automated and thus very cost- and time-efficient way.

It is obvious that the goal of gathering in-depth phenotypic data by employing web-based administration of psychometric tests in the guise of entertaining serious games chaperoned by individual automatic performance feedback requires a highly interdisciplinary skill set: social, computer, and data scientists need to work closely together to design, develop, refine, and validate the tools and put them to work. Yet, this aggregate competence is often readily available in university settings and easily accessible through close collaborations between disciplines.

The possibilities of the outlined web platform go way beyond the scope of only gathering data, if additional opportunities offered by the digital era are harnessed: it could also provide a framework to present, discuss, and continuously update scientific findings. We think that eventually such approaches will not only help online participants better understand their own behavior and detect patterns that may be early signs of neuropsychiatric disorders; they could also open up venues for the development of efficient, individualized, and most importantly scientifically sound methods of cognitive enhancement.