Measuring self-regulation in everyday life: Reliability and validity of smartphone-based experiments in alcohol use disorder

Zech, Hilmar; Waltmann, Maria; Lee, Ying; Reichert, Markus; Bedder, Rachel L.; Rutledge, Robb B.; Deeken, Friederike; Wenzel, Julia; Wedemeyer, Friederike; Aguilera, Alvaro; Aslan, Acelya; Bach, Patrick; Bahr, Nadja S.; Ebrahimi, Claudia; Fischbach, Pascale C.; Ganz, Marvin; Garbusow, Maria; Großkopf, Charlotte M.; Heigert, Marie; Hentschel, Angela; Belanger, Matthew; Karl, Damian; Pelz, Patricia; Pinger, Mathieu; Riemerschmid, Carlotta; Rosenthal, Annika; Steffen, Johannes; Strehle, Jens; Weiss, Franziska; Wieder, Gesine; Wieland, Alfred; Zaiser, Judith; Zimmermann, Sina; Liu, Shuyan; Goschke, Thomas; Walter, Henrik; Tost, Heike; Lenz, Bernd; Andoh, Jamila; Ebner-Priemer, Ulrich; Rapp, Michael A.; Heinz, Andreas; Dolan, Ray; Smolka, Michael N.; Deserno, Lorenz

doi:10.3758/s13428-022-02019-8

Measuring self-regulation in everyday life: Reliability and validity of smartphone-based experiments in alcohol use disorder

Open access
Published: 12 December 2022

Volume 55, pages 4329–4342, (2023)
Cite this article

Download PDF

You have full access to this open access article

Behavior Research Methods Aims and scope Submit manuscript

Measuring self-regulation in everyday life: Reliability and validity of smartphone-based experiments in alcohol use disorder

Download PDF

Hilmar Zech ORCID: orcid.org/0000-0002-5141-7952^1,2,
Maria Waltmann^2,3,
Ying Lee^1,4,5,
Markus Reichert^6,7,8,
Rachel L. Bedder^4,5,9,
Robb B. Rutledge^4,5,10,
Friederike Deeken¹¹,
Julia Wenzel¹²,
Friederike Wedemeyer¹²,
Alvaro Aguilera¹³,
Acelya Aslan¹⁴,
Patrick Bach¹⁴,
Nadja S. Bahr¹²,
Claudia Ebrahimi¹²,
Pascale C. Fischbach¹,
Marvin Ganz¹⁴,
Maria Garbusow¹²,
Charlotte M. Großkopf¹,
Marie Heigert¹²,
Angela Hentschel¹,
Matthew Belanger¹,
Damian Karl¹⁴,
Patricia Pelz¹²,
Mathieu Pinger¹⁵,
Carlotta Riemerschmid¹²,
Annika Rosenthal¹²,
Johannes Steffen¹,
Jens Strehle¹³,
Franziska Weiss¹⁵,
Gesine Wieder¹³,
Alfred Wieland¹⁴,
Judith Zaiser¹⁴,
Sina Zimmermann¹⁴,
Shuyan Liu¹²,
Thomas Goschke¹⁶,
Henrik Walter¹²,
Heike Tost⁸,
Bernd Lenz¹⁴,
Jamila Andoh⁸,
Ulrich Ebner-Priemer^7,8,
Michael A. Rapp¹¹,
Andreas Heinz¹²,
Ray Dolan^4,5,17,18,
Michael N. Smolka¹,
Lorenz Deserno^1,2,3 &
the ReCoDe-Consortium

3521 Accesses
7 Citations
10 Altmetric
Explore all metrics

Abstract

Self-regulation, the ability to guide behavior according to one’s goals, plays an integral role in understanding loss of control over unwanted behaviors, for example in alcohol use disorder (AUD). Yet, experimental tasks that measure processes underlying self-regulation are not easy to deploy in contexts where such behaviors usually occur, namely outside the laboratory, and in clinical populations such as people with AUD. Moreover, lab-based tasks have been criticized for poor test–retest reliability and lack of construct validity. Smartphones can be used to deploy tasks in the field, but often require shorter versions of tasks, which may further decrease reliability. Here, we show that combining smartphone-based tasks with joint hierarchical modeling of longitudinal data can overcome at least some of these shortcomings. We test four short smartphone-based tasks outside the laboratory in a large sample (N = 488) of participants with AUD. Although task measures indeed have low reliability when data are analyzed traditionally by modeling each session separately, joint modeling of longitudinal data increases reliability to good and oftentimes excellent levels. We next test the measures’ construct validity and show that extracted latent factors are indeed in line with theoretical accounts of cognitive control and decision-making. Finally, we demonstrate that a resulting cognitive control factor relates to a real-life measure of drinking behavior and yields stronger correlations than single measures based on traditional analyses. Our findings demonstrate how short, smartphone-based task measures, when analyzed with joint hierarchical modeling and latent factor analysis, can overcome frequently reported shortcomings of experimental tasks.

Sorry—Bad Habit! Validation of the German Self-Report Habit Index with a Test for Its Relation to Potentially Addictive Forms of Health-Risk Behaviors

Article Open access 03 May 2023

Do daily fluctuations in inhibitory control predict alcohol consumption? An ecological momentary assessment study

Article Open access 01 March 2018

Self-help Smartphone Applications for Alcohol Use, PTSD, Anxiety, and Depression: Addressing the New Research-Practice Gap

Article 09 May 2019

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Introduction

Self-regulation, the ability to guide feelings and behaviors according to one’s needs and goals, relates to a range of outcomes, including somatic and mental health (Goschke, 2014; Moffit et al., 2011). Consequently, there is a growing interest in self-regulation across research domains, using a variety of measurement methods (Eisenberg et al., 2019; Nigg, 2017). Whereas questionnaires primarily capture explicit aspects of self-regulation, experimental tasks are designed to capture distinct cognitive and motivational mechanisms underlying self-regulation. Further added value of experimental tasks is their ability to manipulate physiological and brain states in a controlled manner. Research using experimental tasks thus promises an improved mechanistic understanding of phenomena such as self-regulation and its failures, and ultimately the development of targeted, mechanism-based treatments for psychiatric conditions such as substance use disorder (SUD).

However, although bespoke experimental tasks can outperform questionnaires in measuring distinct processes, linkage to real-life outcomes has so far been less successful than questionnaire-based measures. In a recent study, Eisenberg et al. (2019) assessed the ecological validity of 22 questionnaires and 37 task measures. While questionnaires modestly predicted real-world outcomes, experimental tasks showed no relationship to real-world outcomes. In the domain of SUD, Ekhtiari et al. (2017) reviewed studies that attempted to link decision-making tasks to drug use and concluded that these measures may not be sufficient to predict drug use in real life. These findings mirror a more widely held view that experimental tasks lack “realism” and generalizability (Falk and Heckman, 2009).

Here, we argue that experimental tasks do not inherently lack real-world relevance, but instead that shortcomings regarding their psychometric properties can explain their lack of realism. An important psychometric property is test–retest reliability—or a task’s consistency in measuring between-participant differences. Recent reports revealed low test–retest reliability for many tasks (Enkavi et al., 2019; Hedge et al., 2018). Low test–retest reliability is a challenge when we relate one measurement to another (e.g., relating working memory to alcohol consumption in SUD)—because mathematically low reliability limits the observable correlation between two measures (Spearman, 1904/2010: Eq. 1).

$${r}_{observed}={r}_{true}\ast \sqrt{reliabilit{y}_x\ast reliabilit{y}_y}$$

(1)

As an illustration, if working memory and drinking have a high true correlation of .8, but working memory is measured with a low reliability of .31 (which is the median reliability of experimental tasks reported by Hedge et al., 2018), then any observable correlation between the two measures would mathematically decrease to an upper limit of .44. Thus, weak or no relations between task measures and real-life outcomes may simply arise from the low reliability of task measures (Enkavi et al., 2019; Hedge et al., 2018).

In addition to these psychometric shortcomings, experimental tasks—unlike questionnaires—are more difficult to deploy outside the laboratory in real-life scenarios (Zech et al., 2022). Most experimental tasks have been designed to run on laboratory computers, requiring specialized software and, in some cases, specialized hardware. This makes it difficult to link task measures to critical real-life events. For example, binge drinking happens occasionally and may take place in highly specific real-life environments (e.g., in the bar). Indeed, studies that successfully connected task measures to self-control failures used smartphones to measure self-control failures in the field (Berkman et al. 2011; Krönke et al., 2018, 2020a, 2020b, 2021a, 2021b; Lopez et al. 2014; Overmeyer et al., 2021). Some of these studies even suggest that state-like mobilization of control rather than trait self-control can explain self-control failures in daily life (Krönke et al., 2018). The lack of mobility of laboratory tasks makes it difficult to measure such state-like processes. Their lack of mobility also makes it difficult to target large samples of specifically vulnerable patient populations. Together with their low reliability, the lack of mobility of tasks could therefore explain the difficulty of linking task measures to real-life variables.

Recently, several lab-based tasks have been translated to smartphones, which makes it easier to deploy tasks in the field and in relevant populations (for a summary of advantages of using smartphones in behavioral studies, see Miller 2012 and Zech et al., 2022). However, especially when using smartphone tasks in longitudinal designs, they often need to be shorter than lab-based tasks to assure participant compliance outside the laboratory and in repeated testing sessions. Making tasks shorter (i.e., reducing the number of trials), however, reduces task reliability and thus further aggravates existing psychometric problems of experimental tasks (Miller & Ulrich, 2013; Rouder & Haaf, 2019; Smittenaar et al., 2015; Zech et al., 2022). Researchers consequently find themselves in a dilemma: Either they have to rely on long tasks that may sacrifice compliance even in the lab, or they use short tasks that may be unreliable.

Here, we thus outline an approach to overcome shortcomings in the applicability of experimental tasks to real-life scenarios: firstly, this entails moving task measures to platforms such as smartphones, which allows for the collection of rich longitudinal data in real-life contexts; secondly, it analytically exploits this longitudinal data to produce more reliable task metrics. Using this method, we test the reliability and construct validity of four smartphone-based tasks in a large (N = 488) sample of participants with mild to moderate alcohol use disorder. Little is known about the psychometric properties of experimental tasks in this population, although it is often the target of task-based research that strives toward clinical application (e.g., Heinz et al., 2020; Stavro et al., 2013; but see Kräplin et al., 2016). The tasks were gamified to increase participant engagement and designed to capture four distinct cognitive and motivational processes: working memory (McNab et al., 2015), response inhibition (Smittenaar et al., 2015), risk-taking (Rutledge et al., 2014), and information sampling (Hunt et al., 2016; see Fig. 1).

Beyond reliability, which is agnostic to what is measured, a theoretical assumption that tasks are particularly well suited to assess distinct cognitive-motivational processes requires the empirical investigation of their construct validity. In vulnerable populations, such as individuals with SUD, widespread cognitive alterations are well known (Hildebrandt et al., 2021) and may impede the detection of specific cognitive-motivational processes. Instead, multiple experiments may return a more general impairment, an undesirable scenario for mechanism-based research that aims to improve clinical stratification and ultimately individualized treatments. In fact, both reliability and construct validity do not depend on the task alone but also on the characteristics of an investigated sample (Knekta et al., 2019). When striving for future clinical applicability, there is an urgent need to assess psychometric criteria in populations such as people with SUDs. We therefore also assessed the tasks’ validity by analyzing their latent factor structure and by correlating the resulting factor scores with real-life measures of drinking.

Together, this study aimed to show how smartphones can be used to overcome shortcomings of experimental tasks by using joint modeling of longitudinal data from outside the laboratory to produce reliable and valid task measures in a clinical population diagnosed with alcohol use disorder.

Results

Study design

Data were collected as part of a larger research consortium on substance use disorder (SUD) employing a smartphone-based longitudinal Ecological Momentary Assessment (EMA) of up to 1 year (Heinz et al., 2020, see Materials and methods). After study inclusion, individuals with alcohol use disorder and associated comorbidities (Materials and methods, Table S1) commenced a smartphone-based data collection. On the first day of data collection, they completed four experimental tasks twice in pseudo-random order (randomized within each session; see Fig. 1), which allowed us to assess reliability and construct validity in N = 488 individuals. Of these, 373 participants (76.4%) completed the four tasks again 3 weeks later, allowing us to assess test–retest reliability at a longer retest interval. The tasks were taken from the Great Brain Experiment (GBE) app (Brown et al., 2014, see Materials and methods). Tests of response inhibition, working memory, risk taking, and information sampling were included and presented as games. In addition to completing the tasks, participants also self-reported how many days in the past 3 months they had consumed alcohol as well as how much alcohol they consumed on a regular drinking day. The study protocol was pre-registered (https://osf.io/9ze2u/).

Reliability

We initially assessed split-half reliability (or internal consistency) for each task, and each of the two sessions. This indicates consistency of a measure within a session by splitting trials into two halves and comparing scores across each half. Subsequently, we analyzed test–retest reliability, the temporal consistency across two distinct sessions, comparing two analytic approaches in this regard. In a commonly used approach (Enkavi et al., 2019; Hedge et al., 2018; Rouder & Haaf, 2019), we computed task scores for each session separately and then examined test–retest reliability based on the resulting scores. We refer to this as a separate modeling approach. A problem in this popular approach is that it does not consider the dependency of data within participants (the hierarchical longitudinal structure of the data) and, both theoretically and empirically, this has been shown to lead to exaggerated residual variance, which depresses reliability (Brown et al., 2020; Rouder & Haaf, 2019; Waltmann et al., 2022). Therefore, we next analyzed the data with a joint modeling approach by modeling data jointly from both sessions using hierarchical mixed models (for details, see Materials and methods and Supplementary Materials). These models account for a dependency of data within participants, thereby regularizing scores by moving session scores towards participant means, with further regularization by moving participant means towards the sample mean. Indeed, Waltmann et al. (2022) showed that this joint modeling approach yields more accurate session score estimates, leading to both higher and more accurate estimates of reliability than a separate modeling approach. Test–retest reliability was assessed using intra class coefficients (ICCs), which compare the variance of interest—the between-participant variance—with the total residual variance (including systematic within-session variance, e.g., repetition effects). We report ICC(1) (Liljequist et al., 2019) as a primary outcome because it can be computed for separate and joint modeling approaches (see Supplementary Materials).

Split-half reliability

For the risk taking task, split-half reliabilities for the gain and loss gambles were adequate (according to interpretations by Nunnally and Bernstein, 1994; r_sb gain session 1 = .84; r_sb_gain session 2 = .91; r_sb loss session1 = .77; r_sb_loss_session_2 = .82) but lower for mixed gambles (r_sb_mixed_session_1 = .67; r_sb_mixed_session_2 = .71). For the information sampling task, split-half reliabilities were adequate (r_sb_session_1 = .86; r_sb_session_2 = .86). Split-half reliabilities for the working memory task and for the inhibition task could not be analyzed in this way due to their adaptive task design (see Methods section).

Test–retest reliability

Test–retest reliability increased for all tasks when calculating scores based on the joint compared to the separate modeling approach (see Table 1 and Fig. 2). The inhibition task had moderate reliability when scores were calculated based on separate modeling (ICC1 = .51; according to interpretations by Koo and Li, 2016), but good reliability when scores were calculated based on joint modeling (ICC1 = .70). For comparison^{Footnote 1}, the reliabilities of SST scores reported by Hedge et al. (2018) ranged between .36 and .49. The working memory task had poor reliability in all conditions when scores were calculated based on separate modeling (ICC1s ≤ .43), but this reliability increased to moderate levels when scores were calculated based on joint modeling (ICC1 ranging from .51 to .64). For comparison, Lo et al. (2012) reviewed the reliability of similar lab-based tasks and reported reliabilities ranging from .56 to .60. The risk-taking task had moderate reliability in all conditions when scores were calculated based on separate modeling (ICC1s ranging from .52 to .65), with reliability increasing to moderate to good when scores were calculated based on joint modeling (ICC1s ranging from .73 to .80). For comparison, in a similar lab-based task, Petzold et al. (2019) report retest correlations between .02 and .65. Finally, the information sampling task had good reliability when scores were calculated based on separate modeling (ICC1 = .78), which further improved when scores were calculated based on joint modeling (ICC1 = .91). For comparison, Grummit et al. (as cited in Enkavi et al., 2019) reported an ICC of .53 in a sample of 312 children and Enkavi et al. (2019) report an ICC of .31 in a computer-based online information sampling task.

Table 1 Test–retest reliabilities (ICCs) for the different task measures, analysis approaches, and at 3-week follow-up. Note that for the joint modeling, only ICC1s can be calculated. Additional ICCs (ICC2s) are reported in Supplementary Table S3

Full size table

To assess whether reliability decreases with increasing retest periods, we also assessed reliabilities between the first measurement and a 3-week follow-up measurement. As expected, reliabilities for all tasks were slightly lower for the longer retest period (see Table 1).

Construct validity

To assess construct validity, we used exploratory factor analysis, an approach commonly used to assess discriminant and convergent validity (Eisenberg et al., 2019; Russel, 2002). Factor analysis seeks to reduce the dimensionality of measurements with the aim of revealing common factors underlying several measurements outcomes (Eisenberg et al., 2019). This allows researchers to assess whether measures designed to assess different processes, e.g., cognitive control and decision-making, also capture these differences in the variance of a given dataset. Generally, little is known about the factor structure of smartphone-based experimental task measures. Importantly, the factor structure of measurements can differ depending on the population (Knekta et al., 2019) such as healthy individuals or clinical populations—in our case individuals suffering from SUD. A common concern is that, in the face of known widespread cognitive deficits (Hildebrandt et al., 2021), distinct task measures may load on a single factor, potentially hampering inference on different cognitive processes in patient populations. This concern can be amplified for experiments in the field that utilize a smartphone (as increased external distractions could lead to correlated errors).

We conducted the factor analysis based on the average scores across both sessions from joint modeling, based on evidence that joint modeling yielded the highest reliability estimates. A scree plot indicated that the data was best represented by three factors (see Fig. 3A). Factor loadings indicated the first factor, which we labeled cognitive control (following terminology by, Nigg, 2017), represented measures of working memory and response inhibition. The different conditions in the working memory task had factor loadings of .75 to .82, and the response inhibition task had a factor loading of .46. The second factor represented risk-taking to avoid losses (factor loading of 1.00) and the third factor risk-taking for gains (factor loading of .86). Risk taking for mixed gambles loaded equally on Factor 2 (factor loading of .34) and Factor 3 (factor loading of .38). Information sampling loaded on none of the factors (all loadings ≤ -.06; see Fig. 3). However, an eigenvalue around 1 (Fig. 3A) may indicate, albeit weakly, a separate factor for information sampling (see Supplementary Materials).

Together, the factor loadings indicated a single cognitive control dimension and several decision-making dimensions. However, some measures, such as risk-taking for mixed gambles, did not clearly load on a distinct factor but instead spread over several factors, potentially because they involve a consideration of potential gains and losses. To gain further insight into the structure of the data, we, therefore, conducted a hierarchical clustering analysis based on factor loadings. Rather than showing to which extent a measurement is represented by each individual factor, cluster analyses can reveal which variables load similarly to one or several factors. To illustrate, factors can be understood as dimensions in a “psychological space” and cluster analysis determines how distant task measures are from each other within this space (Eisenberg et al., 2019). Notably, the clusters that emerged from this analysis mirrored the factor analysis regarding a large cognitive control cluster consisting of the working memory and inhibition task (see Fig. 3B). In addition, the analysis revealed two to three decision-making clusters (depending on where one chooses to cut the tree diagram) consisting of one cluster for risk-taking in the context of losses, and one for risk taking when rewards were available. The latter further split into a risk taking for gains and mixed gambles cluster and an information sampling bias cluster. The factor structure, as well as the clustering, are broadly consistent with theoretical predictions of a cognitive control dimension separate from a decision-making dimension that segregates gain and loss contexts (Deza Araujo et al., 2018).

Reliability of factor scores

Integrating several noisy measures onto latent variables can further increase reliability potentially by reducing measurement error (Eisenberg et al., 2019; Nigg 2017; Shahar et al., 2019). To test whether this was the case in our dataset, we calculated factor scores for each participant and session based on each task measure’s factor loading. The resulting cognitive control scores (based on the cognitive control factor) showed higher test–retest reliability than each of the underlying task measures (ICC1 = .81; compare Table 1). In the decision-making domain, reliabilities were similar to those of the underlying scores both for the risk taking for gains factor (ICC1 = .79) and for the risk taking to avoid losses factor (ICC1 = .82).

Correlation of task measures with drinking

As proof of concept, we correlated latent factor scores derived from joint modeling with measures of drinking. We observed a correlation between cognitive control and the number of reported drinking days over the last three month (Spearman correlation with Bonferroni corrected p values: r = – 0.230, 95% CI [– 0.31, – 0.14], p < .001, see supplement). When estimating this correlation based on single components of the cognitive control factor derived from separate modeling (focusing on the first session), the five correlations were overall weaker (working memory [short]: r = – 0.181, 95% CI [– 0.27, – 0.09]; p < .001; working memory [long]: r = – 0.131, 95% CI [– 0.22, – 0.04]; p = .005; working memory [encoding]: r = – 0.187, 95% CI [– 0.27, – 0.10]; p < .001; working memory [delayed]: r = – 0.210, 95% CI [– 0.30, – 0.12]; p < .001; SSRT: r = 0.111, 95% CI [– 0.10, 0.08]; p = .016). This pattern indicates that improving reliability, through joint modeling and deriving latent factor scores, can uncover significant correlations between task measures and real-world outcomes such as drinking.

Discussion

We assessed the reliability and validity of four smartphone-based experiments in a large sample of participants suffering from alcohol use disorder. We showed that split-half reliability was high. Test–retest reliability ranged from moderate to good when modeling task data separately for each session, and increased to good to excellent when modeling sessions jointly. This emphasizes a need for adequate modeling of within-subject longitudinal data for sufficient reliability estimates and highlights the value of smartphones for larger data collection than can be accomplished within a laboratory setting. With respect to construct validity, we identified a cognitive control factor distinct from two factors reflecting decision-making in the context of losses versus rewards. Latent variables based on factor loadings further increased test–retest reliability, demonstrating that combining several measures into latent variables is a further useful denoising step. As proof of concept, we demonstrate that a latent cognitive control score based on joint modeling indeed yielded stronger correlations with drinking behavior than single task scores based on separate modeling. Together, our data show that distinct cognitive-motivational aspects can be measured experimentally with sufficient reliability and validity in substance use disorder (SUD) through smartphone-based data collection.

Improving task reliability through joint modeling

Recent analyses of multiple task measures indicate that most tasks show poor test–retest reliability (Enkavi et al., 2019; Hedge et al. 2018). This poses a major problem because reliability limits the observable correlation between two variables, one possible account of why many task measures show only weak or no correlation to real-life outcomes (Eisenberg et al., 2019). Here, we demonstrate that modeling task outcomes jointly based on two measurement sessions improves a tasks’ test–retest reliability from moderate to good and often excellent levels compared to traditional methods that model each session separately. Previous work indicated that this method of jointly modeling task data produces more reliable laboratory task measures (Brown et al., 2020; Waltmann et al., 2022). A recent simulation showed that this moves reliability estimates closer to true reliability levels by regulating task outcomes based on all available data, thereby reducing measurement noise (Waltmann et al., 2022). Our smartphone tasks are very short (~5 min per task) and the observed improvement rests on data from more than one session and more than one person. Here, smartphone-based tasks have a major advantage over laboratory-based tasks by allowing researchers to collect data more efficiently from several shorter sessions in the field (Miller, 2012; Zech et al., 2020, 2022). Future research could aim at moving tasks to smartphone to obtain reliable scores from two or potentially even more sessions.

Reliability based on factor scores

In line with prior research (Enkavi et al., 2019; Eisenberg et al., 2019; Shahar et al., 2019), we show that a tasks’ reliability is further increased when measurements are based on factor scores rather than on individual tasks scores. In addition to reducing dimensionality, factor analysis can be regarded as a “denoising” step when data from several related tasks is available. However, having multiple tasks available in the same subjects is relatively unusual and here again smartphone-based tasks can be useful. As participants need to complete more tasks, the burden of participating in experimental sessions in the lab increases (e.g., Eisenberg’s participants had to complete 150 tasks) while smartphone-based tasks can spread this burden over time, thus making participation more engaging for participants.

State-dependent process?

A major goal of most smartphone-based studies is to detect state-dependent changes. Our data reveal excellent split-half reliability, a finding that is especially encouraging given participants completed the tasks outside a controlled lab environment, where more measurement noise (e.g., because of distractions) could have led to reduced reliability. Moreover, estimates of split-half reliability were consistently higher as compared to test–retest reliability. This indicates that at least part of the unexplained variance in experimental task measures might be driven by state-dependent changes in cognitive and motivational processes (Hedge et al., 2018). As most tasks measuring such processes use cross-sectional designs, little is Zech et al. (2022) known about whether and at what frequency cognitive and motivational processes fluctuate (). There are a few notable exceptions. For example, in a week-long smartphone experiment, Eldar et al. (2018) showed that reward-learning processes fluctuate at two distinct timescales—one fast and one slow—and that these fluctuations were linked to changes in mood. In SUD, Konova et al. (2020) linked longitudinal fluctuations in risky decision making to prospective opioid use. Importantly, they varied the sampling rate (between 1 week and 1 month) and only found the expected association at the weekly sampling rate. In a citizen science sample, smartphone-based assessments of risk-taking were linked to circadian rhythm (Bedder et al., 2020). These studies underline the importance of understanding whether, and at which frequencies, cognitive processes fluctuate. Future research should investigate such fluctuations more deeply by deploying experimental tasks in longitudinal settings. We propose that smartphone-based tasks with sufficient psychometrics, as outlined in the current study, can help in this endeavor. Indeed, they can easily be deployed in real-world environments with high measurement frequencies, and we are currently implementing this in a multi-center study (Heinz et al., 2020).

Construct validity

Next to reliability, construct validity is an important psychometric criterion. Construct validity refers to the extent to which associations between measurements reflect theoretical relationships between underlying constructs. Although it is as important as reliability, construct validity is rarely assessed for experimental tasks (for an exception, see Eisenberg et al., 2019). We assessed construct validity by creating a psychological space using factor analysis and then categorizing tasks within this psychological space using cluster analysis (following the method described in Eisenberg et al., 2019). This revealed a large cognitive control cluster consisting of working memory and inhibition, in addition to several smaller clusters related to risk taking, the latter broadly splitting into risk taking for gains and risk taking to avoid losses. In line with Eisenberg et al. (2019), our findings do not support the notion of a unifying construct of self-regulation, which may lack coherence as a construct. Instead, we show key factors of cognitive control and motivation that align well with theoretical accounts. In a recent historical review, Nigg (2017) concluded that most theories include working memory and inhibition in cognitive control (see also Botvinik & Braver, 2015), but separate them from decision-making such as risk taking. In a review of behavioral and genetic studies, Friedman and Miyake (2017) further concluded that response inhibition and working memory are robustly correlated. Furthermore, empirical evidence shows a distinction in neural circuits underlying risk taking to avoid losses and risk taking for gains (Deza Araujo et al., 2018). Risk taking for gains decreases with age, putatively reflecting decreases in dopamine (Rutledge et al., 2016), and risk taking for losses has been linked to circadian rhythms (Bedder et al., 2020). This distinction is also in line with prospect theory which posits that people assess risks differently in the context of gains compared to losses (Kahneman & Tversky, 1979; Tversky & Kahneman, 1981). Prospect theory has been extensively tested in both healthy populations and populations with SUD (Cabedo-Peris et al., 2022). We conclude that our tasks have theoretically plausible construct validity. This is the first demonstration of theoretically plausible construct validity for smartphone-based tasks, in this case within a sample of participants suffering from SUD.

Clinical sample and correlation with drinking

A task’s reliability and construct validity do not only depend on the task itself, but also on other factors such as the investigated sample (Knekta et al., 2019). Most large-scale assessments of task reliability and construct validity have been conducted so far in healthy individuals or rather unselected samples (Eisenberg et al., 2019; Enkavi et al., 2019; Hedge et al. 2018; Shahar et al., 2019; Waltmann et al., 2022). We demonstrate that latent cognitive control factor scores derived from joint modeling yielded higher correlations with a measure of drinking behavior as compared to individual scores derived from separate modeling. This result confirms Spearman’s (1904/2010) prediction that better reliability will increase observed correlations between variables while underlining the importance of increasing reliability through joint longitudinal data modeling and by combining task measures into latent factor scores in clinical populations.

Limitations

We show that—when analyzing data using traditional analytic approaches—the four tasks tested in this study had already moderate to good reliability, thereby sometimes already exceeding the median reliability of other tasks reported in previous studies (Enkavi et al., 2019; Hedge et al., 2018). The gamification of the tasks may play a role in this regard. For example, gamification of tasks may render tasks more engaging, thereby motivating participants to complete them to the best of their abilities leading to higher between-participant variance and in turn higher reliability. However, our study design did not allow us to systematically assess differences in this regard.

We further demonstrate that one of the extracted factor scores (cognitive control) correlated with real-life measures of drinking in the expected direction (lower cognitive control was related to more drinking). Although this finding is promising with regard to the task measures’ ecological validity, we did not find similar correlations between the two risk taking factors and drinking. On the one hand this finding might point toward a mechanistic insight that cognitive control is more important than risk preferences in SUDs. However, we are hesitant to interpret null-results as it is also possible that characteristics of the task explain these null-results. For example, it is possible that risk taking using abstract rewards does not generalize to risk taking involving real, health-related rewards. Future research could investigate this idea by designing risk taking tasks that are more closely related to risk taking in SUD. It is also possible that increasing the number of trials (for example by including additional measurement sessions) reveals correlations. Finally, future research could explore whether different tasks correlate with other aspects of addictive behaviors that were not investigated in the current project.

The temporal spacing between test sessions is an interesting target for ongoing and future research. Varying this interval systematically could lead to new insights as to when tasks are most sensitive to state- or training-related changes. A noteworthy advantage of our study is that we included two sessions for each task on the first study day, enabling us to specifically assess reliability and validity. Many existing studies on task reliability commonly use data from longitudinal studies that were designed to test clinical or developmental effects (e.g., Shahar et al., 2019, Brown et al., 2020). We also demonstrated that modeling session scores jointly—that is using data from all sessions—further improves reliability to good-to-excellent levels. It should be noted that this approach only works if data from several sessions is available, which might not be feasible for complex studies (e.g., studies involving brain scans). The method is therefore especially advantageous for remote studies such as this one.

One possible limitation of smartphone-based research is, however, that missing data might not be random. For example, it is possible that participants that have generally lower working memory also perform fewer working memory tasks. If data is missing in a systematic way, it should decrease overall variance and make subsequent analyses more conservative. Finally, as both reliability and construct validity depend both on tasks and on populations (Knekta et al., 2019), it should be noted that our results do not necessarily generalize to other populations (e.g., healthy participants or participants with other pathologies).

Conclusions

We demonstrate good reliability and plausible construct validity of four smartphone-based tasks deployed in a large sample of participants suffering from SUD. We show that reliability can be further improved to good to excellent levels when modeling data from several sessions jointly and when combining several measures into latent variables. Finally, we demonstrate that latent factor scores from joint modeling increase correlations with a measure of drinking. Together, our results demonstrate the strong potential of smartphones to help overcome psychometric shortcomings of lab-based experimental tasks and to investigate real-life outcomes, which require sufficient psychometrics and an easy mobile deployment in real life and in clinically relevant populations. These results represent a critical milestone towards longitudinal experimental studies in SUD research and in psychiatry and psychology more generally.

Materials and methods

General procedure

This study was part of a larger German research consortium on substance use disorder (SUD) at three sites (Technical University Dresden, Charité Berlin, and Central Institute of Health Mannheim), in which a smartphone-based longitudinal Ecological Momentary Assessment (EMA) of up to 1 year was performed with a range of subjective reports. In addition to subjective reports, individuals performed four cognitive-motivational tasks on the smartphone once per month. These tasks were taken from the Great Brain Experiment (GBE) app (Brown et al., 2014, see below for details). Before starting the EMA study, individuals underwent extensive clinical and neurocognitive assessments (see Heinz et al., 2020). During this assessment appointment, which was either conducted inside the laboratory or online via video chat, the app for running the EMA study (Movisens app; movisens GmbH, Germany; Reichert et al., 2021) as well as a customized version of the GBE app for assessment of the four cognitive-motivational tasks (see below) were installed either on participants’ own phone or on a study phone. On the first Monday following the assessment, participants were prompted to complete each smartphone task twice. The current study focusses on these first two measurement sessions. Participants also participated in multiple sub-projects of the consortium (see Heinz et al., 2020), that are unrelated to the present study.

Participants

The study procedure was approved by the review boards of the local ethics committee at Heidelberg University (2018-621N-MA), Charité – Universitätsmedizin Berlin (EA1/212/18), and Technical University Dresden (EK 459112018). Data collection took place between February 2020 and March 2022. All participants gave written informed consent before participating in the study. For study inclusion at all three sites, participants had to fulfill criteria of substance used disorder, specifically mild to moderate Alcohol Use Disorder (AUD). According to DSM 5, mild to moderate AUD was defined as the presence of at least two AUD criteria. Participants were recruited through flyers and advertisements. Telephone screenings were conducted before study inclusion/exclusion. Exclusion criteria were: clinical indication for detoxification treatment, insufficient knowledge of the German language, seeking a therapeutic intervention, MRI contraindications, medical history of DSM-5 bipolar disorder, psychotic disorder, schizophrenia or schizophrenic spectrum disorder, or current use of drugs or medication nor substance dependence thereof other than alcohol, nicotine, or cannabis, as well as medical history of severe head injury, or other severe central nervous system disorders. Data from 488 participants was analyzed for the present study. Participants age ranged from 16 to 65 years (M = 36.9, SD = 12.8) and 180 participants (36.9%) reported to be female. Participants fulfilled 2 to 9 AUD criteria ranged from (M = 4.05, SD = 1.60).

Inhibition task

During the Inhibition Task (Smittenaar et al., 2015), participants tapped left or right on their smartphone screen to collect fruits falling from a tree (see Fig. 1). Each trial began with two fruits hanging at the top of the screen for one to three seconds (randomly selected from a uniform distribution). Next, one of the fruits fell down and passed over one of two circles indicating the time during which participants should collect the fruit through tapping (Go-Trials with a response window spanning from 500 to 800 ms after stimulus onset). In 12 of 32 trials (37.5%), the falling fruit turned brown, indicating that it was rotten and should not be collected (stop trials). At the beginning of each session the delay after which the fruit turned brown (stop signal delay; SSD) was 350 ms. This delay changed according to staircase procedure (Verbruggen et al., 2019): it increased by 50 ms after each successful stop trial (rendering the subsequent stop trial more difficult) and decreased by 50 ms after each unsuccessful stop trial (rendering the subsequent stop trial easier).

Working memory task

During the working memory task (McNab et al., 2015), participants were asked to remember the positions of two up to 12 red circles presented on a 4 x 4 grid (see Fig. 1). The task involved four conditions: In the ‘long no distractor’ condition circles were presented for 2 s (encoding phase), then disappeared for 1 s (maintenance phase), before participants had to tap on their no-longer visible locations. In the ‘short no-distractor’ condition, patterns were presented for 1 instead of 2 s. In the ‘encoding-distractor’ condition, two yellow distractors were presented together with the red circles during the encoding phase. In the ‘delayed-distractor’ condition, the same two yellow distractors were presented but during the maintenance phase. Each condition started with three circles in trial one. If participants failed to respond correctly, two circles were presented in the second trial. If participants failed at this level, the condition was terminated. If a trial was completed correctly, the number of red circles in the corresponding condition increased by one in the next trial. If participants failed in a trial (from level four onwards), the level was repeated once. If they failed again the condition was terminated. A maximum of eight trials was completed for each condition.

Risk taking task

During the risk taking task (Rutledge et al., 2014), participants repeatedly chose between a certain outcome and a gamble, with equal probabilities of the two outcomes (see Fig. 1). The task involved three conditions: In the ‘gain’ condition participants chose between either a certain gain or to gamble for a larger gain against 0 points. In the ‘loss’ condition, participants chose between either a certain loss or to gamble for 0 points against a larger loss. In the ‘mixed’ condition, participants chose between a certain amount of 0 points or to gamble for a gain against a loss amount. The gain and loss conditions consisted of 11 trials and the mixed condition consisted of eight trials. In each trial, a certain amount was first randomly chosen with replacement from a fixed list of outcomes. Gamble amounts were then calculated by multiplying the certain amount with a randomly chosen multiplier from another fixed list (for details, see Bedder et al., 2020; Rutledge et al., 2014). The task also involved current mood ratings (“How happy are you at this moment?”; rating line with endpoints “very happy” and “very unhappy”) which were presented after every 2–3 trials and are known to be correlated with depressive symptoms (Rutledge et al., 2014), but are not subject to the currently reported reliability analysis.

Information sampling task

During the information sampling task (Hunt et al., 2016) participants were presented with four playing cards in rows of two and had to choose the row with the largest sum of card values (see Fig 1). Each of the 21 trials began with all cards facing down. Participants could invest points to turn over one card at a time to sample information with increasing costs for each additional card (zero points for the first card, 10 for the first card, 15 for the third, and 20 for the fourth card). Before turning over a card, participants could also choose to guess, at no cost, which row had the largest value. A choice at this stage would be a gamble (called a guess in the task) at 50/50. Participants won 60 points if this guess was correct and lost 50 points if the guess was incorrect. If turning over one or multiple cards, the costs for information sampling reduced the total win. Card values were sampled randomly with replacement from a discrete uniform distribution with integers ranging from 1–10.

Reliability

The first goal of this study was to assess the smartphone tasks’ reliability. Where possible, we first assessed the tasks’ split-half reliability, i.e., the consistency with which a task measures its construct within one measurement session. Next, we assessed the tasks’ test–retest reliability, i.e., the consistency with which a task measures its construct between two measurement sessions. While assessing the tasks’ test–retest reliability, we compared two approaches of analyzing task data—the more traditional approach in which sessions are modeled separately, and an alternative approach in which sessions are modeled jointly. The latter has recently been shown to yield superior reliability estimates in theory and practice in other cognitive tasks (for details, see below; Brown, 2020; Haines et al., 2020; Waltmann et al., 2022).

Modeling sessions separately vs. jointly

For each task, we compared two approaches of analyzing task data: The first approach, which we subsequently call separate modeling, is traditionally used to analyze task data. In this approach, summary scores are first created by aggregating data separately for each session of each participant. Next, these summary scores are used for inference, for example to calculate test–retest reliabilities. According to Haines et al. (2020), one problem of this approach is that it assumes that scores are estimated without measurement error. This, in turn, leads to ignoring uncertainty during inference, which, for example, can attenuate test–retest reliability. A second problem is that this method assumes that person-level parameters are distributed uniformly across an interval that spans beyond a reasonable range of task scores. This is because knowledge about scores from other participants or scores from other sessions of the same participant is not integrated in estimating individual session scores. Prior research shows that integrating such information into individual score estimation yields more reliable scores (Brown et al., 2020; Efron & Morris, 1977; Haines et al., 2020; Rouder & Haaf, 2019; Waltmann et al., 2022; Williams et al., 2021).

The alternative analysis approach, which we subsequently call joint modeling, overcomes both problems of the separate modeling approach. Instead of first calculating summary scores and using them in a second step for inference, the prediction approach performs inference directly based on all available trial-level data. This allows it to carry, firstly, within-session uncertainty into the inference step and, secondly, to use information from other participants and sessions in each individual session score estimation. Both of these aspects improved test–retest reliability in previous work. We implemented this approach using hierarchical mixed models specifically designed to model each task’s outcome measure (for details see Supplementary Materials). Hierarchical mixed models allow us to analyze data at the trial-level while still accounting for the participant and session structure of the data. We validated that scores based on mixed models did not substantially differ from task scores calculated with established methods when modelled for each session separately (see Supplementary Materials).

Split-half and test–retest reliability

Firstly, split-half reliability was assessed based on Spearman–Brown-corrected correlations within each session (based on odd-even splits). Note that for the working memory task and for the inhibition task, split-half reliabilities could not be computed because these tasks are adaptive. Therefore, splitting the task into two halves is not appropriate (Draheim et al., 2020). Qualitative interpretations of split-half reliabilities are given in line with Nunnally and Bernstein (1994; split-half reliabilities above .8 were labeled as adequate). Secondly, test–retest reliability was calculated based on intra-class correlation coefficients (ICCs) based on data from the first two measurement sessions. To calculate ICCs directly from mixed models, we followed the method recently described by Brown et al. (2020), which calculates reliabilities based on variance components extracted from mixed models. Waltmann et al. (2022) recently showed that this method yields more conservative and more accurate reliabilities than alternative methods (e.g., first predicting sessions scores and calculating reliabilities based on these predictions) and thus crucially avoids overestimating reliability. Qualitative interpretations of test–retest reliabilities are given in accordance with Koo and Li (2016): ICCs less than .5 were being interpreted as “poor”, ICCs between .5 and .75 as “moderate”, ICCs between .75 and .9 as “good”, and ICCs above .9 as “excellent”.

Factor and clustering analysis

Exploratory factor analysis was conducted using maximum likelihood estimation followed by oblimin rotation, which rotates factors without enforcing orthogonality. The analysis was based on average joint prediction scores from both sessions. Before conducting this analysis, the outcome measure of the inhibition task was inverted, so that it could be interpreted in the same direction as the outcome measure of the working memory task (i.e., higher values indicating better performance). This analysis was implemented using the factor_analyzer package (Python 3.5). The optimal number of factors was determined using a scree plot (see Fig. 3). The hierarchical clustering analysis was conducted using the SciPy package (Python 3.5). The analysis was conducted using Euclidean distances to generate a hierarchical tree. As there were no implicit heights at which to cut this tree, the cut height was determined based on theoretical considerations. To calculate reliability of factor scores, factor scores were extracted separately for each session using the tenBerge method, which is most appropriate for oblimin rotation (Ten Berge et al., 1999).

Notes

Note that all comparison studies used a traditional, separate modeling approach.

References

Bedder, R., Vaghi, M., Dolan, R., & Rutledge, R. (2020). Risk taking for potential losses but not gains increases with time of day. PsyArXiv. https://doi.org/10.31234/osf.io/3qdnx
Berkman, E. T., Falk, E. B., & Lieberman, M. D. (2011). In the trenches of real-world self-control. Psychological Science, 22, 498–506.
Article PubMed Google Scholar
Botvinick, M., & Braver, T. (2015). Motivation and cognitive control: From behavior to neural mechanism. Annual Review of Psychology, 66, 83–113.
Article PubMed Google Scholar
Brown, H. R., Zeidman, P., Smittenaar, P., Adams, R. A., McNab, F., Rutledge, R. B., & Dolan, R. J. (2014). Crowdsourcing for cognitive science–the utility of smartphones. PLoS One, 9(7), e100662.
Article PubMed PubMed Central Google Scholar
Brown, V. M., Chen, J., Gillan, C. M., & Price, R. B. (2020). Improving the reliability of computational analyses: Model-based planning and its relationship with compulsivity. Biological Psychiatry: Cognitive Neuroscience and Neuroimaging, 5, 601–609.
PubMed Google Scholar
Cabedo-Peris, J., González-Sala, F., Merino-Soto, C., Pablo, J. Á. C., & Toledano-Toledano, F. (2022). Decision making in addictive behaviors based on prospect theory: A systematic review. Healthcare, 10(9), 1659.
Deza Araujo, Y. I., Nebe, S., Neukam, P. T., Pooseh, S., Sebold, M., Garbusow, M., & Smolka, M. N. (2018). Risk seeking for losses modulates the functional connectivity of the default mode and left frontoparietal networks in young males. Cognitive, Affective, & Behavioral Neuroscience, 18, 536–549.
Article Google Scholar
Draheim, C., Tsukahara, J. S., Martin, J. D., Mashburn, C. A., & Engle, R. W. (2020). A toolbox approach to improving the measurement of attention control. Journal of Experimental Psychology: General. Advance online publication.
Efron, B., & Morris, C. (1977). Stein's paradox in statistics. Scientific American, 236, 119–127.
Article Google Scholar
Eisenberg, I. W., Bissett, P. G., Zeynep Enkavi, A., Li, J., MacKinnon, D. P., Marsch, L. A., & Poldrack, R. A. (2019). Uncovering the structure of self-regulation through data-driven ontology discovery. Nature Communications, 10, 1–3.
Article Google Scholar
Ekhtiari, H., Victor, T. A., & Paulus, M. P. (2017). Aberrant decision-making and drug addiction—How strong is the evidence? Current Opinion in Behavioral Sciences, 13, 25–33.
Article Google Scholar
Eldar, E., Roth, C., Dayan, P., & Dolan, R. J. (2018). Decodability of reward learning signals predicts mood fluctuations. Current Biology, 28, 1433–1439.
Article PubMed PubMed Central Google Scholar
Enkavi, A. Z., Eisenberg, I. W., Bissett, P. G., Mazza, G. L., MacKinnon, D. P., Marsch, L. A., & Poldrack, R. A. (2019). Large-scale analysis of test–retest reliabilities of self-regulation measures. Proceedings of the National Academy of Sciences, 116, 5472–5477.
Article Google Scholar
Falk, A., & Heckman, J. J. (2009). Lab experiments are a major source of knowledge in the social sciences. Science, 326, 535–538.
Article PubMed Google Scholar
Friedman, N. P., & Miyake, A. (2017). Unity and diversity of executive functions: Individual differences as a window on cognitive structure. Cortex, 86, 186–204. https://doi.org/10.1016/j.cortex.2016.04.023
Goschke, T. (2014). Dysfunctions of decision-making and cognitive control as transdiagnostic mechanisms of mental disorders: Advances, gaps, and needs in current research. International Journal of Methods in Psychiatric Research, 23, 41–57.
Article PubMed Google Scholar
Haines, N., Kvam, P. D., Irving, L. H., Smith, C., Beauchaine, T. P., Pitt, M. A., ... & Turner, B. M. (2020). Theoretically informed generative models can advance the psychological and brain sciences: Lessons from the reliability paradox.
Hedge, C., Powell, G., & Sumner, P. (2018). The reliability paradox: Why robust cognitive tasks do not produce reliable individual differences. Behavior Research Methods, 50, 1166–1186.
Article PubMed Google Scholar
Heinz, A., Kiefer, F., Smolka, M. N., Endrass, T., Beste, C., Beck, A., & Spanagel, R. (2020). Addiction research consortium: Losing and regaining control over drug intake (ReCoDe)—From trajectories to mechanisms and interventions. Addiction Biology, 25, e12866.
Article PubMed Google Scholar
Hildebrandt, M. K., Dieterich, R., & Endrass, T. (2021). Neural correlates of inhibitory control in relation to the degree of substance use and substance-related problems–a systematic review and perspective. Neuroscience & Biobehavioral Reviews, 128, 1–11.
Article Google Scholar
Hunt, L. T., Rutledge, R. B., Malalasekera, W. N., Kennerley, S. W., & Dolan, R. J. (2016). Approach-induced biases in human information sampling. PLoS Biology, 14, e2000638.
Article PubMed PubMed Central Google Scholar
Kahneman, D., Tversky, A. (1979). Prospect theory: An analysis of decision under risk. Econometrica, 47, 263–291.
Article Google Scholar
Knekta, E., Runyon, C., & Eddy, S. (2019). One size doesn’t fit all: Using factor analysis to gather validity evidence when using surveys in your research. CBE—Life $ciences Education, 18, rm1.
Article Google Scholar
Konova, A. B., Lopez-Guzman, S., Urmanche, A., Ross, S., Louie, K., Rotrosen, J., & Glimcher, P. W. (2020). Computational markers of risky decision-making for identification of temporal windows of vulnerability to opioid use in a real-world clinical setting. JAMA Psychiatry, 77, 368–377.
Article PubMed Google Scholar
Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15, 155–163.
Article PubMed PubMed Central Google Scholar
Kräplin, A., Scherbaum, S., Bühringer, G., & Goschke, T. (2016). Retest reliabilities of decision-making and cognitive control measures in addictive disorders. Sucht, 62, 191–202.
Article Google Scholar
Krönke, K. M., Mohr, H., Wolff, M., Kraplin, A., Smolka, M. N., Buhringer, G., Ruge, H., & Goschke, T. (2021a). Real-life self-control is predicted by parietal activity during preference decision making: A brain decoding analysis. Cognitive, Affective, & Behavioral Neuroscience, 21, 936–947.
Article Google Scholar
Krönke, K. M., Wolff, M., Mohr, H., Kräplin, A., Smolka, M. N., Bühringer, G., & Goschke, T. (2018). Monitor yourself! Deficient error-related brain activity predicts real-life self-control failures. Cognitive, Affective, & Behavioral Neuroscience, 18, 622–637.
Article Google Scholar
Krönke, K. M., Wolff, M., Mohr, H., Kräplin, A., Smolka, M. N., Bühringer, G., & Goschke, T. (2020a). Predicting real-life self-control from brain activity encoding the value of anticipated future outcomes. Psychological Science, 31, 268–279.
Article PubMed Google Scholar
Krönke, K. M., Mohr, H., Wolff, M., Kraplin, A., Smolka, M. N., Buhringer, G., Ruge, H., & Goschke, T. (2021b). Real-life self-control is predicted by parietal activity during preference decision making: A brain decoding analysis. Cognitive, Affective, & Behavioral Neuroscience, 21, 936–947.
Article Google Scholar
Krönke, K. M., Wolff, M., Shi, Y., Kräplin, A., Smolka, M. N., Bühringer, G., & Goschke, T. (2020b). Functional connectivity in a triple-network saliency model is associated with real-life self-control. Neuropsychologia, 149, 107667.
Article PubMed Google Scholar
Liljequist, D., Elfving, B., & Skavberg Roaldsen, K. (2019). Intraclass correlation–A discussion and demonstration of basic features. PloS one, 14(7), e0219854. https://doi.org/10.1371/journal.pone.0219854
Lo, A. H., Humphreys, M., Byrne, G. J., & Pachana, N. A. (2012). Test–retest reliability and practice effects of the Wechsler Memory Scale‐III. Journal of Neuropsychology, 6(2), 212–231.
Lopez, R. B., Hofmann, W., Wagner, D. D., Kelley, W. M., & Heatherton, T. F. (2014). Neural predictors of giving in to temptation in daily life. Psychological Science, 25, 1337–1344.
Article PubMed Google Scholar
McNab, F., Zeidman, P., Rutledge, R. B., Smittenaar, P., Brown, H. R., Adams, R. A., & Dolan, R. J. (2015). Age-related changes in working memory and the ability to ignore distraction. Proceedings of the National Academy of Sciences, 112, 6515–6518.
Article Google Scholar
Miller, G. (2012). The smartphone psychology manifesto. Perspectives on Psychological Science, 7(3), 221–237.
Article PubMed Google Scholar
Miller, J., & Ulrich, R. (2013). Mental chronometry and individual differences: Modeling reliabilities and correlations of reaction time means and effect sizes. Psychonomic Bulletin & Review, 20(5), 819–858.
Article Google Scholar
Moffitt, T. E., Arseneault, L., Belsky, D., Dickson, N., Hancox, R. J., Harrington, H., & Caspi, A. (2011). A gradient of childhood self-control predicts health, wealth, and public safety. Proceedings of the National Academy of Sciences, 108, 2693–2698.
Article Google Scholar
Nigg, J. T. (2017). Annual research review: On the relations among self-regulation, self-control, executive functioning, effortful control, cognitive control, impulsivity, risk-taking, and inhibition for developmental psychopathology. Journal of Child Psychology and Psychiatry, 58, 361–383.
Article PubMed Google Scholar
Nunnally, J., & Bernstein, I. H. (1994). Psychometric theory (3rd ed.). McGraw-Hill.
Google Scholar
Overmeyer, R., Berghauser, J., Dieterich, R., Wolff, M., Goschke, T., & Endrass, T. (2021). The error-related negativity predicts self-control failures in daily life. Frontiers in Human Neuroscience, 14, 614979.
Article PubMed PubMed Central Google Scholar
Petzold, J., Kienast, A., Lee, Y., Pooseh, S., London, E. D., Goschke, T., & Smolka, M. N. (2019). Baseline impulsivity may moderate L-DOPA effects on value-based decision-making. Scientific Reports, 9, 1–8.
Article Google Scholar
Reichert, M., Gan, G., Renz, M., Braun, U., Brüßler, S., Timm, I., & Meyer-Lindenberg, A. (2021). Ambulatory assessment for precision psychiatry: Foundations, current developments and future avenues. Experimental Neurology, 345, 113807.
Article PubMed Google Scholar
Rouder, J. N., & Haaf, J. M. (2019). A psychometrics of individual differences in experimental tasks. Psychonomic Bulletin & Review, 26(2), 452–467.
Article Google Scholar
Russell, D. W. (2002). In search of underlying dimensions: The use (and abuse) of factor analysis in personality and social psychology bulletin. Personality and Social Psychology Bulletin, 28, 1629–1646.
Article Google Scholar
Rutledge, R. B., Skandali, N., Dayan, P., & Dolan, R. J. (2014). A computational and neural model of momentary subjective well-being. Proceedings of the National Academy of Sciences, 111(33), 12252–12257.
Article Google Scholar
Rutledge, R. B., Smittenaar, P., Zeidman, P., Brown, H. R., Adams, R. A., Lindenberger, U., & Dolan, R. J. (2016). Risk taking for potential reward decreases across the lifespan. Current Biology, 26, 1634–1639.
Article PubMed PubMed Central Google Scholar
Shahar, N., Hauser, T. U., Moutoussis, M., Moran, R., Keramati, M., Consortium, N., & Dolan, R. J. (2019). Improving the reliability of model-based decision-making estimates in the two-stage decision task with reaction-times and drift-diffusion modeling. PLoS Computational Biology, 15(2), e1006803.
Article PubMed PubMed Central Google Scholar
Smittenaar, P., Rutledge, R. B., Zeidman, P., Adams, R. A., Brown, H., Lewis, G., & Dolan, R. J. (2015). Proactive and reactive response inhibition across the lifespan. PLoS One, 10, e0140383.
Article PubMed PubMed Central Google Scholar
Spearman, C. (1904/2010). The proof and measurement of association between two things. International Journal of Epidemiology, 39, 1137–1150. (Original work published 1904).
Stavro, K., Pelletier, J., & Potvin, S. (2013). Widespread and sustained cognitive deficits in alcoholism: A meta-analysis. Addiction Biology, 18, 203–213.
Article PubMed Google Scholar
Ten Berge, J. M., Krijnen, W. P., Wansbeek, T., & Shapiro, A. (1999). Some new results on correlation-preserving factor scores prediction methods. Linear Algebra and its Applications, 289, 311–318.
Article Google Scholar
Tversky, A., & Kahneman, D. (1981). The framing of decisions and the psychology of choice. Science, 211, 453–458.
Verbruggen, F., Aron, A. R., Band, G. P., Beste, C., Bissett, P. G., Brockett, A. T., et al. (2019). A consensus guide to capturing the ability to inhibit actions and impulsive behaviors in the stop-signal task. eLIFE, 8, e46323.
Article PubMed PubMed Central Google Scholar
Waltmann, M., Schlagenhauf, F., & Deserno, L. (2022). Sufficient reliability of the behavioral and computational readouts of a probabilistic reversal learning task. Behavior Research Methods, 1–22.
Williams, D. R., Mulder, J., Rouder, J. N., & Rast, P. (2021). Beneath the surface: Unearthing within-person variability and mean relations with Bayesian mixed models. Psychological Methods, 26(1), 74.
Article PubMed Google Scholar
Zech, H. G., Rotteveel, M., van Dijk, W. W., & van Dillen, L. F. (2020). A mobile approach-avoidance task. Behavior Research Methods, 52, 2085–2097.
Article PubMed PubMed Central Google Scholar
Zech, H. G., Reichert, M., Ebner-Priemer, U. W., Tost, H., Rapp, M. A., Heinz, A., et al. (2022). Mobile data collection of cognitive-behavioral tasks in substance use disorders: Where are we now? Neuropsychobiology, 1–13.

Download references

Funding

Open Access funding enabled and organized by Projekt DEAL. Research reported in this publication was supported by a grant from the German Research Foundation/Deutsche Forschungsgemeinschaft (DFG Project number 402170461 [TRR 265 “Losing and Regaining Control over Drug intake”], Heinz et al., 2020).

Author information

Authors and Affiliations

Department of Psychiatry, Technische Universität Dresden, Dresden, Germany
Hilmar Zech, Ying Lee, Pascale C. Fischbach, Charlotte M. Großkopf, Angela Hentschel, Matthew Belanger, Johannes Steffen, Michael N. Smolka & Lorenz Deserno
Department of Child and Adolescent Psychiatry, Psychosomatics and Psychotherapy, Centre of Mental Health, University of Würzburg, Margarete-Höppel-Platz 1, 97080, Würzburg, Germany
Hilmar Zech, Maria Waltmann & Lorenz Deserno
Max Planck Institute for Human Cognitive and Brain Sciences, Stephanstraße 1, 04103, Leipzig, Germany
Maria Waltmann & Lorenz Deserno
Max Planck University College London Centre for Computational Psychiatry and Ageing Research, London, UK
Ying Lee, Rachel L. Bedder, Robb B. Rutledge & Ray Dolan
Wellcome Centre for Neuroimaging (WCHN), University College London, London, UK
Ying Lee, Rachel L. Bedder, Robb B. Rutledge & Ray Dolan
Department of eHealth and Sports Analytics, Faculty of Sport Science, Ruhr-Universität Bochum (RUB), Bochum, Germany
Markus Reichert
Mental mHealth Lab, Institute of Sports and Sports Science, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany
Markus Reichert & Ulrich Ebner-Priemer
Department of Psychiatry and Psychotherapy, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
Markus Reichert, Heike Tost, Jamila Andoh & Ulrich Ebner-Priemer
Neuroscience Institute & Department of Psychology, Princeton University, Princeton, NJ, USA
Rachel L. Bedder
Department of Psychology, Yale University, New Haven, CT, USA
Robb B. Rutledge
Social and Preventive Medicine, Department of Sports and Health Sciences, Intra-faculty unit “Cognitive Sciences”, Faculty of Human Science, and Faculty of Health Sciences Brandenburg, Research Area Services Research and e-Health, University of Potsdam, Potsdam, Germany
Friederike Deeken & Michael A. Rapp
Department of Psychiatry and Neurosciences | CCM, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Department of Pediatric Surgery, Augustenburger Platz 1, 13353, Berlin, Germany
Julia Wenzel, Friederike Wedemeyer, Nadja S. Bahr, Claudia Ebrahimi, Maria Garbusow, Marie Heigert, Patricia Pelz, Carlotta Riemerschmid, Annika Rosenthal, Shuyan Liu, Henrik Walter & Andreas Heinz
Center for Information Services and High Performance Computing (ZIH), Technische Universität Dresden, Dresden, Germany
Alvaro Aguilera, Jens Strehle & Gesine Wieder
Department of Addictive Behavior and Addiction Medicine, Central Institute of Mental Health, Medical Faculty Mannheim, Heidelberg University, Mannheim, Germany
Acelya Aslan, Patrick Bach, Marvin Ganz, Damian Karl, Alfred Wieland, Judith Zaiser, Sina Zimmermann & Bernd Lenz
Department of Clinical Psychology, Central Institute of Mental Health, Medical Faculty Mannheim, University of Heidelberg, Mannheim, Germany
Mathieu Pinger & Franziska Weiss
Department of Psychology, Technische Universität Dresden, Dresden, Germany
Thomas Goschke
State Key Laboratory of Cognitive Neuroscience and Learning, IDG/McGovern Institute for Brain Research, Beijing Normal University, Beijing, China
Ray Dolan
BIH Visiting Professor, Stiftung Charité, Department of Psychiatry and Psychotherapy, Charité – Universitätsmedizin, Berlin, Germany
Ray Dolan

Authors

Hilmar Zech
View author publications
You can also search for this author in PubMed Google Scholar
Maria Waltmann
View author publications
You can also search for this author in PubMed Google Scholar
Ying Lee
View author publications
You can also search for this author in PubMed Google Scholar
Markus Reichert
View author publications
You can also search for this author in PubMed Google Scholar
Rachel L. Bedder
View author publications
You can also search for this author in PubMed Google Scholar
Robb B. Rutledge
View author publications
You can also search for this author in PubMed Google Scholar
Friederike Deeken
View author publications
You can also search for this author in PubMed Google Scholar
Julia Wenzel
View author publications
You can also search for this author in PubMed Google Scholar
Friederike Wedemeyer
View author publications
You can also search for this author in PubMed Google Scholar
Alvaro Aguilera
View author publications
You can also search for this author in PubMed Google Scholar
Acelya Aslan
View author publications
You can also search for this author in PubMed Google Scholar
Patrick Bach
View author publications
You can also search for this author in PubMed Google Scholar
Nadja S. Bahr
View author publications
You can also search for this author in PubMed Google Scholar
Claudia Ebrahimi
View author publications
You can also search for this author in PubMed Google Scholar
Pascale C. Fischbach
View author publications
You can also search for this author in PubMed Google Scholar
Marvin Ganz
View author publications
You can also search for this author in PubMed Google Scholar
Maria Garbusow
View author publications
You can also search for this author in PubMed Google Scholar
Charlotte M. Großkopf
View author publications
You can also search for this author in PubMed Google Scholar
Marie Heigert
View author publications
You can also search for this author in PubMed Google Scholar
Angela Hentschel
View author publications
You can also search for this author in PubMed Google Scholar
Matthew Belanger
View author publications
You can also search for this author in PubMed Google Scholar
Damian Karl
View author publications
You can also search for this author in PubMed Google Scholar
Patricia Pelz
View author publications
You can also search for this author in PubMed Google Scholar
Mathieu Pinger
View author publications
You can also search for this author in PubMed Google Scholar
Carlotta Riemerschmid
View author publications
You can also search for this author in PubMed Google Scholar
Annika Rosenthal
View author publications
You can also search for this author in PubMed Google Scholar
Johannes Steffen
View author publications
You can also search for this author in PubMed Google Scholar
Jens Strehle
View author publications
You can also search for this author in PubMed Google Scholar
Franziska Weiss
View author publications
You can also search for this author in PubMed Google Scholar
Gesine Wieder
View author publications
You can also search for this author in PubMed Google Scholar
Alfred Wieland
View author publications
You can also search for this author in PubMed Google Scholar
Judith Zaiser
View author publications
You can also search for this author in PubMed Google Scholar
Sina Zimmermann
View author publications
You can also search for this author in PubMed Google Scholar
Shuyan Liu
View author publications
You can also search for this author in PubMed Google Scholar
Thomas Goschke
View author publications
You can also search for this author in PubMed Google Scholar
Henrik Walter
View author publications
You can also search for this author in PubMed Google Scholar
Heike Tost
View author publications
You can also search for this author in PubMed Google Scholar
Bernd Lenz
View author publications
You can also search for this author in PubMed Google Scholar
Jamila Andoh
View author publications
You can also search for this author in PubMed Google Scholar
Ulrich Ebner-Priemer
View author publications
You can also search for this author in PubMed Google Scholar
Michael A. Rapp
View author publications
You can also search for this author in PubMed Google Scholar
Andreas Heinz
View author publications
You can also search for this author in PubMed Google Scholar
Ray Dolan
View author publications
You can also search for this author in PubMed Google Scholar
Michael N. Smolka
View author publications
You can also search for this author in PubMed Google Scholar
Lorenz Deserno
View author publications
You can also search for this author in PubMed Google Scholar

Consortia

the ReCoDe-Consortium

Contributions

All authors contributed to data collection. HZ, MW, YL and LD analyzed the data. HZ and LD drafted the initial version of the manuscript. MS and LD primarily designed the study in consultation with AH, MR, UEB, HT, MAR and RJD. RB and RR consulted regarding adaptions of the GBE and provided analysis scripts. All authors critically read, gave comments, and revised the manuscript.

Corresponding authors

Correspondence to Hilmar Zech or Lorenz Deserno.

Ethics declarations

Conflict of interest

Dr. Ebner-Priemer reports consultancy for Boehringer-Ingelheim.

Additional information

Open data statement

All materials and analysis scripts required to replicate our results have been deposited in Open Science Framework (https://osf.io/9ze2u/).

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (DOCX 522 KB)

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Reprints and permissions

About this article

Cite this article

Zech, H., Waltmann, M., Lee, Y. et al. Measuring self-regulation in everyday life: Reliability and validity of smartphone-based experiments in alcohol use disorder. Behav Res 55, 4329–4342 (2023). https://doi.org/10.3758/s13428-022-02019-8

Download citation

Accepted: 03 November 2022
Published: 12 December 2022
Issue Date: December 2023
DOI: https://doi.org/10.3758/s13428-022-02019-8

Keywords

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

Measuring self-regulation in everyday life: Reliability and validity of smartphone-based experiments in alcohol use disorder

Abstract

Similar content being viewed by others

Sorry—Bad Habit! Validation of the German Self-Report Habit Index with a Test for Its Relation to Potentially Addictive Forms of Health-Risk Behaviors

Do daily fluctuations in inhibitory control predict alcohol consumption? An ecological momentary assessment study

Self-help Smartphone Applications for Alcohol Use, PTSD, Anxiety, and Depression: Addressing the New Research-Practice Gap

Introduction

Results

Study design

Reliability

Split-half reliability

Test–retest reliability

Construct validity

Reliability of factor scores

Correlation of task measures with drinking

Discussion

Improving task reliability through joint modeling

Reliability based on factor scores

State-dependent process?

Construct validity

Clinical sample and correlation with drinking

Limitations

Conclusions

Materials and methods

General procedure

Participants

Inhibition task

Working memory task

Risk taking task

Information sampling task

Reliability

Modeling sessions separately vs. jointly

Split-half and test–retest reliability

Factor and clustering analysis

Notes

References

Funding

Author information

Authors and Affiliations

Consortia

the ReCoDe-Consortium

Contributions

Corresponding authors

Ethics declarations

Conflict of interest

Additional information

Publisher’s note

Supplementary Information

Supplementary file1 (DOCX 522 KB)

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation