1 Introduction

Flow, an optimal experience often colloquially referred to as being ‘in the zone’, has been studied extensively for more than 40 years because it is recognised as an important concept in the scholarly literature and popular culture. This psychological state of absorption in, and effortlessness of, one’s actions (Csikszentmihalyi, 1975) has historically been conceptualised through Csikszentmihalyi’s model of flow, which contains nine key dimensions (see Jackson & Csikszentmihalyi, 1999). Continued debate regarding measurement and conceptual inconsistencies, however, has led to recent appraisals that “flow research is approaching a crisis point” (Swann et al., 2018, p. 249). Specifically, Swann et al. (2018) contended that Csikszentmihalyi’s flow model is not sufficiently mechanistic to be considered a ‘theory’, as it lacks specific definitions and propositions underpinning testable causal relationships, has issues of discriminant validity, and conflates flow with other states within the nine-dimensional model. The nine-dimensional model has also been criticised for construct validity issues and relational ambiguity between dimensions (e.g., Engeser & Schiepe-Tiska, 2012; Heutte et al., 2021), theoretical incompatibility with other established psychological theories (e.g., self-efficacy; Jackson & Csikszentmihalyi, 1999; Swann et al., 2018), and inconsistent application and assessment across disciplines (e.g., Norsworthy et al., 2021; Peifer, 2012). Further, in experimental research the nine-dimensional model is often replaced by a unidimensional measure (challenge-skill balance) that (a) may not reflect the experience of flow, (b) is commonly accepted as an antecedent to flow rather than an experiential dimension, and (c) is most commonly assessed objectively through difficulty level with no subjective input (Norsworthy et al., 2021).
In a recent collection of commentaries and articles on flow, Peifer and Engeser (2021a, b) concluded that “a gold standard for the modelling and measurement of flow is not at close reach” (p. 61). Reflecting on recent research, Peifer et al. (2022) proposed that a core experience of flow can be considered. Taken together, these critiques highlight the need for researchers to reconsider issues of definition and conceptualization of flow, and then to devise (or identify an existing) instrument that appropriately assesses any theoretical advancements regarding the flow construct.

In order to chart disparities and commonalities in contemporary flow research, Norsworthy et al. (2021) recently conducted a scoping review encompassing over 230 flow-related works spanning multiple scientific disciplines such as psychology, physiology, and neuroscience. Norsworthy and colleagues reported that flow was assessed using 141 different measures and described using 108 varying constructs, terms, or dimensions—targeting all, some, or none of Csikszentmihalyi’s nine dimensions (e.g., Fong et al., 2015; Zito et al., 2018). Within Norsworthy et al.’s synthesis, for instance, it was reported that Heutte et al. (2016) observed that barely half of the nine dimensions were perceived by learners in educational settings, and that specific dimensions—such as ‘time transformation’ or ‘loss of self-consciousness’—are only considered relevant in specific contexts (e.g., Sinnamon et al., 2012; Swann et al., 2012). A common theme that emerged in Norsworthy and colleagues’ review was the use of varied descriptive constructs that, despite similarities with one (or more) of Csikszentmihalyi’s dimensions, were contributing to challenges when synthesising research findings (also see Auld, 2014). Examples of such instances included the use of ‘effortlessness’ in psychophysiological settings (Bian et al., 2016; De Manzano et al., 2013) to characterise Csikszentmihalyi’s ‘sense of control’; ‘telepresence’ in human computer interface and gaming contexts (e.g., Klasen et al., 2012; Lazoc & Luiza, 2012) to characterise absorption or ‘merging of action and awareness’; and the use of ‘enjoyment’ or ‘intrinsic reward’ (e.g., Llorens et al., 2013; Romero & Calvillo-Gamez, 2014) to represent ‘autotelic experience’.

Based on their review findings, Norsworthy et al. (2021) concluded that, despite substantial differences in terminology, flow researchers across scientific disciplines appeared to commonly conceptualise two key antecedents to flow, three core experiential dimensions, and three outcome themes. The antecedents were ‘optimal challenge’, which incorporated the concepts of clear task goals and immediate, unambiguous feedback, and ‘high motivation’, which subsumed the themes of interest, subjective value, and intrinsic or extrinsic motives. The three core experiential dimensions that characterise the construct or experience itself were identified as ‘absorption’, ‘effort-less control’, and ‘intrinsic reward’. Lastly, three common outcome themes of ‘positive development’, ‘high functioning’, and ‘further engagement’ were proposed. Similarly, in a recent literature review targeting theoretical integration in the field of flow research, Peifer and Engeser (2021b) expressed support for synthesising flow descriptors into three similar core experiential constructs (i.e., absorption, perceived demand-skill balance, enjoyment).

1.1 Three Core Experiential Dimensions to Flow

With respect to the absorption dimension of flow, Norsworthy et al. (2021) defined ‘absorption’ as “a state of absorption in the task characterised by focused, undistracted attention, and a merging of action and awareness”. From a neuroscientific perspective, Dietrich (2004) and Norsworthy et al. (2021) highlighted that flow is thought to occur through a depleting ‘onion-peeling’ effect (or down-regulation) of higher cognitive processes as attentional resources are reallocated to deal with the growing demands of the task (also see Sadlo, 2016; Ulrich et al., 2016). In line with Peifer and Engeser’s (2021b) rationale, Norsworthy et al. outlined that as absorption occurs to meet the complexity of the task, a greater number of unnecessary higher cognitive functions (i.e., reflective self-consciousness or time monitoring) are down-regulated to free up attentional capacity, which gives rise to frequent descriptions of flow nuances such as ‘loss of self-consciousness’, ‘time transformation’, and a ‘feeling of connection’. This delayering process may help to account for the seemingly continuous experience (i.e., mild to intense) of flow, after an initial threshold or discrete measure is passed (Norsworthy et al., 2021).

‘Effort-less control’ was defined by Norsworthy et al. (2021) as “a high sense of control in which the task feels less effortful than is typical for that person, characterised by fluidity of performance and an absence of concern over losing control”. This dimension, in which one’s high sense of control in the act feels less effortful (though not void of all effort), differentiates flow from other models of focused engagement that require high degrees of felt effort (i.e., cognitive/mental effort and high arousal) to override external distractors, and from other immersive states that may be enjoyable but do not permit high functioning (Harris et al., 2017; Norsworthy et al., 2021; Peifer et al., 2014; Romero & Calvillo-Gamez, 2014; Tozman et al., 2015). The dimension ‘effort-less control’ differs from Peifer and Engeser’s (2021b) utilisation of ‘perceived demand-skill balance’ on three key points. First, although Peifer and Engeser (2021b) explained that perceived demand-skill balance represents the high level of control felt in flow, sources within Norsworthy and colleagues’ review highlighted (in both psychological and neuroscientific research) that it is the sense of ‘effortlessness’ accompanying the sense of control (i.e., a subjective sense of the act being less effortful or more fluid than usual) that differentiates flow from other forms of high control (also see Peifer & Tan, 2021). Second, the optimal level of challenge (i.e., perceived demand-skill balance) is widely recognised (including by Peifer and Engeser) as an antecedent to, rather than a dimension of, flow (also see Barthelmäs & Keller, 2021). Lastly, Peifer and Engeser’s (2021b) utilisation of ‘perceived demand-skill balance’ posits that flow must derive from a situation in which the individual’s skill is being challenged. Norsworthy and colleagues’ utilisation of ‘effort-less control’, however, also accounts for the high degree of felt control in flow within scenarios that are not specific to a demand-skill balance, such as non-achievement scenarios (e.g., an interesting conversation uninspired by achievement motives; see Engeser & Schiepe-Tiska, 2012).

The third dimension of flow, ‘intrinsic reward’, is characterised by positive valence and optimal levels of arousal (also see Peifer et al., 2014; Ulrich et al., 2016). This dimension is evident in the activation of midbrain reward structures (Nah et al., 2017) and increased dopamine production that occurs during flow (Bian et al., 2016). The label ‘intrinsic reward’ captures the autotelic experience (as used by Csikszentmihalyi, 2014) and enjoyment (as used by Peifer & Engeser, 2021b) of flow and was chosen by Norsworthy et al. (2021) because the term better represents what was observed in the literature; in fact, Csikszentmihalyi himself often utilises the term ‘intrinsic reward’. ‘Intrinsic reward’ is more widely applicable across scientific disciplines and can be assessed physiologically (e.g., dopamine levels) without involving reflective cognitive processes that occur following the flow experience. Determining one’s level of enjoyment, by contrast, requires such reflection, which opens up the possibility of bias from outcomes and contextual or social factors (Abuhamdeh, 2021). For further details, see Norsworthy et al.’s (2021) and Abuhamdeh’s (2021) reviews on the relationship between flow and enjoyment.

Given the scope and findings of their review (which included an examination of existing flow instruments), Norsworthy et al. (2021) suggested that no existing flow instrument adequately assesses this three-dimensional conceptualisation of flow, and that a new instrument was required to assess these dimensions and this exact conceptualisation. Although many flow measurements exist, they assess one, some, or none of the three dimensions, often assessing similar dimensions (e.g., enjoyment) that may bear resemblance to one of the three dimensions (e.g., intrinsic reward) but differ in dimensional meaning. The main purpose of this study, therefore, was to examine existing instruments, to develop an instrument that could assess flow experiences using this three-dimensional conceptualisation of flow, and to examine aspects of the reliability and validity of scores derived from that instrument. Instrument development followed Boateng et al.’s (2018) three-phased approach (i.e., item development, scale development, scale evaluation) and was grounded in Messick’s (1995) recommendations for the assessment of different aspects of construct validity. Specifically, we sought to provide preliminary evidence for the content (relevance, representativeness, and technical quality), substantive (theoretical rationale), structural (factor structure), external (convergent and discriminant evidence), and generalizability (scores across populations, settings, and tasks) aspects of validity.

1.2 Transparency and Openness

We describe our sampling plan, all data exclusions (if any), all manipulations, and all measures in the study, and we adhered to the Transparency and Openness Promotion (TOP) Guidelines (Nosek et al., 2015). All data, analysis syntax, and research materials can be found in the article, S1, and Appendix 1. Data analysis is detailed below. The study protocol was approved (RA/4/20/609) by the lead author’s institutional ethics committee prior to data collection.

2 Phase 1: Item and Scale Development

We used a multi-stage approach for item development including theoretical consultation, review of existing instruments, item generation, item review, expert review of items, and target population review of items. Norsworthy et al.’s (2021) scoping review findings (see Table 1) provided the conceptual justification for item generation, and supported the substantive aspect of validity for instrument development. Although antecedent constructs have been assessed within existing flow measures (e.g., Flow Short Scale, Rheinberg et al., 2002), Norsworthy et al.’s proposed antecedents to flow (i.e., ‘optimal challenge’, ‘high motivation’) were excluded from item development to ensure the new instrument assessed the flow experience itself, and did not conflate experience with putative antecedents or pre-conditions.

Table 1 Norsworthy et al.’s (2021) Flow definition, dimensions, and descriptions

2.1 Examining Existing Instruments

An initial search of the databases Web of Science (1900), PubMed (1997), Medline (1966), Scopus (1966), Embase (1947), PsycINFO (1806), SPORTDiscus (1930), and Google Scholar was conducted to ensure no existing instrument adequately assessed these three dimensions. No time parameters were applied. Many flow assessments exist (including over 20 Likert scale instruments), but they assess one, some, or none of the three dimensions (absorption, effort-less control, intrinsic reward), often assessing similar dimensions (e.g., high control or enjoyment) that may bear resemblance to one of the three dimensions (e.g., effort-less control or intrinsic reward) but differ in dimensional meaning. The widely used FSS-2 (Jackson & Eklund, 2002), for example, assesses all nine dimensions of flow; not only does it include the antecedents of optimal challenge (and its sub-dimensions, clear goals and unambiguous feedback), it also does not directly assess effort-less control, and it assesses the nuanced dimension of ‘time transformation’, whose validity is often criticised. The Autotelic Personality Questionnaire (APQ; Tse et al., 2020) assesses personality traits rather than the flow experience, and many of the physiological assessments mentioned above are still elementary and do not distinguish whether they assess a specific dimension or the experience as a whole. The Flow Short Scale (FSS; Rheinberg et al., 2002) was considered the closest existing instrument to the three dimensions posited in Norsworthy et al.’s scoping review (see Norsworthy et al., 2021, Chap. 2), owing to its measures of fluency of performance and absorption. The FSS consists of 13 items that are suggested to predominantly assess the two factors of ‘fluency of performance’ and ‘absorption’ (whilst also offering questions to assess ‘perceived importance’, ‘perceived outcome importance’, ‘demand’, ‘skills’, and ‘perceived fit of demands and skills’), resembling two of the three dimensions targeted in this study (i.e., absorption, effort-less control). The FSS, however, (a) does not fully assess the dimensions (and sub-dimensions) of absorption and effort-less control as laid out in the scoping review, (b) includes antecedent constructs, and (c) does not assess the intrinsic reward dimension. Although the existing options for assessing flow are plentiful, and utilising (or further developing) an existing measure would certainly be advantageous, our examination made clear that no existing instrument adequately assesses the core conceptualisation of flow that Norsworthy et al. (2021) highlighted in their scoping review. The existing measures have substantive issues, such as confounding antecedents with the experience itself and not directly assessing the effortlessness associated with high degrees of control, which is what specifically delineates flow from similar states (e.g., clutch states) rather than conflating them.

2.2 Item Generation

Given that no adequate existing instrument was available, the lead author initially generated a pool of 60 potential items (20 for each dimension) to assess the three flow experience dimensions. These items were generated from dimensional and sub-dimensional descriptions and by adapting questions from existing scales that targeted the same (or similar) dimensions. Two co-authors, both of whom had extensive experience with instrument development procedures, then provided detailed review and comments (e.g., identifying redundancy, ambiguity, and double-barrelled questioning) that resulted in a revised, more parsimonious pool of 36 potential items (12 for each dimension).

2.3 Expert and Target Population Testing

The initial pool of 36 items was first assessed by an expert panel comprising five scholars with research expertise in the topic (Boateng et al., 2018). These experts were chosen for their relevant academic experience, evidenced by continuous publications in the motivational and psychological literature, and were recruited through direct contact and the authors’ academic network. They were asked to provide feedback on item content with an emphasis on relevance and technical quality (e.g., representativeness, understanding, jargon, overlap, ambiguity; Haynes et al., 1995). The experts were presented with all items alongside definitions of the three dimensions (which were used as criteria for the rating). They were then asked to provide a rating from 1 (not representative) to 5 (clearly representative) in terms of how well each item represented the targeted dimension. Additionally, we requested qualitative feedback on item clarity, wording, and possible dimensional overlap. Surveys were administered through the Qualtrics online platform and included written instructions and definitions of key terms prior to commencement.

To gather feedback from the target population (which also served as a ‘non-academic’ form of feedback), a group of 15 adults from the general population (6 males, 9 females) was also recruited through the lead author’s network. The participants came from a range of professions (e.g., student, teacher, executive, designer, sports coach, and retail). These participants were asked to undertake an identical task. Together, these consultations resulted in 8 items being dropped and minor wording changes being made to ensure item clarity and relevance. For example, items with a mean score below 4.4 were cut, and items containing the word ‘performance’ or terms deemed ‘too academic’ were removed or changed. As a result of these review stages, the initial pool of 60 items was reduced to 28 items. These items are presented in the Supplementary Material (Table S1; S = supplementary material).
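The rating-based part of this screening can be sketched as follows. This is only an illustration of the stated cut-off (mean representativeness rating below 4.4): the item labels and ratings are hypothetical placeholders rather than study data, and the qualitative wording review is not modelled.

```python
from statistics import mean

# Hypothetical ratings (1 = not representative, 5 = clearly representative)
# from five raters per item; these values are illustrative only.
ratings = {
    "item_a": [5, 5, 4, 5, 5],
    "item_b": [4, 4, 5, 4, 4],
    "item_c": [5, 4, 5, 5, 4],
}

CUTOFF = 4.4  # items with a mean rating below this value are dropped


def screen_items(item_ratings, cutoff=CUTOFF):
    """Split items into retained and dropped lists by mean rating."""
    retained, dropped = [], []
    for name, scores in item_ratings.items():
        (retained if mean(scores) >= cutoff else dropped).append(name)
    return retained, dropped


retained, dropped = screen_items(ratings)
```

With these placeholder ratings, only the item whose mean falls below the cut-off is dropped.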

3 Phase 2: Evaluation Through Factor Analysis and Validity Testing

3.1 Overview and Procedure

We began Phase 2 with these 28 items and sought to test and further refine the item pool through the recruitment of a large sample of participants and iterative factor analytic methods. Our primary aim in Phase 2 was to perform exploratory and confirmatory factor analyses with the goal of testing and further refining the item pool and instrument. We also sought to (a) examine aggregate- and item-level descriptive statistics, and (b) assess external aspects of validity by examining item responses against related constructs and general appraisals of flow.

We used a 7-point Likert-type response scale for the flow items anchored at 1 (Strongly Disagree) and 7 (Strongly Agree). This response scale was selected because there is evidence that scale reliability and validity are generally improved using a 7-point scale (Dawes, 2008). In addition to the flow items, we assessed related constructs and included questions assessing general appraisals of flow (see Stage 3 for full details). We collected data through a cross-sectional survey with 913 participants (from an initial pool of 956, of which 43 [4.5%] had large sections of incomplete data). This sample consisted of adults (635 females, 285 males, 3 prefer not to say) from English-speaking countries (United Kingdom, United States, Australia, New Zealand, Ireland, South Africa), and all participants were recruited through the Prolific data collection platform. Prolific is a research recruitment and data collection company designed for use by certified academic institutions and only for research approved by an institutional review board. Prolific samples are increasingly prevalent in research reports, especially when online methods are used for data collection within the behavioural sciences (see Buhrmester et al., 2018; Palan & Schitter, 2018); there is evidence that online methods may reduce sample biases in comparison to traditional data collection approaches (e.g., Gosling et al., 2004). To ensure adherence and good quality data, time limits were set, participants were paid above-average rates (see Prolific), and two attention tests were included in the questions (Aguinis et al., 2021).

All participants were provided electronically with an information letter and provided informed consent prior to completing the questionnaire. Data were downloaded and stored in a de-identified spreadsheet on a secure server by the first author to ensure participant confidentiality. Participants were instructed to engage in an activity of their choice before commencing the survey; they were asked to complete the questionnaire immediately after the activity so that recall was as close to the event as possible (as recommended for flow measures; Norsworthy et al., 2017). Participants were instructed that they could use an activity that they had already completed (within the last hour) before filling in the questionnaire, and that if no activity had been carried out they should stop and engage in one before continuing. Participants consented to having carried out the activity. Items in the survey were focused on participants’ thoughts and feelings experienced during the focal activity. Data from approximately half of the total sample (n = 452) were randomly apportioned for exploratory factor analytic purposes (see Stage 1 below) to refine the item pool and instrument, and the remaining data (n = 461) were assigned for confirmatory factor analytic purposes (see Stage 2 below). During analysis, data were initially screened for missing values. Pairwise deletion was used for the few (< 0.01%) missing data points during the factor analysis. Finally, to assess discriminant and convergent validity, we used all participant data (n = 913) to examine correlations between flow scores and related variables (see Stage 3 below), using only the subscale and global scores that were ultimately retained for the ‘final’ instrument following Stages 1 and 2. We felt that (re-)using the entire sample for this purpose was likely to generate more representative correlation values (rather than, for example, computing these correlations separately with the sub-samples used for the Stage 1 and Stage 2 analyses).
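A random apportionment of this kind can be sketched as below. The source does not state how the split was implemented, so the function name, the sequential participant IDs, and the fixed seed are assumptions made purely for illustration.

```python
import random


def split_sample(participant_ids, n_first, seed=0):
    """Randomly apportion IDs into two non-overlapping subsamples."""
    rng = random.Random(seed)  # fixed seed (an assumption) keeps the split reproducible
    shuffled = list(participant_ids)
    rng.shuffle(shuffled)
    return shuffled[:n_first], shuffled[n_first:]


# 913 retained participants: 452 for the EFA, the remaining 461 for the CFA.
ids = range(1, 914)
efa_ids, cfa_ids = split_sample(ids, n_first=452)
```

Shuffling once and slicing guarantees that every participant lands in exactly one subsample.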

3.2 Stage 1: Exploratory Factor Analysis

3.2.1 Method

Exploratory factor analysis (EFA) was performed in Stage 1 to (a) explore the factor structure of items, (b) examine the structural aspect of validity (i.e., examine the higher-order structure of the item pool), (c) eliminate problematic (i.e., high cross-loading) items, and (d) identify the most parsimonious instrument (Boateng et al., 2018). Specifically, a principal axis factor analysis with a promax rotation was carried out. Promax was chosen because it is an oblique rotation method, which allows the extracted factors to correlate. The analysis was performed using the EFA.dimensions package for R (version 0.1.7.6; O’Connor, 2023). This package was also used to determine the number of factors to extract. Eight such tests were applied: the empirical Kaiser criterion, comparison data, the Hull method, Velicer’s minimum average partial test, parallel analysis, the salient loadings criterion, the standard error scree test, and the sequential chi-square model test. Parallel analysis was run using a principal components extraction and 95th percentile parallel eigenvalues, as these specifications show the highest accuracy across varied data conditions (Auerswald & Moshagen, 2019).
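The retention rule behind parallel analysis can be sketched as follows: factors are counted for as long as each observed eigenvalue exceeds the corresponding 95th percentile eigenvalue obtained from random data. The sketch assumes both sets of eigenvalues have already been computed; the numbers are hypothetical and are not the study's values, and this is not the EFA.dimensions implementation.

```python
def n_factors_parallel(observed, reference):
    """Count leading observed eigenvalues that exceed their reference value."""
    count = 0
    for obs, ref in zip(observed, reference):
        if obs <= ref:
            break  # stop at the first eigenvalue not above its reference
        count += 1
    return count


# Hypothetical eigenvalues for illustration only (not the study's values).
observed_eigs = [9.8, 2.9, 2.1, 1.5, 0.9, 0.7]
reference_eigs = [1.5, 1.4, 1.35, 1.3, 1.25, 1.2]  # 95th percentile, random data

n_retain = n_factors_parallel(observed_eigs, reference_eigs)
```

Stopping at the first non-exceeding eigenvalue reflects the usual sequential reading of the criterion.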

3.2.2 Results

A KMO value greater than 0.6 indicates factor analysis is appropriate, and greater than 0.9 is ‘marvelous’ (Kaiser, 1974; Kaiser & Rice, 1974). A significant Bartlett test (p < .05) is indicative of adequate conditions to fit a factor analysis. A strong KMO value (.92) and significant Bartlett’s test of sphericity (χ²(378) = 8424.89, p < .001) indicated adequacy of fit for factor analysis. Six of the eight factor extraction tests indicated that four was the optimal number of factors to extract. The standard error scree test and the sequential chi-square model test suggested extracting 7 and 12 factors, respectively. When there is an incongruence between the results of the sequential chi-square model test and parallel analysis, Auerswald and Moshagen (2019, p. 487) recommend referring to comparison data or the empirical Kaiser criterion (both of which suggested a four-factor solution). Accordingly, four factors were extracted. Figure S1 presents a plot with the observed and 95th percentile parallel eigenvalues produced as part of the parallel analysis.
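As a check on the degrees of freedom reported for Bartlett's test, the statistic is commonly written as below. The symbols are notation introduced here rather than taken from the source: p is the number of items, n the number of respondents, and R the item correlation matrix. With the 28 items entering the EFA, the degrees of freedom work out to 378.

```latex
\chi^2 = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R\rvert,
\qquad
df = \frac{p(p-1)}{2} = \frac{28 \times 27}{2} = 378 .
```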

Together, the four factors explained 63% of the variance in the data. Eigenvalues and the variance explained per factor are presented in Table 2. Factor loadings and communalities are also presented in this table. Based on consideration of item content, the factors were labelled (1) absorption, (2) effort-less control, (3) intrinsic reward, and (4) intuiting. This fourth factor was unexpected based on the theoretical model on which scale items were constructed.

Table 2 Promax rotated factor loadings, extraction communalities (h²), proportion of variance explained by retained factors, and rotated factor correlations

With the goal of developing a parsimonious (final) instrument, the strongest 3 items on each of the four factors (12 items in total) were retained for confirmatory analysis (Stage 2). Items were retained based on high factor loadings (with loadings ≥ 0.45, 0.55, 0.63, and 0.71 being considered fair, good, very good, and excellent, respectively; Tabachnick & Fidell, 2013), low cross-loadings (≤ 0.32; Tabachnick & Fidell, 2013), and expert judgement as to whether items adequately covered all aspects of each construct. For example, Item 14 showed a lower factor loading but was retained for the CFA as the item description (i.e., letting it happen rather than making it happen) has been used as a key differentiating description between flow and other similar states (Swann et al., 2016).
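The numerical part of these retention criteria can be expressed as a small helper, sketched below. The function names are hypothetical, and the sketch covers only the loading and cross-loading thresholds; the expert judgement that led to Item 14 being retained is not captured by it.

```python
def loading_label(loading):
    """Label an absolute factor loading using the cut-offs cited above."""
    size = abs(loading)
    if size >= 0.71:
        return "excellent"
    if size >= 0.63:
        return "very good"
    if size >= 0.55:
        return "good"
    if size >= 0.45:
        return "fair"
    return "below fair"


def passes_screen(primary, cross_loadings, max_cross=0.32):
    """True if the primary loading is at least fair and no cross-loading exceeds max_cross."""
    return abs(primary) >= 0.45 and all(abs(c) <= max_cross for c in cross_loadings)
```

For instance, `passes_screen(0.80, [0.10, 0.05, 0.20])` is true, whereas a cross-loading of 0.40 would fail the screen.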

3.3 Stage 2: Confirmatory Factor Analysis

3.3.1 Method

In Stage 2, confirmatory factor analysis (CFA) was conducted on the proposed 12 Psychological Flow Scale (PFS) items. Norsworthy et al.’s (2021) scoping review highlighted that, conceptually speaking, the flow experience should not be reduced solely to its dimensions, or be represented by only one or some of its dimensions (as researchers have done in the past); rather, flow is posited to occur when all three dimensions are present and interactive. Empirical testing to ascertain whether the PFS could include a single global latent factor (i.e., a global flow score, as opposed to the instrument being limited to the assessment of the factors/dimensions) was therefore deemed important. Accordingly, a series of models was tested: (1) a global latent factor model in which items were treated as reflective indicators of a single global flow factor, (2) a four-factor, orthogonal model in which items were treated as reflective indicators of their purported factors as indicated by the EFA (absorption, effort-less control, intrinsic reward, and intuiting), but with no global flow factor, and (3) a higher-order model in which items were treated as reflective indicators of one of four first-order factors (absorption, effort-less control, intrinsic reward, and intuiting), which were themselves treated as reflective indicators of a second-order factor representing global flow. Figures depicting these models are presented in the supplementary material (Figures S2–S5). Model identification was achieved by setting the variances of latent factors to one. The analysis was carried out in Amos (version 27) using maximum likelihood estimation. When using maximum likelihood estimation, it is recommended that |skew| < 2 and |kurtosis| < 7 for all items (Fabrigar et al., 1999). All items showed skew and kurtosis below these thresholds (see Table S1).
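A check of this distributional guideline can be sketched as follows. The sketch assumes simple moment-based estimators and excess kurtosis (normal distribution = 0); statistical packages often report bias-corrected variants, so values may differ slightly from those in Table S1. The example responses are hypothetical.

```python
def moments_skew_kurtosis(values):
    """Return moment-based skewness and excess kurtosis (requires non-zero variance)."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m3 = sum((v - mu) ** 3 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    skew = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0
    return skew, excess_kurtosis


def within_ml_guideline(values, max_skew=2.0, max_kurtosis=7.0):
    """True if |skew| and |kurtosis| both fall below the stated thresholds."""
    skew, kurt = moments_skew_kurtosis(values)
    return abs(skew) < max_skew and abs(kurt) < max_kurtosis


# Hypothetical responses spread evenly over the 7-point scale.
responses = [1, 2, 3, 4, 5, 6, 7]
skew, kurt = moments_skew_kurtosis(responses)
```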

3.3.2 Results

As can be seen in Table 3, the single global factor model fit the data poorly, suggesting that the scale does not measure one domain only (Dunn & McCray, 2020). The four orthogonal factors model also showed poor fit to the data. However, loadings of items on their factors were generally high (with the exception of item 14, all standardized factor loadings were above 0.71; see Table S2 for factor loadings for Models 1–3). This would suggest that, although items are good indicators of their factors, the scale does not simply measure a series of lower-order dimensions. The higher-order model with one second-order global flow factor and four first-order factors showed marginal fit to the data. Large standardized factor loadings were observed between first-order factors and their purported items (again with the exception of item 14) and between flow and three of the four first-order factors. Flow showed a weak standardized loading on intuiting (0.32). The modification indices indicated that freeing the error terms for the intuiting and effort-less control factors to covary would reduce model discrepancy by 56.92. Freeing these error terms to covary caused the standardized factor loading from flow to intuiting to decrease substantially (from 0.32 to 0.14). We interpreted this to indicate that the intuiting factor was not caused by the global flow factor, but rather only appeared to be caused by flow due to shared variance with effort-less control. The poor factor loading from flow to intuiting, and the fact that an intuiting factor was unexpected based on theoretical understandings of flow (Norsworthy et al., 2021), prompted us to specify a fourth model excluding the intuiting factor and its items. This modified model was found to fit the data well, with the exception of the model χ² test (which is to be expected given the large sample size; Hoyle, 2011). Large standardized factor loadings (all > 0.71; see Table 4) were observed between each first-order factor and its item indicators. Standardized factor loadings between the general flow factor and each first-order factor were all in excess of 0.52. This final model best reflects the theoretical model on which the scale items were constructed. The final model was also rerun with the full dataset (N = 913), producing similar results (reported in Tables S3 and S4).

Table 3 Fit indices for tested CFA models
Table 4 Factor loadings for final model (Modified higher-order model with three first-order factors)

Internal consistency estimates (Cronbach’s α and McDonald’s ω) were then calculated for global flow, absorption, effort-less control, and intrinsic reward on the full dataset. In all cases, these values (reported in Table 5) exceeded 0.82, indicating good internal consistency. Table 5 also reports descriptive statistics and intercorrelations between subscales.
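For readers less familiar with these estimates, Cronbach's α can be computed from the item variances and the variance of the summed score, as in the minimal sketch below. It assumes complete data (no missing responses), uses hypothetical responses, and does not cover McDonald's ω, which requires a fitted factor model.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for complete data: rows are respondents, columns are items."""
    k = len(rows[0])
    item_vars = [variance([row[j] for row in rows]) for j in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical 7-point responses from five respondents to a three-item subscale.
data = [
    [6, 7, 6],
    [5, 5, 6],
    [3, 4, 3],
    [7, 6, 7],
    [4, 4, 5],
]
alpha = cronbach_alpha(data)
```

Alpha approaches 1 as the items covary more strongly relative to their individual variances.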

Table 5 Intercorrelations, reliability estimates (Cronbach’s α and McDonald’s ω), and descriptive statistics for global flow and all subscales among full dataset (N = 913)

3.4 Stage 3: Discriminant and Convergent Validity

In the final stage, we examined external validity (i.e., discriminant and criterion relevance) for scores derived using the final, 9-item version of the PFS through correlational analysis.

3.4.1 Method

To examine discriminant and convergent validity, PFS scores were correlated with constructs that have been previously linked (both positively and negatively) to flow. These constructs included cognitive state anxiety and pressure / tension (e.g., Llorens & Salanova, 2017; Peifer, 2012; Sadlo, 2016), perceived competence (e.g., Schuler & Brandstatter, 2013; Valenzuela et al., 2018), performance (e.g., Csikszentmihalyi et al., 2018; Flett, 2015; Moran, 2012), and autotelic personality (Baumann, 2012; Jackson & Csikszentmihalyi, 1999; Tse et al., 2020). The Flow Short Scale (Rheinberg et al., 2002) was also included to evaluate scale redundancy, as it was considered the closest existing flow scale (in terms of dimensional similarity) to the PFS. Evidence for generalisability (i.e., across activities) was considered inherent within the dataset because participants carried out a wide variety of activities (see Results). Common criticisms of existing psychological measurements of flow (see Ellis et al., 2019; Engeser, 2012; Jackman et al., 2017; Moneta, 2012; Swann et al., 2018) were also addressed by assessing a number of specific indicators of the flow experience. Specifically, participants were given direct questions (see ‘Assessments’ below) targeting self-reports of flow entry, frequency of perceived flow entry and exit within the same event (to account for whether flow occurred once or on multiple occasions), flow intensity (to examine intra-differences), and percentage of time in flow within the given event.

Assessments

Psychological Flow Scale (PFS)

The ‘final’ PFS (for the purpose of this study, as confirmed during stages 1 and 2) consisted of nine items, with three items for each of the three flow dimensions (the items and the dimensions they pertain to are detailed in Appendix 1). A global flow score was determined by averaging responses to the nine items. Subscale scores (i.e., absorption, effort-less control, intrinsic reward) were determined by averaging responses to the three items in each subscale.
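The scoring rule described above can be sketched as follows. The item-to-subscale assignment in `SUBSCALES` is a placeholder chosen for illustration (the actual assignment is given in Appendix 1), and the function name `score_pfs` is hypothetical.

```python
from statistics import mean

# ASSUMPTION: placeholder item positions (0-based) for each subscale;
# consult Appendix 1 for the actual item-to-dimension assignment.
SUBSCALES = {
    "absorption": (0, 1, 2),
    "effortless_control": (3, 4, 5),
    "intrinsic_reward": (6, 7, 8),
}

def score_pfs(responses):
    """Return subscale means and the global mean for one respondent's nine item responses."""
    if len(responses) != 9:
        raise ValueError("expected responses to nine items")
    # Each subscale score is the mean of its three items.
    scores = {name: mean([responses[i] for i in idx]) for name, idx in SUBSCALES.items()}
    # The global flow score is the mean of all nine items.
    scores["global_flow"] = mean(responses)
    return scores

# Made-up responses for one respondent.
example = score_pfs([6, 7, 6, 5, 5, 6, 7, 7, 6])
print(example)
```

Because every subscale has the same number of items, the global score equals the mean of the three subscale scores when no responses are missing.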

Perceived Anxiety

The Mental Readiness Form-3 (MRF-3; Krane, 1994) was utilised to understand participants’ perceived level of anxiety during the task in question. The MRF-3 assesses cognitive state anxiety and was developed as a shorter alternative to the Competitive State Anxiety Inventory-2 (CSAI-2; Martens et al., 1990). The assessment involves one question, “Please rate how you felt during your chosen task/activity”, to which participants respond on a scale from 1 (not worried) to 11 (very worried).

Perceived Competence

The Intrinsic Motivation Inventory (IMI) is a multidimensional instrument intended to assess participants’ subjective experience related to a target activity. The 6-item perceived competence subscale of the IMI (Ryan, 1982) was used in this study to understand participants’ perceived competence (e.g., “I think I am pretty good at this activity”). Responses were made on a response scale anchored at 1 (not at all true) and 7 (very true). Responses to all six items were summed to determine a score for further analyses. The internal consistency estimate for scores derived from this assessment in this study was α = 0.88.

Perceived Stress

The 5-item ‘pressure/tension’ subscale of the IMI (Ryan, 1982) was used to assess participants’ stress perceptions (e.g., “I felt very tense while doing this activity”). Responses were scored on the same 7-point response scale described in the previous section. Responses to all five items were summed to determine a score for further analyses. The internal consistency estimate for scores derived from this assessment in this study was α = 0.82.

Self-Reported Performance

A single self-report item was used to examine perceived performance. Participants responded on a response scale ranging from 1 (very low performance) to 11 (very high performance) to the item, “Please rate how you felt you performed in your chosen activity”.

Autotelic Personality

The Autotelic Personality Questionnaire (APQ; Tse et al., 2020) was used to assess autotelic disposition. The scale consists of 26 items across 7 subscales (curiosity, persistence, low self-centeredness, intrinsic motivation, enjoyment and transformation of challenge, enjoyment and transformation of boredom, attentional control). Example items include “I am curious about the world” (curiosity), “I find it hard to choose where my attention goes” (attentional control), and “I am good at finishing projects” (persistence). Participants responded on a scale ranging from 1 (strongly disagree) to 7 (strongly agree). Responses to all 26 items were summed to determine a composite score for further analyses (Tse et al., 2021). The internal consistency estimate for scores derived from this assessment in this study was α = 0.84.

Flow Short Scale (FSS)

The FSS (Rheinberg et al., 2002) was used because it was deemed a common existing assessment of flow that most closely resembled the PFS. Consistent with Kyriazos (2018), only the first 10 items (out of 13)—which target the flow experience of ‘fluency of performance’ (e.g., “My thoughts/activities run fluidly and smoothly”) and ‘absorption’ (e.g., “I am completely lost in thought”)—were utilised because the remaining 3 questions target perceived (outcome) importance. Responses were made on a scale ranging from 1 (not at all) to 7 (very much). Responses to all 10 items were summed to determine a composite score for further analyses. The internal consistency estimate for scores derived from this measure in this study was α = 0.85.

Activity Details

We used a single item to identify the activity the participant was referring to in their experience: “Please state the task/activity that you used to answer the above questions”. A text area was made available for responses. A second single item was used to identify the duration of the activity: “Please state in minutes how long the task/activity lasted for”. A small text area was made available for responses.

Self-Report Flow Entry

Participants were instructed, “the questions above are designed to assess ‘flow’ experiences. Flow is described as ‘a total engagement in which nothing else matters, actions seem to flow effortlessly—simply participating in the act feels satisfying.’ Based on your experience and responses to the above items, do you think that you experienced ‘flow’ in your recent task/activity?” Participants were asked to respond ‘yes’, ‘no’, or ‘unsure’, with the aim of providing a discrete assessment of flow entry (see discussions around flow being a discrete and continuous construct; Norsworthy et al., 2021; Peifer & Engeser, 2021a, b). If participants answered ‘no’ to this item, they were not asked to complete the following items.

Flow Duration

Participants reported the percentage of time they felt that they were in flow within the given activity using the item, “What percentage of the time were you in flow in relation to the total time (length) of the activity?” A percentage score was designated for responses.

Flow Frequency

To determine flow frequency, participants were asked, “was the flow experience a single experience or did you experience it or enter and exit it on multiple occasions within the activity?” Ordinal responses (‘just once’, ‘on two or three occasions’, ‘multiple occasions’) were presented.

Flow Intensity

Participants reported their perceptions regarding the intensity of the flow experience by responding to the question, “Referring back to the experience of flow, how strongly did you experience flow?” A response scale was provided that ranged from 1 (very weak) to 7 (very strong). See Supplementary Material for these flow items in full.

3.4.2 Results

Activity Details

Participants engaged in over 100 unique types of activities, including physical exercise, running, cooking, golf, music practice, drawing, jigsaw puzzles, email responding, gaming, studying, cycling, fitness work out, housework, aerobics, tennis, work meeting, yoga, knitting, canoeing, researching, exams, carpentry, and work-related tasks. Activities ranged from 6 min in length to 4 h.

Correlates

To assess convergent and discriminant validity and examine the relationship between PFS scores and the above assessments, descriptive statistics and a series of correlation coefficients were computed with the full dataset (n = 913) using IBM SPSS (version 27). See Table 6 for detailed statistics. As expected, perceived anxiety and stress scores significantly negatively correlated with the global PFS score and absorption, effort-less control and intrinsic reward PFS subscale scores. Scores for perceived competence, self-reported performance, autotelic personality, flow (as measured by the FSS), flow intensity, and flow duration all significantly positively correlated with global PFS and all PFS subscale scores. Broadly, these correlation values demonstrate that the nomological network associated with PFS scores ‘operates’ in a way that is consistent with existing flow research, and provide preliminary evidence for the external aspect of validity for PFS scores. Correlations between the above variables and between global PFS and the subcomponents of autotelic personality are reported in Tables S5 and S6, respectively.
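The correlational analyses were run in SPSS. For readers who want to see the underlying computation, the sketch below calculates a Pearson correlation and its t statistic in plain Python. The data are made up, and the function names `pearson_r` and `t_statistic` are hypothetical. It illustrates the statistic only and does not reproduce the values in Table 6.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 3:
        raise ValueError("need two sequences of equal length with at least 3 values")
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def t_statistic(r, n):
    """t statistic (n - 2 degrees of freedom) for testing r against zero; requires |r| < 1."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

# Made-up scores for six respondents: global PFS and MRF-3 anxiety ratings.
pfs = [6.2, 5.8, 4.1, 6.7, 3.9, 5.0]
anxiety = [2, 3, 8, 1, 9, 5]
r = pearson_r(pfs, anxiety)
print(f"r = {r:.2f}, t({len(pfs) - 2}) = {t_statistic(r, len(pfs)):.2f}")
```

In this toy example higher flow scores go with lower anxiety ratings, so r is negative, which matches the direction of the association reported above.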

Table 6 Correlations between global flow and subscales and theoretically relevant variables

The examination of general flow appraisals revealed that 71% (n = 648) of participants reported being in flow (i.e., flow entry), 18% (n = 164) reported not being in flow, and 11% (n = 101) reported being unsure as to whether they were in flow or not (see Table S7 for further details). On average, participants reported being in flow 66.49% (SD = 24.48) of the time during their nominated activity (flow duration). They also reported a mean flow intensity of 6.5 out of 10 (SD = 1.6). See Figs. S6 and S7 for histograms of flow duration and intensity. Flow duration and flow intensity both correlated positively and significantly with global PFS scores, and the PFS subscale scores for absorption, effort-less control, and intrinsic reward. See Table S8 for information on frequency of flow within a single event. For the interested reader, a positive and significant correlation was evident between the percentage of time participants self-reported to be in flow and self-reported intensity of flow (r = .47, p < .001). Taken together, these correlations with global and subscale PFS scores—derived using simple, naturalistic assessments of flow characteristics—offer evidence that the PFS (and its dimensions) may provide insight into important elements of one’s overall flow experience in a given activity.

4 Discussion

In response to recent critiques of the flow literature (e.g., Peifer & Engeser, 2021a, b; Swann et al., 2018), Norsworthy et al. (2021) conducted a scoping review and identified three core ‘flow experience’ themes (i.e., absorption, effort-less control, and intrinsic reward) that appear to exist across scientific disciplines and activity domains. Although the area is replete with measurement instruments, existing psychological assessments show substantive issues (such as confounding antecedents with the experience measure, and not directly assessing the effortlessness associated with high degrees of control that specifically delineates, rather than conflates, flow from other similar states), and no instrument exists that adequately assesses those three dimensions of the flow experience. The purpose of this study was to develop, and provide preliminary evidence of reliability and validity for, a new psychological flow scale (PFS). Overall, our findings offered insight into Messick’s (1995) content, substantive, structural, and external aspects of validity regarding the PFS across a variety of activity domains, and support for further use of the PFS as a psychological assessment of flow in English-speaking adults.

4.1 Advancing the Measurement of Flow

The PFS advances our measurement of flow by targeting the core experiential dimensions of the construct, rather than conflating flow antecedents, pre-conditions, or outcomes (which are included in many existing measures) with the experience of flow itself. Further, the PFS is intended to be used in a domain-general sense and as a result does not rely on domain-specific descriptions (e.g., telepresence in computing) that may not generalise across scientific disciplines, domains or nuanced experiential descriptions. We also presented evidence to show that flow may be operationalised using the PFS with a higher-order latent global factor alongside the three dimensions of absorption, effort-less control, and intrinsic reward. These results may lend support to the notion that an attempt to model the flow experience as a whole (alongside or perhaps, in some cases, instead of the three sub-dimensions) may be valuable for informing our understanding of the construct and how it operates (see Norsworthy et al., 2021). The EFA revealed a small fourth factor involving items that best represented intuitive behaviour or spontaneous action, which seemed to be inter-related with the effort-less control dimension. Although the CFA showed evidence against its inclusion in the final PFS, future research may want to examine how this ‘intuiting’ construct relates to effort-less control, especially within non-movement associated activities (e.g., brainstorming) in which cognitive aspects of the experience may be more prevalent. Finally, in outlining the instructions for the PFS, we attempted to bring some clarity to flow measurement by asking participants to report on the most intense experience in the given event. This reduces interpretive ambiguity as to whether responses to flow items represent a conflated amalgamation (averaged or summed) of multiple experiences into a single reported experience (see Moneta, 2012; Swann et al., 2018).

4.2 External Validity

Findings revealed preliminary evidence for the nomological network associated with PFS scores and, in doing so, support for convergent and discriminant validity (Messick, 1995). More specifically, global and dimensional PFS scores were negatively correlated with perceived anxiety and stress, and positively correlated with perceived competence, self-rated performance, and autotelic personality. It is important to caution against making any causal inferences regarding these associations; nonetheless, they are consistent (in a directional sense) with theory and research that identify flow as a buffer to anxiety and stress (also see Llorens & Salanova, 2017; Peifer et al., 2014; Sadlo, 2016; Waples & Knight, 2017), and as a facilitator of positive development and high functioning (also see Flett, 2015; Klasen et al., 2012; Norsworthy et al., 2017; Norsworthy et al., 2021; Swann et al., 2017).

One criticism of existing flow assessments (e.g., using the FSS-2 and FSS) is that, despite there being no ‘threshold’ explicitly recommended by those who developed the instruments, flow is often assumed to be ‘present’ when scores are above a midpoint (see Jackman et al., 2017). Participants in this study who reported that they had experienced flow (reporting ‘Yes’ from a ‘Yes’, ‘No’ or ‘Unsure’ response option) scored higher on global PFS scores (M = 5.86; out of 7) than those who reported they were unsure (M = 5.21) or had not experienced flow (M = 4.68); all of these mean flow scores exceeded the midpoint (3.5). On the assumption that flow has an entry point (differentiating it from other immersive or automatic states) and also varies by intensity (accounting for mild and intense flow experiences; see Norsworthy et al., 2021), it therefore appears important to caution against ‘assuming’ flow occurs when scores reach or exceed a scale midpoint.
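The point about midpoint thresholds can be made concrete with the group means reported above. In the sketch below, the midpoint value is taken as stated in the text, and the rule "score above midpoint means flow is present" is the practice being cautioned against, not a recommended procedure. The function name `flagged_as_flow` is hypothetical.

```python
# Mean global PFS scores reported above, keyed by self-reported flow entry.
group_means = {"yes": 5.86, "unsure": 5.21, "no": 4.68}

MIDPOINT = 3.5  # scale midpoint as stated in the text

def flagged_as_flow(score, threshold=MIDPOINT):
    """Apply the midpoint rule: treat any score above the threshold as 'flow present'."""
    return score > threshold

flags = {group: flagged_as_flow(m) for group, m in group_means.items()}
for group, flag in flags.items():
    print(f"self-report = {group!r}: mean = {group_means[group]}, flagged as flow = {flag}")
```

Applied to the group means, the midpoint rule flags every group as being in flow, including participants who reported that they had not experienced it.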

On the topic of flow ‘intensity’, it is worth highlighting that in this study self-reported flow intensity scores significantly positively correlated with PFS scores. As such, it is reasonable to conclude that PFS scores may align with self-reported intensities of simple descriptions of flow, and that higher PFS scores may indicate higher self-reported flow intensity. Additionally, flow intensity also correlated positively with the percentage of time that participants self-reported to be in flow, and their global PFS scores, indicating that higher intensities of flow may take time to develop (i.e., a flow experience may progressively grow in intensity, rather than instantly occur at a ‘deep’ level). This proposition would be consistent with the hypothesis that flow occurs through a depleting ‘onion-peeling’ effect of higher cognitive processes, which results in progressive absorption and greater reduction in felt effort (see Norsworthy et al., 2021; Sadlo, 2016; Ulrich et al., 2016).

4.3 Limitations and Future Directions

In this investigation, we tested the PFS with adults who were resident in multiple English-speaking countries and who performed an array of focal activities. It is necessary to highlight, though, that instrument validation is an iterative process and that our results provide only preliminary insight into the development and use of the PFS in this field. With that in mind, we encourage researchers to expand the evidence base for or against the use of the PFS by examining validity and reliability properties in different cultures and languages, in adult and youth samples, within and between focal activities, and on multiple occasions within individuals. We derived insight using a cross-sectional approach; longitudinal studies, therefore, could be used to evaluate within-person variability or stability in PFS scores (and relations with correlate variables). In addition, although our purpose was to develop a psychological flow scale, it would also be worthwhile, in light of emerging psychophysiological work in this area, for researchers to examine key (measurable) physiological or neuroscientific markers of flow alongside insight derived through instruments such as the PFS (see, for example, Norsworthy et al., 2021; Peifer & Tan, 2021).

4.4 Conclusion

The Psychological Flow Scale (PFS) is a relatively brief instrument that assesses three related dimensions (i.e., absorption, effort-less control, intrinsic reward) that are proposed to characterise the flow state experience across different domains and contexts. This instrument is an important advancement for the field of flow science because (a) it may enable valuable cross-disciplinary comparisons to be made in future work, (b) it is based upon the findings of a recent synthesis that aimed to provide a parsimonious understanding of the construct, (c) it removes what researchers believe to be the ‘antecedents’ (or operational factors) from assessing the experience of flow, and (d) it addresses issues of construct conflation with similar states (although further discriminant validity testing is required). In addition, evidence derived from CFA offers preliminary support for a higher-order flow construct that addresses recent conceptual and measurement criticisms of flow. We encourage sustained efforts focused on developing the PFS and demonstrating its applicability and correlates across populations and activities. As well as advancing our assessment and understanding of people’s flow experiences, we also hope the PFS will offer practical value for researchers and intervention designers in terms of quantifying our efforts to improve people’s experiences (and to ‘find flow’) in diverse activities.