The design fluency test: a reliable and valid instrument for the assessment of game intelligence?

The design fluency test (DFT) has been reported to predict successful sports performance in soccer players and has therefore been in the spotlight of sport psychology research. There is, however, a lack of research on the psychometric properties of the DFT in elite sports. The aim of this research was therefore to provide findings on test–retest reliability, practice effects and the diagnostic power of the DFT. Multiple studies of youth and adult elite athletes, as well as nonathlete students, were conducted in applied settings. Test–retest analyses demonstrated poor to acceptable short-term and long-term correlations. Furthermore, significant changes between test and retest were obtained in some variables, and these changes differed among samples. The differential value of the DFT was corroborated by significant differences between adolescent students and adolescent elite soccer players. Regarding the prospective value, significant partial correlation coefficients were found between DFT scores and volleyball performance in adult elite players. Although our research partially confirmed previous findings on the differential and prospective power of the DFT, the findings on test–retest reliability indicate that the DFT cannot be recommended for application in sports. The psychometric properties of the DFT, in particular its test–retest reliability, have to be improved before research on its application to the selection of team sport athletes and the prediction of future success in team sports can be carried out. Further research is needed to develop a scientific instrument for the assessment of game intelligence.


Introduction
Over the last decade, sport science research has become increasingly interested in a construct called executive functions (EF). EF refer to a broad construct (Etnier & Chang, 2009) that encompasses a set of higher-level functions, such as mental set shifting, information updating and monitoring, as well as inhibition of prepotent responses (Miyake et al., 2000). On the one hand, sport science has investigated the effect of chronic or acute exercise on EF (Chang, Labban, Gapin, & Etnier, 2012; Lambourne & Tomporowski, 2010; McMorris & Hale, 2012; Verburgh, Königs, Scherder, & Oosterlaan, 2013); on the other hand, recent research has examined the role of EF in athletic performance and sport expertise (Jacobson & Matthaeus, 2014; Krenn, Finkenzeller, Würth, & Amesberger, 2018).
In the latter case, seminal findings were published by Vestberg, Gustafson, Maurex, Ingvar, and Petrovic (2012), who found soccer players' EF to be significantly predictive of the number of goals and assists the players scored two years later. In addition, they detected higher EF in players of the highest Swedish national league in comparison to soccer players of the second and third Swedish national divisions, and found that all soccer performance level groups scored significantly above the population norm. As a consequence, EF were considered a key indicator of athletic performance in team sports (Lundgren, Högman, Näslund, & Parling, 2016; Vestberg et al., 2012). Subsequent research was able to corroborate this significant role of EF, mainly in soccer: it was found that elite soccer players showed higher EF than subelite players (Vestberg et al., 2012; Vestberg, Jafari, Almeida, Maurex, Ingvar, & Petrovic, 2020), that ambitious soccer players showed higher EF than the population norm (Vestberg et al., 2012), and that players' EF correlated significantly with their scored assists (Vestberg et al., 2012, 2020) and scored goals (Huijgen et al., 2015). Hence, these studies provide putative evidence that above-average cognitive performance is associated with soccer expertise. Further, EF have been suggested to represent a cognitive measure of game intelligence in team sports such as ice hockey (Lundgren et al., 2016) and soccer (Vestberg et al., 2020).
The findings described above on the role of EF in team sports, predominantly in soccer, are mainly based on the Design Fluency Test (DFT), which claims to measure EF (Huijgen et al., 2015; Lundgren et al., 2016; Vestberg et al., 2012; Vestberg, Reinebo, Maurex, Ingvar, & Petrovic, 2017). The DFT was originally developed for the assessment of fundamental skills and higher-level executive functions in clinical populations of children, adolescents, and adults, and is one element of the Delis-Kaplan Executive Function System (D-KEFS) (Swanson, 2005). Rows of boxes, each consisting of the same array of five dots, are presented. Within each box, different designs have to be generated by connecting the dots using four straight lines. The participants are required to draw as many different designs as possible within 60 s. There are three conditions, which differ in the properties of the dots: in condition 1 (C1) all dots are filled, in condition 2 (C2) the dots are empty, and in condition 3 (C3) the dots are alternately filled and empty. According to the D-KEFS manual (Delis, Kaplan, & Kramer, 2001b), C1 provides a basic test of design fluency. C2 requires design fluency and response inhibition caused by the change from filled to empty dots. C3 is designed to assess design fluency and cognitive flexibility through switching between filled and empty dots. Suchy, Kraybill, and Larson (2010) emphasized that C3 captures a separate construct that needs to be considered when interpreting DFT results. In general, it is stated that the DFT measures the "initiation of problem-solving behavior, fluency in generating visual patterns, creativity in drawing new designs, simultaneous processing in drawing the designs while observing the rules and restrictions of the task, and inhibiting previously drawn responses" (Swanson, 2005, p. 122). Many researchers claim that such skills are crucial for success in several team sports (Furley & Memmert, 2010b; Tillman & Wiens, 2011).
In soccer, for instance, "a successful player must constantly assess the situation, compare it to past experiences, create new possibilities, make quick decisions to actions, but also quickly inhibit planned behavior" (Vestberg et al., 2012, p. 4). Huijgen et al. (2015) emphasized that "a soccer player must be able to quickly anticipate and react to fast changing situations that occur during a soccer match" (p. 2). Based on these soccer-specific demands, Vestberg et al. (2012) argued that the DFT is appropriate for the assessment of EF associated with success in soccer because it challenges EF similar to those required in typical soccer game situations. However, taking the DFT's clinical origin into consideration, its straightforward application to samples of elite athletes seems challenging and places the highest demands on its psychometric properties.
In contrast to the findings on the DFT and soccer expertise (Vestberg et al., 2012, 2017, 2020), Furley, Schul, and Memmert (2017) pointed out that findings of improved cognitive performance in expert soccer players are anything but consistent. Several studies failed to provide evidence of superior executive functions in experts (e.g. Furley & Memmert, 2010a) or showed equivocal findings (e.g. Verburgh, Scherder, van Lange, & Oosterlaan, 2014). These inconsistencies have been discussed against the background of confounding variables, sample sizes, researcher expectations, and the definition of expertise (Furley et al., 2017), as well as from the perspective of the need for reliable measurements (Schweizer, Furley, Rost, & Barth, 2020).
In previous studies in sport, the DFT was used based on the test criteria provided by the D-KEFS manual (Delis, Kaplan, & Kramer, 2001a). However, according to Homack, Lee, and Riccio (2005), much research remains to be done to fully determine the psychometric properties of the DFT. Likewise, Shunk, Davis, and Dean (2006) concluded that the psychometric properties of the DFT are not well established. Although the D-KEFS manual provides data on the reliability of the DFT, these data refer to a heterogeneous sample in terms of age and other demographic characteristics. Furthermore, the period between test and retest was not kept constant, varying between 9 and 74 days. Additionally, the validity information in the D-KEFS manual comprises intercorrelations of measures and differences between patients with Alzheimer's and Huntington's disease (Delis et al., 2001a). Therefore, open questions remain on short-term and long-term test-retest reliability and practice effects in general, as well as on the use of the DFT in athletes in particular. Results on the stability of the rank order of participants over short- and long-term intervals are necessary to evaluate findings on the prospective value of DFT performance. Further questions concern the differential and prospective value of the DFT in team sports, in order to expand the existing knowledge on the diagnostic power of the DFT for application in team sports.
As the DFT was designed to detect dysfunctional EF in clinical populations, its application to elite athletes calls for strong evidence of a reliable and valid assessment of EF in this highly skilled sample. So far, research has failed to provide such clear evidence of reliability and validity. The current study aimed to contribute to this goal and to strengthen the evidence base on the use of the DFT in sports for practitioners and researchers. We assembled different data sets collected in the applied field of sport psychology to enable a broad and differentiated analysis of the psychometric properties of the DFT.
The first aim of the present study was to determine the reliability of DFT scores. Short-term test-retest reliability of DFT scores was assessed in three samples that differed in the activities and time intervals between test and retest. Thus, reliability was examined in varying contextual situations to enable an estimation of the effect of different sources of bias between measurements. Using a correlation coefficient approach, the relationship between individual test and retest scores was evaluated to show how well the rank order of participants in the first test was replicated in the retest (Hopkins, 2000). Additionally, changes between test and retest were examined to assess non-random effects resulting from, for example, activities between measurements and motivational and learning processes (Hopkins, 2000). Long-term test-retest correlation and systematic change of DFT scores between measurements were evaluated in a sample of national volleyball team players who were tested twice within a year.
The second aim was to determine the differential value of the DFT. Previous research (Vestberg et al., 2012, 2017) showed higher DFT scores in the sum of correct designs in soccer players compared to normative data (Delis et al., 2001b). This study focused on differences between adolescent elite soccer players and high-school students. In contrast to previous studies (Huijgen et al., 2015; Vestberg et al., 2012, 2017), all single and composite DFT performance scores were considered in order to reflect DFT performance in a comprehensive manner. Based on the findings by Vestberg et al. (2017), we expected that elite athletes would show a higher total sum of correct designs in the DFT compared to the student group.
The third aim addressed the prospective value of the DFT in national team volleyball players, in order to determine the extent to which the results transfer across different types of ball sports (Vestberg et al., 2012, 2017). Past research suggested that playing soccer places high demands on EF (Vestberg et al., 2012, 2017). However, volleyball, as an open-skill and strategic sport discipline, also seems to place high demands on EF (Alves et al., 2013; Jacobson & Matthaeus, 2014; Krenn et al., 2018; Montuori et al., 2019). Taking several cognitive similarities between both sports into account (e.g. focusing on the ball and on the movement patterns of teammates and opposing players; keeping tactical information and experiences about team members and opponents in mind; adapting to continuously changing situations and creating new ways to solve upcoming problems on the court; cf. Alves et al., 2013), we assumed significant correlations between DFT scores and performance parameters in volleyball. In addition, we expected higher correlations between DFT scores and broader, more compound performance parameters (e.g. total points scored, attack errors), which should rely more heavily on the EF concepts of inhibition, working memory and cognitive flexibility, than with more specific performance parameters (e.g. serve aces and serve errors).

Methods
Multiple studies with different designs and different samples were conducted. Students, soccer and volleyball players, or their parents/guardians gave informed consent to take part in the study. The local educational board, the Austrian Football Association (ÖFB) and the Austrian Volleyball Federation (ÖVV) gave their permission for the scientific processing of the data. Ethical approval was obtained from the local university ethics committee. The DFT was administered in the context of a longstanding cooperation with the sport associations. Test-retest reliability was assessed in four samples in different contextual situations. Students and soccer players were compared to obtain information on the differential value of the DFT. Elite volleyball players were investigated to gain knowledge about the prospective value of the DFT for future success in volleyball.

Samples for assessing test-retest reliability and changes of test-retest scores.
Three groups of female adolescents were recruited. The time interval and activity between test and retest, and the familiarity of the participants with the DFT, differed across groups. Table 1 provides information on the sample size, age, test-retest interval, and cognitive/physical load between measurements. In sample 1, female students of a vocational high school were examined. The students took the DFT at the beginning and at the end of a regular school lesson. The cognitive load of the lesson was low in comparison to the demands placed on the female soccer players between the two assessments. Sample 2 comprised female soccer players, who completed the DFT during the entrance examination for the Austrian national centre of women's soccer. The DFT was administered at the beginning and at the end of a comprehensive sport psychological test battery, which has a high cognitive load. The female soccer players of sample 3 were familiar with the DFT, as they had previously completed the test during their entrance examination (time span = 0.5 to 3.5 years). These athletes took the DFT at the beginning of and during a comprehensive sport-specific motor test battery for youth national team players. The time interval between the two measurements varied between 60 and 240 min (M = 122.11 ± 47.11 min).
Long-term test-retest correlation and long-term changes were assessed by administering the DFT to 16 male volleyball players of the Austrian national team and the Austrian youth national team (sample 4) twice within a one-year interval (April 2014 to April 2015). All participants of this sample were also included in sample 7 for testing the prospective power of the DFT.
Samples for testing the differential value. High-school students (sample 5) were compared to adolescent elite soccer players (sample 6). The student sample included 119 individuals (61 females and 58 males) aged between 11 and 19 years (Mage = 13.99, SD = 2.01), who attended public schools. The students completed the DFT prior to a regular school lesson. The adolescent elite athlete sample (n = 117; 69 females and 48 males) consisted of 12- to 18-year-old soccer players (Mage = 13.90, SD = 1.01), who also attended public schools. Female players were tested during the entrance examination for the Austrian national centre of women's soccer. The data of the boys were collected prior to a training session in Austrian football youth academies.
Sample and design for testing the prospective value. Volleyball players of the Austrian national team and the Under 20 (U20) national team performed the DFT at the beginning of a training camp, held in April 2014 and April 2015, respectively. These scores were used to predict their volleyball performance in the following season (season 2014/2015 or season 2015/2016, each lasting from May to April). In total, 36 volleyball players (Mage = 21.52, SD = 2.83; sample 7), consisting of 13 players from the national team in 2014, 14 national team players in 2015, and 9 U20 national team players in 2014, were used to determine the prospective power. Data of each game were obtained from the official game stats sheets provided by the website of the European Volleyball Confederation (CEV), recorded by experienced observers of each game's host country.


Keywords
Executive functions · Psychological assessment · Elite athletes · Test criteria · Team sport

Procedure
Across all samples, the DFT was carried out as a paper-and-pencil test in group settings under the supervision of a qualified and specially trained psychologist. After a standardized introduction following the guidelines of the D-KEFS test manual (Delis et al., 2001b), participants completed the three practice trials of C1 (filled dots). After successful completion of the examples, the actual test started. In the testing phase, each participant had to create as many different designs as possible in 60 s, using only four lines. The experimenter recorded the time using a stopwatch. The procedure was repeated in the same manner for C2 (empty dots) and C3 (switching between filled and empty dots).

Evaluation/scoring
The criteria for scoring were taken from the D-KEFS test manual (Delis et al., 2001a). A qualified psychologist carried out the evaluation and double-checked the results. The primary measure for each condition was the number of correct designs, i.e. designs that were different and finished within the 60 s time limit (sum correct). A set-loss design is a design that violates the criterion rule (sum set-loss; e.g. more or fewer than four lines, at least one free-floating line, etc.). Furthermore, the number of correct but repeated designs per condition was counted (sum repeated). Additionally, composite scores that might provide information on higher-level cognitive skills, such as cognitive shifting (Delis et al., 2001a), were calculated. The total sum over all three conditions was computed for correct designs (total_sum correct), set-loss designs (total_sum set-loss) and repeated designs (total_sum repeated), as well as the sum of correct designs of conditions 1 and 2 (C1+C2_sum correct). Finally, a contrast measure was determined as the sum of correct designs in condition 3 minus the mean of the sums of correct designs of conditions 1 and 2 (contrast measure). This measure is reported as an indicator of cognitive shifting (Delis et al., 2001b).
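The composite scores described above can be summarized in a few lines of code. This is an illustrative sketch with hypothetical variable names, not the official D-KEFS scoring routine:

```python
# Sketch of the DFT composite scores described above (illustrative names,
# not the official D-KEFS scoring software).

def dft_composites(correct, set_loss, repeated):
    """correct, set_loss, repeated: dicts mapping condition (1, 2, 3) to counts."""
    composites = {
        "total_sum_correct": sum(correct.values()),
        "total_sum_set_loss": sum(set_loss.values()),
        "total_sum_repeated": sum(repeated.values()),
        "C1+C2_sum_correct": correct[1] + correct[2],
        # Contrast measure: C3 correct minus the mean of C1 and C2 correct,
        # reported as an indicator of cognitive shifting (Delis et al., 2001b).
        "contrast_measure": correct[3] - (correct[1] + correct[2]) / 2,
    }
    return composites

# Hypothetical participant with 10, 9 and 7 correct designs in C1-C3.
scores = dft_composites(
    correct={1: 10, 2: 9, 3: 7},
    set_loss={1: 1, 2: 2, 3: 3},
    repeated={1: 0, 2: 1, 3: 1},
)
```

Note that the contrast measure can be negative (here 7 − (10 + 9)/2 = −2.5), i.e. fewer correct designs under the switching condition than under the simpler conditions.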

Statistical analysis
All statistical analyses were performed using the Statistical Package for the Social Sciences (IBM SPSS Statistics for Windows, Version 22.0; IBM Corp., Armonk, NY, USA). Shapiro-Wilk tests for normality revealed that most of the variables across samples did not follow a normal distribution. Therefore, Spearman's rank correlations were computed to test the relationship between test and retest scores. Changes between measurements were analysed by Wilcoxon signed-rank tests. For all correlational and difference analyses, the alpha level was Bonferroni adjusted to the number of tested scores (k = 14) and thus set at p < 0.004. According to George and Mallery (2003), test-retest correlation coefficients below r = 0.60 indicate poor reliability, r values between 0.60 and 0.69 questionable reliability, r values between 0.70 and 0.79 acceptable reliability, r values between 0.80 and 0.89 good reliability, and values of r = 0.90 and greater excellent reliability.
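The correlational part of this pipeline can be illustrated without SPSS. The following stdlib-only sketch computes Spearman's rho (Pearson correlation of mean ranks, averaging ranks for ties) and the Bonferroni-adjusted alpha; the t1/t2 data are hypothetical, not from the study:

```python
# Minimal stdlib sketch: Spearman's rho between test (t1) and retest (t2)
# scores, plus the Bonferroni-adjusted alpha level.
from statistics import mean

def ranks(xs):
    """Return 1-based ranks, averaging the ranks of tied values."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1  # extend over the tie group
        avg = (i + j) / 2 + 1  # mean of the tied rank positions
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman(x, y):
    """Spearman's rho = Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    num = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    den = (sum((a - mx) ** 2 for a in rx)
           * sum((b - my) ** 2 for b in ry)) ** 0.5
    return num / den

# Bonferroni adjustment for k = 14 tested scores: 0.05 / 14 ≈ 0.0036,
# hence the alpha level of p < 0.004 reported above.
alpha = 0.05 / 14

# Hypothetical t1/t2 sums of correct designs for eight participants.
t1 = [8, 10, 9, 12, 7, 11, 10, 9]
t2 = [9, 11, 8, 13, 8, 12, 11, 10]
rho = spearman(t1, t2)
```

In practice scipy.stats.spearmanr would be used instead; the sketch only makes the rank-order logic behind the reliability coefficients explicit.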
For assessing the differential power, differences between students and athletes were tested by a multivariate analysis of variance (MANOVA) including all single scores, with age as a covariate. In the case of a significant main effect, results of univariate tests for each score are reported. The alpha level of the univariate tests was set at p < 0.006, corresponding to a Bonferroni correction for the number of tested scores (k = 9).
The prospective power was assessed using partial correlations between the DFT scores and four different performance measures: players' total points scored, serve errors, serve aces and attack errors. A preselection of these measures was conducted in accordance with Vestberg et al. (2012) in order to replicate previous findings and to reduce the number of partial correlation analyses. Since scoring points differed in their probability of occurrence, we controlled for playing position (setters/liberos versus hitters/blockers) and team membership, i.e. national team versus youth national team (cf. Vestberg et al., 2012). In addition, for the separate analyses of total points, serve errors, aces and attack errors, we controlled for the total number of a player's attempts in each variable. This approach was chosen to account for the higher likelihood of players showing higher performance scores the more often they tried or the more time they spent on the court (e.g. a higher likelihood of generating two serve errors when serving ten times rather than four times). Thus, we controlled for the number of line-ups when analysing the correlation between DFT scores and the sum of scored performance points, and we controlled for the total number of serves/attacks, respectively, when analysing the correlation between DFT scores and serve errors/attack errors. Performance data were transformed to square root values to address the skewed distributions of these variables. The alpha level was Bonferroni adjusted to the number of analysed DFT scores (k = 14) and thus was p < 0.004.
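The logic of a partial correlation with the square-root transform can be sketched as follows. This stdlib-only example uses the closed form for a single covariate and entirely hypothetical data; the study itself controlled for several covariates at once (position, team, number of attempts), which would require residualizing on all of them:

```python
# Sketch of a first-order partial correlation with the square-root
# transform described above (stdlib only; data values are hypothetical).
from math import sqrt
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

def partial_corr(x, y, z):
    """Correlation of x and y controlling for one covariate z.
    With several covariates, one would residualize x and y on all of
    them and correlate the residuals instead."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical example: DFT correct designs, total points scored
# (square-root transformed to address skew), and the number of
# line-ups as the controlled covariate.
dft = [9, 11, 10, 13, 8, 12]
points = [sqrt(p) for p in [40, 85, 60, 120, 30, 95]]
lineups = [10, 14, 12, 16, 9, 15]
r_partial = partial_corr(dft, points, lineups)
```

Controlling for line-ups removes the shared variance that merely reflects time on court, so r_partial estimates the DFT-performance association at a fixed number of appearances.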

Results
Table 2 shows the descriptive statistics of the test and retest DFT scores across samples.

Short-term and long-term test-retest reliability
Table 3 displays the Spearman's rank correlation coefficients between t1 and t2 for each sample. Correlation coefficients varied between r = -0.08 and r = 0.72. The only acceptable test-retest reliability was shown by the composite score of C1 and C2 for the number of correct designs (C1+C2_sum correct). The highest correlations among single scores were obtained for the sum of correct designs in condition 2 across groups, with coefficients between r = 0.61 and r = 0.68. In all samples, correlation coefficients for the sum of correct designs were lowest in the third condition (between r = 0.00 and r = 0.37), compared to the first (between r = 0.29 and r = 0.57) and the second condition. This was most prominent in sample 4 (test-retest interval of one year), which showed a zero correlation in condition 3. Set-loss designs and repeated designs showed poor to questionable test-retest coefficients across all samples and all conditions, varying between r = -0.08 and r = 0.67 for set-loss and between r = -0.05 and r = 0.63 for repeated designs. Remarkably weak correlations (r < 0.35) were observed for the contrast measure. The other composite scores demonstrated poor to questionable test-retest correlation coefficients (between r = 0.17 and r = 0.66), with the exception of the total sum of correct designs (r = 0.70) and the sum of correct designs of conditions 1 and 2 (r = 0.72) in sample 1 (female students).

Changes in test-retest DFT scores
The results on differences between test and retest are reported for each sample in Tables 2 and 3. The greatest number of significant alterations was obtained in the student sample (sample 1), which had the shortest test-retest interval and a low cognitive load between measurements. These significant changes in the students were observed in the sum of correct designs in all three conditions (C1: increase of 42.52%; C2: increase of 25.17%; C3: increase of 11.66%), as well as in the sum of repeated designs in conditions 1 and 2. Furthermore, with the exception of the total sum of set-losses, all composite scores yielded significant changes in sample 1. Soccer players of sample 2, who had a high cognitive load between assessments, demonstrated a significant increase from test to retest in the sum of repeated designs in condition 1 (increase of 52.87%) and in the total sum of repetitions. Female soccer players who were familiar with the DFT and completed a physically demanding test battery between assessments (sample 3) revealed significant improvements in the number of correct designs in conditions 1 (increase of 17.59%) and 2 (increase of 17.94%), as well as in both composite scores on correct designs. Regarding long-term alterations (sample 4), volleyball players demonstrated a significant improvement in the sum of correct designs in condition 1 (increase of 26.04%), as well as significant improvements in the sum of correct designs of conditions 1 and 2 and the sum of correct designs of all three conditions.

Differential value
A MANOVA run on all single scores with age as a covariate yielded a significant difference between students and soccer players (sample 5 vs. sample 6), F(9, 225) = 9.81, p < 0.001, ηp² = 0.28. Table 4 includes descriptive statistics and the results of follow-up univariate analyses of variance. Soccer players scored significantly higher on the number of correct designs in conditions 1 and 2 compared to students. Students, however, repeated significantly fewer designs in all three conditions.

Prospective value
The analysis of prospective power was based on the primary measures and the contrast score as described in the D-KEFS manual (Delis et al., 2001b). Table 5 shows the partial correlations between performance data and the sum of correct designs in all conditions, as well as the contrast measure, in elite volleyball players.
Partial correlations showed that the number of correct designs in conditions 1 and 2, the sum of correct designs of conditions 1 and 2, as well as the total sum of correct designs over all conditions significantly predicted players' total points. However, the partial correlation between players' total points and the sum of correct designs in condition 3 was low (r = 0.07) and did not reach statistical significance. The remaining DFT parameters showed no significant correlations with the number of total points, aces, serve errors or attack errors.

Discussion
The aim of the present study was to assemble different analyses evaluating the psychometric properties of the DFT in the field of sports. Thus, test-retest correlations and changes between measurements were determined in different contexts to gain insights into random and non-random effects on the reproducibility of DFT scores in applied settings. Furthermore, differential and prospective aspects were evaluated to expand on previous empirical findings.

Short-term and long-term test-retest reliability across different contextual situations
Three female adolescent samples were used for analysing how well the rank order of participants in the first test was replicated in the retest over a short-term interval. Test-retest correlations showed poor to acceptable reliability coefficients (George & Mallery, 2003) in all samples, suggesting no specific impact of sample characteristics, duration of the test-retest period, or activity between measurements. The students, who had the shortest time interval and no physical and only a low cognitive load between measurements, obtained coefficients similar to those of the soccer players. Based on these consistent findings regarding the size of the test-retest correlations, it can be assumed that the poor to acceptable test-retest reliability represents a general rather than a specific effect. This assumption converges with the test-retest reliability data in the D-KEFS manual (Delis et al., 2001a), which reports coefficients of r = 0.58 (C1), r = 0.57 (C2) and r = 0.32 (C3) for correct designs over an average test-retest interval of 25 days (SD = 12.8). Furthermore, it is corroborated by our findings on long-term test-retest reliability, even though these results have to be interpreted carefully due to the small sample size. As a consequence, we have to emphasize that DFT application in sports seems critical, as we failed to provide clear evidence of its test-retest reliability. Interestingly, the most stable relationship in a single score across samples was the sum of correct designs in condition 2 (between r = 0.61 and r = 0.68). The exposure to condition 1 may have led to practice effects, resulting in higher test-retest reliability of condition 2 in terms of correct designs. This finding appears to be of significance for the improvement of the DFT: an extension of the number of practice trials and/or the number of tasks within a condition might contribute to an increase in test-retest reliability.
Regarding all four samples, it is remarkable that condition 3 showed the lowest correlations in the sum of correct designs. Switching from the very similar conditions 1 and 2 to the more complex task of connecting filled and empty dots alternately resulted in poor stability of the test-retest DFT scores. Hence, differences in the level of coefficients between conditions raise doubts regarding the use of aggregate DFT scores across all conditions. Further studies are needed to address the divergent findings observed in conditions 1 and 2 compared to condition 3. The poor correlations of the contrast measure throughout all four samples might result from these deviant correlations between conditions. This corroborates the notion of Crawford et al. (2008) that the contrast score of the DFT is "uninterpretable" (p. 1072).
The results on test-retest correlations across all four samples raise the question of whether the DFT measures temporally stable traits related to higher-order cognitive functions. A possible reason for the obtained test-retest correlations might be performance variability and measurement error caused by a task requiring complex and effortful processes (Delis, Kramer, Kaplan, & Holdnack, 2004). Furthermore, it has to be clarified how different states such as achievement motivation, test anxiety, self-confidence, and other possible moderators affect behavioural accuracy and variability in design fluency.
The design fluency task can be considered an open-ended task with more than one correct outcome. Specifically, participants can perform the test by drawing the lines intuitively, by using one specific strategy, or by switching between different strategies. Thus, intra-individual variance increases, which, in turn, decreases the correlation between test and retest DFT scores. It is assumed that condition 1 contributes to a decrease in performance variability by reactivating the strategy used at t1, which might result in the higher test-retest reliability of condition 2.
Taking these poor to acceptable test-retest correlations into consideration, the application of the DFT in the field of team sports has to be regarded with caution. It is recommended to improve the DFT substantially before it is applied in team as well as individual diagnostics. The administration of only three practice trials per condition might not be enough to obtain stable data on design fluency performance. An extensive pretest practice phase consisting of the entire DFT would be useful, allowing participants to become more familiar with the test and minimizing practice effects. Furthermore, a longer processing time per condition might enhance test-retest reliability. Future studies are required to determine whether these recommendations yield the expected improvements in reliability.

Changes in test-retest DFT scores across different contextual situations
All four samples yielded systematic, albeit inconsistent, changes between test and retest. The effects found on the sum of correct designs in conditions 1 and 2 are assumed to be due to practice effects, and converge with findings presented in the DFT manual (interval: M = 25 ± 12.8 days), which indicates significant changes for all measures with the exception of repetition errors (Delis et al., 2001a). It is worth mentioning that improvements are observable even after a year. This finding is in line with Rabbitt et al. (2004), who showed practice effects in neuropsychological tests even after a period of 7 years. The higher number of significant increases in DFT scores among the students may be a consequence of the shorter time interval and the lower activity between measures. The absent practice effects of soccer players who performed the DFT for the first time (sample 2) might be a result of the demanding cognitive sport psychological assessments between test and retest. Mental fatigue might have reduced practice effects, which is supported by the significant increase of repeated designs in condition 1. The inconsistent findings on changes between test and retest across samples emphasize the importance of controlling for mental and physical activity when assessing DFT performance.
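Systematic test-retest changes of the kind discussed here are commonly evaluated with paired comparisons. The sketch below computes a paired t statistic on hypothetical test and retest scores (invented data, not from the study) to show how a practice effect would surface as a large positive t value:

```python
from statistics import mean, stdev
from math import sqrt

# Hypothetical correct-design counts for 10 participants at test (t1)
# and retest (t2); a uniform upward shift mimics a practice effect.
t1 = [9, 11, 8, 12, 10, 7, 13, 9, 11, 10]
t2 = [11, 13, 9, 13, 12, 9, 15, 10, 13, 12]

# Paired t statistic: mean difference divided by its standard error.
diffs = [b - a for a, b in zip(t1, t2)]
t = mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))
print(round(t, 2))  # -> 11.13
```

With 9 degrees of freedom, a t value this large would indicate a highly significant increase from test to retest, i.e. a clear practice effect in this fabricated example.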

Differential value
In conditions 1 and 2, soccer players generated more correct designs than the students. This finding is in line with previous results (Vestberg et al., 2012, 2017) showing higher scores of total correct designs in soccer players compared to a normative group. However, the soccer players also produced significantly more repeated designs in all three conditions. Hence, our study delivers first findings that scoring should not be limited exclusively to a composite measure of correct designs, as was done in previous studies (Vestberg et al., 2012, 2017). It is recommended that future studies focus on the interplay of correct designs, set-loss designs, and repeated designs of the DFT in order to obtain a holistic pattern of cognitive performance. An alternative explanation of the found differences is conceivable: soccer players may have an advantage in design fluency performance when tested for the first time due to the similarity of the DFT to game situations that are featured on soccer tactic boards. This assumption is supported by the greater number of significant changes between test and retest (section: Changes in test-retest DFT scores) and higher increases in the sum of correct designs from test to retest in the student sample (sample 1; Table 3) as compared to both soccer samples (samples 2 and 3). Thus, familiarity with scenarios similar to those administered in the DFT needs to be addressed in future studies. A further reason for the superiority of the soccer players in generating a higher number of correct designs could be a better fitness status that might have influenced DFT performance. There is strong evidence of a relationship between physical fitness and cognitive functions (Chaddock, Neider, Voss, Gaspar, & Kramer, 2011), which therefore has to be considered in research on EF.

Prospective value
The partial correlations between the parameters on correct designs, with the exception of condition 3, and sports performance turned out significant, explaining between 25% and 31.36% of the variance. Volleyball players drawing more correct designs scored more points. This relationship is in line with findings reported in soccer players (Huijgen et al., 2015; Vestberg et al., 2012). However, caution is needed, given that the test-retest correlation of the overall score of correct designs (r = 0.61) was of similar size to the correlation between the overall score and total play points (r = 0.50). Furthermore, it has to be mentioned critically that the relationships between sport performance and the DFT scores on set-loss designs, repeated designs, and correct designs in condition 3 did not all point in the expected direction. For example, positive, non-significant correlations were revealed between total points scored and repeated designs in all conditions. Another unexpected finding is the positive, non-significant correlation between the sum of correct designs in conditions 1 and 2 and serve errors.
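The explained-variance figures follow directly from squaring the correlation coefficient (R² = r²); the short snippet below reproduces the 25% and 31.36% values from coefficients of r = 0.50 and r = 0.56:

```python
# Explained variance from a (partial) correlation coefficient: R^2 = r^2.
# The 25% and 31.36% figures correspond to r = 0.50 and r = 0.56.
for r in (0.50, 0.56):
    print(f"r = {r:.2f} -> {100 * r ** 2:.2f}% of variance explained")
```

Note the asymmetry this squaring implies: a drop from r = 0.56 to r = 0.50 already costs more than six percentage points of explained variance.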
Our findings deliver weak support for the prospective power of the DFT, which needs to be discussed against the background of the poor to acceptable test-retest reliability of DFT scores. If performance variability is high, as indicated by the poor to acceptable test-retest correlations, then the question arises whether it is appropriate to predict sports performance using just one composite measure of design fluency. Future work may thus examine whether administering the DFT at various time intervals and calculating average composite scores across these intervals counteracts large individual performance variability and yields comparable findings on prospective validity.

Limitations
The selection and assessment of participants in varying contextual situations for determining aspects of reliability might at first glance be regarded as a limitation of the study. However, the overall aim of this study was to show how reliable and valid the DFT is in applied sport psychological contexts in order to provide data from the different settings that occur in practical work. The assessment was done in the interest of the sport associations and is therefore typical for work in the field of applied sport psychology. It must be critically noted that the samples of this study consisted of females, with the exception of one male sample. A further shortcoming is the short time interval between test and retest, which could have resulted in memory effects for some designs. The findings on long-term reliability are based on a small sample of 16 elite volleyball players; thus, a replication in a larger sample is worthwhile. Regarding differences between soccer players and non-athlete high school students, it cannot be completely excluded that different intellectual abilities, motivation, and coping strategies influenced the findings. The sample size and the small number of volleyball games may be regarded as a potential limitation of the results on the prospective value of the DFT with respect to their generalisability. Finally, the findings of this study are limited to differential and prospective validity. It is recommended to examine the internal structure of the DFT using exploratory and confirmatory factor analyses to test the assumptions of the DFT measures (Delis et al., 2001b). Additionally, divergent and convergent validity should be examined using well-defined EF tasks that are distal and proximal to the skills proposed in the test manual (Delis et al., 2001b), as well as tactical decision-making tests, creativity tests, or even coaching ratings, in order to explore which executive subdomains are called upon by the DFT.

Conclusions
The size of the test-retest correlations and the significant changes between test and retest lead to the conclusion that the application of the DFT in individual as well as group diagnostics cannot be recommended. Based on our findings and the information on test-retest reliability provided in the D-KEFS manual (Delis et al., 2001a), we draw the conclusion that previous findings on the prospective power of the DFT have to be regarded with caution (Huijgen et al., 2015; Vestberg et al., 2012, 2017). Further studies are needed to improve the psychometric properties of the DFT before its application in sports is considered. It has to be mentioned that there is growing acceptance that soccer talent is multidimensional in nature and therefore needs to be predicted by multidisciplinary test batteries (Murr, Feichtinger, Larkin, O'Connor, & Höner, 2018). Finally, it is suggested (1) to select scientific instruments not only on the basis of face validity, and (2) to use more sophisticated research approaches instead of the simple research logic of previous studies on the DFT and soccer expertise.